BUG: read_csv dtype inference is inconsistent for string columns with some integers #4691

amcpherson · 2013-08-27T17:13:45Z

Large TSV files with columns that mix integers and strings are not converted to a consistent data type. Some of the integers are parsed as integers, while identical entries elsewhere in the same column are parsed as strings. The transition seems to happen at power-of-2 indices (index 262144 = 2^18 in the session below), which suggests that something is being done in chunks, but not for all chunks.

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'a':['1']*100000 + ['X']*100000 + ['1']*100000, 'b':['b']*300000})

In [3]: df.to_csv('test', sep='\t', index=False, na_rep='NA')

In [4]: df2 = pd.read_csv('test', sep='\t')

In [5]: print df2['a'].unique()
['1' 'X' 1]

In [6]: for a in df2['a'][262140:262150]:
   ...:         print repr(a)
   ...:
'1'
'1'
'1'
'1'
1
1
1
1
1
1
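A quick way to confirm that a column holds mixed Python types is to count the type of each stored object. The following is a minimal sketch, not part of the original report: `type_counts` is a hypothetical helper name, and the small Series is a hand-built stand-in for `df2['a']` so that the example does not depend on the parser behaviour of any particular pandas version.

```python
import pandas as pd


def type_counts(series):
    """Return {type name: count} for the Python objects stored in a Series.

    Hypothetical helper for illustration; not a pandas API.
    """
    return series.map(lambda v: type(v).__name__).value_counts().to_dict()


# Hand-built stand-in for the mixed column from the report:
# some entries are the string '1', others the integer 1.
mixed = pd.Series(['1', 'X', 1, 1], dtype=object)

counts = type_counts(mixed)
print(counts)
```

On the real `df2['a']`, more than one key in the result would indicate the inconsistency described above.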


jreback · 2013-08-27T17:58:54Z

looks like a dupe of #3866 ?

jreback · 2013-08-27T18:00:11Z

and #4681 ?

hayd · 2013-08-27T18:05:31Z

As pointed out by a commenter on SO, you can use the converters argument:

df2 = pd.read_csv('test', sep='\t', converters={'a': str})
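The workaround can be tried on a small in-memory file. This is a sketch under stated assumptions: the `tsv` string is a stand-in for the `test` file from the report, and the `dtype={'a': str}` variant is an alternative not mentioned in this thread, shown on the assumption that the installed pandas accepts a per-column `dtype` mapping in `read_csv`.

```python
import io

import pandas as pd

# Small in-memory stand-in for the 'test' file from the report
# (tab-separated, column 'a' mixes the text '1' with 'X').
tsv = "a\tb\n" + "1\tb\n" * 3 + "X\tb\n" + "1\tb\n" * 3

# Workaround from the thread: run str on every raw field of column 'a'.
via_converters = pd.read_csv(io.StringIO(tsv), sep='\t', converters={'a': str})

# Alternative (an assumption, not from the thread): ask the parser to keep
# column 'a' as strings instead of inferring a type.
via_dtype = pd.read_csv(io.StringIO(tsv), sep='\t', dtype={'a': str})

# Every value in column 'a' should now be a Python str in both frames.
all_str_conv = all(isinstance(v, str) for v in via_converters['a'])
all_str_dtype = all(isinstance(v, str) for v in via_dtype['a'])
print(all_str_conv, all_str_dtype)
```

Either way, the column is never passed through integer inference, so the result should not depend on how the file is split during parsing.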

jreback · 2013-09-28T19:26:14Z

closing as a dupe of #3866

TomAugspurger mentioned this issue Aug 27, 2013

read_csv example tricks parser dtypes #4692

Closed

jreback closed this as completed Sep 28, 2013
