BUG: read_csv dtype inference is inconsistent for string columns with some integers #4691

Closed
amcpherson opened this issue Aug 27, 2013 · 4 comments
Labels
Duplicate Report Duplicate issue or pull request IO CSV read_csv, to_csv
Comments

@amcpherson
Contributor

Large TSV files with columns that mix integers and strings are not consistently converted to a single data type. Some of the integers are interpreted as integers, while identical entries elsewhere in the same column are interpreted as strings. The transition seems to happen at power-of-2 indices (262144 = 2^18 in the session below), which suggests the conversion is done in chunks, but not in the same way for every chunk.

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'a':['1']*100000 + ['X']*100000 + ['1']*100000, 'b':['b']*300000})

In [3]: df.to_csv('test', sep='\t', index=False, na_rep='NA')

In [4]: df2 = pd.read_csv('test', sep='\t')

In [5]: print df2['a'].unique()
['1' 'X' 1]

In [6]: for a in df2['a'][262140:262150]:
   ...:         print repr(a)
   ...:
'1'
'1'
'1'
'1'
1
1
1
1
1
1
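
For anyone re-running this on Python 3, here is a rough sketch of the same round trip that tallies the Python types found in column `a` instead of printing individual values. It assumes a current pandas; the helper name `type_counts` and the in-memory `io.StringIO` buffer (used in place of the `test` file on disk) are illustrative choices, not part of the original report. The result depends on the pandas version, so no expected output is shown.

```python
import io
from collections import Counter

import pandas as pd


def type_counts(n=100_000):
    """Round-trip the frame from the report and tally Python types in column 'a'."""
    df = pd.DataFrame({'a': ['1'] * n + ['X'] * n + ['1'] * n,
                       'b': ['b'] * (3 * n)})
    buf = io.StringIO()  # assumption: in-memory buffer instead of the 'test' file
    df.to_csv(buf, sep='\t', index=False, na_rep='NA')
    buf.seek(0)
    df2 = pd.read_csv(buf, sep='\t')  # default settings, as in the report
    # Count how many values of each Python type ended up in the column.
    return Counter(type(v).__name__ for v in df2['a'])


print(type_counts())
```

If more than one type name shows up in the counter, the column was parsed inconsistently.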
@jreback
Contributor

jreback commented Aug 27, 2013

looks like a dupe of #3866 ?

@jreback
Contributor

jreback commented Aug 27, 2013

and #4681 ?

@hayd
Contributor

hayd commented Aug 27, 2013

As pointed out by a commenter on SO, you can use the converters argument:

df2 = pd.read_csv('test', sep='\t', converters={'a': str})
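
A self-contained sketch of that workaround, assuming a current pandas on Python 3. The helper name `read_a_as_str` and the in-memory buffer are made up for illustration; the `converters={'a': str}` call is the one from the line above. Passing `dtype={'a': str}` to `read_csv` should be a comparable alternative, though that is not something verified in this thread.

```python
import io

import pandas as pd


def read_a_as_str(n=100_000):
    """Write the frame from the report and read it back with column 'a' forced to str."""
    df = pd.DataFrame({'a': ['1'] * n + ['X'] * n + ['1'] * n,
                       'b': ['b'] * (3 * n)})
    buf = io.StringIO()  # assumption: in-memory buffer instead of the 'test' file
    df.to_csv(buf, sep='\t', index=False, na_rep='NA')
    buf.seek(0)
    # The converter is applied to every raw field of column 'a',
    # so no per-chunk type inference happens for that column.
    return pd.read_csv(buf, sep='\t', converters={'a': str})


df2 = read_a_as_str()
# Every value in 'a' should now be a Python str.
print(all(isinstance(v, str) for v in df2['a']))
```

Because the converter bypasses inference for that column, the values come back as strings regardless of where the chunk boundaries fall.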

@jreback
Contributor

jreback commented Sep 28, 2013

closing as a dupe of #3866

@jreback jreback closed this as completed Sep 28, 2013
3 participants