Large TSV files with columns that mix integers and strings are not converted to a consistent data type. Some integer-like entries are parsed as integers, while identical entries elsewhere in the same column are left as strings. The transition seems to happen at power-of-2 indices, which suggests the file is processed in chunks and the conversion is applied to some chunks but not others.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'a':['1']*100000 + ['X']*100000 + ['1']*100000, 'b':['b']*300000})
In [3]: df.to_csv('test', sep='\t', index=False, na_rep='NA')
In [4]: df2 = pd.read_csv('test', sep='\t')
In [5]: print df2['a'].unique()
['1' 'X' 1]
In [6]: for a in df2['a'][262140:262150]:
   ...:     print repr(a)
   ...:
'1'
'1'
'1'
'1'
1
1
1
1
1
1
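A possible workaround, sketched under assumptions rather than taken from the report: `read_csv` accepts a `dtype` argument that fixes a column's type up front, and a `low_memory` flag that, when set to `False`, lets the parser read the whole column before inferring its type. The sketch assumes Python 3 and a pandas version that supports both options, and it uses an in-memory buffer instead of the `test` file.

```python
import io
import pandas as pd

# Rebuild the data from the report in memory instead of writing a file.
df = pd.DataFrame({'a': ['1'] * 100000 + ['X'] * 100000 + ['1'] * 100000,
                   'b': ['b'] * 300000})
tsv = df.to_csv(sep='\t', index=False, na_rep='NA')

# Option 1: state the column type explicitly so no inference is needed.
as_str = pd.read_csv(io.StringIO(tsv), sep='\t', dtype={'a': str})

# Option 2: let the parser see the whole column before inferring its type.
whole = pd.read_csv(io.StringIO(tsv), sep='\t', low_memory=False)

# Collect the Python types that actually appear in column 'a'.
types_as_str = {type(v) for v in as_str['a']}
types_whole = {type(v) for v in whole['a']}
print(types_as_str, types_whole)
```

Either option should leave column `a` holding only strings. Option 1 is the more explicit of the two, since it does not depend on how the parser splits the input.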