
BUG: Unexpected behaviour when reading large text files with mixed datatypes #3866

Closed
martingoodson opened this issue Jun 12, 2013 · 4 comments

@martingoodson commented Jun 12, 2013

read_csv gives unexpected behaviour with large files if a column contains both strings and integers, e.g.:

>>> df = DataFrame({'colA': range(500000-1) + ['apple', 'pear'] + range(500000-1)})
>>> len(set(df.colA))
500001

>>> df.to_csv('testpandas2.txt')
>>> df2=read_csv('testpandas2.txt')
>>> len(set(df2.colA))
762143

>>> pandas.__version__
'0.11.0'

It seems some of the integers are parsed as integers and others as strings.

>>> list(set(df2.colA))[-10:]
['282248', '282249', '282240', '282241', '282242', '15679', '282244', '282245', '282246', '282247']
>>> list(set(df2.colA))[:10]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
@jreback (Contributor) commented Jun 12, 2013

In [1]: df2=read_csv('testpandas2.txt',index_col=0)

In [2]: df2
Out[2]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 1 columns):
colA    1000000  non-null values
dtypes: object(1)

In [3]: from collections import Counter

In [4]: Counter(df2.colA.apply(lambda x: type(x)))
Out[4]: Counter({<type 'int'>: 737856, <type 'str'>: 262144})

So the way parsing works (when you don't specify a specific dtype) is that on a particular column you loop over all dtypes and try to convert to an actual type; if something breaks you go to the next dtype. The data is modified in-place, so the rows before the strings are converted to integers; when it hits the strings the parsing stops and the column is marked object.

So the end result is the correct dtype.

You essentially want downcasting back to strings for object dtype; easy enough: specify object as the dtype for this column.
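The workaround described above can be sketched with a small hypothetical file (the `dtype=` keyword of `read_csv` accepts a per-column mapping in current pandas; the CSV content here is made up for illustration):

```python
import io

import pandas as pd

# Hypothetical miniature of the report: a column mixing integers and strings.
csv_text = "colA\n1\n2\napple\npear\n3\n"

# Forcing the column to str keeps every value a string, so set()/groupby
# on the column behave consistently instead of mixing int and str keys.
df = pd.read_csv(io.StringIO(csv_text), dtype={"colA": str})
print(df["colA"].tolist())  # ['1', '2', 'apple', 'pear', '3']
```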

If you want this automatic I think we'd have to provide an option to do it, because it would be inefficient in terms of parsing speed: you'd have to copy the column for every dtype you try.

Can you explain why this actually matters?

@martingoodson (Author) commented Jun 12, 2013

I'm not sure I understand. Why aren't there 500K integers and 500K+2 strings, if everything after the strings are encountered is parsed as a string?

This matters because if you try to aggregate using the object-type column as a key, the results will be incorrect: you get twice as many keys as you actually intended. Thus even trying to find the number of unique keys in a table, a fairly basic task, will not work as expected.
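The double-counting described above comes straight from Python set semantics on mixed-type keys, nothing pandas-specific; a minimal illustration:

```python
# The int 1 and the string '1' hash as distinct keys, so a column that
# mixes both types inflates any unique-key or group-by count.
keys = [1, 2, '1', '2']
print(len(set(keys)))  # 4, although only two distinct values were intended
```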


@jreback (Contributor) commented Jun 12, 2013

@wesm pls take a look

so the int conversion stops at 262144, which is exactly 2**16 * 4... weird, must be something odd going on
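A cutoff at an exact power of two suggests the chunked low-memory read path is inferring dtypes per chunk rather than over the whole column. Assuming that is what is happening here, one plausible workaround is to disable the chunking so inference sees every value at once (`low_memory` is a real `read_csv` keyword; whether it fixes this exact case in 0.11 is unverified, and the CSV below is a made-up miniature):

```python
import io

import pandas as pd

csv_text = "colA\n1\napple\n2\n"

# low_memory=False reads the file in a single pass, so type inference
# sees the whole column before deciding on a dtype.
df = pd.read_csv(io.StringIO(csv_text), low_memory=False)
print(df["colA"].dtype)  # object
```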

@jreback (Contributor) commented Jul 10, 2013

I can repro, but fix is eluding me :)
