read_csv, integer dtype and empty cells #2631

Closed
janschulz opened this Issue Jan 3, 2013 · 3 comments

Contributor

janschulz commented Jan 3, 2013

Reading a CSV file with an integer column that has empty cells casts that column to float (which in the end causes problems when merging this dataframe on that column with a dataframe where the corresponding column is int).

It would be nice if a warning could be printed when such a conversion takes place (maybe only when an explicit dtype={"col":np.int64} setting is passed to read_csv), and if I could optionally specify that such rows should be dropped (isn't there an NA value for int columns...?)

data = """YEAR, DOY, a
2001,106380451,10
2001,,11
2001,106380451,67"""
from io import StringIO
import numpy as np
import pandas
f = pandas.read_csv(StringIO(data), sep=",", dtype={'DOY': np.int64})
f.dtypes
YEAR      int64
 DOY    float64
 a        int64
Owner

wesm commented Jan 3, 2013

There are no integer NA values, unfortunately. I plan to fix this (a big project; it probably requires circumventing NumPy) one of these days.

Contributor

janschulz commented Jan 3, 2013

I don't mind that it is not possible (yet), but I do mind that read_csv changed the datatype even though I specified it and didn't say anything (throw an exception or print a warning).

pandas/src/parser.pyx has the exception throwing commented out in line 900, which seems to do what I expected...?

Would it be possible to add a param to specify a strategy (drop row, throw exception, cast to float) for what should happen in such cases? I tried to understand the code and it seems that it operates on columns, so dropping rows if an int is NA does not seem to be an easy option :-(
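
The three strategies named above (drop row, throw exception, cast to float) can be sketched as a post-processing step on a frame that has already been read. This is only an illustration: `coerce_int_column` and its `on_missing` argument are hypothetical names, not a read_csv parameter, and the sample header is written without spaces after the commas to keep the column names clean.

```python
from io import StringIO

import pandas as pd

data = """YEAR,DOY,a
2001,106380451,10
2001,,11
2001,106380451,67"""


def coerce_int_column(frame, column, on_missing="raise"):
    """Hypothetical helper: convert `column` to int64 using the chosen strategy."""
    missing = frame[column].isna()
    if on_missing == "raise":
        # Refuse to convert silently when any cell is empty.
        if missing.any():
            raise ValueError(
                f"column {column!r} has {int(missing.sum())} missing value(s)"
            )
        return frame.astype({column: "int64"})
    if on_missing == "drop":
        # Drop the rows with an empty cell, then convert what is left.
        return frame.loc[~missing].astype({column: "int64"})
    if on_missing == "float":
        # Keep every row and accept float64, which can hold NaN.
        return frame.astype({column: "float64"})
    raise ValueError(f"unknown strategy: {on_missing!r}")


# Read without a dtype, so the column with the empty cell arrives as float64.
frame = pd.read_csv(StringIO(data))
dropped = coerce_int_column(frame, "DOY", on_missing="drop")
as_float = coerce_int_column(frame, "DOY", on_missing="float")
```

Because the whole frame is already in memory here, dropping rows is straightforward; that is different from doing it inside a parser that converts one column at a time.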

wesm was assigned Jan 20, 2013

wesm closed this in 5da8df7 Jan 20, 2013

Owner

wesm commented Jan 20, 2013

Done. Thanks for the suggestion; I agree raising the exception is the right move. Note that in your example you need to pass skipinitialspace=True.
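
A minimal sketch of why skipinitialspace=True matters for the example data, assuming a reasonably recent pandas: without it the header names keep their leading space, so a dtype key of 'DOY' does not match the column actually named ' DOY'. With it, the dtype request applies and the empty cell leads to a ValueError rather than a silent cast to float.

```python
from io import StringIO

import numpy as np
import pandas as pd

data = """YEAR, DOY, a
2001,106380451,10
2001,,11
2001,106380451,67"""

# Without skipinitialspace the header names keep their leading space.
raw_columns = pd.read_csv(StringIO(data)).columns.tolist()

# With skipinitialspace=True the names are clean and 'DOY' can be matched.
clean_columns = pd.read_csv(StringIO(data), skipinitialspace=True).columns.tolist()

# Requesting int64 for a column with an empty cell now raises an error.
try:
    pd.read_csv(StringIO(data), skipinitialspace=True, dtype={"DOY": np.int64})
    raised = False
except ValueError:
    raised = True
```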
