Description
I read in an HDF file of more than 4 million rows, and now I want to write the first million rows to a sample CSV:
df_small = df[:int(1e6)]
df_small.to_csv("X.csv", sep='\t')
len(df_small)
# out: 1,000,000
The DataFrame consists of a datetime index and a single text column.
When I read the CSV back in, I get more rows than I saved:
df2 = pd.read_csv("X.csv",
                  sep='\t',
                  engine='python',
                  parse_dates=['datetime'],
                  index_col='datetime',
                  infer_datetime_format=True)
len(df2)
# out: 1,000,002
Looking at the index, the datetimes were not actually parsed: the index just has dtype object.
I then tried my own date parser, and it raised an error when it hit a "..." in the datetime index, a value that was not there before saving.
I opened the CSV in Excel and found a "..." in the datetime column. I also noticed that the datetime index and the first column were merged together. I'm not sure whether that is relevant or just the way Excel reads the file.
With read_csv the data otherwise comes in fine, except for the couple of extra rows with "..." in the index. The rest of each of those rows is blank.
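To pin down exactly which records are affected, something like the following could be run against the saved file. This is only a rough diagnostic sketch: `find_unparsed_rows` is a hypothetical helper, not part of pandas, and it assumes the index was written in an ISO-like format such as `2020-01-01 00:00:00` and that the file uses the default double-quote quoting.

```python
import csv
from datetime import datetime


def find_unparsed_rows(path, sep="\t"):
    """Return (line_number, row) pairs whose first field is not an ISO-like datetime.

    line_number is the physical line in the file on which the record ends.
    """
    bad = []
    with open(path, newline="") as fh:
        reader = csv.reader(fh, delimiter=sep)
        next(reader, None)  # skip the header row
        for row in reader:
            first = row[0] if row else ""
            try:
                # Assumption: the index was saved as 'YYYY-MM-DD HH:MM:SS'.
                datetime.fromisoformat(first)
            except ValueError:
                bad.append((reader.line_num, row))
    return bad
```

Calling `find_unparsed_rows("X.csv")` should list the records whose index value is "..." (or otherwise not a datetime), along with their position in the file, which makes it easier to compare them against the corresponding rows of the original DataFrame.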