nrows limit fails reading well formed csv files from Australian electricity market data #7626
Reading Australian electricity market data files, read_csv reads past the nrows limit for certain nrows values and consequently fails.
These market data files are four csv files combined into a single csv file, so the file has multiple headers and a variable field count across rows.
The first set of data is from rows 1-1442.
The intent was to extract the first set of data with nrows=1442.
Testing several arbitrary CSV files from this data source shows well-formed CSV: 120 fields in each of rows 1 to 1442 (with a 10-field row at row 0).
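For illustration, here is a self-contained miniature of the layout described above (not the real market data): a short top row, a uniform narrow section, then a wider section with its own header, all in one file. The field counts and values are stand-ins.

```python
import io
import pandas as pd

# Miniature of the multi-section layout: a 2-field top row, a 3-field
# section of 10 rows, then a wider 5-field section with its own header.
first_section = "C,INFO\n" + "a,b,c\n" + "\n".join("1,2,3" for _ in range(10)) + "\n"
second_section = "p,q,r,s,t\n" + "\n".join("4,5,6,7,8" for _ in range(3)) + "\n"
combined = first_section + second_section

# Skip the top row, take the 3-field header, and stop before the wider
# section; on a pandas release where this bug is fixed, nrows is honoured.
df = pd.read_csv(io.StringIO(combined), skiprows=1, header=0, nrows=10)
print(df.shape)
```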
Other python examples of reading the market data using csv module work fine
In the reproducible example below, the code works for nrows < 824 but fails for any value above it.
Testing on arbitrary files suggests the 824 limit is variable: sometimes a few rows more, sometimes a few less.
I meant to say: the sort of error returned is:
That error is expected (the field size changes past row 1442), but for these files the parser reads past the nrows limit (1442, or anything above 823).
I also tested nrows on arbitrarily created csv files built from numpy arrays, but couldn't reproduce the error I saw with the real data.
(And apologies for poorly formed markdown above - first time posting :-)
Thanks - but I'm unclear on your request; I thought I had already done what you asked.
I created a reproducible example with the code at the bottom of my post - admittedly in IPython rather than a plain Python file.
I'm trying to extract the first section (rows 1-1442 of a 3366 row file) - this is where my problem occurs.
Was my code example unclear?
For reproducibility purposes, the bulk of the code deals with downloading a zip file, but the test is in the five lines from 'with thezipfile.open(fname) as csvFile:' onwards
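A self-contained sketch of that pattern, with the zip built in memory from dummy data instead of being downloaded (the archive contents and names here are placeholders, not the actual market-data files):

```python
import io
import zipfile
import pandas as pd

# Build a small zip archive in memory to stand in for the downloaded file.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("demo.csv", "a,b\n1,2\n3,4\n5,6\n")

# The test itself: open one CSV member of the archive and read nrows rows.
fname = "demo.csv"
with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as thezipfile:
    with thezipfile.open(fname) as csvFile:
        df = pd.read_csv(csvFile, header=0, nrows=2)
print(len(df))
```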
I'm expecting this to be a subtle bug (or I'm doing something very wrong) - the nrows parameter clearly works on the various, much larger examples I threw at it.
But at the same time, these electricity market files are well-formed CSV files (they are part of the market data process in a live electricity market, where auctions have been run every 5 minutes for the past 15 years), and pandas is failing to parse the files I used while developing the code.
thanks - you've given me a thought that I can test it by just breaking the relevant part of the CSV file out.
But if it turns out to be related to the file structure itself - not sure how to provide a test without a link to a sample file. Would creating a github repo with some sample csv files and a few lines of code be suitable as the test?
See what you come up with. This is either a bug, which can be reproduced via a generated type of test file (e.g. create a specific structure), a problem with the csv, or incorrect use of the options.
We need a self-contained test in order to narrow down the problem.
lmk what you find.
If the file contains only the relevant section (or only rows to skip at the front), there is no error.
That implies it isn't a problem with my use of the options, I think.
If the file structure includes the very next line (no. 1443), the 130-field header for the next section, it fails for any nrows > 823.
I also experimented with deleting an arbitrary (small) number of rows at the end of the section, before the next header row, to see if the issue was related to that particular line ending. It failed again.
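The "only the relevant section" experiment can be sketched like this: slice the raw lines out first so the parser never sees the next section's wider header. The file and line counts here are illustrative stand-ins, not the real data.

```python
import io
import itertools
import pandas as pd

# Build a stand-in file: one 3-field section followed by a wider section.
with open("section_demo.csv", "w") as f:
    f.write("a,b,c\n")
    f.writelines("1,2,3\n" for _ in range(5))
    f.write("p,q,r,s,t\n")   # the next section's wider header
    f.write("4,5,6,7,8\n")

# Slice out only the first section (header + 5 data rows) before parsing,
# so pandas never tokenizes the wider rows at all.
with open("section_demo.csv") as f:
    section = "".join(itertools.islice(f, 0, 6))

df = pd.read_csv(io.StringIO(section), header=0)
print(df.shape)
```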
I'm not sure I can create a test file - other than the sample files I've been experimenting with.
I'll go and figure out how to make a github repo and perhaps we can take it from there.
For info - the full error at the fail point is:
```
/Users/ChristopherShort/anaconda/lib/python3.4/site-packages/pandas/io/parsers.py in read(self, nrows)
/Users/ChristopherShort/anaconda/lib/python3.4/site-packages/pandas/parser.so in pandas.parser.TextReader.read (pandas/parser.c:7146)()
/Users/ChristopherShort/anaconda/lib/python3.4/site-packages/pandas/parser.so in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7547)()
/Users/ChristopherShort/anaconda/lib/python3.4/site-packages/pandas/parser.so in pandas.parser.TextReader._read_rows (pandas/parser.c:7979)()
/Users/ChristopherShort/anaconda/lib/python3.4/site-packages/pandas/parser.so in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:7853)()
/Users/ChristopherShort/anaconda/lib/python3.4/site-packages/pandas/parser.so in pandas.parser.raise_parser_error (pandas/parser.c:19604)()
CParserError: Error tokenizing data. C error: Expected 120 fields in line 1443, saw 130
```
How about this for a test:
Create two CSV files (1442 rows by 120 cols and 5 rows by 130 cols).
It fails in the same way, though the nrows value at which failure occurred was much larger than in the examples above (where the files contained more strings).
In the example below, it fails for nrows > 1360 and works fine for lower values.
This is a small update (and to see if any thoughts occur to you).
Before I went to look at parser.py, I tried to generalise the test file above in order to explore row/column variations to see if there was a boundary to the error.
I didn't get far in exploring row parameters before realising the error appears to occur randomly.
In the code below, it loops over the 'test' 3 times, printing out the number of rows in the failed example, as well as the memory size of the dataframe in the failed run.
The number of errors differs across runs (I've seen one run with no errors at all).
The dataframe memory size doesn't appear relevant: when I printed it for all tests, bigger ones passed and smaller ones failed; there was nothing obvious to look for.
Also indicating that it's not a memory thing: a typo that set the number of columns to 12 instead of 120 produced that error each and every time read_csv was called.
I'll go look at parser.py to see where I could put some print statements - but, as you say, it's probably in a Cython call somewhere, and I'm an economist (not a programmer) who last used C sparingly 20 years ago.
```python
import numpy as np
import pandas as pd

def test_RowCount(size_1=(1442, 120), rowCount=1361):
    # original parameters where failure occurred
    df_1 = pd.DataFrame(np.random.uniform(size=size_1))
    df_1.to_csv('test120.csv', index=False)

    # create a combined csv file ('testNrows.csv') of different record lengths
    filenames = ['test120.csv', 'test130.csv']
    with open('testNrows.csv', 'w') as outfile:
        for fname in filenames:
            with open(fname) as infile:
                for line in infile:
                    outfile.write(line)

    try:
        df = pd.read_csv('testNrows.csv', header=0, nrows=rowCount)
    except pd.parser.CParserError as error:
        print(error)
        print('Rows: ', size_1)
        print('Memory (MB): ', df_1.memory_usage(index=True).sum() / 1024 / 1024, '\n')
    # except:
    #     print("Unexpected error: ", sys.exc_info())

### Write out 1 file of a different record length for later use in test_RowCount
size_2 = (1, 130)
df_2 = pd.DataFrame(np.random.uniform(size=size_2))
df_2.to_csv('test130.csv', index=False)

### Loop for testing various row counts and record lengths
for j in range(3):
    print('Run ', j)
    for i in range(1442, 1361, -1):
        # print(i)
        test_RowCount(size_1=(i, 120), rowCount=1360)
```
@jreback This is a real problem. It's still present in 0.19. It can be worked around by
@jreback I have uploaded a file which reproduces this error: https://gist.githubusercontent.com/jzwinck/838882fbc07f7c3a53992696ef364f66
Simply download that file and run this:
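The snippet itself wasn't captured in this excerpt; from the description below (start at line 2195, read 100 rows), it was presumably of this shape. A generated stand-in file is used here in place of the gist file: every line has 5 fields except line 2355, which is wider, mirroring the report.

```python
import pandas as pd

# Stand-in for the gist file: 2400 lines, with one wider line at 2355.
with open("standin.csv", "w") as f:
    for i in range(1, 2401):
        f.write("1,2,3,4,5,6,7,8\n" if i == 2355 else "1,2,3,4,5\n")

# Read 100 rows starting at line 2195; line 2355 should never be touched.
df = pd.read_csv("standin.csv", skiprows=2194, header=None, nrows=100)
print(df.shape)
```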
It fails, saying:
Since we told Pandas to read from line 2195 for 100 rows, it should never have seen line 2355.
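One possible workaround (an assumption on my part, not something confirmed in this thread): the pure-Python engine does not go through the C parser's low-memory chunked tokenizer, which is where the traceback above dies, so nrows behaves. Demonstrated on a generated stand-in with a wider row past the requested window:

```python
import pandas as pd

# Stand-in file: 2400 lines of 5 fields, with a wider 8-field line at 2355.
with open("workaround_demo.csv", "w") as f:
    for i in range(1, 2401):
        f.write("1,2,3,4,5,6,7,8\n" if i == 2355 else "1,2,3,4,5\n")

# engine="python" sidesteps the C parser's chunked reader entirely.
df = pd.read_csv("workaround_demo.csv", skiprows=2194, header=None,
                 nrows=100, engine="python")
print(df.shape)
```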