nrows limit fails reading well formed csv files from Australian electricity market data #7626
ChristopherShort commented Jul 1, 2014:
I meant to say: the sort of error returned is:

Which is expected: the field count changes past row 1442. But for these files, the nrows limit reads past the 1442 (or 823+) value. I also tested nrows on arbitrarily created csv files via numpy arrays, but couldn't reproduce the error from the real data I was working with. (And apologies for the poorly formed markdown above, first time posting :-) )
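For reference, the "expected" error described here can be demonstrated with a generated stand-in for the market file (not the real data, just the same shape of problem: a 120-field section followed by a 130-field one). Note that `pd.errors.ParserError` is the modern name for what 2014-era pandas called `CParserError`:

```python
import io
import numpy as np
import pandas as pd

# Generated stand-in: a 120-field section followed by a 130-field section.
buf = io.StringIO()
pd.DataFrame(np.random.uniform(size=(1442, 120))).to_csv(buf, index=False)
pd.DataFrame(np.random.uniform(size=(5, 130))).to_csv(buf, index=False)
buf.seek(0)

# Without nrows, the tokenizer eventually reaches the 130-field header row
# and raises the tokenizing error quoted in this thread.
caught = None
try:
    pd.read_csv(buf, header=0)
except pd.errors.ParserError as err:  # CParserError in 2014-era pandas
    caught = err
print(caught)
```

The bug under discussion is that the same error fired even when nrows should have stopped the parser well before the wider section.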
Why don't you create a test? Pull the header and 2 rows from each section (then limit the number of fields), then try this using nrows to skip. If this is a bug, we would need a reproducible example.
ChristopherShort commented Jul 1, 2014:
Thanks, but I'm unclear on your request; that is, I thought I had already done what you asked. I created a reproducible example with the code at the bottom of my post, admittedly with IPython rather than a straight Python file. I'm trying to extract the first section (rows 1-1442 of a 3366-row file), and this is where my problem occurs. Was my code example unclear? For reproducibility purposes, the bulk of the code deals with downloading a zip file, but the test is in the five lines from `with thezipfile.open(fname) as csvFile:` onwards.

I'm expecting it to be a subtle bug (or I'm doing something very wrong): the nrows parameter clearly works on the various examples I threw at it that were much larger in row length. But at the same time, these electricity market files are well-formed CSV files (they are part of the market data process in a live electricity market where auctions have been run every 5 minutes for the past 15 years), and pandas is failing to parse the files I used in developing the code.
No, what I mean is that we need an example that you can simply copy and paste (and that does not use an external URL).
ChristopherShort commented Jul 1, 2014:
thanks - you've given me a thought that I can test it by just breaking the relevant part of the CSV file out. But if it turns out to be related to the file structure itself - not sure how to provide a test without a link to a sample file. Would creating a github repo with some sample csv files and a few lines of code be suitable as the test? |
See what you come up with. This is either a bug, which can be reproduced via a generated test file (e.g. one with a specific structure), a problem with the csv, or incorrect use of the options. We need a self-contained test in order to narrow down the problem. lmk what you find.
ChristopherShort commented Jul 1, 2014:
Thanks. If the file is only the relevant section (or rows to skip at the front), there is no error, which I think also implies it's not a problem with my use of the options. If the file structure includes the very next line (no. 1443, the 130-field header for the next section), it fails with any nrows > 823.

I also experimented with deleting an arbitrary (but small) number of rows at the end of the section, just before the next header row, to see if the issue related to that particular line ending. Again it fails.

I'm not sure I can create a test file other than the sample files I've been experimenting with. I'll go and figure out how to make a GitHub repo and perhaps we can take it from there.

For info, the full error at the fail point is:

```
/Users/ChristopherShort/anaconda/lib/python3.4/site-packages/pandas/io/parsers.py in read(self, nrows)
/Users/ChristopherShort/anaconda/lib/python3.4/site-packages/pandas/parser.so in pandas.parser.TextReader.read (pandas/parser.c:7146)()
/Users/ChristopherShort/anaconda/lib/python3.4/site-packages/pandas/parser.so in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7547)()
/Users/ChristopherShort/anaconda/lib/python3.4/site-packages/pandas/parser.so in pandas.parser.TextReader._read_rows (pandas/parser.c:7979)()
/Users/ChristopherShort/anaconda/lib/python3.4/site-packages/pandas/parser.so in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:7853)()
/Users/ChristopherShort/anaconda/lib/python3.4/site-packages/pandas/parser.so in pandas.parser.raise_parser_error (pandas/parser.c:19604)()

CParserError: Error tokenizing data. C error: Expected 120 fields in line 1443, saw 130
```
ChristopherShort commented Jul 3, 2014:
How about this for a test: create two CSV files (1442 rows by 120 cols, and 5 rows by 130 cols) and concatenate them. It fails in the same way, though the nrows parameter could be much larger before failure occurred relative to the examples above (where the files contained more strings). In the example below it fails for nrows > 1360 and works fine for lower values.
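The test described above can be sketched like this (file names and sizes as described in the thread; on current pandas, where the stopping logic has been fixed, the nrows=1360 read succeeds):

```python
import os
import tempfile
import numpy as np
import pandas as pd

tmp = tempfile.mkdtemp()
f120 = os.path.join(tmp, 'test120.csv')
f130 = os.path.join(tmp, 'test130.csv')
fall = os.path.join(tmp, 'testNrows.csv')

# Two CSVs of different widths, as described: 1442 rows x 120 cols
# and 5 rows x 130 cols, concatenated into one file.
pd.DataFrame(np.random.uniform(size=(1442, 120))).to_csv(f120, index=False)
pd.DataFrame(np.random.uniform(size=(5, 130))).to_csv(f130, index=False)
with open(fall, 'w') as out:
    for fname in (f120, f130):
        with open(fname) as f:
            out.write(f.read())

# nrows values up to 1360 worked; larger values tripped the tokenizing
# error on the 130-field header of the second section.
df = pd.read_csv(fall, header=0, nrows=1360)
print(df.shape)  # (1360, 120)
```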
jreback added the CSV and Bug labels on Jul 3, 2014
jreback added this to the 0.15.0 milestone on Jul 3, 2014
ok great, must be a bug somewhere
ChristopherShort commented Oct 27, 2014:
I have some time now to try and look at this bug, but not much experience. Do you have any recommendations on things I should know first?

Well, it's going to be in parser.pyx. IMHO it's not so easy to debug Cython; I would start by putting print statements in to figure out what it is doing on this file.
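Before instrumenting parser.pyx, one cheap cross-check (my suggestion, not from the thread) is the pure-Python engine, which does not go through the C tokenizer at all. If it succeeds where the default C engine fails, the bug is localized to the C engine's stopping logic:

```python
import io
import numpy as np
import pandas as pd

# Same mixed-width input as the failing case: a 120-field section
# followed by a 130-field section.
buf = io.StringIO()
pd.DataFrame(np.random.uniform(size=(1442, 120))).to_csv(buf, index=False)
pd.DataFrame(np.random.uniform(size=(5, 130))).to_csv(buf, index=False)
buf.seek(0)

# engine='python' bypasses parser.pyx entirely; nrows=1361 was a failing
# value for the C engine in the reports above.
df = pd.read_csv(buf, header=0, nrows=1361, engine='python')
print(df.shape)  # (1361, 120)
```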
ChristopherShort commented Oct 27, 2014:

OK, thanks.
ChristopherShort commented Oct 29, 2014:
This is a small update (and to see if any thoughts occur to you). Before going to look at parser.pyx, I tried to generalise the test file above in order to explore row/column variations and see if there was a boundary to the error. I didn't get far in exploring row parameters before realising that the error appears to occur randomly.

The code below loops over the test 3 times, printing out the number of rows in each failed example, as well as the memory size of the dataframe in the failed run. There are a different number of errors across different runs (I've seen one where there was no error at all). The dataframe memory size doesn't appear relevant: when I printed it for all tests, bigger ones passed, smaller ones failed, nothing obvious to look for. Also indicating it's not a memory thing: a typo that set the number of columns to 12 instead of 120 got the error each and every time read_csv was called.

I'll go look at parser.pyx to see where I could put some print statements, but, as you say, it's probably in a Cython call somewhere, and I'm an economist (not a programmer) who last used C sparingly 20 years ago.

```python
import numpy as np
import pandas as pd

def test_RowCount(size_1=(1442, 120), rowCount=1361):  # original parameters where failure occurred
    df_1 = pd.DataFrame(np.random.uniform(size=size_1))
    df_1.to_csv('test120.csv', index=False)
    # create combined csv file ('testNrows.csv') of different record lengths
    filenames = ['test120.csv', 'test130.csv']
    with open('testNrows.csv', 'w') as outfile:
        for fname in filenames:
            with open(fname) as infile:
                for line in infile:
                    outfile.write(line)
    try:
        df = pd.read_csv('testNrows.csv', header=0, nrows=rowCount)
    except pd.parser.CParserError as error:
        print(error)
        print('Rows: ', size_1[0])
        print('Memory (MB): ', df_1.memory_usage(index=True).sum() / 1024 / 1024, '\n')
    # except:
    #     print('Unexpected error: ', sys.exc_info()[0])

### Write out 1 file of different record length for later use in test_RowCount function
size_2 = (1, 130)
df_2 = pd.DataFrame(np.random.uniform(size=size_2))
df_2.to_csv('test130.csv', index=False)

### Loop for testing various row counts and record lengths
for j in range(3):
    print('Run ', j)
    for i in range(1442, 1361, -1):
        test_RowCount(size_1=(i, 120), rowCount=1360)
```
jreback modified the milestones: 0.16.0, Next Major Release on Mar 6, 2015

@jreback This is a real problem. It's still present in 0.19. It can be worked around by
Well, if you have a reproducible example, please show it.
@jreback OK, the input file is 516 KB. Where would you like me to put it? I tried removing "unnecessary" rows from it, but this bug doesn't reproduce if you shrink the file a lot.
Best to put this up on a separate repo or gist, and use a URL to access it.
@jreback I have uploaded a file which reproduces this error: https://gist.githubusercontent.com/jzwinck/838882fbc07f7c3a53992696ef364f66

Simply download that file and run this:

It fails, saying:

Since we told Pandas to read from line 2195 for 100 rows, it should never have seen line 2355.
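The exact command and error text were lost from the quoted comment; presumably it was a skiprows plus nrows call. A generated stand-in with the same shape of problem (the sizes here are illustrative, not the gist's actual dimensions) looks like this, and succeeds on pandas releases that include the fix:

```python
import io
import numpy as np
import pandas as pd

# Stand-in for the gist file: 2350 narrow rows followed by a wider
# section, so a line past the requested range has extra fields.
buf = io.StringIO()
pd.DataFrame(np.random.uniform(size=(2350, 10))).to_csv(buf, index=False)
pd.DataFrame(np.random.uniform(size=(5, 20))).to_csv(buf, index=False)
buf.seek(0)

# Read 100 rows starting at line 2195; with correct stopping logic the
# parser must never tokenize the wider section further down the file.
df = pd.read_csv(buf, skiprows=2194, nrows=100, header=None)
print(df.shape)  # (100, 10)
```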
jeffcarey added a commit to jeffcarey/pandas that referenced this issue on Nov 25, 2016: 0d6027c
jeffcarey referenced this issue on Nov 25, 2016: BUG: Corrects stopping logic when nrows argument is supplied (#7626) #14747 (closed)
jeffcarey added a commit to jeffcarey/pandas that referenced this issue on Nov 26, 2016: af0ca98
jeffcarey added a commit to jeffcarey/pandas that referenced this issue on Nov 26, 2016: b3c7428
jeffcarey added a commit to jeffcarey/pandas that referenced this issue on Nov 26, 2016: f6e5236
jeffcarey added a commit to jeffcarey/pandas that referenced this issue on Nov 26, 2016: 29a887c
jeffcarey added a commit to jeffcarey/pandas that referenced this issue on Nov 26, 2016: f4c3c13
jeffcarey added a commit to jeffcarey/pandas that referenced this issue on Nov 29, 2016: e9c5bee
jeffcarey added a commit to jeffcarey/pandas that referenced this issue on Dec 2, 2016: 6f1965a
jreback modified the milestones: 0.19.2, Next Major Release on Dec 5, 2016
jreback closed this in 4378f82 on Dec 6, 2016
jeffcarey and jorisvandenbossche added a commit that referenced this issue on Dec 15, 2016: 96cac41
ChristopherShort commented Jul 1, 2014:

Reading Australian electricity market data files, read_csv reads past the nrows limit for certain nrows values and consequently fails.

These market data files are 4 csv files combined into a single csv file, so the file has multiple headers and a variable field count across the rows. The first set of data is rows 1-1442, and the intent was to extract that first set with nrows=1442.

Testing several arbitrary CSV files from this data source shows well-formed CSV: counting fields per row shows 120 fields between rows 1 and 1442 (with a 10-field row at row 0), i.e. it returns

```
120    1441
10        1
dtype: int64
```

Other Python examples of reading the market data using the csv module work fine.

In the reproducible example below, the code works for nrows < 824 but fails on any value above it. Testing on arbitrary files suggests the 824 limit is variable: sometimes a few more rows, sometimes a few less.
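Since the csv module handles these files fine, one workaround (illustrative, not proposed in the thread) is to split the multi-section file on field-count changes first, then hand each section to pandas, sidestepping nrows entirely:

```python
import csv
import io
import pandas as pd

# Hypothetical mixed file: two sections with different field counts.
raw = "a,b,c\n1,2,3\n4,5,6\nx,y,z,w\n7,8,9,10\n"

# Start a new section whenever the field count changes from the
# section's header row.
sections, current = [], []
for row in csv.reader(io.StringIO(raw)):
    if current and len(row) != len(current[0]):
        sections.append(current)
        current = []
    current.append(row)
sections.append(current)

# First row of each section is its header.
frames = [pd.DataFrame(s[1:], columns=s[0]) for s in sections]
print([f.shape for f in frames])  # [(2, 3), (1, 4)]
```

This assumes adjacent sections always differ in width, which holds for the files described here (120-field data rows vs. a 130-field header).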