Loading data currently yields ParserError #11
Thanks for sharing, Paige! It's weird that Dask chokes on the file, since it is clearly using pandas under the hood! It actually seems like a Dask bug, so I recommend raising a Dask issue. To do this, you will want to simplify your example even further, down to something like:

```python
url = "http://put the full url here"
df = dd.read_csv(url)
```
Thanks for your input @rabernat! I will make a Dask issue about this now.
Yep, it's strange. I've tried to download the file and open it locally and it fails, yet it looks fine in Excel. I think it's something to do with return chars and escape sequences of quotes around that line.
Thanks for initially flagging this error @NickMortimer! We don't want our beginner-friendly tutorials to be broken! Depending on how long it takes to fix this Dask bug, it might be worth making a PR with the pandas-library fix for now. @NickMortimer, want to make a PR for your fix? 🙂
As Martin mentioned over in the upstream Dask issue (xref dask/dask#7680 (comment)), a quick fix for now is to pass `blocksize=None` and `sample=False`:

```python
In [1]: import dask.dataframe as dd
   ...: url = 'https://webservices.volcano.si.edu/geoserver/GVP-VOTW/ows?service=WFS&version=2.0.0&request=GetFeature&typeName=GVP-VOTW:Smithsonian_VOTW_Holocene_Volcanoes&outputFormat=csv'
   ...: df = dd.read_csv(url, blocksize=None, sample=False)

In [2]: df
Out[2]:
Dask DataFrame Structure:
                  FID Volcano_Number Volcano_Name Primary_Volcano_Type Last_Eruption_Year Country Geological_Summary Region Subregion Latitude Longitude Elevation Tectonic_Setting Geologic_Epoch Evidence_Category Primary_Photo_Link Primary_Photo_Caption Primary_Photo_Credit Major_Rock_Type GeoLocation
npartitions=1
               object          int64       object               object            float64  object             object object    object  float64   float64     int64           object         object            object             object                object               object          object      object
                  ...            ...          ...                  ...                ...     ...                ...    ...       ...      ...       ...       ...              ...            ...               ...                ...                   ...                  ...             ...         ...
Dask Name: read-csv, 1 tasks
```
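To illustrate why `sample=False` helps: by default, Dask reads only the first chunk of bytes from the file to infer dtypes, and pandas then parses that raw byte sample. If the byte cutoff happens to land inside a quoted field, pandas sees an unterminated quote and raises exactly this kind of `ParserError`. Here is a minimal sketch of that failure mode; the CSV content and the cutoff point are made up for illustration:

```python
import io

import pandas as pd

# A well-formed CSV with a quoted field (a stand-in for the volcano data).
full_csv = b'Volcano_Name,Geological_Summary\nEtna,"a long, quoted summary of the volcano"\n'

# Simulate a byte sample that cuts the file off mid-quote.
sample = full_csv[:50]

try:
    pd.read_csv(io.BytesIO(sample))
except pd.errors.ParserError as err:
    # The truncated sample has an open quote at EOF, so the C parser fails.
    print("ParserError:", err)

# The full file, by contrast, parses without issue.
df = pd.read_csv(io.BytesIO(full_csv))
print(df)
```

With `sample=False` (and `blocksize=None`), Dask skips this sampling step entirely, so the truncation can never happen.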
Thanks @jrbourbeau! Good suggestion. This quick fix is cleaner than first importing through the pandas library.
I just forked the repo to prepare a pull request, and it all works in my environment on my local PC, so this could be a version issue in the Pangeo Binder session? On my local machine, dask version = 2.17.2 and pandas version = 1.0.5, and all is fine.
Hmm, locally I get the same error. Running

```python
import dask
import pandas as pd
import dask.dataframe as dd

print(f"{dask.__version__ = }")
print(f"{pd.__version__ = }")

url = 'https://webservices.volcano.si.edu/geoserver/GVP-VOTW/ows?service=WFS&version=2.0.0&request=GetFeature&typeName=GVP-VOTW:Smithsonian_VOTW_Holocene_Volcanoes&outputFormat=csv'
df = dd.read_csv(url, blocksize=None)
```

outputs the same `ParserError`. For now, my guess is adding …
I've made a pull request for this: #14. I'm new to the whole pull-request thing, so feedback is welcome...
The `dask.ipynb` notebook currently yields a `ParserError` when loading the volcano data. The line of code that breaks is `df = dd.read_csv(server+query, blocksize=None)`. The error can be found below:
```
---------------------------------------------------------------------------
ParserError                               Traceback (most recent call last)
<ipython-input> in <module>
      6
      7 # blocksize=None means use a single partion
----> 8 df = dd.read_csv(server+query, blocksize=None)

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask/dataframe/io/csv.py in read(urlpath, blocksize, collection, lineterminator, compression, sample, enforce, assume_missing, storage_options, include_path_column, **kwargs)
    578             storage_options=storage_options,
    579             include_path_column=include_path_column,
--> 580             **kwargs,
    581         )
    582

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask/dataframe/io/csv.py in read_pandas(reader, urlpath, blocksize, collection, lineterminator, compression, sample, enforce, assume_missing, storage_options, include_path_column, **kwargs)
    444
    445     # Use sample to infer dtypes and check for presence of include_path_column
--> 446     head = reader(BytesIO(b_sample), **kwargs)
    447     if include_path_column and (include_path_column in head.columns):
    448         raise ValueError(

/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
    674     )
    675
--> 676     return _read(filepath_or_buffer, kwds)
    677
    678 parser_f.__name__ = name

/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    452
    453     try:
--> 454         data = parser.read(nrows)
    455     finally:
    456         parser.close()

/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/io/parsers.py in read(self, nrows)
   1131     def read(self, nrows=None):
   1132         nrows = _validate_integer("nrows", nrows)
-> 1133         ret = self._engine.read(nrows)
   1134
   1135         # May alter columns / col_dict

/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/io/parsers.py in read(self, nrows)
   2035     def read(self, nrows=None):
   2036         try:
-> 2037             data = self._reader.read(nrows)
   2038         except StopIteration:
   2039             if self._first_chunk:

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()
pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()

ParserError: Error tokenizing data. C error: EOF inside string starting at row 172
```
This appears to be due to abnormal parsing of the data file itself. The data can be successfully loaded using the pandas library instead, as shown by @NickMortimer during a workshop at the Dask Distributed Summit. 🙂 If the above line of code is replaced with a pandas-based load, the data loads just fine. So those three lines of code are an easy fix, unless someone else has an idea how to load the data using `dask.dataframe` directly.