
Loading data currently yields ParserError #11

Open · paigem opened this issue May 20, 2021 · 9 comments

Comments

@paigem

paigem commented May 20, 2021

The dask.ipynb notebook currently yields a ParserError when loading the volcano data. The line of code that breaks:

df = dd.read_csv(server+query, blocksize=None)

The error can be found below:

---------------------------------------------------------------------------
ParserError                               Traceback (most recent call last)
in
      6
      7 # blocksize=None means use a single partition
----> 8 df = dd.read_csv(server+query, blocksize=None)

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask/dataframe/io/csv.py in read(urlpath, blocksize, collection, lineterminator, compression, sample, enforce, assume_missing, storage_options, include_path_column, **kwargs)
578 storage_options=storage_options,
579 include_path_column=include_path_column,
--> 580 **kwargs,
581 )
582

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask/dataframe/io/csv.py in read_pandas(reader, urlpath, blocksize, collection, lineterminator, compression, sample, enforce, assume_missing, storage_options, include_path_column, **kwargs)
444
445 # Use sample to infer dtypes and check for presence of include_path_column
--> 446 head = reader(BytesIO(b_sample), **kwargs)
447 if include_path_column and (include_path_column in head.columns):
448 raise ValueError(

/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
674 )
675
--> 676 return _read(filepath_or_buffer, kwds)
677
678 parser_f.name = name

/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
452
453 try:
--> 454 data = parser.read(nrows)
455 finally:
456 parser.close()

/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/io/parsers.py in read(self, nrows)
1131 def read(self, nrows=None):
1132 nrows = _validate_integer("nrows", nrows)
-> 1133 ret = self._engine.read(nrows)
1134
1135 # May alter columns / col_dict

/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/io/parsers.py in read(self, nrows)
2035 def read(self, nrows=None):
2036 try:
-> 2037 data = self._reader.read(nrows)
2038 except StopIteration:
2039 if self._first_chunk:

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()

pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()

ParserError: Error tokenizing data. C error: EOF inside string starting at row 172
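The "EOF inside string" message suggests the sampled bytes ended inside a quoted field: the volcano CSV has long quoted text columns with embedded newlines, and a fixed-size byte sample can cut one of them in half. A stdlib-only sketch of that failure mode, using hypothetical miniature data (the columns and values here are invented, not the real file):

```python
import csv
import io

# Hypothetical miniature CSV: the summary column holds quoted,
# multi-line text, like a long description field in the volcano data.
data = (
    'id,name,summary\n'
    '1,Etna,"A long description\nspanning several lines"\n'
    '2,Fuji,"Another\nmultiline summary"\n'
)

# dd.read_csv reads a fixed-size byte sample to infer dtypes. If the cut
# lands inside a quoted field, the sample ends with an unterminated
# string -- the "EOF inside string" that pandas' C parser reports.
sample = data.encode()[:40]  # cut lands inside the first quoted field

# Python's csv module tolerates the unterminated quote and silently
# returns the truncated field, instead of raising like pandas does:
rows = list(csv.reader(io.StringIO(sample.decode())))
print(rows[-1])  # last row's summary field is cut off mid-sentence

# Parsing the complete data handles the quoted newlines fine:
full_rows = list(csv.reader(io.StringIO(data)))
assert len(full_rows) == 3
```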

This appears to be due to a parsing quirk in the data file itself. The data can be successfully loaded using the pandas library instead, as shown by @NickMortimer during a workshop at the Dask Distributed Summit. 🙂 If the above line of code is replaced with:

import pandas as pd
df = pd.read_csv(server + query)
df = dd.from_pandas(df, npartitions=1)

then the data loads just fine. So the above three lines of code are an easy fix, unless someone else has an idea how to load the data using dask.dataframe directly.

@rabernat
Member

Thanks for sharing Paige!

It's weird that Dask chokes on the file, since it is clearly using pandas under the hood! It actually seems like a Dask bug. I recommend raising a Dask issue. To do this, you will want to simplify your example even further, into:

url = "http://put the full url here"
df = dd.read_csv(url)

@paigem
Author

paigem commented May 20, 2021

Thanks for your input @rabernat! I will open a Dask issue about this now.

@NickMortimer

NickMortimer commented May 20, 2021 via email

@paigem
Author

paigem commented May 20, 2021

Thanks for initially flagging this error @NickMortimer! We don't want our beginner-friendly tutorials to be broken!

Depending on how long it takes to fix this Dask bug, it might be worth making a PR with the pandas library fix for now. @NickMortimer - want to make a PR for your fix? 🙂

@jrbourbeau
Member

As Martin mentioned over in the upstream Dask issue (xref dask/dask#7680 (comment)), a quick fix for now is to pass sample=False to dask.dataframe.read_csv:

In [1]: import dask.dataframe as dd
   ...: url = 'https://webservices.volcano.si.edu/geoserver/GVP-VOTW/ows?service=WFS&version=2.0.0&request=GetFeature&typeName=GVP-VOTW:
   ...: Smithsonian_VOTW_Holocene_Volcanoes&outputFormat=csv'
   ...: df = dd.read_csv(url, blocksize=None, sample=False)

In [2]: df
Out[2]:
Dask DataFrame Structure:
                  FID Volcano_Number Volcano_Name Primary_Volcano_Type Last_Eruption_Year Country Geological_Summary  Region Subregion Latitude Longitude Elevation Tectonic_Setting Geologic_Epoch Evidence_Category Primary_Photo_Link Primary_Photo_Caption Primary_Photo_Credit Major_Rock_Type GeoLocation
npartitions=1
               object          int64       object               object            float64  object             object  object    object  float64   float64     int64           object         object            object             object                object               object          object      object
                  ...            ...          ...                  ...                ...     ...                ...     ...       ...      ...       ...       ...              ...            ...               ...                ...                   ...                  ...             ...         ...
Dask Name: read-csv, 1 tasks

@paigem
Author

paigem commented May 20, 2021

Thanks @jrbourbeau! Good suggestion. This quick fix is cleaner than importing first through the pandas library.

@NickMortimer

NickMortimer commented May 21, 2021

I just forked the repo to prepare a pull request, and it all works in my environment on my local PC, so could this be a version issue in the Pangeo Binder session?

dask version 2.17.2 and pandas version 1.0.5 on my local machine, and all is fine.

@jrbourbeau
Member

Hmm, locally I get the same pandas.errors.ParserError when using dask=2.17.2 and pandas=1.0.5. That is,

import dask
import pandas as pd
import dask.dataframe as dd

print(f"{dask.__version__ = }")
print(f"{pd.__version__ = }")

url = 'https://webservices.volcano.si.edu/geoserver/GVP-VOTW/ows?service=WFS&version=2.0.0&request=GetFeature&typeName=GVP-VOTW:Smithsonian_VOTW_Holocene_Volcanoes&outputFormat=csv'
df = dd.read_csv(url, blocksize=None)

outputs

dask.__version__ = '2.17.2'
pd.__version__ = '1.0.5'
Traceback (most recent call last):
  File "test.py", line 9, in <module>
    df = dd.read_csv(url, blocksize=None)
  File "/Users/james/miniforge3/envs/test/lib/python3.8/site-packages/dask/dataframe/io/csv.py", line 568, in read
    return read_pandas(
  File "/Users/james/miniforge3/envs/test/lib/python3.8/site-packages/dask/dataframe/io/csv.py", line 446, in read_pandas
    head = reader(BytesIO(b_sample), **kwargs)
  File "/Users/james/miniforge3/envs/test/lib/python3.8/site-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/Users/james/miniforge3/envs/test/lib/python3.8/site-packages/pandas/io/parsers.py", line 454, in _read
    data = parser.read(nrows)
  File "/Users/james/miniforge3/envs/test/lib/python3.8/site-packages/pandas/io/parsers.py", line 1133, in read
    ret = self._engine.read(nrows)
  File "/Users/james/miniforge3/envs/test/lib/python3.8/site-packages/pandas/io/parsers.py", line 2037, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 860, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 875, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 929, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 916, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2071, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at row 172

For now, my guess is that adding sample=False might be the most robust quick fix.

@NickMortimer

I've made a pull request for this: #14. I'm new to the whole pull request thing, so feedback is welcome...
