
Loading data currently yields ParserError #11

Open · paigem opened this issue May 20, 2021 · 9 comments

Comments

@paigem

paigem commented May 20, 2021

The dask.ipynb notebook currently yields a ParserError when loading the volcano data. The line of code that breaks:

df = dd.read_csv(server+query, blocksize=None)

The error can be found below:

---------------------------------------------------------------------------
ParserError                               Traceback (most recent call last)
in
      6
      7 # blocksize=None means use a single partition
----> 8 df = dd.read_csv(server+query, blocksize=None)

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask/dataframe/io/csv.py in read(urlpath, blocksize, collection, lineterminator, compression, sample, enforce, assume_missing, storage_options, include_path_column, **kwargs)
578 storage_options=storage_options,
579 include_path_column=include_path_column,
--> 580 **kwargs,
581 )
582

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask/dataframe/io/csv.py in read_pandas(reader, urlpath, blocksize, collection, lineterminator, compression, sample, enforce, assume_missing, storage_options, include_path_column, **kwargs)
444
445 # Use sample to infer dtypes and check for presence of include_path_column
--> 446 head = reader(BytesIO(b_sample), **kwargs)
447 if include_path_column and (include_path_column in head.columns):
448 raise ValueError(

/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
674 )
675
--> 676 return _read(filepath_or_buffer, kwds)
677
678 parser_f.name = name

/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
452
453 try:
--> 454 data = parser.read(nrows)
455 finally:
456 parser.close()

/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/io/parsers.py in read(self, nrows)
1131 def read(self, nrows=None):
1132 nrows = _validate_integer("nrows", nrows)
-> 1133 ret = self._engine.read(nrows)
1134
1135 # May alter columns / col_dict

/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/io/parsers.py in read(self, nrows)
2035 def read(self, nrows=None):
2036 try:
-> 2037 data = self._reader.read(nrows)
2038 except StopIteration:
2039 if self._first_chunk:

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()

pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()

ParserError: Error tokenizing data. C error: EOF inside string starting at row 172
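The "EOF inside string" message suggests the sampled bytes ended inside a quoted field: the volcano CSV has long quoted text columns with embedded newlines, and a fixed-size byte sample can cut one of them in half. A stdlib-only sketch of that failure mode, using hypothetical miniature data (the columns and values here are invented, not the real file):

```python
import csv
import io

# Hypothetical miniature CSV: the summary column holds quoted,
# multi-line text, like a long description field in the volcano data.
data = (
    'id,name,summary\n'
    '1,Etna,"A long description\nspanning several lines"\n'
    '2,Fuji,"Another\nmultiline summary"\n'
)

# dd.read_csv reads a fixed-size byte sample to infer dtypes. If the cut
# lands inside a quoted field, the sample ends with an unterminated
# string -- the "EOF inside string" that pandas' C parser reports.
sample = data.encode()[:40]  # cut lands inside the first quoted field

# Python's csv module tolerates the unterminated quote and silently
# returns the truncated field, instead of raising like pandas does:
rows = list(csv.reader(io.StringIO(sample.decode())))
print(rows[-1])  # last row's summary field is cut off mid-sentence

# Parsing the complete data handles the quoted newlines fine:
full_rows = list(csv.reader(io.StringIO(data)))
assert len(full_rows) == 3
```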

This appears to be due to a parsing quirk in the data file itself. The data can be successfully loaded using the pandas library instead, as shown by @NickMortimer during a workshop at the Dask Distributed Summit. 🙂 If the above line of code is replaced with:

import pandas as pd
df = pd.read_csv(server + query)
df = dd.from_pandas(df, npartitions=1)

then the data loads just fine. So the above three lines of code are an easy fix, unless someone else has an idea how to load the data using dask.dataframe directly.

@rabernat
Member

Thanks for sharing Paige!

It's weird that Dask chokes on the file, since it is clearly using pandas under the hood! It actually seems like a Dask bug. I recommend raising a Dask issue. To do this, you will want to simplify your example even further, into:

url = "http://put the full url here"
df = dd.read_csv(url)

@paigem
Author

paigem commented May 20, 2021

Thanks for your input @rabernat! I will open a Dask issue about this now.

@NickMortimer

NickMortimer commented May 20, 2021 via email

@paigem
Author

paigem commented May 20, 2021

Thanks for initially flagging this error @NickMortimer! We don't want our beginner-friendly tutorials to be broken!

Depending on how long it takes to fix this Dask bug, it might be worth making a PR with the pandas library fix for now. @NickMortimer - want to make a PR for your fix? 🙂

@jrbourbeau
Member

As Martin mentioned over in the upstream Dask issue (xref dask/dask#7680 (comment)), a quick fix for now is to pass sample=False to dask.dataframe.read_csv:

In [1]: import dask.dataframe as dd
   ...: url = 'https://webservices.volcano.si.edu/geoserver/GVP-VOTW/ows?service=WFS&version=2.0.0&request=GetFeature&typeName=GVP-VOTW:
   ...: Smithsonian_VOTW_Holocene_Volcanoes&outputFormat=csv'
   ...: df = dd.read_csv(url, blocksize=None, sample=False)

In [2]: df
Out[2]:
Dask DataFrame Structure:
                  FID Volcano_Number Volcano_Name Primary_Volcano_Type Last_Eruption_Year Country Geological_Summary  Region Subregion Latitude Longitude Elevation Tectonic_Setting Geologic_Epoch Evidence_Category Primary_Photo_Link Primary_Photo_Caption Primary_Photo_Credit Major_Rock_Type GeoLocation
npartitions=1
               object          int64       object               object            float64  object             object  object    object  float64   float64     int64           object         object            object             object                object               object          object      object
                  ...            ...          ...                  ...                ...     ...                ...     ...       ...      ...       ...       ...              ...            ...               ...                ...                   ...                  ...             ...         ...
Dask Name: read-csv, 1 tasks

@paigem
Author

paigem commented May 20, 2021

Thanks @jrbourbeau! Good suggestion. This quick fix is cleaner than importing first through the pandas library.

@NickMortimer

NickMortimer commented May 21, 2021

I just forked the repo to prepare a pull request, and it all works in my environment on my local PC, so could this be a version issue in the Pangeo Binder session?

dask version 2.17.2 and pandas version 1.0.5 on my local machine, and all is fine.

@jrbourbeau
Member

Hmm, locally I get the same pandas.errors.ParserError when using dask=2.17.2 and pandas=1.0.5. That is,

import dask
import pandas as pd
import dask.dataframe as dd

print(f"{dask.__version__ = }")
print(f"{pd.__version__ = }")

url = 'https://webservices.volcano.si.edu/geoserver/GVP-VOTW/ows?service=WFS&version=2.0.0&request=GetFeature&typeName=GVP-VOTW:Smithsonian_VOTW_Holocene_Volcanoes&outputFormat=csv'
df = dd.read_csv(url, blocksize=None)

outputs

dask.__version__ = '2.17.2'
pd.__version__ = '1.0.5'
Traceback (most recent call last):
  File "test.py", line 9, in <module>
    df = dd.read_csv(url, blocksize=None)
  File "/Users/james/miniforge3/envs/test/lib/python3.8/site-packages/dask/dataframe/io/csv.py", line 568, in read
    return read_pandas(
  File "/Users/james/miniforge3/envs/test/lib/python3.8/site-packages/dask/dataframe/io/csv.py", line 446, in read_pandas
    head = reader(BytesIO(b_sample), **kwargs)
  File "/Users/james/miniforge3/envs/test/lib/python3.8/site-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/Users/james/miniforge3/envs/test/lib/python3.8/site-packages/pandas/io/parsers.py", line 454, in _read
    data = parser.read(nrows)
  File "/Users/james/miniforge3/envs/test/lib/python3.8/site-packages/pandas/io/parsers.py", line 1133, in read
    ret = self._engine.read(nrows)
  File "/Users/james/miniforge3/envs/test/lib/python3.8/site-packages/pandas/io/parsers.py", line 2037, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 860, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 875, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 929, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 916, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2071, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at row 172

For now, my guess is that adding sample=False might be the most robust quick fix.

@NickMortimer

I've made a pull request for this: #14. I'm new to the whole pull request thing, so feedback is welcome...
