bug in loading datasets #1773

Closed
ghost opened this issue Jan 24, 2021 · 3 comments

Comments

ghost commented Jan 24, 2021

Hi,
I need to load a dataset, and I am using these commands:

from datasets import load_dataset
dataset = load_dataset('csv', data_files={'train': 'sick/train.csv',
                                          'test':  'sick/test.csv',
                                          'validation': 'sick/validation.csv'})
print(dataset['validation'])

The files under sick/ (train.csv, test.csv, validation.csv) are simple CSV files containing the data. I am getting the error below. Do you have an idea how I can solve this? Thank you @lhoestq

Using custom data configuration default
Downloading and preparing dataset csv/default-61468fc71a743ec1 (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /julia/cache_home_2/datasets/csv/default-61468fc71a743ec1/0.0.0/2960f95a26e85d40ca41a230ac88787f715ee3003edaacb8b1f0891e9f04dda2...
Traceback (most recent call last):
  File "/julia/libs/anaconda3/envs/success/lib/python3.7/site-packages/datasets-1.2.0-py3.7.egg/datasets/builder.py", line 485, in incomplete_dir
    yield tmp_dir
  File "/julia/libs/anaconda3/envs/success/lib/python3.7/site-packages/datasets-1.2.0-py3.7.egg/datasets/builder.py", line 527, in download_and_prepare
    dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
  File "/julia/libs/anaconda3/envs/success/lib/python3.7/site-packages/datasets-1.2.0-py3.7.egg/datasets/builder.py", line 604, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/julia/libs/anaconda3/envs/success/lib/python3.7/site-packages/datasets-1.2.0-py3.7.egg/datasets/builder.py", line 959, in _prepare_split
    for key, table in utils.tqdm(generator, unit=" tables", leave=False, disable=not_verbose):
  File "/julia/libs/anaconda3/envs/success/lib/python3.7/site-packages/tqdm-4.49.0-py3.7.egg/tqdm/std.py", line 1133, in __iter__
    for obj in iterable:
  File "/julia/cache_home_2/modules/datasets_modules/datasets/csv/2960f95a26e85d40ca41a230ac88787f715ee3003edaacb8b1f0891e9f04dda2/csv.py", line 129, in _generate_tables
    for batch_idx, df in enumerate(csv_file_reader):
  File "/julia/libs/anaconda3/envs/success/lib/python3.7/site-packages/pandas-1.2.0-py3.7-linux-x86_64.egg/pandas/io/parsers.py", line 1029, in __next__
    return self.get_chunk()
  File "/julia/libs/anaconda3/envs/success/lib/python3.7/site-packages/pandas-1.2.0-py3.7-linux-x86_64.egg/pandas/io/parsers.py", line 1079, in get_chunk
    return self.read(nrows=size)
  File "/julia/libs/anaconda3/envs/success/lib/python3.7/site-packages/pandas-1.2.0-py3.7-linux-x86_64.egg/pandas/io/parsers.py", line 1052, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "/julia/libs/anaconda3/envs/success/lib/python3.7/site-packages/pandas-1.2.0-py3.7-linux-x86_64.egg/pandas/io/parsers.py", line 2056, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 756, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 783, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 827, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 814, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 1951, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 37, saw 2


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "write_sick.py", line 19, in <module>
    'validation': 'sick/validation.csv'})
  File "/julia/libs/anaconda3/envs/success/lib/python3.7/site-packages/datasets-1.2.0-py3.7.egg/datasets/load.py", line 612, in load_dataset
    ignore_verifications=ignore_verifications,
  File "/julia/libs/anaconda3/envs/success/lib/python3.7/site-packages/datasets-1.2.0-py3.7.egg/datasets/builder.py", line 534, in download_and_prepare
    self._save_info()
  File "/julia/libs/anaconda3/envs/success/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/julia/libs/anaconda3/envs/success/lib/python3.7/site-packages/datasets-1.2.0-py3.7.egg/datasets/builder.py", line 491, in incomplete_dir
    shutil.rmtree(tmp_dir)
  File "/julia/libs/anaconda3/envs/success/lib/python3.7/shutil.py", line 498, in rmtree
    onerror(os.rmdir, path, sys.exc_info())
  File "/julia/libs/anaconda3/envs/success/lib/python3.7/shutil.py", line 496, in rmtree
    os.rmdir(path)
OSError: [Errno 39] Directory not empty: '/julia/cache_home_2/datasets/csv/default-61468fc71a743ec1/0.0.0/2960f95a26e85d40ca41a230ac88787f715ee3003edaacb8b1f0891e9f04dda2.incomplete'
Member

lhoestq commented Jan 24, 2021

Looks like an issue with your CSV file. Did you use the right delimiter?
Apparently, at line 37 the pandas CSV reader finds 2 fields instead of the 1 it expects.
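
To see which lines trip the parser, one option is to count fields per line with Python's built-in csv module. This is only a minimal sketch: the helper name find_inconsistent_rows and the sample rows are made up for illustration and are not taken from the files in this issue. It compares every line's field count with the first line's, which mirrors the "Expected 1 fields in line 37, saw 2" message.

```python
import csv
import io


def find_inconsistent_rows(lines, delimiter=","):
    """Return (line_number, field_count) for rows whose field count
    differs from that of the first non-empty row."""
    reader = csv.reader(lines, delimiter=delimiter)
    expected = None
    mismatches = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        if expected is None:
            expected = len(row)  # the first row sets the expected width
        elif len(row) != expected:
            mismatches.append((reader.line_num, len(row)))
    return mismatches


# Made-up, tab-separated sample in which one sentence contains a comma.
sample = (
    "sentence_A\tsentence_B\tlabel\n"
    "A man is playing\tA man plays\t1\n"
    "A dog, barking\tA dog barks\t1\n"
)

# Parsed with a comma delimiter, only the third line splits into 2 fields.
print(find_inconsistent_rows(io.StringIO(sample), delimiter=","))

# Parsed with a tab delimiter, every line has the same number of fields.
print(find_inconsistent_rows(io.StringIO(sample), delimiter="\t"))
```

For a real file, an open file object can be passed instead of the io.StringIO wrapper, for example one opened with open(path, newline="").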

Member

lhoestq commented Jan 25, 2021

Note that you can pass any argument you would pass to pandas.read_csv as kwargs to load_dataset. For example, to use a tab separator:

from datasets import load_dataset
dataset = load_dataset('csv', data_files=data_files, sep="\t")

You can see the full list of arguments here: https://github.com/huggingface/datasets/blob/master/src/datasets/packaged_modules/csv/csv.py

(I haven't found the list in the documentation though, we definitely must add it!)
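
If the delimiter is not known in advance, it can be guessed from a small sample of the file before choosing a value for sep. The following is a sketch using csv.Sniffer from the standard library; the helper name guess_delimiter, the candidate characters, and the sample text are assumptions made for illustration. Sniffer is a heuristic and raises csv.Error when it cannot decide, so the result is worth checking by eye.

```python
import csv


def guess_delimiter(sample_text, candidates=",\t;|"):
    """Guess the delimiter of CSV text, restricted to the candidate characters."""
    return csv.Sniffer().sniff(sample_text, delimiters=candidates).delimiter


# Made-up sample, not taken from the files in this issue.
sample = (
    "sentence_A\tsentence_B\tlabel\n"
    "A man is playing\tA man plays\t1\n"
    "A dog, barking\tA dog barks\t1\n"
)

sep = guess_delimiter(sample)
print(repr(sep))

# The guess could then be forwarded as a keyword argument (not run here):
# dataset = load_dataset('csv', data_files=data_files, sep=sep)
```

In practice the sample would be the first few kilobytes read from the CSV file rather than a hard-coded string.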

fghg123 commented Sep 6, 2021

You can try converting the file to the "CSV UTF-8" format.
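
A conversion like this can also be done in Python rather than in a spreadsheet program. The sketch below assumes the original encoding is known; the helper name convert_to_utf8, the file names, and the cp1252 source encoding are illustrative assumptions, not details from this issue. It re-encodes the text only and does not change the delimiter.

```python
import tempfile
from pathlib import Path


def convert_to_utf8(src, dst, source_encoding):
    """Re-encode a text file to UTF-8, leaving line endings untouched."""
    with open(src, "r", encoding=source_encoding, newline="") as fin:
        text = fin.read()
    with open(dst, "w", encoding="utf-8", newline="") as fout:
        fout.write(text)


# Demo on a temporary file so the sketch runs without touching real data.
original_text = "id,text\n1,café\n"
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "legacy.csv"  # hypothetical input file
    dst = Path(tmp) / "utf8.csv"    # hypothetical output file
    src.write_bytes(original_text.encode("cp1252"))
    convert_to_utf8(src, dst, source_encoding="cp1252")
    converted = dst.read_bytes()

print(converted.decode("utf-8"))
```

If the source encoding is wrong, the read step may raise UnicodeDecodeError or silently produce wrong characters, so the output should be inspected before loading it.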
