Dataset "ruwiki_good" does not want to be downloaded #89

Alvant opened this issue Nov 25, 2022 · 0 comments

Alvant commented Nov 25, 2022

The "ruwiki_good" dataset currently cannot be loaded: load_dataset('ruwiki_good') fails, and this should be fixed. At the very least, the call should download the dataset and report where the .txt file was saved, so that the file can be processed manually.

If you try this:

>>> d = load_dataset('ruwiki_good')

you get something like this:

Checking if dataset "ruwiki_good" was already downloaded before
Dataset "ruwiki_good" not found on the machine
Downloading the "ruwiki_good" dataset...
100%|█████████████████████████████████████████| 51.2M/51.2M [00:46<00:00, 1.10MiB/s]
Dataset downloaded! Save path is: "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/topicnet/dataset_manager/ruwiki_good.txt"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/topicnet/dataset_manager/api.py", line 132, in load_dataset
    raise exception
  File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/topicnet/dataset_manager/api.py", line 126, in load_dataset
    return Dataset(save_path, **kwargs)
  File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/topicnet/cooking_machine/dataset.py", line 220, in __init__
    self._data = self._read_data(data_path)
  File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/topicnet/cooking_machine/dataset.py", line 355, in _read_data
    data = data_handle.read_csv(
  File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/pandas/util/_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 935, in read_csv
    kwds_defaults = _refine_defaults_read(
  File "/home/alvant/lib/miniconda3/envs/test2/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 2063, in _refine_defaults_read
    raise ValueError(
ValueError: Specified \n as separator or delimiter. This forces the python engine which does not accept a line terminator. Hence it is not allowed to use the line terminator as separator.

OS is:

Linux mx 4.19.0-16-amd64 #1 SMP Debian 4.19.181-1 (2021-03-19) x86_64 GNU/Linux

Expected Result

The dataset is 1) downloaded and 2) ready to use for topic modeling.

Current "Workaround"

If you replace sep='\n' with sep='###' in this code (the read_csv call in _read_data, topicnet/cooking_machine/dataset.py):

data = data_handle.read_csv(
    data_path,
    engine='python',
    error_bad_lines=False,
    sep='\n',
    header=None,
    names=[VW_TEXT_COL]
)

then everything seems to work fine, presumably because '###' does not occur in the text, so each line is still read as a single field.
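To illustrate the workaround outside of TopicNet, here is a minimal sketch of reading a one-document-per-line text into a single-column DataFrame with a separator that is assumed never to appear in the data. The helper name read_lines_as_frame, the sample text, and the value of VW_TEXT_COL are made up for this example and are not taken from the library. The error_bad_lines argument is left out because newer pandas releases removed it in favor of on_bad_lines.

```python
import io

import pandas as pd

# Assumed column name, for illustration only; TopicNet defines its own VW_TEXT_COL.
VW_TEXT_COL = 'vw_text'


def read_lines_as_frame(source, sep='###'):
    """Read text with one document per line into a single-column DataFrame.

    `sep` must be a string that never occurs in the data, so that every
    line is parsed as exactly one field. A multi-character separator is
    treated as a regular expression and requires the python engine.
    """
    return pd.read_csv(
        source,
        engine='python',
        sep=sep,
        header=None,
        names=[VW_TEXT_COL],
    )


# Hypothetical sample in a Vowpal Wabbit-like layout, one document per line.
sample = io.StringIO(
    "doc1 |@text alpha beta\n"
    "doc2 |@text gamma, delta\n"
)
frame = read_lines_as_frame(sample)
print(frame.shape)
```

If the chosen separator does occur somewhere in the file, the affected lines would be split into several fields, so this approach only holds under the assumption stated above.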
