
UnicodeEncodeError: 'charmap' codec can't encode characters in position 31758-31761: character maps to <undefined> - when fetching the DBPedia_IT dataset #57

Closed
ERijck opened this issue Mar 16, 2022 · 2 comments

Comments


ERijck commented Mar 16, 2022

  • OCTIS version: 1.10.3
  • Python version: 3.9
  • Operating System: Windows

Description

I am trying to fetch the DBPedia_IT dataset. I expected the fetch to complete without errors, but a UnicodeEncodeError was raised.

What I Did

from octis.dataset.dataset import Dataset
dataset = Dataset()
dataset.fetch_dataset('DBPedia_IT')

Traceback (most recent call last):

  Input In [42] in <module>
    dataset.fetch_dataset('DBPedia_IT')

  File ~\Anaconda3\envs\SentenceTransformers\lib\site-packages\octis\dataset\dataset.py:392 in fetch_dataset
    cache = download_dataset(dataset_name, target_dir=dataset_home, cache_path=cache_path)

  File ~\Anaconda3\envs\SentenceTransformers\lib\site-packages\octis\dataset\downloader.py:77 in download_dataset
    f.write(corpus.text)

  File ~\Anaconda3\envs\SentenceTransformers\lib\encodings\cp1252.py:19 in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]

UnicodeEncodeError: 'charmap' codec can't encode characters in position 31758-31761: character maps to <undefined>

Ravi2712 commented Apr 21, 2022

@ERijck I faced a similar error while using the preprocessor.preprocess_dataset() function that comes with OCTIS. I found out that my dataset contains some Unicode characters and emojis.
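
The root cause is that open() on Windows falls back to the locale code page (cp1252 in the traceback above) when no encoding is given, and cp1252 cannot represent emojis or most non-Latin characters. A minimal illustration (the example string is an assumption, not taken from the actual DBPedia_IT data):

# cp1252 handles accented Latin letters, but not emojis or most other Unicode
# characters, so encoding them fails with the same 'charmap' error as above.
text = "Città del Vaticano ☕"
text.encode("cp1252")  # raises UnicodeEncodeError: character maps to <undefined>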

The error you are facing comes from line 77 in downloader.py, where OCTIS is trying to create the dataset files. Link here:

if corpus and metadata and vocabulary:

A quick fix would be to read/write the corpus with an encoding that allows Unicode characters, which looks like the following:

with open(corpus_path, 'w', encoding='utf8') as f:
    f.write(corpus.text)
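
Since the downloader writes the metadata and vocabulary files as well (see the check quoted above), the same explicit encoding would presumably be needed for every text file it creates. A sketch only, assuming variable names analogous to the corpus write rather than the actual OCTIS code:

# Hypothetical: the same fix applied to the other text files the downloader
# writes; the paths and objects below are assumptions, not OCTIS's real names.
with open(vocabulary_path, 'w', encoding='utf8') as f:
    f.write(vocabulary.text)
with open(metadata_path, 'w', encoding='utf8') as f:
    f.write(metadata.text)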

Two quick modifications:

  • You can fork the repository and change these lines until OCTIS provides Unicode support for datasets.
  • You can edit the files installed in your environment (not recommended).

NOTE:
If you end up making this modification yourself before OCTIS does, you also need to change the corresponding functions that read the "Dataset" files before you start training a model; a read-side sketch follows below.
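
A minimal sketch of that read-side change, assuming a hypothetical loader function (the actual reading code in OCTIS may be organized differently):

# Hypothetical read-side counterpart: open the saved dataset files with the
# same explicit encoding so the Unicode content round-trips correctly.
def load_corpus(corpus_path):
    with open(corpus_path, 'r', encoding='utf8') as f:
        return f.read()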

silviatti (Collaborator) commented

Hi,
thanks @ERijck for reporting and @Ravi2712 for your suggestion. It is indeed an encoding problem. I'll fix this in the next release.

Silvia

silviatti added a commit that referenced this issue May 14, 2022