
UnicodeEncodeError: 'charmap' codec can't encode characters in position 31758-31761: character maps to <undefined> - when fetching the DBPedia_IT dataset #57

Closed
ERijck opened this issue Mar 16, 2022 · 2 comments

Comments


ERijck commented Mar 16, 2022

  • OCTIS version: 1.10.3
  • Python version: 3.9
  • Operating System: Windows

Description

I am trying to fetch the DBPedia_IT dataset. I expected the fetch to complete without errors, but a UnicodeEncodeError was raised.

What I Did

from octis.dataset.dataset import Dataset
dataset = Dataset()
dataset.fetch_dataset('DBPedia_IT')

Traceback (most recent call last):

  Input In [42] in <module>
    dataset.fetch_dataset('DBPedia_IT')

  File ~\Anaconda3\envs\SentenceTransformers\lib\site-packages\octis\dataset\dataset.py:392 in fetch_dataset
    cache = download_dataset(dataset_name, target_dir=dataset_home, cache_path=cache_path)

  File ~\Anaconda3\envs\SentenceTransformers\lib\site-packages\octis\dataset\downloader.py:77 in download_dataset
    f.write(corpus.text)

  File ~\Anaconda3\envs\SentenceTransformers\lib\encodings\cp1252.py:19 in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]

UnicodeEncodeError: 'charmap' codec can't encode characters in position 31758-31761: character maps to <undefined>

Ravi2712 commented Apr 21, 2022

@ERijck I faced a similar error while using the preprocessor.preprocess_dataset() function that comes with OCTIS. I found out that my dataset contains some Unicode characters and emojis.
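
The root cause is that open() on Windows falls back to the locale code page (cp1252 in the traceback above) when no encoding is given, and cp1252 cannot represent emojis or most non-Latin characters. A minimal illustration (the example string is an assumption, not taken from the actual DBPedia_IT data):

# cp1252 handles accented Latin letters, but not emojis or most other Unicode
# characters, so encoding them fails with the same 'charmap' error as above.
text = "Città del Vaticano ☕"
text.encode("cp1252")  # raises UnicodeEncodeError: character maps to <undefined>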

The error you are facing comes from line 77 in downloader.py, where OCTIS is trying to create the dataset files. Link here:

if corpus and metadata and vocabulary:

A quick fix would be to read/write the corpus with an encoding that allows Unicode characters, which looks like the following:

with open(corpus_path, 'w', encoding='utf8') as f:
    f.write(corpus.text)
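
Since the downloader writes the metadata and vocabulary files as well (see the check quoted above), the same explicit encoding would presumably be needed for every text file it creates. A sketch only, assuming variable names analogous to the corpus write rather than the actual OCTIS code:

# Hypothetical: the same fix applied to the other text files the downloader
# writes; the paths and objects below are assumptions, not OCTIS's real names.
with open(vocabulary_path, 'w', encoding='utf8') as f:
    f.write(vocabulary.text)
with open(metadata_path, 'w', encoding='utf8') as f:
    f.write(metadata.text)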

Two quick modifications:

  • You can fork the repository and change these lines until OCTIS provides Unicode support for datasets.
  • You can edit the files installed in your environment (not recommended).

NOTE:
If you end up making this modification yourself before OCTIS does, you also need to change the corresponding functions that read the "Dataset" files before you start training a model; a read-side sketch follows below.
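
A minimal sketch of that read-side change, assuming a hypothetical loader function (the actual reading code in OCTIS may be organized differently):

# Hypothetical read-side counterpart: open the saved dataset files with the
# same explicit encoding so the Unicode content round-trips correctly.
def load_corpus(corpus_path):
    with open(corpus_path, 'r', encoding='utf8') as f:
        return f.read()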

silviatti (Collaborator) commented

Hi,
thanks @ERijck for reporting and @Ravi2712 for your suggestion. It is indeed an encoding problem. I'll fix this in the next release.

Silvia

silviatti added a commit that referenced this issue May 14, 2022