Transcript string 'null' converted to [None] by load_dataset() #4467

Closed
mbarnig opened this issue Jun 9, 2022 · 3 comments

mbarnig commented Jun 9, 2022

Issue

I am training a Luxembourgish speech-recognition model in Colab with a custom dataset that includes a dictionary of Luxembourgish words, for example the spoken numbers 0 to 9. When preparing the dataset with the call

ds_train1 = mydataset.map(prepare_dataset)

the following error was issued:

ValueError                                Traceback (most recent call last)
<ipython-input-69-1e8f2b37f5bc> in <module>()
----> 1 ds_train = mydataset_train.map(prepare_dataset)

11 frames
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in __call__(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   2450         if not _is_valid_text_input(text):
   2451             raise ValueError(
-> 2452                 "text input must of type str (single example), List[str] (batch or single pretokenized example) "
   2453                 "or List[List[str]] (batch of pretokenized examples)."
   2454             )

ValueError: text input must of type str (single example), List[str] (batch or single pretokenized example) or List[List[str]] (batch of pretokenized examples).

Debugging this problem was not easy, because all transcriptions in the source file are valid strings. I finally discovered that the transcription string 'null' is loaded as None by load_dataset(), so the column prints as [None]. After deleting this row from the dataset, training worked fine.
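One way to locate such rows before tokenization is to scan the transcript column for values that are not strings. The following is only a sketch: `find_non_string_rows` is a hypothetical helper, not part of `datasets`, and the rows are plain dicts standing in for dataset examples (the first file name is made up for illustration).

```python
from typing import Any, Dict, Iterable, List


def find_non_string_rows(rows: Iterable[Dict[str, Any]], column: str) -> List[int]:
    """Return the indices of rows whose value in `column` is not a str."""
    return [i for i, row in enumerate(rows) if not isinstance(row.get(column), str)]


# Stand-in rows (hypothetical data). The second row mimics the transcript
# 'null' after it has been parsed as a missing value.
rows = [
    {"wav_filename": "example.wav", "transcript": "eent"},
    {"wav_filename": "wavs/female/NULL1.wav", "transcript": None},
]

bad_rows = find_non_string_rows(rows, "transcript")
print(bad_rows)
```

Running a check like this before calling map() points at the offending row directly, which is easier to act on than the tokenizer's ValueError.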

Expected result:

transcription 'null' interpreted as 'str' instead of 'None'.

Reproduction

Here is the code to reproduce the error with a one-row dataset.

import csv

# Print every row of the CSV file to confirm its raw contents.
with open("null-test.csv", newline="") as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)

['wav_filename', 'wav_filesize', 'transcript']
['wavs/female/NULL1.wav', '17530', 'null']
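The output above already hints at the cause: Python's csv module does no type inference, so the cell is still the string 'null' at this stage. A minimal, self-contained sketch of the same check, using an in-memory copy of the one-row file instead of null-test.csv:

```python
import csv
import io

# In-memory copy of the one-row CSV shown in the report.
csv_text = (
    "wav_filename,wav_filesize,transcript\n"
    "wavs/female/NULL1.wav,17530,null\n"
)

rows = list(csv.DictReader(io.StringIO(csv_text)))
transcript = rows[0]["transcript"]

# The csv module returns every cell as a plain string.
print(repr(transcript), type(transcript).__name__)
```

So the conversion to a missing value must happen later, in whatever parses the file for load_dataset().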

dataset = load_dataset('csv', data_files={'train': 'null-test.csv'}) 

Using custom data configuration default-81ac0c0e27af3514
Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-81ac0c0e27af3514/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519...
Downloading data files: 100% 1/1 [00:00<00:00, 29.55it/s]
Extracting data files: 100% 1/1 [00:00<00:00, 23.66it/s]
Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-81ac0c0e27af3514/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519. Subsequent calls will reuse this data.
100% 1/1 [00:00<00:00, 25.84it/s]

print(dataset['train']['transcript'])

[None]

Environment info

!pip install datasets==2.2.2
!pip install transformers==4.19.2
mbarnig added the bug label Jun 9, 2022
albertvillanova self-assigned this Jun 9, 2022
@albertvillanova
Member

Hi @mbarnig, thanks for reporting.

Please note that this is expected behavior in pandas (we use the pandas library to parse CSV files): https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

By default the following values are interpreted as NaN: 
‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘<NA>’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’.

(see "null" in the last position in the above list).
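To see in advance which cells of a CSV file would be affected, the file can be scanned with the standard library for exact matches against that list. This is a rough sketch: `find_na_like_cells` is a hypothetical helper, and the set below is copied from the list quoted above, which may differ between pandas versions.

```python
import csv
import io

# NA strings copied from the list quoted above (may vary by pandas version).
DEFAULT_NA_STRINGS = {
    "", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN", "-NaN", "-nan",
    "1.#IND", "1.#QNAN", "<NA>", "N/A", "NA", "NULL", "NaN", "n/a", "nan", "null",
}


def find_na_like_cells(csv_text):
    """Return (row_number, column, value) for each cell that matches an NA string."""
    hits = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        for column, value in row.items():
            # Skip non-string values that DictReader emits for ragged rows.
            if isinstance(value, str) and value in DEFAULT_NA_STRINGS:
                hits.append((row_number, column, value))
    return hits


csv_text = (
    "wav_filename,wav_filesize,transcript\n"
    "wavs/female/NULL1.wav,17530,null\n"
)
hits = find_na_like_cells(csv_text)
print(hits)
```

Row numbers here count data rows, starting at 1 after the header.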

In order to prevent pandas from performing that automatic conversion from the string "null" to a NaN value, you should pass the pandas parameter keep_default_na=False:

In [2]: dataset = load_dataset('csv', data_files={'train': 'null-test.csv'}, keep_default_na=False)
In [3]: dataset["train"][0]["transcript"]
Out[3]: 'null'
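The effect of the parameter can also be checked with pandas alone, without going through load_dataset(). A minimal sketch, assuming pandas is installed and using an in-memory copy of the one-row file:

```python
import io

import pandas as pd

# In-memory copy of the one-row CSV from the report.
csv_text = (
    "wav_filename,wav_filesize,transcript\n"
    "wavs/female/NULL1.wav,17530,null\n"
)

# Default parsing: the string "null" is treated as a missing value.
df_default = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False turns off the built-in list of NA strings.
df_kept = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(df_default["transcript"].isna().tolist())
print(df_kept["transcript"].tolist())
```

With the default settings the transcript is missing; with keep_default_na=False it stays the literal string, matching the load_dataset() result shown above.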

mbarnig commented Jun 9, 2022

Thanks for the quick answer.

@AbrahamSanders

@albertvillanova I also ran into this issue, it had me scratching my head for a while! In my case it was tripped by a literal "NA" comment collected from a user-facing form (e.g., this question does not apply to me). Thankfully this answer was here, but I feel it is such a common trap that it deserves to be noted in the official docs, maybe here?

I'm happy to submit a PR if you agree!
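For the literal "NA" case, a possible refinement in plain pandas is to disable the default list but still treat empty cells as missing by passing na_values explicitly. This is a sketch with hypothetical column names and data; the thread only confirms that load_dataset() accepts keep_default_na, so passing na_values through it is an assumption to verify.

```python
import io

import pandas as pd

# Hypothetical form export: one literal "NA" answer and one empty answer.
csv_text = (
    "respondent_id,comment\n"
    "1,NA\n"
    "2,\n"
)

# Only the empty string counts as missing; "NA" is kept as text.
df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values=[""])

first_comment = df["comment"].iloc[0]
second_comment = df["comment"].iloc[1]
print(repr(first_comment), pd.isna(second_comment))
```

This keeps user-entered "NA" answers intact while empty cells are still recognized as missing.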
