Transcript string 'null' converted to [None] by load_dataset() #4467

mbarnig · 2022-06-09T14:26:00Z

Issue

I am training a Luxembourgish speech-recognition model in Colab with a custom dataset that includes a dictionary of Luxembourgish words, for example the spoken numbers 0 to 9. When preparing the dataset with the script

ds_train = mydataset_train.map(prepare_dataset)

the following error was issued:

ValueError                                Traceback (most recent call last)
<ipython-input-69-1e8f2b37f5bc> in <module>()
----> 1 ds_train = mydataset_train.map(prepare_dataset)

11 frames
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in __call__(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   2450         if not _is_valid_text_input(text):
   2451             raise ValueError(
-> 2452                 "text input must of type str (single example), List[str] (batch or single pretokenized example) "
   2453                 "or List[List[str]] (batch of pretokenized examples)."
   2454             )

ValueError: text input must of type str (single example), List[str] (batch or single pretokenized example) or List[List[str]] (batch of pretokenized examples).

Debugging this problem was not easy, because all transcriptions in the dataset are valid strings. I finally discovered that the transcription string 'null' is interpreted as [None] by load_dataset(). After deleting this row from the dataset, the training worked fine.
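
For anyone hitting the same ValueError, a quick way to locate the offending rows is to scan the text column for values that are not strings before calling map(). The following is only a minimal sketch: the helper name find_non_string_rows and the sample rows (including the second file name) are hypothetical and merely mimic what the loader returned here.

```python
def find_non_string_rows(rows, column):
    """Return (index, value) pairs whose `column` value is not a str."""
    return [
        (i, row[column])
        for i, row in enumerate(rows)
        if not isinstance(row[column], str)
    ]


# Hypothetical rows: the first mimics the row from this issue after loading,
# the second is an invented, well-formed row for contrast.
rows = [
    {"wav_filename": "wavs/female/NULL1.wav", "transcript": None},
    {"wav_filename": "wavs/female/EENT1.wav", "transcript": "eent"},
]

bad_rows = find_non_string_rows(rows, "transcript")
print(bad_rows)  # [(0, None)]
```

The helper accepts any iterable of dict-like rows, so it should also work when iterating directly over a loaded split, assuming each row behaves like a dict.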

Expected result:

The transcription 'null' should be loaded as a 'str', not as 'None'.

Reproduction

Here is the code to reproduce the error with a one-row-dataset.

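import csv
from datasets import load_dataset

# Print the raw CSV rows to show that the file contains the literal string 'null'.
with open("null-test.csv") as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)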

['wav_filename', 'wav_filesize', 'transcript']
['wavs/female/NULL1.wav', '17530', 'null']

dataset = load_dataset('csv', data_files={'train': 'null-test.csv'})

Using custom data configuration default-81ac0c0e27af3514
Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-81ac0c0e27af3514/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519...
Downloading data files: 100%
1/1 [00:00<00:00, 29.55it/s]
Extracting data files: 100%
1/1 [00:00<00:00, 23.66it/s]
Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-81ac0c0e27af3514/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519. Subsequent calls will reuse this data.
100%
1/1 [00:00<00:00, 25.84it/s]

print(dataset['train']['transcript'])

[None]

Environment info

!pip install datasets==2.2.2
!pip install transformers==4.19.2


albertvillanova · 2022-06-09T16:29:02Z

Hi @mbarnig, thanks for reporting.

Please note that this is expected behavior in pandas (we use the pandas library to parse CSV files): https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

By default the following values are interpreted as NaN: 
‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘<NA>’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’.

(see "null" in the last position in the above list).

In order to prevent pandas from performing that automatic conversion from the string "null" to a NaN value, you should pass the pandas parameter keep_default_na=False:

In [2]: dataset = load_dataset('csv', data_files={'train': 'null-test.csv'}, keep_default_na=False)
In [3]: dataset["train"][0]["transcript"]
Out[3]: 'null'
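
The same behavior can be reproduced with pandas alone, which may help confirm that the conversion happens in the CSV parser rather than in the datasets library itself. This is a small sketch, not the library's internal code path: it feeds the one-row CSV from the report through pandas.read_csv from an in-memory buffer (CSV_TEXT, default_df and kept_df are names chosen for the illustration).

```python
import io

import pandas as pd

# The one-row CSV from the report, held in memory instead of null-test.csv.
CSV_TEXT = (
    "wav_filename,wav_filesize,transcript\n"
    "wavs/female/NULL1.wav,17530,null\n"
)

# Default parsing: 'null' is in pandas' default NA list and becomes NaN.
default_df = pd.read_csv(io.StringIO(CSV_TEXT))

# With keep_default_na=False the default NA list is ignored,
# so 'null' stays a plain string.
kept_df = pd.read_csv(io.StringIO(CSV_TEXT), keep_default_na=False)

print(pd.isna(default_df["transcript"][0]))  # True
print(repr(kept_df["transcript"][0]))        # 'null'
```

Note that keep_default_na=False switches off the whole default list, so empty fields in text columns are then read as empty strings rather than as missing values.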

mbarnig · 2022-06-09T17:55:37Z

Thanks for the quick answer.

AbrahamSanders · 2023-07-04T02:18:39Z

@albertvillanova I also ran into this issue, it had me scratching my head for a while! In my case it was tripped by a literal "NA" comment collected from a user-facing form (e.g., this question does not apply to me). Thankfully this answer was here, but I feel it is such a common trap that it deserves to be noted in the official docs, maybe here?

I'm happy to submit a PR if you agree!
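
For the "NA" form-answer case, one option in plain pandas is to disable the default NA list and then name explicitly which markers should still count as missing. The sketch below uses invented sample data and shows only pandas.read_csv; this thread confirms that keep_default_na is forwarded by load_dataset(), but whether na_values is forwarded the same way is an assumption to check against the datasets documentation.

```python
import io

import pandas as pd

# Hypothetical form export: "NA" is a real answer, the empty field is missing.
FORM_CSV = (
    "respondent,comment\n"
    "1,NA\n"
    "2,\n"
    "3,great service\n"
)

# Ignore pandas' default NA markers and treat only the empty string as missing.
form_df = pd.read_csv(
    io.StringIO(FORM_CSV),
    keep_default_na=False,
    na_values=[""],
)

print(form_df["comment"].tolist())
```

With these settings the literal "NA" survives as a string, while the empty field is still reported as missing.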

mbarnig added the bug Something isn't working label Jun 9, 2022

albertvillanova self-assigned this Jun 9, 2022

albertvillanova closed this as completed Jun 9, 2022
