You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I am training a luxembourgish speech-recognition model in Colab with a custom dataset, including a dictionary of luxembourgish words, for example the speaken numbers 0 to 9. When preparing the dataset with the script
ds_train1 = mydataset.map(prepare_dataset)
the following error was issued:
ValueError Traceback (most recent call last)
<ipython-input-69-1e8f2b37f5bc> in <module>()
----> 1 ds_train = mydataset_train.map(prepare_dataset)
11 frames
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in __call__(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
2450 if not _is_valid_text_input(text):
2451 raise ValueError(
-> 2452 "text input must of type str (single example), List[str] (batch or single pretokenized example) "
2453 "or List[List[str]] (batch of pretokenized examples)."
2454 )
ValueError: text input must of type str (single example), List[str] (batch or single pretokenized example) or List[List[str]] (batch of pretokenized examples).
Debugging this problem was not easy, all transcriptions in the dataset are correct strings. Finally I discovered that the transcription string 'null' is interpreted as [None] by the load_dataset() script. By deleting this row in the dataset the training worked fine.
Expected result:
transcription 'null' interpreted as 'str' instead of 'None'.
Reproduction
Here is the code to reproduce the error with a one-row-dataset.
with open("null-test.csv") as f:
reader = csv.reader(f)
for row in reader:
print(row)
Using custom data configuration default-81ac0c0e27af3514
Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-81ac0c0e27af3514/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519...
Downloading data files: 100%
1/1 [00:00<00:00, 29.55it/s]
Extracting data files: 100%
1/1 [00:00<00:00, 23.66it/s]
Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-81ac0c0e27af3514/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519. Subsequent calls will reuse this data.
100%
1/1 [00:00<00:00, 25.84it/s]
By default the following values are interpreted as NaN:
‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘<NA>’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’.
(see "null" in the last position in the above list).
In order to prevent pandas from performing that automatic conversion from the string "null" to a NaN value, you should pass the pandas parameter keep_default_na=False:
In [2]: dataset=load_dataset('csv', data_files={'train': 'null-test.csv'}, keep_default_na=False)
In [3]: dataset["train"][0]["transcript"]
Out[3]: 'null'
@albertvillanova I also ran into this issue, it had me scratching my head for a while! In my case it was tripped by a literal "NA" comment collected from a user-facing form (e.g., this question does not apply to me). Thankfully this answer was here, but I feel it is such a common trap that it deserves to be noted in the official docs, maybe here?
Issue
I am training a luxembourgish speech-recognition model in Colab with a custom dataset, including a dictionary of luxembourgish words, for example the speaken numbers 0 to 9. When preparing the dataset with the script
ds_train1 = mydataset.map(prepare_dataset)
the following error was issued:
Debugging this problem was not easy, all transcriptions in the dataset are correct strings. Finally I discovered that the transcription string 'null' is interpreted as [None] by the
load_dataset()
script. By deleting this row in the dataset the training worked fine.Expected result:
transcription 'null' interpreted as 'str' instead of 'None'.
Reproduction
Here is the code to reproduce the error with a one-row-dataset.
['wav_filename', 'wav_filesize', 'transcript']
['wavs/female/NULL1.wav', '17530', 'null']
Using custom data configuration default-81ac0c0e27af3514
Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-81ac0c0e27af3514/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519...
Downloading data files: 100%
1/1 [00:00<00:00, 29.55it/s]
Extracting data files: 100%
1/1 [00:00<00:00, 23.66it/s]
Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-81ac0c0e27af3514/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519. Subsequent calls will reuse this data.
100%
1/1 [00:00<00:00, 25.84it/s]
[None]
Environment info
The text was updated successfully, but these errors were encountered: