
Error when using tokenizer #993

Closed
joaocp98662 opened this issue Apr 29, 2022 · 4 comments
joaocp98662 commented Apr 29, 2022

Hello,

I have a dataset of documents stored in a JSONL file; each document is an entry with an id and a contents field. The contents field contains all the information about the document, distributed across 5 fields separated by '\n'. For instance:

{"id": "NCT03538132", "contents": "Patients' Perception on Bone Grafts\nPatients' Perception on Bone Biomaterials Used in Dentistry : a Multicentric Study\nThe goal of this study is to collect the patients' opinion about the different types of bone graft, to assess which are the most rejected by the patients and if the demographic variables (such as the gender or the age) and the level of education influence their decision.\nNowadays, many procedures may need regenerative techniques. Some studies have already assessed the patients' opinion regarding soft tissue grafts, some investigators have centered their studies on the techniques' efficiency without assessing the patient's perception.\nInclusion criteria: - Adult (18 years old or more) - Able to read and write - Not under the influence of alcohol or drugs - Had not previously undergone any surgery involving bone graft or bone augmentation. Exclusion criteria: - Any patient who doesn't fullfill the inclusion criterias."}

For each document I would like to know how many tokens are generated for each field and plot a distribution of the number of tokens with respect to the field. I can only get the tokens for 66% of my dataset; after that I get this error:

line 145, in
inputs = tokenizer(
File "/usr/local/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2413, in call
return self.batch_encode_plus(
File "/usr/local/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2598, in batch_encode_plus
return self._batch_encode_plus(
File "/usr/local/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 439, in _batch_encode_plus
for key in tokens_and_encodings[0][0].keys():
IndexError: list index out of range

I saved the documents that break the tokenizer in a log file and tried to run again with just those few documents, and it still breaks on some of them. I also tried changing the order of the documents, moving the breaking ones to the beginning: it then worked for those, but broke on other documents. Can anyone help me?

This is my code:

import json

# JsonlCollectionIterator is assumed here to be Pyserini's JSONL collection iterator
from pyserini.encode import JsonlCollectionIterator

collection_iterator = JsonlCollectionIterator(
    f'{ct_utils.paths["input_dir"]}/documents.jsonl',
    fields=["brief_title", "official_title", "brief_summary", "detailed_description", "criteria"]
)

fields_tokens = {
                "brief_title": [], 
                "official_title": [],
                "brief_summary": [],
                "detailed_description": [],
                "criteria": []
            }

log_file = open('documents_log.txt', 'w')

# batch size 1 over a single shard, so each batch_info holds one document
for index, batch_info in enumerate(collection_iterator(1, 0, 1)):

    for field in collection_iterator.fields:

        # try:
        inputs = tokenizer(
            batch_info[field],
            padding='longest',
            truncation=False,
            add_special_tokens=True,
            return_tensors='pt'
        )
        # except:
        #     log_file.write(batch_info["id"][0] + "\n")

        fields_tokens[field].append({
            'index': index + 1,
            'document_id': batch_info['id'][0],
            'tokens_count': inputs["input_ids"].shape[1]
        })

#log_file.close()

with open("fields_tokens.json", "w") as file:
    json.dump(fields_tokens, file, indent=4)

I didn't pass max_length and set truncation to False because I want the tokenizer to generate the tokens for the complete text.
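
For completeness, this is roughly how I then plan to plot the distribution from fields_tokens.json (a sketch assuming matplotlib; the file layout matches the dictionary dumped above):

import json
import matplotlib.pyplot as plt

with open("fields_tokens.json") as f:
    fields_tokens = json.load(f)

# one histogram of token counts per field
fig, axes = plt.subplots(1, len(fields_tokens), figsize=(20, 4), sharey=True)
for ax, (field, entries) in zip(axes, fields_tokens.items()):
    counts = [entry["tokens_count"] for entry in entries]
    ax.hist(counts, bins=50)
    ax.set_title(field)
    ax.set_xlabel("tokens per document")
axes[0].set_ylabel("number of documents")
plt.tight_layout()
plt.show()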

Narsil (Collaborator) commented May 2, 2022

Hi @joaocp98662 ,

Looking at the error, it seems to be coming from your data; you might have an empty field somewhere.
You need the try...except, as your commented-out code suggests, and if you could provide the content of batch_info[field], that would help us solve the issue: we could then have a very simple reproduction.

The code should definitely not crash, but it is still probably just a data issue.
If you could provide the exact tokenizer you're using, that would help too.

@SaulLu for reference.

Cheers,

joaocp98662 (Author) commented

Hi @Narsil

Thank you very much for your reply. Yes, you are right, it was a data issue. I was looping through a fixed list of five fields, and many of the documents didn't have all five of them. It is a big dataset and I was convinced that these five fields existed in all the documents. When I passed batch_info[field] (containing the text of that field) to the tokenizer, it was passing nothing for the fields that didn't exist in the given document. I changed my approach, already tested it, and the tokenizer worked just fine.
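
For reference, a guard along these lines before the tokenizer call avoids the error (a minimal sketch, not my exact code, assuming batch_info[field] is an empty list when the document is missing that field, as in the log shown below):

for index, batch_info in enumerate(collection_iterator(1, 0, 1)):
    for field in collection_iterator.fields:
        texts = batch_info[field]
        if not texts or not texts[0]:
            # field missing or empty for this document: log it and skip
            log_file.write(f'id: {batch_info["id"][0]}, field: {field}\n')
            continue
        inputs = tokenizer(
            texts,
            padding='longest',
            truncation=False,
            add_special_tokens=True,
            return_tensors='pt'
        )
        fields_tokens[field].append({
            'index': index + 1,
            'document_id': batch_info['id'][0],
            'tokens_count': inputs["input_ids"].shape[1]
        })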

I think you can close the issue. If you need further information, please let me know.

Cheers,

Narsil (Collaborator) commented May 3, 2022

OK, I will close, but it still seems odd that the error occurs in tokenizer code. If you could provide more info we could fix it (the error should never be raised from our code, but from yours where possible, so that it's easier for you to debug).

Narsil closed this as completed May 3, 2022
joaocp98662 (Author) commented May 3, 2022

except:
    log_file.write(f'id: {batch_info["id"][0]}, field: {field}, info: {batch_info[field]}')

All the documents caught in the except don't have the specified field, so in the log file batch_info[field] is []. Passing that to the tokenizer returns the error reported in the first post.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/msmarco-bert-base-dot-v5")

# passing an empty list reproduces the error
inputs = tokenizer(
    [],
    padding='longest',
    truncation=False,
    add_special_tokens=True,
    return_tensors='pt'
)

Result:
in
inputs = tokenizer(
File "/usr/local/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2413, in call
return self.batch_encode_plus(
File "/usr/local/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2598, in batch_encode_plus
return self._batch_encode_plus(
File "/usr/local/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 439, in _batch_encode_plus
for key in tokens_and_encodings[0][0].keys():
IndexError: list index out of range
