
Error when using tokenizer #993

Closed
joaocp98662 opened this issue Apr 29, 2022 · 4 comments
joaocp98662 commented Apr 29, 2022

Hello,

I have a dataset of documents stored in a JSONL file; each document is an entry with an id and a contents field. The contents field contains all the information about the document, distributed across 5 fields separated by '\n'. For instance:

{"id": "NCT03538132", "contents": "Patients' Perception on Bone Grafts\nPatients' Perception on Bone Biomaterials Used in Dentistry : a Multicentric Study\nThe goal of this study is to collect the patients' opinion about the different types of bone graft, to assess which are the most rejected by the patients and if the demographic variables (such as the gender or the age) and the level of education influence their decision.\nNowadays, many procedures may need regenerative techniques. Some studies have already assessed the patients' opinion regarding soft tissue grafts, some investigators have centered their studies on the techniques' efficiency without assessing the patient's perception.\nInclusion criteria: - Adult (18 years old or more) - Able to read and write - Not under the influence of alcohol or drugs - Had not previously undergone any surgery involving bone graft or bone augmentation. Exclusion criteria: - Any patient who doesn't fullfill the inclusion criterias."}

For each document I would like to know how many tokens are generated for each field and plot a distribution of the number of tokens with respect to the field. I can only get the tokens for 66% of my dataset; after that I get this error:

line 145, in
inputs = tokenizer(
File "/usr/local/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2413, in call
return self.batch_encode_plus(
File "/usr/local/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2598, in batch_encode_plus
return self._batch_encode_plus(
File "/usr/local/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 439, in _batch_encode_plus
for key in tokens_and_encodings[0][0].keys():
IndexError: list index out of range

I saved the documents that break the tokenizer in a log file and tried to run again with just those few documents, and it still breaks on some of them. I also tried changing the order of the documents, moving the breaking ones to the beginning: it then worked for those, but broke on other documents. Can anyone help me?

This is my code:

import json

# JsonlCollectionIterator is assumed here to be Pyserini's JSONL collection iterator
from pyserini.encode import JsonlCollectionIterator

collection_iterator = JsonlCollectionIterator(
    f'{ct_utils.paths["input_dir"]}/documents.jsonl',
    fields=["brief_title", "official_title", "brief_summary", "detailed_description", "criteria"]
)

fields_tokens = {
                "brief_title": [], 
                "official_title": [],
                "brief_summary": [],
                "detailed_description": [],
                "criteria": []
            }

log_file = open('documents_log.txt', 'w')

# batch size 1 over a single shard, so each batch_info holds one document
for index, batch_info in enumerate(collection_iterator(1, 0, 1)):

    for field in collection_iterator.fields:

        # try:
        inputs = tokenizer(
            batch_info[field],
            padding='longest',
            truncation=False,
            add_special_tokens=True,
            return_tensors='pt'
        )
        # except:
        #     log_file.write(batch_info["id"][0] + "\n")

        fields_tokens[field].append({
            'index': index + 1,
            'document_id': batch_info['id'][0],
            'tokens_count': inputs["input_ids"].shape[1]
        })

#log_file.close()

with open("fields_tokens.json", "w") as file:
    json.dump(fields_tokens, file, indent=4)

I didn't pass max_length and set truncation to False because I want the tokenizer to generate the tokens for the complete text.
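
For completeness, this is roughly how I then plan to plot the distribution from fields_tokens.json (a sketch assuming matplotlib; the file layout matches the dictionary dumped above):

import json
import matplotlib.pyplot as plt

with open("fields_tokens.json") as f:
    fields_tokens = json.load(f)

# one histogram of token counts per field
fig, axes = plt.subplots(1, len(fields_tokens), figsize=(20, 4), sharey=True)
for ax, (field, entries) in zip(axes, fields_tokens.items()):
    counts = [entry["tokens_count"] for entry in entries]
    ax.hist(counts, bins=50)
    ax.set_title(field)
    ax.set_xlabel("tokens per document")
axes[0].set_ylabel("number of documents")
plt.tight_layout()
plt.show()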

Narsil (Collaborator) commented May 2, 2022

Hi @joaocp98662 ,

Looking at the error, it seems to be coming from your data; you might have an empty field somewhere.
You need the try...except, as your commented-out code suggests, and if you could provide the content of batch_info[field], that would help us solve the issue: we could then have a very simple reproduction.

The code should definitely not crash, but it is still probably just a data issue.
If you could provide the exact tokenizer you're using, that would help too.

@SaulLu for reference.

Cheers,

joaocp98662 (Author) commented

Hi @Narsil

Thank you very much for your reply. Yes, you are right, it was a data issue. I was looping through a fixed list of five fields, and many of the documents didn't have all five of them. It is a big dataset and I was convinced that these five fields existed in all the documents. When I passed batch_info[field] (containing the text of that field) to the tokenizer, it was passing nothing for the fields that didn't exist in the given document. I changed my approach, already tested it, and the tokenizer worked just fine.
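
For reference, a guard along these lines before the tokenizer call avoids the error (a minimal sketch, not my exact code, assuming batch_info[field] is an empty list when the document is missing that field, as in the log shown below):

for index, batch_info in enumerate(collection_iterator(1, 0, 1)):
    for field in collection_iterator.fields:
        texts = batch_info[field]
        if not texts or not texts[0]:
            # field missing or empty for this document: log it and skip
            log_file.write(f'id: {batch_info["id"][0]}, field: {field}\n')
            continue
        inputs = tokenizer(
            texts,
            padding='longest',
            truncation=False,
            add_special_tokens=True,
            return_tensors='pt'
        )
        fields_tokens[field].append({
            'index': index + 1,
            'document_id': batch_info['id'][0],
            'tokens_count': inputs["input_ids"].shape[1]
        })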

I think you can close the issue. If you need further information, please let me know.

Cheers,

Narsil (Collaborator) commented May 3, 2022

OK, I will close, but it still seems odd that the error occurs in tokenizer code. If you could provide more info we could fix it (the error should never be raised from our code, but from yours where possible, so that it's easier for you to debug).

Narsil closed this as completed May 3, 2022
joaocp98662 (Author) commented May 3, 2022

except:
    log_file.write(f'id: {batch_info["id"][0]}, field: {field}, info: {batch_info[field]}')

All the documents caught in the except don't have the specified field, so in the log file batch_info[field] is []. Passing that to the tokenizer returns the error reported in the first post.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/msmarco-bert-base-dot-v5")

# passing an empty list reproduces the error
inputs = tokenizer(
    [],
    padding='longest',
    truncation=False,
    add_special_tokens=True,
    return_tensors='pt'
)

Result:
in
inputs = tokenizer(
File "/usr/local/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2413, in call
return self.batch_encode_plus(
File "/usr/local/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2598, in batch_encode_plus
return self._batch_encode_plus(
File "/usr/local/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 439, in _batch_encode_plus
for key in tokens_and_encodings[0][0].keys():
IndexError: list index out of range
