
Mentions positioning with GPT2-tokenizer #10

Open
Herice31-53 opened this issue Mar 30, 2023 · 0 comments

Comments

@Herice31-53

Hello,

This is really great work, and the latency is genuinely better than that of other models.

However, I do have some questions and was hoping you could give me some insight.

Using the GPT-2 tokenizer, the position of each mention usually corresponds to position_of_the_mention + 1 in the AIDA data you are using (except for the start index of a mention that is the first word of the text).
Example from the aida_test_dataset:
[screenshot: example_tokenizer]
You can see that the JAPAN position within the list should be [4, 6] instead of [5, 7], and that the Rugby Union mention should be [0, 5] instead of [0, 6]. So I was simply wondering whether that is expected?
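
For reference, here is roughly how I computed the token spans (a minimal sketch using Hugging Face's GPT2TokenizerFast, not your repo's code; the text is a made-up snippet, not the actual AIDA document):

```python
# Minimal sketch: print each GPT-2 token with its index and character span,
# so the token indices can be compared against the [start, end] mention
# positions in the AIDA data. The example text below is hypothetical.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

text = "RUGBY UNION - JAPAN WIN"  # hypothetical snippet, not the real document
enc = tokenizer(text, return_offsets_mapping=True)

for i, (tok_id, span) in enumerate(zip(enc["input_ids"], enc["offset_mapping"])):
    print(i, repr(tokenizer.decode([tok_id])), span)
```

One thing I noticed while checking: GPT-2's BPE folds a leading space into the following token (the Ġ-prefixed tokens), so token counts for a mention tokenized in isolation can differ from its counts in context. I am not sure whether that is related to the off-by-one.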

Also, if it's not too much trouble, could you explain what happens with entities.json if the training datasets don't have a "candidates" key?

Finally, about mentions.json: there are a ton of '!' characters. Should I replicate that in my own mentions.json?
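
My guess (just a guess, not something I found confirmed in the repo) is that the '!' runs are padding, since token id 0 in the GPT-2 vocabulary decodes to '!'. A quick check:

```python
# In GPT-2's vocabulary, token id 0 is '!', so zero-padded id sequences
# render as runs of '!' when decoded. This is only my assumption about
# why mentions.json looks this way.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
print(tokenizer.convert_ids_to_tokens(0))  # -> '!'
print(tokenizer.decode([0, 0, 0, 0]))      # -> '!!!!'
```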

Thank you for your help and for this repo!
