
Mentions positioning with GPT2-tokenizer #10

Open
Herice31-53 opened this issue Mar 30, 2023 · 0 comments

Comments

@Herice31-53

Hello,

This is really great work, and the latency is genuinely better than that of other models.

However, I do have some questions and was hoping you could give me some insight.

Using the GPT-2 tokenizer, the position of each mention usually corresponds to position_of_the_mention + 1 in the AIDA data you are using (except for the start index of a mention that is the first word of the text).
Example from the aida_test_dataset:
[screenshot: example_tokenizer]
You can see that the JAPAN position within the list should be [4, 6] instead of [5, 7], and that the Rugby Union mention should be [0, 5] instead of [0, 6]. So I was simply wondering whether that is expected?
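
For reference, here is roughly how I computed the token spans (a minimal sketch using Hugging Face's GPT2TokenizerFast, not your repo's code; the text is a made-up snippet, not the actual AIDA document):

```python
# Minimal sketch: print each GPT-2 token with its index and character span,
# so the token indices can be compared against the [start, end] mention
# positions in the AIDA data. The example text below is hypothetical.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

text = "RUGBY UNION - JAPAN WIN"  # hypothetical snippet, not the real document
enc = tokenizer(text, return_offsets_mapping=True)

for i, (tok_id, span) in enumerate(zip(enc["input_ids"], enc["offset_mapping"])):
    print(i, repr(tokenizer.decode([tok_id])), span)
```

One thing I noticed while checking: GPT-2's BPE folds a leading space into the following token (the Ġ-prefixed tokens), so token counts for a mention tokenized in isolation can differ from its counts in context. I am not sure whether that is related to the off-by-one.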

Also, if it's not too much trouble, could you explain what happens with entities.json if the training datasets don't have a "candidates" key?

Finally, about mentions.json: there are a ton of '!' characters. Should I replicate that in my own mentions.json?
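
My guess (just a guess, not something I found confirmed in the repo) is that the '!' runs are padding, since token id 0 in the GPT-2 vocabulary decodes to '!'. A quick check:

```python
# In GPT-2's vocabulary, token id 0 is '!', so zero-padded id sequences
# render as runs of '!' when decoded. This is only my assumption about
# why mentions.json looks this way.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
print(tokenizer.convert_ids_to_tokens(0))  # -> '!'
print(tokenizer.decode([0, 0, 0, 0]))      # -> '!!!!'
```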

Thank you for your help and for this repo!
