Trying to understand the index_answer function #30

Closed
kushalj001 opened this issue May 6, 2020 · 6 comments

Comments

@kushalj001

[image: screenshot of the index_answer function]
About the last condition in this function, where you return (None, None): does this condition actually arise, or is it just there to avoid a crash?
I am trying to implement the same paper, and when I try to get the final labels for my context-question pairs, many answers result in a ValueError. Is this a flaw in the dataset?
Thank you.

@hitvoice
Owner

hitvoice commented May 12, 2020

It's probably due to an inconsistency between the annotated answer span and the spaCy tokenization. This is likely to happen where the corpus has unusual punctuation.
If the annotated answer_start or answer_end falls in the middle of a token produced by spaCy, the lookup raises a ValueError.
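
To make the failure mode concrete, here is a minimal sketch of mapping a character-level answer span to token indices. It is not the repository's code: `tokenize_with_spans` is a hypothetical whitespace tokenizer standing in for spaCy (with spaCy, the offsets would come from `token.idx` and `len(token)`), and `index_answer_sketch` is an illustrative name.

```python
import re


def tokenize_with_spans(text):
    """Hypothetical stand-in tokenizer: split on whitespace and return
    (start, end) character offsets for each token, with end exclusive."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def index_answer_sketch(context, answer_start, answer_end):
    """Map a character span to (start_token, end_token) indices.

    Returns (None, None) when either boundary does not coincide with a
    token boundary, which is the case that raises ValueError below.
    """
    spans = tokenize_with_spans(context)
    starts = [s for s, _ in spans]
    ends = [e for _, e in spans]
    try:
        # list.index raises ValueError if the offset is not a token boundary
        return starts.index(answer_start), ends.index(answer_end)
    except ValueError:
        return None, None


context = "He was born in 1990(approx.) in Paris."

# "born" is a whole token, so both boundaries are found.
start = context.index("born")
aligned = index_answer_sketch(context, start, start + len("born"))

# "1990" ends inside the token "1990(approx.)", so the end lookup fails.
start = context.index("1990")
misaligned = index_answer_sketch(context, start, start + len("1990"))
```

Examples that come back as (None, None) can then be filtered out before training.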

@kushalj001
Author

So you're not considering those examples for training, right?

@hitvoice
Owner

Yes. It's hard to automatically fix the tokenization errors.

@kushalj001
Author

I banged my head for some days trying to debug and fix them. The errors are largely due to the absence of a space character (' ') just before or just after the answer span in the context. I reduced the errors to 10-15 erroneous examples and finally dropped those.
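
A rough way to find such cases is to check the characters on either side of the annotated span. This is only a sketch with a hypothetical helper name (`boundary_issues`); it flags candidates rather than confirmed failures, since a real tokenizer splits off ordinary punctuation without needing a space.

```python
def boundary_issues(context, answer_start, answer_text):
    """Report whether the answer span touches a non-space character
    on its left or right side in the context."""
    answer_end = answer_start + len(answer_text)  # exclusive end offset
    issues = []
    if answer_start > 0 and not context[answer_start - 1].isspace():
        issues.append("no_space_before")
    if answer_end < len(context) and not context[answer_end].isspace():
        issues.append("no_space_after")
    return issues


context = "born in 1990(approx.) in Paris"

# "1990" is followed directly by "(" with no space in between.
after_issue = boundary_issues(context, context.index("1990"), "1990")

# "(approx.)" is preceded directly by "0".
before_issue = boundary_issues(context, context.index("(approx.)"), "(approx.)")

# "born" starts the context and is followed by a space.
clean = boundary_issues(context, context.index("born"), "born")
```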
Also, a follow-up question: removing punctuation from the contexts and questions is not necessary, right? Your script only fixes the spaces before building the vocab. Is lowercasing the text before building the vocab also unnecessary?
In the case of GloVe 840B, keeping the data as it is does not affect the vocab a lot. But in the case of GloVe 6B, lowercasing the data reduces the number of out-of-vocabulary words to a fair extent.

Thank you for your help!

@hitvoice
Owner

> Also, a follow-up question, removal of punctuation is not necessary from the contexts and questions, right?

No, it is not necessary.

> Your script has only fixed the spaces before building a vocab. Even lowercasing the text is not necessary, right before building the vocab?

If you use the lower-cased GloVe, you should lowercase the text before building the vocab. Otherwise, the vocab and the embedding tokens may not match.

> In the case of GloVe 840B, keeping the data as it is does not affect the vocab a lot. But in the case of GloVe 6B lowercasing the data reduces the Out of Vocabulary words to a fair extent.

Yes, you should lowercase the data when using the lowercased GloVe 6B version.
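
The effect of a casing mismatch can be checked by comparing the data vocab against the embedding vocab with and without lowercasing. The sketch below uses a made-up toy vocabulary and a hypothetical helper (`oov_rate`); in practice the embedding vocab would be the set of words read from the GloVe file.

```python
def oov_rate(tokens, embedding_vocab, lowercase=False):
    """Fraction of distinct tokens that have no entry in the embedding vocab."""
    if lowercase:
        tokens = [t.lower() for t in tokens]
    vocab = set(tokens)
    if not vocab:
        return 0.0
    missing = {t for t in vocab if t not in embedding_vocab}
    return len(missing) / len(vocab)


# Toy stand-in for a lowercased embedding vocabulary (illustrative only).
embedding_vocab = {"the", "eiffel", "tower", "is", "in", "paris"}
tokens = ["The", "Eiffel", "Tower", "is", "in", "Paris"]

# Cased tokens miss the lowercased embedding entries.
cased_rate = oov_rate(tokens, embedding_vocab)

# Lowercasing first makes every token match.
lowered_rate = oov_rate(tokens, embedding_vocab, lowercase=True)
```

Whatever casing is chosen has to be applied consistently, both when building the vocab and when converting text to ids.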

@kushalj001
Author

Thanks a lot!
