QA task scores #52
Hi. Could you post a complete training log for QA somewhere and link it here? I need to have a look at it before I can say anything.
For QA there's a train set and a dev set. Evaluation is done on the dev set. The dev set seems to be provided with correct labels (and not just placeholders), so the F1 score printed locally for this set should be the same as the one rendered through the PR. Following is the log when run with default params, after uncommenting the parts of run_squad.py that print the results after evaluation:

```
bash train.sh bert-base-multilingual-cased bert QA_EN_HI
Fine-tuning bert-base-multilingual-cased on QA_EN_HI
```

The score through the PR was given as QA_EN_HI: 19.444444444444443, while the value printed locally is around 66.
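For anyone comparing numbers locally, here is a minimal sketch of the token-overlap F1 used in SQuAD-style evaluation, assuming a predictions.json mapping question IDs to answer strings and the SQuAD-format dev file at the path used in this repo; the evaluation server may apply additional rules, so treat this only as a sanity check.

```python
import json
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation and articles, collapse whitespace (SQuAD-style)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def f1_score(prediction, ground_truth):
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Hypothetical local paths -- adjust to your layout.
with open("Data/Processed_Data/QA_EN_HI/dev-v2.0.json") as f:
    dev = json.load(f)
with open("predictions.json") as f:
    preds = json.load(f)

scores = []
for article in dev["data"]:
    for para in article["paragraphs"]:
        for qa in para["qas"]:
            if qa["id"] not in preds:
                continue
            golds = [a["text"] for a in qa["answers"]] or [""]
            scores.append(max(f1_score(preds[qa["id"]], g) for g in golds))

print("Avg F1 over", len(scores), "questions:", 100 * sum(scores) / max(len(scores), 1))
```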
Could you share the file Data/Processed_Data/QA_EN_HI/dev-v2.0.json? Since all the questions in the original dataset do not have contexts, DrQA is used to retrieve contexts for these questions from Wikipedia. It looks like when you ran DrQA, it generated contexts for more examples, as the predictions.json that you have uploaded has more entries in it?
Please find the dev file attached.
Could you upload it to some file-sharing site, or somewhere like Pastebin?
The file that I have on my end does not have questions with ID 235 onwards in the dev set. Could you make a backup of the dev file, delete the keys with ID 235 onwards, and try running training again? It looks like the higher score you are getting is due to these extra questions being considered as part of the dev set.
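A rough sketch of that trimming step, assuming the file follows the usual SQuAD v2 layout and that the IDs in question are numeric strings on the qas entries; if the IDs actually live on the titles, the filter would need to move up a level.

```python
import json
import shutil

dev_path = "Data/Processed_Data/QA_EN_HI/dev-v2.0.json"
shutil.copy(dev_path, dev_path + ".bak")  # keep a backup, as suggested above

with open(dev_path) as f:
    dev = json.load(f)

kept_articles = []
for article in dev["data"]:
    for para in article["paragraphs"]:
        # Keep only questions whose numeric ID is below 235 (assumes numeric ID strings).
        para["qas"] = [qa for qa in para["qas"] if int(qa["id"]) < 235]
    article["paragraphs"] = [p for p in article["paragraphs"] if p["qas"]]
    if article["paragraphs"]:
        kept_articles.append(article)
dev["data"] = kept_articles

with open(dev_path, "w") as f:
    json.dump(dev, f, ensure_ascii=False)
```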
Here's the dev set I had been using: Data/Processed_Data/QA_EN_HI/dev-v2.0.json. Even after removing the titles with IDs 235 onwards, the inconsistency persists: the local dev set F1 score is ~67 while the PR gives around 24!
I will check and get back. When you updated the dev file, did you delete the cache file in the same directory? If not, please delete that file and try re-running.
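For reference, a small sketch of clearing that cache, assuming the cached_* file naming that Hugging Face's run_squad.py used at the time; adjust the pattern if your cache files are named differently.

```python
import glob
import os

data_dir = "Data/Processed_Data/QA_EN_HI"  # adjust if your layout differs

# run_squad.py typically writes feature caches named like cached_<split>_<model>_<seq_len>
for path in glob.glob(os.path.join(data_dir, "cached_*")):
    print("removing", path)
    os.remove(path)
```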
I also tried after deleting the cached dev file; that didn't help either.
It seems that the QA dataset processing scripts are returning more data points in your case than what is actually expected. I would suggest that you rerun the QA preprocessing alone in a new python:3.6 docker container and check the dataset that you obtain. Please check whether the train set has 259 entries and the test set has 54 entries.
I've been getting 313 entries in the train set and 70 in the dev set.
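One way to check those counts on the processed files, assuming they follow the SQuAD layout and the train/dev file names used below; the paths are an assumption and should be adjusted to the local setup.

```python
import json

def count_questions(path):
    """Count the number of questions in a SQuAD-format JSON file."""
    with open(path) as f:
        data = json.load(f)["data"]
    return sum(len(para["qas"]) for article in data for para in article["paragraphs"])

# Hypothetical file names -- adjust to where the processed files actually live.
for split in ("train", "dev"):
    path = f"Data/Processed_Data/QA_EN_HI/{split}-v2.0.json"
    print(split, count_questions(path))
```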
I seem to have figured out what the issue is. Would you be able to try running the code again with a few changes? When you're running the docker container, use this exact image: …, and apply the following changes to Data/Preprocess_Scripts/preprocess_qa.sh:

```diff
index 1fdaf95..3f57130 100644
--- a/Data/Preprocess_Scripts/preprocess_qa.sh
+++ b/Data/Preprocess_Scripts/preprocess_qa.sh
@@ -15,9 +15,9 @@ python $PREPROCESS_DIR/preprocess_drqa.py --data_dir $ORIGINAL_DATA_DIR
 git clone https://github.com/facebookresearch/DrQA.git
 cd DrQA
 git checkout 96f343c
-pip install -r requirements.txt
+pip install elasticsearch==7.8.0 nltk==3.5 scipy==1.5.0 prettytable==0.7.2 tqdm==4.46.1 regex==2020.6.8 termcolor==1.1.0 scikit-learn==0.23.1 numpy==1.18.5 torch==1.4.0
 python setup.py develop
-pip install spacy
+pip install spacy==2.3.0
 python -m spacy download xx_ent_wiki_sm
 python -c "import nltk;nltk.download(['punkt', 'averaged_perceptron_tagger', 'maxent_ne_chunker', 'words'])"
 ./download.sh
```

If DrQA runs properly, the penultimate line of the …
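As a quick sanity check after installing, something like the following (not part of the official scripts) could confirm that the versions pinned in the diff above actually ended up in the environment:

```python
import pkg_resources  # ships with setuptools; works on Python 3.6

# Versions pinned in the patched preprocess_qa.sh above.
pinned = {
    "elasticsearch": "7.8.0",
    "nltk": "3.5",
    "scipy": "1.5.0",
    "prettytable": "0.7.2",
    "tqdm": "4.46.1",
    "regex": "2020.6.8",
    "termcolor": "1.1.0",
    "scikit-learn": "0.23.1",
    "numpy": "1.18.5",
    "torch": "1.4.0",
    "spacy": "2.3.0",
}

for name, want in pinned.items():
    try:
        have = pkg_resources.get_distribution(name).version
    except pkg_resources.DistributionNotFound:
        have = "missing"
    flag = "ok" if have == want else "MISMATCH"
    print(f"{name}: installed={have} expected={want} [{flag}]")
```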
Yes, that solved it. Thanks a lot!
Sorry about the issues. We rely on DrQA running in a "deterministic" manner, and due to updates to either the Python version or some of the packages, this wasn't happening. I will update the scripts and the README with these additional instructions. Were you able to submit and run the evaluation properly?
Yes, the score on submission is consistent now. Thanks again.
There's no test set for QA, so the scores shown after the git PR should be on the same dev set, I believe. Since the dev set does have labels, we should have been able to rely on the F1 scores printed locally (which look okay, ~72 for lr=5e-6, bs=2, epochs=16, max_seq=512, seed=32). I fail to understand why the scores retrieved via the pull request differ and are extremely poor (~25.3). Please let me know if there's anything I could be missing here, or what could explain this inconsistency.
PS: the model is bert-base-multilingual-cased (bert).