
QA task scores #52

Closed

BarahFazili opened this issue May 4, 2021 · 15 comments

Comments

@BarahFazili

There's no test set for QA, so the scores shown after the git PR would, I believe, be computed on the same dev set. Since the dev set does have labels, we should have been able to rely on the F1 scores printed locally (which look okay, ~72 for lr=5e-6, bs=2, epochs=16, max_seq=512, seed=32). I don't understand why the scores retrieved via the pull request differ and are extremely poor (~25.3). Please let me know if there's anything I could be missing here, or what explains this inconsistency.

PS: the model is bert-base-multilingual-cased (model type bert).
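
For reference, those settings would map onto run_squad.py arguments roughly as follows (a sketch only: the flag names are taken from the argument dump in the log further down, the data/output paths are illustrative, and the actual invocation is handled by train.sh):

python run_squad.py \
    --model_type bert \
    --model_name_or_path bert-base-multilingual-cased \
    --learning_rate 5e-6 \
    --per_gpu_train_batch_size 2 \
    --num_train_epochs 16 \
    --max_seq_length 512 \
    --seed 32 \
    --do_train --do_eval \
    --data_dir Data/Processed_Data/QA_EN_HI \
    --output_dir Results/QA_EN_HI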

@Genius1237
Collaborator

Hi. Could you post a complete training log for QA somewhere and link it here? I need to have a look at it before I can say anything.

@BarahFazili
Author

For QA there's a train set and a dev set, and evaluation is done on the dev set. The dev set seems to be provided with correct labels (not just placeholders), so the F1 score on this set printed locally should be the same as the one reported through the PR. Below is the log from a run with default params, after uncommenting the parts of run_squad.py that print the results after evaluation.
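
(The uncommented lines are roughly of the following shape; this is a sketch, not the exact code in the repo.)

# rough sketch of the result-printing code near the end of run_squad.py (not the exact repo code)
result = evaluate(args, model, tokenizer, prefix=global_step)
logger.info("Results: {}".format(result))

The run command and resulting log: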

bash train.sh bert-base-multilingual-cased bert QA_EN_HI

Fine-tuning bert-base-multilingual-cased on QA_EN_HI
06/25/2021 18:23:58 - WARNING - main - Process rank: -1, device: cuda, n_gpu: 1, distributed training: False
Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertForQuestionAnswering: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']

  • This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
    06/25/2021 18:24:14 - INFO - main - Training/evaluation parameters Namespace(adam_epsilon=1e-08, cache_dir='', config_name='', data_dir='/home/barah/exp/grey/GLUECoS/Data/Processed_Data/QA_EN_HI', device=device(type='cuda'), do_eval=True, do_lower_case=False, do_train=True, doc_stride=128, eval_all_checkpoints=False, evaluate_during_training=False, gradient_accumulation_steps=1, lang_id=0, learning_rate=5e-05, local_rank=-1, logging_steps=500, max_answer_length=30, max_grad_norm=1.0, max_query_length=64, max_seq_length=512, max_steps=-1, model_name_or_path='bert-base-multilingual-cased', model_type='bert', n_best_size=20, n_gpu=1, no_cuda=False, null_score_diff_threshold=0.0, num_train_epochs=5.0, output_dir='/home/barah/exp/grey/GLUECoS/Results/QA_EN_HI', overwrite_cache=False, overwrite_output_dir=True, per_gpu_eval_batch_size=8, per_gpu_train_batch_size=4, predict_file=None, save_steps=500, seed=42, threads=1, tokenizer_name='', train_file=None, verbose_logging=False, version_2_with_negative=True, warmup_steps=0, weight_decay=0.0)
    06/25/2021 18:24:14 - INFO - main - Loading features from cached file /home/barah/exp/grey/GLUECoS/Data/Processed_Data/QA_EN_HI/cached_train_bert-base-multilingual-cased_512
    06/25/2021 18:24:14 - INFO - main - ***** Running training *****
    06/25/2021 18:24:14 - INFO - main - Num examples = 438
    06/25/2021 18:24:14 - INFO - main - Num Epochs = 5
    06/25/2021 18:24:14 - INFO - main - Instantaneous batch size per GPU = 4
    06/25/2021 18:24:14 - INFO - main - Total train batch size (w. parallel, distributed & accumulation) = 4
    06/25/2021 18:24:14 - INFO - main - Gradient Accumulation steps = 1
    06/25/2021 18:24:14 - INFO - main - Total optimization steps = 550
    Iteration: 100%|█████████████████████████████████████████████████████████████████| 110/110 [00:28<00:00, 3.84it/s]
    06/25/2021 18:24:43 - INFO - main - training loss QA= 2.6527178460901433███| 110/110 [00:28<00:00, 4.25it/s]
    Iteration: 100%|█████████████████████████████████████████████████████████████████| 110/110 [00:29<00:00, 3.78it/s]
    06/25/2021 18:25:12 - INFO - main - training loss QA= 1.2291807915676725███| 110/110 [00:29<00:00, 4.20it/s]
    Iteration: 100%|█████████████████████████████████████████████████████████████████| 110/110 [00:29<00:00, 3.72it/s]
    06/25/2021 18:25:41 - INFO - main - training loss QA= 0.717832342916253████| 110/110 [00:29<00:00, 4.17it/s]
    Iteration: 100%|█████████████████████████████████████████████████████████████████| 110/110 [00:29<00:00, 3.75it/s]
    06/25/2021 18:26:11 - INFO - main - training loss QA= 0.33675127934812654██| 110/110 [00:29<00:00, 4.18it/s]
    Iteration: 100%|█████████████████████████████████████████████████████████████████| 110/110 [00:29<00:00, 3.75it/s]
    06/25/2021 18:26:40 - INFO - main - training loss QA= 0.22754330959604968██| 110/110 [00:29<00:00, 4.14it/s]
    Epoch: 100%|█████████████████████████████████████████████████████████████████████████| 5/5 [02:25<00:00, 29.19s/it]
    06/25/2021 18:26:40 - INFO - main - global_step = 551, average loss = 1.0309306944591778
    06/25/2021 18:26:40 - INFO - main - Loading features from cached file /home/barah/exp/grey/GLUECoS/Data/Processed_Data/QA_EN_HI/cached_dev_bert-base-multilingual-cased_512
    06/25/2021 18:26:40 - INFO - main - ***** Running evaluation 551 *****
    06/25/2021 18:26:40 - INFO - main - Num examples = 123
    06/25/2021 18:26:40 - INFO - main - Batch size = 8
    Evaluating: 100%|██████████████████████████████████████████████████████████████████| 16/16 [00:02<00:00, 6.87it/s]
    06/25/2021 18:26:42 - INFO - main - Evaluation done in total 2.328559 secs (0.018931 sec per example)
    06/25/2021 18:26:43 - INFO - main - eval f1 at 66.42857142857143
    OrderedDict([('exact', 65.71428571428571), ('f1', 66.42857142857143), ('total', 70), ('HasAns_exact', 65.71428571428571), ('HasAns_f1', 66.42857142857143), ('HasAns_total', 70), ('best_exact', 65.71428571428571), ('best_exact_thresh', 0.0), ('best_f1', 66.42857142857143), ('best_f1_thresh', 0.0)])
    06/25/2021 18:26:43 - INFO - main - Results: {'exact': 65.71428571428571, 'f1': 66.42857142857143, 'total': 70, 'HasAns_exact': 65.71428571428571, 'HasAns_f1': 66.42857142857143, 'HasAns_total': 70, 'best_exact': 65.71428571428571, 'best_exact_thresh': 0.0, 'best_f1': 66.42857142857143, 'best_f1_thresh': 0.0}
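
(For what it's worth, these numbers can also be checked independently of run_squad.py by running the official SQuAD v2.0 evaluation script over the dev file and the predictions file produced during evaluation; the paths here are illustrative.)

python evaluate-v2.0.py Data/Processed_Data/QA_EN_HI/dev-v2.0.json Results/QA_EN_HI/predictions.json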

The score through the PR was given as QA_EN_HI: 19.444444444444443, while the value printed locally is around 66.

@Genius1237
Collaborator

Genius1237 commented Jun 25, 2021

Could you share the file Data/Processed_Data/QA_EN_HI/dev-v2.0.json?

Since the questions in the original dataset do not come with contexts, DrQA is used to retrieve contexts for these questions from Wikipedia. It looks like when you ran DrQA, it generated contexts for more examples, as the predictions.json that you have uploaded has more entries in it.
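
(Roughly speaking, the context retrieval uses DrQA's TF-IDF document ranker, along the lines of the sketch below; the retriever model path and the query are illustrative.)

# sketch of DrQA TF-IDF retrieval for context lookup (model path and query are illustrative)
from drqa import retriever

ranker = retriever.get_class('tfidf')(
    tfidf_path='DrQA/data/wikipedia/docs-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz')
doc_names, doc_scores = ranker.closest_docs('example question text', k=1)
print(doc_names, doc_scores)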

@BarahFazili
Author

BarahFazili commented Jun 25, 2021 via email

@Genius1237
Collaborator

Could you upload it to a file-sharing site, or paste it somewhere like Pastebin?

@Genius1237
Collaborator

The file that I have on my end does not have questions with ID 235 onwards in the dev set. Could you make a backup of the dev file, delete the keys with ID 235 onwards, and try running training again? It looks like the higher score you are getting is due to these extra questions being considered part of the dev set.
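
(A minimal sketch of that trimming step, assuming the dev file follows the SQuAD-style layout and that question IDs are numeric; back up the original file first.)

# sketch: drop dev-set questions with ID 235 onwards (assumes SQuAD-style JSON and numeric IDs)
import json

path = 'Data/Processed_Data/QA_EN_HI/dev-v2.0.json'
with open(path) as f:
    dev = json.load(f)

for article in dev['data']:
    for para in article['paragraphs']:
        para['qas'] = [qa for qa in para['qas'] if int(qa['id']) < 235]
    # drop paragraphs left without any questions
    article['paragraphs'] = [p for p in article['paragraphs'] if p['qas']]
dev['data'] = [a for a in dev['data'] if a['paragraphs']]

with open(path, 'w') as f:
    json.dump(dev, f, ensure_ascii=False)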

@BarahFazili
Author

Here's the dev set I had been using: Data/Processed_Data/QA_EN_HI/dev-v2.0.json. Even after removing the titles with IDs 235 onwards, the inconsistency persists. The local dev-set F1 score is ~67, while the PR gives around 24!

@Genius1237
Collaborator

I will check and get back. When you updated the dev file, did you delete the cache file in the same directory? If not, please delete that file and try re-running.
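
(The cache file in question is the one referenced in the log above, i.e. something like:)

rm Data/Processed_Data/QA_EN_HI/cached_dev_bert-base-multilingual-cased_512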

@BarahFazili
Author

I also tried after deleting the cached dev file; that didn't help either.

@Genius1237
Collaborator

It seems that the QA dataset processing scripts are returning more data points in your case than what is actually expected.

I would suggest that you try rerunning the QA preprocessing alone in a new python:3.6 Docker container and check the dataset that you obtain. Please check whether the train set has 259 entries and the test set has 54 entries.
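
(A sketch of the container invocation, assuming the GLUECoS checkout is mounted into the container; adjust paths to your setup, then rerun the QA preprocessing from inside it.)

docker run -it --rm -v $(pwd):/GLUECoS -w /GLUECoS python:3.6 bash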

@BarahFazili
Author

I've been getting 313 entries in the train set and 70 in the dev set.

@Genius1237
Collaborator

I seem to have figured out what the issue is. Would you be able to try running the code again with a few changes?

When you're running the Docker container, use this exact image: python:3.6.10. Also, apply this patch to the GLUECoS repo. It changes 2 lines in the Data/Preprocess_Scripts/preprocess_qa.sh file.

index 1fdaf95..3f57130 100644
--- a/Data/Preprocess_Scripts/preprocess_qa.sh
+++ b/Data/Preprocess_Scripts/preprocess_qa.sh
@@ -15,9 +15,9 @@ python $PREPROCESS_DIR/preprocess_drqa.py --data_dir $ORIGINAL_DATA_DIR
 git clone https://github.com/facebookresearch/DrQA.git
 cd DrQA
 git checkout 96f343c
-pip install -r requirements.txt
+pip install elasticsearch==7.8.0 nltk==3.5 scipy==1.5.0 prettytable==0.7.2 tqdm==4.46.1 regex==2020.6.8 termcolor==1.1.0 scikit-learn==0.23.1 numpy==1.18.5 torch==1.4.0
 python setup.py develop
-pip install spacy
+pip install spacy==2.3.0
 python -m spacy download xx_ent_wiki_sm
 python -c "import nltk;nltk.download(['punkt', 'averaged_perceptron_tagger', 'maxent_ne_chunker', 'words'])"
 ./download.sh
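
To apply it, save the diff to a file and use git apply (the filename here is arbitrary):

git apply qa_preprocess_fix.patch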

The preprocess_qa.sh script makes a few modifications to the DrQA repo. These are done in lines 24-32. Could you also please manually verify that these changes take effect (check after running it)?

If DrQA runs properly, the penultimate line of the preprocess_qa.sh script's output should be Finished. Total = 215.

@BarahFazili
Author

Yes, that solved it. Thanks a lot !

@Genius1237
Collaborator

Sorry about the issues. We rely on DrQA running in a "deterministic" manner. Due to updates to either the python version or some of the packages, this wasn't happening.

I will update the scripts and the readme with these additional instructions. Were you able to submit and run evaluation properly?

@BarahFazili
Author

Yes, the score on submission is consistent now. Thanks again.
