# SRL Patient recognition experiments

In this notebook I carry out experiments to test whether two Semantic Role Labelling (SRL) systems can correctly identify patients in sentences with varying structures. This code was based on code provided by Pia Sommerauer.

In this code I load two models, namely the AllenNLP SRL model and the AllenNLP SRL BERT model. I create a variety of test cases, for which I evaluate the performance of the two models. All the test sentences are stored in a JSON file specified through the `test_sents_path` variable. The SRL predictions are stored in the JSON file specified through `srl_pred_path`, and the SRL BERT predictions are stored in the JSON file specified through `bert_pred_path`.
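As a rough illustration of the storage step, the sketch below writes a set of test sentences to a JSON file and reads it back. The helper names `save_json` and `load_json`, the file name, and the dictionary layout are assumptions made for this example; the notebook's own helpers and schema may differ.

```python
import json
import os
import tempfile


def save_json(obj, path):
    """Write a JSON-serialisable object to `path`."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(obj, f, indent=2, ensure_ascii=False)


def load_json(path):
    """Read a JSON file back into Python objects."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Hypothetical location; the notebook defines the real path elsewhere.
out_dir = tempfile.mkdtemp()
test_sents_path = os.path.join(out_dir, "test_sentences.json")

# Assumed layout: one list of sentences per sentence structure.
test_sents = {"active": ["John kissed Mary yesterday"]}
save_json(test_sents, test_sents_path)
reloaded = load_json(test_sents_path)
```

The same two helpers could be reused for `srl_pred_path` and `bert_pred_path`, so that sentences and predictions from both models stay in one consistent format.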

### Patient recognition
I carry out two sets of tests: in the first test I only use names as patients, in the second test I add titles to the agent and patient, namely 'Doctor' and 'nurse'. In both sets I test 3 sentence structures and names from 3 cultures: English, Iranian and Dutch. The sentence structures are as follows:
* Active: '`name1` kissed `name2` yesterday'
* Passive: '`name1` is the one that was kissed by `name2` yesterday'
* 'It was ... who' + passive: 'It was `name1` who was kissed by `name2` yesterday'
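The three structures can be written as templates and filled with name pairs. The sketch below uses plain string formatting rather than the CheckList `Editor`, writes the cleft template with "who", and uses a handful of placeholder names; the name lists, the `build_cases` helper, and the record layout are illustrative assumptions, not the lists or code used in the experiments. Note that the patient is the second name in the active sentence but the first name in the two passive structures.

```python
from itertools import permutations

# Templates use explicit {agent}/{patient} slots so the gold patient is unambiguous.
TEMPLATES = {
    "active": "{agent} kissed {patient} yesterday",
    "passive": "{patient} is the one that was kissed by {agent} yesterday",
    "it_was_who_passive": "It was {patient} who was kissed by {agent} yesterday",
}

# Placeholder names per culture (illustrative only).
NAMES = {
    "English": ["John", "Mary"],
    "Iranian": ["Reza", "Shirin"],
    "Dutch": ["Daan", "Sanne"],
}


def build_cases(templates, names_by_culture):
    """Create one test case per structure, culture and ordered name pair."""
    cases = []
    for structure, template in templates.items():
        for culture, names in names_by_culture.items():
            for agent, patient in permutations(names, 2):
                cases.append({
                    "structure": structure,
                    "culture": culture,
                    "sentence": template.format(agent=agent, patient=patient),
                    "patient": patient,  # gold label to compare against the ARG1 span
                })
    return cases


cases = build_cases(TEMPLATES, NAMES)
```

Storing the gold patient next to each sentence keeps the later evaluation independent of word order.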

For the doctor/nurse titles, I test both a stereotypical context, where the Doctor has a male name and the nurse a female name, and a non-stereotypical context, where the genders of the Doctor and the nurse are reversed. This tests whether the models show gender bias: if they do, we expect better results in the stereotypical context than in the non-stereotypical one.
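One way to make this comparison concrete is to build the titled noun phrases for each context and compare failure rates between the two. The sketch below is a minimal version of that idea; the function names `titled_pair`, `failure_rate` and `bias_gap` are hypothetical, and the example names are placeholders.

```python
def titled_pair(male_name, female_name, stereotypical=True):
    """Return the (doctor, nurse) noun phrases for one context."""
    if stereotypical:
        doctor_name, nurse_name = male_name, female_name
    else:
        doctor_name, nurse_name = female_name, male_name
    return f"Doctor {doctor_name}", f"nurse {nurse_name}"


def failure_rate(outcomes):
    """Fraction of failed cases; `outcomes` holds True for each correct patient."""
    if not outcomes:
        return 0.0
    return outcomes.count(False) / len(outcomes)


def bias_gap(stereotypical_outcomes, non_stereotypical_outcomes):
    """Positive values mean more failures in the non-stereotypical context."""
    return failure_rate(non_stereotypical_outcomes) - failure_rate(stereotypical_outcomes)


# Example: fill the active template for both contexts (doctor as agent is an assumption).
doctor, nurse = titled_pair("John", "Mary", stereotypical=True)
stereo_sentence = f"{doctor} kissed {nurse} yesterday"
doctor, nurse = titled_pair("John", "Mary", stereotypical=False)
non_stereo_sentence = f"{doctor} kissed {nurse} yesterday"
```

A gap close to zero would suggest that the titles do not interact with the gender of the names, while a clearly positive gap would be in line with the bias hypothesis described above.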


### Import libraries

In [1]:
from allennlp_models.pretrained import load_predictor

In [2]:
import checklist
from checklist.editor import Editor
from checklist.perturb import Perturb
from checklist.test_types import MFT, INV, DIR
from checklist.expect import Expect

In [3]:
from checklist.pred_wrapper import PredictorWrapper

In [4]:
import json
import logging
logger = logging.getLogger()
logger.setLevel(logging.CRITICAL)

In [5]:
from utils_functions import *

### Load the models 

In [6]:
# load the regular SRL model
srl_predictor = load_predictor('structured-prediction-srl')
# load the SRL BERT model
srlbert_predictor = load_predictor('structured-prediction-srl-bert')
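
Both predictors return, for each sentence, a dictionary with the tokenised `words` and one BIO-tagged frame per verb under `verbs`. The sketch below shows one possible way to pull the patient (the `ARG1` span) out of such a dictionary. The helper `extract_patient` is hypothetical, and the example dictionary is hand-written to illustrate the shape of the output; it is not an actual model prediction.

```python
def extract_patient(prediction, verb="kissed", role="ARG1"):
    """Return the tokens tagged B-/I-<role> in the frame of `verb`, joined by spaces."""
    words = prediction["words"]
    wanted = {f"B-{role}", f"I-{role}"}
    for frame in prediction["verbs"]:
        if frame["verb"] != verb:
            continue  # skip frames of other verbs, e.g. auxiliaries
        return " ".join(w for w, tag in zip(words, frame["tags"]) if tag in wanted)
    return ""  # no frame found for the target verb


# Hand-written example in the predictor's output format (not real model output).
example_prediction = {
    "words": ["John", "kissed", "Mary", "yesterday"],
    "verbs": [
        {"verb": "kissed", "tags": ["B-ARG0", "B-V", "B-ARG1", "B-ARGM-TMP"]},
    ],
}
patient = extract_patient(example_prediction)
```

With the models loaded, the same helper could be applied to `srl_predictor.predict(sentence=...)` and `srlbert_predictor.predict(sentence=...)`, and the returned span compared with the expected patient name.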

2022-04-01 16:10:12,567 - INFO - allennlp.common.params - model_details.paper.url = https://api.semanticscholar.org/CorpusID:174799945
2022-04-01 16:10:12,568 - INFO - allennlp.common.params - model_details.license = None
2022-04-01 16:10:12,568 - INFO - allennlp.common.

2022-04-01 16:10:12,724 - INFO - allennlp.common.params - id = nlvr2-vilbert
2022-04-01 16:10:12,725 - INFO - allennlp.common.params - registered_model_name = nlvr2
2022-04-01 16:10:12,726 - INFO - allennlp.common.params - model_class = None
2022-04-01 16:10:12,727 - INFO - allennlp.common.params - registered_predictor_name = None
2022-04-01 16:10:12,728 - INFO - allennlp.common.params - display_name = Visual Entailment - NLVR2
2022-04-01 16:10:12,728 - INFO - allennlp.common.params - task_id = nlvr2
2022-04-01 16:10:12,729 - INFO - allennlp.common.params - model_usage.archive_file = vilbert-nlvr2-2021.06.01.tar.gz
2022-04-01 16:10:12,730 - INFO - allennlp.common.params - model_usage.training_config = vilbert_nlvr2_pretrained.jsonnet
2022-04-01 16:10:12,731 - INFO - allennlp.common.params - model_usage.install_instructions = pip install allennlp>=2.5.1 allennlp-models>=2.5.1
2022-04-01 16:10:12,732 - INFO - allennlp.common.params - model_usage.overrides = None
2022-04-01 16:10:12,733 -

2022-04-01 16:10:12,827 - INFO - allennlp.common.params - intended_use.primary_users = None
2022-04-01 16:10:12,827 - INFO - allennlp.common.params - intended_use.out_of_scope_use_cases = None
2022-04-01 16:10:12,828 - INFO - allennlp.common.params - factors.relevant_factors = None
2022-04-01 16:10:12,828 - INFO - allennlp.common.params - factors.evaluation_factors = None
2022-04-01 16:10:12,828 - INFO - allennlp.common.params - metrics.model_performance_measures = Accuracy
2022-04-01 16:10:12,829 - INFO - allennlp.common.params - metrics.decision_thresholds = None
2022-04-01 16:10:12,829 - INFO - allennlp.common.params - metrics.variation_approaches = None
2022-04-01 16:10:12,829 - INFO - allennlp.common.params - evaluation_data.dataset.name = Stanford Natural Language Inference (SNLI) dev set
2022-04-01 16:10:12,830 - INFO - allennlp.common.params - evaluation_data.dataset.processed_url = https://allennlp.s3.amazonaws.com/datasets/snli/snli_1.0_test.jsonl
2022-04-01 16:10:12,830 - IN

2022-04-01 16:10:12,974 - INFO - allennlp.common.params - model_usage.training_config = structured_prediction/bert_base_srl.jsonnet
2022-04-01 16:10:12,975 - INFO - allennlp.common.params - model_usage.install_instructions = pip install allennlp==2.1.0 allennlp-models==2.1.0
2022-04-01 16:10:12,976 - INFO - allennlp.common.params - model_usage.overrides = None
2022-04-01 16:10:12,977 - INFO - allennlp.common.params - model_details.description = An implementation of a BERT based model (Shi et al, 2019) with some modifications (no additional parameters apart from a linear classification layer), which is currently the state of the art single model for English PropBank SRL (Newswire sentences). It achieves 86.49 test F1 on the Ontonotes 5.0 dataset.
2022-04-01 16:10:12,978 - INFO - allennlp.common.params - model_details.short_description = A BERT based model (Shi et al, 2019) with some modifications (no additional parameters apart from a linear classification layer)
2022-04-01 16:10:12,980

2022-04-01 16:10:13,097 - INFO - allennlp.common.params - evaluation_data.dataset.url = https://github.com/gabrielStanovsky/oie-benchmark
2022-04-01 16:10:13,098 - INFO - allennlp.common.params - evaluation_data.motivation = None
2022-04-01 16:10:13,099 - INFO - allennlp.common.params - evaluation_data.preprocessing = None
2022-04-01 16:10:13,101 - INFO - allennlp.common.params - training_data.dataset.name = All Words Open IE
2022-04-01 16:10:13,102 - INFO - allennlp.common.params - training_data.dataset.url = https://github.com/gabrielStanovsky/supervised-oie/tree/master/data
2022-04-01 16:10:13,102 - INFO - allennlp.common.params - training_data.motivation = None
2022-04-01 16:10:13,103 - INFO - allennlp.common.params - training_data.preprocessing = None
2022-04-01 16:10:13,104 - INFO - allennlp.common.params - quantitative_analyses.unitary_results = None
2022-04-01 16:10:13,105 - INFO - allennlp.common.params - quantitative_analyses.intersectional_results = None
2022-04-01 16:10:13,

2022-04-01 16:10:13,250 - INFO - allennlp.common.params - model_details.contributed_by = Jacob Morrison
2022-04-01 16:10:13,251 - INFO - allennlp.common.params - model_details.date = 2021-05-07
2022-04-01 16:10:13,251 - INFO - allennlp.common.params - model_details.version = 2
2022-04-01 16:10:13,251 - INFO - allennlp.common.params - model_details.model_type = ViLBERT based on BERT large
2022-04-01 16:10:13,251 - INFO - allennlp.common.params - model_details.paper.citation = 
@inproceedings{Lu2019ViLBERTPT,
title={ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks},
author={Jiasen Lu and Dhruv Batra and D. Parikh and Stefan Lee},
booktitle={NeurIPS},
year={2019}
}
2022-04-01 16:10:13,251 - INFO - allennlp.common.params - model_details.paper.title = ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
2022-04-01 16:10:13,252 - INFO - allennlp.common.params - model_details.paper.url = https://api.se

2022-04-01 16:10:13,340 - INFO - allennlp.common.params - metrics.decision_thresholds = None
2022-04-01 16:10:13,340 - INFO - allennlp.common.params - metrics.variation_approaches = None
2022-04-01 16:10:13,341 - INFO - allennlp.common.params - evaluation_data.dataset.name = On Measuring and Mitigating Biased Gender-Occupation Inferences SNLI Dataset
2022-04-01 16:10:13,343 - INFO - allennlp.common.params - evaluation_data.dataset.processed_url = https://storage.googleapis.com/allennlp-public-models/binary-gender-bias-mitigated-snli-dataset.jsonl
2022-04-01 16:10:13,343 - INFO - allennlp.common.params - evaluation_data.dataset.url = https://github.com/sunipa/On-Measuring-and-Mitigating-Biased-Inferences-of-Word-Embeddings
2022-04-01 16:10:13,344 - INFO - allennlp.common.params - evaluation_data.motivation = None
2022-04-01 16:10:13,345 - INFO - allennlp.common.params - evaluation_data.preprocessing = None
2022-04-01 16:10:13,346 - INFO - allennlp.common.params - training_data.dataset.n

2022-04-01 16:10:13,482 - INFO - allennlp.common.params - id = rc-naqanet
2022-04-01 16:10:13,482 - INFO - allennlp.common.params - registered_model_name = naqanet
2022-04-01 16:10:13,483 - INFO - allennlp.common.params - model_class = None
2022-04-01 16:10:13,484 - INFO - allennlp.common.params - registered_predictor_name = None
2022-04-01 16:10:13,485 - INFO - allennlp.common.params - display_name = Numerically Augmented QA Net
2022-04-01 16:10:13,485 - INFO - allennlp.common.params - task_id = rc
2022-04-01 16:10:13,487 - INFO - allennlp.common.params - model_usage.archive_file = naqanet-2021.02.26.tar.gz
2022-04-01 16:10:13,488 - INFO - allennlp.common.params - model_usage.training_config = rc/naqanet.jsonnet
2022-04-01 16:10:13,489 - INFO - allennlp.common.params - model_usage.install_instructions = pip install allennlp==2.1.0 allennlp-models==2.1.0
2022-04-01 16:10:13,489 - INFO - allennlp.common.params - model_usage.overrides = None
2022-04-01 16:10:13,490 - INFO - allennlp.comm

2022-04-01 16:10:13,587 - INFO - allennlp.common.params - model_details.paper.title = Extending a Parser to Distant Domains Using a Few Dozen Partially Annotated Examples
2022-04-01 16:10:13,588 - INFO - allennlp.common.params - model_details.paper.url = https://api.semanticscholar.org/CorpusID:21712653
2022-04-01 16:10:13,589 - INFO - allennlp.common.params - model_details.license = None
2022-04-01 16:10:13,590 - INFO - allennlp.common.params - model_details.contact = allennlp-contact@allenai.org
2022-04-01 16:10:13,591 - INFO - allennlp.common.params - intended_use.primary_uses = None
2022-04-01 16:10:13,593 - INFO - allennlp.common.params - intended_use.primary_users = None
2022-04-01 16:10:13,594 - INFO - allennlp.common.params - intended_use.out_of_scope_use_cases = None
2022-04-01 16:10:13,595 - INFO - allennlp.common.params - factors.relevant_factors = None
2022-04-01 16:10:13,596 - INFO - allennlp.common.params - factors.evaluation_factors = None
2022-04-01 16:10:13,597 - INFO 

2022-04-01 16:10:13,696 - INFO - allennlp.common.params - quantitative_analyses.intersectional_results = None
2022-04-01 16:10:13,697 - INFO - allennlp.common.params - model_ethical_considerations.ethical_considerations = None
2022-04-01 16:10:13,699 - INFO - allennlp.common.params - model_caveats_and_recommendations.caveats_and_recommendations = None
2022-04-01 16:10:13,744 - INFO - allennlp.common.params - id = lm-masked-language-model
2022-04-01 16:10:13,745 - INFO - allennlp.common.params - registered_model_name = masked_language_model
2022-04-01 16:10:13,746 - INFO - allennlp.common.params - model_class = None
2022-04-01 16:10:13,746 - INFO - allennlp.common.params - registered_predictor_name = None
2022-04-01 16:10:13,747 - INFO - allennlp.common.params - display_name = BERT-based Masked Language Model
2022-04-01 16:10:13,748 - INFO - allennlp.common.params - task_id = masked-language-modeling
2022-04-01 16:10:13,749 - INFO - allennlp.common.params - model_usage.archive_file = be

2022-04-01 16:10:13,847 - INFO - allennlp.common.params - model_details.paper.title = RoBERTa: A Robustly Optimized BERT Pretraining Approach
2022-04-01 16:10:13,847 - INFO - allennlp.common.params - model_details.paper.url = https://api.semanticscholar.org/CorpusID:198953378
2022-04-01 16:10:13,852 - INFO - allennlp.common.params - model_details.license = None
2022-04-01 16:10:13,853 - INFO - allennlp.common.params - model_details.contact = allennlp-contact@allenai.org
2022-04-01 16:10:13,854 - INFO - allennlp.common.params - intended_use.primary_uses = None
2022-04-01 16:10:13,855 - INFO - allennlp.common.params - intended_use.primary_users = None
2022-04-01 16:10:13,857 - INFO - allennlp.common.params - intended_use.out_of_scope_use_cases = None
2022-04-01 16:10:13,858 - INFO - allennlp.common.params - factors.relevant_factors = None
2022-04-01 16:10:13,859 - INFO - allennlp.common.params - factors.evaluation_factors = None
2022-04-01 16:10:13,860 - INFO - allennlp.common.params - m

2022-04-01 16:10:13,968 - INFO - allennlp.common.params - training_data.preprocessing = None
2022-04-01 16:10:13,969 - INFO - allennlp.common.params - quantitative_analyses.unitary_results = None
2022-04-01 16:10:13,970 - INFO - allennlp.common.params - quantitative_analyses.intersectional_results = None
2022-04-01 16:10:13,972 - INFO - allennlp.common.params - model_ethical_considerations.ethical_considerations = None
2022-04-01 16:10:13,974 - INFO - allennlp.common.params - model_caveats_and_recommendations.caveats_and_recommendations = None
2022-04-01 16:10:14,014 - INFO - allennlp.common.params - id = rc-bidaf
2022-04-01 16:10:14,016 - INFO - allennlp.common.params - registered_model_name = bidaf
2022-04-01 16:10:14,018 - INFO - allennlp.common.params - model_class = None
2022-04-01 16:10:14,019 - INFO - allennlp.common.params - registered_predictor_name = None
2022-04-01 16:10:14,020 - INFO - allennlp.common.params - display_name = BiDAF
2022-04-01 16:10:14,021 - INFO - allennlp.c

2022-04-01 16:10:14,129 - INFO - allennlp.common.params - model_details.paper.title = Bidirectional Attention Flow for Machine Comprehension
2022-04-01 16:10:14,129 - INFO - allennlp.common.params - model_details.paper.url = https://api.semanticscholar.org/CorpusID:8535316
2022-04-01 16:10:14,130 - INFO - allennlp.common.params - model_details.license = None
2022-04-01 16:10:14,131 - INFO - allennlp.common.params - model_details.contact = allennlp-contact@allenai.org
2022-04-01 16:10:14,132 - INFO - allennlp.common.params - intended_use.primary_uses = None
2022-04-01 16:10:14,133 - INFO - allennlp.common.params - intended_use.primary_users = None
2022-04-01 16:10:14,134 - INFO - allennlp.common.params - intended_use.out_of_scope_use_cases = None
2022-04-01 16:10:14,135 - INFO - allennlp.common.params - factors.relevant_factors = None
2022-04-01 16:10:14,137 - INFO - allennlp.common.params - factors.evaluation_factors = None
2022-04-01 16:10:14,139 - INFO - allennlp.common.params - metr

2022-04-01 16:10:14,230 - INFO - allennlp.common.params - evaluation_data.preprocessing = None
2022-04-01 16:10:14,231 - INFO - allennlp.common.params - training_data.dataset.name = Stanford Natural Language Inference (SNLI) train set
2022-04-01 16:10:14,232 - INFO - allennlp.common.params - training_data.dataset.processed_url = https://allennlp.s3.amazonaws.com/datasets/snli/snli_1.0_train.jsonl
2022-04-01 16:10:14,233 - INFO - allennlp.common.params - training_data.dataset.url = https://nlp.stanford.edu/projects/snli/
2022-04-01 16:10:14,234 - INFO - allennlp.common.params - training_data.motivation = None
2022-04-01 16:10:14,235 - INFO - allennlp.common.params - training_data.preprocessing = None
2022-04-01 16:10:14,236 - INFO - allennlp.common.params - quantitative_analyses.unitary_results = Net Neutral: 0.613096454815352, Fraction Neutral: 0.6704967487937075, Threshold:0.5: 0.6637061892722586, Threshold:0.7: 0.49490217463150243
2022-04-01 16:10:14,237 - INFO - allennlp.common.para

2022-04-01 16:10:14,368 - INFO - allennlp.common.params - task_id = semparse-nlvr
2022-04-01 16:10:14,369 - INFO - allennlp.common.params - model_usage.archive_file = https://allennlp.s3.amazonaws.com/models/nlvr-erm-model-2020.02.10-rule-vocabulary-updated.tar.gz
2022-04-01 16:10:14,370 - INFO - allennlp.common.params - model_usage.training_config = None
2022-04-01 16:10:14,371 - INFO - allennlp.common.params - model_usage.install_instructions = pip install allennlp==1.0.0 allennlp-models==1.0.0
2022-04-01 16:10:14,371 - INFO - allennlp.common.params - model_usage.overrides = None
2022-04-01 16:10:14,372 - INFO - allennlp.common.params - model_details.description = The model is a semantic parser trained on Cornell NLVR.
2022-04-01 16:10:14,373 - INFO - allennlp.common.params - model_details.short_description = The model is a semantic parser trained on Cornell NLVR.
2022-04-01 16:10:14,374 - INFO - allennlp.common.params - model_details.developed_by = Dasigi et al
2022-04-01 16:10:14,3

2022-04-01 16:10:14,467 - INFO - allennlp.common.params - evaluation_data.dataset.processed_url = https://allennlp.s3.amazonaws.com/datasets/multinli/multinli_1.0_dev_mismatched.jsonl
2022-04-01 16:10:14,468 - INFO - allennlp.common.params - evaluation_data.dataset.url = https://cims.nyu.edu/~sbowman/multinli/
2022-04-01 16:10:14,469 - INFO - allennlp.common.params - evaluation_data.motivation = None
2022-04-01 16:10:14,469 - INFO - allennlp.common.params - evaluation_data.preprocessing = None
2022-04-01 16:10:14,470 - INFO - allennlp.common.params - training_data.dataset.name = Multi-genre Natural Language Inference (MultiNLI) train set
2022-04-01 16:10:14,471 - INFO - allennlp.common.params - training_data.dataset.processed_url = https://allennlp.s3.amazonaws.com/datasets/multinli/multinli_1.0_train.jsonl
2022-04-01 16:10:14,471 - INFO - allennlp.common.params - training_data.dataset.url = https://cims.nyu.edu/~sbowman/multinli/
2022-04-01 16:10:14,476 - INFO - allennlp.common.params

2022-04-01 16:10:14,621 - INFO - allennlp.common.params - model_usage.training_config = None
2022-04-01 16:10:14,622 - INFO - allennlp.common.params - model_usage.install_instructions = The model is available at https://github.com/anthonywchen/MOCHA.
2022-04-01 16:10:14,623 - INFO - allennlp.common.params - model_usage.overrides = None
2022-04-01 16:10:14,625 - INFO - allennlp.common.params - model_details.description = LERC is a BERT model that is trained to mimic human judgement scores on candidate answers in the MOCHA dataset. LERC outputs scores that range from 1 to 5, however, to stay consistent with metrics such as BLEU and ROUGE, we normalize the output of LERC to be between 0 and 1 in this demo.
2022-04-01 16:10:14,626 - INFO - allennlp.common.params - model_details.short_description = A BERT model that scores candidate answers from 0 to 1.
2022-04-01 16:10:14,627 - INFO - allennlp.common.params - model_details.developed_by = Chen et al
2022-04-01 16:10:14,628 - INFO - allennlp

2022-04-01 16:10:14,742 - INFO - allennlp.common.params - metrics.decision_thresholds = None
2022-04-01 16:10:14,742 - INFO - allennlp.common.params - metrics.variation_approaches = None
2022-04-01 16:10:14,743 - INFO - allennlp.common.params - evaluation_data.dataset.name = PIQA (validation set)
2022-04-01 16:10:14,744 - INFO - allennlp.common.params - evaluation_data.dataset.notes = Please download the data from the url provided.
2022-04-01 16:10:14,744 - INFO - allennlp.common.params - evaluation_data.dataset.url = https://yonatanbisk.com/piqa/
2022-04-01 16:10:14,745 - INFO - allennlp.common.params - evaluation_data.motivation = None
2022-04-01 16:10:14,746 - INFO - allennlp.common.params - evaluation_data.preprocessing = None
2022-04-01 16:10:14,747 - INFO - allennlp.common.params - training_data.dataset.name = PIQA (train set)
2022-04-01 16:10:14,752 - INFO - allennlp.common.params - training_data.dataset.notes = Please download the data from the url provided.
2022-04-01 16:10:14

2022-04-01 16:10:14,896 - INFO - allennlp.common.params - model_details.developed_by = Lample et al
2022-04-01 16:10:14,897 - INFO - allennlp.common.params - model_details.contributed_by = None
2022-04-01 16:10:14,897 - INFO - allennlp.common.params - model_details.date = 2020-06-24
2022-04-01 16:10:14,898 - INFO - allennlp.common.params - model_details.version = 1
2022-04-01 16:10:14,901 - INFO - allennlp.common.params - model_details.model_type = BiLSTM
2022-04-01 16:10:14,902 - INFO - allennlp.common.params - model_details.paper.citation = 
@article{Lample2016NeuralAF,
title={Neural Architectures for Named Entity Recognition},
author={Guillaume Lample and Miguel Ballesteros and Sandeep Subramanian and K. Kawakami and Chris Dyer},
journal={ArXiv},
year={2016},
volume={abs/1603.01360}}

2022-04-01 16:10:14,903 - INFO - allennlp.common.params - model_details.paper.title = Neural Architectures for Named Entity Recognition
2022-04-01 16:10:14,903 - INFO - allennlp.common.params - model_d

2022-04-01 16:10:15,009 - INFO - allennlp.common.params - evaluation_data.preprocessing = None
2022-04-01 16:10:15,010 - INFO - allennlp.common.params - training_data.dataset.name = CNN/DailyMail
2022-04-01 16:10:15,011 - INFO - allennlp.common.params - training_data.dataset.notes = Please download the data from the url provided.
2022-04-01 16:10:15,012 - INFO - allennlp.common.params - training_data.dataset.url = https://github.com/abisee/cnn-dailymail
2022-04-01 16:10:15,012 - INFO - allennlp.common.params - training_data.motivation = None
2022-04-01 16:10:15,013 - INFO - allennlp.common.params - training_data.preprocessing = None
2022-04-01 16:10:15,014 - INFO - allennlp.common.params - quantitative_analyses.unitary_results = None
2022-04-01 16:10:15,015 - INFO - allennlp.common.params - quantitative_analyses.intersectional_results = None
2022-04-01 16:10:15,015 - INFO - allennlp.common.params - model_ethical_considerations.ethical_considerations = None
2022-04-01 16:10:15,017 - INF

2022-04-01 16:10:35,900 - INFO - allennlp.nn.initializers -    bert_model.encoder.layer.1.attention.output.LayerNorm.bias
2022-04-01 16:10:35,900 - INFO - allennlp.nn.initializers -    bert_model.encoder.layer.1.attention.output.LayerNorm.weight
2022-04-01 16:10:35,901 - INFO - allennlp.nn.initializers -    bert_model.encoder.layer.1.attention.output.dense.bias
2022-04-01 16:10:35,902 - INFO - allennlp.nn.initializers -    bert_model.encoder.layer.1.attention.output.dense.weight
2022-04-01 16:10:35,903 - INFO - allennlp.nn.initializers -    bert_model.encoder.layer.1.attention.self.key.bias
2022-04-01 16:10:35,904 - INFO - allennlp.nn.initializers -    bert_model.encoder.layer.1.attention.self.key.weight
2022-04-01 16:10:35,906 - INFO - allennlp.nn.initializers -    bert_model.encoder.layer.1.attention.self.query.bias
2022-04-01 16:10:35,908 - INFO - allennlp.nn.initializers -    bert_model.encoder.layer.1.attention.self.query.weight
2022-04-01 16:10:35,908 - INFO - allennlp.nn.initial

2022-04-01 16:10:35,962 - INFO - allennlp.nn.initializers -    bert_model.encoder.layer.3.attention.self.query.weight
2022-04-01 16:10:35,962 - INFO - allennlp.nn.initializers -    bert_model.encoder.layer.3.attention.self.value.bias
2022-04-01 16:10:35,963 - INFO - allennlp.nn.initializers -    bert_model.encoder.layer.3.attention.self.value.weight
2022-04-01 16:10:35,963 - INFO - allennlp.nn.initializers -    bert_model.encoder.layer.3.intermediate.dense.bias
2022-04-01 16:10:35,964 - INFO - allennlp.nn.initializers -    bert_model.encoder.layer.3.intermediate.dense.weight
2022-04-01 16:10:35,964 - INFO - allennlp.nn.initializers -    bert_model.encoder.layer.3.output.LayerNorm.bias
2022-04-01 16:10:35,965 - INFO - allennlp.nn.initializers -    bert_model.encoder.layer.3.output.LayerNorm.weight
2022-04-01 16:10:35,965 - INFO - allennlp.nn.initializers -    bert_model.encoder.layer.3.output.dense.bias
2022-04-01 16:10:35,966 - INFO - allennlp.nn.initializers -    bert_model.encoder.la

2022-04-01 16:10:36,029 - INFO - allennlp.nn.initializers -    bert_model.encoder.layer.7.output.dense.bias
2022-04-01 16:10:36,029 - INFO - allennlp.nn.initializers -    bert_model.encoder.layer.7.output.dense.weight
2022-04-01 16:10:36,031 - INFO - allennlp.nn.initializers -    bert_model.encoder.layer.8.attention.output.LayerNorm.bias
2022-04-01 16:10:36,032 - INFO - allennlp.nn.initializers -    bert_model.encoder.layer.8.attention.output.LayerNorm.weight
2022-04-01 16:10:36,033 - INFO - allennlp.nn.initializers -    bert_model.encoder.layer.8.attention.output.dense.bias
2022-04-01 16:10:36,034 - INFO - allennlp.nn.initializers -    bert_model.encoder.layer.8.attention.output.dense.weight
2022-04-01 16:10:36,035 - INFO - allennlp.nn.initializers -    bert_model.encoder.layer.8.attention.self.key.bias
2022-04-01 16:10:36,035 - INFO - allennlp.nn.initializers -    bert_model.encoder.layer.8.attention.self.key.weight
2022-04-01 16:10:36,037 - INFO - allennlp.nn.initializers -    bert_

In [7]:
# Functions that create model predictions for a list of sentences.
# Added by Pia, edited by Goya.

def predict_srl(data):
    # run the AllenNLP SRL predictor on each sentence in turn
    return [srl_predictor.predict(d) for d in data]


def predict_srlbert(data):
    # run the AllenNLP SRL BERT predictor on each sentence in turn
    return [srlbert_predictor.predict(d) for d in data]

# wrap both functions so that CheckList can call them on a list of sentences
predict_srl = PredictorWrapper.wrap_predict(predict_srl)
predict_srlbert = PredictorWrapper.wrap_predict(predict_srlbert)
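
To make the wrapping step concrete, here is a minimal sketch of the shape I assume `PredictorWrapper.wrap_predict` gives to a prediction function: the wrapped function returns the predictions together with one confidence value per input. The names `wrap_predict_sketch` and `fake_predict` are hypothetical, the constant confidence of 1.0 is my assumption about the library's behaviour, and the fake predictor only mimics the `words`/`verbs` keys of the AllenNLP SRL output.

```python
def wrap_predict_sketch(predict_fn):
    """Return a function that yields (predictions, confidences) for a list of inputs."""
    def wrapped(inputs):
        preds = predict_fn(inputs)
        # Assumption: no real confidence is available, so use 1.0 per prediction
        confs = [1.0] * len(preds)
        return preds, confs
    return wrapped


def fake_predict(sentences):
    # Hypothetical stand-in for predict_srl: one dict per input sentence
    return [{"words": s.split(), "verbs": []} for s in sentences]


wrapped = wrap_predict_sketch(fake_predict)
preds, confs = wrapped(["Eric kissed Sara yesterday ."])
print(len(preds), confs)
```

The point of the sketch is only that the test functions later receive a prediction and a confidence for every sentence, even though the SRL models themselves do not provide a confidence score here.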

### Define output file paths

In [8]:
#create lists to store the test sentences and model predictions in
test_data = []
SRLBERT_predictions = []
SRL_predictions = []

In [9]:
#define paths to output files
test_sents_path = './JSON_test_and_predict_files/test_data_patient.json'
bert_pred_path = './JSON_test_and_predict_files/BERT_predictions_patient.json'
srl_pred_path = './JSON_test_and_predict_files/SRL_predictions_patient.json'

#set name of current capability
capability = 'patient_recognition'
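
The helper that actually writes these files is not shown in this part of the notebook. As a rough sketch, assuming the stored data is a list of JSON-serialisable records, writing and re-reading such a file could look like this. The function `save_json`, the record layout and the temporary demo path are all hypothetical and only serve as illustration.

```python
import json
import os
import tempfile


def save_json(records, path):
    """Write a list of JSON-serialisable records to path, creating folders as needed."""
    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)


# Hypothetical record layout; the real files may be structured differently
example_records = [
    {"capability": "patient_recognition",
     "testcase": "English_names_active",
     "sentence": "Eric Reed hurt Ernest Miller yesterday."}
]

# Write to a temporary folder so the sketch does not touch the real output files
demo_path = os.path.join(tempfile.mkdtemp(), "JSON_test_and_predict_files", "demo.json")
save_json(example_records, demo_path)

# Read the file back to check that the round trip preserves the records
with open(demo_path, encoding="utf-8") as f:
    loaded = json.load(f)
print(loaded == example_records)
```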

### Load CheckList expectation functions (defined in utils)
Load the functions that test whether names are recognized as patients.

In [10]:
#load functions that check whether the argument of interest is correctly predicted
expect_arg1_verb0 = Expect.single(found_arg1_people_verb0)
expect_arg1_verb1 = Expect.single(found_arg1_people_verb1)
expect_arg1_verb2 = Expect.single(found_arg1_people_verb2)
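
The `found_arg1_people_verb*` functions are defined in utils and are not shown here. As a hedged sketch of what such a check could look like, the hypothetical function below follows the signature that `Expect.single` expects (`x, pred, conf, label, meta`) and passes only if the ARG1 span of the chosen verb is exactly the patient's first and last name. The helper names, the `verb_index` parameter (my reading of the `verb0`/`verb1`/`verb2` suffixes) and the exact-match criterion are assumptions. The BIO tags in the example are my reconstruction of one of the failing SRL outputs reported further down.

```python
def get_arg_tokens(pred, arg_target="ARG1", verb_index=0):
    """Collect the words tagged with arg_target for the verb at verb_index."""
    verbs = pred.get("verbs", [])
    if len(verbs) <= verb_index:
        return []
    tags = verbs[verb_index]["tags"]
    # BIO tags look like 'B-ARG1' / 'I-ARG1'; drop the two-character prefix
    return [w for w, t in zip(pred["words"], tags) if t[2:] == arg_target]


def found_patient_sketch(x, pred, conf, label=None, meta=None):
    """Pass only if the ARG1 span is exactly the patient's first and last name."""
    expected = [meta["first"], meta["last"]]
    return get_arg_tokens(pred, "ARG1", verb_index=0) == expected


words = ["Eric", "Reed", "hurt", "Ernest", "Miller", "yesterday", "."]
meta = {"first": "Ernest", "last": "Miller"}

# Expected analysis: the second person is the patient (ARG1)
good = {"words": words,
        "verbs": [{"verb": "hurt",
                   "tags": ["B-ARG0", "I-ARG0", "B-V", "B-ARG1", "I-ARG1", "B-ARGM-TMP", "O"]}]}
# Failing analysis: ARG1 is assigned to 'Eric', the patient is tagged ARGM-LOC
bad = {"words": words,
       "verbs": [{"verb": "hurt",
                  "tags": ["B-ARG1", "O", "B-V", "B-ARGM-LOC", "I-ARGM-LOC", "B-ARGM-TMP", "O"]}]}

print(found_patient_sketch(" ".join(words), good, 1.0, meta=meta))
print(found_patient_sketch(" ".join(words), bad, 1.0, meta=meta))
```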

Load the functions that test whether the title 'doctor' followed by a name is recognized as patient.

In [11]:
#load functions that check whether the argument of interest is correctly predicted
expect_arg1_doctor_verb0 = Expect.single(found_arg1_doctor_verb0)
expect_arg1_doctor_verb1 = Expect.single(found_arg1_doctor_verb1)
expect_arg1_doctor_verb2 = Expect.single(found_arg1_doctor_verb2)

### Load wordlists to use in sample sentences

In [12]:
# initialize editor object
editor = Editor()
#import the alphabet detector to ensure we only use names written in Latin characters
from alphabet_detector import AlphabetDetector
ad = AlphabetDetector()

#get lists of names from the different countries
english_firstname = editor.lexicons.female_from.United_Kingdom + editor.lexicons.male_from.United_Kingdom
english_male = editor.lexicons.male_from.United_Kingdom 
english_female = editor.lexicons.female_from.United_Kingdom

#get Iranian names, only those in Latin characters
iran_lastnames = [name for name in editor.lexicons.last_from.Iran if ad.only_alphabet_chars(name, "LATIN")]
iran_female = [name for name in editor.lexicons.female_from.Iran if ad.only_alphabet_chars(name, "LATIN")]
iran_male = [name for name in editor.lexicons.male_from.Iran if ad.only_alphabet_chars(name, "LATIN")]
iran_names = iran_female + iran_male

#get Dutch names
dutch_male = editor.lexicons.male_from.the_Netherlands
dutch_female = editor.lexicons.female_from.the_Netherlands
dutch_names = dutch_female + dutch_male
dutch_lastnames = editor.lexicons.last_from.the_Netherlands

# a list of verbs to use in the test cases
passive_verbs = ['kissed', 'killed', 'hurt', 'touched', 'ignored', 'silenced', 'hit', 'greeted']
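
The Latin-character filter above relies on the `alphabet_detector` package. As an illustration of the idea, and not of that package's implementation, the following standard-library sketch keeps a name only if every alphabetic character has a Unicode name containing `LATIN`. The function `only_latin_letters` and the candidate names are hypothetical.

```python
import unicodedata


def only_latin_letters(text):
    """True if every alphabetic character in text has a LATIN Unicode name."""
    return all(
        "LATIN" in unicodedata.name(ch, "")
        for ch in text
        if ch.isalpha()
    )


# The third entry is a name written in Arabic script, given as Unicode escapes
candidates = ["Nassim", "Zo\u00eb", "\u0646\u0633\u06cc\u0645", "Moradi"]
latin_only = [name for name in candidates if only_latin_letters(name)]
print(latin_only)
```

Accented Latin letters such as the one in the second candidate pass the check, so the filter removes names in other scripts without restricting the list to plain ASCII.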

## Tests

### Names only: English names
Tests in the names-only setting, for English names.

In [13]:
#create samples
testcase_name = 'English_names_active'
t = editor.template("{first_name} {last_name} {verb} {first} {last} yesterday.", first=english_firstname, last=editor.lexicons.last_from.United_Kingdom, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(
    t, capability, testcase_name,
    expect_arg1_verb0, format_srl_verb0,
    predict_srl, predict_srlbert, test_data,
    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    2 (2.0%)

Example fails:
[ARG1: Eric] Reed [V: hurt] [ARGM-LOC: Ernest Miller] [ARGM-TMP: yesterday] .
----
[ARG1: Adam] [ARGM-ADV: Walker] [V: hurt] [ARG2: Sara Miller] [ARGM-TMP: yesterday] .
----
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)
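
To illustrate how a template with placeholders turns into test sentences plus metadata, here is a simplified sketch. It is not CheckList's implementation: the function `fill_template` is hypothetical, every filler list is passed explicitly (in the notebook `first_name` and `last_name` are not passed and, I assume, come from the editor's built-in lexicons), and sampling is done with a seeded random generator.

```python
import random


def fill_template(template, nsamples, seed=0, **fillers):
    """Sample nsamples sentences by drawing one value per placeholder."""
    rng = random.Random(seed)
    samples = []
    for _ in range(nsamples):
        # Draw one filler per placeholder and keep it as metadata for the checks
        choice = {key: rng.choice(values) for key, values in fillers.items()}
        samples.append({"sentence": template.format(**choice), "meta": choice})
    return samples


samples = fill_template(
    "{first_name} {last_name} {verb} {first} {last} yesterday.",
    nsamples=5,
    first_name=["Eric", "Adam"],
    last_name=["Reed", "Walker"],
    verb=["kissed", "hurt", "greeted"],
    first=["Ernest", "Sara"],
    last=["Miller"],
)
for s in samples:
    print(s["sentence"])
```

Keeping the chosen fillers next to each sentence is what allows an expectation function to know which name is supposed to be the patient (`first` and `last`) in that particular sample.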


In [14]:
#create samples
testcase_name = 'English_names_passive'
t = editor.template("{first} {last} was {verb} by {first_name} {last_name} yesterday", first=english_firstname, last=editor.lexicons.last_from.United_Kingdom, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(
    t, capability, testcase_name,
    expect_arg1_verb1, format_srl_verb1,
    predict_srl, predict_srlbert, test_data,
    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    1 (1.0%)

Example fails:
[ARGM-MNR: David] [ARG1: Griffiths] was [V: greeted] [ARG0: by Carl Ford] [ARGM-TMP: yesterday]
----
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


In [15]:
#create samples
testcase_name = 'English_names_itwas_passive'
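#NOTE: the template string ends with a stray apostrophe (yesterday'), which shows up as a final token in the outputs below
t = editor.template("It was {first} {last} who was {verb} by {first_name} {last_name} yesterday'", first=english_firstname, last=editor.lexicons.last_from.United_Kingdom, verb=passive_verbs, meta=True, nsamples=100)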

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(
    t, capability, testcase_name,
    expect_arg1_verb2, format_srl_verb2,
    predict_srl, predict_srlbert, test_data,
    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    2 (2.0%)

Example fails:
It was Amanda Griffiths [R-ARG1: who] was [V: killed] [ARG0: by Emma Cooper] [ARGM-TMP: yesterday] '
----
It was [ARG1: Nicola] Richards [R-ARG1: who] was [V: silenced] [ARG0: by Emily Allen yesterday] '
----
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


### Names only: Iranian names
Tests in the names-only setting, for Iranian names.

In [16]:
#create samples
testcase_name = 'Iranian_names_active'
t = editor.template("{first_name} {last_name} {verb} {first} {last} yesterday.", first=iran_names, last=iran_lastnames, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(
    t, capability, testcase_name,
    expect_arg1_verb0, format_srl_verb0,
    predict_srl, predict_srlbert, test_data,
    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    11 (11.0%)

Example fails:
[ARG0: William Brown] [V: touched] [ARG1: Nassim] [ARG2: Moradi] [ARGM-TMP: yesterday] .
----
[ARG0: Andrew Watson] [V: hurt] [ARG1: Rahman] [ARGM-MNR: Ghazi] [ARGM-TMP: yesterday] .
----
[ARG0: Alexander Johnson] [V: hit] [ARG1: Amir] [ARG2: Jamali] [ARGM-TMP: yesterday] .
----
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


In [17]:
#create samples
testcase_name = 'Iranian_names_passive'
t = editor.template("{first} {last} was {verb} by {first_name} {last_name} yesterday", first=iran_names, last=iran_lastnames, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_verb1, format_srl_verb1, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    5 (5.0%)

Example fails:
Niki [ARG1: Rahimi] was [V: hit] [ARG0: by Sally Miller] [ARGM-TMP: yesterday]
----
[ARGM-TMP: Cleopatra] [ARG1: Shariati] was [V: killed] [ARG0: by Jay Allen] [ARGM-TMP: yesterday]
----
Helen [ARG1: Rezaei] was [V: kissed] [ARG0: by Judith Nelson] [ARGM-TMP: yesterday]
----
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


In [18]:
#create samples (note: the template ends with a stray ' character, which also appears in the generated sentences)
testcase_name = 'Iranian_names_itwas_passive'
t = editor.template("It was {first} {last} who was {verb} by {first_name} {last_name} yesterday'", first=iran_names, last=iran_lastnames, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_verb2, format_srl_verb2, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    43 (43.0%)

Example fails:
It was Camelia Jabbari [R-ARG1: who] was [V: greeted] [ARG0: by Ben Brooks yesterday] '
----
It was Carina [ARG1: Mansourian] [R-ARG1: who] was [V: hurt] [ARG0: by Gary Stewart] [ARGM-TMP: yesterday] '
----
It was Mani Behbahani [R-ARG1: who] was [V: ignored] [ARG0: by Carolyn Hart] yesterday '
----
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


### Names only: Dutch names
Tests in the names-only setting, with Dutch names for the patient.

In [19]:
#create samples
testcase_name = 'Dutch_names_active'
t = editor.template("{first_name} {last_name} {verb} {first} {last} yesterday.", first=dutch_names, last=dutch_lastnames, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_verb0, format_srl_verb0, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    4 (4.0%)

Example fails:
[ARG1: Sue Nelson] [V: greeted] [ARG2: Wim Staal] [ARGM-TMP: yesterday] .
----
[ARG1: Pamela Mason] [V: touched] [ARG2: Maria Vos] [ARGM-TMP: yesterday] .
----
[ARG1: Betty Green] [V: greeted] [ARG2: Johannes Boersma] [ARGM-TMP: yesterday] .
----
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


In [20]:
#create samples
testcase_name = 'Dutch_names_passive'
t = editor.template("{first} {last} was {verb} by {first_name} {last_name} yesterday", first=dutch_names, last=dutch_lastnames, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_verb1, format_srl_verb1, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    8 (8.0%)

Example fails:
[ARGM-ADV: Cor] [ARG1: Pronk] was [V: kissed] [ARG0: by David Carter] [ARGM-TMP: yesterday]
----
Dirk [ARG1: Roos] was [V: hurt] [ARG0: by Alexander Walker] [ARGM-TMP: yesterday]
----
[ARGM-DIS: David] [ARG1: Roos] was [V: greeted] [ARG0: by Alexandra Robinson] [ARGM-TMP: yesterday]
----
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    1 (1.0%)

Example fails:
Henriëtte [ARG1: Vos] was [V: greeted] [ARG0: by Rose Cooper] [ARGM-TMP: yesterday]
----


In [21]:
#create samples (note: the template ends with a stray ' character, which also appears in the generated sentences)
testcase_name = 'Dutch_names_itwas_passive'
t = editor.template("It was {first} {last} who was {verb} by {first_name} {last_name} yesterday'", first=dutch_names, last=dutch_lastnames, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_verb2, format_srl_verb2, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    36 (36.0%)

Example fails:
It was Marlies [ARG1: Muller] [R-ARG1: who] was [V: killed] [ARG0: by Matt Butler yesterday] '
----
It was Liesbeth Polak [R-ARG1: who] was [V: killed] [ARG0: by Don James] [ARGM-TMP: yesterday] '
----
It was [ARG1: Henk] Rutten [R-ARG1: who] was [V: kissed] [ARG0: by Lawrence Thompson] [ARGM-TMP: yesterday] '
----
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


### Titles + names: English names
In the stereotypical version: 'Doctor' + male name; 'Nurse' + female name.

In [22]:
#create samples
testcase_name = 'English_title_stereotype_active'
t = editor.template("Nurse {female} {last_name} {verb} Doctor {first} {last} yesterday.", first=english_male, last=editor.lexicons.last_from.United_Kingdom, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_doctor_verb0, format_srl_verb0, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


In [23]:
#create samples
testcase_name = 'English_title_stereotype_passive'
t = editor.template("Doctor {first} {last} was {verb} by nurse {female} {last_name} yesterday.", first=english_male, last=editor.lexicons.last_from.United_Kingdom, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_doctor_verb1, format_srl_verb1, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


In [24]:
#create samples
testcase_name = 'English_title_stereotype_itwas_passive'
t = editor.template("It was Doctor {first} {last} who was {verb} by nurse {female} {last_name} yesterday.", first=english_male, last=editor.lexicons.last_from.United_Kingdom, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_doctor_verb2, format_srl_verb2, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


In the non-stereotypical version: 'Doctor' + female name; 'Nurse' + male name.

In [25]:
#create samples
testcase_name = 'English_title_nonstereotype_active'
t = editor.template("Nurse {male} {last_name} {verb} Doctor {first} {last} yesterday.", first=english_female, last=editor.lexicons.last_from.United_Kingdom, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_doctor_verb0, format_srl_verb0, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


In [26]:
#create samples
testcase_name = 'English_title_nonstereotype_passive'
t = editor.template("Doctor {first} {last} was {verb} by nurse {male} {last_name} yesterday.", first=english_female, last=editor.lexicons.last_from.United_Kingdom, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_doctor_verb1, format_srl_verb1, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    2 (2.0%)

Example fails:
[ARGM-DIS: Doctor Ethel] [ARG1: Clark] was [V: hit] [ARG0: by nurse Kevin Sullivan] [ARGM-TMP: yesterday] .
----
[ARGM-DIS: Doctor Ethel] [ARG1: Davies] was [V: hit] [ARG0: by nurse Chris Martin] [ARGM-TMP: yesterday] .
----
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


In [27]:
#create samples
testcase_name = 'English_title_nonstereotype_itwas_passive'
t = editor.template("It was Doctor {first} {last} who was {verb} by nurse {male} {last_name} yesterday.", first=english_female, last=editor.lexicons.last_from.United_Kingdom, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_doctor_verb2, format_srl_verb2, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


### Titles + names: Iranian names
In the stereotypical version: 'Doctor' + male name; 'Nurse' + female name.

In [28]:
#create samples
testcase_name = 'Iranian_title_stereotype_active'
t = editor.template("Nurse {female} {last_name} {verb} Doctor {first} {last} yesterday.", first=iran_male, last=iran_lastnames, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_doctor_verb0, format_srl_verb0, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    4 (4.0%)

Example fails:
[ARGM-TMP: Nurse] [ARG0: Deborah Robertson] [V: touched] [ARG1: Doctor Saeed] [ARGM-PRD: Razi] [ARGM-TMP: yesterday] .
----
Nurse [ARG0: Judith Sullivan] [V: kissed] [ARG1: Doctor Robert Khan yesterday] .
----
Nurse [ARG0: Louise Hamilton] [V: kissed] [ARG1: Doctor Daniel] [ARG2: Panahi] [ARGM-TMP: yesterday] .
----
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


In [29]:
#create samples
testcase_name = 'Iranian_title_stereotype_passive'
t = editor.template("Doctor {first} {last} was {verb} by nurse {female} {last_name} yesterday.", first=iran_male, last=iran_lastnames, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_doctor_verb1, format_srl_verb1, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    15 (15.0%)

Example fails:
[ARG2: Doctor Navid Zandi] was [V: touched] [ARG0: by nurse Laura Hamilton] [ARGM-TMP: yesterday] .
----
[ARG2: Doctor Jalil Fatemi] was [V: touched] [ARG0: by nurse Barbara Clark] [ARGM-TMP: yesterday] .
----
[ARG2: Doctor Mani] [ARG1: Peyrovani] was [V: kissed] [ARG0: by nurse Rebecca Coleman] [ARGM-TMP: yesterday] .
----
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


In [30]:
#create samples
testcase_name = 'Iranian_title_stereotype_itwas_passive'
t = editor.template("It was Doctor {first} {last} who was {verb} by nurse {female} {last_name} yesterday.", first=iran_male, last=iran_lastnames, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_doctor_verb2, format_srl_verb2, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


In the non-stereotypical version: 'Doctor' + female name; 'Nurse' + male name.

In [31]:
#create samples
testcase_name = 'Iranian_title_nonstereotype_active'
t = editor.template("Nurse {male} {last_name} {verb} Doctor {first} {last} yesterday.", first=iran_female, last=iran_lastnames, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_doctor_verb0, format_srl_verb0, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    1 (1.0%)

Example fails:
[ARGM-CAU: Nurse] [ARG0: Donald Cohen] [V: touched] [ARG1: Doctor Niki] [ARG2: Hassani] [ARGM-TMP: yesterday] .
----
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


In [32]:
#create samples
testcase_name = 'Iranian_title_nonstereotype_passive'
t = editor.template("Doctor {first} {last} was {verb} by nurse {male} {last_name} yesterday.", first=iran_female, last=iran_lastnames, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_doctor_verb1, format_srl_verb1, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    16 (16.0%)

Example fails:
[ARG2: Doctor Fatemeh Afshar] was [V: touched] [ARG0: by nurse Steve Gordon] [ARGM-TMP: yesterday] .
----
[ARG2: Doctor Negar Tabatabaei] was [V: touched] [ARG0: by nurse Richard White] [ARGM-TMP: yesterday] .
----
[ARG2: Doctor Fatemeh Rajabi] was [V: kissed] [ARG0: by nurse Ralph Bell] [ARGM-TMP: yesterday] .
----
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


In [33]:
#create samples
testcase_name = 'Iranian_title_nonstereotype_itwas_passive'
t = editor.template("It was Doctor {first} {last} who was {verb} by nurse {male} {last_name} yesterday.", first=iran_female, last=iran_lastnames, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_doctor_verb2, format_srl_verb2, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


### Titles + names: Dutch names
In the stereotypical version: 'Doctor' + male name; 'Nurse' + female name.

In [34]:
#create samples
testcase_name = 'Dutch_title_stereotype_active'
t = editor.template("Nurse {female} {last_name} {verb} Doctor {first} {last} yesterday.", first=dutch_male, last=dutch_lastnames, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_doctor_verb0, format_srl_verb0, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    6 (6.0%)

Example fails:
Nurse [ARG0: Rose Wright] [V: touched] [ARG1: Doctor Bas] [ARG2: Wiersma] [ARGM-TMP: yesterday] .
----
[ARGM-MOD: Nurse] [ARG0: Diane Price] [V: hurt] [ARG1: Doctor Eduard] [ARG2: Smulders] [ARGM-TMP: yesterday] .
----
Nurse [ARG0: Catherine Cohen] [V: hurt] [ARG1: Doctor Rob] [ARG2: Simons] [ARGM-TMP: yesterday] .
----
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


In [35]:
#create samples
testcase_name = 'Dutch_title_stereotype_passive'
t = editor.template("Doctor {first} {last} was {verb} by nurse {female} {last_name} yesterday.", first=dutch_male, last=dutch_lastnames, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_doctor_verb1, format_srl_verb1, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    6 (6.0%)

Example fails:
[ARG1: Doctor Jacques] Kuiper was [V: hit] [ARG0: by nurse Kathleen King] [ARGM-TMP: yesterday] .
----
[ARG2: Doctor Geert Molenaar] was [V: touched] [ARG0: by nurse Charlotte Foster] [ARGM-TMP: yesterday] .
----
[ARG2: Doctor Joop Dijkstra] was [V: kissed] [ARG0: by nurse Anne Kennedy] [ARGM-TMP: yesterday] .
----
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


In [36]:
#create samples
testcase_name = 'Dutch_title_stereotype_itwas_passive'
t = editor.template("It was Doctor {first} {last} who was {verb} by nurse {female} {last_name} yesterday.", first=dutch_male, last=dutch_lastnames, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_doctor_verb2, format_srl_verb2, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    1 (1.0%)

Example fails:
It was Doctor Erik [ARG1: Martens] [R-ARG1: who] was [V: greeted] [ARG0: by nurse Melissa Evans] [ARGM-TMP: yesterday] .
----
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


In the non-stereotypical version: 'Doctor' + female name; 'Nurse' + male name.

In [37]:
#create samples
testcase_name = 'Dutch_title_nonstereotype_active'
t = editor.template("Nurse {male} {last_name} {verb} Doctor {first} {last} yesterday.", first=dutch_female, last=dutch_lastnames, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_doctor_verb0, format_srl_verb0, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    2 (2.0%)

Example fails:
Nurse [ARG0: Ken Alexander] [V: hurt] [ARG1: Doctor] Helena Vonk yesterday .
----
Nurse [ARG0: Tom Gordon] [V: greeted] [ARG1: Doctor Nina] [ARG2: Wagenaar] [ARGM-TMP: yesterday] .
----
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


In [38]:
#create samples
testcase_name = 'Dutch_title_nonstereotype_passive'
t = editor.template("Doctor {first} {last} was {verb} by nurse {male} {last_name} yesterday.", first=dutch_female, last=dutch_lastnames, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_doctor_verb1, format_srl_verb1, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    7 (7.0%)

Example fails:
[ARG2: Doctor Ilse] [ARG1: Dijkstra] was [V: kissed] [ARG0: by nurse Colin Brooks] [ARGM-TMP: yesterday] .
----
[ARG2: Doctor Ineke] [ARG1: Rutten] was [V: kissed] [ARG0: by nurse Dick Perry] [ARGM-TMP: yesterday] .
----
[ARG2: Doctor Gerda] [ARG1: Vonk] was [V: kissed] [ARG0: by nurse Matthew Rose] [ARGM-TMP: yesterday] .
----
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


In [39]:
#create samples
testcase_name = 'Dutch_title_nonstereotype_itwas_passive'
t = editor.template("It was Doctor {first} {last} who was {verb} by nurse {male} {last_name} yesterday.", first=dutch_female, last=dutch_lastnames, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_doctor_verb2, format_srl_verb2, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)
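
To compare the stereotypical and non-stereotypical settings, the SRL fail counts printed in the title tests above can be collected in one place. This is a minimal sketch: the numbers are copied by hand from the cell outputs of this run, the dictionary layout and the `total_fails` helper are assumptions made for this example, and SRL BERT is left out because it reported 0 fails in every title test.

```python
# SRL fail counts (out of 100 sentences per test), copied from the outputs above.
srl_fails = {
    ("English", "stereotype"):    {"active": 0, "passive": 0,  "itwas_passive": 0},
    ("English", "nonstereotype"): {"active": 0, "passive": 2,  "itwas_passive": 0},
    ("Iranian", "stereotype"):    {"active": 4, "passive": 15, "itwas_passive": 0},
    ("Iranian", "nonstereotype"): {"active": 1, "passive": 16, "itwas_passive": 0},
    ("Dutch", "stereotype"):      {"active": 6, "passive": 6,  "itwas_passive": 1},
    ("Dutch", "nonstereotype"):   {"active": 2, "passive": 7,  "itwas_passive": 0},
}


def total_fails(setting):
    """Sum the fail counts over all cultures and structures for one setting."""
    return sum(sum(counts.values())
               for (_, s), counts in srl_fails.items() if s == setting)


# Print one row per culture and setting.
print(f"{'culture':8} {'setting':14} {'active':>6} {'passive':>7} {'it was':>6}")
for (culture, setting), counts in srl_fails.items():
    print(f"{culture:8} {setting:14} {counts['active']:>6} "
          f"{counts['passive']:>7} {counts['itwas_passive']:>6}")

print("stereotype total:   ", total_fails("stereotype"))
print("nonstereotype total:", total_fails("nonstereotype"))
```

On these counts the totals are 32 fails in the stereotypical setting and 28 in the non-stereotypical setting, each out of 900 sentences, so this run does not show the SRL model doing worse in the non-stereotypical context. Since each test draws a single random sample of 100 sentences, small differences between settings should be read with caution.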


### Store all data to JSON

In [40]:
#store the test sentences
store_data(test_sents_path, test_data, new_file=True)
#store the model predictions
store_data(bert_pred_path, SRLBERT_predictions, new_file=True)
store_data(srl_pred_path, SRL_predictions, new_file=True)