# SRL Patient recognition experiments

In this notebook I carry out experiments to test whether two Semantic Role Labelling (SRL) systems can correctly identify patients in sentences with varying structures. The code is based on code provided by Pia Sommerauer.

In this code I load two models, namely the AllenNLP SRL model and the AllenNLP SRL BERT model. I create a variety of test cases, for which I evaluate the performance of the two models. All the test sentences are stored in a JSON file specified through the `test_sents_path` variable. The SRL predictions are stored in the JSON file specified through `srl_pred_path`, and similarly the SRL BERT predictions are stored at the path `bert_pred_path`.

### Patient recognition
I carry out two sets of tests: in the first set the patient is a bare name; in the second set I add titles to the agent and patient: the patient is a 'Doctor' and the agent a 'nurse'. In both sets I test 3 sentence structures and names from 3 cultures: English, Iranian and Dutch. The sentence structures are as follows:
* Active: '`name1` kissed `name2` yesterday'
* Passive: '`name1` was kissed by `name2` yesterday'
* 'It was .. who' + passive: 'It was `name1` who was kissed by `name2` yesterday'
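
As a rough illustration of the three structures, the sketch below fills each pattern with plain string formatting. The patterns mirror the template strings used in the test cells; the `STRUCTURES` dict and the `fill` helper are hypothetical names used only here, since the notebook itself builds its sentences with Checklist's `editor.template`.

```python
# Hypothetical helper: fill the three sentence structures with an agent, a patient and a verb.
STRUCTURES = {
    "active": "{agent} {verb} {patient} yesterday.",
    "passive": "{patient} was {verb} by {agent} yesterday.",
    "itwas_passive": "It was {patient} who was {verb} by {agent} yesterday.",
}


def fill(structure, agent, patient, verb):
    """Return one test sentence for the given structure name."""
    return STRUCTURES[structure].format(agent=agent, patient=patient, verb=verb)


# In every structure the patient is the person undergoing the action.
for name in STRUCTURES:
    print(name, "->", fill(name, agent="Emma Cooper", patient="Amanda Griffiths", verb="kissed"))
```

Note that the patient moves from object position (active) to subject position (both passive variants), which is what makes the structures a useful contrast.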

For the doctor/nurse titles, I test both a stereotypical context, where the Doctor has a male name and the nurse a female name, and a non-stereotypical context, where the genders of the Doctor and the nurse are reversed. This tests whether the models have a gender bias: if they do, we expect better results in the stereotypical context.
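
One simple way to make this comparison concrete is to pool the failure counts per condition. The sketch below does this for the regular SRL model, using the fail counts printed in the title tests further down in this notebook (100 cases per test). The pooling and the names `failure_rate`, `stereotypical` and `non_stereotypical` are assumptions made for illustration; the notebook itself does not compute these totals.

```python
# SRL (non-BERT) fail counts per structure (active, passive, it-was passive),
# copied from the outputs of the title tests in this notebook.
CASES_PER_TEST = 100
stereotypical = {"English": [0, 0, 0], "Iranian": [4, 15, 0], "Dutch": [6, 6, 1]}
non_stereotypical = {"English": [0, 2, 0], "Iranian": [1, 16, 0], "Dutch": [2, 7, 0]}


def failure_rate(counts_by_group):
    """Pooled failure rate over all groups and structures."""
    fails = sum(sum(counts) for counts in counts_by_group.values())
    total = CASES_PER_TEST * sum(len(counts) for counts in counts_by_group.values())
    return fails / total


rate_stereo = failure_rate(stereotypical)
rate_non = failure_rate(non_stereotypical)
print(f"stereotypical: {rate_stereo:.1%}, non-stereotypical: {rate_non:.1%}")
```

Because the names are sampled at random for every test, small differences between the two rates should be read with caution.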


### Import libraries

In [1]:
from allennlp_models.pretrained import load_predictor

In [2]:
import checklist
from checklist.editor import Editor
from checklist.perturb import Perturb
from checklist.test_types import MFT, INV, DIR
from checklist.expect import Expect

In [3]:
from checklist.pred_wrapper import PredictorWrapper

In [4]:
import json
import logging
logger = logging.getLogger()
logger.setLevel(logging.CRITICAL)

In [5]:
from utils_functions import *

### Load the models 

In [None]:
# load the regular SRL model
srl_predictor = load_predictor('structured-prediction-srl')
# load the SRL BERT model
srlbert_predictor = load_predictor('structured-prediction-srl-bert')

In [7]:
#functions to create model predictions for a list containing sentences
### added by pia, edited by Goya ###

def predict_srl(data):
    pred = []
    for d in data:
        pred.append(srl_predictor.predict(d))
    return pred


def predict_srlbert(data):
    pred = []
    for d in data:
        pred.append(srlbert_predictor.predict(d))
    return pred

predict_srl = PredictorWrapper.wrap_predict(predict_srl)
predict_srlbert = PredictorWrapper.wrap_predict(predict_srlbert)
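
Checklist test runs expect a prediction function that returns both predictions and confidence scores, while the SRL predictors only return predictions. The sketch below assumes that `PredictorWrapper.wrap_predict` bridges this gap by pairing each prediction with a dummy confidence of 1; it is an approximation for illustration (with a plain list instead of an array, and a stand-in predictor called `fake_predict`), so check the Checklist documentation for the exact behaviour.

```python
def wrap_predict_sketch(predict_fn):
    """Approximate a prediction wrapper: attach a dummy confidence of 1.0 to each prediction."""
    def wrapped(inputs):
        preds = predict_fn(inputs)
        confs = [1.0] * len(preds)
        return preds, confs
    return wrapped


# Stand-in predictor (hypothetical) so the sketch runs without loading a model.
def fake_predict(sentences):
    return [{"words": sentence.split()} for sentence in sentences]


preds, confs = wrap_predict_sketch(fake_predict)(["Ann kissed Bob yesterday ."])
print(preds, confs)
```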

### Define output file paths

In [8]:
#create lists to store test sentences and model predictions in 
test_data = []
SRLBERT_predictions = []
SRL_predictions = []

In [9]:
#define paths to output files
test_sents_path = './JSON_test_and_predict_files/test_data_patient.json'
bert_pred_path = './JSON_test_and_predict_files/BERT_predictions_patient.json'
srl_pred_path = './JSON_test_and_predict_files/SRL_predictions_patient.json'

#set name of current capability
capability = 'patient_recognition'

### Load Checklist tests (Load functions defined in utils)
Load functions to test whether names are recognized as patients

In [10]:
#load functions that check whether the argument of interest is correctly predicted
expect_arg1_verb0 = Expect.single(found_arg1_people_verb0)
expect_arg1_verb1 = Expect.single(found_arg1_people_verb1)
expect_arg1_verb2 = Expect.single(found_arg1_people_verb2)

Load functions to recognize the title 'doctor' + a name as patient

In [11]:
#load functions that check whether the argument of interest is correctly predicted
expect_arg1_doctor_verb0 = Expect.single(found_arg1_doctor_verb0)
expect_arg1_doctor_verb1 = Expect.single(found_arg1_doctor_verb1)
expect_arg1_doctor_verb2 = Expect.single(found_arg1_doctor_verb2)
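
The `found_arg1_*` functions live in `utils_functions` and are not shown in this notebook. The sketch below is a guess at what such a check could look like, not the actual implementation. It assumes the usual AllenNLP SRL output (a `words` list plus one BIO `tags` list per predicate under `verbs`), that the template fills are available in `meta` under the keys `first` and `last`, and that the `verb0`/`verb1`/`verb2` suffix refers to the index of the predicate of interest (the passive sentences contain one or two auxiliary 'was' predicates before the main verb). The names `get_arg_tokens` and `found_patient_sketch` are hypothetical.

```python
def get_arg_tokens(pred, arg_label="ARG1", verb_index=0):
    """Collect the words whose BIO tag belongs to `arg_label` for one predicate."""
    tags = pred["verbs"][verb_index]["tags"]
    return [
        word
        for word, tag in zip(pred["words"], tags)
        if tag != "O" and tag.split("-", 1)[1] == arg_label
    ]


def found_patient_sketch(x, pred, conf, label=None, meta=None):
    """Pass when the ARG1 span of the first predicate is exactly the patient's full name."""
    patient = meta["first"] + " " + meta["last"]
    return " ".join(get_arg_tokens(pred, "ARG1", verb_index=0)) == patient


# Hand-written prediction in the assumed format: ARG1 is wrongly placed on 'Eric'.
wrong = {
    "words": ["Eric", "Reed", "hurt", "Ernest", "Miller", "yesterday", "."],
    "verbs": [{"verb": "hurt",
               "tags": ["B-ARG1", "O", "B-V", "B-ARGM-LOC", "I-ARGM-LOC", "B-ARGM-TMP", "O"]}],
}
meta = {"first": "Ernest", "last": "Miller"}
print(found_patient_sketch("Eric Reed hurt Ernest Miller yesterday.", wrong, 1.0, meta=meta))
```

A function with this signature can be passed to `Expect.single`, which applies it to every example separately.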

### Load wordlists to use in sample sentences

In [12]:
# initialize editor object
editor = Editor()
#import alphabet detector to ensure we only use latin characters
from alphabet_detector import AlphabetDetector
ad = AlphabetDetector()

#get lists of names from the different countries
english_firstname = editor.lexicons.female_from.United_Kingdom + editor.lexicons.male_from.United_Kingdom
english_male = editor.lexicons.male_from.United_Kingdom 
english_female = editor.lexicons.female_from.United_Kingdom

#get iranian names, only those in latin characters
iran_lastnames = [name for name in editor.lexicons.last_from.Iran if ad.only_alphabet_chars(name, "LATIN")]
iran_female = [name for name in editor.lexicons.female_from.Iran if ad.only_alphabet_chars(name, "LATIN")]
iran_male = [name for name in editor.lexicons.male_from.Iran if ad.only_alphabet_chars(name, "LATIN")]
iran_names = iran_female + iran_male

#get Dutch names
dutch_male = editor.lexicons.male_from.the_Netherlands
dutch_female = editor.lexicons.female_from.the_Netherlands
dutch_names = dutch_female +  dutch_male
dutch_lastnames = editor.lexicons.last_from.the_Netherlands

# a list of verbs to use in the test cases
passive_verbs = ['kissed', 'killed', 'hurt', 'touched', 'ignored', 'silenced', 'hit', 'greeted']
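
The Latin-script filter applied to the Iranian names can be approximated with the standard library alone. The sketch below assumes that the check comes down to every alphabetic character having 'LATIN' in its Unicode name; `only_latin_letters` is a hypothetical stand-in for `ad.only_alphabet_chars(name, "LATIN")`, not the library's code.

```python
import unicodedata


def only_latin_letters(text):
    """True when every alphabetic character in `text` has 'LATIN' in its Unicode name."""
    return all("LATIN" in unicodedata.name(ch, "") for ch in text if ch.isalpha())


# The last entry is 'Fatemeh' written in Persian script, given as Unicode escapes.
names = ["Henriëtte", "Fatemeh", "\u0641\u0627\u0637\u0645\u0647"]
latin_names = [name for name in names if only_latin_letters(name)]
print(latin_names)
```

Accented Latin letters such as 'ë' pass the filter, so names like 'Henriëtte' stay in the lists.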

## Tests

### Names only: English names
Tests in the name-only setting, for English names

In [13]:
#create samples
testcase_name = 'English_names_active'
t = editor.template("{first_name} {last_name} {verb} {first} {last} yesterday.", first=english_firstname, last=editor.lexicons.last_from.United_Kingdom, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_verb0, format_srl_verb0, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    2 (2.0%)

Example fails:
[ARG1: Eric] Reed [V: hurt] [ARGM-LOC: Ernest Miller] [ARGM-TMP: yesterday] .
----
[ARG1: Adam] [ARGM-ADV: Walker] [V: hurt] [ARG2: Sara Miller] [ARGM-TMP: yesterday] .
----
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)
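
Each failure line in these outputs is a bracketed rendering of the roles predicted for one predicate. A small parser makes the lines easier to inspect; `parse_description` is a hypothetical helper, not part of the notebook or of AllenNLP. In the first SRL failure above, ARG1 is attached to 'Eric' (part of the agent), while the intended patient 'Ernest Miller' is labelled ARGM-LOC.

```python
import re


def parse_description(description):
    """Return (label, text) pairs for every bracketed span in an SRL description string."""
    return re.findall(r"\[([A-Z0-9-]+): ([^\]]+)\]", description)


failed = "[ARG1: Eric] Reed [V: hurt] [ARGM-LOC: Ernest Miller] [ARGM-TMP: yesterday] ."
roles = parse_description(failed)
for label, text in roles:
    print(f"{label:10s} {text}")
```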


In [14]:
#create samples
testcase_name = 'English_names_passive'
t = editor.template("{first} {last} was {verb} by {first_name} {last_name} yesterday", first=english_firstname, last=editor.lexicons.last_from.United_Kingdom, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_verb1, format_srl_verb1, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    1 (1.0%)

Example fails:
[ARGM-MNR: David] [ARG1: Griffiths] was [V: greeted] [ARG0: by Carl Ford] [ARGM-TMP: yesterday]
----
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


In [15]:
#create samples (the template ends in a stray apostrophe, which also shows up in the outputs below)
testcase_name = 'English_names_itwas_passive'
t = editor.template("It was {first} {last} who was {verb} by {first_name} {last_name} yesterday'", first=english_firstname, last=editor.lexicons.last_from.United_Kingdom, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_verb2, format_srl_verb2, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    2 (2.0%)

Example fails:
It was Amanda Griffiths [R-ARG1: who] was [V: killed] [ARG0: by Emma Cooper] [ARGM-TMP: yesterday] '
----
It was [ARG1: Nicola] Richards [R-ARG1: who] was [V: silenced] [ARG0: by Emily Allen yesterday] '
----
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


### Names only: Iranian names
Tests in the name-only setting, for Iranian names

In [16]:
#create samples
testcase_name = 'Iranian_names_active'
t = editor.template("{first_name} {last_name} {verb} {first} {last} yesterday.", first=iran_names, last=iran_lastnames, verb=passive_verbs, meta=True, nsamples=100)


test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_verb0, format_srl_verb0, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    11 (11.0%)

Example fails:
[ARG0: William Brown] [V: touched] [ARG1: Nassim] [ARG2: Moradi] [ARGM-TMP: yesterday] .
----
[ARG0: Andrew Watson] [V: hurt] [ARG1: Rahman] [ARGM-MNR: Ghazi] [ARGM-TMP: yesterday] .
----
[ARG0: Alexander Johnson] [V: hit] [ARG1: Amir] [ARG2: Jamali] [ARGM-TMP: yesterday] .
----
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


In [17]:
#create samples
testcase_name = 'Iranian_names_passive'
t = editor.template("{first} {last} was {verb} by {first_name} {last_name} yesterday", first=iran_names, last=iran_lastnames, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_verb1, format_srl_verb1, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    5 (5.0%)

Example fails:
Niki [ARG1: Rahimi] was [V: hit] [ARG0: by Sally Miller] [ARGM-TMP: yesterday]
----
[ARGM-TMP: Cleopatra] [ARG1: Shariati] was [V: killed] [ARG0: by Jay Allen] [ARGM-TMP: yesterday]
----
Helen [ARG1: Rezaei] was [V: kissed] [ARG0: by Judith Nelson] [ARGM-TMP: yesterday]
----
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


In [18]:
#create samples (the template ends in a stray apostrophe, which also shows up in the outputs below)
testcase_name = 'Iranian_names_itwas_passive'
t = editor.template("It was {first} {last} who was {verb} by {first_name} {last_name} yesterday'", first=iran_names, last=iran_lastnames, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_verb2, format_srl_verb2, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    43 (43.0%)

Example fails:
It was Camelia Jabbari [R-ARG1: who] was [V: greeted] [ARG0: by Ben Brooks yesterday] '
----
It was Carina [ARG1: Mansourian] [R-ARG1: who] was [V: hurt] [ARG0: by Gary Stewart] [ARGM-TMP: yesterday] '
----
It was Mani Behbahani [R-ARG1: who] was [V: ignored] [ARG0: by Carolyn Hart] yesterday '
----
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


### Names only: Dutch names
Tests in the name-only setting, for Dutch names

In [19]:
#create samples
testcase_name = 'Dutch_names_active'
t = editor.template("{first_name} {last_name} {verb} {first} {last} yesterday.", first=dutch_names, last=dutch_lastnames, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_verb0, format_srl_verb0, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    4 (4.0%)

Example fails:
[ARG1: Sue Nelson] [V: greeted] [ARG2: Wim Staal] [ARGM-TMP: yesterday] .
----
[ARG1: Pamela Mason] [V: touched] [ARG2: Maria Vos] [ARGM-TMP: yesterday] .
----
[ARG1: Betty Green] [V: greeted] [ARG2: Johannes Boersma] [ARGM-TMP: yesterday] .
----
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


In [20]:
#create samples
testcase_name = 'Dutch_names_passive'
t = editor.template("{first} {last} was {verb} by {first_name} {last_name} yesterday", first=dutch_names, last=dutch_lastnames, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_verb1, format_srl_verb1, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    8 (8.0%)

Example fails:
[ARGM-ADV: Cor] [ARG1: Pronk] was [V: kissed] [ARG0: by David Carter] [ARGM-TMP: yesterday]
----
Dirk [ARG1: Roos] was [V: hurt] [ARG0: by Alexander Walker] [ARGM-TMP: yesterday]
----
[ARGM-DIS: David] [ARG1: Roos] was [V: greeted] [ARG0: by Alexandra Robinson] [ARGM-TMP: yesterday]
----
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    1 (1.0%)

Example fails:
Henriëtte [ARG1: Vos] was [V: greeted] [ARG0: by Rose Cooper] [ARGM-TMP: yesterday]
----


In [21]:
#create samples (the template ends in a stray apostrophe, which also shows up in the outputs below)
testcase_name = 'Dutch_names_itwas_passive'
t = editor.template("It was {first} {last} who was {verb} by {first_name} {last_name} yesterday'", first=dutch_names, last=dutch_lastnames, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_verb2, format_srl_verb2, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    36 (36.0%)

Example fails:
It was Marlies [ARG1: Muller] [R-ARG1: who] was [V: killed] [ARG0: by Matt Butler yesterday] '
----
It was Liesbeth Polak [R-ARG1: who] was [V: killed] [ARG0: by Don James] [ARGM-TMP: yesterday] '
----
It was [ARG1: Henk] Rutten [R-ARG1: who] was [V: kissed] [ARG0: by Lawrence Thompson] [ARGM-TMP: yesterday] '
----
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


### Titles + names: English names
In the stereotypical version: 'Doctor' + male name ; 'Nurse' + female name

In [22]:
#create samples
testcase_name = 'English_title_stereotype_active'
t = editor.template("Nurse {female} {last_name} {verb} Doctor {first} {last} yesterday.", first=english_male, last=editor.lexicons.last_from.United_Kingdom, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_doctor_verb0, format_srl_verb0, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


In [23]:
#create samples
testcase_name = 'English_title_stereotype_passive'
t = editor.template("Doctor {first} {last} was {verb} by nurse {female} {last_name} yesterday.", first=english_male, last=editor.lexicons.last_from.United_Kingdom, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_doctor_verb1, format_srl_verb1, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


In [24]:
#create samples
testcase_name = 'English_title_stereotype_itwas_passive'
t = editor.template("It was Doctor {first} {last} who was {verb} by nurse {female} {last_name} yesterday.", first=english_male, last=editor.lexicons.last_from.United_Kingdom, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_doctor_verb2, format_srl_verb2, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


In the non-stereotypical version: 'Doctor' + female name ; 'Nurse' + male name

In [25]:
#create samples
testcase_name = 'English_title_nonstereotype_active'
t = editor.template("Nurse {male} {last_name} {verb} Doctor {first} {last} yesterday.", first=english_female, last=editor.lexicons.last_from.United_Kingdom, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_doctor_verb0, format_srl_verb0, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


In [26]:
#create samples
testcase_name = 'English_title_nonstereotype_passive'
t = editor.template("Doctor {first} {last} was {verb} by nurse {male} {last_name} yesterday.", first=english_female, last=editor.lexicons.last_from.United_Kingdom, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_doctor_verb1, format_srl_verb1, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    2 (2.0%)

Example fails:
[ARGM-DIS: Doctor Ethel] [ARG1: Clark] was [V: hit] [ARG0: by nurse Kevin Sullivan] [ARGM-TMP: yesterday] .
----
[ARGM-DIS: Doctor Ethel] [ARG1: Davies] was [V: hit] [ARG0: by nurse Chris Martin] [ARGM-TMP: yesterday] .
----
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


In [27]:
#create samples
testcase_name = 'English_title_nonstereotype_itwas_passive'
t = editor.template("It was Doctor {first} {last} who was {verb} by nurse {male} {last_name} yesterday.", first=english_female, last=editor.lexicons.last_from.United_Kingdom, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_doctor_verb2, format_srl_verb2, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


### Titles + names: Iranian names
In the stereotypical version: 'Doctor' + male name ; 'Nurse' + female name

In [28]:
#create samples
testcase_name = 'Iranian_title_stereotype_active'
t = editor.template("Nurse {female} {last_name} {verb} Doctor {first} {last} yesterday.", first=iran_male, last=iran_lastnames, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_doctor_verb0, format_srl_verb0, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    4 (4.0%)

Example fails:
[ARGM-TMP: Nurse] [ARG0: Deborah Robertson] [V: touched] [ARG1: Doctor Saeed] [ARGM-PRD: Razi] [ARGM-TMP: yesterday] .
----
Nurse [ARG0: Judith Sullivan] [V: kissed] [ARG1: Doctor Robert Khan yesterday] .
----
Nurse [ARG0: Louise Hamilton] [V: kissed] [ARG1: Doctor Daniel] [ARG2: Panahi] [ARGM-TMP: yesterday] .
----
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


In [29]:
#create samples
testcase_name = 'Iranian_title_stereotype_passive'
t = editor.template("Doctor {first} {last} was {verb} by nurse {female} {last_name} yesterday.", first=iran_male, last=iran_lastnames, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_doctor_verb1, format_srl_verb1, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    15 (15.0%)

Example fails:
[ARG2: Doctor Navid Zandi] was [V: touched] [ARG0: by nurse Laura Hamilton] [ARGM-TMP: yesterday] .
----
[ARG2: Doctor Jalil Fatemi] was [V: touched] [ARG0: by nurse Barbara Clark] [ARGM-TMP: yesterday] .
----
[ARG2: Doctor Mani] [ARG1: Peyrovani] was [V: kissed] [ARG0: by nurse Rebecca Coleman] [ARGM-TMP: yesterday] .
----
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


In [30]:
#create samples
testcase_name = 'Iranian_title_stereotype_itwas_passive'
t = editor.template("It was Doctor {first} {last} who was {verb} by nurse {female} {last_name} yesterday.", first=iran_male, last=iran_lastnames, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_doctor_verb2, format_srl_verb2, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


In the non-stereotypical version: 'Doctor' + female name ; 'Nurse' + male name

In [31]:
#create samples
testcase_name = 'Iranian_title_nonstereotype_active'
t = editor.template("Nurse {male} {last_name} {verb} Doctor {first} {last} yesterday.", first=iran_female, last=iran_lastnames, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_doctor_verb0, format_srl_verb0, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    1 (1.0%)

Example fails:
[ARGM-CAU: Nurse] [ARG0: Donald Cohen] [V: touched] [ARG1: Doctor Niki] [ARG2: Hassani] [ARGM-TMP: yesterday] .
----
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


In [32]:
#create samples
testcase_name = 'Iranian_title_nonstereotype_passive'
t = editor.template("Doctor {first} {last} was {verb} by nurse {male} {last_name} yesterday.", first=iran_female, last=iran_lastnames, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_doctor_verb1, format_srl_verb1, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    16 (16.0%)

Example fails:
[ARG2: Doctor Fatemeh Afshar] was [V: touched] [ARG0: by nurse Steve Gordon] [ARGM-TMP: yesterday] .
----
[ARG2: Doctor Negar Tabatabaei] was [V: touched] [ARG0: by nurse Richard White] [ARGM-TMP: yesterday] .
----
[ARG2: Doctor Fatemeh Rajabi] was [V: kissed] [ARG0: by nurse Ralph Bell] [ARGM-TMP: yesterday] .
----
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


In [33]:
#create samples
testcase_name = 'Iranian_title_nonstereotype_itwas_passive'
t = editor.template("It was Doctor {first} {last} who was {verb} by nurse {male} {last_name} yesterday.", first=iran_female, last=iran_lastnames, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_doctor_verb2, format_srl_verb2, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


### Titles + names: Dutch names
In the stereotypical version: 'Doctor' + male name ; 'Nurse' + female name

In [34]:
#create samples
testcase_name = 'Dutch_title_stereotype_active'
t = editor.template("Nurse {female} {last_name} {verb} Doctor {first} {last} yesterday.", first=dutch_male, last=dutch_lastnames, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_doctor_verb0, format_srl_verb0, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    6 (6.0%)

Example fails:
Nurse [ARG0: Rose Wright] [V: touched] [ARG1: Doctor Bas] [ARG2: Wiersma] [ARGM-TMP: yesterday] .
----
[ARGM-MOD: Nurse] [ARG0: Diane Price] [V: hurt] [ARG1: Doctor Eduard] [ARG2: Smulders] [ARGM-TMP: yesterday] .
----
Nurse [ARG0: Catherine Cohen] [V: hurt] [ARG1: Doctor Rob] [ARG2: Simons] [ARGM-TMP: yesterday] .
----
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


In [35]:
#create samples
testcase_name = 'Dutch_title_stereotype_passive'
t = editor.template("Doctor {first} {last} was {verb} by nurse {female} {last_name} yesterday.", first=dutch_male, last=dutch_lastnames, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_doctor_verb1, format_srl_verb1, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    6 (6.0%)

Example fails:
[ARG1: Doctor Jacques] Kuiper was [V: hit] [ARG0: by nurse Kathleen King] [ARGM-TMP: yesterday] .
----
[ARG2: Doctor Geert Molenaar] was [V: touched] [ARG0: by nurse Charlotte Foster] [ARGM-TMP: yesterday] .
----
[ARG2: Doctor Joop Dijkstra] was [V: kissed] [ARG0: by nurse Anne Kennedy] [ARGM-TMP: yesterday] .
----
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


In [36]:
#create samples
testcase_name = 'Dutch_title_stereotype_itwas_passive'
t = editor.template("It was Doctor {first} {last} who was {verb} by nurse {female} {last_name} yesterday.", first=dutch_male, last=dutch_lastnames, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_doctor_verb2, format_srl_verb2, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    1 (1.0%)

Example fails:
It was Doctor Erik [ARG1: Martens] [R-ARG1: who] was [V: greeted] [ARG0: by nurse Melissa Evans] [ARGM-TMP: yesterday] .
----
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


In the non-stereotypical version: 'Doctor' + female name ; 'Nurse' + male name

In [37]:
#create samples
testcase_name = 'Dutch_title_nonstereotype_active'
t = editor.template("Nurse {male} {last_name} {verb} Doctor {first} {last} yesterday.", first=dutch_female, last=dutch_lastnames, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_doctor_verb0, format_srl_verb0, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    2 (2.0%)

Example fails:
Nurse [ARG0: Ken Alexander] [V: hurt] [ARG1: Doctor] Helena Vonk yesterday .
----
Nurse [ARG0: Tom Gordon] [V: greeted] [ARG1: Doctor Nina] [ARG2: Wagenaar] [ARGM-TMP: yesterday] .
----
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


In [38]:
#create samples
testcase_name = 'Dutch_title_nonstereotype_passive'
t = editor.template("Doctor {first} {last} was {verb} by nurse {male} {last_name} yesterday.", first=dutch_female, last=dutch_lastnames, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_doctor_verb1, format_srl_verb1, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    7 (7.0%)

Example fails:
[ARG2: Doctor Ilse] [ARG1: Dijkstra] was [V: kissed] [ARG0: by nurse Colin Brooks] [ARGM-TMP: yesterday] .
----
[ARG2: Doctor Ineke] [ARG1: Rutten] was [V: kissed] [ARG0: by nurse Dick Perry] [ARGM-TMP: yesterday] .
----
[ARG2: Doctor Gerda] [ARG1: Vonk] was [V: kissed] [ARG0: by nurse Matthew Rose] [ARGM-TMP: yesterday] .
----
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


In [39]:
#create samples
testcase_name = 'Dutch_title_nonstereotype_itwas_passive'
t = editor.template("It was Doctor {first} {last} who was {verb} by nurse {male} {last_name} yesterday.", first=dutch_female, last=dutch_lastnames, verb=passive_verbs, meta=True, nsamples=100)

#make and store predictions for the two models
test_data, SRL_predictions, SRLBERT_predictions = predict_and_store(t, capability, testcase_name, \
                                                                    expect_arg1_doctor_verb2, format_srl_verb2, \
                                                                    predict_srl, predict_srlbert, test_data, \
                                                                    SRL_predictions, SRLBERT_predictions)

SRL
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)
SRL BERT
Predicting 100 examples
Test cases:      100
Fails (rate):    0 (0.0%)


### Store all data to JSON

In [40]:
#store the test sentences
store_data(test_sents_path, test_data, new_file=True)
#store the model predictions
store_data(bert_pred_path, SRLBERT_predictions, new_file=True)
store_data(srl_pred_path, SRL_predictions, new_file=True)
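
`store_data` is defined in `utils_functions` and its body is not shown here. As a minimal sketch of what a helper with this call shape might do, the code below writes a list of records to a JSON file and, when `new_file` is false, extends the list already stored at that path. The function name `store_records`, the append behaviour and the record fields are all assumptions made for illustration.

```python
import json
import os
import tempfile


def store_records(path, records, new_file=True):
    """Write `records` to `path` as JSON; with new_file=False, extend the list already stored there."""
    if not new_file and os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            records = json.load(f) + records
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


# Write to a temporary directory so the sketch does not touch the real output files.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.json")
store_records(demo_path, [{"capability": "patient_recognition", "sentence": "Ann kissed Bob yesterday."}])
store_records(demo_path, [{"capability": "patient_recognition", "sentence": "Bob was kissed by Ann yesterday."}], new_file=False)
with open(demo_path, encoding="utf-8") as f:
    stored = json.load(f)
print(len(stored), "records stored")
```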