
# Probing for Dutch Relative Pronoun Choice

## Introduction 

Delobelle et al. (2020) test RobBERT, a Dutch language model using the RoBERTa architecture and training objective, on a task where the model has to predict *die* or *dat*. This is a so-called masked prediction task: one word in a sentence is replaced with the special token [MASK], and the model has to predict which of two candidate words (*die* or *dat*) is more likely in the masked position. Neural language models take the full context into account to make the prediction. Delobelle et al. collect data from the Dutch section of the Europarl corpus, using all sentences containing *die* or *dat* as test cases (288K sentences). The task is inspired by a similar experiment by Allein et al. (2020), who use data from Europarl and SONAR to train a neural classifier for the task: an LSTM initialized with word embeddings obtained from a word2vec model. They obtain accuracies of 83.2% (Europarl) and 84.5% (SONAR). (The majority baseline for Europarl is 66%, as reported by Delobelle et al.)
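As a rough illustration of how a single test item for such a masked prediction task can be built, the sketch below replaces the first *die* or *dat* token of a sentence with a mask token and keeps the original word as the gold label. The function name `mask_pronoun`, the whitespace tokenization, and the choice to mask the first occurrence are assumptions made for illustration; they are not taken from Delobelle et al.

```python
def mask_pronoun(sentence, mask_token="[MASK]", targets=("die", "dat")):
    """Replace the first target token with mask_token; return (masked, gold).

    Returns None when the sentence contains no target token.
    Assumes whitespace-separated tokens (a simplification).
    """
    tokens = sentence.split()
    for i, token in enumerate(tokens):
        if token.lower() in targets:
            gold = token.lower()
            masked = tokens[:i] + [mask_token] + tokens[i + 1:]
            return " ".join(masked), gold
    return None


example = "Melk is het enige product van de koe dat aan de paniek is ontsnapt ."
print(mask_pronoun(example))
```

RoBERTa-style tokenizers use `<mask>` rather than `[MASK]`, so the `mask_token` argument would have to be changed for those models.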

Delobelle et al. report very high accuracy for RobBERT as well as for Bertje and mBERT (98.2-99.2% for all models). It should be noted, though, that in many cases there is no real ambiguity, because (1) *dat* is used as a subordinator rather than as a pronoun, or (2) *die* or *dat* is used but there are no preceding nouns with the opposite gender. Also, in cases where *die* or *dat* is used as a deictic pronoun introducing an NP (*die jongen*), the noun with which the pronoun agrees is usually adjacent or very close to the pronoun (with no intervening distractors), making the task superficially easy.

We turn the task into a proper probing task, and make it harder, by zooming in on cases where

1. the masked position is a relative pronoun, and
2. the relative pronoun is part of a complex NP of the form *...noun1 ... van ... noun2 ... RelClause*, where the two nouns are of opposite gender and the relative clause is attached either to the higher noun1 (the less likely configuration in Dutch) or to the lower noun2.

Both nouns have to be singular, as plural nouns always take *die* as relative pronoun. In these cases, predicting the correct relative pronoun amounts to recognizing that the mask introduces a relative clause that has been attached to either the higher or the lower noun. In this way, we can use pronoun choice to probe whether the model is sensitive to relative clause attachment, even though the language model itself only predicts the likelihood of strings, not of structures. It should be noted, though, that in contrast with subject-verb agreement, the right context potentially plays a role in the prediction, as some relative clauses might be more likely than others to be introduced by a *die*/*dat* relative pronoun (which often has the role of subject in the relative clause).

Dutch has a gender system, where singular nouns are either neuter or nonneuter. The distinction is relevant for choice of the definite determiner (het or de), inflection of prenominal adjectives in indefinite NPs, and relative pronoun choice (die or dat). Here we are interested in the latter phenomenon.
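The agreement rule that the probing task relies on can be written down in a few lines. The sketch below is only an illustration of the rule as stated above (plural nouns always take *die*; singular neuter nouns take *dat*); the function name and the `"de"`/`"het"` and `"sg"`/`"pl"` labels are assumptions chosen to mirror the attribute values used in the corpus query later in this notebook.

```python
def relative_pronoun(gender, number):
    """Return the relative pronoun (die/dat) that agrees with an antecedent noun.

    gender: "de" (nonneuter) or "het" (neuter); number: "sg" or "pl".
    """
    if number == "pl":
        return "die"   # plural nouns always take die
    if gender == "het":
        return "dat"   # singular neuter noun
    return "die"       # singular nonneuter noun


# When noun1 and noun2 are singular and differ in gender, the two
# pronouns differ, so the pronoun reveals the attachment site.
print(relative_pronoun("het", "sg"), relative_pronoun("de", "sg"))
```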

### Hypothesis

1. If the model prefers *die* over *dat*, it has a preference for the most frequent pronoun, which overrides context.
2. If the model does better on low-attachment cases than on high-attachment cases, it takes local context into account, but does not fully take the wider context into account.
3. If the model does better on low-attachment cases than on high-attachment cases, it takes preceding context into account, and not only following context.

## Relative Clause Attachment 

Relative clause attachment has been studied extensively from a psycholinguistic perspective (see Desmet et al. 2006 for a study on Dutch). The canonical example is

> Someone shot the servant of the actress who was on the balcony.
     
Here, the relative clause could be attached either to the higher NP (the servant) or to the lower NP (the actress). It has been claimed that languages differ in their preference for high or low attachment. Desmet et al. write:

> Probably the most interesting finding about the syntactic ambiguity in (1) is that the preferred interpretation differs across languages, with English preferring low attachment, and many other languages (Dutch, French, German, Spanish) preferring high attachment (for an overview, see Mitchell & Brysbaert, 1998). In line with the tuning hypothesis, evidence has been obtained that in English text corpora low attachment is more prevalent than high attachment, whereas in Spanish and French the reverse pattern was found (Baltazart & Kister, 1995; Corley, 1996; Cuetos et al., 1996; Mitchell & Brysbaert, 1998; Mitchell et al., 1995).

Desmet et al. point out that the previously observed preference for high attachment in fact does not exist once one takes into account that Dutch has a gender system, which disambiguates many of the potentially ambiguous cases [CHECK]. In our newspaper data we also find that, in ambiguous as well as non-ambiguous contexts, low attachment is far more frequent than high attachment. In this experiment, we use this ambiguity, in combination with the fact that Dutch relative pronouns agree with the gender of the antecedent, to design a probing task for neural language models.

### Data collection

We search the newspaper section of the SONAR/LassyLarge corpus for relevant examples, using the automatically parsed data of the LassyLarge corpus (Van Noord et al., 20xx). We use the following query:

    '//node[@cat="np" and node[@pos="noun" and @gen="het" and @num="sg"] 
                      and node[@rel="mod" and @cat="pp" and node[@rel="hd" and @root="van"]]/
                            node[@cat="np" and node[@pos="noun" and @gen="de" and @num="sg"]]]/
          node[@rel="mod" and @cat="rel"]/node[@rel="rhd" and @word="dat"]'

This query searches for an NP containing a het-noun as head, a PP containing an NP with a de-noun as head, and finally a relative clause introduced by *dat*. Note that because the relative clause is an immediate daughter of the highest node with category NP, the relative clause is attached to the higher noun. A similar query is used for the *die*-cases (with opposite gender specifications).
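To make the structure targeted by the query concrete, the sketch below walks a small hand-written tree with Python's standard `xml.etree.ElementTree` and checks the same conditions step by step. The toy tree is invented for illustration and only carries the attributes that the query mentions; real LassyLarge trees contain many more. The helper names are hypothetical, and in practice the XPath query itself would be run with a dedicated XPath tool rather than re-implemented this way.

```python
import xml.etree.ElementTree as ET

# Toy Alpino-style tree (invented) for "het product van de koe dat ...".
TOY_TREE = """
<alpino_ds>
  <node cat="top" rel="top">
    <node cat="np" rel="predc">
      <node rel="det" pos="det" word="het" root="het"/>
      <node rel="hd" pos="noun" gen="het" num="sg" word="product" root="product"/>
      <node rel="mod" cat="pp">
        <node rel="hd" pos="prep" word="van" root="van"/>
        <node rel="obj1" cat="np">
          <node rel="det" pos="det" word="de" root="de"/>
          <node rel="hd" pos="noun" gen="de" num="sg" word="koe" root="koe"/>
        </node>
      </node>
      <node rel="mod" cat="rel">
        <node rel="rhd" pos="pron" word="dat" root="dat"/>
        <node rel="body" cat="ssub"/>
      </node>
    </node>
  </node>
</alpino_ds>
"""


def has_sg_noun(node, gender):
    """True if node has a singular noun daughter of the given gender."""
    return any(d.get("pos") == "noun" and d.get("gen") == gender and d.get("num") == "sg"
               for d in node.findall("node"))


def has_van_pp(node, gender):
    """True if node has a PP modifier headed by 'van' whose NP holds such a noun."""
    for pp in node.findall("node"):
        if pp.get("rel") != "mod" or pp.get("cat") != "pp":
            continue
        if not any(h.get("rel") == "hd" and h.get("root") == "van"
                   for h in pp.findall("node")):
            continue
        if any(n.get("cat") == "np" and has_sg_noun(n, gender)
               for n in pp.findall("node")):
            return True
    return False


def high_attachment_hits(root, high="het", low="de", pronoun="dat"):
    """Return the relative-pronoun nodes of relatives attached to the higher NP."""
    hits = []
    for np_node in root.iter("node"):
        if np_node.get("cat") != "np":
            continue
        if not (has_sg_noun(np_node, high) and has_van_pp(np_node, low)):
            continue
        # The relative clause must be a direct daughter of this (higher) NP.
        for rel in np_node.findall("node"):
            if rel.get("rel") == "mod" and rel.get("cat") == "rel":
                hits.extend(r for r in rel.findall("node")
                            if r.get("rel") == "rhd" and r.get("word") == pronoun)
    return hits


root = ET.fromstring(TOY_TREE.strip())
print([h.get("word") for h in high_attachment_hits(root)])
```

Swapping the `high` and `low` arguments and setting `pronoun="die"` corresponds to the mirrored query for the *die*-cases.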

An example of our data is: 

>  Melk is het enige product van de koe [MASK] aan de paniek is ontsnapt .

Here, the correct choice is *dat*, as the relative clause modifies the neuter noun *product* and not the nonneuter noun *koe*. 

Note also that in some cases the masked sentence actually is ambiguous, in the sense that both *dat* and *die* result in a sentence with a plausible interpretation:

> Een dagboek van de nieuwe NCMV-topman [MASK] een tip oplicht van de geheimen van de onderhandeling .

Here, the choice for *dat* amounts to choosing *dagboek* as antecedent (in line with the source), whereas *die* amounts to choosing *NCMV-topman* as antecedent, a possibility that would result in a semantically plausible sentence as well.

False hits in the corpus:

> Het gevaar van een hernieuwde oorlog komt uit Rusland zelf , [MASK] in het verleden nooit lang bij de zijlijn is blijven staan in de Kaukasus .

Here, the correct antecedent is *Rusland* and not *gevaar* as the automatic parse assumes. 

Ambiguous cases were kept; false hits (wrong attachments, *dat* as subordinating conjunction) were removed. Duplicate sentences were also removed.
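The duplicate-removal step can be sketched as follows. This is a minimal illustration, assuming the test items are held as a list of strings and that two sentences count as duplicates when they are identical up to whitespace; the notebook does not say how duplicates were actually detected.

```python
def deduplicate(sentences):
    """Drop repeated sentences, keeping the first occurrence and the original order."""
    seen = set()
    unique = []
    for sentence in sentences:
        key = " ".join(sentence.split())  # normalise whitespace before comparing
        if key not in seen:
            seen.add(key)
            unique.append(sentence)
    return unique


print(deduplicate(["a b .", "c d .", "a  b ."]))
```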

### Attachment statistics

Desmet et al. cite the psycholinguistic literature as claiming that Dutch generally has a preference for high attachment. The raw frequencies in our corpus contradict this view (see discussion with Kisjes). So we assume that high-attachment cases are actually more challenging for the model, as in those cases the antecedent is separated from the pronoun by a distractor with opposite gender, and the corpus frequencies do not favor high attachment per se.
 
NB: Is LassyLarge used in training Bertje? Presumably yes.
  

* Allein, L., Leeuwenberg, A., & Moens, M.-F. (2020). Binary and multitask classification model for Dutch anaphora resolution: Die/dat prediction. arXiv preprint arXiv:2001.02943.
* Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., ... & Stoyanov, V. (2020, July). Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 8440-8451).
* Delobelle, P., Winters, T., & Berendt, B. (2020). RobBERT: A Dutch RoBERTa-based language model. arXiv preprint arXiv:2001.06286.
* Desmet, T., De Baecke, C., Drieghe, D., Brysbaert, M., & Vonk, W. (2006). Relative clause attachment in Dutch: On-line comprehension corresponds to corpus frequencies when lexical variables are taken into account. Language and Cognitive Processes, 21(4), 453-485. DOI: 10.1080/01690960400023485
* de Vries, W., van Cranenburgh, A., Bisazza, A., Caselli, T., van Noord, G., & Nissim, M. (2019). BERTje: A Dutch BERT model. arXiv preprint arXiv:1912.09582.


## Bertje experiments

Here we use the Bertje model as described in de Vries et al. (2019) and set up a masked language modeling task.

In [1]:
from transformers import pipeline
bertje = pipeline('fill-mask', model='GroNLP/bert-base-dutch-cased')


Some weights of BertModel were not initialized from the model checkpoint at GroNLP/bert-base-dutch-cased and are newly initialized: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [2]:
def predict(Model, Sentence, Targets, Correct):
    # Restrict the fill-mask candidates to Targets and take the top-ranked one.
    res = Model(Sentence, targets=Targets)[0]
    # Strip the SentencePiece word-boundary marker before comparing.
    if res['token_str'].strip('▁') == Correct:
        return 1
    else:
        return 0

def test(Model, File, Targets, Correct):
    # File holds one masked sentence per line; Correct is the gold pronoun for every line.
    score = 0
    count = 0
    with open(File) as sentences:
        for line in sentences:
            score += predict(Model, line, Targets, Correct)
            count += 1
    return (score, count)

def diedat(Model,Datfile,Diefile) :
    (datscore,datcount) = test(Model,Datfile,['die','dat'],'dat')
    (diescore,diecount) = test(Model,Diefile,['die','dat'],'die')
    
    print('{}, accuracy: {:6.4f} ({}/{})'.format('dat',datscore/datcount,datscore,datcount))
    print('{}, accuracy: {:6.4f} ({}/{})'.format('die',diescore/diecount,diescore,diecount))
    print('overall accuracy: {:6.4f}'.format((datscore+diescore)/(datcount+diecount)))

        

In [3]:
diedat(bertje,'clef.np2.1000.dat.mask','clef.np2.1000.die.mask')

dat, accuracy: 0.9310 (931/1000)
die, accuracy: 0.9370 (937/1000)
overall accuracy: 0.9340


## Comparison with RobBERT


We also perform experiments using RobBERT for comparison.





In [4]:
from transformers import RobertaTokenizer
rtokenizer = RobertaTokenizer.from_pretrained("pdelobelle/robbert-v2-dutch-base")

robbert = pipeline('fill-mask', model='pdelobelle/robbert-v2-dutch-base', tokenizer=rtokenizer)


Special tokens have been added in the vocabulary, make sure the associated word embedding are fine-tuned or trained.


In [5]:
diedat(robbert,'clef.np2.1000.dat.rmask','clef.np2.1000.die.rmask')
        

dat, accuracy: 0.7820 (782/1000)
die, accuracy: 0.8360 (836/1000)
overall accuracy: 0.8090


XLM-RoBERTa is a multilingual version of the RoBERTa LM (Conneau et al., 2020).

In [6]:
from transformers import AutoTokenizer
  
xlmtokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

xlmroberta = pipeline('fill-mask', model='xlm-roberta-base', tokenizer=xlmtokenizer)

Some weights of XLMRobertaForMaskedLM were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [7]:
diedat(xlmroberta,'clef.np2.1000.dat.rmask','clef.np2.1000.die.rmask')

dat, accuracy: 0.8520 (852/1000)
die, accuracy: 0.9570 (957/1000)
overall accuracy: 0.9045


In [8]:
from transformers import AutoTokenizer

mberttokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

mbert = pipeline('fill-mask', model='bert-base-multilingual-cased', tokenizer=mberttokenizer)

Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [9]:
# diedat(mbert,'clef.np1.edited.dat.mask','clef.np1.edited.die.mask')
diedat(mbert,'clef.np2.1000.dat.mask','clef.np2.1000.die.mask')

dat, accuracy: 0.7690 (769/1000)
die, accuracy: 0.8850 (885/1000)
overall accuracy: 0.8270


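The comparison between Bertje and RobBERT leads to a surprising result. While this is good news for Bertje, what explains it? TODO: also test on easier tasks (NP2 attachment, other?).

All models emit warnings about weights that were not initialized from, or not used in, the checkpoint. For Bertje and mBERT the weights named are the pooler and the next-sentence-prediction head (`cls.seq_relationship`), neither of which is involved in masked-token prediction, so these models should be fine out of the box for MLM. Whether the same holds for the `lm_head.decoder.bias` warning of XLM-RoBERTa still needs to be checked.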

 |             | Bertje | Robbert | mBERT  | XLM-Roberta | 
 | :----       | :----: | :----:  | :----: | :----:      |
 | dat         | 0.779  | 0.587   | 0.498  | 0.586       |
 | die         | 0.754  | 0.654   | 0.686  | 0.762       | 
 | **overall** | 0.769  | 0.615   | 0.577  | 0.660       | 
 
 

## CLEF

The raw text of the LassyLarge corpus (SONAR) was included in the training corpus for Bertje. We test on data from the CLEF corpus (4 years of newspaper text not included in SONAR), to see whether inclusion in the original corpus affects results.

 |             | Bertje | Robbert | mBERT  | XLM-Roberta | 
 | :----       | :----: | :----:  | :----: | :----:      |
 | dat         | 0.792  | 0.558   | 0.537  | 0.607       |
 | die         | 0.789  | 0.677   | 0.743  | 0.821       | 
 | **overall** | 0.790  | 0.631   | 0.662  | 0.738       | 
 
 ## Filtered data
 
 The data contains some attachment and other parsing errors, resulting in false hits (i.e. the sentence is not an example of a relative clause attached to the high noun in a complex NP). We manually filtered out those sentences. In addition, we automatically filtered out *die*-cases where the second noun is an amount or unit (*jaar*, *gulden*, *miljard*, *procent*), as these are often preceded by a number but can nevertheless be modified by *die*. This reduces the data from 978 to 888 *dat* sentences and from 1528 to 1163 *die* sentences.
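The automatic part of this filter could look roughly like the sketch below. It assumes each item is available as a pair of the masked sentence and the lemma of the second (lower) noun, which is a hypothetical input format; the set of unit nouns contains only the four nouns named above, and the list actually used may have been longer.

```python
# Only the four nouns named in the text; the real list may differ.
UNIT_NOUNS = {"jaar", "gulden", "miljard", "procent"}


def drop_unit_cases(items, unit_nouns=UNIT_NOUNS):
    """Keep only items whose second (lower) noun is not an amount/unit noun.

    items: iterable of (masked_sentence, noun2_lemma) pairs (assumed format).
    """
    return [(sentence, noun2) for sentence, noun2 in items
            if noun2.lower() not in unit_nouns]


toy_items = [("sentence A", "procent"), ("sentence B", "huis"), ("sentence C", "Jaar")]
print(drop_unit_cases(toy_items))
```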
 
 |             | Bertje | Robbert | mBERT  | XLM-Roberta | 
 | :----       | :----: | :----:  | :----: | :----:      |
 | dat         | 0.789  | 0.537   | 0.510 | 0.590      |
 | die         | 0.740  | 0.650   | 0.711  | 0.783       | 
 | **overall** | 0.761  | 0.601   | 0.624  | 0.700       | 
 
 ### raw scores NP1 edited 
 
 
 |             | Bertje       | Robbert      | mBERT        | XLM-Roberta  |
 | :----       | ----:        | ----:        | ----:        | ----:        |
 | dat (888)   | 701 (0.7894) | 477 (0.5372) | 453 (0.5101) | 524 (0.5901) |
 | die (1163)  | 860 (0.7395) | 756 (0.6500) | 827 (0.7111) | 911 (0.7833) |
 | **overall** | 0.7611       | 0.6012       | 0.6241       | 0.6997       |
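The overall row is a micro-average: correct predictions are pooled over the *dat* and *die* items and divided by the pooled item count, which is also what the `diedat` function above computes. The short check below recomputes two of the overall figures from the raw counts in the table; the helper name `overall_accuracy` is introduced here for illustration only.

```python
def overall_accuracy(correct_counts, totals):
    """Micro-averaged accuracy: pooled correct predictions over pooled items."""
    return sum(correct_counts) / sum(totals)


bertje_overall = overall_accuracy([701, 860], [888, 1163])
robbert_overall = overall_accuracy([477, 756], [888, 1163])
print(f"Bertje: {bertje_overall:.4f}, Robbert: {robbert_overall:.4f}")
```

Because the *die* set is larger than the *dat* set, this figure differs slightly from the unweighted mean of the two per-pronoun accuracies.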
 
 ### raw scores NP2 1000 
 
 
 |             | Bertje  | Robbert | mBERT  | XLM-Roberta | 
 | :----       | :----:  | :----:  | :----: | :----:      |
 | dat (1000)  | 0.9310  | 0.7820  | 0.7690 | 0.8520 |
 | die (1000)  | 0.9370  | 0.8360  | 0.8850 | 0.9570 | 
 | **overall** | 0.9340  | 0.8090  | 0.8270 | 0.9045 | 
 

## Data Analysis

The LM cannot use formal grammatical information to decide whether high or low attachment is preferred (and thus whether to choose *die* or *dat*). However, in almost all cases the relative pronoun functions as subject of the relative clause. Thus, we can study to what extent there is a correlation between, e.g., the main verb and the likelihood of *die* or *dat* as relative pronoun (PMI score over the full corpus). (Even weaker: the correlation between a de/het noun as subject and the verb.)

Also check the features used in the Allein models (trained classifiers). They might have tried similar things. 

### frequency of high vs low attachment

Low attachment is far more frequent. This is one of the reasons all models do better on the low attachment cases. (Apart from the fact that the antecedent is closer to the pronoun.)

For instance, before filtering, we obtained fewer than 2,600 high-attachment cases in total, versus 4,400 low-attachment cases in only half of the corpus (94 only). The comparable number for the full corpus would therefore be approximately 8,800.

### frequency of dat vs die in relative clauses 

In high attachment, after filtering, there are 888 *dat* cases vs 1163 *die* cases. In low attachment, there are 2002 vs 2423. This may explain why models generally do better for *die*.

### how often does a noun occur with a relative clause?

Some nouns are more likely to be followed by a relative clause than others. Obtaining the data requires an XQuery script. (We need statistics over all noun occurrences as well as over noun-relative clause co-occurrences.) A full analysis requires some sort of logistic regression model in which these probabilities (or PMI scores) are used as features.


### how often does a main verb in rel occur with die/dat?

Some verbs occur more often with *die* than with *dat*. This requires an XQuery script that counts *die*/*dat* together with the main verb lemma in the sentence. There is no need to collect overall verb statistics, as the measure is MI over verb-*die* vs verb-*dat* within the context of relative clauses.
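One way to quantify this association is pointwise mutual information computed over relative clauses only: PMI(verb, pronoun) = log2(P(verb, pronoun) / (P(verb) P(pronoun))). The sketch below is an illustration under the assumption that the counts are available as a dictionary from verb to *die*/*dat* counts; the toy counts and verb names are invented and do not come from the corpus.

```python
import math


def pmi(counts, verb, pronoun):
    """PMI (in bits) between a verb and a pronoun, over relative-clause counts.

    counts maps verb -> {"die": n, "dat": m}; all probabilities are relative
    to the total number of counted relative clauses.
    """
    total = sum(c["die"] + c["dat"] for c in counts.values())
    joint = counts[verb][pronoun]
    verb_total = counts[verb]["die"] + counts[verb]["dat"]
    pronoun_total = sum(c[pronoun] for c in counts.values())
    if joint == 0:
        return float("-inf")  # the pair never co-occurs
    return math.log2(joint * total / (verb_total * pronoun_total))


# Invented toy counts, for illustration only.
toy_counts = {"verb_a": {"die": 30, "dat": 10}, "verb_b": {"die": 20, "dat": 40}}
for pronoun in ("die", "dat"):
    print(pronoun, round(pmi(toy_counts, "verb_a", pronoun), 3))
```

A positive score means the verb occurs with that pronoun more often than expected if verb and pronoun were independent; a negative score means less often.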

Here we compute the correlation between *die* and *dat* frequency for all verbs occurring at least twice with both *die* and *dat* in the NRC 1994 data, after removing verbs that occur more than 500 times with either pronoun (outliers that make the plot hard to read).
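The selection criteria can be made explicit with a small filter. This is a sketch of the thresholds as described in the text, assuming rows of the form (verb, die count, dat count); it is not the script that produced `verbs_no_outliers1.tsv`, and the toy rows are invented.

```python
def keep_verb(die_count, dat_count, min_count=2, max_count=500):
    """True if the verb occurs at least min_count times with both pronouns
    and at most max_count times with either of them."""
    return (min(die_count, dat_count) >= min_count
            and max(die_count, dat_count) <= max_count)


def filter_verbs(rows, min_count=2, max_count=500):
    """rows: iterable of (verb, die_count, dat_count) tuples (assumed format)."""
    return [(verb, die, dat) for verb, die, dat in rows
            if keep_verb(die, dat, min_count, max_count)]


# Invented toy rows, for illustration only.
toy_rows = [("v1", 5, 3), ("v2", 1, 10), ("v3", 600, 40), ("v4", 500, 2)]
print(filter_verbs(toy_rows))
```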

TODO: plot the dots, add a trendline 



In [30]:
import pandas as pd

# Read the verb counts once; columns are verb, die count, dat count (space-separated).
verbs_label = pd.read_csv('verbs_no_outliers1.tsv', delimiter=' ', names=['verb','die','dat'])
# Numeric columns only, used for the correlation and trendline cells below.
verbs = verbs_label[['die','dat']]
print(verbs_label['verb'][1])
for i,row in verbs_label.iterrows() :
    print(row['verb'])

aanbieden
aanbevolen
aanbieden
aanbiedt
aanblijven
...

In [17]:
print(verbs.corr()) # pearson correlation 
print(verbs.corr(method="spearman")) #spearman rank correlation 

          die       dat
die  1.000000  0.904389
dat  0.904389  1.000000
          die       dat
die  1.000000  0.726825
dat  0.726825  1.000000


In [25]:
import numpy as np
%matplotlib
import matplotlib.pyplot as plt

Using matplotlib backend: TkAgg


In [34]:
#verbs.plot(x="die",y="dat",kind="scatter",color="red")
#for i,row in verbs_label.iterrows() :
#    verbs.plot.annotate(row['verb'],row['die'],row['dat'])

plt.scatter(verbs_label['die'],verbs_label['dat'])
for i,row in verbs_label.iterrows() :
    print(row['verb'])
    plt.text(row['die'],row['dat'],row['verb'])
    
    

aanbevolen
aanbieden
aanbiedt
aanblijven
...

In [12]:
import matplotlib.pyplot as plt

fit = np.polyfit(x=verbs['die'],y=verbs['dat'],deg=1)
line_fit = np.poly1d(fit)
plt.plot(verbs['die'],line_fit(verbs['die']))

[<matplotlib.lines.Line2D at 0x7f08164f99d0>]

## Regression testing

Does the model make predictions based on N1 and N2 only, or does it also take the right context (the relative clause) into account and assess whether that fits better with N1 or N2? (Note that in most test sentences, the relative pronoun functions as subject of the relative clause, so the model has to assess whether N1 or N2 is the more suitable subject for the relative clause.)

### Method

Combine a high-attachment left context (LC) with a random relative clause (RC) taken from the low-attachment test data, with either a matching or a non-matching pronoun. The prediction is that this degrades performance, especially if the relative clause was taken from data with the opposite pronoun: appending a *dat*-relative clause to a *die*-N1 high-attachment case should lead to poor results for predicting *die*, and slightly less so if the relative clause was originally also a *die* clause but from a different sentence.

Below: hardest case (random RC from mismatching pronoun), medium hard case (random RC from matching pronoun) 

Next: the same setup for N2 (low-attachment) cases, with a random RC taken from either opposite-pronoun or matching-pronoun relative clauses (from the N1 data).
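The recombination described above can be sketched as follows. This is an illustration only: it assumes each test item is a single string containing one `[MASK]`, treats everything after the mask as the relative clause (a simplification, since the remainder of the donor sentence may continue after the clause ends), and uses a fixed random seed. The function names are hypothetical and the files under `Regression/` were not necessarily built this way.

```python
import random

MASK = "[MASK]"


def split_at_mask(sentence, mask=MASK):
    """Split a masked sentence into (left context, right context)."""
    left, _, right = sentence.partition(mask)
    return left.rstrip(), right.lstrip()


def recombine(target_items, donor_items, seed=0, mask=MASK):
    """Pair each target left context with the right context of a random donor."""
    rng = random.Random(seed)
    recombined = []
    for target in target_items:
        left, _ = split_at_mask(target, mask)
        _, donor_right = split_at_mask(rng.choice(donor_items), mask)
        recombined.append(f"{left} {mask} {donor_right}")
    return recombined


# Placeholder token strings stand in for real sentences.
print(recombine(["A b [MASK] c d ."], ["X y [MASK] e f ."]))
```

For the hardest condition, `donor_items` would hold items with the opposite pronoun; for the medium condition, items with the same pronoun taken from different sentences.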

In [5]:
diedat(bertje,'Regression/dat.np1.die.rc.mask','Regression/die.np1.dat.rc.mask')

dat, accuracy: 0.5507 (489/888)
die, accuracy: 0.5597 (497/888)
overall accuracy: 0.5552


In [4]:
diedat(bertje,'Regression/dat.np1.dat.rc.mask','Regression/die.np1.die.rc.mask')

dat, accuracy: 0.5946 (528/888)
die, accuracy: 0.6216 (552/888)
overall accuracy: 0.6081


In [6]:
diedat(bertje,'Regression/dat.np2.die.rc.mask','Regression/die.np2.dat.rc.mask')

dat, accuracy: 0.7917 (703/888)
die, accuracy: 0.8311 (738/888)
overall accuracy: 0.8114


In [7]:
diedat(bertje,'Regression/dat.np2.dat.rc.mask','Regression/die.np2.die.rc.mask')

dat, accuracy: 0.8018 (712/888)
die, accuracy: 0.8514 (756/888)
overall accuracy: 0.8266
