# Working with Universal Dependencies

The Universal Dependencies framework distributes annotated corpora for many languages, all using the same dependency format. In this notebook, I'll demonstrate how to access such a corpus from within Python. 

We will work with the English GUM corpus. You can find it listed under "English" on https://universaldependencies.org/#language- and the repository with the actual corpus is at https://github.com/UniversalDependencies/UD_English-GUM/tree/master

From the GitHub repository, please download the training portion of the corpus (the file `en_gum-ud-train.conllu`) and put it in the same directory as this notebook.

First, we will take a look at the format in which Universal Dependencies corpora are stored.

In [28]:
# if you don't have conllu yet, uncomment the following
# !python -m pip install --upgrade conllu

In [29]:
import conllu # reading Universal Dependencies files in the CoNLL-U format

We open the GUM corpus as a text file, and look at its first few lines. After the initial metadata, the first sentence starts with the line
      "# text = Aesthetic Appreciation and Spanish Art:"

In [30]:
with open("en_gum-ud-train.conllu", encoding="utf8") as f:
    data = f.read()

In [31]:
print(data[:4000])

# newdoc id = GUM_academic_art
# global.Entity = GRP-etype-infstat-salience-centering-minspan-link-identity
# meta::author = Claire Bailey-Ross, Andrew Beresford, Daniel Smith, Claire Warwick
# meta::dateCollected = 2017-09-13
# meta::dateCreated = 2017-08-08
# meta::dateModified = 2017-09-13
# meta::genre = academic
# meta::salientEntities = 4 (5*), 5 (5*), 44 (5*), 45 (5*), 46 (5*), 47 (5*), 27 (4*), 147 (4*), 2 (3*), 43 (3), 20 (2*), 23 (2), 63 (2), 72 (2), 73 (2), 3 (1), 19 (1), 24 (1), 26 (1), 48 (1), 49 (1), 50 (1), 62 (1), 68 (1), 69 (1), 74 (1), 76 (1), 77 (1), 78 (1), 79 (1), 158 (1)
# meta::sourceURL = https://dh2017.adho.org/abstracts/333/333.pdf
# meta::speakerCount = 0
# meta::summary1 = (human) This paper presents an eye tracking study to explore how viewers experience art, focusing on a 17th Century collection of Spanish paintings by Zurbarán.
# meta::summary2 = (claude-3-5-sonnet-20241022) This pilot study uses eye-tracking techniques to examine how viewers visually pro

As you can see, this is a tabular format. There is a line for each word in the sentence, and the information for that word is given in tab-delimited cells, for example:

```2	Appreciation	appreciation	NOUN	NN	Number=Sing	0	root	0:root	Entity=1)|MSeg=Appreciat-ion```

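This is the CoNLL-U format, which originated with a shared task at the Conference on Computational Natural Language Learning (CoNLL). Each line has ten tab-separated cells: id, form, lemma, upos, xpos, feats, head, deprel, deps, and misc.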
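
To make the tabular layout concrete, here is a minimal sketch that splits the example line above into its ten cells by hand. The helper name `split_conllu_line` and the `FIELDS` list are only for illustration (the field names follow the keys that the conllu package uses); in practice you would let the package do this for you.

```python
# The ten CoNLL-U columns, in order. FIELDS and split_conllu_line are
# hypothetical names used only for this illustration.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

# The example line from above, with explicit tab characters
line = ("2\tAppreciation\tappreciation\tNOUN\tNN\tNumber=Sing\t0\troot"
        "\t0:root\tEntity=1)|MSeg=Appreciat-ion")

def split_conllu_line(line):
    """Split one token line on tabs and pair each cell with its column name."""
    cells = line.rstrip("\n").split("\t")
    return dict(zip(FIELDS, cells))

row = split_conllu_line(line)
print(row["form"], row["upos"], row["head"], row["deprel"])
```

Note that everything stays a string here (the head is `"0"`, not `0`), whereas the conllu package converts ids and heads to integers and unpacks `feats` and `misc` into dictionaries.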

There is a Python package, conllu, that is made for reading CoNLL-U data. Once we have read the GUM corpus into a string, we can parse it with conllu:

In [32]:
sentences = conllu.parse(data)

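The content of `sentences` is a sequence of TokenList objects. Here is the one at index 10, which is the 11th sentence of the corpus because Python counts from 0 (its `sent_id` is "GUM_academic_art-11"). For brevity, we will call it "sentence 10":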

In [33]:
print(sentences[10])

TokenList<Thus, ,, the, time, it, takes, and, the, ways, of, visually, exploring, an, artwork, can, inform, about, its, relevance, ,, interestingness, ,, and, even, its, aesthetic, appeal, ., metadata={sent_id: "GUM_academic_art-11", s_type: "sub", s_prominence: "3", transition: "establishment", text: "Thus, the time it takes and the ways of visually exploring an artwork can inform about its relevance, interestingness, and even its aesthetic appeal."}>


We can access the entries on the TokenList through a for-loop, or using an index. Here is the first token of sentence 10. As you can see, it is a Python dictionary.

In [34]:
sentence10 = sentences[10]
firstword = sentence10[0]
firstword

{'id': 1,
 'form': 'Thus',
 'lemma': 'thus',
 'upos': 'ADV',
 'xpos': 'RB',
 'feats': None,
 'head': 16,
 'deprel': 'advmod',
 'deps': [('advmod', 16)],
 'misc': {'Discourse': 'context-background:12->23:7:_',
  'PDTB': 'Explicit:Contingency.Cause.Result:thus:107:79-106:108-134',
  'SpaceAfter': 'No'}}

You can access the entries in that dictionary by their keys:

In [35]:
firstword["lemma"]

'thus'

To better understand this big dictionary, it helps to view it as an attribute-value matrix. Here is a simplified copy of the first word of sentence 10 of the UD_English-GUM corpus, with `deps` set to `None` and only `SpaceAfter` kept under `misc`:

In [36]:
firstword = {'id': 1,
  'form': 'Thus',
  'lemma': 'thus',
  'upos': 'ADV',
  'xpos': 'RB',
  'feats': None,
  'head': 16,
  'deprel': 'advmod',
  'deps': None,
  'misc': {'SpaceAfter': 'No'}}

This is the following attribute-value matrix (AVM):

$$
\left[\begin{array}{ll}
\text{id:} & 1\\
\text{form:} & 'Thus'\\
\text{lemma:} & 'thus'\\
\text{upos:} &  'ADV'\\
\text{xpos:} & 'RB'\\
\text{feats:} &  None\\
\text{head:} & 16\\
\text{deprel:}  & advmod\\
\text{deps:}  & None\\
\text{misc:} & \left[\begin{array}{ll}
\text{SpaceAfter:} & 'No'
\end{array}\right]
\end{array}\right]
$$

As you saw above, you can access an entry in this attribute-value matrix through its dictionary key:

In [37]:
firstword["lemma"]

'thus'

One of the values in the AVM is itself an AVM. To access the value that tells you whether there is a space after the word, you need to specify the whole path of keys. `firstword["misc"]` accesses a dictionary, namely `{'SpaceAfter': 'No'}`, which again has keys, in particular `SpaceAfter`: 

In [38]:
firstword["misc"]["SpaceAfter"]

'No'
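
One caveat: for many tokens the value under `misc` is `None` rather than a dictionary, and then the second lookup raises a `TypeError`. Below is a minimal sketch of a guarded lookup, with hypothetical helper names `space_after` and `detokenize`, run on a small hand-written token list rather than on the corpus.

```python
def space_after(token):
    """Return False only if the token is explicitly marked SpaceAfter=No."""
    misc = token.get("misc") or {}   # treat None as an empty dictionary
    return misc.get("SpaceAfter") != "No"

def detokenize(tokens):
    """Glue word forms back together, honoring the SpaceAfter annotation."""
    pieces = []
    for token in tokens:
        pieces.append(token["form"])
        if space_after(token):
            pieces.append(" ")
    return "".join(pieces).strip()

# hand-written toy tokens, shaped like the first four tokens of sentence 10
tokens = [
    {"form": "Thus", "misc": {"SpaceAfter": "No"}},
    {"form": ",", "misc": None},
    {"form": "the", "misc": None},
    {"form": "time", "misc": None},
]
print(detokenize(tokens))  # Thus, the time
```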

The Universal Dependencies representation of a whole sentence is a list of tokens, that is, a list of dictionaries (=AVMs):

In [39]:
sentence10 = [{'id': 1,
  'form': 'Thus',
  'lemma': 'thus',
  'upos': 'ADV',
  'xpos': 'RB',
  'feats': None,
  'head': 16,
  'deprel': 'advmod',
  'deps': None,
  'misc': {'SpaceAfter': 'No'}},
 {'id': 2,
  'form': ',',
  'lemma': ',',
  'upos': 'PUNCT',
  'xpos': ',',
  'feats': None,
  'head': 1,
  'deprel': 'punct',
  'deps': None,
  'misc': None},
 {'id': 3,
  'form': 'the',
  'lemma': 'the',
  'upos': 'DET',
  'xpos': 'DT',
  'feats': {'Definite': 'Def', 'PronType': 'Art'},
  'head': 4,
  'deprel': 'det',
  'deps': None,
  'misc': None},
 {'id': 4,
  'form': 'time',
  'lemma': 'time',
  'upos': 'NOUN',
  'xpos': 'NN',
  'feats': {'Number': 'Sing'},
  'head': 16,
  'deprel': 'nsubj',
  'deps': None,
  'misc': None},
 {'id': 5,
  'form': 'it',
  'lemma': 'it',
  'upos': 'PRON',
  'xpos': 'PRP',
  'feats': {'Case': 'Nom',
   'Gender': 'Neut',
   'Number': 'Sing',
   'Person': '3',
   'PronType': 'Prs'},
  'head': 6,
  'deprel': 'nsubj',
  'deps': None,
  'misc': None},
 {'id': 6,
  'form': 'takes',
  'lemma': 'take',
  'upos': 'VERB',
  'xpos': 'VBZ',
  'feats': {'Mood': 'Ind',
   'Number': 'Sing',
   'Person': '3',
   'Tense': 'Pres',
   'VerbForm': 'Fin'},
  'head': 4,
  'deprel': 'acl:relcl',
  'deps': None,
  'misc': None},
 {'id': 7,
  'form': 'and',
  'lemma': 'and',
  'upos': 'CCONJ',
  'xpos': 'CC',
  'feats': None,
  'head': 9,
  'deprel': 'cc',
  'deps': None,
  'misc': None},
 {'id': 8,
  'form': 'the',
  'lemma': 'the',
  'upos': 'DET',
  'xpos': 'DT',
  'feats': {'Definite': 'Def', 'PronType': 'Art'},
  'head': 9,
  'deprel': 'det',
  'deps': None,
  'misc': None},
 {'id': 9,
  'form': 'ways',
  'lemma': 'way',
  'upos': 'NOUN',
  'xpos': 'NNS',
  'feats': {'Number': 'Plur'},
  'head': 4,
  'deprel': 'conj',
  'deps': None,
  'misc': None},
 {'id': 10,
  'form': 'of',
  'lemma': 'of',
  'upos': 'SCONJ',
  'xpos': 'IN',
  'feats': None,
  'head': 12,
  'deprel': 'mark',
  'deps': None,
  'misc': None},
 {'id': 11,
  'form': 'visually',
  'lemma': 'visually',
  'upos': 'ADV',
  'xpos': 'RB',
  'feats': None,
  'head': 12,
  'deprel': 'advmod',
  'deps': None,
  'misc': None},
 {'id': 12,
  'form': 'exploring',
  'lemma': 'explore',
  'upos': 'VERB',
  'xpos': 'VBG',
  'feats': {'VerbForm': 'Ger'},
  'head': 9,
  'deprel': 'acl',
  'deps': None,
  'misc': None},
 {'id': 13,
  'form': 'an',
  'lemma': 'a',
  'upos': 'DET',
  'xpos': 'DT',
  'feats': {'Definite': 'Ind', 'PronType': 'Art'},
  'head': 14,
  'deprel': 'det',
  'deps': None,
  'misc': None},
 {'id': 14,
  'form': 'artwork',
  'lemma': 'artwork',
  'upos': 'NOUN',
  'xpos': 'NN',
  'feats': {'Number': 'Sing'},
  'head': 12,
  'deprel': 'obj',
  'deps': None,
  'misc': None},
 {'id': 15,
  'form': 'can',
  'lemma': 'can',
  'upos': 'AUX',
  'xpos': 'MD',
  'feats': {'VerbForm': 'Fin'},
  'head': 16,
  'deprel': 'aux',
  'deps': None,
  'misc': None},
 {'id': 16,
  'form': 'inform',
  'lemma': 'inform',
  'upos': 'VERB',
  'xpos': 'VB',
  'feats': {'VerbForm': 'Inf'},
  'head': 0,
  'deprel': 'root',
  'deps': None,
  'misc': None},
 {'id': 17,
  'form': 'about',
  'lemma': 'about',
  'upos': 'ADP',
  'xpos': 'IN',
  'feats': None,
  'head': 19,
  'deprel': 'case',
  'deps': None,
  'misc': None},
 {'id': 18,
  'form': 'its',
  'lemma': 'its',
  'upos': 'PRON',
  'xpos': 'PRP$',
  'feats': {'Gender': 'Neut',
   'Number': 'Sing',
   'Person': '3',
   'Poss': 'Yes',
   'PronType': 'Prs'},
  'head': 19,
  'deprel': 'nmod:poss',
  'deps': None,
  'misc': None},
 {'id': 19,
  'form': 'relevance',
  'lemma': 'relevance',
  'upos': 'NOUN',
  'xpos': 'NN',
  'feats': {'Number': 'Sing'},
  'head': 16,
  'deprel': 'obl',
  'deps': None,
  'misc': {'SpaceAfter': 'No'}},
 {'id': 20,
  'form': ',',
  'lemma': ',',
  'upos': 'PUNCT',
  'xpos': ',',
  'feats': None,
  'head': 21,
  'deprel': 'punct',
  'deps': None,
  'misc': None},
 {'id': 21,
  'form': 'interestingness',
  'lemma': 'interestingness',
  'upos': 'NOUN',
  'xpos': 'NN',
  'feats': {'Number': 'Sing'},
  'head': 19,
  'deprel': 'conj',
  'deps': None,
  'misc': {'SpaceAfter': 'No'}},
 {'id': 22,
  'form': ',',
  'lemma': ',',
  'upos': 'PUNCT',
  'xpos': ',',
  'feats': None,
  'head': 27,
  'deprel': 'punct',
  'deps': None,
  'misc': None},
 {'id': 23,
  'form': 'and',
  'lemma': 'and',
  'upos': 'CCONJ',
  'xpos': 'CC',
  'feats': None,
  'head': 27,
  'deprel': 'cc',
  'deps': None,
  'misc': None},
 {'id': 24,
  'form': 'even',
  'lemma': 'even',
  'upos': 'ADV',
  'xpos': 'RB',
  'feats': None,
  'head': 27,
  'deprel': 'advmod',
  'deps': None,
  'misc': None},
 {'id': 25,
  'form': 'its',
  'lemma': 'its',
  'upos': 'PRON',
  'xpos': 'PRP$',
  'feats': {'Gender': 'Neut',
   'Number': 'Sing',
   'Person': '3',
   'Poss': 'Yes',
   'PronType': 'Prs'},
  'head': 27,
  'deprel': 'nmod:poss',
  'deps': None,
  'misc': None},
 {'id': 26,
  'form': 'aesthetic',
  'lemma': 'aesthetic',
  'upos': 'ADJ',
  'xpos': 'JJ',
  'feats': {'Degree': 'Pos'},
  'head': 27,
  'deprel': 'amod',
  'deps': None,
  'misc': None},
 {'id': 27,
  'form': 'appeal',
  'lemma': 'appeal',
  'upos': 'NOUN',
  'xpos': 'NN',
  'feats': {'Number': 'Sing'},
  'head': 19,
  'deprel': 'conj',
  'deps': None,
  'misc': {'SpaceAfter': 'No'}},
 {'id': 28,
  'form': '.',
  'lemma': '.',
  'upos': 'PUNCT',
  'xpos': '.',
  'feats': None,
  'head': 16,
  'deprel': 'punct',
  'deps': None,
  'misc': None}]

In [40]:
# now we can iterate through the AVMs for this sentence, and 
# print information for each one
for token in sentence10:
    print(token["id"], token["form"], token["upos"], 
          token["head"], token["deprel"])

1 Thus ADV 16 advmod
2 , PUNCT 1 punct
3 the DET 4 det
4 time NOUN 16 nsubj
5 it PRON 6 nsubj
6 takes VERB 4 acl:relcl
7 and CCONJ 9 cc
8 the DET 9 det
9 ways NOUN 4 conj
10 of SCONJ 12 mark
11 visually ADV 12 advmod
12 exploring VERB 9 acl
13 an DET 14 det
14 artwork NOUN 12 obj
15 can AUX 16 aux
16 inform VERB 0 root
17 about ADP 19 case
18 its PRON 19 nmod:poss
19 relevance NOUN 16 obl
20 , PUNCT 21 punct
21 interestingness NOUN 19 conj
22 , PUNCT 27 punct
23 and CCONJ 27 cc
24 even ADV 27 advmod
25 its PRON 27 nmod:poss
26 aesthetic ADJ 27 amod
27 appeal NOUN 19 conj
28 . PUNCT 16 punct


Now say we want to determine how often we have subject-verb-object (SVO) versus SOV versus VSO etc. in a Universal Dependencies corpus. To do that, we would like to have an AVM for a word that includes all its dependents. For the verb "inform" in the sentence above, we would like the AVM to list that "time" (word 4) is the nsubj of "inform", and "relevance" (word 19) is its obl:


$$
\left[\begin{array}{ll}
\text{form:} & inform\\
\text{id:} & 16\\
\text{upos:} & VERB\\
\text{dep:} & \left[ \left[\begin{array}{ll}
\text{id:} & 4\\
\text{deprel:} & nsubj\end{array}\right], 
\left[\begin{array}{ll}
\text{id:} & 19\\
\text{deprel:} & obl\end{array}\right] \right]
\end{array}\right]
$$


As a Python data structure, this AVM is rather complex: It is a dictionary, but under the key "dep" the value is a list of dictionaries. 


In [41]:
inform_avm_with_deps = { "form" : "inform",
                        "id" : 16,
                        "upos" : "VERB",
                        "dep" : [ {"id" : 4, "deprel" : "nsubj"}, 
                                  {"id" : 19, "deprel" : "obl"}]
                       }

Here is how we make a version of sentence 10 that has such an AVM for each word.

In [42]:
def reformat_sentence(sentence):
    # first we initialize each AVM to have an empty dependencies list
    sentence_reformatted = [ ]
    for token in sentence:
        sentence_reformatted.append( { "form" : token["form"], 
                                    "id" : token["id"],
                                    "upos" : token["upos"],
                                    "dep" : [ ]
                                  } )

    # now we add dependencies
    for token in sentence:
        # the root has head 0 and is nobody's dependent, so we skip it
        # (and any token without a head). Without this check, index
        # 0 - 1 = -1 would attach the root to the last token.
        if not token["head"]:
            continue
        # looking up the head of this token. index is that head minus one.
        myhead_ix = token["head"] - 1
        # adding this token to the head's dependencies
        sentence_reformatted[ myhead_ix ]["dep"].append({ "id" : token["id"],
                                                       "deprel" : token["deprel"]})

    return sentence_reformatted
    
reformat_sentence(sentence10)

[{'form': 'Thus',
  'id': 1,
  'upos': 'ADV',
  'dep': [{'id': 2, 'deprel': 'punct'}]},
 {'form': ',', 'id': 2, 'upos': 'PUNCT', 'dep': []},
 {'form': 'the', 'id': 3, 'upos': 'DET', 'dep': []},
 {'form': 'time',
  'id': 4,
  'upos': 'NOUN',
  'dep': [{'id': 3, 'deprel': 'det'},
   {'id': 6, 'deprel': 'acl:relcl'},
   {'id': 9, 'deprel': 'conj'}]},
 {'form': 'it', 'id': 5, 'upos': 'PRON', 'dep': []},
 {'form': 'takes',
  'id': 6,
  'upos': 'VERB',
  'dep': [{'id': 5, 'deprel': 'nsubj'}]},
 {'form': 'and', 'id': 7, 'upos': 'CCONJ', 'dep': []},
 {'form': 'the', 'id': 8, 'upos': 'DET', 'dep': []},
 {'form': 'ways',
  'id': 9,
  'upos': 'NOUN',
  'dep': [{'id': 7, 'deprel': 'cc'},
   {'id': 8, 'deprel': 'det'},
   {'id': 12, 'deprel': 'acl'}]},
 {'form': 'of', 'id': 10, 'upos': 'SCONJ', 'dep': []},
 {'form': 'visually', 'id': 11, 'upos': 'ADV', 'dep': []},
 {'form': 'exploring',
  'id': 12,
  'upos': 'VERB',
  'dep': [{'id': 10, 'deprel': 'mark'},
   {'id': 11, 'deprel': 'advmod'},
   {'id'

Based on this data structure, we can determine whether the subject comes before the verb: if so, its ID is lower than that of the verb. We can also determine whether the subject comes before the object: if so, its ID is lower than that of the object.

We can also see how far away from the verb the subject is, by computing the difference between the IDs of the verb and its subject. In the same way, we can determine how far away from the verb the direct object is. 
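
As a minimal sketch of this idea, the function below reads the order and distance of each subject off an AVM with a `dep` list. The function name `subject_verb_order` and the labels "SV" and "VS" are my own choices for illustration; the AVM is the one for "inform" shown earlier.

```python
def subject_verb_order(verb_avm):
    """For each nsubj dependent, report its order relative to the verb
    ("SV" = subject first, "VS" = verb first) and the distance in tokens."""
    results = []
    for dep in verb_avm["dep"]:
        if dep["deprel"] == "nsubj":
            distance = verb_avm["id"] - dep["id"]
            order = "SV" if distance > 0 else "VS"
            results.append((order, abs(distance)))
    return results

inform = {"form": "inform", "id": 16, "upos": "VERB",
          "dep": [{"id": 4, "deprel": "nsubj"},
                  {"id": 19, "deprel": "obl"}]}

print(subject_verb_order(inform))  # [('SV', 12)]
```

The same comparison works for objects by testing for `"obj"` instead of `"nsubj"`.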

In [43]:
import random
random.seed(123)

In [None]:
# print(sentences[10])
# print(sentences[10].metadata)
def finderskeepers(sentences, sample_size):
    # The check below counts a pronoun object as fitting the generalization
    # if it comes directly after the verb it depends on.
    fit_generalization = []
    possible_exceptions = []
    current = {}
    for sentence in sentences:
        for word in sentence:
            if word["upos"] == "PRON" and word["deprel"] == "obj":
                # the head is looked up by list position (head id minus one)
                if word["head"] is not None and sentence[word["head"]-1]["upos"] == "VERB":
                    current["sentence"] = sentence.metadata['text']
                    current["PRON"] = word
                    current["VERB"] = sentence[word["head"]-1]
                    if word["id"]-1 == word["head"]:
                        # the pronoun directly follows its verb
                        if current not in fit_generalization:
                            fit_generalization.append(current.copy())
                    else:  # everything else is a possible exception
                        if current not in possible_exceptions:
                            possible_exceptions.append(current.copy())
    
    print(f"Sentences that fit the generalization: {len(fit_generalization)}\n")
    
    fitted_samples = random.sample(fit_generalization, sample_size)
    
    for entry in fitted_samples:
        print(f'PRON: {entry["PRON"]}, VERB: {entry["VERB"]}\n Sentence: {entry["sentence"]}\n')
    
    print(f"\nSentences that may (or may not) be exceptions: {len(possible_exceptions)}\n")
    
    fitted_samples = random.sample(possible_exceptions, sample_size)
    
    for entry in fitted_samples:
        print(f'PRON: {entry["PRON"]}, VERB: {entry["VERB"]}\n Sentence: {entry["sentence"]}\n')
    
    #print(count)
    print(f"\nTotal sentences with Pronouns linked to Verbs: {len(fit_generalization)+len(possible_exceptions)}")
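
The function above finds a head by list position (`sentence[word["head"]-1]`). That works as long as token ids and list positions line up. As far as I know, the conllu package keeps multiword-token range lines such as `1-2` in the token list, with a tuple as their id, and in that case positions shift. Here is a minimal sketch of a lookup by id instead, using a hand-written toy sentence; the helper name `token_by_id` is hypothetical, and the tuple id is my assumption about how such lines are represented.

```python
def token_by_id(sentence, token_id):
    """Return the token whose 'id' equals token_id, or None if there is none."""
    for token in sentence:
        if token["id"] == token_id:
            return token
    return None

# Toy sentence "Don't see it" with a range line for the contraction.
# The tuple id (1, "-", 2) is an assumption about the parsed representation.
toy = [
    {"id": (1, "-", 2), "form": "Don't", "upos": None, "head": None},
    {"id": 1, "form": "Do", "upos": "AUX", "head": 3},
    {"id": 2, "form": "n't", "upos": "PART", "head": 3},
    {"id": 3, "form": "see", "upos": "VERB", "head": 0},
    {"id": 4, "form": "it", "upos": "PRON", "head": 3},
]

pronoun = toy[4]
print(toy[pronoun["head"] - 1]["form"])            # n't  (off by one position)
print(token_by_id(toy, pronoun["head"])["form"])   # see
```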

In [53]:
finderskeepers(sentences, 5)

Sentences that fit the generalization: 995

PRON: this, VERB: love
 Sentence: I also love this.

PRON: em, VERB: get
 Sentence: You know then, they have to, like, keep em, away from anything, you know, get em really in the soft ground, and, no hard pebbles, or hard clods of dirt or anything?

PRON: what, VERB: see
 Sentence: Position a large mirror so you can check your positioning and see what you're doing.

PRON: that, VERB: read
 Sentence: Now, when I read that, the first thing that jumps into my mind is hundreds of millions.

PRON: her, VERB: wake
 Sentence: Once Bet was asleep it was impossible to wake her up.


Sentences that may (or may not) be exceptions: 156

PRON: which, VERB: hatched
 Sentence: So Erasmus laid the the — laid the egg, which Luther hatched.

PRON: anything, VERB: publish
 Sentence: More recently, institutions of academic expertise have been subject to a large and growing outbreak of so-called predatory journals — journals that will publish almost anything for 

In [46]:
# AND NOW WE BEGIN TO EXAMINE THE SWEDISH DATA
with open("sv_lines-ud-train.conllu", encoding="utf8") as f:
    data = f.read()
sentences_swedish = conllu.parse(data)
print(data[:1000])

# newdoc id = 1
# sent_id = sv_lines-ud-train-doc1-1
# text = Visa alla
# text_en = Show All
1	Visa	visa	VERB	IMP-ACT	Mood=Imp|VerbForm=Fin|Voice=Act	0	root	_	_
2	alla	all	PRON	TOT-PL-NOM	Definite=Ind|Number=Plur|PronType=Tot	1	obj	_	_

# sent_id = sv_lines-ud-train-doc1-2
# text = Om ANSI SQL-frågeläge
# text_en = About ANSI SQL query mode
1	Om	om	ADP	_	_	3	case	_	_
2	ANSI	ANSI	PROPN	SG-NOM	Case=Nom	3	nmod	_	_
3	SQL-frågeläge	SQLfrågeläge	NOUN	IND-NOM	Case=Nom|Definite=Ind|Gender=Neut|Number=Sing	0	root	_	_

# sent_id = sv_lines-ud-train-doc1-3
# text = En del av innehållet i det här avsnittet kanske inte gäller för vissa språk.
# text_en = Some of the content in this topic may not be applicable to some languages.
1	En	en	DET	SG-IND	Definite=Ind|Gender=Com|Number=Sing|PronType=Art	2	det	_	_
2	del	del	NOUN	SG-IND-NOM	Case=Nom|Definite=Ind|Gender=Com|Number=Sing	11	nsubj	_	_
3	av	av	ADP	_	_	4	case	_	_
4	innehållet	innehåll	NOUN	SG-DEF-NOM	Case=Nom|Definite=Def|Gender=Neut|Number=Sing	2	

In [47]:
def swedishchecker(sentences, sample_size):
    fit_generalization = []
    possible_exceptions = []
    current = {}
    for sentence in sentences:
        for word in sentence:
            if word["xpos"] == "NEG":
                if word["head"] is not None and sentence[word["head"]-1]["upos"] == "VERB":
                    current["sentence"] = sentence.metadata['text']
                    # the translation is looked up under 'text_en' first,
                    # then under 'Text_en'
                    meta = sentence.metadata
                    current["sentence-E"] = meta.get('text_en', meta.get('Text_en'))
                    if current["sentence-E"] is None:
                        print(f"Error for sentence: {sentence.metadata}")
                    current["NEG"] = word
                    current["VERB"] = sentence[word["head"]-1]
                    if word["id"]-1 == word["head"]:
                        # the negation directly follows its verb
                        if current not in fit_generalization:
                            fit_generalization.append(current.copy())
                    else:  # everything else is a possible exception
                        if current not in possible_exceptions:
                            possible_exceptions.append(current.copy())
    
    print(f"Sentences that fit the generalization: {len(fit_generalization)}\n")
    
    fitted_samples = random.sample(fit_generalization, min(sample_size,len(fit_generalization)))
    
    for entry in fitted_samples:
        print(f'NEG: {entry["NEG"]}, VERB: {entry["VERB"]}\n Sentence: {entry["sentence"]}\n English Translation: {entry["sentence-E"]}\n')
    
    print(f"\nSentences that may (or may not) be exceptions: {len(possible_exceptions)}\n")
    
    fitted_samples = random.sample(possible_exceptions, min(sample_size,len(possible_exceptions)))
    
    for entry in fitted_samples:
        print(f'NEG: {entry["NEG"]}, VERB: {entry["VERB"]}\n Sentence: {entry["sentence"]}\n English Translation: {entry["sentence-E"]}\n')
    
    print(f"\nTotal sentences with Negations linked to Verbs: {len(fit_generalization)+len(possible_exceptions)}")

In [48]:
# Unknown Bug in the Code prevents me from storing text_en
swedishchecker(sentences_swedish, 6)

Sentences that fit the generalization: 99

NEG: inte, VERB: betyder
 Sentence: Detta betyder inte att alla andra lever ett tryggt och behagligt liv under en anständig regim.
 English Translation: This is not to say that everyone else is living pleasantly and well under a decent regime.

NEG: inte, VERB: finns
 Sentence: Och det finns inte heller någon möjlighet för honom att bli invigd i sådana mysterier.
 English Translation: There's no initiation either into such mysteries.

NEG: inte, VERB: flyttas
 Sentence: Om du flyttar en linje från rutnätets centrum, flyttas inte den närmaste parallella linjen, och avståndet mellan den linje som flyttas och den närmaste parallella linjen ökar.
 English Translation: If you are moving the line away from the center of the grid, the nearest parallel line doesn't move, and the distance between the line you are moving and the nearest parallel line grows larger.

NEG: aldrig, VERB: såg
 Sentence: Det såg aldrig ut att vara annat än trasiga, kasserade 

In [None]:
# TASK 2 EXTRACTING SEMANTIC GENERALIZATIONS ABOUT VERBS
# extract all the verbs in corpus and sort them by frequency
def verb_frequencies(sentences):
    verb_freq = {}
    # verbs = []
    for sentence in sentences:
        for word in sentence:
            if word["upos"] == "VERB":
                verb = word["lemma"]
                verb_freq[verb] = verb_freq.get(verb,0) + 1
    verbs = list(verb_freq.keys())
    print(f"There are {len(verbs)} verbs used in the corpus.")
    #print(f"Those verbs are: {verbs}")
    sorted_by_frequency_desc = sorted(verb_freq.items(), key=lambda item: item[1], reverse=True)
    first_five = sorted_by_frequency_desc[:5] # optionally random sample the top 20%
    # it would be better if we can print this in a nicer format, and include the english translation
    print(f"The highest frequency verbs are: {first_five}")
    # "mid-frequency" here means the five verbs that start one fifth of the
    # way down the frequency-sorted list
    middle_five = sorted_by_frequency_desc[len(sorted_by_frequency_desc)//5:len(sorted_by_frequency_desc)//5+5]
    print(f"Some of the mid-frequency verbs are: {middle_five}")
    return first_five + middle_five
    

    

In [None]:
verbs = verb_frequencies(sentences_swedish)

There are 1308 verbs used in the corpus.
The highest frequency verbs are: [('säga', 294), ('ha', 225), ('komma', 188), ('se', 185), ('gå', 180)]
Some of the mid-frequency verbs are: [('misstänka', 5), ('erinra', 5), ('stoppa', 5), ('lukta', 5), ('vila', 5)]
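
The counting step can also be sketched with `collections.Counter` from the standard library, which handles the bookkeeping and the sorting. This is an alternative to the dictionary used above, not a change to it; the function name `count_verb_lemmas` and the toy sentences are made up for illustration.

```python
from collections import Counter

def count_verb_lemmas(sentences):
    """Count how often each lemma occurs with the universal POS tag VERB."""
    return Counter(word["lemma"]
                   for sentence in sentences
                   for word in sentence
                   if word["upos"] == "VERB")

# two hand-written toy sentences, reduced to the keys we need
toy_sentences = [
    [{"lemma": "jag", "upos": "PRON"}, {"lemma": "säga", "upos": "VERB"}],
    [{"lemma": "hon", "upos": "PRON"}, {"lemma": "säga", "upos": "VERB"},
     {"lemma": "gå", "upos": "VERB"}],
]

counts = count_verb_lemmas(toy_sentences)
print(len(counts))             # 2 distinct verb lemmas
print(counts.most_common(1))   # [('säga', 2)]
```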


The following function generates these sets for any verb in the corpus:

1. Its subject set: the set of head nouns of its subjects
2. Its object set: the set of head nouns of its objects
3. Its modifier set: the set of heads of its modifiers
4. Its before/after set: the word that precedes the verb and the word that follows it

In [62]:
def gen_sets(sentences, verb):
    sets = {"verb": verb, "subjects": set(), "objects": set(), "modifiers": set(), "before": set(), "after": set()}
    for sentence in sentences:
        words = [x['lemma'] for x in sentence]
        if verb in words:
            # 1-based id of the first token with this lemma; this assumes
            # that token ids match list positions
            word_id = words.index(verb)+1
            # neighbours, guarded so that we neither wrap around to the end
            # of the sentence nor run past it
            if word_id >= 2:
                sets["before"].add(words[word_id-2])
            if word_id < len(words):
                sets["after"].add(words[word_id])
            for word in sentence:
                if(word["deprel"] in ["obj", "nsubj", "iobj", "advmod"] and word["head"] == word_id):
                    # match/case needs Python 3.10 or later
                    match word["deprel"]:
                        case "obj" | "iobj":
                            sets["objects"].add(word["lemma"])
                        case "nsubj":
                            sets["subjects"].add(word["lemma"])
                        case "advmod":
                            sets["modifiers"].add(word["lemma"])
                    
    return sets
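
Note that `gen_sets` only looks at the first token in each sentence whose lemma matches, whatever its part of speech. As a possible refinement, here is a minimal sketch that visits every occurrence tagged VERB and collects its neighbours. The function name `neighbours_of_verb` and the toy sentence are hypothetical, and the sketch covers only the before/after part.

```python
def neighbours_of_verb(sentence, verb):
    """Return a (before, after) lemma pair for every VERB token with this
    lemma; None marks a missing neighbour at a sentence boundary."""
    pairs = []
    for position, token in enumerate(sentence):
        if token["upos"] == "VERB" and token["lemma"] == verb:
            before = sentence[position - 1]["lemma"] if position > 0 else None
            after = (sentence[position + 1]["lemma"]
                     if position + 1 < len(sentence) else None)
            pairs.append((before, after))
    return pairs

# hand-written toy sentence in which the verb lemma occurs twice
toy_sentence = [
    {"lemma": "säga", "upos": "VERB"},
    {"lemma": "han", "upos": "PRON"},
    {"lemma": "och", "upos": "CCONJ"},
    {"lemma": "säga", "upos": "VERB"},
    {"lemma": "hon", "upos": "PRON"},
]

print(neighbours_of_verb(toy_sentence, "säga"))
# [(None, 'han'), ('och', 'hon')]
```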



In [None]:
#example
print(gen_sets(sentences_swedish, "säga"))

{'verb': 'säga', 'subjects': {'barn', 'pojke', 'Shahar', 'Ryan', 'fru', 'Marlow', 'Ron', 'Hagrid', 'Malfoy', 'far', 'de', 'besättning', 'Dursley', 'vis', 'John', 'Margot', 'hon', 'Hjalmar', 'Borgin', 'Harry', 'George', 'du', 'legend', 'Neil', 'han', 'Dudley', 'vi', 'svensk', 'Fred', 'Roland', 'oljeställ', 'Kedourie', 'Bray', 'Compson', 'man', 'företrädare', 'ni', 'Quinn', 'någon', 'som', 'farmor', 'jag', 'Weasley', 'ingen', 'maskinist', 'Le', 'kvinna', 'stolsgranne', 'Paniotis', 'Vernon', 'Stella', 'Harold', 'mor', 'Wentz'}, 'objects': {'jomän', 'fri', 'vad', 'de', 'tio', 'stund', 'ja', 'ingenting', 'någonting', 'någon', 'följande', 'den', 'farmor', 'sig', 'månad', 'sak', 'denna', 'tvekan', 'se', 'pöbelvälde', 'tack', 'ord', 'mor', 'morgon'}, 'modifiers': {'skrattande', 'då', 'oskuldsfullt', 'förnumstigt', 'barskt', 'inte', 'varpå', 'så', 'nu', 'avsiktligt', 'nyss', 'frågande', 'dessutom', 'hastigt', 'kort', 'minst', 'uppriktigt', 'ursäktande', 'plötsligt', 'överhuvudtaget', 'hur', 'st

In [72]:
verb_sets = [gen_sets(sentences_swedish, lemma) for lemma, _ in verbs]
print(verb_sets)

[{'verb': 'säga', 'subjects': {'barn', 'pojke', 'Shahar', 'Ryan', 'fru', 'Marlow', 'Ron', 'Hagrid', 'Malfoy', 'far', 'de', 'besättning', 'Dursley', 'vis', 'John', 'Margot', 'hon', 'Hjalmar', 'Borgin', 'Harry', 'George', 'du', 'legend', 'Neil', 'han', 'Dudley', 'vi', 'svensk', 'Fred', 'Roland', 'oljeställ', 'Kedourie', 'Bray', 'Compson', 'man', 'företrädare', 'ni', 'Quinn', 'någon', 'som', 'farmor', 'jag', 'Weasley', 'ingen', 'maskinist', 'Le', 'kvinna', 'stolsgranne', 'Paniotis', 'Vernon', 'Stella', 'Harold', 'mor', 'Wentz'}, 'objects': {'jomän', 'fri', 'vad', 'de', 'tio', 'stund', 'ja', 'ingenting', 'någonting', 'någon', 'följande', 'den', 'farmor', 'sig', 'månad', 'sak', 'denna', 'tvekan', 'se', 'pöbelvälde', 'tack', 'ord', 'mor', 'morgon'}, 'modifiers': {'skrattande', 'då', 'oskuldsfullt', 'förnumstigt', 'barskt', 'inte', 'varpå', 'så', 'nu', 'avsiktligt', 'nyss', 'frågande', 'dessutom', 'hastigt', 'kort', 'minst', 'uppriktigt', 'ursäktande', 'plötsligt', 'överhuvudtaget', 'hur', 's