# English-Swedish Generalizations & Clustering
**Authors:** Riley Clark, Ishani Saha, Drew Marceau, Antoinette Reid

In [17]:
# Load Necessary Libraries
import conllu
import random
random.seed(123)
import math
import gensim.downloader as gensim_api
import numpy as np
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from gensim.models import Word2Vec
from scipy import stats
from functools import reduce
import json

# Task 1.2: English Syntax Generalization Practice
We used the English GUM corpus with 233k entries maintained by Georgetown University Linguistics students. The text comes from multiple media formats. Our goal was to examine a generalization about verbs and their direct object pronouns within the corpus. The generalization held for most, though not all, of the relevant sentences in the corpus.

## Our Generalization
**Expected Generalization:** Pronoun direct objects immediately follow their linked verbs.

An example that illustrates this generalization is provided below.

“First, our experimental subjects lived in a large enclosure under conditions that **allowed them** to exercise all day long.”

In [5]:
def englishGen(sentences, sample_size):
    fit_generalization = []
    possible_exceptions = []
    current = {}
    for sentence in sentences:
        for word in sentence:
            if word["upos"] == "PRON" and word["deprel"] == "obj": # and word["lemma"] != "that"
                if word["head"] is not None and sentence[word["head"]-1]["upos"] == "VERB":
                    current["sentence"] = sentence.metadata['text']
                    current["PRON"] = word
                    current["VERB"] = sentence[word["head"]-1]
                    # the pronoun fits if its id is exactly one more than the verb's id
                    if word["id"] != 0 and word["id"]-1 == word["head"]:
                        if current not in fit_generalization:
                            fit_generalization.append(current.copy())
                    else: # every other position is a possible exception
                        if current not in possible_exceptions:
                            possible_exceptions.append(current.copy())
    
    print(f"Sentences that fit the generalization: {len(fit_generalization)}\n")
    
    # cap the sample size so random.sample cannot raise on a short list
    fitted_samples = random.sample(fit_generalization, min(sample_size, len(fit_generalization)))
    
    for entry in fitted_samples:
        print(f'PRON: {entry["PRON"]}, VERB: {entry["VERB"]}\n Sentence: {entry["sentence"]}\n')
    
    print(f"\nSentences that may (or may not) be exceptions: {len(possible_exceptions)}\n")
    
    exception_samples = random.sample(possible_exceptions, min(sample_size, len(possible_exceptions)))
    
    for entry in exception_samples:
        print(f'PRON: {entry["PRON"]}, VERB: {entry["VERB"]}\n Sentence: {entry["sentence"]}\n')
    
    print(f"\nTotal sentences with Pronouns linked to Verbs: {len(fit_generalization)+len(possible_exceptions)}")

## Results
We used a function called 'englishGen' to search the corpus for sentences that contain pronoun (direct) objects linked to verbs. If those pronouns immediately followed the verb, then they fit our generalization. Otherwise, they were flagged as exceptions. Note that the deprel "obj" specifies only direct objects. If we wanted indirect objects we could use "iobj".
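The adjacency test can be illustrated on two hand-built sentences. This is a minimal sketch rather than the notebook's own code: the token dictionaries are hypothetical stand-ins for the tokens returned by `conllu.parse`, and `tok` and `pronoun_object_checks` are helper names introduced here for illustration.

```python
# Hypothetical hand-built tokens; they only mimic the keys used in englishGen.
def tok(id_, form, upos, deprel, head):
    return {"id": id_, "form": form, "upos": upos, "deprel": deprel, "head": head}

# "I like it": the pronoun object directly follows its verb.
follows = [tok(1, "I", "PRON", "nsubj", 2), tok(2, "like", "VERB", "root", 0),
           tok(3, "it", "PRON", "obj", 2)]
# "This I asked": the pronoun object has been moved to the front.
fronted = [tok(1, "This", "PRON", "obj", 3), tok(2, "I", "PRON", "nsubj", 3),
           tok(3, "asked", "VERB", "root", 0)]

def pronoun_object_checks(sentence):
    """Return (pronoun, verb, is_adjacent) for each pronoun direct object of a verb."""
    by_id = {t["id"]: t for t in sentence}  # look heads up by id, not list position
    checks = []
    for t in sentence:
        if t["upos"] == "PRON" and t["deprel"] == "obj":
            verb = by_id.get(t["head"])
            if verb is not None and verb["upos"] == "VERB":
                # Adjacent means the pronoun's id is the verb's id plus one.
                checks.append((t["form"], verb["form"], t["id"] - 1 == verb["id"]))
    return checks

results = pronoun_object_checks(follows)
moved = pronoun_object_checks(fronted)
print(results)
print(moved)
```

Looking the head up by `id` instead of by list position is a choice made for this sketch; it avoids assuming that token ids and list positions always line up.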

About 86% of the sentences with a pronoun direct object linked to a verb (995 of 1,151) fit our generalization; direct object pronouns do tend to immediately follow their linked verbs. Exceptions arise when an adverb modifies the verb, when an adjective modifies the pronoun, in wh-questions, and when a passivized verb requires argument movement that brings the direct object pronoun *forward* in the sentence, among other cases. Examples are provided below.

## Examples of Exceptions
### Wh-Question
"If your professor comes into an early morning class holding a mug of liquid, **what** do you assume she is **drinking**?"

### Stammer
“You know, **documenting**, uh, uh, **whatever**.”

### Modifying Preposition
"And I said ‘Ben, **pick** me out **something**, you've got fifty bucks to spend.'"

### Dialogue
"Whitmore told LA Weekly that on October 6 after traveling back from Florida, Montalvo ‘walked into lobby of the East L.A. station and turned himself in’, and **told** the police, ‘**everything** he did’.”

### Wh-Movement
“Malaysia and Indonesia have maintained a policy of turning away boats of migrants **which**, according to AFP, the United Nations and United States have both **criticised**.”

### Movement
"**This** I would have **asked** him had he not been so far away, but he was very far, and could not be seen at all when he drew nigh that gigantic reef."

In [7]:
with open("en_gum-ud-train.conllu", encoding="utf8") as f:
    data = f.read()
sentences_english = conllu.parse(data)
englishGen(sentences_english, 3)

Sentences that fit the generalization: 995

PRON: it, VERB: like
 Sentence: And I really like it.

PRON: her, VERB: distract
 Sentence: Back out in the mall, Cara is wailing, which could start an asthma attack, so to distract her I say, “You want a cookie?”

PRON: em, VERB: get
 Sentence: You know then, they have to, like, keep em, away from anything, you know, get em really in the soft ground, and, no hard pebbles, or hard clods of dirt or anything?


Sentences that may (or may not) be exceptions: 156

PRON: which, VERB: live
 Sentence: Yet the turning - point is past, and history begins anew for us, the history which we shall live and act and others will write about.

PRON: that, VERB: use
 Sentence: One of them is the channels that NBC as the broadcasting rights owner for the United States will use to air the Paralympic Games on.

PRON: what, VERB: figure
 Sentence: No, I was just trying to figure out what we spent already.


Total sentences with Pronouns linked to Verbs: 1151
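The share of sentences that fit the generalization can be checked directly from the two counts printed above. A minimal sketch; the variable names are introduced here for illustration only.

```python
fit = 995           # pronoun objects immediately after their verb
not_adjacent = 156  # possible exceptions
total = fit + not_adjacent
share = fit / total
print(f"{fit} of {total} pronoun objects fit the generalization: {share:.1%}")
```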


# Tasks 1.1 & 1.3: Swedish Verb Negation Generalization
We used the Swedish LinES corpus (from the Parallel Treebank of the same name) that includes just over 100k Swedish translations from English text. Our goal was to examine a generalization on Swedish verb negation. Our results conflicted with our expectations, and so we performed further examination to search for a different possible generalization.

## Our Generalization
**Expected Generalization:** Negation words immediately follow the verb they negate.

The example below illustrates this generalization.

“Hon **svarade inte**.”
(She didn't answer.)
('inte' = 'not', linked words are bolded)

In [8]:
def swedishGen(sentences, sample_size):
    fit_generalization = []
    possible_exceptions = []
    current = {}
    exceptions = []
    for sentence in sentences:
        for word in sentence:
            # xpos "NEG" marks the negation word; its head must be a verb
            # alternative test on morphological features instead of xpos:
            #if word["feats"] is not None and word["feats"].get("Polarity") == "Neg" and word["head"] is not None and sentence[word["head"]-1]["upos"] == "VERB":
            if word["xpos"] == "NEG" and word["head"] is not None and sentence[word["head"]-1]["upos"] == "VERB":
                current["sentence"] = sentence.metadata['text']
                try:
                    current["sentence-E"] = sentence.metadata['text_en']
                except KeyError: # no English translation stored for this sentence
                    current["sentence-E"] = None
                current["NEG"] = word
                current["VERB"] = sentence[word["head"]-1]
                if word["id"] != 0 and word["id"]-1 == word["head"]:
                    if current not in fit_generalization:
                        fit_generalization.append(current.copy())
                elif word["id"]-1 != word["head"]:
                    if current not in possible_exceptions:
                        possible_exceptions.append(current.copy())
                        exceptions.append(sentence)
                            
    print(f"Sentences that fit the generalization: {len(fit_generalization)}\n")
    fitted_samples = random.sample(fit_generalization, min(sample_size, len(fit_generalization)))
    for entry in fitted_samples:
        print(f'NEG: {entry["NEG"]}, VERB: {entry["VERB"]}\n'
              f'Sentence: {entry["sentence"]}\n'
              f'English Translation: {entry["sentence-E"]}\n')

    print(f"\nSentences that may (or may not) be exceptions: {len(possible_exceptions)}\n")
    fitted_samples = random.sample(possible_exceptions, min(sample_size, len(possible_exceptions)))
    for entry in fitted_samples:
        print(f'NEG: {entry["NEG"]}, VERB: {entry["VERB"]}\n'
              f'Sentence: {entry["sentence"]}\n'
              f'English Translation: {entry["sentence-E"]}\n')

    return exceptions

## Results & Discussion
We used a function called 'swedishGen' to search the corpus for negation words linked to verbs. Then we filtered instances where the negation came immediately after the verb; those examples fit our generalization. The other sentences were cached as exceptions.

Only about 21% of the negated verbs in our corpus (99 of 467) fit the expected generalization. The exceptions were numerous and included embedded clauses, questions, auxiliaries, and verb-object switches, among others. Some examples of these are provided below:

In [9]:
with open("sv_lines-ud-train.conllu", encoding="utf8") as f:
    data = f.read()
sentences_swedish = conllu.parse(data)
exceptions = swedishGen(sentences_swedish, 3)

Sentences that fit the generalization: 99

NEG: inte, VERB: visas
Sentence: När du exporterar till Excel blir detaljfälten tillgängliga i pivottabellens verktygsfält, men fälten visas inte i rapporten.
English Translation: When you export to Excel, detail fields will be available on the PivotTable toolbar in Excel, but the fields won't be displayed in the report.

NEG: inte, VERB: insåg
Sentence: Jag insåg inte genast vad det där skeppsbrottet verkligen innebar.
English Translation: I did not see the real significance of that wreck at once.

NEG: inte, VERB: tycker
Sentence: "Vi måste göda dig medan vi har chansen jag tycker inte den där skolmaten låter särskilt bra"
English Translation: We must feed you up while we've got the chance. I don't like the sound of that school food


Sentences that may (or may not) be exceptions: 368

NEG: inte, VERB: grälade
Sentence: Mina föräldrar grälade och när de inte grälade snäste de åt varandra och när de inte snäste planerade de en fest, ordnade e

## Secondary Corpus Test
We also tested the generalization on a slightly smaller corpus (96k entries), Talbanken from Lund University, whose sentences were taken from various text genres such as textbooks, brochures, and newspaper articles. The results were similar to the above: 22% of the sentences with negated verbs fit our generalization. This leads us to believe that translation bias in our first corpus is probably not the reason that our generalization fits so poorly.

## Generalization Exceptions
Our generalization seems to hold on simple sentences with little to no nuance.

Examples include:

“Hon svarade inte.” - “She didn't answer.”

“Hon talar inte jiddisch?” - “She doesn't speak Yiddish?”

These sentences relay straightforward information and do not contain many flourishes in speech. If we consider only this pattern, our generalization holds with 21% accuracy on verb negations.

However, if we take into account more detailed sentences, we see a different result. In examining our initial results, we identified two main exceptions that change the location of the negation. If we include this nuance, accuracy increases to 44%.

- **Auxiliary verbs:** if auxiliary verbs are present, the negation is placed between the auxiliary and the main (head) verb.

“Hans sekreterare hade inte ringt det samtal hon hade fått instruktioner om.”

“His secretary had not made the instructed call.”
	
- **Embedded clauses:** if there is an embedded clause, the negation follows the subject of the clause and precedes the main (head) verb. This seems to happen because of rules regarding VPs in Swedish.

“Jag har suttit här tålmodigt och jag finner det anmärkningsvärt att ni inte ropar upp mig.”

“I have sat here patiently and I find it quite extraordinary that you are not calling me.”
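The three placement patterns described so far can be sketched as a small classifier. This is an illustration under stated assumptions, not the procedure used to obtain the 44% figure: it assumes auxiliaries attach to the head verb with the relation `aux` and that embedded clauses are introduced by a dependent with the relation `mark`. The toy tokens and the names `tok` and `negation_fits` are hypothetical.

```python
# Hypothetical hand-built tokens that mimic the keys of parsed CoNLL-U tokens.
def tok(id_, form, deprel, head):
    return {"id": id_, "form": form, "deprel": deprel, "head": head}

def negation_fits(sentence, neg):
    """Name the placement pattern a negation token satisfies, or return None."""
    verb_id = neg["head"]
    dependents = [t for t in sentence if t["head"] == verb_id]
    # Pattern 1: the negation immediately follows its head verb.
    if neg["id"] == verb_id + 1:
        return "after verb"
    # Pattern 2: the negation sits between an auxiliary and the head verb.
    aux_ids = [t["id"] for t in dependents if t["deprel"] == "aux"]
    if any(a < neg["id"] < verb_id for a in aux_ids):
        return "between auxiliary and verb"
    # Pattern 3: in a clause with a subordinator, the negation follows the
    # subject and precedes the head verb.
    has_mark = any(t["deprel"] == "mark" for t in dependents)
    subj_ids = [t["id"] for t in dependents if t["deprel"] == "nsubj"]
    if has_mark and any(s < neg["id"] < verb_id for s in subj_ids):
        return "embedded clause"
    return None

simple = [tok(1, "Hon", "nsubj", 2), tok(2, "svarade", "root", 0),
          tok(3, "inte", "advmod", 2)]
with_aux = [tok(1, "sekreterare", "nsubj", 4), tok(2, "hade", "aux", 4),
            tok(3, "inte", "advmod", 4), tok(4, "ringt", "root", 0)]
embedded = [tok(1, "att", "mark", 4), tok(2, "ni", "nsubj", 4),
            tok(3, "inte", "advmod", 4), tok(4, "ropar", "ccomp", 0)]
shifted = [tok(1, "Jag", "nsubj", 2), tok(2, "förstår", "root", 0),
           tok(3, "det", "obj", 2), tok(4, "inte", "advmod", 2)]

for sentence in (simple, with_aux, embedded, shifted):
    print(negation_fits(sentence, next(t for t in sentence if t["form"] == "inte")))
```

A negation that matches none of the three patterns, as in the last toy sentence, is returned as `None` and would still count as an exception.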

While 44% accuracy might seem low, this result can be explained by recognizing a feature of Swedish that lets words be reordered to put emphasis on certain aspects of the sentence. For example, Object Shift allows the object of a verb to swap places with the negation while still producing a grammatical sentence.

“Jag förstår det inte alls.”

“I do not understand it at all.”


# Task 2
## Verb Frequency

In [None]:
def verb_frequencies(sentences):
    verb_freq = {}
    # verbs = []
    for sentence in sentences:
        for word in sentence:
            if word["upos"] == "VERB":
                verb = word["lemma"]
                verb_freq[verb] = verb_freq.get(verb,0) + 1
    verbs = list(verb_freq.keys())
    print(f"There are {len(verbs)} verbs used in the corpus.")
    #print(f"Those verbs are: {verbs}")
    sorted_by_frequency_desc = sorted(verb_freq.items(), key=lambda item: item[1], reverse=True)
    first_five = sorted_by_frequency_desc[:5] # optionally random sample the top 20%
    # it would be better if we can print this in a nicer format, and include the english translation
    print(f"The highest frequency verbs are: {first_five}")
    # "mid-frequency" here means the five verbs starting at the 20% mark of the sorted list
    middle_five = sorted_by_frequency_desc[len(sorted_by_frequency_desc)//5:len(sorted_by_frequency_desc)//5+5]
    print(f"Some of the mid-frequency verbs are: {middle_five}")

    #choosing a randomized set from top 20%
    arbitrary_top20 = [sorted_by_frequency_desc[x] for x in random.sample(range(len(verbs)//5), 5)]
    arbitrary_between20and40 = [sorted_by_frequency_desc[x] for x in random.sample(range(len(verbs)//5,2*len(verbs)//5), 5)]
    return arbitrary_top20 + arbitrary_between20and40

In [None]:
verb_frequencies(sentences_swedish)

There are 1308 verbs used in the corpus.
The highest frequency verbs are: [('säga', 294), ('ha', 225), ('komma', 188), ('se', 185), ('gå', 180)]
Some of the mid-frequency verbs are: [('misstänka', 5), ('erinra', 5), ('stoppa', 5), ('lukta', 5), ('vila', 5)]


[('lägga', 47),
 ('räcka', 17),
 ('hänga', 19),
 ('hälsa', 8),
 ('tränga', 8),
 ('dränka', 3),
 ('besvara', 4),
 ('misstänka', 5),
 ('testa', 2),
 ('uppmana', 4)]

The verbs chosen are randomly sampled from the top 20% of verbs when sorted by frequency and the next 20% of most frequently used verbs -- five from each category. Frequency was determined by tallying the lemmas of each verb.
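The band-based sampling can be sketched on toy data. This is a minimal illustration, not the notebook's function: `sample_frequency_bands`, its `per_band` and `seed` parameters, and the toy lemmas are all introduced here, and a local `random.Random` is used so that the global random state is left alone.

```python
import random
from collections import Counter

def sample_frequency_bands(lemmas, per_band=5, seed=123):
    """Sample lemmas from the top 20% and the next 20% of the frequency ranking."""
    rng = random.Random(seed)               # local generator with a fixed seed
    ranked = Counter(lemmas).most_common()  # (lemma, count), most frequent first
    band = len(ranked) // 5                 # number of lemmas in one 20% band
    size = min(per_band, band)              # never ask for more than a band holds
    top = rng.sample(ranked[:band], size)
    following = rng.sample(ranked[band:2 * band], size)
    return top, following

# Toy data: verb0 occurs 10 times, verb1 9 times, ..., verb9 once.
toy_lemmas = [f"verb{i}" for i in range(10) for _ in range(10 - i)]
top, following = sample_frequency_bands(toy_lemmas, per_band=1)
print(top, following)
```

With ten distinct toy lemmas each band holds two entries, so the first sample comes from `verb0` and `verb1` and the second from `verb2` and `verb3`.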

## Verb Sets

In [37]:
def gen_sets(sentences, verb):
    sets = {"verb": verb, "subjects": set(), "objects": set(), "modifiers": set(), "before": set(), "after": set(), "clausalcomps": set()}
    for sentence in sentences:
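        words = [x['lemma'] for x in sentence]
        if (verb in words):
            # 1-based position of the first token whose lemma matches the verb
            word_id = words.index(verb)+1
            # guard the neighbours: a sentence-initial verb must not wrap around to
            # the last word, and a sentence-final verb must not raise IndexError
            if word_id >= 2:
                sets["before"].add(words[word_id-2])
            if word_id < len(words):
                sets["after"].add(words[word_id])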
            for word in sentence:
                if(word["deprel"] in ["obj", "nsubj", "iobj", "advmod", "ccomp"] and word["head"] == word_id):
                    match word["deprel"]:
                        case "obj" | "iobj":
                            if word["upos"]!= "PROPN": sets["objects"].add(word["lemma"])
                           # sets["objects"].add(word["lemma"])
                        case "nsubj":
                            if word["upos"]!= "PROPN": sets["subjects"].add(word["lemma"]) 
                            #sets["subjects"].add(word["lemma"])
                        case "advmod": #includes negation
                            sets["modifiers"].add(word["lemma"])
                        case "ccomp":
                            sets["clausalcomps"].add(word["lemma"])
                    
    return sets

For any verb *lemma* in the set of Swedish sentences, this function generates a dictionary containing sets of the subjects, objects, modifiers, clausal complements, preceding words, and following words of the given verb. Proper nouns are left out of the subject and object sets. The modifier set contains only words with the dependency relation "advmod" to the verb, that is, adverbial modifiers (which, as the code comment notes, include negation), and not other dependents such as prepositions or auxiliaries. Including those would likely have skewed our results by adding noise, since adverbs are probably more significant semantically. However, it may have been beneficial to include other modifiers as well.

## Word 2 Vector Model

After we have the sets of words that we need, we need to make the Word2Vec model. The following function trains it on the lemmatized corpus and returns the word vectors (the model's `wv` attribute), so that we can call gensim's similarity functions on them:

In [38]:
# train a Word2Vec model on the lemmatized corpus
def make_W2V(conllu_corpus):
    sentences = []
    for tokList in conllu_corpus:
        # one training sentence per parsed sentence, as a list of lemmas
        sent = [token["lemma"] for token in tokList]
        sentences.append(sent)

    # skip-gram (sg=1), 300-dimensional vectors, every lemma kept (min_count=0)
    space = Word2Vec(sentences, epochs=10, min_count=0, vector_size=300, sg = 1)
    return space.wv

Now that we have the Word2Vec vectors, we can compute the k semantically nearest words to a given word. We can also find the centroid of each set by summing all of the word vectors in the set and taking the word whose vector is most similar to the sum. Note that similarity is measured with cosine similarity, so we do not need to divide the sum by the number of words (which would give the average vector): scaling a vector does not change its cosine similarity to any other vector.
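The claim about the sum and the average can be checked numerically. The sketch below uses small made-up vectors and a hand-written `cosine` helper (both introduced here for illustration) instead of the trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional "word vectors" standing in for one verb's set.
vectors = [[1.0, 2.0, 0.0], [3.0, 0.0, 1.0], [0.0, 1.0, 4.0]]
total = [sum(column) for column in zip(*vectors)]  # summed vector
mean = [x / len(vectors) for x in total]           # averaged vector
query = [2.0, 1.0, 1.0]                            # any other vector

# Both comparisons give the same similarity, up to floating-point rounding.
print(cosine(total, query), cosine(mean, query))
```

Dividing by the number of vectors only rescales the sum, and the rescaling cancels between the numerator and the denominator of the cosine.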

In [39]:
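def k_nearest(k, space, vector):
    # `vector` is a vocabulary word, or the "empty set" sentinel from find_centroid
    if vector == "empty set":
        return "empty set"
    # ask gensim for exactly k neighbours (its default is 10)
    return space.most_similar(vector, topn=k)

def find_centroid(words: set, space):
    # the sentinel is wrapped in a list so callers can index [0] in both cases
    if len(words) == 0:
        return ["empty set"]
    total = [space[token] for token in words]
    
    vector_sum = reduce(lambda x, y: x + y, total)
    # nearest vocabulary entry to the summed vector, as a (word, similarity) pair
    return space.similar_by_vector(vector_sum)[0]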

In [44]:
def outputTopK(label, verb, sets, space, svToEn, translate=True, k=5):
    topK = k_nearest(k, space, find_centroid(sets[label], space)[0])
    arr = []
    if topK == "empty set":
        print(f'The {label} set is empty for verb {verb}')
        return
    for pair in topK:
        if translate: arr.append(svToEn[pair[0]])
        else: arr.append(pair[0])
    print(f'The top {k} words most similar to \'{verb}\' (\'{svToEn[verb]}\') in the {label} set are: {arr}')

def outputAllSets(verb, k=5):
    sets = gen_sets(sentences_swedish, verb)
    space = make_W2V(sentences_swedish)
    with open('dictionary.json', 'r') as f:
        svToEn = json.load(f)
    # pass k through; otherwise the k argument of outputAllSets is ignored
    outputTopK("subjects", verb, sets, space, svToEn, k=k)
    outputTopK("objects", verb, sets, space, svToEn, k=k)
    outputTopK("modifiers", verb, sets, space, svToEn, k=k)
    outputTopK("before", verb, sets, space, svToEn, k=k)
    outputTopK("after", verb, sets, space, svToEn, k=k)


outputAllSets("ha")

The top 5 words most similar to 'ha' ('have') in the subjects set are: ['railway', 'literature', 'ty', 'roll', 'presuppose']
The top 5 words most similar to 'ha' ('have') in the objects set are: ['conviction', 'Clelia', 'greatness', 'Society', 'rinse']
The top 5 words most similar to 'ha' ('have') in the modifiers set are: ['lust', 'before', 'bad', 'away', 'suspect']
The top 5 words most similar to 'ha' ('have') in the before set are: ['wonder', 'pay', 'intention', 'marriage', 'Jerusalem']
The top 5 words most similar to 'ha' ('have') in the after set are: ['Wonderful', 'home', 'immediately', 'Margot', 'everything']


## A. Which out of the subject/object/modifier/before-after sets gives the best sense of the meaning of the verb?
'ha' means 'to have' or 'to possess'. The object set seems to give the best sense of the verb's meaning, suggesting phrases like 'to have hope', 'to possess horse', 'to have hold', and 'to possess greatness' (these exact words may not appear in the output above because the Word2Vec model is nondeterministic). Keep in mind that these are lemmas, so they may mean different things when detached from their context. The after set also seems to line up well with the meaning of the verb; however, it would be harder to interpret without being given the verb.

## B. Does frequency of the verb have an effect on the kinds of clusters you end up with?
Yes and no. The function we use for vector similarity implements cosine similarity, which does not depend on the length of the vector, so in that sense frequency does not matter. However, a verb that appears more often in the corpus is seen in more of its uses, so its embedding is likely to reflect its actual meaning more closely. This better embedding may in turn affect the clusters that appear.

## C. One additional experiment: Clausal Complement Set


In [75]:
space = make_W2V(sentences_swedish)
with open('dictionary.json', 'r') as f:
    svToEn = json.load(f)

verbs = ["ha", "komma", "se", "tala"]

for verb in verbs:
    sets = gen_sets(sentences_swedish, verb)
    outputTopK("clausalcomps", verb, sets, space, svToEn, k=8)

The top 8 words most similar to 'ha' ('have') in the clausalcomps set are: ['fly', 'send', 'go', 'think', 'go', 'stop', 'set', 'jab']
The top 8 words most similar to 'komma' ('come') in the clausalcomps set are: ['exist', 'express', 'miss', 'opportunity', 'opinion', 'event', 'roll', 'beginning']
The top 8 words most similar to 'se' ('see') in the clausalcomps set are: ['clock', 'parent', 'set', 'Boy', 'fetch', 'quiet', 'then', 'laugh']
The top 8 words most similar to 'tala' ('say') in the clausalcomps set are: ['meet', 'of course', 'right', 'imagine', 'hope', 'already', 'forget', 'yet']


What we can infer from these lists:  
&emsp;1. 'ha' seems to take a variety of verbs, suggesting that it is a very general verb in Swedish. This tracks with it being among the most frequent words in the corpus. More specifically, though, we can say that it is likely to take stative predicates.  

&emsp;2. 'komma' takes mostly nouns/adjectives, which suggests that rather than the action meaning of 'come', 'komma' is used more in evaluative predications ('came to be', 'came to know'). It is strange, though, that there are few to no verbs, so perhaps the data is just noisy.  

&emsp;3. 'se' seems to be a perception verb, taking words like 'leave', 'each other', 'fetch', etc. It is used to set up concrete actions.  

&emsp;4. 'tala' is the cleanest cluster, relating most to cognition verbs ('hope', 'wonder', 'mean'). This shows us that it is likely to mean something about describing thought or attitude.


To conclude, the clausal complement set is able to tell us a little bit about the meaning of the given verb, more so for some than others. For example, 'komma' is used in a much more mixed context than 'tala', so it is a bit easier to tell what 'tala' means than to tell what 'komma' means. Overall though, it is not very obvious what verb was used just by looking at the sets. It helps to narrow down the search to broad meanings like perception verbs, general-use verbs, cognition verbs, etc., but not to the specific meaning of the verb.