Name: Max Boholm (no group)

# A2: Vector Semantics

Nikolai Ilinykh, Mehdi Ghanimifard, Wafia Adouane and Simon Dobnik

The lab is an exploration and learning exercise to be done in a group and also in discussion with the teachers and other students.

Before starting, please read [the following instructions](https://github.com/sdobnik/computational-semantics/blob/master/README.md) on how to work on group assignments.

Write all your answers and the code in the appropriate boxes below.

In this lab we will look at how to build distributional semantic models from corpora and use semantic similarity captured by these models to do semantic tasks. We are also going to examine how different vector composition functions for phrases affect both the model and the learned information about similarities.  

Note that this lab uses a code from `dist_erk.py`, which contains functions that highly resemble those shown during the lecture. In the end, you can use either of the functions (from the lecture / from the file) to solve the tasks.

In [1]:
# the following command simply imports all the methods from that code.
from dist_erk import *

## 1. Loading a corpus

To train a distributional model, we first need a sufficiently large collection of texts which contain different words used frequently enough in different contexts. Here we will use a section of the Wikipedia corpus which you can download from [here](https://linux.dobnik.net/cloud/index.php/s/isMBj49jt5renYt?path=%2Fresources%2Fa2-distributional-representations) (wikipedia.txt.zip). (This file has been borrowed from another lab by [Richard Johansson](http://www.cse.chalmers.se/~richajo/)).  
When unpacked, the file is 151mb, hence if you are using the MLT servers you should store it in a temporary folder outside your home and adjust the `corpus_dir` path below.  
<!-- <It may already exist in `/opt/mlt/courses/cl2015/a5`.> -->


In [2]:
corpus_dir = "/home/max/Documents/resources/wikipedia/"

## 2. Building a model

Now you are ready to build the model.  
Using the methods from the code imported above build three word matrices with 1000 dimensions as follows:  

(i) with raw counts (saved to a variable `space_1k`);  
(ii) with PPMI (`ppmispace_1k`);  
(iii) with reduced dimensions SVD (`svdspace_1k`).  
For the latter use `svddim=5`. **[5 marks]**

Your task is to replace `...` with function calls. Functions are imported from `dist_erk.py` earlier, and they largely resemble functions shown during the lecture.

In [3]:
numdims = 1000
svddim = 5

# which words to use as targets and context words?
# we need to count the words and keep only the N most frequent ones
# which function would you use here with which variable?
ktw = do_word_count(corpus_dir, numdims)

mwi= make_word_index(ktw)
#M.B. why we need these?
wi = make_word_index(ktw).keys() # word index
words_in_order = sorted(make_word_index(ktw).values()) # sorted words

# create different spaces (the original matrix space, the ppmi space, the svd space)
# which functions with which arguments would you use here?
print('create count matrices')
space_1k = make_space(corpus_dir, mwi, numdims)
print('ppmi transform')
ppmispace_1k = ppmi_transform(space_1k, mwi)
print('svd transform')
svdspace_1k = svd_transform(space_1k, numdims, svddim)
print('done.')

reading file wikipedia.txt
create count matrices
reading file wikipedia.txt
ppmi transform
svd transform
done.


In [4]:
# now, to test the space, you can print vector representation for some words
print('house:', space_1k['house'])

house: [2554 3774 3105  567  962  631  443  185  311  189  131   28   93  169
   81  125  151  408  194   90   79   29  217  184   62   15   31   70
   10    1   41   21    1   31   37    1   30    5   25    7    3   20
   11    1   32   36    2    5   66    4    0   46    8   18   28    0
   20    7    8   16   10   40    0  175   10    2    7   19    1  174
   11    3    1    6    0    0    0   10    9   11    7   24    4    4
   14   23   58    7    0   10    2    3   10    6   18    6   13    3
   22    0    3    5    3    7   14    3   40   20   19   15    6    8
   24    4    5    1   19    0    3    1    0   14    0   14   53    7
    7   11    6    5    5    4   12    6   53    1    1  433    4    0
    5    7    7   12    1    1    3    4   17    8   16    1    2   31
    1   12   14    1   44    6   14    9   38    7    2    6    8    1
   10    6   10    1    9    7    9    4    3   10    0   11    3    2
    0    2   11   37    2    0    2    1    5    9   10   16   88    6

Oxford Advanced Dictionary has 185,000 words, hence 1,000 words is not representative. We trained a model with 10,000 words, and 50 dimensions on truncated SVD. It took 40 minutes on a laptop. We saved all three matrices [here](https://linux.dobnik.net/cloud/index.php/s/isMBj49jt5renYt?path=%2Fresources%2Fa2-distributional-representations) (pretrained.zip). Download them and unpack them to a `pretrained` folder which should be a subfolder of the folder with this notebook:

In [5]:
import numpy as np

numdims = 10000
svddim = 50

print('Please wait...')
ktw_10k       = np.load('./pretrained/ktw_wikipediaktw.npy', allow_pickle=True)
space_10k     = np.load('./pretrained/raw_wikipediaktw.npy', allow_pickle=True).all()
ppmispace_10k = np.load('./pretrained/ppmi_wikipediaktw.npy', allow_pickle=True).all()
svdspace_10k  = np.load('./pretrained/svd50_wikipedia10k.npy', allow_pickle=True).all()
print('Done.')


Please wait...
Done.


In [6]:
# testing semantic space
print('house:', space_10k['house'])

house: [2554 3774 3105 ...    0    0    0]


## 3. Testing semantic similarity

The file `similarity_judgements.txt` (a copy is included with this notebook) contains 7,576 pairs of words and their lexical and visual similarities (based on the pictures) collected through crowd-sourcing using Mechanical Turk as described in [1]. The score range from 1 (highly dissimilar) to 5 (highly similar). Note: this is a different dataset from the phrase similarity dataset we discussed during the lecture (the one from [2]). For more information, please read the papers.

The following code will transform similarity scores into a Python-friendly format:

In [7]:
word_pairs = [] # test suit word pairs
semantic_similarity = [] 
visual_similarity = []
test_vocab = set()

for index, line in enumerate(open('similarity_judgements.txt')):
    data = line.strip().split('\t')
    if index > 0 and len(data) == 3:
        w1, w2 = tuple(data[0].split('#'))
        # it will check if both words from each pair exist in the word matrix.
        if w1 in ktw_10k and w2 in ktw_10k:
            word_pairs.append((w1, w2))
            test_vocab.update([w1, w2])
            semantic_similarity.append(float(data[1]))
            visual_similarity.append(float(data[2]))
        
print('number of available words to test:', len(test_vocab-(test_vocab-set(ktw))))
print('number of available word pairs to test:', len(word_pairs))
#list(zip(word_pairs, visual_similarity, semantic_similarity))

number of available words to test: 12
number of available word pairs to test: 774


Now we are going to test how the cosine similarity between vectors of each of the three spaces (normal space, ppmi, svd) compares with the human similarity judgements for the words in the similarity dataset. Which of the three spaces best approximates human judgements?

For comparison of several scores, we can use [Spearman correlation coefficient](https://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient) which is implemented in `scipy.stats.spearmanr` [here](https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.stats.spearmanr.html). The values of the Sperman correlation coefficient range from -1, 0 to 1, where 0 indicates no correlation, 1 perfect correaltion and -1 negative correlation. Hence, the greater the number the better the similarity scores align. The p values tells us if the coefficient is statistically significant. For this to be the case, it must be less than or equal to $< 0.05$.

Here is how you can calculate Pearson's correlation coefficient betweeen the scores of visual similarity and semantic similarity of the available words in the test suite:

In [8]:
from scipy import stats

rho, pval = stats.spearmanr(semantic_similarity, visual_similarity)
print("""Visual Similarity vs. Semantic Similarity:
rho     = {:.4f}
p-value = {:.4f}""".format(rho, pval))


Visual Similarity vs. Semantic Similarity:
rho     = 0.7122
p-value = 0.0000


Let's now calculate the cosine similarity scores of all word pairs in an ordered list using all three matrices. **[6 marks]**

In [9]:
raw_similarities  = [cosine(w1, w2, space_10k) for w1, w2 in word_pairs]
ppmi_similarities = [cosine(w1, w2, ppmispace_10k) for w1, w2 in word_pairs]
svd_similarities  = [cosine(w1, w2, svdspace_10k) for w1, w2 in word_pairs]

Now, calculate correlation coefficients between lists of similarity scores and the real semantic similarity scores from the experiment. The scores of what model best correlates them? Is this expected? **[6 marks]**

In [10]:
from scipy.stats import spearmanr as correlation

experiment={}
experiment["raw"] = correlation(semantic_similarity,raw_similarities)
experiment["pmi"] = correlation(semantic_similarity,ppmi_similarities)
experiment["svd"] = correlation(semantic_similarity,svd_similarities)

print("\trho\tp")
for condition in ["raw", "pmi", "svd"]:
    print("{}\t{}\t{}".format(condition, round(experiment[condition][0], 3), round(experiment[condition][1], 3)))

	rho	p
raw	0.152	0.0
pmi	0.455	0.0
svd	0.423	0.0


**Your answer should go here:**
The experiment tells us:

1. There is a positive correlation between the judgments and all three distributional models. 
2. All correlations are signifiant at 0.001 level.
3. The raw representation performs worst.
4. The PMI and SVD representations performs close to equally well; PMI model is 0.032 better.

It is expected that the normalised models perform better than the raw count model. 

We can also calculate correlation coefficients between lists of cosine similarity scores and the real visual similarity scores from the experiment. Which similarity model best correlates with them? How do the correlation coefficients compare with those from the previous comparison - and can you speculate why do we get such results? **[7 marks]**

In [11]:
from scipy.stats import spearmanr as correlation

experiment_2={}
experiment_2["raw"] = correlation(visual_similarity,raw_similarities)
experiment_2["pmi"] = correlation(visual_similarity,ppmi_similarities)
experiment_2["svd"] = correlation(visual_similarity,svd_similarities)

print("\trho\tp")
for condition in ["raw", "pmi", "svd"]:
    print("{}\t{}\t{}".format(condition, round(experiment_2[condition][0], 3), round(experiment_2[condition][1], 3)))

	rho	p
raw	0.121	0.001
pmi	0.384	0.0
svd	0.31	0.0


**Your answer should go here:**

Here are my observations:

1. The range of the scores for the three models are lower for Experiment 2 than Experiment 1, i.e. there is less difference between them.
2. All three models perform worse in Experiment 2 than Experiment 1.
3. The difference between PMI and SVD is larger for Experiment 2 than Experiment 1.

A tentative "structuralist" conclusion can be drawn from this. That is, the meaning of words are defined by their relation to other words within the system (i.e. language), while external factors, such as, how similiar referents are, are of less importance for meaning. According to this hypothesis, we expect higher correlations between distributional language models and semantic similiarity judgments (as both are system internal), than with visual judgements (which is external to language). 

I have no clear idea why PMI performs better than SVD in Experiment 2, than Experiment 1.

## 4. Operations on similarities

We can perform mathematical operations on vectors to derive meaning predictions. For example, we can subtract the normalised vectors for `king` minus `queen` and add the resulting vector to `man` and we hope to get the vector for `woman`. Why? **[3 marks]**

**Your answer should go here:**

First, some definitions:
*    we have two pairs: *a1*--*a2* and *b1*--*b2*; for example, *king* (*a1*)--*queen* (*a2*) and *man* (*b1*)--*woman* (*b2*)
*    the assumed mathematical relation defined above is: *b2* = *b1* - (*a1* - *a2*)

Now, the question is *why* we should make this assumption. 
*Attempted explanation:*
The semantic difference between *king* and *queen* is the dimension *+MAN*. Consider a classical semantic component analysis of the pair:

|Word |HUMAN|...|ROYAL|MAN|
|-----|-----|---|-----|---|
|King |  +  |...|  +  | + |
|Queen|  +  |...|  +  | - |

Assume that vectors ultimately represent conceptual content like this. Calculating the diffenrence between *king* and *queen* will result in the vector corresponding to the meaning *+MAN*. (We use substraction to capture semantic difference.) Now, we use *that* vector (representing *+MAN*) and substract from the vector of *man*, which would leave us the vector for *woman*. Again, the idea, perhaps is clarified by considering components:

|Word |HUMAN|...|MAN|
|-----|-----|---|---|
|Man  |  +  |...| + |
|Woman|  +  |...| - |

Here is some helpful code that allows us to calculate such comparisons.

In [12]:
from scipy.spatial import distance

def normalize(vec):
    return vec / veclen(vec)

def find_similar_to(vec1, space):
    # vector similarity funciton
    #sim_fn = lambda a, b: 1-distance.euclidean(normalize(a), normalize(b))
    #sim_fn = lambda a, b: 1-distance.correlation(a, b)
    #sim_fn = lambda a, b: 1-distance.cityblock(normalize(a), normalize(b))
    #sim_fn = lambda a, b: 1-distance.chebyshev(normalize(a), normalize(b))
    #sim_fn = lambda a, b: np.dot(normalize(a), normalize(b))
    sim_fn = lambda a, b: 1-distance.cosine(a, b)

    sims = [
        (word2, sim_fn(vec1, space[word2]))
        for word2 in space.keys()
    ]
    return sorted(sims, key = lambda p:p[1], reverse=True)

Here is how you apply this code. Comment on the results you get. **[3 marks]**

In [13]:
short = normalize(svdspace_10k['short'])
light = normalize(svdspace_10k['light'])
long = normalize(svdspace_10k['long'])
heavy = normalize(svdspace_10k['heavy'])

find_similar_to(light - (heavy - long), svdspace_10k)[:10]

[('long', 0.8733111261346901),
 ('above', 0.8259671977311955),
 ('around', 0.8030776291120685),
 ('sun', 0.7692439111243973),
 ('just', 0.7678481974778111),
 ('wide', 0.767257431992253),
 ('each', 0.7665960260861158),
 ('circle', 0.7647746702909336),
 ('length', 0.7601066921319761),
 ('almost', 0.7542351860536628)]

**Your answer should go here:**

Here are two observations:

1. The ambiguity of *light* seems to cause confusion for the prediction.
2. For present purposes, the prediction "over-estimates" the contribution of the second element of the first pair. That is, we do not get *b2* (*short*), but *a2* (*long*) from our function *b1* - (*a1* - *a2*), discussed above. 

Find 5 similar pairs of pairs of words and test them. Hint: Google for `word analogies examples`. You can also construct analogies that are less lexical but more grammatical, e.g. `see, saw, leave, ?` or analogies that are based on world knowledge as in the [Google analogy dataset](http://download.tensorflow.org/data/questions-words.txt) from [3]. Does the resulting vector similarity confirm your expectations? But remember you can only do this if the words are contained in our vector space with 10,000 dimensions. **[10 marks]**

In [14]:
def in_model(quadruple, model):
    """Tests if the words of a quadruple are in a model."""
    not_in_model=[]
    for w in quadruple:
        try:
            model[w]
        except KeyError:
            not_in_model.append(w)
    
    if len(not_in_model) == 1:
        print(f"This word is not in model: {not_in_model[0]}")
    elif len(not_in_model) > 1:
        print("These words are not in model:")
        for w in not_in_model:
            print(w)

def predictor(quadruple, model, k=1):
    """Takes a quadruple of words and a model, and returns the "accuracy" of a prediction from that model. 
    Arguments: 
    - quadruple of words a1, a2, b1, b2, such that "a1 is to a2 as b1 is to b2" 
    (e.g., 'king' is to 'queen' as 'man' is to 'woman'
    - vector model of word meanings;
    - k, which defines which predictions in a list of weighted predictions 
    to consider as the prediction(s) of the algorithm. By default k=1, i.e. the first prediction
    of the list. However, to allow for more forgiving evaluations, k can be set to include the top k 
    predicitions of the list.
    """
    
    vectors=[normalize(model[w]) for w in quadruple]
    new_vector=vectors[3]-(vectors[1]-vectors[2])
    # b2 = b1 - (a1 - a2)
    my_prediction = [w for w,n in find_similar_to(new_vector, model)[:k]]
    
    if quadruple[-1] in my_prediction:
        print("Correct prediction ({})!".format(",".join(my_prediction)))
    else:
        print("Incorrect prediction ({} instead of {})".format(",".join(my_prediction), quadruple[-1]))
    
pairs=[
    ("man","woman","boy","girl"),
    ("night", "day", "dark", "light"),
    ("oslo", "norway", "cairo", "egypt"), #from http://download.tensorflow.org/data/questions-words.txt
    ("husband", "wife", "brother", "sister"),
    ("bull", "cow", "husband", "wife")
]

for pair in pairs:
    in_model(pair, svdspace_10k)

print("Predictions given algorithm k1:")
for pair in pairs:
    predictor(pair, svdspace_10k)

print("\nPredictions given algorithm k2:")
for pair in pairs:
    predictor(pair, svdspace_10k, k=2)

print("\nPredictions given algorithm k3:")
for pair in pairs:
    predictor(pair, svdspace_10k, k=3)  


Predictions given algorithm k1:
Incorrect prediction (boy instead of girl)
Incorrect prediction (dark instead of light)
Incorrect prediction (cairo instead of egypt)
Correct prediction (sister)!
Incorrect prediction (husband instead of wife)

Predictions given algorithm k2:
Correct prediction (boy,girl)!
Incorrect prediction (dark,thin instead of light)
Incorrect prediction (cairo,jerusalem instead of egypt)
Correct prediction (sister,brother)!
Incorrect prediction (husband,married instead of wife)

Predictions given algorithm k3:
Correct prediction (boy,girl,cat)!
Correct prediction (dark,thin,light)!
Incorrect prediction (cairo,jerusalem,carthage instead of egypt)
Correct prediction (sister,brother,uncle)!
Correct prediction (husband,married,wife)!


## 5. Semantic composition and phrase similarity **[20 marks]**

In this task, we are going to look at how different semantic composition models, introduced in [2] correlate with human judgements. The file with the dataset is `mitchell_lapata_acl08.txt` included with this notebook. Your task is to do the following:  

(i) process the dataset, extract pairs of `reference - landmark high` and `reference - landmark low`; you can use the code from the lecture as something to start with. Note that there are 2 landmarks for each reference: one landmark exhibits high similarity with the reference, while another one has low similarity with the reference. A single human participant could have evaluated both of these pairs. For more details, we refer you to the paper.  

(ii) build models of semantic phrase composition: in the lecture we introduced simple additive, simple multiplicative and combined models (details are in [2]). Your task is to take a single pair (a reference or a high similarity landmark or a low similarity landmark) and compute the composition of its vectors using each of these functions. Thus, you will have three compositional models that take a `noun - verb` pair and output a single vector, representing the meaning of this pair. As your semantic space, you can use pretrained spaces (standard space, ppmi or svd) introduced above. It is up to you which space you use, but for someone who runs your code, it should be pretty straightforward to switch between them.

(iii) calculate Spearman correlation between each model's predictions and human judgements; you should have something similar to the scores that are shown in the paper [2]:  

![title](./res.png)

The paper states that they calculated correlations between each individual participant's judgeements and each model's predictions.  

Let's say we have 3 models: simple additive (A), simple multiplicative (M), combined (C).  
From our task dataset, we also know that we have 20 participants.  
Now, for each participant in 20 participants we get all `verb - noun` pairs that these participated evaluated.  
For example:

In [15]:
participant_judgemenets_example = [
 'participant50 chatter child gabble 6 high',
 'participant50 chatter tooth click 2 high',
 'participant50 reel head whirl 5 high',
 'participant50 reel mind stagger 4 low',
 'participant50 reel industry stagger 5 high',
 'participant50 reel man whirl 3 low',
 'participant50 glow fire beam 7 low',
 'participant50 glow face burn 3 low',
 'participant50 glow cigar burn 5 high',
 'participant50 glow skin beam 7 high'
    
]

In [16]:
participant_judgemenets_example

['participant50 chatter child gabble 6 high',
 'participant50 chatter tooth click 2 high',
 'participant50 reel head whirl 5 high',
 'participant50 reel mind stagger 4 low',
 'participant50 reel industry stagger 5 high',
 'participant50 reel man whirl 3 low',
 'participant50 glow fire beam 7 low',
 'participant50 glow face burn 3 low',
 'participant50 glow cigar burn 5 high',
 'participant50 glow skin beam 7 high']

Let's look at the first pair that participant50 evaluated: reference `child chatter` and high-level similarity landmark (as the last word in the row indicates) `child gabble`. The human gave the similarity score of 6 (very similar). Thus, human similarity judgment = [6].  

Our A model's output:  
cosine(p1, p2) = 0.88, where p1 is the result of addition of word vectors in the reference phrase `child gabble`, and p2 is the result of addition of word vectors in the high-level similarity phrase `child chatter`.  

Therefore, we have human rating vector [6] and model A output [0.88]. Next is to compute correlation between these two vectors.

To get an overall score, simply average your correlation scores over all participants, since you are calculating correlation scores per participant.

Of course, your human rating vectors will be longer (e.g., [6, 7, 3, 4, 5]) where each element is a participant's judgement of a specific pair. Each of your models (A, B, C) will produce a single vector of cosine similarity between these same pairs (e.g., [0.89, 0.98, 0.23, 0.65, 0.55]). The goal is to compare each model's cosine similarity vectors with human rating vectors and identify the model which outputs the best result in terms of being the closest to the way human rate similarity between the phrases.

The minimum to do in this task: compute correlations for 3 models mentioned above and human rating for AT LEAST one participant. Elaborate on how different the resulting correlation scores are depending on the model's composition function (additive, multiplicative, combined). For examples on how to interpret the results, look at Section 5 Results of the original paper.

In [17]:
#Only functions in this cell. The evaluation is carried out in the next cell. 
from scipy.stats import spearmanr as correlation
import math
import numpy as np

def file_catcher(file_name):
    """Takes a filename for a file with the psychological data and returns a dataset 
    organised as a list of dictionaries."""
    
    data=[]
    participants=[]
    lexicon=[]
    with open(file_name, mode="r") as f:
        for row in [x.split(" ") for x in f.read().split("\n")[1:]]:
            unit={}
            participant=row[0]
            unit["participant"]=participant
            if participant not in participants:
                participants.append(participant)
            unit["verb"]=row[1]
            unit["noun"]=row[2]
            unit["lm"]=row[3]
            unit["input"]=int(row[4])
            unit["hilo"]=row[5]
            for w in row[1:3]:
                if w not in lexicon:
                    lexicon.append(w)
            data.append(unit)
    return data, participants, lexicon

def restrict_data(data_set, space_model):
    """Takes the dataset of psychological data and tests if the words of that dataset is present in a model. 
    Returns a restricted dataset where every pair of judgement, which contains a word not in the model, 
    is removed."""
    
    new_data=[]
    for unit in data_set:
        add_this=True
        for w in [unit["noun"], unit["verb"], unit["lm"]]:
            try:
                space_model[w]
            except KeyError:
                add_this=False
        if add_this == True:
            new_data.append(unit)
    
    reduction=len(data_set)-len(new_data)
    print(f"NOTE: The dataset has been reduced by {reduction} instances of [noun reference] -- [noun landmark] pairs of judgement, since these pairs contain at leats one word, which is not contained in the space model.")
            
    return new_data

def simple_add(noun, verb, model):
    return model[noun]+model[verb]

def simple_multiply(noun, verb, model):
    return model[noun]*model[verb]

def combined(noun, verb, model):
    return model[noun]+model[verb]+(model[noun]*model[verb])

# I used the following function to see how the data looked like. Please ignored, if considered irrelevant.
def know_your_data(data, ignore_landmark=False):
    """Counts instances for judgement in the dataset. Returns a frequency list."""
    
    type_freq={}    
    
    if ignore_landmark == False:
        for unit in data:
            typ=unit["noun"]+" "+unit["verb"]+" (landmark being: {})".format(unit["lm"])
            if typ in type_freq:
                type_freq[typ]+=1
            else:
                type_freq[typ]=1
    else:
        for unit in data:
            typ=unit["noun"]+" "+unit["verb"]
            if typ in type_freq:
                type_freq[typ]+=1
            else:
                type_freq[typ]=1
    output=list(type_freq.items())
    for line in output:
        print(line)

def my_cosine(vec1, vec2):
    """Calculates the consine similartiy of two vectors. Uses code from Erk."""
    
    len_vec1=math.sqrt(np.sum(np.square(vec1)))
    len_vec2=math.sqrt(np.sum(np.square(vec2)))
    
    if len_vec1 == 0.0 or len_vec2 == 0.0: # as assumed by Erk: "if one of the vectors is empty. make the cosine zero."
        cosine=0.0
    else:
        dotproduct = numpy.sum(vec1 * vec2)
        cosine=dotproduct / (len_vec1 * len_vec2)
    
    return cosine  

def create_eval(psy_data, space_model):
    """Takes a dataset of psychological data and returns the values required for the evaluation."""
    
    evaluation_data={p:{"add":[], "mult":[], "comb":[], "psy_sim":[], "hilo":[]} for p in participants}
    
    for unit in psy_data:
        partic=unit["participant"]
        noun=unit["noun"]
        reference=unit["verb"]
        lm=unit["lm"]
        
        evaluation_data[partic]["add"].append(my_cosine(simple_add(noun,reference, space_model),simple_add(noun,lm,space_model)))
        evaluation_data[partic]["mult"].append(my_cosine(simple_multiply(noun,reference, space_model),simple_multiply(noun,lm,space_model)))
        evaluation_data[partic]["comb"].append(my_cosine(combined(noun,reference, space_model),combined(noun,lm,space_model)))
        evaluation_data[partic]["psy_sim"].append(unit["input"])
        evaluation_data[partic]["hilo"].append(unit["hilo"])

    return evaluation_data

def mean(num_lst):
    """Calculates the mean of a list of values."""
    if len(num_lst)>0:
        m=sum(num_lst)/len(num_lst)
    else:
        m=0 
    return m

def to_star(p_val):
    """Takes a numerical p-value an returns the symbol for that value following the standard "star convention"."""

    p_star=""
    if p_val <= 0.05:
        if p_val<= 0.01:
            p_star="**"
        else:
            p_star="*"
    return p_star
    

def evaluate(data_for_evaluation, overview=True, average=True, overall_correlation=True):
    """Takes data for evaluation and prints the results of the evaluation.
    The output is determined by three Boolean arguments:
    (i)   overview: when True, the function prints a table of scores for each participant.
    (ii)  average: when True, a column of average correlation between judgement and distance
    in model is added as a column to the table with mean cosine similarities for high and low landmarks.
    "Average" here stand for the mean of correlations calculated for each participant.
    (iii) overall_correlation: when True, a table for correlation of judgment and similarity in the
    model is calculated for all judgments.
    """
    
    #Two dictionaries for score keeping; one for the means of distance scores and one for correlations
    evaluation_m1={p:{"add":{"H":0, "L":0}, "mult":{"H":0, "L":0}, "comb":{"H":0, "L":0}} for p in participants} #evaluation method 1; here we add evaluation data for high vs low
    correlations={p:{"add":0, "mult":0, "comb":0} for p in participants}
   
    for pa in participants: #this list is defined outside the function
        variable_x=data_for_evaluation[pa] #in order to simplify the code somewhat ...
                        
        #for correlations
        for operation in ["add", "mult", "comb"]:
            rho, p = correlation(variable_x[operation], variable_x["psy_sim"])
            rho=round(rho, 3)
            correlations[pa][operation]=(rho, to_star(p))
       
        #for method 1 type evaluation in [2], i.e. mean distance for high vs. low landmarks
        for operation in ["add", "mult", "comb"]:
            high_lst=[]
            low_lst=[]
            
            #for every high and low classification, we append its associated distance measure to separate lists     
            for hilo, model_score in zip(variable_x["hilo"], variable_x[operation]):
                if hilo == "high":
                    high_lst.append(model_score)
                if hilo == "low":
                    low_lst.append(model_score)
            #now we take the mean of those lists of distance scores
            high_score=mean(high_lst)
            low_score=mean(low_lst)
            #and adds to the dictionary used for score keeping
            evaluation_m1[pa][operation]["H"]=high_score
            evaluation_m1[pa][operation]["L"]=low_score
    
    #Table for overview (all participants)
    table_no=1
    if overview == True:
        print(f"TABLE {table_no}: Overview of scores per participant.\n")
        print("Participant\tAdd\t\t\tMultiply\t\tCombined")
        print("\t\tCor.\tHigh\tLow\tCor.\tHigh\tLow\tCor.\tHigh\tLow")
        
        for p in participants:
            who=p
            scores=[]
            for operation in ["add", "mult", "comb"]:
                scores.append(str(round(correlations[p][operation][0], 3))+correlations[p][operation][1]) #rho + its significance
                scores.append(str(round(evaluation_m1[p][operation]["H"], 3)))
                scores.append(str(round(evaluation_m1[p][operation]["L"], 3)))
            print("{}\t{}".format(who, "\t".join(scores)))
        print("* Sig at 0.05 p-level; ** Sig. at 0.01 p-level.")
        table_no+=1
    
    #Table for means of cosine similarities for high and low landmarks
    models={m:{"hi":0, "lo":0, "cor":0} for m in ["add", "mult", "comb"]}
        
    for operation in ["add", "mult", "comb"]: 
        high_lst=[]
        low_lst=[]
        cor_lst=[]
                        
        for pa in participants:
            high_lst.append(evaluation_m1[pa][operation]["H"])
            low_lst.append(evaluation_m1[pa][operation]["L"])
            cor_lst.append(correlations[pa][operation][0]) #only rho-value; I am not sure how to use significance values in this average approach
     
        models[operation]["hi"]=round(mean(high_lst),3)
        models[operation]["lo"]=round(mean(low_lst),3)
        models[operation]["cor"]=round(mean(cor_lst),3)
    
    #If the variable average is set to True, a column for average correlaions are added to the table
    if average == True:
        print(f"\nTABLE {table_no}: Evaluations of models Add, Multiply and Combined.\n")
        print("Model\tHigh\tLow\tCorrelation*")
        for model in ["add", "mult", "comb"]:
            print("{0}\t{1}\t{2}\t{3}".format(model, models[model]["hi"], models[model]["lo"], models[model]["cor"]))
        print(f"Note: Correlation here refers to the average of correlations calculated for all participants (N={len(participants)}).")
    #If not, the table only contain means of cosine similarities for high and low landmarks
    else:
        print(f"\nTABLE {table_no}: Evaluations of models Add, Multiply and Combined.\n")
        print("Model\tHigh\tLow")
        for model in ["add", "mult", "comb"]:
            print("{0}\t{1}\t{2}".format(model, models[model]["hi"], models[model]["lo"]))
    
    table_no+=1
    
    #The table for correlation of judgment and similarity in the model is calculated for all judgments
    if overall_correlation == True:
        judgements=[]
        add_lst=[]
        mult_lst=[]
        combined_lst=[]
        
        #Populates the lists with every judgement, similarity for Add model, etc.
        for pa in participants:
            judgements.extend(data_for_evaluation[pa]["psy_sim"])
            add_lst.extend(data_for_evaluation[pa]["add"])
            mult_lst.extend(data_for_evaluation[pa]["mult"])
            combined_lst.extend(data_for_evaluation[pa]["comb"])
        
        #Calculates the correlation
        cor_add, p_add=correlation(judgements, add_lst)
        cor_mult, p_mult=correlation(judgements, mult_lst)
        cor_comb, p_comb=correlation(judgements, combined_lst)
        
        #Prints a table
        print(f"\nTABLE {table_no}: Correlations for Add, Multiply and Combined with all judgments (N={len(judgements)}).\n")
        print("Model    Rho")
        print("Add      {}{}".format(round(cor_add, 3), to_star(p_add)))
        print("Multiply {}{}".format(round(cor_mult, 3), to_star(p_mult)))
        print("Combined {}{}".format(round(cor_comb, 3), to_star(p_comb)))
        print("* Sig at 0.05 p-level; ** Sig. at 0.01 p-level.")


In [18]:
#Now, lets run this...
space_model=svdspace_10k #Set your preferred space model here if you want

data_set, participants, words = file_catcher("mitchell_lapata_acl08.txt")
#know_your_data(data_set, ignore_landmark=True)
#in_model(words, svdspace_10k)
data_set=restrict_data(data_set, space_model)

data_for_evaluation=create_eval(data_set, space_model)
evaluate(data_for_evaluation, overview=False)

NOTE: The dataset has been reduced by 3360 instances of [noun reference] -- [noun landmark] pairs of judgement, since these pairs contain at leats one word, which is not contained in the space model.

TABLE 1: Evaluations of models Add, Multiply and Combined.

Model	High	Low	Correlation*
add	0.927	0.874	0.522
mult	0.937	0.931	0.1
comb	0.923	0.89	0.335
Note: Correlation here refers to the average of correlations calculated for all participants (N=60).

TABLE 2: Correlations for Add, Multiply and Combined with all judgments (N=240).

Model    Rho
Add      0.473**
Multiply 0.012
Combined 0.317**
* Sig at 0.05 p-level; ** Sig. at 0.01 p-level.


**Any comments/thoughts should go here:**

In contrast to Mitchell's and Lapata's study [2], the model Add is better than Multiply and Combined in this evaluation. I have no clear idea why this is the case. However, the reduced dataset used here (due to several words in their experiment being absent from the space model used), might be one explanation of why the results are so different. 

# Literature

  - [1] C. Silberer and M. Lapata. Learning grounded meaning representations with autoencoders. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 721–732, Baltimore, Maryland, USA, June 23–25 2014 2014. Association for Computational Linguistics.  

  - [2] Mitchell, J., & Lapata, M. (2008). Vector-based Models of Semantic Composition. In Proceedings of ACL-08: HLT (pp. 236–244). Association for Computational Linguistics.
  
  - [3] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.

## Marks

This assignment has a total of 60 marks.