# Assignment 2

## Guidelines

> Remember to add an explanation of what you do using markdown, and to comment your code. Please *be brief*.
>
> If you re-use a substantial portion of code you find online, e.g., on Stack Overflow, you need to add a link to it and make the borrowing explicit. The same applies if you take it and modify it, even substantially. There is nothing wrong with doing that, provided you acknowledge it and make it clear you know what you're doing.
>
> Make sure your notebooks have been run when you submit, as I won't run them myself. Submit the `.ipynb` file along with an `.html` export of the same. Submit all necessary auxiliary files as well. Please compress your submission into a `.zip` archive. Only `.zip` files can be submitted.

## Grading policy
> As follows:
>
> * 70 points for correctly completing the assignment.
>
> * 20 points for appropriately writing and organizing your code in terms of structure, readability (also by humans), comments and minimal documentation. It is important to be concise but also to explain what you did and why, when not obvious.
> 
> * 10 points for doing something extra, e.g., if you go beyond expectations (overall or on something specific). Some ideas for extras might be mentioned in the exercises, or you can come up with your own. You don't need to do them all to get the bonus. The points above sum to 90; doing (some of) the extras can bring you to 100, so the extras are not necessary to get an A.
> 

**The AUC code of conduct applies to this assignment: please only submit your own work.**

---

# Introduction

In this assignment, you will build and compare vector models for measuring **semantic similarity**.

First, you are going to use different count-based methods to create these models. Secondly, you are going to create dense, lower-dimensionality models from them. Thirdly, you are going to use prediction-based models as well.

Finally, you are asked to assess the performance of these models against a human gold standard.

---

# Corpus preparation (10 points)

## Question 1 (10 points)

Create one distributional space by **counting and filtering** the surface co-occurrences in a symmetric ±5 word collocation span from the following corpus:

* A lemmatized version of the Reuters corpus (the choice of the lemmatizer is up to you). For this step, you might need a PoS-tagger: you are welcome to choose one yourself. In case you can't do PoS tagging on your own, you can use the following command to load the provided corpus in `data/reuters.pos` (uploaded as a `.zip` file, so first unzip it):

```python
with open("data/reuters.pos", "rb") as corpus_file:
    reuter_PoSTagged = pickle.load(corpus_file)
```

Remember to make motivated choices for the different strategies in building word vectors as described in class. Be explicit about:

1. what lemmas you want to describe (i.e., what will be your target vectors?);
2. how you want to describe them (i.e., what will be your contexts?);
3. what filtering strategy you are going to choose (i.e., what do you exclude?).
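
As a rough illustration of the counting step only, the sketch below collects surface co-occurrences in a symmetric ±5 window from a list of already-lemmatized sentences. The function name `count_cooccurrences`, the toy sentences, and the decision to clip the window at sentence boundaries are assumptions made for this example; the choice of targets, contexts and filtering is still yours to make and motivate.

```python
from collections import Counter

def count_cooccurrences(sentences, span=5):
    """Count (target, context) pairs within a symmetric window of `span` tokens."""
    counts = Counter()
    for sentence in sentences:
        for i, target in enumerate(sentence):
            # Window boundaries, clipped to the current sentence
            start = max(0, i - span)
            end = min(len(sentence), i + span + 1)
            for j in range(start, end):
                if j != i:  # a token is not its own context
                    counts[(target, sentence[j])] += 1
    return counts

# Toy, already-lemmatized input (hypothetical; not the Reuters data)
toy_sentences = [["trade", "friction", "raise", "fear"],
                 ["trade", "fear", "grow"]]
cooc = count_cooccurrences(toy_sentences, span=5)
```

Because the window is symmetric, `cooc[(a, b)]` and `cooc[(b, a)]` are always equal; filtering (e.g., by PoS tag or frequency) would be applied before or after this step.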

In [6]:
# your code here

import pickle

with open("data/reuters.pos", "rb") as corpus_file:
    reuter_PoSTagged = pickle.load(corpus_file)
    
reuter_PoSTagged

[[('ASIAN', 'NNP'),
  ('EXPORTERS', 'NNP'),
  ('FEAR', 'NN'),
  ('DAMAGE', 'NN'),
  ('FROM', 'IN'),
  ('U', 'NNP'),
  ('.', '.'),
  ('S', 'NNP'),
  ('.-', 'CD'),
  ('JAPAN', 'NNP'),
  ('RIFT', 'NNP'),
  ('Mounting', 'VBG'),
  ('trade', 'NN'),
  ('friction', 'NN'),
  ('between', 'IN'),
  ('the', 'DT'),
  ('U', 'NNP'),
  ('.', '.'),
  ('S', 'NN'),
  ('.', '.'),
  ('And', 'CC'),
  ('Japan', 'NNP'),
  ('has', 'VBZ'),
  ('raised', 'VBN'),
  ('fears', 'NNS'),
  ('among', 'IN'),
  ('many', 'JJ'),
  ('of', 'IN'),
  ('Asia', 'NNP'),
  ("'", 'POS'),
  ('s', 'NNS'),
  ('exporting', 'VBG'),
  ('nations', 'NNS'),
  ('that', 'IN'),
  ('the', 'DT'),
  ('row', 'NN'),
  ('could', 'MD'),
  ('inflict', 'VB'),
  ('far', 'RB'),
  ('-', ':'),
  ('reaching', 'VBG'),
  ('economic', 'JJ'),
  ('damage', 'NN'),
  (',', ','),
  ('businessmen', 'NNS'),
  ('and', 'CC'),
  ('officials', 'NNS'),
  ('said', 'VBD'),
  ('.', '.')],
 [('They', 'PRP'),
  ('told', 'VBD'),
  ('Reuter', 'NNP'),
  ('correspondents', 'NNS'),
 

---

# Vector representations (60 points)

## Question 2 (20 points)

Weight the counts in the space you created for the previous question using each of the following association measures:

1. One **measure of your choice** among those available in the [nltk.BigramAssocMeasures](http://www.nltk.org/howto/metrics.html#association-measures) module.
2. The **Positive Local Mutual Information** measure (as shown in class/lab).

**Possible extra**

3. Also use the **smoothed PPMI measure** proposed by [Levy et al. (2015)](http://www.aclweb.org/anthology/Q15-1016). Recall that the authors proposed to smooth the PPMI by raising the context counts to the power of $\alpha$ (where $\alpha= 0.75$ is reported to work well). That is, if $V_c$ is the vocabulary of all the contexts in a given space and $f(c)$ is the context frequency, they proposed the following association measure:

$$PPMI_\alpha (w,c) = \max \left(0, \ \log_2 \left(\frac{p(w,c)}{p(w) \cdot p_\alpha(c)}\right) \right)$$

$$\text{where: } \ p_\alpha(c) = \frac{f(c)^\alpha}{\sum_{c' \in V_c} f(c')^\alpha}$$
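
The formula above can be sketched with NumPy as follows. This is a minimal illustration, not a reference implementation: it assumes a dense targets × contexts count matrix, and it assumes that $f(c)$ is the column sum of that matrix. The function name `smoothed_ppmi` is made up for this example.

```python
import numpy as np

def smoothed_ppmi(counts, alpha=0.75):
    """PPMI with smoothed context probabilities on a (targets x contexts) count matrix."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    p_wc = counts / total                              # joint p(w, c)
    p_w = counts.sum(axis=1, keepdims=True) / total    # marginal p(w)
    f_c = counts.sum(axis=0, keepdims=True)            # assumed f(c): column sums
    p_c_alpha = f_c ** alpha / (f_c ** alpha).sum()    # smoothed p_alpha(c)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_wc / (p_w * p_c_alpha))
    pmi[~np.isfinite(pmi)] = 0.0                       # zero counts give -inf or nan
    return np.maximum(pmi, 0.0)                        # keep positive values only

# Toy counts (hypothetical): two targets, two contexts
toy_weights = smoothed_ppmi([[2, 0], [0, 2]])
```

With `alpha=1.0` the same function reduces to plain PPMI, which makes it easy to compare the two weightings on the same space.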

In [None]:
# your code here

## Question 3 (20 points)

Up to this point, you should have created 2 different distributional spaces (3 if you did the extra).

Use **Singular Value Decomposition** to reduce their dimensionality retaining only the first 100 dimensions. For this question, you can either re-use the SVD code from the lab, or import the SVD functions from external libraries such as [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html) or [scipy](https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.svd.html).
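
If you prefer not to depend on the lab code, one possible sketch of the reduction using only NumPy is given below. It is an assumption-laden illustration: the function name `reduce_with_svd` is hypothetical, the input is taken to be a dense weighted matrix, and scaling the left singular vectors by the singular values is one design choice among several (the lab code may do it differently).

```python
import numpy as np

def reduce_with_svd(matrix, k=100):
    """Project the rows of `matrix` onto its first k singular dimensions."""
    matrix = np.asarray(matrix, dtype=float)
    # Thin SVD: matrix = U @ diag(s) @ Vt, singular values in descending order
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    k = min(k, len(s))          # cannot keep more dimensions than exist
    return U[:, :k] * s[:k]     # reduced target vectors, shape (n_targets, k)

# Toy weighted space (hypothetical): 6 targets, 4 contexts
rng = np.random.default_rng(0)
toy_space = rng.random((6, 4))
toy_reduced = reduce_with_svd(toy_space, k=2)   # shape (6, 2)
```

For a large sparse space, a truncated solver such as the sklearn class linked above is likely to be far cheaper than a full decomposition.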

**Possible extra**

Find the 'optimal' number of dimensions to retain using the approach shown in the lab. Use a model with this dimensionality instead of 100.

In [None]:
# your code here

## Question 4 (20 points)

Train a Word2Vec model on the same corpus, for example using [gensim](https://radimrehurek.com/gensim). Make sure to motivate the choice of your hyperparameters.

**Possible extra**

*Fine-tuning* is the process of starting from a pre-trained embedding model and training it some more using new data. Try to use a pre-trained model from gensim and to fine-tune it on the Reuters corpus.

In [None]:
# your code here

---

# Evaluating on semantic similarity (20 points)

## Question 5 (20 points)

Evaluate the performance of your models on a **semantic similarity task**, using `SimLex-999` as the gold standard. Evaluate all of your models on the dataset in `data/SimLex-999.txt`, and determine the best-performing model. Note: there should be 5 to 8 model evaluations in total: 5 if you did not do any extras (2 from Question 2 + 2 from Question 3 + 1 from Question 4), and 8 if you did them all (3 from Question 2 + 3 from Question 3 + 2 from Question 4).

1. Your evaluation should follow the approach shown in lab 4 (Section 1.6: "Evaluating your Model"), using a **correlation measure** on model predictions and the (human) gold standard. 
2. Remember to **visualize** your results (e.g., as bar plots).
3. Take note of (and report) the overlap between your models and the SimLex-999 dataset, i.e., how many pairs are shared by your model and the evaluation dataset.
4. Make sure to discuss your results and provide your reasoning on them.

### Remarks

- The 'SimLex-999' dataset is described in `data/SimLex-999.README.txt`, and on [the author's GitHub page](https://fh295.github.io/simlex.html). Hint: the relevant judgements are those in the `SimLex999` column.
- To directly compare the models against the gold standard, you will have to find the *overlap* between them, i.e. the pairs that occur in your model *and* the evaluation dataset.
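
The overlap-then-correlate idea can be sketched as below. This is only an illustration under several assumptions: the model is represented as a plain dict from word to NumPy vector, Spearman's rho (via `scipy.stats.spearmanr`) is picked as the correlation measure, and the helper names `cosine` and `evaluate` are hypothetical. The toy vectors are invented; the three ratings are the `SimLex999` values of those pairs in the dataset.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / norm) if norm else 0.0

def evaluate(vectors, gold_pairs):
    """Return (Spearman rho, number of gold pairs covered by the model)."""
    # Keep only the pairs for which the model has both words (the overlap)
    shared = [(w1, w2, r) for w1, w2, r in gold_pairs
              if w1 in vectors and w2 in vectors]
    predicted = [cosine(vectors[w1], vectors[w2]) for w1, w2, _ in shared]
    gold = [r for _, _, r in shared]
    rho, _ = spearmanr(predicted, gold)
    return rho, len(shared)

# Toy model (invented vectors) and a few gold pairs
toy_vectors = {
    "old": np.array([1.0, 0.0]), "new": np.array([0.0, 1.0]),
    "smart": np.array([1.0, 1.0]), "intelligent": np.array([1.0, 1.0]),
    "hard": np.array([1.0, 0.0]), "difficult": np.array([1.0, 0.5]),
}
toy_gold = [("old", "new", 1.58), ("smart", "intelligent", 9.2),
            ("hard", "difficult", 8.77), ("happy", "cheerful", 9.55)]
toy_rho, toy_overlap = evaluate(toy_vectors, toy_gold)
```

Here the pair `happy`/`cheerful` is dropped because the toy model has no vector for either word, so the reported overlap is 3 out of 4 pairs.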

In [1]:
with open("data/SimLex-999.txt") as f:
    for n, line in enumerate(f.read().split("\n")):
        items = line.split("\t")
        print(items)
        if n>10:
            break

['word1', 'word2', 'POS', 'SimLex999', 'conc(w1)', 'conc(w2)', 'concQ', 'Assoc(USF)', 'SimAssoc333', 'SD(SimLex)']
['old', 'new', 'A', '1.58', '2.72', '2.81', '2', '7.25', '1', '0.41']
['smart', 'intelligent', 'A', '9.2', '1.75', '2.46', '1', '7.11', '1', '0.67']
['hard', 'difficult', 'A', '8.77', '3.76', '2.21', '2', '5.94', '1', '1.19']
['happy', 'cheerful', 'A', '9.55', '2.56', '2.34', '1', '5.85', '1', '2.18']
['hard', 'easy', 'A', '0.95', '3.76', '2.07', '2', '5.82', '1', '0.93']
['fast', 'rapid', 'A', '8.75', '3.32', '3.07', '2', '5.66', '1', '1.68']
['happy', 'glad', 'A', '9.17', '2.56', '2.36', '1', '5.49', '1', '1.59']
['short', 'long', 'A', '1.23', '3.61', '3.18', '2', '5.36', '1', '1.58']
['stupid', 'dumb', 'A', '9.58', '1.75', '2.36', '1', '5.26', '1', '1.48']
['weird', 'strange', 'A', '8.93', '1.59', '1.86', '1', '4.26', '1', '1.3']
['wide', 'narrow', 'A', '1.03', '3.06', '3.04', '2', '4.06', '1', '0.58']


In [1]:
# your code here

---