## Language Processing assignment 3: Word embeddings and society

In this assignment you will load vector representations of words and calculate their cosine similarity, a common measure of semantic similarity.

On grading: There are five exercises in this assignment. To pass, at least three exercises must be correct, and every remaining exercise must show a genuine attempt at a solution. For example, three perfect exercises with the other two left blank will not be considered a pass.

In [1]:
import numpy as np
import math

#### Exercise 1:
 
In order to play with word embeddings, we need a way of storing them in our program. We need a data structure to represent all the word embeddings.

The goal of this exercise is to open the file where the embeddings are saved and to put them in a variable that you can use afterwards.

You can represent the data in whatever way you think fits best. The result can range from a very simple approach to a more complex but useful class.

Given a word, such as `"house"`, this data structure should return the embeddings related to that word.

In the saved file, we will have a set of words, and each word will be represented as a sequence of floating point numbers, such as:

`house --> 0.001 0.002 0.005 0.001 0.0012312 0.004 ...`

`cow --> 0.2 0.01 0.00031 0.01 0.9 0.00031 0.0015 0.002 ...`

The number of floating point numbers will always be the same.

---
 
If we want to play with word embeddings, we have to get them from somewhere. Pick the English embeddings from [Absalon](https://absalon.ku.dk/files/7371057/download?download_frd=1) or any embeddings you like from this [website](https://fasttext.cc/docs/en/crawl-vectors.html). I recommend downloading the file from Absalon, as it contains only 50,000 words and is therefore faster to load.

If you get the embeddings from the fastText website, you should download them in **text** format. These embeddings have been trained with raw text from Wikipedia. The file may take up a fair amount of disk space, depending on the language you choose.

Once you have downloaded the embeddings, it's time to start programming! The files follow a specific format.

##### FILE FORMAT:

The first line in the file contains two numbers separated by a single space. The first number indicates the number of words in the file (`N_WORDS`) and the second number specifies the number of dimensions (`N_DIMENSIONS`) that are used to represent each of those words.

After the first line, each line will contain one word at the beginning. Following the word, and separated by spaces, there will be `N_DIMENSIONS` numbers, which represent each word in the space.

The words are sorted by their frequency in the Wikipedia corpus, so the first words in the file are the most frequent ones. Here you can see how the English embeddings file starts:

`9999 300`

`, 0.1250 -0.1079 0.0245 -0.2529 0.1057 -0.0184 0.1177 ...`

`the -0.0517 0.0740 -0.0131 0.0447 -0.0343 0.0212 0.0069 ...`

`. 0.0342 -0.0801 0.1162 -0.3968 -0.0147 -0.0533 0.0606 ...`

`and 0.0082 -0.0899 0.0265 -0.0086 -0.0609 0.0068 0.0652 ...`

`...`
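To make the format concrete, here is a minimal sketch that parses a made-up three-line sample held in memory. The header values `2 3` and the truncation to three dimensions are assumptions for illustration only, and the name `toy_embeddings` is hypothetical. A real solution would read from the downloaded file instead.

```python
import io

# Made-up miniature file in the described format: a header line, then one word per line.
sample = io.StringIO(
    "2 3\n"
    "the -0.0517 0.0740 -0.0131\n"
    "and 0.0082 -0.0899 0.0265\n"
)

# The header holds N_WORDS and N_DIMENSIONS.
n_words, n_dimensions = (int(x) for x in sample.readline().split())

toy_embeddings = {}
for line in sample:
    parts = line.rstrip().split(" ")
    word, values = parts[0], [float(x) for x in parts[1:]]
    # Sanity check against the header; a mismatch would signal a malformed line.
    assert len(values) == n_dimensions
    toy_embeddings[word] = values

print(n_words, n_dimensions, toy_embeddings["and"])
```

Reading the header separately keeps it out of the word dictionary and gives you a cheap consistency check for every later line.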

##### What you have to do:

Write a program that reads the file and stores the words and their embeddings in the data structure you think is best. It can be very simple, or it can be something more complex.

##### Important note: You are not allowed to use a package like gensim to open the file

In [20]:
#1.- Define object to save words and their embeddings
#2.- Write code for reading the file and save it in the defined object

#1. I decided the best object to save this type of data was a dictionary, i.e. word_embeddings
#2. Here is the code to do this, and a short print statement to show how it is used

word_embeddings = {}
with open('wiki.en.vec.short50K', 'r', encoding='utf-8') as f:
    #The first line holds N_WORDS and N_DIMENSIONS, not an embedding, so read it separately
    n_words, n_dimensions = (int(x) for x in f.readline().split())
    for line in f:
        elements = line.split()
        word = elements[0]
        embedding = np.array([float(x) for x in elements[1:]])
        word_embeddings[word] = embedding

print(word_embeddings['she'])



[ 8.8637e-02 -4.1191e-03 -2.5390e-01  3.2839e-01  1.3474e-01  2.1099e-01
  1.5887e-01  1.3855e-01  8.4253e-03  2.1534e-01  3.0048e-01 -6.4144e-02
 -5.2371e-02 -1.1812e-01 -1.2325e-01 -4.0250e-01  3.0414e-01 -2.9539e-01
 -9.1729e-02  2.4486e-01 -4.2637e-02 -1.7132e-02 -2.8279e-01 -3.7561e-01
 -1.1360e-01  3.1833e-01  2.7959e-01 -9.7287e-02  9.9161e-03  7.5482e-02
 -9.2898e-02  3.4215e-01  1.1880e-02  7.0913e-02 -7.9244e-02 -4.6072e-01
 -1.6607e-01 -3.2089e-01 -6.7372e-02 -1.0502e-01 -9.7204e-02  1.0914e-01
 -1.5526e-01 -4.9404e-02  9.2036e-02 -2.6817e-01  1.4362e-01  2.6702e-01
 -4.4490e-02 -1.0933e-01  1.8751e-01 -1.2888e-01 -4.2224e-02  2.1953e-01
 -1.4486e-02 -1.3708e-01  6.1556e-03  2.7584e-01  2.7403e-01  2.3650e-01
 -5.9322e-02 -2.6784e-02  4.3424e-02 -2.8990e-01 -9.1051e-03 -1.1771e-01
 -2.2181e-01  2.6172e-01 -1.9857e-01 -1.1452e-03 -1.7981e-01 -2.9999e-01
 -3.1459e-02  1.4812e-01 -1.4990e-01 -8.7872e-02  2.8076e-01  3.0934e-01
 -2.3802e-01 -1.6786e-01  1.4296e-01 -3.5512e-02  2

#### Exercise 2:

A common distance metric used to measure the similarity between two words is the cosine similarity, which measures the cosine of the angle between the two vectors that represent each of the words.

This similarity value is calculated by using this formula:

$$\text{similarity} = \cos(\theta) = {\mathbf{A} \cdot \mathbf{B} \over \|\mathbf{A}\|_2 \|\mathbf{B}\|_2} $$

<!--= \frac{ \sum\limits_{i=1}^{n}{A_i  B_i} }{ \sqrt{\sum\limits_{i=1}^{n}{A_i^2}}  \sqrt{\sum\limits_{i=1}^{n}{B_i^2}} }-->

Don't be scared. The first part of the formula, $\mathbf{A} \cdot \mathbf{B}$ is the dot product between vectors $\mathbf{A}$ and $\mathbf{B}$. And you know how to do that in Python.

$\mathbf{A} \cdot \mathbf{B} = \sum\limits_{i=1}^{n}{A_i  B_i}$

In the lower part, $\|\mathbf{A}\|_2 \|\mathbf{B}\|_2$, you have to calculate the Euclidean norm of each vector ($\mathbf{A}$ and $\mathbf{B}$) and multiply their results. The Euclidean norm is calculated using this formula:

$\|\mathbf{A}\|_2 = \sqrt{\sum\limits_{i=1}^{n}{A_i^2}}$

The sum inside the square root is the dot product of the vector with itself, so the norm can be rewritten like this:

$\|\mathbf{A}\|_2 = \sqrt{\sum\limits_{i=1}^{n}{A_i^2}} = \sqrt{\mathbf{A} \cdot \mathbf{A}} $
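As a quick check of this identity, the sketch below computes the norm both ways for a made-up vector (chosen so the result is a round number). The variable names are illustrative only.

```python
import numpy as np

A = np.array([3.0, 4.0])  # made-up vector: 3^2 + 4^2 = 25

norm_from_sum = np.sqrt(np.sum(A ** 2))  # square root of the sum of squares
norm_from_dot = np.sqrt(np.dot(A, A))    # square root of A · A

print(norm_from_sum, norm_from_dot)
```

Both expressions give the same value, which is why the dot product alone is enough to build the whole cosine similarity formula.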

You should program the cosine similarity function using numpy. You cannot use existing cosine similarity functions; you must write your own. The function must take two numpy arrays and return a number.

The result should be interpreted as the similarity between the two words: the higher the number, the more similar the words are.
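To get a feel for the range of values, here is a small sketch on made-up two-dimensional vectors. The helper `toy_cosine` is a hypothetical name used only for this illustration; it simply applies the formula above.

```python
import numpy as np

def toy_cosine(u, v):
    # Dot product divided by the product of the two Euclidean norms.
    return np.dot(u, v) / (np.sqrt(np.dot(u, u)) * np.sqrt(np.dot(v, v)))

a = np.array([1.0, 0.0])
same_direction = np.array([2.0, 0.0])  # same direction, different length
orthogonal = np.array([0.0, 3.0])      # at a right angle to a
opposite = np.array([-1.0, 0.0])       # pointing the other way

print(toy_cosine(a, same_direction))
print(toy_cosine(a, orthogonal))
print(toy_cosine(a, opposite))
```

Vectors pointing the same way score 1 regardless of their length, perpendicular vectors score 0, and opposite vectors score -1.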

In [26]:
def similarity(A, B):
    #YOUR CODE HERE
    cosine_similarity = (np.dot(A, B)) / ((np.sqrt(np.dot(A, A))) * (np.sqrt(np.dot(B, B))))
    return cosine_similarity

A = word_embeddings['house']
B = word_embeddings['he']
C = word_embeddings['the']

print("Cosine similarity between 'house' and 'he':", similarity(A, B))
print("Cosine similarity between 'the' and 'he':", similarity(C, B))
#Here 'the' and 'he' come out as more 'similar' than 'house' and 'he'.

Cosine similarity between 'house' and 'he': 0.2716942456042166
Cosine similarity between 'the' and 'he': 0.4304674538970477


#### Exercise 3:

In the third exercise you have to squeeze your brain a bit more. Now, you have loaded the whole embedding file, and you also have a distance metric to measure the similarity between words. Let's do more complex things, then.

Given a word, you have to find the 30 most similar words. To do so, compute the similarity between that word and every word in the embeddings file, and keep the closest ones.

In order to make this task easier, I attach a simple implementation of an ordered list.

In [16]:
#Returns the element stored at position `index` of LIST
def get (LIST, index):
    return LIST[index]

#Returns the score of an element, i.e. the second item of a (word, score) tuple
def get_value(el):
    return el[1]



class OrderedListTuple:
    
    def __init__(self, max_size):
        self.content = []
        self.max_size = max_size
        
    def find_pos (self, element):
        index = 0
        while (index <= len(self.content)-1) and get_value(get(self.content, index)) > get_value(element):
            index += 1
        return index

    def insert_element (self, element):
        pos = self.find_pos (element)
        self.content.insert (pos, element)
        if len(self.content) > self.max_size:
            self.content.pop()

This implementation is very simple. When we initialize the list, we set the maximum number of elements it can hold. Each element is a tuple whose second item is a score, and the list is kept sorted from the highest score to the lowest. When we add an element, it is inserted at the correct position; if the list then exceeds its maximum size, the last element (the one with the lowest score) is removed. Let's see how it works with an example:

In [17]:
L = OrderedListTuple(4)
print (L.content)

L.insert_element(("house", 14))
print (L.content)
L.insert_element(("home", 6))
print (L.content)
L.insert_element(("brown", 3))
print (L.content)
L.insert_element(("elbow", 4))
print (L.content)
L.insert_element(("high", 1))
print (L.content)
L.insert_element(("the", 9))
print (L.content)
L.insert_element(("and", 43))
print (L.content)
L.insert_element(("kitty", 44))
print (L.content)

[]
[('house', 14)]
[('house', 14), ('home', 6)]
[('house', 14), ('home', 6), ('brown', 3)]
[('house', 14), ('home', 6), ('elbow', 4), ('brown', 3)]
[('house', 14), ('home', 6), ('elbow', 4), ('brown', 3)]
[('house', 14), ('the', 9), ('home', 6), ('elbow', 4)]
[('and', 43), ('house', 14), ('the', 9), ('home', 6)]
[('kitty', 44), ('and', 43), ('house', 14), ('the', 9)]


##### Hint: Why don't you create a similarity function that gets two words, and it returns a tuple? For each word in the dictionary, you can calculate the similarity to an input word, and save this in a tuple. Then, using the previous data structure, you can save only the N-best words.

With this data structure, you should be able to get the most similar words to one word.

In [41]:
#YOUR CODE HERE
given_word = 'house'

target_vector = word_embeddings[given_word]

#Keep only the 30 best-scoring (word, similarity) tuples
ordered_list = OrderedListTuple(max_size=30)

for word, vector in word_embeddings.items():
    #Skip the query word itself and any entry whose vector length differs from the target's
    if word == given_word or target_vector.shape != vector.shape:
        continue
    similarity_points = similarity(target_vector, vector)
    ordered_list.insert_element((word, similarity_points))

similar_words = [word for word, _ in ordered_list.content]
print(f'30 most similar words to: {given_word}\n', similar_words)

30 most similar words to: house
 ['houses', 'mansion', 'farmhouse', 'barn', 'representatives', 'townhouse', 'cottage', 'upstairs', 'outbuildings', 'residence', 'room', 'schoolhouse', 'passedbody', 'bedroom', 'bungalow', 'mansions', 'parsonage', 'downstairs', 'barns', 'tavern', 'cottages', 'apartment', 'manor', 'bedrooms', 'kitchen', 'townhouses', 'hallway', 'rooms', 'lodgings', 'rectory']
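As an aside, the same "keep the N best" idea can be sketched with Python's standard `heapq.nlargest` instead of the ordered list. This is an alternative technique, not the one the exercise asks for, and the vectors and the helper `toy_cosine` below are made up for illustration.

```python
import heapq
import numpy as np

def toy_cosine(u, v):
    # Dot product divided by the product of the two Euclidean norms.
    return np.dot(u, v) / (np.sqrt(np.dot(u, u)) * np.sqrt(np.dot(v, v)))

# Made-up 2-d vectors; real embeddings have hundreds of dimensions.
toy_vectors = {
    "house": np.array([1.0, 0.1]),
    "home": np.array([0.9, 0.2]),
    "barn": np.array([0.7, 0.6]),
    "river": np.array([0.0, 1.0]),
}

query = "house"
# Lazily score every other word against the query.
scored = (
    (word, toy_cosine(toy_vectors[query], vector))
    for word, vector in toy_vectors.items()
    if word != query
)
# nlargest keeps only the best n items, ranked by the score in position 1.
top_two = heapq.nlargest(2, scored, key=lambda pair: pair[1])
print([word for word, _ in top_two])
```

Like the ordered list, this never needs to sort the full vocabulary; it only tracks the current best candidates.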


#### Exercise 4:

The last exercise is really cool. One of the properties that researchers found in word embeddings was that we could perform algebraic operations over the vectors in order to get specific words.

For example, if we perform an operation like this one:

$$DICTIONARY['berlin'] - DICTIONARY['germany'] + DICTIONARY['france']$$

This results in a vector. If we find the 20 closest words to that vector, the word `paris` should be among them. Another nice operation is:

$$DICTIONARY['queen'] - DICTIONARY['woman'] + DICTIONARY['man']$$

Perform these operations with the words you want, and check whether they work.
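The mechanics can be sketched on a tiny, hand-built vocabulary. Everything here is hypothetical: the two-dimensional vectors are constructed so that the arithmetic works out, and the helper `toy_cosine` is just for illustration. Real embeddings are noisier, so the expected word may only appear somewhere among the nearest neighbours.

```python
import numpy as np

def toy_cosine(u, v):
    # Dot product divided by the product of the two Euclidean norms.
    return np.dot(u, v) / (np.sqrt(np.dot(u, u)) * np.sqrt(np.dot(v, v)))

# Hypothetical 2-d vectors: first axis ~ "royalty", second axis ~ "gender".
toy = {
    "king": np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man": np.array([0.1, 1.0]),
    "woman": np.array([0.1, -1.0]),
    "apple": np.array([-1.0, 0.2]),
}

# queen - woman + man
target = toy["queen"] - toy["woman"] + toy["man"]

# The input words are usually excluded, since they tend to stay close to the result.
inputs = {"queen", "woman", "man"}
candidates = {w: toy_cosine(target, v) for w, v in toy.items() if w not in inputs}
nearest = max(candidates, key=candidates.get)
print(nearest)
```

With the real embeddings, the same steps apply: build the target vector, then reuse the nearest-words search from Exercise 3 with that vector in place of a word's embedding.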

In [None]:
#YOUR CODE HERE
#berlin - germany + france

In [None]:
#YOUR CODE HERE
#queen - woman + man

#### Exercise 5:

In recent years, many researchers have shown that word embeddings obtained from large corpora reproduce biases that happen in society.

In this exercise, we would like you to find examples in the loaded word embedding file that show some sort of bias. This bias can be any of the following, or any other bias you are interested in:

 * Gender
 * Origin
 * Sexual preference
 * Socioeconomic class
 * Academic background
 
These examples could be based on distances between words, but any other creative methodology you can think of is welcome as well.

For instance, what is the distance between "maid" and "man", and "maid" and "woman"?

You should provide examples but also your interpretation of these results.
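One possible way to structure such a comparison is to subtract two similarities, as in the sketch below. The function `association_gap`, the helper `toy_cosine`, and the vectors are all hypothetical; the numbers are invented to show the mechanics and say nothing about real embeddings.

```python
import numpy as np

def toy_cosine(u, v):
    # Dot product divided by the product of the two Euclidean norms.
    return np.dot(u, v) / (np.sqrt(np.dot(u, u)) * np.sqrt(np.dot(v, v)))

def association_gap(word, first, second, vectors):
    """Similarity to `first` minus similarity to `second`.

    A positive value means `word` sits closer to `first` than to `second`.
    """
    return (toy_cosine(vectors[word], vectors[first])
            - toy_cosine(vectors[word], vectors[second]))

# Invented vectors, used only to demonstrate the calculation.
toy = {
    "man": np.array([1.0, 0.0]),
    "woman": np.array([0.0, 1.0]),
    "some_word": np.array([1.0, 3.0]),
}

gap = association_gap("some_word", "woman", "man", toy)
print(gap)
```

With the real embeddings you would pass `word_embeddings` and a word of interest; a gap far from zero is a starting point for your interpretation, not a conclusion in itself.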

If you want to get some inspiration, you may want to check some recent articles about the topic:

  * Bender, Emily M., and Batya Friedman. "Data statements for natural language processing: Toward mitigating system bias and enabling better science." Transactions of the Association for Computational Linguistics 6 (2018): 587-604. https://aclanthology.org/Q18-1041/
  * Hovy, Dirk, and Shrimai Prabhumoye. "Five sources of bias in natural language processing." Language and Linguistics Compass 15.8 (2021): e12432. https://compass.onlinelibrary.wiley.com/doi/10.1111/lnc3.12432

In [None]:
#YOUR CODE HERE

YOUR INTERPRETATION HERE (press ENTER to write)

#### Exercise 6

In this exercise we are going to see how to calculate word counts normalized by document frequency (TF-IDF).

To this end, we will calculate the word frequencies and the IDF-normalized counts using the `TfidfVectorizer` class from scikit-learn.

We will work with the Gutenberg corpus from NLTK.
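Before using the library, it may help to see the idea on a toy corpus in plain Python. The three documents below are made up, and the sketch uses the smoothed IDF formula that scikit-learn applies by default, $\text{idf}(t) = \ln\frac{1 + n}{1 + \text{df}(t)} + 1$. It leaves out the L2 normalization that `TfidfVectorizer` also applies by default, so the numbers will not match the library's output exactly.

```python
import math
from collections import Counter

docs = ["the cat sat", "the dog sat", "the bird flew"]  # made-up toy corpus
tokenized = [doc.split() for doc in docs]

n_docs = len(tokenized)
# Document frequency: in how many documents each word occurs at least once.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

# Smoothed inverse document frequency for every word.
idf = {word: math.log((1 + n_docs) / (1 + df)) + 1 for word, df in doc_freq.items()}

# Unnormalized TF-IDF for the first document: term count times IDF.
tf_first = Counter(tokenized[0])
tfidf_first = {word: count * idf[word] for word, count in tf_first.items()}

for word in ("the", "sat", "cat"):
    print(word, round(idf[word], 4))
```

A word that appears in every document ("the") gets the lowest IDF, while a word found in a single document ("cat") gets the highest, which is what makes TF-IDF favour document-specific vocabulary.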

In [None]:
from nltk.corpus import gutenberg
import nltk

import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
fileids = gutenberg.fileids()
fileids

In [None]:
corpus = []
for fileid in fileids:
    corpus.append(nltk.corpus.gutenberg.raw(fileid))

In [None]:
#Step 1:
vec_tfidf = TfidfVectorizer()
result_tfidf = vec_tfidf.fit_transform(corpus).toarray()

#Step 2:
id2word = {vec_tfidf.vocabulary_[key]:key for key in vec_tfidf.vocabulary_.keys()}

#Step 3:
sorted_ids_blake = np.argsort(result_tfidf[4]).reshape(-1)

#Step 4:
for word_id in sorted_ids_blake[-10:]:
    print (id2word[word_id],result_tfidf[4][word_id])

##### Exercise 6.1:
Explain using your own words and in one single sentence (per step) what happens in each step.

###### Step 1: YOUR SENTENCE HERE

###### Step 2: YOUR SENTENCE HERE

###### Step 3: YOUR SENTENCE HERE

###### Step 4: YOUR SENTENCE HERE

##### Exercise 6.2:
Which are the top-10 most relevant words according to their inverse document frequency? In other words, find the 10 words with the highest inverse document frequency.

If you do not know how to get the IDFs of the words, you may want to take a look at the documentation of `TfidfVectorizer`.

In [None]:
#YOUR CODE HERE

##### Exercise 6.3:

What do the inverse document frequencies look like in this corpus? Do they seem relevant? Please answer in 1-2 sentences.

#### YOUR RESPONSE HERE