<a href="https://colab.research.google.com/github/mbargane93/Lab1_week1_NLP/blob/main/Lab2intro_to_wordvectors_AboubacrySene.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


<h1 style="font-family:verdana;font-size:300%;text-align:center;background-color:#f2f2f2;color:#0d0d0d">AMMI NLP - Review sessions</h1>

<h1 style="font-family:verdana;font-size:180%;text-align:Center;color:#993333"> Lab 2: Introduction to wordvectors </h1>

**Big thanks to Amr Khalifa who improved this lab and turned it into a Jupyter notebook!**

In [112]:
import io, sys
import numpy as np

In [34]:
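def load_vectors(filename):
    '''Load word vectors from a text file in .vec format into a dict.'''
    data = {}
    with io.open(filename, 'r', encoding='utf-8', newline='\n') as fin:
        # The first line holds the vocabulary size n and the dimension d
        n, d = map(int, fin.readline().split())
        for line in fin:
            tokens = line.rstrip().split(' ')
            data[tokens[0]] = np.asarray([float(x) for x in tokens[1:]])
    return data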

In [35]:
# Loading word vectors
print('')
print(' ** Word vectors ** ')
print('')

'''
word_vectors is a dictionary that maps words to their numerical word vector
[word (string)] = [np-array] 
'''
word_vectors = load_vectors('wiki.en.vec')

tree_vector = word_vectors['tree']
print(type(tree_vector), len(tree_vector))


 ** Word vectors ** 

<class 'numpy.ndarray'> 300


In [36]:
## This function computes the cosine similarity between vectors u and v

def cosine(u, v):
    '''
    Parameters:
    u : 1-D numpy array
    v : 1-D numpy array 
    Returns:
    cos (float) : value of the cosine similarity between vectors u, v 
    '''
    ## FILL CODE
    dot_product = np.dot(u, v)
    norm_u = np.linalg.norm(u)
    norm_v = np.linalg.norm(v)
    cos = dot_product / (norm_u * norm_v)
    return cos


In [8]:
# compute similarity between words

print('similarity(apple, apples) = %.3f' %
      cosine(word_vectors['apple'], word_vectors['apples']))
print('similarity(apple, banana) = %.3f' %
      cosine(word_vectors['apple'], word_vectors['banana']))
print('similarity(apple, tiger) = %.3f' %
      cosine(word_vectors['apple'], word_vectors['tiger']))

similarity(apple, apples) = 0.637
similarity(apple, banana) = 0.431
similarity(apple, tiger) = 0.212


In [37]:
# Functions for nearest neighbor
# This function returns the word corresponding to the
# nearest neighbor vector of x
# The list exclude_words can be used to exclude some
# words from the nearest neighbors search 

def nearest_neighbor(x, word_vectors, exclude_words=()):
    '''
    Parameters:
    x (1-D numpy array): vector whose nearest neighbour is wanted
    word_vectors (Python dict): {word (string): np-array of word vector}
    exclude_words (list of strings): words to be excluded from the search
    
    Returns:
    best_word (string) : the word whose word vector is the nearest neighbour 
    to the vector x
    '''
    best_score = -1.0
    best_word = None
    ## FILL CODE
    for word, vector in word_vectors.items():
        # Skip excluded words before computing the similarity
        if word in exclude_words:
            continue
        score = cosine(vector, x)
        if score > best_score:
            best_score = score
            best_word = word
    return best_word

In [38]:
print('')
print('The nearest neighbor of cat is: ' +
      nearest_neighbor(word_vectors['cat'], word_vectors, exclude_words = ['cat', 'cats']))


The nearest neighbor of cat is: dog


#### Hint (using Python priority queues with the heapq data structure): 
If you don't want to store all the words and scores, you can use a priority queue and only store the best K elements so far. 
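
As an illustration of the hint, here is a minimal sketch of a heap-based top-K search. The names `knn_heap`, `cosine_sim` and the toy dictionary `toy` are assumptions made for this example and are not part of the lab. Unlike a full sort, the heap never holds more than `k` entries.

```python
import heapq
import numpy as np

def cosine_sim(u, v):
    # Cosine similarity between two 1-D arrays
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def knn_heap(x, vectors, k, exclude_words=()):
    # Min-heap of (score, word); heap[0] is the weakest candidate kept so far
    heap = []
    for word, v in vectors.items():
        if word in exclude_words:
            continue
        score = cosine_sim(x, v)
        if len(heap) < k:
            heapq.heappush(heap, (score, word))
        elif score > heap[0][0]:
            # Replace the weakest candidate with the better one
            heapq.heapreplace(heap, (score, word))
    # Return the kept candidates from best to worst
    return sorted(heap, reverse=True)

# Toy 2-D vectors, made up for illustration only
toy = {
    'cat': np.array([1.0, 0.0]),
    'kitten': np.array([0.9, 0.1]),
    'dog': np.array([0.7, 0.7]),
    'car': np.array([0.0, 1.0]),
}
top2 = knn_heap(toy['cat'], toy, 2, exclude_words=('cat',))
print(top2)
```

Passing the query word through `exclude_words` makes the exclusion explicit instead of relying on the query being the top-ranked match.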

In [64]:

## This function returns the words corresponding to the
## K nearest neighbors of vector x.
## You can use the functions heappush and heappop.
def knn(x, vectors, k):
    '''
    Parameters:
    x (1-D numpy array): query vector whose nearest neighbours are wanted
    vectors (Python dict): {word (string): np-array of word vector}
    k (int): number of nearest neighbours to be found
    
    Returns: 
    k_nearest_neighbors (list of tuples): [(score, word), (score, word), ....]
    '''
    ## FILL CODE
    # Score every word in the vocabulary against x
    word_scores = [(cosine(x, v), word) for word, v in vectors.items()]
    # Sort by decreasing cosine similarity
    word_scores.sort(key=lambda pair: pair[0], reverse=True)
    # Rank 0 is dropped on the assumption that it is the query word itself
    # (true when x is the vector of a word in `vectors`)
    k_nearest_neighbors = word_scores[1:k+1]
    return k_nearest_neighbors

In [65]:
knn_cat = knn(word_vectors['cat'], word_vectors, 5)
print()
print('')
print('cat')
print('--------------')
# Reuse the neighbours computed above instead of searching twice
for score, word in knn_cat:
    print(word + '\t%.3f' % score)




cat
--------------
cats	0.732
dog	0.638
pet	0.573
rabbit	0.549
dogs	0.538


#### Hint: 
To find the analogies, we look for the nearest neighbour of the word vector $d$ defined by
$$ d = \frac{c}{\Vert {c} \Vert} + \frac{b}{\Vert {b} \Vert} - \frac{a}{\Vert {a} \Vert}$$
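
The formula in the hint can be sketched as follows: normalize the three input vectors, combine them into $d$, then search for the word closest to $d$. The helper names `unit` and `analogy_by_offset` and the toy vectors are assumptions made for this example; this is one possible reading of the hint, not necessarily the lab's reference solution.

```python
import numpy as np

def unit(v):
    # Scale a vector to unit length
    return v / np.linalg.norm(v)

def analogy_by_offset(a, b, c, vectors):
    # d = c/||c|| + b/||b|| - a/||a||, as in the hint
    d = unit(vectors[c]) + unit(vectors[b]) - unit(vectors[a])
    best_word, best_score = None, -np.inf
    for word, v in vectors.items():
        # The three input words are not valid answers
        if word in (a, b, c):
            continue
        score = np.dot(d, v) / (np.linalg.norm(d) * np.linalg.norm(v))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# Toy 3-D vectors, made up for illustration only
toy = {
    'paris': np.array([1.0, 0.0, 1.0]),
    'france': np.array([1.0, 0.0, 0.0]),
    'rome': np.array([0.0, 1.0, 1.0]),
    'italy': np.array([0.0, 1.0, 0.0]),
    'tiger': np.array([1.0, 1.0, 1.0]),
}
result = analogy_by_offset('paris', 'france', 'rome', toy)
print(result)
```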


In [66]:
import numpy as np

In [80]:
## This function returns the word d, such that a:b and c:d
## verify the same relation
def analogy(a, b, c, word_vectors):
    '''
    Parameters:
    a (string): word a
    b (string): word b
    c (string): word c
    word_vectors (Python dict): {word (string): np-array of word vector}
    
    Returns: 
    the word d (string) associated with c such that c:d is similar to a:b 
    
    '''
    ## FILL CODE
    word_a, word_b, word_c = a.lower(), b.lower(), c.lower()

    e_a, e_b, e_c = word_vectors[word_a], word_vectors[word_b], word_vectors[word_c]
    max_cosine_sim = -100
    d = None
    for w in word_vectors:
        # The three input words are not valid answers
        if w in (word_a, word_b, word_c):
            continue
        # Compare the offset b - a with the offset w - c
        cosine_sim = cosine(e_b - e_a, word_vectors[w] - e_c)
        if cosine_sim > max_cosine_sim:
            max_cosine_sim = cosine_sim
            d = w
    return d

In [81]:
# Word analogies
print('')
print('france - paris + rome = ' + analogy('paris', 'france', 'rome', word_vectors))


france - paris + rome = italy


## A word about biases in word vectors

In [82]:
## A word about biases in word vectors:

print('')
print('similarity(genius, man) = %.3f' %
      cosine(word_vectors['man'], word_vectors['genius']))
print('similarity(genius, woman) = %.3f' %
      cosine(word_vectors['woman'], word_vectors['genius']))


similarity(genius, man) = 0.445
similarity(genius, woman) = 0.325


In [90]:
# Compute the association strength between:
#   - a word w
#   - two sets of attributes A and B

def association_strength(w, A, B, vectors):
    '''
    Parameters:
    w (string): word w
    A (list of strings): The words belonging to set A
    B (list of strings): The words belonging to set B
    vectors (Python dict): {word (string): np-array of word vector}
    
    Returns: 
    strength (float): the value of the association strength 
    '''
    strength = 0.0
    part_a = 0.0
    part_b = 0.0 
    ## FILL CODE
    # Mean cosine similarity between w and the words of A
    for a in A:
        part_a += cosine(vectors[w], vectors[a]) / len(A)
    # Mean cosine similarity between w and the words of B
    for b in B:
        part_b += cosine(vectors[w], vectors[b]) / len(B)
    strength = part_a - part_b
    return strength

In [91]:
## Perform the word embedding association test between:
##   - two sets of words X and Y
##   - two sets of attributes A and B

def weat(X, Y, A, B, vectors):
    '''
    Parameters:
    X (list of strings): The words belonging to set X
    Y (list of strings): The words belonging to set Y
    A (list of strings): The words belonging to set A
    B (list of strings): The words belonging to set B
    vectors (Python dict): {word (string): np-array of word vector}
    
    Returns: 
    score (float): the value of the group association strength  
    '''
    score_1 = 0.0
    score_2 = 0.0
    ## FILL CODE
    for x in X:
        score_1 += association_strength(x, A, B, vectors)
    for y in Y:
        score_2 += association_strength(y, A, B, vectors)
    score = score_1 - score_2
    return score
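
For reference, the two functions above can be summarized by the following formulas, written here as a sketch in our own notation ($s$ for the association strength of a word, $S$ for the group score). Note that $S$ is a raw test statistic: it is not divided by any standard deviation, so it should not be confused with a normalized effect size.

$$ s(w, A, B) = \frac{1}{|A|} \sum_{a \in A} \cos(w, a) - \frac{1}{|B|} \sum_{b \in B} \cos(w, b) $$

$$ S(X, Y, A, B) = \sum_{x \in X} s(x, A, B) - \sum_{y \in Y} s(y, A, B) $$

A positive $S$ means that the words of $X$ are, on average, more strongly associated with $A$ than with $B$, compared with the words of $Y$.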

In [92]:
## Replicate one of the experiments from:
##
## Semantics derived automatically from language corpora contain human-like biases
## Caliskan, Bryson, Narayanan (2017)

career = ['executive', 'management', 'professional', 'corporation', 
          'salary', 'office', 'business', 'career']
family = ['home', 'parents', 'children', 'family',
          'cousins', 'marriage', 'wedding', 'relatives']
male = ['john', 'paul', 'mike', 'kevin', 'steve', 'greg', 'jeff', 'bill']
female = ['amy', 'joan', 'lisa', 'sarah', 'diana', 'kate', 'ann', 'donna']

print('')
print('Word embedding association test: %.3f' %
      weat(career, family, male, female, word_vectors))


Word embedding association test: 0.847


## Word translation using word vectors

In the following, we will use word vectors in English and French to translate words from English to French. The idea is to learn a linear function that maps English word vectors to their corresponding French word vectors. To learn this linear mapping, we will use a small bilingual lexicon that contains pairs of English and French words that are translations of each other.

The following function will load the small English-French bilingual lexicon:

In [93]:
def load_lexicon(filename):
    '''
    Parameters:
    filename(string): the path of the lexicon
    
    Returns:
    data(list of pairs of string): the bilingual lexicon
    '''
    data = []
    # The context manager closes the file once the lexicon is read
    with io.open(filename, 'r', encoding='utf-8', newline='\n') as fin:
        for line in fin:
            a, b = line.rstrip().split(' ')
            data.append((a, b))
    return data

In [95]:
word_vectors_en = load_vectors('wiki.en.vec')
word_vectors_fr = load_vectors('wiki.fr.vec')
lexicon = load_lexicon("lexicon-en-fr.txt")
print(lexicon[:5])

[('the', 'le'), ('the', 'les'), ('the', 'la'), ('and', 'et'), ('was', 'fut')]


In [96]:
# We split the lexicon into a train and validation set
train = lexicon[:5000]
valid = lexicon[5000:5100]

The following function will learn the mapping from English to French. The idea is to build two matrices $X_{\text{en}}$ and $X_{\text{fr}}$, whose rows are the English and French vectors of the word pairs in the lexicon, and to find a mapping $W$ that minimizes $||X_{\text{en}} W - X_{\text{fr}} ||_2$. In numpy, this mapping can be obtained by using the `numpy.linalg.lstsq` function.
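
To make the least-squares step concrete, here is a small self-contained sketch on synthetic data. The matrices `X_en`, `X_fr` and `W_true` are random stand-ins invented for this example, not real word vectors. It also shows that `numpy.linalg.lstsq` returns a tuple whose first element is the solution matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4          # toy embedding dimension (the real vectors have 300)
n_pairs = 50   # toy number of lexicon pairs

# Synthetic "English" vectors and a hidden linear map to "French"
W_true = rng.normal(size=(d, d))
X_en = rng.normal(size=(n_pairs, d))
X_fr = X_en @ W_true

# lstsq returns (solution, residuals, rank, singular values)
W, residuals, rank, singular_values = np.linalg.lstsq(X_en, X_fr, rcond=None)

print(W.shape)
print(np.allclose(W, W_true))
```

Because the synthetic data is exactly linear and noise-free, the hidden map is recovered up to numerical precision. With real word vectors, the fit is only approximate.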

In [109]:
def align(word_vectors_en, word_vectors_fr, lexicon):
    '''
    Parameters:
    word_vectors_en(dict: string -> np.array): English word vectors
    word_vectors_fr(dict: string -> np.array): French word vectors
    lexicon(list of pairs of string): bilingual training lexicon
    
    Returns
    result(tuple): the output of np.linalg.lstsq; result[0] is the
    mapping (np.array) from English to French vectors
    '''
    x_en, x_fr = [], []
    ## FILL CODE
    # Stack the vectors of each (English, French) pair row by row
    for w_e, w_f in lexicon:
        x_en.append(word_vectors_en[w_e])
        x_fr.append(word_vectors_fr[w_f])
    # Solve min_W ||x_en W - x_fr|| in the least-squares sense
    return np.linalg.lstsq(np.array(x_en), np.array(x_fr), rcond=None)

In [110]:
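# Note: the mapping is fit on the full lexicon, which includes the
# validation pairs; pass `train` instead to keep the evaluation held out
mapping = align(word_vectors_en, word_vectors_fr, lexicon)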

In [111]:
mapping

(array([[-0.06183285, -0.01071552,  0.00175985, ..., -0.01107046,
          0.01629405, -0.01644996],
        [-0.01655313, -0.02930488,  0.09810107, ..., -0.01744702,
         -0.02848298,  0.02070179],
        [-0.01970861, -0.0147154 ,  0.01231819, ...,  0.03036093,
         -0.00209909, -0.00944313],
        ...,
        [ 0.0669847 ,  0.02351181,  0.02041902, ...,  0.00886501,
          0.08635366,  0.00595836],
        [ 0.01936122,  0.00552446,  0.01234669, ..., -0.00623332,
         -0.05116348,  0.05634361],
        [ 0.00530333, -0.03424679, -0.03369923, ..., -0.01344391,
         -0.00051053, -0.00491391]]),
 array([240.83364785, 229.76001971, 240.87553006, 242.85647259,
        243.71812188, 226.23758728, 253.75036774, 232.27915867,
        240.37369797, 242.5454353 , 234.87123906, 242.73885799,
        240.05196358, 225.23258537, 231.97861248, 275.50632002,
        254.2120771 , 212.69817596, 231.11341292, 236.77024946,
        228.47970302, 230.40315553, 231.74205474, 254

Given a mapping, a set of English word vectors and a set of French word vectors, the next function will translate an English word to French. To do so, we apply the mapping to the English word vector, and retrieve the nearest neighbor of the obtained vector in the set of French word vectors. The translation is then the corresponding French word.
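
The two steps described above (apply the mapping, then take the nearest neighbor in the target space) can be sketched on a toy example. The function `translate_toy`, the 2-D vectors and the matrix `W` are all made up for illustration. Here `W` is passed directly as a matrix, and the true nearest neighbor is returned.

```python
import numpy as np

def translate_toy(word, vectors_src, vectors_tgt, W):
    # Step 1: map the source vector into the target space
    mapped = vectors_src[word] @ W
    # Step 2: return the target word with the highest cosine similarity
    best_word, best_score = None, -np.inf
    for tgt_word, v in vectors_tgt.items():
        score = np.dot(mapped, v) / (np.linalg.norm(mapped) * np.linalg.norm(v))
        if score > best_score:
            best_word, best_score = tgt_word, score
    return best_word

# Toy vectors: the "French" space is the "English" space rotated by 90 degrees
vectors_en = {'cat': np.array([1.0, 0.0]), 'dog': np.array([0.0, 1.0])}
vectors_fr = {'chat': np.array([0.0, 1.0]), 'chien': np.array([-1.0, 0.0])}
W = np.array([[0.0, 1.0],
              [-1.0, 0.0]])

print(translate_toy('cat', vectors_en, vectors_fr, W))
print(translate_toy('dog', vectors_en, vectors_fr, W))
```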

In [135]:
def translate(word, word_vectors_en, word_vectors_fr, mapping):
    '''
    Parameters:
    word(string): an English word
    word_vectors_en(dict: string -> np.array): English word vectors
    word_vectors_fr(dict: string -> np.array): French word vectors
    mapping(tuple): the output of align; mapping[0] is the matrix that
    maps English vectors to French vectors
    
    Returns
    A string containing the translation of the English word
    '''
    
    ## FILL CODE
    # Map the English vector into the French space
    mapped = word_vectors_en[word] @ mapping[0]
    # Note: knn drops its top-ranked match, so this picks the
    # second-closest French vector to the mapped vector
    _trans = knn(mapped, word_vectors_fr, 1)
    return _trans[0][1]

In [136]:
print(translate("world", word_vectors_en, word_vectors_fr, mapping))
print(translate("machine", word_vectors_en, word_vectors_fr, mapping))
print(translate("learning", word_vectors_en, word_vectors_fr, mapping))

mondiale
machines
apprendre


Finally, let's implement a function to evaluate this method on the validation lexicon:

In [137]:
def evaluate(valid, word_vectors_en, word_vectors_fr, mapping):
    '''
    Parameters:
    valid(a list of pairs of string): the validation lexicon
    word_vectors_en(dict: string -> np.array): English word vectors
    word_vectors_fr(dict: string -> np.array): French word vectors
    mapping(tuple): the output of align; mapping[0] is the matrix that
    maps English vectors to French vectors
    
    Returns
    Accuracy(float): the accuracy on the validation lexicon
    '''
    n_correct = 0
    ## FILL CODE
    for word_en, word_fr in valid:
        prediction = translate(word_en, word_vectors_en, word_vectors_fr, mapping)
        # Count a hit only on an exact match with the reference translation
        if prediction == word_fr:
            n_correct += 1
    return n_correct / len(valid)

In [138]:
evaluate(valid, word_vectors_en, word_vectors_fr, mapping)

0.18