
<h1 style="font-family:verdana;font-size:300%;text-align:center;background-color:#f2f2f2;color:#0d0d0d">MMI_2024_NLP - Week 1</h1>

<h1 style="font-family:verdana;font-size:180%;text-align:Center;color:#993333"> Lab 2: Introduction to word vectors </h1>


Before we start, please rename the notebook using the following format: **Firstname_LASTNAME_Lab2_intro_to_wordvectors.ipynb**


In some cells and files you will see code blocks that look like this:

```python
##############################################################################
#                    TODO: Write the equation for a line                     #
##############################################################################
pass
##############################################################################
#                              END OF YOUR CODE                              #
##############################################################################
```

You should replace the `pass` statement with your own code and leave the blocks intact, like this:

```python
##############################################################################
#                    TODO: Write the equation for a line                     #
##############################################################################
y = m * x + b
##############################################################################
#                              END OF YOUR CODE                              #
##############################################################################
```

In [2]:
import io, sys
import numpy as np

In [3]:

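def load_vectors(filename):
    # Maps each word to its vector. The first line of the file is a header
    # with two integers (read into n and d) and holds no vector.
    data = {}
    # The "with" block closes the file once all lines have been read
    with io.open(filename, 'r', encoding='utf-8', newline='\n') as fin:
        n, d = map(int, fin.readline().split())
        for line in fin:
            tokens = line.rstrip().split(' ')
            data[tokens[0]] = np.asarray([float(x) for x in tokens[1:]])
    return data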
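
The parsing logic in `load_vectors` suggests that the `.vec` file starts with a header line of two integers, followed by one line per word (the word, then its space-separated components). A minimal sketch on an in-memory toy file; the contents of `toy_file` are made up for illustration and are not taken from `wiki.en.vec`:

```python
import io
import numpy as np

# Hypothetical 2-word, 3-dimensional file in the layout load_vectors expects.
toy_file = io.StringIO("2 3\ncat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")

n, d = map(int, toy_file.readline().split())   # header: two integers
toy_vectors = {}
for line in toy_file:
    tokens = line.rstrip().split(' ')
    toy_vectors[tokens[0]] = np.asarray([float(x) for x in tokens[1:]])

print(n, d, sorted(toy_vectors))
```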

In [4]:
# Loading word vectors

print('')
print(' ** Word vectors ** ')
print('')

'''
word_vectors is a dictionary that maps words to their numerical word vector
[word (string)] = [np-array]
'''
word_vectors = load_vectors('/content/drive/MyDrive/AMMI-23/nlp/Lab2/wiki.en.vec')

tree_vector = word_vectors['tree']
print(type(tree_vector), len(tree_vector))


 ** Word vectors ** 

<class 'numpy.ndarray'> 300


In [5]:
## This function computes the cosine similarity between vectors u and v

def cosine(u, v):
    '''
    Parameters:
    u : 1-D numpy array
    v : 1-D numpy array

    Returns:
    cos (float) : value of the cosine similarity between vectors u, v
    '''
    ##########################################################################
    #                      TODO: Implement this function                     #
    ##########################################################################
    # Replace "pass" statement with your code
    dot_prod = np.dot(u, v)
    u_norm = np.linalg.norm(u)
    v_norm = np.linalg.norm(v)
    cos = dot_prod / (u_norm * v_norm)
    ##########################################################################
    #                            END OF YOUR CODE                            #
    ##########################################################################

    return cos

In [6]:
# compute similarity between words
print(f"test similarity {cosine(np.array([1, 0, 0]), np.array([1, 0, 0]))}")
print('similarity(apple, apples) = %.3f' %
      cosine(word_vectors['apple'], word_vectors['apples']))
print('similarity(apple, banana) = %.3f' %
      cosine(word_vectors['apple'], word_vectors['banana']))
print('similarity(apple, tiger) = %.3f' %
      cosine(word_vectors['apple'], word_vectors['tiger']))

test similarity 1.0
similarity(apple, apples) = 0.637
similarity(apple, banana) = 0.431
similarity(apple, tiger) = 0.212
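
As a sanity check on the values above, cosine similarity depends only on the direction of the two vectors, not on their length. The sketch below repeats the formula under the hypothetical name `cosine_sketch` so that it runs on its own, and checks three easy cases with made-up vectors:

```python
import numpy as np

def cosine_sketch(u, v):
    # Same formula as `cosine`, repeated so this block is self-contained.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([1.0, 2.0, 3.0])
same_direction = cosine_sketch(u, 10.0 * u)   # rescaling leaves the angle unchanged
orthogonal = cosine_sketch(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
opposite = cosine_sketch(u, -u)
print(same_direction, orthogonal, opposite)
```

The three values should be close to 1, 0 and -1 respectively, up to floating-point rounding.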


In [7]:
## Functions for nearest neighbor
## This function returns the word corresponding to
## nearest neighbor vector of x
## The list exclude_words can be used to exclude some
## words from the nearest neighbors search

def nearest_neighbor(x, word_vectors, exclude_words=()):
    '''
    Parameters:
    x (1-D numpy array): query vector whose nearest neighbour is searched
    word_vectors (Python dict): {word (string): np-array of word vector}
    exclude_words (iterable of strings): words to be excluded from the search

    Returns:
    best_word (string) : the word whose word vector is the nearest neighbour
    to x
    '''
    best_score = -1.0
    best_word = None
    ##########################################################################
    #                      TODO: Implement this function                     #
    ##########################################################################
    # Replace "pass" statement with your code
    for word, vector in word_vectors.items():
      if word in exclude_words:
        continue
      score = cosine(x, vector)
      if score > best_score:
        best_score = score
        best_word = word
    ##########################################################################
    #                            END OF YOUR CODE                            #
    ##########################################################################

    return best_word

In [8]:
print('')
print('The nearest neighbor of cat is: ' +
      nearest_neighbor(word_vectors['cat'], word_vectors, exclude_words = ['cat', 'cats']))


The nearest neighbor of cat is: dog



#### Hint (using Python priority queues with the `heapq` module):
If you don't want to store every word and score, you can use a priority queue and keep only the best K elements seen so far.
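
One way to read this hint is sketched below. `top_k_sketch` and `toy_scores` are hypothetical names with made-up scores; the point is only that a min-heap of size K always exposes its weakest element at index 0, so each new candidate costs O(log K) rather than a full sort:

```python
import heapq

def top_k_sketch(scored_items, k):
    """Return the k highest-scoring (score, item) pairs, best first."""
    heap = []  # min-heap: heap[0] is the weakest candidate kept so far
    for score, item in scored_items:
        if len(heap) < k:
            heapq.heappush(heap, (score, item))
        elif score > heap[0][0]:
            # Swap out the weakest kept candidate in a single heap operation.
            heapq.heapreplace(heap, (score, item))
    return sorted(heap, reverse=True)

toy_scores = [(0.2, 'tiger'), (0.9, 'cats'), (0.6, 'dog'), (0.1, 'car'), (0.5, 'pet')]
top3 = top_k_sketch(toy_scores, 3)
print(top3)  # [(0.9, 'cats'), (0.6, 'dog'), (0.5, 'pet')]
```

Memory stays proportional to K instead of the vocabulary size, which matters when the dictionary holds a very large number of words.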

In [10]:
## This function returns the words corresponding to the
## K nearest neighbors of vector x.
## You can use the functions heappush and heappop.
import heapq
def knn(x, vectors, k):
    '''
    Parameters:
    x (1-D numpy array): query vector whose nearest neighbours are searched
    vectors (Python dict): {word (string): np-array of word vector}
    k (int): number of nearest neighbours to be found

    Returns:
    k_nearest_neighbors (list of tuples): [(score, word), (score, word), ....]
    sorted from most to least similar
    '''

    k_nearest_neighbors = None
    ##########################################################################
    #                      TODO: Implement this function                     #
    ##########################################################################
    # Replace "pass" statement with your code
    heap = []  # min-heap holding the best k (score, word) pairs seen so far

    for word, vector in vectors.items():
      score = cosine(x, vector)
      heapq.heappush(heap, (score, word))
      # Drop the weakest candidate as soon as the heap holds more than k items
      if len(heap) > k:
        heapq.heappop(heap)

    k_nearest_neighbors = heapq.nlargest(k, heap)
    ##########################################################################
    #                            END OF YOUR CODE                            #
    ##########################################################################
    return k_nearest_neighbors

In [11]:
knn_cat = knn(word_vectors['cat'], word_vectors, 5)
print('')
print('cat')
print('--------------')
# Reuse the result computed above instead of searching the vocabulary twice
for score, word in knn_cat:
    print(word + '\t%.3f' % score)


cat
--------------
cat	1.000
cats	0.732
dog	0.638
pet	0.573
rabbit	0.549


#### Hint:
To solve an analogy a:b :: c:d, we look for the word whose vector is the nearest neighbour of the vector $d$ defined by
$$ d = \frac{c}{\Vert {c} \Vert} + \frac{b}{\Vert {b} \Vert} - \frac{a}{\Vert {a} \Vert}$$
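
To make the formula concrete, here is a small sketch on hand-made 2-D vectors. The `toy` dictionary is entirely hypothetical (real word vectors have 300 dimensions and are far less tidy); it is built so that the "capital to country" offset is the same for both pairs:

```python
import numpy as np

def unit(v):
    # Rescale a vector to length 1.
    return v / np.linalg.norm(v)

# Hypothetical embeddings: axis 0 ~ "is a capital", axis 1 ~ "which country".
toy = {
    'paris':  np.array([1.0, 1.0]),
    'france': np.array([0.0, 1.0]),
    'rome':   np.array([1.0, -1.0]),
    'italy':  np.array([0.0, -1.0]),
    'tree':   np.array([-1.0, 0.0]),
}

a, b, c = 'paris', 'france', 'rome'
d = unit(toy[c]) + unit(toy[b]) - unit(toy[a])

# The three input words are excluded, as in the search over the real vocabulary.
candidates = {w: v for w, v in toy.items() if w not in (a, b, c)}
# d is fixed, so ranking by dot product with unit vectors equals ranking by cosine.
best = max(candidates, key=lambda w: np.dot(d, unit(candidates[w])))
print(best)  # italy
```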


In [12]:
## This function returns the word d such that a:b and c:d
## satisfy the same relation

def analogy(a, b, c, word_vectors):
    '''
    Parameters:
    a (string): word a
    b (string): word b
    c (string): word c
    word_vectors (Python dict): {word (string): np-array of word vector}

    Returns:
    the word d (string) associated with c such that c:d is similar to a:b

    '''
    ##########################################################################
    #                      TODO: Implement this function                     #
    ##########################################################################
    # Replace "pass" statement with your code
    # Normalize the word vectors
    a_vector = word_vectors[a] / np.linalg.norm(word_vectors[a])
    b_vector = word_vectors[b] / np.linalg.norm(word_vectors[b])
    c_vector = word_vectors[c] / np.linalg.norm(word_vectors[c])

    # Compute the new vector d
    d_vector = c_vector + b_vector - a_vector

    # Find the nearest neighbor to d_vector
    best_score = -1.0
    best_word = None

    for word, vector in word_vectors.items():
        if word in [a, b, c]:
            continue
        score = cosine(d_vector, vector)
        if score > best_score:
            best_score = score
            best_word = word
    ##########################################################################
    #                            END OF YOUR CODE                            #
    ##########################################################################

    return best_word

In [13]:
# Word analogies
print('')
print('france - paris + rome = ' + analogy('paris', 'france', 'rome', word_vectors))


france - paris + rome = italy


## A word about biases in word vectors

In [14]:
## A word about biases in word vectors:
print('')
print('similarity(genius, man) = %.3f' %
      cosine(word_vectors['man'], word_vectors['genius']))
print('similarity(genius, woman) = %.3f' %
      cosine(word_vectors['woman'], word_vectors['genius']))


similarity(genius, man) = 0.445
similarity(genius, woman) = 0.325
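
For reference, the two quantities implemented in the next cells can be written compactly as follows. This is a sketch of what the code in this notebook computes; the symbols $s$ and $S$ are notation chosen here for convenience:

$$ s(w, A, B) = \frac{1}{|A|} \sum_{a \in A} \cos(w, a) - \frac{1}{|B|} \sum_{b \in B} \cos(w, b) $$

$$ S(X, Y, A, B) = \sum_{x \in X} s(x, A, B) - \sum_{y \in Y} s(y, A, B) $$

A positive $S$ means that the words in $X$ are, on average, closer to the attributes in $A$ than the words in $Y$ are; the score is zero when both sets relate to $A$ and $B$ in the same way.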


In [15]:
## Compute the association strength between:
##   - a word w
##   - two sets of attributes A and B

def association_strength(w, A, B, vectors):
    '''
    Parameters:
    w (string): word w
    A (list of strings): The words belonging to set A
    B (list of strings): The words belonging to set B
    vectors (Python dict): {word (string): np-array of word vector}

    Returns:
    strength (float): the value of the association strength
    '''
    strength = 0.0
    part_a = 0.0
    part_b = 0.0
    ##########################################################################
    #                      TODO: Implement this function                     #
    ##########################################################################
    # Replace "pass" statement with your code
    part_a = np.mean([cosine(vectors[w], vectors[a]) for a in A])
    part_b = np.mean([cosine(vectors[w], vectors[b]) for b in B])
    strength = part_a - part_b
    ##########################################################################
    #                            END OF YOUR CODE                            #
    ##########################################################################

    return strength

In [16]:
## Perform the word embedding association test between:
##   - two sets of words X and Y
##   - two sets of attributes A and B

def weat(X, Y, A, B, vectors):
    '''
    Parameters:
    X (list of strings): The words belonging to set X
    Y (list of strings): The words belonging to set Y
    A (list of strings): The words belonging to set A
    B (list of strings): The words belonging to set B
    vectors (Python dict): {word (string): np-array of word vector}

    Returns:
    score (float): the value of the group association strength
    '''

    score = 0.0
    ##########################################################################
    #                      TODO: Implement this function                     #
    ##########################################################################
    # Replace "pass" statement with your code
    # association_strength is the function defined in the previous cell
    score_X = sum(association_strength(x, A, B, vectors) for x in X)
    score_Y = sum(association_strength(y, A, B, vectors) for y in Y)

    score = score_X - score_Y
    ##########################################################################
    #                            END OF YOUR CODE                            #
    ##########################################################################

    return score

In [17]:
## Replicate one of the experiments from:
##
## Semantics derived automatically from language corpora contain human-like biases
## Caliskan, Bryson, Narayanan (2017)

career = ['executive', 'management', 'professional', 'corporation',
          'salary', 'office', 'business', 'career']
family = ['home', 'parents', 'children', 'family',
          'cousins', 'marriage', 'wedding', 'relatives']
male = ['john', 'paul', 'mike', 'kevin', 'steve', 'greg', 'jeff', 'bill']
female = ['amy', 'joan', 'lisa', 'sarah', 'diana', 'kate', 'ann', 'donna']

print('')
print('Word embedding association test: %.3f' %
      weat(career, family, male, female, word_vectors))


Word embedding association test: 0.847


## Word translation using word vectors

In the following, we will use word vectors in English and French to translate words from English to French. The idea is to learn a linear function that maps English word vectors to their corresponding French word vectors. To learn this linear mapping, we will use a small bilingual lexicon that contains pairs of English and French words that are translations of each other.

The following function will load the small English-French bilingual lexicon:

In [18]:
def load_lexicon(filename):
    '''
    Parameters:
    filename(string): the path of the lexicon

    Returns:
    data(list of pairs of string): the bilingual lexicon
    '''
    data = []
    # The "with" block closes the file once all pairs have been read
    with io.open(filename, 'r', encoding='utf-8', newline='\n') as fin:
        for line in fin:
            a, b = line.rstrip().split(' ')
            data.append((a, b))
    return data

In [19]:
word_vectors_en = load_vectors('/content/drive/MyDrive/AMMI-23/nlp/Lab2/wiki.en.vec')
word_vectors_fr = load_vectors('/content/drive/MyDrive/AMMI-23/nlp/Lab2/wiki.fr.vec')
lexicon = load_lexicon("/content/drive/MyDrive/AMMI-23/nlp/Lab2/lexicon-en-fr.txt")
print(lexicon[:5])

[('the', 'le'), ('the', 'les'), ('the', 'la'), ('and', 'et'), ('was', 'fut')]


In [20]:
# We split the lexicon into a train and validation set
train = lexicon[:5000]
valid = lexicon[5000:5100]

The following function learns the mapping from English to French. The idea is to build two matrices $X_{\text{en}}$ and $X_{\text{fr}}$, whose rows are the vectors of the paired words in the lexicon, and to find a mapping $W$ that minimizes $||X_{\text{en}} W - X_{\text{fr}} ||_2$. In numpy, this mapping can be obtained with the `numpy.linalg.lstsq` function.
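
Before applying this to real word vectors, it can help to see `numpy.linalg.lstsq` recover a known mapping on synthetic data. In the sketch below every name (`true_map`, `x_src`, `x_tgt`, `w_hat`) and every number is made up; the "target" vectors are an exact linear image of the "source" vectors, so the least-squares solution should match the ground truth:

```python
import numpy as np

rng = np.random.default_rng(0)
true_map = rng.normal(size=(4, 4))   # hypothetical ground-truth mapping
x_src = rng.normal(size=(50, 4))     # 50 toy "English" vectors, one per row
x_tgt = x_src @ true_map             # their exact "French" counterparts

# lstsq returns (solution, residuals, rank, singular values).
w_hat, _, rank, _ = np.linalg.lstsq(x_src, x_tgt, rcond=None)
recovered = np.allclose(w_hat, true_map)
print(w_hat.shape, rank, recovered)
```

With real embeddings the relation is only approximately linear, so the fit leaves a non-zero residual and the learned mapping has to be judged on held-out pairs.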

In [21]:
def align(word_vectors_en, word_vectors_fr, lexicon):
    '''
    Parameters:
    word_vectors_en(dict: string -> np.array): English word vectors
    word_vectors_fr(dict: string -> np.array): French word vectors
    lexicon(list of pairs of string): bilingual training lexicon

    Returns
    mapping(np.array): the mapping from English to French vectors
    '''
    x_en, x_fr = [], []
    ##########################################################################
    #                      TODO: Implement this function                     #
    ##########################################################################
    # Replace "pass" statement with your code
    for en_word, fr_word in lexicon:
      if en_word in word_vectors_en and fr_word in word_vectors_fr:
        x_en.append(word_vectors_en[en_word])
        x_fr.append(word_vectors_fr[fr_word])

    x_en = np.array(x_en)
    x_fr = np.array(x_fr)

    mapping, _, _, _ = np.linalg.lstsq(x_en, x_fr, rcond=None)
    ##########################################################################
    #                            END OF YOUR CODE                            #
    ##########################################################################

    return mapping

In [22]:
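# Note: the full lexicon is used here, so the pairs in `valid` are also seen
# during fitting; pass `train` instead to keep the validation pairs held out.
mapping = align(word_vectors_en, word_vectors_fr, lexicon)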
mapping

array([[-0.06183285, -0.01071552,  0.00175985, ..., -0.01107046,
         0.01629405, -0.01644996],
       [-0.01655313, -0.02930488,  0.09810107, ..., -0.01744702,
        -0.02848298,  0.02070179],
       [-0.01970861, -0.0147154 ,  0.01231819, ...,  0.03036093,
        -0.00209909, -0.00944313],
       ...,
       [ 0.0669847 ,  0.02351181,  0.02041902, ...,  0.00886501,
         0.08635366,  0.00595836],
       [ 0.01936122,  0.00552446,  0.01234669, ..., -0.00623332,
        -0.05116348,  0.05634361],
       [ 0.00530333, -0.03424679, -0.03369923, ..., -0.01344391,
        -0.00051053, -0.00491391]])

Given a mapping, a set of English word vectors and a set of French word vectors, the next function translates an English word into French. To do so, we apply the mapping to the English word vector and retrieve the nearest neighbor of the resulting vector among the French word vectors. The translation is the corresponding French word.
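
The same two steps (apply the mapping, then search for the nearest neighbor) can be sketched on toy data. Everything below is hypothetical: the 2-D vectors, the axis-swapping `toy_mapping`, and the helper `translate_sketch`. The sketch also stacks the French vectors into one normalized matrix so that all similarities come from a single matrix product, which is one possible alternative to looping over the dictionary:

```python
import numpy as np

# Hypothetical toy data: 2-D vectors and a mapping that swaps the two axes.
toy_en = {'cat': np.array([1.0, 0.0]), 'dog': np.array([0.0, 1.0])}
toy_fr = {'chat': np.array([0.1, 2.0]), 'chien': np.array([3.0, 0.2])}
toy_mapping = np.array([[0.0, 1.0],
                        [1.0, 0.0]])

fr_words = list(toy_fr)
fr_matrix = np.stack([toy_fr[w] for w in fr_words])
# Normalize each row so a dot product ranks words like cosine similarity does.
fr_matrix = fr_matrix / np.linalg.norm(fr_matrix, axis=1, keepdims=True)

def translate_sketch(word):
    mapped = toy_en[word] @ toy_mapping   # row vector times mapping
    scores = fr_matrix @ mapped           # one score per French word
    return fr_words[int(np.argmax(scores))]

print(translate_sketch('cat'), translate_sketch('dog'))  # chat chien
```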

In [23]:
def translate(word, word_vectors_en, word_vectors_fr, mapping):
    '''
    Parameters:
    word(string): an English word
    word_vectors_en(dict: string -> np.array): English word vectors
    word_vectors_fr(dict: string -> np.array): French word vectors
    mapping(np.array): the mapping from English to French vectors

    Returns
    A string containing the translation of the English word
    '''
    ##########################################################################
    #                      TODO: Implement this function                     #
    ##########################################################################
    # Replace "pass" statement with your code
    en_vector = word_vectors_en[word]
    mapped_vector = np.dot(en_vector, mapping)
    best_score = -1.0
    best_word = None
    for fr_word, fr_vector in word_vectors_fr.items():
      score = cosine(mapped_vector, fr_vector)
      if score > best_score:
        best_score = score
        best_word = fr_word
    ##########################################################################
    #                            END OF YOUR CODE                            #
    ##########################################################################

    return best_word

In [24]:
print(translate("man", word_vectors_en, word_vectors_fr, mapping))
print(translate("machine", word_vectors_en, word_vectors_fr, mapping))
print(translate("learning", word_vectors_en, word_vectors_fr, mapping))

homme
machine
apprentissage


Finally, let's implement a function to evaluate this method on the validation lexicon:

In [26]:
def evaluate(valid, word_vectors_en, word_vectors_fr, mapping):
    '''
    Parameters:
    valid(a list of pairs of string): the validation lexicon
    word_vectors_en(dict: string -> np.array): English word vectors
    word_vectors_fr(dict: string -> np.array): French word vectors
    mapping(np.array): the mapping from English to French vectors

    Returns
    Accuracy(float): the accuracy on the validation lexicon
    '''
    acc, n = 0.0, 0
    ##########################################################################
    #                      TODO: Implement this function                     #
    ##########################################################################
    # Replace "pass" statement with your code
    for en_word, fr_word in valid:
      # Skip pairs whose words are missing from either vocabulary
      if en_word in word_vectors_en and fr_word in word_vectors_fr:
        # Reuse translate() rather than repeating the nearest-neighbor search
        prediction = translate(en_word, word_vectors_en, word_vectors_fr, mapping)
        if prediction == fr_word:
          acc += 1

        n += 1

    accuracy = acc / n if n > 0 else 0.0
    ##########################################################################
    #                            END OF YOUR CODE                            #
    ##########################################################################

    return accuracy

In [27]:
accuracy = evaluate(valid, word_vectors_en, word_vectors_fr, mapping)
accuracy

0.64