# Naive Machine Translation

A program that translates English words to French by learning a linear mapping between the two languages' word-embedding spaces.

## 1. Initialization

### 1.1. Importing packages and downloading data

#### Importing packages

In [1]:
from gensim.models import KeyedVectors
import pandas as pd
import pickle
import numpy as np

#### The word embeddings data for English and French words

English embeddings from Google code archive word2vec https://code.google.com/archive/p/word2vec/  
French embeddings from cross_lingual_text_classification https://github.com/vjstark/crosslingual_text_classification -> https://dl.fbaipublicfiles.com/arrival/vectors/wiki.multi.fr.vec

In [2]:
en_embeddings = KeyedVectors.load_word2vec_format('./src/GoogleNews-vectors-negative300.bin', binary = True)
fr_embeddings = KeyedVectors.load_word2vec_format('./src/wiki.multi.fr.vec')

#### Loading the English to French dictionaries

This function returns the English to French dictionary read from a space-delimited file in which the first column holds the English word and the second column its French translation.

In [3]:
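def get_dict(file_name):
    # Read the space-delimited word pairs. read_csv treats the first line as a
    # header by default, so that line is not added to the dictionary.
    my_file = pd.read_csv(file_name, delimiter=' ')
    etof = {}  # the English to French dictionary to be returned
    for i in range(len(my_file)):
        en = my_file.iloc[i, 0]  # English word (first column)
        fr = my_file.iloc[i, 1]  # French word (second column)
        etof[en] = fr

    return etof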
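
The same lookup table can be built without an explicit loop. The sketch below is an alternative, not the notebook's method: `get_dict_vectorized` is a hypothetical name, and it assumes the file has no header row (`header=None`) and that words such as `null` or `NA` should stay text (`keep_default_na=False`) rather than be parsed as missing values. An in-memory string stands in for the real file.

```python
import io

import pandas as pd


def get_dict_vectorized(file_or_buffer):
    """Build an English-to-French dict from a two-column, space-delimited file.

    Assumptions (not from the notebook): the file has no header row, and no
    word should be interpreted as a missing value.
    """
    pairs = pd.read_csv(
        file_or_buffer,
        delimiter=' ',
        header=None,            # treat the first line as data, not column names
        keep_default_na=False,  # keep words like "null" or "NA" as strings
    )
    # Column 0 holds the English word, column 1 its French translation.
    return dict(zip(pairs[0], pairs[1]))


# Toy stand-in for a file such as ./data/en-fr.train.txt
sample = io.StringIO("the la\nhouse maison\nnull nul\n")
toy_dict = get_dict_vectorized(sample)
print(toy_dict)
```

If the real files do start with a header line, dropping `header=None` restores the behaviour of `get_dict`.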

In [4]:
en_fr_train = get_dict('./data/en-fr.train.txt')
print('The length of the english to french training dictionary is', len(en_fr_train))
en_fr_test = get_dict('./data/en-fr.test.txt')
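print('The length of the english to french test dictionary is', len(en_fr_test))

The length of the english to french training dictionary is 5000
The length of the english to french test dictionary is 5000

Note: the second count above was recorded while the cell still printed `len(en_fr_train)` on both lines, so it repeats the training size; rerun the cell to see the size of the test dictionary.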


#### Subsetting the data

In [5]:
# Vocabulary of each embedding model. `.vocab` is the gensim < 4.0 attribute;
# on gensim >= 4.0 use `.key_to_index` instead.
english_set = set(en_embeddings.vocab)
french_set = set(fr_embeddings.vocab)
en_embeddings_subset = {}
fr_embeddings_subset = {}

for en_word in en_fr_train.keys():
    fr_word = en_fr_train[en_word]
    if fr_word in french_set and en_word in english_set:
        en_embeddings_subset[en_word] = en_embeddings[en_word]
        fr_embeddings_subset[fr_word] = fr_embeddings[fr_word]


for en_word in en_fr_test.keys():
    fr_word = en_fr_test[en_word]
    if fr_word in french_set and en_word in english_set:
        en_embeddings_subset[en_word] = en_embeddings[en_word]
        fr_embeddings_subset[fr_word] = fr_embeddings[fr_word]


with open("./data/en_embeddings.p", "wb") as f:
    pickle.dump(en_embeddings_subset, f)
with open("./data/fr_embeddings.p", "wb") as f:
    pickle.dump(fr_embeddings_subset, f)

Loading the subset of data

In [6]:
with open("./data/en_embeddings.p", "rb") as f:
    en_embeddings_subset = pickle.load(f)
with open("./data/fr_embeddings.p", "rb") as f:
    fr_embeddings_subset = pickle.load(f)

### 1.2. Generating the embedding matrices

**Defining get_matrices function**

**Inputs** :  
*en_fr*: English to French dictionary  
*french_vecs*: French words to their corresponding word embeddings.  
*english_vecs*: English words to their corresponding word embeddings.

**Outputs** :  
*X*: a matrix whose rows are the English embeddings.  
*Y*: a matrix whose rows are the corresponding French embeddings.  

In [7]:
def get_matrices(en_fr, french_vecs, english_vecs):

    X_l = list()
    Y_l = list()

    english_set = set(english_vecs.keys())
    french_set = set(french_vecs.keys())

    for en_word, fr_word in en_fr.items():

        if fr_word in french_set and en_word in english_set:
            en_vec = english_vecs[en_word]
            fr_vec = french_vecs[fr_word]
            X_l.append(en_vec)
            Y_l.append(fr_vec)

    X = np.vstack(X_l)
    Y = np.vstack(Y_l)

    return X, Y
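
To see what `get_matrices` returns, the sketch below runs a condensed copy of it on made-up 3-dimensional vectors (the words and numbers are purely illustrative). Each matched pair contributes one row to `X` and one row to `Y`, and a pair missing from either embedding dictionary is skipped.

```python
import numpy as np


def get_matrices(en_fr, french_vecs, english_vecs):
    # Condensed copy of the function above so the example runs on its own.
    X_l, Y_l = [], []
    for en_word, fr_word in en_fr.items():
        if fr_word in french_vecs and en_word in english_vecs:
            X_l.append(english_vecs[en_word])
            Y_l.append(french_vecs[fr_word])
    return np.vstack(X_l), np.vstack(Y_l)


# Hypothetical toy data: 3-dimensional "embeddings" for illustration only.
toy_en_fr = {"cat": "chat", "dog": "chien", "bird": "oiseau"}
toy_en_vecs = {"cat": np.array([1.0, 0.0, 0.0]),
               "dog": np.array([0.0, 1.0, 0.0]),
               "bird": np.array([0.0, 0.0, 1.0])}
toy_fr_vecs = {"chat": np.array([0.9, 0.1, 0.0]),
               "chien": np.array([0.1, 0.9, 0.0])}  # "oiseau" has no vector

X_toy, Y_toy = get_matrices(toy_en_fr, toy_fr_vecs, toy_en_vecs)
# One row per matched pair, one column per embedding dimension.
print(X_toy.shape, Y_toy.shape)
```

Row `i` of `X_toy` and row `i` of `Y_toy` always belong to the same word pair, which is what the loss and the accuracy test later rely on.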

Getting the training set

In [8]:
X_train, Y_train = get_matrices(en_fr_train, fr_embeddings_subset, en_embeddings_subset)

## 2. Translations

### 2.1. Translation as linear transformation of embeddings

#### Computing the loss

The loss function is the squared Frobenius norm of the difference between the transformed English embeddings `XR` and the French embeddings `Y`, divided by the number of training examples *m*.
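
Written out, with $X$ and $Y$ each holding one embedding per row ($m$ rows, $n$ columns), the loss described above is:

$$
L(X, Y, R) = \frac{1}{m}\,\lVert XR - Y \rVert_F^2 = \frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{n}\bigl((XR)_{ij} - Y_{ij}\bigr)^2
$$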

**Defining compute_loss function**

**Inputs** :  
*X*: a matrix of dimension (m,n) whose rows are the English embeddings.  
*Y*: a matrix of dimension (m,n) whose rows are the corresponding French embeddings.  
*R*: a matrix of dimension (n,n) - transformation matrix from English to French vector space embeddings.

**Outputs** :  
*L*: a scalar - the value of the loss function for the given X, Y and R.

In [9]:
def compute_loss(X, Y, R):

    m = np.shape(X)[0]
    
    diff = np.dot(X,R)-Y
    diff_squared = np.square(diff)
    sum_diff_squared = np.sum(diff_squared)

    loss = sum_diff_squared/m

    return loss

#### Computing the gradient of the loss with respect to the transformation matrix R

**Defining compute_gradient function**

**Inputs** :  
*X*: a matrix of dimension (m,n) whose rows are the English embeddings.  
*Y*: a matrix of dimension (m,n) whose rows are the corresponding French embeddings.  
*R*: a matrix of dimension (n,n) - transformation matrix from English to French vector space embeddings.

**Outputs** :  
*gradient*: a matrix of dimension (n,n) - gradient of the loss function L for given X, Y and R.
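
The formula implemented in the next cell follows from the definition of the loss as the squared Frobenius norm of $XR - Y$ divided by $m$. Differentiating with respect to $R$ gives:

$$
\frac{\partial L}{\partial R} = \frac{\partial}{\partial R}\,\frac{1}{m}\lVert XR - Y \rVert_F^2 = \frac{2}{m}\,X^{\top}(XR - Y)
$$

The result has the same shape as $R$, namely (n,n).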

In [10]:
def compute_gradient(X, Y, R):

    m = np.shape(X)[0]

    gradient = np.dot(X.T,(np.dot(X,R)-Y)) * (2/m)

    return gradient
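
One way to gain confidence in an analytic gradient is to compare it with a finite-difference estimate. The sketch below does that on small random matrices; `numerical_gradient` is a hypothetical helper, and the two functions from the notebook are restated so the block runs on its own.

```python
import numpy as np


def compute_loss(X, Y, R):
    # Same loss as above: squared Frobenius norm of XR - Y, divided by m.
    m = X.shape[0]
    return np.sum(np.square(X @ R - Y)) / m


def compute_gradient(X, Y, R):
    # Same analytic gradient as above: (2/m) * X^T (XR - Y).
    m = X.shape[0]
    return X.T @ (X @ R - Y) * (2 / m)


def numerical_gradient(X, Y, R, eps=1e-6):
    # Hypothetical helper: central finite differences, one entry of R at a time.
    grad = np.zeros_like(R)
    for i in range(R.shape[0]):
        for j in range(R.shape[1]):
            step = np.zeros_like(R)
            step[i, j] = eps
            grad[i, j] = (compute_loss(X, Y, R + step)
                          - compute_loss(X, Y, R - step)) / (2 * eps)
    return grad


rng = np.random.default_rng(0)  # fixed seed, toy sizes
X_toy = rng.normal(size=(8, 3))
Y_toy = rng.normal(size=(8, 3))
R_toy = rng.normal(size=(3, 3))

max_abs_diff = np.max(np.abs(compute_gradient(X_toy, Y_toy, R_toy)
                             - numerical_gradient(X_toy, Y_toy, R_toy)))
print(f"largest difference between analytic and numerical gradient: {max_abs_diff:.2e}")
```

The check loops over every entry of `R`, so it is only practical for small matrices, not for the 300-dimensional embeddings used here.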

#### Finding the optimal R with the gradient descent algorithm

**Defining align_embeddings function**

**Inputs** :  
*X*: a matrix of dimension (m,n) whose rows are the English embeddings.  
*Y*: a matrix of dimension (m,n) whose rows are the corresponding French embeddings.  
*iterations*: positive int - the number of gradient descent steps to take.  
*learning_rate*: positive float - the size of each gradient descent step.

**Outputs** :  
*R*: a matrix of dimension (n,n) - the transformation matrix found by gradient descent, which approximately minimizes the squared Frobenius norm ||XR-Y||^2

In [11]:
def align_embeddings(X, Y, iterations=100, learning_rate=0.0003):

    R = np.random.rand(X.shape[1], X.shape[1])

    for i in range(iterations):
        if i % 25 == 0:
            print(f"loss at iteration {i} is: {compute_loss(X, Y, R):.4f}")

        gradient = compute_gradient(X, Y, R)

        R -= learning_rate*gradient

    return R
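
Because the loss is an ordinary least-squares objective, its minimizer can also be computed in closed form, which gives a handy sanity check for the gradient descent loop. The sketch below is not part of the notebook's method: it uses `np.linalg.lstsq` as a swapped-in reference solver, synthetic data, and a hypothetical print-free restatement of `align_embeddings` called `fit_by_gradient_descent`.

```python
import numpy as np


def fit_by_gradient_descent(X, Y, iterations=500, learning_rate=0.1, seed=0):
    # Hypothetical, print-free restatement of align_embeddings with a fixed seed.
    rng = np.random.default_rng(seed)
    m, n = X.shape
    R = rng.random((n, n))  # random start in [0, 1), as in the notebook
    for _ in range(iterations):
        gradient = X.T @ (X @ R - Y) * (2 / m)
        R -= learning_rate * gradient
    return R


rng = np.random.default_rng(1)
X_toy = rng.normal(size=(200, 5))   # 200 toy "English" vectors
R_true = rng.normal(size=(5, 5))    # mapping used to generate the toy data
Y_toy = X_toy @ R_true + 0.01 * rng.normal(size=(200, 5))  # noisy "French" vectors

R_gd = fit_by_gradient_descent(X_toy, Y_toy)
# Closed-form least-squares solution of X R = Y, used only for comparison.
R_ls, *_ = np.linalg.lstsq(X_toy, Y_toy, rcond=None)

print("max |R_gd - R_ls| =", np.max(np.abs(R_gd - R_ls)))
```

On this well-conditioned toy problem the two solutions should agree closely; on real embeddings the learning rate and iteration count that work depend on the scale of `X`.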

#### Calculating transformation matrix R

In [18]:
R_train = align_embeddings(X_train, Y_train, iterations=500, learning_rate=0.8)

loss at iteration 0 is: 954.8366
loss at iteration 25 is: 97.7061
loss at iteration 50 is: 26.8728
loss at iteration 75 is: 9.8347
loss at iteration 100 is: 4.4102
loss at iteration 125 is: 2.3495
loss at iteration 150 is: 1.4620
loss at iteration 175 is: 1.0429
loss at iteration 200 is: 0.8312
loss at iteration 225 is: 0.7186
loss at iteration 250 is: 0.6562
loss at iteration 275 is: 0.6205
loss at iteration 300 is: 0.5994
loss at iteration 325 is: 0.5867
loss at iteration 350 is: 0.5789
loss at iteration 375 is: 0.5740
loss at iteration 400 is: 0.5709
loss at iteration 425 is: 0.5688
loss at iteration 450 is: 0.5675
loss at iteration 475 is: 0.5666


### 2.2. Testing the translation

#### Finding k-nearest neighbors

**Defining cosine_similarity function**

**Inputs** :  
*A*: a numpy array which corresponds to a word vector  
*B*: a numpy array which corresponds to a word vector

**Outputs** :  
*cos*: a scalar - the cosine similarity between A and B.

In [13]:
def cosine_similarity(A, B):

    dot = np.dot(A, B)
    norma = np.linalg.norm(A)
    normb = np.linalg.norm(B)
    cos = dot / (norma * normb)

    return cos

**Defining nearest_neighbor function**

**Inputs** :  
*v*: the vector to find the nearest neighbors for  
*candidates*: a matrix whose rows are the vectors to search for neighbors  
*k*: the number of nearest neighbors to find

**Outputs** :  
*k_idx*: the indices of the k most similar vectors, sorted by increasing cosine similarity (the closest one is last)

In [14]:
def nearest_neighbor(v, candidates, k=1):

    similarity_l = []

    for row in candidates:
        cos_similarity = cosine_similarity(v,row)
        similarity_l.append(cos_similarity)
        
    sorted_ids = np.argsort(similarity_l)

    k_idx = sorted_ids[-k:]
    
    return k_idx
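
The row-by-row loop is easy to follow but slow for large candidate sets. A possible vectorized variant, sketched below under the assumption that `candidates` is a 2-D array with one vector per row, computes all cosine similarities in one matrix-vector product. `nearest_neighbor_vectorized` is a hypothetical name, and the toy vectors are made up.

```python
import numpy as np


def nearest_neighbor_vectorized(v, candidates, k=1):
    # Hypothetical alternative to nearest_neighbor: all cosine similarities at once.
    candidates = np.asarray(candidates, dtype=float)
    v = np.asarray(v, dtype=float)
    dots = candidates @ v  # shape (num_candidates,)
    norms = np.linalg.norm(candidates, axis=1) * np.linalg.norm(v)
    similarities = dots / norms
    # argsort is ascending, so the last k entries are the k most similar rows.
    return np.argsort(similarities)[-k:]


v = np.array([1.0, 0.0])
candidates = np.array([[1.0, 0.1],    # nearly parallel to v
                       [0.0, 1.0],    # orthogonal to v
                       [-1.0, 0.0],   # opposite to v
                       [0.9, 0.5]])   # fairly close to v

top2 = nearest_neighbor_vectorized(v, candidates, k=2)
print(top2)  # indices ordered from less similar to more similar
```

As in the loop version, the closest candidate is the last element of the returned array.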

#### Testing the translation and computing its accuracy

**Defining test_vocabulary function**

**Inputs** :  
*X*: a matrix whose rows are the English embeddings.  
*Y*: a matrix whose rows are the corresponding French embeddings.  
*R*: the transform matrix which translates word embeddings from English to French word vector space.

**Outputs** :  
*accuracy*: the fraction of English words whose transformed embedding has the correct French embedding as its nearest neighbor

In [15]:
def test_vocabulary(X, Y, R):

    pred = np.dot(X,R)

    num_correct = 0

    for i in range(len(pred)):
        # nearest_neighbor returns an array of k indices; with k=1 take its only element
        pred_idx = nearest_neighbor(pred[i], Y)[0]

        if pred_idx == i:
            num_correct += 1

    accuracy = num_correct/np.shape(pred)[0]

    return accuracy
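
A quick way to check an accuracy routine like this is to feed it data where the right answer is known. The sketch below builds toy "French" vectors, derives toy "English" vectors from them with a random orthogonal matrix `Q`, and confirms that mapping back with `Q` recovers every pair. `accuracy_vectorized` is a hypothetical vectorized counterpart of `test_vocabulary` for k = 1; apart from how ties are broken, it is meant to compute the same quantity.

```python
import numpy as np


def accuracy_vectorized(X, Y, R):
    # Hypothetical vectorized counterpart of test_vocabulary (k = 1).
    pred = X @ R
    pred_unit = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    Y_unit = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    similarities = pred_unit @ Y_unit.T      # entry (i, j): cos(pred[i], Y[j])
    best = np.argmax(similarities, axis=1)   # most similar row of Y per prediction
    return np.mean(best == np.arange(len(pred)))


rng = np.random.default_rng(0)
Y_toy = rng.normal(size=(6, 4))                # toy "French" vectors
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))   # random orthogonal matrix
X_toy = Y_toy @ Q.T                            # toy "English" vectors: X_toy @ Q equals Y_toy

acc_exact = accuracy_vectorized(X_toy, Y_toy, Q)  # exact mapping: each row matches itself
acc_identity = accuracy_vectorized(X_toy, Y_toy, np.eye(4))  # no alignment applied
print(acc_exact, acc_identity)
```

Comparing the score for the exact mapping with the score for the identity matrix shows how much of the accuracy comes from the learned alignment rather than from the data itself.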

Applying the translation mechanism to unseen data:

In [16]:
X_val, Y_val = get_matrices(en_fr_test, fr_embeddings_subset, en_embeddings_subset)

In [19]:
acc = test_vocabulary(X_val, Y_val, R_train)
print(f"accuracy on test set is {acc:.3f}")

accuracy on test set is 0.552


We managed to translate words the model never saw during training from English to French with about 55% accuracy, using only basic linear algebra: a single matrix that maps English word embeddings into the French embedding space!