<a href="https://colab.research.google.com/github/martin-fabbri/colab-notebooks/blob/master/deeplearning.ai/nlp/c1_w4_assignment_naive_machine_translation_lsh.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 4 - Naive Machine Translation and LSH

You will now implement your first machine translation system and then you
will see how locality sensitive hashing works. Let's get started by importing
the required functions!

If you are running this notebook on your local computer, don't forget to
download the Twitter samples and stopwords from NLTK.

```
nltk.download('stopwords')
nltk.download('twitter_samples')
```

### This assignment covers the following topics:

- [1. The word embeddings data for English and French words](#1)
  - [1.1 Generate embedding and transform matrices](#1-1)
      - [Exercise 1](#ex-01)
- [2. Translations](#2)
  - [2.1 Translation as linear transformation of embeddings](#2-1)
      - [Exercise 2](#ex-02)  
      - [Exercise 3](#ex-03)  
      - [Exercise 4](#ex-04)        
  - [2.2 Testing the translation](#2-2)
      - [Exercise 5](#ex-05)
      - [Exercise 6](#ex-06)      
- [3. LSH and document search](#3)
  - [3.1 Getting the document embeddings](#3-1)
      - [Exercise 7](#ex-07)
      - [Exercise 8](#ex-08)      
  - [3.2 Looking up the tweets](#3-2)
  - [3.3 Finding the most similar tweets with LSH](#3-3)
  - [3.4 Getting the hash number for a vector](#3-4)
      - [Exercise 9](#ex-09)  
  - [3.5 Creating a hash table](#3-5)
      - [Exercise 10](#ex-10)  
  - [3.6 Creating all hash tables](#3-6)
      - [Exercise 11](#ex-11) 

In [61]:
import pdb
import pickle
import string
import re
import time
import nltk
import gensim
import scipy
import sklearn

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from gensim.models import KeyedVectors
from nltk.corpus import stopwords
from nltk.corpus import twitter_samples
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

In [62]:
nltk.download('stopwords')
nltk.download('twitter_samples')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


True

In [63]:
def process_tweet(tweet):
    '''
    Input:
        tweet: a string containing a tweet
    Output:
        tweets_clean: a list of words containing the processed tweet

    '''
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')
    # remove stock market tickers like $GE
    tweet = re.sub(r'\$\w*', '', tweet)
    # remove old style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    # remove hyperlinks
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    # remove hashtags
    # only removing the hash # sign from the word
    tweet = re.sub(r'#', '', tweet)
    # tokenize tweets
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)

    tweets_clean = []
    for word in tweet_tokens:
        if (word not in stopwords_english and  # remove stopwords
            word not in string.punctuation):  # remove punctuation
            # tweets_clean.append(word)
            stem_word = stemmer.stem(word)  # stemming word
            tweets_clean.append(stem_word)

    return tweets_clean


def get_dict(file_name):
    """
    This function returns the English to French dictionary given a file where each column corresponds to a word.
    Check out the files this function takes in your workspace.
    """
    my_file = pd.read_csv(file_name, delimiter=' ')
    etof = {}  # the english to french dictionary to be returned
    for i in range(len(my_file)):
        # indexing into the rows.
        en = my_file.loc[i][0]
        fr = my_file.loc[i][1]
        etof[en] = fr

    return etof


def cosine_similarity(A, B):
    '''
    Input:
        A: a numpy array which corresponds to a word vector
        B: A numpy array which corresponds to a word vector
    Output:
        cos: a number representing the cosine similarity between A and B.
    '''
    # cosine similarity = (A . B) / (||A|| * ||B||)
    dot = np.dot(A, B)
    norma = np.linalg.norm(A)
    normb = np.linalg.norm(B)
    cos = dot / (norma * normb)

    return cos

<a name="1"></a>

# 1. The word embeddings data for English and French words

Write a program that translates English to French.

## The data

The full dataset for English embeddings is about 3.64 gigabytes, and the French
embeddings are about 629 megabytes. To prevent the Coursera workspace from
crashing, we've extracted a subset of the embeddings for the words that you'll
use in this assignment.

If you want to run this on your local computer and use the full dataset,
you can download the
* English embeddings from Google code archive word2vec
[look for GoogleNews-vectors-negative300.bin.gz](https://code.google.com/archive/p/word2vec/)
    * You'll need to unzip the file first.
* and the French embeddings from
[cross_lingual_text_classification](https://github.com/vjstark/crosslingual_text_classification).
    * in the terminal, type (in one line)
    `curl -o ./wiki.multi.fr.vec https://dl.fbaipublicfiles.com/arrival/vectors/wiki.multi.fr.vec`

Then copy-paste the code below and run it.

```python
# Use this code to download and process the full dataset on your local computer

from gensim.models import KeyedVectors

en_embeddings = KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary = True)
fr_embeddings = KeyedVectors.load_word2vec_format('./wiki.multi.fr.vec')


# loading the english to french dictionaries
en_fr_train = get_dict('en-fr.train.txt')
print('The length of the english to french training dictionary is', len(en_fr_train))
en_fr_test = get_dict('en-fr.test.txt')
print('The length of the english to french test dictionary is', len(en_fr_test))

# Note: `.vocab` is available in gensim < 4.0; in gensim 4.x use `.key_to_index` instead.
english_set = set(en_embeddings.vocab)
french_set = set(fr_embeddings.vocab)
en_embeddings_subset = {}
fr_embeddings_subset = {}
french_words = set(en_fr_train.values())

for en_word in en_fr_train.keys():
    fr_word = en_fr_train[en_word]
    if fr_word in french_set and en_word in english_set:
        en_embeddings_subset[en_word] = en_embeddings[en_word]
        fr_embeddings_subset[fr_word] = fr_embeddings[fr_word]


for en_word in en_fr_test.keys():
    fr_word = en_fr_test[en_word]
    if fr_word in french_set and en_word in english_set:
        en_embeddings_subset[en_word] = en_embeddings[en_word]
        fr_embeddings_subset[fr_word] = fr_embeddings[fr_word]


pickle.dump( en_embeddings_subset, open( "en_embeddings.p", "wb" ) )
pickle.dump( fr_embeddings_subset, open( "fr_embeddings.p", "wb" ) )
```

#### The subset of data

To do the assignment on the Coursera workspace, we'll use the subset of word embeddings.

In [64]:
!wget https://github.com/martin-fabbri/colab-notebooks/raw/master/data/nlp/en_embeddings.p 
!wget https://github.com/martin-fabbri/colab-notebooks/raw/master/data/nlp/fr_embeddings.p
!wget https://raw.githubusercontent.com/martin-fabbri/colab-notebooks/master/data/nlp/en-fr.train.txt
!wget https://raw.githubusercontent.com/martin-fabbri/colab-notebooks/master/data/nlp/en-fr.test.txt


In [65]:
en_embeddings_subset = pickle.load(open('en_embeddings.p', 'rb'))
fr_embeddings_subset = pickle.load(open('fr_embeddings.p', 'rb'))

#### Look at the data

* `en_embeddings_subset`: the key is an English word, and the value is a
300-dimensional array, which is the embedding for that word.
```
'the': array([ 0.08007812,  0.10498047,  0.04980469,  0.0534668 , -0.06738281, ....
```

* `fr_embeddings_subset`: the key is a French word, and the value is a 300-dimensional
array, which is the embedding for that word.
```
'la': array([-6.18250e-03, -9.43867e-04, -8.82648e-03,  3.24623e-02,...
```

In [66]:
type(en_embeddings_subset)

dict

In [67]:
len(en_embeddings_subset)

6370

In [68]:
len(en_embeddings_subset['the'])

300

#### Load two dictionaries mapping the English to French words
* A training dictionary
* and a testing dictionary.

In [69]:
# loading the english to french dictionaries
en_fr_train = get_dict('en-fr.train.txt')
print('The length of the English to French training dictionary is', len(en_fr_train))
en_fr_test = get_dict('en-fr.test.txt')
print('The length of the English to French test dictionary is', len(en_fr_train))

The length of the English to French training dictionary is 5000
The length of the English to French test dictionary is 5000


#### Looking at the English French dictionary

* `en_fr_train` is a dictionary where the key is the English word and the value
is the French translation of that English word.
```
{'the': 'la',
 'and': 'et',
 'was': 'était',
 'for': 'pour',
```

* `en_fr_test` is similar to `en_fr_train`, but is a test set.  We won't look at it
until we get to testing.

In [70]:
en_fr_train['the']

'la'

In [71]:
en_fr_train['people']

'personnes'

<a name="1-1"></a>

## 1.1 Generate embedding and transform matrices

<a name="ex-01"></a>
#### Exercise 01: Translating English dictionary to French by using embeddings

You will now implement a function `get_matrices`, which takes the loaded data
and returns matrices `X` and `Y`.

Inputs:
- `en_fr`: English to French dictionary
- `french_vecs`: dictionary mapping French words to their embeddings
- `english_vecs`: dictionary mapping English words to their embeddings

Returns:
- Matrix `X` and matrix `Y`, where each row in `X` is the word embedding for an
English word, and the same row in `Y` is the word embedding for the French
version of that English word.

<div style="width:image width px; font-size:100%; text-align:center;">
<img src='https://github.com/martin-fabbri/colab-notebooks/raw/master/deeplearning.ai/nlp/images/X_to_Y.jpg' alt="alternate text" width="width" height="height" style="width:800px;height:200px;" /> Figure 2 </div>

Use the `en_fr` dictionary to ensure that the ith row in the `X` matrix
corresponds to the ith row in the `Y` matrix.

**Instructions**: Complete the function `get_matrices()`:
* Iterate over the English words in the `en_fr` dictionary.
* Check whether the English word and its French translation both have an embedding.
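
To make the row alignment concrete, here is a minimal sketch on made-up data. The 2-dimensional vectors and the names `toy_en_fr`, `toy_en_vecs` and `toy_fr_vecs` are hypothetical and chosen only for illustration; the real embeddings have 300 dimensions.

```python
import numpy as np

# Hypothetical toy data: 2-dimensional "embeddings" for illustration only.
toy_en_fr = {"cat": "chat", "dog": "chien", "bird": "oiseau"}
toy_en_vecs = {"cat": np.array([1.0, 0.0]), "dog": np.array([0.0, 1.0])}
toy_fr_vecs = {"chat": np.array([0.9, 0.1]),
               "chien": np.array([0.1, 0.9]),
               "oiseau": np.array([0.5, 0.5])}

X_rows, Y_rows = [], []
for en_word, fr_word in toy_en_fr.items():
    # keep a pair only when BOTH words have an embedding
    if en_word in toy_en_vecs and fr_word in toy_fr_vecs:
        X_rows.append(toy_en_vecs[en_word])
        Y_rows.append(toy_fr_vecs[fr_word])

# row i of X_toy and row i of Y_toy belong to the same word pair
X_toy = np.vstack(X_rows)
Y_toy = np.vstack(Y_rows)
print(X_toy.shape, Y_toy.shape)  # (2, 2) (2, 2): "bird" has no English vector
```

Because both lists are appended to in the same loop iteration, the ith rows always stay paired.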

<details>
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
    <p>
        <ul>
            <li><a href="https://realpython.com/python-sets/#set-size-and-membership" >Sets</a> are useful data structures that can be used to check if an item is a member of a group.</li>
            <li>You can get the words that have an embedding in a language by using the <a href="https://www.w3schools.com/python/ref_dictionary_keys.asp"> keys</a> method of the embeddings dictionary.</li>
            <li>Collect the vectors for `X` and `Y` in lists, in matching order. You can use <a href="https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.ma.vstack.html"> np.vstack()</a> to merge them into a numpy matrix. </li>
            <li><a href="https://docs.scipy.org/doc/numpy/reference/generated/numpy.vstack.html">numpy.vstack</a> stacks the items in a list as rows in a matrix.</li>
        </ul>
    </p>
</details>

In [72]:
a = np.array([1, 2, 4])
b = np.array([5, 6, 7])
c = np.vstack((a, b))
d = np.array([8, 9, 10])
e = np.vstack((c, d))
e

array([[ 1,  2,  4],
       [ 5,  6,  7],
       [ 8,  9, 10]])

In [73]:
l = list()
l.append([1, 2, 4])
l.append([5, 6, 7])
l.append([8, 9, 10])
e = np.vstack(l)
e

array([[ 1,  2,  4],
       [ 5,  6,  7],
       [ 8,  9, 10]])

In [74]:
e = {k for k in en_embeddings_subset.keys()}
type(e)

set

In [75]:
f = {k for k in fr_embeddings_subset.keys()}
type(f)

set

In [76]:
en_fr_train.values()

dict_values(['la', 'et', 'était', 'pour', 'cela', 'avec', 'depuis', 'ce', 'tuc', 'son', 'pas', 'sont', 'parlez', 'lequel', 'egalement', 'étaient', 'mais', 'ont', 'one', 'nouveautés', 'premiers', 'page', 'you', 'eux', 'avais', 'article', 'who', 'all', 'leurs', 'là', 'fabriqué', 'son', 'personnes', 'peut', 'aprés', 'autres', 'devrais', 'deux', 'partition', 'her', 'peut', 'ferait', 'plus', 'elle', 'quand', 'heure', 'equipe', 'américains', 'telles', 'débat', 'liens', 'seule', 'quelques', 'vois', 'unies', 'ans', 'école', 'mondiale', 'universitaire', 'lors', 'out', 'état', 'états', 'nationales', 'wikipedia', 'année', 'most', 'villes', 'utilisée', 'puis', 'comté', 'externes', 'où', 'sera', 'quelle', 'effacer', 'ces', 'janvier', 'mars', 'août', 'juillet', 'être', 'film', 'lui', 'plusieurs', 'sud', 'septembre', 'aimez', 'entre', 'octobre', 'three', 'juin', 'bah', 'utilisez', 'war', 'under', 'eux', 'avril', 'born', 'decembre', 'lien', 'ultérieur', 'partie', 'novembre', 'joueurs', 'listes', 'svp'

In [77]:
# UNQ_C1 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def get_matrices(en_fr, french_vecs, english_vecs):
    """
    Input:
        en_fr: English to French dictionary
        french_vecs: French words to their corresponding word embeddings.
        english_vecs: English words to their corresponding word embeddings.
    Output: 
        X: a matrix where the rows are the English embeddings.
        Y: a matrix where the rows correspond to the French embeddings.
    """

    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###

    # X_l and Y_l are lists of the english and french word embeddings
    X_l = list()
    Y_l = list()

    # get the english words (the keys in the dictionary) and store in a set()
    english_set = set(english_vecs.keys())

    # get the french words (keys in the dictionary) and store in a set()
    french_set = set(french_vecs.keys())

    # store the french words that are part of the english-french dictionary (these are the values of the dictionary)
    french_words = set(en_fr.values())

    # loop through all english, french word pairs in the english french dictionary
    for en_word, fr_word in en_fr.items():

        # check that the french word has an embedding and that the english word has an embedding
        if fr_word in french_set and en_word in english_set:

            # get the english embedding
            en_vec = english_vecs[en_word]

            # get the french embedding
            fr_vec = french_vecs[fr_word]

            # add the english embedding to the list
            X_l.append(en_vec)

            # add the french embedding to the list
            Y_l.append(fr_vec)

    # stack the vectors of X_l into a matrix X
    X = np.vstack(X_l)

    # stack the vectors of Y_l into a matrix Y
    Y = np.vstack(Y_l)
    ### END CODE HERE ###

    return X, Y

Now we will use the function `get_matrices()` to build the training matrices `X_train` and `Y_train`,
which hold the English and French word embeddings, respectively.

In [78]:
# UNQ_C2 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# You do not have to input any code in this cell, but it is relevant to grading, so please do not change anything

# getting the training set:
X_train, Y_train = get_matrices(
    en_fr_train, fr_embeddings_subset, en_embeddings_subset)
X_train, Y_train

(array([[ 0.08007812,  0.10498047,  0.04980469, ...,  0.00366211,
          0.04760742, -0.06884766],
        [ 0.02600098, -0.00189209,  0.18554688, ..., -0.12158203,
          0.22167969, -0.02197266],
        [-0.01177979, -0.04736328,  0.04467773, ...,  0.07128906,
         -0.03491211,  0.02416992],
        ...,
        [-0.17089844,  0.17871094, -0.06494141, ..., -0.10644531,
         -0.31640625, -0.09326172],
        [-0.21875   ,  0.09179688,  0.03637695, ..., -0.015625  ,
         -0.27148438,  0.14941406],
        [-0.00418091,  0.0703125 , -0.04516602, ..., -0.16015625,
          0.09326172, -0.15039062]], dtype=float32),
 array([[-0.0061825 , -0.00094387, -0.00882648, ...,  0.111644  ,
         -0.0503964 , -0.0603421 ],
        [-0.0341354 ,  0.042414  , -0.0656882 , ..., -0.0539992 ,
          0.0371097 , -0.0433599 ],
        [ 0.0426481 ,  0.0395683 , -0.00825683, ...,  0.0295259 ,
          0.0713421 ,  0.0626402 ],
        ...,
        [ 0.0903279 , -0.108363  , -0.0

<a name="2"></a>

# 2. Translations

<div style="width:image width px; font-size:100%; text-align:center;"><img src='https://github.com/martin-fabbri/colab-notebooks/raw/master/deeplearning.ai/nlp/images/e_to_f.jpg' alt="alternate text" width="width" height="height" style="width:700px;height:200px;" /> Figure 1 </div>

Write a program that translates English words to French words using word embeddings and vector space models. 

<a name="2-1"></a>
## 2.1 Translation as linear transformation of embeddings

Given dictionaries of English and French word embeddings, you will create a transformation matrix `R`.
* Given an English word embedding, $\mathbf{e}$, you can multiply $\mathbf{eR}$ to get a new word embedding $\mathbf{f}$.
    * Both $\mathbf{e}$ and $\mathbf{f}$ are [row vectors](https://en.wikipedia.org/wiki/Row_and_column_vectors).
* You can then compute the nearest neighbors to $\mathbf{f}$ in the French embeddings and recommend the word that is most similar to the transformed word embedding.
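
As a rough illustration of these two steps, the sketch below transforms a toy row vector and picks the closest French vector by cosine similarity. The rotation matrix `R_toy`, the 2-dimensional vectors and the helper `cosine` are assumptions made up for this example, not the matrix you will learn later.

```python
import numpy as np

def cosine(u, v):
    # cosine similarity between two 1-D vectors
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Assumed toy transformation: a 90-degree rotation in 2 dimensions.
R_toy = np.array([[0.0, 1.0],
                  [-1.0, 0.0]])

e = np.array([1.0, 0.0])   # toy English embedding (row vector)
f = np.dot(e, R_toy)       # transformed embedding, here [0, 1]

# Hypothetical French embeddings to search over.
toy_french = {"chat": np.array([0.1, 0.9]),
              "chien": np.array([0.9, 0.1])}

# nearest neighbor = the French word whose vector is most similar to f
best_word = max(toy_french, key=lambda w: cosine(f, toy_french[w]))
print(best_word)  # chat
```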

### Describing translation as the minimization problem

Find a matrix `R` that minimizes the following expression.

$$\arg \min _{\mathbf{R}}\| \mathbf{X R} - \mathbf{Y}\|_{F}\tag{1} $$

### Frobenius norm

The Frobenius norm of a matrix $A$ (assuming it is of dimension $m,n$) is defined as the square root of the sum of the absolute squares of its elements:

$$\|\mathbf{A}\|_{F} \equiv \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n}\left|a_{i j}\right|^{2}}\tag{2}$$
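
A quick numerical check of definition (2), using a small matrix chosen only for illustration. It compares the formula written out by hand with `np.linalg.norm` called with `ord='fro'`.

```python
import numpy as np

A = np.array([[2.0, 2.0],
              [2.0, 2.0]])

# definition (2): square root of the sum of the squared absolute values
frob_manual = np.sqrt(np.sum(np.abs(A) ** 2))

# NumPy's built-in Frobenius norm
frob_numpy = np.linalg.norm(A, ord='fro')

print(frob_manual, frob_numpy)  # 4.0 4.0
```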

### Actual loss function
In real-world applications, the Frobenius norm loss:

$$\| \mathbf{XR} - \mathbf{Y}\|_{F}$$

is often replaced by its squared value divided by $m$:

$$ \frac{1}{m} \|  \mathbf{X R} - \mathbf{Y} \|_{F}^{2}$$

where $m$ is the number of examples (rows in $\mathbf{X}$).

* Minimizing this loss function yields the same `R` as minimizing the original Frobenius norm.
* The reason for taking the square is that it's easier to compute the gradient of the squared Frobenius norm.
* The reason for dividing by $m$ is that we're more interested in the average loss per embedding than the loss for the entire training set.
    * The total loss over the training set increases with more words (training examples),
    so taking the average helps us to track the average loss regardless of the size of the training set.
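
The claim that both losses share the same minimizer can be checked on a toy problem. The sketch below is an illustration under simplifying assumptions: `R` is reduced to a single scalar `r`, the data are random, and the minimum is found by a simple grid search rather than gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.random((6, 1))
Y_toy = 2.0 * X_toy          # by construction, the best scalar r is 2
m = X_toy.shape[0]

candidates = np.linspace(0.0, 4.0, 41)   # grid of candidate values for r

# loss 1: plain Frobenius norm; loss 2: squared norm divided by m
norm_loss = [np.linalg.norm(X_toy * r - Y_toy) for r in candidates]
scaled_loss = [np.linalg.norm(X_toy * r - Y_toy) ** 2 / m for r in candidates]

best_norm = candidates[int(np.argmin(norm_loss))]
best_scaled = candidates[int(np.argmin(scaled_loss))]
print(best_norm, best_scaled)  # both are (approximately) 2.0
```

The loss values differ, but the location of the minimum does not.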

##### [Optional] Detailed explanation why we use norm squared instead of the norm:
<details>
<summary>
    Click for optional details
</summary>
    <p>
        <ul>
            <li>The norm is always nonnegative (we're summing up absolute values), and so is the square. 
            <li> When we take the square of all non-negative (positive or zero) numbers, the order of the data is preserved.  
            <li> For example, if 3 > 2, 3^2 > 2^2
            <li> Using the norm or squared norm in gradient descent results in the same <i>location</i> of the minimum.
            <li> Squaring cancels the square root in the Frobenius norm formula. Because of the <a href="https://en.wikipedia.org/wiki/Chain_rule"> chain rule</a>, we would have to do more calculations if we had a square root in our expression for summation.
            <li> Dividing the function value by the positive number doesn't change the optimum of the function, for the same reason as described above.
            <li> We're interested in transforming English embedding into the French. Thus, it is more important to measure average loss per embedding than the loss for the entire dictionary (which increases as the number of words in the dictionary increases).
        </ul>
    </p>
</details>

<a name="ex-02"></a>

### Exercise 02: Implementing translation mechanism described in this section.

#### Step 1: Computing the loss
* The loss function will be the squared Frobenius norm of the difference between
the matrix `Y` and its approximation `XR`, divided by the number of training examples $m$.
* Its formula is:
$$ L(X, Y, R)=\frac{1}{m}\sum_{i=1}^{m} \sum_{j=1}^{n}\left( a_{i j} \right)^{2}$$

where $a_{i j}$ is the value in the $i$th row and $j$th column of the matrix $\mathbf{XR}-\mathbf{Y}$.
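
A worked example of this formula on tiny matrices, chosen by hand for illustration so that the arithmetic can be followed: only one entry of $\mathbf{XR}-\mathbf{Y}$ is non-zero.

```python
import numpy as np

# Hand-picked toy matrices (m = 2 examples, n = 2 dimensions).
X_toy = np.array([[1.0, 0.0],
                  [0.0, 1.0]])
Y_toy = np.array([[1.0, 1.0],
                  [0.0, 1.0]])
R_toy = np.eye(2)

diff = np.dot(X_toy, R_toy) - Y_toy   # [[0, -1], [0, 0]]
m = X_toy.shape[0]

# sum of squared entries is 1, divided by m = 2
loss_toy = np.sum(diff ** 2) / m
print(loss_toy)  # 0.5
```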

#### Instructions: complete the `compute_loss()` function

* Compute the approximation of `Y` by matrix multiplying `X` and `R`
* Compute difference `XR - Y`
* Compute the squared Frobenius norm of the difference and divide it by $m$.

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<p>
<ul>
   <li> Useful functions:
       <a href="https://docs.scipy.org/doc/numpy/reference/generated/numpy.dot.html">Numpy dot </a>,
       <a href="https://docs.scipy.org/doc/numpy/reference/generated/numpy.sum.html">Numpy sum</a>,
       <a href="https://docs.scipy.org/doc/numpy/reference/generated/numpy.square.html">Numpy square</a>,
       <a href="https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.norm.html">Numpy norm</a>
    </li>
   <li> Be careful about which operation is elementwise and which operation is a matrix multiplication.</li>
   <li> Try to use matrix operations instead of the numpy norm function.  If you choose to use the norm function, check its extra arguments and remember that it returns the norm itself, so you need to square the result to get the squared loss.</li>

</ul>
</p>
</details>

In [79]:
# UNQ_C3 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def compute_loss(X, Y, R):
    '''
    Inputs: 
        X: a matrix of dimension (m,n) where the rows are the English embeddings.
        Y: a matrix of dimension (m,n) where the rows correspond to the French embeddings.
        R: a matrix of dimension (n,n) - transformation matrix from English to French vector space embeddings.
    Outputs:
        L: a scalar - the value of the loss function for given X, Y and R.
    '''
    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
    # m is the number of rows in X
    m = len(X)
    
    # diff is XR - Y
    diff = np.dot(X, R) - Y

    # diff_squared is the element-wise square of the difference
    diff_squared = np.square(diff)

    # sum_diff_squared is the sum of the squared elements
    sum_diff_squared = np.sum(diff_squared)

    # loss is sum_diff_squared divided by the number of examples (m)
    loss = sum_diff_squared / m
    ### END CODE HERE ###
    return loss

<a name="ex-03"></a>

### Exercise 03

### Step 2: Computing the gradient of the loss with respect to the transform matrix R

* Calculate the gradient of the loss with respect to the transform matrix `R`.
* The gradient is a matrix that encodes how much a small change in `R`
affects the loss function.
* The negative of the gradient gives us the direction in which we should change `R`
to decrease the loss.
* $m$ is the number of training examples (number of rows in $X$).
* The formula for the gradient of the loss function $L(X,Y,R)$ is:

$$\frac{d}{dR}L(X,Y,R)=\frac{d}{dR}\Big(\frac{1}{m}\| X R -Y\|_{F}^{2}\Big) = \frac{2}{m}X^{T} (X R - Y)$$
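
One way to gain confidence in this formula is a finite-difference check: nudge each entry of `R` slightly and compare the resulting change in the loss with the analytic gradient. The sketch below does this on small random matrices; the helper names `loss_fn` and `analytic_grad` and the step size `eps` are assumptions made for this illustration.

```python
import numpy as np

def loss_fn(X, Y, R):
    # (1/m) * squared Frobenius norm of XR - Y
    return np.sum((X @ R - Y) ** 2) / X.shape[0]

def analytic_grad(X, Y, R):
    # (2/m) * X^T (XR - Y)
    return (2.0 / X.shape[0]) * (X.T @ (X @ R - Y))

rng = np.random.default_rng(0)
X_toy = rng.random((4, 3))
Y_toy = rng.random((4, 3))
R_toy = rng.random((3, 3))

eps = 1e-6
numeric_grad = np.zeros_like(R_toy)
for i in range(R_toy.shape[0]):
    for j in range(R_toy.shape[1]):
        step = np.zeros_like(R_toy)
        step[i, j] = eps
        # central difference approximation of dL/dR[i, j]
        numeric_grad[i, j] = (loss_fn(X_toy, Y_toy, R_toy + step)
                              - loss_fn(X_toy, Y_toy, R_toy - step)) / (2 * eps)

max_err = np.max(np.abs(numeric_grad - analytic_grad(X_toy, Y_toy, R_toy)))
print(max_err < 1e-6)  # True
```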

**Instructions**: Complete the `compute_gradient` function below.

<details>
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<p>
    <ul>
    <li><a href="https://docs.scipy.org/doc/numpy/reference/generated/numpy.matrix.T.html" > Transposing in numpy </a></li>
    <li><a href="https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.shape.html" > Finding out the dimensions</a> of matrices in numpy </li>
    <li>Remember to use numpy.dot for matrix multiplication </li>
    </ul>
</p>
</details>

In [80]:
# UNQ_C4 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def compute_gradient(X, Y, R):
    '''
    Inputs: 
        X: a matrix of dimension (m,n) where the rows are the English embeddings.
        Y: a matrix of dimension (m,n) where the rows correspond to the French embeddings.
        R: a matrix of dimension (n,n) - transformation matrix from English to French vector space embeddings.
    Outputs:
        g: a matrix of dimension (n,n) - gradient of the loss function L for given X, Y and R.
    '''
    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
    # m is the number of rows in X
    m = len(X)

    # gradient is X^T(XR - Y) * 2/m
    gradient = 2/m * np.dot(X.T, np.dot(X, R) - Y)
    ### END CODE HERE ###
    return gradient

### Step 3: Finding the optimal R with gradient descent algorithm

#### Gradient descent

[Gradient descent](https://ml-cheatsheet.readthedocs.io/en/latest/gradient_descent.html) is an iterative algorithm which is used in searching for the optimum of the function. 
* Earlier, we mentioned that the gradient of the loss with respect to the matrix encodes how much a tiny change in some coordinate of that matrix affects the loss function.
* Gradient descent uses that information to iteratively change matrix `R` until we reach a point where the loss is minimized. 

#### Training with a fixed number of iterations

Most of the time we iterate for a fixed number of training steps rather than iterating until the loss falls below a threshold.

##### OPTIONAL: explanation for fixed number of iterations
<details>
<summary>
    <font size="3" color="darkgreen"><b>click here for detailed discussion</b></font>
</summary>
<p>
<ul>
    <li> You cannot rely on training loss getting low -- what you really want is the validation loss to go down, or validation accuracy to go up. And indeed - in some cases people train until validation accuracy reaches a threshold, or -- commonly known as "early stopping" -- until the validation accuracy starts to go down, which is a sign of over-fitting.
    </li>
    <li>
    Why not always do "early stopping"? Well, mostly because well-regularized models on larger data-sets never stop improving. Especially in NLP, you can often continue training for months and the model will continue getting slightly and slightly better. This is also the reason why it's hard to just stop at a threshold -- unless there's an external customer setting the threshold, why stop, where do you put the threshold?
    </li>
    <li>Stopping after a certain number of steps has the advantage that you know how long your training will take - so you can keep some sanity and not train for months. You can then try to get the best performance within this time budget. Another advantage is that you can fix your learning rate schedule -- e.g., lower the learning rate when 10% of the training steps remain, and then lower it again when 1% remain. Such learning rate schedules help a lot, but are harder to do if you don't know how long you're training.
    </li>
</ul>
</p>
</details>

Pseudocode:
1. Calculate gradient $g$ of the loss with respect to the matrix $R$.
2. Update $R$ with the formula:
$$R_{\text{new}}= R_{\text{old}}-\alpha g$$

Where $\alpha$ is the learning rate, which is a scalar.
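
The update rule can be tried out on a one-dimensional stand-in before applying it to matrices. In this sketch the function $f(r) = (r-3)^2$, the starting point and the value of `alpha` are assumptions picked for illustration; the same two steps (compute gradient, update) are repeated for a fixed number of iterations.

```python
def grad_f(r):
    # derivative of the toy function f(r) = (r - 3)^2
    return 2.0 * (r - 3.0)

r = 0.0        # starting value
alpha = 0.1    # learning rate

for _ in range(100):
    g = grad_f(r)        # step 1: compute the gradient
    r = r - alpha * g    # step 2: r_new = r_old - alpha * g

print(round(r, 4))  # 3.0, the minimizer of f
```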

#### Learning rate

* The learning rate or "step size" $\alpha$ is a coefficient which decides how much we want to change $R$ in each step.
* If we change $R$ too much, we could skip the optimum by taking too large of a step.
* If we make only small changes to $R$, we will need many steps to reach the optimum.
* Learning rate $\alpha$ is used to control those changes.
* Values of $\alpha$ are chosen depending on the problem, and we'll use `learning_rate`$=0.0003$ as the default value for our algorithm.
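
The effect of the step size can be seen on the same kind of toy problem. The sketch below runs gradient descent on the assumed function $f(r) = (r-3)^2$ with three illustrative learning rates; the helper `run_gd` and the specific values are hypothetical and unrelated to the defaults used in this assignment.

```python
def run_gd(alpha, steps=20, r0=0.0):
    # gradient descent on f(r) = (r - 3)^2, whose gradient is 2 * (r - 3)
    r = r0
    for _ in range(steps):
        r = r - alpha * 2.0 * (r - 3.0)
    return r

r_small = run_gd(0.01)  # tiny steps: still far from the optimum after 20 steps
r_good = run_gd(0.4)    # moderate steps: very close to 3
r_large = run_gd(1.1)   # steps too large: overshoots further each time and diverges

print(abs(r_small - 3.0), abs(r_good - 3.0), abs(r_large - 3.0))
```

With the same number of steps, only the moderate learning rate ends up near the optimum.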

<a name="ex-04"></a>

### Exercise 04

#### Instructions: Implement `align_embeddings()`

<details>
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<p>
<ul>
    <li>Use the 'compute_gradient()' function to get the gradient in each step</li>

</ul>
</p>
</details>

In [84]:
# UNQ_C5 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def align_embeddings(X, Y, train_steps=100, learning_rate=0.0003):
    '''
    Inputs:
        X: a matrix of dimension (m,n) where the rows are the English embeddings.
        Y: a matrix of dimension (m,n) where the rows correspond to the French embeddings.
        train_steps: positive int - the number of steps the gradient descent algorithm will take.
        learning_rate: positive float - the size of each gradient descent step.
    Outputs:
        R: a matrix of dimension (n,n) - the projection matrix that minimizes the F norm ||X R -Y||^2
    '''
    np.random.seed(129)

    # the number of columns in X is the number of dimensions for a word vector (e.g. 300)
    # R is a square matrix with length equal to the number of dimensions in the word embedding
    R = np.random.rand(X.shape[1], X.shape[1])

    for i in range(train_steps):
        if i % 25 == 0:
            print(f"loss at iteration {i} is: {compute_loss(X, Y, R):.4f}")
        ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
        # use the function that you defined to compute the gradient
        gradient = compute_gradient(X, Y, R)

        # update R by subtracting the learning rate times gradient
        R = R - learning_rate * gradient 
        ### END CODE HERE ###
    return R

In [82]:
# UNQ_C6 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# You do not have to input any code in this cell, but it is relevant to grading, so please do not change anything

# Testing your implementation.
np.random.seed(129)
m = 10
n = 5
X = np.random.rand(m, n)
Y = np.random.rand(m, n) * .1
R = align_embeddings(X, Y)

loss at iteration 0 is: 3.7242
loss at iteration 25 is: 3.6283
loss at iteration 50 is: 3.5350
loss at iteration 75 is: 3.4442


**Expected Output:**
```
loss at iteration 0 is: 3.7242
loss at iteration 25 is: 3.6283
loss at iteration 50 is: 3.5350
loss at iteration 75 is: 3.4442
```

## Calculate transformation matrix R

Using the training set, find the transformation matrix $\mathbf{R}$ by calling the function `align_embeddings()`.

**NOTE:** The code cell below will take a few minutes to fully execute (~3 mins)

In [83]:
# UNQ_C7 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# You do not have to input any code in this cell, but it is relevant to grading, so please do not change anything
R_train = align_embeddings(X_train, Y_train, train_steps=400, learning_rate=0.8)

loss at iteration 0 is: 963.0146
loss at iteration 25 is: 97.8292
loss at iteration 50 is: 26.8329
loss at iteration 75 is: 9.7893
loss at iteration 100 is: 4.3776
loss at iteration 125 is: 2.3281
loss at iteration 150 is: 1.4480
loss at iteration 175 is: 1.0338
loss at iteration 200 is: 0.8251
loss at iteration 225 is: 0.7145
loss at iteration 250 is: 0.6534
loss at iteration 275 is: 0.6185
loss at iteration 300 is: 0.5981
loss at iteration 325 is: 0.5858
loss at iteration 350 is: 0.5782
loss at iteration 375 is: 0.5735


##### Expected Output

```
loss at iteration 0 is: 963.0146
loss at iteration 25 is: 97.8292
loss at iteration 50 is: 26.8329
loss at iteration 75 is: 9.7893
loss at iteration 100 is: 4.3776
loss at iteration 125 is: 2.3281
loss at iteration 150 is: 1.4480
loss at iteration 175 is: 1.0338
loss at iteration 200 is: 0.8251
loss at iteration 225 is: 0.7145
loss at iteration 250 is: 0.6534
loss at iteration 275 is: 0.6185
loss at iteration 300 is: 0.5981
loss at iteration 325 is: 0.5858
loss at iteration 350 is: 0.5782
loss at iteration 375 is: 0.5735