# Spectral_decomposition_driver_notebook

## Overview
The Isomantics algorithm consists of the following stages:
### **(Stage 0 and 1)** Prepare vocabulary:
* Currently prepared languages:
    1. English
    2. Russian
    3. German
    4. French
    5. Italian
    6. Chinese 
    

### **(Stage 2)** Train translation matrices:

* Training set:
    * For two given languages $Lg_1$ and $Lg_2$, we create a training set $\Omega_{(Lg_1,Lg_2)}$ as follows:
        1. For each $word_i$ in language 1, find the direct translation $\widehat{word_i}$ in language 2.
        2. Find vector embeddings $w_i\in Lg_1$ and $\widehat{w_i}\in Lg_2$ of $word_i$ and $\widehat{word_i}$ respectively.
        3. Add the pair $<w_i,\widehat{w_i}>$ to the training set $\Omega_{(Lg_1,Lg_2)}$
            * **Note:** We found that training on only the top 5-10k most popular terms in $\Omega_{(Lg_1,Lg_2)}$ generates the best word-to-word translation results on out-of-sample test sets.
* Building the Cost function:
    * Loss function for the learning process:
        * $ Loss(T_{Lg_1,Lg_2})= \sum_{<w_i,\widehat{w_i}>\in\Omega_{(Lg_1,Lg_2)}} ||T_{Lg_1,Lg_2}w_i - \widehat{w_i}||^2_2 $
    * Regularization terms:
        * Overfitting regularizer:
            * $Reg_{Frobenius}(T_{Lg_1,Lg_2}) = ||T_{Lg_1,Lg_2}||_F$
        * Normality Regularizer:
            * $Reg_{Normality}(T_{Lg_1,Lg_2}) = ||T_{Lg_1,Lg_2}^{T}T_{Lg_1,Lg_2} - T_{Lg_1,Lg_2}T_{Lg_1,Lg_2}^T||_2$
                * **Note:** A matrix satisfying $T^{T}T = TT^{T}$ is normal and therefore diagonalizable. The Normality Regularizer pushes the trained matrix toward this property; it does not guarantee it exactly.
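
The training-set construction in steps 1-3 above can be sketched as follows. This is only an illustration, not the repository's code: the function name `build_training_pairs`, the dictionary-based inputs, and the toy 2-dimensional vectors are all assumptions made for the example.

```python
import numpy as np

def build_training_pairs(translations, src_vectors, tgt_vectors, max_pairs=5000):
    """Collect embedding pairs <w_i, w_hat_i> for one language pair.

    translations: iterable of (source_word, target_word), assumed to be
                  ordered from most to least popular term.
    src_vectors, tgt_vectors: dict mapping word -> 1-D numpy array.
    Returns (X, Y) where row i of X is w_i and row i of Y is w_hat_i.
    """
    X, Y = [], []
    for src_word, tgt_word in translations:
        # Keep a pair only if both words have an embedding.
        if src_word in src_vectors and tgt_word in tgt_vectors:
            X.append(src_vectors[src_word])
            Y.append(tgt_vectors[tgt_word])
        if len(X) == max_pairs:
            break
    return np.array(X), np.array(Y)

# Toy embeddings, made up for illustration only.
en = {"cat": np.array([1.0, 0.0]), "dog": np.array([0.0, 1.0])}
ru = {"кошка": np.array([0.9, 0.1]), "собака": np.array([0.1, 0.9])}
pairs = [("cat", "кошка"), ("dog", "собака"), ("bird", "птица")]

X, Y = build_training_pairs(pairs, en, ru)
print(X.shape, Y.shape)  # (2, 2) (2, 2): "bird" is dropped, it has no embedding
```

The `max_pairs` cap mirrors the note about restricting training to the top few thousand terms.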


#### Full cost function:
$$ J(T_{Lg_1,Lg_2})= Loss(T_{Lg_1,Lg_2}) + \lambda_{1}Reg_{Frobenius}(T_{Lg_1,Lg_2}) + \lambda_{2}Reg_{Normality}(T_{Lg_1,Lg_2}) $$  
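
As a rough sketch, the full cost can be evaluated with NumPy as shown below. Several details are assumptions rather than the project's implementation: the loss is summed over all training pairs, both regularizers use the Frobenius norm, and the function name `cost` and the default values of `lam1` and `lam2` are chosen only for this example.

```python
import numpy as np

def cost(T, W, W_hat, lam1=0.1, lam2=0.1):
    """J(T) = Loss + lam1 * Reg_Frobenius + lam2 * Reg_Normality.

    W, W_hat: arrays of shape (n_pairs, dim); row i holds w_i and its
    translation w_hat_i. T has shape (dim, dim).
    """
    residual = W @ T.T - W_hat            # row i equals T w_i - w_hat_i
    loss = np.sum(residual ** 2)          # squared L2 error, summed over pairs
    reg_frobenius = np.linalg.norm(T)     # Frobenius norm (default for 2-D input)
    reg_normality = np.linalg.norm(T.T @ T - T @ T.T)
    return loss + lam1 * reg_frobenius + lam2 * reg_normality

W = np.array([[1.0, 0.0], [0.0, 1.0]])
identity = np.eye(2)                          # normal matrix, maps W onto itself
shear = np.array([[1.0, 1.0], [0.0, 1.0]])    # not a normal matrix

print(cost(identity, W, W))  # loss and normality penalty are both zero here
print(cost(shear, W, W))     # picks up a loss term and a normality penalty
```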

### **(Stage 3)** Translation Spectral Analysis:
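* Factor the matrix $T_{Lg_1,Lg_2} = U\Sigma V^T$ (singular value decomposition), where $U$ and $V$ are orthogonal matrices (rotations, possibly combined with reflections) and the diagonal of $\Sigma$ holds the singular values of $T_{Lg_1,Lg_2}$, or the "*Translation spectrum*". For a normal matrix, the singular values are the absolute values of its eigenvalues.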
* Run a statistical analysis of the spectral values associated with each pair of languages.
    1. mean
    2. median
    3. max value
    4. min value
    5. standard deviation

* Compare the statistical spectral analysis across different language pairs.
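
A minimal sketch of the spectral statistics listed above, using `numpy.linalg.svd`. The function name `spectrum_stats` and the diagonal test matrix are illustrative assumptions; the project's own analysis script may compute these differently.

```python
import numpy as np

def spectrum_stats(T):
    """Summary statistics of the singular values of a translation matrix."""
    # compute_uv=False returns only the singular values, in descending order.
    s = np.linalg.svd(T, compute_uv=False)
    return {
        "mean": float(s.mean()),
        "median": float(np.median(s)),
        "max": float(s.max()),
        "min": float(s.min()),
        "std": float(s.std()),  # population standard deviation (ddof=0)
    }

# A diagonal matrix makes the spectrum easy to read off: 3, 2, 1.
T = np.diag([3.0, 2.0, 1.0])
stats = spectrum_stats(T)
print(stats)
```

Computing one such dictionary per language pair gives a table that can be compared across pairs.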
***

# Stages of Isomantics:

***

# Stage 0:

### Download the files required for Isomantics.

The steps in **Stage 1** are used to:

  1. Authenticate the Google Drive API, which is used for translating words from one language to another.
  2. Create pickle files of the vocabulary and the corresponding vectors for each language.
  3. Translate one language's vocab into another and create pickle files for the lg1 -> lg2 translations
  (e.g. en_en.pkl, en_ru.pkl, ...).
  
**Note: For the ease of running experiments and testing, pickle files for vocabs, vectors and translations have already been created.**

* **Go to 1...** if you wish to create the pickles yourself from the downloaded vectors. (This takes a long time.)
* **Go to 2...** if you wish to skip creating your own pickle translation files. (Recommended)

    1. To download the pre-trained fastText embeddings:
          * Go to [facebookresearch/fasttext](https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md)
          * Click the *text* link corresponding to each language and save the '.vec' file to the '/code/fasttext' directory.
          
          </br>
          **OR**
          </br>
          
          * The vectors can also be downloaded directly from the following links:
              * [English](https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.en.vec)
              * [Russian](https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.ru.vec)
              * [German](https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.de.vec)
              * [French](https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.fr.vec)
              * [Italian](https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.it.vec)
              * [Chinese](https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.zh.vec)
          * Go to **Stage 1**.
          
       </br>   
    2. Download the [pickled translations](https://mega.nz/#!wkRlQAAQ!bkqHMKfreAgo8jVJQywoAWOxjXHfM63WbfNx3nYHnQ4), then save and unzip the folder in the root directory. (Don't change the name of the unzipped folder.)  
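
For option 1, the downloaded '.vec' files are plain text: a header line with the word count and the embedding dimension, followed by one word and its vector per line. The sketch below reads that format; the function name `load_vec` and the `max_words` parameter are assumptions for illustration and are not taken from the repository's scripts.

```python
import io
import numpy as np

def load_vec(stream, max_words=None):
    """Read fastText text format: 'count dim' header, then 'word v1 ... v_dim'."""
    _, dim = map(int, stream.readline().split())
    vocab, rows = [], []
    for line in stream:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            continue  # skip malformed lines, e.g. tokens that contain spaces
        vocab.append(word)
        rows.append(np.asarray([float(v) for v in values], dtype=np.float32))
        if max_words is not None and len(vocab) >= max_words:
            break
    vectors = np.vstack(rows) if rows else np.empty((0, dim), dtype=np.float32)
    return vocab, vectors

# A tiny in-memory stand-in for a real '.vec' file.
sample = io.StringIO("3 2\nthe 0.1 0.2\nof 0.3 0.4\nand 0.5 0.6\n")
vocab, vectors = load_vec(sample, max_words=2)
print(vocab, vectors.shape)  # ['the', 'of'] (2, 2)
```

With a real download, the same function can be given a file opened with `open(path, encoding="utf-8")`, where `path` points to wherever the '.vec' file was saved.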
***

# Stage 1:


**1. If you've chosen to download the vectors, follow the steps below to create your own .pkl translation files; otherwise, go to 2.**



1. To authenticate the gdrive API:
    * Go to the [Google Drive API Guide](https://developers.google.com/drive/v3/web/quickstart/python) and follow step 1 to turn on the gdrive API for your project.
    * **Note**: Download the JSON file in step h) and save it as **client_secrets.json** in the root directory.
2. Change directory to **/code** and run the following command in the terminal to create a **gauth.yml** file:

    ``` 
    $ python3 gauth.py 
    ```

3. Run **vocab_vectors.py** to pickle the vocab and vector objects.

    ```
    $ python3 vocab_vectors.py
    ```
4. Run **build_translations.py** to create pickle translations for English to English.

    ```
    $ python3 build_translations.py en en
    ```
* Open a new tab in the terminal and run the translation script for another language pair, e.g. English to Spanish:

    ```
    $ python3 build_translations.py en es
    ```
        .........

        .........

    * and so on for all possible combinations of languages.

    </br>
        
#### 2. If you've chosen to download the pickle files from the link mentioned, go directly to **Stage 2**        
        
        

</br>
***

# (Stage 2) 

1. Train translation matrices:

 * The vocab and the vector embeddings are located in the **'/pickle'** folder.
 * Change directory to **'/code'** to train the translation matrices.
 * The following code trains the translation matrices and exports them as .csv files to the experiment's folder under **'/data'**.
 * Run the following code with the appropriate flags.

```
# specify the name of the experiment with the '--e' flag
# specify the name of the model regularizer with the '--m' flag
# possible values for model regularizers: 'l2', 'l3_l2', 'l3'

! python3 isomantics_train_translations.py --e exp_l2_T --m l2
```

```
Using TensorFlow backend.
Creating Translation Matrices for....:

en->en

length of en-en vocab: 49999
en->ru

length of en-ru vocab: 15610
^C
```


# (Stage 3) Translation Spectral Analysis

 * This code imports the translation matrices from '/data/exp-name/T/'.
 * It then performs spectral analysis on the T matrices and exports the data for plotting heatmaps.

```
! python3 isomantics_statistical_analysis.py
```

```
Using TensorFlow backend.
Creating Translation Matrices for....:

Exporting DataFrames for SVD Heatmaps as .CSV files
```


# (Stage 4) Create SVD Heatmaps

```
! jupyter notebook SVD_heatmaps.ipynb
```

# Translation Matrix Results  
## En to Ru Fasttext_Random  
- En Vocabulary Size = 1,259,685  
- En Embedding Length = 300  
- Ru Vocabulary Size = 944,211  
- Ru Embedding Length = 300  
- Train Size = 5,000  
- Test Size = 1,500  
- <b>Test Accuracy = 3.9%</b>  

#### Test L2 Norms  
- X_norm: L2 norms for En test vectors  
- y_norm: L2 norms for Ru test vectors  
- yhat_norm: L2 norms for X.dot(T) test vectors (T = translation matrix)  
- yhat_neighbor norm: L2 norms for the nearest neighbor to X.dot(T) in the y test vectors  
![](../images/en_ru_fasttext_random_T_norm.png)  
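
The quantities listed above can be computed along the following lines. This is a sketch under stated assumptions, not the code that produced the reported numbers: nearest neighbors are found by cosine similarity, test accuracy is taken to be the share of test words whose nearest neighbor is the correct translation, and the function name `evaluate` and the key `yhat_neighbor_norm` are made up for the example. It also assumes no vector has zero norm.

```python
import numpy as np

def evaluate(X, y, T):
    """X, y: (n_test, dim) arrays; row i of y is the true translation of row i of X."""
    yhat = X.dot(T)  # predicted target-language vectors
    # Cosine similarity between every prediction and every candidate in y.
    yhat_unit = yhat / np.linalg.norm(yhat, axis=1, keepdims=True)
    y_unit = y / np.linalg.norm(y, axis=1, keepdims=True)
    nearest = np.argmax(yhat_unit @ y_unit.T, axis=1)
    return {
        "X_norm": np.linalg.norm(X, axis=1),
        "y_norm": np.linalg.norm(y, axis=1),
        "yhat_norm": np.linalg.norm(yhat, axis=1),
        "yhat_neighbor_norm": np.linalg.norm(y[nearest], axis=1),
        "accuracy": float(np.mean(nearest == np.arange(len(X)))),
    }

# Toy data: T swaps the two coordinates, and y is a rescaled swap of X.
X = np.array([[1.0, 0.0], [0.0, 2.0]])
y = np.array([[0.0, 3.0], [4.0, 0.0]])
T = np.array([[0.0, 1.0], [1.0, 0.0]])

result = evaluate(X, y, T)
print(result["accuracy"])            # 1.0
print(result["yhat_norm"])           # norms of the predictions: 1 and 2
print(result["yhat_neighbor_norm"])  # norms of the retrieved y vectors: 3 and 4
```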

#### Translation Matrix Isotropy  
- Isotropy = 32.3%  
![](../images/en_ru_fasttext_random_T_isotropy.png)  

## En to Ru Fasttext_Top  
- En Vocabulary Size = 1,259,685  
- En Embedding Length = 300  
- Ru Vocabulary Size = 944,211  
- Ru Embedding Length = 300  
- Train Size = 5,000  
- Test Size = 1,500  
- <b>Test Accuracy = 46.3%</b>  

#### Test L2 Norms  
- X_norm: L2 norms for En test vectors  
- y_norm: L2 norms for Ru test vectors  
- yhat_norm: L2 norms for X.dot(T) test vectors (T = translation matrix)  
- yhat_neighbor norm: L2 norms for the nearest neighbor to X.dot(T) in the y test vectors  
![](../images/en_ru_fasttext_top_T_norm.png)  

#### Translation Matrix Isotropy  
- Isotropy = 38.2%  
![](../images/en_ru_fasttext_top_T_isotropy.png)  

## En to De Fasttext_Random  
- En Vocabulary Size = 1,259,685  
- En Embedding Length = 300  
- De Vocabulary Size = 1,137,616  
- De Embedding Length = 300  
- Train Size = 5,000  
- Test Size = 1,500  
- <b>Test Accuracy = 21.9%</b>  

#### Test L2 Norms  
- X_norm: L2 norms for En test vectors  
- y_norm: L2 norms for De test vectors  
- yhat_norm: L2 norms for X.dot(T) test vectors (T = translation matrix)  
- yhat_neighbor norm: L2 norms for the nearest neighbor to X.dot(T) in the y test vectors  
![](../images/en_de_fasttext_random_T_norm.png)  

#### Translation Matrix Isotropy  
- Isotropy = 35.6%  
![](../images/en_de_fasttext_random_T_isotropy.png)  

## En to De Fasttext_Top  
- En Vocabulary Size = 1,259,685  
- En Embedding Length = 300  
- De Vocabulary Size = 1,137,616  
- De Embedding Length = 300  
- Train Size = 5,000  
- Test Size = 1,500  
- <b>Test Accuracy = 63.6%</b>  

#### Test L2 Norms  
- X_norm: L2 norms for En test vectors  
- y_norm: L2 norms for De test vectors  
- yhat_norm: L2 norms for X.dot(T) test vectors (T = translation matrix)  
- yhat_neighbor norm: L2 norms for the nearest neighbor to X.dot(T) in the y test vectors  
![](../images/en_de_fasttext_top_T_norm.png)  

#### Translation Matrix Isotropy  
- Isotropy = 43.4%  
![](../images/en_de_fasttext_top_T_isotropy.png)  

## En to It Zeroshot  
- En Vocabulary Size = 200,000  
- En Embedding Length = 300  
- It Vocabulary Size = 200,000  
- It Embedding Length = 300  
- Train Size = 5,000  
- Test Size = 1,869  
- <b>Test Accuracy = 27.9%</b>  

#### Test L2 Norms  
- X_norm: L2 norms for En test vectors  
- y_norm: L2 norms for It test vectors  
- yhat_norm: L2 norms for X.dot(T) test vectors (T = translation matrix)  
- yhat_neighbor norm: L2 norms for the nearest neighbor to X.dot(T) in the y test vectors  
![](../images/en_it_zeroshot_T_norm.png)  

#### Translation Matrix Isotropy  
- Isotropy = 46.6%  
![](../images/en_it_zeroshot_T_isotropy.png)  

