# Spectral_decomposition_driver_notebook

## Overview
The Isomantics algorithm consists of the following stages:
### **(Stage 0)** Prepare vocabulary
* currently prepared languages:
    1. English
    2. Russian
    3. German
    4. French
    5. Italian
    6. Chinese 
    
### **(Stage 1)** Word embeddings:
* Using `fasttext` or `word2vec`, we embed the vocabulary of each language.
### **(Stage 2)** Train translation matrices:

* Training set:
    * For two given languages $Lg_1$ and $Lg_2$, we create a training set $\Omega_{(Lg_1,Lg_2)}$ as follows:
        1. For each $word_i$ in language 1, find the direct translation $\widehat{word_i}$ in language 2.
        2. Find vector embeddings $w_i\in Lg_1$ and $\widehat{w_i}\in Lg_2$ of $word_i$ and $\widehat{word_i}$ respectively.
        3. Add the pair $<w_i,\widehat{w_i}>$ to the training set $\Omega_{(Lg_1,Lg_2)}$
            * **Note** We found that training on only the top 5-10k most popular terms in $\Omega_{(Lg_1,Lg_2)}$ generates the best word-to-word translation results on out-of-sample test sets.
* Building the Cost function:
    * Loss function for the learning process:
        * $Loss(T_{Lg_1,Lg_2}) = \sum_{<w_i,\widehat{w_i}>\in\Omega_{(Lg_1,Lg_2)}} ||T_{Lg_1,Lg_2}w_i - \widehat{w_i}||^2_2$
    * Regularization terms:
        * Overfitting Regularizer:
            * $Reg_{Frobenius}(T_{Lg_1,Lg_2}) = ||T_{Lg_1,Lg_2}||_F$
        * Normality Regularizer:
            * $Reg_{Normality}(T_{Lg_1,Lg_2}) = ||T_{Lg_1,Lg_2}^{T}T_{Lg_1,Lg_2} - T_{Lg_1,Lg_2}T_{Lg_1,Lg_2}^T||_2$
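                * **Note** The Normality Regularizer penalizes the deviation of $T_{Lg_1,Lg_2}$ from normality ($T^{T}T = TT^{T}$). A normal matrix is unitarily diagonalizable, so this term pushes the learned matrix toward being diagonalizable rather than guaranteeing it.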
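
The training-set construction described above can be sketched roughly as follows. This is a minimal illustration, not the project's own code: the helper name `build_training_pairs`, the `top_k` parameter, and the toy embeddings are all hypothetical, and the translation list is assumed to be sorted from most to least popular term.

```python
import numpy as np

def build_training_pairs(translations, lg1_dict, lg2_dict, top_k=5000):
    """Collect (w_i, w_hat_i) embedding pairs for translated word pairs.

    `translations` is assumed to be ordered from most to least popular,
    so stopping at `top_k` keeps only the most popular terms.
    """
    X, Y = [], []
    for word, word_hat in translations:
        # Keep a pair only if both words have an embedding
        if word in lg1_dict and word_hat in lg2_dict:
            X.append(lg1_dict[word])
            Y.append(lg2_dict[word_hat])
        if len(X) == top_k:
            break
    return np.array(X), np.array(Y)

# Toy 3-dimensional embeddings, purely illustrative
en = {"cat": np.array([1.0, 0.0, 0.0]), "dog": np.array([0.0, 1.0, 0.0])}
de = {"Katze": np.array([0.9, 0.1, 0.0]), "Hund": np.array([0.1, 0.9, 0.0])}
pairs = [("cat", "Katze"), ("dog", "Hund"), ("bird", "Vogel")]

X_train, Y_train = build_training_pairs(pairs, en, de, top_k=2)
print(X_train.shape, Y_train.shape)
```

Rows of `X_train` and `Y_train` with the same index form one pair $<w_i,\widehat{w_i}>$.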


#### Full cost function:
$$ J(T_{Lg_1,Lg_2})= Loss(T_{Lg_1,Lg_2}) + \lambda_{1}Reg_{Frobenius}(T_{Lg_1,Lg_2}) + \lambda_{2}Reg_{Normality}(T_{Lg_1,Lg_2}) $$  
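
As a rough numerical illustration of the cost function, the NumPy sketch below evaluates $J$ for a given matrix. It is not the training code used in this notebook (which lives in `translation_matrix`). It assumes the loss is summed over all training pairs, that both regularizers use the Frobenius norm, and that embeddings are stored one per row; the function name `cost` and the default values of `lam1` and `lam2` are hypothetical.

```python
import numpy as np

def cost(T, X, Y, lam1=0.1, lam2=0.1):
    """J(T) = Loss + lam1 * Reg_Frobenius + lam2 * Reg_Normality.

    X and Y hold one embedding per row, so applying T to every w_i
    at once is X @ T.T.
    """
    residual = X @ T.T - Y
    loss = np.sum(residual ** 2)                      # sum of squared L2 errors
    reg_frobenius = np.linalg.norm(T, "fro")          # size of T
    reg_normality = np.linalg.norm(T.T @ T - T @ T.T, "fro")  # zero iff T is normal
    return loss + lam1 * reg_frobenius + lam2 * reg_normality

X = np.eye(2)
T_identity = np.eye(2)                        # normal matrix
T_shear = np.array([[1.0, 1.0], [0.0, 1.0]])  # not normal

# With lam1 = 0 and a perfect fit, only the normality term remains
print(cost(T_identity, X, X @ T_identity.T, lam1=0.0, lam2=1.0))
print(cost(T_shear, X, X @ T_shear.T, lam1=0.0, lam2=1.0))
```

The identity matrix incurs no normality penalty, while the shear matrix does, even though both fit their targets exactly.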

### **(Stage 3)** Translation Spectral Analysis:
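* Factor the matrix $T_{Lg_1,Lg_2} = U\Sigma V^T$, where $U$ and $V$ are orthogonal matrices (rotations and/or reflections) and the diagonal of $\Sigma$ holds the singular values of $T_{Lg_1,Lg_2}$, or the "*Translation spectrum*". When $T_{Lg_1,Lg_2}$ is normal, these singular values are the absolute values of its eigenvalues.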
* Run a statistical analysis of the spectral values associated with each pair of languages.
    1. mean
    2. median
    3. max value
    4. min value
    5. standard deviation

* Compare the statistical spectral analysis across different language pairs.
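
The spectral statistics listed above can be computed with `numpy.linalg.svd`. The sketch below is illustrative only; the function name `spectrum_stats` and the toy matrix are hypothetical and are not the notebook's `SVD` or `stat_calc` helpers.

```python
import numpy as np

def spectrum_stats(T):
    """Summary statistics of the singular values of T."""
    # compute_uv=False returns only the singular values, in descending order
    s = np.linalg.svd(T, compute_uv=False)
    return {
        "mean": s.mean(),
        "median": np.median(s),
        "max": s.max(),
        "min": s.min(),
        "std": s.std(),
    }

# Toy matrix whose singular values are 3, 2 and 1
T_toy = np.diag([3.0, 2.0, 1.0])
toy_stats = spectrum_stats(T_toy)
print(toy_stats)
```

Running the same function on each trained $T_{Lg_1,Lg_2}$ gives one row of statistics per language pair, which can then be compared across pairs.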

# (Stage 0) Prepare vocabulary:

# TODO 
 * add details on where the vocab was downloaded from
 * point to where the data is in the repo
 * add instructions on where to call it in the code

# (Stages 1 and 2) 
1. Word embeddings. Using fasttext or word2vec we embed the vocab for each of the languages
2. Train translation matrices:

# TODO 
 * add details on where the vocab is located
 * point to where the embeddings are located
 * Test following code on embedding process:

In [1]:
# Import tools for running Spectral decomposition
import sys
sys.path.append("../")
import ismtools



Using TensorFlow backend.


In [None]:
# TODO script this section and call the script from bash in this cell
    # add the -a for setting calculate_KNN = True
# TODO?? put manual list of translations into ismtools or another imported .py

# Set parameter on calculate KNN (need to change for -a sysarg)
calculate_KNN = False

# Build the list of translations (embedding, lg1, lg2) for every language pair
languages = ['en','ru','de','es','fr','it', 'zh-CN']
translations = []
for lang1 in languages:
    for lang2 in languages:
        translations.append(('fasttext_top',lang1, lang2))

for translation in translations:
    embedding, lg1, lg2 = translation
    # Vocab/Vectors/Dicts
    lg1_vocab, lg1_vectors, lg2_vocab, lg2_vectors = \
        pickle_rw((lg1 + '_' + embedding.split('_')[0] + '_vocab', 0),
                  (lg1 + '_' + embedding.split('_')[0] + '_vectors', 0),
                  (lg2 + '_' + embedding.split('_')[0] + '_vocab', 0),
                  (lg2 + '_' + embedding.split('_')[0] + '_vectors', 0),
                  write=False)
    lg1_dict = make_dict(lg1_vocab, lg1_vectors)
    lg2_dict = make_dict(lg2_vocab, lg2_vectors)

    print('Translation: '+lg1+'->'+lg2+'\n')

    # Train/Test Vocab/Vectors
    vocab_train, vocab_test = vocab_train_test(embedding, lg1, lg2, lg1_vocab)
    X_train, X_test, y_train, y_test = vectors_train_test(vocab_train,
                                                          vocab_test,lg1_dict,lg2_dict)
    
    # Fit translation matrix to training data
    model, history, T, tf, I, M, fro = translation_matrix(X_train, y_train)
    
    if calculate_KNN:
        results_df = translation_results(X_test, y_test, vocab_test, T,
                                     lg2_vectors, lg2_vocab)
        acc = T_norm_EDA(results_df)

"""
TODO create standardized dumping location for the translation matrix and its metadata.
- pickle dump based on the dir structure in the folders
"""

# (Stage 3) Translation Spectral Analysis

# TODO 
 * 

In [42]:
# Set list of T_matrix full paths
import pickle

T_matrix_dir = "./T_Matrices_examples/"
T_matrix_names = !ls $T_matrix_dir

T_matrix_full_paths = [T_matrix_dir + T_matrix_name
                       for T_matrix_name in T_matrix_names]

# Load T_matrix into memory
load_path = T_matrix_full_paths[2]
print(load_path)
# Use a context manager so the file is closed even if unpickling fails
with open(load_path, 'rb') as f:
    T_matrix = pickle.load(f)

./T_Matrices_examples/T_matrix_de_es.pkl


EOFError: 

In [48]:
import pandas as pd
pd.read_pickle("./T_Matrices_examples/T_matrix_de_fr.pkl")

ValueError: unsupported pickle protocol: 3
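
The `unsupported pickle protocol: 3` error typically appears when a Python 2 interpreter reads a pickle written by Python 3, since protocol 3 does not exist in Python 2. An `EOFError` on load usually points to an empty or truncated file. One possible workaround, sketched below under the assumption that the matrices are plain NumPy arrays, is to re-save each file from Python 3 with `protocol=2`. The helper name `resave_with_protocol` and the demo file names are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np

def resave_with_protocol(src_path, dst_path, protocol=2):
    """Load a pickle and write it back with an older protocol."""
    with open(src_path, "rb") as f:
        obj = pickle.load(f)
    with open(dst_path, "wb") as f:
        pickle.dump(obj, f, protocol=protocol)

# Demo on a throwaway file standing in for a T matrix pickle
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "T_matrix_demo.pkl")
dst = os.path.join(tmp_dir, "T_matrix_demo_py2.pkl")
with open(src, "wb") as f:
    pickle.dump(np.eye(3), f, protocol=3)

resave_with_protocol(src, dst)

with open(dst, "rb") as f:
    T_demo = pickle.load(f)
print(T_demo.shape)
```

Whether the re-saved file loads cleanly under Python 2 also depends on the NumPy versions on both sides, so this should be verified on the actual files.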

In [None]:
U,s,Vh = SVD(T)

s1 = log(s)


for stat in stats:
    svd[stat,translation[1],translation[2]] = stat_calc(stat, s, fro, acc)
    svd1[stat,translation[1],translation[2]] = stat_calc(stat, s1, fro, acc)


#Exporting DataFrames for SVD Heatmaps

s_df = make_df(languages,languages)
s1_df = make_df(languages,languages)


for stat in stats:
    for lang1 in languages:
        for lang2 in languages:
            # DataFrame.set_value was removed in pandas 1.0; .at is the replacement
            s_df.at[lang1, lang2] = svd[stat,lang1,lang2]
            s1_df.at[lang1, lang2] = svd1[stat,lang1,lang2]

    s_df.to_csv('../HeatmapData/T/s_{}.csv'.format(stat),columns = languages)
    s1_df.to_csv('../HeatmapData/T/s1_{}.csv'.format(stat),columns = languages)



In [None]:
if __name__ == '__main__':
    # Manually set list of translations (embedding, lg1, lg2)
    
    svds = ['s','s1']
    languages = ['en','ru','de','es','fr','it', 'zh-CN']
    stats = ['min','max','mean','median','std','fro','acc']
    
    translations=[]
    
    for lang1 in languages:
        for lang2 in languages:
            translations.append(('fasttext_top',lang1, lang2))
    
    svd = {}
    svd1 = {}
    
    for translation in translations:
        embedding, lg1, lg2 = translation
        # Vocab/Vectors/Dicts
        lg1_vocab, lg1_vectors, lg2_vocab, lg2_vectors = \
            pickle_rw((lg1 + '_' + embedding.split('_')[0] + '_vocab', 0),
                      (lg1 + '_' + embedding.split('_')[0] + '_vectors', 0),
                      (lg2 + '_' + embedding.split('_')[0] + '_vocab', 0),
                      (lg2 + '_' + embedding.split('_')[0] + '_vectors', 0),
                      write=False)
        lg1_dict = make_dict(lg1_vocab, lg1_vectors)
        lg2_dict = make_dict(lg2_vocab, lg2_vectors)

        print('Translation: '+lg1+'->'+lg2+'\n')

        # Train/Test Vocab/Vectors
        vocab_train, vocab_test = vocab_train_test(embedding, lg1, lg2, lg1_vocab)
        X_train, X_test, y_train, y_test = vectors_train_test(vocab_train,
                                                              vocab_test,lg1_dict,lg2_dict)
 
        
        # Fit translation matrix to training data
        model, history, T, tf,I, M, fro = translation_matrix(X_train, y_train)
        
        results_df = translation_results(X_test, y_test, vocab_test, T,
                                         lg2_vectors, lg2_vocab)
        acc = T_norm_EDA(results_df)
        
        U,s,Vh = SVD(T)
        
        s1 = log(s)
    
        
        for stat in stats:
            svd[stat,translation[1],translation[2]] = stat_calc(stat, s, fro, acc)
            svd1[stat,translation[1],translation[2]] = stat_calc(stat, s1, fro, acc)
            
        
    #Exporting DataFrames for SVD Heatmaps
    
    s_df = make_df(languages,languages)
    s1_df = make_df(languages,languages)
    

    for stat in stats:
        for lang1 in languages:
            for lang2 in languages:
                # DataFrame.set_value was removed in pandas 1.0; .at is the replacement
                s_df.at[lang1, lang2] = svd[stat,lang1,lang2]
                s1_df.at[lang1, lang2] = svd1[stat,lang1,lang2]

        s_df.to_csv('../HeatmapData/T/s_{}.csv'.format(stat),columns = languages)
        s1_df.to_csv('../HeatmapData/T/s1_{}.csv'.format(stat),columns = languages)

    



# Translation Matrix Results  
## En to Ru Fasttext_Random  
- En Vocabulary Size = 1,259,685  
- En Embedding Length = 300  
- Ru Vocabulary Size = 944,211  
- Ru Embedding Length = 300  
- Train Size = 5,000  
- Test Size = 1,500  
- <b>Test Accuracy = 3.9%</b>  

#### Test L2 Norms  
- X_norm: L2 norms for En test vectors  
- y_norm: L2 norms for Ru test vectors  
- yhat_norm: L2 norms for X.dot(T) test vectors (T = translation matrix)  
- yhat_neighbor norm: L2 norms for nearest neighbor to X.dot(T) in y test vectors  
![](../images/en_ru_fasttext_random_T_norm.png)  
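
For readers who want to reproduce the four norm series and the test accuracy, the sketch below shows one way they could be computed. It is not the notebook's `translation_results` or `T_norm_EDA` code. It assumes embeddings are stored one per row (hence `X.dot(T)`), that the nearest neighbor is found by Euclidean distance among the test targets, and that a prediction counts as correct when its nearest neighbor is its own target. The function name `nearest_neighbor_report` is hypothetical.

```python
import numpy as np

def nearest_neighbor_report(X, Y, T):
    """Row-wise L2 norms and nearest-neighbor accuracy for Y_hat = X.dot(T)."""
    Y_hat = X.dot(T)
    # Pairwise Euclidean distances: dists[i, j] = ||Y_hat[i] - Y[j]||
    dists = np.linalg.norm(Y_hat[:, None, :] - Y[None, :, :], axis=2)
    nn_idx = dists.argmin(axis=1)
    return {
        "X_norm": np.linalg.norm(X, axis=1),
        "y_norm": np.linalg.norm(Y, axis=1),
        "yhat_norm": np.linalg.norm(Y_hat, axis=1),
        "yhat_neighbor_norm": np.linalg.norm(Y[nn_idx], axis=1),
        # Fraction of test words whose nearest neighbor is the true target
        "accuracy": np.mean(nn_idx == np.arange(len(Y))),
    }

# Toy test set where T is the identity, so every prediction hits its target
X_toy = np.array([[1.0, 0.0], [0.0, 2.0]])
Y_toy = X_toy.copy()
report = nearest_neighbor_report(X_toy, Y_toy, np.eye(2))
print(report["accuracy"], report["yhat_norm"])
```

The pairwise distance matrix grows with the square of the test-set size, which is manageable for test sets of about 1,500 words but not for a full vocabulary.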

#### Translation Matrix Isotropy  
- Isotropy = 32.3%  
![](../images/en_ru_fasttext_random_T_isotropy.png)  

## En to Ru Fasttext_Top  
- En Vocabulary Size = 1,259,685  
- En Embedding Length = 300  
- Ru Vocabulary Size = 944,211  
- Ru Embedding Length = 300  
- Train Size = 5,000  
- Test Size = 1,500  
- <b>Test Accuracy = 46.3%</b>  

#### Test L2 Norms  
- X_norm: L2 norms for En test vectors  
- y_norm: L2 norms for Ru test vectors  
- yhat_norm: L2 norms for X.dot(T) test vectors (T = translation matrix)  
- yhat_neighbor norm: L2 norms for nearest neighbor to X.dot(T) in y test vectors  
![](../images/en_ru_fasttext_top_T_norm.png)  

#### Translation Matrix Isotropy  
- Isotropy = 38.2%  
![](../images/en_ru_fasttext_top_T_isotropy.png)  

## En to De Fasttext_Random  
- En Vocabulary Size = 1,259,685  
- En Embedding Length = 300  
- De Vocabulary Size = 1,137,616  
- De Embedding Length = 300  
- Train Size = 5,000  
- Test Size = 1,500  
- <b>Test Accuracy = 21.9%</b>  

#### Test L2 Norms  
- X_norm: L2 norms for En test vectors  
- y_norm: L2 norms for De test vectors  
- yhat_norm: L2 norms for X.dot(T) test vectors (T = translation matrix)  
- yhat_neighbor norm: L2 norms for nearest neighbor to X.dot(T) in y test vectors  
![](../images/en_de_fasttext_random_T_norm.png)  

#### Translation Matrix Isotropy  
- Isotropy = 35.6%  
![](../images/en_de_fasttext_random_T_isotropy.png)  

## En to De Fasttext_Top  
- En Vocabulary Size = 1,259,685  
- En Embedding Length = 300  
- De Vocabulary Size = 1,137,616  
- De Embedding Length = 300  
- Train Size = 5,000  
- Test Size = 1,500  
- <b>Test Accuracy = 63.6%</b>  

#### Test L2 Norms  
- X_norm: L2 norms for En test vectors  
- y_norm: L2 norms for De test vectors  
- yhat_norm: L2 norms for X.dot(T) test vectors (T = translation matrix)  
- yhat_neighbor norm: L2 norms for nearest neighbor to X.dot(T) in y test vectors  
![](../images/en_de_fasttext_top_T_norm.png)  

#### Translation Matrix Isotropy  
- Isotropy = 43.4%  
![](../images/en_de_fasttext_top_T_isotropy.png)  

## En to It Zeroshot  
- En Vocabulary Size = 200,000  
- En Embedding Length = 300  
- It Vocabulary Size = 200,000  
- It Embedding Length = 300  
- Train Size = 5,000  
- Test Size = 1,869  
- <b>Test Accuracy = 27.9%</b>  

#### Test L2 Norms  
- X_norm: L2 norms for En test vectors  
- y_norm: L2 norms for It test vectors  
- yhat_norm: L2 norms for X.dot(T) test vectors (T = translation matrix)  
- yhat_neighbor norm: L2 norms for nearest neighbor to X.dot(T) in y test vectors  
![](../images/en_it_zeroshot_T_norm.png)  

#### Translation Matrix Isotropy  
- Isotropy = 46.6%  
![](../images/en_it_zeroshot_T_isotropy.png)  

