# Introduction to cross-lingual word-embeddings at Wikimania 2019

* In this tutorial we will learn how to work with cross-lingual word embeddings. 
* See the introductory [slides here](https://upload.wikimedia.org/wikipedia/commons/6/63/Tutorial_on_Multilingual_Word_Embeddings%2C_Wikimania_2019.pdf)
* This code is based on the repository shared by [Smith et al.](https://github.com/Babylonpartners/fastText_multilingual)
* You can see applications of this code in the Wikipedia [Sections](https://github.com/digitalTranshumant/wmf-interlanguage) and [Template parameters](https://github.com/digitalTranshumant/templatesAlignment) alignment projects.



In [10]:
# Config
## Add your folders and languages here
import fasttext  # imported under its own name; later cells call fasttext.load_model
from scipy.spatial import distance
import numpy as np
import networkx as nx
lang1 = 'en'
lang2 = 'es'
langs = [lang1, lang2]
pathVectors = 'vectors'
pathAlignment = 'wikiAlignments'  # no trailing slash: paths are built with '%s/...'
print(langs)

['en', 'es']


## Download fasttext models

* This script downloads the pre-trained fastText models for the languages listed in the `langs` variable.
* This process **can take a long time**.
* Note that **each model file is several gigabytes** (the log below reports 9.6G for English and 5.1G for Spanish), and later you will need to unzip those models, using around 15G per model in total.
* Comment out (add a `#` prefix to) the first line of the next cell to download the models. If you already have the models in your folder, you can skip this step.



In [4]:
# COMMENT  HERE TO RUN THIS CELL

!mkdir {pathVectors}
for l in langs:
    print(l)
    !wget -P {pathVectors}/ {'https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.%s.zip' % l}

mkdir: vectors/: File exists
en
--2019-09-26 15:46:30--  https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.en.zip
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.20.6.166, 104.20.22.166
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.20.6.166|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10356881291 (9.6G) [application/zip]
Saving to: ‘vectors/wiki.en.zip’


2019-09-26 16:44:53 (2.82 MB/s) - ‘vectors/wiki.en.zip’ saved [10356881291/10356881291]

es
--2019-09-26 16:44:53--  https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.es.zip
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.20.6.166, 104.20.22.166
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.20.6.166|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5445405108 (5.1G) [application/zip]
Saving to: ‘vectors/wiki.es.zip’


2019-09-26 19:38:01 (2.26 MB/s) - Read error at byte 2327299513/5445405108 (S
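
The downloaded archives still have to be unzipped before the models can be loaded. The following is a minimal sketch of that step using Python's standard `zipfile` module. The helper name `unzip_models` is hypothetical (it is not part of the original notebook), and it assumes that each `wiki.<lang>.zip` archive contains the `wiki.<lang>.bin` file expected by the loading cell. Archives that are missing or incomplete, as can happen after a read error during download, are skipped.

```python
import zipfile
from pathlib import Path


def unzip_models(path_vectors, langs):
    """Extract the wiki.<lang>.zip archives found in path_vectors.

    Hypothetical helper, not part of the original notebook.
    Returns the names of the archives that were extracted.
    """
    extracted = []
    for lang in langs:
        archive = Path(path_vectors) / ('wiki.%s.zip' % lang)
        if not archive.is_file():
            print('missing archive, skipping:', archive)
            continue
        # A truncated download is not a valid zip file
        if not zipfile.is_zipfile(archive):
            print('not a valid zip (incomplete download?):', archive)
            continue
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(path_vectors)
        extracted.append(archive.name)
    return extracted
```

In this notebook it could be called as `unzip_models(pathVectors, langs)`.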

* Load the models; this can take a few minutes.

In [17]:
model = {}

for lang in langs:
    # Print each path before loading so a missing file is easy to spot
    print('%s/wiki.%s.bin' % (pathVectors,lang))
    model[lang] = fasttext.load_model('%s/wiki.%s.bin' % (pathVectors,lang))

vectors/wiki.en.bin
vectors/wiki.es.bin






### Embeddings example

* This is what the word 'cat' looks like in this embedding model

In [18]:
model['en'].get_sentence_vector('cat')

array([-0.03272143,  0.03321843, -0.0772398 ,  0.02752271, -0.04689749,
        0.10779523,  0.05039032, -0.1213629 ,  0.00796918,  0.0365316 ,
        0.03590076, -0.00070022,  0.04651386, -0.04166288,  0.06664581,
       -0.02164455,  0.0180805 , -0.10384931,  0.04688571,  0.06662694,
        0.00233572,  0.12208793, -0.09872068,  0.0255165 ,  0.08340995,
        0.00577335, -0.0176113 ,  0.06296295,  0.07984982,  0.11208957,
        0.06389221,  0.05539172,  0.02762271, -0.05251925,  0.04438546,
       -0.02399754, -0.01537215, -0.01010495,  0.01509995, -0.00657111,
       -0.02613736, -0.061927  , -0.05292661, -0.00875183,  0.03022415,
        0.12282077, -0.01940934, -0.09258875,  0.03871215, -0.06963187,
        0.02200041, -0.01411154, -0.02184908, -0.08269455,  0.07468157,
        0.08944456,  0.00224687, -0.1002942 ,  0.01784089,  0.04561058,
        0.04928856, -0.11202534,  0.02219844,  0.05074738, -0.01451611,
       -0.08938298,  0.02949806, -0.00669799, -0.03016053,  0.01

* Remember that **these numbers have no meaning by themselves**; they will change if you retrain the model on another corpus.


In [19]:
print(len(model['en'].get_sentence_vector('cat')))

300


### Distances within the same language

In [20]:
v1 = model['en'].get_sentence_vector('cat')
v2 = model['en'].get_sentence_vector('kitty')
distance.cosine(v1,v2)

0.5402991771697998

In [21]:
v1 = model['en'].get_sentence_vector('cat')
v2 = model['en'].get_sentence_vector('lion')
distance.cosine(v1,v2)

0.6383496820926666

In [22]:
v1 = model['en'].get_sentence_vector('cat')
v2 = model['en'].get_sentence_vector('car')
distance.cosine(v1,v2)

0.8043883889913559
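
The values above are cosine distances: SciPy's `distance.cosine(u, v)` is 1 minus the cosine similarity of the two vectors, so it ranges from 0 (same direction) to 2 (opposite direction), and smaller means more similar. That is why 'cat' is closer to 'kitty' (0.54) than to 'lion' (0.64) or 'car' (0.80). The sketch below recomputes the same quantity with NumPy; the function name `cosine_distance` and the 2-d toy vectors are made up for illustration.

```python
import numpy as np


def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


# Toy 2-d vectors (not real embeddings)
print(cosine_distance([1, 0], [2, 0]))   # same direction, different length
print(cosine_distance([1, 0], [0, 3]))   # orthogonal
print(cosine_distance([1, 0], [-1, 0]))  # opposite direction
```

Because the measure depends only on direction, rescaling a vector does not change its distance to other vectors.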

## Load transformation Matrices
* Note that this repository already contains the transformation from 'en' to 'es'
* These alignments were generated using [this code](https://analytics.wikimedia.org/datasets/one-off/dsaez/)
* If you need a pair of languages that is not included here, please contact us, or use the pre-trained alignments [provided here](https://drive.google.com/drive/folders/1_cbl3GKmg9Ots6_QOXcGxRNQr8SCELWO?usp=sharing)

In [23]:
v1 = model['es'].get_sentence_vector('perro')
v2 = model['en'].get_sentence_vector('dog')

In [24]:
distance.cosine(v1,v2)

0.9362809807062149

* The following function applies the transformation to a given vector.

In [25]:
def apply_transform(vec, transform):
    """
    Apply the given transformation to the vector space

    Right-multiplies given transform with embeddings E:
        E = E * transform

    Transform can either be a string with a filename to a
    text file containing a ndarray (compat. with np.loadtxt)
    or a numpy ndarray.
    """
    transmat = np.loadtxt(transform) if isinstance(transform, str) else transform
    return np.matmul(vec, transmat)

In [27]:
# Align second language
v2Aligned = apply_transform(v2,'%s/apply_in_%s_to_%s.txt' % (pathAlignment,'en','es') )

In [28]:
distance.cosine(v1,v2Aligned)

0.29473989976649984
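
This notebook loads the transformation matrices from text files and does not show how they were computed. One common way to learn such a matrix, and to our understanding the approach described by Smith et al., is to take pairs of known translations and solve the orthogonal Procrustes problem with an SVD. The sketch below illustrates the idea on synthetic data only; the function name `learn_orthogonal_map` and all variables are hypothetical, and it is not the code used to produce the files in `wikiAlignments`. It follows the same right-multiplication convention as `apply_transform`.

```python
import numpy as np


def learn_orthogonal_map(X, Y):
    """Return the orthogonal matrix W that minimizes ||X @ W - Y||.

    X, Y: arrays of shape (n_pairs, dim); row i of X and row i of Y
    hold the vectors of a known translation pair.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt


# Synthetic check (random data, not real embeddings): hide a known
# rotation Q in Y and confirm that the learned map reproduces it.
rng = np.random.default_rng(0)
dim = 5
X = rng.normal(size=(50, dim))
Q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))  # random orthogonal matrix
Y = X @ Q

W = learn_orthogonal_map(X, Y)
print(np.allclose(X @ W, Y))              # mapped vectors match the targets
print(np.allclose(W @ W.T, np.eye(dim)))  # W is orthogonal
```

An orthogonal map preserves lengths and angles, so distances within the transformed language stay the same after alignment.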

### Subword information

* Using subword information with modified or misspelled words

within the same language:

In [29]:
v1 = model['en'].get_sentence_vector('excellent')
v2 = model['en'].get_sentence_vector('excelent')
distance.cosine(v1,v2)

0.4250869154930115

or with cross-lingual aligned vectors

In [30]:
v1 = model['es'].get_sentence_vector('perro1')
v2 = model['en'].get_sentence_vector('dog')
v2Aligned = apply_transform(v2,'%s/apply_in_%s_to_%s.txt' % (pathAlignment,'en','es') )
distance.cosine(v1,v2Aligned)

0.4684736197488095
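
Misspelled or modified words such as 'excelent' or 'perro1' still get useful vectors because fastText builds a word vector from the vectors of its character n-grams. The sketch below only illustrates how much two spellings overlap at the n-gram level; it does not call fastText. The helper names are hypothetical, and the n-gram lengths 3 to 6 with `<` and `>` as word boundaries are fastText's documented defaults, assumed here rather than checked against the wiki models.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return the set of character n-grams of '<word>'."""
    padded = '<' + word + '>'
    grams = set()
    for n in range(minn, maxn + 1):
        for i in range(len(padded) - n + 1):
            grams.add(padded[i:i + n])
    return grams


def jaccard(a, b):
    """Share of n-grams that two sets have in common."""
    return len(a & b) / len(a | b)


good = char_ngrams('excellent')
typo = char_ngrams('excelent')
print(sorted(char_ngrams('cat')))
print('shared n-grams:', len(good & typo))
print('overlap: %.2f' % jaccard(good, typo))
```

The more n-grams two spellings share, the more of their vector components come from the same subword vectors.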

### Sentence Level

In [31]:
sentence1 = model['es'].get_sentence_vector('Hola! Que tenga un bonito día señor, nos vemos más tarde! :)')
sentence2 = model['en'].get_sentence_vector('Hi! Have a nice day sir, see you later! :)')

In [32]:
distance.cosine(sentence1,sentence2)

0.9927418855950236

In [35]:
sentence2Aligned = apply_transform(sentence2,'%s/apply_in_%s_to_%s.txt' % (pathAlignment,'en','es') )

In [36]:
distance.cosine(sentence1,sentence2Aligned)

0.44225135145721073
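
As far as we understand the fastText implementation for unsupervised models, `get_sentence_vector` splits the text on whitespace, divides each word vector by its L2 norm, and averages the results. The sketch below reproduces that idea with a toy dictionary in place of a real model; the function `sentence_vector` and the vectors in `toy` are hypothetical and only meant to show the arithmetic.

```python
import numpy as np


def sentence_vector(sentence, word_vectors):
    """Average the L2-normalized vectors of whitespace-separated tokens.

    word_vectors: dict mapping token -> 1-d numpy array (a toy stand-in
    for a fastText model, for illustration only).
    """
    dim = len(next(iter(word_vectors.values())))
    total = np.zeros(dim)
    count = 0
    for token in sentence.split():
        vec = word_vectors.get(token)
        if vec is None:
            continue  # a real fastText model would fall back to subwords
        norm = np.linalg.norm(vec)
        if norm > 0:
            total += vec / norm
            count += 1
    return total / count if count else total


toy = {'nice': np.array([3.0, 0.0]), 'day': np.array([0.0, 0.5])}
print(sentence_vector('nice day', toy))
```

Since tokens are split on whitespace only, punctuation attached to a word (as in 'Hola!') becomes part of the token.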

## Aligning sets of words

Load all transformations

In [37]:
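transmat = {}
# Loop names chosen so they do not overwrite the lang1/lang2 settings
for src in langs:
    print(src)
    transmat[src] = {}
    for tgt in langs:
        if src != tgt:
            # transmat[src][tgt] maps vectors of language tgt into the space of src
            transmat[src][tgt] = np.loadtxt('%s/apply_in_%s_to_%s.txt' % (pathAlignment,tgt,src))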

en
es


In [38]:
words = {}
words[lang1] = ['cat','kitty','motocycle','car','dog','truck','geography','mountains','rivers','basketball','football']
words[lang2] = ['gato','automóvil','perro','camión','geografía','montañas','rios','baloncesto','futbol']

In [39]:
def getMoreSimilar(wordLang1,setLang2,sourceLang,targetLang):
    """
    Given a word in language 1 and a set of words/sentences in language 2,
    return the closest item of the set together with its cosine distance.
    wordLang1: str, 'perro'
    setLang2: dict or list, ['hello','dog']
    sourceLang: str, 'es'
    targetLang: str, 'en'
    return tuple (distance, item)
    """
    # model and transmat are read from the notebook's global scope
    d = []
    vec1 = model[sourceLang].get_sentence_vector(wordLang1)
    for s2 in setLang2:
        vec2 = model[targetLang].get_sentence_vector(s2.strip().replace('_',' '))
        # Map the target-language vector into the source-language space
        vec2T = apply_transform(vec2,transmat[sourceLang][targetLang])
        dist = distance.cosine(vec1,vec2T)
        d.append((dist,s2))
    return min(d)


In [40]:
wordEn ='cat'
print('Searching for the most similar word to:', wordEn)
print('list',words['es'])
getMoreSimilar(wordEn,words['es'],'en','es')

Searching for the most similar word to: cat
list ['gato', 'automóvil', 'perro', 'camión', 'geografía', 'montañas', 'rios', 'baloncesto', 'futbol']


(0.4377602546993745, 'gato')

In [41]:
wordEn ='kitty'
print('Searching for the most similar word:', wordEn)
print('list',words['es'])
getMoreSimilar(wordEn,words['es'],'en','es')

Searching for the most similar word: kitty
list ['gato', 'automóvil', 'perro', 'camión', 'geografía', 'montañas', 'rios', 'baloncesto', 'futbol']


(0.6476671722893668, 'gato')

### Aligning set of words

Given two sets of words, get a one-to-one mapping between them

In [42]:
# One-to-one mappings
def alignSets(set1,set2,sourceLang,targetLang,sensivity=.45):
    """
    Given two sets of words/sentences in two languages,
    return the possible one-to-one alignments between them.
    set1: dict or list, ['hola','perro']
    set2: dict or list, ['hello','dog']
    sourceLang: str, 'es'
    targetLang: str, 'en'
    sensivity: float, maximum cosine distance for a pair to be considered
    return list of dicts, sorted by increasing distance
    """
    global model
    global transmat
    output = []
    G = nx.Graph()
    # Add one edge per candidate pair whose distance is below the threshold
    for s1 in set1:
        vec1 = model[sourceLang].get_sentence_vector(s1.strip().replace('_',' '))
        for s2 in set2:
            vec2 = model[targetLang].get_sentence_vector(s2.strip().replace('_',' '))
            vec2T = apply_transform(vec2,transmat[sourceLang][targetLang])
            dist = distance.cosine(vec1,vec2T)
            if dist < sensivity:
                node1 = '%s_%s' % (sourceLang,s1)
                node2 = '%s_%s' % (targetLang,s2)
                G.add_edge(node1,node2)
                G[node1][node2]['w'] = dist

    # Greedily take the closest remaining pair and remove both of its words
    while G.edges():
        p = sorted(G.edges(data=True), key=lambda x: x[2]['w'])[0]
        psorted = sorted(list(p[:2]))
        # Node names are '<lang>_<word>'; the slices assume two-letter language codes
        output.append({psorted[0][:2]:psorted[0][3:],psorted[1][:2]:psorted[1][3:],'d':p[2]['w']})
        G.remove_node(p[0])
        G.remove_node(p[1])
    return output

In [43]:
print(words[lang1])
print(words[lang2])

alignSets(words[lang1],words[lang2],lang1,lang2)

['cat', 'kitty', 'motocycle', 'car', 'dog', 'truck', 'geography', 'mountains', 'rivers', 'basketball', 'football']
['gato', 'automóvil', 'perro', 'camión', 'geografía', 'montañas', 'rios', 'baloncesto', 'futbol']


[{'en': 'basketball', 'es': 'baloncesto', 'd': 0.20396930125846902},
 {'en': 'mountains', 'es': 'montañas', 'd': 0.24248148366015954},
 {'en': 'truck', 'es': 'camión', 'd': 0.2715019349251284},
 {'en': 'car', 'es': 'automóvil', 'd': 0.274432416258372},
 {'en': 'geography', 'es': 'geografía', 'd': 0.2938059612477104},
 {'en': 'dog', 'es': 'perro', 'd': 0.29473989749777796},
 {'en': 'football', 'es': 'futbol', 'd': 0.39243692824605925},
 {'en': 'cat', 'es': 'gato', 'd': 0.4377602546993745}]
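
The matching strategy in `alignSets` is greedy: it repeatedly takes the closest remaining pair below the threshold and removes both words from further consideration. The sketch below shows the same idea without networkx on a small table of made-up distances; the function name `greedy_one_to_one` and the numbers in `toy` are hypothetical and do not come from the models above.

```python
def greedy_one_to_one(distances, threshold=0.45):
    """Greedy one-to-one matching.

    distances: dict mapping (left, right) -> distance.
    Returns a list of (left, right, distance), closest pairs first.
    """
    used_left, used_right = set(), set()
    matches = []
    for (left, right), dist in sorted(distances.items(), key=lambda kv: kv[1]):
        if dist >= threshold:
            break  # all remaining pairs are at least this far apart
        if left in used_left or right in used_right:
            continue  # one of the two words is already matched
        matches.append((left, right, dist))
        used_left.add(left)
        used_right.add(right)
    return matches


# Toy distances, made up for illustration
toy = {
    ('cat', 'gato'): 0.40,
    ('kitty', 'gato'): 0.42,
    ('kitty', 'perro'): 0.44,
    ('dog', 'perro'): 0.30,
    ('car', 'gato'): 0.90,
}
print(greedy_one_to_one(toy))
```

Here 'kitty' stays unmatched because both of its candidates are taken by closer pairs. A greedy pass is simple but not guaranteed to minimize the total distance; an optimal assignment solver such as `scipy.optimize.linear_sum_assignment` would be an alternative.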