### MUSE and fastText Colab setup

This notebook provides an environment for training unsupervised MUSE models. Run the setup cells, upload the Europarl corpora you want to align, and run your training code in the appropriate cells at the bottom.

NOTE: We suggest running this notebook in GPU mode (Runtime --> Change Runtime Type --> GPU). fastText is CPU-only, but MUSE is very slow on CPU.
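
To confirm that the runtime change took effect, one option is a quick PyTorch check. This is a minimal sketch: `describe_device` is an illustrative helper name, not part of MUSE, and it assumes PyTorch is available, as it normally is on Colab.

```python
def describe_device():
    """Return a short description of the device PyTorch would use."""
    try:
        import torch  # preinstalled on Colab; may be missing elsewhere
    except ImportError:
        return "PyTorch is not installed"
    if torch.cuda.is_available():
        return "GPU: " + torch.cuda.get_device_name(0)
    return "CPU only (MUSE training will be slow)"

print(describe_device())
```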

Total time to train everything should be about 1.5 hours.

#### Submission instructions

Use this training code to complete the MUSE-related questions for Lab 1. Submit this Colab notebook alongside your completed lab1.ipynb.

In [1]:
import torch

In [2]:
!git clone https://github.com/facebookresearch/MUSE.git

Cloning into 'MUSE'...
remote: Enumerating objects: 239, done.
remote: Total 239 (delta 0), reused 0 (delta 0), pack-reused 239
Receiving objects: 100% (239/239), 215.77 KiB | 1.27 MiB/s, done.
Resolving deltas: 100% (136/136), done.


Download the evaluation data for MUSE

#### [NOTE: We ran into some issues getting MUSE to correctly use the evaluation data. It should be possible to skip this cell (with no impact on training quality) if you follow the note about commenting out a line in the evaluator.py file.]

Takes a few minutes to download

In [3]:

# !cd ./MUSE/data/; chmod +x get_evaluation.sh
# !cd ./MUSE/; ./data/get_evaluation.sh

In [4]:
# get fastText

! wget https://github.com/facebookresearch/fastText/archive/v0.9.2.zip
! unzip v0.9.2.zip
! cd fastText-0.9.2; make


--2023-02-19 00:09:25--  https://github.com/facebookresearch/fastText/archive/v0.9.2.zip
Resolving github.com (github.com)... 20.27.177.113
Connecting to github.com (github.com)|20.27.177.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/facebookresearch/fastText/zip/refs/tags/v0.9.2 [following]
--2023-02-19 00:09:26--  https://codeload.github.com/facebookresearch/fastText/zip/refs/tags/v0.9.2
Resolving codeload.github.com (codeload.github.com)... 192.30.255.120
Connecting to codeload.github.com (codeload.github.com)|192.30.255.120|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘v0.9.2.zip’

v0.9.2.zip              [     <=>            ]   4.17M  4.18MB/s    in 1.0s    

2023-02-19 00:09:27 (4.18 MB/s) - ‘v0.9.2.zip’ saved [4369852]

Archive:  v0.9.2.zip
5b5943c118b0ec5fb9cd8d20587de2b2d3966dfe
   creating: fastText-0.9.2/
   creating: fastText-0.9.2/.circlec

In [5]:
#get europarl fr_en
!wget https://www.statmt.org/europarl/v7/fr-en.tgz

--2023-02-19 00:09:46--  https://www.statmt.org/europarl/v7/fr-en.tgz
Resolving www.statmt.org (www.statmt.org)... 129.215.197.184
Connecting to www.statmt.org (www.statmt.org)|129.215.197.184|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 202718517 (193M) [application/x-gzip]
Saving to: ‘fr-en.tgz’


2023-02-19 00:10:42 (3.53 MB/s) - ‘fr-en.tgz’ saved [202718517/202718517]



In [6]:
#unpack
!tar -xvzf fr-en.tgz

europarl-v7.fr-en.en
europarl-v7.fr-en.fr


#### Run your training code:

To run the training scripts, prefix each command with a ! so that it runs as a shell command, as it would in a Linux terminal.

With the Europarl corpora we will first train fastText embeddings (see below). Once the embeddings are created, we'll feed them to MUSE, which aligns them.

fastText does not need a GPU and takes about 30 minutes per language. If you are concerned about your GPU quota, you can run it in a CPU-only notebook, save the files it creates (MUSE/en.vec and MUSE/fr.vec), and then move them to a GPU notebook to run MUSE.
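
The `.vec` files mentioned above are plain text: a header line with the vocabulary size and vector dimension, followed by one word per line with its vector. The sketch below shows that layout on a made-up three-word sample rather than real Europarl output; `SAMPLE_VEC` and `read_vec` are illustrative names, not part of fastText or MUSE.

```python
import io

# Toy stand-in for a fastText .vec file (made-up words and values):
# the header holds "<vocab size> <dimension>", then one word per line.
SAMPLE_VEC = "3 4\nthe 0.1 0.2 0.3 0.4\nof 0.5 0.6 0.7 0.8\nvote 0.9 1.0 1.1 1.2\n"

def read_vec(handle):
    """Parse a .vec-style text stream into (dimension, {word: [floats]})."""
    n_words, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        word, values = line.rstrip().split(" ", 1)
        vec = [float(v) for v in values.split()]
        if len(vec) != dim:
            raise ValueError(f"{word!r}: expected {dim} values, got {len(vec)}")
        vectors[word] = vec
    if len(vectors) != n_words:
        raise ValueError(f"header says {n_words} words, found {len(vectors)}")
    return dim, vectors

dim, vectors = read_vec(io.StringIO(SAMPLE_VEC))
print(dim, sorted(vectors))
```

Checking the header against the body this way is a cheap sanity test after copying the files between notebooks, since a truncated upload would fail the word count.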


In [7]:
### fastText here
# skipgram with default settings; each run writes <output>.bin and <output>.vec
!./fastText-0.9.2/fasttext skipgram -input europarl-v7.fr-en.en -output MUSE/en
!./fastText-0.9.2/fasttext skipgram -input europarl-v7.fr-en.fr -output MUSE/fr


Read 52M words
Number of words:  87628
Number of labels: 0
Progress: 100.0% words/sec/thread:   27809 lr:  0.000000 avg.loss:  1.352167 ETA:   0h 0m 0s
Read 54M words
Number of words:  119873
Number of labels: 0
Progress: 100.0% words/sec/thread:   24196 lr:  0.000000 avg.loss:  1.245346 ETA:   0h 0m 0s


In [8]:
# Adapted from https://github.com/facebookresearch/MUSE/blob/master/demo.ipynb
import io
import numpy as np

def load_vec(emb_path, nmax=50000):
    """Load up to nmax word vectors from a fastText-style .vec text file."""
    vectors = []
    word2id = {}
    with io.open(emb_path, 'r', encoding='utf-8', newline='\n', errors='ignore') as f:
        next(f)  # skip the header line ("<vocab size> <dimension>")
        for line in f:
            word, vect = line.rstrip().split(' ', 1)
            vect = np.fromstring(vect, sep=' ')
            assert word not in word2id, 'word found twice'
            vectors.append(vect)
            word2id[word] = len(word2id)
            if len(word2id) == nmax:
                break
    id2word = {v: k for k, v in word2id.items()}
    embeddings = np.vstack(vectors)
    return embeddings, id2word, word2id

## modified this to return a result list
def get_nn(word, src_emb, src_id2word, tgt_emb, tgt_id2word, K=5):
    """Return the K nearest target words to `word` as (cosine score, word) pairs."""
    result = []
    print("Nearest neighbors of \"%s\":" % word)
    word2id = {v: k for k, v in src_id2word.items()}
    word_emb = src_emb[word2id[word]]
    # cosine similarity between the query vector and every target vector
    scores = (tgt_emb / np.linalg.norm(tgt_emb, 2, 1)[:, None]).dot(word_emb / np.linalg.norm(word_emb))
    k_best = scores.argsort()[-K:][::-1]
    for idx in k_best:
        result.append((scores[idx], tgt_id2word[idx]))
    return result

In [9]:
# load english and french word embeddings
MUSE_PATH = "MUSE"
en_embeddings, en_id2word, en_word2id = load_vec(MUSE_PATH + "/en.vec", nmax=50000)
fr_embeddings, fr_id2word, fr_word2id = load_vec(MUSE_PATH + "/fr.vec", nmax=50000)

You can use the get_nn function as follows, where K is the number of results (feel free to increase it). Do this for the English words (Minutes, minutes, vote) and the French words (vous, intervienne, accord).
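
The ranking that get_nn performs is a cosine-similarity search. The sketch below reproduces the idea on made-up 2-D vectors so the scores are easy to check by hand; `cosine_top_k`, the toy words, and their values are illustrative assumptions, not part of MUSE or the Europarl embeddings.

```python
import numpy as np

def cosine_top_k(query, matrix, k=2):
    """Return [(score, row_index)] for the k rows of `matrix` closest to `query`."""
    rows = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = rows.dot(query / np.linalg.norm(query))
    best = np.argsort(scores)[::-1][:k]
    return [(float(scores[i]), int(i)) for i in best]

# Toy 2-D "embeddings" (made-up values, not Europarl vectors).
words = ["vote", "ballot", "banana"]
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])

for score, idx in cosine_top_k(emb[0], emb, k=2):
    print(f"{words[idx]}: {score:.3f}")
```

When the source and target embeddings are the same matrix, the query word itself comes back first with a score of about 1, so ask for K+1 results if you want K distinct neighbours.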

In [10]:
print('most similar word to Minutes is %s'%get_nn('Minutes', en_embeddings, en_id2word, en_embeddings, en_id2word, K=2))

## TO COMPLETE*** GET REST OF WORDS

Nearest neighbors of "Minutes":
most similar word to Minutes is [(0.9999999999999999, 'Minutes'), (0.9388245054182518, 'Minutes.)')]


In [11]:
# FAISS is a library for fast similarity search; MUSE can use it to speed up nearest-neighbor lookups. This is how to install and import it.
!apt install libomp-dev
#!python -m pip install --upgrade faiss faiss-gpu
!pip install faiss-gpu
import faiss

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following package was automatically installed and is no longer required:
  libnvidia-common-510
Use 'apt autoremove' to remove it.
The following additional packages will be installed:
  libomp-10-dev libomp5-10
Suggested packages:
  libomp-10-doc
The following NEW packages will be installed:
  libomp-10-dev libomp-dev libomp5-10
0 upgraded, 3 newly installed, 0 to remove and 21 not upgraded.
Need to get 351 kB of archives.
After this operation, 2,281 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu focal/universe amd64 libomp5-10 amd64 1:10.0.0-4ubuntu1 [300 kB]
Get:2 http://archive.ubuntu.com/ubuntu focal/universe amd64 libomp-10-dev amd64 1:10.0.0-4ubuntu1 [47.7 kB]
Get:3 http://archive.ubuntu.com/ubuntu focal/universe amd64 libomp-dev amd64 1:10.0-50~exp1 [2,824 B]
Fetched 351 kB in 2s (175 kB/s)
Selecting previously unselected package libomp5-10:amd64.

Now we are going to run MUSE.
Note: We found an issue when running the evaluation parts of training. To get around it, comment out line 217 in /MUSE/src/evaluation/evaluator.py:
 self.word_translation(to_log)
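
If you prefer not to edit the file by hand, the change can be scripted. This is a sketch under the assumptions stated in the note above (file path, line 217, and the call text); `comment_out_line` is a hypothetical helper, and it refuses to edit if the line does not contain the expected call, since line numbers can differ between MUSE revisions.

```python
from pathlib import Path

def comment_out_line(path, line_number, expected):
    """Comment out a 1-indexed line, but only if it contains `expected`.

    Returns True if the file was changed, False if the line was already a comment.
    """
    path = Path(path)
    lines = path.read_text(encoding="utf-8").splitlines(keepends=True)
    target = lines[line_number - 1]
    stripped = target.lstrip()
    if stripped.startswith("#"):
        return False
    if expected not in target:
        raise ValueError(f"line {line_number} is {target.strip()!r}, not the expected call")
    indent = target[: len(target) - len(stripped)]
    lines[line_number - 1] = indent + "# " + stripped
    path.write_text("".join(lines), encoding="utf-8")
    return True

# Path and line number come from the note above; check them against your clone.
EVALUATOR = Path("MUSE/src/evaluation/evaluator.py")
if EVALUATOR.exists():
    comment_out_line(EVALUATOR, 217, "self.word_translation(to_log)")
```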

### Training should take around 30 minutes on GPU; plan accordingly.

In [12]:
### MUSE Here
# %cd MUSE
# !python unsupervised.py --src_lang fr --tgt_lang en --src_emb fr.vec --tgt_emb en.vec --n_refinement 5 --emb_dim 100 --dis_most_frequent 0



In [13]:
en_embeddings, en_id2word, en_word2id = load_vec(MUSE_PATH + "/en.vec", nmax=50000)
fr_embeddings, fr_id2word, fr_word2id = load_vec(MUSE_PATH + "/fr.vec", nmax=50000)