# Neural Machine Translator using Attention

The goal is to train a deep learning model to translate French into English.

We will use a dataset of French-English pairs from the nonprofit [Tatoeba](https://tatoeba.org/fr/).

Three architectures are possible:

1. An RNN only (LSTM or GRU)
2. An RNN encoder with an attention + RNN decoder
3. Attention only (GPT-style)

This homework requires option 2 or option 3.

## Hints

- Start by running the model on random tensors to check shapes
- Then overfit a single batch before training on the full dataset
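
The two hints above can be sketched as follows. This is a minimal sketch, not the expected solution: `TinyLM`, the sizes, the learning rate and the number of steps are all hypothetical choices for illustration. It first pushes random token ids through a small GRU language model to check shapes, then trains repeatedly on that same batch. If the loss does not fall on a single batch, the model or the training loop likely has a bug.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical sizes, chosen small so the check runs in seconds on CPU.
VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM = 50, 16, 32
BATCH_SIZE, SEQ_LEN = 4, 7


class TinyLM(nn.Module):
    """Toy next-token model: embedding -> GRU -> linear projection."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.rnn = nn.GRU(EMBED_DIM, HIDDEN_DIM, batch_first=True)
        self.out = nn.Linear(HIDDEN_DIM, VOCAB_SIZE)

    def forward(self, x):
        h, _ = self.rnn(self.embed(x))  # (batch, length, hidden)
        return self.out(h)              # (batch, length, vocab)


model = TinyLM()

# Hint 1: random token ids are enough to check shapes.
tokens = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, SEQ_LEN + 1))
inputs, targets = tokens[:, :-1], tokens[:, 1:]
logits = model(inputs)
assert logits.shape == (BATCH_SIZE, SEQ_LEN, VOCAB_SIZE)

# Hint 2: train on this one batch only and watch the loss fall.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
losses = []
for step in range(300):
    optimizer.zero_grad()
    logits = model(inputs)
    loss = F.cross_entropy(logits.reshape(-1, VOCAB_SIZE), targets.reshape(-1))
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
```

The same two checks apply unchanged to an attention-based model: only the class definition needs to be swapped.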

## Pitfalls

- Handling batches of variable-length sequences (padding and masking)
- Keeping track of 4-D tensors of shape (batch $\times$ head $\times$ length $\times$ embedding_size)
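
Both pitfalls can be illustrated on a toy batch. The sketch below is one possible way to handle them, under stated assumptions: `PAD_ID = 0` is a hypothetical padding id, the sizes are arbitrary, queries, keys and values reuse the embeddings without learned projections, and the last dimension of the 4-D tensor is the per-head size (`EMBED_DIM // NUM_HEADS`). It pads three sequences to a common length, splits the embeddings into heads, and masks both future positions and padding in the attention scores.

```python
import math

import torch
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_sequence

torch.manual_seed(0)

PAD_ID = 0  # hypothetical id reserved for padding
VOCAB_SIZE, EMBED_DIM, NUM_HEADS = 20, 8, 2
HEAD_DIM = EMBED_DIM // NUM_HEADS

# Three sequences of different lengths (real token ids start at 1).
sequences = [torch.tensor([5, 3, 9, 2]), torch.tensor([7, 4]), torch.tensor([6, 1, 8])]
batch = pad_sequence(sequences, batch_first=True, padding_value=PAD_ID)  # (batch, length)
is_token = batch != PAD_ID                                               # (batch, length)

B, L = batch.shape
embed = torch.nn.Embedding(VOCAB_SIZE, EMBED_DIM, padding_idx=PAD_ID)
x = embed(batch)  # (batch, length, embed)

# Split the embedding dimension into heads: (batch, head, length, head_dim).
q = k = v = x.view(B, L, NUM_HEADS, HEAD_DIM).transpose(1, 2)

scores = q @ k.transpose(-2, -1) / math.sqrt(HEAD_DIM)  # (batch, head, length, length)

# A position may attend to itself and earlier positions, never to padding.
causal = torch.tril(torch.ones(L, L, dtype=torch.bool))  # (length, length)
allowed = causal & is_token[:, None, None, :]             # (batch, 1, length, length)
scores = scores.masked_fill(~allowed, float("-inf"))

weights = scores.softmax(dim=-1)  # rows sum to 1 over the allowed keys
context = weights @ v             # (batch, head, length, head_dim)
merged = context.transpose(1, 2).reshape(B, L, EMBED_DIM)  # (batch, length, embed)

# In the loss, ignore_index skips padded targets.
# Random logits stand in for model output; batch stands in for the shifted targets.
logits = torch.randn(B, L, VOCAB_SIZE)
loss = F.cross_entropy(logits.reshape(-1, VOCAB_SIZE), batch.reshape(-1), ignore_index=PAD_ID)
```

Because sequences are padded on the right, the first key is always a real token, so no attention row is fully masked and the softmax never produces NaN.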

In [None]:
!uv add torcheval scikit-learn matplotlib pandas

In [None]:
# Download the dataset
!wget https://download.pytorch.org/tutorial/data.zip
# On Mac: !curl -O https://download.pytorch.org/tutorial/data.zip

In [1]:
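# Extract to a temporary folder (assuming the archive's top-level folder is data/), then rename it to tatoeba as there is already a data folder
!unzip -q data.zip -d tmp_extract && mv tmp_extract/data tatoeba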

In [2]:
!head tatoeba/eng-fra.txt

Go.	Va !
Run!	Cours !
Run!	Courez !
Wow!	Ça alors !
Fire!	Au feu !
Help!	À l'aide !
Jump.	Saute.
Stop!	Ça suffit !
Stop!	Stop !
Stop!	Arrête-toi !


In [3]:
import pandas as pd

df = pd.read_csv('tatoeba/eng-fra.txt', sep='\t', names=('en', 'fr'))

In [4]:
df.head()

|   | en | fr |
|---|---|---|
| 0 | Go. | Va ! |
| 1 | Run! | Cours ! |
| 2 | Run! | Courez ! |
| 3 | Wow! | Ça alors ! |
| 4 | Fire! | Au feu ! |


In [5]:
# Join each pair into a single string: the French sentence first, then its English translation
df['sample'] = df.apply(lambda row: f"FRA: {row['fr']} ENG: {row['en']}", axis=1)
df['sample']

0                                        FRA: Va ! ENG: Go.
1                                    FRA: Cours ! ENG: Run!
2                                   FRA: Courez ! ENG: Run!
3                                 FRA: Ça alors ! ENG: Wow!
4                                  FRA: Au feu ! ENG: Fire!
                                ...                        
135837    FRA: Une empreinte carbone est la somme de pol...
135838    FRA: La mort est une chose qu'on nous décourag...
135839    FRA: Puisqu'il y a de multiples sites web sur ...
135840    FRA: Si quelqu'un qui ne connaît pas vos antéc...
135841    FRA: Il est peut-être impossible d'obtenir un ...
Name: sample, Length: 135842, dtype: object
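
Joining each pair into one `FRA: ... ENG: ...` string appears to suit a decoder-only model (option 3): at inference time the model would receive the text up to `ENG:` and be asked to continue. A minimal sketch of both string formats, where `make_sample` and `make_prompt` are hypothetical helper names that are not part of the notebook:

```python
def make_sample(fr: str, en: str) -> str:
    """Training string: source sentence followed by its translation."""
    return f"FRA: {fr} ENG: {en}"


def make_prompt(fr: str) -> str:
    """Inference string: same prefix, translation left for the model to complete."""
    return f"FRA: {fr} ENG:"


sample = make_sample("Au feu !", "Fire!")
prompt = make_prompt("Au feu !")
print(sample)  # FRA: Au feu ! ENG: Fire!
print(prompt)  # FRA: Au feu ! ENG:
```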

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
vec.fit(df['sample'])

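CountVectorizer parameters:

| Parameter | Value |
|---|---|
| input | `'content'` |
| encoding | `'utf-8'` |
| decode_error | `'strict'` |
| strip_accents | |
| lowercase | `True` |
| preprocessor | |
| tokenizer | |
| stop_words | |
| token_pattern | `'(?u)\\b\\w\\w+\\b'` |
| ngram_range | `(1, ...)` |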

In [7]:
analyzer = vec.build_analyzer()
analyzer("Je mange des carottes")

['je', 'mange', 'des', 'carottes']
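
The default `token_pattern` listed among the parameters above, `(?u)\b\w\w+\b`, keeps only runs of two or more word characters after lowercasing. Punctuation and one-letter words such as "y", "a" or "I" are therefore silently dropped, which matters for translation. The sketch below reproduces that behaviour with the standard `re` module; `simple_analyzer` is a hypothetical stand-in for `vec.build_analyzer()`, not scikit-learn's actual implementation.

```python
import re

# Same regular expression as CountVectorizer's default token_pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def simple_analyzer(text: str) -> list[str]:
    """Lowercase the text, then keep runs of two or more word characters."""
    return TOKEN_PATTERN.findall(text.lower())


print(simple_analyzer("Je mange des carottes"))
# ['je', 'mange', 'des', 'carottes']
print(simple_analyzer("FRA: Il y a un chat ! ENG: I see a cat."))
# ['fra', 'il', 'un', 'chat', 'eng', 'see', 'cat']
```

If one-letter words and punctuation should survive, a different `token_pattern` can be passed to `CountVectorizer`.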

In [9]:
len(vec.vocabulary_)

33696

In [10]:
df['tokens'] = df['sample'].map(analyzer)

In [11]:
df['tokens'].map(len).max()

np.int64(101)

In [12]:
# Map each token to its integer id in the fitted vocabulary
df['X'] = df['tokens'].map(lambda x: list(map(vec.vocabulary_.get, x)))
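
To read model predictions later, ids have to be mapped back to tokens. `vocabulary_` maps token to id, so one option is to invert it once. A sketch on a toy dictionary standing in for `vec.vocabulary_` (the four ids are the ones the notebook reports for the first sample; `encode` and `decode` are hypothetical helper names):

```python
# Toy stand-in for vec.vocabulary_ (token -> id).
vocabulary = {"fra": 13294, "va": 31616, "eng": 11109, "go": 14014}

# Invert the mapping once (id -> token) to decode sequences of ids.
id_to_token = {idx: token for token, idx in vocabulary.items()}


def encode(tokens):
    """Map tokens to ids; unknown tokens become None, as with dict.get."""
    return [vocabulary.get(token) for token in tokens]


def decode(ids):
    """Map ids back to tokens."""
    return [id_to_token[idx] for idx in ids]


ids = encode(["fra", "va", "eng", "go"])
print(ids)                       # [13294, 31616, 11109, 14014]
print(decode(ids))               # ['fra', 'va', 'eng', 'go']
print(encode(["fra", "chien"]))  # [13294, None]
```

The last line shows why `dict.get` needs care at inference time: a token absent from the vocabulary yields `None`, which cannot be turned into a tensor.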

In [13]:
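# Inputs: every token except the last
df['X-'] = df['X'].map(lambda l: l[:-1])
# Targets: the same sequence shifted left by one, so y[i] is the token that follows X-[i]
df['y'] = df['X'].map(lambda l: l[1:])
df.head()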

|   | en | fr | sample | tokens | X | X- | y |
|---|---|---|---|---|---|---|---|
| 0 | Go. | Va ! | FRA: Va ! ENG: Go. | [fra, va, eng, go] | [13294, 31616, 11109, 14014] | [13294, 31616, 11109] | [31616, 11109, 14014] |
| 1 | Run! | Cours ! | FRA: Cours ! ENG: Run! | [fra, cours, eng, run] | [13294, 7445, 11109, 26204] | [13294, 7445, 11109] | [7445, 11109, 26204] |
| 2 | Run! | Courez ! | FRA: Courez ! ENG: Run! | [fra, courez, eng, run] | [13294, 7428, 11109, 26204] | [13294, 7428, 11109] | [7428, 11109, 26204] |
| 3 | Wow! | Ça alors ! | FRA: Ça alors ! ENG: Wow! | [fra, ça, alors, eng, wow] | [13294, 32924, 1372, 11109, 32784] | [13294, 32924, 1372, 11109] | [32924, 1372, 11109, 32784] |
| 4 | Fire! | Au feu ! | FRA: Au feu ! ENG: Fire! | [fra, au, feu, eng, fire] | [13294, 2699, 12581, 11109, 12766] | [13294, 2699, 12581, 11109] | [2699, 12581, 11109, 12766] |
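
`X-` and `y` are the same sequence offset by one position: at each step the model sees the tokens so far and is trained to predict the next one. A small sketch that lists these (context, target) pairs for the first sample; `next_token_pairs` is a hypothetical helper, not part of the notebook:

```python
def next_token_pairs(tokens):
    """Return (context, target) pairs for next-token prediction."""
    inputs, targets = tokens[:-1], tokens[1:]
    return [(inputs[: i + 1], targets[i]) for i in range(len(inputs))]


pairs = next_token_pairs(["fra", "va", "eng", "go"])
for context, target in pairs:
    print(context, "->", target)
# ['fra'] -> va
# ['fra', 'va'] -> eng
# ['fra', 'va', 'eng'] -> go
```

Only the targets that come after `eng` belong to the translation. Whether the loss should also cover the French part is a design choice left open here.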


In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 135842 entries, 0 to 135841
Data columns (total 7 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   en      135842 non-null  object
 1   fr      135842 non-null  object
 2   sample  135842 non-null  object
 3   tokens  135842 non-null  object
 4   X       135842 non-null  object
 5   X-      135842 non-null  object
 6   y       135842 non-null  object
dtypes: object(7)
memory usage: 7.3+ MB
