# Lesson 11

## 00:00:07 - Blog recap

* Blog on super convergence: [The 1cycle policy](https://sgugger.github.io/the-1cycle-policy.html)
  * 5x faster than stepwise approaches.
  * Lets you use very high learning rates (somewhere between 1 and 3).
  * Trains at high learning rates for lots of the epochs: loss doesn't improve much but it's doing a lot of searching to find generalisable areas.
  * When learning rate is high, momentum is lower.
* Hamel Husain's blog on [sequence-to-sequence data products](https://towardsdatascience.com/how-to-create-data-products-that-are-magical-using-sequence-to-sequence-models-703f86a231f8)
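
The inverse relationship between learning rate and momentum in the 1cycle policy can be sketched as below. This is a simplified illustration, not the blog's implementation: the function name `one_cycle`, the linear ramps, the peak rate of 1.0, the divisor of 10 and the momentum range 0.95 to 0.85 are all assumptions chosen for the example, and the final annealing phase is left out.

```python
def one_cycle(step, total_steps, lr_max=1.0, div=10.0, mom_max=0.95, mom_min=0.85):
    """Return (learning_rate, momentum) for one step of a simplified 1cycle schedule."""
    half = total_steps / 2
    # pos goes 0 -> 1 over the first half of training and 1 -> 0 over the second.
    pos = step / half if step <= half else (total_steps - step) / half
    lr_min = lr_max / div
    lr = lr_min + pos * (lr_max - lr_min)
    # Momentum moves in the opposite direction to the learning rate.
    mom = mom_max - pos * (mom_max - mom_min)
    return lr, mom

# Illustrative schedule over 100 steps (hypothetical values).
schedule = [one_cycle(s, 100) for s in range(101)]
for s in (0, 25, 50, 75, 100):
    lr, mom = schedule[s]
    print(f"step {s:3d}: lr={lr:.3f} momentum={mom:.3f}")
```

The learning rate peaks at the midpoint, exactly where momentum is at its lowest.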

## 00:05:42 - Building a sequence-to-sequence model using machine translation

* Neural translation surpassed typical translation techniques as of 2016.
  * The path of neural translation is similar to image classification in 2012: it has just surpassed the state of the art and is now moving past it rapidly.
* Four big wins of Neural MT:
  1. End-to-end training: all params are optimised to minimise the loss function (fewer hyperparameters).
  2. Distributed representations share strength: they better exploit word and phrase similarities.
  3. Better exploitation of context: NMT can use a much bigger context to translate more accurately.
  4. More fluent text generation.
  
* Models use Bidirectional LSTM with Attention (which is obviously not just useful for machine translation).

## 00:10:05 - Translate French into English

* Basic idea: make it look like a standard NN problem, need 3 things:
  1. Data (x, y pairs)
  2. Architecture
  3. Loss function
  
* Lots of parallel corpora exist (some language -> some other language), especially for European documents.
* For bounding boxes, all the interesting stuff is in the loss function; for translation, it's all in the architecture.

## 00:13:16 - Neural translation walkthrough

* Take a sentence in English and put it through an RNN/Encoder.
* Encoder: piece of NN architecture that takes input and turns into some representation.
* Decoder: take the encoder / RNN output and convert into a sequence of French tokens.

* When translating, you don't know in advance how many words the output sentence should contain.
  * Key issue: the output has arbitrary length, which need not match the input length.
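
One common way to handle output of unknown length is to let the decoder emit one token at a time until it produces an end-of-sequence marker. The sketch below shows only that control flow. `greedy_decode` and `toy_step` are hypothetical names, and `toy_step` is a stand-in that replays a canned sentence rather than a trained decoder.

```python
EOS = "_eos_"

def greedy_decode(step_fn, state, max_len=30):
    """Generate tokens one at a time until the decoder emits EOS or max_len is reached."""
    tokens = []
    for _ in range(max_len):
        token, state = step_fn(state)
        if token == EOS:
            break
        tokens.append(token)
    return tokens

# Hypothetical stand-in for a trained decoder: the "state" is just a position
# in a canned French sentence.
def toy_step(state):
    canned = ["d'", "où", "venons", "-nous", "?", EOS]
    return canned[state], state + 1

source = ["where", "did", "we", "come", "from", "?"]
output = greedy_decode(toy_step, state=0)
print(len(source), len(output), output)
```

The six-token input yields a five-token output, and the `max_len` cap guards against a decoder that never emits the end marker.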
 
## 00:18:19 - RNN revision

* Revisit Lesson 6 if this lesson doesn't make sense.
* An RNN is a standard fully connected network: an input goes into a linear layer, whose output feeds the next layer, and so on. The one key difference: the second layer can also accept another input and concatenate it with the previous layer's output.

<img src="https://i.gyazo.com/900233717de09d0ac63b4330a2c6b877.gif" width="400px">

* Use the same weight matrix for each of the layer outputs and the same weight matrix for each input.

* The diagram can be refactored to be a for loop:

<img src="https://i.gyazo.com/a53c737b2b3c325112430d9d3ad4b6a5.gif" width="400px">

<img src="https://i.gyazo.com/82ecea54084aab349d420720a0caa647.gif" width="400px">

  * The refactoring is basically what makes it an RNN.
  
* RNNs can be stacked: output of one RNN can be fed into another:

<img src="https://i.gyazo.com/0383008c47d943200ea38423ffcb3071.gif" width="400px">

  * Need to be able to write it from scratch to really understand it.
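
As a from-scratch illustration of the for-loop refactoring, here is a minimal RNN forward pass in plain Python. It is a sketch under assumptions, not the lesson's code: the names `matvec` and `rnn_forward`, the `tanh` nonlinearity and the random weights are all chosen for the example.

```python
import math
import random

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_forward(inputs, W_in, W_hid, hidden_size):
    """Run a minimal RNN: the same two weight matrices are reused at every step."""
    h = [0.0] * hidden_size
    outputs = []
    for x in inputs:  # the for loop replaces the unrolled diagram
        # Input and previous hidden state are combined by addition here, which is
        # equivalent to concatenating them and multiplying by one larger matrix.
        pre = [a + b for a, b in zip(matvec(W_in, x), matvec(W_hid, h))]
        h = [math.tanh(p) for p in pre]
        outputs.append(h)
    return outputs

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
n_in, n_hid = 3, 4
sequence = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# First RNN layer: one hidden state per input step.
states = rnn_forward(sequence, rand_matrix(n_hid, n_in), rand_matrix(n_hid, n_hid), n_hid)

# Stacking: the outputs of the first RNN become the inputs of a second one.
stacked = rnn_forward(states, rand_matrix(n_hid, n_hid), rand_matrix(n_hid, n_hid), n_hid)
print(len(states), len(stacked))
```

Because the weights are shared across steps, the same function handles a sequence of any length.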

In [1]:
%matplotlib inline
%reload_ext autoreload
%autoreload 2

In [60]:
import re
from pathlib import Path
import pickle
from collections import Counter, defaultdict

import numpy as np

from fastai.text import Tokenizer, partition_by_cores

In [19]:
PATH = Path('data/translate')
TMP_PATH = PATH / 'tmp'
TMP_PATH.mkdir(exist_ok=True)
fname = 'giga-fren.release2.fixed'
en_fname = PATH / f'{fname}.en'
fr_fname = PATH / f'{fname}.fr'

### **Start do not rerun**

In [10]:
!wget http://www.statmt.org/wmt10/training-giga-fren.tar --directory-prefix={PATH}

--2018-08-04 15:34:53--  http://www.statmt.org/wmt10/training-giga-fren.tar
Resolving www.statmt.org (www.statmt.org)... 129.215.197.184
Connecting to www.statmt.org (www.statmt.org)|129.215.197.184|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2595102720 (2.4G) [application/x-tar]
Saving to: 'data/translate/training-giga-fren.tar’


2018-08-04 15:55:03 (2.05 MB/s) - 'data/translate/training-giga-fren.tar’ saved [2595102720/2595102720]



In [11]:
!cd {PATH} && tar -xvf training-giga-fren.tar

giga-fren.release2.fixed.en.gz
giga-fren.release2.fixed.fr.gz


In [12]:
!cd {PATH} && gunzip giga-fren.release2.fixed.en.gz && gunzip giga-fren.release2.fixed.fr.gz

* Training translation models takes a long time.
  * No conceptual difference between 2 and 8 layers: use 2 layers because we think it should be enough.
* Find questions that start with Wh (what, where, when etc) and match with French questions:

In [31]:
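# Raw strings avoid invalid-escape warnings for `\?`.
re_eng_questions = re.compile(r'^(Wh[^?.!]+\?)')
re_french_questions = re.compile(r'^([^?.!]+\?)')

lines = []

# Read both files in lockstep: line i of one file is the translation of line i of the other.
with open(en_fname, encoding='utf-8') as en_fh, open(fr_fname, encoding='utf-8') as fr_fh:
    for eq, fq in zip(en_fh, fr_fh):
        lines.append((
            re_eng_questions.search(eq),
            re_french_questions.search(fq)
        ))

# Keep only the pairs where both sides matched.
questions = [(e.group(), f.group()) for e, f in lines if e and f]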
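
To see what the two patterns accept, here is a small self-contained check on made-up sentence pairs (the `samples` list is invented for illustration and is not taken from the corpus):

```python
import re

re_eng_questions = re.compile(r'^(Wh[^?.!]+\?)')
re_french_questions = re.compile(r'^([^?.!]+\?)')

# Invented examples: only the first pair starts with an English "Wh" question.
samples = [
    ("What is light ? It is energy.", "Qu'est-ce que la lumière? C'est de l'énergie."),
    ("Light is energy.", "La lumière est de l'énergie."),
    ("Is it bright?", "Est-ce lumineux?"),
]

pairs = []
for en, fr in samples:
    e, f = re_eng_questions.search(en), re_french_questions.search(fr)
    if e and f:  # keep the pair only when both sides look like a question
        pairs.append((e.group(), f.group()))

print(pairs)
```

Each match stops at the first question mark, so any text after the opening question is dropped.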

In [32]:
pickle.dump(questions, (PATH / 'fr-en-qs.pkl').open('wb'))

### **End do not rerun**

In [33]:
questions = pickle.load((PATH / 'fr-en-qs.pkl').open('rb'))

* We now have 52k sentence pairs:

In [34]:
questions[:5], len(questions)

([('What is light ?', 'Qu’est-ce que la lumière?'),
  ('Who are we?', 'Où sommes-nous?'),
  ('Where did we come from?', "D'où venons-nous?"),
  ('What would we do without it?', 'Que ferions-nous sans elle ?'),
  ('What is the absolute location (latitude and longitude) of Badger, Newfoundland and Labrador?',
   'Quelle sont les coordonnées (latitude et longitude) de Badger, à Terre-Neuve-etLabrador?')],
 52331)

* Separate questions into each language:

In [36]:
en_qs, fr_qs = zip(*questions)

### **Start do not rerun**

* Tokenise using English and French tokenizer.

In [45]:
!python -m spacy download fr

Collecting https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-2.0.0/fr_core_news_sm-2.0.0.tar.gz
  Downloading https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-2.0.0/fr_core_news_sm-2.0.0.tar.gz (39.8MB)
    100% |████████████████████████████████| 39.8MB 2.4MB/s
Installing collected packages: fr-core-news-sm
  Running setup.py install for fr-core-news-sm ... done
Successfully installed fr-core-news-sm-2.0.0

    Linking successful
    /home/lex/anaconda3/envs/fastai/lib/python3.6/site-packages/fr_core_news_sm
    -->
    /home/lex/anaconda3/envs/fastai/lib/python3.6/site-packages/spacy/data/fr

    You can now load the model via spacy.load('fr')



In [51]:
en_tok = Tokenizer.proc_all_mp(partition_by_cores(en_qs))
fr_tok = Tokenizer.proc_all_mp(partition_by_cores(fr_qs), 'fr')

In [52]:
en_tok[0], fr_tok[0]

(['what', 'is', 'light', '?'],
 ['qu’', 'est', '-ce', 'que', 'la', 'lumière', '?'])

* Find the 90th percentile of sentence lengths in each language and use it to choose the maximum sequence length.

In [55]:
np.percentile([len(o) for o in en_tok], 90), np.percentile([len(o) for o in fr_tok], 90)

(23.0, 28.0)

In [54]:
keep = np.array([len(o) < 30 for o in en_tok])

In [56]:
en_tok = np.array(en_tok)[keep]
fr_tok = np.array(fr_tok)[keep]
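
The percentile-then-filter step can be illustrated on a toy corpus using only the standard library. This is a sketch under assumptions: the token lists are invented, and the percentile itself is used as the cut-off, whereas the notebook uses a fixed limit of 30. `statistics.quantiles` with `method="inclusive"` interpolates linearly, which should agree with the default behaviour of `np.percentile`.

```python
from statistics import quantiles

# Invented corpus: ten short sentences plus one 50-token outlier.
sizes = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 50]
en_toy = [["tok"] * n for n in sizes]
fr_toy = [["mot"] * (n + 1) for n in sizes]

lengths = [len(s) for s in en_toy]
# n=10 yields the 10th, 20th, ..., 90th percentiles; the last one is the 90th.
p90 = quantiles(lengths, n=10, method="inclusive")[-1]

# One mask, built from the English side, filters both languages so pairs stay aligned.
keep = [len(s) <= p90 for s in en_toy]
en_kept = [s for s, k in zip(en_toy, keep) if k]
fr_kept = [s for s, k in zip(fr_toy, keep) if k]

print(p90, len(en_kept), len(fr_kept))
```

Only the outlier is dropped, so one very long sentence no longer dictates the padded length of every batch.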

In [57]:
pickle.dump(en_tok, (PATH / 'en_tok.pkl').open('wb'))
pickle.dump(fr_tok, (PATH / 'fr_tok.pkl').open('wb'))

### **End do not rerun**

In [58]:
en_tok = pickle.load((PATH / 'en_tok.pkl').open('rb'))
fr_tok = pickle.load((PATH / 'fr_tok.pkl').open('rb'))

* Don't need to know a lot of NLP for deep learning on text, but the basics are useful, particularly tokenising.
* 00:28:37 - Some students in the study group are trying to build language models for Chinese. That needs a tokeniser like [SentencePiece](https://github.com/google/sentencepiece), since written Chinese doesn't separate individual words.

* Next, turn tokens into numbers:

In [62]:
def toks2ids(tok, pre):
    freq = Counter(p for o in tok for p in o)
    itos = [o for o, c in freq.most_common(40000)]
    itos.insert(0, '_bos_')
    itos.insert(1, '_pad_')
    itos.insert(2, '_eos_')
    itos.insert(3, '_unk')
    stoi = defaultdict(lambda: 3, {v: k for k, v in enumerate(itos)})
    ids = np.array([([stoi[o] for o in p] + [2]) for p in tok])
    np.save(TMP_PATH / f'{pre}_ids.npy', ids)
    pickle.dump(itos, open(TMP_PATH / f'{pre}_itos.pkl', 'wb'))
    return ids, itos, stoi

In [63]:
en_ids, en_itos, en_stoi = toks2ids(en_tok, 'en')
fr_ids, fr_itos, fr_stoi = toks2ids(fr_tok, 'fr')
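
The mapping that `toks2ids` builds can be followed on a two-sentence toy corpus. The sketch below keeps the same special tokens and the same `defaultdict` fallback to index 3, but skips saving to disk. `build_vocab` is a hypothetical helper name used only here.

```python
from collections import Counter, defaultdict

def build_vocab(tok, max_vocab=40000):
    """Build index-to-string and string-to-index maps, most frequent tokens first."""
    freq = Counter(w for sent in tok for w in sent)
    itos = ['_bos_', '_pad_', '_eos_', '_unk'] + [w for w, _ in freq.most_common(max_vocab)]
    # Any token missing from the vocabulary maps to index 3 ('_unk').
    stoi = defaultdict(lambda: 3, {w: i for i, w in enumerate(itos)})
    return itos, stoi

tok = [['what', 'is', 'light', '?'], ['who', 'are', 'we', '?']]
itos, stoi = build_vocab(tok)

# Each sentence becomes a list of ids with the '_eos_' id (2) appended.
ids = [[stoi[w] for w in sent] + [2] for sent in tok]

# 'darkness' was never seen, so it falls back to the unknown id.
unseen = [stoi[w] for w in ['what', 'is', 'darkness', '?']]

print(ids[0], [itos[i] for i in ids[0]], unseen)
```

The most frequent token (`'?'` here) takes the first slot after the four special tokens, and decoding through `itos` recovers the sentence plus its end marker.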

In [64]:
def load_ids(pre):
    ids = np.load(TMP_PATH / f'{pre}_ids.npy')
    itos = pickle.load(open(TMP_PATH / f'{pre}_itos.pkl', 'rb'))
    stoi = defaultdict(lambda: 3, {v:k for k, v in enumerate(itos)})
    return ids, itos, stoi

In [66]:
en_ids, en_itos, en_stoi = load_ids('en')
fr_ids, fr_itos, fr_stoi = load_ids('fr')

In [67]:
[fr_itos[o] for o in fr_ids[0]], len(en_itos), len(fr_itos)

(['qu’', 'est', '-ce', 'que', 'la', 'lumière', '?', '_eos_'], 17573, 24793)

## 00:33:01 - Word vectors

* Seq-to-seq with language models hasn't been explored yet in academia: lots of potential papers to be written.
* Word2Vec has been surpassed by a number of newer word vectors: FastText is a good choice.

#### **Start do not rerun**

In [70]:
!pip install git+https://github.com/facebookresearch/fastText.git

Collecting git+https://github.com/facebookresearch/fastText.git
  Cloning https://github.com/facebookresearch/fastText.git to /tmp/pip-req-build-erf46hfl
Building wheels for collected packages: fasttext
  Running setup.py bdist_wheel for fasttext ... done
  Stored in directory: /tmp/pip-ephem-wheel-cache-s3ecl52n/wheels/69/f8/19/7f0ab407c078795bc9f86e1f6381349254f86fd7d229902355
Successfully built fasttext


* Need to also download the fasttext word vectors:

In [83]:
#!wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.en.zip --directory-prefix={PATH}

In [84]:
#!wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.fr.zip  --directory-prefix={PATH}

In [87]:
!cd {PATH} && unzip wiki.en.zip && unzip wiki.fr.zip

Archive:  wiki.en.zip
  inflating: wiki.en.vec             
  inflating: wiki.en.bin             
Archive:  wiki.fr.zip
  inflating: wiki.fr.vec             
  inflating: wiki.fr.bin             


#### **End do not rerun**

In [88]:
import fastText as ft

In [89]:
english_vecs = ft.load_model(str((PATH / 'wiki.en.bin')))

In [90]:
french_vecs = ft.load_model(str((PATH / 'wiki.fr.bin')))

* Turn it into a dictionary:

In [None]:
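def get_vecs(lang, ft_vecs):
    """Map every word in the fastText vocabulary to its vector and pickle the result."""
    vecd = {w: ft_vecs.get_word_vector(w) for w in ft_vecs.get_words()}
    # f-string so {lang} is substituted; 'wb' so the file is opened for binary writing.
    pickle.dump(vecd, open(PATH / f'wiki.{lang}.pkl', 'wb'))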

* Preparing data for PyTorch

In [None]:
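# Assumed definitions (not shown in these notes): 90th-percentile sequence lengths.
enlen_90 = int(np.percentile([len(o) for o in en_ids], 90))
frlen_90 = int(np.percentile([len(o) for o in fr_ids], 90))
en_ids_tr = np.array([o[:enlen_90] for o in en_ids])
fr_ids_tr = np.array([o[:frlen_90] for o in fr_ids])
enlen_90, frlen_90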

In [None]:
def iter_to_numpy(*a):
    """convert iterable object into numpy array"""
    return np.array(a[0]) if len(a)==1 else [np.array(o) for o in a]

In [None]:
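from torch.utils.data import Dataset  # assumed import; the notes do not show where Dataset comes from

class Seq2SeqDataset(Dataset):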
    def __init__(self, x, y):
        self.x = x
        self.y = y
        
    def __getitem__(self, idx):
        return iter_to_numpy(self.x[idx], self.y[idx])
    
    def __len__(self):
        return len(self.x)

In [None]:
np.random.seed(42)
trn_keep = np.random.rand(len(en_ids_tr)) > 0.1
en_trn, fr_trn = en_ids_tr[trn_keep], fr_ids_tr[trn_keep]
en_val, fr_val = en_ids_tr[~trn_keep], fr_ids_tr[~trn_keep]
len(en_trn), len(en_val)
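
The split works because a single random mask is applied to both languages, so sentence pairs stay aligned. Here is a plain-Python sketch of the same idea on invented data (`en_toy` and `fr_toy` are placeholders, and the exact split sizes depend on the random seed):

```python
import random

random.seed(42)
n = 1000

# Invented id sequences: French sentence i is recognisable as the partner of English sentence i.
en_toy = [[i, 2] for i in range(n)]
fr_toy = [[i + n, 2] for i in range(n)]

# Roughly 90% of the entries are True (training), the rest False (validation).
trn_keep = [random.random() > 0.1 for _ in range(n)]

en_trn = [s for s, k in zip(en_toy, trn_keep) if k]
fr_trn = [s for s, k in zip(fr_toy, trn_keep) if k]
en_val = [s for s, k in zip(en_toy, trn_keep) if not k]
fr_val = [s for s, k in zip(fr_toy, trn_keep) if not k]

print(len(en_trn), len(en_val))
```

Drawing a separate mask for each language would pair sentences with the wrong translations.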

In [None]:
trn_ds = Seq2SeqDataset(fr_trn, en_trn)
val_ds = Seq2SeqDataset(fr_val, en_val)

In [None]:
bs = 125