# Loading Text Data

#### TorchText
TorchText helps load and preprocess NLP datasets. A good tutorial is available [here](https://towardsdatascience.com/how-to-use-torchtext-for-neural-machine-translation-plus-hack-to-make-it-5x-faster-77f3884d95), and the official documentation is [here](https://torchtext.readthedocs.io/en/latest/index.html).

![alt text](docs/imgs/torchtext_diagram.png "Title")

Main features of TorchText:
* Ability to define a preprocessing pipeline
* Batching, padding, and numericalizing (including building a vocabulary object)
* Wrapper for dataset splits (train, validation, test)
* Loading a custom NLP dataset

#### Spacy
spaCy is a production-oriented library for NLP tasks. Its main features include:
* Tokenization (What we want now)
* Part-of-speech tagging
* Similarity
* Serialization

Here we use spaCy only for its tokenizer, which splits sentences in many languages into individual tokens.

![alt text](docs/imgs/spacy_diagram.png "Title")

For examples and tutorials check [here](https://spacy.io/usage/spacy-101)

#### Tokenizer and Indexing
First we need to split our sentences into tokens, and then map each token to an integer index.

![alt text](docs/imgs/tokenizer_indexing.png "Title")
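
The two steps can be illustrated with a minimal, library-free sketch. It is not how spaCy or torchtext work internally: `simple_tokenize` is a hypothetical regex stand-in for a real tokenizer, and `build_index` is a hypothetical helper that numbers tokens in order of first appearance.

```python
import re

def simple_tokenize(text):
    # Rough stand-in for a real tokenizer: runs of word characters,
    # plus each punctuation mark as its own token.
    return re.findall(r"\w+|[^\w\s]", text)

def build_index(tokenized_sentences):
    # Give each new token the next free integer, in order of first appearance.
    index = {}
    for tokens in tokenized_sentences:
        for tok in tokens:
            index.setdefault(tok, len(index))
    return index

sentences = ["Hi! my name is Leo.", "my name is Ana."]
tokenized = [simple_tokenize(s) for s in sentences]
word_to_index = build_index(tokenized)
indexed = [[word_to_index[tok] for tok in tokens] for tokens in tokenized]

print(tokenized[0])  # ['Hi', '!', 'my', 'name', 'is', 'Leo', '.']
print(indexed)       # [[0, 1, 2, 3, 4, 5, 6], [2, 3, 4, 7, 6]]
```

Words shared by both sentences ("my", "name", "is", ".") map to the same index, which is what lets an embedding layer reuse one vector per word.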

#### Install spacy/torchtext and language support
```bash
pip install torchtext spacy
# Download the spaCy language models (short names like "en" are spaCy 2.x shortcuts)
python -m spacy download en
python -m spacy download de
python -m spacy download fr
python -m spacy download pt
```

#### Download Some Datasets
```bash
wget http://www.statmt.org/europarl/v7/fr-en.tgz
tar -zxvf fr-en.tgz
```

#### References
* https://medium.com/@debanjanmahata85/natural-language-processing-with-spacy-36b90b9afa3d
* https://spacy.io/usage/training
* [Tutorial on TorchText](http://anie.me/On-Torchtext/)
* https://mlexplained.com/2018/02/08/a-comprehensive-tutorial-to-torchtext/
* https://towardsdatascience.com/use-torchtext-to-load-nlp-datasets-part-i-5da6f1c89d84
* https://towardsdatascience.com/how-to-use-torchtext-for-neural-machine-translation-plus-hack-to-make-it-5x-faster-77f3884d95
* https://github.com/pytorch/text
* https://nlpforhackers.io/complete-guide-to-spacy/
* http://www.statmt.org/europarl/
* [Nice Sentiment Analysis using torchtext](https://medium.com/@sonicboom8/sentiment-analysis-torchtext-55fb57b1fab8)

In [1]:
import pandas as pd
import spacy
# Legacy torchtext API (Field, BucketIterator, TabularDataset); newer
# torchtext releases moved it to torchtext.legacy and later removed it
from torchtext import data, datasets
from torchtext.data import Field, BucketIterator, TabularDataset

# Used to split the data into train and validation sets
from sklearn.model_selection import train_test_split

# Load the spaCy pipelines used to tokenize French and English
spacy_fr = spacy.load('fr')
spacy_en = spacy.load('en')

SOS_WORD = '<s>'
EOS_WORD = '</s>'
BLANK_WORD = "<blank>"

MAX_LEN = 100
MIN_FREQ = 2

In [2]:
def tokenize_fr(text):
    return [tok.text for tok in spacy_fr.tokenizer(text)]

def tokenize_en(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]

print(tokenize_en('Hi! my name is Leo, and yours?'))

['Hi', '!', 'my', 'name', 'is', 'Leo', ',', 'and', 'yours', '?']


In [3]:
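# Source (French) sentences are only padded; target (English) sentences
# also get start-of-sentence and end-of-sentence markers
SRC = data.Field(tokenize=tokenize_fr, pad_token=BLANK_WORD)
TGT = data.Field(tokenize=tokenize_en, init_token=SOS_WORD, eos_token=EOS_WORD, pad_token=BLANK_WORD)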
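
To make the role of the three special tokens concrete, here is a small stand-alone sketch of what happens to a batch of target sentences. `wrap_and_pad` is a hypothetical helper written for illustration; it only imitates the effect of `init_token`, `eos_token`, and `pad_token`, and is not torchtext code.

```python
SOS_WORD, EOS_WORD, BLANK_WORD = "<s>", "</s>", "<blank>"

def wrap_and_pad(batch_tokens, init_token=None, eos_token=None, pad_token=BLANK_WORD):
    # Add the optional start/end markers around each sentence.
    wrapped = [
        ([init_token] if init_token else []) + tokens + ([eos_token] if eos_token else [])
        for tokens in batch_tokens
    ]
    # Right-pad every sentence to the length of the longest one in the batch.
    max_len = max(len(tokens) for tokens in wrapped)
    return [tokens + [pad_token] * (max_len - len(tokens)) for tokens in wrapped]

target_batch = [["Hello", "world"], ["Hi"]]
padded = wrap_and_pad(target_batch, init_token=SOS_WORD, eos_token=EOS_WORD)

for row in padded:
    print(row)
# ['<s>', 'Hello', 'world', '</s>']
# ['<s>', 'Hi', '</s>', '<blank>']
```

After padding, every sentence in the batch has the same length, so the batch can be stored as one rectangular tensor.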

#### Load Dataset

In [4]:
# Read the sentence-aligned corpus; the files are closed automatically
with open('./europarl-v7.fr-en.en', encoding='utf-8') as f:
    europarl_en = f.read().split('\n')
with open('./europarl-v7.fr-en.fr', encoding='utf-8') as f:
    europarl_fr = f.read().split('\n')

raw_data = {'English': europarl_en, 'French': europarl_fr}
df = pd.DataFrame(raw_data, columns=["English", "French"])

# Remove very long sentences and pairs whose lengths differ by a factor
# of 1.5 or more (length is approximated by counting spaces)
df['eng_len'] = df['English'].str.count(' ')
df['fr_len'] = df['French'].str.count(' ')
df = df.query('fr_len < 80 & eng_len < 80')
df = df.query('fr_len < eng_len * 1.5 & fr_len * 1.5 > eng_len')
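
The filter rule is easier to check on a few hand-made pairs. The sketch below re-expresses the two `query` conditions as a plain function; `keep_pair` and the example sentences are made up for illustration.

```python
def keep_pair(english, french, max_spaces=80, ratio=1.5):
    # Count spaces as a cheap proxy for sentence length, as the cell above does.
    eng_len = english.count(" ")
    fr_len = french.count(" ")
    # First condition: both sentences must be short enough.
    if not (eng_len < max_spaces and fr_len < max_spaces):
        return False
    # Second condition: neither side may reach 1.5x the length of the other.
    return fr_len < eng_len * ratio and fr_len * ratio > eng_len

pairs = [
    ("I agree with you .", "Je suis d'accord avec vous ."),
    ("Yes .", "Oui , bien sûr , je suis tout à fait d'accord ."),
    ("", ""),
]
kept = [keep_pair(en, fr) for en, fr in pairs]
print(kept)  # [True, False, False]
```

Because both comparisons are strict, pairs with no spaces at all (empty lines and single-word sentences) are dropped too, since `0 < 0 * 1.5` is false.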

#### Split Between Train/Val

In [5]:
# Hold out 10% of the sentence pairs as a validation set
train, val = train_test_split(df, test_size=0.1)

#### Convert to CSV

In [6]:
train.to_csv("train.csv", index=False)
val.to_csv("val.csv", index=False)

#### Create Pytorch Dataset
Now use the spaCy tokenizers and torchtext to process the dataset.

In [7]:
# Create source and target fields given the spacy tokenizers
SRC = data.Field(tokenize=tokenize_fr, pad_token=BLANK_WORD)
TGT = data.Field(tokenize=tokenize_en, init_token = SOS_WORD, eos_token = EOS_WORD, pad_token=BLANK_WORD)

# Associate the 'English' column with the TGT field and the 'French' column with the SRC field
data_fields = [('English', TGT), ('French', SRC)]
# skip_header=True keeps the CSV header row from being loaded as an example
train, val = data.TabularDataset.splits(path='./', train='train.csv', validation='val.csv', format='csv', fields=data_fields, skip_header=True)

# Alternative: load one of torchtext's built-in translation datasets
#train, val, test = datasets.IWSLT.splits(exts=('.fr', '.en'), fields=(SRC, TGT), 
#    filter_pred=lambda x: len(vars(x)['src']) <= MAX_LEN and len(vars(x)['trg']) <= MAX_LEN)

#### Get indexes for all words
This step assigns a specific index to every word. These indexes are the input to the embedding layer.

In [8]:
SRC.build_vocab(train, val, min_freq=MIN_FREQ)
TGT.build_vocab(train, val, min_freq=MIN_FREQ)
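
The effect of `min_freq` can be sketched without torchtext. The helper below is a hypothetical, simplified vocabulary builder: the special-token order (`<unk>` at 0, `<blank>` at 1) and the tie-breaking rule are assumptions chosen for illustration, not a statement of torchtext's exact behavior.

```python
from collections import Counter

def build_vocab(tokenized_sentences, min_freq=2, specials=("<unk>", "<blank>")):
    counts = Counter(tok for tokens in tokenized_sentences for tok in tokens)
    # Special tokens take the lowest indexes; frequent tokens come next.
    itos = list(specials)
    # Sort by descending frequency, breaking ties alphabetically.
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq:
            itos.append(tok)
    return {tok: i for i, tok in enumerate(itos)}

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "end"]]
stoi = build_vocab(corpus, min_freq=2)
unk = stoi["<unk>"]
# Words seen fewer than min_freq times fall back to the <unk> index.
encoded = [stoi.get(tok, unk) for tok in ["the", "dog", "sat"]]

print(stoi)     # {'<unk>': 0, '<blank>': 1, 'the': 2, 'sat': 3}
print(encoded)  # [2, 0, 3]
```

Raising `min_freq` shrinks the vocabulary (and the embedding matrix) at the cost of mapping more rare words to `<unk>`.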

In [9]:
print("Index of word 'the':", SRC.vocab.stoi['the'])

Index of word 'the': 5953


#### Get an Iterator

In [10]:
train_iter = BucketIterator(train, batch_size=10, sort_key=lambda x: len(x.French), shuffle=True)
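
`BucketIterator` tries to batch sentences of similar length together so that less padding is needed. The stand-alone sketch below shows why that matters, using made-up sentence lengths and a hypothetical `padding_fraction` helper. It compares fixed-size batches taken in the original order with batches taken after sorting by length.

```python
def padding_fraction(lengths, batch_size):
    # Each batch is padded to its longest sentence; measure the wasted share.
    padded_slots = real_tokens = 0
    for start in range(0, len(lengths), batch_size):
        chunk = lengths[start:start + batch_size]
        padded_slots += max(chunk) * len(chunk)
        real_tokens += sum(chunk)
    return 1 - real_tokens / padded_slots

lengths = [5, 40, 7, 38, 6, 41, 8, 39]  # made-up sentence lengths
unsorted_waste = padding_fraction(lengths, batch_size=4)
sorted_waste = padding_fraction(sorted(lengths), batch_size=4)

print(f"{unsorted_waste:.2f}")  # 0.43
print(f"{sorted_waste:.2f}")    # 0.06
```

In this toy case, mixing short and long sentences wastes about 43% of each batch on padding, while grouping by length brings that down to about 6%.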

In [11]:
batch = next(iter(train_iter))
print(batch.English)

tensor([[    2,     2,     2,     2,     2,     2,     2,     2,     2,     2],
        [   52,    15,   206,    45,   143,   143,  2567,    45,    29,  2809],
        [   12,   122,     6,    39,     5,   629,     4,    23,   587,     4],
        [   88,    11,  4570,   219,     7,    24,    63,     8,    12,   222],
        [   26,     4,  1357,    25,   199,    40,    59,  1270,   479,   807],
        [  204,  2597,     9,   175,     5,     4,     9,     5,     8,   139],
        [  297,   257,  1357,   794,  1114,    57,     4,    33,   954,     5],
        [   11,    46,    43,    25,     8,   172,    34,  2007,   512,    24],
        [   24,    18,     4,    19,     4,    11,    59,     5,   112,    32],
        [   32,  1108, 16121,    41,  5645,     4,   119,    11,   219,    48],
        [   48,     8,   851,    16,     7,  9386,    21,   109,   542,  1856],
        [  117,     4,   122,  1055,    11,  2701,    77,    17,    10,    11],
        [  155,    83,    11,    77,   5

#### More efficient way
While torchtext is brilliant, its `sort_key`-based batching leaves a little to be desired. Sentences in a batch often differ a lot in length, so you end up feeding a lot of padding into your network (the 1s in the batch printed above are padding).

A more efficient batching mechanism varies the batch size with the sequence length, so that roughly 1500 tokens are processed in each iteration.

In [12]:
# Running maxima of the sentence lengths in the batch being built
max_src_in_batch, max_tgt_in_batch = 0, 0


def batch_size_fn(new, count, sofar):
    "Keep augmenting batch and calculate total number of tokens + padding."
    global max_src_in_batch, max_tgt_in_batch
    # count == 1 means a new batch has just been started
    if count == 1:
        max_src_in_batch = 0
        max_tgt_in_batch = 0
    # French is the source (SRC); English is the target (TGT), which
    # also gets the <s> and </s> tokens, hence the + 2
    max_src_in_batch = max(max_src_in_batch, len(new.French))
    max_tgt_in_batch = max(max_tgt_in_batch, len(new.English) + 2)
    src_elements = count * max_src_in_batch
    tgt_elements = count * max_tgt_in_batch
    return max(src_elements, tgt_elements)


class MyIterator(data.Iterator):
    def create_batches(self):
        if self.train:
            # Sort large chunks by length, cut them into token-budgeted
            # batches, then shuffle the order of those batches
            def pool(d, random_shuffler):
                for p in data.batch(d, self.batch_size * 100):
                    p_batch = data.batch(
                        sorted(p, key=self.sort_key),
                        self.batch_size, self.batch_size_fn)
                    for b in random_shuffler(list(p_batch)):
                        yield b
            self.batches = pool(self.data(), self.random_shuffler)

        else:
            # Validation/test: keep the data order, sort only within a batch
            self.batches = []
            for b in data.batch(self.data(), self.batch_size,
                                self.batch_size_fn):
                self.batches.append(sorted(b, key=self.sort_key))
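
The token-budget idea behind `batch_size_fn` can be tried out without torchtext. The sketch below is a simplified, stand-alone re-implementation written for illustration, not torchtext's own batching code. `batch_by_tokens` is a hypothetical name, the `(src_len, tgt_len)` pairs are made up, and the budget of 70 is chosen only to keep the example small.

```python
def batch_by_tokens(lengths, max_tokens):
    # lengths: iterable of (src_len, tgt_len) pairs, ideally sorted by length.
    # Yields batches whose padded size (count * longest sentence) fits the budget.
    current, max_src, max_tgt = [], 0, 0
    for src_len, tgt_len in lengths:
        new_src = max(max_src, src_len)
        new_tgt = max(max_tgt, tgt_len + 2)  # +2 for the <s> and </s> markers
        padded_size = (len(current) + 1) * max(new_src, new_tgt)
        if current and padded_size > max_tokens:
            # Adding this pair would exceed the budget: emit and start over.
            yield current
            current, new_src, new_tgt = [], src_len, tgt_len + 2
        current.append((src_len, tgt_len))
        max_src, max_tgt = new_src, new_tgt
    if current:
        yield current

pairs = [(4, 5), (5, 5), (6, 4), (30, 28), (31, 33), (60, 58)]
batches = list(batch_by_tokens(pairs, max_tokens=70))

print([len(b) for b in batches])  # [3, 2, 1]
```

Short sentences end up in larger batches and long sentences in smaller ones, so each iteration processes a similar number of tokens. A single pair that exceeds the budget on its own is still emitted as a batch of one.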