## Install `PADL` and `sentencepiece`

In [None]:
!python -m pip install git+https://github.com/lf1-io/padl.git

!pip install sentencepiece

## Download `Sentiment140` dataset with 1.6 million tweets

In [None]:
# Make data directory if it doesn't exist
!mkdir -p data
!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/sentiment-analysis-is-bad/data/training.1600000.processed.noemoticon.csv.zip -P data
!unzip -n -d data data/training.1600000.processed.noemoticon.csv.zip

----

## Sentiment Analysis with Padl
This notebook implements and trains a Sentiment Monitor using `padl`.

We import `padl` along with all the other libraries we need and define some constants for the model. From `padl` we import `transform`, `batch`, `unbatch`, `same` and `identity`:
- `transform`: Any callable class implementing `__call__` or any class inheriting `torch.nn.Module` (and implementing `forward`) can become a `Transform` using the `transform` decorator. 

- `batch`: Stands for `padl.transforms.Batchify`, which marks the point where the dataloader is called: batches are created there and sent to the GPU.

- `unbatch`: Stands for `padl.transforms.Unbatchify`, which unbatches the output of the neural network and marks the beginning of the postprocessing stage, carried out on the CPU.

- `same`: Operator for calling methods or accessing attributes of the object passed through it. For example: `same.count(5)([5, 7, 8, 5, 5])  # outputs 3`

- `identity`: Stands for `padl.transforms.Identity()`, which is the Identity transform.
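
To make the behaviour of `same` more concrete, here is a plain-Python stand-in that mimics the two ways it is used in this notebook (method calls and indexing). The class `_Same` and the name `same_sketch` are hypothetical, and this is not how `padl` implements the operator; it only illustrates the idea of recording an operation and applying it once the input arrives.

```python
class _Same:
    """Toy stand-in: records an operation and applies it to a later input."""

    def __getattr__(self, name):
        # same_sketch.count(5) -> a function that calls x.count(5)
        def method(*args, **kwargs):
            return lambda x: getattr(x, name)(*args, **kwargs)
        return method

    def __getitem__(self, key):
        # same_sketch[0] -> a function that returns x[0]
        return lambda x: x[key]


same_sketch = _Same()  # hypothetical name, not part of padl

count_fives = same_sketch.count(5)
first_item = same_sketch[0]

print(count_fives([5, 7, 8, 5, 5]))    # 3
print(first_item(("some tweet", 1)))   # some tweet
```

The indexing form is the one the pipelines below rely on: `same[0]` picks the text and `same[1]` picks the label from each `(text, label)` pair.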

For our model we will also use some global constants: `VOCAB_SIZE`, `TRAIN_TEST_SPLIT`, `EMB_DIM`, `RNN_HIDDEN_SIZE`, `DECODER_HIDDEN`, `PADDING_PERCENTILE`.

In [None]:
import os 
import uuid
import numpy
import pandas
import sentencepiece
import torch
import matplotlib.pyplot as plt
import random

import padl
from padl import transform, batch, unbatch, same, identity

VOCAB_SIZE = 5000 # Size of the vocabulary used by our tokenizer  
TRAIN_TEST_SPLIT = 10000 # Number of samples held out for validation
EMB_DIM = 64 # Number of components of each embedding vector
RNN_HIDDEN_SIZE = 1024 # Hidden size of our recurrent layer
DECODER_HIDDEN = 64 # Number of hidden dimensions in the dense layers after the rnn
PADDING_PERCENTILE = 99 # Percentile of datapoints at which we want to truncate our padding

### The data
The dataset used in this notebook is `Sentiment140`, which contains 1.6 million tweets labelled as negative (0) or positive (4).

In [None]:
data = pandas.read_csv(
    "data/training.1600000.processed.noemoticon.csv",
    header=None,
    encoding='latin-1'
)

Let's check out the data.

In [None]:
data.head(10)

We only need the first (label) and the last (text) columns. We keep and rename them, map the labels from 0/4 to 0/1, and split the shuffled data into training and validation sets.

In [None]:
data = data.drop(labels=[1, 2, 3, 4], axis=1).rename(columns={0: 'label', 5: 'text'})
data['label'] = data['label'].apply(lambda x: int(x/4))
data_list = list(zip(data['text'], data['label']))
random.shuffle(data_list)
train_data = data_list[:-TRAIN_TEST_SPLIT]
valid_data = data_list[-TRAIN_TEST_SPLIT:]
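
As a small illustration of the label mapping and the split performed in the cell above, the following sketch runs the same steps on made-up data. The names `toy_rows`, `toy_pairs` and `holdout` are hypothetical, and the fixed seed is only there to make the sketch repeatable.

```python
import random

# Toy stand-in for the (text, label) pairs; labels use the raw 0/4 coding.
toy_rows = [(f"tweet {i}", 4 if i % 2 else 0) for i in range(10)]

# Map 0 -> 0 (negative) and 4 -> 1 (positive), as int(x / 4) does above.
toy_pairs = [(text, int(label / 4)) for text, label in toy_rows]

random.seed(0)  # fixed seed only to make the sketch repeatable
random.shuffle(toy_pairs)

holdout = 3  # plays the role of TRAIN_TEST_SPLIT
toy_train = toy_pairs[:-holdout]   # everything except the last `holdout` pairs
toy_valid = toy_pairs[-holdout:]   # the last `holdout` pairs

print(len(toy_train), len(toy_valid))               # 7 3
print(sorted({label for _, label in toy_pairs}))    # [0, 1]
```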

We dump a text file with one sentence on each line that will be used for training our tokenizer.

In [None]:
with open('corpus.txt', 'w') as f:
    f.writelines(data['text'][:-TRAIN_TEST_SPLIT].apply(lambda x: x + '\n').tolist())

### Creating the Transforms
It is time to define and instantiate the `Transform`s we are going to use in our model. With `padl` this is very easy! We write functions, classes implementing a `__call__` method, or classes inheriting from `torch.nn.Module` and implementing `forward`, and we add the `@transform` decorator. They are then ready to use `padl` features such as saving, composing and applying. As simple as that!

We create the following ones:
- `Bpe`: Consists of a tokenizer based on the byte pair encoding algorithm and uses the `sentencepiece` package.
- `Pad_Seq`: Pads our sentences so they have the same sequence length and can be processed into batches. In our case, we choose a padding length at the 99th percentile of lengths of the samples.
- `Embedding`: Our tokens embedder.
- `MyNN`: Class containing our architecture.
- `classify`: Postprocesses the output of the neural network.
- `to_tensor`: Converts the input to a `torch.Tensor`.
- `loss_function`: Loss function used for training, which is `torch.nn.CrossEntropyLoss`.
- `norm`: Computes probability values using a softmax function. This is used in infer mode to estimate the probabilities of positiveness and negativeness.

In [None]:
@transform
class Bpe:
    def __init__(self):
        self._model = None
        self.vocab_size = None
        self.dic = None
        self.model_prefix = str(uuid.uuid4())
        
    def __call__(self, x):
        return self._model.encode_as_ids(x)
    
    def fit(self, corpus_file):
        sentencepiece.SentencePieceTrainer.Train(
            f'--input={corpus_file} '
            f'--model_prefix={self.model_prefix} '
            f'--vocab_size={VOCAB_SIZE} '
            f'--character_coverage={1.0} '
            '--model_type=bpe '
        )
        self._model = sentencepiece.SentencePieceProcessor()
        self._model.Load(f'{self.model_prefix}.model')
        self.vocab_size = self._model.vocab_size()
        self.dic = {i:self._model.decode([i]) for i in range(self.vocab_size)}
        with open(f'{self.model_prefix}.model', 'rb') as f:
            self._content = f.read()
        os.remove(f'{self.model_prefix}.model')
        os.remove(f'{self.model_prefix}.vocab')
    
    def post_load(self, path, i):
        self._model = sentencepiece.SentencePieceProcessor()
        self._model.Load(str(path / f'{i}.model'))
    
    def pre_save(self, path, i):
        with open(path / f'{i}.model', 'wb') as f:
            f.write(self._content)


@transform
class Pad_Seq:
    def __init__(self, seq_len):
        self.seq_len = seq_len
    
    def __call__(self, seq):
        if len(seq) < self.seq_len:
            return seq + [2 for i in range(len(seq), self.seq_len)], [len(seq)]
        return seq[:self.seq_len], [self.seq_len] 


@transform
class MyNN(torch.nn.Module):
    def __init__(self, hidden_size, decoder_hidden, emb_dim):
        super().__init__()
        self.hidden_size = hidden_size
        self.decoder_hidden = decoder_hidden
        self.lstm = torch.nn.LSTM(
            input_size=emb_dim,
            hidden_size=self.hidden_size, 
            batch_first=True
        )
        self.lin1 = torch.nn.Linear(self.hidden_size, self.decoder_hidden)
        self.act = torch.nn.ReLU()
        self.lin2 = torch.nn.Linear(self.decoder_hidden, 2)
    
    def forward(self, x, lengths=None):
        out, state = self.lstm(x)
        if self.pd_mode != 'infer':
            output = [sentence[length.item() - 1 , :] for sentence, length in zip(out, lengths)]
            output = torch.stack(output)
        if self.pd_mode == 'infer':
            output = state[0].squeeze(0)
        dec = self.lin1(output)
        dec = self.act(dec)
        return self.lin2(dec)
    

@transform
def classify(x):
    negative_score = x[0].item()
    positive_score = x[1].item()
    if positive_score > 0.6:
        category = 'Positive'
    elif 0.4 < positive_score <= 0.6:
        category = 'Neutral'
    else:  # positive_score <= 0.4
        category = 'Negative'
    return {'Negativeness': round(negative_score, 2),
            'Positiveness': round(positive_score, 2), 'Sentiment': category}
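
To see what `Pad_Seq` returns, here is the same logic as a plain function without the `@transform` decorator, so it runs without `padl`. The function name `pad_or_truncate` and the constant `PAD_ID` are hypothetical; the padding id 2 is taken from the class above.

```python
PAD_ID = 2  # same padding id as in Pad_Seq above


def pad_or_truncate(seq, seq_len, pad_id=PAD_ID):
    """Return (fixed-length sequence, [effective length])."""
    if len(seq) < seq_len:
        # Short input: fill up with the padding id, report the true length.
        padded = seq + [pad_id] * (seq_len - len(seq))
        return padded, [len(seq)]
    # Long input: cut at seq_len, so the effective length is seq_len.
    return seq[:seq_len], [seq_len]


print(pad_or_truncate([11, 12, 13], 5))
# ([11, 12, 13, 2, 2], [3])
print(pad_or_truncate([11, 12, 13, 14, 15, 16], 5))
# ([11, 12, 13, 14, 15], [5])
```

The second element of the result is the length that `MyNN.forward` uses outside infer mode to pick the LSTM output at the last non-padded position.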

Initialize the Byte Pair Encoder and print it. 

In [None]:
bpe = Bpe()
bpe

Fit the Byte Pair Encoder on our corpus.

In [None]:
bpe.fit('corpus.txt')

Choose a padding length and define the remaining components of our model.

In [None]:
random_sample = [train_data[i][0] for i in numpy.random.permutation(len(train_data))[:10000]]
len_list = [len(bpe(sent)) for sent in random_sample]
seq_len = int(numpy.quantile(len_list, 0.01 * PADDING_PERCENTILE))

print(f'sequence-length chosen on {PADDING_PERCENTILE}th percentile: {seq_len}')

pad = Pad_Seq(seq_len)
to_tensor = transform(lambda x: torch.LongTensor(x))
emb = transform(torch.nn.Embedding)(VOCAB_SIZE, EMB_DIM)
nn = MyNN(
    hidden_size=RNN_HIDDEN_SIZE,
    decoder_hidden=DECODER_HIDDEN,
    emb_dim=EMB_DIM,
)
loss_function = transform(torch.nn.CrossEntropyLoss)()
norm = transform(torch.nn.Softmax)(dim=-1)

Let's plot the distribution of sequence lengths for a subsample of our data. We choose a padding length such that 99% of our sentences are not cut off.

In [None]:
plt.hist(len_list, bins=20);
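
The effect of choosing the padding length at a high percentile can be sketched without `numpy`. The helper `quantile_linear` below is a hypothetical plain-Python function that interpolates linearly between neighbouring sorted values, which should match the default behaviour of `numpy.quantile`; the token counts in `toy_lengths` are made up.

```python
import math


def quantile_linear(values, q):
    """Quantile with linear interpolation between neighbouring sorted values."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


# Hypothetical token counts: 99 short tweets and one very long outlier.
toy_lengths = list(range(1, 100)) + [400]

seq_len_sketch = int(quantile_linear(toy_lengths, 0.99))
kept_whole = sum(length <= seq_len_sketch for length in toy_lengths)

print(seq_len_sketch, kept_whole)  # 102 99
```

With these toy numbers only the single outlier is truncated, while every padded sequence is 102 tokens long instead of 400.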

### Building and training the model.

We now build our training and inference pipelines. Here is a quick reminder of the `padl` operators:
- `>>`: Compose operator: $(f_1 >> f_2)(x) \rightarrow f_2(f_1(x))$
- `+`: Rollout operator: $(f_1 + f_2) (x) \rightarrow (f_1(x), f_2(x))$
- `/`: Parallel operator: $(f_1 / f_2)((x_1,x_2)) \rightarrow (f_1(x_1), f_2(x_2))$
- `~`: Map operator: $({\sim}f)([x_1, ..., x_n]) \rightarrow [f(x_1), ..., f(x_n)]$
- `-`: Name operator: Names a transform so that its output can be accessed by the given name, or so that the transform itself can be accessed by its name from the pipeline:  
    - $((f_1 - \text{'zulu'})+f_2)(x) \rightarrow \text{Namedtuple}(\text{'zulu'}:f_1(x), \text{'out_1'}:f_2(x))$
    - $((f_1 - \text{'zulu'})+f_2)[\text{'zulu'}] = f_1$
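
The semantics of the first four operators can be written down as plain-Python higher-order functions. The helper names `compose`, `rollout`, `parallel` and `map_over` are hypothetical and are not part of `padl`; the sketch only mirrors the formulas in the list above.

```python
def compose(f1, f2):
    """f1 >> f2: feed the output of f1 into f2."""
    return lambda x: f2(f1(x))


def rollout(f1, f2):
    """f1 + f2: apply both functions to the same input."""
    return lambda x: (f1(x), f2(x))


def parallel(f1, f2):
    """f1 / f2: apply f1 to the first item and f2 to the second."""
    return lambda pair: (f1(pair[0]), f2(pair[1]))


def map_over(f):
    """~f: apply f to every item of a sequence."""
    return lambda items: [f(item) for item in items]


def double(x):
    return 2 * x


def increment(x):
    return x + 1


print(compose(double, increment)(3))          # 7
print(rollout(double, increment)(3))          # (6, 4)
print(parallel(double, increment)((3, 10)))   # (6, 11)
print(map_over(double)([1, 2, 3]))            # [2, 4, 6]
```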

In [None]:
# training pipeline
data_model = ( 
    same[0] 
    >> bpe 
    >> pad 
    >> ~ to_tensor  
    >> batch
    >> emb / identity  
    >> nn
)
targets = same[1] >> batch
model = data_model + targets >> loss_function

# inference pipeline for easy human readability of the output
infer_model = (
    bpe
    >> to_tensor
    >> batch
    >> emb
    >> nn
    >> norm
    >> unbatch
    >> classify
)

In [None]:
# Print the inference model
infer_model

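Below, the model is trained with the Adam optimizer for one epoch, stopping early once the iteration counter reaches `max_itr = 201`. The training loss is logged every 5 iterations, the model is validated every 50 iterations, and the inference pipeline is saved with the `pd_save` method whenever the validation accuracy improves.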

`padl` provides a built-in way to save a `Transform`: the `pd_save` method. A `Transform` inheriting from `torch.nn.Module` is saved by default using the `torch` saving functionality. If another `Transform` needs to save extra state, like `Bpe` in this example, we define how to save and load it in the `pre_save` and `post_load` methods, respectively. To overwrite a saved `padl` model that already exists at the same path, we set the argument `force_overwrite` to `True`.
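
The `pre_save` / `post_load` pattern can be illustrated in isolation. The sketch below uses a hypothetical class `BlobHolder` and calls the two hooks by hand on a temporary folder; how and when `padl` invokes them is not shown here, and the `(path, i)` arguments simply follow the `Bpe` class defined earlier.

```python
import tempfile
from pathlib import Path


class BlobHolder:
    """Toy object holding bytes that must survive a save/load cycle."""

    def __init__(self, content=b""):
        self._content = content

    def pre_save(self, path, i):
        # Write the extra state into the folder of the saved model.
        with open(Path(path) / f"{i}.model", "wb") as f:
            f.write(self._content)

    def post_load(self, path, i):
        # Read the extra state back after the object has been re-created.
        with open(Path(path) / f"{i}.model", "rb") as f:
            self._content = f.read()


# Calling the hooks manually; in the notebook this is left to padl.
with tempfile.TemporaryDirectory() as folder:
    BlobHolder(b"tokenizer bytes").pre_save(folder, 0)
    restored = BlobHolder()
    restored.post_load(folder, 0)

print(restored._content)  # b'tokenizer bytes'
```

`Bpe` follows the same idea: `fit` keeps the serialized `sentencepiece` model in `self._content`, `pre_save` writes those bytes out, and `post_load` loads the tokenizer from the written file.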

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print('Device to be used: ', device)

In [None]:
model.pd_to(device)  # use the device selected above instead of hard-coding 'cuda'
max_accuracy = 0.
optimizer = torch.optim.Adam(model.pd_parameters(), lr=1e-4)
it = 0
num_epochs = 1
max_itr = 201
train_batch_size = 2000
valid_batch_size = 2000

if os.path.exists('train_file.csv'):
    os.remove('train_file.csv')

if os.path.exists('valid_file.csv'):
    os.remove('valid_file.csv')

for epoch in range(num_epochs):
    print('Start epoch %d'%epoch)
    for loss in model.train_apply(train_data, batch_size=train_batch_size):
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if it % 5 == 0:
            print(f'TRAIN iteration {it}; loss: {loss.item()}')
            with open('train_file.csv', 'a') as f:
                f.write(f'{loss.item()}\n')
        if it % 50 == 0:
            counter = 0.
            accuracy = 0.
            for res, targets in model[:-1].eval_apply(valid_data, batch_size=valid_batch_size):
                top_prob, preds = res.topk(1, dim=1)
                correct = (preds.view(-1) == targets)
                accuracy += torch.mean(correct.type(torch.FloatTensor))
                counter += 1
            accuracy = accuracy/counter
            print(f'VALID_accuracy: {accuracy}')
            with open('valid_file.csv', 'a') as f:
                f.write(f'{accuracy}\n')
            if accuracy > max_accuracy:
                max_accuracy = accuracy
                print('Saving...')
                infer_model.pd_save('sent_analysis.padl', force_overwrite=True)
        if it == max_itr:
            break
        it += 1 
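
The validation step in the loop above averages a top-1 accuracy over batches. A dependency-free sketch of that computation, with made-up scores and the hypothetical helper `batch_accuracy`, looks like this:

```python
def batch_accuracy(scores, labels):
    """Fraction of rows whose highest-scoring class matches the label."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Two toy batches of (class scores, labels) for a two-class problem.
toy_batches = [
    ([[0.2, 0.8], [0.9, 0.1]], [1, 0]),   # both predictions correct
    ([[0.6, 0.4], [0.3, 0.7]], [1, 1]),   # one of two correct
]

per_batch = [batch_accuracy(scores, labels) for scores, labels in toy_batches]
mean_accuracy = sum(per_batch) / len(per_batch)

print(mean_accuracy)  # 0.75
```

Averaging per-batch accuracies equals the overall accuracy only when all batches have the same size. That holds in this notebook, where the 10000 validation samples split into five batches of 2000.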

Now we can load and use our trained model with the `padl.load` function.

In [None]:
loaded_model = padl.load('sent_analysis.padl')

In [None]:
loaded_model.infer_apply('Padl is a powerful and super cool tool!')

And that's it! This is how easy it is to build, train, save and load models with `padl`.