# Character-based Language Model with AllenNLP

## Objective

In this task, we switch to character-based language models. We implement the same deep learning multinomial classification approach that we completed in the previous task.

The goal remains to predict the next token given a preceding sequence of tokens. However, by using characters as tokens, instead of words, we solve 2 problems:

There are no more Out Of Vocabulary (OOV) tokens, since all the characters are known in advance.
The total number of classes to predict is reduced to a few dozen characters instead of thousands of different words. (This is true for alphabetic-based scripts such as Latin, Arabic or Cyrillic but not in the case of logographic scripts used in Mandarin, Korean, or Japanese.)
To build a character-based language model on our domain specific corpus, we will use the AllenNLP framework. AllenNLP is a state-of-the-art NLP framework created by the Allen Institute for AI. Its generic approach allows us to work on a wide range of NLP problems. AllenNLP is based on PyTorch.

Experimenting with character based language models underlines the difference with word based models in terms of the implementation process and the resulting outputs (generated texts and perplexity scores). In particular, switching from words to characters as the target multiclass has 2 main advantages:

The number of target classes is drastically reduced from thousands of tokens to less than a hundred characters.
All characters are known in advance.
The abstraction level of the AllenNLP framework makes it particularly well suited to handle all sorts of NLP tasks (POS, NER, ). And the investment required to learn the framework is well worth it.

## Background

- Character set

As you will notice, the set of characters present in the corpus is much richer than the expected set of ASCII letters and digits. It may be useful to filter out rare and noisy characters.

- AllenNLP

AllenNLP is based on PyTorch and therefore enforces a precise object-oriented code structure with predefined classes and functions. Although specific to AllenNLP, these constraints bring an efficient conceptualization of diverse NLP tasks. The return on investment (ROI) is worth the time needed to learn the framework.

## Explore and analyze the set of unique characters present in the dataset.

In [1]:
import sys
sys.executable

'/home/michael/.pyenv/versions/3.7.1/envs/allennlp=0.9.0/bin/python3.7'

In [6]:
# !pip install pandas-profiling

In [45]:
import pandas as pd
import numpy as np
import re
import csv
from tqdm import tqdm
import torch
from collections import defaultdict, Counter
from typing import List, Dict, Tuple
import allennlp
from allennlp.common.util import START_SYMBOL, END_SYMBOL
from allennlp.data.tokenizers.character_tokenizer import CharacterTokenizer
from allennlp.modules.token_embedders import Embedding
from allennlp.modules.text_field_embedders import BasicTextFieldEmbedder, TextFieldEmbedder
from allennlp.modules.seq2seq_encoders import PytorchSeq2SeqWrapper
from allennlp.data.iterators import BasicIterator
from allennlp.training.trainer import Trainer
from allennlp.nn.util import get_text_field_mask, sequence_cross_entropy_with_logits
from allennlp.data.tokenizers.token import Token
from allennlp.data.token_indexers import TokenIndexer, SingleIdTokenIndexer
from allennlp.data.fields import TextField
from allennlp.data.instance import Instance
from allennlp.data.vocabulary import Vocabulary, DEFAULT_PADDING_TOKEN
from allennlp.models import Model
from torch.optim import Adam

In [46]:
allennlp.__version__

'0.9.0'

In [3]:
df_raw = pd.read_csv("../data/stackexchange_812k.tokenized.csv").sample(frac=1, random_state=8).reset_index(drop=True)

In [4]:
#Concatenate all the original texts from the dataset and list the unique 
text = ''.join(df_raw.text.values).lower()

# simple but efficient way to split string into list of characters
all_characters = [s for s in text]
unique_characters = np.unique(all_characters) 
print(unique_characters)

char_count = Counter(all_characters)
print(char_count.most_common())

# limit the allowed characters to MAX_VOCAB_SIZE
MAX_VOCAB_SIZE = 40
valid_characters = [t[0] for t in  char_count.most_common(MAX_VOCAB_SIZE)]
valid_characters.sort()

['\t' '\x0b' '\x0c' ' ' '!' "'" ',' '-' '.' '?' '\\' 'a' 'b' 'c' 'd' 'e'
 'f' 'g' 'h' 'i' 'j' 'k' 'l' 'm' 'n' 'o' 'p' 'q' 'r' 's' 't' 'u' 'v' 'w'
 'x' 'y' 'z' '\x7f' '\xa0' '¡' '¢' '£' '¥' '¦' '§' '¨' '©' 'ª' '«' '¬'
 '\xad' '®' '¯' '°' '±' '²' '³' '´' 'µ' '¶' '·' '¹' 'º' '»' '¼' '½' '¾'
 '¿' '×' 'ß' 'à' 'á' 'â' 'ã' 'ä' 'å' 'æ' 'ç' 'è' 'é' 'ê' 'ë' 'ì' 'í' 'î'
 'ï' 'ð' 'ñ' 'ò' 'ó' 'ô' 'õ' 'ö' '÷' 'ø' 'ù' 'ú' 'û' 'ü' 'ý' 'ā' 'ă' 'ą'
 'ć' 'č' 'ē' 'ĕ' 'ė' 'ę' 'ğ' 'ī' 'ı' 'ĺ' 'ļ' 'ł' 'ń' 'ō' 'ő' 'œ' 'ř' 'ś'
 'ş' 'š' 'ū' 'ů' 'ŷ' 'ź' 'ž' 'ƒ' 'ơ' 'ƴ' 'ț' 'ȳ' 'ɑ' 'ə' 'ɛ' 'ɣ' 'ɪ' 'ɵ'
 'ʃ' 'ʊ' 'ʒ' 'ʼ' 'ˆ' 'ˇ' 'ˈ' 'ˉ' 'ˌ' '˙' '˚' '˜' '̀' '́' '̂' '̃' '̄' '̅'
 '̇' '̈' '̧' '̶' '̸' '͝' ';' '΄' 'ά' 'έ' 'ή' 'ί' 'α' 'β' 'γ' 'δ' 'ε' 'ζ'
 'η' 'θ' 'ι' 'κ' 'λ' 'μ' 'ν' 'ξ' 'ο' 'π' 'ρ' 'ς' 'σ' 'τ' 'υ' 'φ' 'χ' 'ψ'
 'ω' 'ό' 'ύ' 'ώ' 'ϐ' 'ϕ' 'ϵ' 'а' 'б' 'в' 'г' 'д' 'е' 'ж' 'з' 'и' 'й' 'к'
 'л' 'м' 'н' 'о' 'п' 'р' 'с' 'т' 'у' 'ф' 'х' 'ц' 'ч' 'ш' 'щ' 'ъ' 'ы' 'ь'
 'э' 'ю' 'я' 'ё' 'є' 'א' 'ב' 'ד' 'ה' 'ו' 'ח' 'י' 'כ' '

## Subsample the dataset to take into account the titles only.

In [5]:
POSTS_TYPE = 'title'
DF_SAMPLE_COUNT = 10000

# subsample the original dataset

df = df_raw[(df_raw.category == POSTS_TYPE)].sample(DF_SAMPLE_COUNT).reset_index(drop = True)

print("df.shape: ", df.shape)

print(df.text.sample(2).values)

df.shape:  (10000, 7)
['Dice probability for Yahtzee large straight'
 'validation of a Zero Adjusted Gamma model']


## Implement the tokenization of the dataset and transform the tokens into AllenNLP Instances.

In [6]:
tokenizer = CharacterTokenizer()
train_set = df.text.apply(lambda txt : tokenizer.tokenize(txt.lower())).values

In [8]:
def tokens_to_instance(tokens: List[Token], token_indexers: Dict[str, TokenIndexer]):
    tokens = list(tokens)
    tokens.insert(0, Token(START_SYMBOL))
    tokens.append(Token(END_SYMBOL))

    input_field  = TextField(tokens[:-1], token_indexers)
    output_field = TextField(tokens[1:], token_indexers)
    return Instance({'input_tokens': input_field, 'output_tokens': output_field})        

In [9]:
token_indexers = {'tokens': SingleIdTokenIndexer()}
instances = [tokens_to_instance(tokens, token_indexers) for tokens in train_set]
token_counts = {char: 1 for char in valid_characters}
vocab = Vocabulary({'tokens': token_counts})

## Design a RNN with AllenNLP that includes:
- an embedding of the tokens,
- a seq2seq LSTM layer,
- a feed forward layer that outputs probability distribution of the characters

AllenNLP is a modelling framework which is capable of many types of models. Most of the state of the art results in academia you will see are variations on Transformer networks which are not directly covered in this course but are available within AllenNLP. LSTMs and Seq2Seq models are still very competitive and widely used. In industry however you have to weigh up a lot of things to decide what the best model is; size of the dataset, how long you have to train/fine tune the model, infrastructure available, how the model will be served, what an acceptable level of accuracy is etc. Normally you start with a very simple model and gradually try more complex models until you reach an acceptable level of performance. It is usually a mistake to jump straight to an LSTM or Transformer model without having simpler baselines to benchmark them against and you will be surprised how often a simple model will achieve the goal.

In [11]:
EMBEDDING_SIZE = 32
HIDDEN_SIZE = 256
BATCH_SIZE = 128

In [37]:
if torch.cuda.is_available():
    cuda_device = 0
else:
    cuda_device = -1
    
cuda_device == 0

True

In [33]:
# @Model.register('rnn_language_model')
class RNNLanguageModel(Model):
    def __init__(self,
                 embedder: TextFieldEmbedder,
                 hidden_size: int,
                 max_len: int,
                 vocab: Vocabulary) -> None:
        super().__init__(vocab)

        self.embedder = embedder

        # initialize a Seq2Seq encoder, LSTM
        self.rnn = PytorchSeq2SeqWrapper(
            torch.nn.LSTM(EMBEDDING_SIZE, HIDDEN_SIZE, batch_first=True))

        self.hidden2out = torch.nn.Linear(in_features=self.rnn.get_output_dim(), out_features=vocab.get_vocab_size('tokens'))
        self.hidden_size = hidden_size
        self.max_len = max_len

    def forward(self, input_tokens, output_tokens):
        '''
        This is the main process of the Model where the actual computation happens. 
        Each Instance is fed to the forward method. 
        It takes dicts of tensors as input, with same keys as the fields in your Instance (input_tokens, output_tokens)
        It outputs the results of predicted tokens and the evaluation metrics as a dictionary. 
        '''

        mask = get_text_field_mask(input_tokens)
        embeddings = self.embedder(input_tokens)
        rnn_hidden = self.rnn(embeddings, mask)
        out_logits = self.hidden2out(rnn_hidden)
        loss = sequence_cross_entropy_with_logits(out_logits, output_tokens['tokens'], mask)

        return {'loss': loss}

    def generate(self) -> Tuple[List[Token], torch.tensor]:

        start_symbol_idx = self.vocab.get_token_index(START_SYMBOL, 'tokens')
        end_symbol_idx = self.vocab.get_token_index(END_SYMBOL, 'tokens')
        padding_symbol_idx = self.vocab.get_token_index(DEFAULT_PADDING_TOKEN, 'tokens')

        log_likelihood = 0.
        words = []
        state = (torch.zeros(1, 1, self.hidden_size).to(cuda_device), torch.zeros(1, 1, self.hidden_size).to(cuda_device))

        word_idx = start_symbol_idx

        for i in range(self.max_len):
            tokens = torch.tensor([[word_idx]]).to(cuda_device)

            embeddings = self.embedder({'tokens': tokens})
            output, state = self.rnn._module(embeddings, state)
            output = self.hidden2out(output)

            log_prob = torch.log_softmax(output[0, 0], dim=0)

            dist = torch.exp(log_prob)

            word_idx = start_symbol_idx

            while word_idx in {start_symbol_idx, padding_symbol_idx}:
                word_idx = torch.multinomial(
                   dist, num_samples=1, replacement=False).item()

            log_likelihood += log_prob[word_idx]

            if word_idx == end_symbol_idx:
                break

            token = Token(text=self.vocab.get_token_from_index(word_idx, 'tokens'))
            words.append(token)

        return words, log_likelihood    

## Train the model.

In [38]:
token_embedding = Embedding(num_embeddings=vocab.get_vocab_size('tokens'), embedding_dim=EMBEDDING_SIZE)

embedder = BasicTextFieldEmbedder({"tokens": token_embedding})

model = RNNLanguageModel(embedder=embedder, hidden_size=HIDDEN_SIZE, max_len=80, vocab=vocab)

if cuda_device == 0:
    model.to(cuda_device)

In [40]:
iterator = BasicIterator(batch_size=BATCH_SIZE)
iterator.index_with(vocab)
optimizer = Adam(model.parameters(), lr=5.e-3)

In [47]:
trainer = Trainer(model=model,
                  optimizer=optimizer,
                  iterator=iterator,
                  train_dataset=instances,
                  num_epochs=40,
                  cuda_device=cuda_device)
trainer.train()

loss: 0.9274 ||: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 79/79 [00:06<00:00, 11.94it/s]
loss: 0.9187 ||: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 79/79 [00:06<00:00, 11.91it/s]
loss: 0.9148 ||: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 79/79 [00:06<00:00, 11.90it/s]
loss: 0.9087 ||: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 79/79 [00:06<00:00, 11.91it/s]
loss: 0.8989 ||: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████

loss: 0.7860 ||: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 79/79 [00:06<00:00, 11.93it/s]
loss: 0.7901 ||: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 79/79 [00:06<00:00, 11.76it/s]


{'best_epoch': 39,
 'peak_cpu_memory_MB': 6254.796,
 'peak_gpu_0_memory_MB': 947,
 'training_duration': '0:04:31.333810',
 'training_start_epoch': 0,
 'training_epochs': 39,
 'epoch': 39,
 'training_loss': 0.7900567213191262,
 'training_cpu_memory_MB': 6254.796,
 'training_gpu_0_memory_MB': 947}

## Evaluate the model by calculating the loss of some sentences and by generating text.

In [48]:
def predict(text: str, model: Model) -> float:
    tokenizer = CharacterTokenizer()
    tokens = tokenizer.tokenize(text)

    token_indexers = {'tokens': SingleIdTokenIndexer()}
    instance = tokens_to_instance(tokens, token_indexers)
    output = model.forward_on_instance(instance)
    print(output)

In [49]:
sentence = "In a fixed-effects model only time-varying variables can be used."
predict(sentence, model)

sentence = "I know a pretty little place in Southern California, down San Diego way."
predict(sentence, model)

sentence = "This that is noon but yes apple whatever did regression variable"
predict(sentence, model)

{'loss': 1.600434}
{'loss': 3.3273149}
{'loss': 2.8697264}


In [50]:
for _ in range(50):
    tokens, _ = model.generate()
    print(''.join(token.text for token in tokens))

defining a percentage of a given of predictor with many parameter for categorica
what is the generation of missing values? for rmsemeter' we dath when sloupher t
is using chi square fich approaches that factors on many zero? pairwise comparis
testing whether two coefficients and bernoulli wins for estimating proportion or
how to truncated processes? that is not correctly detection? in stata model and 
how does the sampling distribution is jags? all zero-information criterion in r?
how is test gensod for confidence intervals and nn? --gbf ? into dereige iter's?
sample regression contribution and divisa-- sampled between pseudo allorized in 
model disiders, estimate of correlation? how to detex a ?? mistage accept or p-v
as an unbiased mining where roc correct? in r with bayesian marginal polynomials
calculate the extrement used, use adaptive distribution? often researched by sta
adapting cross validation results? of the goodness of fit cale of sample is sign
optimal outliers for the sta