<a href="https://colab.research.google.com/github/igmim-yassine/Pytorch-Tutorial/blob/master/03E_LanguageModelling_ROCDataset_NLP_class_Emines_Jan22.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Notebook Summary
* Learn a Language Model on the ROC Story Dataset: https://cs.rochester.edu/nlp/rocstories/
> Available here: https://drive.google.com/file/d/1eJINcSbC3JLl0hTNbhh5G94zTuXinpC-/view?usp=sharing

* Generate Text with this Language Model using several decoding techniques
* Evaluate the Language Model using the perplexity and the BLEU score. 

In [1]:
import pandas as pd
import tensorflow as tf
import numpy as np
import re
import json

### 1. Load and Preprocess the Dataset

In [None]:
data_path = "/content/drive/MyDrive/12_Teaching/UM6P-NLP-Jan2022/notebooks/ROCStories_winter2017.csv"
df = pd.read_csv(data_path)

In [None]:
def get_sentences(df, max_samples=None):
    df["sentence_1_2"] = df.sentence1 + " " + df.sentence2
    sentences = df.sentence_1_2
    sentences_1, sentences_2 = df["sentence1"], df["sentence2"]
    if max_samples is not None:
        sentences = sentences[:max_samples]
        sentences_1 = sentences_1[:max_samples]
        sentences_2 = sentences_2[:max_samples]
    return sentences, sentences_1, sentences_2

In [None]:
sentences, sentences_1, sentences_2 = get_sentences(df)

In [None]:
# print sentences example:


**Exercise 1**: Create a function `clean_text` that cleans the sentences:
* Split words joined by "-"
* Split numbers from text using a regular expression and the function `re.split`
* Replace the token "&" with the token "and". 
* Tip: create lambda functions and apply them to the dataframe using the `.apply` method. 

In [None]:
def clean_text(sentences):
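    clean_func1 = lambda t: t  # TODO (Exercise 1): replace this identity placeholder
    clean_func2 = lambda t: t  # TODO (Exercise 1): replace this identity placeholder
    clean_func3 = lambda t: t  # TODO (Exercise 1): replace this identity placeholder
    clean_func4 = lambda t: t  # TODO (Exercise 1): replace this identity placeholder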
    sentences = sentences.apply(clean_func1)
    # ... #
    return sentences
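
One possible way to implement the three cleaning rules of Exercise 1 is sketched below on a single string. This is only an illustration, not the reference solution: the helper name `clean_sentence`, the exact regular expression, and the final whitespace cleanup are assumptions.

```python
import re

def clean_sentence(text):
    """Apply the three cleaning rules from the exercise to one string (hypothetical helper)."""
    # 1. Split hyphenated words: "well-known" -> "well known"
    text = text.replace("-", " ")
    # 2. Separate digits from letters: "5pm" -> "5 pm", "room12" -> "room 12"
    parts = re.split(r"(?<=\d)(?=[A-Za-z])|(?<=[A-Za-z])(?=\d)", text)
    text = " ".join(parts)
    # 3. Replace the "&" token with "and"
    text = text.replace("&", " and ")
    # Collapse the repeated whitespace introduced by the steps above (assumed extra step)
    return re.sub(r"\s+", " ", text).strip()

print(clean_sentence("Tom & Jerry met at 5pm in a well-known cafe."))
# prints: Tom and Jerry met at 5 pm in a well known cafe.
```

On a pandas Series, the same helper could be applied with `sentences.apply(clean_sentence)`.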

In [None]:
sentences = clean_text(sentences)
sentences_1 = clean_text(sentences_1)
sentences_2 = clean_text(sentences_2)

**Exercise 2**: Build the vocab by removing some punctuation and adding the special tokens. 

In [None]:
import nltk
from nltk import word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
def get_vocab(sentences, tokens_to_remove=["$", "%", "'", "''"], special_tokens=["<PAD>", "<SOS>", "<EOS>"]):
    print("Building vocab....")
    # tokenize sentences


    # build vocab


    print("vocab length:", len(vocab))
    print("saving vocab...")
    with open("vocab.json", "w") as f:
        json.dump(vocab, f)
    return tokens, vocab
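
As a rough guide for Exercise 2, the sketch below builds a token-to-id dictionary from sentences that are already tokenized (in the notebook, `word_tokenize` would produce them). The helper name `build_vocab_sketch` and the frequency-based ordering are assumptions; the only constraint taken from the notebook is that `vocab` is a dictionary containing the special tokens.

```python
from collections import Counter

def build_vocab_sketch(tokenized_sentences,
                       tokens_to_remove=("$", "%", "'", "''"),
                       special_tokens=("<PAD>", "<SOS>", "<EOS>")):
    """Map each token to an integer id, special tokens first (hypothetical helper)."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    # Drop the unwanted punctuation tokens
    for tok in tokens_to_remove:
        counts.pop(tok, None)
    # Sort by descending frequency, then alphabetically, for a reproducible order
    words = sorted(counts, key=lambda w: (-counts[w], w))
    return {tok: idx for idx, tok in enumerate(list(special_tokens) + words)}

toy = [["the", "cat", "sat", "."], ["the", "dog", "ran", "%"]]
toy_vocab = build_vocab_sketch(toy)
print(toy_vocab["<PAD>"], toy_vocab["the"], len(toy_vocab))
# prints: 0 3 9
```

Putting `<PAD>` at id 0 is a common convention because padded positions are then plain zeros.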

In [None]:
tokens, vocab = get_vocab(sentences)

Building vocab....
vocab length: 20111
saving vocab...


In [None]:
def tokenize(sentences, vocab):

    return df, len_sentences
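
Since `get_dataloader` later reads the columns `input_sentence` and `target_sentence`, `tokenize` presumably has to produce padded input/target pairs. A minimal sketch of one such pair is shown below, assuming the usual next-token setup where the target is the input shifted by one position. The helper name `make_lm_pair` and the `max_len` truncation are assumptions.

```python
def make_lm_pair(token_ids, sos_id, eos_id, pad_id, max_len):
    """Build a padded (input, target) pair for next-token prediction (hypothetical helper)."""
    inp = [sos_id] + token_ids   # the model reads <SOS> w1 ... wn
    tar = token_ids + [eos_id]   # and must predict w1 ... wn <EOS>
    inp, tar = inp[:max_len], tar[:max_len]   # truncate sentences that are too long
    pad = [pad_id] * (max_len - len(inp))     # both lists have the same length here
    return inp + pad, tar + pad

inp, tar = make_lm_pair([7, 8, 9], sos_id=1, eos_id=2, pad_id=0, max_len=6)
print(inp)  # prints: [1, 7, 8, 9, 0, 0]
print(tar)  # prints: [7, 8, 9, 2, 0, 0]
```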

In [None]:
df, len_sentences = tokenize(sentences, vocab)

In [None]:
def tokenize_test(sentences, vocab):
    tokenize_func = lambda t: word_tokenize(t)
    tok_to_id_func = lambda t: [vocab["<SOS>"]] + [vocab[w] for w in t if w in vocab] + [vocab["<EOS>"]]
    tokenized_sentences = sentences.apply(tokenize_func)
    tokens_id = tokenized_sentences.apply(tok_to_id_func)
    len_sentences = tokens_id.apply(len)
    return tokens_id, len_sentences

In [None]:
def split_train_test(sentences, sentences_1_and_2, val_size=5000, test_size=3000):

    return train_sentences, val_sentences, test_sentences

In [None]:
def preprocess_data(data_path):
    df = pd.read_csv(data_path)  # same loading call as in the first cell of this section
    sentences, sentences_1, sentences_2 = get_sentences(df)
    sentences, sentences_1, sentences_2 = clean_text(sentences), clean_text(sentences_1), clean_text(sentences_2)
    tokens, vocab = get_vocab(sentences)
    padded_sentences, len_sentences = tokenize(sentences, vocab)
    print("dataset set length:", len(padded_sentences))
    sentences_1, len_sentences_1 = tokenize_test(sentences_1, vocab)
    sentences_2, len_sentences_2 = tokenize_test(sentences_2, vocab)
    sentences_1_and_2 = pd.concat([sentences_1, sentences_2], axis=1)
    train_sentences, val_sentences, test_sentences = split_train_test(padded_sentences, sentences_1_and_2)
    print("train dataset size", len(train_sentences))
    print("val dataset size", len(val_sentences))
    print("test dataset size", len(test_sentences))
    return train_sentences, val_sentences, test_sentences

In [None]:
def get_dataloader(dataset, max_samples, batch_size):
    input_sentence = np.array([seq for seq in dataset.input_sentence.values])
    target_sentence = np.array([seq for seq in dataset.target_sentence.values])
    if max_samples is not None:
        input_sentence = input_sentence[:max_samples]
        target_sentence = target_sentence[:max_samples]
    tfdataset = tf.data.Dataset.from_tensor_slices(
            (input_sentence, target_sentence))
    dataloader = tfdataset.batch(batch_size, drop_remainder=True)
    return dataloader

In [None]:
def get_test_dataloader(data):
    inputs, targets = data
    inputs = inputs.to_list()
    targets = targets.to_list()
    inputs = [tf.constant(inp, dtype=tf.int32) for inp in inputs]
    targets = [tf.constant(tar, dtype=tf.int32) for tar in targets]
    return (inputs, targets)

**Exercise 4**: 
Create a `decode` function that decodes a list of token ids into text using the vocab. 

In [None]:
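def decode(seq_idx, vocab, delim=' ', ignored=["<SOS>", "<PAD>"]):
    """Decode a sequence of token ids into text, skipping the `ignored` tokens."""
    # TODO (Exercise 4): map each id back to its token and join the tokens with `delim`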
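
A possible shape for the Exercise 4 function is sketched below on a toy vocabulary. It assumes `vocab` maps tokens to ids, as in `tokenize_test`. Stopping at `<EOS>` is an assumption made here, and the name `decode_sketch` is hypothetical.

```python
def decode_sketch(seq_idx, vocab, delim=" ", ignored=("<SOS>", "<PAD>")):
    """Turn a sequence of token ids back into a string (hypothetical helper)."""
    id_to_token = {idx: tok for tok, idx in vocab.items()}  # invert the vocab
    tokens = []
    for idx in seq_idx:
        tok = id_to_token[int(idx)]
        if tok == "<EOS>":       # assumed behavior: stop at the end-of-sentence marker
            break
        if tok not in ignored:
            tokens.append(tok)
    return delim.join(tokens)

demo_vocab = {"<PAD>": 0, "<SOS>": 1, "<EOS>": 2, "the": 3, "cat": 4, "sat": 5}
print(decode_sketch([1, 3, 4, 5, 2, 0, 0], demo_vocab))
# prints: the cat sat
```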


**Exercise 5**: Build an LSTM network with: 
* An embedding layer
* An LSTM layer. There is a subtlety: you need to output the whole sequence of hidden states using the `return_sequences` argument: https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM
* A dropout layer after the LSTM layer
* A dense layer that projects the hidden state over the vocabulary. 
> What is the size of the NN output? 

In [None]:
# Build Model
def build_LSTM(seq_len, vocab_size, emb_size, output_size, rnn_units, dropout_rate, rnn_drop_rate=0.0):

  return lstm_model
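
The four layers listed in Exercise 5 could be stacked roughly as follows. This is a minimal sketch rather than the expected solution: the function name `build_lstm_sketch` and the toy sizes are assumptions, and it leaves out the `output_size` and `rnn_drop_rate` arguments of the skeleton above. Calling the model on a dummy batch shows the output shape, `(batch_size, seq_len, vocab_size)`.

```python
import numpy as np
import tensorflow as tf

def build_lstm_sketch(seq_len, vocab_size, emb_size, rnn_units, dropout_rate):
    """Embedding -> LSTM (all timesteps) -> Dropout -> Dense over the vocabulary."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(seq_len,), dtype="int32"),
        tf.keras.layers.Embedding(vocab_size, emb_size),
        # return_sequences=True keeps one hidden state per timestep
        tf.keras.layers.LSTM(rnn_units, return_sequences=True),
        tf.keras.layers.Dropout(dropout_rate),
        # One logit per vocabulary entry, no softmax applied here
        tf.keras.layers.Dense(vocab_size),
    ])

toy_model = build_lstm_sketch(seq_len=5, vocab_size=50, emb_size=8,
                              rnn_units=16, dropout_rate=0.1)
toy_logits = toy_model(np.zeros((2, 5), dtype="int32"))
print(tuple(toy_logits.shape))  # prints: (2, 5, 50)
```

Because the dense layer returns logits, a matching loss would be `tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)`.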

**Exercise 6:**  
Create a function that trains the LSTM (as in the day 2 notebook).  
Compute the perplexity over the train and validation sets. Note that the perplexity is the exponential of the cross-entropy! 
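
To make the link between cross-entropy and perplexity concrete, here is a small framework-free sketch. The function name `perplexity` and its input format (the probabilities the model assigned to the true tokens) are assumptions for illustration. In the notebook, the same quantity can be obtained by exponentiating the average cross-entropy loss.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood of the true tokens)."""
    cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(cross_entropy)

# A model that is uniformly unsure between 4 words has a perplexity close to 4
uniform_ppl = perplexity([0.25, 0.25, 0.25, 0.25])
print(round(uniform_ppl, 6))  # prints: 4.0
```

A perplexity of about 4 can be read as the model hesitating between 4 equally likely words at each step. A perfect model would reach 1.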

In [None]:
def train_LSTM(model, optimizer, EPOCHS, train_dataset, val_dataset, output_path, checkpoint_path):
    LSTM_ckpt_path = checkpoint_path + '/' + 'LSTM-{epoch}'

    callbacks = [
        tf.keras.callbacks.ModelCheckpoint(
            filepath=LSTM_ckpt_path,
            monitor='val_loss',
            save_best_only=True,
            save_weights_only=True,
            verbose=1)
    ]

    return train_history

**Exercise 7**:
Create a function that generates text at inference time with the trained LSTM. 
This function uses either: 
* greedy decoding, using `tf.math.argmax`
* sampling with temperature, using `tf.random.categorical`

In [None]:
def generate_text(lstm, inputs, seq_len=10,
                            decoding="sampling", temp=1):
  # Loop over number of decoding timesteps: (equal to seq_len)

      # pass forward on the lstm 

      # get the last prediction (logits)

      # if decoding = sampling 
        # divide logits by temperature 
        # sample a word

        # if decoding == "greedy"
        # find the greedy word (argmax)

      # compute the inputs of the next timestep by concatenating inputs and the predicted token using tf.concat

  # return the final inputs (complete sequence of word ids)
  return inputs
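
The difference between the two decoding modes can be illustrated on a single vector of logits. The sketch below uses NumPy instead of `tf.math.argmax` and `tf.random.categorical` so that it runs without a trained model. The helper name `pick_next_token` and the `rng` parameter are assumptions.

```python
import numpy as np

def pick_next_token(logits, decoding="sampling", temp=1.0, rng=None):
    """Choose the next token id from a 1-D array of logits (hypothetical helper)."""
    logits = np.asarray(logits, dtype=np.float64)
    if decoding == "greedy":
        return int(np.argmax(logits))       # always the most likely token
    scaled = logits / temp                  # temp < 1 sharpens, temp > 1 flattens
    probs = np.exp(scaled - scaled.max())   # numerically stable softmax
    probs /= probs.sum()
    if rng is None:
        rng = np.random.default_rng()
    return int(rng.choice(len(probs), p=probs))

demo_logits = [2.0, 0.5, -1.0]
print(pick_next_token(demo_logits, decoding="greedy"))  # prints: 0
print(pick_next_token(demo_logits, temp=0.7, rng=np.random.default_rng(0)))
```

In `generate_text`, the chosen id would then be appended to `inputs` with `tf.concat` before the next forward pass.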

**Exercise 8**:
* Take one `inputs` sequence from the test dataset, generate text from it, and decode the result with the `decode` function

#### Measuring the BLEU score 
Compute it between the true sentence and the generated sentence on the test dataset, 
using `sentence_bleu` from nltk: https://www.nltk.org/_modules/nltk/translate/bleu_score.html

In [None]:
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

In [None]:
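def BLEU_score(true_sentence, generated_sentence, split_str=False):
    # sentence_bleu expects a list of tokenized references and one tokenized hypothesis:
    # the true sentence is the reference, the generated sentence is the hypothesis.
    if split_str:
        true_sentence = true_sentence.split(sep=' ')
        generated_sentence = generated_sentence.split(sep=' ')
    score = sentence_bleu(references=[true_sentence], hypothesis=generated_sentence, smoothing_function=SmoothingFunction().method2)
    return score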

**Exercise 9**: Create a function that: 
* Loops over the test set 
* Generates text from each input of the test set
* Decodes it using the `decode` function 
* Evaluates the BLEU score between the true decoded sentence (from the test set) and the decoded generated sentence 
* Computes the average BLEU score on the test set. 
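
The evaluation loop of Exercise 9 might look like the sketch below. To keep it runnable without a trained model, the generation, decoding, and scoring steps are passed in as functions. The name `average_score` and this calling convention are assumptions. In the notebook, these slots would be filled with `generate_text`, `decode`, and `BLEU_score`.

```python
def average_score(test_pairs, generate_fn, decode_fn, score_fn):
    """Average a sentence-level score over (inputs, target) pairs (hypothetical helper)."""
    scores = []
    for inputs, target in test_pairs:
        generated_ids = generate_fn(inputs)        # e.g. a wrapper around generate_text
        generated_text = decode_fn(generated_ids)  # ids -> text
        true_text = decode_fn(target)
        scores.append(score_fn(true_text, generated_text))
    return sum(scores) / len(scores) if scores else 0.0

# Toy stand-ins: a "model" that always generates [3], and an exact-match score
toy_pairs = [([1, 2], [3]), ([5], [6])]
toy_avg = average_score(
    toy_pairs,
    generate_fn=lambda ids: [3],
    decode_fn=lambda ids: " ".join(str(i) for i in ids),
    score_fn=lambda true, gen: 1.0 if true == gen else 0.0,
)
print(toy_avg)  # prints: 0.5
```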