<a href="https://colab.research.google.com/github/mkaramib/NLP/blob/main/Summarization/TextSummarization_Transformers_Decoder.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Summarization using Transformers based on GPT (Decoder)
This Jupyter notebook implements text summarization using Transformers. It uses Trax to train a summarizer on CNN daily news articles.

## Install Libraries
In this section, the required libraries, such as Trax and SentencePiece, are installed.

In [None]:
# install SentencePiece
!pip install sentencepiece

# install Trax
!pip install -q -U trax

# check the Trax version
!pip list | grep trax

## Import Libraries
Required libraries will be imported.

In [7]:
import sys
import os
import time
import numpy as np
import pandas as pd
import gin
import textwrap
import trax
from trax import layers as tl
from trax.fastmath import numpy as jnp
import sentencepiece as spm
from trax.supervised import training

## Initializations
This section will initialize required parameters of libraries. 

In [3]:
wrapper = textwrap.TextWrapper(width=70)

# Data Preparation
In this section, the data is prepared and converted to tensors in the format the Transformer requires.

### Train SentencePiece and Build Vocabulary
In this section, a sub-word vocabulary is built using SentencePiece. The trained model is saved in *./data/mbpe.model*.
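
To give a feel for how a BPE vocabulary ends up containing sub-words, here is a minimal sketch of a single merge step on a toy corpus. It is only an illustration of the idea, not SentencePiece's actual implementation; the helper names `most_frequent_pair` and `merge_pair` and the toy word counts are made up for this example.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the adjacent symbol pair with the highest total frequency.

    `words` maps a tuple of symbols (one word) to how often that word occurs.
    """
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0]


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Toy corpus (hypothetical): words split into characters, with their counts.
corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6}

pair = most_frequent_pair(corpus)        # the pair that gets merged first
corpus_after = merge_pair(corpus, pair)  # corpus after one merge step
print(pair)
print(corpus_after)
```

Repeating this step until the symbol inventory reaches the requested size is the basic idea behind the `--vocab_size=20000` setting used below.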

In [15]:
# train a SentencePiece BPE model from `./data/vocab_train.txt`; this writes `./data/mbpe.model` and `./data/mbpe.vocab`
# `mbpe.vocab` is only a human-readable reference; it is not used for segmentation.
spm.SentencePieceTrainer.train('--input=./data/vocab_train.txt --model_prefix=./data/mbpe --model_type=bpe --vocab_size=20000')

# alternative: override the pre-defined special symbols (ids and pieces)
#spm.SentencePieceTrainer.train('--input=botchan.txt --vocab_size=2000 --model_prefix=m --pad_id=0 --unk_id=1 --bos_id=2 --eos_id=3 --pad_piece=[PAD] --unk_piece=[UNK] --bos_piece=[BOS] --eos_piece=[EOS]')

# create a segmenter instance and load the model file (mbpe.model)
sp = spm.SentencePieceProcessor()
sp.load('./data/mbpe.model')

# get the size of vocabulary
vocab_size = sp.get_piece_size()

# get the vocabulary as a list of pieces (avoid shadowing the built-in `id`)
vocab = [sp.id_to_piece(i) for i in range(vocab_size)]

In [None]:
# print vocab size
print(f'vocab size = {vocab_size}')

# print the first 20 vocabulary entries, one per line
print(*vocab[0:20], sep="\n")

### Test tokenization
Here is an example to check whether tokenization with SentencePiece works.

In [None]:
# encode: text => id
s1 = "Four groups that advocate for immigrant rights said Thursday they will challenge Arizona 's new immigration law , which allows police to ask anyone for proof of legal U.S. residency ."

# print the encoded results, tokens and ids
print(sp.encode_as_pieces(s1))
print(sp.encode_as_ids(s1))

# print decoded 
print(sp.decode_pieces(sp.encode_as_pieces(s1)))
print(sp.decode_ids([1758, 1093, 32, 21, 3370, 25, 8133, 485, 26, 222, 70, 58, 959]))

### Trax tokenizer
This section shows how the Trax tokenizer works.


In [12]:
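def tokenize(input_str, EOS=1):
    """Convert a string to a list of token ids, ready for inference."""

    # trax.data.tokenize takes a stream and returns a stream, so wrap the
    # string in a 1-element stream with `iter` and pull the result with `next`.
    # The vocabulary was trained with SentencePiece, so point Trax at the
    # model file and set vocab_type accordingly (the default is 'subword').
    inputs = next(trax.data.tokenize(iter([input_str]),
                                     vocab_type='sentencepiece',
                                     vocab_dir='./data/',
                                     vocab_file='mbpe.model'))

    # Mark the end of the sentence with EOS.
    # NOTE: SentencePiece's default ids are unk=0, bos=1, eos=2; pass
    # EOS=sp.eos_id() to use this model's own end-of-sentence id.
    return list(inputs) + [EOS]

def detokenize(integers):
    """Convert a list of token ids back to a wrapped string."""

    s = trax.data.detokenize(integers,
                             vocab_type='sentencepiece',
                             vocab_dir='./data/',
                             vocab_file='mbpe.model')

    # wrap the text at 70 characters for readable printing
    return wrapper.fill(s)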

### Data Load
In this section, the data is loaded. It is stored in a tab-separated CSV file with two columns, *body* and *highlights*.
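
As a small sketch of the expected file layout, the snippet below parses an in-memory sample with the same two columns using only the standard library. The two rows are invented for illustration and are not taken from `cnn_1000.csv`.

```python
import csv
import io

# Hypothetical sample in the same layout: a header line, then tab-separated rows.
sample = (
    "body\thighlights\n"
    "First article text .\tFirst summary .\n"
    "Second article text .\tSecond summary .\n"
)

# DictReader uses the header line to name the columns of each row.
rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
articles = [row["body"] for row in rows]
highlights = [row["highlights"] for row in rows]

print(len(articles), "samples")
print(highlights[0])
```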

In [9]:
# data file
cnn_file = "./data/cnn_1000.csv"

# read data
cnn = pd.read_csv(cnn_file, sep="\t")
cnn_articles = cnn["body"]
cnn_highlights = cnn["highlights"]

# train the sentencepiece for tokenization and vocabulary

# tokenize them using sentencepiece

In [None]:
# check that the data was loaded
print(f'num of samples = {len(cnn_articles)}')

# Transformer - Decoder Only
This experiment uses only the decoder part of the Transformer.
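
A decoder-only model sees one token sequence, so a common way to frame summarization is to join the article and its summary into a single sequence and compute the loss only on the summary part. The notebook does not show this step, so the sketch below is an assumption about a typical setup: the function name `build_example`, the separator id `sep_id=0`, and the token ids are all hypothetical.

```python
def build_example(article_ids, summary_ids, eos_id=1, sep_id=0):
    """Join article and summary ids into one sequence plus a loss mask.

    Layout (an assumed convention): article, EOS, separator, summary, EOS.
    The mask is 1 only where the model should be trained to predict,
    i.e. on the summary tokens and the final EOS.
    """
    tokens = list(article_ids) + [eos_id, sep_id] + list(summary_ids) + [eos_id]
    mask = [0] * (len(article_ids) + 2) + [1] * (len(summary_ids) + 1)
    return tokens, mask


# Hypothetical token ids for a 3-token article and a 2-token summary.
tokens, mask = build_example([11, 12, 13], [21, 22])
print(tokens)
print(mask)
```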


## Embedding and Positional Encoding
In this section, an Embedding and Positional Encoding will be implemented. 
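
For intuition about what a positional encoding adds to the embeddings, here is a minimal pure-Python sketch of the sinusoidal scheme from the original Transformer paper. Whether `tl.PositionalEncoding` matches it value for value is an assumption; the function name `sinusoidal_encoding` is made up for this example.

```python
import math


def sinusoidal_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    table = []
    for pos in range(max_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the frequency 1 / 10000^(2k / d_model).
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = sinusoidal_encoding(max_len=8, d_model=4)
print(pe[0])  # the encoding of position 0
print(pe[1])  # the encoding of position 1
```

Each row of this table is added to the embedding of the token at that position, which lets the model tell apart identical tokens at different positions.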

In [None]:
def PositionalEncoder(vocab_size, d_model, dropout, max_len, mode):
    """Returns a list of layers that: 
    1. takes a block of text as input, 
    2. embeds the words in that text, and 
    3. adds positional encoding, 
       i.e. associates a number in range(max_len) with 
       each word in each sentence of embedded input text 
    
    The input is a list of tokenized blocks of text
    
    Args:
        vocab_size (int): vocab size.
        d_model (int):  depth of embedding.
        dropout (float): dropout rate (how much to drop out).
        max_len (int): maximum symbol length for positional encoding.
        mode (str): 'train' or 'eval'.
    """
    # Embedding inputs and positional encoder
    return [ 
        # Add embedding layer of dimension (vocab_size, d_model)
        tl.Embedding(vocab_size, d_model),  
        # Use dropout with rate and mode specified
        tl.Dropout(rate=dropout, mode=mode), 
        # Add positional encoding layer with maximum input length and mode specified
        tl.PositionalEncoding(max_len=max_len, mode=mode)] 

## Feedforward Layer

In [None]:
def FeedForward(d_model, d_ff, dropout, mode, ff_activation):
    """
    Returns a list of layers that implements a feed-forward block.
    The input is an activation tensor.
    Args:
        d_model (int):  depth of embedding.
        d_ff (int): depth of feed-forward layer.
        dropout (float): dropout rate (how much to drop out).
        mode (str): 'train' or 'eval'.
        ff_activation (function): the non-linearity in feed-forward layer.

    Returns:
        list: list of trax.layers.combinators.Serial that maps an activation tensor to an activation tensor.
    """
    
    # Create the feed-forward block (list): normalized input, two dense layers, dropout
    return [ 
        # Normalize layer inputs
        tl.LayerNorm(), 
        # First dense layer expands to d_ff units
        tl.Dense(d_ff), 
        # Activation passed in as a parameter; it is a constructor, so call it
        ff_activation(),  # Generally ReLU
        # Dropout with the given rate; the mode disables it during evaluation
        tl.Dropout(rate=dropout, mode=mode), 
        # Second dense layer projects back to d_model units
        tl.Dense(d_model), 
        # Dropout with the given rate; the mode disables it during evaluation
        tl.Dropout(rate=dropout, mode=mode) 
    ]

## Decoder Layer
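The key property of the decoder block is causal attention: position *i* may only attend to positions up to and including *i*. The sketch below shows that masking on a small matrix of raw attention scores, using plain Python. It illustrates the idea only and is not how `tl.CausalAttention` is implemented; the function name and the scores are hypothetical.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax over scores[i][j] that ignores future positions j > i."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]          # positions 0..i only
        peak = max(visible)             # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Future positions get exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights


# Hypothetical raw scores for a 3-token sequence.
scores = [[0.0, 5.0, 5.0],
          [1.0, 1.0, 9.0],
          [2.0, 2.0, 2.0]]
weights = causal_attention_weights(scores)
for row in weights:
    print(row)
```

Note that the large scores above the diagonal have no effect on the result, which is what keeps the model from looking at tokens it has not generated yet.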

In [None]:
def DecoderBlock(d_model, d_ff, n_heads, dropout, mode, ff_activation):
    """
    Returns a list of layers that implements a Transformer decoder block.
    The input is an activation tensor.

    Args:
        d_model (int):  depth of embedding.
        d_ff (int): depth of feed-forward layer.
        n_heads (int): number of attention heads.
        dropout (float): dropout rate (how much to drop out).
        mode (str): 'train' or 'eval'.
        ff_activation (function): the non-linearity in feed-forward layer.

    Returns:
        list: list of trax.layers.combinators.Serial that maps an activation tensor to an activation tensor.
    """
        
    # Return a list of two Residual blocks: causal attention (with normalization and dropout) and feed-forward
    return [
      tl.Residual(
          # Normalize layer input
          tl.LayerNorm(), 
          # Add causal attention 
          tl.CausalAttention(d_model, n_heads=n_heads, dropout=dropout, mode=mode),
          # Add dropout with rate and mode specified
          tl.Dropout(rate=dropout, mode=mode)
        ),
      tl.Residual(
          # Add feed-forward block
          # We don't need to normalize the layer inputs here. The feed-forward block takes care of that for us.
          FeedForward(d_model, d_ff, dropout, mode, ff_activation)
        ),
      ]

## Transformer LM
In this section, a decoder-based Transformer is implemented.
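
The model below starts with a shift-right step, which is what makes teacher forcing work: the input at each position is the target token from the previous position. A minimal sketch on a plain list, assuming padding id 0 (the helper name `shift_right` and the token ids are made up for this example):

```python
def shift_right(tokens, pad_id=0):
    """Prepend one padding id and drop the last token, keeping the length."""
    return [pad_id] + list(tokens[:-1])


targets = [17, 42, 8, 1]       # hypothetical token ids, ending with EOS
inputs = shift_right(targets)  # what the decoder sees at each position

# At position t the model reads inputs[t] and is trained to predict targets[t].
for seen, wanted in zip(inputs, targets):
    print(f"sees {seen:>2} -> predicts {wanted}")
```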


In [None]:
# A Transformer using only the Decoder part
def TransformerLM(vocab_size=33300,
                  d_model=512,
                  d_ff=2048,
                  n_layers=6,
                  n_heads=8,
                  dropout=0.1,
                  max_len=4096,
                  mode='train',
                  ff_activation=tl.Relu):
    """
    Returns a Transformer language model.
    The input to the model is a tensor of tokens. 

    Args:
        vocab_size (int): vocab size.
        d_model (int):  depth of embedding.
        d_ff (int): depth of feed-forward layer.
        n_layers (int): number of decoder layers.
        n_heads (int): number of attention heads.
        dropout (float): dropout rate (how much to drop out).
        max_len (int): maximum symbol length for positional encoding.
        mode (str): 'train', 'eval' or 'predict', predict mode is for fast inference.
        ff_activation (function): the non-linearity in feed-forward layer.

    Returns:
        trax.layers.combinators.Serial: A Transformer language model as a layer that maps from a tensor of tokens
        to activations over a vocab set.
    """
    
    
    # Embedding inputs and positional encoder
    positional_encoder = [ 
        # Add embedding layer of dimension (vocab_size, d_model)
        tl.Embedding(vocab_size, d_model),
        # Use dropout with rate and mode specified
        tl.Dropout(rate=dropout,mode=mode),
        # Add positional encoding layer with maximum input length and mode specified
        tl.PositionalEncoding(max_len=max_len, mode=mode)]

    # Create the complete model
    return tl.Serial(
        # Use teacher forcing (feed the previous target token to the current step)
        tl.ShiftRight(mode=mode),
        # Add positional encoder
        positional_encoder,
        # Add decoder blocks (a list of n_layers DecoderBlocks)
        [DecoderBlock(d_model, d_ff, n_heads, dropout, mode, ff_activation) for _ in range(n_layers)],
        # Normalize layer
        tl.LayerNorm(),
        # Add dense layer of vocab_size units (the logits layer: one score
        # per vocabulary entry for the next token; no activation is applied here)
        tl.Dense(vocab_size),
        # Convert logits to log-probabilities with LogSoftmax
        tl.LogSoftmax()
    )

# Training
In this section, the decoder-based Transformer implemented above is trained.
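
The training loop uses a learning-rate schedule with warm-up followed by reciprocal square-root decay. The sketch below shows a common form of such a schedule with the same settings (1000 warm-up steps, peak 0.01). The exact formula inside `trax.lr.warmup_and_rsqrt_decay` may differ in detail, so treat this as an approximation; the function name `warmup_rsqrt` is hypothetical.

```python
import math


def warmup_rsqrt(step, n_warmup_steps=1000, max_value=0.01):
    """Linear warm-up to max_value, then decay proportional to 1/sqrt(step)."""
    if step < n_warmup_steps:
        # Warm-up phase: grow linearly from 0 to max_value.
        return max_value * step / n_warmup_steps
    # Decay phase: equals max_value at the end of warm-up, then shrinks.
    return max_value * math.sqrt(n_warmup_steps / step)


for step in (1, 500, 1000, 4000, 16000):
    print(step, warmup_rsqrt(step))
```

The warm-up keeps early updates small while the weights are still random, and the slow decay afterwards helps training settle.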

In [None]:
def training_loop(TransformerLM, train_gen, eval_gen,
                  d_model=512,
                  d_ff=2048,
                  n_layers=6,
                  n_heads=8,
                  dropout=0.1,
                  max_len=4096,
                  mode='train',
                  output_dir="~/model"):
    '''
    Input:
        TransformerLM (function): Constructor of the model defined above.
        train_gen (generator): Training stream of data.
        eval_gen (generator): Evaluation stream of data.
        d_model, d_ff, n_layers, n_heads, dropout, max_len, mode:
            Hyperparameters forwarded to TransformerLM.
        output_dir (str): folder to save your file.
    Returns:
        trax.supervised.training.Loop: Training loop.
    '''
    # parameters without defaults must come before those with defaults,
    # so train_gen and eval_gen are placed right after TransformerLM

    # expand "~" to the user's home directory
    output_dir = os.path.expanduser(output_dir)
    # linear warm-up for 1000 steps to a peak of 0.01, then rsqrt decay
    lr_schedule = trax.lr.warmup_and_rsqrt_decay(n_warmup_steps=1000, max_value=0.01)

    # define the training task
    train_task = training.TrainTask( 
      labeled_data= train_gen, # The training generator
      loss_layer= tl.CrossEntropyLoss(), # Loss function 
      optimizer= trax.optimizers.Adam(0.01), # Adam with learning rate 0.01
      lr_schedule= lr_schedule,
      n_steps_per_checkpoint=10
    )

    # define the evaluation task
    eval_task = training.EvalTask( 
      labeled_data= eval_gen, # The evaluation generator
      metrics=[tl.CrossEntropyLoss(), tl.Accuracy()] # CrossEntropyLoss and Accuracy
    )

    # define the training loop; build the model from the arguments
    # instead of hard-coded sizes
    loop = training.Loop(TransformerLM(d_model=d_model,
                                       d_ff=d_ff,
                                       n_layers=n_layers,
                                       n_heads=n_heads,
                                       dropout=dropout,
                                       max_len=max_len,
                                       mode=mode),
                         train_task,
                         eval_tasks=[eval_task],
                         output_dir=output_dir)
    
    return loop