<a href="https://colab.research.google.com/github/mkaramib/NLP/blob/main/Summarization/TextSummarization_Transformers_Decoder.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Summarization using Transformers based on GPT (Decoder)
This Jupyter notebook implements text summarization using a Transformer. It uses Trax to train a summarizer on CNN daily news articles.

## Install Libraries
In this section, the required libraries, such as Trax and SentencePiece, are installed.

In [None]:
# install SentencePiece
!pip install sentencepiece

# install Trax
!pip install -q -U trax

# check the Trax version
!pip list | grep trax

## Import Libraries
Required libraries will be imported.

In [2]:
import sys
import os
import time
import numpy as np
import pandas as pd
import random as rnd
import textwrap
import trax
from trax import layers as tl
from trax.fastmath import numpy as jnp
import sentencepiece as spm
from trax.supervised import training

## Initializations
This section initializes helper objects that are used later in the notebook.

In [4]:
wrapper = textwrap.TextWrapper(width=70)

# Data Preparation
In this section, the data is prepared and converted to tensors in the format required by the Transformer.

### Train SentencePiece and Build Vocabulary
In this section, a sub-word vocabulary is built using SentencePiece. The trained model is saved to *./data/mbpe.model*.

In [8]:
# train a BPE SentencePiece model on `./data/vocab_train.txt`;
# this writes `./data/mbpe.model` and `./data/mbpe.vocab`
# (`mbpe.vocab` is only a human-readable reference and is not used for segmentation)
spm.SentencePieceTrainer.train('--input=./data/vocab_train.txt --model_prefix=./data/mbpe --model_type=bpe --user_defined_symbols=[SEP] --pad_id=0 --unk_id=1 --bos_id=2 --eos_id=3 --pad_piece=[PAD] --unk_piece=[UNK] --bos_piece=[BOS] --eos_piece=[EOS] --vocab_size=20000')

# create a segmenter instance and load the trained model file (mbpe.model)
sp = spm.SentencePieceProcessor()
sp.load('./data/mbpe.model')

# get the size of vocabulary
vocab_size = sp.get_piece_size()

# get the vocabulary
vocab_terms = [sp.id_to_piece(i) for i in range(vocab_size)]

In [35]:
print('eos=', sp.eos_id())
print('unk=', sp.unk_id())
print('pad=', sp.pad_id())  
print('code=',sp.decode_ids([0, 1, 2, 3, 4, 5, 6]))
print('id = ',sp.encode_as_ids("[SEP]"))
print('text = ', sp.decode_ids([19930, 4]))

eos= 3
unk= 1
pad= 0
code=  ⁇ [SEP] t a
id =  [19930, 4]
text =  [SEP]
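
To make the idea of a sub-word vocabulary more concrete, here is a minimal sketch of the merge step behind byte-pair encoding (BPE), the model type requested above. It is a toy illustration only and not SentencePiece's actual implementation; the corpus, the helper names `most_frequent_pair` and `merge_pair`, and the number of merges are all assumptions made for this example.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs over all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# toy corpus (hypothetical): word as a tuple of characters -> frequency
toy_words = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6}

learned_merges = []
for _ in range(3):
    pair = most_frequent_pair(toy_words)
    learned_merges.append(pair)
    toy_words = merge_pair(toy_words, pair)

print(learned_merges)
print(list(toy_words))
```

Each merge adds one new symbol to the vocabulary, so frequent character sequences end up as single sub-word pieces while rare words stay split into smaller units.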


### Test tokenization
Here is an example to check that tokenization with SentencePiece works as expected.

In [None]:
# encode: text => id
s1 = "Four groups that advocate for immigrant rights said Thursday ."

# print the encoded results, tokens and ids
print(sp.encode_as_pieces(s1))
print(sp.encode_as_ids(s1))

# print decoded 
print(sp.decode_pieces(sp.encode_as_pieces(s1)))
print(sp.decode_ids([1758, 1093, 32, 21, 3370, 25, 8133, 485, 26, 222, 70, 58, 959]))

### Data Load
In this section, the data is loaded. It is stored in a tab-separated file with two columns, *body* and *highlights*.

In [30]:
# data file
cnn_file = "./data/cnn_1000.csv"

# train, validation, test rates
train_rate, val_rate, test_rate = 80, 10, 10

# read data
cnn = pd.read_csv(cnn_file, sep="\t")
cnn_articles = cnn["body"]
cnn_highlights = cnn["highlights"]

# get boundaries
total_samples = len(cnn_articles)
train_end_i = int( total_samples * train_rate/100)
val_end_i = train_end_i + int(total_samples * val_rate/100)
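
The boundary arithmetic above can be checked in isolation. The sketch below wraps it in a hypothetical helper, `split_bounds`, and runs it on made-up sample counts; neither the helper nor the counts come from the notebook.

```python
def split_bounds(total_samples, train_rate, val_rate):
    """Return the exclusive end indices of the train and validation slices."""
    train_end = int(total_samples * train_rate / 100)
    val_end = train_end + int(total_samples * val_rate / 100)
    return train_end, val_end

# hypothetical sample counts, chosen only for illustration
print(split_bounds(1000, train_rate=80, val_rate=10))
print(split_bounds(7, train_rate=80, val_rate=10))
```

Because `int()` rounds down, any remainder falls into the test slice. With very small datasets the validation slice can even be empty, as in the second call.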

### Data Vectorize
The articles and their highlights must be vectorized, that is, converted to lists of token ids.


In [31]:
# Vectorize articles and highlights
articles = [sp.encode_as_ids(artcl) for artcl in cnn_articles]
highlights = [sp.encode_as_ids(hglt) for hglt in cnn_highlights]

# split into train, validation, and test sets
train_articles, val_articles, test_articles = articles[:train_end_i], articles[train_end_i:val_end_i], articles[val_end_i:]  
train_highlights, val_highlights, test_highlights = highlights[:train_end_i], highlights[train_end_i:val_end_i], highlights[val_end_i:]

### Max Length
The maximum length of an article concatenated with its highlights is needed, because every example is padded to that length.

In [32]:
# find max length
max_l = 0
for a,h in zip(articles, highlights):
  if len(a)+len(h) > max_l:
    max_l = len(a)+len(h)

# add the EOS, and SEP to the length: article + [EOS] + [SEP] + highlights + [EOS]
max_l += 3

In [33]:
max_l

2611
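
The same computation can be written more compactly with `max()`. This is a sketch on toy token lists; the helper name `max_packed_length` and the toy ids are assumptions, and `n_special=3` mirrors the three special tokens counted in the cell above.

```python
def max_packed_length(articles, highlights, n_special=3):
    """Longest article + highlights pair, plus room for the special tokens."""
    longest = max(len(a) + len(h) for a, h in zip(articles, highlights))
    return longest + n_special

# toy token-id lists (hypothetical)
toy_articles = [[11, 12, 13, 14, 15], [21, 22]]
toy_highlights = [[31], [41, 42, 43]]

print(max_packed_length(toy_articles, toy_highlights))
```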

### Data Generator
Data generators provide batches of data for training, validation, and testing.

In [118]:
# define the data generator
def data_generator(x, y, max_length, batch_size, eos_id, sep_id, pad_id, loop = False, shuffle=False):
  '''
  Yield batches of training inputs, training targets, and loss masks.
  Inputs:
    x: list of articles (each a list of token ids)
    y: list of highlights (each a list of token ids)
    max_length: max length of the concatenation of an article and its highlights
    batch_size: number of examples per batch
    eos_id, sep_id, pad_id: ids of the [EOS], [SEP], and [PAD] tokens
    loop: whether to restart from the beginning when the end of the list is reached
    shuffle: whether to shuffle the examples
  Outputs:
    triples of (input, target, mask)
  '''
  data_size = len(x)
  indexes = [*range(data_size)]
  
  # shuffle if required
  if shuffle:
    rnd.shuffle(indexes)

  index = 0
  stop = False

  # iterate
  while True:
    
    # primary indexes for outputs
    out_indexes = [0]*batch_size

    # iterate batch_size time
    for i in range(batch_size):
      
      # if at the end of the list
      if index >= data_size:
        if not loop:
          # the unfilled slots of this last batch keep index 0
          stop = True
          break
        
        index = 0
        if shuffle:
          rnd.shuffle(indexes)

      # add next on to the output indexes
      out_indexes[i] = indexes[index]

      # increase index
      index += 1

    # convert the output indexes into actual output
    # pad all to the max-length
    #X = np.full((batch_size, max_length), pad_id)
    #Y = np.full((batch_size, max_length), pad_id)
    result_x = []
    result_mask = []

    # iterate through the batch
    for i in range(batch_size):

      # add the article_vec + eos_id + sep_id + highlight_vec + eos_id
      out = x[out_indexes[i]] + [eos_id] + [sep_id] + y[out_indexes[i]] + [eos_id]
      # mask is 0 over the article and separators, 1 over the highlights and their eos
      mask_temp = [0]*(len(x[out_indexes[i]]) + 2) + [1]*(len(y[out_indexes[i]]) + 1)

      # pad to the max_length (padded positions get mask 0)
      out = out + [pad_id] * (max_length-len(out))
      mask_temp = mask_temp + [0] * (max_length-len(mask_temp))

      # add the results
      result_x.append(out)
      result_mask.append(mask_temp)

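    # yield the result; inputs and targets are the same sequence because
    # the language model shifts its input right internally (tl.ShiftRight)
    yield jnp.array(result_x), jnp.array(result_x), jnp.array(result_mask)

    # if stop is flagged, do not continue
    if stop:
      break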

In [119]:
eos_id = sp.eos_id()
pad_id = sp.pad_id()
sep_id = 4  # id of the user-defined [SEP] piece, as shown by the decode output above

temp_data_generator = data_generator(train_articles, train_highlights, max_length = max_l, batch_size=4, eos_id=eos_id, sep_id=sep_id, pad_id=pad_id)
next_batch = next(temp_data_generator)
x,y,mask = next_batch
print(f' x_shape = {x.shape}, y_shape={y.shape}, mask shape = {mask.shape}')
print(f'x[0] = {type(x[0])}')

 x_shape = (4, 2611), y_shape=(4, 2611), mask shape = (4, 2611)
x[0] = <class 'jax.interpreters.xla._DeviceArray'>
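
The per-example layout described in the comments (article + [EOS] + [SEP] + highlights + [EOS], padded to a fixed length) can be sketched in isolation. The helper name `pack_example` and the toy ids below are assumptions made for illustration; they are not part of the notebook.

```python
def pack_example(article, highlight, eos_id, sep_id, pad_id, max_length):
    """Concatenate one article/highlight pair and build its loss mask."""
    tokens = article + [eos_id, sep_id] + highlight + [eos_id]
    # 0 over the article and the two separators, 1 over the highlight and its EOS
    mask = [0] * (len(article) + 2) + [1] * (len(highlight) + 1)
    n_pad = max_length - len(tokens)
    # padded positions are filled with pad_id and carry mask 0
    return tokens + [pad_id] * n_pad, mask + [0] * n_pad

# toy ids (hypothetical): article = [11, 12, 13], highlight = [21, 22]
tokens, mask = pack_example([11, 12, 13], [21, 22],
                            eos_id=3, sep_id=4, pad_id=0, max_length=10)
print(tokens)  # [11, 12, 13, 3, 4, 21, 22, 3, 0, 0]
print(mask)    # [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
```

The mask is what restricts the loss to the summary tokens: positions with weight 0 (article, separators, padding) do not contribute.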


# Transformer - Decoder Only
This experiment is based on using only the Decoder part of Transformer. 


## Embedding and Positional Encoding
In this section, an Embedding and Positional Encoding will be implemented. 

In [120]:
def PositionalEncoder(vocab_size, d_model, dropout, max_len, mode):
    """Returns a list of layers that: 
    1. takes a block of text as input, 
    2. embeds the words in that text, and 
    3. adds positional encoding, 
       i.e. associates a number in range(max_len) with 
       each word in each sentence of embedded input text 
    
    The input is a list of tokenized blocks of text
    
    Args:
        vocab_size (int): vocab size.
        d_model (int):  depth of embedding.
        dropout (float): dropout rate (how much to drop out).
        max_len (int): maximum symbol length for positional encoding.
        mode (str): 'train' or 'eval'.
    """
    # Embedding inputs and positional encoder
    return [ 
        # Add embedding layer of dimension (vocab_size, d_model)
        tl.Embedding(vocab_size, d_model),  
        # Use dropout with rate and mode specified
        tl.Dropout(rate=dropout, mode=mode), 
        # Add positional encoding layer with maximum input length and mode specified
        tl.PositionalEncoding(max_len=max_len, mode=mode)] 
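
For intuition about what a positional encoding adds to the embeddings, the following sketch computes the sinusoidal table from the original Transformer formulation in plain Python. It is an assumption that this matches `tl.PositionalEncoding` in detail, since the Trax layer may differ (for example in how sine and cosine columns are arranged); the function name `sinusoidal_positions` is hypothetical.

```python
import math

def sinusoidal_positions(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    table = []
    for pos in range(max_len):
        row = []
        for i in range(d_model):
            # each pair of dimensions (2k, 2k+1) shares one wavelength
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = sinusoidal_positions(max_len=4, d_model=4)
for row in pe:
    print([round(v, 3) for v in row])
```

Each row is added to the embedding of the token at that position, which gives the otherwise order-agnostic attention layers access to token order.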

## Feedforward Layer

In [121]:
def FeedForward(d_model, d_ff, dropout, mode, ff_activation):
    """
    Returns a list of layers that implements a feed-forward block.
    The input is an activation tensor.
    Args:
        d_model (int):  depth of embedding.
        d_ff (int): depth of feed-forward layer.
        dropout (float): dropout rate (how much to drop out).
        mode (str): 'train' or 'eval'.
        ff_activation (function): the non-linearity in feed-forward layer.

    Returns:
        list: list of trax.layers.combinators.Serial that maps an activation tensor to an activation tensor.
    """
    
    # Create feed-forward block (list) with two dense layers with dropout and input normalized
    return [ 
        # Normalize layer inputs
        tl.LayerNorm(), 
        # First feed-forward (dense) layer, expanding to d_ff units
        tl.Dense(d_ff), 
        # Activation layer, instantiated from the class passed in
        ff_activation(),  # Generally ReLU
        # Dropout with the given rate; only active in 'train' mode
        tl.Dropout(rate=dropout, mode=mode), 
        # Second feed-forward (dense) layer, projecting back to d_model units
        tl.Dense(d_model), 
        # Dropout with the given rate; only active in 'train' mode
        tl.Dropout(rate=dropout, mode=mode) 
    ]
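
The computation performed by this block can be sketched with NumPy. This is a simplified illustration under stated assumptions: dropout is omitted, the layer normalization has no learnable scale or bias, and the weights are random. The function names are hypothetical and not part of Trax.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize the last axis to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feed_forward(x, w1, b1, w2, b2):
    """LayerNorm -> Dense(d_ff) -> ReLU -> Dense(d_model)."""
    h = layer_norm(x)
    h = np.maximum(h @ w1 + b1, 0.0)  # ReLU
    return h @ w2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 4, 16
x = rng.normal(size=(2, 5, d_model))  # (batch, sequence, d_model)
w1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

out = feed_forward(x, w1, b1, w2, b2)
print(out.shape)  # (2, 5, 4)
```

The output has the same shape as the input, which is what allows the block to be wrapped in a residual connection.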

## Decoder Layer

In [122]:
def DecoderBlock(d_model, d_ff, n_heads, dropout, mode, ff_activation):
    """
    Returns a list of layers that implements a Transformer decoder block.
    The input is an activation tensor.

    Args:
        d_model (int):  depth of embedding.
        d_ff (int): depth of feed-forward layer.
        n_heads (int): number of attention heads.
        dropout (float): dropout rate (how much to drop out).
        mode (str): 'train' or 'eval'.
        ff_activation (function): the non-linearity in feed-forward layer.

    Returns:
        list: list of trax.layers.combinators.Serial that maps an activation tensor to an activation tensor.
    """
        
    # Add list of two Residual blocks: the attention with normalization and dropout and feed-forward blocks
    return [
      tl.Residual(
          # Normalize layer input
          tl.LayerNorm(), 
          # Add causal attention 
          tl.CausalAttention(d_model, n_heads=n_heads, dropout=dropout, mode=mode),
          # Add dropout with rate and mode specified
          tl.Dropout(rate=dropout, mode=mode)
        ),
      tl.Residual(
          # Add feed-forward block
          # We don't need to normalize the layer inputs here. The feed-forward block takes care of that for us.
          FeedForward(d_model, d_ff, dropout, mode, ff_activation)
        ),
      ]
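
The "causal" part of causal attention means that each position may only attend to itself and to earlier positions. The sketch below shows a single attention head in NumPy with such a mask. It illustrates the general technique and is not the internals of `tl.CausalAttention`; there are no learned projections, and the function name is hypothetical.

```python
import numpy as np

def causal_self_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask."""
    seq_len, d_k = q.shape
    scores = q @ k.T / np.sqrt(d_k)
    # True above the diagonal marks "future" positions, which are blocked
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -1e9, scores)
    # numerically stable softmax over the key axis
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))  # (sequence, d_model)
out, weights = causal_self_attention(x, x, x)
print(np.round(weights, 2))
```

The printed weight matrix is lower triangular: the first position can only attend to itself, so its output equals its own value vector.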

## Transformer LM
In this section, a decoder-based Transformer is implemented.


In [126]:
# A Transformer using only the Decoder part
def TransformerLM(vocab_size=33300,
                  d_model=4,
                  d_ff=16,
                  n_layers=2,
                  n_heads=2,
                  dropout=0.1,
                  max_len=4096,
                  mode='train',
                  ff_activation=tl.Relu):
    """
    Returns a Transformer language model.
    The input to the model is a tensor of tokens. 

    Args:
        vocab_size (int): vocab size.
        d_model (int):  depth of embedding.
        d_ff (int): depth of feed-forward layer.
        n_layers (int): number of decoder layers.
        n_heads (int): number of attention heads.
        dropout (float): dropout rate (how much to drop out).
        max_len (int): maximum symbol length for positional encoding.
        mode (str): 'train', 'eval' or 'predict', predict mode is for fast inference.
        ff_activation (function): the non-linearity in feed-forward layer.

    Returns:
        trax.layers.combinators.Serial: A Transformer language model as a layer that maps from a tensor of tokens
        to activations over a vocab set.
    """
  
    # Embedding inputs and positional encoder
    positional_encoder = [ 
        # Add embedding layer of dimension (vocab_size, d_model)
        tl.Embedding(vocab_size, d_model),
        # Use dropout with rate and mode specified
        tl.Dropout(rate=dropout,mode=mode),
        # Add positional encoding layer with maximum input length and mode specified
        tl.PositionalEncoding(max_len=max_len, mode=mode)]

    # Create the complete model
    return tl.Serial(
        # Shift the input right so each position predicts the next token (teacher forcing)
        tl.ShiftRight(mode=mode),
        # Add positional encoder
        positional_encoder,
        # Add decoder blocks (a list of n_layers DecoderBlocks)
        [DecoderBlock(d_model, d_ff, n_heads, dropout, mode, ff_activation) for _ in range(n_layers)],
        # Normalize layer
        tl.LayerNorm(),
        # Add dense layer of size vocab_size to produce one logit per vocabulary entry
        tl.Dense(vocab_size),
        # Get log-probabilities with LogSoftmax
        tl.LogSoftmax()
    )
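
The role of the right shift at the start of the model can be shown on a toy sequence. The sketch assumes that the shift prepends one padding id (0) and drops the last token, which is my understanding of `tl.ShiftRight`; the helper `shift_right` and the token ids are hypothetical.

```python
def shift_right(tokens, pad_id=0):
    """Prepend one pad token and drop the last token, keeping the length."""
    return [pad_id] + tokens[:-1]

targets = [11, 12, 13, 3]      # toy token ids ending in an EOS id
inputs = shift_right(targets)
print(inputs)  # [0, 11, 12, 13]

# at each position, the model sees inputs up to that position
# and is trained to predict the target at the same position
for step, target in enumerate(targets):
    print(f"sees {inputs[:step + 1]} -> predicts {target}")
```

Combined with causal attention, this is why the same sequence can serve as both input and target: position *t* never sees the token it is asked to predict.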

# Training
In this section, the Decoder-based Transformer implemented above is trained.

In [127]:
# training loop
def training_loop(model, train_gen, eval_gen, train_steps=1, output_dir = "model"):
    '''
    Input:
        model (trax.layers.combinators.Serial): The model to train.
        train_gen (generator): Training stream of data.
        eval_gen (generator): Evaluation stream of data.
        train_steps (int): Number of training steps to run.
        output_dir (str): Folder in which checkpoints are saved.
    Returns:
        trax.supervised.training.Loop: Training loop.
    '''
    #output_dir = os.path.expanduser(output_dir)  # trainer is an object
    lr_schedule = trax.lr.warmup_and_rsqrt_decay(n_warmup_steps=1000, max_value=0.01)

    # define the training task
    train_task = training.TrainTask( 
      labeled_data= train_gen, # The training generator
      loss_layer= tl.CrossEntropyLoss(), # Loss function 
      optimizer= trax.optimizers.Adam(0.01), # Adam optimizer constructed with learning rate 0.01
      lr_schedule= lr_schedule,
      n_steps_per_checkpoint=100
    )

    # define the evaluation task
    eval_task = training.EvalTask( 
      labeled_data= eval_gen, # The evaluation generator
      metrics=[tl.CrossEntropyLoss(), tl.Accuracy()], # CrossEntropyLoss and Accuracy
      n_eval_batches = 10         # Number of batches to use on each evaluation
    )

    # define the training loop
    loop = training.Loop(
        model,
        train_task, # A train task
        eval_tasks = [eval_task], # The evaluation task
        output_dir=output_dir)
    
    # run the training loop for train_steps steps
    loop.run(n_steps= train_steps)
    return loop
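
The learning-rate schedule named above combines a warm-up with reciprocal square-root decay. The sketch below shows one common formulation of that idea: a linear ramp up to `max_value`, then decay proportional to 1/sqrt(step). It is an assumption that Trax's `warmup_and_rsqrt_decay` follows exactly this curve, so treat it as an illustration of the shape only; the function name is hypothetical.

```python
import math

def warmup_then_rsqrt(step, n_warmup_steps=1000, max_value=0.01):
    """Linear warm-up to max_value, then decay proportional to 1/sqrt(step)."""
    if step < n_warmup_steps:
        return max_value * step / n_warmup_steps
    return max_value * math.sqrt(n_warmup_steps / step)

for step in (10, 500, 1000, 4000):
    print(step, warmup_then_rsqrt(step))
```

Under this formulation, a run of only a few steps stays near the very beginning of the warm-up ramp, so the effective learning rate is far below `max_value`.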

### Training Process
Instantiate the training and validation data generators and set the number of training steps.

In [None]:
train_steps = 10           # number of training steps
!rm -f 'model/model.pkl.gz'   # Remove old model.pkl if it exists

# train data generator (loop=True so the stream is not exhausted during training)
train_data_gen = data_generator(train_articles, train_highlights, 
                                max_length = max_l, batch_size=64, 
                                eos_id=eos_id, sep_id=sep_id, pad_id=pad_id,
                                loop=True)

# validation data generator (loop=True so every evaluation can draw its batches)
val_data_gen = data_generator(val_articles, val_highlights, 
                              max_length = max_l, batch_size=64, 
                              eos_id=eos_id, sep_id=sep_id, pad_id=pad_id,
                              loop=True)

# define the model; its vocabulary size must match the SentencePiece vocabulary
model = TransformerLM(vocab_size=vocab_size)

# instantiate the training loop
#train_loops = training_loop(model, train_data_gen, val_data_gen, train_steps=train_steps)

# Conclusion
I was not able to train the model because of memory limits.
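
A rough back-of-the-envelope estimate suggests where much of the memory may go. The sketch below computes the size of the output logits tensor alone, for the batch size (64) and sequence length (2611) used above and for two illustrative vocabulary sizes, 2000 and 20000 (the latter being the tokenizer's `--vocab_size`). It assumes 32-bit floats and ignores activations, gradients, and optimizer state, so actual usage would be higher. It is an estimate, not a measurement of what happened in this notebook.

```python
def logits_bytes(batch_size, seq_len, vocab_size, bytes_per_value=4):
    """Size in bytes of one (batch, seq_len, vocab) tensor of float32 values."""
    return batch_size * seq_len * vocab_size * bytes_per_value

for vocab in (2000, 20000):
    gib = logits_bytes(64, 2611, vocab) / 2**30
    print(f"vocab_size={vocab}: {gib:.2f} GiB for the logits alone")
```

Since this size scales linearly with batch size and sequence length, a smaller batch or a cap on the maximum length would be natural ways to reduce it.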