# Transformer Summarizer

- Explore summarization using the transformer model
- Implement transformer decoder from scratch
![alt_text](images/transformerNews.png)

## Outline

- [Introduction](#0)
- [Part 1: Importing the dataset](#1)
    - [1.1 Tokenize & Detokenize helper functions](#1.1)
    - [1.2 Preprocessing for Language Models: Concatenate It!](#1.2)
    - [1.3 Batching with bucketing](#1.3)
- [Part 2: Summarization with transformer](#2)
    - [2.1 Dot product attention](#2.1)
        - [Exercise 01](#ex01)
    - [2.2 Causal Attention](#2.2)
        - [Exercise 02](#ex02)
    - [2.3 Transformer decoder block](#2.3)
        - [Exercise 03](#ex03)
    - [2.4 Transformer Language model](#2.4)
        - [Exercise 04](#ex04)
- [Part 3: Training](#3)
    - [3.1 Training the model](#3.1)
        - [Exercise 05](#ex05)
- [Part 4: Evaluation](#4)
    - [4.1 Loading in a trained model](#4.1)
- [Part 5: Testing with your own input](#5)
    - [Exercise 06](#ex06)
    - [5.1 Greedy decoding](#5.1)
        - [Exercise 07](#ex07)

<a name='0'></a>
### Introduction
- Summarization is an important task in NLP and can be useful for a consumer enterprise
- Bots can scrape articles and summarize them; you can then run sentiment analysis on the summaries to identify the sentiment about a particular stock
- Summarization also helps condense long emails or articles

In this notebook you will:

1. Use built-in functions to preprocess data
2. Implement DotProductAttention
3. Implement Causal Attention
4. Understand how attention works
5. Build the transformer model
6. Evaluate the model
7. Summarize the article

In [4]:
import sys
import os

import numpy as np
import textwrap
wrapper = textwrap.TextWrapper(width=70)

import trax
from trax import layers as tl
from trax.fastmath import numpy as jnp

# To print the entire np array
np.set_printoptions(threshold=sys.maxsize)

In [12]:
DATA_DIR = 'data'
VOCAB_DIR = os.path.join(DATA_DIR, 'vocab_dir')

In [5]:
# Importing CNN/DailyMail articles dataset
train_stream_fn = trax.data.TFDS('cnn_dailymail',
                                 data_dir=DATA_DIR,
                                 keys=('article', 'highlights'),
                                 train=True)

# This should be much faster because the data has already been downloaded.
eval_stream_fn = trax.data.TFDS('cnn_dailymail',
                                data_dir=DATA_DIR,
                                keys=('article', 'highlights'),
                                train=False)

Downloading and preparing dataset cnn_dailymail/plain_text/3.0.0 (download: 558.32 MiB, generated: 1.27 GiB, total: 1.82 GiB) to data/cnn_dailymail/plain_text/3.0.0...


Shuffling and writing examples to data/cnn_dailymail/plain_text/3.0.0.incompleteE8RCQQ/cnn_dailymail-train.tfrecord

Shuffling and writing examples to data/cnn_dailymail/plain_text/3.0.0.incompleteE8RCQQ/cnn_dailymail-validation.tfrecord

Shuffling and writing examples to data/cnn_dailymail/plain_text/3.0.0.incompleteE8RCQQ/cnn_dailymail-test.tfrecord

Dataset cnn_dailymail downloaded and prepared to data/cnn_dailymail/plain_text/3.0.0. Subsequent calls will reuse this data.


<a name='1.1'></a>
## 1.1 Tokenize & Detokenize helper functions

Just like in the previous assignment, the cell above loads in the encoder for you. Given any dataset, you have to be able to map words to their indices and indices back to their words. The inputs and outputs of your [Trax](https://github.com/google/trax) models are usually tensors of integers, where each integer corresponds to a word. If you were to process your data manually, you would need the following:

- <span style='color:blue'> word2Ind: </span> a dictionary mapping the word to its index.
- <span style='color:blue'> ind2Word:</span> a dictionary mapping the index to its word.
- <span style='color:blue'> word2Count:</span> a dictionary mapping the word to the number of times it appears. 
- <span style='color:blue'> num_words:</span> total number of words that have appeared. 
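
As a minimal sketch of these four structures, the snippet below builds them for a single toy sentence. The sentence, the whitespace split, and the reading of `num_words` as the number of distinct words are assumptions made for illustration; the notebook itself relies on a subword vocabulary instead.

```python
from collections import Counter

# Hypothetical toy corpus, used only for illustration
sentence = "the cat sat on the mat"
words = sentence.split()

# word2Count: how many times each word appears
word2Count = Counter(words)

# word2Ind: assign indices in order of first appearance
word2Ind = {}
for w in words:
    if w not in word2Ind:
        word2Ind[w] = len(word2Ind)

# ind2Word: the inverse mapping
ind2Word = {i: w for w, i in word2Ind.items()}

# num_words: interpreted here as the number of distinct words (assumption)
num_words = len(word2Ind)

# Round trip: words -> indices -> words
encoded = [word2Ind[w] for w in words]
decoded = " ".join(ind2Word[i] for i in encoded)

print(encoded)   # [0, 1, 2, 3, 0, 4]
print(decoded)   # the cat sat on the mat
```

The `tokenize` and `detokenize` helpers described next play the same two roles, but operate on subwords rather than whole words.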

Since you have already implemented these in previous assignments of the specialization, we will provide you with helper functions that will do this for you. Run the cell below to get the following functions:

- <span style='color:blue'> tokenize: </span> converts a text sentence to its corresponding token list (i.e. list of indices), splitting words into subwords along the way.
- <span style='color:blue'> detokenize: </span> converts a token list to its corresponding sentence (i.e. string).

In [19]:
def tokenize(input_str, EOS=1):
    """Input str to features dict, ready for inference"""
  
    # Use the trax.data.tokenize method. It takes streams and returns streams,
    # we get around it by making a 1-element stream with `iter`.
    inputs =  next(trax.data.tokenize(
        iter([input_str]),
        vocab_dir=VOCAB_DIR,
        vocab_file='summarize32k.subword.subwords'
    ))
    
    # Mark the end of the sentence with EOS
    return list(inputs) + [EOS]

def detokenize(integers):
    """List of ints to str"""
  
    s = trax.data.detokenize(
        integers,
        vocab_dir=VOCAB_DIR,
        vocab_file='summarize32k.subword.subwords'
    )
    
    return wrapper.fill(s)

<a name='1.2'></a>
## 1.2 Preprocessing for Language Models: Concatenate It!
**Transformer decoder** language model applied to an input-output problem
- Language models predict the next word; they have no notion of a separate input
- Concatenate the inputs with the targets, putting a separator in between
- Create a mask with 0s at the input positions and 1s at the target positions, so that the model is not penalized for mispredicting the article and focuses only on the summary


In [20]:
SEP = 0 # Padding or separator token
EOS = 1 # End of sentence token

# Concatenate tokenized inputs and targets using 0 as separator
def preprocess(stream):
    for (article, summary) in stream:
        joint = np.array(list(article) + [EOS, SEP] + list(summary) + [EOS])
        mask = [0] * (len(list(article)) + 2) + [1]* (len(list(summary)) + 1) # Accounting for EOS and SEP
        yield joint, joint, np.array(mask)
        
# You can combine a few data preprocessing steps into a pipeline like this
input_pipeline = trax.data.Serial(
    # Tokenizes
    trax.data.Tokenize(vocab_dir=VOCAB_DIR, vocab_file='summarize32k.subword.subwords'),
    # Uses function defined above
    preprocess,
    # Filters out examples longer than 2048
    trax.data.FilterByLength(2048)
)

# Apply preprocessing to data streams
train_stream = input_pipeline(train_stream_fn())
eval_stream = input_pipeline(eval_stream_fn())

train_input, train_target, train_mask = next(train_stream)

assert sum((train_input - train_target)**2) == 0 # They are the same in Language Model (LM)
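
To make the role of the mask concrete, here is a small NumPy sketch with made-up token IDs and made-up per-token losses. It assumes the mask is used as per-token loss weights, which is how a weighted cross-entropy loss would consume it; the numbers are hypothetical and not taken from the dataset.

```python
import numpy as np

EOS, SEP = 1, 0

# Hypothetical token IDs, chosen only for illustration
article = [11, 12, 13]
summary = [21, 22]

# Same layout as preprocess(): article, EOS, SEP, summary, EOS
joint = np.array(article + [EOS, SEP] + summary + [EOS])
mask = np.array([0] * (len(article) + 2) + [1] * (len(summary) + 1))

# Hypothetical per-token losses: large on the article, small on the summary
per_token_loss = np.array([2.0, 2.0, 2.0, 2.0, 2.0, 0.5, 0.3, 0.1])

# Only positions where mask == 1 contribute to the average
masked_mean = (per_token_loss * mask).sum() / mask.sum()
plain_mean = per_token_loss.mean()

print(joint)        # [11 12 13  1  0 21 22  1]
print(mask)         # [0 0 0 0 0 1 1 1]
print(masked_mean)  # about 0.3, the article losses are ignored
print(plain_mean)   # about 1.36, dominated by the article
```

With the mask applied, errors on the article positions do not move the loss at all, so training pressure falls on the summary tokens only.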

In [21]:
# prints mask, 0s on article, 1s on summary
print(f'Single example mask:\n\n {train_mask}')

Single example mask:

 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0

In [22]:
# prints: [Example][<EOS>][<pad>][Example Summary][<EOS>]
print(f'Single example:\n\n {detokenize(train_input)}')

Single example:

 By . Emily Allen . Last updated at 6:33 PM on 7th September 2011 .
Anthony Lloyd, 17, was found by police with his pockets 'bulging' with
stolen cigarettes and jewellery . A teenage looter who took part in
the Manchester riots lied in court about being an Olympic hopeful in a
bid to avoid jail. Anthony Lloyd, 17, who was caught with pockets
bulging with jewellery and cigarettes during the riots in Manchester
on August 9, was jailed for eight months yesterday. But now he could
now be hauled back into the dock after falsely claiming he was a
member of the British judo team in a desperate plea for leniency. A
letter was even presented to the judge at Manchester Magistrates Court
backing up the claim. Lloyd had managed to convince his unwitting
legal team he was a promising judo star. His defence lawyer Estelle
Parkhouse told the court: 'A custodial sentence would impair his
prospects with the squad and being part of the Olympics.' But it has
been discovered Lloyd has nev

<a name='1.3'></a>

## 1.3 Batching with bucketing

In [23]:
# Bucketing to create batched generators

# Buckets are defined in terms of boundaries and batch sizes
# batch_sizes[i] determines the batch size for items with length < boundaries[i]
# So below, we'll take a batch of 16 sentences of length < 128, 8 of length < 256,
# 4 of length < 512, 2 of length < 1024, and a batch of 1 for the remaining, longer sentences

boundaries = [128, 256, 512, 1024]
batch_sizes = [16, 8, 4, 2, 1]

# Create the streams
train_batch_stream = trax.data.BucketByLength(
    boundaries,
    batch_sizes
)(train_stream)

eval_batch_stream = trax.data.BucketByLength(
    boundaries,
    batch_sizes
)(eval_stream)
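
The mapping from example length to batch size can be sketched in plain Python. The helper `batch_size_for_length` is hypothetical and follows the `length < boundaries[i]` convention from the comments above; the exact handling of lengths that fall on a boundary inside Trax may differ.

```python
import bisect

boundaries = [128, 256, 512, 1024]
batch_sizes = [16, 8, 4, 2, 1]

def batch_size_for_length(length):
    """Hypothetical helper: pick the batch size for an example of this length.

    bisect_right returns the index of the first boundary strictly greater
    than `length`, i.e. the first bucket i with length < boundaries[i].
    Lengths beyond the last boundary fall into the extra, final bucket.
    """
    bucket = bisect.bisect_right(boundaries, length)
    return batch_sizes[bucket]

for length in (100, 300, 1076):
    print(length, batch_size_for_length(length))
# 100 16
# 300 4
# 1076 1
```

Longer examples get smaller batches, which keeps the number of tokens per batch roughly bounded.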

In [27]:
# Every execution will result in generation of a different article
# Try running this cell multiple times to see how the length of the examples affects the batch size
input_batch, _, mask_batch = next(train_batch_stream)

input_batch.shape

(1, 1076)

In [28]:
# print corresponding integer values
print(input_batch[0])

[  567   379  4773 13859 23839    58   186 13550   574 23839    58   379
  7226  5182  3047  6611   136  4601     3  2937   180  1731 16958     4
     2   406   754   429 11969 28081   379  9720 22449  3590  4601     3
  2937   180  1731 16958     4     2   406   754   429   379 16226   958
    11  5496  6758  9945  3403   417   229  2554    28   177  6369   809
   213  3843   819  3324 16864  2173   132 13597     2  4084     2  1019
  9022 11195    16   213  1293   261  2474   636 12461   379     9  9733
  8409  1779  2935   186  1526    64   213  1293   261  2474   636 12461
    70  1779    23    46   132 11235  1019    44    74    28  2593    70
    23  5159    28 17383   320   250    15 21048   921 16564   333     3
  5496  6758  9945  3403   417     2  1210     2    23    46 17694   254
    15  4161   132  3126   132  1110    70    72    91   102    22  1343
   635   101   186  5973    32    88   226   469   102 19387 25679   113
 19181  1551 10017   213   438 10725     3  1191   

In [32]:
# print the article and its summary
print('Article:\n\n', detokenize(input_batch[0]))

Article:

 By . Daily Mail Reporter and Associated Press Reporter . PUBLISHED: .
09:41 EST, 18 February 2013 . | . UPDATED: . 09:41 EST, 18 February
2013 . Terrorist: Ramzi Yousef is serving a life sentence at the ADX
supermax prison in Florence, Colorado, for masterminding the 1993
World Trade Center bombing . The convicted terrorist who planned and
carried out the 1993 World Trade Center bombing - who has been in
isolation for more than a decade - has filed a lawsuit to end his
solitary confinement. Ramzi Yousef, 45, has been imprisoned since his
capture in Pakistan in 1995 - two years after he killed six people and
injured 1,000 others after detonating explosives beneath the North
Tower. Since the September 11 attacks, the 45-year-old Pakistani
national has . been in solitary confinement in a 7-foot-by-11-foot
cell at the . federal ADX supermax prison in Colorado, known as 'the
Alcatraz of the Rockies.' Yousef says that despite good . behavior
while behind bars, he remains in solita

<a name='2'></a>
# Part 2: Summarization with transformer
![alt_text](images/transformer_decoder_zoomin.png)

<a name='2.1'></a>
## 2.1 Dot product attention 

Now you will implement dot product attention, which takes in a query, a key, a value, and a mask, and returns the attention output.  
![alt_text](images/dotproduct.png)

Here are some helper functions that will help you create tensors and display useful information:
   - `create_tensor` creates a JAX NumPy array (`jnp.array`) from a list of lists.
   - `display_tensor` prints out the shape and the actual tensor.

In [34]:
def create_tensor(t):
    """
    Create tensor from list of lists
    """
    return jnp.array(t)

def display_tensor(t, name):
    """Display shape and tensor"""
    print(f'{name} shape: {t.shape}\n')
    print(f'{t}\n')

The formula for attention is this one:

$$
\text{Attention}(Q, K, V)=\operatorname{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}+M\right) V\tag{1}
$$

$d_{k}$ stands for the dimension of the queries and keys, and $M$ is the mask that is added to the scaled scores before the softmax.

The `query`, `key`, `value` and `mask` tensors are provided for this example.

Notice that the masking is done using very negative values that will yield a similar effect to using $-\infty $. 
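
A quick way to see why a very negative value behaves like $-\infty$ is to push both through a softmax. The sketch below uses plain NumPy and a locally defined `softmax` helper; the scores are arbitrary numbers chosen for illustration.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

# Arbitrary scores; the middle position is the one we want to block
scores = np.array([1.0, 2.0, 3.0])

mask_large = np.array([0.0, -1e9, 0.0])     # very negative value
mask_inf = np.array([0.0, -np.inf, 0.0])    # true negative infinity

w_large = softmax(scores + mask_large)
w_inf = softmax(scores + mask_inf)

print(w_large)  # the blocked position gets weight 0.0
print(w_inf)    # same weights as above
```

In both cases the exponential of the masked score underflows to zero, so the blocked position receives no attention weight and the remaining weights still sum to one.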

In [35]:
q = create_tensor([[1, 0, 0], [0, 1, 0]])
display_tensor(q, 'query')
k = create_tensor([[1, 2, 3], [4, 5, 6]])
display_tensor(k, 'key')
v = create_tensor([[0, 1, 0], [1, 0, 1]])
display_tensor(v, 'value')
m = create_tensor([[0, 0], [-1e9, 0]])
display_tensor(m, 'mask')

query shape: (2, 3)

[[1 0 0]
 [0 1 0]]

key shape: (2, 3)

[[1 2 3]
 [4 5 6]]

value shape: (2, 3)

[[0 1 0]
 [1 0 1]]

mask shape: (2, 2)

[[ 0.e+00  0.e+00]
 [-1.e+09  0.e+00]]





In [36]:
q_dot_k = q @ k.T / jnp.sqrt(3)
display_tensor(q_dot_k, 'query dot key')

query dot key shape: (2, 2)

[[0.57735026 2.309401  ]
 [1.1547005  2.8867514 ]]



In [37]:
masked = q_dot_k + m
display_tensor(masked, 'masked query dot key')

masked query dot key shape: (2, 2)

[[ 5.7735026e-01  2.3094010e+00]
 [-1.0000000e+09  2.8867514e+00]]



In [38]:
display_tensor(masked @ v, 'masked query dot key dot value')

masked query dot key dot value shape: (2, 3)

[[ 2.3094010e+00  5.7735026e-01  2.3094010e+00]
 [ 2.8867514e+00 -1.0000000e+09  2.8867514e+00]]
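
The walk-through above multiplies the masked scores by the values directly and does not apply the softmax from equation (1). As a sketch of the complete formula on the same toy tensors, the snippet below uses plain NumPy rather than `trax.fastmath.numpy` so that it runs without Trax; the `softmax` and `attention` helpers are defined here for illustration only.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v, m):
    """Equation (1): softmax(Q K^T / sqrt(d_k) + M) V."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k) + m
    return softmax(scores) @ v

# Same toy tensors as in the cells above
q = np.array([[1, 0, 0], [0, 1, 0]], dtype=float)
k = np.array([[1, 2, 3], [4, 5, 6]], dtype=float)
v = np.array([[0, 1, 0], [1, 0, 1]], dtype=float)
m = np.array([[0, 0], [-1e9, 0]], dtype=float)

out = attention(q, k, v, m)
print(np.round(out, 4))
# approximately:
# [[0.8497 0.1503 0.8497]
#  [1.     0.     1.    ]]
```

The second query is masked from the first key, so all of its weight lands on the second value row `[1, 0, 1]`.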



- To use the dummy tensors above to test some of the graded functions, add a batch dimension to them so that they mimic the shape of real-life examples.
- The mask is also replaced by a boolean version that resembles the one used by trax:

In [39]:
q_with_batch = q[None, :]
display_tensor(q_with_batch, 'query with batch dim')

query with batch dim shape: (1, 2, 3)

[[[1 0 0]
  [0 1 0]]]



In [41]:
k_with_batch = k[None,:]
display_tensor(k_with_batch, 'key with batch dim')
v_with_batch = v[None,:]
display_tensor(v_with_batch, 'value with batch dim')
m_bool = create_tensor([[True, True], [False, True]])
display_tensor(m_bool, 'boolean mask')

key with batch dim shape: (1, 2, 3)

[[[1 2 3]
  [4 5 6]]]

value with batch dim shape: (1, 2, 3)

[[[0 1 0]
  [1 0 1]]]

boolean mask shape: (2, 2)

[[ True  True]
 [False  True]]
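
One way to relate the boolean mask to the additive mask `m` used earlier: `True` marks a position that may be attended to and `False` a position to block. The NumPy sketch below converts between the two under that assumption and checks that the result broadcasts over the batch dimension; how Trax consumes the boolean mask internally is not shown here.

```python
import numpy as np

q = np.array([[1, 0, 0], [0, 1, 0]], dtype=float)
k = np.array([[1, 2, 3], [4, 5, 6]], dtype=float)
m_bool = np.array([[True, True], [False, True]])

# Indexing with None adds a leading batch axis: (2, 3) -> (1, 2, 3)
q_with_batch = q[None, :]
k_with_batch = k[None, :]

# Boolean mask -> additive mask: keep True positions, block False ones
additive = np.where(m_bool, 0.0, -1e9)

# Batched scores have shape (1, 2, 2); the (2, 2) mask broadcasts over the batch
scores = q_with_batch @ np.swapaxes(k_with_batch, -1, -2) / np.sqrt(q.shape[-1])
masked = scores + additive

print(q_with_batch.shape)  # (1, 2, 3)
print(additive)
print(masked.shape)        # (1, 2, 2)
```

The additive mask recovered this way matches the `m` tensor defined earlier, so both forms describe the same blocked position.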

