# Generating Summaries
## Justyn Lewis and Emery Jacobowitz

---

Reading is hard. It would be really nice if we could just plug a text into a computer program that succinctly explains what's going on. This is the summarization problem: how do we figure out what information a text is trying to convey, and what should we report in our summary?

There are two approaches to summarization.

The first is **extractive** summarization, where we come up with some way of ranking how important each sentence is, then cut out everything except the top-ranked sentences. This is pretty easy to do, and it retains the original wording, but it's not how humans summarize things.
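To make the extractive idea concrete, here is a minimal sketch that ranks sentences by the summed frequency of their words and keeps the best one. The function name `extractive_summary`, the regex-based sentence splitting, and the scoring rule are all assumptions chosen for illustration; this is not the method used in the rest of this notebook.

```python
import re
from collections import Counter

def extractive_summary(text, n_sentences=1):
    # Crude sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    # Count how often each lowercased word appears in the whole text.
    freq = Counter(re.findall(r'[a-z]+', text.lower()))

    def score(sentence):
        # A sentence's score is the total frequency of the words it contains.
        return sum(freq[w] for w in re.findall(r'[a-z]+', sentence.lower()))

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    # Report the chosen sentences in their original order.
    return [s for s in sentences if s in top]

review = "The dog food is great. My dog loves this dog food. Shipping was slow."
summary = extractive_summary(review, n_sentences=1)
print(summary)
```

Note that this scoring favors long sentences, which is one reason real extractive systems use more careful ranking.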

The "natural" approach is **abstractive** summarization. This is where we read through a text, and come up with a novel summary that captures the main ideas. This is the approach we will be implementing today.

We start by importing a dataset. This is a set of Amazon reviews for various products over the span of a few years. Each one comes with a sample summary, which we will use to train our model.

In [1]:
import pandas as pd
import tensorflow as tf


# Getting the data
# This is a dataset of 500k Amazon reviews
# We only use 100k of these.
# forward slash avoids the invalid "\R" escape sequence and works on every OS
data = pd.read_csv('data/Reviews.csv', nrows=100000)
data.drop_duplicates(subset=['Text'],inplace=True)  #dropping duplicates
data.dropna(axis=0,inplace=True)   #dropping na
text_data = data['Text']


In [2]:
print(tf.__version__)
print(tf.keras.__version__)

print(tf.test.is_built_with_cuda())
print(tf.config.list_physical_devices('GPU'))

2.10.0
2.10.0
False
[]


## Prologue: Preparing the data

The first step when dealing with raw text is to clean it up. This step usually includes:
- Case normalization
- Lemmatizing
    - i.e. replacing inflected words with their "dictionary forms"
- Removing punctuation
- etc.

To accomplish this, we will use a **tokenizer** from Keras. The goal is to fit the tokenizer on our dataset, to create a "vocabulary" which accurately represents the kind of things we are going to summarize. With its default settings, the tokenizer also lowercases the text and strips punctuation. It does not lemmatize, so inflected forms of a word get separate vocabulary entries.

Before the tokenizer runs, we first need to wrap each summary in tags marking its start and end.

In [3]:
summary_data = data['Summary']
summary_data = ['_START_ ' + summary + ' _END_' for summary in summary_data]

summary_data[:3]

['_START_ Good Quality Dog Food _END_',
 '_START_ Not as Advertised _END_',
 '_START_ "Delight" says it all _END_']

In [5]:
from tensorflow.keras.preprocessing.text import Tokenizer

# AKA out-of-vocabulary token
# Stands in for any word the tokenizer did not see while it was being fit
oov = '_oov_'

# initialize the tokenizer
text_tokenizer = Tokenizer(oov_token=oov)

# fit it on our dataset
text_tokenizer.fit_on_texts(text_data)

# split the text up into sequences
input_seqs = text_tokenizer.texts_to_sequences(text_data)

# repeat for the summaries, using the summary tokenizer's own vocabulary
summary_tokenizer = Tokenizer(oov_token=oov)
summary_tokenizer.fit_on_texts(summary_data)
target_seqs = summary_tokenizer.texts_to_sequences(summary_data)

In [6]:
print('Input vocabulary size: ', len(text_tokenizer.word_index))
print('Target vocabulary size: ', len(summary_tokenizer.word_index))
text_tokenizer.texts_to_sequences(['We', 'love', 'Artificial', 'Intelligence', '!'])
# Each number is the word's index in the vocabulary. Indices are assigned in
# order of frequency, so a lower index means a more frequent word.
# Notice that "!" maps to an empty list, because the tokenizer strips out punctuation.




Input vocabulary size:  60712
Target vocabulary size:  15822


[[51], [56], [537], [11973], []]
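To show why the indices above track word frequency, here is a simplified pure-Python stand-in for what the tokenizer does. The helper names `fit_vocab` and `to_sequence` are hypothetical, and the alphabetical tie-breaking is an assumption made so the result is deterministic; it is a sketch of the idea, not the Keras implementation.

```python
import re
from collections import Counter

def fit_vocab(texts, oov_token='_oov_'):
    """Build a word -> index map. Index 1 is reserved for the OOV token."""
    counts = Counter()
    for text in texts:
        # Lowercase, then keep only runs of letters, digits and apostrophes.
        counts.update(re.findall(r"[a-z0-9']+", text.lower()))
    # Most frequent first; ties broken alphabetically (an assumption).
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    vocab = {oov_token: 1}
    for word in ordered:
        vocab[word] = len(vocab) + 1
    return vocab

def to_sequence(text, vocab, oov_token='_oov_'):
    """Map each word to its index, falling back to the OOV index."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [vocab.get(w, vocab[oov_token]) for w in words]

vocab = fit_vocab(["Good dog food", "good food, good price!"])
seq = to_sequence("Good cat food!", vocab)
print(vocab)
print(seq)
```

Here "good" is the most frequent word, so it gets the lowest non-OOV index, while the unseen word "cat" falls back to the OOV index.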

The last bit of preprocessing is to pad and truncate our sequences, since we are going to want to always feed in inputs of the same length to our model. This is why we needed to label the starts and ends of the summaries. The maximum length after preprocessing will just be the average item length plus a buffer for now. 

In [7]:
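from tensorflow.keras.preprocessing.sequence import pad_sequences
from math import floor
import numpy as np

# NOTE: text_data and summary_data hold raw strings, so len(item) counts
# characters, not tokens. The resulting limits are therefore generous
# compared with the average number of tokens per item.
input_max_len = floor(np.average([len(item) for item in text_data])) + 25
target_max_len = floor(np.average([len(item) for item in summary_data])) + 25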

print("Input maxlen:  ", input_max_len)
print("Target maxlen: ", target_max_len)

# the 'post' options mean we pad/truncate at the end of the sequence, not before it.
input_seqs = pad_sequences(input_seqs, maxlen=input_max_len, padding='post', truncating='post')
target_seqs = pad_sequences(target_seqs, maxlen=target_max_len, padding='post', truncating='post')

Input maxlen:   468
Target maxlen:  62
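
The effect of the two 'post' options on a single sequence can be sketched in plain Python. The helper `pad_or_truncate` is a hypothetical name used only for this illustration; the notebook itself relies on `pad_sequences`.

```python
def pad_or_truncate(seq, maxlen, pad_value=0):
    """Mimic padding='post' and truncating='post' for one sequence."""
    trimmed = seq[:maxlen]  # drop any tokens past maxlen
    # Fill the remaining slots at the end with the padding value.
    return trimmed + [pad_value] * (maxlen - len(trimmed))

short = pad_or_truncate([5, 8, 2], maxlen=5)
long = pad_or_truncate([5, 8, 2, 9, 4, 7, 1], maxlen=5)
print(short)
print(long)
```

Short sequences gain trailing zeros and long ones lose their tail, so every row ends up with exactly `maxlen` entries.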


A final adjustment is to choose the shuffle buffer and batch sizes, producing a shuffled, batched TensorFlow Dataset.

In [8]:
BUFFER = 20000
BATCH_SIZE = 32

input_seqs = tf.cast(input_seqs, dtype=tf.int32)
target_seqs = tf.cast(target_seqs, dtype=tf.int32)

dataset = tf.data.Dataset.from_tensor_slices((input_seqs, target_seqs)).shuffle(BUFFER).batch(BATCH_SIZE)
dataset

<BatchDataset element_spec=(TensorSpec(shape=(None, 468), dtype=tf.int32, name=None), TensorSpec(shape=(None, 62), dtype=tf.int32, name=None))>

## The Transformer Architecture, pt. 1
---
### Data Utilities
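The thing about self-attention in transformers is that, on its own, it ignores the order of its inputs, so we could end up losing ordinal information about our data. To solve this problem, we use "positional encodings", where we embed ordinal information within our word data. We want to implement these formulae:

$$PE_{(pos,\, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

$$PE_{(pos,\, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$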

In [9]:
import numpy as np

# Math is hard.
# This section is closely adapted from https://medium.com/swlh/abstractive-text-summarization-using-transformers-3e774cc42453

# calculate PE for every position below `position` and every dimension below `d_model`
def positional_encoding(position, d_model):
    angle_rads = get_angles(
        np.arange(position)[:, np.newaxis],
        np.arange(d_model)[np.newaxis, :],
        d_model
    )

    # apply sin to even indices in the array; 2i
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])

    # apply cos to odd indices in the array; 2i+1
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])

    pos_encoding = angle_rads[np.newaxis, ...]

    return tf.cast(pos_encoding, dtype=tf.float32)

# generate the internal argument to the trig function
def get_angles(position, i, d_model):
    angle_rates = 1 / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
    return position * angle_rates
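
As a quick sanity check on the formulae, here is a self-contained NumPy sketch of the same sinusoidal encoding, without the TensorFlow cast or the extra batch axis. The name `sinusoidal_encoding` is an assumption for this example only.

```python
import numpy as np

def sinusoidal_encoding(num_positions, d_model):
    pos = np.arange(num_positions)[:, np.newaxis]   # shape (positions, 1)
    i = np.arange(d_model)[np.newaxis, :]           # shape (1, d_model)
    # Columns 2i and 2i+1 share the same angle rate.
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    angles[:, 0::2] = np.sin(angles[:, 0::2])       # even columns: sine
    angles[:, 1::2] = np.cos(angles[:, 1::2])       # odd columns: cosine
    return angles

pe = sinusoidal_encoding(num_positions=4, d_model=6)
print(pe.shape)
print(pe[0])
```

At position 0 every angle is 0, so the even columns hold sin(0) = 0 and the odd columns hold cos(0) = 1.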

Next, we need two small masks.

One is the padding mask, which ignores the padded portions of our sequences.

The other is the lookahead mask, which hides the words that come after a given word. This is important because we want to make predictions based only on the words we have seen already, since language is linear.

In [10]:
def padding_mask(seq):
    # we padded with 0s
    remove_padding = tf.math.equal(seq, 0)
    new_seq = tf.cast(remove_padding, tf.float32)
    # add two new axes
    return new_seq[:, tf.newaxis, tf.newaxis, :]

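# band_part(ones, -1, 0) keeps the lower triangle, diagonal included.
# Subtracting it from 1 leaves 1s strictly above the diagonal, so row i
# marks every position after i as hidden. Adapted from the same source as above.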
def lookahead_mask(size):
    mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
    return mask

print(padding_mask(input_seqs[:3]))
print(padding_mask(input_seqs[:4]))
print(padding_mask(input_seqs[:5]))
print('==================================================')
print(lookahead_mask(5))
print(lookahead_mask(4))
print(lookahead_mask(3))
print(lookahead_mask(2))

tf.Tensor(
[[[[0. 0. 0. ... 1. 1. 1.]]]


 [[[0. 0. 0. ... 1. 1. 1.]]]


 [[[0. 0. 0. ... 1. 1. 1.]]]], shape=(3, 1, 1, 468), dtype=float32)
tf.Tensor(
[[[[0. 0. 0. ... 1. 1. 1.]]]


 [[[0. 0. 0. ... 1. 1. 1.]]]


 [[[0. 0. 0. ... 1. 1. 1.]]]


 [[[0. 0. 0. ... 1. 1. 1.]]]], shape=(4, 1, 1, 468), dtype=float32)
tf.Tensor(
[[[[0. 0. 0. ... 1. 1. 1.]]]


 [[[0. 0. 0. ... 1. 1. 1.]]]


 [[[0. 0. 0. ... 1. 1. 1.]]]


 [[[0. 0. 0. ... 1. 1. 1.]]]


 [[[0. 0. 0. ... 1. 1. 1.]]]], shape=(5, 1, 1, 468), dtype=float32)
tf.Tensor(
[[0. 1. 1. 1. 1.]
 [0. 0. 1. 1. 1.]
 [0. 0. 0. 1. 1.]
 [0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0.]], shape=(5, 5), dtype=float32)
tf.Tensor(
[[0. 1. 1. 1.]
 [0. 0. 1. 1.]
 [0. 0. 0. 1.]
 [0. 0. 0. 0.]], shape=(4, 4), dtype=float32)
tf.Tensor(
[[0. 1. 1.]
 [0. 0. 1.]
 [0. 0. 0.]], shape=(3, 3), dtype=float32)
tf.Tensor(
[[0. 1.]
 [0. 0.]], shape=(2, 2), dtype=float32)
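
To see what the lookahead mask does once it reaches the softmax, here is a small NumPy sketch. It assumes every pair of positions has the same raw score (all zeros), which is an artificial setup chosen so the effect of the mask is easy to read; `np.triu` is used here in place of `tf.linalg.band_part`.

```python
import numpy as np

def softmax(x):
    # Subtract the row maximum first for numerical stability.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

size = 3
# 1s strictly above the diagonal, matching lookahead_mask(3) above.
look_mask = np.triu(np.ones((size, size)), k=1)
scores = np.zeros((size, size))  # pretend every pair scores equally
# Masked positions receive a huge negative score, so softmax sends them to 0.
weights = softmax(scores + look_mask * -1e9)
print(weights)
```

Row *i* ends up spreading its attention evenly over positions 0 through *i* and giving zero weight to everything later.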


## The Transformer Architecture, pt. 2
---
### Paying Self-Attention

At this point, we can start assembling the transformer. The first step is to implement a scaled dot product function, which is the key to our self-attention layer. This includes a few components: 
- *Matrix multiplication* (i.e. dot product) of our query and our key
- *Scaling*
    - We scale the scores down to keep them from growing too large, dividing by $\sqrt{d_k}$, where $d_k$ is the dimensionality of the keys.
- An optional *mask*
    - This is where we will use our padding and lookahead masks from before
- *Softmax*
    - Transforming the scores into weights in \[0, 1\] that sum to 1, i.e. a probability distribution.
- Another step of matrix multiplication of the result of the above with our value.

Thus, the scaled dot product attention is equivalent to

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

although the effect of the mask is not shown here.


In [11]:
# takes in a query, a key, a value, and an optional mask.
# returns a tuple of (the output, the attention weights)
def sdp_attention(q, k, v, mask=None):
    # Step 1: QK^T, AKA matrix multiplication
    qk = tf.matmul(q, k, transpose_b=True)
    
    # Scaling: divide by sqrt(d_k), where d_k is the size of the last axis of k
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled = qk / tf.math.sqrt(dk)
    
    # apply the mask, only if it is not the default None option.
    if mask is not None:
        scaled += (mask * -1e9)
    
    # softmax
    softmax = tf.nn.softmax(scaled)
    
    # finally, multiply by v
    result = tf.matmul(softmax, v)
    return result, softmax
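
For readers who want to check the shapes without TensorFlow, here is a NumPy sketch of the same computation on a tiny random example. The sizes (2 queries, 3 keys, depth 4, value size 5) and the names `attention` and `pad_mask` are assumptions picked for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v, mask=None):
    d_k = k.shape[-1]
    # QK^T, scaled by the square root of the key depth.
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d_k)
    if mask is not None:
        scores = scores + mask * -1e9   # hide masked key positions
    weights = softmax(scores)
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.normal(size=(2, 4))   # 2 query positions, depth 4
k = rng.normal(size=(3, 4))   # 3 key positions, depth 4
v = rng.normal(size=(3, 5))   # one value vector per key position
out, weights = attention(q, k, v)

# Pretend the last key position is padding.
pad_mask = np.array([[0.0, 0.0, 1.0]])
_, masked_weights = attention(q, k, v, mask=pad_mask)
print(out.shape, weights.shape)
```

Each row of `weights` is a probability distribution over the key positions, and masking a key drives its weight to zero for every query.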

Next, we need to actually make this into a layer in our Transformer architecture. We do this by implementing a subclass of the Layer class from Keras. The important method that we implement is call(), which applies the logic of the layer. Keras documentation also suggests implementing a few other methods, but we don't think those are necessary for our project, so we are going to tactically ignore them.

We make it "multi-headed", which means that we split the query, key, and value into several smaller pieces ("heads") and run attention on all of them in parallel, so each head can focus on a different part of the representation. If we want to split into *h* heads, then the size of each head's query, key, and value (AKA the *depth*) will be equal to *d_model* / *h*. Therefore, *d_model* must be evenly divisible by *h*.
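
The head-splitting step is just a reshape and a transpose, which can be sketched in NumPy. The sizes below (batch of 2, sequence length 5, *d_model* of 6, 3 heads) are assumptions chosen to keep the arrays small.

```python
import numpy as np

batch, seq_len, d_model, heads = 2, 5, 6, 3
depth = d_model // heads  # each head sees a slice of width 2

x = np.arange(batch * seq_len * d_model, dtype=np.float32)
x = x.reshape(batch, seq_len, d_model)

# split: (batch, seq_len, d_model) -> (batch, heads, seq_len, depth)
split = x.reshape(batch, seq_len, heads, depth).transpose(0, 2, 1, 3)

# merge: undo the transpose, then flatten heads * depth back into d_model
merged = split.transpose(0, 2, 1, 3).reshape(batch, seq_len, d_model)
print(split.shape, merged.shape)
```

Head 1 of the first token holds columns 2 and 3 of that token's vector, and merging the heads recovers the original array exactly.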

In [26]:
class AttentionLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, heads):
        super(AttentionLayer, self).__init__()
        self.d_model = d_model
        self.heads = heads
        
        # d_model needs to be divisible by the number of heads
        if not d_model % heads == 0:
            raise RuntimeError('d_model (' + str(d_model) + ') must be divisible by heads (' + str(heads) + ').')
        
        # integer division, because the depth is used as a tensor dimension
        self.depth = d_model // heads
        
        # make Keras densely-connected NN layers for the data we eventually have
        self.q_layer = tf.keras.layers.Dense(d_model)
        self.k_layer = tf.keras.layers.Dense(d_model)
        self.v_layer = tf.keras.layers.Dense(d_model)
        
        # dense layer that will eventually store the output
        self.out_layer = tf.keras.layers.Dense(d_model)
    
    def call(self, q, v, k, mask):
        batch = tf.shape(q)[0]

        # start applying all those layers we made
        Q = self.q_layer(q)
        V = self.v_layer(v)
        K = self.k_layer(k)
        
        # split the heads apart, so that we are multi-headed
        split_Q = self.split(Q, batch)
        split_V = self.split(V, batch)
        split_K = self.split(K, batch)
    
        # apply the attention to the split heads
        attention = sdp_attention(split_Q, split_K, split_V, mask)
        weights = attention[1]
        # (batch, heads, seq_len, depth) -> (batch, seq_len, heads, depth)
        attention_results = tf.transpose(attention[0], perm=[0,2,1,3])
        
        # concatenate attention back to one vector
        attention_results = tf.reshape(attention_results, (batch, -1, self.d_model))
        output = self.out_layer(attention_results)
        
        return output, weights
        
    
    # helper method for splitting the heads
    def split(self, layer, batch):
        Layer = tf.reshape(layer, (batch, -1, self.heads, self.depth))
        Layer = tf.transpose(Layer, perm=[0,2,1,3])
        return Layer
        
AttentionLayer(6, 3)

<__main__.AttentionLayer at 0x1da0967a920>

### Feeding it Forward

The last basic unit of a transformer that we have to implement is a Feed-Forward Network (FFN) layer. This is pretty simple, since we just need to sequentially go from one layer to another, and Keras has this functionality.

In [27]:
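def ffn(d_model, d_ffn):
    # Expand to d_ffn units with a ReLU, as in the standard Transformer.
    # Without an activation, the two dense layers collapse into one linear map.
    l1 = tf.keras.layers.Dense(d_ffn, activation='relu')
    # project back down to d_model
    l2 = tf.keras.layers.Dense(d_model)
    return tf.keras.Sequential([l1, l2])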

## The Transformer Architecture, pt. 3
---
### Building the Transformer 

So far, we've made the individual building blocks of the Transformer, which process data at each step. Now, we can build the main components: the Encoder and the Decoder. These are composed of individual layers, each of which contains:
- An attention layer
- An FFN layer
- Normalization layers
- Dropout layers
    - When training, a dropout layer randomly sets some inputs to 0 and scales the remaining inputs up by 1 / (1 - rate), so that the expected sum stays the same.
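
The dropout behavior described in the list can be sketched in NumPy. The function `inverted_dropout` is a hypothetical helper written for this illustration; the notebook itself uses the Keras `Dropout` layer.

```python
import numpy as np

def inverted_dropout(x, rate, rng):
    # True where the unit survives; each unit is dropped with probability `rate`.
    keep = rng.random(x.shape) >= rate
    # Survivors are scaled up so the expected value of the output matches x.
    return np.where(keep, x / (1.0 - rate), 0.0)

rng = np.random.default_rng(0)
x = np.ones(10000)
dropped = inverted_dropout(x, rate=0.1, rng=rng)
print(dropped[:10])
print(dropped.mean())
```

Every output is either 0 or 1 / 0.9, and the mean stays close to 1, which is why no rescaling is needed when dropout is switched off at inference time.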

In [32]:
class EncoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, heads, d_ffn):
        super(EncoderLayer, self).__init__()
        
        # the MH Attention layer
        self.att = AttentionLayer(d_model, heads)
        
        # the FFN  layer
        self.ffn = ffn(d_model, d_ffn)
        
        # Normie layers
        # we use a smaller epsilon value than the Keras default (1e-3)
        self.norm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.norm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        
        # Dropout layers
        # Each one zeroes 10% of its inputs during training.
        self.drop1 = tf.keras.layers.Dropout(0.1)
        self.drop2 = tf.keras.layers.Dropout(0.1)
        
    def call(self, datum, is_training, mask):
        # Sorry for the weird formatting.
        # I think this is the best way to present the process
        
        # the attention layer returns (output, weights); we only need the output here
        att_out, _ = self.att(datum, datum, datum, mask)
        
        # <...>   <---   normalize <--  dropout    <--   attention <-- input
        attention = self.norm1( datum + self.drop1( att_out, training=is_training))
        
        #           normalize     <--       dropout <-- ffn <-- <...>
        feed = self.norm2( attention + self.drop2( self.ffn(attention), training=is_training))
        
        return feed

A Decoder layer is pretty much the same as an Encoder layer, except that it has an extra attention layer, which attends over the Encoder's output.

In [35]:
class DecoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, heads, d_ffn):
        super(DecoderLayer, self).__init__()
        
        # the MH Attention layers
        self.att1 = AttentionLayer(d_model, heads)
        self.att2 = AttentionLayer(d_model, heads)
        
        # the FFN  layer
        self.ffn = ffn(d_model, d_ffn)
        
        # Normie layers
        # we use a smaller epsilon value than the Keras default (1e-3)
        self.norm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.norm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.norm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        
        # Dropout layers
        # Each one zeroes 10% of its inputs during training.
        self.drop1 = tf.keras.layers.Dropout(0.1)
        self.drop2 = tf.keras.layers.Dropout(0.1)
        self.drop3 = tf.keras.layers.Dropout(0.1)
        
    # The decoder uses both of the masks we defined earlier
    def call(self, datum, encoder_result, is_training, look_mask, pad_mask):
        
        # The first attention layer: self-attention over the summary so far
        att1_res = self.att1(datum, datum, datum, look_mask)
        att1_out = att1_res[0]
        att1_weights = att1_res[1]
        att1_out = self.norm1(datum + (self.drop1(att1_out, training=is_training)))
        
        # This extra attention layer uses the first layer's output as the query
        # and the encoder's output as the key and value.
        att2_res = self.att2(att1_out, encoder_result, encoder_result, pad_mask)
        att2_out = att2_res[0]
        att2_weights = att2_res[1]
        att2_out = self.norm2(att1_out + (self.drop2(att2_out, training=is_training)))
        
        # Finally, the ffn layer
        ffn_out = self.norm3(att2_out + self.drop3(self.ffn(att2_out), training=is_training))
        
        return ffn_out, att1_weights, att2_weights
        

### Assembling the components

Now that we have our Encoder/Decoder layers, we can actually put them together to form the Encoder and Decoder themselves.

These are basically just bigger Keras Layers that have multiple smaller layers inside.

In [39]:
class Encoder(tf.keras.layers.Layer):
    def __init__(self, layers, d_model, heads, d_ffn, vocab_size, max_pos):
        super(Encoder, self).__init__()

        self.d_model = d_model
        self.layers = layers

        # creates an embedding layer with the size of the vocabulary
        self.embedding = tf.keras.layers.Embedding(vocab_size, self.d_model)
        
        self.pos_enc = positional_encoding(max_pos, self.d_model)

        self.encoder_layers = []
        for i in range(layers):
            l = EncoderLayer(d_model, heads, d_ffn)
            self.encoder_layers.append(l)

        self.drop = tf.keras.layers.Dropout(0.1)
    
    def call(self, datum, is_training, mask):
        # get the length of the input
        length = tf.shape(datum)[1]
        
        # get the input embedding
        Datum = self.embedding(datum)
        Datum = Datum * tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        
        # apply the positional encoding
        Datum += self.pos_enc[:, :length, :]
        
        # apply dropout
        Datum = self.drop(Datum, training=is_training)
        
        # apply all of the Encoder layers in order
        for layer in self.encoder_layers:
            Datum = layer(Datum, is_training, mask)
        
        return Datum

