# Text Summarization

## Abstract
The goal of this notebook is to implement, measure and compare different ML models for abstractive text summarisation.

In particular, three versions will be explored: two recurrent-based networks and a transformer-based network.
- LSTM
- Bi-directional GRU
- Fine tuning a pre-trained transformer model

## Introduction
The objective is to generate an abstractive summary of a given text input. Abstractive in this context means that potentially new words will be generated to represent the original input. This is in contrast with extractive text summarization strategies which would take parts of the original input and compose them to generate the summary.

## Table of contents
* [1. Framing the problem](#framing-the-problem)
* [2. The data](#the-data)
* [3. Exploring the data](#exploring-the-data)
* [4. Processing the data](#processing-the-data)
* [5. Model exploration](#model-exploration)
    * [5.1 LSTM-based architecture](#lstm)
    * [5.2 GRU-based architecture](#gru)
* [6. Comparing the results](#comparing-the-results)

## 1. Framing the problem <a class="anchor" id="framing-the-problem"></a>

The business objective is to produce a shorter version of an input text that contains its most important information, so that more documents can be processed without losing the substance of the original text.

We can express this problem more formally as follows. The abstractive text summarization algorithm receives a sequence of words (s1, s2, ..., sn) and returns another sequence (o1, o2, ..., om), where m < n and the elements in the output sequence do not need to appear in the input sequence.

This can be solved using supervised learning where pairs of text and their manually generated summaries will be provided to the learning algorithm. 

The models to explore are the traditional sequence-to-sequence (seq2seq) recurrent neural networks and a pre-trained transformer, all of which use an encoder-decoder architecture.

We will use TensorFlow and Hugging Face Transformers to create the neural network models.

## 2. The Data <a class="anchor" id="the-data"></a>
We will use the Amazon Fine Food Reviews data set, which contains about 500,000 samples. However, we will only use the first 100,000 samples for training and validation because of the limited computational resources available.

Amazon Fine Food Reviews data set from Kaggle:
https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews

## 3. Exploring the Data <a class="anchor" id="exploring-the-data"></a>
In the following section we will:
- Check the structure of the data set
- Display only Text and Summary columns since that's all we are interested in
- Observe from the previous point that the summaries are quite short. This will limit the size of the generated summaries, which might be a problem if the business problem required longer documents and summaries. In that case we could look for another data set at this point.
- Plot a chart to confirm the observation about the length of the summaries, with most of them being under 10 words long. This is before we process the data, which will shorten the summaries slightly: most of them use very succinct language with few articles, and articles are among the words removed during preprocessing.

Additional checks we could perform include looking for rare words, since we might want to remove them during the processing of the data.
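
One way to run the rare-word check is to count word frequencies across the corpus and list the words that fall below a threshold. The sketch below uses only the standard library. The function name `find_rare_words`, the whitespace tokenization, and the `min_count` threshold are illustrative assumptions and are not part of the notebook's utilities.

```python
from collections import Counter


def find_rare_words(texts, min_count=2):
    """Return the words that occur fewer than `min_count` times across `texts`."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return sorted(word for word, n in counts.items() if n < min_count)


# Toy corpus standing in for the review texts (hypothetical data).
sample_texts = [
    "great coffee great price",
    "coffee arrived stale",
    "great taste",
]
rare = find_rare_words(sample_texts, min_count=2)
print(rare)
```

With the notebook's data, the same function could be called on the non-null entries of the `Text` column.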

In [None]:
import warnings
warnings.filterwarnings('ignore')

### Dependencies 

In [None]:
# pip install pandas
# pip install numpy
# pip install seaborn
# pip install matplotlib
# pip install nltk
# pip install wordcloud
# pip install scikit-learn
# conda install -c apple tensorflow-deps
# pip install tensorflow-macos
# pip install tensorflow-metal
# pip install -q contractions==0.0.48
# Download TF_text (https://github.com/sun1638650145/Libraries-and-Extensions-for-TensorFlow-for-Apple-Silicon/releases/download/v2.13/tensorflow_text-2.13.0-cp38-cp38-macosx_11_0_arm64.whl)
# pip install <path-to-TF-text.whl>


In [None]:
# Foundation
import os
import re
import pickle
import string
import unicodedata
from random import randint

# Scientific computing
import numpy as np
import pandas as pd
from sklearn.utils import shuffle

# Data sets
from dataset_utils import merge_headlines_and_text_from_csv_files, percentage_of_words_that_count_under_limit
from nlp_preprocessing import preprocess_data_frame, remove_indexes, preprocess_text, remove_long_text_and_summary_from_data_frame

# Plotting
import matplotlib.pyplot as plt

from attention import AttentionLayer
from tf_utils import DataSets, tokenized_padded_sets, get_embedding_matrix
# from sklearn.model_selection import train_test_split
# # TensorFlow
import tensorflow as tf
import tensorflow_text as text
from tensorflow.keras import Input, Model
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.layers import LSTM, Bidirectional, Dense, Embedding, TimeDistributed, Concatenate
from tensorflow.keras import backend as K

import nltk

from summary_inference import inference_models_from_lstm, decode_sequence, seq2summary, seq2text

In [None]:
# Load the data set and print structure
reviews_path = "./data/Reviews.csv"
df = pd.read_csv(reviews_path, nrows=100000)
df.head(5)

<img src="./images/exploring-the-data-1.PNG" alt="Initial data structure" />

In [None]:
# Display only the relevant columns
df.loc[:,['Text', 'Summary']].head(5)

<img src="./images/exploring-the-data-2.PNG" alt="Initial data structure" />

In [None]:
# Let's quantify the size of the summaries in the training set
summary_lengths = [len(summary.split()) for summary in df.Summary if isinstance(summary, str)]

pd.DataFrame({'summaries': summary_lengths}).hist(bins=30)
plt.show()

<img src="./images/exploring-the-data-3.PNG" alt="Initial data structure" />

## 4. Processing the data <a class="anchor" id="processing-the-data"></a>
- Leaving only the Text and Summary columns
- Rename to text and headlines. This is purely for consistency if we decide to use another data set.
- Remove duplicates
- Lower case
- Remove html tags
- Remove hyperlinks
- Expand contractions
- Remove possessive apostrophe
- Remove numbers
- Remove stopwords
- Remove special characters
- Remove entries with a length not within the max limits that we'll determine after further data inspection
- Add start and end tokens to the summaries
- Shuffle the data

In this example we will preprocess the text and the summary in the same way, except that start and end tokens are added only to the summaries.

The implementation of all these utilities can be found in _nlp_preprocessing.py_
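
The cleaning steps listed above can be sketched with the standard library alone. This is a minimal illustration and not the code in _nlp_preprocessing.py_: the function names `clean_text` and `add_boundary_tokens`, the tiny `CONTRACTIONS` and `STOP_WORDS` tables, and the exact order of the steps are assumptions made for the example. The notebook itself depends on the `contractions` package and would use a much larger stop-word list.

```python
import re

# Tiny illustrative tables (assumptions for this sketch only).
CONTRACTIONS = {"don't": "do not", "it's": "it is", "can't": "cannot"}
STOP_WORDS = {"a", "an", "the", "is", "it", "this", "and", "of", "to"}


def clean_text(text):
    """Apply the cleaning steps from the list above to a single string."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)                # remove HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove hyperlinks
    for short, full in CONTRACTIONS.items():            # expand contractions
        text = text.replace(short, full)
    text = re.sub(r"'s\b", "", text)                    # remove possessive apostrophe
    text = re.sub(r"\d+", " ", text)                    # remove numbers
    text = re.sub(r"[^a-z\s]", " ", text)               # remove special characters
    words = [w for w in text.split() if w not in STOP_WORDS]  # remove stop words
    return " ".join(words)


def add_boundary_tokens(summary, start_token="tokenstart", end_token="tokenend"):
    """Wrap a cleaned summary in the start and end tokens used by the decoder."""
    return f"{start_token} {summary} {end_token}"


raw = "I don't like the <b>dog's</b> food, see http://example.com 2 stars!"
cleaned = clean_text(raw)
print(cleaned)
print(add_boundary_tokens("good food"))
```

The start and end tokens are plain lowercase words so that they survive the same cleaning and tokenization as any other word.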

In [None]:
# == DATA SET ==
# Amazon Fine Food Reviews data set from Kaggle
# https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews
# The data set contains about 500,000 samples. Here we train on a subset to speed up training
start_token = 'tokenstart'
end_token = 'tokenend'

def remove_columns_except_headlines_and_text(data_frame):
    cols_to_remove = data_frame.columns.tolist()
    cols_to_remove.remove('Summary')
    cols_to_remove.remove('Text')

    data_frame.drop(cols_to_remove, axis='columns', inplace=True)

remove_columns_except_headlines_and_text(df)
df.rename(columns={"Summary": "headlines", "Text": "text"}, inplace=True)
print(f"Before removing duplicates {len(df)}")

df.drop_duplicates(subset=['text'], inplace=True)
df.dropna(axis=0, inplace=True)
print(f"After removing duplicates {len(df)}")

preprocess_data_frame(df, stemming=False, start_token=start_token, end_token=end_token)
# df = shuffle(df)
print("Finished preprocessing the data set.")
df.head(5)


<img src="./images/processing-the-data-1.PNG" alt="Initial data structure" />

In [None]:
# DETERMINING MAX-LENGTH OF PROCESSED SENTENCES
#
# Since we have not yet checked how much the sentence lengths vary
# or whether there are outliers, we plot the distribution of the length counts
# and pick a max-length close to the smallest value that
# still covers a large percentage of the population.

text_count = [len(sentence.split()) for sentence in df.text]
headlines_count = [len(sentence.split()) for sentence in df.headlines]

pd.DataFrame({'text': text_count, 'headlines': headlines_count}).hist(bins=30)
plt.show()

print(f"The real max length in the text column is {np.max(text_count)}")
print(f"The real max length in the headlines column is {np.max(headlines_count)}")

text_ratio_70 = percentage_of_words_that_count_under_limit(df.text, 70)
text_ratio_80 = percentage_of_words_that_count_under_limit(df.text, 80)
text_ratio_90 = percentage_of_words_that_count_under_limit(df.text, 90)
text_ratio_100 = percentage_of_words_that_count_under_limit(df.text, 100)
headline_ratio_10 = percentage_of_words_that_count_under_limit(df.headlines, 10)
headline_ratio_12 = percentage_of_words_that_count_under_limit(df.headlines, 12)

print(f"Ratio of text with a count equal or less than {70} is {text_ratio_70}")
print(f"Ratio of text with a count equal or less than {80} is {text_ratio_80}")
print(f"Ratio of text with a count equal or less than {90} is {text_ratio_90}")
print(f"Ratio of text with a count equal or less than {100} is {text_ratio_100}")

print(f"Ratio of headlines with a count equal or less than {10} is {headline_ratio_10}")
print(f"Ratio of headlines with a count equal or less than {12} is {headline_ratio_12}")


# Therefore we will pick:
# 100 as the text's max-length since 94% of the population has 100 or fewer words
# 10 as the summary's max-length since 99% of the population has 10 or fewer words
max_text_len = 100
max_summary_len = 10

<img src="./images/processing-the-data-2.PNG" alt="Initial data structure" />
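
The coverage ratios printed above come from `percentage_of_words_that_count_under_limit` in `dataset_utils`, whose implementation is not shown here. The sketch below is an assumption about what such a helper computes: the fraction of samples whose word count is at or below a limit. It also adds a hypothetical helper, `smallest_limit_covering`, that picks the smallest candidate limit reaching a target coverage.

```python
def ratio_under_limit(texts, limit):
    """Fraction of `texts` whose whitespace-separated word count is <= limit."""
    texts = list(texts)
    if not texts:
        return 0.0
    within = sum(1 for t in texts if len(t.split()) <= limit)
    return within / len(texts)


def smallest_limit_covering(texts, target_ratio, candidates):
    """Smallest candidate limit covering at least `target_ratio` of texts, else None."""
    texts = list(texts)
    for limit in sorted(candidates):
        if ratio_under_limit(texts, limit) >= target_ratio:
            return limit
    return None


# Toy samples with 1, 3, 7 and 2 words (hypothetical data).
samples = ["good", "very good food", "my dog loves this food a lot", "ok product"]
ratio_3 = ratio_under_limit(samples, 3)
best = smallest_limit_covering(samples, 0.75, [2, 3, 5, 8])
print(ratio_3, best)
```

Choosing the limit this way trades a small loss of long samples for much shorter padded sequences, which reduces training time.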

In [None]:
df = remove_long_text_and_summary_from_data_frame(df, 
                                             max_text_len, 
                                             max_summary_len, 
                                             "text", 
                                             "headlines")
print(f'Dataset size: {len(df)} after deleting long text and summaries')

In [None]:
# Split the data sets, tokenize and pad the sequences
x_tokenizer = Tokenizer()
y_tokenizer = Tokenizer()

data_sets = tokenized_padded_sets(df, x_tokenizer, y_tokenizer, max_text_len, max_summary_len, test_size_ratio=0.1)

x_train_padded = data_sets.train.x
y_train_padded = data_sets.train.y
x_val_padded = data_sets.val.x
y_val_padded = data_sets.val.y

# Remove summaries and texts where the summary only contains the start and end tokens
remove_train_indexes = remove_indexes(y_train_padded)
remove_val_indexes = remove_indexes(y_val_padded)

y_train_padded = np.delete(y_train_padded, remove_train_indexes, axis=0)
x_train_padded = np.delete(x_train_padded, remove_train_indexes, axis=0)
y_val_padded = np.delete(y_val_padded, remove_val_indexes, axis=0)
x_val_padded = np.delete(x_val_padded, remove_val_indexes, axis=0)

## 5. Model Exploration <a class="anchor" id="model-exploration"></a>

### 5.1 LSTM-based network <a class="anchor" id="lstm"></a>

Below is the diagram of the first model. It follows an encoder-decoder architecture. 

The encoder has three stacked LSTM layers. For each layer the diagram shows the time steps.

The attention layer sits between the encoder and the decoder, capturing the encoder's hidden states into a context vector that is passed to the decoder.

The decoder contains a single LSTM layer, that receives the attention layer's output and the decoder's last step's output. The decoder also has a dense and a softmax layer to create a probability distribution of what the next word will be. The example below implements a greedy strategy by simply selecting the most probable next word.

<img src="./images/lstm-based-model-arch.PNG" alt="LSTM-based architecture" />
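
To make the role of the attention layer concrete, the sketch below computes a context vector for a single decoder step in plain Python. It uses simple dot-product scoring, which is an assumption made for brevity: the `AttentionLayer` imported from `attention` is not shown in this notebook and may score the states differently (for example with an additive formulation). The names `softmax` and `attention_context` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(encoder_states, decoder_state):
    """Return (weights, context) for one decoder step using dot-product scores."""
    # One score per encoder time step: how well it matches the decoder state.
    scores = [sum(e * d for e, d in zip(state, decoder_state)) for state in encoder_states]
    weights = softmax(scores)
    # The context vector is the weighted sum of the encoder states.
    dim = len(decoder_state)
    context = [sum(w * state[i] for w, state in zip(weights, encoder_states)) for i in range(dim)]
    return weights, context


encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three time steps, two units
decoder_state = [2.0, 0.0]
weights, context = attention_context(encoder_states, decoder_state)
print(weights, context)
```

In the model, the context vector is concatenated with the decoder output before the dense softmax layer, so every predicted word can draw on the most relevant encoder time steps.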


In [None]:
# == LSTM-based network ==

def enc_dec_with_att_model(latent_dim, max_text_len):
    K.clear_session()

    #-------------------------------
    # Encoder
    #-------------------------------
    encoder_inputs = Input(shape=(max_text_len,))
    enc_emb = Embedding(data_sets.x_vocab_size, latent_dim, trainable=True)(encoder_inputs)

    #LSTM 1
    encoder_lstm1 = LSTM(latent_dim, return_sequences=True, return_state=True)
    encoder_output1, state_h1, state_c1 = encoder_lstm1(enc_emb)

    #LSTM 2
    encoder_lstm2 = LSTM(latent_dim, return_sequences=True, return_state=True)
    encoder_output2, state_h2, state_c2 = encoder_lstm2(encoder_output1)

    #LSTM 3
    encoder_lstm3=LSTM(latent_dim, return_state=True, return_sequences=True)
    encoder_outputs, state_h, state_c= encoder_lstm3(encoder_output2)

    #-------------------------------
    # Decoder
    #-------------------------------
    decoder_inputs = Input(shape=(None,))
    dec_emb_layer = Embedding(data_sets.y_vocab_size, latent_dim, trainable=True)
    dec_emb = dec_emb_layer(decoder_inputs)

    # LSTM using the encoder's final hidden and cell states as its initial state
    decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
    decoder_outputs, decoder_state_h, decoder_state_c = decoder_lstm(dec_emb, initial_state=[state_h, state_c])

    #-------------------------------
    # Attention Layer
    #-------------------------------
    attn_layer = AttentionLayer(name='attention_layer')
    attn_out, attn_states = attn_layer([encoder_outputs, decoder_outputs])

    # Concat attention output and decoder LSTM output
    decoder_concat_input = Concatenate(axis=-1, name='concat_layer')([decoder_outputs, attn_out])

    #-------------------------------
    # Dense layer
    #-------------------------------
    decoder_dense = TimeDistributed(Dense(data_sets.y_vocab_size, activation='softmax'))
    decoder_outputs = decoder_dense(decoder_concat_input)

    # Define the model
    model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
    
    return {
        'model': model,
        'encoder_inputs': encoder_inputs,
        'encoder_outputs': encoder_outputs, 
        'decoder_inputs': decoder_inputs,
        'decoder_outputs': decoder_outputs,
        'state_h': state_h,
        'state_c': state_c,
        'dec_emb_layer': dec_emb_layer,
        'decoder_lstm': decoder_lstm,
        'attn_layer': attn_layer,
        'decoder_dense': decoder_dense
    }

In [None]:
# model.save('amazon-reviews-model.h5')

# # LOADING THE MODEL
# savedModel = tf.keras.models.load_model(
#        'amazon-reviews-model-25092023.h5',
#        custom_objects={'AttentionLayer':AttentionLayer}
# )


# savedModel.summary()
# model_info['model'] = savedModel

In [None]:
# == MODEL TRAINING == 

latent_dim = 256

def train_lstm_model(latent_dim):
    model_info = enc_dec_with_att_model(latent_dim, max_text_len)
    model = model_info['model']
    model.summary()

    model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    batch_size = 512
    epochs = 50

    es = EarlyStopping(monitor='val_loss', mode='min', verbose=1)

    # Train the model
    history = model.fit([x_train_padded, y_train_padded[:,:-1]], y_train_padded.reshape(y_train_padded.shape[0], y_train_padded.shape[1], 1)[:,1:], epochs=epochs, callbacks=[es], batch_size=batch_size, validation_data=([x_val_padded, y_val_padded[:,:-1]], y_val_padded.reshape(y_val_padded.shape[0],y_val_padded.shape[1], 1)[:,1:]))
    
    return model_info, history
    
    
lstm_model_info, lstm_history = train_lstm_model(latent_dim)


<img src="./images/lstm-1.PNG" alt="Initial data structure" />
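
The `model.fit` call above passes `y_train_padded[:, :-1]` as the decoder input and `y_train_padded[:, 1:]` as the target. This is the usual teacher-forcing arrangement: at each step the decoder sees the true previous word and is trained to predict the next one. The sketch below shows the shift on toy data with plain Python lists; the function name `teacher_forcing_pairs` and the token ids are illustrative assumptions.

```python
def teacher_forcing_pairs(padded_summaries):
    """Split each padded summary into (decoder input, decoder target)."""
    decoder_inputs = [row[:-1] for row in padded_summaries]   # drop the last position
    decoder_targets = [row[1:] for row in padded_summaries]   # drop the start token
    return decoder_inputs, decoder_targets


# Toy ids: 1 = start token, 2 = end token, 0 = padding (assumed for illustration).
y = [
    [1, 7, 9, 2, 0],
    [1, 5, 2, 0, 0],
]
dec_in, dec_target = teacher_forcing_pairs(y)
print(dec_in)
print(dec_target)
```

Position `t` of the target always holds the token that follows position `t` of the decoder input.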

In [None]:
# == DRAWING THE ACCURACY == 
# Plot both curves over the same epochs so that they stay aligned
plt.plot(lstm_history.history['accuracy'], label='train acc')
plt.plot(lstm_history.history['val_accuracy'], label='val acc')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.show()

In [None]:
# == DRAWING THE LOSS == 
plt.plot(lstm_history.history['loss'], label='train loss')
plt.plot(lstm_history.history['val_loss'], label='val loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()

### LSTM-based model Inference 
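
The greedy strategy described for the decoder, always taking the most probable next word, can be sketched independently of Keras. In the sketch below, `greedy_decode`, `toy_model` and the probability table are hypothetical stand-ins; the notebook's actual decoding is done by `decode_sequence` from `summary_inference`.

```python
def greedy_decode(next_token_probs, start_id, end_id, max_len):
    """Repeatedly pick the most probable next token until end_id or max_len."""
    tokens = [start_id]
    for _ in range(max_len):
        probs = next_token_probs(tokens)
        next_id = max(range(len(probs)), key=probs.__getitem__)  # argmax
        if next_id == end_id:
            break
        tokens.append(next_id)
    return tokens[1:]  # drop the start token


# Hypothetical stand-in for the decoder: next-token probabilities keyed by the last token.
TABLE = {
    0: [0.0, 0.1, 0.7, 0.2],    # after start (0), token 2 is most likely
    2: [0.0, 0.1, 0.1, 0.8],    # after 2, token 3 is most likely
    3: [0.0, 0.9, 0.05, 0.05],  # after 3, the end token (1) is most likely
}


def toy_model(tokens):
    return TABLE[tokens[-1]]


result = greedy_decode(toy_model, start_id=0, end_id=1, max_len=10)
print(result)
```

Greedy decoding is cheap but can lock onto repetitive output; beam search is a common alternative that keeps several candidate sequences alive at each step.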

In [None]:
inf_encoder_model, inf_decoder_model = inference_models_from_lstm(lstm_model_info, latent_dim, max_text_len)

In [None]:
reverse_target_word_index = y_tokenizer.index_word 
reverse_source_word_index = x_tokenizer.index_word 
target_word_index = y_tokenizer.word_index

summary_labels = []
summary_preds = []

for i in range(10):
  original_summary = seq2summary(y_val_padded[i], target_word_index, reverse_target_word_index)
  prediction = decode_sequence(x_val_padded[i].reshape(1, max_text_len), max_summary_len, inf_encoder_model, inf_decoder_model, target_word_index, reverse_target_word_index)
  summary_labels.append(original_summary)
  summary_preds.append(prediction)

print("Done generating predictions.")

In [None]:
lstm_model_info['model'].save('lstm-amazon-reviews-031020230554.h5')

### 5.2 Bi-directional GRU-based network <a class="anchor" id="gru"></a>

The second model also follows an encoder-decoder architecture. 

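The encoder has two stacked bidirectional GRU layers.

The attention mechanism is the same as before. The decoder has the same structure, but it uses a single GRU layer with twice the latent dimension so that it can be initialised with the concatenated forward and backward encoder states.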

<img src="./images/lstm-based-model-arch.PNG" alt="LSTM-based architecture" />
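
A bidirectional layer reads the sequence once from left to right and once from right to left, and the two final states are concatenated. That is why the decoder GRU in the code that follows is created with `latent_dim*2` units. The toy sketch below shows the idea with a deliberately simplified recurrence; `run_rnn` and `bidirectional_state` are hypothetical names and the update rule is not a real GRU.

```python
import math


def run_rnn(inputs, hidden_size):
    """Toy recurrent pass: h_t = tanh(0.5 * h_{t-1} + x_t), applied to every unit."""
    h = [0.0] * hidden_size
    for x in inputs:
        h = [math.tanh(0.5 * value + x) for value in h]
    return h


def bidirectional_state(inputs, hidden_size):
    """Concatenate the final states of a forward and a backward pass."""
    forward = run_rnn(inputs, hidden_size)
    backward = run_rnn(list(reversed(inputs)), hidden_size)
    return forward + backward


latent = 4
state = bidirectional_state([0.1, 0.5, -0.3], latent)
print(len(state))
```

The concatenated state has `2 * latent` entries, and its two halves differ because each direction ends on a different part of the sequence.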

In [None]:
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import GRU, Embedding, Dropout


latent_dimension = 256
embedding_dim = 100
x_embedding_matrix = get_embedding_matrix(x_tokenizer, embedding_dim, data_sets.x_vocab_size)
y_embedding_matrix = get_embedding_matrix(y_tokenizer, embedding_dim, data_sets.y_vocab_size)


def enc_dec_with_att_model_bi_gru(latent_dim, 
                                  max_text_len, 
                                  embedding_dim, 
                                  x_vocab_size, 
                                  y_vocab_size,
                                  x_embedding_matrix, 
                                  y_embedding_matrix):
    encoder_input = Input(shape=(max_text_len, ))
    decoder_input = Input(shape=(None, ))

    # ENCODER
    encoder_embedding = Embedding(x_vocab_size,
                                  embedding_dim,
                                  embeddings_initializer=Constant(x_embedding_matrix),
                                  trainable=False)(encoder_input) 

    # GRU 1
    encoder_gru_01 = Bidirectional(GRU(latent_dim, return_sequences=True, return_state=True))
    encoder_output_01, encoder_forward_state_01, encoder_backward_state_01 = encoder_gru_01(encoder_embedding)
    encoder_output_dropout_01 = Dropout(0.3)(encoder_output_01)

    # GRU 2
    encoder_gru_02 = Bidirectional(GRU(latent_dim, return_sequences=True, return_state=True))
    encoder_output, encoder_forward_state, encoder_backward_state = encoder_gru_02(encoder_output_dropout_01)
    encoder_state = Concatenate()([encoder_forward_state, encoder_backward_state])

    # DECODER
    decoder_embedding_layer = Embedding(y_vocab_size,
                                  embedding_dim,
                                  embeddings_initializer=Constant(y_embedding_matrix),
                                  trainable=False)
    decoder_embedding = decoder_embedding_layer(decoder_input)

    # GRU using encoder_states as initial state
    decoder_gru = GRU(latent_dim*2, return_sequences=True, return_state=True)
    decoder_output, decoder_state = decoder_gru(decoder_embedding, initial_state=[encoder_state])

    # Attention Layer
    attention_layer = AttentionLayer() 
    attention_out, attention_states = attention_layer([encoder_output, decoder_output])

    # Concat attention output and decoder GRU output 
    decoder_concatenate = Concatenate(axis=-1)([decoder_output, attention_out])

    # Dense layer
    decoder_dense = TimeDistributed(Dense(y_vocab_size, activation='softmax')) #hierarchical
    decoder_dense_output = decoder_dense(decoder_concatenate)

    # Define the model
    model = Model([encoder_input, decoder_input], decoder_dense_output)

    return {
        'model': model,
        'encoder_input': encoder_input,
        'encoder_output': encoder_output, 
        'encoder_state': encoder_state,
        'decoder_input': decoder_input,
        'decoder_state': decoder_state,
        'decoder_dense': decoder_dense,
        'decoder_embedding_layer': decoder_embedding_layer,
        'decoder_gru': decoder_gru
    }

In [None]:
# == Bi-GRU MODEL TRAINING == 

latent_dim = 256

def train_gru_model(latent_dim, 
                   max_text_len, 
                   embedding_dim, 
                   x_vocab_size,
                   y_vocab_size,
                   x_embedding_matrix, 
                   y_embedding_matrix):
    model_info = enc_dec_with_att_model_bi_gru(latent_dim, 
                                        max_text_len, 
                                        embedding_dim, 
                                        x_vocab_size,
                                        y_vocab_size,
                                        x_embedding_matrix, 
                                        y_embedding_matrix)
    model = model_info['model']
    model.summary()

    model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    batch_size = 512
    epochs = 50

    es = EarlyStopping(monitor='val_loss', mode='min', verbose=1)

    # Train the model
    history = model.fit([x_train_padded, y_train_padded[:,:-1]], y_train_padded.reshape(y_train_padded.shape[0], y_train_padded.shape[1], 1)[:,1:], epochs=epochs, callbacks=[es], batch_size=batch_size, validation_data=([x_val_padded, y_val_padded[:,:-1]], y_val_padded.reshape(y_val_padded.shape[0],y_val_padded.shape[1], 1)[:,1:]))
    
    return model_info, history
    
    
gru_model_info, gru_history = train_gru_model(latent_dim, 
                   max_text_len, 
                   embedding_dim, 
                   data_sets.x_vocab_size,
                   data_sets.y_vocab_size,
                   x_embedding_matrix, 
                   y_embedding_matrix)

In [None]:
gru_model_info['model'].save('gru-amazon-reviews-031020230708.h5')

In [None]:
# == DRAWING THE ACCURACY == 
# Plot both curves over the same epochs so that they stay aligned
plt.plot(gru_history.history['accuracy'], label='train acc')
plt.plot(gru_history.history['val_accuracy'], label='val acc')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.show()

In [None]:
# == DRAWING THE LOSS == 
plt.plot(gru_history.history['loss'], label='train loss')
plt.plot(gru_history.history['val_loss'], label='val loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()

In [None]:
def gru_based_inference_models(model_info, latent_dim, max_text_len):
    model = model_info['model']
    encoder_input = model_info['encoder_input']
    encoder_output = model_info['encoder_output']
    encoder_state = model_info['encoder_state']
    decoder_input = model_info['decoder_input']
    decoder_state = model_info['decoder_state']
    decoder_dense = model_info['decoder_dense']
    y_embedding_layer = model_info['decoder_embedding_layer']
    decoder_gru = model_info['decoder_gru']
    
    # Encoder Inference Model
    encoder_model_inference = Model(encoder_input, [encoder_output, encoder_state])

    # Decoder Inference
    # Below tensors will hold the states of the previous time step
    decoder_state = Input(shape=(latent_dim*2, ))
    decoder_intermittent_state_input = Input(shape=(max_text_len, latent_dim*2))

    # Get Embeddings of Decoder Sequence
    decoder_embedding_inference = y_embedding_layer(decoder_input)

    # Predict Next Word in Sequence, Set Initial State to State from Previous Time Step
    decoder_output_inference, decoder_state_inference = decoder_gru(decoder_embedding_inference,
                                                                    initial_state=[decoder_state])

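    # Attention Inference
    # Reuse the attention layer that was trained with the model; a fresh
    # AttentionLayer() would not share any trained weights.
    attention_layer = next(layer for layer in model.layers if isinstance(layer, AttentionLayer))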
    attention_out_inference, attention_state_inference = attention_layer([decoder_intermittent_state_input,
                                                                          decoder_output_inference])
    decoder_inference_concat = Concatenate(axis=-1)([decoder_output_inference,
                                                     attention_out_inference])

    # Dense Softmax Layer to Generate Prob. Dist. Over Target Vocabulary
    decoder_output_inference = decoder_dense(decoder_inference_concat)

    # Final Decoder Model
    decoder_model_inference = Model([decoder_input, decoder_intermittent_state_input, decoder_state], 
                                    [decoder_output_inference, decoder_state_inference])
    
    return encoder_model_inference, decoder_model_inference

def decode_sequence(input_sequence,
                    max_summary_len,
                    enc_inference_model, 
                    dec_inference_model, 
                    start_token, 
                    end_token, 
                    target_word_index,
                    reverse_target_word_index):
  """Text generation function via encoder / decoder network."""

  # Encode Input as State Vectors.
  encoder_output, encoder_state = enc_inference_model.predict(input_sequence)

  # Generate Empty Target Sequence of Length 1.
  target_sequence = np.zeros((1, 1))

  # Choose 'start' as the first word of the target sequence
  target_sequence[0, 0] = target_word_index[start_token]

  decoded_sentence = ''
  break_condition = False
  while not break_condition:
      token_output, decoder_state = dec_inference_model.predict([target_sequence, 
                                                                 encoder_output,
                                                                 encoder_state])

      # Sample Token
      sampled_token_index = np.argmax(token_output[0, -1, :])

      if not sampled_token_index == 0:
        sampled_token = reverse_target_word_index[sampled_token_index]

        if not sampled_token == end_token:
            decoded_sentence += ' ' + sampled_token

        # Break Condition: Encounter Max Length / Find Stop Token.
        if sampled_token == end_token or len(decoded_sentence.split()) >= (max_summary_len - 1):
            break_condition = True

        # Update Target Sequence (length 1).
        target_sequence = np.zeros((1, 1))
        target_sequence[0, 0] = sampled_token_index

      else:
        break_condition = True

      # Update internal states
      encoder_state = decoder_state

  return decoded_sentence

In [None]:
enc_gru_inf_model, dec_gru_inf_model = gru_based_inference_models(gru_model_info,
                                                                  latent_dim, max_text_len)

In [None]:
reverse_target_word_index = y_tokenizer.index_word 
reverse_source_word_index = x_tokenizer.index_word 
target_word_index = y_tokenizer.word_index

summary_labels = []
summary_preds = []

for i in range(100, 150):
  original_summary = seq2summary(y_val_padded[i], target_word_index, reverse_target_word_index)
  prediction = decode_sequence(x_val_padded[i].reshape(1, max_text_len),
                               max_summary_len,
                               enc_gru_inf_model,
                               dec_gru_inf_model,
                               start_token,
                               end_token,
                               target_word_index,
                               reverse_target_word_index)
  summary_labels.append(original_summary)
  summary_preds.append(prediction)

print("Done generating predictions.")

In [None]:
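# The predictions above were generated for validation samples 100 to 149,
# so offset the index to print the matching source text.
first_sample = 100
for index, summaries in enumerate(zip(summary_labels, summary_preds)):
    input_seq = x_val_padded[first_sample + index]
    original_text = seq2text(input_seq, reverse_source_word_index)
    print(f"Original text: {original_text}")
    print(f"Original: {summaries[0]}")
    print(f"Prediction: {summaries[1]}")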

### TensorFlow_text's ROUGE

In [None]:
nltk.download('punkt')

summary_labels_tf = tf.ragged.constant([nltk.word_tokenize(s) for s in summary_labels])
summary_preds_tf = tf.ragged.constant([nltk.word_tokenize(s) for s in summary_preds])
result = text.metrics.rouge_l(summary_preds_tf, summary_labels_tf)

print('F-Measure: %s' % result.f_measure)
print('P-Measure: %s' % result.p_measure)
print('R-Measure: %s' % result.r_measure)

# Mean F-measure over all evaluated samples
print(float(tf.reduce_mean(result.f_measure)))

### Keras' ROUGE
Below is another API to calculate ROUGE scores.

In [None]:
# pip install keras-nlp
# pip install rouge-score
import keras_nlp
import rouge_score
from rouge_score import rouge_scorer, scoring

rouge_l = keras_nlp.metrics.RougeL()
rouge_l(summary_labels, summary_preds)["f1_score"]

### Trying your own inputs

In [None]:
# decode_sequence was redefined above for the GRU model, so import the
# LSTM version again under a separate name.
from summary_inference import decode_sequence as lstm_decode_sequence

review = "bought several vitality canned dog food products found them good quality product looks like stew processed meat smells better labrador finicky appreciates product better"
my_review = np.array([preprocess_text(review)])
print(my_review)

# Convert the text into an integer sequence with the tokenizer fitted on the training data
tokenized_sequence = x_tokenizer.texts_to_sequences(my_review)

# == PADDING == zero-pad up to the maximum length
padded_tokenized_sequence = pad_sequences(tokenized_sequence, maxlen=max_text_len, padding='post')

summary = lstm_decode_sequence(padded_tokenized_sequence, max_summary_len, inf_encoder_model, inf_decoder_model, target_word_index, reverse_target_word_index)

print(summary)

## 6. Comparing the results <a class="anchor" id="comparing-the-results"></a>

We will look at both a manual (human) evaluation of the results and the computed ROUGE metrics.

Even with the reduced training set, both recurrent neural networks take a long time to train, and their results are modest. Nevertheless, they do not produce gibberish: what they generate tends to capture the sentiment of the source text, albeit in a very abstracted and condensed form.

### Manual evaluation

When inspecting the results manually:
- The LSTM architecture took twice as long to train and produces summaries that are much more repetitive and carry less information. It also got some results completely wrong.
- The GRU is considerably better overall and takes less time to train, so it opens the door to further iterations: increasing the training sample size and potentially increasing the complexity of the network to capture more information.

### ROUGE metrics
After manually inspecting the results we can expect the ROUGE scores to be quite low, and they are. This is because the summaries in the data set are very short and ROUGE is based on the overlap of words between the generated summary and the reference summary, so a single differing word changes the score considerably.
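
To see why short summaries score low, it helps to compute ROUGE-L by hand. The sketch below is a minimal pure-Python version based on the longest common subsequence (LCS) of the two token lists, with a balanced F1. The function names are hypothetical, and the F-measure weighting may differ from the one used by the library calls above, so the numbers are not guaranteed to match theirs.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


def rouge_l_scores(prediction, reference):
    """Return (precision, recall, f1) of ROUGE-L for two whitespace-tokenized strings."""
    pred, ref = prediction.split(), reference.split()
    if not pred or not ref:
        return 0.0, 0.0, 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


close = rouge_l_scores("great dog food", "good dog food")
miss = rouge_l_scores("great product", "love it")
print(close, miss)
```

In the first pair a single differing word out of three already lowers every score to two thirds, and a prediction that conveys the same sentiment with different words, as in the second pair, scores zero.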
