# Text Summarization of Amazon reviews

This notebook implements a seq2seq model for text summarization.

In [1]:
import os

import pandas as pd
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
from collections import Counter

import Summarizer
import summarizer_data_utils
import summarizer_model_utils



In [2]:
print(tf.__version__)

2.0.0-alpha0


## The data


The data we will be using is the Amazon Fine Food Reviews dataset from Kaggle.  
It contains, as the name suggests, roughly 570,000 reviews of fine foods from Amazon, along with summaries of those reviews. 
Our aim is to input a review (Text column) and automatically create a summary (Summary column) for it.


https://www.kaggle.com/snap/amazon-fine-food-reviews/data

### Reading and exploring

In [3]:
# load csv file using pandas.
file_path = './Reviews.csv'
data = pd.read_csv(file_path)
data.shape

(568454, 10)

In [4]:
# we will only use the last two columns Summary (target) and Text (input).
data.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


In [5]:
# check for missing values --> Summary has some.
# 27 are missing, so we will drop those!
data.isnull().sum()

Id                         0
ProductId                  0
UserId                     0
ProfileName               16
HelpfulnessNumerator       0
HelpfulnessDenominator     0
Score                      0
Time                       0
Summary                   27
Text                       0
dtype: int64

In [6]:
# drop rows where the value in Summary is missing.
data.dropna(subset=['Summary'], inplace=True)

In [7]:
# only summary and text are useful for us.
data = data[['Summary', 'Text']]
data.head()

Unnamed: 0,Summary,Text
0,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,"""Delight"" says it all",This is a confection that has been around a fe...
3,Cough Medicine,If you are looking for the secret ingredient i...
4,Great taffy,Great taffy at a great price. There was a wid...


In [8]:
# we will not use all of them, only short ones of similar length.
# choosing reviews of similar length makes it easier for the model to learn.
raw_texts = []
raw_summaries = []

for text, summary in zip(data.Text, data.Summary):
    if 100 < len(text) < 150:
        raw_texts.append(text)
        raw_summaries.append(summary)

In [9]:
len(raw_texts), len(raw_summaries)

(78862, 78862)

In [10]:
for t, s in zip(raw_texts[:5], raw_summaries[:5]):
    print('Text:\n', t)
    print('Summary:\n', s, '\n\n')

Text:
 Great taffy at a great price.  There was a wide assortment of yummy taffy.  Delivery was very quick.  If your a taffy lover, this is a deal.
Summary:
 Great taffy 


Text:
 This taffy is so good.  It is very soft and chewy.  The flavors are amazing.  I would definitely recommend you buying it.  Very satisfying!!
Summary:
 Wonderful, tasty taffy 


Text:
 Right now I'm mostly just sprouting this so my cats can eat the grass. They love it. I rotate it around with Wheatgrass and Rye too
Summary:
 Yay Barley 


Text:
 This is a very healthy dog food. Good for their digestion. Also good for small puppies. My dog eats her required amount at every feeding.
Summary:
 Healthy Dog Food 


Text:
 The Strawberry Twizzlers are my guilty pleasure - yummy. Six pounds will be around for a while with my son and I.
Summary:
 Strawberry Twizzlers - Yummy 




In [11]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/nasir/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Clean and prepare the data

In [12]:
# the function gives us the option to keep most of the characters inside the texts and summaries
# (keep_most=True), meaning punctuation, question marks, slashes...
# or we can set it to False, as here, meaning we only keep letters and numbers.
processed_texts, processed_summaries, words_counted = summarizer_data_utils.preprocess_texts_and_summaries(
    raw_texts,
    raw_summaries,
    keep_most=False
)

Processing Time:  26.7448627948761
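
The code of `preprocess_texts_and_summaries` is not shown in this notebook, so the following is only a rough sketch of what such a cleaning step might look like. The helper `clean_text` and its regular expressions are assumptions of mine, not the actual implementation in `summarizer_data_utils` (which may well use nltk for tokenizing). It lowercases the text, optionally keeps common punctuation as separate tokens, and counts the resulting words.

```python
import re
from collections import Counter


def clean_text(text, keep_most=False):
    """Lowercase a string and split it into tokens (hypothetical helper)."""
    text = text.lower()
    if keep_most:
        # Keep common punctuation, but split it off as separate tokens.
        text = re.sub(r"([.,!?;:/()\-])", r" \1 ", text)
        text = re.sub(r"[^a-z0-9.,!?;:/()\- ]+", " ", text)
    else:
        # Keep only letters and digits; everything else becomes a space.
        text = re.sub(r"[^a-z0-9]+", " ", text)
    return text.split()


sample = "Right now I'm mostly just sprouting this so my cats can eat the grass."
tokens = clean_text(sample)

# Word frequencies are what the lookup dicts are later built from.
counts = Counter(tokens)
print(tokens)
```

With `keep_most=False` the apostrophe in "I'm" is replaced by a space, which is consistent with the separate `'i', 'm'` tokens visible in the processed texts.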


In [13]:
for t,s in zip(processed_texts[:5], processed_summaries[:5]):
    print('Text\n:', t, '\n')
    print('Summary:\n', s, '\n\n\n')

Text
: ['great', 'taffy', 'at', 'a', 'great', 'price', 'there', 'was', 'a', 'wide', 'assortment', 'of', 'yummy', 'taffy', 'delivery', 'was', 'very', 'quick', 'if', 'your', 'a', 'taffy', 'lover', 'this', 'is', 'a', 'deal'] 

Summary:
 ['great', 'taffy'] 



Text
: ['this', 'taffy', 'is', 'so', 'good', 'it', 'is', 'very', 'soft', 'and', 'chewy', 'the', 'flavors', 'are', 'amazing', 'i', 'would', 'definitely', 'recommend', 'you', 'buying', 'it', 'very', 'satisfying'] 

Summary:
 ['wonderful', 'tasty', 'taffy'] 



Text
: ['right', 'now', 'i', 'm', 'mostly', 'just', 'sprouting', 'this', 'so', 'my', 'cats', 'can', 'eat', 'the', 'grass', 'they', 'love', 'it', 'i', 'rotate', 'it', 'around', 'with', 'wheatgrass', 'and', 'rye', 'too'] 

Summary:
 ['yay', 'barley'] 



Text
: ['this', 'is', 'a', 'very', 'healthy', 'dog', 'food', 'good', 'for', 'their', 'digestion', 'also', 'good', 'for', 'small', 'puppies', 'my', 'dog', 'eats', 'her', 'required', 'amount', 'at', 'every', 'feeding'] 

Summary:
 ['

### Create lookup dicts

We cannot feed our network actual words, only numbers. So we first have to create our lookup dicts, where each word gets an int value (high or low, depending on its frequency in our corpus). These later help us convert the texts into numbers.

We also add special tokens. EndOfSentence and StartOfSentence are crucial for the Seq2Seq model we use later.
The pad token is needed because all summaries and texts in a batch must have the same length; shorter sequences are filled up with it.

So we need 2 lookup dicts:
 - from word to index
 - from index to word
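
To make the idea concrete, here is a minimal sketch of how two such dicts could be built from a word counter. The function `build_lookup_dicts` is hypothetical and is not the `create_word_inds_dicts` helper used below; in particular, giving frequent words the low indices and putting the special tokens first are assumptions.

```python
from collections import Counter


def build_lookup_dicts(word_counts, specials):
    """Map words to indices and back (hypothetical helper)."""
    # Assumption: special tokens come first so their indices are stable.
    vocab = list(specials)
    # Assumption: most_common() order, so frequent words get low indices.
    vocab += [word for word, _ in word_counts.most_common()]
    word_to_index = {word: i for i, word in enumerate(vocab)}
    index_to_word = {i: word for word, i in word_to_index.items()}
    return word_to_index, index_to_word


toy_counts = Counter("great taffy at a great price a great deal".split())
toy_specials = ["<EOS>", "<SOS>", "<PAD>", "<UNK>"]
w2i, i2w = build_lookup_dicts(toy_counts, toy_specials)
print(w2i["great"], i2w[0], len(w2i))
```

Both dicts always have the same size, which is also what the `print` in the next cell checks for the real vocabulary.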

In [14]:
specials = ["<EOS>", "<SOS>","<PAD>","<UNK>"]
word2ind, ind2word,  missing_words = summarizer_data_utils.create_word_inds_dicts(words_counted,
                                                                       specials = specials)
print(len(word2ind), len(ind2word), len(missing_words))


25067 25067 0


### Pretrained embeddings

Optionally we can use pretrained word embeddings, which can speed up training and improve accuracy.
I tried two options: GloVe embeddings and embeddings from tf_hub.
The ones from tf_hub worked better, so we use those. 
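
For the GloVe option, the idea is to create one row per vocabulary word and fill it from the GloVe text file (one word per line, followed by its vector). The sketch below illustrates this on a tiny in-memory file. `build_embedding_matrix` is a hypothetical stand-in, not the `create_and_save_embedding_matrix` helper used in the next cell, and the random initialization of missing words is an assumption.

```python
import io

import numpy as np


def build_embedding_matrix(word_to_index, glove_file, dim, seed=0):
    """Fill one row per vocabulary word from a GloVe-style text file."""
    rng = np.random.default_rng(seed)
    # Assumption: words missing from the file keep a small random vector.
    matrix = rng.normal(scale=0.1, size=(len(word_to_index), dim)).astype(np.float32)
    found = 0
    for line in glove_file:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if word in word_to_index and len(values) == dim:
            matrix[word_to_index[word]] = np.asarray(values, dtype=np.float32)
            found += 1
    return matrix, found


# Toy vocabulary and a three-line stand-in for a GloVe file (made-up numbers).
toy_vocab = {"<PAD>": 0, "<UNK>": 1, "great": 2, "taffy": 3}
toy_glove = io.StringIO("great 0.1 0.2 0.3\ntaffy 0.4 0.5 0.6\nprice 0.7 0.8 0.9\n")
toy_matrix, n_found = build_embedding_matrix(toy_vocab, toy_glove, dim=3)
print(toy_matrix.shape, n_found)
```

The resulting array can be stored with `np.save` and loaded again when the model is built.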

In [27]:
glove_embeddings_path = './glove.6B.300d.txt'
embedding_matrix_save_path = './embeddings/my_embedding_github.npy'
emb = summarizer_data_utils.create_and_save_embedding_matrix(word2ind, glove_embeddings_path, embedding_matrix_save_path)

In [32]:
# the embeddings from tf_hub.
# TF 2.x executes eagerly, so tf.Session no longer exists: hub.load returns a
# callable and .numpy() turns its output into an array.
# note: enabling the first module is a choice made here; the second one is the alternative.
embed = hub.load("https://tfhub.dev/google/nnlm-en-dim128/2")
#embed = hub.load("https://tfhub.dev/google/Wiki-words-250/2")
emb = embed(list(word2ind.keys()))
embedding = emb.numpy()

In [29]:
embedding.shape

In [None]:
np.save('./tf_hub_embedding.npy', embedding)

### Convert text and summaries

As mentioned before, we cannot feed words directly to our network; we first have to convert them to numbers. That is what we do here. We also add the SOS and EOS tokens to the summaries.

In [None]:
# converts words in texts and summaries to indices
# it looks like we have to set eos here to False
converted_texts, unknown_words_in_texts = summarizer_data_utils.convert_to_inds(processed_texts,
                                                                                word2ind,
                                                                                eos = False)

In [None]:
converted_summaries, unknown_words_in_summaries = summarizer_data_utils.convert_to_inds(processed_summaries,
                                                                                        word2ind,
                                                                                        eos = True,
                                                                                        sos = True)
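
The two conversion calls above can be pictured with the following sketch. `words_to_indices` is a hypothetical re-implementation for illustration only; mapping out-of-vocabulary words to `<UNK>` and returning them in a second list are assumptions about how `convert_to_inds` behaves.

```python
def words_to_indices(tokens, word_to_index, sos=False, eos=False):
    """Convert a token list to indices and collect unknown words."""
    unknown = []
    indices = []
    for token in tokens:
        if token in word_to_index:
            indices.append(word_to_index[token])
        else:
            # Assumption: unknown words are replaced by the <UNK> index.
            indices.append(word_to_index["<UNK>"])
            unknown.append(token)
    if sos:
        indices.insert(0, word_to_index["<SOS>"])
    if eos:
        indices.append(word_to_index["<EOS>"])
    return indices, unknown


toy_w2i = {"<EOS>": 0, "<SOS>": 1, "<PAD>": 2, "<UNK>": 3, "great": 4, "taffy": 5}
toy_i2w = {i: w for w, i in toy_w2i.items()}

# Texts get no SOS/EOS, summaries get both, as in the cells above.
text_ids, text_unknown = words_to_indices(["great", "yummy", "taffy"], toy_w2i)
summary_ids, _ = words_to_indices(["great", "taffy"], toy_w2i, sos=True, eos=True)

print(text_ids, text_unknown)
print([toy_i2w[i] for i in summary_ids])
```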

In [None]:
converted_texts[0]

In [None]:
# seems to have worked well. 
print( summarizer_data_utils.convert_inds_to_text(converted_texts[0], ind2word),
       summarizer_data_utils.convert_inds_to_text(converted_summaries[0], ind2word))
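
Once everything is a list of indices, the sequences in a batch still have different lengths, which is where the pad token comes in. Batching happens inside the Summarizer class and is not shown here, so the following is just a sketch of the general idea; `pad_batch` is a hypothetical helper and padding on the right is an assumption.

```python
def pad_batch(sequences, pad_index):
    """Right-pad every sequence in a batch to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_index] * (max_len - len(seq)) for seq in sequences]
    # The true lengths are usually kept so the padding can be masked later.
    lengths = [len(seq) for seq in sequences]
    return padded, lengths


toy_batch = [[1, 4, 5, 0], [1, 6, 0], [1, 7, 8, 9, 0]]
padded_batch, true_lengths = pad_batch(toy_batch, pad_index=2)
for row in padded_batch:
    print(row)
```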


## The model

Now we can build and train our model. First we define the hyperparameters we want to use. Then we create our Summarizer and call .build_graph(), which, as the name suggests, builds the computation graph. 
Then we can train the model using .train().

After training we can try our model using .infer()

### Training

We can optionally use a cyclic learning rate, which we do here. 
I trained the model for 20 epochs, by which point the loss was low; training longer would probably give better results.

Unfortunately I do not have the resources to search for the best hyperparameters, but these do pretty well. 
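
How the Summarizer class implements its cyclic learning rate is not visible in this notebook. A common variant is the triangular policy, where the rate rises linearly from a base value to a maximum and falls back again. The sketch below shows that variant as an assumption, using the `learning_rate` (0.0005) and `max_lr` (0.005) values from the hyperparameter cell; the half-cycle length of 700 steps is an assumption as well.

```python
import math


def triangular_lr(step, base_lr, max_lr, step_size):
    """Triangular cyclic learning rate: base_lr -> max_lr -> base_lr."""
    # Index of the current cycle; one cycle lasts 2 * step_size steps.
    cycle = math.floor(1 + step / (2 * step_size))
    # Distance from the peak of the cycle, scaled to the range [0, 1].
    x = abs(step / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


base_lr, max_lr, step_size = 0.0005, 0.005, 700
for step in (0, 350, 700, 1050, 1400):
    print(step, round(triangular_lr(step, base_lr, max_lr, step_size), 6))
```

The rate starts at the base value, peaks after `step_size` steps, and is back at the base value after `2 * step_size` steps.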


In [None]:
# model hyperparameters
num_layers_encoder = 4
num_layers_decoder = 4
rnn_size_encoder = 512
rnn_size_decoder = 512

batch_size = 256
epochs = 200
clip = 5
keep_probability = 0.5
learning_rate = 0.0005
max_lr=0.005
learning_rate_decay_steps = 700
learning_rate_decay = 0.90


pretrained_embeddings_path = './tf_hub_embedding.npy'
summary_dir = os.path.join('./tensorboard', 'Nn_' + str(rnn_size_encoder) + '_Lr_' + str(learning_rate))


use_cyclic_lr = True
inference_targets=True


In [None]:
len(converted_summaries)

In [None]:
round(78862*0.9)
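
The value computed above is the index where the data is cut into 90% training and 10% validation examples. As a small sketch, the same split can be derived from the list length instead of being typed in by hand. `train_validation_split` is a hypothetical helper, and the toy lists only stand in for `converted_texts` and `converted_summaries`.

```python
def train_validation_split(inputs, targets, train_fraction=0.9):
    """Split two parallel lists at the same index; no shuffling is done."""
    assert len(inputs) == len(targets)
    split = round(len(inputs) * train_fraction)
    return (inputs[:split], targets[:split]), (inputs[split:], targets[split:])


# Stand-ins with the same number of examples as the filtered dataset.
toy_inputs = list(range(78862))
toy_targets = list(range(78862))
(train_x, train_y), (val_x, val_y) = train_validation_split(toy_inputs, toy_targets)
print(len(train_x), len(val_x))
```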

In [None]:
# build graph and train the model 
summarizer_model_utils.reset_graph()
summarizer = Summarizer.Summarizer(word2ind,
                                   ind2word,
                                   save_path='./models/amazon/my_model',
                                   mode='TRAIN',
                                   num_layers_encoder = num_layers_encoder,
                                   num_layers_decoder = num_layers_decoder,
                                   rnn_size_encoder = rnn_size_encoder,
                                   rnn_size_decoder = rnn_size_decoder,
                                   batch_size = batch_size,
                                   clip = clip,
                                   keep_probability = keep_probability,
                                   learning_rate = learning_rate,
                                   max_lr=max_lr,
                                   learning_rate_decay_steps = learning_rate_decay_steps,
                                   learning_rate_decay = learning_rate_decay,
                                   epochs = epochs,
                                   pretrained_embeddings_path = pretrained_embeddings_path,
                                   use_cyclic_lr = use_cyclic_lr,
                                   summary_dir = summary_dir)           

summarizer.build_graph()
summarizer.train(converted_texts[:70976], 
                 converted_summaries[:70976],
                 validation_inputs=converted_texts[70976:],
                 validation_targets=converted_summaries[70976:])


# hidden training output.
# both train and validation loss decrease nicely.

### Inference
Now we can use our trained model to create summaries. 

In [None]:
summarizer_model_utils.reset_graph()
summarizer = Summarizer.Summarizer(word2ind,
                                   ind2word,
                                   './models/amazon/my_model',
                                   'INFER',
                                   num_layers_encoder = num_layers_encoder,
                                   num_layers_decoder = num_layers_decoder,
                                   batch_size = len(converted_texts[:50]),
                                   clip = clip,
                                   keep_probability = 1.0,
                                   learning_rate = 0.0,
                                   beam_width = 5,
                                   rnn_size_encoder = rnn_size_encoder,
                                   rnn_size_decoder = rnn_size_decoder,
                                   inference_targets = True,
                                   pretrained_embeddings_path = pretrained_embeddings_path)

summarizer.build_graph()
preds = summarizer.infer(converted_texts[:50],
                         restore_path =  './models/amazon/my_model',
                         targets = converted_summaries[:50])


In [None]:
# show results
summarizer_model_utils.sample_results(preds,
                                      ind2word,
                                      word2ind,
                                      converted_summaries[:50],
                                      converted_texts[:50])
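
Besides reading the samples, a simple number can help to compare predicted and original summaries. The following sketch computes a unigram-overlap F1 score in the spirit of ROUGE-1. It is not part of `summarizer_model_utils`; `unigram_f1` is a hypothetical helper and the token lists are made up for illustration.

```python
from collections import Counter


def unigram_f1(predicted, reference):
    """Unigram-overlap F1 between two token lists (a ROUGE-1-style score)."""
    if not predicted or not reference:
        return 0.0
    # Multiset intersection counts every shared word at most as often as it
    # appears in both lists.
    overlap = sum((Counter(predicted) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


print(unigram_f1(["great", "taffy"], ["great", "taffy"]))
print(unigram_f1(["tasty", "taffy"], ["wonderful", "tasty", "taffy"]))
```

A score like this only rewards word overlap, so a good summary that uses different words than the original still gets a low value.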

## Conclusion

Generally I am really impressed by how well the model works. 
We only used a limited amount of data, trained for a limited amount of time, and used nearly random hyperparameters, and it still delivers good results. 

However, we are clearly overfitting the training data, and the model does not generalize perfectly.
Sometimes the summaries the model creates are good, sometimes bad, sometimes they are better than the original ones and sometimes they are just really funny.


Therefore it would be really interesting to scale it up and see how it performs. 

To sum up, I am impressed by seq2seq models: they perform well on many different tasks, and I look forward to exploring more possible applications 
(speech recognition...).