# Text Summarization using Deep Learning Model
## Developed by

*   [Islam Mohamed](https://github.com/islammohamedd1)
*   [Mostafa Yasser](https://github.com/mostafayasser)
*   [Eslam Mamdouh](https://github.com/EslamMamdoouh)

## Problem Description and Background
Companies that sell products online struggle to analyze customer feedback. Reviews often contain a lot of unnecessary text, and long reviews are harder to analyze and take much more time to read.
### Solution
The solution is a learning agent: a deep learning model that summarizes the feedback. It uses NLP techniques and a recurrent neural network (RNN) to capture the context of a review and generate a short summary of it.




### Import required libraries

In [1]:
import numpy as np  
import pandas as pd 
import re           
from tensorflow.keras.preprocessing.text import Tokenizer 
from tensorflow.keras.preprocessing.sequence import pad_sequences
from nltk.corpus import stopwords   
from tensorflow.keras.layers import Input, LSTM, Embedding, Dense, Concatenate, TimeDistributed
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.callbacks import EarlyStopping

import warnings
warnings.filterwarnings("ignore")

### Load the dataset

In [2]:
!wget --no-check-certificate  https://storage.googleapis.com/islamohamedd1.appspot.com/amazon-fine-food-reviews/Reviews.csv -O data.csv

data = pd.read_csv('data.csv', nrows=150000)
data.drop_duplicates(subset=['Text'],inplace=True)
data.dropna(axis=0,inplace=True)

--2020-05-13 14:32:29--  https://storage.googleapis.com/islamohamedd1.appspot.com/amazon-fine-food-reviews/Reviews.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.21.16, 2a00:1450:4006:807::2010
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.21.16|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 300904694 (287M) [text/csv]
Saving to: ‘data.csv’


2020-05-13 14:37:03 (1.05 MB/s) - ‘data.csv’ saved [300904694/300904694]



### Extract, clean and format the data from the dataset
* Contractions are expanded to their full forms
* Stop words are removed from the review text
* \_START_ and \_END_ tokens are inserted at the beginning and the end of each label
* The data is split into training data and test data
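
The cleaning steps above can be sketched on a toy example. This is a simplified illustration, not the notebook's exact code: `TOY_STOP_WORDS` and `TOY_CONTRACTIONS` are small hypothetical stand-ins for NLTK's English stop-word list and the full `contraction_mapping`, and this sketch lowercases the text before filtering stop words.

```python
# Toy stand-ins (assumptions): the notebook uses NLTK's English stop words
# and a much larger contraction_mapping.
TOY_STOP_WORDS = {"the", "is", "a", "it", "this"}
TOY_CONTRACTIONS = {"don't": "do not", "it's": "it is"}


def expand_contractions(text):
    """Replace each whitespace-separated token that is a known contraction."""
    return " ".join(TOY_CONTRACTIONS.get(tok, tok) for tok in text.split())


def clean_review(text):
    """Lowercase, drop stop words, then expand contractions."""
    words = [w for w in text.lower().split() if w not in TOY_STOP_WORDS]
    return expand_contractions(" ".join(words))


def wrap_label(summary):
    """Lowercase a summary, expand contractions, add start/end markers."""
    return "_START_ " + expand_contractions(summary.lower()) + " _END_"


print(clean_review("This popcorn is great but I don't like the price"))
# popcorn great but i do not like price
print(wrap_label("Good Quality Dog Food"))
# _START_ good quality dog food _END_
```

Only the review text has stop words removed; the labels keep them so that the generated summaries read as normal phrases.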

In [3]:
# contraction_mapping source: https://gist.github.com/aravindpai/f21a286e3cb8c68b199e4b6692ced40b#file-contraction-py
contraction_mapping = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not",

                           "didn't": "did not", "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not",

                           "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",

                           "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would",

                           "i'd've": "i would have", "i'll": "i will",  "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would",

                           "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam",

                           "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have",

                           "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock",

                           "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have",

                           "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is",

                           "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as",

                           "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would",

                           "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have",

                           "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have",

                           "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are",

                           "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",

                           "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is",

                           "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have",

                           "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have",

                           "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all",

                           "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have",

                           "you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have",

                           "you're": "you are", "you've": "you have"}

contraction_mapping['<br>'] = ""

In [4]:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
nltk.download('punkt')

stop_words = set(stopwords.words('english'))

def clean_sentences(data):
  new_data = []
  for s in data:
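    new_s = s.lower()
    # NOTE: the filter below reads the original `s`, so the lowercased text
    # from the previous line is discarded and capitalized stop words such as
    # "I" or "The" are kept. Filtering `new_s.split()` instead would remove
    # them, but it would also change the vocabulary the tokenizers build.
    new_s = ' '.join([w for w in s.split() if w not in stop_words])
    # Expand contractions token by token.
    new_s = ' '.join([contraction_mapping[t] if t in contraction_mapping else t for t in new_s.split(" ")])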
    new_data.append(new_s)
  return new_data


def clean_labels(data):
  new_data = []
  for s in data:
    new_s = s.lower()
    new_s = ' '.join([contraction_mapping[t] if t in contraction_mapping else t for t in new_s.split(" ")])
    new_data.append(new_s)
  return new_data

labels = clean_labels(data['Summary'])
sentences = clean_sentences(data['Text'])

[nltk_data] Downloading package stopwords to /home/islam/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /home/islam/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [5]:
new_labels = []
for s in labels:
  new_s = s.split()
  new_s = ['_START_'] + new_s + ['_END_']
  new_labels.append(' '.join(new_s))
labels = new_labels

In [6]:
labels[:3]

['_START_ good quality dog food _END_',
 '_START_ not as advertised _END_',
 '_START_ "delight" says it all _END_']

In [7]:
from sklearn.model_selection import train_test_split
training_sentences, testing_sentences, training_labels, testing_labels = train_test_split(
    sentences, labels, test_size=0.1, random_state=0, shuffle=True)

### Tokenize the sentences
The text is converted to sequences of integer tokens, padded to a fixed length, so that the model can process it.
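
As a rough illustration of what tokenizing and padding do, here is a pure-Python sketch. It is a simplified stand-in for the Keras `Tokenizer` and `pad_sequences`, not their implementation, and the helper names (`build_word_index`, `to_sequence`, `pad_post`) are hypothetical. It mirrors three behaviors: indices start at 1 with the most frequent word first, words unseen during fitting are skipped, and with `padding='post'` plus the default `truncating='pre'` over-long sequences keep their last `max_len` tokens.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to an integer, most frequent first, starting at 1.

    Index 0 is left unused so it can serve as the padding value.
    """
    counts = Counter(word for text in texts for word in text.split())
    # most_common() sorts by count; ties keep first-seen order.
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}


def to_sequence(text, word_index):
    """Convert a text to token ids, skipping words never seen in training."""
    return [word_index[w] for w in text.split() if w in word_index]


def pad_post(seq, max_len):
    """Right-pad with zeros; over-long sequences keep their last max_len ids."""
    seq = seq[-max_len:]
    return seq + [0] * (max_len - len(seq))


train = ["great tea", "great popcorn great price"]
word_index = build_word_index(train)
vocab_size = len(word_index) + 1  # +1 for the reserved padding index 0

print(word_index)  # {'great': 1, 'tea': 2, 'popcorn': 3, 'price': 4}
print(pad_post(to_sequence("great tea today", word_index), max_len=5))  # [1, 2, 0, 0, 0]
print(vocab_size)  # 5
```

The `+ 1` in the vocabulary size is what lets the embedding layer reserve row 0 for padding.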

In [8]:
max_len = 80

x_tokenizer = Tokenizer()
x_tokenizer.fit_on_texts(training_sentences)

x_training = x_tokenizer.texts_to_sequences(training_sentences) 
x_validation = x_tokenizer.texts_to_sequences(testing_sentences)

x_training = pad_sequences(x_training, maxlen=max_len, padding='post') 
x_validation = pad_sequences(x_validation, maxlen=max_len, padding='post')

x_vocab_size = len(x_tokenizer.word_index) + 1

In [9]:
y_tokenizer = Tokenizer()
y_tokenizer.fit_on_texts(training_labels)

y_training = y_tokenizer.texts_to_sequences(training_labels)
y_validation = y_tokenizer.texts_to_sequences(testing_labels)

y_training = pad_sequences(y_training, maxlen=max_len, padding='post')
y_validation = pad_sequences(y_validation, maxlen=max_len, padding='post')

y_vocab_size = len(y_tokenizer.word_index) + 1

In [10]:
x_vocab_size, y_vocab_size

(69883, 18277)

### Create the model architecture
The model uses an LSTM (Long Short-Term Memory) network for the encoder and another LSTM network for the decoder.

The encoder-decoder architecture works as illustrated below:
![Encoder-Decoder architecture](https://storage.googleapis.com/islamohamedd1.appspot.com/ai_project/encoder_decoder_diagram.jpeg)


#### Encoder
The encoder is an LSTM network whose initial state is all zeros. At each time step the encoder receives one word of the input sequence. The LSTM processes that word together with the state from the previous time step, then passes the updated state on to the next step. After the last word, the encoder outputs its final state.

![Encoder architecture](https://storage.googleapis.com/islamohamedd1.appspot.com/ai_project/encoder_diagram.jpeg)


#### Decoder
The decoder LSTM network takes the final state of the encoder as its initial state and the first word (start) as input. At each step, the decoder predicts the next word (y1, y2, etc.) from the previous word and the state carried over from the previous step, which originates from the encoder's final state.

![Decoder architecture](https://storage.googleapis.com/islamohamedd1.appspot.com/ai_project/decoder_diagram.jpeg)
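
The flow of state from encoder to decoder can be sketched with a toy recurrent cell. This is only an illustration under strong simplifications: `toy_cell` is a hypothetical scalar update, whereas a real LSTM keeps two state vectors (h and c) and uses learned gates. The point is just to show the zero initial state, the step-by-step update, and the hand-off of the final encoder state to the decoder.

```python
import math


def toy_cell(x, state):
    """Toy recurrent update: mixes the current input with the previous state.

    A real LSTM cell keeps two state vectors (h and c) and uses learned gates;
    this scalar version only illustrates how state flows between time steps.
    """
    return math.tanh(0.5 * x + 0.5 * state)


def encode(inputs):
    """Run the cell over the input sequence, starting from a zero state."""
    state = 0.0
    for x in inputs:
        state = toy_cell(x, state)
    return state  # the final state summarizes the whole input


def decode(initial_state, start_input, steps):
    """Unroll the decoder: each step feeds its output back as the next input."""
    state, x, outputs = initial_state, start_input, []
    for _ in range(steps):
        state = toy_cell(x, state)
        outputs.append(state)
        x = state
    return outputs


context = encode([0.2, -0.1, 0.7])
outputs = decode(context, start_input=1.0, steps=3)
print(context, outputs)
```

In the actual model the decoder output at each step is a probability distribution over the summary vocabulary rather than a single number.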


In [11]:
from tensorflow.keras import backend as K
K.clear_session()
embidding_dim = 500

# Encoder
encoder_inputs = Input(shape=(max_len,))
encoder_embedding = Embedding(x_vocab_size, embidding_dim, trainable=True)(encoder_inputs)

encoder_lstm1 = LSTM(embidding_dim, return_sequences=True, return_state=True)
encoder_output1, state_h1, state_c1 = encoder_lstm1(encoder_embedding)

encoder_lstm2 = LSTM(embidding_dim, return_sequences=True, return_state=True)
encoder_output2, state_h2, state_c2 = encoder_lstm2(encoder_output1)

encoder_lstm3 = LSTM(embidding_dim, return_state=True, return_sequences=True)
encoder_outputs, state_h, state_c = encoder_lstm3(encoder_output2)

# Decoder
decoder_inputs = Input(shape=(None,))
decoder_embedding_layer = Embedding(y_vocab_size, embidding_dim, trainable=True)
decoder_embedding = decoder_embedding_layer(decoder_inputs)

decoder_lstm = LSTM(embidding_dim, return_sequences=True, return_state=True)
decoder_outputs, decoder_fwd_state, decoder_back_state = decoder_lstm(decoder_embedding, initial_state=[state_h, state_c])

decoder_dense = TimeDistributed(Dense(y_vocab_size, activation='softmax'))
decoder_outputs = decoder_dense(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, 80)]         0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, 80, 500)      34941500    input_1[0][0]                    
__________________________________________________________________________________________________
lstm (LSTM)                     [(None, 80, 500), (N 2002000     embedding[0][0]                  
__________________________________________________________________________________________________
input_2 (InputLayer)            [(None, None)]       0                                            
______________________________________________________________________________________________

In [12]:
model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy', metrics=['acc'])

### Define the checkpoint path where the model weights are saved after each epoch

In [13]:
import tensorflow as tf

checkpoint_path = "./model_weights/model_weights/cp.ckpt"

### Load the model weights

In [14]:
model.load_weights(checkpoint_path)

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7fa72da40850>

### Create a reversed word_index to look up the text of each word from its token

In [15]:
reverse_target_word_index=y_tokenizer.index_word 
reverse_source_word_index=x_tokenizer.index_word 
target_word_index=y_tokenizer.word_index
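
A minimal sketch of the reverse lookup, using a small hypothetical `word_index` rather than the fitted tokenizer's: inverting the word-to-id dictionary gives an id-to-word dictionary, which turns a padded token sequence back into text.

```python
# Hypothetical vocabulary; the notebook's real one comes from y_tokenizer.
word_index = {"great": 1, "tea": 2, "start": 3, "end": 4}

# Invert the mapping: token id -> word.
index_word = {i: w for w, i in word_index.items()}


def ids_to_text(ids):
    """Join the words for each id, skipping the padding id 0."""
    return " ".join(index_word[i] for i in ids if i != 0)


print(ids_to_text([3, 1, 2, 4, 0, 0]))  # start great tea end
```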

### Set up the decoder model to be ready to generate summaries

In [16]:
encoder_model = Model(inputs=encoder_inputs,outputs=[encoder_outputs, state_h, state_c])

decoder_state_input_h = Input(shape=(embidding_dim,))
decoder_state_input_c = Input(shape=(embidding_dim,))
decoder_hidden_state_input = Input(shape=(max_len, embidding_dim))

decoder_embedding2 = decoder_embedding_layer(decoder_inputs)

decoder_outputs2, state_h2, state_c2 = decoder_lstm(decoder_embedding2, initial_state=[decoder_state_input_h, decoder_state_input_c])

decoder_outputs2 = decoder_dense(decoder_outputs2)

decoder_model = Model(
    [decoder_inputs] + [decoder_hidden_state_input,decoder_state_input_h, decoder_state_input_c],
    [decoder_outputs2] + [state_h2, state_c2]
)

### Decoding process
The decoding process runs as follows:


1. The decoding function takes the review to summarize as input.
2. The review is fed to the encoder.
3. The encoder's final state is used as the initial state of the decoder.
4. The first input to the decoder is the word "start" (the Keras `Tokenizer` lowercases text and strips underscores by default, so `_START_` and `_END_` are stored as `start` and `end`).
5. The decoder generates one word at a time until it predicts the "end" word or the summary reaches the maximum length (10 words).
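
The loop described in these steps is a greedy decode. The sketch below shows its control flow with a stand-in for the trained decoder: `TOY_NEXT_WORD` and `toy_decoder_step` are hypothetical and simply look up the next word in a table, where the real decoder would run the LSTM and take the most probable token.

```python
START, END = "start", "end"

# Hypothetical next-word table standing in for the trained decoder.
TOY_NEXT_WORD = {"start": "great", "great": "tea", "tea": "end"}


def toy_decoder_step(word, state):
    """Return (next_word, new_state); a real decoder would run the LSTM here."""
    return TOY_NEXT_WORD.get(word, END), state + 1


def greedy_decode(initial_state, max_words=10):
    """Generate words until the end marker or the length limit is reached."""
    word, state, summary = START, initial_state, []
    while len(summary) < max_words:
        # Feed the previous word and state, get the next word and state.
        word, state = toy_decoder_step(word, state)
        if word == END:
            break
        summary.append(word)
    return " ".join(summary)


print(greedy_decode(initial_state=0))  # great tea
```

Greedy decoding keeps only the single most probable word at each step, which is simple but can miss better summaries that a wider search would find.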



In [17]:
def decode_sequence(input_sequence):
    # Encode the review once; the encoder states seed the decoder.
    encoder_out, encoder_h, encoder_c = encoder_model.predict(input_sequence)

    # The first decoder input is the 'start' token.
    output_sequence = np.zeros((1, 1))
    output_sequence[0, 0] = target_word_index['start']

    stop = False
    decoded_sentence = ''
    while not stop:
        output_tokens, h, c = decoder_model.predict([output_sequence] + [encoder_out, encoder_h, encoder_c])

        # Greedy choice: take the most probable next token.
        highest_probability_token = np.argmax(output_tokens[0, -1, :])
        if highest_probability_token == 0:
            # Index 0 is the padding value and has no word; stop decoding.
            break
        highest_probability_word = reverse_target_word_index[highest_probability_token]

        if highest_probability_word != 'end':
            decoded_sentence += ' ' + highest_probability_word

        # Stop at the 'end' token or once the summary reaches 10 words.
        if highest_probability_word == 'end' or len(decoded_sentence.split()) >= 10:
            stop = True

        # Update the target sequence (of length 1).
        output_sequence = np.zeros((1, 1))
        output_sequence[0, 0] = highest_probability_token

        # Update internal states
        encoder_h, encoder_c = h, c

    return decoded_sentence

### Test the model

In [18]:
def get_summary(input_sequence):
    newString=''
    for i in input_sequence:
      if((i!=0 and i!=target_word_index['start']) and i!=target_word_index['end']):
        newString=newString+reverse_target_word_index[i]+' '
    return newString

def get_text(input_sequence):
    newString = ''
    for i in input_sequence:
      if(i != 0):
        newString = newString + reverse_source_word_index[i] + ' '
    return newString

In [19]:
for i in range(200):
  original_summary = get_summary(y_validation[i])
  predicted_summary = decode_sequence(x_validation[i].reshape(1,max_len))
  print("Review:", get_text(x_validation[i]))
  print("Original summary:", original_summary)
  print("Predicted summary:", predicted_summary)

  print("\n")

Review: this favorite flavor stash tea adding little almond milk makes perfect also save way double price im getting stash tea amazon now 
Original summary: such a great save 
Predicted summary:  great tea


Review: i pleasently surprised good juicy tasty papaya chunks opened can br br the soft juicy chunks right bite size make sure serve cold br br i mix native forest pineapple chunks delicous fruit bowl 
Original summary: fresh as can be for canned 
Predicted summary:  delicious


Review: really good popcorn shipping costs much popcorn don't know worth it 
Original summary: amish popcorn 
Predicted summary:  great popcorn


Review: we make son gluten free diet take lunch but love them 
Original summary: yummy 
Predicted summary:  great gluten free snack


Review: like others said consistency stage 2 combo thin could thickened oatmeal cereal my 7 mth old loved flavor impatiently waited more baby foods i like contains spinach iron fruit antioxidants sweeten bit i use everyday i am home

Review: ruby red great i wish people aware would available retail again tip even better mix sunrise orange i 4 quart pitcher i mix one pack fab 
Original summary: love this crystal light flavor it is the best 
Predicted summary:  good stuff


Review: i tried product whim remembered ages ago buying kinder eggs kinder products canada enjoying them this excellent product hazelnut filling good wafers quite crisp the outer chocolate coating tasty well i wish american mass produced chocolate quality br br the chocolate come melted product description warn one refridgerate upon delivery hour when i so fine still quite fresh tasty considering i ordered early july exactly unexpected come melted matter quickly ship which shipped quickly aside 
Original summary: excellent chocolate bar 
Predicted summary:  delicious


Review: my dogs loved toys even older dog never plays toys used kept get treat it they even fought other's kongs traded awhile a big christmas success dog toy 
Original summary: dog

Review: i fish driven cat vs chicken beef products absolutely loved them i keep wrapped something let smell away reach he could smell everything even though i couldn't he would ripped anything get them 
Original summary: cats favorite treat 
Predicted summary:  my cat loves it


Review: i was expecting a lot more candy for that price and its not that good im very disappointed i wont be back for more unless the price drops by at least half 
Original summary: not that much candy for the price 
Predicted summary:  not as good as the original


Review: it looks like dung tastes horrible if want experience disgusting yourself buy try it 
Original summary: terrible 
Predicted summary:  horrible


Review: yesterday i decided i cans yet use i opened get rid dumping soup sausepan i could see consistency wonderfully creamy lots potatoes enough clams able call clam chowder i apologize anyone read first review chose purchase wonderful soup i serving soup lunch today i 1200 plus calories day i made

Review: love sure jell less sugar recipes granted still lot sugar way less full sugar recipes it sets softer side i like you really taste fruit i made strawberry jam pear jam excellent sure website recipes package insert 
Original summary: perfect 
Predicted summary:  great product


Review: gift box a br br i emailed great northern customer service couple times question request sample know i could try it they never responded i took chance ordered some br br this time taking chance paid off it pops great tastes great too br br my issue 4 ounce pack overflows popcorn popping bowl my next order 2 5 oz packs br br bottom line regardless size stuff tastes tons better popcorn i have tried works great microwave popping bowl 
Original summary: pops perfectly in a microwave popping bowl 
Predicted summary:  great for baking


Review: i love coffee i love chai lattes i love keurig but combination disgusting the taste first prepared strong ok i get could even even like it aftertaste kick in i ne

Review: fan newman's own brand human food their salad dressings amazing however cat would eat product i purchased beef flavor loves beef flavored cat treats i give organic dry food time talking cat owners informed i mixing wet food diet wild cats get moisture food nature drink lot water i thought hurt give cat treat couple times week even touch it i tried mixing regular dry food give texture ate dry food around it i sure preference cat particular cat food 
Original summary: cat will not eat it 
Predicted summary:  my cat loves it


Review: best bet figure allergic to it broken heart nurse i believe best him i cannot argue results seen since putting formula i hate cost worth every cent and i use 5 coupons whenever i buy helps lot and specialist said i give probiotics found local health food stores mimic benefits breastmilk one last thing i tried get covered insurance unsuccessful but luck all probably prescription side insurance medical part some cover pediatrician's office call child's

Review: i have browsing amazon tea sellers authentic restaurant style thai tea this one exactly i looking for br br amazing taste extremely easy make instructions provided show easy ways make tea traditional long boil br br however morning rush i simply use tea coffee machine 1 2 tbsp per cup water comes excellent cream sugar br br definitely recommended 
Original summary: excellent for those looking for restaurant thai tea taste 
Predicted summary:  delicious


Review: love love love metromint water chocolatemint favorite i like drink water like calories i drink i despise soda metromint water calories tastes great it extremely refreshing especially hot day i take water bottle metromint refreshing go drink i want splurge tap water if less expensive i would purchase often 
Original summary: best bottled water out there 
Predicted summary:  great soda


Review: i reluctant order item due negative reviews however i bit bullet subscribed i love coconut milk i buying health food store love 

Review: i always little skeptical comes things like pleasantly surprised this bar yummy i tried first without heating liked heating couple seconds microwave made even better highly recommend these super yummy 
Original summary: delicious 
Predicted summary:  delicious


Review: taste chalky disappointing love others line one serious need redo 
Original summary: forget this one 
Predicted summary:  not good


Review: why heck cannot townhouse get crackers right once remember feeling dread got slept friend's house brands foods part world that's feeling crackers arise a little crunchy little fried little much olive oil spice imbibed yes despite olive oil sea salt proclamation box garlic onion present hey see next make rambling excuse referencing obscure food allergies flee 
Original summary: flatbread crackers flat crackers 
Predicted summary:  great for the office


Review: these things like eating poker chips seasoning salt they're remarkably high calories too 
Original summary: hard as

Review: i found flavor store once searched forever find again then went amazon purchased was little afraid buying food amazon no fear like buy store could find cheaper amazon too 
Original summary: pop tarts semi frosted roll flavor 
Predicted summary:  great product


Review: what ask for this popcorn great taste nearly 100 pops i use stir crazy popper plain old crisco oil variety seasonings br br we go lot popcorn one we will ordering again 
Original summary: this is great popcorn 
Predicted summary:  great popcorn


Review: my husband loves these they pretty good they moist like real jerky sense otherwise texture similar i wish variety sweet i definitely recommend trying getting entire package luckily i got try them good 
Original summary: pretty good 
Predicted summary:  yummy


Review: item tasted plain crumbly did taste like described we disappointed product 
Original summary: plain 
Predicted summary:  not bad


Review: ive love flavor infused oils they make scrambled eggs saut 

Review: a good product i received one puree function didnt work sent back ordered another one 2nd one problem i figured two i received problem i must one i read reviews saw others saying take machine apart put black belt back on we works like champ whatever cause issue need better quality control 
Original summary: frustrated but its a good product 
Predicted summary:  good product but poor packaging




In [20]:
test_review = input("Enter a review: ")
test_sequence = x_tokenizer.texts_to_sequences([test_review])
test_sequence = pad_sequences(test_sequence, maxlen=max_len, padding='post')
decode_sequence(test_sequence.reshape(1, max_len))

Enter a review: sandwich was spicy. the sause was amaizing


' good but not great'