## Summarizing Text with Amazon Reviews - version 2

### 0. Description
1. **Project**: Amazon에서 팔린 식품의 리뷰를 요약하는 모델을 만드는 프로젝트로, [블로그](https://towardsdatascience.com/text-summarization-with-amazon-reviews-41801c2210b)와 [Github](https://github.com/Currie32/Text-Summarization-with-Amazon-Reviews)를 보고 참고함. 공부의 목적으로 이 프로젝트를 그대로 따라해 보는 중임.<br /><br />
2. **Data** : *Amazon Fine Food Reviews*. [Kaggle](https://www.kaggle.com/snap/amazon-fine-food-reviews)에서 다운로드함.<br />리뷰 내용(description)을 input으로 하고, 리뷰의 제목(title)을 target으로 하여 description이 text, title이 summary이다.
<br /><br />
3. **Tools** : Python, Tensorflow 1.2.1
<br /><br />
4. **Model**: 인코딩 레이어에 **bi-directional RNN과 LSTMs**을 사용하고, 디코딩 레이어에 **attention**을 사용한다. Textsum에서 사용한 seq2seq 모델과 유사함.
<br /><br />
5. **Sections** :
    - Inspection the Data
    - Preparing the Data
    - Building the Model
    - Training the Model
    - Making Our Own Summaries
<br/><br/>
6. **NEW for version2** : version2 에서는 data split 을 하고, evaluation을 할 것이다.

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
import re
from nltk.corpus import stopwords
import time
from tensorflow.python.layers.core import Dense
from tensorflow.python.ops.rnn_cell_impl import _zero_state_tensors
print('TensorFlow Version: {}'.format(tf.__version__))

TensorFlow Version: 1.13.1


### 1. Inspecting the Data

In [2]:
path = '/home/limhyesu/Summarization_study_data/Reviews.csv'
reviews = pd.read_csv(path)

pd.read_csv returns into DataFrame <br/>
reviews : DataFrame

In [3]:
reviews.shape

(568454, 10)

In [4]:
# Check for any nulls values. 칼럽별 null 개수 구하기.
reviews.isnull().sum()

Id                         0
ProductId                  0
UserId                     0
ProfileName               16
HelpfulnessNumerator       0
HelpfulnessDenominator     0
Score                      0
Time                       0
Summary                   27
Text                       0
dtype: int64

In [5]:
# Remove null values and unneeded features
# drop null values
reviews = reviews.dropna()
# drop unneeded features. only Summary and Text remain.
reviews = reviews.drop(['Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator', 
                        'HelpfulnessDenominator', 'Score', 'Time'], 1)
# drop parameter avoids the old index being added as column
reviews = reviews.reset_index(drop=True)

### 2. Preparing the Data
- convert to lowercase
- replace contractions with their longer forms. (contraction : 줄임말 등)
- remove any unwanted characters (done after replacing contractions. backward slash before hyphen.)
- remove stopwords from description

### Clean the Data

In [6]:
# A list of contractions from http://stackoverflow.com/questions/19790188/expanding-english-language-contractions-in-python
# someone made contraction dictionary from wikipeda
contractions = { 
"ain't": "am not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he's": "he is",
"how'd": "how did",
"how'll": "how will",
"how's": "how is",
"i'd": "i would",
"i'll": "i will",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'll": "it will",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"must've": "must have",
"mustn't": "must not",
"needn't": "need not",
"oughtn't": "ought not",
"shan't": "shall not",
"sha'n't": "shall not",
"she'd": "she would",
"she'll": "she will",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"that'd": "that would",
"that's": "that is",
"there'd": "there had",
"there's": "there is",
"they'd": "they would",
"they'll": "they will",
"they're": "they are",
"they've": "they have",
"wasn't": "was not",
"we'd": "we would",
"we'll": "we will",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what will",
"what're": "what are",
"what's": "what is",
"what've": "what have",
"where'd": "where did",
"where's": "where is",
"who'll": "who will",
"who's": "who is",
"won't": "will not",
"wouldn't": "would not",
"you'd": "you would",
"you'll": "you will",
"you're": "you are"
}

In [7]:
def clean_text(text, remove_stopwords = True):
    '''Remove unwanted characters, stopwords, and format the text to create fewer nulls word embeddings'''
    
    # 1. Convert words to lower case
    text = text.lower()
    
    # 2. Replace contractions with their longer forms
    if True:
        text = text.split()
        new_text = []
        for word in text:
            if word in contractions:
                new_text.append(contractions[word]) # longer term을 nex_text에 append함.
            else:
                new_text.append(word)
        # join() method takes all items in an iterable and joins them into one string.
        text = " ".join(new_text)        
        
        # 3. Format words and remove unwanted characters
        # ?는 0번 또는 1차례까지의 발생을 의미함. http 또는 https를 의미함.
        # MULTILINE : '^'가 각 문자열, 문장의 처음에 매칭됨. '$'는 각 문자열과 문장의 마지막에 매칭됨.
        text = re.sub(r'https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)
        text = re.sub(r'\<a href', ' ', text)
        text = re.sub(r'\&amp;', ' ', text)
        text = re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', ' ', text)
        text = re.sub(r'<br />', ' ', text)
        text = re.sub(r'\'', ' ', text)
        text = re.sub(r'https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)
        text = re.sub(r'\<a href', ' ', text)
        text = re.sub(r'\&amp;', ' ', text)
        text = re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', ' ', text)
        text = re.sub(r'<br />', ' ', text)
        text = re.sub(r'\'', ' ', text)
        
        # 4. Optionally, remove stop words
        if remove_stopwords:
            text = text.split()
            stops = set(stopwords.words("english"))
            text = [w for w in text if not w in stops]
            text = " ".join(text)

    return text

### Data Split
Split data into two parts : Train, Test<br/>
**Train** - text_train, summary_train<br/>
**Test** - text_test, summary_test

In [8]:
type(reviews)

pandas.core.frame.DataFrame

In [9]:
from sklearn.model_selection import train_test_split
target_attribute = reviews['Summary']
tmp = reviews.drop(columns=['Summary'], axis=1)
text_train, text_test, summary_train, summary_test = train_test_split(tmp, target_attribute, test_size=0.05)

In [10]:
# convert to DataFrame
summary_test = pd.DataFrame(summary_test)

In [11]:
summary_train = pd.DataFrame(summary_train)

### Clean the split data

In [12]:
# Clean the summaries and texts
# stopwords will only be removed from the description to make training faster
# but they will reamin in the summaries to make them sound more like natural phrases.
clean_summaries_train = []
for summary in summary_train.Summary:
    clean_summaries_train.append(clean_text(summary, remove_stopwords=False))
print("Summaries for train are complete.")

clean_texts_train = []
for text in text_train.Text:
    clean_texts_train.append(clean_text(text, remove_stopwords=True))
print("Texts for train are complete.")

clean_summaries_test = []
for summary in summary_test.Summary:
    clean_summaries_test.append(clean_text(summary, remove_stopwords=False))
print("Summaries for test are complete.")

clean_texts_test = []
for text in text_test.Text:
    clean_texts_test.append(clean_text(text, remove_stopwords=True))
print("Texts for test are complete.")

Summaries for train are complete.
Texts for train are complete.
Summaries for test are complete.
Texts for test are complete.


In [13]:
# Inspect the cleaned summaries and texts to ensure they have been cleaned all.
for i in range(5):
    print("Clean Review #", i+1)
    print(clean_summaries_train[i])
    print(clean_texts_train[i])
    print()

Clean Review # 1
very yum
received one bars free influenster com return honest feedback glad try product quite yummy nice soft like like delicate banana nut taste aroma reminds mom homemade banana bread thanks influenster quaker

Clean Review # 2
super bland
candies tiny little flavor barely distinguish one tried 6 satisfy sweet urge definitely worth price would recommend walgreen nice store brand jolly rancher sugar free candies instead

Clean Review # 3
spills everytime
every time heated particular soup microwave container made popping sound popped turning tray fell spilled catch spills completely find eating noodles gravitate bottom left untasty pasty noodle mash end serving

Clean Review # 4
awesome
exactly expected rediculous amount laffy taffy shipping quick product showed perfect condition

Clean Review # 5
the best
husband coffee drinker family tried dozen flavors since got keurig last year one prefers says full bodied lots flavor without stoutness something like starbuck buy 1

In [14]:
def count_words(count_dict, text):
    '''Count the number of occurrences of each word in a set of text'''
    # build word histogram as dictioncary to count the word
    
    for sentence in text: # text가 하나의 문장을 element로 가진 배열
        for word in sentence.split():
            if word not in count_dict:
                count_dict[word] = 1
            else:
                count_dict[word] += 1

In [15]:
# Find the number of times each word was used and the size of the vocabulary both in summary and text
# Summary와 Text에 나타나는 서로 다른 단어의 종류 개수.
word_counts = {}

count_words(word_counts, clean_summaries_train)
count_words(word_counts, clean_texts_train)

print("Size of Vocabulary:", len(word_counts))

Size of Vocabulary: 130051


Load Conceptnet Numberbatch's (CN) embeddings, similar to GloVe, but probably better
<br />
- use pre-trained word vectors to help improve the performance of our model.
- **ConceptNet Numberbatch(CN)** : word embeddings that we are using.
- **ConceptNet** : semantic network. 컴퓨터가 자연어의 의미를 이해하는데 도움이 되도록 만들어짐. https://github.com/commonsense/conceptnet-numberbatch

In [16]:
# create an empty dictionary
embeddings_index = {}

# f로 파일을 오픈함. f의 각 line에 대하여, line을 단어 별로 나누어 values에 할당. word는 value[0].
# float32 type으로 values[1:]를 numpy array로 convert하여 embedding에 저장.
# word가 key, embedding이 value.

with open('/home/limhyesu/Summarization_study_data/numberbatch-en-17.04b.txt', encoding = 'utf-8') as f:
    for line in f:
        values = line.split(' ')
        word = values[0]
        embedding = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = embedding
        
print('Word embeddings:', len(embeddings_index))

Word embeddings: 418082


In [17]:
# Find the number of words that are missing from CN, and are used more than our threshold
# CN에 없고 threshold인 20번 보다 많이 등장한 단어의 개수를 찾는다. -> missing_words 에 추가
missing_words = 0
threshold = 20

# Dict.items() returns dict_items object that connects key and value.
for word, count in word_counts.items():
    if count > threshold:
        if word not in embeddings_index:
            missing_words += 1

missing_ratio = round(missing_words/len(word_counts), 4)*100

print("Number of words missing from CN:", missing_words)
print("Percent of words that are missing from vocabulary: {}%".format(missing_ratio))

Number of words missing from CN: 3685
Percent of words that are missing from vocabulary: 2.83%


CN에 없는 단어가 word_embedding_matrix에 더해지려면 그 단어는 적어도 20번 이상 등장해야 한다. 많이 등장해야 모델이 단어의 의미를 이해할 수 있기 때문이다.

In [18]:
# Limit the vocab that we will use to words that appear >= threshold or are in Glove

# dictionary to convert words to integers
vocab_to_int = {}

value = 0
for word, count in word_counts.items():
    if count >= threshold or word in embeddings_index: # 20번 이상 등장하거나, CN에 있다면
        vocab_to_int[word] = value # 단어마다 int 할당. 
        value += 1
        
# summary 나 text에 등장하는 단어 중
# vocab_to_int에는 CN에 있거나, CN에는 없지만 20번 이상 등장하는 단어가 key
# 각 key에 대해 0부터 차례대로 int 할당

In [19]:
print("Number of words we will use:", len(vocab_to_int))
print("Total number of unique words:", len(word_counts))

Number of words we will use: 58798
Total number of unique words: 130051


In [20]:
# Sepcial tokens that will be added to our vocab
codes = ["<UNK>", "<PAD>", "<EOS>", "<GO>"]

# Add codes to vocab
for code in codes:
    vocab_to_int[code] = len(vocab_to_int) # vocab_to_int의 마지막에 codes 추가.

# Dictionary to convert integers to words
int_to_vocab = {}
for word, value in vocab_to_int.items():
    int_to_vocab[value] = word

In [21]:
usage_ratio = round(len(vocab_to_int)/ len(word_counts), 4)*100

print("Total number of unique words:", len(word_counts))
print("Number of words we will use:", len(vocab_to_int))
print("Percent of words we will use: {}%".format(usage_ratio))

Total number of unique words: 130051
Number of words we will use: 58802
Percent of words we will use: 45.21%


- limit vocabulary to words that are either **in CN** or **occur more than 20 times** in our dataset
- model이 단어를 많이 볼 수록, 즉 단어가 많이 나타날 수록 단어들끼리 어떻게 연관되어 있는지 알기 쉽기 때문에 어휘를 위와 같이 제한하는 것이 좋은 word embedding을 만들 수 있다.
- word_embedding_matrix를 만들 때 np.zeros의 dtype을 float32로 설정하는 것은 매우 중요하다. 초기값이 float64인데 이는 Tensorflow에서 안돌아가므로 32로 낮춰야한다.

In [22]:
# Need to use 300 for embedding dimensions to match CN's vectors.
embedding_dim = 300
nb_words = len(vocab_to_int)

# np.zeros(shape, type, order) : return a new array of given shape and type, filled with zeros.
# (nb_words, emabedding_dim) : shape of the new array.
word_embedding_matrix = np.zeros((nb_words, embedding_dim), dtype = np.float32)

# vocab_to_int : 사용할 단어, embeddings_index : CN에 있는 단어
for word, i in vocab_to_int.items():
    if word in embeddings_index: 
        # CN에 있는 word라면 embedding 그대로 추가
        word_embedding_matrix[i] = embeddings_index[word]
    else:
        # if word not in CN, create a random embedding for it 
        new_embedding = np.array(np.random.uniform(-1.0, 1.0, embedding_dim))
        embeddings_index[word] = new_embedding # embeddings_index에 word와 embedding 추가함.
        word_embedding_matrix[i] = new_embedding
        
# Check if value matches len(vocab_to_int)
print(len(word_embedding_matrix))

58802


In [23]:
def convert_to_ints(text, word_count, unk_count, eos=False):
    '''Convert words in text to an integer.
        If word is not in vocab_to_int, use UNK's integer.
        Total the number of words and UNKs.
        Add EOS token to the end of texts'''
    ints = []
    for sentence in text:
        sentences_ints = []
        for word in sentence.split():
            word_count += 1
            if word in vocab_to_int:
                sentences_ints.append(vocab_to_int[word])
            else:
                sentences_ints.append(vocab_to_int["<UNK>"])
                unk_count += 1
        if eos:
            sentences_ints.append(vocab_to_int["<EOS>"])
        ints.append(sentences_ints)
    return ints, word_count, unk_count

In [24]:
# Apply convert_to_ints to clean_summaries and clean_texts
word_count = 0
unk_count = 0

# summary를 int로 바꿈, text를 int로 바꾸고 eos 추가.
int_summaries_train, word_count, unk_count = convert_to_ints(clean_summaries_train, word_count, unk_count)
int_texts_train, word_count, unk_count = convert_to_ints(clean_texts_train, word_count, unk_count, eos = True)

unk_percent = round(unk_count/word_count, 4)*100

print("Total number of words in headlines:", word_count)
print("Total number of UNKs in headlines:", unk_count)
print("Percent of words that are UNK: {}%".format(unk_percent))

Total number of words in headlines: 24390949
Total number of UNKs in headlines: 185684
Percent of words that are UNK: 0.76%


In [25]:
def create_lengths(text):
    '''Create a data frame of the sentence lengths from a text'''
    # 각 sentence의 length를 counts 열에 작성.
    lengths = []
    for sentence in text:
        lengths.append(len(sentence))
    return pd.DataFrame(lengths, columns=['counts'])

In [26]:
lengths_summaries_train = create_lengths(int_summaries_train)
lengths_texts_train = create_lengths(int_texts_train)

print("Summaries:")
print(lengths_summaries_train.describe())
print()
print("Texts:")
print(lengths_texts_train.describe())

Summaries:
              counts
count  539990.000000
mean        4.180633
std         2.657093
min         0.000000
25%         2.000000
50%         4.000000
75%         5.000000
max        48.000000

Texts:
              counts
count  539990.000000
mean       41.988628
std        42.512535
min         1.000000
25%        18.000000
50%        29.000000
75%        50.000000
max      2085.000000


In [27]:
# Inspect the length of texts
print(np.percentile(lengths_texts_train.counts, 90)) # compute the 90th percentile of the lengths_texts elements
print(np.percentile(lengths_texts_train.counts, 95))
print(np.percentile(lengths_texts_train.counts, 99))

84.0
115.0
206.0


In [28]:
# Inspect the length of summaries
print(np.percentile(lengths_summaries_train.counts, 90))
print(np.percentile(lengths_summaries_train.counts, 95))
print(np.percentile(lengths_summaries_train.counts, 99))

8.0
9.0
13.0


In [29]:
def unk_counter(sentence):
    '''Counts the number of the UNK appears in a sentence'''
    unk_count = 0
    for word in sentence:
        if word == vocab_to_int["<UNK>"]:
            unk_count += 1
    return unk_count

- To help train the model faster, **sort** the reviews by the **length of the descriptions** form shortest to longest.
- This maeks each batch to have descriptions of **similar lengths**, which will result int **less padding**, thus **less computing**.
- Some reviews will not be included because of the number of UNK tokens in the description or summary. If there is more than 1 UNK in the description or any UNKs in the summary, the review will not be used. 의미있는 데이터로 모델을 만들고 싶기 때문.

In [30]:
# Sort the summaries and texts by the length of the texts, shortest to largest
# Sorting is to make each batch to have descriptions of similar lengths, which will result in less padding, thus less computing.
# Limit the length of summaries and texts based on the min and max ranges
# Remove reviews that include too many UNKs

sorted_summaries = []
sorted_texts = []
max_text_length = 84 # 90% percentile
max_summary_length = 13 # 99% percentile
min_length = 2
unk_text_limit = 1
unk_summary_limit = 0
# text에는 1개의 UNK 까지 허용. summary에는 UNK가 없어야 함.

for length in range(min(lengths_texts_train.counts), max_text_length):
    for count, words in enumerate(int_summaries_train): # enumerate : 몇 번 째 반복문인지 확인 가능.
        if(len(int_summaries_train[count]) >= min_length and
           len(int_summaries_train[count]) <= max_summary_length and
           len(int_texts_train[count]) >= min_length and
           unk_counter(int_summaries_train[count]) <= unk_summary_limit and
           unk_counter(int_texts_train[count]) <= unk_text_limit and
           length == len(int_texts_train[count]) # min, max 사이의 범위에 있는 length가 text의 length일 때.
           ):
            sorted_summaries.append(int_summaries_train[count])
            sorted_texts.append(int_texts_train[count])

# Compare lengths to ensure they match
print(len(sorted_summaries))
print(len(sorted_texts))

403825
403825


### 3. Building the Model

In [31]:
def model_inputs():
    '''Create placeholders for input to the model'''
    
    input_data = tf.placeholder(tf.int32, [None, None], name='input')
    targets = tf.placeholder(tf.int32, [None, None], name='targets')
    lr = tf.placeholder(tf.float32, name='learning_rate')
    keep_prob = tf.placeholder(tf.float32, name='keep_prob')
    
    # summary_length, text_length are the lengths of each sentence within a batch
    # max_summary_length is maximum length of a summary within a batch
    summary_length = tf.placeholder(tf.int32, (None,), name='summary_length')
    
    # Computes the maximum of elements across dimensions of a tensor.
    max_summary_length = tf.reduce_max(summary_length, name='max_dec_len')
    text_length = tf.placeholder(tf.int32, (None,), name='text_length')
    
    return input_data, targets, lr, keep_prob, summary_length, max_summary_length, text_length

In [32]:
def process_encoding_input(target_data, vocab_to_int, batch_size):
    '''Remove the last word id from each batch and concat the <GO> to the begining of each batch'''
    
    # ending = target_data의 마지막 단어를 추출함.
    ending = tf.strided_slice(target_data, [0, 0], [batch_size, -1], [1, 1])
    # dec_input = scalar value로 채워진 tensor를 만들어서 마지막에 <GO> 붙임.
    dec_input = tf.concat([tf.fill([batch_size, 1], vocab_to_int['<GO>']), ending], 1)
    
    return dec_input

In [33]:
def encoding_layer(rnn_size, sequence_length, num_layers, rnn_inputs, keep_prob):
    '''Create the encoding layer'''
    
    for layer in range(num_layers):
        with tf.variable_scope('encoder_{}'.format(layer)):
            # Cell 만들기
            cell_fw = tf.contrib.rnn.LSTMCell(rnn_size,
                                              initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2))
            cell_fw = tf.contrib.rnn.DropoutWrapper(cell_fw,
                                                   input_keep_prob = keep_prob)
            
            cell_bw = tf.contrib.rnn.LSTMCell(rnn_size,
                                             initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2))
            cell_bw = tf.contrib.rnn.DropoutWrapper(cell_bw,
                                                    input_keep_prob = keep_prob)
            ### MY CODE START
            
            
            # Cell 구동
            enc_output, enc_state = tf.nn.bidirectional_dynamic_rnn(cell_fw,
                                                                    cell_bw, rnn_inputs, sequence_length,
                                                                    dtype = tf.float32)

            # Join outputs since we are using a bidirectional RNN
            enc_output = tf.concat(enc_output, 2)

            return enc_output, enc_state

In [34]:
def training_decoding_layer(dec_embed_input, summary_length, dec_cell, initial_state, output_layer,
                           vocab_size, max_summary_length):
    '''Create the training logits'''
    
    training_helper = tf.contrib.seq2seq.TrainingHelper(inputs=dec_embed_input,
                                                       sequence_length=summary_length,
                                                       time_major=False)
    training_decoder = tf.contrib.seq2seq.BasicDecoder(dec_cell,
                                                      training_helper,
                                                      initial_state,
                                                      output_layer)
    training_logits, *_ = tf.contrib.seq2seq.dynamic_decode(training_decoder,
                                                          output_time_major=False,
                                                          impute_finished=True,
                                                          maximum_iterations=max_summary_length)
        
    return training_logits

**TrainingHelper** reads a sequence of integers from the encoding layer.
<br />
**BasicDecoder** processes the sequence with the decoding cell, an output layer, which is a fully connected layer. *initial_state* comes from our *DynamicAttentionWrapperState* that you will see soon.
<br />
**dynamic_decode** creates our outputs that will be used for training.

In [35]:
def inference_decoding_layer(embeddings, start_token, end_token, dec_cell, initial_state, output_layer,
                             max_summary_length, batch_size):
    '''Create the inference logits'''
    
    ### MY CODE START
    start_tokens = tf.tile(tf.constant([start_token], dtype = tf.int32), [batch_size], name='start_tokens')
    #start_toekns = tf.contrib.seq2seq.tile_batch(tf.constant([start_token], dtype = tf.int32), [batch_size], name='start_tokens')
    
    inference_helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(embeddings,
                                                               start_tokens, end_token)
    inference_decoder = tf.contrib.seq2seq.BasicDecoder(dec_cell, inference_helper, initial_state,
                                                       output_layer)
    
    inference_logits, *_ = tf.contrib.seq2seq.dynamic_decode(inference_decoder,
                                                           output_time_major=False,
                                                           impute_finished=True,
                                                           maximum_iterations=max_summary_length)
    
    return inference_logits

**inference_decoding_layer** is very similar to training layer. The main difference is **GreedyEmbeddingHelper**, which uses the argmax of the output (treated as logits) and passes the result through an embedding layer to get the next input. Although it is asking for **start_tokens**, we only have one, < GO >.

In [36]:
을def decoding_layer(dec_embed_input, embeddings, enc_output, enc_state, vocab_size, text_length, summary_length,
                  max_summary_length, rnn_size, vocab_to_int, keep_prob, batch_size, num_layers):
    '''Create the decoding cell and attention for the trainig and inference decoding layers'''
    
    for layer in range(num_layers):
        with tf.variable_scope('decoder_{}'.format(layer)):
            lstm = tf.contrib.rnn.LSTMCell(rnn_size,
                                          initializer = tf.random_uniform_initializer(-0.1, 0.1, seed=2))
            dec_cell = tf.contrib.rnn.DropoutWrapper(lstm, input_keep_prob = keep_prob)
            
    output_layer = Dense(vocab_size, 
                         kernel_initializer = tf.truncated_normal_initializer(mean = 0.0, stddev = 0.1))
    
    attn_mech = tf.contrib.seq2seq.BahdanauAttention(rnn_size, enc_output, text_length, normalize = False, name='BahdanauAttention')
    
    ### MY CODE START
    '''No DynamicAttentionWrapper in new version so I changed the code'''
    
    dec_cell = tf.contrib.seq2seq.AttentionWrapper(dec_cell, attn_mech, rnn_size)
    initial_state = dec_cell.zero_state(dtype=tf.float32, batch_size=batch_size)
    
    with tf.variable_scope("decode"):
        training_logits = training_decoding_layer(dec_embed_input, summary_length, dec_cell, initial_state,
                                                output_layer, vocab_size, max_summary_length)
        
    with tf.variable_scope("decode", reuse=True):
        inference_logits = inference_decoding_layer(embeddings, vocab_to_int['<GO>'],
                                                 vocab_to_int['<EOS>'],
                                                 dec_cell, initial_state, output_layer, max_summary_length,
                                                 batch_size)
        
    return training_logits, inference_logits

In [37]:
def seq2seq_model(input_data, target_data, keep_prob, text_length, summary_length, max_summary_length,
                 vocab_size, rnn_size, num_layers, vocab_to_int, batch_size):
    '''Use the previous functions to create the training and inference logits'''
    
    # Use Numberbatch's embeddings and the newly created ones as our embeddings
    embeddings = word_embedding_matrix
    
    enc_embed_input = tf.nn.embedding_lookup(embeddings, input_data)
    enc_output, enc_state = encoding_layer(rnn_size, text_length, num_layers, enc_embed_input, keep_prob)
    
    dec_input = process_encoding_input(target_data, vocab_to_int, batch_size)
    dec_embed_input = tf.nn.embedding_lookup(embeddings, dec_input)
    
    training_logits, inference_logits = decoding_layer(dec_embed_input, embeddings, enc_output, 
                                                      enc_state, vocab_size, text_length, summary_length,
                                                      max_summary_length, rnn_size, vocab_to_int, keep_prob, batch_size,
                                                      num_layers)
    
    return training_logits, inference_logits

In [38]:
을def pad_sentence_batch(sentence_batch):
    '''Pad sentences with <PAD> so that each sentence of batch has the same length'''
    max_sentence = max([len(sentence) for sentence in sentence_batch])
    return [sentence + [vocab_to_int['<PAD>']] * (max_sentence - len(sentence)) for sentence in sentence_batch]

In [39]:
def get_batches(summaries, texts, batch_size):
    '''Batch summaries, texts, and the lengths of their sentences together'''
    for batch_i in range(0, len(texts)//batch_size):
        start_i = batch_i * batch_size
        summaries_batch = summaries[start_i:start_i+batch_size]
        texts_batch = texts[start_i:start_i+batch_size]
        pad_summaries_batch = np.array(pad_sentence_batch(summaries_batch))
        pad_texts_batch = np.array(pad_sentence_batch(texts_batch))
        
        # Need the lengths for the _lengths parameters
        pad_summaries_lengths = []
        for summary in pad_summaries_batch:
            pad_summaries_lengths.append(len(summary))
            
        pad_texts_lengths = []
        for text in pad_texts_batch:
            pad_texts_lengths.append(len(text))
            
        yield pad_summaries_batch, pad_texts_batch, pad_summaries_lengths, pad_texts_lengths

In [40]:
# Set the Hyperparameters
epochs = 100
batch_size = 64
rnn_size = 256
num_layers = 2
learning_rate = 0.005
keep_probability = 0.75

In [41]:
# Build the graph
train_graph = tf.Graph()
# Set the graph to default to ensure that it is ready for training
with train_graph.as_default():
    
    # Load the model inputs
    input_data, targets, lr, keep_prob, summary_length, max_summary_length, text_length = model_inputs()
    
    # Create the training and inference logits
    training_logits, inference_logits = seq2seq_model(tf.reverse(input_data, [-1]),
                                                     targets, keep_prob, text_length, summary_length,
                                                     max_summary_length, len(vocab_to_int)+1, rnn_size, num_layers,
                                                     vocab_to_int, batch_size)
    
    # Create tensors for the training logits and inference logits
    training_logits = tf.identity(training_logits.rnn_output, 'logits')
    inference_logits = tf.identity(inference_logits.sample_id, name='predictions')
    
    # Create the weights for sequence_loss
    masks = tf.sequence_mask(summary_length, max_summary_length, dtype = tf.float32, name='masks')
    
    with tf.name_scope("optimization"):
        # Loss function
        cost = tf.contrib.seq2seq.sequence_loss(training_logits, targets, masks)
        
        # Optimizer
        optimizer = tf.train.AdamOptimizer(learning_rate)
        
        # Gradient Clipping
        gradients = optimizer.compute_gradients(cost)
        capped_gradients = [(tf.clip_by_value(grad, -5., 5.), var) for grad, var in gradients if grad is not None]
        train_op = optimizer.apply_gradients(capped_gradients)
        
print("Graph is built")

Instructions for updating:
Colocations handled automatically by placer.

For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.

Instructions for updating:
This class is equivalent as tf.keras.layers.LSTMCell, and will be replaced by that in Tensorflow 2.0.
Instructions for updating:
Please use `keras.layers.Bidirectional(keras.layers.RNN(cell))`, which is equivalent to this API
Instructions for updating:
Please use `keras.layers.RNN(cell)`, which is equivalent to this API
Instructions for updating:
Use tf.cast instead.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
Graph is built


### 4. Training the Model

used subset of the data since the whole data will take too long time

In [42]:
# Subset the data for training
start = 200000
end = start + 100000
sorted_summaries_short = sorted_summaries[start:end]
sorted_texts_short = sorted_texts[start:end]
print("The shortest text length:", len(sorted_texts_short[0]))
print("The longest text length:", len(sorted_texts_short[-1]))

The shortest text length: 26
The longest text length: 42


In [43]:
# Train the Model
learning_rate_decay = 0.95
min_learning_rate = 0.0005
display_step = 20 # Check training loss after every 20 batches
stop_early = 0
stop = 3 # If the update loss does not decrease in 3 consecutive update checks, stop training
per_epoch = 3 # Make 3 update checks per epoch
update_check = (len(sorted_texts_short)//batch_size//per_epoch)-1

update_loss = 0
batch_loss = 0
summary_update_loss = [] # Record the update losses for saving improvements in the model

checkpoint = "./best_model.ckpt"

In [44]:
with tf.Session(graph=train_graph) as sess:
    sess.run(tf.global_variables_initializer())
    
    # If we want to continue training a previous session
    # loader = tf.train.import_meta_graph("./"+checkpoint+'.meta')
    # loader.restore(sess, checkpoint)
    
    for epoch_i in range(1, epochs+1):
        update_loss = 0
        batch_loss = 0
        for batch_i, (summaries_batch, texts_batch, summaries_lengths, texts_lengths) in enumerate(
                get_batches(sorted_summaries_short, sorted_texts_short, batch_size)):
            start_time = time.time()
            _, loss = sess.run(
                [train_op, cost], 
                {input_data: texts_batch,
                 targets: summaries_batch,
                 lr: learning_rate,
                 summary_length: summaries_lengths,
                 text_length: texts_lengths,
                 keep_prob: keep_probability})
            
            batch_loss += loss
            update_loss += loss
            end_time = time.time()
            batch_time = end_time - start_time

            if (batch_i % display_step == 0) and (batch_i > 0):
                print('Epoch {:>3}/{} Batch{:>4}/{} - Loss: {:>6.3f}, Seconds: {:>4.2f}'
                     .format(epoch_i, epochs, batch_i, len(sorted_texts_short) // batch_size,
                            batch_loss / display_step, batch_time*display_step))
                batch_loss = 0
                
            if (batch_i % update_check == 0) and (batch_i > 0):
                print("Average loss for this update:", round(update_loss/update_check,3))
                summary_update_loss.append(update_loss)
                
                # If the update loss is at a new minimum, save the model
                if update_loss <= min(summary_update_loss):
                    print('New Record!')
                    stop_early = 0
                    saver = tf.train.Saver()
                    saver.save(sess, checkpoint)
                    
                else:
                    print("No Improvement.")
                    stop_early += 1
                    if stop_early == stop:
                        break
                update_loss = 0
                
            
            # Reduce learnig rate, but not below its minimum value
            learning_rate *= learning_rate_decay
            if learning_rate < min_learning_rate:
                learning_rate = min_learning_rate
            
            if stop_early == stop:
                print("Stopping Training.")
                break   

Epoch   1/100 Batch  20/1562 - Loss:  4.663, Seconds: 23.13
Epoch   1/100 Batch  40/1562 - Loss:  2.739, Seconds: 18.00
Epoch   1/100 Batch  60/1562 - Loss:  2.689, Seconds: 16.55
Epoch   1/100 Batch  80/1562 - Loss:  2.772, Seconds: 19.06
Epoch   1/100 Batch 100/1562 - Loss:  2.562, Seconds: 20.28
Epoch   1/100 Batch 120/1562 - Loss:  2.546, Seconds: 19.06
Epoch   1/100 Batch 140/1562 - Loss:  2.541, Seconds: 15.24
Epoch   1/100 Batch 160/1562 - Loss:  2.674, Seconds: 17.94
Epoch   1/100 Batch 180/1562 - Loss:  2.562, Seconds: 20.27
Epoch   1/100 Batch 200/1562 - Loss:  2.592, Seconds: 20.20
Epoch   1/100 Batch 220/1562 - Loss:  2.533, Seconds: 15.16
Epoch   1/100 Batch 240/1562 - Loss:  2.447, Seconds: 13.92
Epoch   1/100 Batch 260/1562 - Loss:  2.521, Seconds: 19.68
Epoch   1/100 Batch 280/1562 - Loss:  2.525, Seconds: 19.58
Epoch   1/100 Batch 300/1562 - Loss:  2.600, Seconds: 20.07
Epoch   1/100 Batch 320/1562 - Loss:  2.519, Seconds: 20.88
Epoch   1/100 Batch 340/1562 - Loss:  2.

Epoch   2/100 Batch1120/1562 - Loss:  1.627, Seconds: 18.11
Epoch   2/100 Batch1140/1562 - Loss:  1.580, Seconds: 21.08
Epoch   2/100 Batch1160/1562 - Loss:  1.942, Seconds: 22.61
Epoch   2/100 Batch1180/1562 - Loss:  1.836, Seconds: 21.49
Epoch   2/100 Batch1200/1562 - Loss:  1.587, Seconds: 22.62
Epoch   2/100 Batch1220/1562 - Loss:  1.661, Seconds: 20.11
Epoch   2/100 Batch1240/1562 - Loss:  1.816, Seconds: 19.69
Epoch   2/100 Batch1260/1562 - Loss:  1.878, Seconds: 19.54
Epoch   2/100 Batch1280/1562 - Loss:  1.713, Seconds: 19.94
Epoch   2/100 Batch1300/1562 - Loss:  1.627, Seconds: 22.81
Epoch   2/100 Batch1320/1562 - Loss:  1.547, Seconds: 20.02
Epoch   2/100 Batch1340/1562 - Loss:  1.883, Seconds: 21.55
Epoch   2/100 Batch1360/1562 - Loss:  1.772, Seconds: 19.92
Epoch   2/100 Batch1380/1562 - Loss:  1.691, Seconds: 19.84
Epoch   2/100 Batch1400/1562 - Loss:  1.621, Seconds: 21.42
Epoch   2/100 Batch1420/1562 - Loss:  1.888, Seconds: 20.11
Epoch   2/100 Batch1440/1562 - Loss:  1.

Epoch   4/100 Batch 660/1562 - Loss:  1.593, Seconds: 20.83
Epoch   4/100 Batch 680/1562 - Loss:  1.592, Seconds: 17.97
Epoch   4/100 Batch 700/1562 - Loss:  1.423, Seconds: 19.42
Epoch   4/100 Batch 720/1562 - Loss:  1.461, Seconds: 16.25
Epoch   4/100 Batch 740/1562 - Loss:  1.393, Seconds: 20.63
Epoch   4/100 Batch 760/1562 - Loss:  1.496, Seconds: 17.64
Epoch   4/100 Batch 780/1562 - Loss:  1.544, Seconds: 21.98
Epoch   4/100 Batch 800/1562 - Loss:  1.403, Seconds: 19.78
Epoch   4/100 Batch 820/1562 - Loss:  1.409, Seconds: 19.30
Epoch   4/100 Batch 840/1562 - Loss:  1.387, Seconds: 20.61
Epoch   4/100 Batch 860/1562 - Loss:  1.420, Seconds: 19.39
Epoch   4/100 Batch 880/1562 - Loss:  1.515, Seconds: 19.34
Epoch   4/100 Batch 900/1562 - Loss:  1.457, Seconds: 19.34
Epoch   4/100 Batch 920/1562 - Loss:  1.463, Seconds: 21.24
Epoch   4/100 Batch 940/1562 - Loss:  1.325, Seconds: 20.76
Epoch   4/100 Batch 960/1562 - Loss:  1.450, Seconds: 20.63
Epoch   4/100 Batch 980/1562 - Loss:  1.

Epoch   6/100 Batch 200/1562 - Loss:  1.285, Seconds: 21.87
Epoch   6/100 Batch 220/1562 - Loss:  1.283, Seconds: 16.28
Epoch   6/100 Batch 240/1562 - Loss:  1.248, Seconds: 14.96
Epoch   6/100 Batch 260/1562 - Loss:  1.261, Seconds: 21.05
Epoch   6/100 Batch 280/1562 - Loss:  1.286, Seconds: 19.29
Epoch   6/100 Batch 300/1562 - Loss:  1.409, Seconds: 21.61
Epoch   6/100 Batch 320/1562 - Loss:  1.332, Seconds: 22.23
Epoch   6/100 Batch 340/1562 - Loss:  1.259, Seconds: 20.77
Epoch   6/100 Batch 360/1562 - Loss:  1.209, Seconds: 20.65
Epoch   6/100 Batch 380/1562 - Loss:  1.208, Seconds: 19.60
Epoch   6/100 Batch 400/1562 - Loss:  1.189, Seconds: 22.37
Epoch   6/100 Batch 420/1562 - Loss:  1.457, Seconds: 19.24
Epoch   6/100 Batch 440/1562 - Loss:  1.368, Seconds: 21.06
Epoch   6/100 Batch 460/1562 - Loss:  1.278, Seconds: 17.97
Epoch   6/100 Batch 480/1562 - Loss:  1.251, Seconds: 19.40
Epoch   6/100 Batch 500/1562 - Loss:  1.225, Seconds: 21.26
Average loss for this update: 1.285
New 

Epoch   7/100 Batch1300/1562 - Loss:  1.218, Seconds: 22.96
Epoch   7/100 Batch1320/1562 - Loss:  1.130, Seconds: 20.09
Epoch   7/100 Batch1340/1562 - Loss:  1.359, Seconds: 22.21
Epoch   7/100 Batch1360/1562 - Loss:  1.287, Seconds: 20.56
Epoch   7/100 Batch1380/1562 - Loss:  1.243, Seconds: 20.77
Epoch   7/100 Batch1400/1562 - Loss:  1.201, Seconds: 21.22
Epoch   7/100 Batch1420/1562 - Loss:  1.362, Seconds: 20.10
Epoch   7/100 Batch1440/1562 - Loss:  1.305, Seconds: 18.42
Epoch   7/100 Batch1460/1562 - Loss:  1.280, Seconds: 19.93
Epoch   7/100 Batch1480/1562 - Loss:  1.196, Seconds: 18.64
Epoch   7/100 Batch1500/1562 - Loss:  1.300, Seconds: 22.33
Epoch   7/100 Batch1520/1562 - Loss:  1.329, Seconds: 23.97
Epoch   7/100 Batch1540/1562 - Loss:  1.265, Seconds: 18.87
Average loss for this update: 1.266
No Improvement.
Epoch   7/100 Batch1560/1562 - Loss:  1.161, Seconds: 21.94
Epoch   8/100 Batch  20/1562 - Loss:  1.390, Seconds: 21.61
Epoch   8/100 Batch  40/1562 - Loss:  1.185, Sec

Epoch   9/100 Batch 840/1562 - Loss:  1.109, Seconds: 21.46
Epoch   9/100 Batch 860/1562 - Loss:  1.149, Seconds: 19.92
Epoch   9/100 Batch 880/1562 - Loss:  1.223, Seconds: 20.15
Epoch   9/100 Batch 900/1562 - Loss:  1.186, Seconds: 19.55
Epoch   9/100 Batch 920/1562 - Loss:  1.189, Seconds: 21.58
Epoch   9/100 Batch 940/1562 - Loss:  1.100, Seconds: 22.73
Epoch   9/100 Batch 960/1562 - Loss:  1.180, Seconds: 21.75
Epoch   9/100 Batch 980/1562 - Loss:  1.229, Seconds: 23.41
Epoch   9/100 Batch1000/1562 - Loss:  1.194, Seconds: 22.70
Epoch   9/100 Batch1020/1562 - Loss:  1.156, Seconds: 15.93
Average loss for this update: 1.17
No Improvement.
Epoch   9/100 Batch1040/1562 - Loss:  1.143, Seconds: 18.72
Epoch   9/100 Batch1060/1562 - Loss:  1.217, Seconds: 21.56
Epoch   9/100 Batch1080/1562 - Loss:  1.223, Seconds: 21.66
Epoch   9/100 Batch1100/1562 - Loss:  1.218, Seconds: 21.96
Epoch   9/100 Batch1120/1562 - Loss:  1.130, Seconds: 18.74
Epoch   9/100 Batch1140/1562 - Loss:  1.104, Seco

Epoch  11/100 Batch 380/1562 - Loss:  1.033, Seconds: 25.11
Epoch  11/100 Batch 400/1562 - Loss:  0.994, Seconds: 22.90
Epoch  11/100 Batch 420/1562 - Loss:  1.217, Seconds: 19.39
Epoch  11/100 Batch 440/1562 - Loss:  1.128, Seconds: 21.10
Epoch  11/100 Batch 460/1562 - Loss:  1.085, Seconds: 18.66
Epoch  11/100 Batch 480/1562 - Loss:  1.061, Seconds: 20.24
Epoch  11/100 Batch 500/1562 - Loss:  1.047, Seconds: 20.45
Average loss for this update: 1.079
New Record!
Epoch  11/100 Batch 520/1562 - Loss:  1.063, Seconds: 22.37
Epoch  11/100 Batch 540/1562 - Loss:  1.195, Seconds: 22.83
Epoch  11/100 Batch 560/1562 - Loss:  1.154, Seconds: 24.65
Epoch  11/100 Batch 580/1562 - Loss:  1.121, Seconds: 21.98
Epoch  11/100 Batch 600/1562 - Loss:  1.057, Seconds: 21.98
Epoch  11/100 Batch 620/1562 - Loss:  1.079, Seconds: 18.41
Epoch  11/100 Batch 640/1562 - Loss:  1.078, Seconds: 18.68
Epoch  11/100 Batch 660/1562 - Loss:  1.171, Seconds: 21.32
Epoch  11/100 Batch 680/1562 - Loss:  1.202, Seconds

Epoch  12/100 Batch1480/1562 - Loss:  1.055, Seconds: 18.74
Epoch  12/100 Batch1500/1562 - Loss:  1.144, Seconds: 21.57
Epoch  12/100 Batch1520/1562 - Loss:  1.176, Seconds: 23.38
Epoch  12/100 Batch1540/1562 - Loss:  1.116, Seconds: 19.56
Average loss for this update: 1.118
No Improvement.
Epoch  12/100 Batch1560/1562 - Loss:  1.032, Seconds: 21.68
Epoch  13/100 Batch  20/1562 - Loss:  1.213, Seconds: 21.54
Epoch  13/100 Batch  40/1562 - Loss:  1.038, Seconds: 16.02
Epoch  13/100 Batch  60/1562 - Loss:  1.043, Seconds: 17.39
Epoch  13/100 Batch  80/1562 - Loss:  1.056, Seconds: 19.98
Epoch  13/100 Batch 100/1562 - Loss:  0.962, Seconds: 21.52
Epoch  13/100 Batch 120/1562 - Loss:  0.954, Seconds: 20.48
Epoch  13/100 Batch 140/1562 - Loss:  0.908, Seconds: 16.14
Epoch  13/100 Batch 160/1562 - Loss:  1.079, Seconds: 17.72
Epoch  13/100 Batch 180/1562 - Loss:  1.042, Seconds: 21.49
Epoch  13/100 Batch 200/1562 - Loss:  1.022, Seconds: 21.71
Epoch  13/100 Batch 220/1562 - Loss:  1.052, Sec

Epoch  14/100 Batch1020/1562 - Loss:  1.046, Seconds: 15.53
Average loss for this update: 1.052
No Improvement.
Epoch  14/100 Batch1040/1562 - Loss:  1.027, Seconds: 17.96
Epoch  14/100 Batch1060/1562 - Loss:  1.102, Seconds: 21.11
Epoch  14/100 Batch1080/1562 - Loss:  1.098, Seconds: 20.86
Epoch  14/100 Batch1100/1562 - Loss:  1.084, Seconds: 20.87
Epoch  14/100 Batch1120/1562 - Loss:  1.034, Seconds: 17.91
Epoch  14/100 Batch1140/1562 - Loss:  1.006, Seconds: 21.10
Epoch  14/100 Batch1160/1562 - Loss:  1.157, Seconds: 22.37
Epoch  14/100 Batch1180/1562 - Loss:  1.101, Seconds: 21.38
Epoch  14/100 Batch1200/1562 - Loss:  0.985, Seconds: 22.48
Epoch  14/100 Batch1220/1562 - Loss:  1.021, Seconds: 19.87
Epoch  14/100 Batch1240/1562 - Loss:  1.133, Seconds: 19.76
Epoch  14/100 Batch1260/1562 - Loss:  1.115, Seconds: 19.72
Epoch  14/100 Batch1280/1562 - Loss:  1.046, Seconds: 19.86
Epoch  14/100 Batch1300/1562 - Loss:  1.026, Seconds: 22.98
Epoch  14/100 Batch1320/1562 - Loss:  0.978, Sec

Epoch  16/100 Batch 540/1562 - Loss:  1.081, Seconds: 22.13
Epoch  16/100 Batch 560/1562 - Loss:  1.042, Seconds: 22.15
Epoch  16/100 Batch 580/1562 - Loss:  1.024, Seconds: 22.07
Epoch  16/100 Batch 600/1562 - Loss:  0.971, Seconds: 21.67
Epoch  16/100 Batch 620/1562 - Loss:  0.982, Seconds: 17.46
Epoch  16/100 Batch 640/1562 - Loss:  0.972, Seconds: 17.63
Epoch  16/100 Batch 660/1562 - Loss:  1.076, Seconds: 20.64
Epoch  16/100 Batch 680/1562 - Loss:  1.093, Seconds: 17.88
Epoch  16/100 Batch 700/1562 - Loss:  0.998, Seconds: 18.86
Epoch  16/100 Batch 720/1562 - Loss:  1.034, Seconds: 16.20
Epoch  16/100 Batch 740/1562 - Loss:  0.980, Seconds: 20.40
Epoch  16/100 Batch 760/1562 - Loss:  1.025, Seconds: 17.95
Epoch  16/100 Batch 780/1562 - Loss:  1.034, Seconds: 21.89
Epoch  16/100 Batch 800/1562 - Loss:  0.983, Seconds: 19.44
Epoch  16/100 Batch 820/1562 - Loss:  0.981, Seconds: 19.18
Epoch  16/100 Batch 840/1562 - Loss:  0.958, Seconds: 20.69
Epoch  16/100 Batch 860/1562 - Loss:  1.

Epoch  18/100 Batch  80/1562 - Loss:  0.992, Seconds: 19.89
Epoch  18/100 Batch 100/1562 - Loss:  0.883, Seconds: 21.20
Epoch  18/100 Batch 120/1562 - Loss:  0.876, Seconds: 20.03
Epoch  18/100 Batch 140/1562 - Loss:  0.842, Seconds: 16.23
Epoch  18/100 Batch 160/1562 - Loss:  0.989, Seconds: 16.96
Epoch  18/100 Batch 180/1562 - Loss:  0.971, Seconds: 21.63
Epoch  18/100 Batch 200/1562 - Loss:  0.944, Seconds: 21.56
Epoch  18/100 Batch 220/1562 - Loss:  0.984, Seconds: 15.85
Epoch  18/100 Batch 240/1562 - Loss:  0.948, Seconds: 14.77
Epoch  18/100 Batch 260/1562 - Loss:  0.942, Seconds: 20.14
Epoch  18/100 Batch 280/1562 - Loss:  0.950, Seconds: 18.90
Epoch  18/100 Batch 300/1562 - Loss:  1.008, Seconds: 20.29
Epoch  18/100 Batch 320/1562 - Loss:  0.987, Seconds: 21.46
Epoch  18/100 Batch 340/1562 - Loss:  0.931, Seconds: 20.25
Epoch  18/100 Batch 360/1562 - Loss:  0.917, Seconds: 20.32
Epoch  18/100 Batch 380/1562 - Loss:  0.923, Seconds: 18.86
Epoch  18/100 Batch 400/1562 - Loss:  0.

Epoch  19/100 Batch1180/1562 - Loss:  1.045, Seconds: 21.36
Epoch  19/100 Batch1200/1562 - Loss:  0.930, Seconds: 22.81
Epoch  19/100 Batch1220/1562 - Loss:  0.955, Seconds: 19.48
Epoch  19/100 Batch1240/1562 - Loss:  1.063, Seconds: 20.10
Epoch  19/100 Batch1260/1562 - Loss:  1.042, Seconds: 19.66
Epoch  19/100 Batch1280/1562 - Loss:  0.975, Seconds: 19.81
Epoch  19/100 Batch1300/1562 - Loss:  0.971, Seconds: 22.39
Epoch  19/100 Batch1320/1562 - Loss:  0.915, Seconds: 19.97
Epoch  19/100 Batch1340/1562 - Loss:  1.061, Seconds: 21.43
Epoch  19/100 Batch1360/1562 - Loss:  1.012, Seconds: 20.06
Epoch  19/100 Batch1380/1562 - Loss:  0.999, Seconds: 19.89
Epoch  19/100 Batch1400/1562 - Loss:  0.963, Seconds: 21.35
Epoch  19/100 Batch1420/1562 - Loss:  1.058, Seconds: 19.85
Epoch  19/100 Batch1440/1562 - Loss:  1.033, Seconds: 18.49
Epoch  19/100 Batch1460/1562 - Loss:  1.029, Seconds: 20.17
Epoch  19/100 Batch1480/1562 - Loss:  0.957, Seconds: 18.68
Epoch  19/100 Batch1500/1562 - Loss:  1.

Epoch  21/100 Batch 720/1562 - Loss:  0.989, Seconds: 16.37
Epoch  21/100 Batch 740/1562 - Loss:  0.946, Seconds: 20.21
Epoch  21/100 Batch 760/1562 - Loss:  0.964, Seconds: 17.88
Epoch  21/100 Batch 780/1562 - Loss:  0.983, Seconds: 22.31
Epoch  21/100 Batch 800/1562 - Loss:  0.930, Seconds: 19.28
Epoch  21/100 Batch 820/1562 - Loss:  0.935, Seconds: 19.43
Epoch  21/100 Batch 840/1562 - Loss:  0.922, Seconds: 20.75
Epoch  21/100 Batch 860/1562 - Loss:  0.954, Seconds: 19.47
Epoch  21/100 Batch 880/1562 - Loss:  0.979, Seconds: 18.93
Epoch  21/100 Batch 900/1562 - Loss:  0.977, Seconds: 19.78
Epoch  21/100 Batch 920/1562 - Loss:  0.990, Seconds: 20.87
Epoch  21/100 Batch 940/1562 - Loss:  0.937, Seconds: 20.76
Epoch  21/100 Batch 960/1562 - Loss:  0.974, Seconds: 20.67
Epoch  21/100 Batch 980/1562 - Loss:  1.004, Seconds: 22.30
Epoch  21/100 Batch1000/1562 - Loss:  0.973, Seconds: 22.12
Epoch  21/100 Batch1020/1562 - Loss:  0.964, Seconds: 15.11
Average loss for this update: 0.967
No I

### 5. Test and Evaluate the Model

After training the model, we are going to test whether the model is trained well using test data set. In this section, we will going to use ROUGE score as a evaluation metric.

In [45]:
def text_to_seq(text):
    '''Prepare the text for the model'''
    text = clean_text(text)
    return [vocab_to_int.get(word, vocab_to_int['<UNK>']) for word in text.split()]

In [47]:
# Making Our Own Summaries
input_sentence = clean_texts_test
text = []
texts_batch_words = []
answer_logits_words = []

for text_single in clean_texts_test:
    text.append(text_to_seq(text_single))

checkpoint = "./best_model.ckpt"

loaded_graph = tf.Graph()
with tf.Session(graph=loaded_graph) as sess:
    # Load saved model
    loader = tf.train.import_meta_graph(checkpoint + '.meta')
    loader.restore(sess, checkpoint)

    input_data = loaded_graph.get_tensor_by_name('input:0')
    logits = loaded_graph.get_tensor_by_name('predictions:0')
    text_length = loaded_graph.get_tensor_by_name('text_length:0')
    summary_length = loaded_graph.get_tensor_by_name('summary_length:0')
    keep_prob = loaded_graph.get_tensor_by_name('keep_prob:0')
    
    pad = vocab_to_int["<PAD>"]

    # Get batches of test data
    for batch_i, (_, texts_batch, _, texts_length) in enumerate(
        get_batches(text, text, batch_size)):
        answer_logits = sess.run(logits, {input_data: texts_batch,
                                summary_length:[np.random.randint(5,8)],
                                text_length: texts_length,
                                keep_prob: 1.0})
        
        for j, text_i in enumerate(texts_batch):
            texts_batch_words.append(" ".join([int_to_vocab[i] for i in text_i if i != pad]))

        for j, answer_i in enumerate(answer_logits):
            answer_logits_words.append(" ".join([int_to_vocab[i] for i in answer_i if i != pad]))

Instructions for updating:
Use standard file APIs to check for files with this prefix.
INFO:tensorflow:Restoring parameters from ./best_model.ckpt


In [48]:
original_text = np.asarray(texts_batch_words)
answer_summary = np.asarray(answer_logits_words)

original_text = pd.DataFrame(original_text)
answer_summary = pd.DataFrame(answer_summary)

original_text.columns=["text"]
answer_summary.columns=["system summary"]

model_summary=pd.DataFrame({'model summary':clean_summaries_test})

text_and_summary = pd.concat([original_text, answer_summary, model_summary], axis=1)

In [49]:
text_and_summary.dropna(axis=0, how='any')

Unnamed: 0,text,system summary,model summary
0,bought chocolates mother birthday little scare...,best chocolate ever,great
1,dogs love treat good value compared prices gro...,great value great value,my dogs love this item
2,husband extensive taste test hard find boutiqu...,best kettle bought,these chips rock
3,great diets 2 weight watchers points per entir...,a lot of fiber,only 2 weight watchers points per brownie
4,raw eating organic garden fresh grown fruits v...,great product,raw organics going back to paradise
5,even opened yet wanted warn 3 packages receive...,check mine,arrived 8 days after the best before date
6,delicious brimming flavor individual piece cru...,delicious hard to prepare,delicious brimming with natural flavor
7,bought milk reading reviews daughter gas probl...,milk milk,better milk than other big brands
8,drinking coconut milk supposed good <br >but s...,eh coconut milk milk pudding,coconut milk is good for you
9,asin <UNK> let organic jelly gummi bears 3 5 o...,best gummi bears,delicious snack


In [50]:
def list_to_file(file_name, list_name):
    with open(file_name, 'w') as f:
        for sentence in list_name:
            f.write("%s\n" % sentence)

In [51]:
list_to_file('model_sum.txt', text_and_summary['model summary'])

In [52]:
list_to_file('system_sum.txt', text_and_summary['system summary'])

#### NOT NEEDED IN THIS PROJECT : START

In [53]:
'''
n=0
for sentence in text_and_summary['model summary']:
    with open("./models/model_sum.{}.txt".format(n), 'w') as f:
        f.write("%s\n" %sentence)
    n+=1
    
n=0
for sentence in text_and_summary['system summary']:
    with open("./systems/system_sum.{}.txt".format(n), 'w') as f:
        f.write("%s\n" %sentence)
    n+=1
'''

In [54]:
#from pyrouge import Rouge155

#Rouge155.convert_summaries_to_rouge_format('./models', './models_out')
#Rouge155.convert_summaries_to_rouge_format('./systems', './systems_out')

In [55]:
#from pyrouge import Rouge155
'''
Rouge155.write_config_static(
    './systems_out', 'system_sum.(\d+).txt',
    './models_out', 'model_sum.(\d+).txt',
    './config')
'''

"\nRouge155.write_config_static(\n    './systems_out', 'system_sum.(\\d+).txt',\n    './models_out', 'model_sum.(\\d+).txt',\n    './config')\n"

#### NOT NEEDED IN THIS PROJECT : END

In [2]:
from rouge import FilesRouge

files_rouge = FilesRouge('./system_sum.txt', './model_sum.txt')
scores = files_rouge.get_scores()

In [3]:
len(scores)

28421

드디어 rouge 스코어를 구했다..!!!!!

In [4]:
type(scores)

list

In [5]:
scores

[{'rouge-1': {'f': 0.0, 'p': 0.0, 'r': 0.0},
  'rouge-2': {'f': 0.0, 'p': 0.0, 'r': 0.0},
  'rouge-l': {'f': 0.0, 'p': 0.0, 'r': 0.0}},
 {'rouge-1': {'f': 0.0, 'p': 0.0, 'r': 0.0},
  'rouge-2': {'f': 0.0, 'p': 0.0, 'r': 0.0},
  'rouge-l': {'f': 0.0, 'p': 0.0, 'r': 0.0}},
 {'rouge-1': {'f': 0.0, 'p': 0.0, 'r': 0.0},
  'rouge-2': {'f': 0.0, 'p': 0.0, 'r': 0.0},
  'rouge-l': {'f': 0.0, 'p': 0.0, 'r': 0.0}},
 {'rouge-1': {'f': 0.0, 'p': 0.0, 'r': 0.0},
  'rouge-2': {'f': 0.0, 'p': 0.0, 'r': 0.0},
  'rouge-l': {'f': 0.0, 'p': 0.0, 'r': 0.0}},
 {'rouge-1': {'f': 0.0, 'p': 0.0, 'r': 0.0},
  'rouge-2': {'f': 0.0, 'p': 0.0, 'r': 0.0},
  'rouge-l': {'f': 0.0, 'p': 0.0, 'r': 0.0}},
 {'rouge-1': {'f': 0.0, 'p': 0.0, 'r': 0.0},
  'rouge-2': {'f': 0.0, 'p': 0.0, 'r': 0.0},
  'rouge-l': {'f': 0.0, 'p': 0.0, 'r': 0.0}},
 {'rouge-1': {'f': 0.22222221728395072, 'p': 0.25, 'r': 0.2},
  'rouge-2': {'f': 0.0, 'p': 0.0, 'r': 0.0},
  'rouge-l': {'f': 0.21693121693096162, 'p': 0.25, 'r': 0.2}},
 {'rouge-1': {

In [28]:
print(np.asarray(scores).shape)

(28421,)


scores 에 id column 을 추가해주자.

In [6]:
import numpy as np
import pandas as pd

rouge_score = np.asarray(scores)
rouge_score = pd.DataFrame(scores)


In [7]:
rouge_score

Unnamed: 0,rouge-1,rouge-2,rouge-l
0,"{'f': 0.0, 'p': 0.0, 'r': 0.0}","{'f': 0.0, 'p': 0.0, 'r': 0.0}","{'f': 0.0, 'p': 0.0, 'r': 0.0}"
1,"{'f': 0.0, 'p': 0.0, 'r': 0.0}","{'f': 0.0, 'p': 0.0, 'r': 0.0}","{'f': 0.0, 'p': 0.0, 'r': 0.0}"
2,"{'f': 0.0, 'p': 0.0, 'r': 0.0}","{'f': 0.0, 'p': 0.0, 'r': 0.0}","{'f': 0.0, 'p': 0.0, 'r': 0.0}"
3,"{'f': 0.0, 'p': 0.0, 'r': 0.0}","{'f': 0.0, 'p': 0.0, 'r': 0.0}","{'f': 0.0, 'p': 0.0, 'r': 0.0}"
4,"{'f': 0.0, 'p': 0.0, 'r': 0.0}","{'f': 0.0, 'p': 0.0, 'r': 0.0}","{'f': 0.0, 'p': 0.0, 'r': 0.0}"
5,"{'f': 0.0, 'p': 0.0, 'r': 0.0}","{'f': 0.0, 'p': 0.0, 'r': 0.0}","{'f': 0.0, 'p': 0.0, 'r': 0.0}"
6,"{'f': 0.22222221728395072, 'p': 0.25, 'r': 0.2}","{'f': 0.0, 'p': 0.0, 'r': 0.0}","{'f': 0.21693121693096162, 'p': 0.25, 'r': 0.2}"
7,"{'f': 0.2857142832653061, 'p': 1.0, 'r': 0.166...","{'f': 0.0, 'p': 0.0, 'r': 0.0}","{'f': 0.17050691244243746, 'p': 1.0, 'r': 0.16..."
8,"{'f': 0.39999999520000007, 'p': 0.5, 'r': 0.33...","{'f': 0.22222221728395072, 'p': 0.25, 'r': 0.2}","{'f': 0.371428571428493, 'p': 0.5, 'r': 0.3333..."
9,"{'f': 0.0, 'p': 0.0, 'r': 0.0}","{'f': 0.0, 'p': 0.0, 'r': 0.0}","{'f': 0.0, 'p': 0.0, 'r': 0.0}"


rouge_score dataframe 을 csv 파일로 변환해서 저장해야 한다. 

In [8]:
rouge_score.to_csv('rouge_score.csv', encoding='utf-8')

'rouge_score.csv' 에 저장되어 있는 것을 rouge_score라는 df로 읽어오는 작업을 진행하려 한다. 이는 매번 새로 실행시켜야 하는 상황을 방지하기 위함이다.

In [4]:
import pandas as pd

rouge_score_path = "./rouge_score.csv"
rouge_score = pd.read_csv(rouge_score_path, index_col=[0])

In [9]:
import numpy as np
rouge_1 = np.asarray(rouge_score['rouge-1'])
rouge_2 = np.asarray(rouge_score['rouge-2'])
rouge_L = np.asarray(rouge_score['rouge-l'])

In [10]:
import re

In [11]:
text = []
for element in rouge_1:
    string = str(element)
    text.append(re.sub('[\sa-z{\[\]:"\'}]', '', string))

In [12]:
rouge_1_list = []
for k in text:
    rouge_1_list.append(k.split(','))

In [13]:
rouge_1_df = pd.DataFrame(rouge_1_list)
rouge_1_df.columns = ['rouge-1-f', 'rouge-1-p', 'rouge-1-r']

In [14]:
text = []
for element in rouge_2:
    string = str(element)
    text.append(re.sub('[\sa-z{\[\]:"\'}]', '', string))

rouge_2_list = []
for k in text:
    rouge_2_list.append(k.split(','))

rouge_2_df = pd.DataFrame(rouge_2_list)
rouge_2_df.columns = ['rouge-2-f', 'rouge-2-p', 'rouge-2-r']

In [15]:
text = []
for element in rouge_L:
    string = str(element)
    text.append(re.sub('[\sa-z{\[\]:"\'}]', '', string))

rouge_L_list = []
for k in text:
    rouge_L_list.append(k.split(','))

rouge_L_df = pd.DataFrame(rouge_L_list)
rouge_L_df.columns = ['rouge-L-f', 'rouge--L-p', 'rouge-L-r']

In [16]:
result = rouge_1_df
result = result.join(rouge_2_df)
result = result.join(rouge_L_df)

In [17]:
result 

Unnamed: 0,rouge-1-f,rouge-1-p,rouge-1-r,rouge-2-f,rouge-2-p,rouge-2-r,rouge-L-f,rouge--L-p,rouge-L-r
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.22222221728395072,0.25,0.2,0.0,0.0,0.0,0.21693121693096162,0.25,0.2
7,0.2857142832653061,1.0,0.16666666666666666,0.0,0.0,0.0,0.17050691244243746,1.0,0.16666666666666666
8,0.39999999520000007,0.5,0.3333333333333333,0.22222221728395072,0.25,0.2,0.371428571428493,0.5,0.3333333333333333
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [20]:
result.describe()

Unnamed: 0,rouge-1-f,rouge-1-p,rouge-1-r,rouge-2-f,rouge-2-p,rouge-2-r,rouge-L-f,rouge--L-p,rouge-L-r
count,28421.0,28421.0,28421.0,28421.0,28421.0,28421.0,28421.0,28421.0,28421.0
unique,205.0,19.0,55.0,119.0,13.0,40.0,245.0,19.0,55.0
top,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
freq,20021.0,20021.0,20021.0,26365.0,26365.0,26365.0,20021.0,20021.0,20021.0


In [21]:
# 이렇게 describe() 결과가 나온다는 것은 dtype 이 object 임을 의미하고, categorical series 임을 의미함.
# categorical -> numeric 로 변환하는 방법이 없을까?

In [25]:
result = result.convert_objects(convert_numeric=True)

For all other conversions use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
  """Entry point for launching an IPython kernel.


In [26]:
result.describe()

Unnamed: 0,rouge-1-f,rouge-1-p,rouge-1-r,rouge-2-f,rouge-2-p,rouge-2-r,rouge-L-f,rouge--L-p,rouge-L-r
count,28421.0,28421.0,28421.0,28421.0,28421.0,28421.0,28421.0,28421.0,28421.0
mean,0.116639,0.139479,0.115046,0.039031,0.045383,0.039212,0.104803,0.137613,0.11381
std,0.218001,0.256351,0.224888,0.16154,0.183428,0.166812,0.204875,0.253442,0.223454
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.2,0.25,0.166667,0.0,0.0,0.0,0.159705,0.25,0.166667
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


# 바꿨다...!