## Summarizing Text with Amazon Reviews - version 2

### 0. Description
1. **Project**: A project to build a model that summarizes reviews of food products sold on Amazon, based on this [blog post](https://towardsdatascience.com/text-summarization-with-amazon-reviews-41801c2210b) and its [GitHub repository](https://github.com/Currie32/Text-Summarization-with-Amazon-Reviews). I am reproducing the project as-is for study purposes.<br /><br />
2. **Data** : *Amazon Fine Food Reviews*, downloaded from [Kaggle](https://www.kaggle.com/snap/amazon-fine-food-reviews).<br />The review body (description) is the input and the review title is the target, so the description is the text and the title is the summary.
<br /><br />
3. **Tools** : Python, TensorFlow 1.2.1
<br /><br />
4. **Model**: The encoding layer uses a **bi-directional RNN with LSTMs**, and the decoding layer uses **attention**. It is similar to the seq2seq model used in Textsum.
<br /><br />
5. **Sections** :
    - Inspecting the Data
    - Preparing the Data
    - Building the Model
    - Training the Model
    - Making Our Own Summaries
<br/><br/>
6. **NEW for version2** : Version 2 splits the data into train and test sets and adds evaluation.

In [None]:
import pandas as pd
import numpy as np
import tensorflow as tf
import re
from nltk.corpus import stopwords
import time
from tensorflow.python.layers.core import Dense
from tensorflow.python.ops.rnn_cell_impl import _zero_state_tensors
print('TensorFlow Version: {}'.format(tf.__version__))

### 1. Inspecting the Data

In [None]:
path = '/home/limhyesu/Summarization_study_data/Reviews.csv'
reviews = pd.read_csv(path)

`pd.read_csv` returns a DataFrame, <br/>
so `reviews` is a DataFrame.

In [None]:
reviews.shape

In [None]:
# Check for any null values: count the nulls in each column.
reviews.isnull().sum()

In [None]:
# Remove null values and unneeded features
# drop null values
reviews = reviews.dropna()
# drop unneeded features. only Summary and Text remain.
reviews = reviews.drop(['Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator', 
                        'HelpfulnessDenominator', 'Score', 'Time'], axis=1)
# drop parameter avoids the old index being added as column
reviews = reviews.reset_index(drop=True)

### 2. Preparing the Data
- convert to lowercase
- replace contractions with their longer forms (contractions are shortened forms such as "don't")
- remove any unwanted characters (done after replacing contractions; note the backslash before the hyphen in the regex)
- remove stopwords from description
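
The four steps above can be sketched on a single sentence. This is a minimal illustration, not the notebook's `clean_text`: `sample_contractions`, `sample_stopwords`, and `clean_sample` are hypothetical names, and the two small tables stand in for the full contraction dictionary and NLTK's stopword list.

```python
import re

# Hypothetical two-entry subset of the notebook's contraction table.
sample_contractions = {"don't": "do not", "it's": "it is"}
# Hypothetical stand-in for NLTK's English stopword list.
sample_stopwords = {"i", "it", "is", "do", "this", "the"}

def clean_sample(text, remove_stopwords=True):
    # 1. Lowercase.
    text = text.lower()
    # 2. Expand contractions word by word (before punctuation is stripped,
    #    otherwise the apostrophes needed for the lookup would be gone).
    words = [sample_contractions.get(w, w) for w in text.split()]
    text = " ".join(words)
    # 3. Replace unwanted characters with spaces (note the escaped hyphen).
    text = re.sub(r'[_"\-;%()|+&=*.,!?:#$@\[\]/\']', ' ', text)
    # 4. Optionally drop stopwords.
    words = text.split()
    if remove_stopwords:
        words = [w for w in words if w not in sample_stopwords]
    return " ".join(words)

cleaned = clean_sample("I don't like this tea - it's bitter!")
print(cleaned)
```

Calling `clean_sample(..., remove_stopwords=False)` keeps the stopwords, which is the setting the notebook uses for summaries.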

In [None]:
# A list of contractions from http://stackoverflow.com/questions/19790188/expanding-english-language-contractions-in-python
# Someone built this contraction dictionary from Wikipedia
contractions = { 
"ain't": "am not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he's": "he is",
"how'd": "how did",
"how'll": "how will",
"how's": "how is",
"i'd": "i would",
"i'll": "i will",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'll": "it will",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"must've": "must have",
"mustn't": "must not",
"needn't": "need not",
"oughtn't": "ought not",
"shan't": "shall not",
"sha'n't": "shall not",
"she'd": "she would",
"she'll": "she will",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"that'd": "that would",
"that's": "that is",
"there'd": "there had",
"there's": "there is",
"they'd": "they would",
"they'll": "they will",
"they're": "they are",
"they've": "they have",
"wasn't": "was not",
"we'd": "we would",
"we'll": "we will",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what will",
"what're": "what are",
"what's": "what is",
"what've": "what have",
"where'd": "where did",
"where's": "where is",
"who'll": "who will",
"who's": "who is",
"won't": "will not",
"wouldn't": "would not",
"you'd": "you would",
"you'll": "you will",
"you're": "you are"
}

In [None]:
def clean_text(text, remove_stopwords = True):
    '''Remove unwanted characters and stopwords, and format the text so that fewer words end up without a word embedding'''
    
    # Convert words to lower case
    text = text.lower()
    
    # Replace contractions with their longer forms
    new_text = []
    for word in text.split():
        if word in contractions:
            new_text.append(contractions[word]) # append the longer form to new_text
        else:
            new_text.append(word)
    # join() takes all items in an iterable and joins them into one string.
    text = " ".join(new_text)
    
    # Format words and remove unwanted characters
    # 's?' matches zero or one 's', so the pattern covers both http and https.
    # MULTILINE: '^' matches at the start of each line and '$' at the end of each line.
    text = re.sub(r'https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)
    text = re.sub(r'\<a href', ' ', text)
    text = re.sub(r'\&amp;', ' ', text)
    # Strip '<br />' before the punctuation pass: that pass removes '/',
    # after which this pattern could no longer match.
    text = re.sub(r'<br />', ' ', text)
    text = re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', ' ', text)
    text = re.sub(r'\'', ' ', text)
    
    # Optionally, remove stop words
    if remove_stopwords:
        text = text.split()
        stops = set(stopwords.words("english"))
        text = [w for w in text if w not in stops]
        text = " ".join(text)

    return text

### START Data Split
Split data into two parts : Train, Test<br/>
**Train** - text_train, summary_train<br/>
**Test** - text_test, summary_test
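
As a rough picture of what the split does, the sketch below shuffles indices once and slices off a test portion while keeping each text paired with its summary. It is an illustrative stand-in, not how scikit-learn implements `train_test_split`; `split_pairs` and its `seed` parameter are hypothetical.

```python
import random

def split_pairs(texts, summaries, test_size=0.05, seed=0):
    """Shuffle indices once, then slice off the test portion."""
    indices = list(range(len(texts)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(round(len(indices) * test_size)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    # Same return order as the notebook: text_train, text_test, summary_train, summary_test.
    return (pick(texts, train_idx), pick(texts, test_idx),
            pick(summaries, train_idx), pick(summaries, test_idx))

# Hypothetical toy data: 20 aligned (text, summary) pairs.
toy_texts = ["text {}".format(i) for i in range(20)]
toy_summaries = ["summary {}".format(i) for i in range(20)]
t_train, t_test, s_train, s_test = split_pairs(toy_texts, toy_summaries)
print(len(t_train), len(t_test))
```

Because the same index list selects from both sequences, a text and its summary always land in the same partition.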

In [None]:
type(reviews)

In [None]:
from sklearn.model_selection import train_test_split
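# Summary is the target; what remains (the Text column) is the input.
target_attribute = reviews['Summary']
tmp = reviews.drop(columns=['Summary'])
# Hold out 5% of the reviews for testing.
text_train, text_test, summary_train, summary_test = train_test_split(tmp, target_attribute, test_size=0.05)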

In [None]:
# convert to DataFrame
summary_test = pd.DataFrame(summary_test)

In [None]:
summary_train = pd.DataFrame(summary_train)

### END Data Split

In [None]:
# Clean the summaries and texts
# stopwords will only be removed from the description to make training faster
# but they will remain in the summaries to make them sound more like natural phrases.
clean_summaries_train = []
for summary in summary_train.Summary:
    clean_summaries_train.append(clean_text(summary, remove_stopwords=False))
print("Summaries for train are complete.")

clean_texts_train = []
for text in text_train.Text:
    clean_texts_train.append(clean_text(text, remove_stopwords=True))
print("Texts for train are complete.")

clean_summaries_test = []
for summary in summary_test.Summary:
    clean_summaries_test.append(clean_text(summary, remove_stopwords=False))
print("Summaries for test are complete.")

clean_texts_test = []
for text in text_test.Text:
    clean_texts_test.append(clean_text(text, remove_stopwords=True))
print("Texts for test are complete.")

In [None]:
# Inspect the cleaned summaries and texts to ensure they have been cleaned properly.
for i in range(5):
    print("Clean Review #", i+1)
    print(clean_summaries_train[i])
    print(clean_texts_train[i])
    print()

# Done up to this point.

In [None]:
def count_words(count_dict, text):
    '''Count the number of occurrences of each word in a set of text'''
    # Build a word histogram as a dictionary to count the words
    
    for sentence in text: # text is a list whose elements are sentences
        for word in sentence.split():
            if word not in count_dict:
                count_dict[word] = 1
            else:
                count_dict[word] += 1

In [None]:
# Find the number of times each word was used and the size of the vocabulary, across both summaries and texts.
# len(word_counts) is the number of distinct words appearing in Summary and Text.
word_counts = {}

count_words(word_counts, clean_summaries_train)
count_words(word_counts, clean_texts_train)

print("Size of Vocabulary:", len(word_counts))

Load ConceptNet Numberbatch (CN) embeddings, which are similar to GloVe but probably better.
<br />
- Use pre-trained word vectors to help improve the performance of our model.
- **ConceptNet Numberbatch (CN)** : the word embeddings that we are using.
- **ConceptNet** : a semantic network built to help computers understand the meaning of natural language. https://github.com/commonsense/conceptnet-numberbatch
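
The loading cell below assumes a plain-text layout with one word per line followed by its vector components, separated by spaces. A minimal sketch of that parsing on made-up data follows; `sample_file` and its three-dimensional vectors are hypothetical, and plain Python lists are used in place of NumPy arrays.

```python
import io

# Hypothetical three-dimensional embeddings in the "word v1 v2 ..." text layout.
sample_file = io.StringIO(
    "coffee 0.1 -0.2 0.3\n"
    "tea 0.0 0.5 -0.5\n"
)

embeddings = {}
for line in sample_file:
    values = line.rstrip().split(' ')
    word = values[0]                              # first field is the word
    vector = [float(v) for v in values[1:]]       # the rest are its components
    embeddings[word] = vector

print(len(embeddings), embeddings["tea"])
```

Some word-vector text files begin with a header line giving the number of rows and dimensions. Whether this particular file does is worth checking, since such a line would otherwise be stored as if it were a word.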

In [None]:
# create an empty dictionary
embeddings_index = {}

# Open the file as f. Split each line of f on spaces into values; the word is values[0].
# Convert values[1:] to a float32 numpy array and store it as embedding.
# The word is the key and the embedding is the value.

with open('/home/limhyesu/Summarization_study_data/numberbatch-en-17.04b.txt', encoding = 'utf-8') as f:
    for line in f:
        values = line.split(' ')
        word = values[0]
        embedding = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = embedding
        
print('Word embeddings:', len(embeddings_index))

In [None]:
# Find the number of words that are missing from CN and are used more than our threshold (20 times).
# Each such word increments missing_words.
missing_words = 0
threshold = 20

# dict.items() returns a view of the dictionary's (key, value) pairs.
for word, count in word_counts.items():
    if count > threshold:
        if word not in embeddings_index:
            missing_words += 1

# Scale to a percentage before rounding to avoid floating-point noise in the printed value.
missing_ratio = round(100 * missing_words/len(word_counts), 2)

print("Number of words missing from CN:", missing_words)
print("Percent of words that are missing from vocabulary: {}%".format(missing_ratio))

For a word that is not in CN to be added to word_embedding_matrix, it must appear at least 20 times. A word has to appear often enough for the model to learn its meaning.
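
The rule can be tried out on a toy vocabulary. In this sketch `sample_counts`, `sample_cn_words`, and `sample_threshold` are hypothetical stand-ins for `word_counts`, `embeddings_index`, and `threshold`.

```python
# Hypothetical word counts and a hypothetical set of words covered by CN.
sample_counts = {"coffee": 120, "tasssty": 3, "keurig": 45, "the": 900}
sample_cn_words = {"coffee", "the"}
sample_threshold = 20

sample_vocab_to_int = {}
for word, count in sample_counts.items():
    # Keep a word if it is frequent enough OR already has a pre-trained vector.
    if count >= sample_threshold or word in sample_cn_words:
        sample_vocab_to_int[word] = len(sample_vocab_to_int)

print(sample_vocab_to_int)
```

Here "keurig" is kept because it is frequent even though it has no pre-trained vector, while the rare misspelling "tasssty" is dropped and would later be mapped to `<UNK>`.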

In [None]:
# Limit the vocab that we will use to words that appear >= threshold or are in CN

# dictionary to convert words to integers
vocab_to_int = {}

value = 0
for word, count in word_counts.items():
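    if count >= threshold or word in embeddings_index: # appears at least 20 times, or is in CN
        vocab_to_int[word] = value # assign an integer to each word
        value += 1
        
# Among the words that appear in the summaries or texts, the keys of vocab_to_int
# are those that are in CN, or are not in CN but appear at least 20 times.
# Each key is assigned an integer in order, starting from 0.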

In [None]:
print("Number of words we will use:", len(vocab_to_int))
print("Total number of unique words:", len(word_counts))

In [None]:
# Special tokens that will be added to our vocab
codes = ["<UNK>", "<PAD>", "<EOS>", "<GO>"]

# Add codes to vocab
for code in codes:
    vocab_to_int[code] = len(vocab_to_int) # append the codes at the end of vocab_to_int

# Dictionary to convert integers to words
int_to_vocab = {}
for word, value in vocab_to_int.items():
    int_to_vocab[value] = word

In [None]:
usage_ratio = round(100 * len(vocab_to_int) / len(word_counts), 2)

print("Total number of unique words:", len(word_counts))
print("Number of words we will use:", len(vocab_to_int))
print("Percent of words we will use: {}%".format(usage_ratio))

- Limit the vocabulary to words that are either **in CN** or **occur at least 20 times** in our dataset.
- The more often the model sees a word, the easier it is to learn how words relate to one another, so limiting the vocabulary this way helps produce good word embeddings.
- When creating word_embedding_matrix, it is very important to set the dtype of np.zeros to float32. The default is float64, which does not work with TensorFlow here, so it has to be lowered to 32.

In [None]:
# Need to use 300 for embedding dimensions to match CN's vectors.
embedding_dim = 300
nb_words = len(vocab_to_int)

# np.zeros(shape, dtype, order): return a new array of the given shape and type, filled with zeros.
# (nb_words, embedding_dim): shape of the new array.
word_embedding_matrix = np.zeros((nb_words, embedding_dim), dtype = np.float32)

# vocab_to_int: the words we will use; embeddings_index: the words in CN
for word, i in vocab_to_int.items():
    if word in embeddings_index: 
        # If the word is in CN, copy its embedding as-is
        word_embedding_matrix[i] = embeddings_index[word]
    else:
        # If the word is not in CN, create a random embedding for it
        new_embedding = np.array(np.random.uniform(-1.0, 1.0, embedding_dim))
        embeddings_index[word] = new_embedding # add the word and its embedding to embeddings_index
        word_embedding_matrix[i] = new_embedding
        
# Check if value matches len(vocab_to_int)
print(len(word_embedding_matrix))

In [None]:
def convert_to_ints(text, word_count, unk_count, eos=False):
    '''Convert words in text to an integer.
        If word is not in vocab_to_int, use UNK's integer.
        Total the number of words and UNKs.
        Add EOS token to the end of texts'''
    ints = []
    for sentence in text:
        sentences_ints = []
        for word in sentence.split():
            word_count += 1
            if word in vocab_to_int:
                sentences_ints.append(vocab_to_int[word])
            else:
                sentences_ints.append(vocab_to_int["<UNK>"])
                unk_count += 1
        if eos:
            sentences_ints.append(vocab_to_int["<EOS>"])
        ints.append(sentences_ints)
    return ints, word_count, unk_count

In [None]:
# Apply convert_to_ints to clean_summaries and clean_texts
word_count = 0
unk_count = 0

# Convert summaries to ints; convert texts to ints and append <EOS>.
int_summaries_train, word_count, unk_count = convert_to_ints(clean_summaries_train, word_count, unk_count)
int_texts_train, word_count, unk_count = convert_to_ints(clean_texts_train, word_count, unk_count, eos = True)

# Scale to a percentage before rounding to avoid floating-point noise in the printed value.
unk_percent = round(100 * unk_count/word_count, 2)

print("Total number of words in summaries and texts:", word_count)
print("Total number of UNKs in summaries and texts:", unk_count)
print("Percent of words that are UNK: {}%".format(unk_percent))

In [None]:
def create_lengths(text):
    '''Create a data frame of the sentence lengths from a text'''
    # Write the length of each sentence into the 'counts' column.
    lengths = []
    for sentence in text:
        lengths.append(len(sentence))
    return pd.DataFrame(lengths, columns=['counts'])

In [None]:
lengths_summaries_train = create_lengths(int_summaries_train)
lengths_texts_train = create_lengths(int_texts_train)

print("Summaries:")
print(lengths_summaries_train.describe())
print()
print("Texts:")
print(lengths_texts_train.describe())

In [None]:
# Inspect the length of texts
print(np.percentile(lengths_texts_train.counts, 90)) # compute the 90th percentile of the lengths_texts elements
print(np.percentile(lengths_texts_train.counts, 95))
print(np.percentile(lengths_texts_train.counts, 99))

In [None]:
# Inspect the length of summaries
print(np.percentile(lengths_summaries_train.counts, 90))
print(np.percentile(lengths_summaries_train.counts, 95))
print(np.percentile(lengths_summaries_train.counts, 99))

In [None]:
def unk_counter(sentence):
    '''Count the number of times UNK appears in a sentence'''
    unk_count = 0
    for word in sentence:
        if word == vocab_to_int["<UNK>"]:
            unk_count += 1
    return unk_count

- To help train the model faster, **sort** the reviews by the **length of the descriptions**, from shortest to longest.
- This gives each batch descriptions of **similar lengths**, which results in **less padding** and thus **less computing**.
- Some reviews will not be included because of the number of UNK tokens in the description or summary. If there is more than 1 UNK in the description or any UNKs in the summary, the review will not be used, because we want to build the model from meaningful data.
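
The effect of the filtering and sorting can be checked on a handful of made-up reviews. In this sketch the `UNK` id, the `pairs` data, and the `padding_needed` helper are all hypothetical; the helper simply counts how many pad tokens are needed when every batch is padded to its longest text.

```python
UNK = 0  # hypothetical id for the <UNK> token

# Hypothetical (summary, text) pairs already converted to integer ids.
pairs = [
    ([5, 6], [1, 2, 3, 4, 7, 8]),
    ([9, 3], [4, 2]),
    ([UNK, 3], [4, 2, 6]),          # dropped: UNK in the summary
    ([7, 7], [UNK, UNK, 5, 1]),     # dropped: two UNKs in the text
    ([8, 1], [3, 3, 3, 3, 3]),
    ([2, 2], [6, 1, 9]),
]

def padding_needed(texts, batch_size):
    """Total pad tokens when each batch is padded to its longest text."""
    total = 0
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        longest = max(len(t) for t in batch)
        total += sum(longest - len(t) for t in batch)
    return total

# Keep reviews with no UNK in the summary and at most one UNK in the text.
kept = [(s, t) for s, t in pairs if s.count(UNK) == 0 and t.count(UNK) <= 1]
# Sort the surviving pairs by text length, shortest first.
kept_sorted = sorted(kept, key=lambda pair: len(pair[1]))

unsorted_padding = padding_needed([t for _, t in kept], batch_size=2)
sorted_padding = padding_needed([t for _, t in kept_sorted], batch_size=2)
print(unsorted_padding, sorted_padding)
```

With these toy inputs the sorted order needs fewer pad tokens than the original order for the same batch size, which is the saving the bullets above describe.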

In [None]:
# Sort the summaries and texts by the length of the texts, shortest to longest.
# Sorting gives each batch descriptions of similar lengths, which results in less padding and thus less computing.
# Limit the length of summaries and texts based on the min and max ranges.
# Remove reviews that include too many UNKs.

sorted_summaries = []
sorted_texts = []
max_text_length = 84 # 90th percentile
max_summary_length = 13 # 99th percentile
min_length = 2
unk_text_limit = 1
unk_summary_limit = 0
# Allow at most 1 UNK in a text; a summary must contain no UNKs.

for length in range(min(lengths_texts_train.counts), max_text_length):
    for count, words in enumerate(int_summaries_train): # enumerate also yields the index of each item
        if(len(int_summaries_train[count]) >= min_length and
           len(int_summaries_train[count]) <= max_summary_length and
           len(int_texts_train[count]) >= min_length and
           unk_counter(int_summaries_train[count]) <= unk_summary_limit and
           unk_counter(int_texts_train[count]) <= unk_text_limit and
           length == len(int_texts_train[count]) # the text's length equals the current length in the min..max range
           ):
            sorted_summaries.append(int_summaries_train[count])
            sorted_texts.append(int_texts_train[count])

# Compare lengths to ensure they match
print(len(sorted_summaries))
print(len(sorted_texts))

### 3. Building the Model

In [None]:
def model_inputs():
    '''Create placeholders for input to the model'''
    
    input_data = tf.placeholder(tf.int32, [None, None], name='input')
    targets = tf.placeholder(tf.int32, [None, None], name='targets')
    lr = tf.placeholder(tf.float32, name='learning_rate')
    keep_prob = tf.placeholder(tf.float32, name='keep_prob')
    
    # summary_length, text_length are the lengths of each sentence within a batch
    # max_summary_length is maximum length of a summary within a batch
    summary_length = tf.placeholder(tf.int32, (None,), name='summary_length')
    
    # Computes the maximum of elements across dimensions of a tensor.
    max_summary_length = tf.reduce_max(summary_length, name='max_dec_len')
    text_length = tf.placeholder(tf.int32, (None,), name='text_length')
    
    return input_data, targets, lr, keep_prob, summary_length, max_summary_length, text_length

In [None]:
def process_encoding_input(target_data, vocab_to_int, batch_size):
    '''Remove the last word id from each row of the batch and concat <GO> to the beginning of each row'''
    
    # ending = target_data with the last word id of each row sliced off.
    ending = tf.strided_slice(target_data, [0, 0], [batch_size, -1], [1, 1])
    # dec_input = a [batch_size, 1] tensor filled with the <GO> id, concatenated in front of ending.
    dec_input = tf.concat([tf.fill([batch_size, 1], vocab_to_int['<GO>']), ending], 1)
    
    return dec_input

In [None]:
def encoding_layer(rnn_size, sequence_length, num_layers, rnn_inputs, keep_prob):
    '''Create the encoding layer'''
    
    for layer in range(num_layers):
        with tf.variable_scope('encoder_{}'.format(layer)):
            # Build the cells
            cell_fw = tf.contrib.rnn.LSTMCell(rnn_size,
                                              initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2))
            cell_fw = tf.contrib.rnn.DropoutWrapper(cell_fw,
                                                   input_keep_prob = keep_prob)
            
            cell_bw = tf.contrib.rnn.LSTMCell(rnn_size,
                                             initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2))
            cell_bw = tf.contrib.rnn.DropoutWrapper(cell_bw,
                                                    input_keep_prob = keep_prob)
            ### MY CODE START
            
            
            # Run the cells
            enc_output, enc_state = tf.nn.bidirectional_dynamic_rnn(cell_fw,
                                                                    cell_bw, rnn_inputs, sequence_length,
                                                                    dtype = tf.float32)
            
            '''
            ((enc_output_fw, enc_output_bw),
            (enc_state_fw, enc_state_bw)) = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw, 
                                                                            rnn_inputs, sequence_length, dtype=tf.float32)
            '''
            # Join outputs since we are using a bidirectional RNN
            enc_output = tf.concat(enc_output, 2)
            '''
            enc_output = tf.concat((enc_output_fw, enc_output_bw), 2)
            
            enc_state = []
            for i in range(num_layers):
                if isinstance(enc_state_fw[i], tf.contrib.rnn.LSTMStateTuple):
                    enc_state_c = tf.concat(values=(enc_state_fw[i].c, enc_state_bw[i].c), 
                                            axis=1, name="enc_state_fw_c")
                    enc_state_h = tf.concat(values=(enc_state_fw[i].h, enc_state_bw[i].h), 
                                            axis=1, name="enc_state_fw_h")
                    enc_state = tf.contrib.rnn.LSTMStateTuple(c=encoder_state_c, h=enc_state_h)
                elif isinstance(enc_state_fw[i], tf.Tensor):
                    enc_state = tf.concat(values=(enc_state_fw[i], enc_state_bw[i]), 
                                          axis=1, name='bidirectional_concat')
            
            enc_state = tuple(enc_state)
            '''
            
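            # NOTE: this return sits inside the for loop, so the function exits after building the first layer (encoder_0).
            return enc_output, enc_state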

In [None]:
def training_decoding_layer(dec_embed_input, summary_length, dec_cell, initial_state, output_layer,
                           vocab_size, max_summary_length):
    '''Create the training logits'''
    
    training_helper = tf.contrib.seq2seq.TrainingHelper(inputs=dec_embed_input,
                                                       sequence_length=summary_length,
                                                       time_major=False)
    training_decoder = tf.contrib.seq2seq.BasicDecoder(dec_cell,
                                                      training_helper,
                                                      initial_state,
                                                      output_layer)
    training_logits, *_ = tf.contrib.seq2seq.dynamic_decode(training_decoder,
                                                          output_time_major=False,
                                                          impute_finished=True,
                                                          maximum_iterations=max_summary_length)
        
    return training_logits

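**TrainingHelper** reads the embedded target sequence (`dec_embed_input`) and feeds the ground-truth token to the decoder at each step.
<br />
**BasicDecoder** processes the sequence with the decoding cell and an output layer, which is a fully connected layer. *initial_state* comes from the zero state of the *AttentionWrapper* cell created in `decoding_layer` further down (the original tutorial used *DynamicAttentionWrapperState*, which is not available in this TensorFlow version).
<br />
**dynamic_decode** creates our outputs that will be used for training.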
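
During training the decoder input at each step is the previous ground-truth token, which is why `process_encoding_input` drops the last id of every target row and puts `<GO>` in front. A plain-Python sketch of that shift follows; the token ids and `target_batch` are hypothetical, and lists stand in for tensors.

```python
GO, PAD = 1, 0  # hypothetical token ids

# Hypothetical batch of two padded target summaries.
target_batch = [
    [11, 12, 13, 14],
    [21, 22, 23, PAD],
]

# Drop the last id of every row and put <GO> in front,
# mirroring the strided_slice + concat in process_encoding_input.
decoder_inputs = [[GO] + row[:-1] for row in target_batch]

for inputs, targets in zip(decoder_inputs, target_batch):
    # At step t the decoder is fed inputs[t] and trained to predict targets[t].
    print(inputs, "->", targets)
```

Each row keeps its length, so the decoder inputs and the targets stay aligned step by step.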

In [None]:
def inference_decoding_layer(embeddings, start_token, end_token, dec_cell, initial_state, output_layer,
                             max_summary_length, batch_size):
    '''Create the inference logits'''
    
    ### MY CODE START
    start_tokens = tf.tile(tf.constant([start_token], dtype = tf.int32), [batch_size], name='start_tokens')
    #start_toekns = tf.contrib.seq2seq.tile_batch(tf.constant([start_token], dtype = tf.int32), [batch_size], name='start_tokens')
    
    inference_helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(embeddings,
                                                               start_tokens, end_token)
    inference_decoder = tf.contrib.seq2seq.BasicDecoder(dec_cell, inference_helper, initial_state,
                                                       output_layer)
    
    inference_logits, *_ = tf.contrib.seq2seq.dynamic_decode(inference_decoder,
                                                           output_time_major=False,
                                                           impute_finished=True,
                                                           maximum_iterations=max_summary_length)
    
    return inference_logits

**inference_decoding_layer** is very similar to the training layer. The main difference is **GreedyEmbeddingHelper**, which takes the argmax of the output (treated as logits) and passes the result through an embedding layer to get the next input. Although it asks for **start_tokens**, we only have one, `<GO>`, which is tiled across the batch.
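
The greedy feedback loop can be sketched without TensorFlow. This is an illustration of the idea only, not the library's implementation: `greedy_decode` and `toy_scores` are hypothetical, and the toy "decoder" is a fixed lookup table that ignores the hidden state a real LSTM cell would carry.

```python
def greedy_decode(step_fn, start_token, end_token, max_steps):
    """Feed the argmax of each step back in as the next input."""
    token, output = start_token, []
    for _ in range(max_steps):
        scores = step_fn(token)  # one score per vocabulary id
        token = max(range(len(scores)), key=scores.__getitem__)  # argmax
        if token == end_token:
            break
        output.append(token)
    return output

# Hypothetical stand-in for the decoder cell + output layer: a fixed table
# mapping the previous token id to scores over a 4-word vocabulary.
toy_scores = {
    0: [0.0, 0.1, 0.7, 0.2],    # after <GO> (id 0) prefer id 2
    2: [0.0, 0.1, 0.1, 0.8],    # after id 2 prefer id 3
    3: [0.0, 0.9, 0.05, 0.05],  # after id 3 prefer <EOS> (id 1)
}

summary_ids = greedy_decode(toy_scores.__getitem__, start_token=0, end_token=1, max_steps=10)
print(summary_ids)
```

The `max_steps` argument plays the role of `maximum_iterations`: decoding stops either at the end token or when the step limit is reached.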

In [None]:
def decoding_layer(dec_embed_input, embeddings, enc_output, enc_state, vocab_size, text_length, summary_length,
                  max_summary_length, rnn_size, vocab_to_int, keep_prob, batch_size, num_layers):
    '''Create the decoding cell and attention for the training and inference decoding layers'''
    
    for layer in range(num_layers):
        with tf.variable_scope('decoder_{}'.format(layer)):
            lstm = tf.contrib.rnn.LSTMCell(rnn_size,
                                          initializer = tf.random_uniform_initializer(-0.1, 0.1, seed=2))
            dec_cell = tf.contrib.rnn.DropoutWrapper(lstm, input_keep_prob = keep_prob)
            
    output_layer = Dense(vocab_size, 
                         kernel_initializer = tf.truncated_normal_initializer(mean = 0.0, stddev = 0.1))
    
    attn_mech = tf.contrib.seq2seq.BahdanauAttention(rnn_size, enc_output, text_length, normalize = False, name='BahdanauAttention')
    
    ### MY CODE START
    '''No DynamicAttentionWrapper in new version so I changed the code'''
    
    dec_cell = tf.contrib.seq2seq.AttentionWrapper(dec_cell, attn_mech, rnn_size)
    initial_state = dec_cell.zero_state(dtype=tf.float32, batch_size=batch_size)
    
    with tf.variable_scope("decode"):
        training_logits = training_decoding_layer(dec_embed_input, summary_length, dec_cell, initial_state,
                                                output_layer, vocab_size, max_summary_length)
        
    with tf.variable_scope("decode", reuse=True):
        inference_logits = inference_decoding_layer(embeddings, vocab_to_int['<GO>'],
                                                 vocab_to_int['<EOS>'],
                                                 dec_cell, initial_state, output_layer, max_summary_length,
                                                 batch_size)
        
    return training_logits, inference_logits

In [None]:
def seq2seq_model(input_data, target_data, keep_prob, text_length, summary_length, max_summary_length,
                 vocab_size, rnn_size, num_layers, vocab_to_int, batch_size):
    '''Use the previous functions to create the training and inference logits'''
    
    # Use Numberbatch's embeddings and the newly created ones as our embeddings
    embeddings = word_embedding_matrix
    
    enc_embed_input = tf.nn.embedding_lookup(embeddings, input_data)
    enc_output, enc_state = encoding_layer(rnn_size, text_length, num_layers, enc_embed_input, keep_prob)
    
    dec_input = process_encoding_input(target_data, vocab_to_int, batch_size)
    dec_embed_input = tf.nn.embedding_lookup(embeddings, dec_input)
    
    training_logits, inference_logits = decoding_layer(dec_embed_input, embeddings, enc_output, 
                                                      enc_state, vocab_size, text_length, summary_length,
                                                      max_summary_length, rnn_size, vocab_to_int, keep_prob, batch_size,
                                                      num_layers)
    
    return training_logits, inference_logits

In [None]:
def pad_sentence_batch(sentence_batch):
    '''Pad sentences with <PAD> so that each sentence of batch has the same length'''
    max_sentence = max([len(sentence) for sentence in sentence_batch])
    return [sentence + [vocab_to_int['<PAD>']] * (max_sentence - len(sentence)) for sentence in sentence_batch]

In [None]:
def get_batches(summaries, texts, batch_size):
    '''Batch summaries, texts, and the lengths of their sentences together'''
    for batch_i in range(0, len(texts)//batch_size):
        start_i = batch_i * batch_size
        summaries_batch = summaries[start_i:start_i+batch_size]
        texts_batch = texts[start_i:start_i+batch_size]
        pad_summaries_batch = np.array(pad_sentence_batch(summaries_batch))
        pad_texts_batch = np.array(pad_sentence_batch(texts_batch))
        
        # Need the lengths for the _lengths parameters
        pad_summaries_lengths = []
        for summary in pad_summaries_batch:
            pad_summaries_lengths.append(len(summary))
            
        pad_texts_lengths = []
        for text in pad_texts_batch:
            pad_texts_lengths.append(len(text))
            
        yield pad_summaries_batch, pad_texts_batch, pad_summaries_lengths, pad_texts_lengths

In [None]:
# Set the Hyperparameters
epochs = 100
batch_size = 64
rnn_size = 256
num_layers = 2
learning_rate = 0.005
keep_probability = 0.75

In [None]:
# Build the graph
train_graph = tf.Graph()
# Set the graph to default to ensure that it is ready for training
with train_graph.as_default():
    
    # Load the model inputs
    input_data, targets, lr, keep_prob, summary_length, max_summary_length, text_length = model_inputs()
    
    # Create the training and inference logits
    training_logits, inference_logits = seq2seq_model(tf.reverse(input_data, [-1]),
                                                     targets, keep_prob, text_length, summary_length,
                                                     max_summary_length, len(vocab_to_int)+1, rnn_size, num_layers,
                                                     vocab_to_int, batch_size)
    
    # Create tensors for the training logits and inference logits
    training_logits = tf.identity(training_logits.rnn_output, 'logits')
    inference_logits = tf.identity(inference_logits.sample_id, name='predictions')
    
    # Create the weights for sequence_loss
    masks = tf.sequence_mask(summary_length, max_summary_length, dtype = tf.float32, name='masks')
    
    with tf.name_scope("optimization"):
        # Loss function
        cost = tf.contrib.seq2seq.sequence_loss(training_logits, targets, masks)
        
        # Optimizer (uses the lr placeholder so the rate fed at each step takes effect)
        optimizer = tf.train.AdamOptimizer(lr)
        
        # Gradient Clipping
        gradients = optimizer.compute_gradients(cost)
        capped_gradients = [(tf.clip_by_value(grad, -5., 5.), var) for grad, var in gradients if grad is not None]
        train_op = optimizer.apply_gradients(capped_gradients)
        
print("Graph is built")

### 4. Training the Model

Only a subset of the data is used, since training on the whole dataset would take too long.

In [None]:
# Subset the data for training
start = 200000
end = start + 50000
sorted_summaries_short = sorted_summaries[start:end]
sorted_texts_short = sorted_texts[start:end]
print("The shortest text length:", len(sorted_texts_short[0]))
print("The longest text length:", len(sorted_texts_short[-1]))

In [None]:
# Train the Model
learning_rate_decay = 0.95
min_learning_rate = 0.0005
display_step = 20 # Check training loss after every 20 batches
stop_early = 0
stop = 3 # If the update loss does not decrease in 3 consecutive update checks, stop training
per_epoch = 3 # Make 3 update checks per epoch
update_check = (len(sorted_texts_short)//batch_size//per_epoch)-1

update_loss = 0
batch_loss = 0
summary_update_loss = [] # Record the update losses for saving improvements in the model

checkpoint = "./best_model.ckpt"

In [None]:
with tf.Session(graph=train_graph) as sess:
    sess.run(tf.global_variables_initializer())
    
    # If we want to continue training a previous session
    # loader = tf.train.import_meta_graph("./"+checkpoint+'.meta')
    # loader.restore(sess, checkpoint)
    
    for epoch_i in range(1, epochs+1):
        update_loss = 0
        batch_loss = 0
        for batch_i, (summaries_batch, texts_batch, summaries_lengths, texts_lengths) in enumerate(
                get_batches(sorted_summaries_short, sorted_texts_short, batch_size)):
            start_time = time.time()
            _, loss = sess.run(
                [train_op, cost], 
                {input_data: texts_batch,
                 targets: summaries_batch,
                 lr: learning_rate,
                 summary_length: summaries_lengths,
                 text_length: texts_lengths,
                 keep_prob: keep_probability})
            
            batch_loss += loss
            update_loss += loss
            end_time = time.time()
            batch_time = end_time - start_time

            if (batch_i % display_step == 0) and (batch_i > 0):
                print('Epoch {:>3}/{} Batch {:>4}/{} - Loss: {:>6.3f}, Seconds: {:>4.2f}'
                     .format(epoch_i, epochs, batch_i, len(sorted_texts_short) // batch_size,
                            batch_loss / display_step, batch_time*display_step))
                batch_loss = 0
                
            if (batch_i % update_check == 0) and (batch_i > 0):
                print("Average loss for this update:", round(update_loss/update_check,3))
                summary_update_loss.append(update_loss)
                
                # If the update loss is at a new minimum, save the model
                if update_loss <= min(summary_update_loss):
                    print('New Record!')
                    stop_early = 0
                    saver = tf.train.Saver()
                    saver.save(sess, checkpoint)
                    
                else:
                    print("No Improvement.")
                    stop_early += 1
                    if stop_early == stop:
                        break
                update_loss = 0
                
            
        # Reduce learning rate once per epoch, but not below its minimum value
        learning_rate *= learning_rate_decay
        if learning_rate < min_learning_rate:
            learning_rate = min_learning_rate
        
        # The break above only exits the batch loop; exit the epoch loop here
        if stop_early == stop:
            print("Stopping Training.")
            break
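
The stopping rule and the learning-rate floor can be tried out without TensorFlow. The sketch below is a simplified stand-in, not the notebook's training loop: `decay_learning_rate` and `checks_until_stop` are hypothetical helpers, and the loss values and the starting learning rate of 0.005 are made up for illustration.

```python
def decay_learning_rate(lr, decay=0.95, floor=0.0005):
    """Multiply lr by decay, but never return a value below floor."""
    return max(lr * decay, floor)


def checks_until_stop(update_losses, patience=3):
    """Return how many update checks run before patience is exhausted.

    A check counts as an improvement when its loss is at or below the
    minimum of all losses seen so far, mirroring the `<=` test in the loop.
    """
    seen = []
    misses = 0
    for n, loss in enumerate(update_losses, start=1):
        seen.append(loss)
        if loss <= min(seen):
            misses = 0          # new record: reset the counter
        else:
            misses += 1         # no improvement at this check
            if misses == patience:
                return n        # training would stop at this check
    return len(update_losses)   # patience never ran out


# Made-up update losses: three improvements, then three checks without one
losses = [5.0, 4.2, 3.9, 4.0, 4.1, 3.95, 3.7]
print(checks_until_stop(losses))

# ASSUMPTION: 0.005 is a hypothetical starting learning rate
lr = 0.005
for _ in range(100):
    lr = decay_learning_rate(lr)
print(lr)
```

Note that a loss of 3.95 still counts as "no improvement" because it is compared against the best loss so far (3.9), not against the previous check.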

### Making Our Own Summaries

To see the quality of the summaries that this model can generate, you can either create your own review, or use a review from the dataset. You can set the length of the summary to a fixed value, or use a random value like I have here.

In [None]:
def text_to_seq(text):
    '''Prepare the text for the model'''
    text = clean_text(text)
    return [vocab_to_int.get(word, vocab_to_int['<UNK>']) for word in text.split()]
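
As a quick illustration of the `<UNK>` fallback, here is a minimal sketch with a toy vocabulary. `toy_vocab_to_int` and `toy_text_to_seq` are hypothetical names, and lowercasing stands in for the notebook's `clean_text`, which does more than that.

```python
# Toy vocabulary; the real vocab_to_int is built from the review corpus.
toy_vocab_to_int = {"<UNK>": 0, "<PAD>": 1, "great": 2, "coffee": 3, "taste": 4}


def toy_text_to_seq(text, vocab):
    """Map each whitespace-separated word to its id, using <UNK> for unknown words."""
    # ASSUMPTION: lowercasing stands in for the notebook's clean_text function.
    words = text.lower().split()
    return [vocab.get(word, vocab["<UNK>"]) for word in words]


print(toy_text_to_seq("Great coffee with bold taste", toy_vocab_to_int))
```

Words missing from the vocabulary ("with" and "bold" here) all map to the same `<UNK>` id, so the model cannot distinguish between them.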

In [None]:
vocab_to_int["<PAD>"]

In [None]:
# Making Our Own Summaries
input_sentence = clean_texts_test
text = []
texts_batch_words = []
answer_logits_words = []

for text_single in clean_texts_test:
    text.append(text_to_seq(text_single))

checkpoint = "./best_model.ckpt"

loaded_graph = tf.Graph()
with tf.Session(graph=loaded_graph) as sess:
    # Load saved model
    loader = tf.train.import_meta_graph(checkpoint + '.meta')
    loader.restore(sess, checkpoint)

    input_data = loaded_graph.get_tensor_by_name('input:0')
    logits = loaded_graph.get_tensor_by_name('predictions:0')
    text_length = loaded_graph.get_tensor_by_name('text_length:0')
    summary_length = loaded_graph.get_tensor_by_name('summary_length:0')
    keep_prob = loaded_graph.get_tensor_by_name('keep_prob:0')
    
    pad = vocab_to_int["<PAD>"]

    # Feed the test texts in batches of batch_size to match the model's input shape.
    for batch_i, (_, texts_batch, _, texts_length) in enumerate(
        get_batches(text, text, batch_size)):
        answer_logits = sess.run(logits, {input_data: texts_batch,
                                summary_length:[np.random.randint(5,8)],
                                text_length: texts_length,
                                keep_prob: 1.0})
        
        for j, text_i in enumerate(texts_batch):
            texts_batch_words.append(" ".join([int_to_vocab[i] for i in text_i if i != pad]))

        for j, answer_i in enumerate(answer_logits):
            answer_logits_words.append(" ".join([int_to_vocab[i] for i in answer_i if i != pad]))
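
The decoding step at the end of the loop can be seen in isolation with a toy mapping. This is only a sketch: `toy_int_to_vocab`, `pad_id`, and `ids_to_sentence` are hypothetical names, and the id batches are made up.

```python
# Toy id-to-word mapping; the real int_to_vocab comes from the corpus vocabulary.
toy_int_to_vocab = {0: "<UNK>", 1: "<PAD>", 2: "great", 3: "coffee", 4: "taste"}
pad_id = 1


def ids_to_sentence(ids, int_to_vocab, pad):
    """Join the words for a sequence of ids, skipping padding tokens."""
    return " ".join(int_to_vocab[i] for i in ids if i != pad)


# Two padded sequences, as they might come out of a batch
batch = [[2, 3, 1, 1], [3, 4, 0, 1]]
sentences = [ids_to_sentence(row, toy_int_to_vocab, pad_id) for row in batch]
print(sentences)
```

Only `<PAD>` is filtered out; an `<UNK>` token stays visible in the decoded sentence.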

In [None]:
original_text = np.asarray(texts_batch_words)
answer_summary = np.asarray(answer_logits_words)

original_text = pd.DataFrame(original_text)
answer_summary = pd.DataFrame(answer_summary)

original_text.columns=["text"]
answer_summary.columns=["system summary"]

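# In ROUGE terminology, the "model" summary is the human-written reference (the review title)
# and the "system" summary is the one generated by our network.
model_summary=pd.DataFrame({'model summary':clean_summaries_test})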

text_and_summary = pd.concat([original_text, answer_summary, model_summary], axis=1)

In [None]:
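# Drop rows with missing values (for example, texts without a generated summary) and keep the result
text_and_summary = text_and_summary.dropna(axis=0, how='any')
text_and_summary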

In [None]:
def list_to_file(file_name, list_name):
    with open(file_name, 'w') as f:
        for sentence in list_name:
            f.write("%s\n" % sentence)

In [None]:
list_to_file('model_sum.txt', text_and_summary['model summary'])

In [None]:
list_to_file('system_sum.txt', text_and_summary['system summary'])

In [None]:
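import os

# Write one file per summary, the layout pyrouge expects.
# 'model' files hold the reference titles, 'system' files the generated summaries.
for folder, prefix, column in [('./models', 'model_sum', 'model summary'),
                               ('./systems', 'system_sum', 'system summary')]:
    os.makedirs(folder, exist_ok=True)  # create the output directory if it is missing
    for n, sentence in enumerate(text_and_summary[column]):
        with open("{}/{}.{}.txt".format(folder, prefix, n), 'w') as f:
            f.write("%s\n" % sentence)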

In [None]:
from pyrouge import Rouge155

Rouge155.convert_summaries_to_rouge_format('./models', './models_out')
Rouge155.convert_summaries_to_rouge_format('./systems', './systems_out')

In [None]:
#from pyrouge import Rouge155
'''
Rouge155.write_config_static(
    './systems_out', 'system_sum.(\d+).txt',
    './models_out', 'model_sum.(\d+).txt',
    './config')
'''

In [None]:
from rouge import FilesRouge

files_rouge = FilesRouge('./system_sum.txt', './model_sum.txt')
scores = files_rouge.get_scores()

Finally, we have the ROUGE scores!
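
To give a feel for what these numbers measure, here is a minimal sketch of ROUGE-1 as plain unigram overlap. It is a simplified illustration, not the implementation used by the `rouge` package or by ROUGE-1.5.5, which apply their own preprocessing, so their values may differ. `rouge_1` is a hypothetical helper name.

```python
from collections import Counter


def rouge_1(hypothesis, reference):
    """Return unigram-overlap recall, precision and F1 for two strings."""
    hyp_counts = Counter(hypothesis.split())
    ref_counts = Counter(reference.split())
    # Clipped overlap: a word counts at most as often as it appears in both texts
    overlap = sum((hyp_counts & ref_counts).values())
    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(hyp_counts.values()), 1)
    if recall + precision == 0:
        f1 = 0.0
    else:
        f1 = 2 * recall * precision / (recall + precision)
    return {"r": recall, "p": precision, "f": f1}


# Generated summary vs. reference title (made-up example)
print(rouge_1("great coffee", "great tasting coffee"))
```

In this example both generated words appear in the reference, so precision is 1, while recall is 2/3 because "tasting" is missing from the generated summary.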

Let's add an id column to `scores`.

In [None]:
import pandas as pd

# Build a DataFrame from the per-sentence scores.
# The default integer index serves as the id of each summary pair.
rouge_score = pd.DataFrame(scores)


The `rouge_score` DataFrame needs to be converted to a CSV file and saved.

In [None]:
rouge_score.to_csv('rouge_score.csv', encoding='utf-8')
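
If one flat column per metric with an explicit id is preferred, the nested scores can be flattened before writing. The sketch below uses only the standard library and assumes each score is a dict of the form `{'rouge-1': {'f': ..., 'p': ..., 'r': ...}, ...}`; that layout, the helper name `flatten_scores`, and the example values are assumptions for illustration.

```python
import csv
import io


def flatten_scores(scores):
    """Turn nested ROUGE dicts into flat rows with an explicit id column."""
    rows = []
    for idx, score in enumerate(scores):
        row = {"id": idx}
        for metric, values in score.items():
            for key, value in values.items():
                row["{}_{}".format(metric, key)] = value
        rows.append(row)
    return rows


# Made-up scores in the assumed layout
example_scores = [
    {"rouge-1": {"f": 0.8, "p": 1.0, "r": 0.67}},
    {"rouge-1": {"f": 0.0, "p": 0.0, "r": 0.0}},
]
rows = flatten_scores(example_scores)

buffer = io.StringIO()   # written to memory here; a real file works the same way
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Each row then carries its own `id`, so the file stays usable even if the rows are later shuffled or filtered.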

In [None]:
# Iterating over a DataFrame yields only its column names, so write the raw score list instead
list_to_file('rouge_score.txt', scores)