# **Baseline and LSTM Model**
As described in the README.md file, all changes made to rouge for FinRouge live in the rouge package folder contained within this repo. These changes take effect automatically when rouge is imported, and a message printed at import confirms them.

**Note: In order to run this notebook, you must first follow the instructions located in the README.md file.**

In [None]:
#This cell contains packages that may need to be installed. If you do not have a package, you may 
#uncomment the appropriate line and run it.

#!pip install transformers
#!pip install bert-extractive-summarizer
#!pip install tensorflow
#!pip install sentencepiece
#!pip install rouge
#!pip install neuralcoref
#!pip install spacy
#!python -m spacy download en_core_web_md

In [None]:
import pandas as pd
import numpy as np
from transformers import BertTokenizer, TFBertModel
from rouge import Rouge




***Baseline with PyPI's BERT Extractive Summarizer***

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')


In [None]:
df = pd.read_csv("newsdf.csv")

In [None]:
# make sure that the text is strictly of type string and the language is english.
df = df[df['text'].apply(lambda x: isinstance(x, str))]
df = df.loc[df['language'] == "english"]

***Sample article in the corpus***

In [None]:
df['text'][0]

'March 27(Reuters) - AU Optronics Corp :\n* Says it plans to pay cash dividend of T$1.2/share for 2017\nSource text in Chinese: goo.gl/uxuxci\nFurther company coverage: (Beijing Headline News)\n '

In [None]:
from summarizer import Summarizer
model = Summarizer()

In [None]:
res = model(df['text'][0])
res

'March 27(Reuters) - AU Optronics Corp :\n* Says it plans to pay cash dividend of T$1.2/share for 2017\nSource text in Chinese: goo.gl/uxuxci\nFurther company coverage: (Beijing Headline News)'

Here, we use the elbow method to choose the number of sentences in each extractive summary (the hypothesis).
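To make the elbow idea concrete, here is a minimal sketch of one common heuristic: compute the clustering inertia for each candidate k and pick the k whose point lies farthest from the straight line joining the first and last points. The function name `elbow_k` and the inertia values are made up for illustration, and this is not necessarily how `calculate_optimal_k` is implemented inside the summarizer package.

```python
import numpy as np

def elbow_k(inertias):
    """Return the k (1-based) whose inertia is farthest from the chord
    joining the first and last points of the inertia curve."""
    y = np.asarray(inertias, dtype=float)
    x = np.arange(1, len(y) + 1, dtype=float)
    dx, dy = x[-1] - x[0], y[-1] - y[0]
    # Perpendicular distance of every point from the chord.
    dist = np.abs(dy * (x - x[0]) - dx * (y - y[0])) / np.hypot(dx, dy)
    return int(x[np.argmax(dist)])

# Hypothetical inertias for k = 1..6: the curve flattens after k = 3.
inertias = [100.0, 40.0, 15.0, 12.0, 10.0, 9.0]
best_k = elbow_k(inertias)
print(best_k)
```

With these illustrative values the largest bend in the curve is at k = 3, so a summary of three sentences would be chosen.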

In [None]:
summaries = []
for t in list(df['text']):
  num_sentences = model.calculate_optimal_k(t, k_max=10)
  result = model(body= t, num_sentences=num_sentences)
  summaries.append(''.join(result))

In [None]:
summaries[0]

'March 27(Reuters) - AU Optronics Corp :\n* Says it plans to pay cash dividend of T$1.2/share for 2017\nSource text in Chinese: goo.gl/uxuxci\nFurther company coverage: (Beijing Headline News)'

In [None]:
df['summaries'] = summaries  # assign by position; pd.Series(summaries) would align on df's filtered index

In [None]:
np.mean(np.array(df['text']) == np.array(df['summaries']))

0.1

About 10% of the summarized articles are identical to the original article text.

**Now that we have summarized the articles in our corpus, let us use the Rouge metric to score our initial attempt.** We use our aforementioned FinRouge package, which is a modification of PyRouge, a summarization evaluation package.

In [None]:
from rouge import Rouge
# rouge = Rouge()
# scores = rouge.get_scores(res, df['text'][0])
# scores

hyps, refs = list(df['summaries']), list(df['text'])
rouge = Rouge()
scores = rouge.get_scores(hyps, refs, avg = True)
scores

{'rouge-1': {'f': 0.5721986992959578,
  'p': 0.9991596638655462,
  'r': 0.46414224691138095},
 'rouge-2': {'f': 0.5657310743409856,
  'p': 0.9864468872588035,
  'r': 0.4592730245376804},
 'rouge-l': {'f': 0.6575778499591654,
  'p': 0.9989247311827956,
  'r': 0.5487586495283288}}

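We get a ROUGE-2 (bigram) recall of 0.459 out of a maximum of 1. **Next, we attempt to summarize a document by simply using its first 3 sentences.** Note that the code below keeps the first four segments returned by the split (`[:4]`), and that it splits after colons and semicolons as well as periods.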

In [None]:
import re
sentences = list(df['text'])
first_3 = []
for sentence in sentences:
  first_3.append(' '.join(re.split(r'(?<=[.:;])\s', sentence)[:4]))
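A small sketch of what the regular expression in the cell above does may help when reading the scores. It reuses the same pattern on a made-up string; the helper name `leading_segments` and the sample text are illustrative only.

```python
import re

def leading_segments(text, n=4):
    """Join the first n segments of text, splitting on whitespace that
    follows '.', ':' or ';' (the same pattern as the cell above)."""
    return ' '.join(re.split(r'(?<=[.:;])\s', text)[:n])

sample = "One. Two: three; four. Five. Six."
print(leading_segments(sample))       # keeps four segments
print(leading_segments(sample, n=3))  # keeps three segments
```

Because a colon or semicolon also ends a segment, a "segment" here can be shorter than a grammatical sentence.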

In [None]:
scores = rouge.get_scores(first_3, sentences, avg = True)
scores

{'rouge-1': {'f': 0.6251823344699587, 'p': 1.0, 'r': 0.5389781379708752},
 'rouge-2': {'f': 0.6240696186269217, 'p': 1.0, 'r': 0.5380210322606225},
 'rouge-l': {'f': 0.6772991857863555, 'p': 1.0, 'r': 0.5805159844148441}}

Using this method, we get a ROUGE-2 recall of 0.538, compared with 0.459 for the BERT extractive summarizer. Taking the first sentences therefore produces 'better' summaries according to FinRouge scoring.
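Both baselines are extractive and are scored against the full article text, which is why precision sits at or near 1 and recall is the number that moves. The sketch below shows the idea with a simplified ROUGE-N based on clipped n-gram counts over whitespace tokens. It is an illustration under those assumptions, not the FinRouge implementation, and the function names are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(hypothesis, reference, n=2):
    """Simplified ROUGE-N precision, recall and F1 from clipped n-gram overlap."""
    hyp = ngrams(hypothesis.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum((hyp & ref).values())
    p = overlap / max(sum(hyp.values()), 1)
    r = overlap / max(sum(ref.values()), 1)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"p": p, "r": r, "f": f}

# The hypothesis is a prefix of the reference, as with an extractive summary.
scores = rouge_n("a b c", "a b c d e", n=2)
print(scores)
```

Every bigram of the hypothesis appears in the reference, so precision is 1.0, while recall is 0.5 because only two of the four reference bigrams are covered.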

**This section takes the code above (the basic summarizer baseline, the first-3-sentences summary, and FinRouge scoring) and puts it into functions for easy future use:**

In [None]:
#Functions to score, make a model, and summarize

import re

import pandas as pd
from rouge import Rouge
from summarizer import Summarizer
from summarizer.coreference_handler import CoreferenceHandler  # optional sentence handler to pass as `handler`

def score_it(df):
  """Takes a df that already has two columns, one labeled text and one labeled summaries.
  Returns averaged rouge-1, rouge-2, and rouge-l scores."""
  hyps, refs = list(df['summaries']), list(df['text'])
  rouge = Rouge()
  scores = rouge.get_scores(hyps, refs, avg = True)
  return scores

def summarize_and_score(df, model = "bert-extractive-summarizer", num_articles = 10, handler = None, random_or_ordered = "ordered"):
  """Summarizes the first num_articles rows of df with model = "first_3" or
  "bert-extractive-summarizer" (optionally with a sentence handler) and
  returns rouge-1, rouge-2, and rouge-l scores."""
  if random_or_ordered == 'ordered':
    # copy so that adding the summaries column does not touch the caller's df
    df = df.iloc[:num_articles].copy()
  else:
    raise NotImplementedError(f"Sorry, {random_or_ordered} not yet implemented")

  if model == 'first_3':
    sentences = list(df['text'])
    first_3 = []
    for sentence in sentences:
      first_3.append(' '.join(re.split(r'(?<=[.:;])\s', sentence)[:4]))
    df['summaries'] = first_3
  elif model == 'bert-extractive-summarizer':
    if handler:
      summarizer_model = Summarizer(sentence_handler = handler)
    else:
      summarizer_model = Summarizer()
    summaries = []
    for t in list(df['text']):
      num_sentences = summarizer_model.calculate_optimal_k(t, k_max=10)
      result = summarizer_model(body= t, num_sentences=num_sentences)
      summaries.append(''.join(result))
    # assign the list by position rather than aligning a new Series on the index
    df['summaries'] = summaries
  else:
    raise ValueError(f"{model} is not a model that can be used here")
  return score_it(df)

## ***FINAL LSTM MODEL USED***

In [None]:
!pip install contractions
import contractions


In [None]:
df_copy = df[['title', 'text']]
df_copy.head(3)
# make sure that the text is strictly of type string and the language is english.
df_copy = df_copy[df_copy['text'].apply(lambda x: isinstance(x, str))]
df_copy = df_copy.loc[df['language'] == "english"]
df_copy = df_copy.sample(500).reset_index()

| | title | text |
|---|---|---|
| 0 | BRIEF-AU Optronics to pay cash dividend of T$1... | March 27(Reuters) - AU Optronics Corp :\n* Say... |
| 1 | British stats watchdog - stop using RPI inflat... | March 8, 2018 / 1:35 PM / Updated an hour ago ... |
| 2 | Dropbox shares surge in IPO | Dropbox shares surge in IPO Saturday, March 24... |


Preprocessing data

Remove extra whitespace and lowercase the sentences

In [None]:
texts = df_copy['text']
new_texts = []
for sentence in list(texts):
  edited = " ".join(sentence.split())
  edited = edited.lower()
  new_texts.append(edited)

In [None]:
titles = df_copy['title']
new_titles = []
for sentence in list(titles):
  edited = " ".join(sentence.split())
  edited = edited.lower()
  new_titles.append(edited)
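The two loops above apply the same normalization to the text and title columns. As a minimal sketch, the step could be factored into a single helper; the name `normalize_text` is hypothetical and not part of the notebook.

```python
def normalize_text(sentence):
    """Collapse runs of whitespace (including newlines) to single spaces
    and lowercase the result."""
    return " ".join(sentence.split()).lower()

examples = ["  March 27(Reuters) - AU  Optronics Corp :\n* Says  ", "Dropbox Shares\tSurge"]
normalized = [normalize_text(s) for s in examples]
print(normalized)
```

The same function could then be applied to both columns, for example with a list comprehension over `list(df_copy['text'])` and `list(df_copy['title'])`.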

In [None]:
df_copy['text'] = pd.Series(new_texts)
df_copy['title'] = pd.Series(new_titles)
df_copy.head(5)

| | title | text |
|---|---|---|
| 0 | brief-au optronics to pay cash dividend of t$1... | march 27(reuters) - au optronics corp : * says... |
| 1 | british stats watchdog - stop using rpi inflat... | march 8, 2018 / 1:35 pm / updated an hour ago ... |
| 2 | dropbox shares surge in ipo | dropbox shares surge in ipo saturday, march 24... |
| 3 | bookkeeper of auschwitz dies before starting s... | berlin (reuters) - the man known as the “boo... |
| 4 | us stocks set for a negative open as trade war... | dow closes 336 points higher as trade-war worr... |


Expand Contractions

In [None]:
texts = df_copy['text']
new_texts_col = []
for sentence in list(texts):
  # expand contractions word by word, then rejoin with single spaces
  new_texts_col.append(" ".join(contractions.fix(word) for word in sentence.split()))

In [None]:
texts = df_copy['title']
new_titles_col = []
for sentence in list(texts):
  # expand contractions word by word, then rejoin with single spaces
  new_titles_col.append(" ".join(contractions.fix(word) for word in sentence.split()))

In [None]:
df_copy['text'] = pd.Series(new_texts_col)
df_copy['title'] = pd.Series(new_titles_col)

In [None]:
df_copy.head()

| | title | text |
|---|---|---|
| 0 | brief-au optronics to pay cash dividend of t$1... | march 27(reuters) - au optronics corp : * says... |
| 1 | british stats watchdog - stop using rpi inflat... | march 8, 2018 / 1:35 pm / updated an hour ago ... |
| 2 | dropbox shares surge in ipo | dropbox shares surge in ipo saturday, march 24... |
| 3 | bookkeeper of auschwitz dies before starting s... | berlin (reuters) - the man known as the “boo... |
| 4 | us stocks set for a negative open as trade war... | dow closes 336 points higher as trade-war worr... |


In [1]:
#note - hyperparameter tuning was done on multiple machines

Set the Parameters

In [None]:
batch_size = 10  # Batch size for training.
epochs = 5  # Number of epochs to train for
latent_dim = 256  # Latent dimensionality of the encoding space.
num_samples = len(list(df_copy['text']))  # Number of samples to train on (all rows of df_copy).
# Path to the data CSV file on disk.
data_path = "newsdf.csv"

In [None]:
# Vectorize the data.
input_texts = []
target_texts = []
input_characters = set()
target_characters = set()
texts = df_copy['text']
titles = df_copy['title']
for text, title in zip(texts, titles):
  input_text = text + " "
  # "\t" marks the start of a target sequence and "\n" marks its end
  target_text = "\t" + title + " " + "\n"
  input_texts.append(input_text)
  target_texts.append(target_text)
  # collect the character vocabulary of inputs and targets
  input_characters.update(input_text)
  target_characters.update(target_text)

input_characters = sorted(list(input_characters))
target_characters = sorted(list(target_characters))
num_encoder_tokens = len(input_characters)
num_decoder_tokens = len(target_characters)
max_encoder_seq_length = max([len(txt) for txt in input_texts])
max_decoder_seq_length = max([len(txt) for txt in target_texts])

print("Number of samples:", len(input_texts))
print("Number of unique input tokens:", num_encoder_tokens)
print("Number of unique output tokens:", num_decoder_tokens)
print("Max sequence length for inputs:", max_encoder_seq_length)
print("Max sequence length for outputs:", max_decoder_seq_length)

input_token_index = dict([(char, i) for i, char in enumerate(input_characters)])
target_token_index = dict([(char, i) for i, char in enumerate(target_characters)])

encoder_input_data = np.zeros(
    (len(input_texts), max_encoder_seq_length, num_encoder_tokens), dtype="float32"
)
decoder_input_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens), dtype="float32"
)
decoder_target_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens), dtype="float32"
)

for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
    for t, char in enumerate(input_text):
        encoder_input_data[i, t, input_token_index[char]] = 1.0
    encoder_input_data[i, t + 1 :, input_token_index[" "]] = 1.0
    for t, char in enumerate(target_text):
        # decoder_target_data is ahead of decoder_input_data by one timestep
        decoder_input_data[i, t, target_token_index[char]] = 1.0
        if t > 0:
            # decoder_target_data will be ahead by one timestep
            # and will not include the start character.
            decoder_target_data[i, t - 1, target_token_index[char]] = 1.0
    decoder_input_data[i, t + 1 :, target_token_index[" "]] = 1.0
    decoder_target_data[i, t:, target_token_index[" "]] = 1.0


Number of samples: 500
Number of unique input tokens: 99
Number of unique output tokens: 99
Max sequence length for inputs: 38084
Max sequence length for outputs: 225
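The printed sizes have a practical consequence worth working through: the dense one-hot arrays hold one float32 per (sample, timestep, token) triple. A quick back-of-the-envelope check, using only the numbers printed above (the helper name `one_hot_bytes` is hypothetical):

```python
import numpy as np

def one_hot_bytes(num_samples, seq_length, num_tokens, dtype="float32"):
    """Bytes needed for a dense (samples, timesteps, tokens) one-hot array."""
    return num_samples * seq_length * num_tokens * np.dtype(dtype).itemsize

encoder_bytes = one_hot_bytes(500, 38084, 99)
decoder_bytes = one_hot_bytes(500, 225, 99)
print(f"encoder_input_data: {encoder_bytes / 1e9:.2f} GB")
print(f"decoder_input_data: {decoder_bytes / 1e6:.2f} MB")
```

The encoder array alone comes to roughly 7.5 GB, driven almost entirely by the single longest article (38,084 characters), since every sample is padded to that length.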


Build the Model

In [None]:
import keras 
encoder_inputs = keras.Input(shape=(None, num_encoder_tokens))
encoder = keras.layers.LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)

# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]

# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = keras.Input(shape=(None, num_decoder_tokens))

# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.
decoder_lstm = keras.layers.LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = keras.layers.Dense(num_decoder_tokens, activation="softmax")
decoder_outputs = decoder_dense(decoder_outputs)

# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)
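During training the decoder receives the target sequence as input and is asked to predict the same sequence shifted one step ahead (teacher forcing). The toy sketch below reproduces that shift for a single short target with NumPy. The helper `one_hot_pair` and the sample string are illustrative, and unlike the vectorization cell it leaves the final target row empty instead of padding it with a space.

```python
import numpy as np

def one_hot_pair(target_text, token_index):
    """Return (decoder_input, decoder_target) one-hot arrays for one sequence."""
    steps, vocab = len(target_text), len(token_index)
    dec_in = np.zeros((steps, vocab), dtype="float32")
    dec_out = np.zeros((steps, vocab), dtype="float32")
    for t, char in enumerate(target_text):
        dec_in[t, token_index[char]] = 1.0
        if t > 0:
            # The target at step t - 1 is the character fed in at step t.
            dec_out[t - 1, token_index[char]] = 1.0
    return dec_in, dec_out

target = "\tab\n"  # "\t" = start marker, "\n" = end marker
token_index = {c: i for i, c in enumerate(sorted(set(target)))}
index_token = {i: c for c, i in token_index.items()}

dec_in, dec_out = one_hot_pair(target, token_index)
decoded_in = "".join(index_token[int(i)] for i in dec_in.argmax(axis=1))
decoded_out = "".join(index_token[int(i)] for i in dec_out[:-1].argmax(axis=1))
print(repr(decoded_in), repr(decoded_out))
```

The decoder input decodes back to the full string including the start marker, while the target is the same string without it, one step ahead.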

Train the Model

In [None]:
model.compile(
    optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"]
)
model.fit(
    [encoder_input_data, decoder_input_data],
    decoder_target_data,
    batch_size=batch_size,
    epochs=epochs,
    validation_split=0.2,
)
# Save model
model.save("lstm_baseline_model_new.pt")


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5




INFO:tensorflow:Assets written to: /content/drive/MyDrive/lstm_baseline_model_new.pt/assets


