# Neural Networks and Deep Learning
---

Neural network architectures can be adapted to many different objectives, which is a key reason they have been adopted across a wide array of tasks and industries. In this information age, we have access to far more information than we can read, so we need systems that can summarise it concisely.

News is one such source of data: most people turn to it to keep up to date with events around the world. A headline must convey as much key information as possible in a single short sentence, which makes headlines ideal targets for training a summarisation engine.

Categorising the information by content is just as necessary as summarising it, since different readers are interested in different subsets of the news.

Therefore, in this project, we will be looking at two models. The first is a categorisation model that assigns news articles to categories, and the second is a summarisation model that condenses an article's content into a headline.

## Importing Libraries

First, we need to download the `spaCy` model, as `spaCy` is an NLP library with pretrained word vectors that capture information about the words themselves, making it easier for us to train our final model. If you are running this on Google Colab, you will have to restart the runtime after installation for the changes to take effect.

In [None]:
# !python -m spacy download en_core_web_lg

Next, we import everything that we will be using in this project. If any of these libraries are missing, install them first.

In [None]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sb
import spacy
import re
import warnings
import nltk
from bs4 import BeautifulSoup
from matplotlib.colors import LogNorm
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from json import load, dump

import tensorflow as tf
from tensorflow import keras, constant
from tensorflow.data import Dataset
from tensorflow.keras import initializers, regularizers, constraints
from tensorflow.keras import backend as K
from tensorflow.keras.preprocessing.text import Tokenizer, tokenizer_from_json
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Layer, Dense, Input, LSTM, TimeDistributed, Concatenate, Attention, Embedding, Bidirectional, Dropout, BatchNormalization, Masking
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
from tensorflow.keras.utils import plot_model, to_categorical
from tensorflow.keras.optimizers import RMSprop, Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy, CategoricalCrossentropy
from tensorflow.keras.metrics import CategoricalAccuracy
from tensorflow.keras.initializers import Constant, GlorotUniform

In [None]:
plt.rcParams['figure.figsize'] = (6,6)
nltk.download('stopwords')
pd.set_option("display.max_colwidth", 200)
warnings.filterwarnings("ignore")
tf.config.run_functions_eagerly(True)
tf.get_logger().setLevel('WARNING')

We also need to check if we have a GPU installed and configured to speed up the learning process later.

In [None]:
devices = tf.config.list_physical_devices('GPU')
print(devices)

We also need to set some parameters here so that it is easy to vary them later on for fine tuning of the model.

In [None]:
# Category Model Params
category_model_name = 'CategorisationModel'
embedding_dim = 300
learning_rate = 0.001

# Summary Model Params
summary_model_name = 'SummarisationModel'
batch_size = 256
latent_dim = 200

## Data Preparation and Exploration


### Reading the Data

First, we need to import the data, which is stored in a JSON file. The dataset was retrieved from [Kaggle](https://www.kaggle.com/rmisra/news-category-dataset?select=News_Category_Dataset_v2.json), and we expanded it by [scraping](https://gist.github.com/majulahsingapuri/535ac8d3daac708996a3588e5c9d18e6) the first 3 paragraphs of each article. We chose this dataset because it has over 200k data points, which is more than enough for our needs, and because it provides links to the original articles, making it easier for us to scrape their content.

In [None]:
# load data
df = pd.read_json('./data/News_Category_Dataset_pretty_v3.json', orient = 'records')
df = df.drop(columns=['authors', 'link', 'date', 'short_description'])
df = df.dropna()
df.head()

In the process of scraping, we lost a few thousand data points because the articles had either been removed or were not accessible, but the remainder is still more than enough for us to train the model.

In [None]:
df.shape

### Article Categories
Next, let's take a look at how many categories we have to sort the news articles into.

In [None]:
cates = df.groupby('category')
print("total categories:", cates.ngroups)
print(cates.size())

Replacing `THE WORLDPOST` with `WORLDPOST` as they are in essence the same tag.

In [None]:
# as shown above, THE WORLDPOST and WORLDPOST should be the same category, so merge them.
df.category = df.category.map(lambda x: "WORLDPOST" if x == "THE WORLDPOST" else x)

### Data Cleaning

Now to actually process the data. First, since the content has been scraped from the internet, some HTML may have made its way into the dataset, so we will have to remove it. In addition, we will also be removing other text fragments as highlighted in the code comments below. We will also expand all contractions to ensure that the data is written in as proper English as possible.

In [None]:
contraction_mapping = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not",
                           "didn't": "did not",  "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not",
                           "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",
                           "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would",
                           "i'd've": "i would have", "i'll": "i will",  "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would",
                           "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam",
                           "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have",
                           "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock",
                           "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have",
                           "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is",
                           "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as",
                           "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would",
                           "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have",
                           "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have",
                           "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are",
                           "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",
                           "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is",
                           "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have",
                           "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have",
                           "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all",
                           "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have",
                           "you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have",
                           "you're": "you are", "you've": "you have"}

In [None]:
def text_cleaner(text):
    newString = text.lower() # Lower Case
    newString = BeautifulSoup(newString, "lxml").text # Remove html fragments
    newString = re.sub(r'\([^)]*\)', '', newString) # Anything in brackets
    newString = re.sub('"','', newString) # Quotes
    newString = ' '.join([contraction_mapping[t] if t in contraction_mapping else t for t in newString.split(" ")]) # Contractions   
    newString = re.sub(r"'s\b","",newString) # Possessive apostrophe
    newString = re.sub("[^a-zA-Z]", " ", newString) # Non-letters
    newString = re.sub('[m]{2,}', 'mm', newString) # "Hmmm" sort of letters?
    return newString

In [None]:
df['content'] = df.content.apply(text_cleaner)
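
To make the effect of the cleaning steps concrete, here is a minimal standalone sketch that applies the same kind of steps to a single made-up sentence. The name `clean_sample` and the two-entry contraction table are hypothetical and exist only for this illustration. Unlike `text_cleaner`, the sketch skips the HTML step and collapses repeated spaces at the end so that the result is easier to read.

```python
import re

# Hypothetical two-entry table, used only for this illustration.
SAMPLE_CONTRACTIONS = {"don't": "do not", "it's": "it is"}

def clean_sample(text):
    text = text.lower()                        # lower case
    text = re.sub(r"\([^)]*\)", "", text)      # anything in brackets
    text = text.replace('"', "")               # double quotes
    words = [SAMPLE_CONTRACTIONS.get(w, w) for w in text.split(" ")]
    text = " ".join(words)                     # expanded contractions
    text = re.sub(r"'s\b", "", text)           # possessive apostrophe
    text = re.sub(r"[^a-z]", " ", text)        # non-letters
    return " ".join(text.split())              # collapse spaces (extra step)

cleaned = clean_sample(
    'The Ministry (MOH) said "it\'s" the hospital\'s call, don\'t worry!'
)
print(cleaned)
```

Note that the contraction lookup runs after lower-casing, so only the lower-case keys of a mapping can ever match.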

### Visualising Data
Next, since we will be using LSTMs and Embedding layers, we need to know the input and output sizes that we will be dealing with so that we can build the model appropriately. The charts below show that the majority of the content is about 100-200 words long and the headlines are 12-14 words long.

In [None]:
text_word_count = []
headline_word_count = []

# populate the lists with sentence lengths
for i in df['content']:
      text_word_count.append(len(i.split()))

for i in df['headline']:
      headline_word_count.append(len(i.split()))

length_df = pd.DataFrame({'content':text_word_count, 'headline':headline_word_count})

In [None]:
ax_list = length_df['content'].hist(bins = 1000)
ax_list.set_xlim(0, 500)
plt.savefig('./Images/content_distribution')
plt.show()

In [None]:
ax_list = length_df['headline'].hist(bins = 50)
plt.savefig('./Images/headline_distribution')
plt.show()

Hence we will be setting `max_text_len` and `max_headline_len` to the following values.

In [None]:
max_text_len=200
max_headline_len=14
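
One way to sanity-check a cut-off like this is to measure what fraction of the samples fit within it. The sketch below is only an illustration: `coverage` is a hypothetical helper and `sample_lengths` is made-up data. In the notebook, the same function could be called on `length_df['content']` or `length_df['headline']`.

```python
def coverage(lengths, limit):
    """Return the fraction of sequences whose length is at most `limit`."""
    lengths = list(lengths)
    return sum(1 for n in lengths if n <= limit) / len(lengths)

# Made-up word counts, for illustration only.
sample_lengths = [80, 120, 150, 190, 210, 400]
text_coverage = coverage(sample_lengths, 200)
print(f"{text_coverage:.0%} of the sample fits within 200 words")
```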

### Data Preparation

Next we need to add the start and end tokens into the headlines in order for the decoder portion of the summary LSTM model to work.

In [None]:
df['headline'] = df['headline'].apply(lambda x : 'sostok '+ x + ' eostok')

We will also keep only the articles whose headlines, including the start and end tokens, are at most `max_headline_len` words long. This ensures that no headline is truncated during padding, so every headline retains both its start and end token.

In [None]:
df = df[df['headline'].str.split().str.len().le(max_headline_len)]

Next we need to convert the categories from text to numbers so that they are easier to process.

In [None]:
df['category'], categories = pd.factorize(pd.Categorical(df['category']))

In [None]:
df.head(10)

### Splitting the Data

We will be using `sklearn`'s `train_test_split` to split the data into `train`, `test` and `validation` sets. 

In [None]:
x_train, x_test, y_train, y_test = train_test_split(
    df['content'],
    df[['headline', 'category']],
    test_size=0.2,
    random_state=0,
    shuffle=True
)
x_train, x_val, y_train, y_val = train_test_split(
    x_train,
    y_train,
    test_size=0.25,
    random_state=0,
    shuffle=True
)
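
The two calls above split the data in two stages: 20% is set aside first, then 25% of the remaining 80%. This works out to roughly a 60/20/20 split. The arithmetic can be checked with a short sketch. `split_sizes` is a hypothetical helper, and its rounding is an approximation that may differ by a row from what `train_test_split` actually does.

```python
def split_sizes(n_rows, test_size=0.2, val_size=0.25):
    """Approximate (train, val, test) sizes for a two-stage split."""
    n_test = round(n_rows * test_size)
    n_val = round((n_rows - n_test) * val_size)
    n_train = n_rows - n_test - n_val
    return n_train, n_val, n_test

sizes = split_sizes(1000)
print(sizes)
```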

In [None]:
print(x_train.shape)
print(y_train.shape)

### Tokenising Data

Now we need to fit the `Tokenizer`s on the `x_train` and `y_train` data to create our vocabulary.

In [None]:
x_tokenizer = Tokenizer() 
x_tokenizer.fit_on_texts(list(x_train))

Rather than covering all the words that exist in the dataset, we want to focus on the ones that appear more often as that will aid in the training of the model. We are taking all words that appear at least 4 times.

In [None]:
thresh=4

cnt=0
tot_cnt=0
freq=0
tot_freq=0

for key,value in x_tokenizer.word_counts.items():
    tot_cnt=tot_cnt+1
    tot_freq=tot_freq+value
    if(value<thresh):
        cnt=cnt+1
        freq=freq+value
    
print("% of rare words in vocabulary:",(cnt/tot_cnt)*100)
print("Total Coverage of rare words:",(freq/tot_freq)*100)
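
The same two statistics can be reproduced without Keras, which may make it clearer what they measure. This is a sketch on a made-up toy corpus. `rare_word_stats` is a hypothetical helper and it splits on whitespace only, whereas the Keras `Tokenizer` also filters punctuation.

```python
from collections import Counter

def rare_word_stats(texts, thresh=4):
    """Return (share of vocabulary that is rare, share of all word
    occurrences that are rare words, number of words kept)."""
    counts = Counter(word for text in texts for word in text.split())
    rare = {w: c for w, c in counts.items() if c < thresh}
    vocab_share = len(rare) / len(counts)
    freq_share = sum(rare.values()) / sum(counts.values())
    return vocab_share, freq_share, len(counts) - len(rare)

# Toy corpus, for illustration only.
docs = ["the cat sat", "the cat ran", "the dog sat", "the cat slept"]
vocab_share, freq_share, kept = rare_word_stats(docs, thresh=3)
print(vocab_share, freq_share, kept)
```

Here four of the six distinct words are rare, but they account for only five of the twelve word occurrences, which is the pattern that justifies dropping them.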

Now we will refit the tokenizer with the reduced vocabulary size and use it to convert each of the `train`, `test` and `validation` datasets into padded integer sequences.

In [None]:
#prepare a tokenizer for the article content on the training data
x_tokenizer = Tokenizer(num_words=tot_cnt-cnt) 
x_tokenizer.fit_on_texts(list(x_train))

#convert text sequences into integer sequences
x_train    =   x_tokenizer.texts_to_sequences(x_train) 
x_test   =   x_tokenizer.texts_to_sequences(x_test)
x_val   =   x_tokenizer.texts_to_sequences(x_val)

#padding zero upto maximum length
x_train    =   pad_sequences(x_train,  maxlen=max_text_len, padding='post')
x_test   =   pad_sequences(x_test, maxlen=max_text_len, padding='post')
x_val   =   pad_sequences(x_val, maxlen=max_text_len, padding='post')

#size of vocabulary ( +1 for padding token)
x_voc   =  x_tokenizer.num_words + 1
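
To illustrate what the padding step does to each sequence, here is a pure-Python sketch. `pad_post` is a hypothetical helper, not the Keras function. It assumes the Keras defaults as we understand them: `padding='post'` appends zeros at the end, while truncation is left at its default of `'pre'`, which drops tokens from the start of a sequence that is too long.

```python
def pad_post(seq, maxlen, value=0):
    """Pad at the end; truncate from the front if the sequence is too long."""
    trimmed = seq[-maxlen:]                 # 'pre' truncation keeps the tail
    return trimmed + [value] * (maxlen - len(trimmed))

short = pad_post([5, 3, 9], 5)
long_ = pad_post([1, 2, 3, 4, 5, 6], 4)
print(short, long_)
```

If keeping the opening of each article matters more than keeping its end, passing `truncating='post'` to `pad_sequences` would be worth considering.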

Checking size of vocabulary

In [None]:
x_voc

Checking to see if the padding has been done correctly

In [None]:
x_train

Repeat the process for `y` data

In [None]:
#prepare a tokenizer for the headlines on the training data
y_tokenizer = Tokenizer()   
y_tokenizer.fit_on_texts(list(y_train['headline']))

In [None]:
thresh=4

cnt=0
tot_cnt=0
freq=0
tot_freq=0

for key,value in y_tokenizer.word_counts.items():
    tot_cnt=tot_cnt+1
    tot_freq=tot_freq+value
    if(value<thresh):
        cnt=cnt+1
        freq=freq+value
    
print("% of rare words in vocabulary:",(cnt/tot_cnt)*100)
print("Total Coverage of rare words:",(freq/tot_freq)*100)

In [None]:
#prepare a tokenizer for the headlines on the training data
y_tokenizer = Tokenizer(num_words=tot_cnt-cnt) 
y_tokenizer.fit_on_texts(list(y_train['headline']))

#convert text sequences into integer sequences
y_train_seq    =   y_tokenizer.texts_to_sequences(y_train['headline']) 
y_test_seq   =   y_tokenizer.texts_to_sequences(y_test['headline']) 
y_val_seq   =   y_tokenizer.texts_to_sequences(y_val['headline']) 

#padding zero upto maximum length
y_train_seq    =   pad_sequences(y_train_seq, maxlen=max_headline_len, padding='post')
y_test_seq   =   pad_sequences(y_test_seq, maxlen=max_headline_len, padding='post')
y_val_seq   =   pad_sequences(y_val_seq, maxlen=max_headline_len, padding='post')

#size of vocabulary
y_voc  =   y_tokenizer.num_words +1

In [None]:
y_train = y_train.drop(columns='headline')
y_test = y_test.drop(columns='headline')
y_val = y_val.drop(columns='headline')

In [None]:
y_train = to_categorical(list(y_train['category']))
y_test = to_categorical(list(y_test['category']))
y_val = to_categorical(list(y_val['category']))
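
The one-hot encoding applied above can be sketched in plain Python. `one_hot` is a hypothetical helper written for this illustration. Like `to_categorical`, it infers the number of classes from the largest label when none is given.

```python
def one_hot(labels, num_classes=None):
    """Convert integer labels into one-hot rows."""
    if num_classes is None:
        num_classes = max(labels) + 1
    return [[1.0 if i == label else 0.0 for i in range(num_classes)]
            for label in labels]

encoded = one_hot([0, 2, 1])
print(encoded)
```

Because the width is inferred per call, a split that happens to lack the highest category index would come out narrower than the others. Passing `num_classes=len(categories)` explicitly would guard against that.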

Checking to see if the data has been correctly categorised

In [None]:
y_train[0]

Checking to see if the padding has been done correctly

In [None]:
y_train_seq

### Saving Tokenizers for future use
In the event that we want to deploy this model, we need to save the tokenizers.

In [None]:
tokenizer_json = x_tokenizer.to_json(ensure_ascii=False, indent=4)
with open('./tokenizer/xtokenizer.json', 'w+', encoding='utf-8') as f:
    f.write(tokenizer_json)

In [None]:
tokenizer_json = y_tokenizer.to_json(ensure_ascii=False, indent=4)
with open('./tokenizer/ytokenizer.json', 'w+', encoding='utf-8') as f:
    f.write(tokenizer_json)

## Categorising Articles
Now let's begin building the first model, which categorises news articles based on their content.

### Loading spaCy
We will be using [spaCy](https://spacy.io) for this part, as it is a widely used NLP library that ships with pretrained pipelines. We exclude most of the pipeline components, as we are only interested in the pretrained word vectors, which map each word to its vector representation. These vectors were trained on an extremely large dataset so that a 300-dimensional vector captures the meaning of each word. We will use them to initialise our Embedding layers.

In [None]:
nlp = spacy.load('en_core_web_lg', exclude=['tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner'] )

In [None]:
x_word_index = x_tokenizer.word_index

print('Found %s unique tokens.' % len(x_word_index))

In [None]:
x_embedding_matrix = np.zeros((len(x_word_index) + 1, 300))
for word, i in x_word_index.items():
    embedding_word = nlp(word)
    embedding_vector = embedding_word.vector
    if embedding_vector is not None:
        x_embedding_matrix[i] = embedding_vector
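
The loop above fills one row per word index and leaves row 0, the padding index, as zeros. The same idea is sketched below with plain lists and a made-up lookup table standing in for spaCy. `build_embedding_matrix`, `toy_vectors` and the 2-dimensional vectors are all hypothetical.

```python
def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with index i; row 0 stays zero."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        vec = vectors.get(word)
        if vec is not None:          # unknown words keep an all-zero row
            matrix[idx] = list(vec)
    return matrix

# Toy data, for illustration only.
toy_word_index = {"news": 1, "cat": 2, "zzz": 3}
toy_vectors = {"news": [0.1, 0.2], "cat": [0.3, 0.4]}
toy_matrix = build_embedding_matrix(toy_word_index, toy_vectors, dim=2)
print(toy_matrix)
```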

In [None]:
y_word_index = y_tokenizer.word_index

print('Found %s unique tokens.' % len(y_word_index))

In [None]:
y_embedding_matrix = np.zeros((len(y_word_index) + 1, 300))
for word, i in y_word_index.items():
    embedding_word = nlp(word)
    embedding_vector = embedding_word.vector
    if embedding_vector is not None:
        y_embedding_matrix[i] = embedding_vector

### Saving Embeddings for future use

In [None]:
with open('./embeddings/xEmbedding.npy', 'wb') as f:
    np.save(f, x_embedding_matrix)

In [None]:
with open('./embeddings/yEmbedding.npy', 'wb') as f:
    np.save(f, y_embedding_matrix)

### Attention Layer

For this model, we will not be using the TensorFlow Attention layer but rather this custom implementation.

In [None]:
class CustomAttention(Layer):
    def __init__(self, step_dim,
                 W_regularizer=None, b_regularizer=None,
                 W_constraint=None, b_constraint=None,
                 bias=True, **kwargs):
        self.supports_masking = True
        self.init = GlorotUniform
        self.W_regularizer = regularizers.get(W_regularizer)
        self.b_regularizer = regularizers.get(b_regularizer)
        self.W_constraint = constraints.get(W_constraint)
        self.b_constraint = constraints.get(b_constraint)
        self.bias = bias
        self.step_dim = step_dim
        self.features_dim = 0
        super(CustomAttention, self).__init__(**kwargs)

    def get_config(self):

        base_config = super().get_config()
        config = {
            'step_dim' : self.step_dim,
            'W_regularizer' : self.W_regularizer,
            'b_regularizer' : self.b_regularizer,
            'W_constraint' : self.W_constraint,
            'b_constraint' : self.b_constraint,
            'bias' : self.bias
        }
        return dict(list(base_config.items()) + list(config.items()))

    def build(self, input_shape):
        assert len(input_shape) == 3
        self.W = self.add_weight(shape=(input_shape[-1],),
                                 initializer=self.init,
                                 name=f'{self.name}_W',
                                 regularizer=self.W_regularizer,
                                 constraint=self.W_constraint)
        self.features_dim = input_shape[-1]
        if self.bias:
            self.b = self.add_weight(shape=(input_shape[1],),
                                     initializer='zero',
                                     name=f'{self.name}_b',
                                     regularizer=self.b_regularizer,
                                     constraint=self.b_constraint)
        else:
            self.b = None
        self.built = True

    def compute_mask(self, input, input_mask=None):
        return None

    def call(self, x, mask=None):
        features_dim = self.features_dim
        step_dim = self.step_dim
        eij = K.reshape(K.dot(K.reshape(x, (-1, features_dim)), K.reshape(self.W, (features_dim, 1))), (-1, step_dim))
        if self.bias:
            eij += self.b
        eij = K.tanh(eij)
        a = K.exp(eij)
        if mask is not None:
            a *= K.cast(mask, K.floatx())
        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())
        a = K.expand_dims(a)
        weighted_input = x * a
        return K.sum(weighted_input, axis=1)

    def compute_output_shape(self, input_shape):
        return input_shape[0],  self.features_dim
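
To show what `call` computes for a single sample, here is a pure-Python sketch of the same pooling. Each time step gets a score `tanh(x_t . W + b_t)`, the scores are normalised into weights, and the output is the weighted sum of the time steps. `attention_pool` is a hypothetical helper. It omits the mask and the small epsilon that the layer adds to the denominator.

```python
import math

def attention_pool(x, w, b):
    """x: list of time steps (each a list of features),
    w: one weight per feature, b: one bias per time step."""
    scores = [math.tanh(sum(xi * wi for xi, wi in zip(step, w)) + bt)
              for step, bt in zip(x, b)]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [sum(a * step[d] for a, step in zip(weights, x))
              for d in range(len(x[0]))]
    return weights, pooled

# With zero weights and biases every step scores the same,
# so the result is a plain average over time steps.
weights, pooled = attention_pool([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0], [0.0, 0.0])
print(weights, pooled)
```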

### Building the Model

Now we will combine all the different layers into the model shown below.

In [None]:
K.clear_session()

inp = Input(shape=(max_text_len,), dtype='int32')
x = Embedding(len(x_word_index)+1, embedding_dim, embeddings_initializer=Constant(x_embedding_matrix), input_length=max_text_len, trainable=False, mask_zero=True)(inp)
x = Bidirectional(LSTM(embedding_dim, dropout=0.25, return_sequences=True))(x)
x = CustomAttention(max_text_len)(x)
merged = Dense(256, activation='relu')(x)
merged = Dropout(0.25)(merged)
merged = BatchNormalization()(merged)
outp = Dense(len(categories), activation='softmax')(merged)

AttentionLSTM = Model(inputs=inp, outputs=outp, name=category_model_name)
AttentionLSTM.compile(loss=CategoricalCrossentropy(), optimizer=Adam(learning_rate=learning_rate), metrics=[CategoricalAccuracy(name='acc')])

AttentionLSTM.summary()

### Visualise the Model

In [None]:
plot_model(AttentionLSTM, to_file='./Images/' + category_model_name + '.png', rankdir='TB', show_shapes=True)

### Setting the Callbacks

We use an `EarlyStopping` and a `ModelCheckpoint` callback. The first stops training once the validation loss stops improving, and the second saves the model with the lowest validation loss seen so far.

In [None]:
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1,patience=3)
checkpoint = ModelCheckpoint(
    './models/' + category_model_name + '.h5', 
    monitor='val_loss',  
    mode='min', 
    verbose=1, 
    save_best_only=True
)
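
The patience rule behind early stopping can be sketched in a few lines. This is a simplified illustration, not the Keras implementation. `early_stop_epoch` is a hypothetical helper and the loss values are made up.

```python
def early_stop_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0      # improvement: reset the counter
        else:
            wait += 1                 # no improvement this epoch
            if wait >= patience:
                return epoch
    return len(val_losses)

# Made-up validation losses: the best value is at epoch 2.
stopped_at = early_stop_epoch([1.0, 0.8, 0.9, 0.85, 0.82, 0.7], patience=3)
print(stopped_at)
```

In this example training stops at epoch 5, so the later loss of 0.7 is never reached. That trade-off is what the `patience` value controls.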

### Training the Model

In [None]:
attlstm_history = AttentionLSTM.fit(
    x_train, 
    y_train,
    callbacks=[es, checkpoint],
    batch_size=batch_size, 
    epochs=50, 
    validation_data=(x_test, y_test)
)

### Plotting the results


In [None]:
acc = attlstm_history.history['acc']
val_acc = attlstm_history.history['val_acc']
loss = attlstm_history.history['loss']
val_loss = attlstm_history.history['val_loss']
epochs = range(1, len(acc) + 1)

In [None]:
plt.title('Training and validation accuracy')
plt.plot(epochs, acc, 'red', label='Training acc')
plt.plot(epochs, val_acc, 'blue', label='Validation acc')
plt.legend()
plt.savefig('./Images/categorisation_accuracy')
plt.show()

In [None]:
plt.title('Training and validation loss')
plt.plot(epochs, loss, 'red', label='Training loss')
plt.plot(epochs, val_loss, 'blue', label='Validation loss')
plt.legend()
plt.savefig('./Images/categorisation_loss')
plt.show()

### Loading the Best Model

In [None]:
AttentionLSTM = load_model('./models/' + category_model_name + '.h5', custom_objects={
    'CustomAttention': CustomAttention
    })

### Plotting the Confusion Matrix

In [None]:
predicted = AttentionLSTM.predict(x_val)
cm = confusion_matrix(y_val.argmax(axis=1), predicted.argmax(axis=1))

In [None]:
fig, ax = plt.subplots(figsize=(15,15))
sb.heatmap(cm, annot=True, square=True, norm=LogNorm())
plt.savefig('./Images/categorisation_confusion_matrix')
plt.show()

### Evaluating the Model
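We will now evaluate our model on the `validation` dataset, which it has not seen before. Note that the `test` split was the one passed as `validation_data` during training, so the `validation` split serves as the held-out set here.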

In [None]:
AttentionLSTM.evaluate(x_val, y_val)

## Summarising the content
Now we will build our second model, which is the summary model. This model requires multiple layers of LSTMs as we need to gather information from the entirety of the text.

In [None]:
K.clear_session()

# Encoder
encoder_inputs = Input(shape=(max_text_len, ))

# Embedding layer
enc_emb =  Embedding(len(x_word_index)+1, embedding_dim, embeddings_initializer=Constant(x_embedding_matrix), input_length=max_text_len, trainable=False, mask_zero=True)(encoder_inputs)

# Encoder LSTM 1
encoder_lstm1 = LSTM(latent_dim, return_sequences=True, return_state=True, dropout=0.2)
encoder_output1, state_h1, state_c1 = encoder_lstm1(enc_emb)

# Encoder LSTM 2
encoder_lstm2 = LSTM(latent_dim, return_sequences=True, return_state=True, dropout=0.2)
encoder_output2, state_h2, state_c2 = encoder_lstm2(encoder_output1)

# Encoder LSTM 3
encoder_lstm3 = LSTM(latent_dim, return_state=True, return_sequences=True, dropout=0.2)
encoder_outputs, state_h, state_c= encoder_lstm3(encoder_output2)

# Set up the decoder, using `encoder_states` as the initial state
decoder_inputs = Input(shape=(None, ))

# Embedding layer
dec_emb_layer = Embedding(len(y_word_index)+1, embedding_dim, embeddings_initializer=Constant(y_embedding_matrix), input_length=max_text_len, trainable=False, mask_zero=True)
dec_emb = dec_emb_layer(decoder_inputs)

# Decoder LSTM
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True, dropout=0.2)
decoder_outputs, decoder_fwd_state, decoder_back_state = decoder_lstm(dec_emb, initial_state=[state_h, state_c])

# Attention layer
attn_out = Attention()([decoder_outputs, encoder_outputs])

# Concat attention input and decoder LSTM output
decoder_concat_input = Concatenate(axis=-1, name='concat_layer')([decoder_outputs, attn_out])

# Dense layer
decoder_dense =  TimeDistributed(Dense(y_voc, activation='softmax'))
decoder_outputs = decoder_dense(decoder_concat_input)

# Define the model 
model = Model([encoder_inputs, decoder_inputs], decoder_outputs, name=summary_model_name)

model.summary()

In [None]:
plot_model(model, to_file='./Images/' + summary_model_name + '.png', rankdir='TB', show_shapes=True)

In [None]:
model.compile(optimizer=RMSprop(), loss=SparseCategoricalCrossentropy())

### Creating the Datasets

We will be using the `Dataset` class to load the data and send it to the appropriate inputs in our model. The first input is the entire 200-word content and the second input is the entire headline minus the last token. The output is the entire headline minus the first token, as individual outputs.

In [None]:
train_data = Dataset.from_tensor_slices(({
        "input_1": x_train, 
        "input_2": y_train_seq[:,:-1]
    }, 
    y_train_seq.reshape(y_train_seq.shape[0], y_train_seq.shape[1], 1)[:,1:]
)).batch(batch_size)

In [None]:
test_data = Dataset.from_tensor_slices(({
        "input_1": x_test, 
        "input_2": y_test_seq[:,:-1]
    }, 
    y_test_seq.reshape(y_test_seq.shape[0], y_test_seq.shape[1], 1)[:,1:]
)).batch(batch_size)

In [None]:
val_data = Dataset.from_tensor_slices(({
        "input_1": x_val, 
        "input_2": y_val_seq[:,:-1]
    }, 
    y_val_seq.reshape(y_val_seq.shape[0], y_val_seq.shape[1], 1)[:,1:]
)).batch(batch_size)
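
The slicing used in the three datasets above is the usual teacher-forcing arrangement: at each position the decoder is fed the true previous token and asked to predict the next one. A minimal sketch on one made-up padded sequence follows. `teacher_forcing_pair` is a hypothetical helper, and the ids 1 and 2 stand in for `sostok` and `eostok`.

```python
def teacher_forcing_pair(seq):
    """Return (decoder input, decoder target) for one padded sequence."""
    return seq[:-1], seq[1:]

# 1 = start token, 2 = end token, 0 = padding (made-up ids).
padded_headline = [1, 7, 9, 2, 0, 0]
decoder_in, decoder_target = teacher_forcing_pair(padded_headline)
print(decoder_in)
print(decoder_target)
```

Position by position, the target is the input shifted one step to the left, so the start token is never a target.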

### Setting the Callbacks

In [None]:
checkpoint = ModelCheckpoint(
    './models/' + summary_model_name + '.h5', 
    monitor='val_loss',  
    mode='min', 
    verbose=1, 
    save_best_only=True
)

### Training the Model

In [None]:
history = model.fit(
    train_data,
    epochs=1,
    callbacks=[es,checkpoint],
    # train_data is already batched via .batch(batch_size), so batch_size is not passed again
    validation_data=test_data
)

### Plotting the Results

In [None]:
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='test')
plt.legend()
plt.savefig('./Images/summary_loss')
plt.show()

### Loading the Best Model

In [None]:
model = load_model('./models/' + summary_model_name + '.h5')

### Evaluating the Model

In [None]:
model.evaluate(val_data)

### Building the Encoder and Decoder Models

In [None]:
reverse_target_word_index=y_tokenizer.index_word
reverse_source_word_index=x_tokenizer.index_word
target_word_index=y_tokenizer.word_index

In [None]:
# Encode the input sequence to get the feature vector
encoder_model = Model(inputs=encoder_inputs,outputs=[encoder_outputs, state_h, state_c])

# Decoder setup
# Below tensors will hold the states of the previous time step
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_hidden_state_input = Input(shape=(max_text_len,latent_dim))

# Get the embeddings of the decoder sequence
dec_emb2= dec_emb_layer(decoder_inputs) 

# To predict the next word in the sequence, set the initial states to the states from the previous time step
decoder_outputs2, state_h2, state_c2 = decoder_lstm(dec_emb2, initial_state=[decoder_state_input_h, decoder_state_input_c])

# Attention inference
attn_out_inf = Attention()([decoder_outputs2, decoder_hidden_state_input])
decoder_inf_concat = Concatenate(axis=-1, name='concat')([decoder_outputs2, attn_out_inf])

# A dense softmax layer to generate prob dist. over the target vocabulary
decoder_outputs2 = decoder_dense(decoder_inf_concat) 

# Final decoder model
decoder_model = Model(
    [decoder_inputs] + [decoder_hidden_state_input,decoder_state_input_h, decoder_state_input_c],
    [decoder_outputs2] + [state_h2, state_c2])

#### Saving the Encoder and Decoder Models

In [None]:
encoder_model.save('./models/encoder_' + summary_model_name + '.h5')
decoder_model.save('./models/decoder_' + summary_model_name + '.h5')

In [None]:
plot_model(encoder_model, to_file='./Images/encoder_model.png', rankdir='TB', show_shapes=True)

In [None]:
plot_model(decoder_model, to_file='./Images/decoder_model.png', rankdir='TB', show_shapes=True)

### Functions to predict the headlines

In [None]:
def decode_sequence(input_seq):
    # Encode the input as state vectors.
    e_out, e_h, e_c = encoder_model.predict(input_seq)
    
    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1,1))
    
    # Populate the first word of target sequence with the start word.
    target_seq[0, 0] = target_word_index['sostok']

    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
      
        output_tokens, h, c = decoder_model.predict([target_seq] + [e_out, e_h, e_c])

        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_token = reverse_target_word_index[sampled_token_index]
        
        if(sampled_token!='eostok'):
            decoded_sentence += ' '+sampled_token

        # Exit condition: either hit max length or find stop word.
        if (sampled_token == 'eostok'  or len(decoded_sentence.split()) >= (max_headline_len-1)):
            stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1,1))
        target_seq[0, 0] = sampled_token_index

        # Update internal states
        e_h, e_c = h, c

    return decoded_sentence
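
Stripped of the Keras models, the loop above is a greedy decoder: at every step it picks the most likely next token and feeds it back in until the end token or the length limit is reached. The sketch below shows only that control flow. `greedy_decode`, `toy_step` and the probability table are hypothetical, and the LSTM states that the real decoder carries between steps are left out.

```python
def greedy_decode(step_fn, start_id, end_id, max_len):
    """Repeatedly pick the most likely next token until end_id or max_len."""
    tokens = []
    current = start_id
    while len(tokens) < max_len:
        probs = step_fn(current)
        current = max(range(len(probs)), key=probs.__getitem__)  # argmax
        if current == end_id:
            break
        tokens.append(current)
    return tokens

# Toy next-token distributions keyed by the previous token (made-up values).
# Ids: 0 = padding, 1 = start, 2 = end, 3 and 4 = ordinary words.
TOY_TABLE = {
    1: [0.0, 0.0, 0.1, 0.8, 0.1],
    3: [0.0, 0.0, 0.2, 0.1, 0.7],
    4: [0.0, 0.0, 0.9, 0.05, 0.05],
}

def toy_step(token_id):
    return TOY_TABLE[token_id]

decoded = greedy_decode(toy_step, start_id=1, end_id=2, max_len=13)
print(decoded)
```

Greedy decoding commits to the single best token at each step, so it can miss a better overall sequence. Beam search is a common alternative if headline quality needs to improve.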

In [None]:
def seq2headline(input_seq):
    newString=''
    for i in input_seq:
        if((i!=0 and i!=target_word_index['sostok']) and i!=target_word_index['eostok']):
            newString=newString+reverse_target_word_index[i]+' '
    return newString

def seq2text(input_seq):
    newString=''
    for i in input_seq:
        if(i!=0):
            newString=newString+reverse_source_word_index[i]+' '
    return newString

In [None]:
with open('./outputs/' + summary_model_name + '.txt', 'w') as f:
    for i in range(0,50):
        f.write("Article:" + seq2text(x_val[i]) + '\n')
        f.write("Original headline:" + seq2headline(y_val_seq[i]) + '\n')
        f.write("Predicted headline:" + decode_sequence(x_val[i].reshape(1,max_text_len)) + '\n')
        f.write("\n")

### Creating headlines from custom text

In [None]:
# Need to add the category to this too
def create_headline(_input):
    # Clean the raw text the same way as the training data before tokenising
    _input = [text_cleaner(t) for t in _input]
    _input = x_tokenizer.texts_to_sequences(_input)
    _input = pad_sequences(_input,  maxlen=max_text_len, padding='post')
    return decode_sequence(_input.reshape(1,max_text_len))

In [None]:
text = '''
SINGAPORE: From Dec 8, all COVID-19 patients who are unvaccinated "by choice" will have to pay their own medical bills if they are admitted to hospitals or COVID-19 treatment facilities, the Ministry of Health (MOH) said on Monday (Nov 8).

The Government is currently footing the full COVID-19 medical bills of all Singaporeans, permanent residents and long-term pass holders, other than for those who test positive soon after returning from overseas travel.

"Currently, unvaccinated persons make up a sizeable majority of those who require intensive inpatient care, and disproportionately contribute to the strain on our healthcare resources," said MOH.

'''

create_headline([text])