In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from tensorflow.keras import callbacks, models, layers, preprocessing as TFprocessing

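The next cell loads the dataset CSV file from the specified file path using the pandas library's read_csv() function. The encoding is set to 'latin-1' to ensure that all characters in the CSV file are read properly. Columns whose names start with "Unnamed" are then dropped before the first 10 rows are displayed.

Note: This dataset has already been preprocessed.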

In [2]:
dataset = pd.read_csv(r'C:\Users\moeme\Desktop\Text Summariztation Dataset\final_dataset.csv', encoding='latin-1')
dataset = dataset.loc[:, ~dataset.columns.str.contains('^Unnamed')]
display(dataset.head(10))

Unnamed: 0,text,summary
0,administration union territory daman diu ha re...,daman diu revoke mandatory rakshabandhan offic...
1,malaika arora slammed instagram user trolled d...,malaika slam user trolled divorcing rich man
2,indira gandhi institute medical science igims ...,virgin corrected unmarried igims form
3,lashkaretaibas kashmir commander abu dujana wa...,aaj aapne pakad liya let man dujana killed
4,hotel maharashtra train staff spot sign sex tr...,hotel staff get training spot sign sex traffic...
5,32yearold man wednesday wa found hanging insid...,man found dead delhi police station kin allege...
6,delhi high court reduced compensation awarded ...,delhi hc reduces aid negligent accident victim 45
7,60year old dalit woman wa allegedly lynched ag...,60yrold lynched rumour wa cutting people hair
8,inquiry aircraft accident investigation bureau...,chopper flying critically low led 2015 bombay ...
9,congress party ha opened bank called state ban...,congress open state bank tomato lucknow


This block of code preprocesses the "text" column of the dataset with tokenization and sequence padding, using the TensorFlow Keras preprocessing module.

The first four lines of code initialize a tokenizer object, fit it on the text data from the "text" column of the dataset, and create a vocabulary dictionary that maps each unique word in the text data to its index, with the "<PAD>" token assigned to 0.

The next line of code converts the text data into sequences of integers using the texts_to_sequences() method of the tokenizer object and stores them in the variable text2seq. Because num_words=10000 and oov_token=None, only the most frequent words (those with an index below 10000) appear in the sequences; rarer words are dropped.

Finally, the last line of code pads or truncates the sequences to a fixed length of 1024 and stores them in the text variable. This is done with the pad_sequences() function from the sequence module of the TensorFlow package. Since padding and truncating are both set to "post", zeros are appended at the end of short sequences and overlong sequences lose their tail. The resulting sequences can be used for text summarization or other NLP tasks that require fixed-length input.

In [3]:
text_tokenizer = TFprocessing.text.Tokenizer(num_words=10000, lower=False, split=' ', oov_token=None, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n')
text_tokenizer.fit_on_texts(dataset["text"])
text_dic_vocabulary = {"<PAD>":0}
text_dic_vocabulary.update(text_tokenizer.word_index)

## text to seq
text2seq= text_tokenizer.texts_to_sequences(dataset["text"])

## padding sequence
text = TFprocessing.sequence.pad_sequences(text2seq, maxlen=1024, padding="post", truncating="post")
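To make the padding step concrete, here is a minimal pure-Python sketch of what post-padding and post-truncation do to a list of integer sequences. The helper name pad_post and the toy token ids are made up for illustration; the sketch is a stand-in for the behaviour described above, not the TensorFlow implementation.

```python
def pad_post(sequences, maxlen, pad_value=0):
    """Illustrative stand-in for pad_sequences(..., padding="post", truncating="post")."""
    padded = []
    for seq in sequences:
        kept = list(seq[:maxlen])                    # truncating="post": drop tokens past maxlen
        kept += [pad_value] * (maxlen - len(kept))   # padding="post": append zeros at the end
        padded.append(kept)
    return padded

toy_sequences = [[5, 3, 9], [7, 2, 8, 4, 6, 1]]   # made-up token ids
toy_padded = pad_post(toy_sequences, maxlen=4)
print(toy_padded)   # [[5, 3, 9, 0], [7, 2, 8, 4]]
```

Every row ends up with the same length, which is what lets the sequences be stacked into a single array for the model.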

This block of code adds START and END tokens to the summaries (y) in the dataset. It then applies tokenization and sequence padding to the summary data, in the same way as the previous block of code.

The first two statements add "<START>" and "<END>" tokens to the summaries in the dataset by concatenating them to the beginning and end of each summary, using a lambda function applied to the "summary" column of the dataset.

The remaining lines mirror the previous block of code: they fit a tokenizer, build the vocabulary dictionary, convert the summaries to integer sequences, and pad them, this time to a fixed length of 256.

In [4]:
# Add START and END tokens to the summaries (y)
special_tokens = ("<START>", "<END>")
dataset["summary"] = dataset['summary'].apply(lambda x: special_tokens[0]+' '+x+' '+special_tokens[1])
# check example
print(dataset["summary"][1])

summary_tokenizer = TFprocessing.text.Tokenizer(num_words=10000, lower=False, split=' ', oov_token=None, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n')
summary_tokenizer.fit_on_texts(dataset["summary"])
summary_dic_vocabulary = {"<PAD>":0}
summary_dic_vocabulary.update(summary_tokenizer.word_index)

## summary to seq
summary2seq = summary_tokenizer.texts_to_sequences(dataset["summary"])
## padding sequence
summary = TFprocessing.sequence.pad_sequences(summary2seq, maxlen=256, padding="post", truncating="post")

<START> malaika slam user trolled divorcing rich man <END>
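One detail worth checking: the filters string passed to the tokenizer contains "<" and ">". The sketch below emulates the word-splitting step in plain Python, under the assumption that filter characters are replaced by the split character before the sentence is split. The function split_like_tokenizer is a hypothetical helper written for this illustration, not part of TensorFlow.

```python
KERAS_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'   # same string as in the cell above

def split_like_tokenizer(sentence, filters=KERAS_FILTERS, split=" "):
    """Approximate the tokenizer's splitting: each filter character becomes a separator."""
    table = str.maketrans({ch: split for ch in filters})
    return [word for word in sentence.translate(table).split(split) if word]

words = split_like_tokenizer("<START> malaika slam user trolled divorcing rich man <END>")
print(words)
# ['START', 'malaika', 'slam', 'user', 'trolled', 'divorcing', 'rich', 'man', 'END']
```

If the tokenizer behaves as assumed here, the special tokens are stored in summary_dic_vocabulary as START and END, without the angle brackets, so any later lookup should use those keys (or the filters string should omit "<" and ">").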


In [5]:
# check an example from the previous two blocks of code
print(text[8])
print(summary[8])

[1720 1167  842 ...    0    0    0]
[   1 7620 1428 6458 1051  539 1121 9093  148  338    2    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0 

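This block of code uses the train_test_split() function from the scikit-learn library to split the preprocessed text and summary data into training and testing sets. With test_size=0.2, 20% of the rows are held out for testing, and shuffle=True shuffles the rows before the split.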

In [6]:
X_train, X_test, y_train, y_test = train_test_split(text, summary, test_size=0.2, shuffle = True)
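The split above is not seeded, so each run produces a different partition. As a rough illustration of what a shuffled 80/20 split does, and of how a fixed seed makes it repeatable, here is a pure-Python sketch. The helper shuffled_split and the toy data are hypothetical; with scikit-learn the usual way to get the same effect is to pass random_state to train_test_split().

```python
import math
import random

def shuffled_split(n_rows, test_size=0.2, seed=0):
    """Return (train_indices, test_indices) after shuffling the row indices once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # fixed seed -> same split on every run
    n_test = math.ceil(n_rows * test_size)    # number of held-out rows
    return indices[n_test:], indices[:n_test]

toy_text = [f"text{i}" for i in range(10)]        # made-up inputs
toy_summary = [f"summary{i}" for i in range(10)]  # made-up targets

train_idx, test_idx = shuffled_split(len(toy_text))
# The same indices select from both lists, so each text stays paired with its summary
toy_X_test = [toy_text[i] for i in test_idx]
toy_y_test = [toy_summary[i] for i in test_idx]
print(len(train_idx), len(test_idx))   # 8 2
```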

This block of code defines a Seq2Seq model using Keras. The model architecture consists of an encoder and a decoder. The encoder consists of an input layer, an embedding layer, and an LSTM layer. The decoder consists of another input layer, an embedding layer, an LSTM layer, and a dense layer with a softmax activation applied at every time step. The final hidden and cell states of the encoder LSTM are passed to the decoder LSTM as its initial state.

The model takes two inputs, the input sequence (x) and the target sequence (y), and outputs the predicted target sequence (y). It is compiled with the RMSprop optimizer and the sparse categorical cross-entropy loss function, and accuracy is used as a metric. The model summary is displayed at the end.

As the model summary shows, the model has 396,732,106 trainable parameters.
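The parameter total can be reproduced from the two embedding counts printed in the model summary together with lstm_units and embeddings_size. The sketch below is plain arithmetic; the LSTM and Dense formulas are the standard ones for these layer types, and the final "capped" figure is a hypothetical what-if, not something the notebook does.

```python
embeddings_size = 300
lstm_units = 250

# Embedding parameter counts as printed by model.summary()
x_embedding_params = 268_015_200
y_embedding_params = 69_481_800

# An embedding layer holds input_dim * output_dim weights, so the vocabulary sizes follow
text_vocab = x_embedding_params // embeddings_size       # 893_384
summary_vocab = y_embedding_params // embeddings_size    # 231_606

# One LSTM layer: 4 gates, each with input weights, recurrent weights, and a bias
lstm_params = 4 * (lstm_units * (embeddings_size + lstm_units) + lstm_units)   # 551_000

# Dense output layer: one weight per (LSTM unit, output word) plus one bias per output word
dense_params = (lstm_units + 1) * summary_vocab          # 58_133_106

total = x_embedding_params + y_embedding_params + 2 * lstm_params + dense_params
print(total)   # 396732106

# Hypothetical what-if: size the embeddings and the output layer by num_words=10000
num_words = 10_000
capped_total = 2 * num_words * embeddings_size + 2 * lstm_params + (lstm_units + 1) * num_words
print(capped_total)   # 9612000
```

Almost all parameters sit in the two embedding tables and the output layer, which are sized by the full vocabulary. Assuming the tokenizers never emit an index of 10000 or above (which is what num_words=10000 is meant to enforce), most of those rows are never used, so sizing the layers by num_words instead would likely shrink the model considerably.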

Background on RNNs and LSTMs:

[LSTM](https://drive.google.com/file/d/1gSBaOkAWy9JGfp6IgZahw5_1hbaJV-b6/view?usp=sharing)

[LSTM Lecture](https://drive.google.com/file/d/1aC1NR3aw3AzrOKkbORfr0HDiy0GcV4JA/view?usp=sharing)

In [8]:
lstm_units = 250
embeddings_size = 300
##------------ ENCODER (embedding + lstm) ------------------------##
x_input_layer = layers.Input(name="x_input_layer", shape=(X_train.shape[1],))
### embedding
x_embedding = layers.Embedding(name="x_embedding_layer", input_dim=len(text_dic_vocabulary),output_dim=embeddings_size, trainable=True)
embedding_layer = x_embedding(x_input_layer)

### lstm 
x_lstm_layer = layers.LSTM(name="x_lstm_layer", units=lstm_units, dropout=0.4, return_sequences=True, return_state=True)
x_out, state_h, state_c = x_lstm_layer(embedding_layer)

##------------ DECODER (embedding + lstm + dense) ----------------##
y_input_layer = layers.Input(name="y_input_layer", shape=(None,))

### embedding
y_embedding = layers.Embedding(name="y_embedding_layer", input_dim=len(summary_dic_vocabulary), output_dim=embeddings_size, trainable=True)
y_embedding_layer = y_embedding(y_input_layer)

### lstm 
y_lstm_layer = layers.LSTM(name="y_lstm_layer", units=lstm_units, dropout=0.4, return_sequences=True, return_state=True)
y_out, _, _ = y_lstm_layer(y_embedding_layer, initial_state=[state_h, state_c])

### final dense layers
dense_layer = layers.TimeDistributed(name="dense", layer=layers.Dense(units=len(summary_dic_vocabulary), activation='softmax'))
y_out = dense_layer(y_out)

##---------------------------- COMPILE ---------------------------##
model = models.Model(inputs=[x_input_layer, y_input_layer], outputs=y_out, name="Seq2Seq")
model.compile(optimizer='rmsprop',loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()

Model: "Seq2Seq"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 x_input_layer (InputLayer)     [(None, 1024)]       0           []                               
                                                                                                  
 y_input_layer (InputLayer)     [(None, None)]       0           []                               
                                                                                                  
 x_embedding_layer (Embedding)  (None, 1024, 300)    268015200   ['x_input_layer[0][0]']          
                                                                                                  
 y_embedding_layer (Embedding)  (None, None, 300)    69481800    ['y_input_layer[0][0]']          
                                                                                            

This block of code trains the Seq2Seq model using the fit() method of the model object, with X_train as the encoder input and y_train as the source of both the decoder input and the target. The decoder input is y_train without its last column, and the target is y_train without its first column (the start token), reshaped to add a trailing dimension, so at each position the model learns to predict the next token. Training runs for at most 100 epochs with a batch size of 16. The validation split is set to 0.3, which means 30% of the training data is used for validation. Setting verbose=1 displays the training progress. The EarlyStopping callback stops training if the validation loss does not improve for 2 consecutive epochs.
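The one-step shift between decoder input and target is easier to see on a toy example. The sketch below uses plain lists with made-up token ids (1 for START, 2 for END, 0 for padding) and applies the same two slices as the training cell.

```python
# Made-up padded summaries: 1 = START, 2 = END, 0 = <PAD>
toy_summary = [
    [1, 7, 8, 9, 2, 0],
    [1, 5, 2, 0, 0, 0],
]

decoder_input = [row[:-1] for row in toy_summary]    # same slice as y_train[:, :-1]
decoder_target = [row[1:] for row in toy_summary]    # same slice as y_train[:, 1:]

# At every step the decoder sees the true token so far and must predict the next one
for step, (seen, expected) in enumerate(zip(decoder_input[0], decoder_target[0])):
    print(f"step {step}: decoder sees {seen}, should predict {expected}")
```

Feeding the ground-truth previous token at each step is commonly called teacher forcing. Note that padded positions are also part of the target here, so they contribute to the loss and to the accuracy metric.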

In [None]:
## train
training = model.fit(x=[X_train, y_train[:,:-1]], 
                     y=y_train.reshape(y_train.shape[0], y_train.shape[1], 1)[:,1:],
                     batch_size=16, 
                     epochs=100,  
                     verbose=1, 
                     validation_split=0.3,
                     callbacks=[callbacks.EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=2)])

Epoch 1/100
    6/14521 [..............................] - ETA: 242:21:01 - loss: 10.9173 - accuracy: 0.7563
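To clarify what patience=2 means for the EarlyStopping callback, here is a small pure-Python emulation of the stopping rule: training stops once the validation loss has failed to beat its best value for 2 epochs in a row. The function stopping_epoch and the loss values are made up for illustration and only approximate the callback's behaviour with its default settings.

```python
def stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None if it runs to the end."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0     # improvement: remember it and reset the counter
        else:
            wait += 1                # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

toy_val_losses = [2.10, 1.80, 1.75, 1.77, 1.76, 1.70]   # made-up validation losses
print(stopping_epoch(toy_val_losses))   # 5
```

In this toy run the best loss is reached at epoch 3 and training stops at epoch 5, so the better value at epoch 6 is never reached. With the callback as configured in the training cell, the model keeps the weights from the last epoch that ran; passing restore_best_weights=True to EarlyStopping is the usual way to roll back to the best epoch instead.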

In [None]:
## plot loss and accuracy
metrics = [k for k in training.history.keys() if ("loss" not in k) and ("val" not in k)]
fig, ax = plt.subplots(nrows=1, ncols=2, sharey=True)
ax[0].set(title="Training")
ax11 = ax[0].twinx()
ax[0].plot(training.history['loss'], color='black')
ax[0].set_xlabel('Epochs')
ax[0].set_ylabel('Loss', color='black')
for metric in metrics:
    ax11.plot(training.history[metric], label=metric)
ax11.set_ylabel("Score", color='steelblue')
ax11.legend()
ax[1].set(title="Validation")
ax22 = ax[1].twinx()
ax[1].plot(training.history['val_loss'], color='black')
ax[1].set_xlabel('Epochs')
ax[1].set_ylabel('Loss', color='black')
for metric in metrics:
    ax22.plot(training.history['val_'+metric], label=metric)
ax22.set_ylabel("Score", color="steelblue")
plt.show()