<a href="https://colab.research.google.com/github/johnprasanth93/Machine_Learning_Projects/blob/master/BiLSTM_Story_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Import Libraries**

In [1]:
import numpy as np
import pandas as pd
import re
import os

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow import keras
from tensorflow.keras.layers import *
import tensorflow as tf
import warnings
warnings.filterwarnings('ignore')

**Load the data and merge the five sentence columns into a single `sentences` column**

In [2]:
df = pd.read_csv("/content/ROCStories_winter2017.csv")

In [3]:
len(df)

52665

In [4]:
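merge_df = df.copy()  # copy, so that adding a column does not modify df
# Join the five sentence columns, separated by single spaces, into one story string.
merge_df['sentences'] = merge_df['sentence1']+' '+merge_df['sentence2']+' '+merge_df['sentence3']+' '+merge_df['sentence4']+' '+merge_df['sentence5']
merge_df.head()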

Unnamed: 0,storyid,storytitle,sentence1,sentence2,sentence3,sentence4,sentence5,sentences
0,8bbe6d11-1e2e-413c-bf81-eaea05f4f1bd,David Drops the Weight,David noticed he had put on a lot of weight re...,He examined his habits to try and figure out t...,He realized he'd been eating too much fast foo...,He stopped going to burger places and started ...,"After a few weeks, he started to feel much bet...",David noticed he had put on a lot of weight re...
1,0beabab2-fb49-460e-a6e6-f35a202e3348,Frustration,Tom had a very short temper.,One day a guest made him very angry.,He punched a hole in the wall of his house.,Tom's guest became afraid and left quickly.,Tom sat on his couch filled with regret about ...,Tom had a very short temper. One day a guest m...
2,87da1a22-df0b-410c-b186-439700b70ba6,Marcus Buys Khakis,Marcus needed clothing for a business casual e...,All of his clothes were either too formal or t...,He decided to buy a pair of khakis.,The pair he bought fit him perfectly.,Marcus was happy to have the right clothes for...,Marcus needed clothing for a business casual e...
3,2d16bcd6-692a-4fc0-8e7c-4a6f81d9efa9,Different Opinions,Bobby thought Bill should buy a trailer and ha...,Bill thought a truck would be better for what ...,Bobby pointed out two vehicles were much more ...,Bill was set in his ways with conventional thi...,He ended up buying the truck he wanted despite...,Bobby thought Bill should buy a trailer and ha...
4,c71bb23b-7731-4233-8298-76ba6886cee1,Overcoming shortcomings,John was a pastor with a very bad memory.,He tried to memorize his sermons many days in ...,He decided to learn to sing to overcome his ha...,He then made all his sermons into music and sa...,His congregation was delighted and so was he.,John was a pastor with a very bad memory. He t...


**Finding the maximum story length (in words) in the dataset**

In [5]:
max_len_sentence = 0
# Scan every story and keep the largest word count (words are runs matched by \w+).
for i in range(len(merge_df)):
  res = len(re.findall(r'\w+', merge_df.loc[i,'sentences']))
  if res > max_len_sentence:
    max_len_sentence = res

print(max_len_sentence)

74
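The count above uses the regex `\w+`, which treats apostrophes and other punctuation as separators, so it can differ from a whitespace split of the same text. A minimal sketch of the difference, using one sentence from the first story (the helper names are illustrative and not part of the notebook):

```python
import re

def count_regex_words(text):
    """Count runs of word characters, as the loop above does."""
    return len(re.findall(r"\w+", text))

def count_whitespace_tokens(text):
    """Count tokens produced by a plain whitespace split."""
    return len(text.split())

sample = "He realized he'd been eating too much fast food lately."
# "he'd" counts as two regex words ("he", "d") but one whitespace token.
print(count_regex_words(sample), count_whitespace_tokens(sample))
```

This is one reason the maximum found here need not match the sequence lengths produced later by a tokenizer that splits on whitespace.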


### **Text Data Preprocessing**


In [6]:
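# Only the characters #$^&* are filtered out, so punctuation such as . and , stays attached to words.
tokenizer = Tokenizer(filters='#$^&*')
sentences = []
# Use the first 3001 stories (.loc slicing includes both ends); raise the upper bound to use more data.
for i in merge_df.loc[0:3000,'sentences'].values: 
  corpus = i.lower().split('\n')  # a one-element list unless the story contains a newline
  tokenizer.fit_on_texts(corpus)
  sentences.append(corpus)

# +1 because index 0 is reserved (it is used for padding later).
total_unique_words = len(tokenizer.word_index)+1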


In [7]:
display(len(tokenizer.word_index),total_unique_words)

13080

13081
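Because the tokenizer is created with `filters='#$^&*'`, periods and commas are not stripped, so a word followed by a period is a different vocabulary entry from the bare word. The sketch below is a hand-rolled approximation of how a frequency-ranked word index is built (lowercase, whitespace split, most frequent word gets index 1, index 0 left free). It is not the Keras implementation, and `build_word_index` is a hypothetical helper.

```python
from collections import Counter

def build_word_index(texts, filters="#$^&*"):
    """Rank words by frequency; index 0 is left unused for padding."""
    # Replace each filtered character with a space before splitting.
    strip_table = str.maketrans({ch: " " for ch in filters})
    counts = Counter()
    for text in texts:
        counts.update(text.lower().translate(strip_table).split())
    # most_common() sorts by descending count; ties keep first-seen order.
    ranked = counts.most_common()
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

texts = ["He started a diet. He felt better.", "He liked the diet."]
word_index = build_word_index(texts)
vocab_size = len(word_index) + 1  # +1 for the reserved index 0
print(word_index, vocab_size)
```

In this toy corpus `diet.` is a vocabulary entry while `diet` is not, which mirrors how punctuation inflates the vocabulary when it is left attached.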

In [8]:
sentences[0]

["david noticed he had put on a lot of weight recently. he examined his habits to try and figure out the reason. he realized he'd been eating too much fast food lately. he stopped going to burger places and started a vegetarian diet. after a few weeks, he started to feel much better."]

**Converting each story to a sequence of integer token IDs**

In [9]:
embeddings = []
for line in sentences:
  embeddings.append(tokenizer.texts_to_sequences(line)[0])

In [10]:
embeddings[0]

[394,
 198,
 5,
 14,
 94,
 16,
 3,
 127,
 10,
 968,
 2389,
 5,
 4587,
 9,
 6684,
 2,
 174,
 6,
 1318,
 28,
 1,
 6685,
 5,
 115,
 265,
 84,
 376,
 176,
 144,
 447,
 206,
 2390,
 5,
 279,
 81,
 2,
 1808,
 3523,
 6,
 69,
 3,
 6686,
 1423,
 39,
 3,
 132,
 3524,
 5,
 69,
 2,
 302,
 144,
 458]

**Splitting each sequence into n-gram prefixes (used to predict the next word)**

In [11]:
input_sequences = []

for embedding in embeddings:
  for j in range(1,len(embedding)):
    n_gram_sequence = embedding[:j+1]
    input_sequences.append(n_gram_sequence)

In [12]:
print(len(input_sequences))

134008
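To make the prefix expansion concrete, here is a minimal sketch on the first four token IDs of the story shown earlier. `prefix_ngrams` is an illustrative helper that mirrors the loop above; in each prefix the last ID is the word to predict and the rest is the context.

```python
def prefix_ngrams(token_ids):
    """Return every prefix of length >= 2; the last item of each is the target."""
    return [token_ids[: end + 1] for end in range(1, len(token_ids))]

story = [394, 198, 5, 14]  # first four token IDs of the first story
for seq in prefix_ngrams(story):
    print(seq[:-1], "->", seq[-1])
```

A story of n tokens therefore contributes n - 1 training samples, which is how roughly 3000 stories expand to 134008 sequences.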


**Finding the maximum sequence length (in tokens)**

In [13]:
# Note: the name `max` shadows Python's built-in max(); it is kept because later cells use it.
max = 0
for x in embeddings:
  if len(x) > max:
    max = len(x)
print(max)

69


In [14]:
merge_df['sentences'].values.reshape(-1,1)

array([["David noticed he had put on a lot of weight recently. He examined his habits to try and figure out the reason. He realized he'd been eating too much fast food lately. He stopped going to burger places and started a vegetarian diet. After a few weeks, he started to feel much better."],
       ["Tom had a very short temper. One day a guest made him very angry. He punched a hole in the wall of his house. Tom's guest became afraid and left quickly. Tom sat on his couch filled with regret about his actions."],
       ['Marcus needed clothing for a business casual event. All of his clothes were either too formal or too casual. He decided to buy a pair of khakis. The pair he bought fit him perfectly. Marcus was happy to have the right clothes for the event.'],
       ...,
       ['Janice was out exercising for her big soccer game. She was doing some drills with her legs. While working out and exercising she slips on the grass. She falls down and uses her wrist to break her fall. She 

**Pre-padding the sequences [each sequence is left-padded with zeros to the length of the longest sequence]**

In [15]:
input_sequences = np.array(pad_sequences(input_sequences, maxlen =max, padding='pre'))

In [16]:
input_sequences.shape

(134008, 69)
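Pre-padding keeps the target word in the last column of every row, which is why the labels can later be read from `input_sequences[:,-1]`. The following is a plain-Python sketch of the idea, not the Keras `pad_sequences` implementation. `pad_pre` is a hypothetical helper, and it assumes `maxlen` is positive.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad with `value`; longer sequences keep only their last maxlen items."""
    padded = []
    for seq in sequences:
        trimmed = list(seq)[-maxlen:]
        padded.append([value] * (maxlen - len(trimmed)) + trimmed)
    return padded

rows = pad_pre([[394, 198], [394, 198, 5]], maxlen=5)
print(rows)
# The final column still holds the word to predict for each row.
print([row[-1] for row in rows])
```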

In [17]:
count = len(input_sequences)

In [18]:
input_sequences[:,-1]

array([  198,     5,    14, ...,    57, 13080,   659], dtype=int32)

**Generating X values and labels**

In [19]:
x_value = input_sequences[:,:-1]
x_value.shape

(134008, 68)

In [20]:
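# Adds a trailing axis of size 1; note that an Embedding layer takes 2-D (batch, timesteps) integer input.
x_value = np.array(x_value).reshape((x_value.shape[0],x_value.shape[1],1))
x_value.shape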

(134008, 68, 1)

In [21]:
labels =input_sequences[:,-1]
len(labels)

134008

**One-hot encoding the labels (one class per vocabulary word)**

In [22]:
y_values = tf.keras.utils.to_categorical(labels, num_classes=total_unique_words)

In [23]:
y_values.shape

(134008, 13081)
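A one-hot label matrix of this shape is large. The sketch below shows what one encoded row looks like and estimates the memory needed for the full matrix, assuming 4-byte `float32` entries. `one_hot` is an illustrative helper, not the Keras function.

```python
def one_hot(label, num_classes):
    """Return a list with 1.0 at position `label` and 0.0 elsewhere."""
    row = [0.0] * num_classes
    row[label] = 1.0
    return row

print(one_hot(2, 5))

# Rough size of the dense label matrix, assuming 4 bytes per entry.
n_samples, n_classes = 134008, 13081
n_bytes = n_samples * n_classes * 4
print(f"{n_bytes / 1e9:.2f} GB")
```

If that does not fit in memory, keeping the labels as integers and training with a sparse categorical cross-entropy loss is a common alternative, though it would be a change from the setup used in this notebook.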

## **Training the Neural Network**

In [24]:
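import tensorflow as tf
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Flatten, Dense
from tensorflow.keras.models import Sequential

def create_model():
  model = Sequential()
  # Map each token ID to a 200-dimensional vector; inputs have max-1 timesteps.
  model.add(Embedding(total_unique_words, 200, input_length=max-1))
  # Bidirectional LSTM: 150 units per direction, one output per timestep.
  model.add(Bidirectional(LSTM(150, return_sequences=True)))
  model.add(Flatten())
  # Softmax over the whole vocabulary gives next-word probabilities.
  model.add(Dense(total_unique_words, activation='softmax'))
  return model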

In [25]:
model = create_model()
adam = tf.keras.optimizers.Adam(learning_rate=0.01)
model.compile(loss='categorical_crossentropy', optimizer = adam, metrics = ['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 68, 200)           2616200   
_________________________________________________________________
bidirectional (Bidirectional (None, 68, 300)           421200    
_________________________________________________________________
flatten (Flatten)            (None, 20400)             0         
_________________________________________________________________
dense (Dense)                (None, 13081)             266865481 
Total params: 269,902,881
Trainable params: 269,902,881
Non-trainable params: 0
_________________________________________________________________
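The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers with the sizes from this notebook; the variable names are illustrative.

```python
vocab_size, embed_dim, lstm_units, timesteps = 13081, 200, 150, 68

embedding_params = vocab_size * embed_dim
# One LSTM has 4 gates, each with input weights, recurrent weights and a bias.
lstm_params = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
bilstm_params = 2 * lstm_params  # forward and backward copies
# Flatten concatenates the 2 * 150 outputs of all 68 timesteps.
flat_features = timesteps * 2 * lstm_units
dense_params = flat_features * vocab_size + vocab_size  # weights + biases

total = embedding_params + bilstm_params + dense_params
print(embedding_params, bilstm_params, dense_params, total)
print(f"dense share: {dense_params / total:.1%}")
```

Almost all of the parameters sit in the final dense layer, because `Flatten` feeds it the outputs of every timestep rather than a single summary vector.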


In [26]:
# Early stopping: halt training once training accuracy has not improved for 4 consecutive epochs
early_stop = tf.keras.callbacks.EarlyStopping(patience=4, monitor='accuracy')

In [27]:
history=model.fit(
    x_value, y_values, 
    epochs = 100, verbose= 1,
    batch_size=128,
    callbacks = [early_stop]
    )

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100


In [28]:
model.save('my_model')

## Generating a story with the trained model

In [29]:
init_text = "david"
gen_words = 40

for i in range(gen_words):
  # Encode the text generated so far and left-pad it to the model's input length.
  token_list = tokenizer.texts_to_sequences([init_text])[0]
  token_list = pad_sequences([token_list], maxlen =max-1, padding='pre')
  # predict_classes is deprecated; take the argmax of the softmax output instead.
  pred = int(np.argmax(model.predict(token_list), axis=-1)[0])
  # Map the predicted index back to its word (empty string if the index is unknown).
  output_word = tokenizer.index_word.get(pred, "")
  init_text += " "+ output_word

print(init_text)

david was faced a particularly difficult exam. he expected to just pass based out it. however, he saw a person carelessly showing his exam. noble david decided to not cheat. unfortunately, david did not pass this difficult exam. to be that
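The generation loop above is a form of greedy decoding: at each step it appends the single most probable next word. The sketch below shows the same control flow without TensorFlow. `toy_probs` is a hypothetical lookup table standing in for the trained model, and unlike the BiLSTM it conditions only on the last word rather than on the whole padded prefix.

```python
def greedy_generate(seed, next_word_probs, n_words):
    """Repeatedly append the most probable next word given the last word."""
    words = seed.split()
    for _ in range(n_words):
        probs = next_word_probs.get(words[-1])
        if not probs:  # no known continuation: stop early
            break
        words.append(max(probs, key=probs.get))
    return " ".join(words)

# Hypothetical toy probabilities standing in for the model's softmax output.
toy_probs = {
    "david": {"was": 0.7, "did": 0.3},
    "was": {"happy": 0.6, "sad": 0.4},
}
print(greedy_generate("david", toy_probs, n_words=5))
```

Greedy decoding is deterministic, so the same seed text always produces the same continuation.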
