<a href="https://colab.research.google.com/github/lalitpandey02/PythonNotebooks/blob/main/WordLevel_LSTM0226.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center><img src="https://github.com/insaid2018/Term-1/blob/master/Images/INSAID_Full%20Logo.png?raw=true" width="240" height="100" /></center>

<center><h1>Introduction to Word Level Language Modelling (Practical Implementation)</h1></center>

---
# **Table of Contents**
---

**1.** [**Introduction**](#Section1)<br>
**2.** [**Problem Statement**](#Section2)<br>
**3.** [**Installing & Importing Libraries**](#Section3)<br>
**4.** [**Data Acquisition & Description**](#Section4)<br>
**5.** [**Data Preprocessing**](#Section5)<br>
**6.** [**Train Language Model**](#section6)<br>
  - **6.1** [**Load Sequences**](#section601)
  - **6.2** [**Encode Sequences**](#section602)
  - **6.3** [**Sequence Inputs and Output**](#section603)
  - **6.4** [**Fit Model**](#section604)

**7.** [**Use Language Model**](#section7)<br>
  - **7.1** [**Load the data**](#section701)
  - **7.2** [**Load Model**](#section702)
  - **7.3** [**Generate Text**](#section703)

**8.** [**Conclusion**](#section8)<br>

---
<a name = Section1></a>
# **1. Introduction**
---

- **Language models** learn and **predict** one word at a time. The **training** of the network involves **providing** sequences of words as **input** that are processed one at a time where a **prediction** can be made and learned for each **input sequence**.

- Neural Language Models (NLM) address the **N-gram data sparsity** issue through **parameterization** of words as **vectors** (word embeddings) and using them as inputs to a neural network.

- Word **embeddings** obtained through NLMs **exhibit** the **property** whereby semantically close **words** are likewise **close** in the induced **vector space**.
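
To make the "close in the vector space" idea concrete, here is a minimal sketch using cosine similarity. The three 3-dimensional vectors are hand-picked values for illustration only, not embeddings learned by any model, and `cosine_similarity` is a helper defined here.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings (hand-picked, not learned)
toy_embeddings = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.75, 0.20],
    "apple": [0.10, 0.20, 0.90],
}

sim_related = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
sim_unrelated = cosine_similarity(toy_embeddings["king"], toy_embeddings["apple"])
print(f"king vs queen: {sim_related:.3f}")
print(f"king vs apple: {sim_unrelated:.3f}")
```

In a trained model the vectors have many more dimensions (50 in this notebook), but the comparison works the same way.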

---
<a name = Section2></a>
# **2. Problem Statement**
---

- The **problem statement** is to train a **language model** on the given text and then **generate** text given an input text in such a way that it looks **straight** out of this document and is **grammatically** correct and **legible** to read.

* For this, we need to develop a **word-level** neural language **model** and use it to generate text.

* A **language model** can predict the probability of the next word in the sequence, based on the **words** already **observed** in the sequence.

* **Neural network models** are a preferred method for **developing statistical language models** because they can use a **distributed representation** where different words with similar meanings have **similar representation**.

- Also, it is because they can use a **large context** of recently observed words when **making predictions**.
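
As a rough illustration of what "predicting the probability of the next word" means, the sketch below estimates it from bigram counts on a tiny made-up sentence. This is a count-based stand-in, not the neural model built in this notebook; `next_word_probs` and the sample text are assumptions for illustration.

```python
from collections import Counter, defaultdict

sample_tokens = "the just man is happy and the unjust man is miserable".split()

# Count how often each word follows each context word
bigram_counts = defaultdict(Counter)
for prev_word, next_word in zip(sample_tokens, sample_tokens[1:]):
    bigram_counts[prev_word][next_word] += 1

def next_word_probs(prev_word):
    """Return P(next word | prev_word) estimated from the counts."""
    counts = bigram_counts[prev_word]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

probs_after_the = next_word_probs("the")
probs_after_is = next_word_probs("is")
print(probs_after_the)
print(probs_after_is)
```

A neural language model plays the same role, but it conditions on a much longer context and shares information between similar words through their embeddings.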

---
<a name = Section3></a>
# **3. Installing and Importing Libraries**
---

In [11]:
# Import tensorflow 2.x
# This code block will only work in Google Colab.
try:
    # %tensorflow_version only exists in Colab.
    %tensorflow_version 2.x
except Exception:
    pass

Colab only includes TensorFlow 2.x; %tensorflow_version has no effect.


In [12]:
from random import randint
from pickle import load
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences
from numpy import array
from pickle import dump
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Embedding
import string

---
<a name = Section4></a>
# **4. Data Acquisition & Description**
---

- **The Republic by Plato**
<br>
<center> <img src="https://raw.githubusercontent.com/insaid2018/Term-1/master/Images/socrates.JPG" /></center>
<br>

-  The Republic is the **classical Greek philosopher Plato’s** most famous work.

- It is **structured** as a **dialog** (i.e., a conversation) on the topic of **order and justice** within a city state.

- Download the ASCII **text version** of the entire book (or books) here: [The Republic](https://www.gutenberg.org/ebooks/1497) and save it as *republic.txt*.

- Open the file in a **text editor** and delete the **front** and **back** matter. 

- This includes details about the **book** at the beginning, a **long analysis**, and **license** information at the end.
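
If you prefer to trim the file in code rather than in a text editor, something like the following sketch could work. It assumes the download contains Project Gutenberg's usual `*** START OF ...` and `*** END OF ...` marker lines; `strip_gutenberg_boilerplate` is a hypothetical helper, and the long analysis that precedes Book I would still need to be removed separately.

```python
import re

def strip_gutenberg_boilerplate(text):
    """Keep only the text between the START and END marker lines, if present."""
    start = re.search(r"^\*\*\* ?START OF.*$", text, flags=re.MULTILINE)
    end = re.search(r"^\*\*\* ?END OF.*$", text, flags=re.MULTILINE)
    begin_idx = start.end() if start else 0
    end_idx = end.start() if end else len(text)
    return text[begin_idx:end_idx].strip()

# Tiny synthetic stand-in for a downloaded file
raw_text = (
    "Header and license notes\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK THE REPUBLIC ***\n"
    "BOOK I.\n"
    "I went down yesterday to the Piraeus\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK THE REPUBLIC ***\n"
    "More license text\n"
)
body = strip_gutenberg_boilerplate(raw_text)
print(body)
```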

In [13]:
import urllib.request  # importing only `urllib` does not reliably expose `urllib.request`
response = urllib.request.urlopen('https://raw.githubusercontent.com/insaid2018/DeepLearning/master/Data/republic_clean.txt')
doc = response.read().decode('utf8')
print(doc[:800])

BOOK I.

I went down yesterday to the Piraeus with Glaucon the son of Ariston,
that I might offer up my prayers to the goddess (Bendis, the Thracian
Artemis.); and also because I wanted to see in what manner they would
celebrate the festival, which was a new thing. I was delighted with the
procession of the inhabitants; but that of the Thracians was equally,
if not more, beautiful. When we had finished our prayers and viewed the
spectacle, we turned in the direction of the city; and at that instant
Polemarchus the son of Cephalus chanced to catch sight of us from a
distance as we were starting on our way home, and told his servant to
run and bid us wait for him. The servant took hold of me by the cloak
behind, and said: Polemarchus desires you to wait.

I turn


---
<a name = Section5></a>
# **5. Data Preprocessing**
---

We'll be using the following **process sequence** in this notebook:

<br>   
<center><img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/word_lstm_flow0.png" width="600" height="400"/></center>

<br>    


#### Clean Text

* **Replace `--`** with a white space so we can split words better.

* **Split words** based on **white space**.

* Remove all **punctuation** from **words** to reduce the vocabulary size (e.g. ‘What?’ becomes ‘What’).

* **Remove all words** that are not alphabetic to remove standalone **punctuation tokens**.

* Normalize **all words** to **lowercase** to reduce the **vocabulary size**.

In [14]:


# turn a doc into clean tokens
def clean_doc(doc):
    # replace '--' with a space ' '
    doc = doc.replace('--', ' ')
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # make lower case
    tokens = [word.lower() for word in tokens]
    return tokens
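
As a quick check of the punctuation and filtering steps in `clean_doc`, the sketch below applies the same `str.maketrans` / `translate` pattern to a few hand-picked tokens. The sample tokens are made up for illustration.

```python
import string

# Translation table that deletes every character in string.punctuation
table = str.maketrans('', '', string.punctuation)

sample_tokens = ['What?', "don't", 'well-ordered', '(Bendis,', '...']
stripped = [w.translate(table) for w in sample_tokens]
print(stripped)

# Tokens that end up empty or non-alphabetic are dropped by the isalpha() filter
kept = [w.lower() for w in stripped if w.isalpha()]
print(kept)
```

Note that hyphenated words and contractions are merged into a single token (`wellordered`, `dont`) rather than split, which is a side effect of deleting punctuation after splitting on white space.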

In [15]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [16]:
# clean document
tokens = clean_doc(doc)
print(tokens[:200])
print('Total Tokens: %d' % len(tokens))
print('Unique Tokens: %d' % len(set(tokens)))

['book', 'i', 'i', 'went', 'down', 'yesterday', 'to', 'the', 'piraeus', 'with', 'glaucon', 'the', 'son', 'of', 'ariston', 'that', 'i', 'might', 'offer', 'up', 'my', 'prayers', 'to', 'the', 'goddess', 'bendis', 'the', 'thracian', 'artemis', 'and', 'also', 'because', 'i', 'wanted', 'to', 'see', 'in', 'what', 'manner', 'they', 'would', 'celebrate', 'the', 'festival', 'which', 'was', 'a', 'new', 'thing', 'i', 'was', 'delighted', 'with', 'the', 'procession', 'of', 'the', 'inhabitants', 'but', 'that', 'of', 'the', 'thracians', 'was', 'equally', 'if', 'not', 'more', 'beautiful', 'when', 'we', 'had', 'finished', 'our', 'prayers', 'and', 'viewed', 'the', 'spectacle', 'we', 'turned', 'in', 'the', 'direction', 'of', 'the', 'city', 'and', 'at', 'that', 'instant', 'polemarchus', 'the', 'son', 'of', 'cephalus', 'chanced', 'to', 'catch', 'sight', 'of', 'us', 'from', 'a', 'distance', 'as', 'we', 'were', 'starting', 'on', 'our', 'way', 'home', 'and', 'told', 'his', 'servant', 'to', 'run', 'and', 'bid',

In [17]:
# organize into sequences of tokens
length = 50 + 1
sequences = list()
for i in range(length, len(tokens)):
    # select sequence of tokens
    seq = tokens[i-length:i]
    # convert into a line
    line = ' '.join(seq)
    # store
    sequences.append(line)
print('Total Sequences: %d' % len(sequences))

Total Sequences: 118633


In [18]:
sequences[:5]

['book i i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was',
 'i i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was delighted',
 'i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was delighted with',
 'went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what man

**Observations:**

- Transforming the tokens into **space-separated strings** for later storage in a file.

- Splitting the list of **clean tokens** into **sequences**.
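
The windowing loop above can be pictured on a short token list. This sketch uses a window of 3 input words plus 1 output word instead of 50 + 1, and the token list is made up for illustration.

```python
toy_tokens = ['i', 'went', 'down', 'yesterday', 'to', 'the', 'piraeus']
window = 3 + 1  # 3 input words + 1 output word

toy_sequences = []
# Same loop bounds as the notebook cell: i runs from window to len(tokens) - 1
for i in range(window, len(toy_tokens)):
    toy_sequences.append(' '.join(toy_tokens[i - window:i]))

for line in toy_sequences:
    print(line)
```

Each line overlaps the previous one by all but one word. With these loop bounds the window ending at the very last token is not emitted; `range(window, len(toy_tokens) + 1)` would include it.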

In [19]:
# save sequences to file, one sequence per line
def save_doc(lines, filename):
    data = '\n'.join(lines)
    # the context manager closes the file even if the write fails
    with open(filename, 'w') as file:
        file.write(data)
    
# save sequences to file
out_filename = 'republic_sequences.txt'
save_doc(sequences, out_filename)

----
<a id=section6></a>
## **6. Train Language Model**
----


<br>   
<center><img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/word_lstm_flow4.png" width="700" height="400"/></center>

<br>    

* The model uses a **distributed** representation for words so that different words with similar meanings will have a similar representation.

* It **learns** the **representation** at the same time as **learning the model.**

* It **learns** to predict the **probability** for the next **word** using the **context** of the last **50 words**.

- We will use an **Embedding Layer** to learn the representation of words, and a **Long Short-Term Memory (LSTM)** recurrent neural network to learn to **predict words** based on their context.

<a id=section601></a>
### **6.1 Load Sequences**

- We can load our **training data** using the **`load_doc()`** function defined below.


- Once loaded, we can **split the data into separate training sequences** by splitting based on new lines.


- The snippet below will load the **‘republic_sequences.txt‘** data file from the current working directory.

In [20]:
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load
in_filename = 'republic_sequences.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')

In [21]:
lines[:2]

['book i i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was',
 'i i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was delighted']

<a id=section602></a>
### **6.2 Encode Sequences**

- The **word embedding layer** expects input sequences to be comprised of integers.

- We can **map each word in our vocabulary** to a unique integer and encode our input sequences.

- Later, when we make predictions, we can convert the **prediction to numbers** and look up their **associated words** in the **same mapping**.
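
A rough pure-Python picture of this mapping: words are ranked by frequency and numbered from 1, which is how the `word_index` shown later in this section is ordered. `build_word_index` is a hypothetical helper and only approximates what the Keras `Tokenizer` does (it ignores filtering, lowercasing, and tie-breaking details).

```python
from collections import Counter

def build_word_index(lines):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for line in lines for word in line.split())
    ranked = [word for word, _ in counts.most_common()]
    return {word: idx for idx, word in enumerate(ranked, start=1)}

toy_lines = ['the state and the man', 'the man and justice']
word_index = build_word_index(toy_lines)
encoded = [[word_index[w] for w in line.split()] for line in toy_lines]

# Reverse mapping turns predicted integers back into words
index_word = {idx: word for word, idx in word_index.items()}
print(word_index)
print(encoded)
print([index_word[i] for i in encoded[1]])
```

Index 0 is never assigned to a word, which is why the vocabulary size used for the model is the size of the mapping plus 1.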

In [22]:

"""
First, the Tokenizer must be trained on the entire training dataset, which means it finds all of the unique words in the data and assigns each a unique integer.

We can then use the fit Tokenizer to encode all of the training sequences, converting each sequence from a list of words to a list of integers.

"""

# integer encode sequences of words
tokenizer = Tokenizer()
tokenizer.fit_on_texts(lines)
sequences = tokenizer.texts_to_sequences(lines)

- We can access the **mapping** of **words** to **integers** as a dictionary attribute called **`word_index`** on the **tokenizer** object.

- We need to know the **size** of the **vocabulary** for defining the **embedding** layer later. 

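- We can determine the vocabulary size by **calculating** the size of the **mapping dictionary** and adding 1, because word indices start at 1 and the embedding layer also needs a slot for index 0.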


In [23]:
# vocabulary size
vocab_size = len(tokenizer.word_index) + 1
vocab_size

7410

In [24]:
tokenizer.word_index

{'the': 1,
 'and': 2,
 'of': 3,
 'to': 4,
 'is': 5,
 'in': 6,
 'he': 7,
 'a': 8,
 'that': 9,
 'be': 10,
 'i': 11,
 'not': 12,
 'which': 13,
 'are': 14,
 'you': 15,
 'they': 16,
 'or': 17,
 'will': 18,
 'said': 19,
 'as': 20,
 'we': 21,
 'but': 22,
 'have': 23,
 'them': 24,
 'his': 25,
 'for': 26,
 'by': 27,
 'who': 28,
 'their': 29,
 'what': 30,
 'then': 31,
 'this': 32,
 'one': 33,
 'if': 34,
 'with': 35,
 'there': 36,
 'all': 37,
 'true': 38,
 'at': 39,
 'when': 40,
 'do': 41,
 'other': 42,
 'has': 43,
 'yes': 44,
 'any': 45,
 'him': 46,
 'no': 47,
 'good': 48,
 'would': 49,
 'may': 50,
 'state': 51,
 'from': 52,
 'man': 53,
 'say': 54,
 'our': 55,
 'only': 56,
 'was': 57,
 'an': 58,
 'must': 59,
 'should': 60,
 'so': 61,
 'more': 62,
 'us': 63,
 'can': 64,
 'on': 65,
 'were': 66,
 'very': 67,
 'now': 68,
 'like': 69,
 'such': 70,
 'replied': 71,
 'just': 72,
 'certainly': 73,
 'than': 74,
 'also': 75,
 'these': 76,
 'men': 77,
 'same': 78,
 'another': 79,
 'about': 80,
 'justice': 8

<a id=section603></a>
### **6.3 Sequence Inputs and Output**

In [25]:
# separate into input and output
sequences = array(sequences) #array slicing

X, y = sequences[:,:-1], sequences[:,-1]

#one hot encode the output word.
#Keras provides the to_categorical() that can be used to one hot encode the output words for each input-output sequence pair.

y = to_categorical(y, num_classes=vocab_size)
seq_length = X.shape[1]

In [26]:
X[0]

array([1046,   11,   11, 1045,  329, 7409,    4,    1, 2873,   35,  213,
          1,  261,    3, 2251,    9,   11,  179,  817,  123,   92, 2872,
          4,    1, 2249, 7408,    1, 7407, 7406,    2,   75,  120,   11,
       1266,    4,  110,    6,   30,  168,   16,   49, 7405,    1, 1609,
         13,   57,    8,  549,  151,   11])

In [27]:
y.shape

(118633, 7410)

In [28]:
sequences.shape

(118633, 51)

In [29]:
sequences[:,:-1]

array([[1046,   11,   11, ...,  549,  151,   11],
       [  11,   11, 1045, ...,  151,   11,   57],
       [  11, 1045,  329, ...,   11,   57, 1147],
       ...,
       [ 382,  467,    4, ..., 1044,  414,   13],
       [ 467,    4,   33, ...,  414,   13,   21],
       [   4,   33,   79, ...,   13,   21,   23]])

In [30]:
sequences[:,-1]

array([  57, 1147,   35, ...,   21,   23,   85])

In [31]:
X.shape

(118633, 50)

In [34]:
y[0]

array([0., 0., 0., ..., 0., 0., 0.], dtype=float32)
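
On a toy example the input/output split and the one-hot step look like this. The sketch uses plain Python lists in place of NumPy slicing and `to_categorical`, and the integer sequences and vocabulary size are made up.

```python
toy_sequences = [
    [1, 4, 2, 3],
    [4, 2, 3, 1],
    [2, 3, 1, 5],
]
toy_vocab_size = 6  # largest index (5) + 1, since index 0 is reserved

# Equivalent of sequences[:, :-1] and sequences[:, -1]
toy_X = [seq[:-1] for seq in toy_sequences]
toy_y = [seq[-1] for seq in toy_sequences]

def one_hot(index, num_classes):
    """Return a list of zeros with a single 1.0 at position `index`."""
    vec = [0.0] * num_classes
    vec[index] = 1.0
    return vec

toy_y_onehot = [one_hot(idx, toy_vocab_size) for idx in toy_y]
print(toy_X)
print(toy_y)
print(toy_y_onehot[0])
```

This also explains why `y[0]` above prints as mostly zeros: only one of the 7410 positions holds a 1.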

<a id=section604></a>
### **6.4 Fit Model**

- The learned **embedding** needs to know the size of the **vocabulary** and the length of **input sequences** as previously discussed.

 - The **output layer** predicts the **next word** as a single **vector** the size of the **vocabulary** with a **probability** for each word in the vocabulary.

 - A **softmax** activation function is used to **ensure** the outputs have the **characteristics** of normalized probabilities.
 
 <center><img src="https://raw.githubusercontent.com/insaid2018/Term-1/master/Images/images.png" width="400" height="150"/></center>
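
A minimal sketch of the softmax computation in plain Python. The input scores are arbitrary numbers chosen for illustration; in the model they would be the raw outputs of the final `Dense` layer.

```python
import math

def softmax(scores):
    """Turn arbitrary real-valued scores into probabilities that sum to 1."""
    shifted = [s - max(scores) for s in scores]  # subtract max for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

toy_scores = [2.0, 1.0, 0.1]
toy_probs = softmax(toy_scores)
print([round(p, 3) for p in toy_probs])
print(sum(toy_probs))
```

Larger scores get larger probabilities, and the ordering of the scores is preserved.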

In [35]:
# define model
model = Sequential()
"""
- Size of the **embedding** vector space: a parameter to specify how many dimensions will be used to represent each word

- Common values are **50, 100, and 300**. 

- We will use 50 here, but consider **testing smaller or larger values**.

- We will use two LSTM hidden layers with **200 memory cells** each, followed by a Dense layer with 200 units.

- More memory cells and a deeper network may achieve better results.
"""
model.add(Embedding(vocab_size, 50, input_length=seq_length))

model.add(LSTM(200, return_sequences=True))
model.add(LSTM(200))
model.add(Dense(200, activation='relu'))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 50, 50)            370500    
                                                                 
 lstm (LSTM)                 (None, 50, 200)           200800    
                                                                 
 lstm_1 (LSTM)               (None, 200)               320800    
                                                                 
 dense (Dense)               (None, 200)               40200     
                                                                 
 dense_1 (Dense)             (None, 7410)              1489410   
                                                                 
Total params: 2,421,710
Trainable params: 2,421,710
Non-trainable params: 0
_________________________________________________________________
None
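
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for Embedding, LSTM, and Dense layers with the sizes from this notebook (vocabulary 7410, embedding size 50, 200 units); the helper names are made up for illustration.

```python
vocab_size = 7410
embed_dim = 50
units = 200

def lstm_params(input_dim, units):
    """4 gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """Weight matrix plus one bias per unit."""
    return input_dim * units + units

layer_params = {
    'embedding': vocab_size * embed_dim,
    'lstm': lstm_params(embed_dim, units),
    'lstm_1': lstm_params(units, units),
    'dense': dense_params(units, units),
    'dense_1': dense_params(units, vocab_size),
}
total_params = sum(layer_params.values())
for name, count in layer_params.items():
    print(f'{name:10s} {count:>10,}')
print(f'{"total":10s} {total_params:>10,}')
```

More than half of the parameters sit in the final softmax layer, because its size scales with the vocabulary.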


In [36]:
# compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

**Observation:** 

- The model is compiled specifying the **categorical** cross **entropy** loss needed to fit the **model**.

- Technically, the **model** is learning a **multi-class** classification and this is the **suitable** loss function for this type of problem.

- The efficient **Adam** variant of **mini-batch** gradient descent is used as the optimizer, and **accuracy** is tracked as the evaluation metric.

- **Model Training** on the data 

In [None]:
# fit model
model.fit(X, y, batch_size=128, epochs=200)

Epoch 1/200
Epoch 2/200
Epoch 3/200
Epoch 4/200
Epoch 5/200
 75/927 [=>............................] - ETA: 6:30 - loss: 5.1153 - accuracy: 0.1501
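
To put the loss values in the log above in context: for a one-hot target, categorical cross-entropy reduces to the negative log of the probability assigned to the correct word. The sketch below computes the loss of a model that guesses uniformly over the 7410-word vocabulary, and the number of batches per epoch for 118633 sequences at batch size 128. `cross_entropy` is a helper defined here.

```python
import math

def cross_entropy(prob_of_correct_word):
    """Categorical cross-entropy for a single one-hot target (natural log)."""
    return -math.log(prob_of_correct_word)

vocab_size = 7410
n_sequences = 118633
batch_size = 128

uniform_loss = cross_entropy(1.0 / vocab_size)
steps_per_epoch = math.ceil(n_sequences / batch_size)
print(f'loss of a uniform guess: {uniform_loss:.2f}')
print(f'batches per epoch: {steps_per_epoch}')
```

A loss of about 5.1 in the fifth epoch is therefore already well below the uniform baseline, and the batch count matches the `927` shown in the progress bar.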

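- Use the **Keras model API** to save the model to the file **`model.h5`** in the current working directory.

- Later, when we load the model to make predictions, we will also need the **mapping of words to integers**. This mapping is in the **Tokenizer object**, and we can save that too **using Pickle**.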

In [32]:
# save the model to file
model.save('model.h5')
# save the tokenizer (the context manager closes the file after writing)
with open('tokenizer.pkl', 'wb') as f:
    dump(tokenizer, f)

----
<a id=section7></a>
## **7. Use Language Model**

---


<br>   
<center><img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/word_lstm_flow10.png" width="700" height="400"/></center>

<br>    

<a id=section701></a>
### **7.1 Load the data**

In [2]:
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load cleaned text sequences
in_filename = 'republic_sequences.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')


In [35]:
lines[:10]

['book i i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was',
 'i i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was delighted',
 'i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was delighted with',
 'went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what man

- We need the text so that we can choose a **source sequence** as input to the model for generating a **new sequence of text**.

- The model will require **50 words** as **input**.

- Later, we will need to specify the **expected length** of input.

- We can determine this from the **input sequences** by **calculating** the length of one line of the loaded data and **subtracting** **1** for the **expected output** word that is also on the same line.



In [1]:
seq_length = len(lines[0].split()) - 1


<a id=section702></a>
### **7.2 Load Model**

- We can now **load the model** from file.


- Keras provides the **load_model() function** for loading the model, ready for use.

In [None]:


# load the model
model = load_model('model.h5')

# load the tokenizer
tokenizer = load(open('tokenizer.pkl', 'rb'))

<a id=section703></a>
### **7.3 Generate Text**

* The first step in generating text is **preparing a seed input**.


* We will select a **random line** of text from the **input text** for this purpose. 

In [None]:
import numpy as np

In [None]:


# generate a sequence from a language model
def generate_seq(model, tokenizer, seq_length, seed_text, n_words):
    result = list()
    in_text = seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # truncate sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # predict probabilities and take the index of the most likely word
        yhat = np.argmax(model.predict(encoded, verbose=0))
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text += ' ' + out_word
        result.append(out_word)
    return ' '.join(result)


# load the model
model = load_model('model.h5')

# load the tokenizer
tokenizer = load(open('tokenizer.pkl', 'rb'))

# select a seed text
# randint is inclusive on both ends, so the upper bound must be len(lines) - 1
seed_text = lines[randint(0, len(lines) - 1)]
print("seed_text:" + '\n')
print(seed_text + '\n')

# generate new text
generated = generate_seq(model, tokenizer, seq_length, seed_text, 50)
print("generated_text:" + '\n')
print(generated)

seed_text:

that time which the poets call the threshold of old age is life harder towards the end or what report do you give of it i will tell you socrates he said what my own feeling is men of my age flock together we are birds of a feather as the

generated_text:

good and thirdly aim prevail over them the third trial of conjuring and of the fiction which they is to be unjust and reject the good man will be imagined to be sure he said and in the same time be as well as the good and tyrannical peace at
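
Inside `generate_seq`, the call to `pad_sequences` with `maxlen=seq_length` and `truncating='pre'` keeps only the most recent `seq_length` word ids, and shorter inputs are left-padded with zeros by default. The pure-Python sketch below imitates that behaviour for a single sequence; `pad_or_truncate_pre` is a hypothetical stand-in, not the Keras function.

```python
def pad_or_truncate_pre(ids, maxlen, pad_value=0):
    """Keep the last `maxlen` ids; left-pad with `pad_value` if too short."""
    trimmed = ids[-maxlen:]
    return [pad_value] * (maxlen - len(trimmed)) + trimmed

long_input = [11, 57, 1147, 35, 1, 3]   # more than maxlen ids
short_input = [11, 57]                  # fewer than maxlen ids

truncated = pad_or_truncate_pre(long_input, maxlen=4)
padded = pad_or_truncate_pre(short_input, maxlen=4)
print(truncated)
print(padded)
```

This is why the growing `in_text` string never causes a shape mismatch: each prediction only ever sees the last 50 words.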


**Observations:**

 - Printing the seed and the **generated** text **concatenated** together would make the output easier to interpret. Nevertheless, the **generated** text gets the right kind of words in the **right** kind of order.

 - Try running the **example** a few times to see other examples of **generated** text.

----
<a id=section8></a>
## **8. Conclusion**

---

- **Statistical** language models are **central** to many challenging natural language processing tasks.

- State-of-the-art **results** are achieved using **neural language models**, specifically those with **word embeddings** and recurrent neural network algorithms.

- In general, **word-level language** models tend to **display** higher accuracy than **character-level language models**. 

- This is because they can form **shorter** representations of **sentences** and preserve the **context between** words more easily than character-level language models.

- Neural language models allow **conditioning** on increasingly large **context** sizes with only a linear increase in the number of parameters, and they support generalization across **different** contexts.