### Problem Statement - 62 – Natural Language Processing (NLP) Assignment

## Group No - 26

## Group Member Names:
1. Sunil Mittal (BITS ID : 2021SC04968) - 100%
2. Indira Saha (BITS ID : 2021SC04956) - 100%
3. Vikram Panwar (BITS ID : 2021SC04958) - 100%
4. Muhammad Iqbal J (BITS ID : 2021SC04960) - 100%

<hr/>
<p>
Link to the Dataset: <a href="https://s3.amazonaws.com/text-datasets/nietzsche.txt">nietzsche.txt</a>

Description of Data: This is a rich English-language text dataset. The main task is to prepare the text for developing a word-level language model, then train a neural network that contains an Embedding layer and an LSTM layer, and finally use the trained model to generate new text with properties similar to the input text.

<ol>
<li>Define the above text in Python and encode the text as an integer. Determine the vocabulary size. Create the word sequence. 3</li>
<li>Split the sequences into input (X) and output elements (y). Fit your model to predict a probability distribution across all words in the vocabulary. 2</li>
<li>Define and build the LSTM model for text generation.  3</li>
<li>Evaluate the performance of the model. 2</li>
</ol>
</p>

### Explanation :
<p>
Preparing text for developing a word-level language model involves several key steps:
<p>
<b>Text Gathering:</b> Gather the text data that you want to use for training your model. It could be a corpus of text from novels, newspapers, web pages, etc. It is essential to choose a corpus that is representative of the type of language model you want to develop.
</p>
<p>
<b>Text Cleaning:</b> The raw text data usually contains a lot of noise like HTML tags, emojis, special characters, etc. that are not necessary for our language model. The text needs to be cleaned by removing these unnecessary characters.
</p>
<p>
<b>Text Normalization:</b> This involves several steps such as:
</p>
<p>
<b>Lowercasing:</b> To ensure that the model doesn't treat 'word' and 'Word' as two different words, it is a good idea to convert all the text into lowercase.
</p>
<p>
<b>Lemmatization/Stemming:</b> This reduces the words to their base or root form. For instance, 'running' will be reduced to 'run'. However, whether you do this or not will depend on the specific requirements of your model.
</p>
<p>
<b>Removing Stop Words:</b> Stop words like 'is', 'the', 'and' etc. occur very frequently in text data and don't contain valuable information, so they can be removed.
</p>
<p>
<b>Handling Punctuation:</b> Depending on your needs, you may want to remove punctuation, or replace them with token representations.
</p>
<p>
<b>Tokenization:</b> Tokenization is the process of breaking down the text into individual words or tokens. In a word-level language model, tokens are typically individual words.
</p>
<p>
<b>Vocabulary Creation:</b> After tokenization, a vocabulary of unique words is created. This vocabulary serves as the input feature space for the model.
</p>
<p>
<b>Sequence Creation:</b> Language models are trained to predict the next word in a sequence. Therefore, from your tokenized text, you need to create sequences of words. The length of these sequences is a parameter that you can tune.
</p>
<p>
<b>Encoding Sequences:</b> The sequences of words are then encoded into sequences of integers or one-hot encoded vectors. The encoding process transforms the textual information into numerical input that the language model can process.
</p>
<p>
<b>Preparing Training and Validation Set:</b> Divide the dataset into a training set and validation set. The training set is used to train the model while the validation set is used to evaluate the model's performance during the training process.
</p>
<p>
<b>Padding Sequences :</b> Depending on your model architecture, you may need to ensure that all sequences have the same length. You can do this by padding shorter sequences with a special symbol (like 'PAD').
</p>
<p>
<b>Preparing Labels:</b> For each sequence, the corresponding label will be the word that comes after the sequence in the text data. These labels are what the model will try to predict.
</p>
</p>
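
The steps above can be sketched end to end on a toy sentence. This is a minimal, dependency-free illustration rather than the notebook's actual pipeline: the corpus, the `<PAD>` token name, and the context length of 3 are all assumptions chosen for readability.

```python
# Toy corpus; the notebook itself works on nietzsche.txt.
corpus = "the cat sat on the mat"

# Tokenization: lowercase and split on whitespace.
tokens = corpus.lower().split()

# Vocabulary creation: index 0 is reserved for padding.
vocab = {"<PAD>": 0}
for tok in tokens:
    vocab.setdefault(tok, len(vocab))

# Encoding: replace each word with its integer id.
encoded = [vocab[tok] for tok in tokens]

# Sequence creation and labels: a window of context words is the input,
# and the word that follows the window is the label.
context_len = 3  # tunable sequence length (illustrative value)
pairs = []
for i in range(1, len(encoded)):
    context = encoded[max(0, i - context_len):i]
    # Padding: left-pad short contexts with the PAD id.
    context = [0] * (context_len - len(context)) + context
    pairs.append((context, encoded[i]))

for context, label in pairs:
    print(context, "->", label)
```

Each printed line is one training example: a fixed-length list of word ids and the id of the word that comes next.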


#### 1. Define the above text in Python and encode the text as an integer. Determine the vocabulary size. Create the word sequence:

In [1]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, LearningRateScheduler
import math
import numpy as np
import pandas as pd
import re
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import nltk
#nltk.download lines are required to download the necessary resources for lemmatization and stopwords.
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Iqbal\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Iqbal\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Iqbal\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [2]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}

<IPython.core.display.Javascript object>

##### i. Load the text File

In [3]:
def create_sequence(all_words, seq_size):

  for i in range(0, len(all_words), seq_size):
    yield ' '.join(all_words[i:i + seq_size])
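

As a quick check of what `create_sequence` produces, the sketch below repeats the generator and runs it on a made-up seven-word list with a chunk size of 3 (the notebook itself uses 10).

```python
def create_sequence(all_words, seq_size):
    # Yield consecutive, non-overlapping chunks of seq_size words.
    for i in range(0, len(all_words), seq_size):
        yield ' '.join(all_words[i:i + seq_size])

words = "supposing truth woman ground suspecting philosopher far".split()
chunks = list(create_sequence(words, 3))
print(chunks)
# ['supposing truth woman', 'ground suspecting philosopher', 'far']
```

Because the chunks do not overlap, a word at the end of one chunk is never used as context for the first word of the next chunk, and the final chunk may be shorter than `seq_size`.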

In [4]:
print("preprocessing data...")
# Read the raw corpus from disk
with open('nietzsche.txt', 'r') as file:
    text_unprocessed = file.read()
    
lemmatizer = WordNetLemmatizer()
# Function to clean and preprocess text
# The preprocess_text function first converts the text to lowercase, then removes punctuation,
# special characters and numbers using regular expressions. Next, it splits the text into individual
# words, drops English stop words, and lemmatizes each remaining word using NLTK's WordNet lemmatizer.
# It returns the cleaned words grouped into fixed-size chunks (see create_sequence).
# lemmatize() is called without a part-of-speech tag, so words are treated as nouns
# (e.g., "philosophers" becomes "philosopher").
# Contractions dictionary
contractions = { 
"ain't": "am not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he's": "he is",
"how'd": "how did",
"how'll": "how will",
"how's": "how is",
"i'd": "i would",
"i'll": "i will",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'll": "it will",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"might've": "might have",
"mightn't": "might not",
"must've": "must have",
"mustn't": "must not",
"needn't": "need not",
"she'd": "she would",
"she'll": "she will",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"that'd": "that would",
"that's": "that is",
"there'd": "there had",
"there's": "there is",
"they'd": "they would",
"they'll": "they will",
"they're": "they are",
"they've": "they have",
"wasn't": "was not",
"we'd": "we would",
"we'll": "we will",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what will",
"what're": "what are",
"what's": "what is",
"what've": "what have",
"where'd": "where did",
"where's": "where is",
"who'll": "who will",
"who's": "who is",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"you'd": "you would",
"you'll": "you will",
"you're": "you are",
"womenthat":"women that",
"womanwhat":"woman what"
}

def preprocess_text(text_unprocessed):
    # Lowercase
    text_unprocessed = text_unprocessed.lower()
    
    # Remove punctuation (words joined only by punctuation, e.g. "women--that", get glued together)
    text_unprocessed = re.sub(r'[^\w\s]', '', text_unprocessed)
    
    # Replace contractions
    # NOTE: apostrophes were already removed above, so the apostrophe-based keys no longer match here;
    # only the glued-word entries ("womenthat", "womanwhat") take effect.
    for contraction, replacement in contractions.items():
         text_unprocessed = text_unprocessed.replace(contraction, replacement)
    
    # Remove special characters
    # \W matches any non-word character (Unicode-aware for str patterns in Python 3)
    text_unprocessed = re.sub(r'\W', ' ', text_unprocessed)
    
    # Remove emojis (code points outside the Basic Multilingual Plane)
    RE_EMOJI = re.compile('[\U00010000-\U0010ffff]', flags=re.UNICODE)
    text_unprocessed = RE_EMOJI.sub(r'', text_unprocessed)
    
    # Remove numbers
    text_unprocessed = re.sub(r'\d+', '', text_unprocessed)
    
    # Remove stop words, then lemmatize the remaining words
    stop_words = set(stopwords.words('english'))  # build the set once, not once per word
    words = text_unprocessed.split()
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    
    # Group the cleaned words into chunks of seq_len words
    seq_len = 10
    return list(create_sequence(words, seq_len))


text = preprocess_text(text_unprocessed)


preprocessing data...
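
The order of the cleaning steps matters. The sketch below uses only `re`, a two-entry contractions dictionary, and a made-up sentence (all illustrative assumptions) to compare stripping punctuation before expanding contractions, as `preprocess_text` does, with the alternative of expanding first and replacing punctuation with spaces.

```python
import re

contractions = {"don't": "do not", "it's": "it is"}  # two entries for illustration
sample = "Don't say it's true--not yet."

def expand(text):
    # Plain substring replacement, as in preprocess_text.
    for contraction, replacement in contractions.items():
        text = text.replace(contraction, replacement)
    return text

# Order used in preprocess_text: strip punctuation, then expand.
stripped_first = expand(re.sub(r'[^\w\s]', '', sample.lower()))

# Alternative order: expand first, then turn punctuation into spaces.
expanded_first = re.sub(r'[^\w\s]', ' ', expand(sample.lower()))
expanded_first = ' '.join(expanded_first.split())

print(stripped_first)   # dont say its truenot yet
print(expanded_first)   # do not say it is true not yet
```

With the first order the apostrophes are gone before the dictionary lookup, so the contractions are left unexpanded and words separated only by dashes are merged. Switching to the second order would change the vocabulary, so the figures reported in this notebook would no longer apply as-is.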


##### ii. Below code first tokenizes the input text, converting it into sequences of integers where each integer represents a unique word.

In [5]:
# Step 1: Preprocess the text
tokenizer = Tokenizer()
print("Fitting the tokenizer on the text...\n")
tokenizer.fit_on_texts(text)
print("Converting text into sequences of integers...\n")
# The vocabulary size is the number of unique words plus one
vocab_size = len(tokenizer.word_index) + 1
print(f"Vocabulary size: {vocab_size}")
# Convert text into sequences of integers
sequences = []
for line in text:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        sequences.append(n_gram_sequence)

# Print out the first 15 sequences
for seq in sequences[:15]:
    print(' '.join([tokenizer.index_word[i] for i in seq]))

Fitting the tokenizer on the text...

Converting text into sequences of integers...

Vocabulary size: 10404
preface supposing
preface supposing truth
preface supposing truth woman
preface supposing truth woman ground
preface supposing truth woman ground suspecting
preface supposing truth woman ground suspecting philosopher
preface supposing truth woman ground suspecting philosopher far
preface supposing truth woman ground suspecting philosopher far dogmatist
preface supposing truth woman ground suspecting philosopher far dogmatist failed
understand woman
understand woman terrible
understand woman terrible seriousness
understand woman terrible seriousness clumsy
understand woman terrible seriousness clumsy importunity
understand woman terrible seriousness clumsy importunity usually
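
To make the integer encoding concrete, here is a small sketch that mimics what the tokenizer does, without Keras: words are ranked by frequency, the most frequent word gets index 1, and index 0 is left free for padding (which is why the vocabulary size is the number of unique words plus one). The two input lines are made up, and the `Counter`-based ranking is an approximation of the Keras behaviour, not its implementation.

```python
from collections import Counter

lines = ["truth woman truth", "woman truth philosopher"]  # toy input

counts = Counter(word for line in lines for word in line.split())
# Most frequent word gets index 1; index 0 is reserved for padding.
word_index = {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}
vocab_size = len(word_index) + 1

# Prefix (n-gram) sequences: every prefix of length >= 2 of each line.
sequences = []
for line in lines:
    ids = [word_index[w] for w in line.split()]
    for i in range(1, len(ids)):
        sequences.append(ids[:i + 1])

print(word_index)   # {'truth': 1, 'woman': 2, 'philosopher': 3}
print(vocab_size)   # 4
print(sequences)    # [[1, 2], [1, 2, 1], [2, 1], [2, 1, 3]]
```

The last id of each prefix becomes the label and the ids before it become the input, which is what the next section does after padding.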


#### 2. Split the sequences into input (X) and output elements (y). Fit your model to predict a probability distribution across all words in the vocabulary:

The code below pads the sequences so that they all have the same length, then splits each one into an input-output pair:
the model is trained to predict the next word (y) given the sequence of previous words (X).


In [6]:
# Pad sequences for equal input length 
print("Padding sequences...\n")
max_sequence_len = max([len(seq) for seq in sequences])
sequences = np.array(pad_sequences(sequences, maxlen=max_sequence_len, padding='pre'))
# Split sequences into input (X) and output (y)
print("Splitting sequences into input and output...\n")
X = sequences[:,:-1]
y = sequences[:,-1]
y = to_categorical(y, num_classes=len(tokenizer.word_index) + 1)

# Printing first sequence and its corresponding output
print("X[0] (input sequence): ", X[0])
print("y[0] (output word): ", np.argmax(y[0]))

Padding sequences...

Splitting sequences into input and output...

X[0] (input sequence):  [   0    0    0    0    0    0    0    0    0 3089]
y[0] (output word):  545
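
One-hot encoding `y` is simple but memory-hungry with a vocabulary this large. The rough estimate below assumes 4-byte values and an upper bound on the sample count derived from the 1369 steps per epoch reported in this notebook's training log, together with Keras's default batch size of 32 (no `batch_size` is passed to `model.fit`). The helper names are hypothetical.

```python
def one_hot_bytes(n_samples, vocab_size, bytes_per_value=4):
    # Dense one-hot matrix: one value per (sample, vocabulary entry).
    return n_samples * vocab_size * bytes_per_value

def integer_label_bytes(n_samples, bytes_per_value=4):
    # Integer class ids: one value per sample.
    return n_samples * bytes_per_value

vocab_size = 10404        # vocabulary size printed by the tokenizer cell
n_samples = 1369 * 32     # upper bound: 1369 batches of at most 32 samples

print(f"one-hot labels: {one_hot_bytes(n_samples, vocab_size) / 1e9:.2f} GB")
print(f"integer labels: {integer_label_bytes(n_samples) / 1e6:.2f} MB")
```

If memory becomes a problem, one option is to keep `y` as integer ids and compile with Keras's `sparse_categorical_crossentropy` loss, which computes the same cross-entropy without materialising the one-hot matrix.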


#### 3. Define and build the LSTM model for text generation:
Constructs the LSTM model with an Embedding layer.

In [7]:
# Step 2: Build the LSTM model
print("Building the LSTM model...\n")
#Define Model
model = Sequential()
# The Embedding layer below converts the word integers into dense vectors of length 50 (output_dim). 
# The input dimension is the vocabulary size (the number of unique words plus one), 
# and the input length is the length of the input sequences.
model.add(Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=50, input_length=max_sequence_len-1))
#LSTM layer captures the sequence structure of the input. Here we're using 50 hidden units
model.add(LSTM(50))
#Dense layer outputs a probability distribution across all words. 
#The number of units is the vocabulary size, 
# and the softmax activation function ensures that the output is a probability distribution.
model.add(Dense(len(tokenizer.word_index) + 1, activation='softmax'))

# Learning rate schedule: keep the initial learning rate for the first 5 epochs so the model
# converges quickly, then decay it exponentially to fine-tune the model in the later stages.
def scheduler(epoch, lr):
    if epoch < 5:
        return lr
    else:
        return lr * math.exp(-0.1)

# NOTE: these callbacks only take effect when passed to model.fit(..., callbacks=callbacks).
callbacks = [
    LearningRateScheduler(scheduler, verbose=1),
    EarlyStopping(monitor='loss', patience=5),
    ModelCheckpoint('best_model.h5', monitor='loss', save_best_only=True)
]

# The model is compiled with the categorical cross-entropy loss function,
# which is suitable for multi-class classification problems, and the Adam optimizer.
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])


Building the LSTM model...
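
To get a feel for the size of this architecture, the trainable parameters can be counted by hand. The sketch below uses the standard formulas for an embedding table, an LSTM with one bias vector per gate, and a dense layer, with the sizes from this notebook (vocabulary 10404, embedding size 50, 50 LSTM units). The function names are hypothetical helpers.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry.
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units

vocab_size, embed_dim, lstm_units = 10404, 50, 50
total = (embedding_params(vocab_size, embed_dim)
         + lstm_params(embed_dim, lstm_units)
         + dense_params(lstm_units, vocab_size))
print(total)  # 1071004
```

Almost all of the parameters sit in the embedding table and the output layer, both of which scale with the vocabulary size. The LSTM itself accounts for only 20,200 of them.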



### It trains this model on the sequence data.

In [8]:
# Step 3: Train the model
print("Training the model...\n")
# The model is fitted on the input sequences X and their corresponding outputs y.
# The number of epochs is set to 150.
model.fit(X, y, epochs=150, verbose=2)


Training the model...

Epoch 1/150
1369/1369 - 31s - loss: 8.4604 - accuracy: 0.0122 - 31s/epoch - 23ms/step
Epoch 2/150
1369/1369 - 35s - loss: 8.0564 - accuracy: 0.0124 - 35s/epoch - 25ms/step
Epoch 3/150
1369/1369 - 31s - loss: 7.9526 - accuracy: 0.0125 - 31s/epoch - 23ms/step
Epoch 4/150
1369/1369 - 32s - loss: 7.8372 - accuracy: 0.0125 - 32s/epoch - 24ms/step
Epoch 5/150
1369/1369 - 32s - loss: 7.6809 - accuracy: 0.0129 - 32s/epoch - 24ms/step
Epoch 6/150
1369/1369 - 33s - loss: 7.4898 - accuracy: 0.0139 - 33s/epoch - 24ms/step
Epoch 7/150
1369/1369 - 33s - loss: 7.2786 - accuracy: 0.0155 - 33s/epoch - 24ms/step
Epoch 8/150
1369/1369 - 36s - loss: 7.0572 - accuracy: 0.0186 - 36s/epoch - 27ms/step
Epoch 9/150
1369/1369 - 35s - loss: 6.8328 - accuracy: 0.0243 - 35s/epoch - 25ms/step
Epoch 10/150
1369/1369 - 36s - loss: 6.6085 - accuracy: 0.0319 - 36s/epoch - 26ms/step
Epoch 11/150
1369/1369 - 36s - loss: 6.3864 - accuracy: 0.0450 - 36s/epoch - 26ms/step
Epoch 12/150
1369/1369 - 36s 

<keras.src.callbacks.History at 0x2ce2d8f2620>
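
The `fit` call above does not pass the `callbacks` list, so the learning-rate schedule defined earlier is not active in this run. As a sketch of what it would do if supplied via `callbacks=callbacks`, the loop below replays the same `scheduler` rule over 15 epochs. The starting rate of 0.001 is an assumption (Adam's default in Keras).

```python
import math

def scheduler(epoch, lr):
    # Same rule as the notebook: hold for 5 epochs, then decay by exp(-0.1).
    if epoch < 5:
        return lr
    return lr * math.exp(-0.1)

lr = 0.001  # assumed starting rate (Adam's Keras default)
history = []
for epoch in range(15):
    # The scheduler receives the current rate, so the decay compounds.
    lr = scheduler(epoch, lr)
    history.append(lr)

for epoch, value in enumerate(history):
    print(f"epoch {epoch:2d}: lr = {value:.6f}")
```

Because each call receives the rate from the previous epoch, the rate after epoch `n >= 5` is the starting rate times `exp(-0.1 * (n - 4))`, so ten decay steps shrink it by a factor of `e`.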

### Finally, it uses the trained model to generate new text.


In [9]:
# Step 4: Generate new text
def generate_text(seed_text, next_words, model, max_sequence_len):
    for _ in range(next_words):
        # Encode the text generated so far and left-pad/truncate it to the model's input length
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        predicted_probs = model.predict(token_list, verbose=0)
        # Greedy decoding: take the index of the most probable next word
        predicted = int(np.argmax(predicted_probs, axis=-1)[0])
        
        # Map the index back to its word (index 0 is padding and has no word)
        output_word = tokenizer.index_word.get(predicted, "")
        seed_text += " " + output_word
    return seed_text


In [19]:
# Generate new text
print("\nGenerating new text...\n")
print(generate_text("But to speak seriously", 20, model, max_sequence_len))


Generating new text...

But to speak seriously truth perception say much mutual sense man sinful know would like already something sought origin test presence said ideal laugh
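
`generate_text` always picks the most probable word (greedy decoding). A common alternative, not used in this notebook, is to sample from the predicted distribution after rescaling it with a temperature. The sketch below is dependency-free and uses a made-up three-word distribution. `sample_with_temperature` is a hypothetical helper, not part of Keras.

```python
import math
import random

def sample_with_temperature(probs, temperature=1.0, rng=random):
    # Rescale log-probabilities by the temperature, then renormalise.
    scaled = [math.log(p + 1e-12) / temperature for p in probs]
    top = max(scaled)
    weights = [math.exp(s - top) for s in scaled]
    # Draw one index in proportion to the rescaled weights.
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]

probs = [0.1, 0.7, 0.2]     # made-up distribution over three words
rng = random.Random(0)      # seeded for repeatability
greedy = max(range(len(probs)), key=probs.__getitem__)
cold = sample_with_temperature(probs, temperature=0.01, rng=rng)
warm = [sample_with_temperature(probs, temperature=1.5, rng=rng) for _ in range(10)]
print(greedy, cold, warm)
```

A very low temperature behaves like the greedy choice, while higher temperatures spread the picks over less likely words, which can reduce repetition at the cost of coherence.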


#### 4. Evaluate the performance of the model:

Of the configurations compared below, the following combination achieved the best accuracy after one epoch of training:
<ul>
    <li>Optimizer: adam</li>
    <li>Activation: softmax</li>
</ul>

The detailed metrics are shown in the code execution below.

In [7]:
print("Building multiple LSTM model for comparison...\n")
class eval_result:
    def __init__(self, opt, act, loss, accuracy):
        self.opt = opt
        self.act = act
        self.loss = loss
        self.accuracy = accuracy

def build_models_eval(optimizer, activation):
    print(optimizer + ' ' + activation)
    model = Sequential()
    model.add(Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=10, input_length=max_sequence_len-1))
    model.add(LSTM(50))
    model.add(Dense(len(tokenizer.word_index) + 1, activation=activation))
    
    # NOTE: the scheduler and callbacks below are defined but never passed to fit,
    # so they have no effect on the comparison.
    def scheduler(epoch, lr):
        if epoch < 5:
            return lr
        else:
            return lr * math.exp(-0.1)

    callbacks = [
        LearningRateScheduler(scheduler, verbose=1),
        EarlyStopping(monitor='loss', patience=5),
        ModelCheckpoint('best_model.h5', monitor='loss', save_best_only=True)
    ]

    model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    
    return model

Building multiple LSTM model for comparison...



In [8]:
print("Training multiple models...\n")
eval_results = []
res_compare = []
# res_compare.append(eval_result('Optimizer', 'Activation', 'loss', 'accuracy'))

# optimizers = ['adadelta', 'adagrad', 'adam']
# activation = ['tanh', 'softmax', 'relu']
optimizers = ['adam']
activation = ['tanh', 'softmax', 'relu']

for opt in optimizers:
    for act in activation:
        model = build_models_eval(opt, act)
        # The model is fitted on the input sequences X and their corresponding outputs y.
        # The number of epochs is set to 1, but can be increased to get higher accuracy.
        model.fit(X, y, epochs=1, verbose=2)

        res = model.evaluate(X, y)
        # print(opt, act, res[0], res[1])
        res_compare.append(eval_result(opt, act, res[0], res[1]))
        eval_results.append([opt, act, res[0], res[1]])

Training multiple models...

adam tanh
1369/1369 - 34s - loss: 11.0858 - accuracy: 0.0034 - 34s/epoch - 25ms/step
adam softmax
1369/1369 - 35s - loss: 8.4610 - accuracy: 0.0123 - 35s/epoch - 26ms/step
adam relu
1369/1369 - 38s - loss: 15.9577 - accuracy: 0.0019 - 38s/epoch - 28ms/step
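
One likely reason softmax comes out ahead is that categorical cross-entropy expects the output layer to produce a probability distribution, and only softmax guarantees that. The sketch below applies the three activations to the same made-up scores and checks whether the results are non-negative and sum to 1.

```python
import math

def softmax(z):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, -1.0, 0.5]  # made-up pre-activation scores for three words
outputs = {
    "softmax": softmax(logits),
    "tanh": [math.tanh(v) for v in logits],
    "relu": [max(0.0, v) for v in logits],
}
for name, values in outputs.items():
    print(name, [round(v, 3) for v in values], "sum =", round(sum(values), 3))
```

The tanh outputs can be negative and the relu outputs are unbounded, so neither can be read as word probabilities. The loss values reported for those two runs are therefore not directly comparable to the softmax run.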


In [9]:
import numpy as np
import pandas as pd

optimizer = ""
acti = ""
accuracy = 0

for result in res_compare:
    # print(result.opt, result.act, result.loss, result.accuracy)
    if result.accuracy > accuracy:
        accuracy = result.accuracy
        optimizer = result.opt
        acti = result.act

print('Optimizer: ', optimizer)
print('Activation: ', acti)
print('Best accuracy: ', accuracy * 100, '%')



Optimizer:  adam
Activation:  softmax
Best accuracy:  1.2441501952707767 %
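
Accuracy is a harsh metric for next-word prediction, so it can help to read the loss directly. The sketch below converts cross-entropy loss to perplexity and compares it with the loss of a model that guesses uniformly over the vocabulary. The loss values are copied from the logs in this notebook, and the use of the natural logarithm assumes Keras's convention of reporting cross-entropy in nats.

```python
import math

vocab_size = 10404  # vocabulary size printed by the tokenizer cell

# Loss of a model that assigns equal probability to every word.
uniform_loss = math.log(vocab_size)

def perplexity(cross_entropy_loss):
    # Cross-entropy in nats, so perplexity is exp(loss).
    return math.exp(cross_entropy_loss)

for loss in (8.4610, 6.3864):  # values taken from the logs in this notebook
    print(f"loss {loss:.4f} -> perplexity {perplexity(loss):.0f}")
print(f"uniform baseline loss: {uniform_loss:.3f}")
```

A loss below the uniform baseline means the model does better than guessing at random. Note that both `model.fit` and `model.evaluate` here use the same `X` and `y`, so these numbers describe fit to the training data rather than performance on held-out text.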
