### Problem Statement - 62 – NLP Assignment
<hr/>
<p>
Link to the Dataset: <a href="https://s3.amazonaws.com/text-datasets/nietzsche.txt">nietzsche.txt</a>

Description of Data: This is a rich English-language text dataset. The main task is to prepare the text for a word-level language model, train a neural network that contains an embedding layer and an LSTM layer, and then use the trained model to generate new text with properties similar to those of the input text.

<ol>
<li>Define the above text in Python and encode the text as integers. Determine the vocabulary size. Create the word sequences. 3</li>
<li>Split the sequences into input (X) and output (y) elements. Fit your model to predict a probability distribution across all words in the vocabulary. 2</li>
<li>Define and build the LSTM model for text generation. 3</li>
<li>Evaluate the performance of the model. 2</li>
</ol>
</p>

### Explanation
<p>
Preparing text for developing a word-level language model involves several key steps:
<p>
<b>Text Gathering:</b> Gather the text data that you want to use for training your model. It could be a corpus of text from novels, newspapers, web pages, etc. It is essential to choose a corpus that is representative of the type of language model you want to develop.
</p>
<p>
<b>Text Cleaning:</b> The raw text data usually contains a lot of noise like HTML tags, emojis, special characters, etc. that are not necessary for our language model. The text needs to be cleaned by removing these unnecessary characters.
</p>
<p>
<b>Text Normalization:</b> This involves several steps such as:
</p>
<p>
<b>Lowercasing:</b> To ensure that the model doesn't treat 'word' and 'Word' as two different words, it is a good idea to convert all the text into lowercase.
</p>
<p>
<b>Lemmatization/Stemming:</b> This reduces words to their base or root form. For instance, 'running' is reduced to 'run'. Whether you do this depends on the specific requirements of your model.
</p>
<p>
<b>Removing Stop Words:</b> Stop words like 'is', 'the', and 'and' occur very frequently and carry little information for tasks such as classification, so they are often removed there. For text generation they are usually kept, because the model needs them to produce fluent sentences.
</p>
<p>
<b>Handling Punctuation:</b> Depending on your needs, you may want to remove punctuation or replace it with dedicated tokens.
</p>
<p>
<b>Tokenization:</b> Tokenization is the process of breaking the text down into individual words or tokens. In a word-level language model, tokens are typically individual words.
</p>
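<p>
As a rough illustration of the normalization and tokenization steps above, the sketch below lowercases a string and extracts word tokens with a regular expression. The helper name <code>tokenize</code>, the regular expression, and the sample sentence are assumptions made for this example; the notebook code further down relies on the Keras <code>Tokenizer</code> instead.
</p>

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    text = text.lower()
    # Keep runs of letters, optionally with an internal apostrophe (e.g. "don't")
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text)

# Illustrative sample, not taken verbatim from the dataset
sample = "Supposing that Truth is a woman--what then? Don't ask!"
tokens = tokenize(sample)
print(tokens)
```

<p>
This simple pattern drops digits and all punctuation. Whether that is acceptable depends on the corpus and on whether the generated text should contain punctuation.
</p>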
<p>
<b>Vocabulary Creation:</b> After tokenization, a vocabulary of unique words is created. This vocabulary serves as the input feature space for the model.
</p>
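<p>
One possible way to build such a vocabulary is sketched below. The helper <code>build_vocab</code> and the toy token list are hypothetical. Reserving index 0 for padding is a convention assumed here, which is also why the vocabulary size is the number of words plus one.
</p>

```python
from collections import Counter

def build_vocab(tokens):
    """Map each unique token to an integer index, most frequent first."""
    counts = Counter(tokens)
    # Index 0 is reserved for padding, so real words start at 1
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

# Toy token list used only for illustration
tokens = "the will to truth the will to power".split()
word_to_index = build_vocab(tokens)
vocab_size = len(word_to_index) + 1  # +1 for the reserved padding index
print(word_to_index, vocab_size)
```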
<p>
<b>Sequence Creation:</b> Language models are trained to predict the next word in a sequence. Therefore, from your tokenized text, you need to create sequences of words. The length of these sequences is a parameter that you can tune.
</p>
<p>
<b>Encoding Sequences:</b> The sequences of words are then encoded into sequences of integers or one-hot encoded vectors. The encoding process transforms the textual information into numerical input that the language model can process.
</p>
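<p>
The sequence creation and integer encoding steps can be sketched with a sliding window, as below. The function <code>make_windows</code>, the toy vocabulary, and the window length of 3 are assumptions chosen for illustration. Each window holds <code>sequence_length</code> input words followed by one target word.
</p>

```python
def make_windows(token_ids, sequence_length):
    """Slide a window over the ids; each window has sequence_length inputs plus 1 target."""
    size = sequence_length + 1
    return [token_ids[i:i + size] for i in range(len(token_ids) - size + 1)]

# Toy vocabulary and text used only for illustration
word_to_index = {"the": 1, "will": 2, "to": 3, "truth": 4, "power": 5}
tokens = "the will to truth the will to power".split()

token_ids = [word_to_index[w] for w in tokens]  # integer encoding of the text
windows = make_windows(token_ids, sequence_length=3)
print(windows)
```

<p>
A fixed window keeps every training example the same length, so no padding is needed. The notebook code further down takes a different route and builds growing prefixes of each line, which then have to be padded.
</p>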
<p>
<b>Preparing Training and Validation Set:</b> Divide the dataset into a training set and validation set. The training set is used to train the model while the validation set is used to evaluate the model's performance during the training process.
</p>
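<p>
A minimal sketch of such a split is shown below, assuming the examples are held in a plain Python list. The helper <code>train_val_split</code>, the 20% validation fraction, and the toy data are hypothetical choices.
</p>

```python
import random

def train_val_split(samples, val_fraction=0.1, seed=0):
    """Shuffle a copy of the samples and hold out a fraction for validation."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]

# Toy encoded windows used only for illustration
windows = [[1, 2, 3, 4], [2, 3, 4, 1], [3, 4, 1, 2], [4, 1, 2, 3], [1, 2, 3, 5]]
train, val = train_val_split(windows, val_fraction=0.2)
print(len(train), len(val))
```

<p>
With Keras, passing <code>validation_split</code> to <code>model.fit</code> is an alternative that holds out part of the arrays without a manual split.
</p>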
<p>
<b>Padding Sequences:</b> Depending on your model architecture, you may need to ensure that all sequences have the same length. You can do this by padding shorter sequences with a special symbol (like 'PAD').
</p>
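<p>
To make the padding step concrete, here is a small pure-Python sketch. The function <code>pad_pre</code> is a hypothetical helper that pads on the left with 0 and keeps the most recent items of over-long sequences, which is intended to mirror what <code>pad_sequences(..., padding='pre')</code> does in the notebook code further down.
</p>

```python
def pad_pre(sequences, maxlen, pad_value=0):
    """Left-pad (or left-truncate) each sequence to exactly maxlen items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep the most recent tokens if too long
        padded.append([pad_value] * (maxlen - len(seq)) + seq)
    return padded

print(pad_pre([[5], [3, 4], [1, 2, 3, 4, 5]], maxlen=3))
```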
<p>
<b>Preparing Labels:</b> For each sequence, the corresponding label will be the word that comes after the sequence in the text data. These labels are what the model will try to predict.
</p>
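<p>
The label preparation can be sketched as follows. The helpers <code>split_inputs_and_labels</code> and <code>one_hot</code>, the toy windows, and the vocabulary size of 6 are assumptions made for this example.
</p>

```python
def split_inputs_and_labels(windows):
    """The last id in each window is the label; everything before it is the input."""
    X = [w[:-1] for w in windows]
    y = [w[-1] for w in windows]
    return X, y

def one_hot(index, vocab_size):
    """Return a vector of zeros with a single 1 at the label's index."""
    vec = [0] * vocab_size
    vec[index] = 1
    return vec

# Toy encoded windows used only for illustration
windows = [[1, 2, 3, 4], [2, 3, 4, 1], [1, 2, 3, 5]]
X, y = split_inputs_and_labels(windows)
y_one_hot = [one_hot(label, vocab_size=6) for label in y]
print(X, y, y_one_hot[0])
```

<p>
One-hot labels grow with the vocabulary size. With a large vocabulary, keeping the integer labels and training with the <code>sparse_categorical_crossentropy</code> loss is a common way to save memory.
</p>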
</p>


<p>How to train a neural network that contains an embedding layer and an LSTM layer, then use the trained model to generate new text with properties similar to those of the input text:</p>


<p>
Assuming that the text has been preprocessed and sequences of word indices have been created, here is how you can define and train a neural network:
</p>

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocab_size = len(word_to_index) + 1  # size of your vocabulary
embedding_dim = 100  # dimension of the embedding space
sequence_length = 50  # length of your sequences
num_units = 256  # number of units in LSTM layer

model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=sequence_length),
    LSTM(num_units),
    Dense(vocab_size, activation='softmax')
])

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Assuming you have input sequences `x_train` and one-hot encoded target sequences `y_train`
model.fit(x_train, y_train, epochs=50, verbose=2)


<p>
This will train a simple LSTM-based language model. You can make the model more complex and possibly improve performance by using more layers, dropout, recurrent dropout, etc.

After training the model, you can use it to generate new text that has similar properties to the input text. Here's an example of how to do that:
</p>

In [None]:
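import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_text(seed_text, next_words, model, max_sequence_len):
    # Reverse lookup so each predicted index maps back to its word
    index_to_word = {index: word for word, index in word_to_index.items()}
    for _ in range(next_words):
        # Encode the running text, skipping words that are not in the vocabulary
        # (assumes the vocabulary was built from lowercased text)
        token_list = [word_to_index[w] for w in seed_text.lower().split() if w in word_to_index]
        # Pad or truncate to the input length the model was built with
        token_list = pad_sequences([token_list], maxlen=max_sequence_len, padding='pre')
        predicted_probs = model.predict(token_list, verbose=0)
        predicted = int(np.argmax(predicted_probs, axis=-1)[0])  # most probable next word
        seed_text += " " + index_to_word.get(predicted, "")
    return seed_text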

seed_text = "Once upon a time"
next_words = 100

print(generate_text(seed_text, next_words, model, sequence_length))


This appends 100 generated words to the seed text "Once upon a time". The quality of the generated text will depend on how well the model has been trained.

The function generate_text starts with some seed text, then predicts the next word, appends it to the text, and repeats this process as many times as specified. The pad_sequences function is used to ensure that the input sequence to the model always has the expected length.

Please note that the preprocessing, the model's architecture, and its parameters may need to be tweaked to achieve good results.
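
One such tweak concerns how the next word is chosen. The generate_text function above always takes the most probable word, which can lead to repetitive output. A common alternative is to sample from the predicted distribution after rescaling it with a temperature. The sketch below is a minimal pure-Python illustration; the function name sample_with_temperature, the temperature value, and the three-word distribution are made up for the example.

```python
import math
import random

def sample_with_temperature(probs, temperature=1.0, rng=random):
    """Sample an index from probs after rescaling by temperature."""
    # Rescale in log space; a lower temperature sharpens the distribution
    logits = [math.log(max(p, 1e-12)) / temperature for p in probs]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]  # subtract the peak for stability
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]

probs = [0.1, 0.6, 0.3]  # made-up next-word distribution
rng = random.Random(0)   # fixed seed so the run is repeatable
picks = [sample_with_temperature(probs, temperature=0.5, rng=rng) for _ in range(1000)]
print(picks.count(1) / len(picks))  # share of draws that chose index 1
```

A temperature near 0 approaches the greedy argmax choice, while a temperature of 1.0 samples from the model's distribution unchanged.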

In [1]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.utils import to_categorical
import numpy as np
import pandas as pd

In [2]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}


In [5]:

# Load the corpus (assumes nietzsche.txt was downloaded from the dataset link above)
with open('nietzsche.txt', 'r', encoding='utf-8') as file:
    text = file.read()

In [6]:

# Step 1: Preprocess the text
tokenizer = Tokenizer()
tokenizer.fit_on_texts([text])


In [7]:
# Convert text into sequences of integers
sequences = []
for line in text.split('\n'):
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        sequences.append(n_gram_sequence)



In [8]:
# Pad sequences for equal input length 
max_sequence_len = max([len(seq) for seq in sequences])
sequences = np.array(pad_sequences(sequences, maxlen=max_sequence_len, padding='pre'))



In [9]:
# Split sequences into input (X) and output (y)
X = sequences[:,:-1]
y = sequences[:,-1]
y = to_categorical(y, num_classes=len(tokenizer.word_index) + 1)



In [10]:
# Step 2: Build the LSTM model
model = Sequential()
model.add(Embedding(len(tokenizer.word_index) + 1, 10, input_length=max_sequence_len-1))
model.add(LSTM(50))
model.add(Dense(len(tokenizer.word_index) + 1, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])



In [11]:
# Step 3: Train the model
model.fit(X, y, epochs=100, verbose=2)


Epoch 1/100
2886/2886 - 62s - loss: 6.7160 - accuracy: 0.0732 - 62s/epoch - 21ms/step
Epoch 2/100
2886/2886 - 56s - loss: 6.3315 - accuracy: 0.0944 - 56s/epoch - 19ms/step
Epoch 3/100
2886/2886 - 57s - loss: 6.1012 - accuracy: 0.1084 - 57s/epoch - 20ms/step
Epoch 4/100
2886/2886 - 62s - loss: 5.9327 - accuracy: 0.1225 - 62s/epoch - 21ms/step
Epoch 5/100
2886/2886 - 62s - loss: 5.7764 - accuracy: 0.1324 - 62s/epoch - 22ms/step
Epoch 6/100
2886/2886 - 59s - loss: 5.6414 - accuracy: 0.1398 - 59s/epoch - 20ms/step
Epoch 7/100
2886/2886 - 59s - loss: 5.5202 - accuracy: 0.1462 - 59s/epoch - 20ms/step
Epoch 8/100
2886/2886 - 62s - loss: 5.4077 - accuracy: 0.1520 - 62s/epoch - 21ms/step
Epoch 9/100
2886/2886 - 65s - loss: 5.3010 - accuracy: 0.1582 - 65s/epoch - 23ms/step
Epoch 10/100
2886/2886 - 71s - loss: 5.1996 - accuracy: 0.1627 - 71s/epoch - 25ms/step
Epoch 11/100
2886/2886 - 78s - loss: 5.1043 - accuracy: 0.1660 - 78s/epoch - 27ms/step
Epoch 12/100
2886/2886 - 80s - loss: 5.0128 - accura


In [None]:
# Step 4: Generate new text
def generate_text(seed_text, next_words, model, max_sequence_len):
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        # predict_classes was removed in TensorFlow 2.6; take the argmax of predict instead
        predicted = int(np.argmax(model.predict(token_list, verbose=0), axis=-1)[0])

        # index_word is the Tokenizer's reverse lookup (index -> word); index 0 is padding
        output_word = tokenizer.index_word.get(predicted, "")
        seed_text += " " + output_word
    return seed_text



In [None]:
# Generate new text
print(generate_text("Once upon a time", 20, model, max_sequence_len))

This script does the following:

<ol>
<li>It tokenizes the input text, converting it into sequences of integers where each integer represents a unique word.</li>
<li>It creates input-output sequence pairs, where the model is trained to predict the next word given a sequence of previous words.</li>
<li>It pads sequences to ensure they are of equal length.</li>
<li>It constructs an LSTM model with an Embedding layer.</li>
<li>It trains this model on the sequence data.</li>
<li>Finally, it uses the trained model to generate new text.</li>
</ol>

Note: For simplicity, I've kept the model's architecture and the preprocessing steps relatively straightforward. A larger dataset and more complex model architecture (more layers, regularization techniques like dropout, etc.) can potentially yield better results.
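
The script above does not evaluate the model beyond the accuracy printed during training. One common measure for language models is perplexity, the exponential of the mean negative log-probability that the model assigns to the true next words. The sketch below is a minimal illustration; the helper name perplexity and the probabilities are made up. The value 5.0128 is the training loss of epoch 12 in the log above, and a held-out validation loss would give a fairer estimate.

```python
import math

def perplexity(target_probs):
    """Perplexity = exp(mean negative log-probability of the true next words)."""
    nll = [-math.log(p) for p in target_probs]
    return math.exp(sum(nll) / len(nll))

# Made-up probabilities that a model assigned to the correct next word;
# a uniform guess over 4 words gives a perplexity of about 4
print(perplexity([0.25, 0.25, 0.25, 0.25]))

# Categorical cross-entropy is a mean negative log-likelihood (natural log),
# so exponentiating a reported loss gives the corresponding perplexity
print(math.exp(5.0128))
```

Lower perplexity is better: a perplexity of k means the model is, on average, as uncertain as if it were choosing uniformly among k words.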