### *Deep Dive into*
# Natural Language Processing

#### Contributors
Helene Willits,
Shaina Bagri,
Rachel Castellino

This notebook builds on the notebook titled "An Introduction to Natural Language Processing" by the same contributors. Here, we explore NLP in more detail and provide a work-along example that introduces you to building an NLP model.

Let's say we want to build a model that predicts the next word a user will type. We will show you how you can process a set of training data, train a Natural Language Processing model, and use that model to create a text predictor.

## Import Libraries
First, we need to import the required libraries. Most of them are part of TensorFlow, a library that is widely used in artificial intelligence and machine learning. TensorFlow gives us access to various machine learning models, layers, and preprocessing techniques without having to code them manually.

In [5]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.optimizers import Adam
import numpy as np
import os
import requests

# Clean the Data
import string

# Plot the Model
from tensorflow import keras
from tensorflow.keras.utils import plot_model

# Callbacks
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.callbacks import ReduceLROnPlateau
from tensorflow.keras.callbacks import TensorBoard

## Gather Data

You can use any data that you like to train the model, but the dataset needs to be large enough to aptly train our model to predict which word will come next in a text stream. We are going to develop our model with the text from the book Metamorphosis by Franz Kafka. If you are coding along with this guide, you can access the text file here: https://www.gutenberg.org/cache/epub/5200/pg5200.txt

In [None]:
url = 'https://www.gutenberg.org/cache/epub/5200/pg5200.txt'
r = requests.get(url, allow_redirects=True)

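# Save the downloaded text to disk; the with-blocks close the files for us
with open('metamorphosis_gutenberg.txt', 'wb') as f:
    f.write(r.content)

# Read the file back as a list of lines
with open('metamorphosis_gutenberg.txt', 'r', encoding='utf8') as file:
    allLines = file.readlines()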

## Preprocessing the Data
The first step in developing a Natural Language Processing model is to preprocess the data to remove anything unnecessary. This step includes both context-based preprocessing and standard NLP processing.

The goal of preprocessing is to get the data that looks like this:

**The quick red fox jumped over the lazy dog.**

**The dog waited, and, to her surprise, she was untouched.**

To look like this:

**The quick red fox jumped over the lazy dog waited and to her surprise she was untouched**

In the context of our dataset (a novel), some of the data, such as the copyright information, will not be useful in building our NLP model. We can pick out the useful text from the file using the following code.

NOTE: This code is specific to this data set.

In [4]:
lines = allLines[46:1992]
print("The First Line: ", lines[0])
print("The Last Line: ", lines[-1])
print("Number of Lines: ", len(lines))

We will also need to reformat the text so that it is easier to process. 

In order to do this, we will first remove unnecessary characters. For example, some characters we will remove are the newline ('\n') and carriage return ('\r') characters that are used in text files to signal different types of spacing.

In [None]:
# Join all lines into a single string, separated by spaces
data = ' '.join(lines)
    
data = data.replace('\n', '').replace('\r', '').replace('\ufeff', '')
data[:360]

We also need to remove punctuation, so we will convert any punctuation marks into spaces.

In [None]:
translator = str.maketrans(string.punctuation, ' '*len(string.punctuation)) #map punctuation to space
new_data = data.translate(translator)

new_data[:500]

We'll then copy over the text into a new data object without duplicating any words. This simplifies the problem that our model must solve.

In [None]:
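# Keep only the first occurrence of each word, preserving order.
# We start from new_data so that the punctuation removal above is kept.
z = list(dict.fromkeys(new_data.split()))

data = ' '.join(z)
data[:500]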
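
To see the cleaning steps end to end, here is a minimal sketch that applies them to the two example sentences from the start of this section. The sample_ names are made up for illustration so that they do not overwrite the notebook's own variables.

```python
import string

# The two illustrative sentences from the start of this section
sample_lines = [
    "The quick red fox jumped over the lazy dog.\n",
    "The dog waited, and, to her surprise, she was untouched.\n",
]

# 1. Join the lines and drop newline / carriage-return characters
sample_text = " ".join(sample_lines).replace("\n", "").replace("\r", "")

# 2. Map every punctuation mark to a space
sample_translator = str.maketrans(string.punctuation, " " * len(string.punctuation))
sample_text = sample_text.translate(sample_translator)

# 3. Keep only the first occurrence of each word (case-sensitive), preserving order
sample_cleaned = " ".join(dict.fromkeys(sample_text.split()))

print(sample_cleaned)
```

Because the comparison is case-sensitive, the lowercase "the" is kept even though "The" has already appeared; only exact repeats such as the second "The" and "dog" are dropped.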

Next, we will perform some preprocessing steps that are standard components of NLP.

## Tokenization

Recall from our Introduction to NLP that tokenization breaks the text into words or phrases so that we can later identify the relationships between them. There are many ways to do this; we will use the Tokenizer class provided by Keras. The tokenizer builds a vocabulary from the text and assigns each word, or token, a number. In general, such a number could be the word's index in the vocabulary, a measure of the word's relevance, or one of many other statistics. Here, texts_to_sequences returns each word's index in the vocabulary, where more frequent words receive lower indices.

In [None]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])

sequence_data = tokenizer.texts_to_sequences([data])[0]
sequence_data[:10]
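
To make the word-to-index mapping concrete, here is a minimal plain-Python sketch of roughly what the tokenizer does with its default settings: lowercase the text, count the words, and number them from 1 in order of decreasing frequency. The sample sentence and the toy_ names are made up for illustration, and the sketch skips the punctuation filtering that the real Tokenizer also performs.

```python
from collections import Counter

toy_text = "the cat sat on the mat the end"

# Lowercase and split on whitespace
toy_words = toy_text.lower().split()

# Count occurrences; Counter keeps first-seen order, which breaks ties below
toy_counts = Counter(toy_words)

# Rank words by frequency (most frequent first); index 0 is left unused
toy_ranked = sorted(toy_counts.items(), key=lambda item: item[1], reverse=True)
toy_word_index = {word: i + 1 for i, (word, _) in enumerate(toy_ranked)}

# Replace each word with its index
toy_sequence = [toy_word_index[w] for w in toy_words]

print(toy_word_index)
print(toy_sequence)
```

In this toy text, "the" appears three times, so it receives index 1; the remaining words appear once each and are numbered in the order they first occur.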

In [None]:
# Word indices start at 1, so add 1 to account for the unused index 0
vocab_size = len(tokenizer.word_index) + 1
print(vocab_size)

In [None]:
sequences = []

# Slide a window of two tokens over the text: (current word, next word)
for i in range(1, len(sequence_data)):
    words = sequence_data[i-1:i+1]
    sequences.append(words)
    
print("The number of sequences is: ", len(sequences))
sequences = np.array(sequences)
sequences[:10]

In [None]:
X = []
y = []

for i in sequences:
    X.append(i[0])
    y.append(i[1])
    
X = np.array(X)
y = np.array(y)

In [None]:
print("The Data is: ", X[:5])
print("The responses are: ", y[:5])
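
The cells above turn the token sequence into (current word, next word) training pairs. As a small illustration, the sketch below applies the same windowing to a made-up list of five token indices; the toy_ names and values are hypothetical.

```python
# Hypothetical token indices standing in for sequence_data
toy_sequence_data = [4, 1, 7, 2, 9]

# Each pair holds a token and the token that follows it
toy_pairs = [toy_sequence_data[i - 1:i + 1] for i in range(1, len(toy_sequence_data))]

# The first element of each pair is the input, the second is the target
toy_X = [pair[0] for pair in toy_pairs]
toy_y = [pair[1] for pair in toy_pairs]

print(toy_pairs)
print("inputs: ", toy_X)
print("targets:", toy_y)
```

A sequence of n tokens yields n - 1 pairs, and the target list is simply the input list shifted by one position.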

In [None]:
y = to_categorical(y, num_classes=vocab_size)
y[:5]
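
The to_categorical call one-hot encodes the targets: each index becomes a vector of length vocab_size with a single 1 at that index. Here is a minimal plain-Python sketch of the idea, using a hypothetical vocabulary of five entries; the one_hot helper is written for illustration and is not part of Keras.

```python
def one_hot(index, num_classes):
    """Return a list of zeros with a single 1.0 at position `index`."""
    vector = [0.0] * num_classes
    vector[index] = 1.0
    return vector

toy_vocab_size = 5        # hypothetical vocabulary size (index 0 unused)
toy_targets = [1, 3, 4]   # hypothetical next-word indices

toy_encoded = [one_hot(i, toy_vocab_size) for i in toy_targets]

for row in toy_encoded:
    print(row)
```

This format matches the softmax output layer of the model, which produces one probability per vocabulary entry.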

## Creating the Model

In [None]:
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=1))
model.add(LSTM(1000, return_sequences=True))
model.add(LSTM(1000))
model.add(Dense(1000, activation="relu"))
model.add(Dense(vocab_size, activation="softmax"))

In [None]:
model.summary()
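
The parameter counts reported by model.summary() can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers with biases enabled. The vocabulary size of 2500 is an assumed placeholder, so substitute the vocab_size printed earlier; the helper names are made up for illustration.

```python
def embedding_params(vocab_size, output_dim):
    # One vector of length output_dim per vocabulary entry
    return vocab_size * output_dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias
    return 4 * (input_dim * units + units * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units

assumed_vocab_size = 2500  # placeholder, not the real vocabulary size

layer_params = {
    "embedding": embedding_params(assumed_vocab_size, 10),
    "lstm_1": lstm_params(10, 1000),
    "lstm_2": lstm_params(1000, 1000),
    "dense_1": dense_params(1000, 1000),
    "dense_2": dense_params(1000, assumed_vocab_size),
}

for name, count in layer_params.items():
    print(f"{name:>10}: {count:,}")
print(f"{'total':>10}: {sum(layer_params.values()):,}")
```

Only the embedding and the final dense layer depend on the vocabulary size; the two LSTM layers and the hidden dense layer account for about 13 million parameters regardless of the text used.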

## Plot the Model

In [None]:
keras.utils.plot_model(model, to_file='model.png', show_layer_names=True)

## Callbacks

In [None]:
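# Save the model whenever the training loss improves
checkpoint = ModelCheckpoint("nextword1.h5", monitor='loss', verbose=1,
    save_best_only=True, mode='auto')

# Lower the learning rate when the training loss stops improving
reduce = ReduceLROnPlateau(monitor='loss', factor=0.2, patience=3, min_lr=0.0001, verbose = 1)

# Write training logs that TensorBoard can display
logdir='logsnextword1'
tensorboard_Visualization = TensorBoard(log_dir=logdir)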
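
To give a feel for what the ReduceLROnPlateau settings above do, here is a simplified plain-Python simulation of the idea: if the loss has not improved for patience epochs in a row, multiply the learning rate by factor, but never go below min_lr. This is a sketch under simplifying assumptions, not the Keras implementation; it ignores the callback's minimum-improvement threshold and cooldown options, and the loss values are made up.

```python
def simulate_reduce_lr(losses, lr=0.001, factor=0.2, patience=3, min_lr=0.0001):
    """Return the learning rate in effect after each epoch's loss is seen."""
    best = float("inf")
    wait = 0
    history = []
    for loss in losses:
        if loss < best:
            # The loss improved: remember it and reset the counter
            best = loss
            wait = 0
        else:
            # No improvement: count the epoch, reduce once patience runs out
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history

# Hypothetical per-epoch training losses with two plateaus
toy_losses = [2.0, 1.5, 1.5, 1.5, 1.5, 1.4, 1.4, 1.4, 1.4]
lr_history = simulate_reduce_lr(toy_losses)

for epoch, (loss, lr) in enumerate(zip(toy_losses, lr_history), start=1):
    print(f"epoch {epoch}: loss={loss:.1f}  lr={lr:.6f}")
```

With these settings the first plateau cuts the rate from 0.001 to 0.0002, and the second would cut it to 0.00004 but is held at the floor of 0.0001.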

## Compile the Model

In [None]:
model.compile(loss="categorical_crossentropy", optimizer=Adam(learning_rate=0.001))

## Fit the Model

In [None]:
model.fit(X, y, epochs=150, batch_size=64, callbacks=[checkpoint, reduce, tensorboard_Visualization])

## Use the Model to Make Predictions

In [None]:
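def predict_word(text):
    # Convert the input word to its integer index
    sequence = tokenizer.texts_to_sequences([text])[0]
    sequence = np.array(sequence)

    # predict_classes is not available in recent TensorFlow releases,
    # so take the index of the highest predicted probability instead
    preds = np.argmax(model.predict(sequence), axis=-1)[0]
    predicted_word = ""

    # Look up the word whose index matches the prediction
    for key, value in tokenizer.word_index.items():
        if value == preds:
            predicted_word = key
            break

    print(predicted_word)
    return predicted_word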
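
The prediction step boils down to two operations: pick the index with the highest probability, then map that index back to a word. Here is a minimal plain-Python sketch of that decoding, with a made-up four-word vocabulary and made-up probabilities standing in for the tokenizer and the model output.

```python
# Hypothetical vocabulary in the same word -> index form as tokenizer.word_index
toy_vocab = {"one": 1, "morning": 2, "gregor": 3, "samsa": 4}

# Invert the mapping once so lookups by index are direct
toy_index_to_word = {index: word for word, index in toy_vocab.items()}

# Hypothetical softmax output: one probability per index (position 0 is unused)
toy_probs = [0.05, 0.10, 0.15, 0.60, 0.10]

# The predicted class is the position of the largest probability
toy_predicted_index = max(range(len(toy_probs)), key=lambda i: toy_probs[i])
toy_predicted_word = toy_index_to_word.get(toy_predicted_index, "")

print(toy_predicted_index, toy_predicted_word)
```

Building the inverted dictionary once avoids scanning the whole vocabulary on every prediction, which can matter when many predictions are made.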

Try the model out yourself! Enter lines of text and watch the model guess the next word.

In [None]:
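while True:
    text = input("Enter your line: ")

    if text == "stop the script":
        print("Ending The Program.....")
        break

    try:
        # The model predicts from a single word, so keep only the last one
        last_word = text.split(" ")[-1]
        predict_word(last_word)

    except Exception as error:
        # For example, a word outside the vocabulary cannot be tokenized
        print("Could not predict a word:", error)
        continue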

## Additional Exercises

You can try replicating this process to develop a model that mimics your own writing style. In order to do this, use text that you have written in text messages, emails, documents, or other text files. Use these resources to train your own model to perform personalized text prediction. Make sure to perform context-based preprocessing steps that make sense for the data that you are using.

You can also refer to https://www.gutenberg.org/ for thousands of free text data sets that can be used for Natural Language Processing.

## Resources

The exercises developed in this notebook were originally outlined in this article:

https://towardsdatascience.com/next-word-prediction-with-nlp-and-deep-learning-48b9fe0a17bf

Here is the dataset that we used to generate our NLP model:

https://www.gutenberg.org/cache/epub/5200/pg5200.txt