### *Deep Dive into*
# Natural Language Processing

#### Contributors
Helene Willits,
Shaina Bagri,
Rachel Castellino

This notebook builds on the notebook titled "An Introduction to Natural Language Processing" by the same contributors. Here, we explore NLP in more detail and provide a work-along example that shows how to build a simple NLP model in practice.

Let's say we want to build a model that predicts the next word a user will type.

## Import Libraries
First, we need to import the required libraries. Most of them are part of TensorFlow, a library that is widely used in artificial intelligence and machine learning. Using TensorFlow gives us access to ready-made machine learning models, layers, and preprocessing techniques, so we do not have to code them by hand.

In [5]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.optimizers import Adam
import numpy as np
import os

# Clean the Data
import string

# Plot the Model
from tensorflow import keras
from tensorflow.keras.utils import plot_model

# Callbacks
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.callbacks import ReduceLROnPlateau
from tensorflow.keras.callbacks import TensorBoard

You can use any data that you like to train the model, but the dataset should be reasonably large. We are going to develop the model with the text of the book Metamorphosis by Franz Kafka. If you are coding along with this guide, you can access the original text file here: https://www.gutenberg.org/cache/epub/5200/pg5200.txt


In [None]:
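import urllib.request

# open() cannot read a URL, so fetch the file over HTTP instead.
# Assumption: appending ?raw=true to the GitHub page link returns the plain text file.
txt_url = "https://github.com/helenewillits/Natural-Language-Processing-Research/blob/main/metamorphosis_clean.txt?raw=true"

with urllib.request.urlopen(txt_url) as file:
    # Decode the bytes and keep one list entry per line of the file
    lines = file.read().decode("utf8").splitlines(keepends=True)

print("The First Line: ", lines[0])
print("The Last Line: ", lines[-1])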

## Pre-processing the Data
The first step in developing a Natural Language Processing model is to preprocess the data and remove anything the model does not need. This step includes both context-based preprocessing and standard NLP preprocessing.

In [4]:
# Work in progress: building the cleaned text directly from the Project Gutenberg file

# import urllib.request
# url = "https://www.gutenberg.org/cache/epub/5200/pg5200.txt"

# data = urllib.request.urlopen(url)
# lines = []

# for line in data:
#     lines.append(line.decode("utf8"))

# print(lines[:5])

In the context of our dataset (a novel), there is data such as the copyright information that will not be useful in building our NLP model, so we want to keep only the text of the story itself.
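
One way to pick out the story text is sketched below. It is a minimal example rather than the method used to produce `metamorphosis_clean.txt`: the helper name `strip_gutenberg_boilerplate` is hypothetical, the sample lines are an abbreviated stand-in for the real file, and it assumes the file marks the story with lines beginning `*** START OF` and `*** END OF`, which is how Project Gutenberg plain-text files are usually delimited. Check the markers in your own copy before relying on it.

```python
def strip_gutenberg_boilerplate(raw_lines):
    """Keep only the lines between the START and END marker lines."""
    start, end = 0, len(raw_lines)
    for idx, line in enumerate(raw_lines):
        if line.startswith("*** START OF"):
            start = idx + 1   # the story begins after the START marker
        elif line.startswith("*** END OF"):
            end = idx         # the story stops before the END marker
            break
    return raw_lines[start:end]


# Abbreviated, made-up sample standing in for the downloaded file
sample = [
    "The Project Gutenberg eBook of Metamorphosis\n",
    "*** START OF THE PROJECT GUTENBERG EBOOK METAMORPHOSIS ***\n",
    "One morning, when Gregor Samsa woke from troubled dreams,\n",
    "he found himself transformed.\n",
    "*** END OF THE PROJECT GUTENBERG EBOOK METAMORPHOSIS ***\n",
    "Updated editions will replace the previous one.\n",
]

story_lines = strip_gutenberg_boilerplate(sample)
print(story_lines)
```

If neither marker is found, the function returns the input unchanged, so it is safe to run on a file that has already been cleaned.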

We will also need to reformat the text so that it is easier to process. To do this, we will remove unnecessary characters, such as the “newline” and “tab” characters that text files use to signal different types of spacing.

In [None]:
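# Join all lines into one string (a single join is enough; no loop is needed)
data = ' '.join(lines)

# Remove newline, carriage-return, tab, and byte-order-mark characters
data = data.replace('\n', '').replace('\r', '').replace('\t', ' ').replace('\ufeff', '')
data[:360]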

In [None]:
translator = str.maketrans(string.punctuation, ' '*len(string.punctuation)) #map punctuation to space
new_data = data.translate(translator)

new_data[:500]

In [None]:
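# Keep only the first occurrence of each word, preserving the original order.
# Start from new_data so the punctuation removal above is not discarded.
z = list(dict.fromkeys(new_data.split()))

data = ' '.join(z)
data[:500]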

Next, we will perform some preprocessing steps that are standard components of NLP.

## Tokenization

In [None]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])

sequence_data = tokenizer.texts_to_sequences([data])[0]
sequence_data[:10]
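
To make the idea concrete, here is a small pure-Python sketch of what a word-level tokenizer does: it ranks words by frequency and replaces each word with its rank. It only broadly mirrors the Keras `Tokenizer` (the sentence is made up, and details such as tie-breaking and punctuation filtering may differ), so treat it as an illustration rather than a drop-in replacement.

```python
from collections import Counter

text = "the cat sat on the mat and the dog sat too"
words = text.lower().split()

# Rank words by frequency; index 0 is left unused, so numbering starts at 1
counts = Counter(words)
word_index = {
    word: rank
    for rank, (word, _) in enumerate(counts.most_common(), start=1)
}

# Replace every word with its integer index
sequence = [word_index[w] for w in words]

print(word_index)
print(sequence)
```

Because index 0 is never assigned to a word, a vocabulary size that covers every index is the number of distinct words plus one.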

In [None]:
vocab_size = len(tokenizer.word_index) + 1
print(vocab_size)

In [None]:
sequences = []

for i in range(1, len(sequence_data)):
    words = sequence_data[i-1:i+1]
    sequences.append(words)
    
print("The number of sequences is: ", len(sequences))
sequences = np.array(sequences)
sequences[:10]

In [None]:
X = []
y = []

for i in sequences:
    X.append(i[0])
    y.append(i[1])
    
X = np.array(X)
y = np.array(y)

In [None]:
print("The Data is: ", X[:5])
print("The responses are: ", y[:5])
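
The pairing and splitting steps can be sketched on a toy sequence. The values below are made up, and the column slicing is simply an alternative to the explicit loop, not a change to the approach.

```python
import numpy as np

toy_sequence = [1, 3, 2, 4, 1]

# Pair each token with the token that follows it (a sliding window of size 2)
pairs = np.array(
    [toy_sequence[i - 1:i + 1] for i in range(1, len(toy_sequence))]
)

# First column is the input word, second column is the word to predict
inputs, targets = pairs[:, 0], pairs[:, 1]

print(pairs)
print(inputs, targets)
```

A sequence of n tokens yields n - 1 pairs, which is why the loop in the notebook starts at index 1.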

In [None]:
y = to_categorical(y, num_classes=vocab_size)
y[:5]
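
For intuition, one-hot encoding turns each target index into a vector that is 1 at that index and 0 everywhere else. The NumPy sketch below reproduces the idea on made-up targets; `toy_targets` and `toy_vocab_size` are illustrative names, not part of the notebook.

```python
import numpy as np

toy_targets = np.array([3, 2, 4, 1])
toy_vocab_size = 5  # indices 0..4; index 0 is not assigned to any word

# Row k of the identity matrix is the one-hot vector for class k
one_hot = np.eye(toy_vocab_size, dtype="float32")[toy_targets]

print(one_hot)
```

Each row has the same length as the vocabulary, which is why the final layer of the model needs one output per vocabulary index.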

## Creating the Model

In [None]:
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=1))
model.add(LSTM(1000, return_sequences=True))
model.add(LSTM(1000))
model.add(Dense(1000, activation="relu"))
model.add(Dense(vocab_size, activation="softmax"))

In [None]:
model.summary()
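
The parameter counts reported by the summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers with the layer sizes from the model above; `example_vocab_size` is a hypothetical value, so substitute the `vocab_size` printed earlier to match your own run.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    # One weight per input-output pair plus one bias per output
    return input_dim * units + units


example_vocab_size = 3000  # hypothetical; use your own vocab_size

layer_params = {
    "embedding": example_vocab_size * 10,
    "lstm_1": lstm_params(10, 1000),
    "lstm_2": lstm_params(1000, 1000),
    "dense_hidden": dense_params(1000, 1000),
    "dense_output": dense_params(1000, example_vocab_size),
}

for name, count in layer_params.items():
    print(f"{name:>12}: {count:,}")
print(f"{'total':>12}: {sum(layer_params.values()):,}")
```

Most of the parameters sit in the two LSTM layers and the output layer, which helps explain why training this model on a CPU can be slow.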

## Plot the Model

In [None]:
keras.utils.plot_model(model, to_file='model.png', show_layer_names=True)

## Callbacks

In [None]:
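# Save the model whenever the training loss improves
checkpoint = ModelCheckpoint("nextword1.h5", monitor='loss', verbose=1,
    save_best_only=True, mode='auto')

# Cut the learning rate to 20% of its value after 3 epochs without improvement
reduce = ReduceLROnPlateau(monitor='loss', factor=0.2, patience=3, min_lr=0.0001, verbose=1)

# Write training logs that TensorBoard can display
logdir = 'logsnextword1'
tensorboard_Visualization = TensorBoard(log_dir=logdir)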

## Compile the Model

In [None]:
model.compile(loss="categorical_crossentropy", optimizer=Adam(learning_rate=0.001))
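
To see what the loss measures, here is a simplified NumPy version of categorical cross-entropy applied to made-up predictions. It is a sketch for intuition only; the Keras implementation handles clipping and reduction in its own way.

```python
import numpy as np


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # Clip to avoid log(0), then average the per-example losses
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))


# Two one-hot targets over a three-word toy vocabulary
y_true = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])

# Predictions that put most of the probability on the correct word
confident = np.array([[0.05, 0.9, 0.05], [0.1, 0.1, 0.8]])
# Predictions that are close to a uniform guess
uncertain = np.array([[0.4, 0.3, 0.3], [0.3, 0.4, 0.3]])

loss_confident = categorical_crossentropy(y_true, confident)
loss_uncertain = categorical_crossentropy(y_true, uncertain)
print(loss_confident, loss_uncertain)
```

The loss is lower when the model assigns more probability to the correct next word, which is the quantity the optimizer tries to reduce during training.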

## Fit the Model

In [None]:
model.fit(X, y, epochs=150, batch_size=64, callbacks=[checkpoint, reduce, tensorboard_Visualization])

## Use the Model to Make Predictions

In [None]:
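def predict_word(text):
    # Convert the input word to its integer index
    sequence = tokenizer.texts_to_sequences([text])[0]
    if not sequence:
        # The word is not in the vocabulary, so there is nothing to predict from
        print("Unknown word:", text)
        return ""
    sequence = np.array(sequence)

    # predict_classes was removed from recent TensorFlow releases;
    # take the index of the highest-probability class instead
    preds = np.argmax(model.predict(sequence), axis=-1)

    # Look the predicted index up in the tokenizer's index-to-word mapping
    predicted_word = tokenizer.index_word.get(int(preds[0]), "")

    print(predicted_word)
    return predicted_word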
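
Returning the single most likely word is only one way to use the model's output. As a hedged sketch, several candidate words can be ranked from the softmax probabilities instead. The probabilities and the `toy_index_word` mapping below are made up, and `top_k_words` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

# Made-up softmax output over a small vocabulary (index 0 is unused)
toy_probs = np.array([0.0, 0.1, 0.6, 0.05, 0.25])
toy_index_word = {1: "the", 2: "room", 3: "door", 4: "bed"}


def top_k_words(probs, index_word, k=3):
    # argsort is ascending, so reverse it and keep the k most likely indices
    best = np.argsort(probs)[::-1][:k]
    return [index_word[int(i)] for i in best if int(i) in index_word]


suggestions = top_k_words(toy_probs, toy_index_word)
print(suggestions)
```

With a trained model, the row returned by `model.predict` for one input word would take the place of `toy_probs`, and `tokenizer.index_word` would take the place of `toy_index_word`.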

Try the model out yourself! Enter lines of text and watch the model guess the next word.

In [None]:
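while True:
    text = input("Enter your line: ")

    if text == "stop the script":
        print("Ending The Program.....")
        break

    try:
        # The model sees one word at a time, so predict from the last word typed
        last_word = text.split(" ")[-1]
        predict_word(last_word)
    except Exception as error:
        # Report the problem instead of silently skipping it, then keep going
        print("Could not predict a word:", error)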