# Harper Adams Data Science 
## NLP deeper dive
<center>
   <img src="https://github.com/ha-data-science/ha-data-science.github.io/blob/main/img/HAP-E-logo.png?raw=true" alt="HAP-E Group" width="125"/>
</center>

Ed Harris </br>
HARUG! / HAP-E Group </br>
2021-12-01 </br></br>

---

This notebook is adapted from Ch 11 from:</br> 
<center>
   <img src="https://github.com/ha-data-science/ha-data-science.github.io/blob/main/pages/harug-files/2021-12-01-NLP-sentiment-analysis/img/Chollet.jpg?raw=true" alt="HAP-E Group" width="125"/>
</center>
Chollet, F., 2021. Deep Learning with Python, 2nd ed. Manning Publications, Shelter Island, NY.
</br></br>

Also see:

<center>
   <img src="https://github.com/ha-data-science/ha-data-science.github.io/blob/main/pages/harug-files/2021-12-01-NLP-sentiment-analysis/img/Liu.png?raw=true" alt="HAP-E Group" width="125"/>


   <img src="https://github.com/ha-data-science/ha-data-science.github.io/blob/main/pages/harug-files/2021-12-01-NLP-sentiment-analysis/img/Rothman.png?raw=true" alt="HAP-E Group" width="125"/>
</center>

Liu, B., 2020. Sentiment Analysis: Mining Opinions, Sentiments, and Emotions, 2nd ed. Cambridge University Press.

Rothman, D., 2021. Transformers for Natural Language Processing: Build innovative deep neural network architectures for NLP with Python, PyTorch, TensorFlow, BERT, RoBERTa, and more, 1st ed. Packt Publishing.

&nbsp;
---

>Frederick Jelinek, an early speech recognition researcher,
joked in the 1990s: “Every time I fire a linguist, the performance of the speech recognizer
goes up.”


# Deep learning for text

## 11.1 Natural-language processing: The bird's eye view

- “What’s the topic of this text?” (text classification)
- “Does this text contain abuse?” (content filtering)
- “Does this text sound positive or negative?” (sentiment analysis)
- “What should be the next word in this incomplete sentence?” (language modeling)
- “How would you say this in German?” (translation)
- “How would you summarize this article in one paragraph?” (summarization)

## 11.2 Preparing text data

>...keep in mind... the text-processing models you will train won’t possess a human-like understanding of language; rather, they simply look for statistical regularities in their input data, which turns out to be sufficient to perform well on many simple tasks.

</br></br>

- First, you standardize the text to make it easier to process, such as by converting
it to lowercase or removing punctuation.
- You split the text into units (called tokens), such as characters, words, or groups
of words. This is called tokenization.
- You convert each such token into a numerical vector. This will usually involve
first indexing all tokens present in the data.

</br></br>

<center>
   <img src="https://github.com/ha-data-science/ha-data-science.github.io/blob/main/pages/harug-files/2021-12-01-NLP-sentiment-analysis/img/11-1.png?raw=true" alt="Prepping text data steps" width="600"/>
</center>



### Text standardization

This involves removing punctuation and formatting, and also smoothing over subtle differences such as verb tense and similar language ambiguities.

&nbsp;

Consider:

- “sunset came. i was staring at the Mexico sky. Isnt nature splendid??”
- “Sunset came; I stared at the México sky. Isn’t nature splendid?”

&nbsp;

Remove punctuation and capitals:

- “sunset came i was staring at the mexico sky isnt nature splendid”
- “sunset came i stared at the méxico sky isnt nature splendid”

&nbsp;

'Stemming' (standardize variation of terms)

- “sunset came i [stare] at the mexico sky isnt nature splendid”
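
As a rough illustration of the first two steps above (lowercasing, punctuation removal, and folding accented characters), here is a minimal plain-Python sketch. The function name `standardize` and the extra set of curly quotes are assumptions made for this example, not part of the book's code, and stemming is deliberately left out.

```python
import string
import unicodedata

def standardize(text):
    # Decompose accented characters, then drop the combining marks (é -> e)
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    lowered = no_accents.lower()
    # Drop ASCII punctuation plus curly quotes (an assumption for this example)
    drop = set(string.punctuation) | {"’", "‘", "“", "”"}
    return "".join(c for c in lowered if c not in drop)

first = standardize("sunset came. i was staring at the Mexico sky. Isnt nature splendid??")
second = standardize("Sunset came; I stared at the México sky. Isn’t nature splendid?")
print(first)
print(second)
```

After this step the two sentences differ only in verb form ("was staring" versus "stared"), which is the part that stemming would address.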



### Text splitting (tokenization)


&nbsp;

>Once your text is standardized, you need to break it up into units to be vectorized (tokens), a step called tokenization. 

- Word-level tokenization: Where tokens are space-separated (or punctuation-separated) substrings. A variant of this is to further split words into subwords when applicable. For instance, treating “staring” as “star+ing” or “called” as “call+ed.”
- N-gram tokenization: Where tokens are groups of N consecutive words. For instance, “the cat” or “he was” would be 2-gram tokens (also called bigrams).
- Character-level tokenization: Where each character is its own token. In practice, this scheme is rarely used, and you only really see it in specialized contexts, like text generation or speech recognition.
</br></br>
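
The three schemes listed above can be sketched in a few lines of plain Python. The helper names (`word_tokens`, `ngram_tokens`, `char_tokens`) are made up for this illustration, and the input is assumed to be already standardized.

```python
def word_tokens(text):
    # Word-level: split on whitespace
    return text.split()

def ngram_tokens(text, n=2):
    # N-gram: every run of n consecutive words
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def char_tokens(text):
    # Character-level: each character is its own token
    return list(text)

sentence = "the cat sat on the mat"
print(word_tokens(sentence))      # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(ngram_tokens(sentence, 2))  # ['the cat', 'cat sat', 'sat on', 'on the', 'the mat']
print(char_tokens("cat"))         # ['c', 'a', 't']
```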

<center>
   <img src="https://github.com/ha-data-science/ha-data-science.github.io/blob/main/pages/harug-files/2021-12-01-NLP-sentiment-analysis/img/11-2.png?raw=true" alt="Prepping text data steps" width="600"/>
</center>

### Vocabulary indexing

>Once your text is split into tokens, you need to encode each token into a numerical representation.

Build an index of all terms found in the training data (the “vocabulary”), and assign a unique integer to each entry in the vocabulary.
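
In practice the vocabulary is often capped at the most frequent terms. The sketch below shows one way this could be done in plain Python; the function name `build_vocabulary` and the convention of reserving index 0 for padding and index 1 for unknown tokens are assumptions for this example.

```python
from collections import Counter

def build_vocabulary(texts, max_tokens=None):
    # Count every whitespace-separated token across the training texts
    counts = Counter(tok for text in texts for tok in text.lower().split())
    # Reserve 0 for padding and 1 for out-of-vocabulary tokens (assumed convention)
    vocabulary = {"": 0, "[UNK]": 1}
    keep = None if max_tokens is None else max(max_tokens - len(vocabulary), 0)
    # Assign the next free integer to each term, most frequent first
    for token, _ in counts.most_common(keep):
        vocabulary[token] = len(vocabulary)
    return vocabulary

texts = ["the cat sat", "the cat ran", "the dog"]
vocab = build_vocabulary(texts, max_tokens=4)
print(vocab)    # {'': 0, '[UNK]': 1, 'the': 2, 'cat': 3}

# Tokens that did not make the cut fall back to the [UNK] index
encoded = [vocab.get(tok, 1) for tok in "the bird sat".split()]
print(encoded)  # [2, 1, 1]
```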

### Using the TextVectorization layer


In [None]:
# Make a set of functions for vocabulary indexing
# (the hard way, just to illustrate)

import string

class Vectorizer:
    def standardize(self, text):
        text = text.lower()
        return "".join(char for char in text if char not in string.punctuation)

    def tokenize(self, text):
        text = self.standardize(text)
        return text.split()

    def make_vocabulary(self, dataset):
        self.vocabulary = {"": 0, "[UNK]": 1}
        for text in dataset:
            text = self.standardize(text)
            tokens = self.tokenize(text)
            for token in tokens:
                if token not in self.vocabulary:
                    self.vocabulary[token] = len(self.vocabulary)
        self.inverse_vocabulary = dict(
            (v, k) for k, v in self.vocabulary.items())

    def encode(self, text):
        text = self.standardize(text)
        tokens = self.tokenize(text)
        return [self.vocabulary.get(token, 1) for token in tokens]

    def decode(self, int_sequence):
        return " ".join(
            self.inverse_vocabulary.get(i, "[UNK]") for i in int_sequence)

vectorizer = Vectorizer()
dataset = [
    "I write, erase, rewrite",
    "Erase again, and then",
    "A poppy blooms.",
]
vectorizer.make_vocabulary(dataset)

In [None]:
# test your encoding function
test_sentence = "I write, rewrite, and still rewrite again"
encoded_sentence = vectorizer.encode(test_sentence)
print(encoded_sentence)

In [None]:
# decoded, and unknown values
decoded_sentence = vectorizer.decode(encoded_sentence)
print(decoded_sentence)

> In practice, most people would work with the Keras TextVectorization layer, which is fast and efficient and can be dropped directly into a tf.data pipeline or a Keras model.

This is what the TextVectorization layer looks like:

In [None]:
from tensorflow.keras.layers import TextVectorization
text_vectorization = TextVectorization(
    output_mode="int",        # return each token as an integer index
)

In [None]:
# some service libraries + tensorflow

import re
import string
import tensorflow as tf

def custom_standardization_fn(string_tensor):
    lowercase_string = tf.strings.lower(string_tensor) # convert strings to lowercase
    return tf.strings.regex_replace(                 # replace punctuation with empty string
        lowercase_string, f"[{re.escape(string.punctuation)}]", "") 

def custom_split_fn(string_tensor):
    return tf.strings.split(string_tensor) # split string on whitespace

text_vectorization = TextVectorization(
    output_mode="int",
    standardize=custom_standardization_fn,
    split=custom_split_fn,
)

In [None]:
# test it
# also you can play and experiment with this by editing the dataset...
dataset = [
    "I write, erase, rewrite",
    "Erase again, and then",
    "A poppy blooms.",
]
text_vectorization.adapt(dataset)

**Displaying the vocabulary**

In [None]:
# notice the symbolic [UNK]
text_vectorization.get_vocabulary()

In [None]:
# test sentence
vocabulary = text_vectorization.get_vocabulary()
test_sentence = "I write, rewrite, and still rewrite again"
encoded_sentence = text_vectorization(test_sentence)
print(encoded_sentence)

In [None]:
# decode back from int
inverse_vocab = dict(enumerate(vocabulary))
decoded_sentence = " ".join(inverse_vocab[int(i)] for i in encoded_sentence)
print(decoded_sentence)

## 11.3 Two approaches for representing groups of words: Sets and sequences

The problem of word order is a principal challenge in Natural Language Processing.

Consider:

Regular: You will take out the trash

Yoda:    The trash you will take out

approach 1: ignore order (e.g. 'bag of words')

approach 2: consider order (e.g. RNN, Transformers)
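
To make the difference concrete, here is a small plain-Python sketch (not from the book) comparing the two sentences as a set of words and as a sequence of words.

```python
regular = "You will take out the trash"
yoda = "The trash you will take out"

regular_tokens = regular.lower().split()
yoda_tokens = yoda.lower().split()

# Approach 1: as sets, the two sentences are indistinguishable
print(set(regular_tokens) == set(yoda_tokens))  # True

# Approach 2: as sequences, the difference in word order is preserved
print(regular_tokens == yoda_tokens)            # False
```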

Here we will look at sentiment analysis using each approach, with the famous IMDB Movie Review Dataset.

### Preparing the IMDB movie reviews data

In [None]:
# Get the data - courtesy of Andrew Maas, Stanford U
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz

In [None]:
# Delete an unneeded part of the data (optional... haha)
!rm -r aclImdb/train/unsup

In [None]:
# ls aclImdb/train/pos

In [None]:
# !cat aclImdb/train/pos/4077_10.txt

In [None]:
!cat aclImdb/train/pos/10000_8.txt

In [None]:
# Create train / test / validate partitions
# only run once, visually inspect dir structure if unsure

import os, pathlib, shutil, random

base_dir = pathlib.Path("aclImdb")
val_dir = base_dir / "val"
train_dir = base_dir / "train"
for category in ("neg", "pos"):
    os.makedirs(val_dir / category)
    files = os.listdir(train_dir / category)
    random.Random(1337).shuffle(files)
    num_val_samples = int(0.2 * len(files))
    val_files = files[-num_val_samples:]
    for fname in val_files:
        shutil.move(train_dir / category / fname,
                    val_dir / category / fname)

In [None]:
# set location of train / test / validate files
# the 2 classes are negative versus positive

from tensorflow import keras
batch_size = 32

train_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/train", batch_size=batch_size
)
val_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/val", batch_size=batch_size
)
test_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/test", batch_size=batch_size
)

**Displaying the shapes and dtypes of the first batch**

In [None]:
# These datasets yield inputs that are TensorFlow tf.string tensors and targets that are
# int32 tensors encoding the value “0” or “1.”
for inputs, targets in train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

### Processing words as a set: The bag-of-words approach

#### Single words (unigrams) with binary encoding

If you use a bag of single words, the sentence “the cat sat on the mat” becomes

{"cat", "mat", "on", "sat", "the"}

The main advantage of this encoding is that you can represent an entire text as a single vector; the cost is that word order and context information are discarded.
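
As a rough sketch of what such a vector looks like, the plain-Python example below builds a multi-hot encoding against a tiny, made-up vocabulary. The `multi_hot` helper and the placement of `[UNK]` at index 0 are assumptions for this illustration.

```python
def multi_hot(text, vocabulary):
    # Map each vocabulary entry to its position in the output vector
    index = {tok: i for i, tok in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for tok in text.lower().split():
        # Unknown words all share the [UNK] slot
        vector[index.get(tok, index["[UNK]"])] = 1
    return vector

vocabulary = ["[UNK]", "cat", "dog", "mat", "on", "sat", "the"]
vec = multi_hot("the cat sat on the mat", vocabulary)
print(vec)  # [0, 1, 0, 1, 1, 1, 1]
```

Note that "the" appears twice in the sentence but still contributes a single 1: the encoding records presence, not frequency.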

**Preprocessing our datasets with a `TextVectorization` layer**

In [None]:
# Practicality: Limit the vocabulary to the 20,000 most frequent words.
text_vectorization = TextVectorization(
    max_tokens=20000,
    output_mode="multi_hot",
)

# Prepare a dataset that only yields raw text inputs (no labels)
text_only_train_ds = train_ds.map(lambda x, y: x)

# Use that dataset to index the dataset vocabulary via the adapt() method
text_vectorization.adapt(text_only_train_ds)

# Prepare processed versions of our training, validation, and test dataset.
# NB you can specify num_parallel_calls to leverage multiple CPUs!
binary_1gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_1gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_1gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

**Inspecting the output of our binary unigram dataset**

In [None]:
for inputs, targets in binary_1gram_train_ds:
    print("inputs.shape:", inputs.shape) # (32, 20000): batch of 32 reviews x 20,000 vocabulary entries
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape) 
    print("targets.dtype:", targets.dtype)
    print("inputs[0]:", inputs[0]) # just 0 or 1 for presence against the 20K
    print("targets[0]:", targets[0])
    break

**Our model-building utility**

In [None]:
# Uses popular and mainstream Tensorflow and Keras
# to set up our language 'model'
# need some neural network background here
# but we will ignore this...

# This will result in a 'model' definition

from tensorflow import keras
from tensorflow.keras import layers

def get_model(max_tokens=20000, hidden_dim=16):
    inputs = keras.Input(shape=(max_tokens,))
    x = layers.Dense(hidden_dim, activation="relu")(inputs)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="rmsprop",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

**Training and testing the binary unigram model**

In [None]:
# this will take a few minutes to run (progress bar)

# 'Calls' and preps the model we just defined
model = get_model()
model.summary()
callbacks = [
    keras.callbacks.ModelCheckpoint("binary_1gram.keras",
                                    save_best_only=True)
]

# this runs the model against our data
model.fit(binary_1gram_train_ds.cache(), #NB this caches the data in memory: called once
          validation_data=binary_1gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)

# reloads the best model saved by the checkpoint callback during training
model = keras.models.load_model("binary_1gram.keras")

# print the results - should be pretty good... like > 80% accuracy in testing
print(f"Test acc: {model.evaluate(binary_1gram_test_ds)[1]:.3f}")

#### Bigrams with binary encoding

Of course, discarding word order is very reductive, because even atomic concepts can be expressed via multiple words: the term “United States” conveys a concept that is quite distinct from the meaning of the words “states” and “united” taken separately.

For this reason, you will usually end up re-injecting local order information into your bag-of-words representation by looking at N-grams rather than single words (most commonly, bigrams).

&nbsp;

With bigrams, our sentence becomes

{"the", "the cat", "cat", "cat sat", "sat",

"sat on", "on", "on the", "the mat", "mat"}

&nbsp;

The TextVectorization layer can be configured to return arbitrary N-grams: bigrams, trigrams, etc. Just pass an ngrams=N argument as in the following listing.

**Configuring the `TextVectorization` layer to return bigrams**

In [None]:
text_vectorization = TextVectorization(
    ngrams=2,     # specifies 2-grams
    max_tokens=20000,
    output_mode="multi_hot",
)

**Training and testing the binary bigram model**

In [None]:
# rerun 2-gram model and print results
# should take a few mins
# DO YOU THINK this will do better, poorer, or about the same?

text_vectorization.adapt(text_only_train_ds)
binary_2gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_2gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_2gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

model = get_model()
model.summary()
callbacks = [
    keras.callbacks.ModelCheckpoint("binary_2gram.keras",
                                    save_best_only=True)
]
model.fit(binary_2gram_train_ds.cache(),
          validation_data=binary_2gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)
model = keras.models.load_model("binary_2gram.keras")
print(f"Test acc: {model.evaluate(binary_2gram_test_ds)[1]:.3f}")

#### Bigrams with TF-IDF encoding

You can also add a bit more information to this representation by counting how many times each word or N-gram occurs, that is to say, by taking the histogram of the words over the text:

{"the": 2, "the cat": 1, "cat": 1, "cat sat": 1, "sat": 1,

"sat on": 1, "on": 1, "on the": 1, "the mat": 1, "mat": 1}


&nbsp;

If you’re doing text classification, knowing how many times a word occurs in a sample is critical: any sufficiently long movie review may contain the word “terrible” regardless of sentiment, but a review that contains many instances of the word “terrible” is likely a negative one.
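
The histogram above can be reproduced with a short plain-Python sketch. The helper `ngrams_up_to` is a name invented for this example; it simply collects all unigrams and bigrams before counting them.

```python
from collections import Counter

def ngrams_up_to(words, n):
    # All 1-grams, 2-grams, ... up to n-grams, as space-joined strings
    return [" ".join(words[i:i + size])
            for size in range(1, n + 1)
            for i in range(len(words) - size + 1)]

counts = Counter(ngrams_up_to("the cat sat on the mat".split(), 2))
print(counts["the"])      # 2
print(counts["the mat"])  # 1
print(len(counts))        # 10 distinct unigrams and bigrams
```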

**Configuring the `TextVectorization` layer to return token counts**

In [None]:
# count occurrence
text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode="count"  # count occurrence
)

**Configuring `TextVectorization` to return TF-IDF-weighted outputs**

Some important technical detail to UNDERSTAND, but it will work even if IGNORED (ha!).

Now, of course, some words are bound to occur more often than others no matter what the text is about. The words “the,” “a,” “is,” and “are” will always dominate your word count histograms, drowning out other words—despite being pretty much useless features in a classification context. 

&nbsp;

How could we address this? Via 'normalization'.

&nbsp;

We could just normalize word counts by subtracting the mean and dividing by the variance (computed across the entire training dataset). That would make sense. Except most vectorized sentences consist almost entirely of zeros (our previous example features 12 non-zero entries and 19,988 zero entries), a property called “sparsity.” That’s a great property to have, as it dramatically reduces compute load and reduces the risk of overfitting. If we subtracted the mean from each feature, we’d wreck sparsity. Thus, whatever normalization scheme we use
should be divide-only. 

What, then, should we use as the denominator? The best practice
is to go with something called TF-IDF normalization—TF-IDF stands for “term frequency, inverse document frequency.”
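
To give a feel for the idea, here is a minimal plain-Python sketch of one TF-IDF variant. The smoothed formula `log(1 + N / (1 + df))` is an assumption chosen for this example; the exact weighting used inside `TextVectorization` may differ. The function name `tf_idf` and the toy documents are also made up.

```python
import math
from collections import Counter

def tf_idf(docs):
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for toks in tokenized:
        doc_freq.update(set(toks))
    # Inverse document frequency (assumed smoothing): rarer terms weigh more
    idf = {t: math.log(1 + n_docs / (1 + df)) for t, df in doc_freq.items()}
    # Scale each term count by its idf; absent terms stay absent (zero)
    return [{t: c * idf[t] for t, c in Counter(toks).items()}
            for toks in tokenized]

docs = [
    "the film was terrible terrible",
    "the film was fine",
    "the cast was great",
]
weights = tf_idf(docs)
for term, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {weight:.3f}")
```

Because each count is only rescaled and never shifted, terms that do not occur keep a weight of zero, so sparsity is preserved. In the first toy document, "terrible" (rare and repeated) ends up with a much larger weight than "the" (present in every document).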

In [None]:
text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode="tf_idf", #set TF-IDF mode
)

**Training and testing the TF-IDF bigram model**

In [None]:
# setup model and run like before
# takes a few mins
text_vectorization.adapt(text_only_train_ds)

tfidf_2gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
tfidf_2gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
tfidf_2gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

model = get_model()
model.summary()
callbacks = [
    keras.callbacks.ModelCheckpoint("tfidf_2gram.keras",
                                    save_best_only=True)
]

# NB adding model fit to object 'history'
history = model.fit(tfidf_2gram_train_ds.cache(),
          validation_data=tfidf_2gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)
model = keras.models.load_model("tfidf_2gram.keras")
print(f"Test acc: {model.evaluate(tfidf_2gram_test_ds)[1]:.3f}")



In [None]:
# for plotting
import matplotlib.pyplot as plt
import numpy

# list all data in history
print(history.history.keys())
# summarize history for accuracy
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')  # second curve is val_accuracy
plt.show()

# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')  # second curve is val_loss
plt.show()

#### Testing our model on novel data

<center>
   <img src="https://github.com/ha-data-science/ha-data-science.github.io/blob/main/pages/harug-files/2021-12-01-NLP-sentiment-analysis/img/WoT.jpg?raw=true" alt="Wheel of Time starring Rosamund Pike" width="500"/>
</center>

In [None]:
# set up inputs to actually test our model on novel data
inputs = keras.Input(shape=(1,), dtype="string")
processed_inputs = text_vectorization(inputs)
outputs = model(processed_inputs)
inference_model = keras.Model(inputs, outputs)

In [None]:
# test

my_Fake_Review0 = "The actress was terrible and the writing was bad"

my_Fake_Review1 = "The actress was relatable and the writing was fabulous"

# Amazon wheel of time from today, literally
# 2 stars
real_Review0 = "I have read all of the books cover to cover at least 4 times, but I am judging this as a standalone effort. To be Blunt, it is bad. We start with a big bit of drama in episode 1 and very little reason to care. As it continues into the 3rd episode the characters are so utterly devoid of any personality that there's still little reason to care. Pike chews the set with some of the worst acting I have ever seen. The rest of the cast is not much better. The script is as subtle as a hammer blow. It is childish, unnatural and lacking any art. The CGI is pretty and the monsters are genuinely scary to look at, but then it goes too far with Styrofoam bricks flying around. Touching a little upon the books you have a world where all races unite for a final battle and a world mostly run by women. Despite this, the writers have chosen to go OTT with some aspects of equality which completely devalues the equality the books naturally crafted. There is no art here (beyond set design, makeup, costume and some CGI). It is a bloated mess filled with bland characters in a constant state of reaction. There is no flow, no rhythm and no heart. It is a hollow big budget waste, sadly. Looking squarely at the books they have changed a lot of things. The journey is much the same, but Mat is not Mat, the Aes Sedai are just weird, Lan is a skinny dude, and there are small changes here and there that just weren't needed. If it was like this and good, I'd have loved it. I wouldn't mind the series and the books having their own identity. Sadly, this is a poor childish hackjob of a creation that failed to hold my interest in any meaningful way. New Spring and. The Eye of the World captivated me, made me care instantly for the people and held me for the whole series. This. This did not."

# 5 stars
real_Review1 = "Seems their are two parties who have watched this, the ones that are new, book readers who went in with an open mind and expected changes, and then the ones who went in with a closed mind giving 1 or 2 star reviews. The acting the scenery even seeing the one power wielderld (which I was a bit skeptical about in the trailers) was fantastic, I'm a massive fan of the books and I expected changed but nothing so far I feel has taken away from the overarching story. For open minded fans and people new to the wheel of time story this is a must watch!!!"

import tensorflow as tf
raw_text_data = tf.convert_to_tensor([
    [my_Fake_Review0],
])

predictions = inference_model(raw_text_data)
print(f"{float(predictions[0] * 100):.2f} percent positive")