<a href="https://colab.research.google.com/github/neildocs/InclusiveGatewayTestCase/blob/master/nlp/sentiment_in_text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing

## Word based encoding

Use the `tensorflow.keras.preprocessing.text.Tokenizer` API to

- generate the dictionary of word encodings, and

- create vectors from the sentences.



In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'I love my dog',
    'I, love my cat',
    'You love my dog!', 
    'Do you think my dog is amazing?'
]

tokenizer = Tokenizer(num_words = 100)  # keep at most the 99 most frequent words when encoding; ample for this corpus
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

By default, the `Tokenizer` strips _punctuation_ and converts text to _lowercase_, so `dog` and `dog!` map to the same token.
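
To make the indexing rule concrete, here is a minimal plain-Python sketch of how such a word index can be built: lowercase the text, blank out punctuation, count the words, and rank them by frequency. The names `PUNCTUATION`, `split_words`, `build_word_index`, and `demo_sentences` are made up for this illustration, and the sketch only approximates the Keras behaviour; it is not the library's implementation.

```python
from collections import Counter

# Characters replaced by spaces before splitting (an assumption of this
# sketch, chosen to resemble the tokenizer's default filter list).
PUNCTUATION = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def split_words(sentence):
    """Lowercase a sentence, blank out punctuation, and split on whitespace."""
    table = str.maketrans({ch: " " for ch in PUNCTUATION})
    return sentence.lower().translate(table).split()


def build_word_index(sentences):
    """Map each word to a 1-based rank, most frequent word first."""
    counts = Counter()
    for sentence in sentences:
        counts.update(split_words(sentence))
    # sorted() is stable, so words with equal counts keep first-seen order.
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


demo_sentences = [
    'I love my dog',
    'I, love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

print(build_word_index(demo_sentences))
# → {'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}
```

Index `0` is left unused here, which keeps it free to serve as a padding value later.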

## Text to sequence

The previous step **tokenized** the words and sentences, building a dictionary of all the words.

Next, turn the sentences into lists of values based on the tokens, using `Tokenizer.texts_to_sequences()`.

You then need to manipulate these lists so that every sentence has the same length.



In [None]:
# ---------------------------------
# Builds on the previous tokenization step
# ---------------------------------

sequences = tokenizer.texts_to_sequences(sentences)

print(word_index)
print(sequences)

The sequences contain **only tokens that exist in the dictionary**.

In the example below, words such as _really_, _loves_, and _manatee_ are **not** in the dictionary, so the sequences omit these unknown words.

In [None]:
test_data = [
    'I really love my dog',
    'My dog loves my manatee'
]

test_seq = tokenizer.texts_to_sequences(test_data)
print(test_seq)

Summary

- A lot of training data is needed to get a **broad** dictionary.

- Instead of ignoring unseen words, substitute a special value by setting the `oov_token` (out-of-vocabulary) argument of `Tokenizer`.
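
The difference between dropping unseen words and substituting an out-of-vocabulary token can be sketched in plain Python. The `encode` function and the two small hand-written dictionaries below are assumptions made for this illustration (they cover only part of the vocabulary); the sketch mimics the idea and is not the Keras implementation.

```python
def encode(sentence, word_index, oov_token=None):
    """Turn a sentence into a list of integer tokens.

    Unknown words are skipped when oov_token is None; otherwise they are
    replaced by the index that word_index assigns to oov_token.
    """
    tokens = []
    for word in sentence.lower().split():
        if word in word_index:
            tokens.append(word_index[word])
        elif oov_token is not None:
            tokens.append(word_index[oov_token])
    return tokens


# Hand-written dictionaries for illustration: one without and one with
# a reserved out-of-vocabulary entry at index 1.
plain_index = {'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5}
oov_index = {'<oov>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6}

print(encode('I really love my dog', plain_index))               # → [4, 2, 1, 3]
print(encode('My dog loves my manatee', plain_index))            # → [1, 3, 1]
print(encode('I really love my dog', oov_index, '<oov>'))        # → [5, 1, 3, 2, 4]
print(encode('My dog loves my manatee', oov_index, '<oov>'))     # → [2, 4, 1, 2, 1]
```

With the OOV entry, each encoded sentence keeps its original length, so word positions are preserved even when the words themselves are unknown.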

In [None]:
# ----------------------------
# This is a revised version 2
# ----------------------------

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'I love my dog',
    'I, love my cat',
    'You love my dog!', 
    'Do you think my dog is amazing?'
]

tokenizer = Tokenizer(num_words = 100, oov_token = "<oov>")  # oov_token can be any special value
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
sequence = tokenizer.texts_to_sequences(sentences)

test_data = [
    'I really love my dog',
    'My dog loves my manatee'
]

test_seq = tokenizer.texts_to_sequences(test_data)
print(test_seq)
print(word_index)

## Uniformize with Padding

To make all sequences the same size, use `pad_sequences`.

By default, `pad_sequences` adds `0`s before the values so that every sequence is as long as the longest one.

Specifying `padding='post'` appends the `0`s after the values instead.

Setting `maxlen=x` limits every sequence to `x` tokens.

By default, tokens are truncated from the beginning. Specify `truncating='post'` to truncate from the end.
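
The padding and truncation options can be illustrated with a small plain-Python stand-in. The function `pad_sketch` and the list `demo_seqs` are hypothetical names for this sketch; the function returns nested lists rather than a NumPy array and is only meant to show what each option does.

```python
def pad_sketch(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    """Pad or truncate every sequence to a common length."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    result = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # "pre" drops items from the front, "post" drops them from the end.
            seq = seq[len(seq) - maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        # "pre" puts the filler before the values, "post" puts it after.
        result.append(fill + seq if padding == "pre" else seq + fill)
    return result


demo_seqs = [[5, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]

print(pad_sketch(demo_seqs))
# → [[0, 0, 0, 5, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
print(pad_sketch(demo_seqs, padding="post"))
# → [[5, 3, 2, 4, 0, 0, 0], [8, 6, 9, 2, 4, 10, 11]]
print(pad_sketch(demo_seqs, maxlen=5))
# → [[0, 5, 3, 2, 4], [9, 2, 4, 10, 11]]
print(pad_sketch(demo_seqs, maxlen=5, truncating="post"))
# → [[0, 5, 3, 2, 4], [8, 6, 9, 2, 4]]
```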

In [None]:
# ----------------------------
# This is a revised version 3
# ----------------------------

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = [
    'I love my dog',
    'I, love my cat',
    'You love my dog!', 
    'Do you think my dog is amazing?'
]

tokenizer = Tokenizer(num_words = 100, oov_token = "<oov>")  # oov_token can be any special value
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)

# padding
# padded = pad_sequences(sequences)
padded = pad_sequences(sequences, maxlen=5)

print("\nWord index = ", word_index)
print("\nSequences = ", sequences)
print("\nPadded Sequences: ")
print(padded)

test_data = [
    'I really love my dog',
    'My dog loves my manatee'
]

test_seq = tokenizer.texts_to_sequences(test_data)
print("\nTest sequence = ", test_seq)

padded = pad_sequences(test_seq, maxlen=10)
print("\nPadded test sequence = ")
print(padded)

## Using Sarcasm


> *Sarcasm* is a publicly available data set.

### Prepare the dataset

In [None]:
!wget --no-check-certificate \
    https://storage.googleapis.com/laurencemoroney-blog.appspot.com/sarcasm.json \
    -O /tmp/sarcasm.json

In [None]:
import json

with open('/tmp/sarcasm.json', 'r') as f:
    data_store = json.load(f)

sentences = []
labels = []
urls = []

for item in data_store:
    sentences.append(item['headline'])
    labels.append(item['is_sarcastic'])
    urls.append(item['article_link'])
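
The loading loop above assumes the file is a JSON array of objects with the keys `headline`, `is_sarcastic`, and `article_link`. The sketch below parses two invented records of that shape; the headlines, links, and the names `sample_json`, `records`, `demo_headlines`, and `demo_labels` are made up for illustration and do not come from the data set.

```python
import json

# Two invented records in the shape the loading loop expects.
sample_json = """
[
  {"article_link": "https://example.com/a",
   "headline": "an invented headline",
   "is_sarcastic": 0},
  {"article_link": "https://example.com/b",
   "headline": "another invented headline",
   "is_sarcastic": 1}
]
"""

records = json.loads(sample_json)

# Same extraction as the loop, written as list comprehensions.
demo_headlines = [record["headline"] for record in records]
demo_labels = [record["is_sarcastic"] for record in records]

print(demo_headlines)  # → ['an invented headline', 'another invented headline']
print(demo_labels)     # → [0, 1]
```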

### Tokenizer

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, padding="post")

print(padded[0])
print(padded.shape)

## Play with Sarcasm

In [None]:
!wget --no-check-certificate \
    https://storage.googleapis.com/laurencemoroney-blog.appspot.com/sarcasm.json \
    -O /tmp/sarcasm.json

In [None]:
# ----------
# Load data
# ----------

import json

with open("/tmp/sarcasm.json", "r") as f:
    datastore = json.load(f)

sentences = []
labels = []

for item in datastore:
    sentences.append(item["headline"])
    labels.append(item["is_sarcastic"])

In [None]:
# --------------------
# setup configurations
# --------------------

vocab_size = 10000
embedding_dim = 16
max_length = 32
trunc_type = "post"
padding_type = "post"
oov_tok = "<OOV>"
training_size = 20000


In [None]:
# ---------------------
# Split into training and testing sets
# ---------------------
training_sentences = sentences[:training_size]
testing_sentences = sentences[training_size:]
training_labels = labels[:training_size]
testing_labels = labels[training_size:]


In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index

training_seq = tokenizer.texts_to_sequences(training_sentences)
training_padded = pad_sequences(training_seq, 
                                maxlen = max_length, 
                                padding = padding_type, 
                                truncating = trunc_type)

testing_seq = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_seq, 
                               maxlen = max_length, 
                               padding = padding_type, 
                               truncating = trunc_type)

In [None]:
# ---------------------
# Build neural network
# ---------------------
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length), 
    keras.layers.GlobalAveragePooling1D(),   # average the embeddings over the sequence axis
    keras.layers.Dense(24, activation = "relu"),
    keras.layers.Dense(1, activation = "sigmoid")
])
model.compile(loss = "binary_crossentropy",
              optimizer = "adam",
              metrics = ["accuracy"])

model.summary()
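
As a cross-check on what `model.summary()` reports, the parameter counts for this architecture can be worked out by hand. The sketch below assumes the Keras defaults (an embedding table with no bias, and one bias per unit in each `Dense` layer); `count_parameters` is a hypothetical helper written for this illustration.

```python
def count_parameters(vocab_size, embedding_dim, hidden_units):
    """Return trainable parameter counts per layer for the model above."""
    counts = {
        # One embedding_dim-sized vector per vocabulary entry.
        "embedding": vocab_size * embedding_dim,
        # Averaging over the sequence has nothing to learn.
        "pooling": 0,
        # Weights plus one bias per hidden unit.
        "dense_hidden": embedding_dim * hidden_units + hidden_units,
        # A single output unit: one weight per hidden unit plus one bias.
        "dense_output": hidden_units + 1,
    }
    counts["total"] = sum(counts.values())
    return counts


print(count_parameters(vocab_size=10000, embedding_dim=16, hidden_units=24))
# → {'embedding': 160000, 'pooling': 0, 'dense_hidden': 408, 'dense_output': 25, 'total': 160433}
```

Nearly all of the parameters sit in the embedding table, so `vocab_size` and `embedding_dim` dominate the model's size.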

In [None]:
# --------
# Training
# --------

# TensorFlow 2.x expects NumPy arrays here rather than Python lists
import numpy as np

training_padded = np.array(training_padded)
training_labels = np.array(training_labels)
testing_padded = np.array(testing_padded)
testing_labels = np.array(testing_labels)

num_epochs = 30
history = model.fit(training_padded, 
                    training_labels, 
                    epochs = num_epochs, 
                    validation_data = (testing_padded, 
                                       testing_labels), 
                    verbose = 2)

In [None]:
# ----------------
# Plot the results
# ----------------
import matplotlib.pyplot as plt

def plot_graphs(history, string):
    plt.plot(history.history[string])
    plt.plot(history.history['val_' + string])
    plt.xlabel("Epochs")
    plt.ylabel(string)
    plt.legend([string, 'val_' + string])
    plt.show()

plot_graphs(history, "accuracy")
plot_graphs(history, "loss")

In the graphs, the _training loss_ falls while the _validation loss_ increases, which suggests that the model is overfitting.
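
One simple way to see where the curves diverge is to find the epoch at which the validation loss is lowest; training beyond that point mostly fits the training set. The sketch below is a minimal illustration: `best_epoch` is a hypothetical helper, and the loss values are made-up numbers, not results from this notebook.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    lowest = min(val_losses)
    return val_losses.index(lowest) + 1


# Made-up loss curves for illustration only.
demo_train_loss = [0.60, 0.40, 0.30, 0.22, 0.17, 0.13]
demo_val_loss = [0.55, 0.42, 0.38, 0.39, 0.43, 0.48]

epoch = best_epoch(demo_val_loss)
print(epoch)  # → 3

# Training loss keeps falling after that epoch while validation loss rises.
print(demo_train_loss[-1] < demo_train_loss[epoch - 1])  # → True
print(demo_val_loss[-1] > demo_val_loss[epoch - 1])      # → True
```

After a real training run, the same helper could be applied to `history.history["val_loss"]`.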