# IMDB USE and GNEWS Embedding Models

_By Nick Brooks, February 2020_


# TF Hub for TF2: Text classification with movie reviews (preview)

SOURCE: https://github.com/tensorflow/hub/blob/master/examples/colab/tf2_text_classification.ipynb

This notebook classifies movie reviews as *positive* or *negative* using the text of the review. This is an example of *binary*—or two-class—classification, an important and widely applicable kind of machine learning problem. 

We'll use the [IMDB dataset](https://www.tensorflow.org/api_docs/python/tf/keras/datasets/imdb) that contains the text of 50,000 movie reviews from the [Internet Movie Database](https://www.imdb.com/). These are split into 25,000 reviews for training and 25,000 reviews for testing. The training and testing sets are *balanced*, meaning they contain an equal number of positive and negative reviews. 

This notebook uses [tf.keras](https://www.tensorflow.org/guide/keras), a high-level API to build and train models in TensorFlow, and [TensorFlow Hub](https://www.tensorflow.org/hub), a library and platform for transfer learning. For a more advanced text classification tutorial using `tf.keras`, see the [MLCC Text Classification Guide](https://developers.google.com/machine-learning/guides/text-classification/).

In [None]:
!pip install tensorflow_datasets > /dev/null

In [None]:
import time
import numpy as np
import gc
import pandas as pd

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds
from tensorflow.keras import backend as K

from tensorflow.keras.layers import Dense, Input, BatchNormalization, Dropout, concatenate, GlobalAveragePooling1D
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.optimizers import Adam, SGD

import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from wordcloud import WordCloud, STOPWORDS
from sklearn import metrics

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("Hub version: ", hub.__version__)
print("GPU is", "available" if tf.config.list_physical_devices("GPU") else "NOT AVAILABLE")

start = time.time()
pd.options.display.max_colwidth = 1500

In [None]:
# My Parameters
BATCH_SIZE = 512
SEED=42

In [None]:
print("Word Cloud Function..")
stopwords = set(STOPWORDS)
size = (20,10)

def cloud(text, title, stopwords=stopwords, size=size):
    """
    Plot a word cloud of `text` under the given `title`.
    """
    # Setting figure parameters
    mpl.rcParams['figure.figsize']=(10.0,10.0)
    mpl.rcParams['font.size']=12
    mpl.rcParams['savefig.dpi']=100
    mpl.rcParams['figure.subplot.bottom']=.1 
    
    # Generate the word cloud (WordCloud drops the given stopwords)
    wordcloud = WordCloud(width=1600, height=800,
                          background_color='black',
                          stopwords=stopwords,
                         ).generate(str(text))
    
    # Output Visualization
    fig = plt.figure(figsize=size, dpi=80, facecolor='k',edgecolor='k')
    plt.imshow(wordcloud,interpolation='bilinear')
    plt.axis('off')
    plt.title(title, fontsize=50,color='y')
    plt.tight_layout(pad=0)
    plt.show()

## Download the IMDB dataset

The IMDB dataset is available on [TensorFlow datasets](https://github.com/tensorflow/datasets). The following code downloads the IMDB dataset to your machine (or the colab runtime):

In [None]:
train_data, test_data = tfds.load(name="imdb_reviews", split=["train", "test"], 
                                  batch_size=-1, as_supervised=True)

train_examples, train_labels = tfds.as_numpy(train_data)
test_examples, test_labels = tfds.as_numpy(test_data)

In [None]:
np.save("train_examples", train_examples)
np.save("train_labels", train_labels)

np.save("test_examples", test_examples)
np.save("test_labels", test_labels)

In [None]:
!ls

## Explore the data 

Let's take a moment to understand the format of the data. Each example is the raw text of a movie review paired with a label. The text is not preprocessed in any way. The label is an integer, either 0 or 1, where 0 is a negative review and 1 is a positive review.

In [None]:
print("Training entries: {}, test entries: {}".format(len(train_examples), len(test_examples)))

Let's look at the review lengths (in bytes), then print the first 10 examples.

In [None]:
input_len = [len(x) for x in np.concatenate((train_examples, test_examples), axis=0)]
print("Input Lengths:\nAverage {:.1f} +/- {:.1f}\nMax {} Min {}".format(np.mean(input_len), np.std(input_len), np.max(input_len), np.min(input_len)))

In [None]:
train_examples[:10]

Let's also print the first 10 labels.

In [None]:
train_labels[:10]

In [None]:
# Look at class balance..
unique_elements, counts_elements = np.unique(train_labels, return_counts=True)
print("Label values (first row) and their counts (second row):")
print(np.asarray((unique_elements, counts_elements)))

## Build the model

The neural network is created by stacking layers—this requires three main architectural decisions:

* How to represent the text?
* How many layers to use in the model?
* How many *hidden units* to use for each layer?

In this example, the input data consists of sentences. The labels to predict are either 0 or 1.

One way to represent the text is to convert sentences into embedding vectors. We can use a pre-trained text embedding as the first layer, which has two advantages:
*   we don't have to worry about text preprocessing,
*   we can benefit from transfer learning.
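
To make the idea of a sentence embedding concrete, here is a minimal sketch that averages per-word vectors. The vocabulary, the random vectors, the whitespace tokenizer, and the names `embed_sentence` and `embed_batch` are all hypothetical stand-ins; a pre-trained Hub model uses its own vocabulary, tokenization, and way of combining word vectors, which may differ from a plain mean.

```python
import numpy as np

EMBEDDING_DIM = 4  # toy size, chosen only for illustration
rng = np.random.default_rng(0)

# Hypothetical vocabulary with random vectors standing in for trained embeddings.
vocab = {w: rng.normal(size=EMBEDDING_DIM) for w in ["great", "movie", "boring", "plot"]}

def embed_sentence(sentence):
    """Average the vectors of the known words; return zeros if none are known."""
    tokens = sentence.lower().split()
    vectors = [vocab[t] for t in tokens if t in vocab]
    if not vectors:
        return np.zeros(EMBEDDING_DIM)
    return np.mean(vectors, axis=0)

def embed_batch(sentences):
    """Stack one fixed-size vector per sentence."""
    return np.stack([embed_sentence(s) for s in sentences])

batch = embed_batch(["Great movie", "Boring plot", "unknown words only"])
print(batch.shape)  # (3, 4): one row per sentence, one column per dimension
```

Whatever the sentence length, every input maps to a vector of the same size, which is what lets a dense layer sit directly on top of the embedding.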

For this example we will use a model from [TensorFlow Hub](https://www.tensorflow.org/hub) called [google/tf2-preview/gnews-swivel-20dim/1](https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1).

There are three other models to test for the sake of this tutorial:
* [google/tf2-preview/gnews-swivel-20dim-with-oov/1](https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim-with-oov/1) - same as [google/tf2-preview/gnews-swivel-20dim/1](https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1), but with 2.5% vocabulary converted to OOV buckets. This can help if vocabulary of the task and vocabulary of the model don't fully overlap.
* [google/tf2-preview/nnlm-en-dim50/1](https://tfhub.dev/google/tf2-preview/nnlm-en-dim50/1) - A much larger model with ~1M vocabulary size and 50 dimensions.
* [google/tf2-preview/nnlm-en-dim128/1](https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1) - Even larger model with ~1M vocabulary size and 128 dimensions.
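
The general idea behind OOV (out-of-vocabulary) buckets can be sketched as follows: known tokens keep their own ids, while unknown tokens are hashed into a small, fixed set of extra ids. This is an illustration only. The vocabulary, the bucket count, the MD5 hash, and the `token_to_id` helper are assumptions made for this sketch, not the scheme the Hub model actually uses.

```python
import hashlib

VOCAB = {"movie": 0, "great": 1, "boring": 2}  # hypothetical in-vocabulary ids
NUM_OOV_BUCKETS = 4                            # hypothetical bucket count

def token_to_id(token):
    """Return the vocabulary id, or a hashed bucket id for unknown tokens."""
    if token in VOCAB:
        return VOCAB[token]
    # A stable hash keeps the same unknown token in the same bucket on every run.
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return len(VOCAB) + int(digest, 16) % NUM_OOV_BUCKETS

ids = [token_to_id(t) for t in ["great", "movie", "cinematography"]]
print(ids)
```

Because each bucket has its own embedding row, unseen words still contribute some signal instead of being dropped, at the cost of unrelated words sometimes sharing a bucket.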

Let's first create a Keras layer that uses a TensorFlow Hub model to embed the sentences, and try it out on a couple of input examples. Note that the output shape of the produced embeddings is as expected: `(num_examples, embedding_dimension)`.

## GNEWS Embeddings Model

In [None]:
%%time
model_url = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1"
# output_shape=[20] matches the 20-dimensional embeddings this model produces
hub_layer = hub.KerasLayer(model_url, output_shape=[20], input_shape=[],
                           dtype=tf.string, trainable=True, name='gnews_embedding')

In [None]:
hub_layer(train_examples[:3])

Let's now build the full model:

In [None]:
def build_model(embed):
    
    model = Sequential([
        Input(shape=[], dtype=tf.string),
        embed,
        Dropout(.2),
        Dense(16, activation='relu'),
        Dropout(.2),
        Dense(1, activation='sigmoid')
    ])
    model.compile(Adam(), loss='binary_crossentropy', metrics=['accuracy'])
    
    return model

model = build_model(hub_layer)
model.summary()
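
As a rough cross-check of the numbers `model.summary()` reports for the classification head, the parameter count of a dense layer is `inputs * units + units` (weights plus biases). The sketch below assumes a 20-dimensional embedding, as the model name `gnews-swivel-20dim` suggests; `dense_params` is a helper introduced here for illustration, and the embedding layer's own parameters are not counted.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

EMBEDDING_DIM = 20  # assumed from the "20dim" in the model name

hidden = dense_params(EMBEDDING_DIM, 16)  # Dense(16) on top of the embedding
output = dense_params(16, 1)              # Dense(1) sigmoid output
print(hidden, output, hidden + output)    # 336 17 353
```

The `Dropout` layers add no parameters, so the head stays very small compared with the embedding table.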

In [None]:
es = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=4, verbose=1, mode='min')
checkpoint = tf.keras.callbacks.ModelCheckpoint('model.h5', monitor='val_loss', save_best_only=True)

history = model.fit(
                    train_examples,
                    train_labels,
                    epochs=40,
                    batch_size=BATCH_SIZE,
                    validation_split = .2,
                    shuffle = True,
                    callbacks = [checkpoint, es],
                    verbose=1)

model.load_weights('model.h5')
results = model.evaluate(test_examples, test_labels)
print(results)
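
The interaction between `patience=4` and the best-only checkpoint can be sketched in plain Python. This is a simplified model of the callback logic, assuming the default `min_delta` of 0; the validation losses are made-up numbers and `early_stopping_epoch` is a hypothetical helper, not a Keras function.

```python
def early_stopping_epoch(val_losses, patience=4):
    """Return (epoch training stops at, epoch with the best val_loss), 1-indexed."""
    best = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # Improvement: remember it (this is when the checkpoint is rewritten).
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Made-up validation losses: the best value occurs at epoch 3.
val_losses = [0.60, 0.50, 0.45, 0.46, 0.47, 0.48, 0.49, 0.40]
print(early_stopping_epoch(val_losses))  # (7, 3)
```

Training stops four epochs after the best one, so the weights in memory are not the best ones seen; that is why the cell reloads `model.h5` before evaluating. Note that the improvement at epoch 8 is never reached.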

In [None]:
history_dict = history.history
history_dict.keys()

acc = history_dict['accuracy']
val_acc = history_dict['val_accuracy']
loss = history_dict['loss']
val_loss = history_dict['val_loss']

epochs = range(1, len(acc) + 1)

f, ax = plt.subplots(1,2, figsize = [11,4])

# 'r' draws a solid red line
ax[0].plot(epochs, loss, 'r', label='Training loss')
# 'b' draws a solid blue line
ax[0].plot(epochs, val_loss, 'b', label='Validation loss')
ax[0].set_title('Training and validation loss')
ax[0].set_xlabel('Epochs')
ax[0].set_ylabel('Loss')
ax[0].legend()

ax[1].plot(epochs, acc, 'r', label='Training acc')
ax[1].plot(epochs, val_acc, 'b', label='Validation acc')
ax[1].set_title('Training and validation accuracy')
ax[1].set_xlabel('Epochs')
ax[1].set_ylabel('Accuracy')
ax[1].legend()

plt.tight_layout()

plt.show()

In [None]:
# Test Predictions
test_pred = model.predict(test_examples, batch_size = BATCH_SIZE)
results_pd = pd.DataFrame.from_dict({'text': test_examples, 'pred': test_pred[:,0], 'ground_truth': test_labels})
results_pd['error'] = results_pd['ground_truth'] - results_pd['pred']

print("Look at False Negatives")
display(results_pd.sort_values(by = 'error', ascending=False).iloc[:10])

print("Look at False Positives")
display(results_pd.sort_values(by = 'error', ascending=True).iloc[:10])
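
To see why sorting on `ground_truth - pred` surfaces the most confident mistakes, consider a toy example with made-up predictions. A large positive error means a positive review scored close to 0 (false negative); a large negative error means a negative review scored close to 1 (false positive).

```python
import numpy as np

ground_truth = np.array([1, 0, 1, 0])
pred = np.array([0.1, 0.9, 0.8, 0.2])  # made-up sigmoid outputs

error = ground_truth - pred  # approximately [0.9, -0.9, 0.2, -0.2]

worst_false_negative = int(np.argmax(error))  # positive review predicted near 0
worst_false_positive = int(np.argmin(error))  # negative review predicted near 1
print(worst_false_negative, worst_false_positive)  # 0 1
```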

In [None]:
# Clear Memory
K.clear_session()

del history
del model
_ = gc.collect()

## Universal Sentence Encoder Embeddings Model

In [None]:
%%time
module_url = 'https://tfhub.dev/google/universal-sentence-encoder-large/4'
USE_embed = hub.KerasLayer(module_url, trainable=False, name='USE_embedding')

In [None]:
USE_embed(train_examples[:3])

In [None]:
def build_model(embed):
    
    model = Sequential([
        Input(shape=[], dtype=tf.string),
        embed,
        Dropout(.2),
        Dense(16, activation='relu'),
        Dropout(.2),
        Dense(1, activation='sigmoid')
    ])
    model.compile(Adam(), loss='binary_crossentropy', metrics=['accuracy'])
    
    return model

model = build_model(USE_embed)
model.summary()

In [None]:
MAX_LEN = 2058

small_train_examples = np.array([x[:MAX_LEN] for x in train_examples])
small_test_examples = np.array([x[:MAX_LEN] for x in test_examples])
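
One detail worth keeping in mind: assuming the examples are `bytes` objects, as `tfds.as_numpy` returns for string features, the slice `x[:MAX_LEN]` counts bytes rather than characters. The toy example below uses a tiny limit to show that a cut can land in the middle of a multi-byte UTF-8 character.

```python
MAX_LEN = 4  # tiny limit, for illustration only

text = "café!"
review = text.encode("utf-8")  # bytes; "é" takes two bytes in UTF-8
truncated = review[:MAX_LEN]

print(len(text), len(review))                      # 5 6
print(truncated)                                   # b'caf\xc3'
print(truncated.decode("utf-8", errors="ignore"))  # caf
```

For mostly-ASCII English reviews this rarely matters, but it explains why the truncated length is a byte budget, not a character or word count.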

In [None]:
es = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=4, verbose=1, mode='min')
checkpoint = tf.keras.callbacks.ModelCheckpoint('model.h5', monitor='val_loss', save_best_only=True)

history = model.fit(
                    small_train_examples,
                    train_labels,
                    epochs=40,
                    batch_size=BATCH_SIZE,
                    validation_split = .2,
                    shuffle = True,
                    callbacks = [checkpoint, es],
                    verbose=1)

model.load_weights('model.h5')
results = model.evaluate(small_test_examples, test_labels)
print(results)

In [None]:
history_dict = history.history
history_dict.keys()

acc = history_dict['accuracy']
val_acc = history_dict['val_accuracy']
loss = history_dict['loss']
val_loss = history_dict['val_loss']

epochs = range(1, len(acc) + 1)

f, ax = plt.subplots(1,2, figsize = [11,4])

# 'r' draws a solid red line
ax[0].plot(epochs, loss, 'r', label='Training loss')
# 'b' draws a solid blue line
ax[0].plot(epochs, val_loss, 'b', label='Validation loss')
ax[0].set_title('Training and validation loss')
ax[0].set_xlabel('Epochs')
ax[0].set_ylabel('Loss')
ax[0].legend()

ax[1].plot(epochs, acc, 'r', label='Training acc')
ax[1].plot(epochs, val_acc, 'b', label='Validation acc')
ax[1].set_title('Training and validation accuracy')
ax[1].set_xlabel('Epochs')
ax[1].set_ylabel('Accuracy')
ax[1].legend()

plt.tight_layout()

plt.show()

In [None]:
# Test Predictions
test_pred = model.predict(small_test_examples, batch_size = BATCH_SIZE)
results_pd = pd.DataFrame.from_dict({'text': test_examples, 'pred': test_pred[:,0], 'ground_truth': test_labels})
results_pd['error'] = results_pd['ground_truth'] - results_pd['pred']

print("Look at False Negatives")
display(results_pd.sort_values(by = 'error', ascending=False).iloc[:10])

print("Look at False Positives")
display(results_pd.sort_values(by = 'error', ascending=True).iloc[:10])

## Universal Sentence Encoder Clustering

In [None]:
# USE output shape..
USE_embed([small_train_examples[0]])['outputs'].numpy().shape

In [None]:
%%time
full_labels = np.concatenate((train_labels, test_labels))
full_txt = np.concatenate((small_train_examples, small_test_examples))

batch_size = 500
embeddings = []

# Step by batch start index so a final partial batch is not dropped
for i in range(0, full_txt.shape[0], batch_size):
    embeddings.extend(USE_embed(full_txt[i: i + batch_size])['outputs'].numpy())

#### Fit KMeans

In [None]:
kmeans = KMeans(n_clusters=3, random_state=SEED).fit(embeddings)
print("Silhouette Coefficient: %0.3f"% metrics.silhouette_score(embeddings, kmeans.labels_, sample_size=1000))

# Prepare DataFrame
df = pd.DataFrame.from_dict({"Text": full_txt,
                             "Labels": np.concatenate((train_labels, test_labels)),
                             "Clusters": kmeans.labels_})

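For intuition about the silhouette coefficient: for each point, `a` is the mean distance to the other points in its own cluster, `b` is the mean distance to the nearest other cluster, and the score is `(b - a) / max(a, b)`. The sketch below is a small NumPy version on made-up 1-D points. `silhouette_mean` is a hypothetical helper that assumes at least two clusters and scores singleton clusters as 0; it is not the scikit-learn implementation.

```python
import numpy as np

def silhouette_mean(points, labels):
    """Mean silhouette over all points, using Euclidean distance."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    scores = []
    for i in range(len(points)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        if not same.any():
            scores.append(0.0)  # singleton cluster
            continue
        a = dist[i, same].mean()
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

points = [[0.0], [1.0], [10.0], [11.0]]
good = silhouette_mean(points, [0, 0, 1, 1])  # tight, well-separated clusters
bad = silhouette_mean(points, [0, 1, 0, 1])   # clusters mix near and far points
print(round(good, 3), round(bad, 3))
```

Values near 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest points sit closer to another cluster than to their own.
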
#### Model Evaluation

In [None]:
print("Homogeneity: %0.3f" % metrics.homogeneity_score(full_labels, kmeans.labels_))
print("Completeness: %0.3f" % metrics.completeness_score(full_labels, kmeans.labels_))
print("V-measure: %0.3f" % metrics.v_measure_score(full_labels, kmeans.labels_))
print("Adjusted Rand-Index: %.3f"
      % metrics.adjusted_rand_score(full_labels, kmeans.labels_))
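# Note: this scores the ground-truth labels (not the K-means clusters) as a partition of the embeddings
print("Silhouette Coefficient (ground-truth labels): %0.3f"
      % metrics.silhouette_score(embeddings, full_labels, sample_size=1000))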

display(pd.crosstab(df['Clusters'], df['Labels']))
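
Homogeneity and completeness can both be read off the same crosstab of labels against clusters. Homogeneity is `1 - H(C|K) / H(C)` (each cluster contains a single class) and completeness is `1 - H(K|C) / H(K)` (each class falls in a single cluster). The following NumPy sketch computes both from a contingency table; the helper names and toy labels are made up for illustration, and this is not the scikit-learn source.

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (natural log) of a vector of counts."""
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())

def homogeneity_completeness(labels_true, labels_pred):
    classes, class_idx = np.unique(labels_true, return_inverse=True)
    clusters, cluster_idx = np.unique(labels_pred, return_inverse=True)
    # Contingency table: rows are true classes, columns are clusters.
    table = np.zeros((len(classes), len(clusters)))
    np.add.at(table, (class_idx, cluster_idx), 1)
    n = table.sum()
    row_tot = np.broadcast_to(table.sum(axis=1, keepdims=True), table.shape)
    col_tot = np.broadcast_to(table.sum(axis=0, keepdims=True), table.shape)
    nz = table > 0
    h_c = entropy(table.sum(axis=1))  # H(C): entropy of the classes
    h_k = entropy(table.sum(axis=0))  # H(K): entropy of the clusters
    h_c_given_k = float(-(table[nz] / n * np.log(table[nz] / col_tot[nz])).sum())
    h_k_given_c = float(-(table[nz] / n * np.log(table[nz] / row_tot[nz])).sum())
    homogeneity = 1.0 if h_c == 0 else 1.0 - h_c_given_k / h_c
    completeness = 1.0 if h_k == 0 else 1.0 - h_k_given_c / h_k
    return homogeneity, completeness

# One big cluster: trivially complete, but not homogeneous at all.
one_cluster = homogeneity_completeness([0, 0, 1, 1], [0, 0, 0, 0])
# Clusters match the classes up to renaming: both scores are perfect.
perfect = homogeneity_completeness([0, 0, 1, 1], [1, 1, 0, 0])
print(one_cluster, perfect)
```

With two sentiment classes and three clusters, as in the cell above, the two scores can diverge noticeably, which is why the V-measure (their harmonic mean) is reported alongside them.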

In [None]:
for c in sorted(df['Clusters'].unique()):
    cloud(df.loc[df.Clusters == c,"Text"].astype(str).str.title().values, title=f"Cluster ID: {c}", size=[8,5])
    display(df.loc[df.Clusters == c,:].sample(5, random_state=SEED))

In [None]:
print("Notebook Runtime: %0.2f Minutes"%((time.time() - start)/60))