# Natural Language Processing

In this notebook we introduce some of the ideas of natural language processing. To speed things up we will be using a couple of optional TensorFlow libraries, TensorFlow Hub and TensorFlow Datasets, and also a library called tf_keras, which is the Keras 2 package and works much like the Keras bundled with TensorFlow.

The content of this notebook is heavily inspired by one of the [TensorFlow tutorials](https://www.tensorflow.org/tutorials/keras/text_classification_with_hub).

# Pre-trained models

This notebook is the first time in the course that we will be using a pre-trained model (in this case for a text embedding layer). Using these pre-trained models can dramatically speed up certain tasks.

## Installing the libraries
In the first cell we use pip to install these libraries.

In [1]:
!pip install tensorflow_hub
!pip install tensorflow_datasets
!pip install tf_keras



## Import libraries, check versions and GPU availability
Next up we import the libraries, check some of the TensorFlow settings and see whether you are running on a GPU or not.

In [2]:
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds
import tf_keras as keras


print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("Hub version: ", hub.__version__)
print("GPU is", "available" if tf.config.list_physical_devices("GPU") else "NOT AVAILABLE")




Version:  2.18.0
Eager mode:  True
Hub version:  0.16.1
GPU is NOT AVAILABLE


## Load Internet Movie Database Review Dataset
Now we are going to load a text dataset from the [Internet Movie Database](https://www.imdb.com). This dataset consists of movie reviews of various lengths, each labelled as either positive or negative.

In [3]:
# Split the training set into 60% and 40% to end up with 15,000 examples
# for training, 10,000 examples for validation and 25,000 examples for testing.
train_data, validation_data, test_data = tfds.load(
    name="imdb_reviews", 
    split=('train[:60%]', 'train[60%:]', 'test'),
    as_supervised=True)
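
The split sizes quoted in the comment follow from simple arithmetic. The sketch below is only a sanity check and does not touch TensorFlow Datasets: it assumes the IMDB training and test sets each hold 25,000 reviews (the figures implied by the comment), and the variable names are made up for illustration.

```python
# Assumed sizes of the IMDB splits (25,000 reviews each).
TRAIN_TOTAL = 25_000
TEST_TOTAL = 25_000

# 'train[:60%]' keeps the first 60% of the training reviews,
# 'train[60%:]' keeps the remaining 40% for validation.
n_train = TRAIN_TOTAL * 60 // 100
n_validation = TRAIN_TOTAL - n_train

print(n_train, n_validation, TEST_TOTAL)
```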

## Printing some example reviews
The TensorFlow dataset is an interesting object. In the code below we create a batch of 10 reviews (and labels), turn it into an iterator with `iter` and then get the values with `next`. At first it might seem a little clunky using TensorFlow datasets, but they are a useful resource to try and understand.

In [4]:
train_examples_batch, train_labels_batch = next(iter(train_data.batch(10)))
print(train_examples_batch)

tf.Tensor(
[b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."
 b'I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell asleep because the film was rubbish. The plot de

2025-02-06 15:14:22.236099: I tensorflow/core/kernels/data/tf_record_dataset_op.cc:376] The default buffer size is 262144, which is overridden by the user specified `buffer_size` of 8388608
2025-02-06 15:14:22.237980: W tensorflow/core/kernels/data/cache_dataset_ops.cc:914] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.


In [5]:
# Also print the labels
print(train_labels_batch)

tf.Tensor([0 0 0 1 1 1 0 0 0 0], shape=(10,), dtype=int64)
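
The labels are plain integers. In the TensorFlow Datasets version of `imdb_reviews`, 0 marks a negative review and 1 a positive one, which agrees with the first review printed above. The sketch below copies the ten labels from the output by hand rather than reading the dataset, and `LABEL_NAMES` is a hypothetical helper list used only to make them readable.

```python
from collections import Counter

# Hypothetical lookup list: index 0 -> negative, index 1 -> positive.
LABEL_NAMES = ["negative", "positive"]

# The ten labels printed above, copied by hand.
batch_labels = [0, 0, 0, 1, 1, 1, 0, 0, 0, 0]

# Translate each integer label into its name and count the classes.
named_labels = [LABEL_NAMES[label] for label in batch_labels]
counts = Counter(named_labels)

print(named_labels)
print(counts)
```

A batch of only ten reviews says little about the class balance of the whole dataset, so the counts here are just for illustration.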


## Text Embedding
Obviously, as we've stated repeatedly in the course, neural networks take input vectors (of various sizes) and then apply some combination of matrix multiplications, additions, non-linear activation functions and other mathematical operations. English sentences are not immediately well-suited to these operations, so the first stage in any natural language processing task is to convert the input into some kind of numerical vector via a process of embedding. Fortunately lots of people have done this before us, so in true [Blue Peter](https://www.bbc.co.uk/cbbc/watch/bp-heres-one-i-made-earlier-video-challenge) style we can use one that has already been pre-trained.

In the code below we use TensorFlow Hub to get a pre-trained text embedding layer. The embedding we use is the catchily named [google/nnlm-en-dim50/2](https://tfhub.dev/google/nnlm-en-dim50/2), which as its name suggests is a Neural Network Language Model in English that encodes each piece of text into a vector of dimension 50. We create a layer called `hub_layer` which is trainable but starts off from the pre-trained weights.
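
To make the idea of an embedding concrete, here is a toy sketch in plain Python. It is not the NNLM model: the three-dimensional `toy_vectors` table, the whitespace tokenizer and the `embed_sentence` helper are all invented for illustration. It shows one common recipe, which is to look up a vector for each word and combine them (here by summing and dividing by the square root of the word count), so that text of any length becomes a vector of fixed size.

```python
import math

# Hypothetical 3-dimensional word vectors; a real model learns these values
# and has a far larger vocabulary and dimension.
toy_vectors = {
    "great": [0.9, 0.1, 0.0],
    "terrible": [-0.8, 0.2, 0.1],
    "movie": [0.0, 0.5, 0.3],
}
UNKNOWN = [0.0, 0.0, 0.0]  # fallback for words missing from the table


def embed_sentence(sentence):
    """Map a sentence of any length to one fixed-size vector."""
    words = sentence.lower().split()
    if not words:
        return list(UNKNOWN)
    vectors = [toy_vectors.get(word, UNKNOWN) for word in words]
    scale = math.sqrt(len(words))
    # Sum the word vectors dimension by dimension, then rescale.
    return [sum(column) / scale for column in zip(*vectors)]


print(embed_sentence("Great movie"))
print(embed_sentence("A terrible terrible movie"))
```

Both calls return three numbers even though the sentences differ in length, which is the property the dense layers later in the notebook rely on.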

#### From TensorFlow's Tutorial
There are many other pre-trained text embeddings from TFHub that can be used in this tutorial:

* [google/nnlm-en-dim128/2](https://tfhub.dev/google/nnlm-en-dim128/2) - trained with the same NNLM architecture on the same data as [google/nnlm-en-dim50/2](https://tfhub.dev/google/nnlm-en-dim50/2), but with a larger embedding dimension. Larger dimensional embeddings can improve on your task but it may take longer to train your model.
* [google/nnlm-en-dim128-with-normalization/2](https://tfhub.dev/google/nnlm-en-dim128-with-normalization/2) - the same as [google/nnlm-en-dim128/2](https://tfhub.dev/google/nnlm-en-dim128/2), but with additional text normalization such as removing punctuation. This can help if the text in your task contains additional characters or punctuation.
* [google/universal-sentence-encoder/4](https://tfhub.dev/google/universal-sentence-encoder/4) - a much larger model yielding 512 dimensional embeddings trained with a deep averaging network (DAN) encoder.

And many more! Find more [text embedding models](https://tfhub.dev/s?module-type=text-embedding) on TFHub.



In [6]:
embedding = "https://tfhub.dev/google/nnlm-en-dim50/2"
hub_layer = hub.KerasLayer(embedding, input_shape=[], 
                           dtype=tf.string, trainable=True)
hub_layer(train_examples_batch[:3])

<tf.Tensor: shape=(3, 50), dtype=float32, numpy=
array([[ 0.5423195 , -0.0119017 ,  0.06337538,  0.06862972, -0.16776837,
        -0.10581174,  0.16865303, -0.04998824, -0.31148055,  0.07910346,
         0.15442263,  0.01488662,  0.03930153,  0.19772711, -0.12215476,
        -0.04120981, -0.2704109 , -0.21922152,  0.26517662, -0.80739075,
         0.25833532, -0.3100421 ,  0.28683215,  0.1943387 , -0.29036492,
         0.03862849, -0.7844411 , -0.0479324 ,  0.4110299 , -0.36388892,
        -0.58034706,  0.30269456,  0.3630897 , -0.15227164, -0.44391504,
         0.19462997,  0.19528408,  0.05666234,  0.2890704 , -0.28468323,
        -0.00531206,  0.0571938 , -0.3201318 , -0.04418665, -0.08550783,
        -0.55847436, -0.23336391, -0.20782952, -0.03543064, -0.17533456],
       [ 0.56338924, -0.12339553, -0.10862679,  0.7753425 , -0.07667089,
        -0.15752277,  0.01872335, -0.08169781, -0.3521876 ,  0.4637341 ,
        -0.08492756,  0.07166859, -0.00670817,  0.12686075, -0.19326553,
 

## Building a full model
To build the full model we need to take our embedding layer and then add a simple fully connected network that ends with a single neuron in the output layer (which we will use to compare to the labels). 

In [7]:
model = keras.Sequential()
model.add(hub_layer)
model.add(keras.layers.Dense(16, activation='relu'))
model.add(keras.layers.Dense(1))

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 keras_layer (KerasLayer)    (None, 50)                48190600  
                                                                 
 dense (Dense)               (None, 16)                816       
                                                                 
 dense_1 (Dense)             (None, 1)                 17        
                                                                 
Total params: 48191433 (183.84 MB)
Trainable params: 48191433 (183.84 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


## Model comments

So our full model contains 48,191,433 parameters. The vast majority of these are in the embedding layer, and those parameters are already pre-trained. The fully connected classification part of the network has just 816 + 17 = 833 parameters.
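
The dense-layer counts in the summary can be reproduced by hand: a fully connected layer has one weight per input-unit pair plus one bias per unit. The sketch below uses a hypothetical helper, `dense_params`, and takes the embedding-layer count directly from the summary rather than deriving it.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


embedding_params = 48_190_600     # copied from the model summary
hidden = dense_params(50, 16)     # 50-dimensional embedding -> 16 neurons
output = dense_params(16, 1)      # 16 neurons -> 1 output neuron
total = embedding_params + hidden + output

print(hidden, output, total)  # 816 17 48191433
```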

# Model Compilation

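We are going to use the `adam` optimizer, and since we are doing a binary classification task we will use `BinaryCrossentropy` as our loss function. The final layer has no activation function, so the model outputs a raw score (a logit) rather than a probability; this is why the loss is created with `from_logits=True`.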
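
For a single review, binary cross-entropy can be written directly in terms of the raw score. The plain-Python sketch below is an illustration, not Keras's own code: `bce_naive` applies a sigmoid and then the textbook formula, while `bce_from_logit` uses an algebraically equal rearrangement that avoids taking the log of zero when the score is large. Both function names are made up for this example.

```python
import math


def sigmoid(z):
    """Squash a raw score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def bce_naive(logit, label):
    """Textbook form: -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(logit)."""
    p = sigmoid(logit)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))


def bce_from_logit(logit, label):
    """Same loss computed from the logit without forming p explicitly."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


for logit, label in [(2.0, 1), (2.0, 0), (-1.5, 1), (0.0, 0)]:
    print(logit, label, round(bce_naive(logit, label), 6),
          round(bce_from_logit(logit, label), 6))
```

A confident correct score (logit 2.0 with label 1) gives a small loss, and the same score with the wrong label gives a much larger one.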

# Model training

Then we will train our model for 10 epochs. To speed things up we only train on a subset of the data (the 15,000-review training split loaded earlier).

In [9]:
model.compile(optimizer='adam',
              loss=keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

history = model.fit(train_data.shuffle(10000).batch(512),
                    epochs=10,
                    validation_data=validation_data.batch(512),
                    verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
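
Each epoch is one full pass over the training split in batches of 512 reviews. The number of batches per pass is the split size divided by the batch size, rounded up, because `batch` keeps the final partial batch by default. The sketch below assumes the split sizes stated in the loading cell.

```python
import math

BATCH_SIZE = 512

# Split sizes as stated in the comment of the loading cell.
split_sizes = {"train": 15_000, "validation": 10_000, "test": 25_000}

# Round up so the last, smaller batch is counted as well.
steps = {name: math.ceil(size / BATCH_SIZE) for name, size in split_sizes.items()}

print(steps)  # {'train': 30, 'validation': 20, 'test': 49}
```

The figure for the test split agrees with the `49/49` shown in the evaluation output later in the notebook.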


## Model performance

The last step is to use our test dataset (which we haven't touched yet) to evaluate the performance of the model.

When I first ran this the accuracy was 85.3%, which is not bad considering all we have done is add a single hidden layer of 16 neurons after our pre-trained embedding layer.

In [10]:
results = model.evaluate(test_data.batch(512), verbose=2)

for name, value in zip(model.metrics_names, results):
  print("%s: %.3f" % (name, value))

49/49 - 1s - loss: 0.3604 - accuracy: 0.8526 - 1s/epoch - 22ms/step
loss: 0.360
accuracy: 0.853
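
The model's single output is a raw score rather than a probability, so it needs one more step before it reads as a prediction. One common approach, sketched here in plain Python, is to pass the score through a sigmoid and call the review positive when the result exceeds 0.5. The values in `example_logits` are invented for illustration and are not outputs of the trained model.

```python
import math


def sigmoid(z):
    """Convert a raw score (logit) into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


example_logits = [-2.0, -0.1, 0.0, 0.3, 4.0]  # made-up scores

predictions = []
for z in example_logits:
    p = sigmoid(z)
    label = "positive" if p > 0.5 else "negative"
    predictions.append(label)
    print(f"logit {z:+.1f} -> probability {p:.3f} -> {label}")
```

Since the sigmoid of 0 is exactly 0.5, thresholding the probability at 0.5 is the same as checking whether the raw score is above zero.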


# Suggested tasks
1. Can you improve on the model performance?
2. How does the speed of model training change if you try some of the other pre-trained text embedding layers?