<a href="https://colab.research.google.com/github/harnalashok/deeplearning-sequences/blob/main/text_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
#Last amended: 12th March, 2021
# My folder: harnalashok/github/deeplearning-sequences
# References:
# https://www.tensorflow.org/tutorials/text/text_classification_rnn
# https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/guide/data.ipynb#scrollTo=m5bz7R1xhX1f
# https://stackoverflow.com/a/49579995/3282777
# https://www.tensorflow.org/tutorials/load_data/text

# Objectives:
#            i)  Learning to work with tensors
#            ii) Learning to work with tf.data API
#           iii) Text Classification--Work in progress

In [14]:
# 1.0 Call libraries
import numpy as np
import tensorflow_datasets as tfds
import tensorflow as tf
import matplotlib.pyplot as plt
import os

In [15]:
# 1.1 More libraries
from tensorflow.keras import utils
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
from tensorflow.keras.layers.experimental import preprocessing 

In [16]:
# 1.2 Set numpy decimal printoptions
#      Limit display to precision of 3

np.set_printoptions(precision=3)

In [17]:
# 1.3
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Preprocessing data
The Keras preprocessing layers API allows developers to build Keras-native input processing pipelines. These input processing pipelines can be used as independent preprocessing code in non-Keras workflows, combined directly with Keras models, and exported as part of a Keras SavedModel.  

With Keras preprocessing layers, you can build and export models that are truly end-to-end: models that accept raw images or raw structured data as input; models that handle feature normalization or feature value indexing on their own.

See this [link](https://www.tensorflow.org/guide/keras/preprocessing_layers) for usage of preprocessing library methods. Pay attention to the panel on the right.  



### Text Vectorization
Once our train, test and validation datasets are ready, we feed them to the `TextVectorization` layer.


#### Using `preprocessing.TextVectorization` layer. Its [full syntax](https://www.tensorflow.org/api_docs/python/tf/keras/layers/experimental/preprocessing/TextVectorization) is:

`tf.keras.layers.experimental.preprocessing.TextVectorization(
    max_tokens=None, standardize=LOWER_AND_STRIP_PUNCTUATION,
    split=SPLIT_ON_WHITESPACE, ngrams=None, output_mode=INT,
    output_sequence_length=None, pad_to_max_tokens=True, vocabulary=None, **kwargs
)
`

The `TextVectorization` layer standardizes, tokenizes, and vectorizes text data.

> Standardization refers to preprocessing the text, typically to remove punctuation or HTML elements to simplify the dataset.

> Tokenization refers to splitting strings into tokens (for example, splitting a sentence into individual words by splitting on whitespace).

> Vectorization refers to converting tokens into numbers so they can be fed into a neural network.

All of these tasks can be accomplished with this layer. You can learn more about each of these in the API doc.
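The first two stages can be imitated in plain Python to see what they do to a sentence. This is only a sketch: the helper names `standardize` and `tokenize` are made up for illustration, and the punctuation set (ASCII only) is an assumption that approximates the layer's default rather than reproducing its implementation.

```python
import string

# Assumption: only ASCII punctuation is removed.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)

def standardize(text):
    """Lowercase the text and delete ASCII punctuation characters."""
    return text.lower().translate(_STRIP_PUNCT)

def tokenize(text):
    """Split the text on runs of whitespace."""
    return text.split()

tokens = tokenize(standardize("Look Big, is the best advice!"))
print(tokens)  # ['look', 'big', 'is', 'the', 'best', 'advice']
```

Note that deleting punctuation also fuses hyphenated words: `standardize("Real-World.")` gives `"realworld"`.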

> The default standardization converts text to lowercase and removes punctuation.

> The default tokenizer splits on whitespace.

> The default vectorization mode is `int`. This outputs integer indices (one per token). This mode can be used to build models that take word order into account. You can also use other modes, like `binary`, to build bag-of-words models.
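The difference between the `int` and `binary` modes can be sketched without TensorFlow. The functions below (`build_vocab`, `encode_int`, `encode_binary`) are hypothetical helpers, not the layer's API. They assume index 0 is reserved for padding and index 1 for out-of-vocabulary tokens, as in the layer's `int` mode. The alphabetical tie-break and the exact layout of the binary vector are choices made for this sketch and may differ from what the layer does.

```python
from collections import Counter

PAD, OOV = 0, 1  # assumed reserved indices: padding and out-of-vocabulary

def build_vocab(token_lists):
    """Give every token an index >= 2, most frequent tokens first."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Tie-break alphabetically (an arbitrary choice for this sketch).
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    return {tok: i + 2 for i, tok in enumerate(ordered)}

def encode_int(tokens, vocab):
    """One index per token, so word order is preserved."""
    return [vocab.get(tok, OOV) for tok in tokens]

def encode_binary(tokens, vocab):
    """Multi-hot vector: 1 where a token occurs; order and counts are lost."""
    vec = [0] * (len(vocab) + 2)
    for tok in tokens:
        vec[vocab.get(tok, OOV)] = 1
    return vec

docs = [["the", "cat", "sat"], ["the", "dog", "sat", "down"]]
vocab = build_vocab(docs)
print(vocab)     # {'sat': 2, 'the': 3, 'cat': 4, 'dog': 5, 'down': 6}

int_ids = encode_int(["the", "bird", "sat"], vocab)
print(int_ids)   # [3, 1, 2]  ("bird" is unknown, so it maps to OOV)

print(encode_binary(["cat", "the", "sat"], vocab))  # [0, 0, 1, 1, 1, 0, 0]
```

Reordering the tokens changes the `int` output but leaves the `binary` output unchanged, which is why the latter suits bag-of-words models.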


##### Simple **experiment**

In [18]:
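# 2.0 One document
#     Note: adjacent string literals are joined with no separator, so
#     "is" + "just" below becomes the single token "isjust" (likewise
#     "encounters.In" and "downhow")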
data = [
         "The title of Rachel Levin’s book, Look Big, is"
          "just about the best two words of advice one can"
          " give about how to survive most animal encounters."
          "In her illustrated service manual, Levin breaks down"
          "how to handle 50 different kinds of animals common in "
          "North America, based on expert advice. Let’s look at her"
          " tips for dealing with five of these creatures and see"
          " how they stack up with what the experts say—and with"
          " real-world experience. "
        ]

In [19]:
# 3.1 TextVectorization is a Keras layer in tensorflow,
#     so it can be made part of a model

layer = preprocessing.TextVectorization()

# 3.2 Train (adapt) the layer: this builds its vocabulary from the data
layer.adapt(data)

# 3.3 Transform data
vectorized_text = layer(data)

# 3.4 Examine transformed data
print(vectorized_text)


tf.Tensor(
[[ 4 15  2 25 31 53  6 54 35 10  4 55 14 11  2  9 26 51 39 10  7  5 19 29
  59 45  8 37 21 30 32 52 46  5 38 62 47 34  2 58 50 36 28 61 56 27 43  9
  33  6 57  8 16 40 48  3 41  2 18 49 60 22  7 17 20 13  3 12  4 42 23  3
  24 44]], shape=(1, 74), dtype=int64)
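The shape `(1, 74)` and the index range 2 to 62 in the output above can be cross-checked in plain Python. The sketch below assumes the default standardization amounts to lowercasing plus removal of ASCII punctuation. It also shows a side effect of how `data` is written: adjacent string literals are concatenated with no separator, so a missing space at a line end fuses two words into one token.

```python
import string
from collections import Counter

data = [
         "The title of Rachel Levin’s book, Look Big, is"
          "just about the best two words of advice one can"
          " give about how to survive most animal encounters."
          "In her illustrated service manual, Levin breaks down"
          "how to handle 50 different kinds of animals common in "
          "North America, based on expert advice. Let’s look at her"
          " tips for dealing with five of these creatures and see"
          " how they stack up with what the experts say—and with"
          " real-world experience. "
        ]

# Assumed emulation of the default standardize + split steps
text = data[0].lower().translate(str.maketrans("", "", string.punctuation))
tokens = text.split()

print(len(tokens))                     # 74 tokens, one per output column
print(len(set(tokens)))                # 61 distinct tokens (indices 2..62)
print("isjust" in tokens)              # True: "is" and "just" were fused
print(Counter(tokens).most_common(1))  # [('of', 4)]
```

The most frequent token, `of`, is the one that received index 2 in the layer's output.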


In [20]:
# 4.0 Two documents
# 4.1 The shorter document is padded with zeros so that
#     both rows have the same length

data = [ [
         "The title of Rachel Levin’s book, Look Big, is"
          "just about the best two words of advice one can"
          " give about how to survive most animal encounters."
          "In her illustrated service manual, Levin breaks down"
          ],
          ["how to handle 50 different kinds of animals common in "
          "North America, based on expert advice. Let’s look at her"
          " tips for dealing with five of these creatures and see"
          " how they stack up with what the experts say—and with"
          " real-world experience. "
          ]
        ]

# 4.2        
layer = preprocessing.TextVectorization()
layer.adapt(data)
vectorized_text = layer(data)
print(vectorized_text)


tf.Tensor(
[[ 4 15  2 25 31 53  7 54 35 10  4 55 14 11  2  9 26 51 39 10  5  6 19 29
  59 45  8 37 21 30 32 52 46  0  0  0  0  0  0  0  0  0]
 [ 5  6 38 62 47 34  2 58 50 36 28 61 56 27 43  9 33  7 57  8 16 40 48  3
  41  2 18 49 60 22  5 17 20 13  3 12  4 42 23  3 24 44]], shape=(2, 42), dtype=int64)
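The trailing zeros in the first row above are padding: the first document has 33 tokens and the second 42, so the shorter row is filled up to the longer one. A minimal sketch of that idea, using a hypothetical helper `pad_batch` (not part of TensorFlow):

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]

batch = pad_batch([[4, 15, 2], [5, 6, 38, 62, 47]])
print(batch)  # [[4, 15, 2, 0, 0], [5, 6, 38, 62, 47]]
```

Index 0 is never assigned to a real token, which is what allows later layers to recognise and ignore the padded positions.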


## Text Classification

In [12]:
# 5.0
dataset, info = tfds.load(
                           'imdb_reviews',
                            with_info=True,
                            as_supervised=True
                          )


Downloading and preparing dataset imdb_reviews/plain_text/1.0.0 (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...


Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


In [21]:
# 5.1
type(dataset)    # dict
dataset.keys()   # dict_keys(['test', 'train', 'unsupervised'])

dict

dict_keys(['test', 'train', 'unsupervised'])

In [22]:
# 5.2
train_dataset, test_dataset = dataset['train'], dataset['test']
type(train_dataset)


tensorflow.python.data.ops.dataset_ops.PrefetchDataset

In [23]:
# 5.2.1
train_dataset.element_spec

(TensorSpec(shape=(), dtype=tf.string, name=None),
 TensorSpec(shape=(), dtype=tf.int64, name=None))

In [24]:
# 5.2.2
for example, label in train_dataset.take(2):
  print('text: ', example.numpy())
  print('label: ', label.numpy())


text:  b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."
label:  0
text:  b'I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell asleep because the film was rubbish. 

In [25]:
# 6.0
BUFFER_SIZE = 10000
BATCH_SIZE = 64      # Try 2 and see what happens

In [26]:
# 6.1
train_dataset = train_dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
test_dataset = test_dataset.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
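What `shuffle(BUFFER_SIZE)` followed by `batch(BATCH_SIZE)` does can be pictured with two small generators. This is a conceptual sketch in plain Python, not how `tf.data` is implemented. The names `shuffle` and `batch` here are stand-alone functions written for illustration, and the `seed` parameter is an assumption added to make the run repeatable.

```python
import random

def shuffle(items, buffer_size, seed=None):
    """Yield items in random order using a bounded buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)        # fill the buffer first
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]              # emit a random buffered item
        buffer[idx] = item             # and replace it with the new one
    rng.shuffle(buffer)                # drain what is left at the end
    yield from buffer

def batch(items, batch_size):
    """Group consecutive items into lists of batch_size."""
    chunk = []
    for item in items:
        chunk.append(item)
        if len(chunk) == batch_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk                    # the last batch may be smaller

batches = list(batch(shuffle(range(10), buffer_size=4, seed=0), batch_size=3))
print([len(b) for b in batches])       # [3, 3, 3, 1]
print(list(shuffle(range(5), buffer_size=1)))  # [0, 1, 2, 3, 4]
```

A buffer of size 1 leaves the order unchanged, so the buffer size controls how thoroughly the data is mixed. With the 25,000 IMDB training reviews and a batch size of 64, this kind of batching gives 391 batches, the last one holding 40 reviews.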


In [None]:
# 6.2 After batching, each element returned by take() is one
#     batch of (up to) BATCH_SIZE reviews with their labels
for example, label in train_dataset.take(3):
  print('texts: ', example.numpy().shape)
  print('texts: ', example.numpy()[:4])
  print()
  print('labels: ', label.numpy()[:4])


In [28]:
# 6.3
# https://www.tensorflow.org/api_docs/python/tf/keras/layers/experimental/preprocessing/TextVectorization
# Text vectorization layer.
VOCAB_SIZE=1000

# 6.3.1
encoder = tf.keras.layers.experimental.preprocessing.TextVectorization(
                                                                         max_tokens=VOCAB_SIZE
                                                                       )

In [29]:
# 6.3.2 Train encoder. Takes time.....
encoder.adapt(train_dataset.map(lambda text, label: text))

In [None]:
# 6.3.3 Print 20 words from vocab
encoder.get_vocabulary()[:20]

In [34]:
example.shape


TensorShape([64])

In [None]:
encoded_example = encoder(example)[:3].numpy()
encoded_example

array([[  1,   1,   1, ...,   0,   0,   0],
       [ 11,   7,  29, ...,   0,   0,   0],
       [ 51,  10, 208, ...,   0,   0,   0]])
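In the encoded output above, the 1s stand for out-of-vocabulary words and the trailing 0s for padding. Because `max_tokens=VOCAB_SIZE` caps the vocabulary, every word outside the most frequent ones is mapped to index 1. The sketch below imitates that cap with a hypothetical helper `build_capped_vocab`. It assumes that two of the `max_tokens` slots are taken by the padding and OOV indices, and it breaks frequency ties alphabetically for determinism.

```python
from collections import Counter

OOV = 1  # assumed index for out-of-vocabulary tokens

def build_capped_vocab(tokens, max_tokens):
    """Keep only the (max_tokens - 2) most frequent tokens."""
    counts = Counter(tokens)
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    kept = ordered[: max_tokens - 2]   # slots 0 and 1 are reserved
    return {tok: i + 2 for i, tok in enumerate(kept)}

corpus = "the movie was good the movie was bad the end".split()
vocab = build_capped_vocab(corpus, max_tokens=5)
print(vocab)    # {'the': 2, 'movie': 3, 'was': 4}

encoded = [vocab.get(tok, OOV) for tok in "the movie was terrible".split()]
print(encoded)  # [2, 3, 4, 1]
```

Rare words such as `good` or `bad` in this toy corpus also fall outside the cap and would be encoded as 1.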

In [None]:
model = tf.keras.Sequential(
                             [
                              encoder,
                              tf.keras.layers.Embedding(
                                                         input_dim=len(encoder.get_vocabulary()),
                                                         output_dim=64,
                                                         # Use masking to handle the variable
                                                         # sequence lengths
                                                         mask_zero=True
                                                        ),
                              tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
                              tf.keras.layers.Dense(64, activation='relu'),
                              tf.keras.layers.Dense(1)
                             ]
                           )


In [None]:
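# summary() raises a ValueError while the model is still unbuilt,
# that is, before it has seen any input (e.g. via predict() or fit())
model.summary()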

ValueError: ignored

In [None]:
print([layer.supports_masking for layer in model.layers])


[False, True, True, True, True]
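With `mask_zero=True`, the embedding layer marks every position whose index is 0 as padding, and layers that support masking ignore those positions. The effect can be sketched in plain Python. The helper names and the per-token feature values below are hypothetical, and the average is only a stand-in: an LSTM skips masked timesteps rather than averaging.

```python
def compute_mask(token_ids):
    """True for real tokens, False for padding (index 0)."""
    return [tid != 0 for tid in token_ids]

def masked_mean(values, mask):
    """Average only the values at unmasked positions."""
    kept = [v for v, keep in zip(values, mask) if keep]
    return sum(kept) / len(kept) if kept else 0.0

ids = [7, 3, 9, 0, 0]                # a padded sequence
feats = [1.0, 2.0, 3.0, 0.0, 0.0]    # hypothetical per-token features

mask = compute_mask(ids)
print(mask)                          # [True, True, True, False, False]
print(masked_mean(feats, mask))      # 2.0
print(sum(feats) / len(feats))       # 1.2
```

Without the mask, the padded positions drag the result down, and the amount depends on how much padding the batch happens to contain.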


In [None]:
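# predict on a sample text without padding
# The final Dense(1) layer has no activation, so the output is a logit:
# a value >= 0 indicates a positive review, a value < 0 a negative one
sample_text = ('The movie was cool. The animation and the graphics '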
               'were out of this world. I would recommend this movie.')
predictions = model.predict(np.array([sample_text]))
print(predictions[0])


[-0.006]


In [None]:
# predict on a sample text with padding

padding = "the " * 2000
predictions = model.predict(np.array([sample_text, padding]))
print(predictions[0])


[0.0159057]


In [None]:
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])
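Because the last layer is `Dense(1)` with no activation, the model emits a raw logit and the loss is told so with `from_logits=True`. The relation between logit, probability and loss can be worked through by hand. The sketch below uses the standard sigmoid and a common numerically stable form of binary cross-entropy. The function names are illustrative and the formula is not claimed to be the exact code path Keras uses.

```python
import math

def sigmoid(z):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_logit(z, y):
    """Binary cross-entropy for one logit z and one label y in {0, 1}."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

logit = -0.006                 # the untrained model's output shown earlier
prob = sigmoid(logit)
label = int(logit >= 0)        # threshold at logit 0, i.e. probability 0.5

print(round(prob, 4))          # 0.4985
print(label)                   # 0
print(round(bce_from_logit(0.0, 1), 4))  # 0.6931, i.e. log(2)
```

A logit near 0 means the untrained model is essentially undecided, and the corresponding loss is close to log(2).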

In [None]:
history = model.fit(
                    train_dataset,
                    epochs=10,
                    validation_data=test_dataset, 
                    validation_steps=30
                    )
   
# Each epoch takes 31 secs on GPU

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [None]:
def plot_graphs(history, metric):
  # Plot a training metric and its validation counterpart against epochs
  plt.plot(history.history[metric])
  plt.plot(history.history['val_'+metric], '')
  plt.xlabel("Epochs")
  plt.ylabel(metric)
  plt.legend([metric, 'val_'+metric])
