<a href="https://colab.research.google.com/github/harnalashok/deeplearning-sequences/blob/main/text_analytics_using_RNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# https://www.tensorflow.org/text/tutorials/text_classification_rnn
# Text classification with an RNN
# List of tensorflow datasets is here:
#  https://www.tensorflow.org/datasets/catalog/overview#all_datasets

In [54]:
import tensorflow as tf
import pathlib
import tensorflow_datasets as tfds


In [55]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"


## Load the dataset

[tfds.load](https://www.tensorflow.org/datasets/api_docs/python/tfds/load)

The easiest way to load a dataset is `tfds.load`. It will:

1. Download the data and save it as tfrecord files.
2. Load the tfrecord files and create a `tf.data.Dataset`.

It returns the requested dataset as a `tf.data.Dataset`. When `split=None`, it instead returns a dict that maps each split name to a `tf.data.Dataset`.

A list of available tfds datasets is [here](https://www.tensorflow.org/datasets/catalog/overview#all_datasets)

In [56]:
# Load imdb reviews dataset:

dataset = tfds.load("imdb_reviews",
                    split = None,          # If None, will return all 
                                           #   splits in a Dict format
                    data_dir = "/content/", # Dir where to cache data
                    as_supervised = True    # The returned tf.data.Dataset 
                                            #   will have a 2-tuple structure 
                                            #    (input, label) 
                    )


In [57]:
dataset.keys()

dict_keys(['train', 'test', 'unsupervised'])

In [58]:
train = dataset["train"] ; test = dataset["test"]

In [59]:
# What are the types of train and test?

type(train)
print()
type(test)

tensorflow.python.data.ops.dataset_ops.PrefetchDataset




tensorflow.python.data.ops.dataset_ops.PrefetchDataset

In [60]:
# Look at one element of train
train.element_spec

(TensorSpec(shape=(), dtype=tf.string, name=None),
 TensorSpec(shape=(), dtype=tf.int64, name=None))

In [61]:
for i, j in train.take(2):
  print(i, i.numpy())
  print(j, j.numpy())

tf.Tensor(b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.", shape=(), dtype=string) b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline

In [62]:
next(iter(train))

(<tf.Tensor: shape=(), dtype=string, numpy=b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.">,
 <tf.Tensor: shape=(), dtype=int64, numpy=0>)

In [64]:
# Constants
VOCAB_SIZE = 5000

[tf.keras.layers.TextVectorization](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization)<br>
A preprocessing layer which maps text features to integer sequences.

In [70]:
# Create layer object:
tv = tf.keras.layers.TextVectorization(
                                       max_tokens= VOCAB_SIZE,
                                       standardize='lower_and_strip_punctuation',
                                       split='whitespace',
                                       ngrams=None,
                                       output_mode='int',
                                       output_sequence_length=200,
                                       pad_to_max_tokens=False,
                                       vocabulary=None,
                                       idf_weights=None,
                                       sparse=False,
                                       ragged=False,
                                      )


In [73]:
tv.adapt(train.map(lambda text, label: text))


In [75]:
tv.get_vocabulary()[:10]

['', '[UNK]', 'the', 'and', 'a', 'of', 'to', 'is', 'in', 'it']
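The vocabulary listing above starts with two reserved entries (`''` for padding and `'[UNK]'` for unknown words), followed by the most frequent tokens. The following is a plain-Python sketch of that idea, not the layer's actual implementation. The helper names `standardize` and `build_vocabulary` and the toy reviews are hypothetical, and the alphabetical tie-break is an assumption made only to keep the result deterministic.

```python
import re
import string
from collections import Counter


def standardize(text):
    """Lowercase and strip ASCII punctuation (a rough analogue of
    standardize='lower_and_strip_punctuation')."""
    return re.sub(f"[{re.escape(string.punctuation)}]", "", text.lower())


def build_vocabulary(texts, max_tokens):
    """Return ['', '[UNK]'] followed by tokens in order of falling frequency."""
    counts = Counter()
    for text in texts:
        counts.update(standardize(text).split())  # split on whitespace
    # Assumption: ties are broken alphabetically so the sketch is deterministic.
    ranked = sorted(counts, key=lambda tok: (-counts[tok], tok))
    # Two slots are reserved, so only max_tokens - 2 real tokens fit.
    return ["", "[UNK]"] + ranked[: max_tokens - 2]


# Hypothetical toy corpus, used only for illustration.
toy_reviews = ["The movie is good.", "The movie is not good!", "The plot is thin"]
vocab = build_vocabulary(toy_reviews, max_tokens=6)
print(vocab)  # ['', '[UNK]', 'is', 'the', 'good', 'movie']
```

With `max_tokens=6` only four real words survive; the rarer words ('not', 'plot', 'thin') would all be encoded as `'[UNK]'`.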

Once the vocabulary is set, the layer can encode text into indices. By default, the tensors of indices are 0-padded to the longest sequence in the batch. Here a fixed `output_sequence_length=200` is set, so every row is padded (or truncated) to exactly 200 indices:

In [78]:
tv([["This is good"], ["This is not good"]])

<tf.Tensor: shape=(2, 200), dtype=int64, numpy=
array([[11,  7, 50,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0],
       [11,  7, 22, 50,  0,  0,  0,  0,  0,  0,
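The long runs of zeros in the output above are padding. The following plain-Python sketch shows the encode-then-pad step in miniature; it is an illustration, not the layer's code. The function `encode` and the six-word `toy_vocab` are hypothetical, and the sequence length is shortened to 8 for readability.

```python
def encode(text, vocab, output_sequence_length):
    """Map whitespace tokens to vocabulary indices, then pad or truncate."""
    index = {tok: i for i, tok in enumerate(vocab)}
    tokens = text.lower().split()
    ids = [index.get(tok, 1) for tok in tokens]  # 1 is the '[UNK]' slot
    ids = ids[:output_sequence_length]           # truncate long inputs
    return ids + [0] * (output_sequence_length - len(ids))  # 0 is padding


# Hypothetical vocabulary: index 0 is padding, index 1 is '[UNK]'.
toy_vocab = ["", "[UNK]", "this", "is", "good", "not"]

print(encode("This is good", toy_vocab, 8))      # [2, 3, 4, 0, 0, 0, 0, 0]
print(encode("This is not good", toy_vocab, 8))  # [2, 3, 5, 4, 0, 0, 0, 0]
print(encode("This is superb", toy_vocab, 8))    # [2, 3, 1, 0, 0, 0, 0, 0]
```

A word outside the vocabulary ('superb') maps to index 1, and every row has the same length regardless of how many words the input contains.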

In [None]:
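# The layer must be called with a batch of strings;
#  calling it with no arguments raises an error:
tv([["This is good"]])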

### Labeling each example

In [None]:
# tf.cast casts a tensor to a new datatype:
#  tf.int32, tf.int64, tf.float64 etc

tf.cast(1, tf.int64)
tf.cast(1.2, tf.float64)

<tf.Tensor: shape=(), dtype=float64, numpy=1.2000000476837158>
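The result above is 1.2000000476837158 rather than 1.2. A likely explanation is that the Python float 1.2 is first converted to a float32 tensor (TensorFlow's default dtype for Python floats) and only then widened to float64, so the float32 rounding error is carried along. The sketch below reproduces the same rounding with the standard library only; the helper name `through_float32` is hypothetical.

```python
import struct


def through_float32(value):
    """Round a Python float to the nearest float32, then widen it back."""
    return struct.unpack("f", struct.pack("f", value))[0]


print(through_float32(1.2))  # 1.2000000476837158
print(through_float32(0.5))  # 0.5 (exactly representable, so unchanged)
```

Values that float32 represents exactly, such as 0.5, pass through unchanged; 1.2 does not.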

In [None]:
# Define a labeler function:
#  'example' is a single element of a
#    tf.data.Dataset (not the whole Dataset):
#     Pair it with 'index' after the index
#      is cast to a tensorflow integer (tf.int64):

def labeler(example, index):
  return example, tf.cast(index, tf.int64)


#### An example

In [None]:
# Create two tf.data.Datasets

d1 = tf.data.Dataset.from_tensor_slices([1,2,3,4])
d2 = tf.data.Dataset.from_tensor_slices([6,7,8,9])


In [None]:
d1
d2

<TensorSliceDataset element_spec=TensorSpec(shape=(), dtype=tf.int32, name=None)>

<TensorSliceDataset element_spec=TensorSpec(shape=(), dtype=tf.int32, name=None)>

In [None]:
# Print one element at a time:

for i in d1.take(2):
  print(i)

tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)


In [None]:
# Use as_numpy_iterator():
#  as_numpy_iterator() returns an iterator through which you can convert
#   all elements of the dataset to numpy.

list(d1.as_numpy_iterator())
print("\n======\n")
for i in d1.as_numpy_iterator():
  print(i)

[1, 2, 3, 4]



1
2
3
4


map() function:<br>
See examples of map() usage [here](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#map)<br>
`map()` transformation applies `map_func` to each element of this dataset, and returns a new dataset containing the transformed elements, in the same order as they appeared in the input. `map_func` can be used to change both the values **and the structure** of a dataset's elements.

In [None]:
# Apply labeler function:

t = []
for i,j in enumerate([d1,d2]):
  t.append(j.map(lambda x: labeler(x,i)))

In [None]:
# Examine each labeled Dataset:

for x in t:
  print(list(x.as_numpy_iterator()))
  print(type(x))

[(1, 0), (2, 0), (3, 0), (4, 0)]
<class 'tensorflow.python.data.ops.dataset_ops.MapDataset'>
[(6, 1), (7, 1), (8, 1), (9, 1)]
<class 'tensorflow.python.data.ops.dataset_ops.MapDataset'>
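The labels above come out right (0 for `d1`, 1 for `d2`) even though the lambda refers to the loop variable `i`. As far as I understand, this is because `map()` traces the function when it is called, so the current value of `i` is captured at that moment. In plain Python, a lambda created in a loop looks up `i` only when it is finally called, which gives a different result. The sketch below shows that pitfall and the usual `i=i` workaround; `labeler_py` is a hypothetical stand-in for `labeler` that needs no TensorFlow.

```python
def labeler_py(example, index):
    """Plain-Python stand-in for labeler(): pair an example with a label."""
    return example, int(index)


# Late binding: both lambdas see the final value of i (which is 1).
late = [lambda x: labeler_py(x, i) for i in range(2)]

# Default-argument trick: each lambda keeps the value i had when it was created.
bound = [lambda x, i=i: labeler_py(x, i) for i in range(2)]

print(late[0](1))   # (1, 1)  <- label is wrong for the first dataset
print(bound[0](1))  # (1, 0)
print(bound[1](6))  # (6, 1)
```

Binding the index explicitly is a cheap safeguard if the same labeling loop is ever reused outside `tf.data`.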


In [None]:
t = []
for j in [d1,d2]:
  t.append(j.map(lambda x: x+10))




In [None]:
for j in t[0].as_numpy_iterator():
  print(j)

11
12
13
14


## Next, with our data

In [None]:
FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']

In [None]:
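# NOTE: 'data_path' is not defined in the cells shown here; it is assumed
#  to be a pathlib.Path to the folder that holds the three text files.
labeled_data_sets = []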

for i, file_name in enumerate(FILE_NAMES):
  lines_dataset = tf.data.TextLineDataset(str(data_path / file_name))
  labeled_dataset = lines_dataset.map(lambda ex: labeler(ex, i))
  labeled_data_sets.append(labeled_dataset)

In [None]:
for text,label in labeled_data_sets[0].take(5):
  print(text)
  print(text.numpy())
  print(label)
  print(label.numpy())
  print("\n====\n")

tf.Tensor(b"\xef\xbb\xbfAchilles sing, O Goddess! Peleus' son;", shape=(), dtype=string)
b"\xef\xbb\xbfAchilles sing, O Goddess! Peleus' son;"
tf.Tensor(0, shape=(), dtype=int64)
0
tf.Tensor(b'His wrath pernicious, who ten thousand woes', shape=(), dtype=string)
b'His wrath pernicious, who ten thousand woes'
tf.Tensor(0, shape=(), dtype=int64)
0
tf.Tensor(b"Caused to Achaia's host, sent many a soul", shape=(), dtype=string)
b"Caused to Achaia's host, sent many a soul"
tf.Tensor(0, shape=(), dtype=int64)
0
tf.Tensor(b'Illustrious into Ades premature,', shape=(), dtype=string)
b'Illustrious into Ades premature,'
tf.Tensor(0, shape=(), dtype=int64)
0
tf.Tensor(b'And Heroes gave (so stood the will of Jove)', shape=(), dtype=string)
b'And Heroes gave (so stood the will of Jove)'
tf.Tensor(0, shape=(), dtype=int64)
0


## Prepare the dataset for training