<a href="https://colab.research.google.com/github/nigoda/machine_learning/blob/main/12_NLP_Load_text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Load text**

Next, you will use lower-level utilities like *tf.data.TextLineDataset* to load text files, and *tf.text* to preprocess the data for finer-grained control.

In [None]:
import collections
import pathlib
import re
import string

import tensorflow as tf

from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras import preprocessing
from tensorflow.keras import utils
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

import tensorflow_datasets as tfds

In [None]:
!pip install tensorflow-text

Collecting tensorflow-text
Installing collected packages: tensorflow-text
Successfully installed tensorflow-text-2.4.3


In [None]:
import tensorflow_text as tf_text

## Example 1: Predict the tag for a Stack Overflow question

### Download and explore the dataset

In [None]:
data_url = 'https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz'
dataset = utils.get_file(
    'stack_overflow_16k.tar.gz',
    data_url,
    untar=True,
    cache_dir='stack_overflow',
    cache_subdir=''
)
dataset_dir = pathlib.Path(dataset).parent

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz


In [None]:
list(dataset_dir.iterdir())

[PosixPath('/tmp/.keras/test'),
 PosixPath('/tmp/.keras/stack_overflow_16k.tar.gz.tar.gz'),
 PosixPath('/tmp/.keras/train'),
 PosixPath('/tmp/.keras/README.md')]

In [None]:
train_dir = dataset_dir/'train'
list(train_dir.iterdir())

[PosixPath('/tmp/.keras/train/csharp'),
 PosixPath('/tmp/.keras/train/javascript'),
 PosixPath('/tmp/.keras/train/java'),
 PosixPath('/tmp/.keras/train/python')]

The train/csharp, train/java, train/python and train/javascript directories contain many text files, each of which is a Stack Overflow question. Print a file and inspect the data.

In [None]:
sample_file = train_dir/'python/1775.txt'
with open(sample_file) as f:
  print(f.read())

"check if a numbers comes before the same number but am not getting any output from the code number =[2,3,4,9,9,5,1].def checklist(list1):.    for i in range(len(list1 - 1)):..if list1[i] == 9 and list1[i+1] == 9:.            return true.    return false...this code does not output any value either true or false, it is suppose to output true if 9 comes after 9 or false if otherwise .[enter image description here][1]..[1]: https://i.stack.imgur.com/kwm0s.png  this link contain the code and the output"



### Load the dataset

Next, you will load the data off disk and prepare it into a format suitable for training. To do so, you will use the *text_dataset_from_directory* utility to create a labeled *tf.data.Dataset*. If you're new to tf.data, it's a powerful collection of tools for building input pipelines.

The *preprocessing.text_dataset_from_directory* utility expects a directory structure as follows.

train/

...csharp/
......1.txt
......2.txt

...java/
......1.txt
......2.txt

...javascript/
......1.txt
......2.txt

...python/
......1.txt
......2.txt

When running a machine learning experiment, it is a best practice to divide your dataset into three splits: train, validation, and test. The Stack Overflow dataset has already been divided into train and test, but it lacks a validation set. Create a validation set using an 80:20 split of the training data by using the validation_split argument below.



In [None]:
batch_size = 32
seed = 42

raw_train_ds = preprocessing.text_dataset_from_directory(
    train_dir,
    batch_size = batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed
) 

Found 8000 files belonging to 4 classes.
Using 6400 files for training.


There are 8,000 examples in the training folder, of which you will use 80% (or 6,400) for training.

As you will see shortly, you can train a model by passing a tf.data.Dataset directly to model.fit.

First, iterate over the dataset and print out a few examples, to get a feel for the data. 

In [None]:
for text_batch, label_batch in raw_train_ds.take(1):
  for i in range(10):
    print("Question : ",text_batch.numpy()[i][:100], '...')
    print("Labels : ",label_batch.numpy()[i])

Question :  b'"my tester is going to the wrong constructor i am new to programming so if i ask a question that can' ...
Labels :  1
Question :  b'"blank code slow skin detection this code changes the color space to lab and using a threshold finds' ...
Labels :  3
Question :  b'"option and validation in blank i want to add a new option on my system where i want to add two text' ...
Labels :  1
Question :  b'"exception: dynamic sql generation for the updatecommand is not supported against a selectcommand th' ...
Labels :  0
Question :  b'"parameter with question mark and super in blank, i\'ve come across a method that is formatted like t' ...
Labels :  1
Question :  b'call two objects wsdl the first time i got a very strange wsdl. ..i would like to call the object (i' ...
Labels :  0
Question :  b'how to correctly make the icon for systemtray in blank using icon sizes of any dimension for systemt' ...
Labels :  0
Question :  b'"is there a way to check a variable that exists in a differen

The labels are 0, 1, 2 or 3. To see which of these correspond to which string label, you can check the class_names property on the dataset.

In [None]:
for i,label in enumerate(raw_train_ds.class_names):
  print("Label", i, "corresponds to", label)

Label 0 corresponds to csharp
Label 1 corresponds to java
Label 2 corresponds to javascript
Label 3 corresponds to python


In [None]:
raw_val_ds = preprocessing.text_dataset_from_directory(
    train_dir,
    batch_size=batch_size,
    validation_split=0.2,
    subset ='validation',
    seed=seed
)

Found 8000 files belonging to 4 classes.
Using 1600 files for validation.
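Both `text_dataset_from_directory` calls above pass the same `seed` and `validation_split`, which is what keeps the training and validation subsets from overlapping. The sketch below illustrates the idea in plain Python. It is only an illustration under assumed behaviour (shuffle the file indices with a seed, then cut off the tail for validation), not the Keras implementation, and `split_indices` is a hypothetical helper.

```python
import random

def split_indices(num_files, validation_split, seed):
    """Shuffle file indices with a fixed seed, then reserve the tail for validation."""
    indices = list(range(num_files))
    random.Random(seed).shuffle(indices)
    num_val = int(num_files * validation_split)
    return indices[:-num_val], indices[-num_val:]

# Same seed for both subsets: the shuffled order is identical, so the
# training part and the validation part never share a file.
train_idx, val_idx = split_indices(8000, 0.2, seed=42)
print(len(train_idx), len(val_idx))  # 6400 1600
print(set(train_idx).isdisjoint(val_idx))  # True

# A different seed for the validation call would shuffle differently,
# so some "validation" files could also appear in the training subset.
_, other_val_idx = split_indices(8000, 0.2, seed=7)
leaked = set(train_idx) & set(other_val_idx)
print("files leaked into validation:", len(leaked))
```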


In [None]:
test_dir = dataset_dir/'test'
raw_test_ds = preprocessing.text_dataset_from_directory(
    test_dir, batch_size=batch_size)

Found 8000 files belonging to 4 classes.


### Prepare the dataset for training

Next, you will standardize, tokenize, and vectorize the data using the *preprocessing.TextVectorization* layer.

*   Standardization refers to preprocessing the text, typically to remove punctuation or HTML elements to simplify the dataset.

*   Tokenization refers to splitting strings into tokens (for example, splitting a sentence into individual words by splitting on whitespace).

*   Vectorization refers to converting tokens into numbers so they can be fed into a neural network.

All of these tasks can be accomplished with this layer. You can learn more about each of these in the API doc.

*   The default standardization converts text to lowercase and removes punctuation.

*   The default tokenizer splits on whitespace.

*   The default vectorization mode is int. This outputs integer indices (one per token). This mode can be used to build models that take word order into account. You can also use other modes, like binary, to build bag-of-words models.

You will build two models to learn more about these modes. First, you will use the binary mode to build a bag-of-words model. Next, you will use the int mode with a 1D ConvNet.
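To make the three steps concrete, here is a minimal plain-Python sketch of standardization, tokenization, and the two vectorization modes. The toy vocabulary and the helper names are assumptions made for illustration, and the index layout is simplified, so this shows the idea rather than the exact behaviour of `TextVectorization`.

```python
import re
import string

def standardize(text):
    """Lowercase the text and strip punctuation."""
    text = text.lower()
    return re.sub(f"[{re.escape(string.punctuation)}]", "", text)

def tokenize(text):
    """Split on whitespace."""
    return text.split()

# Hypothetical toy vocabulary: index 0 = padding, index 1 = out-of-vocabulary.
vocab = ["", "[UNK]", "how", "to", "sort", "a", "list"]
index = {token: i for i, token in enumerate(vocab)}

def to_int(tokens):
    """'int' style: one index per token, so word order is preserved."""
    return [index.get(token, 1) for token in tokens]

def to_binary(tokens):
    """'binary' style: one slot per vocabulary entry, 1 if the token occurs."""
    vector = [0] * len(vocab)
    for token in tokens:
        vector[index.get(token, 1)] = 1
    return vector

tokens = tokenize(standardize("How to sort a list? Sort a LIST!"))
print(tokens)
print(to_int(tokens))     # [2, 3, 4, 5, 6, 4, 5, 6]
print(to_binary(tokens))  # [0, 0, 1, 1, 1, 1, 1]
```

The binary vector has a fixed length and ignores order and repetition, which suits a bag-of-words model. The integer sequence keeps order, which a convolutional model can exploit.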








In [None]:
VOCAB_SIZE = 10000

binary_vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='binary'
)

For int mode, in addition to the maximum vocabulary size, you need to set an explicit maximum sequence length, which will cause the layer to pad or truncate sequences to exactly sequence_length values.

In [None]:
MAX_SEQUENCE_LENGTH = 250

int_vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH)
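The effect of a fixed output length can be sketched in a few lines of plain Python. `pad_or_truncate` is a hypothetical helper written for this illustration. It assumes padding is added at the end with the value 0, which matches the vectorized output shown later in this section.

```python
def pad_or_truncate(ids, sequence_length, pad_id=0):
    """Return exactly `sequence_length` ids: cut long inputs, right-pad short ones."""
    ids = ids[:sequence_length]
    return ids + [pad_id] * (sequence_length - len(ids))

print(pad_or_truncate([24, 4, 41], 5))                   # [24, 4, 41, 0, 0]
print(pad_or_truncate([24, 4, 41, 149, 723, 247], 5))    # [24, 4, 41, 149, 723]
```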

Next, you will call adapt to fit the state of the preprocessing layer to the dataset.

In [None]:
#Make a text-only dataset (without labels), then call adapt
train_text = raw_train_ds.map(lambda text, labels:text)
binary_vectorize_layer.adapt(train_text)
int_vectorize_layer.adapt(train_text)

See the result of using these layers to preprocess data:

In [None]:
def binary_vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return binary_vectorize_layer(text), label

In [None]:
def int_vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return int_vectorize_layer(text), label

In [None]:
# Retrieve a batch (of 32 questions and labels) from the dataset
text_batch, label_batch = next(iter(raw_test_ds))
first_question, first_label = text_batch[0], label_batch[0]
print("Question", first_question)
print("Label", first_label)

Question tf.Tensor(b"how to get 4 unique random numbers in range <0;9>? i want to get 4 unique random floating point numbers in the range &lt;0;9>. how could i do that. is it possible to do this with a single function so i don't need to generate random numbers in a loop?\n", shape=(), dtype=string)
Label tf.Tensor(0, shape=(), dtype=int32)


In [None]:
print("'binary' vectorized question:",
      binary_vectorize_text(first_question, first_label)[0])

'binary' vectorized question: tf.Tensor([[1. 1. 1. ... 0. 0. 0.]], shape=(1, 10000), dtype=float32)


In [None]:
print("'int' vectorized question:",
      int_vectorize_text(first_question, first_label)[0])

'int' vectorized question: tf.Tensor(
[[  24    4   41  149  723  247  183    7  427 1928    3   46    4   41
   149  723  247 3676  316  183    7    2  427    1   24  176    3   40
    14    6   11  204    4   40   13   21    5  394   38   50    3  129
    78    4  647  247  183    7    5  123    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0

Binary mode returns an array denoting which tokens exist at least once in the input, while int mode replaces each token by an integer, thus preserving their order. You can look up the token (string) that each integer corresponds to by calling .get_vocabulary() on the layer.

In [None]:
print("1289 --->", int_vectorize_layer.get_vocabulary()[1289])
print("313 --->", int_vectorize_layer.get_vocabulary()[313])
print("Vocabulary size: {}".format(len(int_vectorize_layer.get_vocabulary())))

1289 ---> roman
313 ---> source
Vocabulary size: 10000


### Ready to train a model

In [None]:
binary_train_ds = raw_train_ds.map(binary_vectorize_text)
binary_val_ds = raw_val_ds.map(binary_vectorize_text)
binary_test_ds = raw_test_ds.map(binary_vectorize_text)

int_train_ds = raw_train_ds.map(int_vectorize_text)
int_val_ds = raw_val_ds.map(int_vectorize_text)
int_test_ds = raw_test_ds.map(int_vectorize_text)

### Configure the dataset for performance

These are two important methods you should use when loading data to make sure that I/O does not become blocking.

*.cache()* keeps data in memory after it's loaded off disk. This will ensure the dataset does not become a bottleneck while training your model. If your dataset is too large to fit into memory, you can also use this method to create a performant on-disk cache, which is more efficient to read than many small files.

*.prefetch()* overlaps data preprocessing and model execution while training.
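The two ideas can be illustrated without TensorFlow. The sketch below is a simplified analogy, not how `tf.data` is implemented: `Cached` (hypothetical) reads its source once and replays it from memory, and `prefetch` (hypothetical) lets a background thread load upcoming items while the consumer works on the current one. Unlike `tf.data`, this `Cached` fills its cache eagerly on the first pass.

```python
import queue
import threading

def prefetch(iterable, buffer_size=2):
    """Yield items while a background thread loads the next ones into a queue."""
    q = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for item in iterable:
            q.put(item)
        q.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is done:
            return
        yield item

class Cached:
    """Read the source once, then replay it from memory on later passes."""

    def __init__(self, make_iterable):
        self._make_iterable = make_iterable
        self._items = None

    def __iter__(self):
        if self._items is None:
            self._items = list(self._make_iterable())
        return iter(self._items)

reads = []

def slow_source():
    for i in range(5):
        reads.append(i)  # stands in for an expensive disk read
        yield i

ds = Cached(slow_source)
first_epoch = list(prefetch(ds))
second_epoch = list(prefetch(ds))
print(first_epoch, second_epoch)
print("source reads:", len(reads))  # 5, not 10: the second epoch came from memory
```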



In [None]:
AUTOTUNE = tf.data.AUTOTUNE

def configure_dataset(dataset):
  return dataset.cache().prefetch(buffer_size = AUTOTUNE)

In [None]:
binary_train_ds = configure_dataset(binary_train_ds)
binary_val_ds = configure_dataset(binary_val_ds)
binary_test_ds = configure_dataset(binary_test_ds)

int_train_ds = configure_dataset(int_train_ds)
int_val_ds = configure_dataset(int_val_ds)
int_test_ds = configure_dataset(int_test_ds)

### Train the model

It's time to create our neural network. For the binary vectorized data, train a simple bag-of-words linear model:

In [None]:
binary_model = tf.keras.Sequential([layers.Dense(4)])
binary_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer = 'adam',
    metrics = ['accuracy'])
history = binary_model.fit(
    binary_train_ds, validation_data=binary_val_ds, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Next, you will use the int vectorized layer to build a 1D ConvNet.

In [None]:
def create_model(vocab_size, num_labels):
  model = tf.keras.Sequential([
      layers.Embedding(vocab_size, 64, mask_zero=True),
      layers.Conv1D(64, 5, padding ="valid", activation="relu", strides = 2),
      layers.GlobalMaxPooling1D(),
      layers.Dense(num_labels)
  ])
  return model

In [None]:
# vocab_size is VOCAB_SIZE + 1 since 0 is used additionally for padding.
int_model = create_model(vocab_size=VOCAB_SIZE + 1, num_labels=4)
int_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer ='adam',
    metrics = ['accuracy'])

history = int_model.fit(int_train_ds, validation_data=int_val_ds, epochs=5)


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Compare the two models:

In [None]:
print("Linear model on binary vectorized data: ")
print(binary_model.summary())

Linear model on binary vectorized data: 
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 4)                 40004     
Total params: 40,004
Trainable params: 40,004
Non-trainable params: 0
_________________________________________________________________
None


In [None]:
print("ConvNet model on int vectorizer data:")
print(int_model.summary())

ConvNet model on int vectorizer data:
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 64)          640064    
_________________________________________________________________
conv1d (Conv1D)              (None, None, 64)          20544     
_________________________________________________________________
global_max_pooling1d (Global (None, 64)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 4)                 260       
Total params: 660,868
Trainable params: 660,868
Non-trainable params: 0
_________________________________________________________________
None
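The parameter counts in the two summaries above can be checked by hand. The helper functions below are written for this illustration and use the standard formulas for dense, embedding, and 1D convolution layers with biases. The sizes come from the code above (10,000 binary features, a vocabulary of `VOCAB_SIZE + 1`, embedding width 64, kernel size 5, 4 labels).

```python
def dense_params(inputs, units):
    """One weight per (input, unit) pair plus one bias per unit."""
    return inputs * units + units

def embedding_params(vocab_size, dim):
    """One `dim`-wide vector per vocabulary entry, no bias."""
    return vocab_size * dim

def conv1d_params(kernel_size, in_channels, filters):
    """Each filter spans kernel_size x in_channels weights, plus one bias."""
    return kernel_size * in_channels * filters + filters

binary_total = dense_params(10000, 4)
int_total = (
    embedding_params(10001, 64)
    + conv1d_params(5, 64, 64)
    + dense_params(64, 4)
)
print(binary_total)  # 40004
print(int_total)     # 660868
```

The ConvNet has roughly sixteen times as many parameters as the linear model, almost all of them in the embedding table.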


Evaluate both models on the test data:

In [None]:
binary_loss, binary_accuracy = binary_model.evaluate(binary_test_ds)
int_loss, int_accuracy = int_model.evaluate(int_test_ds)

print("Binary model accuracy : {:2.2%}".format(binary_accuracy))
print("Int model accuracy : {:2.2%}".format(int_accuracy))

Binary model accuracy : 81.51%
Int model accuracy : 80.48%


### Export the model

In the code above, you applied the TextVectorization layer to the dataset before feeding text to the model. If you want to make your model capable of processing raw strings (for example, to simplify deploying it), you can include the TextVectorization layer inside your model. To do so, you can create a new model using the weights you just trained.

In [None]:
export_model = tf.keras.Sequential(
    [binary_vectorize_layer, binary_model,
     layers.Activation('sigmoid')])

export_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer='adam',
    metrics = ['accuracy'])

# Test it with `raw_test_ds`, which yields raw strings
loss, accuracy = export_model.evaluate(raw_test_ds)
print("Accuracy: {:2.2%}".format(accuracy))


Accuracy: 81.51%


Now your model can take raw strings as input and predict a score for each label using *model.predict*. Define a function to find the label with the maximum score.

In [None]:
def get_string_labels(predicted_score_batch):
  predicted_int_labels = tf.argmax(predicted_score_batch, axis=1)
  predicted_labels = tf.gather(raw_train_ds.class_names, predicted_int_labels)
  return predicted_labels
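Taking the argmax after the sigmoid activation picks the same label as taking it on the raw logits, because the sigmoid is monotonically increasing. The plain-Python sketch below shows this with made-up logits; the class names are the ones printed from `class_names` earlier, and the helper functions are hypothetical.

```python
import math

CLASS_NAMES = ["csharp", "java", "javascript", "python"]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

logits = [-1.2, 0.3, 2.5, 0.9]  # made-up model outputs for one question
scores = [sigmoid(z) for z in logits]

label_from_logits = CLASS_NAMES[argmax(logits)]
label_from_scores = CLASS_NAMES[argmax(scores)]
print(label_from_logits, label_from_scores)  # javascript javascript
```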

### Run inference on new data

In [None]:
inputs = [
          " public String getDepartment() {return department;", #python
          "debug public static void main(string[] args) {...}", # java
]
predicted_scores = export_model.predict(inputs)
predicted_labels = get_string_labels(predicted_scores)
for input, label in zip(inputs, predicted_labels):
  print("Question: ",input)
  print("Predicted label:", label.numpy())

Question:   public String getDepartment() {return department;
Predicted label: b'csharp'
Question:  debug public static void main(string[] args) {...}
Predicted label: b'csharp'


## Example 2: Predict the author of Iliad translations

In [None]:
DIRECTORY_URL = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']

for name in FILE_NAMES:
  text_dir = utils.get_file(name, origin=DIRECTORY_URL + name)

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/cowper.txt
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/derby.txt
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/butler.txt


In [None]:
print(text_dir)

/root/.keras/datasets/butler.txt


In [None]:
parent_dir = pathlib.Path(text_dir).parent

In [None]:
print(parent_dir)
print(" ")
list(parent_dir.iterdir())


/root/.keras/datasets
 


[PosixPath('/root/.keras/datasets/derby.txt'),
 PosixPath('/root/.keras/datasets/butler.txt'),
 PosixPath('/root/.keras/datasets/cowper.txt')]

### Load the dataset

You will use `TextLineDataset`, which is designed to create a `tf.data.Dataset` from a text file in which each example is a line of text from the original file, whereas `text_dataset_from_directory` treats all contents of a file as a single example.
`TextLineDataset` is useful for text data that is primarily line-based (for example, poetry or error logs).

Iterate through these files, loading each one into its own dataset. Each example needs to be individually labeled, so use `tf.data.Dataset.map` to apply a labeler function to each one. This will iterate over every example in the dataset, returning (example, label) pairs.

In [None]:
def labeler(example, index):
  return example, tf.cast(index, tf.int64)

In [None]:
labeled_data_sets = []

for i, file_name in enumerate(FILE_NAMES):
  lines_dataset = tf.data.TextLineDataset(str(parent_dir/file_name))
  labeled_dataset = lines_dataset.map(lambda ex: labeler(ex, i))
  labeled_data_sets.append(labeled_dataset)

print(labeled_data_sets[0])

<MapDataset shapes: ((), ()), types: (tf.string, tf.int64)>
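The per-line labeling can be pictured with a plain-Python equivalent. In this sketch the three files are replaced by in-memory stand-ins that each hold a single line quoted from the sample output in this section (the real files contain many more lines), and `labeled_lines` is a hypothetical helper.

```python
import io

# Stand-ins for the downloaded files, keyed by the names in FILE_NAMES.
FILES = {
    "cowper.txt": "Within my tent, leaning against the wall,\n",
    "derby.txt": "In beauty far surpassing all their sex:\n",
    "butler.txt": "for the mules to pass you. Afterwards when I have taken the body home\n",
}

def labeled_lines(file_obj, index):
    """Yield one (line, label) pair per line, with the trailing newline removed."""
    for line in file_obj:
        yield line.rstrip("\n"), index

labeled = []
for i, name in enumerate(["cowper.txt", "derby.txt", "butler.txt"]):
    labeled.extend(labeled_lines(io.StringIO(FILES[name]), i))

for text, label in labeled:
    print(label, text)
```

The label is simply the position of the file in the list, so every line from the first file gets label 0, every line from the second gets label 1, and so on.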


Next, you'll combine these labeled datasets into a single dataset, and shuffle it.

In [None]:
BUFFER_SIZE = 50000
BATCH_SIZE = 64
VALIDATION_SIZE = 5000

In [None]:
all_labeled_data = labeled_data_sets[0]
for labeled_dataset in labeled_data_sets[1:]:
  all_labeled_data = all_labeled_data.concatenate(labeled_dataset)

all_labeled_data = all_labeled_data.shuffle(
    BUFFER_SIZE, reshuffle_each_iteration=False)
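Because the concatenated dataset is ordered by label, the shuffle buffer size matters. The plain-Python sketch below mimics a buffer-based shuffle: it keeps `buffer_size` items in memory, emits a random one, and refills the slot from the stream. It is an assumed simplification for illustration, not the `tf.data` implementation, and `buffered_shuffle` is a hypothetical name.

```python
import random

def buffered_shuffle(iterable, buffer_size, seed=None):
    """Shuffle a stream using only a fixed-size buffer (buffer_size >= 1)."""
    rng = random.Random(seed)
    buffer = []
    for item in iterable:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]        # emit a random buffered item
        buffer[idx] = item       # and replace it with the new one
    while buffer:                # drain what is left, in random order
        idx = rng.randrange(len(buffer))
        buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
        yield buffer.pop()

data = list(range(10))
unshuffled = list(buffered_shuffle(data, buffer_size=1, seed=0))
local_mix = list(buffered_shuffle(data, buffer_size=3, seed=0))
full_mix = list(buffered_shuffle(data, buffer_size=10, seed=0))
print(unshuffled)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(local_mix)
print(full_mix)
```

With a buffer of 1 nothing moves, and with a small buffer the first item emitted can only come from the first few inputs. Only a buffer at least as large as the dataset mixes all examples uniformly.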

In [None]:
for text, label in all_labeled_data.take(10):
  print("Sentence: ", text.numpy())
  print("label: ", label.numpy())

Sentence:  b'Within my tent, leaning against the wall,'
label:  0
Sentence:  b'for the mules to pass you. Afterwards when I have taken the body home'
label:  2
Sentence:  b'In beauty far surpassing all their sex:'
label:  1
Sentence:  b'sent on a like design to the camp of the Grecians. From him they are'
label:  1
Sentence:  b'A barrier against all the winds of heaven,'
label:  0
Sentence:  b'With rapid strides he came; the mountains huge'
label:  0
Sentence:  b'The routed sons of Greece should feel how much'
label:  1
Sentence:  b'From Venus, child of Jove; his mother owns'
label:  1
Sentence:  b'From Chiron, justest of the Centaurs, he.'
label:  1
Sentence:  b"Oft as Achilles, swift of foot, essay'd"
label:  1


### Prepare the dataset for training
Instead of using the Keras `TextVectorization` layer to preprocess the text dataset, you will now use the `tf.text` API to standardize and tokenize the data, build a vocabulary, and use `StaticVocabularyTable` to map tokens to integers to feed to the model.

While `tf.text` provides various tokenizers, you will use the `UnicodeScriptTokenizer` to tokenize the dataset. Define a function to convert the text to lowercase and tokenize it. You will use `tf.data.Dataset.map` to apply the tokenization to the dataset.

In [None]:
tokenizer = tf_text.UnicodeScriptTokenizer()

In [None]:
def tokenize(text, unused_label):
  lower_case = tf_text.case_fold_utf8(text)
  return tokenizer.tokenize(lower_case)

In [None]:
tokenized_ds = all_labeled_data.map(tokenize)



In [None]:
for text_batch in tokenized_ds.take(5):
  print("Tokens: ", text_batch.numpy())

Tokens:  [b'within' b'my' b'tent' b',' b'leaning' b'against' b'the' b'wall' b',']
Tokens:  [b'for' b'the' b'mules' b'to' b'pass' b'you' b'.' b'afterwards' b'when'
 b'i' b'have' b'taken' b'the' b'body' b'home']
Tokens:  [b'in' b'beauty' b'far' b'surpassing' b'all' b'their' b'sex' b':']
Tokens:  [b'sent' b'on' b'a' b'like' b'design' b'to' b'the' b'camp' b'of' b'the'
 b'grecians' b'.' b'from' b'him' b'they' b'are']
Tokens:  [b'a' b'barrier' b'against' b'all' b'the' b'winds' b'of' b'heaven' b',']


Next, you will build a vocabulary by sorting tokens by frequency and keeping the top `VOCAB_SIZE` tokens.

In [None]:
tokenized_ds = configure_dataset(tokenized_ds)

vocab_dict = collections.defaultdict(lambda: 0)
for toks in tokenized_ds.as_numpy_iterator():
  for tok in toks:
    vocab_dict[tok] += 1
print(vocab_dict)

vocab = sorted(vocab_dict.items(), key=lambda x:x[1], reverse = True)
vocab = [token for token, count in vocab]
vocab = vocab[:VOCAB_SIZE]
vocab_size = len(vocab)
print("Vocab size: ", vocab_size)
print("First five vocab entries: ", vocab[:5])



Vocab size:  10000
First five vocab entries:  [b',', b'the', b'and', b"'", b'of']
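The same frequency-based vocabulary can be built more compactly with `collections.Counter`. The sketch below runs on three tokenized lines taken from the sample output above, with `TOP_K` as a hypothetical stand-in for `VOCAB_SIZE`.

```python
import collections

# Three tokenized lines copied from the sample output above (as str, not bytes).
tokenized_lines = [
    ["within", "my", "tent", ",", "leaning", "against", "the", "wall", ","],
    ["in", "beauty", "far", "surpassing", "all", "their", "sex", ":"],
    ["a", "barrier", "against", "all", "the", "winds", "of", "heaven", ","],
]

TOP_K = 3  # hypothetical small vocabulary size for the demo

# Count every token, then keep the TOP_K most frequent ones.
counts = collections.Counter(tok for line in tokenized_lines for tok in line)
vocab = [tok for tok, _ in counts.most_common(TOP_K)]

print(counts[","], counts["the"])  # 3 2
print(vocab)
```

As in the full dataset, the comma is the most frequent token. Several tokens tie at two occurrences here, so which of them make the cut depends on tie-breaking order.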


To convert the tokens into integers, use the `vocab` list to create a `StaticVocabularyTable`. You will map tokens to integers in the range `[2, vocab_size + 2)`. As with the `TextVectorization` layer, `0` is reserved to denote padding and `1` is reserved to denote an out-of-vocabulary (OOV) token.



In [None]:
keys = vocab
values = range(2, len(vocab) + 2) # reserve 0 for padding, 1  for OOV

init = tf.lookup.KeyValueTensorInitializer(
    keys, values, key_dtype=tf.string, value_dtype=tf.int64)

num_oov_buckets = 1
vocab_table =tf.lookup.StaticVocabularyTable(init, num_oov_buckets)
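The id convention described above (0 for padding, 1 for OOV, real tokens from 2 upward) can be sketched with an ordinary dictionary. This mirrors the convention stated in the text, not the internals of `StaticVocabularyTable`, which hashes unknown tokens into OOV buckets. The five-entry vocabulary is the "first five vocab entries" printed earlier, and `lookup` is a hypothetical helper.

```python
PAD_ID = 0  # reserved for padding
OOV_ID = 1  # reserved for out-of-vocabulary tokens

vocab = [",", "the", "and", "'", "of"]

# Real tokens start at id 2, in frequency order.
token_to_id = {token: i for i, token in enumerate(vocab, start=2)}

def lookup(tokens):
    """Map tokens to ids, sending anything unknown to OOV_ID."""
    return [token_to_id.get(token, OOV_ID) for token in tokens]

print(token_to_id)
print(lookup(["the", "wrath", "of", ","]))  # [3, 1, 6, 2]
```

The ids for `,` and `the` (2 and 3) agree with the vectorized sentence shown in the next output.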

Finally, define a function to standardize, tokenize and vectorize the dataset using the tokenizer and lookup table:

In [None]:
def preprocess_text(text, label):
  standardized = tf_text.case_fold_utf8(text)
  tokenized = tokenizer.tokenize(standardized)
  vectorized = vocab_table.lookup(tokenized)
  return vectorized, label

In [None]:
# Try this on a single example to see the output:

example_text, example_label = next(iter(all_labeled_data))
print("Sentence :", example_text.numpy())
vectorized_text, example_label = preprocess_text(example_text, example_label)
print("Vectorized sentence: ", vectorized_text.numpy())

Sentence : b'Within my tent, leaning against the wall,'
Vectorized sentence:  [ 263   32  287    2 2500  199    3  257    2]


Now run the preprocess function on the dataset using `tf.data.Dataset.map`.

In [None]:
all_encoded_data = all_labeled_data.map(preprocess_text)

### Split the dataset into train and test

The Keras `TextVectorization` layer also batches and pads the vectorized data. Padding is required because the examples inside a batch need to be the same size and shape, but the examples in these datasets are not all the same size: each line of text has a different number of words. `tf.data.Dataset` supports splitting and padded-batching datasets:

In [None]:
train_data = all_encoded_data.skip(VALIDATION_SIZE).shuffle(BUFFER_SIZE)
validation_data = all_encoded_data.take(VALIDATION_SIZE)

In [None]:
train_data = train_data.padded_batch(BATCH_SIZE)
validation_data = validation_data.padded_batch(BATCH_SIZE)
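Padded batching differs from a fixed output length in that each batch is only padded to its own longest example. A minimal plain-Python sketch of that behaviour follows; `padded_batch` here is a hypothetical function written for illustration, with made-up token ids.

```python
def padded_batch(sequences, batch_size, pad_id=0):
    """Group sequences into batches, right-padding each batch to its longest member."""
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        width = max(len(seq) for seq in batch)
        yield [seq + [pad_id] * (width - len(seq)) for seq in batch]

sequences = [[263, 32, 287], [3], [2, 2500, 199, 3, 257], [7, 8]]
batches = list(padded_batch(sequences, batch_size=2))
for batch in batches:
    print(batch)
# [[263, 32, 287], [3, 0, 0]]
# [[2, 2500, 199, 3, 257], [7, 8, 0, 0, 0]]
```

The two batches have different widths (3 and 5), which is why the model's input shape leaves the sequence dimension unspecified.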

Now, `validation_data` and `train_data` are not collections of (`example, label`) pairs, but collections of batches. Each batch is a pair of (*many examples*, *many labels*) represented as arrays.

To illustrate:

In [None]:
sample_text, sample_labels = next(iter(validation_data))
print("Text batch shape: ", sample_text.shape)
print("Label batch shape: ", sample_labels.shape)
print("First text example: ", sample_text[0])
print("First label example: ", sample_labels[0])


Text batch shape:  (64, 18)
Label batch shape:  (64,)
First text example:  tf.Tensor(
[ 263   32  287    2 2500  199    3  257    2    0    0    0    0    0
    0    0    0    0], shape=(18,), dtype=int64)
First label example:  tf.Tensor(0, shape=(), dtype=int64)


Since you use `0` for padding and `1` for out-of-vocabulary (OOV) tokens, the vocabulary size has increased by two.

In [None]:
vocab_size += 2

Configure the datasets for better performance as before.

In [None]:
train_data = configure_dataset(train_data)
validation_data = configure_dataset(validation_data)

### Train the model

You can train a model on this dataset as before.

In [None]:
model = create_model(vocab_size = vocab_size, num_labels=3)
model.compile(
    optimizer = 'adam',
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics = ['accuracy'])

history = model.fit(train_data, validation_data=validation_data, epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


In [None]:
loss, accuracy = model.evaluate(validation_data)

print("Loss: ", loss)
print("Accuracy: {:2.2%}".format(accuracy))

Loss:  0.38415566086769104
Accuracy: 84.28%


### Export the model

To make the model capable of taking raw strings as input, you will create a `TextVectorization` layer that performs the same steps as your custom preprocessing function. Since you have already built a vocabulary, you can use `set_vocabulary` instead of `adapt`, which trains a new vocabulary.

In [None]:
preprocess_layer = TextVectorization(
    max_tokens=vocab_size,
    standardize=tf_text.case_fold_utf8,
    split=tokenizer.tokenize,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH)
preprocess_layer.set_vocabulary(vocab)

In [None]:
export_model = tf.keras.Sequential(
    [preprocess_layer, model,
     layers.Activation('sigmoid')])

export_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer='adam',
    metrics=['accuracy'])

In [None]:
# Create a test dataset of raw string

test_ds = all_labeled_data.take(VALIDATION_SIZE).batch(BATCH_SIZE)
test_ds = configure_dataset(test_ds)
loss, accuracy = export_model.evaluate(test_ds)
print("Loss: ",loss)
print("Accuracy: {:2.2%}".format(accuracy))

Loss:  0.5135167241096497
Accuracy: 79.38%


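The loss and accuracy of the model on the encoded validation set and of the exported model on the raw validation set are expected to be the same. In the run recorded above they differ (84.28% versus 79.38% accuracy). One difference between the two pipelines is padding: the exported model pads every line to `MAX_SEQUENCE_LENGTH`, whereas `padded_batch` pads only to the longest line in each batch, which may account for the gap.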


### Run inference on new data

In [None]:
inputs=[
       "Join'd to th' Ionians with their flowing robes,",  # Label: 1
       "the allies, and his armour flashed about him so that he seemed to all",  # Label: 2
       "And with loud clangor of his arms he fell.",  # Label: 0
]

predicted_scores = export_model.predict(inputs)
predicted_labels = tf.argmax(predicted_scores, axis=1)
for input, label in zip(inputs, predicted_labels):
  print("Question: ", input)
  print("Predicted labels: ", label.numpy())

Question:  Join'd to th' Ionians with their flowing robes,
Predicted labels:  1
Question:  the allies, and his armour flashed about him so that he seemed to all
Predicted labels:  2
Question:  And with loud clangor of his arms he fell.
Predicted labels:  0


## Downloading more datasets using TensorFlow Datasets (TFDS)

https://www.kaggle.com/maximgolovatchev/imdb-reviews/notebook


In [None]:
train_ds = tfds.load(
    'imdb_reviews',
    split='train',
    batch_size = BATCH_SIZE,
    shuffle_files=True,
    as_supervised=True)

Downloading and preparing dataset imdb_reviews/plain_text/1.0.0 (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...
Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


In [None]:
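# Note: this loads the 'train' split a second time, so the "validation"
# metrics reported below are measured on data the model was trained on.
# Pass split='test' instead to evaluate on held-out reviews.
val_ds = tfds.load(
    'imdb_reviews',
    split='train',
    batch_size=BATCH_SIZE,
    shuffle_files=True,
    as_supervised=True)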

Print a few examples.

In [None]:
for review_batch, label_batch in val_ds.take(1):
  for i in range(5):
    print("Review: ", review_batch[i].numpy())
    print("Label: ",label_batch[i].numpy())


Review:  b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."
Label:  0
Review:  b'I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell asleep because the film was rubbi

### Prepare the dataset for training

In [None]:
vectorize_layer = TextVectorization(
    max_tokens = VOCAB_SIZE,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH)

# Make a text-only dataset (without labels), then call adapt
train_text = train_ds.map(lambda text, labels: text)
vectorize_layer.adapt(train_text)

In [None]:
def vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return vectorize_layer(text), label

In [None]:
train_ds = train_ds.map(vectorize_text)
val_ds = val_ds.map(vectorize_text)

In [None]:
# Configure dataset for performance as before

train_ds = configure_dataset(train_ds)
val_ds = configure_dataset(val_ds)

### Train the model

In [None]:
model = create_model(vocab_size=VOCAB_SIZE + 1, num_labels=1)
model.summary()

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, None, 64)          640064    
_________________________________________________________________
conv1d_2 (Conv1D)            (None, None, 64)          20544     
_________________________________________________________________
global_max_pooling1d_2 (Glob (None, 64)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 65        
Total params: 660,673
Trainable params: 660,673
Non-trainable params: 0
_________________________________________________________________


In [None]:
model.compile(
    loss=losses.BinaryCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])

In [None]:
history = model.fit(train_ds, validation_data=val_ds, epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


In [None]:
loss, accuracy = model.evaluate(val_ds)

print("loss: ",loss)
print("Accuracy: {:2.2%}".format(accuracy))

loss:  0.09483323246240616
Accuracy: 97.82%


### Export the model

In [None]:
export_model = tf.keras.Sequential(
    [vectorize_layer, model,
     layers.Activation('sigmoid')])

# The model has a single sigmoid output, so use binary cross-entropy on probabilities.
export_model.compile(
    loss=losses.BinaryCrossentropy(from_logits=False),
    optimizer='adam',
    metrics=['accuracy'])

In [None]:
inputs = [
    "This is a fantastic movie.",
    "This is a bad movie.",
    "This movie was so bad that it was good.",
    "I will never say yes to watching this movie.",
] 

predicted_scores = export_model.predict(inputs)
predicted_labels = [int(round(x[0])) for x in predicted_scores]
for input, label in zip(inputs, predicted_labels):
  print("Question: ", input)
  print("predicted labels: ",label)

Question:  This is a fantastic movie.
predicted labels:  1
Question:  This is a bad movie.
predicted labels:  0
Question:  This movie was so bad that it was good.
predicted labels:  0
Question:  I will never say yes to watching this movie.
predicted labels:  1


In [None]:
print(predicted_scores)
print(predicted_labels)

[[0.99992895]
 [0.1504317 ]
 [0.36695465]
 [0.67529476]]
[1, 0, 0, 1]
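The step from sigmoid scores to labels is a threshold at 0.5. The sketch below reproduces it on the scores printed above with an explicit comparison, which makes the boundary case unambiguous: Python's built-in `round` rounds halves to the nearest even integer, so `round(0.5)` is `0`. The `to_label` helper and the choice to map exactly 0.5 to the positive class are assumptions for this illustration.

```python
# Scores copied from the output above (one sigmoid score per review).
predicted_scores = [[0.99992895], [0.1504317], [0.36695465], [0.67529476]]

THRESHOLD = 0.5

def to_label(score, threshold=THRESHOLD):
    """Return 1 (positive) when the score reaches the threshold, else 0 (negative)."""
    return 1 if score >= threshold else 0

predicted_labels = [to_label(row[0]) for row in predicted_scores]
print(predicted_labels)  # [1, 0, 0, 1]

# The boundary case behaves differently under the two approaches.
print(round(0.5), to_label(0.5))  # 0 1
```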
