<a href="https://colab.research.google.com/github/menezesmalu/chatbot/blob/main/classification_yelp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text classification with an RNN

## Setup

In [None]:
import numpy as np

import tensorflow_datasets as tfds
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers

tfds.disable_progress_bar()

Import `matplotlib` and create a helper function to plot graphs:

In [None]:
import matplotlib.pyplot as plt


def plot_graphs(history, metric):
  plt.plot(history.history[metric])
  plt.plot(history.history['val_'+metric], '')
  plt.xlabel("Epochs")
  plt.ylabel(metric)
  plt.legend([metric, 'val_'+metric])

## Setup input pipeline

[yelp_polarity_reviews](https://www.tensorflow.org/datasets/catalog/yelp_polarity_reviews?hl=pt-br)

This notebook uses a binary sentiment dataset of restaurant reviews from the Yelp platform.

The Yelp reviews polarity dataset is built by treating 1- and 2-star reviews as negative and 3- and 4-star reviews as positive. For each polarity, 280,000 training samples and 19,000 test samples are drawn at random. In total there are 560,000 training samples and 38,000 test samples. Negative polarity is class 1 and positive polarity is class 2.

The dataset was found through TFDS.

In [None]:
dataset, info = tfds.load('yelp_polarity_reviews', with_info=True,
                          as_supervised=True)
train_dataset, test_dataset = dataset['train'], dataset['test']

train_dataset.element_spec

Downloading and preparing dataset yelp_polarity_reviews/plain_text/0.1.0 (download: Unknown size, generated: 435.14 MiB, total: 435.14 MiB) to /root/tensorflow_datasets/yelp_polarity_reviews/plain_text/0.1.0...
Shuffling and writing examples to /root/tensorflow_datasets/yelp_polarity_reviews/plain_text/0.1.0.incompleteC3G1HU/yelp_polarity_reviews-train.tfrecord
Shuffling and writing examples to /root/tensorflow_datasets/yelp_polarity_reviews/plain_text/0.1.0.incompleteC3G1HU/yelp_polarity_reviews-test.tfrecord
Dataset yelp_polarity_reviews downloaded and prepared to /root/tensorflow_datasets/yelp_polarity_reviews/plain_text/0.1.0. Subsequent calls will reuse this data.


(TensorSpec(shape=(), dtype=tf.string, name=None),
 TensorSpec(shape=(), dtype=tf.int64, name=None))

Initially this returns a dataset of `(text, label)` pairs:

In [None]:
for example, label in train_dataset.take(1):
  print('text: ', example.numpy())
  print('label: ', label.numpy())

text:  b"The Groovy P. and I ventured to his old stomping grounds for lunch today.  The '5 and Diner' on 16th St and Colter left me with little to ask for.  Before coming here I had a preconceived notion that 5 & Diners were dirty and nasty. Not the case at all.\\n\\nWe walk in and let the waitress know we want to sit outside (since it's so nice and they had misters).  We get two different servers bringing us stuff (talk about service) and I ask the one waitress for recommendations.  I didn't listen to her, of course, and ordered the Southwestern Burger w/ coleslaw and started with a nice stack of rings.\\n\\nThe Onion Rings were perfectly cooked.  They looked like they were prepackaged, but they were very crispy and I could actually bite through the onion without pulling the entire thing out (don't you hate that?!!!)\\n\\nThe Southwestern Burger was order Medium Rare and was cooked accordingly.  Soft, juicy, and pink with a nice crispy browned outer layer that can only be achieved on 

Next, shuffle the training data and group both datasets into batches of `(text, label)` pairs:

In [None]:
BUFFER_SIZE = 10000
BATCH_SIZE = 64

In [None]:
train_dataset = train_dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
test_dataset = test_dataset.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
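
`BUFFER_SIZE` limits how far an example can move during shuffling. The pure-Python sketch below illustrates the idea of a buffer-based shuffle; it is not the `tf.data` implementation, and `buffered_shuffle` is a hypothetical helper that assumes `buffer_size >= 1`.

```python
import random


def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in random order using a fixed-size buffer (illustrative only)."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Emit a random buffered element and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: drain the remaining elements in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(20), buffer_size=5))
print(shuffled)
```

With a buffer of 5, the element emitted at position `k` always comes from the first `k + 5` inputs, so a buffer much smaller than the dataset only shuffles locally. Here the buffer holds 10,000 of the 560,000 training examples.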

In [None]:
for example, label in train_dataset.take(1):
  print('texts: ', example.numpy()[:3])
  print()
  print('labels: ', label.numpy()[:3])

texts:  [b"Get here early. The place fills up quickly. \\n\\nI love this place now. The new update makes the place so much more inviting. \\n\\nThe parking is validated for up to 4 hours, which is really awesome. The only reason I gave it for stars was because of the popcorn. You are a movie theatre, your popcorn should be on point. I don't know what it is, but something is not right about it."
 b"Me and the wifey finally made it to this place. We are big seafood/oyster fans and as most of you know there are few options for good oysters in CLT area. Oysters were good but way pricey. Everything else was an epic FAIL. Service was weak. Bartenders and servers were lethargic. Draft beer was flat. We sat in bar area at a hightop and watched beer after beer get returned for same reason. Bartender did nothing to adjust CO2. He was more concerned with the FSU-NCSU football game. Wife ordered glass of Chardonnay  and the pour on the wine was very weak as well. Maybe a 1/4 glass full. Other food

## Create the text encoder

In [None]:
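VOCAB_SIZE = 1000
# TextVectorization left the experimental namespace in TF 2.6; on older
# versions use tf.keras.layers.experimental.preprocessing.TextVectorization.
encoder = tf.keras.layers.TextVectorization(
    max_tokens=VOCAB_SIZE)
# Build the vocabulary from the training text only (labels are dropped).
encoder.adapt(train_dataset.map(lambda text, label: text))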

In [None]:
vocab = np.array(encoder.get_vocabulary())
vocab[:20]

array(['', '[UNK]', 'the', 'and', 'i', 'to', 'a', 'was', 'of', 'it',
       'for', 'in', 'is', 'that', 'my', 'we', 'this', 'with', 'but',
       'they'], dtype='<U13')

In [None]:
encoded_example = encoder(example)[:3].numpy()
encoded_example

array([[ 43,  45, 605, ...,   0,   0,   0],
       [ 32,   3,   2, ...,   0,   0,   0],
       [ 21,   1,  19, ...,   0,   0,   0]])

In [None]:
for n in range(3):
  print("Original: ", example[n].numpy())
  print("Round-trip: ", " ".join(vocab[encoded_example[n]]))
  print()

Original:  b"Get here early. The place fills up quickly. \\n\\nI love this place now. The new update makes the place so much more inviting. \\n\\nThe parking is validated for up to 4 hours, which is really awesome. The only reason I gave it for stars was because of the popcorn. You are a movie theatre, your popcorn should be on point. I don't know what it is, but something is not right about it."
Round-trip:  get here early the place [UNK] up quickly nni love this place now the new [UNK] makes the place so much more [UNK] nnthe parking is [UNK] for up to 4 hours which is really awesome the only reason i gave it for stars was because of the [UNK] you are a [UNK] [UNK] your [UNK] should be on point i dont know what it is but something is not right about it                                                                                                                                                                                                                                            
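
The round trip above is lossy: words outside the 1,000-token vocabulary become `[UNK]`, and the literal `\n\n` sequences in the raw text collapse to `nn` once punctuation is stripped. The sketch below imitates that behaviour in plain Python, assuming the layer's default settings (lowercasing, punctuation stripping, whitespace splitting, index 0 for padding and index 1 for unknown tokens). The helpers `standardize`, `build_vocab`, and `encode` are hypothetical names, not part of Keras.

```python
import re
import string
from collections import Counter


def standardize(text):
    """Lowercase the text and strip ASCII punctuation."""
    return re.sub(f"[{re.escape(string.punctuation)}]", "", text.lower())


def build_vocab(texts, max_tokens):
    """Reserve index 0 for padding and 1 for unknowns, then add frequent tokens."""
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    frequent = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + frequent


def encode(text, vocab):
    """Map each token to its vocabulary index, or 1 if it is out of vocabulary."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, 1) for tok in standardize(text).split()]


texts = ["The place fills up quickly.", "I love the place!", "The popcorn was off."]
vocab = build_vocab(texts, max_tokens=4)

# Raw string: the backslashes are literal characters, as in the review text.
ids = encode(r"The place. \n\nI love it", vocab)
round_trip = " ".join(vocab[i] for i in ids)
print(vocab)
print(ids)
print(round_trip)
```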

## Create the model

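The model stacks the text encoder, an embedding layer with masking enabled, a bidirectional LSTM, and two dense layers, the last of which outputs a single logit: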

In [None]:
model = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Embedding(
        input_dim=len(encoder.get_vocabulary()),
        output_dim=64,
        # Use masking to handle the variable sequence lengths
        mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])

In [None]:
print([layer.supports_masking for layer in model.layers])

[False, True, True, True, True]
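
With `mask_zero=True`, the embedding layer marks every position whose token id is 0 as padding, and layers that support masking skip those positions. The NumPy sketch below shows what such a mask looks like. The masked average is only a stand-in to show the effect of ignoring padding; it is not what the LSTM computes, and the token ids and values are made up for illustration.

```python
import numpy as np

# Two padded sequences; 0 is the padding id produced by the encoder.
token_ids = np.array([[43, 45, 605, 0, 0],
                      [32, 3, 2, 7, 0]])

# True where there is a real token, False where there is padding.
mask = token_ids != 0
lengths = mask.sum(axis=1)

# Hypothetical per-token values; padded positions must not affect the result.
values = np.arange(10, dtype=float).reshape(2, 5)
masked_mean = (values * mask).sum(axis=1) / lengths

print(mask)
print(lengths)
print(masked_mean)
```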


In [None]:
# Predict on a sample text with the untrained model; the output is a logit, not a probability.
sample_text = ("The service is super slow!  We've been waiting for our food for an hour!")
predictions = model.predict(np.array([sample_text]))
print(predictions[0])

[0.00877833]
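
Because the final `Dense(1)` layer has no activation, the printed value is a logit. A minimal sketch of how it could be read, assuming label 1 corresponds to positive reviews in the TFDS encoding (`interpret_logit` is a hypothetical helper):

```python
import math


def interpret_logit(logit):
    """Convert a logit to a probability and a label using a threshold at 0."""
    probability = 1.0 / (1.0 + math.exp(-logit))
    label = "positive" if logit >= 0.0 else "negative"
    return probability, label


# The untrained model's output from the cell above.
probability, label = interpret_logit(0.00877833)
print(probability, label)
```

A logit this close to zero maps to a probability close to 0.5, which is what one would expect before any training.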


Compile the Keras model to configure the training process:

In [None]:
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])
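
`from_logits=True` tells the loss that the model emits raw logits, so the sigmoid is folded into the loss computation. The NumPy sketch below shows one common numerically stable way to compute binary cross-entropy from logits; it is an illustration under that assumption, not the Keras source, and `bce_from_logits` is a hypothetical helper.

```python
import numpy as np


def bce_from_logits(labels, logits):
    """Mean binary cross-entropy computed directly from logits."""
    labels = np.asarray(labels, dtype=float)
    logits = np.asarray(logits, dtype=float)
    # Stable form of -[z*log(sigmoid(x)) + (1-z)*log(1-sigmoid(x))].
    per_example = (np.maximum(logits, 0.0)
                   - logits * labels
                   + np.log1p(np.exp(-np.abs(logits))))
    return per_example.mean()


loss = bce_from_logits([1, 0, 1], [2.0, -1.0, 0.0])
print(loss)
```

This form never evaluates `log(0)`, so very large positive or negative logits still give a finite loss.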

## Train the model

In [None]:
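# Redefine the model with two stacked bidirectional LSTM layers.
# This replaces the single-LSTM model built above.
model = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Embedding(len(encoder.get_vocabulary()), 64, mask_zero=True),
    # return_sequences=True passes the full sequence on to the next LSTM layer
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='sigmoid'),
    tf.keras.layers.Dropout(0.5),
    # Final layer outputs a single logit (no activation)
    tf.keras.layers.Dense(1)
])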


In [None]:
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])

In [None]:
history = model.fit(train_dataset, epochs=10,
                    validation_data=test_dataset,
                    validation_steps=30)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
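
The arithmetic below works out how much data each epoch touches, using the sample counts from the dataset description and the `BATCH_SIZE` and `validation_steps` values set above. It assumes the full train and test splits are used.

```python
import math

TRAIN_EXAMPLES = 560_000   # from the dataset description
TEST_EXAMPLES = 38_000     # from the dataset description
BATCH_SIZE = 64
VALIDATION_STEPS = 30

# Batches needed to see every training example once.
steps_per_epoch = math.ceil(TRAIN_EXAMPLES / BATCH_SIZE)

# Batches in the full test set (the last one is partial).
test_batches = math.ceil(TEST_EXAMPLES / BATCH_SIZE)

# Examples actually used for validation at the end of each epoch.
validation_examples = VALIDATION_STEPS * BATCH_SIZE
fraction = validation_examples / TEST_EXAMPLES

print(steps_per_epoch, test_batches, validation_examples, round(fraction, 3))
```

So the per-epoch validation metrics are estimated from roughly 5% of the test set, while `model.evaluate(test_dataset)` runs over all of it.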

In [None]:
model.summary()

In [None]:
test_loss, test_acc = model.evaluate(test_dataset)

print('Test Loss:', test_loss)
print('Test Accuracy:', test_acc)

In [None]:
# predict on a sample text without padding.

sample_text = ("The service is super slow!  We've been waiting for our food for an hour!")
predictions = model.predict(np.array([sample_text]))
print(predictions)

In [None]:
plt.figure(figsize=(16, 6))
plt.subplot(1, 2, 1)
plot_graphs(history, 'accuracy')
plt.subplot(1, 2, 2)
plot_graphs(history, 'loss')

# Extract data from datasets