# **Sentiment Analysis with RNN Using Yelp Reviews Dataset**

### **Import libraries and load the dataset**
- Explore the dataset
- Print detailed dataset information using print(info)
- View the dictionary that holds the train and test splits
- Extract the train_ds and test_ds splits
- Check the data types
- Preview the first 3 reviews

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow_datasets as tfds
import tensorflow as tf

In [None]:
# Load the Yelp Reviews dataset
dataset, info = tfds.load('yelp_polarity_reviews',
                          with_info=True,
                          as_supervised=True)


Downloading and preparing dataset 158.67 MiB (download: 158.67 MiB, generated: 435.14 MiB, total: 593.80 MiB) to /root/tensorflow_datasets/yelp_polarity_reviews/0.2.0...


Dataset yelp_polarity_reviews downloaded and prepared to /root/tensorflow_datasets/yelp_polarity_reviews/0.2.0. Subsequent calls will reuse this data.


In [None]:
# detailed information about yelp reviews
print(info)

tfds.core.DatasetInfo(
    name='yelp_polarity_reviews',
    full_name='yelp_polarity_reviews/0.2.0',
    description="""
    Large Yelp Review Dataset.
    This is a dataset for binary sentiment classification. We provide a set of 560,000 highly polar yelp reviews for training, and 38,000 for testing. 
    ORIGIN
    The Yelp reviews dataset consists of reviews from Yelp. It is extracted
    from the Yelp Dataset Challenge 2015 data. For more information, please
    refer to http://www.yelp.com/dataset
    
    The Yelp reviews polarity dataset is constructed by
    Xiang Zhang (xiang.zhang@nyu.edu) from the above dataset.
    It is first used as a text classification benchmark in the following paper:
    Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks
    for Text Classification. Advances in Neural Information Processing Systems 28
    (NIPS 2015).
    
    
    DESCRIPTION
    
    The Yelp reviews polarity dataset is constructed by considering stars 1 an

In [None]:
# view the predefined data within the object
dataset

{Split('train'): <_PrefetchDataset element_spec=(TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None))>,
 Split('test'): <_PrefetchDataset element_spec=(TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None))>}

In [None]:
# extract train and test datasets
train_ds = dataset['train']
test_ds = dataset['test']

In [None]:
# Inspect the data types of the features and labels
train_ds.element_spec

(TensorSpec(shape=(), dtype=tf.string, name=None),
 TensorSpec(shape=(), dtype=tf.int64, name=None))

In [None]:
# Loop through the dataset and preview the first three reviews

for X, y in train_ds.take(3):
  print('feature:', X)
  print('label:', y)
  print()

feature: tf.Tensor(b"The Groovy P. and I ventured to his old stomping grounds for lunch today.  The '5 and Diner' on 16th St and Colter left me with little to ask for.  Before coming here I had a preconceived notion that 5 & Diners were dirty and nasty. Not the case at all.\\n\\nWe walk in and let the waitress know we want to sit outside (since it's so nice and they had misters).  We get two different servers bringing us stuff (talk about service) and I ask the one waitress for recommendations.  I didn't listen to her, of course, and ordered the Southwestern Burger w/ coleslaw and started with a nice stack of rings.\\n\\nThe Onion Rings were perfectly cooked.  They looked like they were prepackaged, but they were very crispy and I could actually bite through the onion without pulling the entire thing out (don't you hate that?!!!)\\n\\nThe Southwestern Burger was order Medium Rare and was cooked accordingly.  Soft, juicy, and pink with a nice crispy browned outer layer that can only be 

In [None]:
# shuffle(1000): draws examples at random from a 1000-element buffer so the model does not see the data in its stored order.
# batch(4): groups the dataset into batches of 4 examples each, so several reviews are processed at once.
# prefetch(tf.data.AUTOTUNE): prepares the next batch while the current batch is being processed, which speeds up training.

train_ds = train_ds.shuffle(1000).batch(4).prefetch(tf.data.AUTOTUNE)
test_ds = test_ds.shuffle(1000).batch(4).prefetch(tf.data.AUTOTUNE)
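
To make the buffered shuffle and batching steps more concrete, here is a minimal pure-Python sketch of the idea. It is only an illustration, not TensorFlow's implementation, and the helper names `buffered_shuffle` and `batch` are hypothetical.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything
            buffer.append(item)
            continue
        # Emit a random buffered element and put the new item in its slot
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: drain what is left in random order
    rng.shuffle(buffer)
    yield from buffer

def batch(items, batch_size):
    """Group consecutive items into lists of at most batch_size elements."""
    current = []
    for item in items:
        current.append(item)
        if len(current) == batch_size:
            yield current
            current = []
    if current:
        yield current  # final, possibly smaller, batch

toy_batches = list(batch(buffered_shuffle(range(10), buffer_size=3), batch_size=4))
print(toy_batches)
```

In this sketch the first element emitted always comes from the first `buffer_size` inputs, which is why a buffer of 1000 only mixes a dataset of 560,000 reviews locally rather than uniformly.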

In [None]:
# Retrieve and print one batch (features and labels) from the training dataset to see what the data looks like.
for X_batch, y_batch in train_ds.take(1):
  print('features:', X_batch)
  print('\nlabels:', y_batch)

features: tf.Tensor(
[b'Went here with wife and some friends on 9/22 for the Bears game. My wife and I we near by so went early to get a good table and relax. Beer selection and waitress was great. Problem we had wad 20 minutes before kickoff some dumbass manager asked us to slide our table over to make room for another group.  We got there early for a reason, we moved a little bit. Turns out other group were a bunch of his buddys n jets fans. So when I cheeered for my Bears they got butt hurt n cried to the manager. He asked me to pace myself and then offered us free drinks. DUH. His buddies were all gone bye halftime. So after cashed out 100 drink tap n left. Within 15 minutes my credit card company contacted me about an attempted 550 charge from this place. Nice try dipshit. Credit company refused charge and this bar lost a customer who loves beer. All cause he wanted to b a bigshot for his boys who were drinkin redds apple ale. Nuff said'
 b"Make sure to double check your order! I 

## **Text Vectorization: Convert raw text strings into sequences of word indices**
- Use the TextVectorization layer from TensorFlow
- Call adapt() on the training dataset so the layer can build its vocabulary
- Use a lambda function to pass only the text (not the labels) to the TextVectorization object

In [None]:
# we specify a maximum size for the vocabulary
VOCAB_SIZE = 9000

In [None]:
# preprocessing text data
encoder = tf.keras.layers.TextVectorization(max_tokens = VOCAB_SIZE)

In [None]:
encoder.adapt(train_ds.map(lambda X, y : X))

This code gets the vocabulary that the text encoder has learned and displays its first 40 entries. The vocabulary is ordered by frequency, so this shows which words are most common in my dataset. Index 0 ('') is reserved for padding and index 1 ('[UNK]') for out-of-vocabulary words.

In [None]:
vocab = encoder.get_vocabulary()

# show the first 40 words
vocab[:40]

['',
 '[UNK]',
 'the',
 'and',
 'i',
 'to',
 'a',
 'was',
 'of',
 'it',
 'for',
 'in',
 'is',
 'that',
 'my',
 'we',
 'this',
 'with',
 'but',
 'they',
 'you',
 'on',
 'not',
 'have',
 'had',
 'at',
 'were',
 'so',
 'are',
 'be',
 'food',
 'place',
 'me',
 'there',
 'as',
 'good',
 'out',
 'like',
 'if',
 'all']
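
As a rough illustration of what the encoder does with its default settings (lowercase, strip punctuation, split on whitespace, rank words by frequency), here is a small pure-Python sketch. The regular expression is only an approximation of the layer's punctuation handling, and the helper names and the toy reviews are made up for this example.

```python
import re
from collections import Counter

def standardize(text):
    """Lowercase the text, strip punctuation, and split on whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # approximation of punctuation stripping
    return text.split()

def build_vocab(texts, max_tokens):
    """Rank tokens by frequency; index 0 is padding and index 1 is unknown."""
    counts = Counter(tok for t in texts for tok in standardize(t))
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def encode(text, vocab):
    """Map each token to its vocabulary index, or 1 for unseen tokens."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, 1) for tok in standardize(text)]

# Made-up reviews, used only to demonstrate the mechanics
toy_reviews = ["The food was good!", "The place was not good.", "Good food, good place"]
toy_vocab = build_vocab(toy_reviews, max_tokens=6)
toy_encoded = encode("The pizza was good", toy_vocab)
print(toy_vocab)
print(toy_encoded)
```

The word "pizza" never appears in the toy reviews, so it maps to index 1, the same '[UNK]' slot seen at the top of the real vocabulary.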

The code here takes one batch of text reviews from the training dataset, prints the original text reviews, and then prints the encoded version of those reviews. This helps me to see how the text is transformed into numerical form by the encoder.

In [None]:
for X, y in train_ds.take(1):
  print('original reviews:', X)
  print()
  print('encoded reviews:', encoder(X))

original reviews: tf.Tensor(
[b"I have to say I was disappointed when someone told me this was the best Indian in Edinburgh.  I wanted to try some Indian while I was in the UK since it's suppose to be better than what we have in the US.  I'm still confident UK Indian is better than the US but I will need another place to corroborate that.  On the plus side, the staff is really nice and attentive."
 b"This joint is open 24 hours!!!  Sold. \\n\\nI had the pho dac biet, and it was super good.  The broth was flavorful and clear and the meat was, surprisingly, very tender.  The noodles were firm and everything was just delicious.  I would normally order pho thai rather than pho dac biet, but I was feeling like a fatty and went ahead with it.  \\nThe wait staff were friendly and patient so that's always a plus.  The facility was CLEAN!!  Now, that's something you don't see very often in pho joints.\\nI also ordered the che ba mau (three-color dessert) and it was just okay.  It was in a small

Here I set up an embedding layer that maps each of the len(vocab) token indices (9000 here, the VOCAB_SIZE cap) to a 64-dimensional vector. Setting mask_zero=True marks index 0 as padding, so downstream layers such as the LSTM skip the padded positions.

In [None]:
embedding_layer = tf.keras.layers.Embedding(
        input_dim = len(vocab),
        output_dim = 64,
        mask_zero = True)
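
Conceptually, an embedding layer is a lookup table with one row per token index. The sketch below mimics that lookup and the padding mask in plain Python; it is a simplified illustration with hypothetical helper names, not how Keras implements the layer.

```python
import random

def make_embedding_table(vocab_size, dim, seed=0):
    """Create a vocab_size x dim table of small random values."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(vocab_size)]

def embed(token_ids, table):
    """Look up one row per token id and build a mask that is False for padding (id 0)."""
    vectors = [table[i] for i in token_ids]
    mask = [i != 0 for i in token_ids]
    return vectors, mask

toy_table = make_embedding_table(vocab_size=10, dim=4)
toy_vectors, toy_mask = embed([3, 7, 0, 0], toy_table)
print(len(toy_vectors), len(toy_vectors[0]))
print(toy_mask)
```

The table entries start out random and are adjusted during training, which is why the embedding values printed before training are small and carry no meaning yet.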


Here I pass a sample text review through the pipeline: the text vectorization layer encodes it into numerical tokens, and the embedding layer converts those tokens into dense vectors. Finally, the code prints out the original review, the encoded version, and the dense vector representation after word embedding.

In [None]:
X = 'Excellent Place!'
X_enc = encoder(X)
X_embed = embedding_layer(X_enc)

print('Original reviews:', X)
print('\nEncoded reviews:', X_enc)
print('\nAfter word embedding:', X_embed)


Original reviews: Excellent Place!

Encoded reviews: tf.Tensor([336  31], shape=(2,), dtype=int64)

After word embedding: tf.Tensor(
[[ 0.01757331  0.04260396  0.04937771 -0.03792123 -0.03766849  0.01313165
  -0.03508587  0.01294206  0.03150516  0.03556773 -0.04271854 -0.02369543
   0.03375738 -0.0071109   0.01945225 -0.0323123  -0.0032388   0.04825736
  -0.00787268  0.03707803  0.01624823 -0.00302924  0.04584234  0.04013792
   0.04226087 -0.00163543 -0.03974649  0.02279266  0.01492066 -0.03629658
   0.01217761  0.0357567   0.00045057 -0.04354965  0.00516307  0.017185
   0.03342057 -0.02560713  0.0483175  -0.00678805 -0.04786395 -0.02327138
  -0.02112691  0.02540408  0.04368997 -0.0179338  -0.03367019  0.021217
   0.02563277 -0.02940058 -0.01704581 -0.00212426 -0.00797459  0.0189031
   0.02244843 -0.01856378  0.00859342  0.00554812  0.02787996  0.00798916
  -0.00202544  0.03363364  0.01444364  0.02613577]
 [-0.046948   -0.00757122 -0.00697754  0.01933506  0.04766598 -0.04873687
  -0.00

## **Building the RNN Model**
- Define a sequential neural network model for sentiment analysis.
- Starts with a text vectorization layer (encoder) to convert text reviews into numerical tokens.
- Followed by an embedding layer to convert these tokens into dense vectors.
- Next, there's an LSTM layer to capture sequential dependencies in the data.
- After that, there's a dense layer with ReLU activation for feature extraction.
- Finally, there's an output layer with a single unit for sentiment prediction.

In [None]:
# define a neural network
model = tf.keras.Sequential([
    encoder,
    embedding_layer,
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(48, activation = 'relu'),
    tf.keras.layers.Dense(1)
])

- Specify the loss function, optimizer, and evaluation metric.
- Binary cross-entropy loss is used for binary classification; from_logits=True is set because the output layer has no activation and therefore emits raw logits.
- The Adam optimizer with a learning rate of 0.0001 is chosen for optimization, and accuracy is chosen as the evaluation metric to monitor during training.

In [None]:
# specify the loss function, optimizer, and evaluation metric
model.compile(loss = tf.keras.losses.BinaryCrossentropy(from_logits = True),
              optimizer = tf.keras.optimizers.Adam(learning_rate = 0.0001),
              metrics = ['accuracy'])
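
Because the output layer has no activation, the loss is computed from raw logits. The following sketch shows, for a single example, that a numerically stable cross-entropy computed from a logit agrees with the textbook formula applied to the sigmoid probability. The function names and the sample values are made up for illustration; this is not the Keras source code.

```python
import math

def sigmoid(z):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_logit(y_true, logit):
    """Numerically stable binary cross-entropy computed directly from a logit."""
    return max(logit, 0.0) - logit * y_true + math.log1p(math.exp(-abs(logit)))

def bce_from_prob(y_true, p):
    """Textbook binary cross-entropy computed from a probability."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

# Made-up (label, logit) pairs
for y, z in [(1, 2.0), (0, 2.0), (1, -1.5)]:
    print(y, z, bce_from_logit(y, z), bce_from_prob(y, sigmoid(z)))
```

Both columns should match up to floating-point error; the logit form simply avoids taking the log of a probability that has been rounded to 0 or 1.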

In [None]:
# This summary provides insights into the model's architecture, the number of parameters, and memory requirements, which are essential for understanding and optimizing the model.
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization (TextVe  (None, None)              0         
 ctorization)                                                    
                                                                 
 embedding (Embedding)       (None, None, 64)          576000    
                                                                 
 lstm (LSTM)                 (None, 64)                33024     
                                                                 
 dense (Dense)               (None, 48)                3120      
                                                                 
 dense_1 (Dense)             (None, 1)                 49        
                                                                 
Total params: 612193 (2.34 MB)
Trainable params: 612193 (2.34 MB)
Non-trainable params: 0 (0.00 Byte)
____________________
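
The parameter counts in the summary can be reproduced by hand. The short sketch below redoes the arithmetic using the standard formulas for embedding, LSTM, and dense layers; the helper names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 weight sets (input, forget, and output gates plus the candidate cell state),
    # each with input weights, recurrent weights, and a bias vector
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units

param_counts = {
    "embedding": embedding_params(9000, 64),
    "lstm": lstm_params(64, 64),
    "dense": dense_params(64, 48),
    "dense_1": dense_params(48, 1),
}
total_params = sum(param_counts.values())
print(param_counts, total_params)
```

Almost all of the 612,193 parameters sit in the embedding table (576,000), so the vocabulary size and embedding dimension dominate the model size.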

## **Training the RNN Model**
- The EarlyStopping callback monitors the validation loss during training. If the validation loss does not improve for 5 consecutive epochs, it stops the training process early to limit overfitting.
- The ModelCheckpoint callback also monitors the validation loss. Because save_best_only is set to True, it saves the model's weights to the checkpoint path only when the validation loss improves, overwriting the previously saved best weights.

In [None]:
# set up early stopping
earlystop_callback = tf.keras.callbacks.EarlyStopping(monitor = 'val_loss', patience = 5)

In [None]:
# setting up model checkpointing
checkpoint_path = './checkpoints'
cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath = checkpoint_path,
                                                  save_weights_only = True,
                                                  save_best_only = True,
                                                  monitor = 'val_loss',
                                                  verbose = 1)
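
The interplay of the two callbacks can be sketched without TensorFlow. The function below is a simplified simulation of the patience logic and of "save only on improvement"; the validation losses are invented for the example, and the function name is hypothetical.

```python
def simulate_callbacks(val_losses, patience):
    """Return (stopped_epoch, best_epoch), both 1-based, for a list of validation losses."""
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Improvement: remember it (this is where a checkpoint would be written)
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch  # stop early
    return len(val_losses), best_epoch  # ran through every epoch

# Invented validation losses, for illustration only
made_up_losses = [0.50, 0.40, 0.35, 0.36, 0.37, 0.36, 0.38, 0.39, 0.30]
stopped_epoch, best_epoch = simulate_callbacks(made_up_losses, patience=5)
print(stopped_epoch, best_epoch)
```

With these invented numbers, training stops at epoch 8 while the best weights come from epoch 3, so the lower loss at epoch 9 is never reached. That is the trade-off the patience value controls.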


In [None]:
# unbatch() takes a batched dataset and transforms it into a dataset where each element corresponds to an individual example.
# After unbatching, we batch the dataset again with a batch size of 64.
train_ds = train_ds.unbatch().batch(64)
test_ds = test_ds.unbatch().batch(64)
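
The unbatch-then-batch step can be pictured with ordinary Python lists. This is only a conceptual sketch with hypothetical helper names, using toy batches instead of the real dataset.

```python
def unbatch(batches):
    """Flatten a sequence of batches back into individual examples."""
    for b in batches:
        yield from b

def rebatch(batches, batch_size):
    """Unbatch, then regroup the examples into batches of a new size."""
    current = []
    for example in unbatch(batches):
        current.append(example)
        if len(current) == batch_size:
            yield current
            current = []
    if current:
        yield current  # final, possibly smaller, batch

small_batches = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
regrouped = list(rebatch(small_batches, batch_size=6))
print(regrouped)
```

With 560,000 training reviews and a batch size of 64, one epoch corresponds to 560,000 / 64 = 8,750 steps.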

In [None]:
history = model.fit(train_ds,
                    epochs = 50,
                    validation_data = test_ds,
                    callbacks = [earlystop_callback, cp_callback]).history    # the callback functions defined earlier

Epoch 1/50
   1701/Unknown - 159s 90ms/step - loss: 0.3140 - accuracy: 0.8448

KeyboardInterrupt: 

## **Using the RNN Model for Sentiment Analysis**

In [None]:
def show(pred):
  print('prediction score:', pred)
  if pred >= 0.5:
    print('A positive review')
  else:
    print('A negative review')


In [None]:
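pred = model.predict(['Not at all impressed'])
pred = tf.nn.sigmoid(pred)    # the model outputs a logit; convert it to a probability before thresholding at 0.5
show(pred)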

NameError: name 'model' is not defined

In [None]:
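pred = model.predict(['what an amazing place'])
pred = tf.nn.sigmoid(pred)    # the model outputs a logit; convert it to a probability before thresholding at 0.5
show(pred)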

In [None]:
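pred = model.predict(['The best cupcakes in Henderson! Consistently moist and fresh!'])
pred = tf.nn.sigmoid(pred)    # the model outputs a logit; convert it to a probability before thresholding at 0.5
show(pred)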

In [None]:
pred = model.predict(["""Mediocre burgers - if you are in the area and want a fast food burger,
 Fatburger is  a better bet than Wendy's. But it is nothing to go out of your way for"""])

pred = tf.nn.sigmoid(pred)
show(pred)
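
Since the final Dense layer has no activation and the loss uses from_logits=True, the model's raw output is a logit rather than a probability. The sketch below shows why the sigmoid matters before comparing against 0.5; the logit values are made up, and the helper names are hypothetical.

```python
import math

def logit_to_probability(logit):
    """Apply the sigmoid so the score lies between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-logit))

def label_from_logit(logit, threshold=0.5):
    """Classify a raw model output after converting it to a probability."""
    prob = logit_to_probability(logit)
    return ("positive" if prob >= threshold else "negative"), prob

# Made-up logits, for illustration only
for made_up_logit in (-2.0, 0.3, 4.0):
    print(made_up_logit, label_from_logit(made_up_logit))
```

A logit of 0.3 is below 0.5 as a raw number, yet its probability is above 0.5, so thresholding the raw output at 0.5 would mislabel it as negative.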


## **Saving RNN Model in Colab**

In [None]:
# Restore the best weights written by the ModelCheckpoint callback,
# then save the model as a TensorFlow (.tf) model
model.load_weights(checkpoint_path)

model.save('rnn_model.tf')

In [None]:
! ls -alt

In [None]:
from google.colab import drive

drive.mount('/googledrive')

In [None]:
model.save('/googledrive/My Drive/some_folder/rnn_model.tf')

In [None]:
! ls -alt '/googledrive/My Drive/some_folder'

In [None]:
model = tf.keras.models.load_model('/googledrive/My Drive/some_folder/rnn_model.tf')

model.summary()