# Assignment - optimization and regularization

Hi there! In this assignment, you will use review data to build an RNN model that predicts the overall rating of a product from the review text. You are also welcome to use the summary text for this task if you find it useful.

To get you started, I have provided a complete working example.

When you are done, submit your results on the Kaggle webpage for this competition. If you would rather not show your score to everyone, you can use an anonymous username on Kaggle.

However, I suggest you use your real name; after all, it is just meant as an exercise, and it is more fun that way. You can submit 5 times every day, so you can experiment without being "locked in".

# Setup
## Data

### y_train.npy:
- **overall**: The overall rating (1-5).
 
### X_train.npy and X_test.npy:
- **reviewerID**: The ID of the reviewer.
- **reviewText**: The text of the review.
- **summary**: The summary of the review.

In [1]:
import tensorflow as tf
import numpy as np
import pandas as pd

X_train =  np.load('X_train.npy',allow_pickle=True)
y_train =  np.load('y_train.npy',allow_pickle=True)
x_test =  np.load('X_test.npy',allow_pickle=True)

df_Xtrain = pd.DataFrame(X_train,columns=['reviewerID','reviewText','summary'])
df_ytrain = pd.DataFrame(y_train,columns=['overall'])
df_train = pd.concat([df_ytrain, df_Xtrain], axis=1)
df_Xtest = pd.DataFrame(x_test,columns=['reviewerID','reviewText','summary'])

In [2]:
print(f'First review = {df_train.loc[0, "reviewText"]}')
print(f'First review has length = {len(df_train.loc[0, "reviewText"])}\n ')
print(f'First review summary= {df_train.loc[0, "summary"]}')
print(f'First review summary has length = {len(df_train.loc[0, "summary"])}\n ')

print(f'First review overall rating = {df_train.loc[0, "overall"]}')

First review = One of my favorite perfumes and the fact that it is unisex is awesome. I'm gifting this for my nephew.
First review has length = 102
 
First review summary= One of my favorite perfumes and the fact that it is unisex is ...
First review summary has length = 65
 
First review overall rating = 4


Making the textual data ready for the RNN model is a bit more involved than for the previous models. We will use the Keras `TextVectorization` layer to do most of the work for us. It splits the text into words and creates a vocabulary with an index number for each word. We can then represent each text sample by the index numbers of its words. See the lecture notes for more details.
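To make the idea concrete, here is a minimal pure-Python sketch of what such a vectorizer does conceptually: lowercase the text, strip punctuation, keep the most frequent words, and map each word to an index. It is not the Keras implementation. The helper names (`tokenize`, `build_vocab`, `encode`) and the toy reviews are made up for illustration, and the regular expression only roughly matches the default Keras standardization. The convention that index 0 is padding and index 1 is the out-of-vocabulary token mirrors the integer output of `TextVectorization`.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase, drop punctuation, and split on whitespace."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def build_vocab(texts, max_tokens):
    """Keep the most frequent words; indices 0 and 1 are reserved."""
    counts = Counter(word for text in texts for word in tokenize(text))
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common  # 0 = padding, 1 = out-of-vocabulary


def encode(text, vocab, length):
    """Map words to indices, then truncate or pad with 0 to a fixed length."""
    index = {word: i for i, word in enumerate(vocab)}
    ids = [index.get(word, 1) for word in tokenize(text)][:length]
    return ids + [0] * (length - len(ids))


# Hypothetical toy reviews, not taken from the assignment data
toy_texts = ["I love it", "I love the smell", "Not for me"]
toy_vocab = build_vocab(toy_texts, max_tokens=6)
encoded = encode("I love the perfume!", toy_vocab, length=6)
print(toy_vocab)
print(encoded)
```

Words that did not make it into the vocabulary ("perfume" in this toy case) all collapse onto index 1, which is why a small `max_tokens` throws away information.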

In [3]:
import pandas as pd
import tensorflow as tf

max_tokens = 1000 # the maximum vocabulary size, based on word frequency. Indices 0 (padding) and 1 (out-of-vocabulary) are reserved, so only the most common `max_tokens-2` words are kept.
output_sequence_length = 100 # the fixed length of each output sequence. Longer reviews are truncated and shorter ones are padded with 0.
pad_to_max_tokens = True # only affects the "multi_hot", "count" and "tf_idf" output modes; in the default "int" mode used here, sequences are padded to `output_sequence_length` regardless.

# Shift the labels from 1-5 to 0-4, as expected by sparse_categorical_crossentropy
df_train['overall'] = df_train['overall'] - 1
# Ensure the text column is of string type and handle NaN values
df_train['reviewText'] = df_train['reviewText'].fillna('').astype(str)

# Initialize the TextVectorization layer
encoder = tf.keras.layers.TextVectorization(max_tokens=max_tokens, output_sequence_length=output_sequence_length, pad_to_max_tokens=pad_to_max_tokens)

# Create a dataset of only text data to adapt the encoder
text_ds = tf.data.Dataset.from_tensor_slices(df_train['reviewText']).batch(128)
encoder.adapt(text_ds)
vocab = np.array(encoder.get_vocabulary()) 

# Create the full train dataset with text and labels
train_ds = tf.data.Dataset.from_tensor_slices((df_train['reviewText'], df_train['overall'])).batch(128)
# Apply TextVectorization to the text data in the dataset
train_ds = train_ds.map(lambda x, y: (encoder(x), y))

# Configure the dataset for performance
AUTOTUNE = tf.data.AUTOTUNE
train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)


In [4]:
# Making the test data ready as we did for the training data
df_Xtest['reviewText'] = df_Xtest['reviewText'].fillna('').astype(str)

# Convert the texts to sequences using the already adapted TextVectorization layer
text_test_ds = tf.data.Dataset.from_tensor_slices(df_Xtest['reviewText']).batch(128)
test_ds = text_test_ds.map(lambda x: encoder(x))

# Configure the dataset for performance
AUTOTUNE = tf.data.AUTOTUNE
test_ds = test_ds.cache().prefetch(buffer_size=AUTOTUNE)

In [5]:
# Inspect the first few batches of the train_ds
for text_batch, label_batch in train_ds.take(1):  # Adjust .take() for more batches
    for i in range(5):  # Adjust the range to see more or fewer examples
        print("Review:", text_batch.numpy()[i])
        print("Label:", label_batch.numpy()[i])
        print("---")


Review: [ 47  11  10 277 822   5   2 506  13   6   9   1   9 635  69   1   8  12
  10   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0]
Label: 3
---
Review: [  3  82   1 111   2 232   1  96   5 352  21  48  21   2 248  96   5 352
   3  19  59   8 211 743  14 908  17   2 248 352 686   5   1  21  48  21
 116 402   3 177  13   6 827   2 149 104  17 220  21 201  21   2  73 216
 459  71   9   7 141  13 577  19  41   2  96  22 120  14   2 104 624 556
 938 622  13  22  49  19   7   1  39 149 104  12   4 131 285   7 112   8
 986   3 177  13   1   2  96  27  11   1]
Label: 4
---
Review: [  1  56   6 145  15   1   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   
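In the output above, 0 is padding and 1 stands for any word that is not in the vocabulary. A small sketch of how such index sequences can be turned back into words is shown here. The `decode` helper and the miniature vocabulary are hypothetical and only illustrate the lookup.

```python
# Hypothetical miniature vocabulary in TextVectorization order:
# index 0 is the padding token '' and index 1 is the out-of-vocabulary token '[UNK]'.
toy_vocab = ['', '[UNK]', 'the', 'i', 'it', 'love']


def decode(indices, vocabulary):
    """Map token indices back to words, dropping padding (index 0)."""
    return ' '.join(vocabulary[i] for i in indices if i != 0)


decoded = decode([3, 5, 4, 1, 0, 0], toy_vocab)
print(decoded)
```

In the notebook, the same helper could be called as `decode(text_batch.numpy()[0], vocab)` to see how much of a review survives with a 1000-word vocabulary.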

In [6]:
embedding_dimension = 128
embedding_model = tf.keras.models.Sequential([
    tf.keras.layers.Embedding(input_dim=len(vocab), 
                              output_dim=embedding_dimension,
                              input_length=output_sequence_length,
                              name="embedding"), 
    tf.keras.layers.LSTM(128),
    tf.keras.layers.Dense(5, activation='softmax')
])

In [7]:
embedding_model.compile(optimizer='adam',
                        loss='sparse_categorical_crossentropy',
                        metrics=['accuracy'])


embedding_model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 100, 128)          128000    
                                                                 
 lstm (LSTM)                 (None, 128)               131584    
                                                                 
 dense (Dense)               (None, 5)                 645       
                                                                 
Total params: 260,229
Trainable params: 260,229
Non-trainable params: 0
_________________________________________________________________
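The parameter counts in the summary above can be reproduced by hand, which is a useful sanity check when you change the layer sizes. The sketch below assumes the standard Keras LSTM layout (four gates, each with an input kernel, a recurrent kernel, and a bias); the variable names are chosen for illustration.

```python
vocab_size = 1000     # len(vocab), equal to max_tokens here
embedding_dim = 128   # output_dim of the Embedding layer
lstm_units = 128
num_classes = 5

# Embedding: one vector of size embedding_dim per vocabulary entry
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# Dense: weight matrix plus one bias per class
dense_params = lstm_units * num_classes + num_classes

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```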


In [8]:
embedding_model.fit(train_ds, epochs=1, verbose=1)



<keras.callbacks.History at 0x20320d85f00>

In [9]:
# Make predictions
predictions = embedding_model.predict(test_ds)
# The 'predictions' array will contain the probabilities of each class for each sample

# Convert probabilities to class labels
predicted_labels = np.argmax(predictions, axis=1)+1

# 'predicted_labels' now contains the class label (1 to 5) for each sample in your test dataset
print(predicted_labels)

[5 5 5 ... 5 5 4]
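The conversion from probabilities to ratings can be checked on a tiny hand-made array. The probabilities below are invented for illustration: `np.argmax` returns the 0-based index of the largest probability in each row, and adding 1 undoes the label shift applied during preprocessing.

```python
import numpy as np

# Hypothetical softmax outputs for three reviews (each row sums to 1)
toy_probs = np.array([
    [0.05, 0.05, 0.10, 0.20, 0.60],  # most mass on class index 4
    [0.10, 0.10, 0.10, 0.60, 0.10],  # most mass on class index 3
    [0.70, 0.10, 0.10, 0.05, 0.05],  # most mass on class index 0
])

# Class indices 0-4 become ratings 1-5
toy_labels = np.argmax(toy_probs, axis=1) + 1
print(toy_labels)
```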


In [10]:
y_test_hat_pd = pd.DataFrame({
    'Id': list(range(len(predicted_labels))),
    'Predicted': predicted_labels.reshape(-1),
})
y_test_hat_pd.to_csv('y_test_hat.csv', index=False)