<a href="https://colab.research.google.com/github/lorenzouttini/Exam-Deep-Learning/blob/main/Rating_Movie.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Prediction Rating Movie**

In this project I predict a movie's rating from the text of its review.

## PART 0: Introduction to the model
The architecture I chose for this project consists of an **RNN unit** (GRU or LSTM) followed by one (or more) **MLP layers**. <br>
As an introductory step, I import all the libraries needed to run the model.

In [None]:
# Standard Libraries
import pandas as pd
import numpy as np

In [None]:
# Libraries from Tensorflow and Keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout, GRU
from tensorflow.keras.optimizers import Adam, SGD


In [None]:
# Libraries useful for unzip of the file
import os
import requests
import zipfile

In [None]:
# Libraries from Sklearn
from sklearn.metrics import accuracy_score, f1_score, precision_recall_curve, roc_curve
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from imblearn.under_sampling import RandomUnderSampler

In [None]:
# SciKeras: a library that wraps Keras models in the scikit-learn estimator API
! pip install scikeras
from scikeras.wrappers import KerasClassifier     # wraps a Keras model as a scikit-learn classifier



### 0.0: Unzip and Read of the file
Since GitHub does not accept uploaded files larger than 25 MB, I **zip** the dataset and upload the archive to GitHub. <br>
Then I define a function that downloads the archive, unzips it and reads the extracted file as a CSV.

In [None]:
def unzip_and_read_csv(url, csv_filename):
    # Download the zip file and stop early if the request failed
    response = requests.get(url)
    response.raise_for_status()

    # Save the zip file
    zip_filename = 'temp.zip'
    with open(zip_filename, 'wb') as f:
        f.write(response.content)

    # Extract the zip file
    with zipfile.ZipFile(zip_filename, 'r') as zip_ref:
        zip_ref.extractall()

    # Read the CSV file into a pandas DataFrame
    csv_filepath = os.path.join(os.getcwd(), csv_filename)
    df = pd.read_csv(csv_filepath,encoding='latin-1')

    # Clean up temporary files
    os.remove(zip_filename)
    os.remove(csv_filepath)

    return df

In [None]:
# Define my GitHub link and the name of our dataset (csv)
url = "https://github.com/lorenzouttini/Exam-Deep-Learning/raw/main/parkReviews.zip"
csv_filename = 'parkReviews.csv'

In [None]:
# Apply the function and show the columns of the dataset
df = unzip_and_read_csv(url, csv_filename)
df.columns

Index(['Review_ID', 'Rating', 'Year_Month', 'Reviewer_Location', 'Review_Text',
       'Branch'],
      dtype='object')

## PART 1: Input representation and Preprocessing of the data
As I explained in the project, I decide to use only the *'Review_Text'* feature as input. <br>
Since it is a text feature, I have to preprocess it before passing it to the RNN unit. <br>
Firstly, I **reduce** the rows of the dataset. <br>
Secondly, I **tokenize** the text. <br>
Finally, I encode the tokens as numbers (the **one-hot encoding** step).

### 1.1: Reduce the rows of the dataset
Since the dataset has more than 42,000 rows and we have limited RAM, I decide to keep only a random 8% sample of the rows (dropping about 90% of the dataset) and, within that sample, only the reviews that contain between 20 and 200 words (both limits included).
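The length filter can be illustrated without pandas. The sketch below uses a hypothetical helper, `keep_review`, that counts whitespace-separated words in the same way as `str.split()` and keeps a review only when the count falls in the inclusive range 20 to 200; the toy reviews are invented for illustration.

```python
def keep_review(text, min_words=20, max_words=200):
    """Return True when the whitespace-separated word count is in [min_words, max_words]."""
    n_words = len(text.split())
    return min_words <= n_words <= max_words

# Toy reviews (invented): 4 words, exactly 20 words, 201 words
short_review = "Great park, loved it."
medium_review = " ".join(["word"] * 20)
long_review = " ".join(["word"] * 201)

kept = [keep_review(r) for r in (short_review, medium_review, long_review)]
print(kept)  # [False, True, False]
```

Both limits are inclusive, which matches the behaviour of `Series.between(20, 200)` with its default settings.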

In [None]:
# Keep a random 8% sample of the rows (drops about 90% of the dataset)
df = df.sample(frac=0.08, random_state=42)

In [None]:
# Remove reviews with more than 200 words or less than 20 words
df = df[df['Review_Text'].str.split().apply(len).between(20, 200)]
df.shape

(2749, 6)

### 1.2: Adjust the balance of the classes (added, not in the project)
For a classifier it helps when the classes of the target appear in roughly equal proportions. <br>
Since the classes in this dataset are strongly imbalanced, I use *"RandomUnderSampler"* from imbalanced-learn to reduce the number of samples of the most frequent classes and obtain a balanced dataset. <br>
In this way the performance metrics are **not dominated** by the most frequent classes.
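As a rough illustration of the idea behind random undersampling, the sketch below reduces every class to the size of the smallest one by sampling without replacement. It is a simplified stand-in written with the standard library only, not the imbalanced-learn implementation; `random_undersample` and the toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict

def random_undersample(texts, labels, seed=42):
    """Sample every class down to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for text, label in zip(texts, labels):
        by_class[label].append(text)

    # The smallest class fixes the number of samples kept per class
    n_min = min(len(items) for items in by_class.values())

    out_texts, out_labels = [], []
    for label in sorted(by_class):
        for text in rng.sample(by_class[label], n_min):  # without replacement
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels

# Toy imbalanced data (invented): six 5-star, three 4-star, two 1-star reviews
toy_labels = [5] * 6 + [4] * 3 + [1] * 2
toy_texts = [f"review {i}" for i in range(len(toy_labels))]

bal_texts, bal_labels = random_undersample(toy_texts, toy_labels)
print(Counter(bal_labels))  # every class is left with 2 reviews
```

The price of this simplicity is that most samples of the frequent classes are thrown away, which is why the balanced dataset in this notebook is so much smaller than the sample it starts from.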

In [None]:
# Count the number of reviews for each rating
target_counts = df['Rating'].value_counts()

# Compute the percentage of each rating
target_percentage = (target_counts / len(df)) * 100

print(target_percentage)

5    56.929793
4    25.245544
3    10.985813
2     3.928701
1     2.910149
Name: Rating, dtype: float64


In [None]:
# Split the dataset into X (review text) and y (star rating)
X = df['Review_Text']
y = df['Rating']

In [None]:
# Create an instance of RandomUnderSampler
undersampler = RandomUnderSampler(random_state=42)

original_shape = X.shape

# Reshape in a numpy array
X = np.array(X).reshape(-1, 1)

# Perform undersampling
X, y = undersampler.fit_resample(X, y)

# Create a new DataFrame with the undersampled data
# (the text goes in 'Review_Text' so that the 'Rating' assignment does not overwrite it)
df = pd.DataFrame(X, columns=['Review_Text'])
df['Rating'] = y

# Shuffle the undersampled data
df = df.sample(frac=1, random_state=42)

In [None]:
# Print the new dimension of X after UnderSampling
new_shape = X.shape
X.shape[0]

400

In [None]:
# Return X to a one-dimensional shape
X = X.reshape(new_shape[0])

# Convert the undersampled X to a pandas Series
X = pd.Series(X.squeeze())

Now we can see that our dataset is **balanced**.

In [None]:
target_counts = df['Rating'].value_counts()

# New percentage
target_percentage = (target_counts / len(df)) * 100

print(target_percentage)

3    20.0
4    20.0
1    20.0
2    20.0
5    20.0
Name: Rating, dtype: float64


In [None]:
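# Shift the ratings down by one so that the classes run from 0 to 4.
# y is shifted as well, because y (not df) is what train_test_split receives later.
df['Rating'] = df['Rating'] - 1
y = y - 1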

In [None]:
len(df)

400

### 1.3: Tokenize
**Tokenization** of the text is another fundamental part of the preprocessing. <br>
As I explained in the project, tokenization consists of converting all the words to lower case, removing punctuation and splitting the text into a list of words on whitespace. It also defines the vocabulary. <br>
For this task I use the ready-made `Tokenizer` from Keras, which is fitted on the reviews themselves. <br>
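The steps described above can be sketched in plain Python. This is a simplified illustration, not the Keras implementation: the helper names (`tokenize`, `build_word_index`) are hypothetical, all of `string.punctuation` is stripped, and ties between equally frequent words are broken alphabetically, which is an assumption made only to keep the toy output deterministic.

```python
import string
from collections import Counter

def tokenize(text):
    """Lower-case the text, replace punctuation with spaces, split on whitespace."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.lower().translate(table).split()

def build_word_index(texts):
    """Map each word to an integer; more frequent words get smaller indices.

    Index 0 is left unused so that it can serve as the padding value.
    """
    counts = Counter(tok for text in texts for tok in tokenize(text))
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i for i, word in enumerate(ordered, start=1)}

# Toy reviews (invented for illustration)
toy_reviews = ["The park was great!", "Great rides, great food."]
toy_word_index = build_word_index(toy_reviews)
toy_sequences = [[toy_word_index[t] for t in tokenize(r)] for r in toy_reviews]

print(toy_word_index["great"])  # 1, the most frequent word
print(toy_sequences)            # [[5, 3, 6, 1], [1, 4, 1, 2]]
```

The vocabulary size used later is the number of distinct words plus one, because index 0 is reserved.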

In [None]:
# Create a vocabulary dictionary
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X)
vocab_size = len(tokenizer.word_index) + 1
word_index = tokenizer.word_index

### 1.4: One-Hot Encoding
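Since I have reduced the dataset to a small size, I am able to build the input representation directly with the Keras utilities. <br>
Here too I use the Keras `Tokenizer`: `texts_to_sequences` replaces every word with its integer index in the vocabulary. Strictly speaking this is an integer (index) encoding rather than a full one-hot matrix; a one-hot vector would have a single 1 at that index. The resulting array has **2 dimensions**:
- First: size of the dataset.
- Second: length of the sequence. <br>

Since the RNN unit accepts only inputs of the same shape, I have to pad all the sequences to a common length (padding). In this case I choose 150, which becomes the second dimension of the encoded array. With `padding='post'`, shorter reviews are filled with zeros at the end, while longer reviews are truncated to 150 tokens.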
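A minimal sketch of what padding to a common length does, assuming the settings used in the next cell (`padding='post'`) together with the Keras default of truncating from the front. The helper `pad_post` is hypothetical and works on plain Python lists.

```python
def pad_post(sequence, maxlen, value=0):
    """Pad with `value` at the end; if too long, keep only the last `maxlen` items."""
    truncated = sequence[-maxlen:]  # drops items from the front when too long
    return truncated + [value] * (maxlen - len(truncated))

print(pad_post([4, 9, 2], maxlen=5))              # [4, 9, 2, 0, 0]
print(pad_post([1, 2, 3, 4, 5, 6, 7], maxlen=5))  # [3, 4, 5, 6, 7]
```

Because reviews of up to 200 words are kept but the length is fixed at 150, the longest reviews lose their first words under this setting.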

In [None]:
# Convert each review to a sequence of word indices, then pad or truncate it to 150 tokens
sequences = tokenizer.texts_to_sequences(X)
padded_sequences = pad_sequences(sequences, maxlen=150, padding='post')
padded_sequences.shape

(400, 150)
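The array above holds one integer word index per position. For comparison, a full one-hot representation would expand every index into a vector as long as the vocabulary, which turns the shape into (400, 150, vocab_size). The sketch below shows this on a toy sequence; `one_hot` and the toy vocabulary size are hypothetical.

```python
def one_hot(index, vocab_size):
    """Return a vector of zeros with a single 1 at position `index`."""
    vector = [0] * vocab_size
    vector[index] = 1
    return vector

toy_vocab_size = 6        # 5 words plus index 0 reserved for padding
toy_padded = [3, 1, 0, 0]  # one padded sequence of word indices

toy_encoded = [one_hot(i, toy_vocab_size) for i in toy_padded]
print(len(toy_encoded), len(toy_encoded[0]))  # 4 6
print(toy_encoded[0])                          # [0, 0, 0, 1, 0, 0]
```

The memory cost grows with the vocabulary size, which is one reason why this representation is only practical for small datasets.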

### 1.5: Train and Test
In the last part of the input preprocessing, I split X and y into a training set and a test set.
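To make the sizes concrete: with 400 samples and a test fraction of 0.2, the split leaves 320 samples for training and 80 for testing. The sketch below shuffles indices with the standard library; the helper `split_indices` is hypothetical and does not reproduce the exact shuffle of scikit-learn.

```python
import random

def split_indices(n_samples, test_size=0.2, seed=42):
    """Shuffle the indices and reserve a fraction of them for the test set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(400)
print(len(train_idx), len(test_idx))  # 320 80
```

With only 80 test samples, each single prediction changes the test accuracy by 1.25 percentage points.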

In [None]:
# Split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(padded_sequences, y, test_size=0.2, random_state=42)

In [None]:
# Reshape the input data to 3D
x_train = x_train.reshape(x_train.shape[0], x_train.shape[1], 1)
x_test = x_test.reshape(x_test.shape[0], x_test.shape[1], 1)

## PART 2: Output Layer

As I have described in the project, the output is a **vector of probabilities**, one for each class.

The output layer is a Dense layer with **5 units** (one per class). <br>


Since it is a multiclass task, the activation function I use for this layer is the **softmax**. <br>
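As a small illustration of how the softmax turns the 5 raw outputs of the last layer into a probability vector, here is a plain Python sketch. The input values are invented, and subtracting the maximum is a standard trick for numerical stability that does not change the result.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Invented raw scores for the 5 rating classes (0 to 4)
probs = softmax([0.5, 1.0, -0.3, 2.0, 0.1])
predicted_class = probs.index(max(probs))

print(round(sum(probs), 6))  # 1.0
print(predicted_class)       # 3
```

The predicted rating is the class with the highest probability, here class 3 (a 4-star review once the shift by one is undone).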

## PART 3: Activation and Loss


### 3.1 **ACTIVATIONS:**
The activation functions of the RNN units (LSTM and GRU) are already built into these layers. <br>
For the hidden layer, the activation function I have chosen is **ReLU** --> `'relu'`. <br>
However, since the vectors contain lots of zeros (and with ReLU many gradients will be zero), I decide to try the **Sigmoid** as well --> `'sigmoid'`.<br> <br>

### 3.2 **LOSS FUNCTIONS:**
Since the target feature ('Rating') consists of **integer values**, the loss function for this task is the Sparse Categorical Cross Entropy (`sparse_categorical_crossentropy`).
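For a single sample, this loss is the negative logarithm of the probability that the model assigns to the true class, with the label given as an integer index rather than a one-hot vector. The sketch below is a simplified version for one sample (no batching, no clipping of the probabilities); the example values are invented.

```python
import math

def sparse_categorical_crossentropy(probs, label):
    """Negative log-probability of the true class, given as an integer index."""
    return -math.log(probs[label])

# A model that guesses uniformly over the 5 classes
uniform = [0.2] * 5
chance_loss = sparse_categorical_crossentropy(uniform, 3)

# A model that is confident and correct
confident = [0.01, 0.01, 0.01, 0.96, 0.01]
good_loss = sparse_categorical_crossentropy(confident, 3)

print(round(chance_loss, 3))  # 1.609
print(round(good_loss, 3))    # 0.041
```

A loss that stays close to log(5) ≈ 1.609 during training is a sign that the model is doing no better than guessing among the 5 classes.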


## PART 4: Initializers, Regularizers, Normalizers

### 4.1 **INITIALIZERS:**
For the RNN layer and the output layer I use the **Glorot (Normal)** weight initialization. This initialization works well for layers with sigmoid, tanh or softmax activations. --> `'glorot_normal'`<br>
For the hidden layer, instead, I use the **He (Normal)** weight initialization, which works better for layers with ReLU activation. --> `'he_normal'`<br> <br>
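The difference between the two initializers lies in the standard deviation of the distribution from which the weights are drawn: Glorot uses sqrt(2 / (fan_in + fan_out)) and He uses sqrt(2 / fan_in). The sketch below only computes these values; the layer sizes (64 inputs to 64 hidden units, 64 inputs to 5 outputs) are taken from the sample model defined later, and the helper names are hypothetical.

```python
import math

def glorot_normal_std(fan_in, fan_out):
    """Standard deviation used by the Glorot (Xavier) normal initializer."""
    return math.sqrt(2.0 / (fan_in + fan_out))

def he_normal_std(fan_in):
    """Standard deviation used by the He normal initializer."""
    return math.sqrt(2.0 / fan_in)

hidden_std = he_normal_std(64)         # hidden Dense layer fed by 64 RNN units
output_std = glorot_normal_std(64, 5)  # output Dense layer: 64 inputs, 5 classes

print(round(hidden_std, 4))  # 0.1768
print(round(output_std, 4))  # 0.1703
```

He initialization ignores the number of outputs and uses a larger variance for the same fan-in, which compensates for ReLU setting roughly half of the activations to zero.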


### 4.2 **REGULARIZERS**
As a regularizer for this model I have chosen **Dropout**. In code it is expressed as an additional layer placed between the other layers --> `Dropout(dropout_rate)`, where the dropout rate is the fraction of neurons that are temporarily removed at each training step. <br><br>
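The effect of a dropout layer can be sketched in a few lines. During training each activation is set to zero with probability `rate` and the surviving ones are scaled by 1 / (1 - rate) so that the expected sum stays the same; at inference time the layer does nothing. The function below is a simplified illustration with the standard library, not the Keras layer.

```python
import random

def dropout(activations, rate, training=True, seed=None):
    """Zero each activation with probability `rate`; rescale the survivors."""
    if not training or rate == 0.0:
        return list(activations)  # inference: values pass through unchanged
    rng = random.Random(seed)
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

acts = [1.0] * 10
dropped = dropout(acts, rate=0.2, seed=0)

print(dropped)                                # each value is 0.0 or about 1.25
print(dropout(acts, rate=0.2, training=False))  # unchanged at inference
```

With a rate of 0.2, about one activation in five is removed on average at each training step.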


### 4.3 **NORMALIZERS**
In the project I mentioned the **Batch Normalization layer** as a possible normalizer. Since Batch Normalization is mostly useful for large network architectures, I decide not to use it here (our model consists of only 3 layers).

All the elements I described in parts 2 (*output*), 3 (*activation, loss*) and 4 (*initializers, regularizers, normalizers*) come together in the following code:
- I define a function `create_model` that builds the whole Sequential model (the 3 layers with their hyperparameters) and compiles it with the loss function, the optimizer and the metric.
- Then, I define a **sample model** (`model`) with the `KerasClassifier` wrapper from SciKeras and fit it on the training data (`x_train`, `y_train`).
- Finally, I show the accuracy of this sample model on the training set.

In [None]:
#Define the model
def create_model(RNN_type,
                 nRNN,
                 nhid,
                 learning_rate= 0.001,
                 hid_act='relu',
                 out_act='softmax',
                 dropout_rate=0.2,
                 optimizer=SGD,
                 epochs=10,
                 batch_size=32):

  model = Sequential()
  model.add(RNN_type(nRNN,
                 input_shape=(150,1),
                 kernel_initializer='glorot_normal'))
  model.add(Dense(nhid,
                  activation=hid_act,
                  kernel_initializer='he_normal'))
  model.add(Dropout(dropout_rate))
  model.add(Dense(5,
                  activation=out_act,
                  kernel_initializer='glorot_normal'))
  model.compile(loss='sparse_categorical_crossentropy',
                optimizer=optimizer(learning_rate = learning_rate),
                metrics=['accuracy'])
  return model

# Define the sample model
model = KerasClassifier(model=create_model,
                        RNN_type = LSTM,
                        nRNN = 64,
                        nhid = 64,
                        epochs=10)



In [None]:
# Fit the model on the train
model.fit(x_train, y_train, validation_split=0.2)
pred = model.predict(x_train)

# Show the Training accuracy
print(f"Training accuracy: {accuracy_score(y_train, pred)}")

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Training accuracy: 0.203125


## PART 5: Hyperparameters
There are a lot of hyperparameters that I could tune in this model. Since we have limited RAM, I decide to tune only some of them. I choose:
- **RNN_type**: the type of RNN architecture to use (LSTM or GRU).
- **learning_rate**: the learning rate of the optimizer. I choose 0.001 and 0.01 as possible values.
- **hid_act**: the activation of the hidden Dense layer (ReLU or Sigmoid).
- **nRNN**: number of units of the RNN layer (32 or 64).
- **nhid**: number of units of the hidden layer (32 or 64).
- **optimizer**: the type of optimizer (SGD or Adam).
- **dropout_rate**: the fraction of neurons temporarily removed (0.2 or 0.3).
- **batch_size**: how many inputs are processed at each step (32 or 64).

Finally, I apply a **GridSearch** to find the combination of hyperparameters that gives the best performance.
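To get a feeling for the cost of this search, the sketch below enumerates the grid with `itertools.product`. The values mirror the ones used in this notebook, with strings standing in for the Keras classes; the number of folds (3) is the `cv` value passed to the grid search.

```python
from itertools import product

# Stand-in grid: strings replace the Keras layer and optimizer classes
grid = {
    "RNN_type": ["LSTM", "GRU"],
    "learning_rate": [0.01, 0.001],
    "hid_act": ["relu", "sigmoid"],
    "nRNN": [32, 64],
    "nhid": [32, 64],
    "optimizer": ["SGD", "Adam"],
    "dropout_rate": [0.2, 0.3],
    "batch_size": [32, 64],
}

combinations = list(product(*grid.values()))
n_combinations = len(combinations)

cv_folds = 3
n_fits = n_combinations * cv_folds

print(n_combinations)  # 256
print(n_fits)          # 768
```

Eight hyperparameters with two values each give 2^8 = 256 configurations, so the search trains 768 models of 10 epochs each, plus one final refit of the best configuration.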

In [None]:
# Define the candidate values of the hyperparameters
RNN_type = [LSTM, GRU]
learning_rate = [0.01, 0.001]
hid_act = ['relu', 'sigmoid']
nRNN = [32, 64]
nhid = [32, 64]
optimizer = [SGD, Adam]
dropout_rate = [0.2,0.3]
batch_size = [32, 64]

In [None]:
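# Define the parameter grid.
# Note: the `model__` prefix routes a value to create_model, which accepts
# batch_size but does not use it; the batch size used for training is the
# unprefixed `batch_size` parameter of KerasClassifier.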
param_grid = dict(model__dropout_rate = dropout_rate,
                  model__learning_rate = learning_rate,
                  model__hid_act = hid_act,
                  model__batch_size = batch_size,
                  model__optimizer = optimizer,
                  model__RNN_type = RNN_type,
                  model__nhid= nhid,
                  model__nRNN= nRNN)

In [None]:
# Redefine the model without fixed hyperparameters
model = KerasClassifier(model=create_model,
                        epochs = 10)

In [None]:
# Set the Grid Search
GS = GridSearchCV(estimator=model,
                  param_grid=param_grid,
                  n_jobs=-1,
                  scoring='accuracy',
                  refit=True,
                  cv=3,
                  verbose = 0)

## PART 6: Evaluation

In the evaluation part I fit the **GridSearch** and check which model obtained the best performance. <br>
According to the output below, the best model uses the following hyperparameters:
- RNN_type = **LSTM**
- batch_size = **64**
- dropout_rate = **0.3**
- hid_act = **relu**
- learning_rate = **0.001**
- nRNN = **64**
- nhid = **64**
- optimizer = **Adam**

This model obtained a mean cross-validated accuracy of about 26% on the training set. <br><br>

Finally, I use this model to predict on the test set and obtain an accuracy of 22.5% (worse than on the training set).

In [None]:
sol_model = GS.fit(x_train, y_train)
print(f'\tModel: Best score got by the best estimator: {sol_model.best_score_}')    # mean cross-validated accuracy on the training set
print(f'\tModel:Configuration for the best estimator/classifier: {sol_model.best_params_}')



Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
	Model: Best score got by the best estimator: 0.26250514312584494
	Model:Configuration for the best estimator/classifier: {'model__RNN_type': <class 'keras.layers.rnn.lstm.LSTM'>, 'model__batch_size': 64, 'model__dropout_rate': 0.3, 'model__hid_act': 'relu', 'model__learning_rate': 0.001, 'model__nRNN': 64, 'model__nhid': 64, 'model__optimizer': <class 'keras.optimizers.adam.Adam'>}


In [None]:
# Prediction on the test with the best estimator
best_model = sol_model.best_estimator_
pred_model = best_model.predict(x_test)
model_acc = accuracy_score(y_test, pred_model)
model_acc      # accuracy on the test



0.225

In [None]:
# model_acc is a single value, so the mean equals it and the standard deviation is 0
print(f"Model: Mean Accuracy test:{np.mean(model_acc)}")
print(f"Model: Standard Deviation test:{np.std(model_acc)}")

Model: Mean Accuracy test:0.225
Model: Standard Deviation test:0.0


## FINAL CONSIDERATIONS
The performance of this model is not acceptable. There are, however, several reasons that explain the results I obtained:
- The dataset I use for the prediction consists of only **400 rows**, a very small number compared to the original dataset, which contains 42,000 rows (about 100 times more). I used such a small dataset because of the RAM limitations.
- The original dataset is very **imbalanced**. Most of the ratings are equal to 5, and this would influence the metrics too much (without balancing the dataset I would obtain 65% accuracy on the test set). For this reason I decide to undersample the dataset, but I use a technique that is not very effective (`RandomUnderSampler`) and I obtain these results.
- Nowadays a lot of pretrained **embeddings** exist. Moreover, there are a lot of NLP techniques that generate very effective embeddings for texts and words. I chose one-hot encoding because it is the only one that we have seen during the lectures.
- Lastly, the **limitations** on RAM and computation units do not allow me to build larger architectures.