# Final Project: Applied Mathematical Concepts For Deep Learning

# Food Reviews Sentiment Analysis



# **About The Project:**

In this project, we solve a binary classification problem for food reviews. We analysed the natural-language text written by customers, cleaned the data, trained models using multiple approaches, and ran predictions on sample test data.

**About Dataset:**

This dataset is taken from Kaggle. It consists of 500,000 Amazon Fine Foods reviews collected over 10 years, with a file size of 300 MB. We used the 'Score' column, which is the rating of the product out of 5, and the text reviews written by the customers.

We used this data to perform a binary classification to determine whether the review submitted by the customer is positive or negative.

We ran experiments with various types of neural networks and finally selected 2 approaches to demonstrate.

**Approach-1:** Using Bi-Directional LSTM to train the model from scratch.

**Approach-2:** Fine-Tuning DistilBERT model with customized data.

This notebook documents its code with comments that follow Google's documentation standards and uses an object-oriented programming style. Each cell contains one pipeline, and the final cell calls the functions of the different classes to run the training and prediction code.


Data Source: [Kaggle Link](https://www.kaggle.com/code/sonalisingh1411/nlp-part-1-amazon-fine-food-sentiment-analysis/input)

# Data Loading Pipeline

In [103]:
""" Load the google drive """
from google.colab import drive

""" Data manipulation libraries """
import pandas as pd
import numpy as np

""" NLTK Libraries and modules """
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

""" Other libraries """
import re
import pickle

""" Scikit-learn libraries """
from sklearn.model_selection import train_test_split

""" Tensorflow libraries """
import tensorflow as tf
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense

from tensorflow.keras.models import load_model

""" Import libraries for pre-trained model """
from transformers import DistilBertTokenizer, TFDistilBertForSequenceClassification

""" Download nltk """
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# Pre-Processing Pipeline
We built the pre-processing pipeline so that it can be used for training, testing and inference data. These are the main operations we performed on the text data.

**Data Cleaning:**
1.   Remove punctuation
2.   Convert words to lower case
3.   Remove stop words
4.   Stem the words
5.   Replace any remaining non-alphanumeric characters with spaces
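
The cleaning steps listed above can be sketched on a single review using only the standard library. This is a simplified illustration, not the notebook's implementation: `STOP_WORDS` is a hypothetical, tiny stop-word set standing in for NLTK's English list, `clean_review` is a name introduced here, and stemming is skipped because it needs NLTK's `PorterStemmer`.

```python
import re

# Hypothetical, tiny stop-word set for illustration only;
# the notebook uses NLTK's full English stop-word list.
STOP_WORDS = {"the", "is", "a", "and", "it", "this", "i"}


def clean_review(text: str) -> str:
    """Apply the cleaning steps in order (stemming is intentionally omitted)."""
    text = re.sub(r"[^\w\s]", "", text)  # 1. remove punctuation
    text = text.lower()                  # 2. convert to lower case
    words = [w for w in text.split() if w not in STOP_WORDS]  # 3. drop stop words
    # 4. stemming would go here (PorterStemmer in the notebook)
    text = " ".join(words)
    return re.sub(r"\W", " ", text)      # 5. replace leftover non-word characters


cleaned = clean_review("This coffee is GREAT, and I love it!")
print(cleaned)  # coffee great love
```

Because punctuation is removed before the stop-word filter runs, contractions such as "don't" become "dont" and no longer match a stop-word list that contains the apostrophe form.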

We took the 'Score' column, which is the rating out of 5, as the output and converted it into a binary label of 0 or 1.

**Label Encodings:**
1.   Review score > 3 = Positive Review [1]
2.   Review score <= 3 = Negative Review [0]
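
As a quick check of the rule above, the following sketch applies the same threshold to every possible score. The function name `encode_label` is introduced here for illustration; it mirrors the `score > 3` comparison used in the notebook.

```python
def encode_label(score: int) -> int:
    """Return 1 for a positive review (score > 3), otherwise 0."""
    return int(score > 3)


scores = [1, 2, 3, 4, 5]
labels = [encode_label(s) for s in scores]
print(labels)  # [0, 0, 0, 1, 1]
```

Note that neutral 3-star reviews end up in the negative class under this rule.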

**Split Data:**

We kept 80% of the data for training and 20% for testing.
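
The idea behind the split can be sketched without scikit-learn: shuffle the indices with a fixed seed, then cut them at the requested fraction. This is only an illustration under that assumption; `split_80_20` is a hypothetical helper, and its shuffling and rounding differ from what `train_test_split` does internally, so it will not reproduce the notebook's exact partition.

```python
import random


def split_80_20(texts, labels, test_size=0.2, seed=42):
    """Shuffle indices reproducibly and split texts and labels together."""
    indices = list(range(len(texts)))
    random.Random(seed).shuffle(indices)  # a fixed seed keeps the split repeatable
    n_test = int(round(len(texts) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(items, idx):
        return [items[i] for i in idx]

    return (pick(texts, train_idx), pick(texts, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


texts = [f"review {i}" for i in range(10)]
labels = [i % 2 for i in range(10)]
X_train, X_test, y_train, y_test = split_80_20(texts, labels)
print(len(X_train), len(X_test))  # 8 2
```

Splitting texts and labels through the same index list keeps every review paired with its own label.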

**Tokenization:**

We fitted the tokenizer on the words in the X_train data only, then converted the training and testing text into integer sequences. We saved the tokenizer so that it can be reused in the inference pipeline.
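
To make the tokenization step concrete, here is a simplified, pure-Python approximation of what a frequency-based word tokenizer does. The helper names `fit_word_index` and `texts_to_sequences` are introduced here for illustration; the sketch assumes whitespace-separated, already cleaned text and drops unseen words, which is how the Keras `Tokenizer` behaves when no out-of-vocabulary token is configured.

```python
from collections import Counter


def fit_word_index(train_texts):
    """Rank words by frequency in the training texts; index 1 is the most frequent."""
    counts = Counter(word for text in train_texts for word in text.split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(texts, word_index, num_words):
    """Map words to indices, dropping unseen words and indices >= num_words."""
    return [
        [word_index[w] for w in text.split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]


train_texts = ["good coffee", "good tea", "bad coffee"]
word_index = fit_word_index(train_texts)
sequences = texts_to_sequences(["good coffee", "bad wine"], word_index, num_words=3)
print(sequences)  # [[1, 2], []]
```

In the second text, "bad" is ranked below the vocabulary cut-off and "wine" never appeared in the training data, so both are dropped. Fitting on the training split alone keeps the test data from influencing the vocabulary.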

**Padding:**

Finally, to give every input the same size, we padded the tokenized sequences to the maximum sequence length found in the training data.
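
The effect of post-padding can be sketched in a few lines of plain Python. `pad_post` is a hypothetical helper written for illustration, not the Keras function; it appends zeros up to `max_len` and, for sequences that are too long, keeps the last `max_len` tokens, which to our understanding matches the default truncation side of `pad_sequences`.

```python
def pad_post(sequences, max_len, value=0):
    """Pad each sequence at the end with `value`; truncate long ones from the front."""
    padded = []
    for seq in sequences:
        seq = seq[-max_len:]  # keep the last max_len tokens if the sequence is too long
        padded.append(seq + [value] * (max_len - len(seq)))
    return padded


train_seq = [[5, 2], [7, 1, 3, 9]]
max_len = max(len(s) for s in train_seq)  # longest training sequence
print(pad_post(train_seq, max_len))             # [[5, 2, 0, 0], [7, 1, 3, 9]]
print(pad_post([[1, 2, 3, 4, 5, 6]], max_len))  # [[3, 4, 5, 6]]
```

The second call shows why the training maximum matters at inference time: a longer unseen review is cut down to the length the model was built for.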


In [104]:
class DataProcessor:
  """ Class to pre-process the data """
  def __init__(self):
      pass

  def data_cleaning(self, text):
    """ Clean the text data by removing punctuation, converting text to lower case, removing stopwords and applying stemming """
    text = text.str.replace(r'[^\w\s]', '', regex=True)  # regex=True is required for the pattern to be treated as a regular expression
    text = text.str.lower()
    stop_words = set(stopwords.words('english'))
    text = text.apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))
    stemmer = PorterStemmer()
    text = text.apply(lambda x: ' '.join([stemmer.stem(word) for word in x.split()]))
    text = [re.sub(r'\W', ' ', word) for word in text]
    return text

  def label_encodings(self, score):
    """ Convert the score [1-5] into binary data [1-positive, 0-negative] """
    return int(score > 3)

  def split_data(self, X, y, split_size):
    """ Split the data into training and testing data """
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=split_size, random_state=42)
    return X_train, X_test, y_train, y_test

  def tokenization(self, X_train, X_test, vocab_size):
    """ Tokenize the text data and convert the text into the sequence """
    tokenizer = Tokenizer(num_words=vocab_size)  # Adjust the vocabulary size as needed
    tokenizer.fit_on_texts(X_train)
    X_train_seq = tokenizer.texts_to_sequences(X_train)
    X_test_seq = tokenizer.texts_to_sequences(X_test)
    return X_train_seq, X_test_seq, tokenizer

  def save_tokenizer(self, tokenizer, tokenizer_file_path):
    """ Save the tokenizer file to use for prediction in inference """
    with open(tokenizer_file_path, 'wb') as tokenizer_file:
      pickle.dump(tokenizer, tokenizer_file)

  def padding_sequence(self, X_train_seq, X_test_seq, max_len):
    """ Pad the sequence to the maximum length of the text data """
    X_train_pad = pad_sequences(X_train_seq, maxlen=max_len, padding='post')
    X_test_pad = pad_sequences(X_test_seq, maxlen=max_len, padding='post')
    return X_train_pad, X_test_pad

# Model Training Pipeline

**Approach-1: Training model from scratch**

We used a bidirectional LSTM architecture to train the model. Considering the size of the dataset, it is important to remember the positive or negative words in a review from its beginning to its end. This was the main reason for selecting the LSTM architecture.
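
As a sanity check on the architecture, the parameter counts can be worked out by hand. The sketch below assumes the layer sizes used in the model-building cell of this notebook (an embedding of 10000 x 16, a bidirectional LSTM with 64 units per direction, and a single-unit dense output) and the standard LSTM formula with four gates. The helper names are introduced here for illustration.

```python
def embedding_params(vocab_size: int, embed_dim: int) -> int:
    """One embedding vector per vocabulary entry."""
    return vocab_size * embed_dim


def lstm_params(input_dim: int, units: int) -> int:
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(inputs: int, outputs: int) -> int:
    """Weight matrix plus one bias per output unit."""
    return inputs * outputs + outputs


emb = embedding_params(10000, 16)   # 160000
bilstm = 2 * lstm_params(16, 64)    # 41472, two directions
dense = dense_params(2 * 64, 1)     # 129, forward and backward outputs are concatenated
total = emb + bilstm + dense
print(emb, bilstm, dense, total)    # 160000 41472 129 201601
```

These values agree with the model summary printed in the training output further down in the notebook.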

The 'Adam' optimizer with 'binary_crossentropy' loss gave us the best results. We used 'sigmoid' as the output activation function because this is a binary classification problem.
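
To illustrate how the sigmoid output and the binary cross-entropy loss fit together, here is a minimal numeric sketch using only the standard library. The function names and the clipping constant `eps` are choices made for this example and are not taken from the Keras source.

```python
import math


def sigmoid(z: float) -> float:
    """Squash a raw score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


confident = binary_cross_entropy([1, 0], [0.9, 0.1])
unsure = binary_cross_entropy([1, 0], [0.5, 0.5])
print(sigmoid(0.0))        # 0.5
print(confident < unsure)  # True
```

The loss is small when the predicted probability is close to the true label and grows quickly for confident wrong predictions, which is why it pairs naturally with a single sigmoid output.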

In [105]:
class ModelBuilder:
  """ Class to build the model and train the model from start """

  def __init__(self):
    pass

  def build_model(self, max_len):
    """
    Build a bidirectional LSTM model.
    Using 'sigmoid' activation and 'binary_crossentropy' loss for binary classification
    """
    model = Sequential()
    model.add(Embedding(input_dim=10000, output_dim=16, input_length=max_len))
    model.add(Bidirectional(LSTM(64)))
    model.add(Dense(1, activation='sigmoid'))

    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

    return model

  def train_model(self, model, X_train, y_train, X_test, y_test, num_epochs):
    """ Train the model with training data and validate with the testing data. Save the model. """
    model.fit(X_train, y_train, epochs=num_epochs, validation_data=(X_test, y_test))
    loss, accuracy = model.evaluate(X_test, y_test)
    """ Print the model summary """
    print("\n")
    print(model.summary())
    model.save('/content/drive/MyDrive/Colab Notebooks/projects/food_sentiment/food_sentiment.keras')
    return loss, accuracy


# Fine-Tune Pre-Trained Model

**Approach-2: Using pre-trained model and fine-tune with customized data**

After training the sentiment analysis model from scratch, we used a pre-trained transformer model, DistilBERT. We adapted it to our food review dataset by freezing all of its layers except the last one and training additional dense layers on top of the model's output logits. After training, we save the model and tokenizer so that they can be used on demand with new, unseen food review data.

In [106]:
class FineTuningModel:
  """ Class to use pre-trained DistilBERT model and fine-tune the model """
  def __init__(self):
    pass

  def fine_tune_model_build(self, X_train, y_train, X_test, y_test):
    """ Tokenize the data, extract the DistilBERT logits and train custom dense layers on top of them """
    X_train_tensor = tf.constant(X_train)
    y_train_tensor = tf.constant(y_train)
    X_test_tensor = tf.constant(X_test)
    y_test_tensor = tf.constant(y_test)
    train_dataset = tf.data.Dataset.from_tensor_slices((X_train_tensor, y_train_tensor))
    eval_dataset = tf.data.Dataset.from_tensor_slices((X_test_tensor,y_test_tensor))

    """ Build the tokenizer and get the tokens for training and testing data """
    tokenizer_ft = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
    X_train_tokens = tokenizer_ft(X_train, truncation=True, padding=True, max_length=max_len, return_tensors='tf')
    X_test_tokens = tokenizer_ft(X_test, truncation=True, padding=True, max_length=max_len, return_tensors='tf')

    distilbert_model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)

    """ Freeze all layers except the last layer """
    for layer in distilbert_model.layers[:-1]:
        layer.trainable = False

    """ Extract the logits from the DistilBERT model output """
    logits = distilbert_model(X_train_tokens)['logits']

    """ Adding the customized trainable layers on the top of pre-trained model """
    model_ft = tf.keras.Sequential([
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(256, activation='relu'),
        tf.keras.layers.Dense(256, activation='relu'),
        tf.keras.layers.Dense(256, activation='relu'),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])
    optimizer_ft = tf.keras.optimizers.Adam(learning_rate=0.001)
    loss_ft = tf.keras.losses.BinaryCrossentropy()
    metrics_ft = ['accuracy']

    """ Compile the fine-tuned model """
    model_ft.compile(optimizer=optimizer_ft, loss=loss_ft, metrics=metrics_ft)

    """ Fit the model """
    model_ft.fit(
        logits,
        y_train,
        validation_data=(distilbert_model(X_test_tokens)['logits'], y_test),
        epochs=50,
        batch_size=8
    )

    """ Print the model summary """
    print("\n")
    print(model_ft.summary())

    """ Save the model """
    model_ft.save('/content/drive/MyDrive/Colab Notebooks/projects/food_sentiment/food_sentiment_finetuned.h5')

    """ Evaluate the model """
    loss_ft, accuracy_ft = model_ft.evaluate(distilbert_model(X_test_tokens)['logits'], y_test, batch_size=32)
    return tokenizer_ft, model_ft, loss_ft, accuracy_ft, distilbert_model

  def save_fine_tuned_tokenizer(self, tokenizer_ft, tokenizer_ft_file_path):
    """ Save the tokenizer from fine tuning of the model """
    with open(tokenizer_ft_file_path, 'wb') as tokenizer_file:
      pickle.dump(tokenizer_ft, tokenizer_file)

  def save_fine_tuned_model(self, model_ft, model_ft_file_path):
    """ Save the fine-tuned model """
    model_ft.save(model_ft_file_path)

# Inference Pipeline

After training both models, we test the results with a class that loads the saved tokenizers, performs the pre-processing required by each model, and returns the predictions for unseen food review data.
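
The final step of inference, turning the model's sigmoid output into a label and a message, can be sketched as follows. `to_sentiment` is a hypothetical helper that combines the thresholding and printing logic in one place; the 0.5 threshold matches the one used in this notebook.

```python
def to_sentiment(probability: float, threshold: float = 0.5):
    """Convert a sigmoid probability into a (label, message) pair."""
    label = int(probability > threshold)
    message = "Positive Review" if label == 1 else "Negative Review"
    return label, message


for p in (0.93, 0.5, 0.08):
    print(p, to_sentiment(p))
# 0.93 (1, 'Positive Review')
# 0.5 (0, 'Negative Review')
# 0.08 (0, 'Negative Review')
```

A probability of exactly 0.5 is treated as negative because the comparison is strict.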

In [107]:
class SentimentPredictor:
  """ Class to predict the sentiment with the trained model """
  def __init__(self):
    self.data_preprocessor = DataProcessor()
    self.model_builder = ModelBuilder()

  def load_tokenizer(self, tokenizer_file_path):
    """ Load saved tokenizer from the path """
    with open(tokenizer_file_path, 'rb') as tokenizer_file:
      loaded_tokenizer = pickle.load(tokenizer_file)
    return loaded_tokenizer

  def load_fine_tuned_tokenizer(self, tokenizer_ft_file_path):
    """ Load saved tokenizer for fine tuned model from the path """
    with open(tokenizer_ft_file_path, 'rb') as tokenizer_file:
      loaded_tokenizer_ft = pickle.load(tokenizer_file)
    return loaded_tokenizer_ft

  def preprocess_input(self, text, tokenizer, max_len):
    """ Pre-process the input, convert the text to sequence and add the padding """
    text_sequence = tokenizer.texts_to_sequences(text)
    text_padded = pad_sequences(text_sequence, maxlen=max_len, padding='post')
    return text_padded

  def load_trained_model(self, model_file_path):
    """ Load the saved model """
    model = load_model(model_file_path)
    return model

  def predict_sentiment(self, text, model):
    """ Predict and return the sentiment as 1 or 0 """
    prediction = model.predict(text)
    if prediction > 0.5:
      prediction = 1
    else:
      prediction = 0
    return prediction

  def preprocess_input_fine_tuned(self, text_ft, tokenizer_ft, max_len):
    """ Pre-process the input text to feed to the fine tuned model """
    text_tokens = tokenizer_ft(text_ft, truncation=True, padding=True, max_length=max_len, return_tensors='tf')
    return text_tokens

  def predict_sentiment_fine_tuned(self, text_ft, model_ft, distilbert_model):
    """ Get the prediction from the fine-tuned model """
    prediction = model_ft.predict(distilbert_model(text_ft)['logits'])[0, 0]
    if prediction > 0.5:
      prediction = 1
    else:
      prediction = 0
    return prediction

  def print_output(self, prediction):
    """ Print the sentiment output for the user """
    if prediction == 1:
      print("Positive Review")
    else:
      print("Negative Review")

# Main Function

This cell calls the functions of all the classes through their objects.
The code follows these steps:


1.   Read the data.
2.   Instantiate the object for DataProcessor.
3.   Use the data processing object to call the functions that pre-process the data.
4.   Instantiate the object for ModelBuilder (Approach-1).
5.   Use the model training object to call the functions that train the model from scratch.
6.   Instantiate the object for FineTuningModel (Approach-2).
7.   Use the fine-tuning object to call the function that prepares the data for the pre-trained model and then fine-tunes the model.
8.   Instantiate the object for SentimentPredictor (Inference pipeline).
9.   Use the predictor object to call the functions that prepare the input for the models, predict the output and print the output for both models.



In [108]:
if __name__ == "__main__":
  """ Mount the drive """
  drive.mount('/content/drive')

  """ Read the data """
  reviews = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/projects/food_sentiment/Reviews.csv').head(10000)

  """ Instantiate DataProcessor class to preprocess the data """
  data_processor = DataProcessor()

  """ Clean the text in the 'Text' column of DataFrame """
  texts = data_processor.data_cleaning(reviews['Text'])

  """ Encode the scores in binary labels """
  labels = reviews['Score'].apply(data_processor.label_encodings)

  """ Split the data into training and testing data """
  X_train, X_test, y_train, y_test = data_processor.split_data(texts, labels, split_size=0.2)

  """ Set the vocabulary size for tokenization """
  vocab_size = 1000

  """ Tokenize the training and testing data """
  X_train_seq, X_test_seq, tokenizer = data_processor.tokenization(X_train, X_test, vocab_size)

  """ Finding maximum sequence length """
  max_len = max(len(seq) for seq in X_train_seq)

  """ Pad the training and testing data sequence with maximum length """
  X_train_pad, X_test_pad = data_processor.padding_sequence(X_train_seq, X_test_seq, max_len)

  """ Save the tokenizer """
  data_processor.save_tokenizer(tokenizer, '/content/drive/MyDrive/Colab Notebooks/projects/food_sentiment/tokenizer.pkl')

  """ Instantiate ModelBuilder class to build and train the model """
  model_builder = ModelBuilder()

  """ Build the model with maximum sequence length """
  print("\n\nModel Training:\n\n")
  model = model_builder.build_model(max_len)

  """ Train the model and obtain loss and accuracy """
  loss, accuracy = model_builder.train_model(model, X_train_pad, y_train, X_test_pad, y_test, num_epochs=10)

  """ Instantiate FineTuningModel class for fine-tuning the model """
  fine_tuning_model = FineTuningModel()

  """ Build and fine tune the model. Reducing size to train the model with limited computational power. """
  print("\n\nFine-Tuned Model Training:\n\n")
  X_train = X_train[0:100]
  y_train = y_train[0:100]
  X_test = X_test[0:100]
  y_test = y_test[0:100]
  tokenizer_ft, model_ft, loss_ft, accuracy_ft, distilbert_model = fine_tuning_model.fine_tune_model_build(X_train, y_train, X_test, y_test)

  """ Save the fine-tuned tokenizer """
  fine_tuning_model.save_fine_tuned_tokenizer(tokenizer_ft, '/content/drive/MyDrive/Colab Notebooks/projects/food_sentiment/tokenizer_ft.pkl')

  """ Save the fine-tuned model """
  fine_tuning_model.save_fine_tuned_model(model_ft, '/content/drive/MyDrive/Colab Notebooks/projects/food_sentiment/food_sentiment_finetuned.keras')


  """ PREDICT THE SENTIMENT """

  """ Instantiate SentimentPredictor class to predict sentiment """
  sentiment_predictor = SentimentPredictor()

  """ Load the tokenizer """
  tokenizer = sentiment_predictor.load_tokenizer('/content/drive/MyDrive/Colab Notebooks/projects/food_sentiment/tokenizer.pkl')

  """ Define a sample review as input """
  print("\n\n")
  text = "Perfect size sea salt for the table or the picnic basket.  We love it. Shakes well, no clumping and flows freely."
  print("Sample Text For Model: ", text)

  """ Tokenize and pad the input text """
  text_padded = sentiment_predictor.preprocess_input([text], tokenizer, max_len)

  """ Load the trained model """
  model = sentiment_predictor.load_trained_model('/content/drive/MyDrive/Colab Notebooks/projects/food_sentiment/food_sentiment.keras')

  """ Predict the sentiment """
  prediction = sentiment_predictor.predict_sentiment(text_padded, model)

  """ Print the prediction """
  sentiment_predictor.print_output(prediction)

  """ Instantiate another SentimentPredictor class for fine-tuned model inference """
  fine_tuned_predictor = SentimentPredictor()

  """ Load the fine-tuned model tokenizer for inference pipeline """
  tokenizer_ft = fine_tuned_predictor.load_fine_tuned_tokenizer('/content/drive/MyDrive/Colab Notebooks/projects/food_sentiment/tokenizer_ft.pkl')

  """ Define a sample review to predict with the fine-tuned model """
  print("\n\n")
  text_ft = "Perfect size sea salt for the table or the picnic basket.  We love it. Shakes well, no clumping and flows freely."
  print("Sample Text For Fine-Tuned Model: ", text_ft)

  """ Pre-process the input for predicting with fine-tuned model """
  text_tokens = fine_tuned_predictor.preprocess_input_fine_tuned([text_ft], tokenizer_ft, max_len)

  """ Load the fine-tuned model for prediction """
  model_ft = fine_tuned_predictor.load_trained_model('/content/drive/MyDrive/Colab Notebooks/projects/food_sentiment/food_sentiment_finetuned.keras')

  """ Predict the sentiment using the fine-tuned model """
  prediction_ft = fine_tuned_predictor.predict_sentiment_fine_tuned(text_tokens, model_ft, distilbert_model)

  """ Print the output """
  fine_tuned_predictor.print_output(prediction_ft)

  """ MAKING MULTIPLE PREDICTIONS """
  print("\n\n")
  print("Sample Predictions: ")
  test_texts = [
    "Great food! I love the idea of one food for all ages & breeds. A real convenience as well as a really good product.",
    "The worst products I ever tried in my life. Very bad quality and bad service."
  ]
  for text in test_texts:
    print("\n",text,"\n")
    text_padded = sentiment_predictor.preprocess_input([text], tokenizer, max_len)
    model = sentiment_predictor.load_trained_model('/content/drive/MyDrive/Colab Notebooks/projects/food_sentiment/food_sentiment.keras')
    prediction = sentiment_predictor.predict_sentiment(text_padded, model)
    print("Model Prediction: ")
    sentiment_predictor.print_output(prediction)

    text_tokens = fine_tuned_predictor.preprocess_input_fine_tuned([text], tokenizer_ft, max_len)
    prediction_ft = fine_tuned_predictor.predict_sentiment_fine_tuned(text_tokens, model_ft, distilbert_model)
    print("Fine-Tuned Model Prediction: ")
    fine_tuned_predictor.print_output(prediction_ft)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).






Model Training:


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Model: "sequential_32"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_17 (Embedding)    (None, 469, 16)           160000    
                                                                 
 bidirectional_17 (Bidirect  (None, 128)               41472     
 ional)                                                          
                                                                 
 dense_51 (Dense)            (None, 1)                 129       
                                                                 
Total params: 201601 (787.50 KB)
Trainable params: 201601 (787.50 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
None


Fine-Tuned Model Training:




Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

Epoch 1/50
...
Epoch 50/50


Model: "sequential_33"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 flatten_15 (Flatten)        (None, 2)                 0         
                                                                 
 dense_52 (Dense)            (None, 256)               768       
                                                      






Sample Text For Model:  Perfect size sea salt for the table or the picnic basket.  We love it. Shakes well, no clumping and flows freely.
Positive Review



Sample Text For Fine-Tuned Model:  Perfect size sea salt for the table or the picnic basket.  We love it. Shakes well, no clumping and flows freely.
Positive Review



Sample Predictions: 

 Great food! I love the idea of one food for all ages & breeds. A real convenience as well as a really good product. 

Model Prediction: 
Positive Review
Fine-Tuned Model Prediction: 
Positive Review

 The worst products I ever tried in my life. Very bad quality and bad service. 

Model Prediction: 
Negative Review
Fine-Tuned Model Prediction: 
Negative Review


**Completion Note:**

Due to limited computational power and RAM, the fine-tuning was performed on a small subset of the data (100 training reviews). Hence, the prediction accuracy of the fine-tuned model is lower than that of the model trained from scratch.