**DESCRIPTION**

**Problem Statement**

Amazon is an online shopping website that now caters to millions of people everywhere. Over 34,000 consumer reviews for Amazon brand products like Kindle, Fire TV Stick and more are provided.

The dataset has attributes like brand, categories, primary categories, reviews.title, reviews.text, and the sentiment. Sentiment is a categorical variable with three levels: "Positive", "Negative", and "Neutral". The task is to predict the sentiment of unseen data.

You are required to predict Sentiment or Satisfaction of a purchase based on multiple features and review text.

# Setup

In [1]:
#Import the necessary library
import matplotlib.pyplot as plt
from itertools import cycle

import pandas as pd
import numpy as np
import re
# import required libraries
import os

In [2]:
from sklearn import metrics, svm
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

from scikeras.wrappers import KerasClassifier

In [3]:
from scipy import stats as st

In [4]:
import string
from string import punctuation
import nltk
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('omw-1.4')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk import sent_tokenize, word_tokenize


In [5]:
import warnings
warnings.filterwarnings("ignore")

In [6]:
from imblearn.over_sampling import SMOTE, SMOTEN

In [7]:
from livelossplot import PlotLossesKerasTF

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import Input, Model

# Data acquisition

In [None]:

pd_df_train_data = pd.read_csv("/content/train_data.csv")
pd_df_test_data = pd.read_csv("/content/test_data.csv")
pd_df_test_data_hidden = pd.read_csv("/content/test_data_hidden.csv")

In [None]:
pd_df_test_data.head()

In [None]:
pd_df_test_data_hidden.head()

In [None]:
pd_df_train_data.head()

### Functions

In [None]:
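def print_performance(labels, predictions):
  # Binary-classification helper: precision_score and recall_score use
  # average='binary' by default, so labels must have exactly two classes.
  print("Accuracy = {}".format(accuracy_score(labels, predictions)))
  print("Precision = {}".format(precision_score(labels, predictions)))
  print("Recall = {}".format(recall_score(labels, predictions)))
  # scikit-learn lays out the binary confusion matrix as [[tn, fp], [fn, tp]]
  tn, fp, fn, tp = confusion_matrix(labels, predictions).ravel()
  print("Confusion matrix: tp {}, fp {}, fn {}, tn {}".format(tp, fp, fn, tn))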



In [None]:
def pd_df_multi_class_confusion_matrix(pd_s_target, pd_s_predict):
  class_sample_ = pd_s_target.unique()
  cm = confusion_matrix(pd_s_target, pd_s_predict, labels=class_sample_)
  multi_columns = zip(['Predicted label']*(len(class_sample_)), class_sample_)
  multi_index = zip(['Actual label']*(len(class_sample_)), class_sample_)
  multi_columns = pd.MultiIndex.from_tuples(list(multi_columns))
  multi_index = pd.MultiIndex.from_tuples(list(multi_index))
  return pd.DataFrame(cm, columns=multi_columns, index=multi_index)

In [None]:
# Compute ROC curve and ROC area for each class
def roc_curve_multiclass(pd_s_target, pd_s_predict):
  fpr = dict()
  tpr = dict()
  roc_auc = dict()
  class_samples_ = pd_s_target.unique()
  # Binarize the output
  np_target = label_binarize(pd_s_target, classes=class_samples_)
  np_predict = label_binarize(pd_s_predict, classes=class_samples_)

  for sample, unique in zip(class_samples_, range(len(class_samples_))):
    fpr[sample], tpr[sample], _ = roc_curve(np_target[:, unique], np_predict[:, unique])
    roc_auc[sample] = auc(fpr[sample], tpr[sample])
  return fpr, tpr, roc_auc

In [None]:
def plot_auc_roc_multiclass(fpr, tpr, roc_auc, class_samples_):
  plt.figure()
  lw = 2  # line width shared by all curves
  for i in class_samples_:
      plt.plot(
          fpr[i],
          tpr[i],
          lw=lw,
          label="ROC curve of class {0} (area = {1:0.2f})".format(i, roc_auc[i]),
      )

  plt.plot([0, 1], [0, 1], color="navy", lw=lw, linestyle="--")
  plt.xlim([0.0, 1.0])
  plt.ylim([0.0, 1.05])
  plt.xlabel("False Positive Rate")
  plt.ylabel("True Positive Rate")
  plt.title("Receiver operating characteristic example")
  plt.legend(loc="lower right")
  plt.show()

In [None]:
class MyTextProcess():
  def __init__(self):
    # character class matching any single punctuation mark
    re_exp_punctuation = '[{}]'.format(re.escape(punctuation))
    self.reObjPunct = re.compile(re_exp_punctuation)
    # runs of two or more whitespace characters (no space allowed inside the braces)
    self.reObjWhiteSpace = re.compile(r'\s{2,}')

    self.wnl = WordNetLemmatizer()
    # load the stop word list once instead of once per token
    self.stop_words = set(stopwords.words('english'))

  def txt_vectorization(self, sequence):
    # lowercase, strip punctuation, collapse whitespace, tokenize, lemmatize
    sequence = sequence.lower()
    sequence = self.reObjPunct.sub(' ', sequence)
    sequence = self.reObjWhiteSpace.sub(' ', sequence)
    wordslist = nltk.word_tokenize(sequence)
    wordslist = [self.wnl.lemmatize(word) for word in wordslist if word not in self.stop_words]
    return wordslist
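
Two regular-expression details in the text-cleaning class are easy to get wrong, so here is a small standalone check that uses only the standard library. The names `PUNCT_RE`, `SPACE_RE` and `normalize` are illustrative and not part of the notebook; tokenization, stop word removal and lemmatization are left out on purpose.

```python
import re
from string import punctuation

# Character class matching any single ASCII punctuation mark.
PUNCT_RE = re.compile("[{}]".format(re.escape(punctuation)))
# Two or more whitespace characters. A space inside the braces, as in
# r"\s{2, 10}", makes Python treat the braces as literal text.
SPACE_RE = re.compile(r"\s{2,}")

def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace runs."""
    text = text.lower()
    text = PUNCT_RE.sub(" ", text)
    text = SPACE_RE.sub(" ", text)
    return text.strip()

cleaned = normalize("Great tablet!!!   Battery-life: 10/10.")
print(cleaned)
```

Building the character class with `re.escape` avoids hand-escaping each punctuation mark, and the open-ended quantifier `{2,}` collapses whitespace runs of any length in one pass.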


## **Project Task: Week 2**

**Model Selection:**

1.   Apply multi-class SVMs and neural nets.

2.   Use possible ensemble techniques like: XGBoost + oversampled_multinomial_NB.

3.   Assign a score to the sentence sentiment (engineer a feature called sentiment score). Use this engineered feature in the model and check for improvements. Draw insights on the same.

**Applying LSTM:**

4.   Use LSTM for the previous problem (use parameters of LSTM like top-word, embedding-length, Dropout, epochs, number of layers, etc.)

**Hint:** Another variation of LSTM, GRU (Gated Recurrent Units) can be tried as well.

5.   Compare the accuracy of neural nets with traditional ML based algorithms.

6.   Find the best setting of LSTM (Neural Net) and GRU that can best classify the reviews as positive, negative, and neutral.

**Hint:** Use techniques like Grid Search, Cross-Validation and Random Search

**Topic Modeling:**

7.   Cluster similar reviews.

**Note:** Some reviews may talk about the device as a gift option. Other reviews may be about the product's looks, and some may highlight its battery and performance. Try naming the clusters.

8.   Perform Topic Modeling

**Hint:** Use the scikit-learn implementations of Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF).


### Apply multi-class SVMs

#### Multi-class SVMs

Source link: https://www.baeldung.com/cs/svm-multiclass-classification

We create two SVM classifiers, one with an RBF kernel and one with a polynomial kernel. Both are fitted on `X_sm` and `y_sm`:
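
As a self-contained illustration of the same idea, the sketch below fits both kernels on a tiny hypothetical corpus. The texts, labels and the `clf_*` / `pred_*` names are made up for the example, and the hyperparameters are not tuned. scikit-learn's `SVC` handles more than two classes internally with a one-vs-one scheme, so no extra wrapper is needed.

```python
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny hypothetical corpus with three sentiment classes.
texts = [
    "love this tablet great screen", "great product love it",
    "terrible battery broke fast", "broke after a week terrible",
    "it is okay nothing special", "okay device average overall",
]
labels = ["Positive", "Positive", "Negative", "Negative", "Neutral", "Neutral"]

# Turn the raw texts into a sparse TF-IDF matrix.
X = TfidfVectorizer().fit_transform(texts)

# Same estimator, two kernels.
clf_rbf = svm.SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, labels)
clf_poly = svm.SVC(kernel="poly", degree=3, C=1.0).fit(X, labels)

pred_rbf = clf_rbf.predict(X)
pred_poly = clf_poly.predict(X)
print(sorted(clf_rbf.classes_))
```

On real data the predictions would be made on a held-out set rather than on the training texts as done here.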

In [None]:
#SVM object, with RBF kernel
sentiment_SVC_rbf_detection_model = svm.SVC(kernel='rbf', gamma=0.5, C=0.1).fit(X_sm, y_sm)

In [None]:
#SVM object, with Polynomial kernel
sentiment_SVC_poly_detection_model = svm.SVC(kernel='poly', degree=3, C=1).fit(X_sm, y_sm)

In [None]:
#check SVM with RBF kernel model for prediction
predict_SVC_rbf = sentiment_SVC_rbf_detection_model.predict(test_hidden_tfidf)

#delete object
del sentiment_SVC_rbf_detection_model

In [None]:
#check SVM with Polynomial kernel model for prediction
predict_SVC_poly = sentiment_SVC_poly_detection_model.predict(test_hidden_tfidf)

#delete object
del sentiment_SVC_poly_detection_model

##### Evaluation metrics

In [None]:
#Evaluation Metrics for SVM with RBF kernel model
print(metrics.classification_report(pd_s_target_test_hidden, predict_SVC_rbf))

In [None]:
#Evaluation Metrics for SVM with Polynomial kernel model
print(metrics.classification_report(pd_s_target_test_hidden, predict_SVC_poly))

### Multi-class neural nets

Source link: https://www.tensorflow.org/text/tutorials/text_classification_rnn#create_the_text_encoder

#### **Create the text encoder**

The raw text loaded from *pd_s_feature* needs to be processed before it can be used in a model. The simplest way to process text for training is using the TextVectorization layer.
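
To make the integer encoding concrete, here is a simplified pure-Python sketch of what such a layer does conceptually: each word becomes an integer id, unknown words map to an out-of-vocabulary id, and the sequence is padded or truncated to a fixed length. The helpers `build_vocab` and `encode` are hypothetical. They assign ids in first-seen order, whereas the Keras layer orders an adapted vocabulary by frequency, so the ids will differ from the real layer.

```python
PAD, OOV = 0, 1  # index 0 is reserved for padding, index 1 for unknown words

def build_vocab(texts):
    """Map each distinct word to an integer id, starting after PAD and OOV."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab) + 2)
    return vocab

def encode(text, vocab, length):
    """Turn text into a fixed-length list of ids, padded or truncated."""
    ids = [vocab.get(word, OOV) for word in text.lower().split()]
    ids = ids[:length]
    return ids + [PAD] * (length - len(ids))

vocab = build_vocab(["great tablet", "bad battery"])
encoded = encode("great battery life", vocab, length=5)
print(encoded)  # [2, 5, 1, 0, 0]
```

The word "life" is not in the vocabulary, so it becomes the OOV id 1, and the two trailing zeros are padding.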

##### Calculate parameters for the encoder layer

In [None]:
#get vocabulary from bag of words object
lst_bag_of_words_vacabulary = list(obj_bag_of_words.vocabulary_.keys())
np_bag_of_words_vacabulary = np.array(lst_bag_of_words_vacabulary)

In [None]:
#show first 10 words
np_bag_of_words_vacabulary[:10]

In [None]:
#find max number of words per review
max_nbr_wors_per_review = 0
#find index of max number of words per review
idx_max_nbr_wors_per_review = 0
i = 0
for nbr_row_words in all_bag_of_words:
  tmp_nbr_words_per_review = nbr_row_words.sum()
  if (max_nbr_wors_per_review < tmp_nbr_words_per_review):
    max_nbr_wors_per_review = tmp_nbr_words_per_review
    idx_max_nbr_wors_per_review = i
  i+=1

In [None]:
#Show max number of words per review
max_nbr_wors_per_review

In [None]:
#Show sentiment of index of max number of words per review
pd_s_target_train[idx_max_nbr_wors_per_review]

In [None]:
#Show review of index of max number of words per review
pd_s_feature[idx_max_nbr_wors_per_review]

##### Create text vectorization layer

In [None]:
#Create text vectorization layer

#Note: this vocabulary contains 1 OOV token,
#so the effective number of tokens is (max_tokens - 1 - (1 if output_mode == "int" else 0))
VOCAB_SIZE = np_bag_of_words_vacabulary.shape[0] + 2
# max number of words per review is 749, but we use 2000 as a reserve, since the raw text also contains stop words
OUTPUT_SESUENCE_LENGTH = 2000
encoder_layer_review = layers.TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    standardize='strip_punctuation',
    output_sequence_length=OUTPUT_SESUENCE_LENGTH,
    pad_to_max_tokens=False,
    vocabulary=np_bag_of_words_vacabulary,# same vocabulary as used for the machine learning models
)

In [None]:
#show first 10 words from layer vocabulary
np.array(encoder_layer_review.get_vocabulary())[:10]

In [None]:
#Show encoded review with the biggest number of words
enc_review_big_nbr_words = encoder_layer_review(pd_s_feature[idx_max_nbr_wors_per_review]).numpy()
enc_review_big_nbr_words

In [None]:
#Show decoded review with the biggest number of words
print(" ".join(np.array(encoder_layer_review.get_vocabulary())[enc_review_big_nbr_words]))

#### Create the neural network model


In [None]:
input_embeding_layer_nn = Input(shape=(2000,), dtype=np.uint32)
x = layers.Embedding(
                    input_dim=VOCAB_SIZE,
                    output_dim=64,
                    mask_zero=True,
                    # Use masking to handle the variable sequence lengths
                    input_length=OUTPUT_SESUENCE_LENGTH,
                )(input_embeding_layer_nn)
x = layers.Dense(512, activation='relu')(x)
x = layers.BatchNormalization()(x)
x = layers.MaxPooling1D(pool_size=2,
   strides=2, padding='valid')(x)
x = layers.Dense(256, activation='relu')(x)
x = layers.BatchNormalization()(x)
x = layers.MaxPooling1D(pool_size=2,
   strides=2, padding='valid')(x)
x = layers.Dense(32, activation='relu')(x)
x = layers.BatchNormalization()(x)
x = layers.Dense(1, activation='relu')(x)
x = layers.Flatten()(x)
x = layers.BatchNormalization()(x)
x = layers.Dropout(rate=0.4)(x)
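# softmax yields one probability distribution over the classes, as categorical_crossentropy expects
out_layer_nn = layers.Dense(len(pd_s_target_train.unique()), activation='softmax')(x)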

In [None]:
#create neural network model
embeding_model_nn = Model(input_embeding_layer_nn, out_layer_nn)

#### Compile the model

We use the `tf.keras.optimizers.Adam` optimizer and the `categorical_crossentropy` loss function.
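
Categorical cross-entropy compares one-hot targets with predicted class probabilities, so it assumes the model output is a probability distribution over the classes, which is what a softmax output produces. The NumPy sketch below works through the arithmetic on made-up logits; the function names are illustrative and this is not the Keras implementation.

```python
import numpy as np

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1 along the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean over the batch of -sum(y_true * log(y_prob))."""
    y_prob = np.clip(y_prob, eps, 1.0)
    return float(-(y_true * np.log(y_prob)).sum(axis=-1).mean())

logits = np.array([[2.0, 0.5, 0.1], [0.2, 0.1, 3.0]])  # hypothetical model scores
y_true = np.array([[1, 0, 0], [0, 0, 1]])              # one-hot targets

probs = softmax(logits)
loss = categorical_crossentropy(y_true, probs)
print(probs.sum(axis=1), loss)
```

Because the targets are one-hot, only the log-probability of the true class contributes to the loss for each sample.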

In [None]:
embeding_model_nn.compile(optimizer='adam',
              loss='categorical_crossentropy', metrics=['accuracy'])

#### Model summary

View all the layers of the network using the Keras `Model.summary` method:

In [None]:
embeding_model_nn.summary()

#### Train and visualize the model results

In [None]:
#encode train data
encode_feature_train = encoder_layer_review(pd_s_feature_train)
#encode the review texts of test data hidden (the features, not the target)
encode_feature_test_hidden = encoder_layer_review(pd_s_feature_test_hidden)

In [None]:
#Oversampling is used to tackle the class imbalance problem.
smote = SMOTE(random_state=42)
X_sm_encode, y_sm_encode = smote.fit_resample(encode_feature_train, pd_s_target_train)

#delete object
del smote

#binary encode target data
class_samples_ = pd_s_target_train.unique()
#binary encode target train data
np_target_train = label_binarize(y_sm_encode, classes=class_samples_)
#binary encode target test data hidden
np_target_test_hidden = label_binarize(pd_s_target_test_hidden, classes=class_samples_)
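
For intuition about what the oversampling step does, here is a rough NumPy sketch of the interpolation idea behind SMOTE: a synthetic sample is placed on the line segment between a minority-class point and one of its nearest minority-class neighbours. The helper `smote_like_sample` is hypothetical and uses a single neighbour for brevity; it is not the imbalanced-learn implementation.

```python
import numpy as np

def smote_like_sample(minority, rng):
    """Create one synthetic point between a random minority row and its nearest neighbour."""
    i = rng.integers(len(minority))
    dists = np.linalg.norm(minority - minority[i], axis=1)
    dists[i] = np.inf                      # exclude the point itself
    j = int(np.argmin(dists))              # nearest neighbour (k = 1 for brevity)
    lam = rng.random()                     # interpolation weight in [0, 1)
    return minority[i] + lam * (minority[j] - minority[i])

rng = np.random.default_rng(42)
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0]])  # hypothetical minority samples
synthetic = smote_like_sample(minority, rng)
print(synthetic)
```

One thing to keep in mind is that interpolation is most meaningful for continuous features such as TF-IDF or topic weights. When applied to integer token ids, the interpolated values generally do not correspond to the words of either source review, so this step may deserve a second look.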

In [None]:
epochs=20
history = embeding_model_nn.fit(
                        x = X_sm_encode,
                        y = np_target_train,
                        validation_data=(encode_feature_test_hidden, np_target_test_hidden),
                        epochs=epochs,
                        batch_size=20,
                        callbacks=[PlotLossesKerasTF()]
                      )

In [None]:
#embeding_model_nn.load_weights("nn_model_weights.h5", by_name=True)

In [None]:
filepath = 'nn_model_weights.h5'
embeding_model_nn.save_weights(filepath, overwrite=True)

#### Predict neural network

In [None]:
input_layer_nn = Input(shape=(1,), dtype=tf.string)
input_encoder_layer_nn = encoder_layer_review(input_layer_nn)

In [None]:
model_nn = Model(input_layer_nn, embeding_model_nn(input_encoder_layer_nn))

In [None]:
predict_nn = model_nn.predict(pd_s_feature_test_hidden)
predict_nn = class_samples_[[np.argmax(i) for i in predict_nn]]

#delete models
del embeding_model_nn
del model_nn

In [None]:
#Evaluation Metrics for neural network model
print(metrics.classification_report(pd_s_target_test_hidden, predict_nn))

### Ensemble techniques

In [None]:
dict_pred = {'NB': predict_NB,
             'RF': predict_RF_best_rand_params,
             'XGB': predict_XGB_best_rand_params,
             'SVC_rbf': predict_SVC_rbf,
             'SVC_poly': predict_SVC_poly,
             'NN': predict_nn,
             }
pd_s_pred_mode = pd.DataFrame(dict_pred).T.mode().T[0]

In [None]:
#Evaluation Metrics mode of Naive Bayes, Random Forest, Xgboost,
#SVM with RBF kernel, SVM with Polynomial kernel, Neural network
print(metrics.classification_report(pd_s_target_test_hidden, pd_s_pred_mode))
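
The ensemble above is a hard majority vote: for each review, the label predicted by the most models wins. A minimal sketch with hypothetical predictions from three models shows the mechanics; `votes` and `majority` are illustrative names.

```python
import pandas as pd

# Hypothetical predictions of three models for four reviews.
votes = pd.DataFrame({
    "model_a": ["Positive", "Negative", "Neutral", "Positive"],
    "model_b": ["Positive", "Negative", "Positive", "Neutral"],
    "model_c": ["Negative", "Negative", "Positive", "Negative"],
})

# mode(axis=1) finds the most frequent label in each row; column 0 holds
# the first mode, so ties are resolved by sorted label order.
majority = votes.mode(axis=1)[0]
print(majority.tolist())
```

`votes.mode(axis=1)[0]` gives the same result as the double transpose `votes.T.mode().T[0]`. In the last row all three models disagree, and the tie falls to the alphabetically first label, which is worth knowing when the number of models is even or the models disagree often.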

### RNN models

##### Make encoder layer

In [None]:
#find vocabulary size
VOCAB_SIZE_TF_TXT_VECT = 20000
encoder = layers.TextVectorization(
                                  max_tokens=VOCAB_SIZE_TF_TXT_VECT,
                                  output_mode='int',
                                  standardize='strip_punctuation',
                                  output_sequence_length=1,
                                  pad_to_max_tokens=False
                                  )
encoder.adapt(pd_s_feature.map(lambda text: text))
VOCAB_SIZE_TF_TXT_VECT = len(encoder.get_vocabulary())+2

In [None]:
#make encoder text vectorization layer
enc_review_tf_txt_vect = layers.TextVectorization(
                                  max_tokens=VOCAB_SIZE_TF_TXT_VECT,
                                  output_mode='int',
                                  standardize='strip_punctuation',
                                  output_sequence_length=2000,
                                  pad_to_max_tokens=False
                                  )
enc_review_tf_txt_vect.adapt(pd_s_feature.map(lambda text: text))

##### Encode train and test data

In [None]:
#encode the review texts (features) of the train and hidden test data
encode_feature_train = enc_review_tf_txt_vect(pd_s_feature_train)
encode_feature_test_hidden = enc_review_tf_txt_vect(pd_s_feature_test_hidden)

##### Oversample encoded with tensorflow text vectorization

In [None]:
#Oversampling is used to tackle the class imbalance problem.
smote = SMOTE(random_state=42)
X_sm_tf_vec, y_sm_tf_vec = smote.fit_resample(encode_feature_train, pd_s_target_train)

#delete object
del smote

#One-hot encoding
class_samples_ = pd_s_target_train.unique()
np_target_train_rnn = label_binarize(y_sm_tf_vec, classes=class_samples_)
np_target_test_hidden_rnn = label_binarize(pd_s_target_test_hidden, classes=class_samples_)

##### Input string model rnn

In [None]:
input_layer_rnn = Input(shape=(1,), dtype=tf.string)
input_encoder_layer_rnn = enc_review_tf_txt_vect(input_layer_rnn)

#### Multiclass LSTM model

##### Build LSTM model

In [None]:
input_embeding_layer_lstm = Input(shape=(2000,), dtype=np.uint32)
x = layers.Embedding(
                    input_dim=VOCAB_SIZE_TF_TXT_VECT,
                    output_dim=64,
                    mask_zero=True,
                    # Use masking to handle the variable sequence lengths
                    input_length=2000,
                )(input_embeding_layer_lstm)
x = layers.Bidirectional(layers.LSTM(64,  return_sequences=True))(x)
x = layers.BatchNormalization()(x)
x = layers.LSTM(32)(x)
x = layers.Dense(128, activation='relu')(x)
x = layers.BatchNormalization()(x)
x = layers.Dropout(rate=0.4)(x)
out_layer_lstm = layers.Dense(len(pd_s_target_train.unique()), activation='softmax')(x)

In [None]:
#Build the model from the embedding input layer
embeding_model_lstm = Model(input_embeding_layer_lstm, out_layer_lstm)

##### Compile model

We use the `tf.keras.optimizers.Adam` optimizer and the `categorical_crossentropy` loss function.

In [None]:
embeding_model_lstm.compile(optimizer='adam',
              loss='categorical_crossentropy', metrics=['accuracy'])

##### Model summary

View all the layers of the network using the Keras `Model.summary` method:

In [None]:
embeding_model_lstm.summary()

##### Train model

In [None]:
epochs=10
history = embeding_model_lstm.fit(
                        x = X_sm_tf_vec,
                        y = np_target_train_rnn,
                        validation_data=(encode_feature_test_hidden, np_target_test_hidden_rnn),
                        epochs=epochs,
                        batch_size=20,
                        callbacks=[PlotLossesKerasTF()]
                      )

In [None]:
#embeding_model_lstm.load_weights("lstm_model_weights.h5", by_name=True)

In [None]:
filepath = 'lstm_model_weights.h5'
embeding_model_lstm.save_weights(filepath, overwrite=True)

##### Predict

In [None]:
model_lstm = Model(input_layer_rnn, embeding_model_lstm(input_encoder_layer_rnn))

In [None]:
predict_lstm = model_lstm.predict(pd_s_feature_test_hidden)
predict_lstm = class_samples_[[np.argmax(i) for i in predict_lstm]]

#delete models
del embeding_model_lstm
del model_lstm

In [None]:
#Evaluation Metrics for LSTM model
print(metrics.classification_report(pd_s_target_test_hidden, predict_lstm))

#### Multiclass GRU model

##### Build GRU model

In [None]:
input_embeding_layer_gru = Input(shape=(2000,), dtype=np.uint32)
x = layers.Embedding(
                    input_dim=VOCAB_SIZE_TF_TXT_VECT,
                    output_dim=64,
                    mask_zero=True,
                    # Use masking to handle the variable sequence lengths
                    input_length=2000,
                )(input_embeding_layer_gru)
x = layers.Bidirectional(layers.GRU(128,  return_sequences=True))(x)
x = layers.BatchNormalization()(x)
x = layers.GRU(128)(x)
x = layers.Dense(64, activation='relu')(x)
x = layers.BatchNormalization()(x)
x = layers.Dropout(rate=0.4)(x)
out_layer_gru = layers.Dense(len(pd_s_target_train.unique()), activation='softmax')(x)

In [None]:
#Build the model from the embedding input layer
embeding_model_gru = Model(input_embeding_layer_gru, out_layer_gru)

##### Compile model

We use the `tf.keras.optimizers.Adam` optimizer and the `categorical_crossentropy` loss function.

In [None]:
embeding_model_gru.compile(optimizer='adam',
              loss='categorical_crossentropy', metrics=['accuracy'])

##### Model summary

View all the layers of the network using the Keras `Model.summary` method:

In [None]:
embeding_model_gru.summary()

##### Train model

In [None]:
epochs=10
history = embeding_model_gru.fit(
                        x = X_sm_tf_vec,
                        y = np_target_train_rnn,
                        validation_data=(encode_feature_test_hidden, np_target_test_hidden_rnn),
                        epochs=epochs,
                        batch_size=20,
                        callbacks=[PlotLossesKerasTF()]
                      )

In [None]:
#embeding_model_gru.load_weights("gru_model_weights.h5", by_name=True)

In [None]:
filepath = 'gru_model_weights.h5'
embeding_model_gru.save_weights(filepath, overwrite=True)

##### Predict

In [None]:
model_gru = Model(input_layer_rnn, embeding_model_gru(input_encoder_layer_rnn))

In [None]:
predict_gru = model_gru.predict(pd_s_feature_test_hidden)
predict_gru = class_samples_[[np.argmax(i) for i in predict_gru]]

#delete models
del embeding_model_gru
del model_gru

In [None]:
#Evaluation Metrics for GRU model
print(metrics.classification_report(pd_s_target_test_hidden, predict_gru))

### Compare the accuracy of neural nets with traditional ML based algorithms

#### Machine learning evaluation

In [None]:
#Evaluation Metrics Naive Bayes Classifier
print(metrics.classification_report(pd_s_target_test_hidden, predict_NB))

In [None]:
#Evaluation Metrics Random Forest Classifier
print(metrics.classification_report(pd_s_target_test_hidden, predict_RF_best_rand_params))

In [None]:
#Evaluation Metrics eXtreme Gradient Boosting
print(metrics.classification_report(pd_s_target_test_hidden, predict_XGB_best_rand_params))

In [None]:
#Evaluation Metrics for SVM with RBF kernel model
print(metrics.classification_report(pd_s_target_test_hidden, predict_SVC_rbf))

In [None]:
#Evaluation Metrics for SVM with Polynomial kernel model
print(metrics.classification_report(pd_s_target_test_hidden, predict_SVC_poly))

#### Neural network evaluation

In [None]:
#Evaluation Metrics for neural network model
print(metrics.classification_report(pd_s_target_test_hidden, predict_nn))

In [None]:
#Evaluation Metrics for LSTM model
print(metrics.classification_report(pd_s_target_test_hidden, predict_lstm))

In [None]:
#Evaluation Metrics for GRU model
print(metrics.classification_report(pd_s_target_test_hidden, predict_gru))

#### Conclusion

The classical machine learning models give better prediction results on the hidden test data than the neural network models.

### Hyperparameter tuning of the deep learning model

##### Build the model

In [None]:
def create_model(neurons_bid_gru, neurons_dense, nbr_out_net, optimizer='adam', activation='relu'):
  input_embeding_layer_gru = Input(shape=(2000,), dtype=np.uint32)
  x = layers.Embedding(
                      input_dim=VOCAB_SIZE_TF_TXT_VECT,
                      output_dim=64,
                      mask_zero=True,
                      # Use masking to handle the variable sequence lengths
                      input_length=2000,
                      )(input_embeding_layer_gru)
  x = layers.Bidirectional(layers.GRU(neurons_bid_gru,  return_sequences=True))(x)
  x = layers.BatchNormalization()(x)
  x = tf.keras.layers.GRU(32)(x)
  x = layers.Dense(neurons_dense, activation='relu')(x)
  x = layers.BatchNormalization()(x)
  x = layers.Dropout(rate=0.4)(x)
  out_layer_gru = layers.Dense(nbr_out_net, activation=activation)(x)

  #create neural network model
  embeding_model_gru = Model(input_embeding_layer_gru, out_layer_gru)

  # Compile model
  embeding_model_gru.compile(loss='categorical_crossentropy',
                              optimizer=optimizer, metrics=['accuracy'])
  return embeding_model_gru

##### Define the search parameters

In [None]:
# define the parameter space for the randomized search
batch_size = [10, 40, 80]
optimizer = ['SGD', 'Adam']
learn_rate = [0.01, 0.3]
momentum = [0.0, 0.4, 0.8]
neurons_bid_gru = [32, 64]
neurons_dense = [64, 128]
activation = ['softmax', 'sigmoid']
nbr_out_net = [int(len(pd_s_target_train.unique()))]

param_grid = dict(
                  batch_size=batch_size,
                  optimizer__learning_rate=learn_rate,
                  optimizer__momentum=momentum,
                  model__activation=activation,
                  model__optimizer=optimizer,
                  model__neurons_bid_gru=neurons_bid_gru,
                  model__neurons_dense=neurons_dense,
                  model__nbr_out_net=nbr_out_net,
                  )
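
To get a feel for the size of this search space, the sketch below counts the full grid and draws two random candidates, which is how `RandomizedSearchCV` picks its `n_iter` settings. The dictionary mirrors the one above, with a hypothetical class count of 3 in place of the value computed from the training targets.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same shape of search space as above, with a hypothetical class count of 3.
space = {
    "batch_size": [10, 40, 80],
    "optimizer__learning_rate": [0.01, 0.3],
    "optimizer__momentum": [0.0, 0.4, 0.8],
    "model__activation": ["softmax", "sigmoid"],
    "model__optimizer": ["SGD", "Adam"],
    "model__neurons_bid_gru": [32, 64],
    "model__neurons_dense": [64, 128],
    "model__nbr_out_net": [3],
}

n_total = len(ParameterGrid(space))  # number of combinations in the full grid
sampled = list(ParameterSampler(space, n_iter=2, random_state=42))
print(n_total, len(sampled))
```

The full grid has 288 combinations, so a random search with `n_iter = 2` covers only a very small part of it. The best parameters it reports should be read as the better of two draws, not as an optimum.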

##### Train the models

In [None]:
# create model
model_GRU = KerasClassifier(model=create_model, epochs=1)
# run a randomized search over the parameter space

gru_hyper_tune_random = RandomizedSearchCV(estimator = model_GRU, param_distributions = param_grid,
                                 n_iter = 2, cv = 3, verbose=2, random_state=42, n_jobs = -1)
pred_gru_hyper_tune_random = gru_hyper_tune_random.fit(X_sm_tf_vec, np_target_train_rnn)

In [None]:
#show best parameters
pred_gru_hyper_tune_random.best_params_

##### Prediction of best model

In [None]:
#predict sentiment with best random parameters
predict_GRU_best_rand_params = pred_gru_hyper_tune_random.best_estimator_.predict(encode_feature_test_hidden)
predict_GRU_best_rand_params = class_samples_[[np.argmax(i) for i in predict_GRU_best_rand_params]]

In [None]:
#Evaluation Metrics
print(metrics.classification_report(pd_s_target_test_hidden, predict_GRU_best_rand_params))

### Perform Topic Modeling

In [None]:
n_components = 100

##### TFIDF transform

In [None]:
# apply transform method for the bag of words of all data
all_data_bag_of_words = obj_bag_of_words.transform(pd_s_feature)
# apply tfidf transformer for all bag of words into it (transformed version)
all_data_tfidf = obj_tfidf.transform(all_data_bag_of_words)

# apply transform method for the bag of words of train data
train_bag_of_words = obj_bag_of_words.transform(pd_s_feature_train)
# apply tfidf transformer for train bag of words into it (transformed version)
train_tfidf = obj_tfidf.transform(train_bag_of_words)

# apply transform method for the bag of words of test data hidden
test_hidden_bag_of_words = obj_bag_of_words.transform(pd_s_feature_test_hidden)
# apply tfidf transformer for test hidden bag of words into it (transformed version)
test_hidden_tfidf = obj_tfidf.transform(test_hidden_bag_of_words)

##### Build neural network

In [None]:
input_nn_TM = Input(shape=(n_components,), dtype=np.float32)
x = layers.Dense(182, activation='relu')(input_nn_TM)
x = layers.BatchNormalization()(x)
x = layers.Dense(128, activation='relu')(x)
x = layers.Dropout(rate=0.5)(x)
x = layers.BatchNormalization()(x)
x = layers.Dense(64, activation='relu')(x)
x = layers.Dropout(rate=0.3)(x)
x = layers.BatchNormalization()(x)
x = layers.Dense(16, activation='relu')(x)
x = layers.Dropout(rate=0.5)(x)
x = layers.Dense(8, activation='relu')(x)
x = layers.BatchNormalization()(x)
x = layers.Dropout(rate=0.6)(x)
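# softmax yields one probability distribution over the classes, as categorical_crossentropy expects
out_nn_TM = layers.Dense(len(pd_s_target_train.unique()), activation='softmax')(x)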

#### Latent Dirichlet Allocation

In [None]:
from sklearn.decomposition import LatentDirichletAllocation

##### LDA transform

In [None]:
# Fit LDA on the TF-IDF matrix of all reviews to learn n_components topics;
# transform() then returns the topic weights of each review.
lda = LatentDirichletAllocation(n_components=n_components, random_state=0)
lda.fit(all_data_tfidf)

In [None]:
# get topics for train data
X_train_lda = lda.transform(train_tfidf)
# get topics for test hidden
X_test_hidden_lda = lda.transform(test_hidden_tfidf)

#Oversampling is used to tackle the class imbalance problem.
smote = SMOTE(random_state=42)
X_train_sm_lda, y_train_sm = smote.fit_resample(X_train_lda, pd_s_target_train)
del smote

In [None]:
class_samples_ = pd_s_target_train.unique()
y_train_sm_one_hot = label_binarize(y_train_sm, classes=class_samples_)
y_test_hidden_one_hot = label_binarize(pd_s_target_test_hidden, classes=class_samples_)

##### Build neural network model

In [None]:
model_nn_lda_TM = Model(input_nn_TM, out_nn_TM)

In [None]:
model_nn_lda_TM.compile(optimizer='adam',
              loss='categorical_crossentropy', metrics=['accuracy'])

##### Train

In [None]:
epochs=60
history = model_nn_lda_TM.fit(
                        x = X_train_sm_lda,
                        y = y_train_sm_one_hot,
                        validation_data=(X_test_hidden_lda, y_test_hidden_one_hot),
                        epochs=epochs,
                        batch_size=40,
                        callbacks=[PlotLossesKerasTF()]
                      )

##### Prediction

In [None]:
#predict sentiment with LDA decomposition
predict_nn_lda_TM = model_nn_lda_TM.predict(X_test_hidden_lda)
predict_nn_lda_TM = class_samples_[[np.argmax(i) for i in predict_nn_lda_TM]]

#delete model
del model_nn_lda_TM

In [None]:
#Evaluation Metrics for LDA decomposition
print(metrics.classification_report(pd_s_target_test_hidden, predict_nn_lda_TM))

#### Non-Negative Matrix Factorization

In [None]:
from sklearn.decomposition import NMF

##### NMF Transform

In [None]:
model_nmf = NMF(n_components=n_components, init='random', random_state=0)
model_nmf.fit(all_data_tfidf)

In [None]:
# get topics for train data
X_train_nmf = model_nmf.transform(train_tfidf)
# get topics for test hidden
X_test_hidden_nmf = model_nmf.transform(test_hidden_tfidf)

#Oversampling is used to tackle the class imbalance problem.
smote = SMOTE(random_state=42)
X_train_sm_nmf, y_train_sm = smote.fit_resample(X_train_nmf, pd_s_target_train)
del smote

In [None]:
class_samples_ = pd_s_target_train.unique()
y_train_sm_one_hot = label_binarize(y_train_sm, classes=class_samples_)
y_test_hidden_one_hot = label_binarize(pd_s_target_test_hidden, classes=class_samples_)

##### Build neural network model

In [None]:
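# clone_model re-creates the layers with fresh weights; Model(input_nn_TM, out_nn_TM) alone would reuse the weights trained on the LDA features
model_nn_nmf_TM = keras.models.clone_model(Model(input_nn_TM, out_nn_TM))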

In [None]:
model_nn_nmf_TM.compile(optimizer='adam',
              loss='categorical_crossentropy', metrics=['accuracy'])

##### Train model

In [None]:
epochs=90
history = model_nn_nmf_TM.fit(
                        x = X_train_sm_nmf,
                        y = y_train_sm_one_hot,
                        validation_data=(X_test_hidden_nmf, y_test_hidden_one_hot),
                        epochs=epochs,
                        batch_size=40,
                        callbacks=[PlotLossesKerasTF()]
                      )

##### Predict

In [None]:
#predict sentiment with NMF decomposition
predict_nn_nmf_TM = model_nn_nmf_TM.predict(X_test_hidden_nmf)
predict_nn_nmf_TM = class_samples_[[np.argmax(i) for i in predict_nn_nmf_TM]]

#delete model
#del model_nn_nmf_TM

In [None]:
#Evaluation Metrics for NMF decomposition
print(metrics.classification_report(pd_s_target_test_hidden, predict_nn_nmf_TM))

# Conclusion

In this project I:

- Performed an EDA on the dataset
- Converted the reviews to TF-IDF scores
- Processed the text (dropped stop words and punctuation, applied lemmatization)
- Implemented several ML algorithms (Naive Bayes, Random Forest, XGBoost and SVMs)
- Tackled the class imbalance problem with SMOTE
- Used the following metrics to evaluate model performance: precision, recall, F1-score, AUC-ROC curve
- Tuned the hyperparameters of the ML algorithms with RandomizedSearchCV
- Used ensemble techniques such as XGBoost + NB + RF + SVM
- Used LSTM, GRU and NN deep learning models
- Tuned the hyperparameters of the DL models with RandomizedSearchCV
- Applied Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF)


I have come to the conclusion that some ML algorithms give much better predictions on imbalanced data, but when the NN models are trained on features from decomposition algorithms such as LDA and NMF, their results are comparable to those of the ML models. Hyperparameter tuning can then be used to find the best-performing parameters.