<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 5px; height: 50px"> 

#   Personalizing Music Video Recommendations with Emotional Intelligence

> Capstone Project: Lionel Foo

---

#### <b> Notebook: 02B Classification Model (Multiclass Classification - Naive Bayes, Logistic Regression, XGBoost) </b>

#### Naive Bayes (MultinomialNB)
**Rationale:**
Naive Bayes is a probabilistic algorithm based on Bayes' theorem. It's particularly suitable for text classification tasks due to its simplicity, efficiency, and effectiveness, especially with a relatively small dataset. In the context of emotional text classification, Naive Bayes can capture the likelihood of specific words or features contributing to each emotion class independently, making it robust for sentiment analysis.
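
As a minimal sketch of the idea (not the scikit-learn implementation used later in this notebook), the snippet below scores a short text against two classes using multinomial Naive Bayes with additive smoothing. The toy corpus and the `train_nb` and `predict_nb` helpers are hypothetical and exist only for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit per-class log priors and smoothed log word likelihoods."""
    vocab = sorted({w for d in docs for w in d.split()})
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    model = {}
    for label, n_docs in class_counts.items():
        # Additive (Laplace/Lidstone) smoothing keeps unseen words from zeroing a class
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        log_like = {w: math.log((word_counts[label][w] + alpha) / total) for w in vocab}
        model[label] = (math.log(n_docs / len(docs)), log_like)
    return model

def predict_nb(model, text):
    """Return the class with the highest log posterior; out-of-vocabulary words are skipped."""
    scores = {}
    for label, (log_prior, log_like) in model.items():
        scores[label] = log_prior + sum(log_like[w] for w in text.split() if w in log_like)
    return max(scores, key=scores.get)

# Hypothetical toy corpus
docs = ["i feel so happy today", "i feel scared and helpless",
        "what a happy joyful day", "scared of the dark"]
labels = ["Joy", "Fear", "Joy", "Fear"]

model = train_nb(docs, labels, alpha=0.2)
print(predict_nb(model, "happy day"))
print(predict_nb(model, "i am scared"))
```

Each word contributes its own log-likelihood term to the class score, which is the "independent contribution" assumption described above.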

#### Logistic Regression
**Rationale:**
Logistic Regression is a linear model used for binary and multi-class classification. In the case of emotional text classification, it can be effective in modeling the relationship between the input features (words in the text) and the probability of belonging to each emotion class. Logistic Regression is known for its simplicity and interpretability.
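
To illustrate how a linear model yields class probabilities, here is a small sketch of the multi-class (softmax) formulation. The weights, biases, and the `predict_proba` helper are hypothetical values chosen for illustration, not coefficients learned in this notebook.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical weights: one weight per (class, word) pair, plus one bias per class
classes = ["Sadness", "Joy", "Fear"]
weights = {
    "Sadness": {"helpless": 1.2, "happy": -1.0},
    "Joy": {"helpless": -0.8, "happy": 1.5},
    "Fear": {"helpless": 0.9, "happy": -0.7},
}
bias = {"Sadness": 0.1, "Joy": 0.2, "Fear": -0.1}

def predict_proba(word_counts):
    """Compute a linear score per class from word counts, then apply softmax."""
    scores = [
        bias[c] + sum(weights[c].get(w, 0.0) * n for w, n in word_counts.items())
        for c in classes
    ]
    return dict(zip(classes, softmax(scores)))

probs = predict_proba({"happy": 2})
print(max(probs, key=probs.get), probs)
```

Because each weight is tied to one word and one class, the sign and size of a weight can be read directly, which is where the interpretability of the model comes from.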

#### XGBoostClassifier
**Rationale:**
XGBoost is an ensemble learning algorithm built on gradient boosting. It builds decision trees sequentially, with each new tree fitted to correct the errors of the trees before it, so that the ensemble as a whole minimizes a chosen loss function. This gives it high predictive performance and the ability to model complex, non-linear relationships in the data. In emotional text classification, where capturing nuanced patterns is crucial, XGBoost can provide strong predictive power.
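
The sequential error-correcting idea can be sketched with a deliberately simplified booster: one-split trees (stumps) fitted to the residuals of a squared-error loss on a one-dimensional feature. This is not XGBoost itself, which adds regularisation and second-order gradient information among other things; `fit_stump`, `boost`, and the toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on a 1-D feature that minimises squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # dropping the largest value keeps both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean

def boost(xs, ys, n_trees=20, learning_rate=0.2):
    """Sequentially fit stumps to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)
    trees = []

    def predict(x):
        return base + learning_rate * sum(tree(x) for tree in trees)

    for _ in range(n_trees):
        # For squared error, the negative gradient is simply the residual
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        trees.append(fit_stump(xs, residuals))
    return predict

# Hypothetical toy data with a step between x=3 and x=4
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]

model = boost(xs, ys)
mse = sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)
baseline = sum((y - sum(ys) / len(ys)) ** 2 for y in ys) / len(ys)
print(f"baseline MSE={baseline:.3f}, boosted MSE={mse:.3f}")
```

The `learning_rate` argument plays the same role as the `xgb__learning_rate` hyperparameter tuned later: smaller values make each tree's correction more cautious.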


---

<b> 1. Import Libraries</b>

In [1]:
#!pip install xgboost

# Imports: standard
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import time

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

# Imports: NLP (WordNetLemmatizer needs the WordNet corpus)
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /Users/lionel/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!



---

<b> 2. Import dataframe and assess imported data </b>

In [2]:
# Import data
df = pd.read_csv("Data/emotions_processed_dataset.csv")
df.head(3)

Unnamed: 0,text,label,emotion_name
0,i just feel really helpless and heavy hearted,4,Fear
1,i have enjoyed being able to slouch about rela...,0,Sadness
2,i gave up my internship with the dmrg and am f...,4,Fear


In [3]:
# Summary of DataFrame information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 380365 entries, 0 to 380364
Data columns (total 3 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   text          380365 non-null  object
 1   label         380365 non-null  int64 
 2   emotion_name  380365 non-null  object
dtypes: int64(1), object(2)
memory usage: 8.7+ MB



---

<b> 2. Prepare Data for Modelling </b>

Outline
* (a) Evaluate Class Imbalance
* (b) Lemmatize 'text' Column
* (c) Perform Train-Test-Split
* (d) Address Class Imbalance


<b> (a) Evaluate Class Imbalance </b>

In [4]:
# Examine Class Imbalance
print("== Examine Class Imbalance ==\n")

# Counts
class_counts = df.groupby(['label', 'emotion_name']).size().reset_index(name='count')
print("= Counts =\n")
print(class_counts[['label', 'emotion_name', 'count']])

# Proportions
print("\n= Proportions =\n")
class_proportions = df.groupby(['label', 'emotion_name']).size() / len(df)
print(class_proportions)

== Examine Class Imbalance ==

= Counts =

   label emotion_name   count
0      0      Sadness  118152
1      1          Joy  134709
2      2         Love   29410
3      3        Anger   54597
4      4         Fear   43497

= Proportions =

label  emotion_name
0      Sadness         0.310628
1      Joy             0.354157
2      Love            0.077320
3      Anger           0.143538
4      Fear            0.114356
dtype: float64


Comments: Class Imbalance
* Class [0]: Sadness, Class [1]: Joy, Class [2]: Love, Class [3]: Anger, Class [4]: Fear
* There are a disproportionately high number of "Sadness" (118152), "Joy" (134709), and "Anger" (54597) instances.
* There are a disproportionately low number of "Love" (29410) instances.

Comments: Approach to Addressing Class Imbalance
* Resample every class to the size of the "Fear" class, whose count lies between the most and least numerous classes. We'll:
    * undersample: the most numerous classes ("Sadness", "Joy", and "Anger") 
    * oversample: the least numerous class ("Love")

* To prevent leakage between the "training" and "testing" data (for example, oversampled copies of the same text landing in both sets), the class rebalancing procedure will be applied:
    * after the "train-test-split" step, and
    * applied to the "training" sample only
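
The approach above can be sketched with the standard library alone. This is an illustrative variant, not the pandas cell used later in this notebook: it undersamples larger classes without replacement and oversamples smaller ones with replacement. The `rebalance` helper and the toy rows are hypothetical.

```python
import random
from collections import Counter

def rebalance(rows, target_label, seed=42):
    """Resample every class to the size of `target_label`.

    Larger classes are undersampled without replacement; smaller classes
    are oversampled with replacement.
    """
    rng = random.Random(seed)
    by_label = {}
    for text, label in rows:
        by_label.setdefault(label, []).append((text, label))
    n_target = len(by_label[target_label])
    balanced = []
    for label, group in by_label.items():
        if len(group) >= n_target:
            balanced.extend(rng.sample(group, n_target))     # undersample
        else:
            balanced.extend(rng.choices(group, k=n_target))  # oversample
    return balanced

# Hypothetical training rows: (text, label)
train_rows = ([(f"sad {i}", "Sadness") for i in range(10)]
              + [(f"fear {i}", "Fear") for i in range(4)]
              + [(f"love {i}", "Love") for i in range(2)])
test_rows = [("held-out text", "Love")]  # never passed to rebalance()

balanced = rebalance(train_rows, target_label="Fear")
print(Counter(label for _, label in balanced))
```

Only `train_rows` is resampled; the test rows keep their original class distribution, so test metrics reflect the data as it naturally occurs.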

<b> (b) Lemmatize 'text' Column </b>

In [5]:
# Create a lemmatizer object
lemmatizer = WordNetLemmatizer()

# Define a function to lemmatize each whitespace-separated token in a text
# (lemmatize() treats every token as a noun unless a POS tag is passed)
def lemmatize_text(text):
    return ' '.join([lemmatizer.lemmatize(word) for word in text.split()])

In [6]:
# Apply lemmatization to all rows in 'text' column
df['text'] = df['text'].apply(lemmatize_text)

<b> (c) Perform Train-Test-Split </b>

In [7]:
# Stratified 80/20 split, done before rebalancing so the test set keeps the original class distribution
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42, stratify=df["label"])

<b> (d) Address Class Imbalance </b>

In [8]:
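# Balance the training dataset: resample every class to the size of class 4 ("Fear")
# replace=True is applied to all classes, so undersampled classes can also contain repeated rows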
class_4_size = train_df[train_df["label"] == 4].shape[0]
balanced_train_df = pd.concat([train_df[train_df["label"] == label].sample(n=class_4_size, replace=True, random_state=42) for label in train_df["label"].unique()], axis=0)

In [9]:
# Display the count of each unique label in the 'label' column of the balanced_train_df DataFrame after balancing
balanced_train_df['label'].value_counts()

label
1    34797
4    34797
3    34797
0    34797
2    34797
Name: count, dtype: int64

In [10]:
# Extract 'text' and 'label' columns from balanced_train_df for training
X_train_bal = balanced_train_df['text']
y_train_bal = balanced_train_df['label']
# Extract 'text' and 'label' columns from test_df for testing
X_test = test_df['text']
y_test = test_df['label']
# Print the shape of training and test data
print(X_train_bal.shape, X_test.shape, y_train_bal.shape, y_test.shape)

(173985,) (76073,) (173985,) (76073,)



---

<b> 3. Modeling using Pipeline and GridSearch </b>

Each model is wrapped in a Pipeline (CountVectorizer followed by the classifier) to streamline text classification, and a grid search with 5-fold cross-validation is used to find the best hyperparameters for both the CountVectorizer and the model.
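
The cost of a grid search grows multiplicatively with the number of values per hyperparameter. As a quick check, the sketch below enumerates the Naive Bayes grid used in this notebook with `itertools.product` and reproduces the candidate and fit counts reported by the search.

```python
from itertools import product

# Parameter grid copied from the Naive Bayes search in this notebook
param_grid_nb = {
    'cvec__max_features': [4000, 5000],
    'cvec__min_df': [2, 3, 4],
    'cvec__max_df': [0.4, 0.6, 0.8],
    'cvec__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'nb__alpha': [0.1, 0.2, 0.3],
}
cv_folds = 5

# Every combination of values is one candidate; each candidate is fitted once per fold
candidates = [dict(zip(param_grid_nb, values)) for values in product(*param_grid_nb.values())]
n_fits = len(candidates) * cv_folds
print(len(candidates), n_fits)  # 162 810
```

Adding one more value to any single hyperparameter would add a full extra slice of candidates, which is why the grids are kept small.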

<br>
<b> (a) Naive Bayes model (MultinomialNB) </b>

In [11]:
# Create copies of the train and test data for Naive Bayes model
X_train_nb = X_train_bal.copy()
y_train_nb = y_train_bal.copy()

X_test_nb = X_test.copy()
y_test_nb = y_test.copy()

In [12]:
# Create a pipeline with CountVectorizer and MultinomialNB
pipeline_nb = Pipeline([
    ('cvec', CountVectorizer(lowercase=True, stop_words= 'english')),
    ('nb', MultinomialNB())
])

# Define parameter grid for grid search
param_grid_nb = {
    'cvec__max_features': [4000, 5000],
    'cvec__min_df': [2, 3, 4],
    'cvec__max_df': [0.4, 0.6, 0.8],
    'cvec__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'nb__alpha': [0.1, 0.2, 0.3],
}

# Perform grid search with cross-validation
grid_search_nb = GridSearchCV(pipeline_nb, param_grid_nb, cv=5, scoring='accuracy', verbose=1)
start_time = time.time()
grid_search_nb.fit(X_train_nb, y_train_nb)
end_time = time.time()

# Print best parameters
best_params_nb = grid_search_nb.best_params_
print("Best Parameters:")
print(best_params_nb)

# Print computational time
print(f"Grid Search took {end_time - start_time:.2f} seconds")

# Print accuracy score for the test set
y_pred_nb_test = grid_search_nb.best_estimator_.predict(X_test_nb)
accuracy_nb_test = accuracy_score(y_test_nb, y_pred_nb_test)
print("Accuracy Score on Test Set:", accuracy_nb_test)

# Print accuracy score for the training set
y_pred_nb_train = grid_search_nb.best_estimator_.predict(X_train_nb)
accuracy_nb_train = accuracy_score(y_train_nb, y_pred_nb_train)
print("Accuracy Score on Training Set:", accuracy_nb_train)

# Print classification report for the test set
classification_report_nb = classification_report(y_test_nb, y_pred_nb_test, target_names = ["Sadness", "Joy", "Love", "Anger", "Fear"])
print("Classification Report on Test Set:")
print(classification_report_nb)

Fitting 5 folds for each of 162 candidates, totalling 810 fits
Best Parameters:
{'cvec__max_df': 0.4, 'cvec__max_features': 5000, 'cvec__min_df': 2, 'cvec__ngram_range': (1, 2), 'nb__alpha': 0.2}
Grid Search took 1496.93 seconds
Accuracy Score on Test Set: 0.9286080475332904
Accuracy Score on Training Set: 0.9503462942207661
Classification Report on Test Set:
              precision    recall  f1-score   support

     Sadness       0.97      0.93      0.95     23630
         Joy       0.97      0.90      0.94     26942
        Love       0.74      0.96      0.84      5882
       Anger       0.91      0.95      0.93     10919
        Fear       0.88      0.96      0.92      8700

    accuracy                           0.93     76073
   macro avg       0.90      0.94      0.91     76073
weighted avg       0.93      0.93      0.93     76073



<br>
<b> (b) Logistic Regression model </b>

In [13]:
# Create copies of the train and test data for Logistic Regression model
X_train_lr = X_train_bal.copy()
y_train_lr = y_train_bal.copy()

X_test_lr = X_test.copy()
y_test_lr = y_test.copy()

In [14]:
# Create a pipeline with CountVectorizer and LogisticRegression
pipeline_lr = Pipeline([
    ('cvec', CountVectorizer(lowercase=True, stop_words='english')),
    ('lr', LogisticRegression(max_iter=15000, random_state=42))
])

# Define parameter grid for grid search
param_grid_lr = {
    'cvec__max_features': [4000, 5000],
    'cvec__min_df': [2, 3, 4],
    'cvec__max_df': [0.40, 0.60],
    'cvec__ngram_range': [(1, 2), (1, 3)],
    'lr__C': [0.02, 0.2],
}

# Perform grid search with cross-validation
grid_search_lr = GridSearchCV(pipeline_lr, param_grid_lr, cv=5, scoring='accuracy', verbose=1)
start_time_lr = time.time()
grid_search_lr.fit(X_train_lr, y_train_lr)
end_time_lr = time.time()

# Print best parameters
best_params_lr = grid_search_lr.best_params_
print("Best Parameters for Logistic Regression:")
print(best_params_lr)

# Print computational time
print(f"Grid Search took {end_time_lr - start_time_lr:.2f} seconds")

# Print accuracy score for the test set
y_pred_lr_test = grid_search_lr.best_estimator_.predict(X_test_lr)
accuracy_lr_test = accuracy_score(y_test_lr, y_pred_lr_test)
print("Accuracy Score on Test Set (Logistic Regression):", accuracy_lr_test)

# Print accuracy score for the training set
y_pred_lr_train = grid_search_lr.best_estimator_.predict(X_train_lr)
accuracy_lr_train = accuracy_score(y_train_lr, y_pred_lr_train)
print("Accuracy Score on Training Set (Logistic Regression):", accuracy_lr_train)

# Print classification report for the test set
classification_report_lr = classification_report(y_test_lr, y_pred_lr_test, target_names = ["Sadness", "Joy", "Love", "Anger", "Fear"])
print("Classification Report on Test Set (Logistic Regression):")
print(classification_report_lr)

Fitting 5 folds for each of 48 candidates, totalling 240 fits
Best Parameters for Logistic Regression:
{'cvec__max_df': 0.4, 'cvec__max_features': 5000, 'cvec__min_df': 2, 'cvec__ngram_range': (1, 2), 'lr__C': 0.2}
Grid Search took 1043.01 seconds
Accuracy Score on Test Set (Logistic Regression): 0.9438039777582059
Accuracy Score on Training Set (Logistic Regression): 0.9683765841882921
Classification Report on Test Set (Logistic Regression):
              precision    recall  f1-score   support

     Sadness       0.97      0.94      0.96     23630
         Joy       0.97      0.93      0.95     26942
        Love       0.81      0.98      0.89      5882
       Anger       0.93      0.95      0.94     10919
        Fear       0.92      0.96      0.94      8700

    accuracy                           0.94     76073
   macro avg       0.92      0.95      0.93     76073
weighted avg       0.95      0.94      0.94     76073



<br>
<b> (c) XGBoost Model </b>

In [15]:
# Create copies of the train and test data for XGBoost model
X_train_xg = X_train_bal.copy()
y_train_xg = y_train_bal.copy()

X_test_xg = X_test.copy()
y_test_xg = y_test.copy()

In [16]:
# Create a pipeline with CountVectorizer and XGBClassifier
pipeline_xgb = Pipeline([
    ('cvec', CountVectorizer(lowercase=True, stop_words= 'english')),
    ('xgb', XGBClassifier(objective='multi:softmax', num_class=5, random_state=42))
])

# Define parameter grid for grid search
param_grid_xgb = {
    'cvec__max_features': [4000, 5000],
    'cvec__min_df': [2, 3, 4],
    'cvec__max_df': [0.4, 0.6, 0.8],
    'cvec__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'xgb__learning_rate': [0.01, 0.2],
}

# Perform grid search with cross-validation
grid_search_xgb = GridSearchCV(pipeline_xgb, param_grid_xgb, cv=5, scoring='accuracy', verbose=1)
start_time = time.time()
grid_search_xgb.fit(X_train_xg, y_train_xg)
end_time = time.time()

# Print best parameters
best_params_xgb = grid_search_xgb.best_params_
print("Best Parameters:")
print(best_params_xgb)

# Print computational time
print(f"Grid Search took {end_time - start_time:.2f} seconds")

# Print accuracy score for the test set
y_pred_xgb_test = grid_search_xgb.best_estimator_.predict(X_test_xg)
accuracy_xgb_test = accuracy_score(y_test_xg, y_pred_xgb_test)
print("Accuracy Score on Test Set:", accuracy_xgb_test)

# Print accuracy score for the training set
y_pred_xgb_train = grid_search_xgb.best_estimator_.predict(X_train_xg)
accuracy_xgb_train = accuracy_score(y_train_xg, y_pred_xgb_train)
print("Accuracy Score on Training Set:", accuracy_xgb_train)

# Print classification report for the test set
classification_report_xgb = classification_report(y_test_xg, y_pred_xgb_test, target_names = ["Sadness", "Joy", "Love", "Anger", "Fear"])
print("Classification Report on Test Set:")
print(classification_report_xgb)

Fitting 5 folds for each of 108 candidates, totalling 540 fits
Best Parameters:
{'cvec__max_df': 0.8, 'cvec__max_features': 5000, 'cvec__min_df': 2, 'cvec__ngram_range': (1, 2), 'xgb__learning_rate': 0.2}
Grid Search took 2010.38 seconds
Accuracy Score on Test Set: 0.9266888383526349
Accuracy Score on Training Set: 0.9475299594792654
Classification Report on Test Set:
              precision    recall  f1-score   support

     Sadness       0.95      0.91      0.93     23630
         Joy       0.96      0.92      0.94     26942
        Love       0.78      0.99      0.87      5882
       Anger       0.93      0.92      0.92     10919
        Fear       0.89      0.96      0.92      8700

    accuracy                           0.93     76073
   macro avg       0.90      0.94      0.92     76073
weighted avg       0.93      0.93      0.93     76073



---

<b> 4. Evaluate Classification Models </b>

<br>

(a) Naive Bayes

|Class|Train Accuracy|Test Precision|Test Recall|Test F1-Score|Test Accuracy|Support|Grid Search Time (s)|
|---|---|---|---|---|---|---|---|
|Class 0: Sadness|---|0.97|0.93|0.95|---|23630|---|
|Class 1: Joy|---|0.97|0.90|0.94|---|26942|---|
|Class 2: Love|---|0.74|0.96|0.84|---|5882|---|
|Class 3: Anger|---|0.91|0.95|0.93|---|10919|---|
|Class 4: Fear|---|0.88|0.96|0.92|---|8700|---|
|Overall|0.95|---|---|---|0.93|76073|1497|

<br>

<br>

(b) Logistic Regression

|Class|Train Accuracy|Test Precision|Test Recall|Test F1-Score|Test Accuracy|Support|Grid Search Time (s)|
|---|---|---|---|---|---|---|---|
|Class 0: Sadness|---|0.97|0.94|0.96|---|23630|---|
|Class 1: Joy|---|0.97|0.93|0.95|---|26942|---|
|Class 2: Love|---|0.81|0.98|0.89|---|5882|---|
|Class 3: Anger|---|0.93|0.95|0.94|---|10919|---|
|Class 4: Fear|---|0.92|0.96|0.94|---|8700|---|
|Overall|0.97|---|---|---|0.94|76073|1043|

<br>

(c) XGBoost

|Class|Train Accuracy|Test Precision|Test Recall|Test F1-Score|Test Accuracy|Support|Grid Search Time (s)|
|---|---|---|---|---|---|---|---|
|Class 0: Sadness|---|0.95|0.91|0.93|---|23630|---|
|Class 1: Joy|---|0.96|0.92|0.94|---|26942|---|
|Class 2: Love|---|0.78|0.99|0.87|---|5882|---|
|Class 3: Anger|---|0.93|0.92|0.92|---|10919|---|
|Class 4: Fear|---|0.89|0.96|0.92|---|8700|---|
|Overall|0.95|---|---|---|0.93|76073|2010|

<br>

# Emotion Classification Models Evaluation

We have built three models here to classify text into five different emotion classes: Sadness, Joy, Love, Anger, and Fear. The models are evaluated based on their F1-score, accuracy, and efficiency (wall-clock time of each model's grid search).

## Evaluation Metrics

1. **Accuracy**: This is the fraction of test texts assigned the correct class out of the five possible classes. It gives a single headline figure, but because the test set keeps the original class imbalance, it is dominated by the larger classes (Sadness and Joy), so we read it alongside the per-class F1-scores.

2. **F1-Score**: This metric is crucial for this task because it balances precision and recall. In the context of emotion classification, precision means the percentage of correct predictions for a particular emotion out of all predictions for that emotion, while recall is the percentage of correct predictions for a particular emotion out of all actual instances of that emotion. The F1-score is particularly useful if we want to have a balance between identifying as many instances of each emotion as possible (high recall) and keeping the number of incorrect predictions low (high precision).

3. **Efficiency**: This is important because we want the model to be practical to train and tune, especially if we’re dealing with large amounts of data. The time reported here is the wall-clock duration of each grid search, so it reflects tuning cost rather than prediction latency.
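
To make the F1-score concrete, the sketch below recomputes it for the ‘Love’ class from the rounded precision and recall values in the three classification reports above. The helper name `f1_score_from` is hypothetical; the formula is the standard harmonic mean.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded precision and recall for the 'Love' class, taken from the three
# classification reports in this notebook
love = {
    "Naive Bayes": (0.74, 0.96),
    "Logistic Regression": (0.81, 0.98),
    "XGBoost": (0.78, 0.99),
}
for model, (p, r) in love.items():
    print(f"{model}: F1 = {f1_score_from(p, r):.2f}")
```

In all three cases recall is high and precision is the weaker value, and the harmonic mean pulls the F1-score toward that weaker value.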

## Models Evaluation

### Naive Bayes
The model has a high overall accuracy (0.93) and high F1-scores for most classes, which suggests that it’s good at identifying the correct emotion class. However, the F1-score for the ‘Love’ class is relatively low (0.84): recall is high (0.96) but precision is only 0.74, which means the model labels a number of other texts as ‘Love’. Its grid search took 1497 seconds for 810 fits.

### Logistic Regression
This model has the highest overall accuracy (0.94) and the highest F1-score for every class, including ‘Love’ (0.89), suggesting that it’s the best at identifying the correct emotion class. Its grid search also finished in the shortest time (1043 seconds), although it covered 48 candidates compared with 162 for Naive Bayes, so the two runtimes are not directly comparable.

### XGBoost
This model has a similar overall accuracy and F1-score to the Naive Bayes model. However, its grid search was the slowest, at 2010 seconds. The F1-score for the ‘Love’ class (0.87) is higher than for the Naive Bayes model (0.84), but lower than for the Logistic Regression model (0.89).
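
Because the three grids differ in size, one rough way to compare training cost is the average time per fit. The sketch below uses the times and fit counts printed by the grid searches above. It is only an approximation: each total also includes the final refit on the full training set (the `GridSearchCV` default), and the cost of a fit varies with settings such as the n-gram range.

```python
# Grid-search wall-clock times (seconds) and fit counts reported earlier in this notebook
searches = {
    "Naive Bayes": (1496.93, 810),
    "Logistic Regression": (1043.01, 240),
    "XGBoost": (2010.38, 540),
}

seconds_per_fit = {name: secs / fits for name, (secs, fits) in searches.items()}
for name, value in sorted(seconds_per_fit.items(), key=lambda kv: kv[1]):
    print(f"{name}: {value:.2f} s per fit")
```

On this per-fit basis Naive Bayes is the cheapest of the three, even though its grid search took longer overall than the Logistic Regression search.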

### Conclusion
Based on these evaluations, the Logistic Regression model seems to be the best choice for this task. It has the highest accuracy and F1-scores, and its grid search completed in the shortest time (1043 seconds). Next, we will try a multi-class RNN classifier with GloVe word embeddings, as semantic understanding can be crucial for emotion classification tasks.