<center>
    <h1> Real-Time Emotion Detection with Kafka, Spark Streaming, and Machine Learning </h1>
    <h2> Data Preprocessing </h2>
    <h4> Ann Maria John, Divya Neelamegam, Kartik Mukkavilli, Poojitha Venkat Ram, Shruti Badrinarayanan </h4>
</center>

## Naive Bayes Baseline Model

In [2]:
import numpy as np
import pandas as pd
import os
import scipy.sparse as sp
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

# File paths for TF-IDF data
training_data_tfidf_file = 'training_tfidf.npz'
validation_data_tfidf_file = 'validation_tfidf.npz'
test_data_tfidf_file = 'test_tfidf.npz'

# Load the original data with labels (rows are assumed to be in the same order as the TF-IDF matrices)
original_training_data = pd.read_csv('training.csv')  
original_validation_data = pd.read_csv('validation.csv') 
original_test_data = pd.read_csv('test.csv')  

# Extract labels from the original data
training_labels = original_training_data['label']
validation_labels = original_validation_data['label']
test_labels = original_test_data['label']

# Load the TF-IDF data
training_data_tfidf = sp.load_npz(training_data_tfidf_file)
validation_data_tfidf = sp.load_npz(validation_data_tfidf_file)
test_data_tfidf = sp.load_npz(test_data_tfidf_file)

# Initialize the Naive Bayes classifier
nb_classifier = MultinomialNB()

# Train the classifier
nb_classifier.fit(training_data_tfidf, training_labels)

# Predict on validation and test data
validation_predictions = nb_classifier.predict(validation_data_tfidf)
test_predictions = nb_classifier.predict(test_data_tfidf)

# Evaluate the classifier
print("Validation Set Performance:")
print(classification_report(validation_labels, validation_predictions))
print("Accuracy:", accuracy_score(validation_labels, validation_predictions))
print("\nConfusion Matrix:")
print(confusion_matrix(validation_labels, validation_predictions))

Validation Set Performance:
              precision    recall  f1-score   support

           0       0.72      0.94      0.82       550
           1       0.69      0.98      0.81       704
           2       1.00      0.14      0.25       178
           3       0.96      0.48      0.64       275
           4       0.90      0.49      0.63       212
           5       1.00      0.09      0.16        81

    accuracy                           0.74      2000
   macro avg       0.88      0.52      0.55      2000
weighted avg       0.80      0.74      0.69      2000

Accuracy: 0.739

Confusion Matrix:
[[518  30   0   1   1   0]
 [ 11 693   0   0   0   0]
 [ 35 118  25   0   0   0]
 [ 69  72   0 132   2   0]
 [ 55  50   0   4 103   0]
 [ 28  37   0   0   9   7]]
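
The per-class figures in the report can be recomputed directly from the confusion matrix, which makes the baseline's weakness easier to see. The sketch below copies the validation matrix printed above (rows are true labels, columns are predicted labels) and derives recall, precision, and accuracy. The variable names are illustrative and not part of the original notebook.

```python
import numpy as np

# Validation confusion matrix copied from the baseline output
# (rows = true labels 0-5, columns = predicted labels 0-5)
cm = np.array([
    [518,  30,  0,   1,   1, 0],
    [ 11, 693,  0,   0,   0, 0],
    [ 35, 118, 25,   0,   0, 0],
    [ 69,  72,  0, 132,   2, 0],
    [ 55,  50,  0,   4, 103, 0],
    [ 28,  37,  0,   0,   9, 7],
])

support = cm.sum(axis=1)           # true samples per class
predicted_counts = cm.sum(axis=0)  # predictions per class
correct = np.diag(cm)              # correctly classified samples per class

recall = correct / support
precision = correct / predicted_counts
accuracy = correct.sum() / cm.sum()

for label in range(cm.shape[0]):
    print(f"class {label}: precision={precision[label]:.2f} "
          f"recall={recall[label]:.2f} support={support[label]}")
print(f"accuracy={accuracy:.3f}")

# Share of all predictions that fall on the two largest classes
majority_share = predicted_counts[:2].sum() / cm.sum()
print(f"predicted as class 0 or 1: {majority_share:.1%}")
```

Classes 0 and 1 make up 1,254 of the 2,000 true labels but receive 1,716 of the 2,000 predictions, which is why recall for classes 2 and 5 is so low even though their precision is 1.00.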


## NLP Model for Textual Data: Naive Bayes and Hyperparameter Tuning

In [3]:
import numpy as np
import os
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

# Load your dataset
original_training_data = pd.read_csv('training.csv')  
original_validation_data = pd.read_csv('validation.csv') 
original_test_data = pd.read_csv('test.csv')  

# Assumes the dataset has a 'text' column for the input text and a 'label' column for the labels
text_data = original_training_data['text']
labels = original_training_data['label']

# Split training.csv 80/10/10 into training, validation, and test sets
# (validation.csv and test.csv loaded above are not used in this cell)
train_texts, test_texts, train_labels, test_labels = train_test_split(text_data, labels, test_size=0.2, random_state=42)
val_texts, test_texts, val_labels, test_labels = train_test_split(test_texts, test_labels, test_size=0.5, random_state=42)

# Convert text data to TF-IDF features
vectorizer = TfidfVectorizer(max_features=5000)  
train_tfidf = vectorizer.fit_transform(train_texts)
val_tfidf = vectorizer.transform(val_texts)
test_tfidf = vectorizer.transform(test_texts)

# Convert TF-IDF matrices to pandas DataFrames
train_tfidf_df = pd.DataFrame(train_tfidf.toarray(), columns=vectorizer.get_feature_names_out())
val_tfidf_df = pd.DataFrame(val_tfidf.toarray(), columns=vectorizer.get_feature_names_out())
test_tfidf_df = pd.DataFrame(test_tfidf.toarray(), columns=vectorizer.get_feature_names_out())

# Display the resulting DataFrames
print("Train TF-IDF DataFrame")
display(train_tfidf_df)

print("\nValidation TF-IDF DataFrame")
display(val_tfidf_df)

# Display the Test TF-IDF DataFrame
print("\nTest TF-IDF DataFrame")
display(test_tfidf_df)

# Initialize the Naive Bayes classifier
nb_classifier = MultinomialNB()

# Define the parameter grid to search
param_grid = {
    'alpha': [0.1, 0.5, 1.0, 2.0],
    'fit_prior': [True, False]
}

# Use GridSearchCV to find the best hyperparameters
grid_search = GridSearchCV(estimator=nb_classifier, param_grid=param_grid, scoring='accuracy', cv=5)
grid_search.fit(train_tfidf, train_labels)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("\nBest Hyperparameters:")
print(best_params)

# Train the classifier with the best hyperparameters
best_nb_classifier = MultinomialNB(alpha=best_params['alpha'], fit_prior=best_params['fit_prior'])
best_nb_classifier.fit(train_tfidf, train_labels)

# Predict on validation data
val_predictions = best_nb_classifier.predict(val_tfidf)

# Evaluate the classifier on the validation set
print("\nValidation Set Performance")
print(classification_report(val_labels, val_predictions))
print("Accuracy:", accuracy_score(val_labels, val_predictions))
print("\nConfusion Matrix:")
print(confusion_matrix(val_labels, val_predictions))

# Predict on test data
test_predictions = best_nb_classifier.predict(test_tfidf)

# Evaluate the classifier on the test set
print("\nTest Set Performance")
print(classification_report(test_labels, test_predictions))
print("Accuracy:", accuracy_score(test_labels, test_predictions))
print("\nConfusion Matrix:")
print(confusion_matrix(test_labels, test_predictions))

Train TF-IDF DataFrame


Unnamed: 0,aa,abandon,abandoned,abc,abdomen,abilities,ability,abit,able,about,...,yours,yourself,youth,youtube,youve,zealand,zero,zombie,zone,zumba
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12795,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12796,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12797,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12798,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0



Validation TF-IDF DataFrame


Unnamed: 0,aa,abandon,abandoned,abc,abdomen,abilities,ability,abit,able,about,...,yours,yourself,youth,youtube,youve,zealand,zero,zombie,zone,zumba
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1595,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1596,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1597,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1598,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0



Test TF-IDF DataFrame


Unnamed: 0,aa,abandon,abandoned,abc,abdomen,abilities,ability,abit,able,about,...,yours,yourself,youth,youtube,youve,zealand,zero,zombie,zone,zumba
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1595,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1596,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1597,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.088186,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1598,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0



Best Hyperparameters:
{'alpha': 0.5, 'fit_prior': False}

Validation Set Performance
              precision    recall  f1-score   support

           0       0.88      0.89      0.88       458
           1       0.87      0.89      0.88       528
           2       0.74      0.66      0.69       140
           3       0.82      0.85      0.84       221
           4       0.80      0.79      0.79       203
           5       0.62      0.46      0.53        50

    accuracy                           0.84      1600
   macro avg       0.79      0.76      0.77      1600
weighted avg       0.84      0.84      0.84      1600

Accuracy: 0.838125

Confusion Matrix:
[[406  12   5  16  14   5]
 [ 15 472  22  11   5   3]
 [  4  38  92   3   3   0]
 [ 14   7   1 188  10   1]
 [ 17   9   2  10 160   5]
 [  7   7   3   1   9  23]]

Test Set Performance
              precision    recall  f1-score   support

           0       0.88      0.89      0.88       488
           1       0.83      0.86      

#### GridSearchCV (5-fold cross-validation, scored on accuracy) selected {'alpha': 0.5, 'fit_prior': False} as the best hyperparameters for the Multinomial Naive Bayes classifier. Retrained with these settings on the training split, the model reaches an accuracy of approximately 83.8% on the validation split, with a weighted-average F1-score of 0.84 and a macro-average F1-score of 0.77. Per-class results are uneven: the two largest classes (0 and 1) each reach an F1-score of 0.88, while the smallest class (5, with 50 samples) reaches only 0.53 with a recall of 0.46. The confusion matrix shows that class 2 is most often confused with class 1 (38 of 140 samples).
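
The gap between the macro and weighted averages in the validation report comes from how each one weights the classes. As a small worked example, the sketch below recomputes both from the per-class F1-scores and supports printed in the report. The inputs are the rounded values from the report, so the results match only to two decimals.

```python
# Per-class F1-scores and supports copied from the validation report (classes 0-5)
f1_scores = [0.88, 0.88, 0.69, 0.84, 0.79, 0.53]
support = [458, 528, 140, 221, 203, 50]

# Macro average: every class counts equally, regardless of size
macro_f1 = sum(f1_scores) / len(f1_scores)

# Weighted average: each class counts in proportion to its support
weighted_f1 = sum(f * s for f, s in zip(f1_scores, support)) / sum(support)

print(f"macro F1:    {macro_f1:.2f}")
print(f"weighted F1: {weighted_f1:.2f}")
```

The weighted average is dominated by classes 0 and 1, which together account for 986 of the 1,600 validation samples, so it hides the weaker scores on the smaller classes that the macro average exposes.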

#### On the held-out test split, the model achieves an accuracy of approximately 82.6% and a weighted-average F1-score of 0.82. These figures are close to the validation results, which suggests that the tuned model generalizes to unseen data drawn from the same source. Overall, the Multinomial Naive Bayes model with the tuned hyperparameters gives consistent performance across the validation and test splits, though the validation results indicate it is weaker on the less frequent classes.
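
One caveat about the grid search: the `TfidfVectorizer` was fit on the whole training split before cross-validation, so the held-out part of each fold contributed to the vocabulary and IDF weights. A possible alternative, sketched below, is to wrap the vectorizer and classifier in a scikit-learn `Pipeline` so that both are refit inside every fold. The toy corpus, the step names `tfidf` and `nb`, and `cv=2` are assumptions made only to keep the example small. In the notebook the inputs would be `train_texts` and `train_labels`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus invented for illustration only
toy_texts = [
    "i feel sad and lonely", "i am so sad today",
    "feeling down and sad", "sad and tired again",
    "i feel happy and excited", "such a happy day",
    "feeling joyful and happy", "happy and grateful today",
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Vectorizer and classifier are fit together on each training fold
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000)),
    ("nb", MultinomialNB()),
])

# Same grid as in the notebook, addressed through the "nb" step name
param_grid = {
    "nb__alpha": [0.1, 0.5, 1.0, 2.0],
    "nb__fit_prior": [True, False],
}

grid_search = GridSearchCV(pipeline, param_grid=param_grid, scoring="accuracy", cv=2)
grid_search.fit(toy_texts, toy_labels)

print("Best hyperparameters:", grid_search.best_params_)
```

With this layout, `grid_search.best_estimator_` is a fitted pipeline that accepts raw text, so no separate `transform` call is needed before `predict`.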

In [8]:
import joblib  # sklearn.externals.joblib has been removed from scikit-learn; use the standalone joblib package

# Save the trained Naive Bayes classifier using joblib
model_filename = 'naive_bayes_model.joblib'
joblib.dump(best_nb_classifier, model_filename)
print(f"\nNaive Bayes Model saved as {model_filename}")

# Load the saved Naive Bayes model using joblib
loaded_nb_classifier = joblib.load(model_filename)
print("\nNaive Bayes Model loaded successfully!")

# Example: Predict using the loaded model
loaded_predictions = loaded_nb_classifier.predict(test_tfidf)

# Evaluate the loaded classifier on the test set
print("\nLoaded Model Performance on Test Set")
print(classification_report(test_labels, loaded_predictions))
print("Accuracy:", accuracy_score(test_labels, loaded_predictions))
print("\nConfusion Matrix:")
print(confusion_matrix(test_labels, loaded_predictions))
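
A saved classifier on its own cannot score raw text: new messages must first be transformed by the same fitted `TfidfVectorizer` that produced the training features. One way to keep the two together, sketched here under assumptions, is to persist them as a single pipeline with the standalone `joblib` package. The toy corpus and the file name `nb_pipeline.joblib` are hypothetical, and the hyperparameters are the tuned values reported above.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus and labels, invented for illustration only
texts = [
    "i feel so sad and alone",
    "i am feeling really happy today",
    "this makes me feel sad",
    "what a happy and joyful day",
]
labels = [0, 1, 0, 1]

# Bundle the vectorizer and classifier so they are saved and loaded as one object
pipeline = make_pipeline(
    TfidfVectorizer(),
    MultinomialNB(alpha=0.5, fit_prior=False),
)
pipeline.fit(texts, labels)

new_texts = ["i feel happy", "so sad today"]
original_predictions = pipeline.predict(new_texts)

# Round-trip through disk in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "nb_pipeline.joblib")  # hypothetical file name
    joblib.dump(pipeline, path)
    loaded_pipeline = joblib.load(path)

# The loaded pipeline accepts raw text directly
loaded_predictions = loaded_pipeline.predict(new_texts)
print(loaded_predictions)
```

The loaded pipeline should reproduce the predictions of the in-memory one, which is a quick check that both the vocabulary and the classifier survived the round trip.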