<center>
    <h1> Real-Time Emotion Detection with Kafka, Spark Streaming, and Machine Learning </h1>
    <h2> Data Preprocessing </h2>
    <h4> Ann Maria John, Divya Neelamegam, Kartik Mukkavilli, Poojitha Venkat Ram, Shruti Badrinarayanan </h4>
</center>

## Load the Pre-Split Data (CSV Files) into Separate DataFrames

In [1]:
import pandas as pd

# Load the datasets
training_data_path = 'Raw Data/training.csv'
validation_data_path = 'Raw Data/validation.csv'
test_data_path = 'Raw Data/test.csv'

# Read the CSV files
training_data = pd.read_csv(training_data_path)
validation_data = pd.read_csv(validation_data_path)
test_data = pd.read_csv(test_data_path)

# Display the first few rows of each dataset to understand the structure
(training_data.head(), validation_data.head(), test_data.head())

(                                                text  label
 0                            i didnt feel humiliated      0
 1  i can go from feeling so hopeless to so damned...      0
 2   im grabbing a minute to post i feel greedy wrong      3
 3  i am ever feeling nostalgic about the fireplac...      2
 4                               i am feeling grouchy      3,
                                                 text  label
 0  im feeling quite sad and sorry for myself but ...      0
 1  i feel like i am still looking at a blank canv...      0
 2                     i feel like a faithful servant      2
 3                  i am just feeling cranky and blue      3
 4  i can have for a treat or if i am feeling festive      1,
                                                 text  label
 0  im feeling rather rotten so im not very ambiti...      0
 1          im updating my blog because i feel shitty      0
 2  i never make her separate from me because i do...      0
 3  i left with my bou

In [2]:
training_data["label"].value_counts()

label
1    5362
0    4666
3    2159
4    1937
2    1304
5     572
Name: count, dtype: int64
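
The counts above show a clear class imbalance, which matters when reading the baseline results later in the notebook. As a minimal sketch, the snippet below re-enters the printed counts by hand (it does not read the CSV files) and computes each class's share and the ratio between the largest and smallest class. The variable names are illustrative and not part of the original notebook.

```python
import pandas as pd

# Counts copied by hand from the value_counts() output above
counts = pd.Series({1: 5362, 0: 4666, 3: 2159, 4: 1937, 2: 1304, 5: 572}, name="count")

# Share of each class in the training set
share = counts / counts.sum()

# Ratio between the most and least frequent class
imbalance_ratio = counts.max() / counts.min()

print(counts.sum())
print(share.round(3))
print(round(imbalance_ratio, 2))
```

The largest class (label 1) is a little over nine times as frequent as the smallest (label 5).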

In [3]:
training_data

Unnamed: 0,text,label
0,i didnt feel humiliated,0
1,i can go from feeling so hopeless to so damned...,0
2,im grabbing a minute to post i feel greedy wrong,3
3,i am ever feeling nostalgic about the fireplac...,2
4,i am feeling grouchy,3
...,...,...
15995,i just had a very brief time in the beanbag an...,0
15996,i am now turning and i feel pathetic that i am...,0
15997,i feel strong and good overall,1
15998,i feel like this was such a rude comment and i...,3


In [4]:
validation_data

Unnamed: 0,text,label
0,im feeling quite sad and sorry for myself but ...,0
1,i feel like i am still looking at a blank canv...,0
2,i feel like a faithful servant,2
3,i am just feeling cranky and blue,3
4,i can have for a treat or if i am feeling festive,1
...,...,...
1995,im having ssa examination tomorrow in the morn...,0
1996,i constantly worry about their fight against n...,1
1997,i feel its important to share this info for th...,1
1998,i truly feel that if you are passionate enough...,1


In [5]:
test_data

Unnamed: 0,text,label
0,im feeling rather rotten so im not very ambiti...,0
1,im updating my blog because i feel shitty,0
2,i never make her separate from me because i do...,0
3,i left with my bouquet of red and yellow tulip...,1
4,i was feeling a little vain when i did this one,0
...,...,...
1995,i just keep feeling like someone is being unki...,3
1996,im feeling a little cranky negative after this...,3
1997,i feel that i am useful to my people and that ...,1
1998,im feeling more comfortable with derby i feel ...,1


#### Drop Duplicates

In [6]:
training_data = training_data.drop_duplicates()
validation_data = validation_data.drop_duplicates()
test_data = test_data.drop_duplicates()

## Data Preprocessing
This notebook is a complete pipeline for preprocessing the text data used in the emotion detection project. It begins by downloading the required NLTK resources: WordNet (for lemmatization) and the Punkt tokenizer. The preprocess_text function lowercases the text, tokenizes it, removes English stop words, lemmatizes each token, and strips punctuation and digits. It also contains a negation step intended to turn "not happy" into "not_happy"; note that "not" is itself in scikit-learn's ENGLISH_STOP_WORDS, so as written it is dropped before that step runs. The function is applied to the training, validation, and test sets loaded from the CSV files.

Next, a TF-IDF vectorizer is configured to use uni-grams, bi-grams, and tri-grams and to ignore terms that appear in fewer than two documents. The vectorizer is fitted on the training data only and then used to transform all three datasets. Finally, the TF-IDF matrices are saved in sparse format (npz files) in a directory named 'Preprocessed Data', and the file paths are displayed for confirmation. The result is the feature representation used in the modeling stage.
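
The negation handling described above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the notebook's function: `mark_negations` is a hypothetical helper, it splits on whitespace instead of using NLTK, and `STOP_WORDS` is a tiny hand-picked set standing in for a real stop-word list. The point it shows is ordering: the check for "not" has to come before stop-word filtering, because stop-word lists commonly contain "not".

```python
# Tiny illustrative stop-word set (an assumption; real lists are much longer)
STOP_WORDS = {"i", "am", "the", "a", "not"}

def mark_negations(text):
    """Prefix the token after "not" with "not_", then drop stop words."""
    out = []
    negate = False
    for tok in text.lower().split():
        # Handle "not" first, before it can be filtered out as a stop word
        if tok == "not":
            negate = True
            continue
        if negate:
            out.append("not_" + tok)
            negate = False
        elif tok not in STOP_WORDS:
            out.append(tok)
    return " ".join(out)

print(mark_negations("i am not happy today"))
```

Because the underscore counts as a word character, a token such as `not_happy` survives both the punctuation-stripping regex used in this notebook and the default token pattern of `TfidfVectorizer`.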

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
import re
import scipy.sparse as sp
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import os

lemmatizer = WordNetLemmatizer()

# Define the preprocessing function including stop words removal
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Tokenize text
    tokens = word_tokenize(text)
    # Lemmatization and handling negations
    prev_word = ""
    processed_tokens = []
    for word in tokens:
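        # Note: "not" is itself in ENGLISH_STOP_WORDS, so it is skipped here
        # and the negation branch below is never reached as written.
        if word in ENGLISH_STOP_WORDS:
            continue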
        if word == "not":
            prev_word = "not_"
        else:
            if prev_word == "not_":
                word = prev_word + word
                prev_word = ""
            word = lemmatizer.lemmatize(word)
            # Remove punctuation and numbers
            word = re.sub(r'[^\w\s]', '', word)
            word = re.sub(r'\d+', '', word)
            processed_tokens.append(word)
    return ' '.join(processed_tokens)

# Reload the raw datasets (this discards the drop_duplicates() step above)
training_data = pd.read_csv('Raw Data/training.csv')
validation_data = pd.read_csv('Raw Data/validation.csv')
test_data = pd.read_csv('Raw Data/test.csv')

# Apply the preprocessing to the text data
training_data['text'] = training_data['text'].apply(preprocess_text)
validation_data['text'] = validation_data['text'].apply(preprocess_text)
test_data['text'] = test_data['text'].apply(preprocess_text)

# Initialize the TF-IDF vectorizer without max_features, so the vocabulary is limited only by min_df.
# Use uni-grams, bi-grams, and tri-grams, and ignore terms that appear in fewer than two documents.
tfidf_vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 3), min_df=2)

# Fit the vectorizer on the training text data and transform all datasets
tfidf_vectorizer.fit(training_data['text'])
training_data_tfidf = tfidf_vectorizer.transform(training_data['text'])
validation_data_tfidf = tfidf_vectorizer.transform(validation_data['text'])
test_data_tfidf = tfidf_vectorizer.transform(test_data['text'])

# Save the TF-IDF data as .npz files since they are in sparse format
preprocessed_data_dir = 'Preprocessed Data/'
os.makedirs(preprocessed_data_dir, exist_ok=True)

# Define file paths for the TF-IDF data
training_data_tfidf_file = os.path.join(preprocessed_data_dir, 'training_tfidf.npz')
validation_data_tfidf_file = os.path.join(preprocessed_data_dir, 'validation_tfidf.npz')
test_data_tfidf_file = os.path.join(preprocessed_data_dir, 'test_tfidf.npz')

# Save the TF-IDF data
sp.save_npz(training_data_tfidf_file, training_data_tfidf)
sp.save_npz(validation_data_tfidf_file, validation_data_tfidf)
sp.save_npz(test_data_tfidf_file, test_data_tfidf)

# Return the file paths for confirmation
(training_data_tfidf_file, validation_data_tfidf_file, test_data_tfidf_file)

[nltk_data] Downloading package wordnet to /Users/shruti/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /Users/shruti/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to /Users/shruti/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


('Preprocessed Data/training_tfidf.npz',
 'Preprocessed Data/validation_tfidf.npz',
 'Preprocessed Data/test_tfidf.npz')

In [15]:
tfidf_vectorizer.get_feature_names_out()

array(['aa', 'aa meeting', 'abandon', ..., 'zoom', 'zooming', 'zumba'],
      dtype=object)

## Baseline Modeling using Multinomial Naive Bayes on Preprocessed Data

In [10]:
import numpy as np
import os
import scipy.sparse as sp
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

# File paths
training_data_tfidf_file = 'Preprocessed Data/training_tfidf.npz'
validation_data_tfidf_file = 'Preprocessed Data/validation_tfidf.npz'
test_data_tfidf_file = 'Preprocessed Data/test_tfidf.npz'

# Load the data
training_data_tfidf = sp.load_npz(training_data_tfidf_file)
validation_data_tfidf = sp.load_npz(validation_data_tfidf_file)
test_data_tfidf = sp.load_npz(test_data_tfidf_file)

# Take the labels from the DataFrames loaded earlier (row order must match the TF-IDF matrices)
training_labels = training_data['label']
validation_labels = validation_data['label']
test_labels = test_data['label']

# Initialize the Naive Bayes classifier
nb_classifier = MultinomialNB()

# Train the classifier
nb_classifier.fit(training_data_tfidf, training_labels)

# Predict on validation and test data
validation_predictions = nb_classifier.predict(validation_data_tfidf)
test_predictions = nb_classifier.predict(test_data_tfidf)

# Evaluate the classifier on the validation set (the test predictions are not scored here)
print("Validation Set Performance:")
print(classification_report(validation_labels, validation_predictions))
print("Accuracy:", accuracy_score(validation_labels, validation_predictions))
print("\nConfusion Matrix:")
print(confusion_matrix(validation_labels, validation_predictions))

Validation Set Performance:
              precision    recall  f1-score   support

           0       0.72      0.94      0.82       550
           1       0.69      0.98      0.81       704
           2       1.00      0.14      0.25       178
           3       0.96      0.48      0.64       275
           4       0.90      0.49      0.63       212
           5       1.00      0.09      0.16        81

    accuracy                           0.74      2000
   macro avg       0.88      0.52      0.55      2000
weighted avg       0.80      0.74      0.69      2000

Accuracy: 0.739

Confusion Matrix:
[[518  30   0   1   1   0]
 [ 11 693   0   0   0   0]
 [ 35 118  25   0   0   0]
 [ 69  72   0 132   2   0]
 [ 55  50   0   4 103   0]
 [ 28  37   0   0   9   7]]
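
The per-class figures in the report can be recovered directly from the confusion matrix, which makes the pattern easier to see: classes 2 and 5 have perfect precision but very low recall, because most of their examples are predicted as class 0 or 1. The sketch below re-enters the printed matrix by hand, with rows as true labels and columns as predicted labels (scikit-learn's convention). The variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted)
cm = np.array([
    [518,  30,  0,   1,   1, 0],
    [ 11, 693,  0,   0,   0, 0],
    [ 35, 118, 25,   0,   0, 0],
    [ 69,  72,  0, 132,   2, 0],
    [ 55,  50,  0,   4, 103, 0],
    [ 28,  37,  0,   0,   9, 7],
])

# Recall: correct predictions divided by the number of true examples per class
recall = cm.diagonal() / cm.sum(axis=1)

# Precision: correct predictions divided by the number of predictions per class
precision = cm.diagonal() / cm.sum(axis=0)

# Accuracy: all correct predictions divided by all examples
accuracy = cm.trace() / cm.sum()

print(np.round(recall, 2))
print(np.round(precision, 2))
print(accuracy)
```

For example, only 25 of the 178 true class-2 examples are predicted as class 2, which gives the recall of 0.14 in the report.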


In [11]:
from joblib import dump

# Re-create and re-fit the classifier (this repeats the training step from the previous cell)
nb_classifier = MultinomialNB()
nb_classifier.fit(training_data_tfidf, training_labels)

# Save the model to a joblib file
model_filename = 'naive_bayes_classifier.joblib'
dump(nb_classifier, model_filename)

['naive_bayes_classifier.joblib']
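
The cell above saves only the classifier. To score new text later, the fitted TF-IDF vectorizer is needed as well, since the classifier expects the same feature columns it was trained on. The following is a minimal round-trip sketch on made-up data, assuming the vectorizer is persisted alongside the model. The toy texts, the temporary directory, and the file name `tfidf_vectorizer.joblib` are assumptions for illustration, not part of the original notebook.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data for illustration only
texts = ["feel sad lonely", "feel happy glad", "sad lonely night", "happy glad day"]
labels = [0, 1, 0, 1]

toy_vec = TfidfVectorizer()
toy_clf = MultinomialNB().fit(toy_vec.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file names; the notebook itself saves only the classifier
    vec_path = os.path.join(tmp, "tfidf_vectorizer.joblib")
    clf_path = os.path.join(tmp, "naive_bayes_classifier.joblib")
    dump(toy_vec, vec_path)
    dump(toy_clf, clf_path)

    # Reload both objects, as a separate scoring process would
    loaded_vec = load(vec_path)
    loaded_clf = load(clf_path)

# New text must go through the same fitted vectorizer before prediction
prediction = loaded_clf.predict(loaded_vec.transform(["feel happy"]))
print(prediction)
```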