# Sentiment Analysis of Emotions

## Introduction

The aim of this project is to classify emotions from text data using four machine learning models: Naive Bayes, Logistic Regression, Random Forest, and Support Vector Machine (SVM). Sentiment analysis, in the context of emotion classification, involves categorizing text based on the emotional tone it conveys. This has wide applications in customer service, social media monitoring, and emotional intelligence systems.

#### Models Overview

* Naive Bayes: A probabilistic classifier based on Bayes' Theorem, Naive Bayes assumes that the features (words) in the data are independent given the class (emotion). It is fast to train and copes well with high-dimensional, sparse data like text, which makes it a common baseline for text classification tasks.

* Logistic Regression: A linear model for binary and multi-class classification that estimates the probability that a given input belongs to each class. It is a simple, well-known algorithm often used in text classification, and it works best when the log-odds of each class are roughly linear in the features.

* Random Forest: An ensemble learning method that constructs multiple decision trees and merges them to get a more accurate and stable prediction. It works well for complex datasets and is less prone to overfitting compared to a single decision tree. Random Forest is known for handling a large number of features, making it suitable for text data with many dimensions.

* Support Vector Machine (SVM): A classifier that finds the maximum-margin hyperplane separating the classes in a high-dimensional space. SVM can handle both linearly and non-linearly separable data through the kernel trick, which makes it a good fit for text data where class separation is complex.
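
As a worked form of the independence assumption described for Naive Bayes above, the predicted emotion $\hat{c}$ for a text with words $w_1, \dots, w_n$ is the class that maximizes the class prior times the per-word likelihoods. The symbols here are introduced only for illustration and are not used elsewhere in this notebook.

$$
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
$$

In practice the product is usually evaluated as a sum of logarithms to avoid numerical underflow:

$$
\hat{c} = \arg\max_{c} \left[ \log P(c) + \sum_{i=1}^{n} \log P(w_i \mid c) \right]
$$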

#### Evaluation Metrics
In this project, we evaluate and compare the performance of the models using the following metrics:
* Accuracy: The percentage of correctly classified instances out of all instances.

* Precision: The ratio of correctly predicted positive instances to the total predicted positive instances, which indicates how reliable the model’s positive predictions are.

* Recall (Sensitivity): The ratio of correctly predicted positive instances to all actual positive instances, indicating how well the model identifies positive instances.

* F1 Score: The harmonic mean of precision and recall, providing a single metric to evaluate the balance between them.

* Confusion Matrix: A matrix that shows the true vs. predicted classifications for each emotion label, offering insight into model performance for each class.

We will evaluate these models on emotion-labeled datasets and compare their performance based on these metrics to identify the best-performing model for emotion classification.
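
To make the metric definitions above concrete, here is a minimal sketch that computes them by hand for a single emotion treated as the "positive" class. The counts are made-up numbers chosen for illustration, not results from this project.

```python
# Hypothetical one-vs-rest counts for a single emotion class (illustrative only)
tp, fp, fn, tn = 40, 10, 20, 130

accuracy = (tp + tn) / (tp + fp + fn + tn)           # correct predictions / all predictions
precision = tp / (tp + fp)                           # how many predicted positives were right
recall = tp / (tp + fn)                              # how many actual positives were found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

For a multi-class problem such as this one, these per-class values are what `classification_report` lists on each row.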

### 1. Importing Libraries and Loading Data

#### 1.1 Importing Libraries

In [None]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import seaborn as sns
from bs4 import BeautifulSoup
from wordcloud import WordCloud

# Libraries for NLP and model building
import nltk
from nltk.corpus import stopwords

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

#### 1.2 Importing Dataset

In [None]:
# Load the train.txt dataset (one "text;emotion" pair per line, no header row)
df = pd.read_csv('./dataset/train.txt', names=['Text', 'Emotion'], sep=';')

In [None]:
# Display the first few rows of the dataset
df.head()

In [None]:
# Print the shape of the dataset (number of rows and columns)
print(df.shape)
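
The loading step above relies on each line having the form `text;emotion`. As a small sketch of how `names` and `sep` interact, the snippet below parses an in-memory sample in that layout; the three sample sentences and the `sample_df` name are invented for illustration.

```python
from io import StringIO
import pandas as pd

# Hypothetical sample in the same "text;emotion" layout assumed by the notebook
sample = StringIO(
    "i feel so happy today;joy\n"
    "i am scared of the dark;fear\n"
    "i feel hopeless;sadness\n"
)

# names= supplies the column headers because the data has no header row
sample_df = pd.read_csv(sample, names=['Text', 'Emotion'], sep=';')
print(sample_df.shape)
print(sample_df['Emotion'].tolist())
```

One caveat of this layout is that a sentence containing a literal `;` would be split into extra fields, so it may be worth checking the real file for such rows.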

### 2. Data Preprocessing

#### 2.1 Cleaning the Data

Remove HTML tags, URL patterns, other unwanted patterns, special characters, numbers, and stopwords.

In [None]:
# Download the NLTK stopword list once (does nothing if it is already present)
nltk.download('stopwords', quiet=True)

# Build the stopword set once instead of rebuilding it on every call
STOP_WORDS = set(stopwords.words('english'))

# Function to clean the text
def clean_text(text):
    # Remove HTML tags using BeautifulSoup (in case some tags are still in text form)
    text = BeautifulSoup(text, "html.parser").get_text()

    # Remove any URL patterns that start with http:// or https://
    text = re.sub(r'http[s]?://\S+', '', text)

    # Remove leftover HTML attribute names like href, src, etc.
    # Note: this also drops these words when they appear in ordinary sentences
    text = re.sub(r'\b(?:href|src|alt|title|class|id|style|rel|data)\b', '', text)

    # Remove special characters and numbers (keeping only alphabets and spaces)
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # Convert to lowercase
    text = text.lower()

    # Remove English stopwords
    text = ' '.join(word for word in text.split() if word not in STOP_WORDS)

    return text

# Apply the cleaning function to the 'Text' column
df['Cleaned_Text'] = df['Text'].apply(clean_text)
# Save the cleaned DataFrame to a CSV file
df.to_csv('./dataset/cleaned_data.csv', index=False)

In [None]:
# Display cleaned text
df[['Text', 'Cleaned_Text']].head()
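
To see what the regex steps in the cleaning function do to a single sentence, here is a stripped-down sketch. It is an approximation rather than the function used above: it skips the BeautifulSoup step, and `STOP_WORDS_SKETCH` is a tiny hypothetical stand-in for the NLTK stopword list so the example runs without downloading a corpus.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list (kept tiny on purpose)
STOP_WORDS_SKETCH = {"i", "the", "a", "at", "my", "was", "and", "is"}

def clean_text_sketch(text):
    """Regex-only approximation of clean_text (no HTML parsing)."""
    text = re.sub(r'http[s]?://\S+', '', text)   # drop http/https URLs
    text = re.sub(r'\b(?:href|src|alt|title|class|id|style|rel|data)\b', '', text)
    text = re.sub(r'[^a-zA-Z\s]', '', text)      # keep letters and whitespace only
    text = text.lower()
    return ' '.join(w for w in text.split() if w not in STOP_WORDS_SKETCH)

cleaned = clean_text_sketch("I was SO nervous at my 1st class!! see http://example.com")
print(cleaned)
```

This prints `so nervous st see`: the URL and punctuation are gone, but so is the ordinary word "class" (it matches the attribute-name pattern), and "1st" is reduced to "st". Both side effects are worth keeping in mind when reading the word clouds later.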

#### 2.2 Emotion Distribution using Pie Chart

In [None]:
# Display a pie chart of emotion counts
emotion_counts = df['Emotion'].value_counts()

# Create the pie chart
plt.figure(figsize=(8, 6))
plt.pie(emotion_counts, labels=emotion_counts.index, autopct='%1.1f%%', startangle=140, colors=sns.color_palette("Set3", len(emotion_counts)))
plt.title('Emotion Distribution in Dataset')
plt.axis('equal')  # Equal aspect ratio ensures that pie chart is circular.
plt.show()

#### 2.3 Emotion Counts using Bar Chart

In [None]:
# Display emotion counts
emotion_counts = df['Emotion'].value_counts()

# Convert the emotion counts to a DataFrame for use in seaborn
emotion_counts_df = emotion_counts.reset_index()
emotion_counts_df.columns = ['Emotion', 'Count']

# Bar chart with hue assigned
plt.figure(figsize=(8, 6))
sns.barplot(x='Emotion', y='Count', data=emotion_counts_df, palette="Set3", hue='Emotion')
plt.title('Emotion Count Distribution')
plt.xlabel('Emotion')
plt.ylabel('Count')
plt.show()

#### 2.4 Word Cloud

In [None]:
# Generate word clouds for each emotion category
unique_emotions = df['Emotion'].unique()

# Dynamically calculate the number of rows and columns for the subplots
num_emotions = len(unique_emotions)
cols = 2  # Fixed number of columns (you can adjust this)
rows = np.ceil(num_emotions / cols).astype(int)  # Calculate the required rows

# Create a plot for each emotion category
plt.figure(figsize=(10, 5 * rows))
for i, emotion in enumerate(unique_emotions, 1):
    # Filter the DataFrame for the current emotion
    emotion_text = df[df['Emotion'] == emotion]['Cleaned_Text'].str.cat(sep=' ')

    # Generate the word cloud
    wordcloud = WordCloud(width=500, height=500, background_color='white', max_words=200).generate(emotion_text)

    # Display the word cloud
    plt.subplot(rows, cols, i)  # Dynamically adjust subplots based on the number of categories
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title(f'Word Cloud for {emotion}')

plt.tight_layout()
plt.show()

#### 2.5 Label Encoding

In [None]:
# Convert categorical emotions to numerical values
label_encoder = LabelEncoder()
df['Emotion_Label'] = label_encoder.fit_transform(df['Emotion'])

# Display the mapping of labels
print(dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_))))

In [None]:
df[['Cleaned_Text', 'Emotion', 'Emotion_Label']].head()
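
`LabelEncoder` sorts the distinct class names and numbers them from 0. The plain-Python sketch below mimics that behavior on a made-up list of labels, which can help when reading the printed mapping above; the variable names are illustrative only.

```python
# Hypothetical label column containing the six emotion names
emotions = ['sadness', 'joy', 'anger', 'joy', 'fear', 'love', 'surprise']

# Sorted distinct names play the role of label_encoder.classes_
classes = sorted(set(emotions))
to_label = {name: idx for idx, name in enumerate(classes)}

encoded = [to_label[e] for e in emotions]   # like fit_transform
decoded = [classes[i] for i in encoded]     # like inverse_transform

print(to_label)
print(encoded)
```

Because the ordering is alphabetical, `anger` maps to 0 and `surprise` to 5, which is the same order the hard-coded `emotion_labels` dictionary in section 4 relies on.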

### 3. Splitting the Data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df['Cleaned_Text'], df['Emotion_Label'], test_size = 0.2, random_state=42)
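
The split above is purely random. Since the emotion classes are not equally frequent, one optional variation (not what this notebook does) is to pass `stratify` so that the train and test sets keep the same class proportions. The sketch below shows the effect on made-up, imbalanced labels.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 samples of class 0 and 20 of class 1
texts = [f"doc {i}" for i in range(100)]
labels = [0] * 80 + [1] * 20

# stratify=labels keeps the 80/20 class ratio in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print("train:", Counter(y_tr))
print("test: ", Counter(y_te))
```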

#### 3.1 TF-IDF Vectorization

In [None]:
tfidfvectorizer = TfidfVectorizer()
X_train_tfidf = tfidfvectorizer.fit_transform(X_train)
X_test_tfidf = tfidfvectorizer.transform(X_test)

In [None]:
# Display the shape of the transformed features
print("Shape of X_train:", X_train_tfidf.shape)
print("Shape of X_test:", X_test_tfidf.shape)
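
To give a feel for the numbers inside the TF-IDF matrix, here is a hand-rolled sketch on three made-up documents. It assumes the weighting that `TfidfVectorizer` documents for its default settings (raw term counts, smoothed idf of `ln((1 + n) / (1 + df)) + 1`, then L2 normalization per row) and uses a plain whitespace split instead of the vectorizer's own tokenizer.

```python
import math

# Hypothetical, already-cleaned documents
docs = ["feel happy happy", "feel sad", "feel scared sad"]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(docs)

# Document frequency: in how many documents each word appears
doc_freq = {w: sum(w in doc for doc in tokenized) for w in vocab}

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}

def tfidf_row(tokens):
    """Raw term count times idf, scaled to unit Euclidean length."""
    weights = [tokens.count(w) * idf[w] for w in vocab]
    norm = math.sqrt(sum(x * x for x in weights))
    return [x / norm for x in weights]

matrix = [tfidf_row(doc) for doc in tokenized]
for word, column in zip(vocab, zip(*matrix)):
    print(word, [round(x, 3) for x in column])
```

Words that appear in every document, such as "feel" here, get the smallest idf, so rarer and more distinctive words dominate each row.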

### 4. Machine Learning Algorithms and Training the Model

#### 4.1 Training and Classification Reports

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Define emotion labels mapping (must match the alphabetical order produced by LabelEncoder in section 2.5)
emotion_labels = {0: 'anger', 1: 'fear', 2: 'joy', 3: 'love', 4: 'sadness', 5: 'surprise'}

# Define classifiers
classifier = {
    'MultinomialNB': MultinomialNB(),
    'Logistic Regression': LogisticRegression(),
    'Random Forest': RandomForestClassifier(),
    'Support Vector Machine': SVC(),
}

# Initialize a dictionary to store results for plotting
accuracy_results = {}

# Train each classifier, then report its accuracy and per-class metrics
for name, clf in classifier.items():
    # Fit the classifier
    clf.fit(X_train_tfidf, y_train)
    
    # Predict using the model
    y_pred_tfidf = clf.predict(X_test_tfidf)
    
    # Calculate accuracy
    accuracy_tfidf = accuracy_score(y_test, y_pred_tfidf)
    accuracy_results[name] = accuracy_tfidf  # Store the accuracy
    
    # Print classification report
    print(f"\n============{name}============")
    print(f"Accuracy: {accuracy_tfidf}")
    print("Classification Report")
    print(classification_report(y_test, y_pred_tfidf, target_names=[emotion_labels[i] for i in range(6)], zero_division=0))

#### 4.2 Accuracy

In [None]:
# Plot bar chart for accuracy results
plt.figure(figsize=(8, 6))
plt.bar(accuracy_results.keys(), accuracy_results.values(), color='skyblue')
plt.title('Accuracy of Classifiers')
plt.xlabel('Classifier')
plt.ylabel('Accuracy')
plt.ylim(0, 1)
plt.show()
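
One way to put the accuracy bars in context is to compare them with a trivial baseline that always predicts the most frequent emotion. The sketch below computes that baseline for a made-up label list; the labels and counts are illustrative and are not taken from this dataset.

```python
from collections import Counter

# Hypothetical test labels with one dominant class
y_true = ['joy'] * 6 + ['sadness'] * 3 + ['surprise'] * 1

# Accuracy obtained by always predicting the most common label
majority_label, majority_count = Counter(y_true).most_common(1)[0]
baseline_accuracy = majority_count / len(y_true)

print(majority_label, baseline_accuracy)
```

A classifier is only adding value to the extent that it beats this number, which is why the per-class precision and recall in the reports above are worth reading alongside accuracy.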

#### 4.3 Confusion Matrix

In [None]:
# Set up the plot for confusion matrix heatmaps
fig, axes = plt.subplots(2, 2, figsize=(14, 12))

for i, (name, clf) in enumerate(classifier.items()):
    # Predict using the model
    y_pred_tfidf = clf.predict(X_test_tfidf)
    
    # Confusion matrix
    cm = confusion_matrix(y_test, y_pred_tfidf)
    
    # Plot confusion matrix as a heatmap
    ax = axes[i // 2, i % 2]
    tick_labels = [emotion_labels[k] for k in range(6)]  # k avoids shadowing the loop index i
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False, ax=ax,
                xticklabels=tick_labels, yticklabels=tick_labels)
    ax.set_title(f'Confusion Matrix for {name}')
    ax.set_xlabel('Predicted')
    ax.set_ylabel('True')

# Adjust layout for confusion matrix subplots
plt.tight_layout()
plt.show()
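
The heatmaps above can also be read numerically. With rows as true labels and columns as predictions (the layout `confusion_matrix` returns), per-class recall is the diagonal entry divided by its row total and per-class precision is the diagonal entry divided by its column total. The sketch below uses a made-up 3-class matrix to show the arithmetic.

```python
# Hypothetical 3-class confusion matrix: rows are true labels, columns are predictions
labels = ['anger', 'joy', 'sadness']
cm_example = [
    [8, 1, 1],
    [2, 15, 3],
    [0, 2, 18],
]

per_class = {}
for idx, name in enumerate(labels):
    row_total = sum(cm_example[idx])                  # samples whose true label is `name`
    col_total = sum(row[idx] for row in cm_example)   # samples predicted as `name`
    per_class[name] = {
        'precision': cm_example[idx][idx] / col_total,
        'recall': cm_example[idx][idx] / row_total,
    }

for name, scores in per_class.items():
    print(f"{name}: precision={scores['precision']:.2f} recall={scores['recall']:.2f}")
```

Large off-diagonal cells point to pairs of emotions the model tends to confuse, which the aggregate accuracy alone does not reveal.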