
# **Formative Assessment: NLP - Emotion Classification in Text**

In [17]:
# Import necessary libraries
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score, f1_score

In [18]:
# Load the dataset
data = pd.read_csv('nlp_dataset.csv')
data.head()

Unnamed: 0,Comment,Emotion
0,i seriously hate one subject to death but now ...,fear
1,im so full of life i feel appalled,anger
2,i sit here to write i start to dig out my feel...,fear
3,ive been really angry with r and i feel like a...,joy
4,i feel suspicious if there is no one outside l...,fear


In [19]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5937 entries, 0 to 5936
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Comment  5937 non-null   object
 1   Emotion  5937 non-null   object
dtypes: object(2)
memory usage: 92.9+ KB


In [20]:
data.nunique()

Unnamed: 0,0
Comment,5934
Emotion,3


In [21]:
print(data.columns)
print(data['Emotion'].unique())

Index(['Comment', 'Emotion'], dtype='object')
['fear' 'anger' 'joy']


In [22]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [23]:
stop_words = set(stopwords.words('english'))

In [24]:
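def preprocess_text(text):
    """Lowercase, strip URLs and non-letters, tokenize, and drop stop words."""
    # Normalize case so "Happy" and "happy" map to the same token
    text = text.lower()

    # Remove URLs and every character that is not a lowercase letter or whitespace
    text = re.sub(r'http\S+|[^a-z\s]', '', text)

    # Tokenize the text
    tokens = word_tokenize(text)

    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]

    # Rejoin into a single string, the input format TfidfVectorizer expects
    return ' '.join(tokens)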

data['Cleaned_Comment'] = data['Comment'].apply(preprocess_text)

data[['Comment', 'Cleaned_Comment', 'Emotion']].head()

Unnamed: 0,Comment,Cleaned_Comment,Emotion
0,i seriously hate one subject to death but now ...,seriously hate one subject death feel reluctan...,fear
1,im so full of life i feel appalled,im full life feel appalled,anger
2,i sit here to write i start to dig out my feel...,sit write start dig feelings think afraid acce...,fear
3,ive been really angry with r and i feel like a...,ive really angry r feel like idiot trusting fi...,joy
4,i feel suspicious if there is no one outside l...,feel suspicious one outside like rapture happe...,fear
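
As a rough illustration of what the cleaning step does to a raw string, the sketch below reuses the regex from `preprocess_text` without NLTK. The whitespace split and the tiny `STAND_IN_STOP_WORDS` set are assumptions made here so the example runs without any downloads; they are not the NLTK tokenizer or stop-word list used above.

```python
import re

# Hypothetical stand-in for the NLTK English stop-word list (assumption)
STAND_IN_STOP_WORDS = {"i", "so", "of", "to", "the", "a", "is", "at"}


def preprocess_sketch(text):
    """Approximate preprocess_text using only the standard library."""
    text = text.lower()
    # Same pattern as above: drop URLs and anything that is not a-z or whitespace
    text = re.sub(r'http\S+|[^a-z\s]', '', text)
    # Whitespace split stands in for word_tokenize (assumption)
    tokens = text.split()
    return ' '.join(t for t in tokens if t not in STAND_IN_STOP_WORDS)


sample = "I'm SO full of life!! See http://example.com/x?y=1 at 9pm"
cleaned = preprocess_sketch(sample)
print(cleaned)
```

The apostrophe, punctuation, digits, and URL are removed before stop-word filtering, so a contraction such as `I'm` becomes the token `im`. Tokens in that form are usually not in a stop-word list, which is likely why `im` and `ive` remain in the cleaned comments above.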


In [25]:
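# Keep the 1000 terms with the highest corpus-wide frequency as features
tfidf_vectorizer = TfidfVectorizer(max_features=1000)
# Note: the vectorizer is fitted on the full dataset, before the train/test split,
# so the vocabulary and IDF weights also reflect the test comments and the
# scores reported later may be slightly optimistic.
X_tfidf = tfidf_vectorizer.fit_transform(data['Cleaned_Comment'])
# Dense view for inspection only; the sparse X_tfidf is what the models use
tfidf_df = pd.DataFrame(X_tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
tfidf_df.head()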

Unnamed: 0,ability,able,absolutely,accept,acceptable,accepted,across,act,actions,actually,...,wrong,wronged,wrote,year,years,yes,yesterday,yet,young,youre
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.440367,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
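
To make the numbers in this table less opaque, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The function name `tfidf_sketch` and the three toy comments are made up for illustration, and the whitespace split is a simplification of the vectorizer's own tokenizer.

```python
import math
from collections import Counter


def tfidf_sketch(docs):
    """Return (vocabulary, rows) with smoothed-IDF, L2-normalized TF-IDF weights."""
    tokenized = [d.split() for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(w for toks in tokenized for w in set(toks))
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = [tf[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows


toy_docs = ["feel afraid", "feel angry", "feel happy"]
vocab, rows = tfidf_sketch(toy_docs)
for doc, row in zip(toy_docs, rows):
    print(doc, [round(v, 3) for v in row])
```

In each toy row the word `feel`, which occurs in every comment, gets a lower weight than the emotion word that occurs only once in the corpus. That is the behaviour TF-IDF is chosen for here.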


In [26]:
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, data['Emotion'], test_size=0.2, random_state=42)
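
One possible variation, not what this notebook does, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn `Pipeline`, so that the TF-IDF vocabulary and IDF weights are learned from the training portion only. The sketch below uses a made-up toy corpus (`toy_texts`, `toy_labels`) purely to show the wiring; `stratify` is added as an assumption to keep class proportions equal across the split.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy data standing in for data['Cleaned_Comment'] and data['Emotion']
toy_texts = [
    "feel afraid dark", "scared nervous alone", "afraid scared night", "nervous afraid exam",
    "angry furious rude", "hate furious traffic", "angry hate noise", "furious angry boss",
    "happy glad sunny", "joyful happy friends", "glad joyful gift", "happy glad music",
]
toy_labels = ["fear"] * 4 + ["anger"] * 4 + ["joy"] * 4

# Split the raw text, not the already-vectorized matrix
txt_train, txt_test, lab_train, lab_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

# The vectorizer is fitted inside the pipeline, on the training texts only
pipeline = make_pipeline(TfidfVectorizer(max_features=1000), SVC(kernel='linear'))
pipeline.fit(txt_train, lab_train)
toy_predictions = pipeline.predict(txt_test)
print(list(toy_predictions))
```

With this layout, `pipeline.predict` applies the already-fitted vectorizer to unseen text, so nothing from the test comments influences the features.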

In [27]:
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

In [28]:
nb_predictions = nb_model.predict(X_test)

In [29]:
svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)

In [30]:
svm_predictions = svm_model.predict(X_test)

In [31]:
nb_report = classification_report(y_test, nb_predictions)
svm_report = classification_report(y_test, svm_predictions)
print("Naive Bayes Classification Report:\n", nb_report)
print("SVM Classification Report:\n", svm_report)

Naive Bayes Classification Report:
               precision    recall  f1-score   support

       anger       0.89      0.93      0.91       392
        fear       0.91      0.91      0.91       416
         joy       0.94      0.89      0.92       380

    accuracy                           0.91      1188
   macro avg       0.91      0.91      0.91      1188
weighted avg       0.91      0.91      0.91      1188

SVM Classification Report:
               precision    recall  f1-score   support

       anger       0.93      0.95      0.94       392
        fear       0.98      0.91      0.95       416
         joy       0.93      0.98      0.95       380

    accuracy                           0.95      1188
   macro avg       0.95      0.95      0.95      1188
weighted avg       0.95      0.95      0.95      1188



In [32]:
nb_accuracy = accuracy_score(y_test, nb_predictions)
nb_f1_score = f1_score(y_test, nb_predictions, average='weighted')

svm_accuracy = accuracy_score(y_test, svm_predictions)
svm_f1_score = f1_score(y_test, svm_predictions, average='weighted')

print(f"Naive Bayes - Accuracy: {nb_accuracy:.2f}, F1-Score: {nb_f1_score:.2f}")
print(f"SVM - Accuracy: {svm_accuracy:.2f}, F1-Score: {svm_f1_score:.2f}")

Naive Bayes - Accuracy: 0.91, F1-Score: 0.91
SVM - Accuracy: 0.95, F1-Score: 0.95
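
The `average='weighted'` option averages the per-class F1-scores using each class's support as the weight. As a sanity check, the sketch below recomputes that average from the per-class values and supports in the classification reports above. The inputs are the two-decimal figures as printed, so agreement is only expected to two decimals; `weighted_average` is a helper name introduced here.

```python
def weighted_average(values, supports):
    """Average of values, weighted by the number of samples in each class."""
    total = sum(supports)
    return sum(v * s for v, s in zip(values, supports)) / total


# Supports for anger, fear, joy, taken from the reports above
supports = [392, 416, 380]
# Per-class F1-scores as printed (rounded to two decimals)
nb_f1_per_class = [0.91, 0.91, 0.92]
svm_f1_per_class = [0.94, 0.95, 0.95]

nb_weighted = weighted_average(nb_f1_per_class, supports)
svm_weighted = weighted_average(svm_f1_per_class, supports)
print(f"Naive Bayes weighted F1: {nb_weighted:.2f}")
print(f"SVM weighted F1: {svm_weighted:.2f}")
```

Because the three classes have similar support, the weighted and macro averages are nearly identical for this dataset.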


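**Naive Bayes is a strong baseline for text classification. Its assumption that features are conditionally independent given the class is a reasonable approximation for the bag-of-words representation behind TF-IDF, and it is fast and simple to train. Because that assumption ignores correlations between words, its estimates can be less accurate than those of a discriminative model, which may explain why it scores lower than SVM here (0.91 vs. 0.95 accuracy).**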
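
To show where the independence assumption enters, here is a minimal multinomial Naive Bayes written from scratch. It is a sketch on raw word counts with Laplace smoothing (`alpha=1.0`), not the scikit-learn `MultinomialNB` implementation, which was fitted on TF-IDF values in this notebook. The toy comments and the names `train_nb` and `predict_nb` are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class word log-likelihoods."""
    vocab = {w for d in docs for w in d.split()}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())

    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_lik = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_lik


def predict_nb(text, log_prior, log_lik):
    """Pick the class with the highest log prior plus summed word log-likelihoods."""
    scores = {}
    for c in log_prior:
        # Independence assumption: each word contributes its own term to the sum;
        # words outside the training vocabulary are skipped
        scores[c] = log_prior[c] + sum(
            log_lik[c][w] for w in text.split() if w in log_lik[c]
        )
    return max(scores, key=scores.get)


toy_docs = [
    "feel afraid scared", "scared nervous afraid",
    "feel angry furious", "furious hate angry",
    "feel happy glad", "glad joyful happy",
]
toy_labels = ["fear", "fear", "anger", "anger", "joy", "joy"]

log_prior, log_lik = train_nb(toy_docs, toy_labels)
print(predict_nb("feel scared", log_prior, log_lik))
```

Each word adds its log-likelihood to the class score regardless of which other words are present, which is why the model is cheap to train but cannot capture interactions between words.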

**SVM performs well because it finds the maximum-margin hyperplane separating the emotion classes. Linear SVMs are particularly effective on high-dimensional, sparse data such as TF-IDF vectors, where classes are often close to linearly separable, which is consistent with its stronger results. The linear kernel is enough to separate the three emotions with high accuracy.**

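**SVM is the better model in this case, with 0.95 accuracy and weighted F1-score against 0.91 for Naive Bayes on the same 1,188 test comments. Since a linear kernel was used, the gain comes from margin-based training on a high-dimensional feature space rather than from non-linear decision boundaries.**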