# Title

## **1. Business Understanding**

### Business Overview


Modern organizations collect large volumes of unstructured textual data from sources such as customer feedback, social media posts, support tickets and online reviews.
Manually reading and categorizing this information is time-consuming, inconsistent and difficult to scale.

This project focuses on building an end-to-end Natural Language Processing (NLP) solution that automatically classifies text into four emotion categories: positive, negative, none and unknown.
The solution is intended to support business teams by enabling faster understanding of public and customer sentiment, improving monitoring of user experience, and assisting decision-making through data-driven insights.

By automating text classification, organizations can:

* detect customer satisfaction and dissatisfaction trends,

* identify neutral or non-informative messages,

* filter ambiguous or unclear content,

* prioritize responses to high-impact feedback.


## Problem Statement

Organizations increasingly rely on large volumes of unstructured text data to understand customer opinions, experiences and engagement. However, this data is difficult to analyze at scale because it is noisy, inconsistent and highly variable in language, structure and quality.

In the provided dataset, text entries must be classified into four emotion categories: **positive, negative, none and unknown**. The dataset also exhibits a strong class imbalance, with the *none* category representing a significantly larger proportion of the data compared to the other classes. This imbalance makes it difficult to build models that perform reliably across all categories, especially for minority classes.

Additionally, raw text contains noise such as irrelevant tokens and informal language, which can negatively affect model performance if not properly cleaned and transformed into meaningful numerical representations.

The key challenge addressed in this project is to design and evaluate an end-to-end NLP pipeline that can reliably transform raw text into structured features and accurately perform multi-class emotion classification, while mitigating the effects of class imbalance and minimizing information leakage during model evaluation.

This project therefore seeks to establish a robust and reproducible workflow for text preprocessing, feature extraction and supervised learning that can generalize well to unseen data and support real-world business use cases.


## Objectives

### Main Objective
To develop an end-to-end Natural Language Processing (NLP) pipeline for automated multi-class text classification that accurately categorizes text data into positive, negative, none and unknown emotion classes, and evaluates the performance of different machine learning models on an imbalanced dataset.

### Specific Objectives
* To explore and understand the structure and distribution of the text dataset and emotion labels.

* To clean and preprocess raw text data by removing noise and irrelevant tokens in order to improve data quality for modelling.

* To transform textual data into numerical feature representations using TF-IDF vectorization.

* To build machine learning pipelines that integrate text vectorization and classification models.

* To train and evaluate multiple classification models, including Logistic Regression, Naïve Bayes and Linear Support Vector Machines, for multi-class text classification.

* To assess model performance using appropriate evaluation metrics such as accuracy, precision, recall and F1-score, with particular focus on macro-averaged metrics due to class imbalance.

### Research Objectives
* To investigate how different linear machine learning classifiers perform on an imbalanced multi-class text classification problem.

* To examine the effect of TF-IDF feature engineering, including the use of n-grams, on the predictive performance of text classification models.

* To determine which classification model provides the most balanced and reliable performance across all emotion categories.

### Success Criteria
The project will be considered successful if the following criteria are met:

* The trained models are able to correctly perform multi-class emotion classification for the four target classes: positive, negative, none and unknown.

* At least one trained model demonstrates improved and more balanced performance across minority classes (positive, negative and unknown) when compared to a naive majority-class baseline.

* The modelling workflow avoids data leakage by separating training and testing data and by performing feature extraction and model training within a unified pipeline.
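The majority-class baseline mentioned in the success criteria could be computed along the following lines. This is a minimal sketch on hypothetical labels, using scikit-learn's `DummyClassifier`. It is not part of the original notebook, and the label counts are invented for illustration.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced labels: 0 = negative, 1 = neutral, 2 = positive.
y_true = [1] * 6 + [2] * 3 + [0] * 1
X_dummy = [[0]] * len(y_true)  # features are ignored by this baseline

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_base = baseline.predict(X_dummy)

baseline_acc = accuracy_score(y_true, y_base)
baseline_macro_f1 = f1_score(y_true, y_base, average="macro", zero_division=0)

print(f"accuracy={baseline_acc:.2f}, macro F1={baseline_macro_f1:.2f}")
```

On these toy labels the baseline looks acceptable on accuracy but scores poorly on macro F1 because two of the three classes are never predicted, which is why the criteria above compare models on minority-class performance rather than on accuracy alone.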


# Data Understanding
This project uses a dataset of tweets from Data.World to build a model that classifies the presence and polarity of emotions directed at brands or products. The dataset contains raw tweet text, the referenced brand or product (if any), and the corresponding emotion label.

The goal of this stage is to understand the dataset’s structure and content. This includes reviewing the features, verifying data types, and identifying potential quality issues such as missing values, duplicates, class imbalance, or corrupted entries.

By exploring the data at this stage, it is possible to detect quality concerns and inform decisions for text cleaning, preprocessing, and subsequent Natural Language Processing (NLP) model development.
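The checks described above (missing values, duplicates and class balance) can be gathered into one small helper. The sketch below uses a hypothetical four-row frame, and `summarize_quality` is an illustrative name, not a function from the notebook.

```python
import pandas as pd


def summarize_quality(frame: pd.DataFrame, label_col: str) -> dict:
    """Collect basic data-quality indicators for a labelled text dataset."""
    return {
        "rows": len(frame),
        "missing_per_column": frame.isna().sum().to_dict(),
        "duplicate_rows": int(frame.duplicated().sum()),
        "label_share": frame[label_col].value_counts(normalize=True).to_dict(),
    }


# Hypothetical toy data that mimics the column layout used later on.
toy = pd.DataFrame({
    "Tweet": ["love it", "love it", None, "meh"],
    "Sentiment": [
        "Positive emotion",
        "Positive emotion",
        "No emotion toward brand or product",
        "No emotion toward brand or product",
    ],
})

summary = summarize_quality(toy, "Sentiment")
print(summary)
```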

## Importing Required Libraries

In [79]:
# Data loading and manipulation
import pandas as pd
import numpy as np

# Text preprocessing and NLP
import nltk
import re
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.corpus import wordnet
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
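nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('wordnet', quiet=True)
# Required by pos_tag in the lemmatization step; newer NLTK releases may
# expect the 'averaged_perceptron_tagger_eng' resource instead.
nltk.download('averaged_perceptron_tagger', quiet=True)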



# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import plotly.colors as pc
# Alias the IPython class so it is not shadowed by PIL's Image module
from IPython.display import Image as IPyImage, display
from PIL import Image, ImageDraw, ImageFont

# Machine learning and preprocessing
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import label_binarize
from sklearn.preprocessing import LabelEncoder
from imblearn.pipeline import Pipeline


# Classification algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB


# Model evaluation metrics
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, confusion_matrix, classification_report
from sklearn.metrics import roc_curve, auc, roc_auc_score

# Model interpretation and saving
import lime
import lime.lime_text
import joblib
import pickle

# Web scraping (tweepy is currently unused)
#import tweepy
import time

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Visualization settings
# A single call: a second set_theme() would reset the style to its default
sns.set_theme(style="whitegrid", rc={'figure.figsize': (11.7, 8.27)})


## Data Loading and Inspection

In [80]:
df = pd.read_csv('./Data/judge-1377884607_tweet_product_company.csv', encoding='ISO-8859-1')
df

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion
...,...,...,...
9088,Ipad everywhere. #SXSW {link},iPad,Positive emotion
9089,"Wave, buzz... RT @mention We interrupt your re...",,No emotion toward brand or product
9090,"Google's Zeiger, a physician never reported po...",,No emotion toward brand or product
9091,Some Verizon iPhone customers complained their...,,No emotion toward brand or product


In [81]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9093 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column                                              Non-Null Count  Dtype 
---  ------                                              --------------  ----- 
 0   tweet_text                                          9092 non-null   object
 1   emotion_in_tweet_is_directed_at                     3291 non-null   object
 2   is_there_an_emotion_directed_at_a_brand_or_product  9093 non-null   object
dtypes: object(3)
memory usage: 213.2+ KB


In [82]:
df.isna().sum()

tweet_text                                               1
emotion_in_tweet_is_directed_at                       5802
is_there_an_emotion_directed_at_a_brand_or_product       0
dtype: int64

In [83]:
df.columns = ['Tweet', 'Brand', 'Sentiment']

df.head(7)

Unnamed: 0,Tweet,Brand,Sentiment
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion
5,@teachntech00 New iPad Apps For #SpeechTherapy...,,No emotion toward brand or product
6,,,No emotion toward brand or product


## Data Cleaning

In [84]:
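# Assign back instead of using inplace on a column selection,
# which recent pandas versions warn about
df['Brand'] = df['Brand'].fillna('Unknown')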

In [85]:
# Drop the row whose tweet text is missing
df.dropna(subset=['Tweet'], inplace=True)

In [86]:
df.duplicated().sum()

22

In [87]:
df.drop_duplicates(inplace=True)

In [88]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9070 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Tweet      9070 non-null   object
 1   Brand      9070 non-null   object
 2   Sentiment  9070 non-null   object
dtypes: object(3)
memory usage: 283.4+ KB


## Exploratory Data Analysis

In [None]:

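# Prepare data (the label column was renamed to 'Sentiment' above)
counts = df['Sentiment'].value_counts()
percentages = counts / counts.sum() * 100

# Fixed color mapping for consistency, keyed by the raw dataset labels
color_map = {
    'Positive emotion': 'green',
    'Negative emotion': 'red',
    'No emotion toward brand or product': 'orange'
}

# Ensure colors follow class order; any other label falls back to gray
colors = [color_map.get(label, 'gray') for label in counts.index]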

# Create figure
plt.figure(figsize=(14, 5))

# ---- Bar Chart (Counts) ----
plt.subplot(1, 2, 1)
sns.barplot(
    x=counts.index,
    y=counts.values,
    palette=colors
)
plt.title("Sentiment Class Distribution (Counts)")
plt.xlabel("Sentiment")
plt.ylabel("Count")

# Add count labels
for i, v in enumerate(counts.values):
    plt.text(i, v, str(v), ha='center', va='bottom')

# ---- Pie Chart (Percentages) ----
plt.subplot(1, 2, 2)
plt.pie(
    percentages.values,
    labels=percentages.index,
    autopct='%1.1f%%',
    startangle=90,
    colors=colors
)
plt.title("Sentiment Class Distribution (Percentage)")
plt.axis('equal')

plt.tight_layout()
plt.show()


## Feature Engineering

In [89]:
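def categorize_emotion(row):
    """Map the raw annotation label to a short emotion class."""
    if row['Sentiment'] == 'No emotion toward brand or product':
        return 'neutral'
    elif row['Sentiment'] == 'Positive emotion':
        return 'positive'
    elif row['Sentiment'] == 'Negative emotion':
        return 'negative'
    else:
        # Any other label is folded into 'neutral', so the models
        # below are trained on three classes.
        return 'neutral'

df['emotion'] = df.apply(categorize_emotion, axis=1)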

In [90]:
df['cleaned_text'] = df['Tweet'].str.lower().str.strip()


In [91]:
def clean_tweet(text):
    """Strip URLs, mentions, hashtags, digits and punctuation from a lowercased tweet."""
    text = re.sub(r'http\S+|www\S+|@\w+|#\w+', '', text)  # URLs, mentions, hashtags (tag text included)
    text = re.sub(r'[^\x00-\x7F]+', '', text)             # non-ASCII chars
    text = re.sub(r'\d+', '', text)                       # numbers
    text = re.sub(r'[^\w\s]', '', text)                   # punctuation (underscores survive: \w matches '_')
    text = re.sub(r'\s+', ' ', text).strip()              # extra whitespace
    return text

df['cleaned_text'] = df['cleaned_text'].apply(clean_tweet)

In [92]:
df.head()

Unnamed: 0,Tweet,Brand,Sentiment,emotion,cleaned_text
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion,negative,i have a g iphone after hrs tweeting at it was...
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion,positive,know about awesome ipadiphone app that youll l...
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion,positive,can not wait for also they should sale them do...
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion,negative,i hope this years festival isnt as crashy as t...
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion,positive,great stuff on fri marissa mayer google tim or...


In [93]:
df['cleaned_text'].tail()

9088                                 ipad everywhere link
9089    wave buzz rt we interrupt your regularly sched...
9090    googles zeiger a physician never reported pote...
9091    some verizon iphone customers complained their...
9092            ___rt google tests checkin offers at link
Name: cleaned_text, dtype: object
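To make the effect of the cleaning step concrete, the sketch below re-declares the same regular expressions and applies them to two inputs. The first input is a made-up tweet. The second mirrors the `___rt` prefix visible in the output above. Treat this as an illustration rather than as a cell from the original notebook.

```python
import re


def clean_tweet(text):
    """Same substitutions as the notebook's clean_tweet, repeated for a standalone demo."""
    text = re.sub(r'http\S+|www\S+|@\w+|#\w+', '', text)  # URLs, mentions, hashtags
    text = re.sub(r'[^\x00-\x7F]+', '', text)             # non-ASCII chars
    text = re.sub(r'\d+', '', text)                       # numbers
    text = re.sub(r'[^\w\s]', '', text)                   # punctuation
    text = re.sub(r'\s+', ' ', text).strip()              # extra whitespace
    return text


# Hypothetical tweet, lowercased first as in the notebook.
sample = "@user Can not wait for #iPad 2! http://t.co/abc  :)".lower().strip()
cleaned_sample = clean_tweet(sample)

# Underscores count as word characters, so they pass through this step.
cleaned_prefix = clean_tweet("___rt google tests")

print(repr(cleaned_sample))
print(repr(cleaned_prefix))
```

Two things are worth noticing. The hashtag pattern removes the tag text itself, so a product name written as `#iPad` disappears from the cleaned text. Leading underscores are kept here and are only removed by the later letters-only replacement.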

In [94]:
tokenizer = RegexpTokenizer(r"[a-zA-Z]+(?:'[a-z]+)?")

In [95]:
stopwords_list = set(stopwords.words('english'))
print(stopwords_list)

{'than', "it'll", 'during', 'against', 'few', 'shan', 'just', 'you', 'at', 'how', 'after', 've', 'don', 'ours', 'itself', "aren't", 'do', 'through', "i've", 'under', 'hers', "shan't", 'over', 'while', 'were', "weren't", 'nor', 'she', 'wasn', 'does', 'my', "wasn't", "we're", "mightn't", 'should', "won't", "we'd", 'above', 'an', 'most', "haven't", 're', "doesn't", 'because', 'he', 'only', "i'd", "that'll", 'from', 'd', 'didn', 'am', 'again', "isn't", 'himself', 'ourselves', 'to', 'why', 'did', 'our', 'the', 'what', 'i', 'themselves', 'we', 'will', 'it', "shouldn't", 'if', 'their', 'y', 'same', "they'll", 'whom', 'all', 'so', 'more', 't', 'until', 'aren', "you'll", 'of', 'couldn', 'isn', 'other', 'further', 'about', 'a', 'no', 'yours', 'its', "she's", "he'd", 'won', 'having', 'can', 'weren', 'this', 'them', 'him', 'then', "we'll", "couldn't", "needn't", 'any', 'm', 'your', 'be', "should've", "she'll", "didn't", 'both', 'off', 'has', 'myself', 'each', 'shouldn', "you're", 'o', 'wouldn', 'n

In [96]:
def remove_stopwords(text):
    """Tokenize, then drop stop words and single-character tokens."""
    if not isinstance(text, str):
        return ""

    tokens = tokenizer.tokenize(text)
    filtered_tokens = [word for word in tokens if word not in stopwords_list]
    cleaned_tokens = [word for word in filtered_tokens if len(word) > 1]

    return " ".join(cleaned_tokens)

In [97]:
df['cleaned_text'] = df['cleaned_text'].apply(remove_stopwords)
df.head()

Unnamed: 0,Tweet,Brand,Sentiment,emotion,cleaned_text
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion,negative,iphone hrs tweeting dead need upgrade plugin s...
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion,positive,know awesome ipadiphone app youll likely appre...
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion,positive,wait also sale
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion,negative,hope years festival isnt crashy years iphone app
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion,positive,great stuff fri marissa mayer google tim oreil...
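Row 2 above shows a side effect of the default stop-word list: "can not wait" is reduced to "wait" because "can" and "not" are both stop words. One possible variation, sketched below, keeps a small set of negation words. The stop-word set here is a short hand-written stand-in for the NLTK list, and `remove_stopwords_keep_negation` is a hypothetical helper, not part of the notebook.

```python
import re

# Small stand-in for NLTK's English stop-word list (assumption: the real
# list is much longer; only a handful of entries are reproduced here).
STOPWORDS = {"i", "can", "not", "for", "no", "nor", "the", "it", "is", "this"}

# Negation words that may carry sentiment and are therefore kept.
NEGATIONS = {"not", "no", "nor"}

# Same token pattern as the notebook's RegexpTokenizer.
TOKEN_RE = re.compile(r"[a-zA-Z]+(?:'[a-z]+)?")


def remove_stopwords_keep_negation(text, stopwords=STOPWORDS, keep=NEGATIONS):
    """Drop stop words and one-letter tokens, but keep the words listed in `keep`."""
    tokens = TOKEN_RE.findall(text)
    kept = [t for t in tokens if (t not in stopwords or t in keep) and len(t) > 1]
    return " ".join(kept)


with_negation = remove_stopwords_keep_negation("can not wait for this")
without_negation = remove_stopwords_keep_negation("can not wait for this", keep=set())

print(with_negation)
print(without_negation)
```

Whether keeping negations improves the classifiers on this dataset is untested here. It would need to be compared through the same cross-validated pipeline as the other preprocessing choices.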


In [98]:
# Remove any remaining non-letter characters (e.g. underscores)
df['cleaned_text'] = df['cleaned_text'].str.replace(r'[^A-Za-z\s]', '', regex=True)
df.head()

Unnamed: 0,Tweet,Brand,Sentiment,emotion,cleaned_text
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion,negative,iphone hrs tweeting dead need upgrade plugin s...
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion,positive,know awesome ipadiphone app youll likely appre...
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion,positive,wait also sale
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion,negative,hope years festival isnt crashy years iphone app
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion,positive,great stuff fri marissa mayer google tim oreil...


In [99]:
def get_pos(word):
    """Return the WordNet POS constant for a single word (defaults to noun)."""
    # Tagging one word at a time ignores the surrounding sentence context
    tag = pos_tag([word])[0][1][0].upper()
    tag_dict = {
        'J': wordnet.ADJ,   # adjective
        'N': wordnet.NOUN,  # noun
        'V': wordnet.VERB,  # verb
        'R': wordnet.ADV    # adverb
    }
    return tag_dict.get(tag, wordnet.NOUN)

In [100]:
lemmatizer = WordNetLemmatizer()
def lemmatize_text(text):
    return ' '.join([lemmatizer.lemmatize(word, get_pos(word)) for word in tokenizer.tokenize(text)])
    
df['cleaned_text'] = df['cleaned_text'].apply(lemmatize_text)
df.head()

Unnamed: 0,Tweet,Brand,Sentiment,emotion,cleaned_text
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion,negative,iphone hr tweet dead need upgrade plugin station
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion,positive,know awesome ipadiphone app youll likely appre...
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion,positive,wait also sale
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion,negative,hope year festival isnt crashy year iphone app
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion,positive,great stuff fri marissa mayer google tim oreil...


In [101]:
# Copy the selection so the 'target' column can be added without a SettingWithCopyWarning
df_cleaned = df[['cleaned_text', 'emotion']].copy()
df_cleaned

Unnamed: 0,cleaned_text,emotion
0,iphone hr tweet dead need upgrade plugin station,negative
1,know awesome ipadiphone app youll likely appre...,positive
2,wait also sale,positive
3,hope year festival isnt crashy year iphone app,negative
4,great stuff fri marissa mayer google tim oreil...,positive
...,...,...
9088,ipad everywhere link,positive
9089,wave buzz rt interrupt regularly schedule geek...,neutral
9090,google zeiger physician never report potential...,neutral
9091,verizon iphone customer complain time fell bac...,neutral


In [103]:
label_encoder = LabelEncoder()
df_cleaned['target'] = label_encoder.fit_transform(df_cleaned['emotion'])
df_cleaned.tail()

Unnamed: 0,cleaned_text,emotion,target
9088,ipad everywhere link,positive,2
9089,wave buzz rt interrupt regularly schedule geek...,neutral,1
9090,google zeiger physician never report potential...,neutral,1
9091,verizon iphone customer complain time fell bac...,neutral,1
9092,rt google test checkin offer link,neutral,1
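The integer codes in `target` follow the alphabetical order of the class names, which is how `LabelEncoder` assigns them. The short sketch below, run on a hypothetical list of labels, shows how the mapping can be recovered from `classes_` so that numeric classes in later reports can be read back as names.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical label list using the same three class names as the notebook.
encoder = LabelEncoder()
codes = encoder.fit_transform(["positive", "neutral", "negative", "neutral"])

# classes_ is sorted, so the position of each name is its integer code.
code_to_label = {i: str(name) for i, name in enumerate(encoder.classes_)}

print(code_to_label)
print(list(codes))
```

The same names could be passed to `classification_report` through its `target_names` argument to make the per-class rows easier to read.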


In [104]:
X = df_cleaned['cleaned_text']
y = df_cleaned['target']

In [105]:
X_train,X_test,y_train,y_test = train_test_split(X,y, test_size=0.2, random_state=42, stratify=y)
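Passing `stratify=y` keeps the class proportions of the full dataset in both the training and the test split, which matters when one class is small. The sketch below checks this on hypothetical labels with a 10/60/30 split. The counts are invented for illustration.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 10% class 0, 60% class 1, 30% class 2.
labels = [0] * 10 + [1] * 60 + [2] * 30
texts = [f"doc {i}" for i in range(len(labels))]

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# With stratification the 20 test items keep the 10/60/30 proportions.
test_counts = Counter(y_te)
train_counts = Counter(y_tr)

print(dict(test_counts))
print(dict(train_counts))
```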


In [106]:
# Pipeline
logreg_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(
        max_features=5000,
        ngram_range=(1, 2)
    )),
    ("clf", LogisticRegression(
        max_iter=1000,
        class_weight="balanced",
        solver='saga'
    ))
])

# Train
logreg_pipeline.fit(X_train, y_train)

# Predict
y_pred = logreg_pipeline.predict(X_test)

# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))


Accuracy: 0.5865490628445424
              precision    recall  f1-score   support

           0       0.18      0.71      0.29       114
           1       0.75      0.69      0.72      1106
           2       0.65      0.37      0.47       594

    accuracy                           0.59      1814
   macro avg       0.52      0.59      0.49      1814
weighted avg       0.68      0.59      0.61      1814
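The two averaged F1 rows in the report above can be reproduced from the per-class rows. The sketch below uses the rounded values printed in the report, so it matches only to two decimals. Class 0 is the smallest class with 114 test tweets, and it pulls the macro average well below the weighted one.

```python
# (f1-score, support) per class, copied from the classification report above.
report = {0: (0.29, 114), 1: (0.72, 1106), 2: (0.47, 594)}

total = sum(support for _, support in report.values())

# Macro average: every class counts equally, regardless of size.
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: each class contributes in proportion to its support.
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total

print(f"support={total}, macro F1={macro_f1:.2f}, weighted F1={weighted_f1:.2f}")
```

Because the macro average treats the 114-tweet class the same as the 1106-tweet class, it is the stricter summary for this dataset and is the metric used for `scoring` in the grid searches.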



In [None]:
param_grid = {
    "tfidf__max_features": [5000, 8000],
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__min_df": [2, 5],
    "clf__C": [0.1, 1, 10],
    "clf__penalty": ["l1", "l2"]
}
grid = GridSearchCV(
    estimator=logreg_pipeline,
    param_grid=param_grid,
    scoring="f1_macro",
    cv=5,
    n_jobs=-1,
    verbose=2
)

grid.fit(X_train, y_train)

print(grid.best_params_)

best_logreg_model = grid.best_estimator_
best_model = best_logreg_model  # alias used in the final model comparison

y_pred_tuned = best_model.predict(X_test)

print("Tuned accuracy:", accuracy_score(y_test, y_pred_tuned))
print(classification_report(y_test, y_pred_tuned))


Fitting 5 folds for each of 48 candidates, totalling 240 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:   49.1s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed: 10.6min
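The line "Fitting 5 folds for each of 48 candidates, totalling 240 fits" follows directly from the size of the parameter grid. The sketch below enumerates the same grid with the standard library to show where the numbers come from. It does not run any model fitting.

```python
from itertools import product

# Same grid as in the cell above.
param_grid = {
    "tfidf__max_features": [5000, 8000],
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__min_df": [2, 5],
    "clf__C": [0.1, 1, 10],
    "clf__penalty": ["l1", "l2"],
}

# Every combination of one value per parameter is one candidate.
candidates = list(product(*param_grid.values()))
n_candidates = len(candidates)  # 2 * 2 * 2 * 3 * 2
n_fits = n_candidates * 5       # one fit per candidate per CV fold

print(n_candidates, n_fits)  # 48 240
```

Since the number of fits grows multiplicatively with each added parameter value, trimming the grid or switching to the already imported `RandomizedSearchCV` are common ways to shorten run time.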


In [None]:
# Multinomial Naive Bayes
nb_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB())
])
param_grid_nb = {
    "tfidf__max_features": [3000, 5000, 8000],
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__min_df": [2, 5],
    "clf__alpha": [0.01, 0.1, 0.5, 1.0]
}
grid_nb = GridSearchCV(
    estimator=nb_pipeline,
    param_grid=param_grid_nb,
    scoring="f1_macro",
    cv=5,
    n_jobs=-1,
    verbose=2
)

grid_nb.fit(X_train, y_train)

print(grid_nb.best_params_)
print("Best CV macro-F1:", grid_nb.best_score_)

best_nb = grid_nb.best_estimator_

y_pred_nb = best_nb.predict(X_test)

print("Tuned NB accuracy:", accuracy_score(y_test, y_pred_nb))
print(classification_report(y_test, y_pred_nb))



In [None]:
# Linear SVC

svc_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC(class_weight="balanced"))
])
param_grid_svc = {
    "tfidf__max_features": [5000],
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__min_df": [2],
    "clf__C": [0.1, 1, 3]
}
grid_svc = GridSearchCV(
    svc_pipeline,
    param_grid=param_grid_svc,
    scoring="f1_macro",
    cv=5,
    n_jobs=-1,
    verbose=2
)

grid_svc.fit(X_train, y_train)

print(grid_svc.best_params_)
print("Best CV macro-F1:", grid_svc.best_score_)

best_svc = grid_svc.best_estimator_

y_pred_svc = best_svc.predict(X_test)

print("LinearSVC accuracy:", accuracy_score(y_test, y_pred_svc))
print(classification_report(y_test, y_pred_svc))


In [None]:
models = {
    "Logistic Regression": best_model,
    "Multinomial NB": best_nb,
    "Linear SVC": best_svc
}

print("===== Final model comparison =====\n")

for name, model in models.items():
    y_pred = model.predict(X_test)

    acc = accuracy_score(y_test, y_pred)
    f1_macro = f1_score(y_test, y_pred, average="macro")
    f1_weighted = f1_score(y_test, y_pred, average="weighted")

    print(f"{name}")
    print(f"  Accuracy     : {acc:.4f}")
    print(f"  Macro F1     : {f1_macro:.4f}")
    print(f"  Weighted F1  : {f1_weighted:.4f}")
    print("-" * 35)



In [102]:
# save the model after
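The saving step is left open above. One way it might look is sketched below with `joblib`, which the notebook already imports. The toy pipeline, the texts and the file name `sentiment_pipeline.joblib` are all hypothetical placeholders, and the sketch uses scikit-learn's `Pipeline` rather than the `imblearn` one so that it stays self-contained.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the tuned model's training set.
toy_texts = ["love this phone", "hate this phone", "great app", "terrible app"]
toy_labels = [1, 0, 1, 0]

toy_model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
]).fit(toy_texts, toy_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; a real project would choose a persistent path.
    model_path = os.path.join(tmp_dir, "sentiment_pipeline.joblib")
    joblib.dump(toy_model, model_path)   # vectorizer and classifier are saved together
    restored = joblib.load(model_path)

# The restored pipeline should reproduce the original predictions.
same_predictions = list(restored.predict(toy_texts)) == list(toy_model.predict(toy_texts))
print(same_predictions)
```

Saving the whole pipeline rather than the classifier alone keeps the fitted TF-IDF vocabulary with the model, so raw text can be passed straight to `predict` after loading.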