In [1]:
!pip install datasets




In [2]:
!pip install --upgrade fsspec datasets




Dataset Acquisition – GoEmotions.

For this project, I used the GoEmotions dataset, developed by Google and available through the Hugging Face Datasets library. It contains over 58,000 English Reddit comments annotated with emotion labels, which makes it well suited to training and evaluating emotion classification models.

Unlike basic sentiment analysis (positive/negative/neutral), GoEmotions includes 27 fine-grained emotions plus a neutral label. It also provides mappings to Ekman's 6 basic emotions (joy, sadness, anger, fear, disgust, surprise) for simpler use cases.
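
As a rough illustration of the Ekman mapping mentioned above, the sketch below collapses fine-grained label names into the six basic emotions. The grouping is written out by hand from my recollection of the mapping file published with GoEmotions, so treat it as an assumption and check it against the official file. The names EKMAN_GROUPS, FINE_TO_EKMAN, and to_ekman are made up for this example.

```python
# Assumed grouping of the 27 fine-grained emotions into Ekman's six categories.
# Verify against the mapping file in the GoEmotions repository before relying on it.
EKMAN_GROUPS = {
    "anger": ["anger", "annoyance", "disapproval"],
    "disgust": ["disgust"],
    "fear": ["fear", "nervousness"],
    "joy": ["joy", "amusement", "approval", "excitement", "gratitude", "love",
            "optimism", "relief", "pride", "admiration", "desire", "caring"],
    "sadness": ["sadness", "disappointment", "embarrassment", "grief", "remorse"],
    "surprise": ["surprise", "realization", "confusion", "curiosity"],
}

# Invert to a lookup table: fine-grained label -> Ekman category.
FINE_TO_EKMAN = {
    fine: coarse for coarse, fines in EKMAN_GROUPS.items() for fine in fines
}

def to_ekman(fine_labels):
    """Map fine-grained label names to a sorted list of unique Ekman categories.

    Labels without a mapping (such as 'neutral') pass through unchanged.
    """
    return sorted({FINE_TO_EKMAN.get(label, label) for label in fine_labels})

print(to_ekman(["annoyance", "disapproval"]))  # ['anger']
print(to_ekman(["gratitude", "neutral"]))      # ['joy', 'neutral']
```

Collapsing labels this way trades detail for more examples per class, which may help when the rare emotions are too sparse to learn.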

✅ Key Steps:
Loaded the dataset using load_dataset("go_emotions") from HuggingFace

Explored the dataset structure: it includes train, validation, and test splits

Each example contains:

A sentence of text (e.g., "I'm feeling really down today")

One or more emotion labels represented by integers

This dataset forms the foundation for training the machine learning model to recognize emotions from written text.


In [3]:
from datasets import load_dataset
dataset = load_dataset("go_emotions")

# Print example
print(dataset['train'][0])



{'text': "My favourite food is anything I didn't have to cook myself.", 'labels': [27], 'id': 'eebbqej'}


In [4]:
labels = dataset['train'].features['labels'].feature.names
print(labels)


['admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring', 'confusion', 'curiosity', 'desire', 'disappointment', 'disapproval', 'disgust', 'embarrassment', 'excitement', 'fear', 'gratitude', 'grief', 'joy', 'love', 'nervousness', 'optimism', 'pride', 'realization', 'relief', 'remorse', 'sadness', 'surprise', 'neutral']
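
The integer labels are indices into this list, so the first training example's [27] means 'neutral'. A minimal sketch of the decoding step, with the list copied from the output above and a helper name (decode_labels) that is only illustrative:

```python
# Label names copied from the output above; the list index is the label id.
LABEL_NAMES = [
    'admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring',
    'confusion', 'curiosity', 'desire', 'disappointment', 'disapproval',
    'disgust', 'embarrassment', 'excitement', 'fear', 'gratitude', 'grief',
    'joy', 'love', 'nervousness', 'optimism', 'pride', 'realization',
    'relief', 'remorse', 'sadness', 'surprise', 'neutral',
]

def decode_labels(label_ids, names=LABEL_NAMES):
    """Turn a list of integer label ids into their emotion names."""
    return [names[i] for i in label_ids]

print(decode_labels([27]))     # ['neutral']
print(decode_labels([3, 10]))  # ['annoyance', 'disapproval']
```

In the notebook itself, the `labels` variable from the cell above could be passed as `names` instead of the hard-coded list.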


Text Preprocessing

Preprocessing is a crucial step in preparing raw text for machine learning. Different models have different needs when it comes to input text, so this step varies depending on the type of model used.

🟩 Preprocessing for Traditional Machine Learning Models
Traditional models like Support Vector Machines (SVM) or Random Forests require numerical input. So, we need to clean the text and convert it into a format the model can understand.

The preprocessing includes:

Lowercasing the text (to treat “Happy” and “happy” the same)

Removing punctuation, numbers, and special characters

Tokenizing: splitting the sentence into words

Removing stop words (common words like “the”, “is”, “and” that add little meaning)

In [5]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab') # Download the punkt_tab for tokenization

stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    """Lowercase, strip non-letters, tokenize, and drop English stop words."""
    text = text.lower()  # Lowercase
    # Keep only a-z and whitespace; this also strips apostrophes, so "didn't" becomes "didnt"
    text = re.sub(r'[^a-z\s]', '', text)
    tokens = word_tokenize(text)  # Tokenize
    tokens = [word for word in tokens if word not in stop_words]  # Remove stop words
    return " ".join(tokens)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [6]:
sample_text = dataset['train'][0]['text']
print("Original:", sample_text)
print("Processed:", preprocess_text(sample_text))


Original: My favourite food is anything I didn't have to cook myself.
Processed: favourite food anything didnt cook
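
In the processed output above, "didn't" survives as "didnt": the apostrophe is stripped before stop-word removal, so the token no longer matches any stop word. One possible variant, sketched below, keeps apostrophes so that contractions can be matched as whole words. The stop-word set here is a tiny hypothetical stand-in for NLTK's English list, and the function name is made up for this example.

```python
import re

# Tiny illustrative stop-word list (hypothetical; the notebook uses NLTK's English list).
DEMO_STOP_WORDS = {"my", "is", "i", "have", "to", "myself", "didn't"}

def preprocess_keep_apostrophes(text, stop_words=DEMO_STOP_WORDS):
    """Lowercase, keep letters and apostrophes, split on whitespace, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z'\s]", "", text)  # apostrophes survive this step
    tokens = text.split()
    return " ".join(token for token in tokens if token not in stop_words)

sample = "My favourite food is anything I didn't have to cook myself."
print(preprocess_keep_apostrophes(sample))  # favourite food anything cook
```

Whether negations such as "didn't" should be dropped at all is a separate design choice: for emotion classification they can carry signal, so it may be worth comparing both options.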


Feature Engineering and Model Development (Traditional ML)

In this step, we’ll turn the cleaned text into a format that the machine learning model can learn from. Specifically, we’ll convert the text into numerical features using TF-IDF (Term Frequency-Inverse Document Frequency). Then we’ll train a traditional model, here a linear SVM (Support Vector Machine), on those features.

🧑‍🏫 What is TF-IDF?
TF (Term Frequency) tells you how often a word appears in a document (here, a sentence).

IDF (Inverse Document Frequency) tells you how rare a word is across the entire dataset (text corpus): words that appear in few documents get a higher value.

So, TF-IDF gives each word a weight based on:

How often it appears in the sentence

How rare it is across the whole dataset

This makes distinctive words stand out, while words that appear almost everywhere (like “the” or “and”) get less weight.
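
To make the weighting concrete, here is a small pure-Python sketch on a three-sentence toy corpus. It uses the smoothed IDF formula that scikit-learn documents as its default, idf = ln((1 + n) / (1 + df)) + 1, but skips the row normalization that TfidfVectorizer also applies by default, so the numbers will not match the vectorizer's output exactly. The corpus and the function name are invented for illustration.

```python
import math
from collections import Counter

# Toy corpus (hypothetical sentences, already lowercased).
docs = [
    "i am so happy today",
    "i am so sad today",
    "happy happy joy",
]

def tfidf(docs):
    """Return (idf per term, list of {term: tf * idf} per document)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Smoothed inverse document frequency.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

    # Term frequency (raw count) times IDF, per document.
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({term: tf[term] * idf[term] for term in tf})
    return idf, weights

idf, weights = tfidf(docs)
print(round(idf["sad"], 3), round(idf["happy"], 3))
print(round(weights[2]["happy"], 3))
```

"sad" occurs in one document and "happy" in two, so "sad" gets the larger IDF; within the third sentence, "happy" is counted twice, which doubles its weight.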

🎯 Steps for Feature Extraction (TF-IDF) and Model Training
Convert Text to Numerical Features with TF-IDF

We'll use TfidfVectorizer from scikit-learn to do this.

Train a Traditional ML Model

We’ll train a classifier, like SVM, on the TF-IDF features.

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report, f1_score, hamming_loss
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import SVC

# Step 1: Preprocess the text
# Apply preprocessing function to the dataset
processed_texts = [preprocess_text(text) for text in dataset['train']['text']]

# Step 2: Convert text to numerical features using TF-IDF
vectorizer = TfidfVectorizer(max_features=5000)  # Limiting to 5000 features for simplicity
X = vectorizer.fit_transform(processed_texts)

# Step 3: Extract labels (each example has a list of one or more label ids)
y = dataset['train']['labels']

# Convert the label lists into a multi-hot indicator matrix (one column per emotion)
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(y)

# Step 4: Hold out 20% of the train split for evaluation
# (the dataset's own validation and test splits are not used here)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 5: Train a Support Vector Machine (SVM) model
# OneVsRestClassifier fits one binary SVM per emotion, which handles multi-label targets
model = OneVsRestClassifier(SVC(kernel='linear'))
model.fit(X_train, y_train)

# Step 6: Evaluate the model
y_pred = model.predict(X_test)

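# Output subset accuracy and classification report
# (for multi-label targets, accuracy_score counts a sample as correct only if every label matches)
print("Accuracy:", accuracy_score(y_test, y_pred))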
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Hamming Loss:", hamming_loss(y_test, y_pred))
print("Micro F1-score:", f1_score(y_test, y_pred, average='micro'))
print("Macro F1-score:", f1_score(y_test, y_pred, average='macro'))

Accuracy: 0.32688320663441606
Classification Report:
               precision    recall  f1-score   support

           0       0.78      0.45      0.57       863
           1       0.77      0.71      0.74       453
           2       0.71      0.15      0.25       323
           3       0.63      0.02      0.05       483
           4       0.72      0.07      0.13       577
           5       0.75      0.03      0.05       212
           6       0.87      0.05      0.10       258
           7       0.95      0.08      0.14       460
           8       0.71      0.27      0.40       128
           9       0.92      0.05      0.09       244
          10       0.50      0.01      0.01       383
          11       0.86      0.19      0.31       156
          12       0.87      0.22      0.36        58
          13       0.74      0.11      0.20       175
          14       0.74      0.32      0.45       116
          15       0.97      0.83      0.90       544
          16       0.00    



In [9]:
from google.colab import drive
drive.mount("/content/drive")


Mounted at /content/drive


In [10]:
import joblib
from google.colab import drive

# 1. Mount Google Drive
drive.mount("/content/drive")

# 2. Create a folder in Drive to hold the artifacts
!mkdir -p /content/drive/MyDrive/svm_emotion_baseline

# 3. Dump the objects
joblib.dump(vectorizer, "/content/drive/MyDrive/svm_emotion_baseline/tfidf_vectorizer.joblib")
joblib.dump(model,      "/content/drive/MyDrive/svm_emotion_baseline/svm_model.joblib")
joblib.dump(mlb,        "/content/drive/MyDrive/svm_emotion_baseline/label_binarizer.joblib")

# 4. Verify
!ls -1 /content/drive/MyDrive/svm_emotion_baseline


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
label_binarizer.joblib
svm_model.joblib
tfidf_vectorizer.joblib
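
The saved artifacts are only useful if they can be reloaded and used together. The sketch below shows one way that round trip might look. It trains on a handful of invented sentences and writes to a temporary folder instead of Drive so that it runs on its own; in the notebook, the paths would be the ones under /content/drive/MyDrive/svm_emotion_baseline, and new text should first go through preprocess_text.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import SVC

# Toy stand-ins for the notebook's training data
# (label ids 11 and 15 are 'disgust' and 'gratitude' in the label list).
toy_texts = ["thank you so much", "thanks a lot", "this is awful",
             "i hate this", "thank you but this is awful"]
toy_labels = [[15], [15], [11], [11], [11, 15]]

toy_vectorizer = TfidfVectorizer()
toy_mlb = MultiLabelBinarizer()
toy_model = OneVsRestClassifier(SVC(kernel="linear"))
toy_model.fit(toy_vectorizer.fit_transform(toy_texts), toy_mlb.fit_transform(toy_labels))

artifacts = {
    "tfidf_vectorizer.joblib": toy_vectorizer,
    "svm_model.joblib": toy_model,
    "label_binarizer.joblib": toy_mlb,
}

# Save all three objects, then load them back from disk.
with tempfile.TemporaryDirectory() as folder:
    for filename, obj in artifacts.items():
        joblib.dump(obj, os.path.join(folder, filename))
    loaded = {name: joblib.load(os.path.join(folder, name)) for name in artifacts}

# Inference: vectorize -> predict indicator rows -> map back to label ids.
new_texts = ["thank you"]
features = loaded["tfidf_vectorizer.joblib"].transform(new_texts)
predicted = loaded["svm_model.joblib"].predict(features)
print(loaded["label_binarizer.joblib"].inverse_transform(predicted))
```

All three objects are needed at inference time: the vectorizer fixes the vocabulary, the model produces one 0/1 column per emotion, and the binarizer maps those columns back to label ids.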


📝 Baseline Model Results Summary (TF-IDF + SVM)
For the initial stage of the project, we implemented a baseline machine learning model using a TF-IDF vectorizer combined with a Support Vector Machine (SVM) in a multi-label classification setup. The goal was to establish a performance benchmark before moving on to more complex models like BERT.

⚙️ Model Setup:
Feature extraction: TF-IDF (max_features=5000)

Classifier: SVM with linear kernel, wrapped in a OneVsRestClassifier for multi-label support

Label binarization: MultiLabelBinarizer to handle multiple emotion labels per text sample
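
The label binarization step turns each list of label ids into a fixed-length row of 0s and 1s. A minimal pure-Python sketch of that idea (not MultiLabelBinarizer's actual implementation, and assuming column j corresponds to label id j):

```python
def multi_hot(label_lists, n_classes):
    """Encode each list of label ids as a row with 1s at those positions."""
    rows = []
    for label_ids in label_lists:
        row = [0] * n_classes
        for label_id in label_ids:
            row[label_id] = 1
        rows.append(row)
    return rows

# Two samples: one labelled 'neutral' (27), one 'annoyance' + 'disapproval' (3, 10).
encoded = multi_hot([[27], [3, 10]], n_classes=28)
print(encoded[0][27], sum(encoded[0]))                 # 1 1
print(encoded[1][3], encoded[1][10], sum(encoded[1]))  # 1 1 2
```

MultiLabelBinarizer builds its columns from the sorted set of labels it sees during fitting, so the column order matches the label ids only if every id occurs in the training data.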

📊 Evaluation Metrics:

Metric	Score
Hamming Loss	0.033
Micro F1-score	0.464
Macro F1-score	0.304

🔍 Interpretation:

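Hamming Loss (0.033) is the fraction of individual label slots predicted incorrectly. With 28 labels and most samples carrying only one or two of them, this value is low almost by construction, so it should be read alongside the other metrics. The subset accuracy of about 0.327 (printed as "Accuracy" in the evaluation output) shows that the full label set is predicted exactly for roughly a third of the samples.

Micro F1-score of 0.464 pools all label decisions, so it is dominated by the frequent emotions. The per-class report shows high precision but low recall for most classes (for example, class 3, annoyance: precision 0.63, recall 0.02), meaning the model rarely predicts a label but is usually right when it does. A few classes do well, such as class 15 (gratitude) with an F1-score of 0.90.

Macro F1-score of 0.304 averages the per-class F1-scores with equal weight, so it is pulled down by rarer emotions that the model seldom or never predicts. This is expected given the imbalanced distribution of emotions in the dataset.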
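
To make these metrics concrete, the sketch below computes them by hand on a small invented example with four labels, where label 0 is frequent and labels 2 and 3 are rare. The data and function names are made up for illustration; the notebook itself uses the scikit-learn functions.

```python
def hamming(y_true, y_pred):
    """Fraction of individual label slots that differ."""
    slots = sum(len(row) for row in y_true)
    wrong = sum(t != p for rt, rp in zip(y_true, y_pred) for t, p in zip(rt, rp))
    return wrong / slots

def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose whole label row matches exactly."""
    return sum(rt == rp for rt, rp in zip(y_true, y_pred)) / len(y_true)

def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(y_true, y_pred):
    """Micro F1 pools counts over labels; macro F1 averages per-label F1."""
    pairs = list(zip(y_true, y_pred))
    counts = []
    for j in range(len(y_true[0])):
        tp = sum(rt[j] == 1 and rp[j] == 1 for rt, rp in pairs)
        fp = sum(rt[j] == 0 and rp[j] == 1 for rt, rp in pairs)
        fn = sum(rt[j] == 1 and rp[j] == 0 for rt, rp in pairs)
        counts.append((tp, fp, fn))
    macro = sum(f1_from_counts(*c) for c in counts) / len(counts)
    micro = f1_from_counts(*(sum(col) for col in zip(*counts)))
    return micro, macro

y_true = [[1, 0, 0, 0], [1, 0, 0, 0], [1, 1, 0, 0],
          [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
# A cautious model: right on the frequent label, silent on the rare ones.
y_pred = [[1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0],
          [0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
all_zero = [[0, 0, 0, 0] for _ in y_true]

micro, macro = micro_macro_f1(y_true, y_pred)
print("Hamming loss:", hamming(y_true, y_pred))             # 0.125
print("Hamming loss (all-zero):", round(hamming(y_true, all_zero), 3))  # 0.292
print("Subset accuracy:", subset_accuracy(y_true, y_pred))  # 0.5
print("Micro F1:", round(micro, 3), "Macro F1:", round(macro, 3))  # 0.727 0.417
```

The pattern mirrors the real results: a model that never predicts the rare labels still gets a low Hamming loss and a decent micro F1, while macro F1 drops because each missed rare label contributes a zero to the average.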

🧠 Insights:
This model provides a useful baseline, especially considering it was built using simple feature extraction and traditional machine learning. However, bag-of-words features capture little context or nuance, both of which are critical for accurately detecting emotions in natural language. Therefore, we will now move on to a deep learning approach, leveraging a pre-trained language model (BERT), to improve emotion classification performance.

