
### Corpus information

- Description of the chosen corpus: Emotion is a dataset of English Twitter messages, each labeled with one of six basic emotions: anger, fear, joy, love, sadness, and surprise.
- Paper(s) and other published materials related to the corpus: https://aclanthology.org/D18-1404.pdf
- State-of-the-art performance (best published results) on this corpus: The best reported performance on the Emotion dataset comes from a fine-tuned DistilBERT, a distilled version of BERT designed to be smaller and faster while retaining most of BERT's language understanding capabilities. The model achieves a macro F1-score of 0.938 and an accuracy of 93.75% on the evaluation set (https://huggingface.co/esuriddick/distilbert-base-uncased-finetuned-emotion).

#### Code references
- https://www.geeksforgeeks.org/text-preprocessing-for-nlp-tasks/
- https://stackoverflow.com/questions/64009196/nlp-text-preprocessing
- https://stackoverflow.com/questions/77552178/nlp-preprocessing-text-in-data-frame-what-is-the-correct-order
- https://stackoverflow.com/questions/68429871/how-to-deal-with-support-vector-machine-and-text-data-when-there-are-multiple-te
- https://stackoverflow.com/questions/64300362/gridsearch-for-nlp-how-to-combine-countvec-and-other-features
- https://stackoverflow.com/questions/51787997/python-using-gridsearchcv-with-nltk

---

## 1. Setup

In [11]:
# installing Hugging Face datasets library
!pip install datasets

# installing NLTK and scikit-learn for NLP and ML
!pip install nltk scikit-learn

# installing contractions package to expand contracted forms
!pip install contractions

# installing ftfy to fix text encoding issues
!pip install ftfy

# loading Hugging Face dataset utility
from datasets import load_dataset

# Natural Language Toolkit (NLTK) for preprocessing
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# for text cleaning and preprocessing
import string
import re
import contractions
import ftfy
import html
import requests
import csv

# for data handling
import pandas as pd
import joblib

# for models
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# for text vectorization
from sklearn.feature_extraction.text import TfidfVectorizer

# for model selection and tuning
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.utils import shuffle

# for evaluation metrics
from sklearn.metrics import accuracy_score, classification_report



[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Junaid\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Junaid\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Junaid\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


---

## 2. Data download and preprocessing

### 2.1. Download the corpus

In [5]:
# Your code to download the corpus here

# loading the 'emotion' dataset
emotion_dataset = load_dataset("dair-ai/emotion")

# checking the dataset structure
emotion_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})

### 2.2. Preprocessing

In [7]:
# Your code for any necessary preprocessing here

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # expanding contractions
    text = contractions.fix(text)

    # lowercasing the text
    text = text.lower()

    # removing URLs if any
    text = re.sub(r'http\S+|www\S+|https\S+', '', text)

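    # removing mentions, hashtags, and reserved words (RT, FAV)
    # note: 'rt' and 'fav' are not anchored to word boundaries, so they are
    # also stripped inside longer words (e.g. 'property' -> 'propey')
    text = re.sub(r'@\w+|#\w+|rt|fav', '', text)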

    # removing numbers
    text = re.sub(r'\d+', '', text)

    # removing punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # tokenizing
    tokens = word_tokenize(text)

    # removing stopwords and lemmatizing the remaining tokens
    cleaned_tokens = [lemmatizer.lemmatize(token) for token in tokens if token not in stop_words]
    return ' '.join(cleaned_tokens).strip()

# applying preprocessing to each split
train_texts = [preprocess_text(text) for text in emotion_dataset['train']['text']]
val_texts = [preprocess_text(text) for text in emotion_dataset['validation']['text']]
test_texts = [preprocess_text(text) for text in emotion_dataset['test']['text']]

# saving labels
train_labels = emotion_dataset['train']['label']
val_labels = emotion_dataset['validation']['label']
test_labels = emotion_dataset['test']['label']

# printing examples that contain text before and after preprocessing
for i in range(10):
    original = emotion_dataset['train']['text'][i]
    cleaned = train_texts[i]
    label = emotion_dataset['train']['label'][i]
    print(f"Sample {i+1}")
    print("Original: ", original)
    print("Cleaned: ", cleaned)
    print("Label: ", emotion_dataset['train'].features['label'].int2str(label))
    print("-" * 60)

Sample 1
Original:  i didnt feel humiliated
Cleaned:  feel humiliated
Label:  sadness
------------------------------------------------------------
Sample 2
Original:  i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake
Cleaned:  go feeling hopeless damned hopeful around someone care awake
Label:  sadness
------------------------------------------------------------
Sample 3
Original:  im grabbing a minute to post i feel greedy wrong
Cleaned:  grabbing minute post feel greedy wrong
Label:  anger
------------------------------------------------------------
Sample 4
Original:  i am ever feeling nostalgic about the fireplace i will know that it is still on the property
Cleaned:  ever feeling nostalgic fireplace know still propey
Label:  love
------------------------------------------------------------
Sample 5
Original:  i am feeling grouchy
Cleaned:  feeling grouchy
Label:  anger
------------------------------------------------------
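
Sample 4 shows "property" reduced to "propey": the alternatives `rt` and `fav` in the reserved-word pattern match anywhere in the text, including inside longer words. The sketch below reproduces the effect and shows a word-boundary variant as one possible alternative. The names `LOOSE`, `BOUNDED`, `strip_tokens`, and the sample sentence are made up for illustration, and the results reported in this notebook were produced with the original pattern.

```python
import re

# Pattern as used in preprocess_text: 'rt' and 'fav' match anywhere in the text.
LOOSE = re.compile(r'@\w+|#\w+|rt|fav')

# Hypothetical variant: \b limits 'rt' and 'fav' to standalone tokens.
BOUNDED = re.compile(r'@\w+|#\w+|\brt\b|\bfav\b')


def strip_tokens(pattern, text):
    """Remove every match of `pattern` and collapse the leftover whitespace."""
    return ' '.join(pattern.sub('', text).split())


sample = "rt @user it is still on the property and my favorite part #tbt"

print(strip_tokens(LOOSE, sample))    # it is still on the propey and my orite pa
print(strip_tokens(BOUNDED, sample))  # it is still on the property and my favorite part
```

Switching to the bounded pattern would change the cleaned texts, so the models below would need to be retrained before their scores could be compared.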

---

## 3. Machine learning model

### 3.1. Model training

In [13]:
# Your code to train the machine learning model on the training set and evaluate the performance on the validation set here

# initialising TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# initialising linear support vector classifier
svm_model = LinearSVC(dual='auto')

# creating pipeline which is a combination of TF-IDF + SVM
pipeline = make_pipeline(vectorizer, svm_model)

# training the model
pipeline.fit(train_texts, train_labels)

# predicting on validation set
y_pred = pipeline.predict(val_texts)

# evaluating performance
accuracy = accuracy_score(val_labels, y_pred)
print(f"Validation Accuracy: {accuracy * 100:.2f}%")
print("Classification Report:")
print(classification_report(val_labels, y_pred, target_names=emotion_dataset['train'].features['label'].names))

Validation Accuracy: 90.85%
Classification Report:
              precision    recall  f1-score   support

     sadness       0.91      0.95      0.93       550
         joy       0.93      0.94      0.93       704
        love       0.86      0.83      0.85       178
       anger       0.93      0.91      0.92       275
        fear       0.88      0.82      0.85       212
    surprise       0.84      0.79      0.82        81

    accuracy                           0.91      2000
   macro avg       0.89      0.87      0.88      2000
weighted avg       0.91      0.91      0.91      2000
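
To make the TF-IDF step of the pipeline more concrete, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings (smoothed IDF followed by L2 normalisation). It assumes plain whitespace tokenization, which is simpler than the vectorizer's own tokenizer, and the function name `tfidf` is hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalised)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # document frequency: in how many documents each term occurs
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    # smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        raw = {term: count * idf[term] for term, count in term_freq.items()}
        # L2 normalisation; the fallback avoids dividing by zero for empty documents
        norm = math.sqrt(sum(weight ** 2 for weight in raw.values())) or 1.0
        vectors.append({term: weight / norm for term, weight in raw.items()})
    return vectors


vectors = tfidf(["feel humiliated", "feeling grouchy", "feel greedy wrong"])
print(vectors[0])
```

In the first toy document, "humiliated" receives a higher weight than "feel" because "feel" also occurs in another document and is therefore less distinctive.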



### 3.2. Hyperparameter optimization

In [17]:
# Your code for hyperparameter optimization here

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LinearSVC(dual='auto'))
])

# hyperparameter grid for tuning
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'tfidf__min_df': [1, 2, 5],
    'tfidf__max_df': [0.75, 0.9, 1.0],
    'clf__C': [0.01, 0.1, 1, 10]
}

# applying grid search with 5-fold cross-validation
grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1, verbose=2)

# running grid search on training data
grid_search.fit(train_texts, train_labels)

# printing best parameters
print("\nBest Parameters:")
print(grid_search.best_params_)

# evaluating on validation set using best estimator
best_model = grid_search.best_estimator_
y_pred = best_model.predict(val_texts)

accuracy = accuracy_score(val_labels, y_pred)
print(f"\nValidation Accuracy (after tuning): {accuracy * 100:.2f}%")
print("Classification Report:")
print(classification_report(val_labels, y_pred, target_names=emotion_dataset['train'].features['label'].names))

Fitting 5 folds for each of 72 candidates, totalling 360 fits

Best Parameters:
{'clf__C': 10, 'tfidf__max_df': 0.75, 'tfidf__min_df': 1, 'tfidf__ngram_range': (1, 2)}

Validation Accuracy (after tuning): 91.15%
Classification Report:
              precision    recall  f1-score   support

     sadness       0.93      0.95      0.94       550
         joy       0.93      0.94      0.93       704
        love       0.86      0.84      0.85       178
       anger       0.93      0.91      0.92       275
        fear       0.87      0.84      0.86       212
    surprise       0.78      0.75      0.77        81

    accuracy                           0.91      2000
   macro avg       0.88      0.87      0.88      2000
weighted avg       0.91      0.91      0.91      2000
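
The "72 candidates, totalling 360 fits" line in the log follows directly from the size of the grid. A short sketch of the arithmetic, using the same `param_grid` values as above (the variable names `candidates` and `total_fits` are illustrative):

```python
from itertools import product

param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'tfidf__min_df': [1, 2, 5],
    'tfidf__max_df': [0.75, 0.9, 1.0],
    'clf__C': [0.01, 0.1, 1, 10],
}

# every combination of one value per hyperparameter is one candidate
candidates = list(product(*param_grid.values()))
n_folds = 5
total_fits = len(candidates) * n_folds

print(len(candidates))  # 72  (2 * 3 * 3 * 4)
print(total_fits)       # 360 (72 candidates * 5 folds)
```

With the default `refit=True`, `GridSearchCV` then fits the best configuration once more on the full training set, which is what `best_estimator_` returns.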



### 3.3. Evaluation on test set

In [19]:
# Your code to evaluate the final model on the test set here

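# note: `pipeline` here is the untuned Pipeline defined in 3.2 (GridSearchCV fits
# clones of it), so this refit uses default hyperparameters, not `best_model`
pipeline.fit(train_texts, train_labels)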

# making predictions on the test set
y_test_pred = pipeline.predict(test_texts)

# evaluating performance on the test set
test_accuracy = accuracy_score(test_labels, y_test_pred)
print(f"Test Accuracy: {test_accuracy * 100:.2f}%")
print("Test Classification Report:")
print(classification_report(test_labels, y_test_pred, target_names=emotion_dataset['train'].features['label'].names))

Test Accuracy: 89.20%
Test Classification Report:
              precision    recall  f1-score   support

     sadness       0.92      0.93      0.93       581
         joy       0.91      0.93      0.92       695
        love       0.79      0.79      0.79       159
       anger       0.87      0.88      0.88       275
        fear       0.87      0.83      0.85       224
    surprise       0.75      0.67      0.70        66

    accuracy                           0.89      2000
   macro avg       0.85      0.84      0.85      2000
weighted avg       0.89      0.89      0.89      2000



---

## 4. Results and summary

### 4.1. Corpus insights

The emotion dataset consists of English-language tweets annotated for six basic emotions: anger, fear, joy, love, sadness, and surprise. Each instance is a short tweet labeled with a single dominant emotion. The classes are noticeably imbalanced: in the validation split, joy (704) and sadness (550) together make up well over half of the 2,000 examples, while love (178) and surprise (81) are the smallest classes. Another insight from analyzing this corpus is that social media language, especially on Twitter, is informal and often includes emojis, hashtags, and contractions, which can introduce noise, so text preprocessing is essential.
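
The class balance can be checked directly from the per-class support column of the validation report in Section 3.1. The sketch below uses the validation split only because its counts are printed above; the training split may differ somewhat, and the variable names are illustrative.

```python
# per-class support from the validation classification report in Section 3.1
support = {
    'sadness': 550,
    'joy': 704,
    'love': 178,
    'anger': 275,
    'fear': 212,
    'surprise': 81,
}

total = sum(support.values())
shares = {label: count / total for label, count in support.items()}

# print classes from largest to smallest share
for label, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{label:<9}{share:6.1%}")

# ratio between the largest and the smallest class
imbalance_ratio = max(support.values()) / min(support.values())
print(f"largest / smallest class: {imbalance_ratio:.2f}")
```

Joy accounts for about 35% of the validation examples and surprise for about 4%, a ratio of roughly 8.7 to 1.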

### 4.2. Results

The model, a TF-IDF vectorizer followed by a linear SVM classifier, performs well at every stage of evaluation. Before hyperparameter tuning it reached a validation accuracy of 90.85%, with high F1-scores for most emotion categories, particularly joy, sadness, and anger. After tuning (best parameters: C=10, ngram_range=(1, 2), min_df=1, max_df=0.75), validation accuracy rose slightly to 91.15%, a gain of 0.30 percentage points, while macro F1 stayed at 0.88. On the unseen test set, the pipeline refit in Section 3.3, which as written uses the default rather than the tuned hyperparameters, reached an accuracy of 89.20%. Performance was highest for sadness, joy, and anger, while love and surprise were weaker, likely because they are the two smallest classes.
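
To put the accuracy figures in perspective, they can be converted back into counts of correctly classified examples, using the split sizes reported in Section 2.1. The helper `correct_count` is hypothetical.

```python
VAL_SIZE = 2000   # validation split size from Section 2.1
TEST_SIZE = 2000  # test split size from Section 2.1


def correct_count(accuracy_pct, n_examples):
    """Number of correctly classified examples implied by an accuracy percentage."""
    return round(accuracy_pct / 100 * n_examples)


baseline_correct = correct_count(90.85, VAL_SIZE)
tuned_correct = correct_count(91.15, VAL_SIZE)
test_correct = correct_count(89.20, TEST_SIZE)

print(baseline_correct)                  # 1817
print(tuned_correct)                     # 1823
print(tuned_correct - baseline_correct)  # 6
print(test_correct)                      # 1784
```

The validation gain after tuning therefore corresponds to six additional correctly classified examples out of 2,000.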

### 4.3. Relation to state of the art

The state-of-the-art performance on this dataset was achieved with a fine-tuned DistilBERT model. This model, distilbert-base-uncased-finetuned-emotion, reaches an accuracy of 93.75% and a macro F1-score of 0.9379 on the evaluation set, as reported on its Hugging Face model card (see also https://paperswithcode.com/sota/text-classification-on-emotion). In comparison, my TF-IDF + Linear SVM model achieved a validation accuracy of 91.15% and a test accuracy of 89.20%, with macro F1-scores of 0.88 on the validation set and 0.85 on the test set. These results are a few percentage points below the transformer-based approach, but they show that traditional machine learning methods can still be competitive at significantly lower computational cost and with a simpler architecture. The gap reflects how effectively pre-trained transformer models like DistilBERT capture contextual information, while the small size of that gap confirms the strong baseline achievable with well-tuned classical models.
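
Macro F1, the metric used in this comparison, is the unweighted mean of the per-class F1-scores, so a small class such as surprise counts as much as joy. The sketch below recomputes the macro and weighted averages from the tuned validation report in Section 3.2. The inputs are the rounded values printed there, so the results are approximate.

```python
# per-class F1 and support from the tuned validation report in Section 3.2
f1 = {'sadness': 0.94, 'joy': 0.93, 'love': 0.85,
      'anger': 0.92, 'fear': 0.86, 'surprise': 0.77}
support = {'sadness': 550, 'joy': 704, 'love': 178,
           'anger': 275, 'fear': 212, 'surprise': 81}

# macro average: every class contributes equally
macro_f1 = sum(f1.values()) / len(f1)

# weighted average: every class contributes in proportion to its support
weighted_f1 = sum(f1[label] * support[label] for label in f1) / sum(support.values())

print(f"macro F1:    {macro_f1:.2f}")     # 0.88
print(f"weighted F1: {weighted_f1:.2f}")  # 0.91
```

The macro average is lower than the weighted one because the weakest classes (surprise, love) are also the smallest.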

---

## 5. Testing model on out-of-domain docs
### 5.1. Annotating out-of-domain documents

1. The out-of-domain dataset selected for this evaluation was a collection of Reddit threads about the movie Once Upon a Time in Hollywood (https://www.kaggle.com/datasets/jrw2200/once-upon-a-time-in-hollywood-reddit-thread). The dataset comes from a public Kaggle repository and contains discussions of the movie from the Reddit community. As the task required, I selected 50 samples from this dataset.
2. The annotation process involved manually assigning one of the six emotion labels to each text. Given the nature of Reddit discussions, some posts contained mixed sentiments; in such cases, the most prominent emotion was chosen.
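
Manual annotation is easy to get subtly wrong (an empty text, a label ID outside 0-5), so a quick check before evaluation can help. This is a minimal sketch, assuming the annotations are stored as a CSV with `text` and `label` columns and integer label IDs. The function `check_annotations` is hypothetical, and the two demo rows are taken from the annotated data in Section 5.5.

```python
import csv
import io
from collections import Counter

NUM_CLASSES = 6  # six emotion labels, encoded as 0..5


def check_annotations(csv_file, text_column='text', label_column='label'):
    """Return label counts after checking each row has text and a valid label ID."""
    counts = Counter()
    for row_number, row in enumerate(csv.DictReader(csv_file), start=1):
        text = (row.get(text_column) or '').strip()
        label = int(row[label_column])
        if not text:
            raise ValueError(f"row {row_number}: empty text")
        if not 0 <= label < NUM_CLASSES:
            raise ValueError(f"row {row_number}: label {label} out of range")
        counts[label] += 1
    return counts


# in-memory stand-in for the annotation file
demo = io.StringIO("text,label\nIt was boring as shit.,0\nThat's about 2 hours in.,5\n")
print(check_annotations(demo))
```

Run on the real file, the returned counts should sum to 50 and show how the annotated labels are distributed.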

### 5.2. Conversion into dataset

In [23]:
# Your code to convert the annotations into a dataset here

file_path = 'reddit_comments.csv'

df = pd.read_csv(file_path)

text_column = 'text'
label_column = 'label'

# extracting texts and labels
df_texts = df[text_column].astype(str).tolist()
df_labels = df[label_column].tolist()

### 5.3. Model evaluation on out-of-domain test set

In [28]:
# Your code to evaluate the model on the out-of-domain test set here

# making predictions using the trained model
predictions = pipeline.predict(df_texts)

# evaluating model performance on reddit data
accuracy = accuracy_score(df_labels, predictions)
print(f"Accuracy: {accuracy * 100:.2f}%")

class_names = ['anger', 'fear', 'joy', 'love', 'sadness', 'surprise']
print("Classification Report:")
print(classification_report(df_labels, predictions, target_names=class_names, zero_division=0))

Accuracy: 26.00%
Classification Report:
              precision    recall  f1-score   support

       anger       0.60      0.56      0.58        16
        fear       0.10      1.00      0.19         3
         joy       0.00      0.00      0.00         8
        love       0.50      0.12      0.20         8
     sadness       0.00      0.00      0.00         6
    surprise       0.00      0.00      0.00         9

    accuracy                           0.26        50
   macro avg       0.20      0.28      0.16        50
weighted avg       0.28      0.26      0.23        50



### 5.4. Bonus task results

The model was evaluated on a 50-text out-of-domain test set of movie discussion posts from Reddit. This text domain differs from the tweets the model was trained on, so lower performance was expected. The accuracy of 26.00% shows that the model struggled to generalize to the new text type. According to the classification report, the model did best on "anger," with a precision of 0.60 and a recall of 0.56. For "fear," recall was 1.00, meaning all three instances were identified, but precision was only 0.10, so most predictions of this class were wrong. The model failed to recognize "joy," "sadness," and "surprise" at all, with zero precision and recall for these emotions. These figures should be read with some caution: the Reddit texts were passed to the model without `preprocess_text`, and `class_names` lists the emotions alphabetically whereas the training labels follow the order sadness, joy, love, anger, fear, surprise, so the annotated label IDs may not line up with the model's. To conclude, this performance gap highlights the challenge of transferring a model trained on one type of text (tweets) to another (movie discussions on Reddit). The lack of overlap in language and context between these domains was likely a main reason for the model's lower accuracy.
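
One thing worth checking before interpreting these numbers is whether the annotation label IDs and the model's label IDs refer to the same emotions. The reports in Section 3 print the dataset's classes in the order sadness, joy, love, anger, fear, surprise, while `class_names` in Section 5.3 is alphabetical. The sketch below assumes the Reddit annotations follow the alphabetical scheme (the rows in Section 5.5 suggest this, but it is an assumption). The helper `to_annotation_ids` is hypothetical.

```python
# label order of the training data, as printed in the reports in Section 3
DATASET_ORDER = ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']

# alphabetical order, as in `class_names` in Section 5.3
# (assumed here to be the scheme used for the manual annotations)
ANNOTATION_ORDER = ['anger', 'fear', 'joy', 'love', 'sadness', 'surprise']


def to_annotation_ids(model_ids):
    """Translate label IDs predicted by the model into the annotation ID scheme."""
    return [ANNOTATION_ORDER.index(DATASET_ORDER[int(i)]) for i in model_ids]


# the model's IDs for sadness, joy, anger become 4, 2, 0 in the annotation scheme
print(to_annotation_ids([0, 1, 3]))  # [4, 2, 0]
```

In the notebook, this would mean comparing `df_labels` with `to_annotation_ids(pipeline.predict(...))` rather than with the raw predictions, ideally after running the Reddit texts through `preprocess_text` as was done for the training data.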

### 5.5. Annotated data

In [31]:
# Include your annotated out-of-domain data here
df

Unnamed: 0,text,label
0,"Margot Robbie's scene was touching, watching S...",3
1,The horror movies this year have been largely ...,0
2,I love Reservoir Dogs. It made me want to stud...,3
3,It was a different friend of Tate's who encoun...,5
4,Margot Robbie watching Sharon Tate onscreen wa...,3
5,It was boring as shit.,0
6,"I didn't like Midsommar, though I understand t...",0
7,John Wick was pretty good. I'm happy the Mavs ...,2
8,Margaret Qualley was very entertaining in this...,5
9,That's about 2 hours in.,5
