We have already seen strong in-domain performance on GoEmotions, but real users don’t only write Reddit comments; they write tweets, product reviews, support tickets, and so on. By running the three models (SVM, BERT, RoBERTa) on the Hugging Face “emotion” tweets corpus, we perform a domain-shift evaluation that asks, “How well does a GoEmotions-trained classifier generalize to text written in a different style and genre?” The gap between in-domain and out-of-domain accuracy is critical for understanding whether the model is robust or whether it has mostly learned patterns specific to Reddit.

Using the small, human-labeled Emotion dataset lets us do this with no extra annotation effort: we already have gold labels for sadness, joy, love, anger, fear, and surprise, each of which has a same-named label in the GoEmotions taxonomy. (The dataset has no neutral class.) By computing Top-1 accuracy, we see how often the model’s single best guess matches the human label. By computing Top-3 recall, we measure whether the true emotion appears among the model’s three highest-scoring labels, which matters in multi-label or recommendation-style scenarios where surfacing the right answer among the top few options can be good enough for a downstream application.
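
To make the two metrics concrete, here is a minimal sketch that computes them from a matrix of per-label scores. The helper name `top_k_hits` and the toy numbers are illustrative assumptions, not part of the notebook's pipeline.

```python
import numpy as np

def top_k_hits(scores: np.ndarray, gold: np.ndarray, k: int) -> np.ndarray:
    """Boolean array: True where the gold label is among the k highest-scoring labels."""
    # argsort is ascending, so the last k columns hold the k largest scores per row
    top_k = np.argsort(scores, axis=1)[:, -k:]
    return (top_k == gold[:, None]).any(axis=1)

# Toy scores for 3 texts over 4 labels (made-up numbers, not model output)
scores = np.array([
    [0.10, 0.70, 0.15, 0.05],   # gold label 1 is ranked first
    [0.40, 0.05, 0.35, 0.20],   # gold label 2 is ranked second
    [0.50, 0.30, 0.15, 0.05],   # gold label 3 is ranked last
])
gold = np.array([1, 2, 3])

top1_accuracy = top_k_hits(scores, gold, k=1).mean()
top3_recall = top_k_hits(scores, gold, k=3).mean()
print(f"Top-1 accuracy: {top1_accuracy:.3f}, Top-3 recall: {top3_recall:.3f}")
```

In this toy case only the first text counts toward Top-1, while the first two count toward Top-3.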

Comparing the three approaches under this evaluation framework accomplishes several things at once:

Quantifies Generalization – We will see whether the clear RoBERTa advantage on GoEmotions holds when dealing with tweets, or if the simpler SVM or smaller BERT models close the gap.

Reveals Failure Modes – A big drop in Top-1 accuracy paired with only a modest drop in Top-3 recall would tell us that the model still “understands” the emotion but struggles to rank it first.

Guides Next Steps – If all models perform poorly out-of-domain, that’s a strong signal that we need domain-adaptive pretraining on tweet/review text, or at least some in-domain fine-tuning to bring performance back up.

Ultimately, this real-world evaluation gives us the evidence to say, “Here’s how my models behave when faced with new, unseen text—this is their true performance in the wild,” rather than just “here’s how they did on Reddit comments.” That kind of insight is what turns a proof-of-concept into a production-ready solution.

In [1]:
!pip install datasets transformers scikit-learn joblib




In [2]:
!pip install --upgrade fsspec datasets


In [3]:
from google.colab import drive
drive.mount("/content/drive")


Mounted at /content/drive


In [4]:
from datasets import load_dataset

# Load just to grab the label names:
go = load_dataset("go_emotions")
emotion_names = go["train"].features["labels"].feature.names
print(emotion_names)


['admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring', 'confusion', 'curiosity', 'desire', 'disappointment', 'disapproval', 'disgust', 'embarrassment', 'excitement', 'fear', 'gratitude', 'grief', 'joy', 'love', 'nervousness', 'optimism', 'pride', 'realization', 'relief', 'remorse', 'sadness', 'surprise', 'neutral']


## Load & Build The Three Inference Pipelines

### SVM

In [5]:
import joblib

vec_path = "/content/drive/MyDrive/svm_emotion_baseline/tfidf_vectorizer.joblib"
mdl_path = "/content/drive/MyDrive/svm_emotion_baseline/svm_model.joblib"
lb_path  = "/content/drive/MyDrive/svm_emotion_baseline/label_binarizer.joblib"

vectorizer = joblib.load(vec_path)
svm_model   = joblib.load(mdl_path)
mlb         = joblib.load(lb_path)


### BERT & RoBERTa

In [6]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

def make_multi_label_pipeline(model_folder):
    tok = AutoTokenizer.from_pretrained(model_folder)
    mdl = AutoModelForSequenceClassification.from_pretrained(model_folder)
    return pipeline(
        "text-classification",
        model=mdl,
        tokenizer=tok,
        function_to_apply="sigmoid",
        top_k=None       # return every label’s probability
    )


pipe_roberta = make_multi_label_pipeline("/content/drive/MyDrive/emotion_model")
pipe_bert    = make_multi_label_pipeline("/content/drive/MyDrive/BERT_model")


Device set to use cpu
Device set to use cpu
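
With `top_k=None`, each pipeline call yields one `{"label", "score"}` dict per label. The sketch below turns such a result into a ranked list of label IDs without loading a model. The helper `ranked_label_ids` is hypothetical, and the `LABEL_<id>` naming is an assumption about how these checkpoints were saved; a config with named `id2label` entries would need a different lookup.

```python
def ranked_label_ids(pipeline_output):
    """Convert one pipeline result (a list of {"label", "score"} dicts) into label IDs, best first."""
    # Sort explicitly rather than relying on the order the pipeline returns
    ordered = sorted(pipeline_output, key=lambda d: d["score"], reverse=True)
    # Assumes default label names of the form "LABEL_<id>"
    return [int(d["label"].split("_")[1]) for d in ordered]

# Hand-written stand-in for pipe([txt])[0]; the scores are invented
fake_output = [
    {"label": "LABEL_17", "score": 0.62},
    {"label": "LABEL_27", "score": 0.81},
    {"label": "LABEL_2", "score": 0.05},
]
print(ranked_label_ids(fake_output))  # [27, 17, 2]
```

Because the scores come from independent sigmoids, they need not sum to 1; only their relative order matters for Top-1 and Top-3.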


## Load the Out-of-Domain “Emotion” Tweets Set

In [7]:
from datasets import load_dataset

emod = load_dataset("emotion")
texts = emod["test"]["text"]       # ~2,000 tweets
gold  = emod["test"]["label"]      # ints 0..5 (six classes)
hf_names = emod["train"].features["label"].names
print("HF emotion labels:", hf_names)


HF emotion labels: ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']


## Map HF-Emotion Labels → Go-Emotions Indices

In [8]:
# Build name → GoEmotions index
go2idx = {name:i for i,name in enumerate(emotion_names)}

# HF index (0–5) → GoEmotions index (0–27)
hf2go = {hf_i: go2idx[name] for hf_i,name in enumerate(hf_names)}

print("Mapping HF→Go:", hf2go)


Mapping HF→Go: {0: 25, 1: 17, 2: 18, 3: 2, 4: 14, 5: 26}
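
The mapping only works if every tweet label has a same-named GoEmotions label. Here is a small self-contained check of that, using the two label lists printed earlier; the variable names `missing` and `never_gold` are illustrative.

```python
go_names = [
    "admiration", "amusement", "anger", "annoyance", "approval", "caring",
    "confusion", "curiosity", "desire", "disappointment", "disapproval",
    "disgust", "embarrassment", "excitement", "fear", "gratitude", "grief",
    "joy", "love", "nervousness", "optimism", "pride", "realization",
    "relief", "remorse", "sadness", "surprise", "neutral",
]
hf_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]

go2idx = {name: i for i, name in enumerate(go_names)}

# Fail early with a readable message instead of a KeyError inside the comprehension
missing = [name for name in hf_names if name not in go2idx]
if missing:
    raise ValueError(f"No GoEmotions label for: {missing}")

hf2go = {hf_i: go2idx[name] for hf_i, name in enumerate(hf_names)}

# GoEmotions labels that can never be the gold answer on this test set
never_gold = len(go_names) - len(set(hf2go.values()))
print(hf2go)
print(f"{never_gold} of {len(go_names)} GoEmotions labels never appear as gold")
```

Any prediction that lands on one of those 22 labels (for example `excitement` for a tweet tagged `joy`) is scored as a miss, which is worth keeping in mind when reading the metrics.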


## Run Inference & Compute Top-1 / Top-3 Metrics
### Loop over every tweet, get the sorted predictions from each model, and check:

Top-1 Accuracy: was the model’s #1 guess equal to the gold emotion?

Top-3 Recall: did the gold emotion appear anywhere in the model’s top-3 guesses?

Re-define preprocess_text

In [9]:
import nltk
nltk.download('punkt')      # the standard tokenizer
nltk.download('punkt_tab')  # the extra one NLTK is looking for
nltk.download('stopwords')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [10]:
# ──────────────── Re-define your SVM preprocessing ────────────────
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # 1) lowercase
    text = text.lower()
    # 2) remove everything except letters & spaces
    text = re.sub(r'[^a-z\s]', '', text)
    # 3) tokenize
    tokens = word_tokenize(text)
    # 4) drop stopwords
    tokens = [t for t in tokens if t not in stop_words]
    # 5) re-join
    return " ".join(tokens)
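
To see what this preprocessing discards on tweet-like input, the sketch below applies only steps 1 and 2 (lowercasing and the regex) to an invented string. It uses a plain whitespace split as a stand-in for `word_tokenize` so it runs without NLTK data; `strip_non_letters` is a hypothetical helper.

```python
import re

def strip_non_letters(text: str) -> str:
    """Steps 1-2 of preprocess_text: lowercase, then drop everything except a-z and whitespace."""
    return re.sub(r"[^a-z\s]", "", text.lower())

sample = "SO happy!!! #blessed 2day :) 😊"   # invented tweet-like string
cleaned = strip_non_letters(sample)
tokens = cleaned.split()   # whitespace split, standing in for word_tokenize
print(tokens)  # ['so', 'happy', 'blessed', 'day']
```

Punctuation, the hashtag marker, the emoticon, and the emoji all disappear, and `2day` collapses to `day`, so some emotional cues never reach the TF-IDF vectorizer.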


In [11]:
import numpy as np

n = len(texts)
results = {
    "svm_top1":   np.zeros(n, dtype=bool),
    "svm_top3":   np.zeros(n, dtype=bool),
    "bert_top1":  np.zeros(n, dtype=bool),
    "bert_top3":  np.zeros(n, dtype=bool),
    "roberta_top1": np.zeros(n, dtype=bool),
    "roberta_top3": np.zeros(n, dtype=bool),
}

for idx, (txt, hf_label) in enumerate(zip(texts, gold)):
    true_go = hf2go[hf_label]   # gold label expressed as a GoEmotions index

    # SVM: rank all 28 labels by decision score, highest first
    proc        = preprocess_text(txt)
    feats       = vectorizer.transform([proc])
    svm_scores  = svm_model.decision_function(feats)[0]
    svm_labels  = np.argsort(svm_scores)[::-1]   # column positions, assumed to equal GoEmotions label IDs
    results["svm_top1"][idx] = (svm_labels[0] == true_go)
    results["svm_top3"][idx] = (true_go in svm_labels[:3])

    # BERT & RoBERTa: the pipeline returns every label with its sigmoid score
    for name, pipe in [("bert", pipe_bert), ("roberta", pipe_roberta)]:
        out = pipe([txt])[0]
        out = sorted(out, key=lambda d: d["score"], reverse=True)  # make the ranking explicit
        pred_label_ids = [int(d["label"].split("_")[1]) for d in out]  # "LABEL_17" -> 17
        results[f"{name}_top1"][idx] = (pred_label_ids[0] == true_go)
        results[f"{name}_top3"][idx] = (true_go in pred_label_ids[:3])


# Summarize: the mean of the per-tweet hit flags is the metric
for model in ["svm", "bert", "roberta"]:
    top1 = results[f"{model}_top1"].mean()
    top3 = results[f"{model}_top3"].mean()
    print(f"{model.upper():8} → Top-1 Acc: {top1:.3f}, Top-3 Recall: {top3:.3f}")


SVM      → Top-1 Acc: 0.088, Top-3 Recall: 0.212
BERT     → Top-1 Acc: 0.198, Top-3 Recall: 0.384
ROBERTA  → Top-1 Acc: 0.213, Top-3 Recall: 0.403
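
The aggregate numbers hide which emotions fail. One possible follow-up, sketched here with toy data, is to break the per-tweet hit flags down by gold class. The function `per_class_accuracy` is a hypothetical helper, not something the notebook defines.

```python
import numpy as np

def per_class_accuracy(hits, gold, class_names):
    """Mean hit rate per gold class; classes with no examples are skipped."""
    hits = np.asarray(hits, dtype=bool)
    gold = np.asarray(gold)
    report = {}
    for class_id, name in enumerate(class_names):
        mask = gold == class_id
        if mask.any():
            report[name] = float(hits[mask].mean())
    return report

hf_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]

# Toy inputs (invented): six tweets, their gold classes, and whether Top-1 was right
toy_gold = np.array([0, 0, 1, 1, 1, 3])
toy_hits = np.array([True, False, True, True, False, False])

report = per_class_accuracy(toy_hits, toy_gold, hf_names)
for name, acc in report.items():
    print(f"{name:10} {acc:.3f}")
```

With the notebook's own variables, the equivalent call would be `per_class_accuracy(results["roberta_top1"], gold, hf_names)`.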


## Interpret the Numbers

Interpretation of Domain-Shift Results
The ordering SVM < BERT < RoBERTa still holds in the new setting, so the transformers continue to outperform the TF-IDF baseline. However, all models exhibit a dramatic drop in Top-1 and Top-3 metrics when moving from Reddit comments (GoEmotions) to tweets (HF Emotion).

Domain Mismatch Drives Low Scores

Style & Length: Tweets are typically shorter, use non-standard spellings, hashtags, and emoji, whereas Go-Emotions examples are longer, more formal Reddit sentences.

Vocabulary & Slang: Emotion expressions on Twitter often involve slang (“lol,” “smh”), emoticons, or culture-specific references that the models never saw in training.

As a result, even RoBERTa places the right emotion in its top 3 only ~40% of the time. That is still above chance, since the models rank all 28 GoEmotions labels and three random picks would hit the gold label about 11% of the time (3/28), but it is far from reliable, because the model must generalize across very different text genres.

Why This Is Expected

Overfitting to In-Domain Patterns: The fine-tuned models learn to pick up on distribution-specific cues (e.g. punctuation patterns, phrase structures) that don’t carry over to tweets.

No In-Domain Examples: Without seeing actual tweets during either pretraining or fine-tuning, it’s normal that performance plummets; transformers excel when their training data matches the test distribution.

Random-Guess Baseline Comparison: The tweet set has 6 gold classes, but the models choose among 28 labels, so a random Top-3 guess yields ~11% recall (3/28) and a random Top-1 guess ~4% (1/28). A guesser restricted to the 6 tweet classes would reach 50% Top-3 recall (3/6), which RoBERTa’s ~40% does not match. All three models beat the 28-label chance level, but the absolute scores remain low in this zero-shot scenario.
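
The chance level depends on how many labels the guesser chooses from: the models here rank all 28 GoEmotions labels, while the tweet set uses only 6. Here is a short worked calculation of both cases, with `chance_top_k` as an illustrative helper.

```python
def chance_top_k(k: int, n_labels: int) -> float:
    """Probability that k distinct labels drawn uniformly at random include the single gold label."""
    return k / n_labels

N_GOEMOTIONS = 28   # labels the models rank
N_HF_EMOTION = 6    # gold classes in the tweet set

for k in (1, 3):
    over_28 = chance_top_k(k, N_GOEMOTIONS)
    over_6 = chance_top_k(k, N_HF_EMOTION)
    print(f"k={k}: {over_28:.3f} over 28 labels, {over_6:.3f} over 6 labels")
```

This gives roughly 0.036 and 0.107 over 28 labels, and 0.167 and 0.500 over 6 labels, for Top-1 and Top-3 respectively.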

Key Takeaway

Low absolute numbers are not a “bug” but a clear signal of domain shift.

These results underscore the necessity of domain-adaptive steps—whether further pretraining on unlabeled tweets, or fine-tuning on even a small labeled tweet set—to bridge the gap between the research data and real-world text.

