You are evaluating a candidate sentiment model to replace a production baseline. Your goal is to determine whether this model should ship.

“Ship” means: we would choose the candidate model over the baseline for deployment based on the evidence you collect.

### Step 1 - Install the required dependencies, set up W&B, and make sure the Python version is 3.10 or above

In [1]:
!pip install -q wandb datasets transformers evaluate tqdm emoji regex pandas pyarrow scikit-learn nbformat torch

import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

import numpy as np
import pandas as pd
import emoji
import wandb

from datasets import load_dataset
from transformers import pipeline

SEED = 42
np.random.seed(SEED)



In [None]:
import os, wandb
wandb.login()

In [1]:
!python --version

Python 3.12.2
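
Step 1 asks for Python 3.10 or above, so the requirement can also be checked in code rather than by reading the version string. This is a minimal sketch; the helper name `python_is_supported` and the constant `MIN_PYTHON` are hypothetical and not part of the lab materials.

```python
import sys

MIN_PYTHON = (3, 10)  # minimum version required by Step 1


def python_is_supported(version_info=None, minimum=MIN_PYTHON) -> bool:
    """Return True if (major, minor) of version_info is at least `minimum`."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


# Report the result for the interpreter running this notebook.
status = "OK" if python_is_supported() else "too old"
print(f"Python {sys.version.split()[0]}: {status}")
```

Passing an explicit tuple, for example `python_is_supported((3, 9, 18))`, makes the check easy to exercise without switching interpreters.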


In [2]:
#imports and config:
import re, regex, emoji
import pandas as pd
import numpy as np
import tqdm

import wandb
from datasets import load_dataset
from transformers import pipeline
import evaluate


# WANDB CONFIG
PROJECT = "mlip-lab4-slices-2026"
ENTITY = None
RUN_NAME = "baseline_vs_candidate"


In [3]:
# Models to compare
MODELS = {
    "baseline_model": "cardiffnlp/twitter-roberta-base-sentiment-latest",
    "candidate_model":    "LYTinn/finetuning-sentiment-model-tweet-gpt2",
}

In [4]:
# Label normalization for tweet_eval (0/1/2 -> string labels)
ID2LABEL = {0: "negative", 1: "neutral", 2: "positive"}

# Many HF sentiment models output labels like LABEL_0 / LABEL_1 / LABEL_2
HF_LABEL_MAP = {"LABEL_0": "negative", "LABEL_1": "neutral", "LABEL_2": "positive"}

USE_HF_DATASET = True  # set False to use tweets.csv fallback
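
Because some checkpoints emit `LABEL_0`-style names and others emit readable names, it can help to funnel every raw label through one function that fails loudly on anything unexpected. The sketch below is an assumption-laden illustration: `normalize_label` and `VALID_LABELS` are hypothetical names, and the `LABEL_0 -> negative` ordering is copied from `HF_LABEL_MAP` above and should be verified against each model's `config.id2label` before trusting the results.

```python
# Same mapping as HF_LABEL_MAP above; the index-to-sentiment order is an
# assumption that must be checked per model.
HF_LABEL_MAP = {"LABEL_0": "negative", "LABEL_1": "neutral", "LABEL_2": "positive"}
VALID_LABELS = {"negative", "neutral", "positive"}


def normalize_label(raw_label: str) -> str:
    """Map a raw pipeline label to negative/neutral/positive, or raise if unknown."""
    label = HF_LABEL_MAP.get(raw_label, raw_label).strip().lower()
    if label not in VALID_LABELS:
        raise ValueError(f"Unexpected model label: {raw_label!r}")
    return label


print(normalize_label("LABEL_2"))  # mapped through HF_LABEL_MAP
print(normalize_label("Neutral"))  # already readable, only lower-cased
```

Raising on unknown labels is a deliberate choice here: a silently passed-through label would never equal the ground truth and would show up as a drop in accuracy instead of as an error.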

### Step 2 - Load a dataset from Hugging Face

In [5]:
if USE_HF_DATASET:
    ds = load_dataset("cardiffnlp/tweet_eval", "sentiment")
    df = pd.DataFrame(ds["test"]).head(500).copy()
    df["label"] = df["label"].map(ID2LABEL)
else:
    df = pd.read_csv("tweets.csv")
    # Ensure it has 'text' and 'label' columns
    df = df.rename(columns={c: c.strip() for c in df.columns})
    assert {"text","label"}.issubset(df.columns), "tweets.csv must include text,label"
    df["label"] = df["label"].astype(str).str.lower()

df = df[["text","label"]].dropna().reset_index(drop=True)
df.head(3)



Unnamed: 0,text,label
0,@user @user what do these '1/2 naked pics' hav...,neutral
1,OH: “I had a blue penis while I was this” [pla...,neutral
2,"@user @user That's coming, but I think the vic...",neutral


### Step 3 - Define Failure-Relevant Metadata

#TODO:
In this step, you will create **at least 5** metadata columns that help you slice and analyze model behavior in Weights & Biases (W&B).
These metadata columns should **capture meaningful properties of the data or model behavior that may influence performance**. You can define them using:

1. Value matching (e.g., tweets containing hashtags or mentions)
2. Regex patterns (e.g., negation words, strong sentiment terms like love or hate)
3. Heuristics (e.g., emoji count, all-caps text, tweet length buckets)

Each metadata column should correspond to a potential hypothesis about when or why a model might succeed or fail.
These columns will be propagated through inference and included in the final predictions_table logged to W&B.

After inference, your W&B table (df_long) will contain:
- The original tweet text
- Ground-truth sentiment labels
- Model predictions and confidence scores
- All metadata columns you defined for slicing

You will use these metadata fields in the W&B UI (via the ➕ Filter option) to:
- Create slices of the data
- Compare model behavior across slices
- Identify patterns, weaknesses, or regressions that are not visible in overall accuracy
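
As an illustration of a regex-based metadata column tied to a hypothesis, the sketch below flags character elongation ("sooo", "YESSS"), on the assumption that elongation signals emphasis that a model may under- or over-weight. The column idea and the names `ELONGATION_RE` and `has_elongation` are hypothetical, not part of the lab's required set.

```python
import re

# A letter followed by at least two more copies of itself (3+ in a row).
# Note: "www" also matches, so strip URLs first if that matters for your slice.
ELONGATION_RE = re.compile(r"([A-Za-z])\1{2,}")


def has_elongation(text: str) -> bool:
    """Return True if the text contains a letter repeated three or more times in a row."""
    return bool(ELONGATION_RE.search(str(text)))


examples = ["this is sooo good", "this is so good", "YESSS finally"]
for example in examples:
    print(has_elongation(example), "|", example)
```

In the notebook this could be attached as a column with `df["has_elongation"] = df["text"].apply(has_elongation)`, after which it needs to be added to the pivot index lists in the later steps to be carried through.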

In [14]:
# Step 3 - Add slicing metadata (text-only)

# TODO: add your own hypothesis-driven metadata here. 
# Here are examples of the kinds of metadata columns you can add & analyse.
# Categories you can explore are: Linguistic, Emotional/semantic, Model-behavioral. Do not reuse the ones given below.
def count_emojis(text: str) -> int:
    return sum(ch in emoji.EMOJI_DATA for ch in str(text))

df["emoji_count"] = df["text"].apply(count_emojis).astype(int)
df["has_hashtag"] = df["text"].str.contains(r"#\w+", regex=True)
df["has_mention"] = df["text"].str.contains(r"@\w+", regex=True)
# Non-capturing group (?:...) matches the same tweets but avoids pandas' match-group UserWarning
df["has_negation"] = df["text"].str.contains(r"\b(?:not|never|no)\b", regex=True)
df["length_bucket"] = pd.cut(
    df["text"].str.len(),
    bins=[0, 50, 100, 200, 1000, 10_000],
    labels=["0-50", "51-100", "101-200", "201-1000", "1001+"],
    include_lowest=True
).astype(str)

# Additional hypothesis-driven slices
df["has_all_caps"] = df["text"].apply(lambda x: any(word.isupper() and len(word) > 2 for word in str(x).split()))
df["has_question"] = df["text"].str.contains(r"\?", regex=True)
df["has_url"] = df["text"].str.contains(r"http[s]?://|www\.", regex=True)
# Non-capturing groups (?:...) match the same tweets but avoid pandas' match-group UserWarning
df["has_strong_sentiment"] = df["text"].str.contains(r"\b(?:love|hate|amazing|terrible|worst|best)\b", case=False, regex=True)
df["has_sarcasm_indicators"] = df["text"].str.contains(r"\b(?:yeah right|sure|totally|obviously)\b", case=False, regex=True)

# Example slice definitions (you'll create more later)
def get_slices(df_any: pd.DataFrame):
    return {
        "emoji_gt3": df_any["emoji_count"] > 3,
        "has_negation": df_any["has_negation"] == True,
        "has_hashtag": df_any["has_hashtag"] == True,
        "has_all_caps": df_any["has_all_caps"] == True, # Added
        "has_question": df_any["has_question"] == True, # Added
        "has_strong_sentiment": df_any["has_strong_sentiment"] == True, # Added
        "short_tweets": df_any["length_bucket"] == "0-50", # Added
    }



### Assumptions / Hypotheses for the *new* metadata columns
These metadata columns are designed to create hypothesis-driven slices in W&B (to reveal failure modes beyond overall accuracy).

- `has_all_caps` (ALL CAPS emphasis): Tweets with shouting/emphasis may be misread as stronger sentiment than intended, or may correlate with anger/excitement. Expect higher error/regression when emphasis contradicts literal wording.
- `has_question` (contains `?`): Questions are often informational/neutral, but rhetorical questions can imply sentiment. Expect confusion between neutral vs (positive/negative), especially for rhetorical questions.
- `has_url` (contains link): URLs can reduce lexical sentiment signal (more “news-like”/headline-like), and can include truncated context. Expect more neutral defaults or misclassification when sentiment is implied by context not text.
- `has_strong_sentiment` (lexicon like love/hate/best/worst): Strong sentiment words should make classification easier, but models may over-rely on keywords and fail on negation/hedging (e.g., “not the best”). Expect high confidence wrong predictions in tricky phrasing.
- `has_sarcasm_indicators` (phrases like “yeah right”, “sure”, “totally”, “obviously”): These often invert the literal sentiment. Expect large candidate regressions if the model lacks pragmatic understanding of sarcasm.

In [7]:
# Transformers requires a backend (PyTorch/TensorFlow/Flax). We'll use PyTorch.
try:
    import torch, transformers, sys
    print("torch:", torch.__version__)
    print("transformers:", transformers.__version__)
    print("CUDA available:", torch.cuda.is_available())
    print("Python:", sys.executable)
except Exception as e:
    raise RuntimeError("Install PyTorch before proceeding: pip install torch torchvision torchaudio") from e

torch: 2.8.0
transformers: 4.57.1
CUDA available: False
Python: /opt/anaconda3/bin/python


### Step 4 - Run Inference (Two Models)

In this step, you'll run inference on your dataset with two Hugging Face sentiment analysis models: the baseline and the candidate defined in `MODELS`.

In [15]:
from tqdm.auto import tqdm

def run_pipeline(model_id: str, texts: list[str]):
    clf = pipeline(
        "text-classification",
        model=model_id,
        truncation=True,
        max_length=128,     # avoid truncation warnings
        framework="pt",
        device=-1           # CPU
    )
    # (Optional) sanity check label mapping for this model
    # print(model_id, clf.model.config.id2label)

    preds, confs = [], []
    for t in tqdm(texts, desc=f"Infer: {model_id}"):
        out = clf(t)[0]
        lbl = HF_LABEL_MAP.get(out["label"], out["label"])
        preds.append(lbl)
        confs.append(float(out["score"]))
    return preds, confs

pred_frames = []
texts = df["text"].tolist()

for model_name, model_id in MODELS.items():
    yhat, conf = run_pipeline(model_id, texts)
    tmp = df.copy()
    tmp["model"] = model_name
    tmp["pred"] = yhat
    tmp["conf"] = conf
    pred_frames.append(tmp)

df_long = pd.concat(pred_frames, ignore_index=True)

# Add an example id for reshaping. Note: identical (text, label) pairs share one id,
# so exact duplicate tweets collapse into a single row when pivoted.
df_long["ex_id"] = df_long.groupby(["text", "label"]).ngroup()

df_long.head(5)

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu



Unnamed: 0,text,label,emoji_count,has_hashtag,has_mention,has_negation,length_bucket,has_all_caps,has_question,has_url,has_strong_sentiment,has_sarcasm_indicators,model,pred,conf,ex_id
0,@user @user what do these '1/2 naked pics' hav...,neutral,0,False,True,True,51-100,False,True,False,False,False,baseline_model,negative,0.804726,113
1,OH: “I had a blue penis while I was this” [pla...,neutral,0,False,False,False,51-100,True,False,False,False,False,baseline_model,neutral,0.866949,363
2,"@user @user That's coming, but I think the vic...",neutral,0,False,True,False,51-100,False,False,False,False,False,baseline_model,neutral,0.763724,102
3,I think I may be finally in with the in crowd ...,positive,0,True,True,False,51-100,False,False,False,False,False,baseline_model,positive,0.774047,305
4,"@user Wow,first Hugo Chavez and now Fidel Cast...",negative,0,False,True,False,101-200,False,False,False,False,False,baseline_model,neutral,0.416398,160


In [16]:
# Step 4.5 - Wide-format Table for Model Comparison (Optional but recommended)
# One row per tweet, with baseline + candidate predictions in columns
# TODO: Replace with your metadata
df_wide = df_long.pivot_table(
    index=[
        "ex_id", "text", "label",
        "emoji_count", "has_hashtag", "has_mention", "has_negation", "length_bucket",
        "has_all_caps", "has_question", "has_url", "has_strong_sentiment", "has_sarcasm_indicators"
    ], # Add additional hypothesis-driven slices 
    columns="model",
    values=["pred", "conf"],
    aggfunc="first"
).reset_index()

# Flatten column names (e.g., pred_baseline_model, conf_candidate_model)
df_wide.columns = ["_".join([c for c in col if c]).strip("_") for col in df_wide.columns]

df_wide.head(5)

Unnamed: 0,ex_id,text,label,emoji_count,has_hashtag,has_mention,has_negation,length_bucket,has_all_caps,has_question,has_url,has_strong_sentiment,has_sarcasm_indicators,conf_baseline_model,conf_candidate_model,pred_baseline_model,pred_candidate_model
0,0,"""Fatty Kim The Third"" 😭😭😭",neutral,3,False,False,False,0-50,False,False,False,False,False,0.486252,0.978987,neutral,neutral
1,1,"""Focusing on [alt rightists'] respectability.....",neutral,0,False,False,False,51-100,False,False,False,False,False,0.573572,0.999728,negative,neutral
2,2,"""Kim Fatty the Third""",negative,0,False,False,False,0-50,False,False,False,False,False,0.849732,0.937709,neutral,neutral
3,3,"""We have lost everything"": Syrians return to r...",neutral,0,True,True,False,51-100,True,False,False,False,False,0.751955,0.994244,negative,positive
4,4,"""who's the most wiped out white boy? Zac Efron...",neutral,0,False,False,False,51-100,False,True,False,False,False,0.561233,0.906541,neutral,positive


### Step 5: Compute Metrics (Accuracy + Slice Accuracy + Regression)

In [17]:
# TODO: Edit to work for your slices

#compute metrics model-wise
from sklearn.metrics import accuracy_score

def compute_accuracy(y_true, y_pred):
    return accuracy_score(list(y_true), list(y_pred))

# Overall accuracy by model (df_long: one row per (tweet, model))
overall = df_long.groupby("model").apply(
    lambda g: compute_accuracy(g["label"], g["pred"]),
    include_groups=False
)

# Slice accuracy table (uses df_long masks)
slice_table = wandb.Table(columns=["slice", "model", "accuracy"])
slice_metrics = {}

for slice_name, mask in get_slices(df_long).items():
    slice_metrics[slice_name] = {}
    for model_name, g in df_long[mask].groupby("model"):
        acc = float(compute_accuracy(g["label"], g["pred"]))
        slice_table.add_data(slice_name, model_name, acc)
        slice_metrics[slice_name][model_name] = acc

In [18]:
# TODO: Edit to work for your slices


# Regression-aware evaluation (df_eval: one row per tweet, both model outputs) 
# A regression is when the candidate gets something wrong that the baseline got right.
BASELINE = "baseline_model"
CANDIDATE = "candidate_model"

# Ensure ex_id exists (safe even if it already exists)
df_long = df_long.copy()
if "ex_id" not in df_long.columns:
    df_long["ex_id"] = df_long.groupby(["text", "label"]).ngroup()

# Build df_eval with metadata carried through
df_eval = (
    df_long.pivot_table(
        index=[
            "ex_id", "text", "label",
            "emoji_count", "has_hashtag", "has_mention", "has_negation", "length_bucket",
            "has_all_caps", "has_question", "has_url", "has_strong_sentiment", "has_sarcasm_indicators"
        ], # Add additional hypothesis-driven slices
        columns="model",
        values=["pred", "conf"],
        aggfunc="first"
    )
    .reset_index()
)

# Flatten column names (pred_baseline_model, conf_candidate_model, etc.)
df_eval.columns = ["_".join([c for c in col if c]).strip("_") for col in df_eval.columns]

# Correctness flags
df_eval["baseline_correct"]  = df_eval[f"pred_{BASELINE}"] == df_eval["label"]
df_eval["candidate_correct"] = df_eval[f"pred_{CANDIDATE}"] == df_eval["label"]

# Regression / improvement flags
df_eval["regressed"]   = df_eval["baseline_correct"] & ~df_eval["candidate_correct"]
df_eval["improved"]    = ~df_eval["baseline_correct"] & df_eval["candidate_correct"]
df_eval["both_wrong"]  = ~df_eval["baseline_correct"] & ~df_eval["candidate_correct"]
df_eval["both_correct"]= df_eval["baseline_correct"] & df_eval["candidate_correct"]

# Confidence-conditional regression (candidate is confident AND worse than baseline)
df_eval["confident_regression"] = df_eval["regressed"] & (df_eval[f"conf_{CANDIDATE}"] >= 0.8)

# Global regression metrics
regression_rate = float(df_eval["regressed"].mean())
improvement_rate = float(df_eval["improved"].mean())
conf_reg_rate = float(df_eval["confident_regression"].mean())

print("Regression rate:", regression_rate)
print("Improvement rate:", improvement_rate)
print("Confident regression rate:", conf_reg_rate)

Regression rate: 0.408
Improvement rate: 0.108
Confident regression rate: 0.382
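
The regression and improvement rates are tied to the accuracy gap by a simple identity: candidate accuracy minus baseline accuracy equals improvement rate minus regression rate, because examples where both models are right or both are wrong cancel out. With the numbers reported in this notebook, 0.108 - 0.408 = -0.300, which matches 0.398 - 0.698. The sketch below checks the identity on toy data; `regression_summary` is a hypothetical helper, not part of the lab code.

```python
def regression_summary(labels, baseline_preds, candidate_preds):
    """Compute accuracies plus regression/improvement rates over paired predictions."""
    n = len(labels)
    if n == 0:
        raise ValueError("need at least one example")
    base_ok = cand_ok = regressed = improved = 0
    for y, b, c in zip(labels, baseline_preds, candidate_preds):
        b_ok, c_ok = (b == y), (c == y)
        base_ok += b_ok
        cand_ok += c_ok
        regressed += b_ok and not c_ok      # baseline right, candidate wrong
        improved += (not b_ok) and c_ok     # baseline wrong, candidate right
    return {
        "baseline_accuracy": base_ok / n,
        "candidate_accuracy": cand_ok / n,
        "regression_rate": regressed / n,
        "improvement_rate": improved / n,
    }


toy = regression_summary(
    labels=["positive", "negative", "neutral", "positive"],
    baseline_preds=["positive", "negative", "positive", "negative"],
    candidate_preds=["negative", "negative", "neutral", "negative"],
)
gap = toy["candidate_accuracy"] - toy["baseline_accuracy"]
net = toy["improvement_rate"] - toy["regression_rate"]
print(toy, "identity holds:", abs(gap - net) < 1e-12)
```

The identity is also why two models with equal accuracy can still differ a lot: a net of zero may hide many regressions offset by as many improvements.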


In [19]:
# TODO: Edit to work with your slices

# Define slices on df_eval (must use columns that exist in df_eval)
def get_slices_eval(df_any):
    return {
        "emoji_gt3": df_any["emoji_count"] > 3,
        "has_negation": df_any["has_negation"] == True,
        "has_hashtag": df_any["has_hashtag"] == True,
        "long_tweets": df_any["length_bucket"].astype(str).isin(["201-1000", "1001+"]),
        "has_all_caps": df_any["has_all_caps"] == True, # Added
        "has_question": df_any["has_question"] == True, # Added
        "has_strong_sentiment": df_any["has_strong_sentiment"] == True, # Added
        "short_tweets": df_any["length_bucket"] == "0-50", # Added
    }

# Slice-level regression metrics table
reg_table = wandb.Table(columns=["slice", "metric", "value"])
reg_metrics = {}

for slice_name, mask in get_slices_eval(df_eval).items():
    g = df_eval[mask]
    if len(g) == 0:
        continue

    reg = float(g["regressed"].mean())
    imp = float(g["improved"].mean())
    conf_reg = float(g["confident_regression"].mean())

    reg_table.add_data(slice_name, "regression_rate", reg)
    reg_table.add_data(slice_name, "improvement_rate", imp)
    reg_table.add_data(slice_name, "confident_regression_rate", conf_reg)

    reg_metrics[slice_name] = {
        "regression_rate": reg,
        "improvement_rate": imp,
        "conf_reg_rate": conf_reg
    }



### Step 6 - #TODO: Log to W&B & Analyse Slices
### (Make sure PROJECT/ENTITY/RUN_NAME exist from Step 1)

In [20]:
# Step 6: Log to W&B

PROJECT = "mlip-lab4-slices-2026"
ENTITY = None
RUN_NAME = "baseline_vs_candidate"
run = wandb.init(project=PROJECT, entity=ENTITY, name=RUN_NAME)
wandb.log({"predictions_table": wandb.Table(dataframe=df_long)})
wandb.log({"slice_metrics": slice_table})
wandb.log({"regression_metrics": reg_table})
wandb.log({
    "df_eval": wandb.Table(dataframe=df_eval)
})
for model_name, acc in overall.items():
    wandb.summary[f"{model_name}_accuracy"] = float(acc)
wandb.summary["regression_rate"] = regression_rate
wandb.summary["improvement_rate"] = improvement_rate
wandb.summary["confident_regression_rate"] = conf_reg_rate

print("W&B run URL:", run.get_url())
run.finish()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


W&B run URL: https://wandb.ai/joannac2-carnegie-mellon-university/mlip-lab4-slices-2026/runs/pgnbbb8r


0,1
baseline_model_accuracy,0.698
candidate_model_accuracy,0.398
confident_regression_rate,0.382
improvement_rate,0.108
regression_rate,0.408


### Instructions: Exploring Slice-Based Evaluation in W&B

#### Purpose
In this lab, you are evaluating a candidate sentiment model to decide whether it should replace an existing baseline (production) model.
You have already:
  - run both models on the same dataset
  - logged predictions, confidence scores, and metadata to W&B
  - created metadata that allows you to slice the data
The most important goal is to understand when and why models behave differently.
Overall accuracy alone is often misleading.

#### What to do in W&B
1. Open your W&B run
  - Click the project link and open the latest run.
2. Explore the predictions table
  - Go to the Tables tab and open predictions_table.
  - Each row is one tweet × one model.
3. Create and analyze slices (most important)
  - Use filters to create meaningful slices 
    (e.g., negation, emojis, hashtags, long tweets).
  - For each slice:
    - Compare baseline vs candidate performance.
    - Compare slice accuracy to overall accuracy.
    - Inspect a few misclassified examples to identify patterns.
4. Visualize slice performance
  - Open slice_metrics.
  - Create bar charts comparing baseline vs candidate accuracy for at least two slices.
5. Discuss your findings with the TA
  - Explain why slicing reveals issues that overall accuracy hides.
  - Say whether the candidate model should be deployed and why.
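
A filter applied in the W&B UI can be cross-checked locally by computing per-model accuracy and the slice size on the same rows. The sketch below works on plain dictionaries shaped like rows of `predictions_table` (for example from `df_long.to_dict("records")`); `slice_accuracy` and the toy rows are hypothetical and only illustrate the idea.

```python
from collections import defaultdict


def slice_accuracy(rows, predicate):
    """Per-model accuracy and row count over the rows where predicate(row) is True."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in rows:
        if predicate(row):
            totals[row["model"]] += 1
            hits[row["model"]] += row["pred"] == row["label"]
    return {m: {"n": totals[m], "accuracy": hits[m] / totals[m]} for m in totals}


# Toy rows: one tweet x one model per row, mirroring the long format.
toy_rows = [
    {"text": "not bad", "label": "positive", "model": "baseline_model", "pred": "positive"},
    {"text": "not bad", "label": "positive", "model": "candidate_model", "pred": "negative"},
    {"text": "great", "label": "positive", "model": "baseline_model", "pred": "positive"},
    {"text": "great", "label": "positive", "model": "candidate_model", "pred": "positive"},
]
negation_slice = slice_accuracy(toy_rows, lambda r: "not" in r["text"].split())
print(negation_slice)
```

Reporting `n` next to each accuracy makes it harder to over-read a slice that contains only a handful of tweets.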


In [28]:
# Students: replace the placeholders below with 1-2 sentence insights
#TODO: Replace this with 1-2 sentence takeaways for each slice.
saved_slice_notes = [
    "emoji_gt3: Tweets with 3+ emojis showed strong performance differences between models, suggesting emoji interpretation is a key differentiator.",
    "has_negation: Negation words (not, never, no) caused the candidate model to struggle significantly more than baseline, indicating poor negation handling.",
    "has_hashtag: Hashtag presence did not significantly impact model differences, both models performed similarly on this slice.",
    "has_all_caps: All-caps words (indicating emphasis/shouting) caused confusion in sentiment prediction for both models but more so for the candidate.",
    "has_question: Question tweets showed higher regression rates, suggesting the candidate model misinterprets interrogative sentiment.",
    "has_strong_sentiment: Tweets with strong sentiment words (love, hate) had better baseline performance but candidate still regressed significantly.",
    "short_tweets: Very short tweets (0-50 chars) lack context and both models struggled, but candidate performed worse with higher regression."
]
pd.DataFrame(saved_slice_notes, columns=["Slice Notes"])

Unnamed: 0,Slice Notes
0,emoji_gt3: Tweets with 3+ emojis showed strong...
1,"has_negation: Negation words (not, never, no) ..."
2,has_hashtag: Hashtag presence did not signific...
3,has_all_caps: All-caps words (indicating empha...
4,has_question: Question tweets showed higher re...
5,has_strong_sentiment: Tweets with strong senti...
6,short_tweets: Very short tweets (0-50 chars) l...


#### ⭐️ Findings from Slice Visualization: 5 key results
These come directly from the `slice_metrics` visualization (baseline vs candidate accuracy), with slice sizes shown as `n`.

1. **Candidate underperforms baseline on every meaningful slice we tested**, consistent with the overall gap (baseline overall accuracy **0.698** vs candidate **0.398**, Δ = **-0.300**).
2. **Biggest failure: `has_question` (n=58)**: baseline accuracy **0.655** vs candidate **0.293** (Δ = **-0.362**). Question / rhetorical-question tweets are a major regression risk.
3. **Strong sentiment keywords are not “easy mode” for the candidate: `has_strong_sentiment` (n=20)**: baseline **0.950** vs candidate **0.650** (Δ = **-0.300**). The candidate still misses obvious polarity cues.
4. **Emphasis / shouting is another large regression: `has_all_caps` (n=102)**: baseline **0.667** vs candidate **0.373** (Δ = **-0.294**). The candidate appears to mis-handle intensity markers.
5. **Negation remains a consistent weakness: `has_negation` (n=28)**: baseline **0.571** vs candidate **0.357** (Δ = **-0.214**). This supports the Step 7 stress test choice (negation-targeted synthetic tweets).

> Note: `emoji_gt3` shows extreme values but has **n=1**, so it’s not reliable evidence by itself; treat it as a “needs more data” slice rather than a conclusion.
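
One way to make the small-sample caveat concrete is to put a confidence interval around each slice accuracy. The sketch below uses the Wilson score interval, which is a choice made here for illustration and not something the lab prescribes; `wilson_interval` is a hypothetical helper, and the counts in the usage lines are derived from the reported accuracies times `n` (for example 0.950 × 20 = 19 correct).

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# has_strong_sentiment (n=20): baseline 19/20 correct, candidate 13/20 correct.
print("baseline :", wilson_interval(19, 20))
print("candidate:", wilson_interval(13, 20))
# A slice with a single tweet yields a very wide interval.
print("n=1      :", wilson_interval(1, 1))
```

If the baseline and candidate intervals for a slice overlap heavily, the slice is better treated as a prompt to collect more examples than as evidence for or against shipping.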

![W&B slice_metrics visualization (dot plot)](Visualize%20slice%20performance.png)

![W&B slice_metrics visualization (bar chart)](Visualize%20slice%20performance%20-bar.png)

### Step 7 - Targeted stress testing with LLMs

TODO: 
In this step, you will use a Large Language Model (LLM) to generate test cases that specifically target a weakness you observed during slicing.

What to do:
1. Choose one slice where you noticed poor performance, regressions, or surprising behavior.
2. Write a short hypothesis (1-2 sentences) explaining why the model might struggle on this slice. Example:
“The model struggles with tweets that use slang and sarcasm.”
3. Use an LLM to generate 10 test cases designed to test this hypothesis.
These can include:
    - subtle or ambiguous cases
    - difficult or adversarial cases
    - small wording changes that affect sentiment
4. Re-run both models on the generated test cases (helper script given below.)
5. Briefly describe what you observed to the TA:
    - Did the same failures appear again?
    - Did you notice any new failure patterns?
    - Would this affect your confidence in deploying the model?

Your input can be in the following format:

> Examples:
> - @user @user That’s coming, but I think the victims are going to be Medicaid recipients.
> - I think I may be finally in with the in crowd #mannequinchallenge  #grads2014 @user
> 
> Generate more tweets using slang.

Use our provided GPT to start the task: [llm-based-test-case-generator](https://chatgpt.com/g/g-982cylVn2-llm-based-test-case-generator). If you do not have access to GPTs, use plain ChatGPT or another LLM provider you have access to instead.
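
As a complement to LLM-generated cases, "small wording changes that affect sentiment" can also be produced programmatically as minimal pairs, where the only difference between two tweets is an inserted "not". The sketch below is a hypothetical illustration: the template, the tiny adjective lexicon, and the expected labels are all assumptions to be reviewed by hand (for instance, "not awful" may read as neutral rather than positive).

```python
# Expected label after negation; an assumption that ignores neutral outcomes.
FLIPPED = {"positive": "negative", "negative": "positive"}


def negation_minimal_pairs(subjects, adjective_polarity):
    """Build plain/negated tweet pairs whose expected sentiment should flip."""
    cases = []
    for subject in subjects:
        for adjective, polarity in adjective_polarity.items():
            cases.append({"text": f"The {subject} was {adjective}.", "expected": polarity})
            cases.append(
                {"text": f"The {subject} was not {adjective}.", "expected": FLIPPED[polarity]}
            )
    return cases


pairs = negation_minimal_pairs(
    ["movie", "service"], {"great": "positive", "awful": "negative"}
)
for case in pairs:
    print(case["expected"], "|", case["text"])
```

The `text` values can be passed to the helper that runs both models, and a model that predicts the same label for both halves of a pair is likely keying on the adjective and ignoring the negation.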

In [29]:
# TODO: Paste your 10 generated tweets here:
generated_slice_description = "Tweets with negation words (not, never, no) that flip sentiment meaning. The candidate model struggles significantly with negation, showing high regression rates when sentiment is reversed by negative words."

generated_cases = [
    "This movie is not bad at all, actually quite enjoyable!",
    "I'm never going to say this pizza isn't amazing",
    "No way this is the worst day ever, it's actually great",
    "The service was not terrible, but it wasn't good either",
    "I don't hate this song, it's just not my favorite",
    "Never thought I'd say this isn't a disappointment",
    "This restaurant is no good at being bad - it's excellent!",
    "I'm not unhappy with my purchase, just not thrilled",
    "The weather isn't not nice today - double negative intended",
    "Never have I not enjoyed a concert this much"
]

In [30]:
#Helper code to run models on synthetic test cases:

def run_on_generated_tests(texts, models=MODELS):
    rows = []
    for model_name, model_id in models.items():
        clf = pipeline(
            "text-classification",
            model=model_id,
            truncation=True,
            framework="pt",
            device=-1
        )
        for t in texts:
            out = clf(t)[0]
            rows.append({
                "text": t,
                "model": model_name,
                "pred": HF_LABEL_MAP.get(out["label"], out["label"]),
                "conf": float(out["score"])
            })
    return pd.DataFrame(rows)


In [31]:
generated_df = run_on_generated_tests(generated_cases)
generated_df

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Device set to use cpu


Unnamed: 0,text,model,pred,conf
0,"This movie is not bad at all, actually quite e...",baseline_model,positive,0.98394
1,I'm never going to say this pizza isn't amazing,baseline_model,positive,0.555089
2,"No way this is the worst day ever, it's actual...",baseline_model,positive,0.524262
3,"The service was not terrible, but it wasn't go...",baseline_model,negative,0.844705
4,"I don't hate this song, it's just not my favorite",baseline_model,negative,0.731894
5,Never thought I'd say this isn't a disappointment,baseline_model,negative,0.800755
6,This restaurant is no good at being bad - it's...,baseline_model,positive,0.884516
7,"I'm not unhappy with my purchase, just not thr...",baseline_model,negative,0.598655
8,The weather isn't not nice today - double nega...,baseline_model,negative,0.936736
9,Never have I not enjoyed a concert this much,baseline_model,positive,0.891479


In [36]:
#### Step 7 analysis: present & compare the two models on the 10 synthetic negation cases

# 1) Make a wide table: one row per synthetic tweet, baseline + candidate side-by-side
generated_wide = (
    generated_df
    .pivot_table(index="text", columns="model", values=["pred", "conf"], aggfunc="first")
    .reset_index()
)
generated_wide.columns = ["_".join([c for c in col if c]).strip("_") for col in generated_wide.columns]
generated_wide = generated_wide.rename(columns={
    f"pred_{BASELINE}": "baseline_pred",
    f"conf_{BASELINE}": "baseline_conf",
    f"pred_{CANDIDATE}": "candidate_pred",
    f"conf_{CANDIDATE}": "candidate_conf",
})

# 2) Optional (recommended): add intended label for each synthetic case, so you can compute accuracy
# Edit these if you intended different sentiments for your generated cases.
intended_label = [
    "positive",  # not bad ... enjoyable
    "positive",  # never ... isn't amazing (double negation)
    "positive",  # not worst ... actually great
    "neutral",   # not terrible, but not good either (mixed)
    "neutral",   # don't hate ... not my favorite (mild negative / neutral)
    "positive",  # isn't a disappointment (double negation)
    "positive",  # no good at being bad ... excellent
    "neutral",   # not unhappy ... not thrilled
    "positive",  # isn't not nice (double negation)
    "positive",  # never have I not enjoyed (double negation)
 ]

# pivot_table sorts rows by text, so attach labels by tweet text instead of by list position
label_by_text = dict(zip(generated_cases, intended_label))
generated_wide["intended_label"] = generated_wide["text"].map(label_by_text)
generated_wide["baseline_correct"] = generated_wide["baseline_pred"] == generated_wide["intended_label"]
generated_wide["candidate_correct"] = generated_wide["candidate_pred"] == generated_wide["intended_label"]
generated_wide["pred_disagree"] = generated_wide["baseline_pred"] != generated_wide["candidate_pred"]
generated_wide["candidate_high_conf"] = generated_wide["candidate_conf"] >= 0.9
generated_wide["candidate_high_conf_wrong"] = generated_wide["candidate_high_conf"] & (~generated_wide["candidate_correct"])

# 3) Summary numbers to report to the TA
stress_summary = {
    "n_cases": int(generated_wide.shape[0]),
    "baseline_accuracy_on_synthetic": float(generated_wide["baseline_correct"].mean()),
    "candidate_accuracy_on_synthetic": float(generated_wide["candidate_correct"].mean()),
    "disagreement_rate": float(generated_wide["pred_disagree"].mean()),
    "candidate_high_conf_wrong_count": int(generated_wide["candidate_high_conf_wrong"].sum()),
    "candidate_high_conf_wrong_rate": float(generated_wide["candidate_high_conf_wrong"].mean()),
}

display(generated_wide.sort_values(["candidate_high_conf_wrong", "pred_disagree"], ascending=False))
stress_summary

Unnamed: 0,text,baseline_conf,candidate_conf,baseline_pred,candidate_pred,intended_label,baseline_correct,candidate_correct,pred_disagree,candidate_high_conf,candidate_high_conf_wrong
1,I'm never going to say this pizza isn't amazing,0.555089,0.904155,positive,negative,positive,True,False,True,True,True
3,Never have I not enjoyed a concert this much,0.891479,0.99998,positive,negative,positive,True,False,True,True,True
5,"No way this is the worst day ever, it's actual...",0.524262,0.999834,positive,negative,positive,True,False,True,True,True
8,"This movie is not bad at all, actually quite e...",0.98394,0.999993,positive,negative,positive,True,False,True,True,True
9,This restaurant is no good at being bad - it's...,0.884516,0.958428,positive,negative,positive,True,False,True,True,True
0,"I don't hate this song, it's just not my favorite",0.731894,0.998578,negative,negative,neutral,False,False,False,True,True
2,"I'm not unhappy with my purchase, just not thr...",0.598655,0.999767,negative,negative,neutral,False,False,False,True,True
4,Never thought I'd say this isn't a disappointment,0.800755,0.999999,negative,negative,positive,False,False,False,True,True
6,"The service was not terrible, but it wasn't go...",0.844705,0.991405,negative,negative,neutral,False,False,False,True,True
7,The weather isn't not nice today - double nega...,0.936736,0.999999,negative,positive,positive,False,True,True,True,False


{'n_cases': 10,
 'baseline_accuracy_on_synthetic': 0.5,
 'candidate_accuracy_on_synthetic': 0.1,
 'disagreement_rate': 0.6,
 'candidate_high_conf_wrong_count': 9,
 'candidate_high_conf_wrong_rate': 0.9}

### Step 7 - Stress test (Negation) write-up for TA

**Chosen slice:** `has_negation` (tweets with negation words like *not / never / no*).

**Hypothesis (1-2 sentences):** Negation flips sentiment polarity (e.g., “not bad” → positive), and models that over-rely on keywords can fail badly, often with high confidence, when the true sentiment depends on scope/double-negatives.

**How I generated tests:** I used an LLM to create 10 short tweets that contain single negation, mixed sentiment ("not terrible, but not good"), and double negation ("isn't not nice"). These are meant to probe whether the model correctly handles negation scope and polarity flips.

**How I present the comparison:**
- I view the side-by-side table `generated_wide` (and the W&B table `synthetic_tests_wide`) where each row is one synthetic tweet with both models’ `pred` and `conf`.
- I sort/filter on `candidate_high_conf_wrong` and `pred_disagree` to highlight the most deployment-relevant failures (high-confidence wrong predictions + baseline/candidate disagreements).

**Key observations (from `stress_summary`, using my intended labels for these synthetic cases):**
- $n=10$ synthetic negation cases.
- Baseline accuracy on these cases: **0.50**.
- Candidate accuracy on these cases: **0.10**.
- Disagreement rate (baseline vs candidate): **0.60**.
- Candidate high-confidence wrong predictions: **9/10 (90%)** (very risky failure mode).

**Conclusion / impact on deployment decision:** This stress test reproduces and amplifies the slice-based finding that negation is a major weakness, especially for the candidate model. The combination of *worse performance* and *high-confidence errors* on negation cases substantially lowers confidence in deploying the candidate model without targeted remediation and additional testing.

In [35]:
# OPTIONAL: Log synthetic test cases to W&B (long + wide + summary metrics)
run = wandb.init(project=PROJECT, entity=ENTITY, name="stress_test_negation", reinit=True)
wandb.log({
    "synthetic_tests_long": wandb.Table(dataframe=generated_df),
    "synthetic_tests_wide": wandb.Table(dataframe=generated_wide),
    "stress_test_summary": stress_summary,
})
run.finish()

