# Step 1 - Install the required dependencies, set up W&B and make sure the python version is 3.10 and above

In [1]:
!pip install -q wandb datasets transformers evaluate tqdm emoji regex pandas pyarrow scikit-learn

import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [2]:
import wandb
# wandb.login(key = "")
wandb.login()

[34m[1mwandb[0m: Currently logged in as: [33mmkipsang[0m ([33mmkipsang-carnegie-mellon-university[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


True

In [3]:
!python --version

Python 3.11.8


In [4]:
#imports and config:
import re, regex, emoji
import pandas as pd
import numpy as np
import tqdm

import wandb
from datasets import load_dataset
from transformers import pipeline
import evaluate


# WANDB CONFIG
PROJECT = "mlip-lab4-slices-2025"    
ENTITY = None                        
RUN_NAME = "tweet_eval_roberta_vs_gpt2"


  from .autonotebook import tqdm as notebook_tqdm





In [5]:
# Models to compare
MODELS = {
    "roberta": "cardiffnlp/twitter-roberta-base-sentiment-latest",
    "gpt2":    "LYTinn/finetuning-sentiment-model-tweet-gpt2",
}

In [6]:

# Label normalization 
ID2LABEL = {0: "negative", 1: "neutral", 2: "positive"}
HF_LABEL_MAP = {"LABEL_0":"negative","LABEL_1":"neutral","LABEL_2":"positive"}

USE_HF_DATASET = True   # set False to use tweets.csv fallback
SEED = 42
np.random.seed(SEED)


# Step 2 - Load a dataset from Hugging Face

In [7]:
if USE_HF_DATASET:
    ds = load_dataset("cardiffnlp/tweet_eval", "sentiment")
    df = pd.DataFrame(ds["test"]).head(500).copy()
    df["label"] = df["label"].map(ID2LABEL)
else:
    df = pd.read_csv("tweets.csv")
    # Ensure it has 'text' and 'label' columns
    df = df.rename(columns={c: c.strip() for c in df.columns})
    assert {"text","label"}.issubset(df.columns), "tweets.csv must include text,label"

df = df[["text","label"]].dropna().reset_index(drop=True)
df.head(3)


To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Generating train split: 100%|██████████| 45615/45615 [00:00<00:00, 713107.50 examples/s]
Generating test split: 100%|██████████| 12284/12284 [00:00<00:00, 1757558.60 examples/s]
Generating validation split: 100%|██████████| 2000/2000 [00:00<00:00, 483688.40 examples/s]


Unnamed: 0,text,label
0,@user @user what do these '1/2 naked pics' hav...,neutral
1,OH: “I had a blue penis while I was this” [pla...,neutral
2,"@user @user That's coming, but I think the vic...",neutral


# Step 3 - Add MetaData for slicing

In this step, you'll add **5 metadata columns** to your dataset to enable slicing later in **Weights & Biases (W&B)**.

You can use:
- **Value matching** (e.g., tweets with hashtags)
- **Regex** (e.g., strong positive words like *love*, *great*)
- **Heuristics** (e.g., emoji count, all-caps detection, tweet length)

These columns will be carried forward when you run inference in Step 6 and will appear in your final `predictions_table` logged to W&B.

---

Once inference is complete, your W&B table (`df_long`) will include:
- Original tweet text
- Ground-truth labels
- Model predictions and confidence scores
- All slicing metadata you define here

Later, in the W&B UI, you can use the ➕ `Filter` option in the table view to explore model behavior across these slices.

In [15]:
import re
# Step 3 – Add Slicing Metadata
# Add new columns for filtering in W&B later

# Example: count emojis in each tweet & create a slice for tweets with >3 emojis
def count_emojis(text):
    return sum(ch in emoji.EMOJI_DATA for ch in str(text))

df["emoji_count"] = df["text"].apply(count_emojis).astype(int)

NEGATION_RE = re.compile(r"\b(?:no|not|never|nothing|nowhere|hardly|scarcely|barely|n't)\b", re.I)
def has_negation(s: str) -> bool:
    return bool(NEGATION_RE.search(str(s)))

URL_RE      = re.compile(r"(https?://|www\.)", re.I)
def has_url_or_mention(s: str) -> bool:
    s = str(s)
    return bool(URL_RE.search(s) or MENTION_RE.search(s))

MENTION_RE  = re.compile(r"@\w+")
def is_long(s: str, n: int = 100) -> bool:
    return len(str(s)) > n

def get_slices(df):
    return {
        "emoji_gt3": df["emoji_count"] > 3,
        "has_negation":        df["text"].apply(has_negation),
        "long_gt100chars":     df["text"].apply(is_long),
        "has_url_or_mention":  df["text"].apply(has_url_or_mention),
    }

In [16]:
# Transformers requires a backend (PyTorch/TensorFlow/Flax). We'll use PyTorch.
try:
    import torch, transformers, sys
    print("torch:", torch.__version__)
    print("transformers:", transformers.__version__)
    print("CUDA available:", torch.cuda.is_available())
    print("Python:", sys.executable)
except Exception as e:
    raise RuntimeError("Install PyTorch before proceeding: pip install torch torchvision torchaudio") from e

torch: 2.8.0+cpu
transformers: 4.56.1
CUDA available: False
Python: c:\Users\STUDENT\marionvenv\Scripts\python.exe


#  Step 4 – Run Inference on Tweets Using Two Sentiment Models

In this step, you'll use two HuggingFace sentiment analysis models to run inference on your dataset:

In [17]:
from tqdm.auto import tqdm

def run_pipeline(model_id, texts):
    clf = pipeline("text-classification", model=model_id, truncation=True, framework="pt", device=-1)
    preds, confs = [], []
    for t in tqdm(texts, desc=f"Infer: {model_id}"):
        out = clf(t)[0]
        lbl = HF_LABEL_MAP.get(out["label"], out["label"])
        preds.append(lbl)
        confs.append(float(out["score"]))
    return preds, confs

pred_frames = []
for model_name, model_id in MODELS.items():
    yhat, conf = run_pipeline(model_id, df["text"].tolist())
    tmp = df.copy()
    tmp["model"] = model_name
    tmp["pred"]  = yhat
    tmp["conf"]  = conf
    pred_frames.append(tmp)

df_long = pd.concat(pred_frames, ignore_index=True)
df_long.head(5)

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu
Infer: cardiffnlp/twitter-roberta-base-sentiment-latest:   0%|          | 0/500 [00:00<?, ?it/s]Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Infer: cardiffnlp/twitter-

Unnamed: 0,text,label,emoji_count,model,pred,conf
0,@user @user what do these '1/2 naked pics' hav...,neutral,0,roberta,negative,0.804726
1,OH: “I had a blue penis while I was this” [pla...,neutral,0,roberta,neutral,0.866949
2,"@user @user That's coming, but I think the vic...",neutral,0,roberta,neutral,0.763724
3,I think I may be finally in with the in crowd ...,positive,0,roberta,positive,0.774047
4,"@user Wow,first Hugo Chavez and now Fidel Cast...",negative,0,roberta,neutral,0.416397


# Step 5: Compute Metrics

In [18]:
#compute metrics model-wise
from sklearn.metrics import accuracy_score

def compute_accuracy(y_true, y_pred):
    y_true = list(y_true)
    y_pred = list(y_pred)
    return accuracy_score(y_true, y_pred)


overall = (
    df_long.groupby("model")
           .apply(lambda g: compute_accuracy(g["label"], g["pred"]))
)

slice_table = wandb.Table(columns=["slice", "model", "accuracy"])
slice_metrics = {}

for slice_name, mask in get_slices(df_long).items():
    slice_metrics[slice_name] = {}  # Initialize inner dict

    for model_name, g in df_long[mask].groupby("model"):
        acc = compute_accuracy(g["label"], g["pred"])
        acc = float(acc) 
        # Add to wandb Table
        slice_table.add_data(slice_name, model_name, acc)
        # Add to dict
        slice_metrics[slice_name][model_name] = acc

  .apply(lambda g: compute_accuracy(g["label"], g["pred"]))


# Step 6: Log to Wandb:

In [19]:
run = wandb.init(project=PROJECT, entity=ENTITY, name=RUN_NAME, config={
    "models": MODELS,
    "n_rows": len(df),
    "use_hf_dataset": USE_HF_DATASET
})

# Main predictions table: one row per (example, model)
pred_table = wandb.Table(dataframe=df_long)
wandb.log({"predictions_table": pred_table})

# Log overall accuracy to wandb summary
for model_name, acc in overall.items():
    wandb.summary[f"{model_name}_accuracy"] = float(acc)

# wandb.log({"slice_accuracy_table": slice_table})
for slice_name, model_dict in slice_metrics.items():
    for model_name, acc in model_dict.items():
        metric_name = f"slice/{slice_name}/{model_name}_accuracy"
        wandb.log({metric_name: acc})


wandb.log({"slice_metrics": slice_table})

run.finish()

[34m[1mwandb[0m: [32m[41mERROR[0m The nbformat package was not found. It is required to save notebook history.


0,1
slice/emoji_gt3/gpt2_accuracy,▁
slice/emoji_gt3/roberta_accuracy,▁
slice/has_negation/gpt2_accuracy,▁
slice/has_negation/roberta_accuracy,▁
slice/has_url_or_mention/gpt2_accuracy,▁
slice/has_url_or_mention/roberta_accuracy,▁
slice/long_gt100chars/gpt2_accuracy,▁
slice/long_gt100chars/roberta_accuracy,▁

0,1
gpt2_accuracy,0.398
roberta_accuracy,0.698
slice/emoji_gt3/gpt2_accuracy,1.0
slice/emoji_gt3/roberta_accuracy,0.0
slice/has_negation/gpt2_accuracy,0.33333
slice/has_negation/roberta_accuracy,0.66667
slice/has_url_or_mention/gpt2_accuracy,0.29279
slice/has_url_or_mention/roberta_accuracy,0.68018
slice/long_gt100chars/gpt2_accuracy,0.33486
slice/long_gt100chars/roberta_accuracy,0.66055


## Instructions: Exploring Slice-Based Evaluation in W&B
 Step 1: Open the W&B Project
- Click on the **project link** above.
- Click on the **latest run** near the top.
Step 2: View Tables
- Click the **"Tables"** tab.
- You should see:
  - `predictions_table`
  - `slice_metrics`

Step 3: Use Filters in `predictions_table`
- Click on `predictions_table`.
- Use the filter bar to explore:
Example (see image):
  ```python
  col2 == 0
Step 4: 
- Check slice_metrics table: It shows accuracy of each model for every slice.
- Add a Bar Chart Panel: Click the "Add panels" button (top-right).
- Choose Bar chart under "Charts".
Try to create bar charts comparing accuracies of both models for a slice. Do it for 2 slices.

Discuss your findings with your TA.

# Filtering: 
<img src="images/filtering.png" alt="Predictions Table" width="600">

## Plotting:
<img src="images/plotting.png" alt="Predictions Table" height="300">
<img src="images/bar-charts.png" alt="Predictions Table" width="600">


In [20]:
# Students: replace the placeholders below with 1–2 sentence insights
saved_slice_notes = [
    "Globally, RoBERTa achieved around 0.68 accuracy while GPT-2 lagged at around 0.40, showing that RoBERTa is much more reliable for tweet sentiment.",
    "Slice analysis revealed that RoBERTa struggled heavily on emoji-rich tweets",
    "On the negation slice, RoBERTa achieved around 0.65 accuracy compared to GPT-2 at around 0.35, showing that GPT-2 struggles much more with polarity flips like 'not good'",
    "For tweets containing URLs or @mentions, RoBERTa again performed strongly ( around 0.66) while GPT-2 lagged at around 0.30, suggesting RoBERTa handles noisy social media text better",
    "On longer tweets (>100 characters), RoBERTa maintained around 0.65 accuracy while GPT-2 dropped to around 0.35, showing GPT-2 is less robust to extended context"
]
pd.DataFrame(saved_slice_notes)

Unnamed: 0,0
0,"Globally, RoBERTa achieved around 0.68 accurac..."
1,Slice analysis revealed that RoBERTa struggled...
2,"On the negation slice, RoBERTa achieved around..."
3,"For tweets containing URLs or @mentions, RoBER..."
4,"On longer tweets (>100 characters), RoBERTa ma..."



After successfully creating the two slices, come up with three *additional* slices you want to check and **create** the slices & view them in Wandb.

There are two directions to identify useful slices:
- Top-down: Think about what kinds of things the model can struggle with, and come up with some slices.
- Bottom-up: Look at model (mis-)predictions, come up with hypotheses, and translate them into data slices.

3. Tweets containing negation words
4. Tweets longer than 100 characters to test whether truncation or length impacts accuracy
5. Tweets containing URLs or @mentions

In [21]:
# Add these three slices & re-run the notebook to see them on Wandb.

additional_slice_ideas = [
    "Tweets containing negation words (e.g., 'not', 'never') since models often misinterpret sentiment flips.",
    "Tweets longer than 100 characters to test whether truncation or length impacts accuracy.",
    "Tweets containing URLs or @mentions, since models may struggle to parse external references or usernames."
]
additional_slice_ideas

["Tweets containing negation words (e.g., 'not', 'never') since models often misinterpret sentiment flips.",
 'Tweets longer than 100 characters to test whether truncation or length impacts accuracy.',
 'Tweets containing URLs or @mentions, since models may struggle to parse external references or usernames.']

# Step 7 - Write down three addition data slices you want to create but do not have the metadata for slicing

In the previous step, you might have already come up with some slices you wanted to create but found it hard to do with existing metadata. Write down three of such slices in this step.

Example: 
- I want to create a slice on tweets using slangs
- I want to create a slice on non-English tweets (if any)

In [22]:
## Write down three additional data slices here:

additional_slice_descriptions = [
    "Tweets containing slang expressions (e.g., 'lol', 'brb', 'smh'), since models might not understand informal language",
    "Tweets written in non-English languages, which may confuse models trained primarily on English text",
    "Tweets with sarcasm indicators like 'yeah right' or 'totally', since sentiment can be opposite of literal meaning"
]
additional_slice_descriptions

["Tweets containing slang expressions (e.g., 'lol', 'brb', 'smh'), since models might not understand informal language",
 'Tweets written in non-English languages, which may confuse models trained primarily on English text',
 "Tweets with sarcasm indicators like 'yeah right' or 'totally', since sentiment can be opposite of literal meaning"]

# Step 8 - Generate more test cases with Large Language Models

Select one slice from the three you wrote down and generate **10 test cases** using LLMs, which can include average case, boundary case, or difficult case.

Your input can be in the following format:

> Examples:
> - @user @user That’s coming, but I think the victims are going to be Medicaid recipients.
> - I think I may be finally in with the in crowd #mannequinchallenge  #grads2014 @user
> 
> Generate more tweets using slangs.

The first part of **Examples** conditions the LLM on the style, length, and content of examples. The second part of **Instructions** instructs what kind of examples you want LLM to generate.

Use our provided GPTs to start the task: [llm-based-test-case-generator](https://chatgpt.com/g/g-982cylVn2-llm-based-test-case-generator). If you do not have access to GPTs, use the plain ChatGPT or other LLM providers you have access to instead.

In [23]:
# Paste your 10 generated tweets here:
generated_slice_description = "Examples: " \
"- lol that movie was trash, can't believe I wasted 2 hours smh " \
"- brb gotta go deal with this drama, life is wild rn " \
"Generate 10 more tweets using modern slang expressions."

generated_cases = [
    "not me crying over a tiktok I've seen 10 times already 😭",
    "this wifi moving like it's powered by vibes only 😤",
    "tell me why my food delivery said “out for delivery” 40 mins ago… I'm starvingggg",
    "woke up and chose chaos today, sorry not sorry",
    "y'all ever get hit with that “we need to talk” text and immediately start spiraling?? 💀",
    "it's giving broke, it's giving stress, it's giving me",
    "someone said my outfit was “interesting” and now I'm spiraling, thanks",
    "had one iced coffee and now I think I can conquer the world",
    "these vibes? immaculate. this playlist? undefeated.",
    "I was today years old when I found out you’re supposed to rinse rice before cooking 😳"
]