# Step 1 - Install the required dependencies and make sure the Python version is 3.10 or above

In [1]:
!pip install zenoml
!pip install zeno-client python-dotenv



In [2]:
!pip install datasets
!pip install transformers torch
!pip install tqdm



In [3]:
!python --version

Python 3.10.0
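
The same check can be made from inside the kernel, which helps when the notebook's interpreter differs from the one that `!python` resolves to. This is a minimal sketch; the helper name `meets_minimum` is illustrative and not part of the original notebook.

```python
import sys

def meets_minimum(version_info=sys.version_info, minimum=(3, 10)):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)

# Report the version of the interpreter that runs this cell.
status = "OK" if meets_minimum() else "too old, need 3.10 or above"
print(sys.version.split()[0], status)
```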


# Step 2 - Load a dataset from Hugging Face

In [4]:
from datasets import load_dataset
import pandas as pd

ds = load_dataset("cardiffnlp/tweet_eval", "sentiment")
df = pd.DataFrame(ds['test']).head(500)
df.head(5)


| | text | label |
|---|---|---|
| 0 | @user @user what do these '1/2 naked pics' hav... | 1 |
| 1 | OH: “I had a blue penis while I was this” [pla... | 1 |
| 2 | @user @user That's coming, but I think the vic... | 1 |
| 3 | I think I may be finally in with the in crowd ... | 2 |
| 4 | @user Wow,first Hugo Chavez and now Fidel Cast... | 0 |


In [5]:
def label_map(x):
    if x == 0:
        return 'negative'
    elif x == 1:
        return 'neutral'
    elif x == 2:
        return 'positive'
    return x
df['label'] = df['label'].map(label_map)
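
The same mapping can also be written as a dictionary lookup. The sketch below works on a plain list so it runs without the dataset; `ID_TO_LABEL` and `to_label_name` are illustrative names. Like `label_map` above, it leaves unknown ids unchanged. Passing the dictionary directly to `df['label'].map(...)` would instead turn unknown ids into `NaN`.

```python
# Integer ids used by the tweet_eval sentiment labels in this notebook.
ID_TO_LABEL = {0: "negative", 1: "neutral", 2: "positive"}

def to_label_name(label_id):
    """Map an integer id to its label name; unknown ids are returned as-is."""
    return ID_TO_LABEL.get(label_id, label_id)

print([to_label_name(x) for x in [1, 2, 0, 7]])
```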

# Step 3 - Run model inference

Warning: This step downloads two models of roughly 500 MB each.

**If you don't want to download the models, you can jump to step 4 and use the provided data in the repo instead.**

### Run inference with roberta

In [6]:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="cardiffnlp/twitter-roberta-base-sentiment-latest")

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps:0


In [7]:
import tqdm

results = []
texts = df['text'].to_list()

## Depending on your machine, this should take around 1 minute
for text in tqdm.tqdm(texts):
    results.append(pipe(text))

100%|██████████| 500/500 [00:10<00:00, 48.52it/s]


In [8]:
df['roberta'] = [r[0]['label'] for r in results]
df['roberta_score'] = [r[0]['score'] for r in results]

### Run inference with gpt2

In [9]:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="LYTinn/finetuning-sentiment-model-tweet-gpt2")

Device set to use mps:0


In [10]:
import tqdm

results = []
texts = df['text'].to_list()

## Depending on your machine, this should take around 1 minute
for text in tqdm.tqdm(texts):
    results.append(pipe(text))

100%|██████████| 500/500 [00:07<00:00, 64.79it/s]


In [11]:
df['gpt2'] = [r[0]['label'] for r in results]
df['gpt2_score'] = [r[0]['score'] for r in results]

## map labels back
def label_map(x):
    if x == 'LABEL_0':
        return 'negative'
    elif x == 'LABEL_1':
        return 'neutral'
    elif x == 'LABEL_2':
        return 'positive'
    return x
df['gpt2'] = df['gpt2'].map(label_map)
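
Once both prediction columns use the same label names, a quick sanity check is to compare each model with the gold labels and with the other model. The sketch below uses plain lists with made-up values; in the notebook the inputs would be `df['label']`, `df['roberta']`, and `df['gpt2']`. The helper names are illustrative.

```python
def accuracy(predictions, labels):
    """Fraction of positions where the prediction equals the gold label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def agreement(first, second):
    """Fraction of positions where two prediction lists agree."""
    return accuracy(first, second)

# Made-up toy values, for illustration only.
labels = ["neutral", "positive", "negative", "neutral"]
roberta = ["neutral", "positive", "neutral", "neutral"]
gpt2 = ["negative", "positive", "negative", "neutral"]

print("roberta accuracy:", accuracy(roberta, labels))
print("gpt2 accuracy:", accuracy(gpt2, labels))
print("model agreement:", agreement(roberta, gpt2))
```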

# Step 4 - Pre-process the data and add additional columns

In [12]:
## If you skipped the model inference, uncomment the line below to load the provided data

# df = pd.read_csv('tweets.csv')

In [13]:
df["input_length"] = df["text"].str.len()
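
Other metadata columns can be derived in the same way and then used for slicing in Step 5. The sketch below computes a few per-tweet features in plain Python. The function name `tweet_metadata`, the feature names other than `input_length`, and the regular expressions are assumptions made for illustration. Each feature could be attached to the DataFrame with something like `df['text'].map(lambda t: tweet_metadata(t)['has_url'])`.

```python
import re

# Simple heuristics; they will not catch every mention or URL format.
MENTION = re.compile(r"@\w+")
URL = re.compile(r"https?://|www\.", re.IGNORECASE)

def tweet_metadata(text):
    """Return a dict of simple features for one tweet."""
    return {
        "input_length": len(text),        # number of characters
        "word_count": len(text.split()),  # whitespace-separated tokens
        "has_mention": bool(MENTION.search(text)),
        "has_url": bool(URL.search(text)),
    }

print(tweet_metadata("@user check www.example.com now"))
```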

# Step 5 - Start Zeno for interactive slicing

In this step, you need to create 5 slices in the Zeno interface and derive meaningful insights.

As a starting point, try to create the two slices we provide:

1. Tweets with hashtags
2. Tweets with strong positive words (e.g., love) -- you can determine the exact words

Creating slices in Zeno is straightforward: click the '+' button for 'create a new slice', then define the slice using existing column attributes, with simple value matching or a regular expression.

![image.png](images/image.png)
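
Before typing a regular expression into the slice dialog, it can help to try it locally on a few strings. The sketch below uses Python's `re` module; the word list is only an example, and whether the Zeno interface accepts exactly the same regex syntax is an assumption to verify in the UI.

```python
import re

HASHTAG = re.compile(r"#\w+")
# \b keeps "love" from matching inside "glove" or "lovely".
STRONG_POSITIVE = re.compile(r"\b(love|amazing|awesome|fantastic)\b", re.IGNORECASE)

samples = [
    "I LOVE this #mannequinchallenge",
    "lost my glove on the bus",
    "what a lovely day",
]
for text in samples:
    print(bool(HASHTAG.search(text)), bool(STRONG_POSITIVE.search(text)), text)
```

A plain substring match on `love` would also select the second and third samples, which is worth keeping in mind when using simple value matching instead.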

Zeno has more features, including interactive metadata and model comparison. Feel free to check the teaser video in the [README](https://github.com/zeno-ml/zeno) of the Zeno repository.

In [14]:
import os
from dotenv import load_dotenv
from zeno_client import ZenoClient, ZenoMetric
import pandas as pd

load_dotenv()
API_KEY = os.getenv("ZENO_API_KEY")

if not API_KEY:
    raise ValueError("ZENO_API_KEY is not set! Please check your .env file.")

print(f"API Key Loaded: {API_KEY[:5]}... (hidden for security)")

df = pd.read_csv('tweets.csv')
df = df.reset_index()
df["index"] = df["index"].astype(str)  # Fix dtype issue

client = ZenoClient(API_KEY)

project = client.create_project(
    name="Tweet Sentiment Analysis",
    view="text-classification",
    metrics=[
        ZenoMetric(name="accuracy", type="mean", columns=["correct"]),
    ]
)

project.upload_dataset(df, id_column="index", data_column="text", label_column="label")

models = ['roberta', 'gpt2']
for model in models:
    df_system = df[['index', model]].copy()  # Prevent SettingWithCopyWarning
    df_system.loc[:, "correct"] = (df_system[model] == df["label"]).astype(int)  # Fix warning
    project.upload_system(df_system, name=model, id_column="index", output_column=model)

print("Data uploaded successfully to Zeno!")


API Key Loaded: zen_y... (hidden for security)
Successfully updated project.
Access your project at  https://hub.zenoml.com/project/b01949ef-afb7-4084-a6c9-4047904cc65e/Tweet%20Sentiment%20Analysis


100%|██████████| 1/1 [00:00<00:00,  1.08it/s]


Successfully uploaded data


100%|██████████| 1/1 [00:00<00:00,  1.17it/s]


Successfully uploaded system


100%|██████████| 1/1 [00:00<00:00,  1.39it/s]

Successfully uploaded system
Data uploaded successfully to Zeno!





After running the code above, you should be able to access the Zeno project at the `hub.zenoml.com` URL printed in the output.


After successfully creating the two slices, come up with three *additional* slices you want to check and **create** the slices in the Zeno interface.

There are two directions to identify useful slices:
- Top-down: Think about what kinds of things the model can struggle with, and come up with some slices.
- Bottom-up: Look at model (mis-)predictions, come up with hypotheses, and translate them into data slices.

3. [YOUR CHOICE]
4. [YOUR CHOICE]
5. [YOUR CHOICE]
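
For the bottom-up direction, one way to start is to count which (gold, predicted) label pairs account for most of the errors and then read the tweets behind the largest counts. The sketch below uses made-up lists; in the notebook the inputs would be `df['label']` and one of the model columns. The helper name `confusion_pairs` is illustrative.

```python
from collections import Counter

def confusion_pairs(labels, predictions):
    """Count (gold, predicted) pairs for rows where the prediction is wrong."""
    return Counter((y, p) for y, p in zip(labels, predictions) if y != p)

# Made-up toy values, for illustration only.
labels = ["neutral", "positive", "negative", "neutral", "negative"]
predictions = ["neutral", "neutral", "neutral", "positive", "neutral"]

for pair, count in confusion_pairs(labels, predictions).most_common():
    print(pair, count)
```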

In [15]:
custom_slice_descriptions = [    
    "Tweets with Negative Words: This slice focuses on tweets that contain negative words such as 'hate,' 'lost,' 'worst,' 'bad,' 'sad,' 'unhappy,' 'failure,' 'awful,' 'horrible,' and 'terrible.' The purpose is to evaluate if the model correctly classifies tweets with strong negative sentiment and identify any misclassifications.",
    
    "Tweets with User Details: This slice captures tweets that include '@' mentions, indicating direct interactions with other users. The goal is to analyze if the presence of user references affects sentiment classification and whether the model treats them differently.",
    
    "Short vs. Long Tweets: This slice categorizes tweets based on their length, where short tweets have an input length of ≤ 10 and long tweets have an input length of > 30. The purpose is to examine whether tweet length impacts sentiment classification accuracy and if the model performs differently on brief vs. detailed text.",
    
    "Tweets with Repeated Letters: This slice identifies tweets that contain elongated words with repeated letters (e.g., 'loooove,' 'happyyyy') using a regular expression pattern. The goal is to determine if the model correctly understands exaggerated words, which often carry strong emotions.",
    
    "Tweets with URLs: This slice filters tweets that contain links such as 'www,' 'https,' or 'http.' The objective is to analyze whether the model is affected by the presence of URLs and if it can correctly classify the sentiment without being misled by external references."
]
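
Two of the slices above depend on regular expressions, and the exact patterns are not recorded in the notebook. The sketch below shows one possible pair of patterns, stated as assumptions: a backreference for letters repeated three or more times, and a word-boundary match for the negative words listed in the description.

```python
import re

# A letter followed by at least two more copies of itself, e.g. "loooove".
REPEATED_LETTERS = re.compile(r"([A-Za-z])\1{2,}")
NEGATIVE_WORDS = re.compile(
    r"\b(hate|lost|worst|bad|sad|unhappy|failure|awful|horrible|terrible)\b",
    re.IGNORECASE,
)

for text in ["I loooove this", "good food", "see www.example.com", "worst day ever"]:
    print(bool(REPEATED_LETTERS.search(text)), bool(NEGATIVE_WORDS.search(text)), text)
```

Note that the repeated-letter pattern also matches the `www` in a URL, so that slice may overlap with the URL slice unless URLs are excluded first.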


# Step 6 - Write down three additional data slices you want to create but do not have the metadata for slicing

In the previous step, you might have already come up with some slices you wanted to create but found hard to build with the existing metadata. Write down three such slices in this step.

Example: 
- I want to create a slice on tweets using slang
- I want to create a slice on non-English tweets (if any)

In [16]:
custom_slice_descriptions = [
    "Tweets with Exaggeration: This slice includes tweets that use extreme or exaggerated language, like 'I’ve been waiting forever!' or 'This is the worst thing to ever happen!' The purpose is to analyze if the model can correctly identify sentiment in exaggerated statements without misclassifying them as overly positive or negative.",
    
    "Tweets with Misspellings: This slice captures tweets with spelling mistakes or informal abbreviations, such as 'luv' (love), 'gud' (good), 'tho' (though),  '4' (for). The purpose is to check if the model can still understand the correct sentiment even when words are misspelled or written in a casual way.",
    
    "Tweets with Numerical Data: This slice includes tweets that contain numbers, percentages, or statistics, like '67.4 kg gold missing from Delhi airport in 7 months', 'they're already producing Model 3 or S'. The purpose is to evaluate whether the presence of numerical information affects sentiment classification and if the model can focus on the context rather than the numbers."
]
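
Slices like these lack metadata, but a rough heuristic can sometimes approximate it. As an illustration for the misspellings slice, the sketch below flags tokens from a tiny lookup table built from the examples in the description. The table and the helper name `informal_tokens` are assumptions; a real slice would need a much larger list or a spell checker. The `'4'` example is left out because a bare digit is too ambiguous for a lookup.

```python
import re

# Tiny illustrative table taken from the examples in the slice description.
INFORMAL_SPELLINGS = {"luv": "love", "gud": "good", "tho": "though"}

def informal_tokens(text):
    """Return the tokens of `text` that appear in the lookup table."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t in INFORMAL_SPELLINGS]

print(informal_tokens("Luv this song tho"))
print(informal_tokens("I love it"))
```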

# Step 7 - Generate more test cases with Large Language Models

Select one slice from the three you wrote down and generate **10 test cases** using LLMs. These can include average cases, boundary cases, or difficult cases.

Your input can be in the following format:

> Examples:
> - OH: “I had a blue penis while I was this” [playing with Google Earth VR]
> - @user @user That’s coming, but I think the victims are going to be Medicaid recipients.
> - I think I may be finally in with the in crowd #mannequinchallenge  #grads2014 @user
> 
> Generate more tweets using slang.

The first part, **Examples**, conditions the LLM on the style, length, and content of the examples. The second part, **Instructions**, tells the LLM what kind of examples you want it to generate.

Use our provided GPT to start the task: [llm-based-test-case-generator](https://chatgpt.com/g/g-982cylVn2-llm-based-test-case-generator). If you do not have access to GPTs, use plain ChatGPT or another LLM provider you have access to instead.
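
When trying several slices or several sets of seed tweets, the two-part input can be assembled programmatically and pasted into whichever LLM you use. This is a minimal sketch; `build_prompt` is an illustrative helper, and the layout simply mirrors the format shown above.

```python
def build_prompt(examples, instruction):
    """Join seed examples and an instruction into the two-part prompt format."""
    lines = ["Examples:"]
    lines += [f"- {example}" for example in examples]
    lines += ["", instruction]
    return "\n".join(lines)

seed_tweets = [
    "I think I may be finally in with the in crowd #mannequinchallenge  #grads2014 @user",
    "@user @user That's coming, but I think the victims are going to be Medicaid recipients.",
]
print(build_prompt(seed_tweets, "Generate 10 more tweets that contain numerical data."))
```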

In [17]:
## Write down the slice you select

slice_description = "Tweets with Numerical Data: This slice includes tweets that contain numbers, percentages, or statistics, like '67.4 kg gold missing from Delhi airport in 7 months', 'they're already producing Model 3 or S'. The purpose is to evaluate whether the presence of numerical information affects sentiment classification and if the model can focus on the context rather than the numbers."

## Write down all generated test cases here

generated_test_cases = [
    "67.4 kg of gold went missing from Delhi airport over 7 months—how does that even happen?",
    "Tesla just announced they’ve produced 10,000 Model Y units this month.",
    "The company’s revenue jumped by 15% this quarter, exceeding expectations.",
    "Scientists discovered a new exoplanet 12.5 light-years away from Earth.",
    "2,500 people attended the rally despite heavy rain. That’s dedication.",
    "A new study says 1 in 3 adults don’t get enough sleep—no wonder we’re all exhausted.",
    "The price of Bitcoin dropped below $40,000 for the first time in 6 months.",
    "My flight is delayed by 3 hours, meaning I won’t land until 2 AM.",
    "The movie grossed $150 million worldwide in its opening weekend.",
    "They’ve already built 5 prototypes of the new EV model. Can’t wait to see the final version.",
]
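
LLM output does not always follow the request, so it can be worth checking the generated cases before using them. The sketch below checks the count, looks for duplicates, and, for the numerical-data slice, verifies that every case contains at least one digit. The helper name `check_cases` and the digit rule are assumptions made for illustration. In the notebook it could be called as `check_cases(generated_test_cases)`.

```python
import re

def check_cases(cases, expected_count=10):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if len(cases) != expected_count:
        problems.append(f"expected {expected_count} cases, got {len(cases)}")
    if len(set(cases)) != len(cases):
        problems.append("duplicate cases found")
    for i, case in enumerate(cases):
        if not re.search(r"\d", case):
            problems.append(f"case {i} contains no digit")
    return problems

sample = [
    "The movie grossed $150 million worldwide in its opening weekend.",
    "My flight is delayed by 3 hours.",
]
print(check_cases(sample, expected_count=2))
```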
