# Twitter Sentiment Hypothesis Generation

This notebook is adapted from various files in `./notebooks` and contains the following modifications:
1. **OpenAI $\rightarrow$ Gemini API**. GPT is not cheap! I opted for Google's free inferencing plan.
2. **Annotation

In [1]:
%load_ext autoreload
%autoreload 2

import os
from dotenv import find_dotenv, load_dotenv

dotenv_path = find_dotenv()
load_dotenv(dotenv_path)

os.environ['CUDA_VISIBLE_DEVICES'] = '5' # Set to the index of the GPU you want to use; see visible GPUs with `nvidia-smi` on command line


import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

from hypothesaes.quickstart import train_sae, interpret_sae, generate_hypotheses, evaluate_hypotheses
from hypothesaes.embedding import get_local_embeddings
from hypothesaes.llm_local import get_vllm_engine
from hypothesaes.select_neurons import select_neurons
from hypothesaes.interpret_neurons import NeuronInterpreter, SamplingConfig, LLMConfig, InterpretConfig, ScoringConfig
from hypothesaes.annotate import annotate_texts_with_concepts
from hypothesaes.evaluation import score_hypotheses

current_dir = os.getcwd()
assert current_dir.endswith("phoenix")

## Dataset
The dataset I will be using is [Twitter Tweet Sentiments (27.5k)](https://www.kaggle.com/datasets/yasserh/twitter-tweets-sentiment-dataset), a collection of 27,480 tweets and their associated sentiments. Of these, 5,496 tweets are used for validation during SAE training and another 5,496 are held out for evaluation. The target variable is the `sentiment` column, which can be `negative`, `neutral`, or `positive`, and we are interested in seeing which features of the `text` column predict it.

In [2]:
base_dir = os.path.join("data")
df = pd.read_csv(os.path.join(base_dir, "Tweets.csv"))
df = df[pd.notnull(df["text"])] # GPT generated command to clean :)

sentiment_to_numeric = {"negative" : -1, "neutral" : 0, "positive" : 1}
df["sentiment"] = df["sentiment"].map(sentiment_to_numeric) # We need numeric labels for regression!

train_df, val_df = train_test_split(df, test_size=5496*2, train_size=16448, random_state=42)
val_df_SAE, val_df_holdout = train_test_split(val_df, test_size=5496, train_size=5496, random_state=42)

train_df.head(5)

| | textID | text | selected_text | sentiment |
|---|---|---|---|---|
| 19413 | ba7bbe76fe | : saw it yesterday. Pretty good. | good. | 1 |
| 23147 | 09b5bef434 | hey, I can\`t make it to Makers tonight | hey, I can\`t make it to Makers tonight | 0 |
| 21547 | 83cdebaa92 | Whats with you though, you sound a bit down y... | you sound a bit down yourself. | -1 |
| 14162 | 914da6164e | No B2G1 for me. Trying to save cash for next ... | No B2G1 for me. Trying to save cash for next ... | 0 |
| 6474 | d7709f9f53 | hahahaha omg you win the internetz today! 'W... | a omg you win | 1 |


## 1. Feature Generation
First, we will compute the embeddings of the `text` column for the training and validation sets.

In [3]:
train_texts = train_df["text"].tolist()
train_sentiments = train_df["sentiment"].tolist()
val_texts = val_df_SAE["text"].tolist()

EMBEDDER = "nomic-ai/modernbert-embed-base"
CACHE_NAME = f"twitter_quickstart_local_{EMBEDDER}"

text2embedding = get_local_embeddings(train_texts + val_texts, model=EMBEDDER, batch_size=128, cache_name=CACHE_NAME)
train_embeddings = np.stack([text2embedding[text] for text in train_texts])
val_embeddings = np.stack([text2embedding[text] for text in val_texts])

Loading embedding chunks:   0%|          | 0/1 [00:00<?, ?it/s]

Loaded 27440 embeddings in 0.1s


Now that we have the embeddings, we will use an SAE to sparsify these representations. Since the dataset used here is similar in size to the one used in `quickstart_local.ipynb`, and after consulting the `README`, I decided to train a Matryoshka SAE with the same parameters: $M=256$, $k=8$, and $\text{prefix\_lengths} = [32, 256]$.

In [4]:
checkpoint_dir = os.path.join(base_dir, "checkpoints", CACHE_NAME)
sae = train_sae(embeddings=train_embeddings, M=256, K=8, matryoshka_prefix_lengths=[32, 256], checkpoint_dir=checkpoint_dir, val_embeddings=val_embeddings)
train_activations = sae.get_activations(train_embeddings)

Loaded model from data/checkpoints/twitter_quickstart_local_nomic-ai/modernbert-embed-base/SAE_matryoshka_M=256_K=8_prefixes=32-256.pt onto device cuda


Computing activations (batchsize=16384):   0%|          | 0/2 [00:00<?, ?it/s]
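
To make the roles of $M$, $k$, and the prefix lengths more concrete, here is a minimal pure-Python sketch of top-$k$ sparsification combined with a Matryoshka-style loss. This is not the `hypothesaes` implementation: the helper names (`top_k_activations`, `decode`, `matryoshka_loss`), the toy dimensions, the random decoder, and the choice to apply top-$k$ once over all neurons are assumptions made for illustration only.

```python
import random

def top_k_activations(pre_activations, k):
    """Keep the k largest pre-activations (after ReLU) and zero out the rest."""
    order = sorted(range(len(pre_activations)),
                   key=lambda i: pre_activations[i], reverse=True)
    sparse = [0.0] * len(pre_activations)
    for i in order[:k]:
        sparse[i] = max(pre_activations[i], 0.0)
    return sparse

def decode(activations, decoder, n_neurons):
    """Reconstruct the input using only the first n_neurons decoder rows."""
    dim = len(decoder[0])
    recon = [0.0] * dim
    for j in range(n_neurons):
        for d in range(dim):
            recon[d] += activations[j] * decoder[j][d]
    return recon

def matryoshka_loss(x, activations, decoder, prefix_lengths):
    """Sum of squared reconstruction errors, one term per nested prefix."""
    loss = 0.0
    for n_neurons in prefix_lengths:
        recon = decode(activations, decoder, n_neurons)
        loss += sum((xi - ri) ** 2 for xi, ri in zip(x, recon))
    return loss

# Toy stand-ins (assumed values), far smaller than M=256, k=8, prefixes [32, 256].
random.seed(0)
M, K, DIM, PREFIXES = 8, 2, 3, [4, 8]
decoder = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(M)]
x = [0.5, -1.0, 2.0]
pre_activations = [random.gauss(0, 1) for _ in range(M)]

sparse = top_k_activations(pre_activations, K)
loss = matryoshka_loss(x, sparse, decoder, PREFIXES)
print(sum(a != 0.0 for a in sparse), "active neurons; loss =", round(loss, 3))
```

The point of the nested prefixes is that the first few neurons must reconstruct the input reasonably well on their own, which pushes them toward coarser features, while the full dictionary can capture finer ones.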

## 2. Feature Selection
Now that we have the sparse activations of our training embeddings, we select the neurons which are most predictive of `sentiment`.

In [5]:
selection_method = "correlation"
top_neuron_count = 20

selected_neurons, scores = select_neurons(
    activations=train_activations,
    target=train_sentiments,
    n_select=top_neuron_count,
    method=selection_method,
)
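
For intuition, the sketch below shows what a correlation-based selection could look like. I am assuming that the `"correlation"` method ranks neurons by the absolute Pearson correlation between their activations and the target; the notebook does not show the library's internals, so `pearson` and `select_by_correlation` are hypothetical helpers and the data is a toy example.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5

def select_by_correlation(activations, target, n_select):
    """activations: one row per text, one column per neuron."""
    n_neurons = len(activations[0])
    corrs = [pearson([row[j] for row in activations], target)
             for j in range(n_neurons)]
    # Rank by magnitude so strongly negative neurons are kept too.
    ranked = sorted(range(n_neurons), key=lambda j: abs(corrs[j]), reverse=True)
    chosen = ranked[:n_select]
    return chosen, [corrs[j] for j in chosen]

# Toy data: neuron 0 fires on positive tweets, neuron 1 on negative ones.
toy_activations = [
    [1.0, 0.0, 0.5],
    [0.0, 0.0, 0.4],
    [0.9, 0.0, 0.5],
    [0.0, 2.0, 0.4],
    [0.0, 1.5, 0.5],
    [0.0, 0.0, 0.5],
]
toy_sentiments = [1, 0, 1, -1, -1, 0]

toy_selected, toy_scores = select_by_correlation(toy_activations, toy_sentiments, 2)
print(toy_selected, [round(s, 3) for s in toy_scores])
```

Ranking by magnitude matches what the results later in the notebook suggest, since the selected neurons include ones with negative correlations.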

## 3. Feature Interpretation
By Proposition 3.1, a natural language concept is predictive of the target variable if it is predictive of activations of a neuron that is predictive of that variable. Since we have found the neurons which are predictive of `sentiment`, it suffices to find the concepts which are predictive of each neuron's activation.


To do so, we use the [Gemini API](https://googleapis.github.io/python-genai) to interpret the neurons with natural language concepts via LLM. Fortunately, `NeuronInterpreter` does a lot of the heavy lifting required for this. Unfortunately, it does not do all of the heavy lifting. The following adjustments must be made:
- **Annotator and Interpreter Model**. Currently, only the OpenAI API is supported. To get around this, I reworked `llm_api.py` and made the appropriate changes in the `_execute_prompts` and `_get_interpretation_gemini` (formerly known as `_get_interpretation_openai`) methods needed for `NeuronInterpreter.interpret_neurons`. Later, annotation required minor changes in `annotate.py`.

- **Rate Limiting**. Initially, I was on the standard plan, where rate limiting was a huge issue and the fix was to scale almost everything down. I rolled back the parallelization and executed the prompts sequentially, and I cut down the number of examples used to score the fidelity of each neuron interpretation. Eventually, I was able to get a better plan with much higher throughput, which made these changes mostly obsolete (but still worth mentioning)!
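
The sequential fallback described above can be sketched as a small retry helper with exponential backoff. This is an illustration, not the code in my modified `llm_api.py`: `RateLimitError`, `call_with_retries`, `run_sequentially`, and the delay values are hypothetical, and the fake `flaky_send` stands in for a real API call so the example runs without credentials.

```python
import time

class RateLimitError(Exception):
    """Hypothetical stand-in for an HTTP 429 / RESOURCE_EXHAUSTED response."""

def call_with_retries(fn, max_attempts=3, base_delay=10.0, sleep=time.sleep):
    """Call fn(), waiting base_delay * 2**(attempt - 1) seconds after each 429."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts:
                raise  # Give up after the last attempt.
            sleep(base_delay * 2 ** (attempt - 1))

def run_sequentially(prompts, send, **retry_kwargs):
    """Send prompts one at a time instead of through a worker pool."""
    return [call_with_retries(lambda p=p: send(p), **retry_kwargs)
            for p in prompts]

# Fake sender that is rate limited on its first call only.
calls = {"n": 0}

def flaky_send(prompt):
    calls["n"] += 1
    if calls["n"] == 1:
        raise RateLimitError("429 RESOURCE_EXHAUSTED")
    return prompt.upper()

waits = []  # Record requested delays instead of actually sleeping.
results = run_sequentially(["a", "b"], flaky_send, sleep=waits.append)
print(results, waits)
```

Passing `sleep` as a parameter keeps the helper testable, and a real version could prefer the server's suggested retry delay over a fixed schedule when the error response provides one.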

In [6]:
TASK_SPECIFIC_INSTRUCTIONS = """All of the texts are tweets. 
Features should describe a specific aspect of the tweet. For example:
- "mentions excitement about a sports event"
- "uses humor to comment on politics"
- "praises a new tech product release"
- "complains about poor customer service from a company"
"""

interpreter = NeuronInterpreter(
    interpreter_model="gemini-2.5-flash-lite",
    annotator_model="gemini-2.5-flash-lite",
    n_workers_interpretation=10,
    n_workers_annotation=50,
    cache_name=CACHE_NAME,
)

interpret_config = InterpretConfig(
    sampling=SamplingConfig(
        n_examples=20,
        max_words_per_example=128,
    ),
    llm=LLMConfig(
        temperature=0.7,
        max_interpretation_tokens=75,
    ),
    n_candidates=3,
    task_specific_instructions=TASK_SPECIFIC_INSTRUCTIONS,
)

interpretations = interpreter.interpret_neurons(
    texts=train_texts,
    activations=train_activations,
    neuron_indices=selected_neurons,
    config=interpret_config,
)

Generating interpretations:   0%|          | 0/60 [00:00<?, ?it/s]

After obtaining the interpretations, we score each interpretation by how correlated it is with neuron activation. In the paper, this is referred to as fidelity. 

In [7]:
# This cell is entirely borrowed from detailed_usage.ipynb.

scoring_config = ScoringConfig(
    n_examples=50,
    max_words_per_example=128,
)

all_metrics = interpreter.score_interpretations(
    texts=train_texts,
    activations=train_activations,
    interpretations=interpretations,
    config=scoring_config,
)

interpretations_data = []
for neuron_idx in selected_neurons:
    neuron_metrics = all_metrics[neuron_idx]
    best_interp, best_metrics = max(neuron_metrics.items(), key=lambda x: x[1]['f1'])
    worst_interp, worst_metrics = min(neuron_metrics.items(), key=lambda x: x[1]['f1'])
    
    interpretations_data.append({
        'neuron_idx': neuron_idx,
        f'target_{selection_method}': scores[selected_neurons.index(neuron_idx)],
        'best_interpretation': best_interp,
        'best_f1': best_metrics['f1'],
        'worst_interpretation': worst_interp,
        'worst_f1': worst_metrics['f1']
    })

best_interp_df = pd.DataFrame(interpretations_data).sort_values(by=f'target_{selection_method}', ascending=False)

display(
    best_interp_df.style.format({
        'separation_score': '{:.2f}',
        'best_f1': '{:.2f}', 
        'worst_f1': '{:.2f}'
    })
)

Found 1940 cached items; annotating 1060 uncached items


Scoring neuron interpretation fidelity (20 neurons; 3 candidate interps per neuron; 50 examples to score each …

| | neuron_idx | target_correlation | best_interpretation | best_f1 | worst_interpretation | worst_f1 |
|---|---|---|---|---|---|---|
| 0 | 28 | 0.343308 | expresses strong positive emotion or excitement | 0.89 | expresses strong positive sentiment | 0.87 |
| 2 | 27 | 0.243332 | expresses gratitude | 0.98 | expresses gratitude | 0.98 |
| 3 | 5 | 0.191838 | mentions Mother's Day | 1.0 | mentions Mother's Day | 1.0 |
| 6 | 12 | 0.140774 | uses exclamation marks at the end of the tweet | 0.91 | uses short, exclamatory phrases | 0.84 |
| 10 | 243 | 0.100437 | expresses love | 0.86 | expresses feelings of love or affection | 0.8 |
| 14 | 21 | 0.088182 | mentions a personal project or creation | 0.87 | mentions a website or online portfolio | 0.68 |
| 18 | 249 | 0.080873 | uses the word 'morning' or 'Morning' | 1.0 | uses the word 'morning' or 'Morning' | 1.0 |
| 19 | 136 | 0.078432 | uses the word 'cute' to describe someone or something | 1.0 | uses the word 'cute' | 0.98 |
| 17 | 178 | -0.081368 | mentions a tummy ache or hurting tummy | 0.94 | mentions a tummy ache or hurting tummy | 0.94 |
| 16 | 242 | -0.083918 | expresses sadness | 0.83 | expresses sadness | 0.83 |
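
As a reminder of what an F1-based fidelity score measures, here is a minimal sketch. I am assuming that each sampled text gets a binary annotation ("does the concept apply?") that is compared against a binary label for whether the neuron is active; how the library samples those texts is not shown here, and `f1_score` is a hypothetical helper. The 3,000 annotations reported in the cell output (1,940 cached plus 1,060 uncached) are consistent with 20 neurons × 3 candidates × 50 examples.

```python
def f1_score(annotations, neuron_active):
    """F1 between binary concept annotations and binary neuron activity."""
    pairs = list(zip(annotations, neuron_active))
    tp = sum(1 for a, t in pairs if a and t)
    fp = sum(1 for a, t in pairs if a and not t)
    fn = sum(1 for a, t in pairs if not a and t)
    if tp == 0:
        return 0.0  # No true positives means precision and recall are zero.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy example: 4 texts where the neuron fires, 4 where it does not.
toy_active = [1, 1, 1, 1, 0, 0, 0, 0]
toy_annotations = [1, 1, 1, 0, 0, 0, 0, 1]  # One miss and one false alarm.

toy_f1 = f1_score(toy_annotations, toy_active)
print(round(toy_f1, 2))
```

A high F1 means the natural-language concept fires on roughly the same texts as the neuron, which is what makes the concept a usable proxy for it.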


Lastly, the best interpretation of each neuron is used to annotate our holdout set, and each concept is scored as a hypothesis for predicting `sentiment`.
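
A back-of-the-envelope estimate shows why this step runs into the quota. The numbers come from the cell output (109,920 items to annotate and a `quotaValue` of 4000 in the `GenerateRequestsPerMinutePerProjectPerModel` error); the assumptions are that each (text, concept) pair costs one API request and that each worker sends its next request as soon as the previous one returns.

```python
n_texts = 5496              # holdout tweets
n_concepts = 20             # one best interpretation per selected neuron
requests_per_minute = 4000  # quotaValue reported in the 429 error
n_workers = 50              # n_workers passed to the annotation call

# Assumption: one request per (text, concept) pair.
n_requests = n_texts * n_concepts

# Lower bound on wall-clock time if the quota is fully used.
min_minutes = n_requests / requests_per_minute

# Average latency per request below which 50 workers would exceed the quota.
min_latency_s = n_workers * 60 / requests_per_minute

print(n_requests, round(min_minutes, 2), min_latency_s)
```

In other words, even with no retries the annotation cannot finish in under about 27.5 minutes, and 50 workers will trip the limit whenever requests return faster than 0.75 seconds on average.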

In [8]:
val_holdout_texts = val_df_holdout['text']
val_holdout_sentiments = val_df_holdout['sentiment']

holdout_annotations = annotate_texts_with_concepts(
    model="gemini-2.5-flash-lite",
    texts=val_holdout_texts,
    concepts=best_interp_df['best_interpretation'].tolist(),
    max_words_per_example=128,
    cache_name=CACHE_NAME,
    n_workers=50,
)

metrics, hypothesis_df = score_hypotheses(
    hypothesis_annotations=holdout_annotations,
    y_true=val_holdout_sentiments,
    classification=False,
)

pd.set_option('display.max_colwidth', None)
display(hypothesis_df.round(3))
pd.reset_option('display.max_colwidth')

print("\nHoldout Set Metrics:")
print(f"R² Score: {metrics['r2']:.3f}")
print(f"Significant hypotheses: {metrics['Significant'][0]}/{metrics['Significant'][1]} " 
      f"(p < {metrics['Significant'][2]:.3e})")

Found 0 cached items; annotating 109920 uncached items


Annotating:   0%|          | 0/109920 [00:00<?, ?it/s]

API error: 429 RESOURCE_EXHAUSTED. {'error': {'code': 429, 'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, head to: https://ai.google.dev/gemini-api/docs/rate-limits.', 'status': 'RESOURCE_EXHAUSTED', 'details': [{'@type': 'type.googleapis.com/google.rpc.QuotaFailure', 'violations': [{'quotaMetric': 'generativelanguage.googleapis.com/generate_requests_per_model', 'quotaId': 'GenerateRequestsPerMinutePerProjectPerModel', 'quotaDimensions': {'model': 'gemini-2.5-flash-lite', 'location': 'global'}, 'quotaValue': '4000'}]}, {'@type': 'type.googleapis.com/google.rpc.Help', 'links': [{'description': 'Learn more about Gemini API quotas', 'url': 'https://ai.google.dev/gemini-api/docs/rate-limits'}]}, {'@type': 'type.googleapis.com/google.rpc.RetryInfo', 'retryDelay': '22s'}]}}; retrying in 10.0s... (2/3)

| | hypothesis | separation_score | separation_pval | regression_coef | regression_pval | feature_prevalence |
|---|---|---|---|---|---|---|
| 1 | expresses gratitude | 0.835 | 0.0 | 0.392 | 0.0 | 0.084 |
| 0 | expresses strong positive emotion or excitement | 0.756 | 0.0 | 0.428 | 0.0 | 0.396 |
| 2 | mentions Mother's Day | 0.756 | 0.0 | 0.133 | 0.006 | 0.03 |
| 4 | expresses love | 0.635 | 0.0 | 0.221 | 0.0 | 0.135 |
| 7 | uses the word 'cute' to describe someone or something | 0.518 | 0.0 | 0.182 | 0.021 | 0.01 |
| 6 | uses the word 'morning' or 'Morning' | 0.393 | 0.0 | 0.291 | 0.0 | 0.03 |
| 3 | uses exclamation marks at the end of the tweet | 0.204 | 0.0 | -0.032 | 0.073 | 0.333 |
| 5 | mentions a personal project or creation | 0.151 | 0.0 | 0.024 | 0.372 | 0.096 |
| 17 | expresses a feeling of missing someone or something | -0.32 | 0.0 | -0.068 | 0.006 | 0.161 |
| 11 | expresses a state of tiredness | -0.394 | 0.0 | -0.122 | 0.0 | 0.122 |



Holdout Set Metrics:
R² Score: 0.459
Significant hypotheses: 11/20 (p < 5.000e-03)
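
To help read the table, the sketch below shows one plausible way the per-hypothesis columns could be computed. These are assumptions rather than the library's documented definitions: I take `separation_score` to be the difference in mean sentiment between tweets with and without the concept, `feature_prevalence` to be the fraction of tweets annotated with it, and the reported threshold of 5.000e-03 over 20 hypotheses to be consistent with a Bonferroni correction at $\alpha = 0.1$. The helper names and toy data are hypothetical.

```python
from statistics import mean

def separation_score(annotations, y):
    """Mean target where the concept is present minus mean where it is absent."""
    present = [t for a, t in zip(annotations, y) if a]
    absent = [t for a, t in zip(annotations, y) if not a]
    return mean(present) - mean(absent)

def feature_prevalence(annotations):
    """Fraction of texts annotated as containing the concept."""
    return sum(annotations) / len(annotations)

def bonferroni_threshold(alpha, n_hypotheses):
    """Per-hypothesis p-value cutoff under a Bonferroni correction."""
    return alpha / n_hypotheses

# Toy holdout: sentiment in {-1, 0, 1} and one binary concept annotation.
toy_y = [1, 1, 0, -1, 0, 0]
toy_concept = [1, 1, 0, 0, 0, 1]

toy_separation = separation_score(toy_concept, toy_y)
toy_prevalence = feature_prevalence(toy_concept)
threshold = bonferroni_threshold(0.1, 20)  # alpha = 0.1 is an inferred value

print(round(toy_separation, 3), toy_prevalence, threshold)
```

Under this reading, a concept such as "expresses gratitude" separates sentiment strongly even though it appears in only a small share of tweets, while the regression coefficient reflects its contribution once the other concepts are accounted for.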
