# Twitter Sentiment Hypothesis Generation

This notebook is adapted from various files in `./notebooks` and contains the following modifications:
1. **OpenAI $\rightarrow$ Gemini API**. GPT is not cheap! I opted for Google's free Tier I plan.
2. **Annotation by Embedding Space Metrics**. I explored performing hypothesis annotation on the holdout set using two metrics on the embedding space: cosine similarity and Euclidean distance.
3. **`SupervisedSparseAutoencoder` Class**. I tried my hand at implementing the class specified in this [feature request](https://github.com/rmovva/HypotheSAEs/issues/2). I was able to implement the constructor and forward pass (to the best of my knowledge) before running out of time.

In [95]:
%load_ext autoreload
%autoreload 2

import os
from dotenv import find_dotenv, load_dotenv

dotenv_path = find_dotenv()
load_dotenv(dotenv_path)

os.environ['CUDA_VISIBLE_DEVICES'] = '5' # Set to the index of the GPU you want to use; see visible GPUs with `nvidia-smi` on command line


import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.model_selection import train_test_split

from hypothesaes.quickstart import train_sae, interpret_sae, generate_hypotheses, evaluate_hypotheses
from hypothesaes.embedding import get_local_embeddings
from hypothesaes.llm_local import get_vllm_engine
from hypothesaes.select_neurons import select_neurons
from hypothesaes.interpret_neurons import NeuronInterpreter, SamplingConfig, LLMConfig, InterpretConfig, ScoringConfig
from hypothesaes.annotate import annotate_texts_with_concepts
from hypothesaes.evaluation import score_hypotheses
from hypothesaes.sae import SparseAutoencoder

current_dir = os.getcwd()
assert current_dir.endswith("phoenix")



## Dataset
The dataset I will be using is [Twitter Tweet Sentiments (27.5k)](https://www.kaggle.com/datasets/yasserh/twitter-tweets-sentiment-dataset), a collection of 27,480 tweets and their associated sentiments. Of these, 16,448 tweets are used to train the SAE, 5,496 are used for validation during SAE training, and 5,496 are used for holdout evaluation. The target variable is the `sentiment` column, which can be `negative`, `neutral`, or `positive`, and we are interested in seeing what features of the `text` column predict it.

In [96]:
base_dir = os.path.join("data")
df = pd.read_csv(os.path.join(base_dir, "Tweets.csv"))
df = df[pd.notnull(df["text"])] # GPT generated command to clean :)

sentiment_to_numeric = {"negative" : -1, "neutral" : 0, "positive" : 1}
df["sentiment"] = df["sentiment"].map(sentiment_to_numeric) # We need numeric labels for regression!

train_df, val_df = train_test_split(df, test_size=5496*2, train_size=16448, random_state=42)
val_df_SAE, val_df_holdout = train_test_split(val_df, test_size=5496, train_size=5496, random_state=42)

train_df.head(5)

Unnamed: 0,textID,text,selected_text,sentiment
19413,ba7bbe76fe,: saw it yesterday. Pretty good.,good.,1
23147,09b5bef434,"hey, I can`t make it to Makers tonight","hey, I can`t make it to Makers tonight",0
21547,83cdebaa92,"Whats with you though, you sound a bit down y...",you sound a bit down yourself.,-1
14162,914da6164e,No B2G1 for me. Trying to save cash for next ...,No B2G1 for me. Trying to save cash for next ...,0
6474,d7709f9f53,hahahaha omg you win the internetz today! 'W...,a omg you win,1


## 1. Feature Generation
First, we will compute the embeddings of the `text` column for the training and validation sets.

In [97]:
train_texts = train_df["text"].tolist()
train_sentiments = train_df["sentiment"].tolist()
val_texts = val_df_SAE["text"].tolist()

EMBEDDER = "nomic-ai/modernbert-embed-base"
CACHE_NAME = f"twitter_quickstart_local_{EMBEDDER}"

text2embedding = get_local_embeddings(train_texts + val_texts, model=EMBEDDER, batch_size=128, cache_name=CACHE_NAME)
train_embeddings = np.stack([text2embedding[text] for text in train_texts])
val_embeddings = np.stack([text2embedding[text] for text in val_texts])


Loaded 27460 embeddings in 0.1s


Now that we have the embeddings, we will use an SAE to sparsify these representations. Since the dataset used here is similar in size to the one used in `quickstart_local.ipynb`, and after consulting the `README`, I decided to train a Matryoshka SAE with the same parameters: $M=256$, $k=8$, and $\text{prefix\_lengths} = [32, 256]$.
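
To make the roles of $M$ and $k$ concrete, here is a minimal NumPy sketch of a top-$k$ sparsification step, assuming the SAE keeps only the $k$ largest of its $M$ pre-activations per example. The function name `topk_activations` is hypothetical and this is not the library's implementation; the Matryoshka prefixes are not modeled here.

```python
import numpy as np

def topk_activations(pre_acts: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest pre-activations in each row and zero out the rest."""
    # Column indices of the k largest entries per row (order within the k is arbitrary).
    top_idx = np.argpartition(pre_acts, -k, axis=1)[:, -k:]
    rows = np.arange(pre_acts.shape[0])[:, None]
    sparse = np.zeros_like(pre_acts)
    # Apply a ReLU to the kept values so that activations are non-negative.
    sparse[rows, top_idx] = np.maximum(pre_acts[rows, top_idx], 0.0)
    return sparse

rng = np.random.default_rng(0)
pre_acts = rng.normal(size=(4, 256))    # 4 toy examples, M = 256 neurons
acts = topk_activations(pre_acts, k=8)  # at most k = 8 non-zero entries per row
```

Each row of `acts` has at most 8 non-zero entries, which is what makes individual neurons easier to interpret later on.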

In [98]:
checkpoint_dir = os.path.join(base_dir, "checkpoints", CACHE_NAME)
sae = train_sae(embeddings=train_embeddings, M=256, K=8, matryoshka_prefix_lengths=[32, 256], checkpoint_dir=checkpoint_dir, val_embeddings=val_embeddings)
train_activations = sae.get_activations(train_embeddings)

Loaded model from data/checkpoints/twitter_quickstart_local_nomic-ai/modernbert-embed-base/SAE_matryoshka_M=256_K=8_prefixes=32-256.pt onto device cuda



## 2. Feature Selection
Now that we have the sparse activations of our training embeddings, we select the neurons which are most predictive of `sentiment`.

In [99]:
selection_method = "correlation"
top_neuron_count = 20

selected_neurons, scores = select_neurons(
    activations=train_activations,
    target=train_sentiments,
    n_select=top_neuron_count,
    method=selection_method,
)
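
For intuition, correlation-based selection can be sketched in plain NumPy as below. This assumes neurons are ranked by the absolute Pearson correlation between their activations and the target, which I have not checked against the `select_neurons` source; `select_by_correlation` is a hypothetical name.

```python
import numpy as np

def select_by_correlation(activations, target, n_select):
    """Return the indices and correlations of the n_select neurons most correlated with target."""
    acts = np.asarray(activations, dtype=float)
    y = np.asarray(target, dtype=float)
    acts_c = acts - acts.mean(axis=0)  # center each neuron's activations
    y_c = y - y.mean()                 # center the target
    denom = np.sqrt((acts_c ** 2).sum(axis=0) * (y_c ** 2).sum())
    # Neurons with constant activation (denom == 0) get correlation 0 instead of NaN.
    corr = np.divide(acts_c.T @ y_c, denom, out=np.zeros(acts.shape[1]), where=denom > 0)
    order = np.argsort(-np.abs(corr))[:n_select]  # strongest |correlation| first
    return order.tolist(), corr[order].tolist()

# Toy check: neuron 0 tracks the target, neuron 1 tracks its negation,
# neuron 2 is noise, and neuron 3 never fires.
rng = np.random.default_rng(0)
toy_target = rng.integers(-1, 2, size=200)
toy_acts = np.column_stack([toy_target, -toy_target, rng.normal(size=200), np.zeros(200)])
toy_selected, toy_scores = select_by_correlation(toy_acts, toy_target, n_select=2)
```

Ranking by absolute value keeps neurons that predict negative sentiment as well as positive sentiment.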

## 3. Feature Interpretation
By Proposition 3.1, a natural language concept is predictive of the target variable if it is predictive of activations of a neuron that is predictive of that variable. Since we have found the neurons which are predictive of `sentiment`, it suffices to find the concepts which are predictive of each neuron's activation.


To do so, we leverage the [Gemini API](https://googleapis.github.io/python-genai) to have an LLM interpret the neurons as natural language concepts. Fortunately, `NeuronInterpreter` does a lot of the heavy lifting required for this. Unfortunately, it does not do all of it. The following adjustments must be made:
- **Annotator and Interpreter Model**. Currently, only the OpenAI API is supported. To get around this, I reworked `llm_api.py` and made the appropriate changes to the `_execute_prompts` and `_get_interpretation_gemini` (formerly `_get_interpretation_openai`) methods used by `NeuronInterpreter.interpret_neurons`. Later, annotation required minor changes in `annotate.py`.

- **Rate Limiting**. Initially, I was on the standard plan and rate limiting was a huge issue. The fix was to scale almost everything down: I rolled back the parallelization to execute the prompts sequentially and cut down the number of examples used to score the fidelity of each neuron interpretation. Eventually, I was able to get a better plan with much higher throughput, which made these changes mostly obsolete (but still worth mentioning)!
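
The sequential fallback described above can be sketched as follows. This is an illustration rather than the code in my modified `llm_api.py`: `RateLimitError`, `call_with_backoff`, and `run_sequentially` are hypothetical names, and `send_prompt` stands in for whatever function actually calls the API.

```python
import random
import time

class RateLimitError(Exception):
    """Hypothetical stand-in for whatever rate-limit error the API client raises."""

def call_with_backoff(send_prompt, prompt, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call send_prompt(prompt), retrying with exponential backoff on rate-limit errors."""
    for attempt in range(max_retries):
        try:
            return send_prompt(prompt)
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            # Wait base_delay, 2*base_delay, 4*base_delay, ... plus a little jitter.
            sleep(base_delay * 2 ** attempt + random.uniform(0.0, 0.1))

def run_sequentially(send_prompt, prompts, **backoff_kwargs):
    """Execute prompts one at a time instead of in parallel."""
    return [call_with_backoff(send_prompt, p, **backoff_kwargs) for p in prompts]
```

Passing `sleep` as a parameter keeps the helper easy to test without actually waiting.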

In [100]:
TASK_SPECIFIC_INSTRUCTIONS = """All of the texts are tweets. 
Features should describe a specific aspect of the tweet. For example:
- "mentions excitement about a sports event"
- "uses humor to comment on politics"
- "praises a new tech product release"
- "complains about poor customer service from a company"
"""

interpreter = NeuronInterpreter(
    interpreter_model="gemini-2.5-flash-lite",
    annotator_model="gemini-2.5-flash-lite",
    n_workers_interpretation=10,
    n_workers_annotation=50,
    cache_name=CACHE_NAME,
)

interpret_config = InterpretConfig(
    sampling=SamplingConfig(
        n_examples=20,
        max_words_per_example=128,
    ),
    llm=LLMConfig(
        temperature=0.7,
        max_interpretation_tokens=75,
    ),
    n_candidates=3,
    task_specific_instructions=TASK_SPECIFIC_INSTRUCTIONS,
)

interpretations = interpreter.interpret_neurons(
    texts=train_texts,
    activations=train_activations,
    neuron_indices=selected_neurons,
    config=interpret_config,
)


After obtaining the interpretations, we score each interpretation by how correlated it is with neuron activation. In the paper, this is referred to as fidelity. 

In [101]:
# This cell is entirely borrowed from detailed_usage.ipynb.

scoring_config = ScoringConfig(
    n_examples=50,
    max_words_per_example=128,
)

all_metrics = interpreter.score_interpretations(
    texts=train_texts,
    activations=train_activations,
    interpretations=interpretations,
    config=scoring_config,
)

interpretations_data = []
for neuron_idx in selected_neurons:
    neuron_metrics = all_metrics[neuron_idx]
    best_interp, best_metrics = max(neuron_metrics.items(), key=lambda x: x[1]['f1'])
    worst_interp, worst_metrics = min(neuron_metrics.items(), key=lambda x: x[1]['f1'])
    
    interpretations_data.append({
        'neuron_idx': neuron_idx,
        f'target_{selection_method}': scores[selected_neurons.index(neuron_idx)],
        'best_interpretation': best_interp,
        'best_f1': best_metrics['f1'],
        'worst_interpretation': worst_interp,
        'worst_f1': worst_metrics['f1']
    })

best_interp_df = pd.DataFrame(interpretations_data).sort_values(by=f'target_{selection_method}', ascending=False)

display(
    best_interp_df.style.format({
        'separation_score': '{:.2f}',
        'best_f1': '{:.2f}', 
        'worst_f1': '{:.2f}'
    })
)

Found 1950 cached items; annotating 1050 uncached items


Scoring neuron interpretation fidelity (20 neurons; 3 candidate interps per neuron; 50 examples to score each …

| | neuron_idx | target_correlation | best_interpretation | best_f1 | worst_interpretation | worst_f1 |
|---|---|---|---|---|---|---|
| 0 | 28 | 0.343308 | expresses strong positive emotion | 0.89 | expresses strong positive sentiment | 0.87 |
| 2 | 27 | 0.243332 | expresses gratitude | 0.98 | expresses gratitude | 0.98 |
| 3 | 5 | 0.191838 | mentions Mother's Day or mothers | 1.0 | mentions Mother's Day or mothers | 1.0 |
| 6 | 12 | 0.140774 | uses an exclamation mark at the end of the tweet | 0.86 | uses exclamation marks | 0.78 |
| 10 | 243 | 0.100437 | expresses love towards another person | 0.82 | expresses love or affection | 0.77 |
| 14 | 21 | 0.088182 | mentions creating or promoting content/products | 0.92 | mentions a website or a link to a website | 0.71 |
| 18 | 249 | 0.080873 | uses the word 'morning' | 1.0 | uses the word 'morning' | 1.0 |
| 19 | 136 | 0.078432 | uses the word 'cute' to describe someone or something | 1.0 | uses the word 'cute' | 0.98 |
| 17 | 178 | -0.081368 | mentions discomfort in the stomach | 0.94 | mentions discomfort in the stomach | 0.94 |
| 16 | 242 | -0.083918 | expresses sadness | 0.83 | expresses sadness | 0.83 |
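
As a rough illustration of the `f1` columns, the sketch below computes an F1 score between two binary vectors: whether the neuron fired on each scored example, and whether the annotator said the concept was present. It assumes fidelity is computed this way; the exact sampling and binarization inside `score_interpretations` may differ, and `f1_fidelity` is a hypothetical name.

```python
import numpy as np

def f1_fidelity(neuron_active, concept_present) -> float:
    """F1 agreement between neuron firing and concept annotations on the same examples."""
    active = np.asarray(neuron_active, dtype=bool)
    present = np.asarray(concept_present, dtype=bool)
    true_pos = np.sum(active & present)
    # Precision: of the examples annotated with the concept, how many fire the neuron.
    precision = true_pos / max(present.sum(), 1)
    # Recall: of the examples that fire the neuron, how many were annotated with the concept.
    recall = true_pos / max(active.sum(), 1)
    if precision + recall == 0:
        return 0.0
    return float(2 * precision * recall / (precision + recall))

# Toy example: 4 activating and 4 non-activating examples, with one miss and one false alarm.
toy_active = [1, 1, 1, 1, 0, 0, 0, 0]
toy_present = [1, 1, 1, 0, 1, 0, 0, 0]
toy_f1 = f1_fidelity(toy_active, toy_present)  # precision = recall = 3/4, so F1 = 0.75
```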


Lastly, the best interpretations for each neuron are used to annotate our holdout set, with each concept scored as a hypothesis for predicting `sentiment`.

In [None]:
val_holdout_texts = val_df_holdout['text'].tolist()
val_holdout_sentiments = val_df_holdout['sentiment']

holdout_annotations = annotate_texts_with_concepts(
    model="gemini-2.5-flash-lite",
    texts=val_holdout_texts,
    concepts=best_interp_df['best_interpretation'].tolist(),
    max_words_per_example=128,
    cache_name=CACHE_NAME,
    n_workers=50,
)

metrics, hypothesis_df = score_hypotheses(
    hypothesis_annotations=holdout_annotations,
    y_true=val_holdout_sentiments,
    classification=False,
)

In [103]:
pd.set_option('display.max_colwidth', None)
display(hypothesis_df.round(3))
pd.reset_option('display.max_colwidth')

print("\nHoldout Set Metrics:")
print(f"R² Score: {metrics['r2']:.3f}")
print(f"Significant hypotheses: {metrics['Significant'][0]}/{metrics['Significant'][1]} " 
      f"(p < {metrics['Significant'][2]:.3e})")

| | hypothesis | separation_score | separation_pval | regression_coef | regression_pval | feature_prevalence |
|---|---|---|---|---|---|---|
| 0 | expresses strong positive emotion | 0.857 | 0.0 | 0.602 | 0.0 | 0.376 |
| 1 | expresses gratitude | 0.835 | 0.0 | 0.373 | 0.0 | 0.084 |
| 2 | mentions Mother's Day or mothers | 0.567 | 0.0 | 0.094 | 0.018 | 0.044 |
| 4 | expresses love towards another person | 0.535 | 0.0 | 0.088 | 0.005 | 0.084 |
| 7 | uses the word 'cute' to describe someone or something | 0.518 | 0.0 | 0.168 | 0.029 | 0.01 |
| 6 | uses the word 'morning' | 0.391 | 0.0 | 0.276 | 0.0 | 0.029 |
| 5 | mentions creating or promoting content/products | 0.368 | 0.0 | 0.114 | 0.0 | 0.101 |
| 3 | uses an exclamation mark at the end of the tweet | 0.21 | 0.0 | -0.045 | 0.01 | 0.302 |
| 17 | expresses a feeling of longing or missing someone or something | -0.318 | 0.0 | -0.027 | 0.323 | 0.125 |
| 10 | mentions a physical ailment of the head | -0.393 | 0.0 | -0.026 | 0.67 | 0.018 |



Holdout Set Metrics:
R² Score: 0.481
Significant hypotheses: 13/20 (p < 5.000e-03)


# Experiments
I had some time left to dive deeper into the project, so I decided to tinker around and implement two new approaches: **hypothesis annotation by embedding space metrics** and **the `SupervisedSparseAutoencoder` class specified [here](https://github.com/rmovva/HypotheSAEs/issues/2)**.

## i. Hypothesis Annotation by Embedding Space Metrics
This was a suggestion in the spec for this project and I found it pretty interesting. The code below uses two approaches to decide which holdout text embeddings match each hypothesis embedding: Euclidean distance and cosine similarity.

In [None]:
val_holdout_texts = val_df_holdout['text'].tolist()
hypotheses_texts = best_interp_df['best_interpretation'].tolist()

holdout_text2embedding = get_local_embeddings(val_holdout_texts + hypotheses_texts, model=EMBEDDER, batch_size=128, cache_name=CACHE_NAME)
val_holdout_embeddings = np.stack([holdout_text2embedding[text] for text in val_holdout_texts])
hypothesis_embeddings = np.stack([holdout_text2embedding[text] for text in hypotheses_texts])


Loaded 27460 embeddings in 0.1s
Loaded model nomic-ai/modernbert-embed-base to cuda



Saved 13 embeddings to /home/phoenixw/HypotheSAEs/emb_cache/twitter_quickstart_local_nomic-ai/modernbert-embed-base/chunk_002.npy


In [None]:
def score_cosine(examples : np.ndarray, hypothesis : np.ndarray, k : float = 0.7) -> np.ndarray:
    """
    Returns the indices (into examples) of the examples whose cosine similarity to the
    hypothesis is at least k. If none qualify, returns the single most similar example.
    """
    similarity_scores = examples @ hypothesis / (np.linalg.norm(examples, axis=1) * np.linalg.norm(hypothesis))
    indices = np.where(similarity_scores >= k)[0] # GPT-Generated!
    if indices.size == 0:
        return np.array([int(np.argmax(similarity_scores))])
    return indices

def score_dist(examples : np.ndarray, hypothesis : np.ndarray, r : float = 4) -> np.ndarray:
    """
    Returns the indices (into examples) of the examples within Euclidean distance r of the
    hypothesis. If none qualify, returns the single closest example.
    """
    diffs = examples - hypothesis
    dists = np.linalg.norm(diffs, axis=1)

    indices = np.where(dists <= r)[0]
    if indices.size == 0:
        return np.array([int(np.argmin(dists))])
    return indices

def annotate_by_embedding(metric, examples : np.ndarray, hypotheses : np.ndarray, **kwargs):
    """
    Annotate all examples with all hypotheses.

    Args:
        metric : scoring function (score_cosine or score_dist) that returns matching example indices
        examples : array of text embeddings, one row per example
        hypotheses : array of hypothesis embeddings, one row per hypothesis
        **kwargs : forwarded to metric (e.g. k or r)

    Returns:
        Dictionary mapping each hypothesis text to a binary (0/1) annotation vector over the examples.
    """
    annotations: dict[str, np.ndarray] = {}
    # NOTE: keys come from the notebook-level `hypotheses_texts` list, which must be
    # in the same order as the rows of `hypotheses`.
    for i, hypothesis in enumerate(hypotheses):
        matching_example_indices = metric(examples, hypothesis, **kwargs)
        annotation = np.zeros(len(examples), dtype=int)
        annotation[matching_example_indices] = 1
        annotations[hypotheses_texts[i]] = annotation
    return annotations

annotations_by_cosine = annotate_by_embedding(score_cosine, val_holdout_embeddings, hypothesis_embeddings, k=0.6)
annotations_by_dist = annotate_by_embedding(score_dist, val_holdout_embeddings, hypothesis_embeddings, r=0.8)
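
One way to compare the two thresholds: if the embeddings are unit-normalized (an assumption I have not verified for this embedder), Euclidean distance and cosine similarity are tied together by $\lVert a - b \rVert^2 = 2 - 2\cos(a, b)$, so a radius $r$ corresponds to a cosine threshold of $1 - r^2/2$. The sketch below checks the identity on random unit vectors; `cosine_threshold_for_radius` is a hypothetical helper.

```python
import numpy as np

def cosine_threshold_for_radius(r: float) -> float:
    """Cosine threshold equivalent to Euclidean radius r, valid for unit-norm vectors only."""
    return 1.0 - r ** 2 / 2.0

rng = np.random.default_rng(0)
unit_vectors = rng.normal(size=(1000, 16))
unit_vectors /= np.linalg.norm(unit_vectors, axis=1, keepdims=True)  # normalize each row

query = unit_vectors[0]
cosines = unit_vectors @ query                            # cosine similarity for unit vectors
distances = np.linalg.norm(unit_vectors - query, axis=1)  # Euclidean distance

# The identity ||a - b||^2 = 2 - 2 cos(a, b) holds up to floating-point error.
identity_holds = np.allclose(distances ** 2, 2.0 - 2.0 * cosines)
equivalent_k = cosine_threshold_for_radius(0.8)  # 1 - 0.64 / 2 = 0.68
```

Under that assumption, `r=0.8` would act like a cosine threshold of 0.68, which is stricter than the `k=0.6` used for the cosine annotations.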

In [106]:
metrics_cosine, hypothesis_df_cosine = score_hypotheses(
    hypothesis_annotations=annotations_by_cosine,
    y_true=val_holdout_sentiments,
    classification=False,
)

metrics_dist, hypothesis_df_dist = score_hypotheses(
    hypothesis_annotations=annotations_by_dist,
    y_true=val_holdout_sentiments,
    classification=False,
)

In [107]:
pd.set_option('display.max_colwidth', None)
print("Cosine Similarity Metric")
display(hypothesis_df_cosine.round(3))
print("Distance Similarity Metric")
display(hypothesis_df_dist.round(3))
pd.reset_option('display.max_colwidth')

print("\nHoldout Set Metrics:")
print(f"R² Score (cosine): {metrics_cosine['r2']:.3f}")
print(f"R² Score (distance): {metrics_dist['r2']:.3f}")
print(f"""Significant hypotheses: 
      (cosine) {metrics_cosine['Significant'][0]}/{metrics_cosine['Significant'][1]} (p < {metrics_cosine['Significant'][2]:.3e})
      (distance) {metrics_dist['Significant'][0]}/{metrics_dist['Significant'][1]} (p < {metrics_dist['Significant'][2]:.3e})""")

Cosine Similarity Metric


| | hypothesis | separation_score | separation_pval | regression_coef | regression_pval | feature_prevalence |
|---|---|---|---|---|---|---|
| 1 | expresses gratitude | 0.304 | 0.0 | 0.397 | 0.0 | 0.546 |
| 0 | expresses strong positive emotion | 0.289 | 0.0 | 0.425 | 0.0 | 0.317 |
| 7 | uses the word 'cute' to describe someone or something | 0.284 | 0.0 | 0.184 | 0.0 | 0.149 |
| 4 | expresses love towards another person | 0.246 | 0.0 | 0.16 | 0.0 | 0.175 |
| 5 | mentions creating or promoting content/products | 0.135 | 0.0 | 0.063 | 0.049 | 0.112 |
| 2 | mentions Mother's Day or mothers | 0.079 | 0.0 | 0.058 | 0.012 | 0.426 |
| 3 | uses an exclamation mark at the end of the tweet | 0.047 | 0.058 | 0.061 | 0.023 | 0.224 |
| 6 | uses the word 'morning' | -0.0 | 0.992 | 0.069 | 0.005 | 0.61 |
| 15 | expresses apology | -0.023 | 0.281 | -0.067 | 0.01 | 0.437 |
| 9 | expresses sadness | -0.1 | 0.0 | -0.224 | 0.0 | 0.443 |


Distance Similarity Metric


| | hypothesis | separation_score | separation_pval | regression_coef | regression_pval | feature_prevalence |
|---|---|---|---|---|---|---|
| 7 | uses the word 'cute' to describe someone or something | 0.905 | 0.0 | 0.778 | 0.0 | 0.003 |
| 2 | mentions Mother's Day or mothers | 0.647 | 0.0 | 0.497 | 0.0 | 0.036 |
| 1 | expresses gratitude | 0.638 | 0.0 | 0.639 | 0.0 | 0.044 |
| 4 | expresses love towards another person | 0.413 | 0.012 | 0.116 | 0.487 | 0.004 |
| 0 | expresses strong positive emotion | 0.384 | 0.0 | 0.375 | 0.0 | 0.012 |
| 5 | mentions creating or promoting content/products | 0.157 | 0.649 | 0.167 | 0.611 | 0.001 |
| 6 | uses the word 'morning' | 0.089 | 0.016 | 0.062 | 0.094 | 0.088 |
| 3 | uses an exclamation mark at the end of the tweet | 0.046 | 0.625 | 0.052 | 0.56 | 0.012 |
| 15 | expresses apology | -0.005 | 0.938 | -0.152 | 0.037 | 0.024 |
| 16 | expresses strong negative emotion, specifically hate or dislike | -0.043 | 0.938 | 0.56 | 0.288 | 0.0 |



Holdout Set Metrics:
R² Score (cosine): 0.216
R² Score (distance): 0.099
Significant hypotheses: 
      (cosine) 12/20 (p < 5.000e-03)
      (distance) 10/20 (p < 5.000e-03)


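The same broad trends hold across all three annotation modes: positive-sentiment hypotheses such as gratitude, strong positive emotion, love, and 'cute' rank near the top by separation score in every table. The fit is weaker with embedding-based annotation, however: holdout $R^2$ drops from 0.481 with LLM annotation to 0.216 with cosine similarity and 0.099 with distance, and fewer hypotheses are significant (12/20 and 10/20 versus 13/20).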

## ii. SupervisedSparseAutoencoder Class
Motivating idea: what if we learned a sparse representation that was not only helpful for reconstruction, but also for predicting the target variable? Unfortunately, this is where I ran out of time. I have implemented the constructor and forward pass (to the best of my knowledge), but did not have time to finish the rest of the implementation.

In [108]:
from typing import Optional

class SupervisedSparseAutoencoder(SparseAutoencoder):
    def __init__(
        self,
        input_dim: int,
        m_total_neurons: int,
        k_active_neurons: int,
        supervised_dim: int = 1,
        alpha: float = 1.0,
        **kwargs
    ) -> None:
        """
        supervised_dim: dimension of target y (1 for scalar regression)
        alpha: weight of supervised loss relative to reconstruction
        """
        super().__init__(input_dim, m_total_neurons, k_active_neurons, **kwargs)
        self.alpha = alpha
        self.supervised_dim = supervised_dim

        # projector on latent space to target prediction
        self.supervised_head = nn.Linear(self.m_total_neurons, supervised_dim)

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor, dict]:
        recon, info = super().forward(x)

        latent_code = info['activations']
        y_prediction = self.supervised_head(latent_code)
        return recon, y_prediction, info
    
    def compute_loss(
        self,
        x: torch.Tensor,
        recon: torch.Tensor,
        info: dict[str, torch.Tensor],
        aux_coef: float,
        multi_coef: float,
    ) -> torch.Tensor:
        raise NotImplementedError("compute_loss is not implemented yet")
    
    def fit(
        self,
        X_train: torch.Tensor,
        X_val: Optional[torch.Tensor] = None,
        save_dir: Optional[str] = None,
        batch_size: int = 512,
        learning_rate: float = 5e-4,
        n_epochs: int = 200,
        aux_coef: float = 1 / 32,
        multi_coef: float = 0.0,
        patience: int = 5,
        show_progress: bool = True,
        clip_grad: float = 1.0
    ) -> dict:
        raise NotImplementedError("fit is not implemented yet")
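
For reference, one plausible shape for the missing objective is reconstruction error plus `alpha` times a prediction error on the target. The NumPy sketch below is only an illustration of that idea under my own assumptions: `supervised_sae_loss` is a hypothetical name, mean squared error is assumed for both terms, and the auxiliary and multi-$k$ terms suggested by `aux_coef` and `multi_coef` are left out.

```python
import numpy as np

def supervised_sae_loss(x, recon, y_true, y_pred, alpha=1.0):
    """Reconstruction MSE plus alpha times supervised MSE (illustrative only)."""
    x, recon = np.asarray(x, dtype=float), np.asarray(recon, dtype=float)
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    recon_loss = np.mean((recon - x) ** 2)             # how well the SAE rebuilds the embedding
    supervised_loss = np.mean((y_pred - y_true) ** 2)  # how well the head predicts the target
    total = recon_loss + alpha * supervised_loss
    return float(total), float(recon_loss), float(supervised_loss)

# Toy numbers: reconstruction error of 1 and prediction error of 4, weighted by alpha = 0.5.
toy_total, toy_recon, toy_sup = supervised_sae_loss(
    x=np.zeros((2, 3)), recon=np.ones((2, 3)),
    y_true=np.zeros((2, 1)), y_pred=np.full((2, 1), 2.0),
    alpha=0.5,
)  # total = 1 + 0.5 * 4 = 3
```

In the class above, the same combination would be written with tensors (for example with `F.mse_loss`) so that gradients flow through both the decoder and `supervised_head`.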