# StringSight


Extract, cluster, and analyze behavioral properties from Generative Models

Python 3.10+ · MIT License · Docs · Blog · Website

Annoyed at having to look through your long model conversations or agentic traces? Fear not, StringSight has come to ease your woes. Understand and compare model behavior by automatically extracting behavioral properties from their responses, grouping similar behaviors together, and quantifying how important these behaviors are.


## Installation & Quick Start

```shell
# Install
pip install stringsight

# Set your API keys
export OPENAI_API_KEY="your-openai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"  # optional
export GOOGLE_API_KEY="your-google-key"        # optional

# Launch the web interface
stringsight launch

# Or run in background (survives terminal disconnects)
stringsight launch --daemon

# Check status
stringsight status

# View logs
stringsight logs

# Stop the server
stringsight stop
```

The UI will be available at http://localhost:5180.

For tutorials and examples, see starter_notebook.ipynb or Google Colab.

## Install from source

Use this if you want the latest code, plan to modify StringSight, or want an editable install.

```shell
# Clone (includes submodules, e.g. the frontend)
git clone --recurse-submodules https://github.com/lisadunlap/stringsight.git
cd stringsight

# If you already cloned without submodules:
# git submodule update --init --recursive

# Create and activate a virtual environment (example: venv)
python -m venv .venv
source .venv/bin/activate

# Install in editable mode
pip install -U pip
pip install -e .
```

## Deployment Options

**Local development:**

```shell
stringsight launch              # Foreground mode (stops when terminal closes)
stringsight launch --daemon     # Background mode (persistent)
```

**Docker (for production or multi-user setups):**

```shell
docker compose up -d
```

The Docker setup includes PostgreSQL, Redis, MinIO storage, and Celery workers for handling long-running jobs.

## Usage

### Data Format

Required columns:

- `prompt`: Question/prompt text (this doesn't need to be your actual prompt; any unique identifier for a run works)
- `model`: Model name
- `model_response`: Model output in one of three formats:
  - OpenAI conversation format: `[{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]` (recommended; multimodal inputs are also supported in this format)
  - Plain string: `"Model response text..."`
  - Custom format: any other format is converted to a string, so it will not render as nicely in the UI (if you care about that sort of thing)

Optional columns:

- Scores: provide metrics in separate columns (e.g. `accuracy`, `helpfulness`; set them with the `score_columns` parameter) or as a single `score` column containing a dictionary like `{"accuracy": 0.85, "helpfulness": 4.2}`.
- `question_id`: Unique ID for a question (useful if you have multiple responses for the same prompt, especially for side-by-side pairing)
- Custom column names via the `prompt_column`, `model_column`, `model_response_column`, and `question_id_column` parameters

For side-by-side: use `model_a`, `model_b`, `model_a_response`, `model_b_response` (pre-paired data), or pass tidy data with `method="side_by_side"` to auto-pair responses by prompt.
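To make the two side-by-side input shapes concrete, here is an illustrative sketch (the prompts and responses are made up):

```python
import pandas as pd

# Pre-paired data: one row per prompt, with both models' responses side by side.
paired_df = pd.DataFrame({
    "prompt": ["What is machine learning?"],
    "model_a": ["gpt-4"],
    "model_b": ["claude-3"],
    "model_a_response": ["Machine learning is..."],
    "model_b_response": ["Machine learning refers to..."],
})

# Tidy (long) data: one row per (prompt, model). Passed with
# method="side_by_side", rows sharing a prompt are auto-paired.
tidy_df = pd.DataFrame({
    "prompt": ["What is machine learning?"] * 2,
    "model": ["gpt-4", "claude-3"],
    "model_response": ["Machine learning is...", "Machine learning refers to..."],
})
```

Either shape describes the same comparison; the tidy form is usually easier to produce from logs.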

### Extract and Cluster Properties

```python
import pandas as pd
from stringsight import explain

# Prepare your data
df = pd.DataFrame({
    "prompt": ["What is machine learning?", "Explain quantum computing", ...],
    "model": ["gpt-4", "claude-3", ...],
    "model_response": [
        [{"role": "user", "content": "What is machine learning?"},
         {"role": "assistant", "content": "Machine learning involves..."}, ...],
        [{"role": "user", "content": "Explain quantum computing"},
         {"role": "assistant", "content": "Quantum computing uses..."}, ...]
    ],
    "accuracy": [1, 0, ...],
    "helpfulness": [4.2, 3.8, ...]
})

# Run analysis
clustered_df, model_stats = explain(
    df,
    model_name="gpt-4.1-mini",
    output_dir="results/test",
    score_columns=["accuracy", "helpfulness"],
    model_response_column="model_response",  # defaults to "model_response"
)
```

### Side-by-Side Comparison

```python
# Compare two models
clustered_df, model_stats = explain(
    df,
    method="side_by_side",
    model_a="gpt-4",
    model_b="claude-3",
    output_dir="results/comparison",
    score_columns=["accuracy", "helpfulness"],
)
```

### Fixed Taxonomy Labeling

```python
from stringsight import label

TAXONOMY = {
    "refusal": "Does the model refuse to follow certain instructions?",
    "hallucination": "Does the model generate false information?",
}

clustered_df, model_stats = label(
    df,
    taxonomy=TAXONOMY,
    output_dir="results/labeled"
)
```
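As a hedged sketch of inspecting the labeled output, assuming each returned row carries a `property_description` drawn from the taxonomy keys (the rows below are toy stand-ins, not real output):

```python
import pandas as pd

# Hypothetical toy output mimicking label()'s clustered_df: with a fixed
# taxonomy, property descriptions come from the taxonomy keys.
toy = pd.DataFrame({
    "model": ["gpt-4", "gpt-4", "claude-3"],
    "property_description": ["refusal", "hallucination", "refusal"],
})

# Count how often each taxonomy label fires per model.
counts = toy.groupby(["model", "property_description"]).size().unstack(fill_value=0)
```

A table like `counts` makes it easy to spot which model triggers which taxonomy label most often.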

## Output

Output dataframe columns:

- `property_description`: Natural-language description of the behavioral trait
- `category`: High-level grouping (e.g., "Reasoning", "Style", "Safety")
- `reason`: Why this behavior occurs
- `evidence`: Specific quotes demonstrating the behavior
- `unexpected_behavior`: Boolean indicating whether the behavior is problematic
- `type`: Nature of the property (e.g., "content", "format", "style")
- `behavior_type`: Classification like "Positive", "Negative (critical)", "Style"
- `cluster_id`: Cluster assignment
- `cluster_label`: Human-readable cluster name

Output files (when `output_dir` is specified):

- `clustered_results.parquet`: Main dataframe with cluster assignments
- `clustered_results.jsonl` / `clustered_results_lightweight.jsonl`: JSON formats
- `full_dataset.json` / `full_dataset.parquet`: Complete PropertyDataset
- `model_cluster_scores.json`: Per model-cluster metrics
- `cluster_scores.json`: Aggregated metrics per cluster
- `model_scores.json`: Overall metrics per model
- `summary.txt`: Human-readable summary
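A sketch of working with these outputs, using the column schema above (toy rows stand in for the real parquet):

```python
import pandas as pd

# Toy stand-in for results/test/clustered_results.parquet; in practice:
# df = pd.read_parquet("results/test/clustered_results.parquet")
df = pd.DataFrame({
    "cluster_label": ["hedging language", "hedging language", "polite refusal"],
    "unexpected_behavior": [False, True, False],
})

# Largest behavioral clusters
top_clusters = df["cluster_label"].value_counts()

# Rows flagged as potentially problematic
flagged = df[df["unexpected_behavior"]]
```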

Metrics in `model_stats`:

The `model_stats` dictionary contains three DataFrames:

1. `model_cluster_scores`: How each model performs on each behavioral cluster
2. `cluster_scores`: Aggregated metrics across all models for each cluster
3. `model_scores`: Overall metrics for each model across all clusters
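A toy sketch of slicing these DataFrames (the `proportion` metric column below is invented for illustration; consult the actual outputs for the real column names):

```python
import pandas as pd

# Hypothetical stand-in for the model_stats dict returned by explain().
model_stats = {
    "model_cluster_scores": pd.DataFrame({
        "model": ["gpt-4", "claude-3"],
        "cluster_label": ["hedging language", "hedging language"],
        "proportion": [0.40, 0.15],  # invented metric for illustration
    }),
}

# Which model exhibits a given cluster most often?
mcs = model_stats["model_cluster_scores"]
top_row = mcs.sort_values("proportion", ascending=False).iloc[0]
```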

## Configuration

```python
explain(
    df,
    model_name="gpt-4.1-mini",                      # LLM for extraction
    embedding_model="text-embedding-3-large",       # Embedding model
    min_cluster_size=5,                             # Min cluster size
    sample_size=100,                                # Sample before processing
    output_dir="results/",
)
```

## Advanced Features

See the documentation for:

- Docker deployment
- Custom column mapping
- Multimodal conversations (text + images)
- Prompt expansion
- Caching configuration
- CLI usage

## Documentation

## Contributing

PRs are very welcome, especially if I forgot to include something important in the README. Questions or issues? Open an issue on GitHub.
