# StringSight


Extract, cluster, and analyze behavioral properties from Generative Models

Python 3.10+ · MIT License · Docs · Blog · Website

Annoyed at having to look through your long model conversations or agentic traces? Fear not, StringSight has come to ease your woes. Understand and compare model behavior by automatically extracting behavioral properties from their responses, grouping similar behaviors together, and quantifying how important these behaviors are.


## Installation & Quick Start

```shell
# Install
pip install stringsight

# Set your API keys
export OPENAI_API_KEY="your-openai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"  # optional
export GOOGLE_API_KEY="your-google-key"        # optional

# Launch the web interface
stringsight launch

# Or run in background (survives terminal disconnects)
stringsight launch --daemon

# Check status
stringsight status

# View logs
stringsight logs

# Stop the server
stringsight stop
```

The UI will be available at http://localhost:5180.

For tutorials and examples, see starter_notebook.ipynb or Google Colab.

## Install from source

Use this if you want the latest code, plan to modify StringSight, or want an editable install.

```shell
# Clone (includes submodules, e.g. the frontend)
git clone --recurse-submodules https://github.com/lisadunlap/stringsight.git
cd stringsight

# If you already cloned without submodules:
# git submodule update --init --recursive

# Create and activate a virtual environment (example: venv)
python -m venv .venv
source .venv/bin/activate

# Install in editable mode
pip install -U pip
pip install -e .
```

## Deployment Options

**Local development:**

```shell
stringsight launch              # Foreground mode (stops when terminal closes)
stringsight launch --daemon     # Background mode (persistent)
```

**Docker (for production or multi-user setups):**

```shell
docker compose up -d
```

The Docker setup includes PostgreSQL, Redis, MinIO storage, and Celery workers for handling long-running jobs.

## Usage

### Data Format

Required columns:

- `prompt`: Question/prompt text (this doesn't need to be your actual prompt; any unique identifier for a run works)
- `model`: Model name
- `model_response`: Model output in one of three formats:
  - OpenAI conversation format: `[{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]` (recommended; multimodal inputs are also supported in this format)
  - Plain string: `"Model response text..."`
  - Custom format: any other format is converted to a string, so it will not render as nicely in the UI (if you care about that sort of thing)

Optional columns:

- Scores: provide metrics in separate columns (e.g. `accuracy`, `helpfulness`; set them with the `score_columns` parameter) or as a single `score` column containing a dictionary like `{"accuracy": 0.85, "helpfulness": 4.2}`.
- `question_id`: Unique ID for a question (useful if you have multiple responses for the same prompt, especially for side-by-side pairing)
- Custom column names via the `prompt_column`, `model_column`, `model_response_column`, and `question_id_column` parameters

For side-by-side: use `model_a`, `model_b`, `model_a_response`, `model_b_response` (pre-paired data), or pass tidy data with `method="side_by_side"` to auto-pair responses by prompt.
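To make the two side-by-side input shapes concrete, here is an illustrative sketch (the prompts and responses are made up):

```python
import pandas as pd

# Pre-paired data: one row per prompt, with both models' responses side by side.
paired_df = pd.DataFrame({
    "prompt": ["What is machine learning?"],
    "model_a": ["gpt-4"],
    "model_b": ["claude-3"],
    "model_a_response": ["Machine learning is..."],
    "model_b_response": ["Machine learning refers to..."],
})

# Tidy (long) data: one row per (prompt, model). Passed with
# method="side_by_side", rows sharing a prompt are auto-paired.
tidy_df = pd.DataFrame({
    "prompt": ["What is machine learning?"] * 2,
    "model": ["gpt-4", "claude-3"],
    "model_response": ["Machine learning is...", "Machine learning refers to..."],
})
```

Either shape describes the same comparison; the tidy form is usually easier to produce from logs.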

### Extract and Cluster Properties

```python
import pandas as pd
from stringsight import explain

# Prepare your data
df = pd.DataFrame({
    "prompt": ["What is machine learning?", "Explain quantum computing", ...],
    "model": ["gpt-4", "claude-3", ...],
    "model_response": [
        [{"role": "user", "content": "What is machine learning?"},
         {"role": "assistant", "content": "Machine learning involves..."}, ...],
        [{"role": "user", "content": "Explain quantum computing"},
         {"role": "assistant", "content": "Quantum computing uses..."}, ...]
    ],
    "accuracy": [1, 0, ...],
    "helpfulness": [4.2, 3.8, ...]
})

# Run analysis
clustered_df, model_stats = explain(
    df,
    model_name="gpt-4.1-mini",
    output_dir="results/test",
    score_columns=["accuracy", "helpfulness"],
    model_response_column="model_response",  # defaults to "model_response"
)
```

### Side-by-Side Comparison

```python
# Compare two models
clustered_df, model_stats = explain(
    df,
    method="side_by_side",
    model_a="gpt-4",
    model_b="claude-3",
    output_dir="results/comparison",
    score_columns=["accuracy", "helpfulness"],
)
```

### Fixed Taxonomy Labeling

```python
from stringsight import label

TAXONOMY = {
    "refusal": "Does the model refuse to follow certain instructions?",
    "hallucination": "Does the model generate false information?",
}

clustered_df, model_stats = label(
    df,
    taxonomy=TAXONOMY,
    output_dir="results/labeled"
)
```
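As a hedged sketch of inspecting the labeled output, assuming each returned row carries a `property_description` drawn from the taxonomy keys (the rows below are toy stand-ins, not real output):

```python
import pandas as pd

# Hypothetical toy output mimicking label()'s clustered_df: with a fixed
# taxonomy, property descriptions come from the taxonomy keys.
toy = pd.DataFrame({
    "model": ["gpt-4", "gpt-4", "claude-3"],
    "property_description": ["refusal", "hallucination", "refusal"],
})

# Count how often each taxonomy label fires per model.
counts = toy.groupby(["model", "property_description"]).size().unstack(fill_value=0)
```

A table like `counts` makes it easy to spot which model triggers which taxonomy label most often.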

## Output

Output dataframe columns:

- `property_description`: Natural-language description of the behavioral trait
- `category`: High-level grouping (e.g., "Reasoning", "Style", "Safety")
- `reason`: Why this behavior occurs
- `evidence`: Specific quotes demonstrating the behavior
- `unexpected_behavior`: Boolean indicating whether the behavior is problematic
- `type`: Nature of the property (e.g., "content", "format", "style")
- `behavior_type`: Classification like "Positive", "Negative (critical)", "Style"
- `cluster_id`: Cluster assignment
- `cluster_label`: Human-readable cluster name

Output files (when `output_dir` is specified):

- `clustered_results.parquet`: Main dataframe with cluster assignments
- `clustered_results.jsonl` / `clustered_results_lightweight.jsonl`: JSON formats
- `full_dataset.json` / `full_dataset.parquet`: Complete PropertyDataset
- `model_cluster_scores.json`: Per model-cluster metrics
- `cluster_scores.json`: Aggregated metrics per cluster
- `model_scores.json`: Overall metrics per model
- `summary.txt`: Human-readable summary
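A sketch of working with these outputs, using the column schema above (toy rows stand in for the real parquet):

```python
import pandas as pd

# Toy stand-in for results/test/clustered_results.parquet; in practice:
# df = pd.read_parquet("results/test/clustered_results.parquet")
df = pd.DataFrame({
    "cluster_label": ["hedging language", "hedging language", "polite refusal"],
    "unexpected_behavior": [False, True, False],
})

# Largest behavioral clusters
top_clusters = df["cluster_label"].value_counts()

# Rows flagged as potentially problematic
flagged = df[df["unexpected_behavior"]]
```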

Metrics in `model_stats`:

The `model_stats` dictionary contains three DataFrames:

1. `model_cluster_scores`: How each model performs on each behavioral cluster
2. `cluster_scores`: Aggregated metrics across all models for each cluster
3. `model_scores`: Overall metrics for each model across all clusters
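A toy sketch of slicing these DataFrames (the `proportion` metric column below is invented for illustration; consult the actual outputs for the real column names):

```python
import pandas as pd

# Hypothetical stand-in for the model_stats dict returned by explain().
model_stats = {
    "model_cluster_scores": pd.DataFrame({
        "model": ["gpt-4", "claude-3"],
        "cluster_label": ["hedging language", "hedging language"],
        "proportion": [0.40, 0.15],  # invented metric for illustration
    }),
}

# Which model exhibits a given cluster most often?
mcs = model_stats["model_cluster_scores"]
top_row = mcs.sort_values("proportion", ascending=False).iloc[0]
```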

## Configuration

```python
explain(
    df,
    model_name="gpt-4.1-mini",                      # LLM for extraction
    embedding_model="text-embedding-3-large",       # Embedding model
    min_cluster_size=5,                             # Min cluster size
    sample_size=100,                                # Sample before processing
    output_dir="results/",
)
```

## Advanced Features

See the documentation for:

- Docker deployment
- Custom column mapping
- Multimodal conversations (text + images)
- Prompt expansion
- Caching configuration
- CLI usage

## Documentation

## Contributing

PRs are very welcome, especially if I forgot to include something important in the README. Questions or issues? Open an issue on GitHub.
