Extract, cluster, and analyze behavioral properties from generative models
Annoyed at having to look through your long model conversations or agentic traces? Fear not, StringSight has come to ease your woes. Understand and compare models by automatically extracting behavioral properties from their responses, grouping similar behaviors together, and quantifying how important each behavior is.
demo.mp4
```bash
# Install
pip install stringsight

# Set your API keys
export OPENAI_API_KEY="your-openai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"  # optional
export GOOGLE_API_KEY="your-google-key"        # optional

# Launch the web interface
stringsight launch

# Or run in the background (survives terminal disconnects)
stringsight launch --daemon

# Check status
stringsight status

# View logs
stringsight logs

# Stop the server
stringsight stop
```

The UI will be available at http://localhost:5180.
For tutorials and examples, see `starter_notebook.ipynb` or Google Colab.
Use this if you want the latest code, plan to modify StringSight, or want an editable install.
```bash
# Clone (includes submodules, e.g. the frontend)
git clone --recurse-submodules https://github.com/lisadunlap/stringsight.git
cd stringsight

# If you already cloned without submodules:
# git submodule update --init --recursive

# Create and activate a virtual environment (example: venv)
python -m venv .venv
source .venv/bin/activate

# Install in editable mode
pip install -U pip
pip install -e .
```

Local Development:
```bash
stringsight launch           # Foreground mode (stops when terminal closes)
stringsight launch --daemon  # Background mode (persistent)
```

Docker (for production or multi-user setups):
```bash
docker compose up -d
```

The Docker setup includes PostgreSQL, Redis, MinIO storage, and Celery workers for handling long-running jobs.
Required columns:
- `prompt`: Question/prompt text (this doesn't need to be your actual prompt, just some unique identifier of a run)
- `model`: Model name
- `model_response`: Model output in one of three formats:
  - OpenAI conversation format (recommended; multimodal inputs are also supported in this format): `[{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]`
  - Plain string: `"Model response text..."`
  - Custom format: any other format will be converted to a string and thus won't render all pretty in the UI (if you care about that sort of thing)
Optional columns:
- Scores: You can provide metrics in separate columns (e.g. `"accuracy"`, `"helpfulness"`, etc.; set them with the `score_columns` parameter) or as a single `score` column containing a dictionary like `{"accuracy": 0.85, "helpfulness": 4.2}` (see the sketch below)
- `question_id`: Unique ID for a question (useful if you have multiple responses for the same prompt, especially for side-by-side pairing)
- Custom column names via the `prompt_column`, `model_column`, `model_response_column`, and `question_id_column` parameters
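For illustration, a dictionary-valued `score` column might look like this (a minimal sketch; the values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "prompt": ["What is machine learning?"],
    "model": ["gpt-4"],
    "model_response": ["Machine learning involves..."],
    # One dict of metrics per row, instead of separate score columns
    "score": [{"accuracy": 0.85, "helpfulness": 4.2}],
})
```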
For side-by-side: Use `model_a`, `model_b`, `model_a_response`, `model_b_response` (pre-paired data, sketched below) or pass tidy data with `method="side_by_side"` to auto-pair by prompt.
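A minimal pre-paired side-by-side DataFrame could look like this (a sketch; responses here are plain strings, but the conversation format above works too):

```python
import pandas as pd

# One row per prompt, with both models' responses already paired
paired_df = pd.DataFrame({
    "prompt": ["What is machine learning?"],
    "model_a": ["gpt-4"],
    "model_b": ["claude-3"],
    "model_a_response": ["Machine learning involves..."],
    "model_b_response": ["Machine learning is a field of AI..."],
})
```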
```python
import pandas as pd
from stringsight import explain

# Prepare your data
df = pd.DataFrame({
    "prompt": ["What is machine learning?", "Explain quantum computing", ...],
    "model": ["gpt-4", "claude-3", ...],
    "model_response": [
        [{"role": "user", "content": "What is machine learning?"},
         {"role": "assistant", "content": "Machine learning involves..."}, ...],
        [{"role": "user", "content": "Explain quantum computing"},
         {"role": "assistant", "content": "Quantum computing uses..."}, ...]
    ],
    "accuracy": [1, 0, ...],
    "helpfulness": [4.2, 3.8, ...]
})

# Run analysis
clustered_df, model_stats = explain(
    df,
    model_name="gpt-4.1-mini",
    output_dir="results/test",
    score_columns=["accuracy", "helpfulness"],
    model_response_column="model_response"  # by default it checks for a "model_response" column
)
```
```python
# Compare two models
clustered_df, model_stats = explain(
    df,
    method="side_by_side",
    model_a="gpt-4",
    model_b="claude-3",
    output_dir="results/comparison",
    score_columns=["accuracy", "helpfulness"],
)
```
```python
from stringsight import label

# Define the behaviors you want to label
TAXONOMY = {
    "refusal": "Does the model refuse to follow certain instructions?",
    "hallucination": "Does the model generate false information?",
}

clustered_df, model_stats = label(
    df,
    taxonomy=TAXONOMY,
    output_dir="results/labeled"
)
```

Output dataframe columns:
- `property_description`: Natural language description of the behavioral trait
- `category`: High-level grouping (e.g., "Reasoning", "Style", "Safety")
- `reason`: Why this behavior occurs
- `evidence`: Specific quotes demonstrating the behavior
- `unexpected_behavior`: Boolean indicating whether this is problematic
- `type`: Nature of the property (e.g., "content", "format", "style")
- `behavior_type`: Classification like "Positive", "Negative (critical)", "Style"
- `cluster_id`: Cluster assignment
- `cluster_label`: Human-readable cluster name
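With those columns, the returned DataFrame can be sliced with ordinary pandas; a quick sketch (column names as listed above):

```python
# clustered_df as returned by explain(...) above

# Count extracted properties per cluster, largest first
print(clustered_df["cluster_label"].value_counts().head(10))

# Inspect properties flagged as problematic
flagged = clustered_df[clustered_df["unexpected_behavior"]]
print(flagged[["property_description", "evidence"]].head())
```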
Output files (when `output_dir` is specified):
- `clustered_results.parquet`: Main dataframe with cluster assignments
- `clustered_results.jsonl` / `clustered_results_lightweight.jsonl`: JSON formats
- `full_dataset.json` / `full_dataset.parquet`: Complete PropertyDataset
- `model_cluster_scores.json`: Per model-cluster metrics
- `cluster_scores.json`: Aggregated metrics per cluster
- `model_scores.json`: Overall metrics per model
- `summary.txt`: Human-readable summary
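These are standard formats, so results can be reloaded later without rerunning the analysis (a sketch, assuming `output_dir="results/test"` as in the example above):

```python
import json
import pandas as pd

clustered_df = pd.read_parquet("results/test/clustered_results.parquet")
with open("results/test/model_cluster_scores.json") as f:
    model_cluster_scores = json.load(f)
```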
Metrics in `model_stats`:
The `model_stats` dictionary contains three DataFrames:
- `model_cluster_scores`: How each model performs on each behavioral cluster
- `cluster_scores`: Aggregated metrics across all models for each cluster
- `model_scores`: Overall metrics for each model across all clusters
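Since `model_stats` is a plain dictionary of DataFrames, each one can be pulled out by the keys above (a minimal sketch):

```python
# model_stats as returned by explain(...) above

# Per-model, per-cluster metrics
print(model_stats["model_cluster_scores"].head())

# Overall metrics per model
print(model_stats["model_scores"].head())
```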
```python
explain(
    df,
    model_name="gpt-4.1-mini",                 # LLM for extraction
    embedding_model="text-embedding-3-large",  # Embedding model
    min_cluster_size=5,                        # Min cluster size
    sample_size=100,                           # Sample before processing
    output_dir="results/"
)
```

See the documentation for:
- Docker deployment
- Custom column mapping
- Multimodal conversations (text + images)
- Prompt expansion
- Caching configuration
- CLI usage
- Full Documentation: https://lisadunlap.github.io/StringSight/
- Demo Website: https://stringsight.com
- Tutorial Notebook: `starter_notebook.ipynb`
PRs are very welcome, especially if I forgot to include something important in the README. Questions or issues? Open an issue on GitHub.
