**PersonaForge**: structured persona generation for LLM agent social simulations and computational social science.
Generate demographically realistic, psychologically grounded personas from real census data — ready to use as LLM system prompts.
```python
from persona_engine import PersonaGenerator, build_system_prompt

gen = PersonaGenerator()

# Step 1: Generate personas (local, no API needed)
personas = gen.synthetic(n=100, country="Brazil", seed=42)

# Step 2: Use as LLM system prompt
system_prompt = build_system_prompt(personas[0])  # → ready for any LLM API
```

PersonaForge is a two-stage pipeline:
```
┌─────────────────────────────────────┐      ┌──────────────────────────────┐
│ 📦 Stage 1: Generate Personas       │      │ 🤖 Stage 2: Use with LLM     │
│    (local, no API key needed)       │ ──▶  │    (requires LLM API key)    │
│                                     │      │                              │
│ 4 modes: synthetic / microdata /    │      │ build_system_prompt()        │
│          survey / archetype         │      │ scripts/simulate.py          │
└─────────────────────────────────────┘      └──────────────────────────────┘
```
```bash
pip install -e .
```

| Mode | Demographics | Psychology | Data Source |
|---|---|---|---|
| 🧪 synthetic | Census probability tables | Synthesized (IPIP/Hofstede) | World Bank + national census (14 countries) |
| 📊 microdata | Real individual records | Synthesized | US/UK/Canada census (built-in) + IPUMS (8 countries) |
| 📋 survey | Real respondent data | Partially real | Custom CSV or bundled WVS sample |
| 🎭 archetype | Hand-authored templates | Hand-authored | 12 built-in archetypes |
```python
gen = PersonaGenerator()

# 🧪 Synthetic: census-calibrated distributions, 14 countries
personas = gen.synthetic(n=1000, country="Japan", seed=42)

# 📊 Microdata: real census individual records
personas = gen.from_microdata(n=100, seed=42)                   # US (default)
personas = gen.from_microdata(n=100, country="United Kingdom")  # built-in
personas = gen.from_microdata(n=100, country="Canada")          # built-in
personas = gen.from_microdata(n=100, country="Brazil")          # requires IPUMS setup

# 📋 Survey: your own respondent data
personas = gen.from_survey(path="my_survey.csv", column_mapping={...})

# 🎭 Archetype: named templates with noise
personas = gen.from_archetype("urban_progressive", n=10, noise=0.1)
```

📝 **Built-in archetypes (12 types)**
urban_progressive · rural_conservative · tech_entrepreneur · retired_worker · young_activist · academic_researcher · blue_collar_worker · immigrant_newcomer · stay_at_home_parent · military_veteran · healthcare_worker · religious_leader
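For programmatic iteration, the twelve archetype names (copied verbatim from the list above) can be kept as a plain Python list:

```python
# The 12 built-in archetype names, copied from the list above.
ARCHETYPES = [
    "urban_progressive", "rural_conservative", "tech_entrepreneur",
    "retired_worker", "young_activist", "academic_researcher",
    "blue_collar_worker", "immigrant_newcomer", "stay_at_home_parent",
    "military_veteran", "healthcare_worker", "religious_leader",
]

# e.g. a small panel covering every archetype (sketch, uses the API above):
# panel = [p for name in ARCHETYPES
#          for p in gen.from_archetype(name, n=5, noise=0.1)]
```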
All generation modes support `psychology=False` to skip psychological synthesis:

```python
# Only demographics — no Big Five, values, moral foundations, or biases
personas = gen.from_microdata(n=1000, country="United Kingdom", psychology=False)
```

CLI:

```bash
python scripts/generate.py --mode microdata --n 1000 --country "United Kingdom" --no-psychology
```

Each `Persona` contains demographics (always present) and optional psychological dimensions:
| Component | Fields | Range | Notes |
|---|---|---|---|
| 👤 Demographics | age, gender, education, income, country, occupation, marital_status | — | Always present |
| 🏙️ city_type | rural, suburban, urban, metropolitan | enum | Optional; synthesized in synthetic/archetype, None in microdata |
| 🧠 Big Five | openness, conscientiousness, extraversion, agreeableness, neuroticism | 0–1 | Optional |
| 💡 MBTI | 16 types | enum | Independent; use big5_to_mbti() to derive from Big Five |
| ⚖️ Values | political_orientation, risk_aversion, collectivism, religiosity, trust_in_people, env_priority, gender_equality, immigration_openness | 0–1 | Optional |
| 🏛️ Moral Foundations | care, fairness, loyalty, authority, purity | 0–1 | Optional |
| 🎯 Behavioral Bias | confirmation_bias, conformity, trust_in_authority | 0–1 | Optional |
| 🔍 Provenance | tracks which fields are real vs synthesized | dict | Per-field data lineage |
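As a concrete illustration of the schema above, a persona serialized to a plain dict might look roughly like this. All values are invented, and the exact key names may differ from persona_engine's actual serialization:

```python
# Illustrative persona dict; field names follow the schema table above.
example = {
    "demographics": {
        "age": 34, "gender": "female", "education": "bachelor",
        "income": "middle", "country": "Brazil",
        "occupation": "teacher", "marital_status": "married",
    },
    "big_five": {
        "openness": 0.71, "conscientiousness": 0.55, "extraversion": 0.40,
        "agreeableness": 0.66, "neuroticism": 0.32,
    },
    "moral_foundations": {
        "care": 0.80, "fairness": 0.75, "loyalty": 0.50,
        "authority": 0.45, "purity": 0.40,
    },
}

# All psychological scores are documented as 0-1:
for section in ("big_five", "moral_foundations"):
    for key, value in example[section].items():
        assert 0.0 <= value <= 1.0, key
```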
Every persona records how its data was obtained:
```python
persona.provenance
# → {"demographics_basis": "real", "psychology_basis": "synthesized"}
# → {"demographics_basis": "census_table", "psychology_basis": "skipped"}
```

Values: `real` (individual census records) · `census_table` (aggregate distributions) · `survey_proxy` (estimated) · `synthesized` (statistical models) · `hand_authored` (archetype templates) · `skipped` (not generated)
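Since provenance is a per-persona dict, auditing a whole population is a one-liner with `collections.Counter`. Here with hand-made stand-in dicts rather than real `Persona` objects:

```python
from collections import Counter

# Stand-ins for persona.provenance dicts from a mixed population.
provenances = [
    {"demographics_basis": "real", "psychology_basis": "synthesized"},
    {"demographics_basis": "real", "psychology_basis": "synthesized"},
    {"demographics_basis": "census_table", "psychology_basis": "skipped"},
]

demo_counts = Counter(p["demographics_basis"] for p in provenances)
psych_counts = Counter(p["psychology_basis"] for p in provenances)
print(demo_counts)   # Counter({'real': 2, 'census_table': 1})
print(psych_counts)  # Counter({'synthesized': 2, 'skipped': 1})
```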
```bash
# Synthetic, default 10 personas, JSON to stdout
python scripts/generate.py

# 100 Japanese personas to file
python scripts/generate.py --mode synthetic --n 100 --country Japan --output personas.json

# Real UK census microdata, demographics only
python scripts/generate.py --mode microdata --n 1000 --country "United Kingdom" --no-psychology --output uk.json

# Archetype as LLM prompts
python scripts/generate.py --mode archetype --archetype urban_progressive --n 20 --format prompts --style narrative

# Export to CSV
python scripts/generate.py --mode synthetic --n 50 --output personas.csv
```

Generate personas and ask each one a question via LLM API:
```bash
# Requires: export ANTHROPIC_API_KEY=sk-xxx (or OPENAI_API_KEY for --provider openai)
python scripts/simulate.py \
  --question "Should the government ban fossil fuels?" \
  --n 20 --country "United States" \
  --output results.csv

# Use pre-generated personas
python scripts/simulate.py \
  --question "Do you support universal basic income?" \
  --personas personas.json \
  --provider openai --model gpt-4o \
  --output results.csv
```

`build_system_prompt()` returns a complete system message with persona description + behavioral hints.
**Claude**
```python
import anthropic
from persona_engine import PersonaGenerator, build_system_prompt

client = anthropic.Anthropic()
persona = PersonaGenerator().from_archetype("urban_progressive", n=1, seed=42)[0]

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=build_system_prompt(persona),
    messages=[{"role": "user", "content": "Should we tax carbon emissions?"}],
)
```

**OpenAI**
```python
import openai
from persona_engine import PersonaGenerator, build_system_prompt

client = openai.OpenAI()
persona = PersonaGenerator().from_archetype("rural_conservative", n=1, seed=42)[0]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": build_system_prompt(persona)},
        {"role": "user", "content": "Should we tax carbon emissions?"},
    ],
)
```

**Any LLM**
Works with any LLM that accepts a system prompt — Ollama, vLLM, Together, Groq, etc.
```python
from persona_engine import build_prompt, build_system_prompt

system_prompt = build_system_prompt(persona, style="narrative")  # persona + behavioral hints
prompt = build_prompt(persona, style="bullet")                   # persona description only

# Styles: "narrative" · "bullet" · "minimal"

# Custom Jinja2 template:
prompt = build_prompt(persona, custom_template="You are {{ age }}, from {{ country }}")
```

Export helpers:

```python
gen.to_json(indent=2)   # JSON string
gen.to_dicts()          # list[dict]
gen.to_dataframe()      # pandas DataFrame (flat columns: demo_*, big5_*, val_*, ...)
```

| Dataset | Source | Countries | Provenance |
|---|---|---|---|
| 📊 Microdata (built-in) | US Census PUMS 2022, UK ONS Census 2021, Canada StatCan PUMF 2021 | US, UK, Canada | Real individual records, zero config |
| 📈 Census tables | World Bank API (ILO estimates) + national census reports | 14 | Real indicators, calibrated cross-tabs |
| 🌍 IPUMS microdata | IPUMS International census samples | 10+ | Real individual records (setup required) |
| 🧭 Hofstede | Hofstede Insights | 58 | Published cultural dimension scores |
| 🧠 IPIP Big Five | International Personality Item Pool | Global | Population trait distributions |
All 14 countries: US, UK, Japan, India, Brazil, Nigeria, South Africa, Germany, China, Mexico, South Korea, Australia, Canada, Indonesia
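persona_engine presumably raises its own error for an unknown country name, but batch scripts may want to fail fast with a suggestion before generating anything. A pre-flight check sketch (country list copied from above; this helper is not part of persona_engine):

```python
import difflib

# The 14 supported countries, copied from the list above.
SUPPORTED = [
    "US", "UK", "Japan", "India", "Brazil", "Nigeria", "South Africa",
    "Germany", "China", "Mexico", "South Korea", "Australia", "Canada",
    "Indonesia",
]

def check_country(name: str) -> str:
    """Return the canonical name, or raise with a close-match hint."""
    if name in SUPPORTED:
        return name
    hint = difflib.get_close_matches(name, SUPPORTED, n=1)
    suffix = f" Did you mean {hint[0]!r}?" if hint else ""
    raise ValueError(f"Unsupported country {name!r}.{suffix}")

check_country("Japan")    # ok
# check_country("Japn")   # → ValueError: ... Did you mean 'Japan'?
```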
Built-in microdata (real census records, zero setup):
- 🇺🇸 US: Census Bureau ACS PUMS 2022 (50K records, auto-fetched on first use)
- 🇬🇧 UK: ONS Census 2021 Public Microdata Teaching Sample (50K records)
- 🇨🇦 Canada: Statistics Canada 2021 Census PUMF (50K records)
Real census microdata has varying field coverage due to source data differences:
| Dimension | 🇺🇸 US | 🇬🇧 UK | 🇨🇦 Canada |
|---|---|---|---|
| Education | 6 levels (no_formal → doctorate) | 3 levels (primary → bachelor) | 5 levels (no_formal → doctorate, no primary) |
| Income | 5 levels | 4 levels (no high) | 5 levels |
| Occupation | 11 categories | 10 categories | 8 categories |
| Marital | 4 statuses | 3 statuses (no widowed) | 4 statuses |
| Age range | 18–94 | 20–72 | 19–88 |
| city_type | None | None | None |
These are limitations of the source data, not bugs. Provenance tracking lets you know exactly what came from real data vs synthesis.
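One way to work with the uneven coverage is to harmonize onto the coarsest scheme before cross-country comparison. A sketch collapsing education onto three UK-like buckets; only the level names that appear in the table above are known, the intermediate labels here are guesses, so adjust them to the values your data actually contains:

```python
# Collapse education codes onto a 3-bucket scheme for cross-country work.
# "no_formal", "primary", "bachelor", "doctorate" come from the coverage
# table above; the other level names are illustrative guesses.
COARSE_EDUCATION = {
    "no_formal": "primary",
    "primary": "primary",
    "secondary": "secondary",     # guessed label
    "some_college": "secondary",  # guessed label
    "bachelor": "bachelor",
    "master": "bachelor",         # guessed label
    "doctorate": "bachelor",
}

def coarse_education(level: str) -> str:
    return COARSE_EDUCATION.get(level, "unknown")

print(coarse_education("doctorate"))  # → bachelor
```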
```bash
# Rebuild World Bank census tables with latest data
python scripts/build_census_tables.py --fetch

# Rebuild UK/Canada microdata caches from source
python scripts/build_public_microdata.py
```

🌍 **IPUMS International Microdata (Optional)**
For real individual-level census records beyond US/UK/Canada:
- Register at https://international.ipums.org/ (free for research)
- Get API key at https://account.ipums.org/api_keys
- Install: `pip install ipumspy`
- Fetch data:

```bash
export IPUMS_API_KEY=your_key_here
python scripts/fetch_ipums_data.py --fetch                            # all countries
python scripts/fetch_ipums_data.py --fetch --countries brazil mexico  # specific countries
```

Available: India, Brazil, Nigeria, South Africa, Germany, China, Mexico, Indonesia.
Not in IPUMS: Japan, South Korea, Australia (use synthetic mode instead).
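Putting the availability notes together, a batch script could route each country to the best available mode. A sketch under the assumptions above (the sets are copied from the text; `pick_mode` is a hypothetical helper, not part of persona_engine):

```python
# Countries with real individual records available, per the notes above.
BUILTIN_MICRODATA = {"US", "UK", "Canada"}
IPUMS_AVAILABLE = {"India", "Brazil", "Nigeria", "South Africa",
                   "Germany", "China", "Mexico", "Indonesia"}

def pick_mode(country: str, ipums_configured: bool = False) -> str:
    """Hypothetical helper: choose a generation mode for a country."""
    if country in BUILTIN_MICRODATA:
        return "microdata"
    if country in IPUMS_AVAILABLE and ipums_configured:
        return "microdata"
    return "synthetic"  # e.g. Japan, South Korea, Australia

print(pick_mode("Canada"))        # → microdata
print(pick_mode("Brazil"))        # → synthetic (no IPUMS setup)
print(pick_mode("Brazil", True))  # → microdata
```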
Test whether persona parameters actually influence LLM responses:
```python
from persona_engine import validate_population, BUILTIN_TASKS

report = validate_population(personas, BUILTIN_TASKS, llm_callable=my_llm)
print(report.summary())
```

```text
PersonaForge/
├── persona_engine/
│   ├── schema.py                  # Pydantic models (Persona, BigFive, Values, ...)
│   ├── generator.py               # PersonaGenerator facade (4 modes)
│   ├── datasets.py                # Data file loaders
│   ├── validation.py              # LLM behavior validation
│   ├── generation/
│   │   ├── _sampling.py           # Shared psychological synthesis pipeline
│   │   ├── synthetic.py           # Census-table demographics
│   │   ├── archetype.py           # Hand-authored templates
│   │   ├── survey.py              # Survey/CSV loader
│   │   └── microdata.py           # Census microdata sampling (US/UK/Canada + IPUMS)
│   └── output/
│       ├── prompt_builder.py      # Persona → system prompt
│       └── behavior_adapter.py    # Persona → behavioral hints
├── datasets/
│   ├── census/                    # 14 country probability tables + WB source data
│   ├── ipums_cache/               # Microdata: UK, Canada (50K records each)
│   ├── pums_cache.json            # US Census PUMS (50K records, auto-fetched)
│   ├── hofstede_sample.csv        # Cultural dimensions (58 countries)
│   └── ipip_distributions.json    # Big Five population distributions
├── scripts/
│   ├── generate.py                # CLI: generate personas
│   ├── simulate.py                # CLI: LLM survey simulation (requires API key)
│   ├── build_census_tables.py     # Fetch WB data → census tables
│   ├── build_public_microdata.py  # Download UK/Canada census → microdata
│   ├── fetch_ipums_data.py        # IPUMS International → microdata
│   └── fetch_census_data.py       # US PUMS API fetcher
├── examples/                      # Usage examples
└── tests/                         # 101 tests
```
```bash
pip install -e ".[dev]"
python -m pytest -v             # 101 tests
python examples/basic_usage.py
```

License: MIT