jameslemon2002/persona-forge

🔨 PersonaForge

Structured persona generation for LLM agent social simulations and computational social science.

Generate demographically realistic, psychologically grounded personas from real census data — ready to use as LLM system prompts.

License: MIT · Python 3.9+ · Tests


⚡ TL;DR

from persona_engine import PersonaGenerator, build_system_prompt

gen = PersonaGenerator()

# Step 1: Generate personas (local, no API needed)
personas = gen.synthetic(n=100, country="Brazil", seed=42)

# Step 2: Use as LLM system prompt
system_prompt = build_system_prompt(personas[0])  # → ready for any LLM API

🏗️ How It Works

PersonaForge is a two-stage pipeline:

┌─────────────────────────────────────┐     ┌──────────────────────────────┐
│  📦 Stage 1: Generate Personas      │     │  🤖 Stage 2: Use with LLM    │
│  (local, no API key needed)         │ ──▶ │  (requires LLM API key)      │
│                                     │     │                              │
│  4 modes: synthetic / microdata /   │     │  build_system_prompt()       │
│           survey / archetype        │     │  scripts/simulate.py         │
└─────────────────────────────────────┘     └──────────────────────────────┘

📥 Install

pip install -e .

🎲 Generation Modes

| Mode | Demographics | Psychology | Data Source |
|---|---|---|---|
| 🧪 synthetic | Census probability tables | Synthesized (IPIP/Hofstede) | World Bank + national census (14 countries) |
| 📊 microdata | Real individual records | Synthesized | US/UK/Canada census (built-in) + IPUMS (8 countries) |
| 📋 survey | Real respondent data | Partially real | Custom CSV or bundled WVS sample |
| 🎭 archetype | Hand-authored templates | Hand-authored | 12 built-in archetypes |

gen = PersonaGenerator()

# 🧪 Synthetic: census-calibrated distributions, 14 countries
personas = gen.synthetic(n=1000, country="Japan", seed=42)

# 📊 Microdata: real census individual records
personas = gen.from_microdata(n=100, seed=42)                   # US (default)
personas = gen.from_microdata(n=100, country="United Kingdom")  # built-in
personas = gen.from_microdata(n=100, country="Canada")          # built-in
personas = gen.from_microdata(n=100, country="Brazil")          # requires IPUMS setup

# 📋 Survey: your own respondent data
personas = gen.from_survey(path="my_survey.csv", column_mapping={...})

# 🎭 Archetype: named templates with noise
personas = gen.from_archetype("urban_progressive", n=10, noise=0.1)
📝 Built-in archetypes (12 types)

urban_progressive · rural_conservative · tech_entrepreneur · retired_worker · young_activist · academic_researcher · blue_collar_worker · immigrant_newcomer · stay_at_home_parent · military_veteran · healthcare_worker · religious_leader
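
How `noise` is applied to an archetype is not spelled out in this README. One plausible reading, sketched below, is per-trait jitter clamped to the 0–1 range used by the persona schema. The function name `jitter_traits` and the template values are hypothetical and are not part of `persona_engine`.

```python
import random


def jitter_traits(template, noise=0.1, seed=None):
    """Return a copy of `template` with Gaussian jitter on every trait.

    Assumption: `noise` acts as a standard deviation and results are
    clamped to [0, 1]. The real package may do something different.
    """
    rng = random.Random(seed)
    return {
        name: min(1.0, max(0.0, value + rng.gauss(0.0, noise)))
        for name, value in template.items()
    }


# Hypothetical template values, for illustration only.
urban_progressive_traits = {
    "openness": 0.80,
    "conscientiousness": 0.55,
    "extraversion": 0.60,
}

# Ten variants of the same archetype, each with its own seed.
variants = [
    jitter_traits(urban_progressive_traits, noise=0.1, seed=i) for i in range(10)
]
```

Seeding each variant separately keeps the set reproducible while still giving every persona slightly different trait values.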

🔀 Demographics-Only Mode

All generation modes support psychology=False to skip psychological synthesis:

# Only demographics — no Big Five, values, moral foundations, or biases
personas = gen.from_microdata(n=1000, country="United Kingdom", psychology=False)

CLI:

python scripts/generate.py --mode microdata --n 1000 --country "United Kingdom" --no-psychology

🧬 Persona Structure

Each Persona contains demographics (always present) and optional psychological dimensions:

| Component | Fields | Range | Notes |
|---|---|---|---|
| 👤 Demographics | age, gender, education, income, country, occupation, marital_status | | Always present |
| 🏙️ city_type | rural, suburban, urban, metropolitan | enum | Optional; synthesized in synthetic/archetype, None in microdata |
| 🧠 Big Five | openness, conscientiousness, extraversion, agreeableness, neuroticism | 0–1 | Optional |
| 💡 MBTI | 16 types | enum | Independent; use big5_to_mbti() to derive from Big Five |
| ⚖️ Values | political_orientation, risk_aversion, collectivism, religiosity, trust_in_people, env_priority, gender_equality, immigration_openness | 0–1 | Optional |
| 🏛️ Moral Foundations | care, fairness, loyalty, authority, purity | 0–1 | Optional |
| 🎯 Behavioral Bias | confirmation_bias, conformity, trust_in_authority | 0–1 | Optional |
| 🔍 Provenance | tracks which fields are real vs synthesized | dict | Per-field data lineage |
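
To make the "always present" versus "optional" split concrete, here is a minimal stand-in built on standard-library dataclasses. The real models are Pydantic classes in `persona_engine/schema.py`. The class names, attribute nesting, and sample values below are assumptions for illustration, and only the Big Five group is modelled.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class DemographicsSketch:
    # Field names follow the Demographics row of the table above.
    age: int
    gender: str
    education: str
    income: str
    country: str
    occupation: str
    marital_status: str


@dataclass
class BigFiveSketch:
    # Each trait is a float in the 0-1 range.
    openness: float
    conscientiousness: float
    extraversion: float
    agreeableness: float
    neuroticism: float


@dataclass
class PersonaSketch:
    demographics: DemographicsSketch          # always present
    city_type: Optional[str] = None           # None for microdata personas
    big_five: Optional[BigFiveSketch] = None  # None when psychology is skipped
    provenance: Dict[str, str] = field(default_factory=dict)

    def has_psychology(self) -> bool:
        return self.big_five is not None


# A demographics-only persona, as psychology=False would produce.
# All values here are made up.
persona = PersonaSketch(
    demographics=DemographicsSketch(
        age=41, gender="female", education="bachelor", income="middle",
        country="United Kingdom", occupation="teacher", marital_status="married",
    ),
    provenance={"demographics_basis": "real", "psychology_basis": "skipped"},
)
```

Downstream code can branch on `has_psychology()` rather than assuming trait fields exist.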

🔍 Provenance Tracking

Every persona records how its data was obtained:

persona.provenance
# e.g. microdata mode:
# → {"demographics_basis": "real", "psychology_basis": "synthesized"}
# e.g. synthetic mode with psychology=False:
# → {"demographics_basis": "census_table", "psychology_basis": "skipped"}

Values: real (individual census records) · census_table (aggregate distributions) · survey_proxy (estimated) · synthesized (statistical models) · hand_authored (archetype templates) · skipped (not generated)
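
Because provenance is a plain dict, a mixed population can be audited or filtered with ordinary Python. The sketch below works on provenance dicts shaped like the examples above. The helper names are hypothetical and not part of the package.

```python
from collections import Counter


def count_by_basis(provenances, key="demographics_basis"):
    """Tally how many personas fall under each provenance value."""
    return Counter(p.get(key, "unknown") for p in provenances)


def only_real_demographics(provenances):
    """Keep entries whose demographics come from individual census records."""
    return [p for p in provenances if p.get("demographics_basis") == "real"]


# Provenance dicts in the shape shown above.
sample = [
    {"demographics_basis": "real", "psychology_basis": "synthesized"},
    {"demographics_basis": "census_table", "psychology_basis": "skipped"},
    {"demographics_basis": "real", "psychology_basis": "skipped"},
]

counts = count_by_basis(sample)
real_only = only_real_demographics(sample)
```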

💻 CLI Usage

📦 Generate Personas

# Synthetic, default 10 personas, JSON to stdout
python scripts/generate.py

# 100 Japanese personas to file
python scripts/generate.py --mode synthetic --n 100 --country Japan --output personas.json

# Real UK census microdata, demographics only
python scripts/generate.py --mode microdata --n 1000 --country "United Kingdom" --no-psychology --output uk.json

# Archetype as LLM prompts
python scripts/generate.py --mode archetype --archetype urban_progressive --n 20 --format prompts --style narrative

# Export to CSV
python scripts/generate.py --mode synthetic --n 50 --output personas.csv

🤖 Simulate with LLM

Generate personas and ask each one a question via LLM API:

# Requires: export ANTHROPIC_API_KEY=sk-xxx  (or OPENAI_API_KEY for --provider openai)
python scripts/simulate.py \
  --question "Should the government ban fossil fuels?" \
  --n 20 --country "United States" \
  --output results.csv

# Use pre-generated personas
python scripts/simulate.py \
  --question "Do you support universal basic income?" \
  --personas personas.json \
  --provider openai --model gpt-4o \
  --output results.csv
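
Conceptually, the simulation is a loop: one system prompt per persona, the same question for each, and one row per answer. The sketch below shows that loop with a stub in place of the LLM call. It is not the code in `scripts/simulate.py`, and the CSV columns and the `ask` signature are assumptions.

```python
import csv
import io


def run_survey(system_prompts, question, ask):
    """Ask every persona the same question and collect one row per answer.

    `ask` is any callable taking (system_prompt, question) and returning text.
    """
    rows = []
    for idx, system_prompt in enumerate(system_prompts):
        answer = ask(system_prompt, question)
        rows.append({"persona_id": idx, "question": question, "answer": answer})
    return rows


def rows_to_csv(rows):
    """Serialize survey rows to CSV text (the column layout is an assumption)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["persona_id", "question", "answer"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def fake_llm(system_prompt, question):
    # Stand-in for a real API call, so the sketch runs without a key.
    return f"(reply conditioned on {len(system_prompt)} chars of persona)"


# Made-up system prompts; in practice these come from build_system_prompt().
prompts = ["You are a 34-year-old teacher.", "You are a 61-year-old farmer."]
rows = run_survey(prompts, "Should the government ban fossil fuels?", fake_llm)
csv_text = rows_to_csv(rows)
```

Passing the LLM as a callable keeps the loop independent of the provider.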

🔌 LLM Integration

build_system_prompt() returns a complete system message: the persona description plus behavioral hints.

Claude
import anthropic
from persona_engine import PersonaGenerator, build_system_prompt

client = anthropic.Anthropic()
persona = PersonaGenerator().from_archetype("urban_progressive", n=1, seed=42)[0]

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=build_system_prompt(persona),
    messages=[{"role": "user", "content": "Should we tax carbon emissions?"}],
)
OpenAI
import openai
from persona_engine import PersonaGenerator, build_system_prompt

client = openai.OpenAI()
persona = PersonaGenerator().from_archetype("rural_conservative", n=1, seed=42)[0]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": build_system_prompt(persona)},
        {"role": "user", "content": "Should we tax carbon emissions?"},
    ],
)
Any LLM

Works with any LLM that accepts a system prompt — Ollama, vLLM, Together, Groq, etc.
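
Many of these servers accept a chat payload shaped like the OpenAI example above, though that should be checked against each provider's documentation. The helper below only builds the request body and sends nothing. `build_chat_payload`, the placeholder model name, and the stand-in system prompt are all hypothetical.

```python
import json


def build_chat_payload(system_prompt, question, model):
    """Assemble a chat request body with the persona as the system message."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    }


# In real use, system_prompt would be build_system_prompt(persona).
payload = build_chat_payload(
    system_prompt="You are a 29-year-old software engineer from Brazil.",
    question="Should we tax carbon emissions?",
    model="your-model-name",  # placeholder; use whatever your server exposes
)
body = json.dumps(payload)  # ready to POST with the HTTP client of your choice
```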

🎨 Prompt Styles

from persona_engine import build_prompt, build_system_prompt

system_prompt = build_system_prompt(persona, style="narrative")  # persona + behavioral hints
prompt = build_prompt(persona, style="bullet")                   # persona description only

# Styles: "narrative" · "bullet" · "minimal"
# Custom Jinja2 template:
prompt = build_prompt(persona, custom_template="You are {{ age }}, from {{ country }}")
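
The exact wording each style produces is not shown in this README. As a rough intuition only, the three styles might differ along the lines of the stand-in below. The function `render_persona` and its output text are invented for illustration and do not reproduce the package's templates.

```python
def render_persona(fields, style="narrative"):
    """Render a few persona fields in one of three illustrative styles."""
    if style == "narrative":
        return (
            f"You are a {fields['age']}-year-old {fields['occupation']} "
            f"from {fields['country']}."
        )
    if style == "bullet":
        return "\n".join(f"- {key}: {value}" for key, value in fields.items())
    if style == "minimal":
        return ", ".join(str(value) for value in fields.values())
    raise ValueError(f"unknown style: {style!r}")


fields = {"age": 42, "occupation": "nurse", "country": "Brazil"}

narrative = render_persona(fields, "narrative")
bullet = render_persona(fields, "bullet")
minimal = render_persona(fields, "minimal")
```

Narrative prompts read more naturally, while bullet and minimal forms use fewer tokens, which adds up over thousands of personas.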

📤 Export

gen.to_json(indent=2)      # JSON string
gen.to_dicts()              # list[dict]
gen.to_dataframe()          # pandas DataFrame (flat columns: demo_*, big5_*, val_*, ...)
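
The flat column names (`demo_*`, `big5_*`, `val_*`) suggest that nested persona groups are flattened with a per-group prefix. Here is a minimal sketch of that idea on plain dicts. The group keys and the fallback for unlisted groups are assumptions, not the package's implementation.

```python
# Prefixes for the three groups named in the export comment above.
# The nested key names on the left are assumptions.
PREFIXES = {"demographics": "demo", "big_five": "big5", "values": "val"}


def flatten_persona(persona):
    """Flatten {group: {field: value}} into {prefix_field: value}.

    Groups set to None (for example, skipped psychology) are left out.
    """
    flat = {}
    for group, fields in persona.items():
        if fields is None:
            continue
        prefix = PREFIXES.get(group, group)  # unlisted groups keep their name
        for name, value in fields.items():
            flat[f"{prefix}_{name}"] = value
    return flat


# Made-up values for illustration.
nested = {
    "demographics": {"age": 30, "country": "Japan"},
    "big_five": {"openness": 0.7},
    "values": None,
}
row = flatten_persona(nested)
```

A list of such rows can be handed directly to `pandas.DataFrame(...)` when pandas is available.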

🗂️ Data Sources

| Dataset | Source | Countries | Provenance |
|---|---|---|---|
| 📊 Microdata (built-in) | US Census PUMS 2022, UK ONS Census 2021, Canada StatCan PUMF 2021 | US, UK, Canada | Real individual records, zero config |
| 📈 Census tables | World Bank API (ILO estimates) + national census reports | 14 | Real indicators, calibrated cross-tabs |
| 🌍 IPUMS microdata | IPUMS International census samples | 10+ | Real individual records (setup required) |
| 🧭 Hofstede | Hofstede Insights | 58 | Published cultural dimension scores |
| 🧠 IPIP Big Five | International Personality Item Pool | Global | Population trait distributions |
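
For intuition about how census tables and cross-tabs can drive generation, the sketch below draws an age band from a marginal table and then an education level conditional on that band, using a seeded generator for reproducibility. The numbers and category names are invented and do not come from the bundled tables. The package's own sampling code may work differently.

```python
import random

# Illustrative numbers only; not taken from any real census table.
AGE_BAND_SHARE = {"18-29": 0.25, "30-44": 0.30, "45-64": 0.30, "65+": 0.15}
EDUCATION_GIVEN_AGE = {
    "18-29": {"secondary": 0.60, "bachelor": 0.40},
    "30-44": {"secondary": 0.50, "bachelor": 0.50},
    "45-64": {"secondary": 0.65, "bachelor": 0.35},
    "65+": {"secondary": 0.80, "bachelor": 0.20},
}


def draw(rng, table):
    """Pick one category with probability proportional to its share."""
    categories = list(table)
    weights = [table[c] for c in categories]
    return rng.choices(categories, weights=weights, k=1)[0]


def sample_demographics(n, seed=None):
    """Sample n (age_band, education) pairs from the cross-tab."""
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        age_band = draw(rng, AGE_BAND_SHARE)
        education = draw(rng, EDUCATION_GIVEN_AGE[age_band])
        people.append({"age_band": age_band, "education": education})
    return people


sample = sample_demographics(5, seed=42)
```

Sampling education conditional on age, rather than independently, is what keeps joint distributions realistic.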

🌐 Supported Countries

All 14 countries: US, UK, Japan, India, Brazil, Nigeria, South Africa, Germany, China, Mexico, South Korea, Australia, Canada, Indonesia

Built-in microdata (real census records, zero setup):

  • 🇺🇸 US: Census Bureau ACS PUMS 2022 (50K records, auto-fetched on first use)
  • 🇬🇧 UK: ONS Census 2021 Public Microdata Teaching Sample (50K records)
  • 🇨🇦 Canada: Statistics Canada 2021 Census PUMF (50K records)

⚠️ Microdata Coverage Limits

Real census microdata has varying field coverage due to source data differences:

| Dimension | 🇺🇸 US | 🇬🇧 UK | 🇨🇦 Canada |
|---|---|---|---|
| Education | 6 levels (no_formal → doctorate) | 3 levels (primary → bachelor) | 5 levels (no_formal → doctorate, no primary) |
| Income | 5 levels | 4 levels (no high) | 5 levels |
| Occupation | 11 categories | 10 categories | 8 categories |
| Marital | 4 statuses | 3 statuses (no widowed) | 4 statuses |
| Age range | 18–94 | 20–72 | 19–88 |
| city_type | None | None | None |

These are limitations of the source data, not bugs. Provenance tracking lets you know exactly what came from real data vs synthesis.
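
If a study design depends on a group the source data cannot represent, it helps to check coverage before sampling. The sketch below encodes the age ranges and the explicitly noted gaps from the table above. The helper names are hypothetical, and the category strings ("widowed", "high", "primary") follow the table's wording, which may not match the package's enum values.

```python
# Age ranges copied from the coverage table above.
AGE_RANGE = {
    "United States": (18, 94),
    "United Kingdom": (20, 72),
    "Canada": (19, 88),
}

# Only the gaps the table names explicitly.
ABSENT_CATEGORIES = {
    ("United Kingdom", "marital_status"): {"widowed"},
    ("United Kingdom", "income"): {"high"},
    ("Canada", "education"): {"primary"},
}


def age_in_coverage(country, age):
    """True if built-in microdata for `country` includes this age."""
    low, high = AGE_RANGE[country]
    return low <= age <= high


def category_available(country, field, value):
    """False if the table marks this category as absent for the country."""
    return value not in ABSENT_CATEGORIES.get((country, field), set())


uk_80_ok = age_in_coverage("United Kingdom", 80)                              # False
uk_widowed_ok = category_available("United Kingdom", "marital_status", "widowed")  # False
```

When a check fails, synthetic mode is one way to cover the missing group, with provenance recording the change of basis.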

🔄 Rebuild Data

# Rebuild World Bank census tables with latest data
python scripts/build_census_tables.py --fetch

# Rebuild UK/Canada microdata caches from source
python scripts/build_public_microdata.py
🌍 IPUMS International Microdata (Optional)

For real individual-level census records beyond US/UK/Canada:

  1. Register at https://international.ipums.org/ (free for research)
  2. Get API key at https://account.ipums.org/api_keys
  3. Install: pip install ipumspy
  4. Fetch data:
export IPUMS_API_KEY=your_key_here
python scripts/fetch_ipums_data.py --fetch                          # all countries
python scripts/fetch_ipums_data.py --fetch --countries brazil mexico # specific

Available: India, Brazil, Nigeria, South Africa, Germany, China, Mexico, Indonesia. Not in IPUMS: Japan, South Korea, Australia (use synthetic mode instead).
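
The availability rules above can be turned into a small lookup that tells you which route a given country needs. The country sets come from this README. The function `microdata_route` and its return strings are hypothetical.

```python
BUILT_IN = {"United States", "United Kingdom", "Canada"}
IPUMS_AVAILABLE = {
    "India", "Brazil", "Nigeria", "South Africa",
    "Germany", "China", "Mexico", "Indonesia",
}
SYNTHETIC_ONLY = {"Japan", "South Korea", "Australia"}


def microdata_route(country):
    """Describe how real-record or fallback data is obtained for a country."""
    if country in BUILT_IN:
        return "microdata (built-in)"
    if country in IPUMS_AVAILABLE:
        return "microdata (IPUMS setup required)"
    if country in SYNTHETIC_ONLY:
        return "synthetic"
    raise ValueError(f"unsupported country: {country!r}")


routes = {c: microdata_route(c) for c in ("Canada", "Brazil", "Japan")}
```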

✅ Validation Framework

Test whether persona parameters actually influence LLM responses:

from persona_engine import validate_population, BUILTIN_TASKS

report = validate_population(personas, BUILTIN_TASKS, llm_callable=my_llm)
print(report.summary())
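
What `validate_population` computes internally is not described here. One simple way to think about "parameters influence responses" is the correlation between a persona trait and a numeric score extracted from the LLM's answers, as sketched below with made-up numbers. This illustrates the idea only. It is not the library's method, and the expected signature of `my_llm` is not documented in this README.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: each persona's risk_aversion and a 0-1 "support for a
# risky policy" score parsed from its LLM answer.
risk_aversion = [0.1, 0.3, 0.5, 0.7, 0.9]
support_scores = [1 - r for r in risk_aversion]  # stub: perfectly trait-driven
flat_scores = [0.5] * 5                          # stub: persona is ignored

responsive = pearson(risk_aversion, support_scores)
unresponsive = pearson(risk_aversion, flat_scores)
```

A correlation near zero across many personas would suggest the model is ignoring that trait in the system prompt.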

📁 Project Structure

PersonaForge/
├── persona_engine/
│   ├── schema.py               # Pydantic models (Persona, BigFive, Values, ...)
│   ├── generator.py            # PersonaGenerator facade (4 modes)
│   ├── datasets.py             # Data file loaders
│   ├── validation.py           # LLM behavior validation
│   ├── generation/
│   │   ├── _sampling.py        # Shared psychological synthesis pipeline
│   │   ├── synthetic.py        # Census-table demographics
│   │   ├── archetype.py        # Hand-authored templates
│   │   ├── survey.py           # Survey/CSV loader
│   │   └── microdata.py        # Census microdata sampling (US/UK/Canada + IPUMS)
│   └── output/
│       ├── prompt_builder.py   # Persona → system prompt
│       └── behavior_adapter.py # Persona → behavioral hints
├── datasets/
│   ├── census/                 # 14 country probability tables + WB source data
│   ├── ipums_cache/            # Microdata: UK, Canada (50K records each)
│   ├── pums_cache.json         # US Census PUMS (50K records, auto-fetched)
│   ├── hofstede_sample.csv     # Cultural dimensions (58 countries)
│   └── ipip_distributions.json # Big Five population distributions
├── scripts/
│   ├── generate.py               # CLI: generate personas
│   ├── simulate.py               # CLI: LLM survey simulation (requires API key)
│   ├── build_census_tables.py    # Fetch WB data → census tables
│   ├── build_public_microdata.py # Download UK/Canada census → microdata
│   ├── fetch_ipums_data.py       # IPUMS International → microdata
│   └── fetch_census_data.py      # US PUMS API fetcher
├── examples/                   # Usage examples
└── tests/                      # 101 tests

🛠️ Development

pip install -e ".[dev]"
python -m pytest -v            # 101 tests
python examples/basic_usage.py

📄 License

MIT
