This repository is an anonymous, sata-bench-style scaffold for evaluating memory-conditioned LLM behavior on a demographic fairness dataset and for fitting random-effects models over the resulting outcomes, as described in "The Personalization Trap: How User Memory Alters Emotional Reasoning in LLMs".

src/personalization_trap/
├── evaluation/
│   ├── dataset/          # dataset loading and preparation
│   └── metrics/          # accuracy / flip-rate helpers
├── methods/
│   ├── inference/        # vLLM inference pipeline
│   ├── random_effects/   # mixed-effects analysis
│   └── utils/            # prompt / extraction / IO helpers
└── configs/              # example TOML configs
scripts/
├── run_inference.py
└── run_random_effects.py
tests/
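
The layout above mentions accuracy and flip-rate helpers without defining them. The following is a minimal sketch of what such helpers might look like, not the repository's implementation. It assumes each record is a dict with `question_id`, `prediction`, and `gold_label` keys, and it assumes "flip rate" means the share of questions whose prediction changes across demographic profiles; both the function names and that definition are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def accuracy(records: List[Dict]) -> float:
    """Fraction of records whose prediction equals the gold label."""
    if not records:
        return 0.0
    hits = sum(r["prediction"] == r["gold_label"] for r in records)
    return hits / len(records)


def flip_rate(records: List[Dict], group_key: str = "question_id") -> float:
    """Assumed definition: fraction of groups (questions) that received
    more than one distinct prediction across their records."""
    predictions_by_group = defaultdict(set)
    for r in records:
        predictions_by_group[r[group_key]].add(r["prediction"])
    if not predictions_by_group:
        return 0.0
    flipped = sum(len(p) > 1 for p in predictions_by_group.values())
    return flipped / len(predictions_by_group)


# Hypothetical toy data: two questions, two demographic profiles each.
records = [
    {"question_id": "q1", "prediction": "A", "gold_label": "A"},
    {"question_id": "q1", "prediction": "A", "gold_label": "A"},
    {"question_id": "q2", "prediction": "B", "gold_label": "B"},
    {"question_id": "q2", "prediction": "C", "gold_label": "B"},
]
```

On this toy data, three of four predictions are correct, and one of the two questions (`q2`) receives different answers under different profiles.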

- Standard inference code for `groupfairnessllm/random_effect_example`.
- vLLM inference over `system_prompt + user_prompt`, with support for open models such as `Qwen/Qwen3-4B`.
- A second-stage extraction pipeline to normalize raw generations into scored labels, with multiple extraction logics:
  - `choice_letter`
  - `yes_no`
  - `regex`
  - `json_field`
  - `label_map`
  - `llm_extract`
- Random-effects modeling to estimate how `gender`, `age`, `religion`, and `ethnicity` influence correctness.

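
To make the second-stage extraction step more concrete, here is a rough sketch of how two of the extraction logics listed above (`choice_letter` and `yes_no`) could work. The function names, the `EXTRACTORS` registry, and the matching rules are hypothetical; the repository's extractors may behave differently.

```python
import re
from typing import Callable, Dict, Optional


def extract_choice_letter(text: str, letters: str = "ABCD") -> Optional[str]:
    """Return the last standalone uppercase choice letter in `text`, if any."""
    pattern = rf"\b([{re.escape(letters)}])\b"
    matches = re.findall(pattern, text)
    return matches[-1] if matches else None


def extract_yes_no(text: str) -> Optional[str]:
    """Return 'yes' or 'no' for the first standalone occurrence, if any."""
    match = re.search(r"\b(yes|no)\b", text, flags=re.IGNORECASE)
    return match.group(1).lower() if match else None


# Hypothetical registry keyed by the extractor names used on the CLI.
EXTRACTORS: Dict[str, Callable[[str], Optional[str]]] = {
    "choice_letter": extract_choice_letter,
    "yes_no": extract_yes_no,
}


def score(raw_generation: str, gold_label: str, extractor: str) -> Dict:
    """Normalize a raw generation into a label and mark it correct or not."""
    prediction = EXTRACTORS[extractor](raw_generation)
    return {"prediction": prediction, "correct": int(prediction == gold_label)}
```

Taking the last matching letter is a design choice that favors a final answer stated after any reasoning text. Generations with no match yield `None` and are scored as incorrect in this sketch.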
python -m venv .venv
source .venv/bin/activate
pip install -e .

python scripts/run_inference.py \
--dataset groupfairnessllm/random_effect_example \
--split train \
--model Qwen/Qwen3-4B \
--output-dir outputs/qwen3_4b \
--extractor choice_letter \
--extraction-prompt-key steu_choice \
--tensor-parallel-size 1

python scripts/run_random_effects.py \
--input outputs/qwen3_4b/predictions.csv \
--output-dir outputs/qwen3_4b/random_effects \
--group-col question_id

The inference pipeline expects at least:

- `system_prompt`
- `user_prompt`
- `question_id`
- `gold_label`
- `gender`
- `age`
- `religion`
- `ethnicity`

If your dataset uses different names, pass `--column-map` with a TOML/JSON config or update the CLI flags.
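
As an illustration of the column requirements above, the sketch below renames dataset columns and reports which expected columns are still missing. It is not the pipeline's own code: the helper names are hypothetical, and the mapping direction (dataset name to expected name) is an assumption about how a column map might be structured.

```python
from typing import Dict, List

# Column names the inference pipeline expects, per the list above.
REQUIRED_COLUMNS = [
    "system_prompt", "user_prompt", "question_id", "gold_label",
    "gender", "age", "religion", "ethnicity",
]


def apply_column_map(rows: List[Dict], column_map: Dict[str, str]) -> List[Dict]:
    """Rename keys in each row; assumed direction: dataset name -> expected name."""
    return [{column_map.get(k, k): v for k, v in row.items()} for row in rows]


def missing_columns(rows: List[Dict]) -> List[str]:
    """List required columns absent from the first row (empty input: all missing)."""
    present = set(rows[0]) if rows else set()
    return [c for c in REQUIRED_COLUMNS if c not in present]


# Hypothetical dataset whose column names differ from the expected ones.
rows = [{
    "sys": "You are a helpful assistant.", "prompt": "How does Sam feel?",
    "qid": 1, "answer": "A",
    "gender": "female", "age": 30, "religion": "none", "ethnicity": "asian",
}]
column_map = {
    "sys": "system_prompt", "prompt": "user_prompt",
    "qid": "question_id", "answer": "gold_label",
}
renamed = apply_column_map(rows, column_map)
```

Checking for missing columns before launching inference surfaces naming mismatches early, instead of partway through a long run.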
Example:
python scripts/run_inference.py \
--dataset groupfairnessllm/random_effect_example \
--split train \
--model Qwen/Qwen3-4B \
--output-dir outputs/qwen3_4b \
--column-map src/personalization_trap/configs/column_map.example.toml

- The project is intentionally anonymous and does not include paper-identifying metadata.
- The mixed-effects implementation uses `statsmodels` with a binomial mixed model and question-level random intercepts.
- The code is designed to be extended for both emotional understanding and recommendation-style tasks.
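
For reference, a binomial mixed model with question-level random intercepts can be written as below. This is a sketch of the general model form, not a statement of the exact specification in the code. It assumes a logit link and treats all four demographic attributes as categorical fixed effects; if `age` is numeric, its term would be a slope instead. Here $y_{ij} = 1$ when the response under demographic profile $i$ to question $j$ is correct, and $g(i)$, $a(i)$, $r(i)$, $e(i)$ denote that profile's gender, age, religion, and ethnicity levels.

```latex
\operatorname{logit}\,\Pr(y_{ij} = 1)
  = \beta_0
  + \beta^{\text{gender}}_{g(i)}
  + \beta^{\text{age}}_{a(i)}
  + \beta^{\text{religion}}_{r(i)}
  + \beta^{\text{ethnicity}}_{e(i)}
  + u_j,
\qquad
u_j \sim \mathcal{N}\!\left(0, \sigma_u^2\right)
```

The random intercept $u_j$ absorbs differences in question difficulty, so the fixed-effect coefficients describe how each demographic attribute shifts the log-odds of a correct answer relative to its reference level.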