Official code, prompts, and data for our ACL 2026 paper:
CAST: Achieving Stable LLM-based Text Analysis for Data Analytics
Jinxiang Xie*, Zihao Li*, Wei He*, Rui Ding†, Shi Han, Dongmei Zhang
ACL 2026
*Equal contribution. †Corresponding author.
Text analysis of tabular data relies on two core operations:
- Summarization — corpus-level theme extraction
- Tagging — row-level labeling
A critical limitation of using LLMs for these tasks is their inability to meet the high standards of output stability demanded by data analytics. We introduce CAST (Consistency via Algorithmic Prompting and Stable Thinking), a framework that enhances output stability by constraining the model's latent reasoning path. CAST combines:
- Algorithmic Prompting (AP) — a procedural scaffold over valid reasoning transitions.
- Thinking-before-Speaking (TbS) — explicit intermediate commitments before final generation.
To measure progress, we also introduce CAST-S and CAST-T, stability metrics for bulleted summarization and tagging, validated against human judgments. Across multiple LLM backbones, CAST consistently achieves the best stability among all baselines, improving the Stability Score by up to 16.2%, while maintaining or improving output quality.
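As a toy illustration of what "output stability" means here (this is *not* the paper's CAST-S/CAST-T definition, which is computed over bulleted summaries and tag assignments): rerun the same prompt several times and measure how much the outputs agree, e.g. via mean pairwise string similarity.

```python
from difflib import SequenceMatcher
from itertools import combinations

def pairwise_stability(outputs):
    """Mean pairwise similarity across repeated runs of the same prompt.

    Toy proxy only -- CAST-S / CAST-T are defined over structured summaries
    and tag assignments, not raw strings.
    """
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0  # a single run is trivially self-consistent
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

# Identical runs score 1.0; a divergent rerun pulls the score down.
print(pairwise_stability(["pricing; support", "pricing; support", "shipping delays"]))
```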
```
.
├── summarization/                        # Summarization (CAST-S) experiments
│   ├── summary_pipeline.py               # End-to-end summary generation + scoring
│   ├── llm_stability_pipeline.py         # Compare baseline / AP / TbS / CAST stability
│   ├── path_stability_pipeline.py        # Reasoning-path ablations
│   ├── distribution_analysis_pipeline.py # Output-distribution sharpness analysis
│   ├── AblationPrompt/                   # baseline / ap / tbs / cast prompts
│   ├── reasoning_path_prompt/            # Reasoning-path ablation prompts
│   ├── EvaluationPrompt/                 # Judge prompts for summary + stability
│   ├── Input/                            # Input datasets (xlsx)
│   │   ├── Summary-Input/                # Datasets for summary_pipeline.py
│   │   └── Stability-Input/              # Datasets for *_stability_pipeline.py
│   └── Output/Stability-Output/          # Reference stability scores + correlation analysis
│
├── tagging/                              # Tagging (CAST-T) experiments
│   ├── program.py                        # Tagging pipeline
│   ├── evaluation.ipynb                  # Evaluation / analysis notebook
│   └── AP.md / TbS.md / AP+TbS.md / none.md  # Prompt variants
│
├── data/                                 # Human annotations + supplementary data
│   ├── README.md                         # Full file-by-file description
│   ├── human_annotations/
│   │   ├── summarization/                # h1 / j2 / z3 stability score JSONs
│   │   └── tagging/                      # h1 / j1 / j2 annotators × 4 prompts
│   └── supplementary/                    # Additional input datasets
│
├── requirements.txt
├── .env.example                          # Copy to .env and fill in API keys
├── LICENSE                               # MIT
└── README.md
```
We recommend uv for Python environment and dependency management (it is significantly faster than pip + venv).
```
# macOS / Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows (PowerShell)
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

# or via pipx / Homebrew
pipx install uv   # any platform
brew install uv   # macOS
```

```
git clone https://github.com/jxtse/CAST-text-analysis.git
cd CAST-text-analysis
uv sync                     # creates .venv and installs core dependencies
source .venv/bin/activate   # Windows: .venv\Scripts\activate
```

Need the heavy distribution-analysis extras (sentence-transformers, umap-learn)? Add the optional group:

```
uv sync --extra distribution   # or: --extra all
```

Python ≥ 3.9 is supported (uv defaults to 3.12).
Prefer plain pip?

```
python -m venv .venv
source .venv/bin/activate          # Windows: .venv\Scripts\activate
pip install -e .                   # core dependencies
pip install -e ".[distribution]"   # optional: distribution-analysis extras
```

Copy the template and fill in whichever providers you intend to use:

```
cp .env.example .env
```

The pipelines read keys from environment variables (loaded via python-dotenv):
| Variable | Used by |
|---|---|
| `OPENAI_API_KEY` | summary_pipeline, path_stability, tagging |
| `OPENROUTER_API_KEY` | All summarization pipelines |
| `SiliconFlow_API_KEY` / `SILICONFLOW_API_KEY` | Stability + tagging pipelines |
| `Grok_API_KEY` / `GROK_API_KEY` | Stability + tagging pipelines |
| `Gemini_API_KEY` / `GEMINI_API_KEY` | Stability + tagging pipelines |
You only need keys for the providers you plan to call.
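Since some providers accept either capitalization, a small helper can resolve whichever spelling is set. This is a hypothetical sketch, not code from the repo (the pipelines do their own loading via python-dotenv):

```python
import os

def get_api_key(*names):
    """Return the first of the given environment variables that is set, else None."""
    for name in names:
        value = os.getenv(name)
        if value:
            return value
    return None

# Either spelling works, mirroring the table above:
key = get_api_key("SILICONFLOW_API_KEY", "SiliconFlow_API_KEY")
```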
All commands assume your current directory is the corresponding subfolder (`summarization/` or `tagging/`), since the scripts use relative paths (e.g. `Input/...`, `Output/...`).
End-to-end summary generation + LLM-judge scoring:

```
cd summarization
python summary_pipeline.py
```

Compare stability across baseline / ap / tbs / cast prompts:

```
python llm_stability_pipeline.py                      # all four
python llm_stability_pipeline.py --prompt_types cast  # subset
python llm_stability_pipeline.py --compare_only       # re-aggregate existing results
python llm_stability_pipeline.py --score_only         # rescore an existing results file
```

Reasoning-path ablations:

```
python path_stability_pipeline.py                  # default 4 paths
python path_stability_pipeline.py --extended_cast  # full 8-path study
python path_stability_pipeline.py --prompt_types perspective_prompt,domain_prompt
```

Output-distribution sharpness analysis:

```
python distribution_analysis_pipeline.py
```

```
cd tagging
# Edit the `dataset_path`, `sheet_names`, `llm_types`, and `prompt_files` lists
# at the bottom of program.py (in `async def main`) to control the run.
python program.py
```

Note: the `dataset_path` referenced in `program.py` points to the tagging dataset, which has been removed from this public release because it contained sensitive material. Replace it with your own input file using the same column layout to reproduce the pipeline.
Then open evaluation.ipynb for the post-hoc tagging analysis.
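For orientation, the run-control block at the bottom of `program.py` looks roughly like the following. All values here are placeholders, not the repo's defaults:

```python
# Hypothetical placeholder values -- edit the real lists at the bottom of
# program.py (inside `async def main`) to match your own data and providers.
dataset_path = "Input/my_tagging_data.xlsx"  # your replacement dataset
sheet_names  = ["Sheet1"]                    # worksheet(s) to tag
llm_types    = ["gpt-4o"]                    # backbone model name(s)
prompt_files = ["AP+TbS.md"]                 # prompt variant: AP / TbS / AP+TbS / none
```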
The reference outputs that back the figures and tables in the paper live under `summarization/Output/Stability-Output/`:

- `cast_stability_score_result.json` — model-judged CAST-S scores
- `human_cast_stability_score_result_anonymous.json` — anonymized human ratings
- `correlation_analysis.py` — Pearson/Spearman correlation between the two
- `stability_correlation_analysis_overall.png`, `stability_correlation_analysis_pair.png` — corresponding figures
The per-annotator raw inputs that produced the anonymized merge above are in
data/human_annotations/, together with the human tagging
annotations and supplementary multilingual datasets.
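The agreement check in `correlation_analysis.py` amounts to Pearson and Spearman correlation between model-judged and human stability scores. A stdlib-only sketch of the two coefficients (the score lists are illustrative, and the rank step does no tie correction):

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def spearman(xs, ys):
    """Spearman correlation = Pearson on ranks (no tie handling here)."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=vs.__getitem__)
        r = [0.0] * len(vs)
        for rank, idx in enumerate(order):
            r[idx] = float(rank)
        return r
    return pearson(ranks(xs), ranks(ys))

model_scores = [0.9, 0.7, 0.8, 0.4]    # illustrative values, not the repo's data
human_scores = [0.85, 0.6, 0.75, 0.5]
```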
To regenerate the correlation figures:

```
cd summarization/Output/Stability-Output
python correlation_analysis.py
```

If CAST or CAST-S/CAST-T are useful in your work, please cite:
```bibtex
@misc{xie2026castachievingstablellmbased,
  title         = {CAST: Achieving Stable LLM-based Text Analysis for Data Analytics},
  author        = {Jinxiang Xie and Zihao Li and Wei He and Rui Ding and Shi Han and Dongmei Zhang},
  year          = {2026},
  eprint        = {2602.15861},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2602.15861}
}
```

The ACL Anthology entry will be linked here once the proceedings are published. See `paper/` for build notes on the arXiv submission.
This project is released under the MIT License.
This work was conducted in part during Jinxiang Xie's internship at Microsoft.
Correspondence: juding@microsoft.com.
Issues and pull requests are welcome.