This repository contains the source code, experimental configurations, and documentation for the arXiv paper: "StatLLaMA: Multi-Stage training for domain-optimized statistical large language models"
This research introduces StatLLaMA, a highly capable, lightweight language model specialized for the domain of statistics. The project encompasses a complete end-to-end workflow, from sophisticated data engineering pipelines for corpus creation to the systematic training and evaluation methodologies that led to its development.
While general-purpose LLMs have demonstrated remarkable capabilities, they often fall short in specialized domains like statistics. This research addresses a fundamental question: How can we efficiently and effectively adapt a lightweight LLM to the statistical domain, enhancing its specialized capabilities while preserving its valuable general reasoning abilities?
This work provides a comprehensive and empirical investigation into LLM domain adaptation, culminating in several key contributions:
- End-to-End Data Engineering Pipelines: Developed a suite of robust tools for collecting, parsing, cleaning, and synthesizing high-quality training data from diverse sources like arXiv, S2ORC, and the Gemini API.
- Validated a High-Efficacy Training Pipeline: Established and proved the effectiveness of an `SFT -> DPO -> DTFT` pipeline starting from a high-quality instruction-tuned model (LLaMA-3.2-3B-Instruct); a minimal sketch of the first two stages follows this list.
- Systematic Comparison of Techniques: Provided direct, in-domain comparisons of key methods, such as `DPO` vs. `GRPO`, demonstrating DPO's superior stability and effectiveness.
- Key Findings on Model Training: Quantified critical principles, such as DPO's ability to recover lost capabilities and the "less is more" rule for fine-tuning highly optimized models.
- Developed StatLLaMA: Produced a lightweight (3B) yet highly capable statistical language model with significant performance gains.
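The staged `SFT -> DPO` ordering referenced above can be reproduced in spirit with Hugging Face TRL. The snippet below is a minimal sketch, not the training code from this repository: the dataset files, hyperparameters, and output paths are placeholders, the actual scripts live under `Flow3/`, and argument names vary slightly across TRL versions (e.g., `processing_class` vs. the older `tokenizer`).

```python
# Hedged sketch of the SFT -> DPO stages using Hugging Face TRL (>= ~0.12).
# Dataset paths, hyperparameters, and output dirs are illustrative placeholders,
# not the values used by the Flow3 scripts in this repository.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer, SFTConfig, SFTTrainer

base = "meta-llama/Llama-3.2-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base)

# Stage 1: supervised fine-tuning on instruction-style statistics data
# (the JSONL is assumed to follow TRL's SFT format, e.g. a "messages" column).
sft_data = load_dataset("json", data_files="data/stat_sft.jsonl", split="train")
sft_trainer = SFTTrainer(
    model=base,
    train_dataset=sft_data,
    args=SFTConfig(output_dir="ckpt/sft", num_train_epochs=1),
)
sft_trainer.train()
sft_trainer.save_model("ckpt/sft")

# Stage 2: DPO on preference pairs (prompt / chosen / rejected columns).
dpo_data = load_dataset("json", data_files="data/stat_dpo.jsonl", split="train")
model = AutoModelForCausalLM.from_pretrained("ckpt/sft")
dpo_trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="ckpt/dpo", beta=0.1),
    train_dataset=dpo_data,
    processing_class=tokenizer,
)
dpo_trainer.train()
dpo_trainer.save_model("ckpt/dpo")
```

The final DTFT stage follows the same pattern (a further supervised pass on task-specific data); refer to `Flow3/` for the exact recipe.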
This project is organized into two primary categories: Part I: Data Engineering & Pre-processing and Part II: Model Training Experimental Flows. Each directory is a self-contained module with its own specific README.md for detailed instructions.
```
.
├── Part I: Data Engineering & Pre-processing
│   ├── ArXiv_multimodal_extractor/    # Crawls and parses arXiv papers with multimodal features (BLIP).
│   ├── S2ORC_section_parser/          # Memory-efficient streaming parser for the massive S2ORC dataset.
│   ├── pdf_content_extractor/         # General-purpose tool for PDF text extraction (Text & OCR).
│   ├── Synthetic_data_generator/      # Generates synthetic data (DPO, etc.) using the Gemini API.
│   └── Token/                         # Pre-tokenizes large corpora for efficient Continual Pre-training.
│
└── Part II: Model Training Experimental Flows
    ├── Flow1/                         # Hypothesis: Knowledge-First from Base Model
    │   ├── CoP.py
    │   ├── SFT.py
    │   └── DPO.py
    ├── Flow2/                         # Hypothesis: Bridging the Capability Gap
    │   ├── CoP.py
    │   ├── Instruct.py
    │   └── ...
    └── Flow3/                         # Hypothesis: Focused Specialization (The Successful Path)
        ├── SFT_v1/
        ├── SFT_v2/
        ├── SFT_v3/
        ├── SFT_v3_DPO/
        ├── SFT_v3.4_DPO_DTFT/
        └── ...
```
- Part I represents the collection of standalone scripts developed to collect, parse, synthesize, and prepare the data for training; a hedged example of the synthetic-data step is sketched after this list.
- Part II groups the training scripts according to the three core experimental strategies (Flows) that were systematically tested and evaluated. Each script corresponds to a specific stage within a flow.
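As an illustration of the Part I tooling, the sketch below shows one way DPO preference pairs could be generated with the Gemini API via the `google-generativeai` package. The prompt wording, model name, file name, and pairing logic are assumptions for illustration; `Synthetic_data_generator/` documents the actual procedure.

```python
# Hedged sketch: generating a chosen/rejected preference pair with Gemini.
# The prompt wording, model name, and output schema are illustrative only;
# see Synthetic_data_generator/README.md for the real pipeline.
import json
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")

question = "Explain when to prefer a Wilcoxon signed-rank test over a paired t-test."

# Ask for a careful answer (chosen) and a deliberately shallow one (rejected).
chosen = model.generate_content(
    f"Answer as a statistics expert, stating assumptions explicitly:\n{question}"
).text
rejected = model.generate_content(
    f"Answer in one vague sentence without justification:\n{question}"
).text

with open("dpo_pairs.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps({"prompt": question, "chosen": chosen, "rejected": rejected}) + "\n")
```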
Due to the large size of the training and evaluation datasets, they are not hosted directly in this GitHub repository. The complete, pre-processed datasets used for all experiments are available for download from Kaggle.
- Download Link: StatLLaMA - Thesis Datasets on Kaggle
The dataset on Kaggle is organized to correspond with the different stages of the training flows described in the paper.
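If you prefer to fetch the data programmatically, `kagglehub` can download a Kaggle dataset by its slug. The slug below is a hypothetical placeholder; substitute the actual identifier from the download link above.

```python
# Hedged sketch: downloading the datasets with kagglehub.
# "owner/statllama-thesis-datasets" is a HYPOTHETICAL slug; replace it with
# the real identifier from the Kaggle link above.
import kagglehub

path = kagglehub.dataset_download("owner/statllama-thesis-datasets")
print("Datasets downloaded to:", path)
```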
- Python 3.9+
- PyTorch 2.0+ with CUDA support.
- An NVIDIA GPU suitable for running 3B-parameter models (e.g., A100, H100); a quick sanity check for the PyTorch/GPU setup is sketched after this list.
- API keys for services like Google Gemini, if using the synthetic data generator.
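A quick way to confirm the PyTorch/CUDA prerequisites before launching any training run (output values shown in the comments are only what you should expect, not guaranteed):

```python
# Quick sanity check for the PyTorch / CUDA prerequisites.
import torch

print("PyTorch version:", torch.__version__)          # expect 2.0 or newer
print("CUDA available:", torch.cuda.is_available())   # expect True
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("VRAM (GiB):", torch.cuda.get_device_properties(0).total_memory / 2**30)
```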
- Clone the repository:
  ```bash
  git clone https://github.com/HuangDLab/StatLLaMA
  cd StatLLaMA
  pip install -e .
  ```
- Install dependencies:
  The `requirements.txt` file in the root directory contains all packages used for the model training stages (`Flow1`, `Flow2`, `Flow3`). For the individual data engineering tools, please refer to the `requirements.txt` file within each specific subdirectory.

  ```bash
  # For model training
  pip install -r requirements.txt

  # Example for a specific data engineering tool
  cd ArXiv_multimodal_extractor/
  pip install -r requirements.txt
  ```
- Data Preparation: Navigate to the desired data engineering directory (e.g., `ArXiv_multimodal_extractor/`, `S2ORC_section_parser/`) and follow the instructions in its specific `README.md` to generate your training datasets.
- Model Training: Navigate to the desired model training directory (`Flow1/`, `Flow2/`, or `Flow3/`) and follow the detailed instructions in its `README.md` to run the training scripts. To replicate the final StatLLaMA model, follow the "golden path" outlined in `Flow3/README.md`; a hedged inference sketch for the resulting checkpoint follows this list.
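Once a flow finishes, the resulting checkpoint can be loaded like any Hugging Face causal LM. The path below is a placeholder for wherever your Flow 3 run saved its final model; this is a generic inference sketch, not a script from the repository.

```python
# Hedged sketch: querying a trained StatLLaMA checkpoint with transformers.
# "ckpt/statllama-final" is a placeholder path for your own Flow 3 output.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "ckpt/statllama-final"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "When is a Poisson regression preferable to OLS?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```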
The primary outcome of this research is the StatLLaMA model, which achieved a balanced and significant performance uplift.
| Model | GSM8K (8-shot) | AP Statistics (0-shot) | ARC (0-shot) |
|---|---|---|---|
| LLaMA-3.2-3B-Instruct (Base) | 64.44 | 37.63 | 43.60 |
| StatLLaMA (Final) | 58.83 | 41.46 | 40.61 |
The final model improves markedly on the core domain-specific benchmark (AP Statistics) while retaining most of its general reasoning ability, trading modest drops on GSM8K and ARC for the domain gain; this validates the effectiveness of the Flow 3 pipeline.
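One common way to reproduce few-shot scores of this kind is EleutherAI's lm-evaluation-harness. The snippet below is an illustrative sketch only: the checkpoint path is a placeholder, and the harness, tasks, and settings are assumptions rather than the exact evaluation configuration used in the paper.

```python
# Hedged sketch: scoring a checkpoint on GSM8K (8-shot) with lm-evaluation-harness.
# The harness choice and the checkpoint path are assumptions for illustration.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=ckpt/statllama-final,dtype=bfloat16",
    tasks=["gsm8k"],
    num_fewshot=8,
    batch_size=8,
)
print(results["results"]["gsm8k"])
```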
If you find this research or any of the accompanying tools useful, please consider citing the paper:
```bibtex
@misc{Zengetal2026statllamamulti-stage,
  title={StatLLaMA: Multi-Stage training for domain-optimized statistical large language models},
  author={Jing-Yi Zeng and Guan-Hua Huang},
  year={2026},
  eprint={2601.09718},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2601.09718},
}
```

This research was partially supported by grants from the Ministry of Science and Technology, Taiwan (MOST 111-2118-M-A49-003-MY2) and the National Science and Technology Council, Taiwan (NSTC 113-2118-M-A49-006 and NSTC 114-2118-M-A49-002). We are grateful to the National Center for High-performance Computing, Taiwan, for computer time and facilities.