Official codebase for Hide to Guide: Learning via Semantic Masking.
Semantic Masked Expert Policy Optimization (SMEPO) is an expert-guided RLVR method that masks reward-relevant semantic spans in expert traces while preserving their procedural structure. This repository provides code for data preparation, semantic masking, and training with masked or full expert-trace guidance.
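To make the idea of masking spans while keeping procedural structure concrete, here is a minimal sketch. It is not the masking code in scripts/masking. The `mask_spans` helper, the `[MASK]` placeholder, and the choice of which spans count as reward-relevant are all assumptions made for illustration; here the spans are simply given as character offsets.

```python
from typing import List, Tuple

MASK_TOKEN = "[MASK]"  # hypothetical placeholder, not taken from the repository


def mask_spans(trace: str, spans: List[Tuple[int, int]], mask: str = MASK_TOKEN) -> str:
    """Replace each [start, end) character span with `mask`; keep all other text."""
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        if start < cursor or end > len(trace) or start >= end:
            raise ValueError(f"invalid or overlapping span: {(start, end)}")
        pieces.append(trace[cursor:start])  # untouched procedural text
        pieces.append(mask)                 # masked span
        cursor = end
    pieces.append(trace[cursor:])
    return "".join(pieces)


# Toy expert trace: the step structure stays, the intermediate values are hidden.
trace = "Compute 3 * 4 = 12.\nThen add 5 to get 17."
spans = [
    (trace.index("12"), trace.index("12") + 2),
    (trace.index("17"), trace.index("17") + 2),
]
masked = mask_spans(trace, spans)
```

In this toy case the line breaks and step wording survive, and only the two numeric results are replaced.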
The repository supports the three training settings used in the paper:
vanilla GRPO: standard RLVR training
full expert trace: RL conditioned on the full expert trace
SMEPO: RL conditioned on semantically masked expert traces
The prefix length is set directly through PREFIX_RATIO. The training code keeps the sentence/newline-aligned prefix truncation used in our experiments.
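The sketch below shows one way a ratio-based cut can be snapped to a sentence or newline boundary. It is an illustration under stated assumptions, not the truncation logic in the training code: the `aligned_prefix` name, measuring length in characters rather than tokens, snapping backward to the previous boundary, and the fallback when no boundary exists are all hypothetical.

```python
def aligned_prefix(trace: str, ratio: float) -> str:
    """Return about the first `ratio` of `trace`, cut at a sentence or line boundary."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    cut = int(len(trace) * ratio)  # raw cut position in characters (assumption)
    if cut >= len(trace):
        return trace
    head = trace[:cut]
    # Last newline or sentence end (punctuation followed by a space) before the cut.
    boundary = max(
        head.rfind("\n"),
        head.rfind(". "),
        head.rfind("? "),
        head.rfind("! "),
    )
    if boundary == -1:
        return head  # no boundary found: fall back to the raw cut (assumption)
    return head[: boundary + 1]


example = "Step one. Step two.\nStep three."
half = aligned_prefix(example, 0.5)
```

Snapping backward keeps the prefix from ending mid-sentence, at the cost of a prefix that is slightly shorter than the requested ratio.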
scripts/data: dataset construction scripts
scripts/masking: semantic masking components used by SMEPO
scripts/train: training entry points for all experiment settings
scripts/setup: environment setup scripts
verl: training stack used by SMEPO
The conda environment name is smepo.
conda env create -f environment.yml
conda activate smepo
You can also use the helper script:
bash scripts/setup/create_env.sh
The training environment includes flash-attn:
pip install flash-attn --no-build-isolation
SMEPO uses task-specific expert traces stored in teacher_ds.
Released raw datasets use the same schema in all domains:
question
reward_model
teacher_ds
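A quick way to sanity-check a row against this schema is sketched below. Only the three field names come from the released data. The `missing_fields` helper and the placeholder values in the sample record are hypothetical, and the real field types are not specified here.

```python
REQUIRED_FIELDS = ("question", "reward_model", "teacher_ds")


def missing_fields(record: dict) -> list:
    """Return the required field names that are absent from `record`."""
    return [name for name in REQUIRED_FIELDS if name not in record]


# Placeholder record: the values are illustrative only, not real dataset content.
sample = {
    "question": "<problem statement>",
    "reward_model": "<reward specification>",
    "teacher_ds": "<expert trace>",
}
problems = missing_fields(sample)
```

An empty list means the row carries all three fields; any returned names point to columns that need to be added before building the masked dataset.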
To construct the masked datasets from raw expert traces:
bash scripts/data/build_data.sh
The released raw datasets are hosted at mit-han-lab/SMEPO. After downloading math.parquet, the standard data build script can be used directly:
python scripts/data/download_from_hf.py \
--repo mit-han-lab/SMEPO \
--filename math.parquet \
--out-parquet data/raw/math_teacher.parquet
Then build the masked dataset:
bash scripts/data/build_data.sh
The repository includes training scripts for the main experiment settings in the paper. Below is an example reproduction flow for the math task with Qwen3-8B:
conda activate smepo
MODEL_PATH=/path/to/Qwen3-8B-Base \
bash scripts/train/math/qwen3_8b/smepo.sh
Related scripts for the same setting:
scripts/train/math/qwen3_8b/vanilla_grpo.sh
scripts/train/math/qwen3_8b/full_expert_trace.sh
scripts/train/math/qwen3_8b/smepo.sh
Additional scripts for other tasks and model settings are provided in scripts/train/.
SMEPO exposes the expert-prefix controls directly from the shell:
MODEL_PATH=/path/to/model \
PREFIX_RATIO=0.5 \
bash scripts/train/math/qwen3_8b/smepo.sh
Useful knobs:
PREFIX_RATIO: fraction of the expert trace used as the prefix
PREFIX_INTRO: prompt string inserted before the prefix
PREFIX_TAIL: prompt string inserted after the prefix
TRAINER_LOGGER: logging backend selection
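As a rough picture of how PREFIX_INTRO and PREFIX_TAIL could wrap the expert prefix, consider the sketch below. It is not the prompt construction used by the training scripts: the `build_guided_prompt` function, placing the question before the intro, and the empty-string defaults are assumptions for illustration.

```python
import os
from typing import Optional


def build_guided_prompt(
    question: str,
    prefix: str,
    intro: Optional[str] = None,
    tail: Optional[str] = None,
) -> str:
    """Concatenate question, intro string, expert prefix, and tail string."""
    # Fall back to the environment variables when no explicit value is passed.
    if intro is None:
        intro = os.environ.get("PREFIX_INTRO", "")  # empty default is an assumption
    if tail is None:
        tail = os.environ.get("PREFIX_TAIL", "")    # empty default is an assumption
    return f"{question}{intro}{prefix}{tail}"


prompt = build_guided_prompt(
    "What is 3 * 4 + 5?",
    "Compute 3 * 4 = [MASK].",
    intro="\nPartial expert trace:\n",
    tail="\nContinue from here.",
)
```

Passing `intro` and `tail` explicitly makes the sketch easy to test; in a shell-driven run the same values would come from the exported variables.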