SMEPO

Official codebase for Hide to Guide: Learning via Semantic Masking.

Semantic Masked Expert Policy Optimization (SMEPO) is an expert-guided RLVR method that masks reward-relevant semantic spans in expert traces while preserving their procedural structure. This repository provides code for data preparation, semantic masking, and training with masked or full expert-trace guidance.
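To make the masking idea concrete, here is a minimal sketch of replacing already-identified spans in an expert trace with a placeholder while leaving the surrounding steps untouched. It is an illustration only, not the code in scripts/masking: the function name mask_spans, the "[MASK]" placeholder, and the use of character offsets are all assumptions, and the sketch does not show how reward-relevant spans are found.

```python
from typing import List, Tuple

# Hypothetical placeholder; the repository may use a different mask string.
MASK_TOKEN = "[MASK]"


def mask_spans(trace: str, spans: List[Tuple[int, int]], mask: str = MASK_TOKEN) -> str:
    """Replace each half-open (start, end) character span in `trace` with `mask`.

    Overlapping or touching spans are merged first, so every masked region
    produces exactly one placeholder. Text outside the spans is kept verbatim,
    which preserves the step-by-step structure of the trace.
    """
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(spans):
        if start >= end:
            continue  # skip empty or inverted spans
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))

    pieces = []
    cursor = 0
    for start, end in merged:
        pieces.append(trace[cursor:start])  # unmasked text before the span
        pieces.append(mask)
        cursor = end
    pieces.append(trace[cursor:])  # unmasked tail
    return "".join(pieces)
```

In this sketch the intermediate results of a worked solution would be hidden, while the wording of each step remains visible to the policy.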

Overview

SMEPO supports three training settings used in the paper:

vanilla GRPO: standard RLVR training
full expert trace: RL conditioned on the full expert trace
SMEPO: RL conditioned on semantically masked expert traces

The prefix length is controlled directly through PREFIX_RATIO. The training code preserves the sentence- and newline-aligned prefix truncation used in our experiments.
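The following sketch shows one way boundary-aligned truncation can work: take the requested fraction of the trace, then cut back to the last sentence or newline boundary that fits. It is a simplified illustration under stated assumptions, not the training code itself. The name aligned_prefix is hypothetical, and the sketch measures length in characters, whereas the actual implementation may count tokens.

```python
import re

# A boundary is whitespace following ., ! or ?, or any newline.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+|\n")


def aligned_prefix(trace: str, prefix_ratio: float) -> str:
    """Return about the first `prefix_ratio` of `trace`, cut at a boundary.

    The cut never exceeds the target length; it falls back to the nearest
    preceding sentence or newline boundary (or the empty string if none fits).
    """
    if not 0.0 <= prefix_ratio <= 1.0:
        raise ValueError("prefix_ratio must be in [0, 1]")
    target = int(len(trace) * prefix_ratio)
    if target >= len(trace):
        return trace  # ratio of 1.0 keeps the full trace
    cut = 0
    for match in _BOUNDARY.finditer(trace):
        if match.end() > target:
            break
        cut = match.end()
    return trace[:cut]
```

With this approach the prefix never ends mid-sentence, so the model continues from a complete step rather than a fragment.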

Repository Structure

scripts/data: dataset construction scripts
scripts/masking: semantic masking components used by SMEPO
scripts/train: training entry points for all experiment settings
scripts/setup: environment setup scripts
verl: training stack used by SMEPO

Environment

The conda environment name is smepo.

conda env create -f environment.yml
conda activate smepo

You can also use the helper script:

bash scripts/setup/create_env.sh

The training environment also depends on flash-attn, which can be installed with:

pip install flash-attn --no-build-isolation

Data

SMEPO uses task-specific expert traces, which are stored in the teacher_ds field of each dataset.

Released raw datasets use the same schema in all domains:

question
reward_model
teacher_ds
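Before building masked data, it can help to confirm that a dataset exposes these three fields. The check below is a small sketch with a hypothetical helper name (missing_fields); it only looks at field names and makes no assumption about the type or contents of each field.

```python
from typing import Iterable, List

# Field names taken from the schema listed above.
REQUIRED_FIELDS = ("question", "reward_model", "teacher_ds")


def missing_fields(columns: Iterable[str]) -> List[str]:
    """Return the required field names that are absent from `columns`."""
    present = set(columns)
    return [name for name in REQUIRED_FIELDS if name not in present]
```

The function accepts any iterable of names, so it can be called with the keys of a single record or with the column names of a loaded table.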

To construct the masked datasets from raw expert traces:

bash scripts/data/build_data.sh

The released raw datasets are hosted on the Hugging Face Hub at mit-han-lab/SMEPO. Download math.parquet first; the standard data build script can then be used without modification:

python scripts/data/download_from_hf.py \
  --repo mit-han-lab/SMEPO \
  --filename math.parquet \
  --out-parquet data/raw/math_teacher.parquet

Then build the masked dataset:

bash scripts/data/build_data.sh

Training

The repository includes training scripts for the main experiment settings in the paper. Below is an example reproduction flow for the math task with Qwen3-8B:

conda activate smepo
MODEL_PATH=/path/to/Qwen3-8B-Base \
bash scripts/train/math/qwen3_8b/smepo.sh

Related scripts for the same setting:

scripts/train/math/qwen3_8b/vanilla_grpo.sh
scripts/train/math/qwen3_8b/full_expert_trace.sh
scripts/train/math/qwen3_8b/smepo.sh

Additional scripts for other tasks and model settings are provided in scripts/train/.

Prefix Control

SMEPO exposes the expert-prefix controls as environment variables that can be set directly from the shell:

MODEL_PATH=/path/to/model \
PREFIX_RATIO=0.5 \
bash scripts/train/math/qwen3_8b/smepo.sh

Useful knobs:

PREFIX_RATIO: fraction of the expert trace used as the prefix
PREFIX_INTRO: prompt string inserted before the prefix
PREFIX_TAIL: prompt string inserted after the prefix
TRAINER_LOGGER: logging backend selection
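As a rough illustration of how the prefix knobs could fit together, the sketch below reads them from the environment and wraps an expert prefix with the intro and tail strings. The helper names (read_prefix_config, wrap_prefix) and the default values are assumptions made for this example; the actual defaults and prompt layout are defined by the training scripts.

```python
import os
from typing import Mapping


def read_prefix_config(env: Mapping[str, str] = os.environ) -> dict:
    """Collect the prefix knobs from `env`; defaults here are hypothetical."""
    return {
        "ratio": float(env.get("PREFIX_RATIO", "1.0")),
        "intro": env.get("PREFIX_INTRO", ""),
        "tail": env.get("PREFIX_TAIL", ""),
    }


def wrap_prefix(prefix: str, intro: str, tail: str) -> str:
    """Place `intro` before and `tail` after the expert prefix."""
    return f"{intro}{prefix}{tail}"
```

Passing a plain dictionary instead of os.environ makes it easy to try different settings without exporting variables.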
