Clean-Offline-RLHF

Project Website · Paper · Platform · Datasets · Clean Offline RLHF

This is the official PyTorch implementation of the paper "Uni-RLHF: Universal Platform and Benchmark Suite for Reinforcement Learning with Diverse Human Feedback". Clean-Offline-RLHF is an offline Reinforcement Learning with Human Feedback codebase that provides implementations of offline RL algorithms trained from high-quality, realistic human feedback.



💡 News

  • [03-26-2024] 🔥 Updated Mini-Uni-RLHF, a minimal out-of-the-box annotation tool for researchers, powered by Streamlit.
  • [03-24-2024] Released the SMARTS environment training dataset, scripts, and labels. You can find them in the smarts branch.
  • [03-20-2024] Updated detailed setup bash files.
  • [02-22-2024] Initial code release.

🛠️ Getting Started

Clone this repository.

git clone https://github.com/pickxiguapi/Clean-Offline-RLHF.git
cd Clean-Offline-RLHF

Install PyTorch & torchvision.

pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu118

Install extra dependencies.

pip install -r requirements/requirements.txt

💻 Usage

Human Feedback

Before using the offline RLHF algorithms, you should annotate your dataset with human feedback. If you wish to collect labels for new tasks, we refer you to the platform for crowdsourced annotation. Here, we provide a crowdsourced annotation dataset of ~15M steps for the sample task (raw dataset).

The processed crowdsourced (CS) and scripted teacher (ST) labels are located in the crowdsource_human_labels and generated_fake_labels folders, respectively.

Note: for comparison and validation purposes, we provide a fast track for scripted teacher (ST) label generation in fast_track/generate_d4rl_fake_labels.py.
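For intuition, a scripted teacher typically prefers whichever of two trajectory segments has the higher ground-truth return. The snippet below is a minimal sketch of that idea; the function name and label convention are illustrative, not the repository's actual API.

import numpy as np

def scripted_teacher_label(rewards_a: np.ndarray, rewards_b: np.ndarray) -> int:
    # Compare ground-truth segment returns; 0 means segment A is preferred,
    # 1 means segment B is preferred (convention is illustrative only).
    return 0 if rewards_a.sum() >= rewards_b.sum() else 1

# Example: two 200-step reward segments sampled from a trajectory
rng = np.random.default_rng(0)
seg_a, seg_b = rng.normal(size=200), rng.normal(size=200) + 0.1
print(scripted_teacher_label(seg_a, seg_b))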

Pre-train Reward Models

Here we provide an example of the CS-MLP method for the walker2d-medium-expert-v2 task; you can customize it in the configuration file rlhf/cfgs/default.yaml.

cd rlhf
python train_reward_model.py domain=mujoco env=walker2d-medium-expert-v2 \
modality=state structure=mlp fake_label=false ensemble_size=3 n_epochs=50 \
num_query=2000 len_query=200 data_dir="../crowdsource_human_labels" \
seed=0 exp_name="CS-MLP"
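As a rough guide, the overrides above correspond to configuration entries along the lines of the sketch below. The field comments are inferred from the command line and are assumptions; the actual rlhf/cfgs/default.yaml may contain additional options.

# Hypothetical sketch of rlhf/cfgs/default.yaml, inferred from the CLI overrides above
domain: mujoco                          # task domain (e.g., mujoco, antmaze, adroit)
env: walker2d-medium-expert-v2          # D4RL task name
modality: state                         # observation modality
structure: mlp                          # reward model architecture
fake_label: false                       # true = scripted teacher (ST), false = crowdsourced (CS)
ensemble_size: 3                        # number of reward models in the ensemble
n_epochs: 50                            # training epochs
num_query: 2000                         # number of preference queries
len_query: 200                          # segment length per query
data_dir: "../crowdsource_human_labels" # label directory
seed: 0
exp_name: "CS-MLP"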

For reward model training on more environments, we provide bash files:

cd rlhf
bash scripts/train_mujoco.sh
bash scripts/train_antmze.sh
bash scripts/train_adroit.sh

Train Offline RL with Pre-trained Rewards

Following the Uni-RLHF codebase implementation, we modified the IQL, CQL, and TD3BC algorithms.

Example: train IQL with the CS-MLP reward model. The log will be uploaded to wandb.

python algorithms/offline/iql_p.py --device "cuda:0" --seed 0 \
--reward_model_path "path/to/reward_model" --config_path ./configs/offline/iql/walker/medium_expert_v2.yaml \
--reward_model_type mlp --name CS-MLP-IQL-Walker-medium-expert-v2
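Conceptually, these modified algorithms replace the dataset's ground-truth rewards with predictions from the pre-trained reward model before running the standard offline RL updates. The sketch below illustrates that relabeling step; the RewardMLP class, checkpoint format, and relabel_rewards helper are assumptions for illustration, not the repository's exact API.

import torch
import torch.nn as nn

class RewardMLP(nn.Module):
    # Hypothetical MLP reward model: r_hat(s, a) -> scalar
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

@torch.no_grad()
def relabel_rewards(dataset: dict, model: RewardMLP, device: str = "cpu") -> dict:
    # Replace dataset["rewards"] with reward-model predictions (illustrative only)
    obs = torch.as_tensor(dataset["observations"], dtype=torch.float32, device=device)
    act = torch.as_tensor(dataset["actions"], dtype=torch.float32, device=device)
    relabeled = dict(dataset)
    relabeled["rewards"] = model.to(device)(obs, act).cpu().numpy()
    return relabeled

# Usage sketch with a D4RL-style dataset and a saved reward model checkpoint:
# dataset = env.get_dataset()
# model = RewardMLP(obs_dim, act_dim)
# model.load_state_dict(torch.load("path/to/reward_model"))
# dataset = relabel_rewards(dataset, model)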

You can use any combination of algorithm, label type, and reward model type:

Algorithm   Label Type   Reward Model Type
IQL         CS           MLP
CQL         ST           TFM
TD3BC                    CNN

For policy training on more environments, we provide bash files:

bash scripts/run_mujoco.sh
bash scripts/run_antmze.sh
bash scripts/run_adroit.sh

🏷️ License

Distributed under the MIT License. See LICENSE.txt for more information.

✉️ Contact

For any questions, please feel free to email yuanyf@tju.edu.cn.

📝 Citation

If you find our work useful, please consider citing:

@inproceedings{anonymous2023unirlhf,
    title={Uni-{RLHF}: Universal Platform and Benchmark Suite for Reinforcement Learning with Diverse Human Feedback},
    author={Yuan, Yifu and Hao, Jianye and Ma, Yi and Dong, Zibin and Liang, Hebin and Liu, Jinyi and Feng, Zhixin and Zhao, Kai and Zheng, Yan},
    booktitle={The Twelfth International Conference on Learning Representations, ICLR},
    year={2024},
    url={https://openreview.net/forum?id=WesY0H9ghM},
}
