
Beyond-Demonstration

CSE-598 Perception in Robotics project, ASU

Implementation of the T-REX and D-REX IRL algorithms, which learn a reward function capturing the expert's intention from demonstrations. The learned reward can then be used to train a policy that performs better than the demonstrations. Based on the stable_baselines3 and imitation packages.
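
As a rough illustration of the second step, the sketch below wraps an environment so that RL training (e.g. with stable_baselines3) optimizes the learned reward instead of the ground-truth one. The `LearnedRewardWrapper` name and the assumption that `reward_net` maps an observation tensor to a scalar reward are illustrative, not the repository's exact interface.

```python
# Minimal sketch (not the repository's exact interface): wrap an environment so
# that RL training optimizes the learned reward instead of the ground-truth one.
# Assumes the classic gym step API (obs, reward, done, info).
import gym
import torch


class LearnedRewardWrapper(gym.Wrapper):
    def __init__(self, env, reward_net):
        super().__init__(env)
        self.reward_net = reward_net

    def step(self, action):
        obs, _, done, info = self.env.step(action)  # drop the true reward
        with torch.no_grad():
            reward = self.reward_net(
                torch.as_tensor(obs, dtype=torch.float32)).item()
        return obs, reward, done, info


# Example usage with stable_baselines3:
# from stable_baselines3 import SAC
# env = LearnedRewardWrapper(gym.make("Hopper-v3"), reward_net)
# SAC("MlpPolicy", env).learn(total_timesteps=1_000_000)
```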

Given a dataset of ranked sub-optimal demonstrations, a state-dependent reward function can be recovered by training a neural network to prefer better-ranked demonstrations over worse ones. The Bradley-Terry and Luce-Shepard models of choice are used to train such reward models from preferences. Ranked trajectories can be generated by injecting different levels of noise into a behavioral cloning (BC) policy trained on the demonstrations.
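
A minimal PyTorch sketch of that preference objective is shown below: the predicted returns of a worse and a better trajectory are treated as logits, and the Bradley-Terry likelihood of the better one being preferred is maximized. The `reward_net` here is a generic per-state reward model; the function is illustrative and omits the discounting, noise probability, and clipping mentioned in the next section.

```python
# Minimal sketch of the Bradley-Terry / Luce-Shepard preference loss.
# `reward_net` maps a (T, obs_dim) tensor of states to per-state rewards;
# traj_better is assumed to be ranked above traj_worse.
import torch
import torch.nn.functional as F


def preference_loss(reward_net, traj_worse, traj_better):
    return_worse = reward_net(traj_worse).sum()    # predicted return of worse rollout
    return_better = reward_net(traj_better).sum()  # predicted return of better rollout
    logits = torch.stack([return_worse, return_better]).unsqueeze(0)
    # Label 1 = "the second (better) trajectory is preferred".
    return F.cross_entropy(logits, torch.tensor([1]))
```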

Example of 3 ranked trajectories generated by noise injection: a high-noise rollout (least preferred), a mid-noise rollout (more preferred), and a noise-free rollout (most preferred).
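
A rough sketch of the noise-injection procedure is shown below, assuming `bc_policy` follows the stable_baselines3 `predict` convention and the environment uses the classic gym step API; both names are illustrative. With probability `noise`, the BC action is replaced by a random one, so higher noise levels yield lower-ranked trajectories.

```python
# Minimal sketch: with probability `noise`, replace the BC policy's action with a
# random one, so higher noise yields worse (lower-ranked) trajectories.
import numpy as np


def rollout_with_noise(env, bc_policy, noise, horizon=1000):
    obs, states = env.reset(), []
    for _ in range(horizon):
        if np.random.rand() < noise:
            action = env.action_space.sample()               # random (noisy) action
        else:
            action, _ = bc_policy.predict(obs, deterministic=True)
        states.append(obs)
        obs, _, done, _ = env.step(action)
        if done:
            obs = env.reset()                                # fixed-horizon rollouts
    return np.array(states)


# Trajectories ranked by noise level: more noise => less preferred.
# ranked = [rollout_with_noise(env, bc_policy, eps) for eps in (0.5, 0.2, 0.0)]
```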

Changes to Baseline

We implement the IRL algorithm using the tools available in the imitation library. Notable changes from the paper's implementation are:

  • Luce preference with discount_factor, noise_prob, clipped reward differences (ideas from DRLHP): Comes with imitation library
  • Mixed sampling: New preference dataset generated every epoch
  • Fixed horizon rollouts for ranked_trajectories: Horizon length is 1000 steps
  • Input normalization in reward network (similar to batch normalization): Comes with imitation library
  • Single reward function: No ensemble
  • Reward scaling with tanh, optimized with AdamW: Scaled reward improves stability (see the sketch after this list)
  • Entropy regularized actor critic policy for BC: Comes with imitation library
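
The tanh scaling and AdamW choice from the list above might look like the sketch below; the network sizes and learning rate are illustrative assumptions, not the repository's exact configuration.

```python
# Minimal sketch of a tanh-scaled reward network trained with AdamW.
import torch
import torch.nn as nn


class ScaledRewardNet(nn.Module):
    def __init__(self, obs_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs):
        # tanh keeps predicted rewards in [-1, 1], which stabilizes training.
        return torch.tanh(self.net(obs)).squeeze(-1)


# reward_net = ScaledRewardNet(obs_dim=11)  # e.g. Hopper observation size
# optimizer = torch.optim.AdamW(reward_net.parameters(), lr=3e-4)
```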

Other possible improvements

  • Use custom reward architectures (RNN, attention)
  • A better preference loss (aLRP)

Results

Better-than-demonstrator performance was observed for the HalfCheetah-v3 environment. For Hopper-v3, the learned policy performed on par with the demonstrations.

Rollout comparison: demonstration vs. learned policy.

Ground-truth vs. predicted reward correlation (unscaled) for Hopper and HalfCheetah.
