Benchmark for "Offline Policy Comparison with Confidence"



Offline Policy Comparison with Confidence (opcc)

opcc is a benchmark of "policy comparison queries" (PCQ) for evaluating uncertainty estimation in offline reinforcement learning.

Research Paper:




1. Set up MuJoCo

  • Download mujoco210 and unzip it in ~/.mujoco
  • Add the following to .bashrc/.zshrc and source it.
    export MUJOCO_PY_MUJOCO_PATH=$HOME/.mujoco/mujoco210/
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/.mujoco/mujoco210/bin

    (You can also refer here for step-by-step instructions on mujoco installation)

2. Set up Python 3.7+ and optionally (recommended) create a virtualenv (refer here)

3. The Python package and its dependencies can be installed using:

pip3 install --upgrade 'pip<=23.0.1'
pip3 install --upgrade 'setuptools<=66'
pip3 install --upgrade 'wheel<=0.38.4'
pip3 install git+

4. Install PyTorch [>= 1.8.0]
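
A quick sanity check of the setup steps above can save debugging time later. The sketch below is not part of opcc: `check_mujoco_env` is a hypothetical helper that only inspects the two environment variables exported in step 1, and it assumes the `mujoco210` directory layout shown there.

```python
import os


def check_mujoco_env(environ=None):
    """Return a list of human-readable problems with the MuJoCo variables.

    Hypothetical helper (not part of opcc): it only inspects the two
    variables exported in step 1 and does not try to import mujoco_py.
    """
    environ = os.environ if environ is None else environ
    problems = []

    mujoco_path = environ.get("MUJOCO_PY_MUJOCO_PATH", "")
    if not mujoco_path:
        problems.append("MUJOCO_PY_MUJOCO_PATH is not set")
    elif not os.path.isdir(os.path.expanduser(mujoco_path)):
        problems.append("MUJOCO_PY_MUJOCO_PATH does not point to a directory")

    # LD_LIBRARY_PATH is a colon-separated list; look for the mujoco210/bin entry.
    ld_entries = [p for p in environ.get("LD_LIBRARY_PATH", "").split(":") if p]
    if not any(p.rstrip("/").endswith("mujoco210/bin") for p in ld_entries):
        problems.append("LD_LIBRARY_PATH has no mujoco210/bin entry")

    return problems


if __name__ == "__main__":
    for problem in check_mujoco_env():
        print("setup issue:", problem)
```

An empty list means both variables look plausible; it does not guarantee that mujoco_py itself will build or import.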



import opcc
import numpy as np
from sklearn import metrics

env_name = 'HalfCheetah-v2'
dataset_name = 'random'

# ########################################################
# Policy Comparison Queries (PCQ) (Section : 3.1 in paper)
# ########################################################
# Queries are dictionaries with policies as keys and
# corresponding queries as values.
queries = opcc.get_queries(env_name)

def random_predictor(obs_a, obs_b, action_a, action_b,
                     policy_a, policy_b, horizon):
    answer = np.random.randint(low=0, high=2, size=len(obs_a)).tolist()
    confidence = np.random.rand(len(obs_a)).tolist()
    return answer, confidence

targets = []
predictions = []
confidences = []
# Batch iteration through Queries :
for (policy_a_id, policy_b_id), query_batch in queries.items():
    # retrieve policies
    policy_a, _ = opcc.get_policy(*policy_a_id)
    policy_b, _ = opcc.get_policy(*policy_b_id)

    # query-a
    obs_a = query_batch['obs_a']
    action_a = query_batch['action_a']

    # query-b
    obs_b = query_batch['obs_b']
    action_b = query_batch['action_b']

    # horizon for policy evaluation
    horizon = query_batch['horizon']

    # ground truth binary vector:
    # (Q(obs_a, action_a, policy_a, horizon)
    # <  Q(obs_b, action_b, policy_b, horizon))
    target = query_batch['target'].tolist()
    targets += target

    # Let's make predictions for the given queries.
    # One can use any mechanism to predict the corresponding
    # answer to queries, and we simply use a random predictor
    # over here for demonstration purposes
    p, c = random_predictor(obs_a, obs_b, action_a, action_b,
                            policy_a, policy_b, horizon)
    predictions += p
    confidences += c
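
The random predictor above only demonstrates the expected interface: one binary answer and one confidence per query. As a rough sketch of a predictor whose confidence carries information, the example below assumes you already have an ensemble of value estimators and uses their agreement as the confidence. `ensemble_predictor` and its inputs are hypothetical and not part of opcc; how the per-member estimates are obtained is left open.

```python
import numpy as np


def ensemble_predictor(q_a_samples, q_b_samples):
    """Answer "is Q_a < Q_b?" per query from ensemble value estimates.

    q_a_samples, q_b_samples: arrays of shape (n_members, n_queries) holding
    hypothetical value estimates, one row per ensemble member.
    Returns (answers, confidences) as lists, matching random_predictor.
    """
    q_a = np.asarray(q_a_samples, dtype=float)
    q_b = np.asarray(q_b_samples, dtype=float)

    # Each ensemble member casts one vote per query.
    votes = q_a < q_b
    frac_true = votes.mean(axis=0)

    # Majority vote gives the answer (ties resolve to 1) ...
    answers = (frac_true >= 0.5).astype(int)
    # ... and the share of members agreeing with it gives the confidence,
    # which therefore lies in [0.5, 1].
    confidences = np.where(answers == 1, frac_true, 1.0 - frac_true)
    return answers.tolist(), confidences.tolist()
```

The returned lists can be appended to `predictions` and `confidences` in the same way as the output of `random_predictor`.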

Evaluation Metrics:

# #########################################
# (Section 3.3 in paper)
# #########################################
loss = np.logical_xor(predictions, targets)  # we use 0-1 loss for demo
confidences = np.array(confidences)  # array form is needed for thresholding
tau_interval = 0.01  # step between confidence thresholds (assumed value)

# List of tuples (coverage, selective_risks, tau)
coverage_sr_tau = []
for tau in np.arange(0, 1 + 2 * tau_interval, tau_interval):
  non_abstain_filter = confidences >= tau
  if any(non_abstain_filter):
    selective_risk = np.sum(loss[non_abstain_filter])
    selective_risk /= np.sum(non_abstain_filter)
    coverage = np.mean(non_abstain_filter)
    coverage_sr_tau.append((coverage, selective_risk, tau))
  else:
    # 0 risk for 0 coverage
    coverage_sr_tau.append((0, 0, tau))

coverages, selective_risks, taus = list(zip(*sorted(coverage_sr_tau)))
assert selective_risks[0] == 0 and coverages[0] == 0, "no coverage not found"
assert coverages[-1] == 1, 'complete coverage not found'

# AURCC ( Area Under Risk-Coverage Curve): Ideally, we would like it to be 0
aurcc = metrics.auc(x=coverages,y=selective_risks)

# Reverse-pair-proportion
rpp = np.logical_and(np.expand_dims(loss, 1)
                     < np.expand_dims(loss, 1).transpose(),
                     np.expand_dims(confidences, 1)
                     < np.expand_dims(confidences, 1).transpose()).mean()

# Coverage Resolution (cr_k) : Ideally, we would like it to be 1
k = 10
bins = [_ for _ in np.arange(0, 1, 1 / k)]
cr_k = np.unique(np.digitize(coverages, bins)).size / len(bins)

print("aurcc: {}, rpp: {}, cr_{}:{}".format(aurcc, rpp, k, cr_k))
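
To make the metrics above concrete, here is a small worked example on made-up data (4 queries, 2 of them answered wrongly). It is a sketch rather than the benchmark's reference implementation: the helper names are hypothetical, and the area is computed with a plain trapezoidal rule in NumPy instead of `sklearn.metrics.auc`.

```python
import numpy as np


def risk_coverage(loss, confidences, taus):
    """Return sorted (coverage, selective_risk) pairs, one per threshold."""
    loss = np.asarray(loss, dtype=float)
    confidences = np.asarray(confidences, dtype=float)
    points = []
    for tau in taus:
        keep = confidences >= tau          # queries we do not abstain on
        if keep.any():
            points.append((keep.mean(), loss[keep].mean()))
        else:
            points.append((0.0, 0.0))      # 0 risk at 0 coverage
    return sorted(points)


def area_under_curve(points):
    """Trapezoidal area under a sorted list of (x, y) points."""
    x, y = np.array(points).T
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0))


def reverse_pair_proportion(loss, confidences):
    """Share of ordered pairs (i, j) with lower loss but lower confidence."""
    loss = np.asarray(loss, dtype=float)[:, None]
    conf = np.asarray(confidences, dtype=float)[:, None]
    return float(np.logical_and(loss < loss.T, conf < conf.T).mean())


# Toy data, made up for illustration only.
loss = [0, 1, 0, 1]
confidences = [0.9, 0.8, 0.4, 0.1]
taus = [0.0, 0.3, 0.5, 0.85, 1.0]

curve = risk_coverage(loss, confidences, taus)
print(curve)
print(area_under_curve(curve), reverse_pair_proportion(loss, confidences))
```

In this toy case exactly one ordered pair is reversed: the third query is correct but has lower confidence than the second, which is wrong. A predictor whose confidence ranked all correct answers above all wrong ones would have a reverse-pair proportion of 0.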


# ###########################################
# Datasets: (Section 4 in paper - step (1) )
# ###########################################

import opcc

env_name = 'HalfCheetah-v2'

# list all dataset names corresponding to an env
dataset_names = opcc.get_dataset_names(env_name)

dataset_name = 'random'
# This is a very-slim wrapper over D4RL datasets.
dataset = opcc.get_qlearning_dataset(env_name, dataset_name)
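
A typical next step is drawing minibatches of transitions for training. The sketch below uses a synthetic stand-in with the key names of D4RL's `qlearning_dataset` (`observations`, `actions`, `next_observations`, `rewards`, `terminals`); that opcc returns exactly these keys is an assumption, and `sample_batch` is a hypothetical helper.

```python
import numpy as np


def sample_batch(dataset, batch_size, rng):
    """Draw a random minibatch of transitions from a dict of aligned arrays."""
    n = len(dataset["observations"])
    idx = rng.integers(0, n, size=batch_size)
    return {key: np.asarray(value)[idx] for key, value in dataset.items()}


# Synthetic stand-in with the key names used by D4RL's qlearning_dataset;
# whether opcc returns exactly these keys is an assumption.
rng = np.random.default_rng(0)
toy_dataset = {
    "observations": rng.normal(size=(100, 17)),
    "actions": rng.normal(size=(100, 6)),
    "next_observations": rng.normal(size=(100, 17)),
    "rewards": rng.normal(size=100),
    "terminals": np.zeros(100, dtype=bool),
}
batch = sample_batch(toy_dataset, batch_size=32, rng=rng)
print({key: value.shape for key, value in batch.items()})
```

With a real dataset, replace `toy_dataset` with the dictionary returned above and check its keys first.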

Policy Usage:

import opcc, gym, torch

env_name = "HalfCheetah-v2"
model, model_info = opcc.get_policy(env_name, pre_trained=1)

done = False
env = gym.make(env_name)

obs = env.reset()
while not done:
    action = model(torch.tensor(obs).unsqueeze(0))
    action = action.detach().cpu().numpy()[0].astype('float32')
    obs, reward, done, step_info = env.step(action)
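
The query targets compare values of the form Q(obs, action, policy, horizon). As an illustration of what such a quantity means, the sketch below sums rewards over a fixed horizon, taking the given action first and following the policy afterwards. It assumes an undiscounted return and a single rollout, and `step_fn` and `policy_fn` are hypothetical stand-ins for an environment and a policy; the benchmark's actual estimation procedure may differ.

```python
def truncated_return(step_fn, policy_fn, obs, first_action, horizon):
    """Sum of rewards over `horizon` steps, starting with `first_action`.

    step_fn(obs, action) -> (next_obs, reward, done) and policy_fn(obs) ->
    action are hypothetical callables standing in for an environment that can
    be started from `obs` and for a policy such as the one loaded above.
    """
    total, action = 0.0, first_action
    for _ in range(horizon):
        obs, reward, done = step_fn(obs, action)
        total += reward
        if done:
            break
        action = policy_fn(obs)  # after the first step, follow the policy
    return total


# Toy 1-D example: the state is a number, the action is added to it,
# and the reward is the negative distance to the origin.
def toy_step(obs, action):
    next_obs = obs + action
    return next_obs, -abs(next_obs), False


def toy_policy(obs):
    return -0.5 * obs  # move halfway towards the origin


q_a = truncated_return(toy_step, toy_policy, obs=4.0, first_action=-4.0, horizon=3)
q_b = truncated_return(toy_step, toy_policy, obs=4.0, first_action=4.0, horizon=3)
print(q_a, q_b, q_a < q_b)
```

Here the first action of query A jumps straight to the origin while that of query B moves away from it, so the comparison `q_a < q_b` is false.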

Benchmark Information

  • We borrow datasets from D4RL
  • Queries can be visualized HERE
  • Baselines can be found HERE

🔅 d4rl:maze2d



| Environment Name | Datasets | Query-Count |
|---|---|---|
| d4rl:maze2d-open-v0 | 1k, 10k, 100k, 1m | 1500 |
| d4rl:maze2d-medium-v1 | 1k, 10k, 100k, 1m | 1500 |
| d4rl:maze2d-umaze-v1 | 1k, 10k, 100k, 1m | 1500 |
| d4rl:maze2d-large-v1 | 1k, 10k, 100k, 1m | 121* |

Pre-trained policy performance:

| Environment Name | pre_trained=1 (best) | pre_trained=2 | pre_trained=3 | pre_trained=4 (worst) |
|---|---|---|---|---|
| d4rl:maze2d-open-v0 | 122.2±10.61 | 104.9±22.19 | 18.05±14.85 | 4.85±8.62 |
| d4rl:maze2d-medium-v1 | 245.55±272.75 | 203.75±252.61 | 256.65±260.16 | 258.55±262.81 |
| d4rl:maze2d-umaze-v1 | 235.5±35.45 | 197.75±58.21 | 23.4±73.24 | 3.2±9.65 |
| d4rl:maze2d-large-v1 | 231.35±268.37 | 160.8±201.97 | 50.65±76.94 | 9.95±9.95 |

🔅 mujoco(gym)



| Environment Name | Datasets | Query-Count |
|---|---|---|
| HalfCheetah-v2 | random, expert, medium, medium-replay, medium-expert | 1500 |
| Hopper-v2 | random, expert, medium, medium-replay, medium-expert | 1500 |
| Walker2d-v2 | random, expert, medium, medium-replay, medium-expert | 1500 |

Pre-trained policy performance:

| Environment Name | pre_trained=1 (best) | pre_trained=2 | pre_trained=3 | pre_trained=4 (worst) |
|---|---|---|---|---|
| HalfCheetah-v2 | 1169.13±80.45 | 1044.39±112.61 | 785.88±303.59 | 94.79±40.88 |
| Hopper-v2 | 1995.84±794.71 | 1466.71±497.1 | 1832.43±560.86 | 236.51±1.09 |
| Walker2d-v2 | 2506.9±689.45 | 811.28±321.66 | 387.01±42.82 | 162.7±102.14 |

Testing Package:

  • Install : pip install -e ".[test]"
  • Run: pytest -v
  • Testing is computationally expensive because we validate ground-truth value estimates and the corresponding labels. These checks can be disabled by setting the following flags:
    export SKIP_QUERY_TARGET_TESTS=1 # disable target estimation and label validation  
    export SKIP_Q_LEARNING_DATASET_TEST=1  # disable test for checking dataset existence
    export SKIP_SEQUENCE_DATASET_TEST=1 # disables test for checking sequence dataset
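
Before a long test run it can help to see which of these flags are active in the current shell. The sketch below is a hypothetical helper, not code from the repository, and it assumes that any non-empty value other than "0" counts as "on"; the repository's own tests may interpret the flags differently.

```python
import os


def flag_is_set(name, environ=None):
    """True when an environment flag such as SKIP_QUERY_TARGET_TESTS is on.

    Assumption: any non-empty value other than "0" counts as "on"; the
    repository's own tests may interpret the flags differently.
    """
    environ = os.environ if environ is None else environ
    return environ.get(name, "0") not in ("", "0")


SKIP_FLAGS = (
    "SKIP_QUERY_TARGET_TESTS",
    "SKIP_Q_LEARNING_DATASET_TEST",
    "SKIP_SEQUENCE_DATASET_TEST",
)

for flag in SKIP_FLAGS:
    print(flag, "->", "skipped" if flag_is_set(flag) else "will run")
```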


  • Install : pip install -e ".[all]"
  • Generate-Queries:
    # Mujoco (Gym) Environment
    python scripts/ --env-name HalfCheetah-v2 --horizons 10 20 30 40 50 --policy-ids 1 2 3 4 --noise 0.1 --eval-runs 10 --ignore-delta-per-horizons 10 10 10 10 10 --max-trans-count 2000 --ignore-stuck-count 1000 --save-prob 0.6 --per-policy-comb-query 250 --use-wandb
    python scripts/ --env-name Hopper-v2 --horizons 10 20 30 40 50 --policy-ids 1 2 3 4 --noise 0.1 --eval-runs 10 --ignore-delta-per-horizons 10 10 10 10 10 --max-trans-count 2000 --ignore-stuck-count 1000 --save-prob 0.6 --per-policy-comb-query 250 --use-wandb
    python scripts/ --env-name Walker2d-v2 --horizons 10 20 30 40 50 --policy-ids 1 2 3 4 --noise 0.1 --eval-runs 10 --ignore-delta-per-horizons 10 10 10 10 10 --max-trans-count 2000 --ignore-stuck-count 1000 --save-prob 0.6 --per-policy-comb-query 250 --use-wandb
    # Maze Environment
    python scripts/ --env-name d4rl:maze2d-large-v1 --horizons 10 20 30 40 50 --policy-ids 1 2 3 4 --noise 0.2 --eval-runs 10 --ignore-delta-per-horizons 10 10 10 10 10 --max-trans-count 2000 --ignore-stuck-count 1000 --save-prob 0.6 --per-policy-comb-query 250 --use-wandb
    python scripts/ --env-name d4rl:maze2d-umaze-v1 --horizons 10 20 30 40 50 --policy-ids 1 2 3 4 --noise 0.2 --eval-runs 10 --ignore-delta-per-horizons 10 10 10 10 10 --max-trans-count 2000 --ignore-stuck-count 1000 --save-prob 0.6 --per-policy-comb-query 250 --use-wandb
    python scripts/ --env-name d4rl:maze2d-medium-v1 --horizons 10 20 30 40 50 --policy-ids 1 2 3 4 --noise 0.2 --eval-runs 10 --ignore-delta-per-horizons 10 10 10 10 10 --max-trans-count 2000 --ignore-stuck-count 1000 --save-prob 0.6 --per-policy-comb-query 250 --use-wandb
    python scripts/ --env-name d4rl:maze2d-open-v0 --horizons 10 20 30 40 50 --policy-ids 1 2 3 4 --noise 0.5 --eval-runs 10 --ignore-delta-per-horizons 10 10 10 10 10 --max-trans-count 2000 --ignore-stuck-count 1000 --save-prob 0.6 --per-policy-comb-query 250 --use-wandb
  • Generate policy performance stats for readme:
    python scripts/ --all-envs


If you have any questions or suggestions, please open an issue on this GitHub repository.