# Benchmark fully off-policy DQN on CartPole

We benchmark with three different types of datasets, generated by
* a __partially-trained__, medium-return policy,
* a __random__, low-return policy, and
* an __expert__, high-return policy.

In [1]:
import numpy as np
from stable_baselines import DQN

## Explore Dataset
Load the `.npz` file and store its arrays in variables.

In [2]:
data = np.load('cartpole_data.npz')
obs = data['obs'] # shape = (878987, 4)
actions = data['actions'] # shape = (878987, 1)
rewards = data['rewards'] # shape = (878987, )
episode_starts = data['episode_starts']
obs

array([[-0.03360261,  0.04453236, -0.03984775,  0.01214404],
       [-0.03271196, -0.14999613, -0.03960487,  0.29199302],
       [-0.03571189, -0.34453166, -0.03376501,  0.5719267 ],
       ...,
       [ 1.8527644 ,  2.7364528 ,  0.20202377, -0.023985  ],
       [ 1.9074935 ,  2.5390933 ,  0.20154408,  0.32502544],
       [ 1.9582753 ,  2.730861  ,  0.20804459,  0.10205026]],
      dtype=float32)
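The `episode_starts` array marks the first step of every episode, so per-episode returns can be recovered from `rewards` alone. The following is a minimal sketch on toy data, not on the real file. The helper name `episode_returns` is hypothetical, and it assumes that the first entry of `episode_starts` is `True`.

```python
import numpy as np

def episode_returns(rewards, episode_starts):
    """Sum rewards per episode.

    Assumes episode_starts[i] is True on the first step of each episode
    and that the very first entry is True.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    starts = np.flatnonzero(np.asarray(episode_starts, dtype=bool))
    # np.add.reduceat sums rewards[starts[k]:starts[k + 1]] for every k,
    # and rewards[starts[-1]:] for the last episode
    return np.add.reduceat(rewards, starts)

# Toy data: three episodes of length 3, 2 and 4 with a reward of 1 per step
toy_rewards = np.ones(9)
toy_starts = np.array([1, 0, 0, 1, 0, 1, 0, 0, 0], dtype=bool)

returns = episode_returns(toy_rewards, toy_starts)
print(returns, returns.mean())
```

Because CartPole pays a reward of 1 per step, the return of an episode equals its length, so the mean return is the number of transitions divided by the number of episodes.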

## Training DQN with fully off-policy data

## Stable-Baselines
[Stable Baselines](https://stable-baselines.readthedocs.io/en/master/index.html) is a set of improved implementations of reinforcement learning algorithms based on OpenAI [Baselines](https://github.com/openai/baselines/).
### Pre-Training (Behavior Cloning)
With the `.pretrain()` method, you can pre-train RL policies using trajectories from an expert, and therefore accelerate training. [Learn more...](https://stable-baselines.readthedocs.io/en/master/guide/pretrain.html)
* Stable-Baselines `.pretrain()` pretrains a model using behavior cloning: supervised learning given an expert dataset.
* The documentation says: _"for a given observation, the action taken by the policy must be the one taken by the expert"_
* Since Q-learning has no explicit policy, this is treated here as equivalent to filling a replay buffer with the off-policy data and not interacting with the environment.
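The replay-buffer view from the last bullet can be sketched with plain NumPy. This is an illustration under stated assumptions, not the code that `.pretrain()` runs: the helper names `make_transitions` and `q_targets` are hypothetical, the arrays are assumed to follow the layout of the `.npz` file above, and every episode end is treated as terminal (which would be inaccurate for episodes cut off by a time limit).

```python
import numpy as np

def make_transitions(obs, actions, rewards, episode_starts):
    """Turn step-wise arrays into (s, a, r, s', done) transitions."""
    obs = np.asarray(obs)
    starts = np.asarray(episode_starts, dtype=bool)
    # A step is terminal when the following step starts a new episode
    dones = np.roll(starts, -1)
    dones[-1] = True  # the last recorded step closes the dataset
    # next_obs is only meaningful where dones is False
    next_obs = np.roll(obs, -1, axis=0)
    actions = np.asarray(actions).reshape(-1)
    return obs, actions, np.asarray(rewards, dtype=np.float64), next_obs, dones

def q_targets(rewards, next_q_values, dones, gamma=0.99):
    """One-step Q-learning target: r + gamma * max_a' Q(s', a') if not done."""
    not_done = 1.0 - dones.astype(np.float64)
    return rewards + gamma * not_done * next_q_values.max(axis=1)

# Toy data: two episodes of length 3 and 2
toy_obs = np.arange(10, dtype=np.float32).reshape(5, 2)
toy_actions = np.array([[0], [1], [0], [1], [0]])
toy_rewards = np.ones(5)
toy_starts = np.array([True, False, False, True, False])

s, a, r, s2, d = make_transitions(toy_obs, toy_actions, toy_rewards, toy_starts)

# Stand-in for the output of a target network: Q(s', .) = 10 for both actions
toy_next_q = np.full((5, 2), 10.0)
targets = q_targets(r, toy_next_q, d, gamma=0.5)
print(d, targets)
```

In a fully off-policy setting, minibatches would be sampled from these fixed arrays only, and no new transitions would be collected from the environment.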

In [3]:
from stable_baselines.gail import ExpertDataset
# Use the whole dataset (`traj_limitation=-1`);
# set a positive value to limit the number of expert trajectories
dataset = ExpertDataset(expert_path='cartpole_data.npz',
                        traj_limitation=-1, batch_size=128)

model = DQN('MlpPolicy', 'CartPole-v1', verbose=0)
# Pretrain the DQN
model.pretrain(dataset, n_epochs=100)

# Test the pre-trained model
env = model.get_env()
obs = env.reset()
reward_sum = 0.0
episode_reward = []
for _ in range(1000):
    action, _ = model.predict(obs)
    obs, reward, done, _ = env.step(action)
    reward_sum += reward
    # Uncomment to render (rendering is not possible from a remote session)
    # env.render()
    if done:
        # Episode finished: store its return and start a new one
        episode_reward.append(reward_sum)
        reward_sum = 0.0
        obs = env.reset()

print('average reward: {}'.format(np.mean(episode_reward)))

env.close()

actions (878987, 1)
obs (878987, 4)
rewards (878987,)
episode_returns (10000,)
episode_starts (878987,)
Total trajectories: -1
Total transitions: 878987
Average returns: 87.8987
Std for returns: 5.305189752497077
average reward: 87.54545454545455


## Berkeley Artificial Intelligence Research (BAIR)
Learn more about Berkeley Artificial Intelligence Research (BAIR) on [bair.berkeley.edu](https://bair.berkeley.edu/).

### BEAR (Bootstrapping Error Accumulation Reduction)
Papers:
* Kumar et al. 2019 - [Stabilizing Off-Policy Q-learning via Bootstrapping Error Reduction](https://arxiv.org/abs/1906.00949)
* Fu et al. 2020 - [Datasets for Data-Driven Reinforcement Learning](https://arxiv.org/abs/2004.07219)
* Fujimoto et al. 2018 - [Off-Policy Deep Reinforcement Learning without Exploration](https://arxiv.org/abs/1812.02900)

Code:
* BEAR on [GitHub](https://github.com/aviralkumar2907/BEAR) by Aviral Kumar
* BCQ on [GitHub](https://github.com/sfujim/BCQ) by Scott Fujimoto

Blog:
* [Data-Driven Deep Reinforcement Learning](https://bair.berkeley.edu/blog/2019/12/05/bear/)

Slides & Talks:
* Stabilizing Off-Policy Q-learning via Bootstrapping Error Reduction - Introduction [Slides](https://sites.google.com/view/bear-off-policyrl)
* Robust Perception, Imitation, and Reinforcement Learning for Embodied Learning Machines - [Talk](https://slideslive.com/38918103/robust-perception-imitation-and-reinforcement-learning-for-embodied-learning-machines?ref=speaker-17453-latest) by Sergey Levine