# Reinforcement Learning

## MiniHack

### Install NLE

In [1]:
!apt update -qq && apt install -qq -y flex bison libbz2-dev libglib2.0 libsm6 libxext6 cmake 
!pip install -U --quiet git+https://github.com/facebookresearch/nle.git@main

6 packages can be upgraded. Run 'apt list --upgradable' to see them.
bison is already the newest version (2:3.0.4.dfsg-1build1).
flex is already the newest version (2.6.4-6).
libsm6 is already the newest version (2:1.2.2-1).
libxext6 is already the newest version (2:1.3.3-1).
libglib2.0-cil is already the newest version (2.12.40-2).
libglib2.0-cil-dev is already the newest version (2.12.40-2).
cmake is already the newest version (3.10.2-1ubuntu2.18.04.2).
libbz2-dev is already the newest version (1.0.6-8.1ubuntu0.2).
libglib2.0-0 is already the newest version (2.56.4-0ubuntu0.18.04.9).
libglib2.0-bin is already the newest version (2.56.4-0ubuntu0.18.04.9).
libglib2.0-data is already the newest version (2.56.4-0ubuntu0.18.04.9).
libglib2.0-dev is already the newest version (2.56.4-0ubuntu0.18.04.9).
libglib2.0-dev-bin is already the newest version (2.56.4-0ubuntu0.18.04.9).
libglib2.0-doc is already the newest version (2.56.4-0ubuntu0.18.04.9).
libglib2.0-tests is already the newest ver

### Install MiniHack

In [2]:
!pip install -U --quiet git+https://github.com/facebookresearch/minihack.git@main

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done


### Install RLlib

In [3]:
!pip install -U --quiet "ray[rllib,tune,default]"

### Install logging and configuration tools

In [4]:
!pip install -U --quiet comet_ml hydra-core pipdeptree wandb

### Versions

In [5]:
!python --version
!pip --version
!pipdeptree -r --packages gym,nle,minihack,ray,wandb

Python 3.7.15
pip 21.1.3 from /usr/local/lib/python3.7/dist-packages/pip (python 3.7)
* ipython==7.9.0
 - jedi [required: >=0.10, installed: ?]
------------------------------------------------------------------------
gym==0.23.0
  - dopamine-rl==1.0.5 [requires: gym>=0.10.5]
  - minihack==0.1.3+2f022b0 [requires: gym>=0.15,<=0.23]
  - nle==0.8.1+68b9362 [requires: gym>=0.15,<=0.23]
    - minihack==0.1.3+2f022b0 [requires: nle>=0.8.1]
ray==2.0.1
wandb==0.13.5


### Imports

In [6]:
import random
import gym
import nle
import minihack

import numpy as np
import cv2
from collections import OrderedDict
from tqdm.auto import trange


### Custom environment

In [7]:
%matplotlib inline
import matplotlib.pyplot as plt

from gym.spaces import Box
from minihack.envs.skills_quest import MiniHackQuestHard
from ray.tune.registry import register_env

class dotdict(dict):
    """dot.notation access to dictionary attributes"""
    __getattr__ = dict.get
    __setattr__ = dict.__setitem__
    __delattr__ = dict.__delitem__


class CustomEnv(MiniHackQuestHard):
    """MiniHack Quest-Hard environment that resizes the `pixel` observation."""

    def __init__(self, config):
        # Hack to resolve error "'CustomEnv' object has no attribute 'env'"
        self.env = dotdict({'_vardir': '/tmp/run'})

        config = dotdict(config)

        # `obs_keys` is a comma-separated string of observation keys.
        self._obs_keys = config.obs_keys.split(",")
        super().__init__(observation_keys=self._obs_keys)

        # `input_shape` holds the target `height` and `width` of the frame.
        self.shape = dotdict(config.input_shape)
        # Advertise the resized frame shape so the model is built to match.
        self.observation_space['pixel'] = Box(
            0, 255, (self.shape.height, self.shape.width, 3), np.uint8
        )

    def _resize_frame(self, frame):
        # cv2 expects dsize as (width, height), not (height, width).
        return cv2.resize(
            frame,
            dsize=(self.shape.width, self.shape.height),
            interpolation=cv2.INTER_LINEAR
        )

    def _process_obs(self, obs):
        # Resize only the `pixel` entry; pass every other key through unchanged.
        return OrderedDict({
            key: self._resize_frame(obs[key]) if key == 'pixel' else obs[key]
            for key in self._obs_keys
        })

    def reset(self):
        return self._process_obs(super().reset())

    def step(self, action):
        obs, reward, done, info = super().step(action)
        return self._process_obs(obs), reward, done, info

register_env('MiniHack-D3QN-v0', CustomEnv)
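
The `dotdict` helper is what lets `CustomEnv` read `config.obs_keys` and `self.shape.height` as attributes. The sketch below shows that behaviour in isolation. The config values (`"pixel,message"`, an 84x84 `input_shape`) are hypothetical and chosen only for illustration; the real values live in `config.yaml`, which is not shown here.

```python
class dotdict(dict):
    """dot.notation access to dictionary attributes"""
    __getattr__ = dict.get
    __setattr__ = dict.__setitem__
    __delattr__ = dict.__delitem__


# Hypothetical config values, chosen only for illustration.
config = dotdict({
    "obs_keys": "pixel,message",
    "input_shape": {"height": 84, "width": 84},
})

# The comma-separated string becomes a list of observation keys.
obs_keys = config.obs_keys.split(",")

# Nested dicts are plain dicts, so they need their own dotdict wrapper.
shape = dotdict(config.input_shape)

# A missing key yields None instead of raising AttributeError,
# because __getattr__ is bound to dict.get.
missing = config.not_a_real_key

print(obs_keys, shape.height, shape.width, missing)
```

Two caveats follow from this design: a misspelled key fails silently with `None`, and a key that shares its name with a `dict` method (such as `items` or `keys`) returns the method rather than the stored value when accessed as an attribute.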

### Train

In [None]:
from hydra import initialize, compose
from omegaconf import OmegaConf
from ray import tune
from ray.air.callbacks.wandb import WandbLoggerCallback

with initialize(version_base=None, config_path="."):
    cfg = compose(config_name='config.yaml')

dqn_cfg = OmegaConf.to_object(cfg.get('minihack-d3qn', {}))

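# W&B settings are expected under `logger_config.wandb`; the lookups below
# raise KeyError if `project`, `group`, or `api_key_file` is missing.
wandb_cfg = dqn_cfg.get('logger_config', {}).get('wandb', {})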

callbacks = [
    WandbLoggerCallback(
        project=wandb_cfg['project'],
        group=wandb_cfg['group'],
        api_key_file=wandb_cfg['api_key_file']
    )
]

analysis = tune.run(
    'DQN',
    callbacks=callbacks,
    config=dqn_cfg,
    stop={"training_iteration": 10},
    local_dir="./results",
    log_to_file=True,
)

2022-11-05 12:30:24,831	INFO worker.py:1515 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
Instructions for updating:
experimental_relax_shapes is deprecated, use reduce_retracing instead


| Trial name | status | loc | gamma | hiddens | lr | target_network_up... | iter | total time (s) | ts | reward | num_recreated_wor... | episode_reward_max | episode_reward_min |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DQN_MiniHack-D3QN-v0_a2cda_00001 | RUNNING | 172.28.0.12:15583 | 0.999 | [256] | 0.0001 | 5000 | 2.0 | 2324.34 | 2000.0 | -6.0 | 0.0 | -2.54 | -9.46 |
| DQN_MiniHack-D3QN-v0_a2cda_00002 | PENDING | | 0.99 | [512] | 0.0001 | 5000 | | | | | | | |
| DQN_MiniHack-D3QN-v0_a2cda_00003 | PENDING | | 0.999 | [512] | 0.0001 | 5000 | | | | | | | |
| DQN_MiniHack-D3QN-v0_a2cda_00004 | PENDING | | 0.99 | [256] | 0.001 | 5000 | | | | | | | |
| DQN_MiniHack-D3QN-v0_a2cda_00005 | PENDING | | 0.999 | [256] | 0.001 | 5000 | | | | | | | |
| DQN_MiniHack-D3QN-v0_a2cda_00006 | PENDING | | 0.99 | [512] | 0.001 | 5000 | | | | | | | |
| DQN_MiniHack-D3QN-v0_a2cda_00007 | PENDING | | 0.999 | [512] | 0.001 | 5000 | | | | | | | |
| DQN_MiniHack-D3QN-v0_a2cda_00008 | PENDING | | 0.99 | [256] | 0.0001 | 10000 | | | | | | | |
| DQN_MiniHack-D3QN-v0_a2cda_00009 | PENDING | | 0.999 | [256] | 0.0001 | 10000 | | | | | | | |
| DQN_MiniHack-D3QN-v0_a2cda_00010 | PENDING | | 0.99 | [512] | 0.0001 | 10000 | | | | | | | |
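
The trial table above is consistent with a grid search over four hyperparameters, with `gamma` changing fastest. The sketch below reproduces that ordering with `itertools.product`. It is an assumption based only on the rows visible in the table: the actual search space is defined in `config.yaml`, which is not shown, and may contain more values. The key name `target_network_update_freq` is likewise assumed from the truncated column header.

```python
from itertools import product

# Values read off the trial table; the real search space lives in
# config.yaml, which is not shown, so treat this grid as an assumption.
gammas = [0.99, 0.999]
hiddens = [[256], [512]]
lrs = [0.0001, 0.001]
target_update_freqs = [5000, 10000]  # assumed key: target_network_update_freq

# gamma changes fastest in the table, so it is the innermost factor here.
trials = [
    {"gamma": g, "hiddens": h, "lr": lr, "target_network_update_freq": t}
    for t, lr, h, g in product(target_update_freqs, lrs, hiddens, gammas)
]

for index, trial in enumerate(trials):
    print(f"{index:05d}", trial)
```

Under these assumptions the grid yields 16 trials, and indices 1 through 10 match the hyperparameters listed for trials `00001` through `00010`. With each trial capped at 10 training iterations, a larger grid multiplies the total run time accordingly.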


[2m[36m(pid=15318)[0m Instructions for updating:
[2m[36m(pid=15318)[0m experimental_relax_shapes is deprecated, use reduce_retracing instead
[2m[36m(pid=15318)[0m Instructions for updating:
[2m[36m(pid=15318)[0m experimental_relax_shapes is deprecated, use reduce_retracing instead
[2m[36m(DQN pid=15318)[0m 2022-11-05 12:30:42,419	INFO simple_q.py:294 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting `simple_optimizer=True` if this doesn't work for you.
[2m[36m(DQN pid=15318)[0m 2022-11-05 12:30:42,421	INFO algorithm.py:358 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.


Result for DQN_MiniHack-D3QN-v0_a2cda_00000:
  agent_timesteps_total: 1000
  counters:
    num_agent_steps_sampled: 1000
    num_agent_steps_trained: 0
    num_env_steps_sampled: 1000
    num_env_steps_trained: 0
  custom_metrics: {}
  date: 2022-11-05_12-31-48
  done: false
  episode_len_mean: 1000.0
  episode_media: {}
  episode_reward_max: -9.11999999999985
  episode_reward_mean: -9.11999999999985
  episode_reward_min: -9.11999999999985
  episodes_this_iter: 1
  episodes_total: 1
  experiment_id: 69cc267f84b546d192e6aea4eb23746f
  hostname: 1c147db97f6e
  info:
    learner: {}
    num_agent_steps_sampled: 1000
    num_agent_steps_trained: 0
    num_env_steps_sampled: 1000
    num_env_steps_trained: 0
  iterations_since_restore: 1
  node_ip: 172.28.0.12
  num_agent_steps_sampled: 1000
  num_agent_steps_trained: 0
  num_env_steps_sampled: 1000
  num_env_steps_sampled_this_iter: 1000
  num_env_steps_trained: 0
  num_env_steps_trained_this_iter: 0
  num_faulty_episodes: 0
  num_healthy_



Result for DQN_MiniHack-D3QN-v0_a2cda_00000:
  agent_timesteps_total: 10000
  counters:
    last_target_update_ts: 10000
    num_agent_steps_sampled: 10000
    num_agent_steps_trained: 32
    num_env_steps_sampled: 10000
    num_env_steps_trained: 32
    num_target_updates: 1
  custom_metrics: {}
  date: 2022-11-05_12-40-03
  done: true
  episode_len_mean: 754.1538461538462
  episode_media: {}
  episode_reward_max: -1.6600000000000013
  episode_reward_mean: -6.986153846153741
  episode_reward_min: -9.449999999999843
  episodes_this_iter: 1
  episodes_total: 13
  experiment_id: 69cc267f84b546d192e6aea4eb23746f
  hostname: 1c147db97f6e
  info:
    last_target_update_ts: 10000
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_lr: 0.0001
          max_q: 0.03945760428905487
          mean_q: 0.002244360279291868
          min_q: -0.029430756345391273
        mean_td_error: -0.0280128326267004
        model: {}
        td_error: [-0.021293779

[2m[36m(pid=15583)[0m Instructions for updating:
[2m[36m(pid=15583)[0m experimental_relax_shapes is deprecated, use reduce_retracing instead
[2m[36m(pid=15583)[0m Instructions for updating:
[2m[36m(pid=15583)[0m experimental_relax_shapes is deprecated, use reduce_retracing instead
[2m[36m(DQN pid=15583)[0m 2022-11-05 12:43:47,833	INFO simple_q.py:294 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting `simple_optimizer=True` if this doesn't work for you.
[2m[36m(DQN pid=15583)[0m 2022-11-05 12:43:48,020	INFO algorithm.py:358 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(DQN pid=15583)[0m 2022-11-05 12:46:30,503	INFO trainable.py:166 -- Trainable.setup took 162.801 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.


Result for DQN_MiniHack-D3QN-v0_a2cda_00001:
  agent_timesteps_total: 1000
  counters:
    num_agent_steps_sampled: 1000
    num_agent_steps_trained: 0
    num_env_steps_sampled: 1000
    num_env_steps_trained: 0
  custom_metrics: {}
  date: 2022-11-05_13-05-54
  done: false
  episode_len_mean: 278.0
  episode_media: {}
  episode_reward_max: -2.53999999999999
  episode_reward_mean: -2.53999999999999
  episode_reward_min: -2.53999999999999
  episodes_this_iter: 1
  episodes_total: 1
  experiment_id: 2d42823f93d540eb9ecce81cc99f6391
  hostname: 1c147db97f6e
  info:
    learner: {}
    num_agent_steps_sampled: 1000
    num_agent_steps_trained: 0
    num_env_steps_sampled: 1000
    num_env_steps_trained: 0
  iterations_since_restore: 1
  node_ip: 172.28.0.12
  num_agent_steps_sampled: 1000
  num_agent_steps_trained: 0
  num_env_steps_sampled: 1000
  num_env_steps_sampled_this_iter: 1000
  num_env_steps_trained: 0
  num_env_steps_trained_this_iter: 0
  num_faulty_episodes: 0
  num_healthy_w