<a href="https://colab.research.google.com/github/lincolnschick/ML4MC/blob/main/docs/reports/requirement-10-code/hedges_MineRL_BC%2Bscripted.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div style="text-align: center">
  <img src="https://github.com/KarolisRam/MineRL2021-Intro-baselines/blob/main/img/colab_banner.png?raw=true">
</div>

# Introduction
This notebook is part three of the Intro track baselines for the [MineRL 2021](https://minerl.io/) competition. To run it, enable a GPU by going to `Runtime -> Change runtime type` and selecting GPU from the drop-down list.

Below you will find an agent that has two components:
1. A machine learning agent that trains on human demonstration data to imitate how players chop trees (training takes less than 10 minutes).
2. A script that crafts a wooden pickaxe and digs down to get some cobblestone.  

The machine learning part runs for a fixed number of steps (2000 by default), then the crafting and digging script takes over.
When evaluated on the MineRLObtainDiamond environment, it achieves an average reward of 8.6.

## Software 2.0
The approach used here, where some human-written code is replaced with an AI component, is quite similar to how Tesla approaches self-driving cars. See this talk by Andrej Karpathy, Director of AI at Tesla:  
[Building the Software 2.0 Stack](https://databricks.com/session/keynote-from-tesla)

Go on, improve the self-driving Steve/Alex below! :)

# Setup

In [None]:
!sudo add-apt-repository -y ppa:openjdk-r/ppa
!sudo apt-get purge openjdk-*
!sudo apt-get install openjdk-8-jdk
!sudo apt-get install xvfb
!sudo apt-get install xserver-xephyr
!sudo apt-get install -y python3-opengl
!sudo apt-get install ffmpeg
!pip3 install gym==0.13.1
!pip3 install minerl==0.4.4
!pip3 install pyvirtualdisplay
!pip3 install -U colabgymrender

In [None]:
# Launch virtual display, which is needed for MineRL
from pyvirtualdisplay import Display
display = Display(visible=False, size=(400, 300))
display.start();

In [None]:
import cv2
import gym
import numpy as np

In [None]:
# vnc4server and python-opengl have no installation candidate on this image,
# so install tigervnc-standalone-server and python3-opengl instead.
!sudo apt-get install -y xvfb xserver-xephyr tigervnc-standalone-server python3-opengl ffmpeg

In [None]:
import minerl

# Import Libraries

In [None]:
import numpy as np
import torch as th
from torch import nn
import gym
# import minerl
from tqdm.notebook import tqdm
from colabgymrender.recorder import Recorder
from pyvirtualdisplay import Display
import logging
logging.disable(logging.ERROR) # reduce clutter, remove if something doesn't work to see the error logs.

# Neural network

In [None]:
class NatureCNN(nn.Module):
    """
    CNN from DQN nature paper:
        Mnih, Volodymyr, et al.
        "Human-level control through deep reinforcement learning."
        Nature 518.7540 (2015): 529-533.

    :param input_shape: A three-item tuple telling image dimensions in (C, H, W)
    :param output_dim: Dimensionality of the output vector
    """

    def __init__(self, input_shape, output_dim):
        super().__init__()
        n_input_channels = input_shape[0]
        self.cnn = nn.Sequential(
            nn.Conv2d(n_input_channels, 32, kernel_size=8, stride=4, padding=0),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=0),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=0),
            nn.ReLU(),
            nn.Flatten(),
        )

        # Compute shape by doing one forward pass
        with th.no_grad():
            n_flatten = self.cnn(th.zeros(1, *input_shape)).shape[1]

        self.linear = nn.Sequential(
            nn.Linear(n_flatten, 512),
            nn.ReLU(),
            nn.Linear(512, output_dim)
        )

    def forward(self, observations: th.Tensor) -> th.Tensor:
        return self.linear(self.cnn(observations))
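
The dummy forward pass in `__init__` is how `NatureCNN` discovers `n_flatten`. As a cross-check, the same number can be worked out by hand from the kernel sizes and strides. The sketch below assumes no padding and no dilation, as in the layers above; the helper names `conv_out` and `nature_cnn_flat_size` are made up for this illustration.

```python
def conv_out(size, kernel, stride, padding=0):
    """Output length of one spatial dimension after a convolution (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


def nature_cnn_flat_size(height, width, out_channels=64):
    """Flattened feature count after the three conv layers used in NatureCNN."""
    # (kernel_size, stride) pairs copied from the nn.Conv2d calls above
    for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
        height = conv_out(height, kernel, stride)
        width = conv_out(width, kernel, stride)
    return out_channels * height * width


# 64 -> 15 -> 6 -> 4 per side, so 64 channels * 4 * 4 = 1024 features
print(nature_cnn_flat_size(64, 64))
```

For the 64x64 `pov` frames used in this notebook, the first `nn.Linear` layer should therefore see 1024 inputs.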

# Custom Environments

Here we are going to set up a custom environment for our model. The goal is to optimize the harvesting of wood with a few potential changes to **rewards** and **actions**.

I've used Faith's Colab notebook as a starting point.

# TreeOpt Environment

In [None]:
# Base: Source Code From TreeChop Environment
# see treechop_specs.py

# Copyright (c) 2020 All Rights Reserved
# Author: William H. Guss, Brandon Houghton

from minerl.herobraine.env_specs.simple_embodiment import SimpleEmbodimentEnvSpec
from minerl.herobraine.hero.mc import MS_PER_STEP, STEPS_PER_MS
from minerl.herobraine.hero.handler import Handler
from typing import List

import minerl.herobraine
import minerl.herobraine.hero.handlers as handlers
from minerl.herobraine.env_spec import EnvSpec

# DO NOT CHANGE PATHS
TREEOPT_DOC = """
.. image:: ../assets/treechop1.mp4.gif
  :scale: 100 %
  :alt:

.. image:: ../assets/treechop2.mp4.gif
  :scale: 100 %
  :alt:

.. image:: ../assets/treechop3.mp4.gif
  :scale: 100 %
  :alt:

.. image:: ../assets/treechop4.mp4.gif
  :scale: 100 %
  :alt:

In treeopt, the agent must collect 64 `minecraft:log`. This replicates a common scenario
in Minecraft, as logs are necessary to craft a large number of items in the game and are a
key resource in Minecraft.

The agent begins in a forest biome (near many trees) with a diamond axe for cutting trees. The agent
is given +1 reward for obtaining each unit of wood, and the episode terminates once the agent
obtains 64 units.
"""
TREEOPT_LENGTH = 8000
TREEOPT_WORLD_GENERATOR_OPTIONS = '''{
    "coordinateScale": 684.412,
    "heightScale": 684.412,
    "lowerLimitScale": 512.0,
    "upperLimitScale": 512.0,
    "depthNoiseScaleX": 200.0,
    "depthNoiseScaleZ": 200.0,
    "depthNoiseScaleExponent": 0.5,
    "mainNoiseScaleX": 80.0,
    "mainNoiseScaleY": 160.0,
    "mainNoiseScaleZ": 80.0,
    "baseSize": 8.5,
    "stretchY": 12.0,
    "biomeDepthWeight": 1.0,
    "biomeDepthOffset": 0.0,
    "biomeScaleWeight": 1.0,
    "biomeScaleOffset": 0.0,
    "seaLevel": 1,
    "useCaves": false,
    "useDungeons": false,
    "dungeonChance": 8,
    "useStrongholds": false,
    "useVillages": false,
    "useMineShafts": false,
    "useTemples": false,
    "useMonuments": false,
    "useMansions": false,
    "useRavines": false,
    "useWaterLakes": false,
    "waterLakeChance": 4,
    "useLavaLakes": false,
    "lavaLakeChance": 80,
    "useLavaOceans": false,
    "fixedBiome": 4,
    "biomeSize": 4,
    "riverSize": 1,
    "dirtSize": 33,
    "dirtCount": 10,
    "dirtMinHeight": 0,
    "dirtMaxHeight": 256,
    "gravelSize": 33,
    "gravelCount": 8,
    "gravelMinHeight": 0,
    "gravelMaxHeight": 256,
    "graniteSize": 33,
    "graniteCount": 10,
    "graniteMinHeight": 0,
    "graniteMaxHeight": 80,
    "dioriteSize": 33,
    "dioriteCount": 10,
    "dioriteMinHeight": 0,
    "dioriteMaxHeight": 80,
    "andesiteSize": 33,
    "andesiteCount": 10,
    "andesiteMinHeight": 0,
    "andesiteMaxHeight": 80,
    "coalSize": 17,
    "coalCount": 20,
    "coalMinHeight": 0,
    "coalMaxHeight": 128,
    "ironSize": 9,
    "ironCount": 20,
    "ironMinHeight": 0,
    "ironMaxHeight": 64,
    "goldSize": 9,
    "goldCount": 2,
    "goldMinHeight": 0,
    "goldMaxHeight": 32,
    "redstoneSize": 8,
    "redstoneCount": 8,
    "redstoneMinHeight": 0,
    "redstoneMaxHeight": 16,
    "diamondSize": 8,
    "diamondCount": 1,
    "diamondMinHeight": 0,
    "diamondMaxHeight": 16,
    "lapisSize": 7,
    "lapisCount": 1,
    "lapisCenterHeight": 16,
    "lapisSpread": 16
}'''


class Treeopt(SimpleEmbodimentEnvSpec):
    def __init__(self, *args, **kwargs):
        if 'name' not in kwargs:
            kwargs['name'] = 'MineRLTreeopt-v0'

        super().__init__(*args,
                         max_episode_steps=TREEOPT_LENGTH, reward_threshold=64.0,
                         **kwargs)

    def create_rewardables(self) -> List[Handler]:
        return [
            handlers.RewardForCollectingItems([
                dict(type="log", amount=1, reward=1.0),
            ]),
            # We tried adding further reward handlers, but some handlers
            # listed in the documentation are not fully implemented in code
            # (e.g. ConstantReward()), and other potentially useful handlers
            # in reward.py are private
            # (e.g. _RewardForPosessingItemBase()).

        ]

    def create_agent_start(self) -> List[Handler]:
        return [
            handlers.SimpleInventoryAgentStart([
                # Switched to diamond_axe to speed up chopping
                dict(type="diamond_axe", quantity=1)
            ])
        ]

    def create_agent_handlers(self) -> List[Handler]:
        return [
            handlers.AgentQuitFromPossessingItem([
                dict(type="log", amount=64)]
            )
        ]

    def create_server_world_generators(self) -> List[Handler]:
        return [
            handlers.DefaultWorldGenerator(force_reset="true",
                                           generator_options=TREEOPT_WORLD_GENERATOR_OPTIONS
                                           )
        ]

    def create_server_quit_producers(self) -> List[Handler]:
        return [
            handlers.ServerQuitFromTimeUp(
                (TREEOPT_LENGTH * MS_PER_STEP)),
            handlers.ServerQuitWhenAnyAgentFinishes()
        ]

    def create_server_decorators(self) -> List[Handler]:
        return []

    def create_server_initial_conditions(self) -> List[Handler]:
        return [
            handlers.TimeInitialCondition(
                allow_passage_of_time=False
            ),
            handlers.SpawningInitialCondition(
                allow_spawning=True
            )
        ]

    def determine_success_from_rewards(self, rewards: list) -> bool:
        return sum(rewards) >= self.reward_threshold

    def is_from_folder(self, folder: str) -> bool:
        return folder == 'survivaltreechop'
        # DON'T CHANGE TO ENSURE ASSETS GET REFERENCED CORRECTLY

    def get_docstring(self):
        return TREEOPT_DOC


# Environment wrappers

In [None]:
class ActionShaping(gym.ActionWrapper):
    """
    The default MineRL action space is the following dict:

    Dict(attack:Discrete(2),
         back:Discrete(2),
         camera:Box(low=-180.0, high=180.0, shape=(2,)),
         craft:Enum(crafting_table,none,planks,stick,torch),
         equip:Enum(air,iron_axe,iron_pickaxe,none,stone_axe,stone_pickaxe,wooden_axe,wooden_pickaxe),
         forward:Discrete(2),
         jump:Discrete(2),
         left:Discrete(2),
         nearbyCraft:Enum(furnace,iron_axe,iron_pickaxe,none,stone_axe,stone_pickaxe,wooden_axe,wooden_pickaxe),
         nearbySmelt:Enum(coal,iron_ingot,none),
         place:Enum(cobblestone,crafting_table,dirt,furnace,none,stone,torch),
         right:Discrete(2),
         sneak:Discrete(2),
         sprint:Discrete(2))

    It can be viewed as:
         - buttons, like attack, back, forward, sprint that are either pressed or not.
         - mouse, i.e. the continuous camera action in degrees. The two values are pitch (up/down), where up is
           negative, down is positive, and yaw (left/right), where left is negative, right is positive.
         - craft/equip/place actions for items specified above.
    So an example action could be sprint + forward + jump + attack + turn camera, all in one action.

    This wrapper makes the action space much smaller by selecting a few common actions and making the camera actions
    discrete. You can change these actions by changing self._actions below. That should just work with the RL agent,
    but would require some further tinkering below with the BC one.
    """
    def __init__(self, env, camera_angle=10, always_attack=False):
        super().__init__(env)

        self.camera_angle = camera_angle
        self.always_attack = always_attack
        self._actions = [
            [('attack', 1)],
            [('forward', 1)],
            # [('back', 1)],
            # [('left', 1)],
            # [('right', 1)],
            # [('jump', 1)],
            # [('forward', 1), ('attack', 1)],
            # [('craft', 'planks')],
            [('forward', 1), ('jump', 1)],
            [('camera', [-self.camera_angle, 0])],
            [('camera', [self.camera_angle, 0])],
            [('camera', [0, self.camera_angle])],
            [('camera', [0, -self.camera_angle])],
        ]

        self.actions = []
        for actions in self._actions:
            act = self.env.action_space.noop()
            for a, v in actions:
                act[a] = v
            if self.always_attack:
                act['attack'] = 1
            self.actions.append(act)

        self.action_space = gym.spaces.Discrete(len(self.actions))

    def action(self, action):
        return self.actions[action]
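
To make the wrapper's lookup table concrete without starting Minecraft, here is a minimal sketch of the same construction using plain dicts. `NOOP` is a reduced stand-in for `env.action_space.noop()` (the real no-op has more keys), and `build_discrete_actions` is a hypothetical helper, not part of MineRL or gym.

```python
def build_discrete_actions(noop, button_sets, always_attack=False):
    """Turn lists of (key, value) pairs into full action dicts, one per index."""
    actions = []
    for buttons in button_sets:
        act = dict(noop)  # copy so each discrete action starts from the no-op
        for name, value in buttons:
            act[name] = value
        if always_attack:
            act['attack'] = 1
        actions.append(act)
    return actions


# Reduced no-op with only the keys used here (assumption for the sketch)
NOOP = {'attack': 0, 'forward': 0, 'jump': 0, 'camera': [0, 0]}
CAMERA_ANGLE = 10
BUTTON_SETS = [
    [('attack', 1)],
    [('forward', 1)],
    [('forward', 1), ('jump', 1)],
    [('camera', [-CAMERA_ANGLE, 0])],
    [('camera', [CAMERA_ANGLE, 0])],
    [('camera', [0, CAMERA_ANGLE])],
    [('camera', [0, -CAMERA_ANGLE])],
]

table = build_discrete_actions(NOOP, BUTTON_SETS, always_attack=True)
print(len(table))  # number of discrete actions
print(table[2])    # forward + jump, with attack forced on
```

The index into `table` is what the network predicts, which is why the output dimension of the network has to match the number of entries in `_actions`.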

# Data parser

In [None]:
def dataset_action_batch_to_actions(dataset_actions, camera_margin=5):
    """
    Turn a batch of actions from dataset (`batch_iter`) to a numpy
    array that corresponds to batch of actions of ActionShaping wrapper (_actions).

    Camera margin sets the threshold for what is considered "moving camera".

    Note: Hardcoded to work for actions in ActionShaping._actions, with "intuitive"
        ordering of actions.
        If you change ActionShaping._actions, remember to change this!

    Array elements are integers corresponding to actions, or "-1"
    for actions that did not have any corresponding discrete match.
    """
    # There are dummy dimensions of shape one
    camera_actions = dataset_actions["camera"].squeeze()
    attack_actions = dataset_actions["attack"].squeeze()
    forward_actions = dataset_actions["forward"].squeeze()
    jump_actions = dataset_actions["jump"].squeeze()
    batch_size = len(camera_actions)
    actions = np.zeros((batch_size,), dtype=np.int64)  # np.int was removed in NumPy 1.24

    for i in range(len(camera_actions)):
        # Moving camera is most important (pitch first, then yaw)
        if camera_actions[i][0] < -camera_margin:
            actions[i] = 3
        elif camera_actions[i][0] > camera_margin:
            actions[i] = 4
        elif camera_actions[i][1] > camera_margin:
            actions[i] = 5
        elif camera_actions[i][1] < -camera_margin:
            actions[i] = 6
        elif forward_actions[i] == 1:
            if jump_actions[i] == 1:
                actions[i] = 2
            else:
                actions[i] = 1
        elif attack_actions[i] == 1:
            actions[i] = 0
        else:
            # No reasonable mapping (would be no-op)
            actions[i] = -1
    return actions
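
The order of the `if`/`elif` branches matters: a frame where the player both turns the camera and attacks is labelled as a camera action, and `attack` is only chosen when nothing else applies. The single-step sketch below restates that priority in plain Python so it can be checked on hand-written cases; `label_for_step` is a hypothetical name and the function is not used by the training code.

```python
def label_for_step(camera, attack, forward, jump, camera_margin=5):
    """Map one dataset step to an ActionShaping index, or -1 if nothing matches."""
    pitch, yaw = camera
    if pitch < -camera_margin:
        return 3  # look up
    if pitch > camera_margin:
        return 4  # look down
    if yaw > camera_margin:
        return 5  # turn right
    if yaw < -camera_margin:
        return 6  # turn left
    if forward == 1:
        return 2 if jump == 1 else 1
    if attack == 1:
        return 0
    return -1  # would be a no-op, so the sample is dropped


print(label_for_step((-10, 0), attack=1, forward=1, jump=0))  # camera wins
print(label_for_step((0, 0), attack=1, forward=1, jump=1))    # forward + jump
print(label_for_step((2, -3), attack=0, forward=0, jump=0))   # below the margin
```

Because attacking is the lowest-priority label, it is likely under-represented in the training targets, which may be one reason the evaluation cell wraps the environment with `always_attack=True`.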

# Setup training

In [None]:
def train():
    data = minerl.data.make("MineRLTreechop-v0",  data_dir='data', num_workers=4)

    # We know ActionShaping has seven discrete actions, so we create
    # a network to map images to seven values (logits), which represent
    # likelihoods of selecting those actions
    network = NatureCNN((3, 64, 64), 7).cuda()
    optimizer = th.optim.Adam(network.parameters(), lr=LEARNING_RATE)
    loss_function = nn.CrossEntropyLoss()

    iter_count = 0
    losses = []
    for dataset_obs, dataset_actions, _, _, _ in tqdm(data.batch_iter(num_epochs=EPOCHS, batch_size=32, seq_len=1)):
        # We only use pov observations (also remove dummy dimensions)
        obs = dataset_obs["pov"].squeeze().astype(np.float32)
        # Transpose observations to be channel-first (BCHW instead of BHWC)
        obs = obs.transpose(0, 3, 1, 2)
        # Normalize observations
        obs /= 255.0

        # Actions need a bit more work
        actions = dataset_action_batch_to_actions(dataset_actions)

        # Remove samples that had no corresponding action
        mask = actions != -1
        obs = obs[mask]
        actions = actions[mask]

        # Obtain logits of each action
        logits = network(th.from_numpy(obs).float().cuda())

        # Minimize cross-entropy with target labels.
        # We could also compute the probability of demonstration actions and
        # maximize them.
        loss = loss_function(logits, th.from_numpy(actions).long().cuda())

        # Standard PyTorch update
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        iter_count += 1
        losses.append(loss.item())
        if (iter_count % 1000) == 0:
            mean_loss = sum(losses) / len(losses)
            tqdm.write("Iteration {}. Loss {:<10.3f}".format(iter_count, mean_loss))
            losses.clear()

    th.save(network.state_dict(), TRAIN_MODEL_NAME)
    del data
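
For reading the loss values that `train()` prints, it helps to know what an uninformed network would score. The sketch below computes the cross-entropy of a single sample in plain Python, under the assumption that it matches what `nn.CrossEntropyLoss` computes per sample before averaging over the batch; the function name is made up for this example.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one sample: -log(softmax(logits)[target])."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target]


# Equal logits over the seven actions give ln(7), whatever the target is
uniform = cross_entropy([0.0] * 7, target=3)
print(round(uniform, 3))  # 1.946

# A confident, correct prediction gives a loss close to zero
print(cross_entropy([10.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], target=0))
```

A logged mean loss well below 1.946 therefore suggests the network has learned something beyond guessing uniformly among the seven actions.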

# Scripted part

In [None]:
def str_to_act(env, actions):
    """
    Simplifies specifying actions for the scripted part of the agent.
    Some examples for a string with a single action:
        'craft:planks'
        'camera:[10,0]'
        'attack'
        'jump'
        ''
    There should be no spaces in single actions, as we use spaces to separate actions with multiple "buttons" pressed:
        'attack sprint forward'
        'forward camera:[0,10]'

    :param env: base MineRL environment.
    :param actions: string of actions.
    :return: dict action, compatible with the base MineRL environment.
    """
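    import ast  # local import so this cell stays self-contained

    act = env.action_space.noop()
    for action in actions.split():
        if ":" in action:
            k, v = action.split(':')
            if k == 'camera':
                # Parse the "[pitch,yaw]" literal without executing arbitrary code
                act[k] = ast.literal_eval(v)
            else:
                act[k] = v
        else:
            act[action] = 1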
    return act
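
The string format can be tried out without a running environment. The sketch below parses the same strings into a plain dict that holds only the keys that differ from the no-op; `parse_action_string` is a hypothetical stand-in for `str_to_act` and does not touch MineRL.

```python
import ast


def parse_action_string(actions):
    """Parse e.g. 'forward camera:[0,10]' into {'forward': 1, 'camera': [0, 10]}."""
    parsed = {}
    for token in actions.split():
        if ':' in token:
            key, value = token.split(':', 1)
            # Only camera values are list literals; other values stay strings
            parsed[key] = ast.literal_eval(value) if key == 'camera' else value
        else:
            parsed[token] = 1  # a bare name means "button pressed"
    return parsed


print(parse_action_string('forward camera:[0,10]'))
print(parse_action_string('craft:planks'))
print(parse_action_string(''))  # empty string: nothing pressed, i.e. a no-op
```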

# Actions
Here's a list of all possible actions:
```
Dict(attack:Discrete(2),
     back:Discrete(2),
     camera:Box(low=-180.0, high=180.0, shape=(2,)),
     craft:Enum(crafting_table,none,planks,stick,torch),
     equip:Enum(air,iron_axe,iron_pickaxe,none,stone_axe,stone_pickaxe,wooden_axe,wooden_pickaxe),
     forward:Discrete(2),
     jump:Discrete(2),
     left:Discrete(2),
     nearbyCraft:Enum(furnace,iron_axe,iron_pickaxe,none,stone_axe,stone_pickaxe,wooden_axe,wooden_pickaxe),
     nearbySmelt:Enum(coal,iron_ingot,none),
     place:Enum(cobblestone,crafting_table,dirt,furnace,none,stone,torch),
     right:Discrete(2),
     sneak:Discrete(2),
     sprint:Discrete(2))
```

### Camera
Camera actions contain two values:
1. Pitch (up/down), where up is negative, down is positive.
2. Yaw (left/right), where left is negative, right is positive.  

For example, moving the camera up by 10 degrees would be 'camera:[-10,0]'.
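
Since each camera action is a relative rotation, it is easy to lose track of where the agent ends up looking after a long script. A small sketch that sums the pitch and yaw over a list of action strings can help; `net_camera` is a hypothetical helper and assumes the `camera:[pitch,yaw]` string format used by the scripted part.

```python
import ast


def net_camera(action_strings):
    """Sum the pitch and yaw changes over a list of action strings."""
    pitch = yaw = 0
    for actions in action_strings:
        for token in actions.split():
            if token.startswith('camera:'):
                d_pitch, d_yaw = ast.literal_eval(token.split(':', 1)[1])
                pitch += d_pitch
                yaw += d_yaw
    return pitch, yaw


look_up = ['camera:[-10,0]'] * 7
print(net_camera(look_up))                          # (-70, 0): 70 degrees up
print(net_camera(look_up + ['camera:[10,0]'] * 7))  # (0, 0): back to level
```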


In [None]:
def get_action_sequence():
    """
    Specify the action sequence for the agent to execute.
    """
    # get 6 logs:
    action_sequence = []
    action_sequence += [''] * 100  # wait 5 sec
    action_sequence += ['forward'] * 8
    action_sequence += ['attack'] * 61
    action_sequence += ['camera:[-10,0]'] * 7  # look up
    action_sequence += ['attack'] * 61
    action_sequence += ['attack'] * 61
    action_sequence += ['attack'] * 61
    action_sequence += ['attack'] * 61
    action_sequence += [''] * 50
    action_sequence += ['jump']
    action_sequence += ['forward'] * 10
    action_sequence += ['camera:[-10,0]'] * 2
    action_sequence += ['attack'] * 61
    action_sequence += ['attack'] * 61
    action_sequence += ['attack'] * 61
    action_sequence += ['camera:[10,0]'] * 9  # look down
    action_sequence += [''] * 50

    # make planks, sticks, crafting table and wooden pickaxe:
    action_sequence += ['back'] * 2
    action_sequence += ['craft:planks'] * 4
    action_sequence += ['craft:stick'] * 2
    action_sequence += ['craft:crafting_table']
    action_sequence += ['camera:[10,0]'] * 9
    action_sequence += ['jump']
    action_sequence += [''] * 5
    action_sequence += ['place:crafting_table']
    action_sequence += [''] * 10

    # bug: looking straight down at a crafting table doesn't let you craft. So we look up a bit before crafting:
    action_sequence += ['camera:[-1,0]']
    action_sequence += ['nearbyCraft:wooden_pickaxe']
    action_sequence += ['camera:[1,0]']
    action_sequence += [''] * 10
    action_sequence += ['equip:wooden_pickaxe']
    action_sequence += [''] * 10

    # dig down:
    action_sequence += ['attack'] * 600
    action_sequence += [''] * 10

    # make stone pick and furnace
    action_sequence += ['jump']
    action_sequence += [''] * 5
    action_sequence += ['place:crafting_table']
    action_sequence += [''] * 10

    action_sequence += ['camera:[-1,0]']
    action_sequence += ['nearbyCraft:stone_pickaxe']
    action_sequence += ['nearbyCraft:furnace']
    action_sequence += ['camera:[1,0]']
    action_sequence += [''] * 10
    action_sequence += ['equip:stone_pickaxe']
    action_sequence += [''] * 10

    return action_sequence
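
When editing the script it can be useful to know how long a segment takes. The sketch below expands `(action, count)` pairs into a step list and converts the step count to seconds; `expand` is a hypothetical helper, and the rate of 20 steps per second is inferred from the `# wait 5 sec` comment next to the 100 no-op steps.

```python
def expand(runs):
    """Expand (action_string, repeat_count) pairs into a flat list of steps."""
    sequence = []
    for action, count in runs:
        sequence += [action] * count
    return sequence


STEPS_PER_SECOND = 20  # assumption: 100 no-op steps are commented as 5 seconds

# The first three lines of the script above, written as runs
opening = expand([('', 100), ('forward', 8), ('attack', 61)])
print(len(opening))                     # 169 steps
print(len(opening) / STEPS_PER_SECOND)  # 8.45 seconds
```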

# Parameters

In [None]:
# Parameters:
EPOCHS = 3  # How many times we train over the dataset.
LEARNING_RATE = 0.0001  # Learning rate for the neural network.

TRAIN_MODEL_NAME = 'another_potato.pth'  # name to use when saving the trained agent.
TEST_MODEL_NAME = 'another_potato.pth'  # name to use when loading the trained agent.

TEST_EPISODES = 20  # number of episodes to test the agent for.
MAX_TEST_EPISODE_LEN = 5000  # 18k is the default for MineRLObtainDiamond.
TREEOPT_STEPS = 2000  # number of steps to run BC lumberjack for in evaluations.

# Download the data

In [None]:
minerl.data.download(directory='data', environment='MineRLTreechop-v0');

Download: https://minerl.s3.amazonaws.com/v4/MineRLTreechop-v0.tar: 100%|██████████| 1511.0/1510.73792 [00:41<00:00, 36.47MB/s]


# Train

In [None]:
display = Display(visible=0, size=(400, 300))
display.start();

In [None]:
train()  # only need to run this once.

0it [00:00, ?it/s]

Iteration 1000. Loss 1.107     
Iteration 2000. Loss 1.023     
Iteration 3000. Loss 0.963     
Iteration 4000. Loss 0.896     
Iteration 5000. Loss 0.912     
Iteration 6000. Loss 0.933     
Iteration 7000. Loss 0.896     
Iteration 8000. Loss 0.899     
Iteration 9000. Loss 0.872     
Iteration 10000. Loss 0.825     
Iteration 11000. Loss 0.837     
Iteration 12000. Loss 0.834     
Iteration 13000. Loss 0.863     
Iteration 14000. Loss 0.855     
Iteration 15000. Loss 0.856     
Iteration 16000. Loss 0.877     
Iteration 17000. Loss 0.814     
Iteration 18000. Loss 0.825     
Iteration 19000. Loss 0.798     
Iteration 20000. Loss 0.848     
Iteration 21000. Loss 0.830     
Iteration 22000. Loss 0.851     
Iteration 23000. Loss 0.803     
Iteration 24000. Loss 0.793     
Iteration 25000. Loss 0.766     
Iteration 26000. Loss 0.786     
Iteration 27000. Loss 0.744     
Iteration 28000. Loss 0.775     
Iteration 29000. Loss 0.834     
Iteration 30000. Loss 0.811     
Iteration 31000. Lo

# Start Minecraft

In [None]:
!sudo apt install tigervnc-standalone-server

In [None]:
abs_TO = Treeopt()
abs_TO.register()

env = gym.make('MineRLTreeopt-v0')

env1 = Recorder(env, './video', fps=60)  # saving environment before action shaping to use with scripted part
env = ActionShaping(env1, always_attack=True)

# Run your agent
As the code below runs, you should see episode videos and rewards appear. You can run the cell below multiple times to see different episodes.

In [None]:
network = NatureCNN((3, 64, 64), 7).cuda()
network.load_state_dict(th.load(TEST_MODEL_NAME))

num_actions = env.action_space.n
action_list = np.arange(num_actions)

action_sequence = get_action_sequence()

for episode in range(TEST_EPISODES):
    obs = env.reset()
    done = False
    total_reward = 0
    steps = 0

    # BC part to get some logs:
    for i in tqdm(range(TREEOPT_STEPS)):
        # Process the observation:
        #   - Transpose image to channels-first (HWC -> CHW), as the network expects
        #   - Add a batch dimension
        #   - Normalize pixel values to [0, 1]
        obs = th.from_numpy(obs['pov'].transpose(2, 0, 1)[None].astype(np.float32) / 255).cuda()
        # Turn logits into probabilities
        probabilities = th.softmax(network(obs), dim=1)[0]
        # Into numpy
        probabilities = probabilities.detach().cpu().numpy()
        # Sample action according to the probabilities
        action = np.random.choice(action_list, p=probabilities)

        obs, reward, done, info = env.step(action)
        total_reward += reward
        steps += 1
        if done:
            break

    # scripted part to use the logs:
    if not done:
        for i, action in enumerate(tqdm(action_sequence[:MAX_TEST_EPISODE_LEN - TREEOPT_STEPS])):
            obs, reward, done, _ = env1.step(str_to_act(env1, action))
            total_reward += reward
            steps += 1
            if done:
                break

    env1.release()
    # env1.play()
    print(f'Episode #{episode + 1} reward: {total_reward}\t\t episode length: {steps}\n')

Episode #1 reward: 0.0		 episode length: 3436
Episode #2 reward: 30.0		 episode length: 3436
Episode #3 reward: 25.0		 episode length: 3436
Episode #4 reward: 5.0		 episode length: 3436
Episode #5 reward: 13.0		 episode length: 3436
Episode #6 reward: 1.0		 episode length: 3436
Episode #7 reward: 12.0		 episode length: 3436
Episode #8 reward: 14.0		 episode length: 3436
Episode #9 reward: 14.0		 episode length: 3436
Episode #10 reward: 12.0		 episode length: 3436
Episode #11 reward: 17.0		 episode length: 3436
Episode #12 reward: 16.0		 episode length: 3436
Episode #13 reward: 9.0		 episode length: 3436
Episode #14 reward: 2.0		 episode length: 3436
Episode #15 reward: 14.0		 episode length: 3436
Episode #16 reward: 13.0		 episode length: 3436
Episode #17 reward: 8.0		 episode length: 3436
Episode #18 reward: 15.0		 episode length: 3436
Episode #19 reward: 20.0		 episode length: 3436
Episode #20 reward: 0.0		 episode length: 3436
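
To summarize the run above, the twenty printed rewards can be averaged directly. The list below is copied by hand from the episode output. These rewards come from the custom `MineRLTreeopt-v0` environment (+1 per log), so they are probably not directly comparable to the 8.6 quoted for MineRLObtainDiamond in the introduction.

```python
# Episode rewards copied from the output above (episodes 1 to 20)
rewards = [0, 30, 25, 5, 13, 1, 12, 14, 14, 12,
           17, 16, 9, 2, 14, 13, 8, 15, 20, 0]

mean_reward = sum(rewards) / len(rewards)
print(len(rewards))                # 20
print(mean_reward)                 # 12.0
print(min(rewards), max(rewards))  # 0 30
```

The spread from 0 to 30 over only 20 episodes suggests that more episodes would be needed before comparing two variants of the agent.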

