<a href="https://colab.research.google.com/github/jimmy93029/Intro2AI-Final/blob/main/AI_Final_BASALT_GAIL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div style="text-align: center">
  <img src="https://github.com/KarolisRam/MineRL2021-Intro-baselines/blob/main/img/colab_banner.png?raw=true">
</div>

# Introduction
This notebook is the installation part for the [MineRL 2022](https://minerl.io/) competition, building on the original introductory notebooks created for the MineRL 2021 competition.

## Note: About this file

This file was updated by NYCU 2024 Spring Intro2AI Team 11: まふまふ.
The original file comes from [here](https://colab.research.google.com/drive/1rJ3lGy-bG7kJRe_wYBWg7fjSaD9oOMDw?usp=sharing).

## There's a video to explain...
Please watch [this intro YouTube video](https://youtu.be/8yIrWcyWGek) for some background information. Hopefully, it will lead to more videos that explore what can be done in this environment.

If you see me (@mdda) online, please say "Hi!"

## Software 2.0
The approach we are going to use, where we take some human-written code and replace it with an AI component, is quite similar to how Tesla approaches self-driving cars. See this talk by Andrej Karpathy, Director of AI at Tesla:  
[Building the Software 2.0 Stack](https://databricks.com/session/keynote-from-tesla)


# Setup

In [None]:
%%capture
!sudo add-apt-repository -y ppa:openjdk-r/ppa
!sudo apt-get purge -y 'openjdk-*'
!sudo apt-get install -y openjdk-8-jdk
!sudo apt-get install -y xvfb xserver-xephyr vnc4server python-opengl ffmpeg
# Takes ~1min to run this
# Added later: make sure Xvfb is installed
!sudo apt-get install -y xvfb

In [None]:

# This takes ~22mins - which would hit us every time we start Colab
#   So we'll do it once, and store a '.tar.gz' of the installation into our
#   Google Drive, so that we can get it back much quicker the second time!

##%%capture
##!pip3 install --upgrade minerl # Default is 0.4.4, we want 1.0.0 for VPT
##!pip3 uninstall minerl
#!pip3 install git+https://github.com/minerllabs/minerl@v1.0.0
#
#!pip3 install pyvirtualdisplay
#!pip3 install -U colabgymrender

In [None]:
import os, sys, time

mine_env = 'mine_env'
mine_env_full = f'/content/{mine_env}'
mine_tar = f'{mine_env}.tar.gz'

if mine_env_full not in sys.path:
  sys.path.insert(0, mine_env_full)
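  # Append to PYTHONPATH without failing when the variable is not set yet
  os.environ['PYTHONPATH'] = os.pathsep.join(filter(None, [os.environ.get('PYTHONPATH'), mine_env_full]))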

mine_env, mine_env_full, mine_tar

('mine_env', '/content/mine_env', 'mine_env.tar.gz')

In [None]:
# We'll connect to our Google Drive here, and see whether we've already saved off a copy
#   This will ask permission to 'connect to your drive' : The answer is 'Yes'!
MINE_ENV_IS_NEW = True

from google.colab import drive  # google.colab contains functions specifically for interacting with Google Colab's environment.
drive.mount('/content/drive')    # mounts your Google Drive as a local file system
if os.path.isfile(f'/content/drive/MyDrive/pythonLib/{mine_tar}'): # check if "mine_env.tar.gz" is in your Google Drive
  ! cp /content/drive/MyDrive/pythonLib/$mine_tar ./$mine_tar  # ! means the command is to be executed in the shell rather than as Python code.
                                              # This command copies the file from your Google Drive to the current working directory of the Colab notebook.

  ! ls -l ./$mine_tar                         # This lists the file details such as permissions, owner, size, and modification date for the copied file in the current directory.
                                              # It helps verify that the file has been copied correctly and shows its properties.
  # e.g.: -rw------- 1 root root 1510118446 Jun 26 08:48 ./mine_env.tar.gz

  # ! tar -tzf ./$mine_tar | grep minerl | head -5    # list some contents of the compressed tar file without extracting it
  ! tar -xzf ./$mine_tar    # This extracts the contents of the tar file into the current directory

  MINE_ENV_IS_NEW = False
  # Takes 1min too (huge saving!)

sys.path.append('/content/drive/MyDrive/pythonLib')
sys.path.append('/content/drive/MyDrive/pythonLib/VPT')

"DONE"

In [None]:
# Check default packages (execute if needed)
!pip3 list

In [None]:
# Build the mine_env if necessary
try:
  from pyvirtualdisplay import Display
except ImportError:
  !pip3 install --target=$mine_env git+https://github.com/minerllabs/minerl@v1.0.2   # 21 mins
  # https://stackoverflow.com/questions/55833509/attributeerror-type-object-callable-has-no-attribute-abc-registry
  !mv $mine_env/typing.py $mine_env/MEH-typing.py  # Fix for Python3.7 ...

  !pip3 install --target=$mine_env pyvirtualdisplay  # 4 secs  # Note: Display creates a virtual framebuffer that graphical applications can use to render output as if they were using a real monitor.
                                                              # Note: This allows you to run applications that require a GUI without having an actual GUI environment installed on the system.
  !pip3 install --target=$mine_env --upgrade colabgymrender # 22 secs  # Note: colabgymrender provides a workaround by capturing the graphical output of the environment and displaying it within the notebook.

  MINE_ENV_IS_NEW = True
  # NB: some restart notices in the output ... but there's no need to restart!
  #     In any case, please wait for the 'DONE' message to print out
f"DONE, with MINE_ENV_IS_NEW={MINE_ENV_IS_NEW}"
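
The cell above uses a failed import as the signal that `mine_env` still needs to be built. A minimal sketch of the same check without actually importing the package, assuming the standard-library `importlib.util.find_spec` is acceptable here; the helper name `is_importable` is hypothetical and not part of the notebook:

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be found on sys.path."""
    # find_spec returns None when no finder can locate the module.
    return importlib.util.find_spec(module_name) is not None


# A standard-library module is found; a made-up name is not.
print(is_importable("json"))
print(is_importable("surely_not_a_real_module_xyz"))
```

In the notebook, a check such as `is_importable("pyvirtualdisplay")` could decide whether the slow `pip3 install` branch has to run.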

In [None]:
# check content of mine_env (execute if needed)
! du -b mine_env | tail -5  # mine_env = ~ 2,094,031,775 bytes overall (a little bit less)

In [None]:
# Build new env.tar.gz file in google drive (execute if needed)
if MINE_ENV_IS_NEW: #  or True
  # ! ls -l /gdrive/MyDrive/mine*
  ! rm -f ./$mine_tar   # Note: removes the existing tar.gz archive of the environment, if any, from the current directory.
  ! tar -czf ./$mine_tar $mine_env  # Note: creates a new compressed (gzipped) tar archive of the directory named by the $mine_env variable (the environment directory).
  ! ls -l ./$mine_tar
  # Without running the env...
  # -rw-r--r-- 1 root root 1505020174 Jun 26 07:26 ./mine_env.tar.gz
  # Once the minerl env has been reset once (i.e. java has built...)
  # -rw------- 1 root root 1511976116 Jun 26 08:43 ./mine_env.tar.gz
  ! tar -tzf ./$mine_tar | head
  ! cp ./$mine_tar /content/drive/MyDrive/pythonLib/  # Note: copies the newly created archive to a Google Drive directory.
  ! ls -l /content/drive/MyDrive/pythonLib/$mine_tar
"DONE"

'DONE'
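
The caching steps above (copy the archive from Drive, extract it, and rebuild and re-archive only when it is missing) can be sketched in plain Python. This is a minimal sketch under assumptions: `restore_or_build` and its parameters are hypothetical names, `fake_build` stands in for the slow `pip3 install` step, and the notebook itself uses shell commands rather than `tarfile`.

```python
import os
import tarfile
import tempfile


def restore_or_build(archive_path, work_dir, env_name, build):
    """Extract env_name from archive_path if it exists, else build and archive it.

    Returns True when the environment had to be built from scratch.
    """
    env_dir = os.path.join(work_dir, env_name)
    if os.path.isfile(archive_path):
        # Fast path: unpack the cached environment into work_dir.
        with tarfile.open(archive_path, "r:gz") as tar:
            tar.extractall(work_dir)
        return False
    # Slow path: build the environment, then cache it for next time.
    os.makedirs(env_dir, exist_ok=True)
    build(env_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(env_dir, arcname=env_name)
    return True


def fake_build(env_dir):
    # Stand-in for the real installation step.
    with open(os.path.join(env_dir, "marker.txt"), "w") as f:
        f.write("installed")


with tempfile.TemporaryDirectory() as drive, \
        tempfile.TemporaryDirectory() as first, \
        tempfile.TemporaryDirectory() as second:
    archive = os.path.join(drive, "mine_env.tar.gz")
    built_first = restore_or_build(archive, first, "mine_env", fake_build)
    built_second = restore_or_build(archive, second, "mine_env", fake_build)
    restored = os.path.isfile(os.path.join(second, "mine_env", "marker.txt"))
    print(built_first, built_second, restored)
```

The first call builds and archives the environment; the second call, in a fresh working directory, only extracts the archive, which is the saving the notebook is after.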

# Import Libraries

In [29]:
import os   # For interacting with the operating system.
import time

import numpy as np  # For numerical operations.

import gym    # To create and manage environments based on the OpenAI Gym toolkit.
import minerl

from tqdm.notebook import tqdm  # For displaying progress bars in Jupyter notebooks.
from colabgymrender.recorder import Recorder # To facilitate rendering of Gym environments in Google Colab.
from pyvirtualdisplay import Display # To create a virtual display to render environments in a headless server or environment like Google Colab.

import logging
logging.disable(logging.ERROR) # reduce clutter, remove if something doesn't work to see the error logs.

np.__version__  # '1.21.6' => shows that NumPy is being read from our ~/mine_env directory
# The NumPy version you see may differ from the one above.
# About warnings: they come from a locally installed package, so if an error occurs, comment out the offending line.

import cv2
#from google.colab.patches import cv2_imshow
#from PIL import Image
import matplotlib.pylab as plt

import glob
import json
import torch  # the code below refers to both `torch` and the `th` alias
import torch as th
import torchvision.transforms.functional as TF
from torch import nn
from torch.nn import functional as F
from torch import optim
from run_inverse_dynamics_model import json_action_to_env_action


from torch.nn import Module, Sequential, Linear, Tanh, Parameter, Embedding
from torch.distributions import Categorical, MultivariateNormal

if torch.cuda.is_available():
    from torch.cuda import FloatTensor
    torch.set_default_tensor_type(torch.cuda.FloatTensor)
else:
    from torch import FloatTensor

# Download Dataset

In [None]:
from download_dataset import download_file
download_file(400) # default is 400, about 40 GB?

# Construct Inverse Dynamics Model Agent

Optional

In [None]:
from inverse_dynamics_model import load_IDM_agent
IDMAgent = load_IDM_agent()

In [None]:
# Test for IDMAgent
# from agent import ENV_KWARGS # need to modify
# required_resolution = ENV_KWARGS["resolution"]
# files = glob.glob("/content/MineRLBasaltFindCave-v0/*.mp4")
# video_path = files[0]
# json_path = video_path.replace(".mp4", ".jsonl")

# cap = cv2.VideoCapture(video_path)
# frames = []

# json_index = 0
# with open(json_path) as json_file:
#   json_lines = json_file.readlines()
#   json_data = "[" + ",".join(json_lines) + "]"
#   json_data = json.loads(json_data)

# for _ in range(5000):
#   ret, frame = cap.read()
#   break
#   if not ret:
#     break
#   assert frame.shape[0] == required_resolution[1] and frame.shape[1] == required_resolution[0], "Video must be of resolution {}".format(required_resolution)
#   # BGR -> RGB
#   frames.append(frame[..., ::-1])
#   break
#   if len(frames) == 100 or len(frames) == 50:
#     l = len(frames)
#     fs = np.stack(frames)
#     predicted_actions = IDMAgent.predict_actions(fs)
#     for i in range(50):
#       env_action, _ = json_action_to_env_action(json_data[json_index])
#       json_index += 1
#       for y, (action_name, action_array) in enumerate(predicted_actions.items()):
#         print(f"{action_name}: {action_array[0, (l - 50 + i)]} ({env_action[action_name]}), ", end = "")
#       print("\n")
#     frames = frames[50:99]

# predicted_actions = IDMAgent.predict_actions(fs)
# l = len(frames)
# for i in range(50, l):
#   env_action, _ = json_action_to_env_action(json_data[json_index])
#   json_index += 1
#   for y, (action_name, action_array) in enumerate(predicted_actions.items()):
#     print(f"{action_name}: {action_array[0, (l - 50 + i)]} ({env_action[action_name]}), ", end = "")
#   print("\n")

# Neural Network for GAIL

In [None]:
# transform of env action and agent action
env = gym.make("MineRLBasaltFindCave-v0")

NOOP = env.action_space.no_op()

# binary encoding of env_action
# forward, back, left, right, sneak, sprint(run), jump, ESC = 2^7, ..., 2^0
ACTION_LIST = ["forward", "back", "left", "right", "sneak", "sprint", "jump", "ESC"]

device = th.device("cuda" if th.cuda.is_available() else "cpu")
print(device)

cuda


### Support functions

In [None]:
def get_flat_grads(f, net):
    flat_grads = torch.cat([
        grad.view(-1)
        for grad in torch.autograd.grad(f, net.parameters(), create_graph=True)
    ])

    return flat_grads


def get_flat_params(net):
    return torch.cat([param.view(-1) for param in net.parameters()])


def set_params(net, new_flat_params):
    start_idx = 0
    for param in net.parameters():
        end_idx = start_idx + np.prod(list(param.shape))
        param.data = torch.reshape(
            new_flat_params[start_idx:end_idx], param.shape
        )

        start_idx = end_idx


def conjugate_gradient(Av_func, b, max_iter=10, residual_tol=1e-10):
    x = torch.zeros_like(b)
    r = b - Av_func(x)
    p = r
    rsold = r.norm() ** 2

    for _ in range(max_iter):
        Ap = Av_func(p)
        alpha = rsold / torch.dot(p, Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rsnew = r.norm() ** 2
        if torch.sqrt(rsnew) < residual_tol:
            break
        p = r + (rsnew / rsold) * p
        rsold = rsnew

    return x


def rescale_and_linesearch(
    g, s, Hs, max_kl, L, kld, old_params, pi, max_iter=10,
    success_ratio=0.1
):
    set_params(pi, old_params)
    L_old = L().detach()

    beta = torch.sqrt((2 * max_kl) / torch.dot(s, Hs))

    for _ in range(max_iter):
        new_params = old_params + beta * s

        set_params(pi, new_params)
        kld_new = kld().detach()

        L_new = L().detach()

        actual_improv = L_new - L_old
        approx_improv = torch.dot(g, beta * s)
        ratio = actual_improv / approx_improv

        if ratio > success_ratio \
            and actual_improv > 0 \
                and kld_new < max_kl:
            return new_params

        beta *= 0.5

    print("The line search failed!")
    return old_params
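
`conjugate_gradient` solves `A x = b` using only matrix-vector products `Av_func(v)`, which is why the training code can pass a Hessian-vector product instead of a full matrix. Below is a minimal pure-Python sketch of the same iteration on a 2×2 symmetric positive-definite system; `cg_solve`, `dot`, and the example matrix are illustrative assumptions, not part of the notebook:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cg_solve(matvec, b, max_iter=10, residual_tol=1e-10):
    """Solve A x = b given only the product matvec(v) = A v."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x, with x = 0
    p = list(r)          # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        if rs_new ** 0.5 < residual_tol:
            break
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = cg_solve(lambda v: [dot(row, v) for row in A], b)
# Residual of the solved system; it should be close to zero.
residual = [dot(row, x) - bi for row, bi in zip(A, b)]
print(x, residual)
```

For this system the exact solution is `[1/11, 7/11]`, and an n-dimensional symmetric positive-definite system needs at most n iterations in exact arithmetic.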

In [None]:
def env_action_to_agent(env_action: dict):
    target_action_C = int(0)
    target_action_R = env_action["camera"]
    for act in ACTION_LIST:
        target_action_C *= 2
        target_action_C += 1 if env_action.get(act) == 1 else 0
    if target_action_C == 0 and np.array_equal(target_action_R, np.zeros(2)):
        isNoop = True
    else:
        isNoop = False
    return [target_action_C, target_action_R, isNoop]

def agent_action_to_env(agent_action_C, agent_action_R):
    target_action = NOOP.copy()  # copy so the shared NOOP template is not mutated
    ACTION_LIST_Rev = ACTION_LIST.copy()
    ACTION_LIST_Rev.reverse()
    for act in ACTION_LIST_Rev:
        target_action[act] = 1 if agent_action_C % 2 == 1 else 0
        agent_action_C //= 2  # Use integer division to keep agent_action_C as an integer
    target_action["camera"] = agent_action_R
    return target_action

def img_to_tensor(frames):
  target_tensor = th.empty((0, 3, 227, 227), dtype = th.float32)
  for frame in frames:
    frame = cv2.resize(frame, (227, 227))
    frame = TF.to_tensor(frame).unsqueeze(0)
    target_tensor = th.cat((target_tensor, frame), dim = 0)
  return target_tensor
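
`env_action_to_agent` packs the eight button flags into one integer, most significant bit first, and `agent_action_to_env` unpacks them in reverse order. Here is a self-contained sketch of that round trip, without the camera component or the MineRL environment; `encode_buttons` and `decode_buttons` are hypothetical names used only for this illustration:

```python
ACTION_LIST = ["forward", "back", "left", "right", "sneak", "sprint", "jump", "ESC"]


def encode_buttons(action: dict) -> int:
    """Pack the button flags into an integer; "forward" is the highest bit."""
    code = 0
    for name in ACTION_LIST:
        code = code * 2 + (1 if action.get(name) == 1 else 0)
    return code


def decode_buttons(code: int) -> dict:
    """Unpack an integer in [0, 255] back into button flags."""
    action = {}
    for name in reversed(ACTION_LIST):
        action[name] = code % 2
        code //= 2
    return action


pressed = {"forward": 1, "jump": 1}
code = encode_buttons(pressed)   # 2**7 + 2**1 = 130
decoded = decode_buttons(code)
print(code, decoded)
```

With eight buttons there are 2**8 = 256 possible codes, so a classifier over this encoding needs 256 output classes.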

### Network

In [None]:
class PolicyNetwork(Module):
    def __init__(self, state_dim, action_dim, discrete) -> None:
        super().__init__()

        self.net = Sequential(
            Linear(state_dim, 50),
            Tanh(),
            Linear(50, 50),
            Tanh(),
            Linear(50, 50),
            Tanh(),
            Linear(50, action_dim),
        )

        self.state_dim = state_dim
        self.action_dim = action_dim
        self.discrete = discrete

        if not self.discrete:
            self.log_std = Parameter(torch.zeros(action_dim))

    def forward(self, states):
        if self.discrete:
            probs = torch.softmax(self.net(states), dim=-1)
            distb = Categorical(probs)
        else:
            mean = self.net(states)

            std = torch.exp(self.log_std)
            cov_mtx = torch.eye(self.action_dim) * (std ** 2)

            distb = MultivariateNormal(mean, cov_mtx)

        return distb


class ValueNetwork(Module):
    def __init__(self, state_dim) -> None:
        super().__init__()

        self.net = Sequential(
            Linear(state_dim, 50),
            Tanh(),
            Linear(50, 50),
            Tanh(),
            Linear(50, 50),
            Tanh(),
            Linear(50, 1),
        )

    def forward(self, states):
        return self.net(states)


class Discriminator(Module):
    def __init__(self, state_dim, action_dim, discrete) -> None:
        super().__init__()

        self.state_dim = state_dim
        self.action_dim = action_dim
        self.discrete = discrete

        if self.discrete:
            self.act_emb = Embedding(
                action_dim, state_dim
            )
            self.net_in_dim = 2 * state_dim
        else:
            self.net_in_dim = state_dim + action_dim

        self.net = Sequential(
            Linear(self.net_in_dim, 50),
            Tanh(),
            Linear(50, 50),
            Tanh(),
            Linear(50, 50),
            Tanh(),
            Linear(50, 1),
        )

    def forward(self, states, actions):
        return torch.sigmoid(self.get_logits(states, actions))

    def get_logits(self, states, actions):
        if self.discrete:
            actions = self.act_emb(actions.long())

        sa = torch.cat([states, actions], dim=-1)

        return self.net(sa)


class Expert(Module):
    def __init__(
        self,
        state_dim,
        action_dim,
        discrete,
        train_config=None
    ) -> None:
        super().__init__()

        self.state_dim = state_dim
        self.action_dim = action_dim
        self.discrete = discrete
        self.train_config = train_config

        self.pi = PolicyNetwork(self.state_dim, self.action_dim, self.discrete)

    def get_networks(self):
        return [self.pi]

    def act(self, state):
        self.pi.eval()

        state = FloatTensor(state)
        distb = self.pi(state)

        action = distb.sample().detach().cpu().numpy()

        return action
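
In discrete mode, `PolicyNetwork.forward` turns the final layer's outputs into probabilities with a softmax and wraps them in a `Categorical` distribution, from which `act` samples. The following is a minimal pure-Python sketch of that step, assuming made-up logits for three actions; `softmax` and `sample_action` are illustrative helpers, not the PyTorch functions:

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng):
    """Draw one action index according to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 0.5, -1.0]                # hypothetical network outputs
probs = softmax(logits)
rng = random.Random(0)
samples = [sample_action(probs, rng) for _ in range(1000)]
print(probs, samples.count(0) / len(samples))
```

Sampling rather than taking the arg-max keeps the policy stochastic, which is what the `distb.sample()` call in `act` does.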

# FindCave Agent

In [35]:
import torch
import torch.optim as optim
import numpy as np
import gym
import minerl
from torch.nn import Module, Linear, Sequential, Tanh, Embedding, Parameter
from torch.distributions import Categorical, MultivariateNormal
import cv2
import glob
import json
import random
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader
from torchvision.transforms import ToTensor

# Networks redefined here, matching the GAIL code above
class PolicyNetwork(Module):
    def __init__(self, state_dim, action_dim, discrete) -> None:
        super().__init__()

        self.net = Sequential(
            Linear(state_dim, 50),
            Tanh(),
            Linear(50, 50),
            Tanh(),
            Linear(50, 50),
            Tanh(),
            Linear(50, action_dim),
        )

        self.state_dim = state_dim
        self.action_dim = action_dim
        self.discrete = discrete

        if not self.discrete:
            self.log_std = Parameter(torch.zeros(action_dim))

    def forward(self, states):
        if self.discrete:
            probs = torch.softmax(self.net(states), dim=-1)
            distb = Categorical(probs)
        else:
            mean = self.net(states)
            std = torch.exp(self.log_std)
            cov_mtx = torch.eye(self.action_dim) * (std ** 2)
            distb = MultivariateNormal(mean, cov_mtx)
        return distb

class ValueNetwork(Module):
    def __init__(self, state_dim) -> None:
        super().__init__()
        self.net = Sequential(
            Linear(state_dim, 50),
            Tanh(),
            Linear(50, 50),
            Tanh(),
            Linear(50, 50),
            Tanh(),
            Linear(50, 1),
        )
    def forward(self, states):
        return self.net(states)

class Discriminator(Module):
    def __init__(self, state_dim, action_dim, discrete) -> None:
        super().__init__()
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.discrete = discrete
        if self.discrete:
            self.act_emb = Embedding(action_dim, state_dim)
            self.net_in_dim = 2 * state_dim
        else:
            self.net_in_dim = state_dim + action_dim

        self.net = Sequential(
            Linear(self.net_in_dim, 50),
            Tanh(),
            Linear(50, 50),
            Tanh(),
            Linear(50, 50),
            Tanh(),
            Linear(50, 1),
        )

    def forward(self, states, actions):
        return torch.sigmoid(self.get_logits(states, actions))

    def get_logits(self, states, actions):
        if self.discrete:
            actions = self.act_emb(actions.long())
        sa = torch.cat([states, actions], dim=-1)
        return self.net(sa)

class GAIL(Module):
    def __init__(self, state_dim, action_dim, discrete, train_config=None) -> None:
        super().__init__()
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.discrete = discrete
        self.train_config = train_config
        self.pi = PolicyNetwork(self.state_dim, self.action_dim, self.discrete)
        self.v = ValueNetwork(self.state_dim)
        self.d = Discriminator(self.state_dim, self.action_dim, self.discrete)

    def get_networks(self):
        return [self.pi, self.v]

    def act(self, state):
        self.pi.eval()
        state = torch.FloatTensor(state).unsqueeze(0)
        distb = self.pi(state)
        action = distb.sample().detach().cpu().numpy().squeeze()
        return action

    def train(self, env, expert, render=False):
        # Add your training logic here, similar to the provided GAIL code
        pass

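def img_to_tensor(images):
    transform = ToTensor()
    tensors = [transform(image) for image in images]
    # Callers move the batch to the right device with .to(self.device),
    # so this also works on a CPU-only runtime.
    return torch.stack(tensors)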

class FindCaveGAILAgent(GAIL):
    def __init__(self, state_dim, action_dim, discrete, train_config=None):
        super().__init__(state_dim, action_dim, discrete, train_config)
        self.policyC = self.pi
        self.policyR = self.pi
        self.optimizerC = optim.Adam(self.policyC.parameters(), lr=0.001)
        self.optimizerR = optim.Adam(self.policyR.parameters(), lr=0.001)
        self.lossCFunc = torch.nn.CrossEntropyLoss()
        self.lossRFunc = torch.nn.MSELoss()

        # Move models to device
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.policyC.to(self.device)
        self.policyR.to(self.device)

    def train(self):
        # Behavioural-cloning style loop over the recorded videos, modelled on the `FindCaveAgent` class
        video_paths = glob.glob("/content/MineRLBasaltFindCave-v0/*.mp4")
        json_paths = [vp.replace(".mp4", ".jsonl") for vp in video_paths]

        batch_count = 0
        batch_loss_C = 0
        batch_loss_R = 0

        n_epochs = 10
        batch_size = 32

        for epoch in range(n_epochs):
            print(f"Epoch {epoch+1}")
            cap_list = []
            json_data_list = []
            cur_json_index = []
            for video_path, json_path in zip(video_paths, json_paths):
                cap_list.append(cv2.VideoCapture(video_path))
                with open(json_path) as jf:
                    json_lines = jf.readlines()
                    json_data = "[" + ",".join(json_lines) + "]"
                    json_data = json.loads(json_data)
                    json_data_list.append(json_data)
                cur_json_index.append(0)

            while len(cap_list) >= batch_size:
                batch_cap = random.sample(cap_list, batch_size)
                batch_index = [cap_list.index(cap) for cap in batch_cap]

                batch_frames = [cap.read()[1] for cap in batch_cap]
                batch_frames = img_to_tensor(batch_frames).to(self.device)

                batch_cur_json_index = [cur_json_index[index] for index in batch_index]
                batch_env_actions = [json_action_to_env_action(json_data_list[index][json_index])[0] for index, json_index in zip(batch_index, batch_cur_json_index)]

                batch_target_actions_C = [env_action_to_agent(env_action)[0] for env_action in batch_env_actions]
                batch_target_actions_R = [env_action_to_agent(env_action)[1] for env_action in batch_env_actions]
                batch_target_actions_C = torch.tensor(batch_target_actions_C, dtype=torch.long).to(self.device)
                batch_target_actions_R = torch.tensor(batch_target_actions_R, dtype=torch.float).to(self.device)

                # Training
                self.optimizerC.zero_grad()
                batch_output_C = self.policyC(batch_frames)
                loss_C = self.lossCFunc(batch_output_C, batch_target_actions_C)
                loss_C.backward()
                self.optimizerC.step()

                self.optimizerR.zero_grad()
                batch_output_R = self.policyR(batch_frames)
                loss_R = self.lossRFunc(batch_output_R, batch_target_actions_R)
                loss_R.backward()
                self.optimizerR.step()

                batch_loss_C += loss_C.item()
                batch_loss_R += loss_R.item()
                batch_count += 1
                if batch_count % 100 == 0:
                    print(f'Batch {batch_count-100} to {batch_count-1}, LossC: {batch_loss_C}, LossR: {batch_loss_R}')
                    batch_loss_C = 0
                    batch_loss_R = 0

                del_indices = []
                for index in batch_index:
                    cur_json_index[index] += 1
                    if cur_json_index[index] >= len(json_data_list[index]):
                        del_indices.append(index)

                del_indices.sort(reverse=True)
                for index in del_indices:
                    cap_list[index].release()
                    del cap_list[index]
                    del json_data_list[index]
                    del cur_json_index[index]

            for cap in cap_list:
                cap.release()

    def predict(self, observe):
        with torch.no_grad():
            obs_tensor = img_to_tensor([observe]).to(self.device)
            resultC = self.policyC(obs_tensor).squeeze().argmax().cpu().numpy()
            resultR = self.policyR(obs_tensor).squeeze().cpu().numpy()
            env_action = agent_action_to_env(resultC, resultR)
        return env_action

    def save_model_weights(self, path="minerl_weights.pth"):
        torch.save({
            'policyC_state_dict': self.policyC.state_dict(),
            'optimizerC_state_dict': self.optimizerC.state_dict(),
            'policyR_state_dict': self.policyR.state_dict(),
            'optimizerR_state_dict': self.optimizerR.state_dict(),
        }, path)

    def load_model_weights(self, path="minerl_weights.pth"):
        checkpoint = torch.load(path, map_location=self.device)  # also loads on a CPU-only runtime
        self.policyC.load_state_dict(checkpoint['policyC_state_dict'])
        self.optimizerC.load_state_dict(checkpoint['optimizerC_state_dict'])
        self.policyR.load_state_dict(checkpoint['policyR_state_dict'])
        self.optimizerR.load_state_dict(checkpoint['optimizerR_state_dict'])

TA = FindCaveGAILAgent(state_dim=3, action_dim=10, discrete=True)
TA.train()


Epoch 1


RuntimeError: mat1 and mat2 shapes cannot be multiplied (34560x640 and 3x50)
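
The shapes in the error message are consistent with a batch of raw video frames being passed straight into `Linear(state_dim, 50)` with `state_dim=3`. This reading assumes 640×360 RGB frames and the batch size of 32 used above; the arithmetic below is a sketch of that interpretation, not output from the notebook:

```python
batch_size, channels, height, width = 32, 3, 360, 640   # assumed frame layout (N, C, H, W)

# A Linear layer acts on the last dimension and treats all leading
# dimensions as rows, so the input is viewed as a (rows x width) matrix.
rows = batch_size * channels * height
print(rows, width)                # 34560 640, matching "mat1" in the error

# One possible fix is to flatten each frame, which needs this many input features:
flat_features = channels * height * width
print(flat_features)              # 691200

# Resizing to 227x227 first, as the earlier img_to_tensor does, gives:
resized_features = channels * 227 * 227
print(resized_features)           # 154587
```

Under this reading, `state_dim` would have to equal the flattened feature count, with the frames flattened before the first `Linear` layer, or the fully connected network would need a convolutional front end.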

In [None]:
class GAIL(Module):
    def __init__(
        self,
        state_dim,
        action_dim,
        discrete,
        train_config=None
    ) -> None:
        super().__init__()

        self.state_dim = state_dim
        self.action_dim = action_dim
        self.discrete = discrete
        self.train_config = train_config

        self.pi = PolicyNetwork(self.state_dim, self.action_dim, self.discrete)
        self.v = ValueNetwork(self.state_dim)

        self.d = Discriminator(self.state_dim, self.action_dim, self.discrete)

    def get_networks(self):
        return [self.pi, self.v]

    def act(self, state):
        self.pi.eval()

        state = FloatTensor(state)
        distb = self.pi(state)

        action = distb.sample().detach().cpu().numpy()

        return action

    def train(self, env, expert, render=False):
        num_iters = self.train_config["num_iters"]
        num_steps_per_iter = self.train_config["num_steps_per_iter"]
        horizon = self.train_config["horizon"]
        lambda_ = self.train_config["lambda"]
        gae_gamma = self.train_config["gae_gamma"]
        gae_lambda = self.train_config["gae_lambda"]
        eps = self.train_config["epsilon"]
        max_kl = self.train_config["max_kl"]
        cg_damping = self.train_config["cg_damping"]
        normalize_advantage = self.train_config["normalize_advantage"]

        opt_d = torch.optim.Adam(self.d.parameters())

        exp_rwd_iter = []

        exp_obs = []
        exp_acts = []

        steps = 0
        while steps < num_steps_per_iter:
            ep_obs = []
            ep_rwds = []

            t = 0
            done = False

            ob = env.reset()

            while not done and steps < num_steps_per_iter:
                act = expert.act(ob)

                ep_obs.append(ob)
                exp_obs.append(ob)
                exp_acts.append(act)

                if render:
                    env.render()
                ob, rwd, done, info = env.step(act)

                ep_rwds.append(rwd)

                t += 1
                steps += 1

                if horizon is not None:
                    if t >= horizon:
                        done = True
                        break

            if done:
                exp_rwd_iter.append(np.sum(ep_rwds))

            ep_obs = FloatTensor(np.array(ep_obs))
            ep_rwds = FloatTensor(ep_rwds)

        exp_rwd_mean = np.mean(exp_rwd_iter)
        print(
            "Expert Reward Mean: {}".format(exp_rwd_mean)
        )

        exp_obs = FloatTensor(np.array(exp_obs))
        exp_acts = FloatTensor(np.array(exp_acts))

        rwd_iter_means = []
        for i in range(num_iters):
            rwd_iter = []

            obs = []
            acts = []
            rets = []
            advs = []
            gms = []

            steps = 0
            while steps < num_steps_per_iter:
                ep_obs = []
                ep_acts = []
                ep_rwds = []
                ep_costs = []
                ep_disc_costs = []
                ep_gms = []
                ep_lmbs = []

                t = 0
                done = False

                ob = env.reset()

                while not done and steps < num_steps_per_iter:
                    act = self.act(ob)

                    ep_obs.append(ob)
                    obs.append(ob)

                    ep_acts.append(act)
                    acts.append(act)

                    if render:
                        env.render()
                    ob, rwd, done, info = env.step(act)

                    ep_rwds.append(rwd)
                    ep_gms.append(gae_gamma ** t)
                    ep_lmbs.append(gae_lambda ** t)

                    t += 1
                    steps += 1

                    if horizon is not None:
                        if t >= horizon:
                            done = True
                            break

                if done:
                    rwd_iter.append(np.sum(ep_rwds))

                ep_obs = FloatTensor(np.array(ep_obs))
                ep_acts = FloatTensor(np.array(ep_acts))
                ep_rwds = FloatTensor(ep_rwds)
                # ep_disc_rwds = FloatTensor(ep_disc_rwds)
                ep_gms = FloatTensor(ep_gms)
                ep_lmbs = FloatTensor(ep_lmbs)

                ep_costs = (-1) * torch.log(self.d(ep_obs, ep_acts))\
                    .squeeze().detach()
                ep_disc_costs = ep_gms * ep_costs

                ep_disc_rets = FloatTensor(
                    [sum(ep_disc_costs[i:]) for i in range(t)]
                )
                ep_rets = ep_disc_rets / ep_gms

                rets.append(ep_rets)

                self.v.eval()
                curr_vals = self.v(ep_obs).detach()
                next_vals = torch.cat(
                    (self.v(ep_obs)[1:], FloatTensor([[0.]]))
                ).detach()
                ep_deltas = ep_costs.unsqueeze(-1)\
                    + gae_gamma * next_vals\
                    - curr_vals

                ep_advs = FloatTensor([
                    ((ep_gms * ep_lmbs)[:t - j].unsqueeze(-1) * ep_deltas[j:])
                    .sum()
                    for j in range(t)
                ])
                advs.append(ep_advs)

                gms.append(ep_gms)

            rwd_iter_means.append(np.mean(rwd_iter))
            print(
                "Iterations: {},   Reward Mean: {}"
                .format(i + 1, np.mean(rwd_iter))
            )

            obs = FloatTensor(np.array(obs))
            acts = FloatTensor(np.array(acts))
            rets = torch.cat(rets)
            advs = torch.cat(advs)
            gms = torch.cat(gms)

            if normalize_advantage:
                advs = (advs - advs.mean()) / advs.std()

            self.d.train()
            exp_scores = self.d.get_logits(exp_obs, exp_acts)
            nov_scores = self.d.get_logits(obs, acts)

            opt_d.zero_grad()
            loss = torch.nn.functional.binary_cross_entropy_with_logits(
                exp_scores, torch.zeros_like(exp_scores)
            ) \
                + torch.nn.functional.binary_cross_entropy_with_logits(
                    nov_scores, torch.ones_like(nov_scores)
                )
            loss.backward()
            opt_d.step()

            self.v.train()
            old_params = get_flat_params(self.v).detach()
            old_v = self.v(obs).detach()

            def constraint():
                return ((old_v - self.v(obs)) ** 2).mean()

            grad_diff = get_flat_grads(constraint(), self.v)

            def Hv(v):
                hessian = get_flat_grads(torch.dot(grad_diff, v), self.v)\
                    .detach()

                return hessian

            g = get_flat_grads(
                ((-1) * (self.v(obs).squeeze() - rets) ** 2).mean(), self.v
            ).detach()
            s = conjugate_gradient(Hv, g).detach()

            Hs = Hv(s).detach()
            alpha = torch.sqrt(2 * eps / torch.dot(s, Hs))

            new_params = old_params + alpha * s

            set_params(self.v, new_params)

            self.pi.train()
            old_params = get_flat_params(self.pi).detach()
            old_distb = self.pi(obs)

            def L():
                distb = self.pi(obs)

                return (advs * torch.exp(
                            distb.log_prob(acts)
                            - old_distb.log_prob(acts).detach()
                        )).mean()

            def kld():
                distb = self.pi(obs)

                if self.discrete:
                    old_p = old_distb.probs.detach()
                    p = distb.probs

                    return (old_p * (torch.log(old_p) - torch.log(p)))\
                        .sum(-1)\
                        .mean()

                else:
                    old_mean = old_distb.mean.detach()
                    old_cov = old_distb.covariance_matrix.sum(-1).detach()
                    mean = distb.mean
                    cov = distb.covariance_matrix.sum(-1)

                    return (0.5) * (
                            (old_cov / cov).sum(-1)
                            + (((old_mean - mean) ** 2) / cov).sum(-1)
                            - self.action_dim
                            + torch.log(cov).sum(-1)
                            - torch.log(old_cov).sum(-1)
                        ).mean()

            grad_kld_old_param = get_flat_grads(kld(), self.pi)

            def Hv(v):
                hessian = get_flat_grads(
                    torch.dot(grad_kld_old_param, v),
                    self.pi
                ).detach()

                return hessian + cg_damping * v

            g = get_flat_grads(L(), self.pi).detach()

            s = conjugate_gradient(Hv, g).detach()
            Hs = Hv(s).detach()

            new_params = rescale_and_linesearch(
                g, s, Hs, max_kl, L, kld, old_params, self.pi
            )

            disc_causal_entropy = ((-1) * gms * self.pi(obs).log_prob(acts))\
                .mean()
            grad_disc_causal_entropy = get_flat_grads(
                disc_causal_entropy, self.pi
            )
            new_params += lambda_ * grad_disc_causal_entropy

            set_params(self.pi, new_params)

        return exp_rwd_mean, rwd_iter_means
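
The per-episode advantage loop above appears to compute a generalized advantage estimate: assuming `ep_gms` and `ep_lmbs` hold successive powers of `gae_gamma` and `gae_lambda`, each `ep_advs[j]` is the sum of `(gamma * lambda) ** k * delta[j + k]` over the rest of the episode. As a rough sketch in plain Python (not the notebook's own code; both function names are made up for illustration), the same values can be obtained with a single backward pass instead of a nested sum:

```python
def gae_naive(deltas, gamma, lam):
    """O(T^2) form: A_j = sum_k (gamma * lam)^k * delta_{j+k}."""
    t = len(deltas)
    return [
        sum((gamma * lam) ** k * deltas[j + k] for k in range(t - j))
        for j in range(t)
    ]


def gae_recursive(deltas, gamma, lam):
    """O(T) form: A_j = delta_j + gamma * lam * A_{j+1}, with A_T = 0."""
    advs = [0.0] * len(deltas)
    running = 0.0
    # Walk the episode backwards, accumulating the discounted sum of deltas.
    for j in reversed(range(len(deltas))):
        running = deltas[j] + gamma * lam * running
        advs[j] = running
    return advs


# Illustrative TD residuals; these are not values from the notebook.
deltas = [1.0, -0.5, 2.0, 0.25]
naive = gae_naive(deltas, gamma=0.99, lam=0.95)
fast = gae_recursive(deltas, gamma=0.99, lam=0.95)
```

Both functions should agree up to floating-point error. The recursive form is linear in the episode length, which may matter for long MineRL episodes.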

In [None]:
# TODO: train and test func

class FindCaveAgent():

  def __init__(self, learning_rate = 0.001):

    # For classification: 8 binary buttons -> 2**8 = 256 joint classes
    self.policyC = AlexNet(output_size = 256).to(device)
    self.optimizerC = optim.Adam(self.policyC.parameters(), lr = learning_rate)
    self.lossCFunc = nn.CrossEntropyLoss()

    # For regression (camera)
    self.policyR = AlexNet(output_size = 2).to(device)
    self.optimizerR = optim.Adam(self.policyR.parameters(), lr = learning_rate)
    self.lossRFunc = nn.MSELoss()

  def train(self, batch_size = 20):

    video_src = glob.glob("/content/MineRLBasaltFindCave-v0/*.mp4")
    vcaps = []
    vlength = []
    action_data = []
    action_index = 0
    counter = 0
    random_video = np.random.randint(len(video_src), size = batch_size)
    for x in random_video:
      vcaps.append(cv2.VideoCapture(video_src[x]))
      vlength.append(int(vcaps[-1].get(cv2.CAP_PROP_FRAME_COUNT)) - 1)
      with open(video_src[x].replace(".mp4", ".jsonl")) as json_file:
        # Parse the JSONL file one record per line, skipping blank lines
        json_data = [json.loads(line) for line in json_file if line.strip()]
        action_data.append(json_data)
    while len(vcaps) != 0:
      counter += 1
      if counter % 20 == 0:
        print(counter)
      if action_index == 3000:
        break
      frames = np.array([vcap.read()[1] for vcap in vcaps])
      actionsC = np.empty(batch_size)
      actionsR = np.empty((batch_size, 2))
      pop_cap = []
      for i in range(batch_size):
        if action_index < vlength[i]:
          actions = env_action_to_agent(json_action_to_env_action(action_data[i][action_index])[0])
          actionsC[i] = actions[0]
          actionsR[i] = actions[1]
        else:
          actionsC[i] = 1
          actionsR[i] = [0, 0]
          pop_cap.append(i)
      if len(pop_cap) > 0:
        pop_cap.reverse()
        for i in pop_cap:
          vcaps[i].release()
          vcaps.pop(i)
          vlength.pop(i)
          action_data.pop(i)
          batch_size -= 1
      frames_tensor = img_to_tensor(frames).to(device)
      actionC_tensor = th.LongTensor(actionsC).to(device)
      actionR_tensor = th.FloatTensor(actionsR).to(device)
      resultC = self.policyC(frames_tensor)
      resultR = self.policyR(frames_tensor)

      self.optimizerC.zero_grad()
      lossC = self.lossCFunc(resultC, actionC_tensor)
      lossC.backward()
      if counter % 20 == 0: print(lossC)
      self.optimizerC.step()

      self.optimizerR.zero_grad()
      lossR = self.lossRFunc(resultR, actionR_tensor)
      lossR.backward()
      if counter % 20 == 0: print(lossR)
      self.optimizerR.step()

      action_index += 1
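    # Release any captures still open (e.g. after hitting the 3000-step cap)
    for vcap in vcaps:
      vcap.release()
    self.save_model_weights()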

  def test(self):

    files = glob.glob("/content/MineRLBasaltFindCave-v0/*.mp4")
    video_path = files[0]
    json_path = video_path.replace(".mp4", ".jsonl")

    cap = cv2.VideoCapture(video_path)
    frames = []

    ret, frame = cap.read()
    frames.append(frame[::-1])
    ret, frame = cap.read()
    frames.append(frame[::-1])
    frames_tensor = img_to_tensor(frames).to(device)
    result = self.policyC.forward(frames_tensor).detach()
    print(result, type(result))

  def predict(self, observe):

    with th.no_grad():

      obs_tensor = img_to_tensor([observe, ]).to(device)
      resultC = self.policyC(obs_tensor).squeeze().argmax().cpu().numpy()
      resultR = self.policyR(obs_tensor).squeeze().cpu().numpy()
      env_action = agent_action_to_env(resultC, resultR)

    return env_action

  def save_model_weights(self, path="minerl_weights.pth"):
    # Save the state dictionaries of models and optimizers
    th.save({
        'policyC_state_dict': self.policyC.state_dict(),
        'optimizerC_state_dict': self.optimizerC.state_dict(),
        'policyR_state_dict': self.policyR.state_dict(),
        'optimizerR_state_dict': self.optimizerR.state_dict(),
    }, path)

  def load_model_weights(self, path="minerl_weights.pth"):
    # Load the state dictionaries of models and optimizers
    # map_location lets a checkpoint saved on GPU load on a CPU-only runtime
    checkpoint = th.load(path, map_location=device)
    self.policyC.load_state_dict(checkpoint['policyC_state_dict'])
    self.optimizerC.load_state_dict(checkpoint['optimizerC_state_dict'])
    self.policyR.load_state_dict(checkpoint['policyR_state_dict'])
    self.optimizerR.load_state_dict(checkpoint['optimizerR_state_dict'])


TA = FindCaveAgent()
TA.train()
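
`policyC` predicts one of 256 classes for 8 buttons, which suggests that each class index is a bitmask over the button states. The helpers `env_action_to_agent` and `agent_action_to_env` are defined elsewhere in the notebook, so the following is only a guess at the idea: the button names are real MineRL action keys, but their order here is hypothetical and may not match the notebook's actual mapping.

```python
# Hypothetical button order; the notebook's own helpers may use a different one.
BUTTONS = ["forward", "back", "left", "right", "jump", "sneak", "sprint", "attack"]


def buttons_to_index(pressed):
    """Pack a dict of 0/1 button states into one integer in [0, 256)."""
    index = 0
    for bit, name in enumerate(BUTTONS):
        if pressed.get(name, 0):
            index |= 1 << bit
    return index


def index_to_buttons(index):
    """Unpack an integer in [0, 256) back into a dict of 0/1 button states."""
    return {name: (index >> bit) & 1 for bit, name in enumerate(BUTTONS)}


# Round trip for "forward + jump" under the assumed order.
example = {"forward": 1, "jump": 1}
encoded = buttons_to_index(example)
decoded = index_to_buttons(encoded)
```

Treating the joint button state as a single class lets one `nn.CrossEntropyLoss` cover all combinations, at the cost of a 256-way output where many classes may be rare in the demonstrations.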

## Testing

In [None]:
path = "/content/drive/MyDrive/minerl_weights.pth"

TA.load_model_weights(path)
env = gym.make("MineRLBasaltFindCave-v0")

disp = Display(visible=0, backend="xvfb")
disp.start();

In [None]:
import gym
import matplotlib.pyplot as plt

def testing(agent, env, render=False):

    obs = env.reset()
    pov = obs["pov"]

    done = False
    cumulative_reward = 0

    while not done:
        ac = agent.predict(pov)
        obs, reward, done, info = env.step(ac)
        pov = obs["pov"]

        cumulative_reward += reward

        if render:
            plt.imshow(pov)
            plt.show()
            plt.clf()  # Important to reduce the usage of RAM

    print(f"Total Cumulative Reward: {cumulative_reward}")
    return cumulative_reward


render = False
num_runs = 10
cumulative_rewards = []

for _ in range(num_runs):
    cumulative_reward = testing(TA, env, render=render)
    cumulative_rewards.append(cumulative_reward)

average_cumulative_reward = sum(cumulative_rewards) / num_runs
print(f"Average Cumulative Reward over {num_runs} runs: {average_cumulative_reward}")


Total Cumulative Reward: 0.0
Total Cumulative Reward: 0.0


## Others

In [None]:
disp = Display(visible=0, backend="xvfb")
disp.start();

In [None]:
env.action_space.sample().keys()

In [None]:
# Have a look at a few actions we might do:
for _ in range(10):
  print( env.action_space.sample() )

In [None]:
# Now that Steve has been spawned, do some actions...
t0=time.time()
obs = env.reset()
pov = obs["pov"]
print(f"{(time.time()-t0):.2f}sec for env.reset")

done, iter = False, 0
actionClist = [128, 128, 128, 130, 130, 130, 130, 130, 130, 128, 128, 0, 0, 0, 1]
while not done:
    ac = agent_action_to_env(actionClist[iter], [0, 0])
    # ac = TA.predict(pov)  # Use this to test the performance of NN
    # Spin around to see what is around us
    # ac["camera"] = [0, +30]  # (pitch, yaw) deltas in degrees : +30 => turn to right

    t1=time.time()
    obs, reward, done, info = env.step(ac)
    #print(obs, reward, info)  # NB: Yikes : obs is only the image!
    #  obs = Dict(pov:Box(low=0, high=255, shape=(360, 640, 3)))
    #print(pov.shape) # (360, 640, 3)  Image spec agrees with docs!
    print(f"{(time.time()-t1):.2f}sec for env.step")  # Approx 0.25sec per step

    pov = obs["pov"]

    #env.render()  # This does an internal cv2.imshow that colab rejects
    #cv2_imshow(pov[:, :, ::-1])
    #cv2.waitKey(1)

    plt.imshow(pov)
    plt.show()
    plt.clf()  # important to reduce the usage of RAM
    iter += 1
    # actionClist has 15 entries, so stop before indexing past its end
    if iter >= len(actionClist): done = True

plt.close()

f"{(time.time()-t0):.2f}sec for whole spin"

In [None]:
# Set up a simple testing function
def action_step(action):
  ac = env.action_space.noop()
  ac.update(action)
  obs, reward, done, info = env.step(ac)
  plt.imshow(obs["pov"])
  plt.show()

In [None]:
action_step({})
action_step(dict(inventory=[1]))
action_step(dict(camera=[0, +30]))
action_step(dict(camera=[-10, -30]))
action_step(dict(camera=[+10, 0]))
action_step(dict(inventory=[1]))  # Put inventory away? = Yes, if it is showing

In [None]:
#action_step({'inventory':[1]})  # Put inventory away? = NOT jump, sneak, use, hotbar.X, back
action_step({})  # NOOP

In [None]:
# Set up a simple calibration function
import cv2
from google.colab.patches import cv2_imshow

def action_step_calibrate(x_off,y_off):
  ac = env.action_space.noop()
  ac.update(dict(camera=[y_off, x_off]))
  obs, reward, done, info = env.step(ac)
  im = obs["pov"][100:250, 200:400,:]
  cv2_imshow(cv2.cvtColor(im, cv2.COLOR_RGB2BGR))
  ac = env.action_space.noop()
  ac.update(dict(camera=[-y_off, -x_off]))  # Move back
  obs, reward, done, info = env.step(ac)

In [None]:
action_step({})
action_step(dict(inventory=[1]))

action_step_calibrate(0, 0)
for x_off in [+0.62, +1.61, +3.22, +5.81, +10.0]:
  print(f"x_off={x_off}")
  action_step_calibrate(x_off,0)
  action_step_calibrate(-x_off,0)
for y_off in [+0.62, +1.61, +3.22, +5.81, +10.0]:
  print(f"y_off={y_off}")
  action_step_calibrate(0, y_off)
  action_step_calibrate(0, -y_off)

action_step(dict(inventory=[1]))  # Put inventory away? = Yes, if it is showing

In [None]:
env.close()

In [None]:
disp.stop();

In [None]:
# THE END! - We'll be using this set-up in the future!