### Made by KukJinKim


## Installing the Minecraft RL Environment
1. Minecraft runs on the Java platform, so install JDK 1.8 first. 

(Windows)  
https://www.oracle.com/kr/java/technologies/javase/javase8-archive-downloads.html

(Mac)  
brew tap AdoptOpenJDK/openjdk  
brew install --cask adoptopenjdk8  

(Ubuntu)  
sudo add-apt-repository ppa:openjdk-r/ppa  
sudo apt-get update  
sudo apt-get install openjdk-8-jdk  

2. In a virtual environment with PyTorch installed, run the commands below to install gym and minerl. 

In [2]:
!pip install gym==0.19.0
!pip install minerl==0.3.7




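## Tips for Implementing Reinforcement Learning 
When implementing and training a model in an unfamiliar reinforcement learning environment, the usual procedure is as follows.  
1. Check the MDP and environment information
2. Implement a random policy and verify that it runs
3. Implement the model
4. Verify that the model runs
5. Implement the buffer (Replay Buffer, Temporal Buffer, etc.)
6. Implement the model update code
7. Tune the hyperparameters
8. Repeat the experiments  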

### 1. Checking the MDP and Environment Information 
First, let's look at the environment information for MineRL.  
Environment information broadly means the observation space, action space, dynamics, rewards, and terminal states.  

This tutorial uses the MineRLNavigateDense-v0 environment. Details are available at the link below.  
https://minerl.readthedocs.io/en/latest/environments/index.html  


Running the code below launches Minecraft. 

In [1]:
import gym
import minerl

env = gym.make("MineRLNavigateDense-v0")
env.make_interactive(port=5656, realtime=False) # Enables interaction with the environment. 
print(f"obs space: {env.observation_space}")
print(f"action space: {env.action_space}")






If the code runs correctly, it prints the following. 
### Env Info  
obs space:  
Dict(compassAngle:Box(low=-180.0, high=180.0, shape=()), inventory:Dict(dirt:Box(low=0, high=2304, shape=())), pov:Box(low=0, high=255, shape=(64, 64, 3)))  
  

action space:  
Dict(attack:Discrete(2), back:Discrete(2), camera:Box(low=-180.0, high=180.0, shape=(2,)), forward:Discrete(2), jump:Discrete(2), left:Discrete(2), place:Enum(dirt,none), right:Discrete(2), sneak:Discrete(2), sprint:Discrete(2))

### Environment Goal  

![](2022-06-28-15-31-33.png)
![](2022-06-28-15-32-02.png)  
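The goal of this environment is for the agent to reach the location of the diamond block. The agent receives a positive reward as it gets closer to the diamond and a negative reward as it moves away. The episode ends when the agent steps on the diamond block or 600 seconds elapse.  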

#### 1-1 Observation space  
The observation consists of a 64x64x3 RGB image tensor (`pov`) and a scalar between -180 and 180 (`compassAngle`). The agent needs both to reach the goal. 

![obs_spec](./2022-06-28-15-28-32.png)
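
Because the two inputs have very different ranges, it helps to rescale them before they reach the network. The snippet below is a minimal sketch in plain Python; `normalize_observation` is a hypothetical helper used only for illustration, and the tutorial's actual tensor converter appears in section 3.1.

```python
def normalize_observation(pov_pixel, compass_angle):
    """Scale one RGB pixel to [0, 1] and the compass angle to [-1, 1]."""
    scaled_pixel = [channel / 255.0 for channel in pov_pixel]
    scaled_angle = compass_angle / 180.0
    return scaled_pixel, scaled_angle


# One pixel of the 64x64x3 image and one compass reading
pixel, angle = normalize_observation([0, 128, 255], -90.0)
print(pixel, angle)
```

The same two divisions (by 255 and by 180) are what the converter applies to the full image tensor and the angle.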

#### 1-2 Action space

attack:Discrete(2),  
back:Discrete(2),  
camera:Box(low=-180.0, high=180.0, shape=(2,)),  
forward:Discrete(2),  
jump:Discrete(2),  
left:Discrete(2),  
place:Enum(dirt,none),  
right:Discrete(2),  
sneak:Discrete(2),  
sprint:Discrete(2)

As shown above, the agent can take a wide variety of actions in this environment, and they must be combined to perform the task. To reduce model complexity, we will use only 7 actions. 
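
One way to picture the reduction to 7 actions is a lookup table from an integer index to the few keys that differ from a no-op. The sketch below is an illustration only: the `NOOP` dict is a hand-written stand-in for `env.action_space.noop()`, and `ACTION_TABLE` and `index_to_action` are hypothetical names. The index assignments mirror the `make_7action` function defined in section 2.

```python
import copy

# Hypothetical stand-in for env.action_space.noop(): every key set to "do nothing".
NOOP = {
    "attack": 0, "back": 0, "camera": [0, 0], "forward": 0, "jump": 0,
    "left": 0, "place": "none", "right": 0, "sneak": 0, "sprint": 0,
}

# Action index -> overrides applied on top of the no-op action.
ACTION_TABLE = {
    0: {},                      # no movement
    1: {"camera": [0, -10]},    # camera deltas used in the tutorial
    2: {"camera": [0, 10]},
    3: {"camera": [-10, 0]},
    4: {"camera": [10, 0]},
    5: {"forward": 1},
    6: {"jump": 1},
}


def index_to_action(action_index):
    """Build a full action dict from one of the 7 discrete indices."""
    action = copy.deepcopy(NOOP)
    action["attack"] = 1  # always attack, as make_7action does
    action.update(ACTION_TABLE[action_index])
    return action


print(index_to_action(5))
```

With this layout the policy only has to output one of 7 indices, and an out-of-range index raises a `KeyError` instead of silently producing a no-op.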

### 2. Implementing a Random Policy and Verifying It Runs
Using the information above, we now implement a random policy to see how the agent behaves in the environment.  

In [1]:
import random 

def make_7action(env, action_index):
    # Build the action dict for the given index, starting from a no-op
    action = env.action_space.noop()

    # Always attack
    action['attack'] = 1
    action['jump'] = 0

    # No action
    if (action_index == 0):
        action['camera'] = [0, 0]
        action['forward'] = 0
        action['jump'] = 0

    # Camera
    elif (action_index == 1):
        action['camera'] = [0, -10]
    elif (action_index == 2):
        action['camera'] = [0, 10]
    elif (action_index == 3):
        action['camera'] = [-10, 0]
    elif (action_index == 4):
        action['camera'] = [10, 0]

    # Move forward or jump
    elif (action_index == 5):
        action['forward'] = 1
    elif (action_index == 6):
        action['jump'] = 1

    return action

def random_policy():
    # random.randint includes both endpoints, so (0, 6) covers the 7 actions
    action_index = random.randint(0, 6)
    action = make_7action(env, action_index)
    return action
    

In [4]:
# This is the same as the standard gym run loop. 
def main():
    episodes = 2
    for e in range(episodes):
        obs = env.reset()
        done = False
        score = 0
        while not done:
            env.render() # Lets you watch the agent as it runs. 
            action = random_policy()
            next_obs, reward, done, info = env.step(action)
            score += reward
            if done:
                #print(f"Episode {e} is finished")
                #print(f"Total score: {score}")
                break
    env.close()
    return 0

main() 
# if __name__ == '__main__':
    # main()

MineRL agent is public, connect on port 5656 with Minecraft 1.11

#### After running the code above, run the command below in a terminal to connect to the environment directly.  

![](2022-06-28-15-53-51.png)

python -m minerl.interactor 5656

### 3. Implementing the Model  
Now let's implement the PPO model. We need a PPO class and a converter function that turns the observation NumPy arrays into torch tensors. 

#### 3.1 converter

In [2]:
import numpy as np 
import torch
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical
from collections import deque

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

def navigate_converter(observation, device):
    # Convert pixels
    pixels = observation['pov']
    pixels = torch.from_numpy(pixels).float() # 64, 64, 3
    pixels /= 255.0 # int2float
    pixels = pixels.permute(2, 0, 1) # 3, 64, 64
    if len(pixels.shape) < 4: # Add batch dimension to pixels
        pixels = pixels.unsqueeze(0) # 1, 3, 64, 64
    
    # Convert angle
    angle = np.array([observation['compassAngle']], dtype=np.float32) # np.float was removed in NumPy 1.24
    compassAngle = torch.from_numpy(angle)
    compassAngle /= 180.0
    compassAngle = compassAngle.unsqueeze(0)
    return pixels.to(device, dtype=torch.float), compassAngle.to(device, dtype=torch.float)



#### 3.2 PPO Class  
The architecture is shown in the figure below.   
![](2022-06-28-16-00-17.png)
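
To see where the size of the first fully connected layer comes from, the spatial size can be traced through three unpadded convolutions. This is a small arithmetic sketch, assuming the kernel sizes and strides (8/4, 4/2, 3/1) and the 64 output channels used by the PPO class in this section.

```python
def conv2d_size_out(size, kernel_size, stride):
    """Output length of an unpadded convolution along one spatial axis."""
    return (size - kernel_size) // stride + 1


size = 64  # the pov image is 64x64
for kernel_size, stride in [(8, 4), (4, 2), (3, 1)]:
    size = conv2d_size_out(size, kernel_size, stride)
    print(size)  # 15, then 6, then 4

flattened = size * size * 64  # 4 * 4 * 64 = 1024 conv features
fc_input = flattened + 1      # plus the scalar compass angle
print(flattened, fc_input)    # 1024 1025
```

The extra `+ 1` is the compass angle, which is concatenated to the flattened image features before the shared linear layer.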

In [4]:
class PPO(nn.Module):
    def __init__(self, num_actions):
        super(PPO, self).__init__()
        self.num_actions = num_actions
        
        self.conv_layers = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.Flatten()
        )

        def conv2d_size_out(size, kernel_size=3, stride=2):
            return (size - (kernel_size - 1) - 1) // stride + 1

        conv_size = conv2d_size_out(64, 8, 4)
        conv_size = conv2d_size_out(conv_size, 4, 2)
        conv_size = conv2d_size_out(conv_size, 3, 1)
        linear_input_size = conv_size * conv_size * 64 # 4 x 4 x 64 = 1024
        self.fc = nn.Linear(linear_input_size+1, 512)
        self.fc_pi = nn.Linear(512, self.num_actions)
        self.fc_v = nn.Linear(512, 1)
    
    def forward(self, obs, softmax_dim=1):
        # obs must be built correctly (see make_batch):
        # pixels (Batch, C, H, W)
        # angles (Batch, 1)
        pixels, compassAngle = obs
        conv_feature = self.conv_layers(pixels) # (Batch, Linear_size)
        concat_feature = torch.cat((conv_feature, compassAngle), dim=1)
        feature = F.relu(self.fc(concat_feature))
        logits = self.fc_pi(feature)
        prob = F.softmax(logits, dim=softmax_dim) # action probabilities, not log-probabilities
        value = self.fc_v(feature)
        return prob, value

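### 4. Verifying the Model Runs  
Now let's check that the PPO class works.  
The PPO model takes the converter's output as input, and an action index is sampled from the action probabilities it returns.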

In [5]:
import gym
import minerl

env = gym.make("MineRLNavigateDense-v0")
env.make_interactive(port=5656, realtime=False)

def main():
    model = PPO(num_actions=7).to(device)
    episodes = 1
    for e in range(episodes):
        state = env.reset()
        done = False
        score = 0
        while not done:
            env.render() 
            # 1. action sampling 
            obs = navigate_converter(state, device)
            prob, value = model(obs, softmax_dim=1)
            prob = prob.squeeze(0) # Remove the batch dimension from the prob tensor. 
            m = Categorical(prob)
            action_index = m.sample().item()
            
            # 2. convert action 
            action = make_7action(env, action_index)

            # 3. take an action and get next information 
            next_state, reward, done, info = env.step(action)
            next_obs = navigate_converter(next_state, device)
            state = next_state
            score += reward
            if done:
                #print(f"Episode {e} is finished")
                #print(f"Total score: {score}")
                break
    env.close()
    return 0

main()
# if __name__ == '__main__':
#     main()


MineRL agent is public, connect on port 5656 with Minecraft 1.11
0it [00:00, ?it/s]

Failed to delete the temporary minecraft directory.


0

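### 5. Implementing the Buffer 

### Buffer Class  
Two things are worth noting in make_batch:  
1) scalar values are wrapped in lists before being turned into a torch.tensor
2) pixels are concatenated along the batch dimension with torch.cat  

These steps turn the buffered samples into batch tensors for the update function.  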

In [12]:
class Buffer:
    def __init__(self, T_horizon):
        self.T_horizon = T_horizon
        self.data = deque(maxlen=T_horizon)

    def put_data(self, transition):
        # obs : pixels, angle
        # trans (obs, a, r, next_obs, prob[a].item(), done)
        self.data.append(transition)
        
    def make_batch(self):
        pixels_lst, angles_lst = [], []
        a_lst, r_lst, prob_a_lst, done_lst = [], [], [], []
        n_pixels_lst, n_angles_lst = [], []

        for transition in self.data:
            obs, a, r, next_obs, prob_a, done = transition
            pixels_lst.append(obs[0])
            angles_lst.append([obs[1]])
            a_lst.append([a])
            r_lst.append([r])
            n_pixels_lst.append(next_obs[0])
            n_angles_lst.append([next_obs[1]])
            prob_a_lst.append([prob_a])
            done_mask = 0 if done else 1
            done_lst.append([done_mask])
        
        pixels = torch.cat(pixels_lst).to(device)
        angles = torch.tensor(angles_lst).to(device)
        a = torch.tensor(a_lst, dtype=torch.int64).to(device)
        r = torch.tensor(r_lst, dtype=torch.float).to(device)
        n_pixels = torch.cat(n_pixels_lst).to(device)
        n_angles = torch.tensor(n_angles_lst).to(device)
        done_mask = torch.tensor(done_lst, dtype=torch.float).to(device)
        prob_a = torch.tensor(prob_a_lst).to(device)
        self.data = deque(maxlen=self.T_horizon)
        obs = (pixels, angles)
        next_obs = (n_pixels, n_angles)
        return obs, a, r, next_obs, done_mask, prob_a

### 6. Implementing the Model Update 

### train_net function  
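
For reference, the quantities that the update function computes can be written out as follows. This is the standard clipped PPO objective with generalized advantage estimation, stated in the notation of the code as an interpretation rather than a quotation of it: $\gamma$ is `gamma`, $\lambda$ is `lmbda`, $\epsilon$ is `eps_clip`, $c$ is the entropy coefficient, and $d_t$ is the done flag, so that $1 - d_t$ corresponds to `done_mask`.

```latex
\begin{aligned}
\delta_t &= r_t + \gamma\,(1 - d_t)\,V(s_{t+1}) - V(s_t) \\
\hat{A}_t &= \delta_t + \gamma\lambda\,\hat{A}_{t+1} \\
\rho_t(\theta) &= \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \\
L(\theta) &= -\min\!\Big(\rho_t\,\hat{A}_t,\ \operatorname{clip}(\rho_t,\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\Big)
  + \operatorname{SmoothL1}\!\Big(V(s_t),\ r_t + \gamma\,(1 - d_t)\,V(s_{t+1})\Big)
  - c\,H\big[\pi_\theta(\cdot \mid s_t)\big]
\end{aligned}
```

Here $\pi_{\theta_{\text{old}}}(a_t \mid s_t)$ is the probability stored in the buffer at sampling time (`prob_a`), and the advantage is treated as a constant when differentiating the loss.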

In [13]:
def train_net(model, buffer, optimizer, K_epoch, lmbda, gamma, eps_clip, entopy_coef):
    obs, a, r, next_obs, done_mask, prob_a = buffer.make_batch()
    for i in range(K_epoch):
        _, next_value = model(next_obs) # only the value of the next state is needed
        pi, value = model(obs) # pi: [batch, num_actions]

        td_target = r + gamma * next_value * done_mask
        delta = td_target - value
        delta = delta.detach().cpu().numpy()

        advantage_lst = []
        advantage = 0.0
        for delta_t in delta[::-1]:
            advantage = gamma * lmbda * advantage + delta_t[0]
            advantage_lst.append([advantage])
        advantage_lst.reverse()
        advantage = torch.tensor(advantage_lst, dtype=torch.float).to(device) # use the global device so this also runs on CPU

        pi_a = pi.gather(1,a) 
        ratio = torch.exp(torch.log(pi_a) - torch.log(prob_a))  # a/b == exp(log(a)-log(b))

        m = Categorical(pi)
        entropy = m.entropy().mean()
        surr1 = ratio * advantage
        surr2 = torch.clamp(ratio, 1-eps_clip, 1+eps_clip) * advantage
        loss = -torch.min(surr1, surr2) + F.smooth_l1_loss(value , td_target.detach()) - entopy_coef * entropy

        optimizer.zero_grad()
        loss.mean().backward()
        optimizer.step()
    
    return loss.mean().item()
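
The backward loop over `delta` inside `train_net` is the part that is easiest to get wrong, so here is the same recursion isolated in plain Python. This is a minimal sketch with hand-picked numbers; `compute_gae` is a hypothetical helper name, and the TD errors are given directly instead of being computed from a value network.

```python
def compute_gae(deltas, gamma, lmbda):
    """Accumulate discounted TD errors from the last step back to the first."""
    advantages = []
    advantage = 0.0
    for delta in reversed(deltas):
        advantage = gamma * lmbda * advantage + delta
        advantages.append(advantage)
    advantages.reverse()  # restore chronological order
    return advantages


deltas = [1.0, 0.0, -1.0]  # TD errors for three consecutive steps
print(compute_gae(deltas, gamma=0.5, lmbda=1.0))  # [0.75, -0.5, -1.0]
```

Reading the result from the right: the last step keeps its own TD error (-1.0), the middle step adds half of that (-0.5), and the first step adds half of the middle value to its own error (1.0 - 0.25 = 0.75).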

### 7. Training and Hyperparameter Tuning 
Now let's train the model in earnest.  


In [14]:
from torch.utils.tensorboard import SummaryWriter

def save_model(episode, SAVE_PERIOD, SAVE_PATH, model, MODEL_NAME, ENV_NAME):
    if episode % SAVE_PERIOD == 0:
        save_path_name = SAVE_PATH + ENV_NAME+'_'+MODEL_NAME+'_'+str(episode)+'.pt'
        torch.save(model.state_dict(), save_path_name)
        print("model saved")

def load_model(model, SAVE_PATH, MODEL_NAME):
    model.load_state_dict(torch.load(SAVE_PATH+MODEL_NAME+'.pt'))
    print("load model successfully")
    return model

    

In [15]:
import os
SAVE_PATH = "./weights/"
summary_path = "./experiments/train/"
# makedirs also creates the missing parent directory ("./experiments")
os.makedirs(summary_path, exist_ok=True)
os.makedirs(SAVE_PATH, exist_ok=True)

MODEL_NAME = 'PPO'
ENV_NAME = 'MineRLNavigateDense-v0'

#Hyperparameters
learning_rate = 0.0005
gamma         = 0.99
lmbda         = 0.95
eps_clip      = 0.1
entopy_coef = 0.1
K_epoch       = 5
T_horizon     = 40

SAVE_PERIOD = 100
total_episodes = 2000
print_interval = 20

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")


In [16]:
def main():
    torch.manual_seed(3407)
    writer = SummaryWriter(summary_path)
    env = gym.make(ENV_NAME)
    env.make_interactive(port=6666, realtime=False)
    model = PPO(num_actions=7).to(device)
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    buffer = Buffer(T_horizon)

    for n_epi in range(total_episodes):
        score = 0.0
        seed = 3407
        env.seed(seed)
        state = env.reset()
        done = False
        loss = 0.0
        steps = 0
        while not done:
            steps += 1
            for t in range(T_horizon):
                env.render()
                obs = navigate_converter(state, device)
                prob, value = model(obs, softmax_dim=1)
                prob = prob.squeeze(0)
                m = Categorical(prob)
                a = m.sample().item()
                action = make_7action(env, a)
                s_prime, r, done, info = env.step(action)
                next_obs = navigate_converter(s_prime, device)
                transition = obs, a, r, next_obs, prob[a].item(), done
                buffer.put_data(transition)
                state = s_prime
                score += r
                if done:
                    break
            if done:
                writer.add_scalar("total_rewards", score, n_epi)
                writer.add_scalar("loss", loss, n_epi)
                print(f'loss : {loss}')
                print("# of episode :{}, score : {:.1f}".format(n_epi, score))
                break
                   
            loss = train_net(model, buffer, optimizer, K_epoch, lmbda, gamma, eps_clip, entopy_coef)
        # save_model already checks n_epi against SAVE_PERIOD
        save_model(n_epi, SAVE_PERIOD, SAVE_PATH, model, MODEL_NAME, ENV_NAME)
    writer.close()
    env.close()
    return 0

main()
# if __name__ == '__main__':
#     main()

MineRL agent is public, connect on port 6666 with Minecraft 1.11

loss : -0.11129617691040039
# of episode :0, score : 43.9
model saved

loss : -0.035930655896663666
# of episode :1, score : 21.9

loss : -0.23664569854736328
# of episode :2, score : 48.2

loss : 0.029654482379555702
# of episode :3, score : -16.7

loss : -0.24434776604175568
# of episode :4, score : 25.2

loss : -0.21240167319774628
# of episode :5, score : 18.3

loss : -0.20293287932872772
# of episode :6, score : -2.8

loss : -0.19111189246177673
# of episode :7, score : 5.5

loss : -0.4168812334537506
# of episode :8, score : 31.8

loss : -0.12589131295681
# of episode :9, score : -9.2

loss : -0.1534474641084671
# of episode :10, score : 40.1

loss : -0.2577904164791107
# of episode :11, score : -5.3

loss : -0.31792280077934265
# of episode :12, score : 40.9

loss : -0.2872028946876526
# of episode :13, score : -8.0