<a href="https://colab.research.google.com/github/ieyasu2017/Thermal/blob/master/environment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1>Markov Decision Process (MDP) in a Maze Environment</h1>

The environment implemented in the program, which satisfies the MDP:

<img src="https://ieyasu03.web.fc2.com/Deep_Learning/Fig18-1.jpg">


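The agent can move up, down, left, or right, and the state is its current position in the maze. Reaching the blue cell ends the episode with a positive reward ($+1$); reaching the red cell ends it with a negative reward ($-1$). Black cells are blocks that the agent cannot enter. Every other move incurs a small negative reward ($-0.04$).

The policy and transition probabilities are set as follows:

Policy: choose an action uniformly at random from up, down, left, and right.

Transition probabilities: the agent moves in the selected direction with probability $0.8$, in the opposite direction with probability $0$, and in each of the other two directions with probability $(1-0.8)/2=0.1$.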
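
The transition rule described here can be sketched as a small function. This is a minimal illustration, not the notebook's implementation: the name `direction_probs` and the string direction labels are assumptions made for this example.

```python
# Hypothetical helper illustrating the transition rule described in the text.
OPPOSITE = {"up": "down", "down": "up", "left": "right", "right": "left"}


def direction_probs(selected, move_prob=0.8):
    """Return the probability of actually moving in each direction."""
    probs = {}
    for direction in OPPOSITE:
        if direction == selected:
            probs[direction] = move_prob            # intended direction
        elif direction == OPPOSITE[selected]:
            probs[direction] = 0.0                  # never moves backwards
        else:
            probs[direction] = (1 - move_prob) / 2  # slips sideways
    return probs


print(direction_probs("up"))
```

The four probabilities always sum to $1$, so the result can be used directly as a sampling distribution.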

The dashed path in the figure is an optimal path: it reaches the goal in $5$ steps, the fewest possible, and so maximizes the reward. Its reward is

 $$ \mathrm{Reward} = -0.04 \times (5-1) + (+1) = 0.84. $$
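
The same arithmetic can be checked with a short sketch. The function name `episode_return` and its parameters are assumptions for illustration. It assumes, as in the formula, that every step except the last earns the per-move reward and that the last step earns the terminal reward.

```python
def episode_return(n_steps, terminal_reward, step_reward=-0.04):
    """Total reward of an episode that ends on a terminal cell at step n_steps."""
    return step_reward * (n_steps - 1) + terminal_reward


print(round(episode_return(5, +1), 2))  # blue cell reached in 5 steps: 0.84
print(round(episode_return(5, -1), 2))  # red cell reached in 5 steps: -1.16
```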

<h3>State, Action, and Environment Setup</h3>

from enum import Enum
import numpy as np


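class State():

    def __init__(self, row=-1, column=-1):
        self.row = row
        self.column = column

    def clone(self):
        return State(self.row, self.column)

    # Compare and hash by position so that equal positions map to the
    # same key in the transition-probability dictionary.
    def __hash__(self):
        return hash((self.row, self.column))

    def __eq__(self, other):
        return self.row == other.row and self.column == other.column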


class Action(Enum): 
    UP = 1
    DOWN = -1
    LEFT = 2
    RIGHT = -2


class Environment(): 

    def __init__(self, grid, move_prob=0.8):      
        # grid is a 2D array whose values are cell attributes:
        #   0: ordinary cell
        #  -1: damage cell (game ends)
        #   1: reward cell (game ends)
        #   9: block cell (the agent cannot enter)
        self.grid = grid
        self.agent_state = State()
        
        # The default reward is negative, like walking through a poison swamp,
        # so the agent has to reach the goal quickly.
        self.default_reward = -0.04

        # The agent moves in the selected direction with probability move_prob
        # and slips to one of the two perpendicular directions with
        # total probability (1 - move_prob).
        self.move_prob = move_prob
        self.reset()

    @property
    def row_length(self):
        return len(self.grid)

    @property
    def column_length(self):
        return len(self.grid[0])

    @property
    def actions(self):
        return [Action.UP, Action.DOWN,
                Action.LEFT, Action.RIGHT]

    def transit_func(self, state, action):
        transition_probs = {}
        if not self.can_action_at(state):
            # Already on the terminal cell.
            return transition_probs

        opposite_direction = Action(action.value * -1)

        for a in self.actions:
            prob = 0
            if a == action:
                prob = self.move_prob
            elif a != opposite_direction:
                prob = (1 - self.move_prob) / 2

            next_state = self._move(state, a)
            if next_state not in transition_probs:
                transition_probs[next_state] = prob
            else:
                transition_probs[next_state] += prob

        return transition_probs

    def can_action_at(self, state):
        # Actions are only possible from ordinary cells (attribute 0).
        return self.grid[state.row][state.column] == 0

    def _move(self, state, action):
        if not self.can_action_at(state):
            raise Exception("Can't move from here!")

        next_state = state.clone()

        # Execute an action (move).
        if action == Action.UP:
            next_state.row -= 1
        elif action == Action.DOWN:
            next_state.row += 1
        elif action == Action.LEFT:
            next_state.column -= 1
        elif action == Action.RIGHT:
            next_state.column += 1

        # Check whether a state is out of the grid.
        if not (0 <= next_state.row < self.row_length):
            next_state = state
        if not (0 <= next_state.column < self.column_length):
            next_state = state

        # Check whether the agent bumped into a block cell.
        if self.grid[next_state.row][next_state.column] == 9:
            next_state = state

        return next_state

    def reward_func(self, state):
        reward = self.default_reward
        done = False

        # Check an attribute of next state.
        attribute = self.grid[state.row][state.column]
        if attribute == 1:
            # Get reward! and the game ends.
            reward = 1
            done = True
        elif attribute == -1:
            # Get damage! and the game ends.
            reward = -1
            done = True

        return reward, done

    def reset(self):
        # Locate the agent at lower left corner.
        self.agent_state = State(self.row_length - 1, 0)
        return self.agent_state

    def step(self, action):
        next_state, reward, done = self.transit(self.agent_state, action)
        if next_state is not None:
            self.agent_state = next_state

        return next_state, reward, done

    def transit(self, state, action):
        transition_probs = self.transit_func(state, action)
        if len(transition_probs) == 0:
            return None, None, True

        next_states = []
        probs = []
        for s in transition_probs:
            next_states.append(s)
            probs.append(transition_probs[s])

        next_state = np.random.choice(next_states, p=probs)
        reward, done = self.reward_func(next_state)
        return next_state, reward, done
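
The sampling step in `transit` draws the next state according to the transition probabilities. A minimal standalone sketch of that idea follows. It swaps in the standard library's `random.choices` for `np.random.choice` and uses plain `(row, column)` tuples instead of `State` objects; both are assumptions made to keep the example self-contained, and `sample_next_state` is a hypothetical name.

```python
import random


def sample_next_state(transition_probs, rng=random):
    """Draw one key from a {state: probability} dictionary."""
    states = list(transition_probs)
    weights = [transition_probs[s] for s in states]
    # random.choices returns a list of k samples; take the single element.
    return rng.choices(states, weights=weights, k=1)[0]


# Example: choosing UP from the lower-left cell (2, 0) of the 3x4 grid.
# UP reaches (1, 0); LEFT hits the wall and stays at (2, 0);
# RIGHT slips to (2, 1); DOWN (the opposite direction) has probability 0.
probs = {(1, 0): 0.8, (2, 0): 0.1, (2, 1): 0.1}
print(sample_next_state(probs))
```

Over many draws, roughly 80% of the samples should be `(1, 0)`.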


<h3>Running the Program</h3>

import random

class Agent():
    
    def __init__(self, env):
        self.actions = env.actions
        
    def policy(self, state):
        return random.choice(self.actions)
    
def main():
    
    # Make grid environment
    grid = [[0,0,0,1],
            [0,9,0,-1],
            [0,0,0,0]]
    
    env = Environment(grid)
    agent = Agent(env)
    
    # Try 1000 games
    for i in range(1000):
        # initialize position of agent
        state = env.reset()
        total_reward = 0
        done = False
        
        n_step = 0
        while not done:
            action = agent.policy(state)
            next_state, reward, done = env.step(action)
            total_reward += reward
            state = next_state
            n_step += 1
        
        if n_step == 5:
            print(f'Episode {i:2d}: Agent gets {total_reward: 3f} reward at {n_step:3d} steps')
        
if __name__ == "__main__":
    main()

Episode 83: Agent gets -1.160000 reward at   5 steps
Episode 350: Agent gets -1.160000 reward at   5 steps
Episode 389: Agent gets  0.840000 reward at   5 steps
Episode 407: Agent gets  0.840000 reward at   5 steps
Episode 535: Agent gets -1.160000 reward at   5 steps
Episode 537: Agent gets -1.160000 reward at   5 steps
Episode 553: Agent gets -1.160000 reward at   5 steps
Episode 601: Agent gets -1.160000 reward at   5 steps
Episode 645: Agent gets -1.160000 reward at   5 steps
Episode 792: Agent gets -1.160000 reward at   5 steps
Episode 910: Agent gets  0.840000 reward at   5 steps
Episode 928: Agent gets -1.160000 reward at   5 steps


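Printing only the episodes that finish in $5$ steps confirms that a path with reward $0.84$ exists. The episodes with reward $-1.16$ are those that ended on the red cell after $5$ steps: $-0.04 \times (5-1) + (-1) = -1.16$.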
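
That $5$ steps is the minimum can be cross-checked with a breadth-first search over the same grid. This sketch ignores the stochastic transitions and simply counts moves. The function name `shortest_steps` is an assumption for illustration and is not part of the notebook's code.

```python
from collections import deque


def shortest_steps(grid, start, goal_value=1):
    """Fewest moves from start to a cell holding goal_value, or None."""
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if grid[r][c] == goal_value:
            return dist
        if grid[r][c] != 0:
            continue  # other terminal cells cannot be left
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            # Skip cells outside the grid, block cells (9), and visited cells.
            if inside and grid[nr][nc] != 9 and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None


grid = [[0, 0, 0, 1],
        [0, 9, 0, -1],
        [0, 0, 0, 0]]
print(shortest_steps(grid, start=(2, 0)))  # 5
```

Calling it with `goal_value=-1` gives the shortest distance to the red cell instead.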

<h3>References</h3>

<a href="https://www.amazon.co.jp/%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92%E3%82%B9%E3%82%BF%E3%83%BC%E3%83%88%E3%82%A2%E3%83%83%E3%83%97%E3%82%B7%E3%83%AA%E3%83%BC%E3%82%BA-Python%E3%81%A7%E5%AD%A6%E3%81%B6%E5%BC%B7%E5%8C%96%E5%AD%A6%E7%BF%92-%E5%85%A5%E9%96%80%E3%81%8B%E3%82%89%E5%AE%9F%E8%B7%B5%E3%81%BE%E3%81%A7-KS%E6%83%85%E5%A0%B1%E7%A7%91%E5%AD%A6%E5%B0%82%E9%96%80%E6%9B%B8-%E4%B9%85%E4%BF%9D/dp/4065142989/ref=sr_1_3?__mk_ja_JP=%E3%82%AB%E3%82%BF%E3%82%AB%E3%83%8A&keywords=Python%E3%81%A7%E5%AD%A6%E3%81%B6%E5%BC%B7%E5%8C%96%E5%AD%A6%E7%BF%92&qid=1568288711&s=gateway&sr=8-3">久保 隆宏 『機械学習スタートアップシリーズ Pythonで学ぶ強化学習 入門から実践まで』 講談社 (2019/02/04)</a>