# Example 3.5: Gridworld

<img src="figures/chap.03.05.example3.5.gridworld.png" width="70%">

* Actions available in each cell: `north`, `south`, `east`, and `west`
* Actions are deterministic
* An action that would take the agent off the grid leaves the state unchanged and yields a reward of `-1`
* Every other action (except those taken in states `A` and `B`) yields a reward of `0`
* In state `A`, every action moves the agent to state `A'` and yields a reward of `+10`
* In state `B`, every action moves the agent to state `B'` and yields a reward of `+5`

In [1]:
import numpy as np
np.set_printoptions(precision=1)

### Grid world state index

|   |   |   |   |   |
|----|----|----|----|----|
| 0,0  | 0,1 | 0,2 | 0,3 | 0,4 |
| 1,0  | 1,1 | 1,2 | 1,3 | 1,4 |
| 2,0  | 2,1 | 2,2 | 2,3 | 2,4 |
| 3,0  | 3,1 | 3,2 | 3,3 | 3,4 |
| 4,0  | 4,1 | 4,2 | 4,3 | 4,4 |

## Bellman equation

$$v_{\pi}(s) = \sum_{a} \pi(a | s)
\sum_{s', r} p(s', r | s, a)
\left[ r + \gamma v_{\pi}(s') \right]$$

* $p(s', r | s, a)$: deterministic
  * $p(s'=(4,1), \, r=10 \ | \ s=(0,1), \, a=\textrm{'north'}) = 1$
    * Moving 'north' from state $A$ leads to next state $(4,1)$ with reward 10 with probability 1
  * $p(s'=(2,1), \, r=0 \ | \ s=(1,1), \, a=\textrm{'north'}) = 0$
    * Moving 'north' from state $(1,1)$ leads to next state $(2,1)$ with reward 0 with probability 0
    * Because the dynamics are deterministic, only $s'=(0,1)$ is possible
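
The two probabilities above can be checked with a small sketch that writes the deterministic dynamics as a 0/1-valued function. The names `step` and `transition_prob` are illustrative assumptions and are not part of the notebook's class; the rules themselves are the ones listed at the top of this example.

```python
# Sketch: deterministic dynamics p(s', r | s, a) as a 0/1-valued function.
# `step` and `transition_prob` are hypothetical helper names for illustration.
SIZE = 5
A, A_PRIME = (0, 1), (4, 1)
B, B_PRIME = (0, 3), (2, 3)
MOVES = {'north': (-1, 0), 'south': (1, 0), 'east': (0, 1), 'west': (0, -1)}

def step(state, action):
    """Return the single (next_state, reward) outcome of `action` in `state`."""
    if state == A:
        return A_PRIME, 10
    if state == B:
        return B_PRIME, 5
    row = state[0] + MOVES[action][0]
    col = state[1] + MOVES[action][1]
    if 0 <= row < SIZE and 0 <= col < SIZE:
        return (row, col), 0
    return state, -1  # off-grid move: stay in place and take the penalty

def transition_prob(next_state, reward, state, action):
    """p(s', r | s, a): 1.0 for the one outcome that occurs, 0.0 otherwise."""
    return 1.0 if step(state, action) == (next_state, reward) else 0.0

print(transition_prob((4, 1), 10, (0, 1), 'north'))  # 1.0
print(transition_prob((2, 1), 0, (1, 1), 'north'))   # 0.0
print(transition_prob((0, 1), 0, (1, 1), 'north'))   # 1.0
```

Because exactly one `(next_state, reward)` pair has probability 1 for each state and action, the inner sum of the Bellman equation collapses to a single term per action.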

#### `self.M`

* The matrix that collects the coefficients of the linear equations

$$ M V = R$$

$$ \left[ \begin{array}{cccc}
w_{0,0} & w_{0,1} & \cdots  & w_{0,24} \\
w_{1,0} & w_{1,1} & \cdots  & w_{1,24} \\
\vdots & \vdots & \ddots & \vdots \\
w_{24,0} & w_{24,1} & \cdots  & w_{24,24}
\end{array} \right]
\left[ \begin{array}{c}
v_{\pi}(s_{(0, 0)}) \\
v_{\pi}(s_{(0, 1)}) \\
\vdots \\
v_{\pi}(s_{(4, 4)}) \\
\end{array} \right]
= \left[ \begin{array}{c}
\frac{1}{4} R_{s_{(0, 0)}} \\
\frac{1}{4} R_{s_{(0, 1)}} \\
\vdots \\
\frac{1}{4} R_{s_{(4, 4)}} \\
\end{array} \right]
$$

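where $R_{s_{(0, 0)}} = r_{a=\textrm{'north'}} + r_{a=\textrm{'south'}} + r_{a=\textrm{'east'}} + r_{a=\textrm{'west'}}$ is the sum of the rewards for the four actions taken from state $(0, 0)$, and the factor $\frac{1}{4}$ comes from the equiprobable random policy $\pi(a | s) = \frac{1}{4}$. The coefficients follow from moving the $\gamma v_{\pi}(s')$ terms of the Bellman equation to the left-hand side: $w_{i,j} = \delta_{i,j} - \frac{\gamma}{4} n_{i,j}$, where $\delta_{i,j}$ is the Kronecker delta and $n_{i,j}$ is the number of actions that take state $i$ to state $j$.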
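
The same system can be sketched in matrix form as $M = I - \gamma P_{\pi}$, where $P_{\pi}$ is the state-to-state transition matrix under the equiprobable policy. This is a standalone illustration under that assumption, not the notebook's own implementation; the names `P`, `r`, and `MOVES` are hypothetical.

```python
import numpy as np

SIZE, GAMMA = 5, 0.9
A, A_PRIME, B, B_PRIME = (0, 1), (4, 1), (0, 3), (2, 3)
MOVES = [(-1, 0), (1, 0), (0, 1), (0, -1)]  # north, south, east, west

# P[i, j]: probability of moving from state i to state j under the policy.
# r[i]: expected immediate reward in state i.
P = np.zeros((SIZE * SIZE, SIZE * SIZE))
r = np.zeros(SIZE * SIZE)

for row in range(SIZE):
    for col in range(SIZE):
        i = row * SIZE + col
        for d_row, d_col in MOVES:
            if (row, col) == A:
                (n_row, n_col), reward = A_PRIME, 10
            elif (row, col) == B:
                (n_row, n_col), reward = B_PRIME, 5
            elif 0 <= row + d_row < SIZE and 0 <= col + d_col < SIZE:
                (n_row, n_col), reward = (row + d_row, col + d_col), 0
            else:
                (n_row, n_col), reward = (row, col), -1  # bumped into the wall
            P[i, n_row * SIZE + n_col] += 0.25  # each action has probability 1/4
            r[i] += 0.25 * reward

M = np.eye(SIZE * SIZE) - GAMMA * P
v = np.linalg.solve(M, r).reshape(SIZE, SIZE)

# Row of M for state (0, 0): two actions (north, west) stay in place,
# south leads to (1, 0) (index 5), and east leads to (0, 1) (index 1).
print(M[0, [0, 1, 5]])  # values 0.55, -0.225, -0.225
print(r[0])             # value -0.5
```

For the corner state $(0, 0)$ this gives $w_{0,0} = 1 - 2 \cdot \frac{0.9}{4} = 0.55$ and a right-hand side of $\frac{1}{4}(-1 - 1 + 0 + 0) = -0.5$.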

In [2]:
class GridworldEnv():
  def __init__(self, size=5):
    self.actions = ['north', 'south', 'east', 'west']
    self.A = (0, 1) # special site
    self.B = (0, 3) # special site
    self.M = np.eye(size * size)
    self.R = np.zeros(size * size)
    self.gamma = 0.9
    self.size = size
    
  def BellmanEquation(self, state):
    """Add the Bellman equation of one state to the linear system M v = R.

    Args:
      state: tuple (row, col) coordinate
    """
    row, col = state
    state_index = row * self.size + col
    for action in self.actions:
      next_state, reward = self.Step(state, action)
      next_state_index = next_state[0] * self.size + next_state[1]
      # Equiprobable policy: each of the four actions has probability 0.25.
      self.M[state_index, next_state_index] -= 0.25 * self.gamma
      self.R[state_index] += 0.25 * reward
      
  def Step(self, state, action):
    """
    Args:
      state: tuple (row, col) coordinate
      action: string, one of 'north', 'south', 'east', 'west'

    Returns:
      next_state: tuple (row, col) coordinate
      reward: int
    """
    if state == self.A:
      # Every action leads to next_state=(4, 1) with reward=10.
      return (4, 1), 10
    elif state == self.B:
      # Every action leads to next_state=(2, 3) with reward=5.
      return (2, 3), 5
    else:
      if action == 'north':
        if state[0] > 0:
          next_state = (state[0]-1, state[1])
          reward = 0
        else:
          next_state = state
          reward = -1
      elif action == 'south':
        if state[0] < self.size - 1:
          next_state = (state[0]+1, state[1])
          reward = 0
        else:
          next_state = state
          reward = -1
      elif action == 'east':
        if state[1] < self.size - 1:
          next_state = (state[0], state[1]+1)
          reward = 0
        else:
          next_state = state
          reward = -1
      elif action == 'west':
        if state[1] > 0:
          next_state = (state[0], state[1]-1)
          reward = 0
        else:
          next_state = state
          reward = -1
      return next_state, reward
    
  def AllBellmanEquations(self):
    for i in range(self.size):
      for j in range(self.size):
        self.BellmanEquation((i, j))
        
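  def SolveLinearBellmanEquations(self):
    # Reset M and R so that repeated calls do not accumulate coefficients.
    self.M = np.eye(self.size * self.size)
    self.R = np.zeros(self.size * self.size)
    self.AllBellmanEquations()
    # Solve M v = R for the state values, then lay them out on the grid.
    solution = np.linalg.solve(self.M, self.R)
    solution = solution.reshape(self.size, self.size)
    return solution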

In [3]:
g = GridworldEnv()
solution = g.SolveLinearBellmanEquations()
print(solution)

[[ 3.3  8.8  4.4  5.3  1.5]
 [ 1.5  3.   2.3  1.9  0.5]
 [ 0.1  0.7  0.7  0.4 -0.4]
 [-1.  -0.4 -0.4 -0.6 -1.2]
 [-1.9 -1.3 -1.2 -1.4 -2. ]]
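
As a quick sanity check on the printed values, the Bellman equation at the two special states reduces to a single term, because every action taken there leads to the same next state. Using the rounded values from the grid above, with $A' = (4, 1)$ and $B' = (2, 3)$:

$$v_{\pi}(A) = 10 + \gamma v_{\pi}(A') \approx 10 + 0.9 \times (-1.3) = 8.83$$

$$v_{\pi}(B) = 5 + \gamma v_{\pi}(B') \approx 5 + 0.9 \times 0.4 = 5.36$$

Both agree with the printed `8.8` and `5.3` to within the rounding error of the one-decimal output.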


### Results

<img src="figures/chap.03.05.example3.5.gridworld.png" width="70%">