## CSCI E-89C Deep Reinforcement Learning, Spring 2020
### Section 5

## First-Visit Monte Carlo (MC) prediction

Consider an Environment with five states: 1, 2, 3, 4, and 5. The possible transitions are: (1) 1->1, 1->2; (2) 2->1, 2->2, 2->3; (3) 3->2, 3->3, 3->4; (4) 4->3, 4->4, 4->5; (5) 5->4, 5->5.

The Agent's actions are encoded as -1, 0, and +1, which correspond to its intention to move left, stay, and move right, respectively. The Environment, however, does not always follow these intentions: with probability 10%, action 0 results in a move to the left (if moving to the left is admissible), and action +1 results in staying. In other words, there is an "east wind." Action -1 is always executed as intended. More specifically, the non-zero transition probabilities $p(s^\prime,r|s,a)$ are

$p(s^\prime=1,r=0|s=1,a=0)=1$,<br>
$p(s^\prime=1,r=0|s=1,a=+1)=0.1,p(s^\prime=2,r=0|s=1,a=+1)=0.9$,<br>

$p(s^\prime=1,r=0|s=2,a=-1)=1$,<br>
$p(s^\prime=1,r=0|s=2,a=0)=0.1,p(s^\prime=2,r=0|s=2,a=0)=0.9$,<br>
$p(s^\prime=2,r=0|s=2,a=+1)=0.1,p(s^\prime=3,r=1|s=2,a=+1)=0.9$,<br>

$p(s^\prime=2,r=0|s=3,a=-1)=1$,<br>
$p(s^\prime=2,r=0|s=3,a=0)=0.1,p(s^\prime=3,r=1|s=3,a=0)=0.9$,<br>
$p(s^\prime=3,r=1|s=3,a=+1)=0.1,p(s^\prime=4,r=0|s=3,a=+1)=0.9$,<br>

etc.

Further, we assume that whenever the process enters state 3, the Environment generates reward = 1. In all other cases the reward is 0. For example, transition 2->3 will result in reward 1, transition 3->3 will result in reward 1, transition 3->2 will result in reward 0, transition 2->2 will result in reward 0, etc.
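The rules above can be collected into a single lookup table. The following is a minimal sketch, not part of the original notebook: the names `build_model`, `admissible`, `WIND`, and `NO_WIND`, as well as the dictionary layout, are illustrative assumptions.

```python
# Sketch of the transition model p(s', r | s, a) described in the text.
# All names here are hypothetical helpers, not part of the notebook.
STATES = (1, 2, 3, 4, 5)
ACTIONS = (-1, 0, 1)
WIND = 0.1      # probability that the wind shifts the move one cell to the left
NO_WIND = 0.9   # probability that the intended move is executed


def admissible(s):
    """Actions that keep the agent inside states 1..5."""
    return [a for a in ACTIONS if 1 <= s + a <= 5]


def build_model():
    """Return {(s, a): [(prob, s_next, reward), ...]} for admissible pairs."""
    model = {}
    for s in STATES:
        for a in admissible(s):
            # The wind affects actions 0 and +1, and only when the shifted
            # move still lands on a valid state.
            if a >= 0 and s + a - 1 >= 1:
                moves = [(WIND, a - 1), (NO_WIND, a)]
            else:
                moves = [(1.0, a)]
            # Reward is 1 exactly when the next state is 3.
            model[(s, a)] = [(p, s + m, int(s + m == 3)) for p, m in moves]
    return model


model = build_model()
print(model[(2, 1)])  # [(0.1, 2, 0), (0.9, 3, 1)]
```

Each entry lists the outcomes of one state-action pair, so the probabilities in every list sum to 1, and the entries for states 1 to 3 reproduce the probabilities written out above.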



Finally, assume that the Agent knows neither about the wind nor what rewards to expect. It chooses to stay in every state, i.e., the policy is
$\pi(-1|1)=0, \pi(0|1)=1, \pi(+1|1)=0$,<br>
$\pi(-1|2)=0, \pi(0|2)=1, \pi(+1|2)=0$,<br>
$\pi(-1|3)=0, \pi(0|3)=1, \pi(+1|3)=0$,<br>
$\pi(-1|4)=0, \pi(0|4)=1, \pi(+1|4)=0$,<br>
$\pi(-1|5)=0, \pi(0|5)=1, \pi(+1|5)=0$.

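Let’s use $\gamma=0.9$ and $T=100$. We estimate the action-value function $q_\pi(s,a)$ via First-Visit Monte Carlo (MC) prediction, starting each episode from a randomly chosen state and a randomly chosen admissible first action.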
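As a sketch of what is being estimated (the indexing convention is an assumption chosen to match the code cells in this section, where the reward stored at index $t$ is $R_{t+1}$), the truncated discounted return and the running-average update can be written as

$$G_t=\sum_{k=0}^{T-t-1}\gamma^{k}R_{t+k+1},\qquad G_t=R_{t+1}+\gamma G_{t+1},\qquad G_T=0,$$

$$N(s,a)\leftarrow N(s,a)+1,\qquad Q(s,a)\leftarrow Q(s,a)+\frac{1}{N(s,a)}\bigl(G_t-Q(s,a)\bigr),$$

where the update is applied only at the first time $t$ in the episode at which $(S_t,A_t)=(s,a)$. With $\gamma=0.9$ and $T=100$, the neglected tail is weighted by at most $\gamma^{100}\approx 2.7\times 10^{-5}$.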

In [123]:
import random
from matplotlib import pyplot as plt 
import numpy as np

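class Environment:
    """Five-state chain (states 1..5) with an east wind; entering state 3 pays reward 1."""

    def __init__(self, S0 = 1):
        self.time = 0
        self.state = S0

    def admissible_actions(self):
        # -1 (left) is unavailable in state 1; +1 (right) is unavailable in state 5.
        A = [-1, 0, 1]
        if self.state == 1: A.remove(-1)
        if self.state == 5: A.remove(1)
        return A

    def check_state(self):
        return self.state

    def get_reward(self, action):
        """Apply `action`, advance the clock, and return the reward for the new state."""
        self.time += 1
        move = action
        # Wind: with probability 0.1, action 0 becomes a move left (only if state > 1)
        # and action +1 becomes a stay (always possible, including in state 1).
        if move == 1 or (move == 0 and self.state > 1):
            move = np.random.choice([move-1, move], p=[0.1, 0.9])
        self.state += move
        # Reward 1 whenever the process enters (or remains in) state 3.
        reward = 1 if self.state == 3 else 0
        return reward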

In [124]:
class Agent:
    """Takes the given first action A0, then always stays (action 0)."""

    def __init__(self, A0=0):
        self.current_reward = 0.0
        self.current_action = A0

    def step(self, env):
        #actions = env.admissible_actions()
        # Policy: stay in every state, except that the first action of the episode is A0.
        action_selected = 0
        if env.time == 0:
            action_selected = self.current_action
        reward = env.get_reward(action_selected)
        self.current_action = action_selected
        self.current_reward = reward

In [125]:
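def gen_episode(S0, A0, T=10):
    """Simulate T+1 steps from (S0, A0) and return [states, actions, rewards].

    Index t holds S_t, A_t, and the reward R_{t+1} received after taking A_t.
    """
    env = Environment(S0)
    agent = Agent(A0)
    states = []
    actions = []
    rewards = []
    for t in range(T+1):
        states.append(env.state)      # record S_t before acting
        agent.step(env)
        actions.append(agent.current_action)
        rewards.append(agent.current_reward)
    return [states, actions, rewards]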

In [131]:
gen_episode(4,1, T=24)

[[4, 5, 5, 5, 5, 5, 5, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]

In [133]:
gen_episode(4,0, T=24)

[[4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 3, 2, 1, 1, 1, 1, 1, 1],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]

In [132]:
gen_episode(4,-1, T=24)

[[4, 3, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 [-1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
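To make the link between a reward sequence and the returns used by MC prediction concrete, here is a small sketch that applies the backward recursion $G_t = R_{t+1} + \gamma G_{t+1}$ to the reward list of the episode just shown. The helper name `discounted_returns` is an illustrative assumption, not part of the notebook.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_{n-1}] where rewards[t] is the reward that follows step t."""
    returns = [0.0] * len(rewards)
    g = 0.0
    # Walk backward so each return reuses the one computed after it.
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


# Reward list of the episode generated from S0 = 4, A0 = -1 above.
rewards = [1, 1] + [0] * 23
returns = discounted_returns(rewards, gamma=0.9)
print(round(returns[0], 2), round(returns[1], 2), returns[2])  # 1.9 1.0 0.0
```

For this single episode, the sampled return for the pair (4, -1) is therefore $1 + 0.9 \cdot 1 = 1.9$; averaging such samples over many episodes is what the MC estimate does.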

In [137]:
S0 = 2
env = Environment(S0)
env.admissible_actions()

[-1, 0, 1]

In [121]:
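gamma = 0.9
T = 100
N = np.zeros((5, 3))  # visit counts; rows are states 1..5, columns are actions -1, 0, +1
Q = np.zeros((5, 3))  # running averages of first-visit returns

for k in range(4000):
    # Exploring start: random initial state and random admissible first action.
    S0 = np.random.randint(low=1, high=6, size=1)[0]
    env = Environment(S0)
    A0 = np.random.choice(env.admissible_actions())
    if k < 20:
        print(S0,A0)
    episode = gen_episode(S0, A0, T)
    pairs = list(zip(episode[0], episode[1]))  # (S_t, A_t) for each time step
    G = 0
    # Walk backward so that G accumulates the discounted return from time t.
    for t in reversed(range(0,T)):
        S, A = pairs[t]
        R = episode[2][t]
        G = gamma*G + R
        # First visit: the pair (S, A) must not have occurred earlier in the episode.
        if (S, A) not in pairs[0:t]:
            N[S-1,A+1] = N[S-1,A+1] + 1
            Q[S-1,A+1] = Q[S-1,A+1] + 1/N[S-1,A+1]*(G - Q[S-1,A+1])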
            

5 -1
2 1
3 0
4 1
2 0
4 0
3 0
5 0
4 1
4 0
2 0
5 0
2 0
2 -1
4 -1
4 0
1 1
4 1
2 -1
2 1
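
When estimating action values, the "first visit" refers to the first occurrence of the pair $(S_t, A_t)$ in an episode, not to the state and the action taken separately. The sketch below shows one way to locate these first occurrences; the helper `first_visit_times` and the short example episode are illustrative assumptions.

```python
def first_visit_times(states, actions):
    """Map each (state, action) pair to the first time step at which it occurs."""
    first = {}
    for t, pair in enumerate(zip(states, actions)):
        # setdefault keeps the earliest t and ignores later repeats of the pair.
        first.setdefault(pair, t)
    return first


# Hypothetical short episode: move left once, then always stay.
states = [4, 3, 3, 2, 2, 1]
actions = [-1, 0, 0, 0, 0, 0]
print(first_visit_times(states, actions))
# {(4, -1): 0, (3, 0): 1, (2, 0): 3, (1, 0): 5}
```

Here the pair (3, 0) occurs at steps 1 and 2, but only the return from step 1 would enter the average for $Q(3, 0)$.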


In [122]:
Q

array([[0.  , 0.  , 0.  ],
       [0.  , 0.  , 4.79],
       [0.  , 4.8 , 2.72],
       [5.38, 2.77, 1.24],
       [2.56, 1.27, 0.  ]])
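
As a sanity check on these estimates, the action values of the "always stay" policy can also be computed directly from the transition probabilities by solving the Bellman equations. The sketch below is not part of the original notebook: the helper `outcomes` is a hypothetical name, the model follows the wind and reward rules stated in the text, and an infinite horizon is assumed (the truncation at $T=100$ changes the values only negligibly).

```python
import numpy as np

gamma = 0.9
states = [1, 2, 3, 4, 5]


def outcomes(s, a):
    """List of (prob, next_state, reward) under the wind rule from the text."""
    if a >= 0 and s + a - 1 >= 1:
        moves = [(0.1, a - 1), (0.9, a)]   # wind shifts the move one cell left
    else:
        moves = [(1.0, a)]                 # action -1, or action 0 in state 1
    return [(p, s + m, float(s + m == 3)) for p, m in moves]


# State values of the stay policy: solve v = r + gamma * P v.
P = np.zeros((5, 5))
r = np.zeros(5)
for s in states:
    for p, s_next, reward in outcomes(s, 0):
        P[s - 1, s_next - 1] += p
        r[s - 1] += p * reward
v = np.linalg.solve(np.eye(5) - gamma * P, r)

# q(s, a) = sum over outcomes of p * (reward + gamma * v(s')).
# NaN marks inadmissible pairs (left in state 1, right in state 5).
q = np.full((5, 3), np.nan)
for s in states:
    for a in (-1, 0, 1):
        if 1 <= s + a <= 5:
            q[s - 1, a + 1] = sum(
                p * (rew + gamma * v[sn - 1]) for p, sn, rew in outcomes(s, a)
            )

print(np.round(q, 2))
```

Under these assumptions, $q_\pi(3,0)=0.9/0.19\approx 4.74$ and $q_\pi(4,-1)=1+0.9\cdot 4.74\approx 5.26$, close to the Monte Carlo estimates 4.8 and 5.38 above. The zeros that the MC table shows for inadmissible pairs simply reflect that those pairs are never visited.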