# RL - 2 : Questions 2-4, 8, 9

1 <-> 2 <-> 3 <-> 4 -> Goal

Let's assume the agent starts in each numbered cell with equal probability. In every cell it moves left or right with equal probability (from cell 1 it can only move right). The agent receives a reward of -1 at each numbered cell (for lollygagging) and a reward of +1 at the goal cell.

**Question 2:**
What is the expected undiscounted return the agent receives?

**Question 3:**
How does the expected return change when the agent moves right with probability 0.75 and 0.25, respectively?

**Question 4:**
What is the expected undiscounted return of an agent following the optimal policy?

**Question 8:**
What are the advantages of moving right from cell 1, 2, 3, and 4?

**Question 9:**
What is the difference in state values, V-optimal(1) - V(1), for cell 1 under the two policies (optimal and uniform)?
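
For the uniform policy, the Bellman equations can be written out explicitly. The following is a worked sketch, assuming the convention used by the code in this notebook: the reward is collected on entering a cell, the goal is terminal with $V(G) = 0$, there is no discounting, and cell 1 can only move right.

```latex
\begin{aligned}
V(1) &= -1 + V(2) \\
V(2) &= \tfrac12\,[-1 + V(1)] + \tfrac12\,[-1 + V(3)] \\
V(3) &= \tfrac12\,[-1 + V(2)] + \tfrac12\,[-1 + V(4)] \\
V(4) &= \tfrac12\,[-1 + V(3)] + \tfrac12\,[\,1 + V(G)\,] = \tfrac12\,V(3) \\[4pt]
\Rightarrow\quad (V(1), V(2), V(3), V(4)) &= (-14,\,-13,\,-10,\,-5),
\qquad \tfrac14 \sum_{s=1}^{4} V(s) = -10.5
\end{aligned}
```

Under the same assumptions, always moving right gives $(V^*(1), V^*(2), V^*(3), V^*(4)) = (-2, -1, 0, 1)$, so the optimal expected return is $-0.5$ and $V^*(1) - V(1) = 12$.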

In [9]:
# Expected return = sum over start states s of P(s) * V(s), with s in {1, 2, 3, 4}:
#  1/4 * [ V(1) + V(2) + V(3) + V(4) ]
#  V(s) = sum-over-actions { P(a | s) * Q(s, a) }
#  Q(s, a) = sum-over-states { P(s' | s, a) * [ r(s') + gamma * V(s') ] }
#  (the reward is collected on entering s', which is how the code below indexes `rewards`)
#
#  V(G) = 0 # terminal: no transitions from the goal state, so no further reward.
#
#  V(4) = 1/2 * Q(4, <-) + 1/2 * Q(4, ->)
#  Q(4, ->) = 1 * [ 1 + 1 * V(G) ] = 1
#  Q(4, <-) = 1 * [ -1 + 1 * V(3) ]
#  V(4) = 1/2 * 1 + 1/2 * [-1 + V(3)] = 1/2 * V(3)
# etc.

import numpy as np

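def compute_optimal_return(
    P_left, P_right,
    rewards=np.array([-1, -1, -1, -1, 1]), discount_factor=1):
    """Value iteration for the optimal (greedy) policy.

    States 0..3 are cells 1..4 and state 4 is the goal; action 0 is left and
    action 1 is right. P_left and P_right are unused and kept only so the
    signature matches compute_expected_return.
    """
    Q = np.zeros((5, 2))
    V = np.zeros(5)  # V[4] (goal) stays 0: the goal is terminal

    for _ in range(1000):
        # Left: the reward is collected on entering the destination cell.
        for s in range(1, 4):
            Q[s][0] = rewards[s-1] + discount_factor * V[s-1]
        Q[0][0] = 0  # placeholder: cell 1 has no left move

        # Right.
        for s in range(0, 4):
            Q[s][1] = rewards[s+1] + discount_factor * V[s+1]

        # Greedy backup; cell 1 can only move right.
        for s in range(1, 4):
            V[s] = max(Q[s][0], Q[s][1])
        V[0] = Q[0][1]

    # Average over the four equally likely start cells.
    return (1/4 * np.sum(V[0:4])), V, Q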

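def compute_expected_return(
    P_left, P_right,
    rewards=np.array([-1, -1, -1, -1, 1]), discount_factor=1):
    """Iterative policy evaluation for a fixed stochastic policy.

    P_left[s] and P_right[s] are the probabilities of moving left and right
    in state s (states 0..3 are cells 1..4, state 4 is the goal).
    """
    Q = np.zeros((5, 2))
    V = np.zeros(5)  # V[4] (goal) stays 0: the goal is terminal

    for _ in range(1000):
        # Left: the reward is collected on entering the destination cell.
        # Q[0][0] stays 0, which is harmless as long as P_left[0] == 0.
        for s in range(1, 4):
            Q[s][0] = rewards[s-1] + discount_factor * V[s-1]

        # Right.
        for s in range(0, 4):
            Q[s][1] = rewards[s+1] + discount_factor * V[s+1]

        # Expected backup under the policy.
        for s in range(0, 4):
            V[s] = P_left[s] * Q[s][0] + P_right[s] * Q[s][1]

    # Average over the four equally likely start cells.
    return (1/4 * np.sum(V[0:4])), V, Q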

P_left = np.array([0, 1/2, 1/2, 1/2, 0])
P_right = np.array([1, 1/2, 1/2, 1/2, 0])
print("Expected return (1/2, 1/2):", compute_expected_return(P_left, P_right)[0])

P_left = np.array([0, 1/4, 1/4, 1/4, 0])
P_right = np.array([1, 3/4, 3/4, 3/4, 0])
print("Expected return (1/4, 3/4):", compute_expected_return(P_left, P_right)[0])

P_left = np.array([0, 3/4, 3/4, 3/4, 0])
P_right = np.array([1, 1/4, 1/4, 1/4, 0])
print("Expected return (3/4, 1/4):", compute_expected_return(P_left, P_right)[0])

P_left = np.array([0, 0, 0, 0, 0])
P_right = np.array([1, 1, 1, 1, 0])
print("Expected return (optimal):", compute_expected_return(P_left, P_right)[0])


P_left = np.array([0, 1/2, 1/2, 1/2, 0])
P_right = np.array([1, 1/2, 1/2, 1/2, 0])
_, V, Q = compute_expected_return(P_left, P_right)
print("advantages of moving right from cell 1, 2, 3, and 4:", Q[0:4, 1] - V[0:4])
_, V_opt, _ = compute_optimal_return(P_left, P_right)
print("difference in state values V-optimal(1) - V(1):", V_opt[0]-V[0])


Expected return (1/2, 1/2): -10.499999999999995
Expected return (1/4, 3/4): -2.462962962962962
Expected return (3/4, 1/4): -99.4899706960878
Expected return (optimal): -0.5
advantages of moving right from cell 1, 2, 3, and 4: [0. 2. 4. 6.]
difference in state values V-optimal(1) - V(1): 11.999999999999993
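
The iterative sweeps above only approximate the fixed point, and the (3/4, 1/4) case converges slowly, which is why its printed value is slightly off from a round number. As a cross-check, the Bellman equations can be solved exactly as a 4x4 linear system. This is a minimal sketch in pure Python, not part of the original notebook; `solve_linear` and `exact_values` are hypothetical helper names, and the model assumes the same convention as the code above (reward on entering a cell, `V(goal) = 0`, cell 1 always moves right).

```python
from fractions import Fraction

def solve_linear(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination on Fractions."""
    n = len(b)
    # Augmented matrix [A | b] with exact rational entries.
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # Find a row with a non-zero pivot and move it into place.
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        # Normalise the pivot row, then eliminate the column elsewhere.
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]

def exact_values(p_right):
    """Exact V(1..4) when cells 2-4 move right with probability p_right.

    Assumed convention: reward -1 on entering a numbered cell, +1 on entering
    the goal, V(goal) = 0, and cell 1 always moves right.
    """
    p = Fraction(p_right)
    q = 1 - p
    # Each row encodes V(s) - sum_s' P(s' | s) V(s') = expected immediate reward.
    A = [
        [1, -1, 0, 0],
        [-q, 1, -p, 0],
        [0, -q, 1, -p],
        [0, 0, -q, 1],
    ]
    b = [-1, -1, -1, p * 1 + q * (-1)]
    return solve_linear(A, b)

for label, p in [("1/2", Fraction(1, 2)), ("3/4", Fraction(3, 4)),
                 ("1/4", Fraction(1, 4)), ("1 (optimal)", Fraction(1))]:
    V = exact_values(p)
    print(f"P(right) = {label}: V = {[str(v) for v in V]}, "
          f"expected return = {sum(V) / 4}")
```

Under these assumptions the exact expected returns are -21/2, -133/54 (about -2.463), -199/2 and -1/2. The printed -99.4899... for the (3/4, 1/4) policy is therefore an under-converged estimate of -99.5; more sweeps or a direct solve closes the gap.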


# RL - 2 : Question 7

1 <-> 2 <-> 3 <-> 4 -> Goal

Let's assume the agent starts in each numbered cell with equal probability. In every cell it moves left or right with equal probability (from cell 1 it can only move right). The agent receives a reward of -1 at each numbered cell (for lollygagging) and a reward of +1 at the goal cell.

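We initialize the policy with weights w = [6, 0, 0, 0] and bias b = 0, where P(right | cell i) = sigmoid(w_i + b). This gives a uniform policy in cells 2 to 4 (sigmoid(0) = 0.5) and P(right) = sigmoid(6) ≈ 0.9975 in cell 1, where the agent can only move right. The agent starts in cell 4 and takes quite a detour: 4 -> 3 -> 2 -> 1 -> 2 -> 3 -> 4 -> Goal.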

With this episode, let’s update the policy with REINFORCE. How would the policy weights change?

In [11]:
import torch
from torch import nn
import numpy as np
from torch.distributions.bernoulli import Bernoulli
from torch.optim import SGD

states = np.array([
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1]
])

L = nn.Linear(in_features=4, out_features=1, bias=False)
with torch.no_grad():
    L.weight.copy_(torch.tensor([6.0, 0.0, 0.0, 0.0], dtype=torch.float32))

model = nn.Sequential(L, nn.Sigmoid())
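# Episode 4 -> 3 -> 2 -> 1 -> 2 -> 3 -> 4 -> Goal as (state index, action, reward);
# action 0 = left, action 1 = right.
trajectory = [(3, 0, -1), (2, 0, -1), (1, 0, -1), (0, 1, -1), (1, 1, -1), (2, 1, -1), (3, 1, 1)]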

inputs = []   # one-hot state features, one row per step
actions = []  # 0 = left, 1 = right

for (s, a, r) in trajectory:
    inputs.append(states[s])
    actions.append(a)

# Undiscounted reward-to-go G_t, accumulated backwards from the final step.
returns = [trajectory[-1][2]]
for i in reversed(range(0, len(trajectory)-1)):
    returns.append(trajectory[i][2] + returns[-1])
returns.reverse()

optimizer = SGD(model.parameters(), lr=1)
ip = torch.tensor(np.array(inputs), dtype=torch.float32, requires_grad=False)

optimizer.zero_grad()
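probs = model(ip).squeeze(-1)  # shape (7,), matching the actions tensor
dist = Bernoulli(probs=probs)
# REINFORCE loss: -mean(log pi(a_t | s_t) * G_t). Without the squeeze above,
# log_prob would broadcast (7, 1) against (7,) into a (7, 7) matrix.
log_probs = dist.log_prob(torch.tensor(actions, dtype=torch.float32))
loss = -(log_probs * torch.as_tensor(returns, dtype=torch.float32)).mean()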
loss.backward()
optimizer.step()

print(model[0].weight)


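NameError: name '_C' is not defined

Note: this error is raised by `import torch` itself, not by the REINFORCE code. It typically indicates a broken PyTorch installation (or a kernel that needs restarting after installing it), so the cell produced no updated weights.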
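
The REINFORCE step for this episode can also be worked out by hand, without PyTorch. The sketch below is an illustration rather than the notebook's own method: it assumes the loss is the mean over the seven steps of `-log pi(a_t | s_t) * G_t`, undiscounted reward-to-go returns, a learning rate of 1, and a policy `pi(right | s) = sigmoid(w[s])`. The helper names `sigmoid`, `reward_to_go` and `reinforce_step` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def reward_to_go(rewards):
    """Undiscounted return G_t = r_t + r_{t+1} + ... for each step t."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running += r
        returns.append(running)
    return returns[::-1]

def reinforce_step(weights, trajectory, lr=1.0):
    """One gradient step on loss = -mean_t( log pi(a_t | s_t) * G_t ).

    With one-hot features and no bias, pi(right | s) = sigmoid(weights[s]),
    so d log pi(a | s) / d weights[s] = a - sigmoid(weights[s]).
    """
    returns = reward_to_go([r for (_, _, r) in trajectory])
    grad = [0.0] * len(weights)  # gradient of the loss w.r.t. each weight
    for (s, a, _), g in zip(trajectory, returns):
        grad[s] -= g * (a - sigmoid(weights[s])) / len(trajectory)
    return [w - lr * dw for w, dw in zip(weights, grad)]

# Episode 4 -> 3 -> 2 -> 1 -> 2 -> 3 -> 4 -> Goal as (state index, action, reward).
trajectory = [(3, 0, -1), (2, 0, -1), (1, 0, -1), (0, 1, -1),
              (1, 1, -1), (2, 1, -1), (3, 1, 1)]
new_weights = reinforce_step([6.0, 0.0, 0.0, 0.0], trajectory)
print(new_weights)
```

Under these assumptions the weights for cells 2, 3 and 4 rise from 0 to 1/7, 2/7 and 3/7, making a right move more likely there, while the weight for cell 1 drops very slightly below 6. If the loss used a sum over steps instead of a mean, every change would be seven times larger.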