# HW 3

Max Schrader

In [1]:
import numpy as np

## Question 1

Write a Python program to calculate the value functions in slides 16-18 (three cases: $\gamma = 0, 0.9, 1$). Compare the results with those in the slides.

In [2]:
MOVEMENTS = {"FACEBOOK": {"R": -1, "P": {"FACEBOOK": 0.9, "CLASS_1": 0.1}},
             "CLASS_1": {"R": -2, "P": {"FACEBOOK": 0.5, "CLASS_2": 0.5}},
             "CLASS_2": {"R": -2, "P": {"CLASS_3": 0.8, "SLEEP": 0.2}},
             "CLASS_3": {"R": -2, "P": {"PASS": 0.6, "PUB": 0.4}},
             "PASS": {"R": 10, "P": {"SLEEP": 1}},
             "SLEEP": {"R": 0, "P": {}},
             "PUB": {"R": 1, "P": {"CLASS_1": 0.2, "CLASS_2": 0.4, "CLASS_3": 0.4}},
             }
NUM_STATES = 7

### Composing the R and P Matrices

In [3]:
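P = []
R = []
# Map each state name to its row/column index (dicts preserve insertion order).
ORDER = {key: i for i, key in enumerate(MOVEMENTS.keys())}
for state, state_info in MOVEMENTS.items():
    R.append(state_info['R'])
    # Start from an all-zero row, then fill in the listed transitions.
    P.append([0] * len(ORDER))
    for to_state, prob in state_info['P'].items():
        P[-1][ORDER[to_state]] = prob
P = np.array(P)  # transition matrix; the terminal SLEEP row stays all zeros
R = np.array(R)  # immediate reward per state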
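
A quick sanity check on the transition matrix can catch typos in the probabilities before any values are computed. The sketch below is only illustrative: `P_check` and `check_transition_matrix` are hypothetical names, and the matrix is retyped by hand from `MOVEMENTS` in the order FACEBOOK, CLASS_1, CLASS_2, CLASS_3, PASS, SLEEP, PUB. It assumes a terminal state is encoded as an all-zero row, as SLEEP is here.

```python
import numpy as np

# Transition matrix retyped from MOVEMENTS (assumed order:
# FACEBOOK, CLASS_1, CLASS_2, CLASS_3, PASS, SLEEP, PUB).
P_check = np.array([
    [0.9, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.8, 0.0, 0.2, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.6, 0.0, 0.4],
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # SLEEP: terminal, no outgoing transitions
    [0.0, 0.2, 0.4, 0.4, 0.0, 0.0, 0.0],
])


def check_transition_matrix(p):
    """Return True if all entries are non-negative and every row sums to 1 (or 0 for a terminal state)."""
    row_sums = p.sum(axis=1)
    rows_ok = np.isclose(row_sums, 1.0) | np.isclose(row_sums, 0.0)
    return bool(np.all(p >= 0) and np.all(rows_ok))


print(check_transition_matrix(P_check))
```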

### Calculating the State-Value Function 

In [4]:
def state_value(discount_factor):
    """Print v = (I - gamma * P)^-1 R for each state, rounded to 2 decimals."""
    I = np.eye(len(P))
    values = np.matmul(np.linalg.inv(I - discount_factor * P), R)
    for key, state in ORDER.items():
        print(f"{key}: ", round(values[state], 2))
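
As a cross-check on the closed-form solution, the same values can be approximated by repeatedly applying the Bellman update $v \leftarrow R + \gamma P v$ until it stops changing. This is a minimal sketch rather than part of the assignment: `iterative_state_value`, `P_demo`, and `R_demo` are hypothetical names, the matrices are retyped from `MOVEMENTS`, and the loop is only guaranteed to converge for $\gamma < 1$.

```python
import numpy as np

# Same MRP as above, retyped by hand (assumed order:
# FACEBOOK, CLASS_1, CLASS_2, CLASS_3, PASS, SLEEP, PUB).
P_demo = np.array([
    [0.9, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.8, 0.0, 0.2, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.6, 0.0, 0.4],
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.2, 0.4, 0.4, 0.0, 0.0, 0.0],
])
R_demo = np.array([-1.0, -2.0, -2.0, -2.0, 10.0, 0.0, 1.0])


def iterative_state_value(p, r, gamma, tol=1e-10, max_iter=10_000):
    """Approximate the state values by fixed-point iteration on v = r + gamma * p @ v."""
    v = np.zeros(len(r), dtype=float)
    for _ in range(max_iter):
        v_next = r + gamma * (p @ v)
        # Stop once the largest change across states is below the tolerance.
        if np.max(np.abs(v_next - v)) < tol:
            return v_next
        v = v_next
    return v


print(np.round(iterative_state_value(P_demo, R_demo, 0.9), 2))
```

With $\gamma = 0.9$ the result should agree with the matrix-inverse values to two decimals.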

### Discount of 0

In [5]:
state_value(0)

FACEBOOK:  -1.0
CLASS_1:  -2.0
CLASS_2:  -2.0
CLASS_3:  -2.0
PASS:  10.0
SLEEP:  0.0
PUB:  1.0


### Discount of 0.9

In [6]:
state_value(0.9)

FACEBOOK:  -7.64
CLASS_1:  -5.01
CLASS_2:  0.94
CLASS_3:  4.09
PASS:  10.0
SLEEP:  0.0
PUB:  1.91


### Discount of 1

In [7]:
state_value(1)

FACEBOOK:  -22.54
CLASS_1:  -12.54
CLASS_2:  1.46
CLASS_3:  4.32
PASS:  10.0
SLEEP:  0.0
PUB:  0.8


### Compare the results with those in the slides

The results match those in the slides; the only differences come from rounding.

## Question 2

Solve the Bellman equation in either Python or MATLAB using the equation shown in slide 23 and compare the solution with the values shown in slide 21.

In [8]:
def bellman(p: np.array, r: np.array, gamma: float) -> np.array:
    return np.matmul(np.linalg.inv(np.eye(len(p)) - gamma * p), r)

In [9]:
[round(v, 2) for v in bellman(P, R, 1)]

[-22.54, -12.54, 1.46, 4.32, 10.0, 0.0, 0.8]

The solution above matches slide 21 exactly, apart from rounding.
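
An alternative to forming the inverse explicitly is to solve the linear system $(I - \gamma P)\,v = R$ directly, which is generally the numerically preferred route. The sketch below assumes the same MRP as in Question 1; `bellman_solve`, `P_mrp`, and `R_mrp` are hypothetical names introduced only for this illustration.

```python
import numpy as np

# MRP retyped from MOVEMENTS (assumed order:
# FACEBOOK, CLASS_1, CLASS_2, CLASS_3, PASS, SLEEP, PUB).
P_mrp = np.array([
    [0.9, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.8, 0.0, 0.2, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.6, 0.0, 0.4],
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.2, 0.4, 0.4, 0.0, 0.0, 0.0],
])
R_mrp = np.array([-1.0, -2.0, -2.0, -2.0, 10.0, 0.0, 1.0])


def bellman_solve(p, r, gamma):
    """Solve (I - gamma * p) v = r without computing the inverse explicitly."""
    return np.linalg.solve(np.eye(len(p)) - gamma * p, r)


v_solve = bellman_solve(P_mrp, R_mrp, 1.0)
v_inv = np.linalg.inv(np.eye(len(P_mrp)) - P_mrp) @ R_mrp
print(np.round(v_solve, 2))
print(np.allclose(v_solve, v_inv))
```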

## Question 3

Verify all state values in slide 29 by solving the Bellman equations by hand.

In [10]:
gamma = 1
policy = 0.5
# Sleep:
print("Sleep:", policy * (0))
# Study:
print("Study:", round(policy * (10) + policy * (1 + 0.2*-1.3 + 0.4*2.7 + 0.4*7.4), 2))
# Class 2:
print("Class 2:", round(policy * (-2 + 7.4) + policy * 0, 2))
# Class 1:
print("Class 1:", round(policy * (-2 + 2.7) + policy * (-1 + -2.3), 2))
# Facebook:
print("Facebook:", round(policy * (-1 + -2.3) + policy * (0 + -1.3), 2))

Sleep: 0.0
Study: 7.39
Class 2: 2.7
Class 1: -1.3
Facebook: -2.3
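
The check above plugs the slide's rounded values into the right-hand side of each equation. As a complementary sketch, the five Bellman expectation equations can also be solved simultaneously as one linear system $v_\pi = R_\pi + P_\pi v_\pi$ (with $\gamma = 1$). The names `P_pi`, `R_pi`, and `v_pi` are hypothetical, and the rewards and transitions are inferred from the expressions in the cell above, assuming each of the two actions in a state is chosen with probability 0.5.

```python
import numpy as np

states = ["Facebook", "Class 1", "Class 2", "Study", "Sleep"]

# Policy-averaged transition matrix: each of the two actions has probability 0.5.
P_pi = np.array([
    [0.5, 0.5, 0.0, 0.0, 0.0],  # Facebook: stay on Facebook, or move to Class 1
    [0.5, 0.0, 0.5, 0.0, 0.0],  # Class 1: go to Facebook, or move on to Class 2
    [0.0, 0.0, 0.0, 0.5, 0.5],  # Class 2: move on to Study, or go to Sleep
    [0.0, 0.1, 0.2, 0.2, 0.5],  # Study: Sleep (0.5), or Pub (0.5 * [0.2, 0.4, 0.4])
    [0.0, 0.0, 0.0, 0.0, 0.0],  # Sleep: terminal
])

# Policy-averaged immediate rewards, e.g. Study: 0.5 * 10 + 0.5 * 1 = 5.5.
R_pi = np.array([
    0.5 * -1 + 0.5 * 0,
    0.5 * -1 + 0.5 * -2,
    0.5 * -2 + 0.5 * 0,
    0.5 * 10 + 0.5 * 1,
    0.0,
])

# Solve (I - P_pi) v = R_pi for all states at once.
v_pi = np.linalg.solve(np.eye(len(states)) - P_pi, R_pi)
for name, value in zip(states, v_pi):
    print(f"{name}: {value:.2f}")
```

Rounded to one decimal place, the solution should reproduce the values used in the cell above (-2.3, -1.3, 2.7, 7.4, 0).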


## Question 4

Verify all values of the optimal value function and optimal action-value function in slides 38-39 by solving Bellman equations by hand. Are all values correct?

### Optimal Value Function

In [11]:
# Facebook
print("Facebook:", 0 + -2 + -2 + 10)

# Class 1
print("Class 1:", -2 + -2 + 10)

# Class 2
print("Class 2:", -2 + 10)

# Study
print("Study:", 10)

# Sleep
print("Sleep:", 0)

Facebook: 6
Class 1: 6
Class 2: 8
Study: 10
Sleep: 0
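
The hand calculation follows the best path from each state. One way to double-check it is value iteration on the Bellman optimality equation $v^*(s) = \max_a \big[ R(s,a) + \sum_{s'} P(s' \mid s,a)\, v^*(s') \big]$ with $\gamma = 1$. This is a rough sketch under assumptions: the `mdp` dictionary, its action labels, and `value_iteration` are hypothetical, with rewards and transitions inferred from the calculations in this notebook.

```python
# state -> action label -> (reward, {next_state: probability})
mdp = {
    "Facebook": {"stay": (-1, {"Facebook": 1.0}), "quit": (0, {"Class 1": 1.0})},
    "Class 1": {"facebook": (-1, {"Facebook": 1.0}), "study": (-2, {"Class 2": 1.0})},
    "Class 2": {"sleep": (0, {"Sleep": 1.0}), "study": (-2, {"Study": 1.0})},
    "Study": {
        "study": (10, {"Sleep": 1.0}),
        "pub": (1, {"Class 1": 0.2, "Class 2": 0.4, "Study": 0.4}),
    },
    "Sleep": {},  # terminal state, no actions
}


def value_iteration(mdp, gamma=1.0, tol=1e-9, max_iter=1000):
    """Apply the Bellman optimality backup until the values stop changing."""
    v = {s: 0.0 for s in mdp}
    for _ in range(max_iter):
        v_new = {}
        for s, actions in mdp.items():
            if not actions:
                v_new[s] = 0.0  # terminal state keeps value 0
                continue
            v_new[s] = max(
                r + gamma * sum(p * v[s2] for s2, p in nxt.items())
                for r, nxt in actions.values()
            )
        if max(abs(v_new[s] - v[s]) for s in mdp) < tol:
            return v_new
        v = v_new
    return v


v_star = value_iteration(mdp)
for state, value in v_star.items():
    print(f"{state}: {value:g}")
```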


### Optimal Action-Value Function

In [12]:
# Facebook to Facebook
print("Facebook -> Facebook:", -1 + 6)

# Facebook to Class 1
print("Facebook -> Class 1:", 0 + 6)

# Class 1 to Facebook
print("Class 1 -> Facebook:", -1 + 6)

# Class 1 to Class 2
print("Class 1 -> Class 2:", -2 + 8)

# Class 2 to Sleep
print("Class 2 -> Sleep:", 0 + 0)

# Class 2 to Study
print("Class 2 -> Study:", -2 + 10)

# Study -> Sleep
print("Study -> Sleep:", 10)

# Study -> Pub
print("Study -> Pub:", 1 + 0.2 * 6 + 0.4 * 8 + 0.4 *10)

# Sleep -> Sleep
print("Sleep -> Sleep:", 0)

Facebook -> Facebook: 5
Facebook -> Class 1: 6
Class 1 -> Facebook: 5
Class 1 -> Class 2: 6
Class 2 -> Sleep: 0
Class 2 -> Study: 8
Study -> Sleep: 10
Study -> Pub: 9.4
Sleep -> Sleep: 0


Not all values are correct: slide 39 has an incorrect $q^*$ for Study to Pub. It should be $q^* = 9.4$.
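
The action values can also be computed in one pass from the optimal state values via $q^*(s,a) = R(s,a) + \sum_{s'} P(s' \mid s,a)\, v^*(s')$ with $\gamma = 1$. The sketch below is illustrative only: `transitions`, `v_opt`, and `q_opt` are hypothetical names, the `v_opt` values are the ones derived by hand above, and the rewards and transitions are inferred from the calculations in this notebook.

```python
import math

# Optimal state values as derived by hand above.
v_opt = {"Facebook": 6, "Class 1": 6, "Class 2": 8, "Study": 10, "Sleep": 0}

# (state, action label) -> (reward, {next_state: probability})
transitions = {
    ("Facebook", "Facebook"): (-1, {"Facebook": 1.0}),
    ("Facebook", "Class 1"): (0, {"Class 1": 1.0}),
    ("Class 1", "Facebook"): (-1, {"Facebook": 1.0}),
    ("Class 1", "Class 2"): (-2, {"Class 2": 1.0}),
    ("Class 2", "Sleep"): (0, {"Sleep": 1.0}),
    ("Class 2", "Study"): (-2, {"Study": 1.0}),
    ("Study", "Sleep"): (10, {"Sleep": 1.0}),
    ("Study", "Pub"): (1, {"Class 1": 0.2, "Class 2": 0.4, "Study": 0.4}),
}

# One-step lookahead: immediate reward plus expected optimal value of the next state.
q_opt = {
    key: reward + sum(prob * v_opt[nxt] for nxt, prob in next_states.items())
    for key, (reward, next_states) in transitions.items()
}

for (state, action), value in q_opt.items():
    print(f"{state} -> {action}: {value:g}")
```

Under these assumptions the Study to Pub entry again comes out at about 9.4, consistent with the hand calculation.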