## Small Policy v1
Given a 10x10 grid world and (s, a, r, sp) transition data, find the optimal policy.

Lisa Fung

Last Updated: 11/5/24

### Data Exploration

In [12]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
small_data = pd.read_csv("./data/small.csv")

In [16]:
# small_data.head()
# small_data.describe()
# small_data.hist()
# [print(col, small_data[col].value_counts()) for col in small_data.columns]

Small Data Information
- States s: [1, 100], all seen
- Actions a: [1, 4], all seen
    - 1: left, 2: right, 3: up, 4: down
- Rewards r: {0, 3, 10}
- Next states sp: [1, 100]
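
The "all seen" claims above can be checked programmatically. The following is a minimal sketch, not the notebook's own code: `summarize_coverage` is a hypothetical helper, and `toy` is a made-up three-row stand-in for `small_data` so the example runs without the CSV file.

```python
import pandas as pd

def summarize_coverage(df, n_states=100, n_actions=4):
    """Report which states/actions never appear and which rewards occur.

    Hypothetical helper; assumes columns named 's', 'a', 'r' as in small.csv.
    """
    missing_states = sorted(set(range(1, n_states + 1)) - set(df["s"]))
    missing_actions = sorted(set(range(1, n_actions + 1)) - set(df["a"]))
    return {
        "missing_states": missing_states,
        "missing_actions": missing_actions,
        "rewards": sorted(df["r"].unique().tolist()),
    }

# Toy stand-in for small_data (made-up rows, not the real dataset)
toy = pd.DataFrame({"s": [1, 2, 2], "a": [1, 2, 4], "r": [0, 0, 3], "sp": [2, 2, 3]})
summary = summarize_coverage(toy, n_states=3, n_actions=4)
print(summary)
```

On the real data, calling `summarize_coverage(small_data)` should return empty `missing_states` and `missing_actions` lists if every state and action is indeed observed.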

In [37]:
# Find R(s, a) > 0
small_data_pos_rewards = small_data[small_data['r'] > 0]    # Subset of small data with positive R(s, a)

# Find states, actions with positive rewards
# R(82, a) = 3
# R(47, a) = 10
small_data_pos_rewards[['s', 'a', 'r']].value_counts()

s   a  r 
82  2  3     154
    4  3     137
    3  3     135
47  1  10    128
    3  10    125
    2  10    113
    4  10    113
82  1  3     105
Name: count, dtype: int64

Optimal policy intuition: move toward state 47, collecting the reward at state 82 along the way if possible
- $R(82, a) = 3$
- $R(47, a) = 10$
- $R(s, a) = 0 \quad \forall s \notin \{47, 82\}$

### Approach

1. Estimate transitions $T(s' | s, a)$ using **Maximum Likelihood Estimation** (Note: no exploration)
2. Set rewards $R(s, a)$ with $R(47, a) = 10$, $R(82, a) = 3$, all other $R(s, a) = 0$
3. Find optimal policy $\pi^*$ using **Value Iteration**

    a. Find $U^*(s)$ by iterating $$U_{k+1}(s) = \max_a \left( R(s,a) + \gamma \sum_{s'} T(s' | s,a) \cdot U_k(s') \right)$$ until the maximum change in value satisfies $\lVert U_{k+1} - U_{k} \rVert_{\infty} < \delta$

    b. Extract $\pi^*$ with $\pi^*(s) = \arg\max_a \left( R(s,a) + \gamma \sum_{s'} T(s' | s,a) \cdot U^*(s') \right)$
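
Step 3 can be sketched on a toy problem as follows. This is an illustrative sketch under stated assumptions, not necessarily the implementation used later in this notebook: the function name `value_iteration`, the values of `gamma` and `delta`, and the 3-state chain MDP are all made up for the example, and the arrays are 0-indexed rather than padded to 1-based indices.

```python
import numpy as np

def value_iteration(T, R, gamma=0.95, delta=1e-6, max_iters=10_000):
    """Value iteration on a tabular MDP (illustrative sketch).

    T: (S, A, S) array with T[s, a, sp] = T(sp | s, a)
    R: (S,) state rewards or (S, A) state-action rewards
    gamma, delta: assumed values; the notebook does not specify them here
    """
    n_states, n_actions, _ = T.shape
    R_sa = np.broadcast_to(R[:, None] if R.ndim == 1 else R, (n_states, n_actions))
    U = np.zeros(n_states)
    for _ in range(max_iters):
        # Bellman backup: Q(s, a) = R(s, a) + gamma * sum_sp T(sp | s, a) * U(sp)
        Q = R_sa + gamma * (T @ U)
        U_next = Q.max(axis=1)
        converged = np.max(np.abs(U_next - U)) < delta
        U = U_next
        if converged:
            break
    # Greedy policy extraction from the final value estimate
    Q = R_sa + gamma * (T @ U)
    return U, Q.argmax(axis=1)

# Toy 3-state chain: action 0 moves left, action 1 moves right (deterministic)
T = np.zeros((3, 2, 3))
for s in range(3):
    T[s, 0, max(s - 1, 0)] = 1.0
    T[s, 1, min(s + 1, 2)] = 1.0
R = np.array([0.0, 0.0, 10.0])  # reward only in the last state

U, policy = value_iteration(T, R, gamma=0.9)
print(U, policy)
```

In this toy chain the extracted policy moves right in every state, since staying at the rewarding end state is worth $10 / (1 - \gamma) = 100$ with $\gamma = 0.9$.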


In [None]:
n_states = 100
n_actions = 4

# Estimate T(sp | s, a) in transition matrix T(s, a, sp)
T = np.zeros((n_states + 1, n_actions + 1, n_states + 1))

# Count observed transitions N(s, a, sp)
counts = small_data.groupby(['s', 'a', 'sp']).size()
for (s, a, sp), n in counts.items():
    T[s, a, sp] = n

# Normalize along the next-state (sp) axis: T(sp | s, a) = N(s, a, sp) / N(s, a)
# Rows with N(s, a) = 0 (e.g. the unused index 0) stay all-zero, avoiding a divide-by-zero warning
N_sa = T.sum(axis=2, keepdims=True)
T = np.divide(T, N_sa, out=np.zeros_like(T), where=N_sa > 0)



In [124]:
# Set rewards R(s, a) = R(s), only depends on current state
R = np.zeros((n_states + 1))
R[47] = 10
R[82] = 3

In [None]:
# Value Iteration