## Large Policy v2: Model-based approach

Given a dataset of observed `(s, a, r, sp)` transitions from an unknown MDP, find an optimal policy. Not every `(s, a)` pair appears in the data, so values for unseen pairs must be interpolated from neighboring states.
- States: |S| = 302020
- Actions: 9 actions
- Discount factor = 0.95

Lisa Fung

Last Updated: 11/9/24

### Data Exploration

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [12]:
large_data = pd.read_csv("./data/large.csv")
n_states = 302020
n_actions = 9
# n_limit_actions = 5

Large Data Observations

Rewards
- Only 7 unique values: [-10  -5   0   5  10  50 100]
- r=100 occurs only on transitions into sp = 301013 or sp = 301111, via actions 1-4
- sp = 301013
    - s=301012, a=1 (delta_s = +1)
    - s=301014, a=2 (delta_s = -1)
    - s=301113, a=3 (delta_s = -100)
    - s=300413, a=4 (delta_s = +600)
- sp = 301111
    - s=301110, a=1 (delta_s = +1)
    - s=301112, a=2 (delta_s = -1)
    - s=301211, a=3 (delta_s = -100)
    - s=301011, a=4 (delta_s = +100)
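The reward observations above can be reproduced with a short pandas query. The following is a minimal sketch, not the notebook's original cell: `summarize_rewards` is a hypothetical helper, and `sample` is a tiny synthetic stand-in for `large_data` that only shares its column names.

```python
import pandas as pd

# Synthetic stand-in for large.csv (hypothetical rows, same column names)
sample = pd.DataFrame({
    "s":  [301012, 301014, 10, 11],
    "a":  [1, 2, 5, 3],
    "r":  [100, 100, 0, -5],
    "sp": [301013, 301013, 10, 11],
})

def summarize_rewards(df, target_reward=100):
    """Return sorted unique rewards and the distinct transitions earning target_reward."""
    unique_rewards = sorted(df["r"].unique().tolist())
    # Distinct (s, a, sp) triples that produced the target reward
    hits = df.loc[df["r"] == target_reward, ["s", "a", "sp"]].drop_duplicates()
    hits = hits.assign(delta_s=hits["sp"] - hits["s"])
    return unique_rewards, hits

unique_rewards, hits = summarize_rewards(sample)
print(unique_rewards)
print(hits)
```

Calling `summarize_rewards(large_data)` on the real data should list the same kind of `(s, a, sp, delta_s)` rows as the bullets above.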


Actions
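- Actions 1-4 are stochastic: each has one dominant `delta_s` but sometimes produces one of the others
- Actions 5-9 usually leave the state unchanged (`delta_s = 0`) and occasionally transition randomly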

### Approach

- Transition Model: $T(s' - s \mid a)$
- Rewards: $R(s, s')$
- Only take actions $a = 1,2,3,4,5$
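One way these pieces could fit together, assuming the standard Bellman optimality backup (the update actually used to compute `Q` is not shown in this notebook), with $\gamma = 0.95$ and $\Delta s$ ranging over the observed offsets:

$$
Q(s, a) = \sum_{\Delta s} T(\Delta s \mid a) \left[ R(s, s + \Delta s) + \gamma \max_{a'} Q(s + \Delta s, a') \right]
$$

Because $T$ depends only on the offset $\Delta s$ and the action, it generalizes to `(s, a)` pairs that never appear in the data.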


#### Transition Model

$T(a, \Delta s) = T(\Delta s \mid a)$
- $|\Delta s| = 9$, $|A| = 5$

In [18]:
large_data['delta_s'] = large_data['sp'] - large_data['s']  # delta_s = sp - s

# Keep only the delta_s values produced by actions 1-4 (including 0);
# action 1 alone already exhibits all nine of them:
# array([-600, -100,   -6,   -1,    0,    1,    6,  100,  600])
delta_s_list = np.sort(large_data[large_data['a'] == 1]['delta_s'].unique())

In [27]:
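n_delta_s = len(delta_s_list)  # 9 distinct delta_s values
n_limit_actions = 5

# T[a, j] = P(delta_s_list[j] | a); row 0 is left unused (actions are 1-indexed)
T = np.zeros((n_limit_actions+1, n_delta_s))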

In [29]:
# Count (a, delta_s) occurrences, restricted to the nine kept delta_s values
large_data_a_delta_s = (
    large_data[large_data['delta_s'].isin(delta_s_list)][['a', 'delta_s']]
    .value_counts()
    .sort_index(level=[0, 1])
)

In [33]:
for row in large_data_a_delta_s.items():
    print(row)

((1, -600), 161)
((1, -100), 562)
((1, -6), 88)
((1, -1), 637)
((1, 0), 887)
((1, 1), 6250)
((1, 6), 1723)
((1, 100), 603)
((1, 600), 162)
((2, -600), 177)
((2, -100), 573)
((2, -6), 1610)
((2, -1), 6312)
((2, 0), 885)
((2, 1), 547)
((2, 6), 175)
((2, 100), 604)
((2, 600), 153)
((3, -600), 1574)
((3, -100), 6440)
((3, -6), 179)
((3, -1), 581)
((3, 0), 946)
((3, 1), 596)
((3, 6), 158)
((3, 100), 558)
((3, 600), 179)
((4, -600), 99)
((4, -100), 674)
((4, -6), 198)
((4, -1), 579)
((4, 0), 764)
((4, 1), 619)
((4, 6), 155)
((4, 100), 6345)
((4, 600), 1648)
((5, 0), 10913)
((6, 0), 11131)
((7, 0), 11033)
((8, 0), 10957)
((9, 0), 11149)
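The counts above still need to be turned into probabilities. A minimal sketch of one way to do this is a maximum-likelihood estimate: divide each action's counts by that action's total. `estimate_transition_model` is a hypothetical helper, not the notebook's original code, and the demo data is synthetic.

```python
import numpy as np
import pandas as pd

def estimate_transition_model(df, delta_s_list, n_limit_actions=5):
    """Maximum-likelihood estimate of T[a, j] = P(delta_s_list[j] | a).

    Row 0 is unused so that 1-indexed actions can index T directly.
    Transitions whose delta_s is not in delta_s_list are ignored.
    """
    delta_index = {int(d): j for j, d in enumerate(delta_s_list)}
    T = np.zeros((n_limit_actions + 1, len(delta_s_list)))
    delta = df["sp"] - df["s"]
    for a, d in zip(df["a"], delta):
        if 1 <= a <= n_limit_actions and int(d) in delta_index:
            T[int(a), delta_index[int(d)]] += 1
    row_sums = T.sum(axis=1, keepdims=True)
    # Normalize rows that have data; rows without data stay all-zero
    np.divide(T, row_sums, out=T, where=row_sums > 0)
    return T

# Synthetic demo: action 1 moves +1 three times and stays once; action 5 always stays
demo = pd.DataFrame({
    "s":  [10, 11, 12, 13, 20, 21],
    "a":  [1, 1, 1, 1, 5, 5],
    "sp": [11, 12, 13, 13, 20, 21],
})
demo_deltas = np.array([-600, -100, -6, -1, 0, 1, 6, 100, 600])
T_demo = estimate_transition_model(demo, demo_deltas)
print(T_demo[1])
```

Restricting to the nine kept offsets means the rare random jumps of actions 5-9 are dropped before normalizing, which is a modeling choice rather than something the data requires.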


#### Reward Model
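The notebook does not show how the reward model was built. As a hedged sketch of one possibility consistent with $R(s, s')$ from the Approach section, the mean observed reward can be stored per `(s, sp)` pair. `estimate_reward_model`, `reward`, and the default of 0 for unseen pairs are all assumptions made for illustration.

```python
import pandas as pd

def estimate_reward_model(df):
    """Map each observed (s, sp) pair to its mean observed reward."""
    mean_r = df.groupby(["s", "sp"])["r"].mean()
    return {(int(s), int(sp)): float(r) for (s, sp), r in mean_r.items()}

def reward(R, s, sp, default=0.0):
    """Look up R(s, sp); unseen pairs fall back to `default` (an assumption)."""
    return R.get((s, sp), default)

# Synthetic demo rows (hypothetical data, same column names as large.csv)
demo = pd.DataFrame({
    "s":  [301012, 301012, 5, 5, 5],
    "a":  [1, 1, 5, 1, 1],
    "r":  [100, 100, 0, -5, -10],
    "sp": [301013, 301013, 5, 6, 6],
})
R_demo = estimate_reward_model(demo)
print(reward(R_demo, 301012, 301013))
print(reward(R_demo, 5, 6))
print(reward(R_demo, 1, 2))
```

A dictionary keeps the model sparse, which matters here because only a small fraction of the 302020 × 302020 possible `(s, sp)` pairs can ever be observed.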

### Extract Policy from Q Function

In [167]:
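# Extract optimal policy pi(s) = a from action-value function Q(s, a)

def extract_policy(Q, mode='random'):
    """
    Greedy policy from Q, limited to actions 1-4.

    If the greedy action falls outside 1-4, fall back to a random action
    in 1-4 (mode='random') or to the most recent valid greedy action
    (mode='previous').
    """
    policy = np.zeros(n_states+1, dtype=int)  # index 0 unused
    # predictable_action = np.random.randint(1, n_actions+1)
    predictable_action = 4  # fallback until a valid greedy action is seen
    for s in range(1, n_states+1):
        # Assumes column index of Q equals the action number
        policy[s] = np.argmax(Q[s, :])
        if policy[s] not in [1, 2, 3, 4]:  # Actions 5-9 usually leave the state unchanged
            if mode == 'random':
                policy[s] = np.random.randint(1, 5)
            elif mode == 'previous':
                policy[s] = predictable_action
        else:
            predictable_action = policy[s]

    return policy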

In [None]:
optimal_policy = extract_policy(Q, mode='random')

In [None]:
np.unique(optimal_policy, return_counts=True)


In [None]:
# Write optimal policy to file
with open("large.policy", "w") as file:
    for a in optimal_policy[1:]:
        file.write(f"{int(a)}\n")

### Future Improvements

- For unvisited states s, estimate Q(s, a) by averaging the Q values of nearby visited states, weighted by a distance-dependent discount
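A minimal sketch of what this improvement might look like follows. It rests on several assumptions that are not in the notebook: distance is measured as the difference in state index, the weight decays as `decay**distance`, and `fill_unvisited_q`, `radius`, and `decay` are hypothetical names.

```python
import numpy as np

def fill_unvisited_q(Q, visited, radius=2, decay=0.5):
    """Fill Q rows of unvisited states with a distance-weighted average of visited neighbors.

    Q: (n_states + 1, n_actions + 1) float array, row and column 0 unused.
    visited: boolean array of length n_states + 1.
    A neighbor at index distance d gets weight decay**d (an assumed form).
    """
    Q_filled = Q.copy()
    n = Q.shape[0]
    for s in np.flatnonzero(~visited):
        if s == 0:
            continue  # row 0 is padding
        lo, hi = max(1, s - radius), min(n - 1, s + radius)
        neighbors = np.arange(lo, hi + 1)
        neighbors = neighbors[visited[neighbors]]
        if neighbors.size == 0:
            continue  # no visited neighbor in range; leave the row unchanged
        weights = decay ** np.abs(neighbors - s)
        Q_filled[s] = weights @ Q[neighbors] / weights.sum()
    return Q_filled

# Synthetic demo: 5 states, 2 actions; only states 1 and 3 were visited
Q_demo = np.zeros((6, 3))
Q_demo[1] = [0.0, 1.0, 0.0]
Q_demo[3] = [0.0, 0.0, 2.0]
visited_demo = np.array([False, True, False, True, False, False])
Q_filled = fill_unvisited_q(Q_demo, visited_demo, radius=1)
print(Q_filled[2])
```

Since the observed offsets include ±100 and ±600 as well as ±1 and ±6, index distance alone may be a poor notion of "nearby"; a distance defined over those offsets could be a better fit.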