# Write-up and code for Feb 20 and 22

## To Do:
- ~~Write code for the interface for RL algorithms with value function approximation. The core of this interface should be a function from a (state, action) pair to a sampling of the (next state, reward) pair. It is important that this interface doesn't present the state-transition probability model or the reward model.~~
- ~~Implement any Monte-Carlo Prediction algorithm with Value Function approximation~~
- Implement 1-step TD Prediction algorithm with Value Function approximation
- Implement Eligibility-Traces-based TD(lambda) Prediction algorithm with Value Function approximation
- Implement SARSA and SARSA(Lambda) with Value Function approximation
- Implement Q-Learning with Value Function approximation
- Optional: Implement LSTD and LSPI
- Test the above algorithms versus DP Policy Evaluation and DP Policy Iteration/Value Iteration algorithm on an example MDP
- Project Suggestion: Customize the LSPI algorithm for American Option Pricing (see this paper)

## MDP Interface for RL Algorithms with Value Function Approximation

In [2]:
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

In [8]:
from typing import Dict, NamedTuple, Callable, Set, Tuple, List
from modules.MDP import MDP, Q, Policy, V
from modules.RL_interface import RL_interface
from modules.state_action_vars import S, A
import random
import numpy as np

In [4]:
class RL_interface_FA(NamedTuple):
    # Interface for reinforcement learning with value function approximation,
    # largely inspired by Professor Ashwin Rao's implementation.
    # It exposes only sampling functions, never the state-transition
    # probability model or the reward model.

    # function that takes in a state and returns the set of possible actions
    state_action_func: Callable[[S], Set[A]]
    # function that takes in a state and an action, and returns a sampled
    # next state and the reward
    state_reward_func: Callable[[S, A], Tuple[S, float]]
    # initial (state, action) generator
    init_state_action_gen: Callable[[], Tuple[S, A]]
    # discount factor
    gamma: float
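
As a usage illustration, here is a minimal sketch of how such an interface might be filled in for a toy two-state problem. Everything in it is an assumption made for the example: the class `RLInterfaceFASketch` repeats the four fields above with concrete types so the snippet runs on its own, and the states, actions, transition rule and rewards are invented.

```python
import random
from typing import Callable, NamedTuple, Set, Tuple

# Hypothetical concrete types for this sketch: states are ints, actions are strings.
State = int
Action = str


class RLInterfaceFASketch(NamedTuple):
    # Same four fields as RL_interface_FA, with concrete types
    state_action_func: Callable[[State], Set[Action]]
    state_reward_func: Callable[[State, Action], Tuple[State, float]]
    init_state_action_gen: Callable[[], Tuple[State, Action]]
    gamma: float


def actions(s: State) -> Set[Action]:
    # both actions are available in every state
    return {"stay", "move"}


def step(s: State, a: Action) -> Tuple[State, float]:
    # "move" flips the state with probability 0.9; "stay" keeps it.
    # Landing in state 1 pays a reward of 1, landing in state 0 pays 0.
    if a == "move" and random.random() < 0.9:
        s_next = 1 - s
    else:
        s_next = s
    return s_next, float(s_next == 1)


def init() -> Tuple[State, Action]:
    # uniformly random start state and action
    s = random.choice([0, 1])
    return s, random.choice(sorted(actions(s)))


toy = RLInterfaceFASketch(actions, step, init, gamma=0.9)

random.seed(0)
s0, a0 = toy.init_state_action_gen()
s1, r1 = toy.state_reward_func(s0, a0)  # one sampled transition
```

The caller only ever sees sampled `(next state, reward)` pairs; the 0.9 flip probability stays hidden inside `step`, which is the point of the interface.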

## Monte-Carlo Prediction Algorithm with Value Function Approximation
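
For reference, the standard gradient Monte-Carlo update for a differentiable approximator $\hat{v}(s, \mathbf{w})$ moves the weights toward the sampled return. The notation is an assumption of this note rather than something defined in the notebook: $G_t$ is the discounted return from step $t$, $\alpha$ is the step size, and $\mathbf{x}(s)$ is the feature vector (the role played by `feature_func`).

$$
\mathbf{w} \leftarrow \mathbf{w} + \alpha \left( G_t - \hat{v}(S_t, \mathbf{w}) \right) \nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w}),
\qquad
G_t = \sum_{k=0}^{T-t} \gamma^{k} R_{t+k}
$$

In the linear case, $\hat{v}(s, \mathbf{w}) = \mathbf{w}^{\top} \mathbf{x}(s)$, so the gradient is simply the feature vector:

$$
\nabla_{\mathbf{w}} \hat{v}(s, \mathbf{w}) = \mathbf{x}(s)
\quad\Longrightarrow\quad
\mathbf{w} \leftarrow \mathbf{w} + \alpha \left( G_t - \mathbf{w}^{\top} \mathbf{x}(S_t) \right) \mathbf{x}(S_t)
$$

Here $R_t$ denotes the reward received after taking the action at step $t$, matching the indexing of `r_list`, and $T$ is the index of the last sampled reward.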

In [15]:
def mc_vi_approx(polf: Callable[[S], A], alpha: float, v_hat: Callable[[S, np.ndarray], float], 
                 sampler: RL_interface_FA, feature_func: Callable[[S], np.ndarray], 
                 num_epi: int, num_steps: int, d: int) \
    -> np.ndarray:
    # Monte-Carlo prediction with value function approximation.
    # Assumes the approximation is linear, v_hat(s, w) = w^T x(s), so its gradient
    # with respect to w is the feature vector x(s) = feature_func(s).
    # Returns the learned weight vector w of shape (d, 1).

    # initialize weight vector
    w = np.zeros((d, 1))
    gamma = sampler.gamma

    for i in range(num_epi):
        s_list, a_list, r_list = get_mc_path(polf, sampler, num_steps)
        rewards = np.array(r_list)
        for j in range(num_steps):
            # discounted return from step j to the end of the sampled path
            G = np.sum(np.power(gamma, np.arange(len(rewards) - j)) * rewards[j:])
            # gradient step: for a linear v_hat the gradient is the feature vector, not w
            x = np.reshape(feature_func(s_list[j]), w.shape)
            w += alpha * (G - v_hat(s_list[j], w)) * x

    return w
    
    
def get_mc_path(polf: Callable[[S], A], sampler: RL_interface_FA, num_steps: int) \
    -> Tuple[List[S], List[A], List[float]]:
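    # simulate one Monte-Carlo path of num_steps transitions under policy polf;
    # r_list[j] is the reward sampled after taking a_list[j] in s_list[j]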
    s_list = []
    a_list = []
    r_list = []

    s, a = sampler.init_state_action_gen()
    s_list.append(s)
    a_list.append(a)
    
    for i in range(num_steps):
        s, r = sampler.state_reward_func(s, a)
        a = polf(s)
        s_list.append(s)
        a_list.append(a)
        r_list.append(r)
        
    # sample the last reward
    _, r = sampler.state_reward_func(s, a)
    r_list.append(r)
    
    return s_list, a_list, r_list
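
The return at step j is a discounted sum over the remaining rewards of the path, so recomputing it from scratch for every j costs quadratic time in the path length. A minimal sketch of the usual alternative, a single backward pass using G_j = r_j + gamma * G_{j+1}, is shown below. The helper `discounted_returns` and the sample rewards are hypothetical and not part of the notebook.

```python
from typing import Sequence

import numpy as np


def discounted_returns(rewards: Sequence[float], gamma: float) -> np.ndarray:
    # returns[j] = sum_k gamma**k * rewards[j + k], built from the end of the path
    returns = np.zeros(len(rewards))
    G = 0.0
    for j in reversed(range(len(rewards))):
        G = rewards[j] + gamma * G
        returns[j] = G
    return returns


rewards = [1.0, 0.0, 2.0]
gamma = 0.5
returns = discounted_returns(rewards, gamma)

# direct evaluation of the same sums, one slice per time step, for comparison
direct = [
    float(np.sum(np.power(gamma, np.arange(len(rewards) - j)) * np.array(rewards[j:])))
    for j in range(len(rewards))
]
```

Both computations give the same values; the backward pass just avoids re-summing the tail of the path at every step.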

## Least-Squares Policy Iteration for American Option Pricing