# Write-up and code for Feb 20 and 22

## To Do:
- ~~Write code for the interface for RL algorithms with value function approximation. The core of this interface should be a function that maps a (state, action) pair to a sample of the (next state, reward) pair. It is important that this interface does not expose the state-transition probability model or the reward model.~~
- Implement any Monte-Carlo Prediction algorithm with Value Function approximation
- Implement 1-step TD Prediction algorithm with Value Function approximation
- Implement Eligibility-Traces-based TD(lambda) Prediction algorithm with Value Function approximation
- Implement SARSA and SARSA(lambda) with Value Function approximation
- Implement Q-Learning with Value Function approximation
- Optional: Implement LSTD and LSPI
- Test the above algorithms against the DP Policy Evaluation and DP Policy Iteration/Value Iteration algorithms on an example MDP
- Project Suggestion: Customize the LSPI algorithm for American Option Pricing (see this paper)

## MDP Interface for RL Algorithms with Value Function Approximation

In [9]:
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

In [14]:
from typing import Dict, NamedTuple, Callable, Set, Tuple
from modules.MDP import MDP, Q, Policy, V
from modules.RL_interface import RL_interface
from modules.state_action_vars import S, A
import random

In [16]:
class RL_interface_FA(NamedTuple):
    # Interface for reinforcement learning with value function approximation,
    # largely inspired by Professor Ashwin Rao's implementation.
    # It exposes only sampling functions, never the transition or reward model.

    # Function that takes a state and returns the set of actions available in it
    state_action_func: Callable[[S], Set[A]]
    # Function that takes a state and an action, and returns a sample of (next state, reward)
    state_reward_func: Callable[[S, A], Tuple[S, float]]
    # discount factor
    gamma: float
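
To make the interface concrete, here is a minimal usage sketch on a hypothetical five-state chain. Everything except the three field names is an assumption made for illustration: `S` and `A` are replaced by `int` and `str` stand-ins (the notebook imports them from `modules.state_action_vars`), the class is redeclared so the snippet runs on its own, and `actions`, `step`, and `sample_return` are made-up helper names. The point is only that an algorithm can draw experience through `state_reward_func` without ever seeing transition probabilities.

```python
import random
from typing import Callable, NamedTuple, Set, Tuple

# Stand-in types (assumption): the notebook imports S and A from its own modules.
S = int
A = str


class RL_interface_FA(NamedTuple):
    # Redeclared here only so the sketch is self-contained.
    state_action_func: Callable[[S], Set[A]]
    state_reward_func: Callable[[S, A], Tuple[S, float]]
    gamma: float


def actions(state: S) -> Set[A]:
    # Hypothetical chain with states 0..4: both moves are allowed everywhere.
    return {"left", "right"}


def step(state: S, action: A) -> Tuple[S, float]:
    # Sample one transition: the chosen move succeeds with probability 0.9,
    # otherwise the agent slips in the opposite direction.
    direction = 1 if action == "right" else -1
    if random.random() > 0.9:
        direction = -direction
    next_state = min(4, max(0, state + direction))
    # Reward 1 whenever the agent lands on (or stays at) state 4.
    reward = 1.0 if next_state == 4 else 0.0
    return next_state, reward


def sample_return(mdp: RL_interface_FA, start: S, num_steps: int) -> float:
    # Roll out a uniformly random policy and accumulate the discounted return,
    # using only the two sampling functions of the interface.
    state, total, discount = start, 0.0, 1.0
    for _ in range(num_steps):
        action = random.choice(sorted(mdp.state_action_func(state)))
        state, reward = mdp.state_reward_func(state, action)
        total += discount * reward
        discount *= mdp.gamma
    return total


chain = RL_interface_FA(state_action_func=actions,
                        state_reward_func=step,
                        gamma=0.9)
random.seed(0)
print(sample_return(chain, start=2, num_steps=20))
```

A rollout helper of this kind is roughly what a Monte-Carlo prediction routine would need from the interface: sampled (next state, reward) pairs and the discount factor, nothing more.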

## Monte-Carlo Prediction Algorithm with Value Function Approximation