# Prototyping: Control with Linear Function Approximation

Our aim is to carry out RL tasks when rewards are delayed (aggregate and anonymous), using linear function approximation.
To do so, we project both large discrete states and continuous states onto basis vectors.
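
To make the delayed-reward setting concrete, here is a minimal sketch under one assumed reading of it: per-step rewards are hidden, and only their sum is revealed at the end of each fixed-length window. The function `aggregate_rewards` and its `window` parameter are illustrative names and are not part of `drmdp`.

```python
from typing import List, Optional, Sequence


def aggregate_rewards(
    rewards: Sequence[float], window: int
) -> List[Optional[float]]:
    """Hide per-step rewards; emit their sum at the end of each window.

    Steps without feedback are marked with None. A trailing partial
    window is flushed at the last step so that no reward is lost.
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    observed: List[Optional[float]] = []
    running = 0.0
    for step, reward in enumerate(rewards, start=1):
        running += reward
        if step % window == 0 or step == len(rewards):
            observed.append(running)
            running = 0.0
        else:
            observed.append(None)
    return observed


# Five steps of reward -1, feedback every two steps (assumed setup).
feedback = aggregate_rewards([-1.0, -1.0, -1.0, -1.0, -1.0], window=2)
print(feedback)
```

The total reward is preserved, but which step earned which part of it can no longer be read from the feedback alone.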


In this notebook, we test policy control with linear function approximation, using a select set of environments.
These environments have either discrete or continuous states, and discrete actions.

The environments and their best encoding for reward estimation from previous analyses are:

  - GridWorld: Tiling(8, HT=512)
  - RedGreen: Tiling(2)
  - MountainCar: Tiling(4)
  - GEM:Finite-CC-PMSM-v0 (Gym Electric Motor): Scaled raw features

We cannot know in advance whether a feature encoding that suits estimating rewards from aggregate samples is equally adequate for representing state when learning an action-value function in control (or a state-value function in policy evaluation).
So, for each environment, we also use a second encoding as a reference for comparison:

  - GridWorld: Random Binary Vectors
  - RedGreen: Random Binary Vectors
  - MountainCar: GaussianMixture(covariance_type='diag', n_components=3)
  - GEM:Finite-CC-PMSM-v0 (Gym Electric Motor): GaussianMixture(covariance_type='diag', n_components=11)
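
As an illustration of the reference encoding used for the discrete environments, the sketch below assigns every state a fixed random binary vector. It is an assumption about what "Random Binary Vectors" means here, not the `drmdp.feats` implementation; `random_binary_codes`, `num_states` and `seed` are hypothetical names.

```python
import numpy as np


def random_binary_codes(num_states: int, enc_size: int, seed: int = 0) -> np.ndarray:
    """Return a (num_states, enc_size) lookup table of random 0/1 codes."""
    rng = np.random.default_rng(seed)
    # `high` is exclusive, so the entries are drawn from {0, 1}.
    return rng.integers(0, 2, size=(num_states, enc_size)).astype(np.float64)


codes = random_binary_codes(num_states=48, enc_size=64)
state_features = codes[7]  # feature vector of the state with index 7
print(codes.shape, state_features[:8])
```

Two states can receive the same code by chance; with 64 bits per code this is unlikely, but a careful implementation might check for duplicates.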



In [1]:
import math
import random
from typing import Sequence


In [2]:
import gymnasium as gym
import numpy as np
import pandas as pd


In [3]:
from drmdp import envs, feats

## Control with SARSA
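
For reference, the code in this section follows the usual episodic semi-gradient SARSA scheme with a linear action-value function. The notation is an assumption chosen to mirror the code: $\mathbf{w}$ corresponds to `weights` and $\mathbf{x}(s, a)$ to a row of `gradients`.

$$
\hat{q}(s, a; \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s, a),
\qquad
\nabla_{\mathbf{w}} \hat{q}(s, a; \mathbf{w}) = \mathbf{x}(s, a)
$$

$$
\mathbf{w} \leftarrow \mathbf{w} + \alpha \left[ R_{t+1} + \gamma \, \hat{q}(S_{t+1}, A_{t+1}; \mathbf{w}) - \hat{q}(S_t, A_t; \mathbf{w}) \right] \mathbf{x}(S_t, A_t)
$$

On the last step of an episode the bootstrap term is dropped, so the target reduces to $R_{t+1}$.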

In [4]:
def action_values(
    observation, actions: Sequence[int], weights, feat_transform: feats.FeatTransform
):
    observations = [observation] * len(actions)
    state_action_m = feat_transform.batch_transform(observations, actions)
    return np.dot(state_action_m, weights), state_action_m
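
A quick way to see what `action_values` returns is to call it with a stand-in transform. `StackedTransform` below is hypothetical: it only mimics the two members this notebook relies on (`batch_transform` and `output_shape`) by copying the observation into the slot of the chosen action, and the real `feats.FeatTransform` implementations may differ. `action_values` is repeated without the `feats` type hint so that the snippet runs on its own.

```python
from typing import Sequence

import numpy as np


class StackedTransform:
    """Hypothetical stand-in for a feats.FeatTransform."""

    def __init__(self, obs_dim: int, num_actions: int):
        self.obs_dim = obs_dim
        self.num_actions = num_actions
        self.output_shape = obs_dim * num_actions

    def batch_transform(self, observations, actions: Sequence[int]) -> np.ndarray:
        obs = np.asarray(observations, dtype=np.float64)
        out = np.zeros((len(actions), self.output_shape))
        for row, action in enumerate(actions):
            start = action * self.obs_dim
            # Only the block that belongs to `action` is non-zero.
            out[row, start : start + self.obs_dim] = obs[row]
        return out


def action_values(observation, actions, weights, feat_transform):
    observations = [observation] * len(actions)
    state_action_m = feat_transform.batch_transform(observations, actions)
    return np.dot(state_action_m, weights), state_action_m


tfx = StackedTransform(obs_dim=2, num_actions=3)
weights = np.arange(6, dtype=np.float64)  # [0, 1, 2, 3, 4, 5]
qvalues, features = action_values(np.array([1.0, 2.0]), (0, 1, 2), weights, tfx)
# q(s, 0) = 0*1 + 1*2 = 2, q(s, 1) = 2*1 + 3*2 = 8, q(s, 2) = 4*1 + 5*2 = 14
print(qvalues, features.shape)
```

Each row of the second return value is the feature vector of one state-action pair, which for a linear model is also the gradient of that action's value with respect to the weights.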

In [5]:
def semi_gradient_sarsa(
    env, alpha: float, gamma: float, 
    epsilon: float, num_episodes: int, 
    feat_transform: feats.FeatTransform,
    verbose: bool = True
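):
    """Episodic semi-gradient SARSA with an epsilon-greedy policy.

    Returns the learned weight vector of the linear action-value function.
    """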
    actions = tuple(range(env.action_space.n))
    weights = np.zeros(feat_transform.output_shape, dtype=np.float64)
    returns = []
    
    for i in range(num_episodes):
        obs, _ = env.reset()
        state_qvalues, gradients = action_values(obs, actions, weights, feat_transform)
        rewards = 0
        # choose action
        if random.random() <= epsilon:
            action = env.action_space.sample()
        else:
            action = np.random.choice(np.flatnonzero(state_qvalues == state_qvalues.max()))

        while True:
            # apply the selected action
            next_obs, reward, term, trunc, _ = env.step(action)
            rewards += reward
            
            if term or trunc:
                weights = weights + alpha * (reward - state_qvalues[action]) * gradients[action]
                break

            next_state_qvalues, next_gradients = action_values(next_obs, actions, weights, feat_transform)
            
            if random.random() <= epsilon:
                next_action = env.action_space.sample()
            else:
                # greedy
                next_action = np.random.choice(np.flatnonzero(next_state_qvalues == next_state_qvalues.max()))

            weights = weights + alpha * (
                reward + gamma * next_state_qvalues[next_action] - state_qvalues[action]
            ) * gradients[action]
            obs = next_obs
            action = next_action
            state_qvalues = next_state_qvalues
            gradients = next_gradients
        returns.append(rewards)
        if verbose and (i + 1) % max(1, math.floor(num_episodes / 5)) == 0:
            print("Episode", i+1, "mean returns:", np.mean(returns))
    return weights
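
The action-selection logic appears twice in `semi_gradient_sarsa`. One possible refactoring is a small helper like the one below. It is a sketch rather than the notebook's code: `epsilon_greedy` and the explicit `rng` argument are hypothetical, exploratory actions are drawn uniformly from the index range instead of through `env.action_space.sample()`, and the comparison is a strict `<` so that `epsilon=0` never explores.

```python
import numpy as np


def epsilon_greedy(qvalues: np.ndarray, epsilon: float, rng: np.random.Generator) -> int:
    """Pick a random action with probability epsilon, else a greedy one.

    Ties between maximal q-values are broken uniformly at random.
    """
    if rng.random() < epsilon:
        return int(rng.integers(len(qvalues)))
    best = np.flatnonzero(qvalues == qvalues.max())
    return int(rng.choice(best))


rng = np.random.default_rng(0)
qvalues = np.array([1.0, 3.0, 3.0])
greedy_picks = {epsilon_greedy(qvalues, 0.0, rng) for _ in range(200)}
random_picks = [epsilon_greedy(qvalues, 1.0, rng) for _ in range(200)]
print(sorted(greedy_picks))
```

With `epsilon=0.0` only the two tied maximal actions are ever selected, while `epsilon=1.0` ignores the q-values entirely.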

In [6]:
def collect_interaction_data(env, weights, num_episodes: int, feat_transform):
    actions = tuple(range(env.action_space.n))
    buffer = []
    returns = []
    for i in range(num_episodes):
        obs, _ = env.reset()
        rewards = 0
        steps = []
        while True:
            state_qvalues, _ = action_values(obs, actions, weights, feat_transform)
            action = np.random.choice(
                np.flatnonzero(state_qvalues == state_qvalues.max())
            )
            (
                next_obs,
                reward,
                term,
                trunc,
                _,
            ) = env.step(action)
            rewards += reward
            steps.append((obs, action, next_obs, reward))
            obs = next_obs
            if term or trunc:
                returns.append(rewards)
                break
        buffer.append(steps)
    return buffer, returns
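
Each episode in the buffer is a list of `(obs, action, next_obs, reward)` tuples, so returns can be recomputed from it later, for instance with a discount factor other than the one used during collection. The helper below is a sketch on a hand-made buffer; `discounted_returns` and `toy_buffer` are hypothetical names.

```python
from typing import List, Sequence, Tuple

Step = Tuple[object, int, object, float]  # (obs, action, next_obs, reward)


def discounted_returns(
    buffer: Sequence[Sequence[Step]], gamma: float = 1.0
) -> List[float]:
    """Return the discounted return of the first step of every episode."""
    returns = []
    for steps in buffer:
        total = 0.0
        # Accumulate backwards: G_t = r_t + gamma * G_{t+1}
        for _, _, _, reward in reversed(steps):
            total = reward + gamma * total
        returns.append(total)
    return returns


toy_buffer = [
    [((3, 0), 2, (2, 0), -1.0), ((2, 0), 2, (1, 0), -1.0)],
    [((3, 0), 1, (3, 1), -1.0)],
]
undiscounted = discounted_returns(toy_buffer)
discounted = discounted_returns(toy_buffer, gamma=0.5)
print(undiscounted, discounted)
```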

In [7]:
def play(env, weights, num_episodes: int, feat_transform):
    actions = tuple(range(env.action_space.n))
    returns = []
    for i in range(num_episodes):
        obs, _ = env.reset()
        rewards = 0
        while True:
            state_qvalues, _ = action_values(obs, actions, weights, feat_transform)
            action = np.random.choice(
                np.flatnonzero(state_qvalues == state_qvalues.max())
            )
            (
                next_obs,
                reward,
                term,
                trunc,
                _,
            ) = env.step(action)
            rewards += reward
            obs = next_obs
            if term or trunc:
                returns.append(rewards)
                break
    return returns

In [8]:
def control_and_evaluate(
    env: gym.Env,
    args,
    alpha: float = 0.01,
    epsilon: float = 0.1,
    num_episodes: int = 5000,
    gamma: float = 1.0,
    turns: int = 5,
    eval_max_steps: int = 1000,
):
    rows = []
    for kwargs in args:
        print("Control with Fn Approx - SARSA:", kwargs)
        for turn in range(turns):
            print("Turn", turn + 1)
            ft_tfx = feats.create_feat_transformer(env, **kwargs)
            print("ft-tfx:", vars(ft_tfx))
            weights = semi_gradient_sarsa(
                env,
                alpha=alpha,
                gamma=gamma,
                epsilon=epsilon,
                num_episodes=num_episodes,
                feat_transform=ft_tfx,
            )
            buffer, returns = collect_interaction_data(
                gym.wrappers.TimeLimit(env, max_episode_steps=eval_max_steps),
                weights=weights,
                num_episodes=15,
                feat_transform=ft_tfx,
            )
            print(
                f"Eval (mean) returns: [min:{np.min(returns)}, mean:{np.mean(returns)}, max:{np.max(returns)}]"
            )
            rows.append({"args": kwargs, "buffer": buffer, "turn": turn})
    return pd.DataFrame(rows)
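
The returned frame stores raw trajectories rather than summary statistics. A sketch of how the encodings could be compared afterwards is given below; `mean_eval_return` is a hypothetical helper that works on plain row dictionaries with the same keys as the frame (`args`, `buffer`, `turn`), and the toy rows are made up for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Mapping


def mean_eval_return(rows: Iterable[Mapping]) -> Dict[str, float]:
    """Average undiscounted evaluation return per encoding configuration."""
    per_config = defaultdict(list)
    for row in rows:
        key = str(row["args"])  # dicts are unhashable, so key on their repr
        for steps in row["buffer"]:
            # The reward is the last element of each step tuple.
            per_config[key].append(sum(step[-1] for step in steps))
    return {key: mean(values) for key, values in per_config.items()}


step = (0, 1, 1, -1.0)  # (obs, action, next_obs, reward), made-up values
toy_rows = [
    {"args": {"name": "random", "enc_size": 64}, "buffer": [[step] * 2], "turn": 0},
    {"args": {"name": "random", "enc_size": 64}, "buffer": [[step] * 4], "turn": 1},
    {"args": {"name": "tiles", "tiling_dim": 6}, "buffer": [[step] * 5], "turn": 0},
]
summary = mean_eval_return(toy_rows)
print(summary)
```

With a real result, the rows could be obtained with `df.to_dict("records")`, for example `mean_eval_return(df_gridworld.to_dict("records"))`.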

### Grid World

In [9]:
%%time
env = envs.make("GridWorld-v0")
df_gridworld = control_and_evaluate(env, [
    {"name": "random", "enc_size": 64},
    {"name": "tiles", "tiling_dim": 6}
])


Control with Fn Approx - SARSA: {'name': 'random', 'enc_size': 64}
Turn 1
ft-tfx: {'obs_space': Box(0.0, [ 4. 12.], (2,), float32), 'num_actions': 4, 'obs_dim': 64, '_representations': {}}
Episode 1000 mean returns: -575.0
Episode 2000 mean returns: -341.1285
Episode 3000 mean returns: -254.88466666666667
Episode 4000 mean returns: -212.25525
Episode 5000 mean returns: -185.7678
Eval (mean) returns: [min:-17.0, mean:-17.0, max:-17.0]
Turn 2
ft-tfx: {'obs_space': Box(0.0, [ 4. 12.], (2,), float32), 'num_actions': 4, 'obs_dim': 64, '_representations': {}}
Episode 1000 mean returns: -52.069
Episode 2000 mean returns: -37.966
Episode 3000 mean returns: -33.72866666666667
Episode 4000 mean returns: -31.19925
Episode 5000 mean returns: -30.5046
Eval (mean) returns: [min:-15.0, mean:-15.0, max:-15.0]
Turn 3
ft-tfx: {'obs_space': Box(0.0, [ 4. 12.], (2,), float32), 'num_actions': 4, 'obs_dim': 64, '_representations': {}}
Episode 1000 mean returns: -56.67
Episode 2000 mean returns: -50.1415
Epi

In [10]:
df_gridworld

| | args | buffer | turn |
|---|---|---|---|
| 0 | {'name': 'random', 'enc_size': 64} | [[([3 0], 2, [2 0], -1.0), ([2 0], 2, [1 0], -... | 0 |
| 1 | {'name': 'random', 'enc_size': 64} | [[([3 0], 2, [2 0], -1.0), ([2 0], 2, [1 0], -... | 1 |
| 2 | {'name': 'random', 'enc_size': 64} | [[([3 0], 2, [2 0], -1.0), ([2 0], 2, [1 0], -... | 2 |
| 3 | {'name': 'random', 'enc_size': 64} | [[([3 0], 2, [2 0], -1.0), ([2 0], 2, [1 0], -... | 3 |
| 4 | {'name': 'random', 'enc_size': 64} | [[([3 0], 2, [2 0], -1.0), ([2 0], 2, [1 0], -... | 4 |
| 5 | {'name': 'tiles', 'tiling_dim': 6} | [[([3 0], 2, [2 0], -1.0), ([2 0], 2, [1 0], -... | 0 |
| 6 | {'name': 'tiles', 'tiling_dim': 6} | [[([3 0], 2, [2 0], -1.0), ([2 0], 2, [1 0], -... | 1 |
| 7 | {'name': 'tiles', 'tiling_dim': 6} | [[([3 0], 2, [2 0], -1.0), ([2 0], 2, [1 0], -... | 2 |
| 8 | {'name': 'tiles', 'tiling_dim': 6} | [[([3 0], 2, [2 0], -1.0), ([2 0], 2, [1 0], -... | 3 |
| 9 | {'name': 'tiles', 'tiling_dim': 6} | [[([3 0], 2, [2 0], -1.0), ([2 0], 2, [1 0], -... | 4 |


### Ice World

In [11]:
%%time
env = envs.make("IceWorld-v0")
df_iceworld = control_and_evaluate(env, [
    {"name": "random", "enc_size": 64},
    {"name": "tiles", "tiling_dim": 6}
])

Control with Fn Approx - SARSA: {'name': 'random', 'enc_size': 64}
Turn 1
ft-tfx: {'obs_space': Box(0.0, 4.0, (2,), float32), 'num_actions': 4, 'obs_dim': 64, '_representations': {}}
Episode 1000 mean returns: -11.132
Episode 2000 mean returns: -11.033
Episode 3000 mean returns: -11.059666666666667
Episode 4000 mean returns: -11.02925
Episode 5000 mean returns: -11.0252
Eval (mean) returns: [min:-1000.0, mean:-1000.0, max:-1000.0]
Turn 2
ft-tfx: {'obs_space': Box(0.0, 4.0, (2,), float32), 'num_actions': 4, 'obs_dim': 64, '_representations': {}}
Episode 1000 mean returns: -11.095
Episode 2000 mean returns: -11.009
Episode 3000 mean returns: -10.860333333333333
Episode 4000 mean returns: -10.957
Episode 5000 mean returns: -10.9908
Eval (mean) returns: [min:-6.0, mean:-6.0, max:-6.0]
Turn 3
ft-tfx: {'obs_space': Box(0.0, 4.0, (2,), float32), 'num_actions': 4, 'obs_dim': 64, '_representations': {}}
Episode 1000 mean returns: -11.783
Episode 2000 mean returns: -11.051
Episode 3000 mean retu

In [12]:
df_iceworld

| | args | buffer | turn |
|---|---|---|---|
| 0 | {'name': 'random', 'enc_size': 64} | [[([0 0], 3, [1 0], -1.0), ([1 0], 0, [1 0], -... | 0 |
| 1 | {'name': 'random', 'enc_size': 64} | [[([0 0], 1, [0 1], -1.0), ([0 1], 1, [0 2], -... | 1 |
| 2 | {'name': 'random', 'enc_size': 64} | [[([0 0], 3, [1 0], -1.0), ([1 0], 3, [2 0], -... | 2 |
| 3 | {'name': 'random', 'enc_size': 64} | [[([0 0], 3, [1 0], -1.0), ([1 0], 3, [2 0], -... | 3 |
| 4 | {'name': 'random', 'enc_size': 64} | [[([0 0], 1, [0 1], -1.0), ([0 1], 1, [0 2], -... | 4 |
| 5 | {'name': 'tiles', 'tiling_dim': 6} | [[([0 0], 1, [0 1], -1.0), ([0 1], 1, [0 2], -... | 0 |
| 6 | {'name': 'tiles', 'tiling_dim': 6} | [[([0 0], 1, [0 1], -1.0), ([0 1], 1, [0 2], -... | 1 |
| 7 | {'name': 'tiles', 'tiling_dim': 6} | [[([0 0], 3, [1 0], -1.0), ([1 0], 3, [2 0], -... | 2 |
| 8 | {'name': 'tiles', 'tiling_dim': 6} | [[([0 0], 1, [0 1], -1.0), ([0 1], 1, [0 2], -... | 3 |
| 9 | {'name': 'tiles', 'tiling_dim': 6} | [[([0 0], 1, [0 1], -1.0), ([0 1], 1, [0 2], -... | 4 |


### RedGreen

In [13]:
%%time
env = envs.make("RedGreen-v0")
df_redgreen = control_and_evaluate(env, [
    {"name": "random", "enc_size": 32},
    {"name": "tiles", "tiling_dim": 6}
])

Control with Fn Approx - SARSA: {'name': 'random', 'enc_size': 32}
Turn 1
ft-tfx: {'obs_space': Box(0.0, 7.0, (1,), float32), 'num_actions': 3, 'obs_dim': 32, '_representations': {}}
Episode 1000 mean returns: -6.835
Episode 2000 mean returns: -6.618
Episode 3000 mean returns: -6.552333333333333
Episode 4000 mean returns: -6.5255
Episode 5000 mean returns: -6.5026
Eval (mean) returns: [min:-6.0, mean:-6.0, max:-6.0]
Turn 2
ft-tfx: {'obs_space': Box(0.0, 7.0, (1,), float32), 'num_actions': 3, 'obs_dim': 32, '_representations': {}}
Episode 1000 mean returns: -6.673
Episode 2000 mean returns: -6.5555
Episode 3000 mean returns: -6.52
Episode 4000 mean returns: -6.4895
Episode 5000 mean returns: -6.4816
Eval (mean) returns: [min:-6.0, mean:-6.0, max:-6.0]
Turn 3
ft-tfx: {'obs_space': Box(0.0, 7.0, (1,), float32), 'num_actions': 3, 'obs_dim': 32, '_representations': {}}
Episode 1000 mean returns: -6.761
Episode 2000 mean returns: -6.583
Episode 3000 mean returns: -6.514666666666667
Episode 4

In [14]:
df_redgreen

| | args | buffer | turn |
|---|---|---|---|
| 0 | {'name': 'random', 'enc_size': 32} | [[([0], 0, [1], -1.0), ([1], 1, [2], -1.0), ([... | 0 |
| 1 | {'name': 'random', 'enc_size': 32} | [[([0], 0, [1], -1.0), ([1], 1, [2], -1.0), ([... | 1 |
| 2 | {'name': 'random', 'enc_size': 32} | [[([0], 0, [1], -1.0), ([1], 1, [2], -1.0), ([... | 2 |
| 3 | {'name': 'random', 'enc_size': 32} | [[([0], 0, [1], -1.0), ([1], 1, [2], -1.0), ([... | 3 |
| 4 | {'name': 'random', 'enc_size': 32} | [[([0], 0, [1], -1.0), ([1], 1, [2], -1.0), ([... | 4 |
| 5 | {'name': 'tiles', 'tiling_dim': 6} | [[([0], 0, [1], -1.0), ([1], 1, [2], -1.0), ([... | 0 |
| 6 | {'name': 'tiles', 'tiling_dim': 6} | [[([0], 0, [1], -1.0), ([1], 1, [2], -1.0), ([... | 1 |
| 7 | {'name': 'tiles', 'tiling_dim': 6} | [[([0], 0, [1], -1.0), ([1], 1, [2], -1.0), ([... | 2 |
| 8 | {'name': 'tiles', 'tiling_dim': 6} | [[([0], 0, [1], -1.0), ([1], 1, [2], -1.0), ([... | 3 |
| 9 | {'name': 'tiles', 'tiling_dim': 6} | [[([0], 0, [1], -1.0), ([1], 1, [2], -1.0), ([... | 4 |


### Mountain Car

In [15]:
%%time
env = envs.make("MountainCar-v0", max_episode_steps=1000)
df_mountaincar = control_and_evaluate(env, [
    {"name": "gaussian-mix", "params": {"n_components": (384//3), "covariance_type": "diag"}},
    {"name": "tiles", "tiling_dim": 6},
], alpha=0.01, epsilon=0.2, num_episodes=1000)

Control with Fn Approx - SARSA: {'name': 'gaussian-mix', 'params': {'n_components': 128, 'covariance_type': 'diag'}}
Turn 1
ft-tfx: {'obs_space': Box([-1.2  -0.07], [0.6  0.07], (2,), float32), 'num_actions': 3, '_gm': GaussianMixture(covariance_type='diag', n_components=128), 'obs_dim': 128}
Episode 200 mean returns: -1000.0
Episode 400 mean returns: -1000.0
Episode 600 mean returns: -1000.0
Episode 800 mean returns: -1000.0
Episode 1000 mean returns: -1000.0
Eval (mean) returns: [min:-1000.0, mean:-1000.0, max:-1000.0]
Turn 2
ft-tfx: {'obs_space': Box([-1.2  -0.07], [0.6  0.07], (2,), float32), 'num_actions': 3, '_gm': GaussianMixture(covariance_type='diag', n_components=128), 'obs_dim': 128}
Episode 200 mean returns: -1000.0
Episode 400 mean returns: -1000.0
Episode 600 mean returns: -1000.0
Episode 800 mean returns: -1000.0
Episode 1000 mean returns: -1000.0
Eval (mean) returns: [min:-1000.0, mean:-1000.0, max:-1000.0]
Turn 3
ft-tfx: {'obs_space': Box([-1.2  -0.07], [0.6  0.07], (2

### GEM

In [16]:
%%time
env = envs.make("Finite-CC-PMSM-v0", max_time_steps=1000)
df_gem = control_and_evaluate(env, [
    # {"name": "gaussian-mix", "params": {"n_components": 11, "covariance_type": "diag"}},
    {"name": "scale"},
], alpha=0.01, epsilon=0.2, num_episodes=1000)

Control with Fn Approx - SARSA: {'name': 'scale'}
Turn 1
ft-tfx: {'obs_space': Box(0.0, [0.8 0.8 1.  1. ], (4,), float64), 'num_actions': 8, 'obs_dim': 4, 'obs_range': array([0.8, 0.8, 1. , 1. ])}
Episode 200 mean returns: 165.8506076960823
Episode 400 mean returns: 5372.248419400881
Episode 600 mean returns: 4538.945804037329
Episode 800 mean returns: 4335.166491246552
Episode 1000 mean returns: 4159.184354820485
Eval (mean) returns: [min:585.3857300194837, mean:909.8025153560114, max:1778.7087335684062]
Turn 2
ft-tfx: {'obs_space': Box(0.0, [0.8 0.8 1.  1. ], (4,), float64), 'num_actions': 8, 'obs_dim': 4, 'obs_range': array([0.8, 0.8, 1. , 1. ])}
Episode 200 mean returns: 158.0563628461465
Episode 400 mean returns: 190.45983296269935
Episode 600 mean returns: 336.4992377977178
Episode 800 mean returns: 1787.8911903004177
Episode 1000 mean returns: 2114.487680598498
Eval (mean) returns: [min:486.3623744167834, mean:1228.6469828189986, max:1816.972941108315]
Turn 3
ft-tfx: {'obs_space