# Prototyping: Control with Linear Function Approximation

Our aim is to solve RL tasks in which rewards are delayed (aggregate and anonymous), using linear function approximation.
To do so, we encode both large discrete state spaces and continuous states as feature (basis) vectors.
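
To make "delayed, aggregate, and anonymous" concrete, here is a minimal sketch of how a per-step reward sequence could look once only periodic sums are revealed. The function `aggregate_rewards` and its `period` parameter are hypothetical names used for illustration; they are not taken from `drmdp`, which may implement the delay differently.

```python
from typing import List, Optional, Sequence


def aggregate_rewards(
    rewards: Sequence[float], period: int
) -> List[Optional[float]]:
    """Hide per-step rewards; reveal only their sum every `period` steps.

    Hypothetical helper for illustration, not part of drmdp.
    """
    observed: List[Optional[float]] = []
    running_sum = 0.0
    for step, reward in enumerate(rewards, start=1):
        running_sum += reward
        if step % period == 0:
            # End of a window: emit the aggregate and start a new window.
            observed.append(running_sum)
            running_sum = 0.0
        else:
            # Inside a window: the agent sees no reward signal.
            observed.append(None)
    return observed


print(aggregate_rewards([-1.0, -1.0, -1.0, 10.0], period=2))
# prints [None, -2.0, None, 9.0]
```

The agent sees only the window sums, so it cannot tell which step in a window contributed which part of the reward. In this sketch, a trailing window that is shorter than `period` is never revealed.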


In this notebook, we test policy control with linear function approximation on a select set of environments.
These environments have either discrete or continuous states, and discrete actions.

The environments, and the encoding that previous analyses found best for reward estimation in each, are:

  - GridWorld: Tiling(8, HT=512)
  - RedGreen: Tiling(2)
  - MountainCar: Tiling(4)
  - GEM:Finite-CC-PMSM-v0 (Gym Electric Motor): Scaled raw features
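
As a reminder of what a tiling encoder does, the sketch below maps a continuous observation to one active tile per tiling, using uniformly shifted grids. It is an illustrative assumption, not the `drmdp` implementation: the function `tile_indices` and its parameters are hypothetical, and it does not hash the tile indices.

```python
import numpy as np


def tile_indices(obs, low, high, tiling_dim: int, num_tilings: int) -> np.ndarray:
    """Return one active tile index per tiling (hypothetical helper)."""
    obs = np.asarray(obs, dtype=np.float64)
    low = np.asarray(low, dtype=np.float64)
    high = np.asarray(high, dtype=np.float64)
    # Rescale every dimension to the interval [0, tiling_dim].
    scaled = (obs - low) / (high - low) * tiling_dim
    # Each tiling needs one extra tile per dimension to absorb the shift.
    grid_shape = (tiling_dim + 1,) * obs.size
    tiles_per_tiling = (tiling_dim + 1) ** obs.size
    active = []
    for tiling in range(num_tilings):
        # Shift each tiling by a different fraction of a tile width.
        offset = tiling / num_tilings
        coords = np.clip(np.floor(scaled + offset).astype(int), 0, tiling_dim)
        flat = np.ravel_multi_index(tuple(coords), grid_shape)
        active.append(tiling * tiles_per_tiling + int(flat))
    return np.array(active)


# MountainCar-like bounds: position in [-1.2, 0.6], velocity in [-0.07, 0.07].
idx = tile_indices([-0.5, 0.0], [-1.2, -0.07], [0.6, 0.07], tiling_dim=4, num_tilings=2)
print(idx)
```

The binary feature vector has a one at each returned index and zeros elsewhere, so nearby observations share some active tiles and the linear model generalises between them.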

We cannot know in advance whether a feature encoding that suits estimating rewards from aggregate samples is equally adequate for representing state when learning an action-value function in control (or a state-value function in policy evaluation).
So, for each environment, we also use a second encoding as a reference for comparison:

  - GridWorld: Random Binary Vectors
  - RedGreen: Random Binary Vectors
  - MountainCar: GaussianMixture(covariance_type='diag', n_components=3)
  - GEM:Finite-CC-PMSM-v0 (Gym Electric Motor): GaussianMixture(covariance_type='diag', n_components=11)
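
For the discrete environments, a random binary encoding can be sketched as a lookup that gives each distinct state a fixed random 0/1 vector. The class below is a hypothetical illustration (the name `RandomBinaryEncoder`, the `seed` argument, and the caching scheme are assumptions); the encoder in `drmdp.feats` may work differently.

```python
import numpy as np


class RandomBinaryEncoder:
    """Assign each distinct discrete state a fixed random 0/1 vector.

    Hypothetical sketch for illustration, not the drmdp implementation.
    """

    def __init__(self, enc_size: int, seed: int = 0):
        self.enc_size = enc_size
        self._rng = np.random.default_rng(seed)
        self._codes = {}

    def encode(self, obs) -> np.ndarray:
        # Use the observation's values as a hashable dictionary key.
        key = tuple(np.asarray(obs).ravel().tolist())
        if key not in self._codes:
            # First visit: draw a code and remember it for later visits.
            code = self._rng.integers(0, 2, size=self.enc_size)
            self._codes[key] = code.astype(np.float64)
        return self._codes[key]


encoder = RandomBinaryEncoder(enc_size=64)
first = encoder.encode([3, 0])
again = encoder.encode([3, 0])
print(first.shape, np.array_equal(first, again))
```

Unlike tilings, such codes carry no notion of distance between states, which makes them a useful structure-free reference.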



In [1]:
import math
import random
from typing import Sequence


In [2]:
import gymnasium as gym
import numpy as np
import pandas as pd


In [3]:
from drmdp import envs, feats

## Control with SARSA

In [4]:
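def action_values(
    observation, actions: Sequence[int], weights, feat_transform: feats.FeatTransform
):
    """Compute q(s, a) for every action, plus the state-action feature matrix.

    With a linear approximator, row `a` of the feature matrix is also the
    gradient of q(s, a) with respect to the weights.
    """
    # Pair the same observation with each candidate action.
    observations = [observation] * len(actions)
    state_action_m = feat_transform.batch_transform(observations, actions)
    return np.dot(state_action_m, weights), state_action_m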
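
`action_values` returns the feature matrix alongside the q-values because, for a linear model q(s, a) = w · x(s, a), the gradient with respect to w is just x(s, a). The sketch below shows one common way to build x(s, a): copy the state features into a block reserved for the action. This construction and the helper `stack_state_action` are assumptions for illustration; `batch_transform` in `drmdp.feats` may build its features differently.

```python
import numpy as np


def stack_state_action(
    state_features: np.ndarray, action: int, num_actions: int
) -> np.ndarray:
    """Place the state features in the block for `action`; zeros elsewhere.

    Hypothetical helper, not part of drmdp.
    """
    dim = state_features.shape[0]
    x = np.zeros(dim * num_actions, dtype=np.float64)
    x[action * dim : (action + 1) * dim] = state_features
    return x


state_features = np.array([1.0, 0.0, 1.0])
num_actions = 2
weights = np.arange(6, dtype=np.float64)  # [0, 1, 2, 3, 4, 5]

# One row per action, mirroring the matrix returned by action_values.
features = np.stack(
    [stack_state_action(state_features, a, num_actions) for a in range(num_actions)]
)
qvalues = features @ weights
print(qvalues)  # prints [2. 8.]
```

With this layout each action has its own slice of the weight vector, so an update for one action leaves the estimates of the others unchanged.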

In [38]:
def semi_gradient_sarsa(
    env, alpha: float, gamma: float, 
    epsilon: float, num_episodes: int, 
    feat_transform: feats.FeatTransform,
    verbose: bool = True
):
    actions = tuple(range(env.action_space.n))
    weights = np.zeros(feat_transform.output_shape, dtype=np.float64)
    returns = []
    
    for i in range(num_episodes):
        obs, _ = env.reset()
        state_qvalues, gradients = action_values(obs, actions, weights, feat_transform)
        rewards = 0
        steps = 0
        # choose action
        if random.random() <= epsilon:
            action = env.action_space.sample()
        else:
            action = np.random.choice(np.flatnonzero(state_qvalues == state_qvalues.max()))

        while True:
            # take the action selected for the current state
            next_obs, reward, term, trunc, _ = env.step(action)
            rewards += reward
            steps += 1
            
            if term or trunc:
                weights = weights + alpha * (reward - state_qvalues[action]) * gradients[action]
                break

            next_state_qvalues, next_gradients = action_values(next_obs, actions, weights, feat_transform)
            
            if random.random() <= epsilon:
                next_action = env.action_space.sample()
            else:
                # greedy
                next_action = np.random.choice(np.flatnonzero(next_state_qvalues == next_state_qvalues.max()))

            weights = weights + alpha * (
                reward + gamma * next_state_qvalues[next_action] - state_qvalues[action]
            ) * gradients[action]
            obs = next_obs
            action = next_action
            state_qvalues = next_state_qvalues
            gradients = next_gradients
        returns.append(rewards)
        if verbose and ((i == 0) or (i + 1) % max(1, math.floor(num_episodes / 5)) == 0):
            print("Episode", i+1, ";", steps, "steps", "mean returns:", np.mean(returns))
    return weights
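
For reference, the two weight updates in `semi_gradient_sarsa` can be written as follows, with $\hat{q}(s, a, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s, a)$ so that $\nabla_{\mathbf{w}} \hat{q}(s, a, \mathbf{w}) = \mathbf{x}(s, a)$. The symbols are ours; they map onto `alpha`, `gamma`, `state_qvalues[action]`, `next_state_qvalues[next_action]`, and `gradients[action]` in the code.

$$
\mathbf{w} \leftarrow \mathbf{w} + \alpha \left[ r + \gamma \, \hat{q}(s', a', \mathbf{w}) - \hat{q}(s, a, \mathbf{w}) \right] \mathbf{x}(s, a)
\qquad \text{(non-terminal step)}
$$

$$
\mathbf{w} \leftarrow \mathbf{w} + \alpha \left[ r - \hat{q}(s, a, \mathbf{w}) \right] \mathbf{x}(s, a)
\qquad \text{(last step of the episode)}
$$

Note that the code applies the second form whenever `term or trunc` is true, so a truncated episode is treated the same way as a terminated one.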

In [6]:
def collect_interaction_data(env, weights, num_episodes: int, feat_transform):
    actions = tuple(range(env.action_space.n))
    buffer = []
    returns = []
    for i in range(num_episodes):
        obs, _ = env.reset()
        rewards = 0
        steps = []
        while True:
            state_qvalues, _ = action_values(obs, actions, weights, feat_transform)
            action = np.random.choice(
                np.flatnonzero(state_qvalues == state_qvalues.max())
            )
            (
                next_obs,
                reward,
                term,
                trunc,
                _,
            ) = env.step(action)
            rewards += reward
            steps.append((obs, action, next_obs, reward))
            obs = next_obs
            if term or trunc:
                returns.append(rewards)
                break
        buffer.append(steps)
    return buffer, returns

In [7]:
def play(env, weights, num_episodes: int, feat_transform):
    actions = tuple(range(env.action_space.n))
    returns = []
    for i in range(num_episodes):
        obs, _ = env.reset()
        rewards = 0
        while True:
            state_qvalues, _ = action_values(obs, actions, weights, feat_transform)
            action = np.random.choice(
                np.flatnonzero(state_qvalues == state_qvalues.max())
            )
            (
                next_obs,
                reward,
                term,
                trunc,
                _,
            ) = env.step(action)
            rewards += reward
            obs = next_obs
            if term or trunc:
                returns.append(rewards)
                break
    return returns
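
Both `collect_interaction_data` and `play` pick greedy actions with `np.flatnonzero(q == q.max())` followed by a random choice, rather than `np.argmax`. The sketch below isolates that pattern to show why: `np.argmax` always returns the first maximiser, while the random choice spreads ties evenly. The helper name `argmax_random_tiebreak` is hypothetical.

```python
import numpy as np


def argmax_random_tiebreak(qvalues: np.ndarray, rng: np.random.Generator) -> int:
    """Return the index of a maximal entry, chosen uniformly among ties."""
    best = np.flatnonzero(qvalues == qvalues.max())
    return int(rng.choice(best))


rng = np.random.default_rng(0)
qvalues = np.array([0.0, 1.5, 1.5, -2.0])

# Indices 1 and 2 tie for the maximum; both should appear across draws.
picks = {argmax_random_tiebreak(qvalues, rng) for _ in range(200)}
print(sorted(picks), int(np.argmax(qvalues)))
```

This matters most at the start of training, when all weights are zero and every action ties: a plain `np.argmax` would always pick action 0.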

In [8]:
def control_and_evaluate(
    env: gym.Env,
    args,
    alpha: float = 0.01,
    epsilon: float = 0.1,
    num_episodes: int = 5000,
    gamma: float = 1.0,
    turns: int = 5,
    eval_max_steps: int = 1000,
):
    rows = []
    for kwargs in args:
        print("Control with Fn Approx - SARSA:", kwargs)
        for turn in range(turns):
            print("Turn", turn + 1)
            ft_tfx = feats.create_feat_transformer(env, **kwargs)
            print("ft-tfx:", vars(ft_tfx))
            weights = semi_gradient_sarsa(
                env,
                alpha=alpha,
                gamma=gamma,
                epsilon=epsilon,
                num_episodes=num_episodes,
                feat_transform=ft_tfx,
            )
            buffer, returns = collect_interaction_data(
                gym.wrappers.TimeLimit(env, max_episode_steps=eval_max_steps),
                weights=weights,
                num_episodes=15,
                feat_transform=ft_tfx,
            )
            print(
                f"Eval (mean) returns: [min:{np.min(returns)}, mean:{np.mean(returns)}, max:{np.max(returns)}]"
            )
            rows.append({"args": kwargs, "buffer": buffer, "turn": turn})
    return pd.DataFrame(rows)

### Grid World

In [35]:
%%time
env = envs.make("GridWorld-v0")
df_gridworld = control_and_evaluate(env, [
    {"name": "random", "args": {"enc_size": 64}},
    {"name": "tiles", "args": {"tiling_dim": 6}}
])


Control with Fn Approx - SARSA: {'name': 'random', 'args': {'enc_size': 64}}
Turn 1
ft-tfx: {'obs_space': Box(0.0, [ 4. 12.], (2,), float32), 'num_actions': 4, 'obs_dim': 64, '_representations': {}}
Episode 1000 ; 17 steps mean returns: -90.236
Episode 2000 ; 16 steps mean returns: -59.3325
Episode 3000 ; 15 steps mean returns: -48.474333333333334
Episode 4000 ; 17 steps mean returns: -43.5925
Episode 5000 ; 23 steps mean returns: -40.3842
Eval (mean) returns: [min:-15.0, mean:-15.0, max:-15.0]
Turn 2
ft-tfx: {'obs_space': Box(0.0, [ 4. 12.], (2,), float32), 'num_actions': 4, 'obs_dim': 64, '_representations': {}}
Episode 1000 ; 24 steps mean returns: -88.914
Episode 2000 ; 15 steps mean returns: -69.6405
Episode 3000 ; 17 steps mean returns: -59.284
Episode 4000 ; 21 steps mean returns: -51.06125
Episode 5000 ; 16 steps mean returns: -45.8
Eval (mean) returns: [min:-15.0, mean:-15.0, max:-15.0]
Turn 3
ft-tfx: {'obs_space': Box(0.0, [ 4. 12.], (2,), float32), 'num_actions': 4, 'obs_dim

In [11]:
df_gridworld

Unnamed: 0,args,buffer,turn
0,"{'name': 'random', 'args': {'enc_size': 64}}","[[([3 0], 2, [2 0], -1.0), ([2 0], 2, [1 0], -...",0
1,"{'name': 'random', 'args': {'enc_size': 64}}","[[([3 0], 2, [2 0], -1.0), ([2 0], 2, [1 0], -...",1
2,"{'name': 'random', 'args': {'enc_size': 64}}","[[([3 0], 2, [2 0], -1.0), ([2 0], 2, [1 0], -...",2
3,"{'name': 'random', 'args': {'enc_size': 64}}","[[([3 0], 2, [2 0], -1.0), ([2 0], 2, [1 0], -...",3
4,"{'name': 'random', 'args': {'enc_size': 64}}","[[([3 0], 2, [2 0], -1.0), ([2 0], 2, [1 0], -...",4
5,"{'name': 'tiles', 'args': {'tiling_dim': 6}}","[[([3 0], 2, [2 0], -1.0), ([2 0], 2, [1 0], -...",0
6,"{'name': 'tiles', 'args': {'tiling_dim': 6}}","[[([3 0], 2, [2 0], -1.0), ([2 0], 2, [1 0], -...",1
7,"{'name': 'tiles', 'args': {'tiling_dim': 6}}","[[([3 0], 2, [2 0], -1.0), ([2 0], 2, [1 0], -...",2
8,"{'name': 'tiles', 'args': {'tiling_dim': 6}}","[[([3 0], 2, [2 0], -1.0), ([2 0], 2, [1 0], -...",3
9,"{'name': 'tiles', 'args': {'tiling_dim': 6}}","[[([3 0], 2, [2 0], -1.0), ([2 0], 2, [1 0], -...",4
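
Each `buffer` cell holds a list of episodes, and each episode is a list of `(obs, action, next_obs, reward)` tuples, so evaluation returns can be recovered from the frame afterwards. Below is a minimal sketch on a small synthetic frame with the same columns; the helper `episode_returns` and the toy data are made up for illustration and are not results from this notebook.

```python
import pandas as pd


def episode_returns(buffer):
    """Sum the reward (4th tuple element) over each episode's transitions."""
    return [sum(reward for _, _, _, reward in episode) for episode in buffer]


# Synthetic stand-in for a frame returned by control_and_evaluate.
toy = pd.DataFrame(
    [
        {
            "args": {"name": "random"},
            "turn": 0,
            "buffer": [[((3, 0), 2, (2, 0), -1.0), ((2, 0), 2, (1, 0), -1.0)]],
        },
        {
            "args": {"name": "tiles"},
            "turn": 0,
            "buffer": [[((3, 0), 2, (2, 0), -1.0)], [((3, 0), 1, (3, 1), -1.0)]],
        },
    ]
)

# Pull the encoder name out of the args dict so that we can group by it.
toy["name"] = toy["args"].apply(lambda kwargs: kwargs["name"])
toy["mean_return"] = toy["buffer"].apply(
    lambda buf: sum(episode_returns(buf)) / len(buf)
)
summary = toy.groupby("name")["mean_return"].mean()
print(summary)
```

Applied to `df_gridworld`, the same steps would give one mean evaluation return per encoding, averaged over turns.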


### Ice World

In [12]:
%%time
env = envs.make("IceWorld-v0")
df_iceworld = control_and_evaluate(env, [
    {"name": "random", "args": {"enc_size": 64}},
    {"name": "tiles", "args":{"tiling_dim": 6}}
])

Control with Fn Approx - SARSA: {'name': 'random', 'args': {'enc_size': 64}}
Turn 1
ft-tfx: {'obs_space': Box(0.0, 4.0, (2,), float32), 'num_actions': 4, 'obs_dim': 64, '_representations': {}}
Episode 1000 mean returns: -10.401
Episode 2000 mean returns: -10.413
Episode 3000 mean returns: -10.608333333333333
Episode 4000 mean returns: -10.744
Episode 5000 mean returns: -10.6986
Eval (mean) returns: [min:-6.0, mean:-6.0, max:-6.0]
Turn 2
ft-tfx: {'obs_space': Box(0.0, 4.0, (2,), float32), 'num_actions': 4, 'obs_dim': 64, '_representations': {}}
Episode 1000 mean returns: -10.734
Episode 2000 mean returns: -10.916
Episode 3000 mean returns: -10.893333333333333
Episode 4000 mean returns: -11.0345
Episode 5000 mean returns: -10.9996
Eval (mean) returns: [min:-6.0, mean:-6.0, max:-6.0]
Turn 3
ft-tfx: {'obs_space': Box(0.0, 4.0, (2,), float32), 'num_actions': 4, 'obs_dim': 64, '_representations': {}}
Episode 1000 mean returns: -10.663
Episode 2000 mean returns: -10.822
Episode 3000 mean retu

In [13]:
df_iceworld

Unnamed: 0,args,buffer,turn
0,"{'name': 'random', 'args': {'enc_size': 64}}","[[([0 0], 3, [1 0], -1.0), ([1 0], 3, [2 0], -...",0
1,"{'name': 'random', 'args': {'enc_size': 64}}","[[([0 0], 1, [0 1], -1.0), ([0 1], 1, [0 2], -...",1
2,"{'name': 'random', 'args': {'enc_size': 64}}","[[([0 0], 1, [0 1], -1.0), ([0 1], 1, [0 2], -...",2
3,"{'name': 'random', 'args': {'enc_size': 64}}","[[([0 0], 1, [0 1], -1.0), ([0 1], 1, [0 2], -...",3
4,"{'name': 'random', 'args': {'enc_size': 64}}","[[([0 0], 3, [1 0], -1.0), ([1 0], 3, [2 0], -...",4
5,"{'name': 'tiles', 'args': {'tiling_dim': 6}}","[[([0 0], 3, [1 0], -1.0), ([1 0], 3, [2 0], -...",0
6,"{'name': 'tiles', 'args': {'tiling_dim': 6}}","[[([0 0], 1, [0 1], -1.0), ([0 1], 1, [0 2], -...",1
7,"{'name': 'tiles', 'args': {'tiling_dim': 6}}","[[([0 0], 3, [1 0], -1.0), ([1 0], 3, [2 0], -...",2
8,"{'name': 'tiles', 'args': {'tiling_dim': 6}}","[[([0 0], 3, [1 0], -1.0), ([1 0], 3, [2 0], -...",3
9,"{'name': 'tiles', 'args': {'tiling_dim': 6}}","[[([0 0], 3, [1 0], -1.0), ([1 0], 3, [2 0], -...",4


### RedGreen

In [14]:
%%time
env = envs.make("RedGreen-v0")
df_redgreen = control_and_evaluate(env, [
    {"name": "random", "args": {"enc_size": 32}},
    {"name": "tiles", "args":{"tiling_dim": 6}}
])

Control with Fn Approx - SARSA: {'name': 'random', 'args': {'enc_size': 32}}
Turn 1
ft-tfx: {'obs_space': Box(0.0, 7.0, (1,), float32), 'num_actions': 3, 'obs_dim': 32, '_representations': {}}
Episode 1000 mean returns: -6.687
Episode 2000 mean returns: -6.5485
Episode 3000 mean returns: -6.504666666666667
Episode 4000 mean returns: -6.49525
Episode 5000 mean returns: -6.4876
Eval (mean) returns: [min:-6.0, mean:-6.0, max:-6.0]
Turn 2
ft-tfx: {'obs_space': Box(0.0, 7.0, (1,), float32), 'num_actions': 3, 'obs_dim': 32, '_representations': {}}
Episode 1000 mean returns: -6.897
Episode 2000 mean returns: -6.6525
Episode 3000 mean returns: -6.579
Episode 4000 mean returns: -6.55825
Episode 5000 mean returns: -6.5264
Eval (mean) returns: [min:-6.0, mean:-6.0, max:-6.0]
Turn 3
ft-tfx: {'obs_space': Box(0.0, 7.0, (1,), float32), 'num_actions': 3, 'obs_dim': 32, '_representations': {}}
Episode 1000 mean returns: -6.855
Episode 2000 mean returns: -6.6275
Episode 3000 mean returns: -6.5503333333

In [15]:
df_redgreen

Unnamed: 0,args,buffer,turn
0,"{'name': 'random', 'args': {'enc_size': 32}}","[[([0], 0, [1], -1.0), ([1], 1, [2], -1.0), ([...",0
1,"{'name': 'random', 'args': {'enc_size': 32}}","[[([0], 0, [1], -1.0), ([1], 1, [2], -1.0), ([...",1
2,"{'name': 'random', 'args': {'enc_size': 32}}","[[([0], 0, [1], -1.0), ([1], 1, [2], -1.0), ([...",2
3,"{'name': 'random', 'args': {'enc_size': 32}}","[[([0], 0, [1], -1.0), ([1], 1, [2], -1.0), ([...",3
4,"{'name': 'random', 'args': {'enc_size': 32}}","[[([0], 0, [1], -1.0), ([1], 1, [2], -1.0), ([...",4
5,"{'name': 'tiles', 'args': {'tiling_dim': 6}}","[[([0], 0, [1], -1.0), ([1], 1, [2], -1.0), ([...",0
6,"{'name': 'tiles', 'args': {'tiling_dim': 6}}","[[([0], 0, [1], -1.0), ([1], 1, [2], -1.0), ([...",1
7,"{'name': 'tiles', 'args': {'tiling_dim': 6}}","[[([0], 0, [1], -1.0), ([1], 1, [2], -1.0), ([...",2
8,"{'name': 'tiles', 'args': {'tiling_dim': 6}}","[[([0], 0, [1], -1.0), ([1], 1, [2], -1.0), ([...",3
9,"{'name': 'tiles', 'args': {'tiling_dim': 6}}","[[([0], 0, [1], -1.0), ([1], 1, [2], -1.0), ([...",4


### Mountain Car

In [18]:
%%time
env = envs.make("MountainCar-v0", max_episode_steps=1000)
df_mountaincar = control_and_evaluate(env, [
    {"name": "scale", "args": None},
    {"name": "tiles", "args": {"tiling_dim": 6}},
], alpha=0.01, epsilon=0.2, num_episodes=1000)

Control with Fn Approx - SARSA: {'name': 'scale', 'args': None}
Turn 1
ft-tfx: {'obs_space': Box([-1.2  -0.07], [0.6  0.07], (2,), float32), 'num_actions': 3, 'obs_dim': 2, 'obs_range': array([1.8000001, 0.14     ], dtype=float32)}
Episode 200 mean returns: -1000.0
Episode 400 mean returns: -1000.0
Episode 600 mean returns: -1000.0
Episode 800 mean returns: -1000.0
Episode 1000 mean returns: -1000.0
Eval (mean) returns: [min:-1000.0, mean:-1000.0, max:-1000.0]
Turn 2
ft-tfx: {'obs_space': Box([-1.2  -0.07], [0.6  0.07], (2,), float32), 'num_actions': 3, 'obs_dim': 2, 'obs_range': array([1.8000001, 0.14     ], dtype=float32)}
Episode 200 mean returns: -1000.0
Episode 400 mean returns: -1000.0
Episode 600 mean returns: -1000.0
Episode 800 mean returns: -1000.0
Episode 1000 mean returns: -1000.0
Eval (mean) returns: [min:-1000.0, mean:-1000.0, max:-1000.0]
Turn 3
ft-tfx: {'obs_space': Box([-1.2  -0.07], [0.6  0.07], (2,), float32), 'num_actions': 3, 'obs_dim': 2, 'obs_range': array([1.800

### GEM

In [39]:
%%time
env = envs.make("Finite-CC-PMSM-v0", max_episode_steps=200)
df_gem = control_and_evaluate(env, [
    {"name": "scale", "args": None},
], alpha=0.01, epsilon=0.2, num_episodes=1000)

Control with Fn Approx - SARSA: {'name': 'scale', 'args': None}
Turn 1
ft-tfx: {'obs_space': Box(0.0, [0.8 0.8 1.  1. ], (4,), float64), 'num_actions': 8, 'obs_dim': 4, 'obs_range': array([0.8, 0.8, 1. , 1. ])}
Episode 1 ; 200 steps mean returns: 366.7618433946664
Episode 200 ; 200 steps mean returns: 369.7405795512957
Episode 400 ; 200 steps mean returns: 369.27925967642756
Episode 600 ; 200 steps mean returns: 368.4499800164134
Episode 800 ; 200 steps mean returns: 365.5863532717111


capi_return is NULL
Call-back cb_fcn_in___user__routines failed.


KeyboardInterrupt: 