# Prototyping: Control with Linear Function Approximation

Our aim is to solve RL tasks in which rewards are delayed (aggregated and anonymous), using linear function approximation.
To do so, we project both large discrete state spaces and continuous states onto basis vectors.
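
For reference, the linear form this relies on can be written as below. The symbols are notation assumed here, not names from the notebook's code: $\mathbf{w}$ is the weight vector and $\mathbf{x}(s, a)$ is the basis vector of a state-action pair.

$$
\hat{q}(s, a, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s, a),
\qquad
\nabla_{\mathbf{w}} \hat{q}(s, a, \mathbf{w}) = \mathbf{x}(s, a)
$$

Because the gradient is the basis vector itself, the choice of encoding fully determines which weights an update can change.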


In this notebook, we test policy control with linear function approximation on a select set of environments.
These environments have either discrete or continuous states, and discrete actions.

The environments, together with the encoding that worked best for reward estimation in previous analyses, are:

  - GridWorld: Tiling(8, HT=512)
  - RedGreen: Tiling(2)
  - MountainCar: Tiling(4)
  - GEM:Finite-CC-PMSM-v0 (Gym Electric Motor): Scaled raw features
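
To make the idea of a tiling concrete, here is a minimal tile-coding sketch. It is not the `drmdp` implementation: the function names are hypothetical, `tiling_dim` and `num_tilings` are parameters of this sketch and may not mean the same thing as in `drmdp`, and hashing tiles into a fixed-size table (which `HT=512` suggests) is left out.

```python
import numpy as np


def tile_indices(obs, low, high, tiling_dim: int, num_tilings: int) -> list:
    """Return one active tile index per tiling for a continuous observation."""
    obs = np.asarray(obs, dtype=np.float64)
    low = np.asarray(low, dtype=np.float64)
    high = np.asarray(high, dtype=np.float64)
    # Rescale every dimension to [0, tiling_dim].
    scaled = (obs - low) / (high - low) * tiling_dim
    grid = (tiling_dim + 1,) * obs.size  # one extra cell absorbs the offset
    tiles_per_tiling = (tiling_dim + 1) ** obs.size
    active = []
    for tiling in range(num_tilings):
        offset = tiling / num_tilings  # shift each tiling by a fraction of a tile
        coords = np.clip(np.floor(scaled + offset).astype(int), 0, tiling_dim)
        flat = int(np.ravel_multi_index(tuple(coords), grid))
        active.append(tiling * tiles_per_tiling + flat)
    return active


def tile_features(obs, low, high, tiling_dim: int, num_tilings: int) -> np.ndarray:
    """Binary feature vector with exactly `num_tilings` active entries."""
    size = num_tilings * (tiling_dim + 1) ** len(low)
    vec = np.zeros(size)
    vec[tile_indices(obs, low, high, tiling_dim, num_tilings)] = 1.0
    return vec


# MountainCar-like bounds: position in [-1.2, 0.6], velocity in [-0.07, 0.07].
low, high = [-1.2, -0.07], [0.6, 0.07]
features = tile_features([-0.5, 0.0], low, high, tiling_dim=4, num_tilings=2)
print(features.shape, int(features.sum()))  # (50,) 2
```

Nearby observations share some active tiles, which is what lets a linear model generalise across neighbouring states.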

We cannot know in advance whether a feature encoding that is suitable for estimating rewards from aggregate samples is equally adequate for representing state when learning an action-value function in control (or a state-value function in policy evaluation).
So, for each environment, we also use a second encoding as a reference for comparison:

  - GridWorld: Random Binary Vectors
  - RedGreen: Random Binary Vectors
  - MountainCar: GaussianMixture(covariance_type='diag', n_components=3)
  - GEM:Finite-CC-PMSM-v0 (Gym Electric Motor): GaussianMixture(covariance_type='diag', n_components=11)
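
As an illustration of the random binary vector encoding, the sketch below assigns each distinct observation a fixed random 0/1 code the first time it is seen. The class name and its lazy dictionary are assumptions made for this example; the `drmdp` transformer may build its codes differently.

```python
import numpy as np


class RandomBinaryEncoder:
    """Hypothetical encoder: one fixed random binary code per distinct observation."""

    def __init__(self, enc_size: int, seed: int = 0):
        self.enc_size = enc_size
        self._rng = np.random.default_rng(seed)
        self._codes = {}

    def encode(self, obs) -> np.ndarray:
        # Observations are turned into hashable tuples so they can be dict keys.
        key = tuple(np.asarray(obs).ravel().tolist())
        if key not in self._codes:
            bits = self._rng.integers(0, 2, size=self.enc_size)
            self._codes[key] = bits.astype(np.float64)
        return self._codes[key]


encoder = RandomBinaryEncoder(enc_size=32)
first = encoder.encode([3, 0])
again = encoder.encode([3, 0])  # same observation, same code
other = encoder.encode([2, 0])  # new observation, new random code
print(first.shape, np.array_equal(first, again))  # (32,) True
```

Unlike tiles, these codes carry no notion of distance between states, so any generalisation across states comes only from bits that happen to overlap.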



In [1]:
import math
import random
from typing import Sequence


In [2]:
import gymnasium as gym
import numpy as np
import pandas as pd


In [3]:
from drmdp import envs, feats

## Control with SARSA

In [4]:
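def action_values(
    observation, actions: Sequence[int], weights, feat_transform: feats.FeatTransform
):
    """Return q(s, a) for every action and the state-action feature rows.

    With a linear q, row i is also the gradient of q(s, actions[i]) w.r.t. the weights.
    """
    observations = [observation] * len(actions)
    state_action_m = feat_transform.batch_transform(observations, actions)
    return np.dot(state_action_m, weights), state_action_m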
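
To check the shapes `action_values` works with without installing `drmdp`, the sketch below pairs a copy of the function with a hypothetical one-hot transformer. `OneHotStateAction` is invented for this example and only mimics the two members the function relies on, `output_shape` and `batch_transform`.

```python
import numpy as np


class OneHotStateAction:
    """Hypothetical stand-in for feats.FeatTransform (integer states only)."""

    def __init__(self, num_states: int, num_actions: int):
        self.num_states = num_states
        self.num_actions = num_actions
        self.output_shape = num_states * num_actions

    def batch_transform(self, observations, actions) -> np.ndarray:
        # One row per (observation, action) pair, with a single active entry.
        batch = np.zeros((len(observations), self.output_shape))
        for row, (obs, action) in enumerate(zip(observations, actions)):
            batch[row, action * self.num_states + int(obs)] = 1.0
        return batch


def action_values(observation, actions, weights, feat_transform):
    # Copy of the notebook function, minus the drmdp type hint.
    observations = [observation] * len(actions)
    state_action_m = feat_transform.batch_transform(observations, actions)
    return np.dot(state_action_m, weights), state_action_m


ft = OneHotStateAction(num_states=3, num_actions=2)
weights = np.arange(ft.output_shape, dtype=np.float64)
qvalues, features = action_values(1, (0, 1), weights, ft)
print(qvalues, features.shape)  # [1. 4.] (2, 6)
```

With one-hot features the linear model reduces to a lookup table, so this stand-in is a convenient baseline when debugging a new encoding.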

In [5]:
def semi_gradient_sarsa(
    env, alpha: float, gamma: float, 
    epsilon: float, num_episodes: int, 
    feat_transform: feats.FeatTransform,
    verbose: bool = True
):
    actions = tuple(range(env.action_space.n))
    weights = np.zeros(feat_transform.output_shape, dtype=np.float64)
    returns = []
    
    for i in range(num_episodes):
        obs, _ = env.reset()
        state_qvalues, gradients = action_values(obs, actions, weights, feat_transform)
        rewards = 0
        steps = 0
        # choose the first action (epsilon-greedy, ties broken at random)
        if random.random() <= epsilon:
            action = env.action_space.sample()
        else:
            action = np.random.choice(np.flatnonzero(state_qvalues == state_qvalues.max()))

        while True:
            # take the selected action and observe the outcome
            next_obs, reward, term, trunc, _ = env.step(action)
            rewards += reward
            steps += 1
            
            if term or trunc:
                # episode end: no bootstrap term (truncation is treated like termination here)
                weights = weights + alpha * (reward - state_qvalues[action]) * gradients[action]
                break

            next_state_qvalues, next_gradients = action_values(next_obs, actions, weights, feat_transform)
            
            if random.random() <= epsilon:
                next_action = env.action_space.sample()
            else:
                # greedy
                next_action = np.random.choice(np.flatnonzero(next_state_qvalues == next_state_qvalues.max()))

            weights = weights + alpha * (
                reward + gamma * next_state_qvalues[next_action] - state_qvalues[action]
            ) * gradients[action]
            obs = next_obs
            action = next_action
            state_qvalues = next_state_qvalues
            gradients = next_gradients
        returns.append(rewards)
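        # "mean returns" is the running mean over all episodes so far, not a recent window;
        # max(1, ...) avoids a modulo-by-zero when num_episodes < 5
        if verbose and ((i == 0) or (i + 1) % max(1, num_episodes // 5) == 0):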
            print("Episode", i+1, ";", steps, "steps", "mean returns:", np.mean(returns))
    return weights
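
Written out, the two weight updates in `semi_gradient_sarsa` correspond to the rules below. The notation is assumed here rather than defined in the notebook: $\mathbf{x}(s, a)$ is the feature row `gradients[action]`, and $\hat{q}(s, a, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s, a)$.

For a step that does not end the episode:

$$
\mathbf{w} \leftarrow \mathbf{w} + \alpha \left[ r + \gamma\, \hat{q}(s', a', \mathbf{w}) - \hat{q}(s, a, \mathbf{w}) \right] \mathbf{x}(s, a)
$$

For the last step of an episode (terminated or truncated):

$$
\mathbf{w} \leftarrow \mathbf{w} + \alpha \left[ r - \hat{q}(s, a, \mathbf{w}) \right] \mathbf{x}(s, a)
$$

One detail of the loop as written: `state_qvalues` for the current step is the `next_state_qvalues` array from the previous iteration, so $\hat{q}(s, a, \mathbf{w})$ is evaluated with the weights from before the most recent update.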

In [6]:
def collect_interaction_data(env, weights, num_episodes: int, feat_transform):
    """Roll out the greedy policy; record (obs, action, next_obs, reward) per step."""
    actions = tuple(range(env.action_space.n))
    buffer = []
    returns = []
    for i in range(num_episodes):
        obs, _ = env.reset()
        rewards = 0
        steps = []
        while True:
            state_qvalues, _ = action_values(obs, actions, weights, feat_transform)
            action = np.random.choice(
                np.flatnonzero(state_qvalues == state_qvalues.max())
            )
            (
                next_obs,
                reward,
                term,
                trunc,
                _,
            ) = env.step(action)
            rewards += reward
            steps.append((obs, action, next_obs, reward))
            obs = next_obs
            if term or trunc:
                returns.append(rewards)
                break
        buffer.append(steps)
    return buffer, returns

In [7]:
def play(env, weights, num_episodes: int, feat_transform):
    """Roll out the greedy policy and return the total reward of each episode."""
    actions = tuple(range(env.action_space.n))
    returns = []
    for i in range(num_episodes):
        obs, _ = env.reset()
        rewards = 0
        while True:
            state_qvalues, _ = action_values(obs, actions, weights, feat_transform)
            action = np.random.choice(
                np.flatnonzero(state_qvalues == state_qvalues.max())
            )
            (
                next_obs,
                reward,
                term,
                trunc,
                _,
            ) = env.step(action)
            rewards += reward
            obs = next_obs
            if term or trunc:
                returns.append(rewards)
                break
    return returns
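
The expression `np.random.choice(np.flatnonzero(q == q.max()))` recurs in all three functions above. A small helper such as the hypothetical `random_argmax` below captures it; the name and the explicit `Generator` argument are choices made for this sketch, not part of the notebook.

```python
import numpy as np


def random_argmax(values: np.ndarray, rng: np.random.Generator) -> int:
    """Index of a maximal entry, with ties broken uniformly at random."""
    best = np.flatnonzero(values == values.max())
    return int(rng.choice(best))


rng = np.random.default_rng(0)

# A unique maximum is always selected.
print(random_argmax(np.array([0.0, 5.0, 2.0]), rng))  # 1

# With all-zero estimates (as at the start of training) every action is a candidate.
picks = {random_argmax(np.zeros(3), rng) for _ in range(100)}
```

Plain `np.argmax` returns the first maximal index, so with zero-initialised weights it would always pick action 0 until an update breaks the tie.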

In [8]:
def control_and_evaluate(
    env: gym.Env,
    args,
    alpha: float = 0.01,
    epsilon: float = 0.1,
    num_episodes: int = 5000,
    gamma: float = 1.0,
    turns: int = 5,
    eval_max_steps: int = 1000,
):
    """Train SARSA `turns` times per encoding in `args`; keep each run's greedy rollouts."""
    rows = []
    for kwargs in args:
        print("Control with Fn Approx - SARSA:", kwargs)
        for turn in range(turns):
            print("Turn", turn + 1)
            ft_tfx = feats.create_feat_transformer(env, **kwargs)
            print("ft-tfx:", vars(ft_tfx))
            weights = semi_gradient_sarsa(
                env,
                alpha=alpha,
                gamma=gamma,
                epsilon=epsilon,
                num_episodes=num_episodes,
                feat_transform=ft_tfx,
            )
            buffer, returns = collect_interaction_data(
                gym.wrappers.TimeLimit(env, max_episode_steps=eval_max_steps),
                weights=weights,
                num_episodes=15,
                feat_transform=ft_tfx,
            )
            print(
                f"Eval (mean) returns: [min:{np.min(returns)}, mean:{np.mean(returns)}, max:{np.max(returns)}]"
            )
            rows.append({"args": kwargs, "buffer": buffer, "turn": turn})
    return pd.DataFrame(rows)
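
The returned frame stores raw transitions only, so comparing encodings means reducing each `buffer` to returns first. The sketch below does that on a toy frame with the same three columns. The helper `episode_returns`, the derived column names, and the toy data are all made up for illustration.

```python
import numpy as np
import pandas as pd


def episode_returns(buffer) -> list:
    """Sum the reward (fourth field) over the transitions of each episode."""
    return [sum(step[3] for step in episode) for episode in buffer]


# Toy data in the layout produced by control_and_evaluate:
# one row per (encoding, turn); buffer = list of episodes; episode = list of steps.
toy = pd.DataFrame(
    [
        {
            "args": {"name": "random"},
            "turn": 0,
            "buffer": [[((0,), 0, (1,), -1.0), ((1,), 1, (2,), -1.0)]],
        },
        {
            "args": {"name": "tiles"},
            "turn": 0,
            "buffer": [
                [((0,), 0, (1,), -1.0)],
                [((0,), 1, (2,), -1.0), ((2,), 0, (3,), -1.0)],
            ],
        },
    ]
)

toy["encoding"] = toy["args"].map(lambda kwargs: kwargs["name"])
toy["mean_return"] = toy["buffer"].map(
    lambda buf: float(np.mean(episode_returns(buf)))
)
summary = toy.groupby("encoding")["mean_return"].mean()
print(summary)
```

Applied to the real output, the same two `map` calls would give one mean evaluation return per turn, which can then be averaged per encoding.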

### Grid World

In [9]:
%%time
env = envs.make("GridWorld-v0")
print(env)
df_gridworld = control_and_evaluate(env, [
    {"name": "random", "args": {"enc_size": 64}},
    {"name": "tiles", "args": {"tiling_dim": 6}}
])


<GridWorldObsAsVectorWrapper<GridWorld instance>>
Control with Fn Approx - SARSA: {'name': 'random', 'args': {'enc_size': 64}}
Turn 1
ft-tfx: {'obs_space': Box(0.0, [ 4. 12.], (2,), float32), 'num_actions': 4, 'obs_dim': 64, '_representations': {}}
Episode 1 ; 5314 steps mean returns: -7690.0
Episode 1000 ; 20 steps mean returns: -62.935
Episode 2000 ; 17 steps mean returns: -43.8265
Episode 3000 ; 18 steps mean returns: -37.091
Episode 4000 ; 21 steps mean returns: -33.9805
Episode 5000 ; 26 steps mean returns: -32.2854
Eval (mean) returns: [min:-17.0, mean:-17.0, max:-17.0]
Turn 2
ft-tfx: {'obs_space': Box(0.0, [ 4. 12.], (2,), float32), 'num_actions': 4, 'obs_dim': 64, '_representations': {}}
Episode 1 ; 64939 steps mean returns: -87610.0
Episode 1000 ; 21 steps mean returns: -165.232
Episode 2000 ; 17 steps mean returns: -101.553
Episode 3000 ; 29 steps mean returns: -81.045
Episode 4000 ; 17 steps mean returns: -70.0075
Episode 5000 ; 19 steps mean returns: -62.7996
Eval (mean) re

In [10]:
df_gridworld

| | args | buffer | turn |
|---|---|---|---|
| 0 | {'name': 'random', 'args': {'enc_size': 64}} | [[([3 0], 2, [2 0], -1.0), ([2 0], 2, [1 0], -... | 0 |
| 1 | {'name': 'random', 'args': {'enc_size': 64}} | [[([3 0], 2, [2 0], -1.0), ([2 0], 1, [2 1], -... | 1 |
| 2 | {'name': 'random', 'args': {'enc_size': 64}} | [[([3 0], 2, [2 0], -1.0), ([2 0], 2, [1 0], -... | 2 |
| 3 | {'name': 'random', 'args': {'enc_size': 64}} | [[([3 0], 2, [2 0], -1.0), ([2 0], 2, [1 0], -... | 3 |
| 4 | {'name': 'random', 'args': {'enc_size': 64}} | [[([3 0], 2, [2 0], -1.0), ([2 0], 2, [1 0], -... | 4 |
| 5 | {'name': 'tiles', 'args': {'tiling_dim': 6}} | [[([3 0], 2, [2 0], -1.0), ([2 0], 2, [1 0], -... | 0 |
| 6 | {'name': 'tiles', 'args': {'tiling_dim': 6}} | [[([3 0], 2, [2 0], -1.0), ([2 0], 2, [1 0], -... | 1 |
| 7 | {'name': 'tiles', 'args': {'tiling_dim': 6}} | [[([3 0], 2, [2 0], -1.0), ([2 0], 2, [1 0], -... | 2 |
| 8 | {'name': 'tiles', 'args': {'tiling_dim': 6}} | [[([3 0], 2, [2 0], -1.0), ([2 0], 2, [1 0], -... | 3 |
| 9 | {'name': 'tiles', 'args': {'tiling_dim': 6}} | [[([3 0], 2, [2 0], -1.0), ([2 0], 2, [1 0], -... | 4 |


### Ice World

In [11]:
%%time
env = envs.make("IceWorld-v0")
df_iceworld = control_and_evaluate(env, [
    {"name": "random", "args": {"enc_size": 64}},
    {"name": "tiles", "args":{"tiling_dim": 6}}
])

Control with Fn Approx - SARSA: {'name': 'random', 'args': {'enc_size': 64}}
Turn 1
ft-tfx: {'obs_space': Box(0.0, 4.0, (2,), float32), 'num_actions': 4, 'obs_dim': 64, '_representations': {}}
Episode 1 ; 12 steps mean returns: -43.0
Episode 1000 ; 6 steps mean returns: -10.683
Episode 2000 ; 6 steps mean returns: -10.944
Episode 3000 ; 6 steps mean returns: -11.005333333333333
Episode 4000 ; 7 steps mean returns: -11.01575
Episode 5000 ; 6 steps mean returns: -11.0096
Eval (mean) returns: [min:-6.0, mean:-6.0, max:-6.0]
Turn 2
ft-tfx: {'obs_space': Box(0.0, 4.0, (2,), float32), 'num_actions': 4, 'obs_dim': 64, '_representations': {}}
Episode 1 ; 2 steps mean returns: -33.0
Episode 1000 ; 6 steps mean returns: -12.087
Episode 2000 ; 6 steps mean returns: -11.363
Episode 3000 ; 6 steps mean returns: -11.171666666666667
Episode 4000 ; 6 steps mean returns: -11.018
Episode 5000 ; 6 steps mean returns: -10.9456
Eval (mean) returns: [min:-6.0, mean:-6.0, max:-6.0]
Turn 3
ft-tfx: {'obs_space

In [12]:
df_iceworld

| | args | buffer | turn |
|---|---|---|---|
| 0 | {'name': 'random', 'args': {'enc_size': 64}} | [[([0 0], 3, [1 0], -1.0), ([1 0], 3, [2 0], -... | 0 |
| 1 | {'name': 'random', 'args': {'enc_size': 64}} | [[([0 0], 1, [0 1], -1.0), ([0 1], 1, [0 2], -... | 1 |
| 2 | {'name': 'random', 'args': {'enc_size': 64}} | [[([0 0], 3, [1 0], -1.0), ([1 0], 3, [2 0], -... | 2 |
| 3 | {'name': 'random', 'args': {'enc_size': 64}} | [[([0 0], 3, [1 0], -1.0), ([1 0], 3, [2 0], -... | 3 |
| 4 | {'name': 'random', 'args': {'enc_size': 64}} | [[([0 0], 3, [1 0], -1.0), ([1 0], 3, [2 0], -... | 4 |
| 5 | {'name': 'tiles', 'args': {'tiling_dim': 6}} | [[([0 0], 3, [1 0], -1.0), ([1 0], 3, [2 0], -... | 0 |
| 6 | {'name': 'tiles', 'args': {'tiling_dim': 6}} | [[([0 0], 1, [0 1], -1.0), ([0 1], 1, [0 2], -... | 1 |
| 7 | {'name': 'tiles', 'args': {'tiling_dim': 6}} | [[([0 0], 3, [1 0], -1.0), ([1 0], 3, [2 0], -... | 2 |
| 8 | {'name': 'tiles', 'args': {'tiling_dim': 6}} | [[([0 0], 3, [1 0], -1.0), ([1 0], 3, [2 0], -... | 3 |
| 9 | {'name': 'tiles', 'args': {'tiling_dim': 6}} | [[([0 0], 1, [0 1], -1.0), ([0 1], 1, [0 2], -... | 4 |


### RedGreen

In [13]:
%%time
env = envs.make("RedGreen-v0")
df_redgreen = control_and_evaluate(env, [
    {"name": "random", "args": {"enc_size": 32}},
    {"name": "tiles", "args":{"tiling_dim": 6}}
])

Control with Fn Approx - SARSA: {'name': 'random', 'args': {'enc_size': 32}}
Turn 1
ft-tfx: {'obs_space': Box(0.0, 7.0, (1,), float32), 'num_actions': 3, 'obs_dim': 32, '_representations': {}}
Episode 1 ; 14 steps mean returns: -14.0
Episode 1000 ; 6 steps mean returns: -6.745
Episode 2000 ; 7 steps mean returns: -6.58
Episode 3000 ; 6 steps mean returns: -6.53
Episode 4000 ; 6 steps mean returns: -6.511
Episode 5000 ; 7 steps mean returns: -6.494
Eval (mean) returns: [min:-6.0, mean:-6.0, max:-6.0]
Turn 2
ft-tfx: {'obs_space': Box(0.0, 7.0, (1,), float32), 'num_actions': 3, 'obs_dim': 32, '_representations': {}}
Episode 1 ; 21 steps mean returns: -21.0
Episode 1000 ; 6 steps mean returns: -6.642
Episode 2000 ; 6 steps mean returns: -6.5365
Episode 3000 ; 6 steps mean returns: -6.5056666666666665
Episode 4000 ; 6 steps mean returns: -6.48625
Episode 5000 ; 6 steps mean returns: -6.474
Eval (mean) returns: [min:-6.0, mean:-6.0, max:-6.0]
Turn 3
ft-tfx: {'obs_space': Box(0.0, 7.0, (1,), 

In [14]:
df_redgreen

| | args | buffer | turn |
|---|---|---|---|
| 0 | {'name': 'random', 'args': {'enc_size': 32}} | [[([0], 0, [1], -1.0), ([1], 1, [2], -1.0), ([... | 0 |
| 1 | {'name': 'random', 'args': {'enc_size': 32}} | [[([0], 0, [1], -1.0), ([1], 1, [2], -1.0), ([... | 1 |
| 2 | {'name': 'random', 'args': {'enc_size': 32}} | [[([0], 0, [1], -1.0), ([1], 1, [2], -1.0), ([... | 2 |
| 3 | {'name': 'random', 'args': {'enc_size': 32}} | [[([0], 0, [1], -1.0), ([1], 1, [2], -1.0), ([... | 3 |
| 4 | {'name': 'random', 'args': {'enc_size': 32}} | [[([0], 0, [1], -1.0), ([1], 1, [2], -1.0), ([... | 4 |
| 5 | {'name': 'tiles', 'args': {'tiling_dim': 6}} | [[([0], 0, [1], -1.0), ([1], 1, [2], -1.0), ([... | 0 |
| 6 | {'name': 'tiles', 'args': {'tiling_dim': 6}} | [[([0], 0, [1], -1.0), ([1], 1, [2], -1.0), ([... | 1 |
| 7 | {'name': 'tiles', 'args': {'tiling_dim': 6}} | [[([0], 0, [1], -1.0), ([1], 1, [2], -1.0), ([... | 2 |
| 8 | {'name': 'tiles', 'args': {'tiling_dim': 6}} | [[([0], 0, [1], -1.0), ([1], 1, [2], -1.0), ([... | 3 |
| 9 | {'name': 'tiles', 'args': {'tiling_dim': 6}} | [[([0], 0, [1], -1.0), ([1], 1, [2], -1.0), ([... | 4 |


### Mountain Car

In [15]:
%%time
env = envs.make("MountainCar-v0", max_episode_steps=1000)
df_mountaincar = control_and_evaluate(env, [
    {"name": "scale", "args": None},
    {"name": "tiles", "args": {"tiling_dim": 6}},
], alpha=0.01, epsilon=0.2, num_episodes=1000)

Control with Fn Approx - SARSA: {'name': 'scale', 'args': None}
Turn 1
ft-tfx: {'obs_space': Box([-1.2  -0.07], [0.6  0.07], (2,), float32), 'num_actions': 3, 'obs_dim': 2, 'obs_range': array([1.8000001, 0.14     ], dtype=float32)}
Episode 1 ; 1000 steps mean returns: -1000.0
Episode 200 ; 1000 steps mean returns: -1000.0
Episode 400 ; 1000 steps mean returns: -1000.0
Episode 600 ; 1000 steps mean returns: -1000.0
Episode 800 ; 1000 steps mean returns: -1000.0
Episode 1000 ; 1000 steps mean returns: -1000.0
Eval (mean) returns: [min:-1000.0, mean:-1000.0, max:-1000.0]
Turn 2
ft-tfx: {'obs_space': Box([-1.2  -0.07], [0.6  0.07], (2,), float32), 'num_actions': 3, 'obs_dim': 2, 'obs_range': array([1.8000001, 0.14     ], dtype=float32)}
Episode 1 ; 1000 steps mean returns: -1000.0
Episode 200 ; 1000 steps mean returns: -1000.0
Episode 400 ; 1000 steps mean returns: -1000.0
Episode 600 ; 1000 steps mean returns: -1000.0
Episode 800 ; 1000 steps mean returns: -1000.0
Episode 1000 ; 1000 step