# Using the Stable Baselines RL library

[Documentation of Stable Baselines](https://stable-baselines3.readthedocs.io/en/master/index.html).

## Installation of Stable Baselines

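* `conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia`
* `conda install gymnasium numpy matplotlib tensorboard pip jupyter notebook -c conda-forge`
* `pip install "stable-baselines3[extra]"` (the `[extra]` syntax is a pip feature, so this package is installed with pip inside the conda environment)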

## Test Stable Baselines using Gymnasium Mountain Car

<img src="mountain-car-v0.gif" alt="drawing" width="400"/>

Goal: drive up the mountain on the right. However, the car's engine is not strong enough to scale the mountain in a single pass. Therefore, the only way to succeed is to drive back and forth to build up momentum.

Let's have a look at the [documentation](https://github.com/openai/gym/blob/master/gym/envs/classic_control/mountain_car.py) to understand the environment.

Note that "car position" is along the x-axis (1-dimensional).

What can we state about the problem?
* To be able to reach the mountain top on the right, the car first needs to swing to the left. This makes it not so obvious what a good reward function would be.

### 1st attempt: DQN
* Let's try DQN.
* Let's train for about 100 episodes. From the documentation, we understand that an episode takes at most 200 steps, so we budget 100 × 200 = 20000 training steps.



In [5]:
# mountaincar_dqn_agent

import gymnasium as gym
from stable_baselines3 import DQN

env = gym.make('MountainCar-v0', render_mode="rgb_array")
env.metadata['render_fps'] = 200

training = False
if training:
    # MountainCar does not give good results with the DQN default settings,
    # so use the tuned settings from SB3 Zoo: rl-baselines3-zoo\hyperparams\dqn.yml
    model = DQN(
        "MlpPolicy",
        env,
        verbose=1,
        device="cpu",
        gamma=0.98,
        learning_rate=0.005,
        buffer_size=10000,
        exploration_fraction=0.2,
        exploration_final_eps=0.07,
        exploration_initial_eps=1.0,
        train_freq=16,
        gradient_steps=8,
        batch_size=128,
        learning_starts=1000,
        target_update_interval=600,
        _init_setup_model=True,
        policy_kwargs=dict(net_arch=[256, 256]),
        tensorboard_log="tensorboard_logs/mountaincar_dqn_agent/",
    )
    model.learn(total_timesteps=20000)
    model.save("learned_models/dqn_mountaincar")
else:
    # model = DQN.load("learned_models/dqn_mountaincar")
    model = DQN.load("learned_models/dqn_mountaincar_400000steps")

print("finished training, now use the trained model and render the env")

env.unwrapped.render_mode = "human"  # gym.make() wraps the env several times; .unwrapped reaches the base env
n_episodes = 10
for i in range(n_episodes):
    obs, info = env.reset()
    done = False
    episode_steps = 0
    while not done:
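        # predict() defaults to deterministic=False, so DQN still takes a random action with a
        # small probability (its exploration rate); pass deterministic=True for a purely greedy policy
        action, state = model.predict(obs)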
        obs, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        episode_steps += 1
        env.render()

    result = 'Success!' if terminated else 'Failure!'  # terminated is only True when the goal was reached
    print(result)
env.close()

finished training, now use the trained model and render the env
Success!
Success!
Success!
Success!
Success!
Success!
Success!
Failure!
Failure!
Failure!


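Training for 20000 steps takes about 1 minute and the result is shown below. An unsuccessful episode gives an episode reward of -200 (200 steps at -1 each), so we must conclude that the agent never reached the goal during training. Note that the cell output above comes from a run with `training = False`, which loads the 400000-step model of the 2nd attempt.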
  <img src="training_result_1st_attempt.png" alt="drawing" width="250"/>
  <img src="dqn_mountaincar_20000steps.png" alt="drawing" width="1200"/>
  
One option to try is to increase exploration:
* We noted that none of the 100 episodes was successful, so the agent had no opportunity to learn. If even a single episode had been successful, there would at least have been something for the agent to learn from, so let's increase the amount of exploration. The [documentation](https://stable-baselines3.readthedocs.io/en/master/index.html) describes several ways to do this. For example, we could change `exploration_fraction=0.2` to `exploration_fraction=0.3`. This is the fraction of the total number of training steps over which epsilon is decreased from its initial value to its final value.
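
To make the effect of `exploration_fraction` concrete, the linear schedule described above can be sketched in plain Python. This is an illustrative re-implementation under the stated description, not SB3's own code; the function name `epsilon_at` is hypothetical and the default values are the ones used in the code cell above.

```python
def epsilon_at(step, total_timesteps=20_000, exploration_fraction=0.2,
               initial_eps=1.0, final_eps=0.07):
    """Linearly anneal epsilon over the first `exploration_fraction` of training."""
    anneal_steps = exploration_fraction * total_timesteps
    if step >= anneal_steps:
        return final_eps
    progress = step / anneal_steps  # 0.0 at the start, 1.0 when annealing ends
    return initial_eps + progress * (final_eps - initial_eps)


if __name__ == "__main__":
    for fraction in (0.2, 0.3):
        # Length of the annealing phase and epsilon after 2000 training steps
        print(fraction, round(fraction * 20_000),
              epsilon_at(2_000, exploration_fraction=fraction))
```

With a larger fraction, epsilon is still higher at the same training step, so the agent keeps taking random actions for longer.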


### 2nd attempt: longer learning
* Let's try not to be clever. We can also simply train for a longer time, hoping that the car gets lucky and reaches the top at least once. We try 400k steps.
* Note that in Stable Baselines the time spent in exploration is proportional to the total number of training steps (through `exploration_fraction`). So more steps means longer exploration.
* It takes 10 minutes and the result is shown below. Some success! Quite often the car reaches the top.
  <img src="dqn_mountaincar_400000steps.png" alt="drawing" width="1200"/>
* Set `training = False` and play a couple of episodes (this loads the saved model "dqn_mountaincar_400000steps"). As you can see, brute force worked out quite well!
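
The proportionality mentioned above is simple arithmetic. The sketch below assumes the `exploration_fraction=0.2` setting from the code cell and a worst case of 200 steps per episode; the helper name `exploration_budget` is made up for illustration.

```python
def exploration_budget(total_timesteps, exploration_fraction=0.2, max_episode_steps=200):
    """Return (annealing steps, minimum number of episodes those steps span)."""
    anneal_steps = round(exploration_fraction * total_timesteps)
    # Episodes last at most max_episode_steps, so this is a lower bound on episodes
    min_episodes = anneal_steps // max_episode_steps
    return anneal_steps, min_episodes


for total in (20_000, 400_000):
    steps, episodes = exploration_budget(total)
    print(f"{total} training steps: epsilon is annealed over {steps} steps "
          f"(at least {episodes} episodes)")
```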

### 3rd attempt: tweaking the reward function

Let's try a different approach and try to tweak the reward function of the mountaincar environment. Remember that defining subgoals helps learning, but possibly removes the guarantee of optimality.

* Make a copy of the source code of the official Gymnasium MountainCar environment, which you can find in `...\envs\<your env>\Lib\site-packages\gymnasium\envs\classic_control\mountain_car.py`.
* The maximum episode length of 200 is enforced by the Gymnasium environment wrapper, not by the environment itself, so I've added code for this by hand (see below for the code).
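
The hand-written step limit boils down to a counter that is reset at the start of every episode. The sketch below isolates that logic in a tiny stand-alone class; `StepLimit` is a hypothetical name used only for illustration, whereas the full environment code further down keeps the counter directly on the environment.

```python
class StepLimit:
    """Count steps within an episode and report when the limit is reached."""

    def __init__(self, max_episode_steps=200):
        self.max_episode_steps = max_episode_steps
        self.elapsed_steps = 0

    def reset(self):
        # Called at the start of every episode
        self.elapsed_steps = 0

    def step(self):
        # Called once per environment step; returns the `truncated` flag
        self.elapsed_steps += 1
        return self.elapsed_steps >= self.max_episode_steps


limit = StepLimit(max_episode_steps=200)
limit.reset()
flags = [limit.step() for _ in range(200)]
print(flags.count(True), flags[-1])
```

As an alternative to counting by hand, Gymnasium also provides a `TimeLimit` wrapper in `gymnasium.wrappers` that can be applied to a custom environment.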

**Original reward function**:

`step(self, action):
    ...
    reward = -1.0
    ...`

Note that, looking at this code, there's no reward for reaching the finish. There's only a reward of -1 per step. Can you explain why this reward function works anyway?
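
A small calculation may help when thinking about this question. The sketch below sums the per-step reward of -1 over an episode, optionally discounted with the `gamma=0.98` used for DQN above. The function name `episode_return` and the example episode length of 120 steps are assumptions made for illustration.

```python
def episode_return(n_steps, gamma=1.0, step_reward=-1.0):
    """Sum of (discounted) rewards for an episode that lasts n_steps steps."""
    return sum(step_reward * gamma**t for t in range(n_steps))


# 200 is the time limit; 120 is a hypothetical successful episode
for n_steps in (120, 200):
    undiscounted = episode_return(n_steps)
    discounted = episode_return(n_steps, gamma=0.98)
    print(n_steps, undiscounted, round(discounted, 2))
```

Comparing the two rows shows how the episode length enters the return.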

**Modified reward function (1st attempt)**:

To be able to reach the mountain top at the right, the car needs to swing to the left. This makes it not so obvious what a good reward function would be. 

The more the car manages to go to the right, the better it is, so let's reward it when the car is far to the right.

`step(self, action):
    ...
    reward = -1.0 + position  # the more to the right the higher the reward
    if position >= 0.5:  # bonus if finish is reached
        reward = 1
    ...`
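
To get a feeling for the numbers before answering, the modified reward can be evaluated at a few positions. The sketch below copies the reward rule from the snippet above into a stand-alone function; the name `shaped_reward_v1` and the sample positions are chosen for illustration only.

```python
def shaped_reward_v1(position, goal_position=0.5):
    """Reward of the 1st modification: -1 + position, or +1 at the finish."""
    if position >= goal_position:  # bonus if finish is reached
        return 1.0
    return -1.0 + position  # the more to the right, the higher the reward


# Left wall, roughly the bottom of the valley, about halfway up the right hill, finish
for position in (-1.2, -0.5, 0.0, 0.5):
    print(f"position {position:+.1f} -> reward {shaped_reward_v1(position):+.1f}")
```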

What do you think will happen?

Result:
<img src="dqn_mymountaincar_reward_attempt1_250k_sb2.png" alt="drawing" width="400"/>

Can you explain it?


**Modified reward function (2nd attempt)**:

Instead of continually rewarding the agent for being as far to the right as possible, let's only reward it when it breaks its record for being furthest to the right. It is also important to note that it is not a problem if the car swings very far to the left: the car does not "die".

`step(self, action):
    ...
    reward = -1.0
    if position > self.max_reached_position:  # reward when new maximally right position has been reached
        self.max_reached_position = position
        reward = 5.0
    ...`

<img src="dqn_mymountaincar_reward_attempt2_200k_sb2.png" alt="drawing" width="400"/>

The results are quite good. Some discussion, hypothetically, without proof:
* We were warned to be careful with defining subgoals. This is a subgoal.
* What will give maximum reward?
  * If with every swing to the right we go just a little further than the previous time, the reward is 5 every time.
  * So a slowly learning car will actually collect more reward than a fast learning car, because a fast learning car quickly reaches a high `max_reached_position`, which will then rarely be exceeded.
  * There's no incentive for the car to reach the finish! The longer it keeps driving, the better the reward!
  * This all has very much to do with the time limit of 200 steps! Slow learning would work perfectly if there were no time limit.
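
The argument about slow versus fast progress can be checked on toy data. The sketch below applies the record-breaking reward rule to two hand-made position traces; the function `record_breaking_return` and both traces are invented for illustration and do not come from a real training run.

```python
def record_breaking_return(positions, max_reached_position=-1.2):
    """Total reward for a trace: +5 for each new rightmost record, -1 otherwise."""
    total = 0.0
    for position in positions:
        if position > max_reached_position:
            max_reached_position = position
            total += 5.0
        else:
            total += -1.0
    return total


slow_trace = [-0.5, -0.4, -0.3, -0.2]  # creeps a little further right each step
fast_trace = [0.4, 0.3, 0.2, 0.1]      # sets a high record at once, then falls back
print(record_breaking_return(slow_trace), record_breaking_return(fast_trace))
```

In this toy case the trace that improves in small increments collects more reward than the one that sets a high record immediately.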
 
**Modified reward function (3rd and last attempt)**:

Add a finish bonus that exceeds the cumulative `max_reached_position` bonus: 5000.

`step(self, action):
    ...
    reward = -1.0
    if position > self.max_reached_position:  # reward when new maximally right position has been reached
        self.max_reached_position = position
        reward = 5.0
    if position >= 0.5:  # bonus if finish is reached
        reward = 5000.0
    ...`
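
That 5000 is large enough follows from a simple upper bound: within one episode the record bonus can be earned at most once per step. A sketch of that arithmetic, using the time limit of 200 steps and the bonus values from the snippet above (the constant names are illustrative):

```python
MAX_EPISODE_STEPS = 200
RECORD_BONUS = 5.0
FINISH_BONUS = 5000.0

# Upper bound: a new record on every single step of a full-length episode
max_cumulative_record_bonus = MAX_EPISODE_STEPS * RECORD_BONUS
print(max_cumulative_record_bonus, FINISH_BONUS > max_cumulative_record_bonus)
```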

<img src="dqn_mymountaincar_reward_attempt3_200k_sb2.png" alt="drawing" width="400"/>

Setting `training = False` shows quite okay results. Is this due to the reward function, or is the reward function actually not very different from the original one, so that the result is simply due to training for 200k steps? Judging by eye, the original reward function even seems slightly better, but it was trained for 400k steps.

In [2]:
"""
http://incompleteideas.net/MountainCar/MountainCar1.cp
permalink: https://perma.cc/6Z2N-PFWC
"""
import math
from typing import Optional

import numpy as np

import gymnasium as gym
from gymnasium import spaces
from gymnasium.envs.classic_control import utils
from gymnasium.error import DependencyNotInstalled


class MyMountainCarEnv(gym.Env):
    """
    ## Description

    The Mountain Car MDP is a deterministic MDP that consists of a car placed stochastically
    at the bottom of a sinusoidal valley, with the only possible actions being the accelerations
    that can be applied to the car in either direction. The goal of the MDP is to strategically
    accelerate the car to reach the goal state on top of the right hill. There are two versions
    of the mountain car domain in gymnasium: one with discrete actions and one with continuous.
    This version is the one with discrete actions.

    This MDP first appeared in [Andrew Moore's PhD Thesis (1990)](https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-209.pdf)

    ```
    @TECHREPORT{Moore90efficientmemory-based,
        author = {Andrew William Moore},
        title = {Efficient Memory-based Learning for Robot Control},
        institution = {University of Cambridge},
        year = {1990}
    }
    ```

    ## Observation Space

    The observation is a `ndarray` with shape `(2,)` where the elements correspond to the following:

    | Num | Observation                          | Min   | Max  | Unit         |
    |-----|--------------------------------------|-------|------|--------------|
    | 0   | position of the car along the x-axis | -1.2  | 0.6  | position (m) |
    | 1   | velocity of the car                  | -0.07 | 0.07 | velocity (v) |

    ## Action Space

    There are 3 discrete deterministic actions:

    - 0: Accelerate to the left
    - 1: Don't accelerate
    - 2: Accelerate to the right

    ## Transition Dynamics:

    Given an action, the mountain car follows the following transition dynamics:

    *velocity<sub>t+1</sub> = velocity<sub>t</sub> + (action - 1) * force - cos(3 * position<sub>t</sub>) * gravity*

    *position<sub>t+1</sub> = position<sub>t</sub> + velocity<sub>t+1</sub>*

    where force = 0.001 and gravity = 0.0025. The collisions at either end are inelastic with the velocity set to 0
    upon collision with the wall. The position is clipped to the range `[-1.2, 0.6]` and
    velocity is clipped to the range `[-0.07, 0.07]`.

    ## Reward:

    The goal is to reach the flag placed on top of the right hill as quickly as possible, as such the agent is
    penalised with a reward of -1 for each timestep.

    ## Starting State

    The position of the car is assigned a uniform random value in *[-0.6 , -0.4]*.
    The starting velocity of the car is always assigned to 0.

    ## Episode End

    The episode ends if either of the following happens:
    1. Termination: The position of the car is greater than or equal to 0.5 (the goal position on top of the right hill)
    2. Truncation: The length of the episode is 200.


    ## Arguments

    ```python
    import gymnasium as gym
    gym.make('MountainCar-v0')
    ```

    On reset, the `options` parameter allows the user to change the bounds used to determine
    the new random state.

    ## Version History

    * v0: Initial versions release (1.0.0)
    """

    metadata = {
        "render_modes": ["human", "rgb_array"],
        "render_fps": 200, # Erco
    }

    def __init__(self, render_mode: Optional[str] = None, goal_velocity=0):
        self._max_episode_steps = 200  # Erco
        self.min_position = -1.2
        self.max_position = 0.6
        self.max_speed = 0.07
        self.goal_position = 0.5
        self.goal_velocity = goal_velocity
        
        self.max_reached_position = self.min_position  # max_reached_position not in reset(), so keep its value (Erco)
        
        self.force = 0.001
        self.gravity = 0.0025

        self.low = np.array([self.min_position, -self.max_speed], dtype=np.float32)
        self.high = np.array([self.max_position, self.max_speed], dtype=np.float32)

        self.render_mode = render_mode

        self.screen_width = 600
        self.screen_height = 400
        self.screen = None
        self.clock = None
        self.isopen = True

        self.action_space = spaces.Discrete(3)
        self.observation_space = spaces.Box(self.low, self.high, dtype=np.float32)

    def step(self, action: int):
        assert self.action_space.contains(
            action
        ), f"{action!r} ({type(action)}) invalid"

        position, velocity = self.state
        velocity += (action - 1) * self.force + math.cos(3 * position) * (-self.gravity)
        velocity = np.clip(velocity, -self.max_speed, self.max_speed)
        position += velocity
        position = np.clip(position, self.min_position, self.max_position)
        if position == self.min_position and velocity < 0:
            velocity = 0

        terminated = bool(
            position >= self.goal_position and velocity >= self.goal_velocity
        )
        # Reward shaping (Erco)
        # attempt 0: original reward function
        reward = -1.0  
        #
        # attempt 1:
        #reward = -1.0 + position  # the more to the right the higher the reward
        #if position >= 0.5:  # bonus if finish is reached
        #    reward = 1
        #
        # attempt 2:
        #reward = -1.0
        #if position > self.max_reached_position:  # reward when new maximally right position has been reached
        #    self.max_reached_position = position
        #    reward = 5.0
        #
        # attempt 3:
        #reward = -1.0
        #if position > self.max_reached_position:  # reward when new maximally right position has been reached
        #    self.max_reached_position = position
        #    reward = 5.0
        #if position >= 0.5:  # bonus if finish is reached
        #    reward = 5000.0

        self._elapsed_steps += 1  # Erco
        truncated = bool(self._elapsed_steps >= self._max_episode_steps)  # normally done by the Gymnasium wrapper (Erco)

        self.state = (position, velocity)
        if self.render_mode == "human":
            self.render()
        return np.array(self.state, dtype=np.float32), reward, terminated, truncated, {}

    def reset(
        self,
        *,
        seed: Optional[int] = None,
        options: Optional[dict] = None,
    ):
        super().reset(seed=seed)
        
        self._elapsed_steps = 0  # Erco
        
        # Note that if you use custom reset bounds, it may lead to out-of-bound
        # state/observations.
        low, high = utils.maybe_parse_reset_bounds(options, -0.6, -0.4)
        self.state = np.array([self.np_random.uniform(low=low, high=high), 0])

        if self.render_mode == "human":
            self.render()
        return np.array(self.state, dtype=np.float32), {}

    def _height(self, xs):
        return np.sin(3 * xs) * 0.45 + 0.55

    def render(self):
        if self.render_mode is None:
            assert self.spec is not None
            gym.logger.warn(
                "You are calling render method without specifying any render mode. "
                "You can specify the render_mode at initialization, "
                f'e.g. gym.make("{self.spec.id}", render_mode="rgb_array")'
            )
            return

        try:
            import pygame
            from pygame import gfxdraw
        except ImportError as e:
            raise DependencyNotInstalled(
                "pygame is not installed, run `pip install gymnasium[classic-control]`"
            ) from e

        if self.screen is None:
            pygame.init()
            if self.render_mode == "human":
                pygame.display.init()
                self.screen = pygame.display.set_mode(
                    (self.screen_width, self.screen_height)
                )
            else:  # mode in "rgb_array"
                self.screen = pygame.Surface((self.screen_width, self.screen_height))
        if self.clock is None:
            self.clock = pygame.time.Clock()

        world_width = self.max_position - self.min_position
        scale = self.screen_width / world_width
        carwidth = 40
        carheight = 20

        self.surf = pygame.Surface((self.screen_width, self.screen_height))
        self.surf.fill((255, 255, 255))

        pos = self.state[0]

        xs = np.linspace(self.min_position, self.max_position, 100)
        ys = self._height(xs)
        xys = list(zip((xs - self.min_position) * scale, ys * scale))

        pygame.draw.aalines(self.surf, points=xys, closed=False, color=(0, 0, 0))

        clearance = 10

        l, r, t, b = -carwidth / 2, carwidth / 2, carheight, 0
        coords = []
        for c in [(l, b), (l, t), (r, t), (r, b)]:
            c = pygame.math.Vector2(c).rotate_rad(math.cos(3 * pos))
            coords.append(
                (
                    c[0] + (pos - self.min_position) * scale,
                    c[1] + clearance + self._height(pos) * scale,
                )
            )

        gfxdraw.aapolygon(self.surf, coords, (0, 0, 0))
        gfxdraw.filled_polygon(self.surf, coords, (0, 0, 0))

        for c in [(carwidth / 4, 0), (-carwidth / 4, 0)]:
            c = pygame.math.Vector2(c).rotate_rad(math.cos(3 * pos))
            wheel = (
                int(c[0] + (pos - self.min_position) * scale),
                int(c[1] + clearance + self._height(pos) * scale),
            )

            gfxdraw.aacircle(
                self.surf, wheel[0], wheel[1], int(carheight / 2.5), (128, 128, 128)
            )
            gfxdraw.filled_circle(
                self.surf, wheel[0], wheel[1], int(carheight / 2.5), (128, 128, 128)
            )

        flagx = int((self.goal_position - self.min_position) * scale)
        flagy1 = int(self._height(self.goal_position) * scale)
        flagy2 = flagy1 + 50
        gfxdraw.vline(self.surf, flagx, flagy1, flagy2, (0, 0, 0))

        gfxdraw.aapolygon(
            self.surf,
            [(flagx, flagy2), (flagx, flagy2 - 10), (flagx + 25, flagy2 - 5)],
            (204, 204, 0),
        )
        gfxdraw.filled_polygon(
            self.surf,
            [(flagx, flagy2), (flagx, flagy2 - 10), (flagx + 25, flagy2 - 5)],
            (204, 204, 0),
        )

        self.surf = pygame.transform.flip(self.surf, False, True)
        self.screen.blit(self.surf, (0, 0))
        if self.render_mode == "human":
            pygame.event.pump()
            self.clock.tick(self.metadata["render_fps"])
            pygame.display.flip()

        elif self.render_mode == "rgb_array":
            return np.transpose(
                np.array(pygame.surfarray.pixels3d(self.screen)), axes=(1, 0, 2)
            )

    def get_keys_to_action(self):
        # Control with left and right arrow keys.
        return {(): 1, (276,): 0, (275,): 2, (275, 276): 1}

    def close(self):
        if self.screen is not None:
            import pygame

            pygame.display.quit()
            pygame.quit()
            self.isopen = False

In [3]:
# my_mountaincar_dqn_agent

import gymnasium as gym
from stable_baselines3 import DQN

#env = gym.make('MountainCar-v0', render_mode="rgb_array")
env = MyMountainCarEnv(render_mode="rgb_array")  # the first positional argument is render_mode, so pass it by keyword

training = True
if training:
    # MountainCar does not give good results with the DQN default settings,
    # so use the tuned settings from SB3 Zoo: rl-baselines3-zoo\hyperparams\dqn.yml
    model = DQN(
        "MlpPolicy",
        env,
        device="cpu",
        verbose=0,
        gamma=0.98,
        learning_rate=0.005,
        buffer_size=10000,
        exploration_fraction=0.2,
        exploration_final_eps=0.07,
        exploration_initial_eps=1.0,
        train_freq=16,
        gradient_steps=8,
        batch_size=128,
        learning_starts=1000,
        target_update_interval=600,
        _init_setup_model=True,
        policy_kwargs=dict(net_arch=[256, 256]),
        tensorboard_log="tensorboard_logs/mymountaincar_dqn_agent/",
    )
    model.learn(total_timesteps=20000)
    model.save("learned_models/dqn_my_mountaincar")
else:
    model = DQN.load("learned_models/dqn_my_mountaincar")

print("finished training, now use the trained model and render the env")

env.render_mode = "human"
n_episodes = 10
for i in range(n_episodes):
    obs, info = env.reset()
    done = False
    episode_steps = 0
    while not done:
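        # predict() defaults to deterministic=False, so DQN still takes a random action with a
        # small probability (its exploration rate); pass deterministic=True for a purely greedy policy
        action, state = model.predict(obs)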
        obs, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        episode_steps += 1
        env.render()

    result = 'Success!' if terminated else 'Failure!'  # terminated is only True when the goal was reached
    print(result)
env.close()

finished training, now use the trained model and render the env
Failure!
Failure!
Failure!
Failure!
Failure!
Failure!
Failure!
Failure!
Failure!
Failure!
