# Using the Stable Baselines RL library

## Installation

<u>Correct way of using pip with anaconda</u>
* always create an environment, to not mess up the base environment: `conda create -n <myenv> python=<python version>`
* first try conda install: `conda install <your package>`
* if this doesn't work, try the conda forge channel: `conda install -c conda-forge <your package>`
* if this doesn't work, try pip in the following way:
  * `conda install pip`
  * `<location of anaconda>\anaconda\envs\<your env>\Scripts\pip install <your package>` (to be sure to use the correct pip binary)
* how to check the location of pip:
  * `which pip` (linux)
  * `where pip` (windows cmd)
  * `Get-Command pip` (windows powershell)
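
The same check can also be done from inside Python. The snippet below is a minimal sketch that uses only the standard library; the helper name `describe_python_env` is made up for this example.

```python
import shutil
import sys


def describe_python_env():
    """Return the running interpreter and the first pip found on PATH."""
    return {
        "python": sys.executable,    # interpreter that executes this code
        "pip": shutil.which("pip"),  # None if no pip is on PATH
    }


if __name__ == "__main__":
    for name, path in describe_python_env().items():
        print(f"{name}: {path}")
```

If both paths point into the folder of your conda environment, the `pip` on your PATH belongs to that environment.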

So installing Stable Baselines will be:
* `conda create -n py36 python=3.6`
* `conda activate py36`
* `conda install pip`
* `conda install tensorflow=1.15` (stable-baselines version 2 does not support tensorflow 2, whereas version 3 uses pytorch)
* `<location of anaconda>\anaconda\envs\<your env>\Scripts\pip install stable-baselines` (to be sure to use the correct pip binary)
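
After installing, it can help to confirm which versions actually ended up in the environment. The snippet below is a small sketch, not part of Stable Baselines; `installed_version` is a hypothetical helper and the module names are the ones imported in this notebook.

```python
import importlib


def installed_version(module_name):
    """Return the module's __version__, 'unknown' if it has none,
    or None if the module cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", "unknown")


if __name__ == "__main__":
    for name in ("tensorflow", "stable_baselines", "gym"):
        print(name, installed_version(name))
```

A result of `None` means the package is missing from the active environment, which usually points to a wrong `pip` binary or a forgotten `conda activate`.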

Possible installation problems:
* Error `AttributeError: module 'gym.envs.box2d' has no attribute 'MountainCar'`. The solution is to install the packages `swig`, `pocketsphinx` and `gym[all]`.


## Test Stable Baselines using OpenAI Gym Mountain Car

<img src="mountain-car-v0.gif" alt="drawing" width="400"/>

Goal: drive up the mountain on the right. However, the car's engine is not strong enough to scale the mountain in a single pass. Therefore, the only way to succeed is to drive back and forth to build up momentum.

Let's have a look at the [documentation](https://github.com/openai/gym/blob/master/gym/envs/classic_control/mountain_car.py) to understand the environment.

Note that "car position" is along the x-axis (1-dimensional).

What can we state about the problem?
* To be able to reach the mountain top at the right, the car first needs to swing to the left. This makes it not so obvious what a good reward function would be.

### 1st attempt: DQN
* Let's try DQN.
* Let's try 100 episodes. From the documentation, we understand that an episode takes at most 200 steps. So we need 100 × 200 = 20000 steps.



In [None]:
# mountaincar_dqn_agent

import gym
from stable_baselines import DQN

env = gym.make('MountainCar-v0')

training = True
if training:
    model = DQN("MlpPolicy", env, verbose=1, tensorboard_log="tensorboard_logs/mountaincar_dqn_agent/")
    #model = DQN("MlpPolicy", env, verbose=1, learning_rate=0.0005, gamma=0.99, policy_kwargs=dict(layers=[64, 64]), tensorboard_log="tensorboard_logs/mountaincar_dqn_agent/")
    model.learn(total_timesteps=20000)
    model.save("learned_models/dqn_mountaincar")
else:
    model = DQN.load("learned_models/dqn_mountaincar")
    #model = DQN.load("learned_models/dqn_mountaincar_400000steps_sb2")

print("finished training, now use the trained model and render the env")

n_episodes = 10
for i in range(n_episodes):
    obs = env.reset()
    done = False
    episode_steps = 0
    while not done:
        action, state = model.predict(obs)  # greedy policy
        obs, reward, done, info = env.step(action)
        episode_steps += 1
        env.render()

    result = 'Success!' if episode_steps < 200 else 'Failure!'
    print(result)
env.close()

Training takes about 1 minute and the result is shown below. An unsuccessful episode gives an episode reward of -200, so we must conclude that the agent never managed to reach the top.
  <img src="training_result_1st_attempt_sb2.png" alt="drawing" width="400"/>
  
One option to try is to increase exploration:
* We noted that none of the 100 episodes was successful, so the agent never had anything to learn from. If even a single episode had been successful, there would at least have been something for the agent to learn. So let's increase the amount of exploration. The [documentation](https://stable-baselines.readthedocs.io/en/master/modules/dqn.html) describes several ways to do this. For example, we could change `exploration_fraction` from 0.1 to 0.2. This is the fraction of the total number of training steps over which epsilon is decreased from its maximum to its minimum value.
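
To get a feeling for what `exploration_fraction` does, here is a minimal sketch of an epsilon schedule. It is not the Stable Baselines implementation: the function name `epsilon_at` is made up, the linear shape is an assumption, and the start and end values (1.0 and 0.02) are assumed defaults that you should check against the documentation.

```python
def epsilon_at(step, total_timesteps, exploration_fraction=0.1,
               initial_eps=1.0, final_eps=0.02):
    """Epsilon after `step` training steps, annealed linearly (assumed)."""
    # Number of steps over which epsilon goes from initial_eps to final_eps.
    schedule_steps = int(exploration_fraction * total_timesteps)
    if schedule_steps <= 0:
        return final_eps
    progress = min(step / schedule_steps, 1.0)
    return initial_eps + progress * (final_eps - initial_eps)


if __name__ == "__main__":
    for fraction in (0.1, 0.2):
        values = [round(epsilon_at(s, 20000, fraction), 3)
                  for s in (0, 1000, 2000, 4000)]
        print(f"exploration_fraction={fraction}: {values}")
```

With 20000 training steps, a fraction of 0.1 means epsilon reaches its minimum after 2000 steps, which is only about 10 episodes of 200 steps. Doubling the fraction doubles that window.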


### 2nd attempt: longer learning
* Let's try brute force instead of being clever. We can simply train for a longer time, hoping that the car gets lucky and reaches the top at least once. We try 400k steps.
* Note that in Stable Baselines the time spent in exploration is proportional to the total number of steps. So more steps also means longer exploration.
* Training takes about 10 minutes and the result is shown below. Some success! Quite often the car reaches the top.
  <img src="tensorboard400000_sb2.png" alt="drawing" width="400"/>
* Set `training = False` and let's play a couple of episodes (load the saved model "dqn_mountaincar_400000steps_sb2"). As you can see, brute force worked out quite well! From the TensorBoard graph, we did not expect this.
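
To put a number on "quite often", you could collect the `episode_steps` value of every evaluation episode and compute a success rate. The helper below is a small sketch that uses the same criterion as the evaluation loop (fewer than 200 steps counts as a success); the function name and the example episode lengths are made up for illustration.

```python
def success_rate(episode_lengths, max_episode_steps=200):
    """Fraction of episodes that ended before the step limit."""
    if not episode_lengths:
        raise ValueError("episode_lengths must not be empty")
    successes = sum(1 for n in episode_lengths if n < max_episode_steps)
    return successes / len(episode_lengths)


if __name__ == "__main__":
    # Hypothetical episode lengths, not measured results.
    example_lengths = [150, 200, 200, 120]
    print(success_rate(example_lengths))
```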

### 3rd attempt: tweaking the reward function

Let's try a different approach and tweak the reward function of the Mountain Car environment. Remember that defining subgoals helps learning, but possibly removes the guarantee of optimality.

* make a copy of the [official OpenAI Gym Mountaincar](https://github.com/openai/gym/blob/master/gym/envs/classic_control/mountain_car.py) 
* the max episode length of 200 is enforced by the gym environment wrapper, not by the environment itself. So I've added code for this by hand (see below for the code).
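
To illustrate what such a step limit does, here is a minimal pure-Python sketch. It is not Gym's wrapper code: `StepLimit`, `NeverEndingEnv` and `run_episode` are hypothetical names, and the toy environment only exists to show the cut-off.

```python
class StepLimit:
    """Minimal stand-in for a time-limit wrapper (not Gym's implementation)."""

    def __init__(self, env, max_episode_steps):
        self.env = env
        self.max_episode_steps = max_episode_steps
        self.elapsed_steps = 0

    def reset(self):
        self.elapsed_steps = 0
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.elapsed_steps += 1
        if self.elapsed_steps >= self.max_episode_steps:
            done = True  # cut the episode off at the limit
        return obs, reward, done, info


class NeverEndingEnv:
    """Hypothetical toy environment that never terminates on its own."""

    def reset(self):
        return 0.0

    def step(self, action):
        return 0.0, -1.0, False, {}


def run_episode(env):
    """Step with a fixed action until done; return the number of steps."""
    env.reset()
    steps, done = 0, False
    while not done:
        _, _, done, _ = env.step(0)
        steps += 1
    return steps


if __name__ == "__main__":
    print(run_episode(StepLimit(NeverEndingEnv(), max_episode_steps=200)))
```

In the copied environment the same counter lives inside `reset()` and `step()` instead of in a separate wrapper class.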

**Original reward function**:

`step(self, action):
    ...
    reward = -1.0
    ...`

Note that, looking at this code, there's no reward for reaching the finish. There's only a reward of -1 per step. Can you explain why this reward function works anyway?
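
One possible line of reasoning: with a reward of -1 per step, the episode return simply counts the steps, so an episode that ends early at the finish collects less penalty than one that runs into the step limit. The sketch below just does this arithmetic; `episode_return` is a made-up helper, and the episode length of 120 is a hypothetical example.

```python
def episode_return(n_steps, reward_per_step=-1.0, gamma=1.0):
    """Sum of (discounted) rewards for an episode of n_steps steps."""
    return sum(reward_per_step * gamma ** t for t in range(n_steps))


if __name__ == "__main__":
    print("failed episode (200 steps):", episode_return(200))
    print("hypothetical success after 120 steps:", episode_return(120))
```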

**Modified reward function (1st attempt)**:

To be able to reach the mountain top at the right, the car needs to swing to the left. This makes it not so obvious what a good reward function would be. 

The more the car goes to the right, the better it is, so let's reward it when the car is far to the right.

`step(self, action):
    ...
    reward = -1.0 + position  # the more to the right the higher the reward
    if position >= 0.5:  # bonus if finish is reached
        reward = 1
    ...`

What do you think will happen?

Result:
<img src="dqn_mymountaincar_reward_attempt1_250k_sb2.png" alt="drawing" width="400"/>

Can you explain it?


**Modified reward function (2nd attempt)**:

Instead of continually rewarding the agent for being as far to the right as possible, let's only reward it when it breaks its record for being furthest to the right. It is also important to note that it is not a problem if the car swings very far to the left. The car does not "die".

`step(self, action):
    ...
    reward = -1.0
    if position > self.max_reached_position:  # reward when new maximally right position has been reached
        self.max_reached_position = position
        reward = 5.0
    ...`

<img src="dqn_mymountaincar_reward_attempt2_200k_sb2.png" alt="drawing" width="400"/>

The results are quite good. Some discussion, hypothetically, without proof:
* We were warned to be careful with defining subgoals. This is a subgoal.
* What will give maximum reward?
  * If with every swing to the right, we go just a little further than the previous time, the reward is 5 every time.
  * So actually a slowly learning car will have more reward than a fast learning car, because a fast learning car has quickly reached a high max_reached_position, which will not often be exceeded any more.
  * There's no incentive for the car to reach the finish! The longer it keeps driving, the better the reward!
  * This all has very much to do with the time limit of 200 steps! Slow learning would work perfectly if there were no time limit.
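
The argument above can be made concrete with a small sketch that applies the record-breaking rule to a list of positions. The function `record_rewards` is a made-up helper and both position sequences are hypothetical, hand-picked numbers, not simulated car trajectories.

```python
def record_rewards(positions, start_record=-1.2,
                   step_reward=-1.0, record_reward=5.0):
    """Total reward of the record-breaking rule along a list of positions."""
    record = start_record
    total = 0.0
    for position in positions:
        if position > record:
            record = position       # new furthest-right position
            total += record_reward
        else:
            total += step_reward
    return total


if __name__ == "__main__":
    creeping = [-0.5, -0.4, -0.3, -0.2, -0.1, 0.0]  # new record every step
    jumping = [0.0, -0.1, -0.2, -0.3, -0.2, -0.1]   # one record, then none
    print("creeping:", record_rewards(creeping))
    print("jumping:", record_rewards(jumping))
```

Both sequences end up equally far to the right, but the one that creeps forward collects the bonus on every step, while the one that gets there at once collects it only a single time.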
 
**Modified reward function (3rd and last attempt)**:

Add a finish bonus that exceeds the cumulative `max_reached_position` bonus: 5000.

`step(self, action):
    ...
    reward = -1.0
    if position > self.max_reached_position:  # reward when new maximally right position has been reached
        self.max_reached_position = position
        reward = 5.0
    if position >= 0.5:  # bonus if finish is reached
        reward = 5000.0
    ...`

<img src="dqn_mymountaincar_reward_attempt3_200k_sb2.png" alt="drawing" width="400"/>

Setting `training = False` shows quite okay results. Is this due to the reward function? Or is the reward function actually not very different from the original one, and is it simply because we've trained for 200k steps? Judging by eye, the original reward function even seems slightly better, but it was trained for 400k steps.
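
As a sanity check on the finish bonus of 5000: within one episode, the record bonus can be collected at most once per step, so 200 steps times 5 gives an upper bound of 1000. The sketch below does this arithmetic; the function names are made up for this example.

```python
def max_record_bonus_per_episode(max_episode_steps=200, record_reward=5.0):
    """Upper bound on the record bonus: every step sets a new record."""
    return max_episode_steps * record_reward


def finish_dominates(finish_bonus=5000.0, max_episode_steps=200,
                     record_reward=5.0):
    """True if reaching the finish pays more than any record streak."""
    bound = max_record_bonus_per_episode(max_episode_steps, record_reward)
    return finish_bonus > bound


if __name__ == "__main__":
    print(max_record_bonus_per_episode())
    print(finish_dominates())
```

This bound is loose, because the car cannot actually set a new record on every step, so a smaller finish bonus might work as well.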

In [None]:
"""
http://incompleteideas.net/sutton/MountainCar/MountainCar1.cp
permalink: https://perma.cc/6Z2N-PFWC
"""
import math

import numpy as np

import gym
from gym import spaces
from gym.utils import seeding


class MyMountainCarEnv(gym.Env):
    """
    Description:
        The agent (a car) is started at the bottom of a valley. For any given
        state the agent may choose to accelerate to the left, right or cease
        any acceleration.
    Source:
        The environment appeared first in Andrew Moore's PhD Thesis (1990).
    Observation:
        Type: Box(2)
        Num    Observation               Min            Max
        0      Car Position              -1.2           0.6
        1      Car Velocity              -0.07          0.07
    Actions:
        Type: Discrete(3)
        Num    Action
        0      Accelerate to the Left
        1      Don't accelerate
        2      Accelerate to the Right
        Note: This does not affect the amount of velocity affected by the
        gravitational pull acting on the car.
    Reward:
         Reward of 0 is awarded if the agent reached the flag (position = 0.5)
         on top of the mountain.
         Reward of -1 is awarded if the position of the agent is less than 0.5.
    Starting State:
         The position of the car is assigned a uniform random value in
         [-0.6 , -0.4].
         The starting velocity of the car is always assigned to 0.
    Episode Termination:
         The car position is more than 0.5
         Episode length is greater than 200
    """

    metadata = {
        'render.modes': ['human', 'rgb_array'],
        'video.frames_per_second': 30
    }

    def __init__(self, goal_velocity=0):
        self._max_episode_steps = 200  # Erco
        self.min_position = -1.2
        self.max_position = 0.6
        self.max_speed = 0.07
        self.goal_position = 0.5
        self.goal_velocity = goal_velocity

        self.max_reached_position = self.min_position  # max_reached_position not in reset(), so keeps its value (Erco)

        self.force = 0.001
        self.gravity = 0.0025

        self.low = np.array(
            [self.min_position, -self.max_speed], dtype=np.float32
        )
        self.high = np.array(
            [self.max_position, self.max_speed], dtype=np.float32
        )

        self.viewer = None

        self.action_space = spaces.Discrete(3)
        self.observation_space = spaces.Box(
            self.low, self.high, dtype=np.float32
        )

        self.seed()

    def seed(self, seed=None):
        self.np_random, seed = seeding.np_random(seed)
        return [seed]

    def step(self, action):
        assert self.action_space.contains(action), "%r (%s) invalid" % (action, type(action))

        position, velocity = self.state
        velocity += (action - 1) * self.force + math.cos(3 * position) * (-self.gravity)
        velocity = np.clip(velocity, -self.max_speed, self.max_speed)
        position += velocity
        position = np.clip(position, self.min_position, self.max_position)
        if (position == self.min_position and velocity < 0):
            velocity = 0

        done = bool(
            position >= self.goal_position and velocity >= self.goal_velocity
        )
        # Erco
        # attempt 0: original reward function
        #reward = -1.0  
        #
        # attempt 1:
        #reward = -1.0 + position  # the more to the right the higher the reward
        #if position >= 0.5:  # bonus if finish is reached
        #    reward = 1
        #
        # attempt 2:
        reward = -1.0
        if position > self.max_reached_position:  # reward when new maximally right position has been reached
            self.max_reached_position = position
            reward = 5.0
        #
        # attempt 3:
        #reward = -1.0
        #if position > self.max_reached_position:  # reward when new maximally right position has been reached
        #    self.max_reached_position = position
        #    reward = 5.0
        #if position >= 0.5:  # bonus if finish is reached
        #    reward = 5000.0

            
        self._elapsed_steps += 1  # Erco
        if self._elapsed_steps >= self._max_episode_steps:  # the OpenAI Gym wrapper limits #steps (Erco)
            done = True  # Erco
            

        self.state = (position, velocity)
        return np.array(self.state), reward, done, {}

    def reset(self):
        self._elapsed_steps = 0  # Erco
        self.state = np.array([self.np_random.uniform(low=-0.6, high=-0.4), 0])
        return np.array(self.state)

    def _height(self, xs):
        return np.sin(3 * xs) * .45 + .55

    def render(self, mode='human'):
        screen_width = 600
        screen_height = 400

        world_width = self.max_position - self.min_position
        scale = screen_width / world_width
        carwidth = 40
        carheight = 20

        if self.viewer is None:
            from gym.envs.classic_control import rendering
            self.viewer = rendering.Viewer(screen_width, screen_height)
            xs = np.linspace(self.min_position, self.max_position, 100)
            ys = self._height(xs)
            xys = list(zip((xs - self.min_position) * scale, ys * scale))

            self.track = rendering.make_polyline(xys)
            self.track.set_linewidth(4)
            self.viewer.add_geom(self.track)

            clearance = 10

            l, r, t, b = -carwidth / 2, carwidth / 2, carheight, 0
            car = rendering.FilledPolygon([(l, b), (l, t), (r, t), (r, b)])
            car.add_attr(rendering.Transform(translation=(0, clearance)))
            self.cartrans = rendering.Transform()
            car.add_attr(self.cartrans)
            self.viewer.add_geom(car)
            frontwheel = rendering.make_circle(carheight / 2.5)
            frontwheel.set_color(.5, .5, .5)
            frontwheel.add_attr(
                rendering.Transform(translation=(carwidth / 4, clearance))
            )
            frontwheel.add_attr(self.cartrans)
            self.viewer.add_geom(frontwheel)
            backwheel = rendering.make_circle(carheight / 2.5)
            backwheel.add_attr(
                rendering.Transform(translation=(-carwidth / 4, clearance))
            )
            backwheel.add_attr(self.cartrans)
            backwheel.set_color(.5, .5, .5)
            self.viewer.add_geom(backwheel)
            flagx = (self.goal_position-self.min_position) * scale
            flagy1 = self._height(self.goal_position) * scale
            flagy2 = flagy1 + 50
            flagpole = rendering.Line((flagx, flagy1), (flagx, flagy2))
            self.viewer.add_geom(flagpole)
            flag = rendering.FilledPolygon(
                [(flagx, flagy2), (flagx, flagy2 - 10), (flagx + 25, flagy2 - 5)]
            )
            flag.set_color(.8, .8, 0)
            self.viewer.add_geom(flag)

        pos = self.state[0]
        self.cartrans.set_translation(
            (pos-self.min_position) * scale, self._height(pos) * scale
        )
        self.cartrans.set_rotation(math.cos(3 * pos))

        return self.viewer.render(return_rgb_array=mode == 'rgb_array')

    # manual: python keyboard_agent.py MountainCar-v0, keys 0, 1, 2, ... to control (https://tomroth.com.au/gym-play/)
    # Keyboard agent only supports discrete action spaces
    def get_keys_to_action(self):
        # Control with left and right arrow keys.
        return {(): 1, (276,): 0, (275,): 2, (275, 276): 1}

    def close(self):
        if self.viewer:
            self.viewer.close()
            self.viewer = None

In [None]:
# mymountaincar_dqn_agent

import gym
from stable_baselines import DQN

#env = gym.make('MountainCar-v0')
env = MyMountainCarEnv(0)

training = True
if training:
    model = DQN("MlpPolicy", env, verbose=1, tensorboard_log="tensorboard_logs/mymountaincar_dqn_agent/")
    model.learn(total_timesteps=20000)  # normally at least 200000 steps
    model.save("learned_models/dqn_mymountaincar")
else:
    model = DQN.load("learned_models/dqn_mymountaincar")
    #model = DQN.load("learned_models/dqn_mymountaincar_reward_attempt3_200k_sb2")

print("finished training, now use the trained model and render the env")

n_episodes = 10
for i in range(n_episodes):
    obs = env.reset()
    done = False
    episode_steps = 0
    while not done:
        action, state = model.predict(obs)  # greedy policy
        obs, reward, done, info = env.step(action)
        episode_steps += 1
        env.render()

    result = 'Success!' if episode_steps < 200 else 'Failure!'
    print(result)
env.close()