# IBM Think Digital Summit 2020 Code Cafe RL experience

Presented by [Nextgrid](https://nextgrid.ai)

<img align="left" width="100%" height="auto" src="https://nextgrid.ai/wp-content/uploads/2020/08/ibm-think-2020.jpg"/>

<div style="max-width:700px;">

## Reinforcement Learning Hands-on

Reinforcement learning is a machine learning technique that learns how to maximize a reward by taking actions. A dog might try to learn how to maximize belly rubs through its barking, or a cat might try to learn how to maximize being annoying through its jumping. Both these animals are **AGENTS** taking **ACTIONS** based on their current **STATE**, trying to maximize the **REWARD**.  
### Goal = maximize reward    
    
The goal of the **AGENT** is to **maximize its total reward**. It does this by adding the maximum reward attainable from future states to the reward for achieving its current state, effectively influencing the current action by the potential future reward. This potential reward is a weighted sum of the expected values of the rewards of all future steps starting from the current state.
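
The weighted sum described above can be sketched in a few lines of Python. This is only an illustration: the function name `discounted_return` and the `gamma` discount weight are assumptions chosen for the example, and the reward values are made up.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards, each weighted by gamma ** (number of steps into the future)."""
    total = 0.0
    # Walk backwards so every step folds in the already-discounted future.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Three steps of +1 reward with a weight of 0.5 per step: 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
```

A `gamma` close to 1 makes the agent value distant rewards almost as much as immediate ones, while a small `gamma` makes it short-sighted.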

<img align="left" width="500" height="auto" src="https://nextgrid.ai/wp-content/uploads/2020/08/RL.png"></img>

<div style="clear:left;"></div>

<br/>

---

## Today's hands-on task - Gym, CartPole v1
---

Today we will be using a **Reinforcement Learning algorithm** on **IBM Watson Studio** to **train** an **agent** to **efficiently** balance a pole on a cart.





<img align="left"  src="https://nextgrid.ai/wp-content/uploads/2020/09/cartpole.gif"/><div style="clear:left;"/>





## CartPole Environment


A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. 

The **pendulum starts upright, and the goal is to prevent it from falling over**. 
A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 12 degrees from vertical, or the cart moves more than 2.4 units from the center.
<br/>

### Observation

Type: Box(4)  

<img align="left" width="400" src="https://nextgrid.ai/wp-content/uploads/2020/09/Screenshot-2020-09-10-at-10.28.17-e1599728746594.png"/><div style="clear:left;"/>
<br/>

### Actions

Type: Discrete(2)

The action is a single integer, 0 or 1: `0` pushes the cart to the left and `1` pushes the cart to the right.

<img align="left" width="400" src="https://nextgrid.ai/wp-content/uploads/2020/09/Screenshot-2020-09-10-at-10.28.22.png"/><div style="clear:left;"/>
<br/>   
   
   
   

### Reward

A reward of +1 is provided for every timestep that the pole remains upright.

### Episode Termination
1. Pole Angle is more than ±12°
2. Cart Position is more than ±2.4 (center of the cart reaches the edge of the display)
3. Episode length is greater than 500
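
The three conditions listed above can be written as a small check. This is a simplified sketch, not the environment's own code: the function name `episode_is_over` and its parameters are hypothetical, and the angle is taken in degrees for readability.

```python
def episode_is_over(cart_position, pole_angle_deg, episode_length,
                    max_angle_deg=12.0, max_position=2.4, max_length=500):
    """Return True if any of the three termination conditions holds."""
    pole_fell = abs(pole_angle_deg) > max_angle_deg      # condition 1
    cart_left_track = abs(cart_position) > max_position  # condition 2
    out_of_time = episode_length > max_length            # condition 3
    return pole_fell or cart_left_track or out_of_time


print(episode_is_over(cart_position=0.3, pole_angle_deg=4.0, episode_length=120))   # False
print(episode_is_over(cart_position=0.3, pole_angle_deg=12.5, episode_length=120))  # True
```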

### Solved Requirements
Considered solved when the average reward is greater than or equal to 475.0 over 100 consecutive trials (195.0 is the threshold for the older `CartPole-v0`).
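
A "solved" check of this kind can be sketched as a sliding-window average over the episode rewards. The helper name `is_solved` is hypothetical, and the threshold and window are passed in as parameters; the numbers in the usage line are toy values chosen so the arithmetic is easy to follow.

```python
def is_solved(episode_rewards, threshold, window=100):
    """True if any `window` consecutive episodes average at least `threshold`."""
    if len(episode_rewards) < window:
        return False
    # Sum of the first window, then slide it one episode at a time.
    window_sum = sum(episode_rewards[:window])
    if window_sum / window >= threshold:
        return True
    for i in range(window, len(episode_rewards)):
        window_sum += episode_rewards[i] - episode_rewards[i - window]
        if window_sum / window >= threshold:
            return True
    return False


# Toy example with a window of 3: the last three rewards average about 10.67
print(is_solved([5, 12, 9, 11], threshold=10, window=3))  # True
```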


# Libraries
---

For this exercise we will be using the following libraries and environments:

**[OpenAI Gym](https://gym.openai.com/)**  
Toolkit for developing and comparing reinforcement learning algorithms. It supports teaching agents everything from walking to playing games like Pong or Pinball.

**[Box2D](https://box2d.com)**  
A 2D Physics Engine

**[Stable-baselines3](https://stable-baselines3.readthedocs.io/en/master/)**  
Set of improved implementations of reinforcement learning algorithms based on OpenAI [Baselines](https://github.com/openai/baselines/).


## Stable Baselines


<img align="left" width="200" src="https://github.com/hill-a/stable-baselines/raw/master/docs//_static/img/logo.png"/><div style="clear:left;"/>
Stable Baselines is a set of improved implementations of reinforcement learning algorithms based on OpenAI Baselines. Stable Baselines3, which we are using today, is built on [PyTorch](https://pytorch.org/).

<img align="left" width="500" src="https://nextgrid.ai/wp-content/uploads/2020/09/Screenshot-2020-09-10-at-12.34.52-e1599734718305.png"/><div style="clear:left;"/>



<br/>


# RL Algorithm - DQN (Deep Q-Network)
---
In deep Q-learning, we use a **neural network** to approximate the Q-value function. The state is given as the input and the Q-value of all possible actions is generated as the output. The comparison between Q-learning & deep Q-learning is illustrated below:
<img align="left" width="1000" src="https://nextgrid.ai/wp-content/uploads/2020/09/Screenshot-2020-09-10-at-13.05.08.png"/><div style="clear:left;"/>
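
To make the idea of "one Q-value per action" concrete, here is a minimal sketch of the two things an agent typically does with those outputs: pick an action (epsilon-greedy) and build the learning target for one transition. It illustrates the general Q-learning technique, not the internals of Stable Baselines3; the function names and the example Q-values are assumptions.

```python
import random


def select_action(q_values, epsilon, rng=random):
    """Epsilon-greedy: a random action with probability epsilon, else the best one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda action: q_values[action])


def td_target(reward, next_q_values, gamma=0.99, done=False):
    """Value the predicted Q-value is pushed towards for one transition."""
    if done:
        return reward  # no future reward after the episode ends
    return reward + gamma * max(next_q_values)


q_values = [0.2, 0.7]  # hypothetical network output for actions 0 and 1
print(select_action(q_values, epsilon=0.0))   # 1
print(td_target(1.0, [0.5, 2.0], gamma=0.5))  # 2.0
```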



---
</div>

## Instructions


### How to run cells

- Option 1: Select the cell and click Run in the menu
- Option 2: Shift + Enter

<img align="left" width="1000" src="https://nextgrid.ai/wp-content/uploads/2020/09/ezgif.com-crop.gif"/><div style="clear:left;"/>

## Install Python libraries

Start by installing the required packages using the Python package manager `pip`.

**We add the following packages**
```
stable-baselines3 box2d box2d-kengz
```

In [None]:
!pip install stable-baselines3 box2d box2d-kengz pyglet==1.5.0 cloudpickle==1.2.0

# Import Python packages
Import the Python packages that we will use by adding them to the code cell below.

#### Todo
- Add the `stable_baselines3` packages to the cell

```
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.results_plotter import load_results, ts2xy
from stable_baselines3.common.noise import NormalActionNoise
from stable_baselines3.common.callbacks import BaseCallback
```

In [None]:
import os
import gym
import imageio
import numpy as np
import matplotlib.pyplot as plt
import base64
import IPython
import PIL.Image


# Import stable_baselines packages
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.results_plotter import load_results, ts2xy
from stable_baselines3.common.noise import NormalActionNoise
from stable_baselines3.common.callbacks import BaseCallback

# Notebook values and folders
Set the values and folders that will be used in the notebook.

#### Todo

- Add `CartPole-v1` in ```env_id = ''``` 

In [None]:
# Environment
env_id = 'CartPole-v1'   

# Set and make log folder
log_dir = 'log'            
os.makedirs(log_dir, exist_ok=True)    

# Simple example
Let's start with a simple example where we configure the environment and train our model. We will also evaluate the model's score before and after training.
1. Import RL algorithm & policy
2. Configure environment & hyperparameters
3. Evaluate before training
4. Train model
5. Evaluate after training
6. Plotting
7. Save model
8. Delete model

## 1. Import RL algorithm & policy 
From `stable_baselines3` we import the `DQN` algorithm and `MlpPolicy`  
`MlpPolicy` = Policy class for DQN that uses an MLP (2 layers of 64) as its Q-network

In [None]:
from stable_baselines3 import DQN
from stable_baselines3.dqn import MlpPolicy


## 2. Configure environment
- Configure our `env` using the values stated previously. 
- Instantiate the agent in the model

In [None]:
# Create environment
env = gym.make(env_id)
env = Monitor(env, log_dir)


# Instantiate the agent
model = DQN(MlpPolicy, env, verbose=1)
# model.set_env(env)

## 3. Evaluate before training model
Before running model training we evaluate it by running 100 episodes and returning the score for each one. 

In [None]:
# import evaluation helper from stable_baselines
from stable_baselines3.common.evaluation import evaluate_policy

# evaluate the model by running it for 100 episodes (`n_eval_episodes=100`) and then print the score for each round (episode)
evals = evaluate_policy(model, env, n_eval_episodes=100, deterministic=True, render=False, callback=None, reward_threshold=None, return_episode_rewards=True)

# print result
print(evals[0])

## 4. Train model
Train our model for 75000 steps using default settings

In [None]:
# run model training
model.learn(total_timesteps=75000, log_interval=100)

## 5. Evaluate after training model
Let's evaluate the model again after training, this time over 10 episodes.  
Did the score improve?

In [None]:
evals = evaluate_policy(model, env, n_eval_episodes=10, deterministic=True, render=False, callback=None, reward_threshold=None, return_episode_rewards=True)
print(evals[0])
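
A raw list of episode scores can be hard to compare between runs, so it may help to summarize it. The sketch below uses only the Python standard library; `summarize` is a hypothetical helper, and the reward values are placeholders standing in for `evals[0]`, not real results.

```python
from statistics import mean, stdev


def summarize(episode_rewards):
    """Return (mean, standard deviation, minimum, maximum) of the episode rewards."""
    spread = stdev(episode_rewards) if len(episode_rewards) > 1 else 0.0
    return mean(episode_rewards), spread, min(episode_rewards), max(episode_rewards)


# Placeholder rewards; in the notebook you would pass evals[0] instead.
example_rewards = [120.0, 150.0, 90.0, 200.0]
avg, spread, worst, best = summarize(example_rewards)
print(f"mean={avg:.1f} std={spread:.1f} min={worst:.0f} max={best:.0f}")
```

Comparing the mean before and after training gives a quick answer to "did the score improve?", while the standard deviation shows how consistent the agent is.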

## 6. Plotting
We visualize the training results by plotting them.

In [None]:
# Import results plotter
from stable_baselines3.common import results_plotter

# Plot results
results_plotter.plot_results([log_dir], 1e6, results_plotter.X_TIMESTEPS, "DQN CartPole-v1")

## 7. Save model
Save the current state of our agent using the `model.save()` function.

In [None]:
# Save the agent
model.save('my_cartpole_model')

In [None]:
!ls      # check that my_cartpole_model.zip has been saved
!ls log  # list the monitor log files

## 8. Delete model
We can remove our `model` from memory by deleting it, and then load the saved agent back from disk.

In [None]:
del model 

In [None]:
# load model back again
model = DQN.load("my_cartpole_model")

# Taking it one step further
You have now tried the basics of configuring, training and evaluating your model.  
Now let's see how we can improve our results and training by configuring hyperparameters and adding some automation.

### Delete model configuration
Before continuing we delete the previous DQN model configuration

In [None]:
# delete current model & configuration
del model

# delete previous log files
!rm -r ./log/



In [None]:
# Rebuild environment (recreate the log folder that was just deleted)
os.makedirs(log_dir, exist_ok=True)
env = gym.make(env_id)
env = Monitor(env, log_dir)



## Hyperparameter configuration
Configure the DQN algorithm's hyperparameters.  
Read more at https://stable-baselines3.readthedocs.io/en/master/modules/dqn.html

In [None]:
model = DQN(
    MlpPolicy,
    env,
    verbose=1,                         # 0 = no output, 1 = show training output
    gamma=0.99,                        # discount factor for future rewards
    learning_rate=0.00025,             # learning rate of the optimizer
    buffer_size=100000,                # size of the replay buffer
    batch_size=512,                    # number of transitions sampled from the replay buffer per gradient step
    learning_starts=0,                 # steps collected before learning starts
    target_update_interval=1250,       # update the target network every N environment steps
    train_freq=16,                     # update the model every N environment steps
    gradient_steps=4,                  # gradient steps per update
    exploration_fraction=0.06,         # fraction of training over which epsilon is reduced
    exploration_final_eps=0.11,        # final probability of taking a random action
 
    )

# model.set_env(env)
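
To get a feel for what `exploration_fraction` and `exploration_final_eps` do, the sketch below computes a linearly decaying exploration rate. It is a simplified illustration rather than the library's own code: `linear_epsilon` is a hypothetical helper, the starting value of 1.0 is an assumption, and 50000 is the number of training steps used later in this notebook.

```python
def linear_epsilon(step, total_timesteps, exploration_fraction,
                   final_eps, initial_eps=1.0):
    """Linearly anneal epsilon from initial_eps to final_eps, then hold it there."""
    decay_steps = exploration_fraction * total_timesteps
    if decay_steps <= 0 or step >= decay_steps:
        return final_eps
    progress = step / decay_steps  # 0.0 at the start, close to 1.0 at the end
    return initial_eps + progress * (final_eps - initial_eps)


# With the values above, epsilon decays over the first 6% of training
for step in (0, 1500, 3000, 50000):
    eps = linear_epsilon(step, total_timesteps=50000,
                         exploration_fraction=0.06, final_eps=0.11)
    print(step, round(eps, 3))
```

A larger `exploration_fraction` keeps the agent trying random actions for longer, which can help if training gets stuck early.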

### Save model when new reward score is achieved
Let's create a callback that checks the mean training reward every 25000 steps and saves the model if it performs better than before.

In [None]:
class SaveOnBestTrainingRewardCallback(BaseCallback):
    """
    Callback for saving a model (the check is done every ``check_freq`` steps)
    based on the training reward (in practice, we recommend using ``EvalCallback``).

    :param check_freq: (int)
    :param log_dir: (str) Path to the folder where the model will be saved.
      It must contain the file created by the ``Monitor`` wrapper.
    :param verbose: (int)
    """
    def __init__(self, check_freq: int, log_dir: str, verbose=1):
        super(SaveOnBestTrainingRewardCallback, self).__init__(verbose)
        self.check_freq = check_freq
        self.log_dir = log_dir
        self.save_path = os.path.join(log_dir, 'best_model')
        self.best_mean_reward = -np.inf

    def _init_callback(self) -> None:
        # Create folder if needed
        if self.save_path is not None:
            os.makedirs(self.save_path, exist_ok=True)

    def _on_step(self) -> bool:
        if self.n_calls % self.check_freq == 0:

            # Retrieve training reward
            x, y = ts2xy(load_results(self.log_dir), 'timesteps')
            if len(x) > 0:
                # Mean training reward over the last 100 episodes
                mean_reward = np.mean(y[-100:])
                if self.verbose > 0:
                    print(f"Num timesteps: {self.num_timesteps}")
                    print(f"Best mean reward: {self.best_mean_reward:.2f} - Last mean reward per episode: {mean_reward:.2f}")

                # New best model, you could save the agent here
                if mean_reward > self.best_mean_reward:
                    self.best_mean_reward = mean_reward
                    # Example for saving best model
                    if self.verbose > 0:
                        print(f"Saving new best model to {self.save_path}.zip")
                    self.model.save(self.save_path)

        return True

In [None]:
# Create the callback: check every 25000 steps
callback = SaveOnBestTrainingRewardCallback(check_freq=25000, log_dir=log_dir)

In [None]:
# Train the agent
model.learn(total_timesteps=50000, log_interval=100, callback=callback)

# Plotting & Evaluating
We use plotting & evaluation to measure the results.

In [None]:
evals = evaluate_policy(model, env, n_eval_episodes=10, deterministic=True, render=False, callback=None, reward_threshold=None, return_episode_rewards=True)
print(evals[0])

In [None]:
from stable_baselines3.common import results_plotter

# Helper from the library
results_plotter.plot_results([log_dir], 1e6, results_plotter.X_TIMESTEPS, "DQN CartPole-v1")

In [None]:
def moving_average(values, window):
    """
    Smooth values by doing a moving average
    :param values: (numpy array)
    :param window: (int)
    :return: (numpy array)
    """
    weights = np.repeat(1.0, window) / window
    return np.convolve(values, weights, 'valid')


def plot_results(log_folder, title='Learning Curve'):
    """
    plot the results

    :param log_folder: (str) the save location of the results to plot
    :param title: (str) the title of the task to plot
    """
    x, y = ts2xy(load_results(log_folder), 'timesteps')
    y = moving_average(y, window=50)
    # Truncate x
    x = x[len(x) - len(y):]

    fig = plt.figure(title)
    plt.plot(x, y)
    plt.xlabel('Number of Timesteps')
    plt.ylabel('Rewards')
    plt.title(title + " Smoothed")
    plt.show()

In [None]:
plot_results(log_dir)

## Keep training & evaluating
Keep training and evaluating by adding new code to the empty code cells below.  
Read through the documentation linked above to understand how you can modify DQN to perform better.

In [None]:
# Add your code 

In [None]:
# Add your code 

In [None]:
# Add your code 

In [None]:
# Add your code 

In [None]:
# Add your code 

In [None]:
# Add your code 

<img align="left" width="1000" src="https://nextgrid.ai/wp-content/uploads/2020/09/Screenshot-2020-09-10-at-12.53.12.png"/><div style="clear:left;"/>