### Set up libraries

RL library for working with model-free algorithms

In [None]:
!pip install stable-baselines3[extra]

In [None]:
import os
import gym
from stable_baselines3 import PPO  # RL algorithm
from stable_baselines3.common.vec_env import DummyVecEnv  # Vectorized environment wrapper
from stable_baselines3.common.evaluation import evaluate_policy  # Evaluates how the model performs

### Load environment

OpenAI Gym provides you with an easy way to build environments for training RL agents

In [2]:
environment_name = 'CartPole-v0'
env = gym.make(environment_name)

In [4]:
episodes = 5 #Try the environment 5 times
for episode in range(1, episodes+1):
    state = env.reset() #Initial set of observations
    done = False
    score = 0

    while not done:
        env.render() #View the graphical representation of env
        action = env.action_space.sample() #Random selection of an action
        n_state, reward, done, info = env.step(action)
        score += reward
    print('Episode:{} Score:{}'.format(episode, score))
env.close()

Episode:1 Score:24.0
Episode:2 Score:10.0
Episode:3 Score:28.0
Episode:4 Score:13.0
Episode:5 Score:11.0


Episodes: Think of an episode as one full game within the environment.
Some environments cap the episode length, e.g. CartPole-v0, which ends after at most 200 steps. Others are open-ended, e.g. play until you run out of lives.

In [3]:
episodes = 5
for episode in range(1, episodes+1):
    print(episode)

1
2
3
4
5


Environment functions
- env.reset(): reset the environment and obtain the initial observations
- env.render(): visualise the environment
- env.step(): apply an action to the environment
- env.close(): close down the render frame
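
To make the four calls concrete without installing anything, here is a minimal sketch of a toy environment that follows the same call pattern. `ToyBalanceEnv`, its dynamics, and its bounds are hypothetical and are not part of Gym; only the shape of `reset`, `step`, `render` and `close` mirrors the list above.

```python
import random


class ToyBalanceEnv:
    """Hypothetical stand-in that mimics the four Gym calls listed above."""

    def __init__(self, max_steps=200, seed=0):
        self.max_steps = max_steps        # episode cap, like CartPole-v0's 200 steps
        self.rng = random.Random(seed)    # seeded so the sketch is repeatable
        self.position = 0
        self.steps = 0

    def reset(self):
        # Start a new episode and return the initial observation.
        self.position = 0
        self.steps = 0
        return self.position

    def step(self, action):
        # Action 1 pushes right, action 0 pushes left; noise makes balancing hard.
        self.position += (1 if action == 1 else -1) + self.rng.choice([-1, 0, 1])
        self.steps += 1
        out_of_bounds = abs(self.position) > 5
        hit_cap = self.steps >= self.max_steps
        done = out_of_bounds or hit_cap
        return self.position, 1.0, done, {}   # obs, reward, done, info

    def render(self):
        # Text output instead of a graphical window.
        print("step", self.steps, "position", self.position)

    def close(self):
        # Nothing to release in this toy example.
        pass


toy_env = ToyBalanceEnv()
action_rng = random.Random(1)
obs = toy_env.reset()
done, score = False, 0.0
while not done:
    action = action_rng.choice([0, 1])   # random policy, as in the loop above
    obs, reward, done, info = toy_env.step(action)
    score += reward
toy_env.close()
print("Score:", score)
```

Because the reward is 1 per step, the score equals the number of steps survived, which is how the random CartPole scores above should be read as well.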

In [11]:
print(env.reset())
print(env.action_space.sample())
print(env.observation_space)
print(env.step(1))

[ 0.0196153  -0.04807584  0.00778899  0.03825711]
0
Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)
(array([ 0.01865378,  0.14693356,  0.00855413, -0.25195816], dtype=float32), 1.0, False, {})


### Understanding the environment

Observation
- 0 ----------- Cart Position ----- -4.8 to 4.8
- 1 ----------- Cart Velocity ----- -Inf to Inf
- 2 ----------- Pole Angle ----- -24 deg to 24 deg
- 3 ----------- Pole Angular Velocity ----- -Inf to Inf
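
The table gives the pole angle in degrees, while the `Box` printed earlier reports the same bound in radians (`4.1887903e-01`). A quick check of that conversion, using only the standard library (the variable names are illustrative):

```python
import math

angle_limit_deg = 24
angle_limit_rad = math.radians(angle_limit_deg)  # degrees -> radians

print(round(angle_limit_rad, 4))  # 0.4189
```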

In [13]:
env.observation_space.sample()


array([-3.4992239e+00,  5.8906958e+37, -1.4341356e-01, -1.0775765e+38],
      dtype=float32)

Actions
- 0 ----------- Push cart to the left
- 1 ----------- Push cart to the right 

In [14]:
print(env.action_space)
print(env.action_space.sample())

Discrete(2)
0


### Training

Model-based vs model-free: model-based algorithms learn from predictions of the next state/reward, while model-free algorithms learn from real samples collected in the environment

There are a number of algorithms available through Stable Baselines.

Certain algorithms will perform better in certain environments. A literature review is needed to determine the best approach

Action space

- Discrete Single Process: DQN
- Discrete Multi Process: PPO or A2C
- Continuous Single Process: SAC or TD3
- Continuous Multi Process: PPO or A2C
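
The list above can be written down as a small lookup, which is handy when comparing several environments. This is only a sketch of the rule of thumb from this section; `SUGGESTED_ALGORITHMS` and `suggest_algorithms` are hypothetical helper names, not part of Stable Baselines.

```python
# (action space type, multiprocessed?) -> algorithms suggested in the list above
SUGGESTED_ALGORITHMS = {
    ("discrete", False): ["DQN"],
    ("discrete", True): ["PPO", "A2C"],
    ("continuous", False): ["SAC", "TD3"],
    ("continuous", True): ["PPO", "A2C"],
}


def suggest_algorithms(action_space, multiprocessed):
    """Return the suggested algorithm names for an action space type."""
    return SUGGESTED_ALGORITHMS[(action_space.lower(), bool(multiprocessed))]


# CartPole has a discrete action space; a single environment is one process.
print(suggest_algorithms("Discrete", False))  # ['DQN']
```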

![RL_algorithms](RL_algorithms.png)

Training metrics

Evaluation Metrics
- ep_len_mean: average number of steps an episode lasted before it ended
- ep_rew_mean: average total reward the agent accumulated per episode
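
As a rough illustration of how these two averages come about, the sketch below computes them over a rolling buffer of finished episodes. The window of 100 episodes and the `"l"`/`"r"` keys are assumptions about how Stable Baselines3 tracks episodes by default; the episode data is taken from the random run earlier in this notebook.

```python
from collections import deque
from statistics import fmean

# Assumed rolling window: only the most recent 100 episodes are averaged.
episode_buffer = deque(maxlen=100)

# (length, total reward) of the five random episodes shown earlier.
# In CartPole the reward is 1 per step, so length and reward coincide.
for length, reward in [(24, 24.0), (10, 10.0), (28, 28.0), (13, 13.0), (11, 11.0)]:
    episode_buffer.append({"l": length, "r": reward})

ep_len_mean = fmean(ep["l"] for ep in episode_buffer)
ep_rew_mean = fmean(ep["r"] for ep in episode_buffer)

print(ep_len_mean, ep_rew_mean)  # 17.2 17.2
```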

Time metrics
- fps: how many environment steps are processed per second
- iterations: how many rollout-and-update cycles have been completed
- time_elapsed: how long training has been running, in seconds
- total_timesteps: how many environment steps have been taken in total so far

Loss metrics
- entropy_loss
- policy_loss
- value_loss

Other metrics
- explained_variance: how much of the variance in the observed returns the value function is able to explain (1 is a perfect prediction)
- learning_rate: the step size used when updating the policy
- n_updates: how many gradient updates have been applied to the agent
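
Explained variance is the least self-explanatory of these metrics. A common definition is `1 - Var(returns - predictions) / Var(returns)`; the sketch below implements that formula with the standard library. It is meant to show the idea, under the assumption that the logged value follows this definition, and the function name is only illustrative.

```python
from statistics import pvariance


def explained_variance(predicted_values, observed_returns):
    """1 - Var(returns - predictions) / Var(returns); NaN if returns are constant."""
    var_returns = pvariance(observed_returns)
    if var_returns == 0:
        return float("nan")
    residuals = [r - p for r, p in zip(observed_returns, predicted_values)]
    return 1 - pvariance(residuals) / var_returns


returns = [1.0, 2.0, 3.0, 4.0]
print(explained_variance(returns, returns))    # 1.0 (perfect value predictions)
print(explained_variance([2.5] * 4, returns))  # 0.0 (no better than predicting the mean)
```

Values close to 1 suggest the value function predicts returns well; values at or below 0 suggest it is doing no better than a constant guess.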

In [15]:
# Path of the directory where TensorBoard logs will be written
log_path = os.path.join('Training', 'Logs')
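
`os.path.join` only builds a path string. If you prefer to create the folders explicitly up front, `os.makedirs` with `exist_ok=True` is one way to do it. The sketch below works inside a temporary directory so it does not touch the project folder; `base_dir` and the `sketch_` names are illustrative only.

```python
import os
import tempfile

# A temporary base directory keeps this sketch away from the real project folder;
# in the notebook you would drop base_dir and use the relative paths directly.
with tempfile.TemporaryDirectory() as base_dir:
    sketch_log_path = os.path.join(base_dir, "Training", "Logs")
    sketch_model_dir = os.path.join(base_dir, "Training", "Saved_Models")
    for path in (sketch_log_path, sketch_model_dir):
        os.makedirs(path, exist_ok=True)  # no error if the folder already exists
    created = sorted(os.listdir(os.path.join(base_dir, "Training")))

print(created)  # ['Logs', 'Saved_Models']
```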

In [16]:
env = gym.make(environment_name)  # Create the environment
env = DummyVecEnv([lambda: env])  # Wrap it in a vectorized environment
model = PPO('MlpPolicy', env, verbose=1, tensorboard_log=log_path)
# The policy is the rule that tells the agent how to act in the environment.
# Stable Baselines3 has three policy types: MlpPolicy, CnnPolicy, MultiInputPolicy



Using cpu device


In [None]:
model.learn(total_timesteps=20000)

### Save and Reload Model

In [18]:
PPO_Path = os.path.join('Training', 'Saved_Models', 'PPO_Model_Cartpole')

Save model

In [19]:
model.save(PPO_Path)

Load model

In [20]:
model = PPO.load(PPO_Path, env=env)

### Evaluation

`evaluate_policy` returns:
- Average reward over the number of episodes
- Standard deviation of the reward across those episodes
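
The two returned numbers are plain summary statistics over the per-episode rewards. A minimal sketch of that aggregation, assuming the population standard deviation is used (the same convention as NumPy's default); `summarize_rewards` is a hypothetical helper, not a Stable Baselines function:

```python
from statistics import fmean, pstdev


def summarize_rewards(episode_rewards):
    """Return (mean reward, population standard deviation) over episodes."""
    return fmean(episode_rewards), pstdev(episode_rewards)


# Ten evaluation episodes that all reach the 200-step cap.
print(summarize_rewards([200.0] * 10))  # (200.0, 0.0)
```

A standard deviation of 0 means every evaluation episode scored the same, which for CartPole-v0 happens when the agent reaches the step cap every time.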

In [21]:
evaluate_policy(model, env, n_eval_episodes=10, render=True)



(200.0, 0.0)

In [25]:
obs = env.reset()
action, _ = model.predict(obs)  # Returns the action and the hidden state (None for non-recurrent policies)

Testing model across episodes

In [28]:
episodes = 5 #Try the environment 5 times
for episode in range(1, episodes+1):
    obs = env.reset() 
    done = False
    score = 0

    while not done:
        env.render() 
        action, _ = model.predict(obs) # USING MODEL HERE
        obs, reward, done, info = env.step(action)
        score += reward
    print('Episode:{} Score:{}'.format(episode, score))
env.close()

Episode:1 Score:200.0
Episode:2 Score:200.0
Episode:3 Score:200.0
Episode:4 Score:200.0
Episode:5 Score:200.0


### Viewing logs in Tensorboard

In [29]:
training_log_path = os.path.join(log_path, 'PPO_1')

In [33]:
!tensorboard --logdir={training_log_path} # The leading ! runs a shell command from the Jupyter notebook

2024-05-13 16:35:55.760257: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-05-13 16:35:55.762024: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-05-13 16:35:55.786909: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-05-13 16:35:55.786936: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-05-13 16:35:55.787894: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to

Core metrics to look at
1. Average reward
2. Average Episode length

Training strategies
1. Train for longer
2. Hyperparameter tuning
3. Try different algorithms

### Applying callbacks to the training stage

Stable Baselines callbacks let you log data or save the model when certain conditions are met.

- Specify a reward threshold: training stops once the agent reaches a certain mean reward (this lets you stop before the model becomes unstable)
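
The control flow behind this idea can be sketched without any RL library: evaluate every fixed number of timesteps, remember the best mean reward, and stop once a new best reaches the threshold. The function below is a simplified illustration of that logic, not the library's implementation; `train_with_threshold` and the evaluation results fed into it are hypothetical.

```python
def train_with_threshold(eval_results, eval_freq, reward_threshold, total_timesteps):
    """Return the timestep at which training would stop.

    eval_results: hypothetical mean rewards, one per periodic evaluation.
    """
    best_mean_reward = float("-inf")
    eval_steps = range(eval_freq, total_timesteps + 1, eval_freq)
    for timestep, mean_reward in zip(eval_steps, eval_results):
        if mean_reward > best_mean_reward:
            best_mean_reward = mean_reward  # a real run would save the best model here
            if best_mean_reward >= reward_threshold:
                return timestep             # threshold reached: stop early
    return total_timesteps                  # threshold never reached


stopped_at = train_with_threshold(
    eval_results=[120.0, 180.0, 200.0, 200.0, 200.0],
    eval_freq=10000,
    reward_threshold=200,
    total_timesteps=50000,
)
print(stopped_at)  # 30000
```

Note that the threshold is only checked when an evaluation runs, so the evaluation frequency determines how quickly training can react.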

In [34]:
from stable_baselines3.common.callbacks import EvalCallback, StopTrainingOnRewardThreshold

In [36]:
save_path = os.path.join('Training', 'Saved_Models')

In [37]:
stop_callback = StopTrainingOnRewardThreshold(reward_threshold=200, verbose=1)
env_callback = EvalCallback(env,
                            callback_on_new_best=stop_callback,
                            eval_freq=10000,  # How often (in timesteps) EvalCallback evaluates the model
                            best_model_save_path=save_path,
                            verbose=1)



In [38]:
env = gym.make(environment_name) #Create environment
env = DummyVecEnv([lambda: env]) #Wrap environment
model = PPO('MlpPolicy', env, verbose=1, tensorboard_log=log_path)

Using cpu device




In [None]:
model.learn(total_timesteps=20000, callback=env_callback)

### Changing policies

In [42]:
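# Four hidden layers of 128 units each for the policy (pi) and the value function (vf).
# Note: newer Stable Baselines3 releases (1.8+) deprecate the list wrapper and expect a plain dict, dict(pi=[...], vf=[...]).
net_arch = [dict(pi=[128, 128, 128, 128], vf=[128, 128, 128, 128])]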
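
To get a feel for how large this architecture is, the sketch below counts the weights and biases of each network. It assumes fully connected layers with biases, separate policy and value networks, and CartPole's 4 observations and 2 actions; `count_mlp_parameters` is a hypothetical helper.

```python
def count_mlp_parameters(input_dim, hidden_layers, output_dim):
    """Count weights and biases of a fully connected network."""
    sizes = [input_dim, *hidden_layers, output_dim]
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


hidden = [128, 128, 128, 128]
policy_params = count_mlp_parameters(4, hidden, 2)  # 4 observations -> 2 action logits
value_params = count_mlp_parameters(4, hidden, 1)   # 4 observations -> 1 state value

print(policy_params, value_params)  # 50434 50305
```

Roughly 100,000 parameters in total is generous for a 4-dimensional problem like CartPole, so a larger network here is an experiment rather than a guaranteed improvement.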

In [43]:
model = PPO('MlpPolicy', env, verbose=1, tensorboard_log=log_path, policy_kwargs={'net_arch':net_arch})

Using cpu device




In [None]:
model.learn(total_timesteps=20000, callback=env_callback)

### Using an alternate algorithm

In [46]:
from stable_baselines3 import DQN

In [47]:
model = DQN('MlpPolicy', env, verbose=1, tensorboard_log=log_path)

Using cpu device


In [None]:
model.learn(total_timesteps=20000, callback=env_callback)