# `Stable-Baselines3` Tutorial - Part 1
Created by following [this](https://pythonprogramming.net/introduction-reinforcement-learning-stable-baselines-3-tutorial/) tutorial by sentdex.

In [1]:
import gym
from stable_baselines3 import A2C, PPO

In [2]:
# Making the environment
env = gym.make('LunarLander-v2')

# Inspecting the environment
print('sample action:', env.action_space.sample())
print('observation space shape:', env.observation_space.shape)
print('sample observation:', env.observation_space.sample())

sample action: 3
observation space shape: (8,)
sample observation: [-0.11661723  0.2551806  -0.5293492   0.43286434 -0.43185088  0.7570258
  0.84934956  0.2473871 ]


In [3]:
# Visualizing the environment
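obs = env.reset()
for step in range(200):
	env.render()
	obs, reward, done, info = env.step(env.action_space.sample()) # take random action
	if done: # episode ended (landed or crashed), so reset before stepping again
		obs = env.reset()
env.close()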

## A2C Model

In [4]:
# Creating and training an A2C model
a2c_model = A2C('MlpPolicy', env, verbose=1)
a2c_model.learn(total_timesteps=10_000)

Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 116      |
|    ep_rew_mean        | -387     |
| time/                 |          |
|    fps                | 80       |
|    iterations         | 100      |
|    time_elapsed       | 6        |
|    total_timesteps    | 500      |
| train/                |          |
|    entropy_loss       | -1.36    |
|    explained_variance | -0.0115  |
|    learning_rate      | 0.0007   |
|    n_updates          | 99       |
|    policy_loss        | -0.956   |
|    value_loss         | 5.81     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 125      |
|    ep_rew_mean        | -282     |
| time/                 |          |
|    fps                | 123      |
|    iterations         | 200      |
|    time_elapsed

<stable_baselines3.a2c.a2c.A2C at 0x1fdfb563a90>
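
The `ep_rew_mean` and `ep_len_mean` rows in the log above are averages over recently finished episodes, which is why they change slowly between printouts. Below is a minimal sketch of that bookkeeping in plain Python. It is not SB3's actual code: `EpisodeStats` is a hypothetical helper, and the default window of 100 episodes is my understanding of SB3's default, so treat it as an assumption.

```python
from collections import deque
from statistics import mean


class EpisodeStats:
    """Hypothetical helper: keeps only the most recent `window` episodes."""

    def __init__(self, window=100):  # 100 is an assumed default, not confirmed here
        self.returns = deque(maxlen=window)
        self.lengths = deque(maxlen=window)

    def record(self, episode_return, episode_length):
        # Older episodes fall out of the deque automatically once it is full.
        self.returns.append(episode_return)
        self.lengths.append(episode_length)

    def summary(self):
        return {
            "ep_rew_mean": mean(self.returns),
            "ep_len_mean": mean(self.lengths),
        }


# Made-up episode results, with a tiny window so the effect is visible.
stats = EpisodeStats(window=3)
for ep_return, ep_length in [(-400.0, 100), (-300.0, 120), (-200.0, 140), (-100.0, 160)]:
    stats.record(ep_return, ep_length)

summary = stats.summary()
print(summary)  # the first episode (-400.0, 100) is no longer part of the averages
```

Because early, poor episodes stay in the window for a while, the logged mean can lag behind what the current policy actually achieves.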

In [5]:
# Visualizing model performance
episodes = 5
for ep in range(episodes):
	obs = env.reset()
	done = False
	while not done:
		action, _states = a2c_model.predict(obs) # pass observation to model to get predicted action
		obs, reward, done, info = env.step(action) # pass action to env and get info back
		env.render() # show the environment on the screen
env.close()
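
The loop above only renders the agent and discards `reward`. To put a number on performance, the same loop can sum the reward per episode. Here is a small sketch, assuming the older Gym API used in this notebook (`reset()` returns the observation, `step()` returns a 4-tuple). `run_episodes` and `CountdownEnv` are hypothetical names; the stand-in environment is only there so the snippet runs without Box2D.

```python
class CountdownEnv:
    """Hypothetical stand-in env: every episode lasts `horizon` steps, reward 1.0 per step."""

    def __init__(self, horizon=3):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # the observation is just the step counter

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        return self.t, 1.0, done, {}  # obs, reward, done, info


def run_episodes(env, policy, episodes=5):
    """Run `episodes` full episodes and return the summed reward of each one."""
    returns = []
    for _ in range(episodes):
        obs = env.reset()
        done = False
        total = 0.0
        while not done:
            action = policy(obs)  # policy maps an observation to an action
            obs, reward, done, info = env.step(action)
            total += reward
        returns.append(total)
    return returns


demo_returns = run_episodes(CountdownEnv(horizon=3), policy=lambda obs: 0, episodes=2)
print(demo_returns)  # [3.0, 3.0]
```

In this notebook the equivalent call would be something like `run_episodes(env, lambda obs: a2c_model.predict(obs)[0])`, since `predict` returns an `(action, state)` pair.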

## PPO Model

In [6]:
# Creating and training a PPO model
ppo_model = PPO('MlpPolicy', env, verbose=1)
ppo_model.learn(total_timesteps=10_000)

Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 91.9     |
|    ep_rew_mean     | -175     |
| time/              |          |
|    fps             | 448      |
|    iterations      | 1        |
|    time_elapsed    | 4        |
|    total_timesteps | 2048     |
---------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 93.1        |
|    ep_rew_mean          | -174        |
| time/                   |             |
|    fps                  | 373         |
|    iterations           | 2           |
|    time_elapsed         | 10          |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.011935966 |
|    clip_fraction        | 0.0854      |
|    clip_range           | 0.2         |
|    entropy_loss  

<stable_baselines3.ppo.ppo.PPO at 0x1fd90793b20>
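
The `iterations` and `total_timesteps` rows in the two logs suggest that each algorithm collects a fixed number of steps per update: 500 / 100 = 5 for A2C and 2048 / 1 = 2048 for PPO. These match what I understand to be the default `n_steps` for each algorithm with a single environment, but check the version you have installed. The sketch below works out how many rollouts `total_timesteps=10_000` comes to, assuming training always finishes a whole rollout before checking the budget. `rollouts_needed` is a hypothetical helper, not part of SB3.

```python
import math


def rollouts_needed(total_timesteps, n_steps, n_envs=1):
    """Return (number of rollouts, timesteps actually collected).

    Assumes whole rollouts only: collection stops at the first rollout
    boundary at or past `total_timesteps`.
    """
    steps_per_rollout = n_steps * n_envs
    rollouts = math.ceil(total_timesteps / steps_per_rollout)
    return rollouts, rollouts * steps_per_rollout


a2c_plan = rollouts_needed(10_000, n_steps=5)     # n_steps=5 inferred from the A2C log
ppo_plan = rollouts_needed(10_000, n_steps=2048)  # n_steps=2048 inferred from the PPO log

print("A2C:", a2c_plan)  # (2000, 10000)
print("PPO:", ppo_plan)  # (5, 10240)
```

Under these assumptions PPO overshoots the requested budget slightly (10,240 steps instead of 10,000), while A2C performs many more, much smaller updates.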

In [7]:
# Visualizing model performance
episodes = 5
for ep in range(episodes):
	obs = env.reset()
	done = False
	while not done:
		action, _states = ppo_model.predict(obs) # pass observation to model to get predicted action
		obs, reward, done, info = env.step(action) # pass action to env and get info back
		env.render() # show the environment on the screen
env.close()