## Welcome to the FUN track of the WiDS reinforcement learning in a nutshell tutorial!

In this track, you will play with several RL libraries that provide:
- Standard environments to train and compare different algorithms
- Easy-to-use pre-implemented algorithms
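
To make "standard environment" concrete, here is a minimal sketch of the `reset()` / `step()` convention that Gym-style environments follow. `CoinFlipEnv` and `run_random_episode` are hypothetical names made up for this illustration and are not part of any library; the sketch only assumes the classic four-value `step` return used later in this notebook.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment following the classic Gym reset/step convention."""

    def __init__(self, episode_length=10, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.t = 0
        self.coin = 0

    def reset(self):
        # Start a new episode and return the first observation
        self.t = 0
        self.coin = self.rng.randint(0, 1)
        return self.coin

    def step(self, action):
        # Reward 1.0 if the action matches the coin shown in the last observation
        reward = 1.0 if action == self.coin else 0.0
        self.t += 1
        self.coin = self.rng.randint(0, 1)
        done = self.t >= self.episode_length
        return self.coin, reward, done, {}


def run_random_episode(env, seed=0):
    """Play one episode with uniformly random actions and return the total reward."""
    rng = random.Random(seed)
    observation = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = rng.randint(0, 1)  # a random agent ignores the observation
        observation, reward, done, info = env.step(action)
        total_reward += reward
    return total_reward


env = CoinFlipEnv(episode_length=10)
print(run_random_episode(env))
```

Because every environment exposes the same two methods, the same agent loop can be reused across tasks, which is what makes comparing algorithms straightforward.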

Now let's get started!

In [12]:
# For Windows, please check https://towardsdatascience.com/how-to-install-openai-gym-in-a-windows-environment-338969e24d30
# for dependency installation

!pip install gym
!pip install stable_baselines



In [1]:
import numpy as np

### OpenAI Gym

In [2]:
import gym

from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import PPO2
from stable_baselines import A2C
from stable_baselines import DQN

In [3]:
# Render the simulation of the model in the environment
# For vectorized models like PPO and A2C, set use_vec_env to True
def render(env_id, use_vec_env=False, model=None, max_step=500):
    env = gym.make(env_id)
    if use_vec_env:
        env = DummyVecEnv([lambda: env])
    
    observation = env.reset()
    
    for _ in range(max_step):
        env.render()
        if model is None:  # Sample a random action from the action space if no model is provided
            if use_vec_env:
                action = [env.action_space.sample()]
            else:
                action = env.action_space.sample()
        else:
            action, _states = model.predict(observation)

        observation, reward, done, info = env.step(action)

        if done:
            observation = env.reset()

    if use_vec_env:
        env.envs[0].close()
    else:
        env.close()

In [4]:
# Environment
env_id = "CartPole-v1"

# Training parameters
policy = "MlpPolicy"
max_train_step = 10000

model_path = "./models/CartPole_PPO.model"

In [5]:
# Random agent
render(env_id, model=None)

To make life easier, we use a variation of the original OpenAI baselines: [stable baselines](https://github.com/hill-a/stable-baselines).

In [6]:
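import os

# Make sure the target folder exists; model.save may not create it
os.makedirs(os.path.dirname(model_path), exist_ok=True)
model = PPO2(policy, env_id).learn(max_train_step)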
model.save(model_path)
del model # We will reload the saved model for rendering

In [7]:
model = PPO2.load(model_path)  # load is a class method, so no new model needs to be built first
render(env_id, use_vec_env=True, model=model)
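
Rendering shows the agent's behaviour qualitatively. To put a number on it, one option is to average the total reward over a few episodes. The sketch below assumes the classic Gym API used above (`reset()` returns an observation, `step()` returns four values) and a non-vectorized environment. `evaluate` and `CountdownEnv` are hypothetical helpers written for this illustration; `CountdownEnv` is only a stand-in so the block runs without Gym installed.

```python
def evaluate(env, predict, n_episodes=10):
    """Return the mean undiscounted episode return of `predict` on `env`.

    `predict` is any callable that maps an observation to an action.
    """
    returns = []
    for _ in range(n_episodes):
        observation = env.reset()
        done, episode_return = False, 0.0
        while not done:
            action = predict(observation)
            observation, reward, done, info = env.step(action)
            episode_return += reward
        returns.append(episode_return)
    return sum(returns) / len(returns)


class CountdownEnv:
    """Hypothetical stand-in: each episode lasts `length` steps with reward 1 per step."""

    def __init__(self, length=5):
        self.length = length
        self.remaining = length

    def reset(self):
        self.remaining = self.length
        return self.remaining

    def step(self, action):
        self.remaining -= 1
        return self.remaining, 1.0, self.remaining == 0, {}


print(evaluate(CountdownEnv(length=5), predict=lambda obs: 0, n_episodes=3))  # 5.0
```

With the objects from this notebook, a call along the lines of `evaluate(gym.make(env_id), lambda obs: model.predict(obs)[0])` should work, assuming `model.predict` accepts a single unvectorized observation.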

Now let's try another environment with another model!

In [8]:
# Environment
# LunarLander needs the Box2D extra: pip install gym[box2d]
env_id = "LunarLander-v2"

# Training parameters
policy = "MlpPolicy"
max_train_step = 100000

model_path = "./models/LunarLander_A2C.model"

In [9]:
model = A2C(policy, env_id, ent_coef=0.1).learn(total_timesteps=max_train_step)
# Save the agent
model.save(model_path)
del model  # delete trained model to demonstrate loading

In [11]:
# Load the trained agent
model = A2C.load(model_path)

# Enjoy trained agent
render(env_id, use_vec_env=True, model=model)

### Tensorforce

[Tensorforce](https://github.com/tensorforce/tensorforce) is an open-source library that provides modular APIs for reinforcement learning. As the name suggests, it is built on top of TensorFlow.

In [1]:
from tensorforce.agents import PPOAgent
from tensorforce.execution import Runner
from tensorforce.contrib.openai_gym import OpenAIGym

In [9]:
# Create an OpenAI Gym environment
env = OpenAIGym('CartPole-v1', visualize=True)

In [10]:
# Network as list of layers
network_spec = [
    dict(type='dense', size=32, activation='tanh'),
    dict(type='dense', size=32, activation='tanh')
]

agent = PPOAgent(
    states=env.states,
    actions=env.actions,
    network=network_spec,
    batching_capacity=4096,
    step_optimizer=dict(
        type='adam',
        learning_rate=1e-3
    ),
    optimization_steps=10,
    # Model
    scope='ppo',
    discount=0.99,
    entropy_regularization=0.01,
    likelihood_ratio_clipping=0.2,
#    summarizer=dict(directory="./board/",
#                    steps=50,
#                    labels=['graph',
#                            'configuration',
#                            'gradients_scalar',
#                            'regularization',
#                            'inputs',
#                            'losses',
#                            'variables'
#                           ])
)

INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
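
The `likelihood_ratio_clipping=0.2` setting above most likely corresponds to the clipping range of the standard PPO objective. As a rough illustration of that idea, here is a plain-Python sketch of the clipped surrogate objective. It shows the textbook formula under that assumption, not Tensorforce's actual implementation, and `clipped_surrogate` is a hypothetical helper name.

```python
import math


def clipped_surrogate(new_log_probs, old_log_probs, advantages, clip=0.2):
    """Mean PPO clipped surrogate objective (a quantity to be maximized)."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        # Likelihood ratio pi_new(a|s) / pi_old(a|s), computed from log-probabilities
        ratio = math.exp(new_lp - old_lp)
        # Keep the ratio inside [1 - clip, 1 + clip]
        clipped_ratio = max(1.0 - clip, min(1.0 + clip, ratio))
        # Take the more pessimistic of the unclipped and clipped terms
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)


# Ratio 1.5 with a positive advantage: the gain is capped at (1 + 0.2) * 1.0
print(clipped_surrogate([math.log(1.5)], [0.0], [1.0]))  # 1.2
```

The cap removes the incentive to move the new policy far from the old one within a single update, which is why a small value such as 0.2 is a common choice.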


In [11]:
# Callback function printing episode statistics
def episode_finished(r):
    print("Finished episode {ep} after {ts} timesteps (reward: {reward})".format(ep=r.episode, ts=r.episode_timestep,
                                                                                 reward=r.episode_rewards[-1]))
    return True

In [12]:
# Create the runner
runner = Runner(agent=agent, environment=env)

# Start learning
runner.run(episodes=100, max_episode_timesteps=200, episode_finished=episode_finished)

# Print statistics
print("Learning finished. Total episodes: {ep}. Average reward of last 100 episodes: {ar}.".format(
    ep=runner.episode,
    ar=np.mean(runner.episode_rewards[-100:]))
)

Finished episode 1 after 13 timesteps (reward: 13.0)
Finished episode 2 after 15 timesteps (reward: 15.0)
Finished episode 3 after 25 timesteps (reward: 25.0)
Finished episode 4 after 28 timesteps (reward: 28.0)
Finished episode 5 after 16 timesteps (reward: 16.0)
Finished episode 6 after 30 timesteps (reward: 30.0)
Finished episode 7 after 13 timesteps (reward: 13.0)
Finished episode 8 after 18 timesteps (reward: 18.0)
Finished episode 9 after 10 timesteps (reward: 10.0)
Finished episode 10 after 26 timesteps (reward: 26.0)
Finished episode 11 after 27 timesteps (reward: 27.0)
Finished episode 12 after 22 timesteps (reward: 22.0)
Finished episode 13 after 23 timesteps (reward: 23.0)
Finished episode 14 after 11 timesteps (reward: 11.0)
Finished episode 15 after 23 timesteps (reward: 23.0)
Finished episode 16 after 22 timesteps (reward: 22.0)
Finished episode 17 after 21 timesteps (reward: 21.0)
Finished episode 18 after 18 timesteps (reward: 18.0)
Finished episode 19 after 53 timestep

In [13]:
runner.agent.save_model(directory="./agents/")
runner.close()
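
Per-episode rewards are noisy, so a single average over the last 100 episodes can hide the trend. A trailing moving average is one simple way to see whether the agent is improving. The sketch below is plain Python; `moving_average` is a hypothetical helper, and the sample values are the first six episode rewards printed above. In the notebook you would presumably pass `runner.episode_rewards` instead.

```python
def moving_average(values, window=10):
    """Trailing mean over the last `window` values (fewer at the start)."""
    averages = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            # Drop the value that just left the window
            running_sum -= values[i - window]
        averages.append(running_sum / min(i + 1, window))
    return averages


rewards = [13.0, 15.0, 25.0, 28.0, 16.0, 30.0]
print(moving_average(rewards, window=3))
```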