## Welcome to the FUN track of the WiDS reinforcement learning in a nutshell tutorial!

In this track, you will play with several RL libraries that provide:
- Standard environments to train and compare different algorithms
- Easy-to-use pre-implemented algorithms

Now let's get started!

In [None]:
# For Windows, please first install msmpi from https://www.microsoft.com/en-us/download/details.aspx?id=57467,
# and then follow the instructions at https://towardsdatascience.com/how-to-install-openai-gym-in-a-windows-environment-338969e24d30
!conda update conda
!conda update conda-build
!conda install swig

# For Linux and MacOS, you can run this cell directly.
!apt install swig # Remember to comment out this line if you are running on Windows.
!pip install --upgrade pip
!pip install gym==0.12.1
!pip install stable_baselines==2.4.1
!pip install box2d==2.3.2
!pip install box2d-kengz
!pip install tensorforce==0.4.3
# If you run into issues installing box2d, please try to build it from source by uncommenting the following lines.
#!sudo apt-get install --reinstall build-essential
#!pip install git+https://github.com/pybox2d/pybox2d


In [None]:
import time
import numpy as np

### OpenAI Gym

In [None]:
import gym

from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import PPO2
from stable_baselines import A2C
from stable_baselines import DQN

In [None]:
# Render the simulation of the model in the environment
# For vectorized models like PPO and A2C, set use_vec_env to True
def render(env_id, use_vec_env=False, model=None, max_step=500):
    env = gym.make(env_id)
    if use_vec_env:
        # Note: Vectorized environments allow multiprocess training. 
        # In this tutorial, we only use one process, so we use DummyVecEnv, which is just a simple wrapper.

        env = DummyVecEnv([lambda: env])
    
    observation = env.reset()
    
    for _ in range(max_step):
        env.render()
        if model is None:  # Sample a random action from the action space if no model is provided
            if use_vec_env:
                action = [env.action_space.sample()]
            else:
                action = env.action_space.sample()
        else:
            action, _states = model.predict(observation)

        observation, reward, done, info = env.step(action)

        if done:
            observation = env.reset()

    if use_vec_env:
        env.envs[0].close()
    else:
        env.close()
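
The loop in `render` relies only on the Gym-style `reset`/`step` contract. As a rough sketch, the same control flow can be run against a toy environment with no RL library installed. `CoinFlipEnv` and `run_random_agent` are hypothetical names made up for this illustration and are not part of Gym or stable baselines.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment exposing Gym-style reset() and step()."""

    def __init__(self, episode_length=5, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.t = 0
        self.coin = 0

    def reset(self):
        self.t = 0
        self.coin = self.rng.randint(0, 1)
        return self.coin  # the observation is the current coin face

    def step(self, action):
        # Reward 1 when the action matches the coin face that was observed.
        reward = 1.0 if action == self.coin else 0.0
        self.t += 1
        self.coin = self.rng.randint(0, 1)
        done = self.t >= self.episode_length
        return self.coin, reward, done, {}


def run_random_agent(env, max_step=20, seed=1):
    """Same control flow as render(): act, step, reset when the episode ends."""
    rng = random.Random(seed)
    observation = env.reset()
    total_reward, episodes = 0.0, 0
    for _ in range(max_step):
        action = rng.randint(0, 1)  # random action, as when model is None
        observation, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            episodes += 1
            observation = env.reset()
    return total_reward, episodes


total_reward, episodes = run_random_agent(CoinFlipEnv())
print(total_reward, episodes)
```

Swapping the random action for `model.predict(observation)` is the only change needed to evaluate a trained agent in this loop.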

We chose the MlpPolicy because the input of CartPole is a feature vector, not an image.
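
To make the idea concrete, here is a minimal sketch of what a multilayer perceptron policy does with a CartPole observation: it maps the four features to a probability for each of the two actions. The layer sizes, random weights, and helper names (`dense`, `softmax`, `random_layer`) are assumptions for illustration only and do not reflect the architecture stable baselines actually builds.

```python
import math
import random


def dense(inputs, weights, biases, activation=None):
    """Fully connected layer: one output per row of `weights`."""
    outputs = []
    for row, b in zip(weights, biases):
        z = sum(w * x for w, x in zip(row, inputs)) + b
        outputs.append(activation(z) if activation else z)
    return outputs


def softmax(logits):
    """Turn raw scores into probabilities that sum to one."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_layer(n_in, n_out, rng):
    """Untrained layer with small random weights and zero biases."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


rng = random.Random(0)
hidden_w, hidden_b = random_layer(4, 8, rng)  # 4 observation features -> 8 hidden units (illustrative size)
output_w, output_b = random_layer(8, 2, rng)  # 8 hidden units -> 2 actions (push left, push right)

# CartPole observation: cart position, cart velocity, pole angle, pole angular velocity
observation = [0.02, -0.01, 0.03, 0.04]
hidden = dense(observation, hidden_w, hidden_b, activation=math.tanh)
action_probs = softmax(dense(hidden, output_w, output_b))
print(action_probs)
```

An image observation would instead call for convolutional layers, which is why a different policy type is used for pixel-based environments.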

In [None]:
# Environment
env_id = "CartPole-v1"

# Training parameters
policy = "MlpPolicy"
max_train_step = 100000
learning_rate = 0.0001

model_path = "./models/CartPole_PPO.model"
log_path = "./log/"
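
The cells below save models under `./models/` and write TensorBoard logs under `./log/`. Assuming the library does not create missing directories on save (not verified for the pinned version), a small sketch like the following creates them up front. The paths are copied from the cell above.

```python
import os

# Paths copied from the cell above; adjust them if you changed the originals.
model_path = "./models/CartPole_PPO.model"
log_path = "./log/"

# Create the parent directory of the model file and the log directory.
# exist_ok=True makes the calls safe to re-run.
os.makedirs(os.path.dirname(model_path), exist_ok=True)
os.makedirs(log_path, exist_ok=True)
```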

In [None]:
# Random agent
render(env_id, model=None)

To make life easier, we use a fork of the original OpenAI baselines: [stable baselines](https://github.com/hill-a/stable-baselines).

In [None]:
model = PPO2(policy, env_id, learning_rate=learning_rate, tensorboard_log=log_path)
model.learn(max_train_step, tb_log_name=env_id+str(time.time()))
# Save the agent
model.save(model_path)
del model  # delete trained model to demonstrate loading

In [None]:
model = PPO2.load(model_path)  # load is a classmethod, so there is no need to build a fresh model first
render(env_id, use_vec_env=True, model=model)

In [None]:
dqn_model_path = "./models/CartPole_DQN.model"  # separate file so the PPO agent saved above is not overwritten

model = DQN(policy, env_id, learning_rate=learning_rate, tensorboard_log=log_path)
model.learn(max_train_step, tb_log_name=env_id+str(time.time()))
# Save the agent
model.save(dqn_model_path)
del model  # delete trained model to demonstrate loading

In [None]:
model = DQN.load(dqn_model_path)
render(env_id, use_vec_env=True, model=model)
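
Unlike PPO, DQN learns a value for each action and acts on those values. A common way to balance exploration and exploitation during training is epsilon-greedy action selection, sketched below. The Q-values and the `epsilon_greedy` helper are hypothetical; this is not the code the library runs internally.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
q_values = [0.1, 0.7]  # hypothetical Q-value estimates for CartPole's two actions

# epsilon=0 always exploits; epsilon=0.5 explores on roughly half of the steps.
greedy_action = epsilon_greedy(q_values, epsilon=0.0, rng=rng)
explored = [epsilon_greedy(q_values, epsilon=0.5, rng=rng) for _ in range(1000)]
print(greedy_action, explored.count(0) / len(explored))
```

With `epsilon=0.5`, the lower-valued action is chosen only when a random draw happens to land on it, so it should appear in about a quarter of the samples.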

Now let's try another environment with another model!

In [None]:
# Environment
env_id = "LunarLander-v2"

# Training parameters
policy = "MlpPolicy"
max_train_step = 100000

model_path = "./models/LunarLander_A2C.model"

In [None]:
model = A2C(policy, env_id, ent_coef=0.1, learning_rate=learning_rate, tensorboard_log=log_path)
model.learn(total_timesteps=max_train_step, tb_log_name=env_id+str(time.time()))
# Save the agent
model.save(model_path)
del model  # delete trained model to demonstrate loading

In [None]:
# Load the trained agent
model = A2C.load(model_path)  # hyperparameters such as ent_coef are restored from the saved file

# Enjoy trained agent
render(env_id, use_vec_env=True, model=model)
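
The `ent_coef=0.1` argument passed to A2C above weights an entropy bonus that discourages the policy from becoming deterministic too early. The sketch below shows the idea on two hand-written action distributions. The `policy_loss` value is a made-up number, and the value-function term of the full A2C loss is omitted for brevity.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def loss_with_entropy_bonus(policy_loss, probs, ent_coef):
    """Subtracting ent_coef * entropy lowers the loss of less deterministic policies."""
    return policy_loss - ent_coef * entropy(probs)


uniform = [0.25, 0.25, 0.25, 0.25]  # LunarLander-v2 has four discrete actions
peaked = [0.97, 0.01, 0.01, 0.01]

policy_loss = 1.0  # hypothetical value, for illustration only
uniform_loss = loss_with_entropy_bonus(policy_loss, uniform, ent_coef=0.1)
peaked_loss = loss_with_entropy_bonus(policy_loss, peaked, ent_coef=0.1)
print(uniform_loss, peaked_loss)
```

A larger `ent_coef` pushes harder toward exploration; setting it to zero removes the bonus entirely.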

### Tensorforce

[Tensorforce](https://github.com/tensorforce/tensorforce) is an open-source library that provides modular APIs for reinforcement learning. As the name suggests, it is built on top of TensorFlow.

In [None]:
from tensorforce.agents import PPOAgent
from tensorforce.execution import Runner
from tensorforce.contrib.openai_gym import OpenAIGym

In [None]:
# Create an OpenAI Gym environment
env = OpenAIGym('CartPole-v1', visualize=True)

In [None]:
# Network as list of layers
network_spec = [
    dict(type='dense', size=32, activation='tanh'),
    dict(type='dense', size=32, activation='tanh')
]

agent = PPOAgent(
    states=env.states,
    actions=env.actions,
    network=network_spec,
    batching_capacity=4096,
    step_optimizer=dict(
        type='adam',
        learning_rate=1e-3
    ),
    optimization_steps=10,
    # Model
    scope='ppo',
    discount=0.99,
    entropy_regularization=0.01,
    likelihood_ratio_clipping=0.2,
#    summarizer=dict(directory="./board/",
#                    steps=50,
#                    labels=['graph',
#                            'configuration',
#                            'gradients_scalar',
#                            'regularization',
#                            'inputs',
#                            'losses',
#                            'variables'
#                           ])
)
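
Two of the settings above, `discount=0.99` and `likelihood_ratio_clipping=0.2`, correspond to standard PPO quantities. The sketch below computes discounted returns for a short episode and the clipped surrogate objective for a single sample. The helper names are hypothetical and this is a textbook illustration, not Tensorforce's implementation.

```python
def discounted_returns(rewards, discount):
    """Return G_t = r_t + discount * G_{t+1} for every step of one episode."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + discount * running
        returns.append(running)
    return list(reversed(returns))


def clipped_surrogate(ratio, advantage, clip=0.2):
    """Per-sample PPO objective: min(ratio * A, clip(ratio, 1 - clip, 1 + clip) * A)."""
    clipped_ratio = max(1.0 - clip, min(1.0 + clip, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


returns = discounted_returns([1.0, 1.0, 1.0], discount=0.99)
print(returns)

# Positive advantage: the objective stops growing once the ratio exceeds 1 + clip.
print(clipped_surrogate(ratio=1.5, advantage=2.0))
# Negative advantage: the objective stops changing once the ratio drops below 1 - clip.
print(clipped_surrogate(ratio=0.5, advantage=-2.0))
```

Here `ratio` stands for the probability of the taken action under the new policy divided by its probability under the old policy, so clipping limits how far a single update can move the policy.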

In [None]:
# Callback function printing episode statistics
def episode_finished(r):
    print("Finished episode {ep} after {ts} timesteps (reward: {reward})".format(ep=r.episode, ts=r.episode_timestep,
                                                                                 reward=r.episode_rewards[-1]))
    return True

In [None]:
# Create the runner
runner = Runner(agent=agent, environment=env)

# Start learning
runner.run(episodes=100, max_episode_timesteps=200, episode_finished=episode_finished)

# Print statistics
print("Learning finished. Total episodes: {ep}. Average reward of last 100 episodes: {ar}.".format(
    ep=runner.episode,
    ar=np.mean(runner.episode_rewards[-100:]))
)

In [None]:
runner.agent.save_model(directory="./agents/")
runner.close()