# Reinforcement Learning with OpenAI Gym
This notebook serves as a simple working example of how to perform reinforcement learning (RL) with OpenAI Gym.

## Environment Setup
1. Install swig: https://www.dev2qa.com/how-to-install-swig-on-macos-linux-and-windows/
2. Set up a python venv (optional):

`pip3 install virtualenv`

`python3 -m virtualenv venv`

`source venv/bin/activate`

3. Install required python packages:
`pip3 install gym==0.17.2 box2d-py==2.3.8`


In [12]:
import numpy as np
from numpy import pi
import gym

import matplotlib.pyplot as plt
%matplotlib inline

from operator import itemgetter

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torchvision.transforms as T
# if gpu is to be used
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Reinforcement Learning
The general statement of an RL problem can be formulated as follows:

The probability $p_\theta$ of a play-out of a game, i.e. a sequence of state vectors $\textbf{s}_t$ and agent actions $\textbf{a}_t$, factors into the initial-state distribution $p(\textbf{s}_1)$, the policy $\pi_\theta(\textbf{a}_t|\textbf{s}_t)$, and the environment model $p(\textbf{s}_{t+1}|\textbf{s}_t, \textbf{a}_t)$:

$$
p_{\theta}(\textbf{s}_1,\textbf{a}_1,\ldots,\textbf{s}_T,\textbf{a}_T)=p(\textbf{s}_1)\prod_{t=1}^{T}\pi_{\theta}(\textbf{a}_t|\textbf{s}_t)\,p(\textbf{s}_{t+1}|\textbf{s}_t,\textbf{a}_t)
$$

There is additionally a reward, $r(\textbf{s}_t,\textbf{a}_t)$, given to the agent for each step in the game. The goal is to teach an agent to maximize this reward, i.e.

$$
\max_{\theta} E_{p_{\theta}}\left[\sum_{t} r(\textbf{s}_t,\textbf{a}_t)\right]
$$
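
The expectation above can be approximated by averaging the summed reward over many sampled play-outs. The following is a minimal sketch on a made-up one-dimensional walk (not the Bipedal Walker). The names `rollout` and `estimate_expected_return`, the 0.8 success probability, and both toy policies are assumptions introduced purely for illustration.

```python
import random


def rollout(policy, rng, horizon=10):
    """Sample one play-out and return the summed reward.

    Toy setting (an assumption for illustration): the state is an integer
    position, the action is -1 or +1, and the move succeeds with
    probability 0.8. The reward is the distance gained in that step.
    """
    position = 0
    total_reward = 0.0
    for _ in range(horizon):
        action = policy(position, rng)   # a_t drawn from the policy
        moved = rng.random() < 0.8       # s_{t+1} drawn from the toy model
        step = action if moved else 0
        position += step
        total_reward += step             # reward collected at this step
    return total_reward


def estimate_expected_return(policy, episodes=2000, seed=0):
    """Monte Carlo estimate of the expected summed reward under a policy."""
    rng = random.Random(seed)
    returns = [rollout(policy, rng) for _ in range(episodes)]
    return sum(returns) / len(returns)


def always_right(position, rng):
    """Toy policy: always step to the right."""
    return 1


def coin_flip(position, rng):
    """Toy policy: step left or right with equal probability."""
    return rng.choice([-1, 1])


print(estimate_expected_return(always_right))  # close to 0.8 * 10 = 8
print(estimate_expected_return(coin_flip))     # close to 0
```

Maximizing over $\theta$ then amounts to searching for the policy whose estimate is largest; here `always_right` clearly beats `coin_flip`.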

In model-free RL we do not learn the model $p(\textbf{s}_{t+1}|\textbf{s}_t, \textbf{a}_t)$ explicitly. Instead, we teach our agent to maximize the reward based purely on the current state and possibly the agent's history (past states and actions).

In model-based RL our agent tries to learn a model of the environment which correctly predicts future states and rewards, so that the policy can be chosen to maximize the cumulative reward.

## Bipedal Walker
The following variables define the actions and states of the game "BipedalWalker-v3" from OpenAI Gym.

The walker has two legs, and each leg has two joints (hip and knee). A torque in the range (-1, 1) can be applied to each of these joints, so an action is a vector of four torques.

The state of the game is given by 24 variables, described in more detail here: https://github.com/openai/gym/wiki/BipedalWalker-v2


In [2]:
STATE_SPACE = 24
ACTION_SPACE = 4
ENV = gym.make("BipedalWalker-v3")

# actions
Hip_1 = 0
Knee_1 = 1
Hip_2 = 2
Knee_2 = 3

# state (named indices 0-13 of the 24 entries; the remaining 10 are lidar rangefinder readings)
HULL_ANGLE = 0
HULL_ANGULAR_VELOCITY = 1
VEL_X = 2
VEL_Y = 3
HIP_JOINT_1_ANGLE = 4
HIP_JOINT_1_SPEED = 5
KNEE_JOINT_1_ANGLE = 6
KNEE_JOINT_1_SPEED = 7
LEG_1_GROUND_CONTACT_FLAG = 8
HIP_JOINT_2_ANGLE = 9
HIP_JOINT_2_SPEED = 10
KNEE_JOINT_2_ANGLE = 11
KNEE_JOINT_2_SPEED = 12
LEG_2_GROUND_CONTACT_FLAG = 13




In [57]:
ENV.observation_space.sample()

array([-1.2787744 ,  1.1125426 , -0.67280614, -1.0236198 ,  0.55412406,
        1.2463422 ,  0.19889538,  0.9246162 , -1.2400373 , -0.5879139 ,
       -1.9885919 ,  0.56692463,  1.1481102 , -1.1484573 ,  1.9298387 ,
        0.65511125,  0.02889112, -1.6323286 ,  0.0808932 , -1.3743677 ,
       -0.02229984,  0.55941194, -1.019841  ,  0.29258353], dtype=float32)
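
The named indices above cover only entries 0 to 13 of the 24-entry state vector. To our understanding of the Gym source, the remaining ten entries are lidar rangefinder readings; treat that as an assumption to check against the wiki page linked above. Under that assumption, a small hypothetical helper (`describe_state` is not part of Gym) can make a raw state easier to read:

```python
STATE_NAMES = [
    "hull_angle", "hull_angular_velocity", "vel_x", "vel_y",
    "hip_joint_1_angle", "hip_joint_1_speed",
    "knee_joint_1_angle", "knee_joint_1_speed", "leg_1_ground_contact_flag",
    "hip_joint_2_angle", "hip_joint_2_speed",
    "knee_joint_2_angle", "knee_joint_2_speed", "leg_2_ground_contact_flag",
]
NUM_LIDAR = 10  # assumed: 24 total entries minus the 14 named ones


def describe_state(state):
    """Return a dict mapping a readable name to each entry of a 24-vector."""
    if len(state) != len(STATE_NAMES) + NUM_LIDAR:
        raise ValueError("expected a state vector with 24 entries")
    named = {name: float(state[i]) for i, name in enumerate(STATE_NAMES)}
    for k in range(NUM_LIDAR):
        named["lidar_%d" % k] = float(state[len(STATE_NAMES) + k])
    return named


# Works on any length-24 sequence, including the arrays returned by the env.
example = describe_state([0.1 * i for i in range(24)])
print(example["vel_x"], example["lidar_0"])
```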

Now we'll define the basic interface with the game. The game is essentially a very rapid turn-based game. In each round, there are 2 main steps:
1. An action is taken by the agent according to the agent's policy function.
2. The game updates according to its current state and the input action from the agent.


In [41]:
class Agent():
    """
    The base class for all agent-based reinforcement learning.
    Provides default implementations for __init__() and play() methods.
    
    __init__() by default only requires a policy function, which selects the next action for the agent to take
    
    play() by default steps through the game using the policy function provided by the user to select an action at each step
    """
    
    def __init__(self, policy_function):
        self.policy = policy_function
    
    def play(self, env, steps=500, score_augmentation=False, show=False, verbose=False, *args, **kwargs):
        state = env.reset()
        cumulative_score = 0
        if show:
            env.render()
        for step in range(steps):
            if verbose:
                print("state:", state)
            action = self.policy(state, *args, **kwargs)
            if verbose:
                print("action:", action)
            state, reward, terminal, info = env.step(action)
            if show:
                env.render()
            cumulative_score += reward
            if terminal:
                break
        env.close()
        return cumulative_score

In the BipedalWalker-v3 game, the agent receives a positive reward proportional to the distance walked across the terrain. An agent that makes it all the way to the far end can collect a total reward of 300+.

If the agent falls over, it receives a reward of -100.
There is also a small negative reward proportional to the torque applied to the joints, so that the agent learns to walk smoothly with minimal torque.
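
To make the structure of this reward concrete, here is a rough sketch with the three ingredients described above. The function name `sketch_step_reward` and the coefficients `distance_scale` and `torque_penalty` are placeholders invented for this illustration; they are not the constants used inside BipedalWalker-v3.

```python
def sketch_step_reward(distance_gained, torques, fell_over,
                       distance_scale=1.0, torque_penalty=0.01):
    """Illustrative per-step reward: progress minus effort, or -100 on a fall.

    distance_scale and torque_penalty are placeholder values chosen for
    this sketch; they are NOT the constants used by BipedalWalker-v3.
    """
    if fell_over:
        return -100.0
    effort = sum(abs(t) for t in torques)
    return distance_scale * distance_gained - torque_penalty * effort


# The same progress is worth slightly less when it costs more torque.
print(sketch_step_reward(1.0, [0.0, 0.0, 0.0, 0.0], fell_over=False))
print(sketch_step_reward(1.0, [1.0, -1.0, 1.0, -1.0], fell_over=False))
print(sketch_step_reward(1.0, [0.0, 0.0, 0.0, 0.0], fell_over=True))
```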

Let's define a couple of test policy functions to play with:

In [42]:
def random_policy(state):
    """This agent returns random actions."""
    return np.random.uniform(low=-1.0, high=1.0, size=4)


def stupid_policy(state):
    """A very simple expert system."""
    if state[LEG_1_GROUND_CONTACT_FLAG] == 1 and state[LEG_2_GROUND_CONTACT_FLAG] == 1:
        return np.random.uniform(low=-1.0, high=1.0, size=4)
    if state[KNEE_JOINT_1_SPEED] < -0.5:
        return np.array([-0.05, -0.2, 0., 0.])
    if state[KNEE_JOINT_2_SPEED] < -0.5:
        return np.array([0., 0., -0.05, -0.2])
    return np.random.uniform(low=-1.0, high=1.0, size=4)

In [43]:
my_agent = Agent(stupid_policy)
my_agent.play(ENV, steps=100, show=False, verbose=False)

-97.54617795890783

## Model-free RL

This section will provide a simple template for model-free reinforcement learning.

The goal in model-free RL is to sample game iterations and thereby train a model which correctly predicts the future reward given the current state and candidate action. The 'future reward' may be the reward in the next step of the game, or the cumulative reward over many future steps.

The choice of action can then be as simple as choosing the action which leads to the largest future reward.
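
As a sketch of what a learned reward predictor could look like, the class below simply remembers sampled `(state, action, reward)` triples and predicts the reward of the closest stored sample. This nearest-neighbour approach and the class name `NearestNeighbourRewardPredictor` are assumptions chosen for brevity, not the method this notebook will eventually use. It is written without NumPy so that it runs on plain lists as well as arrays.

```python
class NearestNeighbourRewardPredictor:
    """Predict the reward of (state, action) from the closest stored sample."""

    def __init__(self):
        self.samples = []  # list of (features, reward) pairs

    def add_sample(self, state, action, reward):
        """Store one observed transition as a flat feature vector."""
        features = [float(x) for x in state] + [float(x) for x in action]
        self.samples.append((features, float(reward)))

    def __call__(self, state, action):
        """Return the reward of the stored sample nearest to (state, action)."""
        if not self.samples:
            return 0.0  # no data yet: fall back to a neutral guess
        query = [float(x) for x in state] + [float(x) for x in action]

        def squared_distance(sample):
            return sum((q - f) ** 2 for q, f in zip(query, sample[0]))

        return min(self.samples, key=squared_distance)[1]


predictor = NearestNeighbourRewardPredictor()
predictor.add_sample([0.0, 0.0], [1.0], reward=2.0)
predictor.add_sample([0.0, 0.0], [-1.0], reward=-3.0)
print(predictor([0.1, 0.0], [0.9]))  # nearest sample is the first one: 2.0
```

Because an instance is callable as `predictor(state, action)`, it could stand in wherever a reward-predicting function of a state and a candidate action is expected.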

Below, we define a basic ModelFreeAgent class which can serve as a template for creating your own custom model-free agent.

In [44]:
class ModelFreeAgent(Agent):
    """
    Template for a model-free RL agent. 
    
    To instantiate an agent, the user must provide the following functions:
    
    policy_function(): A method for choosing what action to take next. This can be a selection among a discrete set
                       of actions (as in the probability of each legal move for a current board position),
                       or a function which chooses an action vector by some heuristic (applied when the action space
                       is a continuous-valued vector, like the four torques for our Bipedal Walker).
    
    action_generator(): A method for generating candidate actions. This can be deterministic (generate *all* legal moves
                        for a given chess board position) or probabilistic (generate candidate actions by sampling from
                        some - possibly learned - distribution over the action space)
                           
    reward_predictor(): A method for predicting the reward for a given state:action pair (s_t, a_t)
    """
    
    def __init__(self, policy_function, action_generator, reward_predictor):
        self.policy = policy_function
        self.action_generator = action_generator
        self.reward_predictor = reward_predictor
        

In [47]:
def model_free_policy(current_state, *args, **kwargs):
    """
    The model-free agent chooses an action by generating candidate actions, predicting the future reward for each
    candidate action, and then applying its policy function to select among the action:reward pairs. In this example
    the policy is simply to choose the action which maximizes the predicted reward in the next time step.

    Returns the next action to be input to the environment.
    """
    action_generator = kwargs['action_generator']
    reward_predictor = kwargs['reward_predictor']
    
    # Generate a list of candidate actions
    candidate_actions = action_generator(current_state)
    # Predict the future reward for each candidate action
    action_reward_pairs = []
    for action in candidate_actions:
        reward = reward_predictor(current_state, action)
        action_reward_pairs.append((action, reward))
    # Apply the policy function to the list of action:reward pairs. 
    # In this case, the policy is to choose the maximum predicted reward for the next time step.
    best_action = max(action_reward_pairs, key=itemgetter(1))[0]

    return best_action

        
def model_free_action_generator(state):
    """
    The action generator for our model-free agent. In this example, the function generates four random action vectors.
    
    (This code is a template which should be replaced for optimal agent performance)
    """
    return np.random.uniform(low=-1.0, high=1.0, size=(4,4))


def model_free_reward_predictor(current_state, candidate_action):
    """
    The function to predict the reward for a given environment state and candidate action.
    
    (This code is a template which should be replaced for optimal agent performance)
    """
    return np.random.uniform(low=-100.0, high=10.0)

Let's test out our template model-free agent. We expect it to do about as well as the random agent from above, since the candidate actions and predicted rewards are providing absolutely zero additional information to the agent.

In [49]:
my_model_free_agent = ModelFreeAgent(model_free_policy, model_free_action_generator, model_free_reward_predictor)

my_model_free_agent.play(ENV,
                         steps=100,
                         action_generator=my_model_free_agent.action_generator,
                         reward_predictor=my_model_free_agent.reward_predictor)

-101.88813562640368
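
A single episode score like the ones above is noisy, so comparing agents is more reliable when the score is averaged over several episodes. The helper below is a minimal sketch; `evaluate` is a hypothetical name, and `fake_episode` is a synthetic stand-in for a call such as `my_agent.play(ENV, steps=100)` so that the example runs without Gym installed.

```python
import random
import statistics


def evaluate(play_episode, episodes=20):
    """Run play_episode() several times; return (mean, standard deviation)."""
    scores = [play_episode() for _ in range(episodes)]
    return statistics.mean(scores), statistics.stdev(scores)


# Stand-in for `lambda: my_agent.play(ENV, steps=100)`. The scores it
# produces are synthetic and only illustrate the bookkeeping.
rng = random.Random(0)


def fake_episode():
    return rng.gauss(-100.0, 5.0)


mean_score, std_score = evaluate(fake_episode)
print("mean %.1f +/- %.1f over 20 episodes" % (mean_score, std_score))
```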

## Model-based RL

This section will provide a simple template for model-based reinforcement learning.

The goal in model-based RL is to sample game iterations and thereby train a model which correctly predicts the future environment state and corresponding reward, given the current state and candidate action. 

The main advantage of model-based RL is that a model for the environment's evolution allows the agent to predict the cumulative reward over many future steps. This allows the agent to 'plan ahead', hopefully leading to a more successful policy.

The choice of action can then be as simple as choosing the action which leads to the largest cumulative future reward.

Below, we define a basic ModelBasedAgent class which can serve as a template for creating your own custom model-based agent.

In [50]:
class ModelBasedAgent(Agent):
    """
    Template for a model-based RL agent. 
    
    To instantiate an agent, the user must provide the following functions:
    
    policy_function(): A method for choosing what action to take next. This can be a selection among a discrete set
                       of actions (as in the probability of each legal move for a current board position),
                       or a function which chooses an action vector by some heuristic (applied when the action space
                       is a continuous-valued vector, like the four torques for our Bipedal Walker).

    action_generator(): A method for generating candidate actions. This can be deterministic (generate *all* legal moves
                        for a given chess board position) or probabilistic (generate candidate actions by sampling from
                        some - possibly learned - distribution over the action space)
                        
    environment_model(): A method for predicting the next environment state given the current environment state and the 
                         candidate action. 
                         
    reward_predictor(): A method for predicting the reward for a given state:action pair (s_t, a_t)
    """
    
    def __init__(self, policy_function, action_generator, environment_model, reward_predictor):
        self.policy = policy_function
        self.action_generator = action_generator
        self.environment_model = environment_model        
        self.reward_predictor = reward_predictor
        

In [65]:
def model_based_policy(current_state, *args, **kwargs):
    """
    The model-based agent chooses an action by iteratively generating candidate actions and the corresponding
    predicted future environment state, up to some depth, and then predicting the reward for each
    candidate action in each trajectory. 
    The agent then applies its policy function to select the action which is expected to maximize the future
    cumulative reward. 
    
    In this example the policy is simply to choose the action which maximizes the predicted cumulative reward
    over the next <depth> number of steps. That is, the action which leads to the largest total reward across
    all subsequent steps.
    
    This implementation generates the default number of candidate trajectories from the action_generator method
    (say, <C> candidate trajectories) at each step, leading to <C>**<depth> random sample trajectories to compare.

    Returns the next action to be input to the environment.
    """
    action_generator = kwargs['action_generator']
    environment_model = kwargs['environment_model']
    reward_predictor = kwargs['reward_predictor']
    depth = kwargs['depth']
        
    # Build a search tree of sampled trajectories, one level per planning step.
    # Each node stores [predicted_state, first_action, cumulative_reward], where
    # first_action is the depth-1 action that the trajectory started with.
    frontier = [[current_state, None, 0.0]]
    for d in range(depth):
        next_frontier = []
        for state, first_action, cumulative_reward in frontier:
            candidate_actions = action_generator(state)

            for action in candidate_actions:
                new_state = environment_model(state, action)
                # Predict the reward from the simulated state, not the root state
                reward = reward_predictor(state, action)
                root_action = action if first_action is None else first_action
                next_frontier.append([new_state, root_action, cumulative_reward + reward])
        frontier = next_frontier

    # Apply the policy function to the sampled trajectories.
    # In this case, the policy is to take the first action of the trajectory
    # with the maximum cumulative predicted reward.
    best_action = max(frontier, key=itemgetter(2))[1]

    return best_action

        
def model_based_action_generator(state):
    """
    The action generator for our model-based agent. In this example, the function generates four random action vectors.
    
    (This code is a template which should be replaced for optimal agent performance)
    """
    return np.random.uniform(low=-1.0, high=1.0, size=(4,4))


def model_based_environment_model(current_state, candidate_action):
    """
    The environment model for our model-based agent. In this example, the function generates a random state vector.
    
    (This code is a template which should be replaced for optimal agent performance)
    """
    return ENV.observation_space.sample()


def model_based_reward_predictor(current_state, candidate_action):
    """
    The function to predict the reward for a given environment state and candidate action.
    
    (This code is a template which should be replaced for optimal agent performance)
    """
    return np.random.uniform(low=-100.0, high=10.0)

Let's test out our template model-based agent. We expect it to do about as well as the random agent from above, since the candidate actions, predicted states, and predicted rewards are all random and provide no additional information to the agent.

In [66]:
my_model_based_agent = ModelBasedAgent(model_based_policy, model_based_action_generator,
                                       model_based_environment_model, model_based_reward_predictor)

my_model_based_agent.play(ENV,
                          steps=100,
                          action_generator=my_model_based_agent.action_generator,
                          environment_model=my_model_based_agent.environment_model,
                          reward_predictor=my_model_based_agent.reward_predictor,
                          depth=3)

-9.24223374316765
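
To see why planning ahead can help once the model is informative, consider the toy problem below. It is an assumed example, unrelated to the walker: positions on a line, a small penalty for arriving at position 1, and a large payoff at position 2. The function `plan_first_action` is a hypothetical, exhaustive variant of the search idea above rather than a drop-in replacement for `model_based_policy`.

```python
from itertools import product


def plan_first_action(state, actions, environment_model, reward_predictor, depth):
    """Score every action sequence of length `depth` and return the first
    action of the sequence with the largest summed predicted reward."""
    best_total, best_first = float("-inf"), None
    for sequence in product(actions, repeat=depth):
        simulated, total = state, 0.0
        for action in sequence:
            total += reward_predictor(simulated, action)
            simulated = environment_model(simulated, action)
        if total > best_total:
            best_total, best_first = total, sequence[0]
    return best_first


# Toy environment (assumed for illustration): the state is an integer position.
ARRIVAL_REWARD = {-1: 1.0, 1: -1.0, 2: 10.0}


def toy_model(position, action):
    """Deterministic model: the action is added to the position."""
    return position + action


def toy_reward(position, action):
    """Reward for the position reached by taking `action` from `position`."""
    return ARRIVAL_REWARD.get(position + action, 0.0)


print(plan_first_action(0, [-1, 1], toy_model, toy_reward, depth=1))  # -1
print(plan_first_action(0, [-1, 1], toy_model, toy_reward, depth=2))  # 1
```

With a depth of 1 the planner is greedy and steps left for the small immediate reward. With a depth of 2 it accepts the penalty at position 1 in order to reach the larger payoff at position 2.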

Note that we currently do not include any training for either model-based or model-free RL. This is still to come!