# 18. Reinforcement Learning

### Learning to Optimize Rewards

In Reinforcement Learning, a software **agent** makes **observations** and takes **actions** within an **environment**, and in return it receives **rewards**. 

### Policy Search

The algorithm a software agent uses to determine its actions is called its **policy**. The central question is how to find the best policy (for example, the one that uses the least time or energy). This is what **policy search** is about. There are several approaches:

1. **Brute force**: Try out many different values for the policy parameters, and pick the combination that performs best.

2. **Genetic algorithms**: Randomly create N policies and try them out, then discard the worst X% and create new policies from the survivors (e.g. by copying them and adding random variation).

3. **Policy gradients**: Evaluate the gradients of the rewards with regard to the policy parameters, then tweak these parameters by following the gradients toward higher rewards.
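The genetic-algorithm approach from the list above can be sketched on a toy problem. This is only an illustration under stated assumptions: `evolve`, `toy_fitness`, the 20% survival rate, and the Gaussian mutation are all made up for the example, and `toy_fitness` stands in for "average reward of a policy over a few episodes".

```python
import random

def toy_fitness(params):
    # Hypothetical stand-in for a policy's average reward:
    # the best possible "policy" has every parameter equal to 0.5.
    return -sum((p - 0.5) ** 2 for p in params)

def evolve(fitness, n_params, n_policies=20, keep_fraction=0.2,
           n_generations=30, noise=0.1, seed=0):
    rng = random.Random(seed)
    # Randomly create N policies (each one is just a list of parameters).
    population = [[rng.uniform(-1, 1) for _ in range(n_params)]
                  for _ in range(n_policies)]
    n_keep = max(1, int(n_policies * keep_fraction))
    for _ in range(n_generations):
        # Rank the policies and keep only the best ones.
        population.sort(key=fitness, reverse=True)
        survivors = population[:n_keep]
        # Refill the population with mutated copies of the survivors.
        children = [[w + rng.gauss(0, noise) for w in rng.choice(survivors)]
                    for _ in range(n_policies - n_keep)]
        population = survivors + children
    return max(population, key=fitness)

best = evolve(toy_fitness, n_params=2)
print(best, toy_fitness(best))
```

Because the survivors are carried over unchanged, the best fitness in the population never decreases from one generation to the next.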

### Introduction to OpenAI Gym

One of the challenges of Reinforcement Learning is that in order to train an agent, you first need to have a working environment. Training in the real world is slow and expensive, so we usually resort to a simulated environment. **OpenAI Gym** is a toolkit that provides a wide variety of such simulated environments.

In [1]:
import gym

In [2]:
env = gym.make("CartPole-v1")

In [3]:
obs = env.reset()

In [4]:
obs

array([-0.02427601, -0.00117788,  0.00012786,  0.03746302])

CartPole is a 2D simulation of a cart that can be accelerated left or right in order to balance a pole placed on top of it. Each observation is a 1D array of four floats:

* Horizontal position (0.0 = center)
* Velocity (positive = right)
* Angle of the pole (0.0 = vertical)
* Angular velocity (positive = clockwise)

Let's see which actions are possible in this env:

In [5]:
env.action_space

Discrete(2)

Two discrete actions are allowed: 0 (accelerate left) and 1 (accelerate right). Since the pole is leaning slightly to the right (the angle `obs[2]` is positive), we accelerate right:

In [6]:
action = 1 # accelerate right

In [7]:
obs, reward, done, info = env.step(action) # execute the chosen action

In [8]:
obs

array([-0.02429957,  0.19394224,  0.00087712, -0.25517956])

In [9]:
reward # in this env reward is always 1

1.0

In [10]:
done # True when episode is over

False

In [11]:
info 

{}
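The calls above follow the older Gym API, where `reset()` returns only the observation and `step()` returns four values. In Gym 0.26 and later, and in Gymnasium, `reset()` returns an `(obs, info)` pair and `step()` returns five values, with `done` split into `terminated` and `truncated`. The helpers below are a hypothetical sketch (the names `unpack_reset` and `unpack_step` are not part of either library) that normalizes both conventions to the one used in these notes.

```python
def unpack_reset(result):
    """Return (obs, info) from either reset() convention."""
    # Newer API: reset() returns a 2-tuple whose second item is the info dict.
    if isinstance(result, tuple) and len(result) == 2 and isinstance(result[1], dict):
        return result
    # Older API: reset() returns the observation alone.
    return result, {}

def unpack_step(result):
    """Return (obs, reward, done, info) from a 4- or 5-item step() result."""
    if len(result) == 5:
        obs, reward, terminated, truncated, info = result
        # The episode is over if it ended naturally or hit the time limit.
        return obs, reward, terminated or truncated, info
    return result
```

With these, `obs, info = unpack_reset(env.reset())` and `obs, reward, done, info = unpack_step(env.step(action))` behave the same under either API version.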

Let's hardcode a simple policy: accelerate left when the pole is leaning to the left, and accelerate right when it is leaning to the right.

In [12]:
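def basic_policy(obs):
    # obs[2] is the pole angle: negative = leaning left, positive = leaning right
    angle = obs[2]
    return 0 if angle < 0 else 1  # 0 = accelerate left, 1 = accelerate right

totals = []  # total reward collected in each of the 500 episodes
for episode in range(500):
    episode_rewards = 0
    obs = env.reset()
    for step in range(200):  # cap each episode at 200 steps
        action = basic_policy(obs)
        obs, reward, done, info = env.step(action)
        episode_rewards += reward
        if done:  # the episode ended before the 200-step cap
            break
    totals.append(episode_rewards)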

In [13]:
# results
import numpy as np
np.mean(totals), np.std(totals), np.min(totals), np.max(totals)

(41.614, 8.417660244984946, 24.0, 68.0)

Even across 500 episodes, this policy never kept the pole upright for more than 68 consecutive steps, so there is clear room for improvement.

### Neural Network Policies

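Our neural network takes an observation as input and estimates a probability for each action. We then select an action at random according to those probabilities, rather than always taking the most likely one, so that the agent keeps exploring. Since CartPole has only two actions, a single sigmoid output neuron is enough: it gives the probability $p$ of one action, and the other action has probability $1 - p$.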

In [14]:
import tensorflow as tf
from tensorflow import keras

n_inputs = env.observation_space.shape[0] # = 4
model = keras.models.Sequential([
    keras.layers.Dense(5, activation="elu", input_shape=[n_inputs]),
    keras.layers.Dense(1, activation="sigmoid"),
])
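Turning the network's output into an action could look roughly like this. The sketch assumes the sigmoid output is read as the probability of action 0 (accelerate left); the notes do not fix that convention, and `sample_action` is a hypothetical helper, not a Keras or Gym function.

```python
import random

def sample_action(p_left, rng=random):
    """Return 0 (left) with probability p_left, otherwise 1 (right)."""
    # rng.random() is uniform in [0, 1), so the test below is true
    # with probability p_left.
    return 0 if rng.random() < p_left else 1

# Example: with p_left = 0.7, about 70% of the sampled actions are 0.
rng = random.Random(42)
actions = [sample_action(0.7, rng) for _ in range(10_000)]
print(actions.count(0) / len(actions))
```

In the notebook, `p_left` would be the model's prediction for the current observation instead of a fixed number.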

How do we train it? 

### Evaluating Actions: The Credit Assignment Problem

It's not possible to use our usual supervised approach here. For example, if the agent manages to balance the pole for 100 steps, how can it know which of the 100 actions it took were good, and which of them were bad? In other words, there is no target probability distribution to learn from. 

A strategy to tackle this issue is to evaluate an action based on the sum of all the rewards that come after it, usually applying a discount factor $\gamma$ at each step. 

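$$\text{Return} = r_1 + \gamma \, r_2 + \gamma^2 \, r_3 + \dots$$

where $r_1$ is the reward received right after the action, $r_2$ the reward one step later, and so on.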

The closer $\gamma$ is to 1, the more future rewards count compared with immediate ones; the closer it is to 0, the more the return is dominated by immediate rewards.
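The discounted return of every step in an episode can be computed in a single backward pass over the rewards. The function below is a minimal sketch; the name `discount_rewards` and the example rewards are assumptions chosen for illustration.

```python
def discount_rewards(rewards, gamma):
    """For each step, sum that step's reward and all later ones, discounted by gamma."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backward so each return reuses the one computed for the next step:
    # return[i] = reward[i] + gamma * return[i + 1]
    for i in reversed(range(len(rewards))):
        running = rewards[i] + gamma * running
        returns[i] = running
    return returns

print(discount_rewards([10, 0, -50], gamma=0.8))  # [-22.0, -40.0, -50.0]
```

In this example the first action looks bad (return of -22) even though its immediate reward was +10, because it was followed two steps later by a large penalty: 10 + 0.8 × 0 + 0.8² × (-50) = -22.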

### Policy Gradients

