<div class="title">Introduction to Reinforcement Learning</div>
<div class="subtitle">Advanced Methods in Machine Learning</div>
<div class="author">Carlos María Alaíz Gudín - Universidad Autónoma de Madrid</div>

---

**Initial Configuration**

This cell configures the Jupyter notebook.

In [1]:
%%html
<head><link rel="stylesheet" href="style.css"></head>

In [None]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

This cell imports the packages to be used (all of them quite standard except for `Utils`, which is provided with the notebook).

In [None]:
import numpy as np

import gym
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"

import tensorflow as tf
gpu_devices = tf.config.experimental.list_physical_devices("GPU")
for device in gpu_devices:
    tf.config.experimental.set_memory_growth(device, True)
from tensorflow import keras
from IPython import display

from Utils import generate_sequence

seed = 123

# Reinforcement Learning

## Definition

Alongside **Supervised Learning** and **Unsupervised Learning**, there is a third machine learning paradigm called **Reinforcement Learning** (RL), which studies how an agent should take actions in an environment in order to maximize a cumulative reward.

This type of problem is usually formalized as a Markov Decision Process (MDP), with the following elements:
* $s_t$ is the state at time $t$. Some of the states are terminal, so they end the episode.
* $r_t$ is the reward obtained at time $t$.
* $a_t$ is the action taken by the agent at time $t$.
* $p_s(s' | s, a)$ is the transition model, which models the probability of moving from state $s$ to state $s'$ when taking the action $a$.
* $p_r(r | s, a)$ is the reward model, which models the probability of getting reward $r$ from state $s$ when taking the action $a$.

The objective in RL is to find a policy, that is, a distribution $\pi(a | s)$ over actions given the current state, that maximizes the expected cumulative reward.

## Solving RL Problems

If the true probabilities $p_s$ and $p_r$ are known, then the optimal policy $\pi(a | s)$ can be calculated using dynamic programming.
In practice, however, these distributions are usually unknown, and there are two main approaches to solving RL problems:
1. The **model-based methods** try to model $p_s$ and $p_r$, and then derive $\pi(a | s)$ from these models.
1. The **model-free methods** try to directly optimize the policy $\pi(a | s)$.
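
For the case in which $p_s$ and $p_r$ are known, one standard dynamic programming scheme is value iteration. The following is a minimal sketch on a hypothetical two-state MDP. The arrays `P` and `R`, the discount `gamma` and the function name are assumptions made for illustration and are not part of the notebook; the reward model is summarized here by its expected value.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Value iteration for a tabular MDP (illustrative sketch).

    P[s, a, s2] is the probability of moving from s to s2 under action a.
    R[s, a] is the expected reward for taking action a in state s.
    """
    V = np.zeros(P.shape[0])
    for _ in range(max_iter):
        # Bellman optimality backup for every state-action pair.
        Q = R + gamma * (P @ V)
        V_new = Q.max(axis=1)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    Q = R + gamma * (P @ V)
    # Greedy (deterministic) policy with respect to the final values.
    return V, Q.argmax(axis=1)

# Hypothetical MDP: in state 0, action 0 stays put (reward 0) and
# action 1 moves to the absorbing state 1 (reward 1).
P = np.zeros((2, 2, 2))
P[0, 0, 0] = 1.0
P[0, 1, 1] = 1.0
P[1, :, 1] = 1.0
R = np.array([[0.0, 1.0],
              [0.0, 0.0]])

V, policy = value_iteration(P, R)
print(V, policy)
```

In this toy problem the greedy policy picks action 1 in state 0, since waiting only delays the single available reward.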

A couple of model-free methods are briefly described next.

### Policy Gradient Method

This method directly optimizes the expected reward with respect to the policy.

Since the distributions are unknown, the expected reward has no explicit form.
Nevertheless, its gradient with respect to the policy parameters can be estimated by Monte Carlo sampling, that is, by averaging over episodes generated with the current policy.

In practice, several episodes (trials of the agent) are run, and the obtained rewards are used to promote the strategies that lead to good results.
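
The idea can be illustrated on a deliberately simplified problem. The sketch below assumes one-step episodes (a two-armed bandit) and a softmax policy over two logits; the function names, the learning rate and the running-average baseline are illustrative choices, not something prescribed by this notebook.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def reinforce_bandit(mean_rewards, episodes=3000, lr=0.1, seed=0):
    """Monte Carlo policy gradient on a bandit (illustrative sketch)."""
    mean_rewards = np.asarray(mean_rewards, dtype=float)
    rng = np.random.default_rng(seed)
    theta = np.zeros(len(mean_rewards))  # policy parameters (logits)
    baseline = 0.0                       # running average of the rewards
    for _ in range(episodes):
        probs = softmax(theta)
        a = rng.choice(len(theta), p=probs)            # sample an action
        r = mean_rewards[a] + rng.normal(scale=0.1)    # noisy reward
        # Gradient of log pi(a) w.r.t. the logits: one_hot(a) - probs.
        grad_log_pi = -probs
        grad_log_pi[a] += 1.0
        # Promote actions whose reward is above the baseline.
        theta += lr * (r - baseline) * grad_log_pi
        baseline += 0.05 * (r - baseline)
    return softmax(theta)

probs = reinforce_bandit([0.2, 1.0])
print(probs)
```

After training, most of the probability mass should sit on the second action, which has the higher mean reward.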

### Actor-Critic Method

In this case, the agent learns to map the observed state to two outputs:
1. The policy, $\pi(a | s)$. The part of the agent responsible for this output is called the **Actor**.
1. The estimated future reward. Specifically, the sum of all rewards it expects to receive from the current state onwards. The part of the agent responsible for this output is the **Critic**.

The Critic learns its task by comparing the rewards obtained in the sampled real episodes with its predictions, and applying gradient descent.
The Actor, in turn, uses a slight modification of the policy gradient to learn its task, so that it also uses the information provided by the Critic.
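
One common way to write the two objectives, given here as a sketch rather than as the only possible formulation, uses the return $G_t$ observed from step $t$ onwards, the Critic's estimate $V(s_t)$, a discount factor $\gamma \in (0, 1]$ and a regression loss $\ell$ (these symbols are introduced here for illustration):

$$
G_t = \sum_{k \ge 0} \gamma^k r_{t+k}, \qquad
L_{\text{actor}} = -\sum_t \log \pi(a_t | s_t) \, \bigl(G_t - V(s_t)\bigr), \qquad
L_{\text{critic}} = \sum_t \ell\bigl(V(s_t), G_t\bigr).
$$

The factor $G_t - V(s_t)$ is positive when an action led to a better outcome than the Critic expected, so minimizing $L_{\text{actor}}$ makes such actions more likely.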

# CartPole v0

## Introduction

A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track.
The pendulum starts upright, and the goal is to prevent it from falling over by increasing and reducing the cart's velocity.

### Observation (State)

| Index | Observation | Min | Max|
|:-:|:-:|:-:|:-:|
| 0 | Cart Position | $$-2.4$$ | $$2.4$$ |
| 1 | Cart Velocity | $$-\infty$$ | $$\infty$$ |
| 2 | Pole Angle | $$\sim -24^\circ$$ | $$\sim 24^\circ$$ |
| 3 | Pole Velocity At Tip | $$-\infty$$ | $$\infty$$ |

For the **starting value**, every observation is drawn uniformly at random from $[-0.05, 0.05]$.

### Actions and Reward

| Index | Action |
|:-:|:-:|
| 0 | Push cart to the left |
| 1 | Push cart to the right |

The **reward** is $1$ for every step taken, including the termination step.

### Episode Termination

The episode finishes when any of the following conditions is met:
1. Pole Angle is outside the range $\pm 12^\circ$.
2. Cart Position is outside the range $\pm 2.4$ (center of the cart reaches the edge of the display).
3. Episode length is greater than $200$ (the limit for `CartPole-v0`; the value $500$ applies to `CartPole-v1`).
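
The three conditions can be sketched as a small check. This is only an illustration: the function name and the `max_steps` argument are hypothetical, and the observation is assumed to store the pole angle in radians, in the order given in the observation table.

```python
import math

X_LIMIT = 2.4                   # cart position limit
ANGLE_LIMIT = math.radians(12)  # pole angle limit, in radians

def episode_finished(observation, steps_taken, max_steps):
    """Return True if any of the termination conditions holds."""
    x, _, angle, _ = observation
    return (
        abs(angle) > ANGLE_LIMIT       # pole has tilted too far
        or abs(x) > X_LIMIT            # cart has left the track
        or steps_taken >= max_steps    # step budget is used up
    )

print(episode_finished((0.0, 0.0, 0.3, 0.0), steps_taken=10, max_steps=200))
```

The example call describes a pole tilted by 0.3 rad (about 17 degrees), which is beyond the angle limit.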

## Random Policy

The following cell creates the environment.

In [None]:
env = gym.make('CartPole-v0')

This code generates an episode by following a random policy (each action is selected at random, ignoring the state).

In [None]:
generate_sequence(env, model=None, early_stop=True)
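
Since `generate_sequence` lives in `Utils` and is not shown here, the following sketch illustrates what a random-policy episode amounts to. It is not the actual implementation: `ToyEnv` and `random_episode` are hypothetical names, and the toy environment only mimics the `reset()` / `step()` interface (with the 4-tuple return value) used in this notebook.

```python
import numpy as np

class ToyEnv:
    """Hypothetical stand-in environment that ends after `horizon` steps."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return np.zeros(4)

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        # Same return layout as in the notebook: state, reward, done, info.
        return np.zeros(4), 1.0, done, {}

def random_episode(env, num_actions, rng, max_steps=1000):
    """Run one episode choosing every action uniformly at random."""
    env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = int(rng.integers(num_actions))  # the state is ignored
        _, reward, done, _ = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward

total = random_episode(ToyEnv(horizon=5), num_actions=2,
                       rng=np.random.default_rng(0))
print(total)
```

With a reward of 1 per step, the total reward of an episode is simply its length, which is why longer episodes indicate a better policy in CartPole.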

## Actor-Critic Method with Neural Networks

### Configuration

This cell configures the model.
The parameter $\gamma$ (variable `gamma`) is the discount factor: it determines how much less a reward counts the further in the future it is obtained.

In [None]:
gamma = 0.99
max_steps_per_episode = 10000
env.seed(seed)
eps = np.finfo(np.float32).eps.item()
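
To make the role of the discount factor concrete, the sketch below computes the discounted return $G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots$ for every step of a short episode. The function name and the example values are illustrative assumptions.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Return G_t for every step t, accumulating from the last reward."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

example = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
print(example)
```

Earlier steps obtain larger returns because more rewards still lie ahead of them, while a smaller $\gamma$ makes the distant rewards count less.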

### Model

The following cell builds the neural network that will implement both the Actor and the Critic, sharing the input and hidden layers, and using a different output layer for the policy (Actor) and the estimation of the rewards (Critic).

In [None]:
num_inputs = env.observation_space.shape[0]
num_actions = env.action_space.n
num_hidden = 128

inputs = keras.layers.Input(shape=(num_inputs,))
common = keras.layers.Dense(num_hidden, activation="relu")(inputs)
action = keras.layers.Dense(num_actions, activation="softmax")(common)
critic = keras.layers.Dense(1)(common)

model = keras.Model(inputs=inputs, outputs=[action, critic])

<div class="qst">

* What is the activation function of each output layer? Why?

</div>

### Training

The training consists of completing episodes under the current policy, recording at every step the log-probability of the chosen action, the Critic's estimate and the reward obtained.
Once the episode is finished, the losses of both the Actor and the Critic are computed from these records.
The weights are then updated using the gradient of the sum of both losses.

In [None]:
optimizer = keras.optimizers.Adam(learning_rate=0.01)
huber_loss = keras.losses.Huber()
action_probs_history = []
critic_value_history = []
rewards_history = []
running_reward = 0
episode_count = 0

while True:
    state = env.reset()
    episode_reward = 0
    with tf.GradientTape() as tape:
        for timestep in range(1, max_steps_per_episode):
            # Show the attempts.
            env.render()

            # Estimate the policy (prediction of the next actions) and the future rewards using the model.
            state = tf.convert_to_tensor(state)
            state = tf.expand_dims(state, 0)
            action_probs, critic_value = model(state)
            critic_value_history.append(critic_value[0, 0])

            # Choose random action using the policy.
            action = np.random.choice(num_actions, p=np.squeeze(action_probs))
            action_probs_history.append(tf.math.log(action_probs[0, action]))

            # Apply the sampled action.
            state, reward, done, _ = env.step(action)
            rewards_history.append(reward)
            episode_reward += reward

            if done:
                break

        # Once the episode is finished, update running reward to check condition for solving.
        running_reward = 0.05 * episode_reward + (1 - 0.05) * running_reward

        # Calculate the discounted return at every step, accumulating from the last reward backwards.
        returns = []
        discounted_sum = 0
        for r in rewards_history[::-1]:
            discounted_sum = r + gamma * discounted_sum
            returns.insert(0, discounted_sum)
        returns = np.array(returns)
        returns = (returns - np.mean(returns)) / (np.std(returns) + eps)
        returns = returns.tolist()

        # Compute the loss values (both for Actor and Critic) to update the network.
        history = zip(action_probs_history, critic_value_history, returns)
        actor_losses = []
        critic_losses = []
        for log_prob, value, ret in history:
            diff = ret - value
            actor_losses.append(-log_prob * diff)
            critic_losses.append(huber_loss(tf.expand_dims(value, 0), tf.expand_dims(ret, 0)))

        # Update the weights through backpropagation.
        loss_value = sum(actor_losses) + sum(critic_losses)
        grads = tape.gradient(loss_value, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))

        # Clear variables.
        action_probs_history.clear()
        critic_value_history.clear()
        rewards_history.clear()

    episode_count += 1
    if episode_count % 10 == 0:
        print("Running reward: %6.2f at episode %3d" % (running_reward, episode_count))

    if running_reward > 80:
        print("Solved at episode {}!".format(episode_count))
        break
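
The running reward used as a stopping criterion is an exponential moving average that starts at zero, so it lags behind the actual performance. The sketch below, with a hypothetical helper name and a hypothetical constant episode reward, counts how many episodes the average needs to cross a threshold.

```python
def episodes_to_threshold(episode_reward, threshold, weight=0.05,
                          max_episodes=10_000):
    """Count episodes until the moving average exceeds the threshold.

    Assumes every episode obtains the same reward; returns None if the
    threshold is not exceeded within max_episodes.
    """
    running = 0.0
    for episode in range(1, max_episodes + 1):
        running = weight * episode_reward + (1 - weight) * running
        if running > threshold:
            return episode
    return None

n = episodes_to_threshold(episode_reward=200, threshold=80)
print(n)
```

Even an agent that scored 200 in every episode from the start would need several episodes before the average crossed 80, so the reported episode count somewhat overstates how long learning took.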

### Trained Model

The trained model is able to generate much longer episodes than the random policy.

In [None]:
generate_sequence(env, model=model)