<h1> Reinforcement Learning Using KerasRL</h1>
Reinforcement learning (RL) trains machine learning models to make a sequence of decisions. The agent learns to achieve a goal in an uncertain, potentially complex environment. In reinforcement learning, an artificial intelligence faces a game-like situation and uses trial and error to find a solution to the problem. The picture below shows the RL setting, in which the agent interacts with the environment. The agent takes an action from a set of actions (for example, moving up in a game). The action takes the agent and the environment to the same or a new state, and the agent then receives a reward for it.  
<img src="images/rl.png" width=500><br>
RL is studied in several fields under different names, such as optimal control theory, game theory, and operations research. There are a variety of approaches to solving RL problems.  <br><br>
There are several RL packages for Python. Here we focus on Keras-RL. <br>
For installing keras-rl you can use: <code>pip install keras-rl</code><br>
For using Keras-RL you might need to install openmpi on your machine.<br>

<br>Also, for simulating environments, we are going to use gym. Gym includes some classic RL problems and provides an animated representation of the environment. 
<br>To install gym on your machine, you can use: <code>conda install -c conda-forge gym</code>
Let's explore gym first before going further.

In [None]:
# If you are on macOS, Big Sur has issues with gym and raises many warnings. Ignore them for now.
# This cell pops up another window that shows the animated environment.
import gym
env = gym.make('CartPole-v0')
env.reset()
for _ in range(100):
    env.render()
    env.step(env.action_space.sample()) # take a random action
env.close()

The step function interacts with the environment and returns four values, as follows:
<ul>
  <li><code class="highlighter-rouge">observation</code> (<strong>object</strong>): an environment-specific object representing your observation of the environment. For example, pixel data from a camera, joint angles and joint velocities of a robot, or the board state in a board game.</li>
  <li><code class="highlighter-rouge">reward</code> (<strong>float</strong>): amount of reward achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward.</li>
  <li><code class="highlighter-rouge">done</code> (<strong>boolean</strong>): whether it’s time to <code class="highlighter-rouge">reset</code> the environment again. Most (but not all) tasks are divided up into well-defined episodes, and <code class="highlighter-rouge">done</code> being <code class="highlighter-rouge">True</code> indicates the episode has terminated. (For example, perhaps the pole tipped too far, or you lost your last life.)</li>
  <li><code class="highlighter-rouge">info</code> (<strong>dict</strong>): diagnostic information useful for debugging. It can sometimes be useful for learning (for example, it might contain the raw probabilities behind the environment’s last state change). However, official evaluations of your agent are not allowed to use this for learning.</li>
</ul>

This is just an implementation of the classic “agent-environment loop”. Each timestep, the agent chooses an action, and the environment returns an observation and a reward. The process gets started by calling reset(), which returns an initial observation. So a more proper way of writing the previous code would be to respect the done flag:

In [None]:
# A better way of implementing agent-env loop
import gym
env = gym.make('CartPole-v0')
for i_episode in range(1):
    observation = env.reset()
    for t in range(10):
        env.render()
        print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break
env.close()

<h2> Cart Pole V0 System</h2>
In the CartPole-v0 environment, a pole is attached to a cart moving along a frictionless track. The pole starts upright and the goal of the agent is to prevent it from falling over by applying a force of -1 or +1 to the cart. A reward of +1 is given for every time step the pole remains upright. An episode ends when (1) the pole is more than 15 degrees from vertical or (2) the cart moves more than 2.4 units from the center.
<br>The problem is considered "solved" when the average total reward for the episode reaches 195 over 100 consecutive trials.
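
The "solved" criterion above can be sketched as a small check over the most recent episode rewards. This is only an illustration: the helper name `is_solved` and its parameters are hypothetical, and the threshold and window simply mirror the numbers quoted above.

```python
from collections import deque


def is_solved(episode_rewards, threshold=195.0, window=100):
    """Return True once the mean of the last `window` episode rewards reaches `threshold`."""
    # Keep only the most recent `window` rewards.
    recent = deque(episode_rewards, maxlen=window)
    if len(recent) < window:
        return False  # not enough consecutive trials yet
    return sum(recent) / window >= threshold


print(is_solved([200.0] * 100))  # enough episodes, high average
print(is_solved([200.0] * 50))   # too few episodes to decide
```

A full 100-episode window is required before the check can succeed, so an agent cannot be declared successful after a handful of lucky episodes.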

<h1>Actor - Critic Learning</h1>
<h2> Introduction</h2>
If you’re learning to play Go, one of the best ways to improve is to get a stronger player to review your games. Sometimes the most useful feedback just points out where you won or lost the game. The reviewer might give comments like, “You were already far behind by move 30” or “At move 110, you had a winning position, but your opponent turned it around by move 130.”

Why is this feedback helpful? You may not have time to scrutinize all 300 moves in a game, but you can focus your full attention on a 10- or 20-move sequence. The reviewer lets you know which parts of the game are important.

Reinforcement-learning researchers apply this principle in actor-critic learning, which is a combination of policy learning and value learning. The policy function plays the role of the actor: it picks what moves to play. The value function is the critic: it tracks whether the agent is ahead or behind in the course of the game. That feedback guides the training process, in the same way that a game review can guide your own study.
<h2>Method</h2>
Actor-Critic methods are temporal difference (TD) learning methods that represent the policy function independent of the value function.

A policy function (or policy) returns a probability distribution over actions that the agent can take based on the given state. A value function determines the expected return for an agent starting at a given state and acting according to a particular policy forever after.

In the Actor-Critic method, the policy is referred to as the actor that proposes a set of possible actions given a state, and the estimated value function is referred to as the critic, which evaluates actions taken by the actor based on the given policy. In this tutorial, both the Actor and Critic will be represented using one neural network with two outputs.
   

In [None]:
import gym
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Configuration parameters for the whole setup
seed = 42
gamma = 0.99  # Discount factor for past rewards
max_steps_per_episode = 10000
env = gym.make("CartPole-v0")  # Create the environment
env.seed(seed)
eps = np.finfo(np.float32).eps.item()  # Smallest number such that 1.0 + eps != 1.0

<h2>Model</h2>
The Actor and Critic will be modeled using one neural network that generates the action probabilities and critic value respectively. 

During the forward pass, the model will take in the state as the input and will output both action probabilities and critic value V, which models the state-dependent value function. The goal is to train a model that chooses actions based on a policy π that maximizes expected return.

For CartPole-v0, four values represent the state: cart position, cart velocity, pole angle, and pole velocity. The agent can take two actions: push the cart left (0) or push it right (1).

In [None]:
num_inputs = 4
num_actions = 2
num_hidden = 128

inputs = layers.Input(shape=(num_inputs,))
common = layers.Dense(num_hidden, activation="relu")(inputs)
action = layers.Dense(num_actions, activation="softmax")(common)
critic = layers.Dense(1)(common)

model = keras.Model(inputs=inputs, outputs=[action, critic])

<h2>Training</h2>
To train the agent, you will follow these steps:

<ol>
<li>Run the agent on the environment to collect training data per episode.</li>
<li>Compute expected return at each time step.</li>
<li>Compute the loss for the combined actor-critic model.</li>
<li>Compute gradients and update network parameters.</li>
<li>Repeat 1-4 until either success criterion or max episodes has been reached.</li>
</ol>
<h3>Collecting Data</h3>
As in supervised learning, in order to train the actor-critic model, you need to have training data. However, in order to collect such data, the model would need to be "run" in the environment.

Training data is collected for each episode. Then at each time step, the model's forward pass will be run on the environment's state in order to generate action probabilities and the critic value based on the current policy parameterized by the model's weights.

The next action will be sampled from the action probabilities generated by the model, which would then be applied to the environment, causing the next state and reward to be generated.
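
As a rough illustration of the sampling step, the sketch below draws an action index from a probability vector and records its log probability. The probabilities are made-up stand-ins for a model output, and `sample_action` is a hypothetical helper written with the standard library only.

```python
import math
import random


def sample_action(action_probs, rng=random):
    """Draw one action index according to the given probabilities."""
    actions = range(len(action_probs))
    return rng.choices(actions, weights=action_probs, k=1)[0]


# Hypothetical model output for one CartPole state: P(left), P(right).
action_probs = [0.7, 0.3]
rng = random.Random(42)  # seeded only to make the sketch reproducible
action = sample_action(action_probs, rng)
log_prob = math.log(action_probs[action])  # kept for the actor loss later
print(action, log_prob)
```

Sampling, rather than always taking the most probable action, keeps the agent exploring while the policy is still being learned.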


<h2> Computing the expected return</h2>
The sequence of rewards {r_t} for t = 1, ..., T collected during one episode is converted into a sequence of expected returns {G_t} for t = 1, ..., T, in which the sum of rewards is taken from the current timestep t to T and each reward is multiplied by an exponentially decaying discount factor γ:

<img src="images/reward.png" width=200>
Since γ is between 0 and 1, rewards further out from the current timestep are given less weight.

Intuitively, discounting means that rewards now are better than rewards later. Mathematically, it ensures that the sum of the rewards converges.
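
A minimal sketch of this computation, assuming a plain Python list of per-step rewards: each reward r_t' with t' ≥ t contributes γ^(t' - t) · r_t' to G_t, and walking the list backwards lets every return reuse the one after it. The function name `discounted_returns` is hypothetical.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t for every timestep, where G_t = r_t + gamma * G_{t+1}."""
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()  # restore chronological order
    return returns


print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

With γ = 0.5, the final step is worth 1.0, the step before it 1 + 0.5 · 1.0 = 1.5, and the first step 1 + 0.5 · 1.5 = 1.75, so earlier steps accumulate more discounted future reward.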

<h2> The Actor-Critic Loss</h2>
Since a hybrid actor-critic model is used, the chosen loss function is a combination of actor and critic losses for training, as shown below:
<br>Loss = L_Actor + L_Critic<br>
The actor loss is based on policy gradients with the critic as a state-dependent baseline, and is computed with single-sample (per-episode) estimates.
<img src="images/l_actor.png" width =350>
where:<br>

T: the number of timesteps per episode, which can vary per episode<br>
s_t: the state at timestep t<br>
a_t: the action chosen at timestep t given state s_t<br>
π_Θ: the policy (actor), parameterized by Θ<br>
V_Θ^π: the value function (critic), also parameterized by Θ<br>
G = G_t: the expected return for a given state-action pair at timestep t<br>
A negative sign is applied to the sum because the idea is to maximize the probabilities of actions yielding higher rewards by minimizing the combined loss.
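
To make the definitions concrete, here is a small sketch of the actor loss as described: the negated sum over timesteps of log π(a_t | s_t) multiplied by (G_t - V(s_t)). The inputs are made-up numbers rather than real model outputs, and `actor_loss` is a hypothetical helper.

```python
import math


def actor_loss(chosen_action_probs, returns, values):
    """Negated sum of log pi(a_t | s_t) * (G_t - V(s_t)) over one episode."""
    return -sum(
        math.log(p) * (g - v)
        for p, g, v in zip(chosen_action_probs, returns, values)
    )


# Hypothetical two-step episode.
probs = [0.5, 0.5]    # probability the policy gave to each chosen action
returns = [1.0, 0.0]  # G_t
values = [0.0, 0.0]   # critic estimates V(s_t)
print(actor_loss(probs, returns, values))
```

In this example only the first step has a positive G - V term, so minimizing the loss pushes up the probability of the action chosen there, while the second step contributes nothing.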

<h2> Advantage</h2>
The G - V term in our L_Actor formulation is called the advantage, which indicates how much better an action is, given a particular state, than a random action selected according to the policy π for that state.

While it's possible to exclude a baseline, this may result in high variance during training. The nice thing about choosing the critic V as a baseline is that it is trained to be as close as possible to G, leading to lower variance.

In addition, without the critic, the algorithm would try to increase probabilities for actions taken on a particular state based on expected return, which may not make much of a difference if the relative probabilities between actions remain the same.

For instance, suppose that two actions for a given state would yield the same expected return. Without the critic, the algorithm would try to raise the probability of these actions based on the objective J. With the critic, it may turn out that there's no advantage (G-V) and thus no benefit gained in increasing the actions' probabilities and the algorithm would set the gradients to zero.
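
The zero-advantage case can be checked numerically. The sketch below assumes a softmax policy over raw scores (logits) and uses the standard identity that the derivative of log softmax(z)_a with respect to z_k is 1[k = a] - π_k, so the gradient of -log π(a) · (G - V) with respect to the logits is -(G - V) · (onehot(a) - π). The helper names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def actor_grad_wrt_logits(logits, action, advantage):
    """Gradient of -log pi(action) * advantage with respect to each logit."""
    probs = softmax(logits)
    return [
        -advantage * ((1.0 if k == action else 0.0) - p)
        for k, p in enumerate(probs)
    ]


# Zero advantage: every gradient component is zero, so nothing changes.
print(actor_grad_wrt_logits([0.0, 0.0], action=0, advantage=0.0))
# Positive advantage: a descent step raises the chosen action's logit.
print(actor_grad_wrt_logits([0.0, 0.0], action=0, advantage=1.0))
```

A gradient-descent step subtracts this gradient, so a positive advantage raises the logit of the chosen action and lowers the other, while a zero advantage leaves the policy untouched.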

<h2>Critic Loss</h2>
Training V to be as close as possible to G can be set up as a regression problem with the following loss function:<br>
<img src="images/critic_loss.png" width=150>
where L_δ is the Huber loss, which is less sensitive to outliers in data than squared-error loss.
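
For intuition, the Huber loss can be written out by hand: it is quadratic for small errors and linear for large ones. The sketch below is a plain-Python illustration with δ = 1.0 assumed (the default of `keras.losses.Huber`), and `huber` and `critic_loss` are hypothetical helper names, not part of any library.

```python
def huber(error, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond it."""
    abs_err = abs(error)
    if abs_err <= delta:
        return 0.5 * error ** 2
    return delta * (abs_err - 0.5 * delta)


def critic_loss(returns, values, delta=1.0):
    """Sum of Huber losses between returns G_t and critic estimates V(s_t)."""
    return sum(huber(g - v, delta) for g, v in zip(returns, values))


# A small error (0.5) is penalized quadratically, a large one (4.0) linearly.
print(critic_loss([1.0, 5.0], [0.5, 1.0]))  # 0.125 + 3.5 = 3.625
```

A squared-error loss would charge 0.5 · 4.0² = 8.0 for the large error instead of 3.5, which is why a single surprising return disturbs the critic less under the Huber loss.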

In [None]:
optimizer = keras.optimizers.Adam(learning_rate=0.01)
huber_loss = keras.losses.Huber()
action_probs_history = []
critic_value_history = []
rewards_history = []
running_reward = 0
episode_count = 0

while True:  # Run until solved
    state = env.reset()
    episode_reward = 0
    with tf.GradientTape() as tape:
        for timestep in range(1, max_steps_per_episode):
            # env.render(); Adding this line would show the attempts
            # of the agent in a pop up window.

            state = tf.convert_to_tensor(state)
            state = tf.expand_dims(state, 0)

            # Predict action probabilities and estimated future rewards
            # from environment state
            action_probs, critic_value = model(state)
            critic_value_history.append(critic_value[0, 0])

            # Sample action from action probability distribution
            action = np.random.choice(num_actions, p=np.squeeze(action_probs))
            action_probs_history.append(tf.math.log(action_probs[0, action]))

            # Apply the sampled action in our environment
            state, reward, done, _ = env.step(action)
            rewards_history.append(reward)
            episode_reward += reward

            if done:
                break

        # Update running reward to check condition for solving
        running_reward = 0.05 * episode_reward + (1 - 0.05) * running_reward

        # Calculate expected value from rewards
        # - At each timestep what was the total reward received after that timestep
        # - Rewards in the past are discounted by multiplying them with gamma
        # - These are the labels for our critic
        returns = []
        discounted_sum = 0
        for r in rewards_history[::-1]:
            discounted_sum = r + gamma * discounted_sum
            returns.insert(0, discounted_sum)

        # Normalize
        returns = np.array(returns)
        returns = (returns - np.mean(returns)) / (np.std(returns) + eps)
        returns = returns.tolist()

        # Calculating loss values to update our network
        history = zip(action_probs_history, critic_value_history, returns)
        actor_losses = []
        critic_losses = []
        for log_prob, value, ret in history:
            # At this point in history, the critic estimated that we would get a
            # total reward = `value` in the future. We took an action with log probability
            # of `log_prob` and ended up receiving a total reward = `ret`.
            # The actor must be updated so that it predicts an action that leads to
            # high rewards (compared to critic's estimate) with high probability.
            diff = ret - value
            actor_losses.append(-log_prob * diff)  # actor loss

            # The critic must be updated so that it predicts a better estimate of
            # the future rewards.
            critic_losses.append(
                huber_loss(tf.expand_dims(value, 0), tf.expand_dims(ret, 0))
            )

        # Backpropagation
        loss_value = sum(actor_losses) + sum(critic_losses)
        grads = tape.gradient(loss_value, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))

        # Clear the loss and reward history
        action_probs_history.clear()
        critic_value_history.clear()
        rewards_history.clear()

    # Log details
    episode_count += 1
    if episode_count % 10 == 0:
        template = "running reward: {:.2f} at episode {}"
        print(template.format(running_reward, episode_count))

    if running_reward > 195:  # Condition to consider the task solved
        print("Solved at episode {}!".format(episode_count))
        break

In [None]:
# Let's check the trained model.
for i_episode in range(1):
    state = env.reset()
    for t in range(10000):
        env.render()
        print(state)
        # action = env.action_space.sample()
        state = tf.convert_to_tensor(state)
        state = tf.expand_dims(state, 0)

        # Predict action probabilities and estimated future rewards
        # from environment state
        action_probs, critic_value = model(state)
        action = np.random.choice(num_actions, p=np.squeeze(action_probs))
        state, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break
env.close()