<a href="https://colab.research.google.com/github/marcinwolter/MachineLearning-KISD-2024/blob/main/lecture6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center>




#**<font color = "red">Introduction to machine learning</font>**

**Lecture 6**


##**<font color = "green">Reinforcement learning</font>**

*17 April 2024*


---

*Marcin Wolter, IFJ PAN*

*e-mail: marcin.wolter@ifj.edu.pl*


---
</center>

#<font color='green'>**Program for today:**


* ###  <font color='red'>Reinforcement learning: how to train a robot?


<br>


**As always, all slides are here:**

*https://github.com/marcinwolter/MachineLearning-KISD-2024*

<br>




# <font color='green'> **Reinforcement learning**

##**Definition**
Reinforcement Learning (RL) is the science of decision making. It is about learning the optimal behavior in an environment to obtain maximum reward. This optimal behavior is learned through interactions with the environment and observations of how it responds, similar to children exploring the world around them and learning the actions that help them achieve a goal.

In the absence of a supervisor, the learner must independently discover the sequence of actions that maximizes the reward. This discovery process is akin to a trial-and-error search. The quality of actions is measured not just by the immediate reward they return, but also by the delayed reward they might fetch. Because it can learn the actions that result in eventual success in an unseen environment without the help of a supervisor, reinforcement learning is a very powerful family of methods.

##**How Does Reinforcement Learning Work?**
The Reinforcement Learning problem involves an agent exploring an unknown environment to achieve a goal. RL is based on the hypothesis that all goals can be described by the maximization of expected cumulative reward. The agent must learn to sense and perturb the state of the environment using its actions to derive maximal reward. The formal framework for RL borrows from the problem of optimal control of Markov Decision Processes (MDP).

The main elements of an RL system are:

1. The agent or the learner
2. The environment the agent interacts with
3. The policy that the agent follows to take actions
4. The reward signal that the agent observes upon taking actions
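
The four elements above can be sketched as a short interaction loop. This is only an illustration: `CoinFlipEnv`, `random_policy` and `run_episode` are hypothetical names invented for this sketch, and the toy environment (guessing a biased coin) is an assumption, not part of the lecture.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess the outcome of a biased coin."""

    def __init__(self, p_heads=0.8, episode_length=10, seed=0):
        self.p_heads = p_heads
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.steps = 0

    def reset(self):
        self.steps = 0
        return 0  # the environment has a single dummy state

    def step(self, action):
        # action: 1 = guess heads, 0 = guess tails
        outcome = 1 if self.rng.random() < self.p_heads else 0
        reward = 1.0 if action == outcome else 0.0  # the reward signal
        self.steps += 1
        done = self.steps >= self.episode_length
        return 0, reward, done


def random_policy(state, rng):
    """The policy maps a state to an action (here it ignores the state)."""
    return rng.choice([0, 1])


def run_episode(env, policy, rng):
    """The agent acts, the environment answers with a new state and a reward."""
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(state, rng)
        state, reward, done = env.step(action)
        total_reward += reward
    return total_reward


rng = random.Random(1)
env = CoinFlipEnv()
total = run_episode(env, random_policy, rng)
print("Total reward collected in one episode:", total)
```

A learning agent would replace `random_policy` with a policy that is adjusted after each episode so that the total reward grows.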

A useful abstraction of the reward signal is the value function, which faithfully captures the ‘goodness’ of a state. While the reward signal represents the immediate benefit of being in a certain state, the value function captures the cumulative reward that is expected to be collected from that state on, going into the future. The objective of an RL algorithm is to discover the action policy that maximizes the average value that it can extract from every state of the system.
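
To make the difference between immediate reward and cumulative future reward concrete, here is a minimal sketch that computes the discounted return from every time step of a recorded reward sequence. The function name `discounted_returns` and the discount factor `gamma` are assumptions of this sketch; the discount factor weights later rewards less than earlier ones.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every t."""
    returns = []
    running = 0.0
    # Walk backwards so each return reuses the one that follows it
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


rewards = [1.0, 1.0, 1.0]
returns = discounted_returns(rewards, gamma=0.5)
print(returns)  # [1.75, 1.5, 1.0]
```

The value function of a state is the expectation of this quantity over the trajectories that start in that state.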

<img src='https://images.synopsys.com/is/image/synopsys/reinforcement-learningV1-02?qlt=82&wid=1200&ts=1680107783898&$responsive$&fit=constrain&dpr=off' width=500px>

## **Examples of Reinforcement Learning**
Any real-world problem where an agent must interact with an uncertain environment to meet a specific goal is a potential application of RL. Here are a few RL success stories:

1. **Robotics.** Robots with pre-programmed behavior are useful in structured environments, such as the assembly line of an automobile manufacturing plant, where the task is repetitive in nature. In the real world, where the response of the environment to the behavior of the robot is uncertain, pre-programming accurate actions is nearly impossible. In such scenarios, RL provides an efficient way to build general-purpose robots. It has been successfully applied to robotic path planning, where a robot must find a short, smooth, and navigable path between two locations, void of collisions and compatible with the dynamics of the robot.

2. **AlphaGo.** One of the most complex strategic games is a 3,000-year-old Chinese board game called Go. Its complexity stems from the fact that there are roughly 10^170 legal board positions, many orders of magnitude more than in the game of chess. In 2016, an RL-based Go agent called AlphaGo defeated Lee Sedol, one of the world's strongest Go players. It first learned from records of games played by strong human players and then improved through experience by playing against itself. Its successor, AlphaGo Zero, learned entirely by playing against itself, at a scale of practice that no human player can match.

3. **Autonomous Driving.** An autonomous driving system must perform multiple perception and planning tasks in an uncertain environment. Some specific tasks where RL finds application include vehicle path planning and motion prediction. Vehicle path planning requires several low-level and high-level policies to make decisions over varying temporal and spatial scales. Motion prediction is the task of predicting the movement of pedestrians and other vehicles, to understand how the situation might develop based on the current state of the environment.

<h2><strong>How Actor-Critic works:</strong></h2>
<p>Imagine you play a video game with a friend who gives you feedback. You're the Actor, and your friend is the Critic:</p>
<p><img class="img-fluid" style="display: block; margin-left: auto; margin-right: auto;" src="https://pylessons.com/media/Tutorials/Reinforcement-learning-tutorial/A2C-reinforcement-learning/09_A2C-reinforcement-learning.png" alt="" width="949" height="534" /></p>

In the beginning, you don't know how to play, so you try some actions at random. The Critic observes your actions and provides feedback.
<p>In the Actor-Critic Methods:</p>
<ul>
<li>The "Critic" estimates the value function. This could be the action-value (the Q value) or state-value (the V value). It critiques the action that the Actor selected, providing feedback on how to adjust. When it is trained with Q-learning, it can take advantage of efficiency tricks such as memory replay.</li>
<li>The "Actor" selects the actions and updates its policy in the direction suggested by the Critic.</li>
</ul>
<p>We update both the Actor network and the Critic network at each update step.</p>
<p>The Critic's feedback is often expressed as the advantage, A(s, a) = Q(s, a) - V(s). Intuitively, it tells how much better it is to take a specific action than the average action at the given state. So, using the Value function as the baseline function, we subtract the Value from the Q value.</p>
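
As a minimal numeric sketch of this subtraction, assume made-up Q values and a uniform policy for a single state with two actions. None of these numbers come from the lecture; they only illustrate that actions better than average get a positive advantage and worse ones a negative advantage.

```python
import numpy as np

# Hypothetical action-value estimates Q(s, a) for one state with two actions
q_values = np.array([1.0, 3.0])
# Hypothetical policy probabilities pi(a | s) for the same state
policy = np.array([0.5, 0.5])

# The state value V(s) is the policy-weighted average of the Q values
state_value = float(np.dot(policy, q_values))

# Advantage: how much better each action is than the average action
advantages = q_values - state_value

print(state_value)  # 2.0
print(advantages)   # [-1.  1.]
```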

# Actor Critic Method

**Author:** [Apoorv Nandan](https://twitter.com/NandanApoorv)<br>
Modified by Marcin Wolter<br>
**Date created:** 2020/05/13<br>
**Last modified:** 2023/04/16<br>
**Description:** Implement Actor Critic Method in CartPole environment.

## Introduction

This script shows an implementation of the Actor Critic method on the CartPole-v1 environment.

### Actor Critic Method

As an agent takes actions and moves through an environment, it learns to map
the observed state of the environment to two possible outputs:

1. Recommended action: A probability value for each action in the action space.
   The part of the agent responsible for this output is called the **actor**.
2. Estimated rewards in the future: Sum of all rewards it expects to receive in the
   future. The part of the agent responsible for this output is the **critic**.

Actor and Critic learn to perform their tasks, such that the recommended actions
from the actor maximize the rewards.
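
The two outputs can be illustrated without any neural network. The sketch below assumes hypothetical raw actor scores (`logits`) and a hypothetical critic estimate; it turns the scores into action probabilities with a softmax and samples one action from them.

```python
import numpy as np


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


rng = np.random.default_rng(0)

logits = np.array([0.2, 1.5])   # hypothetical actor scores for two actions
action_probs = softmax(logits)  # actor output: one probability per action
critic_value = 12.3             # hypothetical critic output: expected future reward

# The agent samples its action from the actor's distribution
action = rng.choice(len(action_probs), p=action_probs)
print("probabilities:", action_probs, "sampled action:", action)
```

Sampling, rather than always taking the most probable action, keeps some exploration in the agent's behavior.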

### CartPole-v1

A pole is attached to a cart placed on a frictionless track. The agent has to apply
force to move the cart. It is rewarded for every time step the pole
remains upright. The agent, therefore, must learn to keep the pole from falling over.

### References

- [CartPole](http://www.derongliu.org/adp/adp-cdrom/Barto1983.pdf)
- [CartPole in GYM](https://www.gymlibrary.dev/environments/classic_control/cart_pole/)
- [Actor Critic Method](https://hal.inria.fr/hal-00840470/document)


# **Cart Pole**
<figure class="align-default" id="id1">
<a class="reference internal image-reference" href="https://www.gymlibrary.dev/_images/cart_pole.gif"><img alt="https://www.gymlibrary.dev/_images/cart_pole.gif" src="https://www.gymlibrary.dev/_images/cart_pole.gif" style="width: 200px;" /></a>
</figure>
<p>This environment is part of the <a href='https://www.gymlibrary.dev/environments/classic_control/'>Classic Control environments</a>. Please read that page first for general information.</p>
<div class="table-wrapper colwidths-auto docutils container">
<table class="docutils align-default">
<thead>
<tr class="row-odd"><th class="head"><p></p></th>
<th class="head"><p></p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td><p>Action Space</p></td>
<td><p>Discrete(2)</p></td>
</tr>
<tr class="row-odd"><td><p>Observation Shape</p></td>
<td><p>(4,)</p></td>
</tr>
<tr class="row-even"><td><p>Observation High</p></td>
<td><p>[4.8   inf 0.42  inf]</p></td>
</tr>
<tr class="row-odd"><td><p>Observation Low</p></td>
<td><p>[-4.8   -inf -0.42  -inf]</p></td>
</tr>
<tr class="row-even"><td><p>Import</p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">gym.make(&quot;CartPole-v1&quot;)</span></code></p></td>
</tr>
</tbody>
</table>
</div>
<section id="description">
<h2>Description</h2>
<p>This environment corresponds to the version of the cart-pole problem described by Barto, Sutton, and Anderson in
<a class="reference external" href="https://ieeexplore.ieee.org/document/6313077">“Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems”</a>.
A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track.
The pendulum is placed upright on the cart and the goal is to balance the pole by applying forces
in the left and right direction on the cart.</p>
</section>
<section id="action-space">
<h2>Action Space</h2>
<p>The action is a <code class="docutils literal notranslate"><span class="pre">ndarray</span></code> with shape <code class="docutils literal notranslate"><span class="pre">(1,)</span></code> which can take values <code class="docutils literal notranslate"><span class="pre">{0,</span> <span class="pre">1}</span></code> indicating the direction
of the fixed force the cart is pushed with.</p>
<div class="table-wrapper colwidths-auto docutils container">
<table class="docutils align-default">
<thead>
<tr class="row-odd"><th class="head"><p>Num</p></th>
<th class="head"><p>Action</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td><p>0</p></td>
<td><p>Push cart to the left</p></td>
</tr>
<tr class="row-odd"><td><p>1</p></td>
<td><p>Push cart to the right</p></td>
</tr>
</tbody>
</table>
</div>
<p><strong>Note</strong>: The change in velocity produced by the applied force is not fixed; it depends on the angle
at which the pole is pointing. The position of the pole's center of gravity changes the amount of energy needed to move the cart underneath it.</p>
</section>
<section id="observation-space">
<h2>Observation Space</h2>
<p>The observation is a <code class="docutils literal notranslate"><span class="pre">ndarray</span></code> with shape <code class="docutils literal notranslate"><span class="pre">(4,)</span></code> with the values corresponding to the following positions and velocities:</p>
<div class="table-wrapper colwidths-auto docutils container">
<table class="docutils align-default">
<thead>
<tr class="row-odd"><th class="head"><p>Num</p></th>
<th class="head"><p>Observation</p></th>
<th class="head"><p>Min</p></th>
<th class="head"><p>Max</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td><p>0</p></td>
<td><p>Cart Position</p></td>
<td><p>-4.8</p></td>
<td><p>4.8</p></td>
</tr>
<tr class="row-odd"><td><p>1</p></td>
<td><p>Cart Velocity</p></td>
<td><p>-Inf</p></td>
<td><p>Inf</p></td>
</tr>
<tr class="row-even"><td><p>2</p></td>
<td><p>Pole Angle</p></td>
<td><p>~ -0.418 rad (-24°)</p></td>
<td><p>~ 0.418 rad (24°)</p></td>
</tr>
<tr class="row-odd"><td><p>3</p></td>
<td><p>Pole Angular Velocity</p></td>
<td><p>-Inf</p></td>
<td><p>Inf</p></td>
</tr>
</tbody>
</table>
</div>
<p><strong>Note:</strong> While the ranges above denote the possible values for observation space of each element,
it is not reflective of the allowed values of the state space in an unterminated episode. Particularly:</p>
<ul class="simple">
<li><p>The cart x-position (index 0) can take values between <code class="docutils literal notranslate"><span class="pre">(-4.8,</span> <span class="pre">4.8)</span></code>, but the episode terminates
if the cart leaves the <code class="docutils literal notranslate"><span class="pre">(-2.4,</span> <span class="pre">2.4)</span></code> range.</p></li>
<li><p>The pole angle can be observed between  <code class="docutils literal notranslate"><span class="pre">(-.418,</span> <span class="pre">.418)</span></code> radians (or <strong>±24°</strong>), but the episode terminates
if the pole angle is not in the range <code class="docutils literal notranslate"><span class="pre">(-.2095,</span> <span class="pre">.2095)</span></code> (or <strong>±12°</strong>)</p></li>
</ul>
</section>
<section id="rewards">
<h2>Rewards</h2>
<p>Since the goal is to keep the pole upright for as long as possible, a reward of <code class="docutils literal notranslate"><span class="pre">+1</span></code> for every step taken,
including the termination step, is allotted. The threshold for rewards is 475 for v1.</p>
</section>
<section id="starting-state">
<h2>Starting State</h2>
<p>All observations are assigned a uniformly random value in <code class="docutils literal notranslate"><span class="pre">(-0.05,</span> <span class="pre">0.05)</span></code></p>
</section>
<section id="episode-end">
<h2>Episode End</h2>
<p>The episode ends if any one of the following occurs:</p>
<ol class="arabic simple">
<li><p>Termination: Pole Angle is greater than ±12°</p></li>
<li><p>Termination: Cart Position is greater than ±2.4 (center of the cart reaches the edge of the display)</p></li>
<li><p>Truncation: Episode length is greater than 500 (200 for v0)</p></li>
</ol>
</section>
<section id="arguments">
<h2>Arguments</h2>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">gym</span><span class="o">.</span><span class="n">make</span><span class="p">(</span><span class="s1">&#39;CartPole-v1&#39;</span><span class="p">)</span>
</pre></div>
</div>
<p>No additional arguments are currently supported.</p>
</section>
</section>
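
The episode-end rules above can be written as a small check. This is a sketch of the rules as stated in the text, not the code used inside the gym library; the function name `episode_status` and the constants are hypothetical, and the step limit is interpreted here as "the episode stops once 500 steps have been taken".

```python
import math

X_LIMIT = 2.4                     # cart position limit from the text
ANGLE_LIMIT = 12 * math.pi / 180  # 12 degrees, about 0.2095 rad
MAX_STEPS = 500                   # step limit for CartPole-v1


def episode_status(observation, step_count):
    """Classify an episode as 'terminated', 'truncated' or 'running'.

    observation: (cart position, cart velocity, pole angle, pole angular velocity)
    step_count:  number of steps taken so far in this episode
    """
    cart_position, _, pole_angle, _ = observation
    if abs(cart_position) > X_LIMIT or abs(pole_angle) > ANGLE_LIMIT:
        return "terminated"  # the pole fell too far or the cart left the track
    if step_count >= MAX_STEPS:
        return "truncated"   # the time limit was reached with the pole still up
    return "running"


print(episode_status((0.0, 0.0, 0.05, 0.0), step_count=10))
print(episode_status((2.5, 0.0, 0.0, 0.0), step_count=10))
print(episode_status((0.0, 0.0, 0.0, 0.0), step_count=500))
```

Because the agent receives +1 for every step, the largest total reward it can collect in one CartPole-v1 episode is 500.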

## Setup


In [None]:
import gym
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# VideoRecorder is used by make_video() and by the training loop
from gym.wrappers.monitoring.video_recorder import VideoRecorder
import os
import matplotlib.pyplot as plt
from IPython import display


In [None]:

# Configuration parameters for the whole setup
seed = 42
gamma = 0.99  # Discount factor for future rewards
max_steps_per_episode = 10000  # Upper bound; CartPole-v1 itself truncates episodes at 500 steps
env = gym.make("CartPole-v1")  # Create the environment
env.reset(seed=seed)
eps = np.finfo(np.float32).eps.item()  # Smallest number such that 1.0 + eps != 1.0




## Implement Actor Critic network

This network learns two functions:

1. Actor: This takes as input the state of our environment and returns a
probability value for each action in its action space.
2. Critic: This takes as input the state of our environment and returns
an estimate of total rewards in the future.

In our implementation, they share the initial layer.


In [None]:
num_inputs = 4    # four parameters describing the cart state
num_actions = 2   # move the cart left or right
num_hidden = 128

inputs = layers.Input(shape=(num_inputs,))
common = layers.Dense(num_hidden, activation="relu")(inputs)
action = layers.Dense(num_actions, activation="softmax")(common)
critic = layers.Dense(1)(common)

model = keras.Model(inputs=inputs, outputs=[action, critic])

model.summary()

Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_3 (InputLayer)           [(None, 4)]          0           []                               
                                                                                                  
 dense_2 (Dense)                (None, 128)          640         ['input_3[0][0]']                
                                                                                                  
 dense_3 (Dense)                (None, 2)            258         ['dense_2[0][0]']                
                                                                                                  
 dense_4 (Dense)                (None, 1)            129         ['dense_2[0][0]']                
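
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The helper name `dense_params` is made up for this check.

```python
def dense_params(n_inputs, n_units):
    """Number of trainable parameters of a fully connected layer."""
    return n_inputs * n_units + n_units  # weights + biases


common_params = dense_params(4, 128)  # shared hidden layer
actor_params = dense_params(128, 2)   # actor head (softmax over 2 actions)
critic_params = dense_params(128, 1)  # critic head (single value)

print(common_params, actor_params, critic_params)    # 640 258 129
print(common_params + actor_params + critic_params)  # 1027
```

Most of the parameters sit in the shared layer, which is why sharing it between actor and critic keeps the model small.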
                                                                                            

In [None]:
def make_video(i, env, model):
    """Run one episode with the current model and record it to final_<i>.mp4."""

    video = VideoRecorder(env, 'final_'+str(i)+'.mp4')

    steps = 0
    done = False
    state = env.reset()
    while not done:
        env.render(mode='rgb_array')
        video.capture_frame()
        # Predict action probabilities and estimated future rewards
        # from environment state
        state = tf.convert_to_tensor(state)
        state = tf.expand_dims(state, 0)
        action_probs, critic_value = model(state)
        # Sample action from action probability distribution
        action = np.random.choice(num_actions, p=np.squeeze(action_probs))
        state, reward, done, _ = env.step(action)
        steps += 1

    print("Testing steps: {}: ".format(steps))
    video.close()


## Train
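
Before the full training loop, it may help to see the loss for a single time step in isolation. The sketch below uses made-up numbers for the action probability, the critic's estimate and the observed return. The `huber` function is a plain-Python version of the Huber loss with threshold `delta=1.0`, written here for illustration rather than taken from Keras.

```python
import math


def huber(error, delta=1.0):
    """Huber loss: quadratic for small errors, linear for large ones."""
    abs_err = abs(error)
    if abs_err <= delta:
        return 0.5 * error ** 2
    return delta * (abs_err - 0.5 * delta)


log_prob = math.log(0.6)  # hypothetical: the chosen action had probability 0.6
value = 0.2               # hypothetical critic estimate of the return
ret = 0.7                 # hypothetical (normalized) return actually observed

# The return was higher than expected, so the difference is positive
diff = ret - value

# Actor loss: minimizing it raises the probability of actions with positive diff
actor_loss = -log_prob * diff

# Critic loss: pulls the estimate towards the observed return
critic_loss = huber(ret - value)

total_loss = actor_loss + critic_loss
print(actor_loss, critic_loss, total_loss)
```

In the training loop these per-step terms are summed over the whole episode before one gradient step is taken.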


In [None]:
optimizer = keras.optimizers.Adam(learning_rate=0.01)
huber_loss = keras.losses.Huber()
action_probs_history = []
critic_value_history = []
rewards_history = []
running_reward = 0
episode_count = 0

from gym.wrappers.monitoring.video_recorder import VideoRecorder
video = VideoRecorder(env, 'training.mp4')
while True:  # Run until solved

    state = env.reset()

    episode_reward = 0
    with tf.GradientTape() as tape:
        for timestep in range(1, max_steps_per_episode):
            # Render the current frame as an RGB array and add it to the
            # training video.
            env.render(mode='rgb_array')
            video.capture_frame()  # capture the video frame

            state = tf.convert_to_tensor(state)
            state = tf.expand_dims(state, 0)

            # Predict action probabilities and estimated future rewards
            # from environment state
            action_probs, critic_value = model(state)
            critic_value_history.append(critic_value[0, 0])

            # Sample action from action probability distribution
            action = np.random.choice(num_actions, p=np.squeeze(action_probs))
            action_probs_history.append(tf.math.log(action_probs[0, action]))

            # Apply the sampled action in our environment
            state, reward, done, _ = env.step(action)
            rewards_history.append(reward)
            episode_reward += reward

            if done:
                break

        # Update running reward to check condition for solving
        running_reward = 0.05 * episode_reward + (1 - 0.05) * running_reward

        # Calculate expected value from rewards
        # - At each timestep what was the total reward received after that timestep
        # - Rewards further in the future are discounted by multiplying them with gamma
        # - These are the labels for our critic
        returns = []
        discounted_sum = 0
        for r in rewards_history[::-1]:
            discounted_sum = r + gamma * discounted_sum
            returns.insert(0, discounted_sum)

        # Normalize
        returns = np.array(returns)
        returns = (returns - np.mean(returns)) / (np.std(returns) + eps)
        returns = returns.tolist()

        # Calculating loss values to update our network
        history = zip(action_probs_history, critic_value_history, returns)
        actor_losses = []
        critic_losses = []
        for log_prob, value, ret in history:
            # At this point in history, the critic estimated that we would get a
            # total reward = `value` in the future. We took an action with log probability
            # of `log_prob` and ended up receiving a total reward = `ret`.
            # The actor must be updated so that it predicts an action that leads to
            # high rewards (compared to critic's estimate) with high probability.
            diff = ret - value
            actor_losses.append(-log_prob * diff)  # actor loss

            # The critic must be updated so that it predicts a better estimate of
            # the future rewards.
            critic_losses.append(
                huber_loss(tf.expand_dims(value, 0), tf.expand_dims(ret, 0))
            )

        # Backpropagation
        loss_value = sum(actor_losses) + sum(critic_losses)
        grads = tape.gradient(loss_value, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))

        # Clear the loss and reward history
        action_probs_history.clear()
        critic_value_history.clear()
        rewards_history.clear()

    # Log details
    episode_count += 1
    if episode_count % 10 == 0:
        template = "running reward: {:.2f} at episode {}"
        print(template.format(running_reward, episode_count))

        make_video(episode_count, env, model)


    if running_reward > 195:  # Condition to consider the task solved (195 is the CartPole-v0 threshold; v1 uses 475)
        print("Solved at episode {}!".format(episode_count))
        video.close()
        break




running reward: 10.25 at episode 10
Testing steps: 18: 
running reward: 17.16 at episode 20
Testing steps: 21: 
running reward: 22.37 at episode 30
Testing steps: 49: 
running reward: 26.78 at episode 40
Testing steps: 105: 
running reward: 38.56 at episode 50
Testing steps: 121: 
running reward: 53.16 at episode 60
Testing steps: 78: 
running reward: 44.92 at episode 70
Testing steps: 24: 
running reward: 77.10 at episode 80
Testing steps: 32: 
running reward: 103.78 at episode 90
Testing steps: 120: 
running reward: 133.37 at episode 100
Testing steps: 163: 
Solved at episode 105!
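
The running reward printed during training is an exponential moving average that starts at zero, so it lags behind the reward of individual episodes. The sketch below assumes a hypothetical agent that already collects the same reward in every episode and counts how many episodes pass before the average crosses the threshold of 195 used in the training loop. The function name `episodes_until_solved` is made up for this illustration.

```python
def episodes_until_solved(episode_reward, threshold=195.0, alpha=0.05):
    """Count episodes until the moving average of the reward exceeds the threshold."""
    if episode_reward <= threshold:
        raise ValueError("a constant reward at or below the threshold never solves the task")
    running = 0.0
    episodes = 0
    while running <= threshold:
        # Same update rule as in the training loop
        running = alpha * episode_reward + (1 - alpha) * running
        episodes += 1
    return episodes


n = episodes_until_solved(500.0)
print(n)  # 10
```

Even a perfect agent therefore needs several episodes before the stopping condition triggers, which is one reason the running reward in the log stays well below the best episode lengths.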


### **Record video after training**

In [None]:
make_video(episode_count, env, model)



Testing steps: 500: 


# **So, we have trained the robot!!!**

###**Film showing the training of a robot:**

[![Robot training video](http://img.youtube.com/vi/n2gE7n11h1Y/0.jpg)](https://www.youtube.com/watch?v=n2gE7n11h1Y "Robot training video")