[Hands-On Reinforcement Learning Course: Part 1 - From zero to hero, step by step](https://towardsdatascience.com/hands-on-reinforcement-learning-course-part-1-269b50e39d08)

by [Pau Labarta Bajo](https://pau-labarta-bajo.medium.com/?source=post_page-----269b50e39d08--------------------------------), Nov 27, 2021.  [[github - reinforcement_learning_training_loop.py]](https://gist.github.com/Paulescu/9a2270e4801162f1dbb3e183716c4fc5/raw/2a49f76707b52e9920f656ed771292e1dfa9e276/reinforcement_learning_training_loop.py)

Summarized and Revised by Ivan H.P. Lin, July 13 2022

# 0 - Contents
1. What is a Reinforcement Learning problem? 🤔
2. Policies 👮🏽 and value functions.
3. How to generate the training data? 📊
4. Python boilerplate code.🐍
5. Recap ✨
6. Homework 📚
7. What’s next? ❤️

Let’s start!

# 1 - What is a reinforcement learning problem? 🤔
**AGENT**, **ENVIRONMENT**, **ACTION**, and **REWARDS**
<figure><center>
<img src="https://miro.medium.com/max/850/1*w-BQXtdOQgzdZbcZ8yaiCQ.jpeg" width="60%" >
<figcaption>Reinforcement learning ingredients (Image by the author) </figcaption> 
</center></figure>

An intelligent **agent** 🤖 needs to learn, through trial and error, how to take **actions** inside an **environment** 🌎 in order to maximize a **cumulative reward**.

If you are asking yourself what exactly an agent, an environment, an action, and a reward are, you are on the right track.

The definition I just gave introduces a bunch of terms that you might not be familiar with. In fact, they are ambiguous on purpose. This generality is what makes RL applicable to a wide range of seemingly different learning problems. This is the philosophy behind mathematical modeling, which lies at the root of RL.

Let's use a few examples to explain **AGENT**, **ENVIRONMENT**, **ACTION**, and **REWARDS**:

## **AGENT**, **ENVIRONMENT**, **ACTION**, and **REWARDS**
### Example 1: learning to walk 🚶🏽‍♀️🚶🏿🚶‍♀️

### **AGENT**
* The **agent** is my son, Kai. And he wants to stand up and walk. His muscles are strong enough at this point in time to have a chance at it. The learning problem for him is: how to sequentially adjust his body position, including several angles on his legs, waist, back, and arms to balance his body and not fall.


### **ENVIRONMENT**
* The **environment** is the physical world surrounding him, including the laws of physics, the most important of which is gravity. Without gravity the learning-to-walk problem would change drastically, and even become irrelevant: why would you want to walk in a world where you can simply fly? Another important law in this learning problem is Newton’s third law, which in plain words says that if you fall on the floor, the floor hits you back with the same force. Ouch!

### **ACTION**
* The **actions** are all the updates in these body angles, which determine his body position and speed as he starts chasing things around. 
  - Sure, he can do other things at the same time, like imitating the sound of a cow, but these are probably not helping him accomplish his goal. We ignore these actions in our framework. Adding unnecessary actions does not change the modeling step, but it makes the problem harder to solve later on.

### **REWARDS**, and ***Cumulative Reward***
* The **reward** he receives is a stimulus coming from the brain, that makes him happy or makes him feel pain. There is the negative reward he experiences when falling on the floor, which is physical pain maybe followed by frustration. 
  - On the other hand, there are several things that contribute positively to his happiness, like the joy of getting to places faster 👶🚀, or the external stimulus that comes from my wife Jagoda and me when we say “Good job!” or “Bravo!” to each attempt and marginal improvement he shows.

* A little bit more about rewards 💰
Some actions might seem very appealing for the baby at the beginning, like trying to run to get a boost of excitement. However, he soon learns that in some (or most) cases he ends up falling on his face, and experiencing an extended period of pain and tears. This is why **intelligent agents maximize cumulative reward**, and **not marginal reward**. 
  * **They trade off short-term rewards against long-term ones. An action that gives an immediate reward but puts the body in a position about to fall is not an optimal one.**

* The frequency and intensity of the rewards are key for helping the agent learn. Very infrequent (sparse) feedback means harder learning. Think about it, if you do not know if what you do is good or bad, how can you learn? This is one of the main reasons why some RL problems are harder than others.

* Reward shaping is a tough modeling decision for many real-world RL problems.
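
To make the difference between immediate and cumulative reward concrete, here is a minimal sketch. The two strategies (`"run"` and `"walk"`) and every reward number are invented for illustration; nothing here is measured.

```python
# Hypothetical per-step rewards for two strategies (all numbers are invented).
# "run": an exciting first step, then a fall followed by a period of pain.
# "walk": modest but steady rewards.
rewards = {
    "run": [5, -10, -3, -3],
    "walk": [1, 1, 1, 1],
}


def immediate_reward(step_rewards):
    """Reward of the very first step only."""
    return step_rewards[0]


def cumulative_reward(step_rewards):
    """Sum of all rewards collected over the episode."""
    return sum(step_rewards)


# A short-sighted agent compares only the first reward of each strategy.
greedy_choice = max(rewards, key=lambda name: immediate_reward(rewards[name]))
# An agent maximizing cumulative reward compares the totals.
smart_choice = max(rewards, key=lambda name: cumulative_reward(rewards[name]))

print(greedy_choice)  # run
print(smart_choice)   # walk
```

With these made-up numbers, running looks best after one step (5 vs 1) but worst over the whole episode (-11 vs 4).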


## **AGENT**, **ENVIRONMENT**, **ACTION**, and **REWARDS**
### Example 2: learning to play monopoly like a pro 🎩🏨💰 
<img src="https://miro.medium.com/max/960/0*T_ritMztZYRTR74R.jpg" width="30%">

### **AGENT**, **ENVIRONMENT**, **ACTION**, and **REWARDS**


What would the 4 RL ingredients be?

* The **agent** is you, the one who wants to win at Monopoly.
* Your **actions** are the ones you see in the screenshot below:
<img src="https://miro.medium.com/max/1050/0*XVrxPYg9fKpAWtwj" width="70%">
* The **environment** is the current state of the game, including the *list of properties*, *positions*, and *cash amounts* each player has. 
  - There is also the strategy of your opponent, which is something you cannot predict and lies outside of your control.
* And the **reward** is 0, except in your last move, where it is +1 if you win the game, and -1 if you go bankrupt. 
  - This reward formulation makes sense but makes the problem hard to solve. 
  - As we said above, a sparser reward means a harder problem to solve. Because of this, there are other ways to model the reward that make it noisier but less sparse.

### **self-play**
When you play against another person in Monopoly, you do not know how she or he will play. 
* What you can do is play against yourself. As you learn to play better, your opponent does too (because it is you), forcing you to level up your game to keep on winning. You see the positive feedback loop.

This trick is called **self-play**. It gives us a path to bootstrap intelligence without using the external advice of an expert player.

  * **Self-play** is the main difference between [AlphaGo](https://deepmind.com/research/case-studies/alphago-the-story-so-far) and [AlphaGo Zero](https://deepmind.com/blog/article/alphago-zero-starting-scratch), the two models developed by DeepMind that play the game of Go better than any human.

## **AGENT**, **ENVIRONMENT**, **ACTION**, and **REWARDS**
### Example 3: learning to drive 🚗 

Learning to drive a car is not easy. The goal of the driver is clear: to get from point A to point B, comfortably for her and any passengers on board.

There are many external aspects to the driver that make driving challenging, including:

  - other drivers' behavior
  - traffic signs
  - pedestrian behavior
  - pavement conditions
  - weather conditions
  - … even fuel optimization (who wants to spend extra on this?)

How would we approach this problem with reinforcement learning?

### **AGENT**, **ENVIRONMENT**, **ACTION**, and **REWARDS**

* The **agent** is the driver who wants to get from A to B, comfortably.
* The **state of the environment** the driver observes includes many things: the position, speed and acceleration of the car, all other cars, passengers, road conditions and traffic signs. Transforming such a big vector of inputs into an appropriate action is challenging, as you can imagine.
* The **actions** are basically three: the direction of the steering wheel, throttle intensity and brake intensity.
* The **reward** after each action is a weighted sum of the different aspects you need to balance when driving. A decrease in distance to point B brings a positive reward, while an increase brings a negative one. To discourage collisions, getting too close to (or colliding with) another car or a pedestrian should bring a very big negative reward. Also, in order to encourage smooth driving, sharp changes in speed or direction contribute a negative reward.
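
As a rough sketch of such a weighted-sum reward, the function below combines progress, safety and smoothness terms. The field names, weights and the safe-gap threshold are all hypothetical and untuned; a real driving simulator would expose different signals.

```python
from dataclasses import dataclass


@dataclass
class StepInfo:
    """What happened during one driving step (hypothetical fields)."""
    distance_reduction_m: float  # > 0 if the car got closer to point B
    min_gap_m: float             # distance to the closest car or pedestrian
    speed_change_mps: float      # change in speed during the step
    steering_change_rad: float   # change in steering angle during the step


# Illustrative weights and threshold, not tuned values.
W_PROGRESS = 1.0
W_COLLISION = 100.0
W_SMOOTHNESS = 0.5
SAFE_GAP_M = 2.0


def driving_reward(info: StepInfo) -> float:
    """Weighted sum: progress minus safety and smoothness penalties."""
    progress = W_PROGRESS * info.distance_reduction_m
    collision_penalty = W_COLLISION if info.min_gap_m < SAFE_GAP_M else 0.0
    smoothness_penalty = W_SMOOTHNESS * (
        abs(info.speed_change_mps) + abs(info.steering_change_rad)
    )
    return progress - collision_penalty - smoothness_penalty


calm_step = StepInfo(distance_reduction_m=5.0, min_gap_m=10.0,
                     speed_change_mps=0.0, steering_change_rad=0.0)
risky_step = StepInfo(distance_reduction_m=8.0, min_gap_m=0.5,
                      speed_change_mps=3.0, steering_change_rad=0.4)

print(driving_reward(calm_step))
print(driving_reward(risky_step))
```

The risky step makes more progress towards B, yet the large collision weight makes its reward strongly negative, which is the balance the weights are meant to encode.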

## RL Diagram
After these 3 examples, I hope the following representation of RL elements and how they play together makes sense:

<figure><center>
<img src="https://miro.medium.com/max/850/1*w-BQXtdOQgzdZbcZ8yaiCQ.jpeg" width="60%" >
<figcaption>Reinforcement learning ingredients (Image by the author) </figcaption> 
</center></figure>

# 2. **Policies**, **Value functions**, **Bellman Equation**

## **Policy**, "- Deterministic / Stochastic"
The agent picks the action she thinks is the best based on the current state of the environment.

This is the agent’s strategy, commonly referred to as the agent’s **policy**.

> A **policy** is a learned mapping from states to actions.
- **Solving a reinforcement learning problem means finding the best possible policy.**

### **Deterministic** or **Stochastic**, *policy optimization methods*
Policies are either **deterministic**, when they map each state to one action,

$$\pi(\text{state}) = \text{action}$$

or **stochastic**, when they map each state to a probability distribution over all possible actions.

$$\pi(\text{state}) = (\text{prob}(\text{action}_1), \text{prob}(\text{action}_2), \dots, \text{prob}(\text{action}_N))$$

**Stochastic** is a word you often read and hear in Machine Learning, and it essentially means uncertain or random. In environments with high uncertainty, like Monopoly where you are rolling dice, **stochastic** policies are better than deterministic ones.
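
The two kinds of policy can be sketched with plain dictionaries. The states `"s0"`/`"s1"`, the actions and the probabilities below are made up; real policies are usually learned functions, not hand-written tables.

```python
import random

# Deterministic policy: each state maps to exactly one action.
deterministic_policy = {"s0": "right", "s1": "left"}

# Stochastic policy: each state maps to a probability for every action.
stochastic_policy = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.5, "right": 0.5},
}


def act_deterministic(state):
    """Always returns the same action for a given state."""
    return deterministic_policy[state]


def act_stochastic(state, rng=random):
    """Samples an action according to the state's probability distribution."""
    probs = stochastic_policy[state]
    actions = list(probs)
    weights = [probs[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded generator so runs are repeatable
print(act_deterministic("s0"))
print([act_stochastic("s0", rng) for _ in range(5)])
```

Calling `act_deterministic("s0")` always gives `"right"`, while repeated calls to `act_stochastic("s0")` give `"right"` about 80% of the time.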

There exist several methods to actually compute this optimal policy. These are called **policy optimization methods**.

## **Value functions**, ***q*-value functions**
Sometimes, depending on the problem, instead of directly trying to find the optimal policy, one can try to find the **value function** associated with that optimal policy.

But, what is a value function?  And what does value mean in this context?

The **value** is a number associated with each state **$s$** of the environment that estimates how good it is for the agent to be in state **$s$**.

It is the cumulative reward the agent collects when starting at state **$s$** and choosing actions according to policy **$π$** .

**A value function is a learned mapping from states to values**.
The value function of a policy is commonly denoted as 

* **$v_\pi(s) =$ cumulative reward when the agent starts at state $s$ and follows policy $π$**

### **Bellman equation**, ***Q*-learning**
Value functions can also map pairs of **(state, action)** to values. In this case, they are called ***q-value*** functions.

* $q_\pi(s, a) =$ cumulative reward when the agent starts at state **$s$**, takes action **$a$**, and follows policy **$π$** thereafter

The optimal value function (or $q$-value function) satisfies a mathematical equation, called the **Bellman equation**.
<figure><center>
<img src="https://miro.medium.com/max/1400/0*8ltyj-zO-6Li3zuX.png" width="50%">
</center></figure>

This equation is useful because it can be transformed into an iterative procedure to find the optimal value function.
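
One way to picture that iterative procedure is the sketch below. It assumes the Bellman optimality equation in its usual form for a deterministic environment, q(s, a) = r + γ · max over a' of q(s', a'), with a discount factor γ that the text does not specify. The 4-cell corridor, its rewards and `GAMMA = 0.9` are invented for illustration.

```python
GAMMA = 0.9         # discount factor (assumed; not given in the text)
N_STATES = 4        # states 0..3, state 3 is terminal (the goal)
ACTIONS = (-1, +1)  # move left or move right


def step(state, action):
    """Deterministic toy dynamics: returns (next_state, reward)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


def q_value_iteration(n_sweeps=50):
    """Repeatedly applies the Bellman update to every (state, action) pair."""
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(n_sweeps):
        for s in range(N_STATES - 1):  # the terminal state keeps q = 0
            for a in ACTIONS:
                next_state, reward = step(s, a)
                best_next = max(q[(next_state, a2)] for a2 in ACTIONS)
                q[(s, a)] = reward + GAMMA * best_next
    return q


q = q_value_iteration()
for s in range(N_STATES - 1):
    print(s, round(q[(s, -1)], 3), round(q[(s, +1)], 3))
```

In this toy corridor the values for moving right settle at 0.81, 0.9 and 1.0 for states 0, 1 and 2: the closer to the goal, the higher the value.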

* Why are value functions useful?
  - Because you can infer an optimal policy from an optimal q-value function.
* How?  
  - The optimal policy is the one where at each state **$s$** the agent chooses the action **$a$** that maximizes the ***$q$***-value function
* So, you can jump from optimal policies to optimal q-functions, and vice versa 😎.
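
Reading a policy off a q-value function takes only a few lines. The q-values below are invented numbers for two hypothetical states; the point is only the "pick the best action per state" step.

```python
# Hypothetical q-values for two states and two actions (numbers are invented).
q_table = {
    "s0": {"left": 0.1, "right": 0.7},
    "s1": {"left": 0.4, "right": 0.2},
}


def greedy_policy(q_table):
    """For each state, pick the action with the highest q-value."""
    return {
        state: max(action_values, key=action_values.get)
        for state, action_values in q_table.items()
    }


policy = greedy_policy(q_table)
print(policy)  # {'s0': 'right', 's1': 'left'}
```

If `q_table` held the optimal q-values, the resulting `policy` would be an optimal policy.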

There are several RL algorithms that focus on finding optimal ***q***-value functions. These are called **Q-learning methods**.



# 3. [OpenAI's Gym](https://gym.openai.com/), [DeepMind's MuJoCo](https://www.endtoend.ai/envs/gym/mujoco/) - How to generate training data? 📊
Reinforcement learning agents are VERY data-hungry.

A way to overcome this hurdle is by using **simulated environments**. Writing the engine that simulates the environment usually requires more work than solving the RL problem. Also, differences between engine implementations can render comparisons between algorithms meaningless.

This is why the team at **OpenAI** released the [Gym toolkit](https://gym.openai.com/) back in 2016. OpenAI's Gym offers a standardized API for a collection of environments for different problems, including

- the classic Atari games,
- robotic arms
- or landing on the Moon (well, a simplified one)

**OpenAI Gym** also defines a standard API to build environments, allowing third parties (like you) to create and make your environments available to others.
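
As a rough idea of what such an environment looks like, the class below mimics the `reset()` / `step()` interface that the training loop in this article relies on. It deliberately does not import `gym`, so it is only Gym-like: a real Gym environment would subclass `gym.Env` and also declare its action and observation spaces. The `CoinGuessEnv` name and its rules are invented.

```python
import random


class CoinGuessEnv:
    """Toy environment: guess the outcome of a coin flip, 3 guesses per episode."""

    def __init__(self, seed=None):
        self.rng = random.Random(seed)
        self.steps_left = 0

    def reset(self):
        """Starts a new episode and returns the initial state."""
        self.steps_left = 3
        return self.steps_left  # the state is just the number of guesses left

    def step(self, action):
        """Applies an action (0 or 1) and returns (state, reward, done, info)."""
        coin = self.rng.randint(0, 1)
        reward = 1.0 if action == coin else 0.0
        self.steps_left -= 1
        done = self.steps_left == 0
        return self.steps_left, reward, done, {}


# A random agent interacting with the environment for one episode.
env = CoinGuessEnv(seed=0)
state = env.reset()
done = False
total_reward = 0.0
while not done:
    action = random.randint(0, 1)
    state, reward, done, info = env.step(action)
    total_reward += reward

print(total_reward)
```

Because every environment exposes the same two methods, the agent code in the loop does not need to know anything about coins, Atari or robots.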

There are also environments built on engines that used to be proprietary, like **[MuJoCo](https://www.endtoend.ai/envs/gym/mujoco/)**, a physics engine acquired by DeepMind in 2021 and since made freely available. With MuJoCo you can solve continuous control tasks in 3D, like learning to walk 👶.

If you are interested in **self-driving cars**, then you should check out [CARLA](https://carla.org/), the most popular open urban driving simulator.

# 4. Python boilerplate code 🐍
The pseudo-code below sketches the typical training loop of an RL agent:



```
import random

def train(n_episodes: int):
    """
    Pseudo-code of a Reinforcement Learning agent training loop.

    `load_env`, `get_rl_agent` and `get_epsilon` are placeholders that
    you need to implement for your own problem.
    """

    # Python object that wraps all environment logic. Typically you will
    # be using OpenAI Gym here.
    env = load_env()

    # Python object that wraps all agent policy (or value function)
    # parameters, and action generation methods.
    agent = get_rl_agent()

    for episode in range(n_episodes):

        # Random start of the environment.
        state = env.reset()

        # Epsilon is a parameter that controls the exploration-exploitation
        # trade-off. It is good practice to set a decaying value for epsilon.
        epsilon = get_epsilon(episode)

        done = False
        while not done:

            if random.uniform(0, 1) < epsilon:
                # Explore the action space: pick a random action.
                action = env.action_space.sample()
            else:
                # Exploit learned values (or policy).
                action = agent.get_best_action(state)

            # The environment transitions to the next state and maybe
            # rewards the agent.
            next_state, reward, done, info = env.step(action)

            # Adjust agent parameters. We will see how later in the course.
            agent.update_parameters(state, action, reward, next_state)

            state = next_state
```
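
For readers who want something they can actually run, here is one possible way to fill in the placeholders of the loop above. Everything added is an assumption made for illustration: `CorridorEnv` is a toy stand-in for a Gym environment, `TabularQAgent` uses a tabular Q-learning style update (the article does not say which update rule it will use), `update_parameters` takes an extra `done` argument, and `env.sample_action()` replaces `env.action_space.sample()`.

```python
import random


class CorridorEnv:
    """Toy stand-in for a Gym environment: walk right from cell 0 to cell 4."""
    N_CELLS = 5
    ACTIONS = (0, 1)  # 0 = left, 1 = right

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.N_CELLS - 1)
        done = self.position == self.N_CELLS - 1
        reward = 1.0 if done else -0.01  # small cost for every extra step
        return self.position, reward, done, {}

    def sample_action(self):
        return random.choice(self.ACTIONS)


class TabularQAgent:
    """Keeps one q-value per (state, action) pair."""

    def __init__(self, n_states, actions, alpha=0.5, gamma=0.9):
        self.actions = actions
        self.alpha = alpha  # learning rate (assumed value)
        self.gamma = gamma  # discount factor (assumed value)
        self.q = {(s, a): 0.0 for s in range(n_states) for a in actions}

    def get_best_action(self, state):
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update_parameters(self, state, action, reward, next_state, done):
        # Q-learning style update: move q towards reward + discounted best next value.
        best_next = 0.0 if done else max(self.q[(next_state, a)] for a in self.actions)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])


def get_epsilon(episode):
    """Exploration probability that decays from 0.5 towards 0.05."""
    return max(0.05, 0.5 * 0.95 ** episode)


def train(n_episodes, seed=0):
    random.seed(seed)
    env = CorridorEnv()
    agent = TabularQAgent(env.N_CELLS, env.ACTIONS)
    for episode in range(n_episodes):
        state = env.reset()
        epsilon = get_epsilon(episode)
        done = False
        while not done:
            if random.uniform(0, 1) < epsilon:
                action = env.sample_action()           # explore
            else:
                action = agent.get_best_action(state)  # exploit
            next_state, reward, done, info = env.step(action)
            agent.update_parameters(state, action, reward, next_state, done)
            state = next_state
    return agent


agent = train(n_episodes=200)
learned_policy = [agent.get_best_action(s) for s in range(CorridorEnv.N_CELLS - 1)]
print(learned_policy)
```

After training, the learned policy should be to move right (action `1`) in every non-terminal cell, since that is the shortest way to the goal.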



## **$ϵ$** and **exploration-exploitation**

Epsilon (**$ϵ$**) is a key parameter to ensure our agent explores the environment enough, before drawing definite conclusions on what is the best action to take in each state.

It is a value between 0 and 1, and it represents the probability the agent chooses a random action instead of what she thinks is the best one.

This tradeoff between *exploring new strategies* vs *sticking to already known ones* is called the **exploration-exploitation problem**. This is a key ingredient in RL problems and something that distinguishes RL problems from supervised machine learning.

Technically speaking, we want the agent to find the global optimum, not a local one.

It is good practice to start your training with a large value (e.g. 50%) and progressively decrease it after each episode. This way the agent explores a lot at the beginning, and less as she perfects her strategy.
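
Two common ways to decay epsilon are sketched below. The function names, the end value of 1% and the decay speeds are illustrative assumptions; only the 50% starting point comes from the text.

```python
def exponential_epsilon(episode, start=0.5, end=0.01, decay=0.99):
    """Multiply epsilon by `decay` every episode, never going below `end`."""
    return max(end, start * decay ** episode)


def linear_epsilon(episode, start=0.5, end=0.01, n_decay_episodes=500):
    """Go from `start` to `end` in a straight line over `n_decay_episodes`."""
    fraction = min(episode / n_decay_episodes, 1.0)
    return start + fraction * (end - start)


for episode in (0, 100, 500, 1000):
    print(
        episode,
        round(exponential_epsilon(episode), 3),
        round(linear_epsilon(episode), 3),
    )
```

Both schedules start at 0.5 and end at 0.01; the exponential one drops quickly at first, while the linear one decreases at a constant rate. Which works better depends on the problem.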

# 5. Recap

* Every RL problem has an agent (or agents), environment, actions, states and rewards.
* The agent sequentially takes actions with the goal of maximizing total rewards. For that she needs to find the optimal policy.
* Value functions are useful as they give us an alternative path to find the optimal policy.
* In practice, you need to try different RL algorithms for your problem, and see what works best.
* RL agents need a lot of training data to learn. OpenAI Gym is a great tool to re-use and create your environments.
* Balancing exploration and exploitation is necessary when training RL agents, to ensure the agent does not get stuck in local optima.