# Reinforcement Learning (RL)

## What is an RL problem?

> Reinforcement learning is a framework for solving control tasks (also called decision problems) by building agents that learn from the environment by interacting with it through trial and error, receiving rewards (positive or negative) as their only feedback.

- RL is applicable to a wide range of seemingly different learning problems

**RL terminology:**

What is an agent? What is an environment? What exactly are the actions the agent can take? What is the reward, and why do we speak of a cumulative reward?

#### Example: Learning to drive

- The __agent__ is the driver who wants to get from A to B, comfortably.

- The __state__ of the environment that the driver observes includes many variables: the position, speed and acceleration of the car, as well as all other cars, passengers, road conditions and traffic signs. Transforming such a large vector of inputs into an appropriate action is challenging.

  
- The __actions__ are basically three: the direction of the steering wheel, throttle intensity and brake intensity.


- The __reward__ after each action is a weighted sum of the different aspects you need to balance when driving. A decrease in distance to point B brings a positive reward, while an increase brings a negative one. To discourage collisions, getting too close to (or colliding with) another car or a pedestrian should carry a very large negative reward. Also, to encourage smooth driving, sharp changes in speed or direction contribute a negative reward.
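The weighted sum described in the reward bullet can be sketched as a small function. This is only an illustration: the function name, the argument names, the weights, the `safe_gap` threshold and the size of the collision penalty are all assumptions chosen for the example, not values from a real driving simulator.

```python
def driving_reward(prev_dist, dist, min_gap, speed_change, steering_change,
                   w_progress=1.0, w_smooth=0.1,
                   safe_gap=2.0, collision_penalty=100.0):
    """Illustrative reward: progress towards B, safety, and comfort.

    All weights and thresholds are hypothetical and would need tuning.
    """
    # Positive when the car got closer to B, negative when it moved away.
    progress = w_progress * (prev_dist - dist)

    # Very large negative reward when too close to another car or pedestrian.
    safety = -collision_penalty if min_gap < safe_gap else 0.0

    # Sharp changes in speed or direction are penalized.
    comfort = -w_smooth * (abs(speed_change) + abs(steering_change))

    return progress + safety + comfort


# Moving 1 unit closer to B, far from other cars, with no sharp changes.
print(driving_reward(prev_dist=10.0, dist=9.0, min_gap=5.0,
                     speed_change=0.0, steering_change=0.0))
```

Changing the weights changes what the agent ends up optimizing, which is why reward design matters so much in practice.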

<img src="img/rl.png" width=500 height=500 />

#### The reward hypothesis: the central idea of Reinforcement Learning
#### Why is the goal of the agent to maximize the expected return?

Because RL is based on the reward hypothesis: all goals can be described as the maximization of the expected return (expected cumulative reward).

That’s why in Reinforcement Learning, to have the best behavior, we aim to learn to take actions that maximize the expected cumulative reward.

__Markov Property__

The RL process is modeled as a Markov Decision Process (MDP).

The Markov Property implies that our agent needs only the current state to decide what action to take, and not the history of all the states and actions that came before.

**Action Space**
  
The action space is the set of all possible actions in an environment.

The actions can come from a **discrete** or **continuous** space:

**Discrete space:** the number of possible actions is finite.

**Continuous space:** the number of possible actions is infinite.
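As a rough illustration of the difference, the sketch below samples from a discrete action set and from a continuous one using only the standard library. The action names and the ranges for steering and throttle are assumptions made up for the example.

```python
import random

# Discrete space: a finite set of actions (hypothetical grid-world moves).
DISCRETE_ACTIONS = ["left", "right", "up", "down"]


def sample_discrete(rng):
    """Pick one of finitely many actions."""
    return rng.choice(DISCRETE_ACTIONS)


def sample_continuous(rng):
    """Pick real-valued actions: infinitely many values are possible.

    The ranges below (steering in [-1, 1], throttle in [0, 1]) are assumed.
    """
    steering = rng.uniform(-1.0, 1.0)
    throttle = rng.uniform(0.0, 1.0)
    return steering, throttle


rng = random.Random(0)
print(sample_discrete(rng))
print(sample_continuous(rng))
```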

Now that we understand how to formulate an RL problem, we need to solve it.

## Policies and value functions

#### Policies

The agent picks the action she thinks is the best based on the current state of the environment. This is the agent’s strategy, commonly referred to as the agent’s policy.

> A policy is a learned mapping from states to actions.

> Solving a reinforcement learning problem means finding the best possible policy.

Policies are either __deterministic__, when they map each state to one action,

$$\pi(\text{state}) = \text{action}$$

or __stochastic__, when they map each state to a probability distribution over all possible actions.

$$\pi(\text{state}) = (\text{prob. action } 1,\ \text{prob. action } 2,\ \dots,\ \text{prob. action } N)$$
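The two kinds of policy can be sketched with plain dictionaries. The states (`"s0"`, `"s1"`), the actions and the probabilities are hypothetical; a real agent would learn these mappings rather than hard-code them.

```python
import random

ACTIONS = ["left", "right"]

# Deterministic policy: each state maps to exactly one action.
deterministic_policy = {"s0": "left", "s1": "right"}

# Stochastic policy: each state maps to a probability for every action.
stochastic_policy = {"s0": [0.9, 0.1], "s1": [0.5, 0.5]}


def act_deterministic(state):
    """Return the single action the policy prescribes for this state."""
    return deterministic_policy[state]


def act_stochastic(state, rng):
    """Sample an action according to the policy's distribution for this state."""
    probs = stochastic_policy[state]
    return rng.choices(ACTIONS, weights=probs, k=1)[0]


rng = random.Random(0)
print(act_deterministic("s0"))
print(act_stochastic("s0", rng))
```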

__There exist several methods to actually compute this optimal policy. These are called policy optimization methods.__

### Value functions

Sometimes, depending on the problem, instead of directly trying to find the optimal policy, one can try to find the value function associated with that optimal policy.

But, what is a value function? And before that, what does value mean in this context?

> The value is a number associated with each state s of the environment that estimates how good it is for the agent to be in state s.

> It is the expected cumulative reward the agent collects when starting at state s and choosing actions according to policy π.

> A value function is a learned mapping from states to values.

The value function of a policy is commonly denoted as

$$v_{\pi}(s) = \text{expected cumulative reward when the agent starts at state } s \text{ and follows policy } \pi$$

Value functions can also map (state, action) pairs to values. In this case, they are called q-value functions.

$$q_{\pi}(s,a) = \text{expected cumulative reward when the agent starts at state } s \text{, takes action } a \text{, and follows policy } \pi \text{ thereafter}$$

The optimal value function (or q-value function) satisfies a mathematical equation, called the Bellman equation.

<img src="img/belleq.png" width=400 height=400 />

This equation is useful because it can be transformed into an iterative procedure to find the optimal value function.
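One such iterative procedure is value iteration. The sketch below assumes the Bellman optimality equation in its standard form for deterministic transitions,

$$v_*(s) = \max_a \left[ r(s,a) + \gamma \, v_*(s') \right]$$

where $s'$ is the state reached from $s$ with action $a$ and $\gamma$ is the discount rate. The three-state chain used here is a made-up toy problem, and the function and variable names are illustrative.

```python
def value_iteration(transitions, gamma=0.9, tol=1e-8):
    """Repeatedly apply the Bellman optimality update until values stop changing.

    transitions[state][action] = (next_state, reward); deterministic toy model.
    States with no actions are treated as terminal and keep value 0.
    """
    values = {state: 0.0 for state in transitions}
    while True:
        delta = 0.0
        for state, actions in transitions.items():
            if not actions:  # terminal state
                continue
            best = max(reward + gamma * values[next_state]
                       for next_state, reward in actions.values())
            delta = max(delta, abs(best - values[state]))
            values[state] = best
        if delta < tol:
            return values


# Hypothetical chain 0 -> 1 -> 2, where reaching state 2 pays a reward of 1.
toy_mdp = {
    0: {"stay": (0, 0.0), "go": (1, 0.0)},
    1: {"stay": (1, 0.0), "go": (2, 1.0)},
    2: {},
}
print(value_iteration(toy_mdp))
```

State 1 is one step from the reward and state 0 is two steps away, so its value is discounted once more by $\gamma$.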

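__Note__: Future rewards are weighted by a discount rate called gamma (γ), so that rewards far in the future count less than immediate ones. Gamma must be between 0 and 1, and most of the time it is set between 0.95 and 0.99.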
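To make the role of gamma concrete, here is a minimal sketch that computes the discounted cumulative reward of one episode, assuming the usual definition

$$G = r_0 + \gamma r_1 + \gamma^2 r_2 + \dots$$

The function name and the reward list are illustrative.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum the rewards of one episode, discounting each step by gamma."""
    total = 0.0
    # Work backwards so each step is: reward now + gamma * (everything after).
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

With gamma close to 1 the agent is far-sighted; with gamma close to 0 it cares almost only about the next reward.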

But, why are value functions useful?

Because you can infer an optimal policy from an optimal q-value function.

How?

The optimal policy is the one where at each state __s__ the agent chooses the action __a__ that maximizes the q-value function.
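In code, this amounts to taking an argmax over actions in every state. The q-values below are invented numbers for two hypothetical states; the function name is illustrative.

```python
def greedy_policy(q_table):
    """For each state, pick the action with the highest q-value."""
    return {
        state: max(action_values, key=action_values.get)
        for state, action_values in q_table.items()
    }


# Hypothetical q-values: q_table[state][action] = value.
q_table = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.5, "right": 0.1},
}
print(greedy_policy(q_table))  # {'s0': 'right', 's1': 'left'}
```

The resulting policy is only optimal if the q-values themselves are optimal.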

So, you can jump from optimal policies to optimal q-functions, and vice versa.


__Two main approaches for solving RL problems__

- Policy-based methods
    - teach the agent which action to take, given the current state

- Value-based methods
    - teach the agent which states are more valuable, and then take the action that leads to the more valuable states

```python
import random


def train(n_episodes: int):
    """
    Pseudo-code of a Reinforcement Learning agent training loop.

    load_env, get_rl_agent and get_epsilon are placeholders that must be
    provided for the problem at hand.
    """
    env = load_env()
    agent = get_rl_agent()

    for episode in range(n_episodes):

        # random start of the environment
        state = env.reset()

        # epsilon is the parameter that controls the exploration-exploitation trade-off.
        # it is good practice to set a decaying value for epsilon
        epsilon = get_epsilon(episode)

        done = False
        while not done:

            if random.uniform(0, 1) < epsilon:
                # Explore action space
                action = env.action_space.sample()
            else:
                # Exploit learned values (or policy)
                action = agent.get_best_action(state)

            # environment transitions to next state and maybe rewards the agent.
            # (4-value return as in the classic Gym API; Gymnasium returns 5 values)
            next_state, reward, done, info = env.step(action)

            # adjust agent parameters.
            agent.update_parameters(state, action, reward, next_state)

            state = next_state
```
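The loop above leaves the agent unspecified. As one possible value-based instance, the sketch below uses tabular Q-learning. This is an assumption for illustration: the notes do not say which algorithm the agent uses, and the class name, the learning rate and the extra `done` argument are hypothetical. Only `get_best_action` and `update_parameters` match the names used in the loop.

```python
from collections import defaultdict


class TabularQAgent:
    """Minimal tabular Q-learning agent (illustrative, not tuned)."""

    def __init__(self, actions, learning_rate=0.1, gamma=0.99):
        self.actions = list(actions)
        self.learning_rate = learning_rate
        self.gamma = gamma
        # q[(state, action)] defaults to 0.0 for unseen pairs.
        self.q = defaultdict(float)

    def get_best_action(self, state):
        """Exploit: return the action with the highest q-value in this state."""
        return max(self.actions, key=lambda action: self.q[(state, action)])

    def update_parameters(self, state, action, reward, next_state, done=False):
        """Move q(state, action) towards reward + gamma * max_a q(next_state, a)."""
        if done:
            best_next = 0.0  # no future reward after a terminal state
        else:
            best_next = max(self.q[(next_state, a)] for a in self.actions)
        target = reward + self.gamma * best_next
        error = target - self.q[(state, action)]
        self.q[(state, action)] += self.learning_rate * error


agent = TabularQAgent(actions=[0, 1], learning_rate=0.5, gamma=0.9)
agent.update_parameters(state="s", action=1, reward=1.0, next_state="t")
print(agent.get_best_action("s"))  # action 1 now has the higher q-value
```

A table like this only works when states are discrete and hashable; large or continuous state spaces need a function approximator instead.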

__Epsilon__

It is a value between 0 and 1, and it represents the probability the agent chooses a random action instead of what she thinks is the best one.


This tradeoff between exploring new strategies vs sticking to already known ones is called the __exploration-exploitation__ problem. This is a key ingredient in RL problems and something that distinguishes RL problems from supervised machine learning.

__Exploration__ is exploring the environment by trying random actions in order to find more information about the environment.

__Exploitation__ is exploiting known information to maximize the reward.


Technically speaking, we want the agent to find the globally optimal policy, not a locally optimal one, and exploration is what keeps it from settling too early.
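The training loop calls `get_epsilon(episode)` and recommends a decaying value. One common way to do that is exponential decay from a high starting value towards a small floor, sketched below. The start value, the floor and the decay rate are assumptions and would be tuned per problem.

```python
import math


def get_epsilon(episode, eps_start=1.0, eps_end=0.05, decay_rate=0.01):
    """Exponentially decay epsilon from eps_start towards eps_end.

    Early episodes explore almost always; later episodes mostly exploit.
    """
    return eps_end + (eps_start - eps_end) * math.exp(-decay_rate * episode)


for episode in (0, 100, 500):
    print(episode, round(get_epsilon(episode), 3))
```

Keeping a floor above zero means the agent never stops exploring entirely.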