# Deep Reinforcement Learning

Welcome to this lecture on Deep Reinforcement Learning. Reinforcement learning is ultimately about building programs that learn to perform complex tasks from scratch, by themselves, through trial and error. Many recent applications, such as walking robots, AlphaGo, and self-driving cars, are based on this paradigm.

**Part I Foundations of RL**: In the first part, you'll learn how to define real-world problems as Markov Decision Processes (MDPs) and use classical methods such as SARSA and Q-Learning to solve several environments in OpenAI Gym. You'll also explore techniques such as tile coding and coarse coding to expand the size of the problems that can be solved with traditional RL algorithms.

**Part II Value-Based Methods**: In the second part, you'll learn how to leverage neural networks when solving complex problems using the Deep Q-Networks (DQN) algorithm. You will also learn about modifications such as double Q-learning, prioritized experience replay, and dueling networks. Then, you'll use what you’ve learned to create an artificially intelligent game-playing agent that can navigate a spaceship!

**Part III Policy-Based Methods**: In the third part, you'll learn about policy-based and actor-critic methods such as Proximal Policy Optimization (PPO), Advantage Actor-Critic (A2C), and Deep Deterministic Policy Gradients (DDPG). You’ll also learn about optimization techniques such as evolution strategies and hill climbing.

**Part IV Multi-Agent Reinforcement Learning**: Most of reinforcement learning is concerned with a single agent that seeks to demonstrate proficiency at a single task. In this agent's environment, there are no other agents. However, if we'd like our agents to become truly intelligent, they must be able to communicate with and learn from other agents. In the final part of this nanodegree, we will extend the traditional framework to include multiple agents. You'll also learn all about Monte Carlo Tree Search (MCTS) and master the skills behind DeepMind's AlphaZero.






### Applications

* TD-Gammon, one of the first successful applications of neural networks to reinforcement learning.
* AlphaGo Zero, the state-of-the-art computer program that defeats professional human Go players --> Google
* Agents that use reinforcement learning (RL) to play Atari games.
* Research used to teach humanoid bodies to walk.
* Self-driving cars --> Google + Amazon + Uber
* Reinforcement Learning for telecommunication --> Google
* Reinforcement Learning for inventory management --> Amazon

### Terminology

* Agent: The learner or decision maker. You can think of it as a small puppy.

* Environment: Everything the agent interacts with, i.e. the problem space. You can picture it as the puppy's trainer.

* State: The agent's observation of the environment, which provides the context for choosing intelligent actions. Example: for a walking robot, the state could consist of the positions and velocities of its joints, some measurements of the ground, and contact-sensor data indicating whether the robot is still walking. Based on the state, the Agent chooses which action comes next.

* Actions: What the agent can do at each timestep. Example: for a robot, an action can be the forces it applies to its joints at a given timestep. Actions influence the environment, which responds with feedback in the form of a Reward.

* Reward: A positive or negative feedback from the Environment.

* Timestep: One discrete step of the interaction. Learning is a process over time: at each timestep the agent observes a state, chooses an action, and receives a reward.

Example https://www.youtube.com/watch?v=gn4nRCC9TwQ
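The terms above fit together in a simple interaction loop. The sketch below is a minimal illustration, not a real library API: `CoinGuessEnv`, `run_episode`, and `always_heads` are hypothetical names invented here, and the `reset`/`step` interface only loosely mimics the style of OpenAI Gym.

```python
import random


class CoinGuessEnv:
    """Hypothetical toy environment: guess a hidden coin flip at each timestep."""

    def __init__(self, episode_length=5, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # the state is simply the current timestep index

    def step(self, action):
        coin = self.rng.choice(["heads", "tails"])
        reward = 1 if action == coin else -1  # positive or negative feedback
        self.t += 1
        done = self.t >= self.episode_length
        return self.t, reward, done


def run_episode(env, policy):
    """Run one episode: observe the state, act, receive a reward, repeat."""
    state = env.reset()
    total_reward = 0
    done = False
    while not done:
        action = policy(state)                   # the agent decides
        state, reward, done = env.step(action)   # the environment responds
        total_reward += reward
    return total_reward


def always_heads(state):
    """A trivial agent that ignores the state."""
    return "heads"


total = run_episode(CoinGuessEnv(), always_heads)
print(total)
```

A learning agent would replace `always_heads` with a policy that it updates from the rewards it receives.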

### Assumptions 

* The Agent wants to win and has a positive attitude.
* The Agent observes the environment.
* Does the Agent have all the information needed to make a good decision?
* Is the environment fully observable, or does the agent know only part of it?
* **THE GOAL OF THE AGENT IS TO MAXIMIZE THE CUMULATIVE REWARD.** In other words, the Agent must learn to play by the rules of the Environment. Thus, if you think you can solve a problem using RL, you will need to define the states, the actions, the rewards, and the rules of the environment.


### Exploration - Exploitation Dilemma 

Let's suppose the puppy (the agent) already knows how to behave in some familiar situations. Should it keep doing what it knows works, or take the risk of trying new, random actions from the set of possible actions?

* Exploration: trying out potential hypotheses for how to choose actions.
* Exploitation: using the limited knowledge already gathered about what should work well.
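One common heuristic for balancing the two is epsilon-greedy action selection. The notes above do not prescribe it, so treat the following as an illustrative sketch: the function name `epsilon_greedy`, the action names, and the value estimates are all made up for the example.

```python
import random


def epsilon_greedy(action_values, epsilon, rng=random):
    """With probability epsilon pick a random action, else the best-known one."""
    actions = list(action_values)
    if rng.random() < epsilon:
        return rng.choice(actions)              # explore: try anything
    return max(actions, key=action_values.get)  # exploit: use what is known


# Hypothetical reward estimates the puppy has learned so far.
estimates = {"sit": 1.0, "bark": -0.5, "roll_over": 0.2}

rng = random.Random(0)
choices = [epsilon_greedy(estimates, epsilon=0.1, rng=rng) for _ in range(1000)]
print({a: choices.count(a) for a in estimates})
```

With `epsilon=0.1` the agent mostly repeats the action it currently believes is best, but still occasionally tries the others, so it has a chance to discover that its estimates were wrong.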


### Episodic vs Continuing tasks

Not all problems have an end point, e.g. winning a game vs. financial bots that trade all the time. In the first case, the interaction eventually ends; in the second, the Agent lives forever, so the learning process never ends.

A game such as chess is an episodic task, where an episode finishes when the game ends. The idea is that by playing the game many times, i.e. by interacting with the environment over many episodes, you can learn to play chess better and better. A task with no end point, like the trading bot, is a continuing task.

Let's say you are playing chess, and you only receive a reward at the end of the game. It's important to note that this problem is exceptionally difficult, because the feedback is only delivered at the very end of the game. So, if you lose a game (and get a reward of -1 at the end of the episode), it’s unclear when exactly you went wrong: maybe you were so bad at playing that every move was horrible, or maybe instead … you played beautifully for the majority of the game, and then made only a small mistake at the end. When the reward signal is largely uninformative in this way, we say that the task suffers from the problem of **sparse rewards**. There’s an entire area of research dedicated to this problem.

### The Reward Hypothesis

How do we design a good reward, one that is meaningful enough for the problem? It's important to note that the term "reinforcement" in "Reinforcement Learning" comes from behavioral science. It refers to a stimulus that's delivered immediately after a behavior to make that behavior more likely to occur in the future. It's an important defining hypothesis in reinforcement learning that we can always formulate an agent's goal along the lines of maximizing expected cumulative reward. We call this hypothesis the "Reward Hypothesis".

#### Cumulative reward

Example https://www.youtube.com/watch?v=gn4nRCC9TwQ [Reward function]

The Agent needs to learn about the complex effects of its actions on the environment over time. This is why we need a cumulative reward: it lets the robot make long-term decisions.

**Example**

Consider an agent who would like to learn to escape a maze. Which reward signals will encourage the agent to escape the maze as quickly as possible? 

 
* The reward is -1 for every time step that the agent spends inside the maze. Once the agent escapes, the episode terminates.
* The reward is +1 for every time step that the agent spends inside the maze. Once the agent escapes, the episode terminates.
* The reward is -1 for every time step that the agent spends inside the maze. Once the agent escapes, it receives a reward of +10, and the episode terminates.
* The reward is 0 for every time step that the agent spends inside the maze. Once the agent escapes, it receives a reward of +1, and the episode terminates.

SOLUTION:

* The reward is -1 for every time step that the agent spends inside the maze. Once the agent escapes, the episode terminates.
* The reward is -1 for every time step that the agent spends inside the maze. Once the agent escapes, it receives a reward of +10, and the episode terminates.
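To see why, we can compare the cumulative reward of a fast escape and a slow escape under each of the four reward signals. This is a small sketch under stated assumptions: rewards are summed without discounting, the escape reward is added on top of the per-step rewards, and the names `episode_return` and `schemes` as well as the step counts 5 and 50 are invented for the illustration.

```python
def episode_return(steps_in_maze, step_reward, escape_reward=0):
    """Undiscounted cumulative reward of an episode that spends
    `steps_in_maze` time steps inside the maze and then escapes."""
    return steps_in_maze * step_reward + escape_reward


# The four candidate reward signals, in the order listed above.
schemes = {
    "-1 per step": dict(step_reward=-1, escape_reward=0),
    "+1 per step": dict(step_reward=+1, escape_reward=0),
    "-1 per step, +10 on escape": dict(step_reward=-1, escape_reward=10),
    "0 per step, +1 on escape": dict(step_reward=0, escape_reward=1),
}

for name, kwargs in schemes.items():
    fast = episode_return(5, **kwargs)    # escapes after 5 steps
    slow = episode_return(50, **kwargs)   # escapes after 50 steps
    print(f"{name}: fast={fast}, slow={slow}, rewards speed: {fast > slow}")
```

Only the two signals with a -1 penalty per step give the faster escape a strictly higher cumulative reward. With +1 per step the agent is rewarded for staying in the maze, and with 0 per step every escape earns the same +1 no matter how long it takes (as long as rewards are not discounted).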

#### Return and discounted return

The Agent has to choose its actions to maximize its goal, but who knows what the future holds? That is to say, how does time affect long-term decisions? Which strategy is better in the short term, and which in the long term? Rewards that are closer in time should perhaps be weighted more heavily, since they are more predictable. This is the idea behind the discounted return.
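In the standard formulation, the return is the sum of future rewards, and the discounted return weights the reward received $k$ steps into the future by $\gamma^k$, where $\gamma \in [0, 1]$ is called the discount rate: $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$. The sketch below computes this for a finite list of rewards; the function name `discounted_return` and the example rewards are chosen here for illustration only.

```python
def discounted_return(rewards, gamma):
    """Compute R_1 + gamma * R_2 + gamma^2 * R_3 + ... for a finite reward list.

    Works backwards through the list using G = R + gamma * G_next.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1, 1, 1, 10]  # hypothetical rewards; the large one arrives last

print(discounted_return(rewards, gamma=1.0))  # plain (undiscounted) return
print(discounted_return(rewards, gamma=0.5))  # later rewards count for less
print(discounted_return(rewards, gamma=0.0))  # only the immediate reward counts
```

A smaller $\gamma$ makes the agent care more about immediate rewards, while $\gamma = 1$ treats all rewards equally regardless of when they arrive.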


### Markov Decision Process (MDP)

A (finite) Markov Decision Process (MDP) is defined by: 

* a (finite) set of States $S$
* a (finite) set of Actions $A$
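* a (finite) set of Rewards $R$
* the one-step dynamics of the environment, i.e. the probability $p(s', r \mid s, a)$ of each next state and reward given the current state and action
* a discount rate $\gamma \in [0, 1]$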

#### Example vacuum cleaner
<img src='images/markov_d.png' />

Consider a vacuum cleaner robot with:

* $A = \{\text{search}, \text{recharge}, \text{wait}\}$
* $S = \{\text{high}, \text{low}\}$

Suppose the vacuum cleaner's battery is high. It is then unlikely to recharge, so the possible actions are to wait or to search. Waiting consumes no battery at all, so at the next timestep the battery is still high, and the robot gets a reward of +1. If it instead decides to search the space and collect garbage, the battery is likely to remain high (70%), but there is also a chance (30%) that it becomes low. In either case, the reward is +4. If the battery is low, the robot can choose to recharge (the most likely option, which leads to the new state high), to wait and get a reward of +1, or to take the risk of searching the space. If the battery dies completely while searching and the robot needs to be rescued, it gets a reward of -3; if it collects something instead, it gets a reward of +4.
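The description above can be written down as a table of one-step dynamics. This is only a sketch: the text does not state the reward for recharging, the probability that the battery survives a search from the low state, or which state the robot is in after being rescued, so the values `RECHARGE_REWARD = 0`, `P_LOW_SEARCH_SURVIVES = 0.2`, and "rescued robot returns with a high battery" are assumptions made here, as are all identifier names.

```python
# Assumptions (NOT given in the text above, chosen only for illustration):
RECHARGE_REWARD = 0          # assumed: recharging earns no reward
P_LOW_SEARCH_SURVIVES = 0.2  # assumed: chance the battery lasts a search on low

# (state, action) -> list of (probability, next_state, reward)
DYNAMICS = {
    ("high", "wait"): [(1.0, "high", 1)],
    ("high", "search"): [(0.7, "high", 4), (0.3, "low", 4)],
    ("low", "wait"): [(1.0, "low", 1)],
    ("low", "recharge"): [(1.0, "high", RECHARGE_REWARD)],
    # If the battery dies the robot is rescued (-3); we assume it then
    # comes back with a high battery.
    ("low", "search"): [
        (P_LOW_SEARCH_SURVIVES, "low", 4),
        (1.0 - P_LOW_SEARCH_SURVIVES, "high", -3),
    ],
}


def available_actions(state):
    """Actions that have an entry in the dynamics table for this state."""
    return sorted(a for (s, a) in DYNAMICS if s == state)


def expected_reward(state, action):
    """Expected immediate reward of taking `action` in `state`."""
    return sum(p * r for p, _, r in DYNAMICS[(state, action)])


print(available_actions("high"))
print(available_actions("low"))
print(expected_reward("high", "search"))
```

Every row of the table is a probability distribution over (next state, reward) pairs, so the probabilities for each (state, action) pair must sum to 1.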

#### Deterministic Policy: Example
An example deterministic policy $\pi: \mathcal{S}\to\mathcal{A}$ can be specified as:

$\pi(\text{low}) = \text{recharge}$

$\pi(\text{high}) = \text{search}$

In this case: 

* if the battery level is low, the agent chooses to recharge the battery.
* if the battery level is high, the agent chooses to search for cans.

Therefore,
* If the state is _low_, the agent chooses action _recharge_.
* The agent never _waits_, and it _searches_ for cans only when the battery level is _high_.
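In code, a deterministic policy over a finite state set is just a lookup table from states to actions. A minimal sketch, where the names `policy` and `act` are chosen here for illustration:

```python
# Deterministic policy: exactly one action per state.
policy = {
    "low": "recharge",
    "high": "search",
}


def act(state):
    """Return the single action the policy prescribes for this state."""
    return policy[state]


for state in ("low", "high"):
    print(state, "->", act(state))
```

Calling `act` with the same state always returns the same action, which is what makes the policy deterministic.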


#### Stochastic Policy: Example

An example stochastic policy $\pi: \mathcal{S}\times\mathcal{A}\to [0,1]$ can be specified as:

$\pi(\text{recharge} \mid \text{low}) = 0.5$

$\pi(\text{wait} \mid \text{low}) = 0.4$

$\pi(\text{search} \mid \text{low}) = 0.1$

$\pi(\text{search} \mid \text{high}) = 0.9$

$\pi(\text{wait} \mid \text{high}) = 0.1$

In this case,

* if the battery level is low, the agent recharges the battery with 50% probability, waits for cans with 40% probability, and searches for cans with 10% probability.
* if the battery level is high, the agent searches for cans with 90% probability and waits for cans with 10% probability.

Therefore,
* If the battery level is _high_, the agent chooses to _search_ for a can with 90% probability, and otherwise _waits_ for a can.
* If the battery level is _low_, the agent is most likely to decide to _recharge_ the battery.
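A stochastic policy can be stored as one probability distribution per state, and acting means sampling from it. The sketch below uses the probabilities listed above; the names `policy` and `sample_action`, the seed, and the number of samples are choices made for this illustration.

```python
import random

# pi(action | state): one probability distribution over actions per state.
policy = {
    "low": {"recharge": 0.5, "wait": 0.4, "search": 0.1},
    "high": {"search": 0.9, "wait": 0.1},
}


def sample_action(state, rng=random):
    """Draw one action according to pi(. | state)."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


# Empirical check: sample many actions in the low state and count them.
rng = random.Random(0)
counts = {a: 0 for a in policy["low"]}
for _ in range(10_000):
    counts[sample_action("low", rng)] += 1
print(counts)
```

The observed frequencies should approach 50%, 40%, and 10% as the number of samples grows. Unlike a deterministic policy, calling `sample_action` twice with the same state can return different actions.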


### State-Value Functions

* The state-value function for a policy $\pi$ is denoted $v_\pi$.
  For each state $s \in\mathcal{S}$, it yields the expected return if the agent starts in state $s$ and then uses the policy to choose its actions for all time steps. That is, $v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t=s]$.
  We refer to $v_\pi(s)$ as the value of state $s$ under policy $\pi$.

* The notation $\mathbb{E}_\pi[\cdot]$ is borrowed from the suggested textbook, where $\mathbb{E}_\pi[\cdot]$ is defined as the expected value of a random variable, given that the agent follows policy $\pi$.
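Because $v_\pi(s)$ is an expectation, one simple way to approximate it is to run many episodes from $s$ under the policy and average the returns (a Monte Carlo estimate). The sketch below does this for a deliberately tiny, hypothetical task invented for this example: from the single non-terminal state, each step continues with probability 0.5 and pays +1, and otherwise the episode ends with reward 0. For this task the exact value is $v = 0.5\,(1 + v)$, i.e. $v = 1$.

```python
import random


def sample_return(rng, p_continue=0.5):
    """Undiscounted return G of one episode of the hypothetical task:
    each step pays +1 and continues with probability p_continue;
    otherwise the episode terminates with reward 0."""
    g = 0
    while rng.random() < p_continue:
        g += 1
    return g


def estimate_state_value(num_episodes=20_000, seed=0):
    """Monte Carlo estimate of v(s): the average return over many episodes."""
    rng = random.Random(seed)
    total = sum(sample_return(rng) for _ in range(num_episodes))
    return total / num_episodes


v_hat = estimate_state_value()
print(v_hat)  # should be close to the exact value 1.0
```

The estimate is noisy for small numbers of episodes and gets closer to the true expectation as more episodes are averaged.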


