# Step-1 What is Deep Reinforcement Learning?
Link = https://medium.freecodecamp.org/an-introduction-to-reinforcement-learning-4339519de419

## The Reinforcement Learning Process
![](https://cdn-images-1.medium.com/max/1116/1*aKYFRoEmmKkybqJOvLt2JQ.png?raw=true)

This RL loop outputs a sequence of **state**, **action** and **reward**.

### The central idea of the Reward Hypothesis

Cumulative reward at each time step $t$:

$$G_t=\sum_{k=0}^{T}R_{t+k+1}\tag{1.1}$$

Rewards that come sooner are more likely to be received, since they are more predictable than long-term future rewards. We therefore discount future rewards by a factor $\gamma$:

$$G_t=\sum_{k=0}^{\infty}\gamma^k R_{t+k+1}\text{, where } 0\leq\gamma<1\tag{1.2}$$

- the larger $\gamma$ the smaller the discount
- the smaller $\gamma$ the bigger the discount
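
As a small worked example of (1.2), the sketch below computes the discounted return for a finite list of rewards. The function name `discounted_return` and the reward values are illustrative assumptions, not part of the original article.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k], i.e. G_t for a finite reward list."""
    g = 0.0
    # Walk backwards so each step applies G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 1.0, 1.0]  # hypothetical rewards R_{t+1}, R_{t+2}, R_{t+3}
print(discounted_return(rewards, 0.9))  # large gamma: later rewards still count
print(discounted_return(rewards, 0.5))  # small gamma: later rewards shrink quickly
```

With $\gamma=0.5$ the three rewards contribute $1 + 0.5 + 0.25$, while with $\gamma=0.9$ they contribute $1 + 0.9 + 0.81$, which matches the two bullets above.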

### Episodic or Continuing tasks

An **episodic task** has a starting point and an ending point.<br>
A **continuing task** has no ending point. The agent keeps running until we decide to stop it.

### Monte Carlo vs TD Learning methods

In **Monte Carlo Approach** we collect the rewards **at the end of the episode** and then *calculate* the **maximum expected future reward**.

$$V(S_t)\leftarrow V(S_t) + \alpha[G_t-V(S_t)]\tag{1.3}$$

In **Temporal difference learning** we estimate the **reward at *each step***

$$\displaystyle V(S_t) \leftarrow \underbrace{V(S_t)}_{\text{previous estimate}} + \alpha\left[\overbrace{\underbrace{R_{t+1}}_{\text{reward at t+1}}+\underbrace{\gamma V(S_{t+1})}_{\text{discounted value of next state}}}^{\text{TD target}} - V(S_t)\right]\tag{1.4}$$

By running more and more episodes, **the agent will learn to play better and better**.
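
To make the difference between the two updates concrete, here is a minimal sketch on a dictionary-based value table. It uses the standard TD(0) form, in which the current estimate $V(S_t)$ is subtracted from the TD target. The state names, rewards and step sizes are made up for illustration.

```python
def mc_update(v, state, g, alpha):
    """Monte Carlo update: move V(state) toward the full observed return g."""
    v[state] += alpha * (g - v[state])


def td0_update(v, state, reward, next_state, alpha, gamma):
    """TD(0) update: move V(state) toward reward + gamma * V(next_state)."""
    td_target = reward + gamma * v[next_state]
    v[state] += alpha * (td_target - v[state])


v = {"A": 0.0, "B": 1.0}  # hypothetical value estimates

# Monte Carlo needs the return of a finished episode (here assumed to be 2.0).
mc_update(v, "A", g=2.0, alpha=0.5)

# TD(0) only needs one transition: A -> B with reward 1.0.
td0_update(v, "A", reward=1.0, next_state="B", alpha=0.5, gamma=0.9)
print(v["A"])
```

The Monte Carlo update has to wait for the end of the episode to know `g`, while the TD(0) update can be applied after every single step.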

### Exploration/Exploitation trade-off

- **Exploration** is finding more information about the environment
- **Exploitation** is exploiting known information to maximize the reward.

### Three approaches to Reinforcement Learning
#### Value-based
Value-based RL optimizes the value function
<br>***Value function** is a function that tells us the maximum expected future reward the agent will get at each step*.
$$v_\pi(s) =\mathbb{E}\left[ \sum_{k=0}^\infty \gamma^k R_{t+k+1}\mid S_t=s \right]
\tag{1.5}$$

#### Policy based
Policy based RL directly optimizes the policy function $\pi(s)$ without using a value function.

$$\underbrace{a\quad=\quad\pi(s)}_{\text{action = policy(state)}}
\tag{1.6}$$

- *Deterministic* policy will always return the same action at a given state.
- *Stochastic* policy outputs a probability distribution over actions.

$$\pi(a|s) =\mathbb{P}[A_t=a|S_t=s]
\tag{1.7}$$
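
The two kinds of policy can be sketched as follows. The lookup tables, state names and probabilities are hypothetical; a real policy would usually be a learned function rather than a dictionary.

```python
import random


def deterministic_policy(state, action_table):
    """a = pi(s): the same state always maps to the same action."""
    return action_table[state]


def stochastic_policy(state, action_probs, rng):
    """Sample a ~ pi(a|s) from a per-state probability distribution."""
    actions = list(action_probs[state])
    weights = [action_probs[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


# Hypothetical two-state example.
action_table = {"s0": "left", "s1": "right"}
action_probs = {
    "s0": {"left": 0.8, "right": 0.2},
    "s1": {"left": 0.0, "right": 1.0},
}

rng = random.Random(0)
print(deterministic_policy("s0", action_table))  # always the same action
print(stochastic_policy("s0", action_probs, rng))  # "left" about 80% of the time
```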

#### Model Based
Model based RL models the environment.

## Introducing Deep Reinforcement Learning
Deep reinforcement learning introduces deep neural networks to solve Reinforcement Learning problems--hence the name "deep".
![](https://cdn-images-1.medium.com/max/1395/1*w5GuxedZ9ivRYqM_MLUxOQ.png?raw=true)

# Step-2 Diving deeper into Reinforcement Learning with Q-Learning
Link = https://medium.freecodecamp.org/diving-deeper-into-reinforcement-learning-with-q-learning-c18d0db58efe

## Introducing the Q-table
In a **Q-table**, the columns are the actions $a$, the rows are the states $s$, and the value of each cell is the maximum expected future reward for that given state and action, $Q^*(s,a)$.
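
A minimal sketch of such a table, assuming states and actions are numbered from 0. The sizes and the cell values are invented for illustration.

```python
n_states, n_actions = 3, 2  # hypothetical sizes

# One row per state, one column per action, all cells start at 0.
q_table = [[0.0] * n_actions for _ in range(n_states)]

# Pretend that some learning has already filled in state 1.
q_table[1][0] = 0.2
q_table[1][1] = 0.7


def best_action(q_table, state):
    """Return the column index with the highest Q-value in the state's row."""
    row = q_table[state]
    return max(range(len(row)), key=lambda a: row[a])


print(best_action(q_table, 1))
```

Reading the best action for a state is just a row lookup followed by an argmax; ties are broken in favour of the lowest action index in this sketch.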

## Q-Learning algorithm: learning the Action Value Function
$$Q^\pi(s_t,a_t)=\mathbb{E}\left[\sum_{k=0}^\infty \gamma^k R_{t+k+1}\mid s_t,a_t\right]\tag{2.1}$$
![](https://cdn-images-1.medium.com/max/1116/1*yklmxNRdXleiDbv6aSZUIg.png?raw=true)

### The Q-learning algorithm Process
![](https://cdn-images-1.medium.com/max/1116/1*QeoQEqWYYPs1P8yUwyaJVQ.png?raw=true)

1. $\ $ Initialize Q-values $Q(s,a)$ arbitrarily for all state-action pairs
2. $\ $For life or until learning is stopped 
3. $\quad$ Choose an action $(a)$ in the current world state $(s)$ based on current Q-value estimates $Q(s,\cdot)$
4. $\quad$ Take the action $(a)$ and observe the outcome state $(s')$ and reward $(r)$
5. $\quad$ Update $Q(s,a):=Q(s,a)+\alpha[r+\gamma\max_{a'}Q(s',a')-Q(s,a)]$
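
The update in step 5 can be sketched on a dictionary keyed by `(state, action)`. The state names, action names and hyperparameter values below are assumptions made for the example.

```python
from collections import defaultdict


def q_update(q, s, a, r, s_next, actions, alpha, gamma):
    """Apply Q(s,a) := Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    best_next = max(q[(s_next, a2)] for a2 in actions)
    q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])


q = defaultdict(float)  # unseen (state, action) pairs default to 0.0
actions = ["left", "right"]  # hypothetical action set
q[("s1", "right")] = 2.0  # pretend this value was learned earlier

# One observed transition: in s0 take "left", get reward 1.0, land in s1.
q_update(q, "s0", "left", r=1.0, s_next="s1", actions=actions, alpha=0.1, gamma=0.9)
print(q[("s0", "left")])
```

Here the target is $1.0 + 0.9 \times 2.0 = 2.8$, so with $\alpha = 0.1$ the entry moves one tenth of the way from 0 toward 2.8.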

#### How to choose an action at step 3?
The idea is that in the beginning, we'll use the **epsilon greedy strategy**:
- We specify an exploration rate "epsilon", which we set to 1 in the beginning. This is the fraction of steps that we take randomly. In the beginning, this rate must be at its highest value, because we don't know anything about the values in the Q-table. This means we need to do a lot of exploration by choosing our actions randomly.
- We generate a random number. If this number is larger than "epsilon", then we do "exploitation" (we use what we already know to select the best action at each step). Otherwise, we do exploration.
- The idea is that we must have a big epsilon at the beginning of the training of the Q-function, then reduce it progressively as the agent becomes more confident at estimating Q-values.
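
The strategy described in the bullets above can be sketched as follows. The exponential decay schedule and its constants (`eps_max`, `eps_min`, `decay_rate`) are one common choice assumed here for illustration; the notes only require that epsilon starts high and shrinks over time.

```python
import math
import random


def epsilon_greedy(q_row, epsilon, rng):
    """Pick an action index from one state's Q-values."""
    if rng.random() > epsilon:
        # Exploitation: take the action with the highest current estimate.
        return max(range(len(q_row)), key=lambda a: q_row[a])
    # Exploration: take a uniformly random action.
    return rng.randrange(len(q_row))


def decayed_epsilon(step, eps_max=1.0, eps_min=0.01, decay_rate=0.01):
    """Assumed schedule: exponential decay from eps_max toward eps_min."""
    return eps_min + (eps_max - eps_min) * math.exp(-decay_rate * step)


rng = random.Random(0)
q_row = [0.1, 0.9, 0.3]  # hypothetical Q-values for three actions
for step in (0, 100, 1000):
    eps = decayed_epsilon(step)
    print(step, round(eps, 3), epsilon_greedy(q_row, eps, rng))
```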


![](https://cdn-images-1.medium.com/max/1116/1*9StLEbor62FUDSoRwxyJrg.png?raw=true)

#### Recap ...
- Q-learning is a value-based Reinforcement Learning algorithm that is used to find the optimal action-selection policy using a q function.
- It evaluates which action to take based on an action-value function that determines the value of being in a certain state and taking a certain action at that state.
- Goal: maximize the value function Q (expected future reward given a state and action).
- Q table helps us to find the best action for each state.
- To maximize the expected reward by selecting the best of all possible actions.
- The Q comes from the quality of a certain action in a certain state.
- Function Q(state, action) → returns expected future reward of that action at that state.
- This function can be estimated using Q-learning, which iteratively updates Q(s,a) using the Bellman Equation
- Before we explore the environment: Q table gives the same arbitrary fixed value → but as we explore the environment → Q gives us a better and better approximation.

# Step-3 Deep Q-Learning
# An introduction to Deep Q-Learning: let’s play Doom
Link = https://medium.freecodecamp.org/an-introduction-to-deep-q-learning-lets-play-doom-54d02d8017d8


## Adding ‘Deep’ to Q-Learning
Q-learning updates a Q-table. This is a good strategy, but it does not scale when the state and action spaces are **giant**.

Instead, we create a neural network that approximates, given a state, the different Q-values for each action.

## How does Deep Q-Learning work?
![](https://cdn-images-1.medium.com/max/1395/1*LglEewHrVsuEGpBun8_KTg.png?raw=true)

Our Deep Q Neural Network takes a stack of four frames as input. These pass through the network, which outputs a vector of Q-values, one for each action possible in the given state. We take the biggest Q-value of this vector to find our best action (move left/right or shoot).

### Preprocessing part
![](https://cdn-images-1.medium.com/max/1116/1*QgGnC_0BkQEtPqMUftRC6A.png?raw=true)

### The problem of temporal limitation
Link = https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-8-asynchronous-actor-critic-agents-a3c-c88f72a5e9f2

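The first question you might ask is why we stack frames together.

We stack frames together because it helps us handle the problem of temporal limitation: a single frame gives no sense of motion, whereas a stack of consecutive frames lets the network infer direction and speed.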
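
Frame stacking can be sketched with a fixed-length `deque`. The helper name `stack_frame` and the choice to fill the stack with copies of the first frame at the start of an episode are assumptions for this example; the frames are plain strings here, where a real agent would use preprocessed image arrays.

```python
from collections import deque

STACK_SIZE = 4  # the network input is a stack of four frames


def stack_frame(stacked, frame, is_new_episode):
    """Return the updated stack of the most recent STACK_SIZE frames."""
    if is_new_episode:
        # No history yet: fill the whole stack with the first frame.
        stacked = deque([frame] * STACK_SIZE, maxlen=STACK_SIZE)
    else:
        # maxlen makes the deque drop the oldest frame automatically.
        stacked.append(frame)
    return stacked


stacked = stack_frame(None, "frame0", is_new_episode=True)
stacked = stack_frame(stacked, "frame1", is_new_episode=False)
print(list(stacked))
```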

### Using convolution networks
#### ELU function
Link = https://arxiv.org/pdf/1511.07289.pdf
$$
f(x)=
\left\{\begin{matrix}
 x & \text{if } x> 0 \\  
 \alpha(\exp(x)-1)& \text{otherwise} 
\end{matrix}\right.
\tag{3.1}$$

$$
f'(x)=
\left\{\begin{matrix}
 1 & \text{if } x> 0 \\  
 f(x)+\alpha& \text{otherwise} 
\end{matrix}\right.
\tag{3.2}$$
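
Equations (3.1) and (3.2) translate directly into scalar Python. This is only a sketch for checking the formulas by hand; deep learning libraries provide their own vectorized ELU.

```python
import math


def elu(x, alpha=1.0):
    """ELU activation (3.1): identity for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def elu_grad(x, alpha=1.0):
    """ELU derivative (3.2): 1 for x > 0, f(x) + alpha otherwise."""
    return 1.0 if x > 0 else elu(x, alpha) + alpha


for x in (-2.0, -0.5, 0.0, 1.5):
    print(x, elu(x), elu_grad(x))
```

For $x \le 0$ the derivative of $\alpha(\exp(x)-1)$ is $\alpha\exp(x)$, which equals $f(x)+\alpha$, so the gradient can reuse the forward value.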

### Experience Replay: making more efficient use of observed experience
### Avoid forgetting previous experiences
![](https://cdn-images-1.medium.com/max/2000/1*RFt8MBBkUSPZdolp_WfZFA.png?raw=true)
Think of the **replay buffer** as a folder where every sheet is an experience tuple (state, action, reward, next state). We feed it by interacting with the environment, and then we take some random sheets to feed the neural network.
### Reducing correlation between experiences
First, we must stop learning while interacting with the environment. We should try different things and play a little randomly to explore the state space. We can save these experiences in the replay buffer.

Then, we can recall these experiences and learn from them. After that, we go back to playing with the updated value function.
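
A replay buffer along these lines can be sketched with a bounded `deque` and uniform random sampling. The class name, the capacity and the exact tuple layout `(state, action, reward, next_state, done)` are assumptions for this example.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of experience tuples with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        # Once full, the oldest experience is discarded on each append.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size):
        # Sampling without replacement breaks up consecutive, correlated steps.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buffer = ReplayBuffer(capacity=1000, seed=0)
for t in range(10):
    # Hypothetical experience: (state, action, reward, next_state, done)
    buffer.add((t, 0, 1.0, t + 1, False))
print(buffer.sample(4))
```

Because the batch is drawn at random from the whole buffer, the network sees a mix of old and recent experiences instead of a run of consecutive steps.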

## Deep Q-Learning algorithm
The error is calculated by taking the difference between our Q_target (the reward plus the discounted maximum Q-value from the next state) and Q_value (our current prediction of the Q-value).

$$\underbrace{\Delta w}_{\text{change in weights}}=\alpha\left[\overbrace{\underbrace{R+\gamma\max_a\hat{Q}(s',a,w)}_{\text{Q-target}}-\underbrace{\hat{Q}(s,a,w)}_{\text{Q-value}}}^{\text{Temporal difference error}}\right]\underbrace{\nabla_w\hat{Q}(s,a,w)}_{\text{gradient of current predicted Q-value}} \tag{3.3}$$

There are two processes that are happening in this algorithm:
- We sample the environment, where we perform actions and store the observed experience tuples in a replay memory.
- We select a small batch of tuples at random and learn from it using a gradient descent update step.
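
The learning step can be sketched with a linear approximator standing in for the deep network, so that the gradient in (3.3) is easy to see. Everything here (one weight vector per action, the feature vectors, the `done` flag, the hyperparameters) is an assumption for illustration, not the architecture from the article.

```python
def q_hat(w, features, action):
    """Linear Q-value estimate: dot product of w[action] with the features."""
    return sum(wi * xi for wi, xi in zip(w[action], features))


def learn_from_batch(w, batch, alpha, gamma):
    """Apply one update of the form (3.3) for every experience in the batch."""
    for features, action, reward, next_features, done in batch:
        # Q-target: reward plus discounted best next Q-value (0 at episode end).
        best_next = 0.0 if done else max(
            q_hat(w, next_features, a) for a in range(len(w))
        )
        td_error = reward + gamma * best_next - q_hat(w, features, action)
        # For a linear model, the gradient w.r.t. w[action] is the feature vector.
        w[action] = [wi + alpha * td_error * xi for wi, xi in zip(w[action], features)]


w = [[0.0, 0.0], [0.0, 0.0]]  # two actions, two features each
batch = [([1.0, 0.0], 0, 1.0, [0.0, 1.0], True)]  # hypothetical sampled batch
learn_from_batch(w, batch, alpha=0.5, gamma=0.9)
print(w)
```

In a full agent the batch would come from random sampling of the replay memory, and the weight update would be carried out by the optimizer of the neural network.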