#### Autonomous Helicopter

$ Input[\text{Position of Helicopter}] \longrightarrow Output[\text{Movement of control sticks}] $

$ \text{[reward function]} $

Currently popular applications:
- Robot Control
- Factory Optimization
- Playing games (including video games)
- Financial Trading


### Key concepts of Reinforcement Learning

- state $S$: current state of the agent e.g. position

- action $a$: action taken to change the state

- reward $R(S, a)$: reward function for the given state and action

- new_state $S'$: new state after the action

#### Concept of Return

$\text{Return} = R_1 + \gamma R_2 + \gamma^2 R_3 + \gamma^3 R_4 + \dots + \gamma^{(N-1)} R_N$

where $\gamma$ is the discount factor, $0 < \gamma \le 1$, usually chosen close to 1. Rewards that arrive later are weighted less.
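
As a minimal sketch of the return formula, the function below sums a list of rewards with increasing powers of $\gamma$. The reward values and the two discount factors are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Compute R_1 + gamma*R_2 + gamma^2*R_3 + ... for a list of rewards."""
    total = 0.0
    discount = 1.0  # gamma^0 multiplies the first reward
    for r in rewards:
        total += discount * r
        discount *= gamma  # next reward gets one more factor of gamma
    return total


rewards = [0, 0, 0, 100]  # hypothetical rewards R_1..R_4
high = discounted_return(rewards, gamma=0.9)  # roughly 72.9
low = discounted_return(rewards, gamma=0.5)   # 12.5
print(high, low)
```

A smaller $\gamma$ makes the agent more impatient: the same reward of 100 four steps away is worth much less with $\gamma = 0.5$ than with $\gamma = 0.9$.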

#### Policy (Controller) $\Pi$

The goal is to come up with a policy (controller) $\Pi$ that maps each state $S$ to the action to take, $a = \Pi(S)$, so as to maximize the return.

#### Markov Decision Process (MDP)
- future only depends on the current state

## State-action value function (Q or Q* function)

$Q(S,a)$ = return if you start in state $S$, take action $a$ once, and then **behave optimally after that**.

The best possible return from state $S$ is $\max_a Q(S,a)$, and the best action in state $S$ is the action $a$ that attains this maximum.

### Bellman Equation

$ Q(S,a) = R(S) + \gamma \max_{a'} Q(S',a') $
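
To see the Bellman equation at work, here is a sketch that applies it repeatedly to a tiny deterministic world until the values settle. The six-state line world, its rewards, the terminal states and $\gamma = 0.5$ are all assumptions chosen for illustration, not part of these notes.

```python
GAMMA = 0.5
REWARDS = [100, 0, 0, 0, 0, 40]  # hypothetical R(S) for states 0..5
TERMINAL = {0, 5}                # episode ends in these states
ACTIONS = (-1, +1)               # move left / move right


def next_state(s, a):
    """Deterministic transition, clipped to the ends of the line."""
    return min(max(s + a, 0), len(REWARDS) - 1)


def solve_q(sweeps=50):
    """Apply Q(S,a) = R(S) + gamma * max_a' Q(S',a') until it stops changing."""
    q = {(s, a): 0.0 for s in range(len(REWARDS)) for a in ACTIONS}
    for _ in range(sweeps):
        for s in range(len(REWARDS)):
            for a in ACTIONS:
                if s in TERMINAL:
                    q[(s, a)] = REWARDS[s]  # no future term after the end
                else:
                    s2 = next_state(s, a)
                    best_next = max(q[(s2, a2)] for a2 in ACTIONS)
                    q[(s, a)] = REWARDS[s] + GAMMA * best_next
    return q


q = solve_q()
for s in range(len(REWARDS)):
    print(s, q[(s, -1)], q[(s, +1)])
```

In state 1, going left gives $0 + 0.5 \cdot 100 = 50$ while going right gives $12.5$, so the best action there is to go left.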

### Stochastic Environment (random)

$\text{Expected Return} = \text{Average}[R_1 + \gamma R_2 + \gamma^2 R_3 + \gamma^3 R_4 + \dots + \gamma^{(N-1)} R_N]$

where the average is taken over the random outcomes of the environment and $\gamma$ is the discount factor, as before.

#### Stochastic Bellman Equation
$ Q(S,a) = R(S) + \gamma \, \text{Average}[ \max_{a'} Q(S',a') ]$
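
A minimal sketch of one stochastic Bellman backup: the average is a probability-weighted sum over the possible next states. The state names, the Q-table values and the 0.9 / 0.1 transition probabilities are hypothetical.

```python
def expected_backup(reward, gamma, outcomes, q, actions):
    """R(S) + gamma * Average[max_a' Q(S', a')] for one (S, a) pair.

    outcomes: list of (probability, next_state) pairs that sum to 1.
    q: dict mapping (state, action) to the current Q estimate.
    """
    expected_best = sum(
        p * max(q[(s2, a2)] for a2 in actions) for p, s2 in outcomes
    )
    return reward + gamma * expected_best


# Hypothetical Q estimates for two possible next states.
q = {
    ("A", "left"): 50.0, ("A", "right"): 12.5,
    ("B", "left"): 6.25, ("B", "right"): 20.0,
}
# The action reaches A 90% of the time and slips to B 10% of the time.
outcomes = [(0.9, "A"), (0.1, "B")]
value = expected_backup(0.0, 0.5, outcomes, q, actions=("left", "right"))
print(value)  # roughly 0.5 * (0.9 * 50 + 0.1 * 20) = 23.5
```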

## Continuous State Space

Examples: autonomous truck, lunar lander. The state is a vector of numbers rather than one of a few discrete values:

$ S = [x, y, \dot{x}, \dot{y}, \theta, \dot\theta, L, R] $

$x,y$ = position of the vehicle

$\dot x,\dot y$ = velocity of the vehicle along $x$ and $y$

$\theta, \dot\theta$ = orientation and angular velocity (rate of change of orientation)

$L, R$ = binary flags: 1 if the left/right leg (or wheel) is touching the ground, 0 otherwise

## Deep Reinforcement Learning
### DQN Algorithm

1. Initialize the neural network randomly as a first guess of $Q(s,a)$.
2. Repeat:
    - Take actions in the lunar lander and record $(s, a, R(s), s')$.
    - Store the 10,000 most recent $(s, a, R(s), s')$ tuples.
3. Train the neural network:
    - Create a training set of 10,000 examples using
    $x = (s,a)$ and $y = R(s) + \gamma \max_{a'} Q(s',a')$.
    - Train $Q_{new}$ such that $Q_{new}(s,a) \approx y$.
4. Set $Q = Q_{new}$ and repeat from step 2.
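
Step 3 can be sketched as follows: each stored tuple becomes one training pair $(x, y)$. Here `q_guess` is a hypothetical stand-in for the neural network, and the `done` flag (no future term once the episode has ended) is an assumption added for the sketch; the notes store only $(s, a, R(s), s')$.

```python
def build_training_set(experiences, q, actions, gamma):
    """Turn (s, a, r, s_next, done) tuples into inputs x and targets y."""
    xs, ys = [], []
    for s, a, r, s_next, done in experiences:
        if done:
            target = r  # terminal step: nothing to add after the reward
        else:
            target = r + gamma * max(q(s_next, a2) for a2 in actions)
        xs.append((s, a))
        ys.append(target)
    return xs, ys


def q_guess(s, a):
    # Hypothetical current estimate of Q(s, a); a real DQN would call
    # the neural network here.
    return 0.1 * s + a


experiences = [
    (0, 1, 0.0, 1, False),   # ordinary step
    (1, 0, 100.0, 2, True),  # final step of an episode
]
xs, ys = build_training_set(experiences, q_guess, actions=(0, 1), gamma=0.5)
print(xs)
print(ys)
```

The network would then be fitted to these pairs with an ordinary supervised regression loss, so that $Q_{new}(s,a) \approx y$.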

### $\epsilon$ Greedy Policy

While choosing the action $a$:
1. **Greedy (exploitation)**: with probability $1-\epsilon$ (e.g. 0.95), pick the action that maximizes $Q(s,a)$.
2. **Exploration**: with probability $\epsilon$ (e.g. 0.05), pick an action $a$ at random.

In practice, $\epsilon$ starts high and decreases as training progresses: more exploration at the start, more greedy towards the end.
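
A minimal sketch of $\epsilon$-greedy selection, treating $\epsilon$ as the probability of taking a random action and shrinking it over time. The decay schedule and its constants (`start`, `end`, `decay`) are illustrative assumptions, not values from these notes.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index: random with probability epsilon, else argmax Q."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit


def decayed_epsilon(step, start=1.0, end=0.01, decay=0.995):
    """Shrink epsilon geometrically from `start` towards a floor of `end`."""
    return max(end, start * decay ** step)


q_values = [1.0, 3.0, 2.0]  # hypothetical Q(s, a) for three actions
for step in (0, 100, 1000):
    eps = decayed_epsilon(step)
    print(step, round(eps, 3), epsilon_greedy(q_values, eps))
```

With `epsilon=0.0` the function always returns the greedy action (index 1 for the values above); with `epsilon=1.0` every choice is random.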

## Algorithm Refinement
### Mini-batches
- Use a subset of the training examples for each iteration of gradient descent.
- Each step is noisier but much faster.

### Soft-update

$ Q \leftarrow Q_{new} $ --> hard update: always overwrite $Q$ with the new value.

$ Q \leftarrow (0.05)\,Q_{new} + (1-0.05)\,Q $ --> soft update: blend a small part of the new value with the old value.


### State of Reinforcement Learning
- Much easier to get to work in a simulation than a real robot or system.
- Far fewer real applications than supervised and unsupervised learning.
- Exciting research is going on.

## Deep Q-Learning (DQN)
### Target Network

$ y = R + \gamma \max_{a'} Q(s', a'; w) $
where $w$ are the weights of the network.

We are trying to minimize the error by adjusting $w$:

$Error = R + \gamma \max_{a'} Q(s', a'; w) - Q(s,a;w) = y - Q(s,a;w) $

Since $y$ itself depends on $w$, it changes at every iteration. The network is chasing a moving target, so the error minimization can oscillate and become unstable.

Therefore we use a clone network $\hat{Q}(s,a;w^-)$ as the target network.

We use the $\hat{Q}$ network to generate $y$ and update $w^-$ from $w$ with a **soft update**:

$w^-_{new} \leftarrow \tau w + (1-\tau) w^-$

where $\tau \ll 1$

This ensures that $y$ changes slowly and improves the stability of our learning algorithms.
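
A minimal sketch of the soft update on a flat list of weights. Real networks hold many weight arrays, and the weight values and $\tau = 0.1$ below are hypothetical, picked only to make the drift visible in a few steps.

```python
def soft_update(target_weights, online_weights, tau):
    """Return tau * w + (1 - tau) * w_minus, element by element."""
    return [
        tau * w + (1.0 - tau) * w_minus
        for w, w_minus in zip(online_weights, target_weights)
    ]


w = [1.0, -2.0]       # online network weights (hypothetical)
w_minus = [0.0, 0.0]  # target network weights
for _ in range(3):
    w_minus = soft_update(w_minus, w, tau=0.1)
print(w_minus)  # roughly [0.271, -0.542]: moving slowly towards w
```

After three updates the target weights have covered only about 27% of the distance to the online weights, which is why the targets $y$ change slowly.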

### Experience Replay
States, actions, and rewards within the environment are sequential, so consecutive experiences are strongly correlated, and learning from them in order biases the agent. To resolve this, we store the agent's experiences in a memory buffer and learn from mini-batches sampled at random from that buffer.
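
A minimal sketch of such a memory buffer using only the standard library. The class name `ReplayBuffer` and its methods are hypothetical, and the tiny capacity is chosen just for the demonstration.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size memory of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity=10_000):
        # A deque with maxlen silently drops the oldest item when full.
        self.memory = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.memory.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Draw a random mini-batch without replacement."""
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(capacity=3)
for t in range(5):  # hypothetical sequential experience
    buffer.add(state=t, action=0, reward=0.0, next_state=t + 1)

print(len(buffer))       # only the 3 most recent tuples are kept
print(buffer.sample(2))  # random order breaks the sequential correlation
```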
