# Week 04 Notes - Model Free Learning <a class="tocSkip">

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Dopamine-in-Neuroscience" data-toc-modified-id="Dopamine-in-Neuroscience-1">Dopamine in Neuroscience</a></span><ul class="toc-item"><li><span><a href="#Notes" data-toc-modified-id="Notes-1.1">Notes</a></span></li><li><span><a href="#Take-Aways" data-toc-modified-id="Take-Aways-1.2">Take Aways</a></span></li><li><span><a href="#Learning-Resources" data-toc-modified-id="Learning-Resources-1.3">Learning Resources</a></span></li></ul></li><li><span><a href="#Reading-Assignment:-Model-Based-vs.-Model-Free-Learning-" data-toc-modified-id="Reading-Assignment:-Model-Based-vs.-Model-Free-Learning--2">Reading Assignment: Model Based vs. Model Free Learning <a name="dopamine"></a></a></span></li><li><span><a href="#Homework-Assignment:-Q-Learning" data-toc-modified-id="Homework-Assignment:-Q-Learning-3">Homework Assignment: Q Learning</a></span></li><li><span><a href="#Temporal-Difference-Learning" data-toc-modified-id="Temporal-Difference-Learning-4">Temporal Difference Learning</a></span></li><li><span><a href="#Quiz:-Model-Free-Learning" data-toc-modified-id="Quiz:-Model-Free-Learning-5">Quiz: Model Free Learning</a></span></li><li><span><a href="#Q-Learning-Tutorial-for-Ride-Sharing" data-toc-modified-id="Q-Learning-Tutorial-for-Ride-Sharing-6">Q Learning Tutorial for Ride Sharing</a></span></li><li><span><a href="#Quantum-Interview" data-toc-modified-id="Quantum-Interview-7">Quantum Interview</a></span><ul class="toc-item"><li><span><a href="#Learning-Resources" data-toc-modified-id="Learning-Resources-7.1">Learning Resources</a></span></li></ul></li></ul></div>

## Dopamine in Neuroscience

**Video Description**:

The human brain is wondrous in its capabilities. The rules that govern its prowess at so many tasks are becoming slightly clearer every day. In this video, I'll highlight how 4 key reinforcement learning algorithms help explain how the human brain works, specifically through the lens of the neurotransmitter known as 'dopamine'. These algorithms have been used to help train everything from autopilot systems for airplanes to video game bots. TD-Learning, Rescorla-Wagner, Kalman Filters, and Bayesian Learning, all in one go!

### Notes
- **Associative Learning Theory**
    - Describes the process by which a person or animal learns an association between 2 stimuli
    - Basic claims are that Reinforcement Learning is the acquisition of associations between states, actions and rewards

|    Type    | Punishment<br/>(decreasing behavior) | Reinforcement<br/>(increasing behavior)  |
| :-------------: | :-------------: | :-----: |
| Positive<br/>_(adding)_ | adding something<br/>to<br/>decrease behavior | adding something<br/>to<br/> increase behavior |
| Negative<br/>_(subtracting)_ | subtracting something<br/>to<br/>decrease behavior | subtracting something<br/>to<br/>increase behavior |


- **Rescorla-Wagner Model**
    - Most influential idea of associative learning theory
    - Prediction error based learning model where stimuli acquire value when there is a mismatch between the prediction and the outcome
    - In the equation, the value of stimulus s at trial n equals its value at the previous trial plus the learning rate times the prediction error (the difference between the reward received and the reward expected). The learning rate defines how much this prediction error is weighted
    - This prediction error states that when the animal gets more reward than it expected, it leads to strengthening of the associative weights and a negative prediction error leads to a weakening of the associative weights
    
    $$ \Delta{V_n} = c(\lambda - V_{n-1})$$
    
    - where:
        - $\Delta{V_n}$: change in associative strength for CS on one trial
        - $c$: represents the salience of CS and US; a constant (0.0 - 1.0)
        - $\lambda$: maximum associative strength (magnitude of UR)
        - $V_{n-1}$: associative strength _already_ accrued by CS
        - CS: conditioned stimulus
        - US: unconditioned stimulus
        
    - Groundbreaking because:
        1. Able to explain the conditioning phenomena
        2. Useful in early Natural Language Processing Systems
    - While it provided a basis for associative learning theory, it only estimated a single value. Since the world is full of uncertainty, biological brains must be able to represent that uncertainty somewhere
    - Probability theory suggests that to properly represent a biological brain's uncertainty about the world, it should use a probability distribution over the possible weights instead of a single value
    - Use Bayes Rule:
     
    $$ P(A|B) = \frac{P(B|A)P(A)}{P(B)} $$
    
    - The Bayesian generalization of the Rescorla-Wagner Model is embodied in the Kalman Filter.
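The Rescorla-Wagner update above can be traced numerically. The following is a minimal sketch, not code from the video: the function name `rescorla_wagner` and the values `c=0.3` and `lam=1.0` are illustrative assumptions. It applies $\Delta V_n = c(\lambda - V_{n-1})$ once per trial.

```python
def rescorla_wagner(n_trials, c=0.3, lam=1.0, v0=0.0):
    """Return the associative strength V after each trial.

    Applies Delta V_n = c * (lam - V_{n-1}) on every trial.
    c, lam and v0 are illustrative values, not taken from the notes.
    """
    values = []
    v = v0
    for _ in range(n_trials):
        delta_v = c * (lam - v)  # prediction error scaled by salience
        v += delta_v
        values.append(v)
    return values


# Acquisition: the CS is paired with the US, so lam = 1.0
acquisition = rescorla_wagner(n_trials=10)

# Extinction: the US is withheld, so lam = 0.0 and V decays from its learned value
extinction = rescorla_wagner(n_trials=10, lam=0.0, v0=acquisition[-1])

print([round(v, 3) for v in acquisition])
print([round(v, 3) for v in extinction])
```

The associative strength rises quickly at first and then levels off as it approaches $\lambda$, because the prediction error shrinks on every trial.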
    
    
- **Kalman Filter**
    - States that uncertainty grows over time due to the random diffusion of the weights
    - This uncertainty can be reduced by observing the data
    - Uses a series of measurements observed over time and produces estimates of unknown variables that tend to be more accurate than those based on a single measurement alone by estimating a joint probability distribution over the variables for each time frame
    - Has numerous applications:
        - Navigation
        - Control of Vehicles (aircraft)
        - Time Series Analysis
        - Robotics
    - Has also been used to model the central nervous system's control of movement
    - Because of the time delay between issuing motor commands and receiving sensory feedback, the Kalman Filter is a realistic model for estimating the current state of a motor system and issuing updated commands.
    - It's a two step process:
        - In the prediction step, it produces estimates of the current state variables along with their uncertainties 
        - Once the outcome of the next measurement is observed, these estimates are updated using a weighted average, with more weight given to estimates with higher certainty
    - It's a recursive real-time algorithm that uses only the present input measurements, the previously calculated state, and its uncertainty matrix
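The two-step process can be sketched for the simplest case: a single scalar state that drifts as a random walk. This is an illustrative sketch under that assumption. The function name `kalman_step` and the noise values are hypothetical, and a real filter would carry a state vector and covariance matrix instead of two floats.

```python
def kalman_step(mean, var, measurement, process_var, measurement_var):
    """Run one predict/update cycle of a scalar Kalman filter.

    Assumes a random-walk state model: the state is expected to stay
    where it is, while its uncertainty grows by process_var each step.
    """
    # Prediction step: uncertainty grows due to random diffusion
    pred_mean = mean
    pred_var = var + process_var

    # Update step: the gain weights the measurement by relative certainty
    gain = pred_var / (pred_var + measurement_var)
    new_mean = pred_mean + gain * (measurement - pred_mean)
    new_var = (1.0 - gain) * pred_var
    return new_mean, new_var


# Start very uncertain (variance 1.0) and feed in noisy readings near 1.0
mean, var = 0.0, 1.0
for z in [1.2, 0.9, 1.1, 1.0]:
    mean, var = kalman_step(mean, var, z, process_var=0.01, measurement_var=0.25)

print(mean, var)
```

Each call needs only the previous mean and variance plus the new measurement, which is what makes the algorithm recursive. The variance shrinks as data arrives, so later measurements move the estimate less than early ones.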

So far, we have broken down tasks into trials but real life operates in continuous time and our algorithms are short-sighted. They have been only able to predict the immediate reward, the one that will be received in the very next state. To extend the capabilities of what we are able to model mathematically, we need to switch to modern reinforcement learning theory from classical learning theory.

Let's focus on a specific sequential decision problem - a mouse navigating to the exit of a maze. There are two popular classes of Reinforcement Learning algorithms we can use to help solve this problem:

|    Type    | Model Free<br/>Less online deliberation | Model Based<br/>More online deliberation  |
| :-------------: | :-------------: | :-----: |
| Learn: | Policy $\pi$ | Model of<br/> R and T |
| Online: | $\pi(s)$ | Solve MDP |
| Method: | Learning from<br/>demonstration | Adaptive dynamic<br/>programming |
| Pros: | Simpler execution | Fewer examples needed to learn? |


- **Temporal Difference Learning**
    - Extended the Rescorla-Wagner Model by introducing a discount factor into the prediction error, which helps define how much a reward matters to an agent depending on when in time it's received
    - We can make rewards that happen in the near term worth more
    - Invented by 2 researchers (Sutton & Barto) in the 1980s
    - It has been used for:
        - Game bots that can beat humans (the most popular example is DeepMind and its Deep Q-Learning algorithm, which is able to beat many Atari games)
    - TD Learning helps capture some important properties of temporal dynamics as well as dopamine responses but it lacks the uncertainty tracking mechanism of the Kalman Filter.
    - So we need a Bayesian version of TD Learning, which is called Kalman TD
    - Instead of estimating a single value of the weights, we estimate a mean and a covariance matrix for the weights of our model.
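To make the effect of the discount factor concrete, here is a small sketch. The helper names `discounted_return` and `td_error` and the value `gamma=0.9` are assumptions chosen for illustration. It compares a reward received immediately with the same reward received three steps later, and shows the discounted prediction error that TD Learning uses.

```python
def discounted_return(rewards, gamma):
    """Sum a reward sequence, weighting the reward at step k by gamma**k."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def td_error(reward, value_next, value_current, gamma):
    """Prediction error with the next state's value discounted by gamma."""
    return reward + gamma * value_next - value_current


# The same reward of 1.0, received now versus three steps from now
soon = discounted_return([1.0, 0.0, 0.0, 0.0], gamma=0.9)
late = discounted_return([0.0, 0.0, 0.0, 1.0], gamma=0.9)
print(soon, late)

# A positive error means the outcome was better than predicted
print(td_error(reward=1.0, value_next=0.0, value_current=0.5, gamma=0.9))
```

With `gamma` below 1 the delayed reward is worth less than the immediate one, which is how near-term rewards come to matter more to the agent.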
    
We can think of all four of these models along two dimensions based on what kind of estimator they are (Bayesian or Point based) and what the target they're trying to estimate is (immediate reward or value):

|    Estimator\Target    | Reward  | Value  |
| :-------------: | :-------------: | :-----: |
| Point    | Rescorla-Wagner | TD Learning |
| Bayesian | Kalman Filter | Kalman TD |


### Take Aways
- Associative learning is a learning process in which a new response becomes associated with a particular stimulus
- When we build mathematical models of learning, we can use distributions instead of single values to help represent uncertainty about the world
- Temporal Difference Learning is a Model Free Learning technique that predicts the expected value of a variable occurring at the end of a sequence of states


### Learning Resources
- [Youtube Video](https://www.youtube.com/watch?v=-vhYoS3751g)
- [Code Link](https://github.com/llSourcell/Mathematics_of_Dopamine)
- [Youtube: The Rescorla-Wagner Model](https://www.youtube.com/watch?v=pYyUSh1veoo)
- [Youtube: TD Learning - Richard S. Sutton](https://www.youtube.com/watch?v=LyCpuLikLyQ)
- [Youtube: Special Topics - The Kalman Filter](https://www.youtube.com/watch?v=CaCcOwJPytQ)
- [Youtube: Bayesian Learning](https://www.youtube.com/watch?v=C2OUfJW5UNM)
- [PDF: A Unifying Probabilistic View of Associative Learning](https://dash.harvard.edu/bitstream/handle/1/23845336/4633133.pdf?sequence=1&isAllowed=y)
- [Book: Chapter 9 Temporal-Difference Learning](https://web.stanford.edu/group/pdplab/pdphandbook/handbookch10.html)



## Reading Assignment: Model Based vs. Model Free Learning <a name="dopamine"></a>
- [Model Based vs. Model Free Reading Assignment](https://www.theschool.ai/wp-content/uploads/2018/09/Move37-Reading-Assignment-Model-Based-vs-Model-Free-1.pdf)

**Model**: a plan for our agent. When we have a set of defined state transition probabilities, we call that working with a model. Reinforcement learning can be applied with or without a model, or even used to define a model.

A complete model of the environment is required to do Dynamic Programming. If our agent doesn't have a complete map of what to expect, we can instead employ what is called **model-free learning**, where the agent learns via trial and error.

For some board games such as Chess and Go, although we can accurately model the environment's dynamics, computational power constrains us from calculating the Bellman Optimality equation. This is where Model-free Learning methods shine. We handle this situation by optimizing for a smaller subset of states that are frequently encountered, at the cost of knowing less about the infrequently visited states.

Further Reading:
- [Medium: Model Free Reinforcement Learning Algorithms](https://medium.com/deep-math-machine-learning-ai/ch-12-1-model-free-reinforcement-learning-algorithms-monte-carlo-sarsa-q-learning-65267cb8d1b4)
- [Book: Temporal-Difference Learning (RL: An Introduction, Chapter 6)](http://incompleteideas.net/book/the-book.html)
- [PDF: Reward-Based Learning, Model-Based and Model-Free (2014)](https://www.quentinhuys.com/pub/HuysEa14-ModelBasedModelFree.pdf)
- [BAIR: Temporal Difference Methods: Model-Free Deep RL for Model-Based Control (2018)](https://bair.berkeley.edu/blog/2018/04/26/tdm/)
- [ArXiv: Temporal Difference Methods: Model-Free Deep RL for Model-Based Control (2018)](https://arxiv.org/abs/1802.09081)


## Homework Assignment: Q Learning

This week's homework assignment is to implement Q learning from scratch for the gridworld environment. Use this [repository](https://github.com/rlcode/reinforcement-learning/tree/master/1-grid-world) as a guide, but try not to peek at the Q learning code. Recreate it first, then check your code against it. Good luck!


## Temporal Difference Learning
- [Youtube Video](https://www.youtube.com/watch?v=f4zTDRavVq0)

**TD Learning**:
- This general rule can be summarized as:

$$ New\ Estimate \leftarrow Old\ Estimate + Step\ Size\ [\ Target - Old\ Estimate\ ] \\ $$
$$ Where\ [\ Target - Old\ Estimate\ ]\ \rightarrow estimation\ error\ (\delta) \\ $$

- The Target is the expected return of the state:

$$ Target\ = E_{\pi}\ \left[\ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}\ \right] $$

- The learning rate is a parameter which determines to what extent the error has to be integrated in the new estimation. If the step size is 0, the agent does not learn anything at all. If the step size is 1, the agent considers only the most recent information.
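The role of the step size can be checked directly with the general rule. The snippet below is a minimal sketch. The function name `incremental_update` and the numbers are illustrative assumptions.

```python
def incremental_update(old_estimate, target, step_size):
    """New Estimate <- Old Estimate + Step Size * [Target - Old Estimate]."""
    error = target - old_estimate  # the estimation error (delta)
    return old_estimate + step_size * error


old, target = 5.0, 10.0
for step_size in (0.0, 0.1, 0.5, 1.0):
    print(step_size, incremental_update(old, target, step_size))
```

A step size of 0 leaves the estimate untouched, a step size of 1 replaces it with the target, and values in between move the estimate part of the way toward the target.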


$TD(\lambda) $:

- TD(0) algorithm does not take past states into account
- What matters in TD(0) is the current state and the state at $t + 1$
- However it would be useful to extend what is learned at $t + 1$ to previous states. To achieve this objective, it is necessary to have a **short-term memory mechanism** to store the states which have been visited in the last steps.

For each state $s$ at time $t$ we can define $e_t(s)$ as the **eligibility trace**:

$$
e_t(s) = \left\{
    \begin{array}{l}
      \gamma \lambda e_{t - 1}(s)   \quad\quad    \text{if } s \neq s_t;\\
      \gamma \lambda e_{t - 1}(s) + 1 \quad      \text{if } s = s_t;
    \end{array}
  \right.
$$

Here $\gamma$ is the discount rate and $\lambda \in [0, 1]$ is a decay parameter called the **trace-decay** parameter, which defines the update weight for each state visited. Because each visit adds 1 to the existing trace, this form is known as an **accumulating trace**. When $0 < \lambda < 1$, the traces decrease in time.

This allows giving a small weight to infrequent states. For $\lambda = 0$, we have the TD(0) case, and only the immediately preceding prediction is updated. For $\lambda = 1$, we have TD(1) where all the preceding predictions are equally updated.
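The behavior of the trace can be followed by hand on a tiny example. The following sketch assumes three states and a hypothetical visit order, and the function name `step_traces` is also an assumption. It applies the rule above to every state at each time step.

```python
def step_traces(traces, visited_state, gamma, lam):
    """Decay every trace by gamma * lam, then add 1 to the visited state."""
    new_traces = [gamma * lam * e for e in traces]
    new_traces[visited_state] += 1.0
    return new_traces


# Hypothetical trajectory: state 0, then state 1 twice, then state 2
traces = [0.0, 0.0, 0.0]
for state in [0, 1, 1, 2]:
    traces = step_traces(traces, state, gamma=0.9, lam=0.5)
    print([round(e, 4) for e in traces])

# With lam = 0 only the current state keeps a non-zero trace (the TD(0) case)
td0_traces = [0.0, 0.0, 0.0]
for state in [0, 1, 1, 2]:
    td0_traces = step_traces(td0_traces, state, gamma=0.9, lam=0.0)
print(td0_traces)
```

State 1 is visited twice in a row, so its trace accumulates above 1 before decaying, while state 0, visited only once at the start, fades toward zero.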

Now it's time to define the **update rule for TD($\lambda$)**:

- Estimation error $\delta$ is defined as:

$$ \delta_t = r_{t+1} + \gamma U(s_{t+1}) - U(s_t) $$

- We can update the utility function as:

$$ U_{t+1}(s) = U_t(s) + \alpha \delta_t e_t(s) \quad for\ all\ s \in S $$

![temporal_difference_learning_lambda](imgs/temporal_difference_learning_at_lambda.jpg)

**Python Implementation**:
- The Python Implementation of $TD(\lambda)$ is straightforward. We only need to add an eligibility matrix and its update rule.
- The **main loop** is much simpler than MC methods. In this case, we do not have any first-visit constraints and all we need to do is to apply the update rule.

See the next cell.

**Credits**:

- [Wikipedia: Temporal Difference Learning](https://en.wikipedia.org/wiki/Temporal_difference_learning)
- [Blog: Dissecting Reinforcement Learning 3](https://mpatacchiola.github.io/blog/2017/01/29/dissecting-reinforcement-learning-3.html)
- [Blog: Suttons Temporal Difference Learning](http://kldavenport.com/suttons-temporal-difference-learning/)
- [Youtube: IIT (Madras) -  Machine Learning #84 RL Framework, Temporal Difference (TD) Learning](https://www.youtube.com/watch?v=7eYzvQci9x0)
- [Youtube: DeepMind's Richard Sutton - The Long-Term of AI & Temporal-Difference Learning](https://www.youtube.com/watch?v=EeMCEQa85tw)

```python
# Python implementation of TD(lambda)
# Assumes `env`, `policy_matrix`, `utility_matrix`, `trace_matrix`, `tot_epoch`,
# `alpha`, `gamma` and `lambda_` are defined beforehand, with the matrices
# supporting element-wise arithmetic (e.g. NumPy arrays).
def update_utility(utility_matrix, trace_matrix, alpha, delta):
    '''Return the updated utility matrix

    @param utility_matrix the matrix before the update
    @param trace_matrix the eligibility traces matrix
    @param alpha the step size (learning rate)
    @param delta the error (Target - Old_Estimate)
    @return the updated utility matrix
    '''
    utility_matrix += alpha * delta * trace_matrix
    return utility_matrix

def update_eligibility(trace_matrix, gamma, lambda_):
    '''Return the updated trace_matrix

    @param trace_matrix the eligibility traces matrix
    @param gamma discount factor
    @param lambda_ the decaying value
    @return the updated trace_matrix
    '''
    trace_matrix = trace_matrix * gamma * lambda_
    return trace_matrix

for epoch in range(tot_epoch):
    # Reset and return the first observation
    observation = env.reset(exploring_starts=True)

    for step in range(1000):
        # Take the action from the action matrix
        action = policy_matrix[observation[0], observation[1]]

        # Move one step in the environment and get the observation and reward
        new_observation, reward, done = env.step(action)

        # Estimate the error delta (Target - Old_Estimate)
        delta = reward + gamma * \
            utility_matrix[new_observation[0], new_observation[1]] - \
            utility_matrix[observation[0], observation[1]]

        # Add +1 in the trace matrix (only the state visited)
        trace_matrix[observation[0], observation[1]] += 1

        # Update the utility matrix (all the states)
        utility_matrix = update_utility(utility_matrix, trace_matrix, alpha, delta)

        # Update the trace_matrix (decaying) (all the states)
        trace_matrix = update_eligibility(trace_matrix, gamma, lambda_)
        observation = new_observation

        if done:
            break  # episode finished
```
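The loop above relies on an environment and matrices defined elsewhere, so it cannot be run on its own. As a self-contained sketch of the same update, the code below uses a hypothetical five-state random walk in place of the gridworld. The environment, the function name `td_lambda_random_walk`, and all parameter values are assumptions for illustration. The agent starts in the middle, moves left or right at random, and receives a reward of 1 only when it exits on the right.

```python
import random


def td_lambda_random_walk(episodes=5000, alpha=0.02, gamma=1.0,
                          lambda_=0.5, seed=0):
    """Estimate state utilities of a 5-state random walk with TD(lambda).

    States 0 and 6 are terminal; states 1..5 are non-terminal.
    Reaching state 6 gives reward 1, everything else gives reward 0.
    """
    rng = random.Random(seed)
    n_states = 7
    utility = [0.0] + [0.5] * 5 + [0.0]  # terminal utilities stay at 0

    for _ in range(episodes):
        traces = [0.0] * n_states  # fresh eligibility traces each episode
        state = 3                  # start in the middle

        while state not in (0, 6):
            new_state = state + rng.choice((-1, 1))
            reward = 1.0 if new_state == 6 else 0.0

            # Estimation error (Target - Old_Estimate)
            delta = reward + gamma * utility[new_state] - utility[state]

            # Add +1 to the trace of the state just visited
            traces[state] += 1.0

            # Update every state's utility, then decay every trace
            for s in range(n_states):
                utility[s] += alpha * delta * traces[s]
                traces[s] *= gamma * lambda_

            state = new_state

    return utility[1:6]


estimates = td_lambda_random_walk()
print([round(u, 2) for u in estimates])
```

For this walk the true utility of state $k$ is $k/6$, the probability of exiting on the right, so the estimates should increase roughly linearly from left to right. Unlike the loop above, this sketch resets the traces at the start of every episode, which is a design choice made here and not something stated in the notes.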

## Quiz: Model Free Learning

## Q Learning Tutorial for Ride Sharing

**Video Description**:

Learn how to use Q Learning to optimize dispatch and routing for a ride sharing app.

- [Youtube Video](https://www.youtube.com/watch?v=tU6_Fc6bKyQ)
- [Code Link](https://github.com/colinskow/move37/tree/master/q_learning)

## Quantum Interview

**Video Description**:

I ask 67 questions to Dr Alan Baratz - EVP, R&D and Chief Product Officer of D-Wave Systems. D-Wave has built an incredible quantum computer, and invited me to Vancouver to attend a special launch event of their new Leap system, which allows any developer to use quantum computing very easily in the cloud. In this interview, Alan walks me through the D-Wave facility in Vancouver, and we even get to step inside the quantum computer room. Enjoy!


### Learning Resources
- [Youtube Video](https://www.youtube.com/watch?v=Ewf_gBWBH2A)
- [D-Wave Website](https://www.dwavesys.com/home)
- [Youtube: Quantum Computers Explained](https://www.youtube.com/watch?v=JhHMJCUmq28)
- [Hackernoon: Quantum Computing Explained!](https://hackernoon.com/quantum-computing-explained-a114999299ca)
- [Youtube: Quantum Algorithm](https://www.youtube.com/watch?v=LhtnECml-KI)
- [Youtube: Quantum Machine Learning](https://www.youtube.com/watch?v=DmzWsvb-Un4)
- [Clerro: Quantum Computing Explained](https://www.clerro.com/guide/580/quantum-computing-explained)