# Reinforcement Learning

Aside: Reinforcement Learning

In [None]:
from IPython.display import Image
Image(filename="images/rl-01.png")

### API
Each component of the RL API can be pictured as a box that maps inputs to outputs.

In [None]:
Image(filename="images/rl-02.png")

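1. Planner
    - Last time Charles talked about the Planner box.
    - Input: a model (transition function T, reward function R). Output: a policy.
    - e.g. value iteration or policy iteration

2. Learner (reinforcement learning)
    - Input: transitions <s,a,r,s'>. Output: a policy.
    - Will see many transitions.

3. Modeler
    - Input: transitions. Output: a model (T, R).

4. Simulator
    - Input: a model (T, R). Output: transitions.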

### Ways of gluing these together:

In [None]:
Image(filename="images/rl-03.png")

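- RL-based planner: a planner with a learner inside (model -> Simulator -> transitions -> Learner -> policy).
- Model-based RL: a learner with a planner inside (transitions -> Modeler -> model -> Planner -> policy).

e.g.
- The backgammon-playing RL program used an RL-based planner.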

## Three Approaches to RL

In [None]:
Image(filename="images/rl-04.png")

1. Policy search
    - A policy maps states to actions.
    - Adv: Direct use -> you learn the quantity you actually need.
    - Disadv: Indirect learning. The data doesn't tell you which action to choose (temporal credit assignment problem).

2. Value-function based
    - A value function maps states to values.
    - Adv: Direct learning. Simple if you do it right, and can be powerful.
    - Disadv: Indirect use. The value function has to be turned into a policy, but for some types of value function that conversion is just an argmax.

3. Model-based RL
    - Learn a model (T, R) from transitions.
    - Adv: Direct learning.
    - Disadv: Indirect use. Going from T, R to U means solving the Bellman equations (e.g. by value iteration) and then turning U into a policy. Not nice to do, but doable: it takes planning and optimising.

Focus on Value-function based approaches for now.


## A new kind of value function

$$U(s)=R(s) + \gamma \max_a \sum_{s'}T(s,a,s')U(s')$$
- Long-term value of being in a state is the reward for arriving in that state + the discounted reward of the future. (To leave the state we're going to choose an action and take the expectation ...)

$$\pi(s) = \text{arg}\max_a \sum_{s'} T(s,a,s')U(s')$$
- Look at expected values: iterate over all possible next states, weight the utility of landing in each state by its probability, and pick the action with the highest expected utility.
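
The two equations above can be sketched with a few lines of NumPy. This is only an illustration under stated assumptions: the 2-state, 2-action MDP below (the arrays `T` and `R`, the value of `gamma`, and the helper name `value_iteration`) is made up for the example and does not come from the lecture.

```python
import numpy as np

# Hypothetical MDP, numbers invented for illustration.
# T[s, a, s2] = probability of landing in s2 after taking action a in state s.
T = np.array([
    [[1.0, 0.0], [0.2, 0.8]],   # state 0: action 0 stays put, action 1 usually reaches state 1
    [[0.0, 1.0], [0.9, 0.1]],   # state 1: action 0 stays put, action 1 usually returns to state 0
])
R = np.array([0.0, 1.0])        # R[s]: reward associated with state s
gamma = 0.9                     # discount factor


def value_iteration(T, R, gamma, tol=1e-10, max_iters=10_000):
    """Repeatedly apply U(s) = R(s) + gamma * max_a sum_s' T(s,a,s') U(s')."""
    U = np.zeros(len(R))
    for _ in range(max_iters):
        expected = T @ U                           # expected[s, a] = sum_s' T(s,a,s') U(s')
        U_new = R + gamma * expected.max(axis=1)   # Bellman backup
        if np.max(np.abs(U_new - U)) < tol:
            return U_new
        U = U_new
    return U


U = value_iteration(T, R, gamma)
pi = (T @ U).argmax(axis=1)     # pi(s) = argmax_a sum_s' T(s,a,s') U(s')
print("U  =", U)
print("pi =", pi)
```

In this toy problem the greedy policy heads for state 1 and then stays there, because that is where the reward is.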

### **NEW value function** Q: 
$$Q(s,a) = R(s) + \gamma \sum_{s'}T(s,a,s')\max_{a'}Q(s',a')$$
- Called Q because Q is in the latter half of the alphabet and many other letters are already taken.
- Value for arriving in s, leaving via a (landing in s' with probability T(s,a,s')), and proceeding optimally thereafter.

**Using Q to define U and $\pi$**
- Observe that U(s) returns a scalar, while $\pi$(s) returns an action.
$$U(s) = \max_a Q(s,a)$$
$$\pi(s) = \text{arg}\max_a Q(s,a)$$
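
As a small sketch of these two definitions, assume Q is stored as a table with one row per state and one column per action. The table values below are invented for illustration.

```python
import numpy as np

# Hypothetical Q table: rows are states, columns are actions (made-up values).
Q = np.array([
    [1.0, 3.0],
    [2.5, 0.5],
    [0.0, 0.0],
])

U = Q.max(axis=1)       # U(s) = max_a Q(s, a): one scalar per state
pi = Q.argmax(axis=1)   # pi(s) = argmax_a Q(s, a): one action index per state

print(U)    # U  -> [3.0, 2.5, 0.0]
print(pi)   # pi -> [1, 0, 0] (ties go to the lowest action index)
```

Once Q is known, neither T nor R is needed to act: both U and $\pi$ fall out of a single pass over the table.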

### Q-learning
- Evaluating the Bellman equations from data


In [None]:
Image(filename="images/rl-05.png")

Estimating Q from transitions

$$Q(s,a) = R(s) + \gamma \sum_{s'}T(s,a,s')\max_{a'}Q(s',a')$$
- In the learning setting we don't have R or T (when solving an MDP we do). With R and T, solving for Q can be done in polynomial time.
- A transition is <s,a,r,s'>.
$$\hat Q(s,a) \leftarrow^{\alpha_t} r + \gamma \max_{a'} \hat Q(s',a')$$
where $\alpha_t$ is the learning rate at step $t$. 0 means no learning, 1 means the old estimate is fully replaced by the new value, and 0.5 averages the previous estimate and the new value.


- The update has no sum over next states. Instead it uses the sampled s', the max over a', and the current estimate of Q in that next state.
- Notation: e.g. $V\leftarrow^\alpha X$ means $V \leftarrow V + \alpha(X-V) = (1-\alpha)V + \alpha X$, i.e. moving $\alpha$ of the way from V to X.
    - Converges to E(X) if $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$ (e.g. $\alpha_t = 1/t$)
    - Believe new samples less over time
    - Like an average
    - Adding things up and computing a weighted average, with weights decaying over time
- Q-learning computes the average value you'd get if you follow the optimal policy after taking a particular action.
$$\hat Q(s,a) \leftarrow^{\alpha_t} r + \gamma \max_{a'} \hat Q(s',a')$$
- Hand-waving past the fact that the target in the line above is itself moving (it depends on $\hat Q$), the estimate converges to
$$E[r+\gamma \max_{a'} \hat Q(s',a')]$$
- which, by linearity of expectation, is
$$=R(s) + \gamma E_{s'}[\max_{a'} \hat Q(s',a')]$$
$$=R(s) + \gamma \sum_{s'}T(s,a,s')\max_{a'}\hat Q(s',a')$$
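
The $\leftarrow^{\alpha}$ notation used in this section can be illustrated on its own, without any MDP. The sketch below assumes a stream of noisy samples drawn from a Gaussian with mean 5 (an arbitrary choice for the example) and uses the schedule $\alpha_t = 1/t$. The helper name `soft_update` is hypothetical.

```python
import random


def soft_update(v, x, alpha):
    """V <-(alpha) X: move alpha of the way from the old estimate v to the sample x."""
    return v + alpha * (x - v)


rng = random.Random(0)          # fixed seed so the run is repeatable
v = 0.0
for t in range(1, 10_001):
    x = rng.gauss(5.0, 2.0)     # noisy sample X with E[X] = 5 (assumed for the example)
    v = soft_update(v, x, alpha=1.0 / t)

print(v)                        # close to E[X] = 5
```

With $\alpha_t = 1/t$ the estimate is exactly the running mean of the samples seen so far, which is why the update behaves "like an average".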

## Q-learning convergence


In [None]:
Image(filename="images/rl-06.png")


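Line 2 of the algorithm updates $\hat Q$.
It is remarkable that the whole learning rule is one line of code.
Important caveat: convergence requires every state-action pair (s, a) to be visited infinitely often, along with a suitably decayed $\alpha_t$.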
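
A minimal tabular sketch of the update, under several assumptions that are not from the lecture: a made-up 2-state, 2-action environment (`T`, `R`), a per-pair learning rate that decays as (visit count)^-0.6, and uniformly random action selection so that every (s, a) keeps being visited. The learner only ever sees sampled transitions <s,a,r,s'> and never reads `T` or `R` directly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical environment, hidden from the learner (numbers invented for illustration).
T = np.array([
    [[1.0, 0.0], [0.2, 0.8]],
    [[0.0, 1.0], [0.9, 0.1]],
])
R = np.array([0.0, 1.0])
gamma = 0.9


def step(s, a):
    """Sample one transition: returns (r, s_next)."""
    s_next = rng.choice(2, p=T[s, a])
    return R[s], s_next


def q_learning(n_steps=50_000):
    Q_hat = np.zeros((2, 2))                # "Q-hat starts anywhere"; zeros are an arbitrary choice
    visits = np.zeros((2, 2))
    s = 0
    for _ in range(n_steps):
        a = rng.integers(2)                 # random behaviour: explores, never exploits
        r, s_next = step(s, a)
        visits[s, a] += 1
        alpha = visits[s, a] ** -0.6        # assumed schedule: sum is infinite, sum of squares is finite
        # The one-line Q-learning update:
        Q_hat[s, a] += alpha * (r + gamma * Q_hat[s_next].max() - Q_hat[s, a])
        s = s_next
    return Q_hat


Q_hat = q_learning()
print(Q_hat)
print("greedy policy:", Q_hat.argmax(axis=1))
```

Random action selection is used here only because it keeps the sketch short. It satisfies the "visit everything" caveat, but it never takes advantage of what has been learned.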

Q-learning is actually a **family of algorithms**.
Vary along following themes:
- How to initialise $\hat Q$?
- How to decay $\alpha_t$?
- How to choose actions?
    - Bad ways of choosing actions
        - Always choose $a_0$ -> Bad because it never tries the other actions, so it learns nothing about them.
        - Choose randomly -> May have learned Q, but we don't use it. Don't take advantage of anything you learn.
        - Use $\hat Q$ (**greedy action selection**).
            - Can be bad: if $a_0$ starts out looking great and everything else looks terrible, you only stop choosing $a_0$ if updating $\hat Q$ makes it look worse than terrible. Kind of a **local min**.
            - i.e. an initial $\hat Q$ can make some local min look better than the optimal.
        - Random restarts -> start over again and again.
            - Going to take an even longer time to get to an answer.
            - Might help us get unstuck. (In random optimisation, we did this so if we got stuck we could throw out everything and get unstuck.)
    - Use a simulated annealing-like approach
        - Take uphill steps, but occasionally take a random downhill step: a mixture of choosing randomly and using $\hat Q$.
        - Take a random action sometimes: $\hat \pi (s) = \text{arg}\max_a \hat Q(s,a)$ with probability $1-\epsilon$, a random action otherwise.
        - Gives a chance of exploring the whole space and learning the true Q even if you're stuck.

## $\epsilon$-greedy exploration

In [None]:
Image(filename="images/rl-07.png")

- Decayed $\epsilon$ -> Over time more greedy, less random.
- $\hat Q \to Q$ follows from the standard Q-learning convergence result, so the greedy policy $\hat \pi$ approaches the optimal policy as $\epsilon$ decays.
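
One way $\epsilon$-greedy selection with a decaying $\epsilon$ might look in code. The function names and the particular decay schedule (`eps0 / (1 + decay * t)`) are assumptions made for this sketch; the lecture does not prescribe a schedule.

```python
import random


def epsilon_greedy(q_row, epsilon, rng=random):
    """Greedy action w.r.t. q_row with probability 1 - epsilon, uniformly random action otherwise."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))                      # explore
    return max(range(len(q_row)), key=lambda a: q_row[a])     # exploit


def decayed_epsilon(t, eps0=1.0, decay=0.001):
    """Hypothetical schedule: epsilon shrinks toward 0 as the step count t grows."""
    return eps0 / (1.0 + decay * t)


q_row = [0.1, 0.9, 0.3]          # made-up Q-hat values for one state
rng = random.Random(0)
for t in (0, 1_000, 100_000):
    eps = decayed_epsilon(t)
    picks = [epsilon_greedy(q_row, eps, rng) for _ in range(1_000)]
    share = picks.count(1) / len(picks)
    print(f"t={t:>6}  epsilon={eps:.3f}  greedy action chosen {share:.0%} of the time")
```

Early on almost every action is random. As $\epsilon$ decays, the greedy action dominates, which is the "more greedy, less random" behaviour described above.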

### Exploration-exploitation dilemma
**Fundamental tradeoff in RL**
- Exploration: Getting data you need so you learn
- Exploitation: Using what you know
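- Tradeoff because there's only one agent acting in the world, and each action it takes can serve only one of the two purposes.
- Mirrors how modelling (learning) and planning (using) interact with each other.

- **Optimism in the face of uncertainty**: exploration-exploitation can also be handled through how $\hat Q$ is initialised. Start with optimistically high values so that untried actions look attractive until they have been tried.
    - Similar in spirit to the optimistic heuristics used in A*.

- Other approaches to exploration-exploitation: some in the model-based setting are more powerful because they keep track of which parts of the environment have been learned and which haven't (Transfer?). The algorithm can then explore what it doesn't know and exploit what it does know.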
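
The effect of optimistic initialisation can be sketched on a deliberately tiny problem. Everything here is an assumption made for illustration: a single state with three actions, deterministic made-up rewards, no discounting, purely greedy action selection, and a fixed learning rate of 0.5.

```python
def greedy_run(q_init, rewards, alpha=0.5, n_steps=200):
    """Purely greedy learning on a one-state problem, starting every estimate at q_init."""
    q = [q_init] * len(rewards)
    choices = []
    for _ in range(n_steps):
        a = max(range(len(q)), key=lambda i: q[i])   # greedy w.r.t. current estimates
        q[a] += alpha * (rewards[a] - q[a])          # move the estimate toward the observed reward
        choices.append(a)
    return q, choices


rewards = [0.2, 0.5, 0.9]                            # hypothetical payoffs; action 2 is best

_, pessimistic = greedy_run(0.0, rewards)            # estimates start too low
_, optimistic = greedy_run(10.0, rewards)            # estimates start above any possible reward

print("zero init tries actions:      ", sorted(set(pessimistic)))
print("optimistic init tries actions:", sorted(set(optimistic)))
print("optimistic init ends on action", optimistic[-1])
```

With zero initialisation the first action tried immediately looks better than the untried ones, so the greedy agent never leaves it. With optimistic initialisation every action is disappointing at first, which forces the agent to try them all before settling on the best one.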

## Summary
- Learn to solve an MDP without knowing T or R, but with the ability to interact with the environment via transitions <s,a,r,s'>
- Q-learning family: the Q function, convergence
- Exploration-exploitation: learn and use
    - Optimism in the face of uncertainty
- Approaches to RL
- Connection to planning

Connection to function approximation: overfitting comes up in more generalised RL situations.