## Decision Making and Reinforcement Learning
(RL is a mechanism for doing Decision Making)

* Supervised Learning: y = f(x)
    * Function approximation
    * Given x,y pairs, aim is to find f to map x to y.
* Unsupervised Learning: f(x)
    * Clustering description
    * Given a bunch of xs, the goal is to find some f that gives a compact description of x.
* Reinforcement Learning: y = f(x)
    * Given a string of x,z pairs of data, learn an f that generates ys.
    
Grid world, 3x4 matrix.
- Introduce uncertainty (stochasticity)
    - When you choose an action, it executes correctly with probability 0.8.
    - Otherwise it moves at a right angle to the intended direction: probability 0.1 for each side.
- Q: What is reliability of previous sequence UURRR?
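
A rough brute-force check of this question. Assumption (these notes don't pin the grid down): the usual textbook layout with 4 columns x 3 rows, start in the bottom-left corner, a wall in the second column of the middle row, +1 in the top-right corner and -1 just below it. All names below are made up for the sketch. It pushes a probability distribution over states through the fixed action sequence:

```python
# ASSUMED layout: 4 columns x 3 rows, (x, y) with (0, 0) at bottom-left.
COLS, ROWS = 4, 3
WALL, GOAL, BAD = (1, 1), (3, 2), (3, 1)
MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}
# The two directions at right angles to each intended action.
SIDEWAYS = {"U": "LR", "D": "LR", "L": "UD", "R": "UD"}


def step(state, direction):
    """Deterministic move; bumping into the wall or the edge leaves you in place."""
    x = state[0] + MOVES[direction][0]
    y = state[1] + MOVES[direction][1]
    if (x, y) == WALL or not (0 <= x < COLS and 0 <= y < ROWS):
        return state
    return (x, y)


def success_probability(plan, start=(0, 0)):
    """Probability of being in GOAL after executing the fixed action sequence."""
    belief = {start: 1.0}  # probability distribution over states
    for action in plan:
        nxt = {}
        for state, p in belief.items():
            if state in (GOAL, BAD):  # absorbing states: the game is over
                nxt[state] = nxt.get(state, 0.0) + p
                continue
            outcomes = [(action, 0.8)] + [(d, 0.1) for d in SIDEWAYS[action]]
            for direction, q in outcomes:
                s2 = step(state, direction)
                nxt[s2] = nxt.get(s2, 0.0) + p * q
        belief = nxt
    return belief.get(GOAL, 0.0)


print(round(success_probability("UURRR"), 5))  # 0.32776
```

Under those assumptions the plan succeeds with probability 0.8^5 + 0.1^4 x 0.8 = 0.32776: either every action executes correctly, or the first four all slip and you happen to go around the other way.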

Way of capturing these uncertainties directly:

## Markov Decision Processes

Problem:
* States: S
    * Set of elements (one for every state you can be in).
    * Often have initial and goal states
* **Model**: T(s,a,s') ~Pr(s'|s,a)
    * Rules of the game you're playing. Physics of the world.   
    * T is a function of a state, an action and another state. (That other state s' can be the same as state s.)
    * Model is simple in a deterministic world.
* **Actions**: A(s), A
    * E.g. Up, down, left, right. (No option not to move in this game.)
    * Generally we think of it as a function of states.
* Reward: R(s), R(s,a), R(s,a,s')
    * Scalar value you get for being in a state. E.g. R(goal) = 1, R(red) = -1.
    * Reward encompasses our domain knowledge: The usefulness of entering into that state.

Solution:
* Policy: $\pi(s) \to a$
    * Action you should take in a state. Like a command.
    * $\pi^*$ the optimal policy that maximises the long-term expected reward.

### Markovian Property
1. Only the present matters. You don't have to condition on anything past the most recent state.
    - Even if something isn't really Markovian, you can make your state remember everything from the past. -> But that makes it hard to learn, because you'll only ever see each state once.
    - Could also fold action into state.

Another property:
2. The model is stationary: the model (the rules) doesn't change. (Definition we use for now)

Putting it into the context of RL:
* We would like <s,a> pairs to be the training set, with a being the action we SHOULD take.
* But what we actually get is <s,a,r> pairs and we need to work out what the optimal policy $\pi^*$ is. And that's kind of our f.
    * s is x
   
   
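Policies are more robust to the underlying stochasticity than plans: a plan is a fixed sequence of actions, whereas a policy says what to do in whatever state you end up in.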



## Rewards
- Idea of sequences: Actions that set you up for other actions which then lead to rewards.
    - You don't know WHICH action(s) led to you playing well or badly (reward +1 or -1), i.e. which moves made you win or lose (chess analogy). Contrast with SL, where you are told the right answer for each input.
- **Delayed rewards**
- Minor changes matter

Temporal Credit Assignment Problem

e.g. R(s) = -.04 
- (for every state except the two end states: goal state = +1, 'no' (red) state = -1.)
Can represent policy with arrows
- End states: Absorbing states
(img)
- -> **Minor changes (to R(s), say) matter**

Bottom right case: Minimise chances of slippage and delay. Encouraged to end the game no matter what vs to end the game on +1 for LHS.

Reward // **teaching signal**
or because rewards define MDP, rewards are **domain knowledge**.

### Sequences of Rewards: Assumptions
STATIONARY.

1. **Infinite Horizons**
    - Assuming you can live forever. E.g. if grid world lasted 10 moves, you might choose to avoid -1 rather than risk -1 to go for +1. (Or you might choose to take more risk.)
    - -> Policy can change even if you're in the same state (different number of timesteps left). 
        - i.e. $\pi(s,t)$.
        - I suppose time could be part of the state.
2. **Utility of Sequences** (adding up rewards follows from stationary preferences; nothing else is guaranteed to give this property)
    - if $U(S_0, S_1, S_2, \ldots) > U(S_0, S_1', S_2', \ldots)$ then $U(S_1, S_2, \ldots) > U(S_1', S_2', \ldots)$
    - (Utility over sequence of states)
    
$$U(S_0 S_1 S_2 ...) = \sum_{t=0}^\infty R(s_t)$$

- With this rule, an infinite stream of rewards (1, 1, ...) is no better than (0.5, 0.5, ...): both sum to infinity, so they can't be told apart.

$$U(S_0 S_1 S_2 ...) = \sum_{t=0}^\infty \gamma^t R(s_t) \leq \sum_{t=0}^\infty \gamma^t R_{max} = \frac{R_{max}}{1-\gamma}, \quad 0\leq\gamma < 1$$

Discounted sum. Allows us to go an infinite distance in finite time.
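
A quick numeric sanity check of the two sums. This is only a sketch: the reward streams, `gamma = 0.9` and the 500-step cutoff are arbitrary choices standing in for the infinite sequence.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite prefix of a reward sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))


gamma = 0.9
horizon = 500  # long finite prefix standing in for the infinite sequence

ones = [1.0] * horizon
halves = [0.5] * horizon

# Undiscounted sums keep growing with the horizon, so both head to infinity.
print(sum(ones), sum(halves))  # 500.0 250.0

# Discounted sums settle near R_max / (1 - gamma) and stay comparable.
print(discounted_return(ones, gamma))    # close to 1 / (1 - 0.9) = 10
print(discounted_return(halves, gamma))  # close to 0.5 / (1 - 0.9) = 5
```

With discounting, the (1, 1, ...) stream is worth about twice the (0.5, 0.5, ...) stream, which is the comparison the undiscounted sum could not make.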

**Singularity** (analogy): What limits how fast computer power grows is the time it takes to design the next computer. Once computers can design the next generation of computers themselves, each generation designs its successor twice as fast, so the time between generations halves every time. You would then get an infinite number of successors in finite time.
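
The same geometric series sits behind this analogy. As a sketch, assume (hypothetically) the first generation takes time $T_0$ to design and every later one takes half as long as its predecessor:

$$\sum_{k=0}^{\infty} T_0 \left(\tfrac{1}{2}\right)^k = \frac{T_0}{1-\tfrac{1}{2}} = 2T_0$$

That is the discounted-sum bound with $\gamma = \tfrac{1}{2}$ and $T_0$ in place of $R_{max}$: infinitely many terms, finite total.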

## Policies

Optimal policy $\pi^*$ is one that maximises long-term reward
$$\pi^* = \text{argmax}_\pi E[\sum_{t=0}^{\infty} \gamma^tR(s_t)|\pi]$$
* Expected value of reward of sequence of states we'll see if we follow pi

$$U^{\pi}(s)=E[\sum_{t=0}^{\infty} \gamma^tR(s_t)|\pi, s_0=s]$$
* How good it is to be in a state under a policy is exactly the (discounted) reward we expect to see from that state onwards when following that policy.
* Manages short-term vs long-term tradeoffs. Accounts for delayed rewards.
* $U^{\pi}(s) \ne R(s)$

$$\pi^*(s) = \text{argmax}_a\sum_{s'}T(s,a,s')U(s')$$
where $U(s) = U^{\pi^*}(s)$, i.e. the utility of a state under the optimal policy

Optimal policy maximises expected utility.
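
A minimal sketch of that argmax as a one-step lookahead. The state names, transition probabilities and utilities are all made up for illustration; in practice `U` would be the true utilities $U^{\pi^*}$.

```python
# Hypothetical example: one state with two actions. Nothing here comes
# from the grid world; the numbers are invented for the sketch.
T = {
    # T[s][a] maps next state s' -> Pr(s' | s, a)
    "s0": {
        "safe": {"s1": 1.0},
        "risky": {"s2": 0.8, "s3": 0.2},
    },
}
U = {"s1": 0.5, "s2": 1.0, "s3": -1.0}  # assumed utilities of the next states


def expected_utility(s, a):
    """Sum over s' of T(s, a, s') * U(s')."""
    return sum(p * U[s2] for s2, p in T[s][a].items())


def greedy_action(s):
    """pi*(s) = argmax_a sum_{s'} T(s, a, s') U(s')."""
    return max(T[s], key=lambda a: expected_utility(s, a))


print(expected_utility("s0", "safe"))   # 0.5
print(expected_utility("s0", "risky"))  # 0.8*1.0 + 0.2*(-1.0), about 0.6
print(greedy_action("s0"))              # risky
```

Note the choice depends on the utilities of the successor states, not on their immediate rewards.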

**Bellman Equation**
$$U(s) = R(s) + \gamma \max_a \sum_{s'} T(s,a,s')U(s')$$
* Reward in this state + discounted expected utility of the next state, taking the best action


### Finding Policies

Suppose we have n states. Then we have n Bellman equations
$$U(s) = R(s) + \gamma \max_a \sum_{s'} T(s,a,s')U(s')$$

and n unknowns: $U(s)$ for each state $s$.
BUT max makes the equations non-linear. 
- (Aside you can turn maxes into differentiable stuff that is sometimes useful)
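
One common way to do that (my choice for illustration; the notes don't say which smoothing was meant) is the log-sum-exp soft maximum. `smooth_max` and `beta` are hypothetical names:

```python
import math


def smooth_max(values, beta):
    """Log-sum-exp soft maximum; approaches max(values) as beta grows."""
    m = max(values)  # subtract the max first for numerical stability
    return m + math.log(sum(math.exp(beta * (v - m)) for v in values)) / beta


values = [0.2, 0.5, 0.4]
for beta in (1, 10, 100):
    print(beta, smooth_max(values, beta))
# Always at least max(values) = 0.5, and closer to it as beta increases.
```

Unlike `max`, this is differentiable in every input, at the cost of overestimating the true maximum by at most `log(len(values)) / beta`.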

**Value Iteration**

Algo:
- Start with arbitrary utilities
- Update utilities based on neighbours
    - Neighbours: States they can reach.
- Repeat until convergence

How to update:
- Suppose every time you update is time t.
$$\hat U_{t+1}(s) =  R(s) + \gamma \max_a \sum_{s'} T(s,a,s')\hat U_t(s')$$
- $\hat U_t(s')$ is the current estimate of the utility of $s'$
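
The loop above can be sketched on a tiny made-up MDP. Assumptions: a three-state chain `left -> right -> goal` invented for this example, $\gamma = 0.5$, and the convention that the absorbing goal state just has utility $R(goal)$ because the game ends there.

```python
GAMMA = 0.5
R = {"left": -0.1, "right": -0.1, "goal": 1.0}
TERMINAL = {"goal"}
# T[s][a] maps next state s' -> Pr(s' | s, a); "forward" slips 20% of the time.
T = {
    "left": {"forward": {"right": 0.8, "left": 0.2}, "stay": {"left": 1.0}},
    "right": {"forward": {"goal": 0.8, "right": 0.2}, "stay": {"right": 1.0}},
}


def expected_utility(U, s, a):
    """Sum over s' of T(s, a, s') * U(s')."""
    return sum(p * U[s2] for s2, p in T[s][a].items())


def value_iteration(tol=1e-10):
    U = {s: 0.0 for s in R}  # start with arbitrary utilities
    while True:
        new_U = {}
        for s in R:
            if s in TERMINAL:
                new_U[s] = R[s]  # assumed convention: nothing follows the goal
            else:
                best = max(expected_utility(U, s, a) for a in T[s])
                new_U[s] = R[s] + GAMMA * best  # the update rule above
        if max(abs(new_U[s] - U[s]) for s in R) < tol:  # converged
            return new_U
        U = new_U


def greedy_policy(U):
    """Read the policy off the utilities with a one-step lookahead."""
    return {s: max(T[s], key=lambda a: expected_utility(U, s, a)) for s in T}


U = value_iteration()
print(U)
print(greedy_policy(U))  # forward in both states
```

For this chain the fixed point can be checked by hand: $U(right) = 1/3$ and $U(left) = 1/27$.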

All n equations are tangled together.


* Like a contraction proof. Helps that $\gamma < 1$.
    * $R(s)$ is the truth. Each update adds more truth to the wrong values while the old wrong values get discounted by $\gamma$, so the truth eventually overwhelms the original wrong initialisation and $\hat U_{t+1}(s)$ converges.
(Maybe rewatch vid 24 because I was super distracted.)

So solving for utility (true value) of a state is the same thing as solving for the optimal policy.

...
(img) (vid 26)

...next time utility for x state is greater than 0.
So at some point it'll be worth it to try to go up instead of bashing your head against the wall.

Value iteration works because eventually value **propagates out** from its neighbours.

After more timesteps, you need to figure out the utilities of other states. 

Policy is a mapping from states to actions, NOT states to utilities. If we have U we can figure out $\pi$, but U is more info than we need to figure out $\pi$. If U has the correct orderings it's sufficient.
- U is more like regression, $\pi$ more like a classifier.
    

#### Policy Iteration (vs value iteration)
Emphasis on caring about policies > values.
- Start with initial policy $\pi_0$ <- a guess
- Evaluate how good that policy is: given $\pi_t$, calculate $U_t = U^{\pi_t}$.
- Improve: $\pi_{t+1}(s) = \arg\max_a\sum_{s'} T(s,a,s')U_t(s')$

Allows us to change $\pi$ over time. E.g. suppose we found a great action in some state. Then all other states that can reach that state might end up taking a different action than they did before, because the best action would be moving towards that state.
- How do we calculate $U_t$? Bellman's equation. $$U_t(s)=R(s)+\gamma \sum_{s'} T(s,\pi_t(s),s')U_t(s')$$
    - Instead of the max, stick the policy in, because we already have the policy.
    - n equations in n unknowns but there is no max. Now they are **linear equations**.
- Fewer iterations than value iteration. Apps.
- Bigger jumps than value iterations. Making jumps in policy space rather than in value space.
- Computational tricks e.g. do a step of value iteration to get an estimate of $U_t$.
- Guaranteed to converge. Finite number of policies and you're always getting better.
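
A sketch of the evaluate/improve loop, under the same kind of assumptions as before: a made-up three-state chain, $\gamma = 0.5$, and the absorbing goal given utility $R(goal)$. The evaluation step solves the n linear equations exactly with a small hand-rolled Gauss-Jordan solver (`solve` is a helper written for this sketch, not a library call).

```python
GAMMA = 0.5
STATES = ["left", "right", "goal"]
R = {"left": -0.1, "right": -0.1, "goal": 1.0}
TERMINAL = {"goal"}
T = {  # T[s][a] maps next state s' -> Pr(s' | s, a)
    "left": {"forward": {"right": 0.8, "left": 0.2}, "stay": {"left": 1.0}},
    "right": {"forward": {"goal": 0.8, "right": 0.2}, "stay": {"right": 1.0}},
}


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def evaluate(policy):
    """Solve U(s) - gamma * sum_s' T(s, pi(s), s') U(s') = R(s) for all s."""
    idx = {s: i for i, s in enumerate(STATES)}
    A, b = [], []
    for s in STATES:
        row = [0.0] * len(STATES)
        row[idx[s]] = 1.0
        if s not in TERMINAL:  # assumed convention: U(goal) = R(goal)
            for s2, p in T[s][policy[s]].items():
                row[idx[s2]] -= GAMMA * p
        A.append(row)
        b.append(R[s])
    return dict(zip(STATES, solve(A, b)))


def improve(U):
    """pi_{t+1}(s) = argmax_a sum_{s'} T(s, a, s') U_t(s')."""
    def q(s, a):
        return sum(p * U[s2] for s2, p in T[s][a].items())
    return {s: max(T[s], key=lambda a: q(s, a)) for s in T}


def policy_iteration():
    policy = {s: "stay" for s in T}  # pi_0: an initial guess
    while True:
        U = evaluate(policy)
        new_policy = improve(U)
        if new_policy == policy:  # no state changed its action: done
            return policy, U
        policy = new_policy


policy, U = policy_iteration()
print(policy)  # forward in both states
print(U)
```

There is no max inside `evaluate`, which is why a plain linear solve is enough; the max only appears in `improve`.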

## Summary
- Markov Decision Processes
- States, Rewards, Actions, Transitions, (Discounts <- Parameter)
    - Capturing the underlying process you care about. Rewards & Discounts capture the nature of the task more than the underlying physics.
- Policies
- Value functions (Utilities) -> Factor in long-term aspects vs rewards don't.
- Discounting: deal with infinite sequences in finite time(?)
- Stationary
- Bellman equation
    - Value iteration
    - Policy iteration
        - These can be mapped into linear programs and solved in polynomial time.

Note: Haven't done any reinforcement learning yet. In RL you don't necessarily know the rewards or transitions. Or indeed the actions or states.

