## Markov Decision Process

#### Definition

A sequential decision problem for a fully observable, **stochastic** environment with a Markovian transition model and additive rewards is called a Markov decision process (MDP).

It is defined by:

* A set of states $s\in S$

* A set of actions $a\in A$

* A transition function T(s, a, s') (Markovian)

* A reward function R(s, a, s')

* A start state $s_0$

* A terminal state (optional)
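As a minimal sketch, the components above can be written down as plain Python data. The state names, actions, probabilities, and rewards below are hypothetical and chosen only for illustration; `MODEL`, `T`, and `R` are names introduced here, not part of any library.

```python
# Hypothetical toy MDP: every name and number here is an illustrative assumption.
STATES = ["cool", "warm", "overheated"]
ACTIONS = ["slow", "fast"]
START = "cool"
TERMINALS = {"overheated"}

# MODEL[(s, a)] lists (probability, next_state, reward) triples,
# i.e. T(s, a, s') and R(s, a, s') stored together.
MODEL = {
    ("cool", "slow"): [(1.0, "cool", 1.0)],
    ("cool", "fast"): [(0.5, "cool", 2.0), (0.5, "warm", 2.0)],
    ("warm", "slow"): [(0.5, "cool", 1.0), (0.5, "warm", 1.0)],
    ("warm", "fast"): [(1.0, "overheated", -10.0)],
}


def T(s, a, s_next):
    """Transition probability P(s' | s, a); 0 for outcomes not listed."""
    return sum(p for p, nxt, _ in MODEL.get((s, a), []) if nxt == s_next)


def R(s, a, s_next):
    """Reward for the transition (s, a, s'); 0.0 if it is not listed."""
    for _, nxt, r in MODEL.get((s, a), []):
        if nxt == s_next:
            return r
    return 0.0


# Sanity check: outgoing probabilities of every (s, a) pair sum to 1.
for (s, a), outcomes in MODEL.items():
    assert abs(sum(p for p, _, _ in outcomes) - 1.0) < 1e-12
```

Storing the reward next to each outcome keeps $T$ and $R$ in one table, which is convenient for the sums over $s'$ that appear later.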

#### Solution

A solution to an MDP is called a policy $\pi$.

* $\pi(s)$ is the action recommended by the policy $\pi$ for the state $s$

* An optimal policy $\pi^*$ is a policy that yields the highest expected utility

#### MDP search tree 

Each MDP state projects an **expectimax-like** search tree:

* $s$ is a state and $a$ is an action

* $(s, a)$ is a Q-state: the agent has committed to action $a$ in state $s$ but has not carried it out yet

* From $(s, a)$, the transition $T(s, a, s')$ leads to $s'$ and yields the reward $R(s, a, s')$

#### Utility of State Sequences

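**Discounting**: maximize the sum of rewards, and prefer rewards that arrive earlier over rewards that arrive later.

--> Values of rewards decay exponentially: a reward received $t$ steps in the future is worth $\gamma^t$ times its value (with a decay factor $0<\gamma<1$)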
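A short sketch of the discounted sum, assuming a finite list of rewards indexed from $t = 0$. The function name `discounted_return` and the reward sequences are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite reward sequence."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


# The same reward is worth more when it arrives earlier.
early = discounted_return([10, 0, 0], gamma=0.5)  # 10 * 0.5**0
late = discounted_return([0, 0, 10], gamma=0.5)   # 10 * 0.5**2
print(early, late)  # prints: 10.0 2.5
```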

#### Stationary preferences

Assume an agent's preferences between state sequences are stationary.

Then if two state sequences begin with the same state $r$, they should be preference-ordered the same way as the sequences without $r$, which gives: $$[r, s_1, s_2,\dots] \succ [r, s'_1, s'_2, \dots] \Leftrightarrow [s_1, s_2,\dots] \succ [s'_1, s'_2, \dots]$$

Stationarity has strong consequences: it turns out there are **only** two coherent ways to assign utilities to sequences:

* Additive rewards ($\gamma = 1$) and Discounted rewards ($0<\gamma<1$)
 
With additive (undiscounted) rewards, the utility of an infinite sequence can be infinite: $$\lim_{t\to \infty}U[s_0, s_1, s_2, \dots, s_t] = \infty$$ 

But with discounted rewards the utility has an upper bound: $$\lim_{t\to \infty}U[s_0, s_1, s_2, \dots, s_t] = \sum_{n = 0}^{\infty}\gamma^nR(s_n)\leq \frac{R_{max}}{1 - \gamma}$$
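The bound follows from the geometric series, assuming every reward is at most $R_{max}$ and $0<\gamma<1$: $$\sum_{n=0}^{\infty}\gamma^n R(s_n) \leq \sum_{n=0}^{\infty}\gamma^n R_{max} = \frac{R_{max}}{1-\gamma}$$

As an illustrative example (numbers chosen here, not taken from the notes), with $\gamma = 0.9$ and $R_{max} = 1$ the discounted utility of any sequence is at most $\frac{1}{1-0.9} = 10$.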

#### Optimal Quantities

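* $V^*(s)$ is the value (utility) of a state $s$: the expected utility of starting from $s$ and acting optimally

* $Q^*(s,a)$ is the utility of a Q-state $(s,a)$: the expected utility of starting from $s$, taking action $a$, and acting optimally afterwards

* $\pi^*(s)$ is the optimal action for state $s$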

By the Bellman equations: $$V^*(s)= \mathop{max}\limits_{a}Q^*(s, a)$$ 
$$Q^*(s,a) = \sum_{s'}T(s, a, s')[R(s, a, s') + \gamma V^*(s')]$$
$$\Rightarrow V^*(s)= \mathop{max}\limits_{a}\sum_{s'}T(s, a, s')[R(s, a, s') + \gamma V^*(s')]$$

#### Computation

The MDP tree often goes on forever and states are repeated

--> Do a depth-limited computation, but with increasing depths until the change is small --> (convergence is guaranteed by $\gamma < 1$)

**Time-limited value:**

Define $V_k(s)$ to be the optimal value of $s$ if the game ends in $k$ more steps, so that $V_0(s) = 0$

**Value Iteration:**

$$V_{k+1}(s) \leftarrow \mathop{max}\limits_{a}\sum_{s'}T(s, a, s')[R(s, a, s') + \gamma V_k(s')]$$
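The update above can be sketched in a few lines of Python. This is a minimal version under stated assumptions: the toy model in `MODEL` is hypothetical, and the loop runs a fixed number of sweeps rather than checking for a small change.

```python
# Hypothetical toy MDP (illustrative numbers only).
# MODEL[(s, a)] -> list of (probability, next_state, reward).
MODEL = {
    ("cool", "slow"): [(1.0, "cool", 1.0)],
    ("cool", "fast"): [(0.5, "cool", 2.0), (0.5, "warm", 2.0)],
    ("warm", "slow"): [(0.5, "cool", 1.0), (0.5, "warm", 1.0)],
    ("warm", "fast"): [(1.0, "overheated", -10.0)],
}
STATES = ["cool", "warm", "overheated"]


def actions(s):
    """Actions available in state s (none for terminal states)."""
    return [a for (state, a) in MODEL if state == s]


def q_value(s, a, V, gamma):
    """One-step lookahead: sum over s' of T * (R + gamma * V(s'))."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in MODEL[(s, a)])


def value_iteration(gamma, iterations):
    """Return V_k after `iterations` synchronous updates, starting from V_0 = 0."""
    V = {s: 0.0 for s in STATES}
    for _ in range(iterations):
        # V_{k+1} is built from V_k only; the old table is replaced afterwards.
        V = {
            s: max((q_value(s, a, V, gamma) for a in actions(s)), default=0.0)
            for s in STATES
        }
    return V


V = value_iteration(gamma=0.9, iterations=100)
print({s: round(v, 2) for s, v in V.items()})
```

A stopping rule based on the largest change between $V_k$ and $V_{k+1}$ would match the "until change is small" idea more closely; the fixed sweep count is used here only to keep the sketch short.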

**Policy Methods:**

* Policy Evaluation

    For a fixed policy, the action in each state is given by the policy, so $$V^{\pi}(s) = \sum_{s'}T(s, \pi(s), s')[R(s, \pi(s), s') + \gamma V^{\pi}(s')]$$

    There are two options to calculate the values for a fixed policy $\pi$. The first is to iterate the update, just like value iteration: $$V^{\pi}_{k+1}(s) \leftarrow \sum_{s'}T(s, \pi(s), s')[R(s, \pi(s), s') + \gamma V^{\pi}_k(s')]$$

    Without the max over actions, the complexity is $O(S^2)$ per iteration, compared with $O(AS^2)$ for value iteration. The second option is to use a linear solver, since without the max the equations are linear in $V^{\pi}$

* Policy Extraction (gets the policy implied by values)

    Extracting a policy from $V^*(s)$ needs a one-step lookahead over $T$ and $R$; with Q-values it is simply: $$\pi^*(s) = \mathop{argmax}\limits_a Q^*(s,a)$$

* Policy Iteration

    Two steps, repeated until the policy stops changing: (1) for the fixed current policy $\pi$, find values with policy evaluation --> (2) for the fixed values, get a better policy using policy extraction
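The three policy methods above fit together as in the following sketch. It assumes the same hypothetical toy model as a stand-in for a real problem, uses iterative evaluation with a fixed number of sweeps (not a linear solver), and starts from an arbitrary policy.

```python
# Hypothetical toy MDP (illustrative numbers only).
# MODEL[(s, a)] -> list of (probability, next_state, reward).
MODEL = {
    ("cool", "slow"): [(1.0, "cool", 1.0)],
    ("cool", "fast"): [(0.5, "cool", 2.0), (0.5, "warm", 2.0)],
    ("warm", "slow"): [(0.5, "cool", 1.0), (0.5, "warm", 1.0)],
    ("warm", "fast"): [(1.0, "overheated", -10.0)],
}
STATES = ["cool", "warm", "overheated"]


def actions(s):
    return [a for (state, a) in MODEL if state == s]


def q_value(s, a, V, gamma):
    return sum(p * (r + gamma * V[s2]) for p, s2, r in MODEL[(s, a)])


def evaluate_policy(policy, gamma, sweeps=200):
    """Iterative policy evaluation: repeat the fixed-policy update."""
    V = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        V = {
            s: q_value(s, policy[s], V, gamma) if s in policy else 0.0
            for s in STATES
        }
    return V


def extract_policy(V, gamma):
    """Greedy one-step lookahead on V; terminal states get no action."""
    return {
        s: max(actions(s), key=lambda a: q_value(s, a, V, gamma))
        for s in STATES if actions(s)
    }


def policy_iteration(gamma):
    # Arbitrary starting policy: the first listed action in each state.
    policy = {s: actions(s)[0] for s in STATES if actions(s)}
    while True:
        V = evaluate_policy(policy, gamma)
        improved = extract_policy(V, gamma)
        if improved == policy:  # no state changed its action, so stop
            return policy, V
        policy = improved


policy, V = policy_iteration(gamma=0.9)
print(policy)
```

On this toy model the loop settles on `fast` when cool and `slow` when warm after two rounds of evaluation and extraction.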

## Reinforcement Learning

#### Basic

**Idea**

* Receive feedback in the form of rewards

* Agent’s utility is defined by the reward function

* Must (learn to) act so as to maximize expected rewards

* All learning is based on observed samples of outcomes

**Compared to MDP:**

* A set of states $s\in S$

* A set of actions $a\in A$

* A transition function T(s, a, s') (Markovian)

* A reward function R(s, a, s')

We still look for a policy $\pi(s)$, but now T and R are unknown

#### Model-based learning

Since we don't know T and R, we learn an approximate model from experience and then treat it like an ordinary MDP: we solve for values as if the learned T and R were correct

**Step1:**

* Count outcomes $s'$ for each $(s, a)$

* Normalize to give an estimate $\hat{T}(s, a, s')$   --> use frequency in place of the probability

* Discover each $\hat{R}(s, a, s')$ when we experience $(s, a, s')$

**Step2:**

* Solve the learned MDP given by step 1
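Step 1 can be sketched as counting and normalizing. The experience log below is hypothetical, and the sketch assumes the reward for a given $(s, a, s')$ is deterministic, so the last observed reward is simply stored.

```python
from collections import Counter, defaultdict


def learn_model(experiences):
    """Estimate T-hat and R-hat from (s, a, s_next, reward) tuples."""
    counts = defaultdict(Counter)  # counts[(s, a)][s_next] = times observed
    R_hat = {}                     # R_hat[(s, a, s_next)] = observed reward
    for s, a, s_next, r in experiences:
        counts[(s, a)][s_next] += 1
        R_hat[(s, a, s_next)] = r  # assumes rewards are deterministic
    T_hat = {}
    for (s, a), outcomes in counts.items():
        total = sum(outcomes.values())
        for s_next, n in outcomes.items():
            T_hat[(s, a, s_next)] = n / total  # relative frequency
    return T_hat, R_hat


# Hypothetical experience log.
experiences = [
    ("cool", "fast", "cool", 2.0),
    ("cool", "fast", "warm", 2.0),
    ("cool", "fast", "warm", 2.0),
    ("warm", "slow", "cool", 1.0),
]
T_hat, R_hat = learn_model(experiences)
print(T_hat[("cool", "fast", "warm")])  # 2 of the 3 observed outcomes
```

The resulting `T_hat` and `R_hat` tables can then be handed to any MDP solver for step 2.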

#### Model-free learning

Learn values directly from experience, without estimating T or R: compare what you expected with what you actually got

##### Passive reinforcement learning

**Direct evaluation**

* Goal: Compute values for each state under $\pi$

* Idea: Average together observed sample values

* Advantages: knowledge of T and R is not required; it eventually computes the correct average values

* Disadvantages: connections between states are not used; each state must be learned separately; it takes a long time to learn

* Policy evaluation cannot be used directly since T and R are unknown

* --> Instead, use sample-based policy evaluation (very space- and time-consuming, and we cannot rewind to state $s$ to draw more samples after taking an action): $$V_{k+1}^{\pi}(s)\leftarrow \frac{1}{n}\sum_{i}sample_i$$
where $sample_i = R(s, \pi(s), s'_i)+\gamma V_{k}^{\pi}(s'_i)$
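Direct evaluation can be sketched as averaging observed returns. This is a minimal every-visit version under stated assumptions: the episodes are hypothetical, and each one is a list of `(state, reward)` pairs where the reward is the one received on leaving that state.

```python
from collections import defaultdict


def direct_evaluation(episodes, gamma):
    """Average the observed discounted return from every visit to each state."""
    totals = defaultdict(float)
    visits = defaultdict(int)
    for episode in episodes:
        G = 0.0
        # Walk backwards so G is the return from each step onward.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            totals[state] += G
            visits[state] += 1
    return {s: totals[s] / visits[s] for s in totals}


# Hypothetical episodes collected while following a fixed policy.
episodes = [
    [("B", -1), ("C", -1), ("D", 10)],
    [("E", -1), ("C", -1), ("A", -10)],
]
values = direct_evaluation(episodes, gamma=1.0)
print(values["C"])  # average of 9 and -11
```

Note how `B` and `E` end up with very different values even though both lead to `C`: the estimate ignores how states are connected, which is the disadvantage listed above.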

##### Active reinforcement learning 

**Temporal Difference Learning**

* Idea: learn from every experience

* Temporal difference learning of values: $$sample = R(s, \pi(s), s') + \gamma V^{\pi}(s')$$

and then update: $$V^{\pi}(s) \leftarrow (1-\alpha)V^{\pi}(s)+\alpha \cdot sample$$
It can also be written as $$V^{\pi}(s) \leftarrow V^{\pi}(s)+\alpha (sample - V^{\pi}(s))$$
where $sample - V^{\pi}(s)$ is the temporal difference.

* The running interpolation update makes recent samples more important

* Decreasing learning rate (alpha) can give converging averages
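The update rule can be sketched as a single function applied after each observed transition. The state names, rewards, and starting values below are hypothetical; `td_update` is a name introduced here for illustration.

```python
def td_update(V, s, reward, s_next, alpha, gamma):
    """Apply one temporal difference update to V[s] in place; return the TD error."""
    sample = reward + gamma * V.get(s_next, 0.0)
    td_error = sample - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error
    return td_error


V = {"B": 0.0, "C": 0.0, "D": 8.0}
# Observe C -> D with reward -2: sample = -2 + 8 = 6, so V["C"] moves halfway to 6.
td_update(V, "C", reward=-2.0, s_next="D", alpha=0.5, gamma=1.0)
# Observe B -> C with reward -2: sample = -2 + 3 = 1, so V["B"] moves halfway to 1.
td_update(V, "B", reward=-2.0, s_next="C", alpha=0.5, gamma=1.0)
print(V)  # prints: {'B': 0.5, 'C': 3.0, 'D': 8.0}
```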

**Q-learning**

* Q-learning is based on Q-value iteration, which is similar to value iteration: $$Q_{k+1}(s, a)\leftarrow \sum_{s'}T(s, a, s')[R(s, a, s') + \gamma \mathop{max}\limits_{a'}Q_{k}(s', a')]$$
where $\mathop{max}\limits_{a'}Q_{k}(s', a') = V_k(s')$

* Sample-based Q-learning: $$Q(s, a)\leftarrow (1-\alpha)Q(s, a) + \alpha \cdot sample$$
where $sample = R(s, a, s') + \gamma \mathop{max}\limits_{a'}Q(s', a')$

* Property

    Q-learning converges to the optimal policy even if you're acting suboptimally (off-policy learning), as long as the conditions under Limitation are met

* Limitation

    You need to explore enough, and you have to eventually make the learning rate small enough, but not decrease it too quickly
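A minimal tabular sketch of the sample-based update, assuming a hypothetical deterministic corridor with states 0 to 3 where only entering state 3 gives a reward. The agent acts purely at random, which illustrates the off-policy property: the greedy policy read from `Q` can still be good. A constant learning rate is used here because the toy dynamics are deterministic.

```python
import random

ACTIONS = ["left", "right"]
GOAL = 3  # hypothetical corridor: states 0, 1, 2 and a terminal goal state 3


def step(s, a):
    """Deterministic toy dynamics: reward 1.0 only on entering the goal."""
    s_next = min(s + 1, GOAL) if a == "right" else max(s - 1, 0)
    return s_next, (1.0 if s_next == GOAL else 0.0)


def q_update(Q, s, a, reward, s_next, alpha, gamma):
    """Sample-based update: Q <- (1 - alpha) * Q + alpha * sample."""
    best_next = max(Q.get((s_next, b), 0.0) for b in ACTIONS)
    sample = reward + gamma * best_next
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample


def train(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    Q = {}
    for _ in range(episodes):
        s = 0
        while s != GOAL:
            a = rng.choice(ACTIONS)  # purely random behaviour policy
            s_next, reward = step(s, a)
            q_update(Q, s, a, reward, s_next, alpha, gamma)
            s = s_next
    return Q


Q = train()
greedy = {s: max(ACTIONS, key=lambda a: Q.get((s, a), 0.0)) for s in range(GOAL)}
print(greedy)
```

Entries for the terminal state are never written, so `Q.get` treats its value as 0, matching the convention that no reward follows the end of an episode.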

**Exploration and Exploitation**

* Exploration gives up a reward that you know about in order to learn more about the environment

* Exploitation exploits known rewards to maximize the reward
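One common way to trade the two off is an epsilon-greedy rule. It is not described in the notes above, so treat this as an assumed illustration: with probability `epsilon` the agent explores with a random action, otherwise it exploits the current Q-values. The Q-table below is hypothetical.

```python
import random


def epsilon_greedy(Q, s, actions, epsilon, rng=random):
    """With probability epsilon explore (random action); otherwise exploit."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))


# Hypothetical Q-values for a single state.
Q = {("s0", "left"): 0.2, ("s0", "right"): 0.7}
print(epsilon_greedy(Q, "s0", ["left", "right"], epsilon=0.1))
```

Lowering `epsilon` over time shifts the agent from exploration toward exploitation.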