## 1. Intro

An RL agent may include one or more of these components:

* Policy: the agent's behavior function
* Value function: how good each state or action is
* Model: the agent's representation of the environment

### 1.1 Policy

* A policy is the agent's behavior model
* It is a mapping from state/observation to action
* Stochastic policy: sample an action from the distribution $\pi(a|s) = P[A_t=a|S_t=s]$
* Deterministic policy: $a^* = \underset{a}{\operatorname{argmax}} \pi(a|s)$
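
As a minimal sketch of the two bullet points above, assuming a small tabular policy stored as a nested dict (the state and action names, `sample_action`, and `greedy_action` are made up for illustration):

```python
import random

# Hypothetical tabular policy: pi[state][action] = pi(a|s).
pi = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.5, "right": 0.5},
}

def sample_action(pi, state, rng=random):
    """Stochastic policy: draw an action a with probability pi(a|s)."""
    actions = list(pi[state].keys())
    weights = list(pi[state].values())
    return rng.choices(actions, weights=weights, k=1)[0]

def greedy_action(pi, state):
    """Deterministic policy: pick the action that maximizes pi(a|s)."""
    return max(pi[state], key=pi[state].get)
```

Calling `sample_action(pi, "s0")` repeatedly should return `"right"` about 80% of the time, while `greedy_action(pi, "s0")` always returns `"right"`.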

### 1.2 Value function

* Value function: expected discounted sum of future rewards under a particular policy $\pi$
* The discount factor $\gamma$ weights immediate rewards against future rewards
* Used to quantify goodness/badness of states and actions

$$v_\pi(s) = E_\pi[G_t|S_t=s] = E_{\pi}[\sum_{k=0}^{\infty}\gamma^k R_{t+k+1}|S_t=s], \text{for all } s\in S $$

* Q-function (could be used to select among actions):

$$q_\pi(s, a) = E_\pi[G_t|S_t=s, A_t=a] = E_{\pi}[\sum_{k=0}^{\infty}\gamma^k R_{t+k+1}|S_t=s, A_t=a]$$
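
The return $G_t$ inside both expectations can be computed for one finite episode as follows. This is only a sketch: `discounted_return` is a hypothetical helper, and it assumes the reward list holds $R_{t+1}, R_{t+2}, \dots$ up to the end of the episode.

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * R_{t+k+1} for a finite list of rewards.

    rewards[0] is R_{t+1}, rewards[1] is R_{t+2}, and so on.
    """
    g = 0.0
    # Work backwards so each step applies G = r + gamma * G_next.
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

A single call gives one sampled return; $v_\pi(s)$ is the expectation of this quantity over many episodes that start in $s$ and follow $\pi$.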

### 1.3 Model

A model predicts what the environment will do next. It predicts the next state:

$$P_{ss'}^{a} = P[S_{t+1}=s'|S_t=s,A_t=a]$$

It also predicts the next reward:

$$R_s^a = E[R_{t+1}|S_t=s,A_t=a]$$

### 1.4 Markov Decision Processes (MDPs)

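An MDP is defined by the tuple $(S, A, P, R, \gamma)$:

* $S$ is the set of states
* $A$ is the set of actions
* $P^a$ is the dynamics/transition model for each action, $P(S_{t+1}=s'|S_t=s,A_t=a)$
* $R$ is the reward function, $R(S_t=s,A_t=a) = E[R_{t+1}|S_t=s,A_t=a]$
* Discount factor: $\gamma \in [0, 1]$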
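
One possible way to hold a small finite MDP in code is sketched below. The `MDP` class, its field layout, and the two-state `toy` example are all assumptions made for illustration, not a standard API.

```python
import random
from dataclasses import dataclass

@dataclass
class MDP:
    states: list
    actions: list
    P: dict       # P[(s, a)] -> {s_next: probability}
    R: dict       # R[(s, a)] -> expected immediate reward
    gamma: float  # discount factor in [0, 1]

    def validate(self):
        """Check gamma and that each transition distribution sums to 1."""
        if not 0.0 <= self.gamma <= 1.0:
            raise ValueError("gamma must lie in [0, 1]")
        for s in self.states:
            for a in self.actions:
                total = sum(self.P[(s, a)].values())
                if abs(total - 1.0) > 1e-9:
                    raise ValueError(f"P[{(s, a)}] sums to {total}")
        return True

    def step(self, s, a, rng=random):
        """Sample s' from P(.|s, a) and return it with the expected reward."""
        next_states = list(self.P[(s, a)].keys())
        probs = list(self.P[(s, a)].values())
        s_next = rng.choices(next_states, weights=probs, k=1)[0]
        return s_next, self.R[(s, a)]

# Made-up two-state example.
toy = MDP(
    states=["s0", "s1"],
    actions=["stay", "go"],
    P={
        ("s0", "stay"): {"s0": 1.0},
        ("s0", "go"): {"s0": 0.1, "s1": 0.9},
        ("s1", "stay"): {"s1": 1.0},
        ("s1", "go"): {"s0": 0.9, "s1": 0.1},
    },
    R={("s0", "stay"): 0.0, ("s0", "go"): 1.0,
       ("s1", "stay"): 0.5, ("s1", "go"): 0.0},
    gamma=0.9,
)
```

Here `toy.step("s0", "go")` moves to `"s1"` with probability 0.9 and returns the expected reward 1.0 for that state-action pair.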


### 1.5 Types of RL Agents based on what the agent learns

* Value-based agent
    * explicit: value function
    * implicit: policy (can derive a policy from value function)
* Policy-based agent
    * explicit: policy
    * no value function
* Actor-Critic agent
    * explicit: policy and value function

### 1.6 Types of RL Agents based on whether there is a model

* Model-based
    * explicit: model
    * may or may not have a policy and/or value function
* Model-free
    * explicit: value function and/or policy function
    * No model

<img src="../imgs/RL/rl_types.png" height="321" width="342" />

## Q-learning

### Algorithm

$$
\begin{aligned}
Q(s, a) & \leftarrow Q(s, a) + \alpha [r + \gamma \max_{a'}Q(s', a') - Q(s, a)] \\
s & \leftarrow s'
\end{aligned}
$$
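
A minimal tabular sketch of this update, assuming `Q` is a dict keyed by `(state, action)` that defaults to 0 and that the action set is the same in every state (`q_learning_update` and its parameter names are hypothetical):

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """Apply one Q-learning update in place and return the new current state."""
    # Off-policy target: bootstrap from the best action in s_next,
    # regardless of which action the agent actually takes there.
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return s_next  # s <- s'

# Usage with made-up values.
Q = defaultdict(float)
Q[("s1", "right")] = 2.0
s = q_learning_update(Q, "s0", "left", r=1.0, s_next="s1",
                      actions=["left", "right"], alpha=0.5, gamma=0.9)
```

After this call `Q[("s0", "left")]` is $0 + 0.5 \cdot (1 + 0.9 \cdot 2 - 0) = 1.4$ and `s` is `"s1"`.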

## Sarsa

### Algorithm

$$
\begin{aligned}
Q(s, a) & \leftarrow Q(s, a) + \alpha [r + \gamma Q(s', a') - Q(s, a)] \\
s & \leftarrow s', a \leftarrow a'
\end{aligned}
$$
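
Here $a'$ is the action the agent actually selects in $s'$ under its current policy, so the target uses $Q(s', a')$ instead of the maximum over actions. A minimal tabular sketch, assuming `Q` is a dict keyed by `(state, action)` that defaults to 0 (`sarsa_update` and its parameter names are hypothetical):

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """Apply one SARSA update in place and return the new (state, action)."""
    # On-policy target: bootstrap from the action a_next that was
    # actually chosen in s_next, not from the greedy action.
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
    return s_next, a_next  # s <- s', a <- a'

# Usage with made-up values.
Q = defaultdict(float)
Q[("s1", "left")] = 1.0
Q[("s1", "right")] = 2.0
s, a = sarsa_update(Q, "s0", "left", r=1.0, s_next="s1", a_next="left",
                    alpha=0.5, gamma=0.9)
```

With these numbers `Q[("s0", "left")]` becomes $0.5 \cdot (1 + 0.9 \cdot 1) = 0.95$, because the chosen next action is `"left"` even though `"right"` has the higher value.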