### 1. Markov Process and Markov Decision Process

Markov process. A sequence of states is Markov if and only if the
probability of moving to the next state $S_{t+1}$ depends only on the present state $S_t$ and not on the
previous states $S_1, S_2, \dots, S_{t-1}$. That is, for all $t$,

$\mathbb{P}[S_{t+1}|S_t] = \mathbb{P}[S_{t+1}|S_1, S_2, \dots, S_t]$

In RL we usually assume a time-homogeneous Markov chain, in which the transition
probability is independent of $t$:

$\mathbb{P}[S_{t+1} = s^{\prime}|S_t = s] = \mathbb{P}[S_t = s^{\prime}|S_{t-1} = s].$

Formally,

__Definition 1__ (Markov Process). A Markov process (or Markov chain) is a tuple $(\mathcal{S}, \mathcal{P})$, where:

- $\mathcal{S}$ is a finite set of states
- $\mathcal{P}$ is the state transition probability matrix, $\mathcal{P}_{ss^{'}} = \mathbb{P}[S_{t+1}=s^{'}|S_t=s]$

The dynamics of a Markov process proceed as follows: we start in some state $s_0$ and move to a successor state $s_1$, drawn with probability $\mathcal{P}_{s_0s_1}$. We then move to $s_2$, drawn with probability $\mathcal{P}_{s_1s_2}$, and so on. We represent these dynamics as follows:

$s_0 \to s_1 \to s_2 \to s_3 \to \cdots$
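The sampling procedure described above can be sketched in a few lines of Python. This is only an illustration: the three-state matrix `P`, the helper name `sample_chain`, and the fixed seed are all assumptions made for the example, not part of the definition.

```python
import random

# Hypothetical 3-state chain; row s holds P[S_{t+1} = s' | S_t = s] and sums to 1.
P = [
    [0.9, 0.1, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0],
]

def sample_chain(P, s0, steps, rng):
    """Sample s_0 -> s_1 -> ... -> s_steps from transition matrix P."""
    states = [s0]
    for _ in range(steps):
        current = states[-1]
        # The next state depends only on the current state (row `current` of P).
        nxt = rng.choices(range(len(P)), weights=P[current])[0]
        states.append(nxt)
    return states

rng = random.Random(0)  # fixed seed so the run is repeatable
trajectory = sample_chain(P, s0=0, steps=10, rng=rng)
print(trajectory)
```

Note that `sample_chain` never looks at anything but `states[-1]`, which is exactly the Markov property, and it uses the same `P` at every step, which is time-homogeneity.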

If we introduce reward, action and discount into a Markov process, we get a Markov decision process.

__Definition 2__ (Markov Decision Process). A Markov decision process is a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \gamma, \mathcal{R})$, where:

- $\mathcal{S}$ is a finite set of states
- $\mathcal{A}$ is a finite set of actions
- $\mathcal{P}$ is the state transition probability matrix $\mathcal{P}^a_{ss^{'}} = \mathbb{P}[S_{t+1} = s^{'}|S_t=s,A_t=a]$
- $\gamma \in [0, 1]$ is the discount factor
- $\mathcal{R} : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is a reward function.

The MDP is used to model the environment in reinforcement learning. In an MDP, the transition to the next state $S_{t+1}$ depends not only on the current state $S_t$ but also on the action
$A_t$ taken in that state. In addition, the reward function assigns a reward to each state-action pair.  

The dynamics of an MDP proceed as follows: we start in some state $s_0$ and choose some action
$a_0 \in \mathcal{A}$ to take. As a result of our choice, the MDP randomly transitions to
some successor state $s_1$, drawn with probability $\mathcal{P}^{a_0}_{s_0s_1}$.  
Then, from state $s_1$, we pick another action $a_1$ and move to some state $s_2$, drawn with probability $\mathcal{P}^{a_1}_{s_1s_2}$. We then pick $a_2$, and so on. We can represent this sequential decision-making process as follows:

$s_0 \xrightarrow{a_0} s_1 \xrightarrow{a_1} s_2 \xrightarrow{a_2} s_3 \xrightarrow{a_3} \cdots$
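As a rough sketch of this interaction loop, the snippet below rolls out a few steps of a toy MDP. Everything concrete here is an assumption for illustration: the two states, the action names `"stay"` and `"move"`, the numbers in `P` and `R`, the helper name `rollout`, and in particular the choice to pick actions uniformly at random, which stands in for whatever rule is actually used to select actions.

```python
import random

# Hypothetical 2-state, 2-action MDP.
# P[a][s][s2] = P[S_{t+1} = s2 | S_t = s, A_t = a]; each row sums to 1.
P = {
    "stay": [[1.0, 0.0], [0.0, 1.0]],
    "move": [[0.2, 0.8], [0.7, 0.3]],
}
# R[(s, a)] is the reward for taking action a in state s (made-up values).
R = {(0, "stay"): 0.0, (0, "move"): -1.0, (1, "stay"): 1.0, (1, "move"): -1.0}

def rollout(P, R, s0, steps, rng):
    """Return a list of (state, action, reward, next_state) tuples."""
    actions = sorted(P)
    s, history = s0, []
    for _ in range(steps):
        a = rng.choice(actions)  # assumption: actions chosen uniformly at random
        # The next state depends on both the current state and the chosen action.
        s_next = rng.choices(range(len(P[a][s])), weights=P[a][s])[0]
        history.append((s, a, R[(s, a)], s_next))
        s = s_next
    return history

rng = random.Random(1)  # fixed seed so the run is repeatable
history = rollout(P, R, s0=0, steps=5, rng=rng)
for step in history:
    print(step)
```

Compared with a plain Markov process, the only structural change is that the transition row is indexed by the action as well as the state, and each step also yields a reward.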

### 2. Return, Policy and Value function

Our goal in RL is to choose actions over time so as to maximize the expected value of the return,
i.e. choose the optimal policy. We define return and policy as follows. The return $G_t$ is the total
discounted reward from time-step $t$.

$G_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum^{\infty}_{k=0}\gamma^k R_{t+k+1}$.
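To make the formula concrete, here is a minimal sketch that evaluates the return for a finite list of rewards. Truncating the infinite sum to a finite sequence is an assumption of the example, as are the function name `discounted_return` and the sample rewards. The loop relies on the identity $G_t = R_{t+1} + \gamma G_{t+1}$, which follows directly from the definition.

```python
def discounted_return(rewards, gamma):
    """Compute sum_k gamma**k * rewards[k] for a finite reward sequence."""
    g = 0.0
    # Work backwards through the rewards using G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 0.0, 2.0]  # hypothetical R_{t+1}, R_{t+2}, R_{t+3}
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```

With $\gamma = 0$ only the immediate reward counts, while $\gamma = 1$ gives the plain undiscounted sum of the rewards.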