# Chapter 17 - Reinforcement Learning

Reinforcement Learning (RL) is a suite of techniques for building machine learning systems that make decisions sequentially.

The key distinction between reinforcement learning and standard deep learning lies in how decisions interact over time. In standard deep learning, the prediction of a trained model on one test datum does not affect its predictions on a future test datum. In reinforcement learning, decisions (also called actions) at future instants are affected by the decisions made in the past.

## 17.1. Markov Decision Process (MDP)

### 17.1.1. Definition of an MDP

A Markov decision process (MDP) is a model for how the state of a system evolves as different actions are applied to the system. An MDP is defined by the following components:
* The set of states $\mathcal{S}$ in the MDP.
* The set of actions $\mathcal{A}$ that an agent can take. Actions can change the current state of the agent to some other state within the set $\mathcal{S}$.
* We may not know exactly how the agent's state changes, only the probability of each outcome. In this case, there is a transition probability function $T: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow [0, 1]$ such that $T(s, a, s') = P(s' \mid s, a)$ is the conditional probability of reaching a state $s'$ given that the agent was at state $s$ and took an action $a$. For fixed $s$ and $a$, the transition function is a probability distribution over next states, and we therefore have $\sum_{s' \in \mathcal{S}} T(s, a, s') = 1$ for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$.
* A reward function $r: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ that gives a reward to the agent for taking an action $a$ at state $s$. A large reward $r(s,a)$ indicates that taking the action $a$ at state $s$ is more useful for achieving the goal of the agent. The reward is designed by the user with the goal in mind.
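The components listed above can be written down concretely for a toy problem. The following is a minimal sketch, assuming a hypothetical two-state, two-action MDP; the state names, action names, probabilities, rewards, and the helper `is_valid_transition` are all made up for illustration. It stores $T$ and $r$ as dictionaries and checks that $T(s, a, \cdot)$ sums to 1 for every state-action pair.

```python
# Hypothetical two-state, two-action MDP; every name and number is illustrative.
states = ["s0", "s1"]
actions = ["stay", "move"]

# T[(s, a)] maps each next state s' to the probability P(s' | s, a).
T = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s0": 0.0, "s1": 1.0},
    ("s1", "move"): {"s0": 0.7, "s1": 0.3},
}

# r[(s, a)] is the reward for taking action a at state s.
r = {
    ("s0", "stay"): 0.0,
    ("s0", "move"): -1.0,
    ("s1", "stay"): 1.0,
    ("s1", "move"): -1.0,
}


def is_valid_transition(T, states, actions, tol=1e-9):
    """Check that T(s, a, .) is a probability distribution for every (s, a)."""
    for s in states:
        for a in actions:
            probs = [T[(s, a)].get(s_next, 0.0) for s_next in states]
            # Every entry must be a probability in [0, 1].
            if any(p < 0.0 or p > 1.0 for p in probs):
                return False
            # The probabilities over next states must sum to 1.
            if abs(sum(probs) - 1.0) > tol:
                return False
    return True


print(is_valid_transition(T, states, actions))  # True
```

Dictionaries keep the sketch readable for small problems; for larger state spaces an array indexed by $(s, a, s')$ would likely be a more practical representation.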

### 17.1.2. Return and Discount Factor

Collecting the components above, a Markov decision process (MDP) is the tuple
\begin{split}
\textrm{MDP}: (\mathcal{S}, \mathcal{A}, T, r).
\end{split}

When an agent starts at a particular state $s_0 \in \mathcal{S}$ and continues taking actions, the agent will end up in a trajectory
\begin{split}
\tau = (s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, \ldots).
\end{split}

At each time step $t$, the agent is at a state $s_t$ and takes an action $a_t$, which results in a reward $r_t=r(s_t, a_t)$. The *return* of a trajectory is the total reward obtained by the agent along such a trajectory:
\begin{split}
R(\tau) = r_0 + r_1 + r_2 + \cdots.
\end{split}

The objective in reinforcement learning is to find a trajectory that has the largest *return*.

The sequence of states and actions in a trajectory can be infinitely long, and the return of such a trajectory can then be infinite (for example, when every reward is the same positive constant). In order to keep the reinforcement learning formulation meaningful even for such trajectories, we introduce the notion of a discount factor $\gamma \in [0, 1)$ and define the *discounted return* as
\begin{split}
R(\tau) = r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots = \sum_{t=0}^\infty \gamma^t r_t.
\end{split}
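As a small illustration of the formula above, the sketch below computes the discounted return of a finite list of rewards. Truncating the trajectory to finitely many steps is an assumption made here so that the sum can be evaluated, and the function name `discounted_return` is hypothetical.

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t] for a finite reward list."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    total = 0.0
    # Work backwards: the return from step t equals r_t plus gamma times
    # the return from step t + 1.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

The backward recursion avoids computing powers of $\gamma$ explicitly. If every reward satisfies $|r_t| \le r_{\max}$, the geometric series bounds the discounted return by $r_{\max} / (1 - \gamma)$, which is why the infinite sum stays finite.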

If $\gamma$ is very small, the rewards earned by the agent in the far future, say $t=1000$, are heavily discounted by the factor $\gamma^{1000}$. This encourages the agent to select short trajectories that achieve its goal.

For large values of the discount factor, say $\gamma=0.99$, rewards in the future count almost as much as immediate ones, so the agent is encouraged to *explore* and then find the best trajectory to go to the goal state.
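To make the effect of $\gamma$ concrete, the sketch below compares how quickly the weight $\gamma^t$ decays for a small and a large discount factor. The helper `steps_until_weight_below` and the threshold of 0.01 are illustrative choices, not part of the formulation above.

```python
def steps_until_weight_below(gamma, threshold):
    """Smallest t with gamma**t < threshold, assuming 0 < gamma < 1."""
    t, weight = 0, 1.0
    while weight >= threshold:
        weight *= gamma
        t += 1
    return t


for gamma in (0.5, 0.99):
    # Weight given to a reward received at t = 1000.
    far_weight = gamma ** 1000
    # Number of steps before a reward counts for less than 1% of its value.
    horizon = steps_until_weight_below(gamma, threshold=0.01)
    print(f"gamma={gamma}: gamma**1000={far_weight:.3e}, horizon={horizon}")
```

With $\gamma = 0.5$ the weight drops below 1% after only 7 steps, while with $\gamma = 0.99$ it takes several hundred steps, so rewards far along the trajectory still influence the discounted return.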

### 17.1.3. Discussion of the Markov Assumption

Markov systems are systems where the next state $s_{t+1}$ depends only on the current state $s_t$ and the action $a_t$ taken at the current state; when transitions are stochastic, this means the distribution of $s_{t+1}$ is given by $T(s_t, a_t, \cdot)$. In Markov systems, given $s_t$ and $a_t$, the next state does not depend on which actions were taken in the past or which states the agent visited in the past.
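The Markov assumption can be seen directly in how a trajectory would be simulated. The following is a minimal sketch, assuming a hypothetical transition table `T` and a hypothetical fixed rule `choose_action` for picking actions; neither comes from the text above. Note that `step` receives only the current state and action, never the history.

```python
import random

# Hypothetical transition table: T[(s, a)] maps next state s' to P(s' | s, a).
T = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s0": 0.0, "s1": 1.0},
    ("s1", "move"): {"s0": 0.7, "s1": 0.3},
}


def step(state, action, rng):
    """Sample s_{t+1} from T(s_t, a_t, .); no earlier history is needed."""
    next_states = list(T[(state, action)].keys())
    probs = list(T[(state, action)].values())
    return rng.choices(next_states, weights=probs, k=1)[0]


def choose_action(state):
    """Illustrative fixed rule: move away from s0, stay once in s1."""
    return "move" if state == "s0" else "stay"


def rollout(start_state, num_steps, rng):
    """Simulate num_steps transitions and return the visited states."""
    state = start_state
    visited = [state]
    for _ in range(num_steps):
        action = choose_action(state)
        state = step(state, action, rng)
        visited.append(state)
    return visited


rng = random.Random(0)  # seeded so that repeated runs give the same rollout
print(rollout("s0", num_steps=5, rng=rng))
```

If the next state depended on earlier states or actions, `step` would need the whole history as an argument; under the Markov assumption the pair $(s_t, a_t)$ is enough.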

## 17.2. Value Iteration