## Reinforcement Learning

Reinforcement learning considers the situation where a series of decisions has to be made in a
stochastically changing environment. At the time of each decision, the environment is in one of a number of ***states***
and the decision maker chooses from a number of possible ***actions***.

The decision maker takes an action, $A_0$, at time zero when the state $S_0$ is known. This results in a reward, $R_1$, at time $t=1$, and a new state, $S_1$, is then encountered. The decision maker then takes another action, $A_1$, which
results in a reward, $R_2$, at time $t=2$ and a new state, $S_2$, and so on.

The aim of reinforcement learning is to maximize expected future rewards. Specifically, it attempts
to maximize the expected value of $G_t$ where

\begin{equation}\label{eqn_1}
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots + \gamma^{T-t-1} R_T
\end{equation}

$T$ is a horizon date and $\gamma \in (0, 1]$ is a discount factor. To ensure that equation \eqref{eqn_1} reflects the time value of money, we define $R_t$ as the cash flow received at time $t$ multiplied by $\gamma$ (i.e., discounted
by one period).
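
As a minimal sketch of the discounted sum $G_t$ defined above, the function below computes the return for a single episode. The name `discounted_return`, the list layout (`rewards[k]` holds $R_{k+1}$), and the reward values are assumptions made for illustration only.

```python
def discounted_return(rewards, gamma, t=0):
    """Return G_t = R_{t+1} + gamma * R_{t+2} + ... for one episode.

    rewards[k] holds R_{k+1}, so rewards[0] is the reward R_1 that follows A_0.
    """
    g = 0.0
    # Work backwards using G_k = R_{k+1} + gamma * G_{k+1}, with G_T = 0.
    for r in reversed(rewards[t:]):
        g = r + gamma * g
    return g


rewards = [1.0, 2.0, 3.0]  # R_1, R_2, R_3 (illustrative values), so T = 3
print(discounted_return(rewards, gamma=0.5))       # G_0 = 1 + 0.5*2 + 0.25*3 = 2.75
print(discounted_return(rewards, gamma=0.5, t=1))  # G_1 = 2 + 0.5*3 = 3.5
```

The backward loop relies on the recursion $G_t = R_{t+1} + \gamma G_{t+1}$, which avoids computing powers of $\gamma$ explicitly.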

In order to maximize the expected value of $G_t$ in equation \eqref{eqn_1}, the decision maker needs a set of rules for what action
to take in any given state. This set of rules is represented by a policy function $\pi : \mathcal{S} \rightarrow \mathcal{A}$, where $\mathcal{S}$ and $\mathcal{A}$ are the sets of all possible ***states*** and ***actions***, respectively.

If the decision maker uses policy $\pi$ and is in state $S_t$ at time $t$, then the action taken is $A_t =\pi(S_t)$. The policy is updated as the reinforcement learning algorithm progresses. As we explain later, learning an optimal policy
involves both ***exploration*** and ***exploitation***.

For example, at a particular stage in the execution of the reinforcement learning algorithm, the policy might involve, for all states and all times, a 90% chance of taking the best action identified so far (***exploitation***) and a 10% chance of randomly selecting a different action (***exploration***).
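
The 90%/10% rule in this example could be sketched as follows. This is only an illustration under stated assumptions: the function name `epsilon_greedy`, the representation of the current value estimates as a plain list indexed by action, and the sample numbers are all hypothetical.

```python
import random


def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """Pick an action index from a list of estimated action values.

    With probability 1 - epsilon the best action so far is taken
    (exploitation); with probability epsilon a different action is
    chosen uniformly at random (exploration).
    """
    best = max(range(len(q_values)), key=lambda a: q_values[a])
    others = [a for a in range(len(q_values)) if a != best]
    if others and rng.random() < epsilon:
        return rng.choice(others)  # exploration
    return best  # exploitation


# Hypothetical value estimates for three actions in some state.
action = epsilon_greedy([0.2, 1.5, -0.3], epsilon=0.1)
```

Passing a seeded `random.Random` instance as `rng` makes the choices reproducible, which can help when debugging a learning run.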

For a specific policy $\pi$, we define the value of taking action $A_t$ in state $S_t$ as the expected total
discounted reward obtained by starting in state $S_t$, taking action $A_t$, and then following the
policy in all future states that are encountered.

The value of each state-action pair is represented by a function $Q: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$, referred to as the action-value function or ***Q-function***:

\begin{equation}
Q(S_t, A_t) = \mathbb{E}(G_t \vert S_t, A_t)
\end{equation}
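
One simple way to approximate this expectation is to average the returns observed after each state-action pair over many episodes. The sketch below uses every-visit Monte Carlo averaging, which is one standard estimator and not necessarily the method developed in the rest of this document. The function name `estimate_q`, the `(state, action, reward)` episode layout, and the sample episodes are assumptions for illustration.

```python
from collections import defaultdict


def estimate_q(episodes, gamma):
    """Average the observed returns G_t for every (state, action) pair.

    Each episode is a list of (state, action, reward) tuples, where the reward
    is R_{t+1}, the reward that follows taking `action` in `state` at time t.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for episode in episodes:
        g = 0.0
        # Walk backwards so that g always holds the return from this step on.
        for state, action, reward in reversed(episode):
            g = reward + gamma * g
            totals[(state, action)] += g
            counts[(state, action)] += 1
    return {pair: totals[pair] / counts[pair] for pair in totals}


# Two short, made-up episodes.
episodes = [
    [("s0", "up", 1.0), ("s1", "down", 2.0)],
    [("s0", "up", 3.0)],
]
q = estimate_q(episodes, gamma=0.9)
# q[("s0", "up")] averages the returns 1 + 0.9*2 and 3; q[("s1", "down")] is 2.
```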



The two core stages of the reinforcement learning process are: 

- the ***estimation of the Q-function*** and 
- the ***policy update***.
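
As a rough illustration of the policy update stage, the sketch below derives a greedy policy from a table of Q-function estimates by choosing, in each state, the action with the highest estimated value. The dictionary layout keyed by `(state, action)`, the function name `greedy_policy`, and the sample values are hypothetical.

```python
def greedy_policy(q):
    """Map each state to the action with the highest estimated Q-value.

    `q` is a dict keyed by (state, action) pairs (an illustrative layout).
    """
    best = {}
    for (state, action), value in q.items():
        # Keep the best (action, value) pair seen so far for this state.
        if state not in best or value > best[state][1]:
            best[state] = (action, value)
    return {state: action for state, (action, _) in best.items()}


# Made-up Q estimates for two states and two actions.
q = {
    ("s0", "up"): 2.9,
    ("s0", "down"): 1.0,
    ("s1", "up"): 0.0,
    ("s1", "down"): 2.0,
}
policy = greedy_policy(q)  # {"s0": "up", "s1": "down"}
```

In practice the updated policy usually keeps some exploration rather than being purely greedy, so that the Q-function estimates continue to improve for actions that currently look worse.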

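If the policy is stochastic, so that $\pi(a \vert s)$ is the probability of taking action $a$ in state $s$, and $p(s^\prime, r \vert s, a)$ is the probability of moving to state $s^\prime$ and receiving reward $r$ after taking action $a$ in state $s$, then these probabilities sum to one for every state $s$:

$$
\sum\limits_a \, \pi(a \vert s) \sum\limits_{s^\prime}\sum\limits_r \, p(s^\prime, r \vert s, a) =1
$$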

### Temporal Difference Learning