# WHAT IS REINFORCEMENT LEARNING?

**Reinforcement Learning (RL)**, a subfield of Machine Learning (ML), is an approach that natively incorporates an extra dimension (which is usually *time*, but not necessarily) into the learning equation. This places RL much closer to how people understand Artificial Intelligence (AI).

We'll discuss:
1. How RL is related to and differs from other ML disciplines: *supervised learning* and *unsupervised learning*
2. What the main *RL formalisms* are and how they are related to each other
3. Theoretical foundations of RL: **Markov processes (MPs)**, **Markov reward processes (MRPs)**, and **Markov decision processes (MDPs)**

## 1. The Spectrum

1. **Supervised Learning**: Its basic question is, how do we automatically build a function that maps some input into some output when given a set of example pairs?
    - Example problems: text classification, image classification and object location, regression problems, sentiment analysis
    - The name **supervised** comes from the fact that we learn from known answers provided by a "ground truth" data source

2. **Unsupervised Learning**: the main objective is to learn some hidden structure of the dataset at hand, assuming no supervision and no labels assigned to the data
    - Example problems: clustering, GANs
    - **Generative Adversarial Networks (GANs)**: two competing neural networks (NNs):
        - the first network tries to generate fake data to fool the second network, while
        - the second network tries to discriminate artificially generated data from data sampled from the dataset
        - over time, both networks become more and more skillful in their tasks by capturing subtle, specific patterns in the dataset

3. **Reinforcement Learning (RL)**: somewhere in between full supervision and a complete lack of predefined labels
    - Imagine an `agent` that needs to take actions in some `environment`
    - the `reward` is given to the `agent` by the `environment` as additional feedback about the `agent`'s action. The final goal of the `agent` is to maximize its `reward` as much as possible
    - the `reward` can be positive, negative, or neutral. By observing the `reward` and relating it to the actions taken, our `agent` learns how to perform an action better
    - but RL's generality and flexibility come at a price



## 2. Complications in RL

1. **Having non-iid (independent and identically distributed) data**: note that observations in RL depend on the `agent`'s behavior and, to some extent, are the result of this behavior
    - if `agent` decides to do inefficient things, then observations will tell us nothing about what it has done wrong and what should be done to improve the outcome (the `agent` will just get negative feedback all the time)

2. **Exploration/Exploitation dilemma**: the `agent` needs not only to `exploit` the knowledge it has learned, but also to actively `explore` the `environment`, because doing things differently may significantly improve the outcome
    - `PROBLEM`: too much exploration may also seriously decrease the `reward` (not to mention the `agent` can actually forget what it has learned before)
    - we need to find a balance between these two. There are no universal answers

3. **`Reward` can be seriously delayed after actions**: during learning, we need to discover such causalities, which can be tricky to discern over the flow of time and our actions
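
To make the exploration/exploitation trade-off concrete, here is a minimal sketch of epsilon-greedy action selection. This is one common heuristic, not something these notes prescribe, and the `epsilon_greedy` name and the `estimates` values are hypothetical, chosen only for illustration.

```python
import random

def epsilon_greedy(value_estimates, epsilon, rng):
    """Pick an action index: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        # explore: choose uniformly among all actions
        return rng.randrange(len(value_estimates))
    # exploit: choose the action with the highest current estimate
    return max(range(len(value_estimates)), key=lambda a: value_estimates[a])

rng = random.Random(0)
estimates = [0.2, 1.5, -0.3]   # hypothetical per-action reward estimates

greedy_choice = epsilon_greedy(estimates, epsilon=0.0, rng=rng)   # pure exploitation
random_choices = [epsilon_greedy(estimates, epsilon=1.0, rng=rng) for _ in range(5)]
```

With `epsilon=0.0` the agent only exploits and with `epsilon=1.0` it only explores. Picking a value in between (and possibly changing it over time) is exactly the balance that has no universal answer.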

## 3. RL formalisms

Two major RL entities - `agent` and `environment` - and their communication channels - `actions`, `reward`, and `observations`:

![RL formalism](../images/figure_1-2.png)


1. **Reward**: a scalar value obtained periodically from the `environment`
    - purpose: to tell `agent` how well it behaved
    - common practice on how frequently the `agent` receives `reward`: at fixed time intervals or at every `environment` interaction, just for convenience. In the case of once-in-a-lifetime `reward` systems, all `reward`s except the last one will be zero
    - the term `reinforcement` comes from the fact that `reward` obtained by an `agent` should reinforce its behavior in a positive/negative way
    - `reward` is local: it reflects the success of the `agent`'s recent activity, not all the benefits and losses achieved by the `agent` so far
    - examples: financial trading, chess, dopamine system in the brain, computer games, web navigation, NN architecture search, dog training, school marks

2. **The agent**: somebody or something who/that interacts with the `environment` by executing certain `actions`, making `observations`, and receiving eventual `reward`s for this

3. **The environment**: everything outside of an `agent`
    - the `agent`'s communication with the `environment` is limited to `reward` (obtained from the `environment`), `actions` (executed by the `agent` and sent to the `environment`), and `observations` (some information besides the `reward`s that the `agent` receives from the `environment`)

4. **Actions**: things that an `agent` can do in the `environment`. Two types of `actions`:
    1. **Discrete actions**: form the finite set of mutually exclusive things an `agent` can do
    2. **Continuous actions**: have some value attached to them

5. **Observations**: of the `environment` form the second information channel for an `agent`, with the first being the `reward`. `Observations` are pieces of information that the `environment` provides the `agent` with that indicate what's going on around the `agent`

It's also important to distinguish between an `environment`'s `state` and `observations`:
- the `state` of an `environment` most of the time is *internal* to the `environment` and potentially includes every atom in the universe, which makes it impossible to measure everything about the `environment`
- even if we limit the `environment`'s `state` to be small enough, most of the time it'll be either impossible to get full information about it or our measurements will contain noise
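
The entities and channels above can be sketched as a short interaction loop. This is a toy illustration under stated assumptions: `NoisyCounterEnv`, `RandomAgent`, and all of their methods are hypothetical names invented here, not an API from any library.

```python
import random

class NoisyCounterEnv:
    """Toy environment: hidden integer state, noisy observations, fixed episode length."""

    def __init__(self, steps=10, seed=0):
        self._rng = random.Random(seed)
        self._state = 0          # internal state, never shown to the agent directly
        self.steps_left = steps

    def observe(self):
        # observation = internal state plus measurement noise
        return self._state + self._rng.gauss(0.0, 0.5)

    def step(self, action):
        """Apply a discrete action (0 = decrement, 1 = increment) and return a scalar reward."""
        self._state += 1 if action == 1 else -1
        self.steps_left -= 1
        return float(self._state)

    def is_done(self):
        return self.steps_left == 0


class RandomAgent:
    """Agent that ignores its observations and acts at random."""

    def __init__(self, seed=1):
        self._rng = random.Random(seed)
        self.total_reward = 0.0

    def act(self, observation):
        return self._rng.choice([0, 1])


env = NoisyCounterEnv(steps=10)
agent = RandomAgent()
while not env.is_done():
    obs = env.observe()             # channel: observations (environment -> agent)
    action = agent.act(obs)         # channel: actions (agent -> environment)
    reward = env.step(action)       # channel: reward (environment -> agent)
    agent.total_reward += reward
```

Note that the agent only ever sees `obs`, a noisy view of `_state`, which mirrors the distinction between `state` and `observations`.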

There are many other areas that contribute or relate to RL:

![Various domains in RL](../images/figure_1-3.png)

## 4. The theoretical foundations of RL

### 4.0 Markov decision processes (MDPs)

**Markov decision processes (MDPs)**: can be described like a Russian matryoshka doll:
    - start from the simplest case of a **Markov process (MP)**
    - then extend that with `rewards`, which will turn it into a **Markov reward process (MRP)**
    - then add `actions`, which will lead us to an **MDP**

### 4.1 The Markov process (MP)

**The Markov process (MP)**: the simplest concept in the Markov family, which is also known as a **Markov chain**
1. imagine that we have some system in front of us that we can only observe:
    - `state`: is what we observe
    - the system can switch between `states` according to some laws of dynamics (most of the time unknown to us)
    - we cannot influence the system, can only watch the `states` changing
    - `state space`: a set of all possible `states` for a system
    - this set is required to be finite (but can be extremely large to compensate for this limitation)
    - `chain`: a sequence of `states` formed by our observations
    - a sequence of observations over time forms a `chain` of `states`, and this is called `history`

2. to call such a system an `MP`, it needs to fulfill the **Markov property**:
    - `Markov property`: the future system dynamics from any `state` have to depend on this `state` only. This is to make every observable `state` self-contained to describe the future of the system
    - in other words, the `Markov property` requires the `states` of the system to be distinguishable from each other and unique. In this case, only one `state` is required to model the future dynamics of the system and not the whole `history` or, say, the last $N$ states

3. if the system model complies with the `Markov property`, we can capture `transition probabilities` with a `transition matrix`:
    - `transition matrix`: a square matrix of the size $N \times N$, where $N$ is the number of `states` in our model
    - the cell $(i, j)$ contains the probability of the system to transition from `state` $i$ to `state` $j$

4. The formal definition of an `MP`:
    - a set of states ($S$) that a system can be in
    - a transition matrix ($T$), with transition probabilities, which defines the system dynamics

5. the `state transition graph`: visual representation of an `MP` - a cyclic directed graph
    - nodes corresponding to system states
    - (directed) edges labeled with probabilities representing a possible transition from state to state

6. in practice, we rarely have the luxury of knowing the exact `transition matrix`
    - a much more real-world situation is when we only have observations of our system's `states`, which are also called `episodes`
    - it's not complicated to estimate the `transition matrix` from observations - count all the transitions from every state and normalize them. The more observation data we have, the closer our estimation will be to the true underlying model

7. worth noting: the `Markov property` implies `stationarity`, i.e. the underlying transition distribution for any `state` does not change over time
    - `non-stationarity`: means some hidden factor influences our system dynamics, and it is not included in observations. This contradicts the `Markov property`, which requires the probability distribution to be the same for the same `state` regardless of transition `history`

8. important to understand the difference between the actual transitions observed in an `episode` and the underlying distribution given in the `transition matrix`
    - the concrete `episodes` that we observe are randomly sampled from the distribution of the model, so they can differ from `episode` to `episode`
    - however, the probability of the concrete transition to be sampled remains the same
    - if this is not the case, `Markov chain` formalism becomes non-applicable
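
The points above (sampling episodes from a `transition matrix`, then estimating the matrix back by counting and normalizing) can be sketched as follows. The two-state `sunny`/`rainy` chain and its probabilities are hypothetical, and the helper names are invented for this illustration.

```python
import random
from collections import Counter

STATES = ["sunny", "rainy"]   # hypothetical state space
# T[i][j] = probability of moving from state i to state j (each row sums to 1)
T = [
    [0.8, 0.2],
    [0.1, 0.9],
]

def sample_episode(T, start, length, rng):
    """Sample a chain of state indices; the next state depends only on the current one."""
    chain = [start]
    for _ in range(length - 1):
        current = chain[-1]
        nxt = rng.choices(range(len(T)), weights=T[current])[0]
        chain.append(nxt)
    return chain

def estimate_transition_matrix(episodes, n_states):
    """Count observed transitions and normalize each row into probabilities."""
    counts = Counter()
    for chain in episodes:
        for src, dst in zip(chain, chain[1:]):
            counts[(src, dst)] += 1
    estimate = []
    for i in range(n_states):
        row_total = sum(counts[(i, j)] for j in range(n_states))
        # a state with no observed outgoing transitions gets a row of zeros
        estimate.append([counts[(i, j)] / row_total if row_total else 0.0
                         for j in range(n_states)])
    return estimate

rng = random.Random(42)
episodes = [sample_episode(T, start=0, length=200, rng=rng) for _ in range(50)]
T_hat = estimate_transition_matrix(episodes, n_states=len(STATES))
```

Individual episodes differ from run to run, but `T_hat` should approach `T` as more observations are collected.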

### 4.2 Markov reward processes (MRPs)

1. reward - an extra scalar number - added to our transition from `state` to `state` 
    - reward can be represented in various forms
    - the most general way is another square matrix, where cell $(i, j)$ holds the reward for transitioning from `state` $i$ to `state` $j$, but this can be redundant
    - a more compact representation: (`state`, `reward`) pairs, applicable only if the reward value depends solely on the target `state`, which is not always the case
    - `Discount factor` $\gamma \in [0,1]$ (explained later)
    - as in an `MP`, we observe a chain of state transitions in an `MRP`, but for every transition, we have an extra quantity - reward. So now, all observations have a reward value attached to every transition of the system

2. for every episode, we define `return` at the time $t$ as $G_t$:$$G_t = R_{t+1} + \gamma R_{t+2} + \dots = \sum\limits^{\infty}_{k=0}\gamma^k R_{t+k+1}$$
    - for now, think of $\gamma$ as a measure of how far into the future we look to estimate the future return (closer to $1$ means more steps ahead we take into account)
    - $\gamma$ stands for foresightedness of the agent:
        - $\gamma = 1$: $G_t$ is a sum of all subsequent rewards and corresponds to the agent that has perfect visibility of any subsequent rewards (applicable in situations of short finite episodes)
        - $\gamma = 0$: $G_t$ will be just immediate reward without any subsequent state and will correspond to absolute short-sightedness
        - most of the time $\gamma = 0.9, 0.99$: we look into future rewards, but not too far
    - this `return` quantity is not very useful in practice, as it is defined for every specific chain we observe from our `MRP`, so it can vary widely, even for the same state

3. `value of the state` (more practical quantity) - the expectation of return for any state (obtained by averaging over a large number of chains):$$V(s) = \mathbb{E}[G_t \mid S_t = s]$$
    - interpretation: for every state $s$, the value $V(s)$ is the average (or expected) return we get by following the `MRP`

4. Example
    - ![The state transition graph with transition probabilities and rewards](../images/figure_1-7.png)
    - if $\gamma = 0$: `computer` is the most valuable state to be in$$\begin{align*}
V(\text{chat}) &= -1\cdot 0.5 + 2\cdot 0.3 + 1\cdot 0.2 = 0.3 \\
V(\text{coffee}) &= 2\cdot 0.7 + 1\cdot 0.1 + 3\cdot 0.2 = 2.1 \\
V(\text{home}) &= 1\cdot 0.6 + 1\cdot 0.4 = 1.0 \\
V(\text{computer}) &= 5\cdot 0.5 + (-3)\cdot 0.1 + 1\cdot 0.2 + 2\cdot 0.2 = 2.8
\end{align*}$$
    - if $\gamma = 1$, the value is infinite for all states (this is the reason why we introduce $\gamma$, instead of just summing all future rewards)
    - for $\gamma \in (0, 1)$, we will introduce `Q-learning methods` in [Chapter05](../Chapter05/) on how to quickly calculate values for given transition and reward matrices
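
As a sketch of the two quantities defined above, the code below computes a discounted `return` for a finite reward sequence and reproduces the $\gamma = 0$ state values. The (probability, reward) pairs are read off the $\gamma = 0$ calculation, taking `home` to have two outgoing transitions (0.6 and 0.4) so that its probabilities sum to 1. The function names and `episode_rewards` are hypothetical.

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma * R_{t+2} + ... for a finite list of rewards."""
    g = 0.0
    for reward in reversed(rewards):   # accumulate backwards: G = R + gamma * G_next
        g = reward + gamma * g
    return g

# (probability, reward) pairs for each state's outgoing transitions
OUTGOING = {
    "chat":     [(0.5, -1), (0.3, 2), (0.2, 1)],
    "coffee":   [(0.7, 2), (0.1, 1), (0.2, 3)],
    "home":     [(0.6, 1), (0.4, 1)],
    "computer": [(0.5, 5), (0.1, -3), (0.2, 1), (0.2, 2)],
}

def value_gamma_zero(outgoing):
    """With gamma = 0 the value of a state is its expected immediate reward."""
    return {state: sum(p * r for p, r in pairs) for state, pairs in outgoing.items()}

values = value_gamma_zero(OUTGOING)
best_state = max(values, key=values.get)

episode_rewards = [1, 2, 3]   # hypothetical R_{t+1}, R_{t+2}, R_{t+3}
g_short = discounted_return(episode_rewards, gamma=0.0)   # only the immediate reward
g_full = discounted_return(episode_rewards, gamma=1.0)    # plain sum of all rewards
g_half = discounted_return(episode_rewards, gamma=0.5)    # 1 + 0.5*2 + 0.25*3
```

For $\gamma$ strictly between 0 and 1 the value also depends on where each transition leads, which is why this sketch stops at the $\gamma = 0$ case.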

Now, let's put another layer of complexity around our `MRPs` and introduce the final missing piece: `actions`


### 4.3 Adding actions to MDP

1. to extend our `MRP` to include actions:
    1. add a set of actions ($A$), where $\lvert A \rvert < \infty$, this is our agent's `action space`
    2. condition transition matrix with actions, i.e. add an extra action dimension, turn it into a 3-tensor of shape $\lvert S \rvert \times \lvert S \rvert \times \lvert A \rvert$, recall that $S$ is `state space`

2. Recall:
    - in `MPs` and `MRPs`, the transition matrix is square, with the `source state` in rows and the `target state` in columns. So every row $i$ contains a list of probabilities to jump to every state from state $i$ ![The transition matrix for the Markov process](../images/figure_1-8.png)
    - in an `MDP`, the agent no longer passively observes state transitions, but can actively choose an action to take at every state transition
        - for every `source state`, we have a matrix, where the `depth` dimension contains actions that the agent can take, and the other dimension is the `target state` the system will jump to after actions are performed by the agent ![The transition probabilities for the MDP](../images/figure_1-9.png)
        - so, in general, by choosing an action, the agent can affect the probabilities of the `target states`, which is a useful ability
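
A minimal sketch of the action-conditioned transition tensor, stored as nested lists in the $\lvert S \rvert \times \lvert S \rvert \times \lvert A \rvert$ layout described above. The numbers and helper names are hypothetical. The key constraint is that for every (source state, action) pair, the probabilities over target states sum to 1.

```python
# P[source][target][action]: probability of landing in `target` when taking
# `action` in `source` (hypothetical 2-state, 2-action example)
P = [
    [[0.9, 0.2], [0.1, 0.8]],   # source 0: rows are targets 0 and 1
    [[0.5, 0.0], [0.5, 1.0]],   # source 1
]

def target_distribution(P, source, action):
    """Slice out the distribution over target states for one (source, action) pair."""
    return [P[source][target][action] for target in range(len(P[source]))]

def is_valid_mdp_tensor(P, tol=1e-9):
    """Check that every (source, action) slice is a probability distribution."""
    n_actions = len(P[0][0])
    return all(
        abs(sum(target_distribution(P, s, a)) - 1.0) < tol
        for s in range(len(P))
        for a in range(n_actions)
    )

dist_action_0 = target_distribution(P, source=0, action=0)   # mostly stays in state 0
dist_action_1 = target_distribution(P, source=0, action=1)   # mostly moves to state 1
```

Comparing `dist_action_0` and `dist_action_1` shows how the choice of action shifts the probabilities of the `target states` from the same `source state`.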

Now, with a formally defined `MDP`, we're finally ready to cover the most important thing for `MDP`s and RL: `policy`

### 4.4 Policy

**Policy**: (simple definition) some set of rules that defines the agent's behavior
- it's important to find a good `policy`, because different `policies` can give different amounts of return
- formally, `policy` is defined as the probability distribution over actions for every possible state: 
$$\pi(a \mid s) = P[A_t = a \mid S_t = s]$$
- this is defined as probability and not as a concrete action to introduce randomness into an agent's behavior
- (will talk in `Section 3`) a deterministic `policy` is a special case of a probabilistic one, with the needed action having $1$ as its probability
- (another useful notion) if our `policy` is fixed and not changing during training (i.e. when `policy` always returns the same actions for the same states), then the `MDP` becomes an `MRP`, as we can reduce the transition and reward matrices with the `policy`'s probabilities and get rid of the action dimension
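
The reduction of an `MDP` to an `MRP` under a fixed `policy` can be sketched as a weighted sum over the action dimension. Everything here is a hypothetical illustration: the numbers, the helper names, and the simplifying assumption that the reward `R[source][action]` depends only on the state and the action taken.

```python
import random

# P[source][target][action] in the |S| x |S| x |A| layout (hypothetical numbers)
P = [
    [[0.9, 0.2], [0.1, 0.8]],
    [[0.5, 0.0], [0.5, 1.0]],
]
# R[source][action]: expected immediate reward (simplifying assumption)
R = [
    [0.0, 1.0],
    [2.0, -1.0],
]
# policy[state][action] = pi(action | state); each row sums to 1
policy = [
    [0.5, 0.5],   # stochastic in state 0
    [1.0, 0.0],   # deterministic in state 1: always action 0
]

def sample_action(policy, state, rng):
    """Draw an action index from pi(. | state)."""
    return rng.choices(range(len(policy[state])), weights=policy[state])[0]

def induced_mrp(P, R, policy):
    """Average out the action dimension using the policy's probabilities."""
    n_states, n_actions = len(P), len(policy[0])
    T = [[sum(policy[s][a] * P[s][t][a] for a in range(n_actions))
          for t in range(n_states)]
         for s in range(n_states)]
    r = [sum(policy[s][a] * R[s][a] for a in range(n_actions))
         for s in range(n_states)]
    return T, r

T_pi, r_pi = induced_mrp(P, R, policy)
rng = random.Random(0)
actions_in_state_1 = [sample_action(policy, 1, rng) for _ in range(5)]
```

`T_pi` is an ordinary square transition matrix and `r_pi` a per-state reward vector, so the action dimension is gone and what remains has the shape of an `MRP`.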

## End

After two more introductory chapters about `OpenAI Gym` and `deep learning`, we'll finally start tackling this question - `how do we teach agents to solve practical tasks?`