In [2]:
import numpy as np
import matplotlib.pyplot as plt


# Basic Definitions
## Reward Hypothesis
Reward Hypothesis: all goals can be described by the maximisation of expected cumulative reward.

## History
The history is the sequence of observations, actions and rewards up to time $t$, denoted $H_t$.

## State
The state is a function of history: $S_t = f(H_t)$

### Environment State
The environment state is the environment's private representation of the state, denoted: $S^e_t$
This is the data the environment uses to pick the next reward and observation. It's typically not visible to the agent, and even if it is, the information may not be relevant.

The environment state is Markov.

### Agent State
The agent state is the agent's representation of the state, denoted: $S^a_t$
It's what the agent uses to make decisions, and is used by the reinforcement learning algorithm. 
It's a function of the history, as mentioned before: $S_t^a = f(H_t)$

### Information State
The information state contains all the useful information from the history. It's also known as the Markov state. 
A state is Markov if and only if the distribution of the next state given the current state is the same as its distribution given all the states so far. 
Put mathematically:
$\mathbb{P}[S_{t+1} | S_t] = \mathbb{P}[S_{t+1} | S_1, ..., S_t]$
This essentially means the current state contains all the relevant information from all previous states.

Another way of saying this is that the future is independent of the past given the present; that is the Markov property.

The Markov property is helpful because it makes training RL models easier and more efficient, but it does not always hold. In particular, it can fail when the agent cannot see all the relevant information at the present time. An example is poker, where you don't know the other players' cards. There are, however, methods for dealing with hidden variables like these.
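
As a rough illustration of the Markov property, the sketch below simulates a two-state chain and compares two empirical estimates of the probability of moving to state 1 from state 0: one conditioned only on the current state, and one that also conditions on the state before it. The transition matrix `P`, the seed and the trajectory length are made up for this example. For a Markov chain, the estimates should agree up to sampling noise (here they should all be close to 0.1).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical transition matrix: P[s, s_next] = probability of moving from s to s_next
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# Simulate a long trajectory of the chain, starting in state 0
n_steps = 100_000
u = rng.random(n_steps)
states = np.zeros(n_steps, dtype=int)
for t in range(1, n_steps):
    # Move to state 1 with probability P[current, 1], otherwise move to state 0
    states[t] = int(u[t] < P[states[t - 1], 1])

prev, cur, nxt = states[:-2], states[1:-1], states[2:]

# Estimate P[S_{t+1} = 1 | S_t = 0]
p_given_current = np.mean(nxt[cur == 0] == 1)

# Estimate P[S_{t+1} = 1 | S_t = 0, S_{t-1} = s] for each earlier state s
p_given_history = [np.mean(nxt[(cur == 0) & (prev == s)] == 1) for s in (0, 1)]

print(p_given_current, p_given_history)
```

Knowing the earlier state does not change the estimate, which is what "the future is independent of the past given the present" means in practice.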

## Observability
### Full Observability
Here the agent directly observes the environment state, so $O_t = S^a_t = S^e_t$
(where $O_t$ is the observation at time $t$). 

This is a Markov decision process (MDP).

### Partial Observability
Here the agent observes the environment only indirectly, 
so the environment state $\neq$ agent state.

These problems are known as Partially Observable Markov Decision Processes (POMDPs).
The agent must therefore construct its own state representation.

# Components of an RL Agent
- Policy: the agent's behaviour function
- Value Function: how good each state and/or action is
- Model: the agent's representation of the environment

## Policy
The policy is how the agent maps a state to an action. It's usually denoted $\pi$.
The policy may be deterministic, meaning that whenever it encounters a given state it performs the same action. This is denoted: $a = \pi(s)$

Or it may be stochastic, in which case it samples an action from a distribution over actions. This is denoted: $\pi(a | s) = \mathbb{P}[A=a | S=s]$
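
The two kinds of policy can be sketched with NumPy arrays. This is a minimal illustration, not a prescribed representation: the state and action counts, the tables `deterministic_pi` and `stochastic_pi`, and the helper names are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 3, 2

# Deterministic policy: a lookup table mapping each state to one action, a = pi(s)
deterministic_pi = np.array([0, 1, 1])

def act_deterministic(s):
    return int(deterministic_pi[s])

# Stochastic policy: row s holds pi(a | s), a probability distribution over actions
stochastic_pi = np.array([[0.8, 0.2],
                          [0.5, 0.5],
                          [0.1, 0.9]])

def act_stochastic(s):
    # Sample an action according to the probabilities in row s
    return int(rng.choice(n_actions, p=stochastic_pi[s]))

print(act_deterministic(0))                    # always the same action for state 0
print([act_stochastic(0) for _ in range(10)])  # sampled, so actions can differ between calls
```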

## Value Function
This is essentially a prediction of future reward and it's used to evaluate the goodness/badness of a particular state.

It's typically denoted as: $v_{\pi}(s) = \mathbb{E}_{\pi}[R_t+\gamma R_{t+1}+\gamma^2 R_{t+2} + ... | S_t = s]$

$\gamma$ here is the discount rate, with $0 \le \gamma \le 1$. It controls how much we value future rewards. If it's 0, we only care about the immediate reward; if it's 1, we value all future rewards equally. We usually use some value in between. 
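
To make the effect of $\gamma$ concrete, the sketch below computes the discounted sum inside the expectation for a single, finite reward sequence. The function name `discounted_return` and the reward values are made up for illustration; the value function itself is the expectation of this quantity over trajectories generated by following $\pi$.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Compute R_t + gamma * R_{t+1} + gamma^2 * R_{t+2} + ... for a finite sequence."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))  # [1, gamma, gamma^2, ...]
    return float(np.sum(discounts * rewards))

rewards = [1.0, 1.0, 1.0, 1.0]  # hypothetical rewards R_t, R_{t+1}, R_{t+2}, R_{t+3}

print(discounted_return(rewards, 0.0))  # 1.0: only the first reward counts
print(discounted_return(rewards, 0.5))  # 1.875 = 1 + 0.5 + 0.25 + 0.125
print(discounted_return(rewards, 1.0))  # 4.0: all rewards count equally
```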

## Model
The model predicts what the environment will do next.

Transitions: $\mathcal{P}$ predicts the next state, i.e. it describes the dynamics of the system.

$\mathcal{P}^a_{ss'} = \mathbb{P}[S' = s' | S = s, A = a]$

Rewards: $\mathcal{R}$ predicts the next immediate reward:

$\mathcal{R}_s^a = \mathbb{E}[R | S = s, A = a]$

Models are not always required, but they're pretty helpful.
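
For a small problem with discrete states and actions, a model can be sketched as two arrays: one for $\mathcal{P}^a_{ss'}$ and one for $\mathcal{R}^a_s$. The numbers below and the helper `model_step` are hypothetical and only illustrate the shapes involved.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 2, 2

# Hypothetical transition model: P[s, a, s_next] = P[S' = s_next | S = s, A = a]
P = np.array([[[0.7, 0.3],
               [0.2, 0.8]],
              [[0.4, 0.6],
               [0.0, 1.0]]])

# Hypothetical reward model: R[s, a] = E[R | S = s, A = a]
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])

def model_step(s, a):
    """Use the model to sample a next state and look up the expected immediate reward."""
    s_next = int(rng.choice(n_states, p=P[s, a]))
    return s_next, float(R[s, a])

print(model_step(0, 1))
```

Each `P[s, a]` row sums to 1, since it is a distribution over next states.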

# Categorising RL Agents
## Value Based
- Policy is implicit (e.g. act greedily with respect to the value function)
- Has a value function

## Policy Based
- Has an explicit policy
- No value function

## Actor Critic
- Has an explicit policy
- Has a value function

Tries to get the best of both worlds

## Model Free
- Policy and/or value function
- No model

## Model Based
- Policy and/or value function
- Has a model

![image.png](attachment:image.png)

# Problems in Machine Learning
## Reinforcement Learning
The environment is initially unknown. 
When learning begins, the agent typically has no knowledge of how the environment behaves.
The agent interacts with the environment and improves its policy over time.

## Planning
We give the agent a model of the environment.
The agent performs computations with that model.
Then the agent improves its policy.


# Prediction and Control

Prediction: Evaluating the future
- Given a policy, estimate how much reward it will collect

Control: Optimising the future
- Finding the best policy
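
The difference can be sketched on a tiny MDP with a known model. The example below uses iterative policy evaluation for prediction and value iteration for control. Both are standard dynamic programming methods that these notes do not cover, so treat this as an illustration under assumptions rather than the method implied above. The MDP (`P`, `R`), `gamma` and the function names are hypothetical.

```python
import numpy as np

gamma = 0.9

# Hypothetical 2-state, 2-action MDP: action a always moves the agent to state a
P = np.array([[[1.0, 0.0],
               [0.0, 1.0]],
              [[1.0, 0.0],
               [0.0, 1.0]]])   # P[s, a, s_next]
R = np.array([[0.0, 1.0],
              [0.0, 2.0]])     # R[s, a]

def evaluate_policy(pi, n_iters=500):
    """Prediction: estimate v_pi for a fixed policy pi[s, a] = pi(a | s)."""
    v = np.zeros(P.shape[0])
    for _ in range(n_iters):
        q = R + gamma * P @ v        # q[s, a]: current estimate of taking a in s
        v = np.sum(pi * q, axis=1)   # average over the policy's action probabilities
    return v

def find_best_policy(n_iters=500):
    """Control: value iteration, then read off a greedy policy."""
    v = np.zeros(P.shape[0])
    for _ in range(n_iters):
        q = R + gamma * P @ v
        v = np.max(q, axis=1)        # assume the best action is taken in every state
    q = R + gamma * P @ v
    return v, np.argmax(q, axis=1)

uniform_pi = np.full((2, 2), 0.5)    # pick either action with equal probability
v_uniform = evaluate_policy(uniform_pi)
v_star, greedy_actions = find_best_policy()

print(v_uniform)        # value of the given (uniform random) policy
print(v_star)           # value of the best policy found
print(greedy_actions)   # best action in each state
```

Prediction answers "how good is this policy?", while control searches over policies, so the values it finds are at least as high in every state.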

