# Reinforcement learning

## Sources
- https://medium.com/@lgendrot/teaching-myself-reinforcement-learning-7b4157ee3b68
- https://medium.com/@sriramp_98201/reinforcement-learning-from-intuition-to-actor-critic-and-ppo-db3d1e3907a1
- https://medium.com/@reddyyashu20/reinforcement-learning-tutorial-theo-97a6caa5d5f7




Reinforcement learning (RL) is about learning behavior through interaction and experience, not from labeled data.

An agent repeatedly:

- Observes the current state s
- Chooses an action a
- Receives a reward r
- Moves to a new state s′

The agent’s goal is simple to state but hard to achieve: maximize total future reward, not just immediate reward.

It learns which actions lead to positive feedback (rewards) and which lead to negative feedback.
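The observe/act/reward/next-state loop described above can be sketched in a few lines. This is only an illustration: `CoinFlipEnv`, its `reset`/`step` methods, and `run_episode` are hypothetical names made up for this note, not the API of any particular library.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess a coin flip at each of `horizon` steps."""

    def __init__(self, horizon=10, seed=0):
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # the state s is just the step index here

    def step(self, action):
        coin = self.rng.randint(0, 1)
        reward = 1.0 if action == coin else 0.0  # reward r
        self.t += 1
        done = self.t >= self.horizon
        return self.t, reward, done  # new state s', reward r, episode end flag


def run_episode(env, policy):
    """Run one episode and return the (undiscounted) sum of rewards."""
    state = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(state)                       # choose an action a
        next_state, reward, done = env.step(action)  # receive r, move to s'
        total_reward += reward
        state = next_state
    return total_reward


# A trivial policy that always guesses 0.
total = run_episode(CoinFlipEnv(), policy=lambda s: 0)
```

A learning agent would replace the fixed `policy` with something that improves from the rewards it collects.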

<img src="../image-61.png" width="500px"/>


<img src="../image.png" width="500px"/>

## Observations
Agents need information about the environment in order to decide how to act. In an Atari game, for example, the observations might be screenshots of the game being played, the same frames a human player would see.

## Agents
The agent receives observations of the environment and uses those observations to select actions, which affect the environment and ultimately cause a reward signal to be given to the agent.


## States and the Markov Property
In the course of operation, the agent generates a “path” through an environment, which consists of sequential sets of observations, rewards, and next actions for each step the agent takes.

The history of an agent/environment pair includes all of the observations, rewards, and actions generated up to a particular point in time. 

<img src="../image-2.png" width="500px"/>


Storing the entire history of an agent is unnecessary, which motivates the concept of state. Formally, the state is a function of the history that summarizes it up to a certain point.

<img src="../image-3.png" width="200px" />


The state can be viewed from the perspective of the environment (the environment state) or of the agent (the agent state, which is built entirely by the developer and by the algorithm used to solve the reinforcement learning problem).

Ideally, any state, being a summary of the history, would contain all the useful information about that history. A state with this property is called a Markov state: the future depends on the past only through the current state.

<img src="../image-1.png" width="500px" />



In a **fully observable** environment, the agent state is the same as the environment state.

In a **partially observable** environment, the agent state must be computed from the history, since the observations are incomplete representations of the environment.



### Return and long term thinking

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

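- $G_t$: the return, i.e. the discounted sum of rewards received after time step $t$
- $\gamma$: the discount factor. Values close to 1 favor long-term planning; values close to 0 favor short-term rewards.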
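As a small worked example of the return formula, the sketch below computes $G_t$ for a finite list of rewards. The function name `discounted_return` and the reward values are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """Return R_1 + gamma*R_2 + gamma^2*R_3 + ... for a finite reward list."""
    g = 0.0
    # Work backwards: G_t = R_{t+1} + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 0.0, 2.0]
short_sighted = discounted_return(rewards, gamma=0.0)  # only the first reward counts
balanced = discounted_return(rewards, gamma=0.5)       # 1 + 0.5*0 + 0.25*2
far_sighted = discounted_return(rewards, gamma=1.0)    # plain sum of rewards
```

With $\gamma = 0$ the late reward of 2 is ignored entirely, while with $\gamma = 1$ it counts in full.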

### Value function: compressing the future

<img src="../image-6.png" width="500px" />

The value function estimates how much future reward the agent can expect to receive if it follows the current policy.

It is the expected return when starting from state s, i.e. how good it is to be in that state.

It is parameterized by a **discount** factor (gamma in the formula above).

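### Q-value

Similar to the value function, but parameterized by an action as well as a state: the expected return when taking action a in state s and following the policy afterwards.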
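In the usual textbook notation (an assumption here, since these notes give the value formula only as an image), the two quantities and the link between them can be written as:

$$V^\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right], \qquad Q^\pi(s, a) = \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right]$$

$$V^\pi(s) = \sum_{a} \pi(a|s)\, Q^\pi(s, a)$$

The second line says that the value of a state is the Q-value averaged over the actions the policy would choose there.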


----
## Anatomy of a Reinforcement Learning Agent

<img src="../image-4.png" width="500px" />



## Model

“What’s going to happen next if I do this?”

The model is the agent’s learned conception of how the environment works. Given a particular state and an action to take, the agent’s model predicts the next state.

It can also predict the reward for a given state-action pair.
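One simple way to picture such a model, assuming small discrete state and action sets, is a table of counts built from observed transitions. The class `TabularModel` and its methods are hypothetical names for this sketch, not part of any library.

```python
from collections import Counter, defaultdict


class TabularModel:
    """Count-based model: predicts next state and reward for (state, action)."""

    def __init__(self):
        self.next_state_counts = defaultdict(Counter)
        self.reward_sums = defaultdict(float)

    def update(self, s, a, r, s_next):
        """Record one observed transition (s, a) -> (r, s_next)."""
        self.next_state_counts[(s, a)][s_next] += 1
        self.reward_sums[(s, a)] += r

    def predict(self, s, a):
        """Return (most frequently seen next state, mean observed reward)."""
        counts = self.next_state_counts.get((s, a))
        if not counts:
            raise KeyError(f"no experience for {(s, a)!r}")
        n = sum(counts.values())
        most_likely_next = counts.most_common(1)[0][0]
        mean_reward = self.reward_sums[(s, a)] / n
        return most_likely_next, mean_reward


model = TabularModel()
model.update("s0", "right", 0.0, "s1")
model.update("s0", "right", 0.0, "s1")
model.update("s0", "right", 1.0, "s2")
predicted_state, predicted_reward = model.predict("s0", "right")
```

Real environments usually need a function approximator instead of a table, but the question answered is the same: what happens next if I do this?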



<img src="../image-7.png" width="500px" />


## Categories of agents


<img src="../image-8.png" width="500px" />


### Model-free

A model-free agent learns its policy directly from experience with the environment, without a model of the environment dynamics.

#### Value-based

A value-based agent has no explicit policy. Instead, it learns a value function and acts on its estimates, aiming to find the optimal value function (the maximum value achievable at each state). Technically it still has an implicit policy: choose the action with the highest estimated value.
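The implicit policy of a value-based agent can be sketched as an argmax over a table of action values. The table `q_table`, its numbers, and the helper names below are made up for illustration.

```python
# Hypothetical action-value estimates for two states and two actions.
q_table = {
    ("s0", "left"): 0.2, ("s0", "right"): 0.7,
    ("s1", "left"): 0.5, ("s1", "right"): 0.1,
}
ACTIONS = ["left", "right"]


def greedy_action(q, state, actions):
    """Implicit policy: pick the action with the highest estimated value."""
    return max(actions, key=lambda a: q[(state, a)])


def state_value(q, state, actions):
    """Value of a state under the greedy policy: the best action value."""
    return max(q[(state, a)] for a in actions)


best_in_s0 = greedy_action(q_table, "s0", ACTIONS)
best_in_s1 = greedy_action(q_table, "s1", ACTIONS)
```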


#### Policy-based

A policy-based agent learns a policy directly, without learning a value function. The learned policy selects actions so as to maximize future reward.

- **deterministic**: the policy always produces the same action for a given state, $a=\pi(s)$

- **stochastic**: the policy outputs a probability distribution over actions for a given state, from which actions are sampled, $\pi(a|s)=P[A_t=a|S_t=s]$
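The difference between the two kinds of policy can be shown with a toy example. Everything here is an assumption for illustration: the integer state, the hand-written preferences, and the choice of a softmax to turn preferences into probabilities.

```python
import math
import random

ACTIONS = ["left", "right"]


def deterministic_policy(state):
    """a = pi(s): the same state always yields the same action."""
    return "right" if state >= 0 else "left"


def softmax(prefs):
    """Turn real-valued preferences into probabilities that sum to 1."""
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def stochastic_policy(state, rng):
    """pi(a|s): build a distribution over actions, then sample from it."""
    prefs = [0.0, 1.0] if state >= 0 else [1.0, 0.0]
    probs = softmax(prefs)
    action = rng.choices(ACTIONS, weights=probs, k=1)[0]
    return action, probs


rng = random.Random(0)
fixed_action = deterministic_policy(3)
sampled_action, action_probs = stochastic_policy(3, rng)
```

Calling `stochastic_policy` repeatedly in the same state can return different actions, which is what gives it exploration for free.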


#### Actor/critic

A mix of value-based and policy-based methods: the agent has both a learned policy (the actor) and an estimated value function for that policy (the critic).
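To make the actor and critic roles concrete, here is a minimal sketch of one common variant: a one-step actor-critic update with tabular softmax preferences. This specific variant, the function name `actor_critic_step`, and the step sizes are assumptions chosen for illustration, not something the sources prescribe.

```python
import math


def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(theta, v, s, a, r, s_next, done, actions,
                      alpha_actor=0.1, alpha_critic=0.1, gamma=0.99):
    """Update critic values `v` and actor preferences `theta` in place."""
    # Critic: the TD error measures how much better or worse things went
    # than the critic expected.
    target = r if done else r + gamma * v[s_next]
    td_error = target - v[s]
    v[s] += alpha_critic * td_error

    # Actor: for softmax preferences, the gradient of log pi(a|s) with
    # respect to the preference of action b is 1[b == a] - pi(b|s).
    probs = softmax([theta[(s, b)] for b in actions])
    for b, p in zip(actions, probs):
        grad_log_pi = (1.0 if b == a else 0.0) - p
        theta[(s, b)] += alpha_actor * td_error * grad_log_pi
    return td_error


actions = ["left", "right"]
theta = {("s0", "left"): 0.0, ("s0", "right"): 0.0}  # actor parameters
v = {"s0": 0.0}                                      # critic estimates
delta = actor_critic_step(theta, v, "s0", "right", r=1.0,
                          s_next=None, done=True, actions=actions)
```

After this single rewarded step, the preference for `right` in `s0` goes up and the preference for `left` goes down by the same amount.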


### Model-based
The agent builds a model of the environment by exploring it. The model is different for every environment.


The model is then used for planning, in order to derive a policy.

## Planning vs control

### Planning

Thinking about what to do.

Use the model to simulate future trajectories and optimize over them.
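A brute-force version of this idea, assuming a small deterministic model and a short horizon, is to simulate every action sequence and keep the best one. The `model` function (a position on a line, with a reward for reaching position 3) and the `plan` helper are hypothetical.

```python
from itertools import product

ACTIONS = ["left", "right"]


def model(state, action):
    """Hypothetical known model: returns (next_state, reward)."""
    next_state = state + (1 if action == "right" else -1)
    reward = 1.0 if next_state == 3 else 0.0
    return next_state, reward


def plan(state, horizon, gamma=0.9):
    """Simulate all action sequences of length `horizon`; return the best."""
    best_seq, best_return = None, float("-inf")
    for seq in product(ACTIONS, repeat=horizon):
        s, g, discount = state, 0.0, 1.0
        for a in seq:
            s, r = model(s, a)  # imagined step, no real interaction
            g += discount * r
            discount *= gamma
        if g > best_return:
            best_seq, best_return = seq, g
    return best_seq, best_return


best_seq, best_return = plan(state=1, horizon=3)
```

The number of sequences grows exponentially with the horizon, so practical planners sample or prune trajectories instead of enumerating them all.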

### Control

Executing actions (doing it).

A control problem is to determine the best possible action to take in order to maximize long-term reward:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

The control problem is an optimization problem: find the optimal policy.

A non-control (prediction) problem, such as iterative policy evaluation, is about predicting the value of a state under a given policy.
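As an example of the prediction side, the sketch below runs iterative policy evaluation on a tiny hand-made MDP with a fixed uniform random policy. The MDP `P`, its rewards, and the function name are assumptions made up for this note; `P[s][a]` lists `(probability, next_state, reward)` triples.

```python
# Hypothetical 2-state MDP plus a terminal state "END".
P = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(1.0, "B", 1.0)]},
    "B": {"stay": [(1.0, "B", 0.0)], "go": [(1.0, "END", 2.0)]},
    "END": {},  # terminal: no actions, value stays 0
}

# Fixed policy to evaluate: choose each action with probability 0.5.
uniform_policy = {s: {a: 1.0 / len(P[s]) for a in P[s]} for s in P if P[s]}


def policy_evaluation(P, policy, gamma, tol=1e-10):
    """Repeatedly apply the Bellman expectation backup until values settle."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            if not P[s]:
                continue  # skip terminal states
            new_v = sum(
                policy[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V


V = policy_evaluation(P, uniform_policy, gamma=0.5)
```

The policy is never changed here; the only output is how good each state is under it. For this MDP the values work out to $V(B) = 4/3$ and $V(A) = 10/9$.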

## Exploration vs Exploitation

- Exploitation: Use current knowledge to maximize reward
- Exploration: Try new things to improve knowledge

Stochastic policies have built-in exploration, because actions are sampled from a distribution (often controlled by a temperature parameter).

Deterministic policies need an explicit exploration mechanism such as **ε-greedy**.
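A minimal sketch of ε-greedy action selection, assuming action values are stored in a plain list indexed by action (the function name and values are illustrative):

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon explore, otherwise exploit the best estimate."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: uniform random action
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit


rng = random.Random(0)
q_values = [0.1, 0.9, 0.3]
always_greedy = epsilon_greedy(q_values, epsilon=0.0, rng=rng)
always_random = epsilon_greedy(q_values, epsilon=1.0, rng=rng)
```

In practice ε is often started high and decayed over time, so the agent explores early and exploits later.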

## Model Map

RL Hierarchy:

**Dynamic programming (DP)**: model-based; assumes the environment is known (the state transition probabilities are known).

Uses the Bellman equation in:
- policy iteration 
- value iteration
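Value iteration can be sketched on a tiny known MDP as follows. The MDP `P` is made up for illustration (`P[s][a]` lists `(probability, next_state, reward)` triples) and is chosen so that the immediately rewarding action in state `A` is not the best one in the long run.

```python
# Hypothetical MDP: from A, "left" pays 1 and ends; "right" pays 0 but
# leads to B, where "right" pays 5 and ends.
P = {
    "A": {"left": [(1.0, "END", 1.0)], "right": [(1.0, "B", 0.0)]},
    "B": {"left": [(1.0, "A", 0.0)], "right": [(1.0, "END", 5.0)]},
    "END": {},  # terminal state
}


def backup(P, V, s, a, gamma):
    """Expected one-step return of taking action a in state s."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def value_iteration(P, gamma, tol=1e-10):
    """Apply the Bellman optimality backup until the values stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            if not P[s]:
                continue  # skip terminal states
            best = max(backup(P, V, s, a, gamma) for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # Read off the greedy policy from the converged values.
    policy = {s: max(P[s], key=lambda a: backup(P, V, s, a, gamma))
              for s in P if P[s]}
    return V, policy


V_star, pi_star = value_iteration(P, gamma=0.9)
```

In state `A` the resulting policy picks `right` (value 0.9 × 5 = 4.5) over the immediate reward of 1 from `left`.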

**Model-free RL**: learns from samples when the environment is not known, typically by estimating action values or by optimizing the policy directly.
- Temporal difference:
    - Q-learning (off-policy): updates its estimate by comparing the predicted value with the observed reward plus the estimated value of the next state. The update always assumes the best next action will be taken (the max).
        - Deep Q-Network (DQN): Q-learning with a neural network, as used by DeepMind on Atari games.
    - SARSA (on-policy): updates using the value of the action actually taken next (a more cautious approach).
- Monte Carlo methods
- Policy gradients: REINFORCE, PPO (actor-critic)
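The difference between Q-learning and SARSA comes down to the target used in the update. The sketch below shows the two standard tabular update rules side by side; the dictionary layout, state names, and numbers are illustrative assumptions.

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """Off-policy: bootstrap from the best next action, whatever is done next."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy: bootstrap from the action the agent actually takes next."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])


actions = ["left", "right"]
initial = {("s0", "left"): 0.0, ("s0", "right"): 0.0,
           ("s1", "left"): 0.0, ("s1", "right"): 10.0}

# Same transition for both: in s0 take "right", get r = 1, land in s1,
# where the (exploring) agent actually takes the worse action "left".
Q_ql = dict(initial)
q_learning_update(Q_ql, "s0", "right", 1.0, "s1", actions, alpha=0.5, gamma=0.9)

Q_sarsa = dict(initial)
sarsa_update(Q_sarsa, "s0", "right", 1.0, "s1", "left", alpha=0.5, gamma=0.9)
```

Q-learning credits `("s0", "right")` with the value of the best next action (new estimate 5.0), while SARSA credits it with the action actually taken (new estimate 0.5), which is why SARSA tends to be more cautious while exploring.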

**Model-based RL**: learns the model, then uses it with DP-like planning
- Dyna-Q, AlphaZero, MuZero, world models

<img src="../image-60.png" width="800px">
