# "Model Based Reinforcement Learning (MBRL)"
> "This is a summary of MBRL from ICML-2020 tutorial."

- toc: false
- branch: master
- badges: true
- comments: true
- categories: [fastpages, jupyter]
- image: images/some_folder/your_image.png
- hide: false
- search_exclude: true


This post is a summary of the model-based RL tutorial at ICML-2020. You can find the videos [here](https://sites.google.com/view/mbrl-tutorial).


## Introduction and Motivation

Having access to a model of the world and using it for decision making is a powerful idea. 
MBRL has applications in many areas: robotics (manipulation, where the model predicts what will happen after an action),
self-driving cars (modeling other agents' decisions and future motions and acting accordingly),
games (AlphaGo, which searches over different possibilities), science (chemistry use cases),
and operations research and energy applications (how to allocate renewable energy at different points in time to meet demand).

## Problem Statement

In sequential decision making, the agent interacts with the world by taking an action $a$ and receiving the next state $s$ and a reward $r$.


<img src="images/rl.png">


We can write this problem as a Markov Decision Process (MDP) as follows:

- States $s \in S \subseteq \mathbb{R}^{d_S}$
- Actions $a \in A \subseteq \mathbb{R}^{d_A}$
- Reward function $R: S \times A \rightarrow \mathbb{R}$
- Transition function $T: S \times A \rightarrow S$
- Discount $\gamma \in (0,1)$
- Policy $\pi: S \rightarrow A$

The goal is to find a policy which maximizes the sum of discounted future rewards:
$$
\arg\max_{\pi} \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)
$$
subject to
$$
a_t = \pi(s_t), \quad s_{t+1} = T(s_t, a_t)
$$
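As a small worked example of the objective above, the sketch below rolls out a policy under deterministic dynamics and accumulates the discounted reward. The 1-D toy problem (the `policy`, `transition`, and `reward` lambdas) is purely hypothetical, and the infinite sum is truncated at a finite horizon.

```python
def discounted_return(policy, transition, reward, s0, gamma=0.99, horizon=200):
    """Finite-horizon approximation of sum_t gamma^t R(s_t, a_t)."""
    s, total, discount = s0, 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)                     # a_t = pi(s_t)
        total += discount * reward(s, a)  # add gamma^t * R(s_t, a_t)
        s = transition(s, a)              # s_{t+1} = T(s_t, a_t)
        discount *= gamma
    return total

# Hypothetical 1-D toy problem: drive the state towards zero.
policy = lambda s: -0.5 * s
transition = lambda s, a: s + a
reward = lambda s, a: -(s ** 2)

G = discounted_return(policy, transition, reward, s0=1.0, gamma=0.9)
```

Comparing `G` across candidate policies is the quantity the $\arg\max$ is taken over.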

How can we solve this optimization problem?

- Collect data $D = \{ (s_t, a_t, r_{t+1}, s_{t+1}) \}_{t=0}^{T}$.
- Model-free: learn a policy directly from the data

$D \rightarrow \pi$, e.g. Q-learning, policy gradient

- Model-based: learn a model, then use it to **learn** or **improve** a policy

$D \rightarrow f \rightarrow \pi$
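The model-based pipeline can be sketched end to end on a toy problem. Everything here is an illustrative assumption: the 1-D dynamics in `true_step`, the reward that depends only on the next state, the linear model fitted by least squares, and the one-step greedy policy over a fixed grid of candidate actions.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_step(s, a):
    # Hypothetical ground-truth dynamics, unknown to the agent.
    return 0.9 * s + 0.5 * a

def reward(s_next):
    # Hypothetical reward: stay close to the origin.
    return -(s_next ** 2)

# 1) Collect data D with random actions.
S = rng.uniform(-1, 1, size=200)
A = rng.uniform(-1, 1, size=200)
S_next = true_step(S, A)

# 2) D -> f: fit s' ~ w[0] * s + w[1] * a by least squares.
X = np.stack([S, A], axis=1)
w, *_ = np.linalg.lstsq(X, S_next, rcond=None)

def f(s, a):
    return w[0] * s + w[1] * a

# 3) f -> pi: pick the candidate action with the best predicted reward.
candidates = np.linspace(-1, 1, 21)

def pi(s):
    predicted = f(s, candidates)
    return candidates[np.argmax(reward(predicted))]
```

A model-free method would skip step 2 and update $\pi$ from the tuples in $D$ directly.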


## What is a model?

A model is a representation that explicitly encodes knowledge about the structure of the environment and task.

This model can take a lot of different forms:

- A transition/dynamics model: $s_{t+1} = f_s(s_t, a_t)$
- A model of rewards: $r_{t+1} = f_r(s_t, a_t)$
- An inverse transition/dynamics model, which tells you which action takes you from one state to the next: $a_t = f_s^{-1}(s_t, s_{t+1})$
- A model of distance of two states: $d_{ij} = f_d(s_i, s_j)$
- A model of future returns: $G_t = Q(s_t, a_t)$ or $G_t = V(s_t)$

Typically, when someone says MBRL, they mean the first two items.

<img src="images/model.png">

Sometimes we know the ground-truth dynamics and rewards, for example in game environments or simulators such as MuJoCo and CARLA. In those cases we might as well use them.

But we don't have access to the model in all cases, for example with robots, complex physical dynamics, or interaction with humans. There we need to learn the model.




## How to use a model?

In model-free RL, the agent has a policy and a learning algorithm, as in the figure below:

<img src="images/rl2.png">

In model-based RL we can use the model in three different ways:
- Simulating the environment: replace the environment with the model, use it to generate data, and use that data to update the policy.
- Assisting the learning algorithm: modify the learning algorithm to use the model to interpret the data it receives in a different way.
- Strengthening the policy: allow the agent at test time to use the model to try out different actions before it commits to one of them (i.e., before taking the action in the real world).

<img src="images/mbrl.png">
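The third use, strengthening the policy, can be illustrated with a random-shooting planner: sample many action sequences, roll each one out through the model, and execute only the first action of the best sequence. This is a minimal sketch and not a method taken from the tutorial; the function `plan_random_shooting`, the toy `model`, and the toy `reward` are all hypothetical.

```python
import numpy as np

def plan_random_shooting(model, reward, s0, horizon=10, n_candidates=500, rng=None):
    """Return the first action of the best randomly sampled action sequence."""
    rng = np.random.default_rng() if rng is None else rng
    # Candidate action sequences, sampled uniformly in [-1, 1].
    actions = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon))
    returns = np.zeros(n_candidates)
    s = np.full(n_candidates, s0, dtype=float)
    for t in range(horizon):
        returns += reward(s, actions[:, t])  # imagined reward at step t
        s = model(s, actions[:, t])          # imagined next state
    best = np.argmax(returns)
    return actions[best, 0]                  # commit to one action, then replan

# Hypothetical 1-D model and reward: move the state towards 1.0.
model = lambda s, a: s + 0.1 * a
reward = lambda s, a: -(s - 1.0) ** 2

a0 = plan_random_shooting(model, reward, s0=0.0, rng=np.random.default_rng(0))
```

No real-world action is taken inside the loop; all trial and error happens in the model.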

In general, to compare model-free and model-based:
    
<img src="images/mbrl_vs_mfrl.png">

## How to learn a model?


There are two dimensions that are useful to pay attention to:
- the representation of the state features over which the model is learned
- the representation of the transitions between states

<img src="images/learn_model.png">


Next, we take a look at different transition models.

### State-transition models

In some cases, we know the equations of motion and dynamics but not the exact parameters, such as mass. We can use system identification to estimate these unknown parameters. However, this approach requires a lot of domain knowledge about how exactly the system works.

<img src="images/learn_model2.png">
<img src="images/learn_model3.png">
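A minimal system-identification sketch, assuming the only known structure is Newton's second law for a point mass, $\Delta v = (\Delta t / m) F$, and that the mass $m$ is the single unknown parameter. The simulated data, the noise level, and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
dt, true_mass = 0.05, 2.0  # true_mass is only used to simulate the data

# Simulated measurements: velocity changes under random forces, plus small noise.
forces = rng.uniform(-1, 1, size=500)
dv = forces / true_mass * dt + rng.normal(0, 1e-4, size=500)

# Known structure: dv = theta * (F * dt) with theta = 1 / m.
# Closed-form least-squares estimate of the single unknown theta.
x = forces * dt
theta = np.dot(x, dv) / np.dot(x, x)
mass_hat = 1.0 / theta
```

Because the functional form is given, a single parameter is enough to describe the dynamics, which is why this approach can be very data-efficient when the domain knowledge is available.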



When we don't know the equations of motion, we can simply use an MLP that takes the concatenation of $s_t$ and $a_t$ as input and outputs the next state $s_{t+1}$.

<img src="images/learn_model4.png">
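The sketch below trains a one-hidden-layer MLP on $(s_t, a_t) \rightarrow s_{t+1}$ pairs with plain gradient descent, written in NumPy so the backward pass is visible. The ground-truth function `true_step`, the layer size, and the learning rate are arbitrary assumptions; in practice a deep learning framework would handle the gradients.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_step(s, a):
    # Hypothetical ground-truth dynamics, used only to generate training data.
    return s + 0.1 * np.sin(s) + 0.2 * a

S = rng.uniform(-2, 2, size=(512, 1))
A = rng.uniform(-1, 1, size=(512, 1))
X = np.concatenate([S, A], axis=1)  # network input: [s_t, a_t]
Y = true_step(S, A)                 # regression target: s_{t+1}

hidden = 16
W1 = rng.normal(0, 0.5, size=(2, hidden)); b1 = np.zeros(hidden)
W2 = rng.normal(0, 0.1, size=(hidden, 1)); b2 = np.zeros(1)

def forward(X):
    H = np.tanh(X @ W1 + b1)  # hidden activations
    return H, H @ W2 + b2     # predicted next state

lr, losses = 0.02, []
for step in range(2000):
    H, pred = forward(X)
    err = pred - Y
    losses.append(float(np.mean(err ** 2)))
    # Backpropagation of the mean-squared error.
    g_pred = 2 * err / len(X)
    g_W2 = H.T @ g_pred; g_b2 = g_pred.sum(0)
    g_Z = (g_pred @ W2.T) * (1 - H ** 2)
    g_W1 = X.T @ g_Z; g_b1 = g_Z.sum(0)
    W1 -= lr * g_W1; b1 -= lr * g_b1
    W2 -= lr * g_W2; b2 -= lr * g_b2
```

After training, `forward(np.array([[s, a]]))[1]` plays the role of $f_s(s_t, a_t)$.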


When we have some, but not perfect, domain knowledge about the environment, we can use graph neural networks (GNNs) to model the agent (robot). For example, in MuJoCo we can model a robot with its body parts as nodes and its joints as edges, and learn the physics engine.

<img src="images/learn_model5.png">



### Observation-transition models

In these cases, we don't have access to states (low-level states such as joint angles), but we do have access to images. The MDP for these cases looks like this:

<img src="images/learn_model6.png">

So what can we do with this? 

- Directly predict transitions between observations (observation-transition models)

<img src="images/learn_model7.png">

- Reconstruct the observation at every timestep, using something like an LSTM. Here we need to reconstruct the whole observation at each timestep, and the predicted images can be blurry.

<img src="images/learn_model8.png">

<img src="images/learn_model88.png">



### Latent state-transition models

Another option when we only have access to observations is to infer a latent state and make transitions in that latent space rather than in observation space (latent state-transition models). This can be much faster than reconstructing the observation at every timestep. We take the initial observation, or perhaps the last few observations, embed them into a latent state, and then unroll it in time, making predictions in $z$ instead of $o$.

<img src="images/learn_model9.png">

Usually we reconstruct the observation during training, but at test time we can unroll the latent state very quickly. We can also reconstruct the observation at any timestep we want (not necessarily at every timestep).

<img src="images/learn_model10.png">
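The data flow of a latent state-transition model can be sketched as three functions: an encoder $o_t \rightarrow z_t$, a latent transition $(z_t, a_t) \rightarrow z_{t+1}$, and an optional decoder $z_t \rightarrow o_t$. The weights below are random and untrained, and all dimensions and function names are assumptions; the point is only that the rollout never touches observation space until a reconstruction is requested.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, act_dim, latent_dim = 64, 2, 8

# Randomly initialised (untrained) weights, for illustrating shapes only.
W_enc = rng.normal(0, 0.1, size=(obs_dim, latent_dim))
W_z = rng.normal(0, 0.1, size=(latent_dim, latent_dim))
W_a = rng.normal(0, 0.1, size=(act_dim, latent_dim))
W_dec = rng.normal(0, 0.1, size=(latent_dim, obs_dim))

def encode(o):          # o_t -> z_t
    return np.tanh(o @ W_enc)

def latent_step(z, a):  # (z_t, a_t) -> z_{t+1}
    return np.tanh(z @ W_z + a @ W_a)

def decode(z):          # z_t -> reconstructed o_t, only when needed
    return z @ W_dec

def rollout(o0, actions):
    """Embed the first observation once, then unroll purely in latent space."""
    z = encode(o0)
    latents = [z]
    for a in actions:
        z = latent_step(z, a)
        latents.append(z)
    return np.stack(latents)

o0 = rng.normal(size=obs_dim)
actions = rng.uniform(-1, 1, size=(5, act_dim))
latents = rollout(o0, actions)
o_last = decode(latents[-1])  # reconstruct only the final timestep
```

Each step of the rollout works on an 8-dimensional vector instead of a 64-dimensional observation, which is where the speed-up comes from.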


### Structured latent state-transition models

If you have a little more domain knowledge, you can add some structure to the latent state. For example, if you know that the scene you are trying to model consists of objects, you can explicitly detect those objects, segment them out, and then learn the transitions between objects.

<img src="images/learn_model11.png">


### Recurrent value models

The idea is that when you unroll your latent state, you predict the value of the state at each future step in addition to the reward. The model can then be trained without reconstructing observations, by pushing the predicted values toward the values actually observed when you roll out in the real environment.

<img src="images/learn_model12.png">

Why is this useful? Because some types of planners, such as MCTS (Monte Carlo tree search), only need predicted values rather than predicted states.


### Non-parametric models

So far we talked about parametric ways of learning the model. We can also use non-parametric methods like graphs.

For example, the replay buffer used in off-policy methods can be seen as an approximation to a type of model: if the buffer holds enough data, sampling from it is essentially sampling from a density model over your transitions. With extra replay, you can get roughly the same level of performance as a model-based method that learns a parametric model.

<img src="images/learn_model13.png">
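Seen this way, a replay buffer is just a store of observed transitions plus a sampling rule. The class below is a minimal sketch with uniform sampling; the `ReplayBuffer` name, its methods, and the toy environment are assumptions for illustration and not tied to any particular library.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (s, a, r, s_next) transitions."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)  # oldest transitions are evicted first

    def add(self, s, a, r, s_next):
        self.storage.append((s, a, r, s_next))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement from the stored transitions.
        batch_size = min(batch_size, len(self.storage))
        return rng.sample(list(self.storage), batch_size)

# Hypothetical toy environment: s' = s + a, reward = -|s'|.
buffer = ReplayBuffer(capacity=100)
env_rng = random.Random(0)
s = 0.0
for _ in range(150):
    a = env_rng.uniform(-1.0, 1.0)
    s_next = s + a
    buffer.add(s, a, -abs(s_next), s_next)
    s = s_next

batch = buffer.sample(4, rng=env_rng)
```

Every sampled tuple is a real transition, so unlike a learned parametric model the buffer never produces an inaccurate prediction, but it also cannot answer queries about state-action pairs it has not seen.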

Another thing we can do with the data in the buffer is to learn the transitions between data points and interpolate to find states between those stored in the buffer. In effect, this learns a distribution and uses it to generate new data points.

<img src="images/learn_model14.png">

Another form of non-parametric transition model is a symbolic description, which is popular in the planning community but not in the deep learning community.

<img src="images/learn_model15.png">

Another form of non-parametric model is the Gaussian process, which gives strong predictions from a very small amount of data. PILCO is one example of an algorithm that uses it.

<img src="images/learn_model16.png">
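To make the small-data claim concrete, here is a minimal Gaussian process regression sketch with an RBF kernel, fitted to only eight transitions of a toy system. It is not PILCO itself; the kernel hyperparameters are fixed by hand rather than optimised, and the toy dynamics, the training grid, and the function names are all assumptions.

```python
import numpy as np

def rbf_kernel(X1, X2, length_scale=1.0, signal_var=1.0):
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return signal_var * np.exp(-0.5 * sq_dists / length_scale ** 2)

def gp_predict(X_train, y_train, X_test, noise_var=1e-4):
    """Posterior mean and variance of a zero-mean GP at X_test."""
    K = rbf_kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_test)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = np.diag(rbf_kernel(X_test, X_test)) - np.sum(v ** 2, axis=0)
    return mean, var

# Eight training transitions on a grid of (s, a) pairs.
S, A = np.meshgrid(np.array([-1.5, -0.5, 0.5, 1.5]), np.array([-0.5, 0.5]))
X_train = np.stack([S.ravel(), A.ravel()], axis=1)
# Hypothetical dynamics; the GP models the state change s_{t+1} - s_t.
y_train = 0.1 * np.sin(X_train[:, 0]) + 0.2 * X_train[:, 1]

# One query at a training point, one far away from all data.
X_test = np.array([[0.5, 0.5], [10.0, 10.0]])
mean, var = gp_predict(X_train, y_train, X_test)
```

The predictive variance is what distinguishes this from a plain regressor: it is close to zero near the data and returns to the prior variance far from it, so a planner can tell where the model should not be trusted.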

