# "Model-Based Reinforcement Learning (MBRL)"
> "This is a summary of MBRL from the ICML 2020 tutorial."

- toc: true
- branch: master
- badges: true
- comments: true
- author: Isaac Kargar
- categories: [jupyter]


## Introduction and Motivation

Having access to a model of the world and using it for decision making is a powerful idea.
MBRL has applications in many areas: robotics (manipulation: predicting what will happen when an action is taken),
self-driving cars (modeling other agents' decisions and future motion and acting accordingly),
games (AlphaGo: searching over different possibilities), science (chemical use cases),
and operations research and energy applications (how to allocate renewable energy at different points in time to meet demand).

## Problem Statement

In sequential decision making, the agent interacts with the world by taking an action $a$ and receiving the next state $s$ and a reward $r$.


<img src="files/images/rl.png">


We can write this problem as a Markov Decision Process (MDP) as follows:

- States $S \subseteq \mathbb{R}^{d_S}$
- Actions $A \subseteq \mathbb{R}^{d_A}$
- Reward function $R: S \times A \rightarrow \mathbb{R}$
- Transition function $T: S \times A \rightarrow S$
- Discount $\gamma \in (0,1)$
- Policy $\pi: S \rightarrow A$

The goal is to find a policy that maximizes the sum of discounted future rewards:
$$
\arg\max_{\pi} \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)
$$
subject to
$$
a_t = \pi(s_t), \quad s_{t+1} = T(s_t, a_t)
$$
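To make the objective concrete, here is a minimal sketch that evaluates the discounted sum for a given policy by rolling it out. The 1-D toy environment, the two policies, and the finite `horizon` used to truncate the infinite sum are all assumptions made for illustration; they are not part of the tutorial.

```python
def discounted_return(policy, transition, reward, s0, gamma=0.9, horizon=200):
    """Approximate sum_t gamma^t R(s_t, a_t) by truncating at `horizon` steps."""
    s, total = s0, 0.0
    for t in range(horizon):
        a = policy(s)                        # a_t = pi(s_t)
        total += gamma ** t * reward(s, a)   # accumulate gamma^t R(s_t, a_t)
        s = transition(s, a)                 # s_{t+1} = T(s_t, a_t)
    return total

# Hypothetical toy problem: the action shifts the state, and the reward
# penalizes distance from the origin.
def transition(s, a):
    return s + a

def reward(s, a):
    return -abs(s)

def go_to_origin(s):
    # Step toward 0, moving at most one unit per step.
    return -max(-1.0, min(1.0, s))

def stay_put(s):
    return 0.0

ret_go = discounted_return(go_to_origin, transition, reward, s0=3.0)
ret_stay = discounted_return(stay_put, transition, reward, s0=3.0)
print(ret_go, ret_stay)
```

Solving the optimization problem means searching over policies for the one with the highest such return; in this toy case the policy that walks to the origin scores higher than the one that never moves.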

How do we solve this optimization problem?

- Collect data $D = \{ (s_t, a_t, r_{t+1}, s_{t+1}) \}_{t=0}^T$.
- Model-free: learn a policy directly from the data

$D \rightarrow \pi$, e.g. Q-learning, policy gradient

- Model-based: learn a model, then use it to **learn** or **improve** a policy

$D \rightarrow f \rightarrow \pi$
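The model-based pipeline $D \rightarrow f \rightarrow \pi$ can be sketched end to end on a toy problem. Everything specific here is an assumption chosen for illustration: the linear ground-truth dynamics, the known quadratic reward, the least-squares model, and the random-shooting planner standing in for "use the model to obtain a policy". The tutorial does not prescribe any of these choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_step(s, a):
    """Hypothetical ground-truth dynamics, unknown to the agent."""
    return 0.9 * s + 0.5 * a

def reward(s, a):
    """Reward assumed known here: stay near 0 with small actions."""
    return -(s ** 2) - 0.1 * (a ** 2)

# 1) Collect data D by acting randomly in the environment.
S = rng.uniform(-2, 2, size=500)
A = rng.uniform(-1, 1, size=500)
S_next = true_step(S, A)

# 2) D -> f: fit s_{t+1} ~ w[0] * s_t + w[1] * a_t by least squares.
X = np.stack([S, A], axis=1)
w, *_ = np.linalg.lstsq(X, S_next, rcond=None)

def model(s, a):
    return w[0] * s + w[1] * a

# 3) f -> pi: random shooting. Sample action sequences, simulate them
#    with the learned model, and return the first action of the best one.
def policy(s, horizon=5, n_candidates=256, gamma=0.9):
    plans = rng.uniform(-1, 1, size=(n_candidates, horizon))
    sim_s = np.full(n_candidates, s, dtype=float)
    returns = np.zeros(n_candidates)
    for t in range(horizon):
        returns += gamma ** t * reward(sim_s, plans[:, t])
        sim_s = model(sim_s, plans[:, t])
    return plans[np.argmax(returns), 0]

# Run the planner against the true environment.
s = 2.0
for _ in range(20):
    s = true_step(s, policy(s))
print("learned weights:", w, "final state:", s)
```

Because the toy data is noise-free and the true dynamics are linear, the fitted weights recover the true coefficients; with real data the model is only approximate, and planning errors compound over the horizon.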


## What is a model?

A model is a representation that explicitly encodes knowledge about the structure of the environment and task.

This model can take a lot of different forms:

- A transition/dynamics model: $s_{t+1} = f_s(s_t, a_t)$
- A model of rewards: $r_{t+1} = f_r(s_t, a_t)$
- An inverse transition/dynamics model (which tells you which action takes you from one state to the next): $a_t = f_s^{-1}(s_t, s_{t+1})$
- A model of distance of two states: $d_{ij} = f_d(s_i, s_j)$
- A model of future returns: $G_t = Q(s_t, a_t)$ or $G_t = V(s_t)$

Typically, when someone says MBRL, they mean the first two items.
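The model types listed above differ mainly in what they take as input and what they predict. The sketch below writes four of them as plain functions for a hypothetical 1-D system in which the action is added to the state. The function bodies are hand-written stand-ins chosen for illustration; in practice each would be learned from data.

```python
def f_s(s: float, a: float) -> float:
    """Transition/dynamics model: predicts s_{t+1} from (s_t, a_t)."""
    return s + a

def f_r(s: float, a: float) -> float:
    """Reward model: predicts r_{t+1} from (s_t, a_t)."""
    return -abs(s + a)

def f_s_inv(s: float, s_next: float) -> float:
    """Inverse dynamics model: predicts the action a_t that leads from s_t to s_{t+1}."""
    return s_next - s

def f_d(s_i: float, s_j: float) -> float:
    """Distance model: predicts how far apart two states are."""
    return abs(s_j - s_i)

s, a = 1.5, -0.5
s_next = f_s(s, a)
# The inverse model should recover the action that the forward model applied.
recovered_a = f_s_inv(s, s_next)
print(s_next, f_r(s, a), recovered_a, f_d(s, s_next))
```

The forward and inverse models are consistent with each other when `f_s_inv(s, f_s(s, a))` returns `a`, which holds for this toy system.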

## How to use a model?

<img src="files/images/rl2.png">