## Introduction:
The goal of this exercise is to implement a simple model-based reinforcement learning algorithm. First, we will learn a dynamics function that models the observed state transitions, and then we will use the learned model for decision-time planning to select actions that maximize the predicted rewards (see the [paper](https://arxiv.org/pdf/1708.02596.pdf)).

Miro Board Link: https://miro.com/app/board/uXjVOtP-obk=/?share_link_id=685135424478

Before we start, we install the required `gym` package.

In [None]:
!pip install gym

## Developing an RL cycle using OpenAI GYM
`Gym` is a toolkit for developing and comparing reinforcement learning algorithms. It ships with many built-in environments, such as CartPole and Pendulum. This [link](https://gym.openai.com/envs/) lists all defined environments.

<img src=img/rl.png width="400">

Import the required packages.

In [None]:
import random
import numpy as np 
import gym
import matplotlib.pyplot as plt
import os
import pathlib
import shutil

import torch
import torch.nn as nn
import torch.nn.functional as F

from ml2_utils import Dataset, Logger
from ml2_utils import normalize, unnormalize, weight_init

Set the global seeds.

In [None]:
def set_global_seeds(i):
    torch.manual_seed(i)
    np.random.seed(i)
    random.seed(i)


# Set the random seed:
seed = 999
set_global_seeds(seed)

### Environment

Create the environment

In [None]:
env = gym.make("Pendulum-v0")

In [None]:
obs_dim = env.observation_space.shape[0]
act_dim = env.action_space.shape[0]
print("the shape of the observation space: ", obs_dim)
print("the shape of the action space: ", act_dim)

The observation space of our system contains three measurements, $ [\cos(\phi), \sin(\phi), \dot{\phi}] $. The instance attributes `low` and `high` return the minimum and maximum values of the observation space.
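To see why the angle is encoded as a cosine/sine pair, the following plain-Python sketch converts an angle into the observation format and recovers it with `atan2`. The helper names `angle_to_obs` and `obs_to_angle` are illustrative assumptions and are not part of Gym.

```python
import math

def angle_to_obs(phi, phi_dot):
    """Encode a pendulum state as [cos(phi), sin(phi), phi_dot]."""
    return [math.cos(phi), math.sin(phi), phi_dot]

def obs_to_angle(obs):
    """Recover the angle in [-pi, pi] from an observation."""
    cos_phi, sin_phi, _ = obs
    return math.atan2(sin_phi, cos_phi)

# Two angles that differ by a full turn describe the same physical state,
# and the cos/sin encoding maps them to (numerically) the same observation.
obs_a = angle_to_obs(0.5, 1.0)
obs_b = angle_to_obs(0.5 + 2 * math.pi, 1.0)
print(obs_a)
print(obs_b)
print(obs_to_angle(obs_b))
```

The encoding avoids the jump between $-\pi$ and $\pi$ that a raw angle would have, which makes the observation a continuous function of the physical state.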


In [None]:
print("The minimum value of the observation space :", env.observation_space.low)
print("The maximum value of the observation space :", env.observation_space.high)

The goal of this task is to swing the pendulum up and keep it upright using the motor torque $a$.

In [None]:
print("The minimum value of the action space :", env.action_space.low)
print("The maximum value of the action space :", env.action_space.high)

As we saw in the lecture, if the system is not too complex, as is the case for the pendulum, its model can be written down mathematically. The equations describing the motion of the pendulum can be found [here](https://en.wikipedia.org/wiki/Pendulum_(mechanics)). However, today we will assume that the dynamics of this system are unknown and will try to learn them.
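For reference, an analytic model of a torque-driven pendulum can be sketched in a few lines. The constants (`g`, `m`, `l`, `dt`, torque and speed limits) and the Euler update below are assumptions modeled on the classic Gym pendulum and may differ from the environment you have installed; `pendulum_step` is a hypothetical helper, not a Gym function.

```python
import math

def pendulum_step(th, thdot, u, dt=0.05, g=10.0, m=1.0, l=1.0,
                  max_torque=2.0, max_speed=8.0):
    """One Euler step of a torque-driven pendulum; th = 0 is upright."""
    # clip the torque to the allowed range
    u = max(-max_torque, min(max_torque, u))
    # angular acceleration from gravity and torque, integrated over dt
    thdot_new = thdot + (-3.0 * g / (2.0 * l) * math.sin(th + math.pi)
                         + 3.0 / (m * l ** 2) * u) * dt
    th_new = th + thdot_new * dt
    # clip the angular velocity
    thdot_new = max(-max_speed, min(max_speed, thdot_new))
    return th_new, thdot_new

# Apply a constant torque for a few steps, starting upright and at rest.
th, thdot = 0.0, 0.0
for _ in range(3):
    th, thdot = pendulum_step(th, thdot, u=1.0)
    print(round(th, 4), round(thdot, 4))
```

The learned model in the rest of this notebook plays the role of `pendulum_step`, except that it is fitted from data instead of being derived from physics.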

## Model-Based RL with Model Predictive Control (MPC)


### Step 1: Run some policy (e.g. random policy) to collect data $D_{env} = \{s_t, a_t, r_t, s_{t+1}\}$

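The following code lets the RL agent play **20 episodes** with a random policy. Each episode ends when the environment reports `done` or after at most **300 steps**, whichever comes first. Rendering is disabled (the `env.render()` call is commented out); the accumulated reward and the length of each episode are printed.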

In [None]:
# play 20 episodes with a random policy
number_episodes = 20
max_rollout_length = 300
dataset = Dataset()

for i in range(number_episodes):
    # initialize the environment
    state = env.reset()
    done = False
    t = 0        
    episode_rew = 0  # accumulated reward
    while not done:
        # update the counter
        t += 1
        # choose a random action
        action = env.action_space.sample()
        # take a step in the environment
        next_state, reward, done, info = env.step(action)
        # is the episode done? 
        done = done or (t >= max_rollout_length)
        episode_rew += reward
        #env.render()
        # add the transition to the dataset
        dataset.add(state, action, next_state, reward, done)
        # update the state
        state = next_state

    # when the episode is done, print its cumulative reward and length
    print('Episode %d finished, reward:%d, the length of the episode:%d'% (i, episode_rew,t))
env.close()

The environment is initialized by calling `reset()`. Then, in each iteration of the inner loop, `env.action_space.sample()` samples a random action and `env.step()` executes it in the environment; uncommenting `env.render()` would additionally display the current state of the game. After all episodes have finished, the environment is closed by calling `env.close()`. The `step()` method returns four values that describe the interaction with the environment: the observation, the reward, the `done` flag, and an `info` dictionary.

Whenever `done` is True, this means that the episode has terminated and that the environment should be reset. 

The class `Dataset` provides some functions for obtaining summary statistics. 

In [None]:
print("The state mean: ", dataset.state_mean)
print("The state std: ",  dataset.state_std)
print("The action mean: ",dataset.action_mean)
print("The action std: ", dataset.action_std)
print("size of the random dataset: ", len(dataset))
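The actual `Dataset`, `normalize` and `unnormalize` helpers live in `ml2_utils` and are not shown in this notebook. The sketch below only illustrates the idea behind them in plain Python: compute per-dimension statistics of the states and of the state differences, then use them to standardize and de-standardize a vector. All function names here (`column_stats`, `normalize_sketch`, `unnormalize_sketch`) and the small `eps` term are assumptions for illustration.

```python
import math

def column_stats(rows):
    """Per-dimension mean and (population) standard deviation of a list of vectors."""
    n, dim = len(rows), len(rows[0])
    means = [sum(r[i] for r in rows) / n for i in range(dim)]
    stds = [math.sqrt(sum((r[i] - means[i]) ** 2 for r in rows) / n)
            for i in range(dim)]
    return means, stds

def normalize_sketch(x, mean, std, eps=1e-8):
    """Shift and scale x to roughly zero mean and unit variance."""
    return [(xi - m) / (s + eps) for xi, m, s in zip(x, mean, std)]

def unnormalize_sketch(x, mean, std, eps=1e-8):
    """Inverse of normalize_sketch."""
    return [xi * (s + eps) + m for xi, m, s in zip(x, mean, std)]

# Tiny made-up dataset of states and next states.
states = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5], [-1.0, 0.0, 1.5]]
next_states = [[0.9, 0.1, 0.4], [-0.1, 0.9, -0.3], [-0.9, -0.1, 1.2]]
deltas = [[b - a for a, b in zip(s, s_next)]
          for s, s_next in zip(states, next_states)]

state_mean, state_std = column_stats(states)
delta_mean, delta_std = column_stats(deltas)

z = normalize_sketch(states[0], state_mean, state_std)
print(z)
print(unnormalize_sketch(z, state_mean, state_std))
```

Standardizing inputs and targets keeps all dimensions on a comparable scale, which usually makes gradient-based training of the network better conditioned.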

How to use the dataset?

In [None]:
iterator = dataset.random_iterator(batch_size=3)
next(iterator)

**Note:** the batch consists of NumPy arrays, not torch tensors.

### Step 2: Learn the model $P_\phi(s_t,a_t)$
We parameterize our learned dynamics function $P_\phi (s_t, a_t)$ as a deep neural network, where the parameter vector $\phi$ represents the network's weights. 


We do not want the network to predict the next state $s_{t+1}$ directly from the current state and action $(s_t, a_t)$. This function can be hard to learn when the states $s_t$ and $s_{t+1}$ are very similar and the action seemingly has little effect on the output. The difficulty becomes more pronounced as the time between states, $\Delta t$, becomes small.

Note that increasing $\Delta t$ increases the information available from each data point, which can help both with learning the dynamics and with planning using the learned model. However, increasing $\Delta t$ also makes the discretization coarser and the underlying continuous-time dynamics appear more complex, which makes the learning process more difficult.

We will learn a neural network dynamics model that encodes the change in state that results from executing the action $a_t$ in state $s_t$:
$$\hat{\Delta}_{t+1} = P_\phi (s_t, a_t)$$
such that the predicted next state is
$$ \hat{s}_{t+1} =  s_t + \hat{\Delta}_{t+1} $$

We will train $P_\phi$ in a standard supervised learning setup, by performing gradient descent on the following objective:
$$L(\phi) = \sum_{(s_t, a_t, s_{t+1}) \in D_{env}} \lVert (s_t + \hat{\Delta}_{t+1}) - s_{t+1}\rVert_2^2$$


<img src=img/img1.png width="500">

We will implement a neural network dynamics model and train it using a fixed dataset consisting of rollouts collected by a random policy.
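The objective can be illustrated on a toy problem before turning to the neural network. The sketch below is an assumption-laden stand-in, not the notebook's model: the "true" dynamics are $s_{t+1} = s_t + 0.1\,a_t$, the model predicts the delta as $\hat{\Delta}_{t+1} = w\,a_t$ with a single parameter $w$, and plain gradient descent minimizes the squared error between $s_t + \hat{\Delta}_{t+1}$ and $s_{t+1}$.

```python
import random

random.seed(0)
TRUE_GAIN = 0.1  # hypothetical ground-truth dynamics: s_next = s + 0.1 * a

# Collect transitions (s, a, s_next) with random states and actions.
data = []
for _ in range(100):
    s = random.uniform(-1.0, 1.0)
    a = random.uniform(-2.0, 2.0)
    data.append((s, a, s + TRUE_GAIN * a))

def loss_and_grad(w, data):
    """Sum of squared errors of (s + w * a) against s_next, and its gradient in w."""
    loss, grad = 0.0, 0.0
    for s, a, s_next in data:
        residual = (s + w * a) - s_next
        loss += residual ** 2
        grad += 2.0 * residual * a
    return loss, grad

# Plain gradient descent on the single model parameter.
w = 0.0
lr = 0.001
for _ in range(200):
    loss, grad = loss_and_grad(w, data)
    w -= lr * grad

final_loss, _ = loss_and_grad(w, data)
print(w, final_loss)
```

The network trained below follows the same pattern, with $w$ replaced by the weights $\phi$ and the hand-written gradient replaced by automatic differentiation.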

First part: one layer

In [None]:
def linear_bn_relu_block(in_feat, out_feat, normalize=True):
    """ linear + batchnorm (optional) + relu """
    layers = [nn.Linear(in_feat, out_feat)]
    if normalize:
        layers.append(nn.BatchNorm1d(out_feat))
    layers.append(nn.ReLU(inplace=True))
    return layers

Second part: Module

In [None]:
class MLP(nn.Module):
    def __init__(self, state_dim, action_dim,  nn_size=64):
        super(MLP, self).__init__()
        self.network = nn.Sequential(
            *linear_bn_relu_block(state_dim + action_dim, nn_size, normalize=True),
            *linear_bn_relu_block(nn_size, nn_size, normalize=True),
            nn.Linear(nn_size, state_dim)
        )
        self.apply(weight_init)

    # the states and actions should be normalized tensors
    def forward(self, states, actions):
        state_action_input = torch.cat((states, actions), dim=-1)
        return self.network(state_action_input)

In [None]:
network = MLP(state_dim=3,action_dim=1)
print(network)

Third part: Model

In [None]:
class Model(nn.Module):
    def __init__(self, state_shape, action_shape, nn_size):
        super().__init__()
        self.model = MLP(state_dim = state_shape[0],
                        action_dim = action_shape[0],
                        nn_size = nn_size)


        self.state_mean = None
        self.state_std = None
        self.action_mean = None
        self.action_std = None
        self.delta_state_mean = None
        self.delta_state_std = None

    def set_statistics(self, dataset):
        # dataset is on cpu and numpy
        self.state_mean  = torch.from_numpy(dataset.state_mean).cuda().unsqueeze(dim=0)
        self.state_std   = torch.from_numpy(dataset.state_std).cuda().unsqueeze(dim=0)
        self.action_mean = torch.from_numpy(dataset.action_mean).cuda().unsqueeze(dim=0)
        self.action_std  = torch.from_numpy(dataset.action_std).cuda().unsqueeze(dim=0)
        self.delta_state_mean = torch.from_numpy(dataset.delta_state_mean).cuda().unsqueeze(dim=0)
        self.delta_state_std  = torch.from_numpy(dataset.delta_state_std).cuda().unsqueeze(dim=0)
    
    # Function required to train the model
    def predicted_delta_state_normalized(self, states, actions):
        # normalize the state and the action
        states_normalized = normalize(states, self.state_mean, self.state_std)
        actions_normalized = normalize(actions, self.action_mean, self.action_std)
        # predict the normalized delta
        predicted_delta_state_normalized = self.model(states_normalized, actions_normalized)
        
        return predicted_delta_state_normalized

    # Function to evaluate the model
    def predict_next_states(self, states, actions):
        # predict the normalized delta
        predicted_delta_state_normalized = self.predicted_delta_state_normalized(states, actions)
        # unnormalize the normalized delta
        predicted_delta_state = unnormalize(predicted_delta_state_normalized, self.delta_state_mean,
                                            self.delta_state_std)
        # next state: state + delta
        return states + predicted_delta_state

In [None]:
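def get_reward_tensor(self, actions, states):
    # `self` is unused; it is kept so the function can be bound to an environment
    # object, which is how the MPC code calls it (`self.env.get_reward_tensor`).
    # states:  tensor of shape (batch, 3) holding [cos(th), sin(th), th_dot]
    # actions: tensor of shape (batch, 1)
    cos_th = states[:, 0]
    sin_th = states[:, 1]
    th_dot = states[:, 2]
    # recover the angle with torch ops (NumPy functions do not work on CUDA tensors)
    th = torch.atan2(sin_th, cos_th)
    th_normalize = (((th + np.pi) % (2 * np.pi)) - np.pi)
    # clip the torque and select the action column, keeping the batch dimension
    action = torch.clamp(actions, -2.0, 2.0)[:, 0]
    reward = - (th_normalize ** 2 + .1 * th_dot ** 2 + .001 * (action ** 2))
    return reward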
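To make the reward formula concrete, here is a scalar, plain-Python version of the same expression. The function name `pendulum_reward` and the default `max_torque` are assumptions for illustration; the weights 0.1 and 0.001 are the ones used in the cell above.

```python
import math

def pendulum_reward(cos_th, sin_th, th_dot, action, max_torque=2.0):
    """Negative cost: squared angle from upright, plus velocity and torque penalties."""
    th = math.atan2(sin_th, cos_th)
    # wrap the angle to [-pi, pi)
    th = ((th + math.pi) % (2 * math.pi)) - math.pi
    action = max(-max_torque, min(max_torque, action))
    return -(th ** 2 + 0.1 * th_dot ** 2 + 0.001 * action ** 2)

print(pendulum_reward(1.0, 0.0, 0.0, 0.0))    # upright, at rest
print(pendulum_reward(-1.0, 0.0, 0.0, 0.0))   # hanging down, at rest
print(pendulum_reward(1.0, 0.0, 2.0, 1.0))    # upright, but moving and applying torque
```

The reward is never positive and is largest when the pendulum is upright, at rest, and no torque is applied, so maximizing it drives the planner toward the upright position.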

Last part: Our Algorithm 

In [None]:
class MBRL(object):
    def __init__(self, model, planner, device, action_shape, args):
        self.model                     = model
        self.planner                   = planner
        self.device                    = device
        self.action_dim                = action_shape[0]
        self.update_freq               = args.update_freq
        self.log_interval              = args.log_interval
        
        # optimizers
        self.model_optimizer = torch.optim.Adam(
            self.model.parameters(), lr=args.model_lr, betas=(args.model_beta, 0.999))

        self.train()

    def train(self, training=True):
        self.training = training
        self.model.train(training)

    def get_action(self,env, model,state, step):
        # hand the current state and model to the planner
        self.planner.init_state = np.squeeze(state.numpy(), axis=0)
        self.planner.model = self.model
        # obtain_solution returns the action sequences sorted by cost (best first)
        action_sequences = self.planner.obtain_solution()[0]
        actions = action_sequences.cpu().detach().numpy()
        # MPC: execute only the first action of the best sequence
        action = actions[0][0:self.action_dim]
        return action

    def update_model(self,state, action, next_state, L,step):
        delta_state_normalized  = self.model.predicted_delta_state_normalized(state, action)
        target_delta             = next_state - state
        target_delta_normalized  = normalize(target_delta, self.model.delta_state_mean, self.model.delta_state_std)
        model_loss               = F.mse_loss(delta_state_normalized,target_delta_normalized)

        if step % self.log_interval == 0:
            L.log('train_model/loss', model_loss, step)

        # Optimize the dynamics model
        self.model_optimizer.zero_grad()
        model_loss.backward()
        self.model_optimizer.step()

    def update(self, iterator,L, step):
        #iterator = dataset.random_iterator(args.batch_size)
        state, action, next_state, reward, done = next(iterator)
        # Convert numpy arrays to tensors
        state      =  torch.from_numpy(state).cuda().float()
        action     =  torch.from_numpy(action).cuda().float()
        next_state =  torch.from_numpy(next_state).cuda().float() 
        #if step % self.log_interval == 0:
        #    L.log('train', step)

        self.update_model(state, action, next_state,L, step)

    def save_model(self, dir, step):
        torch.save(self.model.state_dict(), os.path.join(dir, f'{step}.pt'))

### Arguments 

In [None]:
from types import SimpleNamespace
args = SimpleNamespace()  # simple container for the hyperparameters below
# Training and optimization arguments
args.update_freq = 1
args.model_lr    = 0.01
args.model_beta  = 0.99
args.batch_size  = 128
args.num_train_steps = 10
# model argument
args.nn_size     = 64

# prints logs
args.log_interval = 1
args.work_dir     = '.'
args.save_tb      = './tb'
device = 'cuda'

In [None]:
L = Logger(args.work_dir, use_tb=args.save_tb, config='model')

### Put it all together

In [None]:
## Model
state_shape    =  env.observation_space.shape
action_shape  =  env.action_space.shape
model         =  Model(state_shape, action_shape, nn_size = args.nn_size)
model.cuda()

In [None]:
## Planner
planner = None

In [None]:
## Model based RL + MPC
mbrl = MBRL(model, planner, device, action_shape, args)

In [None]:
## Update the statistics of the model (required for normalization)
mbrl.model.set_statistics(dataset)

In [None]:
iterator = dataset.random_iterator(args.batch_size)
#mbrl.update(iterator, step = 1)

Training Loop

In [None]:
for step in range(args.num_train_steps+1):
    iterator = dataset.random_iterator(args.batch_size)
    # evaluate agent periodically

    #if step > 0 and step % args.eval_freq == 0:
    #    print("evaluation")
    #    print('eval/episode', episode, step)
    #    with torch.no_grad():
    #        #evaluate(eval_env, agent, video, args.num_eval_episodes, L, step)
    #        evaluate(eval_env, agent, video, 3, L, step)
    #    if args.save_model:
    #        agent.save_model(model_dir, step)
    for _ in range(3):
        mbrl.update(iterator, L,step)

## MPC

In [None]:
class MPC():
    """ random-shooting MPC: sample action sequences, roll them out with the model, rank by cost """
    def __init__(self, state_dim, action_dim, popsize=1024):
        # updated regularly from outside
        self.model = None
        self.env = None          # must provide get_reward_tensor(actions, states)
        self.init_state = None

        # only initialized once
        # NOTE: args.horizon, args.discount, args.lb and args.ub must be set beforehand
        self.args       = args
        self.action_dim = action_dim
        self.state_dim  = state_dim
        self.horizon    = args.horizon
        self.discount   = args.discount
        self.lb         = args.lb
        self.ub         = args.ub
        self.popsize    = popsize   # number of sampled action sequences
        self.epsilon = 1e-8
        self.init_mean = torch.zeros(self.action_dim).cuda()
        self.init_std = torch.ones(self.action_dim).cuda()

    def get_actions_uniform(self, n):
        # sample n actions uniformly from [lb, ub]
        m = torch.distributions.uniform.Uniform(
            torch.full((self.action_dim,), self.lb, dtype=torch.float32),
            torch.full((self.action_dim,), self.ub, dtype=torch.float32)
            )
        actions = m.sample((n,)).cuda()
        # always include the zero action as the first candidate
        actions[:1, :self.action_dim] = 0
        return actions

    def cost_fn(self):
        # initialize tensors
        self.action_samples = torch.zeros((self.popsize, self.horizon * self.action_dim)).cuda()
        self.next_states_0 = torch.zeros((self.popsize, self.state_dim)).cuda()
        init_states = torch.FloatTensor(np.repeat([self.init_state], self.popsize, axis=0)).cuda()  # [popsize, state_dim]
        all_costs = torch.zeros((self.horizon, self.popsize)).cuda()

        # define number of batches and action samples per batch
        n_batch = max(1, int(self.popsize / 1024))
        per_batch = self.popsize / n_batch

        # predict rewards for each batch of action sequences over the horizon
        for i in range(n_batch):
            start_index = int(i * per_batch)
            end_index = self.popsize if i == n_batch - 1 else int(i * per_batch + per_batch)

            start_states = init_states[start_index:end_index]

            for h in range(self.horizon):
                # sample an action batch and predict the next states with the learned model
                actions = self.get_actions_uniform(end_index - start_index)
                next_states = self.model.predict_next_states(start_states, actions)

                # predict the reward with the reward function of the environment
                predicted_reward = self.env.get_reward_tensor(actions, start_states)

                # accumulate discounted negative rewards (costs) for ranking the action sequences
                all_costs[h, start_index:end_index] += torch.neg(predicted_reward * self.discount ** h)

                # store the actions sampled for step h
                self.action_samples[start_index:end_index, h * self.action_dim:(h + 1) * self.action_dim] = actions

                if h == 0:
                    self.next_states_0[start_index:end_index] += next_states

                start_states = next_states

        return all_costs

    @torch.no_grad()
    def obtain_solution(self):
        all_costs = self.cost_fn()
        costs = torch.sum(all_costs, dim=0)
        indices = torch.argsort(costs)
        # sorted action sequences, predicted first-step reward and next state of the best sequence
        return self.action_samples[indices], -1 * all_costs[0, indices[0]], self.next_states_0[indices[0]]
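The planning idea used by the `MPC` class (sample many action sequences, roll each one out with the model, keep the cheapest, execute only its first action) can be sketched on a one-dimensional toy problem in plain Python. Everything here is a stand-in chosen for illustration: `toy_model` replaces the learned dynamics, `toy_reward` replaces the pendulum reward, and the parameter values are arbitrary.

```python
import random

def toy_model(x, a):
    """Hypothetical stand-in for the learned model: next state of a 1-D point."""
    return x + 0.1 * a

def toy_reward(x, a):
    """Hypothetical reward: stay near the origin with little effort."""
    return -(x ** 2 + 0.001 * a ** 2)

def rollout_cost(x0, actions, discount=1.0):
    """Discounted cost (negative reward) of one action sequence under the model."""
    x, cost = x0, 0.0
    for h, a in enumerate(actions):
        cost += -toy_reward(x, a) * discount ** h
        x = toy_model(x, a)
    return cost

def random_shooting(x0, horizon=5, popsize=256, lb=-1.0, ub=1.0):
    """Return (first action, cost) of the cheapest randomly sampled sequence."""
    candidates = [[random.uniform(lb, ub) for _ in range(horizon)]
                  for _ in range(popsize)]
    candidates[0] = [0.0] * horizon  # always include the "do nothing" sequence
    costs = [rollout_cost(x0, seq) for seq in candidates]
    best = min(range(popsize), key=costs.__getitem__)
    return candidates[best][0], costs[best]

# Receding-horizon loop: replan at every step, execute only the first action.
random.seed(0)
x = 1.0
for t in range(3):
    action, cost = random_shooting(x)
    x = toy_model(x, action)  # in the notebook this would be env.step(action)
    print(t, round(action, 3), round(x, 3))
```

Replanning at every step is what makes this model predictive control: errors of the model over the horizon matter less because only the first action of each plan is ever executed.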


### View in TensorBoard
Open an embedded  TensorBoard viewer inside a notebook:

In [None]:
%load_ext tensorboard 

In [None]:
#docs_infra: no_execute
%tensorboard --logdir {'./tb'}