# Training a forward model for `Reacher-v4`

This is a supplementary notebook showing how a forward model can be trained for the MuJoCo environment `Reacher-v4`.
At the end, this notebook generates and saves a pickle file which stores the newly trained forward model.
The generated pickle file can be used with the model predictive control example.

## Requirements

Although not a dependency of EvoTorch, this notebook uses [skorch](https://github.com/skorch-dev/skorch) for the required supervised learning operations. `skorch` can be installed via:

```bash
pip install skorch
```

## Initial imports

We begin with the imports:

In [None]:
import torch
import numpy as np
import gymnasium as gym
from typing import Iterable
import multiprocessing as mp
import math
from torch import nn
from skorch import NeuralNetRegressor

## Declarations

We declare the environment name below.

In [None]:
ENV_NAME = "Reacher-v4"

By default, we use all the available CPUs of the local computer.

In [None]:
NUM_PROCESSES = mp.cpu_count()

We are going to collect data from this many episodes:

In [None]:
NUM_EPISODES = 20000

## Utilities for training

Here, we define helper functions and utilities for the training of our model.

We begin by defining the function $\text{reacher\_state}(\cdot)$ which, given an observation from the reinforcement learning environment `Reacher-v4`, extracts and returns the state vector of the simulated robotic arm.

In [None]:
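def reacher_state(observation: Iterable) -> Iterable:
    observation = np.asarray(observation, dtype="float32")
    # Keep the joint angle cosines/sines (indices 0-3), the joint angular
    # velocities (6-7), and the fingertip-minus-target offset (8-9).
    state = np.concatenate([observation[:4], observation[6:10]])
    # Add the target position (4-5) to the offset to obtain the fingertip position.
    state[-2] += observation[4]
    state[-1] += observation[5]
    return state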
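
As a quick sanity check of the indexing above, the sketch below applies the same slicing to a made-up 11-element observation. It assumes the `Reacher-v4` observation layout documented by Gymnasium (joint cosines and sines, target position, joint angular velocities, then the fingertip-minus-target vector). The function is repeated only so that the block runs on its own, and the numeric values are arbitrary.

```python
import numpy as np


def reacher_state(observation):
    # Same logic as the notebook's function, repeated so this sketch is standalone.
    observation = np.asarray(observation, dtype="float32")
    state = np.concatenate([observation[:4], observation[6:10]])
    state[-2] += observation[4]
    state[-1] += observation[5]
    return state


# Made-up observation (assumed layout, arbitrary values):
# [cos1, cos2, sin1, sin2, target_x, target_y, vel1, vel2, dx, dy, dz]
fake_observation = [1.0, 0.5, 0.0, 0.25, 0.125, -0.125, 2.0, -2.0, 0.25, 0.5, 0.0]

state = reacher_state(fake_observation)
print(state.shape)  # (8,)
print(state[-2:])   # target position plus offset, i.e. the fingertip position
```

With these values, the last two entries of the state are `0.125 + 0.25` and `-0.125 + 0.5`, so both equal `0.375`.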

We now define a function $\text{collect\_data}(\cdot)$ which collects data from multiple episodes. The number of episodes is specified via the argument `num_episodes`.
Within each episode, the data we collect is:

- current state
- action (uniformly sampled)
- next state (i.e. the state obtained after applying the action)

The forward model that we wish to train should be able to answer this question: _given the current state and the action, what is the prediction for the next state?_ Therefore, among the data we collect, the current states and the actions form the inputs. The targets are derived from the next states: the code stores each target as the difference between the next state and the current state, so the model learns to predict the change in state.
The function $\text{collect\_data}(\cdot)$ organizes its data into inputs and targets, and finally returns them as two arrays with one row per time step.

In [None]:
def collect_data(num_episodes: int):
    inputs = []
    targets = []

    env = gym.make(ENV_NAME)
    for _ in range(num_episodes):
        observation, _ = env.reset()

        while True:
            action = np.clip(np.asarray(env.action_space.sample(), dtype="float32"), -1.0, 1.0)
            state = reacher_state(observation)
            
            observation, reward, terminated, truncated, info = env.step(action)
            done = terminated or truncated

            next_state = reacher_state(observation)

            current_input = np.concatenate([state, action])
            current_target = next_state - state
            
            inputs.append(current_input)
            targets.append(current_target)
            
            if done:
                break
    
    return np.vstack(inputs), np.vstack(targets)

The function below uses multiple CPUs of the local computer to collect data in parallel.

In [None]:
def collect_data_in_parallel(num_episodes: int):
    n = math.ceil(num_episodes / NUM_PROCESSES)
    
    with mp.Pool(NUM_PROCESSES) as p:
        collected_data = p.map(collect_data, [n for _ in range(NUM_PROCESSES)])
    
    all_inputs = []
    all_targets = []
    
    for inp, target in collected_data:
        all_inputs.append(inp)
        all_targets.append(target)
    
    all_inputs = np.vstack(all_inputs)
    all_targets = np.vstack(all_targets)
    
    return all_inputs, all_targets
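
Because the per-process episode count is rounded up, the total number of collected episodes can slightly exceed the requested number. The sketch below reproduces only that arithmetic; the helper name `episodes_per_worker` and the CPU count of 6 are assumptions made for illustration.

```python
import math


def episodes_per_worker(num_episodes: int, num_processes: int) -> int:
    # Hypothetical helper mirroring the `n = math.ceil(...)` line above.
    return math.ceil(num_episodes / num_processes)


num_episodes = 20000
num_processes = 6  # assumed CPU count, for illustration only

per_worker = episodes_per_worker(num_episodes, num_processes)
total_collected = per_worker * num_processes

print(per_worker)       # 3334
print(total_collected)  # 20004
```

The small surplus is harmless for training, since it only adds a few extra transitions to the dataset.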

To make the supervised learning procedure more efficient, we also introduce a normalizer.
This normalizing function receives a batch (i.e. a collection) of vectors, which can be either the input data or the target data, and returns:

- the normalized counterpart of the entire batch
- the per-feature mean of the data
- the per-feature standard deviation of the data (clipped from below at `1e-5` to avoid division by zero)

In [None]:
def normalize(x: np.ndarray) -> tuple:
    mean = np.mean(x, axis=0).astype("float32")
    stdev = np.clip(np.std(x, axis=0).astype("float32"), 1e-5, np.inf)
    normalized = np.asarray((x - mean) / stdev, dtype="float32")
    return normalized, mean, stdev
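
To illustrate the behavior of the normalizer, including the clipping of the standard deviation, here is a small sketch on a made-up batch whose second column is constant. The function is repeated so that the block is standalone; the batch values are arbitrary.

```python
import numpy as np


def normalize(x):
    # Same logic as the notebook's function, repeated so this sketch is standalone.
    mean = np.mean(x, axis=0).astype("float32")
    stdev = np.clip(np.std(x, axis=0).astype("float32"), 1e-5, np.inf)
    normalized = np.asarray((x - mean) / stdev, dtype="float32")
    return normalized, mean, stdev


# Two vectors with two features each; the second feature never changes.
batch = np.array([[1.0, 5.0], [3.0, 5.0]], dtype="float32")

normalized, mean, stdev = normalize(batch)

# The statistics are enough to undo the normalization later.
restored = normalized * stdev + mean

print(normalized)  # first column becomes -1 and 1, second column becomes 0
print(stdev)       # second entry is the clipping floor instead of 0
```

The first column has mean 2 and standard deviation 1. The constant column has a standard deviation of 0, which is raised to `1e-5` so that the division stays finite.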

We are now ready to collect our data and store them.

The following class (not to be instantiated) serves as a namespace where all our collected data and their stats (i.e. means and standard deviations) are stored.
The rest of this notebook will refer to this namespace when training, saving, and testing the model.

In [None]:
class data:
    inputs = None
    targets = None

    input_mean = None
    input_stdev = None

    target_mean = None
    target_stdev = None

Below, we collect the data and their stats, and store them in the `data` namespace.

In [None]:
data.inputs, data.targets = collect_data_in_parallel(NUM_EPISODES)

data.inputs, data.input_mean, data.input_stdev = normalize(data.inputs)
data.targets, data.target_mean, data.target_stdev = normalize(data.targets)

In [None]:
data.inputs.shape, data.targets.shape

In [None]:
data.input_mean.shape, data.input_stdev.shape

In [None]:
data.target_mean.shape, data.target_stdev.shape

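We declare the following architecture for our neural network. It takes 10 inputs (the 8-element state concatenated with the 2-element action) and produces 8 outputs (the normalized predicted change of the state):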

In [None]:
model = nn.Sequential(
    nn.Linear(10, 64),
    nn.Tanh(),
    nn.LayerNorm(64),
    nn.Linear(64, 64),
    nn.Tanh(),
    nn.LayerNorm(64),
    nn.Linear(64, 8),
)

model

Declare a regression problem and set the values of the hyperparameters to be used for the training procedure:

In [None]:
regressor = NeuralNetRegressor(
    model,
    max_epochs=50,
    lr=0.0001,
    optimizer=torch.optim.Adam,
    iterator_train__shuffle=True,
    batch_size=500,
)

Train the model:

In [None]:
regressor.fit(data.inputs, data.targets)

At this point, we should have a trained model.

To test this trained model, we define the convenience function below which receives the current state and an action, and with the help of the trained model, returns the prediction for the next state.

In [None]:
@torch.no_grad()
def use_net(state: Iterable, action: Iterable) -> Iterable:
    input_mean = torch.as_tensor(data.input_mean, dtype=torch.float32)
    input_stdev = torch.as_tensor(data.input_stdev, dtype=torch.float32)
    target_mean = torch.as_tensor(data.target_mean, dtype=torch.float32)
    target_stdev = torch.as_tensor(data.target_stdev, dtype=torch.float32)
    
    state = torch.as_tensor(state, dtype=torch.float32)
    action = torch.clamp(torch.as_tensor(action, dtype=torch.float32), -1.0, 1.0)
    
    x = torch.cat([state, action])    
    x = (x - input_mean) / input_stdev
    y = model(x)
    y = (y * target_stdev) + target_mean
    result = (y + state).numpy()

    return result
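
The last steps of `use_net` undo the target normalization and add the predicted change to the current state. The NumPy sketch below isolates that arithmetic. It replaces the network with a stand-in value (the exact normalized difference that a perfect model would output), and all statistics and states are made up.

```python
import numpy as np

# Made-up statistics standing in for data.target_mean and data.target_stdev.
target_mean = np.array([0.5, -0.5], dtype="float32")
target_stdev = np.array([2.0, 4.0], dtype="float32")

state = np.array([1.0, 2.0], dtype="float32")
true_next_state = np.array([2.0, 0.0], dtype="float32")

# Stand-in for the network output: the normalized state difference.
normalized_delta = ((true_next_state - state) - target_mean) / target_stdev

# The two recovery steps performed at the end of `use_net`.
delta = normalized_delta * target_stdev + target_mean
predicted_next_state = state + delta

print(predicted_next_state)  # [2. 0.]
```

A trained network only approximates the normalized difference, so its recovered next state will deviate from the true one by the (rescaled) prediction error.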

To compare the predictions of our model against the actual states, we instantiate a `Reacher-v4` environment.

In [None]:
env = gym.make(ENV_NAME)
env

In the code below, we have a loop which feeds both the actual `Reacher-v4` environment and our trained predictor the same actions.
During the execution of this loop, the x and y coordinates of the robotic arm's tip, as reported by the actual environment and as predicted by the trained model, are collected.
At the end, the collected x and y coordinates are plotted.

In [None]:
observation, _ = env.reset()
observation = np.asarray(observation, dtype="float32")

actual_state = reacher_state(observation)
pred_state = actual_state.copy()

class actual:
    x = []
    y = []

actual.x.append(actual_state[-2])
actual.y.append(actual_state[-1])    

class predicted:
    x = []
    y = []

predicted.x.append(pred_state[-2])
predicted.y.append(pred_state[-1])    

while True:
    action = np.asarray(env.action_space.sample(), dtype="float32")
    
    observation, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated

    actual_state = reacher_state(observation)
    
    pred_state = use_net(pred_state, action)

    actual.x.append(actual_state[-2])
    actual.y.append(actual_state[-1])    

    predicted.x.append(pred_state[-2])
    predicted.y.append(pred_state[-1])    

    if done:
        break

In [None]:
import matplotlib.pyplot as plt

In [None]:
plt.plot(actual.x, label="actual x")
plt.plot(predicted.x, label="predicted x")
plt.legend()

In [None]:
plt.plot(actual.y, label="actual y")
plt.plot(predicted.y, label="predicted y")
plt.legend()
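
The plots give a qualitative comparison. As one possible way to put a number on the mismatch, the sketch below computes a root-mean-square error between two trajectories. The helper `trajectory_rmse` is hypothetical (it is not part of the notebook or of EvoTorch), and the lists are made-up stand-ins for `actual.x` and `predicted.x`.

```python
import numpy as np


def trajectory_rmse(actual_values, predicted_values) -> float:
    # Hypothetical helper: root-mean-square error between two equally long sequences.
    actual_values = np.asarray(actual_values, dtype="float64")
    predicted_values = np.asarray(predicted_values, dtype="float64")
    return float(np.sqrt(np.mean((actual_values - predicted_values) ** 2)))


# Made-up values standing in for actual.x and predicted.x.
actual_x = [0.0, 1.0, 2.0, 3.0]
predicted_x = [0.0, 1.0, 2.0, 5.0]

print(trajectory_rmse(actual_x, predicted_x))  # 1.0
```

In the notebook, the same helper could be called as `trajectory_rmse(actual.x, predicted.x)`. Since the loop feeds each prediction back into the model, errors may accumulate over the episode, so a per-step error curve could be more informative than a single number.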

Below, we save our trained model.
The saved file can be used by the `Reacher-v4` MPC example notebook, provided that it is copied into the same directory as that notebook.

In [None]:
import pickle

with open("reacher_model.pickle", "wb") as f:
    pickle.dump(
        {
            "model": model,
            "input_mean": data.input_mean,
            "input_stdev": data.input_stdev,
            "target_mean": data.target_mean,
            "target_stdev": data.target_stdev,
        },
        f
    )
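
For completeness, here is a minimal sketch of how such a file might be read back. The helper `load_forward_model` is hypothetical, and how the MPC example actually loads the file is not shown here. To keep the block runnable without a trained network, it writes and reloads a stand-in dictionary with the same keys but placeholder values.

```python
import os
import pickle
import tempfile

EXPECTED_KEYS = ("model", "input_mean", "input_stdev", "target_mean", "target_stdev")


def load_forward_model(path: str) -> dict:
    # Hypothetical helper. Only unpickle files that come from a trusted source.
    with open(path, "rb") as f:
        bundle = pickle.load(f)
    missing = set(EXPECTED_KEYS) - set(bundle)
    if missing:
        raise KeyError(f"missing entries: {sorted(missing)}")
    return bundle


# Stand-in bundle with the same keys as the notebook's dictionary (placeholder values).
stand_in = {key: None for key in EXPECTED_KEYS}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "stand_in_model.pickle")
    with open(path, "wb") as f:
        pickle.dump(stand_in, f)
    loaded = load_forward_model(path)

print(sorted(loaded))
```

When loading the real `reacher_model.pickle`, PyTorch must be importable in the loading process, because the stored `model` object is a `torch.nn.Sequential` instance.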