## Results of Adaptive Robust Model Predictive Shielding
This Jupyter notebook illustrates the workflow used to produce the results presented in the paper "Safe Reinforcement Learning via Adaptive Robust Model Predictive Shielding". The notebook follows the structure of the paper and is divided into the corresponding subsections.

In [None]:
import os
import pandas as pd
import pickle
import numpy as np

# Define directory to store results
save_dir = os.path.join(os.getcwd(), 'results')
os.makedirs(save_dir, exist_ok=True)

## Section 4.2. RL training setup and Section 4.3. Penalty term for reward shaping
This section investigates the influence of different reward shaping approaches on standard RL policies. For the evaluations presented, 20 RL models were trained for each of three penalty weights.
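
As a rough illustration of what a penalty term for reward shaping can look like, the sketch below subtracts a weighted constraint violation from a base reward. The function name `shaped_reward`, the linear form of the penalty, and the example numbers are assumptions made for illustration only; the reward actually used in the `CSTR_base` environment is defined in the repository and may differ.

```python
import numpy as np

def shaped_reward(base_reward, constraint_values, penalty_weight):
    """Hypothetical penalty-based reward shaping.

    constraint_values holds g_i(x) for each constraint, where
    g_i(x) <= 0 means the constraint is satisfied.
    """
    g = np.asarray(constraint_values, dtype=float)
    violation = np.sum(np.maximum(g, 0.0))  # only positive parts count
    return float(base_reward - penalty_weight * violation)

# No violation: the reward is unchanged.
print(shaped_reward(1.0, [-0.2, -0.1], penalty_weight=5e3))
# One constraint violated by 0.01: the reward is reduced by 5e3 * 0.01.
print(shaped_reward(1.0, [-0.2, 0.01], penalty_weight=5e3))
```

A larger `penalty_weight` makes violations more costly relative to the base reward, which is the trade-off varied across the three penalty weights.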

#### Training setup

In [None]:
from RL_policy.RL_training import RL_training 
# penalty term for reward shaping
lambda_weight = 5        # lambda_weight in [2.5, 5, 7.5]*1e3
max_sp = 0.2
save_dir_RL_base_models = os.path.join(save_dir, 'RL_models_base', f'penalty_{lambda_weight}')   # portable path instead of hard-coded backslashes

# define RL training setting
RL_settings_base = {
    'env_type': 'CSTR_base',
    'timesteps': int(1.5e6),
    'gamma' : 0.999,
    'verbose' : 1,
    'algorithm' : 'TD3',
    'action_noise_sigma' : 1e-2,
    'batch_size' : 512,
    'learning_rate' : 1e-5,
    'eval_freq' : 2000,
    'action_noise_type': 'OU',
    'tau': 0.005,
    'scheduling' : 'exponential',        # None or 'linear' or 'exponential'
    'net': dict(pi=[400, 300], qf=[400,300]),
    'size_of_replay_buffer' : int(1.5e6),
    'policy_delay' : 2,
    'save_dir': save_dir_RL_base_models,
    'penalty_weight': lambda_weight*1e3,
    'max_sp': max_sp,
    'seed': 48, 
    'add_info' : 'Additional information about training setting'
    }  

# train num_models RL agents with the defined settings (one seed per model) for the selected penalty weight
num_models = 20

# we recommend parallelizing the training on the CPU for faster results
for i in range(num_models):
    RL_dict = {**RL_settings_base,
                'seed': 48 + i}
    RL_training(RL_dict)
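
The recommendation to parallelize the training could be followed along the lines of the sketch below, which uses Python's standard `concurrent.futures` module. The helpers `build_seeded_settings` and `train_in_parallel` are hypothetical names introduced here, and the sketch assumes that `RL_training` can be pickled and run in a subprocess, which has not been verified for this repository. The demo uses a stand-in function and a thread pool so that it runs anywhere.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def build_seeded_settings(base_settings, num_models, first_seed=48):
    """Return one settings dict per model, differing only in the seed."""
    return [{**base_settings, 'seed': first_seed + i} for i in range(num_models)]

def train_in_parallel(train_fn, settings_list, max_workers=4,
                      executor_cls=ProcessPoolExecutor):
    """Call train_fn once per settings dict using a pool of workers."""
    with executor_cls(max_workers=max_workers) as pool:
        # pool.map preserves the order of settings_list in its results
        return list(pool.map(train_fn, settings_list))

def fake_training(settings):
    """Stand-in for RL_training, used only to demonstrate the pattern."""
    return settings['seed']

settings_list = build_seeded_settings({'algorithm': 'TD3'}, num_models=3)
print(train_in_parallel(fake_training, settings_list, max_workers=2,
                        executor_cls=ThreadPoolExecutor))
```

For the actual CPU-bound training, a process pool (the default `executor_cls`) would be the natural choice, for example `train_in_parallel(RL_training, build_seeded_settings(RL_settings_base, num_models))`.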
   

#### Evaluation - only RL

In [None]:
# only RL evaluation of trained policies without shielding
from policy_deployment.only_RL import eval_ensemble_agents

only_RL_base_results = eval_ensemble_agents(dir = save_dir_RL_base_models, num_models=num_models, num_eval = 100, sp = False)

## Section 5.1. Backup policy construction
In this section, an approximate multi-stage MPC is developed as a backup policy for model predictive shielding. The construction starts with trajectory-based sampling of the training data. Next, the training data is preprocessed, and finally several neural networks are trained in a small grid search.

#### Data sampling

In [None]:
from backup_policy.data_sampling import trajectory_sampling
from RL_policy.RL_environments import CSTR_sp

save_dir_backup_policy = os.path.join(save_dir, 'backup_policy')   # portable path instead of hard-coded backslashes

# environment for sampling of initial states
env = CSTR_sp()

# define sampling parameters
max_sp = 0.2
alpha_var = beta_var = np.array([1 - max_sp, 1 + max_sp])

# trajectory based sampling
trajectory_sampling(env = env, num_traj = 1000, len_traj = 100, save_dir = save_dir_backup_policy, alpha_var = alpha_var, beta_var = beta_var)


#### Preprocessing

In [None]:
from backup_policy.data_sampling import drop_infeasibles, drop_duplicates, scale_data
from RL_policy.RL_environments import CSTR_sp


# load raw data
x_data_raw = pd.read_csv(os.path.join(save_dir_backup_policy, 'RMPC_raw_x_data.csv'))
y_data_raw = pd.read_csv(os.path.join(save_dir_backup_policy, 'RMPC_raw_y_data.csv'))
feas = pd.read_csv(os.path.join(save_dir_backup_policy, 'feas_check_raw_data.csv'))

# drop infeasible data points
x_drop, y_drop = drop_infeasibles(x_data_raw, y_data_raw, feas)

# drop duplicate data points
x_red, y_red = drop_duplicates(x_drop, y_drop, n_dec=4)

# scale data for training according to scaling of RL environment
env = CSTR_sp()
x_scaled, y_scaled = scale_data(env, x_red, y_red)

# save processed data

with open(os.path.join(save_dir_backup_policy, 'RMPC_x_data_red_scaled.pkl'), 'wb') as f:
    pickle.dump(x_scaled, f)

with open(os.path.join(save_dir_backup_policy, 'RMPC_y_data_red_scaled.pkl'), 'wb') as f:
    pickle.dump(y_scaled, f)
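
The duplicate removal with `n_dec=4` suggests that data points are compared after rounding to four decimals. The sketch below shows one way such a step could be written with pandas. `drop_rounded_duplicates` and the column names are hypothetical, and the actual `drop_duplicates` function in `backup_policy.data_sampling` may work differently.

```python
import pandas as pd

def drop_rounded_duplicates(x_df, y_df, n_dec=4):
    """Keep the first of all rows whose inputs coincide after rounding.

    Hypothetical helper: x_df and y_df must have the same number of rows.
    """
    keep = (~x_df.round(n_dec).duplicated(keep='first')).to_numpy()
    return (x_df[keep].reset_index(drop=True),
            y_df[keep].reset_index(drop=True))

# Illustrative data: the first two inputs agree up to four decimals.
x = pd.DataFrame({'c_A': [0.50001, 0.50002, 0.7], 'T': [130.0, 130.0, 128.0]})
y = pd.DataFrame({'u': [1.0, 1.1, 2.0]})

x_red, y_red = drop_rounded_duplicates(x, y)
print(len(x_red))
```

Removing near-identical states keeps the training set from over-representing regions that many trajectories pass through, such as the neighborhood of a steady state.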


#### Grid search


In [None]:
from backup_policy.neural_network_training import train_ARMPC, hyperparameter_list, eval_on_test_data
from sklearn.model_selection import train_test_split

save_dir_grid = os.path.join(save_dir_backup_policy, 'grid_search')   # portable path instead of hard-coded backslashes

# load data and split into training and testing set
with open(os.path.join(save_dir_backup_policy, 'RMPC_x_data_red_scaled.pkl'), 'rb') as f:
    x_data = pickle.load(f)

with open(os.path.join(save_dir_backup_policy, 'RMPC_y_data_red_scaled.pkl'), 'rb') as f:
    y_data = pickle.load(f)

X_train, X_test, Y_train, Y_test = train_test_split(x_data, y_data, test_size=0.15, random_state=48)

# the with-statement closes each file automatically, so no explicit f.close() is needed
with open(os.path.join(save_dir, 'X_train.pkl'), 'wb') as f:
    pickle.dump(X_train, f)

with open(os.path.join(save_dir, 'X_test.pkl'), 'wb') as f:
    pickle.dump(X_test, f)

with open(os.path.join(save_dir, 'Y_train.pkl'), 'wb') as f:
    pickle.dump(Y_train, f)

with open(os.path.join(save_dir, 'Y_test.pkl'), 'wb') as f:
    pickle.dump(Y_test, f)

# hyperparameters for grid search
hp_list = hyperparameter_list(X_train, Y_train, save_dir_grid)

for hp in hp_list:
    loss, val_loss = train_ARMPC(X_train, Y_train, hp)
    test_loss = eval_on_test_data(X_test, Y_test, os.path.join(save_dir_grid, f'grid{hp["number"]}'))
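
The contents of `hyperparameter_list` are not shown in this notebook. As a sketch of how a small grid with a running `number` per configuration could be assembled, the example below uses `itertools.product`. The hyperparameter names and values are assumptions chosen for illustration and are not the grid used for the reported results.

```python
from itertools import product

def build_grid(n_layers_options, n_neurons_options, lr_options):
    """Enumerate all combinations and tag each with a running number."""
    grid = []
    combos = product(n_layers_options, n_neurons_options, lr_options)
    for number, (n_layers, n_neurons, lr) in enumerate(combos):
        grid.append({'number': number,
                     'n_layers': n_layers,
                     'n_neurons': n_neurons,
                     'learning_rate': lr})
    return grid

# Hypothetical search space: 2 x 2 x 2 = 8 configurations
grid = build_grid([2, 3], [50, 100], [1e-3, 1e-4])
print(len(grid), grid[0])
```

The running number gives every configuration a stable identifier that can be reused in directory names when models are stored and evaluated.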


#### Probabilistic validation of backup policy

In [None]:
from backup_policy.auxiliary_functions_NN import compute_N_prob
from policy_deployment.only_RL import closed_loop_evaluation

# backup policy for probabilistic validation
backup_dir = os.path.join(save_dir_grid, 'grid23')

# Define parameters for probabilistic validation with sample discarding
epsilon = 0.005  # admissible probability of violation
r = 1            # number of discarded samples
delta = 1e-6     # confidence parameter (confidence level is 1 - delta)

N_prob = compute_N_prob(epsilon, r, delta)      # result is rounded up inside the function

# evaluate N_prob trajectories for probabilistic validation
backup_validation = closed_loop_evaluation(dir = backup_dir, num_eval = N_prob, sp = False, RL_agent=False, feas_IC=True) 
if np.count_nonzero(backup_validation['CV_all'].to_numpy()) <= r:
    print('Successful probabilistic validation!')
else:
    print('Probabilistic validation failed.')
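
The implementation of `compute_N_prob` is not shown here. One standard way to obtain such a sample size, assumed in this sketch and not confirmed for the repository, uses the binomial tail: if the true violation probability exceeded `epsilon`, the chance of observing at most `r` violations in `N` independent trajectories would be at most the binomial CDF evaluated at `r`. The sketch searches for the smallest `N` for which this chance drops to `delta` or below. The function names are hypothetical, and the repository may instead use a closed-form bound that yields a larger `N`.

```python
import math

def binomial_cdf(r, n, eps):
    """P(at most r violations in n i.i.d. trials with violation probability eps)."""
    return sum(math.comb(n, i) * eps**i * (1.0 - eps)**(n - i)
               for i in range(r + 1))

def min_samples(eps, r, delta):
    """Smallest n with binomial_cdf(r, n, eps) <= delta (linear search)."""
    n = r + 1
    while binomial_cdf(r, n, eps) > delta:
        n += 1
    return n

# Values from the cell above
print(min_samples(eps=0.005, r=1, delta=1e-6))
```

The linear search is sufficient here because, for fixed `r` and `eps`, the binomial CDF decreases monotonically as `n` grows.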

## Section 5.2. Robust model predictive shielding
In this section, the constructed approximate multi-stage MPC is applied as a backup policy for model predictive shielding.

#### Evaluation - Robust MPS

In [None]:
from policy_deployment.MPS import shielded_deployment
from RL_policy.RL_environments import CSTR_sp

max_sp = 0.2

# tuning parameters
n_horizon = 20
num_MC = None  # number of Monte Carlo simulations for robust rollouts, if set to None, min-max approach is applied and four rollouts are performed

backup_dir = os.path.join(save_dir_grid, 'grid23')
agent_dir = os.path.join(save_dir_RL_base_models, 'model_0')

env = CSTR_sp(seed = 123)
sampling = 'uniform'

# exemplary deployment with robust shielding for one RL policy
shielded_deployment_instance = shielded_deployment(agent_dir=agent_dir, backup_dir=backup_dir, env=env, max_sp=max_sp, adaptive=False, n_horizon=n_horizon)
shielded_deployment_instance.shielded_closed_loop(num_eval=1, num_MC = num_MC, sampling=sampling)

# results are stored in a dataframe in attribute results_df
results = shielded_deployment_instance.results_df
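
To make the idea behind the min-max variant with four rollouts more concrete, the sketch below shows a generic shielding decision on a toy scalar system. The RL action is accepted only if applying it once and then following the backup policy stays safe for every corner of the two-parameter uncertainty box (2 x 2 = 4 rollouts). The toy dynamics, the safety check, the backup law, and all function names are assumptions for illustration. The `shielded_deployment` class operates on the CSTR model with the trained neural network backup policy and may differ in its details.

```python
from itertools import product

def is_recoverable(x0, u_first, backup_policy, step, is_safe,
                   n_horizon, alpha_var, beta_var):
    """Apply u_first once, then the backup policy, for each uncertainty corner."""
    for alpha, beta in product(alpha_var, beta_var):  # four corner rollouts
        x = step(x0, u_first, alpha, beta)
        if not is_safe(x):
            return False
        for _ in range(n_horizon - 1):
            x = step(x, backup_policy(x), alpha, beta)
            if not is_safe(x):
                return False
    return True

def shielded_action(x, rl_action, backup_policy, **kwargs):
    """Return (action, intervened): the RL action if recoverable, else the backup action."""
    if is_recoverable(x, rl_action, backup_policy, **kwargs):
        return rl_action, False
    return backup_policy(x), True

# Toy scalar system (purely illustrative, not the CSTR model)
step = lambda x, u, alpha, beta: alpha * 0.9 * x + beta * 0.5 * u
is_safe = lambda x: abs(x) <= 1.0
backup = lambda x: -x  # drives the state towards the origin
kwargs = dict(step=step, is_safe=is_safe, n_horizon=20,
              alpha_var=(0.8, 1.2), beta_var=(0.8, 1.2))

print(shielded_action(0.5, 0.2, backup, **kwargs))  # moderate RL action
print(shielded_action(0.5, 3.0, backup, **kwargs))  # aggressive RL action
```

In this toy setting the moderate action passes all four rollouts, while the aggressive one violates the bound after the first step and is replaced by the backup action.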

## Section 5.3. RL training for adaptive shielding
In this section RL policies are trained for the augmented state space with the safety parameter $\sigma$. The training hyperparameters are set identically to those used in the training process of RL models without safety parameters. Again, 20 RL policies are trained for each reward. Afterwards, the monotonicity of the safety parameter is investigated to determine the policies considered for further evaluations.

#### Training setup

In [None]:
from RL_policy.RL_training import RL_training   

lambda_weight = 5        # lambda_weight in [2.5, 5, 7.5]*1e3

save_dir_RL_sp_models = os.path.join(save_dir, 'RL_models_sp', f'penalty_{lambda_weight}')   # portable path instead of hard-coded backslashes

RL_settings_sp = RL_settings_base.copy()
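RL_settings_sp.update({'fix_p_unc':True,
                       'env_type': 'CSTR_sp',
                       'lambda_weight': lambda_weight*1e3,
                       'penalty_weight': lambda_weight*1e3,   # key used in RL_settings_base, kept in sync with lambda_weight
                       'save_dir': save_dir_RL_sp_models})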

# train num_models RL agents with the defined settings (one seed per model) for the selected penalty weight
num_models = 20

for i in range(num_models):
    RL_dict = {**RL_settings_sp,
                'seed': 48 + i}
    RL_training(RL_dict)

#### Monotonicity analysis

In [None]:
from policy_deployment.only_RL import monotonicity_check

# results are stored in agent_dir
monotonicity_check(agent_dir=save_dir_RL_sp_models, num_models=num_models, episode_length=100)

#### Evaluation - only RL

In [None]:
from policy_deployment.only_RL import eval_ensemble_agents

# only RL evaluation of trained policies without shielding, results are stored in only_RL_sp_results
only_RL_sp_results = eval_ensemble_agents(dir = save_dir_RL_sp_models, num_models=num_models, num_eval = 100, sp = True)

## Section 5.4. Adaptive robust model predictive shielding
In the following, the shielded deployment of RL policies augmented with the safety parameter is presented. We exemplarily show the evaluation of one model for the robust MPS, adaptive MPS, and adaptive robust MPS settings. Note that the same backup policy as derived in Section 5.1 and applied in Section 5.2 is used.

In [None]:
from RL_policy.RL_environments import CSTR_sp
from policy_deployment.MPS import shielded_deployment

# define backup policy and tuning parameters for all evaluations
backup_dir = os.path.join(save_dir_grid, 'grid23')

n_horizon = 20
num_MC = None  # number of Monte Carlo simulations for robust rollouts, if set to None, min-max approach is applied and four rollouts are performed

# tuning parameters for adaptation of safety parameter
step_sp = 0.025
max_sp = 0.2

# environment
env = CSTR_sp(seed = 123)
sampling = 'uniform'  

#### Evaluation - robust MPS


In [None]:
agent_dir = os.path.join(save_dir_RL_sp_models, 'model_0')

# exemplary deployment with robust shielding for one RL policy
shielded_deployment_instance = shielded_deployment(agent_dir=agent_dir, backup_dir=backup_dir, env=env, max_sp=max_sp, sp = True, adaptive=False, n_horizon=n_horizon)
shielded_deployment_instance.shielded_closed_loop(num_eval=100, num_MC = num_MC, sampling=sampling)

# results are stored in a dataframe in attribute results_df
results = shielded_deployment_instance.results_df

#### Evaluation - adaptive MPS

In [None]:
agent_dir = os.path.join(save_dir_RL_sp_models, 'model_0')

# exemplary deployment with adaptive shielding for one RL policy
shielded_deployment_instance = shielded_deployment(agent_dir=agent_dir, env=env, max_sp=max_sp, sp = True, adaptive=True, n_horizon=n_horizon)
shielded_deployment_instance.shielded_closed_loop(num_eval=100, num_MC = num_MC, sampling=sampling)

# results are stored in a dataframe in attribute results_df
results = shielded_deployment_instance.results_df

#### Evaluation - adaptive robust MPS

In [None]:
agent_dir = os.path.join(save_dir_RL_sp_models, 'model_0')

# exemplary deployment with adaptive robust shielding for one RL policy
shielded_deployment_instance = shielded_deployment(agent_dir=agent_dir, backup_dir=backup_dir, env=env, max_sp=max_sp, sp = True, adaptive=True, n_horizon=n_horizon)
shielded_deployment_instance.shielded_closed_loop(num_eval=100, num_MC=num_MC, sampling=sampling)

# results are stored in a dataframe in attribute results_df
results = shielded_deployment_instance.results_df

#### $\beta$ distribution sampling

In [None]:
# adjust sampling method and repeat evaluations with shield and only RL evaluations
sampling = 'beta'
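
How the `'uniform'` and `'beta'` options are implemented is not shown in this notebook. A minimal sketch of the difference, assuming the uncertain parameters are scaling factors in `[1 - max_sp, 1 + max_sp]`, is given below. The function `sample_uncertainty` and the shape parameters `a` and `b` are hypothetical and chosen only for illustration.

```python
import numpy as np

def sample_uncertainty(rng, max_sp, sampling='uniform', size=1, a=2.0, b=2.0):
    """Draw scaling factors in [1 - max_sp, 1 + max_sp] (hypothetical helper)."""
    low, high = 1.0 - max_sp, 1.0 + max_sp
    if sampling == 'uniform':
        return rng.uniform(low, high, size=size)
    if sampling == 'beta':
        # Beta samples live on [0, 1] and are rescaled to [low, high]
        return low + (high - low) * rng.beta(a, b, size=size)
    raise ValueError(f'unknown sampling method: {sampling}')

rng = np.random.default_rng(123)
samples = sample_uncertainty(rng, max_sp=0.2, sampling='beta', size=1000)
print(samples.min(), samples.max())
```

With `a = b > 1`, the rescaled beta distribution concentrates samples around the nominal value 1, whereas uniform sampling weights the whole interval equally.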

## Section 5.5. Comparison with alternative backup policy
In this section, the evaluations using multi-stage MPC as an alternative backup policy are presented. Again, only the exemplary evaluation of one model is presented. The evaluation is to be performed for the base RL policies without safety parameter for the robust MPS setting, and for RL policies with safety parameter for the robust MPS and adaptive robust MPS settings.

In [None]:
from RL_policy.RL_environments import CSTR_sp
from policy_deployment.MPS import shielded_deployment_multi_stage_mpc

n_horizon = 20

# tuning parameters for adaptation of safety parameter
step_sp = 0.025
max_sp = 0.2

# system environment
env = CSTR_sp(seed = 123) 

#### Evaluation - robust MPS with multi-stage MPC

In [None]:
agent_dir = os.path.join(save_dir_RL_sp_models, 'model_0')

# exemplary deployment with robust shielding (multi-stage MPC backup) for one RL policy
shielded_deployment_instance = shielded_deployment_multi_stage_mpc(agent_dir=agent_dir, env=env, max_sp=max_sp, sp = True, adaptive=False, n_horizon=n_horizon)
shielded_deployment_instance.shielded_closed_loop(num_eval=100)

# results are stored in a dataframe in attribute results_df
results = shielded_deployment_instance.results_df

#### Evaluation - adaptive robust MPS with multi-stage MPC

In [None]:
agent_dir = os.path.join(save_dir_RL_sp_models, 'model_0')

# exemplary deployment with adaptive robust shielding (multi-stage MPC backup) for one RL policy
shielded_deployment_instance = shielded_deployment_multi_stage_mpc(agent_dir=agent_dir, env=env, max_sp=max_sp, sp = True, adaptive=True, n_horizon=n_horizon)
shielded_deployment_instance.shielded_closed_loop(num_eval=100)

# results are stored in a dataframe in attribute results_df
results = shielded_deployment_instance.results_df

## Section 5.6. Generalizability of safety parameter
We refrain from showing the entire procedure for demonstrating the generalizability of the safety parameter and limit ourselves to a brief description: the procedures described above for obtaining the RL models (with and without safety parameter) and for constructing the backup policy are repeated for a new maximum value of the safety parameter, which is set to $\sigma_{\text{max}}=0.15$ for the RL environments and for the sampling of the neural network training data. The evaluations are performed for $\sigma_{\text{max}}=0.15$ and $\sigma_{\text{max}}=0.2$.

In [None]:
# repeat RL training and backup policy construction for narrowed safety parameter range and repeat evaluation for previous safety parameter range
max_sp = 0.15

## Section 5.7. Investigation of tuning parameters
We refrain from showing the entire investigation of tuning parameters. To reproduce the presented results, the shielding evaluations described above must be repeated for the different values of $N_{\text{roll}}$, $\alpha_\sigma$ and $N$.

In [None]:
# adjust tuning parameters and repeat evaluation described above
n_horizon = 20            # [5, 10, 20]
num_MC = None             # [10, 100, 1000, None]
step_sp = 0.05            # [0.025, 0.05, 0.1]
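
The candidate values listed in the comments above can be enumerated programmatically. The sketch below builds a full factorial set of configurations with `itertools.product`. Whether the reported results vary all parameters jointly or one at a time is not stated here, so the full grid is an assumption, and how each value is passed to the shielding classes is left open.

```python
from itertools import product

n_horizon_values = [5, 10, 20]
num_MC_values = [10, 100, 1000, None]   # None selects the min-max approach
step_sp_values = [0.025, 0.05, 0.1]

# One dict per combination of tuning parameters
configs = [
    {'n_horizon': n, 'num_MC': mc, 'step_sp': s}
    for n, mc, s in product(n_horizon_values, num_MC_values, step_sp_values)
]
print(len(configs))
```

Each entry of `configs` can then be used to set the tuning parameters before repeating one of the shielded evaluations described above.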