# Causal Knowledge Transfer for Safe Reinforcement Learning

### Train Agents
#### Sumo Network File
When editing the SUMO network (nets/simple_unprotected_right.net.xml), never edit the XML directly. Instead, make the desired changes in nets/netconfig and regenerate the net.xml by executing generate_config.sh.
#### Sumo Route File
The route file is now generated by generate_config.sh as well.
#### Reward Function
* TODO: Come up with a fitting reward function that penalises collisions
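
One possible starting point for this TODO is sketched below. It is only an illustration, not the project's reward function: the name `collision_penalised_reward`, its inputs (`speed`, `max_speed`, `collided`) and the penalty magnitude are all assumptions, and how they would be read from the SUMO environment is left open.

```python
def collision_penalised_reward(speed, max_speed, collided, collision_penalty=100.0):
    """Hypothetical reward: normalised progress minus a fixed collision penalty.

    All names and the default penalty are placeholders for illustration.
    """
    if max_speed <= 0:
        raise ValueError('max_speed must be positive')
    # Reward driving close to the allowed speed, clamped to [0, 1].
    progress = max(0.0, min(speed / max_speed, 1.0))
    # A collision outweighs any progress made in the same step.
    penalty = collision_penalty if collided else 0.0
    return progress - penalty
```

The penalty is deliberately much larger than the maximum per-step progress reward, so that no amount of speed within one step can compensate for a collision.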
#### RL Training
* TODO: Come up with Hyperparameters for the training loop

**Desired Output: Trained_Model.zip**

### Creating the Sumo Environment

In [None]:
from env.SumoEnvironmentGenerator import SumoEnvironmentGenerator
from pathlib import Path

environments = SumoEnvironmentGenerator(
    net_file=str(Path().joinpath('nets', 'simple_unprotected_right', 'simple_unprotected_right.net.xml')),
    route_file=str(Path().joinpath('nets', 'simple_unprotected_right', 'simple_unprotected_right.rou.xml')),
    sumocfg_file=str(Path().joinpath('nets', 'simple_unprotected_right', 'simple_unprotected_right.sumocfg')),
    duration=3600,
    learning_data_csv_name=str(Path().joinpath('env', 'training_data', 'output.csv')),
)

### Training and saving the Model

In [None]:
from stable_baselines3.dqn import DQN

%load_ext tensorboard
env = environments.get_training_env()
model = DQN(
    env=env,  # reuse the environment created above instead of building a second one
    policy='MlpPolicy',
    learning_rate=0.001,
    learning_starts=0,
    train_freq=1,
    target_update_interval=500,
    exploration_fraction=0.05,
    exploration_final_eps=0.01,
    verbose=1,
    tensorboard_log='dqn_sumo_tensorboard'
)
model.learn(10_000, tb_log_name='test_run_short')
model.save(Path().joinpath('env', 'training_data', 'dqn'))

Giving the model a test run in an evaluation environment

In [None]:
from stable_baselines3.dqn import DQN

env = environments.get_demonstration_env()
# DQN.load is a classmethod, so no throwaway model needs to be constructed first
model = DQN.load(Path().joinpath('env', 'training_data', 'dqn'), env=env)

obs, info = env.reset()
done = False
while not done:
    action, _state = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
env.close()

### Produce Traces
Run the simulation repeatedly to produce traces (data) for causal discovery.

#### Data Selection
TODO: Select which columns we want to do Causal Discovery on
#### Data Summary
TODO: Incorporate old data summary script

**Desired Output: One CSV File containing all interesting data**
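
A minimal sketch of how the per-run traces could be combined into that single CSV is shown below. It assumes that every run leaves one CSV file in a common directory and that the columns of interest are already known; the function name `merge_traces`, the `run` column and the file layout are hypothetical and not taken from the existing data summary script.

```python
import csv
from pathlib import Path


def merge_traces(trace_dir, output_file, columns):
    """Concatenate selected columns of all CSV traces in trace_dir into one file.

    Hypothetical helper: assumes one CSV per simulation run. Returns the
    number of trace files that were merged.
    """
    output_file = Path(output_file)
    # Skip the output file itself in case it lives in the same directory.
    trace_files = [
        path for path in sorted(Path(trace_dir).glob('*.csv'))
        if path.resolve() != output_file.resolve()
    ]
    with open(output_file, 'w', newline='') as out:
        writer = csv.DictWriter(out, fieldnames=['run', *columns])
        writer.writeheader()
        for trace_file in trace_files:
            with open(trace_file, newline='') as src:
                for row in csv.DictReader(src):
                    # Keep only the selected columns and tag each row with its run.
                    selected = {column: row[column] for column in columns}
                    writer.writerow({'run': trace_file.stem, **selected})
    return len(trace_files)
```

Tagging each row with the file stem keeps the runs distinguishable after merging, which may matter if causal discovery is later done per run.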


In [None]:
from stable_baselines3.dqn import DQN

simulation_output_path = Path().joinpath('data', 'dqn')
simulation_output_path.mkdir(parents=True, exist_ok=True)

for experiment in range(10):
    env = environments.get_generation_env(output_prefix=str(simulation_output_path.joinpath(str(experiment).zfill(4))))
    # DQN.load is a classmethod, so no throwaway model needs to be constructed first
    model = DQN.load(Path().joinpath('env', 'training_data', 'dqn'), env=env)
    
    obs, info = env.reset()
    done = False
    while not done:
        action, _state = model.predict(obs, deterministic=True)
        obs, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
    env.close()

### Causal Discovery
Discover causal graph

TODO: Decide which discovery algorithm to use

TODO: Figure out how to incorporate R code in Jupyter Notebook

**Desired Output: Causal Graph XML File**

In [None]:
# TODO

### Fit MLMs
Fit MLMs based on Causal Discovery Graph

TODO: Parse Graph XML into MLM parameters / formulae

**Desired Output: MLM**

In [None]:
# TODO

### Produce Interventions

#### Covariate Shift Distribution
* Create a distribution for the covariate (friction) shift
* Sample from distribution
    * Assumption to fulfil: the sparse sample data is representative of the covariate shift ground truth
* Produce Traces for sparse input data
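
The first two steps above could be sketched as follows. The choice of a truncated normal distribution and all parameter values (`mean`, `std`, `low`, `high`) are assumptions made for illustration only; the actual friction distribution still has to be decided.

```python
import random


def sample_friction(n_samples, mean=0.5, std=0.15, low=0.1, high=1.0, seed=None):
    """Draw friction values from a normal distribution truncated to [low, high].

    Hypothetical helper: distribution family and defaults are placeholders.
    """
    rng = random.Random(seed)
    samples = []
    while len(samples) < n_samples:
        value = rng.gauss(mean, std)
        # Rejection sampling: discard physically implausible friction values.
        if low <= value <= high:
            samples.append(value)
    return samples
```

Each sampled friction value would then parameterise one simulation run when producing the sparse traces. Passing a fixed `seed` makes the sparse sample reproducible across experiments.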

#### Crank MLM the other way
* Calculate Intervention Distribution by inputting sparse data into MLM

**Desired Output: Intervention Distribution**

In [None]:
# TODO

### Generate Posterior Distributions
TODO: Generate Posterior Distributions without intervention

TODO: Generate Posterior Distributions with intervention

**Desired Output: Two XML Files**

In [None]:
# TODO

### Query
Compare the distributions and decide which part of the model to retrain.

TODO: Classify the data / model in parts

In [None]:
# TODO

### Evaluation

#### Agent
Compare the resulting agent (training partially continued, depending on the Query) to:
* Old agent (Lower performance bound)
* Completely newly trained agent (upper performance bound)
* (New Agent that is trained completely on new data (without Query))

#### Intervention
Function: Number of Covariate Shift Samples --> Wasserstein distance: Intervention vs. ground truth (distribution)
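
A minimal sketch of this evaluation for one-dimensional samples is given below, assuming both the intervention and the ground truth are available as lists of numbers. The names `wasserstein_1d` and `distance_by_sample_size` are hypothetical. The distance is computed as the area between the two empirical CDFs; `scipy.stats.wasserstein_distance` computes the same quantity and could be used instead.

```python
from bisect import bisect_right


def wasserstein_1d(u, v):
    """1-Wasserstein distance between two 1-D empirical samples.

    Integrates the absolute difference of the two empirical CDFs.
    """
    u_sorted, v_sorted = sorted(u), sorted(v)
    points = sorted(u_sorted + v_sorted)
    total = 0.0
    for left, right in zip(points, points[1:]):
        # Both empirical CDFs are constant on the interval [left, right).
        cdf_u = bisect_right(u_sorted, left) / len(u_sorted)
        cdf_v = bisect_right(v_sorted, left) / len(v_sorted)
        total += abs(cdf_u - cdf_v) * (right - left)
    return total


def distance_by_sample_size(draw_samples, ground_truth, sample_sizes):
    """Map each sample size n to the distance between n drawn samples and the ground truth.

    draw_samples is a hypothetical callable returning n intervention samples.
    """
    return {n: wasserstein_1d(draw_samples(n), ground_truth) for n in sample_sizes}
```

Plotting the returned dictionary would give the curve described above: how quickly the intervention distribution approaches the ground truth as more covariate shift samples become available.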

#### MLM
* Wasserstein Distance: Effect of Intervention vs. ground truth effect
* Maybe also as a function of the number of retrain samples



In [None]:
# TODO