# Archive

## HER 

- HER works with off-policy methods (SAC, TQC, TD3 and DDPG).
- HER is no longer a separate algorithm but a replay buffer class, HerReplayBuffer, that must be passed to an off-policy algorithm together with MultiInputPolicy (for Dict observation support).
- HER requires the environment to inherit from gym.GoalEnv.
- For performance reasons, the maximum number of steps per episode must be specified. In most cases it is inferred if you specify max_episode_steps when registering the environment or if you use a gym.wrappers.TimeLimit (and env.spec is not None). Otherwise, you can pass max_episode_length directly (depending on the library version, to the model constructor or to the replay buffer).

### Train

### Eval 

### Tuning TD3

## Contrib packages: ARS

### Train

### Eval 

## Tuning DDPG

### Parameters

- policy = "MlpPolicy" , "CnnPolicy" , "MultiInputPolicy"
- **learning_rate** = static value, or a function of the remaining training progress (from 1 to 0)
- buffer_size (int) – size of the replay buffer
- **learning_starts (int)** – how many steps the model collects transitions for before learning starts
    -  During these initial steps, the agent takes actions sampled from a uniform random distribution over valid actions. After that, it returns to normal DDPG exploration. (The description quoted here sets this with a start_steps keyword argument; in Stable-Baselines3 the corresponding setting is learning_starts.)
- batch_size (int) – Minibatch size for each gradient update
- **tau (float)** – the soft update coefficient (“Polyak update”, between 0 and 1)
- gamma (float) – the discount factor
- train_freq (Union[int, Tuple[int, str]]) – Update the model every train_freq steps. Alternatively pass a tuple of frequency and unit like (5, "step") or (2, "episode").
- gradient_steps (int) – How many gradient steps to do after each rollout (see train_freq). Setting it to -1 means doing as many gradient steps as steps done in the environment during the rollout.
- action_noise (Optional[ActionNoise]) – the action noise type (None by default); this can help with hard exploration problems. See common.noise for the different action noise types.
    -  Uncorrelated, mean-zero Gaussian noise works perfectly well.
    -  To get higher-quality training data, you may reduce the scale of the noise over the course of training. (The implementation this note is quoted from does not do this and keeps the noise scale fixed throughout.)


- replay_buffer_class (Optional[ReplayBuffer]) – Replay buffer class to use (for instance HerReplayBuffer). If None, it will be automatically selected.
- optimize_memory_usage (bool) – Enable a memory efficient variant of the replay buffer at a cost of more complexity. See https://github.com/DLR-RM/stable-baselines3/issues/37#issuecomment-637501195
- create_eval_env (bool) – Whether to create a second environment that will be used for evaluating the agent periodically. (Only available when passing string for the environment)

- seed (Optional[int]) – Seed for the pseudo random generators
- _init_setup_model (bool) – Whether or not to build the network at the creation of the instance
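
To make the role of tau more concrete, here is a minimal sketch of a Polyak (soft) update on plain Python lists. The function name polyak_update_lists and the list-based "parameters" are illustrative assumptions only; the library applies the same weighted average to network tensors rather than lists.

```python
from typing import List


def polyak_update_lists(params: List[float], target_params: List[float], tau: float) -> List[float]:
    """Return new target values: tau * online + (1 - tau) * target."""
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must be between 0 and 1")
    return [tau * p + (1.0 - tau) * t for p, t in zip(params, target_params)]


online = [1.0, 2.0]   # hypothetical online-network weights
target = [0.0, 0.0]   # hypothetical target-network weights
target = polyak_update_lists(online, target, tau=0.5)
print(target)  # [0.5, 1.0]
```

With tau = 1 the target is overwritten by the online weights in one step; with a small tau such as the default 0.005 the target only drifts slowly towards them.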





stable_baselines3.ddpg.MlpPolicy Parameters
- lr_schedule (Callable[[float], float]) – Learning rate schedule (could be constant)
- n_critics (int) – Number of critic networks to create.
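
As an illustration of what a Callable[[float], float] learning rate schedule can look like, the sketch below builds a linear decay. The helper name linear_schedule is an assumption for this example; the only thing taken from the notes above is that the schedule receives the remaining progress, which goes from 1 (start of training) to 0 (end).

```python
from typing import Callable


def linear_schedule(initial_value: float) -> Callable[[float], float]:
    """Build a schedule that decays linearly from initial_value to 0."""

    def schedule(progress_remaining: float) -> float:
        # progress_remaining is 1.0 at the start of training and 0.0 at the end
        return progress_remaining * initial_value

    return schedule


lr = linear_schedule(0.001)
print(lr(1.0))  # learning rate at the start of training
print(lr(0.0))  # learning rate at the end of training
```

A function built this way could be passed as learning_rate instead of a constant float.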

stable_baselines3.ddpg.MlpPolicy.set_training_mode()
- mode (bool) – if true, set to training mode, else set to evaluation mode

stable_baselines3.ddpg.CnnPolicy

stable_baselines3.ddpg.MultiInputPolicy


In [None]:
# hide all deprecation warnings from tensorflow
#import tensorflow as tf
#tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)

import numpy as np
import optuna
import wandb

#from stable_baselines import PPO2
from gym import make  # assumption: the `make` used in this notebook is gym.make
from gym.wrappers import RecordEpisodeStatistics
from stable_baselines3 import DDPG
from stable_baselines3 import HerReplayBuffer
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.noise import NormalActionNoise
#from stable_baselines.common.evaluation import evaluate_policy
#from stable_baselines.common.cmd_util import make_vec_env

# https://colab.research.google.com/github/araffin/rl-tutorial-jnrr19/blob/master/5_custom_gym_env.ipynb
#from custom_env import GoLeftEnv

# The noise objects for DDPG
# `env` has to exist before its action space is read; it is created the same way as in optimize_agent
env = make('VPPBiddingEnv-TRAIN-v1', render_mode=None)
n_actions = env.action_space.shape[-1]
normal_action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))


def optimize_ddpg(trial):
    """Sample the learning hyperparameters we want to optimise."""

    replay_buffer_class = trial.suggest_categorical("replay_buffer_class", ["HER", "None"])
    replay_buffer_class = {"HER": HerReplayBuffer, "None": None}[replay_buffer_class]

    action_noise = trial.suggest_categorical("action_noise", ["action_noise", "None"])
    action_noise = {"action_noise": normal_action_noise, "None": None}[action_noise]

    params = {
        # suggest_float(..., log=True) replaces the deprecated suggest_loguniform
        'learning_rate': trial.suggest_float('learning_rate', 0.0001, 1.0, log=True),  # default: 0.001
        'learning_starts': trial.suggest_int('learning_starts', 0, 200, step=10),  # default: 100
        # lower bound is 10 instead of 0: an empty minibatch cannot be used for a gradient update
        'batch_size': trial.suggest_int('batch_size', 10, 200, step=10),  # default: 100
        'tau': trial.suggest_float('tau', 0.001, 1.0, log=True),  # default: 0.005
        'gamma': trial.suggest_float('gamma', 0.9, 0.9999, log=True),  # default: 0.99
        'replay_buffer_class': replay_buffer_class,
        'action_noise': action_noise,
    }

    return params
        



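from stable_baselines3.common.evaluation import evaluate_policy


def optimize_agent(trial):
    """Train the model and return the value Optuna should optimise.

    optuna.create_study() minimises the objective by default, so we
    negate the mean reward here.
    """

    model_params = optimize_ddpg(trial)

    # init tracking experiment.
    # hyper-parameters, trial id are stored.
    config = dict(trial.params)
    config["trial.number"] = trial.number
    wandb.init(
        project="RL-optuna",
        entity="jlu237",
        sync_tensorboard=True,
        config=config,
        reinit=True
    )

    env = make('VPPBiddingEnv-TRAIN-v1', render_mode=None)
    env = Monitor(env)
    env = RecordEpisodeStatistics(env)  # record stats such as returns

    model = DDPG('MultiInputPolicy', env, verbose=0, tensorboard_log="runs/ddpg", seed=1, **model_params)
    model.learn(total_timesteps=557, log_interval=1)

    # Score the trial so Optuna has something to optimise (the cell previously returned nothing).
    # Evaluating on the training env for 5 episodes is an assumption, adjust as needed.
    mean_reward, _ = evaluate_policy(model, env, n_eval_episodes=5)

    wandb.finish()

    return -1 * mean_reward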
    
study = optuna.create_study()
try:
    study.optimize(optimize_agent, n_trials=20)
except KeyboardInterrupt:
    print('Interrupted by keyboard.')

In [None]:
env = make('VPPBiddingEnv-TRAIN-v1', render_mode=None)
env.observation_space.spaces["observation"]

In [None]:
print("Number of finished trials: {}".format(len(study.trials)))

print("Best trial:")
trial = study.best_trial

print("  Value: {}".format(trial.value))

print("  Params: ")
for key, value in trial.params.items():
    print("    {}: {}".format(key, value))


In [None]:
model.get_parameters()["critic.optimizer"]["param_groups"]

In [None]:
model.get_parameters()["actor.optimizer"]["param_groups"]

In [None]:
# !apt-get install swig cmake ffmpeg freeglut3-dev xvfb

In [None]:
# Alternative from araffin for optuna from: https://github.com/optuna/optuna-examples/blob/52ed3aff3e3e936be3873b5acc6ee3ccdadea914/rl/sb3_simple.py#L60

""" Optuna example that optimizes the hyperparameters of
a reinforcement learning agent using A2C implementation from Stable-Baselines3
on a OpenAI Gym environment.

This is a simplified version of what can be found in https://github.com/DLR-RM/rl-baselines3-zoo.

You can run this example as follows:
    $ python sb3_simple.py

"""
from typing import Any
from typing import Dict

import gym
import optuna
from optuna.pruners import MedianPruner
from optuna.samplers import TPESampler
from stable_baselines3 import A2C
from stable_baselines3.common.callbacks import EvalCallback
import torch
import torch.nn as nn


N_TRIALS = 100
N_STARTUP_TRIALS = 5
N_EVALUATIONS = 2
N_TIMESTEPS = int(2e4)
EVAL_FREQ = int(N_TIMESTEPS / N_EVALUATIONS)
N_EVAL_EPISODES = 3

ENV_ID = "CartPole-v1"

DEFAULT_HYPERPARAMS = {
    "policy": "MlpPolicy",
    "env": ENV_ID,
}


def sample_a2c_params(trial: optuna.Trial) -> Dict[str, Any]:
    """Sampler for A2C hyperparameters."""
    gamma = 1.0 - trial.suggest_float("gamma", 0.0001, 0.1, log=True)
    max_grad_norm = trial.suggest_float("max_grad_norm", 0.3, 5.0, log=True)
    gae_lambda = 1.0 - trial.suggest_float("gae_lambda", 0.001, 0.2, log=True)
    n_steps = 2 ** trial.suggest_int("exponent_n_steps", 3, 10)
    learning_rate = trial.suggest_float("lr", 1e-5, 1, log=True)
    ent_coef = trial.suggest_float("ent_coef", 0.00000001, 0.1, log=True)
    ortho_init = trial.suggest_categorical("ortho_init", [False, True])
    net_arch = trial.suggest_categorical("net_arch", ["tiny", "small"])
    activation_fn = trial.suggest_categorical("activation_fn", ["tanh", "relu"])

    # Display true values
    trial.set_user_attr("gamma_", gamma)
    trial.set_user_attr("gae_lambda_", gae_lambda)
    trial.set_user_attr("n_steps", n_steps)

    net_arch = [
        {"pi": [64], "vf": [64]} if net_arch == "tiny" else {"pi": [64, 64], "vf": [64, 64]}
    ]

    activation_fn = {"tanh": nn.Tanh, "relu": nn.ReLU}[activation_fn]

    return {
        "n_steps": n_steps,
        "gamma": gamma,
        "gae_lambda": gae_lambda,
        "learning_rate": learning_rate,
        "ent_coef": ent_coef,
        "max_grad_norm": max_grad_norm,
        "policy_kwargs": {
            "net_arch": net_arch,
            "activation_fn": activation_fn,
            "ortho_init": ortho_init,
        },
    }


class TrialEvalCallback(EvalCallback):
    """Callback used for evaluating and reporting a trial."""

    def __init__(
        self,
        eval_env: gym.Env,
        trial: optuna.Trial,
        n_eval_episodes: int = 5,
        eval_freq: int = 10000,
        deterministic: bool = True,
        verbose: int = 0,
    ):

        super().__init__(
            eval_env=eval_env,
            n_eval_episodes=n_eval_episodes,
            eval_freq=eval_freq,
            deterministic=deterministic,
            verbose=verbose,
        )
        self.trial = trial
        self.eval_idx = 0
        self.is_pruned = False

    def _on_step(self) -> bool:
        if self.eval_freq > 0 and self.n_calls % self.eval_freq == 0:
            super()._on_step()
            self.eval_idx += 1
            self.trial.report(self.last_mean_reward, self.eval_idx)
            # Prune trial if need
            if self.trial.should_prune():
                self.is_pruned = True
                return False
        return True


def objective(trial: optuna.Trial) -> float:

    kwargs = DEFAULT_HYPERPARAMS.copy()
    # Sample hyperparameters
    kwargs.update(sample_a2c_params(trial))
    # Create the RL model
    model = A2C(**kwargs)
    # Create env used for evaluation
    eval_env = gym.make(ENV_ID)
    # Create the callback that will periodically evaluate
    # and report the performance
    eval_callback = TrialEvalCallback(
        eval_env, trial, n_eval_episodes=N_EVAL_EPISODES, eval_freq=EVAL_FREQ, deterministic=True
    )

    nan_encountered = False
    try:
        model.learn(N_TIMESTEPS, callback=eval_callback)
    except AssertionError as e:
        # Sometimes, random hyperparams can generate NaN
        print(e)
        nan_encountered = True
    finally:
        # Free memory
        model.env.close()
        eval_env.close()

    # Tell the optimizer that the trial failed
    if nan_encountered:
        return float("nan")

    if eval_callback.is_pruned:
        raise optuna.exceptions.TrialPruned()

    return eval_callback.last_mean_reward


if __name__ == "__main__":
    # Set pytorch num threads to 1 for faster training
    torch.set_num_threads(1)

    sampler = TPESampler(n_startup_trials=N_STARTUP_TRIALS)
    # Do not prune before 1/3 of the max budget is used
    pruner = MedianPruner(n_startup_trials=N_STARTUP_TRIALS, n_warmup_steps=N_EVALUATIONS // 3)

    study = optuna.create_study(sampler=sampler, pruner=pruner, direction="maximize")
    try:
        study.optimize(objective, n_trials=N_TRIALS, timeout=600)
    except KeyboardInterrupt:
        pass

    print("Number of finished trials: ", len(study.trials))

    print("Best trial:")
    trial = study.best_trial

    print("  Value: ", trial.value)

    print("  Params: ")
    for key, value in trial.params.items():
        print("    {}: {}".format(key, value))

    print("  User attrs:")
    for key, value in trial.user_attrs.items():
        print("    {}: {}".format(key, value))

In [None]:
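# code from https://github.com/DLR-RM/rl-baselines3-zoo/blob/master/utils/hyperparams_opt.py#L340
# Note: trial.n_actions and trial.using_her_replay_buffer are not standard Optuna attributes;
# as far as I can tell the zoo attaches them to the trial, and sample_her_params lives in the
# same zoo module, so this cell is only expected to run in that context.
from typing import Any, Dict

import numpy as np
import optuna
from stable_baselines3.common.noise import NormalActionNoise, OrnsteinUhlenbeckActionNoise


def sample_ddpg_params(trial: optuna.Trial) -> Dict[str, Any]:
    """
    Sampler for DDPG hyperparams.
    :param trial: Optuna trial used to draw the values
    :return: sampled hyperparameters as keyword arguments for DDPG
    """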
    gamma = trial.suggest_categorical("gamma", [0.9, 0.95, 0.98, 0.99, 0.995, 0.999, 0.9999])
    learning_rate = trial.suggest_loguniform("learning_rate", 1e-5, 1)
    batch_size = trial.suggest_categorical("batch_size", [16, 32, 64, 100, 128, 256, 512, 1024, 2048])
    buffer_size = trial.suggest_categorical("buffer_size", [int(1e4), int(1e5), int(1e6)])
    # Polyak coeff
    tau = trial.suggest_categorical("tau", [0.001, 0.005, 0.01, 0.02, 0.05, 0.08])

    train_freq = trial.suggest_categorical("train_freq", [1, 4, 8, 16, 32, 64, 128, 256, 512])
    gradient_steps = train_freq

    noise_type = trial.suggest_categorical("noise_type", ["ornstein-uhlenbeck", "normal", None])
    noise_std = trial.suggest_uniform("noise_std", 0, 1)

    # NOTE: Add "verybig" to net_arch when tuning HER (see TD3)
    net_arch = trial.suggest_categorical("net_arch", ["small", "medium", "big"])
    # activation_fn = trial.suggest_categorical('activation_fn', [nn.Tanh, nn.ReLU, nn.ELU, nn.LeakyReLU])

    net_arch = {
        "small": [64, 64],
        "medium": [256, 256],
        "big": [400, 300],
    }[net_arch]

    hyperparams = {
        "gamma": gamma,
        "tau": tau,
        "learning_rate": learning_rate,
        "batch_size": batch_size,
        "buffer_size": buffer_size,
        "train_freq": train_freq,
        "gradient_steps": gradient_steps,
        "policy_kwargs": dict(net_arch=net_arch),
    }

    if noise_type == "normal":
        hyperparams["action_noise"] = NormalActionNoise(
            mean=np.zeros(trial.n_actions), sigma=noise_std * np.ones(trial.n_actions)
        )
    elif noise_type == "ornstein-uhlenbeck":
        hyperparams["action_noise"] = OrnsteinUhlenbeckActionNoise(
            mean=np.zeros(trial.n_actions), sigma=noise_std * np.ones(trial.n_actions)
        )

    if trial.using_her_replay_buffer:
        hyperparams = sample_her_params(trial, hyperparams)

    return hyperparams


In [None]:
!git clone --recursive https://github.com/DLR-RM/rl-baselines3-zoo

In [None]:
#!cd rl-baselines3-zoo/

In [None]:
!pip install -r rl-baselines3-zoo/requirements.txt

In [None]:
!python rl-baselines3-zoo/train.py --algo ddpg --env VPPBiddingEnv-TRAIN-v1 -n 697 -optimize --n-trials 5 --n-jobs -1 \
  --sampler tpe --pruner median

In [None]:
!python rl-baselines3-zoo/scripts/parse_study.py -i path/to/study.pkl --print-n-best-trials 10 --save-n-best-hyperparameters 10


### PPO - Proximal Policy Optimization algorithm 

#### Train the agent

#### Evaluate Agent

## A2C - synchronous, deterministic variant of Asynchronous Advantage Actor Critic (A3C)

#### Training

#### Eval

## Other Algorithm 

In [None]:
# todo

## DQN -- needs Discrete Action Space. 

# Testing

#### Run Episodes

### Check the Environment

# Keras 

## 2. Create a Deep Learning Model with Keras

## 3. Build Agent with Keras-RL


## 4. Reloading Agent from Memory


# Archive


### 2. Create test list, check if date is in test list; if yes, skip day

1. data set start date = 01.07.2020
2. training start date = 02.07.2020
3. first slot lower boundary = 02.07.2020 22.00
4. make test set
    - take time_features_df
    - subtract 2 hours from each timestamp = start of slot
    - iterate over df and get date every 5 days, add to list = test list.
5. in training mode -> skip dates in list.
6. in test mode -> take only dates from test list.
7. Create different test sets? Izzy said a contiguous week would be good (seasonality).

Procedure:
Starting from the first full week: take the week and add it to the test-set list,
then skip 5 weeks, then take 1 week as test week.
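
The weekly procedure above could be sketched as follows. This is only an illustration under assumptions: a "full week" is taken to run Monday to Sunday, the end date 31.12.2020 is made up for the example, and the names build_test_dates and use_day are hypothetical helpers, not part of the environment code.

```python
from datetime import date, timedelta
from typing import List


def build_test_dates(start: date, end: date, skip_weeks: int = 5) -> List[date]:
    """Collect all dates of the held-out test weeks between start and end."""
    # Move forward to the first Monday on or after `start` (first full week).
    week_start = start + timedelta(days=(7 - start.weekday()) % 7)
    test_dates: List[date] = []
    while week_start + timedelta(days=6) <= end:
        test_dates.extend(week_start + timedelta(days=d) for d in range(7))
        # One test week is followed by `skip_weeks` training weeks.
        week_start += timedelta(weeks=1 + skip_weeks)
    return test_dates


# Training start date from the list above; the end date is an assumed example value.
test_dates = build_test_dates(date(2020, 7, 2), date(2020, 12, 31))
test_set = set(test_dates)


def use_day(day: date, mode: str) -> bool:
    """Training mode skips test dates, test mode keeps only test dates."""
    return (day in test_set) if mode == "test" else (day not in test_set)


print(test_dates[0], test_dates[6])  # first and last day of the first test week
```

Keeping whole weeks together preserves the weekly seasonality mentioned in point 7, while the 5-week gap spreads the test weeks over the year.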


### scaler for observations


In [None]:
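# scaler for observations
import numpy as np
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(-1,1))

# asset_data_historic is assumed to be defined earlier in the notebook;
# converting to an array once makes .reshape work for lists as well
a_raw = np.asarray(asset_data_historic, dtype=float)
print("a_raw")

print(a_raw)

# MinMaxScaler expects a 2-D array of shape (n_samples, n_features)
scaler.fit(a_raw.reshape(-1, 1))

b_transformed = scaler.transform(a_raw.reshape(-1, 1))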
print("b_transformed")

print(b_transformed)

# convert from array to list
c_list = [x for xs in list(b_transformed) for x in xs]
print("c_list")

print(c_list)

# transform back to the original scale

d_transformed_back = (scaler.inverse_transform(np.array(c_list).reshape(-1, 1)))
print("d_transformed_back")
print(d_transformed_back)

print("e_array")
e_array = d_transformed_back.flatten()
print(e_array)


print("f_list")
f_list = [x for xs in list(d_transformed_back) for x in xs]

print(f_list)
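
For reference, the arithmetic behind scaling to the range (-1, 1) and back can be written out in plain Python. This is a sketch of the min-max formula, not the scikit-learn implementation; the helper names fit_min_max, scale and unscale are made up for this example, and unlike the library this sketch simply rejects constant data.

```python
from typing import List, Tuple


def fit_min_max(values: List[float]) -> Tuple[float, float]:
    """Remember the data range, as fit() does for a single feature."""
    return min(values), max(values)


def scale(values: List[float], data_min: float, data_max: float,
          low: float = -1.0, high: float = 1.0) -> List[float]:
    """Map [data_min, data_max] linearly onto [low, high]."""
    span = data_max - data_min
    if span == 0:
        raise ValueError("cannot scale constant data in this sketch")
    return [(v - data_min) / span * (high - low) + low for v in values]


def unscale(scaled: List[float], data_min: float, data_max: float,
            low: float = -1.0, high: float = 1.0) -> List[float]:
    """Invert scale(): map [low, high] back onto [data_min, data_max]."""
    return [(s - low) / (high - low) * (data_max - data_min) + data_min for s in scaled]


raw = [0.0, 5.0, 10.0]  # made-up example values
data_min, data_max = fit_min_max(raw)
scaled = scale(raw, data_min, data_max)
restored = unscale(scaled, data_min, data_max)
print(scaled)    # [-1.0, 0.0, 1.0]
print(restored)  # [0.0, 5.0, 10.0]
```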