## Example usage of `adversarial_gym` wrappers with OpenAI Gym and `stable-baselines3`

Import required packages.

`gym` provides the RL environments.
We use the PPO implementation from `stable-baselines3`.

`adversarial_gym` currently contains three wrappers, one each for adding noise to observations, actions, and rewards. All three are imported below.

In [5]:
%load_ext autoreload
%autoreload 2
import gym
from adversarial_gym.wrappers import ObsNoiseWrapper, ActionNoiseWrapper, RewardNoiseWrapper
from stable_baselines3 import PPO



In [6]:
%matplotlib inline
import matplotlib.pyplot as plt
from IPython import display

Create the environment.

In [7]:
env = gym.make("ContinuousCartPole-v0")

Let's first train a PPO agent on this noise-free version of the environment. This gives us a baseline, so we can see what impact, if any, the noise added in the next version of the environment has on performance.

In [8]:
model = PPO('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=10000)

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.




---------------------------------
| rollout/           |          |
|    ep_len_mean     | 24.9     |
|    ep_rew_mean     | 24.9     |
| time/              |          |
|    fps             | 2537     |
|    iterations      | 1        |
|    time_elapsed    | 0        |
|    total_timesteps | 2048     |
---------------------------------
------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 28.5         |
|    ep_rew_mean          | 28.5         |
| time/                   |              |
|    fps                  | 1653         |
|    iterations           | 2            |
|    time_elapsed         | 2            |
|    total_timesteps      | 4096         |
| train/                  |              |
|    approx_kl            | 0.0062948144 |
|    clip_fraction        | 0.0695       |
|    clip_range           | 0.2          |
|    entropy_loss         | -1.42        |
|    explained_variance   | 0.000331     |
|    learning_r

<stable_baselines3.ppo.ppo.PPO at 0x7ffbd002aa00>

Wrap the environment in an observation noise wrapper. This wrapper will sample noise from a normal distribution with a mean of zero and a standard deviation of 1.

These parameters can be changed to make the noise more challenging. For example, you can use a non-zero mean, which immediately makes the environment much harder than a zero mean does, or increase the standard deviation so the noise has a greater spread.

You can also try a different noise type. The available options are:
- `"normal"`
- `"uniform"`
- `"discrete_uniform"`
- `"beta"`
- `"poisson"`

Each distribution takes the same keyword arguments as the NumPy function of the same name (e.g., `"normal"` takes the same arguments as `numpy.random.normal`, and so on).
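To make the link between the noise type string and the keyword arguments concrete, here is a minimal sketch of how such a lookup could work. It is not the `adversarial_gym` implementation: `sample_noise` is a hypothetical helper, and it uses NumPy's `Generator` methods, which accept the same distribution keywords as the legacy `numpy.random` functions. `"discrete_uniform"` is left out because the text above does not say which NumPy function it corresponds to.

```python
import numpy as np

# Noise types that share a name with a NumPy Generator method.
# Assumption: this mirrors the wrapper's behaviour only in spirit.
SUPPORTED = ("normal", "uniform", "beta", "poisson")


def sample_noise(noise_type, params, shape, seed=None):
    """Hypothetical helper: draw noise of the given shape from a named distribution."""
    if noise_type not in SUPPORTED:
        raise ValueError(f"unsupported noise type: {noise_type!r}")
    rng = np.random.default_rng(seed)
    # Look up the sampler by name and forward the user's keyword arguments.
    return getattr(rng, noise_type)(size=shape, **params)


# Add zero-mean, unit-variance Gaussian noise to a dummy 4-dimensional observation.
obs = np.zeros(4)
noisy_obs = obs + sample_noise("normal", dict(loc=0, scale=1), obs.shape, seed=0)
```

Switching distributions then only means changing the string and the dictionary, for example `sample_noise("uniform", dict(low=-1, high=1), obs.shape)`.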

In [9]:
env = ObsNoiseWrapper(env, "normal", dict(loc=0, scale=1))

Now, we can use stable-baselines to train on the environment for 10000 timesteps.

In [10]:
model = PPO('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=10000)

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 25.2     |
|    ep_rew_mean     | 25.2     |
| time/              |          |
|    fps             | 2354     |
|    iterations      | 1        |
|    time_elapsed    | 0        |
|    total_timesteps | 2048     |
---------------------------------
------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 24.7         |
|    ep_rew_mean          | 24.7         |
| time/                   |              |
|    fps                  | 1564         |
|    iterations           | 2            |
|    time_elapsed         | 2            |
|    total_timesteps      | 4096         |
| train/                  |              |
|    approx_kl            | 0.0026303488 |
|    clip_fraction        | 0.0157       |
|    clip_range           | 0.2          |
|    en

<stable_baselines3.ppo.ppo.PPO at 0x7ffbd66d75b0>

Even though we only ran for 10,000 timesteps, which is not enough to train a model to convergence, we can already see the impact of the added noise on the training process. The model trained on the environment *without* noise has already reached a mean episode reward of 65.5 at timestep 10240, while the model trained on the *noisy* environment has only reached 33.8.
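As a quick check on the size of that gap, the relative drop can be computed from the two reward values quoted above. The variable names are only for illustration.

```python
# Mean episode rewards quoted in the text at timestep 10240.
reward_clean = 65.5
reward_noisy = 33.8

# Relative drop in reward caused by the observation noise, as a percentage.
relative_drop = (reward_clean - reward_noisy) / reward_clean * 100
print(f"Relative drop: {relative_drop:.1f}%")
```

With these numbers the noisy run reaches a little over half the reward of the noise-free run.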

The `ActionNoiseWrapper` and `RewardNoiseWrapper` work in the same way as the `ObsNoiseWrapper` demoed here.
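To illustrate the general idea behind a reward noise wrapper, the sketch below perturbs only the reward returned by `step`. It is a toy under stated assumptions, not the library's code: `ToyEnv` and `ToyRewardNoise` are hypothetical classes that avoid any dependency on `gym`, and the constructor arguments simply mirror the `(env, noise_type, params)` pattern used for `ObsNoiseWrapper` above.

```python
import numpy as np


class ToyEnv:
    """Hypothetical stand-in environment that returns a reward of 1.0 every step."""

    def reset(self):
        return np.zeros(4)

    def step(self, action):
        # Old Gym-style 4-tuple: observation, reward, done, info.
        return np.zeros(4), 1.0, False, {}


class ToyRewardNoise:
    """Illustrative wrapper: adds sampled noise to the reward and nothing else."""

    def __init__(self, env, noise_type, params, seed=None):
        self.env = env
        self.params = params
        # Assumption: the noise type names a NumPy Generator method, e.g. "normal".
        self.sampler = getattr(np.random.default_rng(seed), noise_type)

    def reset(self):
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        noise = float(self.sampler(**self.params))
        return obs, reward + noise, done, info


noisy_env = ToyRewardNoise(ToyEnv(), "normal", dict(loc=0, scale=0.5), seed=0)
noisy_env.reset()
obs, reward, done, info = noisy_env.step(0)
```

An action noise wrapper would follow the same pattern, perturbing `action` before passing it to the wrapped environment instead of perturbing the reward afterwards.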