# `rllib` algorithm (PPO) with default settings does not learn anything in our custom environment

<img src="images/no_learning.png" width="700"/>

# Two ways to get the agent to learn

1. Method 1: **Modify the environment** to make it easier for the agent to learn
2. Method 2: **Modify the `rllib` algorithm settings** 

## Method 1: How do we modify the environment to make it easier for the agent to learn?

`rllib` algorithms like `PPO` use deep neural nets for policy improvement

<img src="images/deep_rl.png" width="700"/> 

The neural net learns from visited states during *exploration* and uses that knowledge to improve the policy in unknown states during *exploitation*

<img src="images/generalize.png" width="700"/>

### The neural nets work better when the observations and actions are scaled to a standard interval

| Variable | Good range | Bad range |
| --- | --- | --- |
| Observation | $\left( -1, 1 \right)$, $\left( 0, 1 \right)$ | $\left( 0, 4000 \right)$, $\left( -300, 300 \right)$ |
| Action (only for `Box` action space) | $\left( -1, 1 \right)$, $\left( 0, 1 \right)$| $\left( 0, 4000 \right)$, $\left( -300, 300 \right)$ |
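The rescaling in the table can be sketched as a linear min-max map. This is a minimal illustration, not code from the environment: the helper names `scale_to_unit` and `unscale_from_unit` are hypothetical, and the bounds `0` and `4000` are borrowed from the "bad range" column purely as an example.

```python
import numpy as np


def scale_to_unit(x, low, high):
    """Map values from [low, high] linearly onto [-1, 1]."""
    x = np.asarray(x, dtype=np.float64)
    return 2.0 * (x - low) / (high - low) - 1.0


def unscale_from_unit(y, low, high):
    """Inverse map: take values in [-1, 1] back to [low, high]."""
    y = np.asarray(y, dtype=np.float64)
    return low + (y + 1.0) * (high - low) / 2.0


# Hypothetical raw values and bounds, chosen only for illustration.
raw = np.array([0.0, 1000.0, 4000.0])
scaled = scale_to_unit(raw, low=0.0, high=4000.0)
restored = unscale_from_unit(scaled, low=0.0, high=4000.0)

print(scaled)    # values -1.0, -0.5 and 1.0
print(restored)  # the original raw values
```

The inverse map matters for actions: the agent would emit values in $\left( -1, 1 \right)$, and the environment still needs them converted back to its native range.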

In [1]:
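from inventory_env.inventory_env import InventoryEnv

# Roll out one episode with random actions and print each raw observation
env = InventoryEnv()
obs = env.reset()
while True:
    action = env.action_space.sample()  # random action from the action space
    obs, r, done, _ = env.step(action)
    print(obs)
    if done:
        break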

[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 3.01903052e+03 1.27692539e+02 9.43980690e+01 3.94489580e+01
 1.17184951e+00]
[0.00000000e+00 0.00000000e+00 0.00000000e+00 3.01903052e+03
 9.80706665e+02 1.27692539e+02 9.43980690e+01 3.94489580e+01
 1.17184951e+00]
[0.00000000e+00 0.00000000e+00 3.01903052e+03 9.80706665e+02
 2.62817383e-01 1.27692539e+02 9.43980690e+01 3.94489580e+01
 1.17184951e+00]
[0.00000000e+00 3.01903052e+03 9.80706665e+02 2.62817383e-01
 0.00000000e+00 1.27692539e+02 9.43980690e+01 3.94489580e+01
 1.17184951e+00]
[3.01903052e+03 9.80706665e+02 2.62817383e-01 0.00000000e+00
 0.00000000e+00 1.27692539e+02 9.43980690e+01 3.94489580e+01
 1.17184951e+00]
[3.89473718e+03 2.62817383e-01 0.00000000e+00 0.00000000e+00
 0.00000000e+00 1.27692539e+02 9.43980690e+01 3.94489580e+01
 1.17184951e+00]
[3.77400000e+03 0.00000000e+00 0.00000000e+00 0.00000000e+00
 1.05000000e+02 1.27692539e+02 9.43980690e+01 3.94489580e+01
 1.17184951e+00]
[3.64800000e+03 0.00000000e



In [2]:
# Roll out one episode with random actions and print each raw action
obs = env.reset()
while True:
    action = env.action_space.sample()
    print(action)
    obs, r, done, _ = env.step(action)
    if done:
        break

[2819.21]
[2917.763]
[1263.5724]
[904.40155]
[2690.5845]
[3436.7837]
[3103.8987]
[2928.612]
[3235.6257]
[2306.3406]
[834.32416]
[255.38545]
[3493.4897]
[2667.1821]
[1491.7552]
[3883.3718]
[2594.145]
[2981.6672]
[1169.2024]
[3767.83]
[3341.0586]
[2865.3967]
[534.53143]
[676.79504]
[2059.7349]
[2919.43]
[228.20187]
[1631.8925]
[2447.3618]
[1652.7802]
[2804.589]
[2655.4734]
[2702.1873]
[1849.9341]
[2587.542]
[2418.0916]
[2159.426]
[2193.1501]
[2736.8752]
[2045.3375]
[1838.3569]
[3218.8494]
[1322.9371]
[3632.0747]
[1515.0603]
[1552.869]
[1275.1548]
[2209.0667]
[3903.706]
[1159.2866]
[3587.6445]
[1868.0546]
[2235.1768]
[3762.895]
[571.20447]
[1677.2751]
[3558.2402]
[894.3923]
[3054.7708]
[1276.8757]
[3365.2861]
[3596.924]
[1400.6174]
[2747.01]
[1676.6804]
[3813.6633]
[3970.4348]
[2771.1846]
[1039.4005]
[2650.8455]
[1153.9792]
[2423.8372]
[1555.0676]
[1555.9216]
[3336.9985]
[158.08917]
[1286.9114]
[1725.9342]
[1989.9186]
[1039.8077]
[1057.867]
[2680.121]
[3005.145]
[658.3809]
[708.3073]
[114

## Tasks

1. Modify the environment so that observations have a standard range such as $\left( -1, 1 \right)$ or $\left( 0, 1 \right)$.
2. Modify the environment so that actions have a standard range such as $\left( -1, 1 \right)$ or $\left( 0, 1 \right)$.

## RL algos work better/faster when rewards are non-sparse and have low variance

| Variable | Good variance | Acceptable variance | Bad variance |
| --- | --- | --- | --- |
| Reward | 1 | 10 | 1000 |
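One simple way to lower reward variance is to divide every reward by a fixed positive constant: the variance shrinks by the square of that constant, and the ranking of policies is unchanged. The sketch below assumes illustrative reward values of the magnitude a random policy produces in this environment, and the constant `REWARD_SCALE` is a hypothetical choice, not a recommended setting.

```python
import numpy as np

# Illustrative per-step rewards (assumed values, not a recorded run).
rewards = np.array([-6460.5, -21912.0, -102432.8, -28293.8, 4036.1, -7970.7])

# Hypothetical scale constant; dividing by c reduces the variance by c**2.
REWARD_SCALE = 10_000.0
scaled_rewards = rewards / REWARD_SCALE

print("variance before scaling:", np.var(rewards))
print("variance after scaling: ", np.var(scaled_rewards))
```

Scaling does not make rewards less sparse; it only changes their magnitude, so sparsity has to be addressed in the reward design itself.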

In [3]:
# Run 100 episodes with random actions and print every reward
for _ in range(100):
    obs = env.reset()
    while True:
        action = env.action_space.sample()
        obs, r, done, _ = env.step(action)
        print(r)
        if done:
            break

-6460.500713647208
-21912.04591269641
-102432.81758347484
-28293.818304237295
-14103.654692964157
4036.0907088461718
3391.2621881650084
-6755.828917161115
-7970.745599566568
-7839.319608595042
-8564.323979924497
-8376.019793696329
-6648.7064540264855
-8322.470631251534
-9487.889783149338
-7860.449257884262
-7888.559640021095
-8902.76483765844
-8137.411657557355
-7160.385670031493
-7897.49060451799
-9729.60526834455
-8086.083881971594
-8100.1079482380555
-7046.528882170062
-9651.030916025047
-7777.872450819811
-6918.131075690178
-7613.075768771037
-8390.872144301837
-7587.262471334588
-9480.76661507822
-7344.410429048044
-8818.122443228393
-7016.361705191811
-9264.285225363557
-7484.740211895063
-8570.36986524093
-8524.108462884868
-9037.808952667247
-7618.7559294831635
-6242.9482196217
-10324.744971732187
-7654.920106443868
-8748.652439947582
-8331.739572307843
-9060.396405448719
-7487.82540415719
-8325.666958451548
-8662.099906678048
-8040.699649288556
-8245.111425164469
-7713.2866437

## Tasks

1. Modify the environment so that observations have a standard range such as $\left( -1, 1 \right)$ or $\left( 0, 1 \right)$.
2. Modify the environment so that actions have a standard range such as $\left( -1, 1 \right)$ or $\left( 0, 1 \right)$.
3. Modify the environment to reduce reward variance.

# `gym` wrappers: a construct for modifying environments 

- Preserve the original env
- Avoid duplicated code

Result: smoother experiments
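As a rough sketch of how the three tasks could be expressed with wrappers, the block below subclasses `gym.ObservationWrapper`, `gym.ActionWrapper` and `gym.RewardWrapper`. Several things here are assumptions: the class names are hypothetical, `ToyEnv` is a stand-in that only declares spaces (it is not `InventoryEnv`), the `Box` bounds are assumed finite, and the reward scale of `10_000.0` is an arbitrary illustrative value. The demo calls the conversion methods directly rather than `step`, so it does not depend on which `reset`/`step` signature the installed `gym` version uses.

```python
import numpy as np

try:
    import gym
except ImportError:  # assumption: fall back to the gymnasium fork if gym is absent
    import gymnasium as gym


class ToyEnv(gym.Env):
    """Hypothetical stand-in env: only the (wide) spaces are defined."""
    observation_space = gym.spaces.Box(low=0.0, high=4000.0, shape=(2,), dtype=np.float32)
    action_space = gym.spaces.Box(low=0.0, high=4000.0, shape=(1,), dtype=np.float32)


class ScaleObservation(gym.ObservationWrapper):
    """Rescale Box observations from [low, high] to [-1, 1]."""

    def __init__(self, env):
        super().__init__(env)
        self._low = env.observation_space.low
        self._high = env.observation_space.high
        self.observation_space = gym.spaces.Box(
            low=-1.0, high=1.0, shape=env.observation_space.shape, dtype=np.float32
        )

    def observation(self, obs):
        scaled = 2.0 * (obs - self._low) / (self._high - self._low) - 1.0
        return scaled.astype(np.float32)


class UnscaleAction(gym.ActionWrapper):
    """Expose a [-1, 1] action space and map actions back to the env's range."""

    def __init__(self, env):
        super().__init__(env)
        self._low = env.action_space.low
        self._high = env.action_space.high
        self.action_space = gym.spaces.Box(
            low=-1.0, high=1.0, shape=env.action_space.shape, dtype=np.float32
        )

    def action(self, action):
        return self._low + (np.asarray(action) + 1.0) * (self._high - self._low) / 2.0


class ScaleReward(gym.RewardWrapper):
    """Divide every reward by a fixed positive constant."""

    def __init__(self, env, scale):
        super().__init__(env)
        self._scale = scale

    def reward(self, reward):
        return reward / self._scale


# Stack the wrappers; the original ToyEnv class is left untouched.
obs_env = ScaleObservation(ToyEnv())
act_env = UnscaleAction(obs_env)
wrapped = ScaleReward(act_env, scale=10_000.0)

print(obs_env.observation(np.array([0.0, 4000.0], dtype=np.float32)))  # -1 and 1
print(act_env.action(np.array([0.0], dtype=np.float32)))               # 2000
print(wrapped.reward(-8000.0))                                         # -0.8
```

With the real environment the same stacking would read `ScaleReward(UnscaleAction(ScaleObservation(InventoryEnv())), scale=...)`, provided `InventoryEnv` declares finite `Box` bounds for its observation and action spaces.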