## RL algorithms learn better and faster when rewards are dense and have low variance

| Variable | Good variance | Acceptable variance | Bad variance |
| --- | --- | --- | --- |
| rewards | 1 | 10 | 1000 |

### Variance of stepwise rewards in `InventoryEnv` (calculated over 10000 episodes)

In [1]:
from inventory_env.inventory_env import InventoryEnv
import numpy as np

env = InventoryEnv()
rewards = []
for _ in range(10000):
    obs = env.reset()
    while True:
        obs, r, done, _ = env.step(env.action_space.sample())
        rewards.append(r)
        if done:
            break
print(f"Variance of stepwise rewards is: {np.var(np.array(rewards))}")

Variance of stepwise rewards is: 133809964.03725684


<img src="images/state_action_transition_rewards.png" width="1000"/>

In [3]:
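# Limits used to bound the stepwise reward
max_capacity = 4000
max_unit_selling_price = 100
max_daily_holding_cost_per_unit = 5

upper_bound = max_unit_selling_price * max_capacity
print(f"upper bound: {upper_bound}")
lower_bound = -(max_unit_selling_price + max_daily_holding_cost_per_unit) * max_capacity
print(f"lower bound: {lower_bound}")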

upper bound: 400000
lower bound: -420000


<img src="images/reward_scaling/1.png" width="1000"/>

### We need a way to compress the range

#### Option 1: Linear map

<img src="images/reward_scaling/2.png" width="1000"/>
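A minimal sketch of one possible linear map, assuming the bounds computed earlier in this notebook. The helper name `linear_scale` and the choice to divide by the largest absolute bound (so that a zero reward stays at zero) are illustrative assumptions, not taken from the environment's code.

```python
# Reward bounds computed earlier in this notebook.
LOWER_BOUND = -420_000
UPPER_BOUND = 400_000


def linear_scale(reward, low=LOWER_BOUND, high=UPPER_BOUND):
    """Linearly map a reward into [-1, 1] (hypothetical helper).

    Dividing by the largest absolute bound keeps the sign of the
    reward and leaves a zero reward at zero.
    """
    scale = max(abs(low), abs(high))
    return reward / scale


print(linear_scale(LOWER_BOUND))  # most negative reward
print(linear_scale(UPPER_BOUND))  # most positive reward
print(linear_scale(1_000))        # a small reward ends up very close to 0
```

One likely drawback of a map like this: if most rewards are small compared with the bounds, they all land very close to zero after scaling.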

#### Option 2: `arctan` map

<img src="images/reward_scaling/3.png" width="1000"/>
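The `arctan` idea can be sketched as below. The helper name `arctan_scale`, the `scale` parameter, and the normalisation by 2/π (so that outputs fall in the open interval (-1, 1)) are assumptions made for illustration only.

```python
import math


def arctan_scale(reward, scale=1.0):
    """Squash a reward into the open interval (-1, 1) using arctan.

    `scale` (hypothetical parameter) sets how large a reward must be
    before the curve starts to flatten out.
    """
    return (2.0 / math.pi) * math.atan(reward / scale)


# Small inputs are mapped almost linearly; large ones saturate near +/-1.
for r in (-420_000, -1.0, 0.0, 1.0, 400_000):
    print(r, arctan_scale(r))
```

Unlike a linear map, `arctan` never leaves its bounded range no matter how extreme the reward is, while staying close to linear around zero.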

#### Map the most frequented range to the nearly linear part of `arctan`

<img src="images/reward_scaling/4.png" width="1000"/>

To calculate the average high and low reward scales, we assume the following values:

- number of items sold to customers: `max_mean_daily_demand / 2`
- number of items bought: `max_mean_daily_demand / 2`
- price at which an item is sold: `max_unit_selling_price / 2`
- price at which an item is bought: `max_unit_selling_price / 4`
- daily holding cost per unit: `max_daily_holding_cost_per_unit / 2`
- number of items held: `max_mean_daily_demand / 2`
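The assumptions listed above can be turned into a scale for `arctan` roughly as follows. This is a sketch under stated assumptions, not the implementation of `MyScaleReward`: the value of `MAX_MEAN_DAILY_DEMAND` is a hypothetical placeholder, and the way the typical high and low rewards are combined from the assumed quantities is one plausible reading.

```python
import math

# Environment limits quoted earlier in this notebook.
MAX_UNIT_SELLING_PRICE = 100
MAX_DAILY_HOLDING_COST_PER_UNIT = 5
# Hypothetical placeholder: the real value comes from the environment config.
MAX_MEAN_DAILY_DEMAND = 200

items_sold = MAX_MEAN_DAILY_DEMAND / 2
items_bought = MAX_MEAN_DAILY_DEMAND / 2
items_held = MAX_MEAN_DAILY_DEMAND / 2
selling_price = MAX_UNIT_SELLING_PRICE / 2
buying_price = MAX_UNIT_SELLING_PRICE / 4
holding_cost = MAX_DAILY_HOLDING_COST_PER_UNIT / 2

# Assumed reading: a typical good step is pure sales revenue; a typical
# bad step is purchase cost plus holding cost with no sales.
typical_high = items_sold * selling_price
typical_low = -(items_bought * buying_price + items_held * holding_cost)

# Dividing by this scale puts typical rewards inside [-1, 1],
# where arctan is still roughly linear.
scale = max(abs(typical_high), abs(typical_low))


def scale_reward(reward):
    """Map a raw reward into (-1, 1); typical rewards stay near-linear."""
    return (2.0 / math.pi) * math.atan(reward / scale)


print(typical_high, typical_low)
print(scale_reward(typical_high), scale_reward(-420_000))
```

With this choice, rewards in the frequently visited range keep their relative spacing, while rare extreme rewards are compressed towards -1 or 1.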

### Variance of stepwise rewards in the wrapped environment

In [3]:
import numpy as np

from inventory_env.inventory_env import InventoryEnv
from inventory_env.wrappers import MyScaleReward

env = MyScaleReward(InventoryEnv())
rewards = []
for _ in range(10000):
    obs = env.reset()
    while True:
        obs, r, done, _ = env.step(env.action_space.sample())
        rewards.append(r)
        if done:
            break
print(f"Variance of stepwise rewards is: {np.var(np.array(rewards))}")

Variance of stepwise rewards is: 0.6341989861289701


### Summary

- Reducing the variance of stepwise rewards helps RL algorithms learn faster.
- The main idea is **not** `arctan`; it is just one example. What matters is trying several ways of reducing the variance and running experiments to see how the agent performs with each.
- Sometimes very simple methods are effective. DeepMind simply clipped the rewards (as with `np.clip()`) to reduce their variance in its famous Atari-playing deep RL agent. It worked very well for that specific problem.
- Try to ensure the transformation is monotonic, i.e. if $r_1 \le r_2$, then $f(r_1) \le f(r_2)$.
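The monotonicity condition can be spot-checked on sample rewards, as in the sketch below. The helpers `clip_reward` and `is_monotonic`, the clipping range of [-1, 1], and the sample values are illustrative assumptions; `clip_reward` does for a scalar what `np.clip()` does element-wise on arrays. Passing the check on a few samples is evidence, not a proof.

```python
import math


def clip_reward(reward, low=-1.0, high=1.0):
    """Clip a reward to [low, high] (scalar analogue of np.clip)."""
    return max(low, min(high, reward))


def is_monotonic(transform, samples):
    """Return True if f(r1) <= f(r2) for every consecutive sorted pair."""
    ordered = sorted(samples)
    values = [transform(r) for r in ordered]
    return all(a <= b for a, b in zip(values, values[1:]))


samples = [-420_000, -2_750, -1, 0, 1, 5_000, 400_000]

print(is_monotonic(clip_reward, samples))       # clipping preserves order
print(is_monotonic(math.atan, samples))         # arctan preserves order
print(is_monotonic(lambda r: r * r, samples))   # squaring does not
```

A transformation such as squaring fails the check because it ranks a large loss above a small gain, which would reward the agent for the wrong behaviour.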