## Creating a custom `gym` environment for the Inventory Management problem, Part 5: Gotchas in `step()` implementation

<img src="images/shop.png" width="500"/>

<img src="images/state_action_transition_rewards.png" width="1000"/>

In [1]:
import gym
from gym.spaces import Box
import numpy as np
from numpy.random import default_rng


class InventoryEnv(gym.Env):
    def __init__(self):
        """
        Must define self.observation_space and self.action_space here
        """
        
        # Define action space: bounds, space type, shape
        
        # Bound: Shelf space is limited
        self.max_capacity = 4000
        
        # Space type: Better to use Box than Discrete, since Discrete will lead to too many output nodes in the NN
        # Shape: rllib cannot handle scalar actions, so turn it into a numpy array with shape (1,)
        self.action_space = Box(low=np.array([0]), high=np.array([self.max_capacity]))
        
        # Define observation space: bounds, space type, shape
        
        # Shape: The lead time controls the shape of observation space
        self.lead_time = 5
        self.obs_dim = self.lead_time + 4
        
        # Bounds: Define high of the remaining observation space elements
        self.max_mean_daily_demand = 200
        self.max_unit_selling_price = 100
        self.max_daily_holding_cost_per_unit = 5
        
        obs_low = np.zeros((self.obs_dim,))
        obs_high = np.array([self.max_capacity for _ in range(self.lead_time)] +
                            [self.max_mean_daily_demand, self.max_unit_selling_price,
                             self.max_unit_selling_price, self.max_daily_holding_cost_per_unit
                             ]
                            )
        self.observation_space = Box(low=obs_low, high=obs_high)
        
        # The random number generator that will be used throughout the environment
        self.rng = default_rng()
        
        # All instance variables are defined in the __init__() method
        self.current_obs = None
        self.episode_length_in_days = 90
        self.day_num = None

    def reset(self):
        """
        Returns: the observation of the initial state
        Reset the environment to initial state so that a new episode (independent of previous ones) may start
        """
        # Sample parameter values from the parameter space
        
        # Set mean daily demand (lambda)
        mean_daily_demand = self.rng.uniform() * self.max_mean_daily_demand
        
        # Set selling price
        selling_price = self.rng.uniform() * self.max_unit_selling_price
        
        # Set buying price: buying price cannot be higher than selling price
        buying_price = self.rng.uniform() * selling_price
        
        # Set daily holding cost per unit: holding cost cannot be higher than buying_price
        daily_holding_cost_per_unit = self.rng.uniform() * min(buying_price,
                                                               self.max_daily_holding_cost_per_unit
                                                               )
        
        # Return the first observation
        self.current_obs = np.array([0 for _ in range(self.lead_time)] +
                                    [mean_daily_demand, selling_price, buying_price,
                                     daily_holding_cost_per_unit,
                                     ]
                                    )
        self.day_num = 0
        return self.current_obs

    def step(self, action):
        """
        Returns: Given current obs and action, returns the next observation, the reward, done and optionally additional info
        """
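        # Action looks like np.array([20.0]). We convert that to float 20.0 for easier calculation
        # and cap it at the remaining capacity, so on-shelf plus in-transit units never exceed max_capacity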
        buys = min(action[0], self.max_capacity - np.sum(self.current_obs[:self.lead_time]))
        
        # Compute next obs
        demand = self.rng.poisson(self.current_obs[self.lead_time])
        next_obs = np.concatenate((self.current_obs[1: self.lead_time],
                                   np.array([buys]),
                                   self.current_obs[self.lead_time:]
                                   )
                                  )
        next_obs[0] += max(0, self.current_obs[0] - demand)
        
        # Compute reward
        reward = (self.current_obs[self.lead_time + 1] * (self.current_obs[0] + self.current_obs[1] - next_obs[0]) -
                  self.current_obs[self.lead_time + 2] * buys - 
                  self.current_obs[self.lead_time + 3] * (next_obs[0] - self.current_obs[1])
                  )
                  
        # Compute done
        self.day_num += 1
        done = False
        if self.day_num >= self.episode_length_in_days:
            done = True
            
        self.current_obs = next_obs

        # info must be a dict
        return self.current_obs, reward, done, {}

    def render(self, mode="human"):
        """
        Returns: None
        Show the current environment state e.g. the graphical window in `CartPole-v1`
        This method must be implemented, but it is OK to have an empty implementation if rendering is not
        important
        """
        pass

    def close(self):
        """
        Returns: None
        This method is optional. Used to cleanup all resources (threads, graphical windows) etc.
        """
        pass
    
    def seed(self, seed=None):
        """
        Returns: List of seeds
        This method is optional. Used to set seeds for the environment's random number generator for 
        obtaining deterministic behavior
        """
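        # Re-create the generator so that episodes become reproducible for a given seed
        self.rng = default_rng(seed)
        return [seed]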

In [2]:
env = InventoryEnv()
for _ in range(1000):
    obs = env.reset()
    while True:
        action = np.array([3000])
        obs, r, done, _ = env.step(action)
        assert np.all(obs <= env.observation_space.high), f"Observation {obs} does not respect observation space upper bound {env.observation_space}"
        if done:
            break



AssertionError: Observation [5.99800000e+03 3.00000000e+03 3.00000000e+03 3.00000000e+03
 3.00000000e+03 6.94954109e-01 4.41213796e+00 4.10669941e+00
 4.00165833e-01] does not respect observation space upper bound Box([0. 0. 0. 0. 0. 0. 0. 0. 0.], [4000. 4000. 4000. 4000. 4000.  200.  100.  100.    5.], (9,), float32)
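The bounds check in the cell above can be wrapped in a reusable helper. The following is only a minimal sketch: `find_bound_violation` and `ToyPipelineEnv` are hypothetical names that do not appear in the original notebook, and the toy environment (zero demand, lead time 2, capacity 10) is a stand-in so that the block runs without `gym` installed. It assumes the old `gym` API used here, where `reset()` returns an observation and `step()` returns a 4-tuple.

```python
from types import SimpleNamespace


def find_bound_violation(env, policy, n_episodes=10):
    """Return the first observation that leaves env.observation_space, else None.

    Assumes the old gym API used in this notebook: reset() returns obs and
    step() returns (obs, reward, done, info).
    """
    low, high = env.observation_space.low, env.observation_space.high

    def out_of_bounds(obs):
        return any(x < lo or x > hi for x, lo, hi in zip(obs, low, high))

    for _ in range(n_episodes):
        obs = env.reset()
        if out_of_bounds(obs):
            return obs
        done = False
        while not done:
            obs, _, done, _ = env.step(policy(obs))
            if out_of_bounds(obs):
                return obs
    return None


class ToyPipelineEnv:
    """Hypothetical stand-in env: zero demand, lead time 2, capacity 10, no order cap."""

    def __init__(self):
        self.observation_space = SimpleNamespace(low=[0, 0], high=[10, 10])
        self.obs, self.day = None, 0

    def reset(self):
        self.obs, self.day = [0, 0], 0
        return self.obs

    def step(self, action):
        # Unsold shelf stock merges with the arriving order; the new order joins the queue
        self.obs = [self.obs[0] + self.obs[1], action[0]]
        self.day += 1
        return self.obs, 0.0, self.day >= 5, {}


# Always ordering 8 units overflows the shelf; never ordering keeps it within bounds
print(find_bound_violation(ToyPipelineEnv(), lambda obs: [8]))
print(find_bound_violation(ToyPipelineEnv(), lambda obs: [0]))
```

The helper only relies on `reset()`, `step()` and `observation_space.low`/`high`, so it should also accept `InventoryEnv()` with a policy such as `lambda obs: np.array([3000])`.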

| State | Day num | Buys | Demand |
| --- | --- | --- | --- |
| `[0, 0, 0, 0, 0, ...]` | 0 | 3000 | 0 |
| `[0, 0, 0, 0, 3000, ...]` | 1 | 3000 | 0 |
| `[0, 0, 0, 3000, 3000, ...]` | 2 | 3000 | 0 |
| `[0, 0, 3000, 3000, 3000, ...]` | 3 | 3000 | 0 |
| `[0, 3000, 3000, 3000, 3000, ...]` | 4 | 3000 | 0 |
| `[3000, 3000, 3000, 3000, 3000, ...]` | 5 | 3000 | 0 |
| `[6000, 3000, 3000, 3000, 3000, ...]` | 6 | 3000 | 0 |


- Total units that are already in the inventory or will eventually arrive in the shop: `np.sum(self.current_obs[:self.lead_time])`.
- Remaining inventory space in case of `0` demand: `self.max_capacity - np.sum(self.current_obs[:self.lead_time])`.
- `buys = min(action[0], self.max_capacity - np.sum(self.current_obs[:self.lead_time]))`.

| State | Day num | Buys | Buys after imposing constraint | Demand |
| --- | --- | --- | --- | --- |
| `[0, 0, 0, 0, 0, ...]` | 0 | 3000 | 3000 | 0 |
| `[0, 0, 0, 0, 3000, ...]` | 1 | 3000 | 1000 | 0 |
| `[0, 0, 0, 3000, 1000, ...]` | 2 | 3000 | 0 | 0 |
| `[0, 0, 3000, 1000, 0, ...]` | 3 | 3000 | 0 | 0 |
| `[0, 3000, 1000, 0, 0, ...]` | 4 | 3000 | 0 | 0 |
| `[3000, 1000, 0, 0, 0, ...]` | 5 | 3000 | 0 | 0 |
| `[4000, 0, 0, 0, 0, ...]` | 6 | 3000 | 0 | 0 |
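The rows of the table above can be reproduced with a short zero-demand rollout. This is a sketch in plain Python rather than the environment itself: `simulate_pipeline` is a hypothetical helper, and it tracks only the first `lead_time` entries of the observation (on-shelf stock followed by orders in transit).

```python
def simulate_pipeline(action, max_capacity=4000, lead_time=5, days=7, constrained=True):
    """Roll the inventory pipeline forward under zero demand and a constant order.

    Returns one (day_num, state, executed_buys) tuple per day.
    """
    # pipeline[0] is on-shelf stock, the remaining entries are orders in transit
    pipeline = [0] * lead_time
    rows = []
    for day in range(days):
        buys = action
        if constrained:
            # Cap the order at the space left if nothing is sold
            buys = min(action, max_capacity - sum(pipeline))
        rows.append((day, list(pipeline), buys))
        # Zero demand: shelf stock carries over and merges with the next arrival
        leftover = pipeline[0]
        pipeline = pipeline[1:] + [buys]
        pipeline[0] += leftover
    return rows


for day, state, buys in simulate_pipeline(3000):
    print(day, state, buys)
```

Calling `simulate_pipeline(3000, constrained=False)` gives the unconstrained rollout instead, in which the on-shelf stock grows past `max_capacity`.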

- **Action postprocessing due to the constraint** can affect performance significantly: the agent believes it is performing one action while the environment executes a different one, and this mapping may be hard for the agent to learn.

- Try different ways of imposing the constraint and check whether they lead to different performance.

- Test your environment thoroughly, especially that the action and observation bounds are respected, and write unit tests if possible.
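As an illustration of the suggestion to try different ways of imposing the constraint, the sketch below contrasts three candidates. Only the first, clipping, is what `step()` in this notebook does. The other two (`scale_order`, `penalized_order`) and the `penalty_per_unit` parameter are assumptions made for illustration, not methods from the original, and whether any of them trains better would need to be checked empirically.

```python
def clip_order(raw_order, remaining_space):
    """Clipping, as in step(): execute at most the remaining space."""
    return max(0.0, min(raw_order, remaining_space))


def scale_order(raw_order, remaining_space, max_capacity=4000):
    """Hypothetical alternative: read the action as a fraction of max_capacity
    and apply that fraction to the remaining space, so every action stays feasible."""
    fraction = max(0.0, min(raw_order / max_capacity, 1.0))
    return fraction * remaining_space


def penalized_order(raw_order, remaining_space, penalty_per_unit=1.0):
    """Hypothetical alternative: clip, but also return a negative reward term
    proportional to the overshoot so the agent receives a signal about infeasible orders."""
    executed = clip_order(raw_order, remaining_space)
    overshoot = max(0.0, raw_order - remaining_space)
    return executed, -penalty_per_unit * overshoot


# The agent asks for 3000 units while only 1000 units of space remain
print(clip_order(3000, 1000))
print(scale_order(3000, 1000))
print(penalized_order(3000, 1000))
```

With clipping, many different raw actions collapse onto the same executed order, whereas scaling keeps distinct actions distinct and the penalty variant adds an explicit cost for overshooting.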

<img src="images/inv_sim.png" width="750"/>