# Using wrappers to derive `InventoryEnvHard` from `InventoryEnv`

In the last chapter's exercises, you implemented `InventoryEnvHard` from scratch. In this chapter's exercises, your job is to write wrappers so that `InventoryEnvHard` can be derived by wrapping `InventoryEnv`. We are aiming for

```
inventory_env_hard = Wrapper1(Wrapper2(...WrapperN(InventoryEnv())...))
```

instead of 

```
inventory_env_hard = InventoryEnvHard()
```

This will drastically reduce code duplication and make the implementation clean and flexible.
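
The nesting works because every wrapper exposes the same `reset`/`step` interface as the environment it wraps and delegates to it, changing only one thing on the way. The following dependency-free sketch illustrates the idea. `ToyEnv`, `ToyWrapper`, `ClipAction` and `ScaleReward` are hypothetical stand-ins invented for illustration; in the exercises you would subclass `gym.Wrapper` instead, which takes care of the delegation.

```python
class ToyEnv:
    """Hypothetical stand-in for an environment: the reward is simply the action taken."""

    def reset(self):
        return 0.0  # initial observation

    def step(self, action):
        obs = float(action)
        reward = float(action)
        done = False
        return obs, reward, done, {}


class ToyWrapper:
    """Hypothetical base wrapper: forwards every call to the wrapped environment unchanged."""

    def __init__(self, env):
        self.env = env

    def reset(self):
        return self.env.reset()

    def step(self, action):
        return self.env.step(action)


class ClipAction(ToyWrapper):
    """Changes only the action: clips it to [0, max_action] before passing it on."""

    def __init__(self, env, max_action):
        super().__init__(env)
        self.max_action = max_action

    def step(self, action):
        clipped = min(max(action, 0), self.max_action)
        return self.env.step(clipped)


class ScaleReward(ToyWrapper):
    """Changes only the reward: multiplies it by a constant factor."""

    def __init__(self, env, factor):
        super().__init__(env)
        self.factor = factor

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return obs, self.factor * reward, done, info


# Wrappers compose by nesting; the innermost object is the original environment.
env = ScaleReward(ClipAction(ToyEnv(), max_action=10), factor=0.5)
env.reset()
obs, reward, done, info = env.step(25)
print(obs, reward)  # 10.0 5.0
```

The action 25 is first clipped to 10 by the inner wrapper, and the resulting reward of 10 is then halved by the outer one. Neither wrapper needs to know anything about the other.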

The `InventoryEnvHard` environment differs from `InventoryEnv` in two respects: the state definition and the reward function.

In this exercise, your job is to write a wrapper that changes the state definition of `InventoryEnv` by appending one additional element, `goodwill_penalty_per_unit`, at the end of the state.

I have included the `InventoryEnv` definition in the following code block. Please run it before starting this exercise.

In [1]:
import gym
from gym.spaces import Box
import numpy as np
from numpy.random import default_rng


class InventoryEnv(gym.Env):
    def __init__(self):
        """
        Must define self.observation_space and self.action_space here
        """
        self.max_capacity = 4000

        self.action_space = Box(low=np.array([0]), high=np.array([self.max_capacity]))

        self.lead_time = 5
        self.obs_dim = self.lead_time + 4

        self.max_mean_daily_demand = 200
        self.max_unit_selling_price = 100
        self.max_daily_holding_cost_per_unit = 5

        obs_low = np.zeros((self.obs_dim,))
        obs_high = np.array([self.max_capacity for _ in range(self.lead_time)] +
                            [self.max_mean_daily_demand, self.max_unit_selling_price,
                             self.max_unit_selling_price, self.max_daily_holding_cost_per_unit
                             ]
                            )
        self.observation_space = Box(low=obs_low, high=obs_high)

        self.rng = default_rng()

        self.current_obs = None
        self.episode_length_in_days = 90
        self.day_num = None

    def reset(self):
        """
        Returns: the observation of the initial state
        Reset the environment to initial state so that a new episode (independent of previous ones) may start
        """
        mean_daily_demand = self.rng.uniform() * self.max_mean_daily_demand
        selling_price = self.rng.uniform() * self.max_unit_selling_price
        buying_price = self.rng.uniform() * selling_price
        daily_holding_cost_per_unit = self.rng.uniform() * min(buying_price,
                                                               self.max_daily_holding_cost_per_unit
                                                               )
        self.current_obs = np.array([0 for _ in range(self.lead_time)] +
                                    [mean_daily_demand, selling_price, buying_price,
                                     daily_holding_cost_per_unit,
                                     ]
                                    )
        self.day_num = 0
        return self.current_obs

    def step(self, action):
        """
        Returns: the next observation, the reward, the done flag and a dict with optional additional info,
        given the current observation and the action
        """
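        # Cap the order so that on-hand stock plus orders in transit never exceed the capacity
        buys = min(action[0], self.max_capacity - np.sum(self.current_obs[:self.lead_time]))

        # Sample today's demand around the mean daily demand stored in the observation
        demand = self.rng.poisson(self.current_obs[self.lead_time])
        # Shift the order pipeline by one day and append the new order at its end;
        # the demand, price and cost entries are carried over unchanged
        next_obs = np.concatenate((self.current_obs[1: self.lead_time],
                                   np.array([buys]),
                                   self.current_obs[self.lead_time:]
                                   )
                                  )
        # Stock left over after serving demand is added to the delivery that arrives next
        next_obs[0] += max(0, self.current_obs[0] - demand)

        # Reward = revenue from units sold - purchase cost of the new order
        #          - holding cost of the leftover stock
        reward = (self.current_obs[self.lead_time + 1] * (self.current_obs[0] + self.current_obs[1] - next_obs[0]) -
                  self.current_obs[self.lead_time + 2] * buys -
                  self.current_obs[self.lead_time + 3] * (next_obs[0] - self.current_obs[1])
                  )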

        self.day_num += 1
        done = False
        if self.day_num >= self.episode_length_in_days:
            done = True

        self.current_obs = next_obs

        return self.current_obs, reward, done, {}

    def render(self, mode="human"):
        """
        Returns: None
        Show the current environment state e.g. the graphical window in `CartPole-v1`
        This method must be implemented, but it is OK to have an empty implementation if rendering is not
        important
        """
        pass

    def close(self):
        """
        Returns: None
        This method is optional. Used to clean up all resources (threads, graphical windows, etc.)
        """
        pass

    def seed(self, seed=None):
        """
        Returns: List of seeds
        This method is optional. Used to set seeds for the environment's random number generator for
        obtaining deterministic behavior
        """
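        # Re-create the random number generator so that subsequent sampling is reproducible
        self.rng = default_rng(seed)
        return [seed]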

Implement the wrapper in the following code block.

HINTS:

1. Which methods return the state? Perhaps you need to override some or all of them.
2. Don't forget about the observation space.
3. Can you use the convenience class `gym.ObservationWrapper` to simplify the implementation?
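
On hint 3: in the classic Gym API used in this notebook, `gym.ObservationWrapper` passes the observations returned by both `reset()` and `step()` through a single `observation()` method, so a subclass only has to override that one method (and, per hint 2, update the observation space). The sketch below imitates this mechanism without importing Gym, using a transformation unrelated to the exercise (rescaling each element to the range 0 to 1). `MiniObservationWrapper`, `ToyEnv`, `RescaleObservation` and the `obs_low`/`obs_high` attributes are hypothetical stand-ins, not Gym classes or attributes.

```python
class ToyEnv:
    """Hypothetical environment with a 2-element observation bounded by obs_low/obs_high."""

    def __init__(self):
        self.obs_low = [0.0, 0.0]
        self.obs_high = [4000.0, 200.0]
        self.state = None

    def reset(self):
        self.state = [1000.0, 50.0]
        return list(self.state)

    def step(self, action):
        self.state[0] = min(self.state[0] + action, self.obs_high[0])
        return list(self.state), 0.0, False, {}


class MiniObservationWrapper:
    """Stand-in imitating the idea behind gym.ObservationWrapper (classic 4-tuple step API)."""

    def __init__(self, env):
        self.env = env

    def reset(self):
        # The initial observation goes through the hook ...
        return self.observation(self.env.reset())

    def step(self, action):
        # ... and so does every subsequent observation.
        obs, reward, done, info = self.env.step(action)
        return self.observation(obs), reward, done, info

    def observation(self, obs):
        raise NotImplementedError


class RescaleObservation(MiniObservationWrapper):
    """Rescales each observation element to [0, 1] using the wrapped env's bounds."""

    def __init__(self, env):
        super().__init__(env)
        # The bounds change with the observation, so the declared space must change too.
        self.obs_low = [0.0 for _ in env.obs_low]
        self.obs_high = [1.0 for _ in env.obs_high]

    def observation(self, obs):
        return [(x - lo) / (hi - lo)
                for x, lo, hi in zip(obs, self.env.obs_low, self.env.obs_high)]


wrapped_toy = RescaleObservation(ToyEnv())
print(wrapped_toy.reset())          # [0.25, 0.25]
next_obs, _, _, _ = wrapped_toy.step(1000.0)
print(next_obs)                     # [0.5, 0.25]
```

Note that the subclass never touches `reset()` or `step()`: overriding the hook is enough to transform every observation the agent sees.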

In [None]:
class ModifyObservation(...):
    ...

Now test your wrapper. First, create a wrapped environment using the `ModifyObservation` wrapper.

In [None]:
wrapped = ...

## Tests

1. Reset your environment and obtain the initial state. Does the state have one more element compared to `InventoryEnv`? Is it in the expected range $0 \le \mathrm{goodwill\_penalty\_per\_unit} \le 10$? 

2. Reset the environment again. Do you get a different value for the last element?

3. Perform a random action on the wrapped environment and obtain the next state. Does the new state also contain the extra element, and do all of its values lie within the expected ranges?

4. Does `wrapped.observation_space` have 10 dimensions (instead of 9)?
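
Tests 1, 2 and 4 can also be automated. The helper below is a sketch that assumes the classic Gym API used in this notebook (`reset()` returns only the observation) and that the penalty is the last element of the state. `check_wrapped` and `StubEnv` are hypothetical names; `StubEnv` merely stands in for your wrapped environment so that the snippet runs on its own.

```python
import random
from types import SimpleNamespace


def check_wrapped(env, expected_dim=10, max_penalty=10.0, n_resets=5):
    """Return a dict of pass/fail flags for tests 1, 2 and 4."""
    observations = [env.reset() for _ in range(n_resets)]
    penalties = [float(obs[-1]) for obs in observations]
    return {
        # Test 1: one extra element, and the penalty lies in [0, max_penalty].
        "obs_length_ok": all(len(obs) == expected_dim for obs in observations),
        "penalty_in_range": all(0.0 <= p <= max_penalty for p in penalties),
        # Test 2: the penalty is resampled, so repeated resets should not all agree.
        "penalty_varies": len(set(penalties)) > 1,
        # Test 4: the declared observation space has the extra dimension as well.
        "space_shape_ok": tuple(env.observation_space.shape) == (expected_dim,),
    }


class StubEnv:
    """Hypothetical stand-in for the wrapped environment (NOT a solution to the exercise)."""

    observation_space = SimpleNamespace(shape=(10,))

    def reset(self):
        return [0.0] * 9 + [random.uniform(0.0, 10.0)]


results = check_wrapped(StubEnv())
print(results)
```

With your own solution, call `check_wrapped(wrapped)`; every flag should be `True`. Test 3 still deserves a manual look, since it involves stepping the environment.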