# Write a reward wrapper to apply a penalty for unfulfilled demand (goodwill penalty)

In the last exercise, we wrote a `ModifyObservation` wrapper that modified the observation of `InventoryEnv` to return the state required by `InventoryEnvHard`. 

`InventoryEnvHard` also uses a different reward function that penalizes unfulfilled demand. In particular, we need to add the following term to the reward of `InventoryEnv`:

$$-k \max(0, d - I)$$

Here, $d$ is the realized demand for the day (sampled from a Poisson distribution), $I$ is the on-hand inventory, and $k$ is the `goodwill_penalty_per_unit`.
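To make the term concrete, here is a minimal sketch in plain Python. The function name `goodwill_penalty` and the sample numbers are illustrative assumptions and are not part of `InventoryEnv`.

```python
def goodwill_penalty(demand, on_hand_inventory, goodwill_penalty_per_unit):
    """Return the extra reward term -k * max(0, d - I)."""
    unfulfilled_units = max(0, demand - on_hand_inventory)
    return -goodwill_penalty_per_unit * unfulfilled_units


# Demand exceeds inventory: 120 - 100 = 20 units go unfulfilled.
shortfall_term = goodwill_penalty(
    demand=120, on_hand_inventory=100, goodwill_penalty_per_unit=2.5
)

# Inventory covers demand: nothing is unfulfilled, so the term is zero.
covered_term = goodwill_penalty(
    demand=80, on_hand_inventory=100, goodwill_penalty_per_unit=2.5
)

print(shortfall_term, covered_term)
```

The term is never positive, so it can only lower the reward, and it vanishes whenever the on-hand inventory covers the day's demand.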

In this exercise, your job is to write a reward wrapper `ModifyReward` so that we can instantiate the hard environment as follows:

```
inventory_env_hard = ModifyReward(ModifyObservation(env=InventoryEnv()))
```

Notice that the additional reward term depends on demand. However, the `step()` method of the lesson's `InventoryEnv` samples the demand from a Poisson distribution but neither stores it as an instance variable nor returns it in the `info` dict. The wrapper needs access to the demand to compute the additional term.

To fix this flaw, I have included an `InventoryEnv` implementation below that is slightly different from the one in the lesson. The only difference is that the `demand` is returned in the `info` dict. Please use the class below when writing your wrapper.

In [1]:
import gym
from gym.spaces import Box
import numpy as np
from numpy.random import default_rng


class InventoryEnv(gym.Env):
    def __init__(self):
        """
        Must define self.observation_space and self.action_space here
        """
        self.max_capacity = 4000

        self.action_space = Box(low=np.array([0]), high=np.array([self.max_capacity]))

        self.lead_time = 5
        self.obs_dim = self.lead_time + 4

        self.max_mean_daily_demand = 200
        self.max_unit_selling_price = 100
        self.max_daily_holding_cost_per_unit = 5

        obs_low = np.zeros((self.obs_dim,))
        obs_high = np.array([self.max_capacity for _ in range(self.lead_time)] +
                            [self.max_mean_daily_demand, self.max_unit_selling_price,
                             self.max_unit_selling_price, self.max_daily_holding_cost_per_unit
                             ]
                            )
        self.observation_space = Box(low=obs_low, high=obs_high)

        self.rng = default_rng()

        self.current_obs = None
        self.episode_length_in_days = 90
        self.day_num = None

    def reset(self):
        """
        Returns: the observation of the initial state
        Reset the environment to initial state so that a new episode (independent of previous ones) may start
        """
        mean_daily_demand = self.rng.uniform() * self.max_mean_daily_demand
        selling_price = self.rng.uniform() * self.max_unit_selling_price
        buying_price = self.rng.uniform() * selling_price
        daily_holding_cost_per_unit = self.rng.uniform() * min(buying_price,
                                                               self.max_daily_holding_cost_per_unit
                                                               )
        self.current_obs = np.array([0 for _ in range(self.lead_time)] +
                                    [mean_daily_demand, selling_price, buying_price,
                                     daily_holding_cost_per_unit,
                                     ]
                                    )
        self.day_num = 0
        return self.current_obs

    def step(self, action):
        """
        Returns: Given current obs and action, returns the next observation, the reward, done and optionally additional info
        """
        buys = min(action[0], self.max_capacity - np.sum(self.current_obs[:self.lead_time]))

        demand = self.rng.poisson(self.current_obs[self.lead_time])
        next_obs = np.concatenate((self.current_obs[1: self.lead_time],
                                   np.array([buys]),
                                   self.current_obs[self.lead_time:]
                                   )
                                  )
        next_obs[0] += max(0, self.current_obs[0] - demand)

        reward = (self.current_obs[self.lead_time + 1] * (self.current_obs[0] + self.current_obs[1] - next_obs[0]) -
                  self.current_obs[self.lead_time + 2] * buys -
                  self.current_obs[self.lead_time + 3] * (next_obs[0] - self.current_obs[1])
                  )

        self.day_num += 1
        done = False
        if self.day_num >= self.episode_length_in_days:
            done = True

        self.current_obs = next_obs
        
        # ----- THIS IS DIFFERENT FROM THE VIDEO LESSONS ----- #
        # We return the demand in the info dict so that any wrapper can access it
        return self.current_obs, reward, done, {"demand": demand}

    def render(self, mode="human"):
        """
        Returns: None
        Show the current environment state e.g. the graphical window in `CartPole-v1`
        This method must be implemented, but it is OK to have an empty implementation if rendering is not
        important
        """
        pass

    def close(self):
        """
        Returns: None
        This method is optional. Used to cleanup all resources (threads, graphical windows) etc.
        """
        pass

    def seed(self, seed=None):
        """
        Returns: List of seeds
        This method is optional. Used to set seeds for the environment's random number generator for
        obtaining deterministic behavior
        """
        return

Write your implementation of `ModifyReward` below.

HINT: You can access any instance variable defined in the wrapped env (e.g. `self.current_obs`) inside the wrapper's methods.

In [None]:
class ModifyReward(...):
    ...

Next, check that your `ModifyReward` wrapper works.

1. First, create an environment instance by chaining the `ModifyReward` and `ModifyObservation` wrappers (I have supplied the correct implementation of `ModifyObservation` below).
2. Run an episode and always take action 0. In the original `InventoryEnv`, this leads to zero reward at every step (no buying cost, no sales revenue, and no holding cost). In the wrapped environment, the reward should be negative on every day with positive demand, because all of that demand goes unfulfilled. Verify that this is the case.
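
The check in step 2 can be sketched as a small helper that works with any environment following the `reset()` / `step()` interface used above, where `step()` returns a 4-tuple of observation, reward, done, and info. The `StubShortageEnv` class is a hypothetical stand-in with fixed demands so that the snippet runs without `gym`. It is not the exercise solution, and in the notebook you would pass `inventory_env_hard` to the helper instead.

```python
def run_zero_action_episode(env):
    """Run one episode that always orders nothing and return the stepwise rewards."""
    env.reset()
    rewards = []
    done = False
    while not done:
        # InventoryEnv.step() reads action[0], so the action is a length-1 sequence.
        _, reward, done, _ = env.step([0])
        rewards.append(reward)
    return rewards


class StubShortageEnv:
    """Hypothetical stand-in: empty inventory, fixed demands, penalty-only reward."""

    def __init__(self, demands, goodwill_penalty_per_unit):
        self.demands = demands
        self.goodwill_penalty_per_unit = goodwill_penalty_per_unit
        self.day_num = None

    def reset(self):
        self.day_num = 0
        return [0]

    def step(self, action):
        demand = self.demands[self.day_num]
        # With nothing on hand, all of the day's demand is unfulfilled.
        reward = -self.goodwill_penalty_per_unit * max(0, demand - 0)
        self.day_num += 1
        done = self.day_num >= len(self.demands)
        return [0], reward, done, {"demand": demand}


stub_rewards = run_zero_action_episode(
    StubShortageEnv(demands=[3, 0, 5], goodwill_penalty_per_unit=2)
)

# No reward is positive, and days with positive demand are strictly negative.
assert all(reward <= 0 for reward in stub_rewards)
```

A day with zero realized demand yields a reward of exactly zero, so checking for non-positive rewards (plus at least one strictly negative reward over the episode) is a more robust test than requiring every step to be negative.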

In [1]:
from gym import ObservationWrapper


class ModifyObservation(ObservationWrapper):
    def __init__(self, env):
        super().__init__(env)
        self.max_goodwill_penalty_per_unit = 10
        obs_low = self.env.observation_space.low
        obs_high = self.env.observation_space.high
        self.observation_space = Box(
            low = np.append(obs_low, 0),
            high = np.append(obs_high, self.max_goodwill_penalty_per_unit)
        )
        
    def reset(self):
        self.goodwill_penalty_per_unit = self.env.rng.uniform() * self.max_goodwill_penalty_per_unit
        return super().reset()
    
    def observation(self, obs):
        return np.append(obs, self.goodwill_penalty_per_unit)

In [2]:
# Define the environment using the wrappers
inventory_env_hard = ...

# Run an episode, always taking action 0 (no buys). Check that the stepwise reward is negative whenever demand is positive.