# Multi-Product Supermarket Inventory Management with pi_optimal

## Introduction

This notebook demonstrates how to use `pi_optimal` to optimize inventory management decisions for multiple product categories in a supermarket setting. We'll focus on finding the optimal ordering strategy for different product types: bakery items, dairy products, and fresh produce, each with different shelf lives, profit margins, and demand patterns.

The key challenges in multi-product inventory management include:
- **Stockouts**: Not having enough inventory leads to lost sales and unhappy customers
- **Waste**: Ordering too much leads to expired products and financial losses
- **Holding Costs**: Maintaining inventory incurs costs (refrigeration, space, handling)
- **Product Interactions**: Demand for one product may influence demand for others

Using reinforcement learning with `pi_optimal`, we'll develop an ordering strategy that balances these competing objectives across multiple product types.

## Reinforcement Learning Concepts

Before diving into the implementation, let's understand key reinforcement learning terms in the context of multi-product inventory management:

- **State**: Represents the current situation of our inventory system, including:
  - Current stock levels for each product category
  - Demand history for each product category
  - Day of week (affects demand patterns)
  - Recent waste and stockouts by product type

- **Action**: The decisions we make - specifically how many units to order for each product category (bakery, dairy, fresh produce).

- **Reward**: A numerical signal indicating how good or bad our decisions were. For multi-product inventory, rewards might consider:
  - Overall profit across all product categories
  - Category-specific performance
  - Balance between waste minimization and stockout prevention

- **Policy**: A strategy that determines what ordering decisions to make for each product type given the current state.

- **Offline RL**: Learning from historical inventory data rather than through direct trial-and-error. This approach is ideal for inventory optimization where experimentation can be costly.
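
To make these terms concrete, here is a minimal sketch of a single state, action, and reward step for one product. The function `simulate_day`, its prices, and its costs are hypothetical and much simpler than the notebook's environment: there is no spoilage, holding cost, or weekday effect.

```python
def simulate_day(inventory, order, demand, price=3.0, cost=1.0):
    """Toy one-product, one-day transition; not the notebook's environment."""
    available = inventory + order          # stock on the shelf after delivery
    sales = min(available, demand)         # cannot sell more than is available
    stockouts = demand - sales             # unmet demand
    next_inventory = available - sales     # carried into the next day
    reward = price * sales - cost * order  # simple profit signal
    return next_inventory, reward, stockouts


state = 5    # units currently in stock
action = 10  # units ordered today
next_state, reward, stockouts = simulate_day(state, action, demand=12)
print(next_state, reward, stockouts)
```

An offline RL dataset is essentially a log of many such (state, action, reward, next state) records produced by earlier ordering decisions.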

---
## Table of Contents

1. [Setup and Configuration](#setup-and-configuration)
2. [Understanding the Multi-Product Environment](#understanding-the-multi-product-environment)
3. [Data Collection](#data-collection)
   - [Simulating Random Ordering Strategies](#simulating-random-ordering-strategies)
   - [Exploring Collected Data](#exploring-collected-data)
4. [Defining Custom Reward Functions](#defining-custom-reward-functions)
5. [Training with pi_optimal](#training-with-pi_optimal)
   - [Dataset Preparation](#dataset-preparation)
   - [Agent Configuration](#agent-configuration)
   - [Training the Agent](#training-the-agent)
6. [Evaluating Performance](#evaluating-performance)
   - [Policy Visualization](#policy-visualization)
7. [Conclusion & Next Steps](#conclusion--next-steps)
---

## Setup and Configuration

First, let's import the necessary libraries and load our multi-product supermarket environment.

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# import seaborn as sns

import gymnasium as gym
import pi_optimal as po
from supermarket_env import MultiProductSupermarketEnv

# Set plotting style
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = [14, 8]
plt.rcParams['font.size'] = 12

In [None]:
# Create a multi-product supermarket environment with default settings
multi_product_env = MultiProductSupermarketEnv()

# Print the product names in the multi-product environment
print(f"Products available in the multi-product environment: {multi_product_env.product_names}")

## Understanding the Multi-Product Environment

Let's explore the properties of our multi-product environment to better understand what we're working with.

In [None]:
# Examine the observation and action spaces
print("Observation space:", multi_product_env.observation_space)
print("Action space:", multi_product_env.action_space)

# Get initial observation and info
initial_obs, info = multi_product_env.reset(seed=42)

# Print observation keys to understand what information is available
print("\nObservation keys:")
for key in initial_obs.keys():
    print(f"  - {key}: {type(initial_obs[key])}")

# Display sample info dictionary
print("\nInfo keys:")
for key in info.keys():
    print(f"  - {key}")

Let's examine the product characteristics to understand their differences:

### Product Configuration

The `product_configs` dictionary defines how each product behaves in the environment. For each product (e.g. `"fresh_produce"`, `"dairy"`, `"bakery"`), you’ll find:

| **Key**            | **Meaning**                                                                                             |
|--------------------|---------------------------------------------------------------------------------------------------------|
| `max_inventory`    | Maximum units you can stock at once.                                                                    |
| `shelf_life`       | Number of days before unsold items spoil.                                                               |
| `purchase_cost`    | Cost per unit when ordering from supplier.                                                               |
| `selling_price`    | Revenue per unit sold to customers.                                                                     |
| `holding_cost`     | Cost per unit per day for storage (e.g. refrigeration, shelf space).                                     |
| `stockout_cost`    | Penalty per unit of unmet demand (e.g. lost sale or backorder cost).                                     |
| `waste_cost`       | Penalty per spoiled unit at end of its shelf life.                                                      |
| `demand_mean`      | Average daily demand (in units) before noise/weekday adjustments.                                        |
| `demand_std`       | Standard deviation of daily demand (for stochastic sampling).                                           |
| `weekday_factors`  | List of 7 multipliers (Mon→Sun) that scale demand to capture weekday/weekend effects.                     |

Additionally, the shared `common_config` contains:

- `episode_length`: number of days per simulation episode (default 30).  
- `seed`: random seed for reproducibility (if set).  

By tuning these parameters, you can model everything from highly perishable produce to long‑lasting goods, and explore how your reward functions and ordering policies must adapt.
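
As a rough illustration of how `demand_mean`, `demand_std`, and `weekday_factors` could combine, the sketch below draws one week of demand. The configuration values and the `sample_demand` helper are assumptions made for this example; the environment's actual sampling logic is not shown in this notebook and may differ.

```python
import numpy as np

# Hypothetical configuration for illustration; not the environment's defaults
config = {
    "demand_mean": 20.0,
    "demand_std": 4.0,
    "weekday_factors": [0.9, 0.9, 1.0, 1.0, 1.2, 1.4, 1.1],  # Mon..Sun
    "purchase_cost": 1.5,
    "selling_price": 4.0,
}


def sample_demand(cfg, day_of_week, rng):
    """Draw one day's demand: weekday-scaled mean plus Gaussian noise, floored at 0."""
    mean = cfg["demand_mean"] * cfg["weekday_factors"][day_of_week]
    draw = rng.normal(mean, cfg["demand_std"])
    return max(0, int(round(draw)))


rng = np.random.default_rng(0)
week = [sample_demand(config, day, rng) for day in range(7)]
unit_margin = config["selling_price"] - config["purchase_cost"]

print("Sampled demand (Mon..Sun):", week)
print("Unit margin:", unit_margin)
```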

In [None]:
# Create a table showing the characteristics of each product
product_data = []

for product in multi_product_env.product_names:
    product_info = {
        'Product': product,
        'Shelf Life (days)': multi_product_env.product_configs[product]['shelf_life'],
        'Purchase Cost': multi_product_env.product_configs[product]['purchase_cost'],
        'Selling Price': multi_product_env.product_configs[product]['selling_price'],
        'Storage Cost': multi_product_env.product_configs[product]['holding_cost'],
        'Profit Margin': multi_product_env.product_configs[product]['selling_price'] - 
                        multi_product_env.product_configs[product]['purchase_cost']
    }
    product_data.append(product_info)

# Create a DataFrame to display
product_df = pd.DataFrame(product_data)
product_df = product_df.sort_values('Shelf Life (days)')  # sort_values returns a new DataFrame
product_df

## Data Collection

To train our RL agent, we need to collect data on different ordering strategies for multiple products. We'll implement and simulate various ordering policies.

### Simulating Random Ordering Strategies

Let's implement a function to collect data with different ordering strategies for multiple products.

In [None]:
def collect_multi_product_data(env, n_episodes=50, max_steps=30, ordering_strategies=None, random_seed=42):
    """
    Collect data by running multiple episodes with different ordering strategies
    for multiple products.
    """
    # Ensure reproducibility
    np.random.seed(random_seed)
    
    # Define default strategies if none provided
    if ordering_strategies is None:
        # -- Multi-product strategies --
        def random_small_order(obs, info):
            return tuple(np.array([np.random.randint(0, 21)]) for _ in env.product_names)
        
        def random_medium_order(obs, info):
            return tuple(np.array([np.random.randint(10, 31)]) for _ in env.product_names)
        
        def random_large_order(obs, info):
            return tuple(np.array([np.random.randint(20, 41)]) for _ in env.product_names)
        
        def replenish_to_target(obs, info, target=30):
            actions = []
            for pname in env.product_names:
                inv = np.sum(obs[f'{pname}_inventory'])
                actions.append(np.array([max(0, target - inv)]))
            return tuple(actions)
        
        def order_based_on_demand(obs, info):
            actions = []
            for pname in env.product_names:
                dh = obs[f'{pname}_demand_history']
                avg_d = np.mean(dh)
                actions.append(np.array([int(avg_d + 5)]))
            return tuple(actions)
        
        def product_specific_strategy(obs, info):
            # Compute each order by product name, then emit the actions in
            # env.product_names order so they line up with the action space.
            orders = {}

            # Bakery (shorter shelf life): order close to recent demand
            bakery_demand = np.mean(obs['bakery_demand_history'][-3:])
            orders['bakery'] = int(bakery_demand * 1.1)

            # Dairy (medium shelf life): maintain moderate stock
            dairy_inv = np.sum(obs['dairy_inventory'])
            orders['dairy'] = max(0, 25 - dairy_inv)

            # Fresh produce: cover 1.5x average demand, net of current stock
            fp_demand = np.mean(obs['fresh_produce_demand_history'])
            fp_inv = np.sum(obs['fresh_produce_inventory'])
            orders['fresh_produce'] = max(0, int(fp_demand * 1.5) - fp_inv)

            return tuple(np.array([orders[pname]]) for pname in env.product_names)
        
        ordering_strategies = [
            random_small_order,
            random_medium_order,
            random_large_order,
            replenish_to_target,
            order_based_on_demand,
            product_specific_strategy
        ]

    all_episodes = []

    for ep in range(n_episodes):
        strat = np.random.choice(ordering_strategies)
        strat_name = strat.__name__

        obs, info = env.reset(seed=random_seed + ep)
        for step in range(max_steps):
            action = strat(obs, info)
            obs, reward, terminated, truncated, info = env.step(action)
            if terminated or truncated:
                break

        df = env.get_episode_history()
        df['episode'] = ep
        df['strategy'] = strat_name
        all_episodes.append(df)

    return pd.concat(all_episodes, ignore_index=True)
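
Each strategy above returns one single-element array per product, in product order. The standalone sketch below shows that action format on a mock observation. The product names, the per-bucket inventory layout, and the numbers are assumptions for illustration, not values read from the environment.

```python
import numpy as np

product_names = ["bakery", "dairy", "fresh_produce"]  # assumed names

# Mock observation: units in stock per remaining-shelf-life bucket (assumed layout)
obs = {
    "bakery_inventory": np.array([4, 6]),
    "dairy_inventory": np.array([10, 15, 12]),
    "fresh_produce_inventory": np.array([3, 2, 0]),
}


def replenish_to_target(obs, product_names, target=30):
    """Order enough of each product to bring total stock back up to `target`."""
    actions = []
    for pname in product_names:
        inv = int(np.sum(obs[f"{pname}_inventory"]))
        actions.append(np.array([max(0, target - inv)]))
    return tuple(actions)


action = replenish_to_target(obs, product_names)
print([int(a[0]) for a in action])
```

Dairy already holds more than the target in this mock observation, so its order is zero.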

In [None]:
# Collect training data
train_df = collect_multi_product_data(multi_product_env, n_episodes=60, max_steps=30)

# Collect test data with a different seed so it does not replay the training episodes
test_df = collect_multi_product_data(multi_product_env, n_episodes=10, max_steps=20, random_seed=1234)

# Display the first few rows
train_df.head()

### Exploring Collected Data

Let's visualize the data to understand the relationships between different metrics for each product category.

In [None]:
# Summarize key metrics by product category with generic columns
def summarize_by_product(df):
    product_metrics = []
    
    for product in multi_product_env.product_names:
        # Extract metrics for this product but use generic column names
        metrics = {
            'product': product,
            'order': df[f'{product}_order'].mean(),
            'demand': df[f'{product}_demand'].mean(),
            'inventory': df[f'{product}_inventory'].mean(),
            'waste': df[f'{product}_waste'].mean(),
            'stockouts': df[f'{product}_stockouts'].mean(),
            'profit': df[f'{product}_profit'].mean()
        }
        product_metrics.append(metrics)
    
    # Create DataFrame from list of dictionaries
    return pd.DataFrame(product_metrics).set_index('product')

# Calculate summary by product
product_summary = summarize_by_product(train_df)
product_summary


In [None]:
# Visualize order vs. waste vs. stockouts for each product
fig, axes = plt.subplots(1, len(multi_product_env.product_names), figsize=(18, 6))

for i, product in enumerate(multi_product_env.product_names):
    ax = axes[i]
    scatter = ax.scatter(
        train_df[f'{product}_order'], 
        train_df[f'{product}_waste'],
        c=train_df[f'{product}_stockouts'],
        cmap='viridis',
        alpha=0.6
    )
    ax.set_title(product.replace('_', ' ').title())
    ax.set_xlabel('Order Quantity')
    ax.set_ylabel('Waste')
    ax.grid(True, alpha=0.3)
    
    # Add a colorbar
    cbar = plt.colorbar(scatter, ax=ax)
    cbar.set_label('Stockouts')

plt.suptitle('Order vs. Waste vs. Stockouts by Product Category', fontsize=16)
plt.tight_layout()
plt.show()

In [None]:
# Analyze profit distribution by strategy
strategy_performance = train_df.groupby('strategy').agg({
    'bakery_profit': 'mean',
    'dairy_profit': 'mean',
    'fresh_produce_profit': 'mean',
    'total_profit': 'mean',
    'total_waste': 'mean',
    'total_stockouts': 'mean'
}).reset_index()

# Sort by total profit
strategy_performance = strategy_performance.sort_values('total_profit', ascending=False)
strategy_performance

## Defining Custom Reward Functions

Let's define different reward functions to train our agent for different objectives in multi-product inventory management.

In [None]:
def profit_oriented_reward(row):
    """
    Calculates a reward focused on maximizing total profit across all products.
    
    Args:
        row: A pandas Series containing metrics for a single step
    
    Returns:
        The calculated reward value
    """
    # Simply return the total profit
    return row['total_profit']


# Apply the profit-oriented reward to our training data
train_df['reward'] = train_df.apply(profit_oriented_reward, axis=1)

This is where **you** get to shape the agent’s behavior by defining your own reward functions. Think of a reward function as the “business objective” you hand to the learner: anything you can express in terms of profit, cost, service, or quality can become part of your reward signal.

**What to try**  
- **Penalty terms**: discourage stockouts, spoilage, large order swings, or over‑holding.  
- **Bonus terms**: reward freshness (selling early), high service levels, or hitting promotional targets.  
- **Multi‑objective trade‑offs**: blend profit with waste, customer satisfaction, or inventory turns via weighted sums.  
- **Non‑linear incentives**: use thresholds, piecewise rewards, or saturating functions to prioritize certain ranges.

**How to add your function**  
1. Define a Python function that takes a single row of your simulation log (`train_df`) and returns a scalar reward.  
2. Apply it to the log with `train_df['reward'] = train_df.apply(your_reward, axis=1)`.  
3. Re‑train your agent and observe how its ordering policy shifts under your design!

Feel free to experiment wildly. Seeing how small changes in the shape of your reward can lead to very different ordering strategies is the heart of reward‑function engineering. Have fun, and don’t forget to visualize profit vs. waste vs. stockouts afterward!  

In [None]:
def your_custom_reward_function(row):
    """
    Custom reward function that considers multiple factors.
    
    Args:
        row: A pandas Series containing metrics for a single step
    
    Returns:
        The calculated reward value
    """
    # Placeholder: replace with your own logic, e.g. penalize waste and stockouts, reward profit
    return 0
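
As one possible starting point, the sketch below blends profit with penalties for waste and stockouts. The weights and the demo data are made up; only the column names `total_profit`, `total_waste`, and `total_stockouts` come from the simulation log used earlier in this notebook.

```python
import pandas as pd

# Hypothetical weights; tune them to your own business priorities
WASTE_WEIGHT = 0.5
STOCKOUT_WEIGHT = 1.0


def balanced_reward(row):
    """Profit minus weighted penalties for spoiled units and unmet demand."""
    return (
        row["total_profit"]
        - WASTE_WEIGHT * row["total_waste"]
        - STOCKOUT_WEIGHT * row["total_stockouts"]
    )


# Two made-up days with equal profit but different waste and stockouts
demo = pd.DataFrame({
    "total_profit": [100.0, 100.0],
    "total_waste": [0.0, 20.0],
    "total_stockouts": [0.0, 5.0],
})
demo["reward"] = demo.apply(balanced_reward, axis=1)
print(demo)
```

If the environment's profit figure already subtracts waste and stockout costs, these extra penalties count them twice. That can be intentional when you want a more conservative policy, but it is worth checking.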

## Training with pi_optimal

Now let's prepare the dataset and train a `pi_optimal` agent for multi-product inventory optimization.

### Dataset Preparation

Let's prepare the dataset for `pi_optimal`, focusing on features relevant to multi-product inventory management.

In [None]:

state_columns = ['day_of_week']

# Add product-specific state columns
for product in multi_product_env.product_names:
    state_columns.extend([
        f'{product}_inventory',  # Current inventory level
        f'{product}_waste',      # Recent waste
        f'{product}_stockouts',  # Recent stockouts
        f'{product}_demand'      # Recent demand
    ])

# Define action columns - one for each product
action_columns = [f'{product}_order' for product in multi_product_env.product_names]

lookback = 7 # Number of days to look back for state features 
reward_column = 'reward' 

# Create the dataset
train_dataset = po.datasets.timeseries_dataset.TimeseriesDataset(
    df=train_df,
    lookback_timesteps=lookback,
    unit_index='episode',         # Each episode is a separate unit
    timestep_column='day',        # Day is our timestep
    reward_column=reward_column,  # Use the specified reward
    state_columns=state_columns,  # State features
    action_columns=action_columns # Action features
)
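
With `lookback_timesteps=7`, each sample can draw on a week of history. The sketch below illustrates the general idea of lookback windows using NumPy on made-up inventory levels. It is an illustration only: how `pi_optimal` builds and pads its windows internally is not shown in this notebook.

```python
import numpy as np

lookback = 7

# Made-up inventory levels for one 10-day episode
inventory = np.array([30, 28, 25, 27, 22, 20, 24, 26, 23, 21])

# Each row holds `lookback` consecutive days; the last entry is the current day
windows = np.lib.stride_tricks.sliding_window_view(inventory, lookback)

print(windows.shape)
print(windows[0])
print(windows[-1])
```

A 10-day episode yields only four full 7-day windows, so short episodes leave few complete windows to learn from.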

### Agent Configuration

Let's set up the `pi_optimal` agent configuration suitable for multi-product inventory management.

In [None]:
# Define the model configuration for the multi-product agent
# We use two identically configured random forest models, a common choice for tabular data like this
model_config = [
    {
        "model_type": "RandomForest",
        "params": {
            "n_estimators": 150,
            "max_depth": None,
            "min_samples_leaf": 2
        }
    },
    {
        "model_type": "RandomForest",
        "params": {
            "n_estimators": 150,
            "max_depth": None,
            "min_samples_leaf": 2
        }
    }
]

### Training the Agent

Now let's train the agent.

In [None]:
from pi_optimal.agents.agent import Agent

# Train the profit-oriented agent
profit_agent = Agent()
profit_agent.train(train_dataset, model_config=model_config)


In [None]:
profit_agent.save()

## Evaluating Performance

Now let's evaluate how the agent performs on the test data.

In [None]:
# Apply the reward function to the test data
test_df['reward'] = test_df.apply(profit_oriented_reward, axis=1)

# Create the test dataset, reusing the trained agent's dataset configuration
current_dataset = po.datasets.timeseries_dataset.TimeseriesDataset(
                                    df=test_df,
                                    dataset_config=profit_agent.dataset_config,
                                    train_processors=False,
                                    is_inference=True
)

In [None]:
# Predict the best actions over the next 6 timesteps using the trained agent
best_actions = profit_agent.predict(current_dataset, horizon=6)

### Policy Visualization

Let's visualize the predicted actions from the agent to understand its decision-making patterns.

In [None]:
for i in range(len(best_actions)):
    print(f"Timestep {i}:")    
    for j, product in enumerate(multi_product_env.product_names):
        print(f"  - {product}: {best_actions[i][j]}")
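
Predicted order quantities may need post-processing before they can be placed as real orders. The sketch below assumes the predictions form a (timesteps x products) array of possibly fractional values. `raw_actions` and `max_order` are made-up stand-ins, not output from `profit_agent.predict`.

```python
import numpy as np

# Made-up predictions: one row per timestep, one column per product
raw_actions = np.array([[12.4, -1.3, 55.2],
                        [18.6, 7.2, 9.9]])
max_order = np.array([40, 40, 40])  # hypothetical per-product order caps

# Round to whole units, then clip each order to the range [0, max_order]
orders = np.clip(np.rint(raw_actions), 0, max_order).astype(int)
print(orders)
```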

In [None]:
from pi_optimal.utils.trajectory_visualizer import TrajectoryVisualizer

trajectory_visualizer = TrajectoryVisualizer(profit_agent, current_dataset, best_actions=best_actions)
trajectory_visualizer.display()

## Conclusion & Next Steps

You’ve now explored how to build and customize a multi‑product inventory RL environment, defined your own reward functions, and trained agents to balance profit, waste and service. Here are some ideas for where to take your sprint next:

1. **Extend the Environment**  
   - Add order lead‐times or supplier capacity constraints  
   - Model volume discounts, spoilage rates that change over time, or dynamic pricing incentives  

2. **Advance Your Reward Designs**  
   - Incorporate customer satisfaction metrics (e.g. fill‑rate targets)  
   - Experiment with non‑linear or thresholded penalties (e.g. steep drop after a stockout count)  

3. **Use Advanced Models**  
   - Add new models to our package to experiment with more complex neural networks

4. **Analyze & Share Your Findings**  
   - Compare your policies under different business scenarios  
   - Visualize trade‑offs between profit, waste and service levels  
   - Contribute your best reward functions or policy benchmarks back to the pi_optimal repository  
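
For the fill-rate idea mentioned above, a minimal sketch of the metric follows. The function name `fill_rate` and the daily figures are made up for illustration. In practice you would feed it the per-product demand and stockout columns from your simulation log.

```python
def fill_rate(demand, stockouts):
    """Share of demanded units that were actually served (1.0 = no stockouts)."""
    total_demand = sum(demand)
    if total_demand == 0:
        return 1.0  # nothing was demanded, so nothing went unserved
    return 1.0 - sum(stockouts) / total_demand


# Made-up daily figures for one product
demand = [20, 25, 18, 30]
stockouts = [0, 5, 0, 2]
print(round(fill_rate(demand, stockouts), 3))
```

A reward function could then subtract a penalty whenever the fill rate drops below a chosen target.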
