### The Smart Supplier: Optimizing Orders in a Fluctuating Market

Develop a reinforcement learning agent using dynamic programming to help a Smart Supplier decide which products to manufacture and sell each day to maximize profit. The agent must learn the optimal policy for choosing daily production quantities, considering its limited raw materials and the unpredictable daily demand and selling prices for different products.

#### **Scenario**
A small Smart Supplier manufactures two simple products: Product A and Product B. Each day, the supplier has a limited amount of raw material. The market demand and selling price for Product A and Product B change randomly each day, making each product more or less profitable at different times. The supplier needs to decide how much of each product to produce to maximize profit within its limited raw material.

#### **Objective**
The Smart Supplier's agent must learn the optimal policy π∗ using dynamic programming (Value Iteration or Policy Iteration) to decide how many units of Product A and Product B to produce each day, maximizing the total profit over a fixed number of days (5 in this implementation) given the daily changing market conditions and limited raw material.

**1. Custom Environment Creation (SmartSupplierEnv)**

**Function to calculate Reward**

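Parameters:

1.   env: the SmartSupplierEnv instance; only its market_prices table is read.
2.   market: the current market state, a MarketState member used as the key into env.market_prices.
3.   a_units: the number of units of Product A produced in the current step.
4.   b_units: the number of units of Product B produced in the current step.

The reward is the day's sales revenue: units produced times the current selling price, summed over both products. Raw material acts only as a capacity limit and is not deducted from the reward.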

In [1]:
def calculateReward(env, market, a_units, b_units):
    reward = (a_units * env.market_prices[market]['A'] +
              b_units * env.market_prices[market]['B'])
    return reward
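
As a quick check of the reward formula, the sketch below tabulates the revenue of every action in both market states. It is a standalone illustration: the prices and action tuples are copied from the SmartSupplierEnv class defined further down, while the names `MARKET_PRICES`, `ACTIONS`, and `reward_table` are hypothetical helpers introduced only for this example.

```python
from enum import Enum


class MarketState(Enum):
    HIGH_DEMAND_A = 1
    HIGH_DEMAND_B = 2


# Values copied from SmartSupplierEnv.__init__ (assumed unchanged)
MARKET_PRICES = {
    MarketState.HIGH_DEMAND_A: {'A': 8, 'B': 2},
    MarketState.HIGH_DEMAND_B: {'A': 3, 'B': 5},
}
ACTIONS = {
    'Produce_2A_0B': (2, 0),
    'Produce_1A_2B': (1, 2),
    'Produce_0A_5B': (0, 5),
    'Produce_3A_0B': (3, 0),
    'Do_Nothing': (0, 0),
}


def reward_table():
    """Return {market: {action_name: revenue}} for every market/action pair."""
    return {
        market: {
            name: a_units * prices['A'] + b_units * prices['B']
            for name, (a_units, b_units) in ACTIONS.items()
        }
        for market, prices in MARKET_PRICES.items()
    }


# Print one row of revenues per market state
for market, row in reward_table().items():
    print(market.name, row)
```

The table makes the trade-off visible: three units of A earn the most when demand for A is high, and five units of B earn the most when demand for B is high.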

**SmartSupplierEnv Class Explanation**

This section defines the environment for the Smart Supplier problem as a class called SmartSupplierEnv. In reinforcement learning, the environment represents the world with which the agent interacts.

The class has **reset** and **step** methods, described below.

**Reset method**

Parameter:

1. self -  refers to the instance of the SmartSupplierEnv class.

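This code defines the reset method within the SmartSupplierEnv class. In reinforcement learning, the reset method starts a new episode (a complete run of the simulation, from beginning to end). Here it sets the day to 1, restores raw material to its maximum, draws a random market state, and returns the initial state as the tuple (day, raw material, market state value).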

**Step method**

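Parameters:

1. self - refers to the instance of the SmartSupplierEnv class.
2. action_name - the name of the action to execute; it must be one of the keys of self.actions, for example 'Produce_2A_0B'.

This code defines the step method within the SmartSupplierEnv class. In reinforcement learning, the step method is how the agent interacts with the environment. It takes an action as input and returns the resulting new state, the reward received, and whether the episode is finished. If the action needs more raw material than is available, nothing is produced and the reward is 0.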

In [2]:
import numpy as np
from enum import Enum

# Market states. Product prices per state and raw-material costs are
# defined in SmartSupplierEnv.__init__ below.

class MarketState(Enum):
    HIGH_DEMAND_A = 1
    HIGH_DEMAND_B = 2

class SmartSupplierEnv:
    def __init__(self):
        self.max_raw_material = 10
        self.max_days = 5
        self.product_a_rm_cost = 2
        self.product_b_rm_cost = 1

        # Define actions as name -> (num_A, num_B); the raw-material cost is computed in step()
        # Action ID mapping:
        # 0: Produce_2A_0B
        # 1: Produce_1A_2B
        # 2: Produce_0A_5B
        # 3: Produce_3A_0B
        # 4: Do_Nothing
        self.actions = {
            'Produce_2A_0B': (2, 0),
            'Produce_1A_2B': (1, 2),
            'Produce_0A_5B': (0, 5),
            'Produce_3A_0B': (3, 0),
            'Do_Nothing': (0, 0)
        }

        # Structure: {MarketState: {'A': price_of_A, 'B': price_of_B}}
        self.market_prices = {
            MarketState.HIGH_DEMAND_A: {'A': 8, 'B': 2},
            MarketState.HIGH_DEMAND_B: {'A': 3, 'B': 5}
        }

    def reset(self):
        """Reset the environment to initial state"""
        self.current_day = 1
        self.current_rm = self.max_raw_material
        self.current_market = np.random.choice(list(MarketState))
        return (self.current_day, self.current_rm, self.current_market.value)

    def step(self, action_name):
        """Execute one step in the environment"""
        if self.current_day > self.max_days:
            raise ValueError("Episode has already ended")

        # Get production quantities for the action
        a_units, b_units = self.actions[action_name]

        # Calculate required raw material
        required_rm = a_units * self.product_a_rm_cost + b_units * self.product_b_rm_cost

        # Check if action is feasible
        if required_rm <= self.current_rm:
            # Calculate profit
            profit = (a_units * self.market_prices[self.current_market]['A'] +
                    b_units * self.market_prices[self.current_market]['B'])
            self.current_rm -= required_rm
        else:
            # Action is not feasible, no production, no profit
            profit = 0

        # Prepare for next day
        self.current_day += 1
        self.current_rm = self.max_raw_material  # Daily reset
        self.current_market = np.random.choice(list(MarketState))  # Random market state

        # Check if episode is done
        done = self.current_day > self.max_days

        return (self.current_day, self.current_rm, self.current_market.value), profit, done
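
The feasibility check inside step can be illustrated on its own. The sketch below lists which actions fit within a given amount of raw material, using the per-unit costs (2 for A, 1 for B) and action tuples from the class above. `required_rm` and `feasible_actions` are hypothetical helper names, not part of the environment. Note that the environment itself does not reject an infeasible action; it simply returns zero profit for it.

```python
# Per-unit raw-material costs, as set in SmartSupplierEnv.__init__
RM_COST_A, RM_COST_B = 2, 1

# Action name -> (units of A, units of B), copied from the environment
ACTIONS = {
    'Produce_2A_0B': (2, 0),
    'Produce_1A_2B': (1, 2),
    'Produce_0A_5B': (0, 5),
    'Produce_3A_0B': (3, 0),
    'Do_Nothing': (0, 0),
}


def required_rm(action_name):
    """Raw material consumed by an action."""
    a_units, b_units = ACTIONS[action_name]
    return a_units * RM_COST_A + b_units * RM_COST_B


def feasible_actions(rm):
    """Names of the actions whose raw-material need fits within rm."""
    return [name for name in ACTIONS if required_rm(name) <= rm]


# Show how the feasible set grows with the available raw material
for rm in (0, 3, 4, 5, 6):
    print(rm, feasible_actions(rm))
```

With fewer than 4 units only Do_Nothing is feasible, and every action fits once 6 units are available.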


**2. Dynamic Programming Implementation**

**Value iteration method**

The function takes three arguments:

1.   env: This is an object representing the environment.
2.   gamma: This is the discount factor, typically a value between 0 and 1. It determines the importance of future rewards compared to immediate rewards. A gamma of 1.0 means future rewards are just as important as immediate rewards.
3.  theta: This is a small threshold value used to determine when the value iteration process has converged.

The function repeatedly sweeps over every state (day, raw material, market) and replaces its value V with the maximum, over all actions, of the immediate reward plus the discounted expected value of the next day's state. Each sweep runs backwards from the last day to the first. The next market state is drawn with probability 0.5 each, and raw material is reset to its maximum every day. An action that needs more raw material than is available earns zero reward. Iteration stops when the largest change in any state value falls below theta. The optimal policy is then derived by selecting, for each state, the action with the highest value.
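
Written out, the update applied to each state is the Bellman optimality backup below. The notation is an assumption introduced here for illustration and does not appear in the code: $d$ is the day, $r$ the available raw material, $m$ the market state, $(n_A, n_B)$ the units produced by action $a$, $p_A(m)$ and $p_B(m)$ the selling prices, and $r_{\max}$ the level to which raw material is reset each day.

$$
V(d, r, m) = \max_{a} \Big[ R(r, m, a) + \gamma \sum_{m'} \tfrac{1}{2}\, V(d+1, r_{\max}, m') \Big], \qquad V(d_{\max}+1, \cdot, \cdot) = 0
$$

$$
R(r, m, a) = \begin{cases} n_A\, p_A(m) + n_B\, p_B(m) & \text{if } 2 n_A + n_B \le r \\ 0 & \text{otherwise} \end{cases}
$$

The sum over $m'$ does not depend on $a$, so in this environment the maximizing action is the one with the largest immediate reward.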

In [3]:
def value_iteration(env, gamma=1.0, theta=1e-6):
    """Value Iteration algorithm to find optimal policy"""
    # Initialize value function
    V = np.zeros((env.max_days + 2, env.max_raw_material + 1, len(MarketState) + 1))

    # Create action list for easy indexing
    action_names = list(env.actions.keys())

    while True:
        delta = 0
        # Iterate through all possible states
        for day in range(env.max_days, 0, -1):
            for rm in range(env.max_raw_material + 1):
                for market in MarketState:
                    # Current state
                    state = (day, rm, market.value)

                    # Expected value of tomorrow's state. Raw material resets to
                    # its maximum and the market is drawn uniformly, so this does
                    # not depend on the action taken today.
                    next_rm = env.max_raw_material
                    next_value = 0
                    for next_market in MarketState:
                        next_state = (day + 1, next_rm, next_market.value)
                        next_value += 0.5 * V[next_state]

                    # Find the best action
                    max_value = -float('inf')
                    best_action = None

                    for action_name in action_names:
                        a_units, b_units = env.actions[action_name]
                        required_rm = a_units * env.product_a_rm_cost + b_units * env.product_b_rm_cost

                        # Immediate reward; an infeasible action produces nothing
                        if required_rm <= rm:
                            reward = calculateReward(env, market, a_units, b_units)
                        else:
                            reward = 0

                        # Calculate action value
                        action_value = reward + gamma * next_value

                        if action_value > max_value:
                            max_value = action_value
                            best_action = action_name

                    # Update delta
                    delta = max(delta, abs(max_value - V[state]))

                    # Update value function
                    V[state] = max_value

        # Check for convergence
        if delta < theta:
            break

    # Extract optimal policy
    policy = {}
    for day in range(1, env.max_days + 1):
        for rm in range(env.max_raw_material + 1):
            for market in MarketState:
                state = (day, rm, market.value) #State structure

                # Expected value of tomorrow's state (independent of today's action)
                next_rm = env.max_raw_material
                next_value = 0
                for next_market in MarketState:
                    next_state = (day + 1, next_rm, next_market.value)
                    next_value += 0.5 * V[next_state]

                # Find best action for this state
                best_action = None
                best_value = -float('inf')

                for action_name in action_names:
                    a_units, b_units = env.actions[action_name]
                    required_rm = a_units * env.product_a_rm_cost + b_units * env.product_b_rm_cost

                    # Immediate reward; an infeasible action produces nothing
                    if required_rm <= rm:
                        reward = calculateReward(env, market, a_units, b_units)
                    else:
                        reward = 0

                    action_value = reward + gamma * next_value
                    # Keep the action with the highest value (ties go to the first action listed)
                    if action_value > best_value:
                        best_value = action_value
                        best_action = action_name

                policy[state] = best_action
    return V, policy
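
Because the horizon is finite and each day's value depends only on the following day, a single backward pass is enough to obtain exact values. The sketch below is a compact cross-check under that assumption, restricted to states with full raw material (the only ones reached when the environment is run). The constants are copied from the environment; `best_daily_profit` and `backward_sweep` are hypothetical names introduced for this illustration.

```python
# (price of A, price of B) per market state, copied from the environment
PRICES = {'HIGH_DEMAND_A': (8, 2), 'HIGH_DEMAND_B': (3, 5)}
# (units of A, units of B) per action; (0, 0) is Do_Nothing
ACTIONS = [(2, 0), (1, 2), (0, 5), (3, 0), (0, 0)]
RM_COST = (2, 1)          # raw material per unit of A and of B
MAX_RM, MAX_DAYS = 10, 5


def best_daily_profit(market, rm=MAX_RM):
    """Largest one-day revenue among the actions that fit within rm."""
    p_a, p_b = PRICES[market]
    return max(
        a * p_a + b * p_b
        for a, b in ACTIONS
        if a * RM_COST[0] + b * RM_COST[1] <= rm
    )


def backward_sweep(gamma=1.0):
    """Return V[day][market] at full raw material via one backward pass."""
    # Terminal values: nothing can be earned after the last day
    V = {MAX_DAYS + 1: {m: 0.0 for m in PRICES}}
    for day in range(MAX_DAYS, 0, -1):
        # Markets are equally likely, so the expectation is a plain average
        expected_next = sum(V[day + 1].values()) / len(PRICES)
        V[day] = {m: best_daily_profit(m) + gamma * expected_next for m in PRICES}
    return V


V_check = backward_sweep()
print(V_check[1])
```

With gamma = 1.0 the day 1 values come out as 122.0 and 123.0, matching the RM = 10 rows of the value table printed by the notebook.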

**Print the Policy table.**

This function takes the learned optimal policy as input. The policy is a dictionary that maps each possible state to the best action to take in that state. A state is represented as a tuple containing the current day, the amount of raw material available, and the current market state.

In [4]:
from IPython.display import display, HTML

def print_policy_table(policy):
    # Print the policy in a readable table format
    print("\n---------------------------------------------------------")
    display(HTML("<b><font size='+2'>Optimal Policy Table:</font></b>"))
    print("---------------------------------------------------------\n")
    print(f"{'Day':<5}{'RM':<5}{'Market':<10}{'Best Action':<15}")
    print("-" * 35)

    for day in range(1, 6):
        for rm in range(11):  # raw material levels 0..10
            for market in MarketState:
                state = (day, rm, market.value)
                print(f"{day:<5} {rm:<5} {market.name:<10} {policy[state]:<15}")
        print("-" * 35 if day < 5 else "")

**Print Value table**

The purpose of this function is to print a subset of the learned optimal values in a formatted table, making it easier to understand how the expected future profit changes based on the day, available raw material, and market condition.

**The function prints values for day 1 and day 5 only.**

In [5]:
def print_value_table(V):
    #Print key values from the value function
    print("\n---------------------------------------------------------")
    display(HTML("<b><font size='+2'>Value Function Highlights:</font></b>"))
    print("---------------------------------------------------------\n")
    print(f"{'Day':<5}{'RM':<5}{'Market':<10}{'Value':<10}")
    print("-" * 30)

    # Print values for day 1 with various RM and both markets
    for rm in range(11):  # raw material levels 0..10
        for market in MarketState:
            state = (1, rm, market.value)
            print(f"{1:<5}{rm:<5}{market.name:<10}${V[state]:<10.2f}")

    # Print values for day 5 with various RM and both markets
    print("\n")
    for rm in range(11):  # raw material levels 0..10
        for market in MarketState:
            state = (5, rm, market.value)
            print(f"{5:<5}{rm:<5}{market.name:<10}${V[state]:<10.2f}")

**3. Optimal Policy Analysis**

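Let's analyze the learned optimal policy:

1.   Market State Impact:
  *   In Market State 1 (High Demand A), the policy favors Product A: it picks Produce_3A_0B (profit 24) whenever at least 6 units of raw material are available.
  *   In Market State 2 (High Demand B), the policy favors Product B: it picks Produce_0A_5B (profit 25) whenever at least 5 units are available.

2.   Raw Material Impact:
  *   With fewer than 4 units of raw material, no production action is feasible and the value equals the expected future profit alone (98.00 on day 1). The table still lists Produce_2A_0B for these states because all actions tie at zero reward and the first action in the list wins the tie; it is equivalent to Do_Nothing there.
  *   For intermediate RM, the policy picks the most profitable feasible action, such as Produce_1A_2B (profit 13) with 4 units in Market State 2, and Produce_2A_0B (profit 16) with 4 or 5 units in Market State 1.
  *   Because raw material is reset to 10 at the start of every day, only the RM = 10 rows are reached when the policy is run in the environment.

3.   Day Progression Impact:

  *   The best action is the same on every day. Today's choice affects neither tomorrow's raw material (daily reset) nor tomorrow's market (drawn at random), so the future term is identical for all actions and the policy simply maximizes the current day's profit.
  *   Only the values change with the day: each remaining day adds the expected daily profit of 24.50 (the average of 24 and 25), from 24.00 or 25.00 on day 5 up to 122.00 or 123.00 on day 1 with full raw material.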

**4. Performance Evaluation**

The evaluate_policy function simulates SmartSupplierEnv for num_episodes episodes (1000 by default).

In each episode, it starts by resetting the environment. Then, it follows the provided policy, which dictates the action to take in each state. It calculates the reward for each action and accumulates the episode_profit. After an episode finishes (after 5 days), the episode_profit is added to the total_profit. Finally, it returns the average_profit across all episodes, providing an estimate of the policy's expected performance.

In [6]:
# simulate policy function - Simulates the learned policy over multiple runs to evaluate performance

def evaluate_policy(env, policy, num_episodes=1000):
    total_profit = 0

    for _ in range(num_episodes):
        state = env.reset()
        episode_profit = 0
        done = False

        while not done:
            action = policy[state]
            next_state, reward, done = env.step(action)
            episode_profit += reward
            state = next_state

        total_profit += episode_profit

    average_profit = total_profit / num_episodes
    return average_profit
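
The average alone does not show how noisy a 1000-episode estimate is. One option, sketched below, is to collect the per-episode profits (which would need a small change to evaluate_policy so that it returns them) and report the standard error alongside the mean. `summarize_profits` is a hypothetical helper, and the sample numbers in the usage line are made up purely to exercise the function.

```python
import numpy as np


def summarize_profits(episode_profits):
    """Return (mean, standard error of the mean) for a list of episode profits."""
    profits = np.asarray(episode_profits, dtype=float)
    mean = profits.mean()
    # ddof=1 gives the sample standard deviation
    stderr = profits.std(ddof=1) / np.sqrt(len(profits))
    return mean, stderr


# Illustrative input only; real values would come from simulated episodes
mean, stderr = summarize_profits([120, 122, 124])
print(f"mean={mean:.2f}, stderr={stderr:.2f}")
```

A small standard error relative to the mean suggests that the number of episodes is sufficient for the comparison being made.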

**Main execution function.**

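This code block is the main entry point for running the Smart Supplier simulation.

It creates the environment, calls value_iteration() to compute the optimal state-value function (V∗) and policy, prints the policy and value tables, and then estimates the average profit of the policy over 1000 simulated episodes.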

In [7]:
# --- Main Execution ---
from IPython.display import display, HTML
if __name__ == "__main__":

    # Create environment
    env = SmartSupplierEnv()

    # Run value iteration
    # Update the state values and get the policy and value table.
    V, policy = value_iteration(env)

    # Print policy and value tables
    print_policy_table(policy)
    print_value_table(V)

    # Evaluate policy
    print(f"-----------------------------------------------------------------------------------------")
    print(f"-----------------------------------------------------------------------------------------")
    avg_profit = evaluate_policy(env, policy, 1000)
    display(HTML(f"<b><font size='+2'>Average total profit over 5 days: ${avg_profit:.2f}</font></b>"))
    print(f"-----------------------------------------------------------------------------------------")

    print("\n------Program completed successfully-----")



---------------------------------------------------------
Optimal Policy Table:
---------------------------------------------------------

Day  RM   Market    Best Action    
-----------------------------------
1     0     HIGH_DEMAND_A Produce_2A_0B  
1     0     HIGH_DEMAND_B Produce_2A_0B  
1     1     HIGH_DEMAND_A Produce_2A_0B  
1     1     HIGH_DEMAND_B Produce_2A_0B  
1     2     HIGH_DEMAND_A Produce_2A_0B  
1     2     HIGH_DEMAND_B Produce_2A_0B  
1     3     HIGH_DEMAND_A Produce_2A_0B  
1     3     HIGH_DEMAND_B Produce_2A_0B  
1     4     HIGH_DEMAND_A Produce_2A_0B  
1     4     HIGH_DEMAND_B Produce_1A_2B  
1     5     HIGH_DEMAND_A Produce_2A_0B  
1     5     HIGH_DEMAND_B Produce_0A_5B  
1     6     HIGH_DEMAND_A Produce_3A_0B  
1     6     HIGH_DEMAND_B Produce_0A_5B  
1     7     HIGH_DEMAND_A Produce_3A_0B  
1     7     HIGH_DEMAND_B Produce_0A_5B  
1     8     HIGH_DEMAND_A Produce_3A_0B  
1     8     HIGH_DEMAND_B Produce_0A_5B  
1     9     HIGH_DEMAND_A Produce_3A_0B  
1     9     HIGH_DEMAND_B Produce_0A_5B  
1     10    HIGH_DEMAND_A Pro

---------------------------------------------------------

Day  RM   Market    Value     
------------------------------
1    0    HIGH_DEMAND_A$98.00     
1    0    HIGH_DEMAND_B$98.00     
1    1    HIGH_DEMAND_A$98.00     
1    1    HIGH_DEMAND_B$98.00     
1    2    HIGH_DEMAND_A$98.00     
1    2    HIGH_DEMAND_B$98.00     
1    3    HIGH_DEMAND_A$98.00     
1    3    HIGH_DEMAND_B$98.00     
1    4    HIGH_DEMAND_A$114.00    
1    4    HIGH_DEMAND_B$111.00    
1    5    HIGH_DEMAND_A$114.00    
1    5    HIGH_DEMAND_B$123.00    
1    6    HIGH_DEMAND_A$122.00    
1    6    HIGH_DEMAND_B$123.00    
1    7    HIGH_DEMAND_A$122.00    
1    7    HIGH_DEMAND_B$123.00    
1    8    HIGH_DEMAND_A$122.00    
1    8    HIGH_DEMAND_B$123.00    
1    9    HIGH_DEMAND_A$122.00    
1    9    HIGH_DEMAND_B$123.00    
1    10   HIGH_DEMAND_A$122.00    
1    10   HIGH_DEMAND_B$123.00    


5    0    HIGH_DEMAND_A$0.00      
5    0    HIGH_DEMAND_B$0.00      
5    1    HIGH_DEMAND_A$0.00      
5 

-----------------------------------------------------------------------------------------
/n------Program completed successfully-----


**5. Impact of Dynamics Analysis**

**Fixed Market State 1 vs. Fluctuating Market**

1. Fixed Market State 1:

*   The policy always favors Product A.
*   The optimal action is Produce_3A_0B every day, worth 24 per day with full raw material.
*   Product B is never the best choice when 6 or more units of raw material are available.

2.  Fluctuating Market:

*   The policy must cover both market states and switches products with the market: Produce_3A_0B in Market State 1 and Produce_0A_5B in Market State 2.
*   Product B becomes the better choice whenever Market State 2 occurs (25 per day for five units of B versus 9 for three units of A).
*   The expected profit is 24.50 per day, slightly above the fixed Market State 1 case.
*   The overall strategy is adaptive rather than specialized.

The dynamic environment requires the agent to hold a policy that can capitalize on whichever market state emerges each day, rather than specializing in one product. In this implementation the adaptation is purely reactive: raw material resets daily and the next market state does not depend on the action, so nothing is gained by holding back in early days.

The key insight is that in a fluctuating market, the optimal policy adapts production to the current conditions, while in a fixed market it can specialize completely in the most profitable product.
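
The comparison can be made concrete with a short calculation. The sketch below assumes full raw material every day (so all five actions are feasible) and the prices from the environment. It computes the expected 5-day profit when the market is fixed in one state and when both states are equally likely. `best_profit` and `expected_total` are hypothetical helpers introduced for this example.

```python
# Selling prices per market state, copied from the environment
PRICES = {
    'HIGH_DEMAND_A': {'A': 8, 'B': 2},
    'HIGH_DEMAND_B': {'A': 3, 'B': 5},
}
# (units of A, units of B); every action fits within 10 units of raw material
ACTIONS = [(2, 0), (1, 2), (0, 5), (3, 0), (0, 0)]
DAYS = 5


def best_profit(market):
    """Best one-day revenue in the given market with full raw material."""
    prices = PRICES[market]
    return max(a * prices['A'] + b * prices['B'] for a, b in ACTIONS)


def expected_total(market_probs, days=DAYS):
    """Expected profit over `days` when markets follow `market_probs` each day."""
    per_day = sum(prob * best_profit(m) for m, prob in market_probs.items())
    return days * per_day


fixed_a = expected_total({'HIGH_DEMAND_A': 1.0})
fixed_b = expected_total({'HIGH_DEMAND_B': 1.0})
fluctuating = expected_total({'HIGH_DEMAND_A': 0.5, 'HIGH_DEMAND_B': 0.5})
print(fixed_a, fixed_b, fluctuating)
```

Under these assumptions the fluctuating market sits exactly halfway between the two fixed cases, because the policy reacts to each day's market and the days are independent.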
