### The Smart Supplier: Optimizing Orders in a Fluctuating Market - 6 Marks

Develop a reinforcement learning agent using dynamic programming to help a Smart Supplier decide which products to manufacture and sell each day to maximize profit. The agent must learn the optimal policy for choosing daily production quantities, considering its limited raw materials and the unpredictable daily demand and selling prices for different products.

#### **Scenario**
 A small Smart Supplier manufactures two simple products: Product A and Product B. Each day, the supplier has a limited amount of raw material. The challenge is that the market demand and selling price for Product A and Product B change randomly each day, making some products more profitable than others at different times. The supplier needs to decide how much of each product to produce to maximize profit while managing their limited raw material.

#### **Objective**
The Smart Supplier's agent must learn the optimal policy π∗ using dynamic programming (Value Iteration or Policy Iteration) to decide how many units of Product A and Product B to produce each day to maximize the total profit over the fixed number of days, given the daily changing market conditions and limited raw material.

### --- 1. Custom Environment Creation (SmartSupplierEnv) --- ( 1 Mark )

In [1]:
import numpy as np
import random

class SmartSupplierEnv:
    def __init__(self, num_days=5, initial_raw_material=10):
        # Environment parameters
        self.num_days = num_days
        self.initial_raw_material = initial_raw_material
        
        # Market states and their product prices
        self.market_states = {
            1: {'A_price': 8, 'B_price': 2},  # High Demand for A
            2: {'A_price': 3, 'B_price': 5}   # High Demand for B
        }
        
        # Product raw material costs
        self.rm_costs = {'A': 2, 'B': 1}
        
        # Action space definition
        self.actions = {
            0: {'A': 2, 'B': 0},  # Produce_2A_0B
            1: {'A': 1, 'B': 2},  # Produce_1A_2B
            2: {'A': 0, 'B': 5},  # Produce_0A_5B
            3: {'A': 3, 'B': 0},  # Produce_3A_0B
            4: {'A': 0, 'B': 0}   # Do_Nothing
        }
        
        # State space dimensions
        self.state_dimensions = {
            'day': range(1, num_days + 1),
            'raw_material': range(initial_raw_material + 1),
            'market_state': [1, 2]
        }
        
        # Initialize state
        self.reset()
    
    def reset(self):
        """Reset the environment to initial state"""
        self.current_day = 1
        self.current_raw_material = self.initial_raw_material
        self.current_market_state = random.choice([1, 2])
        return self.get_state()
    
    def get_state(self):
        """Return current state as a tuple"""
        return (self.current_day, self.current_raw_material, self.current_market_state)
    
    def calculate_reward(self, action):
        """Calculate reward for taking an action in current state"""
        action_quantities = self.actions[action]
        
        # Calculate required raw material
        required_rm = (action_quantities['A'] * self.rm_costs['A'] + 
                      action_quantities['B'] * self.rm_costs['B'])
        
        # Check if enough raw material
        if required_rm > self.current_raw_material:
            return 0  # Action fails, no reward
        
        # Calculate profit
        profit = (action_quantities['A'] * self.market_states[self.current_market_state]['A_price'] +
                 action_quantities['B'] * self.market_states[self.current_market_state]['B_price'])
        
        return profit
    
    def step(self, action):
        """Take a step in the environment"""
        # Calculate reward
        reward = self.calculate_reward(action)
        
        # Update state
        action_quantities = self.actions[action]
        required_rm = (action_quantities['A'] * self.rm_costs['A'] + 
                      action_quantities['B'] * self.rm_costs['B'])
        
        if required_rm <= self.current_raw_material:
            self.current_raw_material -= required_rm
        
        # Move to next day
        self.current_day += 1
        
        # Check if episode is done
        done = self.current_day > self.num_days
        
        if not done:
            # Reset raw material for next day
            self.current_raw_material = self.initial_raw_material
            # Randomly change market state
            self.current_market_state = random.choice([1, 2])
        
        return self.get_state(), reward, done

### --- 2. Dynamic Programming Implementation (Value Iteration) --- (2 Mark)

In [2]:
def value_iteration(env, gamma=1.0, theta=1e-6, max_iterations=1000):
    """Value Iteration algorithm to find optimal policy"""
    # Initialize value function
    V = {}
    for day in env.state_dimensions['day']:
        for rm in env.state_dimensions['raw_material']:
            for market in env.state_dimensions['market_state']:
                V[(day, rm, market)] = 0
    
    # Initialize terminal state values
    for rm in env.state_dimensions['raw_material']:
        for market in env.state_dimensions['market_state']:
            V[(env.num_days + 1, rm, market)] = 0
    
    # Value iteration
    for i in range(max_iterations):
        delta = 0
        # Update each state
        for day in env.state_dimensions['day']:
            for rm in env.state_dimensions['raw_material']:
                for market in env.state_dimensions['market_state']:
                    # Set current state
                    env.current_day = day
                    env.current_raw_material = rm
                    env.current_market_state = market
                    
                    # Find best action value
                    best_value = float('-inf')
                    for action in range(len(env.actions)):
                        next_state, reward, done = env.step(action)
                        value = reward + gamma * V[next_state]
                        best_value = max(best_value, value)
                        # Reset state for next action evaluation
                        env.current_day = day
                        env.current_raw_material = rm
                        env.current_market_state = market
                    
                    # Update value function
                    delta = max(delta, abs(V[(day, rm, market)] - best_value))
                    V[(day, rm, market)] = best_value
        
        # Check convergence
        if delta < theta:
            print(f"Value iteration converged after {i+1} iterations")
            break
    
    # Extract optimal policy
    policy = {}
    for day in env.state_dimensions['day']:
        for rm in env.state_dimensions['raw_material']:
            for market in env.state_dimensions['market_state']:
                best_action = None
                best_value = float('-inf')
                
                env.current_day = day
                env.current_raw_material = rm
                env.current_market_state = market
                
                for action in range(len(env.actions)):
                    next_state, reward, done = env.step(action)
                    value = reward + gamma * V[next_state]
                    if value > best_value:
                        best_value = value
                        best_action = action
                    env.current_day = day
                    env.current_raw_material = rm
                    env.current_market_state = market
                
                policy[(day, rm, market)] = best_action
    
    return V, policy

#### --- 3. Simulation and Policy Analysis ---  ( 1 Mark)

In [3]:
def simulate_policy(env, policy, num_episodes=1000):
    """Simulate the learned policy over multiple episodes"""
    total_rewards = []
    
    for episode in range(num_episodes):
        state = env.reset()
        episode_reward = 0
        done = False
        
        while not done:
            action = policy[state]
            state, reward, done = env.step(action)
            episode_reward += reward
        
        total_rewards.append(episode_reward)
    
    return np.mean(total_rewards), np.std(total_rewards)

def analyze_policy(policy, env):
    """Analyze and print snippets of the learned optimal policy"""
    print("\nPolicy Analysis:")
    print("-" * 50)
    
    # Analyze policy for different market states
    for market in [1, 2]:
        print(f"\nMarket State {market}:")
        print(f"Market State {market} prices: A=${env.market_states[market]['A_price']}, B=${env.market_states[market]['B_price']}")
        
        for day in range(1, env.num_days + 1):
            print(f"\nDay {day}:")
            for rm in range(env.initial_raw_material + 1):
                action = policy[(day, rm, market)]
                action_desc = {
                    0: "Produce_2A_0B",
                    1: "Produce_1A_2B",
                    2: "Produce_0A_5B",
                    3: "Produce_3A_0B",
                    4: "Do_Nothing"
                }[action]
                print(f"RM={rm}: {action_desc}")

#### --- 4. Impact of Dynamics Analysis --- (1 Mark)

In [4]:
# Main execution
if __name__ == "__main__":
    # Create environment
    env = SmartSupplierEnv()
    
    # Run value iteration
    V, policy = value_iteration(env)
    
    # Analyze policy
    analyze_policy(policy, env)
    
    # Simulate policy
    mean_reward, std_reward = simulate_policy(env, policy)
    print(f"\nAverage reward over 1000 episodes: {mean_reward:.2f} ± {std_reward:.2f}")
    
    # Print value function for key states
    print("\nValue function for key states:")
    print(f"Day 1, RM=10, Market=1: {V[(1, 10, 1)]:.2f}")
    print(f"Day 1, RM=10, Market=2: {V[(1, 10, 2)]:.2f}")


Policy Analysis:
--------------------------------------------------

Market State 1:
Market State 1 prices: A=$8, B=$2

Day 1:
RM=0: Produce_2A_0B
RM=1: Do_Nothing
RM=2: Produce_2A_0B
RM=3: Do_Nothing
RM=4: Produce_2A_0B
RM=5: Produce_2A_0B
RM=6: Produce_3A_0B
RM=7: Produce_3A_0B
RM=8: Produce_3A_0B
RM=9: Produce_3A_0B
RM=10: Produce_3A_0B

Day 2:
RM=0: Produce_2A_0B
RM=1: Do_Nothing
RM=2: Produce_3A_0B
RM=3: Produce_3A_0B
RM=4: Produce_2A_0B
RM=5: Produce_2A_0B
RM=6: Produce_3A_0B
RM=7: Produce_3A_0B
RM=8: Produce_3A_0B
RM=9: Produce_3A_0B
RM=10: Produce_3A_0B

Day 3:
RM=0: Produce_0A_5B
RM=1: Do_Nothing
RM=2: Produce_1A_2B
RM=3: Produce_2A_0B
RM=4: Produce_2A_0B
RM=5: Produce_2A_0B
RM=6: Produce_3A_0B
RM=7: Produce_3A_0B
RM=8: Produce_3A_0B
RM=9: Produce_3A_0B
RM=10: Produce_3A_0B

Day 4:
RM=0: Produce_1A_2B
RM=1: Produce_2A_0B
RM=2: Produce_2A_0B
RM=3: Produce_2A_0B
RM=4: Produce_2A_0B
RM=5: Produce_2A_0B
RM=6: Produce_3A_0B
RM=7: Produce_3A_0B
RM=8: Produce_3A_0B
RM=9: Produce_3A_