<a href="https://colab.research.google.com/github/lcbjrrr/quantai/blob/main/03_FIAP_Ext_RL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Reinforcement Learning (Optimization)


## The Lemonade Stand

**Scenario:**

You're setting up a small lemonade stand for an hour. You have a limited supply of two key ingredients: **lemons** and **sugar**. You want to make as much money as possible in that hour.

You can make two types of drinks:

1.  **Classic Lemonade:** Sells for a good price, but uses more lemons and sugar.
2.  **Light Lemonade:** Sells for a bit less, but uses fewer ingredients.

**Resources:**

* You have **10 lemons** total.
* You have **12 scoops of sugar** total.

**Ingredient Requirements and Profit per Cup:**

| Drink Type     | Lemons per Cup | Sugar Scoops per Cup | Profit per Cup ($) |
| :------------- | :------------- | :------------------- | :----------------- |
| Classic        | 2              | 3                    | 1.50               |
| Light          | 1              | 1                    | 1.00               |

**The Problem:**

How many cups of **Classic Lemonade** and **Light Lemonade** should you make to earn the **most profit** without using more lemons or sugar than you have?
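
Written as an optimization problem, the table and resource limits above become the following, where $C$ and $L$ (symbols chosen here to match the code's `self.C` and `self.L`) are the number of cups of Classic and Light. This is a sketch of the formulation, assuming only whole cups can be sold:

$$
\begin{aligned}
\max_{C,\,L}\quad & 1.5\,C + 1.0\,L \\
\text{subject to}\quad & 2C + L \le 10 \quad \text{(lemons)} \\
& 3C + L \le 12 \quad \text{(sugar)} \\
& C,\ L \in \{0, 1, 2, \dots\}
\end{aligned}
$$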


In [15]:
import random

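class Environment:
    """Lemonade stand: the state is (cups of Classic, cups of Light)."""

    def __init__(self):
        self.C = 0           # cups of Classic planned
        self.L = 0           # cups of Light planned
        self.max_steps = 33  # episode length
        self.steps = 0

    def get_state(self):
        return (self.C, self.L)

    def is_valid(self, C, L):
        # Lemons: 2 per Classic, 1 per Light, 10 available.
        # Sugar:  3 per Classic, 1 per Light, 12 available.
        return (2 * C + L <= 10) and (3 * C + L <= 12) and (C >= 0 and L >= 0)

    def step(self, action):
        # Counts never drop below zero, but "C+" and "L+" may push the plan
        # past the available ingredients; calc_reward penalizes that.
        if action == "C+":
            self.C += 1
        elif action == "C-" and self.C > 0:
            self.C -= 1
        elif action == "L+":
            self.L += 1
        elif action == "L-" and self.L > 0:
            self.L -= 1

        self.steps += 1
        reward = self.calc_reward(action)
        done = self.steps >= self.max_steps
        return (self.C, self.L), reward, done

    def calc_reward(self, action):
        # The reward depends only on the resulting state; `action` is unused.
        if self.is_valid(self.C, self.L):
            return 1.5 * self.C + 1.0 * self.L
        else:
            return -10  # heavy penalty for violating constraints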

class Agent:
    """Baseline agent: picks a random action each step and learns nothing."""

    def __init__(self):
        self.actions = ["C+", "C-", "L+", "L-"]

    def choose_action(self):
        return random.choice(self.actions)



In [16]:

env = Environment()
agent = Agent()

done = False
steps = 0
best_reward = float("-inf")
best_state = (0, 0)

# Run one episode, remembering the most profitable valid state visited.
while not done:
    state = env.get_state()
    action = agent.choose_action()
    (sC, sL), reward, done = env.step(action)

    print(f"Step {steps}: State={state}, Action={action}, Next State={(sC, sL)}, Reward={reward}")

    if reward > best_reward and env.is_valid(sC, sL):
        best_reward = reward
        best_state = (sC, sL)

    steps += 1

print("\n🧠 Best valid solution found:")
print(f"  C = {best_state[0]}, L = {best_state[1]}, P = {best_reward}")


Step 0: State=(0, 0), Action=L+, Next State=(0, 1), Reward=1.0
Step 1: State=(0, 1), Action=C+, Next State=(1, 1), Reward=2.5
Step 2: State=(1, 1), Action=C-, Next State=(0, 1), Reward=1.0
Step 3: State=(0, 1), Action=C+, Next State=(1, 1), Reward=2.5
Step 4: State=(1, 1), Action=L+, Next State=(1, 2), Reward=3.5
Step 5: State=(1, 2), Action=C+, Next State=(2, 2), Reward=5.0
Step 6: State=(2, 2), Action=L+, Next State=(2, 3), Reward=6.0
Step 7: State=(2, 3), Action=L-, Next State=(2, 2), Reward=5.0
Step 8: State=(2, 2), Action=L-, Next State=(2, 1), Reward=4.0
Step 9: State=(2, 1), Action=C+, Next State=(3, 1), Reward=5.5
Step 10: State=(3, 1), Action=L+, Next State=(3, 2), Reward=6.5
Step 11: State=(3, 2), Action=L-, Next State=(3, 1), Reward=5.5
Step 12: State=(3, 1), Action=L+, Next State=(3, 2), Reward=6.5
Step 13: State=(3, 2), Action=C+, Next State=(4, 2), Reward=-10
Step 14: State=(4, 2), Action=L-, Next State=(4, 1), Reward=-10
Step 15: State=(4, 1), Action=C-, Next State=(3, 1
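
Because the lemonade problem is tiny, the random search above can be checked against an exhaustive enumeration. The sketch below is not part of the original notebook: it restates the same two constraints and profit formula as standalone functions (`is_valid` and `profit` are names chosen here) and scans every whole-cup plan. With only 10 lemons, neither count can exceed 10, so an 11 x 11 grid covers all feasible plans.

```python
def is_valid(C, L):
    # Same constraints as Environment.is_valid: 10 lemons, 12 scoops of sugar.
    return 2 * C + L <= 10 and 3 * C + L <= 12 and C >= 0 and L >= 0


def profit(C, L):
    # Profit per cup taken from the table: $1.50 Classic, $1.00 Light.
    return 1.5 * C + 1.0 * L


# Enumerate every whole-cup plan that respects both resource limits.
candidates = [(C, L) for C in range(11) for L in range(11) if is_valid(C, L)]
best = max(candidates, key=lambda plan: profit(*plan))

print(best, profit(*best))  # prints: (0, 10) 10.0
```

The best plan is 10 cups of Light and no Classic. Reaching it from `(0, 0)` takes ten `L+` moves, so a single 33-step random walk may well miss it.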

# Activity: RL Optimization

Develop a reinforcement learning agent that autonomously learns, through trial and error, how many units of each product to make. For each problem below, the agent interacts with the production environment, observes the current production levels, and receives a reward equal to the revenue generated by its choices, with a penalty whenever it attempts to exceed the available resources.
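
One possible starting point is tabular Q-learning with epsilon-greedy exploration, sketched below on the lemonade stand. This is an illustration under stated assumptions, not the expected solution: the `step` function restates the lemonade transition and reward rules as a standalone function, and the hyperparameters (`episodes`, `alpha`, `gamma`, `epsilon`, `seed`) are arbitrary choices made for this sketch. Whether the agent actually reaches the best plan depends on those settings.

```python
import random
from collections import defaultdict

ACTIONS = ["C+", "C-", "L+", "L-"]


def is_valid(C, L):
    return 2 * C + L <= 10 and 3 * C + L <= 12 and C >= 0 and L >= 0


def step(state, action):
    """Same transition and reward rules as the lemonade Environment."""
    C, L = state
    if action == "C+":
        C += 1
    elif action == "C-" and C > 0:
        C -= 1
    elif action == "L+":
        L += 1
    elif action == "L-" and L > 0:
        L -= 1
    reward = 1.5 * C + 1.0 * L if is_valid(C, L) else -10
    return (C, L), reward


def train(episodes=500, max_steps=33, alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)  # q[(state, action)] -> estimated long-run value
    best_state, best_reward = (0, 0), 0.0
    for _ in range(episodes):
        state = (0, 0)
        for _ in range(max_steps):
            # Epsilon-greedy: explore sometimes, otherwise exploit the table.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            next_state, reward = step(state, action)
            # Q-learning update toward reward + discounted best next value.
            target = reward + gamma * max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (target - q[(state, action)])
            # A positive reward can only come from a valid state.
            if reward > best_reward:
                best_state, best_reward = next_state, reward
            state = next_state
    return q, best_state, best_reward


q, best_state, best_reward = train()
print(best_state, best_reward)
```

Unlike the random agent, this one keeps a table of action values and updates it after every step, so later episodes are steered by what earlier ones experienced. To reuse it on the problems below, replace `is_valid` and the reward expression in `step` with each problem's constraints and prices.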

**Problem 1: The Artisan Chocolate Maker**

A small artisan chocolate company makes two types of chocolate bars:
* **Dark Chocolate Bar:** Uses 3 units of cocoa and 2 units of sugar. Sells for \$4.
* **Milk Chocolate Bar:** Uses 2 units of cocoa and 2 units of sugar. Sells for \$3.

They have 30 units of cocoa and 24 units of sugar available. How many of each type of chocolate bar should they produce to maximize their revenue?
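
As a hint for getting started, Problem 1 can be expressed in the same shape as the lemonade `is_valid` and `calc_reward` methods. In this sketch `D` and `M` (names chosen here) are the number of Dark and Milk bars, and the `-10` penalty is simply carried over from the lemonade example as an assumption; a different penalty may work better.

```python
def is_valid(D, M):
    # Cocoa: 3 per Dark bar, 2 per Milk bar, 30 units available.
    cocoa_ok = 3 * D + 2 * M <= 30
    # Sugar: 2 per Dark bar, 2 per Milk bar, 24 units available.
    sugar_ok = 2 * D + 2 * M <= 24
    return cocoa_ok and sugar_ok and D >= 0 and M >= 0


def calc_reward(D, M):
    # Revenue: $4 per Dark bar, $3 per Milk bar; penalty if a limit is broken.
    if is_valid(D, M):
        return 4 * D + 3 * M
    return -10


print(calc_reward(2, 3))   # 2 Dark + 3 Milk bars -> 17
print(calc_reward(11, 0))  # 33 units of cocoa needed, only 30 available -> -10
```

The remaining problems follow the same pattern: one inequality per limited resource and one revenue expression.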

---


**Problem 2: The Toy Manufacturer**

A toy company produces two popular toys: "Robo-Buddy" and "Doll-Friend."
* **Robo-Buddy:** Requires 2 hours for assembly and 1 hour for painting. Sells for \$25.
* **Doll-Friend:** Requires 1 hour for assembly and 1 hour for painting. Sells for \$18.

They have 40 hours of assembly time and 25 hours of painting time available per week. How many of each toy should they produce to maximize their weekly revenue?

---

**Problem 3: The Urban Farmer**

An urban farmer has a small plot of land and wants to plant two types of herbs: "Basil" and "Mint."
* **Basil:** Requires 0.5 sq ft of land and 1 unit of water. Sells for \$6 per pot.
* **Mint:** Requires 0.25 sq ft of land and 1.5 units of water. Sells for \$4 per pot.

The farmer has 10 sq ft of land and 20 units of water available. How many pots of each herb should they grow to maximize their total sales value?

---

**Problem 4: The T-Shirt Printer**

A custom T-shirt printing business offers two designs: "Abstract Art" and "Nature Scene."
* **Abstract Art T-shirt:** Uses 2 units of ink and takes 10 minutes to print. Sells for \$12.
* **Nature Scene T-shirt:** Uses 1 unit of ink and takes 15 minutes to print. Sells for \$15.

They have 20 units of ink and 180 minutes of printing time available today. How many of each T-shirt design should they print to maximize their revenue?

---

**Problem 5: The Backpack Manufacturer**

A company manufactures two types of backpacks: "Daypack" and "Trekking Pack."
* **Daypack:** Requires 1.5 meters of fabric and 0.5 hours of labor. Sells for \$50.
* **Trekking Pack:** Requires 3 meters of fabric and 1 hour of labor. Sells for \$90.

They have 60 meters of fabric and 25 hours of labor available each day. How many of each backpack should they produce to maximize their daily revenue?

---

**Problem 6: The Coffee Shop Brewer**

A local coffee shop brews two specialty blends: "Morning Buzz" and "Evening Chill."
* **Morning Buzz:** Requires 150g of coffee beans and 5 minutes to brew per batch. Sells for \$20 per batch.
* **Evening Chill:** Requires 100g of coffee beans and 8 minutes to brew per batch. Sells for \$25 per batch.

They have 3000g of coffee beans and 120 minutes of brewing time available. How many batches of each blend should they brew to maximize their revenue?

---

**Problem 7: The Electronics Assembler**

An electronics company assembles two devices: "Smart Watch" and "Fitness Tracker."
* **Smart Watch:** Requires 3 units of Circuit A and 2 units of Circuit B. Sells for \$150.
* **Fitness Tracker:** Requires 1 unit of Circuit A and 2 units of Circuit B. Sells for \$100.

They have 30 units of Circuit A and 28 units of Circuit B in stock. How many of each device should they assemble to maximize their total sales value?

---


**Problem 8: The Furniture Maker**

A small custom furniture maker builds two types of chairs: "Dining Chair" and "Lounge Chair."
* **Dining Chair:** Requires 4 feet of wood and 1 hour of finishing time. Sells for \$80.
* **Lounge Chair:** Requires 6 feet of wood and 2 hours of finishing time. Sells for \$150.

They have 60 feet of wood and 20 hours of finishing time available per week. How many of each type of chair should they produce to maximize their weekly revenue?

---