<a href="https://colab.research.google.com/github/lcbjrrr/quantai/blob/main/04_FIAP_Ext_RL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Reinforcement Learning (Gym Lib)

## Gym Library

The Gym library (now maintained as Gymnasium) is a Python toolkit for building and interacting with reinforcement learning environments. It defines a standard interface through which an agent observes the state of the environment, takes actions, and receives feedback in the form of rewards. Because every environment exposes the same interface, you can design, train, and test agents without depending on the internals of any particular environment. For example, you can use Gym to train an agent to balance a pole, play simple video games, or solve optimization problems. It acts as a sandbox in which the agent learns by trial and error.
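
The observe, act, and receive-reward cycle described above can be sketched without installing anything. The `LineWalkEnv` class below is a hypothetical toy environment, not part of Gym. It only mimics the `reset` and `step` signatures used by Gymnasium (and by Gym from version 0.26), where `reset` returns `(observation, info)` and `step` returns `(observation, reward, terminated, truncated, info)`. The helper `run_episode` and the goal and step-limit values are likewise assumptions made for illustration.

```python
import random


class LineWalkEnv:
    """Hypothetical toy environment mimicking Gymnasium's reset/step signatures."""

    def __init__(self, goal=3, max_steps=20):
        self.goal = goal
        self.max_steps = max_steps
        self.position = 0
        self.steps = 0

    def reset(self):
        self.position = 0
        self.steps = 0
        return self.position, {}  # (observation, info)

    def step(self, action):
        # Action 1 moves right; any other action moves left, never below 0.
        if action == 1:
            self.position += 1
        else:
            self.position = max(0, self.position - 1)
        self.steps += 1
        terminated = self.position == self.goal  # task solved
        truncated = (not terminated) and self.steps >= self.max_steps  # time limit hit
        reward = 1.0 if terminated else 0.0
        return self.position, reward, terminated, truncated, {}


def run_episode(env, policy):
    """Run one episode and return (total_reward, number_of_steps)."""
    observation, info = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(observation)
        observation, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        done = terminated or truncated
    return total_reward, env.steps


rng = random.Random(0)
print(run_episode(LineWalkEnv(), lambda obs: rng.choice([0, 1])))  # random agent
print(run_episode(LineWalkEnv(), lambda obs: 1))  # always move right
```

The loop in `run_episode` has the same shape for any environment that follows this interface, which is what lets an agent be swapped between environments without changes to the training loop.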

In [12]:
import gym
from gym import spaces
import numpy as np

class Environment(gym.Env):
    def __init__(self, max_steps=33):
        super().__init__()
        self.max_steps = max_steps
        self.current_step = 0
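        # Actions: 0 = C+ (increment C), 1 = C- (decrement C), 2 = L+, 3 = L-
        self.action_space = spaces.Discrete(4)
        # The observation is the pair (C, L). Each value grows by at most 1 per
        # step, so both stay within [0, max_steps].
        self.observation_space = spaces.Box(low=0, high=max_steps, shape=(2,), dtype=np.int64)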
        self.C = 0
        self.L = 0

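    def _get_obs(self):
        # Fix the dtype so the observation does not depend on the platform's default integer type
        return np.array([self.C, self.L], dtype=np.int64)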

    def _is_valid(self, C, L):
        return (2 * C + L <= 10) and (3 * C + L <= 12) and (C >= 0 and L >= 0)

    def _calc_reward(self, C, L):
        if self._is_valid(C, L):
            return 1.5 * C + 1.0 * L
        else:
            return -10  # heavy penalty for violating constraints

    def step(self, action):
        self.current_step += 1

        if action == 0:  # C+
            self.C += 1
        elif action == 1 and self.C > 0:  # C-
            self.C -= 1
        elif action == 2:  # L+
            self.L += 1
        elif action == 3 and self.L > 0:  # L-
            self.L -= 1

        # C and L cannot go below zero: the C-/L- branches above only fire
        # when the corresponding value is positive.

        reward = self._calc_reward(self.C, self.L)
        observation = self._get_obs()
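        # The episode ends once the step budget is exhausted.
        terminated = self.current_step >= self.max_steps
        # Simplified 3-tuple return. The Gym >= 0.26 / Gymnasium convention is
        # (observation, reward, terminated, truncated, info).
        return observation, reward, terminated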

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.C = 0
        self.L = 0
        self.current_step = 0

        observation = self._get_obs()
        info = {"current_C": self.C, "current_L": self.L}
        return observation, info

    def render(self, mode='human'):
        # For this problem, a simple print statement is sufficient for "human" mode
        if mode == 'human':
            print(f"Current State: C={self.C}, L={self.L}")

    def close(self):
        # No resources to close for this simple environment
        pass


class RandomAgent:
    def __init__(self, action_space):
        self.action_space = action_space

    def choose_action(self):
        return self.action_space.sample() # Randomly sample from the action space
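
The transition rule inside `step` can be checked in isolation. The sketch below re-implements it as a standalone function, `apply_action` (an illustrative name, not part of the class above), and replays the first five actions of the run logged in the next cell. It assumes the same action encoding as the comment in `__init__`.

```python
ACTION_NAMES = {0: "C+", 1: "C-", 2: "L+", 3: "L-"}


def apply_action(C, L, action):
    """Return the next (C, L); a decrement at zero leaves the state unchanged."""
    if action == 0:
        return C + 1, L
    if action == 1 and C > 0:
        return C - 1, L
    if action == 2:
        return C, L + 1
    if action == 3 and L > 0:
        return C, L - 1
    return C, L  # blocked decrement (or unknown action): no change


state = (0, 0)
for action in [2, 3, 2, 1, 0]:  # actions of steps 0-4 in the logged run
    state = apply_action(*state, action)
    print(ACTION_NAMES[action], state)
```

The fourth action (`C-` while `C` is 0) is a no-op, which matches step 3 of the log, where the state stays at `[0 1]`.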



In [13]:

env = Environment(max_steps=33)
agent = RandomAgent(env.action_space)

best_reward = 0
best_state = (0, 0)
best_step_num = -1

observation, info = env.reset()
done = False
current_step_num = 0

while not done:
    action = agent.choose_action()
    next_observation, reward, done = env.step(action)

    current_C, current_L = next_observation[0], next_observation[1]

    print(f"Step {current_step_num}: State={observation}, Action={action}, Next ={(current_C, current_L)}, Reward={reward}")

    if env._is_valid(current_C, current_L) and reward > best_reward:
        best_reward = reward
        best_state = (current_C, current_L)
        best_step_num = current_step_num

    observation = next_observation
    current_step_num += 1

env.close()

print("\n🧠 Best valid solution found:")
print(f"  C = {best_state[0]}, L = {best_state[1]}, P = {best_reward} (found at step {best_step_num})")

Step 0: State=[0 0], Action=2, Next =(np.int64(0), np.int64(1)), Reward=1.0
Step 1: State=[0 1], Action=3, Next =(np.int64(0), np.int64(0)), Reward=0.0
Step 2: State=[0 0], Action=2, Next =(np.int64(0), np.int64(1)), Reward=1.0
Step 3: State=[0 1], Action=1, Next =(np.int64(0), np.int64(1)), Reward=1.0
Step 4: State=[0 1], Action=0, Next =(np.int64(1), np.int64(1)), Reward=2.5
Step 5: State=[1 1], Action=0, Next =(np.int64(2), np.int64(1)), Reward=4.0
Step 6: State=[2 1], Action=0, Next =(np.int64(3), np.int64(1)), Reward=5.5
Step 7: State=[3 1], Action=0, Next =(np.int64(4), np.int64(1)), Reward=-10
Step 8: State=[4 1], Action=0, Next =(np.int64(5), np.int64(1)), Reward=-10
Step 9: State=[5 1], Action=0, Next =(np.int64(6), np.int64(1)), Reward=-10
Step 10: State=[6 1], Action=3, Next =(np.int64(6), np.int64(0)), Reward=-10
Step 11: State=[6 0], Action=2, Next =(np.int64(6), np.int64(1)), Reward=-10
Step 12: State=[6 1], Action=2, Next =(np.int64(6), np.int64(2)), Reward=-10
Step 13: 
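
Because the state space is tiny, the random agent's result can be checked against an exhaustive search. The sketch below assumes the same constraints and reward as `_is_valid` and `_calc_reward`, re-implemented as standalone functions (`is_valid` and `profit` are illustrative names), and the same step budget of 33. It is a reference baseline for comparison, not a reinforcement learning method.

```python
from itertools import product

MAX_STEPS = 33  # same episode length as in the run above


def is_valid(C, L):
    """Constraints copied from Environment._is_valid."""
    return (2 * C + L <= 10) and (3 * C + L <= 12) and C >= 0 and L >= 0


def profit(C, L):
    """Reward of a valid state, as in Environment._calc_reward."""
    return 1.5 * C + 1.0 * L


# A state (C, L) can be reached from (0, 0) in C + L increments,
# so only states with C + L <= MAX_STEPS are candidates.
candidates = [
    (C, L)
    for C, L in product(range(MAX_STEPS + 1), repeat=2)
    if C + L <= MAX_STEPS and is_valid(C, L)
]

best = max(candidates, key=lambda s: profit(*s))
print(best, profit(*best))  # (0, 10) 10.0
```

Any agent run on this environment can be judged by how close its best valid reward gets to this value.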

## Activity: Gym Lib

Your task is to revisit the problem you previously solved and re-implement your solution using OpenAI’s Gym library, which provides a standardized suite of environments for developing and testing reinforcement learning algorithms. Build your agent within the chosen environment and evaluate its performance using relevant metrics such as cumulative reward or the number of episodes required to achieve stable behavior. Finally, compare the results of your Gym-based implementation with your earlier approach.
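
The metrics mentioned above could be computed along these lines. This is a minimal sketch: the helper names are illustrative, and the stability criterion (the last `window` episode returns differ by at most `tolerance`) is only one possible definition, chosen here as an assumption.

```python
def episode_return(step_rewards):
    """Cumulative (undiscounted) reward of one episode."""
    return sum(step_rewards)


def moving_average(values, window):
    """Trailing moving average; the result has len(values) - window + 1 entries."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [sum(values[i - window:i]) / window for i in range(window, len(values) + 1)]


def episodes_until_stable(returns, window=3, tolerance=0.5):
    """Episodes needed until the last `window` returns differ by at most `tolerance`.

    Returns None if the criterion is never met.
    """
    for end in range(window, len(returns) + 1):
        recent = returns[end - window:end]
        if max(recent) - min(recent) <= tolerance:
            return end
    return None


# Rewards of steps 0-12 from the logged run (the log is cut off after that).
logged_rewards = [1.0, 0.0, 1.0, 1.0, 2.5, 4.0, 5.5, -10, -10, -10, -10, -10, -10]
print(episode_return(logged_rewards))  # -45.0

# Made-up per-episode returns, for illustration only (not measured results).
placeholder_returns = [2.0, 5.0, 9.0, 9.5, 9.5, 10.0]
print(moving_average(placeholder_returns, 3))
print(episodes_until_stable(placeholder_returns))  # 5
```

Reporting the same numbers for both the Gym-based implementation and the earlier approach gives a like-for-like comparison.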