<a href="https://colab.research.google.com/github/gnoejh/ict1022/blob/main/Reinforcement/games.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Reinforcement Learning with OpenAI Gym Environments

This notebook demonstrates reinforcement learning on a custom game environment built with Gymnasium, the maintained successor to OpenAI Gym. We'll train agents with Stable Baselines3 and compare several algorithms on the same task.

## Brick Breaker with Reinforcement Learning

In this section, we'll implement a Brick Breaker game and train a reinforcement learning agent to play it using Gymnasium and Stable Baselines 3.

In [None]:
# Install required packages
# The extras specifier is quoted so the shell does not treat the brackets as a glob pattern
%pip install gymnasium "stable-baselines3[extra]" pygame numpy matplotlib

In [None]:
# Import libraries
import numpy as np
import matplotlib.pyplot as plt
import gymnasium as gym
from gymnasium import spaces
import pygame
from stable_baselines3 import PPO, A2C, DQN
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.callbacks import CheckpointCallback, EvalCallback

### Custom Brick Breaker Environment

We'll create a custom Gymnasium environment for the classic Brick Breaker game where:
- The player controls a paddle at the bottom of the screen
- A ball bounces around, breaking bricks when it hits them
- The goal is to break all bricks without letting the ball fall below the paddle
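
Before the full class, here is a dependency-free sketch of how this game state could be flattened into the observation vector the agent sees. The layout (five scalars followed by the brick grid) and the default sizes mirror the class in the next cell; the helper name `encode_observation` and the sample values are hypothetical and only serve as an illustration.

```python
def encode_observation(paddle_x, ball_x, ball_y, ball_vx, ball_vy,
                       bricks, width=400, height=500, ball_speed=5):
    """Flatten the game state into a list of floats in [0, 1]."""
    def clamp(value):
        return min(1.0, max(0.0, value))

    scalars = [
        clamp(paddle_x / width),
        clamp(ball_x / width),
        clamp(ball_y / height),
        # Velocities lie in [-ball_speed, ball_speed]; shift and scale to [0, 1]
        clamp((ball_vx / ball_speed + 1) / 2),
        clamp((ball_vy / ball_speed + 1) / 2),
    ]
    # Row-major flattening: 1.0 for an intact brick, 0.0 for a broken one
    flat_bricks = [float(cell) for row in bricks for cell in row]
    return scalars + flat_bricks


bricks = [[1] * 8 for _ in range(5)]  # 5 rows x 8 columns, all intact
obs = encode_observation(160, 200, 250, 0.0, 5.0, bricks)
print(len(obs))   # 45
print(obs[:5])    # [0.4, 0.5, 0.5, 0.5, 1.0]
```

A ball moving straight down at full speed maps to a vertical velocity entry of 1.0, and a stationary horizontal component maps to 0.5.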

In [None]:
class BrickBreakerEnv(gym.Env):
    metadata = {'render_modes': ['human', 'rgb_array'], 'render_fps': 30}
    
    def __init__(self, render_mode=None, width=400, height=500):
        super().__init__()
        
        # Game settings
        self.width = width
        self.height = height
        self.render_mode = render_mode
        
        # Paddle settings
        self.paddle_width = 80
        self.paddle_height = 10
        self.paddle_speed = 10
        
        # Ball settings
        self.ball_radius = 8
        self.ball_speed = 5
        
        # Brick settings
        self.brick_rows = 5
        self.brick_cols = 8
        self.brick_width = (self.width - 20) // self.brick_cols
        self.brick_height = 20
        
        # Define action space: 0=left, 1=stay, 2=right
        self.action_space = spaces.Discrete(3)
        
        # Define observation space
        # [paddle_x, ball_x, ball_y, ball_vx, ball_vy, flattened brick grid]
        obs_size = 5 + (self.brick_rows * self.brick_cols)
        self.observation_space = spaces.Box(
            low=0, high=1, shape=(obs_size,), dtype=np.float32
        )
        
        # Initialize pygame only when a window is needed ("human" mode),
        # so "rgb_array" mode also works on machines without a display
        if self.render_mode == "human":
            pygame.init()
            pygame.display.init()
            self.window = pygame.display.set_mode((self.width, self.height))
            self.clock = pygame.time.Clock()
            self.font = pygame.font.SysFont(None, 24)
    
    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        
        # Reset paddle position
        self.paddle_x = (self.width - self.paddle_width) // 2
        self.paddle_y = self.height - 30
        
        # Reset ball position and velocity
        self.ball_x = self.width // 2
        self.ball_y = self.height // 2
        
        # Random initial ball velocity
        angle = self.np_random.uniform(0.1, 0.9) * np.pi
        self.ball_vx = self.ball_speed * np.cos(angle)
        self.ball_vy = self.ball_speed * np.sin(angle)
        
        # Reset bricks
        self.bricks = np.ones((self.brick_rows, self.brick_cols), dtype=np.int8)
        
        # Reset game state
        self.score = 0
        self.steps = 0
        self.done = False
        
        # Return initial observation
        observation = self._get_observation()
        info = {}
        
        return observation, info
    
    def _get_observation(self):
        # Normalize values to [0,1] range
        paddle_x_norm = self.paddle_x / self.width
        ball_x_norm = self.ball_x / self.width
        ball_y_norm = self.ball_y / self.height
        
        # Normalize velocities to [-1,1] range and then map to [0,1]
        ball_vx_norm = (self.ball_vx / self.ball_speed + 1) / 2
        ball_vy_norm = (self.ball_vy / self.ball_speed + 1) / 2
        
        # Flatten brick grid
        brick_grid_flat = self.bricks.flatten()
        
        # Combine all components into a single array
        observation = np.concatenate([
            [paddle_x_norm, ball_x_norm, ball_y_norm, ball_vx_norm, ball_vy_norm],
            brick_grid_flat
        ])
        
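        # Clip so the observation stays inside the declared [0, 1] Box even on
        # the final step, when the ball has moved past the bottom edge
        return np.clip(observation, 0.0, 1.0).astype(np.float32)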
    
    def step(self, action):
        reward = 0
        self.steps += 1
        
        # Process action: move paddle
        if action == 0:  # Move left
            self.paddle_x = max(0, self.paddle_x - self.paddle_speed)
        elif action == 2:  # Move right
            self.paddle_x = min(self.width - self.paddle_width, self.paddle_x + self.paddle_speed)
        # action 1 means stay in place
        
        # Small negative reward for each step to encourage finishing quickly
        reward -= 0.01
        
        # Move ball
        self.ball_x += self.ball_vx
        self.ball_y += self.ball_vy
        
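        # Check wall collisions; force the velocity to point back into the
        # playfield so a second flip cannot send the ball out of bounds
        if self.ball_x <= self.ball_radius:
            self.ball_vx = abs(self.ball_vx)
        elif self.ball_x >= self.width - self.ball_radius:
            self.ball_vx = -abs(self.ball_vx)
        if self.ball_y <= self.ball_radius:
            self.ball_vy = abs(self.ball_vy)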
        
        # Check if ball is below paddle (game over)
        if self.ball_y >= self.height:
            self.done = True
            reward -= 10  # Big penalty for losing the ball
        
        # Check paddle collision
        if (self.ball_y + self.ball_radius >= self.paddle_y and 
            self.ball_y - self.ball_radius <= self.paddle_y + self.paddle_height and 
            self.ball_x + self.ball_radius >= self.paddle_x and 
            self.ball_x - self.ball_radius <= self.paddle_x + self.paddle_width):
            
            # Reverse y velocity
            self.ball_vy = -abs(self.ball_vy)
            
            # Adjust x velocity based on where the ball hit the paddle
            relative_intersect_x = (self.paddle_x + (self.paddle_width / 2)) - self.ball_x
            normalized_intersect_x = relative_intersect_x / (self.paddle_width / 2)
            bounce_angle = normalized_intersect_x * (np.pi / 3)  # Max angle: 60 degrees
            self.ball_vx = -self.ball_speed * np.sin(bounce_angle)
            
            # Reward for keeping the ball in play
            reward += 0.5
        
        # Check brick collisions
        for row in range(self.brick_rows):
            for col in range(self.brick_cols):
                if self.bricks[row, col] == 1:  # If brick exists
                    brick_x = col * self.brick_width + 10
                    brick_y = row * self.brick_height + 40
                    
                    # Check collision with this brick
                    if (self.ball_x + self.ball_radius >= brick_x and 
                        self.ball_x - self.ball_radius <= brick_x + self.brick_width and 
                        self.ball_y + self.ball_radius >= brick_y and 
                        self.ball_y - self.ball_radius <= brick_y + self.brick_height):
                        
                        # Remove brick
                        self.bricks[row, col] = 0
                        self.score += 1
                        
                        # Reward for breaking a brick
                        reward += 1.0
                        
                        # Determine bounce direction based on side of collision
                        dx = min(abs(self.ball_x - brick_x), abs(self.ball_x - (brick_x + self.brick_width)))
                        dy = min(abs(self.ball_y - brick_y), abs(self.ball_y - (brick_y + self.brick_height)))
                        
                        if dx < dy:  # Horizontal collision
                            self.ball_vx = -self.ball_vx
                        else:  # Vertical collision
                            self.ball_vy = -self.ball_vy
                        
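                        break  # Stop scanning this row
            else:
                continue  # No brick was hit in this row; check the next row
            break  # A brick was hit: process only one brick collision per step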
        
        # Check if all bricks are broken (win condition)
        if np.sum(self.bricks) == 0:
            self.done = True
            reward += 50  # Big reward for clearing all bricks
        
        # Return observation, reward, terminated flag, truncated flag, and info
        observation = self._get_observation()
        info = {"score": self.score}
        
        return observation, reward, self.done, False, info
    
    def render(self):
        if self.render_mode is None:
            return
            
        if self.render_mode == "human":
            # Process window events so the OS does not mark the window as unresponsive
            pygame.event.pump()

            # Clear the screen
            self.window.fill((0, 0, 0))
            
            # Draw paddle
            pygame.draw.rect(self.window, (255, 255, 255), 
                             (self.paddle_x, self.paddle_y, self.paddle_width, self.paddle_height))
            
            # Draw ball
            pygame.draw.circle(self.window, (255, 255, 255), 
                              (int(self.ball_x), int(self.ball_y)), self.ball_radius)
            
            # Draw bricks
            colors = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 0), (255, 0, 255)]
            for row in range(self.brick_rows):
                color = colors[row % len(colors)]
                for col in range(self.brick_cols):
                    if self.bricks[row, col] == 1:
                        pygame.draw.rect(self.window, color, 
                                        (col * self.brick_width + 10, row * self.brick_height + 40, 
                                         self.brick_width - 2, self.brick_height - 2))
            
            # Draw score
            score_text = self.font.render(f'Score: {self.score}', True, (255, 255, 255))
            self.window.blit(score_text, (10, 10))
            
            pygame.display.flip()
            self.clock.tick(self.metadata["render_fps"])
            return None
            
        elif self.render_mode == "rgb_array":
            # Placeholder: returns a blank frame, since drawing is only implemented for "human" mode
            return np.zeros((self.height, self.width, 3), dtype=np.uint8)
    
    def close(self):
        if self.render_mode == "human":
            pygame.display.quit()
            pygame.quit()
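
The paddle bounce in `step` is easier to follow with numbers. The sketch below reproduces just that formula outside the class, assuming the default paddle width of 80 and ball speed of 5; the function name `paddle_bounce_vx` is hypothetical.

```python
import math


def paddle_bounce_vx(paddle_x, ball_x, paddle_width=80, ball_speed=5):
    """Horizontal ball velocity after a paddle hit, using the same formula as step()."""
    paddle_center = paddle_x + paddle_width / 2
    # +1 when the ball hits the left edge, 0 at the center, -1 at the right edge
    offset = (paddle_center - ball_x) / (paddle_width / 2)
    bounce_angle = offset * (math.pi / 3)  # scaled so an edge hit gives 60 degrees
    return -ball_speed * math.sin(bounce_angle)


# Paddle spanning x = 160..240, so its center is at x = 200
for label, ball_x in [("left edge", 160), ("center", 200), ("right edge", 240)]:
    print(label, round(paddle_bounce_vx(160, ball_x), 2))
```

A hit on the left edge sends the ball left at about 4.33 pixels per step, a center hit removes the horizontal component entirely, and a hit on the right edge sends it right at about 4.33. Because the collision test pads the paddle by the ball radius, the offset can slightly exceed 1, so the steepest bounces are a little beyond 60 degrees.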

### Register Custom Environment with Gymnasium

Register our Brick Breaker environment with Gymnasium so we can use it with Stable Baselines3.

In [None]:
# Register the custom environment
from gymnasium.envs.registration import register

register(
    id='BrickBreaker-v0',
    entry_point='__main__:BrickBreakerEnv',
    max_episode_steps=2000,
)

### Training with Stable Baselines3

Now we'll set up and train a reinforcement learning agent using PPO from Stable Baselines3.

In [None]:
# Create and wrap the environment
def make_env():
    env = gym.make('BrickBreaker-v0')
    env = Monitor(env)
    return env

# Create a vectorized environment
env = DummyVecEnv([make_env])

# Create a callback for saving checkpoints
checkpoint_callback = CheckpointCallback(
    save_freq=10000,
    save_path="./brickbreaker_model_checkpoints/",
    name_prefix="brickbreaker_model"
)

# Define policy network parameters
policy_kwargs = dict(
    net_arch=[64, 64]
)

# Set up the PPO agent
model = PPO(
    "MlpPolicy",
    env,
    policy_kwargs=policy_kwargs,
    learning_rate=0.0003,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    verbose=1
)

In [None]:
# Train the agent
model.learn(
    total_timesteps=300000,
    callback=checkpoint_callback,
    progress_bar=False  # Disable progress bar to avoid LiveError
)

# Save the final model
model.save("brickbreaker_ppo_final")
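
To get a feel for what these hyperparameters imply, the following back-of-the-envelope sketch works through the arithmetic in plain Python. It assumes that PPO collects whole rollouts of `n_steps` per environment until the timestep budget is reached, and that with a single environment `save_freq` counts environment steps; treat the numbers as estimates rather than guaranteed log output.

```python
import math

# Values copied from the PPO setup and training cells above
total_timesteps = 300_000
n_steps = 2048
batch_size = 64
n_epochs = 10
save_freq = 10_000
n_envs = 1  # DummyVecEnv([make_env]) wraps a single environment

rollout_size = n_steps * n_envs
# Assumption: whole rollouts are collected until the budget is reached
n_rollouts = math.ceil(total_timesteps / rollout_size)
steps_collected = n_rollouts * rollout_size
minibatches_per_epoch = rollout_size // batch_size
updates_per_rollout = minibatches_per_epoch * n_epochs
total_updates = n_rollouts * updates_per_rollout
# Assumption: with one environment, save_freq counts environment steps
n_checkpoints = steps_collected // save_freq

print(n_rollouts, steps_collected)         # 147 301056
print(updates_per_rollout, total_updates)  # 320 47040
print(n_checkpoints)                       # 30
```

Under these assumptions, training slightly overshoots the requested 300,000 steps because the last rollout is completed in full.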

### Evaluation and Visualization

Let's evaluate the trained agent and watch it play.

In [None]:
# Evaluate the trained agent
eval_env = gym.make('BrickBreaker-v0', render_mode='human')
eval_env = Monitor(eval_env)

# Run the agent for up to 5000 steps, resetting whenever an episode ends
obs, info = eval_env.reset()
total_reward = 0.0
for _ in range(5000):
    action, _states = model.predict(obs, deterministic=True)
    # predict() returns a NumPy array; convert it to a plain int for the Discrete action space
    obs, reward, terminated, truncated, info = eval_env.step(int(action))
    total_reward += reward
    eval_env.render()
    
    if terminated or truncated:
        print(f"Episode finished with reward {total_reward:.2f} and score {info['score']}")
        obs, info = eval_env.reset()
        total_reward = 0.0

eval_env.close()
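
To interpret the episode rewards printed above, it helps to recall how `step` composes them: -0.01 per step, +0.5 per paddle hit, +1 per broken brick, -10 for losing the ball, and +50 for clearing all bricks. The sketch below adds these terms up for two made-up episodes; the function name `episode_return` and the episode statistics are hypothetical.

```python
def episode_return(steps, paddle_hits, bricks_broken, lost_ball, total_bricks=40):
    """Sum the reward terms used by BrickBreakerEnv.step() over one episode."""
    total = -0.01 * steps + 0.5 * paddle_hits + 1.0 * bricks_broken
    if lost_ball:
        total -= 10   # penalty for letting the ball fall below the paddle
    if bricks_broken == total_bricks:
        total += 50   # bonus for clearing the board (5 rows x 8 columns)
    return total


# Made-up episode: the ball is lost after 60 steps without touching anything
print(f"{episode_return(60, 0, 0, lost_ball=True):.2f}")        # -10.60
# Made-up episode: the board is cleared in 1500 steps with 30 paddle hits
print(f"{episode_return(1500, 30, 40, lost_ball=False):.2f}")   # 90.00
```

Since episodes are truncated after 2000 steps, the step penalty alone never costs more than 20 per episode, which is small next to the brick and clearing rewards.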

### Compare Different RL Algorithms

Let's train and compare different algorithms from Stable Baselines3.

In [None]:
# Function to train and evaluate an algorithm
def train_and_evaluate(algo_class, algo_name, timesteps=100000):
    # Create a fresh environment
    train_env = DummyVecEnv([make_env])
    
    # Create the model
    model = algo_class(
        "MlpPolicy",
        train_env,
        policy_kwargs=dict(net_arch=[64, 64]),
        verbose=1
    )
    
    # Train the model
    print(f"Training {algo_name}...")
    model.learn(total_timesteps=timesteps)
    
    # Evaluate the model
    eval_env = DummyVecEnv([make_env])
    mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10)
    print(f"{algo_name} achieved mean reward: {mean_reward:.2f} ± {std_reward:.2f}")
    
    # Save the model
    model.save(f"brickbreaker_{algo_name.lower()}")
    
    return model, mean_reward, std_reward
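
The two numbers returned by `evaluate_policy` summarize the per-episode returns. As a minimal sketch, assuming the standard deviation is the population form (NumPy's default) and using made-up episode returns, the same summary can be computed with the standard library:

```python
import statistics

# Hypothetical returns from 8 evaluation episodes (not real results)
episode_rewards = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean_reward = statistics.fmean(episode_rewards)
std_reward = statistics.pstdev(episode_rewards)  # population standard deviation

print(f"mean reward: {mean_reward:.2f} ± {std_reward:.2f}")  # mean reward: 5.00 ± 2.00
```

With only 10 evaluation episodes per algorithm, the spread can be large relative to the mean, so small differences between algorithms should be read with caution.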

In [None]:
# Train and evaluate different algorithms
algorithms = [
    (PPO, "PPO"),
    (A2C, "A2C"),
    (DQN, "DQN")
]

results = []

for algo_class, algo_name in algorithms:
    _, mean_reward, std_reward = train_and_evaluate(algo_class, algo_name, timesteps=100000)
    results.append((algo_name, mean_reward, std_reward))

In [None]:
# Plot the results
algo_names = [result[0] for result in results]
mean_rewards = [result[1] for result in results]
std_rewards = [result[2] for result in results]

plt.figure(figsize=(10, 6))
plt.bar(algo_names, mean_rewards, yerr=std_rewards, capsize=10)
plt.title('Performance Comparison of RL Algorithms on Brick Breaker')
plt.xlabel('Algorithm')
plt.ylabel('Mean Reward')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

## Conclusion

In this notebook, we've implemented a Brick Breaker game as a custom Gymnasium environment and trained a reinforcement learning agent to play it using Stable Baselines3. We've seen how:

1. Custom game environments can be created with Gymnasium
2. Different RL algorithms (PPO, A2C, DQN) can be applied to the same problem
3. RL agents can learn complex game strategies through trial and error

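With enough training, the agent should learn to position the paddle so that the ball stays in play and bricks get broken. How well it does depends on the algorithm, the training budget, and the reward design, so judge it by the evaluation results above. The same approach can be extended to other game environments and more complex control problems.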