# Project 1: Atari Pong Baseline

This notebook provides a working baseline for training an RL agent on Atari Pong.

**Runtime:** ~5 minutes for the baseline (100k steps); actual time depends on your hardware.

## Setup

In [None]:
# Install dependencies (extras are quoted so the shell does not treat the brackets as a glob pattern)
!pip install "stable-baselines3[extra]" "gymnasium[atari,accept-rom-license]" -q

In [None]:
import gymnasium as gym
import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import VecFrameStack
from stable_baselines3.common.evaluation import evaluate_policy
import matplotlib.pyplot as plt

## Create Preprocessed Environment

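`make_atari_env` applies the standard Atari preprocessing, and `VecFrameStack` adds the frame stacking:
- Convert frames to grayscale
- Resize to 84x84
- Skip frames: repeat each action for 4 frames
- Stack the last 4 processed frames so the agent can infer motion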
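
The preprocessing steps listed above can be sketched with plain NumPy. This is a simplified illustration, not the code Stable-Baselines3 runs: the helper names (`to_grayscale`, `resize_nearest`, `preprocess`) are hypothetical, the input frames are random stand-ins for real Atari output, frame skipping is reduced to keeping every 4th frame, and nearest-neighbour resizing stands in for the library's image resizing.

```python
from collections import deque

import numpy as np


def to_grayscale(frame: np.ndarray) -> np.ndarray:
    """Convert an (H, W, 3) uint8 RGB frame to (H, W) using luminance weights."""
    weights = np.array([0.299, 0.587, 0.114])
    return (frame @ weights).astype(np.uint8)


def resize_nearest(image: np.ndarray, size: int = 84) -> np.ndarray:
    """Resize an (H, W) image to (size, size) by nearest-neighbour sampling."""
    rows = np.arange(size) * image.shape[0] // size
    cols = np.arange(size) * image.shape[1] // size
    return image[rows[:, None], cols]


def preprocess(frame: np.ndarray) -> np.ndarray:
    """Grayscale, then resize to 84x84."""
    return resize_nearest(to_grayscale(frame))


# Random 210x160 RGB frames standing in for raw Atari output (assumption)
rng = np.random.default_rng(0)
raw_frames = rng.integers(0, 256, size=(16, 210, 160, 3), dtype=np.uint8)

frame_skip = 4
stack = deque(maxlen=4)  # keeps only the 4 most recent processed frames

# Simplified frame skipping: process every 4th raw frame
for frame in raw_frames[::frame_skip]:
    stack.append(preprocess(frame))

# Stack along the last axis, giving a channels-last observation
observation = np.stack(list(stack), axis=-1)
print(observation.shape)  # (84, 84, 4)
```

The resulting `(84, 84, 4)` layout matches the channels-last shape that the environment cell below should print for `env.observation_space.shape`.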

In [None]:
# Create vectorized environment with preprocessing
env = make_atari_env('PongNoFrameskip-v4', n_envs=4, seed=42)
env = VecFrameStack(env, n_stack=4)

print("Environment created:")
print(f"  Observation shape: {env.observation_space.shape}")
print(f"  Action space: {env.action_space}")
print(f"  Number of parallel envs: {env.num_envs}")

## Train Baseline Agent

In [None]:
# Create PPO agent with CNN policy
model = PPO(
    'CnnPolicy',           # CNN for image input
    env,
    learning_rate=2.5e-4,
    n_steps=128,           # Steps per update per env
    batch_size=256,
    n_epochs=4,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.1,
    ent_coef=0.01,         # Entropy bonus for exploration
    verbose=1,
    seed=42
)

print("\nAgent created. Ready to train!")
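
To see how the hyperparameters above interact, the following sketch works through the arithmetic of one PPO update. It only uses the numbers from this notebook (4 environments, `n_steps=128`, `batch_size=256`, `n_epochs=4`, 100k timesteps) and assumes training proceeds in whole rollouts until the step budget is reached; the variable names are illustrative.

```python
import math

n_envs = 4
n_steps = 128
batch_size = 256
n_epochs = 4
total_timesteps = 100_000

# Transitions collected before each policy update
rollout_size = n_envs * n_steps
# Minibatches per pass over the rollout
minibatches_per_epoch = rollout_size // batch_size
# Gradient steps taken per update (all epochs)
gradient_steps_per_update = minibatches_per_epoch * n_epochs
# Updates needed to reach the step budget, rounded up to whole rollouts
num_updates = math.ceil(total_timesteps / rollout_size)

print(f"Rollout size: {rollout_size}")                            # 512
print(f"Minibatches per epoch: {minibatches_per_epoch}")          # 2
print(f"Gradient steps per update: {gradient_steps_per_update}")  # 8
print(f"Policy updates: {num_updates}")                           # 196
print(f"Environment steps used: {num_updates * rollout_size}")    # 100352
```

Choosing a `batch_size` that divides `n_envs * n_steps` evenly, as here, keeps every minibatch the same size.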

In [None]:
# Train agent
print("Training for 100k steps (~5 minutes)...\n")
model.learn(total_timesteps=100_000)
print("\nTraining complete!")

## Evaluate Agent

In [None]:
# Evaluate trained agent
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=20)

print("\nEvaluation Results (20 episodes):")
print(f"  Mean reward: {mean_reward:.2f} ± {std_reward:.2f}")

# In Pong the episode return is the point difference (agent minus opponent),
# so it ranges from -21 (lost every point) to +21 (won every point).
# A positive mean means the agent scores more points than it concedes on average.
if mean_reward > 0:
    # Position of the mean return within [-21, +21]; this is NOT a win rate
    score_pct = (mean_reward + 21) / 42 * 100
    print(f"  Normalized score: {score_pct:.1f}% of the [-21, +21] range")
    print("  ✓ Agent is winning on average!")
else:
    print("  Agent needs more training")
    print("  Try: model.learn(total_timesteps=500_000)")
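
The mean return alone does not say how many games were won. A minimal sketch of computing a win rate from per-episode returns follows. `summarize_pong_returns` is a hypothetical helper, and the example returns are made up for illustration. With a trained model, the per-episode list can be obtained by passing `return_episode_rewards=True` to `evaluate_policy`, which then returns lists of episode rewards and lengths instead of the mean and standard deviation.

```python
import numpy as np


def summarize_pong_returns(episode_returns):
    """Summarize Pong episode returns; a positive return counts as a win."""
    returns = np.asarray(episode_returns, dtype=float)
    wins = returns > 0
    return {
        "mean_return": float(returns.mean()),
        "win_rate": float(wins.mean()),
        "n_episodes": int(returns.size),
    }


# Made-up returns for illustration only (not results from this notebook)
example_returns = [-21, -15, 3, 21, -7]
summary = summarize_pong_returns(example_returns)

print(f"Mean return: {summary['mean_return']:.2f}")   # -3.80
print(f"Win rate: {summary['win_rate'] * 100:.0f}%")  # 40%
```

In this made-up example the agent wins 2 of 5 games even though its mean return is negative, which is why the two numbers are worth reporting separately.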

## Watch Agent Play

**Note:** Video rendering might not work in all environments. If it fails, the agent is still trained!

In [None]:
# Test in a single environment. Reusing make_atari_env + VecFrameStack keeps the
# preprocessing and observation layout identical to training.
test_env = make_atari_env(
    'PongNoFrameskip-v4', n_envs=1, seed=42,
    env_kwargs={'render_mode': 'rgb_array'},
)
test_env = VecFrameStack(test_env, n_stack=4)

obs = test_env.reset()  # VecEnv API: reset() returns only the observations
frames = []
episode_reward = 0.0

for _ in range(5000):  # Max steps
    action, _ = model.predict(obs, deterministic=True)
    # VecEnv API: step() returns arrays with one entry per environment
    obs, rewards, dones, _ = test_env.step(action)
    episode_reward += float(rewards[0])

    # Record frames (optional)
    if len(frames) < 500:  # Save first 500 frames
        frames.append(test_env.render())

    if dones[0]:
        break

print(f"Episode reward: {episode_reward}")
print(f"Result: {'WIN' if episode_reward > 0 else 'LOSE'}")

# Show sample frame
if frames:
    plt.figure(figsize=(6, 8))
    plt.imshow(frames[0])
    plt.title(f"Pong Gameplay (Episode reward: {episode_reward})")
    plt.axis('off')
    plt.show()

## Next Steps

Now that you have a working baseline, try:

1. **Train longer:** Change `total_timesteps` to 500k or 1M
2. **Tune hyperparameters:** Try different learning rates, batch sizes
3. **Compare algorithms:** Try DQN instead of PPO
4. **Visualize learning:** Track win rate over time
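
For the last idea, one possible way to track win rate over time is a rolling average over episode returns. The sketch below assumes you already have a chronological list of episode returns (for example from the per-episode logs of the monitoring wrapper); `rolling_win_rate` is a hypothetical helper and the returns are made up for illustration.

```python
import numpy as np


def rolling_win_rate(episode_returns, window=10):
    """Fraction of wins (return > 0) in each sliding window of episodes."""
    wins = (np.asarray(episode_returns, dtype=float) > 0).astype(float)
    if wins.size < window:
        return np.array([])  # not enough episodes for a single window
    kernel = np.ones(window) / window
    return np.convolve(wins, kernel, mode="valid")


# Made-up returns for illustration: the agent improves over eight episodes
example_returns = [-21, -19, -10, 2, -5, 7, 12, 15]
win_rate_curve = rolling_win_rate(example_returns, window=4)

print(win_rate_curve)  # [0.25 0.25 0.5  0.75 0.75]
```

Plotting `win_rate_curve` against the episode index with `plt.plot` gives a simple learning curve; a larger window smooths the curve at the cost of responsiveness.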

See `project1_atari_README.md` for detailed improvement ideas!