# 🚀 Improved DQN Training for Atari Games

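This notebook trains a DQN agent on Atari games with revised hyperparameters intended to make learning more stable.

**Key Improvements:**
- Slower epsilon decay (0.995 vs 0.99, applied per training step)
- Lower learning rate (0.0001 vs 0.0005)
- Larger replay buffer (100k vs 50k)
- Less frequent target network updates (every 1000 vs 500 training steps)
- More episodes for proper learning

**Expected Results:** On Pong, episode rewards start near -21 and should trend upward as training progresses; how far they rise varies from run to run and with the number of episodes.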
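
To put the larger replay buffer in perspective, the sketch below estimates its memory footprint. It assumes each stored transition holds two stacked states of shape (4, 84, 84), matching the defaults used later in this notebook; `replay_buffer_bytes` is a hypothetical helper for illustration, not part of the notebook's classes.

```python
import math

def replay_buffer_bytes(capacity, state_shape=(4, 84, 84), bytes_per_value=4):
    """Approximate bytes needed to hold `capacity` transitions.

    Each transition stores a state and a next_state; actions, rewards and
    done flags are ignored because they are negligible by comparison.
    """
    values_per_state = math.prod(state_shape)
    return capacity * 2 * values_per_state * bytes_per_value

# Compare float32 storage (4 bytes per pixel) with uint8 storage (1 byte per pixel)
for label, width in [("float32", 4), ("uint8", 1)]:
    gib = replay_buffer_bytes(100_000, bytes_per_value=width) / 1024**3
    print(f"{label}: {gib:.1f} GiB")
```

Under these assumptions, 100k float32 transitions come to roughly 21 GiB, likely more RAM than a free Colab runtime provides, while uint8 storage needs about a quarter of that.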


## 📦 Setup and Installation


In [None]:
# Check GPU availability and setup
import torch
import os

print("🔍 Checking GPU Setup...")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")

if torch.cuda.is_available():
    print(f"✅ GPU detected!")
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
    print(f"Current GPU: {torch.cuda.current_device()}")
    print(f"GPU count: {torch.cuda.device_count()}")
    
    # Test GPU allocation
    test_tensor = torch.randn(1000, 1000).cuda()
    print(f"✅ GPU test successful! Tensor on GPU: {test_tensor.device}")
    
    # Set default device
    torch.cuda.set_device(0)
    print(f"Default device set to: {torch.cuda.current_device()}")
    
else:
    print("❌ No GPU detected!")
    print("🔧 Troubleshooting steps:")
    print("1. Go to Runtime > Change runtime type")
    print("2. Set Hardware accelerator to GPU")
    print("3. Choose T4 GPU (free) or A100/V100 (if available)")
    print("4. Click Save and restart runtime")
    print("5. Re-run this cell")
    
    # Check if we're in Colab
    try:
        import google.colab
        print("✅ Running in Google Colab")
        print("💡 Make sure to enable GPU in runtime settings!")
    except ImportError:
        print("⚠️  Not running in Google Colab - GPU setup may be different")


In [None]:
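# Install required packages (the quotes stop the shell from interpreting the brackets)
!pip install "gymnasium[atari]" ale-py opencv-python matplotlib tqdm
# Colab already ships a CUDA build of PyTorch; this reinstall is only needed if the GPU check above fails
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118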


In [None]:
# Verify installations
try:
    import gymnasium as gym
    import ale_py
    import cv2
    import matplotlib.pyplot as plt
    from tqdm import tqdm
    import torch
    import numpy as np
    print("✅ All packages installed successfully!")
    print(f"PyTorch version: {torch.__version__}")
    print(f"CUDA available: {torch.cuda.is_available()}")
except ImportError as e:
    print(f"❌ Import error: {e}")
    print("Please restart runtime and run the installation cell again.")


## 🧠 DQN Implementation Classes


In [None]:
# Atari Environment with Preprocessing
import gymnasium as gym
import numpy as np
import cv2
from collections import deque

# Register ALE environments
try:
    import ale_py
    gym.register_envs(ale_py)
except ImportError:
    print("Warning: ale_py not installed")
except Exception as e:
    print(f"Warning: Could not register ALE environments: {e}")

class AtariPreprocessor:
    """Handles frame preprocessing for Atari games."""
    
    def __init__(self, frame_size=(84, 84), grayscale=True):
        self.frame_size = frame_size
        self.grayscale = grayscale
    
    def preprocess_frame(self, frame):
        """Preprocess a single frame."""
        if self.grayscale and len(frame.shape) == 3:
            frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
        frame = cv2.resize(frame, self.frame_size, interpolation=cv2.INTER_AREA)
        frame = frame.astype(np.float32) / 255.0
        return frame

class FrameStack:
    """Stack consecutive frames for temporal information."""
    
    def __init__(self, num_frames=4):
        self.num_frames = num_frames
        self.frames = deque(maxlen=num_frames)
    
    def reset(self, frame):
        """Initialize with the first frame."""
        for _ in range(self.num_frames):
            self.frames.append(frame)
        return self.get_state()
    
    def update(self, frame):
        """Add new frame and return stacked state."""
        self.frames.append(frame)
        return self.get_state()
    
    def get_state(self):
        """Return current state as stacked frames."""
        return np.stack(list(self.frames), axis=0)

class AtariEnvironment:
    """Complete Atari environment wrapper with preprocessing."""
    
    def __init__(self, game_name="ALE/Pong-v5", frame_size=(84, 84), num_frames=4, skip_frames=4, no_op_max=30):
        self.game_name = game_name
        self.skip_frames = skip_frames
        self.no_op_max = no_op_max
        
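        # ALE v5 environments repeat each action 4 times by default; set frameskip=1
        # so that skip_frames in step() is the only frame skipping applied.
        self.env = gym.make(game_name, render_mode="rgb_array", frameskip=1)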
        self.action_space = self.env.action_space
        
        self.preprocessor = AtariPreprocessor(frame_size)
        self.frame_stack = FrameStack(num_frames)
        
        self.episode_reward = 0
        self.episode_length = 0
        self.lives = 0
        
    def reset(self):
        """Reset environment and return initial state."""
        obs, info = self.env.reset()
        self.episode_reward = 0
        self.episode_length = 0
        self.lives = info.get('lives', 0)
        
        # Take a random number of no-op actions (0..no_op_max) to vary the start state
        for _ in range(np.random.randint(0, self.no_op_max + 1)):
            obs, _, terminated, truncated, info = self.env.step(0)
            if terminated or truncated:
                obs, info = self.env.reset()
                self.lives = info.get('lives', 0)
        
        processed_frame = self.preprocessor.preprocess_frame(obs)
        state = self.frame_stack.reset(processed_frame)
        return state, info
    
    def step(self, action):
        """Execute action with frame skipping."""
        total_reward = 0
        life_lost = False
        
        for _ in range(self.skip_frames):
            obs, reward, terminated, truncated, info = self.env.step(action)
            total_reward += reward
            
            # A lost life ends the current run of repeated actions early
            current_lives = info.get('lives', 0)
            life_lost = (current_lives < self.lives) and (current_lives > 0)
            self.lives = current_lives
            
            if terminated or truncated or life_lost:
                break
        
        self.episode_reward += total_reward
        self.episode_length += 1
        
        processed_frame = self.preprocessor.preprocess_frame(obs)
        state = self.frame_stack.update(processed_frame)
        
        info['episode_reward'] = self.episode_reward
        info['episode_length'] = self.episode_length
        info['life_lost'] = life_lost
        
        return state, total_reward, terminated, truncated, info
    
    def close(self):
        self.env.close()
    
    def get_action_space_size(self):
        return self.action_space.n
    
    def get_state_shape(self):
        if len(self.frame_stack.frames) == 0:
            dummy_frame = np.zeros(self.preprocessor.frame_size, dtype=np.float32)
            dummy_state = np.stack([dummy_frame] * self.frame_stack.num_frames, axis=0)
            return dummy_state.shape
        else:
            return self.frame_stack.get_state().shape

print("✅ Atari Environment classes loaded!")
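
How `FrameStack` keeps only the most recent frames can be seen with a bare `deque`. The sketch below uses strings in place of image arrays, and `make_stack` is a hypothetical helper written for this illustration.

```python
from collections import deque

def make_stack(first_frame, num_frames=4):
    """Fill a fixed-length deque with copies of the first frame (as on reset)."""
    return deque([first_frame] * num_frames, maxlen=num_frames)

frames = make_stack("f0")
print(list(frames))      # four copies of the first frame

frames.append("f1")      # each new frame pushes the oldest one out
frames.append("f2")
print(list(frames))      # the two newest frames plus two remaining copies of f0
```

Because `maxlen` is fixed, appending never grows the stack, so the stacked state always has the same shape.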


In [None]:
# DQN Agent Implementation
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import random
from collections import deque, namedtuple

Experience = namedtuple('Experience', ['state', 'action', 'reward', 'next_state', 'done'])

class DQNNetwork(nn.Module):
    """Deep Q-Network architecture for Atari games."""
    
    def __init__(self, input_shape, num_actions, hidden_size=512):
        super(DQNNetwork, self).__init__()
        
        self.input_shape = input_shape
        self.num_actions = num_actions
        
        self.conv1 = nn.Conv2d(input_shape[0], 32, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)
        
        conv_out_size = self._get_conv_output_size()
        
        self.fc1 = nn.Linear(conv_out_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, num_actions)
        
        self._initialize_weights()
    
    def _get_conv_output_size(self):
        with torch.no_grad():
            x = torch.zeros(1, *self.input_shape)
            x = F.relu(self.conv1(x))
            x = F.relu(self.conv2(x))
            x = F.relu(self.conv3(x))
            return x.numel()
    
    def _initialize_weights(self):
        for module in self.modules():
            if isinstance(module, nn.Conv2d) or isinstance(module, nn.Linear):
                nn.init.xavier_uniform_(module.weight)
                if module.bias is not None:
                    nn.init.constant_(module.bias, 0)
    
    def forward(self, x):
        if len(x.shape) == 3:
            x = x.unsqueeze(0)
        
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        
        x = x.view(x.size(0), -1)
        
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        
        return x

class ReplayBuffer:
    """Experience replay buffer."""
    
    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)
        self.capacity = capacity
    
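    def push(self, state, action, reward, next_state, done):
        # Store frames as uint8 (0-255) instead of float32 to use a quarter of the memory.
        # Assumes states are scaled to [0, 1], as AtariPreprocessor produces.
        state = torch.as_tensor(state).mul(255).round().to(torch.uint8)
        next_state = torch.as_tensor(next_state).mul(255).round().to(torch.uint8)
        experience = Experience(state, action, reward, next_state, done)
        self.buffer.append(experience)
    
    def sample(self, batch_size):
        experiences = random.sample(self.buffer, batch_size)
        
        # Convert the stored uint8 frames back to float32 in [0, 1]
        states = torch.stack([e.state for e in experiences]).float() / 255.0
        actions = torch.LongTensor([e.action for e in experiences])
        rewards = torch.FloatTensor([e.reward for e in experiences])
        next_states = torch.stack([e.next_state for e in experiences]).float() / 255.0
        dones = torch.BoolTensor([e.done for e in experiences])
        
        return states, actions, rewards, next_states, dones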
    
    def __len__(self):
        return len(self.buffer)

class DQNAgent:
    """Deep Q-Network Agent with experience replay."""
    
    def __init__(self, state_shape, num_actions, learning_rate=0.0001, gamma=0.99,
                 epsilon_start=1.0, epsilon_end=0.01, epsilon_decay=0.995,
                 memory_size=100000, batch_size=32, target_update_freq=1000, device=None):
        
        self.state_shape = state_shape
        self.num_actions = num_actions
        self.gamma = gamma
        self.epsilon = epsilon_start
        self.epsilon_start = epsilon_start
        self.epsilon_end = epsilon_end
        self.epsilon_decay = epsilon_decay
        self.batch_size = batch_size
        self.target_update_freq = target_update_freq
        
        # Force GPU usage if available
        if device is None:
            if torch.cuda.is_available():
                self.device = torch.device("cuda:0")  # Explicitly use first GPU
                torch.cuda.set_device(0)  # Set default GPU
            else:
                self.device = torch.device("cpu")
        else:
            self.device = device
        
        print(f"🚀 DQN Agent using device: {self.device}")
        if torch.cuda.is_available():
            print(f"🔧 GPU Memory before model creation: {torch.cuda.memory_allocated()/1024**2:.1f} MB")
        
        # Create networks and move to device
        self.q_network = DQNNetwork(state_shape, num_actions).to(self.device)
        self.target_network = DQNNetwork(state_shape, num_actions).to(self.device)
        
        if torch.cuda.is_available():
            print(f"🔧 GPU Memory after model creation: {torch.cuda.memory_allocated()/1024**2:.1f} MB")
        
        self.update_target_network()
        
        self.optimizer = optim.Adam(self.q_network.parameters(), lr=learning_rate)
        self.memory = ReplayBuffer(memory_size)
        
        self.steps_done = 0
        self.training_step = 0
        
        # Test GPU usage
        if torch.cuda.is_available():
            test_input = torch.randn(1, *state_shape).to(self.device)
            with torch.no_grad():
                test_output = self.q_network(test_input)
            print(f"✅ GPU test successful! Input: {test_input.device}, Output: {test_output.device}")
    
    def select_action(self, state, training=True):
        """Select action using epsilon-greedy policy."""
        if training and random.random() < self.epsilon:
            return random.randrange(self.num_actions)
        else:
            with torch.no_grad():
                state_tensor = torch.FloatTensor(state).unsqueeze(0).to(self.device)
                q_values = self.q_network(state_tensor)
                return q_values.max(1)[1].item()
    
    def store_experience(self, state, action, reward, next_state, done):
        """Store experience in replay buffer."""
        self.memory.push(state, action, reward, next_state, done)
    
    def train_step(self):
        """Perform one training step."""
        if len(self.memory) < self.batch_size:
            return None
        
        states, actions, rewards, next_states, dones = self.memory.sample(self.batch_size)
        
        states = states.to(self.device)
        actions = actions.to(self.device)
        rewards = rewards.to(self.device)
        next_states = next_states.to(self.device)
        dones = dones.to(self.device)
        
        current_q_values = self.q_network(states).gather(1, actions.unsqueeze(1))
        
        with torch.no_grad():
            next_q_values = self.target_network(next_states).max(1)[0]
            target_q_values = rewards + (self.gamma * next_q_values * ~dones)
        
        # squeeze(1) drops only the action dimension, so a batch of size 1 keeps its shape
        loss = F.mse_loss(current_q_values.squeeze(1), target_q_values)
        
        self.optimizer.zero_grad()
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(self.q_network.parameters(), max_norm=10.0)
        
        self.optimizer.step()
        
        self.training_step += 1
        
        if self.training_step % self.target_update_freq == 0:
            self.update_target_network()
        
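        # Epsilon decays once per training step (not per episode). With the defaults
        # (1.0 -> 0.01 at 0.995) it reaches the floor after roughly 920 steps.
        if self.epsilon > self.epsilon_end:
            self.epsilon = max(self.epsilon_end, self.epsilon * self.epsilon_decay)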
        
        return loss.item()
    
    def update_target_network(self):
        """Copy weights to target network."""
        self.target_network.load_state_dict(self.q_network.state_dict())

print("✅ DQN Agent classes loaded!")
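
The target that `train_step` regresses towards is the standard one-step Q-learning target. As a minimal sketch in plain Python (no tensors), with `td_target` a hypothetical helper and the Q-values made up for illustration:

```python
def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step Q-learning target: r + gamma * max_a' Q_target(s', a').

    When the episode has ended (`done`), there is no next state to
    bootstrap from, so the target is just the reward.
    """
    if done:
        return reward
    return reward + gamma * max(next_q_values)

# Non-terminal step: the reward plus the discounted best next Q-value
print(td_target(1.0, [0.5, 2.0, -1.0], done=False))
# Terminal step: the bootstrap term is dropped
print(td_target(-1.0, [0.5, 2.0, -1.0], done=True))
```

In the agent, the same calculation is applied to a whole batch at once, with `~dones` zeroing the bootstrap term for terminal transitions.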


## 🚀 Improved Training Function


In [None]:
# Improved Training Function
import time
import json
from datetime import datetime
import matplotlib.pyplot as plt

def improved_train(game_name="ALE/Pong-v5", episodes=1000):
    """
    Improved training with better hyperparameters for actual learning.
    """
    print(f"🚀 Improved Training for Better Learning")
    print(f"Game: {game_name}")
    print(f"Episodes: {episodes}")
    print("=" * 50)
    
    # GPU monitoring
    if torch.cuda.is_available():
        print(f"🔧 Starting GPU Memory: {torch.cuda.memory_allocated()/1024**2:.1f} MB")
        print(f"🔧 GPU Memory Cached: {torch.cuda.memory_reserved()/1024**2:.1f} MB")
    
    # Create environment
    print("🎮 Creating environment...")
    env = AtariEnvironment(
        game_name=game_name,
        frame_size=(84, 84),
        num_frames=4,
        skip_frames=4
    )
    
    state_shape = env.get_state_shape()
    num_actions = env.get_action_space_size()
    
    print(f"✅ Environment created successfully")
    print(f"   State shape: {state_shape}")
    print(f"   Action space: {num_actions}")
    
    # Create agent with IMPROVED settings
    print("🧠 Creating DQN agent with improved hyperparameters...")
    agent = DQNAgent(
        state_shape=state_shape,
        num_actions=num_actions,
        learning_rate=0.0001,     # Lower learning rate for stability
        gamma=0.99,
        epsilon_start=1.0,
        epsilon_end=0.01,
        epsilon_decay=0.995,      # Slower decay than 0.99; applied once per training step
        memory_size=100000,       # Larger buffer for better experience diversity
        batch_size=32,
        target_update_freq=1000   # Less frequent target updates for stability
    )
    
    # Training metrics
    episode_rewards = []
    episode_lengths = []
    average_rewards = []
    losses = []
    epsilon_values = []
    
    best_reward = float('-inf')
    best_episode = 0
    start_time = time.time()
    
    print("🎯 Starting IMPROVED training...")
    print("🔧 Key improvements:")
    print("   • Slower epsilon decay (0.995 vs 0.99)")
    print("   • Lower learning rate (0.0001 vs 0.0005)")
    print("   • Larger replay buffer (100k vs 50k)")
    print("   • Less frequent target updates (1000 vs 500)")
    print("   • More episodes for proper learning")
    print("-" * 50)
    
    try:
        for episode in tqdm(range(episodes), desc="Improved Training"):
            # Train one episode
            state, _ = env.reset()
            episode_reward = 0
            episode_length = 0
            episode_losses = []
            
            while True:
                # Select and execute action
                action = agent.select_action(state, training=True)
                next_state, reward, terminated, truncated, info = env.step(action)
                
                # Store experience
                done = terminated or truncated
                agent.store_experience(state, action, reward, next_state, done)
                
                # Train agent (only if enough experiences)
                if len(agent.memory) >= agent.batch_size:
                    loss = agent.train_step()
                    if loss is not None:
                        episode_losses.append(loss)
                
                # Update state and statistics
                state = next_state
                episode_reward += reward
                episode_length += 1
                
                if done:
                    break
            
            # Record metrics
            avg_loss = np.mean(episode_losses) if episode_losses else 0.0
            
            episode_rewards.append(episode_reward)
            episode_lengths.append(episode_length)
            losses.append(avg_loss)
            epsilon_values.append(agent.epsilon)
            
            # Calculate moving average (last 100 episodes)
            if len(episode_rewards) >= 100:
                recent_avg = np.mean(episode_rewards[-100:])
            else:
                recent_avg = np.mean(episode_rewards)
            average_rewards.append(recent_avg)
            
            # Track best performance
            if recent_avg > best_reward:
                best_reward = recent_avg
                best_episode = episode + 1
            
            # Progress reporting
            if (episode + 1) % 50 == 0:
                elapsed_time = time.time() - start_time
                episodes_per_hour = (episode + 1) / (elapsed_time / 3600)
                
                gpu_info = ""
                if torch.cuda.is_available():
                    gpu_mem = torch.cuda.memory_allocated() / 1024**2
                    gpu_info = f" | GPU: {gpu_mem:.1f}MB"
                
                print(f"Episode {episode + 1:4d} | "
                      f"Reward: {episode_reward:6.1f} | "
                      f"Avg(100): {recent_avg:6.1f} | "
                      f"Best: {best_reward:6.1f}@{best_episode:4d} | "
                      f"Length: {episode_length:3d} | "
                      f"Epsilon: {agent.epsilon:.3f} | "
                      f"Speed: {episodes_per_hour:.1f}/hr{gpu_info}")
    
    except KeyboardInterrupt:
        print("\n⏹️  Training interrupted by user")
    
    finally:
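        # Calculate final statistics.
        # Note: this block ends with `return`, which discards any in-flight exception
        # (for example a CUDA out-of-memory error), so check the episode count in the summary.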
        total_time = time.time() - start_time
        final_avg_reward = average_rewards[-1] if average_rewards else 0
        improvement = final_avg_reward - episode_rewards[0] if episode_rewards else 0
        episodes_per_hour = len(episode_rewards) / (total_time / 3600)
        
        # Final results
        final_results = {
            'total_episodes': len(episode_rewards),
            'total_time_minutes': total_time / 60,
            'best_avg_reward': best_reward,
            'best_episode': best_episode,
            'final_avg_reward': final_avg_reward,
            'improvement': improvement,
            'episodes_per_hour': episodes_per_hour,
            'final_epsilon': agent.epsilon
        }
        
        # Print final summary
        print("\n" + "=" * 50)
        print("🎉 IMPROVED TRAINING COMPLETED!")
        print("=" * 50)
        print("📊 Final Results:")
        print(f"   Total Episodes: {final_results['total_episodes']}")
        print(f"   Total Time: {final_results['total_time_minutes']:.1f} minutes")
        print(f"   Best Average Reward: {final_results['best_avg_reward']:.2f} (Episode {final_results['best_episode']})")
        print(f"   Final Average Reward: {final_results['final_avg_reward']:.2f}")
        print(f"   Improvement: {final_results['improvement']:.2f}")
        print(f"   Episodes per Hour: {final_results['episodes_per_hour']:.1f}")
        print(f"   Final Epsilon: {final_results['final_epsilon']:.3f}")
        
        # Learning assessment
        if final_results['improvement'] > 5:
            print("🎯 EXCELLENT LEARNING! Significant improvement achieved!")
        elif final_results['improvement'] > 1:
            print("✅ GOOD LEARNING! Some improvement achieved!")
        elif final_results['improvement'] > 0:
            print("⚠️  MINIMAL LEARNING! Very small improvement.")
        else:
            print("❌ NO LEARNING! Consider adjusting hyperparameters further.")
        
        env.close()
        
        return {
            'results': final_results,
            'episode_rewards': episode_rewards,
            'average_rewards': average_rewards,
            'losses': losses,
            'epsilon_values': epsilon_values
        }

print("✅ Improved training function loaded!")
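
The `Avg(100)` value reported during training is a trailing mean over at most the last 100 episodes. A minimal sketch of that calculation, using a hypothetical `moving_average` helper and made-up rewards:

```python
def moving_average(values, window=100):
    """Mean of the last `window` values, or of all values if fewer exist."""
    recent = values[-window:]
    return sum(recent) / len(recent)

rewards = [-21.0, -20.0, -19.0, -17.0]
print(moving_average(rewards, window=2))   # average of the last two episodes
print(moving_average(rewards))             # fewer than 100 episodes: average of all
```

Early in training the window is not yet full, so the average covers every episode so far and can swing more than it does later on.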


## 🔧 GPU Monitoring


In [None]:
# GPU Memory Monitoring Function
def monitor_gpu():
    """Monitor GPU memory usage."""
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1024**2  # MB
        reserved = torch.cuda.memory_reserved() / 1024**2    # MB
        total = torch.cuda.get_device_properties(0).total_memory / 1024**2  # MB
        
        print(f"🔧 GPU Memory Status:")
        print(f"   Allocated: {allocated:.1f} MB ({allocated/total*100:.1f}%)")
        print(f"   Reserved:  {reserved:.1f} MB ({reserved/total*100:.1f}%)")
        print(f"   Total:     {total:.1f} MB")
        print(f"   Available: {total-reserved:.1f} MB")
        
        # Release cached blocks if the allocator has reserved >90% of memory.
        # empty_cache() frees only reserved-but-unused memory, not live tensors.
        if reserved > total * 0.9:
            print("⚠️  High GPU memory usage! Clearing cache...")
            torch.cuda.empty_cache()
    else:
        print("❌ No GPU available for monitoring")

# Run GPU monitoring
monitor_gpu()


## 🎮 Run Training

Choose your training configuration:


In [None]:
# Quick Test (10 episodes) - to verify everything works
print("🧪 Running Quick Test (10 episodes)...")
results = improved_train(game_name="ALE/Pong-v5", episodes=10)
print("✅ Quick test completed!")


In [None]:
# Medium Training (100 episodes) - to see some learning
print("🚀 Running Medium Training (100 episodes)...")
results = improved_train(game_name="ALE/Pong-v5", episodes=100)
print("✅ Medium training completed!")


In [None]:
# Full Training (500 episodes) - for proper learning
print("🎯 Running Full Training (500 episodes)...")
results = improved_train(game_name="ALE/Pong-v5", episodes=500)
print("✅ Full training completed!")


## 📊 Visualization


In [None]:
# Plot training results
if 'results' in locals():
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    fig.suptitle('Improved Training Results', fontsize=16)
    
    # Episode rewards
    axes[0, 0].plot(results['episode_rewards'], alpha=0.6, label='Episode Reward', color='lightblue')
    axes[0, 0].plot(results['average_rewards'], label='Moving Average (100)', linewidth=2, color='blue')
    axes[0, 0].set_xlabel('Episode')
    axes[0, 0].set_ylabel('Reward')
    axes[0, 0].set_title('Episode Rewards')
    axes[0, 0].legend()
    axes[0, 0].grid(True, alpha=0.3)
    
    # Training loss
    axes[0, 1].plot(results['losses'], color='orange', alpha=0.7)
    axes[0, 1].set_xlabel('Episode')
    axes[0, 1].set_ylabel('Loss')
    axes[0, 1].set_title('Training Loss')
    axes[0, 1].grid(True, alpha=0.3)
    
    # Epsilon decay
    axes[1, 0].plot(results['epsilon_values'], color='purple', alpha=0.7)
    axes[1, 0].set_xlabel('Episode')
    axes[1, 0].set_ylabel('Epsilon')
    axes[1, 0].set_title('Exploration Rate')
    axes[1, 0].grid(True, alpha=0.3)
    
    # Reward distribution
    axes[1, 1].hist(results['episode_rewards'], bins=50, alpha=0.7, color='green')
    axes[1, 1].set_xlabel('Reward')
    axes[1, 1].set_ylabel('Frequency')
    axes[1, 1].set_title('Reward Distribution')
    axes[1, 1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # Print summary
    print("\n📊 Training Summary:")
    print(f"   Initial Reward: {results['episode_rewards'][0]:.1f}")
    print(f"   Final Reward: {results['episode_rewards'][-1]:.1f}")
    print(f"   Best Average: {results['results']['best_avg_reward']:.1f}")
    print(f"   Improvement: {results['results']['improvement']:.1f}")
else:
    print("❌ No training results available. Run a training cell first.")


## 🎯 Custom Training

Run your own training with custom parameters:


In [None]:
# Custom Training Configuration
GAME_NAME = "ALE/Pong-v5"  # Try: ALE/Breakout-v5, ALE/SpaceInvaders-v5
EPISODES = 200  # Adjust as needed

print(f"🎮 Custom Training: {GAME_NAME} for {EPISODES} episodes")
custom_results = improved_train(game_name=GAME_NAME, episodes=EPISODES)
print("✅ Custom training completed!")


## 💡 Tips for Better Results

1. **Start with Pong** - It's one of the easiest Atari games for DQN to learn
2. **Watch for Improvement** - The 100-episode average reward should rise from about -21
3. **Be Patient** - DQN often needs 200+ episodes before the average reward moves noticeably
4. **Monitor Epsilon** - It decays from 1.0 to 0.01; because the decay is applied per training step, it reaches 0.01 after roughly 920 steps
5. **Check GPU Usage** - Make sure you're using a GPU for faster training
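
Because the agent multiplies epsilon by a constant factor on every training step, the number of steps before it reaches its floor follows from a logarithm. The sketch below works this out; `steps_to_floor` is a hypothetical helper, and 0.9999 is only an illustrative alternative value.

```python
import math

def steps_to_floor(epsilon_start=1.0, epsilon_end=0.01, decay=0.995):
    """Number of multiplicative decay steps until epsilon <= epsilon_end."""
    return math.ceil(math.log(epsilon_end / epsilon_start) / math.log(decay))

# Compare the old rate, the notebook's rate, and a much slower rate
for decay in (0.99, 0.995, 0.9999):
    print(decay, steps_to_floor(decay=decay))
```

With 0.995 the floor is reached after 919 training steps, so a decay factor closer to 1 (or decaying once per episode instead) is one way to extend exploration.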

## 🎯 Expected Learning Progression:

The ranges below are a rough guide for Pong; actual progress varies between runs.

- **Episodes 1-50**: Near-random play, rewards around -21
- **Episodes 50-150**: Occasional ball returns, -15 to -10 rewards
- **Episodes 150-300**: Consistent returns, -5 to +5 rewards
- **Episodes 300+**: Strategic play, +5 to +20 rewards

## 🚨 Important Notes:

- **GPU Recommended**: Enable GPU in Runtime > Change runtime type > GPU (the code falls back to CPU, but training is much slower)
- **Restart Runtime**: If you encounter import errors, restart runtime and run all cells
- **Save Results**: Download plots and results before runtime expires
- **Monitor Progress**: Watch the progress bars and reward improvements

## 🔧 GPU Troubleshooting:

If you see 0% GPU usage:

1. **Check Runtime Settings**: Runtime > Change runtime type > GPU (T4 recommended)
2. **Restart Runtime**: Runtime > Restart runtime (after changing GPU settings)
3. **Re-run Setup Cells**: Run the GPU detection cell again after restart
4. **Verify CUDA**: Make sure PyTorch CUDA version matches Colab's CUDA
5. **Check GPU Monitoring**: Use the GPU monitoring cell to verify memory usage
6. **Force GPU**: The agent explicitly selects the `cuda:0` device when a GPU is available

**Expected GPU Usage**: GPU memory use during training should be roughly 200-500MB
