# HW-04-01: Pokemon Image Classification - Preprocessing

**Dataset:** [Pokemon Images and Types](https://www.kaggle.com/datasets/vishalsubbiah/pokemon-images-and-types)

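This notebook preprocesses the Pokemon dataset for CNN classification: it loads the images, resizes them to 120×120 pixels, scales pixel values to [0, 1], splits the data 80/20 into train and test sets, and saves the result as a compressed `.npz` file.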

## Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
from PIL import Image
from sklearn.model_selection import train_test_split
from tqdm import tqdm
import warnings

warnings.filterwarnings('ignore')
plt.style.use('dark_background')

## Step 1: Download Dataset

**Option A: Kaggle CLI**
```bash
pip install kaggle
kaggle datasets download -d vishalsubbiah/pokemon-images-and-types
unzip pokemon-images-and-types.zip -d ../data/pokemon/
```

**Option B: Manual Download**
1. Go to https://www.kaggle.com/datasets/vishalsubbiah/pokemon-images-and-types
2. Download and extract to `../data/pokemon/`
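
Before loading, it can help to confirm that the extracted files match the layout the loader in Step 2 expects (one subdirectory per type, each holding image files). If the archive turns out to be flat, with all images in a single folder and the types listed in a separate file, the images would need to be sorted into per-type folders first. The following is a minimal sketch, not part of the original notebook; `summarize_layout` is a hypothetical helper name, and it only counts files by extension.

```python
from collections import Counter
from pathlib import Path


def summarize_layout(data_dir):
    """Count files by extension at the top level and per subdirectory.

    Hypothetical inspection helper. Returns a dict mapping '.' (top level)
    or a subdirectory name to a Counter of lowercase file extensions.
    """
    data_dir = Path(data_dir)
    summary = {}
    # Files sitting directly in data_dir (a flat layout shows up here)
    top_level = Counter(p.suffix.lower() for p in data_dir.iterdir() if p.is_file())
    if top_level:
        summary['.'] = top_level
    # Files inside each immediate subdirectory (one per class is expected)
    for sub in sorted(d for d in data_dir.iterdir() if d.is_dir()):
        summary[sub.name] = Counter(
            p.suffix.lower() for p in sub.iterdir() if p.is_file()
        )
    return summary


data_dir = Path('../data/pokemon')
if data_dir.exists():
    for name, counts in summarize_layout(data_dir).items():
        print(f"{name}: {dict(counts)}")
else:
    print(f"Data directory not found: {data_dir}")
```

A layout that suits the loader shows one entry per type, each with `.png` or `.jpg` counts, and few or no image files under `.`.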

## Step 2: Load Images

In [None]:
def load_pokemon_images(data_dir, img_size=(120, 120)):
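    """
    Load Pokemon images from a directory with one subdirectory per class.

    Assumes the layout: data_dir/type_name/pokemon_image.png

    Returns:
        images: uint8 array of shape (N, height, width, 3)
        labels: integer array of shape (N,) indexing into label_names
        label_names: sorted list of class (subdirectory) names
    """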
    data_dir = Path(data_dir)
    images = []
    labels = []
    
    # Get class directories
    class_dirs = sorted([d for d in data_dir.iterdir() if d.is_dir()])
    label_names = [d.name for d in class_dirs]
    
    print(f"Found {len(label_names)} classes: {label_names}")
    
    # Load images from each class
    for label_idx, class_dir in enumerate(tqdm(class_dirs, desc="Loading images")):
        # Sort so the image order (and therefore the later split) is reproducible
        image_files = sorted(list(class_dir.glob('*.png')) + list(class_dir.glob('*.jpg')))
        
        for img_file in image_files:
            try:
                # The context manager closes the file handle once the pixels are read
                with Image.open(img_file) as img:
                    img = img.convert('RGB').resize(img_size)
                    images.append(np.array(img))
                labels.append(label_idx)
            except Exception as e:
                print(f"Error loading {img_file}: {e}")
    
    return np.array(images), np.array(labels), label_names
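
One detail worth checking: PNG sprites often have transparent backgrounds, and `convert('RGB')` simply drops the alpha channel, so transparent pixels take whatever color happens to be stored underneath (frequently black). If that matters for training, the image can be composited onto a solid background instead. The sketch below does this in NumPy for an RGBA array such as `np.array(Image.open(path).convert('RGBA'))`; `composite_on_background` and the white default are illustrative assumptions, not part of the original loader.

```python
import numpy as np


def composite_on_background(rgba, background=255):
    """Alpha-composite an (H, W, 4) uint8 RGBA array onto a solid background.

    Hypothetical helper: `background` is a single gray level (255 = white)
    applied to all three channels. Returns an (H, W, 3) uint8 array.
    """
    rgba = np.asarray(rgba)
    if rgba.ndim != 3 or rgba.shape[2] != 4:
        raise ValueError(f"Expected an (H, W, 4) array, got shape {rgba.shape}")
    rgb = rgba[..., :3].astype(np.float32)
    alpha = rgba[..., 3:4].astype(np.float32) / 255.0  # shape (H, W, 1), in [0, 1]
    # Standard "over" blend: opaque pixels keep their color, transparent
    # pixels take the background color
    blended = alpha * rgb + (1.0 - alpha) * background
    return np.rint(blended).astype(np.uint8)


# Tiny 1x2 example: one opaque red pixel, one fully transparent pixel
demo = np.array([[[255, 0, 0, 255], [0, 0, 0, 0]]], dtype=np.uint8)
print(composite_on_background(demo))
```

In the example, the opaque pixel stays red and the fully transparent one becomes white.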

In [None]:
# Load dataset
data_dir = Path('../data/pokemon')

if not data_dir.exists():
    # Fail fast: every later cell depends on images, labels, and label_names
    raise FileNotFoundError(
        f"Data directory not found: {data_dir}. Please download the dataset first (see Step 1)."
    )

images, labels, label_names = load_pokemon_images(data_dir, img_size=(120, 120))

print(f"\nLoaded {len(images)} images")
print(f"Image shape: {images.shape}")
print(f"Number of classes: {len(label_names)}")

## Step 3: Visualize Samples

In [None]:
# Display random samples
fig, axes = plt.subplots(3, 3, figsize=(9, 9))
axes = axes.ravel()

random_indices = np.random.choice(len(images), size=9, replace=False)

for i, idx in enumerate(random_indices):
    axes[i].imshow(images[idx])
    axes[i].set_title(label_names[labels[idx]])
    axes[i].axis('off')

plt.tight_layout()
plt.show()

## Step 4: Normalize Pixel Values

Scale pixel values from integers in [0, 255] to floats in [0, 1]. Inputs on a small, consistent scale generally make gradient-based training more stable.

In [None]:
# Normalize to [0, 1] range
X = images.astype('float32') / 255.0
y = labels

print(f"Before normalization: min={images.min()}, max={images.max()}")
print(f"After normalization:  min={X.min()}, max={X.max()}")
print(f"\nData shape: {X.shape}")
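
Note that the cast to `float32` quadruples the memory footprint, since each value grows from 1 byte (`uint8`) to 4 bytes. A rough estimate for an array of this shape can be sketched as follows; `array_size_mib` is a hypothetical helper, and the image count of 800 is a made-up placeholder, not the actual size of this dataset.

```python
import numpy as np


def array_size_mib(n_images, height=120, width=120, channels=3, dtype=np.float32):
    """Estimate the in-memory size (in MiB) of an image array of the given shape."""
    bytes_per_value = np.dtype(dtype).itemsize
    return n_images * height * width * channels * bytes_per_value / (1024 ** 2)


n = 800  # hypothetical image count, for illustration only
print(f"uint8:   {array_size_mib(n, dtype=np.uint8):.1f} MiB")
print(f"float32: {array_size_mib(n, dtype=np.float32):.1f} MiB")
```

For the real data, `X.nbytes / (1024 ** 2)` reports the same quantity directly.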

## Step 5: Train/Test Split

In [None]:
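# Split into train/test sets (80/20).
# stratify=y keeps class proportions the same in both sets; it requires at
# least 2 images per class, otherwise train_test_split raises a ValueError.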
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print(f"Training set: {X_train.shape[0]} images")
print(f"Test set:     {X_test.shape[0]} images")
print("\nClass distribution (train):")
for idx, name in enumerate(label_names):
    count = np.sum(y_train == idx)
    print(f"  {name}: {count}")
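
Because a stratified split needs at least two samples in every class, a quick pre-check can surface problem classes before `train_test_split` is called. The following is a sketch under that assumption; `find_rare_classes` is a hypothetical helper, and the toy labels are invented for the example.

```python
import numpy as np


def find_rare_classes(y, label_names, min_count=2):
    """Return {class_name: count} for classes with fewer than `min_count` samples."""
    counts = np.bincount(np.asarray(y), minlength=len(label_names))
    return {
        name: int(count)
        for name, count in zip(label_names, counts)
        if count < min_count
    }


# Toy example (invented labels): class 2 has a single sample, class 3 has none
toy_y = np.array([0, 0, 1, 1, 1, 2])
toy_names = ['Fire', 'Grass', 'Ice', 'Water']
print(find_rare_classes(toy_y, toy_names))
```

Classes flagged this way could be dropped or merged before splitting, or the split could be run without `stratify`.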

## Step 6: Save Preprocessed Data

In [None]:
# Save to compressed numpy format
output_dir = Path('../data/pokemon_preprocessed')
output_dir.mkdir(parents=True, exist_ok=True)

np.savez_compressed(
    output_dir / 'pokemon_data.npz',
    X_train=X_train,
    X_test=X_test,
    y_train=y_train,
    y_test=y_test,
    label_names=label_names
)

file_size = (output_dir / 'pokemon_data.npz').stat().st_size / (1024**2)
print(f"✓ Data saved to {output_dir / 'pokemon_data.npz'}")
print(f"  File size: {file_size:.2f} MB")

## Loading Preprocessed Data (for training notebook)

Use this code in your training notebook to load the preprocessed data:

In [None]:
# Load preprocessed data
data = np.load('../data/pokemon_preprocessed/pokemon_data.npz')
X_train = data['X_train']
X_test = data['X_test']
y_train = data['y_train']
y_test = data['y_test']
label_names = data['label_names'].tolist()  # stored as a NumPy string array; convert back to a list

print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
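
Two details of the `.npz` round trip are easy to miss: `label_names` is saved from a Python list but is stored as a NumPy string array, and `np.load` keeps the archive file open until it is closed. The sketch below demonstrates both with a throwaway file; the array contents and the file name `demo.npz` are invented for the example.

```python
import tempfile
from pathlib import Path

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'demo.npz'  # throwaway file, not the real dataset

    # Save a tiny stand-in for the real arrays
    np.savez_compressed(
        path,
        X_train=np.zeros((2, 120, 120, 3), dtype=np.float32),
        label_names=['Fire', 'Water'],
    )

    # The context manager closes the archive once the arrays are read
    with np.load(path) as data:
        X_demo = data['X_train']
        names = data['label_names']

print(type(names).__name__, names.dtype.kind)  # array type and dtype kind
names_list = names.tolist()                    # back to a plain list of str
print(names_list)
```

Indexing works the same on the array and the list, so the conversion mainly matters for code that expects list methods or plain `str` values.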

## Summary

**Preprocessing steps completed:**
1. Loaded Pokemon images from directory structure
2. Resized images to 120×120 pixels
3. Normalized pixel values to [0, 1] range
4. Split data into train/test sets (80/20)
5. Saved preprocessed data for CNN training

**Next step:** Build and train a CNN model using the preprocessed data.