# 01 - Data Exploration & Understanding

This notebook covers:
1. Downloading the dataset
2. Exploring the images
3. Understanding data distribution
4. Visualizing augmentations
5. Testing our dataset class
6. Visualizing a training batch

## Setup

First, let's import our libraries and set up paths.

In [None]:
import sys
from pathlib import Path

# Add src to path so we can import our modules
sys.path.append(str(Path.cwd().parent / 'src'))

# Standard imports
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
import torch
from torchvision import transforms

# Set up matplotlib
%matplotlib inline
plt.style.use('seaborn-v0_8-whitegrid')

## 1. Dataset Download

We need a hard hat image dataset organized for binary classification (hard hat vs. no hard hat). Two options:

### Option A: Kaggle Dataset
1. Go to: https://www.kaggle.com/datasets/andrewmvd/hard-hat-detection
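2. Download and extract to the `data/` folder. Note that this dataset ships bounding-box annotations for object detection, so it must be converted into the class folders shown below before use.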

### Option B: Roboflow Dataset
1. Go to: https://universe.roboflow.com/search?q=hard+hat+classification
2. Find a classification dataset (not object detection)
3. Download in folder structure format

### Expected Structure
```
data/
├── train/
│   ├── hard_hat/
│   │   ├── img001.jpg
│   │   └── ...
│   └── no_hard_hat/
│       ├── img001.jpg
│       └── ...
└── val/
    ├── hard_hat/
    └── no_hard_hat/
```

If your dataset has a different structure, reorganize it to match this layout before running the cells below; the rest of the notebook assumes it.
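
One way to get to that layout is sketched here, assuming the download unpacks into one sub-folder per class (for example `raw/hard_hat/` and `raw/no_hard_hat/`). The helper name `split_class_folders`, the `raw/` location, and the 80/20 split are illustrative assumptions, not part of this project's `src` code.

```python
import random
import shutil
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def split_class_folders(src_dir, dst_dir, val_fraction=0.2, seed=42):
    """Copy src_dir/<class>/* into dst_dir/{train,val}/<class>/.

    Hypothetical helper: assumes one sub-folder per class under src_dir.
    Returns {class_name: (n_train, n_val)}.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    counts = {}

    for class_dir in sorted(p for p in src_dir.iterdir() if p.is_dir()):
        images = sorted(
            p for p in class_dir.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES
        )
        rng.shuffle(images)
        n_val = int(len(images) * val_fraction)
        splits = {"val": images[:n_val], "train": images[n_val:]}

        for split_name, files in splits.items():
            target = dst_dir / split_name / class_dir.name
            target.mkdir(parents=True, exist_ok=True)
            for f in files:
                shutil.copy2(f, target / f.name)  # copy, so the raw data stays intact

        counts[class_dir.name] = (len(images) - n_val, n_val)

    return counts
```

Calling `split_class_folders('../data/raw', '../data')` would then produce `data/train/` and `data/val/` with one folder per class. Keep the source and destination directories separate so the new `train/` and `val/` folders are not picked up as classes.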

In [None]:
# Define paths
DATA_DIR = Path.cwd().parent / 'data'
TRAIN_DIR = DATA_DIR / 'train'
VAL_DIR = DATA_DIR / 'val'

print(f"Data directory: {DATA_DIR}")
print(f"Exists: {DATA_DIR.exists()}")

# Check what we have
if DATA_DIR.exists():
    print("\nContents:")
    for item in DATA_DIR.iterdir():
        print(f"  {item.name}/" if item.is_dir() else f"  {item.name}")
else:
    print("\n⚠️ Data directory not found. Please download the dataset.")

## 2. Explore the Images

Let's look at some sample images from each class.

In [None]:
def show_images_from_folder(folder_path, num_images=5, title="Images"):
    """
    Display sample images from a folder.
    
    Args:
        folder_path: Path to folder containing images
        num_images: Number of images to display
        title: Title for the plot
    """
    folder = Path(folder_path)
    if not folder.exists():
        print(f"Folder not found: {folder}")
        return
    
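    # Get image files: case-insensitive match on common extensions, sorted for a stable order
    extensions = {'.jpg', '.jpeg', '.png'}
    image_files = sorted(p for p in folder.iterdir() if p.suffix.lower() in extensions)
    image_files = image_files[:num_images]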
    
    if not image_files:
        print(f"No images found in {folder}")
        return
    
    # Plot
    fig, axes = plt.subplots(1, len(image_files), figsize=(3*len(image_files), 3))
    if len(image_files) == 1:
        axes = [axes]
    
    for ax, img_path in zip(axes, image_files):
        img = Image.open(img_path)
        ax.imshow(img)
        ax.axis('off')
        ax.set_title(f"{img.size[0]}x{img.size[1]}")
    
    fig.suptitle(title, fontsize=14)
    plt.tight_layout()
    plt.show()

In [None]:
# Show samples from each class
if TRAIN_DIR.exists():
    show_images_from_folder(TRAIN_DIR / 'hard_hat', title="HARD HAT - Training Samples")
    show_images_from_folder(TRAIN_DIR / 'no_hard_hat', title="NO HARD HAT - Training Samples")
else:
    print("Please download the dataset first.")

## 3. Data Distribution

Let's check the class balance. Ideally each class has roughly the same number of images; a heavily skewed split can bias the model toward the majority class.
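
To put a number on "roughly equal", one option is the ratio between the largest and smallest class, plus inverse-frequency class weights. The sketch below uses made-up counts and a hypothetical helper name (`class_balance`); the weighting formula is a common heuristic, not something this project's training code is known to use.

```python
def class_balance(counts):
    """Return (imbalance_ratio, weights) for a {class_name: count} dict.

    imbalance_ratio = largest class / smallest class (1.0 means balanced).
    weights follow the inverse-frequency heuristic total / (n_classes * count).
    """
    if not counts or min(counts.values()) <= 0:
        raise ValueError("every class needs at least one image")
    total = sum(counts.values())
    n_classes = len(counts)
    ratio = max(counts.values()) / min(counts.values())
    weights = {name: total / (n_classes * n) for name, n in counts.items()}
    return ratio, weights


# Illustrative counts only; substitute the numbers printed by the cell below.
ratio, weights = class_balance({"hard_hat": 300, "no_hard_hat": 100})
print(f"imbalance ratio: {ratio:.1f}")
print(weights)
```

If the ratio turns out large, weights like these could be passed (as a tensor ordered by class index) to the `weight` argument of `torch.nn.CrossEntropyLoss`, or the minority class could be oversampled instead.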

In [None]:
def count_images(folder):
    """Count images in a folder."""
    folder = Path(folder)
    if not folder.exists():
        return 0
    # Case-insensitive match so .JPG and .jpeg files are counted too
    return sum(1 for p in folder.iterdir() if p.suffix.lower() in {'.jpg', '.jpeg', '.png'})

# Count images in each split/class
if DATA_DIR.exists():
    stats = {
        'train_hard_hat': count_images(TRAIN_DIR / 'hard_hat'),
        'train_no_hard_hat': count_images(TRAIN_DIR / 'no_hard_hat'),
        'val_hard_hat': count_images(VAL_DIR / 'hard_hat'),
        'val_no_hard_hat': count_images(VAL_DIR / 'no_hard_hat'),
    }
    
    print("Dataset Statistics:")
    print("=" * 40)
    print(f"Training - Hard Hat:     {stats['train_hard_hat']:5d}")
    print(f"Training - No Hard Hat:  {stats['train_no_hard_hat']:5d}")
    print(f"Training - Total:        {stats['train_hard_hat'] + stats['train_no_hard_hat']:5d}")
    print()
    print(f"Validation - Hard Hat:     {stats['val_hard_hat']:5d}")
    print(f"Validation - No Hard Hat:  {stats['val_no_hard_hat']:5d}")
    print(f"Validation - Total:        {stats['val_hard_hat'] + stats['val_no_hard_hat']:5d}")
    
    # Plot distribution
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    
    # Training split
    ax1.bar(['Hard Hat', 'No Hard Hat'], 
            [stats['train_hard_hat'], stats['train_no_hard_hat']],
            color=['green', 'red'])
    ax1.set_title('Training Set')
    ax1.set_ylabel('Number of Images')
    
    # Validation split
    ax2.bar(['Hard Hat', 'No Hard Hat'], 
            [stats['val_hard_hat'], stats['val_no_hard_hat']],
            color=['green', 'red'])
    ax2.set_title('Validation Set')
    ax2.set_ylabel('Number of Images')
    
    plt.tight_layout()
    plt.show()

## 4. Visualize Data Augmentation

Let's see what our data augmentation transforms do to images.
Augmentation helps the model generalize by showing it varied versions of images.
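
Transform pipelines for pretrained models usually end with a per-channel normalization step, which is why the cells below call `denormalize` and clip before plotting. The `dataset` module is not shown in this notebook, so the sketch below assumes the ImageNet mean and standard deviation commonly used with pretrained torchvision models, and works on a single pixel in plain Python to keep the arithmetic visible. The function names are illustrative.

```python
# Assumed statistics (ImageNet channel means/stds); the values actually used
# by the dataset module may differ.
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb):
    """Map an RGB pixel in [0, 1] to normalized space: (x - mean) / std."""
    return tuple((x - m) / s for x, m, s in zip(rgb, MEAN, STD))


def denormalize_pixel(rgb_norm):
    """Invert normalize_pixel: x * std + mean, clipped to [0, 1] for display."""
    return tuple(
        min(1.0, max(0.0, x * s + m)) for x, m, s in zip(rgb_norm, MEAN, STD)
    )


white = (1.0, 1.0, 1.0)
normalized = normalize_pixel(white)
print(normalized)                     # every channel is above 1.0
print(denormalize_pixel(normalized))  # back to roughly (1.0, 1.0, 1.0)
```

Normalized values fall outside the `[0, 1]` range that `imshow` expects for float images, so plotting them directly would give clipped, washed-out colors.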

In [None]:
# Import our dataset module
from dataset import get_train_transforms, get_val_transforms, denormalize

# Get a sample image
sample_dir = TRAIN_DIR / 'hard_hat'
if sample_dir.exists():
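    sample_images = sorted(
        p for p in sample_dir.iterdir() if p.suffix.lower() in {'.jpg', '.jpeg', '.png'}
    )
    if sample_images:
        sample_path = sample_images[0]
        # Force 3-channel RGB so grayscale or RGBA files display and transform consistently
        original_image = Image.open(sample_path).convert('RGB')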
        print(f"Sample image: {sample_path.name}")
        print(f"Original size: {original_image.size}")

In [None]:
def show_augmentations(image, transform, num_samples=6):
    """
    Show multiple augmented versions of the same image.
    
    This demonstrates how random augmentation creates variety.
    """
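    # Lay out num_samples panels in rows of three (num_samples was previously ignored)
    cols = 3
    rows = (num_samples + cols - 1) // cols
    fig, axes = plt.subplots(rows, cols, figsize=(4 * cols, 4 * rows), squeeze=False)
    axes = axes.flatten()
    
    # Hide any unused panels in the last row
    for ax in axes[num_samples:]:
        ax.axis('off')
    
    for i, ax in enumerate(axes[:num_samples]):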
        # Apply transform (random augmentation each time)
        augmented = transform(image)
        
        # Denormalize for visualization
        augmented = denormalize(augmented)
        
        # Convert to numpy for display
        # Tensor shape: (C, H, W) -> numpy shape: (H, W, C)
        augmented_np = augmented.permute(1, 2, 0).numpy()
        augmented_np = np.clip(augmented_np, 0, 1)  # Clip to valid range
        
        ax.imshow(augmented_np)
        ax.set_title(f"Augmented #{i+1}")
        ax.axis('off')
    
    fig.suptitle("Same image with different random augmentations", fontsize=14)
    plt.tight_layout()
    plt.show()

if 'original_image' in dir():
    train_transform = get_train_transforms()
    show_augmentations(original_image, train_transform)

## 5. Test Dataset Class

Let's test our custom PyTorch Dataset class.
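
For orientation, a folder-based dataset typically starts by scanning the class folders and building a list of `(path, label)` pairs; `__len__` then returns the length of that list and `__getitem__` opens one image and applies the transform. The sketch below shows only the indexing step. It is not the actual `HardHatDataset` code, and the helper name and class order are assumptions (the order is chosen so that index 1 is `hard_hat`, matching the green/red coloring used later in this notebook).

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}
CLASSES = ("no_hard_hat", "hard_hat")  # assumed order: label 0, label 1


def index_image_folder(root_dir, classes=CLASSES):
    """Return (image_path, label_index) pairs, ordered by class then filename.

    Hypothetical helper: the label is the position of the folder name in `classes`.
    """
    root = Path(root_dir)
    samples = []
    for label, class_name in enumerate(classes):
        class_dir = root / class_name
        if not class_dir.is_dir():
            continue  # tolerate a missing class folder
        for path in sorted(class_dir.iterdir()):
            if path.suffix.lower() in IMAGE_SUFFIXES:
                samples.append((path, label))
    return samples
```

Comparing `len(index_image_folder(TRAIN_DIR))` with the dataset size printed below is a quick check that no files are being skipped.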

In [None]:
from dataset import HardHatDataset, create_dataloaders

# Create dataset
if TRAIN_DIR.exists():
    train_dataset = HardHatDataset(
        root_dir=str(TRAIN_DIR),
        transform=get_train_transforms()
    )
    
    print(f"\nDataset size: {len(train_dataset)} images")

In [None]:
# Get a sample
if 'train_dataset' in dir() and len(train_dataset) > 0:
    image, label = train_dataset[0]
    
    print("Sample from dataset:")
    print(f"  Image shape: {image.shape}")  # Should be (3, 224, 224)
    print(f"  Label: {label} ({train_dataset.CLASSES[label]})")
    print(f"  Pixel value range: [{image.min():.2f}, {image.max():.2f}]")

In [None]:
# Test DataLoader
if DATA_DIR.exists():
    train_loader, val_loader = create_dataloaders(
        data_dir=str(DATA_DIR),
        batch_size=16
    )
    
    # Get a batch
    images, labels = next(iter(train_loader))
    
    print(f"\nBatch from DataLoader:")
    print(f"  Images shape: {images.shape}")  # Should be (16, 3, 224, 224)
    print(f"  Labels shape: {labels.shape}")  # Should be (16,)
    print(f"  Labels: {labels.tolist()}")

## 6. Visualize a Batch

Let's display a batch of images with their labels.

In [None]:
def show_batch(images, labels, class_names, num_images=8):
    """
    Display a batch of images with their labels.
    
    Args:
        images: Batch of image tensors (B, C, H, W)
        labels: Batch of labels (B,)
        class_names: List of class names
        num_images: Number of images to display
    """
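    num_images = min(num_images, len(images))
    
    # Two rows; round the column count up so an odd num_images still fits
    cols = max(1, (num_images + 1) // 2)
    fig, axes = plt.subplots(2, cols, figsize=(15, 7), squeeze=False)
    axes = axes.flatten()
    
    # Hide panels left empty when num_images is odd
    for ax in axes[num_images:]:
        ax.axis('off')
    
    for i, ax in enumerate(axes[:num_images]):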
        # Denormalize
        img = denormalize(images[i])
        img = img.permute(1, 2, 0).numpy()
        img = np.clip(img, 0, 1)
        
        # Get label
        label_idx = labels[i].item()
        label_name = class_names[label_idx]
        
        # Plot: green title for hard hat, red otherwise
        # (assumes class index 1 is the hard-hat class)
        ax.imshow(img)
        ax.set_title(label_name, color='green' if label_idx == 1 else 'red')
        ax.axis('off')
    
    fig.suptitle("Training Batch Sample", fontsize=14)
    plt.tight_layout()
    plt.show()

if 'train_loader' in dir():
    images, labels = next(iter(train_loader))
    show_batch(images, labels, HardHatDataset.CLASSES)

## Summary

In this notebook we:

1. **Downloaded/located** the hard hat dataset
2. **Explored** sample images from each class
3. **Analyzed** class distribution (checking for imbalance)
4. **Visualized** data augmentation transforms
5. **Tested** our custom Dataset class and the DataLoaders built from it
6. **Visualized** a labeled training batch

### Next Steps

Now that we understand our data, we can move on to:
1. Model creation (see `src/model.py`)
2. Training (see `src/train.py`)
3. Inference (see `src/inference.py`)

To train the model, run:
```bash
cd src
python train.py --data_dir ../data --epochs 10
```