## Student Information
Gideon Shahar

Noy Rahmani  
311124416

### 1. Problem Definition
**Classes**

In this project we use the Marvel Heroes image dataset from Kaggle.
Each image belongs to one of 8 classes, where each class corresponds to a specific Marvel character:

- **Black Widow**

- **Captain America**

- **Doctor Strange**

- **Hulk**

- **Iron Man**

- **Loki**

- **Spider-Man**

- **Thanos**

The dataset is already organized into folders, one folder per character, with roughly 300 to 350 training images and 55 to 60 validation images per class. This makes it a natural 8-way image classification problem:
Given an image, predict which character appears in it.

**Use-case and Motivation**

Superhero images appear everywhere: fan sites, news articles, social media posts, and digital asset libraries.
Manually tagging every image with the correct character is time-consuming and error-prone.

A model that can automatically classify an image as Black Widow, Hulk, Iron Man, etc., could be used to:

- Auto-tag images in a content management system
(e.g., “show me all images of Spider-Man”).

- Improve search and recommendations in a fan app or media platform
(e.g., “show me similar images of Loki”).

- Provide a proof-of-concept for using deep learning in entertainment and media applications.

This project is a small but realistic example of how an image classifier can help organize and search large collections of visual content.

**Expected Challenges**

Even though the dataset is relatively clean, we expect several challenges:

1. Visual similarity between classes

Some heroes share similar colors and costume patterns
(e.g., the red/blue themes of Spider-Man and Captain America, or the dark suits of Black Widow and Iron Man), which can confuse the model.

2. Variation in pose, lighting, and background

Characters appear in different:

- poses

- zoom levels

- lighting conditions

- environments (city, space, explosions, studio backgrounds)

These variations may distract the model or make certain features harder to learn.

3. Artwork vs. renders vs. movie stills

The dataset may contain a mixture of:

- movie stills

- posters

- CGI renders

- fan art styles

The model must learn to focus on character identity, not artistic style.

4. Potential label noise

Some images may:

- contain multiple characters, or

- be misfiled under the wrong folder

This introduces noise into the labels and may reduce accuracy.

5. Class imbalance and overfitting

Some characters may have more images or higher-quality images than others.
With only about 3,000 images in total (2,584 training and 451 validation), we must be careful to avoid overfitting, especially when using a large pretrained model.

Part of the project will be to investigate these issues using:

- confusion matrix

- top-loss examples

- misclassification analysis

and discuss how these factors affect the final results.
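The analyses listed above can be sketched in plain Python to show the underlying idea. In fastai they are typically done with `ClassificationInterpretation` (for example `plot_confusion_matrix` and `plot_top_losses`); the helper names `confusion_counts` and `top_losses` below are hypothetical, and the predictions are made-up values for illustration, not results from this project.

```python
import math
from collections import Counter

def confusion_counts(true_labels, pred_labels):
    """Count (true, predicted) label pairs; off-diagonal pairs are errors."""
    return Counter(zip(true_labels, pred_labels))

def top_losses(true_labels, pred_probs, k=3):
    """Return (index, loss) for the k samples with the highest cross-entropy loss.

    pred_probs[i] maps each class name to the predicted probability for sample i.
    """
    losses = [-math.log(max(probs[label], 1e-12))
              for label, probs in zip(true_labels, pred_probs)]
    ranked = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return [(i, losses[i]) for i in ranked[:k]]

# Made-up predictions for illustration only (not results from this project)
true_labels = ["hulk", "loki", "hulk", "thanos"]
pred_probs = [
    {"hulk": 0.9, "loki": 0.05, "thanos": 0.05},
    {"hulk": 0.1, "loki": 0.7, "thanos": 0.2},
    {"hulk": 0.2, "loki": 0.1, "thanos": 0.7},
    {"hulk": 0.3, "loki": 0.1, "thanos": 0.6},
]
# The predicted label is the class with the highest probability
pred_labels = [max(p, key=p.get) for p in pred_probs]

cm = confusion_counts(true_labels, pred_labels)
print("hulk predicted as thanos:", cm[("hulk", "thanos")])
print("worst sample:", top_losses(true_labels, pred_probs, k=1))
```

Off-diagonal entries of the confusion counts point to pairs of characters the model mixes up, while the highest-loss samples are good candidates for manual inspection for label noise.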

### 2. Dataset Creation and Preparation

#### 2.1 Source

For this project we use the **Marvel Heroes** image dataset from Kaggle.  
The dataset contains images of 8 Marvel characters (one folder per character), with roughly 360 to 410 images per class across the training and validation splits.

Source: https://www.kaggle.com/datasets/hchen13/marvel-heroes

Please download the dataset manually from Kaggle as a ZIP file (`marvel-heroes.zip`).

#### 2.2 Folder structure

Please extract the ZIP file so that the directory structure is:

- `data/`
  - `train/`
    - `black widow/`
    - `captain america/`
    - `doctor strange/`
    - `hulk/`
    - `ironman/`
    - `loki/`
    - `spider-man/`
    - `thanos/`
  - `valid/`
    ...

Each subfolder contains images for that specific character.  
This structure matches the typical fastai convention: **one folder per class**, where the folder name is used as the label.
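As a minimal sketch of this convention, the label of an image can be read directly from the name of its parent folder. fastai's `parent_label` does the equivalent; the helper `label_from_path` and the file name `img_001.jpg` below are hypothetical and used only for illustration.

```python
from pathlib import Path

def label_from_path(image_path):
    """Return the class label, taken from the name of the image's parent folder."""
    return Path(image_path).parent.name

# "img_001.jpg" is a hypothetical file name used only for illustration
example = Path("data") / "train" / "hulk" / "img_001.jpg"
print(label_from_path(example))
```

Because the label depends only on the immediate parent folder, an image gets the same label whether it sits under `train` or `valid`.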


In [1]:
import sys, fastai
print(sys.executable)
print("fastai version:", fastai.__version__)


c:\Users\Noy\.virtualenvs\shared\Scripts\python.exe
fastai version: 2.8.5


In [None]:
from pathlib import Path

data_path = Path("data")
if not data_path.exists():
    raise FileNotFoundError("The 'data' directory does not exist. Please ensure it is present.")

# Print the number of images per class for each split
train_path = data_path / "train"
valid_path = data_path / "valid"
for split_path, split_name in [(train_path, "Training"), (valid_path, "Validation")]:
    if not split_path.exists():
        raise FileNotFoundError(f"The '{split_path.name}' directory does not exist in 'data'. Please ensure it is present.")
    # Sort the class folders so the output order is deterministic
    for class_dir in sorted(split_path.iterdir()):
        if class_dir.is_dir():
            # Count every file under the class folder, including nested subfolders
            num_images = sum(1 for f in class_dir.rglob('*') if f.is_file())
            print(f"{split_name} data - Class '{class_dir.name}': {num_images} images")


Training data - Class 'black widow': 320 images
Training data - Class 'captain america': 324 images
Training data - Class 'doctor strange': 345 images
Training data - Class 'hulk': 321 images
Training data - Class 'ironman': 318 images
Training data - Class 'loki': 307 images
Training data - Class 'spider-man': 326 images
Training data - Class 'thanos': 323 images
Validation data - Class 'black widow': 55 images
Validation data - Class 'captain america': 57 images
Validation data - Class 'doctor strange': 61 images
Validation data - Class 'hulk': 56 images
Validation data - Class 'ironman': 56 images
Validation data - Class 'loki': 54 images
Validation data - Class 'spider-man': 57 images
Validation data - Class 'thanos': 55 images
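The training counts printed above can be summarized to check how balanced the classes are. The following is a small sketch using only the numbers from that output; the variable names are our own.

```python
# Training counts copied from the output above
train_counts = {
    "black widow": 320, "captain america": 324, "doctor strange": 345,
    "hulk": 321, "ironman": 318, "loki": 307, "spider-man": 326, "thanos": 323,
}

total = sum(train_counts.values())
largest = max(train_counts, key=train_counts.get)
smallest = min(train_counts, key=train_counts.get)
# Ratio between the largest and smallest class; 1.0 would mean perfectly balanced
ratio = train_counts[largest] / train_counts[smallest]

print(f"Total training images: {total}")
print(f"Largest/smallest class ratio: {ratio:.2f} ({largest} vs. {smallest})")
```

A ratio close to 1 suggests that class imbalance in the image counts is mild for this dataset, so differences in image quality and label noise are likely the more relevant concerns.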


### 3. Data Loading in FastAI

In this section we build a FastAI `DataLoaders` object from the extracted
Marvel Heroes image folders.

- Each **subfolder name** (Black Widow, Hulk, etc.) is treated as a **class label**.
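- We let FastAI create a **training/validation split** from the images: because `valid_pct=0.2` is passed, the existing `train`/`valid` folders are not used as the split; all images under `data/` are pooled and 20% are randomly held out for validation.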
- We apply a basic preprocessing pipeline:
  - resize all images to a fixed size (224×224),
  - apply standard image augmentations during training.

We also display a small batch of images with their labels to verify that the
data has been loaded and labeled correctly.

In [None]:
from fastai.vision.all import *

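# Base data path (contains train/ and valid/, each with one folder per class)
data_path = Path("data")

# Create DataLoaders from the folder structure.
# Labels come from each image's parent folder name. Because valid_pct is set,
# images from both train/ and valid/ are pooled and re-split at random.
dls = ImageDataLoaders.from_folder(
    data_path,
    valid_pct=0.2,     # hold out 20% of all images for validation
    seed=42,           # fixed seed for a reproducible split
    item_tfms=Resize(224),      # resize each image to 224x224 (crops by default)
    batch_tfms=aug_transforms() # standard image augmentations
)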

# Show a small batch of images with labels to verify everything
dls.show_batch(max_n=9, figsize=(6, 6))
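To illustrate what a seeded random split with `valid_pct=0.2` and `seed=42` means, here is a minimal plain-Python sketch. It is not fastai's internal implementation and will not reproduce the exact same split; the function `random_split` and the file names are hypothetical.

```python
import random

def random_split(items, valid_pct=0.2, seed=42):
    """Shuffle item indices with a fixed seed and hold out valid_pct of them."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_valid = int(valid_pct * len(items))
    valid = [items[i] for i in indices[:n_valid]]
    train = [items[i] for i in indices[n_valid:]]
    return train, valid

# Hypothetical file names standing in for the real images
files = [f"img_{i:03d}.jpg" for i in range(10)]
train, valid = random_split(files)
print(len(train), len(valid))
```

Fixing the seed means that rerunning the notebook yields the same training and validation sets, which keeps results comparable between runs.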


### 4. Model Training

### 5. Evaluation