# Exercise 2: Transfer Learning

Attribution: Kolhatkar, Varada (2024) [DSCI 572](https://ubc-mds.github.io/DSCI_572_sup-learn-2/README.html) 

**Transfer learning** is like borrowing knowledge from one task to help with another: you take a model that has already learned patterns from a related task (e.g., classifying images in [ImageNet](https://www.image-net.org/)) and adapt it to your task (e.g., detecting specific types of fruits) with less effort and data.

In this exercise, you will explore transfer learning by leveraging pre-trained image classification models. Specifically, you will:

- Use these models out of the box to classify your own images.
  
- Use them as feature extractors to obtain rich representations of your images, which you can then apply to your own tasks.

**Important!!**

We are going to run this notebook on the cloud using [Kaggle](https://www.kaggle.com). Kaggle offers **30 hours** of free GPU usage per week, which should be more than enough for this lab.

Make sure the following are ready **before** you start the lab.

- Create a Kaggle account [here](https://www.kaggle.com/) if you don't have one yet
  
- Verify your phone number [here](https://www.kaggle.com/settings) to get access to GPUs

## Getting Started with Kaggle Kernels
<hr>

To get started, follow these steps:

1. Go to https://www.kaggle.com/kernels
2. Select `+ New Notebook`
3. Click `File` at the top left of your Kaggle notebook and select `Import Notebook`
4. Upload this notebook
5. On the right-hand side of your Kaggle notebook, find `Session options` and make sure:
  
  - `INTERNET` is enabled.
  
  - In the `ACCELERATOR` dropdown, choose an option that starts with `GPU` when you're ready to use it (you can turn it on and off as needed).
    
6. **Run** the cell below to prepare the model, labels, and helper functions.
> The code in the following cell contains helper functions that will be used later. You don't need to fully understand this code to answer the questions in this notebook.

In [None]:
# Import 
from PIL import Image
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from torch import nn, optim
from torchvision import datasets, models, transforms, utils
import glob
import json
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import numpy as np
import os, sys
import pandas as pd
import random
import torch
import torchvision
%matplotlib inline

plt.rcParams.update({'axes.grid': False})

# Download ImageNet labels
!wget https://raw.githubusercontent.com/pytorch/hub/master/imagenet_classes.txt

# Prepare the device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Initialize model globally for reuse
vgg_model = models.get_model('vgg16', weights='VGG16_Weights.DEFAULT').to(device)
vgg_model.eval()

densenet_model = models.get_model('densenet121', weights="DenseNet121_Weights.IMAGENET1K_V1").to(device)
densenet_model.classifier = nn.Identity()  # remove that last "classification" layer
densenet_model.eval()

def classify_image(img: Image.Image, topn: int = 4) -> pd.DataFrame:
    """
    Classify an image using a pre-trained VGG16 model.
    
    Args:
        img: PIL Image to classify
        topn: Number of top predictions to return
        
    Returns:
        DataFrame with top class predictions and their probabilities
    """
    preprocess = transforms.Compose([
                transforms.Resize(299),
                transforms.CenterCrop(299),
                transforms.ToTensor(),
                transforms.Normalize(mean=[0.485, 0.456, 0.406], 
                                  std=[0.229, 0.224, 0.225])])

    with open("imagenet_classes.txt", "r") as f:
        classes = [line.strip() for line in f.readlines()]
    
    img_t = preprocess(img)
    batch_t = torch.unsqueeze(img_t, 0).to(device)
    
    with torch.no_grad():
        output = vgg_model(batch_t)
        _, indices = torch.sort(output, descending=True)
        probabilities = torch.nn.functional.softmax(output, dim=1)
    
    d = {'Predicted Class': [classes[idx] for idx in indices[0][:topn]], 
         'Probability Score': [np.round(probabilities[0, idx].item(),3) for idx in indices[0][:topn]]}
    return pd.DataFrame(d, columns=['Predicted Class','Probability Score'])


# Attribution: [Code from PyTorch docs](https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html?highlight=transfer%20learning)
IMAGE_SIZE = 200
BATCH_SIZE = 64

def read_data(data_dir: str, subdir: dict) -> tuple:
    """
    Reads and preprocesses image data from directories.
    
    Args:
        data_dir: Base directory containing image data
        subdir: Dictionary with train/valid subdirectories
        
    Returns:
        tuple: (image_datasets, dataloaders) containing the processed datasets
    """
    data_transforms = {
        "train": transforms.Compose(
            [
                transforms.Resize((IMAGE_SIZE, IMAGE_SIZE)),     
                transforms.ToTensor(),
                transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),            
            ]
        ),
        "valid": transforms.Compose(
            [
                transforms.Resize((IMAGE_SIZE, IMAGE_SIZE)),                        
                transforms.ToTensor(),
                transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),                        
            ]
        ),
    }

    image_datasets = {
        x: datasets.ImageFolder(os.path.join(data_dir, subdir[x]), data_transforms[x])
        for x in ["train", "valid"]
    }
    
    dataloaders = {}
    
    dataloaders["train"] = torch.utils.data.DataLoader(
            image_datasets["train"], batch_size=BATCH_SIZE, shuffle=True
        )
    
    dataloaders["valid"] = torch.utils.data.DataLoader(
            image_datasets["valid"], batch_size=BATCH_SIZE, shuffle=True
        )
    
    return image_datasets, dataloaders

def get_features(model: nn.Module, data_loader: torch.utils.data.DataLoader, seed: int = None, verbose: bool = False) -> tuple:
    """
    Extract features from penultimate layer of model.
    
    Args:
        model: Pre-trained neural network
        data_loader: DataLoader containing images
        seed: Random seed for reproducibility
        verbose: Whether to print the first few extracted feature vectors
        
    Returns:
        tuple: (features, labels) as torch tensors
    """
    if seed is not None:  # `if seed:` would silently ignore seed=0
        torch.manual_seed(seed)
    model.to(device)
    with torch.no_grad():  # disable gradient tracking; we only need forward passes
        Z_init = torch.empty((0, 1024)).to(device)  # Initialize empty tensors
        y_init = torch.empty((0)).to(device)
        for X, y in data_loader:
            X, y = X.to(device), y.to(device)
            Z_init = torch.cat((Z_init, model(X)), dim=0)
            y_init = torch.cat((y_init, y))
    if verbose:
        print(f'Sample feature vectors: \n {pd.DataFrame(Z_init.cpu().detach())[:10]} \n')
    return Z_init.cpu().detach(), y_init.cpu().detach()

def show_predictions(pipe, 
                    Z_valid: torch.Tensor,
                    y_valid: torch.Tensor, 
                    dataloader: torch.utils.data.DataLoader,
                    class_names: dict,
                    num_images: int = 20,
                    seed: int = None) -> None:
    """
    Display images with predicted and true labels.
    
    Args:
        pipe: Trained sklearn pipeline
        Z_valid: Validation features
        y_valid: Validation labels  
        dataloader: DataLoader for images
        class_names: Dictionary with 'train' and 'valid' lists of class names
        num_images: Number of images to display
        seed: Random seed for reproducibility
    """
    if seed is not None:  # `if seed:` would silently ignore seed=0
        torch.manual_seed(seed)
    images_so_far = 0
    fig = plt.figure(figsize=(15, 25))  # Adjust the figure size for better visualization

    # Convert the features and labels to numpy arrays
    Z_valid = Z_valid.numpy()
    y_valid = y_valid.numpy()

    # Make predictions using the trained logistic regression model
    preds = pipe.predict(Z_valid)

    with torch.no_grad():
        for idx, (inputs, labels) in enumerate(dataloader):
            inputs = inputs.cpu()
            for j in range(inputs.size()[0]):
                if images_so_far >= num_images:
                    return
                # print(f"Dataloader Labels: {labels[j]}: {class_names['valid'][labels[j]]}")
                # 5 images per row; round the row count up so any num_images fits
                ax = plt.subplot(int(np.ceil(num_images / 5)), 5, images_so_far + 1)
                ax.axis('off')
                # A manual hotfix for the wrong directory naming in the dataset
                ax.set_title(f"Predicted Class: {class_names['train'][int(preds[images_so_far])]} \n"
                             f"Actual Label: {class_names['valid'][int(y_valid[images_so_far])]}")
                inp = inputs.data[j].numpy().transpose((1, 2, 0))
                mean = np.array([0.5, 0.5, 0.5])
                std = np.array([0.5, 0.5, 0.5])
                inp = std * inp + mean
                inp = np.clip(inp, 0, 1)
                ax.imshow(inp)
                #imshow(inputs.data[j])
                images_so_far += 1

def show_image_label_prob(image_path: str, true_label: str, num_images: int) -> None:
    """
    Display images with classifications and probabilities.
    
    Args:
        image_path: Glob pattern to match image files
        true_label: Actual label of the images
        num_images: Number of random images to display
    """
    images = glob.glob(image_path)
    # Cap the sample size so a small folder does not raise ValueError
    selected_images = random.sample(images, min(num_images, len(images)))
    plt.figure(figsize=(5, 5));
    for image in selected_images:
        img = Image.open(image)
        img.load()
        plt.imshow(img)
        plt.title(f'Actual Label: {true_label}')
        plt.show()
        df = classify_image(img)    
        print(df.to_string(index=False))
        print("--------------------------------------------------------------\n\n")

def extract_features_and_train_classifier(DATA_DIR: str, SUBDIR: dict, seed: int = None, verbose: bool = True):
    """
    Extract features and train a classifier on image data.
    
    Args:
        DATA_DIR: Base directory containing image data
        SUBDIR: Dictionary with train/valid subdirectories
        seed: Random seed for reproducibility
        verbose: Whether to print training progress
        
    Returns:
        Dictionary containing model, features, and class names
    """
    image_datasets, dataloaders = read_data(DATA_DIR, SUBDIR)
    class_names = {
        "train": image_datasets["train"].classes,
        "valid": image_datasets["valid"].classes
    }
    
    Z_train, y_train = get_features(
        densenet_model, dataloaders["train"], 
        seed=seed, verbose=verbose
    )
    Z_valid, y_valid = get_features(
        densenet_model, dataloaders["valid"],
        seed=seed
    )
    
    pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
    
    if verbose:
        print("======== ML Model Training In Progress ========")
    
    pipe.fit(Z_train, y_train)
    
    if verbose:
        print(f"Training score: {pipe.score(Z_train, y_train):.3f}")
        print(f"Validation score: {pipe.score(Z_valid, y_valid):.3f}")
        print("======== ML Model Training Complete ========")

    return {
        "class_names": class_names,
        "dataloaders": dataloaders,
        "Z_train": Z_train,
        "y_train": y_train,
        "Z_valid": Z_valid,
        "y_valid": y_valid,
        "model": pipe,
    }

def show_directory(directory='/kaggle/input'):
    for dirname, _, filenames in os.walk(directory):
        print(f'Directory: {dirname}')

print('Helper functions, libraries, labels are imported successfully!')

Once you've finished your work on Kaggle, download the notebook so that none of your work is lost.

## Getting Started with Kaggle Datasets
<hr>

In this exercise, we'll use some Kaggle datasets for image classification. To get started with the dataset, follow the instructions below.

1. On the right-hand side of your Kaggle notebook, find `Input` and click `+ Add Input`.

2. Choose `Datasets`. In the search bar, type `cat-and-dog`, locate this [dataset](https://www.kaggle.com/datasets/tongpython/cat-and-dog) (228MB), and click `+` if it has not been added yet.

3. In the search bar, type `fruit-classification10-class`, locate this [dataset](https://www.kaggle.com/datasets/karimabdulnabi/fruit-classification10-class) (31MB), and click `+` if it has not been added yet.
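
Once both datasets are attached, it can help to confirm what actually landed under `/kaggle/input`. The sketch below is one possible way to do that: it walks a directory and counts image files per folder. The function name `count_images_per_folder` and the set of file suffixes are assumptions made for illustration; they are not part of the helper code in this notebook.

```python
import os

# Assumed set of image suffixes; extend it if your dataset uses other formats.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}

def count_images_per_folder(root: str) -> dict:
    """Return {folder path relative to `root`: number of image files in it}."""
    counts = {}
    for dirname, _, filenames in os.walk(root):
        n_images = sum(
            os.path.splitext(name)[1].lower() in IMAGE_EXTENSIONS
            for name in filenames
        )
        if n_images:  # skip folders that hold no images
            counts[os.path.relpath(dirname, root)] = n_images
    return counts
```

On Kaggle you could call, for example, `count_images_per_folder('/kaggle/input/cat-and-dog')` and check that every class folder reports a non-zero count.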

## Exercise 1: Using pre-trained models out of the box
<hr>

First, we will use pre-trained Convolutional Neural Network (CNN) models out of the box for image classification. You can find a list of available pre-trained models [here](https://nnabla.readthedocs.io/en/v1.39.0/python/api/models/imagenet.html).

In this exercise, we'll use the `VGG16` model to classify cats and dogs using this [dataset](https://www.kaggle.com/datasets/tongpython/cat-and-dog). 
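
The helper `classify_image` turns the model's raw output scores into the `Probability Score` column with a softmax and keeps the `topn` largest entries. The following is a minimal NumPy sketch of that step only, using made-up scores and made-up labels rather than real VGG16 output; `top_predictions` is a hypothetical name.

```python
import numpy as np

def top_predictions(logits, class_names, topn=4):
    """Return the `topn` (class name, probability) pairs for one image."""
    logits = np.asarray(logits, dtype=float)
    # Softmax: subtract the maximum first for numerical stability.
    exp = np.exp(logits - logits.max())
    probabilities = exp / exp.sum()
    # Indices of the largest probabilities, highest first.
    order = np.argsort(probabilities)[::-1][:topn]
    return [(class_names[i], round(float(probabilities[i]), 3)) for i in order]

# Made-up scores for four made-up labels (not real model output).
example = top_predictions([2.0, 1.0, 0.1, -1.0],
                          ["tabby", "beagle", "lynx", "pug"], topn=2)
print(example)
```

Because the softmax is taken over every class the model knows, the probabilities of the displayed top entries do not have to sum to 1.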

**Run** the following cell and it will display the image, its label (`Actual Label`), the model's predictions (`Predicted Class`), and the corresponding probabilities (`Probability Score`).

In [None]:
# YOUR INPUT: Number of images shown per category
num_images=4

# Set up data
IMAGE_DIR = {
    "Cat": "/kaggle/input/cat-and-dog/test_set/test_set/cats/*.*", 
    "Dog": "/kaggle/input/cat-and-dog/test_set/test_set/dogs/*.*"
}

# Display image and prediction labels with probability
for true_label, image_path in IMAGE_DIR.items():
    show_image_label_prob(image_path, true_label, num_images=num_images)

## Exercise 1.1 

<div class="alert alert-info">

**Discussion questions**

1. How well does the model distinguish between cats and dogs?
2. Do you notice any specific patterns or characteristics in the Predicted Class versus the Actual Label?
   
</div>

<div class="alert alert-warning">

Type your answer below.
    
</div>

_Type your answer here, replacing this text._

Hopefully, you observed reasonable performance on the cats and dogs dataset. Now, let's test the model on a slightly different dataset: [Fruit Classification Dataset](https://www.kaggle.com/datasets/karimabdulnabi/fruit-classification10-class). 

The dataset includes images from the following 10 classes:
- Apple
- Banana
- Avocado
- Cherry
- Kiwi
- Mango
- Orange
- Pineapple
- Strawberries
- Watermelon
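
An out-of-the-box model can only predict labels from its own training vocabulary, so one quick check is whether these fruit names appear in the model's label list at all. The sketch below shows the idea with a toy label list; `split_by_label_coverage` is a hypothetical helper, and the toy list is not the real ImageNet label file.

```python
def split_by_label_coverage(target_names, model_labels):
    """Split `target_names` into those present in / absent from `model_labels`."""
    known = {label.strip().lower() for label in model_labels}
    present = [name for name in target_names if name.lower() in known]
    absent = [name for name in target_names if name.lower() not in known]
    return present, absent

# Toy label list for illustration only; it is NOT the real ImageNet label file.
toy_labels = ["banana", "orange", "tabby", "beagle"]
present, absent = split_by_label_coverage(["Apple", "Banana", "Orange"], toy_labels)
print("present:", present)
print("absent:", absent)
```

On Kaggle, after the helper cell has downloaded `imagenet_classes.txt`, you could pass its lines as `model_labels`. The comparison is an exact, case-insensitive string match, so near-matches such as singular versus plural forms are reported as absent.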

**Run** the following cell and it will display the image, its label (`Actual Label`), the model's predictions (`Predicted Class`), and the corresponding probabilities (`Probability Score`).

In [None]:
# YOUR INPUT: Number of images shown per category
num_images=2

# Set up data
IMAGE_DIR = {
    "Apple": "/kaggle/input/fruit-classification10-class/MY_data/train/Apple/*.jpeg", 
    "Banana": "/kaggle/input/fruit-classification10-class/MY_data/train/Banana/*.jpeg", 
    "Avocado": "/kaggle/input/fruit-classification10-class/MY_data/train/avocado/*.jpeg", 
    "Cherry": "/kaggle/input/fruit-classification10-class/MY_data/train/cherry/*.jpeg", 
    "Kiwi": "/kaggle/input/fruit-classification10-class/MY_data/train/kiwi/*.jpeg", 
    "Mango": "/kaggle/input/fruit-classification10-class/MY_data/train/mango/*.jpeg", 
    "Orange": "/kaggle/input/fruit-classification10-class/MY_data/train/orange/*.jpeg", 
    "Pineapple": "/kaggle/input/fruit-classification10-class/MY_data/train/pinenapple/*.jpeg", 
    "Strawberries": "/kaggle/input/fruit-classification10-class/MY_data/train/strawberries/*.jpeg", 
    "Watermelon": "/kaggle/input/fruit-classification10-class/MY_data/train/watermelon/*.jpeg", 
}

# Display image and prediction labels with probability
for true_label, image_path in IMAGE_DIR.items():
    show_image_label_prob(image_path, true_label, num_images=num_images)


## Exercise 1.2

<div class="alert alert-info">

**Discussion questions**

1. How well does the model distinguish between different types of fruits?
2. Did you notice any differences in the model's performance between the cats and dogs dataset and the fruits dataset? Briefly explain your answer. 

</div>

<div class="alert alert-warning">

Type your answer below.
    
</div>

_Type your answer here, replacing this text._

<br><br><br><br>

## Exercise 2: Using pre-trained models as feature extractors

<br>

Often, we want to train a model on our own datasets and have it predict classes specific to our data, rather than the 1000 classes from ImageNet. To achieve this, we can use pre-trained models as feature extractors. Specifically, we can leverage the rich representations learned by pre-trained models, use these representations as feature vectors, and train a new model on these feature vectors for our specific task.

In this exercise, you will use a pre-trained CNN model, `DenseNet`, to extract features from images and train a logistic regression classifier to identify different types of fruits.
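
To make the second half of this idea concrete, here is a minimal sketch of training a classifier on feature vectors. It uses the same scikit-learn pipeline as the helper code (`StandardScaler` followed by `LogisticRegression`), but the feature vectors are synthetic random clusters that stand in for network outputs. The sizes, cluster centers, and the `make_split` helper are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_classes, n_features = 3, 16  # toy sizes; the helper code uses 1024-dimensional features

# Synthetic "feature vectors": one well-separated cluster per class.
centers = rng.normal(0.0, 5.0, size=(n_classes, n_features))

def make_split(n_per_class):
    """Draw `n_per_class` noisy points around each class center."""
    y = np.repeat(np.arange(n_classes), n_per_class)
    Z = centers[y] + rng.normal(0.0, 1.0, size=(len(y), n_features))
    return Z, y

Z_train, y_train = make_split(50)
Z_valid, y_valid = make_split(20)

# Same pipeline shape as in the helper code: scale, then logistic regression.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
pipe.fit(Z_train, y_train)
train_score = pipe.score(Z_train, y_train)
valid_score = pipe.score(Z_valid, y_valid)
print(f"Training score: {train_score:.3f}")
print(f"Validation score: {valid_score:.3f}")
```

The classifier never sees pixels, only the feature vectors, which is why a simple linear model can be enough when the features are informative.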

To get started, **run** the following cell to prepare the data.

In [None]:
# Set up data
DATA_DIR = '/kaggle/input/fruit-classification10-class/MY_data/'
SUBDIR = {'train': 'train', 'valid': 'test'}

Next, we'll extract feature vectors from the images above using the pre-trained model. 

Each image is represented by a feature vector extracted from the `DenseNet` model; the first few vectors are printed as tabular data when the cell below runs.
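
The helper `get_features` builds this table by passing the images through the network one batch at a time and stacking the outputs into a single matrix with one row per image. The sketch below mimics that loop in NumPy under simplifying assumptions: the images are tiny random arrays, and `fake_extractor` is a fixed random projection standing in for the network.

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.random((150, 3, 8, 8))    # 150 tiny fake RGB images
W = rng.normal(size=(3 * 8 * 8, 10))   # fixed random projection, stand-in for a network

def fake_extractor(batch):
    """Map a batch of images (B, 3, 8, 8) to feature vectors (B, 10)."""
    return batch.reshape(len(batch), -1) @ W

batch_size = 64
features = np.empty((0, 10))           # start empty, like Z_init in get_features
for start in range(0, len(images), batch_size):
    batch = images[start:start + batch_size]
    features = np.concatenate([features, fake_extractor(batch)], axis=0)

print(features.shape)  # (150, 10): one row per image, one column per feature
```

With the real model the loop is the same, but each row has 1024 columns and the labels are collected alongside the features.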

We'll then train a **logistic regression model** to classify fruits using the extracted features and the true label of the fruit images.

**Run** the following cell to extract features and train the model.

In [None]:
# YOUR INPUT: Integer `seed` that fixes the shuffling order of the images
seed=315

# Extract features and train ML model
model_output = extract_features_and_train_classifier(DATA_DIR, SUBDIR, seed=seed)

Now **run** the cell below to display some of the model's predictions so we can examine them together!

In [None]:
# YOUR INPUT: Number of images shown
num_images=25

# Show predictions
show_predictions(model_output["model"], model_output["Z_valid"], model_output["y_valid"], model_output["dataloaders"]["valid"], model_output["class_names"], num_images=num_images, seed=seed)

## Exercise 2.1

<div class="alert alert-info">

**Discussion questions**

1. How well does the model distinguish between different types of fruits?
2. Is the performance with feature extraction better than the performance with the out-of-the-box method? 

</div>

<div class="alert alert-warning">

Type your answer below.
    
</div>

_Type your answer here, replacing this text._

### Your Free Time (Optional)

**Your tasks**:

Choose any image dataset that interests you and train a model using it!

Feel free to discuss your ideas and progress with your teammates and the workshop team.
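
The helper code reads images with `datasets.ImageFolder`, which expects one subfolder per class under each of the `train` and `valid` directories named in `SUBDIR`, and which assigns class indices from the sorted folder names. Before training on a new dataset, a check along the following lines may save some debugging. Both function names are hypothetical, and this is only a sketch of one possible check.

```python
import os

def list_class_folders(data_dir: str, subdir: dict) -> dict:
    """Return the sorted class-folder names found under each split directory."""
    classes = {}
    for split, folder in subdir.items():
        split_path = os.path.join(data_dir, folder)
        classes[split] = sorted(
            name for name in os.listdir(split_path)
            if os.path.isdir(os.path.join(split_path, name))
        )
    return classes

def same_classes(classes: dict) -> bool:
    """True if every split lists exactly the same class folders, in the same order."""
    class_lists = list(classes.values())
    return all(names == class_lists[0] for names in class_lists[1:])
```

For example, `same_classes(list_class_folders(DATA_DIR, SUBDIR))` returning `False` suggests that the training and validation folders are named differently (including differences in capitalization), in which case the same index may refer to different classes in the two splits.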

In [None]:
# YOUR INPUT
DATA_DIR = "{_DATA_DIRECTORY_PATH_}"
SUBDIR = "{_SUB_DIRECTORY_PATH_}"
seed = "{Seed_Integer}"
# Example - dataset `cat-breed-mardhik` https://www.kaggle.com/datasets/solothok/cat-breed
# DATA_DIR = '/kaggle/input/cat-breed/cat-breed/'
# SUBDIR = {'train': 'TRAIN', 'valid': 'TEST'}
# seed = 315

# Extract features and train ML model
model_output = extract_features_and_train_classifier(DATA_DIR, SUBDIR, seed=seed)

In [None]:
# YOUR INPUT: Number of images shown
num_images=20

# Show predictions
show_predictions(model_output["model"], model_output["Z_valid"], model_output["y_valid"], model_output["dataloaders"]["valid"], model_output["class_names"], num_images=num_images, seed=seed)


<br><br>