<a href="https://colab.research.google.com/github/kamalasaurus/BIOL-GA-1133/blob/kamalasaurus/MachineLearningHW2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

 # Assignment: Comparing feature extraction methods and classifiers for skin cancer image classification

 This assignment guides you through a multi-model comparison for classifying skin cancer images as
 benign or malignant. You will use the Falah/skin-cancer dataset from Hugging Face and experiment
 with different feature extraction techniques and classic machine learning models.

 ## Prerequisites

 Ensure you have the necessary libraries installed:


In [1]:
pip install datasets scikit-learn transformers Pillow numpy scikit-image opencv-python torch



## Step 1: Download and prepare the dataset

 Use the datasets library to load the Falah/skin-cancer data from Hugging Face.

In [4]:
from datasets import load_dataset
import numpy as np
from PIL import Image

from google.colab import userdata
HF_TOKEN = userdata.get('huggingface')

# Load the dataset. The lines below expect both a 'train' and a 'test' split.
dsdict = load_dataset("Falah/skin-cancer", token=HF_TOKEN, verification_mode="no_checks")
train_ds = dsdict["train"]
test_ds = dsdict["test"]

# Fallback, kept from an earlier load_dataset error that later resolved itself:
# if only a 'train' split is returned, make your own stratified train/test split.
#full = dsdict["train"]
#splits = full.train_test_split(test_size=0.2, seed=42, stratify_by_column="label")
#train_ds, test_ds = splits["train"], splits["test"]

# Extract images (PIL) & labels (ints 0/1). Resize to a common size first.
images = [img.convert("RGB").resize((128, 128)) for img in train_ds["image"]]
labels = np.array(train_ds["label"])  # label column: 0 = benign, 1 = malignant

# Convert images to a single numpy array [N, 128, 128, 3]
image_data = np.stack([np.asarray(img) for img in images], axis=0)

print("Dataset loaded and prepared.")
print("image_data:", image_data.shape, "labels:", labels.shape)
print("classes:", train_ds.features["label"].names)

Dataset loaded and prepared.
image_data: (2637, 128, 128, 3) labels: (2637,)
classes: ['benign', 'malignant']
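
Before training anything, it can help to check how balanced the two classes are, since accuracy is easier to interpret when the class proportions are known. The following is a minimal sketch on a small hypothetical label array (`demo_labels` is made up for illustration and is not the real dataset); the same call should work on the `labels` array built above.

```python
import numpy as np

# Hypothetical labels for illustration only (0 = benign, 1 = malignant)
demo_labels = np.array([0, 0, 1, 0, 1, 1, 0, 0])

# Count how many samples fall in each class
counts = np.bincount(demo_labels, minlength=2)
fractions = counts / counts.sum()

for name, n, frac in zip(["benign", "malignant"], counts, fractions):
    print(f"{name}: {n} ({frac:.1%})")
```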


 ## Step 2: Feature extraction using different methods

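 Feature extraction reduces each image to a fixed-length vector of numbers, so that images can be compared with one another and passed to a classifier. In other words, every image has to be converted into a set of numeric features.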

 ## A. Principal Component Analysis (PCA)

 PCA is a dimensionality reduction technique. Here, we'll flatten the image data and then apply
 PCA to reduce the number of features.

In [5]:
from sklearn.decomposition import PCA

# Flatten the image data
n_samples, h, w, c = image_data.shape
image_data_flat = image_data.reshape((n_samples, h * w * c))

# Apply PCA for dimensionality reduction
# We'll use 100 components for this example, which can be tuned
pca = PCA(n_components=100)
pca_features = pca.fit_transform(image_data_flat)

print(f"PCA features shape: {pca_features.shape}")

PCA features shape: (2637, 100)
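
The choice of 100 components is a tunable parameter. One common way to pick it is to look at the cumulative explained variance. The sketch below illustrates the idea on synthetic data (`X_demo` is a made-up stand-in for the flattened images, and the 95% threshold is an assumption, not a value from this assignment).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for flattened images: 200 samples, 50 features,
# generated from 5 latent factors plus a little noise
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 50))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(200, 50))

pca_demo = PCA(n_components=20).fit(X_demo)
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

# Smallest number of components whose cumulative explained variance reaches 95%
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
print(f"components needed for 95% variance: {n_keep}")
```

On the real image data the curve will likely flatten much more slowly, so the same check may suggest a different number than 100.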


 ## B. Histograms

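 An alternative feature extraction method: a histogram of pixel intensities is a simple but often effective feature representation. The code in this step converts each image to grayscale and counts the pixels that fall into each of 256 intensity bins.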

In [6]:
import cv2

def get_histogram_features(image_array):
    # Convert RGB to grayscale for a simpler histogram
    gray_image = cv2.cvtColor(image_array, cv2.COLOR_RGB2GRAY)
    # Compute the histogram
    hist = cv2.calcHist([gray_image], [0], None, [256], [0, 256])
    return hist.flatten()

hist_features = np.array([get_histogram_features(img) for img in image_data])

print(f"Histogram features shape: {hist_features.shape}")

Histogram features shape: (2637, 256)
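
The histogram above discards color by converting to grayscale first. If color turns out to matter, a per-channel variant is one option. This is a sketch only: the helper name `color_histogram` and the choice of 32 bins are assumptions for illustration, and it runs on a random demo image rather than the dataset.

```python
import numpy as np

def color_histogram(image_array, bins=32):
    """Concatenate normalized per-channel histograms of an RGB uint8 image."""
    channels = []
    for ch in range(image_array.shape[-1]):
        counts, _ = np.histogram(image_array[..., ch], bins=bins, range=(0, 256))
        # Normalize so each channel's histogram sums to 1
        channels.append(counts / counts.sum())
    return np.concatenate(channels)

rng = np.random.default_rng(0)
demo_image = rng.integers(0, 256, size=(128, 128, 3), dtype=np.uint8)
demo_hist = color_histogram(demo_image)
print(demo_hist.shape)  # (96,)
```

Normalizing each channel keeps the features on a comparable scale regardless of image size.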


 ## C. Gray-Level Co-occurrence Matrix (GLCM)

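 Yet another feature extraction method is the GLCM, which captures texture information by counting how often pairs of gray levels occur at a given spatial offset.
 In this step the offset is 5 pixels at angle 0 (horizontal), and five summary properties of the resulting matrix are used as features.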

In [7]:
from skimage.feature import graycomatrix, graycoprops
from skimage.color import rgb2gray

def get_glcm_features(image_array):
    # Convert to grayscale and then to integer type
    gray_image = (rgb2gray(image_array) * 255).astype('uint8')
    # Calculate GLCM
    glcm = graycomatrix(gray_image, distances=[5], angles=[0], symmetric=True, normed=True)
    # Extract properties
    contrast = graycoprops(glcm, 'contrast')[0, 0]
    dissimilarity = graycoprops(glcm, 'dissimilarity')[0, 0]
    homogeneity = graycoprops(glcm, 'homogeneity')[0, 0]
    energy = graycoprops(glcm, 'energy')[0, 0]
    correlation = graycoprops(glcm, 'correlation')[0, 0]

    return [contrast, dissimilarity, homogeneity, energy, correlation]

glcm_features = np.array([get_glcm_features(img) for img in image_data])

print(f"GLCM features shape: {glcm_features.shape}")

GLCM features shape: (2637, 5)
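
To make the co-occurrence idea concrete, here is a hand-rolled sketch in plain NumPy for a tiny 4-level image and a horizontal offset of one pixel. The helper `glcm_horizontal` is hypothetical and only mirrors the idea behind `graycomatrix` with `symmetric=True, normed=True`; it is not a replacement for the library call.

```python
import numpy as np

def glcm_horizontal(gray, levels):
    """Symmetric, normalized co-occurrence matrix for horizontally adjacent pixels."""
    glcm = np.zeros((levels, levels), dtype=float)
    left, right = gray[:, :-1].ravel(), gray[:, 1:].ravel()
    np.add.at(glcm, (left, right), 1)  # count each (i, j) neighbor pair
    glcm = glcm + glcm.T               # symmetric: also count (j, i)
    return glcm / glcm.sum()           # normalize to probabilities

tiny = np.array([[0, 0, 1, 1],
                 [0, 0, 1, 1],
                 [0, 2, 2, 2],
                 [2, 2, 3, 3]])

P = glcm_horizontal(tiny, levels=4)
i, j = np.indices(P.shape)

# Contrast weights each pair by the squared gray-level difference;
# homogeneity rewards pairs that lie close to the diagonal
contrast = np.sum(P * (i - j) ** 2)
homogeneity = np.sum(P / (1.0 + (i - j) ** 2))
print(P.round(3))
print(contrast, homogeneity)
```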


 ## D. Vision Transformer (ViT) as a feature extractor

 This method uses a pre-trained deep learning model (ViT base, pre-trained on ImageNet-21k) to produce
 a 768-dimensional feature vector for each image.

 NOTE: If this cell repeatedly crashes your session (most likely because it runs out of memory), skip it; the ViT calls in Step 3 are already commented out.

In [None]:
from transformers import ViTImageProcessor, ViTForImageClassification
import torch

# Load the pre-trained ViT model and its image processor
# (ViTImageProcessor replaces the deprecated ViTFeatureExtractor)
device = "cuda" if torch.cuda.is_available() else "cpu"
feature_extractor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224-in21k")
model = model.to(device).eval()

def get_vit_features(images, batch_size=32):
    # Work in small batches so the whole dataset is never pushed through the model at once
    features = []
    for start in range(0, len(images), batch_size):
        # Preprocess one batch of images
        batch = images[start:start + batch_size]
        inputs = feature_extractor(images=batch, return_tensors="pt").to(device)
        # Get features from the model's last hidden state
        with torch.no_grad():
            outputs = model(**inputs, output_hidden_states=True)
        last_hidden_state = outputs.hidden_states[-1]
        # The [CLS] token (first token) is used as the aggregated feature vector
        features.append(last_hidden_state[:, 0, :].cpu().numpy())
    return np.concatenate(features, axis=0)

# Note: This step can be very slow. It is recommended to run on a GPU.
vit_features = get_vit_features(images)

print(f"ViT features shape: {vit_features.shape}")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Some weights of ViTForImageClassification were not initialized from the model checkpoint at google/vit-base-patch16-224-in21k and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


 ## Step 3: Train and compare machine learning models

 ## A. Naive Bayes

 We will train a Gaussian Naive Bayes model on each feature set. The ViT call is commented out so that the cell still runs if Step 2D was skipped.

In [9]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report

def train_and_evaluate_naive_bayes(features, labels, feature_name):
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.2, random_state=42)
    gnb = GaussianNB()
    gnb.fit(X_train, y_train)
    y_pred = gnb.predict(X_test)
    print(f"--- Naive Bayes with {feature_name} ---")
    print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
    print(classification_report(y_test, y_pred))

train_and_evaluate_naive_bayes(pca_features, labels, "PCA Features")
train_and_evaluate_naive_bayes(hist_features, labels, "Histogram Features")
train_and_evaluate_naive_bayes(glcm_features, labels, "GLCM Features")
#train_and_evaluate_naive_bayes(vit_features, labels, "ViT Features")

--- Naive Bayes with PCA Features ---
Accuracy: 0.7102
              precision    recall  f1-score   support

           0       0.69      0.84      0.76       283
           1       0.75      0.56      0.64       245

    accuracy                           0.71       528
   macro avg       0.72      0.70      0.70       528
weighted avg       0.72      0.71      0.70       528

--- Naive Bayes with Histogram Features ---
Accuracy: 0.6174
              precision    recall  f1-score   support

           0       0.59      0.94      0.72       283
           1       0.78      0.24      0.37       245

    accuracy                           0.62       528
   macro avg       0.68      0.59      0.55       528
weighted avg       0.68      0.62      0.56       528

--- Naive Bayes with GLCM Features ---
Accuracy: 0.7500
              precision    recall  f1-score   support

           0       0.72      0.89      0.79       283
           1       0.82      0.59      0.69       245

    accura
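
A single 80/20 split gives one accuracy number per feature set, which can shift with the random seed. A hedged alternative is stratified k-fold cross-validation, sketched here on synthetic data (`X_demo` and `y_demo` are generated with `make_classification` and only stand in for one of the feature sets; 5 folds is an arbitrary choice).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for a small feature set (5 features); not the real data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0)

# Each fold keeps roughly the same class proportions as the full set
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(GaussianNB(), X_demo, y_demo, cv=cv, scoring="accuracy")

print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the mean and spread across folds makes differences between feature sets easier to judge than a single split.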

 ## B. Logistic Regression

 A Logistic Regression model will be trained on the same feature sets. As before, the ViT call is commented out.

In [10]:
from sklearn.linear_model import LogisticRegression

def train_and_evaluate_logistic_regression(features, labels, feature_name):
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.2, random_state=42)
    # Using 'liblinear' solver for faster convergence on smaller datasets
    lr = LogisticRegression(solver='liblinear', random_state=42, max_iter=1000)
    lr.fit(X_train, y_train)
    y_pred = lr.predict(X_test)
    print(f"--- Logistic Regression with {feature_name} ---")
    print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
    print(classification_report(y_test, y_pred))

train_and_evaluate_logistic_regression(pca_features, labels, "PCA Features")
train_and_evaluate_logistic_regression(hist_features, labels, "Histogram Features")
train_and_evaluate_logistic_regression(glcm_features, labels, "GLCM Features")
#train_and_evaluate_logistic_regression(vit_features, labels, "ViT Features")

--- Logistic Regression with PCA Features ---
Accuracy: 0.8163
              precision    recall  f1-score   support

           0       0.82      0.84      0.83       283
           1       0.81      0.79      0.80       245

    accuracy                           0.82       528
   macro avg       0.82      0.81      0.81       528
weighted avg       0.82      0.82      0.82       528

--- Logistic Regression with Histogram Features ---
Accuracy: 0.6307
              precision    recall  f1-score   support

           0       0.64      0.72      0.68       283
           1       0.62      0.52      0.57       245

    accuracy                           0.63       528
   macro avg       0.63      0.62      0.62       528
weighted avg       0.63      0.63      0.63       528

--- Logistic Regression with GLCM Features ---
Accuracy: 0.7519
              precision    recall  f1-score   support

           0       0.73      0.85      0.79       283
           1       0.79      0.64      0.
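
In Step 2A, PCA was fit on all images before the train/test split, so the test images influenced the components the classifier later sees. One way to avoid that is to put scaling, PCA, and the classifier into a single pipeline that is fit on the training portion only. The sketch below shows this on synthetic data; `X_demo`, `y_demo`, the added `StandardScaler`, and the 10 components are illustrative assumptions, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for flattened image data; not the real dataset
X_demo, y_demo = make_classification(
    n_samples=400, n_features=60, n_informative=10, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

# Scaler and PCA are fit inside the pipeline, using training rows only
leak_free = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),
    LogisticRegression(solver="liblinear", random_state=42, max_iter=1000),
)
leak_free.fit(X_train, y_train)

test_accuracy = leak_free.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.4f}")
```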

## C. Linear Discriminant Analysis (LDA) without feature extraction

 LDA can be used directly as a classification model, and it also performs dimensionality reduction.

In [11]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def train_and_evaluate_lda(data, labels):
    # Flatten the image data for LDA
    n_samples, h, w, c = data.shape
    data_flat = data.reshape((n_samples, h * w * c))

    X_train, X_test, y_train, y_test = train_test_split(
        data_flat, labels, test_size=0.2, random_state=42)

    # Run LDA
    lda = LinearDiscriminantAnalysis()
    lda.fit(X_train, y_train)
    y_pred = lda.predict(X_test)

    print("--- Linear Discriminant Analysis (no feature extraction) ---")
    print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
    print(classification_report(y_test, y_pred))

train_and_evaluate_lda(image_data, labels)

--- Linear Discriminant Analysis (no feature extraction) ---
Accuracy: 0.7708
              precision    recall  f1-score   support

           0       0.76      0.83      0.80       283
           1       0.78      0.70      0.74       245

    accuracy                           0.77       528
   macro avg       0.77      0.77      0.77       528
weighted avg       0.77      0.77      0.77       528
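
With 128 x 128 x 3 = 49,152 pixel features and only 2,109 training images, the covariance matrix that LDA estimates is rank-deficient. A commonly used remedy is a shrinkage estimate of the covariance. The sketch below tries `solver="lsqr"` with `shrinkage="auto"` on synthetic data that has more features than samples (`X_demo` and `y_demo` are made up; whether shrinkage helps on the real images is not something this notebook tested).

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data with far more features than samples, mimicking raw pixels
n_samples, n_features = 120, 500
y_demo = rng.integers(0, 2, size=n_samples)
X_demo = rng.normal(size=(n_samples, n_features))
X_demo[:, :10] += 1.5 * y_demo[:, None]  # only 10 features carry class signal

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo)

# Ledoit-Wolf shrinkage regularizes the covariance estimate
lda_shrunk = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")
lda_shrunk.fit(X_train, y_train)

shrunk_accuracy = lda_shrunk.score(X_test, y_test)
print(f"LDA with shrinkage, test accuracy: {shrunk_accuracy:.4f}")
```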



 # Assignment report questions

 1. Which feature extraction method produced the best results
 for each classifier? Why do you think this is the case?
 2. Compare the performance of Naive Bayes, Logistic Regression, and LDA.
 Which model performed best and under what conditions?
 3. How does LDA perform on the raw pixel data compared to the other models using
 extracted features? Explain why this might be a poor strategy for images.
 4. The ViT model is a complex deep learning model. What are the advantages and
 disadvantages of using it solely as a feature extractor for a simple model like Naive Bayes?
 5. Summarize your findings on the relationship between feature engineering and model performance for this image classification task.
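
When writing the report, it may help to gather the accuracies into one place before comparing them. The sketch below copies the accuracy values printed in Step 3 into a dictionary and picks the best feature set per classifier; the dictionary layout and the name `results` are just one possible arrangement.

```python
# Accuracies copied from the printed outputs in Step 3
results = {
    ("Naive Bayes", "PCA"): 0.7102,
    ("Naive Bayes", "Histogram"): 0.6174,
    ("Naive Bayes", "GLCM"): 0.7500,
    ("Logistic Regression", "PCA"): 0.8163,
    ("Logistic Regression", "Histogram"): 0.6307,
    ("Logistic Regression", "GLCM"): 0.7519,
    ("LDA", "raw pixels"): 0.7708,
}

# Keep the highest-accuracy feature set for each classifier
best = {}
for (model_name, feature_name), acc in results.items():
    if model_name not in best or acc > best[model_name][1]:
        best[model_name] = (feature_name, acc)

for model_name, (feature_name, acc) in best.items():
    print(f"{model_name:<20} {feature_name:<12} {acc:.4f}")
```

Accuracy alone hides the per-class recall differences visible in the classification reports, so the table is a starting point rather than a full answer.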