# HAM10000 Dataset Exploration

This notebook explores the HAM10000 skin lesion dataset for classification. The dataset contains 10,015 dermatoscopic images of pigmented skin lesions across 7 diagnostic categories.

## Dataset Overview

The HAM10000 ('Human Against Machine with 10000 training images') dataset consists of 10,015 dermatoscopic images released as a training set for academic machine learning purposes. The dataset includes images of common pigmented skin lesions from different populations, acquired and stored by different modalities.

The 7 diagnostic categories in the dataset are:
1. Actinic Keratoses (akiec) - 327 images
2. Basal Cell Carcinoma (bcc) - 514 images
3. Benign Keratosis-like Lesions (bkl) - 1099 images
4. Dermatofibroma (df) - 115 images
5. Melanoma (mel) - 1113 images
6. Melanocytic Nevi (nv) - 6705 images
7. Vascular Lesions (vasc) - 142 images

In [None]:
# Import required libraries
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image
import cv2
from pathlib import Path
import sys
from collections import Counter
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Add the project root directory to the Python path
sys.path.append('/content/drive/MyDrive/Explainable-AI-for-Skin-Cancer-Detection')

In [None]:
# Import project modules
from XAI.config import CLASS_NAMES, RAW_DATA_DIR
from XAI.features import extract_color_histogram, extract_shape_features, extract_texture_features

In [None]:
# Set plotting style
plt.style.use('ggplot')
sns.set(style="whitegrid")

In [None]:
# Matplotlib settings for better visualization
%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 12

## Loading the Dataset

In [None]:
# Define paths
DATA_DIR = Path('/content/drive/MyDrive/Explainable-AI-for-Skin-Cancer-Detection/data/raw')
METADATA_FILE = DATA_DIR / 'HAM10000_metadata.csv'
IMAGE_DIR_PART1 = DATA_DIR / 'HAM10000_images_part_1'
IMAGE_DIR_PART2 = DATA_DIR / 'HAM10000_images_part_2'

In [None]:
# Check if the files exist
print(f"Metadata file exists: {METADATA_FILE.exists()}")
print(f"Image directory part 1 exists: {IMAGE_DIR_PART1.exists()}")
print(f"Image directory part 2 exists: {IMAGE_DIR_PART2.exists()}")

In [None]:
# Load metadata
metadata = pd.read_csv(METADATA_FILE)
# Display first few rows
metadata.head()

In [None]:
# Check the shape of the metadata
print(f"Metadata shape: {metadata.shape}")

In [None]:
# Check for missing values
print("\nMissing values:")
print(metadata.isnull().sum())

## Exploratory Data Analysis

In [None]:
# Display dataset information
metadata.info()

In [None]:
# Statistical summary of numerical columns
metadata.describe()

### Class Distribution

In [None]:
# Count and visualize the distribution of diagnostic categories
class_counts = metadata['dx'].value_counts()
print("Class distribution:")
for class_name, count in class_counts.items():
    print(f"{CLASS_NAMES[class_name]}: {count} images ({count/len(metadata)*100:.2f}%)")

In [None]:
# Plot class distribution
plt.figure(figsize=(12, 6))
ax = sns.barplot(x=class_counts.index, y=class_counts.values)
plt.xlabel('Class')
plt.ylabel('Number of Images')
plt.title('Class Distribution in HAM10000 Dataset')

# Add value labels on top of bars
for i, count in enumerate(class_counts.values):
    ax.text(i, count + 50, str(count), ha='center')

# Replace class codes with full names (kept in the same cell so the figure is still active)
plt.xticks(range(len(class_counts)), [CLASS_NAMES[cls] for cls in class_counts.index], rotation=45, ha='right')
plt.tight_layout()
plt.show()

The dataset is highly imbalanced: 'Melanocytic Nevi' (nv) dominates with 6705 images (about 67% of the dataset), while 'Dermatofibroma' (df) has only 115 images, a ratio of roughly 58 to 1. This class imbalance will need to be addressed during model training, for example through class weighting or resampling.
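One common way to handle such imbalance is inverse-frequency class weighting. The sketch below is illustrative only: it uses the image counts listed in the overview and the "balanced" heuristic `n_samples / (n_classes * count)`. The names `overview_counts` and `balanced_class_weights` are hypothetical and not part of the project code.

```python
# Image counts per class, copied from the dataset overview above.
overview_counts = {
    'akiec': 327, 'bcc': 514, 'bkl': 1099, 'df': 115,
    'mel': 1113, 'nv': 6705, 'vasc': 142,
}


def balanced_class_weights(counts):
    """Return inverse-frequency weights: n_samples / (n_classes * count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


weights = balanced_class_weights(overview_counts)

# Print classes from the smallest weight (most frequent) to the largest (rarest).
for cls, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{cls}: {weight:.3f}")
```

Weights of this kind could be passed to a loss function or classifier that accepts per-class weights; whether weighting or resampling works better here would need to be checked empirically.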


### Age Distribution

In [None]:
# Age distribution analysis
plt.figure(figsize=(12, 6))
sns.histplot(metadata['age'].dropna(), bins=20, kde=True)
plt.title('Age Distribution in the Dataset')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()

In [None]:
# Age distribution by diagnostic category
# Fix the box order explicitly so the tick labels below match the boxes
dx_order = sorted(metadata['dx'].unique())
plt.figure(figsize=(14, 8))
sns.boxplot(x='dx', y='age', data=metadata, order=dx_order)
plt.title('Age Distribution by Diagnostic Category')
plt.xlabel('Diagnostic Category')
plt.ylabel('Age')
plt.xticks(range(len(dx_order)), [CLASS_NAMES[cls] for cls in dx_order], rotation=45, ha='right')
plt.tight_layout()
plt.show()

In [None]:
# Statistical summary of age by class
age_by_class = metadata.groupby('dx')['age'].describe()
age_by_class

The age distribution shows that the lesions in this dataset come predominantly from middle-aged and older adults. Some classes, such as 'Actinic Keratoses', skew toward older patients, which is consistent with their link to cumulative sun damage.


### Sex Distribution

In [None]:
# Sex distribution
sex_counts = metadata['sex'].value_counts()
plt.figure(figsize=(8, 6))
sex_counts.plot(kind='pie', autopct='%1.1f%%')
plt.title('Sex Distribution in the Dataset')
plt.ylabel('')
plt.show()

In [None]:
# Sex distribution by diagnostic category
sex_by_dx = pd.crosstab(metadata['dx'], metadata['sex'])
sex_by_dx_norm = sex_by_dx.div(sex_by_dx.sum(axis=1), axis=0)
# DataFrame.plot opens its own figure, so set the size here instead of calling plt.figure first
sex_by_dx_norm.plot(kind='bar', stacked=True, figsize=(14, 6))
plt.title('Sex Distribution by Diagnostic Category')
plt.xlabel('Diagnostic Category')
plt.ylabel('Proportion')
plt.xticks(range(len(sex_by_dx.index)), [CLASS_NAMES[cls] for cls in sex_by_dx.index], rotation=45, ha='right')
plt.legend(title='Sex')
plt.tight_layout()
plt.show()

In [None]:
# Count of images by sex and diagnostic category
sex_dx_counts = pd.crosstab(metadata['dx'], metadata['sex'])
# Capitalize the existing labels rather than assuming a fixed number and order of columns
sex_dx_counts.columns = [str(col).capitalize() for col in sex_dx_counts.columns]
sex_dx_counts.index = [CLASS_NAMES[cls] for cls in sex_dx_counts.index]
sex_dx_counts

The dataset contains more images from male patients than from female patients, plus a small number of entries whose sex is recorded as 'unknown'. Some skin lesion types show sex-based prevalence differences, which could be useful information for the model.
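If sex, age, and localization are later fed to a model alongside the image, the categorical columns need a numeric encoding. The following is a minimal sketch on a small made-up frame: the three rows are invented for illustration, and the median fill and one-hot encoding are assumptions, not the project's pipeline.

```python
import pandas as pd

# Toy rows that mimic the relevant metadata columns (values are invented).
toy = pd.DataFrame({
    'age': [45.0, None, 70.0],
    'sex': ['male', 'female', 'unknown'],
    'localization': ['back', 'trunk', 'face'],
})

# Fill missing ages with the median of the observed ages.
toy['age'] = toy['age'].fillna(toy['age'].median())

# One-hot encode the categorical columns; the numeric column passes through unchanged.
encoded = pd.get_dummies(toy, columns=['sex', 'localization'], dtype=float)
print(encoded.columns.tolist())
```

Keeping 'unknown' as its own category, as done here, avoids silently dropping those rows.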


### Localization Distribution

In [None]:
# Localization distribution
loc_counts = metadata['localization'].value_counts()
plt.figure(figsize=(12, 6))
loc_counts.plot(kind='bar')
plt.title('Localization Distribution in the Dataset')
plt.xlabel('Localization')
plt.ylabel('Count')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

In [None]:
# Localization distribution by diagnostic category
loc_by_dx = pd.crosstab(metadata['dx'], metadata['localization'])
loc_by_dx_norm = loc_by_dx.div(loc_by_dx.sum(axis=1), axis=0)
# DataFrame.plot opens its own figure, so set the size here instead of calling plt.figure first
loc_by_dx_norm.plot(kind='bar', stacked=True, figsize=(14, 8))
plt.title('Localization Distribution by Diagnostic Category')
plt.xlabel('Diagnostic Category')
plt.ylabel('Proportion')
plt.xticks(range(len(loc_by_dx.index)), [CLASS_NAMES[cls] for cls in loc_by_dx.index], rotation=45, ha='right')
plt.legend(title='Localization', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

In [None]:
# Top 5 localizations by class
for cls in metadata['dx'].unique():
    class_df = metadata[metadata['dx'] == cls]
    top_locs = class_df['localization'].value_counts().head(5)
    print(f"\nTop 5 localizations for {CLASS_NAMES[cls]}:")
    for loc, count in top_locs.items():
        print(f"{loc}: {count} ({count/len(class_df)*100:.1f}%)")

The localization distribution shows that lesion types differ in where they appear on the body. For instance, melanoma (mel) images in this dataset come most often from the back and trunk. Understanding these patterns can help in developing better diagnostic models.


### Diagnosis Confirmation Methods

In [None]:
# Diagnosis confirmation method distribution
dx_type_counts = metadata['dx_type'].value_counts()
plt.figure(figsize=(10, 6))
dx_type_counts.plot(kind='bar')
plt.title('Diagnosis Confirmation Methods in the Dataset')
plt.xlabel('Confirmation Method')
plt.ylabel('Count')
plt.tight_layout()
plt.show()

In [None]:
# Percentage of each confirmation method
print("Diagnosis confirmation methods:")
for method, count in dx_type_counts.items():
    print(f"{method}: {count} ({count/len(metadata)*100:.2f}%)")

In [None]:
# Confirmation method by diagnostic category
dx_type_by_dx = pd.crosstab(metadata['dx'], metadata['dx_type'])
dx_type_by_dx_norm = dx_type_by_dx.div(dx_type_by_dx.sum(axis=1), axis=0)
# DataFrame.plot opens its own figure, so set the size here instead of calling plt.figure first
dx_type_by_dx_norm.plot(kind='bar', stacked=True, figsize=(14, 8))
plt.title('Confirmation Method by Diagnostic Category')
plt.xlabel('Diagnostic Category')
plt.ylabel('Proportion')
plt.xticks(range(len(dx_type_by_dx.index)), [CLASS_NAMES[cls] for cls in dx_type_by_dx.index], rotation=45, ha='right')
plt.legend(title='Confirmation Method')
plt.tight_layout()
plt.show()

More than half of the diagnoses in the dataset are confirmed by histopathology, which is considered the gold standard. Confirmation patterns differ by lesion type: melanoma (mel) and basal cell carcinoma (bcc) have high histopathology confirmation rates, likely because these malignant lesions are the most critical to diagnose accurately.

### Image Samples

In [None]:
# Display sample images from each class (figure creation and drawing share one cell)
samples_per_class = 5
fig, axs = plt.subplots(len(CLASS_NAMES), samples_per_class, figsize=(15, 3*len(CLASS_NAMES)))

# Hide all axes up front so slots without an image stay blank
for ax in axs.ravel():
    ax.axis('off')

for i, class_name in enumerate(CLASS_NAMES.keys()):
    # Get samples for this class (fixed seed so the grid is reproducible)
    samples = metadata[metadata['dx'] == class_name].sample(
        min(samples_per_class, sum(metadata['dx'] == class_name)), random_state=42
    )

    for j, (_, row) in enumerate(samples.iterrows()):
        img_id = row['image_id']

        # Find image file (the images are split across two directories)
        img_path = None
        for img_dir in [IMAGE_DIR_PART1, IMAGE_DIR_PART2]:
            temp_path = img_dir / f"{img_id}.jpg"
            if temp_path.exists():
                img_path = temp_path
                break

        if img_path is None:
            print(f"Warning: Image {img_id} not found")
            continue

        # Load and display image
        img = Image.open(img_path)
        axs[i, j].imshow(img)

        # Add class label to first image in row
        if j == 0:
            axs[i, j].set_title(f"{CLASS_NAMES[class_name]}", fontsize=12)

plt.tight_layout()
plt.show()

The sample images show the visual differences between the seven skin lesion classes. Some classes like melanoma (mel) and melanocytic nevi (nv) can look quite similar, which makes the classification task challenging.
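The cells in this notebook that load images look each one up in both image directories with the same inner loop. As a possible refactoring, that lookup could live in one small helper. The function name `find_image_path` and the demo file names are hypothetical; the demo uses a temporary directory instead of the real dataset folders.

```python
import tempfile
from pathlib import Path
from typing import Iterable, Optional


def find_image_path(image_id: str, image_dirs: Iterable[Path],
                    suffix: str = '.jpg') -> Optional[Path]:
    """Return the first existing `<dir>/<image_id><suffix>`, or None if absent."""
    for image_dir in image_dirs:
        candidate = Path(image_dir) / f"{image_id}{suffix}"
        if candidate.exists():
            return candidate
    return None


# Demo with two temporary folders standing in for the two image directories.
with tempfile.TemporaryDirectory() as tmp:
    part1, part2 = Path(tmp) / 'part_1', Path(tmp) / 'part_2'
    part1.mkdir()
    part2.mkdir()
    (part2 / 'demo_image.jpg').touch()

    found = find_image_path('demo_image', [part1, part2])
    missing = find_image_path('no_such_image', [part1, part2])

print(found.name if found else None, missing)
```

In the notebook, a call such as `find_image_path(img_id, [IMAGE_DIR_PART1, IMAGE_DIR_PART2])` would replace the repeated loop.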


### Image Properties Analysis

In [None]:
# Get a sample of images from each class for analysis
sample_size = 20
sample_data = []

In [None]:
for class_name in CLASS_NAMES.keys():
    class_samples = metadata[metadata['dx'] == class_name].sample(
        min(sample_size, sum(metadata['dx'] == class_name)), random_state=42
    )

    for _, row in class_samples.iterrows():
        img_id = row['image_id']

        # Find image file
        img_path = None
        for img_dir in [IMAGE_DIR_PART1, IMAGE_DIR_PART2]:
            temp_path = img_dir / f"{img_id}.jpg"
            if temp_path.exists():
                img_path = temp_path
                break

        if img_path is None:
            continue

        # Read image (cv2.imread returns None if the file cannot be decoded)
        img = cv2.imread(str(img_path))
        if img is None:
            continue
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

        # Get image properties
        height, width, channels = img.shape
        aspect_ratio = width / height
        mean_color = np.mean(img, axis=(0, 1))
        std_color = np.std(img, axis=(0, 1))

        # Extract color histogram
        color_hist = extract_color_histogram(img, bins=8)  # Reduced bins for simplicity

        # Extract shape features
        shape_features = extract_shape_features(img)

        # Extract texture features
        texture_features = extract_texture_features(img)

        # Combine all features
        sample_data.append({
            'image_id': img_id,
            'class': class_name,
            'height': height,
            'width': width,
            'aspect_ratio': aspect_ratio,
            'mean_r': mean_color[0],
            'mean_g': mean_color[1],
            'mean_b': mean_color[2],
            'std_r': std_color[0],
            'std_g': std_color[1],
            'std_b': std_color[2],
            'color_hist': color_hist,
            'shape_features': shape_features,
            'texture_features': texture_features
        })

In [None]:
# Convert to DataFrame
sample_df = pd.DataFrame(sample_data)

In [None]:
# Analyze basic image properties (all four panels are drawn in one cell so they share a figure)
plt.figure(figsize=(12, 8))

# Image dimensions
plt.subplot(2, 2, 1)
plt.scatter(sample_df['width'], sample_df['height'], c=sample_df['class'].astype('category').cat.codes, alpha=0.7)
plt.title('Image Dimensions')
plt.xlabel('Width (pixels)')
plt.ylabel('Height (pixels)')
plt.colorbar(ticks=range(len(CLASS_NAMES)), label='Class')

# Aspect ratio
plt.subplot(2, 2, 2)
sns.boxplot(x='class', y='aspect_ratio', data=sample_df)
plt.title('Aspect Ratio by Class')
plt.xlabel('Class')
plt.ylabel('Aspect Ratio (width/height)')
plt.xticks(rotation=45, ha='right')

# Mean color
plt.subplot(2, 2, 3)
for color in ['mean_r', 'mean_g', 'mean_b']:
    sns.kdeplot(data=sample_df, x=color, label=color.split('_')[1].upper())
plt.title('Mean Color Distribution')
plt.xlabel('Pixel Value')
plt.ylabel('Density')
plt.legend()

# Color standard deviation
plt.subplot(2, 2, 4)
for color in ['std_r', 'std_g', 'std_b']:
    sns.kdeplot(data=sample_df, x=color, label=color.split('_')[1].upper())
plt.title('Color Standard Deviation Distribution')
plt.xlabel('Pixel Value')
plt.ylabel('Density')
plt.legend()

plt.tight_layout()
plt.show()

These analyses show that the images have fairly consistent dimensions but vary in their color properties. This information can guide preprocessing decisions such as the resize target and per-channel color normalization.
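As an example of such a preprocessing decision, per-channel normalization rescales each color channel to roughly zero mean and unit variance. The sketch below uses a random stand-in batch rather than real HAM10000 images, and the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in batch of 4 small RGB images with values in 0..255 (not real data).
images = rng.integers(0, 256, size=(4, 45, 60, 3)).astype(np.float64)

# Per-channel statistics computed over all images and all pixels.
channel_mean = images.mean(axis=(0, 1, 2))
channel_std = images.std(axis=(0, 1, 2))

# Broadcasting applies each channel's statistics along the last axis.
normalized = (images - channel_mean) / channel_std

print(normalized.mean(axis=(0, 1, 2)))
print(normalized.std(axis=(0, 1, 2)))
```

In practice the statistics would be computed on the training split only and then reused for validation and test images.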


### Color Analysis

In [None]:
# Color histograms by class
# Get average color histogram for each class
class_hist_means = {}
bins = 8  # Must match the bins used earlier

In [None]:
for class_name in CLASS_NAMES.keys():
    class_samples = sample_df[sample_df['class'] == class_name]
    if len(class_samples) == 0:
        continue

    # Stack all histograms and compute mean
    hist_stack = np.vstack(class_samples['color_hist'].values)
    class_hist_means[class_name] = np.mean(hist_stack, axis=0)

In [None]:
# Plot average color histograms (figure and subplots are created in one cell)
plt.figure(figsize=(14, 10))

for i, (class_name, hist) in enumerate(class_hist_means.items()):
    plt.subplot(3, 3, i+1)

    # Split the concatenated histogram into its R, G, and B parts
    hist_r = hist[:bins]
    hist_g = hist[bins:2*bins]
    hist_b = hist[2*bins:3*bins]

    # Left edge of each bin; align='edge' makes each bar span its whole bin
    bin_edges = np.linspace(0, 256, bins+1)[:-1]
    width = 256 / bins

    plt.bar(bin_edges, hist_r, width=width, align='edge', alpha=0.7, color='r', label='R')
    plt.bar(bin_edges, hist_g, width=width, align='edge', alpha=0.7, color='g', label='G')
    plt.bar(bin_edges, hist_b, width=width, align='edge', alpha=0.7, color='b', label='B')

    plt.title(f"{CLASS_NAMES[class_name]}")
    plt.xlabel('Pixel Value')
    plt.ylabel('Normalized Frequency')
    plt.legend()

plt.tight_layout()
plt.show()

The color histograms show class-specific color patterns, which is expected since different types of skin lesions have characteristic colorations.
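The plotting code above slices each histogram into three blocks of `bins` values, which implies a concatenated R, G, B layout. The implementation of `extract_color_histogram` is not shown in this notebook, so the following is only a sketch of one feature with that layout; the function name `channel_histogram` is hypothetical.

```python
import numpy as np


def channel_histogram(img_rgb, bins=8):
    """Concatenate normalized per-channel histograms into a vector of length 3 * bins."""
    parts = []
    for channel in range(3):
        counts, _ = np.histogram(img_rgb[..., channel], bins=bins, range=(0, 256))
        parts.append(counts / counts.sum())  # each channel's block sums to 1
    return np.concatenate(parts)


rng = np.random.default_rng(0)
demo_img = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)  # random stand-in image
hist = channel_histogram(demo_img, bins=8)
print(hist.shape)
```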


### Shape Analysis

In [None]:
# Analyze shape features
shape_features = np.vstack(sample_df['shape_features'].values)
shape_df = pd.DataFrame(shape_features, columns=[f'shape_{i}' for i in range(shape_features.shape[1])])
shape_df['class'] = sample_df['class'].values

In [None]:
# Plot first two shape features
plt.figure(figsize=(10, 8))
sns.scatterplot(x='shape_0', y='shape_1', hue='class', data=shape_df)
plt.title('Shape Features by Class (First Two Hu Moments)')
plt.xlabel('Shape Feature 1')
plt.ylabel('Shape Feature 2')
plt.legend(title='Class')
plt.show()

The shape features (Hu moments) show some separation between classes, which suggests that shape carries useful information for distinguishing lesion types, though two features from a small sample cannot establish how much.


### Texture Analysis

In [None]:
# Analyze texture features
texture_features = np.vstack(sample_df['texture_features'].values)
texture_df = pd.DataFrame(texture_features, columns=[f'texture_{i}' for i in range(texture_features.shape[1])])
texture_df['class'] = sample_df['class'].values

In [None]:
# Plot first two texture features
plt.figure(figsize=(10, 8))
sns.scatterplot(x='texture_0', y='texture_1', hue='class', data=texture_df)
plt.title('Texture Features by Class (First Two Haralick Features)')
plt.xlabel('Texture Feature 1')
plt.ylabel('Texture Feature 2')
plt.legend(title='Class')
plt.show()

The texture features show a similar picture: some separation between classes, which suggests that texture also helps distinguish lesion types.
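Haralick-style texture features are derived from a gray-level co-occurrence matrix (GLCM), which counts how often pairs of intensity levels occur next to each other. The implementation of `extract_texture_features` is not shown here, so this is only a minimal NumPy sketch of the idea for horizontal neighbors, computing two such features (contrast and homogeneity). All function names are hypothetical.

```python
import numpy as np


def glcm_horizontal(gray, levels=8):
    """Symmetric, normalized co-occurrence matrix for horizontally adjacent pixels."""
    # Quantize 0..255 intensities into `levels` bins.
    quantized = (gray.astype(np.int64) * levels) // 256
    left = quantized[:, :-1].ravel()
    right = quantized[:, 1:].ravel()
    glcm = np.zeros((levels, levels), dtype=np.float64)
    np.add.at(glcm, (left, right), 1.0)  # count each (left, right) pair
    glcm += glcm.T                       # count pairs in both directions
    return glcm / glcm.sum()


def contrast_and_homogeneity(glcm):
    """Contrast weights pairs by squared level difference; homogeneity does the inverse."""
    i, j = np.indices(glcm.shape)
    contrast = float(np.sum(glcm * (i - j) ** 2))
    homogeneity = float(np.sum(glcm / (1.0 + (i - j) ** 2)))
    return contrast, homogeneity


# A uniform patch versus a patch of alternating black and white columns.
flat = np.full((16, 16), 100, dtype=np.uint8)
stripes = np.tile(np.array([0, 255], dtype=np.uint8), (16, 8))

flat_contrast, flat_homogeneity = contrast_and_homogeneity(glcm_horizontal(flat))
stripe_contrast, stripe_homogeneity = contrast_and_homogeneity(glcm_horizontal(stripes))
print(flat_contrast, flat_homogeneity, stripe_contrast, stripe_homogeneity)
```

The uniform patch has zero contrast and maximal homogeneity, while the striped patch shows the opposite, which is the kind of difference these features are meant to capture.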


### Dimensionality Reduction

In [None]:
# Combine all features for dimensionality reduction
# Standardize the features
from sklearn.preprocessing import StandardScaler

In [None]:
# Combine all features
combined_features = np.hstack([
    StandardScaler().fit_transform(np.vstack(sample_df['color_hist'].values)),
    StandardScaler().fit_transform(np.vstack(sample_df['shape_features'].values)),
    StandardScaler().fit_transform(np.vstack(sample_df['texture_features'].values))
])

In [None]:
# Apply PCA for visualization
pca = PCA(n_components=2)
pca_result = pca.fit_transform(combined_features)

In [None]:
# Create DataFrame for plotting
pca_df = pd.DataFrame(data=pca_result, columns=['PC1', 'PC2'])
pca_df['class'] = sample_df['class'].values
pca_df['class_name'] = pca_df['class'].map(CLASS_NAMES)

In [None]:
# Plot PCA results
plt.figure(figsize=(12, 10))
sns.scatterplot(x='PC1', y='PC2', hue='class_name', data=pca_df, palette='tab10', s=100, alpha=0.7)
plt.title('PCA of Combined Features by Class')
plt.xlabel(f'Principal Component 1 ({pca.explained_variance_ratio_[0]:.2%} variance)')
plt.ylabel(f'Principal Component 2 ({pca.explained_variance_ratio_[1]:.2%} variance)')
plt.legend(title='Class', loc='best')
plt.grid(True, linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

In [None]:
# Apply t-SNE for more complex visualization
tsne = TSNE(n_components=2, perplexity=min(30, len(sample_df) - 1), random_state=42)
tsne_result = tsne.fit_transform(combined_features)

In [None]:
# Create DataFrame for plotting
tsne_df = pd.DataFrame(data=tsne_result, columns=['t-SNE1', 't-SNE2'])
tsne_df['class'] = sample_df['class'].values
tsne_df['class_name'] = tsne_df['class'].map(CLASS_NAMES)

In [None]:
# Plot t-SNE results
plt.figure(figsize=(12, 10))
sns.scatterplot(x='t-SNE1', y='t-SNE2', hue='class_name', data=tsne_df, palette='tab10', s=100, alpha=0.7)
plt.title('t-SNE of Combined Features by Class')
plt.xlabel('t-SNE Dimension 1')
plt.ylabel('t-SNE Dimension 2')
plt.legend(title='Class', loc='best')
plt.grid(True, linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

Both the PCA and t-SNE visualizations show that the combined features (color, shape, and texture) provide some separation between the classes, although there is still considerable overlap. Because the plots are based on at most 20 sampled images per class, they are only indicative. The overlap suggests that these hand-crafted features are informative but not sufficient on their own, which motivates more expressive approaches such as deep learning.
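The degree of overlap can also be summarized numerically. One possible approach, which this notebook does not itself use, is scikit-learn's silhouette score: it ranges from -1 to 1, with higher values meaning points sit closer to their own class than to others. The sketch below uses synthetic 2-D points for illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
labels = np.repeat([0, 1, 2], 20)  # three classes, 20 points each

# Well-separated case: class centers are far apart relative to the noise.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
separated = centers[labels] + rng.normal(scale=1.0, size=(60, 2))

# Overlapping case: every class is drawn around the same center.
overlapping = rng.normal(scale=1.0, size=(60, 2))

score_separated = silhouette_score(separated, labels)
score_overlapping = silhouette_score(overlapping, labels)
print(f"separated: {score_separated:.2f}, overlapping: {score_overlapping:.2f}")
```

In the notebook, the same call could be applied to `combined_features` with `sample_df['class']` as the labels.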
