<a href="https://colab.research.google.com/github/nehay24/CSCN8040_Skin_Cancer_Prediction/blob/main/Skin_cancer_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Case Studies in Artificial Intelligence and Machine Learning
### Unit 4: Interim Results

### Group 4
| Student Id     | Name                     |
|----------------|--------------------------|
| 9027375        | Khushbu Lad              |
| 8944328        | Akshata Madhav           |
| 8964515        | Neha Yadav               |
| 8965985        | Pradeepti Kasam          |
| 8914803        | Rohit Totlani            |

Dataset URL:

[Skin Cancer MNIST: HAM10000](https://www.kaggle.com/datasets/kmader/skin-cancer-mnist-ham10000)

### Step 1: Install Required Libraries

In [None]:
!pip install pandas numpy matplotlib seaborn scipy scikit-learn opencv-python tqdm tensorflow

### Step 2: Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.preprocessing import LabelEncoder
import sys
import os

### Step 3: Load the Dataset

**NOTE:**

- The dataset is large: it contains about 2.9 GB of images.
- The notebook is configured to run both locally and in Google Colab.
- If you run it in Google Colab with Google Drive, no configuration is required.
- If you run it on a local machine, download the dataset from the link in the header and extract the zip into a "dataset" folder in the project directory, then set `DATA_PATH` in the cell below to that folder (it defaults to `./` for local runs).
- Set the correct file path for the dataset, for example:
- data_path = "/content/HAM10000_metadata.csv"
- Ensure that the path matches the actual location of the file on your system.


In [None]:
# Detect Google Colab
IN_COLAB = False
try:
    import google.colab
    IN_COLAB = True
except ModuleNotFoundError:
    pass

In [None]:
if IN_COLAB:
    from google.colab import drive
    drive.mount('/content/drive')
    DATA_PATH = "/content/drive/MyDrive/dataset/"
else:
    DATA_PATH = "./"

# Download the dataset from Kaggle if the metadata file is not already in DATA_PATH
if IN_COLAB and not os.path.exists(DATA_PATH + "HAM10000_metadata.csv"):
    !pip install kaggle
    from google.colab import files
    files.upload()  # Upload kaggle.json manually (for API authentication)

    !mkdir -p ~/.kaggle
    !cp kaggle.json ~/.kaggle/
    !chmod 600 ~/.kaggle/kaggle.json
    !kaggle datasets download -d kmader/skin-cancer-mnist-ham10000 -p {DATA_PATH} --unzip
    print("Dataset downloaded and extracted.")

# Initialize required path
metadata_path = os.path.join(DATA_PATH, "HAM10000_metadata.csv")
images_path = os.path.join(DATA_PATH, "HAM10000_images_part_1/")

df = pd.read_csv(metadata_path)

In [None]:
#code to check sample data
df.head()

### Step 4: Basic Exploration

In [None]:
print("Number of rows: ", df.shape[0])
print("Number of columns: ", df.shape[1])
print("\n")
print("Column Names:", df.columns)
print("\n")
print("Missing Values:\n", df.isnull().sum())

### Step 5: Data Preprocessing


1. The `age` column has 57 missing values; fill them with the median age.

In [None]:
# Fill missing age values with the median age
df['age'] = df['age'].fillna(df['age'].median())

2. There are three categorical variables. A label encoder converts these strings to numeric codes:
    1. Gender (`sex`)
    2. Diagnosis class (`dx`)
    3. Localization (`localization`)


In [None]:
# Encode categorical variables
le_sex = LabelEncoder()
# 0 = Female, 1 = Male, 2= unknown
df['sex'] = le_sex.fit_transform(df['sex'])

le_dx = LabelEncoder()
# Encodes the 7 diagnosis classes
#0-akiec, 1-bcc, 2-bkl, 3-df, 4-mel, 5-nv, 6-vasc
df['dx'] = le_dx.fit_transform(df['dx'])

le_localization = LabelEncoder()
# Encodes the 15 localization classes
# 0-abdomen, 1- acral, 2-back, 3-chest, 4- ear, 5-face, 6-foot, 7-genital, 8-hand, 9-lower extremity, 10-neck, 11-scalp, 12-trunk, 13-unknown, 14-upper extremity
df['localization'] = le_localization.fit_transform(df['localization'])

#check sample data
df.head()
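
Both preprocessing steps above can be illustrated without any libraries. The sketch below is a simplified stand-in, not the pandas or scikit-learn implementation: `fill_missing_with_median` and `label_encode` are hypothetical helper names, and the sample values are made up. It shows why the codes come out as 0 = female, 1 = male, 2 = unknown: classes are numbered in sorted order.

```python
from statistics import median

def fill_missing_with_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    mid = median(observed)
    return [mid if v is None else v for v in values]

def label_encode(values):
    """Map each distinct string to an integer code, numbering classes in sorted order."""
    classes = sorted(set(values))
    code_of = {name: code for code, name in enumerate(classes)}
    return [code_of[v] for v in values], classes

# Made-up sample values, for illustration only
ages = fill_missing_with_median([30, None, 50, 70, None])
codes, classes = label_encode(["male", "female", "unknown", "male"])
print(ages)
print(codes, classes)
```

Because the codes depend only on sorted order, they carry no numeric meaning; a code of 2 is not "more" than a code of 1.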

### Step 6: Exploratory Data Analysis (EDA)

In [None]:
# Plot class distribution
plt.figure(figsize=(10, 5))
sns.countplot(x='dx', hue='dx', data=df, palette='coolwarm')
plt.xticks(rotation=45)
plt.title("Distribution of Lesion Types")

# Create custom legend labels
legend_labels = {
    0: 'akiec',
    1: 'bcc',
    2: 'bkl',
    3: 'df',
    4: 'mel',
    5: 'nv',
    6: 'vasc'
}

# Get the current handles and labels from the plot
handles, labels = plt.gca().get_legend_handles_labels()

# Create a new list of legend labels using the custom mapping
new_labels = [legend_labels[int(label)] for label in labels]

# Create the legend with the updated labels
plt.legend(handles, new_labels, title='Lesion Type')

plt.show()


#### Type of lesions

- akiec (Actinic keratoses and intraepithelial carcinoma)
- bcc (Basal cell carcinoma)
- bkl (Benign keratosis-like lesions)
- df (Dermatofibroma)
- mel (Melanoma)
- nv (Melanocytic nevi)
- vasc (Vascular lesions)

The plot above suggests that melanocytic nevi are the most common type of skin lesion in the dataset.

In [None]:
# Age distribution
plt.figure(figsize=(8, 4))
sns.histplot(df['age'], bins=30, kde=True)
plt.title("Age Distribution of Patients")
plt.xlabel("Age")
plt.show()

- The plot represents the age distribution of patients in the dataset.
- It shows that most patients are between 30 and 60 years old, with a peak around the 40s.
- The distribution indicates a higher frequency of older patients compared to younger ones.

In [None]:
# Sex distribution
plt.figure(figsize=(6, 4))
sns.countplot(x='sex', hue='sex', data=df, palette='viridis')
plt.xticks(ticks=[0, 1, 2], labels=['Female', 'Male', 'unknown'])
plt.title("Gender Distribution")
plt.show()

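- The plot shows that male patients account for more records than female patients or patients of unknown sex. Because the counts include benign as well as malignant lesions, the plot describes the composition of the dataset rather than skin cancer risk by sex.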

### Step 7: Statistical Tests

**t-test:**

A t-test is a statistical test used to compare the means of two groups to determine if they are significantly different from each other. It is commonly used in hypothesis testing.
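
The test used later in this step is called with `equal_var=False`, which is Welch's variant of the t-test. As a rough sketch of the arithmetic behind it, the block below computes the t-statistic and the Welch-Satterthwaite degrees of freedom by hand. The helper name `welch_t` and the sample values are illustrative assumptions; the p-value is left out because it needs the t-distribution CDF, which SciPy provides.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return Welch's t-statistic and the Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # variance() is the sample variance (n - 1 in the denominator)
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t_value = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    dof_value = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t_value, dof_value

# Made-up values, for illustration only
group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10]
welch_stat, welch_dof = welch_t(group_a, group_b)
print(f"t = {welch_stat:.4f}, dof = {welch_dof:.4f}")
```

Unlike the pooled-variance t-test, this form does not assume the two groups have equal variance, which is a reasonable default when group sizes differ a lot.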

**Benign Lesions**

The benign category includes non-cancerous (harmless) lesions:

- bkl (Benign keratosis-like lesions)
- nv (Melanocytic nevi - common moles)

These lesions are generally not life-threatening and do not spread aggressively.

**Malignant Lesions**

The malignant category includes cancerous (harmful) lesions:

- mel (Melanoma - the deadliest form of skin cancer)
- bcc (Basal cell carcinoma - a slow-growing skin cancer)
- akiec (Actinic keratoses and intraepithelial carcinoma - precancerous/cancerous growths)

These lesions have the potential to invade nearby tissues and spread if not treated.

The remaining two classes, df and vasc, are not assigned to either group and are excluded from the test below.

In [None]:
# Hypothesis: Is there a significant difference in age between benign & malignant lesions?
# le_dx.transform maps lesion-type names to the integer codes stored in df['dx']
benign = df[df['dx'].isin(le_dx.transform(['bkl', 'nv']))]['age']
malignant = df[df['dx'].isin(le_dx.transform(['mel', 'bcc', 'akiec']))]['age']

In [None]:
# Perform an independent t-test
t_stat, p_value = stats.ttest_ind(benign, malignant, equal_var=False)
print(f"T-test result: t-statistic = {t_stat:.4f}, p-value = {p_value:.4f}")
if p_value < 0.05:
    print("\n Statistically significant difference in age between benign and malignant lesions.")
else:
    print("\n No significant difference in age between benign and malignant lesions.")

### Step 8: Hypothesis Evaluation

**Chi-Square Test:**

The Chi-Square test of independence is a statistical test used to determine whether there is a significant association between two categorical variables. In this dataset, it can be applied to check whether lesion type and patient gender are independent or associated.
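
To make the statistic concrete, the sketch below computes it by hand from a small contingency table: each expected count is (row total × column total) / grand total, and the statistic sums (observed − expected)² / expected over all cells. The helper name `chi_square_statistic` and the table values are made up for illustration.

```python
def chi_square_statistic(table):
    """Return the chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent
            expected_count = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected_count) ** 2 / expected_count
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof

# Made-up 2 x 3 table, for illustration only
toy_table = [[10, 20, 30], [20, 20, 20]]
toy_chi2, chi2_dof = chi_square_statistic(toy_table)
print(f"chi2 = {toy_chi2:.4f}, dof = {chi2_dof}")
```

Note that `stats.chi2_contingency` applies Yates' continuity correction by default when the table has one degree of freedom (a 2 x 2 table), so its result differs from this plain formula in that case only.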

In [None]:
# Chi-Square test for gender distribution across lesion types
contingency_table = pd.crosstab(df['sex'], df['dx'])
chi2_stat, p, dof, expected = stats.chi2_contingency(contingency_table)
print(f"Chi-Square Test: chi2 = {chi2_stat:.4f}, p-value = {p:.4f}")
if p < 0.05:
    print("\n Statistically significant association between gender and lesion type.")
else:
    print("\n No significant association between gender and lesion type.")

### Step 9: Observations & Summary

**Key Findings**

**Statistical Tests & Hypothesis Evaluation**

- A t-test with unequal variances (`equal_var=False`) compared the mean age of patients with benign and malignant lesions.
- A Chi-Square test checked whether gender and lesion type are associated.

**Potential Challenges**

**Class Imbalance:**

Some lesion types (e.g., melanocytic nevi) dominate, while others are underrepresented. This could bias the model toward predicting the most common class.

**Data Bias:**

Higher representation of certain age groups may lead to a model that performs better on middle-aged individuals but poorly on younger or older patients.

**Feature Engineering Needed:**

Image-based features may need augmentation techniques to balance classes.
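
One common way to counter the class imbalance noted above is to weight each class inversely to its frequency during training. The notebook does not do this; the sketch below is only an assumption about a possible next step. It uses the formula `n_samples / (n_classes * class_count)`, which is the "balanced" heuristic in scikit-learn's `compute_class_weight`. The helper name `balanced_class_weights` and the label counts are made up.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {name: n_samples / (n_classes * count) for name, count in counts.items()}

# Made-up label counts, for illustration only
toy_labels = ["nv"] * 6 + ["mel"] * 3 + ["df"] * 1
toy_weights = balanced_class_weights(toy_labels)
print(toy_weights)
```

Rare classes receive weights above 1 and dominant classes weights below 1, so errors on rare classes cost more in the loss.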

1. Load & Preprocess the Data
    - Load `HAM10000_metadata.csv` and the image files.
    - Join the metadata with the image data.
    - Encode categorical labels (e.g., lesion types).
    - Resize images (e.g., 64x64 or 128x128) and normalize pixel values (divide by 255).

In [None]:
import os
import cv2
from tqdm import tqdm  # for progress bar

# Define image folders
images_path_1 = os.path.join(DATA_PATH, "HAM10000_images_part_1/")
images_path_2 = os.path.join(DATA_PATH, "HAM10000_images_part_2/")

# Initialize lists
image_data = []
labels = []

# Loop through metadata with progress bar
print("📥 Loading and processing images...")
for idx, row in tqdm(df.iterrows(), total=len(df)):
    filename = row['image_id'] + ".jpg"
    # 'dx' was label-encoded to integers in Step 5; map the code back to the lesion-type string
    label = le_dx.classes_[int(row['dx'])]

    # Build possible image paths
    path1 = os.path.join(images_path_1, filename)
    path2 = os.path.join(images_path_2, filename)

    # Try loading the image from both folders
    img = None
    if os.path.exists(path1):
        img = cv2.imread(path1)
    elif os.path.exists(path2):
        img = cv2.imread(path2)

    # If image not found or can't be read, skip
    if img is None:
        print(f"⚠️ Could not load image: {filename}")
        continue

    # Resize image to 64x64 and convert BGR to RGB
    img = cv2.resize(img, (64, 64))
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

    # Append image and label
    image_data.append(img)
    labels.append(label)

print(f"\n✅ Total images loaded: {len(image_data)}")



Convert Data to NumPy Arrays

In [None]:
import numpy as np

# Convert to NumPy arrays
X = np.array(image_data)
y = np.array(labels)

print("✅ Data shapes:")
print("Images (X):", X.shape)
print("Labels (y):", y.shape)


Encode Labels (String → Integer → One-hot)

In [None]:
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

# Encode the lesion-type labels collected during image loading into integers
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# One-hot encode the integer labels
y_onehot = to_categorical(y_encoded)

# Class names in encoder order, as a list of strings
class_names = [str(c) for c in label_encoder.classes_]

print("✅ Unique classes:", class_names)
print("Encoded labels shape:", y_onehot.shape)
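
As a small illustration of the integer-to-one-hot step, the sketch below builds one-hot rows by hand and recovers the integer codes by taking the position of the largest entry, which is what `np.argmax` does later during evaluation. The helper name `one_hot` and the sample codes are assumptions for illustration; it is not how `to_categorical` is implemented.

```python
def one_hot(codes, n_classes):
    """Turn integer class codes into one-hot rows of floats."""
    rows = []
    for code in codes:
        row = [0.0] * n_classes
        row[code] = 1.0
        rows.append(row)
    return rows

# Made-up codes, for illustration only
toy_codes = [0, 2, 1]
toy_onehot = one_hot(toy_codes, n_classes=3)
# The index of the largest entry recovers the original code
recovered = [row.index(max(row)) for row in toy_onehot]
print(toy_onehot)
print(recovered)
```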


Train-Test Split

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y_onehot,
    test_size=0.2,
    random_state=42,
    stratify=y_onehot
)

print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)
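
The `stratify` argument keeps class proportions roughly equal in the train and test sets. The block below is a simplified sketch of that idea, not scikit-learn's implementation: it splits each class separately and then combines the pieces. The helper name `stratified_split` and the 80/20 label mix are made up.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction, seed):
    """Split indices so each class contributes about test_fraction of its samples to the test set."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Made-up imbalanced labels, for illustration only
split_labels = ["nv"] * 80 + ["mel"] * 20
train_idx, test_idx = stratified_split(split_labels, test_fraction=0.2, seed=42)
mel_in_test = sum(split_labels[i] == "mel" for i in test_idx)
print(len(train_idx), len(test_idx), mel_in_test)
```

Without stratification, a rare class could by chance end up with very few or no samples in the test set, which would make its test metrics unreliable.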


Build a CNN Model

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

# Normalize image pixels
X_train = X_train / 255.0
X_test = X_test / 255.0

# Build CNN
model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 3)),
    MaxPooling2D(2, 2),

    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D(2, 2),

    Flatten(),
    Dense(128, activation='relu'),
    Dropout(0.5),
    Dense(y_train.shape[1], activation='softmax')  # 7 output classes
])

# Compile model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()

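# Train (the test split doubles as validation data here, so validation metrics are not an independent estimate)
history = model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_test, y_test))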



Plot Training & Validation Accuracy and Loss

In [None]:
import matplotlib.pyplot as plt

# Plot accuracy
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'], label='Train Accuracy', marker='o')
plt.plot(history.history['val_accuracy'], label='Val Accuracy', marker='x')
plt.title('Model Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()

# Plot loss
plt.subplot(1, 2, 2)
plt.plot(history.history['loss'], label='Train Loss', marker='o')
plt.plot(history.history['val_loss'], label='Val Loss', marker='x')
plt.title('Model Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()

plt.tight_layout()
plt.show()


Evaluate with Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns

# Predict test set
y_pred_probs = model.predict(X_test)
y_pred = np.argmax(y_pred_probs, axis=1)
y_true = np.argmax(y_test, axis=1)

# Confusion matrix (rows = actual class, columns = predicted class)
cm = confusion_matrix(y_true, y_pred)
# Class names as strings, in encoder order (a separate name keeps the earlier `labels` list intact)
class_names = [str(c) for c in label_encoder.classes_]

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=class_names, yticklabels=class_names)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

# Print classification report; zero_division=0 reports 0 instead of warning when a class has no predictions
print("Classification Report:\n")
print(classification_report(y_true, y_pred, target_names=class_names, zero_division=0))
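
To show how the per-class numbers in a classification report relate to the confusion matrix, the sketch below derives precision and recall by hand, assuming the same layout as above (rows are actual classes, columns are predicted classes). The helper name `per_class_precision_recall` and the 2 x 2 matrix are made up for illustration.

```python
def per_class_precision_recall(matrix):
    """matrix[i][j] = number of samples with actual class i predicted as class j."""
    n = len(matrix)
    results = []
    for k in range(n):
        true_positive = matrix[k][k]
        predicted_k = sum(matrix[i][k] for i in range(n))  # column sum
        actual_k = sum(matrix[k])                          # row sum
        # Return 0.0 when a class was never predicted or never occurs
        precision = true_positive / predicted_k if predicted_k else 0.0
        recall = true_positive / actual_k if actual_k else 0.0
        results.append((precision, recall))
    return results

# Made-up confusion matrix, for illustration only
toy_cm = [[8, 2], [1, 9]]
toy_scores = per_class_precision_recall(toy_cm)
for k, (precision, recall) in enumerate(toy_scores):
    print(f"class {k}: precision = {precision:.3f}, recall = {recall:.3f}")
```

With imbalanced classes, these per-class values are usually more informative than overall accuracy, since a model can score high accuracy by favoring the dominant class.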




Save the Trained Model

In [None]:
# Save model
model.save("skin_cancer_classifier_model.h5")
print("Model saved as skin_cancer_classifier_model.h5")


Extend the Metadata with Synthetic Columns

In [None]:
import pandas as pd
import random

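# Reload the metadata so 'dx' holds the original string labels (it was label-encoded in Step 5)
metadata_path = os.path.join(DATA_PATH, "HAM10000_metadata.csv")
df = pd.read_csv(metadata_path)

# Optional: Reset index if needed
df = df.reset_index(drop=True)

# Define synthetic values (illustrative only; these columns are not part of HAM10000)
# Note: random.choice inside the maps below runs once, so every 'retired' row shares one value
# No random seed is set, so the synthetic columns differ between runs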
professions = ['farmer', 'teacher', 'construction worker', 'office worker', 'student', 'factory worker', 'driver', 'retired']

sun_exposure_map = {
    'farmer': 'yes',
    'construction worker': 'yes',
    'driver': 'yes',
    'teacher': 'no',
    'office worker': 'no',
    'student': 'no',
    'factory worker': 'no',
    'retired': random.choice(['yes', 'no'])
}

chemical_exposure_map = {
    'factory worker': 'yes',
    'construction worker': 'yes',
    'farmer': 'yes',
    'teacher': 'no',
    'office worker': 'no',
    'student': 'no',
    'driver': 'no',
    'retired': random.choice(['yes', 'no'])
}

# Add synthetic columns
df['profession'] = [random.choice(professions) for _ in range(len(df))]
df['sun_exposed'] = df['profession'].apply(lambda x: sun_exposure_map.get(x, 'no'))
df['chemical_exposed'] = df['profession'].apply(lambda x: chemical_exposure_map.get(x, 'no'))

# Define binary cancer diagnosis
cancer_types = ['mel', 'bcc', 'akiec']
df['cancer_diagnosed'] = df['dx'].apply(lambda x: 'yes' if x in cancer_types else 'no')

print("✅ Metadata extended with synthetic features.")
df.head()


In [None]:
df.count()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Cancer count by profession
plt.figure(figsize=(10, 5))
sns.countplot(data=df, x='profession', hue='cancer_diagnosed')
plt.title("Skin Cancer Diagnosis by Profession")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


In [None]:
# Sunlight Exposure vs Cancer Diagnosis
sunlight_ct = pd.crosstab(df['sun_exposed'], df['cancer_diagnosed'], normalize='index')
print("Sunlight Exposure vs Cancer Diagnosis:\n", sunlight_ct)

# Chemical Exposure vs Cancer Diagnosis
chemical_ct = pd.crosstab(df['chemical_exposed'], df['cancer_diagnosed'], normalize='index')
print("\nChemical Exposure vs Cancer Diagnosis:\n", chemical_ct)


In [None]:
# Group by combinations
heatmap_data = df.groupby(['sun_exposed', 'chemical_exposed', 'cancer_diagnosed']).size().unstack().fillna(0)

# Plot heatmap
plt.figure(figsize=(6, 4))
sns.heatmap(heatmap_data, annot=True, fmt='g', cmap='YlOrRd')
plt.title("Cancer Diagnosis by Sun & Chemical Exposure")
plt.ylabel('Sun / Chemical Exposure')
plt.xlabel('Cancer Diagnosed')
plt.show()

In [None]:
# Save the extended DataFrame to a CSV file (/content exists only in Colab)
csv_path = "/content/extended_skin_cancer_data.csv" if IN_COLAB else "extended_skin_cancer_data.csv"
df.to_csv(csv_path, index=False)
print(f"✅ CSV saved to: {csv_path}")


In [None]:
# Create age bins; include_lowest=True keeps age 0 in the first bin
df['age_group'] = pd.cut(df['age'], bins=[0, 20, 40, 60, 80, 100],
                         labels=['0–20', '21–40', '41–60', '61–80', '81+'],
                         include_lowest=True)

# Plot
plt.figure(figsize=(8,5))
sns.countplot(data=df, x='age_group', hue='cancer_diagnosed')
plt.title('Cancer Diagnosis by Age Group')
plt.xlabel('Age Group')
plt.ylabel('Count')
plt.tight_layout()
plt.show()


In [None]:
import pandas as pd
import random
import numpy as np

# Reload the metadata (if needed)
metadata_path = os.path.join(DATA_PATH, "HAM10000_metadata.csv")
df = pd.read_csv(metadata_path)
df = df.reset_index(drop=True)

# Step 1: Assign random professions
professions = ['farmer', 'teacher', 'construction worker', 'office worker', 'student', 'factory worker', 'driver', 'retired']
df['profession'] = [random.choice(professions) for _ in range(len(df))]

# Step 2: Define exposure logic
sun_exposure_map = {
    'farmer': 'yes',
    'construction worker': 'yes',
    'driver': 'yes',
    'teacher': 'no',
    'office worker': 'no',
    'student': 'no',
    'factory worker': 'no',
    'retired': random.choice(['yes', 'no'])
}
chemical_exposure_map = {
    'factory worker': 'yes',
    'construction worker': 'yes',
    'farmer': 'yes',
    'teacher': 'no',
    'office worker': 'no',
    'student': 'no',
    'driver': 'no',
    'retired': random.choice(['yes', 'no'])
}

df['sun_exposed'] = df['profession'].apply(lambda x: sun_exposure_map.get(x, 'no'))
df['chemical_exposed'] = df['profession'].apply(lambda x: chemical_exposure_map.get(x, 'no'))

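# Step 3: Set cancer probability by profession
# Higher-risk professions get higher probabilities
# Unlike the earlier cell, 'cancer_diagnosed' is simulated here and is not derived from the real 'dx' column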
cancer_risk_by_profession = {
    'farmer': 0.5,
    'construction worker': 0.45,
    'factory worker': 0.4,
    'driver': 0.3,
    'teacher': 0.2,
    'retired': 0.2,
    'office worker': 0.1,
    'student': 0.05
}

# Generate cancer diagnosis based on risk probability
def assign_cancer(prof):
    return 'yes' if random.random() < cancer_risk_by_profession[prof] else 'no'

df['cancer_diagnosed'] = df['profession'].apply(assign_cancer)

print("✅ Skewed cancer diagnosis based on profession exposure risk.")
df.head()
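
The cell above draws from the global `random` state, so each run produces a different synthetic dataset. If reproducibility matters, one option (an assumption, not something the notebook does) is to draw from a seeded `random.Random` instance. The helper name `simulate_diagnoses`, the seed, and the two-profession risk table are illustrative only.

```python
import random

def simulate_diagnoses(professions, risk_by_profession, seed):
    """Assign 'yes' with the profession's risk probability, using a seeded generator."""
    rng = random.Random(seed)
    return ["yes" if rng.random() < risk_by_profession[p] else "no" for p in professions]

# Made-up inputs, for illustration only
toy_risk = {"farmer": 0.5, "student": 0.05}
toy_professions = ["farmer"] * 10000 + ["student"] * 10000

first_run = simulate_diagnoses(toy_professions, toy_risk, seed=0)
second_run = simulate_diagnoses(toy_professions, toy_risk, seed=0)

farmer_rate = first_run[:10000].count("yes") / 10000
student_rate = first_run[10000:].count("yes") / 10000
print(first_run == second_run, farmer_rate, student_rate)
```

The observed rates land close to the configured probabilities, which is also why any pattern found in these columns reflects the chosen risk table rather than the HAM10000 data.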


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10,5))
sns.countplot(data=df, x='profession', hue='cancer_diagnosed', order=professions)
plt.title("Cancer Diagnosis by Profession (Skewed Exposure Risk)")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


In [None]:
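# Save the skewed dataset; in Colab, also trigger a browser download
skewed_csv_path = "/content/extended_skin_cancer_data_skewed.csv" if IN_COLAB else "extended_skin_cancer_data_skewed.csv"
df.to_csv(skewed_csv_path, index=False)
if IN_COLAB:
    from google.colab import files
    files.download(skewed_csv_path)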



In [None]:
fig, ax = plt.subplots(1, 2, figsize=(12, 4))

sns.countplot(data=df, x='sun_exposed', hue='cancer_diagnosed', ax=ax[0])
ax[0].set_title('🔆 Sunlight Exposure vs Cancer')

sns.countplot(data=df, x='chemical_exposed', hue='cancer_diagnosed', ax=ax[1])
ax[1].set_title('🧪 Chemical Exposure vs Cancer')

plt.tight_layout()
plt.show()


In [None]:
# Prepare numeric-encoded copy of df
df_encoded = df.copy()

# Encode binary and categorical features
df_encoded['cancer_diagnosed'] = df_encoded['cancer_diagnosed'].map({'yes': 1, 'no': 0})
df_encoded['sun_exposed'] = df_encoded['sun_exposed'].map({'yes': 1, 'no': 0})
df_encoded['chemical_exposed'] = df_encoded['chemical_exposed'].map({'yes': 1, 'no': 0})
# The metadata column is named 'sex'; rows with 'unknown' sex become NaN
df_encoded['gender'] = df_encoded['sex'].map({'male': 1, 'female': 0})
df_encoded['retired'] = df_encoded['profession'].map(lambda x: 1 if x == 'retired' else 0)

plt.figure(figsize=(8, 6))
sns.heatmap(df_encoded[['age', 'gender', 'sun_exposed', 'chemical_exposed', 'cancer_diagnosed']].corr(),
            annot=True, cmap='coolwarm')
plt.title('📊 Correlation Heatmap')
plt.tight_layout()
plt.show()


In [None]:
df['exposure_combo'] = df['sun_exposed'] + ' / ' + df['chemical_exposed']

exposure_ct = pd.crosstab(df['exposure_combo'], df['cancer_diagnosed'])
sns.heatmap(exposure_ct, annot=True, cmap='YlOrRd', fmt='d')

plt.title('🔥 Cancer Diagnosis by Combined Exposure (Sun / Chemical)')
plt.ylabel('Exposure Combo')
plt.xlabel('Cancer Diagnosed')
plt.tight_layout()
plt.show()
