<a href="https://colab.research.google.com/github/mrrnour/Medical-Image-Analysis_public/blob/main/Melanoma_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SIIM-ISIC Melanoma Classification

Skin cancer is the most common type of cancer, with melanoma being the deadliest despite its rarity. In 2020, over 100,000 new melanoma cases were expected in the U.S., with nearly 7,000 deaths. Early detection is crucial for effective treatment.

Dermatologists currently identify potential melanomas by examining all of a patient’s moles for unusual ones. AI approaches haven’t fully utilized this method. Improved algorithms that consider patient-specific images could enhance diagnostic accuracy and support dermatologists.

The SIIM-ISIC Melanoma Classification competition on Kaggle aims to develop tools that identify melanoma in skin lesion images using patient-level contextual information. Early detection and accurate diagnosis through image analysis tools could significantly improve outcomes and save lives.

https://www.kaggle.com/competitions/siim-isic-melanoma-classification/overview

### **Here, I work only with the image data**

## 1-Ignition

### 1.1- Set up Kernel and Required Dependencies

In [None]:
! pip install -q kaggle
! pip install wtfml==0.0.2
! pip install torch==2.2.0
! pip install pretrainedmodels
! git clone https://github.com/mrrnour/ds_toolbox_public.git


### 1.2- Loading Libs and Functions

In [None]:
import os
import shutil
import numpy as np
import pandas as pd
from tqdm import tqdm

from sklearn import metrics
from sklearn import model_selection

import torch
import albumentations
import torch.nn as nn
from torch.nn import functional as F

from wtfml.utils import EarlyStopping
from wtfml.engine import Engine
from wtfml.data_loaders.image import ClassificationLoader

import pretrainedmodels

import ds_toolbox_public.dsToolbox.io_funcs_kaggle2colab as io_funcs



In [None]:
###Parameters:
download_folder="/content/drive/My Drive/Colab Notebooks/melanoma_classification_data"
kaggle_json_source=os.path.join(download_folder, '../kaggle.json')
zip_file_name=os.path.join(download_folder, 'siim-isic-melanoma-classification.zip')
exclude_folders=('train/', 'test/', 'tfrecords/')
image_resized_folder = os.path.join(download_folder, 'jpeg512')
colab_folder="/content/data"
colab_image_folder=os.path.join(colab_folder, 'jpeg512')
colab_train_csv=os.path.join(colab_folder, 'train.csv')
colab_test_csv=os.path.join(colab_folder, 'test.csv')
colab_model_path__template=os.path.join(colab_folder, "model_fold.bin")
colab_submission_csv=os.path.join(colab_folder, 'sample_submission.csv')
colab_performance_csv=os.path.join(colab_folder, 'performance.csv')
files2Copy=["README.md",
            "requirements.txt",
            "train.csv" ,
            "test.csv",
            "sample_submission.csv",
            "se_resnext50_32x4d-a260b3a4.pth"]
n_splits=5

###Functions:
def resize_image(image_org_file, image_resized_folder, resize):
  """Resize one image to `resize` (height, width) and save it under the same file name."""
  # Imports stay local so the function is self-contained when joblib runs it in a worker process.
  import os
  from PIL import Image, ImageFile
  ImageFile.LOAD_TRUNCATED_IMAGES = True  # tolerate partially written JPEGs
  base_name=os.path.basename(image_org_file)
  outpath = os.path.join(image_resized_folder, base_name)
  img = Image.open(image_org_file)
  # PIL expects (width, height), so the (height, width) tuple is swapped here.
  img = img.resize(
                  (resize[1], resize[0]),
                  resample=Image.BILINEAR
                  )
  img.save(outpath)

def resize_image_parallel(image_org_folder, image_resized_folder, resize=(512,512),n_jobs=32, verbose=10):
  """Resize every .jpg in `image_org_folder` in parallel and write the results to `image_resized_folder`."""
  import glob
  from joblib import Parallel, delayed

  images = glob.glob(image_org_folder+'/*.jpg')

  if not os.path.exists(image_resized_folder):
    os.makedirs(image_resized_folder)
    print(f"folder {image_resized_folder} created")

  Parallel(n_jobs=n_jobs, verbose=verbose)(
      delayed(resize_image)(
          f,
          image_resized_folder,
          resize)
      for f in tqdm(images))

def add_kfold(df_path, df_fold_path, n_splits):
  df = pd.read_csv(df_path)
  df["kfold"] = -1
  df = df.sample(frac=1).reset_index(drop=True)
  y = df.target.values
  kf = model_selection.StratifiedKFold(n_splits=n_splits)

  for f, (t_, v_) in enumerate(kf.split(X=df, y=y)):
      df.loc[v_, 'kfold'] = f

  df.to_csv(df_fold_path, index=False)
  return df

def create_fileName(model_path__template, fold_no):
    base_name=os.path.basename(model_path__template)
    filename, file_extension = os.path.splitext(base_name)
    dir_name=os.path.dirname(model_path__template)
    model_path=os.path.join(dir_name, f"{filename}{fold_no}{file_extension}")
    return model_path

class SEResnext50_32x4d(nn.Module):
    def __init__(self, pretrained='imagenet'):
        super(SEResnext50_32x4d, self).__init__()

        self.base_model = pretrainedmodels.__dict__[
            "se_resnext50_32x4d"
        ](pretrained=pretrained)

        self.l0 = nn.Linear(2048, 1)

    def forward(self, image, targets):
        batch_size, _, _, _ = image.shape

        x = self.base_model.features(image)
        x = F.adaptive_avg_pool2d(x, 1).reshape(batch_size, -1)

        out = self.l0(x)
        loss = nn.BCEWithLogitsLoss()(out, targets.view(-1, 1).type_as(x))

        return out, loss

def train(fold, model_path__template):
    """
    Trains a classification model using the specified fold of the dataset and saves the trained model.

    Args:
        fold (int): The fold number to be used for training and validation.
        model_path__template (str): The template path where the model will be saved. The fold number will be appended to the filename.

    Returns:
        str: The path where the trained model is saved.

    The function performs the following steps:
    1. Reads the training data and splits it into training and validation sets based on the fold number.
    2. Initializes the model and moves it to the specified device (GPU).
    3. Defines data augmentation techniques for training and validation datasets.
    4. Prepares data loaders for training and validation datasets.
    5. Sets up the optimizer, learning rate scheduler, and early stopping mechanism.
    6. Trains the model for a specified number of epochs, evaluating it on the validation set after each epoch.
    7. Saves the model if it achieves a better validation score and stops early if the validation score does not improve for a specified number of epochs.
    """
    # Read images from the local Colab copy rather than from Google Drive.
    training_data_path = os.path.join(colab_image_folder, 'train')
    df = pd.read_csv(colab_train_csv)
    model_path=create_fileName(model_path__template, fold_no=fold)
    device = "cuda"
    epochs = 50
    train_bs = 32
    valid_bs = 16
    file_extention='jpg'
    df_train = df[df.kfold != fold].reset_index(drop=True)
    df_valid = df[df.kfold == fold].reset_index(drop=True)

    model = SEResnext50_32x4d(pretrained="imagenet")
    model.to(device)

    mean = (0.485, 0.456, 0.406)
    std = (0.229, 0.224, 0.225)
    train_aug = albumentations.Compose(
        [
            albumentations.Normalize(mean, std, max_pixel_value=255.0, always_apply=True),
            albumentations.ShiftScaleRotate(shift_limit=0.0625, scale_limit=0.1, rotate_limit=15),
            albumentations.Flip(p=0.5)
        ]
    )

    valid_aug = albumentations.Compose(
        [
            albumentations.Normalize(mean, std, max_pixel_value=255.0, always_apply=True)
        ]
    )

    train_images = df_train.image_name.values.tolist()
    train_images = [os.path.join(training_data_path, i + f".{file_extention}") for i in train_images]
    train_targets = df_train.target.values

    valid_images = df_valid.image_name.values.tolist()
    valid_images = [os.path.join(training_data_path, i + f".{file_extention}") for i in valid_images]
    valid_targets = df_valid.target.values

    train_dataset = ClassificationLoader(
        image_paths=train_images,
        targets=train_targets,
        resize=None,
        augmentations=train_aug,
    )

    train_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=train_bs, shuffle=True, num_workers=4
    )

    valid_dataset = ClassificationLoader(
        image_paths=valid_images,
        targets=valid_targets,
        resize=None,
        augmentations=valid_aug,
    )

    valid_loader = torch.utils.data.DataLoader(
        valid_dataset, batch_size=valid_bs, shuffle=False, num_workers=4
    )

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer,
        patience=3,
        threshold=0.001,
        mode="max"
    )

    es = EarlyStopping(patience=5, mode="max")

    for epoch in range(epochs):
        train_loss = Engine.train(train_loader, model, optimizer, device=device)
        predictions, valid_loss = Engine.evaluate(
            valid_loader, model, device=device
        )
        predictions = np.vstack((predictions)).ravel()
        auc = metrics.roc_auc_score(valid_targets, predictions)
        print(f"Epoch = {epoch}, AUC = {auc}")
        scheduler.step(auc)

        es(auc, model, model_path=model_path)
        if es.early_stop:
            print("Early stopping")
            break
    return model_path

def predict(fold, model_path__template):
    model_path=create_fileName(model_path__template, fold_no=fold)
    # Read images from the local Colab copy rather than from Google Drive.
    test_data_path = os.path.join(colab_image_folder, 'test')
    df = pd.read_csv(colab_test_csv)
    device = "cuda"
    file_extention='jpg'

    mean = (0.485, 0.456, 0.406)
    std = (0.229, 0.224, 0.225)
    aug = albumentations.Compose(
        [
            albumentations.Normalize(mean, std, max_pixel_value=255.0, always_apply=True)
        ]
    )

    images = df.image_name.values.tolist()
    images = [os.path.join(test_data_path, i + f".{file_extention}") for i in images]
    targets = np.zeros(len(images))

    test_dataset = ClassificationLoader(
        image_paths=images,
        targets=targets,
        resize=None,
        augmentations=aug,
    )

    test_loader = torch.utils.data.DataLoader(
        test_dataset, batch_size=16, shuffle=False, num_workers=4
    )

    model = SEResnext50_32x4d(pretrained=None)
    model.load_state_dict(torch.load(model_path))
    model.to(device)

    predictions = Engine.predict(test_loader, model, device=device)
    predictions = np.vstack((predictions)).ravel()

    return predictions

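The `add_kfold` function above relies on scikit-learn's `StratifiedKFold` so that every fold keeps roughly the same share of melanoma cases, which matters when positives are rare. The following is a minimal plain-Python sketch of that idea, not scikit-learn's implementation; `stratified_folds` and the toy labels are hypothetical and exist only for illustration.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits, seed=0):
    """Assign each sample a fold id so every fold gets a similar share of each class."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Shuffle within each class, then deal the indices round-robin across folds.
    folds = [-1] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[idx] = position % n_splits
    return folds


# Toy data (an assumption, not the competition data): 10 positives among 500 samples.
labels = [1] * 10 + [0] * 490
folds = stratified_folds(labels, n_splits=5)

for f in range(5):
    members = [i for i, fold in enumerate(folds) if fold == f]
    positives = sum(labels[i] for i in members)
    print(f"fold {f}: {len(members)} samples, {positives} positives")
```

With an unstratified split, a fold could end up with very few positives by chance, which would make its validation AUC noisy.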

## 2-Loading and Resizing Images

In [None]:
# ### download kaggle data in Google Drive
# #see https://www.kaggle.com/discussions/general/74235

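# The helpers are called through the io_funcs alias imported above
# (assuming both live in ds_toolbox_public.dsToolbox.io_funcs_kaggle2colab).
io_funcs.copy_kaggle_json_to_colab(kaggle_json_source)
io_funcs.download_and_extract_dataset(download_folder, zip_file_name, kaggle_json_source, extract_folders=None, exclude_folders=exclude_folders)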

##resizing images:
for subset in ['train', 'test']:
  resize_image_parallel(os.path.join(download_folder, 'jpeg', subset) ,
                        os.path.join(image_resized_folder, subset) ,
                        resize=(512,512),
                        n_jobs=32, verbose=10
                        )

import ssl
ssl._create_default_https_context = ssl._create_unverified_context
!wget --no-check-certificate -P  "/content/drive/My Drive/Colab Notebooks/melanoma_classification_data/" "http://data.lip6.fr/cadene/pretrainedmodels/se_resnext50_32x4d-a260b3a4.pth"

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [None]:
###copying from google Drive to Colab
from google.colab import drive
drive.mount('/content/drive/')

# dirs_exist_ok lets this cell be re-run without failing on an existing folder
shutil.copytree(image_resized_folder, colab_image_folder, dirs_exist_ok=True)
for file in files2Copy:
  print(f"copying {file} --> {colab_folder}...")
  shutil.copyfile(os.path.join(download_folder, file), os.path.join(colab_folder, file))

Mounted at /content/drive/


## 3-Training

In [None]:
df=add_kfold(colab_train_csv, colab_train_csv, n_splits=n_splits)

# aucS=[]
# Train every fold, including fold 0, because the prediction step loads one model per fold.
for fold in range(n_splits):
  print(f"Training Fold {fold}...")
  model_path=train(fold, model_path__template=colab_model_path__template)
  # aucS.append(auc)
  # print("auc=",auc)
  print("-"*150)

# pd.Series(aucS, index=range(n_splits)).to_csv(colab_performance_csv, index=False)
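Step 7 of the `train` docstring describes saving the model whenever the validation score improves and stopping once it stalls. The sketch below illustrates that patience logic in plain Python, assuming higher scores are better (as with `mode="max"`). `PatienceStopper` and the AUC values are hypothetical, and wtfml's `EarlyStopping` may differ in details such as tolerance handling.

```python
class PatienceStopper:
    """Hypothetical helper: signal a stop when a 'higher is better' score stalls."""

    def __init__(self, patience):
        self.patience = patience
        self.best = None
        self.bad_epochs = 0
        self.should_stop = False

    def update(self, score):
        """Return True when `score` is a new best, i.e. the moment to save a checkpoint."""
        if self.best is None or score > self.best:
            self.best = score
            self.bad_epochs = 0
            return True
        self.bad_epochs += 1
        if self.bad_epochs >= self.patience:
            self.should_stop = True
        return False


# Made-up validation AUCs, one per epoch.
aucs = [0.80, 0.85, 0.84, 0.86, 0.85, 0.85, 0.84, 0.86, 0.85, 0.90]

stopper = PatienceStopper(patience=5)
checkpoints = []
for epoch, auc in enumerate(aucs):
    if stopper.update(auc):
        checkpoints.append(epoch)  # a real loop would save the model weights here
    if stopper.should_stop:
        print(f"Early stopping at epoch {epoch}, best AUC = {stopper.best}")
        break
```

Note that the loop stops before it ever sees the final, higher score. That is the trade-off of a fixed patience: it saves compute but can miss late improvements.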

## 4-Prediction

The AUC is 0.91, a relatively good result compared with the 0.9490 scored by the first-place solution, which used patient metadata alongside the images:
https://www.kaggle.com/competitions/siim-isic-melanoma-classification/discussion/175412
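For readers unfamiliar with the metric, the ROC AUC can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way in plain Python. The `roc_auc` helper and the toy values are illustrative assumptions; the notebook itself uses `sklearn.metrics.roc_auc_score`.

```python
def roc_auc(targets, scores):
    """Pairwise ROC AUC for binary targets (1 = positive, 0 = negative)."""
    positives = [s for t, s in zip(targets, scores) if t == 1]
    negatives = [s for t, s in zip(targets, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")

    # Count the positive/negative pairs that are ranked correctly; ties count as half.
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


targets = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(targets, scores))  # 3 of the 4 pairs are ordered correctly -> 0.75
```

Because only the ranking of the scores matters, the metric is unaffected by any monotonic rescaling of a model's outputs.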

In [None]:
ps=[]
for fold in range(n_splits):
  print(f"Prediction Fold {fold}...")
  p_i=predict(fold, colab_model_path__template)
  ps.append(p_i)
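  # Back up this fold's checkpoint, not whichever model_path was left over from training.
  model_path=create_fileName(colab_model_path__template, fold_no=fold)
  shutil.copyfile(model_path, os.path.join(download_folder, os.path.basename(model_path)))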
  print("p=", p_i)
  print("-"*150)

predictions = np.mean(np.column_stack(ps), axis=1)
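sample = pd.read_csv(colab_submission_csv)
sample.loc[:, "target"] = predictions
# Write to submission.csv (the file copied to Drive below) instead of overwriting the sample file.
sample.to_csv(os.path.join(colab_folder, 'submission.csv'), index=False)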

%cp -av "/content/data/submission.csv" "/content/drive/My Drive/Colab Notebooks/melanoma_classification_data/"
# performance.csv exists only if the commented-out AUC logging in the training cell is re-enabled
%cp -av "/content/data/performance.csv" "/content/drive/My Drive/Colab Notebooks/melanoma_classification_data/"
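The final prediction above is the per-image mean of the fold models' outputs. As a minimal sketch of that averaging step without NumPy, the hypothetical `average_folds` helper below takes one prediction list per fold and averages them position by position; the numbers are made up for illustration.

```python
def average_folds(fold_predictions):
    """Average predictions across folds, one value per test image."""
    n_images = len(fold_predictions[0])
    if any(len(p) != n_images for p in fold_predictions):
        raise ValueError("every fold must predict the same number of images")
    # zip(*...) groups the predictions that belong to the same image.
    return [sum(per_image) / len(fold_predictions) for per_image in zip(*fold_predictions)]


# Four hypothetical folds, each scoring the same two test images.
ps_demo = [
    [0.0, 1.0],
    [0.5, 0.5],
    [0.25, 0.75],
    [0.25, 0.75],
]
print(average_folds(ps_demo))
```

Averaging across folds tends to smooth out the quirks of any single model, since each one was validated on a different slice of the training data.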
