### This file organizes the file structure according to diagnosis

First, I downloaded the first dataset from
https://www.kaggle.com/datasets/rajivaiml/isic-skin-cancer-dataset and unzipped it manually. I renamed the root folder to 'skin_cancer' to make it shorter to work with. This dataset has a very simple structure.
Then I downloaded the second dataset from https://www.kaggle.com/datasets/farjanakabirsamanta/skin-cancer-dataset/data and put it into the project folder.

In [None]:
import os
import pandas as pd
import shutil

In [None]:
inp_dir = 'archive'

import zipfile
with zipfile.ZipFile('archive.zip', 'r') as zip_ref:
    zip_ref.extractall(inp_dir)

In [None]:
label_df=pd.read_csv(os.path.join(inp_dir,'HAM10000_metadata.csv'))
label_df

Unnamed: 0,lesion_id,image_id,dx,dx_type,age,sex,localization
0,HAM_0000118,ISIC_0027419,bkl,histo,80.0,male,scalp
1,HAM_0000118,ISIC_0025030,bkl,histo,80.0,male,scalp
2,HAM_0002730,ISIC_0026769,bkl,histo,80.0,male,scalp
3,HAM_0002730,ISIC_0025661,bkl,histo,80.0,male,scalp
4,HAM_0001466,ISIC_0031633,bkl,histo,75.0,male,ear
...,...,...,...,...,...,...,...
10010,HAM_0002867,ISIC_0033084,akiec,histo,40.0,male,abdomen
10011,HAM_0002867,ISIC_0033550,akiec,histo,40.0,male,abdomen
10012,HAM_0002867,ISIC_0033536,akiec,histo,40.0,male,abdomen
10013,HAM_0000239,ISIC_0032854,akiec,histo,80.0,male,face


In [None]:
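# Map the short dx codes to the class folder names used in 'skin_cancer'.
# The mapping is named dx_names so that it does not shadow the built-in dict.
dx_names = {'bkl': 'seborrheic keratosis',
            'nv': 'nevus',
            'df': 'dermatofibroma',
            'mel': 'melanoma',
            'vasc': 'vascular lesion',
            'bcc': 'basal cell carcinoma',
            'akiec': 'actinic keratosis'}
label_df = label_df.replace({"dx": dx_names})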

In [None]:
labels = label_df['dx'].unique()
print(labels)
print('Label count: ', len(labels))

['seborrheic keratosis' 'nevus' 'dermatofibroma' 'melanoma'
 'vascular lesion' 'basal cell carcinoma' 'actinic keratosis']
Label count:  7


In [None]:
train_size = 0.8

In [None]:
count_df = label_df.groupby('dx').size().reset_index()
count_df
count_df.reset_index()

Unnamed: 0,index,dx,0
0,0,actinic keratosis,327
1,1,basal cell carcinoma,514
2,2,dermatofibroma,115
3,3,melanoma,1113
4,4,nevus,6705
5,5,seborrheic keratosis,1099
6,6,vascular lesion,142


In [None]:
6705/count_df[0].sum()

0.6694957563654518

Here we see that this dataset is highly imbalanced: a single class, nevus, accounts for about 67% of the images.
A model that always outputs 'nevus' would therefore reach about 67% accuracy without learning anything, so accuracy on this data would be misleading.
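
To make the point concrete, here is a minimal sketch that recomputes the accuracy of a constant predictor from the class counts in the table above. The names `class_counts` and `majority_baseline_accuracy` are introduced here for illustration only and are not used elsewhere in the notebook.

```python
# Class counts copied from the count_df table above.
class_counts = {
    'actinic keratosis': 327,
    'basal cell carcinoma': 514,
    'dermatofibroma': 115,
    'melanoma': 1113,
    'nevus': 6705,
    'seborrheic keratosis': 1099,
    'vascular lesion': 142,
}


def majority_baseline_accuracy(counts):
    """Accuracy of a model that always predicts the most frequent class."""
    return max(counts.values()) / sum(counts.values())


majority_class = max(class_counts, key=class_counts.get)
baseline = majority_baseline_accuracy(class_counts)
print(majority_class, round(baseline, 4))
```

Any model trained on this data should be compared against this baseline rather than against zero.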

Let's check the other dataset, stored in 'skin_cancer'.

In [None]:
def fast_scandir(dirname):
    # Recursively collect the paths of all subfolders of dirname
    subfolders = [f.path for f in os.scandir(dirname) if f.is_dir()]
    for subfolder in list(subfolders):
        subfolders.extend(fast_scandir(subfolder))
    return subfolders

In [None]:
data_dir = 'skin_cancer'

I removed two classes from the first dataset because I decided to use only the 7 classes that are present in both datasets.
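
The classes to drop can also be derived instead of being listed by hand. The sketch below is only an illustration: `isic_classes`, `ham_classes` and `to_remove` are hypothetical names, and the nine folder names are assumed to be the seven classes kept plus the two removed in the next cell.

```python
# Class folders of the first dataset (assumed: the seven kept plus the two removed).
isic_classes = {
    'actinic keratosis', 'basal cell carcinoma', 'dermatofibroma',
    'melanoma', 'nevus', 'pigmented benign keratosis',
    'seborrheic keratosis', 'squamous cell carcinoma', 'vascular lesion',
}

# Diagnosis names of the second dataset after renaming the dx codes.
ham_classes = {
    'actinic keratosis', 'basal cell carcinoma', 'dermatofibroma',
    'melanoma', 'nevus', 'seborrheic keratosis', 'vascular lesion',
}

# Classes that exist only in the first dataset and therefore have to go.
to_remove = sorted(isic_classes - ham_classes)
print(to_remove)
```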

In [None]:
# Remove the two classes that are absent from the second dataset
for split in ('Train', 'Test'):
    for class_name in ('squamous cell carcinoma', 'pigmented benign keratosis'):
        shutil.rmtree(os.path.join(data_dir, split, class_name))

In [None]:
class_dirs = fast_scandir(os.path.join(data_dir, 'Train'))

# Count the images in each class folder
for class_dir in class_dirs:
    label = os.path.basename(class_dir)
    n_images = len(os.listdir(class_dir))
    print(label, ': ', n_images)

melanoma :  438
nevus :  357
basal cell carcinoma :  376
actinic keratosis :  114
vascular lesion :  139
seborrheic keratosis :  77
dermatofibroma :  95


In [None]:
class_dirs = fast_scandir(os.path.join(data_dir, 'Test'))

# Count the images in each class folder
for class_dir in class_dirs:
    label = os.path.basename(class_dir)
    n_images = len(os.listdir(class_dir))
    print(label, ': ', n_images)

melanoma :  16
nevus :  16
basal cell carcinoma :  16
actinic keratosis :  16
vascular lesion :  3
seborrheic keratosis :  3
dermatofibroma :  16


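The 'skin_cancer' dataset is more or less balanced.
To keep as much data as possible for modeling, I decided to use all files from the HAM10000 metadata table except for part of the nevus images.

I capped the nevus class at 1200 images, which is close to the counts of the next two largest classes (melanoma with 1113 and seborrheic keratosis with 1099).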
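
As a rough check of what this cap implies, the sketch below computes how many images of a class end up in each split when the first `train_size` share goes to `Train` and the rest to `Test`. The helper `split_sizes` is hypothetical and only does the arithmetic; it does not move any files.

```python
import math


def split_sizes(n_images, train_size=0.8, cap=None):
    """Return (n_train, n_test) when the first train_size share goes to Train."""
    # Apply the optional cap before splitting.
    n_used = n_images if cap is None else min(n_images, cap)
    n_train = math.floor(train_size * n_used)
    return n_train, n_used - n_train


print(split_sizes(6705, cap=1200))  # nevus, capped at 1200
print(split_sizes(1113))            # melanoma, no cap
```

With the counts from the table above, this gives 960 training and 240 test images for nevus, and 890 and 223 for melanoma.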

In [None]:
count_dict = {}

In [None]:
# Map each diagnosis name (column 0) to its image count (column 1)
for i in range(count_df.shape[0]):
    count_dict[count_df.iloc[i, 0]] = count_df.iloc[i, 1]

In [None]:
count_dict['nevus'] = 1200

In [None]:
img_dir= os.path.join(inp_dir, 'Skin Cancer', 'Skin Cancer')

In [None]:
counter = {}
for label in labels:
    counter [label] = 0

In [None]:
### Moving the images into the correct folders

In [None]:
for i in range(label_df.shape[0]):
    dx = label_df.loc[i, 'dx']
    file_name = label_df.loc[i, 'image_id'] + '.jpg'

    counter[dx] += 1

    # Skip nevus images beyond the cap stored in count_dict
    if dx == 'nevus' and counter[dx] > count_dict['nevus']:
        continue
    img_path_from = os.path.join(img_dir, file_name)

    # The first train_size share of each class goes to Train, the rest to Test
    split = 'Test' if counter[dx] > train_size * count_dict[dx] else 'Train'
    img_path_to = os.path.join(data_dir, split, dx)

    # Skip images that are already in place, e.g. when the cell is re-run
    if os.path.exists(os.path.join(img_path_to, file_name)):
        continue

    shutil.move(img_path_from, img_path_to)

The resulting training and test sets:

In [None]:
class_dirs = fast_scandir(os.path.join(data_dir, 'Train'))

# Count the images in each class folder
for class_dir in class_dirs:
    label = os.path.basename(class_dir)
    n_images = len(os.listdir(class_dir))
    print(label, ': ', n_images)

melanoma :  1328
nevus :  1317
basal cell carcinoma :  492
actinic keratosis :  311
vascular lesion :  140
seborrheic keratosis :  956
dermatofibroma :  114


In [None]:
class_dirs = fast_scandir(os.path.join(data_dir, 'Test'))

# Count the images in each class folder
for class_dir in class_dirs:
    label = os.path.basename(class_dir)
    n_images = len(os.listdir(class_dir))
    print(label, ': ', n_images)

melanoma :  239
nevus :  256
basal cell carcinoma :  116
actinic keratosis :  77
vascular lesion :  30
seborrheic keratosis :  223
dermatofibroma :  38
