<h2 style="color:#22198A">PROJECT INFO</h2>

<h3 style="color:green">About project-03</h3>
<p> The goal of this project is to find out if some characteristics of skin lesions can be reliably measured with a <b>smartphone app.</b> The
characteristics the dermatologist is especially interested in are: asymmetry, border and color.</p>
<p><b>Contact:</b> jtih@itu.dk, joap@itu.dk, luci@itu.dk</p>
<p><b>Created:</b> 06. 04. 2021</p>
<p><b>Last modified:</b> 22. 04. 2021 </p>

<h2 style="color:#22198A">NOTEBOOK SETUP</h2>
<p>Before you start working with the notebook, please go through this setup to make sure everything runs smoothly. If you have just downloaded the repository, no changes should be needed by default.</p>
<h3 style="color:green">Important highlights</h3>
<ul>
<li><b>BASE_DIR:</b> Path to the repository root, relative to the location of this notebook.</li>
<li><b>SCRIPTS IMPORT:</b> All scripts are saved in a single file. Comments split the file into sections that group scripts with similar functionality, e.g. loading data. Every function should have a docstring, which is useful for troubleshooting or for checking how something was implemented. The import mechanism follows <a href='https://stackoverflow.com/questions/34478398/import-local-function-from-a-module-housed-in-another-directory-with-relative-im'>this</a> SO question. <b>Once you run the cell below, all scripts should be loaded.</b></li>
<li><b>PACKAGES USED WITHIN DIRECTORY: </b> All packages used are listed at the top of <b>all_scripts.py</b>, but it is worth highlighting the "not so standard" packages that you should make sure you have installed: <b>pandas, scipy.</b> Alternatively, you can use the provided <b>requirements.txt.</b></li>
</ul>

In [None]:
import os
import sys
BASE_DIR = f"..{os.sep}..{os.sep}..{os.sep}"
USE_DEEPNOTE = True  # Set to True when opening this notebook via Deepnote

# SCRIPTS IMPORT
scripts_path = os.path.abspath(os.path.join(f'{BASE_DIR}scripts'))

# Add the scripts to the path (only if they are not there yet)
if scripts_path not in sys.path:
    sys.path.append(scripts_path)

# Import the needed scripts
from all_scripts import *

# Remove the added path to avoid possible future conflicts
sys.path.remove(scripts_path)

# PLOTS COLOR SETTING - see more here: https://seaborn.pydata.org/generated/seaborn.color_palette.html#seaborn.color_palette
PLOT_COLOR_SETTINGS = sns.color_palette("flare", as_cmap=True)
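
<p>The add-import-remove pattern from the cell above could also be wrapped in a small context manager. The following is only a sketch: <code>temporary_sys_path</code> and the throwaway module <code>demo_scripts_module</code> are hypothetical names used for illustration and are not part of this repository. Unlike the cell above, the sketch removes the path only if it added it itself.</p>

```python
import contextlib
import importlib
import sys
import tempfile
from pathlib import Path


@contextlib.contextmanager
def temporary_sys_path(directory):
    """Keep `directory` on sys.path only for the duration of the with-block."""
    directory = str(Path(directory).resolve())
    added = directory not in sys.path
    if added:
        sys.path.append(directory)
    try:
        yield directory
    finally:
        # Undo only what this context manager did itself.
        if added and directory in sys.path:
            sys.path.remove(directory)


# Demo with a throwaway module (hypothetical) standing in for all_scripts.py.
with tempfile.TemporaryDirectory() as tmp_dir:
    Path(tmp_dir, "demo_scripts_module.py").write_text("ANSWER = 42\n")
    with temporary_sys_path(tmp_dir):
        demo = importlib.import_module("demo_scripts_module")
    path_restored = str(Path(tmp_dir).resolve()) not in sys.path

print(demo.ANSWER, path_restored)
```

<p>In this notebook, the body of the <code>with</code> block would presumably hold <code>from all_scripts import *</code>, with <code>scripts_path</code> passed as the directory.</p>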

<h2 style="color:#22198A">CONSTANTS</h2>

In [None]:
PATH_DATA = {
    "raw": f"{BASE_DIR}data{os.sep}raw{os.sep}"
}

FILENAMES = {
    "GT_train_ISIC_2017": "ISIC-2017_Training_Part3_GroundTruth.csv",
    "GT_validate_ISIC_2017": "ISIC-2017_Validation_Part3_GroundTruth.csv",
    "GT_test_ISIC_2017": "ISIC-2017_Test_v2_Part3_GroundTruth.csv",
    "meta_info": "ISIC_meta_data.csv"
}
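
<p>As an alternative to joining path pieces with <code>os.sep</code> inside f-strings, the same locations could be built with <code>pathlib</code>. This is a minimal sketch, assuming the ground-truth files live under <code>data/raw</code> as in the constants above. The lowercase names <code>base_dir</code>, <code>raw_dir</code>, <code>filenames</code> and <code>raw_paths</code> are illustrative and are not used elsewhere in the notebook.</p>

```python
from pathlib import Path

# Same relative root as BASE_DIR in the setup cell.
base_dir = Path("..") / ".." / ".."
raw_dir = base_dir / "data" / "raw"

filenames = {
    "GT_train_ISIC_2017": "ISIC-2017_Training_Part3_GroundTruth.csv",
    "GT_validate_ISIC_2017": "ISIC-2017_Validation_Part3_GroundTruth.csv",
    "GT_test_ISIC_2017": "ISIC-2017_Test_v2_Part3_GroundTruth.csv",
}

# One full path per file; Path objects can be passed directly to pd.read_csv.
raw_paths = {key: raw_dir / name for key, name in filenames.items()}

# Report missing files up front instead of failing halfway through loading.
missing = [key for key, path in raw_paths.items() if not path.is_file()]

print(raw_paths["GT_train_ISIC_2017"].as_posix())
print("missing files:", missing)
```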

<h2 style="color:#22198A">LOAD DATA</h2>

In [None]:
all_dfs_raw = {
    "train": pd.read_csv(f"{PATH_DATA['raw']}{FILENAMES['GT_train_ISIC_2017']}"),
    "validate": pd.read_csv(f"{PATH_DATA['raw']}{FILENAMES['GT_validate_ISIC_2017']}"),
    "test": pd.read_csv(f"{PATH_DATA['raw']}{FILENAMES['GT_test_ISIC_2017']}")
}

<h2 style="color:#22198A">Task 0: Explore the data, clean it and extract the features</h2>
<p>Go through the data (csv file, images, segmentations) that you have available
to understand what’s available to you, and write a brief description. Decide if
this data is sufficient, or if cleaning is needed. For example, what do you do with
the images that are malignant (cancer), but not of the class you want to focus
on? Are there images of low quality? Etc. You are allowed to search for and add
other public datasets to this set of images.</p>
<h3 style="color:green">Brief summary</h3>
<ul>
<li><b>Source of data:</b> The data comes from the <a href = 'https://challenge.isic-archive.com/landing/2017'>2017 ISIC challenge.</a></li>
</ul>

<h3 style="color:green">Initial exploration</h3>

<h4 style="color:#ff9900">Shape</h4>

In [None]:
for name_df, df in all_dfs_raw.items():
    print(f"{name_df}: {df.shape}")

train: (2000, 3)
validate: (150, 3)
test: (600, 3)


<h4 style="color:#ff9900">Are there any missing values?</h4>

In [None]:
for name_df, df in all_dfs_raw.items():
    print(f"{name_df}:\n{df.isnull().sum()}\n")

train:
image_id                0
melanoma                0
seborrheic_keratosis    0
dtype: int64

validate:
image_id                0
melanoma                0
seborrheic_keratosis    0
dtype: int64

test:
image_id                0
melanoma                0
seborrheic_keratosis    0
dtype: int64



<h4 style="color:#ff9900">What are the variable names?</h4>

In [None]:
for name_df, df in all_dfs_raw.items():
    print(f"{name_df}:\n{list(df.columns)}\n")

train:
['image_id', 'melanoma', 'seborrheic_keratosis']

validate:
['image_id', 'melanoma', 'seborrheic_keratosis']

test:
['image_id', 'melanoma', 'seborrheic_keratosis']



<h4 style="color:#ff9900">Can the same image be labeled as both melanoma and seborrheic keratosis?</h4>
No.

In [None]:
for name_df, df in all_dfs_raw.items():
    # Both labels are 0/1, so an image with both diagnoses has a label sum of exactly 2
    count_both = sum((df["melanoma"] + df["seborrheic_keratosis"]) == 2)
    print(f"{name_df}:\n{count_both}\n")

train:
0

validate:
0

test:
0



<h3 style="color:green">Explore the distribution of data</h3>

<h4 style="color:#ff9900">What percentage of examples are melanomas?</h4>

In [None]:
for name_df, df in all_dfs_raw.items():
    print(f"{name_df}:\n{sum(df['melanoma'])/df.shape[0]*100} %\n")

train:
18.7 %

validate:
20.0 %

test:
19.5 %



<h4 style="color:#ff9900">What percentage of examples are seborrheic keratoses?</h4>

In [None]:
for name_df, df in all_dfs_raw.items():
    print(f"{name_df}:\n{sum(df['seborrheic_keratosis'])/df.shape[0]*100} %\n")

train:
12.7 %

validate:
28.000000000000004 %

test:
15.0 %



<h4 style="color:#ff9900">What percentage of examples are neither melanoma nor seborrheic keratosis?</h4>

In [None]:
for name_df, df in all_dfs_raw.items():
    healthy = 1 - (sum(df['seborrheic_keratosis'])/df.shape[0] +  sum(df['melanoma'])/df.shape[0])
    print(f"{name_df}:\n{healthy*100} %\n")

train:
68.6 %

validate:
52.0 %

test:
65.5 %
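
<p>The three percentages above can be cross-checked from raw counts. The sketch below is dependency-free and uses a tiny invented table (the <code>TOY_</code> rows are not real ISIC data) with the same three columns as the ground-truth files. It also makes one assumption explicit: computing the "neither" share as a remainder is only valid when no image carries both labels.</p>

```python
import csv
import io

# Toy stand-in for a ground-truth CSV; the rows are invented for illustration.
toy_csv = """image_id,melanoma,seborrheic_keratosis
TOY_0000001,0.0,0.0
TOY_0000002,1.0,0.0
TOY_0000003,0.0,1.0
TOY_0000004,0.0,0.0
TOY_0000005,0.0,0.0
"""

rows = list(csv.DictReader(io.StringIO(toy_csv)))
n_rows = len(rows)

is_melanoma = [float(row["melanoma"]) == 1.0 for row in rows]
is_keratosis = [float(row["seborrheic_keratosis"]) == 1.0 for row in rows]

n_melanoma = sum(is_melanoma)
n_keratosis = sum(is_keratosis)
n_both = sum(m and k for m, k in zip(is_melanoma, is_keratosis))

# The remainder below double-subtracts any image that carries both labels.
assert n_both == 0, "labels overlap; the remainder would be wrong"
n_neither = n_rows - n_melanoma - n_keratosis

shares = {
    "melanoma": round(100 * n_melanoma / n_rows, 1),
    "seborrheic_keratosis": round(100 * n_keratosis / n_rows, 1),
    "neither": round(100 * n_neither / n_rows, 1),
}
print(shares)
```

<p>Rounding the shares before printing also avoids floating-point noise in the displayed percentages.</p>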



<h3 style="color:green">Merge all the datasets into one</h3>

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
all_data = pd.concat(
    [all_dfs_raw["train"], all_dfs_raw["validate"], all_dfs_raw["test"]],
    ignore_index=True,
)

# Make sure index is from 0 to N - 1
all_data.reset_index(drop=True, inplace=True)
all_data.head()

       image_id  melanoma  seborrheic_keratosis
0  ISIC_0000000       0.0                   0.0
1  ISIC_0000001       0.0                   0.0
2  ISIC_0000002       1.0                   0.0
3  ISIC_0000003       0.0                   0.0
4  ISIC_0000004       1.0                   0.0


<h4 style="color:#ff9900">Are all image IDs unique?</h4>

In [None]:
unique_ids_count = len(pd.unique(all_data["image_id"]))
unique_ids_count == all_data.shape[0]

True
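
<p>Had the check above returned <code>False</code>, the next question would be which IDs repeat. A minimal sketch with invented IDs, using only the standard library:</p>

```python
from collections import Counter

# Invented IDs for illustration; in the notebook these would come from all_data["image_id"].
image_ids = ["TOY_0000001", "TOY_0000002", "TOY_0000001", "TOY_0000003"]

# Keep only the IDs that occur more than once, together with their counts.
duplicates = {image_id: count for image_id, count in Counter(image_ids).items() if count > 1}
print(duplicates)
```

<p>With pandas, <code>all_data["image_id"].duplicated(keep=False)</code> gives a boolean mask that marks the same rows.</p>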

<h4 style="color:#ff9900">Does the index correspond to our expectation?</h4>

In [None]:
current_indices = all_data.index.to_list()
expected_indices = list(range(all_data.shape[0]))
current_indices == expected_indices

True

<h3 style="color:green">Add meta data</h3>

In [None]:
all_data = getImageMetaData(all_data)

In [None]:
# Check against missing values
all_data.isnull().sum()

image_id                0
melanoma                0
seborrheic_keratosis    0
db_id                   0
size_x                  0
size_y                  0
dtype: int64

<h3 style="color:green">Filter out too large images</h3>

In [None]:
mask_x = all_data["size_x"] <= 3200
mask_y = all_data["size_y"] <= 2100
all_data = all_data[mask_x & mask_y]

<h3 style="color:green">Sample from all data</h3>

In [None]:
sampled_data = sampleFromAllData(all_data, 20, 0.2)

<h3 style="color:green">Build the model input csv</h3>

In [None]:
buildClassifierInput(sampled_data, chunk_size = 100, temp_img_fold = "imageData/")

Data for image ISIC_0013525 were successfuly downloaded.
Data for image ISIC_0010231 were successfuly downloaded.
Data for image ISIC_0000531 were successfuly downloaded.
Data for image ISIC_0000552 were successfuly downloaded.
Data for image ISIC_0012962 were successfuly downloaded.
Data for image ISIC_0013725 were successfuly downloaded.
Data for image ISIC_0012432 were successfuly downloaded.
Data for image ISIC_0012833 were successfuly downloaded.
Data for image ISIC_0012250 were successfuly downloaded.
Data for image ISIC_0011168 were successfuly downloaded.
Data for image ISIC_0013220 were successfuly downloaded.
  z_score = (value - mean)/sd
Data for image ISIC_0000352 were successfuly downloaded.
Data for image ISIC_0016046 were successfuly downloaded.
Data for image ISIC_0013248 were successfuly downloaded.
Data for image ISIC_0000341 were successfuly downloaded.
Data for image ISIC_0012126 were successfuly downloaded.
Data for image ISIC_0011317 were successfuly downloaded.
D
