# **Data Collection Notebook - Benign and Malignant classification of Skin Lesions**

## Objectives

* Fetch the dataset from Kaggle and prepare it for further processing.

## Inputs

* https://www.kaggle.com/datasets/sallyibrahim/skin-cancer-isic-2019-2020-malignant-or-benign

## Outputs

* Generate Dataset:
    * input/  
      * Benign_Malignant_Dataset/MainData  

* Final Output:
  * input/Benign_Malignant_Dataset/MainData/  
    * test  
      * benign  
      * malignant  
    * train  
      * benign  
      * malignant   
    * validation  
      * benign  
      * malignant   

## Benign and Malignant Skin Lesion Dataset

This dataset combines images from ISIC 2019 and ISIC 2020. It comprises approximately 4500 images split between the train, validation, and test sets. It was assembled to train a CNN on diverse data while staying small enough for quick and easy feedback.

This dataset splits the images into two classes: benign and malignant.

BCN_20000 Dataset: (c) Department of Dermatology, Hospital Clínic de Barcelona

HAM10000 Dataset: (c) by ViDIR Group, Department of Dermatology, Medical University of Vienna; https://doi.org/10.1038/sdata.2018.161

MSK Dataset: (c) Anonymous; https://arxiv.org/abs/1710.05006; https://arxiv.org/abs/1902.03368

International Skin Imaging Collaboration. SIIM-ISIC 2020 Challenge Dataset. International Skin Imaging Collaboration https://doi.org/10.34970/2020-ds01 (2020).

Creative Commons Attribution-NonCommercial 4.0 International License.

The dataset was generated by the International Skin Imaging Collaboration (ISIC) and images are from the following sources: Hospital Clínic de Barcelona, Medical University of Vienna, Memorial Sloan Kettering Cancer Center, Melanoma Institute Australia, The University of Queensland, and the University of Athens Medical School. 


---

# Import packages

In [1]:
%pip install -r ../requirements.txt

Defaulting to user installation because normal site-packages is not writeable
Collecting pandas==1.3.4 (from -r ../requirements.txt (line 3))
  Using cached pandas-1.3.4.tar.gz (4.7 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting numpy==1.21.2 (from -r ../requirements.txt (line 4))
  Downloading numpy-1.21.2-cp39-cp39-macosx_11_0_arm64.whl.metadata (2.1 kB)
Collecting seaborn==0.11.2 (from -r ../requirements.txt (line 5))
  Downloading seaborn-0.11.2-py3-none-any.whl.metadata (2.3 kB)
Collecting matplotlib==3.4.3 (from -r ../requirements.txt (line 6))
  Downloading matplotlib-3.4.3.tar.gz (37.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 37.9/37.9 MB 25.4 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting scikit-learn==0.24.2 (from -r ../requirements.t

# Change working directory

* The Jupyter notebooks live in a subfolder, so the code must run from the parent folder for relative paths such as input/ to resolve correctly.

First, check the current working directory.
* os.getcwd() returns the current directory

In [2]:
import os
current_dir = os.getcwd()
current_dir

'/Users/carolina/Documents/CodeInstitute/benign-malignant-classification/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [4]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [5]:
current_dir = os.getcwd()
current_dir

'/Users/carolina/Documents/CodeInstitute/benign-malignant-classification'
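
Re-running the cells above in the same kernel session can move the working directory up one more level each time. A minimal sketch of a repeatable variant follows. It assumes the notebooks folder is named `jupyter_notebooks`, as in the path printed earlier. The helper name `ensure_project_root` is hypothetical and not part of the original notebook.

```python
import os
from pathlib import Path


def ensure_project_root(notebook_dir_name="jupyter_notebooks"):
    """Move to the parent folder only when the cwd is the notebooks folder.

    `notebook_dir_name` is an assumption taken from the path shown above.
    """
    cwd = Path.cwd()
    if cwd.name == notebook_dir_name:
        # Same effect as os.chdir(os.path.dirname(os.getcwd()))
        os.chdir(cwd.parent)
    return Path.cwd()
```

Calling `ensure_project_root()` a second time leaves the directory unchanged, because the current folder is no longer the notebooks folder.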

## Setup Kaggle

Install Kaggle

In [6]:
%pip install kaggle==1.5.12

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


Setup Kaggle details

In [39]:
# Kaggle json file and directory setup
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
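
The cell above relies on a shell command, which is not available on every platform. The same setup can be sketched in plain Python, assuming `kaggle.json` has already been placed in the chosen folder. The function name `configure_kaggle` is hypothetical.

```python
import os
import stat
from pathlib import Path


def configure_kaggle(config_dir):
    """Point the Kaggle CLI at config_dir and restrict kaggle.json to its owner."""
    token = Path(config_dir) / "kaggle.json"
    if not token.is_file():
        raise FileNotFoundError(f"Expected an API token at {token}")
    os.environ["KAGGLE_CONFIG_DIR"] = str(config_dir)
    # Equivalent to `chmod 600 kaggle.json`: read/write for the owner only.
    os.chmod(token, stat.S_IRUSR | stat.S_IWUSR)
    return token
```

With the working directory set as above, `configure_kaggle(os.getcwd())` would mirror the original cell and fail early if the token file is missing.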

In [54]:
KaggleDatasetPath = "sallyibrahim/skin-cancer-isic-2019-2020-malignant-or-benign"
DestinationFolder = "input/"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}



Downloading skin-cancer-isic-2019-2020-malignant-or-benign.zip to input
100%|███████████████████████████████████████▊| 402M/403M [00:15<00:00, 29.8MB/s]
100%|████████████████████████████████████████| 403M/403M [00:15<00:00, 27.9MB/s]


Unzip the downloaded file and delete the zip file

In [55]:
import zipfile
with zipfile.ZipFile(DestinationFolder + 'skin-cancer-isic-2019-2020-malignant-or-benign.zip' , 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + 'skin-cancer-isic-2019-2020-malignant-or-benign.zip')
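
The unzip-and-delete step can also be wrapped in a small reusable function. This is a sketch only, built on the standard-library `zipfile` and `pathlib` modules. The name `extract_and_remove` is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_and_remove(zip_path, destination):
    """Extract zip_path into destination, then delete the archive."""
    zip_path = Path(zip_path)
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(destination)
    # Reached only if extraction raised no error, so a failed unzip keeps the archive.
    zip_path.unlink()
    return Path(destination)
```

Building the path with `Path(DestinationFolder) / "skin-cancer-isic-2019-2020-malignant-or-benign.zip"` avoids depending on the trailing slash in `DestinationFolder`.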

This dataset contains 11400 images. The training set has 5200 benign and 4000 malignant images, the validation set has 550 benign and 550 malignant images, and the test set has 550 benign and 550 malignant images.
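
The counts can be checked against the extracted files. A minimal sketch, assuming the train/validation/test and benign/malignant folder layout listed in the Outputs section. The function name `count_images` is hypothetical, and it counts every file in each folder rather than checking image formats.

```python
from pathlib import Path


def count_images(root, splits=("train", "validation", "test"),
                 labels=("benign", "malignant")):
    """Return {(split, label): number of files} for the expected folder layout."""
    root = Path(root)
    counts = {}
    for split in splits:
        for label in labels:
            folder = root / split / label
            # A missing folder counts as zero rather than raising an error.
            counts[(split, label)] = (
                sum(1 for p in folder.iterdir() if p.is_file())
                if folder.is_dir() else 0
            )
    return counts
```

For example, `counts = count_images("input/Benign_Malignant_Dataset/MainData")` followed by `sum(counts.values())` gives the total to compare with the figures stated above.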

---