# **Data Collection Notebook - Benign and Malignant classification of Skin Lesions**

## Objectives

* Fetch data from Kaggle and prepare it for further processing.

## Inputs

* https://www.kaggle.com/datasets/bryanqtnguyen/benign-and-malignant-skin-lesion-dataset/data
* Kaggle JSON file - the authentication token.

## Outputs

* Generate Dataset:
    * input/  
      * Benign_Malignant_Dataset/MainData  

* Final Output:
  * input/Benign_Malignant_Dataset/MainData/  
    * test  
      * benign  
      * malignant  
    * train  
      * benign  
      * malignant   
    * validation  
      * benign  
      * malignant   

## Benign and Malignant Skin Lesion Dataset

This dataset combines images from ISIC 2019 and ISIC 2020. It comprises approximately 4500 images split between the train, validation, and test sets. It was built for training a CNN on a diverse dataset while staying small enough for quick and easy feedback.

This dataset splits the images into two classes: benign and malignant.

BCN_20000 Dataset: (c) Department of Dermatology, Hospital Clínic de Barcelona

HAM10000 Dataset: (c) by ViDIR Group, Department of Dermatology, Medical University of Vienna; https://doi.org/10.1038/sdata.2018.161

MSK Dataset: (c) Anonymous; https://arxiv.org/abs/1710.05006; https://arxiv.org/abs/1902.03368

International Skin Imaging Collaboration. SIIM-ISIC 2020 Challenge Dataset. International Skin Imaging Collaboration https://doi.org/10.34970/2020-ds01 (2020).

License: Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).

The dataset was generated by the International Skin Imaging Collaboration (ISIC) and images are from the following sources: Hospital Clínic de Barcelona, Medical University of Vienna, Memorial Sloan Kettering Cancer Center, Melanoma Institute Australia, The University of Queensland, and the University of Athens Medical School. 


---

# Import packages

In [1]:
%pip install -r ../requirements.txt

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


# Change working directory

* The Jupyter notebooks are stored in a subfolder, so the working directory must be changed from that folder to its parent before the code runs.
* We access the current directory with os.getcwd()

In [40]:
import os
current_dir = os.getcwd()
current_dir

'/Users/carolina/Documents/CodeInstitute'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [36]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [37]:
current_dir = os.getcwd()
current_dir

'/Users/carolina/Documents/CodeInstitute'
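Running the cells above more than once moves the working directory up one further level each time, which is easy to do by accident in a notebook. A minimal sketch of a guard against that, assuming the notebooks live in a folder named `jupyter_notebooks` (a hypothetical name that is not stated in this notebook, so adjust it to your repository):

```python
import os


def move_to_project_root(notebook_dirname="jupyter_notebooks"):
    """Move to the parent folder only while still inside the notebooks folder.

    `notebook_dirname` is an assumed folder name, not taken from this notebook.
    """
    cwd = os.getcwd()
    if os.path.basename(cwd) == notebook_dirname:
        # Still in the notebooks subfolder: step up exactly one level
        os.chdir(os.path.dirname(cwd))
    # Otherwise leave the working directory untouched
    return os.getcwd()
```

Calling `move_to_project_root()` a second time leaves the directory unchanged, so the cell is safe to re-run.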

## Set up Kaggle

Install Kaggle

In [5]:
%pip install kaggle==1.5.12

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


Set up Kaggle details

In [6]:
# Kaggle json file and directory setup
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
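The download step fails with an authentication error if kaggle.json is missing from the configured directory or is incomplete. The sketch below checks the token file before any download is attempted. It assumes the standard token layout with `username` and `key` fields; `check_kaggle_token` is a hypothetical helper, not part of the Kaggle package.

```python
import json
import os


def check_kaggle_token(config_dir):
    """Return the username stored in kaggle.json, raising if the token is unusable."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"No kaggle.json found in {config_dir}")
    with open(token_path) as f:
        token = json.load(f)
    # A Kaggle API token is expected to carry both of these fields
    missing = {"username", "key"} - set(token)
    if missing:
        raise ValueError(f"kaggle.json is missing fields: {sorted(missing)}")
    return token["username"]
```

In this notebook it could be called as `check_kaggle_token(os.environ['KAGGLE_CONFIG_DIR'])` right after the setup cell.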

In [23]:
# Dataset identifier in the form <owner>/<dataset-slug>
KaggleDatasetPath = "bryanqtnguyen/benign-and-malignant-skin-lesion-dataset"
DestinationFolder = "input/"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading benign-and-malignant-skin-lesion-dataset.zip to input
100%|███████████████████████████████████████▉| 834M/836M [00:30<00:00, 30.4MB/s]
100%|████████████████████████████████████████| 836M/836M [00:30<00:00, 28.6MB/s]


Unzip the downloaded file and delete the zip file

In [24]:
import zipfile

# Build the archive path once so extraction and cleanup refer to the same file
zip_path = os.path.join(DestinationFolder, 'benign-and-malignant-skin-lesion-dataset.zip')
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(zip_path)
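Because the archive is deleted straight after extraction, a corrupted download would only surface later as missing or unreadable images. One possible safeguard, sketched below, is to run the standard library's `ZipFile.testzip()` integrity check first and delete the archive only once extraction has succeeded. `extract_and_remove` is a hypothetical helper name.

```python
import os
import zipfile


def extract_and_remove(zip_path, dest_dir):
    """Verify, extract and then delete a zip archive; return the member count."""
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        # testzip() returns the name of the first corrupt member, or None
        bad_member = zip_ref.testzip()
        if bad_member is not None:
            raise zipfile.BadZipFile(f"Corrupt member in archive: {bad_member}")
        names = zip_ref.namelist()
        zip_ref.extractall(dest_dir)
    # Only reached when the check and the extraction both succeeded
    os.remove(zip_path)
    return len(names)
```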

The dataset comes pre-sorted into test, train and validation folders, each containing only image files.
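To confirm the folder layout and see how the images are distributed, the files in each split and class can be counted. This is a minimal sketch that assumes the two-level layout listed under Outputs (split folder, then class folder) and a small set of common image extensions. `count_images` is a hypothetical helper.

```python
import os


def count_images(root, extensions=(".jpg", ".jpeg", ".png")):
    """Count image files per (split, label) folder pair under `root`."""
    counts = {}
    for split in sorted(os.listdir(root)):
        split_dir = os.path.join(root, split)
        if not os.path.isdir(split_dir):
            continue  # skip stray files such as .DS_Store
        for label in sorted(os.listdir(split_dir)):
            label_dir = os.path.join(split_dir, label)
            if not os.path.isdir(label_dir):
                continue
            counts[(split, label)] = sum(
                1 for name in os.listdir(label_dir)
                if name.lower().endswith(extensions)
            )
    return counts

# Example call, assuming the path listed under Outputs:
# count_images("input/Benign_Malignant_Dataset/MainData")
```

The returned dictionary is keyed by pairs such as `("train", "benign")`, which makes class imbalance between splits easy to spot before training.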

---

# Push files to Repo

---

* git add .

In [51]:
!git add .

* git commit

In [52]:
!git commit -m "Added Kaggle dataset and extracted it"

[main fdfbb0d] Added Kaggle dataset and extracted it
 4 files changed, 99 insertions(+), 156 deletions(-)
 create mode 100644 .DS_Store
 create mode 100644 input/.DS_Store
 create mode 100644 input/Benign_Malignant_DataSet/.DS_Store


* git push

In [57]:
!git push

Everything up-to-date
