# **Data Collection Notebook - Benign and Malignant Classification of Skin Lesions**

## Objectives

* Fetch the dataset from Kaggle and prepare it for further processing.

## Inputs

* https://www.kaggle.com/datasets/bryanqtnguyen/benign-and-malignant-skin-lesion-dataset/data
* Kaggle JSON file - the authentication token.

## Outputs

* Generate Dataset:
  * input/  
    * Benign_Malignant_Dataset/ (image files)  

* Final Output:
  * input/  
    * test  
      * benign  
      * malignant  
    * train  
      * benign  
      * malignant   
    * validation  
      * benign  
      * malignant   

## Benign and Malignant Skin Lesion Dataset

This dataset combines images from ISIC 2019 and ISIC 2020. It comprises approximately 4,500 images split between the train, validation, and test sets. It was made for training a CNN on a diverse dataset while staying small enough for quick and easy feedback.

This dataset splits the images into two classes: benign and malignant.

BCN_20000 Dataset: (c) Department of Dermatology, Hospital Clínic de Barcelona

HAM10000 Dataset: (c) by ViDIR Group, Department of Dermatology, Medical University of Vienna; https://doi.org/10.1038/sdata.2018.161

MSK Dataset: (c) Anonymous; https://arxiv.org/abs/1710.05006; https://arxiv.org/abs/1902.03368

International Skin Imaging Collaboration. SIIM-ISIC 2020 Challenge Dataset. International Skin Imaging Collaboration https://doi.org/10.34970/2020-ds01 (2020).

Creative Commons Attribution-NonCommercial 4.0 International License.

The dataset was generated by the International Skin Imaging Collaboration (ISIC) and images are from the following sources: Hospital Clínic de Barcelona, Medical University of Vienna, Memorial Sloan Kettering Cancer Center, Melanoma Institute Australia, The University of Queensland, and the University of Athens Medical School. 


---

# Import packages

In [1]:
%pip install -r ../requirements.txt

Defaulting to user installation because normal site-packages is not writeable
Collecting streamlit==0.85.0 (from -r ../requirements.txt (line 2))
  Downloading streamlit-0.85.0-py2.py3-none-any.whl.metadata (1.1 kB)
Collecting altair<5 (from -r ../requirements.txt (line 3))
  Downloading altair-4.2.2-py3-none-any.whl.metadata (13 kB)
Collecting astor (from streamlit==0.85.0->-r ../requirements.txt (line 2))
  Downloading astor-0.8.1-py2.py3-none-any.whl.metadata (4.2 kB)
Collecting attrs (from streamlit==0.85.0->-r ../requirements.txt (line 2))
  Using cached attrs-23.2.0-py3-none-any.whl.metadata (9.5 kB)
Collecting base58 (from streamlit==0.85.0->-r ../requirements.txt (line 2))
  Downloading base58-2.1.1-py3-none-any.whl.metadata (3.1 kB)
Collecting blinker (from streamlit==0.85.0->-r ../requirements.txt (line 2))
  Using cached blinker-1.7.0-py3-none-any.whl.metadata (1.9 kB)
Collecting cachetools>=4.0 (from streamlit==0.85.0->-r ../requirements.txt (line 2))
  Using cached cacheto

# Change working directory

The Jupyter notebooks are stored in a subfolder, so we need to change the working directory from its current folder to its parent folder before running the code.
* We access the current directory with os.getcwd()

In [2]:
import os
current_dir = os.getcwd()
current_dir

'/Users/carolina/Documents/CodeInstitute/breast-cancer-prediction/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'/Users/carolina/Documents/CodeInstitute/breast-cancer-prediction'
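
Running the directory-change cell a second time moves the working directory up one more level. A minimal sketch of a guarded version is shown here, assuming the notebooks live in a folder named `jupyter_notebooks` (as in the path printed earlier). The helper name `move_to_project_root` is hypothetical and not part of the original notebook.

```python
import os


def move_to_project_root(notebook_folder="jupyter_notebooks"):
    """Move to the parent directory only when inside the notebook folder.

    `notebook_folder` is an assumption based on the path printed earlier.
    Returns the working directory after the (possible) change.
    """
    current_dir = os.getcwd()
    # Only step up when the current folder is the notebook folder,
    # so re-running the cell does not climb past the project root.
    if os.path.basename(current_dir) == notebook_folder:
        os.chdir(os.path.dirname(current_dir))
    return os.getcwd()
```

Calling `move_to_project_root()` repeatedly leaves the working directory at the project root after the first call.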

## Setup Kaggle

Install Kaggle

In [5]:
%pip install kaggle==1.5.12

Defaulting to user installation because normal site-packages is not writeable
Collecting kaggle==1.5.12
  Downloading kaggle-1.5.12.tar.gz (58 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 59.0/59.0 kB 4.1 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting tqdm (from kaggle==1.5.12)
  Downloading tqdm-4.66.2-py3-none-any.whl.metadata (57 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 57.6/57.6 kB 6.9 MB/s eta 0:00:00
Collecting python-slugify (from kaggle==1.5.12)
  Downloading python_slugify-8.0.4-py2.py3-none-any.whl.metadata (8.5 kB)
Collecting text-unidecode>=1.3 (from python-slugify->kaggle==1.5.12)
  Downloading text_unidecode-1.3-py2.py3-none-any.whl.metadata (2.4 kB)
Downloading python_slugify-8.0.4-py2.py3-none-any.whl (10 kB)
Downloading tqdm-4.66.2-py3-none-any.whl (78 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78.3/78.3 kB

Setup Kaggle details

In [6]:
# Point the Kaggle CLI at the current directory, where kaggle.json is stored
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
# Restrict the token file to owner read/write
! chmod 600 kaggle.json
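
If `kaggle.json` is missing or incomplete, the download step fails with an error that can be hard to trace back to the token. The sketch below checks the file before the CLI is called. It assumes the token holds `username` and `key` fields, and the helper name `check_kaggle_token` is hypothetical.

```python
import json
import os
import stat


def check_kaggle_token(config_dir):
    """Validate kaggle.json in `config_dir` and restrict it to owner read/write."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"kaggle.json not found in {config_dir}")

    with open(token_path) as token_file:
        token = json.load(token_file)

    # Assumption: a Kaggle API token contains these two fields
    missing = [field for field in ("username", "key") if field not in token]
    if missing:
        raise ValueError(f"kaggle.json is missing fields: {missing}")

    # Same effect as `chmod 600 kaggle.json` on POSIX systems
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path
```

In this notebook it could be called as `check_kaggle_token(os.getcwd())` once the working directory is the project root.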

In [7]:
# Dataset identifier in the form <owner>/<dataset-name>
KaggleDatasetPath = "bryanqtnguyen/benign-and-malignant-skin-lesion-dataset"
DestinationFolder = "input/"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading benign-and-malignant-skin-lesion-dataset.zip to input
100%|███████████████████████████████████████▉| 835M/836M [00:38<00:00, 27.2MB/s]
100%|████████████████████████████████████████| 836M/836M [00:38<00:00, 23.0MB/s]


Unzip the downloaded file and delete the zip file

In [9]:
import zipfile

# Build the path once so extraction and deletion refer to the same file
zip_path = os.path.join(DestinationFolder, 'benign-and-malignant-skin-lesion-dataset.zip')
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(zip_path)
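
The Outputs section lists `train`, `validation`, and `test` folders, each with `benign` and `malignant` subfolders. A quick sanity check after extraction might count the files in each one. This is only a sketch: `count_images` is a hypothetical helper, and the root folder passed to it is an assumption, since the extracted archive may nest the splits under another folder.

```python
import os


def count_images(root, splits=("train", "validation", "test"),
                 labels=("benign", "malignant")):
    """Return {(split, label): file_count} for the expected folder layout.

    A value of None marks a folder that does not exist under `root`.
    """
    counts = {}
    for split in splits:
        for label in labels:
            folder = os.path.join(root, split, label)
            if os.path.isdir(folder):
                # Count regular files only, ignoring nested folders
                counts[(split, label)] = sum(
                    1 for name in os.listdir(folder)
                    if os.path.isfile(os.path.join(folder, name))
                )
            else:
                counts[(split, label)] = None
    return counts
```

A call such as `count_images("input")` returns one entry per split and class, which makes a missing folder or an empty class easy to spot before any further processing.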

---

# Push files to Repo

* In case you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create here your folder
  # os.makedirs(name='')
  pass  # placeholder so the try block is valid until a folder name is filled in
except Exception as e:
  print(e)
