This notebook contains code for working with the HAM10000 dataset. It will later be split into individual articles.

# HAM10000 Source Loading

First, download the data from the [Harvard Dataverse](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/DBW86T) - it will ask you to confirm the data license. The download is roughly 3GB, so it might take a while.

In this section, we assume the downloaded zip file is named `HAM10000.zip` and sits in `data/source`. This section unzips the files and organizes the data. Additionally, since the training images are split across two folders, we merge them into one folder.

The file structure for the data assumed for the project is:

```
HAM10000
└───data
    ├── README.md
    ├── source
    │   ├── HAM10000.zip
    │   ├── HAM10000_images_part_1.zip
    │   ├── HAM10000_images_part_2.zip
    │   ├── HAM10000_metadata
    │   ├── HAM10000_segmentations_lesion_tschandl.zip
    │   ├── ISIC2018_Task3_Test_Images.zip
    │   └── ISIC2018_Task3_Test_NatureMedicine_AI_Interaction_Benefit.tab
    └── original
        ├── HAM10000_images_part_1
        │   ├── ISIC_XXXXXXX.jpg
        │   └── ...
        ├── HAM10000_images_part_2
        │   ├── ISIC_XXXXXXX.jpg
        │   └── ...
        ├── HAM10000_images_all
        │   ├── ISIC_XXXXXXX.jpg
        │   └── ...
        ├── HAM10000_metadata
        ├── HAM10000_segmentations_lesion_tschandl
        │   ├── ISIC_XXXXXXX_segmentation.png
        │   └── ISIC_0024307_segmentation.png
        └── ISIC2018_Task3_Test_Images
            ├── ISIC_XXXXXXX.jpg
            └── ...
    ...
```
All the data lives in the `data` folder. The `data/source` folder holds the raw zip file we downloaded, `HAM10000.zip`, along with the six items extracted from it. The `data/original` folder holds the relevant files, extracted and ready for use. The original data source splits the training images into two folders, `HAM10000_images_part_1` and `HAM10000_images_part_2`. For convenience, we combine them into one folder called `HAM10000_images_all`. Additionally, `HAM10000_metadata` has information on the images (such as demographics and labels), `HAM10000_segmentations_lesion_tschandl` has the segmentations, and `ISIC2018_Task3_Test_Images` has the test images.
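
As a quick sanity check on the layout described above, the sketch below lists which of the expected entries are missing from `data/original`. The helper name `missing_original_entries` and the constant `EXPECTED_ORIGINAL_ENTRIES` are hypothetical and not part of the notebook; the entry names are copied from the tree shown earlier.

```python
import os

# Entry names copied from the data/original layout shown above.
EXPECTED_ORIGINAL_ENTRIES = [
    "HAM10000_images_part_1",
    "HAM10000_images_part_2",
    "HAM10000_images_all",
    "HAM10000_metadata",
    "HAM10000_segmentations_lesion_tschandl",
    "ISIC2018_Task3_Test_Images",
]


def missing_original_entries(base_dir: str) -> list:
    """Return the expected entries that are absent from <base_dir>/data/original."""
    original_dir = os.path.join(base_dir, "data", "original")
    return [
        name
        for name in EXPECTED_ORIGINAL_ENTRIES
        if not os.path.exists(os.path.join(original_dir, name))
    ]
```

An empty list means every expected entry is in place. A non-empty list names what still needs to be extracted, copied, or merged.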

Any cached or custom versions of the data will live in their own folder inside `HAM10000/data`. For example, if I reduce the size of every image, I might store the results in `HAM10000/data/reduced_images`.
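
One way to keep that convention consistent is a small helper that builds (and creates, if needed) the folder for a derived dataset. This is only a sketch; the name `derived_data_dir` is hypothetical and does not appear elsewhere in the notebook.

```python
import os


def derived_data_dir(base_dir: str, name: str) -> str:
    """Create (if needed) and return the folder <base_dir>/data/<name>."""
    path = os.path.join(base_dir, "data", name)
    os.makedirs(path, exist_ok=True)  # no error if the folder already exists
    return path

# Example (not run here): derived_data_dir(base_dir, "reduced_images")
```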

In [33]:
# libraries
import os
import zipfile
import shutil

In [10]:
# Get the current working directory
cwd = os.getcwd()

# Print the current working directory
print(cwd)

/Users/nabibahmed/Desktop/Local/Brand/HAM10000/articles


In [11]:
# set the base directory to the root of the HAM10000 project repo (the parent of the cwd printed above)
base_dir = '/Users/nabibahmed/Desktop/Local/Brand/HAM10000'
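
The path above is specific to one machine. As a sketch of a more portable alternative, the hypothetical helper `find_project_root` below walks upward from a starting folder until it finds one that contains a `data` folder. Treating `data` as the marker for the project root is an assumption based on the layout described earlier.

```python
from pathlib import Path


def find_project_root(start: str, marker: str = "data") -> str:
    """Walk upward from `start` and return the first folder containing `marker`."""
    start_path = Path(start).resolve()
    # Check the starting folder first, then each parent in turn.
    for candidate in [start_path, *start_path.parents]:
        if (candidate / marker).is_dir():
            return str(candidate)
    raise FileNotFoundError(
        f"No folder containing '{marker}' found at or above {start_path}"
    )

# Example (not run here): base_dir = find_project_root(os.getcwd())
```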

In [27]:
def extract_zips(base_dir: str = base_dir) -> None:
    """
    Extract the zip files in data/source into data/original
    """
    # source directory for the zip files
    source_dir = os.path.join(base_dir, "data", "source")
    
    # path to the zip files
    zipfiles = dict(
        HAM10000_images_part_1 = os.path.join(source_dir, "HAM10000_images_part_1.zip"),
        HAM10000_images_part_2 = os.path.join(source_dir, "HAM10000_images_part_2.zip"),
        HAM10000_segmentations_lesion_tschandl = os.path.join(source_dir, "HAM10000_segmentations_lesion_tschandl.zip"),
        ISIC2018_Task3_Test_Images = os.path.join(source_dir, "ISIC2018_Task3_Test_Images.zip"),
    )
    
    # destination folder
    destination_dir = os.path.join(base_dir, "data", "original")
    
    # check if destination folder exists
    if os.path.exists(destination_dir):
        print(f"Destination directory exists: {destination_dir} \nAssumes files have been extracted!")
        return None
    
    # unzips and places in destination
    for name, path in zipfiles.items():
        with zipfile.ZipFile(path) as zf:
            zf.extractall(os.path.join(destination_dir, name))
    
    print(f"Destination directory created: {destination_dir} \nFiles have been extracted!")

extract_zips()

Destination directory exists: /Users/nabibahmed/Desktop/Local/Brand/HAM10000/data/original 
Assumes files have been extracted!
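
Note that `extract_zips` skips everything as soon as `data/original` exists, even if only some archives were extracted. A possible refinement, sketched below, checks each archive's target folder separately. The function name `extract_missing` and its arguments are hypothetical; it expects a dictionary shaped like `zipfiles` above (folder name mapped to zip path).

```python
import os
import zipfile


def extract_missing(zip_paths: dict, destination_dir: str) -> list:
    """Extract each archive whose target folder does not exist yet.

    `zip_paths` maps a folder name to the path of its zip file.
    Returns the names that were extracted on this call.
    """
    extracted = []
    for name, path in zip_paths.items():
        target = os.path.join(destination_dir, name)
        if os.path.exists(target):
            continue  # target folder already present, so skip this archive
        with zipfile.ZipFile(path) as zf:
            zf.extractall(target)
        extracted.append(name)
    return extracted
```

This still treats an existing folder as fully extracted, so an interrupted run can leave a partial folder that later calls skip. Deleting that folder and calling the function again would re-extract it.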


In [38]:
def merge_HAM10000_images(base_dir: str = base_dir) -> None:
    """
    Combines the HAM10000 part 1 and part 2 folders into one
    """
    # original directory where the part 1 and part 2 folders exist
    original_dir = os.path.join(base_dir, "data", "original")
    
    # part 1 and part 2 folders
    part_1 = os.path.join(original_dir, "HAM10000_images_part_1")
    part_2 = os.path.join(original_dir, "HAM10000_images_part_2")
    
    # destination folder
    destination_dir = os.path.join(base_dir, "data", "original", "HAM10000_images_all")
    
    # check if destination folder exists
    if os.path.exists(destination_dir):
        print(f"Destination directory exists: {destination_dir} \nAssumes parts have been merged!")
        return None

    # creates the destination folder
    os.mkdir(destination_dir)
    
    # copying over the files
    for p in [part_1, part_2]:
        files_to_copy = os.listdir(p)

        for file in files_to_copy:
            shutil.copy(os.path.join(p, file), os.path.join(destination_dir, file))
    
    # print confirmation
    print(f"Created destination directory: {destination_dir}")
    print(f"The total number of images in the directory is: {len(os.listdir(destination_dir))}")

merge_HAM10000_images()

Created destination directory: /Users/nabibahmed/Desktop/Local/Brand/HAM10000/data/original/HAM10000_images_all
The total number of images in the directory is: 10015
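
Because `shutil.copy` overwrites an existing destination file without warning, a file name shared by both part folders would silently drop one image during the merge. The sketch below is a check that could be run before merging; the helper name `find_name_collisions` is hypothetical.

```python
import os


def find_name_collisions(dir_a: str, dir_b: str) -> list:
    """Return the sorted file names that appear in both folders."""
    return sorted(set(os.listdir(dir_a)) & set(os.listdir(dir_b)))

# Example (not run here), using the part folders from merge_HAM10000_images:
# collisions = find_name_collisions(part_1, part_2)
```

An empty result means the merged folder should contain exactly as many files as the two parts combined.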
