# Uploading to Hugging Face datasets

This notebook uploads the HAM10000 dataset to Hugging Face (HF).

The original dataset was downloaded from Kaggle and extracted to the data folder.

More information on the supported formats and on how to upload to HF Datasets can be found [here](https://huggingface.co/docs/datasets/image_dataset#loading-script).

### 1. Import Dependencies

In [4]:
import os
import shutil
import pandas as pd
from sklearn.model_selection import train_test_split
import json
import jsonlines
from datasets import load_dataset



### 2. Read in files and join with Metadata

Download the HAM10000_metadata.csv file (read here from the notebooks folder). It is used to label each image with its diagnosis type.

In [5]:
# change the working directory to the project root
os.chdir('..')

In [6]:
path = 'data'
fullpath = os.path.join(os.getcwd(), path)

# walking through the directory to get the path names
datapath = []
for root, _, files in os.walk(fullpath):
    for file in files:
        datapath.append(os.path.relpath(os.path.join(root, file)))

orig_df = pd.DataFrame(pd.Series(datapath))
orig_df = orig_df.rename(columns={0: 'file_name'})
orig_df['image_id'] = orig_df["file_name"].apply(lambda x: os.path.splitext(os.path.basename(x))[0])


# read the metadata and map the abbreviated diagnosis codes to readable names
meta_df = pd.read_csv(os.path.join(os.getcwd(), 'notebooks/HAM10000_metadata.csv'))
lesion_type_dict = {
    'nv': 'melanocytic_Nevi',
    'mel': 'melanoma',
    'bkl': 'benign_keratosis-like_lesions',
    'bcc': 'basal_cell_carcinoma',
    'akiec': 'actinic_keratoses',
    'vasc': 'vascular_lesions',
    'df': 'dermatofibroma'
}
meta_df['dx'] = meta_df.dx.map(lesion_type_dict)

df = orig_df.merge(meta_df, how='inner', on='image_id')
print(f"Original dataframe shape: {orig_df.shape}")
print(f"Meta dataframe shape: {meta_df.shape}")

Original dataframe shape: (20030, 2)
Meta dataframe shape: (10015, 7)
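
The file listing has exactly twice as many rows as the metadata (20030 versus 10015), which suggests, but does not prove, that the extracted archive holds each image under more than one path. Because the merge is on image_id, any such duplicates carry through to df. The following is a minimal sketch of how this could be checked with the standard library only; the function name and the example paths are hypothetical.

```python
import os

def find_duplicate_image_ids(paths):
    """Group paths by image id (file name without extension) and keep ids seen more than once."""
    by_id = {}
    for path in paths:
        image_id = os.path.splitext(os.path.basename(path))[0]
        by_id.setdefault(image_id, []).append(path)
    return {image_id: found for image_id, found in by_id.items() if len(found) > 1}

# Hypothetical paths for illustration; in the notebook this would be the datapath list.
example_paths = [
    "data/folder_a/ISIC_0000001.jpg",
    "data/folder_b/ISIC_0000001.jpg",
    "data/folder_a/ISIC_0000002.jpg",
]
duplicates = find_duplicate_image_ids(example_paths)
print(duplicates)
```

If duplicates turn up, keeping one path per image_id before the merge would stop the same image from landing in more than one split.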


### 3. Split files by directory

scikit-learn's train_test_split is used to split the dataset into train, validation, and test sets, stratified by diagnosis.

In [7]:
X = df
y = df['dx']

X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, stratify=y)
X_valid, X_test, y_valid, y_test = train_test_split(X_valid, y_valid, train_size=(2/3), stratify=y_valid)
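
The two calls above split the data in stages, so the final proportions are not obvious from the arguments alone. The sketch below works them out with exact fractions; the helper name is hypothetical and the calculation ignores the rounding to whole rows that scikit-learn applies.

```python
from fractions import Fraction

def two_stage_fractions(first_train_size, second_train_size):
    """Return the (train, valid, test) shares of the full dataset after two successive splits."""
    train = first_train_size
    remainder = 1 - first_train_size        # rows handed to the second split
    valid = remainder * second_train_size   # "train" side of the second split
    test = remainder - valid                # what is left over
    return train, valid, test

train_share, valid_share, test_share = two_stage_fractions(Fraction(4, 5), Fraction(2, 3))
print(train_share, valid_share, test_share)  # 4/5 2/15 1/15
```

So roughly 80% of rows go to train, 13.3% to validation, and 6.7% to test. Neither call sets random_state, so the assignment of rows changes on every run; passing a fixed random_state to train_test_split would make the split reproducible.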

### 4. Move files to correct folders

After the images are moved, each split folder also needs a metadata.jsonl file that links every image to its metadata.

Note that this is not a standard JSON file: each line is a separate JSON object. The jsonlines package is used to write the records into metadata.jsonl.
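
To make the format concrete, here is a minimal sketch that writes and reads a JSON Lines file using only the standard library (the notebook itself uses the jsonlines package). The records and helper names are hypothetical.

```python
import json
import os
import tempfile

# Hypothetical records; the real ones come from the merged dataframe.
records = [
    {"file_name": "melanoma/ISIC_0000001.jpg", "dx": "melanoma"},
    {"file_name": "dermatofibroma/ISIC_0000002.jpg", "dx": "dermatofibroma"},
]

def write_jsonl(path, rows):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row) + "\n")

def read_jsonl(path):
    """Parse each non-empty line as its own JSON document."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "metadata.jsonl")
    write_jsonl(target, records)
    with open(target, encoding="utf-8") as fh:
        raw_lines = fh.read().splitlines()
    round_trip = read_jsonl(target)

print(raw_lines[0])
```

Each line parses on its own, but the file as a whole is not valid JSON, so json.load on the entire file would fail.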

In [8]:
datasets = {'train': X_train, 'valid': X_valid, 'test': X_test}

# creating the destination path data/<split>/<dx>/<image file> for every image
for k, v in datasets.items():
    v['base_path'] = v['file_name'].apply(os.path.basename)
    v['move_path'] = 'data' + os.path.sep + k + os.path.sep + v['dx'] + os.path.sep + v['base_path']
    # 'base_path' stays on the dataframe here; it is dropped when metadata.jsonl is written

In [9]:
# creating the split folders (train, valid, test) with one subfolder per diagnosis
for k in datasets.keys():
    for col in df['dx'].unique():
        os.makedirs(os.path.join(os.getcwd(), 'data', k, col), exist_ok=True)

# moving the files to their correct destination
for k, v in datasets.items():
    for i, row in v.iterrows():
        filename = os.path.join(os.getcwd(), row['file_name'])
        movepath = os.path.join(os.getcwd(), row['move_path'])
        shutil.move(filename, movepath)
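
The move step above assumes the destination folders already exist and that the cell runs only once. As a hedged alternative, the sketch below creates the class folder on demand and skips files that are already in place, so the step can be re-run. The function name and the placeholder file are hypothetical.

```python
import os
import shutil
import tempfile

def move_into_class_folder(src, data_root, split, label):
    """Move src to <data_root>/<split>/<label>/<file name> and return the destination path."""
    dest_dir = os.path.join(data_root, split, label)
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, os.path.basename(src))
    if os.path.exists(dest):
        # already moved on an earlier run: leave both files untouched
        return dest
    shutil.move(src, dest)
    return dest

with tempfile.TemporaryDirectory() as root:
    src = os.path.join(root, "ISIC_0000001.jpg")
    with open(src, "wb") as fh:
        fh.write(b"placeholder")  # stand-in for real image bytes
    dest = move_into_class_folder(src, root, "train", "melanoma")
    moved_ok = os.path.isfile(dest) and not os.path.exists(src)
    relative_dest = os.path.relpath(dest, root)

print(relative_dest)
```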

### 5. Create JSONL files

In [20]:
# creating the jsonlines files
for k, v in datasets.items():
    # keep only "<diagnosis folder>/<image file>" in the "file_name" column,
    # i.e. the path relative to the split's metadata.jsonl
    v['foldername'] = v['move_path'].apply(lambda x: x.split(os.path.sep)[-2])
    v['filename'] = v['move_path'].apply(os.path.basename)
    v['file_name'] = v['foldername'] + os.path.sep + v['filename']
    v = v.drop(columns=['move_path', 'foldername', 'base_path', 'filename'])

    # convert each row to a dictionary and write one JSON object per line
    records = json.loads(v.to_json(orient='records'))
    with jsonlines.open(os.path.join(os.getcwd(), 'data', k, 'metadata.jsonl'), 'w') as writer:
        writer.write_all(records)
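
Before uploading, it can help to confirm that every file_name in a split's metadata.jsonl points at a file that exists next to it, since the paths are resolved relative to the metadata file. The following is a minimal sketch of such a check; the function name and the example entries are hypothetical.

```python
import json
import os
import tempfile

def find_missing_files(split_dir):
    """Return file_name entries in metadata.jsonl that have no matching file on disk."""
    missing = []
    with open(os.path.join(split_dir, "metadata.jsonl"), encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            file_name = json.loads(line)["file_name"]
            # accept either separator, since the notebook joins paths with os.path.sep
            parts = file_name.replace("\\", "/").split("/")
            if not os.path.isfile(os.path.join(split_dir, *parts)):
                missing.append(file_name)
    return missing

with tempfile.TemporaryDirectory() as split_dir:
    os.makedirs(os.path.join(split_dir, "melanoma"))
    with open(os.path.join(split_dir, "melanoma", "ISIC_0000001.jpg"), "wb") as fh:
        fh.write(b"placeholder")  # stand-in for real image bytes
    with open(os.path.join(split_dir, "metadata.jsonl"), "w", encoding="utf-8") as fh:
        fh.write(json.dumps({"file_name": "melanoma/ISIC_0000001.jpg"}) + "\n")
        fh.write(json.dumps({"file_name": "melanoma/ISIC_0000002.jpg"}) + "\n")
    missing = find_missing_files(split_dir)

print(missing)
```

In the notebook this would be called once per split folder, for example on data/train, and an empty list means every entry resolved.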

### 6. Upload to Hugging Face Hub

In [21]:
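# load the local image folders as a dataset and push it to the Hugging Face Hub
# (pushing requires being logged in, e.g. via huggingface-cli login)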

dataset = load_dataset(os.path.join(os.getcwd(), "data"))
dataset.push_to_hub("marmal88/skin_cancer")

Resolving data files: 100%|██████████| 9578/9578 [00:00<00:00, 16333.04it/s]
Resolving data files: 100%|██████████| 1286/1286 [00:00<00:00, 22729.62it/s]
Resolving data files: 100%|██████████| 2493/2493 [00:00<00:00, 11224.38it/s]
Using custom data configuration data-87648cf40e2c2d6c


Downloading and preparing dataset imagefolder/data to /home/oem/.cache/huggingface/datasets/imagefolder/data-87648cf40e2c2d6c/0.0.0/37fbb85cc714a338bea574ac6c7d0b5be5aff46c1862c1989b20e0771199e93f...


Downloading data files: 100%|██████████| 9581/9581 [00:00<00:00, 70904.13it/s]
Downloading data files: 100%|██████████| 1289/1289 [00:00<00:00, 74699.59it/s]
Downloading data files: 100%|██████████| 2496/2496 [00:00<00:00, 75562.68it/s]

Dataset imagefolder downloaded and prepared to /home/oem/.cache/huggingface/datasets/imagefolder/data-87648cf40e2c2d6c/0.0.0/37fbb85cc714a338bea574ac6c7d0b5be5aff46c1862c1989b20e0771199e93f. Subsequent calls will reuse this data.


100%|██████████| 3/3 [00:00<00:00, 16.98it/s]
Pushing split train to the Hub.
100%|██████████| 2/2 [00:02<00:00,  1.28s/ba]
Pushing dataset shards to the dataset hub: 100%|██████████| 5/5 [12:01<00:00, 144.38s/it]
Pushing split test to the Hub.
100%|██████████| 2/2 [00:01<00:00,  1.61ba/s]
Pushing dataset shards to the dataset hub: 100%|██████████| 1/1 [02:08<00:00, 128.22s/it]
Pushing split validation to the Hub.
100%|██████████| 2/2 [00:01<00:00,  1.77ba/s]
Pushing dataset shards to the dataset hub: 100%|██████████| 2/2 [04:12<00:00, 126.23s/i