## Goal

In this notebook, we take the DermaMNIST-C partitions (created by moving all the 'leaking' `lesion ID`s from the validation and testing partitions to the training partition) and improve them further by fixing the leakage that arises from previously undetected duplicates in HAM10000.

To do so, we load the DermaMNIST-C metadata and, for each newly detected duplicate pair in HAM10000, check whether the two images in the pair belong to different partitions. Of the 18 pairs, 5 had one image in training and one in testing, 2 had one image in training and one in validation, and the remaining 11 had both images in training.

To fix this, we move both images of each of the 7 'leaking' pairs to the training partition.
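The per-partition breakdown of duplicate pairs described above can be tallied along the following lines. This is a minimal sketch on toy data, not the analysis that produced the counts reported here: the image IDs are made up, and the partition labels `"test"` and `"val"` are assumptions (only `"train"` is confirmed by the code in this notebook). The column names `image_id`, `split`, `from_img` and `to_img` match those used in the cells below.

```python
from collections import Counter

import pandas as pd

# Toy stand-ins for the DermaMNIST-C metadata and the duplicate-pair list.
# The IDs and the "test"/"val" labels are hypothetical.
metadata = pd.DataFrame({
    "image_id": ["ISIC_1", "ISIC_2", "ISIC_3", "ISIC_4", "ISIC_5", "ISIC_6"],
    "split": ["train", "test", "train", "val", "train", "train"],
})
duplicates = pd.DataFrame({
    "from_img": ["ISIC_1.jpg", "ISIC_3.jpg", "ISIC_5.jpg"],
    "to_img": ["ISIC_2.jpg", "ISIC_4.jpg", "ISIC_6.jpg"],
})

# Lookup table: image ID -> partition.
split_of = metadata.set_index("image_id")["split"]


def strip_extension(names: pd.Series) -> pd.Series:
    """Keep the part of each file name before the first dot."""
    return names.str.split(".").str[0]


splits_a = strip_extension(duplicates["from_img"]).map(split_of)
splits_b = strip_extension(duplicates["to_img"]).map(split_of)

# Sort each pair of labels so that (train, test) and (test, train) count together.
pair_counts = Counter(tuple(sorted(pair)) for pair in zip(splits_a, splits_b))
print(pair_counts)
```

Any key in `pair_counts` whose two labels differ corresponds to a 'leaking' pair.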

In [1]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

from pathlib import Path
from typing import Union

## Read metadata files:
- HAM10000 original metadata file.
- DermaMNIST-C metadata file.
- List of confirmed duplicate pairs in HAM10000.

In [2]:
ham10k_metadata_file: Union[str, Path] = Path("HAM10000_metadata.csv")
dermamnist_c_metadata_file: Union[str, Path] = Path("CSV_files/combined_metadata_corrected.csv")
ham10k_duplicates_file: Union[str, Path] = Path(
    "../HAM10000_DuplicateConfirmation/AnalysesOutputFiles/newly_discovered_duplicates.csv"
)

In [3]:
ham10k_metadata: pd.DataFrame = pd.read_csv(ham10k_metadata_file, header="infer")
dermamnist_c_metadata: pd.DataFrame = pd.read_csv(dermamnist_c_metadata_file, header="infer")
ham10k_duplicates: pd.DataFrame = pd.read_csv(ham10k_duplicates_file, header="infer")

### Assert that the dataframes have the same number of rows.

In [4]:
assert ham10k_metadata.shape[0] == dermamnist_c_metadata.shape[0]
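The row-count assertion confirms that the two tables are the same size, but not that they describe the same images. A stricter check could compare the sets of image IDs, as in the sketch below. The helper `same_image_ids` is hypothetical, and the sketch assumes both tables expose an `image_id` column (this notebook only uses that column on the DermaMNIST-C metadata).

```python
import pandas as pd


def same_image_ids(
    left: pd.DataFrame, right: pd.DataFrame, id_col: str = "image_id"
) -> bool:
    """Return True if both tables contain exactly the same set of IDs."""
    return set(left[id_col]) == set(right[id_col])


# Toy tables with made-up IDs; row order does not affect the comparison.
toy_a = pd.DataFrame({"image_id": ["ISIC_1", "ISIC_2", "ISIC_3"]})
toy_b = pd.DataFrame({"image_id": ["ISIC_3", "ISIC_1", "ISIC_2"]})
toy_c = pd.DataFrame({"image_id": ["ISIC_1", "ISIC_2", "ISIC_9"]})

ids_match = same_image_ids(toy_a, toy_b)
ids_mismatch = same_image_ids(toy_a, toy_c)
```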

### Create a deep copy of the DermaMNIST-C metadata.

In [5]:
dermamnist_c_metadata_new = dermamnist_c_metadata.copy(deep=True)

### Fix the leakage due to HAM10000 duplicates, and save the new metadata to file.

In [6]:
for _, row in ham10k_duplicates.iterrows():
    # Strip the file extension to recover the image IDs of the pair.
    img_id1, img_id2 = (
        row.from_img.split(".")[0], row.to_img.split(".")[0]
    )

    # Look up the current partition of each image.
    split1 = dermamnist_c_metadata_new.loc[
        dermamnist_c_metadata_new["image_id"] == img_id1, "split"
    ].values[0]
    split2 = dermamnist_c_metadata_new.loc[
        dermamnist_c_metadata_new["image_id"] == img_id2, "split"
    ].values[0]

    # A pair that spans two partitions leaks: move both images to training.
    if split1 != split2:
        dermamnist_c_metadata_new.loc[
            dermamnist_c_metadata_new["image_id"] == img_id1, "split"
        ] = "train"
        dermamnist_c_metadata_new.loc[
            dermamnist_c_metadata_new["image_id"] == img_id2, "split"
        ] = "train"
        print("Duplicate pair moved to training.")

Duplicate pair moved to training.
Duplicate pair moved to training.
Duplicate pair moved to training.
Duplicate pair moved to training.
Duplicate pair moved to training.
Duplicate pair moved to training.
Duplicate pair moved to training.


In [7]:
dermamnist_c_metadata_new.to_csv(
    Path("CSV_files/combined_metadata_corrected-HAM10000_corrected.csv"), 
    index=False
)
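As a sanity check on the corrected metadata, one could confirm that no duplicate pair spans two partitions any more. The sketch below illustrates the idea on toy data. The helper `find_leaking_pairs` and the toy IDs are hypothetical, and the `"test"` label is an assumption. The fix applied in the sketch mirrors the loop above: both images of every leaking pair are moved to `"train"`.

```python
import pandas as pd


def find_leaking_pairs(metadata: pd.DataFrame, duplicates: pd.DataFrame) -> pd.DataFrame:
    """Return the duplicate pairs whose two images sit in different partitions."""
    split_of = metadata.set_index("image_id")["split"]
    split1 = duplicates["from_img"].str.split(".").str[0].map(split_of)
    split2 = duplicates["to_img"].str.split(".").str[0].map(split_of)
    return duplicates[split1.ne(split2)]


# Toy data: the first pair leaks (train/test), the second does not.
toy_metadata = pd.DataFrame({
    "image_id": ["ISIC_1", "ISIC_2", "ISIC_3", "ISIC_4"],
    "split": ["train", "test", "train", "train"],
})
toy_duplicates = pd.DataFrame({
    "from_img": ["ISIC_1.jpg", "ISIC_3.jpg"],
    "to_img": ["ISIC_2.jpg", "ISIC_4.jpg"],
})

leaking_before = find_leaking_pairs(toy_metadata, toy_duplicates)

# Move both images of every leaking pair to the training partition.
leaking_ids = (
    pd.concat([leaking_before["from_img"], leaking_before["to_img"]])
    .str.split(".")
    .str[0]
)
toy_fixed = toy_metadata.copy(deep=True)
toy_fixed.loc[toy_fixed["image_id"].isin(leaking_ids), "split"] = "train"

leaking_after = find_leaking_pairs(toy_fixed, toy_duplicates)
assert leaking_after.empty, "Some duplicate pairs still span two partitions."
```

Applied to the real tables, the same helper would take `dermamnist_c_metadata_new` and `ham10k_duplicates` as arguments.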