# Analyzing Class Imbalance in Medical Imaging Datasets

This notebook computes the Imbalance Factor (IF) for several medical imaging datasets. Class imbalance is a critical challenge in medical machine learning: a model trained on skewed labels tends to favor the majority class and can make biased or inaccurate predictions for rare classes.

In this notebook, we compute the IF for four datasets: CheXCOVID, HAM10000, OL3I, and PAPILA. The IF is the ratio of the number of samples in the most frequent class to the number of samples in the least frequent class.

In [1]:
import json
import os
from os.path import dirname as up
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
root_dir = up(os.getcwd())

## CheXCOVID
Here, we read the training CSV files of the 10 client datasets and combine them into one DataFrame. This combined DataFrame represents the CheXCOVID dataset in the paper. We use all of its samples to compute the imbalance factor (IF) of the pooled label distribution.

In [3]:
path = root_dir + "/outputs_final_results/fl_COVID_4T_10C_100R_5I_imbalanced_rehearsal_group_probability/data/clients"

# Read each client's training CSV and collect the DataFrames in a list.
# DataFrame.append was removed in pandas 2.0, so we combine them with pd.concat.
client_dfs = []
for i in range(10):
    file_name = f"train_client{i}_COVID.csv"
    file_path = os.path.join(path, file_name)
    client_dfs.append(pd.read_csv(file_path))

# Concatenate all client DataFrames into one, with a fresh 0..n-1 index
combined_df = pd.concat(client_dfs, ignore_index=True)

combined_df.head()

Unnamed: 0,Path,Sex,AP/PA,Target,Age_multi,Age_binary,PATIENT,PRIMARY_RACE
0,images/2eadbbb367a0366d8c34350d083a83_jumbo.jpeg,M,AP,2,2,0,359,Unknown
1,CheXpert-v1.0/train/patient32487/study1/view2_...,M,,1,1,0,patient32487,White
2,CheXpert-v1.0/train/patient11658/study2/view1_...,M,AP,0,3,1,patient11658,"White, non-Hispanic"
3,CheXpert-v1.0/train/patient57885/study1/view1_...,M,AP,1,2,0,patient57885,Asian
4,CheXpert-v1.0/train/patient09318/study2/view1_...,M,AP,1,1,0,patient09318,White
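
The pooled DataFrame hides how the label distribution differs from one client to another. As a minimal sketch, assuming each client's rows are tagged with a hypothetical `client` column (not present in the original CSVs) and using made-up toy labels, a per-client IF could be computed like this:

```python
import pandas as pd

# Toy stand-in for the per-client CSVs; the labels below are made up for illustration.
toy_clients = {
    0: [0, 0, 0, 0, 1, 2],
    1: [0, 0, 1, 1, 2, 2],
}

# Tag each client's rows with a hypothetical `client` column before concatenating.
frames = [
    pd.DataFrame({"Target": labels, "client": client_id})
    for client_id, labels in toy_clients.items()
]
toy_df = pd.concat(frames, ignore_index=True)

# Per-client IF: largest class count divided by smallest class count.
# Note: a class with zero samples at a client is simply absent from the counts.
per_client_if = toy_df.groupby("client")["Target"].agg(
    lambda s: s.value_counts().max() / s.value_counts().min()
)
print(per_client_if)
```

Because a class that is missing at a client does not appear in its counts, a per-client IF computed this way can understate the imbalance for that client.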


### Calculating the Imbalance Factor (IF)
In this step, we calculate the imbalance factor, which is the ratio of the number of instances in the most common class to the number of instances in the least common class. This is a common metric that helps identify the degree of label imbalance in a dataset.

In [5]:
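# value_counts() is indexed by class label, so .loc[0] and .loc[2] return the
# counts of labels 0 and 2, taken here as the most and least frequent classes.
target_counts = combined_df['Target'].value_counts()

major_label_count = target_counts.loc[0]
print(major_label_count)

minor_label_count = target_counts.loc[2]
print(minor_label_count)

# Compute the imbalance factor (IF)
IF = major_label_count / minor_label_count
print("IF:", IF)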

3037
282
IF: 10.76950354609929
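
The cell above looks up the counts by class label. A label-agnostic variant, sketched here as a hypothetical helper `imbalance_factor` (not part of the original notebook), takes the largest and smallest class counts directly:

```python
import pandas as pd

def imbalance_factor(labels: pd.Series) -> float:
    """Return max class count / min class count for a column of class labels."""
    counts = labels.value_counts()
    return float(counts.max() / counts.min())

# Toy labels, made up for illustration: 6 samples of class 0, 3 of class 1, 2 of class 2.
toy_labels = pd.Series([0] * 6 + [1] * 3 + [2] * 2)
print("IF:", imbalance_factor(toy_labels))  # 6 / 2 = 3.0
```

Applied to `combined_df['Target']`, this gives the same value as the cell above only if labels 0 and 2 are indeed the most and least frequent classes.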


## HAM10000
Here, we repeat the IF computation for the HAM10000 dataset. We first load the dataset configuration and the metadata of each split, then compute the imbalance factor from the training labels.

In [17]:
# Load the dataset configuration (root_dir is recomputed so this cell can run on its own)
root_dir = up(os.getcwd())
with open(root_dir + "/configs/datasets.json", 'r') as f:
    config = json.load(f)

In [47]:
dataset_name = "HAM10000"
dataset_config = config[dataset_name]
train_meta = pd.read_csv(dataset_config['train_meta_path']) 
val_meta = pd.read_csv(dataset_config['val_meta_path']) 
test_meta = pd.read_csv(dataset_config['test_meta_path']) 

In [48]:
train_meta['Target'].value_counts().values

array([5360,  879,  858,  414,  261,  103,   92])

In [49]:
major_label_count = train_meta['Target'].value_counts().values[0]
print(major_label_count)

minor_label_count = train_meta['Target'].value_counts().values[-1]
print(minor_label_count)

IF = major_label_count / minor_label_count
print("IF:", IF)

5360
92
IF: 58.26086956521739
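
The IF only looks at the two extreme classes. As a complement, the sketch below takes the seven class counts printed above and also reports each class's share of the training set; the variable names are illustrative.

```python
import numpy as np

# Class counts copied from the value_counts() output above (sorted in descending order).
ham_counts = np.array([5360, 879, 858, 414, 261, 103, 92])

# IF from the extremes of the count vector.
ham_if = ham_counts.max() / ham_counts.min()

# Fraction of training samples belonging to each class.
ham_shares = ham_counts / ham_counts.sum()

print("IF:", ham_if)
print("Class shares:", np.round(ham_shares, 3))
```

With these counts, the majority class alone accounts for roughly two thirds of the training samples.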


## OL3I
Here, we will repeat the IF computations for the OL3I dataset.

In [10]:
dataset_name = "OL3I"
dataset_config = config[dataset_name]
train_meta = pd.read_csv(dataset_config['train_meta_path']) 
val_meta = pd.read_csv(dataset_config['val_meta_path']) 
test_meta = pd.read_csv(dataset_config['test_meta_path']) 
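
The same loading pattern recurs for every dataset. One way to factor it out, sketched under the assumption that each entry of `config` holds a `train_meta_path` key as in the cells above and that the label column is named `Target`, is a small hypothetical helper:

```python
import pandas as pd

def train_imbalance_factor(config: dict, dataset_name: str, label_col: str = "Target") -> float:
    """Load a dataset's training metadata and return its imbalance factor.

    Hypothetical helper: assumes config[dataset_name]['train_meta_path'] points
    to a CSV with a label column, as in the cells above.
    """
    train_meta = pd.read_csv(config[dataset_name]["train_meta_path"])
    counts = train_meta[label_col].value_counts()
    # Largest class count divided by smallest class count.
    return float(counts.max() / counts.min())
```

With the `config` loaded earlier, `train_imbalance_factor(config, "OL3I")` would cover the loading and IF steps of this section in one call.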

In [11]:
train_meta['Target'].value_counts()

0    4961
1     224
Name: Target, dtype: int64

In [12]:
major_label_count = train_meta['Target'].value_counts()[0]
print(major_label_count)

minor_label_count = train_meta['Target'].value_counts()[1]
print(minor_label_count)

IF = major_label_count / minor_label_count
print("IF:", IF)

4961
224
IF: 22.147321428571427


## PAPILA
Here, we will repeat the IF computations for the PAPILA dataset.

In [13]:
dataset_name = "PAPILA"
dataset_config = config[dataset_name]
train_meta = pd.read_csv(dataset_config['train_meta_path']) 
val_meta = pd.read_csv(dataset_config['val_meta_path']) 
test_meta = pd.read_csv(dataset_config['test_meta_path']) 

In [14]:
train_meta['Target'].value_counts()

0    228
1     68
2     44
Name: Target, dtype: int64

In [15]:
major_label_count = train_meta['Target'].value_counts()[0]
print(major_label_count)

minor_label_count = train_meta['Target'].value_counts()[2]
print(minor_label_count)

IF = major_label_count / minor_label_count
print("IF:", IF)

228
44
IF: 5.181818181818182
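
To compare the four datasets side by side, the counts printed in the sections above can be gathered into one table. This is a small sketch that uses only those reported numbers; the DataFrame layout and column names are illustrative.

```python
import pandas as pd

# (majority count, minority count) as printed in the sections above.
reported_counts = {
    "CheXCOVID": (3037, 282),
    "HAM10000": (5360, 92),
    "OL3I": (4961, 224),
    "PAPILA": (228, 44),
}

summary = pd.DataFrame.from_dict(
    reported_counts, orient="index", columns=["major_count", "minor_count"]
)
summary["IF"] = summary["major_count"] / summary["minor_count"]

# Sort from most to least imbalanced.
summary = summary.sort_values("IF", ascending=False)
print(summary.round(2))
```

With these numbers, HAM10000 is the most imbalanced dataset (IF of about 58.3) and PAPILA the least (IF of about 5.2).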
