# Label Distribution Analysis

This notebook examines the label distribution in the dataset (`data/trials.csv`) and explains what that distribution implies for model training.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Load the data
data = pd.read_csv('data/trials.csv')

# Display first few rows
print(data.head())

In [None]:
# Analyze label distribution
label_counts = data['label'].value_counts()
print("Label Distribution:", label_counts, sep="\n")  # sep avoids a stray leading space

# Plot the label distribution
label_counts.plot(kind='bar', figsize=(8, 6), color='skyblue')
plt.title('Label Distribution')
plt.xlabel('Labels')
plt.ylabel('Count')
plt.show()

## Analysis and Explanation

The dataset is roughly balanced across its five labels. Approximate sample counts per label:

- **Dementia:** 368
- **ALS:** 368
- **Obsessive Compulsive Disorder:** 358
- **Scoliosis:** 335
- **Parkinson’s Disease:** 330
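
To put a number on "roughly balanced", the sketch below computes each class's share, the ratio of the largest to the smallest class, and the accuracy a majority-class guesser would reach. It hard-codes the approximate counts listed above under the illustrative name `approx_counts`; in the notebook itself, `label_counts` from the earlier cell would presumably be the better input.

```python
# Approximate counts copied from the list above (illustrative only);
# in the notebook, label_counts.to_dict() could be used instead.
approx_counts = {
    "Dementia": 368,
    "ALS": 368,
    "Obsessive Compulsive Disorder": 358,
    "Scoliosis": 335,
    "Parkinson's Disease": 330,
}

total = sum(approx_counts.values())

# Share of the dataset held by each class
proportions = {name: n / total for name, n in approx_counts.items()}

# Ratio of the largest class to the smallest; 1.0 means perfectly balanced
imbalance_ratio = max(approx_counts.values()) / min(approx_counts.values())

# Accuracy obtained by always predicting the most frequent class
majority_baseline = max(proportions.values())

for name, share in proportions.items():
    print(f"{name}: {share:.1%}")
print(f"Imbalance ratio (max/min): {imbalance_ratio:.2f}")
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

With these counts, the largest class is only about 1.12 times the size of the smallest, and a majority-class guesser would score about 0.21, which is close to the 0.20 expected from uniform guessing over five classes.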

### Implications for Model Training

- **Balanced Classes:** Because each label has a similar number of examples, training is less likely to be biased toward any one class.
- **Reliable Metrics:** Evaluation metrics such as accuracy, F1-score, and PR-AUC are less likely to be distorted, since no single class dominates the totals.
- **Generalization:** With similar representation, the model sees enough examples of every class to learn discriminative features, which supports generalization to unseen data. Balance alone does not guarantee it, so performance should still be checked on held-out data.
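
One way to carry this balance into training and evaluation is to split the data per class, so the train and test sets keep roughly the same label proportions. The following is a minimal standard-library sketch, not the notebook's actual pipeline: `stratified_split`, `test_fraction`, and the toy `labels` list built from the approximate counts are all hypothetical names introduced for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), sampling test_fraction of each label."""
    rng = random.Random(seed)

    # Group row indices by their label
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy label column built from the approximate counts (illustrative only)
approx_counts = {
    "Dementia": 368,
    "ALS": 368,
    "Obsessive Compulsive Disorder": 358,
    "Scoliosis": 335,
    "Parkinson's Disease": 330,
}
labels = [name for name, n in approx_counts.items() for _ in range(n)]

train_idx, test_idx = stratified_split(labels)
print("Train size:", len(train_idx), "Test size:", len(test_idx))
print(Counter(labels[i] for i in test_idx))
```

If scikit-learn is available, `train_test_split` with its `stratify` argument provides the same behavior and would likely be the more practical choice in the notebook.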

Overall, the balanced label distribution is a positive indicator for the multi-class classification task.