# ADHD Clinical Trials: EDA and Predictive Modeling

This notebook provides an interactive exploration of ADHD clinical trial data and demonstrates the full machine learning pipeline for predicting trial success.

## Objectives

1. Fetch ADHD trial data from ClinicalTrials.gov
2. Perform exploratory data analysis (EDA)
3. Engineer features and create labels
4. Train and evaluate multiple ML models
5. Interpret results and identify key predictors

## Setup and Imports

In [1]:
import sys
import os

# Add parent directory to path
sys.path.append(os.path.dirname(os.getcwd()))

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from src import fetch_data, prepare_data, train_models, utils

# Set plotting style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

print("Setup complete!")

## Step 1: Fetch Data from ClinicalTrials.gov

We'll fetch ADHD Phase 2 and Phase 3 interventional trials using the ClinicalTrials.gov API.

In [None]:
# Fetch trials from API
trials = fetch_data.fetch_adhd_trials(max_results=2000, page_size=100)

print(f"\nFetched {len(trials)} trials")
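
`fetch_adhd_trials` takes `max_results` and `page_size`, which suggests it pages through the API. Its internals are not shown in this notebook, so the following is only a minimal sketch of token-based pagination, assuming each response carries a `studies` list and an optional `nextPageToken`. The names `fetch_all_pages`, `fake_get_page`, and `FAKE_PAGES` are hypothetical, and the fake page source stands in for the HTTP call so the cell runs offline.

```python
def fetch_all_pages(get_page, max_results=2000):
    """Collect studies page by page until max_results is reached
    or the source stops returning a next-page token."""
    studies = []
    token = None  # None requests the first page
    while len(studies) < max_results:
        page = get_page(token)
        studies.extend(page.get("studies", []))
        token = page.get("nextPageToken")
        if not token:  # no further pages
            break
    return studies[:max_results]


# Hypothetical stand-in for an HTTP request, keyed by page token.
FAKE_PAGES = {
    None: {"studies": [{"id": 1}, {"id": 2}], "nextPageToken": "p2"},
    "p2": {"studies": [{"id": 3}, {"id": 4}], "nextPageToken": "p3"},
    "p3": {"studies": [{"id": 5}]},
}


def fake_get_page(token):
    return FAKE_PAGES[token]


capped = fetch_all_pages(fake_get_page, max_results=4)
everything = fetch_all_pages(fake_get_page, max_results=10)
print(len(capped), len(everything))
```

Passing the page source in as a function keeps the loop testable without network access; a real caller would wrap its HTTP request in `get_page`.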

In [None]:
# Save raw data
df_raw = fetch_data.save_data(trials)

print(f"\nRaw data shape: {df_raw.shape}")
df_raw.head()

## Step 2: Exploratory Data Analysis

Let's explore the raw data to understand trial characteristics and distributions.

In [None]:
# Data summary
summary = utils.get_data_summary(df_raw)
print("\nData Summary:")
summary

In [None]:
# Overall status distribution
print("\nTrial Status Distribution:")
print(df_raw['OverallStatus'].value_counts())

plt.figure(figsize=(12, 6))
status_counts = df_raw['OverallStatus'].value_counts()
plt.barh(range(len(status_counts)), status_counts.values, color='steelblue', alpha=0.8)
plt.yticks(range(len(status_counts)), status_counts.index)
plt.xlabel('Count', fontsize=12)
plt.ylabel('Status', fontsize=12)
plt.title('Distribution of Trial Statuses', fontsize=14, fontweight='bold')
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Phase distribution
print("\nPhase Distribution:")
print(df_raw['Phase'].value_counts())

plt.figure(figsize=(10, 6))
phase_counts = df_raw['Phase'].value_counts()
plt.bar(range(len(phase_counts)), phase_counts.values, color='coral', alpha=0.8, edgecolor='black')
plt.xticks(range(len(phase_counts)), phase_counts.index, rotation=45, ha='right')
plt.ylabel('Count', fontsize=12)
plt.xlabel('Phase', fontsize=12)
plt.title('Distribution of Trial Phases', fontsize=14, fontweight='bold')
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Sponsor class distribution
print("\nSponsor Class Distribution:")
print(df_raw['LeadSponsorClass'].value_counts())

plt.figure(figsize=(10, 6))
sponsor_counts = df_raw['LeadSponsorClass'].value_counts()
plt.bar(range(len(sponsor_counts)), sponsor_counts.values, color='mediumseagreen', alpha=0.8, edgecolor='black')
plt.xticks(range(len(sponsor_counts)), sponsor_counts.index, rotation=45, ha='right')
plt.ylabel('Count', fontsize=12)
plt.xlabel('Sponsor Class', fontsize=12)
plt.title('Distribution of Sponsor Classes', fontsize=14, fontweight='bold')
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Enrollment distribution
enrollment = pd.to_numeric(df_raw['EnrollmentCount'], errors='coerce').dropna()

print("\nEnrollment Statistics:")
print(f"Mean: {enrollment.mean():.1f}")
print(f"Median: {enrollment.median():.1f}")
print(f"Min: {enrollment.min():.0f}")
print(f"Max: {enrollment.max():.0f}")

plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.hist(enrollment, bins=30, color='skyblue', alpha=0.8, edgecolor='black')
plt.xlabel('Enrollment Count', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Distribution of Enrollment (Linear Scale)', fontsize=12, fontweight='bold')
plt.grid(alpha=0.3)

plt.subplot(1, 2, 2)
plt.hist(np.log1p(enrollment), bins=30, color='lightcoral', alpha=0.8, edgecolor='black')
plt.xlabel('Log(Enrollment + 1)', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Distribution of Enrollment (Log Scale)', fontsize=12, fontweight='bold')
plt.grid(alpha=0.3)

plt.tight_layout()
plt.show()
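
The second panel uses `log1p` because enrollment counts are typically right-skewed: a few very large trials pull the mean far above the median. A small sketch with made-up enrollment values (illustrative only, not taken from the dataset) shows how the transform brings mean and median back together.

```python
import math
import statistics

# Hypothetical enrollment counts with one very large trial.
enrollment_example = [12, 30, 45, 60, 80, 120, 250, 1500]
log_enrollment_example = [math.log1p(n) for n in enrollment_example]

# Ratio of mean to median as a rough skew indicator (1.0 means symmetric).
raw_ratio = statistics.mean(enrollment_example) / statistics.median(enrollment_example)
log_ratio = statistics.mean(log_enrollment_example) / statistics.median(log_enrollment_example)

print(f"mean/median on raw scale: {raw_ratio:.2f}")
print(f"mean/median on log scale: {log_ratio:.2f}")
```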

## Step 3: Data Preparation and Labeling

Create binary labels and engineer features for modeling.

In [None]:
# Create labels
df_labeled = prepare_data.create_binary_labels(df_raw)

print(f"\nLabeled data shape: {df_labeled.shape}")
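
The labeling rule inside `create_binary_labels` is not shown here. One plausible scheme, sketched below purely as an assumption, treats completed trials as successes, treats terminated, withdrawn, or suspended trials as failures, and drops trials whose outcome is not yet known. `label_status` and the two status sets are hypothetical names.

```python
# Assumed mapping; the project's actual rule may differ.
SUCCESS_STATUSES = {"completed"}
FAILURE_STATUSES = {"terminated", "withdrawn", "suspended"}


def label_status(status):
    """Return 1 for success, 0 for failure, None when the outcome is unknown."""
    key = (status or "").strip().lower()
    if key in SUCCESS_STATUSES:
        return 1
    if key in FAILURE_STATUSES:
        return 0
    return None  # e.g. still recruiting: exclude from modeling


statuses = ["Completed", "Terminated", "Recruiting", "WITHDRAWN", None]
labels = [label_status(s) for s in statuses]

# Keep only trials with a known outcome.
labeled = [(s, y) for s, y in zip(statuses, labels) if y is not None]
print(labeled)
```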

In [None]:
# Visualize class distribution
utils.plot_class_distribution(df_labeled, save_path='../data/processed/class_distribution.png')
plt.show()

In [None]:
# Engineer features
df_features = prepare_data.engineer_features(df_labeled)

print(f"\nData with features shape: {df_features.shape}")
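
To give a feel for what feature engineering on trial records can look like, here is a minimal sketch for a single trial. It is not the body of `engineer_features`: the function name `sketch_features`, the input fields `DesignAllocation` and `DesignMasking`, and the output feature names are all assumptions chosen for illustration.

```python
import math


def sketch_features(trial):
    """Derive a few numeric features from one raw trial record (a dict)."""
    try:
        enrollment = float(trial.get("EnrollmentCount"))
    except (TypeError, ValueError):
        enrollment = None  # missing or non-numeric value

    allocation = (trial.get("DesignAllocation") or "").strip().lower()
    masking = (trial.get("DesignMasking") or "").strip().lower()
    sponsor = (trial.get("LeadSponsorClass") or "").strip().upper()

    has_enrollment = enrollment is not None and enrollment >= 0
    return {
        "log_enrollment": math.log1p(enrollment) if has_enrollment else None,
        "enrollment_missing": int(not has_enrollment),
        "is_randomized": int(allocation == "randomized"),
        "is_blinded": int(masking not in ("", "none", "open label")),
        "sponsor_industry": int(sponsor == "INDUSTRY"),
    }


example = {
    "EnrollmentCount": "99",
    "DesignAllocation": "Randomized",
    "DesignMasking": "Double",
    "LeadSponsorClass": "Industry",
}
features = sketch_features(example)
empty_features = sketch_features({})
print(features)
```

Keeping an explicit `enrollment_missing` flag lets a model distinguish "no value reported" from a genuinely small trial.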

In [None]:
# Select final features
df_final, feature_cols = prepare_data.select_modeling_features(df_features)

print(f"\nFinal dataset shape: {df_final.shape}")
print(f"Number of features: {len(feature_cols)}")
print("\nFeature list:")
for i, feat in enumerate(feature_cols, 1):
    print(f"{i:2d}. {feat}")

In [None]:
# Save processed data
prepare_data.save_processed_data(df_final)

print("\nProcessed data saved!")

## Step 4: Feature Analysis

Analyze relationships between features and the target variable.

In [None]:
# Success rate by overall status
status_summary = utils.create_summary_table(df_final, 'OverallStatus', 'Label')
print("\nSuccess Rate by Status:")
print(status_summary)
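
A summary table of this kind boils down to counting trials and successes per group. The sketch below shows that computation on a handful of made-up rows; `success_rate_by_group` is a hypothetical helper, not the implementation of `create_summary_table`, and the example groups are illustrative.

```python
from collections import defaultdict


def success_rate_by_group(rows, group_key, label_key="Label"):
    """Return {group: {"n": trial count, "success_rate": share with label 1}}."""
    totals = defaultdict(lambda: [0, 0])  # group -> [n_trials, n_successes]
    for row in rows:
        counts = totals[row[group_key]]
        counts[0] += 1
        counts[1] += row[label_key]
    return {
        group: {"n": n, "success_rate": successes / n}
        for group, (n, successes) in totals.items()
    }


# Hypothetical rows for illustration only.
rows = [
    {"LeadSponsorClass": "INDUSTRY", "Label": 1},
    {"LeadSponsorClass": "INDUSTRY", "Label": 1},
    {"LeadSponsorClass": "INDUSTRY", "Label": 0},
    {"LeadSponsorClass": "NIH", "Label": 1},
]
summary_example = success_rate_by_group(rows, "LeadSponsorClass")
for group, stats in summary_example.items():
    print(group, stats)
```

Reporting `n` next to each rate matters here: a rate computed from one or two trials should not be read as a stable estimate.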

In [None]:
# Plot feature distributions by outcome
# select_dtypes catches every integer/float width, not only float64 and int64
numeric_features = df_final[feature_cols].select_dtypes(include='number').columns.tolist()
utils.plot_feature_distributions(
    df_final, 
    numeric_features, 
    save_path='../data/processed/feature_distributions.png'
)
plt.show()

In [None]:
# Correlation matrix
utils.plot_correlation_matrix(
    df_final, 
    feature_cols, 
    save_path='../data/processed/correlation_matrix.png'
)
plt.show()

## Step 5: Model Training and Evaluation

Train multiple models and compare their performance.

In [None]:
# Prepare train/test split
X_train, X_test, y_train, y_test, train_idx, test_idx = train_models.prepare_train_test_split(
    df_final, feature_cols, test_size=0.2, random_state=42
)
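
With imbalanced labels, a purely random split can leave the test set with very few failures. Whether `prepare_train_test_split` stratifies is not shown here; the sketch below illustrates the idea under that assumption, returning positional train and test indices. `stratified_split_indices` is a hypothetical name.

```python
import random
from collections import defaultdict


def stratified_split_indices(labels, test_size=0.2, random_state=42):
    """Split positions into train/test so each class keeps roughly
    the same share in both parts."""
    rng = random.Random(random_state)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    train, test = [], []
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        n_test = max(1, round(len(idxs) * test_size))  # at least one per class
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)


# Hypothetical labels: 80 successes, 20 failures.
example_labels = [1] * 80 + [0] * 20
train_pos, test_pos = stratified_split_indices(example_labels)
print(len(train_pos), len(test_pos))
```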

In [None]:
# Scale features
X_train_scaled, X_test_scaled, scaler = train_models.scale_features(X_train, X_test)
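
The scaler is returned alongside both scaled sets, which fits the usual pattern of fitting on the training data only and reusing those statistics for the test data, so no test information leaks into training. A minimal sketch of that pattern, assuming standardization to zero mean and unit variance (the helper names `fit_scaler` and `apply_scaler` are hypothetical):

```python
import statistics


def fit_scaler(train_columns):
    """Compute (mean, std) per feature column from training data only."""
    params = []
    for col in train_columns:
        mean = statistics.mean(col)
        std = statistics.pstdev(col) or 1.0  # constant column: avoid dividing by 0
        params.append((mean, std))
    return params


def apply_scaler(columns, params):
    """Standardize columns using previously fitted parameters."""
    return [[(x - m) / s for x in col] for col, (m, s) in zip(columns, params)]


# Column-major toy data: two features, the second one constant.
train_cols = [[10.0, 20.0, 30.0], [1.0, 1.0, 1.0]]
test_cols = [[40.0], [1.0]]

scaler_params = fit_scaler(train_cols)             # fitted on train only
train_scaled_example = apply_scaler(train_cols, scaler_params)
test_scaled_example = apply_scaler(test_cols, scaler_params)
print(test_scaled_example)
```

The test value 40 lies outside the training range, so its scaled value exceeds anything seen in training; that is expected and preferable to refitting on the test set.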

In [None]:
# Train models
models = train_models.train_models(X_train_scaled, y_train, random_state=42)

In [None]:
# Evaluate models
results_df = train_models.evaluate_all_models(
    models, X_train_scaled, y_train, X_test_scaled, y_test
)

print("\nModel Performance Comparison:")
results_df

In [None]:
# Calculate baseline
baseline = utils.calculate_baseline_metrics(y_test)
print("\nBaseline (Majority Class) Performance:")
for metric, value in baseline.items():
    print(f"  {metric}: {value:.3f}")
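
The majority-class baseline is what a model scores by always predicting the most common label, and it is the number every trained model has to beat. The sketch below computes it directly; `majority_baseline` is a hypothetical helper, and the metric names it returns are assumptions rather than the keys of `calculate_baseline_metrics`.

```python
from collections import Counter


def majority_baseline(y_true):
    """Metrics for a classifier that always predicts the majority class."""
    majority, count = Counter(y_true).most_common(1)[0]
    n = len(y_true)
    positives = sum(1 for y in y_true if y == 1)

    if majority == 1:
        # Everything is predicted positive: all positives are recovered.
        precision, recall = positives / n, 1.0
    else:
        # Nothing is predicted positive.
        precision, recall = 0.0, 0.0

    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": count / n, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical test labels: 8 successes, 2 failures.
baseline_example = majority_baseline([1] * 8 + [0] * 2)
for name, value in baseline_example.items():
    print(f"  {name}: {value:.3f}")
```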

## Step 6: Model Visualization and Interpretation

In [None]:
# Plot ROC curves
train_models.plot_roc_curves(
    models, X_test_scaled, y_test, 
    save_path='../data/processed/roc_curves.png'
)
plt.show()
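
The area under the ROC curve has a direct interpretation: it is the probability that a randomly chosen successful trial receives a higher score than a randomly chosen failed one, with ties counted as half. The sketch below computes AUC from that definition on a few made-up scores; it is meant to explain the metric, not to replace the library routine behind `plot_roc_curves`.

```python
def roc_auc_pairwise(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted success probabilities.
auc_example = roc_auc_pairwise([1, 1, 0, 1, 0], [0.9, 0.8, 0.7, 0.6, 0.2])
print(f"AUC: {auc_example:.3f}")
```

The pairwise loop is quadratic in the number of samples, which is fine for a test set of a few hundred trials but not how production implementations compute it.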

In [None]:
# Feature importance for Random Forest
train_models.plot_feature_importance(
    models['Random Forest'], 
    feature_cols, 
    'Random Forest',
    top_n=20,
    save_path='../data/processed/feature_importance.png'
)
plt.show()

In [None]:
# Get top features
top_features = utils.get_top_features_by_importance(
    models['Random Forest'], 
    feature_cols, 
    top_n=15
)
print("\nTop 15 Most Important Features (Random Forest):")
print(top_features)

In [None]:
# Confusion matrix for best model
# idxmax() returns an index label, so this assumes results_df is indexed by model name
best_model_name = results_df['test_auc'].idxmax()
best_model = models[best_model_name]

print(f"\nBest model by AUC: {best_model_name}")

train_models.plot_confusion_matrix(
    best_model, X_test_scaled, y_test, best_model_name,
    save_path='../data/processed/confusion_matrix.png'
)
plt.show()

In [None]:
# Classification report
y_pred = best_model.predict(X_test_scaled)
utils.print_classification_summary(y_test, y_pred, best_model_name)
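
Precision, recall, and F1 all derive from the four cells of the confusion matrix. The sketch below spells out that arithmetic on made-up predictions, treating label 1 (success) as the positive class; `confusion_counts` and `summary_metrics` are hypothetical helpers.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    return {
        "tp": sum(1 for t, p in pairs if t == 1 and p == 1),
        "fp": sum(1 for t, p in pairs if t == 0 and p == 1),
        "fn": sum(1 for t, p in pairs if t == 1 and p == 0),
        "tn": sum(1 for t, p in pairs if t == 0 and p == 0),
    }


def summary_metrics(c):
    """Derive precision, recall, F1, and specificity from confusion counts."""
    precision = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
    recall = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    specificity = c["tn"] / (c["tn"] + c["fp"]) if c["tn"] + c["fp"] else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "specificity": specificity}


# Hypothetical labels and predictions.
counts_example = confusion_counts([1, 1, 1, 1, 0, 0, 1, 0], [1, 1, 1, 0, 1, 0, 1, 0])
metrics_example = summary_metrics(counts_example)
print(counts_example)
print(metrics_example)
```

Specificity (the share of failed trials correctly flagged) is worth reading alongside recall here, since failed trials are the minority class.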

## Step 7: Error Analysis

Examine cases where the model made incorrect predictions.

In [None]:
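# Create predictions dataframe
# Assumes test_idx holds positional row indices (use .loc if it holds index labels)
# and that y_pred comes from the classification report cell above
test_df = df_final.iloc[test_idx].copy()
test_df['Predicted'] = y_pred
test_df['Correct'] = test_df['Label'] == test_df['Predicted']
test_df['Probability_Success'] = best_model.predict_proba(X_test_scaled)[:, 1]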

# False positives (predicted success, actually failed)
false_positives = test_df[(test_df['Label'] == 0) & (test_df['Predicted'] == 1)]
print(f"\nFalse Positives: {len(false_positives)}")
if len(false_positives) > 0:
    print("\nSample False Positives:")
    print(false_positives[['NCTId', 'BriefTitle', 'OverallStatus', 'Probability_Success']].head())

# False negatives (predicted failure, actually succeeded)
false_negatives = test_df[(test_df['Label'] == 1) & (test_df['Predicted'] == 0)]
print(f"\nFalse Negatives: {len(false_negatives)}")
if len(false_negatives) > 0:
    print("\nSample False Negatives:")
    print(false_negatives[['NCTId', 'BriefTitle', 'OverallStatus', 'Probability_Success']].head())
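
When there are many errors, it often helps to look first at the ones the model was most confident about, since those are the most informative about what it has learned wrongly. A minimal sketch on made-up rows, where confidence is taken as the distance of the predicted probability from 0.5 (`most_confident_errors` and the trial identifiers are hypothetical):

```python
def most_confident_errors(rows, top_n=5):
    """Return misclassified rows, most confident mistakes first."""
    errors = [r for r in rows if r["Label"] != r["Predicted"]]
    return sorted(
        errors,
        key=lambda r: abs(r["Probability_Success"] - 0.5),
        reverse=True,
    )[:top_n]


# Hypothetical predictions for illustration only.
prediction_rows = [
    {"id": "trial-A", "Label": 0, "Predicted": 1, "Probability_Success": 0.95},
    {"id": "trial-B", "Label": 1, "Predicted": 1, "Probability_Success": 0.80},
    {"id": "trial-C", "Label": 1, "Predicted": 0, "Probability_Success": 0.45},
    {"id": "trial-D", "Label": 0, "Predicted": 1, "Probability_Success": 0.60},
]
worst_errors = most_confident_errors(prediction_rows, top_n=2)
print([r["id"] for r in worst_errors])
```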

## Summary and Conclusions

This notebook demonstrated:

1. **Data Collection**: Fetched ADHD Phase 2/3 trials from ClinicalTrials.gov
2. **Data Preparation**: Created binary labels and engineered 34 features
3. **Model Training**: Trained 3 models (Logistic Regression, Random Forest, Gradient Boosting)
4. **Evaluation**: Compared models using accuracy, precision, recall, F1, and AUC
5. **Interpretation**: Identified key predictors of trial success

### Key Findings

- Model performance suggests trial design characteristics are moderately predictive of success
- Important features typically include enrollment size, randomization, blinding, and sponsor type
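- Class imbalance (most trials succeed) makes failed trials harder to predict, so accuracy alone can overstate performance; compare each model against the majority-class baseline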

### Next Steps

- Explore additional feature engineering (text analysis, temporal features)
- Tune hyperparameters for improved performance
- Validate on trials from other therapeutic areas
- Develop interpretable insights for trial design recommendations