# ADMET Prediction Models - Complete Workflow

This notebook demonstrates a complete end-to-end workflow for ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) property prediction using machine learning.

## Overview

ADMET properties are critical pharmacokinetic and safety parameters evaluated during drug discovery. This workflow includes:

1. **Data Loading and Preprocessing**: Loading molecular data and calculating features
2. **Making Predictions**: Using pre-trained models for all 11 ADMET properties
3. **Results Visualization**: Analyzing and visualizing prediction results
4. **Individual Molecule Analysis**: Examining the ADMET profile of a single molecule
5. **Data Splitting**: Stratified and scaffold-based splitting strategies (optional)
6. **Custom Model Training**: Training your own ADMET models (optional)
7. **Exporting Results**: Saving predictions and summary statistics to CSV

## Available ADMET Models

| Model Name | Property | Classification Criteria |
|------------|----------|-------------------------|
| BBB | Blood-Brain Barrier | logBB ≥ -1: permeable |
| Papp | Caco-2 Permeability | Papp ≥ 8×10⁻⁶ cm/s: permeable |
| P_gp_subs | P-glycoprotein Substrate | ER ≥ 2: substrate |
| CYP1A2 | CYP1A2 Inhibition | IC50 < 10 µM: inhibitor |
| CYP2C9 | CYP2C9 Inhibition | IC50 < 10 µM: inhibitor |
| CYP2C19 | CYP2C19 Inhibition | IC50 < 10 µM: inhibitor |
| CYP2D6 | CYP2D6 Inhibition | IC50 < 10 µM: inhibitor |
| CYP3A4 | CYP3A4 Inhibition | IC50 < 10 µM: inhibitor |
| HCLint | Human Hepatic Clearance | t½ > 30 min: stable |
| RCLint | Rat Hepatic Clearance | t½ > 30 min: stable |
| hERG_inh | hERG Inhibition | IC50 < 10 µM: inhibitor |
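
As a rough illustration of how the criteria above translate into binary labels, the sketch below encodes each row of the table as a comparison and a threshold. It assumes that class 1 corresponds to the category named in the table (permeable, substrate, inhibitor, stable); that mapping, the `CRITERIA` dictionary, and the `to_bioclass` helper are illustrative and not part of the project code.

```python
import operator

# (comparison, threshold) per model, copied from the criteria table.
# Units: logBB and ER are unitless, Papp in cm/s, IC50 in µM, half-life in minutes.
# Assumption: a value that meets the criterion is labeled class 1.
CRITERIA = {
    "BBB": (operator.ge, -1.0),
    "Papp": (operator.ge, 8e-6),
    "P_gp_subs": (operator.ge, 2.0),
    "CYP1A2": (operator.lt, 10.0),
    "CYP2C9": (operator.lt, 10.0),
    "CYP2C19": (operator.lt, 10.0),
    "CYP2D6": (operator.lt, 10.0),
    "CYP3A4": (operator.lt, 10.0),
    "HCLint": (operator.gt, 30.0),
    "RCLint": (operator.gt, 30.0),
    "hERG_inh": (operator.lt, 10.0),
}


def to_bioclass(model_name, value):
    """Return 1 if the measured value meets the table criterion, else 0."""
    compare, threshold = CRITERIA[model_name]
    return int(compare(value, threshold))


print(to_bioclass("BBB", -0.5))     # logBB of -0.5 meets logBB >= -1
print(to_bioclass("CYP3A4", 25.0))  # IC50 of 25 µM is not below 10 µM
```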

## 1. Setup and Imports

First, let's import all necessary libraries and modules.
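
Before running the import cell, it can help to confirm that the third-party packages this notebook imports are installed. The following is a minimal sketch using only the standard library; the package list is taken from the import statements in this notebook, and `find_missing_packages` is a hypothetical helper name.

```python
import importlib.util

# Top-level packages imported somewhere in this notebook.
# Note: `sklearn` is the import name of the scikit-learn distribution.
REQUIRED_PACKAGES = [
    "pandas", "numpy", "rdkit", "matplotlib", "seaborn", "sklearn", "dgllife",
]


def find_missing_packages(names):
    """Return the names that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = find_missing_packages(REQUIRED_PACKAGES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available")
```

The local modules `utils` and `predict` are not checked here because they ship with the project rather than being installed.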

In [None]:
# Standard library imports
import os
import pickle
import warnings
warnings.filterwarnings('ignore')

# Data manipulation
import pandas as pd
import numpy as np

# RDKit for molecular handling
from rdkit import Chem
from rdkit.Chem import Draw

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Custom modules
from utils import data_preprocessing, load_features
from predict import load_model, predict

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
sns.set_style('whitegrid')

print("✓ All libraries imported successfully")
print(f"✓ Working directory: {os.getcwd()}")

## 2. Data Loading and Preprocessing

Let's load a sample dataset and preprocess it. We'll use the included `smiles.csv` file as an example.
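
The cells below read the `SMILES` and `bioclass` columns, so a quick header check can catch a mismatched file early. This is a small sketch using the standard `csv` module; `check_columns` is a hypothetical helper, and the example rows are toy data, not taken from `smiles.csv`.

```python
import csv
import io

# Column names used by the cells in this notebook.
REQUIRED_COLUMNS = {"SMILES", "bioclass"}


def check_columns(csv_text):
    """Return the required columns that are missing from the CSV header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    return sorted(REQUIRED_COLUMNS - {name.strip() for name in header})


# Toy CSV content for illustration only.
example = "SMILES,bioclass\nCCO,0\nc1ccccc1,1\n"
missing_columns = check_columns(example)
print("Missing columns:", missing_columns)
```

For a file on disk, the same function could be called with the result of `open(data_file).read()`.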

In [None]:
# Load the sample data
data_file = 'smiles.csv'
print(f"Loading data from: {data_file}")

df = pd.read_csv(data_file)
print(f"\nLoaded {len(df)} molecules")
print(f"\nFirst 5 rows:")
df.head()

In [None]:
# Check class distribution
print("Class distribution:")
class_counts = df['bioclass'].value_counts()
print(class_counts)
print(f"\nClass 0: {class_counts.get(0, 0)} ({class_counts.get(0, 0)/len(df)*100:.1f}%)")
print(f"Class 1: {class_counts.get(1, 0)} ({class_counts.get(1, 0)/len(df)*100:.1f}%)")

# Visualize class distribution
plt.figure(figsize=(8, 5))
# sort_index() puts class 0 first so each color maps to the same class on every dataset
class_counts.sort_index().plot(kind='bar', color=['#FF6B6B', '#4ECDC4'])
plt.title('Class Distribution', fontsize=14, fontweight='bold')
plt.xlabel('Class', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

In [None]:
# Visualize some example molecules
print("Sample molecules from the dataset:\n")

# Take 4 random molecules
sample_smiles = df.sample(4, random_state=42)['SMILES'].tolist()
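# MolFromSmiles returns None for SMILES it cannot parse; skip those before drawing
mols = [mol for mol in (Chem.MolFromSmiles(smi) for smi in sample_smiles) if mol is not None]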

# Display molecules
img = Draw.MolsToGridImage(mols, molsPerRow=2, subImgSize=(300, 300), 
                            legends=[f"Molecule {i+1}" for i in range(len(mols))])
display(img)

## 3. Making Predictions with Pre-trained Models

Now let's use all 11 pre-trained ADMET models to make predictions on our molecules.

In [None]:
# Define all available ADMET models
admet_models = [
    'BBB',       # Blood-Brain Barrier
    'Papp',      # Caco-2 Permeability
    'P_gp_subs', # P-glycoprotein Substrate
    'CYP1A2',    # CYP1A2 Inhibition
    'CYP2C9',    # CYP2C9 Inhibition
    'CYP2C19',   # CYP2C19 Inhibition
    'CYP2D6',    # CYP2D6 Inhibition
    'CYP3A4',    # CYP3A4 Inhibition
    'HCLint',    # Human Hepatic Clearance
    'RCLint',    # Rat Hepatic Clearance
    'hERG_inh'   # hERG Inhibition
]

print(f"Available ADMET models: {len(admet_models)}")
for i, model in enumerate(admet_models, 1):
    print(f"{i:2d}. {model}")

In [None]:
# Load features for preprocessing
features = load_features()
print(f"Loaded {len(features)} molecular features for model input")

# Preprocess the data once (this is used by all models)
print("\nPreprocessing data...")
scaled_data = data_preprocessing(df.copy())
print(f"✓ Preprocessing complete. Final dataset: {len(scaled_data)} molecules")

In [None]:
# Make predictions with all models
print("Making predictions with all ADMET models...\n")
print("=" * 70)

# Store all predictions
all_predictions = pd.DataFrame({'SMILES': scaled_data['SMILES']})

# Dictionary to store probability predictions
all_probabilities = {}

for model_name in admet_models:
    try:
        print(f"\n{model_name}:")
        print("-" * 70)
        
        # Load model
        model = load_model(model_name)
        
        # Make predictions
        predictions = model.predict(scaled_data[features].values)
        probabilities = model.predict_proba(scaled_data[features].values)
        max_probs = np.max(probabilities, axis=1)
        
        # Store predictions
        all_predictions[model_name] = predictions
        all_predictions[f"{model_name}_prob"] = max_probs
        all_probabilities[model_name] = probabilities
        
        # Calculate statistics
        positive_count = np.sum(predictions == 1)
        negative_count = np.sum(predictions == 0)
        positive_pct = (positive_count / len(predictions)) * 100
        
        print(f"  ✓ Predictions complete")
        print(f"  Positive (1): {positive_count:3d} ({positive_pct:5.1f}%)")
        print(f"  Negative (0): {negative_count:3d} ({100-positive_pct:5.1f}%)")
        print(f"  Avg confidence: {np.mean(max_probs):.3f}")
        
    except Exception as e:
        print(f"  ✗ Error: {str(e)}")
        continue

print("\n" + "=" * 70)
print("✓ All predictions complete!")

In [None]:
# Display prediction results
print("\nPrediction Results Summary:")
print("=" * 70)

# Show predictions for first 10 molecules (only binary predictions)
display_cols = ['SMILES'] + [m for m in admet_models if m in all_predictions.columns]
print("\nFirst 10 molecules (Binary predictions: 1=positive, 0=negative):")
display(all_predictions[display_cols].head(10))

# Save complete predictions to CSV
output_file = 'all_admet_predictions.csv'
all_predictions.to_csv(output_file, index=False)
print(f"\n✓ Complete predictions saved to: {output_file}")

## 4. Results Visualization

Let's visualize the prediction results across all ADMET properties.
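
Note that the `_prob` columns created in the previous section store the maximum class probability, that is, the confidence in whichever class was predicted, not the probability of class 1. The sketch below shows one way to recover the class-1 probability. It assumes a binary classifier whose predicted class is the argmax of its probabilities; `positive_class_probability` is a hypothetical helper.

```python
def positive_class_probability(prediction, max_prob):
    """Convert (predicted class, max class probability) into P(class 1).

    Assumes a binary classifier whose prediction is the argmax class.
    """
    if prediction not in (0, 1):
        raise ValueError("prediction must be 0 or 1")
    return max_prob if prediction == 1 else 1.0 - max_prob


# A confident positive and a confident negative have the same max probability
# but very different class-1 probabilities.
print(positive_class_probability(1, 0.9))
print(positive_class_probability(0, 0.9))
```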

In [None]:
# Visualize prediction distribution across all models
fig, axes = plt.subplots(4, 3, figsize=(15, 12))
axes = axes.flatten()

for idx, model_name in enumerate(admet_models):
    if model_name in all_predictions.columns:
        ax = axes[idx]
        
        # Count predictions
        counts = all_predictions[model_name].value_counts()
        
        # Create bar plot
        bars = ax.bar(['Negative (0)', 'Positive (1)'], 
                      [counts.get(0, 0), counts.get(1, 0)],
                      color=['#FF6B6B', '#4ECDC4'])
        
        ax.set_title(f'{model_name}', fontweight='bold', fontsize=12)
        ax.set_ylabel('Count', fontsize=10)
        ax.tick_params(axis='x', rotation=45)
        
        # Add percentage labels on bars
        total = len(all_predictions)
        for bar in bars:
            height = bar.get_height()
            ax.text(bar.get_x() + bar.get_width()/2., height,
                   f'{height/total*100:.1f}%',
                   ha='center', va='bottom', fontsize=9)

# Remove extra subplot
fig.delaxes(axes[-1])

plt.suptitle('ADMET Prediction Distribution Across All Models', 
             fontsize=16, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

In [None]:
# Create a heatmap showing prediction patterns
prediction_matrix = all_predictions[[m for m in admet_models if m in all_predictions.columns]].T

plt.figure(figsize=(16, 6))
sns.heatmap(prediction_matrix.iloc[:, :50],  # Show first 50 molecules
            cmap=['#FF6B6B', '#4ECDC4'], 
            cbar_kws={'label': 'Prediction (0=Negative, 1=Positive)'},
            linewidths=0.5,
            linecolor='white',
            vmin=0, vmax=1)

plt.title('ADMET Prediction Heatmap (First 50 Molecules)', 
          fontsize=14, fontweight='bold', pad=20)
plt.xlabel('Molecule Index', fontsize=12)
plt.ylabel('ADMET Property', fontsize=12)
plt.tight_layout()
plt.show()

In [None]:
# Analyze prediction confidence across models
prob_cols = [f"{m}_prob" for m in admet_models if f"{m}_prob" in all_predictions.columns]

plt.figure(figsize=(14, 6))

# Box plot of prediction probabilities
prob_data = [all_predictions[col].values for col in prob_cols]
model_labels = [col.replace('_prob', '') for col in prob_cols]

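# Tick labels are set separately because the `labels` keyword of boxplot
# is deprecated in recent Matplotlib releases
bp = plt.boxplot(prob_data, patch_artist=True)
plt.xticks(range(1, len(model_labels) + 1), model_labels)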

# Color the boxes
for patch in bp['boxes']:
    patch.set_facecolor('#4ECDC4')
    patch.set_alpha(0.7)

plt.title('Prediction Confidence Distribution Across ADMET Models', 
          fontsize=14, fontweight='bold')
plt.ylabel('Prediction Probability', fontsize=12)
plt.xlabel('ADMET Model', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Summary statistics table
summary_stats = []

for model_name in admet_models:
    if model_name in all_predictions.columns:
        preds = all_predictions[model_name]
        probs = all_predictions[f"{model_name}_prob"]
        
        summary_stats.append({
            'Model': model_name,
            'Positive (%)': f"{np.sum(preds == 1) / len(preds) * 100:.1f}%",
            'Negative (%)': f"{np.sum(preds == 0) / len(preds) * 100:.1f}%",
            'Mean Confidence': f"{np.mean(probs):.3f}",
            'Median Confidence': f"{np.median(probs):.3f}",
            'Min Confidence': f"{np.min(probs):.3f}",
            'Max Confidence': f"{np.max(probs):.3f}"
        })

summary_df = pd.DataFrame(summary_stats)
print("\nSummary Statistics for All ADMET Models:")
print("=" * 100)
display(summary_df)

## 5. Detailed Analysis for Individual Molecules

Let's examine the ADMET profile for specific molecules.
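
A "positive" prediction means something different for each model (for example, permeable for BBB but inhibitor for hERG_inh), so it can be easier to read a profile with property-specific labels. The sketch below assumes that class 1 is the category named in the classification criteria table and class 0 is its complement; the `CLASS_LABELS` mapping, the `describe` helper, and the example predictions are illustrative only.

```python
# (label for class 0, label for class 1) per model.
# Assumption: class 1 is the category named in the criteria table.
CLASS_LABELS = {
    "BBB": ("non-permeable", "permeable"),
    "Papp": ("non-permeable", "permeable"),
    "P_gp_subs": ("non-substrate", "substrate"),
    "CYP1A2": ("non-inhibitor", "inhibitor"),
    "CYP2C9": ("non-inhibitor", "inhibitor"),
    "CYP2C19": ("non-inhibitor", "inhibitor"),
    "CYP2D6": ("non-inhibitor", "inhibitor"),
    "CYP3A4": ("non-inhibitor", "inhibitor"),
    "HCLint": ("unstable", "stable"),
    "RCLint": ("unstable", "stable"),
    "hERG_inh": ("non-inhibitor", "inhibitor"),
}


def describe(model_name, prediction):
    """Translate a 0/1 prediction into the property-specific label."""
    return CLASS_LABELS[model_name][int(prediction)]


# Made-up predictions for illustration, not real model output.
profile = {"BBB": 1, "CYP3A4": 0, "hERG_inh": 1}
for name, pred in profile.items():
    print(f"{name:12s}: {describe(name, pred)}")
```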

In [None]:
# Analyze a specific molecule
molecule_idx = 0  # Change this to examine different molecules

print(f"\nADMET Profile for Molecule #{molecule_idx}")
print("=" * 70)

# Get SMILES and draw molecule
smiles = all_predictions.iloc[molecule_idx]['SMILES']
print(f"SMILES: {smiles}\n")

mol = Chem.MolFromSmiles(smiles)
display(Draw.MolToImage(mol, size=(400, 400)))

# Show predictions
print("\nADMET Predictions:")
print("-" * 70)

for model_name in admet_models:
    if model_name in all_predictions.columns:
        pred = all_predictions.iloc[molecule_idx][model_name]
        prob = all_predictions.iloc[molecule_idx][f"{model_name}_prob"]
        
        result = "✓ Positive" if pred == 1 else "✗ Negative"
        print(f"{model_name:12s}: {result:12s} (confidence: {prob:.3f})")

In [None]:
# Create radar chart for ADMET profile of a molecule
from math import pi

# Get predictions for the molecule
molecule_predictions = []
categories = []

for model_name in admet_models:
    if model_name in all_predictions.columns:
        pred = all_predictions.iloc[molecule_idx][model_name]
        molecule_predictions.append(pred)
        categories.append(model_name)

# Number of variables
N = len(categories)

# Compute angle for each axis
angles = [n / float(N) * 2 * pi for n in range(N)]
molecule_predictions += molecule_predictions[:1]
angles += angles[:1]

# Initialize the plot
fig, ax = plt.subplots(figsize=(10, 10), subplot_kw=dict(projection='polar'))

# Draw the plot
ax.plot(angles, molecule_predictions, 'o-', linewidth=2, color='#4ECDC4', label='Predictions')
ax.fill(angles, molecule_predictions, alpha=0.25, color='#4ECDC4')

# Fix axis to go in the right order
ax.set_xticks(angles[:-1])
ax.set_xticklabels(categories, size=10)

# Set y-axis limits
ax.set_ylim(0, 1)
ax.set_yticks([0, 0.5, 1])
ax.set_yticklabels(['0', '0.5', '1'])

# Add grid
ax.grid(True)

plt.title(f'ADMET Profile - Molecule #{molecule_idx}\n{smiles[:50]}...', 
          size=12, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

## 6. Data Splitting for Model Training (Optional)

If you want to train custom models, you'll need to split your data. Here's how to do it using different strategies.
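
The idea behind a scaffold-based split is that all molecules sharing a scaffold end up in the same subset, so the test set contains chemotypes the model has not seen. The following is a simplified sketch of that grouping logic in plain Python, not the `dgllife` implementation. It assumes one scaffold key per molecule has already been computed (for example with RDKit's Murcko scaffold utilities); `group_split` and the example keys are hypothetical.

```python
from collections import defaultdict


def group_split(group_keys, frac_train=0.8):
    """Split indices so that items sharing a key land in the same subset.

    Groups are visited largest first; a group goes to the training set if it
    still fits within the training budget, otherwise to the test set.
    """
    groups = defaultdict(list)
    for index, key in enumerate(group_keys):
        groups[key].append(index)

    train_budget = frac_train * len(group_keys)
    train, test = [], []
    for members in sorted(groups.values(), key=len, reverse=True):
        if len(train) + len(members) <= train_budget:
            train.extend(members)
        else:
            test.extend(members)
    return sorted(train), sorted(test)


# Hypothetical scaffold keys, one per molecule.
scaffolds = ["A", "A", "A", "B", "B", "C", "C", "D", "E", "F"]
train_idx, test_idx = group_split(scaffolds, frac_train=0.8)
print("Train indices:", train_idx)
print("Test indices: ", test_idx)
```

Because whole groups are moved at once, the resulting fractions only approximate the requested ones, and the class balance is not controlled, which is the main trade-off against a stratified split.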

In [None]:
# Example: Stratified split
from sklearn.model_selection import train_test_split

print("Data Splitting Examples")
print("=" * 70)

# Stratified split (maintains class distribution)
train_data, test_data = train_test_split(
    scaled_data,
    test_size=0.2,
    random_state=42,
    stratify=scaled_data['bioclass']
)

print("\nStratified Split:")
print(f"  Training set: {len(train_data)} molecules")
print(f"  Test set:     {len(test_data)} molecules")
print(f"\n  Training set class distribution:")
print(f"    Class 0: {np.sum(train_data['bioclass'] == 0)} ({np.sum(train_data['bioclass'] == 0)/len(train_data)*100:.1f}%)")
print(f"    Class 1: {np.sum(train_data['bioclass'] == 1)} ({np.sum(train_data['bioclass'] == 1)/len(train_data)*100:.1f}%)")
print(f"\n  Test set class distribution:")
print(f"    Class 0: {np.sum(test_data['bioclass'] == 0)} ({np.sum(test_data['bioclass'] == 0)/len(test_data)*100:.1f}%)")
print(f"    Class 1: {np.sum(test_data['bioclass'] == 1)} ({np.sum(test_data['bioclass'] == 1)/len(test_data)*100:.1f}%)")

In [None]:
# Example: Scaffold-based split (more realistic for drug discovery)
from dgllife.utils import ScaffoldSplitter

# Prepare data for scaffold splitter
scaffold_df = scaled_data.copy()
scaffold_df = scaffold_df.rename(columns={'SMILES': 'smiles'})

# Split by scaffolds
train_set, val_set, test_set = ScaffoldSplitter.train_val_test_split(
    scaffold_df,
    frac_train=0.8,
    frac_val=0.0,
    frac_test=0.2
)

scaffold_train = scaffold_df.iloc[train_set.indices]
scaffold_test = scaffold_df.iloc[test_set.indices]

print("\nScaffold-Based Split:")
print(f"  Training set: {len(scaffold_train)} molecules")
print(f"  Test set:     {len(scaffold_test)} molecules")
print(f"\n  Training set class distribution:")
print(f"    Class 0: {np.sum(scaffold_train['bioclass'] == 0)} ({np.sum(scaffold_train['bioclass'] == 0)/len(scaffold_train)*100:.1f}%)")
print(f"    Class 1: {np.sum(scaffold_train['bioclass'] == 1)} ({np.sum(scaffold_train['bioclass'] == 1)/len(scaffold_train)*100:.1f}%)")
print(f"\n  Test set class distribution:")
print(f"    Class 0: {np.sum(scaffold_test['bioclass'] == 0)} ({np.sum(scaffold_test['bioclass'] == 0)/len(scaffold_test)*100:.1f}%)")
print(f"    Class 1: {np.sum(scaffold_test['bioclass'] == 1)} ({np.sum(scaffold_test['bioclass'] == 1)/len(scaffold_test)*100:.1f}%)")

print("\n✓ Scaffold split reduces data leakage by keeping molecules that share a scaffold in the same subset")

## 7. Model Training (Optional - Advanced)

To train a custom model, you would use the `model.py` script. Here's the workflow:

### Step 1: Prepare your data
- CSV file with 'SMILES' and 'bioclass' columns

### Step 2: Split the data
```python
# Save split data
train_data.to_csv('train_mydata.csv', index=False)
test_data.to_csv('test_mydata.csv', index=False)
```

### Step 3: Train the model (run in terminal)
```bash
python model.py --file_name mydata.csv --model_name my_custom_model --max_eval 200 --time_out 120 --training
```

### Model Training Parameters:
- `--max_eval`: Number of hyperparameter optimization trials (more trials search the space more thoroughly but take longer)
- `--time_out`: Maximum seconds per trial
- The training uses Hyperopt with TPE (Tree-structured Parzen Estimator) for optimization
- Automatically tries different classifiers and preprocessing methods

### Performance Metrics:
After training, you'll see:
- **Sensitivity (Recall)**: True positive rate
- **Specificity**: True negative rate
- **Accuracy**: Overall accuracy
- **MCC**: Matthews Correlation Coefficient (good for imbalanced data)
- **AUC-ROC**: Area under ROC curve
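
For reference, the threshold-based metrics above can be computed directly from the confusion matrix. The sketch below is a plain-Python illustration with made-up labels, not the code used by `model.py`; AUC-ROC is omitted because it needs predicted scores rather than binary labels.

```python
import math


def binary_metrics(y_true, y_pred):
    """Compute sensitivity, specificity, accuracy, and MCC for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    # Guard each ratio against an empty denominator.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Made-up labels for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name:12s}: {value:.3f}")
```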

In [None]:
# Example: Save training and test sets for custom model training
# Uncomment the lines below to save your split data

# train_data.to_csv('train_custom_model.csv', index=False)
# test_data.to_csv('test_custom_model.csv', index=False)
# print("✓ Training and test sets saved for custom model training")
# print("\nTo train a model, run in terminal:")
# print("python model.py --file_name custom_model.csv --model_name my_model --max_eval 200 --training")

print("Custom model training is available via the command line.")
print("See the markdown cell above for instructions.")

## 8. Exporting Results

Let's export our predictions in various formats for further analysis.

In [None]:
# Export detailed results
print("Exporting results...\n")

# 1. Complete predictions (binary calls plus the `_prob` confidence column for each model)
all_predictions.to_csv('all_admet_predictions_detailed.csv', index=False)
print("✓ Detailed predictions saved to: all_admet_predictions_detailed.csv")

# 2. Summary statistics
summary_df.to_csv('admet_summary_statistics.csv', index=False)
print("✓ Summary statistics saved to: admet_summary_statistics.csv")

# 3. Binary predictions only (for easier viewing)
binary_cols = ['SMILES'] + [m for m in admet_models if m in all_predictions.columns]
all_predictions[binary_cols].to_csv('all_admet_predictions_binary.csv', index=False)
print("✓ Binary predictions saved to: all_admet_predictions_binary.csv")

print("\n" + "=" * 70)
print("Workflow complete! All results have been saved.")
print("=" * 70)

## Summary

This notebook demonstrated a complete ADMET prediction workflow:

### ✓ Completed Steps:
1. **Data Loading**: Loaded and explored molecular data
2. **Preprocessing**: Calculated molecular descriptors and fingerprints
3. **Predictions**: Used all 11 pre-trained ADMET models
4. **Visualization**: Created comprehensive visualizations of results
5. **Analysis**: Examined individual molecule ADMET profiles
6. **Data Splitting**: Demonstrated stratified and scaffold-based splitting
7. **Export**: Saved results in multiple formats

### Next Steps:
- Use the predictions to prioritize compounds for synthesis/testing
- Train custom models on your own data using the splitting methods shown
- Integrate predictions into your drug discovery pipeline
- Perform cross-validation to assess model robustness

### Key Files Generated:
- `all_admet_predictions.csv` - Binary predictions plus per-model confidence (maximum class probability)
- `all_admet_predictions_detailed.csv` - The same table, written again by the export cell
- `all_admet_predictions_binary.csv` - Binary predictions only
- `admet_summary_statistics.csv` - Summary statistics

### For More Information:
- See `README.md` for detailed documentation
- Check `predict.py`, `model.py`, and `utils.py` for implementation details
- Refer to the classification criteria table at the top of this notebook

---

**Note**: This workflow uses pre-trained models optimized on specific datasets. For production use, validate predictions with experimental data and consider retraining models on relevant datasets for your specific application.