# 🧬 Drug-Target Binding Affinity Prediction

## Overview
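This notebook demonstrates how to predict drug-target binding affinity with machine learning. It trains a Random Forest regressor on a synthetic dataset whose affinity values are labeled as KIBA scores; no real KIBA measurements are used.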

**What you'll learn:**
1. Generate synthetic drug and protein features
2. Create a drug-target interaction dataset
3. Train a Random Forest model
4. Evaluate and visualize results

## Step 1: Install Dependencies

In [None]:
# Install required packages
%pip install numpy pandas matplotlib seaborn scikit-learn

## Step 2: Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.preprocessing import StandardScaler
import warnings

# Silence all warnings to keep cell output tidy (this also hides deprecation notices)
warnings.filterwarnings('ignore')
plt.style.use('default')
sns.set_palette('husl')

print('Libraries imported successfully!')

## Step 3: Generate Synthetic Dataset

In [None]:
# Set random seed for reproducibility
np.random.seed(42)

# Dataset parameters
n_drugs = 500
n_targets = 100
n_interactions = 10000

print(f'Creating dataset with {n_drugs} drugs, {n_targets} targets, {n_interactions} interactions')

In [None]:
# Generate Drug Features
drug_df = pd.DataFrame({
    'Drug_ID': [f'DRUG_{i:04d}' for i in range(n_drugs)],
    'Molecular_Weight': np.random.normal(350, 100, n_drugs),
    'LogP': np.random.normal(2.5, 1.5, n_drugs),
    'NumAtoms': np.random.randint(20, 80, n_drugs),
    'NumRings': np.random.randint(1, 6, n_drugs),
    'Polar_Surface_Area': np.random.normal(70, 30, n_drugs),
    'Num_HB_Donors': np.random.randint(0, 5, n_drugs),
    'Num_HB_Acceptors': np.random.randint(2, 10, n_drugs)
})

print(f'Generated {len(drug_df)} drugs')
drug_df.head()
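
Because `Molecular_Weight` and `Polar_Surface_Area` are drawn from unbounded normal distributions, a few rows can come out negative or implausibly small. The following is a minimal sketch of one way to guard against that. The lower bounds (50 for molecular weight, 0 for polar surface area) and the names `demo_df`, `lower_bounds`, and `n_clipped` are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Independent generator so the notebook's global seed is left untouched
rng = np.random.default_rng(0)

# Stand-in for the two normally distributed drug columns
demo_df = pd.DataFrame({
    'Molecular_Weight': rng.normal(350, 100, 500),
    'Polar_Surface_Area': rng.normal(70, 30, 500),
})

# Hypothetical lower bounds chosen for illustration only
lower_bounds = {'Molecular_Weight': 50.0, 'Polar_Surface_Area': 0.0}

n_clipped = {}
for col, lo in lower_bounds.items():
    # Count offending rows before clipping so the change is visible
    n_clipped[col] = int((demo_df[col] < lo).sum())
    demo_df[col] = demo_df[col].clip(lower=lo)

print(f'Rows clipped per column: {n_clipped}')
```

The same loop could be applied to `drug_df` directly after it is created, before any interactions are generated.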

In [None]:
# Generate Target Features
target_df = pd.DataFrame({
    'Target_ID': [f'TARGET_{i:03d}' for i in range(n_targets)],
    'Protein_Length': np.random.randint(200, 800, n_targets),
    'Hydrophobicity': np.random.normal(0, 1, n_targets),
    'Instability_Index': np.random.normal(35, 15, n_targets),
    'Isoelectric_Point': np.random.normal(7, 2, n_targets),
    'Net_Charge': np.random.randint(-10, 10, n_targets),
    'Num_Domains': np.random.randint(1, 5, n_targets),
    'Active_Site_Volume': np.random.normal(1500, 400, n_targets)
})

print(f'Generated {len(target_df)} targets')
target_df.head()

In [None]:
# Generate Drug-Target Interactions with Binding Affinity
drug_indices = np.random.randint(0, n_drugs, n_interactions)
target_indices = np.random.randint(0, n_targets, n_interactions)

kiba_scores = []
for drug_idx, target_idx in zip(drug_indices, target_indices):
    drug = drug_df.iloc[drug_idx]
    target = target_df.iloc[target_idx]
    
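    # Synthetic binding affinity: the sum of three match terms plus noise
    # Size match: peaks when molecular weight is close to a quarter of the active-site volume
    size_factor = 1 / (1 + abs(drug['Molecular_Weight'] - target['Active_Site_Volume']/4))
    # Hydrophobic match: peaks when the drug's LogP equals the target's hydrophobicity
    hydrophobic_match = 1 / (1 + abs(drug['LogP'] - target['Hydrophobicity']))
    # Hydrogen-bond potential: grows with the donor + acceptor count
    hbond_potential = (drug['Num_HB_Donors'] + drug['Num_HB_Acceptors']) / 10
    
    # Add Gaussian noise (sd = 0.5) and clip at zero so scores stay non-negative
    kiba_score = max(0, size_factor + hydrophobic_match + hbond_potential + np.random.normal(0, 0.5))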
    kiba_scores.append(kiba_score)

interaction_df = pd.DataFrame({
    'Drug_ID': drug_df.iloc[drug_indices]['Drug_ID'].values,
    'Target_ID': target_df.iloc[target_indices]['Target_ID'].values,
    'Drug_Idx': drug_indices,
    'Target_Idx': target_indices,
    'KIBA_Score': kiba_scores
})

print(f'✓ Generated {len(interaction_df)} interactions')
print(f'KIBA Score range: {min(kiba_scores):.3f} - {max(kiba_scores):.3f}')
interaction_df.head()
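
The row-by-row loop above makes the scoring formula easy to read, but the same three terms can be computed on whole arrays at once. The sketch below shows that vectorized form on stand-in arrays; all variable names here are illustrative assumptions, and because it uses its own random generator its noise draws will not reproduce the loop's values.

```python
import numpy as np

rng = np.random.default_rng(0)  # independent generator; stand-in data only
n = 1000

# Stand-in columns drawn like the notebook's drug and target features
mol_weight = rng.normal(350, 100, n)
site_volume = rng.normal(1500, 400, n)
logp = rng.normal(2.5, 1.5, n)
hydrophobicity = rng.normal(0, 1, n)
hb_donors = rng.integers(0, 5, n)
hb_acceptors = rng.integers(2, 10, n)

# Same three terms as the loop, computed on whole arrays at once
size_factor = 1 / (1 + np.abs(mol_weight - site_volume / 4))
hydrophobic_match = 1 / (1 + np.abs(logp - hydrophobicity))
hbond_potential = (hb_donors + hb_acceptors) / 10

signal = size_factor + hydrophobic_match + hbond_potential
noise = rng.normal(0, 0.5, n)
scores = np.maximum(0, signal + noise)  # clip at zero, as max(0, ...) does per row

print(f'Score range: {scores.min():.3f} - {scores.max():.3f}')
```

In the notebook, the per-interaction arrays could be obtained with expressions such as `drug_df['Molecular_Weight'].to_numpy()[drug_indices]`.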

## Step 4: Prepare Features for ML

In [None]:
# Combine drug and target features
drug_feature_cols = [col for col in drug_df.columns if col != 'Drug_ID']
target_feature_cols = [col for col in target_df.columns if col != 'Target_ID']

X_drug = drug_df[drug_feature_cols].iloc[interaction_df['Drug_Idx'].values].values
X_target = target_df[target_feature_cols].iloc[interaction_df['Target_Idx'].values].values

X = np.hstack([X_drug, X_target])
y = interaction_df['KIBA_Score'].values

feature_names = drug_feature_cols + target_feature_cols

print(f'Feature matrix shape: {X.shape}')
print(f'Features: {feature_names}')
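
Positional indexing works here because `Drug_Idx` and `Target_Idx` were stored alongside the IDs. When only the ID columns are available, as is typical for real interaction tables, a key-based join is one alternative. This is a small sketch on toy frames; the frames `drugs`, `targets`, and `pairs` and their values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for drug_df, target_df, and interaction_df
drugs = pd.DataFrame({'Drug_ID': ['DRUG_0000', 'DRUG_0001'],
                      'LogP': [2.1, 3.4]})
targets = pd.DataFrame({'Target_ID': ['TARGET_000', 'TARGET_001'],
                        'Hydrophobicity': [0.3, -0.5]})
pairs = pd.DataFrame({'Drug_ID': ['DRUG_0001', 'DRUG_0000', 'DRUG_0001'],
                      'Target_ID': ['TARGET_000', 'TARGET_000', 'TARGET_001'],
                      'KIBA_Score': [1.2, 0.8, 1.5]})

# Left joins keep the row order of `pairs`; validate= raises if an ID
# appears more than once in a feature table
merged = (pairs
          .merge(drugs, on='Drug_ID', how='left', validate='many_to_one')
          .merge(targets, on='Target_ID', how='left', validate='many_to_one'))

X_demo = merged[['LogP', 'Hydrophobicity']].to_numpy()
y_demo = merged['KIBA_Score'].to_numpy()

print(merged)
```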

## Step 5: Train Model

In [None]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features (fit on the training split only; tree ensembles do not need scaling, but it is harmless here)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f'Training samples: {len(X_train)}')
print(f'Testing samples: {len(X_test)}')
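
A random row-level split lets the same drug appear in both the training and test sets, so the test score reflects performance on drugs the model has already seen. If the goal is to estimate performance on unseen drugs, a grouped split is one option. The sketch below uses scikit-learn's `GroupShuffleSplit` on stand-in data; the array names are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)  # stand-in data only
n_rows, n_drugs_demo = 1000, 50

drug_ids = rng.integers(0, n_drugs_demo, n_rows)  # one drug index per interaction
X_demo = rng.normal(size=(n_rows, 4))
y_demo = rng.normal(size=n_rows)

# Hold out roughly 20% of the drugs (not 20% of the rows)
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X_demo, y_demo, groups=drug_ids))

shared = set(drug_ids[train_idx]) & set(drug_ids[test_idx])
print(f'Train rows: {len(train_idx)}, test rows: {len(test_idx)}')
print(f'Drugs shared between train and test: {len(shared)}')
```

In the notebook, `interaction_df['Drug_Idx'].values` could serve as the `groups` argument.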

In [None]:
# Train Random Forest
model = RandomForestRegressor(
    n_estimators=100,
    max_depth=10,
    random_state=42,
    n_jobs=-1
)

model.fit(X_train_scaled, y_train)
print('Model training completed!')
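
A single train/test split gives one performance estimate, which can vary with the split. K-fold cross-validation is a common way to see that spread. This is a minimal sketch on stand-in data, assuming 5 folds and a smaller forest to keep it fast; the names and the toy target function are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)  # stand-in data only
X_demo = rng.normal(size=(300, 5))
# Toy target: depends on the first two features plus a little noise
y_demo = X_demo[:, 0] + 0.5 * X_demo[:, 1] ** 2 + rng.normal(0, 0.1, 300)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
demo_model = RandomForestRegressor(n_estimators=50, max_depth=10, random_state=42)

# One R² value per fold
scores = cross_val_score(demo_model, X_demo, y_demo, cv=cv, scoring='r2')
print(f'R² per fold: {np.round(scores, 3)}')
print(f'Mean R²: {scores.mean():.3f} (std {scores.std():.3f})')
```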

## Step 6: Evaluate Model

In [None]:
# Make predictions
y_pred = model.predict(X_test_scaled)

# Calculate metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print('Model Performance:')
print(f'  MSE:  {mse:.4f}')
print(f'  RMSE: {rmse:.4f}')
print(f'  MAE:  {mae:.4f}')
print(f'  R²:   {r2:.4f}')
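
These metrics are easier to interpret next to two reference points: a baseline that always predicts the training mean, and the ceiling imposed by the noise added to every score. With noise variance σ², the best achievable R² is roughly 1 − σ² / Var(y), ignoring the clipping at zero. The sketch below illustrates both on stand-in data; the uniform signal and all names are assumptions made for the example.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)  # stand-in data only
n = 5000
noise_sd = 0.5  # same noise level the notebook adds to each score

signal = rng.uniform(0, 2, n)                 # noise-free part of the target
y_demo = signal + rng.normal(0, noise_sd, n)  # observed target
X_demo = signal.reshape(-1, 1)

X_tr, X_te = X_demo[:4000], X_demo[4000:]
y_tr, y_te = y_demo[:4000], y_demo[4000:]

# Baseline: always predict the training mean
baseline = DummyRegressor(strategy='mean').fit(X_tr, y_tr)
baseline_r2 = r2_score(y_te, baseline.predict(X_te))

# Ceiling: even a model that recovers the signal exactly cannot explain the noise
ceiling_r2 = 1 - noise_sd ** 2 / np.var(y_te)
oracle_r2 = r2_score(y_te, signal[4000:])

print(f'Baseline R²: {baseline_r2:.3f}')
print(f'Noise ceiling R²: {ceiling_r2:.3f} (oracle achieves {oracle_r2:.3f})')
```

For the notebook's data, the analogous ceiling would be `1 - 0.5**2 / np.var(y_test)`, and a model's R² is best judged against that value, not against 1.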

In [None]:
# Feature Importance
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': model.feature_importances_
}).sort_values('Importance', ascending=False)

print('Top 10 Feature Importance:')
importance_df.head(10)
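
The `feature_importances_` attribute is computed from impurity reductions on the training data. Permutation importance, measured on held-out data, is a common cross-check. Below is a minimal sketch using scikit-learn's `permutation_importance` on stand-in data in which only the first feature carries signal; the data and names are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)  # stand-in data only
X_demo = rng.normal(size=(400, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(0, 0.1, 400)  # only feature 0 matters

# Fit on the first 300 rows, evaluate importance on the remaining 100
demo_model = RandomForestRegressor(n_estimators=50, random_state=42)
demo_model.fit(X_demo[:300], y_demo[:300])

# Shuffle each column in turn and record how much the score drops
result = permutation_importance(
    demo_model, X_demo[300:], y_demo[300:], n_repeats=5, random_state=42
)

ranking = np.argsort(result.importances_mean)[::-1]
for idx in ranking:
    print(f'feature {idx}: {result.importances_mean[idx]:.3f} '
          f'+/- {result.importances_std[idx]:.3f}')
```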

## Step 7: Visualize Results

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(14, 12))

# 1. Actual vs Predicted
axes[0, 0].scatter(y_test, y_pred, alpha=0.5, s=10)
axes[0, 0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
axes[0, 0].set_xlabel('Actual KIBA Score')
axes[0, 0].set_ylabel('Predicted KIBA Score')
axes[0, 0].set_title(f'Actual vs Predicted (R² = {r2:.3f})')
axes[0, 0].grid(True, alpha=0.3)

# 2. Feature Importance
top_features = importance_df.head(10)
axes[0, 1].barh(top_features['Feature'], top_features['Importance'])
axes[0, 1].set_xlabel('Importance')
axes[0, 1].set_title('Top 10 Feature Importance')
axes[0, 1].invert_yaxis()

# 3. Residual Distribution
residuals = y_test - y_pred
axes[1, 0].hist(residuals, bins=50, alpha=0.7, color='steelblue')
axes[1, 0].axvline(x=0, color='red', linestyle='--', lw=2)
axes[1, 0].set_xlabel('Residual')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Residual Distribution')

# 4. KIBA Score Distribution
axes[1, 1].hist(interaction_df['KIBA_Score'], bins=50, alpha=0.7, color='lightcoral')
axes[1, 1].set_xlabel('KIBA Score')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].set_title('KIBA Score Distribution')

plt.tight_layout()
plt.show()

## Step 8: Make Predictions for New Data

In [None]:
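# Example: predict binding affinity for a new drug-target pair.
# Values must be listed in the same order as feature_names.
new_drug = [350, 2.5, 45, 3, 70, 2, 5]  # Molecular_Weight, LogP, NumAtoms, NumRings, Polar_Surface_Area, Num_HB_Donors, Num_HB_Acceptors
new_target = [450, 0.1, 35, 7, 2, 3, 1500]  # Protein_Length, Hydrophobicity, Instability_Index, Isoelectric_Point, Net_Charge, Num_Domains, Active_Site_Volume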

combined = np.array(new_drug + new_target).reshape(1, -1)
combined_scaled = scaler.transform(combined)
prediction = model.predict(combined_scaled)[0]

print(f'Drug features: {new_drug}')
print(f'Target features: {new_target}')
print(f'Predicted KIBA Score: {prediction:.3f}')
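
Passing features as bare lists relies on getting the order right by hand. One way to make that less error-prone is to pass named values and let a helper arrange them. The function `predict_affinity` below is a hypothetical helper, not part of the notebook, and the two-feature model around it exists only to make the sketch runnable.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

def predict_affinity(model, scaler, feature_names, drug, target):
    """Hypothetical helper: order named features, scale them, and predict."""
    merged = {**drug, **target}
    missing = [name for name in feature_names if name not in merged]
    if missing:
        raise KeyError(f'Missing features: {missing}')
    # Build a single row in the exact column order used for training
    row = np.array([[merged[name] for name in feature_names]], dtype=float)
    return float(model.predict(scaler.transform(row))[0])

# Tiny stand-in model with two features, for illustration only
feature_names_demo = ['LogP', 'Hydrophobicity']
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 1 / (1 + np.abs(X_demo[:, 0] - X_demo[:, 1]))

scaler_demo = StandardScaler().fit(X_demo)
model_demo = RandomForestRegressor(n_estimators=20, random_state=42)
model_demo.fit(scaler_demo.transform(X_demo), y_demo)

pred = predict_affinity(
    model_demo, scaler_demo, feature_names_demo,
    drug={'LogP': 0.4}, target={'Hydrophobicity': 0.1},
)
print(f'Predicted score: {pred:.3f}')
```

With the notebook's objects, the call would take `model`, `scaler`, and `feature_names`, plus two dictionaries keyed by the drug and target column names.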

## 🎉 Summary

In this notebook, we:
1. Created a synthetic drug-target binding dataset
2. Trained a Random Forest model for affinity prediction
3. Evaluated it on a held-out test set using MSE, RMSE, MAE, and R²
4. Ranked the drug and target features by their importance to the model

**Next Steps:**
- Use the real KIBA dataset from TDC (Therapeutics Data Commons)
- Try deep learning models (GNN, Transformer)
- Add more molecular descriptors