# 💊 Drug Discovery and Molecular ML: Hands-on Practice

## Table of Contents
1. [Molecular Representations with SMILES](#practice-1-molecular-representations-with-smiles)
2. [Computing Molecular Descriptors](#practice-2-computing-molecular-descriptors)
3. [Molecular Fingerprints](#practice-3-molecular-fingerprints)
4. [Building a Simple QSAR Model](#practice-4-building-a-simple-qsar-model)
5. [Virtual Screening Simulation](#practice-5-virtual-screening-simulation)
6. [ADMET Property Prediction](#practice-6-admet-property-prediction)

### 🎯 Learning Objectives
- Understand molecular representations (SMILES notation)
- Compute molecular descriptors and fingerprints
- Build predictive models for drug properties
- Apply ML to drug discovery problems

## Installing and Importing Essential Libraries

In [None]:
# Install RDKit if not already installed
# (the PyPI package is named `rdkit`; `rdkit-pypi` is the older, legacy name)
# !pip install rdkit

# Import essential libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from rdkit import Chem
from rdkit.Chem import Descriptors, AllChem, Draw
from rdkit.Chem import Lipinski
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import warnings
warnings.filterwarnings('ignore')

# Visualization settings
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 11
sns.set_style('whitegrid')

print("✅ All libraries loaded successfully!")
print(f"RDKit version: {Chem.rdBase.rdkitVersion}")

---
## Practice 1: Molecular Representations with SMILES

### 🎯 Learning Objectives
- Understand SMILES (Simplified Molecular Input Line Entry System) notation
- Convert SMILES strings to molecular objects
- Visualize molecular structures

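### 📖 Key Concepts
**SMILES Notation:** A string-based representation of chemical structures
- `C`: Carbon atom (uppercase letters denote aliphatic atoms)
- `O`: Oxygen atom
- `N`: Nitrogen atom
- `c`: Aromatic carbon (lowercase letters denote aromatic atoms)
- `()`: Branches
- `=`: Double bond
- `1`, `2`, ...: Ring-closure labels; two atoms carrying the same digit are bonded to close a ring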

Examples:
- Ethanol: `CCO`
- Benzene: `c1ccccc1`
- Aspirin: `CC(=O)Oc1ccccc1C(=O)O`
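
To make the notation concrete, here is a minimal sketch that counts heavy atoms directly from a SMILES string using only the standard library. The function name `count_heavy_atoms` and the regular expression are illustrative assumptions, not part of RDKit. The sketch handles only unbracketed atoms from the organic subset, so strings containing bracket atoms such as `[NH4+]` are out of scope. It is a toy for reading SMILES, not a replacement for `Chem.MolFromSmiles`.

```python
import re
from collections import Counter

# Toy pattern (an assumption for illustration): two-letter halogens first,
# then aliphatic (uppercase) and aromatic (lowercase) organic-subset atoms.
# Bonds, branches, and ring-closure digits are simply ignored.
# Bracket atoms like [NH4+] are NOT supported.
ATOM_PATTERN = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")


def count_heavy_atoms(smiles):
    """Count heavy atoms per element in a simple, bracket-free SMILES string."""
    counts = Counter()
    for token in ATOM_PATTERN.findall(smiles):
        # 'c' -> 'C', 'Cl' -> 'Cl': aromatic and aliphatic atoms are merged
        counts[token.capitalize()] += 1
    return counts


for name, smiles in [("Ethanol", "CCO"),
                     ("Benzene", "c1ccccc1"),
                     ("Aspirin", "CC(=O)Oc1ccccc1C(=O)O")]:
    print(name, dict(count_heavy_atoms(smiles)))
```

For aspirin this gives 9 carbons and 4 oxygens, which agrees with its formula C9H8O4 (hydrogens are implicit in SMILES and are not counted here).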

In [None]:
# 1.1 Create molecules from SMILES
def smiles_to_molecule():
    """Convert SMILES strings to molecular objects and visualize"""
    
    # Example drug molecules
    molecules = {
        'Aspirin': 'CC(=O)Oc1ccccc1C(=O)O',
        'Caffeine': 'CN1C=NC2=C1C(=O)N(C(=O)N2C)C',
        'Ibuprofen': 'CC(C)Cc1ccc(cc1)C(C)C(=O)O',
        'Penicillin': 'CC1(C)SC2C(NC(=O)Cc3ccccc3)C(=O)N2C1C(=O)O'
    }
    
    print("Converting SMILES to Molecules")
    print("="*60)
    
    mol_objects = {}
    for name, smiles in molecules.items():
        mol = Chem.MolFromSmiles(smiles)
        if mol is not None:
            mol_objects[name] = mol
            print(f"\n{name}:")
            print(f"  SMILES: {smiles}")
            print(f"  Formula: {Chem.rdMolDescriptors.CalcMolFormula(mol)}")
            print(f"  Molecular Weight: {Descriptors.MolWt(mol):.2f} g/mol")
        else:
            print(f"\n❌ Failed to parse SMILES for {name}")
    
    return mol_objects

molecules = smiles_to_molecule()

In [None]:
# 1.2 Visualize molecular structures
def visualize_molecules(mol_dict):
    """Draw molecular structures"""
    
    # Create a grid of molecular structures
    mol_list = list(mol_dict.values())
    legends = list(mol_dict.keys())
    
    # Draw molecules
    img = Draw.MolsToGridImage(mol_list, molsPerRow=2, subImgSize=(300, 300),
                                legends=legends, returnPNG=False)
    
    return img

# Display molecules
visualize_molecules(molecules)

---
## Practice 2: Computing Molecular Descriptors

### 🎯 Learning Objectives
- Calculate physicochemical properties
- Understand Lipinski's Rule of Five
- Assess drug-likeness

### 📖 Key Concepts
**Lipinski's Rule of Five** (Drug-likeness criteria):
1. Molecular weight ≤ 500 Da
2. LogP ≤ 5
3. H-bond donors ≤ 5
4. H-bond acceptors ≤ 10

In [None]:
# 2.1 Calculate molecular descriptors
def calculate_descriptors(mol_dict):
    """Calculate key molecular descriptors"""
    
    descriptors = []
    
    for name, mol in mol_dict.items():
        desc = {
            'Name': name,
            'MW': Descriptors.MolWt(mol),
            'LogP': Descriptors.MolLogP(mol),
            'HBD': Descriptors.NumHDonors(mol),
            'HBA': Descriptors.NumHAcceptors(mol),
            'TPSA': Descriptors.TPSA(mol),
            'RotBonds': Descriptors.NumRotatableBonds(mol),
            'Rings': Descriptors.RingCount(mol)
        }
        descriptors.append(desc)
    
    df = pd.DataFrame(descriptors)
    
    print("\nMolecular Descriptors")
    print("="*80)
    print(df.to_string(index=False))
    
    return df

descriptor_df = calculate_descriptors(molecules)

In [None]:
# 2.2 Check Lipinski's Rule of Five
def check_lipinski(mol_dict):
    """Check drug-likeness using Lipinski's Rule of Five"""
    
    results = []
    
    print("\nLipinski's Rule of Five Assessment")
    print("="*60)
    
    for name, mol in mol_dict.items():
        mw = Descriptors.MolWt(mol)
        logp = Descriptors.MolLogP(mol)
        hbd = Descriptors.NumHDonors(mol)
        hba = Descriptors.NumHAcceptors(mol)
        
        # Check each rule
        mw_pass = mw <= 500
        logp_pass = logp <= 5
        hbd_pass = hbd <= 5
        hba_pass = hba <= 10
        
        violations = sum([not mw_pass, not logp_pass, not hbd_pass, not hba_pass])
        drug_like = violations <= 1  # Allow 1 violation
        
        print(f"\n{name}:")
        print(f"  MW ≤ 500: {mw:.1f} {'✓' if mw_pass else '✗'}")
        print(f"  LogP ≤ 5: {logp:.2f} {'✓' if logp_pass else '✗'}")
        print(f"  HBD ≤ 5: {hbd} {'✓' if hbd_pass else '✗'}")
        print(f"  HBA ≤ 10: {hba} {'✓' if hba_pass else '✗'}")
        print(f"  → Drug-like: {'Yes ✓' if drug_like else 'No ✗'} ({violations} violations)")
        
        results.append({
            'Name': name,
            'Violations': violations,
            'Drug-like': drug_like
        })
    
    return pd.DataFrame(results)

lipinski_results = check_lipinski(molecules)

---
## Practice 3: Molecular Fingerprints

### 🎯 Learning Objectives
- Generate molecular fingerprints
- Calculate molecular similarity
- Understand how fingerprints represent chemical structures

### 📖 Key Concepts
**Molecular Fingerprints:** Binary vectors representing structural features
- Morgan Fingerprints (Circular fingerprints)
- Tanimoto Similarity: The number of on-bits two fingerprints share, divided by the number of bits set in either one; ranges from 0 (no overlap) to 1 (identical)
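
The Tanimoto calculation can be sketched in plain Python by treating each fingerprint as the set of its on-bit indices. The function name `tanimoto` and the two example bit sets are hypothetical and chosen only for illustration. Returning 0.0 for two empty fingerprints is a convention picked for this sketch.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    n_union = len(bits_a | bits_b)
    if n_union == 0:
        # Convention chosen for this sketch: two empty fingerprints score 0.0
        return 0.0
    return len(bits_a & bits_b) / n_union


# Hypothetical fingerprints, written as sets of on-bit positions
fp_a = {1, 5, 9, 42}
fp_b = {5, 9, 42, 100, 512}

# 3 shared bits out of 6 distinct bits overall
print(tanimoto(fp_a, fp_b))  # 0.5
```

Because the score depends only on which bits are set, two molecules with no substructure bits in common score 0 regardless of fingerprint length.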

In [None]:
# 3.1 Generate molecular fingerprints
def generate_fingerprints(mol_dict):
    """Generate Morgan fingerprints for molecules"""
    
    fingerprints = {}
    
    print("Generating Molecular Fingerprints")
    print("="*60)
    
    for name, mol in mol_dict.items():
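        # Generate Morgan fingerprint (radius=2, 2048 bits)
        # Note: recent RDKit releases deprecate this call in favor of
        # rdFingerprintGenerator.GetMorganGenerator; it still works but may warn.
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)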
        fingerprints[name] = fp
        
        # Count number of set bits
        n_bits = fp.GetNumOnBits()
        print(f"{name}: {n_bits}/2048 bits set ({n_bits/2048*100:.1f}%)")
    
    return fingerprints

fingerprints = generate_fingerprints(molecules)

In [None]:
# 3.2 Calculate molecular similarity
def calculate_similarity(fp_dict):
    """Calculate Tanimoto similarity between molecules"""
    from rdkit import DataStructs
    
    names = list(fp_dict.keys())
    n = len(names)
    
    # Create similarity matrix
    similarity_matrix = np.zeros((n, n))
    
    for i in range(n):
        for j in range(n):
            similarity = DataStructs.TanimotoSimilarity(fp_dict[names[i]], fp_dict[names[j]])
            similarity_matrix[i, j] = similarity
    
    # Create DataFrame
    sim_df = pd.DataFrame(similarity_matrix, index=names, columns=names)
    
    print("\nTanimoto Similarity Matrix")
    print("="*60)
    print(sim_df.round(3))
    
    # Visualize as heatmap
    plt.figure(figsize=(8, 6))
    sns.heatmap(sim_df, annot=True, fmt='.2f', cmap='YlOrRd', 
                square=True, cbar_kws={'label': 'Tanimoto Similarity'})
    plt.title('Molecular Similarity Heatmap', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()
    
    return sim_df

similarity_matrix = calculate_similarity(fingerprints)

---
## Practice 4: Building a Simple QSAR Model

### 🎯 Learning Objectives
- Build a predictive model for molecular properties
- Train a classifier to predict drug activity
- Evaluate model performance

### 📖 Key Concepts
**QSAR (Quantitative Structure-Activity Relationship):**
Mathematical relationship between chemical structure and biological activity

In [None]:
# 4.1 Create synthetic dataset
def create_synthetic_dataset(n_samples=200):
    """Create synthetic molecular dataset for classification"""
    
    np.random.seed(42)
    
    # Generate random descriptors
    data = {
        'MW': np.random.uniform(200, 600, n_samples),
        'LogP': np.random.uniform(-2, 6, n_samples),
        'HBD': np.random.randint(0, 10, n_samples),
        'HBA': np.random.randint(0, 15, n_samples),
        'TPSA': np.random.uniform(0, 200, n_samples),
        'RotBonds': np.random.randint(0, 15, n_samples)
    }
    
    df = pd.DataFrame(data)
    
    # Create synthetic activity label
    # Active if: MW < 500, LogP < 5, and TPSA < 140
    df['Active'] = (
        (df['MW'] < 500) & 
        (df['LogP'] < 5) & 
        (df['TPSA'] < 140)
    ).astype(int)
    
    print(f"Created dataset with {n_samples} molecules")
    print(f"Active compounds: {df['Active'].sum()} ({df['Active'].mean()*100:.1f}%)")
    print(f"Inactive compounds: {(1-df['Active']).sum()} ({(1-df['Active']).mean()*100:.1f}%)")
    
    return df

dataset = create_synthetic_dataset()

In [None]:
# 4.2 Train QSAR classification model
def train_qsar_model(df):
    """Train Random Forest classifier for activity prediction"""
    
    # Prepare features and target
    feature_cols = ['MW', 'LogP', 'HBD', 'HBA', 'TPSA', 'RotBonds']
    X = df[feature_cols]
    y = df['Active']
    
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y
    )
    
    # Train model
    model = RandomForestClassifier(n_estimators=100, random_state=42, max_depth=10)
    model.fit(X_train, y_train)
    
    # Predictions
    y_pred = model.predict(X_test)
    
    # Evaluate
    accuracy = accuracy_score(y_test, y_pred)
    
    print("\nQSAR Model Performance")
    print("="*60)
    print(f"Accuracy: {accuracy:.3f}")
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred, target_names=['Inactive', 'Active']))
    
    # Feature importance
    importances = pd.DataFrame({
        'Feature': feature_cols,
        'Importance': model.feature_importances_
    }).sort_values('Importance', ascending=False)
    
    print("\nFeature Importance:")
    print(importances.to_string(index=False))
    
    # Visualize
    plt.figure(figsize=(8, 5))
    plt.barh(importances['Feature'], importances['Importance'], color='steelblue')
    plt.xlabel('Importance', fontsize=12)
    plt.title('Feature Importance in QSAR Model', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()
    
    return model, X_test, y_test, y_pred

qsar_model, X_test, y_test, y_pred = train_qsar_model(dataset)

---
## Practice 5: Virtual Screening Simulation

### 🎯 Learning Objectives
- Simulate a virtual screening workflow
- Filter compounds using multiple criteria
- Rank and select lead compounds

### 📖 Key Concepts
**Virtual Screening:** Computational method to filter large compound libraries
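
One common way to judge a screen is the enrichment factor: how much more frequent true actives are in the selected subset than in the whole library. The sketch below assumes the true active counts are known, as they would be in a retrospective benchmark. The function name `enrichment_factor` and all the numbers are hypothetical.

```python
def enrichment_factor(n_selected, actives_selected, n_total, actives_total):
    """Ratio of the active rate in the selection to the active rate in the library."""
    if n_selected <= 0 or n_total <= 0 or actives_total <= 0:
        raise ValueError("counts must be positive to compute an enrichment factor")
    # (actives_selected / n_selected) / (actives_total / n_total),
    # rearranged so the arithmetic uses a single division
    return (actives_selected * n_total) / (n_selected * actives_total)


# Hypothetical benchmark: 1000 compounds with 50 actives (5% base rate);
# the screen keeps 100 compounds, 20 of which are active (20% hit rate).
ef = enrichment_factor(n_selected=100, actives_selected=20,
                       n_total=1000, actives_total=50)
print(f"Enrichment factor: {ef:.1f}")  # 4.0
```

A value of 1 means the screen did no better than picking compounds at random, while values above 1 mean actives are concentrated in the selection.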

In [None]:
# 5.1 Virtual screening pipeline
def virtual_screening_pipeline(df, model):
    """Simulate virtual screening process"""
    
    print("Virtual Screening Pipeline")
    print("="*60)
    
    initial_count = len(df)
    print(f"\nStep 1: Initial library size: {initial_count} compounds")
    
    # Step 2: Lipinski filter
    lipinski_filter = (
        (df['MW'] <= 500) &
        (df['LogP'] <= 5) &
        (df['HBD'] <= 5) &
        (df['HBA'] <= 10)
    )
    df_filtered = df[lipinski_filter].copy()
    print(f"Step 2: After Lipinski filter: {len(df_filtered)} compounds ({len(df_filtered)/initial_count*100:.1f}%)")
    
    # Step 3: TPSA filter
    df_filtered = df_filtered[df_filtered['TPSA'] <= 140].copy()  # copy so the column added below is not written to a slice
    print(f"Step 3: After TPSA filter: {len(df_filtered)} compounds ({len(df_filtered)/initial_count*100:.1f}%)")
    
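    # Step 4: ML prediction
    # Note: in this demo `df` includes the molecules the model was trained on;
    # a real screen would score a library the model has not seen.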
    feature_cols = ['MW', 'LogP', 'HBD', 'HBA', 'TPSA', 'RotBonds']
    X_screen = df_filtered[feature_cols]
    
    # Get prediction probabilities
    proba = model.predict_proba(X_screen)[:, 1]  # Probability of being active
    df_filtered['Predicted_Activity'] = proba
    
    # Select top candidates (probability > 0.7)
    df_hits = df_filtered[df_filtered['Predicted_Activity'] > 0.7].copy()
    df_hits = df_hits.sort_values('Predicted_Activity', ascending=False)
    
    print(f"Step 4: After ML prediction: {len(df_hits)} hit compounds ({len(df_hits)/initial_count*100:.1f}%)")
    
    print("\n" + "="*60)
    print(f"✅ Screening complete: {len(df_hits)} hits identified")
    print(f"   Hit rate: {len(df_hits)/initial_count*100:.2f}% of original library")
    
    # Display top 5 hits
    print("\nTop 5 Hit Compounds:")
    print(df_hits[['MW', 'LogP', 'TPSA', 'Predicted_Activity']].head().to_string())
    
    return df_hits

hit_compounds = virtual_screening_pipeline(dataset, qsar_model)

---
## Practice 6: ADMET Property Prediction

### 🎯 Learning Objectives
- Predict ADMET properties
- Assess drug-likeness comprehensively
- Visualize property distributions

### 📖 Key Concepts
**ADMET:**
- **A**bsorption: Oral bioavailability
- **D**istribution: Tissue distribution
- **M**etabolism: Drug metabolism
- **E**xcretion: Elimination pathways
- **T**oxicity: Safety profile

In [None]:
# 6.1 Predict ADMET properties
def predict_admet_properties(df):
    """Predict basic ADMET properties"""
    
    df_admet = df.copy()
    
    # Simple rule-based ADMET predictions
    # (illustrative threshold heuristics for this exercise, not validated models)
    
    # Absorption (good if TPSA < 140 and MW < 500)
    df_admet['Good_Absorption'] = (
        (df_admet['TPSA'] < 140) & 
        (df_admet['MW'] < 500)
    ).astype(int)
    
    # BBB permeability (TPSA < 90 and MW < 400)
    df_admet['BBB_Permeable'] = (
        (df_admet['TPSA'] < 90) & 
        (df_admet['MW'] < 400)
    ).astype(int)
    
    # CYP450 substrate likelihood (LogP between 0 and 5)
    df_admet['CYP_Substrate'] = (
        (df_admet['LogP'] >= 0) & 
        (df_admet['LogP'] <= 5)
    ).astype(int)
    
    # Low toxicity (fewer rotatable bonds and HBA)
    df_admet['Low_Toxicity'] = (
        (df_admet['RotBonds'] < 10) & 
        (df_admet['HBA'] < 10)
    ).astype(int)
    
    # Overall drug-likeness score
    df_admet['ADMET_Score'] = (
        df_admet['Good_Absorption'] + 
        df_admet['BBB_Permeable'] + 
        df_admet['CYP_Substrate'] + 
        df_admet['Low_Toxicity']
    )
    
    print("ADMET Property Prediction Summary")
    print("="*60)
    print(f"Compounds with good absorption: {df_admet['Good_Absorption'].sum()} "
          f"({df_admet['Good_Absorption'].mean()*100:.1f}%)")
    print(f"BBB permeable compounds: {df_admet['BBB_Permeable'].sum()} "
          f"({df_admet['BBB_Permeable'].mean()*100:.1f}%)")
    print(f"Likely CYP substrates: {df_admet['CYP_Substrate'].sum()} "
          f"({df_admet['CYP_Substrate'].mean()*100:.1f}%)")
    print(f"Low toxicity compounds: {df_admet['Low_Toxicity'].sum()} "
          f"({df_admet['Low_Toxicity'].mean()*100:.1f}%)")
    
    # Visualize ADMET score distribution
    plt.figure(figsize=(10, 5))
    
    plt.subplot(1, 2, 1)
    df_admet['ADMET_Score'].value_counts().sort_index().plot(kind='bar', color='steelblue')
    plt.xlabel('ADMET Score', fontsize=11)
    plt.ylabel('Count', fontsize=11)
    plt.title('ADMET Score Distribution', fontsize=12, fontweight='bold')
    plt.xticks(rotation=0)
    
    plt.subplot(1, 2, 2)
    admet_props = ['Good_Absorption', 'BBB_Permeable', 'CYP_Substrate', 'Low_Toxicity']
    prop_means = [df_admet[prop].mean() * 100 for prop in admet_props]
    prop_labels = ['Absorption', 'BBB', 'CYP', 'Low Tox']
    plt.barh(prop_labels, prop_means, color='coral')
    plt.xlabel('Percentage (%)', fontsize=11)
    plt.title('ADMET Properties', fontsize=12, fontweight='bold')
    plt.xlim(0, 100)
    
    plt.tight_layout()
    plt.show()
    
    return df_admet

admet_results = predict_admet_properties(hit_compounds)

In [None]:
# 6.2 Select final drug candidates
def select_drug_candidates(df_admet, min_score=3):
    """Select final drug candidates based on ADMET score"""
    
    candidates = df_admet[df_admet['ADMET_Score'] >= min_score].copy()
    candidates = candidates.sort_values(['ADMET_Score', 'Predicted_Activity'], ascending=False)
    
    print(f"\nFinal Drug Candidates (ADMET Score ≥ {min_score})")
    print("="*70)
    print(f"Total candidates identified: {len(candidates)}")
    
    if len(candidates) > 0:
        print("\nTop 10 Candidates:")
        display_cols = ['MW', 'LogP', 'TPSA', 'Predicted_Activity', 'ADMET_Score']
        print(candidates[display_cols].head(10).to_string())
        
        # Summary statistics
        print("\nCandidate Statistics:")
        print(candidates[display_cols].describe().round(2))
    else:
        print("⚠️ No candidates meet the criteria. Consider relaxing filters.")
    
    return candidates

final_candidates = select_drug_candidates(admet_results, min_score=3)

---
## 🎯 Practice Complete!

### Summary of What We Learned:

1. **Molecular Representations**: SMILES notation and molecular visualization
2. **Molecular Descriptors**: Physicochemical properties and Lipinski's Rule
3. **Molecular Fingerprints**: Binary representations and similarity calculations
4. **QSAR Modeling**: Building predictive models for bioactivity
5. **Virtual Screening**: Filtering large compound libraries computationally
6. **ADMET Prediction**: Assessing drug-like properties

### Key Insights:
- Cheminformatics connects chemistry and machine learning
- Multiple filters reduce compound libraries to manageable sizes
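- ML models can learn descriptor-based rules for drug properties; the QSAR model here was trained on synthetic, threshold-derived labels, so its accuracy says little about performance on real assay data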
- ADMET properties are crucial for drug development success

### Next Steps:
- Learn Graph Neural Networks for molecular representation
- Explore generative models for de novo drug design
- Study protein-ligand docking simulations
- Understand drug-target interaction prediction

### 📚 Further Reading:
- RDKit Documentation: https://www.rdkit.org/docs/
- DeepChem: https://deepchem.io/
- ChEMBL Database: https://www.ebi.ac.uk/chembl/