# BBBP Dataset Analysis and Machine Learning Pipeline

## Lab Overview

In this lab, you will build a complete machine learning pipeline for predicting **Blood-Brain Barrier Penetration (BBBP)** using molecular fingerprints. The blood-brain barrier is a selective membrane that separates circulating blood from the brain extracellular fluid. Predicting whether a drug can cross this barrier is crucial in pharmaceutical research.

### Learning Objectives
By completing this lab, you will learn how to:
1. Download and preprocess molecular datasets
2. Parse SMILES strings using RDKit and generate molecular identifiers (InChI, InChIKey)
3. Handle data quality issues (parsing failures, duplicates, stereochemistry)
4. Generate Morgan molecular fingerprints with different parameters
5. Perform structure-based train/test splitting using Butina clustering
6. Train and evaluate XGBoost classifiers
7. Analyze model performance on known molecules

### Dataset
The **BBBP dataset** contains ~2000 molecules labeled with their blood-brain barrier penetration status:
- `p_np = 1`: The molecule **passes** the blood-brain barrier
- `p_np = 0`: The molecule **does not pass** the blood-brain barrier

### Tools Used
- **RDKit**: Chemistry library for molecular parsing and fingerprint generation
- **XGBoost**: Gradient boosting classifier
- **Pandas, NumPy, Matplotlib**: Data manipulation and visualization

In [None]:
# Import required libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from rdkit import Chem
from rdkit.Chem import MolToInchi, MolToInchiKey, Draw, rdFingerprintGenerator
from rdkit import DataStructs
from rdkit.ML.Cluster import Butina
import xgboost as xgb
from sklearn.metrics import accuracy_score
from IPython.display import display, Image

print("All libraries imported successfully!")

---
## Step 1: Download BBBP Dataset

The BBBP dataset is hosted on DeepChem's S3 storage. The function below downloads it directly as a Pandas DataFrame.

**Dataset columns:**
- `num`: Molecule number/identifier
- `name`: Molecule name
- `p_np`: Target label (1 = penetrates BBB, 0 = does not penetrate)
- `smiles`: SMILES string representation of the molecule

In [None]:
def download_bbbp_dataset():
    """
    Downloads the BBBP (Blood-Brain Barrier Penetration) dataset from DeepChem
    and returns it as a Pandas DataFrame.
    
    Returns:
        pd.DataFrame: The BBBP dataset with columns: num, name, p_np, smiles
    """
    url = "https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/BBBP.csv"
    df = pd.read_csv(url)
    return df

# Download the dataset
print("Downloading BBBP dataset...")
df = download_bbbp_dataset()
print("Dataset downloaded successfully!")
print(f"Shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print("\nFirst few rows:")
df.head()

---
## Step 2: Parse Molecules with RDKit and Add Identifiers

### Task Description
Not all SMILES strings can be parsed by RDKit. In this step, you need to:

1. **Iterate through each row** of the DataFrame
2. **Parse SMILES** using `Chem.MolFromSmiles(smiles)`
3. For successfully parsed molecules, add:
   - `mol`: RDKit molecule object
   - `inchi`: InChI string using `MolToInchi(mol)`
   - `inchikey`: InChIKey using `MolToInchiKey(mol)`
   - `id`: First block of the InChIKey (before the first `-`), a hash of the molecular connectivity, so stereoisomers of the same skeleton share one `id`
4. **Separate** successfully parsed molecules from failed ones

### Expected Output
- A DataFrame with successfully parsed molecules (with new columns: mol, inchi, inchikey, id)
- A DataFrame with molecules that failed to parse

### Hints
- `Chem.MolFromSmiles()` returns `None` if parsing fails
- Use try/except to handle any exceptions during parsing
- InChIKey format: `XXXXXXXXXXXXXX-YYYYYYYYYY-Z`, where XXXXXXXXXXXXXX is the connectivity layer
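
To make the InChIKey layout concrete, the sketch below splits two keys into their three blocks with plain string handling. The keys are assumed to be the standard InChIKeys of L- and D-alanine as listed in public databases (worth confirming with `MolToInchiKey`), and `split_inchikey` is a hypothetical helper, not an RDKit function.

```python
def split_inchikey(inchikey):
    """Split a standard InChIKey into its three hyphen-separated blocks."""
    first_block, second_block, flag = inchikey.split("-")
    return first_block, second_block, flag


# Assumed InChIKeys for the two alanine enantiomers (verify with MolToInchiKey).
keys = {
    "L-alanine": "QNAYBMKLOCPYGJ-REOHZTLPSA-N",
    "D-alanine": "QNAYBMKLOCPYGJ-UWTATZPHSA-N",
}

# The first block is what this lab stores in the 'id' column.
ids = {name: split_inchikey(key)[0] for name, key in keys.items()}
same_skeleton = ids["L-alanine"] == ids["D-alanine"]
print(ids, same_skeleton)
```

Under these assumptions the two enantiomers share the first block and differ only in the second, so deduplicating on `id` merges stereoisomers into one entry.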

In [None]:
def parse_molecules_with_rdkit(df):
    """
    Parses SMILES strings using RDKit and adds mol, inchi, inchikey, and id columns.
    Separates successfully parsed molecules from those that cannot be parsed.
    
    Args:
        df (pd.DataFrame): DataFrame with a 'smiles' column
        
    Returns:
        tuple: (parsed_df, failed_df) where:
            - parsed_df: DataFrame with successfully parsed molecules including 
                        mol, inchi, inchikey, and id columns
            - failed_df: DataFrame with molecules that could not be parsed
    """
    parsed_rows = []
    failed_rows = []
    
    # TODO: Implement the parsing logic
    # 1. Iterate through each row of df using df.iterrows()
    # 2. Get the SMILES string from row['smiles']
    # 3. Try to parse it with Chem.MolFromSmiles(smiles)
    # 4. If mol is None or an exception occurs, add the row to failed_rows
    # 5. If successful:
    #    - Create a dictionary from the row with row.to_dict()
    #    - Add 'mol' key with the parsed molecule
    #    - Add 'inchi' key using MolToInchi(mol)
    #    - Add 'inchikey' key using MolToInchiKey(mol)
    #    - Add 'id' key with first part of inchikey (use inchikey.split('-')[0])
    #    - Append the dictionary to parsed_rows
    
    pass  # Remove this line when implementing
    
    # Create dataframes
    parsed_df = pd.DataFrame(parsed_rows)
    failed_df = pd.DataFrame(failed_rows)
    
    return parsed_df, failed_df

# Parse molecules
print("Parsing molecules with RDKit...")
parsed_df, failed_df = parse_molecules_with_rdkit(df)

print(f"\nSuccessfully parsed: {len(parsed_df)} molecules")
print(f"Failed to parse: {len(failed_df)} molecules")
print(f"\nParsed dataframe columns: {parsed_df.columns.tolist()}")
print(f"\nFirst few parsed molecules:")
# Display without the mol column (which is an object)
display_cols = [col for col in parsed_df.columns if col != 'mol']
parsed_df[display_cols].head()

---
## Step 3: List Molecules That Failed to Parse

### Task Description
Create a function that nicely displays all molecules that failed RDKit parsing.

For each failed molecule, print:
- Number (num)
- Name
- SMILES string
- p_np value

This helps understand what types of SMILES strings cause parsing issues.

In [None]:
def list_failed_molecules(failed_df):
    """
    Lists all molecules that failed to parse with RDKit.
    
    Args:
        failed_df (pd.DataFrame): DataFrame with molecules that failed to parse
    """
    if len(failed_df) == 0:
        print("No failed molecules to list.")
        return
    
    print(f"\nAll {len(failed_df)} molecules that failed to parse:")
    print("=" * 80)
    
    # TODO: Iterate through failed_df and print each molecule's info
    # Use enumerate(failed_df.iterrows(), 1) to get an index starting from 1
    # For each row, print:
    #   - idx. Number: {num}
    #   - Name: {name}
    #   - SMILES: {smiles}
    #   - p_np: {p_np}
    
    pass  # Remove this line when implementing
    
    print("\n" + "=" * 80)

# List failed molecules
if len(failed_df) > 0:
    list_failed_molecules(failed_df)
else:
    print("No molecules failed to parse!")

---
## Step 4: Deduplication by ID and Show Molecules with Multiple Labels

### Task Description
Molecules that share an `id` (the same InChIKey connectivity block) may appear multiple times in the dataset, sometimes with different labels. This is a data quality issue.

1. **Group molecules by 'id'** (first part of InChIKey)
2. For each group, collect all unique `p_np` labels as a set
3. **Consistent molecules**: Groups where all entries have the same label (set has 1 element)
   - Keep only the first occurrence
4. **Inconsistent molecules**: Groups with conflicting labels (set has more than one element)
   - Draw these molecules using `Draw.MolsToGridImage()`
   - These should be excluded from further analysis

### Expected Output
- A deduplicated DataFrame with only consistent molecules
- A grid image showing molecules with inconsistent labels

### Hints
- Use `parsed_df.groupby('id')` to group by molecule ID
- `set(group['p_np'].values)` gives unique labels for a group
- `Draw.MolsToGridImage(mols, molsPerRow=4, legends=legends, subImgSize=(300, 300), returnPNG=False)`
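
The label-set idea can be tried without pandas first. The sketch below is a plain-Python analogue of the `groupby` approach; the `records` list and its placeholder ids are hypothetical and only illustrate the consistent versus inconsistent split.

```python
from collections import defaultdict

# Hypothetical (id, p_np) records; the ids are placeholders, not real InChIKey blocks.
records = [
    ("AAAAAAAAAAAAAA", 1),
    ("AAAAAAAAAAAAAA", 1),
    ("BBBBBBBBBBBBBB", 0),
    ("CCCCCCCCCCCCCC", 1),
    ("CCCCCCCCCCCCCC", 0),
]

# Collect the set of labels seen for each id.
labels_by_id = defaultdict(set)
for mol_id, label in records:
    labels_by_id[mol_id].add(label)

# One label per id means consistent; more than one means conflicting annotations.
consistent = sorted(i for i, labels in labels_by_id.items() if len(labels) == 1)
inconsistent = sorted(i for i, labels in labels_by_id.items() if len(labels) > 1)
print("consistent:", consistent)
print("inconsistent:", inconsistent)
```

A duplicated id with identical labels still counts as consistent, which is why only its first occurrence needs to be kept.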

In [None]:
def deduplicate_by_id(parsed_df, molsPerRow=4):
    """
    Deduplicates molecules by 'id' column. For each unique 'id', collects 'p_np' labels as a set.
    Returns dataframe with single-element label sets (consistent labels) and draws molecules
    with inconsistent labels (more than one element in the label set).
    
    Args:
        parsed_df (pd.DataFrame): DataFrame with parsed molecules including 'id' and 'p_np' columns
        molsPerRow (int): Number of molecules per row in the grid (default: 4)
        
    Returns:
        pd.DataFrame: Deduplicated dataframe with consistent labels (single-element label sets)
    """
    if len(parsed_df) == 0:
        print("No molecules to deduplicate.")
        return pd.DataFrame()
    
    # Group by 'id'; iterating over the result yields (id_val, group) pairs
    grouped = parsed_df.groupby('id')
    
    consistent_rows = []
    inconsistent_mols = []
    inconsistent_legends = []
    
    # TODO: Implement deduplication logic
    # 1. Iterate through grouped: for id_val, group in grouped:
    # 2. Get set of labels: label_set = set(group['p_np'].values)
    # 3. If len(label_set) == 1: 
    #    - Add first row to consistent_rows: consistent_rows.append(group.iloc[0].to_dict())
    # 4. If len(label_set) > 1 (inconsistent):
    #    - For each row in group, add mol to inconsistent_mols
    #    - Create legend with name, num, and p_np value
    
    pass  # Remove this line when implementing
    
    # Create dataframe with consistent labels
    deduplicated_df = pd.DataFrame(consistent_rows)
    
    print(f"\nDeduplication results:")
    print(f"  Total unique IDs: {len(grouped)}")
    print(f"  Consistent labels (single-element sets): {len(consistent_rows)}")
    print(f"  Inconsistent labels (multiple elements): {len(inconsistent_mols)} molecules")
    
    # Draw molecules with inconsistent labels
    if len(inconsistent_mols) > 0:
        print(f"\nDrawing {len(inconsistent_mols)} molecules with inconsistent labels...")
        img = Draw.MolsToGridImage(
            inconsistent_mols,
            molsPerRow=molsPerRow,
            legends=inconsistent_legends,
            subImgSize=(300, 300),
            returnPNG=False
        )
        display(img)
    else:
        print("\nNo molecules with inconsistent labels to display.")
    
    return deduplicated_df

# Deduplicate molecules
print("Deduplicating molecules by 'id'...")
deduplicated_df = deduplicate_by_id(parsed_df)

print(f"\nDeduplicated dataframe shape: {deduplicated_df.shape}")
print(f"Deduplicated dataframe columns: {deduplicated_df.columns.tolist()}")
print(f"\nFirst few deduplicated molecules:")
display_cols = [col for col in deduplicated_df.columns if col != 'mol']
deduplicated_df[display_cols].head()

---
## Step 5: Analyze Unassigned Stereocenters

### Task Description
Stereocenters (chiral centers) in molecules can be either:
- **Assigned**: R or S configuration specified
- **Unassigned**: Configuration marked as '?' (unknown)

Unassigned stereocenters indicate incomplete stereochemistry information.

1. For each molecule, find chiral centers using `Chem.FindMolChiralCenters(mol, includeUnassigned=True)`
2. Count centers where chirality is `'?'`
3. Add this count as a new column `unassigned_stereocenters`
4. List all molecules with at least one unassigned stereocenter
5. Create a histogram of unassigned stereocenter counts

### Hints
- `FindMolChiralCenters` returns list of tuples: `[(atom_idx, chirality), ...]`
- Chirality values: 'R', 'S', or '?' (unassigned)
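
The counting step works on the list of tuples alone, so it can be tested without RDKit. In the sketch below, `chiral_centers` is a hypothetical value shaped like the return of `Chem.FindMolChiralCenters(mol, includeUnassigned=True)`; the atom indices are made up.

```python
# Hypothetical result: (atom_idx, chirality) pairs, with '?' marking unassigned centers.
chiral_centers = [(1, "R"), (4, "?"), (7, "S"), (9, "?")]

# Count only the centers whose configuration is unknown.
unassigned = sum(1 for _, chirality in chiral_centers if chirality == "?")
assigned = len(chiral_centers) - unassigned
print(f"unassigned={unassigned}, assigned={assigned}")
```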

In [None]:
def analyze_unassigned_stereocenters(df):
    """
    Calculates the number of unassigned stereocenters for each molecule.
    Lists all molecules with at least one unassigned stereocenter and creates a histogram.
    
    Args:
        df (pd.DataFrame): DataFrame with parsed molecules including 'mol' column
    """
    if len(df) == 0:
        print("No molecules to analyze.")
        return
    
    unassigned_counts = []
    molecules_with_unassigned = []
    
    # TODO: Implement stereocenter analysis
    # 1. Iterate through df.iterrows()
    # 2. Get the mol from row['mol']
    # 3. Find chiral centers: chiral_centers = Chem.FindMolChiralCenters(mol, includeUnassigned=True)
    # 4. Count unassigned: sum(1 for _, chirality in chiral_centers if chirality == '?')
    # 5. Append count to unassigned_counts
    # 6. If count > 0, add molecule info to molecules_with_unassigned list
    
    pass  # Remove this line when implementing
    
    # Add unassigned_stereocenters column to dataframe
    df['unassigned_stereocenters'] = unassigned_counts
    
    # Print statistics
    total_with_unassigned = sum(1 for count in unassigned_counts if count > 0)
    print(f"\nStereocenter analysis:")
    print(f"  Total molecules analyzed: {len(df)}")
    print(f"  Molecules with unassigned stereocenters: {total_with_unassigned}")
    print(f"  Total unassigned stereocenters: {sum(unassigned_counts)}")
    if total_with_unassigned > 0:
        print(f"  Average unassigned stereocenters (for molecules with >0): {sum(unassigned_counts) / total_with_unassigned:.2f}")
    
    # TODO: List molecules with unassigned stereocenters (similar to list_failed_molecules)
    
    # TODO: Create histogram using plt.hist()
    # plt.figure(figsize=(10, 6))
    # plt.hist(unassigned_counts, bins=range(max(unassigned_counts) + 2), edgecolor='black', alpha=0.7)
    # plt.xlabel('Number of Unassigned Stereocenters')
    # plt.ylabel('Number of Molecules')
    # plt.title('Histogram of Unassigned Stereocenters')
    # plt.show()

# Analyze unassigned stereocenters
print("Analyzing unassigned stereocenters...")
analyze_unassigned_stereocenters(deduplicated_df)

---
## Step 6: Generate Morgan Fingerprints

### Task Description
**Morgan fingerprints** (circular fingerprints, closely related to ECFP) encode the atom environments of a molecule as a fixed-length binary vector. Key parameters:

- **Radius**: How many bonds out from each atom to consider (larger = more context; radius 2 corresponds roughly to ECFP4)
- **Length**: Size of the fingerprint vector (more bits = fewer bit collisions)

Generate 9 fingerprint variants using all combinations of:
- Radius: 1, 2, 3
- Length: 512, 1024, 2048

Column naming convention: `morgan_r{radius}_l{length}` (e.g., `morgan_r2_l1024`)

### Hints
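- Create generator: `fpgen = rdFingerprintGenerator.GetMorganGenerator(radius=radius, fpSize=length)` (pass `fpSize` by keyword; the second positional argument is `countSimulation`, not the length)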
- Generate fingerprint: `fp = fpgen.GetFingerprint(mol)`
- Convert to list: `list(fp)`
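
Why the length matters can be seen with a simplified model of folding: each structural feature gets an integer identifier, and the identifier is mapped onto the bit vector by taking it modulo the length. The sketch below is an illustration under that assumption, not RDKit's actual hashing; `fold_features` and the random `feature_ids` are hypothetical.

```python
import random


def fold_features(feature_ids, length):
    """Map integer feature IDs onto a fixed-length bit vector by modulo folding."""
    bits = [0] * length
    for feature_id in feature_ids:
        bits[feature_id % length] = 1
    return bits


# Hypothetical 32-bit identifiers standing in for hashed atom environments.
rng = random.Random(0)
feature_ids = [rng.randrange(2 ** 32) for _ in range(200)]

for length in (512, 1024, 2048):
    on_bits = sum(fold_features(feature_ids, length))
    collisions = len(set(feature_ids)) - on_bits
    print(f"length={length}: {on_bits} bits set, {collisions} features lost to collisions")
```

Two features that land on the same bit become indistinguishable to the model, so a longer vector keeps more of the structural information at the cost of more input columns.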

In [None]:
def assign_morgan_fingerprints(df):
    """
    Assigns Morgan fingerprints with different radius and length combinations.
    Creates 9 versions: radius 1, 2, 3 and length 512, 1024, 2048.
    
    Args:
        df (pd.DataFrame): DataFrame with parsed molecules including 'mol' column
        
    Returns:
        pd.DataFrame: New dataframe with original columns plus 9 Morgan fingerprint columns
    """
    if len(df) == 0:
        print("No molecules to process.")
        return pd.DataFrame()
    
    if 'mol' not in df.columns:
        print("Error: 'mol' column not found in dataframe.")
        return df
    
    # Create a copy of the dataframe
    new_df = df.copy()
    
    # Define radius and length combinations
    radii = [1, 2, 3]
    lengths = [512, 1024, 2048]
    
    print(f"\nGenerating Morgan fingerprints for {len(df)} molecules...")
    print(f"  Radius options: {radii}")
    print(f"  Length options: {lengths}")
    print(f"  Total combinations: {len(radii) * len(lengths)}")
    
    # TODO: Generate fingerprints for each combination
    # 1. Loop through radii and lengths (nested loops)
    # 2. Create column name: column_name = f'morgan_r{radius}_l{length}'
    # 3. Create generator: fpgen = rdFingerprintGenerator.GetMorganGenerator(radius=radius, fpSize=length)
    # 4. For each molecule in df:
    #    - Generate fingerprint: fp = fpgen.GetFingerprint(mol)
    #    - Convert to list: list(fp)
    #    - Handle None molecules gracefully
    # 5. Add fingerprint column to new_df
    # 6. Print progress: print(f"  Generated: {column_name}")
    
    pass  # Remove this line when implementing
    
    print(f"\nMorgan fingerprint assignment complete!")
    print(f"  New dataframe shape: {new_df.shape}")
    print(f"  New columns added: {len(radii) * len(lengths)}")
    
    return new_df

# Assign fingerprints
print("Assigning Morgan fingerprints...")
fingerprint_df = assign_morgan_fingerprints(deduplicated_df)

print(f"\nFingerprint dataframe shape: {fingerprint_df.shape}")
print(f"Fingerprint dataframe columns: {fingerprint_df.columns.tolist()}")
print(f"\nFirst few rows (showing non-fingerprint columns):")
# Display without the mol and fingerprint columns
display_cols = [col for col in fingerprint_df.columns 
               if col != 'mol' and not col.startswith('morgan_')]
fingerprint_df[display_cols].head()

---
## Step 7: Butina Split (Train/Test)

### Why Butina Clustering?
Random train/test splits can leak information when similar molecules appear in both sets. **Butina clustering** groups similar molecules together, then assigns entire clusters to either train or test, ensuring a more realistic evaluation.

The function below is provided for you. It:
1. Calculates Tanimoto distances between all molecule pairs
2. Clusters molecules using the Butina algorithm
3. Assigns clusters to train (~80%) or test (~20%) sets
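
The distance step can be sketched in plain Python to show what the provided function feeds into the clustering. The fingerprints below are hypothetical sets of on-bit positions, and `tanimoto_similarity` is an illustrative helper rather than the RDKit routine; the loop mirrors the lower-triangle ordering that `butina_split` builds before calling `Butina.ClusterData`.

```python
def tanimoto_similarity(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(bits_a & bits_b) / union


# Hypothetical fingerprints, each a set of on-bit positions.
fingerprints = [{1, 2, 3}, {2, 3, 4}, {10, 11}]

# Flattened lower triangle in the order (1,0), (2,0), (2,1), ...
dists = []
for i in range(1, len(fingerprints)):
    for j in range(i):
        dists.append(1.0 - tanimoto_similarity(fingerprints[i], fingerprints[j]))
print(dists)
```

With a distance cutoff of 0.7, the first two fingerprints (distance 0.5) would count as neighbors, while the third shares no bits with either and would end up alone.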

In [None]:
def butina_split(df, fingerprint_col='morgan_r2_l1024', train_ratio=0.8, cutoff=0.7):
    """
    Creates a Butina split of the dataset into train (~80%) and test (~20%) sets.
    Uses Butina clustering to ensure similar molecules are grouped together.
    
    Args:
        df (pd.DataFrame): DataFrame with molecules and fingerprint columns
        fingerprint_col (str): Name of the fingerprint column to use (default: 'morgan_r2_l1024')
        train_ratio (float): Desired ratio for training set (default: 0.8)
        cutoff (float): Distance cutoff for Butina clustering (default: 0.7)
        
    Returns:
        tuple: (train_df, test_df) - Training and test dataframes
    """
    if len(df) == 0:
        print("No molecules to split.")
        return pd.DataFrame(), pd.DataFrame()
    
    if fingerprint_col not in df.columns:
        print(f"Error: Fingerprint column '{fingerprint_col}' not found.")
        print(f"Available fingerprint columns: {[col for col in df.columns if col.startswith('morgan_')]}")
        return pd.DataFrame(), pd.DataFrame()
    
    print(f"\nPerforming Butina split...")
    print(f"  Using fingerprint: {fingerprint_col}")
    print(f"  Target train ratio: {train_ratio:.1%}")
    print(f"  Distance cutoff: {cutoff}")
    
    # Extract fingerprints and convert to RDKit format
    fingerprints = []
    valid_indices = []
    
    for idx, row in df.iterrows():
        fp_list = row[fingerprint_col]
        if fp_list is not None and len(fp_list) > 0:
            # Convert list of 0s and 1s back to RDKit BitVect
            bit_string = ''.join(['1' if bit else '0' for bit in fp_list])
            fp = DataStructs.CreateFromBitString(bit_string)
            fingerprints.append(fp)
            valid_indices.append(idx)
    
    if len(fingerprints) == 0:
        print("Error: No valid fingerprints found.")
        return pd.DataFrame(), pd.DataFrame()
    
    print(f"  Valid molecules: {len(fingerprints)}")
    
    # Calculate distance matrix (Tanimoto distance = 1 - Tanimoto similarity)
    nfps = len(fingerprints)
    dists = []
    for i in range(1, nfps):
        sims = DataStructs.BulkTanimotoSimilarity(fingerprints[i], fingerprints[:i])
        dists.extend([1 - x for x in sims])
    
    # Perform Butina clustering
    clusters = Butina.ClusterData(dists, nfps, cutoff, isDistData=True)
    print(f"  Number of clusters: {len(clusters)}")
    
    # Sort clusters by size (largest first)
    clusters_sorted = sorted(clusters, key=len, reverse=True)
    
    # Assign clusters to train or test to achieve target ratio
    train_indices = []
    test_indices = []
    total_assigned = 0
    
    for cluster in clusters_sorted:
        cluster_indices = [valid_indices[i] for i in cluster]
        current_ratio = len(train_indices) / max(total_assigned + len(cluster), 1)
        
        # Assign to train if we're below target ratio, otherwise to test
        if current_ratio < train_ratio:
            train_indices.extend(cluster_indices)
        else:
            test_indices.extend(cluster_indices)
        
        total_assigned += len(cluster)
    
    # Create train and test dataframes
    train_df = df.loc[train_indices].copy()
    test_df = df.loc[test_indices].copy()
    
    # Reset indices
    train_df = train_df.reset_index(drop=True)
    test_df = test_df.reset_index(drop=True)
    
    actual_train_ratio = len(train_df) / len(df)
    print(f"\nButina split complete!")
    print(f"  Train set: {len(train_df)} molecules ({actual_train_ratio:.1%})")
    print(f"  Test set: {len(test_df)} molecules ({1-actual_train_ratio:.1%})")
    print(f"  Total: {len(train_df) + len(test_df)} molecules")
    
    return train_df, test_df

# Create Butina split
print("Creating Butina split...")
train_df, test_df = butina_split(fingerprint_df)

print(f"\nTrain dataframe shape: {train_df.shape}")
print(f"Test dataframe shape: {test_df.shape}")
print(f"\nTrain set columns: {train_df.columns.tolist()}")
print(f"\nFirst few train molecules:")
display_cols = [col for col in train_df.columns 
               if col != 'mol' and not col.startswith('morgan_')]
train_df[display_cols].head()

---
## Step 8: Train XGBoost Models (9 Cases) and Show Accuracies

### Task Description
Train **XGBoost classifiers** for each of the 9 fingerprint combinations and compare their performance.

1. For each Morgan fingerprint column:
   - Extract fingerprints as numpy arrays (X_train, X_test)
   - Extract labels (y_train, y_test) from 'p_np' column
   - Train XGBoost classifier
   - Calculate test accuracy
2. Track the **best model** (highest accuracy)
3. Create a **bar plot** comparing accuracies

### XGBoost Parameters
```python
model = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1,
    random_state=42,
    eval_metric='logloss',
    use_label_encoder=False  # deprecated in newer XGBoost releases; drop this argument if it triggers a warning
)
```

### Hints
- Convert fingerprint list to numpy: `np.array(fp_list, dtype=np.float32)`
- Stack arrays: `X_train = np.array(list_of_arrays)`
- Use `accuracy_score(y_true, y_pred)` from sklearn
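
Accuracy is easier to interpret next to a trivial baseline. The sketch below computes accuracy by hand and compares it with always predicting the majority class; the labels and predictions are hypothetical, and `accuracy` is a stand-in for `accuracy_score` written out for illustration.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical test labels and model predictions.
y_true = [1, 1, 1, 0, 1, 0, 1, 1]
y_pred = [1, 1, 0, 0, 1, 0, 1, 1]

model_acc = accuracy(y_true, y_pred)

# Baseline: predict the most frequent label for every molecule.
majority_label = Counter(y_true).most_common(1)[0][0]
baseline_acc = accuracy(y_true, [majority_label] * len(y_true))
print(f"model={model_acc:.3f}, majority baseline={baseline_acc:.3f}")
```

If one class dominates the test set, a model only adds value to the extent that it beats this baseline.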

In [None]:
def train_xgboost_models(train_df, test_df):
    """
    Trains XGBoost models using all 9 Morgan fingerprint combinations.
    Calculates accuracy on test set for each model and keeps the best model.
    
    Args:
        train_df (pd.DataFrame): Training dataframe with fingerprint columns
        test_df (pd.DataFrame): Test dataframe with fingerprint columns
        
    Returns:
        tuple: (results_dict, best_model_info) where:
            - results_dict: Dictionary with fingerprint names as keys and accuracies as values
            - best_model_info: Dictionary with 'model', 'fingerprint_col', 'accuracy', 
                             'X_train', 'X_test', 'y_train', 'y_test' for the best model
    """
    if len(train_df) == 0 or len(test_df) == 0:
        print("Error: Train or test dataframes are empty.")
        return {}, {}
    
    # Get all Morgan fingerprint columns
    fingerprint_cols = [col for col in train_df.columns if col.startswith('morgan_')]
    
    if len(fingerprint_cols) == 0:
        print("Error: No Morgan fingerprint columns found.")
        return {}, {}
    
    print(f"\nTraining XGBoost models for {len(fingerprint_cols)} fingerprint combinations...")
    
    # Extract target variable
    y_train = train_df['p_np'].values
    y_test = test_df['p_np'].values
    
    results = {}
    best_model = None
    best_accuracy = -1
    best_fp_col = None
    best_X_train = None
    best_X_test = None
    best_y_train = None
    best_y_test = None
    
    # TODO: Train a model for each fingerprint column
    # For each fp_col in sorted(fingerprint_cols):
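    #   1. Extract fingerprints as numpy arrays, skipping rows without a fingerprint
    #      - train_mask = train_df[fp_col].notna().values
    #      - X_train = np.array([np.array(fp, dtype=np.float32) for fp in train_df.loc[train_mask, fp_col]])
    #      - y_train_valid = y_train[train_mask]  # labels aligned with the rows kept in X_train
    #      - Build X_test and y_test_valid from test_df in the same way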
    #   2. Create XGBoost model with parameters shown above
    #   3. Train: model.fit(X_train, y_train_valid)
    #   4. Predict: y_pred = model.predict(X_test)
    #   5. Calculate accuracy: accuracy = accuracy_score(y_test_valid, y_pred)
    #   6. Store in results dict: results[fp_col] = accuracy
    #   7. Track best model if accuracy > best_accuracy
    
    pass  # Remove this line when implementing
    
    # Print results
    print(f"\n" + "="*50)
    print("XGBoost Training Results:")
    print("="*50)
    for fp_col, acc in sorted(results.items(), key=lambda x: x[1], reverse=True):
        print(f"  {fp_col}: {acc:.4f}")
    
    # TODO: Create bar plot comparing accuracies
    # Use plt.bar() with fingerprint names on x-axis and accuracies on y-axis
    
    # Prepare best model info
    best_model_info = {}
    if best_model is not None:
        best_model_info = {
            'model': best_model,
            'fingerprint_col': best_fp_col,
            'accuracy': best_accuracy,
            'X_train': best_X_train,
            'X_test': best_X_test,
            'y_train': best_y_train,
            'y_test': best_y_test
        }
        print(f"\n" + "="*50)
        print(f"Best Model Summary:")
        print(f"  Fingerprint: {best_fp_col}")
        print(f"  Test Accuracy: {best_accuracy:.4f}")
        print("="*50)
    
    return results, best_model_info

# Train XGBoost models
print("Training XGBoost models...")
xgboost_results, best_model_info = train_xgboost_models(train_df, test_df)

print(f"\nFinal results dictionary:")
print(xgboost_results)

if best_model_info:
    print(f"\nBest model available for further evaluation:")
    print(f"  Fingerprint: {best_model_info.get('fingerprint_col', 'N/A')}")
    print(f"  Accuracy: {best_model_info.get('accuracy', 0):.4f}")

---
## Step 9: Plot Test Predictions (TP, TN, FP, FN)

### Task Description
Visualize model predictions by categorizing test molecules into:

- **True Positives (TP)**: Correctly predicted as 1 (passes BBB)
- **True Negatives (TN)**: Correctly predicted as 0 (doesn't pass)
- **False Positives (FP)**: Predicted 1 but actual 0
- **False Negatives (FN)**: Predicted 0 but actual 1

Draw molecule grids for each category.

### Hints
- Use `model.predict(X_test)` to get predictions
- Compare predictions with actual labels to categorize
- Use `Draw.MolsToGridImage()` to visualize molecules
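
The four categories follow directly from comparing each prediction with its label. The sketch below shows one way to collect the row positions per category; `categorize` and the toy label lists are hypothetical, and the returned positions could then be used to pick molecules and legends from the test set.

```python
def categorize(y_true, y_pred):
    """Return positions of TP, TN, FP and FN for binary labels (1 = positive)."""
    groups = {"TP": [], "TN": [], "FP": [], "FN": []}
    for i, (actual, pred) in enumerate(zip(y_true, y_pred)):
        if pred == 1 and actual == 1:
            groups["TP"].append(i)
        elif pred == 0 and actual == 0:
            groups["TN"].append(i)
        elif pred == 1 and actual == 0:
            groups["FP"].append(i)
        else:  # pred == 0 and actual == 1
            groups["FN"].append(i)
    return groups


# Hypothetical labels and predictions.
y_true = [1, 0, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1]
groups = categorize(y_true, y_pred)
print(groups)
```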

In [None]:
def plot_test_predictions(best_model_info, test_df, molsPerRow=4):
    """
    Plots test set molecules grouped by prediction outcome:
    - True positives (correctly predicted as 1)
    - True negatives (correctly predicted as 0)
    - False positives (predicted 1, actual 0)
    - False negatives (predicted 0, actual 1)
    
    Args:
        best_model_info (dict): Dictionary containing the best model and related data
        test_df (pd.DataFrame): Test dataframe with molecules
        molsPerRow (int): Number of molecules per row in grid (default: 4)
    """
    if not best_model_info or 'model' not in best_model_info:
        print("Error: Best model information not available.")
        return
    
    model = best_model_info['model']
    fingerprint_col = best_model_info['fingerprint_col']
    y_test = best_model_info['y_test']
    X_test = best_model_info['X_test']
    
    # Make predictions
    y_pred = model.predict(X_test)
    
    # TODO: Categorize predictions into TP, TN, FP, FN
    # 1. Get test indices (molecules with valid fingerprints)
    # 2. For each (prediction, actual) pair:
    #    - If pred == 1 and actual == 1: True Positive
    #    - If pred == 0 and actual == 0: True Negative
    #    - If pred == 1 and actual == 0: False Positive
    #    - If pred == 0 and actual == 1: False Negative
    # 3. Collect molecules and legends for each category
    
    true_positives = []
    true_negatives = []
    false_positives = []
    false_negatives = []
    
    tp_legends = []
    tn_legends = []
    fp_legends = []
    fn_legends = []
    
    # TODO: Fill in the categorization logic
    
    pass  # Remove this line when implementing
    
    print(f"\n" + "="*50)
    print("Test Set Prediction Analysis:")
    print("="*50)
    print(f"  True Positives (correctly predicted as 1): {len(true_positives)}")
    print(f"  True Negatives (correctly predicted as 0): {len(true_negatives)}")
    print(f"  False Positives (predicted 1, actual 0): {len(false_positives)}")
    print(f"  False Negatives (predicted 0, actual 1): {len(false_negatives)}")
    print(f"  Total test molecules: {len(y_test)}")
    
    # TODO: Draw molecule grids for each category using Draw.MolsToGridImage()

# Plot test predictions
if best_model_info:
    print("Plotting test set predictions...")
    plot_test_predictions(best_model_info, test_df)

---
## Step 10: Test Model on Known Molecules

### Task Description
Validate the model on molecules with **known BBB permeability**.

**Molecules that should pass the BBB (expected p_np=1):**
- Water, Glucose, Oxygen, CO2, Ethanol, Nicotine, Morphine, Tryptophan, Vitamin C

**Molecules that should NOT pass (expected p_np=0):**
- Sodium ion, Potassium ion, Doxorubicin, Glucose-6-phosphate, ADP

1. Parse each molecule's SMILES
2. Generate fingerprints using the same parameters as the best model
3. Make predictions
4. Identify **failed predictions** (wrong category)
5. Display molecules with incorrect predictions

### Hints
- Extract radius and length from fingerprint column name: `morgan_r{radius}_l{length}`
- Use `model.predict_proba(X)` to get probability scores
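
Recovering the parameters from the column name can also be done with a regular expression, which rejects names that do not follow the convention. This is a sketch of one possible alternative to string splitting; `parse_fingerprint_column` is a hypothetical helper.

```python
import re


def parse_fingerprint_column(column_name):
    """Extract (radius, length) from a name like 'morgan_r2_l1024'."""
    match = re.fullmatch(r"morgan_r(\d+)_l(\d+)", column_name)
    if match is None:
        raise ValueError(f"Unexpected fingerprint column name: {column_name!r}")
    return int(match.group(1)), int(match.group(2))


radius, length = parse_fingerprint_column("morgan_r2_l1024")
print(radius, length)
```

Failing loudly on a malformed name helps ensure that the known molecules are fingerprinted with the same radius and length as the training data.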

In [None]:
def test_on_known_molecules(best_model_info, passing_barrier=None, non_passing=None):
    """
    Tests the model on lists of known molecules that pass or don't pass the blood-brain barrier.
    
    Args:
        best_model_info (dict): Dictionary containing the best model and related data
        passing_barrier (list): List of dictionaries with 'name' and 'smiles' keys for molecules
                               that pass the barrier. If None, uses default list.
        non_passing (list): List of dictionaries with 'name' and 'smiles' keys for molecules
                           that don't pass the barrier. If None, uses default list.
    """
    if not best_model_info or 'model' not in best_model_info:
        print("Error: Best model information not available.")
        return
    
    # Default passing_barrier molecules list
    if passing_barrier is None:
        passing_barrier = [
            {'name': 'Water', 'smiles': 'O'},
            {'name': 'Glucose', 'smiles': 'C([C@@H]1[C@H]([C@@H]([C@H](C(O1)O)O)O)O)O'},
            {'name': 'Oxygen', 'smiles': 'O=O'},
            {'name': 'Carbon dioxide', 'smiles': 'O=C=O'},
            {'name': 'Ethanol', 'smiles': 'CCO'},
            {'name': 'Nicotine', 'smiles': 'CN1CCC[C@H]1c2cccnc2'},
            {'name': 'Morphine', 'smiles': 'CN1CC[C@]23C4=C5C=CC(=C4O)C[C@@H]2[C@@H]1C[C@@H]5[C@H](C3)O'},
            {'name': 'Tryptophan', 'smiles': 'c1ccc2c(c1)c(c[nH]2)CC(C(=O)O)N'},
            {'name': 'Vitamin C', 'smiles': 'C([C@@H]1[C@H](C(=C(C1=O)O)O)O)O'}
        ]
    
    # Default non_passing molecules list
    if non_passing is None:
        non_passing = [
            {'name': 'Sodium ion', 'smiles': '[Na+]'},
            {'name': 'Potassium ion', 'smiles': '[K+]'},
            {'name': 'Doxorubicin', 'smiles': 'CC1C(C(=O)C2=C(C1=O)C(=O)c3cc(O)ccc3O2)OC(=O)C4CCN(CCN4C)C'},
            {'name': 'Glucose-6-phosphate', 'smiles': 'C([C@@H]1[C@H]([C@@H]([C@H](C(O1)OP(=O)(O)O)O)O)O)O'},
            {'name': 'ADP', 'smiles': 'NC1=NC=NC2=C1N=CN2[C@@H]3O[C@H](COP(=O)(O)OP(=O)(O)O)[C@@H](O)[C@H]3O'}
        ]
    
    model = best_model_info['model']
    fingerprint_col = best_model_info['fingerprint_col']
    
    # Extract radius and length from fingerprint column name
    # Format: morgan_r{radius}_l{length}
    parts = fingerprint_col.replace('morgan_', '').split('_')
    radius = int(parts[0].replace('r', ''))
    length = int(parts[1].replace('l', ''))
    
    print(f"\n" + "="*50)
    print("Testing model on known molecules...")
    print(f"  Using fingerprint: {fingerprint_col}")
    print(f"  Radius: {radius}, Length: {length}")
    
    # TODO: Process molecules from both lists
    # 1. Create Morgan generator: fpgen = rdFingerprintGenerator.GetMorganGenerator(radius=radius, fpSize=length)
    # 2. For each molecule in passing_barrier:
    #    - Parse SMILES: mol = Chem.MolFromSmiles(smiles)
    #    - Generate fingerprint: fp = fpgen.GetFingerprint(mol)
    #    - Convert to numpy array: np.array(list(fp), dtype=np.float32)
    #    - Track the category ('passing_barrier')
    # 3. Do the same for non_passing molecules (category: 'non_passing')
    # 4. Stack fingerprints and make predictions
    # 5. Identify failed predictions:
    #    - passing_barrier predicted as 0 (should be 1)
    #    - non_passing predicted as 1 (should be 0)
    # 6. Display statistics and draw failed molecules
    
    pass  # Remove this line when implementing

# Test on known molecules
if best_model_info:
    print("Testing model on known molecules...")
    test_results = test_on_known_molecules(best_model_info)

---
## Summary

Congratulations! You've completed the BBBP analysis pipeline. Key takeaways:

1. **Data Quality Matters**: A small fraction of SMILES (roughly 0.5%) fail to parse, and some molecules carry inconsistent labels
2. **Stereochemistry**: Many molecules have unassigned stereocenters
3. **Fingerprint Choice**: Different radius/length combinations may give different results
4. **Butina Split**: Structure-based splitting provides more realistic evaluation than random splits
5. **Model Limitations**: Even good models make mistakes on seemingly simple molecules (like ions)

### Questions to Consider
- Why might the model fail on simple ions?
- How could you improve the model's performance?
- What other molecular representations could be used instead of Morgan fingerprints?