**Note**:

This Notebook is inspired by the following post:

https://practicalcheminformatics.blogspot.com/2019/11/visualizing-chemical-space.html

Created with Google Gemini v2.5 Pro

# Chemoinformatics: Chemical Space

In this notebook, we will explore the concept of chemical space.

We'll represent a set of molecules numerically using Morgan fingerprints and then use Principal Component Analysis (PCA) to reduce the high-dimensional data into a 2D plot.

This will allow us to visually identify which molecules are structurally related.

## Preparation

First, let's make sure we have the necessary libraries and set up our environment.

We'll also need the `%matplotlib inline` magic command so that plots display inside the Jupyter notebook.

In [None]:
# scikit-learn provides the PCA class imported in the next cell
%pip install rdkit scikit-learn numpy matplotlib

In [None]:
# Make sure plots appear in the notebook
%matplotlib inline

# Import necessary libraries
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem, Draw
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

## Define Our Molecule Dataset

Instead of a large dataset, we'll use a hand-picked set of 9 molecules.

This set includes a series of simple alcohols, some aromatic compounds, and three more complex drug molecules.

This should create some clear separation in the low-dimensional representation.

In [None]:
# A curated dictionary of molecules {name: SMILES}
molecule_dict = {
    # Alcohols
    "Methanol": "CO",
    "Ethanol": "CCO",
    "Propanol": "CCCO",

    # Aromatics
    "Benzene": "c1ccccc1",
    "Toluene": "Cc1ccccc1",
    "Phenol": "c1ccc(O)cc1",

    # Common Drugs
    "Aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O",
    "Caffeine": "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",
    "Paracetamol": "CC(=O)NC1=CC=C(O)C=C1"
}

# Create RDKit molecule objects and store them
mols = [Chem.MolFromSmiles(smiles) for smiles in molecule_dict.values()]
mol_names = list(molecule_dict.keys())

# Let's see our molecules
Draw.MolsToGridImage(mols, legends=mol_names, molsPerRow=3)

## Generate Molecular Fingerprints

Now, we need to convert these chemical structures into a format a machine learning algorithm can understand: numbers. We will use Morgan fingerprints (a type of circular fingerprint) for this. Each molecule will be represented as a long vector (a list of 0s and 1s) of a fixed length (e.g., 2048 bits).
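
To make the idea of a fixed-length bit vector concrete, here is a toy sketch of how features can be hashed into bits. It is not RDKit's Morgan algorithm: the feature labels, the helper `toy_fingerprint`, and the 64-bit length are all invented for illustration, and real Morgan fingerprints hash circular atom environments rather than text labels.

```python
import zlib

N_BITS = 64  # toy length chosen for readability; the notebook uses 2048


def toy_fingerprint(features, n_bits=N_BITS):
    """Fold hashed feature strings into a fixed-length list of 0s and 1s."""
    bits = [0] * n_bits
    for feature in features:
        # crc32 is deterministic across runs, unlike Python's built-in hash()
        index = zlib.crc32(feature.encode("utf-8")) % n_bits
        bits[index] = 1
    return bits


# Hypothetical substructure labels, made up for this example only
ethanol_like = ["C-C", "C-O", "O-H"]
propanol_like = ["C-C", "C-O", "O-H", "C-C-C"]

fp_a = toy_fingerprint(ethanol_like)
fp_b = toy_fingerprint(propanol_like)

# Bits that are switched on in both vectors
shared = sum(a & b for a, b in zip(fp_a, fp_b))

print(f"Vector length: {len(fp_a)}")
print(f"On-bits: {sum(fp_a)} vs {sum(fp_b)}, shared: {shared}")
```

Molecules with overlapping features switch on overlapping bits, which is what lets a downstream method such as PCA place them near each other. Because the vector is shorter than the number of possible features, two different features can occasionally land on the same bit.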

In [None]:
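# Generate Morgan fingerprints for each molecule
# Radius 2 corresponds to ECFP4 (diameter 4); nBits is the length of the bit vector
# Note: recent RDKit releases deprecate this function in favor of
# rdFingerprintGenerator.GetMorganGenerator; it still works but may emit a warning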
fp_list = []
for mol in mols:
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    fp_list.append(fp)

# Convert the RDKit explicit bit vectors into a NumPy array
# This is our high-dimensional data
np_fps = np.array(fp_list)

print(f"Shape of our fingerprint matrix: {np_fps.shape}")
print("This means we have 9 molecules, each described by 2048 features (bits).")

## Reduce Dimensions with PCA

Our data has 2048 dimensions, which is impossible to visualize directly. We'll use Principal Component Analysis (PCA) to find the two most important "principal components" that capture the most variance in the data. This effectively projects our 2048D data down to 2D.
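
For intuition about what the projection does, the sketch below performs the same kind of computation by hand with NumPy: center the columns, take a singular value decomposition, and project onto the first two axes. The 4 x 6 matrix is made up for illustration and is not real fingerprint data; scikit-learn's `PCA` also centers the data and relies on an SVD, but its internals may differ in detail.

```python
import numpy as np

# Tiny made-up binary matrix: 4 "molecules" x 6 "bits" (not real fingerprints)
X = np.array([
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 1],
], dtype=float)

# Step 1: center each column, since PCA measures variance around the mean
X_centered = X - X.mean(axis=0)

# Step 2: singular value decomposition; the rows of Vt are the principal axes
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Step 3: variance along each axis, expressed as a fraction of the total
variances = S**2 / (X.shape[0] - 1)
ratios = variances / variances.sum()

# Step 4: project the centered data onto the first two axes
coords_2d = X_centered @ Vt[:2].T

print("Explained variance ratios:", np.round(ratios, 3))
print("2D coordinates shape:", coords_2d.shape)
```

Note that the last ratio is essentially zero: after centering, n samples can span at most n - 1 directions. For the nine molecules in this notebook, that means no more than eight components can carry any variance, regardless of the 2048 input dimensions.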

In [None]:
# Initialize PCA to find the top 2 principal components
pca = PCA(n_components=2)

# Fit and transform the fingerprint data
crds = pca.fit_transform(np_fps)

print(f"Shape of our new coordinate matrix: {crds.shape}")
print("The data has been successfully projected from 2048D to 2D.")

## Inspecting the Explained Variance

Before we plot our 2D chemical space, let's check how much of the original variance is captured by our two principal components. This tells us how faithfully the 2D map represents the full 2048-dimensional fingerprint space.

In [None]:
# The pca object stores the explained variance ratio for each component
explained_variance = pca.explained_variance_ratio_

print(f"Variance explained by PC1: {explained_variance[0]:.2%}")
print(f"Variance explained by PC2: {explained_variance[1]:.2%}")
print(f"Total variance explained by the first two components: {explained_variance.sum():.2%}")

# Create a Scree Plot to visualize the explained variance
plt.figure(figsize=(8, 5))
# Bar plot for individual variance
plt.bar(range(len(explained_variance)), explained_variance, alpha=0.7, align='center',
        label='Individual explained variance')
# Line plot for cumulative variance
plt.plot(range(len(explained_variance)), np.cumsum(explained_variance), 'r-o',
         label='Cumulative explained variance')

plt.ylabel('Explained variance ratio')
plt.xlabel('Principal component index')
plt.xticks(range(len(explained_variance)), [f"PC{i+1}" for i in range(len(explained_variance))])
plt.title('Scree Plot')
plt.legend(loc='best')
plt.grid(True)
plt.show()

### Interpretation

The explained variance analysis tells you the quality of your projection.

For example, if the first two components explain 45% of the variance, the remaining 55% is not shown: the 2D plot is still useful for visualization, but it has compressed away over half of the variation in the fingerprints.

The scree plot is a standard way to visualize this and helps you decide how many components you might need to retain to capture a sufficient amount of information (e.g., you might decide you need 5 components to capture 80% of the variance).
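
The arithmetic behind both statements can be sketched in a few lines. The ratios below are made-up numbers chosen to match the 45% and 80% examples above, and `components_needed` is a hypothetical helper, not part of scikit-learn.

```python
from itertools import accumulate


def components_needed(ratios, threshold=0.80):
    """Return the smallest number of leading components whose cumulative
    explained variance ratio reaches the threshold, or None if it never does."""
    for count, cumulative in enumerate(accumulate(ratios), start=1):
        if cumulative >= threshold:
            return count
    return None


# Made-up ratios for illustration; real values would come from
# pca.explained_variance_ratio_
example_ratios = [0.30, 0.15, 0.14, 0.12, 0.10, 0.08, 0.06, 0.05]

# Fraction of the variance that a plot of PC1 and PC2 leaves out
lost_in_2d = 1.0 - sum(example_ratios[:2])

print(f"Variance not shown in a 2D plot: {lost_in_2d:.0%}")
print("Components needed for 80%:", components_needed(example_ratios))
```

To apply this to the notebook's own data, the PCA would have to be fitted with more than two components, because `explained_variance_ratio_` only contains one entry per fitted component.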

## Visualize the Chemical Space

Finally, let's create a scatter plot of our 2D data.

Each point will represent a molecule, and its position will be determined by the two principal components.

We'll label each point with the molecule's name to see how they are related.

In [None]:
# Create the scatter plot
plt.figure(figsize=(12, 8))
x = crds[:, 0] # First principal component
y = crds[:, 1] # Second principal component

plt.scatter(x, y)

# Add labels to each point
for i, name in enumerate(mol_names):
    plt.text(x[i] + 0.05, y[i] + 0.05, name, fontsize=12)

# Add titles and labels
plt.title("Chemical Space of 9 Molecules (PCA)", fontsize=16)
plt.xlabel("Principal Component 1", fontsize=12)
plt.ylabel("Principal Component 2", fontsize=12)
plt.grid(True)
plt.show()

### Task 1: How Trustworthy is Our 2D Map?

Look at the scree plot of the explained variance, and then look at our final 2D chemical space scatter plot. We have projected 2048-dimensional data onto a simple 2D map.

**Your Task:**
Critically evaluate this 2D representation.

Based on the total explained variance you calculated, what percentage of the original structural information is missing from our 2D map?

Does this 2D representation agree with your chemical expectations? Considering the information loss, why can we still be confident that the clear separation between the alcohols and the aromatics along the PC2 axis is a meaningful chemical distinction?

The plot only shows PC1 and PC2. What do you hypothesize is the most likely structural difference being represented by PC3 (the largest source of variance not shown on our plot)?

### Task 2: Explain the Layout of our 2D Chemical Space Plot

For each question below, justify the molecule's position by comparing its chemical structure to that of its neighbors.

**Q1**: The plot shows Caffeine is the most isolated molecule, pushed far out along the PC1 axis. Identify at least two unique structural features of Caffeine that are absent in all other molecules and explain why they make it the primary outlier.

**Q2**: Locate Benzene, Toluene, and Phenol. They all share a core benzene ring, yet they are spread out. Explain their specific order on the plot. Why is Toluene much closer to Phenol than Benzene is?

**Q3**: The alcohols (Methanol, Ethanol, Propanol) are clustered very tightly. What single, systematically changing feature is responsible for the small separation that does exist between them?

### Interpretation

A close inspection of the PCA plot reveals a clear hierarchy in how the structural variance is captured. Instead of a single, continuous gradient, the principal components have separated the molecules based on their most distinct features.

**Principal Component 1 (The X-axis)**: The "Caffeine Axis"
The first principal component, which explains the largest amount of variance, is overwhelmingly dominated by a single molecule: Caffeine. It is pushed far out to one side, while all other eight molecules are clustered together on the other.

This happens because Caffeine is the most structurally unique molecule in our dataset. Its nitrogen-rich, fused heterocyclic purine ring system is fundamentally different from the simple aliphatic alcohols and the single-ring aromatic compounds. PC1 has therefore identified the presence of this specific scaffold as the single biggest source of variation in the entire dataset.

**Principal Component 2 (The Y-axis)**: The "Aromatic vs. Aliphatic Axis"
With Caffeine's profound uniqueness captured by PC1, the second principal component is free to explain the next largest source of variance among the remaining eight molecules. This axis clearly separates the molecules into two fundamental chemical families:

On one side of the Y-axis, you'll find the alcohols (Methanol, Ethanol, Propanol).

On the other side, you'll find all the molecules that contain a benzene ring (Benzene, Toluene, Phenol, Aspirin, and Paracetamol).

**In conclusion**, this plot is a clear illustration of how PCA organizes chemical space. It first isolated the biggest outlier (Caffeine) along PC1 and then used PC2 to organize the remaining, more closely related molecules based on their most significant shared feature: the presence or absence of an aromatic ring.
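
The same pattern can be reproduced on synthetic data. The sketch below is an illustration under invented assumptions, not the notebook's fingerprints: two small families that each share a couple of bits, plus one "outlier" row with eight bits that nothing else has. All names and bit positions are hypothetical.

```python
import numpy as np


def bits_to_row(on_bits, n_bits):
    """Build a 0/1 vector with the given positions switched on."""
    row = np.zeros(n_bits)
    row[list(on_bits)] = 1.0
    return row


n_bits = 14
# Made-up bit patterns: families A and B, plus one row with many unique bits
rows = {
    "A1": [0, 1],
    "A2": [0, 1, 2],
    "B1": [3, 4],
    "B2": [3, 4, 5],
    "Outlier": [6, 7, 8, 9, 10, 11, 12, 13],
}
names = list(rows)
X = np.array([bits_to_row(bits, n_bits) for bits in rows.values()])

# PCA by hand: center the columns, then take the SVD
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
scores = U * S  # coordinates of each row along the principal axes

for name, (pc1, pc2) in zip(names, scores[:, :2]):
    print(f"{name:8s} PC1={pc1:+.2f}  PC2={pc2:+.2f}")
```

In this toy case the outlier sits alone at one end of PC1 with the other four rows on the opposite side, and PC2 then separates family A from family B. The signs of the axes are arbitrary, so only the relative arrangement is meaningful.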