Created with Google Gemini v2.5 Pro

# Chemoinformatics: Working with SMILES

## Preparation

To run this notebook, you need RDKit installed. The command below also installs networkx and matplotlib.

In [None]:
pip install rdkit networkx matplotlib

In [None]:
# Import necessary libraries
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem.rdMolDescriptors import CalcMolFormula
from IPython.display import display

IPythonConsole.drawOptions.addAtomIndices = True # Optional: helps in visualizing atom numbers



## 1. Introduction: What is SMILES?

In chemoinformatics, we need a way to represent complex molecular structures as simple text. While file formats like .mol exist, they can be very long.

SMILES (Simplified Molecular-Input Line-Entry System) is a popular solution that encodes a molecule's 2D graph structure into a single line of text using ASCII characters. This makes it incredibly efficient for storing and searching through large chemical databases.
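
To make the "graph" idea concrete, here is a minimal sketch that parses a SMILES string and lists the nodes (atoms) and edges (bonds) RDKit builds from it. Ethanol is chosen only for illustration, and the variable names are arbitrary. Hydrogens are implicit and therefore do not appear as separate nodes.

```python
from rdkit import Chem

mol = Chem.MolFromSmiles("CCO")  # ethanol

# Nodes of the molecular graph: (atom index, element symbol)
atoms = [(atom.GetIdx(), atom.GetSymbol()) for atom in mol.GetAtoms()]

# Edges of the molecular graph: (begin atom index, end atom index)
bonds = [(bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()) for bond in mol.GetBonds()]

print("Atoms:", atoms)
print("Bonds:", bonds)
```

Three heavy atoms and two bonds are all that is needed to describe ethanol's connectivity, which is why the SMILES string can be so short.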

## 2. From Molecule to SMILES and Back Again

The most basic tasks are converting a known molecule into a SMILES string and, conversely, creating a molecule object from a SMILES string.

**Molecule to SMILES**

Let's create an RDKit molecule object for Aspirin and then generate its SMILES string.



In [None]:
# Aspirin has the chemical formula C9H8O4
# We can define it with a valid SMILES string (written here in Kekulé form, which is not RDKit's canonical form)
aspirin_smiles = 'CC(=O)OC1=CC=CC=C1C(=O)O'

# Create a molecule object from the SMILES string
aspirin_mol = Chem.MolFromSmiles(aspirin_smiles)

In [None]:
# Now, let's generate the SMILES string *from* the molecule object
generated_smiles = Chem.MolToSmiles(aspirin_mol)

In [None]:
print(f"Original SMILES:    {aspirin_smiles}")
print(f"RDKit-generated SMILES: {generated_smiles}")

In [None]:
# Visualize the molecule to confirm
aspirin_mol

**SMILES to Molecule (Visualization)**

Now let's take a few SMILES strings and visualize the molecules they represent.

In [None]:
# A list of molecules represented by their SMILES strings
smiles_list = [
    'CCO',         # Ethanol
    'c1ccccc1',    # Benzene
    'C1CCCCC1',    # Cyclohexane
    'N[C@@H](C)C(=O)O' # L-Alanine (notice the stereochemistry symbols)
]

# Convert the SMILES strings to molecule objects
mol_list = [Chem.MolFromSmiles(s) for s in smiles_list]

# Draw the molecules in a grid
Draw.MolsToGridImage(mol_list, molsPerRow=4, legends=['Ethanol', 'Benzene', 'Cyclohexane', 'L-Alanine'])

Notice that aromatic atoms are written as lowercase letters (like `c` in benzene).
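
The lowercase notation is a shorthand, not a different molecule. As a small sketch (the variable names are illustrative), the following checks that benzene written with alternating double bonds (Kekulé form) is perceived as aromatic by RDKit and yields the same output SMILES as the lowercase form.

```python
from rdkit import Chem

kekule_benzene = Chem.MolFromSmiles("C1=CC=CC=C1")  # explicit alternating double bonds
aromatic_benzene = Chem.MolFromSmiles("c1ccccc1")   # lowercase aromatic notation

# After parsing, RDKit perceives aromaticity, so every atom is flagged aromatic
all_aromatic = all(atom.GetIsAromatic() for atom in kekule_benzene.GetAtoms())

# Both inputs should produce the same SMILES on output
same_output = Chem.MolToSmiles(kekule_benzene) == Chem.MolToSmiles(aromatic_benzene)

print(f"All atoms aromatic: {all_aromatic}")
print(f"Same output SMILES: {same_output}")
```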

## 3. Understanding SMILES Rules: A Hands-On Guide

SMILES has a set of rules for representing different structural features. Let's explore the most important ones.

**Bonds**

Single bonds are often implied or can be written with a `-`

Double bonds use `=`

Triple bonds use `#`
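
To see how these symbols are interpreted, the sketch below parses acetonitrile (chosen only as an illustration) and inspects the bond types RDKit assigns. It also checks that writing single bonds explicitly with `-` gives the same molecule as leaving them implied, using propane as the example.

```python
from rdkit import Chem

# Acetonitrile: an implied single bond followed by an explicit triple bond
mol = Chem.MolFromSmiles("CC#N")
cc_bond = mol.GetBondBetweenAtoms(0, 1).GetBondType()
cn_bond = mol.GetBondBetweenAtoms(1, 2).GetBondType()
print(f"C-C bond: {cc_bond}")
print(f"C-N bond: {cn_bond}")

# Propane with implied vs. explicit single bonds
implied = Chem.MolToSmiles(Chem.MolFromSmiles("CCC"))
explicit = Chem.MolToSmiles(Chem.MolFromSmiles("C-C-C"))
print(f"Implied and explicit single bonds agree: {implied == explicit}")
```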



### Exercise 1: Bonds

Create molecular graphs for ethane, ethene, and ethyne using SMILES.

In [None]:
# Ethane (single bond), ethene (double bond) and ethyne (triple bond)
ethane_mol = Chem.MolFromSmiles( ) # your code here
ethene_mol = Chem.MolFromSmiles( ) # your code here
ethyne_mol = Chem.MolFromSmiles( ) # your code here

Draw.MolsToGridImage([ethane_mol, ethene_mol, ethyne_mol], legends=['Ethane (C-C)', 'Ethene (C=C)', 'Ethyne (C#C)'])

#### 🔍 Solution (Hidden - Expand to see the answer)

<details>
<summary>Click here to see the solution after you've tried exploring the functions</summary>

```
# Ethane (single bond), ethene (double bond) and ethyne (triple bond)
ethane_mol = Chem.MolFromSmiles('C-C')  # 'CC' is equivalent: single bonds are implied
ethene_mol = Chem.MolFromSmiles('C=C')
ethyne_mol = Chem.MolFromSmiles('C#C')

Draw.MolsToGridImage([ethane_mol, ethene_mol, ethyne_mol], legends=['Ethane (C-C)', 'Ethene (C=C)', 'Ethyne (C#C)'])
```

</details>

**Branches**

Branches from the main chain are enclosed in parentheses ().



In [None]:
# Isobutane: A propane chain with a methyl branch on the central carbon
isobutane_smiles = 'CC(C)C'
isobutane_mol = Chem.MolFromSmiles(isobutane_smiles)

display(isobutane_mol)

print(f"SMILES for Isobutane: {isobutane_smiles}")

*Reading the SMILES:*

C (atom 0) is connected to C (atom 1), which has a branch (C) (atom 2) and is also connected to another C (atom 3).
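
The reading above can be checked programmatically. This sketch (the dictionary name `neighbors` is just for illustration) lists, for each atom index, the indices of the atoms it is bonded to.

```python
from rdkit import Chem

mol = Chem.MolFromSmiles("CC(C)C")  # isobutane

# Map each atom index to the sorted indices of its bonded neighbors
neighbors = {
    atom.GetIdx(): sorted(n.GetIdx() for n in atom.GetNeighbors())
    for atom in mol.GetAtoms()
}

for idx, nbrs in neighbors.items():
    print(f"Atom {idx} is bonded to atoms {nbrs}")
```

Atom 1 is the only atom with three neighbors, which marks it as the branching point.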

**Rings**

Rings are handled by "breaking" one bond and labeling the two atoms involved with a matching number.

In [None]:
# Cyclohexane: A 6-carbon ring
cyclohexane_smiles = 'C1CCCCC1'
cyclohexane_mol = Chem.MolFromSmiles(cyclohexane_smiles)

display(cyclohexane_mol)

print(f"SMILES for Cyclohexane: {cyclohexane_smiles}")

*Reading the SMILES:*

The 1 after the first C indicates it's connected to another atom also labeled 1. The parser follows the chain (CCCCC) until it finds the closing 1, forming the ring.
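
As a quick check of this reading, the sketch below asks RDKit for the rings it found and confirms that the two atoms carrying the ring-closure label (indices 0 and 5) are indeed bonded. The variable names are illustrative.

```python
from rdkit import Chem

mol = Chem.MolFromSmiles("C1CCCCC1")  # cyclohexane

ring_info = mol.GetRingInfo()
num_rings = ring_info.NumRings()
ring_atoms = sorted(ring_info.AtomRings()[0])  # atom indices in the first ring

# The ring-closure digit joins the first and last atoms of the chain
closure_bond = mol.GetBondBetweenAtoms(0, 5)

print(f"Number of rings: {num_rings}")
print(f"Atoms in the ring: {ring_atoms}")
print(f"Bond between atoms 0 and 5 exists: {closure_bond is not None}")
```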

## 4. The Problem of Uniqueness: Canonical SMILES

A single molecule can be represented by many different valid SMILES strings, depending on which atom you start from and which path you take.

In [None]:
# Let's represent ethanol in two different ways
ethanol_mol_1 = Chem.MolFromSmiles('CCO')
ethanol_mol_2 = Chem.MolFromSmiles('OCC')

Draw.MolsToGridImage([ethanol_mol_1, ethanol_mol_2], legends=['SMILES CCO', 'SMILES OCC'])

In [None]:
# Generate SMILES from both, first *without* canonicalization
smiles_1 = Chem.MolToSmiles(ethanol_mol_1, canonical=False)
smiles_2 = Chem.MolToSmiles(ethanol_mol_2, canonical=False)

In [None]:
print(f"SMILES from 'CCO': {smiles_1}")
print(f"SMILES from 'OCC': {smiles_2}")

In [None]:
print(f"Are they the same?\n {smiles_1 == smiles_2}")

In [None]:
# Generate SMILES from both, but this time ask for the *canonical* version
smiles_1 = Chem.MolToSmiles(ethanol_mol_1, canonical=True)
smiles_2 = Chem.MolToSmiles(ethanol_mol_2, canonical=True)

In [None]:
print(f"SMILES from 'CCO': {smiles_1}")
print(f"SMILES from 'OCC': {smiles_2}")

In [None]:
print(f"Are they the same?\n {smiles_1 == smiles_2}")

This is a critical concept.

To make database lookups and comparisons reliable, we use a **canonical representation**.

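Canonicalization algorithms such as the **Morgan algorithm** or **CANGEN** assign a unique, unambiguous order to the atoms based on their properties and connectivity. Within a given toolkit, this ensures that every molecule has **exactly one canonical SMILES string**. Different toolkits may produce different canonical forms, so only compare strings generated by the same software.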


**Note:** When you use Chem.MolToSmiles() in RDKit, it generates the canonical SMILES by default.
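
One way to put this to work is a small comparison helper. The functions `canonical` and `same_molecule` below are hypothetical names introduced for this sketch and are not part of RDKit. Note that this approach treats tautomers and different protonation states as different molecules.

```python
from rdkit import Chem

def canonical(smiles):
    """Return RDKit's canonical SMILES for the input (hypothetical helper)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # MolFromSmiles returns None when parsing fails
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    return Chem.MolToSmiles(mol)

def same_molecule(smiles_a, smiles_b):
    """Compare two SMILES strings by their canonical forms (hypothetical helper)."""
    return canonical(smiles_a) == canonical(smiles_b)

print(same_molecule("CCO", "OCC"))  # two spellings of ethanol
print(same_molecule("CCO", "COC"))  # ethanol vs. dimethyl ether
```

Comparing the raw input strings would report the two ethanol spellings as different; comparing canonical forms does not.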

# Exercises: Creating SMILES

### Exercise 2

Write the SMILES string for Toluene (a benzene ring with a methyl group attached).

#### 🔍 Solution (Hidden - Expand to see the answer)

<details>
<summary>Click here to see the solution after you've tried exploring the functions</summary>

```
# Write the SMILES string for Toluene
toluene_smiles = 'c1ccccc1C' # Or Cc1ccccc1
```

</details>

### Exercise 3
Generate the molecule from your SMILES string and visualize it to check your answer of Exercise 2.

#### 🔍 Solution (Hidden - Expand to see the answer)

<details>
<summary>Click here to see the solution after you've tried exploring the functions</summary>

```
# Generate and visualize the molecule
toluene_mol = Chem.MolFromSmiles(toluene_smiles)

display(toluene_mol)

print(f"SMILES for Toluene: {toluene_smiles}")
```

</details>

### Exercise 4

Write at least two different valid (but non-canonical) SMILES strings for acetic acid (CC(=O)O).

#### 🔍 Solution (Hidden - Expand to see the answer)

<details>
<summary>Click here to see the solution after you've tried exploring the functions</summary>

```
# Non-canonical SMILES strings for acetic acid
non_canonical_1 = 'OC(=O)C'
non_canonical_2 = 'C(C)(O)=O'

# Let's verify they all produce the same canonical SMILES
mol1 = Chem.MolFromSmiles('CC(=O)O')
mol2 = Chem.MolFromSmiles(non_canonical_1)
mol3 = Chem.MolFromSmiles(non_canonical_2)

print(f"Canonical SMILES of 'CC(=O)O': {Chem.MolToSmiles(mol1)}")
print(f"Canonical SMILES of '{non_canonical_1}': {Chem.MolToSmiles(mol2)}")
print(f"Canonical SMILES of '{non_canonical_2}': {Chem.MolToSmiles(mol3)}")

print("\nVisual confirmation of the structures:")
display(Draw.MolsToGridImage([mol1, mol2, mol3], legends=["Canonical", non_canonical_1, non_canonical_2]))
```

</details>

# Advanced Exercise: Isomers and Canonicalization

Molecules that share the same chemical formula but have different structures are called isomers.

A tool like *Surge* is a "chemical graph generator" designed to create all possible isomers for a given formula.

For example, running

> surge -S C9H8O4 -oC9H8O4.smi

would generate thousands of SMILES strings, each representing a unique isomer.

This creates a challenge:

**how do we know if two SMILES strings represent the same molecule or different isomers?**

This is where canonical SMILES becomes essential.
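
A minimal sketch of the idea, using a small hand-written list rather than real generator output: group SMILES strings by molecular formula, then collapse duplicates by canonical SMILES. The list `candidates` and the dictionary `isomers_by_formula` are illustrative names.

```python
from collections import defaultdict

from rdkit import Chem
from rdkit.Chem.rdMolDescriptors import CalcMolFormula

# Hand-written example input: three spellings of ethanol plus dimethyl ether
candidates = ["CCO", "OCC", "C(O)C", "COC"]

# formula -> set of distinct canonical SMILES sharing that formula
isomers_by_formula = defaultdict(set)
for smi in candidates:
    mol = Chem.MolFromSmiles(smi)
    if mol is None:  # skip strings that fail to parse
        continue
    isomers_by_formula[CalcMolFormula(mol)].add(Chem.MolToSmiles(mol))

for formula, smiles_set in isomers_by_formula.items():
    print(f"{formula}: {len(smiles_set)} distinct structure(s) -> {sorted(smiles_set)}")
```

Four input strings reduce to two distinct structures with the same formula, which is exactly the isomer relationship.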

### Exercise 5: Aspirin and its Isomers (C₉H₈O₄)

Aspirin has the formula C9H8O4. But so do many other molecules. Let's look at a few.

**Your Task:**

1. Create RDKit molecule objects from the three SMILES strings below.
2. For each molecule, calculate its molecular formula to confirm they are all isomers.
3. Generate the canonical SMILES for each one. Are they different?
4. Visualize all three in a grid to see their different structures.

In [None]:
# SMILES for Aspirin and two of its isomers
aspirin_and_isomers_smiles = {
    "Aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O",
    "3-Acetoxybenzoic acid": "CC(=O)Oc1cccc(C(=O)O)c1",
    "Phenylmalonic acid": "O=C(O)C(C(=O)O)c1ccccc1" # Note: Formula is C9H8O4
}

# 1. Create a list of molecule objects
mols = [Chem.MolFromSmiles(s) for s in aspirin_and_isomers_smiles.values()]
names = list(aspirin_and_isomers_smiles.keys())

# 2. & 3. Loop through, calculate formula, get canonical SMILES, and print
print("--- Isomers of C9H8O4 ---")
for i, mol in enumerate(mols):
    formula = CalcMolFormula(# --- YOUR CODE GOES HERE ---)
    canonical_smiles = # --- YOUR CODE GOES HERE ---
    print(f"Name: {names[i]}")
    print(f"  Formula: {formula}")
    print(f"  Canonical SMILES: {canonical_smiles}\n")

# 4. Visualize the isomers
print("Visual Comparison:")
# --- YOUR CODE GOES HERE ---
display(Draw.MolsToGridImage(mols, legends=names))

#### 🔍 Solution (Hidden - Expand to see the answer)

<details>
<summary>Click here to see the solution after you've tried exploring the functions</summary>

```
# SMILES for Aspirin and two of its isomers
aspirin_and_isomers_smiles = {
    "Aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O",
    "3-Acetoxybenzoic acid": "CC(=O)Oc1cccc(C(=O)O)c1",
    "Phenylmalonic acid": "O=C(O)C(C(=O)O)c1ccccc1" # Note: Formula is C9H8O4
}

# 1. Create a list of molecule objects
mols = [Chem.MolFromSmiles(s) for s in aspirin_and_isomers_smiles.values()]
names = list(aspirin_and_isomers_smiles.keys())

# 2. & 3. Loop through, calculate formula, get canonical SMILES, and print
print("--- Isomers of C9H8O4 ---")
for i, mol in enumerate(mols):
    formula = CalcMolFormula(mol)
    canonical_smiles = Chem.MolToSmiles(mol)
    print(f"Name: {names[i]}")
    print(f"  Formula: {formula}")
    print(f"  Canonical SMILES: {canonical_smiles}\n")

# 4. Visualize the isomers
print("Visual Comparison:")
display(Draw.MolsToGridImage(mols, legends=names))
```

</details>

**Analysis of the Output:**

The code confirms that all three molecules have the formula C9H8O4.

However, it prints three different canonical SMILES strings.

This proves they are distinct molecules (isomers).

The visualization clearly shows their different chemical structures, confirming this conclusion.

### Exercise 6: Identifying Isomers of C₈H₁₁NO

Now, let's do the reverse. You are given three SMILES strings.

Your goal is to determine which of them are actual isomers of the formula C8H11NO.

**Your Task:**

1. For each SMILES string, create a molecule and calculate its molecular formula.
2. Identify which of the molecules match the target formula C8H11NO.
3. Create a new list containing only the true isomers and visualize them.

In [None]:
# A list of potential candidates
candidate_smiles = {
    "Candidate A": "c1ccccc1C(O)CN",
    "Candidate B": "NC1=CC=C(C(C)O)C=C1",
    "Candidate C": "c1ccc(OC)cc1CCN" # 3-Methoxyphenethylamine
}
target_formula = "C8H11NO"

true_isomers = []
isomer_names = []

print(f"--- Checking for isomers of {target_formula} ---")
for name, smi in candidate_smiles.items():
    mol = # --- YOUR CODE GOES HERE ---
    formula = # --- YOUR CODE GOES HERE ---

    # 2. Check if the formula matches
    if formula == # --- YOUR CODE GOES HERE --- :
        print(f"✅ {name} ({formula}) is an isomer.")
        true_isomers.append(# --- YOUR CODE GOES HERE --- )
        isomer_names.append(name)
    else:
        print(f"❌ {name} ({formula}) is NOT an isomer.")

# 3. Visualize only the true isomers
print("\n--- Visualization of True Isomers ---")
if true_isomers:
    display(# --- YOUR CODE GOES HERE ---)
else:
    print("No true isomers were found in the list.")

#### 🔍 Solution (Hidden - Expand to see the answer)

<details>
<summary>Click here to see the solution after you've tried exploring the functions</summary>

```
# A list of potential candidates
candidate_smiles = {
    "Candidate A": "c1ccccc1C(O)CN",
    "Candidate B": "NC1=CC=C(C(C)O)C=C1",
    "Candidate C": "c1ccc(OC)cc1CCN" # 3-Methoxyphenethylamine
}
target_formula = "C8H11NO"

true_isomers = []
isomer_names = []

print(f"--- Checking for isomers of {target_formula} ---")
for name, smi in candidate_smiles.items():
    mol = Chem.MolFromSmiles(smi)
    formula = CalcMolFormula(mol)
    
    # 2. Check if the formula matches
    if formula == target_formula:
        print(f"✅ {name} ({formula}) is an isomer.")
        true_isomers.append(mol)
        isomer_names.append(name)
    else:
        print(f"❌ {name} ({formula}) is NOT an isomer.")

# 3. Visualize only the true isomers
print("\n--- Visualization of True Isomers ---")
if true_isomers:
    display(Draw.MolsToGridImage(true_isomers, legends=isomer_names))
else:
    print("No true isomers were found in the list.")
```

</details>