## Generating a dataset

Purpose: We want to assess how the structure of a set of molecules (e.g. their lipophilicity) correlates with their inhibitory activity against acetylcholinesterase (AChE).

First, import the dependencies and load the two datasets: activities (ChEMBL) and molecular descriptors (PubChem).

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load both datasets
act = pd.read_csv("ache_activities.csv")
desc = pd.read_csv("ache_pubchem_descriptors.csv")

# Keep only relevant columns from ChEMBL
act = act[["canonical_smiles", "pchembl_value", "standard_relation", "standard_type"]]

# Aggregate: a SMILES may appear multiple times, so take the median pchembl_value per SMILES
act = act.groupby("canonical_smiles", as_index=False)["pchembl_value"].median()



Inspect the two datasets.

In [2]:
act.head()

| | canonical_smiles | pchembl_value |
|---|---|---|
| 0 | `Br.CC[N+](CC)(CCCCCn1c(=O)cc(C)n(CCCCCn2c(C)cc...` | 9.4 |
| 1 | `Br.CC[N+](CC)(CCCCCn1c(=O)cc(C)n(CCCCn2c(C)cc(...` | 9.22 |
| 2 | `Br.CC[N+](CC)(CCCCCn1c(=O)cc(C)n(CCCn2c(C)cc(=...` | 8.15 |
| 3 | `Br.COc1ccc2c3c1O[C@H]1C[C@@H](O)C=C[C@@]31CCN(...` | 5.5 |
| 4 | `Brc1ccc(-[n+]2ccc3ccccc3c2)cc1.[Br-]` | 5.62 |


Combine the two datasets.

In [3]:
# Merge on canonical_smiles; how="inner" keeps only rows whose SMILES occurs in both dataframes
merged = pd.merge(act, desc, on="canonical_smiles", how="inner")

print(f"Merged dataset: {len(merged)} entries")
merged.head()

Merged dataset: 200 entries


| | canonical_smiles | pchembl_value | CID | MolecularWeight | XLogP | TPSA | HBondDonorCount | HBondAcceptorCount | RotatableBondCount |
|---|---|---|---|---|---|---|---|---|---|
| 0 | `C#CCCOC[n+]1ccc(/C=N/O)cc1.[Cl-]` | 4.01 | 135909812 | 240.68 | NaN | 45.7 | 1 | 4.0 | 5 |
| 1 | `C(=C/C1CCN(Cc2ccccc2)CC1)\c1noc2ccccc12` | 6.68 | 9901561 | 318.4 | 4.5 | 29.3 | 0 | 3.0 | 4 |
| 2 | `C/C=C1\[C@H]2C=C(C)C[C@]1(NC1OCC3=C4CC(C)(C)C[...` | 6.96 | 118715261 | 492.6 | 1.2 | 90.8 | 4 | 5.0 | 2 |
| 3 | `C/C=C1\[C@H]2C=C(C)C[C@]1(N[C@H]1OC(=O)C3=C4CC...` | 5.66 | 118715260 | 506.6 | 1.5 | 108.0 | 4 | 6.0 | 2 |
| 4 | `C=CC(=O)N1C/C(=C\c2ccc(C)cc2)C(=O)/C(=C/c2ccc(...` | 4.74 | 24796898 | 357.4 | 4.4 | 37.4 | 0 | 2.0 | 3 |
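Because an inner merge drops unmatched rows silently, it can help to audit the merge before trusting the row count. The following is a minimal sketch on made-up toy frames (`act_toy` and `desc_toy` are hypothetical stand-ins, not the real ChEMBL/PubChem files); it uses the `indicator` and `validate` parameters of `pd.merge`.

```python
import pandas as pd

# Toy stand-ins for `act` and `desc`; all values are made up for illustration
act_toy = pd.DataFrame({
    "canonical_smiles": ["CCO", "CCN", "c1ccccc1"],
    "pchembl_value": [5.0, 6.2, 7.1],
})
desc_toy = pd.DataFrame({
    "canonical_smiles": ["CCN", "c1ccccc1", "CCCC"],
    "XLogP": [0.1, 2.1, 2.9],
})

# An outer merge with indicator=True adds a `_merge` column that records
# whether each row came from the left frame, the right frame, or both
audit = pd.merge(act_toy, desc_toy, on="canonical_smiles",
                 how="outer", indicator=True)
merge_counts = audit["_merge"].value_counts()
print(merge_counts)

# validate="one_to_one" raises pandas.errors.MergeError if a SMILES
# is repeated in either frame, so duplicates cannot slip in unnoticed
merged_toy = pd.merge(act_toy, desc_toy, on="canonical_smiles",
                      how="inner", validate="one_to_one")
print(f"Inner merge keeps {len(merged_toy)} of {len(act_toy)} activity rows")
```

The `left_only` count corresponds to activities without descriptors, and `right_only` to descriptors without a measured activity.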


Now inspect the combined dataframe. Look for anomalies, i.e. duplicates and missing data.

In [4]:
merged.describe()

| | pchembl_value | CID | MolecularWeight | XLogP | TPSA | HBondDonorCount | HBondAcceptorCount | RotatableBondCount |
|---|---|---|---|---|---|---|---|---|
| count | 200.0 | 200.0 | 198.0 | 178.0 | 198.0 | 200.0 | 199.0 | 200.0 |
| mean | 6.1976 | 49815260.0 | 402.229242 | 4.301685 | 64.24798 | 1.18 | 4.567839 | 6.275 |
| std | 1.334509 | 47928390.0 | 115.19469 | 1.831441 | 32.207804 | 1.059819 | 1.867903 | 5.340598 |
| min | 4.01 | 1338.0 | 110.13 | 0.1 | 0.0 | 0.0 | 1.0 | 0.0 |
| 25% | 5.0125 | 10572730.0 | 337.175 | 2.8 | 40.6 | 0.0 | 3.0 | 3.0 |
| 50% | 6.1 | 44333800.0 | 392.95 | 4.3 | 59.35 | 1.0 | 4.0 | 5.0 |
| 75% | 7.1325 | 76308880.0 | 457.9 | 5.3 | 85.45 | 2.0 | 6.0 | 9.0 |
| max | 9.57 | 164625600.0 | 801.0 | 10.2 | 183.0 | 5.0 | 11.0 | 31.0 |
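The `count` row of `describe()` already hints at missing values (for example 178 XLogP entries against 200 rows), but it does not reveal duplicates. A possible way to quantify both is sketched below; `summarize_anomalies` is a hypothetical helper, and the toy frame uses made-up values, including two different SMILES strings for the same molecule that share one CID.

```python
import pandas as pd

def summarize_anomalies(df: pd.DataFrame, key: str = "CID") -> dict:
    """Count missing values per column and rows that share the same key."""
    return {
        "missing_per_column": df.isna().sum().to_dict(),
        "rows_with_any_missing": int(df.isna().any(axis=1).sum()),
        # keep=False marks every member of a duplicated group
        "duplicated_keys": int(df.duplicated(subset=key, keep=False).sum()),
    }

# Toy stand-in for `merged`; values are made up for illustration.
# "CCO" and "OCC" are two SMILES for ethanol, so they share a CID here.
toy = pd.DataFrame({
    "canonical_smiles": ["CCO", "OCC", "c1ccccc1", "CCN"],
    "pchembl_value": [5.0, 5.2, 6.1, 7.3],
    "CID": [1, 1, 2, 3],
    "XLogP": [-0.1, -0.1, 2.1, None],
})

report = summarize_anomalies(toy)
print(report)
```

On the real data, the equivalent call would be `summarize_anomalies(merged)`, assuming `CID` is the identifier you want to treat as unique.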



Decide what to do with the bad data points and proceed. Make sure to document properly and preserve the original dataframe.
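One possible cleaning strategy, shown here only as a sketch, is to drop rows with missing descriptors and keep the first row per CID, working on a copy so that the original dataframe stays untouched. The helper `clean_dataset` and the toy values are hypothetical; imputing missing values or averaging duplicated entries would be equally defensible choices, and whichever one you pick should be documented.

```python
import pandas as pd

def clean_dataset(df: pd.DataFrame, required: list, key: str = "CID") -> pd.DataFrame:
    """Return a cleaned copy of df; the input dataframe is not modified."""
    cleaned = df.copy()
    # Assumption: rows lacking any required descriptor are dropped, not imputed
    cleaned = cleaned.dropna(subset=required)
    # Assumption: for repeated keys, the first occurrence is kept
    cleaned = cleaned.drop_duplicates(subset=key, keep="first")
    return cleaned.reset_index(drop=True)

# Toy stand-in for `merged`; values are made up for illustration
toy = pd.DataFrame({
    "canonical_smiles": ["CCO", "OCC", "c1ccccc1", "CCN"],
    "pchembl_value": [5.0, 5.2, 6.1, 7.3],
    "CID": [1, 1, 2, 3],
    "XLogP": [-0.1, -0.1, 2.1, None],
})

cleaned = clean_dataset(toy, required=["XLogP"])
print(f"Original: {len(toy)} rows, cleaned: {len(cleaned)} rows")
```

Printing the row counts before and after each step is a simple way to document how many data points were removed and why.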

Run a quick EDA on the cleaned dataset including some plots. What can you conclude?
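A quick EDA could start with the distribution of the target and its correlation with each descriptor. The sketch below uses plain matplotlib and Pearson correlation; `quick_eda` is a hypothetical helper, and the toy values are made up, so no chemical conclusion should be drawn from its output.

```python
import pandas as pd
import matplotlib.pyplot as plt

def quick_eda(df: pd.DataFrame, target: str = "pchembl_value", feature: str = "XLogP"):
    """Return Pearson correlations with the target and a two-panel figure."""
    # CID is an identifier, not a descriptor, so it is excluded if present
    numeric = df.select_dtypes("number").drop(columns=["CID"], errors="ignore")
    correlations = numeric.corr()[target].drop(target).sort_values()

    fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(9, 3.5))
    ax_hist.hist(df[target], bins=10)
    ax_hist.set_xlabel(target)
    ax_hist.set_ylabel("count")
    ax_scatter.scatter(df[feature], df[target], alpha=0.7)
    ax_scatter.set_xlabel(feature)
    ax_scatter.set_ylabel(target)
    fig.tight_layout()
    return correlations, fig

# Toy stand-in for the cleaned dataframe; values are made up for illustration
toy = pd.DataFrame({
    "CID": [1, 2, 3, 4, 5],
    "pchembl_value": [5.0, 5.5, 6.1, 6.4, 7.2],
    "XLogP": [1.0, 2.0, 3.0, 4.0, 5.0],
    "TPSA": [90.0, 80.0, 60.0, 55.0, 30.0],
})

correlations, fig = quick_eda(toy)
print(correlations)
```

With seaborn already imported, `sns.pairplot` or `sns.heatmap` on the correlation matrix would be natural next steps.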

Export the cleaned dataset as "ache_qsar_data.csv".
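For the export, passing `index=False` to `to_csv` avoids writing the row index, which would otherwise come back as an `Unnamed: 0` column when the file is reloaded. The sketch below writes a made-up toy frame to a temporary directory to stay free of side effects; in the notebook you would write `"ache_qsar_data.csv"` to the working directory instead.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the cleaned dataframe; values are made up for illustration
cleaned = pd.DataFrame({
    "canonical_smiles": ["CCO", "c1ccccc1"],
    "pchembl_value": [5.0, 6.1],
    "XLogP": [-0.1, 2.1],
})

# Temporary directory used only for this sketch
out_path = Path(tempfile.mkdtemp()) / "ache_qsar_data.csv"
cleaned.to_csv(out_path, index=False)

# Round-trip check: the reloaded file should have the same columns and shape
reloaded = pd.read_csv(out_path)
print(list(reloaded.columns), reloaded.shape)
```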