Below is a detailed guide to the topic:

**🔬 Topic\_CheMBL\_35\_2: Predicting biological activity from molecular descriptors**

---

## 🧪 1. Analytical Framework

We aim to predict the biological activity (e.g., IC₅₀) of molecules from molecular descriptors such as molecular weight (MolWt), topological polar surface area (TPSA), and LogP, all computed with RDKit.
This QSAR modeling task involves:

* Querying ChEMBL for assay data (standard\_value for IC50)
* Filtering and cleaning the data
* Generating molecular descriptors
* Training a machine learning model (e.g., Random Forest)
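
The filtering and cleaning step can be sketched as follows. This is a minimal example under stated assumptions, not the guide's own method: the helper name `clean_ic50`, the toy compound IDs, and the choice to collapse replicate measurements with the median are all illustrative. The column names match the CSV used later in this guide.

```python
import pandas as pd


def clean_ic50(df: pd.DataFrame, value_col: str = "ic50_nm") -> pd.DataFrame:
    """Drop unusable rows and collapse replicate measurements per compound."""
    # Rows without a structure or a measured value cannot be used for modeling.
    out = df.dropna(subset=["canonical_smiles", value_col])
    # An IC50 must be strictly positive to be meaningful.
    out = out[out[value_col] > 0]
    # Assumption: keep one row per compound, using the median of its replicates.
    return (
        out.groupby(["chembl_id", "canonical_smiles"], as_index=False)[value_col]
        .median()
    )


# Toy data (hypothetical IDs): CHEMBL_A is measured twice, CHEMBL_C has no SMILES.
raw = pd.DataFrame({
    "chembl_id": ["CHEMBL_A", "CHEMBL_A", "CHEMBL_B", "CHEMBL_C"],
    "canonical_smiles": ["CCO", "CCO", "c1ccccc1", None],
    "ic50_nm": [2500.0, 50000.0, 9000.0, 100.0],
})
cleaned = clean_ic50(raw)
print(cleaned)
```

Without a step like this, the same structure can appear several times with different target values, which gives the model contradictory labels for identical descriptor vectors.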

---

## 📦 2. AIMLOps Folder Structure

```plaintext
project_root/
│
├── data/                ← Exported CSV from SQL (max 100 rows)
│   └── Topic_CheMBL_35_2_data.csv
│
├── notebook/
│   ├── Topic_CheMBL_35_2_1_query_and_descriptors.ipynb
│   └── Topic_CheMBL_35_2_2_model_training.ipynb
│
├── src/
│   └── descriptors.py   ← Functions to calculate descriptors
│
└── scripts/
    └── run_model.py     ← Optional CLI for batch runs
```
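
The layout above can also be created programmatically. The following is a minimal sketch using only the standard library. The helper name `create_layout` is hypothetical, and the extra `models/` directory is an assumption based on the `../models/` path that the model-saving example in this guide writes to.

```python
import tempfile
from pathlib import Path


def create_layout(root: Path) -> list[Path]:
    """Create the project sub-directories under `root` and return their paths."""
    # 'models' is not part of the tree above; it is added here as an assumption.
    subdirs = ["data", "notebook", "src", "scripts", "models"]
    created = []
    for name in subdirs:
        path = root / name
        path.mkdir(parents=True, exist_ok=True)  # safe to re-run
        created.append(path)
    return created


# Demonstrate in a temporary directory so the working directory is untouched.
with tempfile.TemporaryDirectory() as tmp:
    created = create_layout(Path(tmp) / "project_root")
    names = sorted(p.name for p in created)
    print(names)
```

In a real project, pass the actual project root instead of a temporary directory.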

---

## 🧾 3. Example SQL Query (limit 100 rows)

```sql
-- File: get_ic50_cleaned.sql
SELECT md.chembl_id,
       cs.canonical_smiles,
       act.standard_value::float AS ic50_nm
FROM activities act
JOIN compound_structures cs ON act.molregno = cs.molregno
JOIN molecule_dictionary md ON md.molregno = cs.molregno
WHERE act.standard_type = 'IC50'
  AND act.standard_value::text ~ '^[0-9\.]+$'  -- cast to text: ~ is not defined for numeric
  AND act.standard_units = 'nM'
  AND act.standard_value::float < 100000
LIMIT 100;
```

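> 💡 Note: `standard_value` is a `numeric` column, and PostgreSQL has no `~` (regex match) operator for `numeric`, so applying `~` to it directly fails with `operator does not exist`. Cast the column first: `act.standard_value::text` for the regex test, `act.standard_value::float` for numeric comparisons.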

---

## 🧬 4. Python Script for Descriptor Calculation

```python
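import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors

df = pd.read_csv('../data/Topic_CheMBL_35_2_data.csv')
df = df.dropna(subset=['canonical_smiles'])  # MolFromSmiles cannot handle missing values

# Parse SMILES; MolFromSmiles returns None for strings RDKit cannot parse.
df['mol'] = df['canonical_smiles'].apply(Chem.MolFromSmiles)
df = df[df['mol'].notna()].copy()  # drop unparseable rows before computing descriptors

df['MolWt'] = df['mol'].apply(Descriptors.MolWt)    # molecular weight
df['TPSA'] = df['mol'].apply(Descriptors.TPSA)      # topological polar surface area
df['LogP'] = df['mol'].apply(Descriptors.MolLogP)   # Crippen LogP estimate

# Mol objects are not needed downstream, so drop them before display or export.
df.drop(columns='mol', inplace=True)
df.head()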
```

---

## 🤖 5. Train Random Forest Model (compatible with older scikit-learn versions)

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

X = df[['MolWt', 'TPSA', 'LogP']]
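y = df['ic50_nm']  # PostgreSQL folds the unquoted alias to lowercase, so the CSV header is ic50_nm

# Fix the split seed so the reported metrics are reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("R2 score:", r2_score(y_test, y_pred))
# Take the square root manually so the code does not rely on the `squared=False` argument.
print("RMSE:", mean_squared_error(y_test, y_pred)**0.5)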
```
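
Raw IC50 values span several orders of magnitude (here up to 100000 nM), so a few weak compounds can dominate a squared-error metric. A common QSAR convention is to model pIC50, the negative base-10 logarithm of the IC50 in molar units, instead. The sketch below shows the conversion from nanomolar. The function name `ic50_nm_to_pic50` is hypothetical, and switching the target to pIC50 is a suggestion, not something the code above does.

```python
import math


def ic50_nm_to_pic50(ic50_nm: float) -> float:
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    # 1 nM = 1e-9 M, so -log10(x * 1e-9) = 9 - log10(x).
    return 9.0 - math.log10(ic50_nm)


# 1 nM, 1 uM (1000 nM) and 100 uM (100000 nM)
for value in (1.0, 1000.0, 100000.0):
    print(value, ic50_nm_to_pic50(value))
```

With a DataFrame like the one above, the transform could be applied as `df['pIC50'] = df['ic50_nm'].apply(ic50_nm_to_pic50)` and `y` set to that column. Reported errors are then in log units rather than nM.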

---

## 💡 6. Additional SQL & Python Examples (5 examples)

### 🔹 SQL 1: Select compounds with pChEMBL value

```sql
SELECT md.chembl_id, cs.canonical_smiles, act.pchembl_value
FROM activities act
JOIN molecule_dictionary md ON md.molregno = act.molregno
JOIN compound_structures cs ON cs.molregno = md.molregno
WHERE act.pchembl_value IS NOT NULL
LIMIT 100;
```

### 🔹 SQL 2: Fetch compounds with IC50 data and their target names

```sql
SELECT md.chembl_id, cs.canonical_smiles, tgt.pref_name
FROM activities act
JOIN compound_structures cs ON act.molregno = cs.molregno
JOIN molecule_dictionary md ON md.molregno = cs.molregno
JOIN assays ass ON act.assay_id = ass.assay_id  -- activities link to targets through assays
JOIN target_dictionary tgt ON ass.tid = tgt.tid
WHERE act.standard_type = 'IC50' AND act.standard_value::text ~ '^[0-9\.]+$'
LIMIT 100;
```

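### 🔹 Python 1: Add number of H-bond donors and acceptors

```python
from rdkit import Chem
from rdkit.Chem import Lipinski
# The 'mol' column was dropped in section 4, so rebuild the molecules from SMILES.
mols = df['canonical_smiles'].apply(Chem.MolFromSmiles)
df['NumHDonors'] = mols.apply(Lipinski.NumHDonors)
df['NumHAcceptors'] = mols.apply(Lipinski.NumHAcceptors)
```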

### 🔹 Python 2: Export processed dataset

```python
df.to_csv('../data/processed_descriptors.csv', index=False)
```

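### 🔹 Python 3: Save model

```python
import os
import joblib
os.makedirs('../models', exist_ok=True)  # create the output directory if it does not exist yet
joblib.dump(model, '../models/qsar_ic50_rf.pkl')
```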

---

If you would like this packaged as two AIMLOps-style notebooks (`Topic_CheMBL_35_2_1_query_and_descriptors.ipynb`, `Topic_CheMBL_35_2_2_model_training.ipynb`) or as a sample project ZIP, just ask.


In [1]:

```python
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors

df = pd.read_csv('../data/Topic_CheMBL_35_2_data.csv')
df['mol'] = df['canonical_smiles'].apply(Chem.MolFromSmiles)
df['MolWt'] = df['mol'].apply(Descriptors.MolWt)
df['TPSA'] = df['mol'].apply(Descriptors.TPSA)
df['LogP'] = df['mol'].apply(Descriptors.MolLogP)
df.drop(columns='mol', inplace=True)
df.head()
```

| Unnamed: 0 | chembl_id | canonical_smiles | ic50_nm | MolWt | TPSA | LogP |
|---|---|---|---|---|---|---|
| 0 | CHEMBL324340 | `Cc1ccc2oc(-c3cccc(N4C(=O)c5ccc(C(=O)O)cc5C4=O)...` | 2500.0 | 398.374 | 100.71 | 4.30202 |
| 1 | CHEMBL324340 | `Cc1ccc2oc(-c3cccc(N4C(=O)c5ccc(C(=O)O)cc5C4=O)...` | 50000.0 | 398.374 | 100.71 | 4.30202 |
| 2 | CHEMBL109600 | `COc1ccccc1-c1ccc2oc(-c3ccc(OC)c(N4C(=O)c5ccc(C...` | 9000.0 | 520.497 | 119.17 | 5.6778 |
| 3 | CHEMBL357278 | `Cc1nc2cc(OC[C@H](O)CN3CCN(CC(=O)Nc4ccc(Cl)c(C(...` | 4000.0 | 543.011 | 77.93 | 4.27292 |
| 4 | CHEMBL357119 | `Cc1nc2cc(OC[C@H](O)CN3CCN(CC(=O)NCCc4ccccc4)CC...` | 17000.0 | 468.623 | 77.93 | 2.32092 |


In [3]:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

X = df[['MolWt', 'TPSA', 'LogP']]
y = df['ic50_nm']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("R2 score:", r2_score(y_test, y_pred))
print("RMSE:", mean_squared_error(y_test, y_pred)**0.5)  # Compatible format
```


```plaintext
R2 score: -0.6935868754184276
RMSE: 15040.909614321108
```
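
A negative R² such as the one above means that, on this test split, the model's squared error is larger than that of a constant predictor that always outputs the mean of the test targets. The sketch below illustrates this with the usual definition R² = 1 − SS_res / SS_tot. The function `r2` and all numbers are toy values chosen for illustration and are not taken from the run above.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy targets on an IC50-like scale (nM).
y_true = [2000.0, 50000.0, 9000.0, 4000.0, 15000.0]

# Baseline: always predict the mean of the targets.
mean_baseline = [sum(y_true) / len(y_true)] * len(y_true)

# A deliberately poor set of predictions.
poor_model = [40000.0, 3000.0, 30000.0, 25000.0, 1000.0]

print("mean baseline:", r2(y_true, mean_baseline))
print("poor model:", r2(y_true, poor_model))
```

The mean baseline scores exactly 0 and the poor predictions score below 0. With about 100 rows, an unseeded 80/20 split, and only three descriptors, a result in this range mainly indicates that the setup carries little signal yet.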
