<a href="https://colab.research.google.com/github/isaacperomero/Bioinformatics/blob/main/Parte_1_Acetylcholinesterase_Datos_Bioactividad.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Research Project - Computational Drug Discovery [Part 1] Downloading the Bioactivity Data**

The Jupyter notebooks in this research project contain the code for a machine learning (ML) model built on bioactivity data from the ChEMBL database, inspired by the work of Chanin Nantasenamat.

---

## **ChEMBL database**

The [*ChEMBL database*](https://www.ebi.ac.uk/chembl/) contains curated bioactivity information for more than 2 million compounds. It brings together chemical, bioactivity, and genomic data to help translate genomic information into effective new drugs. It is compiled from more than 89,000 documents and 1.6 million assays, and the data span more than 15,000 targets, 2,000 cells, and 48,000 indications.
[Information as of May 10, 2024; ChEMBL version 34].

## **Installing libraries**

Install the ChEMBL web service package to retrieve bioactivity data from the ChEMBL database.

In [37]:
! pip list

Package                          Version
-------------------------------- ---------------------
absl-py                          1.4.0
aiohttp                          3.9.5
aiosignal                        1.3.1
alabaster                        0.7.16
albumentations                   1.3.1
altair                           4.2.2
annotated-types                  0.6.0
anyio                            3.7.1
appdirs                          1.4.4
argon2-cffi                      23.1.0
argon2-cffi-bindings             21.2.0
array_record                     0.5.1
arviz                            0.15.1
astropy                          5.3.4
astunparse                       1.6.3
async-timeout                    4.0.3
atpublic                         4.1.0
attrs                            23.2.0
audioread                        3.0.1
autograd                         1.6.2
Babel                            2.15.0
backcall                         0.2.0
beautifulsoup4                   4.12.3


In [1]:
! pip install chembl_webresource_client

Collecting chembl_webresource_client
  Downloading chembl_webresource_client-0.10.9-py3-none-any.whl (55 kB)
Collecting requests-cache~=1.2 (from chembl_webresource_client)
  Downloading requests_cache-1.2.0-py3-none-any.whl (61 kB)
Collecting cattrs>=22.2 (from requests-cache~=1.2->chembl_webresource_client)
  Downloading cattrs-23.2.3-py3-none-any.whl (57 kB)
Collecting url-normalize

## **Importing libraries**

In [2]:
import pandas as pd
from chembl_webresource_client.new_client import new_client

## **Searching for the target protein**

### **Target search for *acetylcholinesterase***

In [3]:
target = new_client.target
target_query = target.search('acetylcholinesterase')
targets = pd.DataFrame.from_dict(target_query)
targets.head(10)

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,"[{'xref_id': 'P22303', 'xref_name': None, 'xre...",Homo sapiens,Acetylcholinesterase,28.0,False,CHEMBL220,"[{'accession': 'P22303', 'component_descriptio...",SINGLE PROTEIN,9606
1,[],Homo sapiens,Cholinesterases; ACHE & BCHE,28.0,False,CHEMBL2095233,"[{'accession': 'P06276', 'component_descriptio...",SELECTIVITY GROUP,9606
2,[],Drosophila melanogaster,Acetylcholinesterase,18.0,False,CHEMBL2242744,"[{'accession': 'P07140', 'component_descriptio...",SINGLE PROTEIN,7227
3,[],Bemisia tabaci,AChE2,16.0,False,CHEMBL2366409,"[{'accession': 'B3SST5', 'component_descriptio...",SINGLE PROTEIN,7038
4,[],Leptinotarsa decemlineata,Acetylcholinesterase,16.0,False,CHEMBL2366490,"[{'accession': 'Q27677', 'component_descriptio...",SINGLE PROTEIN,7539
5,"[{'xref_id': 'P04058', 'xref_name': None, 'xre...",Torpedo californica,Acetylcholinesterase,15.0,False,CHEMBL4780,"[{'accession': 'P04058', 'component_descriptio...",SINGLE PROTEIN,7787
6,"[{'xref_id': 'P21836', 'xref_name': None, 'xre...",Mus musculus,Acetylcholinesterase,15.0,False,CHEMBL3198,"[{'accession': 'P21836', 'component_descriptio...",SINGLE PROTEIN,10090
7,"[{'xref_id': 'P37136', 'xref_name': None, 'xre...",Rattus norvegicus,Acetylcholinesterase,15.0,False,CHEMBL3199,"[{'accession': 'P37136', 'component_descriptio...",SINGLE PROTEIN,10116
8,"[{'xref_id': 'O42275', 'xref_name': None, 'xre...",Electrophorus electricus,Acetylcholinesterase,15.0,False,CHEMBL4078,"[{'accession': 'O42275', 'component_descriptio...",SINGLE PROTEIN,8005
9,"[{'xref_id': 'P23795', 'xref_name': None, 'xre...",Bos taurus,Acetylcholinesterase,15.0,False,CHEMBL4768,"[{'accession': 'P23795', 'component_descriptio...",SINGLE PROTEIN,9913


### **Select and retrieve bioactivity data for *human acetylcholinesterase* (first entry)**

The first entry (which corresponds to the target protein, human acetylcholinesterase) is assigned to the variable ***selected_target***.

In [7]:
selected_target = targets.target_chembl_id[0]
selected_target

'CHEMBL220'
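
Selecting the target by position relies on the search ranking staying the same between runs. As a minimal sketch, the choice can also be made explicit by filtering on organism and target type. The helper name `pick_target` is an illustrative assumption and not part of the notebook; the sample records mirror a few rows of the search results above, and with the real data they could come from `targets.to_dict('records')`.

```python
# Hypothetical helper: choose a target by explicit criteria instead of row position.
def pick_target(records, organism, target_type):
    """Return the target_chembl_id of the first record matching both fields."""
    for record in records:
        if record["organism"] == organism and record["target_type"] == target_type:
            return record["target_chembl_id"]
    raise LookupError(f"no {target_type} target found for {organism}")


# Sample records copied from the search results shown above (columns trimmed).
sample_targets = [
    {"organism": "Homo sapiens", "pref_name": "Acetylcholinesterase",
     "target_chembl_id": "CHEMBL220", "target_type": "SINGLE PROTEIN"},
    {"organism": "Homo sapiens", "pref_name": "Cholinesterases; ACHE & BCHE",
     "target_chembl_id": "CHEMBL2095233", "target_type": "SELECTIVITY GROUP"},
    {"organism": "Mus musculus", "pref_name": "Acetylcholinesterase",
     "target_chembl_id": "CHEMBL3198", "target_type": "SINGLE PROTEIN"},
]

selected = pick_target(sample_targets, "Homo sapiens", "SINGLE PROTEIN")
print(selected)
```

Raising an error when nothing matches makes a changed search result visible immediately instead of silently selecting a different target.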

Only the bioactivity data for human acetylcholinesterase (CHEMBL220) reported as IC50 values are retrieved.

In [8]:
activity = new_client.activity
res = activity.filter(target_chembl_id = selected_target).filter(standard_type = "IC50")

In [9]:
df = pd.DataFrame.from_dict(res)

In [11]:
df.head(3)

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,33969,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.75
1,,,37563,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.1
2,,,37565,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,50.0


In [12]:
df.columns

Index(['action_type', 'activity_comment', 'activity_id', 'activity_properties',
       'assay_chembl_id', 'assay_description', 'assay_type',
       'assay_variant_accession', 'assay_variant_mutation', 'bao_endpoint',
       'bao_format', 'bao_label', 'canonical_smiles', 'data_validity_comment',
       'data_validity_description', 'document_chembl_id', 'document_journal',
       'document_year', 'ligand_efficiency', 'molecule_chembl_id',
       'molecule_pref_name', 'parent_molecule_chembl_id', 'pchembl_value',
       'potential_duplicate', 'qudt_units', 'record_id', 'relation', 'src_id',
       'standard_flag', 'standard_relation', 'standard_text_value',
       'standard_type', 'standard_units', 'standard_upper_value',
       'standard_value', 'target_chembl_id', 'target_organism',
       'target_pref_name', 'target_tax_id', 'text_value', 'toid', 'type',
       'units', 'uo_units', 'upper_value', 'value'],
      dtype='object')

In [13]:
df.standard_type.unique()

array(['IC50'], dtype=object)

In [14]:
# The lower the standard value, the higher the potency of the drug:
# less drug is needed to produce 50% inhibition
df[['standard_type', 'standard_value', 'standard_units']]

Unnamed: 0,standard_type,standard_value,standard_units
0,IC50,750.0,nM
1,IC50,100.0,nM
2,IC50,50000.0,nM
3,IC50,300.0,nM
4,IC50,800.0,nM
...,...,...,...
9086,IC50,160.0,nM
9087,IC50,7943.28,nM
9088,IC50,100000.0,nM
9089,IC50,63095.73,nM
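
Because IC50 values span several orders of magnitude, they are often expressed on a negative logarithmic scale (pIC50), where a higher number means a more potent compound. The sketch below assumes `standard_value` is given in nM; the function name `pic50_from_nm` is illustrative and not part of the notebook.

```python
import math


def pic50_from_nm(ic50_nm):
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    # 1 nM = 1e-9 M, so -log10(ic50_nm * 1e-9) = 9 - log10(ic50_nm)
    return 9.0 - math.log10(ic50_nm)


for value_nm in (100.0, 750.0, 50000.0):
    print(f"{value_nm:>9.1f} nM -> pIC50 {pic50_from_nm(value_nm):.2f}")
```

Under this conversion, a `standard_value` of 7943.28 nM corresponds to a pIC50 of about 5.1.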


The resulting bioactivity data are saved to a CSV file, **bioactivity_data_raw.csv**.

In [32]:
df.to_csv('bioactivity_data_raw.csv', index = False)

## **Handling missing and duplicate data**
Any compound with a missing value in the **standard_value** or **canonical_smiles** column is removed.

In [38]:
df2 = df[df.standard_value.notna()]
# Build the second mask from df2 itself so its index matches the frame being filtered
df2 = df2[df2.canonical_smiles.notna()]
df2


Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,33969,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.75
1,,,37563,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.1
2,,,37565,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,50.0
3,,,38902,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.3
4,,,41170,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9086,"{'action_type': 'INHIBITOR', 'description': 'N...",,25111481,[],CHEMBL5265203,Inhibition of AChE (unknown origin),B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.16
9087,,,25402914,[],CHEMBL5303778,Cross screening panel,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,pIC50,,UO_0000065,,5.1
9088,,,25402962,[],CHEMBL5303826,Cross screening panel,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,pIC50,,UO_0000065,,4.0
9089,,,25403899,[],CHEMBL5303876,Cross screening panel,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,pIC50,,UO_0000065,,4.2
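
The same filtering can be written with `DataFrame.dropna`, which avoids building boolean masks by hand. This is a minimal sketch on a toy DataFrame (the rows are invented for illustration). It also coerces `standard_value` to numbers, on the assumption that the values may arrive as strings.

```python
import pandas as pd

# Toy data for illustration only; not taken from ChEMBL.
toy = pd.DataFrame({
    "molecule_chembl_id": ["A", "B", "C", "D"],
    "canonical_smiles": ["CCO", None, "CCN", "CCC"],
    "standard_value": ["750.0", "100.0", None, "not a number"],
})

# Turn non-numeric entries into NaN so they are dropped with the missing ones.
toy["standard_value"] = pd.to_numeric(toy["standard_value"], errors="coerce")

# Keep only rows where both columns are present.
clean = toy.dropna(subset=["standard_value", "canonical_smiles"])
print(clean)
```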


Compounds with duplicate values in the canonical_smiles column are removed.

In [40]:
len(df2.canonical_smiles.unique())

6370

In [18]:
df2_nr = df2.drop_duplicates(['canonical_smiles'])
df2_nr

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,33969,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.75
1,,,37563,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.1
2,,,37565,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,50.0
3,,,38902,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.3
4,,,41170,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9086,"{'action_type': 'INHIBITOR', 'description': 'N...",,25111481,[],CHEMBL5265203,Inhibition of AChE (unknown origin),B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.16
9087,,,25402914,[],CHEMBL5303778,Cross screening panel,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,pIC50,,UO_0000065,,5.1
9088,,,25402962,[],CHEMBL5303826,Cross screening panel,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,pIC50,,UO_0000065,,4.0
9089,,,25403899,[],CHEMBL5303876,Cross screening panel,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,pIC50,,UO_0000065,,4.2
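
`drop_duplicates` keeps the first row for each SMILES string, so when a compound was measured more than once, only the first IC50 survives. One alternative, shown here as a sketch rather than the approach used in this notebook, is to aggregate repeated measurements with the median. The toy rows are invented for illustration.

```python
import pandas as pd

# Toy data for illustration only: compound "A" was measured twice.
toy = pd.DataFrame({
    "molecule_chembl_id": ["A", "A", "B"],
    "canonical_smiles": ["CCO", "CCO", "CCN"],
    "standard_value": [100.0, 300.0, 5000.0],
})

# Approach used in this notebook: keep the first occurrence of each SMILES.
first_only = toy.drop_duplicates(["canonical_smiles"])

# Alternative: one row per compound, with the median of its measurements.
median_per_compound = (
    toy.groupby(["molecule_chembl_id", "canonical_smiles"], as_index=False)["standard_value"]
    .median()
)

print(first_only)
print(median_per_compound)
```

Which option is preferable depends on how much the repeated measurements disagree; the median is less sensitive to a single outlying assay than the first recorded value.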


## **Preprocessing the bioactivity data**

### **Combine the columns (*molecule_chembl_id*, *canonical_smiles*, *standard_value*) into a DataFrame**

In [21]:
selection = ['molecule_chembl_id','canonical_smiles','standard_value']
df3 = df2_nr[selection]
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL133897,CCOc1nn(-c2cccc(OCc3ccccc3)c2)c(=O)o1,750.0
1,CHEMBL336398,O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC1CC1,100.0
2,CHEMBL131588,CN(C(=O)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F)c1ccccc1,50000.0
3,CHEMBL130628,O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F,300.0
4,CHEMBL130478,CSc1nc(-c2ccc(OC(F)(F)F)cc2)nn1C(=O)N(C)C,800.0
...,...,...,...
9086,CHEMBL2238282,O=C(/C=C/c1ccc(N2CCCCC2)cc1)c1sccc1Cl,160.0
9087,CHEMBL4636881,CC(=O)Nc1c(F)cc(C(=O)N[C@H]2CC[C@H](O)CC2)cc1O...,7943.28
9088,CHEMBL4635134,CNC(=O)c1cc(C(=O)NC2CC2)cn(Cc2ccccc2)c1=O,100000.0
9089,CHEMBL4639128,COCc1nc2cnc3cc(-c4c(C)noc4C)c(OC[C@H]4CCNC4)cc...,63095.73


Save the DataFrame to a CSV file

In [33]:
df3.to_csv('bioactivity_data_preprocessed.csv', index=False)

### **Labeling compounds as active, inactive, or intermediate**
The bioactivity data are **IC50** values expressed in nM. Compounds with values of 1,000 nM or less are considered **active**, while those with values of 10,000 nM or more are considered **inactive**. Values strictly between 1,000 and 10,000 nM are considered **intermediate**.

In [24]:
df4 = pd.read_csv('bioactivity_data_preprocessed.csv')

In [25]:
bioactivity_threshold = []
for i in df4.standard_value:
  if float(i) >= 10000:
    bioactivity_threshold.append("inactive")
  elif float(i) <= 1000:
    bioactivity_threshold.append("active")
  else:
    bioactivity_threshold.append("intermediate")
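
The loop above can also be expressed as a small function, which makes the behavior at the boundaries easy to check. This is a sketch of the same thresholds; the function name `classify_ic50` is illustrative and not part of the notebook.

```python
def classify_ic50(ic50_nm):
    """Label an IC50 value (in nM) using the same thresholds as the loop above."""
    value = float(ic50_nm)
    if value >= 10000:
        return "inactive"
    if value <= 1000:
        return "active"
    return "intermediate"


# Values exactly at a threshold fall into the outer classes, not "intermediate".
for example in (750.0, 1000.0, 7943.28, 10000.0):
    print(example, classify_ic50(example))
```

With pandas, the labels could then be produced with `df4.standard_value.apply(classify_ic50)`.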

### **Combine the columns (*molecule_chembl_id*, *canonical_smiles*, *standard_value*) and *bioactivity_class* into a DataFrame**

In [28]:
bioactivity_class = pd.Series(bioactivity_threshold, name = 'bioactivity_class')
df5 = pd.concat([df4, bioactivity_class], axis=1)
df5

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,bioactivity_class
0,CHEMBL133897,CCOc1nn(-c2cccc(OCc3ccccc3)c2)c(=O)o1,750.00,active
1,CHEMBL336398,O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC1CC1,100.00,active
2,CHEMBL131588,CN(C(=O)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F)c1ccccc1,50000.00,inactive
3,CHEMBL130628,O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F,300.00,active
4,CHEMBL130478,CSc1nc(-c2ccc(OC(F)(F)F)cc2)nn1C(=O)N(C)C,800.00,active
...,...,...,...,...
6364,CHEMBL2238282,O=C(/C=C/c1ccc(N2CCCCC2)cc1)c1sccc1Cl,160.00,active
6365,CHEMBL4636881,CC(=O)Nc1c(F)cc(C(=O)N[C@H]2CC[C@H](O)CC2)cc1O...,7943.28,intermediate
6366,CHEMBL4635134,CNC(=O)c1cc(C(=O)NC2CC2)cn(Cc2ccccc2)c1=O,100000.00,inactive
6367,CHEMBL4639128,COCc1nc2cnc3cc(-c4c(C)noc4C)c(OC[C@H]4CCNC4)cc...,63095.73,inactive


Save the DataFrame to a CSV file

In [34]:
df5.to_csv('bioactivity_data_curated.csv', index=False)

In [30]:
! zip acetylcholinesterase.zip *.csv

  adding: acetylcholinesterase_02_bioactivity_data_preprocessed.csv (deflated 80%)
  adding: bioactivity_data_curated.csv (deflated 82%)
  adding: bioactivity_data_preprocessed.csv (deflated 80%)
  adding: bioactivity_data_raw.csv (deflated 91%)
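
The `*.csv` wildcard also picked up a CSV file that is not written by any cell shown here. As a sketch, the archive can instead be built from an explicit list of files with Python's standard `zipfile` module; the helper name `zip_files` is illustrative.

```python
import zipfile
from pathlib import Path


def zip_files(archive_path, file_paths):
    """Write the given files into a compressed zip archive and return its member names."""
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for file_path in file_paths:
            path = Path(file_path)
            # Store only the file name, not the directory it came from.
            archive.write(path, arcname=path.name)
    with zipfile.ZipFile(archive_path) as archive:
        return archive.namelist()


# The three files written in this notebook.
OUTPUT_FILES = [
    "bioactivity_data_raw.csv",
    "bioactivity_data_preprocessed.csv",
    "bioactivity_data_curated.csv",
]
# Example call (assumes the files exist in the working directory):
# zip_files("acetylcholinesterase.zip", OUTPUT_FILES)
```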


In [35]:
! ls -l

total 7244
-rw-r--r-- 1 root root  468001 May 10 15:21 acetylcholinesterase_02_bioactivity_data_preprocessed.csv
-rw-r--r-- 1 root root  737987 May 10 15:31 acetylcholinesterase.zip
-rw-r--r-- 1 root root  526353 May 10 15:31 bioactivity_data_curated.csv
-rw-r--r-- 1 root root  468001 May 10 15:31 bioactivity_data_preprocessed.csv
-rw-r--r-- 1 root root 5200159 May 10 15:31 bioactivity_data_raw.csv
drwxr-xr-x 1 root root    4096 May  8 21:21 sample_data


---