<h1>Solution for Data Collection</h1>

Task 1

How many CYP protein structures are available in ChEMBL?

1. To solve this problem, we need to extract all targets whose names <b>contain</b> the word <b>Cytochrome</b>.
2. The task does not specify which organism we are interested in, so let's count how many CYP structures are available for each organism.

In [1]:
import pandas as pd
from chembl_webresource_client.new_client import new_client

In [2]:
# Gather the ChEMBL IDs of the targets of interest from ChEMBL
def download_targetid_chembl(target_name):
    target = new_client.target

    # Instead of matching the exact protein name, look for targets whose
    # preferred name contains the given word (case-insensitive).
    # Field names match the columns of the resulting dataframe.
    cyp = target.filter(pref_name__icontains=target_name).only(
        ['target_chembl_id', 'organism', 'pref_name'])

    return cyp

In [3]:
res = download_targetid_chembl('Cytochrome')

In [4]:
# Converting data into pandas dataframe format
cyp = pd.DataFrame(data=res)

In [5]:
# Let's take a look at the data
cyp.head()

index,organism,pref_name,target_chembl_id
0,Homo sapiens,Cytochrome P450 11A1,CHEMBL2033
1,Candida albicans (strain SC5314 / ATCC MYA-287...,Cytochrome P450 51,CHEMBL1780
2,Homo sapiens,Cytochrome P450 19A1,CHEMBL1978
3,Homo sapiens,Cytochrome P450 11B1,CHEMBL1908
4,Homo sapiens,Cytochrome P450 7A1,CHEMBL1851
...,...,...,...
122,Rattus norvegicus,NADH-cytochrome b5 reductase 3,CHEMBL4523194
123,Homo sapiens,Cytochrome P450 4F8,CHEMBL4523270
124,Homo sapiens,Cytochrome P450 4Z1,CHEMBL4523375
125,Homo sapiens,Cytochrome P450,CHEMBL4523986


In [9]:
print(len(cyp), 'CYP protein structures in ChEMBL')

127 CYP protein structures in ChEMBL


In [7]:
# Let's check how many different organisms are represented
cyp['organism'].describe()

count              127
unique              22
top       Homo sapiens
freq                53
Name: organism, dtype: object

The results tell us that there are 127 rows in total, covering 22 unique organisms. The most common value in the column is <i>Homo sapiens</i>, which occurs 53 times.
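The summary from `describe()` reports only the most frequent organism. To get the count for every organism, as step 2 of the plan asks, `value_counts()` is one option. The sketch below is a minimal illustration on a hand-built sample of rows copied from the output above (the `sample` dataframe is an assumption made for illustration; on the real data you would call the same method on `cyp['organism']`).

```python
import pandas as pd

# Illustrative sample built from rows shown in the output above.
# The real `cyp` dataframe has 127 rows; this is not the full dataset.
sample = pd.DataFrame({
    "organism": [
        "Homo sapiens",
        "Homo sapiens",
        "Homo sapiens",
        "Homo sapiens",
        "Rattus norvegicus",
    ],
    "pref_name": [
        "Cytochrome P450 11A1",
        "Cytochrome P450 19A1",
        "Cytochrome P450 11B1",
        "Cytochrome P450 7A1",
        "NADH-cytochrome b5 reductase 3",
    ],
    "target_chembl_id": [
        "CHEMBL2033",
        "CHEMBL1978",
        "CHEMBL1908",
        "CHEMBL1851",
        "CHEMBL4523194",
    ],
})

# Count how many targets belong to each organism, most frequent first
organism_counts = sample["organism"].value_counts()
print(organism_counts)
```

On the full dataframe, `cyp['organism'].value_counts()` would return one row per organism, which answers the per-organism question directly.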

In [13]:
# Show all CYP structures that belong to Homo sapiens
cyp[cyp['organism'] == 'Homo sapiens'].sort_values('pref_name')

index,organism,pref_name,target_chembl_id
112,Homo sapiens,Apoptotic protease-activating factor 1/Caspase...,CHEMBL3885517
125,Homo sapiens,Cytochrome P450,CHEMBL4523986
0,Homo sapiens,Cytochrome P450 11A1,CHEMBL2033
3,Homo sapiens,Cytochrome P450 11B1,CHEMBL1908
23,Homo sapiens,Cytochrome P450 11B2,CHEMBL2722
35,Homo sapiens,Cytochrome P450 17A1,CHEMBL3522
2,Homo sapiens,Cytochrome P450 19A1,CHEMBL1978
108,Homo sapiens,Cytochrome P450 1A,CHEMBL3544905
17,Homo sapiens,Cytochrome P450 1A1,CHEMBL2231
30,Homo sapiens,Cytochrome P450 1A2,CHEMBL3356
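Note that searching for the word "Cytochrome" also returns targets that are not CYP enzymes, such as NADH-cytochrome b5 reductase 3 and the Apoptotic protease-activating factor entry in the listings above. If only cytochrome P450 targets are wanted, one possible refinement is to filter the preferred names for "Cytochrome P450". The sketch below shows this on a small sample of rows copied from the output above; the `sample` dataframe and the choice of the string "Cytochrome P450" as the criterion are assumptions for illustration, not part of the original solution.

```python
import pandas as pd

# Illustrative sample built from rows shown in the output above
sample = pd.DataFrame({
    "organism": [
        "Homo sapiens",
        "Rattus norvegicus",
        "Homo sapiens",
        "Homo sapiens",
    ],
    "pref_name": [
        "Cytochrome P450 11A1",
        "NADH-cytochrome b5 reductase 3",
        "Cytochrome P450 4F8",
        "Cytochrome P450",
    ],
    "target_chembl_id": [
        "CHEMBL2033",
        "CHEMBL4523194",
        "CHEMBL4523270",
        "CHEMBL4523986",
    ],
})

# Keep only rows whose preferred name mentions "Cytochrome P450"
# (plain substring match, case-insensitive)
is_p450 = sample["pref_name"].str.contains(
    "Cytochrome P450", case=False, regex=False
)
p450_only = sample[is_p450]

print(len(p450_only), "of", len(sample), "sample targets mention Cytochrome P450")
```

Applying the same mask to `cyp['pref_name']` would give a count restricted to P450 targets, which may differ from the 127 reported for the broader "Cytochrome" search.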
