Handling Missing Values

Identify missing values in your dataset.
Decide on a strategy for handling them: imputation, deletion, or prediction.
For numerical data, imputation could mean using the mean, median, or mode.
For categorical data, imputation could involve using the most frequent category or a placeholder value like 'Unknown'.
Implement the chosen strategy to fill in missing values.
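As a minimal sketch of the steps above, the following uses pandas on a small made-up frame. The column names `affinity` and `assay_type` are hypothetical and are not taken from the dataset used later in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data; both column names are hypothetical.
toy = pd.DataFrame({
    "affinity": [1.0, np.nan, 3.0, 100.0],
    "assay_type": ["binding", None, "binding", "functional"],
})

# Identify: count the missing values in each column.
missing_counts = toy.isna().sum()

# Numerical column: fill with the median, which is less affected
# by the extreme value 100.0 than the mean would be.
imputed = toy.copy()
imputed["affinity"] = imputed["affinity"].fillna(imputed["affinity"].median())

# Categorical column: fill with an explicit placeholder category.
imputed["assay_type"] = imputed["assay_type"].fillna("Unknown")

# Alternative strategy (deletion): drop every row that has a missing value.
dropped = toy.dropna()
```

Deletion is the simplest option when plenty of rows remain, while imputation keeps the row count unchanged at the cost of introducing estimated values.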
Standardizing/Normalizing Numerical Values

Standardization: Transform each numerical feature to have a mean of 0 and a standard deviation of 1.
Calculate the mean and standard deviation of each numerical feature.
Subtract the feature's mean from each of its values.
Divide the result by the feature's standard deviation.
Normalization: Scale each numerical feature to a fixed range, usually 0 to 1.
Find the minimum and maximum values of each feature.
Subtract the feature's minimum from each of its values.
Divide the result by the feature's range (max - min).
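Choose between standardization and normalization based on your model and data distribution: standardization suits roughly bell-shaped features, while min-max normalization gives a bounded range but is sensitive to extreme values.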
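The two recipes can be written out by hand with NumPy. This is a sketch on made-up numbers, intended only to show the arithmetic; the population standard deviation (`ddof=0`) is used because that is what scikit-learn's `StandardScaler` computes.

```python
import numpy as np

# Made-up values for a single numerical feature.
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Standardization: subtract the mean, divide by the standard deviation.
mean = values.mean()
std = values.std()  # population standard deviation (ddof=0)
standardized = (values - mean) / std

# Normalization: subtract the minimum, divide by the range.
v_min, v_max = values.min(), values.max()
normalized = (values - v_min) / (v_max - v_min)
```

Here the mean is 5 and the standard deviation is 2, so the value 9.0 standardizes to 2.0 and normalizes to 1.0.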
Tokenizing Words

If your data includes textual information, convert the text into tokens (words or characters).
Decide on the granularity of tokens (words, characters, or n-grams).
Use a tokenizer to split the text into tokens.
Optionally, convert tokens into numerical values (e.g., through embeddings or one-hot encoding).
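The three granularities can be illustrated with the standard library alone. This is a simplified sketch: the regular expression, the helper `ngrams`, and the integer-id vocabulary are illustrative choices, not the tokenizer used by any particular model.

```python
import re

text = "Drug target interaction prediction"

# Word-level tokens: lowercase, then keep runs of letters and digits.
word_tokens = re.findall(r"[a-z0-9]+", text.lower())

# Character-level tokens, e.g. one token per residue of a protein sequence.
char_tokens = list("MGNAAAAKK")

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Bigrams over the word tokens.
bigrams = ngrams(word_tokens, 2)

# Numerical encoding: map each distinct token to an integer id
# (sorted so that the mapping is reproducible).
vocab = {tok: idx for idx, tok in enumerate(sorted(set(word_tokens)))}
token_ids = [vocab[tok] for tok in word_tokens]
```

Integer ids like these are the usual input to an embedding layer or a one-hot encoder.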
Removing Outliers

Identify outliers in your data.
Use statistical methods (e.g., Z-score, IQR) to detect outliers.
Decide whether to remove outliers or cap them.
Remove or cap outliers based on the chosen strategy.
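A small sketch of the IQR rule with both follow-up strategies, on made-up values. The 1.5 multiplier is the conventional default, not a requirement. The Z-score is computed alongside to show that, on a very small sample, a single extreme value may not reach the common threshold of 3.

```python
import numpy as np

# Made-up values with one obvious extreme point.
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 300.0])

# IQR rule: flag points more than 1.5 * IQR outside the quartiles.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

# Strategy 1: remove the flagged points.
removed = values[~is_outlier]

# Strategy 2: cap (winsorize) values at the IQR fences instead.
capped = np.clip(values, lower, upper)

# Z-score for comparison; with only 8 points, |z| cannot exceed sqrt(7).
z_scores = (values - values.mean()) / values.std()
flagged_by_z = np.abs(z_scores) >= 3
```

In this toy case the IQR rule flags the value 300.0, while the Z-score rule flags nothing, which is one reason to check both before choosing a method.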
Removing Duplicates

Check for duplicate entries in your dataset.
Use a function to identify and remove duplicates.
Ensure that the removal of duplicates does not affect the integrity of your data.
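One integrity check worth running before dropping duplicates is whether rows that share a key also agree on the label. The sketch below uses a made-up frame whose column names mirror the `Drug_ID`, `Target_ID`, and `Y` columns of the dataset loaded later; the values are invented, and taking the median is only one possible way to resolve conflicts.

```python
import pandas as pd

# Made-up data: pair (2, "P1") is measured twice with different labels.
toy = pd.DataFrame({
    "Drug_ID": [1, 1, 2, 2, 3],
    "Target_ID": ["P1", "P1", "P1", "P1", "P2"],
    "Y": [10.0, 10.0, 5.0, 50.0, 7.0],
})
keys = ["Drug_ID", "Target_ID"]

# All rows whose key occurs more than once.
dup_rows = toy[toy.duplicated(subset=keys, keep=False)]

# Keys whose duplicate rows disagree on the label.
label_counts = toy.groupby(keys)["Y"].nunique()
conflicting = label_counts[label_counts > 1]

# Option A: keep the first row per key (silently discards Y = 50.0).
first_kept = toy.drop_duplicates(subset=keys)

# Option B: aggregate the labels per key, here with the median.
aggregated = toy.groupby(keys, as_index=False)["Y"].median()
```

If `conflicting` is non-empty, keeping the first row makes the retained label depend on row order, so an explicit aggregation may be the safer choice.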

In [58]:
from methods.widedta import _WideDTADataHandler
from modules.encoders import WideCNN
from tdc.multi_pred import DTI
from pathlib import Path


In [70]:
ds = "bindingdb_ic50"

ds = ds.lower()
df = DTI(ds, path=Path("..", "data", ds), print_stats=True).get_data()
ori_size = df.shape[0]

ori_size

Found local copy...
Loading...
--- Dataset Statistics ---
548633 unique drugs.
5077 unique targets.
990630 drug-target pairs.
--------------------------
Done!


990630

In [71]:
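# Keep one row per (drug, target) pair; the first occurrence is retained.
df = df.drop_duplicates(subset=['Drug_ID', 'Target_ID'])

# Note: ori_size now holds the row count after deduplication,
# which is the baseline used in the comparisons below.
ori_size = df.shape[0]

ori_size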

854120

In [72]:
df = df.dropna()

# Rows before minus rows after, so the reported count is positive.
print("NaN dropped:", ori_size - df.shape[0])

NaN dropped: 89334


In [73]:
# Count the distinct targets each drug has been measured against.
drug_target_counts = df.groupby('Drug_ID')['Target_ID'].nunique().reset_index(name='Target_Count')

print("Min interactions: ", drug_target_counts["Target_Count"].min())
print("Max interactions: ", drug_target_counts["Target_Count"].max())


Min interactions:  1
Max interactions:  305


In [74]:
drug_target_counts[drug_target_counts['Target_Count'] < 10]

Unnamed: 0,Drug_ID,Target_Count
0,7.0,1
1,19.0,1
2,45.0,1
3,51.0,2
4,72.0,4
...,...,...
486177,145866835.0,1
486178,145866836.0,1
486179,145866837.0,1
486180,145866838.0,1


In [75]:
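# Drugs with fewer than 10 distinct targets are excluded.
drug_ids_to_remove = drug_target_counts[drug_target_counts['Target_Count'] < 10]['Drug_ID']

# .copy() makes df an independent frame, so later column assignments
# cannot raise pandas' SettingWithCopyWarning.
df = df[~df['Drug_ID'].isin(drug_ids_to_remove)].copy()

df.shape[0]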

23356

In [76]:
df

Unnamed: 0,Drug_ID,Drug,Target_ID,Target,Y
713,44259.0,CN[C@@H]1C[C@H]2O[C@@](C)([C@@H]1OC)n1c3ccccc3...,P68403,MADPAAGPPPSEGEESTVRFARKGALRQKNVHEVKNHKFTARFFKQ...,9.0
813,3815.0,O=C1NCc2c1c1c3ccccc3[nH]c1c1[nH]c3ccccc3c21,P00517,MGNAAAAKKGSEQESVKEFLAKAKEDFLKKWENPAQNTAHLDQFER...,25000.0
831,2396.0,CN(C)CCCn1cc(C2=C(c3c[nH]c4ccccc34)C(=O)NC2=O)...,P15127,MGSGRGCETTAVPLLMAVAVAGGTAGHLYPGEVCPGMDIRNNLTRL...,100000.0
833,2396.0,CN(C)CCCn1cc(C2=C(c3c[nH]c4ccccc34)C(=O)NC2=O)...,P00517,MGNAAAAKKGSEQESVKEFLAKAKEDFLKKWENPAQNTAHLDQFER...,10.0
839,2396.0,CN(C)CCCn1cc(C2=C(c3c[nH]c4ccccc34)C(=O)NC2=O)...,P27791,MGNAAAAKKGSEQESVKEFLAKAKEDFLKKWEDPSQNTAQLDHFDR...,2000.0
...,...,...,...,...,...
989666,72710568.0,Cc1nc([C@](C)(O)CO)sc1-c1cnc(N)c(O[C@H](C)c2cc...,P29376,MGCWGQLLVWFGAAGAILCSSPGSQETFLRSSPLPLASPSPRDPKV...,2.0
989667,72710568.0,Cc1nc([C@](C)(O)CO)sc1-c1cnc(N)c(O[C@H](C)c2cc...,P16591,MGFGSDLKNSHEAVLKLQDWELRLLETVKKFMALRIKSDKEYASTL...,2.0
990259,91895618.0,O=Nc1c(C2C(=O)Nc3cc(Br)ccc32)[nH]c2ccccc12,P49759,MRHSKRTYCPDWDDKDWDYGKWRSSSSHKRRKRSHSSAQENKRCKY...,2100.0
990262,5329009.0,COc1cc2ncnc(Nc3ccc(F)c(Cl)c3)c2cc1OC,P49759,MRHSKRTYCPDWDDKDWDYGKWRSSSSHKRRKRSHSSAQENKRCKY...,7600.0


In [77]:
print("Original size:", ori_size)
print("Final size:    ", df.shape[0])
print("Difference:  ", df.shape[0] - ori_size)

Original size: 854120
Final size:     23356
Difference:   -830764


In [78]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Standardize the label column Y to zero mean and unit variance.
scaler = StandardScaler()
df[['Y']] = scaler.fit_transform(df[['Y']])

df

Unnamed: 0,Drug_ID,Drug,Target_ID,Target,Y
713,44259.0,CN[C@@H]1C[C@H]2O[C@@](C)([C@@H]1OC)n1c3ccccc3...,P68403,MADPAAGPPPSEGEESTVRFARKGALRQKNVHEVKNHKFTARFFKQ...,-0.157165
813,3815.0,O=C1NCc2c1c1c3ccccc3[nH]c1c1[nH]c3ccccc3c21,P00517,MGNAAAAKKGSEQESVKEFLAKAKEDFLKKWENPAQNTAHLDQFER...,-0.064316
831,2396.0,CN(C)CCCn1cc(C2=C(c3c[nH]c4ccccc34)C(=O)NC2=O)...,P15127,MGSGRGCETTAVPLLMAVAVAGGTAGHLYPGEVCPGMDIRNNLTRL...,0.214329
833,2396.0,CN(C)CCCn1cc(C2=C(c3c[nH]c4ccccc34)C(=O)NC2=O)...,P00517,MGNAAAAKKGSEQESVKEFLAKAKEDFLKKWENPAQNTAHLDQFER...,-0.157161
839,2396.0,CN(C)CCCn1cc(C2=C(c3c[nH]c4ccccc34)C(=O)NC2=O)...,P27791,MGNAAAAKKGSEQESVKEFLAKAKEDFLKKWEDPSQNTAQLDHFDR...,-0.149768
...,...,...,...,...,...
989666,72710568.0,Cc1nc([C@](C)(O)CO)sc1-c1cnc(N)c(O[C@H](C)c2cc...,P29376,MGCWGQLLVWFGAAGAILCSSPGSQETFLRSSPLPLASPSPRDPKV...,-0.157191
989667,72710568.0,Cc1nc([C@](C)(O)CO)sc1-c1cnc(N)c(O[C@H](C)c2cc...,P16591,MGFGSDLKNSHEAVLKLQDWELRLLETVKKFMALRIKSDKEYASTL...,-0.157191
990259,91895618.0,O=Nc1c(C2C(=O)Nc3cc(Br)ccc32)[nH]c2ccccc12,P49759,MRHSKRTYCPDWDDKDWDYGKWRSSSSHKRRKRSHSSAQENKRCKY...,-0.149396
990262,5329009.0,COc1cc2ncnc(Nc3ccc(F)c(Cl)c3)c2cc1OC,P49759,MRHSKRTYCPDWDDKDWDYGKWRSSSSHKRRKRSHSSAQENKRCKY...,-0.128962


In [79]:
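# Min-max scale Y to [0, 1]. This runs on the already standardized column.
# Min-max scaling is unaffected by a prior linear rescaling, so the result is
# the same as scaling the raw values and the standardization above has no
# lasting effect.
scaler = MinMaxScaler()

df[['Y']] = scaler.fit_transform(df[['Y']])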

df

Unnamed: 0,Drug_ID,Drug,Target_ID,Target,Y
713,44259.0,CN[C@@H]1C[C@H]2O[C@@](C)([C@@H]1OC)n1c3ccccc3...,P68403,MADPAAGPPPSEGEESTVRFARKGALRQKNVHEVKNHKFTARFFKQ...,9.000000e-07
813,3815.0,O=C1NCc2c1c1c3ccccc3[nH]c1c1[nH]c3ccccc3c21,P00517,MGNAAAAKKGSEQESVKEFLAKAKEDFLKKWENPAQNTAHLDQFER...,2.500000e-03
831,2396.0,CN(C)CCCn1cc(C2=C(c3c[nH]c4ccccc34)C(=O)NC2=O)...,P15127,MGSGRGCETTAVPLLMAVAVAGGTAGHLYPGEVCPGMDIRNNLTRL...,1.000000e-02
833,2396.0,CN(C)CCCn1cc(C2=C(c3c[nH]c4ccccc34)C(=O)NC2=O)...,P00517,MGNAAAAKKGSEQESVKEFLAKAKEDFLKKWENPAQNTAHLDQFER...,1.000000e-06
839,2396.0,CN(C)CCCn1cc(C2=C(c3c[nH]c4ccccc34)C(=O)NC2=O)...,P27791,MGNAAAAKKGSEQESVKEFLAKAKEDFLKKWEDPSQNTAQLDHFDR...,2.000000e-04
...,...,...,...,...,...
989666,72710568.0,Cc1nc([C@](C)(O)CO)sc1-c1cnc(N)c(O[C@H](C)c2cc...,P29376,MGCWGQLLVWFGAAGAILCSSPGSQETFLRSSPLPLASPSPRDPKV...,2.000000e-07
989667,72710568.0,Cc1nc([C@](C)(O)CO)sc1-c1cnc(N)c(O[C@H](C)c2cc...,P16591,MGFGSDLKNSHEAVLKLQDWELRLLETVKKFMALRIKSDKEYASTL...,2.000000e-07
990259,91895618.0,O=Nc1c(C2C(=O)Nc3cc(Br)ccc32)[nH]c2ccccc12,P49759,MRHSKRTYCPDWDDKDWDYGKWRSSSSHKRRKRSHSSAQENKRCKY...,2.100000e-04
990262,5329009.0,COc1cc2ncnc(Nc3ccc(F)c(Cl)c3)c2cc1OC,P49759,MRHSKRTYCPDWDDKDWDYGKWRSSSSHKRRKRSHSSAQENKRCKY...,7.600000e-04
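
In the output above, each Y equals the raw value divided by about 1e7 (for example, 9.0 became 9e-07), which is what min-max scaling of the raw column would give if its minimum is near 0 and its maximum near 1e7. The sketch below, which uses the first five raw Y values shown earlier and a hypothetical helper `min_max`, checks the underlying property: min-max scaling gives the same result whether or not the data were standardized first.

```python
import numpy as np

# First five raw Y values from the table shown earlier in the notebook.
raw = np.array([9.0, 25000.0, 100000.0, 10.0, 2000.0])

def min_max(x):
    """Scale an array linearly so its minimum is 0 and its maximum is 1."""
    return (x - x.min()) / (x.max() - x.min())

# Standardize first, then min-max scale ...
standardized = (raw - raw.mean()) / raw.std()
scaled_after_standardizing = min_max(standardized)

# ... and compare with min-max scaling the raw values directly.
scaled_directly = min_max(raw)
same = np.allclose(scaled_after_standardizing, scaled_directly)
```

Because the two results coincide, running both scalers in sequence is equivalent to running only the last one.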


In [80]:
# Remove outliers: keep only rows whose Y lies within 3 standard deviations
# of the mean (|z-score| < 3).
from scipy import stats
import numpy as np

df = df[(np.abs(stats.zscore(df['Y'])) < 3)]

df

Unnamed: 0,Drug_ID,Drug,Target_ID,Target,Y
713,44259.0,CN[C@@H]1C[C@H]2O[C@@](C)([C@@H]1OC)n1c3ccccc3...,P68403,MADPAAGPPPSEGEESTVRFARKGALRQKNVHEVKNHKFTARFFKQ...,9.000000e-07
813,3815.0,O=C1NCc2c1c1c3ccccc3[nH]c1c1[nH]c3ccccc3c21,P00517,MGNAAAAKKGSEQESVKEFLAKAKEDFLKKWENPAQNTAHLDQFER...,2.500000e-03
831,2396.0,CN(C)CCCn1cc(C2=C(c3c[nH]c4ccccc34)C(=O)NC2=O)...,P15127,MGSGRGCETTAVPLLMAVAVAGGTAGHLYPGEVCPGMDIRNNLTRL...,1.000000e-02
833,2396.0,CN(C)CCCn1cc(C2=C(c3c[nH]c4ccccc34)C(=O)NC2=O)...,P00517,MGNAAAAKKGSEQESVKEFLAKAKEDFLKKWENPAQNTAHLDQFER...,1.000000e-06
839,2396.0,CN(C)CCCn1cc(C2=C(c3c[nH]c4ccccc34)C(=O)NC2=O)...,P27791,MGNAAAAKKGSEQESVKEFLAKAKEDFLKKWEDPSQNTAQLDHFDR...,2.000000e-04
...,...,...,...,...,...
989666,72710568.0,Cc1nc([C@](C)(O)CO)sc1-c1cnc(N)c(O[C@H](C)c2cc...,P29376,MGCWGQLLVWFGAAGAILCSSPGSQETFLRSSPLPLASPSPRDPKV...,2.000000e-07
989667,72710568.0,Cc1nc([C@](C)(O)CO)sc1-c1cnc(N)c(O[C@H](C)c2cc...,P16591,MGFGSDLKNSHEAVLKLQDWELRLLETVKKFMALRIKSDKEYASTL...,2.000000e-07
990259,91895618.0,O=Nc1c(C2C(=O)Nc3cc(Br)ccc32)[nH]c2ccccc12,P49759,MRHSKRTYCPDWDDKDWDYGKWRSSSSHKRRKRSHSSAQENKRCKY...,2.100000e-04
990262,5329009.0,COc1cc2ncnc(Nc3ccc(F)c(Cl)c3)c2cc1OC,P49759,MRHSKRTYCPDWDDKDWDYGKWRSSSSHKRRKRSHSSAQENKRCKY...,7.600000e-04
