Handling Missing Values

Identify missing values in your dataset.
Decide on a strategy for handling them: imputation, deletion, or prediction from the other features.
For numerical data, imputation usually means filling with the mean or median; the mode is an option for discrete values.
For categorical data, imputation could involve using the most frequent category or a placeholder value like 'Unknown'.
Implement the chosen strategy to fill in missing values.
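
The steps above can be sketched with pandas as follows. This is a minimal illustration on a made-up table, not the pipeline used later in this notebook; the column names `affinity` and `family` and all values are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one numerical and one categorical column with gaps.
toy = pd.DataFrame({
    "affinity": [1.0, np.nan, 3.0, 100.0],
    "family": ["kinase", None, "kinase", "protease"],
})

# Step 1: identify missing values per column.
missing_per_column = toy.isna().sum()

# Step 2a: numerical column -> fill with the median,
# which is less sensitive to extreme values than the mean.
imputed = toy.copy()
imputed["affinity"] = imputed["affinity"].fillna(imputed["affinity"].median())

# Step 2b: categorical column -> fill with a placeholder category.
imputed["family"] = imputed["family"].fillna("Unknown")

# Alternative strategy: delete incomplete rows instead of imputing.
deleted = toy.dropna()

print(missing_per_column)
print(imputed)
print(len(deleted), "rows remain after deletion")
```

Which strategy is appropriate depends on how much data is missing and why; deletion is the simplest choice when only a few rows are affected.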
Standardizing/Normalizing Numerical Values

Standardization: Transform each numerical feature to have a mean of 0 and a standard deviation of 1.
Calculate the mean and standard deviation of each numerical feature.
Subtract the feature's mean from each of its values.
Divide the result by the feature's standard deviation.
Normalization (min-max scaling): Rescale each feature to a fixed range, usually 0 to 1.
Find the minimum and maximum value of each feature.
Subtract the feature's minimum from each of its values.
Divide the result by the feature's range (max - min).
Choose between standardization and normalization based on your model and data distribution.
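
Both transformations can be written in a few lines of NumPy. The sketch below uses a made-up feature matrix and the hypothetical helper names `standardize` and `min_max_normalize`; it assumes no feature is constant (otherwise the division is by zero) and uses the population standard deviation.

```python
import numpy as np

# Toy feature matrix (hypothetical): rows are samples, columns are features.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

def standardize(X):
    """Column-wise z-score: (x - mean) / std."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)  # population standard deviation (ddof=0)
    return (X - mean) / std

def min_max_normalize(X):
    """Column-wise rescaling to the range [0, 1]: (x - min) / (max - min)."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return (X - x_min) / (x_max - x_min)

X_std = standardize(X)
X_norm = min_max_normalize(X)

print(X_std)
print(X_norm)
```

In a train/validation setup, the statistics (mean, standard deviation, minimum, maximum) would normally be computed on the training split only and then reused for the validation split.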
Tokenizing Words

If your data includes textual information, convert the text into tokens (words or characters).
Decide on the granularity of tokens (words, characters, or n-grams).
Use a tokenizer to split the text into tokens.
Optionally, convert tokens into numerical values (e.g., through embeddings or one-hot encoding).
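
As an illustration of these steps, here is a small tokenizer sketch in plain Python. The helper names (`tokenize`, `build_vocab`, `encode`) and the special tokens `<pad>` and `<unk>` are assumptions made for this example. Character-level tokens are a natural fit for protein sequences, where each letter is one amino acid; the two short sequences are prefixes of targets shown later in this notebook.

```python
def tokenize(text, level="char"):
    """Split text into tokens at the chosen granularity."""
    if level == "char":
        return list(text)
    if level == "word":
        return text.split()
    raise ValueError(f"unsupported level: {level}")

def build_vocab(token_lists):
    """Assign an integer id to every distinct token, in order of first appearance."""
    vocab = {"<pad>": 0, "<unk>": 1}  # reserved ids (hypothetical convention)
    for tokens in token_lists:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def encode(tokens, vocab):
    """Map tokens to ids; tokens missing from the vocabulary map to <unk>."""
    return [vocab.get(token, vocab["<unk>"]) for token in tokens]

sequences = ["MKKFF", "MVLG"]
token_lists = [tokenize(s, level="char") for s in sequences]
vocab = build_vocab(token_lists)
ids = [encode(tokens, vocab) for tokens in token_lists]

print(vocab)
print(ids)
```

The integer ids can then be fed to an embedding layer or expanded into one-hot vectors.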
Removing Outliers

Identify outliers in your data.
Use statistical methods (e.g., Z-score, IQR) to detect outliers.
Decide whether to remove outliers or cap them.
Remove or cap outliers based on the chosen strategy.
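
The following sketch applies both detection methods to a made-up array and shows removal and capping side by side. The multiplier `k=1.5` and the |z| > 3 cutoff are common conventions rather than values taken from this notebook, and `iqr_bounds` is a hypothetical helper name.

```python
import numpy as np

# Toy measurements (hypothetical) with one extreme value.
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 300.0])

def iqr_bounds(x, k=1.5):
    """Return the (low, high) fences q1 - k*IQR and q3 + k*IQR."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# IQR method: flag everything outside the fences.
low, high = iqr_bounds(values)
is_outlier = (values < low) | (values > high)

# Strategy 1: remove the flagged values.
removed = values[~is_outlier]

# Strategy 2: cap (clip) the flagged values to the fences.
capped = np.clip(values, low, high)

# Z-score method: distance from the mean in units of standard deviation.
# In a sample this small, one extreme value inflates the standard deviation,
# so its |z| can stay below the usual cutoff of 3.
z = (values - values.mean()) / values.std()

print(is_outlier)
print(removed)
print(capped)
print(np.abs(z).max())
```

The IQR fences are based on quartiles and are therefore less affected by the outlier itself than the mean and standard deviation used by the Z-score.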
Removing Duplicates

Check for duplicate entries in your dataset.
Use a function such as pandas' DataFrame.duplicated or DataFrame.drop_duplicates to identify and remove duplicates.
Ensure that the removal of duplicates does not affect the integrity of your data.
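
A minimal sketch of these checks on a made-up table of drug-target pairs is shown below. The column names mirror the dataset used later in this notebook, but the values are invented. It also illustrates the integrity concern: keeping only the first duplicate silently discards the other measurements, whereas averaging keeps their information.

```python
import pandas as pd

# Toy drug-target pairs (hypothetical): the pair (1, "A") was measured twice.
pairs = pd.DataFrame({
    "Drug_ID": [1, 1, 1, 2],
    "Target_ID": ["A", "A", "B", "A"],
    "Y": [10.0, 30.0, 5.0, 7.0],
})

key = ["Drug_ID", "Target_ID"]

# Identify every row that belongs to a duplicated pair.
dup_mask = pairs.duplicated(subset=key, keep=False)

# Option 1: keep the first occurrence and drop the rest.
first_only = pairs.drop_duplicates(subset=key)

# Option 2: collapse duplicates by averaging their labels.
averaged = pairs.groupby(key, as_index=False)["Y"].mean()

print(pairs[dup_mask])
print(first_only)
print(averaged)
```

Comparing the row counts before and after, as the cells below do, is a quick way to confirm how many rows each option removed.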

In [2]:
from methods.widedta import _WideDTADataHandler
from modules.encoders import WideCNN
from tdc.multi_pred import DTI
from pathlib import Path
from sklearn.model_selection import KFold

In [3]:
ds = "bindingdb_kd"

ds = ds.lower()
df = DTI(ds, path=Path("..", "data", ds), print_stats=True).get_data()
ori_size = df.shape[0]

ori_size

Found local copy...
Loading...
--- Dataset Statistics ---
10661 unique drugs.
1413 unique targets.
52274 drug-target pairs.
--------------------------
Done!


52274

In [6]:
df = DTI(ds, path=Path("..", "data", ds), print_stats=True)

df.harmonize_affinities(mode="mean")

df = df.get_data()

df[df["Drug_ID"].isna()]

Found local copy...
Loading...
--- Dataset Statistics ---
10661 unique drugs.
1413 unique targets.
52274 drug-target pairs.
--------------------------
Done!
The original data has been updated!


Unnamed: 0,Drug_ID,Drug,Target_ID,Target,Y


In [11]:
# Group by 'Target_ID' and count the number of unique 'Target' sequences for each 'Target_ID'
unique_targets_per_id = df.groupby('Target_ID')['Target'].nunique()

# Filter to find 'Target_ID's associated with more than one unique 'Target' sequence
non_unique_target_ids = unique_targets_per_id[unique_targets_per_id > 1]

# Displaying 'Target_ID's with more than one unique 'Target' sequence
non_unique_target_ids

Series([], Name: Target, dtype: int64)

In [18]:
# Group by 'Drug' (SMILES string) and count the number of unique 'Target' sequences for each drug
unique_drug_per_id = df.groupby('Drug')['Target'].nunique()

# Filter to find drugs associated with more than one unique 'Target' sequence
non_unique_drug_ids = unique_drug_per_id[unique_drug_per_id > 1]

# Display the drugs with more than one unique 'Target' sequence
non_unique_drug_ids


Drug
C#CCCCOCCCc1cnc[nH]1                                                3
C#CCOCCCc1cnc[nH]1                                                  2
C#C[C@]1(O)CC[C@H]2[C@@H]3CCC4=CC(=O)CC[C@@H]4[C@H]3CC[C@@]21C      2
C#Cc1ccc(Cn2ccnc2CCc2c(Cl)c(O)cc(O)c2C(=O)OC)cc1                    2
C#Cc1cccc(Nc2ncnc3cc(OCCOC)c(OCCOC)cc23)c1                        363
                                                                 ... 
c1ncc(CCCOCCC2CCCCC2)[nH]1                                          3
c1ncc(CCCOCCCC2CCCC2)[nH]1                                          3
c1ncc(CCCOCCCC2CCCCC2)[nH]1                                         2
c1ncc(COCC2CCCCC2)[nH]1                                             2
c1ncc(COCCCC2CCCCC2)[nH]1                                           2
Name: Target, Length: 2259, dtype: int64

In [21]:
# Filter for one specific drug (the SMILES string paired with 363 targets above)
filtered_df = df[df["Drug"] == "C#Cc1cccc(Nc2ncnc3cc(OCCOC)c(OCCOC)cc23)c1"]

# Finding duplicates based on 'Drug' and 'Target'
duplicates = filtered_df[filtered_df.duplicated(subset=['Drug', 'Target'], keep=False)]

# Displaying the duplicates
duplicates

Unnamed: 0,Drug_ID,Drug,Target_ID,Target,Y


In [3]:
kfold = KFold(n_splits=5, shuffle=True)

In [14]:
for train_idx, val_idx in kfold.split(df):
    print(train_idx, val_idx)

[    0     1     2 ... 25769 25770 25771] [   10    13    16 ... 25763 25766 25767]
[    2     4     6 ... 25769 25770 25771] [    0     1     3 ... 25741 25758 25759]
[    0     1     2 ... 25769 25770 25771] [   26    28    62 ... 25754 25756 25768]
[    0     1     2 ... 25767 25768 25771] [    6    17    32 ... 25765 25769 25770]
[    0     1     3 ... 25768 25769 25770] [    2     4     8 ... 25757 25760 25771]
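
The index arrays printed above are positional, so they are meant to be used with `DataFrame.iloc` rather than `DataFrame.loc`. This matters once rows have been dropped and the index labels are no longer contiguous. The sketch below shows the pattern on a made-up table; the fixed `random_state` is an assumption added here to make the split reproducible and is not set in the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Toy table (hypothetical) standing in for the drug-target DataFrame.
toy = pd.DataFrame({"x": np.arange(10), "y": np.arange(10) * 2.0})

# random_state makes the shuffled folds reproducible across runs.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_sizes = []
seen_val = []
for fold, (train_idx, val_idx) in enumerate(kfold.split(toy)):
    # Positional indexing: train_idx / val_idx are row positions, not labels.
    train_df = toy.iloc[train_idx]
    val_df = toy.iloc[val_idx]
    fold_sizes.append((len(train_df), len(val_df)))
    seen_val.extend(val_idx.tolist())
    print(f"fold {fold}: {len(train_df)} train rows, {len(val_df)} validation rows")
```

Every row lands in exactly one validation fold, so the five validation sets together cover the whole table.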


In [83]:
# Keep one row per (Drug_ID, Target_ID) pair
df = df.drop_duplicates(subset=['Drug_ID', 'Target_ID'])

ori_size = df.shape[0]

ori_size

25772

In [84]:
df = df.dropna()

print("NaN dropped:", ori_size - df.shape[0])

NaN dropped: 0


In [85]:
# Count the number of distinct targets measured for each drug
drug_target_counts = df.groupby('Drug_ID')['Target_ID'].nunique().reset_index(name='Target_Count')

print("Min interactions: ", drug_target_counts["Target_Count"].min())
print("Max interactions: ", drug_target_counts["Target_Count"].max())


Min interactions:  379
Max interactions:  379


In [86]:
drug_target_counts[drug_target_counts['Target_Count'] < 10]

Unnamed: 0,Drug_ID,Target_Count


In [87]:
# Drop drugs that were measured against fewer than 10 targets
drug_ids_to_remove = drug_target_counts[drug_target_counts['Target_Count'] < 10]['Drug_ID']

df = df[~df['Drug_ID'].isin(drug_ids_to_remove)]

df.shape[0]

25772

In [88]:
df

Unnamed: 0,Drug_ID,Drug,Target_ID,Target,Y
0,11314340,Cc1[nH]nc2ccc(-c3cncc(OCC(N)Cc4ccccc4)c3)cc12,AAK1,MKKFFDSRREQGGSGLGSGSSGGGGSTSGLGSGYIGRVFGIGRQQV...,43.0
1,11314340,Cc1[nH]nc2ccc(-c3cncc(OCC(N)Cc4ccccc4)c3)cc12,ABL1p,PFWKILNPLLERGTYYYFMGQQPGKVLGDQRRPSLPALHFIKGAGK...,10000.0
2,11314340,Cc1[nH]nc2ccc(-c3cncc(OCC(N)Cc4ccccc4)c3)cc12,ABL2,MVLGTVLLPPNSYGRDQDTSLCCLCTEASESALPDLTDHFASCVED...,10000.0
3,11314340,Cc1[nH]nc2ccc(-c3cncc(OCC(N)Cc4ccccc4)c3)cc12,ACVR1,MVDGVMILPVLIMIALPSPSMEDEKPKVNPKLYMCVCEGLSCGNED...,10000.0
4,11314340,Cc1[nH]nc2ccc(-c3cncc(OCC(N)Cc4ccccc4)c3)cc12,ACVR1B,MAESAGASSFFPLVVLLLAGSGGSGPRGVQALLCACTSCLQANYTC...,10000.0
...,...,...,...,...,...
25767,151194,Clc1ccc(Nc2nnc(Cc3ccncc3)c3ccccc23)cc1,YES,MGCIKSKENKSPAIKYRPENTPEPVSTSVSHYGAEPTTVSPCPSSS...,10000.0
25768,151194,Clc1ccc(Nc2nnc(Cc3ccncc3)c3ccccc23)cc1,YSK1,MAHLRGFANQHSRVDPEELFTKLDRIGKGSFGEVYKGIDNHTKEVV...,10000.0
25769,151194,Clc1ccc(Nc2nnc(Cc3ccncc3)c3ccccc23)cc1,YSK4,MSSMPKPERHAESLLDICHDTNSSPTDLMTVTKNQNIILQSISRSE...,1900.0
25770,151194,Clc1ccc(Nc2nnc(Cc3ccncc3)c3ccccc23)cc1,ZAK,MSSLGASFVQIKFDDLQFFENCGGGSFGSVYRAKWISQDKEVAVKK...,4400.0


In [89]:
print("Original size:", ori_size)
print("Final size:    ", df.shape[0])
print("Difference:  ", df.shape[0] - ori_size)

Original size: 25772
Final size:     25772
Difference:   0
