# Filtering the data
***

This notebook goes through the process of filtering the original data for training.

**Contents**

1. [Binding affinity table](#Binding-affinity-table)
1. [Removing entries](#Removing-entries)
    1. [Binding data](#Binding-data)
    1. [Ligands](#Ligands)

## Binding affinity table

The index file from the PDBbind dataset lists, for every protein-ligand complex, its ID (`PDB_ID`), the name of the ligand in the complex (`ligand_name`), and the ligand's binding affinity to the protein (given both as `-logKd/Ki` and as `Kd/Ki`).

In [1]:
import pandas as pd

# reading the index file into a Pandas DataFrame 
df = pd.read_csv("D:\\binding_data\\INDEX_general_PL_data_2020.csv")

In [2]:
df.head()

Unnamed: 0,PDB_ID,resolution,release_year,-logKd/Ki,Kd/Ki,reference,ligand_name
0,3zzf,2.2,2012,0.4,Ki=400mM,3zzf.pdf,(NLG)
1,3gww,2.46,2009,0.45,IC50=355mM,3gwu.pdf,(SFX)
2,1w8l,1.8,2004,0.49,Ki=320mM,1w8l.pdf,(1P3)
3,3fqa,2.35,2009,0.49,IC50=320mM,3fq7.pdf,(GAB&PMP)
4,1zsb,2.0,1996,0.6,Kd=250mM,1zsb.pdf,(AZM)
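The two affinity columns carry the same information: `-logKd/Ki` is the negative base-10 logarithm of the value in `Kd/Ki` once it is expressed in mol/L. The sketch below checks this for three of the rows shown above. The helper `neg_log_affinity`, the unit table and the regular expression are assumptions made for illustration; they are not part of PDBbind or of this notebook.

```python
import math
import re

# Assumed, non-exhaustive mapping from unit suffix to mol/L.
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12, "fM": 1e-15}

def neg_log_affinity(entry):
    """Return -log10 of the affinity in mol/L for a string such as 'Ki=400mM'."""
    # Assumed layout: measure name, a relation sign, a number, a unit.
    match = re.fullmatch(r"(Kd|Ki|IC50)[=~<>]+([0-9.]+)([a-zA-Z]+)", entry)
    if match is None:
        raise ValueError(f"unrecognised affinity string: {entry!r}")
    value = float(match.group(2))
    unit = match.group(3)
    return -math.log10(value * UNIT_TO_MOLAR[unit])

# Strings taken from the first rows of the table.
for entry in ["Ki=400mM", "IC50=355mM", "Kd=250mM"]:
    print(entry, round(neg_log_affinity(entry), 2))
```

Rounded to two decimals, the three results are 0.4, 0.45 and 0.6, which agree with the `-logKd/Ki` values in the corresponding rows.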


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19443 entries, 0 to 19442
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PDB_ID        19443 non-null  object 
 1   resolution    19443 non-null  object 
 2   release_year  19443 non-null  int64  
 3   -logKd/Ki     19443 non-null  float64
 4   Kd/Ki         19443 non-null  object 
 5   reference     19443 non-null  object 
 6   ligand_name   19443 non-null  object 
dtypes: float64(1), int64(1), object(5)
memory usage: 1.0+ MB


There are 19443 entries in total.

They will need to be filtered down, but first the columns that are not needed can be removed: `resolution`, `release_year` and `reference`.

In [4]:
df = df.drop(['resolution', 'release_year', 'reference'], axis=1)

## Removing entries

The next step is to remove entries that have issues or irregularities regarding the binding data or the ligands. 

### Binding data

Regarding the binding affinity, entries that matched either of the following criteria were removed:
* the binding affinity is reported as an IC50 value (in the `Kd/Ki` column), since IC50 is considered to be of lower accuracy.
* the binding affinity is reported with a `~` sign, which indicates that it is an approximation.

There are 132 approximate entries and 7152 with affinity reported as IC50:

In [5]:
approx_c = 0
ic50_c = 0
for i in df.index:
    if '~' in df['Kd/Ki'][i]:
        approx_c+=1
    elif 'IC50' in df['Kd/Ki'][i]:
        ic50_c+=1
print('Approximate values: {}\nIC50 values: {}'.format(approx_c, ic50_c))

Approximate values: 132
IC50 values: 7152
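The same tally can be obtained without an explicit loop by using pandas string methods. The following is a minimal sketch on a small made-up table (`toy` and its values are hypothetical), and it mirrors the `if`/`elif` logic: an entry that is both approximate and IC50 is counted only as approximate.

```python
import pandas as pd

# Hypothetical toy data standing in for the 'Kd/Ki' column.
toy = pd.DataFrame(
    {"Kd/Ki": ["Ki=400mM", "IC50=355mM", "Kd~10uM", "IC50~2nM", "Kd=250mM"]}
)

# Boolean masks, one per criterion (plain substring match, no regex).
is_approx = toy["Kd/Ki"].str.contains("~", regex=False)
is_ic50 = toy["Kd/Ki"].str.contains("IC50", regex=False)

approx_count = int(is_approx.sum())
# Mirror the elif: count an IC50 entry only if it is not already approximate.
ic50_count = int((is_ic50 & ~is_approx).sum())

print("Approximate values: {}\nIC50 values: {}".format(approx_count, ic50_count))
```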


Removing the approximate and IC50 values:

In [6]:
for i in df.index:
    if ('IC50' in df['Kd/Ki'][i]) or ('~' in df['Kd/Ki'][i]):
        df.drop(i, inplace=True)

In [7]:
# resetting the DataFrame index
df.reset_index(drop=True, inplace=True)

In [8]:
df.shape

(12159, 4)
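The shape is consistent with the counts above: 19443 - 132 - 7152 = 12159. Dropping rows one at a time works, but a boolean mask usually does the same job in a single step. The sketch below assumes a small made-up table (`toy`; the ID `xxxx` and its affinity are hypothetical) and is an alternative, not the code that produced the result above.

```python
import pandas as pd

# Toy table: three rows from the index file plus one hypothetical row.
toy = pd.DataFrame(
    {
        "PDB_ID": ["3zzf", "3gww", "1zsb", "xxxx"],
        "Kd/Ki": ["Ki=400mM", "IC50=355mM", "Kd=250mM", "Kd~10uM"],
    }
)

# True for rows reported as IC50 or as an approximate value.
unwanted = toy["Kd/Ki"].str.contains("IC50|~", regex=True)

# Keep the remaining rows and renumber the index in one step.
filtered = toy[~unwanted].reset_index(drop=True)
print(filtered)
```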

### Ligands

Ligand names normally consist only of digits and uppercase letters. Every ligand that does not conform to this pattern is checked here, because such names cause issues when downloading the ligands from the Protein Data Bank (see `Molecular_descriptors.ipynb` for details).

In [9]:
# list that will hold the ligands that do not conform to the general naming scheme
nonconformists = []

# accepted characters in the ligand ID
accepted = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890'

# list with the unique ligand names found in the DataFrame
unique_ligands = [i.strip('()') for i in list(df['ligand_name'].unique())]

# check if ligands conform, otherwise insert them in the 'nonconformists' list
for lig in unique_ligands:
    for char in lig:
        if char not in accepted:
            nonconformists.append(lig)
            break

In [10]:
len(nonconformists)

105

There are 105 ligands with unusual IDs.
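The character-by-character check can also be written as a regular expression. This is a minimal sketch on a handful of names taken from this notebook; `CONFORMING` and `is_conforming` are illustrative names, not part of the original code.

```python
import re

# One or more uppercase letters or digits, and nothing else.
CONFORMING = re.compile(r"[A-Z0-9]+")

def is_conforming(name):
    """Return True if the ligand name uses only uppercase letters and digits."""
    # Unlike the loop above, an empty name is treated as nonconforming here.
    return CONFORMING.fullmatch(name) is not None

names = ["NLG", "SFX", "GAB&PMP", "FMN_hq", "9LQ_18-mer"]
nonconforming = [name for name in names if not is_conforming(name)]
print(nonconforming)
```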

The problematic symbols are identified next:

In [11]:
# identifying the unique symbols that come up in the data
symbols = []
for lig in nonconformists:
    for char in lig:
        if char not in accepted and char not in symbols:
            symbols.append(char)

The following are the symbols that were identified. There are some lowercase letters too.

In [12]:
symbols

['-', 'm', 'e', 'r', '_', '/', 'c', '&', 'h', 'q', '.', '+', 'o', 'x', 's']
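Collecting the unexpected characters is a set difference: all characters that occur in the names, minus the accepted ones. A minimal sketch on three names from this notebook follows; sorting is added only to make the output order deterministic, so the order differs from the list above.

```python
accepted = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890")

# A few nonconforming names taken from this notebook.
names = ["GAB&PMP", "FMN_hq", "9LQ_18-mer"]

# Every character used in any name, minus the accepted ones.
symbols = sorted(set("".join(names)) - accepted)
print(symbols)
```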

The occurrences of some of the symbols are counted below.

In [13]:
symbols_subset = ['&', '-', 'mer', '_', '/', '.', '+', 'c']
counts = {}
for ligand in nonconformists:
    for s in symbols_subset:
        if s in ligand:
            counts.setdefault(s, 0)
            counts[s] += 1

In [14]:
counts

{'-': 72, 'mer': 20, '_': 9, '/': 16, 'c': 1, '&': 5, '.': 2, '+': 1}

In [15]:
under = [ligand for ligand in nonconformists if '_' in ligand]

In [16]:
under

['_Y3',
 '__U',
 '__N',
 '_VI',
 'FMN_hq',
 '_MC',
 '9LQ_18-mer',
 'FMN_ox',
 'FMN_sq',
 '_CG',
 '_VX',
 '_T3']

In the original work, the `FMN_`*x* ligands were simply renamed `FMN`, which is a mistake, but it likely had a small impact, since the total number of ligands that contained `FMN` was small (19).

Entries that have `+`, `&`, `/`, `_`, `.` or `-` in the ligand name can be removed, since they either contain multiple ligands bound to the protein, cannot be found on the Protein Data Bank, or are *n*-mers (such as a 3-mer, a short polymer made of 3 amino acids), which are outside the scope of this project.

In [17]:
to_drop = ['+', '&', '/', '-', '_', '.']

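for i in df.index:
    name = df['ligand_name'][i]
    for d in to_drop:
        # Note: the second test is true unless the name contains both 'q' and 'x',
        # so names such as FMN_hq, FMN_ox and FMN_sq are dropped as well.
        if (d in name) and (('q' not in name) or ('x' not in name)):
            df.drop(i, inplace=True)
            break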

In [18]:
# resetting the DataFrame index
df.reset_index(drop=True, inplace=True)

In [19]:
df.shape

(9762, 4)
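The removal by symbol can likewise be expressed as one regular-expression mask. The sketch below uses a made-up table (`toy` is hypothetical, with ligand names taken from this notebook) and drops every name that contains any of the listed symbols; it is an illustration of the idea rather than the code that produced the shape above.

```python
import re

import pandas as pd

to_drop = ['+', '&', '/', '-', '_', '.']

# Escape each symbol so that '+', '.' and '-' are matched literally.
pattern = "|".join(re.escape(symbol) for symbol in to_drop)

# Hypothetical toy table using ligand names that appear in this notebook.
toy = pd.DataFrame(
    {"ligand_name": ["(NLG)", "(GAB&PMP)", "(AZM)", "(FMN_hq)", "(9LQ_18-mer)"]}
)

has_symbol = toy["ligand_name"].str.contains(pattern, regex=True)
kept = toy[~has_symbol].reset_index(drop=True)
print(kept)
```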

A further 155 entries were removed because their ligands caused problems in Mordred and/or could not be found on the Protein Data Bank.

After their removal, the final list of protein-ligand complexes had 9607 entries, and the list of PDB IDs was used for further data sourcing and processing.