# Changes in 2016.03 Release: the FilterCatalog

This is the second of a series of posts highlighting changes (mostly new features) in the 2016.03 (Q1 2016) release of the RDKit.

This one focuses on the `FilterCatalog`: a class introduced in the 2015.09 release that has been improved for this release but remains "underdocumented".

In [1]:
from rdkit import Chem
from rdkit.Chem.FilterCatalog import FilterCatalogParams,FilterCatalog
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
IPythonConsole.ipython_useSVG=True
Chem.WrapLogs()
from rdkit import rdBase
print(rdBase.rdkitVersion)
import time
print(time.asctime())

2016.03.1.b1
Fri Apr  8 10:17:39 2016


At a very high level, the `FilterCatalog` allows collections of queries to be used to filter sets of compounds.

The RDKit comes with a number of query sets pre-configured:

In [14]:
[x for x in dir(FilterCatalogParams.FilterCatalogs) if 'A' <= x[0] <= 'Z']

['ALL', 'BRENK', 'NIH', 'PAINS', 'PAINS_A', 'PAINS_B', 'PAINS_C', 'ZINC']

In [9]:
params = FilterCatalogParams()
params.AddCatalog(FilterCatalogParams.FilterCatalogs.PAINS)
params.AddCatalog(FilterCatalogParams.FilterCatalogs.BRENK)
params.AddCatalog(FilterCatalogParams.FilterCatalogs.NIH)

filters = FilterCatalog(params)
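
Before scanning a whole file, it can help to see what a catalog does with a single molecule. The following is a minimal sketch, not part of the original notebook: it builds a catalog from the PAINS set alone and calls `HasMatch` on two illustrative molecules that I chose for the example (a hydrazone-phenol, a motif the PAINS filters are expected to flag, and ethanol). The names `pains_params`, `pains`, `flagged`, and `clean` are assumptions introduced here.

```python
from rdkit import Chem
from rdkit.Chem.FilterCatalog import FilterCatalog, FilterCatalogParams

# Build a catalog from the PAINS set only, to keep the example small.
pains_params = FilterCatalogParams()
pains_params.AddCatalog(FilterCatalogParams.FilterCatalogs.PAINS)
pains = FilterCatalog(pains_params)

# Illustrative inputs (assumed for this sketch, not taken from the HTS data):
# a hydrazone-phenol, which PAINS is expected to flag, and ethanol.
flagged = Chem.MolFromSmiles('O=C(Cn1cnc2c1c(=O)n(C)c(=O)n2C)N/N=C/c1c(O)ccc(Br)c1')
clean = Chem.MolFromSmiles('CCO')

# HasMatch() answers the yes/no question: does any query in the catalog hit?
flagged_hit = pains.HasMatch(flagged)
clean_hit = pains.HasMatch(clean)
print("hydrazone-phenol flagged:", flagged_hit)
print("ethanol flagged:", clean_hit)
```

`HasMatch` only reports whether any query hits, which is all that is needed when the goal is to discard compounds; `GetMatches` returns the matching entries when the reason for the rejection matters.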

In [10]:
import gzip
# HTS data from the 2014 TDT challenge; text mode ("rt") yields str lines rather than bytes
inf = gzip.open("../data/malariahts_trainingset.txt.gz", "rt")
keep = []
nReject = 0
inf.readline() # ignore the header line
for i,line in enumerate(inf):
    splitL = line.strip().split()
    smi = splitL[-1] # the SMILES is taken from the last column
    m = Chem.MolFromSmiles(smi)
    if m is None:
        continue
    if filters.HasMatch(m):
        # remember the matching filter entries for the most recently rejected molecule
        matches = filters.GetMatches(m)
        nReject += 1
        #print(smi,list(x.GetDescription() for x in matches))
        #if nReject>100:
        #    break
    else:
        keep.append(m)
        if len(keep)>=1000:
            break
    if not (i+1)%1000:
        print("   Processed",i+1,"rejected",nReject)
print("Found:",len(keep),"after scanning",i+1)

   Processed 1000 rejected 112
Found: 1000 after scanning 1126


After the loop, `matches` still holds the filter entries that matched the last rejected molecule:

In [12]:
len(matches)

3
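
The entries returned by `GetMatches` carry more than a count. As a rough sketch, the helper below collects the description of each matching entry for a SMILES string, using the same `GetDescription()` call that appears in the commented-out debugging line above. The helper name `describe_matches`, the PAINS-only catalog, and the example molecule are assumptions made for illustration; they are not part of the original notebook.

```python
from rdkit import Chem
from rdkit.Chem.FilterCatalog import FilterCatalog, FilterCatalogParams


def describe_matches(catalog, smiles):
    """Return the description of every catalog entry matching `smiles`.

    Hypothetical helper for illustration; returns an empty list when the
    SMILES cannot be parsed or nothing in the catalog matches.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return []
    return [entry.GetDescription() for entry in catalog.GetMatches(mol)]


# A PAINS-only catalog keeps the example small (an assumption of this sketch;
# the post itself combines PAINS, BRENK, and NIH).
example_params = FilterCatalogParams()
example_params.AddCatalog(FilterCatalogParams.FilterCatalogs.PAINS)
example_catalog = FilterCatalog(example_params)

# Illustrative hydrazone-phenol, chosen because PAINS is expected to flag it.
hits = describe_matches(
    example_catalog,
    'O=C(Cn1cnc2c1c(=O)n(C)c(=O)n2C)N/N=C/c1c(O)ccc(Br)c1')
for description in hits:
    print(description)
```

Printing the descriptions alongside each rejected SMILES is a quick way to check which filter families are responsible for most of the rejections in a data set.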