# 2-Filtering
This tutorial demonstrates how to filter PDB entries to create subsets of structures. For details see [filters](https://github.com/sbl-sdsc/mmtf-pyspark/tree/master/mmtfPyspark/filters) and [demos](https://github.com/sbl-sdsc/mmtf-pyspark/tree/master/demos/filters).

### Import pyspark and mmtfPyspark

In [1]:
from pyspark.sql import SparkSession
from mmtfPyspark.io import mmtfReader
from mmtfPyspark.filters import ContainsGroup, ContainsLProteinChain, PolymerComposition, Resolution 
from mmtfPyspark.structureViewer import view_group_interaction

### Configure Spark

In [2]:
spark = SparkSession.builder.appName("2-Filtering").getOrCreate()

### Read PDB structures

In [3]:
path = "../resources/mmtf_reduced_sample"
pdb = mmtfReader.read_sequence_file(path).cache()
pdb.count()

9756

## Filter by Quality Metrics
Structures can be filtered by [Resolution](https://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/resolution) and [R-free](https://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/r-value-and-r-free). Each filter takes a minimum and a maximum value. The example below returns structures with a resolution in the inclusive range [0.0, 1.5].

In [4]:
pdb = pdb.filter(Resolution(0.0, 1.5))
pdb.count()

2941
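The inclusive-range behaviour can be illustrated without Spark. The sketch below is a plain-Python stand-in, not the mmtfPyspark implementation: `ResolutionRange` and the `SimpleNamespace` records with a `resolution` attribute are hypothetical names introduced for illustration, the entries are made up, and rejecting entries that have no resolution value is an assumption.

```python
from types import SimpleNamespace


class ResolutionRange:
    """Hypothetical stand-in for a range filter; not the mmtfPyspark class."""

    def __init__(self, min_resolution, max_resolution):
        self.min_resolution = min_resolution
        self.max_resolution = max_resolution

    def __call__(self, t):
        # t is a (key, structure) tuple, mirroring the tuples used in this tutorial
        resolution = t[1].resolution
        if resolution is None:
            # Assumption: entries without a resolution value are rejected
            return False
        # Both bounds are inclusive
        return self.min_resolution <= resolution <= self.max_resolution


# Made-up entries for illustration only (keys are not real PDB IDs)
entries = [
    ("AAAA", SimpleNamespace(resolution=1.2)),
    ("BBBB", SimpleNamespace(resolution=1.5)),   # on the upper bound, so it is kept
    ("CCCC", SimpleNamespace(resolution=2.8)),
    ("DDDD", SimpleNamespace(resolution=None)),  # no resolution recorded
]

kept = [key for key, _ in filter(ResolutionRange(0.0, 1.5), entries)]
print(kept)  # ['AAAA', 'BBBB']
```

Writing the filter as a callable object keeps the bounds together with the predicate, which is the same shape as passing `Resolution(0.0, 1.5)` to `pdb.filter` above.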

## Filter by Polymer Chain Types
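Several filters select structures by the type of their polymer chains. Because each step reassigns pdb, the filters in this tutorial are cumulative: every count refers to the subset left by the previous step.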

### Create a subset of structures that contain at least one L-protein chain

In [5]:
pdb = pdb.filter(ContainsLProteinChain())
pdb.count()

2912

### Create a subset of structures that exclusively contain L-protein chains (e.g., exclude protein-nucleic acid complexes)

In [6]:
pdb = pdb.filter(ContainsLProteinChain(exclusive=True))
pdb.count()

2788
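The difference between the default and the exclusive mode comes down to "any chain matches" versus "all chains match". The sketch below shows that logic in plain Python under stated assumptions: `contains_chain_type` and the string labels for chain types are hypothetical, and the way mmtfPyspark determines chain types internally is not shown here.

```python
def contains_chain_type(chain_types, wanted, exclusive=False):
    """Hypothetical helper illustrating exclusive vs. non-exclusive matching.

    chain_types: list with one type label per polymer chain (made-up representation)
    exclusive=False: at least one chain has the wanted type
    exclusive=True: every chain has the wanted type
    """
    if not chain_types:
        return False
    matches = [chain_type == wanted for chain_type in chain_types]
    return all(matches) if exclusive else any(matches)


# Made-up entries for illustration only
protein_only = ["L-protein", "L-protein"]
protein_dna_complex = ["L-protein", "DNA"]
dna_only = ["DNA"]

for name, chains in [
    ("protein_only", protein_only),
    ("protein_dna_complex", protein_dna_complex),
    ("dna_only", dna_only),
]:
    print(
        name,
        contains_chain_type(chains, "L-protein"),
        contains_chain_type(chains, "L-protein", exclusive=True),
    )
```

The protein-DNA complex passes the default check but fails the exclusive one, which is consistent with the count dropping between the two cells above.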

### Keep protein structures that exclusively contain chains made out of the 20 standard amino acids

In [7]:
pdb = pdb.filter(PolymerComposition(PolymerComposition.AMINO_ACIDS_20, exclusive=True))
pdb.count()

2171
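As a rough illustration of what "exclusively the 20 standard amino acids" means, the sketch below checks one-letter sequences against the standard alphabet. It is an assumption-laden stand-in rather than the library's code: `only_standard_residues`, `matches_composition`, and the example sequences are hypothetical, and representing each chain as a one-letter string is a simplification made for this example.

```python
# The 20 standard amino acids as one-letter codes
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def only_standard_residues(sequence):
    """True if a non-empty chain sequence uses only standard amino acids."""
    return len(sequence) > 0 and set(sequence) <= STANDARD_AMINO_ACIDS


def matches_composition(chain_sequences, exclusive=False):
    """Hypothetical helper: chain_sequences is a list of one-letter sequences."""
    checks = [only_standard_residues(seq) for seq in chain_sequences]
    if not checks:
        return False
    # exclusive=True requires every chain to pass; otherwise one chain is enough
    return all(checks) if exclusive else any(checks)


# Made-up sequences for illustration only
standard_entry = ["MKTAYIAKQR", "GSHMLEDPV"]
mixed_entry = ["MKTAYIAKQR", "MKUAX"]  # U (selenocysteine) and X (unknown) are non-standard

print(matches_composition(standard_entry, exclusive=True))  # True
print(matches_composition(mixed_entry, exclusive=True))     # False
print(matches_composition(mixed_entry))                     # True
```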

## Find the subset of structures that contains ATP

In [8]:
pdb = pdb.filter(ContainsGroup("ATP"))

## Visualize the hits

In [9]:
view_group_interaction(pdb.keys().collect(),"ATP");

[Interactive viewer output: a "Structure" slider steps through the matching structures, index 0 to 11.]

## Filter with a lambda expression
Instead of a pre-made filter, we can write simple filters as lambda expressions. The expression must evaluate to a boolean.

The variable t in the lambda expression below is a (key, value) tuple: t[0] is the PDB ID and t[1] is the mmtfStructure.

Here, we filter by the number of atoms in an entry. You will learn more about extracting structural information from an mmtfStructure in future tutorials.

In [10]:
pdb = pdb.filter(lambda t: t[1].num_atoms < 500)
pdb.count()

7

We can also filter by the key, which is the first element of the tuple: t[0].

**Keys are case sensitive. Always use upper case PDB IDs in mmtf-pyspark!**

In [11]:
pdb = pdb.filter(lambda t: t[0] in ["4AFF", "4CBU"])
pdb.count()

1
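The two lambda filters above can be tried on an ordinary Python list to see how the tuple indexing and the case-sensitive key comparison behave. This is a minimal sketch that does not use Spark or mmtfPyspark: the `SimpleNamespace` objects stand in for mmtfStructure, the atom counts are invented for illustration, and normalizing user input with `upper()` is one possible way to follow the upper-case rule, not something the library does for you.

```python
from types import SimpleNamespace

# (key, structure) tuples; atom counts are made up for illustration
entries = [
    ("4AFF", SimpleNamespace(num_atoms=320)),
    ("4CBU", SimpleNamespace(num_atoms=1250)),
    ("1XYZ", SimpleNamespace(num_atoms=410)),
]

# t[1] is the structure: keep entries with fewer than 500 atoms
small = [t[0] for t in filter(lambda t: t[1].num_atoms < 500, entries)]
print(small)  # ['4AFF', '1XYZ']

# t[0] is the key: a lower-case ID silently fails to match
wanted = ["4aff", "4CBU"]
exact = [t[0] for t in filter(lambda t: t[0] in wanted, entries)]
print(exact)  # ['4CBU']

# Normalizing the requested IDs to upper case recovers the missing hit
wanted_upper = {pdb_id.upper() for pdb_id in wanted}
normalized = [t[0] for t in filter(lambda t: t[0] in wanted_upper, entries)]
print(normalized)  # ['4AFF', '4CBU']
```

A mismatch in case does not raise an error; the entry is simply dropped, so a smaller-than-expected count is the only symptom.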

In [12]:
spark.stop()