# Create Representative Set Demo

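This demo shows how to create a representative set of protein chains: read Hadoop sequence files, filter structures with the Pisces filter, flatMap them to polymer chains, filter the chains with Pisces again, and keep only chains composed of the 20 standard amino acids (`AMINO_ACIDS_20` polymer composition).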

![RCSB PDB](https://cdn.rcsb.org/rcsb-pdb/v2/common/images/Logo_wwpdb.png)

## Imports

In [2]:
from pyspark import SparkConf, SparkContext
from mmtfPyspark.io import mmtfReader, mmtfWriter
from mmtfPyspark.mappers import structureToPolymerChains
from mmtfPyspark.filters import polymerComposition
from mmtfPyspark.webFilters import Pisces

## Configure Spark

In [3]:
conf = SparkConf().setMaster("local[*]") \
                  .setAppName("CreateRepresentativeSetDemo")
sc = SparkContext(conf=conf)

## Read in Hadoop Sequence Files

In [4]:
path = "../../../resources/mmtf_full_sample/"

pdb = mmtfReader.read_sequence_file(path, sc)

## Filter for representative protein chains at 40% sequence identity and 2.0 Å resolution

In [5]:
# PISCES cutoffs: sequence identity in percent, resolution in Angstroms
sequenceIdentity = 40
resolution = 2.0

pdb = pdb.filter(Pisces(sequenceIdentity, resolution)) \
         .flatMap(structureToPolymerChains()) \
         .filter(Pisces(sequenceIdentity, resolution)) \
         .filter(polymerComposition(polymerComposition.AMINO_ACIDS_20))
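
The Pisces filter appears twice in the chain above because it is applied at two granularities: before `flatMap` the records are whole structures, and afterwards they are individual chains. The plain-Python sketch below mimics that flow on toy data, without Spark or mmtfPyspark. The representative set, the record layout, and the helper names (`REPRESENTATIVES`, `structure_passes`, `to_chains`, `chain_passes`) are hypothetical stand-ins, and the sketch assumes that a structure passes when at least one of its chains is in the representative set.

```python
# Toy stand-in for a PISCES representative list (hypothetical chain IDs).
REPRESENTATIVES = {"1ABC.A", "2XYZ.B"}

# Toy "structures": PDB ID -> list of chain names (hypothetical data).
structures = [
    ("1ABC", ["A", "B"]),
    ("2XYZ", ["A", "B"]),
    ("3DEF", ["A"]),
]


def structure_passes(record):
    """Assumed structure-level rule: keep if any chain is representative."""
    pdb_id, chains = record
    return any(f"{pdb_id}.{chain}" in REPRESENTATIVES for chain in chains)


def to_chains(record):
    """Mimic flatMap: one structure yields one record per chain."""
    pdb_id, chains = record
    return [f"{pdb_id}.{chain}" for chain in chains]


def chain_passes(chain_id):
    """Chain-level rule: keep only chains in the representative set."""
    return chain_id in REPRESENTATIVES


# Step 1: the structure-level filter drops 3DEF entirely.
kept_structures = [s for s in structures if structure_passes(s)]

# Step 2: flatMap to chains; non-representative chains are still present.
all_chains = [c for s in kept_structures for c in to_chains(s)]

# Step 3: the chain-level filter removes the non-representative chains.
representative_chains = [c for c in all_chains if chain_passes(c)]

print(all_chains)             # ['1ABC.A', '1ABC.B', '2XYZ.A', '2XYZ.B']
print(representative_chains)  # ['1ABC.A', '2XYZ.B']
```

Under these assumptions, the first pass avoids splitting structures that contain no representative chain, while the second pass is still needed because `flatMap` emits every chain of each surviving structure.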

## Show top 10 chains

In [6]:
pdb.top(10)

[('1FYE.A', <mmtf.api.mmtf_writer.MMTFEncoder at 0x7f42e7929e80>),
 ('1FXL.A', <mmtf.api.mmtf_writer.MMTFEncoder at 0x7f42f730d470>),
 ('1FVI.A', <mmtf.api.mmtf_writer.MMTFEncoder at 0x7f42f73e6ef0>),
 ('1FV1.F', <mmtf.api.mmtf_writer.MMTFEncoder at 0x7f42f76ccd30>),
 ('1FTR.D', <mmtf.api.mmtf_writer.MMTFEncoder at 0x7f42f76a3c88>),
 ('1FT5.A', <mmtf.api.mmtf_writer.MMTFEncoder at 0x7f43085c7cf8>),
 ('1FSG.C', <mmtf.api.mmtf_writer.MMTFEncoder at 0x7f42e3aeecf8>),
 ('1FS1.C', <mmtf.api.mmtf_writer.MMTFEncoder at 0x7f42e39230b8>),
 ('1FR3.L', <mmtf.api.mmtf_writer.MMTFEncoder at 0x7f42e38ed898>),
 ('1FPZ.C', <mmtf.api.mmtf_writer.MMTFEncoder at 0x7f42e385df98>)]
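
`top(10)` returns the ten largest elements in descending order, which is why the keys above run backwards from `1FYE.A`. For key-value tuples the comparison starts with the key. A minimal pure-Python sketch of the same ordering, using `heapq.nlargest` on hypothetical chain IDs (the string payloads are placeholders for the MMTF structure objects):

```python
import heapq

# Hypothetical (chain ID, payload) pairs; payloads stand in for structures.
records = [
    ("1FPZ.C", "structure-1"),
    ("1FYE.A", "structure-2"),
    ("1FT5.A", "structure-3"),
    ("1FXL.A", "structure-4"),
]

# Largest keys first, compared as strings; key= avoids comparing payloads.
top_two = heapq.nlargest(2, records, key=lambda kv: kv[0])

# First two records in their original order, for contrast.
first_two = records[:2]

print([k for k, _ in top_two])    # ['1FYE.A', '1FXL.A']
print([k for k, _ in first_two])  # ['1FPZ.C', '1FYE.A']
```

If any ten records will do, `pdb.take(10)` is likely the cheaper choice, since it does not need to rank the whole dataset.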

## Save representative set

In [7]:
write_path = f'./pdb_representatives_{sequenceIdentity}'

mmtfWriter.write_sequence_file(write_path, sc, pdb)

## Terminate Spark

In [8]:
sc.stop()