# 6-Output
Structures can be saved as single files in MMTF format, or multiple structures can be saved in an MMTF Hadoop Sequence file.

In [1]:
from pyspark.sql import SparkSession
from mmtfPyspark.io import mmtfReader, mmtfWriter
from mmtfPyspark.webfilters import Pisces
from mmtfPyspark.mappers import StructureToPolymerChains

#### Configure Spark

In [2]:
spark = SparkSession.builder.appName("6-Output").getOrCreate()

#### Read PDB structures

In [3]:
path = "../resources/mmtf_reduced_sample"

pdb = mmtfReader.read_sequence_file(path)

## Use Pisces filter to create a non-redundant set of protein chains
Many analyses of the PDB use the [PISCES CulledPDB](https://github.com/sbl-sdsc/mmtf-pyspark/blob/master/mmtfPyspark/webfilters/pisces.py) sets maintained by the R. Dunbrack group.
A CulledPDB set is selected by specifying sequenceIdentity and resolution cutoff values from the following list:
* sequenceIdentity = [20, 25, 30, 40, 50, 60, 70, 80, 90]
* resolution = [1.6, 1.8, 2.0, 2.2, 2.5, 3.0]
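
Because only the cutoff values listed above are available, it can help to check them before constructing the filter. The following is a minimal sketch in plain Python; `validate_pisces_cutoffs` is a hypothetical helper written for this example and is not part of mmtfPyspark.

```python
# Allowed PISCES cutoffs, copied from the list above.
SEQUENCE_IDENTITIES = (20, 25, 30, 40, 50, 60, 70, 80, 90)
RESOLUTIONS = (1.6, 1.8, 2.0, 2.2, 2.5, 3.0)


def validate_pisces_cutoffs(sequence_identity, resolution):
    """Hypothetical helper (not part of mmtfPyspark).

    Returns the cutoffs unchanged if both are in the allowed lists,
    otherwise raises ValueError.
    """
    if sequence_identity not in SEQUENCE_IDENTITIES:
        raise ValueError(
            f"sequenceIdentity must be one of {SEQUENCE_IDENTITIES}, "
            f"got {sequence_identity}"
        )
    if resolution not in RESOLUTIONS:
        raise ValueError(
            f"resolution must be one of {RESOLUTIONS}, got {resolution}"
        )
    return sequence_identity, resolution


# The values used in this notebook pass the check.
sequence_identity, resolution = validate_pisces_cutoffs(20, 1.6)
```

The validated values can then be passed on as `Pisces(sequenceIdentity=sequence_identity, resolution=resolution)`.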


In the example below, we create a high-resolution, non-redundant set of protein chains with a 20% sequence identity threshold and a 1.6 Å resolution threshold.

In [4]:
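# The Pisces filter is applied twice: first to whole structures, then again
# to the individual polymer chains produced by the flatMap step.
nr_chains = pdb \
    .filter(Pisces(sequenceIdentity=20, resolution=1.6)) \
    .flatMap(StructureToPolymerChains()) \
    .filter(Pisces(sequenceIdentity=20, resolution=1.6))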

In [5]:
nr_chains.count()

2004

## Save chains in an MMTF Hadoop Sequence file
If we need to use a set of structures multiple times, or want to create a snapshot in time (e.g., for reproducible analysis), we can save structures to an MMTF Hadoop Sequence File.

Here, we save the PISCES subset we've just created. **Note: if the output file already exists, you must delete it first**.
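
One way to clear a previous output before writing is sketched below. It assumes the output lives on the local filesystem (not HDFS) and that Spark may have written it as a directory of part files; `remove_existing_output` is a hypothetical helper for this example, not an mmtfPyspark function.

```python
import shutil
from pathlib import Path


def remove_existing_output(output_path):
    """Hypothetical helper: delete a previous output file or directory.

    Assumes a local filesystem path. Returns True if something was
    removed and False if nothing existed at output_path.
    """
    target = Path(output_path)
    if target.is_dir():
        # Spark output is typically a directory containing part files.
        shutil.rmtree(target)
        return True
    if target.exists():
        # A single plain file.
        target.unlink()
        return True
    return False
```

In this notebook, the helper could be called as `remove_existing_output(path + "_pices20_1.6")` just before the write cell. The deletion is permanent, so the path is worth double-checking first.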

In [6]:
# Writing is temporarily disabled due to ongoing work on the encoder and decoder libraries
# mmtfWriter.write_sequence_file(path + "_pices20_1.6", nr_chains)

## ... and read structures back in 
Check that the file contains the expected number of chains.

In [7]:
# nr_chains = mmtfReader.read_sequence_file(path + "_pices20_1.6")
# nr_chains.count()

In [8]:
spark.stop()