# 1-MMTF-Datastructure
This tutorial shows how to access data from the MMTF data structure.

In [1]:
from pyspark.sql import SparkSession
from mmtfPyspark.io import mmtfReader
from mmtfPyspark.utils import traverseStructureHierarchy
from mmtfPyspark import structureViewer

#### Configure Spark

In [2]:
spark = SparkSession.builder.appName("1-MMTF-Datastructure").getOrCreate()

### Download an example structure
Here we download an HIV protease structure with a bound ligand (Nelfinavir).

In [3]:
pdb = mmtfReader.download_full_mmtf_files(["1OHR"])

Structures are represented as key-value pairs (tuples):
* key: structure identifier (e.g., PDB ID)
* value: MmtfStructure (structure data)

In this case, we only have one structure, so we can use the first() method to extract the data.

In [4]:
pdb_id = pdb.keys().first()
structure = pdb.values().first()
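
The key-value layout described above can be mimicked without Spark. The following is a minimal sketch using a plain Python list of tuples as a stand-in for the pair RDD. The names `records`, `example_id`, and `example_value` are hypothetical, and the value is a placeholder string rather than a real MmtfStructure.

```python
# Plain-Python stand-in for the (key, value) tuples held in the pair RDD.
# The value is a placeholder string, not a real MmtfStructure object.
records = [("1OHR", "<MmtfStructure placeholder>")]

# Rough equivalents of pdb.keys().first() and pdb.values().first()
keys = [key for key, _ in records]
values = [value for _, value in records]
example_id = keys[0]
example_value = values[0]

# Rough equivalent of pdb.first(), which returns the whole tuple at once
first_id, first_value = records[0]

print(example_id, first_id == example_id)
```

With more than one structure in the collection, `first()` would return only one of them, so it is mainly useful for single-structure examples like this one.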

In [5]:
structureViewer.view_structure(pdb_id);


### Access metadata
traverseStructureHierarchy provides methods to explore MMTF structures.
[See the code for how to access these data](https://github.com/sbl-sdsc/mmtf-pyspark/blob/master/mmtfPyspark/utils/traverseStructureHierarchy.py#L50-L62)

In [6]:
traverseStructureHierarchy.print_metadata(structure)

*** METADATA ***
StructureId           : 1OHR
Title                 : VIRACEPT (R) (NELFINAVIR MESYLATE, AG1343): A POTENT ORALLY BIOAVAILABLE INHIBITOR OF HIV-1 PROTEASE
Deposition date       : 1997-09-27
Release date          : 1998-12-09
Experimental method(s): [X-RAY DIFFRACTION]
Resolution            : 2.0999999046325684
Rfree                 : None
Rwork                 : 0.20000000298023224
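
The long decimals printed for Resolution and Rwork look like 32-bit floating-point values widened to Python's 64-bit floats. That is an assumption about how the values are stored, not something this tutorial states. The sketch below reproduces the effect with the standard library; `as_float32` is a hypothetical helper.

```python
import struct


def as_float32(value):
    """Round-trip a Python float through a 32-bit IEEE 754 float."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


# Values as they would read in a typical structure report
resolution = as_float32(2.1)
r_work = as_float32(0.2)

print(resolution, r_work)

# Rounding recovers the familiar two-decimal values for display
print(round(resolution, 2), round(r_work, 2))
```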



### Structural data
[See the code for how to access these data](https://github.com/sbl-sdsc/mmtf-pyspark/blob/master/mmtfPyspark/utils/traverseStructureHierarchy.py#L87-L96)

In [7]:
traverseStructureHierarchy.print_structure_data(structure)

*** STRUCTURE DATA ***
Number of models : 1
Number of chains : 5
Number of groups : 250
Number of atoms : 1952
Number of bonds : 1926



### Entity data
Entities are the unique molecular components in a structure.

This structure has one unique polymer (ASPARTYLPROTEASE), one non-polymer ligand, and water.

In [8]:
traverseStructureHierarchy.print_entity_info(structure)

*** ENTITY DATA ***
entity type            : 0 polymer
entity description     : 0 ASPARTYLPROTEASE
entity sequence        : 0 PQITLWQRPLVTIKIGGQLKEALLDTGADDTVLEEMSLPGRWKPKMIGGIGGFIKVRQYDQILIEICGHKAIGTVLVGPTPVNIIGRNLLTQIGCTLNF
entity type            : 1 non-polymer
entity description     : 1 2-[2-HYDROXY-3-(3-HYDROXY-2-METHYL-BENZOYLAMINO)-4-PHENYL SULFANYL-BUTYL]-DECAHYDRO-ISOQUINOLINE-3-CARBOXYLIC ACID TERT-BUTYLAMIDE
entity sequence        : 1 
entity type            : 2 water
entity description     : 2 water
entity sequence        : 2 
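
To illustrate how entities relate to chains, here is a minimal sketch with a hand-written entity list. The dictionary layout and the names `entities`, `chain_indices`, and `chain_to_entity` are hypothetical. The chain assignments are inferred from the chain group counts reported in this tutorial (two 99-group protein chains, one single-group ligand chain, two water chains), not read from the MMTF file.

```python
# Hypothetical stand-in for the entity data of 1OHR.
# chain_indices are inferred from the chain listing, not read from the file.
entities = [
    {"type": "polymer", "description": "ASPARTYLPROTEASE", "chain_indices": [0, 1]},
    {"type": "non-polymer", "description": "ligand", "chain_indices": [2]},
    {"type": "water", "description": "water", "chain_indices": [3, 4]},
]
chain_ids = ["A", "B", "C", "D", "E"]

# Invert the mapping: for each chainId, find the entity it belongs to
chain_to_entity = {}
for entity_index, entity in enumerate(entities):
    for chain_index in entity["chain_indices"]:
        chain_to_entity[chain_ids[chain_index]] = entity_index

for chain_id in chain_ids:
    entity = entities[chain_to_entity[chain_id]]
    print(chain_id, entity["type"])
```

Several chains can share one entity, which is why the two protein chains map to the same polymer entry.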




### Chain information
[See the code for how to access these data](https://github.com/sbl-sdsc/mmtf-pyspark/blob/master/mmtfPyspark/utils/traverseStructureHierarchy.py#L98-L112)

Note that in the [PDB file for this structure](https://files.rcsb.org/view/4hhb.pdb) you find chains A and B. These "PDB" chains are referred to by chainName in MMTF. Almost all operations in MMTF use the chainNames.

However, in the MMTF data structure, chains are split into polymer, non-polymer, and water chains. This structure has 5 chains: 2 protein chains (99 groups each), 1 non-polymer chain (1 ligand group), and 2 water chains (29 and 22 water groups). These 5 chains are referred to by their chainIds (A, B, C, D, E).

In [9]:
traverseStructureHierarchy.print_chain_info(structure)

*** CHAIN DATA ***
Number of chains: 5
model: 1
chainName: A, chainId: A, groups: 99
chainName: B, chainId: B, groups: 99
chainName: A, chainId: C, groups: 1
chainName: A, chainId: D, groups: 29
chainName: B, chainId: E, groups: 22
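
The relationship between chainName and chainId can be checked with a few lines of plain Python. This is a minimal sketch that copies the five chain records from the output above into a hand-written list. The variable names are hypothetical and no MMTF API is involved.

```python
from collections import defaultdict

# (chainName, chainId, number of groups), copied from the output above
chains = [
    ("A", "A", 99),
    ("B", "B", 99),
    ("A", "C", 1),
    ("A", "D", 29),
    ("B", "E", 22),
]

ids_by_name = defaultdict(list)    # chainName -> list of chainIds
groups_by_name = defaultdict(int)  # chainName -> total number of groups

for chain_name, chain_id, num_groups in chains:
    ids_by_name[chain_name].append(chain_id)
    groups_by_name[chain_name] += num_groups

total_groups = sum(num_groups for _, _, num_groups in chains)

for chain_name in sorted(ids_by_name):
    print(chain_name, ids_by_name[chain_name], groups_by_name[chain_name])
print("total groups:", total_groups)
```

The total matches the number of groups reported in the structure data section.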



### Chain, entity, group, and atom information
[See the code for how to access these data](https://github.com/sbl-sdsc/mmtf-pyspark/blob/master/mmtfPyspark/utils/traverseStructureHierarchy.py#L157-L225)

In the data listed below, seq. index is the zero-based index of a specific group (residue) into the one-letter polymer sequence.
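
As a quick check of the zero-based convention, the sketch below indexes the polymer sequence from the entity data with the seq. index values of the first four groups in the listing. The tuple layout and variable names are hypothetical; the values are copied from this tutorial's output.

```python
# Polymer sequence from the entity data of 1OHR
sequence = (
    "PQITLWQRPLVTIKIGGQLKEALLDTGADDTVLEEMSLPGRWKPKMIGGI"
    "GGFIKVRQYDQILIEICGHKAIGTVLVGPTPVNIIGRNLLTQIGCTLNF"
)

# (groupName, oneLetterCode, seq. index, groupId) for the first four groups
groups = [
    ("PRO", "P", 0, 1),
    ("GLN", "Q", 1, 2),
    ("ILE", "I", 2, 3),
    ("THR", "T", 3, 4),
]

# The seq. index points directly at the matching one-letter code
matches = [sequence[seq_index] == code for _, code, seq_index, _ in groups]

for (name, code, seq_index, group_id), ok in zip(groups, matches):
    print(name, code, seq_index, group_id, ok)
```

Note that groupId is the residue number from the structure file and happens to be seq. index + 1 here; that offset should not be assumed for other structures.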

DSSP secStruct. is the secondary structure annotation, recalculated by BioJava's implementation of the DSSP method. The one-letter codes are:

* 5: PI_HELIX
* S: BEND
* H: ALPHA_HELIX
* E: EXTENDED
* G: THREE_TEN_HELIX
* B: BRIDGE
* T: TURN
* C: COIL
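
The code list above maps naturally onto a lookup table. The following is a minimal sketch; `DSSP_CODES` and `describe` are hypothetical names, and the example string is built from the DSSP codes of the first four groups of chain A in this tutorial's output.

```python
from collections import Counter

# One-letter DSSP codes and their names, as listed above
DSSP_CODES = {
    "5": "PI_HELIX",
    "S": "BEND",
    "H": "ALPHA_HELIX",
    "E": "EXTENDED",
    "G": "THREE_TEN_HELIX",
    "B": "BRIDGE",
    "T": "TURN",
    "C": "COIL",
}


def describe(codes):
    """Translate a string of one-letter DSSP codes into a list of names."""
    return [DSSP_CODES[code] for code in codes]


# DSSP codes of the first four groups (PRO, GLN, ILE, THR) of chain A
first_four = "CEEC"
names = describe(first_four)
counts = Counter(names)

print(names)
print(counts)
```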

In [10]:
traverseStructureHierarchy.print_chain_group_info(structure)

*** CHAIN AND GROUP DATA ***
model: 1
chainName: A, chainId: A, groups: 99
   groupName      : PRO
   oneLetterCode  : P
   seq. index     : 0
   numAtoms       : 9
   numBonds       : 7
   chemCompType   : L-PEPTIDE LINKING
   groupId        : 1
   insertionCode  : 
   DSSP secStruct.: C

   groupName      : GLN
   oneLetterCode  : Q
   seq. index     : 1
   numAtoms       : 12
   numBonds       : 11
   chemCompType   : L-PEPTIDE LINKING
   groupId        : 2
   insertionCode  : 
   DSSP secStruct.: E

   groupName      : ILE
   oneLetterCode  : I
   seq. index     : 2
   numAtoms       : 9
   numBonds       : 8
   chemCompType   : L-PEPTIDE LINKING
   groupId        : 3
   insertionCode  : 
   DSSP secStruct.: E

   groupName      : THR
   oneLetterCode  : T
   seq. index     : 3
   numAtoms       : 9
   numBonds       : 8
   chemCompType   : L-PEPTIDE LINKING
   groupId        : 4
   insertionCode  : 
   DSSP secStruct.: C

   groupName      : LEU
   oneLetterCode  : L
   seq. index

In [11]:
traverseStructureHierarchy.print_chain_entity_group_atom_info(structure)

*** CHAIN ENTITY GROUP ATOM DATA ***
model: 1
chainName: A, chainId: A, groups: 99
entity type          : polymer
entity description   : ASPARTYLPROTEASE
entity sequence      : PQITLWQRPLVTIKIGGQLKEALLDTGADDTVLEEMSLPGRWKPKMIGGIGGFIKVRQYDQILIEICGHKAIGTVLVGPTPVNIIGRNLLTQIGCTLNF
   groupName      : PRO
   oneLetterCode  : P
   seq. index     : 0
   numAtoms       : 9
   numBonds       : 7
   chemCompType   : L-PEPTIDE LINKING
   groupId        : 1
   insertionCode  : 
   DSSP secStruct.: C
   Atoms          : 
      1	N		-3.477	7.714	33.891	1.0	26.32	N
      2	CA		-2.582	6.722	34.505	1.0	24.3	C
      3	C		-1.168	6.908	34.016	1.0	22.52	C
      4	O		-0.984	7.654	33.063	1.0	22.27	O
      5	CB		-3.083	5.331	34.122	1.0	26.46	C
      6	CG		-3.631	5.623	32.74	1.0	26.17	C
      7	CD		-4.339	6.972	32.959	1.0	26.04	C
      8	H2		-4.023	8.297	34.55	1.0	0.0	H
      9	H3		-2.859	8.366	33.35	1.0	0.0	H
   groupName      : GLN
   oneLetterCode  : Q
   seq. index     : 1
   numAtoms       : 12
   numBonds
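
The atom rows in the listing above are tab-separated. The sketch below parses two of them and computes the N to CA distance. The column meanings (index, name, alternate location, x, y, z, occupancy, B-factor, element) are an assumption based on the values shown, and `parse_atom` is a hypothetical helper.

```python
import math

# Two atom rows from the PRO listing above (leading whitespace removed).
# Assumed columns: index, name, alt. location, x, y, z, occupancy,
# B-factor, element.
rows = [
    "1\tN\t\t-3.477\t7.714\t33.891\t1.0\t26.32\tN",
    "2\tCA\t\t-2.582\t6.722\t34.505\t1.0\t24.3\tC",
]


def parse_atom(row):
    """Split one tab-separated atom row into a dictionary of typed fields."""
    index, name, alt_loc, x, y, z, occupancy, b_factor, element = row.split("\t")
    return {
        "index": int(index),
        "name": name,
        "alt_loc": alt_loc,
        "coords": (float(x), float(y), float(z)),
        "occupancy": float(occupancy),
        "b_factor": float(b_factor),
        "element": element,
    }


atoms = [parse_atom(row) for row in rows]

# Euclidean distance between the backbone N and CA atoms
distance = math.dist(atoms[0]["coords"], atoms[1]["coords"])
print(f"N-CA distance: {distance:.3f}")
```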

### Crystallographic data

In [12]:
traverseStructureHierarchy.print_crystallographic_data(structure)

*** CRYSTALLOGRAPHIC DATA ***
Space group           : P 21 21 21
Unit cell dimensions  : [52.04, 59.38, 61.67, 90.00, 90.00, 90.00]
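
As a small worked example, the unit cell values can be unpacked into edge lengths and angles and used to compute the cell volume. The sketch assumes the list order is a, b, c, alpha, beta, gamma, with angles in degrees, and applies the general triclinic volume formula, which reduces to a * b * c when all angles are 90 degrees.

```python
import math

# Unit cell dimensions copied from the output above.
# Assumed order: a, b, c, alpha, beta, gamma (angles in degrees).
unit_cell = [52.04, 59.38, 61.67, 90.00, 90.00, 90.00]

a, b, c = unit_cell[:3]
cos_alpha, cos_beta, cos_gamma = (
    math.cos(math.radians(angle)) for angle in unit_cell[3:]
)

# General (triclinic) unit cell volume
volume = a * b * c * math.sqrt(
    1.0
    - cos_alpha ** 2
    - cos_beta ** 2
    - cos_gamma ** 2
    + 2.0 * cos_alpha * cos_beta * cos_gamma
)

print(f"Unit cell volume: {volume:.1f}")
```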



### Biological assembly data
In this case, the asymmetric unit (the content of the MMTF structure) corresponds to the biological assembly, so the transformation matrix is the identity matrix.

In [13]:
traverseStructureHierarchy.print_bioassembly_data(structure)

*** BIOASSEMBLY DATA ***
Number bioassemblies: 1
bioassembly: 1
  Number transformations: 1
    transformation: 0
    chains:         (0, 1, 2, 3, 4)
    rotTransMatrix: (1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0)
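
To see what the 16 values of rotTransMatrix mean, the sketch below reshapes them into a 4x4 matrix and applies it to one atom position in homogeneous coordinates. Row-major ordering is an assumption here (it makes no difference for the identity matrix), and `transform` is a hypothetical helper.

```python
# The 16 values printed above, assumed to be a 4x4 matrix in row-major order
rot_trans = (
    1.0, 0.0, 0.0, 0.0,
    0.0, 1.0, 0.0, 0.0,
    0.0, 0.0, 1.0, 0.0,
    0.0, 0.0, 0.0, 1.0,
)
matrix_rows = [rot_trans[i:i + 4] for i in range(0, 16, 4)]


def transform(point, rows):
    """Apply a 4x4 rotation/translation matrix to an (x, y, z) point."""
    x, y, z = point
    homogeneous = (x, y, z, 1.0)
    result = [sum(m * v for m, v in zip(row, homogeneous)) for row in rows]
    return tuple(result[:3])


# Coordinates of the first atom (N of PRO) from the atom listing
original = (-3.477, 7.714, 33.891)
moved = transform(original, matrix_rows)

print(moved)
```

With the identity matrix the coordinates come back unchanged, which is what it means for the asymmetric unit to correspond to the biological assembly. The `chains` tuple in the output lists the chain indices that each transformation applies to.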


In [14]:
spark.stop()