### 3-ColumnarStructureIndexing
ColumnarStructure provides indices to locate the starts and ends of groups and chains in the atom-based arrays.

In [1]:
from pyspark.sql import SparkSession
from mmtfPyspark.io import mmtfReader
from mmtfPyspark.utils import traverseStructureHierarchy, ColumnarStructure
from mmtfPyspark import structureViewer
import numpy as np
from scipy.spatial.distance import pdist, squareform
import matplotlib.pyplot as plt

#### Configure Spark

In [2]:
spark = SparkSession.builder.appName("3-ColumnarStructureIndexing").getOrCreate()

### Download an example structure
Here we download an HIV protease structure with a bound ligand (Nelfinavir).

In [3]:
pdb = mmtfReader.download_full_mmtf_files(["1OHR"])

Structures are represented as key-value pairs (tuples):
* key: structure identifier (e.g., PDB ID)
* value: MmtfStructure (structure data)

In this case, we only have one structure, so we can use the first() method to extract the data.

In [4]:
structure = pdb.values().first()

## Create a columnar structure from an MMTF structure
Here we convert an MMTF structure to a columnar structure. By setting the `firstModelOnly` flag, we
retrieve data for the first model only (this structure has only one model anyway).

In [5]:
arrays = ColumnarStructure(structure, firstModelOnly=True)

### Get atom coordinates as numpy arrays

In [6]:
x = arrays.get_x_coords()
y = arrays.get_y_coords()
z = arrays.get_z_coords()

### Get entity types
Entity types can be used to distinguish polymer from non-polymer groups and select specific components, e.g., all protein groups. The following entity types are available:
* **Polymer groups**
 * PRO: protein
 * DNA: DNA
 * RNA: RNA
 * PSR: saccharide
* **Non-polymer groups**
 * LGO: ligand organic
 * LGI: ligand inorganic
 * SAC: saccharide
 * WAT: water
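Because the entity types are stored per atom, they can serve as boolean masks over the other atom-based arrays. The following is a minimal sketch using small stand-in arrays (the values are made up for illustration; in this notebook the real arrays come from `ColumnarStructure`):

```python
import numpy as np

# Hypothetical stand-in data: one entity type and one x coordinate per atom.
entity_types = np.array(["PRO", "PRO", "LGO", "WAT", "WAT"], dtype=object)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Element-wise comparison yields a boolean mask with one entry per atom.
protein_mask = entity_types == "PRO"
water_mask = entity_types == "WAT"

# Masks select the matching atoms from any array of the same length.
protein_x = x[protein_mask]

# Masks can be combined or inverted, e.g. everything except water.
non_water_x = x[~water_mask]
```

The same mask can be reused for the y and z coordinates, since all atom-based arrays share the same length and order.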

In [7]:
entity_types = arrays.get_entity_types()
entity_types

array(['PRO', 'PRO', 'PRO', ..., 'WAT', 'WAT', 'WAT'], dtype=object)

### Get group names, group numbers, and chain name arrays

In [8]:
group_names = arrays.get_group_names()
group_names

array(['PRO', 'PRO', 'PRO', ..., 'HOH', 'HOH', 'HOH'], dtype=object)

Note that group numbers are strings. They may contain an insertion code, e.g., `'101A'`.

In [9]:
group_numbers = arrays.get_group_numbers()
group_numbers

array(['1', '1', '1', ..., '514', '514', '514'], dtype=object)
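Since group numbers are strings, numeric comparisons require separating the residue number from an optional insertion code first. The helper below is a hypothetical sketch (it is not part of mmtfPyspark) and assumes the format is an optionally signed integer followed by at most one letter:

```python
import re

def split_group_number(group_number):
    """Split a group number such as '101A' into (101, 'A').

    Hypothetical helper; assumes an optional sign, digits, and at most
    one trailing letter as the insertion code.
    """
    match = re.fullmatch(r"(-?\d+)([A-Za-z]?)", group_number)
    if match is None:
        raise ValueError("Unexpected group number format: " + repr(group_number))
    # The insertion code is an empty string when none is present.
    return int(match.group(1)), match.group(2)

number, insertion_code = split_group_number("101A")
```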

In [10]:
chain_names = arrays.get_chain_names()
chain_names

array(['A', 'A', 'A', ..., 'B', 'B', 'B'], dtype=object)

## Indexing the columnar data structure

Indices are available to find the starts and ends of chains and groups. Indices are zero-based atom indices.
* start: index of the first atom
* end: index of the last atom + **1** (i.e., exclusive)
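The start/end convention matches Python's half-open slices, so `array[start:end]` returns exactly the atoms of one group or chain. A minimal sketch with made-up index values (the arrays below are stand-ins, not data from the structure used here):

```python
import numpy as np

# Hypothetical index array for three groups with 3, 2, and 4 atoms.
# It holds one more entry than there are groups: the last entry is the
# end of the final group.
group_indices = np.array([0, 3, 5, 9])
x = np.arange(9, dtype=float)  # stand-in for an atom-based array

per_group = []
for i in range(len(group_indices) - 1):
    start = group_indices[i]
    end = group_indices[i + 1]
    # The half-open slice excludes the atom at position `end`.
    per_group.append(x[start:end])

# The number of atoms per group is the difference of consecutive entries.
atom_counts = np.diff(group_indices)
```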

### Get start and end indices for chains

In [11]:
chain_indices = arrays.get_chain_to_atom_indices()

In [12]:
for i in range(arrays.get_num_chains()):   
    start = chain_indices[i]
    end = chain_indices[i+1]
    
    if entity_types[start] == "PRO":
        print("Protein chain  : " + chain_names[start] + ": " + str(start) + " - " + str(end))
        
    elif entity_types[start] == "LGO":
        print("Organic ligand : " + group_names[start] + "-" + chain_names[start] + ": " + str(start) + " - " + str(end))
        
    elif entity_types[start] == "WAT":
        print("Water          : " + group_names[start] + "-" + chain_names[start] + ": " + str(start) + " - " + str(end))

Protein chain  : A: 0 - 865
Protein chain  : B: 865 - 1755
Organic ligand : 1UN-A: 1755 - 1799
Water          : HOH-A: 1799 - 1886
Water          : HOH-B: 1886 - 1952
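Ranges like those printed above can be used to slice the coordinate arrays per chain. As a rough sketch, the example below computes one centroid per chain from stand-in coordinates (the values and the `coords` array are made up for illustration; with the real data, `x`, `y`, and `z` could be combined with `np.column_stack`):

```python
import numpy as np

# Hypothetical coordinates for five atoms, one row per atom (x, y, z).
coords = np.array([
    [0.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],
    [0.0, 4.0, 0.0],
    [2.0, 4.0, 0.0],
    [10.0, 10.0, 10.0],
])

# Hypothetical chain index array: chain 0 has atoms 0-3, chain 1 has atom 4.
chain_indices = np.array([0, 4, 5])

# Pair each start with the following entry, which is the exclusive end.
centroids = [
    coords[start:end].mean(axis=0)
    for start, end in zip(chain_indices[:-1], chain_indices[1:])
]
```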


### Get start and end indices for groups (residues)

In [13]:
group_indices = arrays.get_group_to_atom_indices()

In [14]:
for i in range(arrays.get_num_groups()):   
    start = group_indices[i]
    end = group_indices[i+1]
    
    print(group_names[start] + "-" + chain_names[start] + ":" + group_numbers[start] + " " + str(start) + " - " + str(end))

PRO-A:1 0 - 9
GLN-A:2 9 - 21
ILE-A:3 21 - 30
THR-A:4 30 - 39
LEU-A:5 39 - 48
TRP-A:6 48 - 64
GLN-A:7 64 - 71
ARG-A:8 71 - 88
PRO-A:9 88 - 95
LEU-A:10 95 - 104
VAL-A:11 104 - 112
THR-A:12 112 - 121
ILE-A:13 121 - 130
LYS-A:14 130 - 138
ILE-A:15 138 - 147
GLY-A:16 147 - 152
GLY-A:17 152 - 157
GLN-A:18 157 - 169
LEU-A:19 169 - 178
LYS-A:20 178 - 191
GLU-A:21 191 - 201
ALA-A:22 201 - 207
LEU-A:23 207 - 216
LEU-A:24 216 - 225
ASP-A:25 225 - 234
THR-A:26 234 - 243
GLY-A:27 243 - 248
ALA-A:28 248 - 254
ASP-A:29 254 - 263
ASP-A:30 263 - 272
THR-A:31 272 - 281
VAL-A:32 281 - 289
LEU-A:33 289 - 298
GLU-A:34 298 - 305
GLU-A:35 305 - 312
MET-A:36 312 - 321
SER-A:37 321 - 329
LEU-A:38 329 - 338
PRO-A:39 338 - 345
GLY-A:40 345 - 350
ARG-A:41 350 - 356
TRP-A:42 356 - 372
LYS-A:43 372 - 378
PRO-A:44 378 - 385
LYS-A:45 385 - 391
MET-A:46 391 - 400
ILE-A:47 400 - 409
GLY-A:48 409 - 414
GLY-A:49 414 - 419
ILE-A:50 419 - 428
GLY-A:51 428 - 433
GLY-A:52 433 - 438
PHE-A:53 438 - 450
ILE-A:54 450 - 459
LYS-A

In [15]:
spark.stop()