Author: Fabian


# 3. Reading and writing files

### Goals

- read the atom coordinates from a PDB file
- write tables of values to a text file
- learn to do calculations with *numpy arrays*
- make first plots with *matplotlib*

### Introduction

#### ATOM data in PDB files

The PDB file format contains information about atom species and positions. This information is found in lines starting with ``ATOM``. Each of these lines has the same number of characters and the following format:

| Characters | Field                                 | Data type        |
|------------|---------------------------------------|------------------|
| 1-6        | Record name «ATOM»                    | String           |
| 7-11       | Serial atom number                    | Integer          |
| 13-16      | Atom name                             | String           |
| 17         | Alternate location                    | String           |
| 18-20      | Residue name                          | String           |
| 22         | Chain identifier                      | String           |
| 23-26      | Residue sequence number               | Integer          |
| 27         | Code for insertion of residues        | String           |
| 31-38      | X coordinate in Å                     | Float            |
| 39-46      | Y coordinate in Å                     | Float            |
| 47-54      | Z coordinate in Å                     | Float            |
| 55-60      | Occupancy                             | Float            |
| 61-66      | Temperature factor                    | Float            |
| 77-78      | Element symbol                        | String           |
| 79-80      | Charge                                | String           |

Further reading: http://www.wwpdb.org/documentation/file-format-content/format33/v3.3.html

**Note:** With PDB files, splitting the lines into their individual fields with ``line.split()`` is dangerous because adjacent fields are not always separated by whitespace: use character positions instead!
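
To illustrate why splitting is fragile, the sketch below assembles a hypothetical ``ATOM`` line (the atom and all its values are made up, not taken from ``1PRE.pdb``) whose coordinates fill their fields completely. ``split()`` then returns the three coordinates as one fused token, while slicing by character position still recovers them.

```python
# Hypothetical ATOM record, assembled field by field so that every value
# sits at the character positions listed in the table (values are made up).
line = (
    f"{'ATOM':<6}{1:>5} {' SG ':4}{'':1}{'CYS':>3} {'A':1}{19:>4}{'':1}   "
    f"{12.345:8.3f}{-100.123:8.3f}{-200.456:8.3f}{1.00:6.2f}{25.00:6.2f}"
    f"{'':10}{'S':>2}{'':2}"
)

# Splitting on whitespace: the coordinates touch each other and stay fused
tokens = line.split()
print(tokens[6])

# Slicing by position: characters 31-38, 39-46 and 47-54 (1-based, inclusive)
x = float(line[30:38])
y = float(line[38:46])
z = float(line[46:54])
print(x, y, z)
```

Calling ``float(tokens[6])`` would raise a ``ValueError``, whereas the sliced fields convert without problems.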

### *TASK 1*

In this folder you will find the PDB file for Proaerolysin (``1PRE.pdb``). Open the file, and print all the lines containing a cysteine (residue name "CYS"). 

**Note:** Remember that in *Python* one starts counting from 0, not 1!
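
One way to avoid off-by-one mistakes is to keep the 1-based character positions from the table in the code and convert them in a single place. The helper name ``pdb_field`` is hypothetical and introduced only for this sketch; the example line is made up.

```python
def pdb_field(line, first, last):
    """Return the characters at the 1-based, inclusive positions first..last."""
    # Python slices start at 0 and exclude the end index,
    # so positions first..last correspond to line[first - 1:last].
    return line[first - 1:last]

# Made-up beginning of an ATOM record, concatenated field by field:
# record name, serial number, gap, atom name, alternate location,
# residue name, gap, chain identifier, residue sequence number
line = "ATOM  " + "    1" + " " + " N  " + " " + "CYS" + " " + "A" + "  19"

residue_name = pdb_field(line, 18, 20)        # same as line[17:20]
serial_number = int(pdb_field(line, 7, 11))   # same as int(line[6:11])
print(residue_name, serial_number)
```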

In [None]:
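with open('1PRE.pdb', 'r') as file_handle:
    lines = file_handle.readlines()

for line in lines:
    # Residue name: characters 18-20, i.e. slice [17:20]; only ATOM records follow this layout
    if line.startswith("ATOM") and line[17:20] == "CYS":
        # Each line already ends with a newline, so suppress the one added by print
        print(line, end="")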

### *TASK 2*

Parse ``1PRE.pdb`` and write the $x$, $y$ and $z$ coordinates of all the cysteines to the output file ``atoms_cys.txt``. Use [*f-strings*](https://www.blog.pythonlibrary.org/2018/03/13/python-3-an-intro-to-f-strings/) to generate one string for each line ending with a newline character ``\n``.

**Advanced:** Make sure that columns are aligned.
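
Aligned columns can be obtained with a width and precision in the f-string format specification. The following is a minimal sketch with made-up coordinates; the width of 8 characters and 3 decimals is an assumption that mirrors the PDB coordinate fields.

```python
# Made-up coordinates, used only to demonstrate the formatting
rows = [(5.061, 43.5, -7.25), (-12.928, 101.004, 0.0)]

# '>8' right-aligns a string in 8 characters,
# '8.3f' prints a float in 8 characters with 3 decimals
header = f"{'X':>8} {'Y':>8} {'Z':>8}\n"
lines_out = [f"{x:8.3f} {y:8.3f} {z:8.3f}\n" for x, y, z in rows]

print(header + "".join(lines_out), end="")
```

Because every field has a fixed width, the header and all rows have the same length regardless of sign or magnitude (as long as the values fit into 8 characters).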

In [None]:

with open('1PRE.pdb', 'r') as file_handle:
    lines = file_handle.readlines()

with open('atoms_cys.txt', 'w') as coordinates:
    coordinates.write("Coordinates of Cys residues\n X \t\t Y \t\t Z\n")
    for line in lines:
        if line.startswith("ATOM") and line[17:20] == "CYS":
            # Characters 31-38, 39-46 and 47-54 correspond to
            # the slices [30:38], [38:46] and [46:54]
            xcoord = float(line[30:38])
            ycoord = float(line[38:46])
            zcoord = float(line[46:54])
            output = f"{xcoord} {ycoord} {zcoord}\n"
            coordinates.write(output)

### *TASK 3*

Define a function ``read_pdb(filename)`` that expects as input parameter a PDB filename and that returns four *numpy arrays* containing
- the $x$ coordinates
- the $y$ coordinates
- the $z$ coordinates
- the atomic numbers (proton numbers)

**Advanced:** Save the function in another file called ``pdb_``*yourname*``.py``, import and call it.
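
Returning *numpy arrays* pays off because calculations then apply to all atoms at once. The sketch below uses small made-up arrays in place of the output of ``read_pdb`` and shows element-wise arithmetic and selection with a boolean mask.

```python
import numpy

# Made-up stand-ins for the four arrays returned by read_pdb
x = numpy.array([1.0, 2.0, 4.0, 5.0])
y = numpy.array([0.0, 2.0, 2.0, 4.0])
z = numpy.array([3.0, 3.0, 1.0, 1.0])
a = numpy.array([7, 6, 6, 16])  # atomic numbers: N, C, C, S

# Element-wise arithmetic: distance of every atom from the origin
distance = numpy.sqrt(x**2 + y**2 + z**2)

# Boolean mask: True where the atomic number is 6 (carbon)
is_carbon = (a == 6)
carbon_x = x[is_carbon]

print(distance)
print(carbon_x, carbon_x.mean())
```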

In [None]:
# Import of numpy
import numpy

In [None]:
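# Atomic (proton) numbers of the elements expected in a protein structure
proton_number = {'H': 1, 'C': 6, 'N': 7, 'O': 8, 'S': 16}

def read_pdb(filename='1PRE.pdb'):
    """Return x, y, z and atomic numbers of all ATOM records as numpy arrays."""
    X = []
    Y = []
    Z = []
    A = []
    with open(filename, 'r') as file_handle:
        for line in file_handle:
            if line.startswith("ATOM"):
                # Characters 31-38, 39-46, 47-54 -> slices [30:38], [38:46], [46:54]
                X.append(float(line[30:38]))
                Y.append(float(line[38:46]))
                Z.append(float(line[46:54]))
                # Element symbol: characters 77-78, right-justified
                A.append(proton_number[line[76:78].strip()])
    X = numpy.array(X)
    Y = numpy.array(Y)
    Z = numpy.array(Z)
    A = numpy.array(A)
    return (X, Y, Z, A)


read_pdb()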

In [None]:
from pdb_fabs import read_pdb
read_pdb()

### *TASK 4*

Use the *matplotlib* function [*matplotlib.pyplot.scatter*](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.scatter.html) to visualise the $x$ and $y$ coordinates of Proaerolysin.

**Advanced:** Plot each element in a different color.
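
One possible approach to the advanced part is to call ``scatter`` once per element, selecting the atoms with a boolean mask. This is a sketch on made-up arrays; the lookup tables ``symbols`` and ``colors`` and the chosen colors are assumptions made for illustration.

```python
import numpy
import matplotlib.pyplot

# Made-up stand-ins for the x, y and atomic-number arrays of a structure
x = numpy.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = numpy.array([1.0, 0.5, 2.0, 1.5, 3.0])
a = numpy.array([6, 7, 6, 8, 16])

# Illustrative lookup tables (names and colors chosen for this sketch)
symbols = {6: 'C', 7: 'N', 8: 'O', 16: 'S'}
colors = {6: 'gray', 7: 'blue', 8: 'red', 16: 'gold'}

figure, axes = matplotlib.pyplot.subplots()
for number in numpy.unique(a):
    mask = (a == number)  # select the atoms of one element
    axes.scatter(x[mask], y[mask],
                 color=colors[int(number)], label=symbols[int(number)])
axes.set_xlabel("x (Å)")
axes.set_ylabel("y (Å)")
axes.legend()
```

Plotting element by element gives each element its own legend entry, which a single ``scatter`` call with a color array does not provide directly.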

In [None]:
# Import matplotlib and activate inline plotting in the notebook with the %matplotlib magic command
import matplotlib
%matplotlib inline

In [None]:
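import matplotlib.pyplot
import numpy
from pdb_fabs import read_pdb

# Read the file once and unpack the four returned arrays
x, y, z, a = read_pdb()

# Color the points by atom index, i.e. by their order in the file
matplotlib.pyplot.scatter(x, y, c=range(len(y)))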

### *EXTRA TASK*

The radius of gyration of a molecule of $N$ atoms is defined as
    
$$R_g = \sqrt{\frac{\sum_i^N m_i r_i^2}{\sum_i^N m_i}}$$

where $m_i$ is the mass of atom $i$ and $r_i = |\vec{r_i} - \vec{r_0}|$ is the distance of atom $i$, located at position $\vec{r_i}$, from the molecule's center of mass

$$\vec{r_0} = \frac{\sum_i^N m_i \vec{r_i}}{\sum_i^N m_i}$$

Estimate the size of Proaerolysin by calculating the radius of gyration on the basis of only its carbon atoms.

**Advanced:** Consider atoms of all elements for this calculation.
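
The two formulas translate almost directly into numpy expressions. The following is a minimal sketch on made-up positions (four atoms at distance 1 Å from the point (5, 5, 5)), assuming that every atom is a carbon with mass 12 u; it is not the result for Proaerolysin.

```python
import numpy

# Made-up positions in Å: four atoms on a circle of radius 1 around (5, 5, 5)
x = numpy.array([6.0, 4.0, 5.0, 5.0])
y = numpy.array([5.0, 5.0, 6.0, 4.0])
z = numpy.array([5.0, 5.0, 5.0, 5.0])
m = numpy.full(len(x), 12.0)  # assumption: every atom is carbon, mass 12 u

# Center of mass: mass-weighted mean of each coordinate
total_mass = m.sum()
x0 = (m * x).sum() / total_mass
y0 = (m * y).sum() / total_mass
z0 = (m * z).sum() / total_mass

# Squared distance of every atom from the center of mass
r2 = (x - x0)**2 + (y - y0)**2 + (z - z0)**2

# Radius of gyration: square root of the mass-weighted mean of r2
radius_of_gyration = numpy.sqrt((m * r2).sum() / total_mass)
print(x0, y0, z0, radius_of_gyration)
```

For the advanced variant, ``m`` would hold one mass per atom, looked up from the element of each atom, while the remaining lines stay unchanged.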

In [2]:
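import numpy

X = []
Y = []
Z = []
mass = 12  # mass of a carbon atom in atomic mass units
with open('1PRE.pdb', 'r') as file_handle:
    for line in file_handle:
        # Element symbol: characters 77-78, right-justified
        if line.startswith("ATOM") and line[76:78].strip() == "C":
            # Characters 31-38, 39-46, 47-54 -> slices [30:38], [38:46], [46:54]
            X.append(float(line[30:38]))
            Y.append(float(line[38:46]))
            Z.append(float(line[46:54]))

X = numpy.array(X)
Y = numpy.array(Y)
Z = numpy.array(Z)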


In [13]:
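# Note: the factors 12*len(X) cancel, so this evaluates to sum(X),
# not to the center-of-mass coordinate sum(X)/len(X)
rcenterx=sum(X)*12*len(X)/(12*len(X))
rcenterx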

69980.76700000012

In [16]:
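# Coordinate sums over all carbon atoms; dividing each sum by len(X)
# would give the center of mass, since the equal masses cancel
rcenterx=sum(X)
rcentery=sum(Y)
rcenterz=sum(Z)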
print(rcenterx)
print(rcentery)
print(rcenterz)

69980.76700000012
238530.87300000017
144237.9699999999


In [17]:
X

array([  5.061,   5.843,   4.405, ..., -12.928, -11.762, -12.493])