Author: ****

# 3. Reading and writing files

### Goals

- read atom coordinates from a PDB file
- write tables of values to a text file
- do calculations with *numpy arrays*
- make first plots with *matplotlib*

### Introduction

#### ATOM data in PDB files

The PDB file format stores the species and position of every atom in lines starting with ``ATOM``. Each of these lines has the same number of characters and the following fixed-column format (columns are counted from 1):

| Columns | Field                          | Data type |
|---------|--------------------------------|-----------|
| 1-6     | Record name ``ATOM``           | String    |
| 7-11    | Atom serial number             | Integer   |
| 13-16   | Atom name                      | String    |
| 17      | Alternate location             | String    |
| 18-20   | Residue name                   | String    |
| 22      | Chain identifier               | String    |
| 23-26   | Residue sequence number        | Integer   |
| 27      | Code for insertion of residues | String    |
| 31-38   | X coordinate in Å              | Float     |
| 39-46   | Y coordinate in Å              | Float     |
| 47-54   | Z coordinate in Å              | Float     |
| 55-60   | Occupancy                      | Float     |
| 61-66   | Temperature factor             | Float     |
| 77-78   | Element symbol                 | String    |
| 79-80   | Charge                         | String    |

Further reading: http://www.wwpdb.org/documentation/file-format-content/format33/v3.3.html

**Note:** With PDB files, splitting the lines into their individual columns using the function ``line.split()`` is dangerous, because neighbouring fields may touch without a separating space or may be empty: use character positions instead!
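To illustrate the note above, here is a minimal sketch that reads an ``ATOM`` line by character position. The helpers ``make_atom_line`` and ``parse_atom_line`` are hypothetical names introduced only for this example, and the atom itself is made up so that its x and y fields touch.

```python
def make_atom_line(serial, name, resname, chain, resseq, x, y, z, element):
    """Build an 80-character ATOM record (hypothetical helper for this demo)."""
    return (
        f"{'ATOM':<6}{serial:>5} {name:<4} {resname:>3} {chain}{resseq:>4}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{0.0:6.2f}{'':10}{element:>2}  "
    )


def parse_atom_line(line):
    """Extract a few fields by character position (1-based columns in comments)."""
    return {
        "serial": int(line[6:11]),        # columns 7-11
        "name": line[12:16].strip(),      # columns 13-16
        "resname": line[17:20],           # columns 18-20
        "x": float(line[30:38]),          # columns 31-38
        "y": float(line[38:46]),          # columns 39-46
        "z": float(line[46:54]),          # columns 47-54
        "element": line[76:78].strip(),   # columns 77-78
    }


# Made-up atom whose x and y fields touch: "-100.123-200.456"
line = make_atom_line(1, "SG", "CYS", "A", 19, -100.123, -200.456, 23.23, "S")
atom = parse_atom_line(line)
print(atom["x"], atom["y"], atom["z"])

tokens = line.split()
print(tokens[6])  # x and y are fused into a single token
```

Slicing recovers all three coordinates, whereas ``split()`` returns the fused token, which ``float()`` cannot convert.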

### *TASK 1*

In this folder you will find the PDB file for Proaerolysin (``1PRE.pdb``). Open the file, and print all the lines containing a cysteine (residue name "CYS"). 

**Note:** Remember that in *Python* one starts counting from 0, not 1!
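As a small sketch of what this means for the table above: a 1-based, inclusive column range ``first``-``last`` corresponds to the *Python* slice ``[first - 1:last]``. The helper ``columns`` and the shortened record are made up for this illustration.

```python
def columns(line, first, last):
    """Return the characters in the 1-based, inclusive columns first..last."""
    return line[first - 1:last]


# Made-up beginning of an ATOM record (columns 1-26 only)
record = "ATOM    745  N   CYS A  19"

print(columns(record, 18, 20))       # residue name, columns 18-20
print(record[17:20])                 # the same characters as a plain slice
print(int(columns(record, 7, 11)))   # atom serial number, columns 7-11
```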

In [22]:
with open("1PRE.pdb", "r") as pdb:
    lines = pdb.readlines()

# Residue name: columns 18-20, i.e. the 0-based slice [17:20]
for i, line in enumerate(lines):
    if line[17:20] == "CYS":
        print(i)  # 0-based index of the matching line; print(line, end="") shows the line itself

745
746
747
748
749
750
1170
1171
1172
1173
1174
1175
1820
1821
1822
1823
1824
1825
1861
1862
1863
1864
1865
1866
4269
4270
4271
4272
4273
4274
4694
4695
4696
4697
4698
4699
5344
5345
5346
5347
5348
5349
5385
5386
5387
5388
5389
5390


### *TASK 2*

Parse ``1PRE.pdb`` and write the $x$, $y$ and $z$ coordinates of all the cysteines to the output file ``atoms_cys.txt``. Use [*f-strings*](https://www.blog.pythonlibrary.org/2018/03/13/python-3-an-intro-to-f-strings/) to generate one string for each line ending with a newline character ``\n``.

**Advanced:** Make sure that columns are aligned.
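One possible way to align the columns is to give every value a fixed width and a fixed number of decimals in the f-string format specification. The sketch below uses three cysteine atoms from ``1PRE.pdb``; the helper name ``format_row`` and the field width of 10 are arbitrary choices for this example.

```python
# Coordinates (x, y, z) in Å of three cysteine atoms from 1PRE.pdb
rows = [
    (13.459, 82.83, 23.23),
    (-34.683, 48.263, 43.646),
    (17.737, 73.52, 9.02),
]


def format_row(x, y, z):
    """Right-align each value in a 10-character field with 3 decimals."""
    return f"{x:10.3f}{y:10.3f}{z:10.3f}\n"


text = "".join(format_row(x, y, z) for x, y, z in rows)
print(text, end="")
```

Because every field has the same width, the decimal points line up regardless of sign or number of digits.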

In [106]:
import numpy as np

# Read all lines; the with-block closes the file automatically
with open("1PRE.pdb", "r") as pdb:
    lines = pdb.readlines()

cys_container = []
for line in lines:
    if line[17:20] == "CYS":
        # Coordinates: columns 31-38, 39-46 and 47-54 (0-based slices)
        x = float(line[30:38])
        y = float(line[38:46])
        z = float(line[46:54])
        cys_container.append([x, y, z])
cys_container = np.array(cys_container)
print(cys_container)

with open("atoms_cys.txt", "w") as f:
    for x, y, z in cys_container:
        f.write(f"{x} {y} {z}\n")

[[ 13.459  82.83   23.23 ]
 [ 12.542  81.712  23.122]
 [ 11.546  81.89   21.995]
 [ 11.397  82.988  21.465]
 [ 11.74   81.685  24.443]
 [ 12.815  81.107  25.8  ]
 [ 13.304  76.65   26.884]
 [ 12.218  77.532  27.304]
 [ 11.06   76.742  27.879]
 [ 11.022  75.512  27.695]
 [ 11.76   78.281  26.056]
 [ 13.153  79.155  25.299]
 [-34.683  48.263  43.646]
 [-33.997  49.53   43.913]
 [-34.829  50.797  43.898]
 [-36.037  50.849  43.733]
 [-32.829  49.678  42.916]
 [-33.275  49.564  41.147]
 [-33.025  52.464  37.669]
 [-31.971  52.007  38.53 ]
 [-31.642  50.547  38.331]
 [-30.709  50.075  38.978]
 [-32.142  52.251  40.022]
 [-33.612  51.54   40.736]
 [ 17.737  73.52    9.02 ]
 [ 18.669  73.085  10.077]
 [ 19.747  74.16   10.288]
 [ 19.92   74.938   9.349]
 [ 19.421  71.86    9.535]
 [ 18.646  70.235   9.589]
 [ 18.284  67.714  13.303]
 [ 19.436  67.773  12.391]
 [ 20.647  66.978  12.853]
 [ 20.941  66.772  14.033]
 [ 19.759  69.237  12.197]
 [ 18.303  70.128  11.605]
 [ 65.931  40.911  32.854]


In [101]:
with open("atoms_cys.txt", "r") as f:
    for row in f:
        print(row)  # each row already ends with "\n", so print() adds a blank line

13.459 82.83 23.23

12.542 81.712 23.122

11.546 81.89 21.995

11.397 82.988 21.465

11.74 81.685 24.443

12.815 81.107 25.8

13.304 76.65 26.884

12.218 77.532 27.304

11.06 76.742 27.879

11.022 75.512 27.695

11.76 78.281 26.056

13.153 79.155 25.299

-34.683 48.263 43.646

-33.997 49.53 43.913

-34.829 50.797 43.898

-36.037 50.849 43.733

-32.829 49.678 42.916

-33.275 49.564 41.147

-33.025 52.464 37.669

-31.971 52.007 38.53

-31.642 50.547 38.331

-30.709 50.075 38.978

-32.142 52.251 40.022

-33.612 51.54 40.736

17.737 73.52 9.02

18.669 73.085 10.077

19.747 74.16 10.288

19.92 74.938 9.349

19.421 71.86 9.535

18.646 70.235 9.589

18.284 67.714 13.303

19.436 67.773 12.391

20.647 66.978 12.853

20.941 66.772 14.033

19.759 69.237 12.197

18.303 70.128 11.605

65.931 40.911 32.854

65.302 41.248 31.584

66.332 41.648 30.537

67.553 41.729 30.74

64.291 42.383 31.857

65.036 43.979 32.268

64.657 47.952 30.508

63.676 46.891 30.597

63.314 46.637 32.044

62.193 46.172 32.274

### *TASK 3*

Define a function ``read_pdb(filename)`` that takes a PDB filename as its input parameter and returns four *numpy arrays* containing
- the $x$ coordinates
- the $y$ coordinates
- the $z$ coordinates
- the atomic numbers (proton numbers)

**Advanced:** Save the function in another file called ``pdb_``*yourname*``.py``, import and call it.
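A minimal sketch of such a function is given below. It assumes that the element symbol is present in columns 77-78 of every ``ATOM`` line and that only the elements listed in the lookup table ``ATOMIC_NUMBERS`` occur; both the table and its name are illustrative choices, not part of the PDB format.

```python
import numpy as np

# Atomic numbers of elements commonly found in proteins (extend as needed)
ATOMIC_NUMBERS = {"H": 1, "C": 6, "N": 7, "O": 8, "P": 15, "S": 16}


def read_pdb(filename):
    """Return x, y, z and the atomic numbers of all ATOM records as numpy arrays."""
    x, y, z, numbers = [], [], [], []
    with open(filename, "r") as f:
        for line in f:
            if not line.startswith("ATOM"):
                continue
            x.append(float(line[30:38]))             # columns 31-38
            y.append(float(line[38:46]))             # columns 39-46
            z.append(float(line[46:54]))             # columns 47-54
            element = line[76:78].strip().upper()    # columns 77-78
            numbers.append(ATOMIC_NUMBERS[element])  # KeyError if the element is not listed
    return np.array(x), np.array(y), np.array(z), np.array(numbers, dtype=int)
```

It could then be called as ``x, y, z, numbers = read_pdb("1PRE.pdb")``. For the advanced variant, the same definition can be saved in a separate ``.py`` file in the notebook folder and loaded with a ``from ... import read_pdb`` statement.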

In [117]:
# Import of numpy
import numpy as np


def read_pdb(filename):
    """Return an array with one row [serial number, x, y, z] per ATOM record."""
    container = []
    with open(filename, "r") as f:
        for line in f:
            if line.startswith("ATOM"):
                n = float(line[6:11])  # atom serial number (columns 7-11), not the atomic number
                x = float(line[30:38])
                y = float(line[38:46])
                z = float(line[46:54])
                container.append([n, x, y, z])
    return np.array(container)

output = read_pdb("1PRE.pdb")
print(output)


[[ 1.0000e+00  5.9550e+00  7.7192e+01  4.1900e+01]
 [ 2.0000e+00  5.0610e+00  7.6430e+01  4.0975e+01]
 [ 3.0000e+00  5.8430e+00  7.5299e+01  4.0293e+01]
 ...
 [ 7.0550e+03 -1.1762e+01  7.6732e+01  4.3104e+01]
 [ 7.0560e+03 -1.1630e+01  7.6829e+01  4.1859e+01]
 [ 7.0570e+03 -1.2493e+01  7.8663e+01  4.4602e+01]]


In [23]:
# TYPE YOUR SOLUTION HERE

### *TASK 4*

Use the *matplotlib* function [*matplotlib.pyplot.scatter*](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.scatter.html) to visualise the $x$ and $y$ coordinates of Proaerolysin.

**Advanced:** Plot each element in a different color.
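One way to get a different color per element is to call ``scatter`` once for each element, so that *matplotlib* assigns the next color of its default cycle every time. The sketch below uses randomly generated stand-in data rather than the real coordinates, and the function name ``scatter_by_element`` is a hypothetical choice.

```python
import numpy as np
import matplotlib.pyplot as plt


def scatter_by_element(x, y, elements):
    """Scatter y against x, drawing the atoms of each element in their own color."""
    x, y, elements = np.asarray(x), np.asarray(y), np.asarray(elements)
    fig, ax = plt.subplots()
    for element in np.unique(elements):
        mask = elements == element          # boolean mask selecting one element
        ax.scatter(x[mask], y[mask], s=4, label=str(element))
    ax.set_xlabel("x (Å)")
    ax.set_ylabel("y (Å)")
    ax.set_aspect("equal")                  # same scale on both axes
    ax.legend(title="Element")
    return ax


# Made-up demo data standing in for the coordinates read from the PDB file
rng = np.random.default_rng(0)
x_demo = rng.normal(0.0, 20.0, size=300)
y_demo = rng.normal(0.0, 20.0, size=300)
elements_demo = rng.choice(["C", "N", "O", "S"], size=300)

ax = scatter_by_element(x_demo, y_demo, elements_demo)
```

In a notebook with inline plotting the figure appears below the cell; in a plain script a final ``plt.show()`` would be needed.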

In [None]:
# Import matplotlib and activate inline plotting in the notebook (IPython magic command)
import matplotlib.pyplot as plt
%matplotlib inline

In [24]:
# TYPE YOUR SOLUTION HERE

### *EXTRA TASK*

The radius of gyration of a molecule of $N$ atoms is defined as
    
$$R_g = \sqrt{\frac{\sum_i^N m_i r_i^2}{\sum_i^N m_i}}$$

where $m_i$ is the mass of atom $i$, $\vec{r_i}$ is its position, and $r_i = |\vec{r_i} - \vec{r_0}|$ is its distance from the molecule's center of mass

$$\vec{r_0} = \frac{\sum_i^N m_i \vec{r_i}}{\sum_i^N m_i}$$

Estimate the size of Proaerolysin by calculating the radius of gyration on the basis of only its carbon atoms.

**Advanced:** Consider atoms of all elements for this calculation.
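The two formulas above translate almost directly into *numpy* operations. The following is a sketch only: the function name ``radius_of_gyration`` and the layout of the inputs (an $N \times 3$ coordinate array and a length-$N$ mass array) are assumptions made for this example.

```python
import numpy as np


def radius_of_gyration(coords, masses):
    """Mass-weighted radius of gyration for N atoms.

    coords: array of shape (N, 3) with the atom positions
    masses: array of shape (N,) with the atom masses
    """
    coords = np.asarray(coords, dtype=float)
    masses = np.asarray(masses, dtype=float)
    total_mass = masses.sum()
    # Center of mass r_0: mass-weighted mean position
    center = (masses[:, None] * coords).sum(axis=0) / total_mass
    # Squared distance of every atom from the center of mass
    r_squared = ((coords - center) ** 2).sum(axis=1)
    return np.sqrt((masses * r_squared).sum() / total_mass)


# Two equal masses at x = -1 and x = +1: each lies at distance 1 from the center
print(radius_of_gyration([[-1, 0, 0], [1, 0, 0]], [12.0, 12.0]))
```

When only carbon atoms are used, all masses are equal and cancel out of both formulas, so any constant mass array gives the same result.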

In [25]:
# TYPE YOUR SOLUTION HERE