## Working with file paths - the os.path module

For this section, we will be working with the file 4eyr.pdb, which is in the PDB_files folder inside the data directory.

To check where the file is, go to a new cell and type ls. ls stands for 'list' and lists the contents of the current directory. It is not a Python command, but it works in the Jupyter notebook. To see everything in the data directory, type ls data. To see which directory you are currently in, type pwd ('print working directory').

In [1]:
ls

01-Introduction_biochem_jupyter_book.md
02-file_parsing_biochem_jupyter_book.md
File_parsing_activities.ipynb
Remdesivir | C27H35N6O8P - PubChem.html
Remdesivir | C27H35N6O8P - PubChem.pdf
[1m[36mdata[m[m/
[1m[36mfig[m[m/
remdesivir_pubchem.json


In [2]:
ls data

[1m[36mPDB_files[m[m/      remdesivir.pdb  remdesivir.sdf  remdesivir.xyz


In [3]:
pwd

'/Users/pac8612/biochem_jupyter_book/Workshop_plagiarized_to_biochem'
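
ls and pwd are shell commands that the notebook passes through for convenience. If you would rather stay in Python, the os module has rough equivalents. The sketch below uses os.getcwd() and os.listdir(); the variable names are only illustrative.

```python
import os

# Python counterpart of pwd: the current working directory as a string
current_directory = os.getcwd()
print(current_directory)

# Python counterpart of ls: the names in that directory, sorted for readability
contents = sorted(os.listdir(current_directory))
for name in contents:
    print(name)
```

Unlike ls, os.listdir() returns a list of strings, so the result can be looped over or filtered like any other list.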

In [4]:
import os
protein_file = os.path.join('data', 'PDB_files','4eyr.pdb')
print(protein_file)

data/PDB_files/4eyr.pdb
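
os.path.join builds the path with the separator that is right for your operating system (a forward slash on macOS and Linux, a backslash on Windows), which is why it is preferable to typing the slashes yourself. The os.path module can also take a path apart again. The following is a small sketch of a few of those functions; the variable names are assumptions chosen for illustration.

```python
import os

protein_file = os.path.join('data', 'PDB_files', '4eyr.pdb')

# Take the path apart again
directory = os.path.dirname(protein_file)        # everything before the file name
filename = os.path.basename(protein_file)        # just the file name
pdb_id, extension = os.path.splitext(filename)   # name and extension separately

print(directory)
print(filename)
print(pdb_id)
print(extension)

# True only if the file is really at that location on your computer
print(os.path.exists(protein_file))
```

Checking os.path.exists before calling open can save you from a confusing FileNotFoundError when the notebook is started from a different directory.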


In [5]:
outfile = open(protein_file,"r")
data = outfile.readlines()

In [6]:
outfile.close()

In [7]:
with open(protein_file,"r") as outfile2:
    data2 = outfile2.readlines()

In [8]:
print(len(data))

2232


In [9]:
print(len(data2))

2232
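
Cells 5 and 6 read the file with open followed by an explicit close, while cell 7 uses a with block; both give the same 2232 lines. The with form closes the file automatically once the indented block ends, even if an error occurs inside it. The sketch below demonstrates this on a small temporary stand-in file (two lines copied from this section, with trailing spaces trimmed) so that it runs without 4eyr.pdb. The file name sample.pdb is hypothetical.

```python
import os
import tempfile

# Two lines copied from the 4eyr.pdb output in this section
sample_lines = [
    'HETNAM     RIT RITONAVIR\n',
    'REMARK   2 RESOLUTION.    1.80 ANGSTROMS.\n',
]

with tempfile.TemporaryDirectory() as tmpdir:
    sample_path = os.path.join(tmpdir, 'sample.pdb')  # hypothetical file name

    # Write the stand-in file
    with open(sample_path, 'w') as handle:
        handle.writelines(sample_lines)

    # Read it back the same way cell 7 does
    with open(sample_path, 'r') as handle:
        sample_data = handle.readlines()

    # The with block has ended, so the file is already closed
    closed_after_with = handle.closed

print(closed_after_with)
print(len(sample_data))
```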


In [10]:
for line in data:
    if 'HETNAM' in line:
        HETNAM_line = line
        print(HETNAM_line)

HETNAM     RIT RITONAVIR                                                        



In [11]:
HETNAM_line.split()

['HETNAM', 'RIT', 'RITONAVIR']

In [12]:
words = HETNAM_line.split()
print(words)

['HETNAM', 'RIT', 'RITONAVIR']


In [13]:
print(words[2])

RITONAVIR


In [14]:
print(words[1])

RIT


In [15]:
print(words[-1])

RITONAVIR


In [16]:
abbrev = words[1]
print(abbrev)

RIT


In [17]:
for line in data:
    if 'RESOLUTION.' in line:
        RESOLUTION_line = line
        print(RESOLUTION_line)

REMARK   2 RESOLUTION.    1.80 ANGSTROMS.                                       



In [18]:
words = RESOLUTION_line.split()
print(words)

['REMARK', '2', 'RESOLUTION.', '1.80', 'ANGSTROMS.']


In [31]:
for line in data:
    if 'PROTEIN ATOMS' in line:
        PROTEIN_ATOMS_line = line
        words = PROTEIN_ATOMS_line.split()
        print(words)

['REMARK', '3', 'PROTEIN', 'ATOMS', ':', '1514']


In [41]:
for line in data:
    if 'PROTEIN ATOMS' in line:
        PROTEIN_ATOM_line = line
        words = PROTEIN_ATOM_line.split(':')
        print(words)

['REMARK   3   PROTEIN ATOMS            ', ' 1514                                    \n']


In [42]:
atoms = words[1]
print(atoms)

 1514                                    



In [43]:
atoms / 2

TypeError: unsupported operand type(s) for /: 'str' and 'int'

In [44]:
atoms = float(atoms)
atoms/2

757.0

In [21]:
type(words[-1])

str

In [22]:
words[-1] = float(words[-1])
print(words[-1] * 2)

3028.0
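
Both float() and int() ignore the spaces and the newline around the digits, which is why the conversion above works without stripping the string first. Because an atom count is a whole number, int() may be the more natural choice there, while float() suits the resolution. A short sketch, using lines copied from the output above with trailing spaces trimmed:

```python
protein_atoms_line = 'REMARK   3   PROTEIN ATOMS            : 1514\n'
resolution_line = 'REMARK   2 RESOLUTION.    1.80 ANGSTROMS.\n'

# Text after the colon still has a leading space and a trailing newline
atom_text = protein_atoms_line.split(':')[1]
atom_count = int(atom_text)          # int() tolerates the surrounding whitespace
print(type(atom_count), atom_count)

# The resolution has a decimal point, so it needs float()
resolution = float(resolution_line.split()[3])
print(type(resolution), resolution)

# Arithmetic now works because these are numbers, not strings
print(atom_count * 2)
```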


Exercise: Extract the resolution, the number of protein atoms, and the number of heterogen atoms from 4eyr.pdb. Your output should look like this:
RESOLUTION : 1.80 ANGSTROMS
PROTEIN ATOMS : 1514
HETEROGEN ATOMS : 50

In [23]:
for line in data:
    if 'RESOLUTION.' in line:
        RESOLUTION_line = line
        words = RESOLUTION_line.split()
        words[2] = words[2].rstrip('.')  # to remove the . from the end of RESOLUTION.
        words[-1] = words[-1].rstrip('.')
        print(words[2], ':', words[3], words[-1])
    if 'PROTEIN ATOMS' in line:
        PROTEIN_ATOMS_line = line
        words = PROTEIN_ATOMS_line.split()
        print(words[2], words[3], ':', words[5] )
    if 'HETEROGEN ATOMS' in line:
        HETEROGEN_ATOMS_line = line
        words = HETEROGEN_ATOMS_line.split()
        print(words[2], words[3], ':', words[5])

RESOLUTION : 1.80 ANGSTROMS
PROTEIN ATOMS : 1514
HETEROGEN ATOMS : 50


In [57]:
# to get the primary sequence for 4EYR

for linenum, line in enumerate(data):
    if 'SEQRES' in line:
        print(linenum, ':', line, sep = '')

310:SEQRES   1 A   99  PRO GLN ILE THR LEU TRP GLN ARG PRO ILE VAL THR ILE          

311:SEQRES   2 A   99  LYS ILE GLY GLY GLN LEU LYS GLU ALA LEU LEU ASN THR          

312:SEQRES   3 A   99  GLY ALA ASP ASP THR VAL LEU GLU GLU VAL ASN LEU PRO          

313:SEQRES   4 A   99  GLY ARG TRP LYS PRO LYS LEU ILE GLY GLY ILE GLY GLY          

314:SEQRES   5 A   99  PHE VAL LYS VAL ARG GLN TYR ASP GLN VAL PRO ILE GLU          

315:SEQRES   6 A   99  ILE CYS GLY HIS LYS VAL ILE GLY THR VAL LEU VAL GLY          

316:SEQRES   7 A   99  PRO THR PRO THR ASN VAL ILE GLY ARG ASN LEU MET THR          

317:SEQRES   8 A   99  GLN ILE GLY CYS THR LEU ASN PHE                              

318:SEQRES   1 B   99  PRO GLN ILE THR LEU TRP GLN ARG PRO ILE VAL THR ILE          

319:SEQRES   2 B   99  LYS ILE GLY GLY GLN LEU LYS GLU ALA LEU LEU ASN THR          

320:SEQRES   3 B   99  GLY ALA ASP ASP THR VAL LEU GLU GLU VAL ASN LEU PRO          

321:SEQRES   4 B   99  GLY ARG TRP LYS PRO LYS LEU ILE
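
Each SEQRES line starts with the record name, a serial number, the chain identifier and the number of residues in the chain; the residue names follow. After split(), the residues are therefore everything from index 4 onward. The sketch below gathers them into one list per chain. It runs on three lines copied from the output above (trailing spaces trimmed) rather than on the full file, and the name sequences is just an illustrative choice.

```python
# First and last SEQRES lines of chain A, and the first line of chain B
sample_lines = [
    'SEQRES   1 A   99  PRO GLN ILE THR LEU TRP GLN ARG PRO ILE VAL THR ILE\n',
    'SEQRES   8 A   99  GLN ILE GLY CYS THR LEU ASN PHE\n',
    'SEQRES   1 B   99  PRO GLN ILE THR LEU TRP GLN ARG PRO ILE VAL THR ILE\n',
]

sequences = {}  # chain identifier -> list of three-letter residue names
for line in sample_lines:
    if line.startswith('SEQRES'):
        words = line.split()
        chain = words[2]       # e.g. 'A'
        residues = words[4:]   # everything after the residue count
        sequences.setdefault(chain, []).extend(residues)

for chain, residues in sequences.items():
    print(chain, len(residues), ' '.join(residues))
```

With the full file you would loop over data instead of sample_lines, and each chain's list should end up with as many entries as the residue count given on its SEQRES lines.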

In [25]:
# Does this approach work with other PDB files? Let's try myoglobin (PDB entry 1vxh)
# First we must set the path to the file
protein_file = os.path.join('data', 'PDB_files','1vxh.pdb')
with open(protein_file,"r") as outfile3:
    data3 = outfile3.readlines()
    print("There are", len(data3), "lines in 1vxh.pdb")


There are 1928 lines in 1vxh.pdb


In [26]:
for line in data3:
    if 'HETNAM' in line:
        HETNAM_line = line
        print(HETNAM_line)

HETNAM     SO4 SULFATE ION                                                      

HETNAM     HEM PROTOPORPHYRIN IX CONTAINING FE                                  



In [27]:
words = HETNAM_line.split()
print(words)

['HETNAM', 'HEM', 'PROTOPORPHYRIN', 'IX', 'CONTAINING', 'FE']


In [28]:
print(words[1], ':', words[2], words[3], words[4], words[5])

HEM : PROTOPORPHYRIN IX CONTAINING FE
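
Printing words[2] through words[5] by hand only works for a name with exactly four words. A slice combined with join handles a name of any length, including the single-word RITONAVIR from 4eyr.pdb. A minimal sketch, using the HEM line from the output above with trailing spaces trimmed:

```python
hetnam_line = 'HETNAM     HEM PROTOPORPHYRIN IX CONTAINING FE\n'

words = hetnam_line.split()
abbrev = words[1]                  # three-letter heterogen code
full_name = ' '.join(words[2:])    # every remaining word, joined by spaces

print(abbrev, ':', full_name)
```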


In [29]:
for line in data3:
    if 'RESOLUTION.' in line:
        RESOLUTION_line = line
        words = RESOLUTION_line.split()
        words[2] = words[2].rstrip('.')  # to remove the . from the end of RESOLUTION.
        words[-1] = words[-1].rstrip('.')
        print(words[2], ':', words[3], words[-1])
    if 'PROTEIN ATOMS' in line:
        PROTEIN_ATOMS_line = line
        words = PROTEIN_ATOMS_line.split()
        print(words[2], words[3], ':', words[5] )
    if 'HETEROGEN ATOMS' in line:
        HETEROGEN_ATOMS_line = line
        words = HETEROGEN_ATOMS_line.split()
        print(words[2], words[3], ':', words[5])

RESOLUTION : 1.70 ANGSTROMS
PROTEIN ATOMS : 1217
HETEROGEN ATOMS : 54
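
The loop above is the same one used for 4eyr.pdb. One way to avoid copying it for every new file is to wrap it in a function. The sketch below is only one possible arrangement: summarize_pdb is a hypothetical name, and the sample lines are modelled on the 4eyr.pdb lines shown earlier, with the HETEROGEN ATOMS line assumed to follow the same layout as the PROTEIN ATOMS line.

```python
def summarize_pdb(lines):
    """Return the resolution and atom counts found in a list of PDB lines."""
    summary = {}
    for line in lines:
        if 'RESOLUTION.' in line:
            summary['RESOLUTION'] = float(line.split()[3])
        elif 'PROTEIN ATOMS' in line:
            summary['PROTEIN ATOMS'] = int(line.split(':')[1])
        elif 'HETEROGEN ATOMS' in line:
            summary['HETEROGEN ATOMS'] = int(line.split(':')[1])
    return summary


# Stand-in lines modelled on 4eyr.pdb (assumed layout, trailing spaces trimmed)
sample_lines = [
    'REMARK   2 RESOLUTION.    1.80 ANGSTROMS.\n',
    'REMARK   3   PROTEIN ATOMS            : 1514\n',
    'REMARK   3   HETEROGEN ATOMS          : 50\n',
]

summary = summarize_pdb(sample_lines)
for label, value in summary.items():
    print(label, ':', value)
```

In the notebook you could then call summarize_pdb(data) or summarize_pdb(data3) on the lists read earlier. Because the function returns numbers rather than printing strings, the values can be used directly in later calculations.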


In [48]:
protein_file = os.path.join('data', 'PDB_files','7tim.pdb')
print(protein_file)

data/PDB_files/7tim.pdb


In [59]:
with open(protein_file,"r") as outfile2:
    data2 = outfile2.readlines()
for line in data2:
    if 'RESOLUTION.' in line:
        RESOLUTION_line = line
        words = RESOLUTION_line.split()
        words[2] = words[2].rstrip('.')  # to remove the . from the end of RESOLUTION.
        words[-1] = words[-1].rstrip('.')
        print(words[2], ':', words[3], words[-1])
    if 'PROTEIN ATOMS' in line:
        PROTEIN_ATOMS_line = line
        words = PROTEIN_ATOMS_line.split()
        print(words[2], words[3], ':', words[5] )
    if 'HETEROGEN ATOMS' in line:
        HETEROGEN_ATOMS_line = line
        words = HETEROGEN_ATOMS_line.split()
        print(words[2], words[3], ':', words[5])
    if 'HETNAM' in line:
        HETNAM_line = line
        words = HETNAM_line.split()
        print('HETNAM Abbreviation:', words[1], 'Full name:', words[2])

RESOLUTION : 1.90 ANGSTROMS
PROTEIN ATOMS : 3766
HETEROGEN ATOMS : 20
HETNAM Abbreviation: PGH Full name: PHOSPHOGLYCOLOHYDROXAMIC


In [62]:
for linenum, line in enumerate(data2):
    if 'SEQRES   1 A' in line:
        print(linenum, ':', line, sep = '')

369:SEQRES   1 A  247  ALA ARG THR PHE PHE VAL GLY GLY ASN PHE LYS LEU ASN          



In [63]:
# print the five lines that follow the first SEQRES line
for line in data2[370:375]:
    print(line)

SEQRES   2 A  247  GLY SER LYS GLN SER ILE LYS GLU ILE VAL GLU ARG LEU          

SEQRES   3 A  247  ASN THR ALA SER ILE PRO GLU ASN VAL GLU VAL VAL ILE          

SEQRES   4 A  247  CYS PRO PRO ALA THR TYR LEU ASP TYR SER VAL SER LEU          

SEQRES   5 A  247  VAL LYS LYS PRO GLN VAL THR VAL GLY ALA GLN ASN ALA          

SEQRES   6 A  247  TYR LEU LYS ALA SER GLY ALA PHE THR GLY GLU ASN SER          



# Extracting the Atom and Heteroatom Coordinates
PDB files contain a lot of information, and sometimes you only want the atom and heteroatom coordinates. This is useful when, for example, a program cannot read a particular PDB file for some unknown reason. You could spend hours trying to find the problem, or you could ask Python to pull out the coordinates of the structure and load those into your program instead.
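
As a minimal sketch of the idea, the code below keeps only the lines whose record name is ATOM or HETATM. It tests the start of each line rather than using 'ATOM' in line, because REMARK lines such as the PROTEIN ATOMS line also contain the text ATOM. The two coordinate lines are invented stand-ins for illustration and are not taken from a real PDB entry.

```python
# Stand-in lines: the REMARK line is from this section, the
# ATOM and HETATM lines are invented for illustration only
sample_lines = [
    'REMARK   3   PROTEIN ATOMS            : 1514\n',
    'ATOM      1  N   PRO A   1      29.361  39.686   5.862  1.00 38.10           N\n',
    'HETATM 1515  C1  RIT A 301      10.000  20.000  30.000  1.00 20.00           C\n',
    'END\n',
]

# Keep only lines that begin with one of the two coordinate record names
coordinate_lines = [
    line for line in sample_lines
    if line.startswith(('ATOM', 'HETATM'))
]

# For comparison: a substring test picks up the REMARK line and misses HETATM
naive_matches = [line for line in sample_lines if 'ATOM' in line]

print(len(coordinate_lines))
print(len(naive_matches))
for line in coordinate_lines:
    print(line, end='')
```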