# Python in 5 Easy Pieces

Guillermo Avendano-Franco (gufranco@mail.wvu.edu)

This is a set of lectures about Python for Scientific Computing. The lectures were prepared for the attendees of the Virtual School of Computational Science and Engineering ([VSCSE](http://www.vscse.org/)) at [West Virginia University](http://www.wvu.edu).
They were presented from June 16 to 20, 2014 and from June 30 to July 2, 2014.

The other lectures are accessible using the IPython Notebook Viewer and are located on my GitHub account:

[Python in 5 Easy Pieces](http://nbviewer.ipython.org/github/guilleaf/Lectures/tree/master/Python_in_5_Easy_Pieces)

# An exercise about Molecules (Part 1/2)

This piece is inspired by chemistry, but its real purpose is to present a series of otherwise disconnected elements of the Python language related to the so-called "Python Standard Library", a rich set of routines to perform a variety of tasks.
This library is one of the key elements that make Python so attractive for scientific computing.

In this piece you will learn:

  1. How to use a few modules in the Python Standard Library, in particular urllib, HTMLParser and json
  2. How to use an external library called OpenBabel to extract chemical information from files
  3. How to use Mayavi to create a primitive but functional molecular visualizer
  4. How to store chemical data in JSON, a popular file format for storing data
  5. How to use NetCDF, a powerful file format for storing large amounts of numerical data

For this piece we are not creating a program; rather, we are creating a script to perform a set of small tasks with molecules.

## Reading data from the Internet

Imagine that you know about a website that hosts a large database of molecules and you would like to have access to that database. However, access is restricted to a web interface. You need to capture a large collection of molecules, but you do not want to click through the site to fetch every single molecule that you need.

That is our first task. We will use two modules from the Python Standard Library: **urllib** to access a webpage and **HTMLParser** to extract the data that you need.

[**urllib**](https://docs.python.org/2.7/library/urllib.html) is a module to deal with URLs. A uniform resource locator, abbreviated URL (also known as a web address, particularly when used with HTTP), is a way to obtain a specific resource from the Internet. Let's import that module:

In [1]:
# Python 2
#import urllib
import urllib.request

The urllib module provides a function to open a [URL](http://en.wikipedia.org/wiki/Uniform_resource_locator) as a read-only file-like object. We can use the methods _read()_, _readline()_ and _readlines()_
to read the webpage as a single string [ using _read()_ ] or as a list of lines [ using _readlines()_ ]. In Python 3 these methods return bytes, which is why the result is decoded into text before it is used.
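
As a minimal sketch of this file-like interface, the example below opens a `data:` URL, which `urllib.request` resolves locally, so no network access is needed. The URL and its two lines of text are made up for illustration; with a real web address the same methods apply.

```python
import urllib.request

# A made-up data: URL holding two short lines of text
# (%20 encodes a space and %0A encodes a newline)
demo_url = "data:text/plain,first%20line%0Asecond%20line%0A"

# read() returns the whole resource as a single bytes object
with urllib.request.urlopen(demo_url) as rf:
    whole = rf.read()

# readlines() returns a list with one bytes object per line
with urllib.request.urlopen(demo_url) as rf:
    lines = rf.readlines()

# Bytes must be decoded to obtain a regular string
print(whole.decode('latin1'))
print(lines)
```

Reading everything at once with _read()_ is convenient for small pages; for very large resources, iterating line by line keeps memory use low.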

Our target is a website that contains a collection of common molecules http://www.reciprocalnet.org/edumodules/commonmolecules/list.html

In [2]:
url="http://www.reciprocalnet.org/edumodules/commonmolecules/list.html"
#rf=urllib.urlopen(url)
rf=urllib.request.urlopen(url)

Now we can read using the file-like object *rf*

In [3]:
webpage=rf.read().decode('latin1')

The webpage is a large file in HTML format. We will see some lines more or less in the middle of the document

In [4]:
lines = webpage.split('\n')  # split once instead of once per iteration
for i in range(303,307):
    # zero-padded line number, five spaces, then the HTML line itself
    print(str(i).zfill(4)+5*' '+ lines[i])

0303     <span class="listFont"><a href="/recipnet/showsample.jsp?sampleId=27344121" target="_blank" onclick="openNewWindow(this.href); return false" onkeypress="openNewWindow(this.href); return false;">Carbon dioxide</a></span>&nbsp; &nbsp;<span class="smallListFont">Carbon dioxide, CO2, is one of the gases in our atmosphere, which is uniformly distributed over the earth's surface.</span><br/>
0304     <span class="listFont"><a href="/recipnet/showsample.jsp?sampleId=27344306" target="_blank" onclick="openNewWindow(this.href); return false" onkeypress="openNewWindow(this.href); return false;">Carbon suboxide</a></span>&nbsp; &nbsp;<span class="smallListFont">Carbon suboxide is a foul-smelling lachrymatory gas.</span><br/>
0305     <span class="listFont"><a href="/recipnet/showsample.jsp?sampleId=27344297" target="_blank" onclick="openNewWindow(this.href); return false" onkeypress="openNewWindow(this.href); return false;">Carbon tetrachloride</a></span>&nbsp; &nbsp;<span class="smallLi

We notice that each molecule has an associated *sampleId*. This number is used by the server-side page *showsample.jsp* (a JavaServer Page) to return the webpage of that particular molecule.

We could try to extract those numbers and their associated molecule names by reading the webpage as a text file and doing text-parsing acrobatics to read the number after _sampleId_ and the name before `</a>`.
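
To give an idea of what such text-parsing acrobatics could look like, here is a small sketch based on a regular expression. The sample line is a shortened copy of one of the lines printed above, and the pattern is only one of many possible choices; it assumes that the name follows the link attributes directly and contains no nested tags.

```python
import re

# Shortened copy of one line of the webpage
line = ('<span class="listFont"><a href="/recipnet/showsample.jsp?sampleId=27344121" '
        'target="_blank">Carbon dioxide</a></span>')

# Group 1: the digits after "sampleId="; group 2: the text just before </a>
pattern = re.compile(r'sampleId=(\d+)"[^>]*>([^<]+)</a>')

found = pattern.findall(line)
print(found)
```

This works on well-behaved lines, but it breaks easily when the HTML layout changes, which is the reason to prefer a real HTML parser.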

The Python standard library includes an [HTMLParser](https://docs.python.org/2.7/library/htmlparser.html) module that makes it easier to parse an HTML document.

In [5]:
from html.parser import HTMLParser
#from HTMLParser import HTMLParser

The **HTMLParser** module provides a class of the same name, *HTMLParser*. We need to subclass *HTMLParser* and override its methods to implement the desired behavior. In particular, we will override the methods *handle_starttag* and *handle_data* to collect the sampleId and the name of the molecule. We take advantage of the first tag in a webpage, `<html>`, to create a couple of variables to store our data.

In [6]:
class MyHTMLParser(HTMLParser):
    
    def handle_starttag(self, tag, attrs):
        print("Start tag:", tag)
        # The opening <html> tag comes first, so we use it to create our storage
        if tag=='html':
            self.molecules={}
            self.molecule_id=None
        # Links to molecules are <a> tags that open in a new window
        if tag=='a' and ('target', '_blank') in attrs:
            for attr in attrs:
                if 'href'== attr[0] and 'sampleId' in attr[1]:
                    self.molecule_id=attr[1].split('=')[1]
                    print(" -> Molecule ID :", self.molecule_id)
        
    def handle_data(self, data):
        # The first text found after a molecule link is taken as the molecule name
        if hasattr(self, 'molecule_id') and self.molecule_id is not None:
            print(" -> Parsing name:",data)
            self.molecules[str(self.molecule_id)]={'name': data}
            self.molecule_id=None  

parser = MyHTMLParser()

We have a new class, MyHTMLParser, that is derived from HTMLParser, and we create an instance of that class called *parser*.
We can use the *feed* method defined in the parent class to process the webpage that we stored as a long string.

In [8]:
parser.feed(webpage)

Start tag: html
Start tag: head
Start tag: title
Start tag: meta
Start tag: link
Start tag: link
Start tag: link
Start tag: script
Start tag: link
Start tag: body
Start tag: table
Start tag: tr
Start tag: td
Start tag: a
Start tag: span
Start tag: td
Start tag: div
Start tag: a
Start tag: a
Start tag: a
Start tag: tr
Start tag: td
Start tag: tr
Start tag: td
Start tag: br
Start tag: span
Start tag: a
Start tag: span
Start tag: a
Start tag: tr
Start tag: td
Start tag: tr
Start tag: td
Start tag: table
Start tag: tr
Start tag: td
Start tag: table
Start tag: tr
Start tag: td
Start tag: table
Start tag: tr
Start tag: td
Start tag: img
Start tag: td
Start tag: a
Start tag: tr
Start tag: td
Start tag: img
Start tag: td
Start tag: a
Start tag: tr
Start tag: td
Start tag: img
Start tag: td
Start tag: a
Start tag: tr
Start tag: td
Start tag: img
Start tag: td
Start tag: a
Start tag: tr
Start tag: td
Start tag: td
Start tag: td
Start tag: table
Start tag: tr
Start tag: td
Start tag: img
Start ta

Start tag: a
 -> Molecule ID : 27344871
 -> Parsing name: Chalcopyrite
Start tag: span
Start tag: br
Start tag: span
Start tag: a
 -> Molecule ID : 27344051
 -> Parsing name: Chlorate
Start tag: span
Start tag: br
Start tag: span
Start tag: a
 -> Molecule ID : 27344345
 -> Parsing name: Trans-Chlordane
Start tag: span
Start tag: br
Start tag: span
Start tag: a
 -> Molecule ID : 27344663
 -> Parsing name: Chlordene
Start tag: span
Start tag: br
Start tag: span
Start tag: a
 -> Molecule ID : 27344582
 -> Parsing name: Chlorine monoxide
Start tag: span
Start tag: br
Start tag: span
Start tag: a
 -> Molecule ID : 27344639
 -> Parsing name: Chlorocresol
Start tag: span
Start tag: br
Start tag: span
Start tag: a
 -> Molecule ID : 27344730
 -> Parsing name: Chloro-difluoro-methane
Start tag: span
Start tag: br
Start tag: span
Start tag: a
 -> Molecule ID : 27344734
 -> Parsing name: Chloromethane
Start tag: span
Start tag: br
Start tag: span
Start tag: a
 -> Molecule ID : 27344384
 -> Parsing

Start tag: br
Start tag: span
Start tag: a
 -> Molecule ID : 27344764
 -> Parsing name: Galena
Start tag: span
Start tag: br
Start tag: span
Start tag: a
 -> Molecule ID : 27344569
 -> Parsing name: Gallic acid
Start tag: span
Start tag: br
Start tag: span
Start tag: a
 -> Molecule ID : 27344569
 -> Parsing name: Gallic acid monohydrate
Start tag: span
Start tag: br
Start tag: span
Start tag: a
 -> Molecule ID : 27344000
 -> Parsing name: Garnet
Start tag: span
Start tag: br
Start tag: span
Start tag: a
 -> Molecule ID : 27344165
 -> Parsing name: Gaspeite
Start tag: span
Start tag: br
Start tag: span
Start tag: a
 -> Molecule ID : 27344811
 -> Parsing name: Germane
Start tag: span
Start tag: br
Start tag: span
Start tag: a
 -> Molecule ID : 27344328
 -> Parsing name: D-Glucitol
Start tag: span
Start tag: br
Start tag: span
Start tag: a
 -> Molecule ID : 27344728
 -> Parsing name: Glucocorticoid
Start tag: span
Start tag: br
Start tag: span
Start tag: a
 -> Molecule ID : 27344247
 -> P

Start tag: a
Start tag: div
Start tag: br
Start tag: span
Start tag: a
 -> Molecule ID : 27344327
 -> Parsing name: Octanitrocubane
Start tag: span
Start tag: br
Start tag: span
Start tag: a
 -> Molecule ID : 27344033
 -> Parsing name: Octanoic acid
Start tag: span
Start tag: br
Start tag: span
Start tag: a
 -> Molecule ID : 27344279
 -> Parsing name: Oestrin
Start tag: span
Start tag: br
Start tag: span
Start tag: a
 -> Molecule ID : 27344279
 -> Parsing name: Oestrone
Start tag: span
Start tag: br
Start tag: span
Start tag: a
 -> Molecule ID : 27344383
 -> Parsing name: Oleic acid
Start tag: span
Start tag: br
Start tag: span
Start tag: a
 -> Molecule ID : 27344382
 -> Parsing name: Oxalate
Start tag: span
Start tag: br
Start tag: span
Start tag: a
 -> Molecule ID : 27344627
 -> Parsing name: Oxalic acid
Start tag: span
Start tag: br
Start tag: span
Start tag: a
 -> Molecule ID : 27343946
 -> Parsing name: Oxirane
Start tag: span
Start tag: br
Start tag: span
Start tag: a
 -> Molecul

Start tag: span
Start tag: br
Start tag: span
Start tag: a
 -> Molecule ID : 27344501
 -> Parsing name: Threonine
Start tag: span
Start tag: br
Start tag: span
Start tag: a
 -> Molecule ID : 27344111
 -> Parsing name: Thymidine
Start tag: span
Start tag: br
Start tag: span
Start tag: a
 -> Molecule ID : 27344815
 -> Parsing name: Thymine
Start tag: span
Start tag: br
Start tag: span
Start tag: a
 -> Molecule ID : 27344806
 -> Parsing name: Thymol
Start tag: span
Start tag: br
Start tag: span
Start tag: a
 -> Molecule ID : 27344795
 -> Parsing name: Tin oxide
Start tag: span
Start tag: br
Start tag: span
Start tag: a
 -> Molecule ID : 27344795
 -> Parsing name: Tinstone
Start tag: span
Start tag: br
Start tag: span
Start tag: a
 -> Molecule ID : 27344519
 -> Parsing name: Titanium oxide
Start tag: span
Start tag: br
Start tag: span
Start tag: a
 -> Molecule ID : 27343934
 -> Parsing name: Titanium tetrachloride
Start tag: span
Start tag: br
Start tag: span
Start tag: a
 -> Molecule ID :

Notice that every time the parser finds a new tag, such as `<span class... >`, our *handle_starttag* method is called. From the matching `<a>` tags and the text that follows them we collect the ID and the name of the molecule. All the data collected is stored in the attribute *molecules* of the parser.
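
The same callback mechanism can be tried on a tiny HTML string without touching the network. The following is a simplified sketch, not the parser used in this lecture: the class name `LinkCollector` and its attributes are made up for illustration, and the storage is created in `__init__` instead of waiting for the `<html>` tag.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect {sampleId: name} pairs from <a> tags (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.links = {}       # results: sampleId -> molecule name
        self._pending = None  # ID seen in the last <a> tag, still waiting for its text

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href') or ''
            if 'sampleId=' in href:
                self._pending = href.split('sampleId=')[1]

    def handle_data(self, data):
        # The first text after a matching <a> tag is taken as the name
        if self._pending is not None:
            self.links[self._pending] = data
            self._pending = None

snippet = ('<html><body>'
           '<a href="/recipnet/showsample.jsp?sampleId=27344121">Carbon dioxide</a><br/>'
           '<a href="/recipnet/showsample.jsp?sampleId=27344306">Carbon suboxide</a>'
           '</body></html>')

collector = LinkCollector()
collector.feed(snippet)
print(collector.links)
```

Creating the storage in `__init__` means the parser also works on fragments that lack an `<html>` tag.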

In [9]:
molecules=parser.molecules
len(molecules)

580

We captured IDs for 580 molecules. Let's see what the Python dictionary looks like:

In [10]:
molecules

{'27343041': {'name': 'Oxychlor'},
 '27343452': {'name': 'Hexachlorocyclopentadiene'},
 '27343872': {'name': 'Sulfur oxide tetrafluoride'},
 '27343875': {'name': 'Inesite'},
 '27343876': {'name': 'L-Alanine'},
 '27343879': {'name': 'Hexachlorocyclohexane'},
 '27343880': {'name': 'Glycine'},
 '27343882': {'name': 'Pyrrhotite'},
 '27343883': {'name': 'Vitamin E'},
 '27343884': {'name': 'Pseudopterosin F dihydrate'},
 '27343885': {'name': 'Prometone'},
 '27343886': {'name': 'Estradiol'},
 '27343888': {'name': 'Cholic acid acrylonitrile clathrate'},
 '27343889': {'name': 'Quartz'},
 '27343893': {'name': 'Scalaradial'},
 '27343895': {'name': 'Cumene hydroperoxide'},
 '27343896': {'name': 'N-('},
 '27343898': {'name': 'Dimethylpyrazine'},
 '27343899': {'name': 'Perylene'},
 '27343903': {'name': 'Niter'},
 '27343907': {'name': 'Nuprin'},
 '27343908': {'name': 'Sinhalite'},
 '27343911': {'name': 'Biphosphate ion'},
 '27343915': {'name': 'Biacetyl'},
 '27343917': {'name': 'Halite'},
 '27343918'

Before repeating the procedure for many molecules, let's explore one particular case. Asking a Python dictionary for its keys does not return them sorted (since Python 3.7 they come back in insertion order). We just need one key to see how we can get the chemical file for it.
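
The difference between insertion order and sorted order can be seen in a short sketch. The three entries are taken from the dictionary shown above, but their order of insertion here is chosen only for illustration.

```python
# Three entries of the molecules dictionary, inserted in an arbitrary order
molecules_demo = {
    '27344305': {'name': 'Hydrated iron tellurite'},
    '27343880': {'name': 'Glycine'},
    '27343889': {'name': 'Quartz'},
}

# The first key is the first one inserted, not the smallest one
first_inserted = next(iter(molecules_demo))

# sorted() returns a new list; key=int compares the IDs as numbers
ordered = sorted(molecules_demo, key=int)

print(first_inserted)
print(ordered)
```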

In [11]:
ID=list(molecules.keys())[0]
ID

'27344305'

Let's define the URL for that specific ID

In [12]:
url='http://www.reciprocalnet.org/recipnet/showsampledetailed.jsp?sampleId='+ID
url

'http://www.reciprocalnet.org/recipnet/showsampledetailed.jsp?sampleId=27344305'
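
String concatenation is enough here because the ID contains only digits. As a hedged alternative, the standard module `urllib.parse` can build and take apart the query part of a URL, which is safer when values may contain spaces or special characters. The variable names below are chosen for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base = 'http://www.reciprocalnet.org/recipnet/showsampledetailed.jsp'

# urlencode turns a dictionary into a properly escaped query string
sample_url = base + '?' + urlencode({'sampleId': '27344305'})
print(sample_url)

# The reverse operation: split the URL and parse its query string
query = parse_qs(urlsplit(sample_url).query)
print(query)
```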

We will use the **urllib** module to read the webpage for that specific molecule. We will also look for the tags that allow us to download the chemical data in PDB file format.

In [13]:
#rf=urllib.urlopen(url)
rf=urllib.request.urlopen(url)
# decode is needed by Python 3
webpage=rf.read().decode('latin1')
for line in webpage.split('\n'):
    if 'pdb' in line:
        print(line)

          <a href="http://www.reciprocalnet.org/recipnet/data/common/50902/50902.pdb?ticket=1808737806">50902.pdb</a>


There is a link to download the PDB file associated with that molecule. Again, we could do the parsing by reading line by line and searching for the pattern that contains 'pdb', or by using [Regular Expressions](http://en.wikipedia.org/wiki/Regular_expression), but here we will use the simple HTMLParser. In this case the parser will also download the PDB file and store its contents in the attribute *pdb*.

In [14]:
class MyHTMLParser2(HTMLParser):
    
    def handle_starttag(self, tag, attrs):
        if tag=='a':
            for attr in attrs:
                if 'href'==attr[0] and 'pdb' in attr[1]:
                    print("Attributes :", attrs)
                    #self.pdb = urllib.urlopen(attr[1]).read()
                    self.pdb = urllib.request.urlopen(attr[1]).read()
                    

parser2 = MyHTMLParser2()

In [15]:
parser2.feed(webpage)

Attributes : [('href', 'http://www.reciprocalnet.org/recipnet/data/common/50902/50902.pdb?ticket=1808737806')]


In the last line of the parser above, we are actually downloading the PDB file and storing the contents in the variable parser2.pdb

In [16]:
print(parser2.pdb.decode('latin1'))

COMPNDMSC_CIF_
REMARK THIS FILE GENERATED BY PBDCRT
ATOM      1  H           1        .000    .000    .000
ATOM      2  H           1        .664    .000  -7.591
ATOM      3  H           1       1.328    .000 -15.182
ATOM      4  H           1       -.771  -7.915    .874
ATOM      5  H           1       -.107  -7.915  -6.717
ATOM      6  H           1        .557  -7.915 -14.308
ATOM      7  H           1      -1.542 -15.829   1.747
ATOM      8  H           1       -.878 -15.829  -5.844
ATOM      9  H           1       -.214 -15.829 -13.435
ATOM     10  H           1      -7.900    .000    .000
ATOM     11  H           1      -7.236    .000  -7.591
ATOM     12  H           1      -6.572    .000 -15.182
ATOM     13  H           1      -8.671  -7.915    .874
ATOM     14  H           1      -8.007  -7.915  -6.717
ATOM     15  H           1      -7.343  -7.915 -14.308
ATOM     16  H           1      -9.442 -15.829   1.747
ATOM     17  H           1      -8.778 -15.829  -5

This PDB file is relatively simple, but the file format was created to describe large proteins. The [Protein Data Bank](http://en.wikipedia.org/wiki/Protein_Data_Bank) format is the standard for structural data of large biological molecules, such as proteins and nucleic acids. However, there are a large number of file formats in chemoinformatics, so we rely on [OpenBabel](http://openbabel.org/wiki/Main_Page) to read and process the PDB file and extract the information that we really want to use, such as the number of atoms, their species and their coordinates.

Open Babel is a package written in C++, but it provides library bindings for Python. [Library bindings](http://en.wikipedia.org/wiki/Language_binding) act as glue between two programming languages, so that a library written for one language can be used in another.

If you do not have OpenBabel installed on your machine, the next instructions will fail, but we will work around that later on.
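
For a file as simple as the one printed above, the coordinates can also be read with plain Python. The sketch below is not a replacement for OpenBabel: it assumes the fixed-column layout of PDB `ATOM` records (atom name in columns 13-16, x, y and z in columns 31-38, 39-46 and 47-54), and the function name `read_atoms` is made up for illustration. Note that the atom name is not always the element symbol in real PDB files.

```python
# Three ATOM records copied from the PDB file printed above
pdb_text = """ATOM      1  H           1        .000    .000    .000
ATOM      2  H           1        .664    .000  -7.591
ATOM      3  H           1       1.328    .000 -15.182
"""

def read_atoms(text):
    """Return (names, coords) from the ATOM/HETATM records of a PDB string."""
    names = []
    coords = []
    for line in text.splitlines():
        if line.startswith(('ATOM', 'HETATM')):
            names.append(line[12:16].strip())      # atom name, columns 13-16
            coords.append([float(line[30:38]),     # x, columns 31-38
                           float(line[38:46]),     # y, columns 39-46
                           float(line[46:54])])    # z, columns 47-54
    return names, coords

names, coords = read_atoms(pdb_text)
print(names)
print(coords)
```

Slicing by column rather than splitting on whitespace matters because PDB fields may touch each other when the numbers are wide.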

In [17]:
# With Open Babel 3.x the import is written: from openbabel import openbabel
import openbabel

obConversion = openbabel.OBConversion()
obConversion.SetInAndOutFormats("pdb", "xyz")

True

Those commands are specific to OpenBabel; you do not need to understand them very deeply. For the purpose of this lecture, the important point to retain is that Python serves as glue between several packages.

In [18]:
mol = openbabel.OBMol()
obConversion.ReadString(mol,parser2.pdb.decode('latin1'))

True

Now mol is an OBMol object, and we can use the methods of that object to extract the data that we need: the number of atoms and their coordinates.

In [19]:
mol.NumAtoms()

267

We store all the positions as a list of lists

In [20]:
positions=[]
for obatom in openbabel.OBMolAtomIter(mol):
    # one [x, y, z] list per atom
    positions.append([obatom.GetX(), obatom.GetY(), obatom.GetZ()])
positions

[[0.0, 0.0, 0.0],
 [0.664, 0.0, -7.591],
 [1.328, 0.0, -15.182],
 [-0.771, -7.915, 0.874],
 [-0.107, -7.915, -6.717],
 [0.557, -7.915, -14.308],
 [-1.542, -15.829, 1.747],
 [-0.878, -15.829, -5.844],
 [-0.214, -15.829, -13.435],
 [-7.9, 0.0, 0.0],
 [-7.236, 0.0, -7.591],
 [-6.572, 0.0, -15.182],
 [-8.671, -7.915, 0.874],
 [-8.007, -7.915, -6.717],
 [-7.343, -7.915, -14.308],
 [-9.442, -15.829, 1.747],
 [-8.778, -15.829, -5.844],
 [-8.114, -15.829, -13.435],
 [-15.8, 0.0, 0.0],
 [-15.136, 0.0, -7.591],
 [-14.472, 0.0, -15.182],
 [-16.571, -7.915, 0.874],
 [-15.907, -7.915, -6.717],
 [-15.243, -7.915, -14.308],
 [-17.342, -15.829, 1.747],
 [-16.678, -15.829, -5.844],
 [-7.093, -7.707, 0.61],
 [-5.611, -4.139, -3.205],
 [-3.719, -5.634, -5.899],
 [-7.694, -5.194, -5.434],
 [-6.581, -1.287, -5.035],
 [-7.974, -7.025, -5.716],
 [-7.8, -6.287, -0.764],
 [-5.444, -6.158, -6.053],
 [-4.881, -0.654, -1.259],
 [-5.322, -1.09, -6.532],
 [-7.075, -5.362, -3.71],
 [-6.127, -3.057, -4.753],
 [-6.456

In [21]:
AtomNumbers=[ obatom.GetAtomicNum() for obatom in openbabel.OBMolAtomIter(mol) ] 
AtomNumbers

[1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 26,
 26,
 52,
 52,
 52,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 1,
 26,
 26,
 52,
 52,
 52,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 26,
 26,
 52,
 52,
 52,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 26,
 26,
 52,
 52,
 52,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 26,
 26,
 52,
 52,
 52,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 26,
 26,
 52,
 52,
 52,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 26,
 26,
 52,
 52,
 52,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 26,
 26,
 52,
 52,
 52,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 26,
 26,
 52,
 52,
 52,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 26,
 26,
 52,
 52,
 52,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 26,
 26,
 52,
 52,
 52,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 26,
 26,
 52,
 52,
 52,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 26,
 26,
 52,
 52,
 52,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 8,
 26,
 26,
 52,
 52,
 52,
 8,
 8,
 8,
 8,
 8,
 8
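
A long flat list of atomic numbers is hard to read. As a small sketch, `collections.Counter` from the Standard Library can summarize how many atoms of each species are present. The list below is a short made-up example in the same style as *AtomNumbers*, and the symbol table covers only the elements that appear in it.

```python
from collections import Counter

# A short made-up list of atomic numbers, in the same style as AtomNumbers
atom_numbers = [1, 1, 26, 26, 52, 52, 52, 8, 8, 8, 8]

# Symbols for the elements that appear in this example only
symbols = {1: 'H', 8: 'O', 26: 'Fe', 52: 'Te'}

# Counter maps each atomic number to the number of times it occurs
counts = Counter(atom_numbers)
for number, count in sorted(counts.items()):
    print(symbols[number], count)
```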

## Storing data in JSON file format

JSON (JavaScript Object Notation) is a lightweight data interchange format based on a subset of JavaScript syntax. JSON has several advantages compared with XML, and document databases such as MongoDB use JSON-like documents to store records, just as relational databases use tables and rows.
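
To make the format concrete, here is a small JSON document parsed with the `json` module. The `ID` and `name` come from the list of molecules collected earlier; the fields `tags`, `checked` and `comment` are made up to show how JSON arrays, booleans and `null` map to Python values.

```python
import json

# A small JSON document written as text
document = '''
{
  "ID": 27343880,
  "name": "Glycine",
  "tags": ["amino acid", "organic"],
  "checked": true,
  "comment": null
}
'''

# json.loads converts JSON text into Python objects:
# object -> dict, array -> list, true -> True, null -> None
record = json.loads(document)
print(record['name'], record['tags'], record['checked'], record['comment'])
```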

We will take the structural data (coordinates, species and number of atoms) and store it together with the name of the molecule and the ID that we obtained before. The process of translating data structures or object state into a format that can be stored is called serialization, and that is exactly what we are doing here.
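
Serialization to JSON is not always a perfect round trip. The sketch below, with a made-up entry, shows two details worth knowing before storing data: tuples come back as lists, and dictionary keys always come back as strings. The key `by_index` is hypothetical and is there only to show the second effect.

```python
import json

entry_demo = {
    'ID': 27344305,
    'pos': [(0.0, 0.0, 0.0), (0.664, 0.0, -7.591)],  # tuples
    'by_index': {0: 'H', 1: 'H'},                    # integer keys (hypothetical field)
}

text = json.dumps(entry_demo, sort_keys=True)  # Python objects -> JSON text
restored = json.loads(text)                    # JSON text -> Python objects

print(restored['pos'])       # the tuples are now lists
print(restored['by_index'])  # the integer keys are now strings
```

This is one reason to store the positions as a list of lists and to use string IDs as dictionary keys from the start.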

In [22]:
ID

'27344305'

In [23]:
molecules[ID]

{'name': 'Hydrated iron tellurite'}

In [24]:
entry={'ID': int(ID), 'name': molecules[ID]['name']}
entry

{'ID': 27344305, 'name': 'Hydrated iron tellurite'}

Now we add the structural data

In [25]:
entry['pos']=positions
entry['natom']=mol.NumAtoms()
entry['species']=AtomNumbers
entry

{'ID': 27344305,
 'name': 'Hydrated iron tellurite',
 'natom': 267,
 'pos': [[0.0, 0.0, 0.0],
  [0.664, 0.0, -7.591],
  [1.328, 0.0, -15.182],
  [-0.771, -7.915, 0.874],
  [-0.107, -7.915, -6.717],
  [0.557, -7.915, -14.308],
  [-1.542, -15.829, 1.747],
  [-0.878, -15.829, -5.844],
  [-0.214, -15.829, -13.435],
  [-7.9, 0.0, 0.0],
  [-7.236, 0.0, -7.591],
  [-6.572, 0.0, -15.182],
  [-8.671, -7.915, 0.874],
  [-8.007, -7.915, -6.717],
  [-7.343, -7.915, -14.308],
  [-9.442, -15.829, 1.747],
  [-8.778, -15.829, -5.844],
  [-8.114, -15.829, -13.435],
  [-15.8, 0.0, 0.0],
  [-15.136, 0.0, -7.591],
  [-14.472, 0.0, -15.182],
  [-16.571, -7.915, 0.874],
  [-15.907, -7.915, -6.717],
  [-15.243, -7.915, -14.308],
  [-17.342, -15.829, 1.747],
  [-16.678, -15.829, -5.844],
  [-7.093, -7.707, 0.61],
  [-5.611, -4.139, -3.205],
  [-3.719, -5.634, -5.899],
  [-7.694, -5.194, -5.434],
  [-6.581, -1.287, -5.035],
  [-7.974, -7.025, -5.716],
  [-7.8, -6.287, -0.764],
  [-5.444, -6.158, -6.053],
  [-4

Now we can store this data in JSON format using the Python [json](https://docs.python.org/2.7/library/json.html) module from the Standard Library

In [26]:
import json

In [27]:
all_entries=[entry]
all_entries

[{'ID': 27344305,
  'name': 'Hydrated iron tellurite',
  'natom': 267,
  'pos': [[0.0, 0.0, 0.0],
   [0.664, 0.0, -7.591],
   [1.328, 0.0, -15.182],
   [-0.771, -7.915, 0.874],
   [-0.107, -7.915, -6.717],
   [0.557, -7.915, -14.308],
   [-1.542, -15.829, 1.747],
   [-0.878, -15.829, -5.844],
   [-0.214, -15.829, -13.435],
   [-7.9, 0.0, 0.0],
   [-7.236, 0.0, -7.591],
   [-6.572, 0.0, -15.182],
   [-8.671, -7.915, 0.874],
   [-8.007, -7.915, -6.717],
   [-7.343, -7.915, -14.308],
   [-9.442, -15.829, 1.747],
   [-8.778, -15.829, -5.844],
   [-8.114, -15.829, -13.435],
   [-15.8, 0.0, 0.0],
   [-15.136, 0.0, -7.591],
   [-14.472, 0.0, -15.182],
   [-16.571, -7.915, 0.874],
   [-15.907, -7.915, -6.717],
   [-15.243, -7.915, -14.308],
   [-17.342, -15.829, 1.747],
   [-16.678, -15.829, -5.844],
   [-7.093, -7.707, 0.61],
   [-5.611, -4.139, -3.205],
   [-3.719, -5.634, -5.899],
   [-7.694, -5.194, -5.434],
   [-6.581, -1.287, -5.035],
   [-7.974, -7.025, -5.716],
   [-7.8, -6.287, -0.764

In [28]:
# The with statement closes the file even if json.dump raises an error
with open('molecules.json','w') as wf:
    json.dump(all_entries, wf, sort_keys=True, indent=None, separators=(',', ': '))

Now we can see the contents of the file using the shell command `cat`, which IPython runs for us when the line starts with `!`.
Remember, this is not a Python command; it is one of the features of the IPython interface.

In [29]:
!cat molecules.json

[{"ID": 27344305,"name": "Hydrated iron tellurite","natom": 267,"pos": [[0.0,0.0,0.0],[0.664,0.0,-7.591],[1.328,0.0,-15.182],[-0.771,-7.915,0.874],[-0.107,-7.915,-6.717],[0.557,-7.915,-14.308],[-1.542,-15.829,1.747],[-0.878,-15.829,-5.844],[-0.214,-15.829,-13.435],[-7.9,0.0,0.0],[-7.236,0.0,-7.591],[-6.572,0.0,-15.182],[-8.671,-7.915,0.874],[-8.007,-7.915,-6.717],[-7.343,-7.915,-14.308],[-9.442,-15.829,1.747],[-8.778,-15.829,-5.844],[-8.114,-15.829,-13.435],[-15.8,0.0,0.0],[-15.136,0.0,-7.591],[-14.472,0.0,-15.182],[-16.571,-7.915,0.874],[-15.907,-7.915,-6.717],[-15.243,-7.915,-14.308],[-17.342,-15.829,1.747],[-16.678,-15.829,-5.844],[-7.093,-7.707,0.61],[-5.611,-4.139,-3.205],[-3.719,-5.634,-5.899],[-7.694,-5.194,-5.434],[-6.581,-1.287,-5.035],[-7.974,-7.025,-5.716],[-7.8,-6.287,-0.764],[-5.444,-6.158,-6.053],[-4.881,-0.654,-1.259],[-5.322,-1.09,-6.532],[-7.075,-5.362,-3.71],[-6.127,-3.057,-4.753],[-6.456,-2.947,-1.874],[-4.054,-5.0,-4.182],[-5.192,-5.422,-1.565],[-16.014,-15.829,-13.

## Collecting the entire database

Now imagine that you need the entire database. This is something that you should not execute; I will just show you the code that I used.
For the next lecture we will start from the data collected using this code.

In [31]:
import time
all_entries=[]
# keys() is not subscriptable in Python 3, so build the list first and then slice it
for index, molId in enumerate(list(molecules.keys())[:100]):
    print(index,molId)
    url='http://www.reciprocalnet.org/recipnet/showsampledetailed.jsp?sampleId='+molId
    rf=urllib.request.urlopen(url)
    webpage=rf.read().decode('latin1')
    parser2.reset()
    # Forget the previous PDB so that a page without a PDB link is skipped
    parser2.pdb=None
    try:
        parser2.feed(webpage)
        if parser2.pdb is not None:
            pdb_string = parser2.pdb.decode('latin1')
            ID = int(molId)
            name = molecules[molId]['name']
            mol = openbabel.OBMol()
            obConversion.ReadString(mol,pdb_string)
            natom=mol.NumAtoms()
            positions = [[obatom.GetX(), obatom.GetY(), obatom.GetZ()]
                         for obatom in openbabel.OBMolAtomIter(mol)]
            AtomNumbers = [ obatom.GetAtomicNum() for obatom in openbabel.OBMolAtomIter(mol)]
            entry={'ID': ID, 'name': name, 'pos': positions, 'natom': natom, 'species': AtomNumbers}
            all_entries.append(entry)
            # Pause between downloads to avoid overloading the server
            time.sleep(1)
    except UnicodeDecodeError:
        print("Error parsing this URL", url)

0 27344305
Attributes : [('href', 'http://www.reciprocalnet.org/recipnet/data/common/50902/50902.pdb?ticket=2103888890')]
1 27344636
Attributes : [('href', 'http://www.reciprocalnet.org/recipnet/data/common/50153/50153.pdb?ticket=389727522')]
2 27344085
Attributes : [('href', 'http://www.reciprocalnet.org/recipnet/data/common/50108/50108.pdb?ticket=1758299763')]
3 27344820
Attributes : [('href', 'http://www.reciprocalnet.org/recipnet/data/common/50855/50855.pdb?ticket=1746378401')]
4 27344257
Attributes : [('href', 'http://www.reciprocalnet.org/recipnet/data/common/50967/50967.pdb?ticket=1417348034')]
5 27344285
Attributes : [('href', 'http://www.reciprocalnet.org/recipnet/data/common/50685/50685.pdb?ticket=95465847')]
6 27344166
7 27344029
Attributes : [('href', 'http://www.reciprocalnet.org/recipnet/data/common/50684/50684.pdb?ticket=849375880')]
8 27344463
Attributes : [('href', 'http://www.reciprocalnet.org/recipnet/data/common/50724/50724.pdb?ticket=2073239893')]
9 27344503
Attrib

61 27343876
Attributes : [('href', 'http://www.reciprocalnet.org/recipnet/data/common/50687/LALNIN12.pdb?ticket=842303148')]
62 27344073
Attributes : [('href', 'http://www.reciprocalnet.org/recipnet/data/common/50954/50954.pdb?ticket=631814420')]
63 27344806
Attributes : [('href', 'http://www.reciprocalnet.org/recipnet/data/common/52035/52035.pdb?ticket=1481711783')]
64 27344296
Attributes : [('href', 'http://www.reciprocalnet.org/recipnet/data/common/50806/50806.pdb?ticket=1437002171')]
65 27343977
Attributes : [('href', 'http://www.reciprocalnet.org/recipnet/data/common/50638/50638.pdb?ticket=1921527379')]
66 27344056
Attributes : [('href', 'http://www.reciprocalnet.org/recipnet/data/common/50768/50768.pdb?ticket=1415992399')]
67 27343987
Attributes : [('href', 'http://www.reciprocalnet.org/recipnet/data/common/50729/50729.pdb?ticket=1614291939')]
68 27344689
69 27344364
Attributes : [('href', 'http://www.reciprocalnet.org/recipnet/data/common/50650/50650.pdb?ticket=1853107176')]
70 

HTTPError: HTTP Error 500: Internal Server Error
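
The loop stops at the first server error because `urlopen` raises an exception that nothing catches. One possible way to make the download more tolerant is sketched below. The function `fetch` and its parameters `opener`, `retries` and `delay` are made up for illustration and are not part of the code that produced the data; the `opener` argument exists only so that the function can be tried without network access.

```python
import time
import urllib.error
import urllib.request

def fetch(url, opener=urllib.request.urlopen, retries=2, delay=1.0):
    """Return the bytes found at url, or None if every attempt fails."""
    for attempt in range(retries + 1):
        try:
            with opener(url) as rf:
                return rf.read()
        except urllib.error.URLError as err:  # HTTPError is a subclass of URLError
            print("Attempt", attempt + 1, "failed for", url, ":", err)
            if attempt < retries:
                time.sleep(delay)  # wait a little before trying again
    return None

# Possible use inside the loop (not executed here):
#     raw = fetch(url)
#     if raw is None:
#         continue
#     webpage = raw.decode('latin1')
```

Returning `None` instead of raising lets the loop skip a broken page and continue with the next molecule.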

In [32]:
len(all_entries)

115

In [34]:
all_entries

[{'ID': 27344305,
  'name': 'Hydrated iron tellurite',
  'natom': 267,
  'pos': [[0.0, 0.0, 0.0],
   [0.664, 0.0, -7.591],
   [1.328, 0.0, -15.182],
   [-0.771, -7.915, 0.874],
   [-0.107, -7.915, -6.717],
   [0.557, -7.915, -14.308],
   [-1.542, -15.829, 1.747],
   [-0.878, -15.829, -5.844],
   [-0.214, -15.829, -13.435],
   [-7.9, 0.0, 0.0],
   [-7.236, 0.0, -7.591],
   [-6.572, 0.0, -15.182],
   [-8.671, -7.915, 0.874],
   [-8.007, -7.915, -6.717],
   [-7.343, -7.915, -14.308],
   [-9.442, -15.829, 1.747],
   [-8.778, -15.829, -5.844],
   [-8.114, -15.829, -13.435],
   [-15.8, 0.0, 0.0],
   [-15.136, 0.0, -7.591],
   [-14.472, 0.0, -15.182],
   [-16.571, -7.915, 0.874],
   [-15.907, -7.915, -6.717],
   [-15.243, -7.915, -14.308],
   [-17.342, -15.829, 1.747],
   [-16.678, -15.829, -5.844],
   [-7.093, -7.707, 0.61],
   [-5.611, -4.139, -3.205],
   [-3.719, -5.634, -5.899],
   [-7.694, -5.194, -5.434],
   [-6.581, -1.287, -5.035],
   [-7.974, -7.025, -5.716],
   [-7.8, -6.287, -0.764

In [38]:
import bz2
# BZ2File works as a context manager, so the file is closed automatically
with bz2.BZ2File('molecules.json.bz2','w') as wf:
    wf.write(json.dumps(all_entries, sort_keys=True, indent=None, separators=(',', ': ')).encode('utf8'))

In [40]:
with bz2.BZ2File('molecules.json.bz2') as rf:
    data_str=rf.read().decode('utf8')
print(data_str)

[{"ID": 27344305,"name": "Hydrated iron tellurite","natom": 267,"pos": [[0.0,0.0,0.0],[0.664,0.0,-7.591],[1.328,0.0,-15.182],[-0.771,-7.915,0.874],[-0.107,-7.915,-6.717],[0.557,-7.915,-14.308],[-1.542,-15.829,1.747],[-0.878,-15.829,-5.844],[-0.214,-15.829,-13.435],[-7.9,0.0,0.0],[-7.236,0.0,-7.591],[-6.572,0.0,-15.182],[-8.671,-7.915,0.874],[-8.007,-7.915,-6.717],[-7.343,-7.915,-14.308],[-9.442,-15.829,1.747],[-8.778,-15.829,-5.844],[-8.114,-15.829,-13.435],[-15.8,0.0,0.0],[-15.136,0.0,-7.591],[-14.472,0.0,-15.182],[-16.571,-7.915,0.874],[-15.907,-7.915,-6.717],[-15.243,-7.915,-14.308],[-17.342,-15.829,1.747],[-16.678,-15.829,-5.844],[-7.093,-7.707,0.61],[-5.611,-4.139,-3.205],[-3.719,-5.634,-5.899],[-7.694,-5.194,-5.434],[-6.581,-1.287,-5.035],[-7.974,-7.025,-5.716],[-7.8,-6.287,-0.764],[-5.444,-6.158,-6.053],[-4.881,-0.654,-1.259],[-5.322,-1.09,-6.532],[-7.075,-5.362,-3.71],[-6.127,-3.057,-4.753],[-6.456,-2.947,-1.874],[-4.054,-5.0,-4.182],[-5.192,-5.422,-1.565],[-16.014,-15.829,-13.
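
JSON text with many repeated keys and numbers usually compresses well. The following sketch checks this in memory with `bz2.compress` and `bz2.decompress`, using made-up and deliberately repetitive data in place of the real list of molecules, so the exact sizes are not representative.

```python
import bz2
import json

# Made-up, highly repetitive data standing in for the list of molecules
entries = [{'ID': i, 'pos': [[0.0, 0.0, 0.0]] * 20} for i in range(200)]

# bz2 works on bytes, so the JSON text is encoded first
raw = json.dumps(entries, sort_keys=True).encode('utf8')
packed = bz2.compress(raw)
print(len(raw), "bytes before,", len(packed), "bytes after compression")

# Decompressing and decoding gives back the original data
restored = json.loads(bz2.decompress(packed).decode('utf8'))
print(restored == entries)
```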

For the next tutorial, we will download this file and extract some basic information about the molecules in the JSON file.

In [41]:
mols=json.loads(data_str)

In [42]:
mols

[{'ID': 27344305,
  'name': 'Hydrated iron tellurite',
  'natom': 267,
  'pos': [[0.0, 0.0, 0.0],
   [0.664, 0.0, -7.591],
   [1.328, 0.0, -15.182],
   [-0.771, -7.915, 0.874],
   [-0.107, -7.915, -6.717],
   [0.557, -7.915, -14.308],
   [-1.542, -15.829, 1.747],
   [-0.878, -15.829, -5.844],
   [-0.214, -15.829, -13.435],
   [-7.9, 0.0, 0.0],
   [-7.236, 0.0, -7.591],
   [-6.572, 0.0, -15.182],
   [-8.671, -7.915, 0.874],
   [-8.007, -7.915, -6.717],
   [-7.343, -7.915, -14.308],
   [-9.442, -15.829, 1.747],
   [-8.778, -15.829, -5.844],
   [-8.114, -15.829, -13.435],
   [-15.8, 0.0, 0.0],
   [-15.136, 0.0, -7.591],
   [-14.472, 0.0, -15.182],
   [-16.571, -7.915, 0.874],
   [-15.907, -7.915, -6.717],
   [-15.243, -7.915, -14.308],
   [-17.342, -15.829, 1.747],
   [-16.678, -15.829, -5.844],
   [-7.093, -7.707, 0.61],
   [-5.611, -4.139, -3.205],
   [-3.719, -5.634, -5.899],
   [-7.694, -5.194, -5.434],
   [-6.581, -1.287, -5.035],
   [-7.974, -7.025, -5.716],
   [-7.8, -6.287, -0.764