updated README, added Mongo stuff.
itdaniher committed Jun 4, 2012
1 parent e3871f3 commit 3803494
Showing 4 changed files with 28 additions and 9 deletions.
8 changes: 8 additions & 0 deletions JSON_Mongo_Builder.py
@@ -0,0 +1,8 @@
#!/usr/bin/python
import json, gzip, pymongo

chebi = json.loads(gzip.open("ChEBI_complete.json.gz").read())

connection = pymongo.Connection("mongodb://localhost")
db = connection.chemicals
[db.chebi.insert(obj) for obj in chebi]
15 changes: 6 additions & 9 deletions README.mkd
@@ -4,19 +4,16 @@

"Chemical Entities of Biological Interest (ChEBI) is a freely available dictionary of molecular entities focused on ‘small’ chemical compounds."

PyChEBI is a bit of Python glue to massage the *terrible* SDF file format into a Pythonic datastructure.
PyChEBI is a Python script to convert the quasi-obsolete SDF file format into a sane (Pythonic) datastructure.

For lulz, it also saves a gzipped, JSON serialized version of the database to the local machine.
The script ChEBI_JSON_Builder.py will get the gzip'ed SDF file from the EBI FTP site, parse it into a list of Python dictionaries, and serialize that list as a compressed JSON file.
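
The serialization step described above can be sketched with the standard library alone. This is a minimal illustration, not the contents of ChEBI_JSON_Builder.py: the helper names `save_json_gz` and `load_json_gz` are hypothetical, and the download and SDF parsing steps are left out.

```python
import gzip
import json


def save_json_gz(records, path):
    """Serialize a list of dictionaries as gzip-compressed JSON."""
    with gzip.open(path, "wb") as handle:
        handle.write(json.dumps(records).encode("utf-8"))


def load_json_gz(path):
    """Read a gzip-compressed JSON file back into Python objects."""
    with gzip.open(path, "rb") as handle:
        return json.loads(handle.read().decode("utf-8"))
```

Reading and writing in binary mode with an explicit UTF-8 encode/decode keeps the sketch independent of the platform's default text encoding.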

-----

SDF is a somewhat terrible format - it dates back to the 1970s and is very domain-specific to physical and computational chemistry.
The script JSON_Mongo_Builder.py will open the compressed JSON file and save it to a [MongoDB](http://www.mongodb.org/) collection. It assumes that the Mongo daemon is running on the localhost, but can be easily adapted to work with a third party service like [MongoLab](https://mongolab.com/).

The ramifications of this is that there's a ton of useful information tied up in this archaic format, and, unless you're willing to break out [OpenBabel](http://openbabel.org) and figure out the right configuration settings, you don't get to play with the shiny info.
To add perspective, I've included two demo scripts, 'demoJSON.py' and 'demoMongo.py' which execute similar queries on the ChEBI data - the unoptimized Python search takes a few dozen times as long as the MongoDB request.
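
For a rough idea of what the plain-Python side of that comparison involves, here is a sketch of a linear scan that mimics a regular-expression lookup on the 'Synonyms' key. It is not the code in demoJSON.py: the function name `find_one_by_synonym` is hypothetical, it is written for Python 3, and it assumes each record stores 'Synonyms' either as a single string or as a list of strings, since the exact layout of the dump is not shown here.

```python
import re


def find_one_by_synonym(records, pattern):
    """Return the first record with a synonym matching `pattern`, else None."""
    regex = re.compile(pattern)
    for record in records:
        synonyms = record.get("Synonyms", [])
        # Assumption: a lone synonym may be stored as a bare string.
        if isinstance(synonyms, str):
            synonyms = [synonyms]
        if any(regex.search(synonym) for synonym in synonyms):
            return record
    return None
```

Every call walks the whole list until it finds a match, which is why an unindexed scan like this gets slower as the number of entries grows.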

-----

PyChEBI uses only Python 2.7 builtin modules. To generate a serialized JSON database, simply execute "PyChEBI.py" and wait. And then wait awhile longer.

Alternatively, you can make use of the included file. It'll probably save you some mental anguish.
SDF is a somewhat terrible format - it's a pseudo-hierarchical key-value mapping with objects separated by the "$$$$" string. Originally designed to distribute [Molfile](http://en.wikipedia.org/wiki/Molfile) connection table information, EBI made use of associated data functionality to distribute a large amount of incredibly useful molecular metadata in addition to the standard table.
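
To make that layout concrete, here is a minimal sketch of an SDF reader. It is not the parser from this repository: the name `parse_sdf` is hypothetical, it assumes every data item starts with a header line of the form `> <Field Name>`, and it stores the connection table under a 'Molfile' key only to mirror the key used in the demo scripts.

```python
import re

# A data header looks like "> <ChEBI ID>"; capture the name in angle brackets.
DATA_HEADER = re.compile(r"^>.*?<([^>]+)>")


def parse_sdf(text):
    """Split SDF text into a list of dictionaries, one per "$$$$" record."""
    records = []
    record, molfile_lines, field, in_molfile = {}, [], None, True
    for line in text.splitlines():
        header = DATA_HEADER.match(line)
        if line.strip() == "$$$$":
            # End of record: keep the connection table as one string.
            record["Molfile"] = "\n".join(molfile_lines)
            records.append(record)
            record, molfile_lines, field, in_molfile = {}, [], None, True
        elif header:
            # The first data header also marks the end of the Molfile block.
            in_molfile = False
            field = header.group(1)
            record[field] = []
        elif in_molfile:
            molfile_lines.append(line)
        elif field is not None and line.strip():
            # Non-blank lines after a header are values for that field.
            record[field].append(line)
    return records
```

Each field maps to a list of lines because a single data item can span several lines, as a list of synonyms would.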

The only parser I could find for the SDF format was part of the overcomplicated [OpenBabel](http://openbabel.org) project. I wanted to play with the information contained in the ChEBI database, but didn't want to deal with an absurdly complex program to get at it. An hour or four and a bit of Python later and I had a beautiful, albeit large, 22k element list of dictionaries.

egonw Jun 9, 2012

Contributor

There are several other tools with SD file parsers, such as the Chemistry Development Kit... via Cinfony this can be used in Python...

itdaniher Jun 9, 2012

Author Owner

Cinfony looks awesome, thanks for the protip!

I've been talking to @kylelutz about his ChemKit project - it looks like with a few tweaks, it'll let me import connection table info from the Molfiles I've stored as strings in PyChEBI and spit them out as sane ChemicalJSON objects.

9 changes: 9 additions & 0 deletions demos/demoJSON.py
@@ -0,0 +1,9 @@
import json, gzip, pymongo

chebi = json.loads(gzip.open("ChEBI_complete.json.gz").read())
chebi = chebi[0:-1]

for obj in chebi:
    if 'Synonyms' in obj.keys():
        if 'FOOF' in obj['Synonyms']:
            obj['Molfile']
5 changes: 5 additions & 0 deletions demos/demoMongo.py
@@ -0,0 +1,5 @@
from re import compile as rec
import pymongo
db = pymongo.Connection("localhost")
chebi = db.chemicals.chebi
chebi.find_one({'Synonyms':rec('FOOF')})['Molfile']
