updated README, added Mongo stuff.

itdaniher · Jun 4, 2012 · 3803494 · egonw · Jun 9, 2012 · itdaniher
1 parent e3871f3
commit 3803494

Showing 4 changed files with 28 additions and 9 deletions.
diff --git a/JSON_Mongo_Builder.py b/JSON_Mongo_Builder.py
@@ -0,0 +1,8 @@
+#!/usr/bin/python
+import json, gzip, pymongo
+
+chebi = json.loads(gzip.open("ChEBI_complete.json.gz").read())
+
+connection = pymongo.Connection("mongodb://localhost")
+db = connection.chemicals
+[db.chebi.insert(obj) for obj in chebi]
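
Note for readers on a current PyMongo: `pymongo.Connection` was removed in PyMongo 3.0 in favour of `MongoClient`, and `Collection.insert` was removed in 4.0. Below is a minimal sketch of the same load using the newer calls. The function names `load_records`, `insert_records`, and `main` are illustrative and not part of this commit; the database and collection names are carried over from the script above.

```python
import gzip
import json


def load_records(path):
    """Read the gzipped JSON file and return the decoded list of dicts."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


def insert_records(collection, records):
    """Bulk-insert the records and return how many documents were written."""
    result = collection.insert_many(records)
    return len(result.inserted_ids)


def main(path="ChEBI_complete.json.gz", uri="mongodb://localhost"):
    # Imported here so the helpers above stay usable without PyMongo installed.
    from pymongo import MongoClient

    client = MongoClient(uri)
    return insert_records(client.chemicals.chebi, load_records(path))
```

Calling `main()` assumes a Mongo daemon on localhost, as the original script does. `insert_many` sends the documents in batches, so it avoids one round trip per record.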
diff --git a/README.mkd b/README.mkd
@@ -4,19 +4,16 @@
 
 "Chemical Entities of Biological Interest (ChEBI) is a freely available dictionary of molecular entities focused on ‘small’ chemical compounds." 
 
-PyChEBI is a bit of Python glue to massage the *terrible* SDF file format into a Pythonic datastructure. 
+PyChEBI is a Python script to convert the quasi-obsolete SDF file format into a sane (Pythonic) data structure.
 
-For lulz, it also saves a gzipped, JSON serialized version of the database to the local machine.
+The script ChEBI_JSON_Builder.py will get the gzip'ed SDF file from the EBI FTP site, parse it into a list of Python dictionaries, and serialize that list as a compressed JSON file.
 
------
-
-SDF is a somewhat terrible format - it dates back to the 1970s and is very domain-specific to physical and computational chemistry. 
+The script JSON_Mongo_Builder.py will open the compressed JSON file and save it to a [MongoDB](http://www.mongodb.org/) collection. It assumes that the Mongo daemon is running on localhost, but can easily be adapted to work with a third-party service like [MongoLab](https://mongolab.com/).
 
-The ramifications of this is that there's a ton of useful information tied up in this archaic format, and, unless you're willing to break out [OpenBabel](http://openbabel.org) and figure out the right configuration settings, you don't get to play with the shiny info.
+To add perspective, I've included two demo scripts, 'demoJSON.py' and 'demoMongo.py', which execute similar queries on the ChEBI data - the unoptimized Python search takes a few dozen times as long as the MongoDB request.
 
 -----
 
-PyChEBI uses only Python 2.7 builtin modules. To generate a serialized JSON database, simply execute "PyChEBI.py" and wait. And then wait awhile longer.
-
-Alternatively, you can make use of the included file. It'll probably save you some mental anguish.
+SDF is a somewhat terrible format - it's a pseudo-hierarchical key-value mapping with objects separated by the "$$$$" string. It was originally designed to distribute [Molfile](http://en.wikipedia.org/wiki/Molfile) connection table information; EBI made use of its associated data functionality to distribute a large amount of incredibly useful molecular metadata in addition to the standard table.
 
+The only parser I could find for the SDF format was part of the overcomplicated [OpenBabel](http://openbabel.org) project. I wanted to play with the information contained in the ChEBI database, but didn't want to deal with an absurdly complex program to get at it. An hour or four and a bit of Python later, I had a beautiful, albeit large, 22k-element list of dictionaries.
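
To make the README's description of SDF concrete, here is a minimal sketch of a parser for that layout. It is not the code in ChEBI_JSON_Builder.py, which this commit does not show. It assumes each record is a Molfile block ending in `M  END`, followed by `> <Key>` data items that each end at a blank line, with `$$$$` closing the record. The function name `parse_sdf`, the choice to store values as lists of lines, and the sample record are all illustrative; the `Molfile` key mirrors the one the demo scripts read.

```python
def parse_sdf(text):
    """Parse SDF text into a list of dicts, one per "$$$$"-terminated record."""
    records = []
    record, molfile_lines = {}, []
    key = None          # name of the data item currently being read
    in_molfile = True   # every record starts with its Molfile block

    for line in text.splitlines():
        if line.strip() == "$$$$":
            # End of record: store the Molfile text and start a fresh record.
            record["Molfile"] = "\n".join(molfile_lines)
            records.append(record)
            record, molfile_lines, key, in_molfile = {}, [], None, True
        elif in_molfile:
            molfile_lines.append(line)
            if line.startswith("M  END"):
                in_molfile = False
        elif line.startswith(">"):
            # Data header such as "> <Synonyms>": the key sits between < and >.
            start, end = line.find("<"), line.rfind(">")
            key = line[start + 1:end] if 0 <= start < end else None
            if key is not None:
                record[key] = []
        elif key is not None and line.strip():
            record[key].append(line)
        else:
            key = None  # a blank line closes the current data item
    # Any trailing text without a closing "$$$$" is ignored.
    return records


# Hypothetical record for illustration only; the ID is a placeholder.
SAMPLE = "\n".join([
    "example record", "  hypothetical", "",
    "  0  0  0  0  0  0  0  0  0  0999 V2000",
    "M  END",
    "> <ChEBI ID>", "CHEBI:0000", "",
    "> <Synonyms>", "FOOF", "F2O2", "",
    "$$$$",
])

if __name__ == "__main__":
    print(parse_sdf(SAMPLE)[0]["Synonyms"])
```

Keeping every value as a list of lines means multi-valued fields such as synonyms need no special casing, at the cost of single-valued fields being one-element lists.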
diff --git a/demos/demoJSON.py b/demos/demoJSON.py
@@ -0,0 +1,9 @@
+import json, gzip
+
+chebi = json.loads(gzip.open("ChEBI_complete.json.gz").read())
+chebi = chebi[0:-1]
+
+for obj in chebi:
+	if 'Synonyms' in obj.keys():
+		if 'FOOF' in obj['Synonyms']:
+			obj['Molfile']
diff --git a/demos/demoMongo.py b/demos/demoMongo.py
@@ -0,0 +1,5 @@
+from re import compile as rec
+import pymongo
+db = pymongo.Connection("localhost")
+chebi = db.chemicals.chebi
+chebi.find_one({'Synonyms':rec('FOOF')})['Molfile']
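
The README calls the two demo queries "similar" rather than identical, and the difference matters when comparing timings. The Mongo query uses a regular expression, which matches any synonym containing `FOOF`. The Python `in` test is an exact-element match if `Synonyms` is a list, or a substring match if it is a string. The sketch below is a pure-Python search that mirrors the regex semantics. It assumes `Synonyms` holds either a string or a list of strings; the function name `find_one_by_synonym` is illustrative and not part of this commit.

```python
import re


def find_one_by_synonym(records, pattern):
    """Return the first record with a synonym matching `pattern`, else None."""
    rx = re.compile(pattern)
    for obj in records:
        synonyms = obj.get("Synonyms", [])
        if isinstance(synonyms, str):
            synonyms = [synonyms]  # treat a lone string as a one-item list
        if any(rx.search(s) for s in synonyms):
            return obj
    return None
```

Stopping at the first hit matches what `find_one` does; the loop in demoJSON.py keeps scanning the whole list after a match, which accounts for part of the timing gap.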