"Chemical Entities of Biological Interest (ChEBI) is a freely available dictionary of molecular entities focused on ‘small’ chemical compounds."
PyChEBI is a Python script to convert the quasi-obsolete SDF file format into a sane (Pythonic) datastructure.
The script ChEBI_JSON_Builder.py will get the gzip'ed SDF file from the EBI FTP site, parse it into a list of Python dictionaries, and serialize that list as a compressed JSON file.
The script JSON_Mongo_Builder.py will open the compressed JSON file and save it to a MongoDB collection. It assumes that the Mongo daemon is running on the localhost, but can be easily adapted to work with a third party service like MongoLab.
To add perspective, I've included two demo scripts, 'demoJSON.py' and 'demoMongo.py' which execute similar queries on the ChEBI data - the unoptimized Python search takes a few dozen times as long as the MongoDB request.
SDF is a somewhat terrible format - it's a pseudo-heirarchical key-value mapping with objects separated by a the "$$$$" string. Originally designed to distribute Molfile connection table information, EBI made use of associated data functionality to distribute a large amount of incredibly useful molecular metadata in addition to the standard table.
The only parser I could find for the SDF format was part of the overcomplicated OpenBabel project. I wanted to play with the information contained in the ChEBI database, but didn't want to deal with an absurdly complex program to get at it. An hour or four and a bit of Python later and I had a beautiful, albiet large, 22k element list of dictionaries.