These are the tools that I use for the UCSC Genocoding project, see

Most tool names start with the prefix "pub", followed by the category and then
the specific publisher. The tool categories are:

- pubGetX = download files from publisher X (medline, pmc, elsevier)
- pubConvX = convert downloaded files to a pub format (tab-separated table
             with fields defined in lib/
- pubLoadX = load pub format data into a database system (mysql or sqlite)
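The naming scheme above can be sketched in a few lines of Python. This is only
an illustration of the convention, not code from this repository: the function
name parse_tool_name is made up here, and the example tool names are simply
built from the publishers listed above.

```python
import re

# Categories taken from the list above; the publisher part is free-form.
CATEGORIES = ("Get", "Conv", "Load")

# "pub" + category + publisher, e.g. pubGetMedline
TOOL_NAME = re.compile(r"^pub(%s)(\w+)$" % "|".join(CATEGORIES))

def parse_tool_name(name):
    """Split a tool name into (category, publisher); return None if the
    name does not follow the pub<Category><Publisher> pattern."""
    match = TOOL_NAME.match(name)
    if match is None:
        return None
    category, publisher = match.groups()
    return category, publisher.lower()

if __name__ == "__main__":
    for tool in ("pubGetMedline", "pubConvElsevier", "pubCrawl"):
        print(tool, parse_tool_name(tool))
```

Names without a category and publisher part, such as pubCrawl, return None,
which matches the "more general tools" described next.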

More general tools are:

- pubPrepX = prepare directory structures. These are used to download
        taxon names, import gene models from websites like NCBI or
- pubRunAnnot = run an annotator from the scripts directory on text data in
             pub format
- pubRunMapReduce = run a map/reduce style job from "scripts" onto fulltext.
- pubCrawl = crawl papers from various publishers, needs a directory set up
             with pubPrepCrawlDir and the journalList directory
- pubLoad = load pub format files into mysql db
- pubMap = complex multi-stage pipeline to find and map markers found in text
           (sequences, SNPs, bands, genes, etc.) to genomic locations
           and create/load BED files into the UCSC browser
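As a rough idea of what loading pub format data into a database involves, here
is a minimal sketch that reads a tab-separated table with a header line into
sqlite. It is not the code of pubLoad: the column names (articleId, title,
year) and the function load_pub_table are assumptions for illustration; the
real fields are the ones defined in lib/.

```python
import csv
import io
import sqlite3

# Hypothetical column names; the real ones are defined in lib/.
FIELDS = ["articleId", "title", "year"]

def load_pub_table(conn, table, tsv_file, fields=FIELDS):
    """Create `table` if needed and insert every row of a tab-separated
    file that starts with a header line. Returns the number of rows."""
    columns = ", ".join('"%s" TEXT' % f for f in fields)
    conn.execute('CREATE TABLE IF NOT EXISTS "%s" (%s)' % (table, columns))
    reader = csv.DictReader(tsv_file, delimiter="\t")
    rows = [[row[f] for f in fields] for row in reader]
    placeholders = ", ".join("?" for _ in fields)
    conn.executemany(
        'INSERT INTO "%s" VALUES (%s)' % (table, placeholders), rows
    )
    conn.commit()
    return len(rows)

if __name__ == "__main__":
    sample = io.StringIO(
        "articleId\ttitle\tyear\n"
        "1\tA first paper\t2010\n"
        "2\tA second paper\t2011\n"
    )
    conn = sqlite3.connect(":memory:")
    print(load_pub_table(conn, "articles", sample))
```

All columns are stored as TEXT here to keep the sketch short; a real loader
would likely pick column types per field.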

If you plan to use any of these, make sure to go over lib/ first.
Most commands need some settings in the config file adapted to your particular
server / cluster system. E.g. pubCrawl needs your email address, and the
pubConvX tools need the cluster system and various input/output directories.
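A quick way to catch a missing setting before starting a long job is a small
check like the one below. This is a sketch only: the setting names (email,
clusterType, inDir, outDir) and the function missing_settings are hypothetical
and do not come from the actual config file.

```python
# Hypothetical setting names, for illustration only; check the config
# file in lib/ for the real ones.
REQUIRED = {
    "pubCrawl": ["email"],
    "pubConv": ["clusterType", "inDir", "outDir"],
}

def missing_settings(tool, config):
    """Return the settings that `tool` needs but that are absent or
    empty in `config` (a plain dict of setting name to value)."""
    return [key for key in REQUIRED.get(tool, []) if not config.get(key)]

if __name__ == "__main__":
    config = {"email": "", "clusterType": "sge", "inDir": "/data/in"}
    for tool in REQUIRED:
        print(tool, "is missing:", missing_settings(tool, config))
```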

Maximilian Haeussler,

BUGS to fix:

fixme: illegal DOI landing page

URL constructor:
for DOI  doi:10.1038/4371102a

URL construction for supplemental files:

no access page:
- in wget, it triggers a 401 error

cat /cluster/home/max/projects/pubs/crawlDir/rupress/ | head -n13658 | tail -n2 > problem.txt
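The head/tail command above pulls out lines 13657 and 13658 of the input. The
same extraction can be sketched in Python; line_range is a made-up helper name,
and the sketch assumes the input has at least `last` lines (for shorter input,
head/tail would behave differently).

```python
from itertools import islice

def line_range(lines, last, count):
    """Return the `count` lines ending at 1-based line number `last`,
    like `head -n LAST | tail -n COUNT` on a long enough input."""
    start = max(last - count, 0)
    return list(islice(lines, start, last))

if __name__ == "__main__":
    demo = ["line %d\n" % i for i in range(1, 11)]
    # Lines 4 and 5 of a ten-line input.
    print(line_range(iter(demo), 5, 2))
```

For the case above, passing an open file handle with last=13658 and count=2
returns the two problem lines without reading the rest of the file.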


various tools to download, convert and process scientific articles





