This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Failed to load latest commit information.
Latest commit message
These are the tools that I use for the UCSC Genocoding project, see http://text.soe.ucsc.edu Most start with the prefix "pub", the category and then the concrete publisher. The tool categories are: - pubGetX = download files from publisher X (medline, pmc, elsevier) - pubConvX = convert downloaded files to a pub format (tab-separated table with fields defined in lib/pubStore.py) - pubLoadX = load pub format data into a database system (mysql or sqlite) More general tools are: - pubPrepX = prepare directory structures. These are used to download taxon names, import gene models from websites like NCBI or UCSC. - pubRunAnnot = run an annotator from the scripts directory on text data in pub format - pubRunMapReduce = run a map/reduce style job from "scripts" onto fulltext. - pubCrawl = crawl papers from various publishers, needs a directory set up with pubPrepCrawlDir and the journalList directory - pubLoad = load pub format files into mysql db - pubMap = complex multi stage pipeline to find and map markers found in text (sequences, snps, bands, genes, etc) to genomic locations and create/load bed files into the ucsc browser If you plan to use any of these, make sure to go over lib/pubConf.py first. Most commands need some settings in the config file adapted to your particular server / cluster system. E.g. pubCrawl needs your email address, pubConvX need the cluster system and various input/output directories. Maximilian Haeussler, firstname.lastname@example.org BUGS to fix: fixme: illegal DOI landing page http://www.nature.com/doifinder/10.1046/j.1523-1747.1998.00092.x URL constructor: http://www.nature.com/nature/journal/v437/n7062/full/4371102a.html for DOI doi:10.1038/4371102a URL construction for supplemental files: http://www.nature.com/bjc/journal/v103/n10/suppinfo/6605908s1.html no access page: http://www.nature.com/nrclinonc/journal/v7/n11/full/nrclinonc.2010.119.html - in wget, it triggers a 401 error cat /cluster/home/max/projects/pubs/crawlDir/rupress/articleMeta.tab | head -n13658 | tail -n2 > problem.txt