forked from maximilianh/pubMunch
These are the tools that I use for the UCSC Genocoding project, see
http://text.soe.ucsc.edu
Most tool names consist of the prefix "pub", then the category, and then the
specific publisher. The tool categories are:
- pubGetX = download files from publisher X (medline, pmc, elsevier)
- pubConvX = convert downloaded files to the pub format (a tab-separated table
  with fields defined in lib/pubStore.py)
- pubLoadX = load pub format data into a database system (mysql or sqlite)
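The naming scheme above can be sketched in a few lines of Python. This is
only an illustration of the convention, not code from this repository: the
function name split_tool_name and the example name "pubGetMedline" are
assumptions derived from the pubGetX pattern and the publishers listed above.

```python
import re

# Hypothetical helper: matches only the three per-publisher categories
# listed above (Get, Conv, Load), followed by a publisher name.
TOOL_NAME = re.compile(r"^pub(Get|Conv|Load)([A-Z]\w*)$")


def split_tool_name(name):
    """Return (category, publisher) for a per-publisher tool name, else None."""
    match = TOOL_NAME.match(name)
    if match is None:
        return None
    category, publisher = match.groups()
    # Publishers are written in lowercase in this README (medline, pmc, ...).
    return category, publisher.lower()
```

General tools without a publisher part, such as pubLoad or pubCrawl, do not
match this pattern and yield None.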
More general tools are:
- pubPrepX = prepare directory structures. These tools download taxon names
  and import gene models from websites like NCBI or UCSC.
- pubRunAnnot = run an annotator from the scripts directory on text data in
  pub format
- pubRunMapReduce = run a map/reduce style job from "scripts" on fulltext.
- pubCrawl = crawl papers from various publishers, needs a directory set up
  with pubPrepCrawlDir and the journalList directory
- pubLoad = load pub format files into a MySQL db
- pubMap = complex multi-stage pipeline to find and map markers found in text
  (sequences, SNPs, bands, genes, etc) to genomic locations
  and create/load bed files into the UCSC browser
If you plan to use any of these, make sure to go over lib/pubConf.py first.
Most commands need some settings in the config file adapted to your particular
server / cluster system. E.g. pubCrawl needs your email address, pubConvX
needs the cluster system and various input/output directories.
Maximilian Haeussler, maximilianh@gmail.com
BUGS to fix:
fixme: illegal DOI landing page
http://www.nature.com/doifinder/10.1046/j.1523-1747.1998.00092.x
URL constructor:
http://www.nature.com/nature/journal/v437/n7062/full/4371102a.html
for DOI doi:10.1038/4371102a
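A minimal sketch of such a URL constructor, inferred only from the single
example above. The function name nature_fulltext_url is hypothetical, and the
journal abbreviation, volume and issue are assumed to come from the article
metadata, since they cannot be derived from the DOI itself.

```python
def nature_fulltext_url(doi, journal, volume, issue):
    """Build a nature.com fulltext URL from a DOI plus journal metadata.

    Assumption: the part of the DOI after the first "/" is the article
    identifier used in the URL, as in the example above.
    """
    if doi.startswith("doi:"):
        doi = doi[len("doi:"):]
    _prefix, _sep, article_id = doi.partition("/")
    if not article_id:
        raise ValueError("DOI has no suffix: %r" % doi)
    return "http://www.nature.com/%s/journal/v%s/n%s/full/%s.html" % (
        journal, volume, issue, article_id)
```

This pattern is only known to hold for the example shown here. Other DOIs may
need different handling.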
URL construction for supplemental files:
http://www.nature.com/bjc/journal/v103/n10/suppinfo/6605908s1.html
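The supplemental URL above could be built along these lines. This is a sketch
based on that one URL only: the name nature_suppinfo_url is hypothetical, and
reading the trailing "s1" as "supplement number 1" is an assumption.

```python
def nature_suppinfo_url(journal, volume, issue, article_id, supp_index=1):
    """Build a nature.com supplementary-info URL.

    Assumption: the file name is the article identifier followed by "s"
    and a supplement number, as in ".../suppinfo/6605908s1.html".
    """
    return "http://www.nature.com/%s/journal/v%s/n%s/suppinfo/%ss%d.html" % (
        journal, volume, issue, article_id, supp_index)
```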
no access page:
http://www.nature.com/nrclinonc/journal/v7/n11/full/nrclinonc.2010.119.html
- in wget, it triggers a 401 error
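One way to detect such no-access pages from Python instead of wget, as a
sketch using only the standard library. The function name http_status and
the opener parameter are illustrative and not part of pubCrawl.

```python
import urllib.error
import urllib.request


def http_status(url, opener=urllib.request.urlopen):
    """Return the HTTP status code for url, including error codes like 401."""
    try:
        with opener(url) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx responses; the code is kept on it.
        return err.code


def is_no_access(url, opener=urllib.request.urlopen):
    """True if the server answers 401, like the no-access page above."""
    return http_status(url, opener) == 401
```

The opener argument exists only so the function can be tried without network
access. In normal use it can be left at its default.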
cat /cluster/home/max/projects/pubs/crawlDir/rupress/articleMeta.tab | head -n13658 | tail -n2 > problem.txt
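The same extraction of the problem lines (the last 2 of the first 13658
lines, i.e. lines 13657 and 13658) can be sketched in Python. The helper name
head_tail is made up for this example.

```python
from collections import deque
from itertools import islice


def head_tail(lines, head_n, tail_n):
    """Mimic `head -n head_n | tail -n tail_n` on any iterable of lines."""
    # islice stops after head_n lines; the bounded deque keeps the last tail_n.
    return list(deque(islice(lines, head_n), maxlen=tail_n))


# Usage sketch (path taken from the command above):
# with open("/cluster/home/max/projects/pubs/crawlDir/rupress/articleMeta.tab") as fh:
#     problem_lines = head_tail(fh, 13658, 2)
```

Like the shell pipeline, this returns the last lines that exist if the input
is shorter than head_n lines.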