Mining publications for gene names
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
bin
features
lib
scripts
spec
test/data/input
.document
.gitignore
.rspec
.travis.yml
Gemfile
LICENSE.txt
README.md
Rakefile
VERSION

README.md

bio-exominer

Build Status

Exominer helps build a list of genes from publications.

Such a gene list may be used for identifying candidate genes connected to a specific disease, but also may be used to compile a targeted exome design for sequencing.

A quick example

gene textmatch description context resource doi
AKP8L HAP95 A kinase (PRKA) anchor protein 8-like A cancer-associated RING finger protein, RNF43, is a ubiquitin ligase that interacts with a nuclear protein, HAP95 Whole-exome sequencing of neoplastic cysts of the pancreas reveals recurrent mutations in components of ubiquitin-dependent pathways doi:10.1073/pnas.1118046108

Here, the second column shows the fuzzy text match, the first column the official HUGO name, the third column a description of the gene, the fourth column the textual context in the publication, the fifth column the title of the publication and the sixth column the DOI.

A complete result for a search for pancreatic cancer genes that were not listed in an exome design can be seen here. In this table the second entry for AM is a false positive; quickly seen by checking the context in the fourth column (AM refers to author initials). This output is generated by a SPARQL query and a lot of flexibility in combining resources and generating output is possible. Note that this is just one example.

The inputs for Exominer consists of a list of Pubmed IDs with text files (PDF, HTML, Word, Excel have to be exported to plain text first). Exominer harvests gene names from these documents using a default symbol list with aliases. Ideally, all texts would only contain HUGO symbols, i.e. the over 30K standardized gene names by the HUGO Gene Nomenclature Committee (HGNC). In reality, scientific authors take liberties and the search for names is 'fuzzy'. Therefore the search for Exominer also mines for the 12 odd million symbols and aliases that are known through NCBI.

All matches are written with their sources, symbol frequencies, publication year, and user provided keywords and impact scores and written out.

Exominer also exports to RDF, so that the gene symbols can be stored into a triple-store graph database and link out to Bio2rdf resources. The latter allows, for example, harvesting of pathways.

Every RDF export contains full information on the origin of symbols. Over time designs can be compared against each other and a historical record is maintained. It is a good idea to store the textual versions of the files too.

The initial symbol list with aliases can be fetched/generated from external sources, such as NCBI, Biomart and/or Bio2rdf. Some examples are listed in this README and related scripts are in ./scripts. For a more specific treatment of design and input/output of exominer, see ./doc/design.md.

Questions to ask from the RDF

  • What genes are mentioned in a paper?
  • What papers refer to certain genes?
  • What genes are mentioned most in papers?
  • What genes are mentioned only in one paper?
  • What genes are mentioned since 2011?
  • What genes are linked to a certain disease subtype?
  • What genes are linked to some author or lab?
  • What genes exist in a design?
  • What are the genes in a design that are non-HUGO named
  • What are the genes in a paper that are non-HUGO named
  • How do designs differ?
  • What genes are not in a design mentioned since 2010?

When linking out to TCGA and bio2rdf we can get mutation information and gene sizes

  • Give mutations of genes and their sizes of those listed in a paper
  • Give mutations of genes and their sizes of those listed in a design

The TCGA (maf) data was provided by Will's Ruby publisci RDF module. We can ask patient related questions

  • How many patients are in the TCGA database?
  • How many patients are in the TCGA per tumor type?

And mutation related questions

  • Rank patients on number of mutations
  • How many genes show at least one mutation per patient
  • What genes in what patients show more than X mutations (normalized for gene length)
  • Rank genes on number of mutations (normalized for gene length)
  • List mutated genes per patient
  • List patient per mutated gene
  • List all mutations that have exactly the same start position and matching variant type (SNP, INS, DEL)

These questions are answered through SPARQL queries below.

Note: this software is under active development!

Installation

gem install bio-exominer

Quick start

List all genes in a paper. Visit the paper with your browser and save it as HTML or text to 'paper.txt'

Command line interface (CLI)

Adding NCBI symbols and aliases

NCBI provides a current list of all NCBI used symbols in one large file at

  wget ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz
  gzip -d gene_info.gz

Fetch this file and unpack. Note: unpacked this is a 1.4Gb file; do not check this file into a github repository! Create the symbol/alias list for exominer with

  ncbi_exominer_symbols gene_info > ncbi_symbols.tab

That makes for some 14 million symbols + aliases(!).

The ncbi_symbols.tab file contains entries, synonyms and descriptsions, such as

repA1   pLeuDn_01       putative replication-associated protein
repA2   pLeuDn_03       putative replication-associated protein
leuA    pLeuDn_04       2-isopropylmalate synthase
leuB    pLeuDn_05       3-isopropylmalate dehydrogenase

You can remove the original gene_info file again after generating the ncbi_symbols file.

Next to the ncbi_symbols.tab file a frequency file is generated named ncbi_exominer_symbols.freq, which contains the frequency of every character used in symbol names:

p: 1255137
L: 1907635
e: 1334974
u: 465711
D: 2110781
n: 533637
_: 11942258

and a list of all characters

"#%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]_`abcdefghijklmnopqrstuvwxyz{}

In this list some gene symbols and gene names include dashes and dots and other characters. Some gene names even contain spaces - we skip these for further processing.

Later, the millions of NCBI symbols and aliases do not all write to a triple-store. Only those symbols get stored that are mined from the documents.

Adding HUGO symbols and aliases

To make sure all recent HUGO symbols are added, download the HUGO symbols file from EBI and parse that

  wget ftp://ftp.ebi.ac.uk/pub/databases/genenames/reference_genome_set.txt.gz
  gzip -d reference_genome_set.txt.gz
  hugo_exominer_symbols reference_genome_set.txt > hugo_symbols.tab

The hugo_symbols.tab is included with the gem (in test/data/input/hugo_symbols) and will always be loaded if you use the --hugo switch without specifying a symbol file. It contains entries, synonyms and discriptions, such as

ERAP2 L-RAP|LRAP  endoplasmic reticulum aminopeptidase 2
ERAS  HRAS2|HRASP ES cell expressed Ras
ERBB2 NEU|HER-2|CD340|HER2|NGL  v-erb-b2 avian erythroblastic leukemia viral oncogene homolog 2
ERBB2IP ERBIN|LAP2  erbb2 interacting protein

Making a text file of your document

Save HTML/Word/Excel/PDF files in a textual format. Command line tools, such as lynx, antiword and pdftotext exist for this purpose. An example of a textual version of an online Nature paper can be made with

  lynx --dump http://www.nature.com/nature/journal/v490/n7418/full/nature11412.html >> tcga_bc.txt

Warning: do not check this file into any public repository! Nature publishing group will not be amused.

Using Exominer to mine a text file for symbols

Pass the symbol file on the command line and pipe in the textual file, e.g.

  exominer -s ncbi_symbols.tab --hugo hugo_symbols.tab < tcga_bc.txt 

This results in a list of symbols and aliases found in the paper, with their tally. For example

35      FOXA1   forkhead box A1
36      cas     CRISPR associated Cas2 family protein
36      AKT1    v-akt murine thymoma viral oncogene homolog 1
37      BRCA2   hypothetical protein
37      BRAF    v-raf murine sarcoma viral oncogene homolog B1
37      BRCA1   breast cancer 1, early onset
38      A       replication gene A protein
38      AFF2    Ady2-Fun34 like Family, similar to S. cerevisiae FUN34 (YNR002C) and ADY2 (YCR010C); similar to Yarrowia glyoxalate pathway regulator, possible transmembrane acetate facilitator/sensor
39      PDGFRA  platelet-derived growth factor receptor, alpha polypeptide
39      RAD51C  Rad51 DNA recombinase 3
39      MAP3K1  mitogen-activated protein kinase kinase kinase 1, E3 ubiquitin protein ligase
41      AKT3    v-akt murine thymoma viral oncogene homolog 3 (protein kinase B, gamma)
43      ATM     hypothetical protein
90      can     carbonic anhydrase 2 Can

Out of a total of 12,774,630 symbols and 3,201,281 aliases scanned

This is not an authorative list but because it is such a comprehensive list of symbols and aliases there should be few false negatives. Obviously the last one is a false positive, but these should be easy to spot and weed out. The idea is to end up with a list of candidate exome targets. So the possible next step (when not using using a triple-store) allows for subtracting symbols already in a design (not yet implemented/NYI):

  exominer -s ncbi_symbols.tab --ignore list.tab < tcga_bc.txt

where list.tab contains a list of symbols to ignore. These symbols with their aliases are skipped in the text mining step.

This can be useful when mining a paper at a time. Mulitible papers is better, because there will be more evidence on gene names and symbols. Exominer can export results to RDF for powerful querying. More on that below.

Also when you have an existing exome design, is is possible to add a prepared exome list and accompanying design to an RDF triple store for further exploration.

Speeding up text search

To speed things up you can create a binary version of the symbols table with

  pack_exominer_symbols ncbi_symbols.tab

and rename that file to

  mv symbols.bin ncbi_symbols.bin

Now use the bin file instead with exominer's -s switch.

Using exominer with a triple-store

exominer supports RDF! This means that you can use a triple-store as a 'back-end' and add results of multiple runs incrementally. For every symbol it is possible to track back the publication and even mine extra information, such as publication date, journal type, and whether a symbol exists in one or more stored designs. We can even link aliases to Hugo symbols and link-out and fetch gene information, such as the length of the nucleotide sequence. Welcome to the world of the semantic web!

When parsing a publication or other resource we want to refer the result set to that. Ideally a DOI is used which can be turned into a URI through http://crossref.org/, e.g. doi:10.1038/171737a0 becomes http://dx.doi.org/10.1038/171737a0 and can be queried, as explained here.

If no URI exists, one can use a URL to a web publication, or even simply the file name with the year and some tags for describing the target of the publication, such as species or disease type.

The DOI describing the file:

  exominer --rdf -s ncbi_symbols.tab --hugo hugo_symbols.tab \
    --doi doi:10.1038/nature11412 < tcga_bc.txt 

allows for mining title and publication date for every symbol found. To add some meta information you could add semi-colon separated tags

  exominer --rdf -s ncbi_symbols.tab --hugo hugo_symbols.tab \
    --doi doi:10.1038/nature11412 --tag 'species=human;type=breast cancer' < tcga_bc.txt 

which helps mining data later on. If no doi exists, you may just add title and year:

  exominer --rdf -s ncbi_symbols.tab --tag 'title=Comprehensive molecular portraits of human breast tumours' \
    --tag 'year=2012;species=human;type=breast cancer' < tcga_bc.txt 

multiple tags are also allowed.

exominer generates RDF which can be added to a triple-store. If you want to add a design (old or new) treat it as a publication and use something like

  exominer --rdf --hugo hugo_symbols.tab --tag 'design=Targeted exome;year=2013;' < design.txt

These commands create turtle RDF with the --rdf switch. Pipe the output into the triple-store with

  curl -T file.rdf -H 'Content-Type: application/x-turtle' http://localhost:8081/data/exominer.rdf

The URI can be a little more descriptive, e.g.:

  curl -T design2012.rdf -H 'Content-Type: application/x-turtle' http://localhost:8081/data/design2012.rdf

Finally, to support multiple searches and make it easier to dereference sources you can supply a unique name to each result set with the --name switch. E.g.

  exominer --rdf --name tcga_bc -s ncbi_symbols.tab --hugo hugo_symbols.tab --doi doi:10.1038/nature11412 --tag 'species=human;type=breast cancer' < tcga_bc.txt 

Context

When a gene name gets mined from a text, it is nice to see where it is coming from. exominer provides context for this reason by including the text around the gene name with every reference. This is also a great way to weed out false positives! If the context for a gene named SE says: 'Department of Oncology, Lund University, SE-221 85 Lund, Sweden' - you may think twice about including it into your design.

Computers are not always good at automated text mining. The human eye can pick these mistakes up quickly, exominer makes use of human recognition. The RDF output contains this context by default. To switch context off, simply you can either add a CLI switch, or pass in a tag saying 'context=false'.

One extra (interesting) facility for context is the --context=line command. This will set the context to the full line in a text file (from LF to LF). This can be very useful when parsing tabular data (Excel dumps, for example).

Vocabularies

In addition to the standard W3C vocabularies, exominer uses the journal archiving and interchange tag set (JAT) for describing publications. Another is Bibliontology. The British Library vocabulary may be useful too.

Using exominer with a triple-store

If you intend to use exominer with a triple-store you need to install one. In principle you can use bio-rdf with any RDF triple store. Instructions for installing 4store can be found on bioruby-rdf. You can add a new triple-store with

4s-backend-setup exominer
4s-backend exominer
4s-httpd -p 8081 exominer

and check the webserver is running on http://localhost:8081/status/. Again, check bioruby-rdf for instructions on installing 4store and sparql-query and examples.

Mining gene symbols with SPARQL

Looking for all database information in the triple-store

SELECT * WHERE { ?s ?p ?o } 

This can be run with the sparql-query tool

sparql-query http://localhost:8081/sparql/ 'SELECT * WHERE { ?s ?p ?o } LIMIT 10'

With a non-HUGO geneid information can be fetched with

SELECT ?type1, ?label1, count(*)
WHERE {
?s1 ?p1 ?o1 .
?o1 bif:contains "HK1" .
?s1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type1 .
?s1 <http://www.w3.org/2000/01/rdf-schema#label> ?label1 .
}
ORDER BY DESC (count(*))

will render a list of gene id's. Follow up with, for example, http://bio2rdf.org/geneid:100036759

Project home page

Information on the source tree, documentation, examples, issues and how to contribute, see

http://github.com/pjotrp/bioruby-exominer

TODO

  • Fix doi to make full URI

Cite

If you use this software, please cite one of

Biogems.info

This Biogem is published at (http://biogems.info/index.html#bio-exominer)

Copyright

Copyright (c) 2013,2014 Cuppen Group and Pjotr Prins. See LICENSE.txt for further details.