This is the codebase for the PGxMine project, which uses text mining to identify papers for curation into PharmGKB. It is a Python 3 project that uses the Kindred relation classifier along with the PubRunner project to run tools across PubMed and accessible PubMed Central.
Viewing the Data
To run a local instance of the PGxMine viewer, use the R Shiny code in shiny/. Installation instructions are found there too.
pip install kindred pubrunner
pip install scispacy
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.0/en_core_sci_sm-0.2.0.tar.gz
This project uses a variety of data sources. A few need to be downloaded as described below, and PubRunner manages the others, apart from DrugBank, which needs to be downloaded manually.
- PubMed and accessible PubMed Central (downloaded by PubRunner)
- PubTator Central
- MeSH (only needed to update the drug list)
- DrugBank (download manually, as an account is required, and name the file drugbank.xml)
- PharmGKB (used for constructing the drug list and comparisons)
The prepareData.sh script downloads some of the data dependencies and runs some preprocessing to extract necessary data (such as gene name mappings). The commands that it runs are detailed below.
# Download PubTator Central, MeSH, dbSNP, Entrez Gene metadata and pharmGKB drug info
sh downloadDataDependencies.sh

# Extract the gene names associated with rsIDs from dbSNP
python linkRSIDToGeneName.py --dbsnp <(zcat data/GCF_000001405.25.gz) --pubtator <(zcat data/bioconcepts2pubtatorcentral.gz) --outFile dbsnp_selected.tsv

# Create the drug list with mappings from MeSH IDs to PharmGKB IDs (with some filtering using DrugBank categories)
python createDrugList.py --meshC data/c2019.bin --meshD data/d2019.bin --drugbank drugbank.xml --pharmgkb data/drugs.tsv --outFile selected_chemicals.json

# Extract a mapping from Entrez Gene ID to name
zgrep -P "^9606\t" data/gene_info.gz | cut -f 2,3,10 -d $'\t' > gene_names.tsv

# Unzip the annotated training data of pharmacogenomics relations
gunzip -c annotations.variant_other.bioc.xml.gz > annotations.variant_other.bioc.xml
gunzip -c annotations.variant_star_rs.bioc.xml.gz > annotations.variant_star_rs.bioc.xml
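For readers without zgrep or GNU cut, the Entrez Gene step (keep rows whose first column is 9606, then take columns 2, 3 and 10) can be sketched in Python. This is an illustrative equivalent, not part of the repository: the function names `extract_human_genes` and `write_gene_names` are hypothetical, and the sample rows in the demo are made up.

```python
import gzip


def extract_human_genes(lines, tax_id="9606"):
    """Yield (column 2, column 3, column 10) for rows whose first column is tax_id."""
    prefix = tax_id + "\t"
    for line in lines:
        # Same test as zgrep -P "^9606\t": the line must start with the ID and a tab
        if not line.startswith(prefix):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            # Assumption: rows with fewer than 10 columns are skipped
            # (cut would instead print whichever fields exist)
            continue
        yield fields[1], fields[2], fields[9]


def write_gene_names(gene_info_gz, out_path):
    """Read a gzipped gene_info file and write the selected columns as TSV."""
    with gzip.open(gene_info_gz, "rt", encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as out:
        for row in extract_human_genes(src):
            out.write("\t".join(row) + "\n")


if __name__ == "__main__":
    # Made-up rows for illustration only; real input is data/gene_info.gz
    sample = [
        "9606\t1\tGENE_A\t-\t-\t-\t19\t-\tdescription\tprotein-coding\n",
        "10090\t2\tGENE_B\t-\t-\t-\t1\t-\tdescription\tprotein-coding\n",
    ]
    for row in extract_human_genes(sample):
        print("\t".join(row))
```

Calling `write_gene_names("data/gene_info.gz", "gene_names.tsv")` would mirror the shell pipeline, with the one difference noted in the comments about short rows.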
There is an example input file in the example directory which contains a couple of PubMed abstracts in BioC format. The run_example.sh script does a full run, extracting chemical/variant associations, and is shown below with comments. The final output is three files: mini_unfiltered.tsv, mini_collated.tsv and mini_sentences.tsv.
# Align the PubTator Central extracted entities against the text sources to get offset positions for chemicals, variants
python align.py --inBioc example/input.bioc.xml --annotations <(zcat data/bioconcepts2pubtatorcentral.gz) --outBioc example/aligned.bioc.xml

# Parse and find sentences that mention a chemical, variant and likely a pharmacogenomic association (using filter terms)
python findPGxSentences.py --inBioc example/aligned.bioc.xml --filterTermsFile pgx_filter_terms.txt --outBioc example/sentences.bioc.xml

# Train relation classifiers (using the annotations* files as training data), filter for specific chemicals and apply the classifiers to extract associations and output with normalized genes, variants and chemicals
python createKB.py --trainingFiles annotations.variant_star_rs.bioc.xml,annotations.variant_other.bioc.xml --inBioC example/sentences.bioc.xml --selectedChemicals selected_chemicals.json --dbsnp dbsnp_selected.tsv --variantStopwords stopword_variants.txt --genes gene_names.tsv --outKB example/kb.tsv

# Collate the output of createKB (which in a full run would be ~1800 files), filter using the relation probability and collate by counting the number of papers
python filterAndCollate.py --inData example --outUnfiltered example/mini_unfiltered.tsv --outCollated example/mini_collated.tsv --outSentences example/mini_sentences.tsv
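The final step is described as filtering by relation probability and collating by counting papers. A minimal sketch of that idea follows. It is not the code in filterAndCollate.py: the record layout (paper ID, chemical, variant, probability), the function name `filter_and_collate` and the 0.75 threshold are all assumptions made for illustration, and the records in the demo are made up.

```python
from collections import defaultdict


def filter_and_collate(records, threshold=0.75):
    """Count distinct papers per (chemical, variant) pair above a probability threshold.

    records: iterable of (paper_id, chemical, variant, probability) tuples.
    The tuple layout and the default threshold are hypothetical.
    """
    papers = defaultdict(set)
    for paper_id, chemical, variant, probability in records:
        # Filter step: drop low-confidence extractions
        if probability >= threshold:
            # A set ensures each paper is counted once per association
            papers[(chemical, variant)].add(paper_id)

    collated = [(chem, var, len(ids)) for (chem, var), ids in papers.items()]
    # Most frequently reported associations first, ties broken alphabetically
    collated.sort(key=lambda row: (-row[2], row[0], row[1]))
    return collated


if __name__ == "__main__":
    # Made-up records for illustration only
    demo = [
        ("paper1", "drugA", "rs0000001", 0.90),
        ("paper2", "drugA", "rs0000001", 0.80),
        ("paper2", "drugA", "rs0000001", 0.95),
        ("paper3", "drugB", "GENE*2", 0.50),
    ]
    for chemical, variant, count in filter_and_collate(demo):
        print(chemical, variant, count)
```

Counting distinct papers rather than sentences keeps a single paper that repeats an association many times from dominating the collated output.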
To do a full run, you need to use PubRunner. It manages the download and format conversion of PubMed, the PubMed Central Open Access subset and the PubMed Central Author Manuscript Collection. The command to do it is below.
pubrunner .
This will take a long time. Setting up PubRunner with a cluster is recommended. A test run is below.
pubrunner --test .
Here is a summary of the main script files. The pubrunner.yml file is the master script for PubRunner and lists the resources and script usage to actually run the project.
- align.py: Align PubTator Central entities against abstracts and full-text papers
- findPGxSentences.py: Identify star alleles then find sentences that mention a chemical and variant
- createKB.py: Train and apply a relation classifier to extract pharmacogenomic chemical/variant associations
- filterAndCollate.py: Filter the results to reduce false positives and collate the associations
- utils/__init__.py: Big functions for variant normalization and outputting the formatted sentences
- createDrugList.py: Creates the list of drugs and drug mappings from MeSH IDs to PharmGKB IDs with some filtering by categories
- linkRSIDToGeneName.py: Extracts gene names from dbSNP associated with rsIDs
- linkStarToRSID.py: Some rudimentary text mining to link star alleles with a specific rsID
- prepareForAnnotation.py: Select sentences and output to the standoff format to be annotated
- prCurve.py: Calculate PR curves for the classifiers
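As a rough illustration of what calculating a PR curve for a classifier involves, here is a minimal sketch in plain Python. It is not the code in prCurve.py: the function name `pr_curve` and its inputs (a list of predicted probabilities and a matching list of true/false labels) are assumptions for the example.

```python
def pr_curve(scores, labels):
    """Return (threshold, precision, recall) triples, one per distinct score.

    scores: predicted probabilities; labels: matching truthy/falsy gold labels.
    Each point treats every item scoring >= threshold as a positive prediction.
    """
    total_positives = sum(1 for label in labels if label)
    if total_positives == 0:
        raise ValueError("need at least one positive label to compute recall")

    # Walk down the ranking from the most confident prediction
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = []
    true_pos = false_pos = 0
    for i, (score, label) in enumerate(ranked):
        if label:
            true_pos += 1
        else:
            false_pos += 1
        # Emit a point only after all items tied at this score are counted
        last_of_tie = i + 1 == len(ranked) or ranked[i + 1][0] != score
        if last_of_tie:
            precision = true_pos / (true_pos + false_pos)
            recall = true_pos / total_positives
            points.append((score, precision, recall))
    return points


if __name__ == "__main__":
    # Made-up scores and labels for illustration only
    for threshold, precision, recall in pr_curve([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]):
        print(threshold, precision, recall)
```

Reading the curve from the top, a higher threshold trades recall for precision, which is the trade-off a probability filter on the extracted associations has to balance.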
The paper can be recompiled from the dataset using Bookdown. All text and code for stats/figures are in the paper/ directory.
Supplementary materials for the manuscript are found in supplementaryMaterials/.