Alexander L. Hayes | Savannah Smith | Devendra Dhami | Sriraam Natarajan
This project uses data pulled from rxlist.com, openFDA, and PubMed. This repository contains the scripts for pulling data from each source; an overview of each is given below.
Questions? Contact Alexander Hayes at hayesall(at)indiana(dot)edu or Savannah Smith at savannah.smith(at)valpo(dot)edu
#####Overview:
This folder contains the final deliverables for the ProHealth Summer REU 2016.
- [Final Paper](Overview/Predicting Drug-Drug Interactions from Text using NLP and Privileged Information.pdf)
- Presentation Poster (png/pdf/psd)
- Project Overview Video - YouTube
#####openFDA:
Here you will find shell scripts for pulling drug names from rxlist.com and labeling information from openFDA, plus fixlist.sh, which repairs the list of drugs if rxdownloader.sh crashes halfway through.
Running the scripts will take some time (rxdownloader.sh took about 6 hours in total to crawl through the database). Find out more about openFDA at their website.
The full set of extracted data can be found on Alexander's GitHub. Because of its size, downloading a .zip is recommended.
- `builddruglist.sh`: Downloads drug names from rxlist.com.
  Outputs a text file (drugslist.txt).
  Arguments can be passed to tweak the output:
  - `bash builddruglist.sh` (by default formats for openFDA)
  - `bash builddruglist.sh openFDA` (replace spaces with +AND+)
  - `bash builddruglist.sh PubMed` (replace spaces with +)
  - `bash builddruglist.sh Web` (replace spaces with _)
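The three formats differ only in what replaces the spaces in a drug name. A minimal Python sketch of that substitution is shown below; the function name `format_drug_name` and the `SEPARATORS` table are illustrative and are not part of builddruglist.sh.

```python
# Hypothetical helper mirroring the three output formats described above.
SEPARATORS = {"openFDA": "+AND+", "PubMed": "+", "Web": "_"}


def format_drug_name(name, target="openFDA"):
    """Replace runs of whitespace in a drug name with the separator for `target`."""
    return SEPARATORS[target].join(name.split())


print(format_drug_name("Warfarin Sodium"))            # Warfarin+AND+Sodium
print(format_drug_name("Warfarin Sodium", "PubMed"))  # Warfarin+Sodium
```

The openFDA form matches the file naming used by rxdownloader.sh (for example, Warfarin+AND+Sodium-data.txt).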
- `rxdownloader.sh`: Takes a list of drugs (drugslist.txt).
  Queries openFDA for each drug (`fdainteractions.sh`).
  Outputs a text file in drugInteractionsFolder/ (e.g. Warfarin+AND+Sodium will be output as drugInteractionsFolder/Warfarin+AND+Sodium-data.txt).
  Running it is simple (though time consuming): `bash rxdownloader.sh`

  Additionally, RXDownloader does some sorting for us: it queries all generic drugs, outputs brand name drugs to a separate file (drugInteractionsFolder/BRANDNAMEDRUGS.txt), and separates unknown drugs as well (drugInteractionsFolder/UNKNOWNDRUGS.txt). Looking up generic drugs also pulls their brand-name equivalents, so redundancy is minimized. All drugs that have over 1000 results are cut off at 1000, and their names are added to a file (drugInteractionsFolder/WARNINGS.txt). Finally, each step is detailed in a log file (drugInteractionsFolder/LOG.txt) which outlines what was queried and when it was completed.

  If RXDownloader crashes, it can be started up again to continue where it left off.
- `fixlist.sh`: For error checking, run `bash fixlist.sh` to download a fresh copy of drugslist.txt, check each entry against the LOG file generated by RXDownloader, and remove any entries that are present in both.
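The resume step amounts to a set difference between the full drug list and the drugs already logged. The Python sketch below assumes the names of already-queried drugs have been extracted from LOG.txt into a collection; the log format and the parsing done by fixlist.sh are not shown here, and `remaining_drugs` and the sample names are hypothetical.

```python
def remaining_drugs(all_drugs, completed):
    """Return the drugs still to be queried, preserving the original list order."""
    done = set(completed)
    return [drug for drug in all_drugs if drug not in done]


# Made-up sample data: one drug has already been logged as completed.
todo = remaining_drugs(["Abilify", "Warfarin+AND+Sodium", "Zocor"], ["Abilify"])
print(todo)  # ['Warfarin+AND+Sodium', 'Zocor']
```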
#####PubMed:
The second dataset consists of Medical Abstracts from PubMed. The thought is that if drug-drug pairs appear in medical abstracts, the combination has been studied. From the abstracts' text, we can discern the findings and whether or not there are adverse events caused by taking both.
Presented here are several scripts for pulling the abstracts. The bulk of the work is done through the RefSense package (Maintained by Lars Arvestad and distributed under a GNU Public License: [Website | GitHub]).
Professor Natarajan suggested we extract the top twenty abstracts from the past ten years (numbers chosen based on a paper he worked on), querying every drug combination (a total of 11,912,080 possible) and outputting the results to text files. Once again, the full set of extracted data can be found on Alexander's GitHub (~25 GB).
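The size of the search space can be checked with a few lines of arithmetic. This sketch uses the n = 4881 drug count given in the smartsplit.sh description further down. The quoted total corresponds to half of the full n-by-n matrix; excluding the diagonal (a drug paired with itself) gives a slightly smaller count.

```python
n = 4881                             # number of drugs in the list

half_matrix = n * n // 2             # half of the full n-by-n matrix
unordered_pairs = n * (n - 1) // 2   # distinct pairs with x != y

print(half_matrix)      # 11912080
print(unordered_pairs)  # 11909640
```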
pmid2text, pmsearch, and perlscripts/
- Each article in PubMed is associated with a unique ID number.
- `pmsearch` can be invoked to search for specific keywords and return all PubMed IDs associated with them. Additionally, the `-t` flag can be passed to specify how old the abstracts can be (in this case, `-t 3650` for approximately 10 years), and the `-d` flag can be passed to specify how many articles are returned (`-d 20`).
- `pmid2text` is used to convert these unique PubMed IDs to a text output: `-a -i` can be passed to pull the abstracts and remove indentation, respectively.
- We can chain these together: `perl pmsearch -t 3650 -d 20 $DRUG1 $DRUG2 | perl pmid2text -a -i > outputFile`
- For completeness, additional functions are in the perlscripts/ directory. Normally these would be placed under lib/, but the path variable is updated automatically by the scripts.
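To run the chained command for many drug pairs, the command line can be generated programmatically. The following is a sketch, assuming `pmsearch` and `pmid2text` sit in the current directory and that drug names are already formatted for PubMed; `build_command` and its default output file name are hypothetical.

```python
import shlex


def build_command(drug1, drug2, days=3650, count=20, outfile="outputFile"):
    """Build the pmsearch | pmid2text shell pipeline for one drug pair."""
    search = ["perl", "pmsearch", "-t", str(days), "-d", str(count), drug1, drug2]
    convert = ["perl", "pmid2text", "-a", "-i"]

    def quote(parts):
        # Quote each argument so unusual drug names cannot break the shell command.
        return " ".join(shlex.quote(part) for part in parts)

    return f"{quote(search)} | {quote(convert)} > {shlex.quote(outfile)}"


print(build_command("Warfarin+Sodium", "Aspirin"))
```

The resulting string could be passed to `subprocess.run(..., shell=True)`, one call per drug pair.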
- `smartsplit.sh`: The drug combinations can be thought of as a matrix with drugs on the x and y axes.

  With n=4881 drugs, checking every combination would normally take n^2 checks. However, a simple property allows us to cut this number in half, because [(x & y) == (y & x) when x=/=y]. Even so, running 11,912,080 checks in sequence would take close to 68 days (at least if you're running it on my laptop). We needed to run many of the checks in parallel, and we needed to split the matrix in a way that made sure each node was responsible for a roughly equal number of calculations.

  This graph represents what we are interested in. The shaded red section represents duplicates and unnecessary checks where x=y. The shaded green columns show roughly equal areas: columns [A-H] have roughly the same number of checks as column [Z]. `smartsplit.sh` needs to be size aware, splitting the matrix into equal parts depending on how many nodes are available to calculate.

  Running the code with `bash smartsplit.sh STABLE.txt` creates 71 directories under a directory called `Data/` (71 is the maximum number of nodes that can be allocated at once on IU's Odin Supercluster; more on that later). Each directory under `Data/` is named after a number between 1 and 71. Each of these subdirectories contains a copy of `STABLE.txt`, `drugs.txt`, and a `check_[1-71]`.

  A brief overview of each file (these are important for `pullabstractsODIN.sh`):
  - `STABLE.txt`: a copy of drugslist.txt, the output of `bash builddruglist.sh PubMed`. It is called STABLE because it is never altered by a program; it is used for copying but is never modified.
  - `drugs.txt`: a copy of `STABLE.txt` that can be altered by other scripts (file is the y axis).
  - `check_[1-71]`: a list of drugs that a particular node is responsible for (file is the x axis). Each item in this list is checked against everything in `drugs.txt` until `check_[1-71]` is exhausted.
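One way to produce such a size-aware split is sketched below in Python. It is not how smartsplit.sh is implemented: it assumes drug number i only needs to be checked against the drugs after it in the list, so early rows carry more work than late ones, and it cuts the list into contiguous chunks whose total work is roughly equal. The names `split_triangle` and `chunk_work` are hypothetical.

```python
def split_triangle(n, parts):
    """Split rows 0..n-1 of a strict upper triangle into contiguous chunks.

    Row i carries n - 1 - i checks (pairs (i, j) with j > i). Returns a list
    of (start, stop) half-open row ranges, one per part.
    """
    total = n * (n - 1) // 2
    chunks, start, done = [], 0, 0
    for i in range(n):
        done += n - 1 - i
        # Close the current chunk once cumulative work reaches its share.
        if len(chunks) < parts - 1 and done * parts >= total * (len(chunks) + 1):
            chunks.append((start, i + 1))
            start = i + 1
    chunks.append((start, n))  # the last chunk takes whatever rows remain
    return chunks


def chunk_work(n, chunk):
    """Number of checks assigned to one (start, stop) chunk."""
    start, stop = chunk
    return sum(n - 1 - i for i in range(start, stop))


chunks = split_triangle(4881, 71)
print(len(chunks))  # 71
```

Each chunk's workload differs from the ideal share by less than one row's worth of checks, so the early chunks hold only a few rows while the later chunks hold many.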
- `pullabstracts.sh`: I would not suggest running this (but if you want to, it's as easy as `bash pullabstracts.sh STABLE.txt`). The only reason I'm including it is to explain why it's unusable. This script pulls abstracts sequentially, making all 11,912,080 checks by itself. This framework became `pullabstractsODIN.sh`, which can run 71 checks in parallel.
- `pullabstractsODIN.sh`: Run the following commands in order:
  - `bash builddruglist.sh PubMed && mv drugslist.txt STABLE.txt`
  - `bash smartsplit.sh STABLE.txt`
  - `srun -N 71 pullabstractsODIN.sh`

  Running 71 copies in parallel is the difference between taking 68 days and taking 1 day. I wrote it this way to make it extremely easy to start the code back up if it crashes: running `srun -N 71 pullabstractsODIN.sh` again will automatically allocate the nodes again and pick up where it previously left off.

  The downside is that this is so specific to running on odin.cs.indiana.edu that it will not work anywhere else. The script automatically allocates 71 nodes (the maximum allowed), and tells each node to run the script on a different set of files corresponding to Data/[1-71]. Running elsewhere would likely require heavy revisions to the code, but the main section that would need to be changed is fairly small:
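```bash
# Excerpt: $LOG and $FINAL are assumed to be defined elsewhere in the script.
function synchronize {
  # Extract the node number from a hostname of the form odinNNN.cs.indiana.edu,
  # stripping leading zeros.
  HOSTNUMBER=$(hostname | grep -o -P '(?<=odin).*(?=\.cs\.indiana\.edu)' | sed 's/^0*//')
  echo "$HOSTNUMBER"
  # Stagger the nodes so they append to the shared log one at a time.
  sleep "$HOSTNUMBER"
  HOST=$(hostname)
  echo "$HOST" >> "$LOG"
  # The log's line count at this moment becomes this node's index.
  OUTPUT=$(wc --lines < "$LOG")
  echo "$HOST$OUTPUT" >> "$FINAL"
}
sleep 5
synchronize
# Hostnames end in "edu", so everything after the "u" is the index.
NUMBERSTRING=$(grep "$HOST" "$FINAL" | cut -d 'u' -f 2)
NUMBER=$((NUMBERSTRING * 1))
echo "$HOST is at $NUMBER"
```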
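On clusters other than Odin, the hostname parsing could be replaced by an index supplied by the scheduler. The Python sketch below assumes the job is launched with SLURM's `srun`, which sets the `SLURM_PROCID` environment variable to a zero-based task rank; with one task per node, that rank maps directly onto Data/[1-71]. The function name and the fallback to rank 0 are hypothetical, and this is not what pullabstractsODIN.sh does.

```python
import os


def data_directory(environ=os.environ, root="Data"):
    """Map this task's zero-based SLURM rank to its one-based Data/ subdirectory."""
    # Assumption: outside of SLURM (variable unset), behave like the first task.
    rank = int(environ.get("SLURM_PROCID", "0"))
    return f"{root}/{rank + 1}"


print(data_directory({"SLURM_PROCID": "70"}))  # Data/71
```

Because the index comes from the scheduler instead of a shared log file, no staggered `sleep` is needed to avoid two nodes claiming the same directory.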
#####SocialBlogs:
Here you will find several attempts to automatically extract text from DailyStrength.org.
In order to run seventh_attempt.py you will need to download and install Beautiful Soup. Also make sure you install all necessary packages (refer to the import statements in the code). Until DailyStrength updated their website, this code ran as expected: it extracted all social blog discussions from a specified location. As a result of the update, the links and tags changed a bit, but this was an easy fix in Beautiful Soup. The real problem was a 'show more' button replacing the option to go to the next page. To solve this, I needed to write a script that could interact with the web page (i.e. push the 'show more' button).
The best option to solve this (after some brief research) appeared to be the Selenium WebDriver. I wrote a script in Python using the Chrome WebDriver. This WebDriver can open a browser, search in a browser, and interact with it in other ways. This code was unpredictable: it would crash unexpectedly after running correctly on a similar task. It seemed that DailyDraft.py needed multiple 'scroll element into view' and multiple 'scroll up to the top of the page' commands. Running FinalDaily.py will take some time.
My final code required a lot of `time.sleep()` calls to pause the code and `driver.implicitly_wait()` calls to give the webpage time to load before completing the next step. I have two output files from FinalDaily.py linked to this repository:
- DailyStrength1.txt has the discussion text of links 1-98 under the result of searching "Drug Interactions."
- DailyStrength2.txt has the output of links 99-130.
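The 'show more' handling described above boils down to a click-and-wait loop. The Python sketch below keeps the browser out of the picture so it can run anywhere: the caller supplies two functions, one that finds the button and one that clicks it. It is an illustration, not the code in FinalDaily.py, and `expand_all` and its parameters are hypothetical.

```python
import time


def expand_all(find_button, click, pause=1.0, max_clicks=500):
    """Click a 'show more' control until it disappears or max_clicks is reached.

    find_button: callable returning the button object, or None when it is gone.
    click: callable that clicks the object returned by find_button.
    Returns the number of clicks performed.
    """
    clicks = 0
    while clicks < max_clicks:
        button = find_button()
        if button is None:
            break
        click(button)
        clicks += 1
        time.sleep(pause)  # give the page time to load the next batch
    return clicks


# Stand-in for a page whose button disappears after three clicks.
page = {"batches_left": 3}
clicked = expand_all(
    lambda: "button" if page["batches_left"] > 0 else None,
    lambda button: page.update(batches_left=page["batches_left"] - 1),
    pause=0,
)
print(clicked)  # 3
```

With Selenium, `find_button` could wrap `driver.find_elements(...)` and return the first match or `None`, and `click` could scroll the element into view with `driver.execute_script("arguments[0].scrollIntoView(true);", button)` before calling `button.click()`. The selector for the button would have to be taken from the live page.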
#####Confidence:
#####Learning: