Indiana University ProHealth REU 2016. Discern which drugs interact with which other drugs. Information extraction on text documents and scraping tools for openFDA, PubMed, and various blogs.

Drug Interaction Discovery with NLP and Machine Learning:

Alexander L. Hayes | Savannah Smith | Devendra Dhami | Sriraam Natarajan

This project uses data pulled from several sources, including openFDA and PubMed. The repository contains the scripts for pulling data from each source; an overview of each is given below.

Questions? Contact Alexander Hayes at hayesall(at)indiana(dot)edu or Savannah Smith at savannah.smith(at)valpo(dot)edu

Table of Contents
  1. Overview
  2. openFDA
  3. PubMed
  4. SocialBlogs
  5. Confidence
  6. Learning


##### Overview

This folder contains the final deliverables for the ProHealth Summer REU 2016.

![Alexander and Savannah's Poster Preview](/Overview/Poster/Drug-Drug\ Interactions\ Poster.png "Poster preview")

##### openFDA

Here you will find several shell scripts: one for pulling drug names, one for pulling labeling information from openFDA, and one for fixing the list of drugs if a run crashes halfway through.

Running the scripts takes some time (a full crawl of the database took about 6 hours in total). Find out more about openFDA at their website.

The full set of extracted data can be found on Alexander's GitHub. Because of its size, downloading a .zip is recommended.

  1. View Code

    Downloads drug names from
    Outputs a text file (drugslist.txt)
    Arguments can be passed to tweak the output.

  • bash (by default formats for openFDA)
  • bash openFDA (replace spaces with +AND+)
  • bash PubMed (replace spaces with +)
  • bash Web (replace spaces with _)
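The three output formats listed above can be sketched as a single shell function. This is a minimal illustration, not the repository's actual script: the function name `format_drug_name` and its argument order are hypothetical, and only the substitution rules (`+AND+`, `+`, `_`) come from the list above.

```shell
#!/usr/bin/env bash
# format_drug_name MODE NAME
# Hypothetical helper: rewrites the spaces in a drug name for one target.
format_drug_name() {
    local mode=${1:-openFDA} name=$2
    case "$mode" in
        openFDA) echo "${name// /+AND+}" ;;   # spaces become +AND+
        PubMed)  echo "${name// /+}" ;;       # spaces become +
        Web)     echo "${name// /_}" ;;       # spaces become _
        *)       echo "unknown mode: $mode" >&2; return 1 ;;
    esac
}

format_drug_name openFDA "Warfarin Sodium"   # prints Warfarin+AND+Sodium
```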
  1. View Code

    Takes a list of drugs (drugslist.txt)
    Queries openFDA for each drug (View Code)
    Outputs a text file in drugInteractionsFolder/ (e.g. Warfarin+AND+Sodium will be output as drugInteractionsFolder/Warfarin+AND+Sodium-data.txt).

    Running it is simple (though time consuming):

    Additionally, RXDownloader does some sorting for us: it queries all generic drugs, outputs brand-name drugs to a separate file (drugInteractionsFolder/BRANDNAMEDRUGS.txt), and separates unknown drugs as well (drugInteractionsFolder/UNKNOWNDRUGS.txt). Looking up generic drugs also pulls their brand-name equivalents, so redundancy is minimized. Any drug with more than 1000 results is cut off at 1000, and its name is added to a warnings file (drugInteractionsFolder/WARNINGS.txt). Finally, each step is recorded in a log file (drugInteractionsFolder/LOG.txt), which outlines what was queried and when it was completed.

    If RXDownloader crashes, it can be started up again to continue where it left off.

  2. View Code

    For error checking: run bash to download a fresh copy of drugslist.txt, check each entry against the LOG file generated by RXDownloader, and remove any entries that are present in both.
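The "remove any entries that are present in both" step can be sketched with `grep`. This is only an illustration under a loud assumption: it treats the log as having one completed drug name per line, which may not match the real LOG.txt format. The function name `remaining_drugs` is hypothetical.

```shell
#!/usr/bin/env bash
# remaining_drugs DRUGLIST LOGFILE
# Hypothetical helper: prints every drug in DRUGLIST that does not appear
# as a whole line in LOGFILE (assumed format: one finished drug per line).
remaining_drugs() {
    local druglist=$1 logfile=$2
    # -F fixed strings, -x whole-line match, -f patterns from file, -v invert
    grep -v -x -F -f "$logfile" "$druglist"
}
```

For example, `remaining_drugs drugslist.txt done.txt > todo.txt` would leave only the drugs that still need to be queried, where `done.txt` stands for a list extracted from the log.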

Return to Top | View in Folder

##### PubMed

The second dataset consists of medical abstracts from PubMed. The idea is that if a drug-drug pair appears in medical abstracts, the combination has been studied. From the abstracts' text, we can discern the findings and whether taking both drugs causes adverse events.

Presented here are several scripts for pulling the abstracts. The bulk of the work is done through the RefSense package (Maintained by Lars Arvestad and distributed under a GNU Public License: [Website | GitHub]).

Professor Natarajan suggested we extract the top twenty abstracts from the past ten years (numbers chosen based on a paper he worked on), querying every drug combination (11,912,080 possible in total) and outputting the results to text files. Once again, the full set of extracted data can be found on Alexander's GitHub (~25 GB).

  1. pmid2text, pmsearch, and perlscripts/
  • Each article in PubMed is associated with a unique ID number.

  • pmsearch can be invoked to search for specific keywords and return all PubMed IDs associated with them. Additionally, the -t flag specifies how old the abstracts can be in days (in this case, -t 3650 for approximately 10 years), and the -d flag specifies how many articles are returned (-d 20).

  • pmid2text converts these unique PubMed IDs to text output: -a pulls the abstracts and -i removes indentation.

  • We can chain these together:
    perl pmsearch -t 3650 -d 20 $DRUG1 $DRUG2 | perl pmid2text -a -i > outputFile

  • For completeness, additional functions are in the perlscripts/ directory. Normally these would be placed under lib/, but the scripts update the path variable automatically.
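The chained command above could be wrapped in a small function so it can be called once per drug pair. This is a sketch, not the repository's code: the function name `fetch_abstracts`, the output file naming, and the `PMSEARCH`/`PMID2TEXT` override variables are all hypothetical; only the `pmsearch` and `pmid2text` flags come from the text above.

```shell
#!/usr/bin/env bash
# fetch_abstracts DRUG1 DRUG2 [OUTDIR]
# Hypothetical wrapper around the pmsearch | pmid2text pipeline.
# PMSEARCH and PMID2TEXT may be set to substitute other commands (e.g. stubs).
fetch_abstracts() {
    local drug1=$1 drug2=$2 outdir=${3:-.}
    # Up to 20 abstracts (-d 20) from roughly the last 10 years (-t 3650),
    # converted to text with abstracts (-a) and no indentation (-i).
    ${PMSEARCH:-perl pmsearch} -t 3650 -d 20 "$drug1" "$drug2" \
        | ${PMID2TEXT:-perl pmid2text} -a -i \
        > "$outdir/${drug1}_${drug2}.txt"   # file name is an assumption
}
```

Calling `fetch_abstracts Warfarin Aspirin results/` would then write `results/Warfarin_Aspirin.txt`, assuming both Perl scripts are in the current directory.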

  1. View Code

    The drug combinations can be thought of as a matrix with drugs on the x and y axes.
    With n=4881 drugs, checking every cell of the matrix would take n^2 checks. However, a simple property allows us to cut this number roughly in half: a pair is unordered, so (x & y) == (y & x) when x ≠ y.
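The counts behind this argument can be checked with shell arithmetic. The function name `pair_counts` is hypothetical; it prints the full matrix size, half of it (the figure used in this README), and the exact number of unordered pairs with the x=y diagonal excluded.

```shell
#!/usr/bin/env bash
# pair_counts N: prints "N^2  N^2/2  N(N-1)/2" using integer arithmetic.
pair_counts() {
    local n=$1
    echo "$(( n * n )) $(( n * n / 2 )) $(( n * (n - 1) / 2 ))"
}

pair_counts 4881   # prints: 23824161 11912080 11909640
```

Halving the matrix gives the 11,912,080 quoted in this document; dropping the diagonal as well gives 11,909,640 distinct pairs, a difference of about 0.02%.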

    However, running 11,912,080 checks in sequence would take close to 68 days (at least if you're running it on my laptop). We needed to run lots of the checks in parallel, and we needed to split the matrix in a way that made sure each node was responsible for a roughly equal number of calculations.

    [Figure: graph displaying how a matrix has duplicates when cut in half.] This graph represents what we are interested in. The shaded red section represents duplicates and unnecessary checks where x=y. The shaded green columns show roughly equal areas: columns [A-H] have roughly the same number of checks as column [Z]. The splitting script needs to be size aware, dividing the matrix into equal parts depending on how many nodes are available to calculate.
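One way to make the split size aware is a greedy pass over the columns. This sketch is an assumption about how such a split could work, not the repository's implementation. It models column j as needing j-1 checks (one against each earlier drug) and closes a chunk once the running total reaches that node's share. The function name `partition_columns` is hypothetical.

```shell
#!/usr/bin/env bash
# partition_columns N K
# Prints one line per node: "node first_column last_column checks".
# Assumed work model: column j costs j-1 checks (pairs with earlier drugs).
partition_columns() {
    local n=$1 k=$2
    local total=$(( n * (n - 1) / 2 ))
    local node=1 start=1 acc=0 chunk=0 col
    for (( col = 1; col <= n; col++ )); do
        acc=$(( acc + col - 1 ))       # work done by all columns so far
        chunk=$(( chunk + col - 1 ))   # work in the current node's chunk
        # Close the chunk when the cumulative work reaches node/k of the
        # total, or when the last column is reached.
        if (( node < k && acc * k >= total * node )) || (( col == n )); then
            echo "$node $start $col $chunk"
            node=$(( node + 1 )); start=$(( col + 1 )); chunk=0
        fi
    done
}

partition_columns 8 2
# prints:
# 1 1 6 15
# 2 7 8 13
```

With 8 drugs and 2 nodes, the first node takes six cheap columns and the second takes two expensive ones, which mirrors the [A-H] versus [Z] picture described above.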

    Running the code with bash STABLE.txt creates 71 directories under a directory called Data/. (71 is the max number of nodes that can be allocated at once on IU's Odin Supercluster, more on that later). Each directory under Data/ is named after a number between 1 and 71. Each of these subdirectories contains a copy of STABLE.txt, drugs.txt, and a check_[1-71].

    A brief overview of each file:

  • STABLE.txt: a copy of drugslist.txt, the output of bash PubMed. It is called STABLE because no program ever alters it: it is used for copying but is never modified.
  • drugs.txt: a copy of STABLE.txt that can be altered by other scripts (file is the y axis).
  • check_[1-71]: this file is a list of drugs that a particular node is responsible for (file is the x axis). Each item in this list is checked against everything in drugs.txt until check_[1-71] is exhausted.
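The directory layout described above can be sketched as follows. This is an illustration under stated assumptions: the function name `make_node_dirs` is hypothetical, and the drugs are dealt out to the check files round-robin, which is a simple stand-in and not the size-aware split the real script performs.

```shell
#!/usr/bin/env bash
# make_node_dirs STABLE K [ROOT]
# Hypothetical helper: creates ROOT/1 .. ROOT/K, each with STABLE.txt,
# drugs.txt, and check_<node>.
make_node_dirs() {
    local stable=$1 k=$2 root=${3:-Data}
    local i
    for (( i = 1; i <= k; i++ )); do
        mkdir -p "$root/$i"
        cp "$stable" "$root/$i/STABLE.txt"   # reference copy, never modified
        cp "$stable" "$root/$i/drugs.txt"    # working copy (y axis)
        : > "$root/$i/check_$i"              # this node's drugs (x axis)
    done
    # Round-robin assignment of drugs to nodes (an assumption, see above).
    awk -v k="$k" -v root="$root" '{
        node = (NR - 1) % k + 1
        print > (root "/" node "/check_" node)
    }' "$stable"
}
```

Running `make_node_dirs STABLE.txt 71` would produce `Data/1` through `Data/71` in the current directory.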
  1. View Code

    I would not suggest running this (but if you want to, it's as easy as bash STABLE.txt). The only reason I'm including it is to explain why it's unusable.

    This script pulls abstracts sequentially, making all 11,912,080 checks by itself. This framework became the basis for the parallel version, which can run 71 checks in parallel.

  2. View Code

    bash PubMed && mv drugslist.txt STABLE.txt
    bash STABLE.txt
    srun -N 71

    Running 71 copies in parallel is the difference between taking 68 days and taking 1 day. I wrote it this way to make it extremely easy to start the code back up if it crashes: running srun -N 71 again will automatically allocate the nodes again and pick up where it previously left off.
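The 68-days-versus-1-day claim follows from simple division, sketched here with `awk`. The function name `estimate` is hypothetical, and the calculation assumes perfect scaling across nodes, which real runs will only approximate.

```shell
#!/usr/bin/env bash
# estimate DAYS NODES CHECKS
# Prints the idealized parallel run time and the implied time per check.
estimate() {
    awk -v days="$1" -v nodes="$2" -v checks="$3" 'BEGIN {
        printf "%.2f days in parallel, %.2f seconds per check\n",
               days / nodes, days * 86400 / checks
    }'
}

estimate 68 71 11912080
# prints: 0.96 days in parallel, 0.49 seconds per check
```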

    The downside is that this is so specific to running on Odin that it will not work anywhere else. The script automatically allocates 71 nodes (the maximum allowed) and tells each node to run the script on a different set of files corresponding to Data/[1-71]. Running elsewhere would likely require heavy revisions to the code, but the main section that would need to be changed is fairly small:

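function synchronize {
	# Node number from the hostname, e.g. odin007 -> 7.
	# (The original lookahead was truncated; matching digits only is an assumption.)
	HOSTNUMBER=$(hostname | grep -o -P '(?<=odin)[0-9]+' | sed 's/^0*//')

	# Record this host in the log, then count the lines logged so far.
	echo "$HOST" >> "$LOG"
	OUTPUT=$(wc --lines "$LOG" | cut -d 'L' -f 1 | cut -d 'D' -f 1)
	echo "$HOST$OUTPUT" >> "$FINAL"

	# Give the other nodes time to write, then look up this host's number.
	sleep 5
	NUMBERSTRING=$(grep "$HOST" "$FINAL" | cut -d 'u' -f 2)
	echo "$HOST is at $NUMBERSTRING"
}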
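For anyone porting this to another Slurm cluster, one possible alternative to parsing the hostname is to read the node index that `srun` exports. This is a sketch of that idea and not part of the original scripts: the function name `node_dir` is hypothetical, and it assumes the job is launched with `srun` so that `SLURM_NODEID` (a 0-based relative node ID) is set.

```shell
#!/usr/bin/env bash
# node_dir [ROOT]
# Hypothetical helper: maps the Slurm node index (0-based) to the
# 1-based directory this node should work in, e.g. Data/7.
node_dir() {
    local root=${1:-Data}
    # Falls back to node 0 when run outside Slurm.
    echo "$root/$(( ${SLURM_NODEID:-0} + 1 ))"
}
```

Each node could then run `cd "$(node_dir)"` before starting its checks, with no dependency on how the cluster names its hosts.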

Return to Top | View in Folder


##### SocialBlogs

Here you will find several attempts to automatically extract text from social blogs.

To run the scraper you will need to download and install Beautiful Soup. Also make sure you have all the necessary packages (refer to the import statements in the code). Until DailyStrength updated their website, this code ran as expected: it extracted all social blog discussions from a specified location. The update changed the links and tags a bit, which was an easy fix in Beautiful Soup. The real problem was a 'show more' button that replaced the option to go to the next page. To solve this, I needed to write a script that could interact with the web page (i.e. push the 'show more' button).

The best option (after some brief research) appeared to be Selenium WebDriver. I wrote a script in Python using the Chrome WebDriver, which can open a browser, search in it, and interact with it in other ways. This code was odd in that it would crash unexpectedly after running correctly on a similar task. It seemed to need multiple 'scroll element into view' and multiple 'scroll up to the top of the page' commands. Running it will take some time.

My final code required a lot of time.sleep() calls to pause the code and driver.implicitly_wait() calls to give the webpage time to load before completing the next step. Two output files are linked to this repository.

Return to Top | View in Folder