scraping antibiotics from wikipedia
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
Data
Images
README.md
wikipedia_tables.Rmd

README.md

Scraping wikipedia tables

wikilogo

I recently ran into an issue in which I needed to create a long regular expression to match medications (antibiotics, specifically) in a giant medications table from an electronic health record (EHR). I wasn’t given a list of antibiotics to look for, but I did know this would end up as a binary (yes/no) variable in the final analytic data set.

This tutorial will walk you through scraping a table on Wikipedia and converting it’s contents into a regular expression, then using that regular expression to match strings in a different table.

An Example EHR medications table

I was able to find an example medications table from data.world. The file is titled, hoyt/Medications/Prescriptions in the “LibreHealth Educational EHR Project (LEEP)” project.

First load the packages:

# load tidyverse and friends
library(tidyverse)
## ── Attaching packages ─────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.4.2     ✔ dplyr   0.7.4
## ✔ tidyr   0.8.0     ✔ stringr 1.3.0
## ✔ readr   1.1.1     ✔ forcats 0.3.0

## ── Conflicts ────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(magrittr)
##
## Attaching package: 'magrittr'

## The following object is masked from 'package:purrr':
##
##     set_names

## The following object is masked from 'package:tidyr':
##
##     extract
library(xml2)
library(rvest)
##
## Attaching package: 'rvest'

## The following object is masked from 'package:purrr':
##
##     pluck

## The following object is masked from 'package:readr':
##
##     guess_encoding

I load the EHR data below:

# fs::dir_ls("Data")
EHRMedTable <- read_csv("Data/raw_data/RXQ_RX_G.csv")
## Parsed with column specification:
## cols(
##   id = col_integer(),
##   seqn = col_integer(),
##   rxduse = col_integer(),
##   rxddrug = col_character(),
##   rxddrgid = col_character(),
##   rxqseen = col_integer(),
##   rxddays = col_integer(),
##   rxdcount = col_integer()
## )
EHRMedTable %>% glimpse(60)
## Observations: 18,704
## Variables: 8
## $ id       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,...
## $ seqn     <int> 62161, 62161, 62162, 62163, 62164, 621...
## $ rxduse   <int> 1, 1, 2, 2, 1, 2, 1, 2, 2, 2, 2, 1, 2,...
## $ rxddrug  <chr> "FLUOXETINE", "METHYLPHENIDATE", NA, N...
## $ rxddrgid <chr> "d00236", "d00900", NA, NA, "d00181", ...
## $ rxqseen  <int> 2, 2, NA, NA, 2, NA, 1, NA, NA, NA, NA...
## $ rxddays  <int> 5840, 5840, NA, NA, 9125, NA, 547, NA,...
## $ rxdcount <int> 2, 2, NA, NA, 1, NA, 1, NA, NA, NA, NA...

Find antibiotic exposures on wikipedia

I googled “list of antibiotics” and found this wikipedia page. I’m only interested in the column titled, “Generic Name”, but I will download the entire table.

list_of_antibiotics

The first function I’ll use comes from the xml2 package. xml2::read_html() loads the html from the Wikipedia page into an R object I call wiki_html.

wiki_html <- xml2::read_html("https://en.wikipedia.org/wiki/List_of_antibiotics")

I always like to check the structure of new objects so I know what I’m working with. The structure of wiki_html object is below.

wiki_html %>% str()
## List of 2
##  $ node:<externalptr>
##  $ doc :<externalptr>
##  - attr(*, "class")= chr [1:2] "xml_document" "xml_node"

I can see this is a list of two objects (a node and a doc).

I want the html node, so I will use a function from the rvest package. The css argument is set to "table". Once again I check the structure of the output object.

wiki_html_tables <- wiki_html %>% rvest::html_nodes(css = "table")
wiki_html_tables %>% str()
## List of 3
##  $ :List of 2
##   ..$ node:<externalptr>
##   ..$ doc :<externalptr>
##   ..- attr(*, "class")= chr "xml_node"
##  $ :List of 2
##   ..$ node:<externalptr>
##   ..$ doc :<externalptr>
##   ..- attr(*, "class")= chr "xml_node"
##  $ :List of 2
##   ..$ node:<externalptr>
##   ..$ doc :<externalptr>
##   ..- attr(*, "class")= chr "xml_node"
##  - attr(*, "class")= chr "xml_nodeset"

This is a list of three lists, each of them xml_nodes.

Use grep to find relevant tables

In order to find the relevant tables in the wiki_html_tables object, I need to be able to search on something. Fortunately, the base::grep() function can be used in combination with sub-setting to extract the relevant_tables from wiki_html_tables.

Get the relevant tables from the xml_nodeset in wiki_html_tables.

relevant_tables <- wiki_html_tables[grep("Generic name", wiki_html_tables)]
relevant_tables %>% str()
## List of 2
##  $ :List of 2
##   ..$ node:<externalptr>
##   ..$ doc :<externalptr>
##   ..- attr(*, "class")= chr "xml_node"
##  $ :List of 2
##   ..$ node:<externalptr>
##   ..$ doc :<externalptr>
##   ..- attr(*, "class")= chr "xml_node"
##  - attr(*, "class")= chr "xml_nodeset"

This returned yet another list of lists (all xml_nodes). I need to use rvest::html_table() with bracket sub-setting to explore this object and learn about it’s contents. I will start with position [[1]] and set fill = TRUE.

I also use dplyr::glimpse(60)

rvest::html_table(relevant_tables[[1]], fill = TRUE) %>%
    dplyr::glimpse(60)
## Observations: 168
## Variables: 5
## $ `Generic name`             <chr> "Aminoglycosides", "...
## $ `Brand names`              <chr> "Aminoglycosides", "...
## $ `Common uses[3]`           <chr> "Aminoglycosides", "...
## $ `Possible side effects[3]` <chr> "Aminoglycosides", "...
## $ `Mechanism of action`      <chr> "Aminoglycosides", "...

Looks like this is the right table! I will assign it to a data frame titled, “WikiAntiB” and check the base::names().

WikiAntiB <- html_table(relevant_tables[[1]], fill = TRUE)
WikiAntiB %>% names()
## [1] "Generic name"             "Brand names"
## [3] "Common uses[3]"           "Possible side effects[3]"
## [5] "Mechanism of action"

The WikiAntiB table has all the antibiotics in the Wikipedia table. I’m wanting to split the Generic name column and take the first word (antibiotic) in the table.

But before I do that, I am going to give both tables snake_case variable names and reduce the EHRMedTable table to only id and med and call this smaller data frame EHRMeds. I also remove the missing meds from EHRMeds.

WikiAntiB <- WikiAntiB %>%
    dplyr::select(
        generic_name = `Generic name`,
        brand_name = `Brand names`,
        common_uses = `Common uses[3]`,
        poss_side_effects = `Possible side effects[3]`,
        mech_action = `Mechanism of action`)
WikiAntiB %>% dplyr::glimpse(60)
## Observations: 168
## Variables: 5
## $ generic_name      <chr> "Aminoglycosides", "Amikacin"...
## $ brand_name        <chr> "Aminoglycosides", "Amikin", ...
## $ common_uses       <chr> "Aminoglycosides", "Infection...
## $ poss_side_effects <chr> "Aminoglycosides", "Hearing l...
## $ mech_action       <chr> "Aminoglycosides", "Binding t...

Clean ERHMeds.

EHRMeds <- EHRMedTable %>%
    dplyr::select(
        id,
        med = rxddrug)
# remove missing
EHRMeds <- EHRMeds %>% filter(!is.na(med))
EHRMeds %>% dplyr::glimpse(60)
## Observations: 12,957
## Variables: 2
## $ id  <int> 1, 2, 5, 7, 12, 14, 15, 16, 17, 18, 19, 22,...
## $ med <chr> "FLUOXETINE", "METHYLPHENIDATE", "BUPROPION...

Prepping strings to be used as regex

The information generic_name isn’t quite ready to do regex pattern match on. For example, the first five lines in the med column in EHRMeds look like this:

EHRMeds$med %>% dplyr::glimpse(80)
##  chr [1:12957] "FLUOXETINE" "METHYLPHENIDATE" "BUPROPION" ...

These are in all caps, so I should convert them to lower case to make them easier to match on using dplyr::mutate() and stringr::str_to_lower().

EHRMeds <- EHRMeds %>%
    dplyr::mutate(med = stringr::str_to_lower(med))
EHRMeds$med %>% dplyr::glimpse(80)
##  chr [1:12957] "fluoxetine" "methylphenidate" "bupropion" ...

Now I need to make sure the generic_names in WikiAntiB can be used to search in the med column in EHRMeds. The pipeline below is long, but the comments describe each step so you should be able to follow along. If not, look in the help file for each function.

WikiAntiBGenName <- WikiAntiB %>%
    dplyr::select(generic_name) %>%
    tidyr::separate(col = generic_name, # select this column to split
                   into = c("gen_name", "etc"), # put medication in gen_name
                    sep = " ", # separate them on the whitespace
                    remove = FALSE) %>% # but keep the original variable
                                    # to verify contents here with head()
    dplyr::select(gen_name) %>% # then take new gen_name (first med)
    dplyr::distinct(gen_name) %>% # get the distinct values
    dplyr::mutate(gen_name =
                        str_to_lower(gen_name), # convert to lowercase
          gen_name = str_remove_all(gen_name,
                                    pattern = "\\(bs\\)"), # remove (bs)
          gen_name = str_replace_all(gen_name,
                                    pattern = "\\/", # replace "/"
                                    replacement = "_"), # w/ underscore
          gen_name = str_replace_all(gen_name,
                                    pattern = "-", # replace "-"
                                    replacement = "_")) %>% # w/ underscore
    tidyr::separate(col = gen_name, # split the new column again, this time
                    into = c("gen_name1", "gen_name2"), # into 2 gen_names
                    sep = "_", # on the underscore we put there ^
                    remove = FALSE) %>% # but keep the old variable
    tidyr::gather(key = gen_name_key, # now get all gen_name meds
                  value = gen_name_value, # into one column
                  gen_name1:gen_name2) %>%
    dplyr::filter(!is.na(gen_name_value)) %>% # remove missing
    dplyr::select(generic_name = gen_name_value) # and rename/minimize

Inspect this new data frame with a single column.

WikiAntiBGenName %>% dplyr::glimpse(60)
## Observations: 161
## Variables: 1
## $ generic_name <chr> "aminoglycosides", "amikacin", "ge...

Check out the unique values of generic_name with base::unique(), utils::head(), and base::writeLines() because these cause R to print the output to the RStudio Notebooks in a useful way.

WikiAntiBGenName$generic_name %>%
    base::unique() %>%
    base::writeLines()

unique_generic_name_writeLines

Note the entry between cefazolin and cefalexin is empty–remove it using dplyr::filter(generic_name != "").

WikiAntiBGenName <- WikiAntiBGenName %>%
    dplyr::filter(generic_name != "")

Now I can put this into a vector so it can be converted into a regular expression. The stringrr::str_c() function and regular expression symbols (+, ? , |) are covered in depth in R for Data Science and a little here on Storybench, so I won’t go into them too much. Just know this is how I construct a pattern to match on in the EHRMeds table.

antibiotic_med <- WikiAntiBGenName$generic_name %>% base::unique()
# collapse to regex
antibiotic_med <- stringr::str_c(antibiotic_med, collapse = "+?|")
antibiotic_med <- base::paste0(antibiotic_med, "+?")

Searching for string patterns with stringr::str_detect()

The stringr package comes with a handy str_detect() function that can be dropped inside dplyr::filter() to look through rows in a data frame for pattern matches. This function takes an input string (med in EHRmeds in this case), and a pattern (antibiotic_med, which we just created). When it’s inside filter(), it will return the rows that match the pattern.

First I check the number of distinct meds in EHRMeds with dplyr::distinct() and base::nrow(), then I test my pattern match with dplyr::filter(stringr::str_detect().

# check rows so I know I'm not fooling myself
EHRMeds %>%
    dplyr::distinct(med) %>%
    base::nrow() # [1] 701
## [1] 701
EHRMeds %>%
    dplyr::filter(stringr::str_detect(string = med,
                                      pattern = antibiotic_med)) %>%
    dplyr::distinct(med) %>%
    base::nrow() # [1] 53
## [1] 53

When I see it’s working (no errors), I assign it to EHRAntiBMeds and rename med to antib_med.

# now assign to new data frame!
EHRAntiBMeds <- EHRMeds %>%
    dplyr::filter(stringr::str_detect(med,
                               antibiotic_med)) %>%
    dplyr::select(id,
                  antib_med = med)

Now I can look in EHRAntiBMeds for the base::unique() medications (antib_med) to see if they all look like antibiotics.

EHRAntiBMeds$antib_med %>%
    base::unique()
##  [1] "rifaximin"
##  [2] "amoxicillin"
##  [3] "hydrocortisone; neomycin; polymyxin b otic"
##  [4] "trimethoprim"
##  [5] "cefdinir"
##  [6] "clindamycin"
##  [7] "azithromycin"
##  [8] "sulfamethoxazole"
##  [9] "ethambutol"
## [10] "pyrazinamide"
## [11] "minocycline"
## [12] "sulfamethoxazole; trimethoprim"
## [13] "cefixime"
## [14] "polymyxin b; trimethoprim ophthalmic"
## [15] "dexamethasone; tobramycin ophthalmic"
## [16] "cefuroxime"
## [17] "doxycycline"
## [18] "amoxicillin; clavulanate"
## [19] "erythromycin topical"
## [20] "ciprofloxacin"
## [21] "vancomycin"
## [22] "penicillin v potassium"
## [23] "silver sulfadiazine topical"
## [24] "penicillin"
## [25] "moxifloxacin ophthalmic"
## [26] "gatifloxacin ophthalmic"
## [27] "metronidazole"
## [28] "ciprofloxacin; dexamethasone otic"
## [29] "erythromycin ophthalmic"
## [30] "gentamicin ophthalmic"
## [31] "azithromycin ophthalmic"
## [32] "tetracycline"
## [33] "ofloxacin ophthalmic"
## [34] "ciprofloxacin ophthalmic"
## [35] "dexamethasone; neomycin; polymyxin b ophthalmic"
## [36] "chloramphenicol"
## [37] "mupirocin topical"
## [38] "isoniazid"
## [39] "levofloxacin"
## [40] "nitrofurantoin"
## [41] "moxifloxacin"
## [42] "benzoyl peroxide; clindamycin topical"
## [43] "sulfacetamide sodium ophthalmic"
## [44] "neomycin; polymyxin b sulfate topical"
## [45] "sulfasalazine"
## [46] "metronidazole topical"
## [47] "clarithromycin"
## [48] "cefprozil"
## [49] "clindamycin topical"
## [50] "polymyxin b sulfate"
## [51] "ofloxacin otic"
## [52] "tobramycin ophthalmic"
## [53] "dapsone topical"

Join this back to EHRMedTable

If I want to join the antibiotic medication used by the patient (identified with id) I can join this back to EHRMedTable.

EHRMedTable <- EHRMedTable %>%
    dplyr::left_join(., EHRAntiBMeds, by = "id")
EHRMedTable %>% glimpse(60)
## Observations: 18,704
## Variables: 9
## $ id        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12...
## $ seqn      <int> 62161, 62161, 62162, 62163, 62164, 62...
## $ rxduse    <int> 1, 1, 2, 2, 1, 2, 1, 2, 2, 2, 2, 1, 2...
## $ rxddrug   <chr> "FLUOXETINE", "METHYLPHENIDATE", NA, ...
## $ rxddrgid  <chr> "d00236", "d00900", NA, NA, "d00181",...
## $ rxqseen   <int> 2, 2, NA, NA, 2, NA, 1, NA, NA, NA, N...
## $ rxddays   <int> 5840, 5840, NA, NA, 9125, NA, 547, NA...
## $ rxdcount  <int> 2, 2, NA, NA, 1, NA, 1, NA, NA, NA, N...
## $ antib_med <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...

And now I can see the counts of the top ten antibiotic medications in EHRMedTable

EHRMedTable %>%
    dplyr::filter(!is.na(antib_med)) %>%
    dplyr::count(antib_med) %>%
    dplyr::distinct(id, .keep_all = TRUE) %>%
    dplyr::arrange(desc(n)) %>%
    utils::head(10)
## # A tibble: 10 x 2
##    antib_med                          n
##    <chr>                          <int>
##  1 amoxicillin                      126
##  2 azithromycin                      40
##  3 sulfamethoxazole; trimethoprim    33
##  4 doxycycline                       26
##  5 amoxicillin; clavulanate          23
##  6 minocycline                       16
##  7 cefdinir                          14
##  8 ciprofloxacin                     13
##  9 penicillin v potassium            13
## 10 nitrofurantoin                    11

BONUS! Using stingr::str_detect() within dplyr::case_when()

If you noticed the sulfamethoxazole; trimethoprim entry in the top ten table print-out above, you might want a variable that indicates there are more than 1 medications listed in the antib_med column. Well fortunately dplyr::case_when() works well with stingr::str_detect() because the result is logical. See how I use it below to create the new variable antib_2meds.

# test -----
EHRMedTable %>%
    dplyr::mutate(antib_2meds = dplyr::case_when(
        stringr::str_detect(antib_med, ";") ~ "2 antibiotic meds",
        !stringr::str_detect(antib_med, ";") ~ "1 antibiotic meds",
        TRUE ~ NA_character_)) %>%
    dplyr::count(antib_2meds, antib_med) %>%
    tidyr::spread(antib_2meds, n) %>%
    head(10)
## # A tibble: 10 x 4
##    antib_med                   `1 antibiotic med… `2 antibiotic me… `<NA>`
##    <chr>                                    <int>             <int>  <int>
##  1 amoxicillin                                126                NA     NA
##  2 amoxicillin; clavulanate                    NA                23     NA
##  3 azithromycin                                40                NA     NA
##  4 azithromycin ophthalmic                      2                NA     NA
##  5 benzoyl peroxide; clindamy…                 NA                 4     NA
##  6 cefdinir                                    14                NA     NA
##  7 cefixime                                     2                NA     NA
##  8 cefprozil                                    1                NA     NA
##  9 cefuroxime                                   3                NA     NA
## 10 chloramphenicol                              1                NA     NA
# assign -----
EHRMedTable <- EHRMedTable %>%
    dplyr::mutate(antib_2meds = dplyr::case_when(
        stringr::str_detect(antib_med, ";") ~ "2 antibiotic meds",
        !stringr::str_detect(antib_med, ";") ~ "1 antibiotic meds",
        TRUE ~ NA_character_))

I hope you find this tutorial helpful! This pipeline could be used if you had lists stored in other files, too (Excel files, googlesheets, etc.)

Be sure to export the data files!

fs::dir_ls("Data")
## Data/processed_data Data/raw_data
if (!file.exists("Data/processed_data")) {
    dir.create("Data/processed_data")
}
EHRMedTable_outfile <- paste0("Data/processed_data/",
                              "EHRMedTable",
                              as.character(Sys.Date()),
                              ".csv")
write_csv(as_data_frame(EHRMedTable), EHRMedTable_outfile)

Closing thoughts

This tutorial was inspired by the text " Teaching Statistics: A Bag of Tricks " from Andrew Gelman and Deborah Nolan.

The following quote is from a section titled, “learn how to learn new technologies”,

In the future, our students (and statisticians in general) will encounter an ever-changing array of novel technologies, data formats, and programming languages. For this reason, we believe it is important for our students to have the skills needed to learn about new technologies. We try to model how to learn about technologies in our course so that our students can continue to be facile with the computer, access data from various new sources, apply the latest statistical methodologies, and communicate their findings to others in novel ways and via new media.

I do not recall this sentiment being taught (or explicitly stated) in any college course. And in a book filled with gems of statistical pedagogical techniques, it still stands out. The older I get, the more I see the need to ‘learn how to learn things efficiently.’ I added efficiently because it’s not likely you will have the time to attend a college course or seminar on every topic you will need to know.

I highly recommend this book (and every other book written by Nolan and Gelman) to anyone interested in improving their ability to teach (and learn) statistics and data science.

The data for this tutorial is available here