## Turn table S2 from the supplementary appendix of "Prediction of Susceptibility to First-Line Tuberculosis Drugs by DNA Sequencing. New England Journal of Medicine 2018;0:null. doi:10.1056/NEJMoa1800474" into a CSV file


Per the table:

"Gene coordinates based on NC_000962.2, with 100 nucleotide positions upstream of each gene read as well. Mutations characterised as 'S' in Walker et al but as 'R' by another source, were characterised as 'R'. Insertions and deletions characterise in Walker et al were re-computed from that data for this study
to ensure that the same version of Cortex was used for both data sets. These indels may therefore differ a little from those published in Walker et al."

In [None]:
#install.packages(c("pdftools", "tidyverse"))

In [None]:
library(tidyverse)

In [None]:
library(pdftools)

In [None]:
sessionInfo()

# Load the PDF file

You'll need a PDF copy of the supplementary appendix. Here it keeps the name it has when downloaded from nejm.org: "nejmoa1800474_appendix.pdf".

In [None]:
supplementary_appendix_text <- pdf_text("nejmoa1800474_appendix.pdf")

`pdf_text()` returns one string per page. Split each page into lines

In [None]:
supplementary_appendix_text_pages <- 
    strsplit(supplementary_appendix_text, "\n")
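
To make the resulting structure concrete without needing the PDF, here is a minimal sketch on two made-up "pages". The strings in `toy_pages` are invented for illustration; only the shape (one string per page, newline-separated lines) mirrors what `pdf_text()` returns.

```r
# Two made-up "pages" standing in for the output of pdf_text()
toy_pages <- c("Table of contents\nTable S2 ... 96",
               "Table S2\nDrug Mutation")

# Splitting on newlines yields a list: one character vector of lines per page
toy_lines <- strsplit(toy_pages, "\n")

length(toy_lines)    # number of pages
lengths(toy_lines)   # number of lines on each page
toy_lines[[2]][1]    # first line of the second page
```

So the length of the list is the page count, and each element holds that page's lines.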

In [None]:
length(supplementary_appendix_text_pages)

126 pages. Let's search them for the beginning of our table

In [None]:
which(grepl("Table S2", supplementary_appendix_text_pages, fixed=TRUE))
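
Calling `grepl()` directly on a list works here because each list element is coerced to a single string before matching. A more explicit per-page search could look like the sketch below; `toy_lines` is made-up data, not the real appendix.

```r
# Made-up pages, each a character vector of lines
toy_lines <- list(c("Contents", "Table S2 ... 96"),
                  c("Some text"),
                  c("Table S2", "Drug Mutation"))

# TRUE for every page where at least one line contains the pattern
has_table <- vapply(toy_lines,
                    function(page) any(grepl("Table S2", page, fixed = TRUE)),
                    logical(1))

which(has_table)   # indices of the matching pages
```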

Two hits. Page 2 is probably the table of contents. Let's look at page 96

In [None]:
supplementary_appendix_text_pages[96]

Looks good. Let's find the end where table S3 starts.

In [None]:
which(grepl("Table S3", supplementary_appendix_text_pages, fixed=TRUE))

In [None]:
supplementary_appendix_text_pages[110]

Okay, we need to extract pages 96 to 109

In [None]:
tableS2 <- supplementary_appendix_text_pages[96:109]

In [None]:
head(tableS2[[1]], 10)

Currently it's a list with one character vector of lines per page. Let's flatten it into a single vector of lines

In [None]:
tableS2 <- unlist(tableS2, recursive=FALSE)

In [None]:
which(startsWith(tableS2, "Drug"))

The table starts on element 7

In [None]:
tableS2[7]

Start to set up a dataframe

In [None]:
column_names <- c("Drug", "Mutation", "Comment", "Characterisation", "Source")

Now to remove the header lines

In [None]:
tableS2 <- tableS2[8:length(tableS2)]

In [None]:
head(tableS2)

And get rid of the page number lines

In [None]:
tableS2[which(startsWith(tableS2, " "))]

What's with that "Yadon" line?

In [None]:
tableS2[1058:1060]

Weird misformat. No idea where that 1919 came from

In [None]:
tableS2[1059] <- "Pyrazinamide pncA_E174G    R Yadon et. al., Nature Communications 2017 Sep 19"

In [None]:
tableS2 <- tableS2[!startsWith(tableS2, " ")]
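
Dropping every line that starts with a space relies on page-number lines being the only indented ones. A narrower filter, assuming page-number lines contain nothing but whitespace and digits (an assumption, not checked against the PDF), might look like this sketch on made-up lines:

```r
# Made-up lines: two table rows, one page-number line, one empty line
toy <- c("DrugA geneA_X1Y R Some source",
         "                    97",
         "DrugB geneB_X2Y S Some source",
         "")

# A page-number line is optional whitespace, digits, optional whitespace
is_page_number <- grepl("^\\s*[0-9]+\\s*$", toy)

# Keep lines that are neither page numbers nor blank
cleaned <- toy[!is_page_number & nzchar(trimws(toy))]
cleaned
```

Unlike negative indexing with `which()`, a logical index also behaves sensibly when nothing matches.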

There's also a typo where a column gets cut off

In [None]:
which(grepl("codon 425", tableS2, fixed=TRUE))

In [None]:
tableS2[746]

The ending codon was cut off. This is the rpoB RRDR (rifampicin resistance-determining region). The Hain MTBDRplus assay actually covers codons 424-452, so let's adjust

In [None]:
tableS2[746] <- 'Rifampicin   rpoB: Any amino acid substitution or insertion/deletion from codon 424 to codon 452 R WHO endorsed line probe assays / Xpert MTB/RIF'

Now for a function that reads each line and turns it into a vector. The tricky part is that some rows carry extra text: an entry in the indels column, a comment when any mutation in a codon range is considered resistant, or a "phylogenetic" comment. We have to detect these and put them into a comments field

In [None]:
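# Parse one table line into c(Drug, Mutation, Comment, Characterisation, Source)
readEntries <- function(line) {
    words <- unlist(strsplit(line, "\\s+"))
    comment <- NA
    # Indel rows carry an extra word after the mutation; keep it as the comment
    if (endsWith(words[2], "indel")) {
        comment <- words[3]
        words <- words[-3]
    }
    if (endsWith(words[2], ":")) {
        # Codon-range rows ("gene: free text ... R source"): collect the free
        # text up to the standalone R/S characterisation as the comment
        comment <- words[3]
        j <- 4
        while (j <= length(words) && !(words[j] %in% c("R", "S"))) {
            comment <- paste(comment, words[j])
            j <- j + 1
        }
        # The source starts after the characterisation, hence j + 1
        output <- c(words[1], substring(words[2], 1, nchar(words[2]) - 1), comment, words[j],
                    paste(words[(j + 1):length(words)], collapse=" "))
    } else if (startsWith(words[4], "(Phylogenetic")) {
        output <- c(words[1:2], "Phylogenetic SN", words[3], "Walker TM et. al., Lancet Infect Dis. 2015 Jun 23")
    } else {
        output <- c(words[1:2], comment, words[3], paste(words[4:length(words)], collapse=" "))
    }
    # Fallback for rows where the R/S letter is fused onto the end of the
    # comment: split off the last character as the characterisation
    if (output[4] != "R" && output[4] != "S") {
        output <- c(output[1:2], substring(output[3], 1, nchar(output[3]) - 1),
                    substring(output[3], nchar(output[3]), nchar(output[3])),
                    paste(words[3:length(words)], collapse=" "))
        print(output)  # show the rows that needed the fallback
    }
    return(output)
}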
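
For the simplest row shape (drug, mutation, a single R or S, then free-text source), the same split can also be written as one regular expression. This is only a sketch: `parse_simple_line` is a hypothetical helper, and it deliberately ignores the indel, codon-range and phylogenetic rows discussed above.

```r
# Hypothetical helper for rows of the form: Drug Mutation R|S Source...
parse_simple_line <- function(line) {
    pattern <- "^(\\S+)\\s+(\\S+)\\s+([RS])\\s+(.*)$"
    m <- regmatches(line, regexec(pattern, line))[[1]]
    if (length(m) == 0) {
        return(NULL)  # line does not have the simple shape
    }
    # m[1] is the full match; m[2:5] are the four capture groups
    c(Drug = m[2], Mutation = m[3], Comment = NA,
      Characterisation = m[4], Source = m[5])
}

parsed <- parse_simple_line(
    "Pyrazinamide pncA_E174G    R Yadon et. al., Nature Communications 2017 Sep 19")
parsed[["Mutation"]]
parsed[["Source"]]
```

Returning `NULL` for non-matching lines makes it easy to see which rows still need special handling.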

In [None]:
entries <- lapply(tableS2, readEntries)

In [None]:
entries

In [None]:
mutations_df <- data.frame(matrix(unlist(entries), nrow=length(entries), byrow=TRUE), stringsAsFactors=FALSE)

In [None]:
mutations_df <- setNames(mutations_df, column_names)

In [None]:
mutations_df
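
The cells above stop at the data frame, so the CSV itself still has to be written. A minimal sketch with base R's `write.csv()` follows; `toy_df` is made-up data with the same columns, and the file name `tableS2.csv` is a hypothetical choice.

```r
# Made-up rows with the same five columns as mutations_df
toy_df <- data.frame(
    Drug = c("DrugA", "DrugB"),
    Mutation = c("geneA_X1Y", "geneB_X2Y"),
    Comment = c(NA, "indel"),
    Characterisation = c("R", "S"),
    Source = c("Source one, 2015", "Source two"),
    stringsAsFactors = FALSE
)

# Write without row names; missing comments become empty fields
out_file <- file.path(tempdir(), "tableS2.csv")
write.csv(toy_df, out_file, row.names = FALSE, na = "")

# Read it back to check that commas inside Source survived the round trip
round_trip <- read.csv(out_file, stringsAsFactors = FALSE, na.strings = "")
round_trip$Source
```

For the real table the equivalent call would be along the lines of `write.csv(mutations_df, "tableS2.csv", row.names = FALSE)`.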

# Known Issues:

The following deletions got truncated in the source PDF:

1. ahpC_-400_588+586_del_GGTGGCCAGCCACACCGCCCGGGTGTTG
2. pncA_-1526_561+4428_del_GCGTTGGGGTGTCTTGACCTGTCGTCC
3. pncA_-745_492_del_TGCGCTGGTCGGGTTTCGGCGCCACCCATGCC

Not all of the mutations are machine readable


In [None]:
unique(mutations_df$Source)

In [None]:
mutations_df[mutations_df$Source == ]
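
Selecting the rows that come from a single source is a logical row index on the `Source` column; the trailing comma matters because it selects rows rather than columns. The sketch below uses made-up data, and the value in `wanted` is a hypothetical placeholder rather than one of the real sources.

```r
# Made-up data with a Source column
toy_df <- data.frame(
    Mutation = c("geneA_X1Y", "geneB_X2Y", "geneC_X3Y"),
    Source = c("Source one", "Source two", "Source one"),
    stringsAsFactors = FALSE
)

wanted <- "Source one"  # hypothetical value; substitute one from unique(...)

# Row subset: note the comma after the condition
subset_df <- toy_df[toy_df$Source == wanted, ]
subset_df

# Row counts per source, as a quick overview
table(toy_df$Source)
```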