Support for mzTab v1.0 in readMzTabData and writeMzTabData #41

Closed
richierocks opened this Issue Jan 8, 2015 · 6 comments

@richierocks

Of the twelve example files found on the mzTab homepage, readMzTabData successfully reads only eight of them.

library(MSnbase)
library(plyr)

zip_file <- "examples.zip"
download.file(
  "http://www.ebi.ac.uk/pride/resources/tools/jmztab/latest/examples.zip", 
  zip_file
)
example_dir <- "mzTab_examples"
unzip(zip_file, exdir = example_dir)
files <- dir(example_dir, full.names = TRUE)
length(tryapply(files, readMzTabData)) # 8

Tested with MSnbase 1.14.1, though the development version here also contains the warning

Support for mzTab version 0.9 only. Support will be added soon.

To further the adoption of the new file format in R workflows, it would be very useful if MSnbase supported the current (v1.0) spec for reading and writing.

lgatto (Owner) commented Jan 8, 2015

Thanks for your interest. I will move this up on my todo list and implement support for the latest version in the coming weeks. Feel free to send a pull request if you find time to work on this before me.

lgatto self-assigned this Jan 8, 2015

lgatto (Owner) commented Jan 9, 2015

A small clarification regarding your issue: the 4 files that do not work are those that encode small molecules (SML), i.e. neither proteins (PRT) nor peptide-spectrum matches (PSM).

> files[c(1, 6, 7, 8)]
[1] "mzTab_examples/Cytidine.mzTab"                        
[2] "mzTab_examples/lipidomics-HFD-LD-study-PL-DG-SM.mzTab"
[3] "mzTab_examples/lipidomics-HFD-LD-study-TG.mzTab"      
[4] "mzTab_examples/MTBLS2.mztab"                          

Within the scope of the MSnbase package, which is aimed at proteomics data, small molecules were never meant to be supported. The PRT/PSM sections are not parsed properly (they end up in the feature metadata instead of the assay slot), so that certainly needs to be fixed.

@sneumann mentioned that he had plans to work on an updated parser for version 1.0 and move it to mzR; Steffen, any news?
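
For a quick sanity check, one can tabulate the section identifiers present in each file (a sketch, reusing the files vector from the snippet at the top of this issue):

## List which three-letter section identifiers (MTD, COM, PRT, PEP, PSM,
## SML, ...) each example file contains; the four failing files should show
## SMH/SML lines but no PRT/PEP/PSM lines.
sections <- lapply(files, function(f) {
  ids <- substring(readLines(f), 1, 3)
  sort(unique(ids[nzchar(ids)]))
})
names(sections) <- basename(files)
sections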


sneumann commented Jan 9, 2015

Hi, yes, mzTab 1.0 is still on my very long list. I'd be happy, Richard, to discuss what it should look like and to guide you through the implementation.


richierocks commented Jan 10, 2015

It seems to me that the problems of "reading in mzTab files" and "fitting the contents of mzTab files into an MSnSet object" are slightly different. So I think that readMzTabData should be constructed to read any mzTab file, storing the results as a list of data frames or data tables. Then conversion to MSnSet should be handled separately by an as.MSnSet function.

Here's a simple(ish) implementation that works on all 12 example files. It's fast (0.3 seconds to read the largest example file on a three-year-old laptop) thanks to stringi::stri_read_lines and data.table::fread.

General strategy:

  1. Read the file in as a character vector with stringi::stri_read_lines.
  2. Split based upon the first two characters in each line.
  3. Store comments as a character vector in the result's comment attribute.
  4. Use data.table::fread to parse the tab delimited lines.
  5. Rearrange the metadata content to be a named list, where each value is either a string or a character vector of length 4 (for controlled vocab items).
  6. Return as an S3 class for general usage or convert to MSnSet for compatibility with existing workflow.

I've kept the code fairly simple since there are some design decisions to be made, especially about what to do with malformed files.

For example, if there are multiple instances of the PRT block, should you throw an error, warn the user, or silently read the file, concatenating the blocks?

What if there are unknown line identifiers ("ZZZ" or whatever), or commas rather than tabs delimiting fields? Clearly there are lots of things that could go wrong, and I don't know whether there is an official way to proceed, or whether you have preferences one way or another. In general it is good practice to accept anything when reading (but be strict about the spec when writing).
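
One lenient option, sketched below as a hypothetical helper (not part of the implementation that follows), would be to warn about and drop unrecognised lines rather than stop:

## Hypothetical helper: a lenient alternative to erroring on unknown line
## identifiers is to warn and drop the offending lines.
dropUnknownLines <- function(lines) {
  allowed <- c("MTD", "COM", "PRH", "PRT", "PEH", "PEP",
               "PSH", "PSM", "SMH", "SML")
  ok <- substring(lines, 1, 3) %in% allowed
  if (any(!ok)) {
    warning(sum(!ok), " line(s) with unknown identifiers were dropped")
  }
  lines[ok]
}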

Anyway, here's the code. Let me know if you think it's worth pursuing further.

You'll need

library(data.table)
library(stringi)

though these will obviously go in the package imports once it's done properly. This looks long, but it's mostly comments.

#' @param file A string containing the path to an mzTab file.
#' @param returnType Return as an \code{MSnSet} or a list of 
#' \code{data.table}s.
#' @param verbose A logical. Currently unused.
#' @importFrom stringi stri_read_lines
#' @export
readMzTabData <- function(file, returnType = c("MSnSet", "list"), verbose = getOption("verbose"))
{
  # TODO: verbose argument currently unused.  Make it 
  # display some messages.

  returnType <- match.arg(returnType)

  # Like readLines, but way faster
  lines <- stri_read_lines(file)
  lines <- lines[nzchar(lines)]

  # Split on the first two characters (so headers stay in
  # the same group as table content rows)
  lineType <- substring(lines, 1, 2)

  # Could be stricter and check that the full three-character
  # identifiers match the 10 allowed types (MTD, COM, PRH/PRT,
  # PEH/PEP, PSH/PSM, SMH/SML), but since it doesn't affect
  # parsing, I don't think it's worth bothering.
  allowed_types <- c(
    "CO", "MT", "PR", "PE", "PS", "SM"
  )
  stopifnot(all(lineType %in% allowed_types))
  linesByType <- split(lines, lineType)

  # Comments are easy: just strip the first four characters
  # from each line.  Though is it important to record the 
  # position in the file where they were found?
  comments <- substring(linesByType[["CO"]], 5)

  # Parse the other five blocks in a loop, then fix up 
  # metadata afterwards
  res <- setNames(
    lapply(
      linesByType[c("MT", "PR", "PE", "PS", "SM")],
      readTab
    ),
    c("Metadata", "Proteins", "Peptides", "propensityScores", "SmallMolecules")
  )
  res[["Metadata"]] <- reshapeMetadata(res[["Metadata"]])

  comment(res) <- comments

  # For general usage, returning a simple list is fine
  # For MSnbase workflows, should preserve existing behaviour 
  # of returning an MSnSet object.
  if(returnType == "MSnSet")
  {
    return(as.MSnSet(res))  #TODO: write as.MSnSet
  }
  res # maybe structure(res, class = "mzTab")
}

#' @param x A character vector of lines from one mzTab section.
#' @return A data table.
#' @importFrom data.table data.table fread setDT
readTab <- function(x)
{
  # Passing data to fread as a string doesn't work for < 2 lines, so
  # we have to handle these cases separately.
  if(length(x) == 0)
  {
    return(data.table())
  }
  if(length(x) == 1)
  {
    # [, -1] drops the identifier column at the start of the line,
    # which we no longer care about.
    return(setDT(read.delim(text = x)[, -1]))
  }
  # It isn't clear from the documentation which values are allowed
  # to represent missing values.  There were cases of "" and "null"
  # in the examples, but this may need to become an argument to
  # readMzTabData.
  # Also, some fiddling with converting columns types may be needed.
  # See https://github.com/Rdatatable/data.table/issues/729
  fread(paste0(x, collapse = "\n"), sep = "\t", na.strings = c("", "null"))[, -1, with = FALSE]
}

#' @param mtd A data table with at least two columns (keys, then values).
#' @return A named list, where each element is either a 
#' string or a named character vector of length 4.
#' @importFrom stringi stri_detect_regex stri_match_all_regex
reshapeMetadata <- function(mtd)
{
  stopifnot(ncol(mtd) >= 2)
  # First column of mtd contains keys (the leading identifier column
  # was already dropped by readTab); second column contains values.
  # Store results in a list, with keys as names
  metadata <- setNames(vector("list", nrow(mtd)), mtd[[1]])

  # Need different behaviour for strings vs. controlled vocab
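  # A controlled-vocabulary parameter has the form
  # "[label, accession, name, optional fourth field]",
  # e.g. "[MS, MS:1001207, Mascot, ]" (accession chosen for illustration);
  # the regex below captures these four parts.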
  rx <- "\\[([[:alnum:]]+), *([[:alnum:]]+:[[:digit:]]+), *([[:print:]]+), *([[:print:]]+)?\\]"
  isControlledVocab <- stri_detect_regex(mtd[[2]], rx)
  controlledVocabValues <- stri_match_all_regex(
    mtd[[2]][isControlledVocab], 
    rx
  )
  metadata[!isControlledVocab] <- mtd[[2]][!isControlledVocab]
  metadata[isControlledVocab] <- lapply(
    controlledVocabValues,
    function(x)
    {
      setNames(x[-1], c("vocab", "id", "name", "def"))
    }
  )
  metadata
}
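
To illustrate, here is how it would be called on one of the small-molecule examples (a sketch; the path refers to the files downloaded in the first snippet):

## Read one of the failing small-molecule files as a plain list of tables
cytidine <- readMzTabData(
  file.path("mzTab_examples", "Cytidine.mzTab"),
  returnType = "list"
)
names(cytidine)              # Metadata, Proteins, Peptides, PSMs, SmallMolecules
cytidine[["SmallMolecules"]] # the SML table
comment(cytidine)            # any COM lines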

lgatto (Owner) commented Jun 19, 2015

Dear @richierocks,

cc @sneumann, @lars20070

Apologies for taking ages to get back to you. I have adapted your suggestion into MSnbase as a new MzTab class that essentially encapsulates a list with metadata (as a list); proteins, peptides, PSMs and small molecules (as data.frames); and comments (a character). I have not used data.table and stringi, as I think the benefit they bring is not (yet) worth the additional dependencies (read.table seems fast enough for the data sizes we are currently confronted with, and the stringi function is still under development); this might change in the future. See ?MzTab for details.

I have also not included the CV parameter parsing at this stage. The rols package is meant to deal with CV terms, and I will make use of it in the next stage, which consists in producing MSnSet instances containing the data and metadata. I am not doing any data validity checks at this stage either; I will do some at the MSnSet level and might move some basic checking to the MzTab level. For now, I wanted to keep things as light as possible.

I have added you as a contributor; I hope you don't mind, and let me know if you prefer otherwise.

Best wishes,

Laurent
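
For reference, a minimal sketch of the new class in use (accessor names assumed from ?MzTab; the file path is a placeholder):

library(MSnbase)

## Parse an mzTab 1.0 file into an MzTab object (placeholder path)
mzt <- MzTab("mzTab_examples/Cytidine.mzTab")

metadata(mzt)        # metadata section, as a list
smallMolecules(mzt)  # SML section, as a data.frame
psms(mzt)            # PSM section (empty for a small-molecule file)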


lgatto (Owner) commented Jun 22, 2015

To conclude this discussion, I have now added as(., "MSnSetList"), which converts an MzTab instance to an MSnSetList containing the data for proteins, peptides and PSMs. readMzTabData has been modified to use the new infrastructure. Under the hood, it parses the mzTab file into an MzTab instance, then into an MSnSetList, and returns the appropriate MSnSet based on what = c("PRT", "PEP", "PSM"). More details and examples in ?readMSnSet and ?MzTab.
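
A short usage sketch based on the description above (the file path is a placeholder):

library(MSnbase)

f <- "example.mzTab"   # placeholder path to an mzTab 1.0 file

## Coerce the full MzTab object to an MSnSetList ...
msl <- as(MzTab(f), "MSnSetList")

## ... or go straight to a single MSnSet for one section
prot <- readMzTabData(f, what = "PRT")   # proteins
psm  <- readMzTabData(f, what = "PSM")   # peptide-spectrum matches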


lgatto closed this Jun 22, 2015
