Training set derived from PR2 for IDTAXA (DECIPHER) #26

carolinebrunel · 2020-07-16T06:51:20Z

Hi Daniel,

I was wondering whether there was a training set available derived from the PR2 database adapted for the idtaxa function (DECIPHER package) or whether there was a tutorial/ some documentation providing clues on how to create it from the original database.

Thanks for your help!

Caroline

vaulot · 2020-07-16T07:10:56Z

Hi Caroline

Here is a piece of code that I wrote 2 years ago. I did not use since so cannot guarantee it will work. If you improve, it will be nice to post here... Good luck. Daniel

Init


# Libraries for bioinfo ----------------------------------------------------

  library("dada2")
  library("Biostrings")
  library("taxa")
  library("scales")  # necessary to transform heatmaps

# Libraries tidyr ---------------------------------------------------------

  library("ggplot2") 
  library("dplyr")   
  library("readxl")
  library("tibble")
  library("readr")
  library("purrr")
  library("forcats")
  library("lubridate")
  library("stringr")
  library(openxlsx)
  library(DECIPHER)

Testing running a training set on PR2

Total number of sequences eliminated : 72786

Show in New WindowClear OutputExpand/Collapse Output
A training set of class 'Taxa'

K-mer size: 8
Number of rank levels: 9
Total number of sequences: 102685
Number of taxonomic groups: 45750
Number of problem groups: 117
Number of problem sequences: 363


seqs <- readDNAStringSet("pr2_version_4.12.0_dada2.fasta.gz")
# obtain the taxonomic assignments
groups <- names(seqs) # sequence names
# assume the taxonomy begins with 'Root;'
groups <- gsub("(.*)(Eukaryota;)", "\\2", groups) # extract the group label

# All taxos must start with "Root;"
groups <- str_replace(groups,"Eukaryota","Root;Eukaryota")
names(seqs)<-groups

groupCounts <- table(groups)
u_groups <- names(groupCounts) # unique groups
length(u_groups) # number of group

taxid <- NULL

maxGroupSize <- 20 # max sequences per label (>= 1)
remove <- logical(length(seqs))
for (i in which(groupCounts > maxGroupSize)) {
  index <- which(groups==u_groups[i])
  keep <- sample(length(index), maxGroupSize)
  remove[index[-keep]] <- TRUE
}
sum(remove) # number of sequences eliminated
# > sum(remove) # number of sequences eliminated
# [1] 72123

maxIterations <- 3 # must be >= 1
allowGroupRemoval <- FALSE

probSeqsPrev <- integer() # suspected problem sequences from prior iteration

for (i in seq_len(maxIterations)) {
  cat("Training iteration: ", i, "\n", sep="")
  # train the classifier
  trainingSet <- LearnTaxa(seqs[!remove],  names(seqs)[!remove],taxid)
  # look for problem sequences
  probSeqs <- trainingSet$problemSequences$Index
  if (length(probSeqs)==0) {
    cat("No problem sequences remaining.\n")
    break
  } else if (length(probSeqs)==length(probSeqsPrev) && all(probSeqsPrev==probSeqs)) {
    cat("Iterations converged.\n")
    break
  }
  if (i==maxIterations)
    break
    probSeqsPrev <- probSeqs
    # remove any problem sequences
    index <- which(!remove)[probSeqs]
    remove[index] <- TRUE # remove all problem sequences
    if (!allowGroupRemoval) {
    # replace any removed groups
    missing <- !(u_groups %in% groups[!remove])
    missing <- u_groups[missing]
    if (length(missing) > 0) {
      index <- index[groups[index] %in% missing]
      remove[index] <- FALSE # don't remove
    }
  }
}

sum(remove) # total number of sequences eliminated
length(probSeqs) # number of remaining problem sequences

saveRDS(trainingSet, "decipher_training_18S_PR2.4.12.0.rds")

Training set stats

trainingSet <- readRDS("decipher_training_18S_PR2.4.12.0.rds")
trainingSet
plot(trainingSet)

Identifying problem sequences

problems_seq <- data.frame(trainingSet$problemSequences) %>% 
                dplyr::mutate(Expected= str_extract(Expected,";[A-Z]{1}[a-z0-9_.]+;$"))

problems_groups <- data.frame(trainingSet$problemGroups)
knitr::kable(problems_seq)

Use training set to assign sequences

test <- readDNAStringSet("otu_taxo.fasta.gz")
ids <- IdTaxa(test,
        trainingSet,
        type="extended",
        strand="top",
        threshold=0)

  ids
  ids[[1]]
  # plot(ids)
  
  output <- data.frame(decipher=purrr::map_chr(ids,function (id) {
                      paste(id$taxon[[8]],
                      " (",
                      round(id$confidence[[8]], digits=1),
                      "%)",
                      sep="",
                      collapse="; ")
                    }))
  output <- tibble::rownames_to_column(output, var="dada2") %>%
            dplyr::mutate (dada2 = str_extract(dada2,"(\\|[A-Z]){1}[^|]+$"))
  
  knitr::kable(output)

otu_taxo.fasta.gz

carolinebrunel · 2020-07-16T07:13:06Z

Thanks a lot Daniel!

digitalwright · 2021-03-24T13:53:08Z

FYI, I posted a PR2 training set on the DECIPHER website.

vaulot · 2021-03-25T01:28:54Z

Hi Erik, thanks a lot. I posted the link in the latest release info. Cheers. Daniel

vaulot · 2021-06-26T02:33:11Z

Hi Erik
I have now posted a small tutorial to do a trained dataset from PR2: https://pr2database.github.io/pr2database/articles/pr2_04_decipher.html

Let me know of anything needs to be changed. Cheers.

vaulot closed this as completed Jun 26, 2021

FarzanehRah mentioned this issue Mar 9, 2023

Training a custom database for classification #35

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Training set derived from PR2 for IDTAXA (DECIPHER) #26

Training set derived from PR2 for IDTAXA (DECIPHER) #26

carolinebrunel commented Jul 16, 2020 •

edited

vaulot commented Jul 16, 2020

carolinebrunel commented Jul 16, 2020

digitalwright commented Mar 24, 2021

vaulot commented Mar 25, 2021

vaulot commented Jun 26, 2021

Training set derived from PR2 for IDTAXA (DECIPHER) #26

Training set derived from PR2 for IDTAXA (DECIPHER) #26

Comments

carolinebrunel commented Jul 16, 2020 • edited

vaulot commented Jul 16, 2020

Init

Testing running a training set on PR2

carolinebrunel commented Jul 16, 2020

digitalwright commented Mar 24, 2021

vaulot commented Mar 25, 2021

vaulot commented Jun 26, 2021

carolinebrunel commented Jul 16, 2020 •

edited