Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Training set derived from PR2 for IDTAXA (DECIPHER) #26

Closed
carolinebrunel opened this issue Jul 16, 2020 · 5 comments
Closed

Training set derived from PR2 for IDTAXA (DECIPHER) #26

carolinebrunel opened this issue Jul 16, 2020 · 5 comments

Comments

@carolinebrunel
Copy link

carolinebrunel commented Jul 16, 2020

Hi Daniel,

I was wondering whether there was a training set available derived from the PR2 database adapted for the idtaxa function (DECIPHER package) or whether there was a tutorial/ some documentation providing clues on how to create it from the original database.

Thanks for your help!

Caroline

@vaulot
Copy link
Collaborator

vaulot commented Jul 16, 2020

Hi Caroline

Here is a piece of code that I wrote 2 years ago. I did not use since so cannot guarantee it will work. If you improve, it will be nice to post here... Good luck. Daniel

Init


# Libraries for bioinfo ----------------------------------------------------

  library("dada2")
  library("Biostrings")
  library("taxa")
  library("scales")  # necessary to transform heatmaps

# Libraries tidyr ---------------------------------------------------------

  library("ggplot2") 
  library("dplyr")   
  library("readxl")
  library("tibble")
  library("readr")
  library("purrr")
  library("forcats")
  library("lubridate")
  library("stringr")
  library(openxlsx)
  library(DECIPHER)

Testing running a training set on PR2

Total number of sequences eliminated : 72786

Show in New WindowClear OutputExpand/Collapse Output
A training set of class 'Taxa'

  • K-mer size: 8
  • Number of rank levels: 9
  • Total number of sequences: 102685
  • Number of taxonomic groups: 45750
  • Number of problem groups: 117
  • Number of problem sequences: 363

seqs <- readDNAStringSet("pr2_version_4.12.0_dada2.fasta.gz")
# obtain the taxonomic assignments
groups <- names(seqs) # sequence names
# assume the taxonomy begins with 'Root;'
groups <- gsub("(.*)(Eukaryota;)", "\\2", groups) # extract the group label

# All taxos must start with "Root;"
groups <- str_replace(groups,"Eukaryota","Root;Eukaryota")
names(seqs)<-groups

groupCounts <- table(groups)
u_groups <- names(groupCounts) # unique groups
length(u_groups) # number of group

taxid <- NULL

maxGroupSize <- 20 # max sequences per label (>= 1)
remove <- logical(length(seqs))
for (i in which(groupCounts > maxGroupSize)) {
  index <- which(groups==u_groups[i])
  keep <- sample(length(index), maxGroupSize)
  remove[index[-keep]] <- TRUE
}
sum(remove) # number of sequences eliminated
# > sum(remove) # number of sequences eliminated
# [1] 72123

maxIterations <- 3 # must be >= 1
allowGroupRemoval <- FALSE

probSeqsPrev <- integer() # suspected problem sequences from prior iteration

for (i in seq_len(maxIterations)) {
  cat("Training iteration: ", i, "\n", sep="")
  # train the classifier
  trainingSet <- LearnTaxa(seqs[!remove],  names(seqs)[!remove],taxid)
  # look for problem sequences
  probSeqs <- trainingSet$problemSequences$Index
  if (length(probSeqs)==0) {
    cat("No problem sequences remaining.\n")
    break
  } else if (length(probSeqs)==length(probSeqsPrev) && all(probSeqsPrev==probSeqs)) {
    cat("Iterations converged.\n")
    break
  }
  if (i==maxIterations)
    break
    probSeqsPrev <- probSeqs
    # remove any problem sequences
    index <- which(!remove)[probSeqs]
    remove[index] <- TRUE # remove all problem sequences
    if (!allowGroupRemoval) {
    # replace any removed groups
    missing <- !(u_groups %in% groups[!remove])
    missing <- u_groups[missing]
    if (length(missing) > 0) {
      index <- index[groups[index] %in% missing]
      remove[index] <- FALSE # don't remove
    }
  }
}

sum(remove) # total number of sequences eliminated
length(probSeqs) # number of remaining problem sequences

saveRDS(trainingSet, "decipher_training_18S_PR2.4.12.0.rds")

Training set stats

trainingSet <- readRDS("decipher_training_18S_PR2.4.12.0.rds")
trainingSet
plot(trainingSet)

Identifying problem sequences

problems_seq <- data.frame(trainingSet$problemSequences) %>% 
                dplyr::mutate(Expected= str_extract(Expected,";[A-Z]{1}[a-z0-9_.]+;$"))

problems_groups <- data.frame(trainingSet$problemGroups)
knitr::kable(problems_seq)

Use training set to assign sequences

test <- readDNAStringSet("otu_taxo.fasta.gz")
ids <- IdTaxa(test,
        trainingSet,
        type="extended",
        strand="top",
        threshold=0)

  ids
  ids[[1]]
  # plot(ids)
  
  output <- data.frame(decipher=purrr::map_chr(ids,function (id) {
                      paste(id$taxon[[8]],
                      " (",
                      round(id$confidence[[8]], digits=1),
                      "%)",
                      sep="",
                      collapse="; ")
                    }))
  output <- tibble::rownames_to_column(output, var="dada2") %>%
            dplyr::mutate (dada2 = str_extract(dada2,"(\\|[A-Z]){1}[^|]+$"))
  
  knitr::kable(output)
  

otu_taxo.fasta.gz

@carolinebrunel
Copy link
Author

Thanks a lot Daniel!

@digitalwright
Copy link

FYI, I posted a PR2 training set on the DECIPHER website.

@vaulot
Copy link
Collaborator

vaulot commented Mar 25, 2021

Hi Erik, thanks a lot. I posted the link in the latest release info. Cheers. Daniel

@vaulot
Copy link
Collaborator

vaulot commented Jun 26, 2021

Hi Erik
I have now posted a small tutorial to do a trained dataset from PR2: https://pr2database.github.io/pr2database/articles/pr2_04_decipher.html

Let me know of anything needs to be changed. Cheers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants