Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create method=agc and method=dgc for cluster/cluster.split command #169

Closed
pschloss opened this issue Sep 28, 2015 · 2 comments
Closed

Create method=agc and method=dgc for cluster/cluster.split command #169

pschloss opened this issue Sep 28, 2015 · 2 comments

Comments

@pschloss
Copy link
Contributor

Both of these would represent a wrapper for the agc (abundance-based greedy clustering) and dgc (distance-based greedy clustering) in the cluster and cluster.split commands

The algorithm would go something like this...

  1. Run unique.seqs on the data if a names file is not given
  2. Remove - and . characters from each sequence (i.e. degap.seqs)
  3. Append the number of sequences that each unique sequence represents to the end of the fasta file name like we do for chimera.uchime.
  4. Run vsearch as described below
  5. Convert outputted *.uc file into a list file

I currently have a hack to do this using bash and R. Here's how it goes using unaligned sequences...

For method=agc...

FASTA=$1
ROOT=$(echo $FASTA | sed 's/fasta/agc/' | sed 's/.ng//')

#this does steps 1 and 3...  (it's the same as for method=dgc)
vsearch --sizeout --derep_fulllength $FASTA --minseqlength 30 --threads 1 --uc $ROOT.sorted.uc --output $ROOT.sorted.fna --strand both --log $ROOT.sorted.log 

#this does step 4
vsearch --maxaccepts 16 --usersort --id 0.97 --minseqlength 30 --wordlength 8 --uc $ROOT.clustered.uc --cluster_smallmem $ROOT.sorted.fna --maxrejects 64 --strand both --log $ROOT.clustered.log --sizeorder

#this does step 5
R -e "source('uc_to_list.R'); uc_to_list('$ROOT.sorted.uc', '$ROOT.clustered.uc')"

#this cleans things up
rm $ROOT.sorted.uc $ROOT.sorted.fna $ROOT.sorted.log $ROOT.clustered.uc $ROOT.clustered.log

For method=dgc...

FASTA=$1
ROOT=$(echo $FASTA | sed 's/fasta/dgc/' | sed 's/.ng//')

#this does steps 1 and 3... (it's the same as for method=agc)
vsearch --sizeout --derep_fulllength $FASTA --minseqlength 30 --threads 1 --uc $ROOT.sorted.uc --output $ROOT.sorted.fna --strand both --log $ROOT.sorted.log 

#this does step 4
vsearch --maxaccepts 16 --usersort --id 0.97 --minseqlength 30 --wordlength 8 --uc $ROOT.clustered.uc --cluster_smallmem $ROOT.sorted.fna --maxrejects 64 --strand both --log $ROOT.clustered.log

#this does step 5
R -e "source('uc_to_list.R'); uc_to_list('$ROOT.sorted.uc', '$ROOT.clustered.uc')"

#this cleans things up
rm $ROOT.sorted.uc $ROOT.sorted.fna $ROOT.sorted.log $ROOT.clustered.uc $ROOT.clustered.log

The R code...

uc_to_list <- function(unique_file_name, clustered_file_name){

    uniqued <- read.table(file=unique_file_name, stringsAsFactors=FALSE)

    names_first_column <- uniqued[uniqued$V1=="S", "V9"]
    names_second_column <- names_first_column

    hits <- uniqued[uniqued$V1=="H", ]

    for(i in 0:(length(names_first_column)-1)){
        dups <- paste(hits[hits$V2==i, "V9"], collapse=",")
        names_second_column[i+1] <- paste(names_second_column[i+1], dups, sep=",")
    }
    names_second_column <- gsub(",$", "", names_second_column)


    clustered <- read.table(file=clustered_file_name, stringsAsFactors=FALSE)
    clustered$sequence <- 1:nrow(clustered)

    otus <- names_second_column[clustered[clustered$V1=="S", "sequence"]]
    hits <- clustered[clustered$V1=="H", ]

    for(i in 1:nrow(hits)){
        otus[hits[i,"V2"]+1] <- paste(otus[hits[i,"V2"]+1], names_second_column[hits[i,"sequence"]], sep=",") 
    }

    list_file_name <- gsub("clustered.uc", "list", clustered_file_name)
    list_data <- paste(otus, collapse="\t")
    list_data <- paste("userLabel", length(otus), list_data, sep="\t")
    write.table(x=list_data, file=list_file_name, quote=F, row.names=F, col.names=F, sep="\t")

}

Options to include...

  • cutoff = --id should be able to be set by the user. Make 0.97 the default. In mothur terms the default cutoff would be 0.03 so we need to do 1-cutoff to get the value for --id
  • processors = I'm pretty sure that this is the --threads option, which takes an int
  • I just noticed that the R code produces a label that us userLabel. This should be 0.03 (i.e. the value of cutoff). Also, the R code gets pretty slow, so if there's a way to do it better that would be great. Again, this was a hack :)

I'll email some example input and output that was generated without running the rm command at the end of each script for both methods.

@mothur-westcott
Copy link
Contributor

@pschloss Is step 4 supposed to have different settings? How does vsearch know which method you are requesting? Does it rely on the file extension? Do you want to use vsearch to run steps 1-3 or mothur?

mothur-westcott added a commit that referenced this issue Oct 12, 2015
mothur-westcott added a commit that referenced this issue Oct 12, 2015
mothur-westcott added a commit that referenced this issue Oct 12, 2015
@pschloss
Copy link
Contributor Author

Sorry - I had a typo above. The DGC isn't supposed to have the --sizeorder flag while AGC is. I've corrected the code above.

mothur-westcott added a commit that referenced this issue Jan 5, 2016
Issue #169
mothur-westcott added a commit that referenced this issue Jan 5, 2016
mothur-westcott added a commit that referenced this issue Feb 7, 2016
mothur-westcott added a commit that referenced this issue Feb 8, 2016
mothur-westcott added a commit that referenced this issue Feb 9, 2016
mothur-westcott added a commit that referenced this issue Feb 9, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants