Create method=agc and method=dgc for cluster/cluster.split command #169

pschloss · 2015-09-28T14:53:18Z

Both of these methods would wrap vsearch's agc (abundance-based greedy clustering) and dgc (distance-based greedy clustering) algorithms so they can be run from the cluster and cluster.split commands.

The algorithm would go something like this...

1. Run unique.seqs on the data if a names file is not given
2. Remove - and . characters from each sequence (i.e. degap.seqs)
3. Append the number of sequences that each unique sequence represents to the end of each sequence name in the fasta file, like we do for chimera.uchime
4. Run vsearch as described below
5. Convert the *.uc file that vsearch outputs into a list file
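Steps 2 and 3 above can be sketched in shell. This is only an illustration, not mothur's implementation: the function name `degap_and_size` is made up, the names file is assumed to use mothur's two-column layout (representative name, then a comma-separated list of the names it represents), and the abundance is appended as `;size=N;`, the annotation vsearch reads, on the assumption that this is the format wanted here.

```sh
#!/bin/sh
# Sketch of steps 2 and 3: degap a fasta file and tag each unique sequence
# with its abundance. degap_and_size is a hypothetical helper name.
#   $1 = mothur names file (representative <tab> comma-separated members)
#   $2 = fasta file of the unique sequences
degap_and_size() {
    awk -F'\t' '
        # Names file (pass=1): count the members listed for each representative.
        pass == 1 { size[$1] = split($2, members, ","); next }

        # Fasta header: keep the name up to the first blank and append the
        # abundance; a sequence missing from the names file counts as 1.
        /^>/ {
            name = substr($0, 2)
            sub(/[ \t].*/, "", name)
            n = (name in size) ? size[name] : 1
            print ">" name ";size=" n ";"   # ";size=N;" format is an assumption
            next
        }

        # Sequence line: strip gap characters (- and .), as degap.seqs would.
        { gsub(/[-.]/, ""); if (length($0) > 0) print }
    ' pass=1 "$1" pass=2 "$2"
}
```

It would be called as `degap_and_size data.names data.fasta > data.sized.fna`, where the file names are placeholders.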

I currently have a hack to do this using bash and R. Here's how it goes using unaligned sequences...

For method=agc...

FASTA=$1
ROOT=$(echo "$FASTA" | sed 's/fasta/agc/' | sed 's/\.ng//')

# this does steps 1 and 3... (it's the same as for method=dgc)
vsearch --sizeout --derep_fulllength "$FASTA" --minseqlength 30 --threads 1 --uc "$ROOT.sorted.uc" --output "$ROOT.sorted.fna" --strand both --log "$ROOT.sorted.log"

# this does step 4 (agc uses --sizeorder)
vsearch --maxaccepts 16 --usersort --id 0.97 --minseqlength 30 --wordlength 8 --uc "$ROOT.clustered.uc" --cluster_smallmem "$ROOT.sorted.fna" --maxrejects 64 --strand both --log "$ROOT.clustered.log" --sizeorder

# this does step 5
R -e "source('uc_to_list.R'); uc_to_list('$ROOT.sorted.uc', '$ROOT.clustered.uc')"

# this cleans things up
rm "$ROOT.sorted.uc" "$ROOT.sorted.fna" "$ROOT.sorted.log" "$ROOT.clustered.uc" "$ROOT.clustered.log"

For method=dgc...

FASTA=$1
ROOT=$(echo "$FASTA" | sed 's/fasta/dgc/' | sed 's/\.ng//')

# this does steps 1 and 3... (it's the same as for method=agc)
vsearch --sizeout --derep_fulllength "$FASTA" --minseqlength 30 --threads 1 --uc "$ROOT.sorted.uc" --output "$ROOT.sorted.fna" --strand both --log "$ROOT.sorted.log"

# this does step 4 (dgc does not use --sizeorder)
vsearch --maxaccepts 16 --usersort --id 0.97 --minseqlength 30 --wordlength 8 --uc "$ROOT.clustered.uc" --cluster_smallmem "$ROOT.sorted.fna" --maxrejects 64 --strand both --log "$ROOT.clustered.log"

# this does step 5
R -e "source('uc_to_list.R'); uc_to_list('$ROOT.sorted.uc', '$ROOT.clustered.uc')"

# this cleans things up
rm "$ROOT.sorted.uc" "$ROOT.sorted.fna" "$ROOT.sorted.log" "$ROOT.clustered.uc" "$ROOT.clustered.log"

The R code...

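uc_to_list <- function(unique_file_name, clustered_file_name){

    # Dereplication *.uc file: V1 is the record type, V2 the 0-based cluster
    # number, and V9 the sequence name
    uniqued <- read.table(file=unique_file_name, stringsAsFactors=FALSE)

    # Each "S" record is a unique sequence; its entry starts with its own name
    names_first_column <- uniqued[uniqued$V1=="S", "V9"]
    names_second_column <- names_first_column

    hits <- uniqued[uniqued$V1=="H", ]

    # Append the duplicates ("H" records) of each unique sequence to its entry
    for(i in seq_along(names_first_column)){
        dups <- paste(hits[hits$V2==(i-1), "V9"], collapse=",")
        names_second_column[i] <- paste(names_second_column[i], dups, sep=",")
    }
    # Unique sequences without duplicates end up with a trailing comma
    names_second_column <- gsub(",$", "", names_second_column)

    # Clustering *.uc file: row k is assumed to be the k-th unique sequence
    clustered <- read.table(file=clustered_file_name, stringsAsFactors=FALSE)
    clustered$sequence <- seq_len(nrow(clustered))

    # Each "S" record seeds an OTU; each "H" record joins the OTU in V2
    otus <- names_second_column[clustered[clustered$V1=="S", "sequence"]]
    hits <- clustered[clustered$V1=="H", ]

    # seq_len() skips the loop cleanly when there are no hits
    for(i in seq_len(nrow(hits))){
        otu <- hits[i,"V2"]+1
        otus[otu] <- paste(otus[otu], names_second_column[hits[i,"sequence"]], sep=",")
    }

    # Write one tab-separated line: label, number of OTUs, then the OTUs
    list_file_name <- gsub("clustered.uc", "list", clustered_file_name, fixed=TRUE)
    list_data <- paste(otus, collapse="\t")
    list_data <- paste("userLabel", length(otus), list_data, sep="\t")
    write.table(x=list_data, file=list_file_name, quote=FALSE, row.names=FALSE, col.names=FALSE, sep="\t")

}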

Options to include...

cutoff = --id should be able to be set by the user. Make 0.97 the default. In mothur terms the default cutoff would be 0.03 so we need to do 1-cutoff to get the value for --id
processors = I'm pretty sure that this is the --threads option, which takes an int
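The mapping from these two options, plus the method, onto the step 4 command might look like the sketch below. `build_cluster_cmd` is a hypothetical helper that only prints the command rather than running it; the fixed flags are copied from the scripts above, and using --threads for processors follows the guess made in this issue rather than anything confirmed here.

```sh
#!/bin/sh
# Sketch: assemble the step 4 vsearch command from mothur-style options.
# build_cluster_cmd is a hypothetical helper; it prints the command only.
#   $1 = method (agc or dgc), $2 = cutoff, $3 = processors, $4 = file root
build_cluster_cmd() {
    method=$1 cutoff=$2 processors=$3 root=$4

    # mothur's cutoff is a distance and --id is a similarity: id = 1 - cutoff.
    id=$(LC_ALL=C awk -v c="$cutoff" 'BEGIN { printf "%g", 1 - c }')

    # Only agc gets --sizeorder; dgc uses the same command without it.
    case $method in
        agc) extra=" --sizeorder" ;;
        dgc) extra="" ;;
        *)   echo "unknown method: $method" >&2; return 1 ;;
    esac

    echo "vsearch --cluster_smallmem $root.sorted.fna --usersort --id $id" \
         "--minseqlength 30 --wordlength 8 --maxaccepts 16 --maxrejects 64" \
         "--strand both --threads $processors --uc $root.clustered.uc" \
         "--log $root.clustered.log$extra"
}
```

For example, `build_cluster_cmd agc 0.03 4 demo` prints a command with `--id 0.97`, `--threads 4`, and a trailing `--sizeorder`, where `demo` is a placeholder root.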
I just noticed that the R code produces a label that is userLabel. This should be 0.03 (i.e. the value of cutoff). Also, the R code gets pretty slow, so if there's a way to do it better that would be great. Again, this was a hack :)
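One possible faster route, sketched here with awk, reads each *.uc file once instead of rescanning the hits for every unique sequence. It is a stand-in for the R function, not mothur's eventual implementation: it keeps the same positional assumptions (S records appear in cluster order, and the k-th S or H record of the clustering file is the k-th unique sequence), expects tab-delimited *.uc files, and takes the label as a third argument so that the cutoff can be written instead of userLabel.

```sh
#!/bin/sh
# Sketch of step 5: turn the two *.uc files into a one-line list file.
# uc_to_list here is a shell stand-in for the R function of the same name.
#   $1 = *.sorted.uc, $2 = *.clustered.uc, $3 = label (e.g. 0.03)
uc_to_list() {
    awk -F'\t' -v label="$3" '
        # Dereplication file: the k-th S record names the k-th unique sequence.
        pass == 1 && $1 == "S" { names[++n_unique] = $9; next }
        # H records are duplicates; column 2 is the 0-based cluster number.
        pass == 1 && $1 == "H" { names[$2 + 1] = names[$2 + 1] "," $9; next }

        # Clustering file: the k-th S/H record is assumed to be the k-th
        # unique sequence (the same positional assumption as the R code).
        pass == 2 && $1 == "S" { otu[++n_otus] = names[++row]; next }
        pass == 2 && $1 == "H" { otu[$2 + 1] = otu[$2 + 1] "," names[++row]; next }

        # One tab-separated line: label, number of OTUs, then each OTU.
        END {
            printf "%s\t%d", label, n_otus
            for (i = 1; i <= n_otus; i++) printf "\t%s", otu[i]
            printf "\n"
        }
    ' pass=1 "$1" pass=2 "$2"
}
```

In the scripts above it would replace the R call with something like `uc_to_list "$ROOT.sorted.uc" "$ROOT.clustered.uc" 0.03 > "$ROOT.list"`.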

I'll email some example input and output that was generated without running the rm command at the end of each script for both methods.

mothur-westcott · 2015-10-12T19:10:01Z

@pschloss Is step 4 supposed to have different settings? How does vsearch know which method you are requesting? Does it rely on the file extension? Do you want to use vsearch to run steps 1-3 or mothur?

pschloss · 2015-10-13T02:15:54Z

Sorry, I had a typo above. DGC isn't supposed to have the --sizeorder flag, while AGC is. I've corrected the code above.

pschloss added Enhancement Priority labels Sep 28, 2015

pschloss added this to the Version 1.37.0 milestone Sep 28, 2015

mothur-westcott added commits that referenced this issue:

Oct 12, 2015: Adds vsearch source to project (6fa8a34)
Oct 12, 2015: Adds fasta, agc/dgc to cluster constructors (34004d2)
Oct 12, 2015: Create function for mothurCluster (25009cb)
Oct 12, 2015: Adds file prep needed to run vsearch methods (a7d1289)
Oct 12, 2015: Adds vsearch driver to cluster command (1d87f7e)
Jan 5, 2016: Created vsearchFileParser class to encapsulate code (9f653f5)
Jan 5, 2016: WIP (4be675a)
Jan 5, 2016: Adds agc and dgc methods to cluster command (4174b4f)
Jan 5, 2016: Adds vsearch to cluster.split (0f33eda)
Feb 7, 2016: vsearch weekend updates (49fbc9d)
Feb 8, 2016: Adds agc and dgc methods to cluster command (11e6aa0)
Feb 8, 2016: WIP cluster.split command agc dgc (f3de1a5)
Feb 9, 2016: Adds agc and dgc to cluster.split (1049f39)
Feb 9, 2016: Resolving conflicts in #169 merge (b6802a3)
Feb 29, 2016: Globalizes inputDir for vsearch methods (eafae91)

mothur-westcott closed this as completed Mar 15, 2016
