
revisit subsampling? #179

Closed
matsen opened this issue Jul 15, 2017 · 9 comments

@matsen (Contributor) commented Jul 15, 2017

At some point it seems worthwhile to revisit the random subsampling used before building the clonal families, for the case where we are interested in every last event that went into a seed lineage.

Perhaps a good thing to do would be for @lauranoges to pick a clonal family for which we would like better resolution on the path to maturity (but where we don't care so much about the other sequences in the clonal family), and we can see what happens if @psathyrella just looks for sequences quite close to the seed.
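
For concreteness, here is a minimal sketch of what "looks for sequences quite close to the seed" could mean; this is hypothetical (plain Hamming distance on same-length strings, arbitrary threshold), not anything partis currently does:

```python
# Hypothetical pre-filter (not current partis behavior): keep only sequences
# within some mismatch threshold of the seed before clustering.
def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def near_seed(seqs, seed_seq, max_dist=30):
    """Subset of seqs that are the seed's length and within max_dist mismatches of it."""
    return [s for s in seqs
            if len(s) == len(seed_seq) and hamming(s, seed_seq) <= max_dist]
```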

@metasoarous (Member)

Another thing we brought up in discussion about subsampling is that if we thought about it as a clustering problem, we could aggregate the duplicity and timepoint metadata for each cluster. @lauranoges et al. would like to avoid (e.g.) a situation where a small number of sequences from one timepoint belonging to some clonal family got completely left out of the subsampling, making it look like there were only hits from one timepoint in the corresponding cftweb tree(s). Something like UCLUST might do the trick here.
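
A rough sketch of that clustering-aware idea, assuming each record is a dict with 'timepoint' and 'duplicity' keys (field names are assumptions, not cft's actual schema): keep at least one sequence per timepoint within a cluster, and carry the aggregated duplicity along so nothing looks missing downstream.

```python
import random
from collections import defaultdict

def subsample_cluster(records, n_keep, seed=0):
    """Subsample one cluster, guaranteeing every timepoint stays represented."""
    rng = random.Random(seed)
    by_timepoint = defaultdict(list)
    for rec in records:
        by_timepoint[rec['timepoint']].append(rec)
    # One representative per timepoint first, so no timepoint disappears ...
    kept = [rng.choice(recs) for recs in by_timepoint.values()]
    # ... then fill up to n_keep at random from the remainder.
    remainder = [rec for rec in records if rec not in kept]
    n_more = max(0, min(n_keep - len(kept), len(remainder)))
    kept += rng.sample(remainder, n_more)
    # Aggregate duplicity so downstream trees still reflect the full cluster.
    total_duplicity = sum(rec['duplicity'] for rec in records)
    return kept, total_duplicity
```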

However, I think this heads in a little bit of a different direction than what you're talking about Erick (looking for sequences close to the seeds), and kind of plays into the seedlineage vs minadcl split. Would it be crazy to run partis on two subsampling strategies?

@psathyrella (Contributor)

I think before adding more to the docket of things that need to be run we should have a better idea of how it could actually change a biological conclusion that we want to make. To be clear, the effect of this is that, on some small fraction of the samples, we could double or triple the size of the final clusters. The vast majority of these clusters are either massive, or of trivial size (i.e. just the seed sequence). In the former case, if going from, say, 2k to 4k sequences changes our conclusions, something's badly wrong in our analysis. In the latter, we're not making any conclusions based on a single/few sequences, so that shouldn't change anything either.

@matsen (Contributor, Author) commented Jul 16, 2017

Thanks, all. @psathyrella, yes, I definitely want to think this through clearly.

The case in which this could make a difference is, like Chris said, just when we are looking very closely at a seed lineage. Even in the big clusters there can be relatively few sequences that branch directly off of the root-to-seed path. If we can get even one more of those, that provides another intermediate that can be tested in the lab. If we double the size of the cluster, that doubles the potential for getting sequences close to the seed lineage path. The flip side is that we can be pretty strict in our clustering: if the inferred naive sequence is too far from the seed naive, we can toss it.
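
To make that "be pretty strict" idea concrete, a hedged sketch (the 'naive_seq' field name and the threshold are assumptions, not existing cft/partis parameters):

```python
# Hypothetical strictness filter: drop sequences whose inferred naive is too
# far from the seed lineage's inferred naive.
def naive_distance(a, b):
    """Mismatches between two inferred naive sequences, counting any length difference."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

def strict_cluster(records, seed_naive, max_mismatches=5):
    """Keep records (dicts with an assumed 'naive_seq' key) whose naive is close to the seed's."""
    return [rec for rec in records
            if naive_distance(rec['naive_seq'], seed_naive) <= max_mismatches]
```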

I was also thinking this weekend that if we are down-sampling anyway, we might as well go for a stricter quality control on the pre-processing side.

Thoughts, @lauranoges ?

@lauradoepker

Yes @matsen and @psathyrella: we might as well downsample to a higher-quality sequence set (e.g. throw out sequences with stop codons). However, if the stop codon arose early, then an entire population of related sequences would get thrown out, which is biased and could thwart us later without our knowledge... which is scary.
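
One hedged way to guard against that bias (a sketch, not what cft does): check how much of a cluster carries a stop codon before filtering, and warn instead of dropping everything when the stop looks like it arose early.

```python
from Bio.Seq import Seq

def stop_fraction(seqs):
    """Fraction of nucleotide sequences whose translation contains a stop codon."""
    return sum('*' in str(Seq(s).translate()) for s in seqs) / float(len(seqs))

def filter_stops(seqs, warn_threshold=0.5):
    """Drop stop-codon sequences, unless most of the cluster shares the stop (likely an early event)."""
    frac = stop_fraction(seqs)
    if frac >= warn_threshold:
        print('warning: %d%% of cluster has stop codons; keeping everything' % round(100 * frac))
        return seqs
    return [s for s in seqs if '*' not in str(Seq(s).translate())]
```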

QA255.105-Vh and -Vk are examples of lineages that we are really interested in. Also BF520.1-Vh and -Vk.

I didn't realize we were downsampling until it was mentioned last week. Are we downsampling randomly @psathyrella or are we downsampling by intentionally picking samples that are _____? (close to seed? good quality? or what?)

@psathyrella (Contributor)

yeah randomly

@metasoarous (Member)

I was just reviewing, and realized that what I've been saying about how I've been filtering out sequences isn't quite right.

What we're actually doing is this:

cft/bin/process_partis.py, lines 157 to 167 at 41247de:

```python
def infer_frameshifts(line):
    attrs = ['stops', 'indelfos', 'input_seqs']
    def infer_(args):
        stop, indelfo, input_seq = args
        aa = Seq(input_seq).translate()
        stop_count = aa.count("*")
        # We say it's a frameshift if the indel offsets don't leave us with a multiple of three, and if there
        # are stop codons. Can tweak this down the road, but for now...
        return bool(stop_count > 0 and indel_offset(indelfo) % 3)
    return map(infer_,
               zip(*map(lambda x: line[x], attrs)))
```

As the inline comment says, we're removing sequences for which:

  • there are stop codons and
  • the indel offsets don't leave the sequence at a multiple of three

I think this was something @psathyrella and I settled on as a temporary solution at a point where we realized that the productivity information partis was sticking in the output was flawed (IIRC, it was not based on the indel reversed sequences or some such). Assuming we don't go the route of taking care of this filtering upstream of cft, would it make sense to switch to the updated productivity information coming out of partis?
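
If we did switch, a sketch of what that might look like, assuming the partis annotation line carries per-sequence 'in_frames', 'stops', and 'mutated_invariants' lists (worth double-checking against the current partis output spec before relying on these keys):

```python
def productive_mask(line):
    """True for each sequence that partis calls in frame, stop-free, and with intact invariant codons."""
    return [in_frame and not stop and not mut_inv
            for in_frame, stop, mut_inv
            in zip(line['in_frames'], line['stops'], line['mutated_invariants'])]
```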

@metasoarous (Member)

@lauranoges @psathyrella What's the current status of this issue? I seem to recall @psathyrella did some tinkering with the downsampling, so are we good to close here?

@psathyrella (Contributor)

yeah, no longer downsampling.

@metasoarous (Member)

Great; thanks! Closing!
