
revisit subsampling? #179

Closed
matsen opened this issue Jul 15, 2017 · 9 comments

@matsen (Contributor) commented Jul 15, 2017

At some point it seems worthwhile to revisit the random subsampling used before building the clonal families, for the case where we are interested in every last event that went into a seed lineage.

Perhaps a good thing to do would be for @lauranoges to pick a clonal family for which we would like better resolution on the path to maturity (but where we don't care so much about the other sequences in the clonal family), and we can see what happens if @psathyrella just looks for sequences quite close to the seed.
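
For concreteness, here is a minimal sketch of what "looks for sequences quite close to the seed" could mean; this is hypothetical (plain Hamming distance on same-length strings, arbitrary threshold), not anything partis currently does:

```python
# Hypothetical pre-filter (not current partis behavior): keep only sequences
# within some mismatch threshold of the seed before clustering.
def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def near_seed(seqs, seed_seq, max_dist=30):
    """Subset of seqs that are the seed's length and within max_dist mismatches of it."""
    return [s for s in seqs
            if len(s) == len(seed_seq) and hamming(s, seed_seq) <= max_dist]
```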

@metasoarous (Member)

Another thing we brought up in discussion about subsampling is that if we thought about it as a clustering problem, we could aggregate the duplicity and timepoint metadata for each cluster. @lauranoges et al. would like to avoid (e.g.) a situation where a small number of sequences from one timepoint belonging to some clonal family got completely left out of the subsampling, making it look like there were only hits from one timepoint in the corresponding cftweb tree(s). Something like UCLUST might do the trick here.
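
A rough sketch of that clustering-aware idea, assuming each record is a dict with 'timepoint' and 'duplicity' keys (field names are assumptions, not cft's actual schema): keep at least one sequence per timepoint within a cluster, and carry the aggregated duplicity along so nothing looks missing downstream.

```python
import random
from collections import defaultdict

def subsample_cluster(records, n_keep, seed=0):
    """Subsample one cluster, guaranteeing every timepoint stays represented."""
    rng = random.Random(seed)
    by_timepoint = defaultdict(list)
    for rec in records:
        by_timepoint[rec['timepoint']].append(rec)
    # One representative per timepoint first, so no timepoint disappears ...
    kept = [rng.choice(recs) for recs in by_timepoint.values()]
    # ... then fill up to n_keep at random from the remainder.
    remainder = [rec for rec in records if rec not in kept]
    n_more = max(0, min(n_keep - len(kept), len(remainder)))
    kept += rng.sample(remainder, n_more)
    # Aggregate duplicity so downstream trees still reflect the full cluster.
    total_duplicity = sum(rec['duplicity'] for rec in records)
    return kept, total_duplicity
```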

However, I think this heads in a little bit of a different direction than what you're talking about Erick (looking for sequences close to the seeds), and kind of plays into the seedlineage vs minadcl split. Would it be crazy to run partis on two subsampling strategies?

@psathyrella (Contributor)

I think before adding more to the docket of things that need to be run we should have a better idea of how it could actually change a biological conclusion that we want to make. To be clear, the effect of this is that, on some small fraction of the samples, we could double or triple the size of the final clusters. The vast majority of these clusters are either massive, or of trivial size (i.e. just the seed sequence). In the former case, if going from, say, 2k to 4k sequences changes our conclusions, something's badly wrong in our analysis. In the latter, we're not making any conclusions based on a single/few sequences, so that shouldn't change anything either.

@matsen (Contributor, Author) commented Jul 16, 2017

Thanks, all. @psathyrella, yes, I definitely want to think this through clearly.

The case in which this could make a difference is, like Chris said, just when we are looking very closely at a seed lineage. Even in the big clusters there can be relatively few sequences that branch directly off of the root-to-seed path. If we can get even one more of those, that provides another intermediate that can be tested in the lab. If we double the size of the cluster, that doubles the potential for getting sequences close to the seed lineage path. The flip side is that we can be pretty strict in our clustering: if the inferred naive sequence is too far from the seed naive, we can toss it.
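
To make that "be pretty strict" idea concrete, a hedged sketch (the 'naive_seq' field name and the threshold are assumptions, not existing cft/partis parameters):

```python
# Hypothetical strictness filter: drop sequences whose inferred naive is too
# far from the seed lineage's inferred naive.
def naive_distance(a, b):
    """Mismatches between two inferred naive sequences, counting any length difference."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

def strict_cluster(records, seed_naive, max_mismatches=5):
    """Keep records (dicts with an assumed 'naive_seq' key) whose naive is close to the seed's."""
    return [rec for rec in records
            if naive_distance(rec['naive_seq'], seed_naive) <= max_mismatches]
```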

I was also thinking this weekend that if we are down-sampling anyway, we might as well go for a stricter quality control on the pre-processing side.

Thoughts, @lauranoges ?

@lauradoepker

Yes @matsen and @psathyrella: we might as well downsample to a higher-quality sequence set (e.g. throw out sequences with stop codons). However, if the stop codon arose early, then an entire population of related sequences would get thrown out, which is biased and could thwart us later without our knowledge... which is scary.
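
One hedged way to guard against that bias (a sketch, not what cft does): check how much of a cluster carries a stop codon before filtering, and warn instead of dropping everything when the stop looks like it arose early.

```python
from Bio.Seq import Seq

def stop_fraction(seqs):
    """Fraction of nucleotide sequences whose translation contains a stop codon."""
    return sum('*' in str(Seq(s).translate()) for s in seqs) / float(len(seqs))

def filter_stops(seqs, warn_threshold=0.5):
    """Drop stop-codon sequences, unless most of the cluster shares the stop (likely an early event)."""
    frac = stop_fraction(seqs)
    if frac >= warn_threshold:
        print('warning: %d%% of cluster has stop codons; keeping everything' % round(100 * frac))
        return seqs
    return [s for s in seqs if '*' not in str(Seq(s).translate())]
```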

QA255.105-Vh and -Vk are examples of lineages that we are really interested in. Also BF520.1-Vh and -Vk.

I didn't realize we were downsampling until it was mentioned last week. Are we downsampling randomly @psathyrella or are we downsampling by intentionally picking samples that are _____? (close to seed? good quality? or what?)

@psathyrella (Contributor)

yeah randomly

@metasoarous (Member)

I was just reviewing, and realized that what I've been saying about how I've been filtering out sequences isn't quite right.

What we're actually doing is this:

cft/bin/process_partis.py, lines 157 to 167 at 41247de:

```python
def infer_frameshifts(line):
    attrs = ['stops', 'indelfos', 'input_seqs']
    def infer_(args):
        stop, indelfo, input_seq = args
        aa = Seq(input_seq).translate()
        stop_count = aa.count("*")
        # We say it's a frameshift if the indel offsets don't leave us with a multiple of three, and if there
        # are stop codons. Can tweak this down the road, but for now...
        return bool(stop_count > 0 and indel_offset(indelfo) % 3)
    return map(infer_,
               zip(*map(lambda x: line[x], attrs)))
```

As the inline comment says, we're removing sequences for which:

  • there are stop codons and
  • the indel offsets don't leave the sequence at a multiple of three

I think this was something @psathyrella and I settled on as a temporary solution at a point where we realized that the productivity information partis was sticking in the output was flawed (IIRC, it was not based on the indel reversed sequences or some such). Assuming we don't go the route of taking care of this filtering upstream of cft, would it make sense to switch to the updated productivity information coming out of partis?
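
If we did switch, a sketch of what that might look like, assuming the partis annotation line carries per-sequence 'in_frames', 'stops', and 'mutated_invariants' lists (worth double-checking against the current partis output spec before relying on these keys):

```python
def productive_mask(line):
    """True for each sequence that partis calls in frame, stop-free, and with intact invariant codons."""
    return [in_frame and not stop and not mut_inv
            for in_frame, stop, mut_inv
            in zip(line['in_frames'], line['stops'], line['mutated_invariants'])]
```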

@metasoarous (Member)

@lauranoges @psathyrella What's the current status of this issue? I seem to recall @psathyrella did some tinkering with the downsampling, so are we good to close here?

@psathyrella (Contributor)

yeah, no longer downsampling.

@metasoarous (Member)

Great; thanks! Closing!
