Disregard the wobble base #1

MoentyPython · 2016-06-29T16:10:48Z

Hey,

thank you for this amazing tool, I have included it in out pipeline and it works perfectly. I just have one issue I'd like to bring up.

Sometimes, it is not the nucleotide (NT) sequence that we are interested in but the amino acid (AA) sequence. When I parse my IMGT files with VDJtools my output file lists all my sequences in order of highest expansion. However, due to the wobble base sometimes several NT sequences code the same AA sequence. When we now look at the top 20 AA-sequences we 1. dont have the actual count, as there might be several other NT sequences coding for it aswell, 2. might be missing AA sequences that would have made it into the top 20 if all there AA-sequences would have been summed up.

I have written my own script to deal with this, however it takes a long time to process 500k lines. I wonder if there is a option to disregard the NT-sequence or if it is possible to include something like that in future versions?

Best wishes,

Solaiman

mikessh · 2016-06-29T16:41:09Z

Dear Solaiman,

Thanks for your feedback! There are two possible workarounds:

If these NT sequences coding for the same AA are erroneous sub-variants, you can remove them using Correct utility, more details here. Note that correct has a certain error threshold (-r option) and will not scan variants that differ by more than 2 mismatches (however in case there is an intermediate variant, i.e. 1 -mismatch-> 2 -other-mismatch -> 3 -other-mismatch-> 4, those will be clustered together).
You can try to pool by amino acid using PoolSamples -i aa ..., more details here. There is a small problem here - pool needs 2+ samples, but you can actually use the metadata file, providing a single metadata file with a single entry per each sample (-m metadata.txt option). I'll fix this limitation in the next release.

Remember that prior to any of those operations you should convert your file to VDJtools format using Convert.

P.S. Issues should be reported to https://github.com/mikessh/vdjtools/issues, this is just a doc repository

Best,
Mike

MoentyPython · 2016-06-30T16:06:08Z

Dear Mikhail,

thank you for your quick response! The metadata file trick did it for me,
easy to write a script to automate it! Thank you.

I have another quick question regarding the OverlapPair
http://vdjtools-doc.readthedocs.io/en/latest/overlap.html#overlappair
function. We would like to calculate the pearsonscore for pairs of samples,
however, a PostDoc in our lab told me it is important for our analysis,
that we include the non-overlapping sequences for our experimental setup.
We read under the description of your clonotype scatterplot, that it only
uses overlapping clonotypes for calculation.

Firstly, is there a specific reason you exclude the non-overlapping pairs,
especially in the HSCT example given, it would be much more interesting to
see how similar/dissimilar the entire repertoire of one patient is to the
next.

Secondly, is there an option im missing, to include non-overlapping pairs?

Thank you!

2016-06-29 18:41 GMT+02:00 Mikhail Shugay notifications@github.com:

Dear Solaiman,

Thanks for your feedback! There are two possible workarounds:

If these NT sequences coding for the same AA are erroneous
sub-variants, you can remove them using Correct utility, more details here
http://vdjtools-doc.readthedocs.io/en/latest/preprocess.html#correct.
Note that correct has a certain error threshold (-r option) and will not
scan variants that differ by more than 2 mismatches (however in case there
is an intermediate variant, i.e. 1 -mismatch-> 2 -other-mismatch -> 3
-other-mismatch-> 4, those will be clustered together).

You can try to pool by amino acid using PoolSamples -i aa ..., more
details here
http://vdjtools-doc.readthedocs.io/en/latest/operate.html#poolsamples.
There is a small problem here - pool needs 2+ samples, but you can actually
use the metadata file, providing a single metadata file with a single
entry per each sample (-m metadata.txt option). I'll fix this limitation
in the next release.

Remember that prior to both steps you should convert your file to VDJtools
format using Convert.

—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
#1 (comment),
or mute the thread
https://github.com/notifications/unsubscribe/AQIyXErVKBZz7ZL6D9zdc3UBgJ5Me4VYks5qQqAlgaJpZM4JBTke
.

mikessh · 2016-06-30T16:56:24Z

Dear Solaiman,

Its not very clear for me, what do you mean under "including non-overlapping clonotypes". If it is for R score computation, then including a set of, say 1000, points that have 0 frequency in one sample and non-zero frequency in other can bias the estimate. I mean people frequently use correlation in bioinformatics for random variables that can lead to non-robust behavior (https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#Robustness), but in this case it can be wise to judge missing clonotypes separately based on frequency and number of non-overlapping clonotypes (see ``freq12`, etc in http://vdjtools-doc.readthedocs.io/en/latest/overlap.html#id4). Note that non-overlapping clonotype density is still shown in side bars (marginal distributions) of first plot in http://vdjtools-doc.readthedocs.io/en/latest/overlap.html#graphical-output.

When you track clonotypes from the same individual (e.g. in HSCT), it is of course better to use plots like the second plot in http://vdjtools-doc.readthedocs.io/en/latest/overlap.html#graphical-output. From statistical point of view, to bind together detected and non-detected clonotypes, as well as clonotype frequency one should have a good idea of whether a change in clonotype count is due to sampling (e.g. you sampled 1000 cells first time and 1010 cells second time) or clonal expansion (you observe 1000 cells first time and 2000 second time). However there is currently no good model to tell clonal expansions from sampling noise.

As for your last question - you can use JoinClonotypes with occurence threshold of 1 (http://vdjtools-doc.readthedocs.io/en/latest/operate.html#joinsamples).

MoentyPython · 2016-07-01T01:13:27Z

Hey,

thank you for the explanation! To my second question: The JoinSamples
function is great! However, I meant if there is anyway to run
CalcPairwiseDistances including the non-overlapping clonotypes for
calculating the Pearson Score (R value)?

Solai

2016-06-30 18:56 GMT+02:00 Mikhail Shugay notifications@github.com:

Dear Solaiman,

Its not very clear for me, what do you mean under "including
non-overlapping clonotypes". If it is for R score computation, then
including a set of, say 1000, points that have 0 frequency in one sample
and non-zero frequency in other can bias the estimate. I mean people
frequently use correlation in bioinformatics for random variables that can
lead to non-robust behavior (
https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#Robustness),
but in this case it can be wise to judge missing clonotypes separately
based on frequency and number of non-overlapping clonotypes (see `freq12,
etc in http://vdjtools-doc.readthedocs.io/en/latest/overlap.html#id4).
Note that non-overlapping clonotype density is still shown in side bars
(marginal distributions) of first plot in
http://vdjtools-doc.readthedocs.io/en/latest/overlap.html#graphical-output
.

When you track clonotypes from the same individual (e.g. in HSCT), it is
of course better to use plots like the second plot in
http://vdjtools-doc.readthedocs.io/en/latest/overlap.html#graphical-output.
From statistical point of view, to bind together detected and non-detected
clonotypes, as well as clonotype frequency one should have a good idea of
whether a change in clonotype count is due to sampling (e.g. you sampled
1000 cells first time and 1010 cells second time) or clonal expansion (you
observe 1000 cells first time and 2000 second time). However there is
currently no good model to tell clonal expansions from sampling noise.

As for your last question - you can use JoinClonotypes with occurence
threshold of 1 (
http://vdjtools-doc.readthedocs.io/en/latest/operate.html#joinsamples).

—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
#1 (comment),
or mute the thread
https://github.com/notifications/unsubscribe/AQIyXIj-ecW6iiMcQ4RhhBZLkEb9P24Wks5qQ_U5gaJpZM4JBTke
.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Disregard the wobble base #1

Disregard the wobble base #1

MoentyPython commented Jun 29, 2016

mikessh commented Jun 29, 2016 •

edited

Loading

MoentyPython commented Jun 30, 2016

mikessh commented Jun 30, 2016

MoentyPython commented Jul 1, 2016

Disregard the wobble base #1

Disregard the wobble base #1

Comments

MoentyPython commented Jun 29, 2016

mikessh commented Jun 29, 2016 • edited Loading

MoentyPython commented Jun 30, 2016

mikessh commented Jun 30, 2016

MoentyPython commented Jul 1, 2016

mikessh commented Jun 29, 2016 •

edited

Loading