Skip to content
This repository has been archived by the owner on May 23, 2024. It is now read-only.

Disregard the wobble base #1

Open
MoentyPython opened this issue Jun 29, 2016 · 4 comments
Open

Disregard the wobble base #1

MoentyPython opened this issue Jun 29, 2016 · 4 comments

Comments

@MoentyPython
Copy link

Hey,

thank you for this amazing tool, I have included it in out pipeline and it works perfectly. I just have one issue I'd like to bring up.

Sometimes, it is not the nucleotide (NT) sequence that we are interested in but the amino acid (AA) sequence. When I parse my IMGT files with VDJtools my output file lists all my sequences in order of highest expansion. However, due to the wobble base sometimes several NT sequences code the same AA sequence. When we now look at the top 20 AA-sequences we 1. dont have the actual count, as there might be several other NT sequences coding for it aswell, 2. might be missing AA sequences that would have made it into the top 20 if all there AA-sequences would have been summed up.

I have written my own script to deal with this, however it takes a long time to process 500k lines. I wonder if there is a option to disregard the NT-sequence or if it is possible to include something like that in future versions?

Best wishes,

Solaiman

@mikessh
Copy link
Owner

mikessh commented Jun 29, 2016

Dear Solaiman,

Thanks for your feedback! There are two possible workarounds:

  1. If these NT sequences coding for the same AA are erroneous sub-variants, you can remove them using Correct utility, more details here. Note that correct has a certain error threshold (-r option) and will not scan variants that differ by more than 2 mismatches (however in case there is an intermediate variant, i.e. 1 -mismatch-> 2 -other-mismatch -> 3 -other-mismatch-> 4, those will be clustered together).
  2. You can try to pool by amino acid using PoolSamples -i aa ..., more details here. There is a small problem here - pool needs 2+ samples, but you can actually use the metadata file, providing a single metadata file with a single entry per each sample (-m metadata.txt option). I'll fix this limitation in the next release.

Remember that prior to any of those operations you should convert your file to VDJtools format using Convert.

P.S. Issues should be reported to https://github.com/mikessh/vdjtools/issues, this is just a doc repository

Best,
Mike

@MoentyPython
Copy link
Author

Dear Mikhail,

thank you for your quick response! The metadata file trick did it for me,
easy to write a script to automate it! Thank you.

I have another quick question regarding the OverlapPair
http://vdjtools-doc.readthedocs.io/en/latest/overlap.html#overlappair
function. We would like to calculate the pearsonscore for pairs of samples,
however, a PostDoc in our lab told me it is important for our analysis,
that we include the non-overlapping sequences for our experimental setup.
We read under the description of your clonotype scatterplot, that it only
uses overlapping clonotypes for calculation.

Firstly, is there a specific reason you exclude the non-overlapping pairs,
especially in the HSCT example given, it would be much more interesting to
see how similar/dissimilar the entire repertoire of one patient is to the
next.

Secondly, is there an option im missing, to include non-overlapping pairs?

Thank you!

2016-06-29 18:41 GMT+02:00 Mikhail Shugay notifications@github.com:

Dear Solaiman,

Thanks for your feedback! There are two possible workarounds:

  1. If these NT sequences coding for the same AA are erroneous
    sub-variants, you can remove them using Correct utility, more details here
    http://vdjtools-doc.readthedocs.io/en/latest/preprocess.html#correct.
    Note that correct has a certain error threshold (-r option) and will not
    scan variants that differ by more than 2 mismatches (however in case there
    is an intermediate variant, i.e. 1 -mismatch-> 2 -other-mismatch -> 3
    -other-mismatch-> 4, those will be clustered together).
  2. You can try to pool by amino acid using PoolSamples -i aa ..., more
    details here
    http://vdjtools-doc.readthedocs.io/en/latest/operate.html#poolsamples.
    There is a small problem here - pool needs 2+ samples, but you can actually
    use the metadata file, providing a single metadata file with a single
    entry per each sample (-m metadata.txt option). I'll fix this limitation
    in the next release.

Remember that prior to both steps you should convert your file to VDJtools
format using Convert.


You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
#1 (comment),
or mute the thread
https://github.com/notifications/unsubscribe/AQIyXErVKBZz7ZL6D9zdc3UBgJ5Me4VYks5qQqAlgaJpZM4JBTke
.

@mikessh
Copy link
Owner

mikessh commented Jun 30, 2016

Dear Solaiman,

Its not very clear for me, what do you mean under "including non-overlapping clonotypes". If it is for R score computation, then including a set of, say 1000, points that have 0 frequency in one sample and non-zero frequency in other can bias the estimate. I mean people frequently use correlation in bioinformatics for random variables that can lead to non-robust behavior (https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#Robustness), but in this case it can be wise to judge missing clonotypes separately based on frequency and number of non-overlapping clonotypes (see ``freq12`, etc in http://vdjtools-doc.readthedocs.io/en/latest/overlap.html#id4). Note that non-overlapping clonotype density is still shown in side bars (marginal distributions) of first plot in http://vdjtools-doc.readthedocs.io/en/latest/overlap.html#graphical-output.

When you track clonotypes from the same individual (e.g. in HSCT), it is of course better to use plots like the second plot in http://vdjtools-doc.readthedocs.io/en/latest/overlap.html#graphical-output. From statistical point of view, to bind together detected and non-detected clonotypes, as well as clonotype frequency one should have a good idea of whether a change in clonotype count is due to sampling (e.g. you sampled 1000 cells first time and 1010 cells second time) or clonal expansion (you observe 1000 cells first time and 2000 second time). However there is currently no good model to tell clonal expansions from sampling noise.

As for your last question - you can use JoinClonotypes with occurence threshold of 1 (http://vdjtools-doc.readthedocs.io/en/latest/operate.html#joinsamples).

@MoentyPython
Copy link
Author

Hey,

thank you for the explanation! To my second question: The JoinSamples
function is great! However, I meant if there is anyway to run
CalcPairwiseDistances including the non-overlapping clonotypes for
calculating the Pearson Score (R value)?

Solai

2016-06-30 18:56 GMT+02:00 Mikhail Shugay notifications@github.com:

Dear Solaiman,

Its not very clear for me, what do you mean under "including
non-overlapping clonotypes". If it is for R score computation, then
including a set of, say 1000, points that have 0 frequency in one sample
and non-zero frequency in other can bias the estimate. I mean people
frequently use correlation in bioinformatics for random variables that can
lead to non-robust behavior (
https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#Robustness),
but in this case it can be wise to judge missing clonotypes separately
based on frequency and number of non-overlapping clonotypes (see `freq12,
etc in http://vdjtools-doc.readthedocs.io/en/latest/overlap.html#id4).
Note that non-overlapping clonotype density is still shown in side bars
(marginal distributions) of first plot in
http://vdjtools-doc.readthedocs.io/en/latest/overlap.html#graphical-output
.

When you track clonotypes from the same individual (e.g. in HSCT), it is
of course better to use plots like the second plot in
http://vdjtools-doc.readthedocs.io/en/latest/overlap.html#graphical-output.
From statistical point of view, to bind together detected and non-detected
clonotypes, as well as clonotype frequency one should have a good idea of
whether a change in clonotype count is due to sampling (e.g. you sampled
1000 cells first time and 1010 cells second time) or clonal expansion (you
observe 1000 cells first time and 2000 second time). However there is
currently no good model to tell clonal expansions from sampling noise.

As for your last question - you can use JoinClonotypes with occurence
threshold of 1 (
http://vdjtools-doc.readthedocs.io/en/latest/operate.html#joinsamples).


You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
#1 (comment),
or mute the thread
https://github.com/notifications/unsubscribe/AQIyXIj-ecW6iiMcQ4RhhBZLkEb9P24Wks5qQ_U5gaJpZM4JBTke
.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants