Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filtering and interpreting results #62

Open
dariober opened this issue Nov 27, 2017 · 8 comments
Open

Filtering and interpreting results #62

dariober opened this issue Nov 27, 2017 · 8 comments

Comments

@dariober
Copy link

Sorry, I'm bombarding you with questions... I'd like to make sure my understanding is correct.

Typically one wants to filter out segments that are copy-number neutral. For this purpose, is it sufficient to filter out records where tcn.em = 2 or NA and lcn.em = 1 or NA? Is any other filtering recommended? E.g.:

datafile<- system.file("extdata", "stomach.csv.gz", package="facets")
rcmat<- readSnpMatrix(datafile)
xx<- preProcSample(rcmat)
oo<- procSample(xx,cval=150)
fit<- emcncf(oo)
cncf<- fit$cncf

# Uninteresting segments:
cncf[which((cncf$tcn.em == 2 & cncf$lcn.em == 1) | is.na(cncf$tcn.em) | is.na(cncf$lcn.em)),]

Related question, In order to assess the "strength" or enrichment of a segment I guess a useful metric is cnlr.median or cnlr.median.clust (these two appear to be almost perfectly correlated, so I guess it doesn't matter which one is chosen, any comment on this?) However, the raw cnlr.median should be adjusted for the logR of the diploid state (found in fit$dipLogR). Otherwise, with high or low ploidy the raw cnlr.median may be misleading.

For example, in the results plotted below segments with logR ~ 0 are still duplicated, so a useful metric to assess enrichment becomes: fit$cncf$cnlr.median - fit$dipLogR. Do you agree with this?

Thanks a lot again!
Dario

facets

@veseshan
Copy link
Collaborator

Feel free to bombard away since it could help to improve the package.

You may not want to filter away all the lcn.em=NA segments. There can be clear focal amplifications i.e. large tcn.em but due to the sparsity of het SNPs we can't determine lcn unambiguously.

cnlr.median is the segment specific estimate. For the clust suffix segments that are similar are combined and given a cluster specific estimate. So they ought to be very similar. cnlr.median by itself is not informative since it depends on library size, ploidy etc. As you said the right quantity to use is cnlr.median - dipLogR.

Venkat

@dariober
Copy link
Author

...Following this up... What about segments where tcn.em=2 and lcn.em=0?

This could be neutral segments if both tumour and normal are homozygote (hence lcn.em=0), so they could be filtered out. However, some of these could be loss-of-heterozygosity segments where the normal does have lcn.em=1 (instead cases where tcn.em != 2 and lcn.em=0 are definitely interesting).

I think to distinguish which case we have one needs to look at mafR.clust and see whether it is close to zero (neutral segment) or >0 (LOH). Now, I guess deciding when mafR.clust is convincingly > 0 may be tricky as it depends also on the number of markers in the segment (few markers are more likely to generate spurious deviations from 0). Any thoughts...?

Dario

@veseshan
Copy link
Collaborator

A segment that has tcn=2 and lcn=0 is always copy neutral LOH because in order to assign lcn the segment should have had enough het snps in the normal. If it didn't lcn assignment would have been NA. You can question whether min.nhet is large enough.

Venkat

@dariober
Copy link
Author

Indeed, that makes sense. Let's try to recapitulate. Would you agree with this classification?

tcn.em lcn.em type
0 0 DEL
0 NA DEL
1 0 HEMIZYGOTE
1 NA HEMIZYGOTE
2 0 LOH
2 1 NEUTR
2 NA NEUTR/Unknown
2+ 0 DUP-LOH
2+ 1+ DUP
2+ NA DUP-[LOH?]

@veseshan
Copy link
Collaborator

Yes except that tcn=1 automatically implies lcn=0. So (1,NA) shouldn't occur (if it does it's a bug).

@dariober
Copy link
Author

Hi - Thanks! I have a few cases where tcn=0 and lcn=NA, 9 out of 8486 (cumulated on several samples). These are the records, if useful:

chrom     start       end seg num.mark nhet segclust cnlr.median.clust   mafR.clust cf.em tcn.em lcn.em    sample_id 
 chrX     29133   2780079 171     3999 1035       20      -0.188145471  0.105767213     1      1     NA    276
 chrX   2780409  31879100 172    26156   31       15      -0.320095316           NA     1      1     NA    276
 chrX  31961182  58438972 174    22373   20       15      -0.320095316           NA     1      1     NA    276
 chrX  58439460  62366300 175      139    0       19      -0.244809231           NA     1      1     NA    276
 chrX  67532300  68324978 177      754    0       15      -0.320095316           NA     1      1     NA    276
 chrX  73961253  75429969 179     1357    0       15      -0.320095316           NA     1      1     NA    276
 chrX 155704979 156030162 181      603  131        8      -0.385180248 -0.001746385     1      1     NA    276
 chrX     21457   2781330  34     4363 1489        9       0.009380662 -0.001925520     1      1     NA    303
 chrX 155703951 156030162  36      622   79       10       0.009380662  0.003004941     1      1     NA    303

@veseshan
Copy link
Collaborator

Thanks. Will fix the bug.

@brewert2
Copy link

I have a quick question regarding the ploidy/purity output.
What does it mean when it says "75%" such as in " .purity.75%" and/or .ploidy.75%">

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants