Q: impact of sequencing throughput on peak calling #644

Closed
SplitInf opened this issue May 13, 2024 · 4 comments

Comments

@SplitInf

Dear @taoliu

Thank you for MACS and for the constant improvements you have been adding.

Use case
I have histone ChIP-seq data from cancer cell lines that the sequencing service centre sequenced extra deep (>80M reads).

Describe the problem
Surprisingly, although all cell lines are quite similar, the number of peaks called varied. A few deeply sequenced samples have significantly more peaks called. Visual inspection shows noticeably more small, noise-like peaks in these samples. This is also surprising because the input samples were sequenced deeply too, which should in theory help mitigate the signal-to-noise problem.

Describe the solution you tried
QC results indicate that all samples were successful, with average RiP% > 15% and RiBL% < 5%. Down-sampling the samples in question lowered the number of peaks called. My question is: is this behaviour expected, and is there a recommended way to handle samples with uneven sequencing depths? Thanks!
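
For readers following along, the down-sampling step mentioned above could be planned roughly as follows. This is a minimal sketch, not the poster's actual procedure: the function name `downsample_fractions`, the sample names, and the read counts are all hypothetical, and it assumes the goal is to bring every sample down to the depth of the shallowest one.

```python
def downsample_fractions(read_counts):
    """Return, per sample, the fraction of reads to keep so that
    every sample ends up at the depth of the shallowest sample."""
    if not read_counts:
        return {}
    target = min(read_counts.values())
    if target <= 0:
        raise ValueError("read counts must be positive")
    return {name: target / n for name, n in read_counts.items()}


# Hypothetical read counts, for illustration only.
counts = {"lineA": 80_000_000, "lineB": 40_000_000, "lineC": 100_000_000}
fractions = downsample_fractions(counts)
for name, frac in fractions.items():
    print(f"{name}: keep {frac:.2f} of reads")
```

The resulting fractions would then be passed to whatever subsampling tool is already in use.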

@taoliu
Contributor

taoliu commented May 14, 2024

This comes down to finding a better way to control the peak-calling threshold. One way is to use the IDR approach if you have multiple replicates. If you have a single deeply sequenced sample, you can divide it into multiple technical replicates and then apply IDR (these are called pseudo-replicates in the ENCODE pipeline; see, for example, the ENCODE histone ChIP-seq pipeline). Another way is to use the cutoff-analysis approach to choose a better cutoff in MACS.
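
The idea of splitting one deep sample into pseudo-replicates can be illustrated with a small sketch. It only shows the random partitioning step; the ENCODE pipeline's own implementation is not shown in this thread and may differ. The function name `split_pseudoreplicates` and the placeholder read IDs are hypothetical, and in practice the split would be applied to alignment records (keeping read pairs together for paired-end data).

```python
import random


def split_pseudoreplicates(reads, seed=0):
    """Randomly partition reads into two disjoint pseudo-replicates
    of (nearly) equal size. A fixed seed keeps the split reproducible."""
    shuffled = list(reads)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Placeholder read identifiers, for illustration only.
reads = [f"read{i}" for i in range(11)]
rep1, rep2 = split_pseudoreplicates(reads, seed=42)
print(len(rep1), len(rep2))
```

Peaks would then be called on each half separately and compared with IDR.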

@SplitInf
Author

SplitInf commented May 17, 2024

Thank you for your reply @taoliu! Digging deeper into my dataset, I realized that these problematic samples have a noticeably higher proportion of peaks with low signalValue (but high confidence).
I tried running the cutoff analysis and found that even a very stringent q-value couldn't get rid of these weak peaks. My question now is: would you recommend using a signalValue threshold? The same antibody was used for all samples, and they were processed together, so I don't think the problem is with the reagent. Many thanks again.

@taoliu
Contributor

taoliu commented May 22, 2024

@SplitInf Yes. I would suggest applying two thresholds: signalValue (it is the log2 fold change, so 1 means 2-fold) and the p- or q-value. With large read counts, it is common for a small log2FC to reach high confidence (p- or q-value). As in RNA-seq differential expression analysis (e.g. a volcano plot), you can use both at the same time to narrow down the results.
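
A two-threshold filter of this kind might look like the sketch below. It assumes the peaks are in the standard 10-column narrowPeak layout, where column 7 is signalValue and column 9 is the -log10 q-value. The function name `filter_narrowpeak`, the example peaks, and the cutoff values are hypothetical. The signalValue cutoff is applied on whatever scale the file uses, so it is worth checking that against the documentation for your MACS version.

```python
def filter_narrowpeak(lines, min_signal, min_neglog10_q):
    """Keep narrowPeak records that pass BOTH thresholds:
    signalValue (column 7) and -log10 q-value (column 9)."""
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and optional header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        signal = float(fields[6])      # column 7: signalValue
        neglog10_q = float(fields[8])  # column 9: -log10(q-value)
        if signal >= min_signal and neglog10_q >= min_neglog10_q:
            kept.append(line)
    return kept


# Made-up peaks, for illustration only.
peaks = [
    "chr1\t100\t500\tpeak_1\t250\t.\t8.5\t30.1\t27.4\t150",
    "chr1\t900\t1200\tpeak_2\t40\t.\t1.3\t12.0\t9.8\t120",  # confident but weak signal
    "chr2\t50\t300\tpeak_3\t15\t.\t6.0\t2.1\t0.9\t90",      # strong signal, low confidence
]
strong = filter_narrowpeak(peaks, min_signal=2.0, min_neglog10_q=2.0)
print([p.split("\t")[3] for p in strong])
```

Here `peak_2` is the kind of peak described in the thread: it passes the q-value cutoff but is dropped by the signalValue cutoff.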

@SplitInf
Author

thanks again for the explanation
