Q: impact of sequencing throughput on peak calling #644

SplitInf · 2024-05-13T22:21:14Z

Thank you for macs and for the constant improvements you have been adding.

Use case
I have some histone chip data on cancer cell lines where they were sequenced extra deep by the sequencing service centre (>80M reads).

Describe the problem
Surprisingly, despite all cell lines being quite similar, the number of peaks called varied. A few deeply sequenced samples have significantly more peaks being called. Visually inspecting peak calling indicates that there are noticeably more small/noise-like peaks being called. It's also surprising since the input samples were sequenced deeply and would, in theory, help to mitigate the signal/noise problem.

Describe the solution you tried
QC results indicate all samples were successful, with an average RiP% > 15% ; RiBL% <5%. Down-sampling of the samples in question helped to lower the number of peaks being called. My question is: Are these behaviours expected and there is a recommended way to handle samples with uneven sequencing depths? Thanks!

taoliu · 2024-05-14T16:09:15Z

It's related to a better way to control the threshold of calling peaks. One way is to use the IDR approach if you have multiple replicates. Or if you have one deeply sequenced sample, you can divide them into multiple technical replicates, then use the IDR approach later ( it's called pseudo replicate in the ENCODE pipeline, for example encode Histone ChIP-seq pipeline ). Another way is to try to use the cutoff-analysis approach to decide a better cutoff in MACS.

SplitInf · 2024-05-17T21:35:44Z

Thank you for your reply @taoliu! Digging deeper into my dataset, I realized that these problematic samples have noticeably higher proportion of peaks with low signalValue (but with high confidence).
I tried running the cutoff analysis and found that even using a very stringent qvlaue couldn't get rid of these weak peaks. My question now is, would you recommend using signalValue threshold? The same antibody was used for all samples and processed together, so I don't think it's the problem with the reagent. Many thanks again.

taoliu · 2024-05-22T14:16:15Z

@SplitInf Yes. I would suggest you apply two thresholds -- signalValue (it is the log2foldchange values, so 1 means 2 folds), and p or q-value. It's common that when the number is big, a small log2fc can have a high confidence (p or q-value). As in RNA-seq differentially expressed genes analysis (e.g. volcano plot), you can use both at the same time to narrow down the results.

SplitInf · 2024-06-10T20:11:31Z

thanks again for the explanation

SplitInf added the General Question label May 13, 2024

SplitInf closed this as completed Jun 10, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Q: impact of sequencing throughput on peak calling #644

Q: impact of sequencing throughput on peak calling #644

SplitInf commented May 13, 2024

taoliu commented May 14, 2024 •

edited

Loading

SplitInf commented May 17, 2024 •

edited

Loading

taoliu commented May 22, 2024

SplitInf commented Jun 10, 2024

Q: impact of sequencing throughput on peak calling #644

Q: impact of sequencing throughput on peak calling #644

Comments

SplitInf commented May 13, 2024

taoliu commented May 14, 2024 • edited Loading

SplitInf commented May 17, 2024 • edited Loading

taoliu commented May 22, 2024

SplitInf commented Jun 10, 2024

taoliu commented May 14, 2024 •

edited

Loading

SplitInf commented May 17, 2024 •

edited

Loading