narrowPeak score (5th) column value above 1000 #123

Closed
alexbarrera opened this issue Mar 30, 2016 · 9 comments

@alexbarrera
Contributor

According to the UCSC ENCODE narrowPeak format, the fifth column (score) should be a value between 0 and 1000. Occasionally, MACS2 will produce peaks with a score value above 1000 (in our analysis, on average, this happens for less than 0.4% of the reported peaks). Still, some of the values go well beyond 1000, so I wonder how the score is computed in MACS2 and what score range is used.

I found this to be an issue when submitting data to the ENCODE project, which would not accept narrowPeak files containing peaks with scores above 1000.
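For context, a minimal R sketch for measuring how often this happens (the file name is hypothetical):

# Load a narrowPeak file (hypothetical name); read.table names the columns
# V1..V10, so V5 is the score column.
peaks <- read.table("example_peaks.narrowPeak", sep = "\t")
# Fraction of peaks whose score exceeds the 0-1000 range allowed by the spec.
mean(peaks$V5 > 1000)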

@alexbarrera
Contributor Author

Answering my own old question in case someone else lands here.

Today I observed (by chance) that the values of the 5th column in narrowPeak files are equal to the integer part of the 9th column (-log10(qvalue)) multiplied by 10: int(-10 * log10(qvalue)).

I have created PR #209 to improve the README description of the column, but it might be worth rescaling the value to the [0, 1000] range.
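A minimal R sketch of that check, plus one way to force scores back into range (the clamping step is my own illustration, not something MACS2 does, and the file name is hypothetical):

peaks <- read.table("example_peaks.narrowPeak", sep = "\t")
# V9 holds -log10(qvalue); the score in V5 should equal its integer part times 10.
all(peaks$V5 == as.integer(peaks$V9 * 10))
# Clamp scores at 1000 so the file passes narrowPeak validation.
peaks$V5 <- pmin(peaks$V5, 1000L)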

@mortunco

Thank you very much!

@amitjavilaventura

I tried what you said, but it did not work on my data, so maybe column 9 (-log10(qvalue)) is not proportional to column 5 (score).

head(as.integer(m_WTplusWNT_1b_narrow4$V9 * 10) == m_WTplusWNT_1b_narrow4$V5)

[1] FALSE FALSE FALSE FALSE FALSE FALSE

@alexbarrera
Contributor Author

@amitjavilaventura are you sure you are correctly loading your narrowPeak file in R? If I do the same, I can't reproduce the error and I do get the expected values:

narrowPeak <- read.table('A549.BCL3.dex.00h.rep1.dedup.masked.sorted_peaks.narrowPeak', sep='\t')
head(as.integer(narrowPeak$V9*10) == narrowPeak$V5)
[1] TRUE TRUE TRUE TRUE TRUE TRUE

If you can't spot the error, can you perhaps post a few lines from your narrowPeak file?

@amitjavilaventura

I have tried it once more, and I cannot find the error.
I load the file:
> m_APC_1b_narrow4 <- read.table("myfile.narrowPeak", header = FALSE, sep = "\t")

The maximum values of columns 9 and 5, respectively, are the following:

> max(m_APC_1b_narrow4$V9)
[1] 2417.319
> max(m_APC_1b_narrow4$V5)
[1] 24267

This doesn't make sense, because both maximum values should be proportional.

Nevertheless, I tried the same with the -log10(p-value) instead of the -log10(q-value), and it worked, at least for the maximum values.

> max(m_APC_1b_narrow4$V8)
[1] 2426.74
> max(m_APC_1b_narrow4$V5)
[1] 24267

I suppose that whether the relationship you quoted holds depends on what you use to "filter" the peaks during peak calling: a threshold on the p-value or a threshold on the q-value.

> head(as.integer(m_APC_1b_narrow4$V8 * 10) == m_APC_1b_narrow4$V5)
[1] TRUE TRUE TRUE TRUE TRUE TRUE

I don't really need the information in the 5th column right now; however, it was quite annoying to discover an "error" in it, because in theory this value cannot exceed 1000.

Thank you for your help.

@alexbarrera
Contributor Author

alexbarrera commented Aug 1, 2019

Oh, I see. Right, that makes sense. If you choose to use p-values (default) instead of q-values to apply any significance thresholds (or to filter, like you said), the 5th column values are computed using p-values. I always filter on q-values, which is why I didn't notice the issue, but since the default is to filter on p-values, I agree this should be reflected in the README.md as well. Good catch!
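For anyone hitting the same question, a minimal R sketch for checking which metric the 5th column was derived from (the file name is hypothetical; in the narrowPeak layout, V8 is -log10(pvalue) and V9 is -log10(qvalue)):

peaks <- read.table("example_peaks.narrowPeak", sep = "\t")
# Whichever comparison is TRUE for (nearly) all rows indicates the metric
# the scores were computed from.
mean(peaks$V5 == as.integer(peaks$V8 * 10))  # scores from -log10(pvalue)?
mean(peaks$V5 == as.integer(peaks$V9 * 10))  # scores from -log10(qvalue)?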

@amitjavilaventura

Thank you!

Actually, I think the default is the q-value. I am not sure, but from the MACS manual:

-q/--qvalue
The qvalue (minimum FDR) cutoff to call significant regions. Default is 0.05. For broad marks, you can try 0.05 as cutoff. Q-values are calculated from p-values using Benjamini-Hochberg procedure.
-p/--pvalue
The pvalue cutoff. If -p is specified, MACS2 will use pvalue instead of qvalue.

@amitjavilaventura

I have already opened a pull request to change it in the README.md.

@taoliu
Contributor

taoliu commented Aug 13, 2019

Thank you for your input! The changes have been integrated into the master branch and will be included in the next release.
