bootstrap the empirical FDR #14

Closed
noamteyssier opened this issue May 17, 2023 · 1 comment · Fixed by #15
Labels
enhancement New feature or request

Comments

@noamteyssier
Owner

A single instantiation of the FDR is biased by the pseudogene groupings, which are completely random and nondeterministic. As a result, the FDR threshold can vary widely between instantiations.

Instead, it would be better to run the null distribution test a large number of times, calculate the empirical FDRs in each run, and then average those values to get an FDR estimate for every gene.

Another method would be to run the null test a large number of times, calculate the empirical FDRs, find the threshold in each run, and then average the thresholds across runs.

It would be best to do both and compare the statistics.
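
To make the two candidates concrete, they can be written as follows. The notation is an assumption introduced here, not taken from the code: $m$ is the number of null runs, $\mathrm{FDR}_g^{(r)}$ is the empirical FDR of gene $g$ in run $r$, and $t^{(r)}$ is the score threshold found in run $r$ at the chosen $\alpha$.

$$
\overline{\mathrm{FDR}}_g = \frac{1}{m} \sum_{r=1}^{m} \mathrm{FDR}_g^{(r)},
\qquad
\bar{t} = \frac{1}{m} \sum_{r=1}^{m} t^{(r)}
$$

Under the first, gene $g$ would be a hit when $\overline{\mathrm{FDR}}_g \le \alpha$; under the second, when its score clears $\bar{t}$.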

@noamteyssier noamteyssier added the enhancement New feature or request label May 17, 2023
@noamteyssier
Owner Author

Distribution of Rankings for pseudogenes in 500 runs

The spread here is pretty wide, which makes a good case for some form of aggregation.

image

Distribution of FDR threshold ($\alpha = 0.05$) for 500 runs

image

In this case, all of the genes at the $\sim 2.5$ range would be considered hits, even though they may not be considered significant in each individual INC run.

This would consider 60 genes as hits in my test dataset.
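
A rough sketch of the threshold-averaging idea, in standard-library Python. Everything named here (`empirical_fdr`, `fdr_threshold`, `bootstrap_thresholds`, `group_size`) is hypothetical, and two details are assumptions not stated in this issue: a pseudogene's score is taken as the mean of its control sgRNA scores, and the empirical FDR at rank $k$ is taken as the fraction of pseudogenes among the top $k$ scores.

```python
import random
from statistics import mean


def empirical_fdr(scores, is_null):
    """Rank items by descending score; the FDR at rank k is
    (number of null items in the top k) / k. Returned in input order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    fdr = [0.0] * len(scores)
    n_null = 0
    for rank, idx in enumerate(order, start=1):
        n_null += is_null[idx]
        fdr[idx] = n_null / rank
    return fdr


def fdr_threshold(scores, is_null, alpha=0.05):
    """Lowest score whose empirical FDR is still <= alpha (None if no item passes)."""
    fdr = empirical_fdr(scores, is_null)
    passing = [s for s, f in zip(scores, fdr) if f <= alpha]
    return min(passing) if passing else None


def bootstrap_thresholds(gene_scores, control_scores, group_size,
                         n_runs=500, alpha=0.05, seed=0):
    """Regroup the controls into random pseudogenes n_runs times and
    collect the FDR threshold found in each run."""
    rng = random.Random(seed)
    thresholds = []
    for _ in range(n_runs):
        shuffled = control_scores[:]
        rng.shuffle(shuffled)
        # Assumption: pseudogene score = mean of its control scores.
        pseudo = [mean(shuffled[i:i + group_size])
                  for i in range(0, len(shuffled) - group_size + 1, group_size)]
        scores = gene_scores + pseudo
        is_null = [0] * len(gene_scores) + [1] * len(pseudo)
        t = fdr_threshold(scores, is_null, alpha)
        if t is not None:  # runs with no passing item are skipped
            thresholds.append(t)
    return thresholds
```

The aggregated threshold would then be `mean(bootstrap_thresholds(...))`, with genes scoring at or above it called as hits. Skipping runs that produce no threshold is a design choice in this sketch and could shift the average if such runs are common.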

Calculating individual gene FDR averages

Another approach would be to calculate an average empirical FDR for each gene given the background distribution and then apply the $\alpha = 0.05$ threshold to that average. In this case, the pseudogenes would be bootstrapped over $m=500$ runs, with an FDR calculated in each run; for each gene, the mean of its empirical FDRs across runs would then be taken and reported.

This would consider 73 genes as hits in my test dataset.
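
The per-gene averaging could look roughly like the sketch below, again in standard-library Python. The function names and `group_size` are hypothetical, and the same assumptions apply: pseudogene scores are means of randomly grouped control scores, and a gene's empirical FDR in one run is the share of pseudogenes among all items scoring at least as high as that gene.

```python
import random
from statistics import mean


def gene_fdrs(gene_scores, pseudo_scores):
    """Empirical FDR of each gene against one set of pseudogene scores."""
    fdrs = []
    for g in gene_scores:
        n_null = sum(p >= g for p in pseudo_scores)
        # The gene itself is counted, so the denominator is never zero.
        n_total = n_null + sum(o >= g for o in gene_scores)
        fdrs.append(n_null / n_total)
    return fdrs


def mean_gene_fdr(gene_scores, control_scores, group_size, n_runs=500, seed=0):
    """Average each gene's empirical FDR over n_runs random pseudogene groupings."""
    rng = random.Random(seed)
    totals = [0.0] * len(gene_scores)
    for _ in range(n_runs):
        shuffled = control_scores[:]
        rng.shuffle(shuffled)
        # Assumption: pseudogene score = mean of its control scores.
        pseudo = [mean(shuffled[i:i + group_size])
                  for i in range(0, len(shuffled) - group_size + 1, group_size)]
        for j, f in enumerate(gene_fdrs(gene_scores, pseudo)):
            totals[j] += f
    return [t / n_runs for t in totals]


def call_hits(avg_fdr, alpha=0.05):
    """Indices of genes whose averaged FDR is at or below alpha."""
    return [j for j, f in enumerate(avg_fdr) if f <= alpha]
```

Unlike the averaged threshold, this yields a value per gene, so two genes with similar scores can land on opposite sides of $\alpha$ only if their averaged FDRs genuinely differ across runs.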
