-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: scipts to match predicted sites to ground truth #66
Conversation
…book) to run the PAS annotation
…and ground truth quantification
Thanks a lot @lschaerfen , looks like great work to me, and will surely be useful in more than one benchmark! Unfortunately I'm a much less versed programmer than you, so wouldn't want to give advice. Maybe @mzavolan and @mrgazzara could review? |
Is there a reason why the predicted PAS are merged before mapping to ground truth? I can see a couple of problems, especially for bigger windows: distant predicted PAS could be merged into one and then mapped to two different ground truth PAS, or predicted PAS that is more than one window away from GT could be first merged with a site located closer to GT. Or does the code work differently and I'm misunderstanding something? |
You're not misunderstanding, this is how the code works. The first step is to merge sites that fall within the window parameter. I can add a parameter to turn off merging, so we can figure out later if we want that or not? |
…t, removed handling one predicted site matching with multiple ground truth sites for now (these sites are discarded)
I addressed your first two scenarios @mzavolan in the latest commit. For now sites that fall under your 3rd point are discarded. I will look into your other points later, also your concern @daneckaw, but at least it is usable now. In the test data not many sites fall into scenario 3 when using a reasonable window parameter. Thank you for your comments!! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly housekeeping comments.
9. end ground truth | ||
10. name ground truth | ||
11. expression ground truth (additional columns go after this one, such as ground truth gene_ID) | ||
12. weight of prediction expression |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make it compatible with BED format, I would include "strand ground truth" in column 12. Then the output is BED of prediction, BED of ground truth and then additional columns.
vec_pred = [] | ||
|
||
# multiple predicted sites for one ground truth? | ||
multiple_predicted_sites = out.duplicated([6, 7, 8, 11], keep=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it was mentioned already, but for better readability it would be nice to have named columns.
args = parser.parse_args() | ||
|
||
fname = args.bed | ||
out = pd.read_csv(fname, delimiter='\t', header=None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here you could read in the csv with specific column names. As this is known and fixed, I don't see a reason to hardcode this.
parser.add_argument('a', help='The BED file containing predictions. MUST be BED6 format.') | ||
parser.add_argument('b', help='The ground truth bed file. First 6 columns must be standard BED6, but can have additional columns appended.') | ||
parser.add_argument('window', help='Number of bases to append to each side of the predicted site.', type=int) | ||
#parser.add_argument('-o', help='output file directory') # not yet implemented!! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of output directory, you could also give the option to save output to specified file name.
window = args.window | ||
|
||
|
||
def bedtools_window(bed1, bed2, window, reverse=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain the parameter reverse? According to bedtools the parameter "-v" means "Only report those entries in A that have no overlaps with B.", is this correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is to find the polyA sites that do not have an overlap in the ground truth set.
|
||
parser.add_argument('a', help='The BED file containing predictions. MUST be BED6 format.') | ||
parser.add_argument('b', help='The ground truth bed file. First 6 columns must be standard BED6, but can have additional columns appended.') | ||
parser.add_argument('window', help='Number of bases to append to each side of the predicted site.', type=int) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be easier to rename window
to window_size
.
|
||
# find sites with no overlap given the window | ||
out_rev = bedtools_window(f_PD, f_GT, window, reverse=True) | ||
out_rev.rename({0: 'chrom_p', 1: 'chromStart_p', 2: 'chromEnd_p', 3: 'name_p', 4: 'score_p', 5: 'strand_p', 6: 'chrom_g', 7: 'chromStart_g', 8: 'chromEnd_g', 9: 'name_g', 10: 'score_g', 11: 'strand_g'}, axis=1, inplace=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could do the renaming in bedtools_window
directly, then you don't have to write the same code twice.
Fixes issues #64 and should remove blocker from #8
summary_workflows/quantification/match_with_gt.py
usesbedtools window
to match predicted and ground truth sites. Expression values are weighted in case multiple predicted sites overlap one ground truth, or vice versa.summary_workflows/quantification/corr_with_gt.py
takes the output (predicted sites matched with ground truth) and calculates the correlation coefficient of the corresponding expression levels.summary_workflows/quantification/corr_with_gt.py
takes the output (predicted sites matched with ground truth) and calculates the correlation coefficient of the corresponding expression levels.summary_workflows/quantification/README.md
to explain usage of scripts.