feat: scipts to match predicted sites to ground truth #66

lschaerfen · 2021-05-31T00:35:33Z

Fixes issue #64 and should remove the blocker from #8.

summary_workflows/quantification/match_with_gt.py uses bedtools window to match predicted and ground truth sites. Expression values are weighted in case multiple predicted sites overlap one ground truth, or vice versa.
summary_workflows/quantification/corr_with_gt.py takes the output (predicted sites matched with ground truth) and calculates the correlation coefficient of the corresponding expression levels.
Updates summary_workflows/quantification/README.md to explain usage of scripts.
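
To illustrate what the second script computes, here is a minimal sketch of a correlation between matched expression values. It assumes the Pearson coefficient (the description above does not say which coefficient `corr_with_gt.py` uses), and the `matched` values are made up for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Note: raises ZeroDivisionError if either vector is constant.
    return cov / math.sqrt(var_x * var_y)


# Hypothetical matched sites: (predicted expression, ground truth expression)
matched = [(10.0, 12.0), (3.0, 2.5), (7.0, 8.0), (0.5, 1.0)]
pred = [p for p, _ in matched]
truth = [g for _, g in matched]
r = pearson_r(pred, truth)
```

Unmatched sites would have to be handled before this step (dropped or assigned zero expression), which the sketch does not attempt.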


ninsch3000 · 2021-06-02T06:54:57Z

Thanks a lot @lschaerfen, looks like great work to me, and will surely be useful in more than one benchmark! Unfortunately I'm a much less versed programmer than you, so I wouldn't want to give advice. Maybe @mzavolan and @mrgazzara could review?


daneckaw · 2021-06-02T21:35:34Z

Is there a reason why the predicted PAS are merged before mapping to ground truth? I can see a couple of problems, especially for bigger windows: distant predicted PAS could be merged into one and then mapped to two different ground truth PAS, or a predicted PAS that is more than one window away from the GT could first be merged with a site located closer to the GT. Or does the code work differently and I'm misunderstanding something?

lschaerfen · 2021-06-02T21:47:56Z

You're not misunderstanding, this is how the code works. The first step is to merge sites that fall within the window parameter. I can add a parameter to turn off merging, so we can figure out later if we want that or not?
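
The chaining effect raised in the question can be reproduced with a small sketch. This is a simplified, hypothetical merge (greedy, single chromosome and strand, positions only), not the merge step actually used in `match_with_gt.py`.

```python
def merge_within_window(positions, window):
    """Greedily merge sorted positions whose gap to the previous site is <= window."""
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= window:
            clusters[-1].append(pos)  # chain onto the current cluster
        else:
            clusters.append([pos])  # start a new cluster
    return clusters


# Hypothetical predicted sites, each 40 nt from its neighbour
predicted = [100, 140, 180, 220]
window = 50
clusters = merge_within_window(predicted, window)

# All four sites end up in one cluster even though the outermost
# sites are further apart than the window.
span = clusters[0][-1] - clusters[0][0]
```

Here the merged cluster spans 120 nt with a 50 nt window, so a single merged site could sit within reach of two distinct ground truth sites.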


lschaerfen · 2021-06-03T05:20:50Z

I addressed your first two scenarios @mzavolan in the latest commit. For now sites that fall under your 3rd point are discarded. I will look into your other points later, also your concern @daneckaw, but at least it is usable now. In the test data not many sites fall into scenario 3 when using a reasonable window parameter. Thank you for your comments!!

dominikburri

Mostly housekeeping comments.

dominikburri · 2021-06-03T07:28:50Z

summary_workflows/quantification/README.md

+	9. end ground truth
+	10. name ground truth
+	11. expression ground truth (additional columns go after this one, such as ground truth gene_ID)
+	12. weight of prediction expression


To make it compatible with BED format, I would include "strand ground truth" in column 12. Then the output is BED of prediction, BED of ground truth and then additional columns.

dominikburri · 2021-06-03T07:33:19Z

summary_workflows/quantification/corr_with_gt.py

+vec_pred = []
+
+# multiple predicted sites for one ground truth?
+multiple_predicted_sites = out.duplicated([6, 7, 8, 11], keep=False)


I think it was mentioned already, but for better readability it would be nice to have named columns.
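
A sketch of what the named-column version could look like. The column names are borrowed from the rename mapping in `match_with_gt.py` quoted further down in this review; the example rows are invented.

```python
import pandas as pd

# Invented example: two predicted sites (p1, p2) hit the same ground truth site
out = pd.DataFrame(
    {
        "chrom_g": ["chr1", "chr1", "chr2"],
        "chromStart_g": [100, 100, 500],
        "chromEnd_g": [101, 101, 501],
        "strand_g": ["+", "+", "-"],
        "name_p": ["p1", "p2", "p3"],
    }
)

# Columns that identify a ground truth site (positions 6, 7, 8, 11 in the unnamed frame)
gt_key = ["chrom_g", "chromStart_g", "chromEnd_g", "strand_g"]

# multiple predicted sites for one ground truth?
multiple_predicted_sites = out.duplicated(gt_key, keep=False)
```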

dominikburri · 2021-06-03T07:34:02Z

summary_workflows/quantification/corr_with_gt.py

+args = parser.parse_args()
+
+fname = args.bed
+out = pd.read_csv(fname, delimiter='\t', header=None)


Here you could read in the CSV with specific column names. As the layout is known and fixed, I don't see a reason not to hardcode them.
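
For example, something along these lines. The names are taken from the rename mapping in `match_with_gt.py`; the real output file may have a different layout (e.g. a weight column or extra ground truth columns), in which case the list would need to be extended to match. The in-memory `example` stands in for `args.bed`.

```python
import io

import pandas as pd

# Assumed layout: BED6 of the prediction followed by BED6 of the ground truth
COLUMNS = [
    "chrom_p", "chromStart_p", "chromEnd_p", "name_p", "score_p", "strand_p",
    "chrom_g", "chromStart_g", "chromEnd_g", "name_g", "score_g", "strand_g",
]

# Invented single-row input standing in for the file passed on the command line
example = io.StringIO("chr1\t100\t101\tp1\t5.0\t+\tchr1\t98\t99\tg1\t6.0\t+\n")

out = pd.read_csv(example, delimiter="\t", header=None, names=COLUMNS)
```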

dominikburri · 2021-06-03T07:37:22Z

summary_workflows/quantification/match_with_gt.py

+parser.add_argument('a', help='The BED file containing predictions. MUST be BED6 format.')
+parser.add_argument('b', help='The ground truth bed file. First 6 columns must be standard BED6, but can have additional columns appended.')
+parser.add_argument('window', help='Number of bases to append to each side of the predicted site.', type=int)
+#parser.add_argument('-o', help='output file directory') # not yet implemented!!


Instead of output directory, you could also give the option to save output to specified file name.
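
A possible shape for this, as a sketch only: the `-o/--output` option and its default are hypothetical and not part of the PR, and the file names passed to `parse_args` are placeholders.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        description="Match predicted sites to ground truth sites."
    )
    parser.add_argument("a", help="The BED file containing predictions. MUST be BED6 format.")
    parser.add_argument("b", help="The ground truth bed file.")
    parser.add_argument("window", type=int,
                        help="Number of bases to append to each side of the predicted site.")
    # Hypothetical option: write the result to an explicit file name
    parser.add_argument("-o", "--output", default=None,
                        help="Output file name (hypothetical; default left to the script).")
    return parser


# Placeholder arguments for illustration
args = build_parser().parse_args(["pred.bed", "gt.bed", "15", "-o", "matched.bed"])
```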

dominikburri · 2021-06-03T07:42:42Z

summary_workflows/quantification/match_with_gt.py

+window = args.window
+
+
+def bedtools_window(bed1, bed2, window, reverse=False):


Could you explain the parameter reverse? According to bedtools the parameter "-v" means "Only report those entries in A that have no overlaps with B.", is this correct?

lschaerfen · 2021-06-03

Yes, this is to find the polyA sites that do not have an overlap in the ground truth set.
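
To make the role of `reverse` concrete, here is a sketch that only builds the `bedtools window` argument list. The body of `bedtools_window` is not shown in this thread, so the helper name `bedtools_window_cmd` and the way the command is assembled are assumptions; the flags `-a`, `-b`, `-w` and `-v` are those of `bedtools window`.

```python
def bedtools_window_cmd(bed1, bed2, window, reverse=False):
    """Build the argument list for `bedtools window` (hypothetical helper)."""
    cmd = ["bedtools", "window", "-a", bed1, "-b", bed2, "-w", str(window)]
    if reverse:
        # -v: only report entries in A that have no overlaps with B
        cmd.append("-v")
    return cmd


# Placeholder file names for illustration
matched_cmd = bedtools_window_cmd("pred.bed", "gt.bed", 15)
unmatched_cmd = bedtools_window_cmd("pred.bed", "gt.bed", 15, reverse=True)
```

The resulting list could then be passed to `subprocess.run`, with the output parsed into a DataFrame.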

dominikburri · 2021-06-03T07:43:22Z

summary_workflows/quantification/match_with_gt.py

+
+parser.add_argument('a', help='The BED file containing predictions. MUST be BED6 format.')
+parser.add_argument('b', help='The ground truth bed file. First 6 columns must be standard BED6, but can have additional columns appended.')
+parser.add_argument('window', help='Number of bases to append to each side of the predicted site.', type=int)


Might be easier to rename window to window_size.

dominikburri · 2021-06-03T07:51:59Z

summary_workflows/quantification/match_with_gt.py

+
+# find sites with no overlap given the window
+out_rev = bedtools_window(f_PD, f_GT, window, reverse=True)
+out_rev.rename({0: 'chrom_p', 1: 'chromStart_p', 2: 'chromEnd_p', 3: 'name_p', 4: 'score_p', 5: 'strand_p', 6: 'chrom_g', 7: 'chromStart_g', 8: 'chromEnd_g', 9: 'name_g', 10: 'score_g', 11: 'strand_g'}, axis=1, inplace=True)


You could do the renaming in bedtools_window directly, then you don't have to write the same code twice.
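
One way this could look, as a sketch: the mapping is defined once and applied by a small helper that `bedtools_window` could call before returning. The helper name `name_columns` is hypothetical. Keys missing from the frame are ignored by `rename`, so the same mapping also works for the `-v` output, which only contains the prediction columns.

```python
import pandas as pd

# Same mapping as in the quoted code, defined once
BED_PAIR_COLUMNS = {
    0: "chrom_p", 1: "chromStart_p", 2: "chromEnd_p",
    3: "name_p", 4: "score_p", 5: "strand_p",
    6: "chrom_g", 7: "chromStart_g", 8: "chromEnd_g",
    9: "name_g", 10: "score_g", 11: "strand_g",
}


def name_columns(df):
    """Return a copy of df with positional columns renamed (hypothetical helper)."""
    return df.rename(columns=BED_PAIR_COLUMNS)


# Invented 6-column frame, as produced when only the prediction columns are reported
raw = pd.DataFrame([["chr1", 100, 101, "p1", 5.0, "+"]])
named = name_columns(raw)
```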

lschaerfen and others added 5 commits May 27, 2021 10:31

Q3: annotating PAS, calculating per-gene usage

6bafc67

different input files, ideas to map PAS between different data sets

6503f35

added some commentary, separate files (separate from the jupyter notebook) to run the PAS annotation

ef1fdfe

adding annotation bed for mouse

ff0fa7d

adding script for ground truth matching

a0f22c5

lschaerfen assigned mfansler and mrgazzara May 31, 2021

lschaerfen marked this pull request as ready for review May 31, 2021 00:43

lschaerfen linked an issue May 31, 2021 that may be closed by this pull request

Matching up predicted with ground truth sites #64

Closed

uniqueg requested review from dominikburri and mfansler May 31, 2021 09:02

Leonard Schaerfen added 3 commits June 1, 2021 23:05

fixed bug in file naming

4a82dcf

added script to calculate correlation coefficient between prediction and ground truth quantification

4c14a4e

updated

09e8d98

lschaerfen linked an issue Jun 2, 2021 that may be closed by this pull request

Q2: Specification - Correlation to 3'-End Seq #8

Closed

lschaerfen mentioned this pull request Jun 2, 2021

Kamikaze Pilot: Write summary workflow for benchmark x #82

Closed

1 task

updated

96d605f

ninsch3000 requested a review from mrgazzara June 2, 2021 06:55

daneckaw mentioned this pull request Jun 2, 2021

Add Q2 benchmark specification and sample files #91

Merged

4 tasks

mzavolan self-requested a review June 2, 2021 18:56

mzavolan requested changes Jun 2, 2021

summary_workflows/quantification/match_with_gt.py (outdated)

summary_workflows/quantification/match_with_gt.py (outdated)

summary_workflows/quantification/README.md

daneckaw reviewed Jun 2, 2021

summary_workflows/quantification/match_with_gt.py (outdated)

lschaerfen added 2 commits June 3, 2021 00:38

named columns to increase readability

6cf98e7

removed merging predicted sites, added non-matched files to the output, removed handling one predicted site matching with multiple ground truth sites for now (these sites are discarded)

f641e8c

mfansler mentioned this pull request Jun 3, 2021

Kamikaze Pilot: Decide on one benchmark #77

Closed

7 tasks

uniqueg requested a review from mzavolan June 3, 2021 07:48

uniqueg changed the title ~~RE Issue #64: Add tool to match predicted sites to ground truth.~~ feat: scipt to match predicted sites to ground truth Jun 3, 2021

uniqueg changed the title ~~feat: scipt to match predicted sites to ground truth~~ feat: scipts to match predicted sites to ground truth Jun 3, 2021

uniqueg merged commit d54eeba into iRNA-COSI:main Jun 3, 2021

dominikburri reviewed Jun 3, 2021
