Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: scipts to match predicted sites to ground truth #66

Merged
merged 11 commits into from
Jun 3, 2021

Conversation

lschaerfen
Copy link
Collaborator

@lschaerfen lschaerfen commented May 31, 2021

Fixes issues #64 and should remove blocker from #8

  • summary_workflows/quantification/match_with_gt.py uses bedtools window to match predicted and ground truth sites. Expression values are weighted in case multiple predicted sites overlap one ground truth, or vice versa.
  • summary_workflows/quantification/corr_with_gt.py takes the output (predicted sites matched with ground truth) and calculates the correlation coefficient of the corresponding expression levels.
  • summary_workflows/quantification/corr_with_gt.py takes the output (predicted sites matched with ground truth) and calculates the correlation coefficient of the corresponding expression levels.
  • Updates summary_workflows/quantification/README.md to explain usage of scripts.

@lschaerfen lschaerfen marked this pull request as ready for review May 31, 2021 00:43
@lschaerfen lschaerfen linked an issue May 31, 2021 that may be closed by this pull request
@lschaerfen lschaerfen linked an issue Jun 2, 2021 that may be closed by this pull request
@ninsch3000
Copy link
Collaborator

Thanks a lot @lschaerfen , looks like great work to me, and will surely be useful in more than one benchmark! Unfortunately I'm a much less versed programmer than you, so wouldn't want to give advice. Maybe @mzavolan and @mrgazzara could review?

summary_workflows/quantification/match_with_gt.py Outdated Show resolved Hide resolved
summary_workflows/quantification/match_with_gt.py Outdated Show resolved Hide resolved
summary_workflows/quantification/README.md Show resolved Hide resolved
@daneckaw
Copy link
Collaborator

daneckaw commented Jun 2, 2021

Is there a reason why the predicted PAS are merged before mapping to ground truth? I can see a couple of problems, especially for bigger windows: distant predicted PAS could be merged into one and then mapped to two different ground truth PAS, or predicted PAS that is more than one window away from GT could be first merged with a site located closer to GT. Or does the code work differently and I'm misunderstanding something?

@lschaerfen
Copy link
Collaborator Author

You're not misunderstanding, this is how the code works. The first step is to merge sites that fall within the window parameter. I can add a parameter to turn off merging, so we can figure out later if we want that or not?

…t, removed handling one predicted site matching with multiple ground truth sites for now (these sites are discarded)
@lschaerfen
Copy link
Collaborator Author

lschaerfen commented Jun 3, 2021

I addressed your first two scenarios @mzavolan in the latest commit. For now sites that fall under your 3rd point are discarded. I will look into your other points later, also your concern @daneckaw, but at least it is usable now. In the test data not many sites fall into scenario 3 when using a reasonable window parameter. Thank you for your comments!!

@uniqueg uniqueg requested a review from mzavolan June 3, 2021 07:48
@uniqueg uniqueg changed the title RE Issue #64: Add tool to match predicted sites to ground truth. feat: scipt to match predicted sites to ground truth Jun 3, 2021
@uniqueg uniqueg changed the title feat: scipt to match predicted sites to ground truth feat: scipts to match predicted sites to ground truth Jun 3, 2021
@uniqueg uniqueg merged commit d54eeba into iRNA-COSI:main Jun 3, 2021
Copy link
Collaborator

@dominikburri dominikburri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly housekeeping comments.

9. end ground truth
10. name ground truth
11. expression ground truth (additional columns go after this one, such as ground truth gene_ID)
12. weight of prediction expression
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make it compatible with BED format, I would include "strand ground truth" in column 12. Then the output is BED of prediction, BED of ground truth and then additional columns.

vec_pred = []

# multiple predicted sites for one ground truth?
multiple_predicted_sites = out.duplicated([6, 7, 8, 11], keep=False)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it was mentioned already, but for better readability it would be nice to have named columns.

args = parser.parse_args()

fname = args.bed
out = pd.read_csv(fname, delimiter='\t', header=None)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you could read in the csv with specific column names. As this is known and fixed, I don't see a reason to hardcode this.

parser.add_argument('a', help='The BED file containing predictions. MUST be BED6 format.')
parser.add_argument('b', help='The ground truth bed file. First 6 columns must be standard BED6, but can have additional columns appended.')
parser.add_argument('window', help='Number of bases to append to each side of the predicted site.', type=int)
#parser.add_argument('-o', help='output file directory') # not yet implemented!!
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of output directory, you could also give the option to save output to specified file name.

window = args.window


def bedtools_window(bed1, bed2, window, reverse=False):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain the parameter reverse? According to bedtools the parameter "-v" means "Only report those entries in A that have no overlaps with B.", is this correct?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is to find the polyA sites that do not have an overlap in the ground truth set.


parser.add_argument('a', help='The BED file containing predictions. MUST be BED6 format.')
parser.add_argument('b', help='The ground truth bed file. First 6 columns must be standard BED6, but can have additional columns appended.')
parser.add_argument('window', help='Number of bases to append to each side of the predicted site.', type=int)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be easier to rename window to window_size.


# find sites with no overlap given the window
out_rev = bedtools_window(f_PD, f_GT, window, reverse=True)
out_rev.rename({0: 'chrom_p', 1: 'chromStart_p', 2: 'chromEnd_p', 3: 'name_p', 4: 'score_p', 5: 'strand_p', 6: 'chrom_g', 7: 'chromStart_g', 8: 'chromEnd_g', 9: 'name_g', 10: 'score_g', 11: 'strand_g'}, axis=1, inplace=True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could do the renaming in bedtools_window directly, then you don't have to write the same code twice.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Matching up predicted with ground truth sites Q2: Specification - Correlation to 3'-End Seq
8 participants