docs: add parameter codes #93
Conversation
| PARAMCODE | dataset/samples | file name | matching ground truth | run type | genome/annotation | explicit param_settings | processing directive | benchmarks |
| --------- | --------------- | ---------- | --------------------- | -------- | ----------------- | ----------------------- | -------------------- | -------------- |
| AA | SRR6795720 | Mayr_NB_R1 | SRR1005606 | indiv | GRCh38 | 4 cores | individual sample | I2,I3,I4,Q2,Q3 |
---
This looks really nice! But I still think we should run all of those with 1, 4 and 16 cores. Perhaps we could append the cores to the parameter codes in the filenames so as not to explode the table (and then remove the column). So AA1, AA4, AA16 would be run AA with 1, 4 and 16 cores, respectively. Having these data will allow us to evaluate parallelization efficiency and scalability, and it should be easy to run these.
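As an aside, the parallelization efficiency mentioned here could be computed from the 1/4/16-core wall-clock times roughly like this (a sketch; all timing values are made-up placeholders, not real measurements):

```python
# Sketch: speedup and parallel efficiency from wall-clock times measured
# with different core counts. The timing values below are placeholders.
timings = {1: 120.0, 4: 40.0, 16: 15.0}  # cores -> wall-clock seconds

baseline = timings[1]  # single-core reference time
for cores in sorted(timings):
    speedup = baseline / timings[cores]  # T(1) / T(n)
    efficiency = speedup / cores         # speedup per core
    print(f"{cores:>2} cores: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```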
---
I like that idea. One thing I worry about is ending up with something like:
```
AJ1/MISO/AJ1_MISO_01.bed
AJ4/MISO/AJ4_MISO_01.bed
AJ16/MISO/AJ16_MISO_01.bed
```
where all those outputs should be identical. Which one(s) should, say, benchmark I2 assess?
The cores parameter should only matter for the I1 (#6), Q1 (#7), and D1 (#13) benchmarks (sidenote: we still need an output code for those). Perhaps a compromise would be that outputs that are agnostic to resources omit the core count, so
```
AJ/MISO/AJ_MISO_01.bed
```
while the resource-benchmarking outputs include one, like
```
AJ4/MISO/AJ4_MISO_04.json
```
Perhaps that is even more confusing? 😬
---
I really like the table! I think it includes everything.
As for the number of cores: it should just be a new parameter code. Most fields will be duplicated, but that shouldn't be a worry. It also gives the flexibility to adjust the benchmarks used, etc. E.g., AP run with one core would simply become AQ, and AP with 16 cores would become AR.
To make it more readable, we could use the first letter in a more logical way. E.g. instead of "A" we could use the first letter of the file name, in this case "M".
Maybe I'm just overthinking the problem.
---
I like it!
Remarks:
- When I suggested this format earlier, @uniqueg and @dominikburri were strongly opposed to individual samples having their own param code. Do you still want to address that issue?
- Let's keep the codes in consecutive alphabetical order; otherwise it will get complicated and it becomes easier to accidentally introduce duplicate codes.
- Agree with @mfansler that appending cores to param codes will produce triplicate directories and identical output files. Where will the number of cores be specified in Nextflow/Snakemake? For Snakemake: if it were in the config, one could use it only to build the outfile name for out 04, like so:

```yaml
# config.yaml
param_code: AJ
method_name: MISO  # added here so the Snakefile lookups below resolve
cores: 4
```

```python
# Snakefile (os is available in Snakefiles)
out1 = os.path.join(
    config["param_code"], config["method_name"],
    "_".join([config["param_code"], config["method_name"], "01"]) + ".bed")
out4 = os.path.join(
    config["param_code"], config["method_name"],
    "_".join([config["param_code"], config["method_name"], "04",
              str(config["cores"])]) + ".json")
```

- Adapting Mervin's example:

```
AJ/MISO/AJ_MISO_01.bed
AJ/MISO/AJ_MISO_04_4.json
```
Well, not sure... out 01-03 would then be overwritten at each new run with different cores. Is that ok? It might be confusing for executors. So maybe we should just keep the original outline.
---
What I meant is that we wouldn't need an I1, Q1 & D1 anymore, but would have compute performance monitored for every benchmark (or, to use the OpenEBench terminology, make compute performance part of the metrics of every challenge). This would give us more fine-grained insights into where individual tools struggle, performance-wise. It may not be all that interesting, but it seems to me that keeping track of performance parameters for every run is something like "good practice"? Now, whether to run every tool n times is debatable; I guess we'd need to consider cost/benefit here. I'm totally fine with leaving that and doing it only once.
However, just to respond to some arguments:
As for producing n times the data: I guess that can't be avoided if we want to run the tools n times, regardless of how we name the files.
As for choosing which one of many (supposedly identical) output files to use afterwards (for other benchmarks): we will need to define that arbitrarily (or have the tools define it for us via defaults) regardless of whether we run them n times or just once. And as for the outputs being identical or not: this would already give us another metric we could always check, for every run: are the results reproducible (using different numbers of cores)?
Yes, still against having parameter codes for individual samples, as those would then just be short names for samples and wouldn't reflect what is, in OEB parlance, a "challenge" (describing a specific set of inputs). To construct that we would then need to somehow merge different sample codes along some rules into a complex parameter/challenge code, and I think that's complicated. So, by all means, we should have one simple code that identifies a specific way each tool should be run (and that should reasonably lead to identical or extremely similar results every time it is run for that tool). Of course, that doesn't mean that a set of inputs couldn't be just one sample, if that is what we really want to have for a specific run (I can't really judge that because I was following the metrics discussions at best peripherally).
I also agree with @ninsch3000 that it's best to keep the 2-letter code in alphabetical order, without any semantics, for the reasons she mentioned. But perhaps even better, we could follow the rules for the challenge IDs defined in the OpenEBench schema definition, which call for codes generated in accordance with this regex: `^OEBX[0-9]{3}[A-Z0-9]{7}$`

So that would be, e.g., `OEBX123ABCD0EF` or `OEBX00000000AA`. Basically we could keep our 2-letter codes, but prefix them with `OEBX00000000`. Or we could use the first three digits to indicate the cores: `OEBX00100000AA`, `OEBX00400000AA`, `OEBX01600000AA`.
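The candidate codes above can be checked against that regex directly (a quick sketch; only the pattern itself comes from the OpenEBench schema as quoted):

```python
import re

# Challenge _id pattern as quoted from the OpenEBench schema definition
OEB_ID = re.compile(r"^OEBX[0-9]{3}[A-Z0-9]{7}$")

for code in ("OEBX123ABCD0EF", "OEBX00000000AA", "OEBX01600000AA", "AA"):
    print(f"{code}: {'valid' if OEB_ID.match(code) else 'invalid'}")
```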
Anyway, let's vote on a couple of things :)
---
❤️ = Generate parameter codes in alphabetical order (as is)
🚀 = Design semantic parameter codes
---
❤️ = Use 2-letter code, e.g., `AA` (as is)
🚀 = Use prefix to generate OEB challenge `_id`-compatible parameter codes, e.g., `OEBX00000000AA`
---
❤️ = run every challenge (i.e., set of inputs) once unless specifically designed to assess reproducibility or parallelization performance
🚀 = run every challenge multiple times to assess reproducibility and/or parallelization performance
---
❤️ = use separate param codes for different # of CPUs or technical replicates
🚀 = codify # of CPUs and technical replicates in param codes
🎉 = codify # of CPUs and technical replicates in output names only (and just give a general guideline for all challenges to be run, say, 3x with default settings and once each with 1, 4 & 16 cores, if supported by the method)
Note that conceptually, the first two options imply cross-challenge comparison metrics in OpenEBench if we want to assess reproducibility and/or parallelization performance. This might be particularly tricky/confusing if we go for challenge ID-type param codes.
Suggestions for codifying rules:

- When using 2-letter param codes: `AA_nnn_###` (example: `AD_003_016`)
- When using OEB-compatible param codes: `OEBXnnn###00AA`, where `nnn` signifies the technical replicate, `###` signifies the core number, and `AA` signifies the set of inputs (example: `OEBX00301600AD`)
- Use `000` to signify the default number of cores
- Always use `000` for cores and `001` for technical replicates for any metrics of that challenge that are not related to testing for reproducibility or parallelization
---
Agree with all suggestions, but I still think we should have 1 param code per sample in some cases (as proposed in the table). More specifically, when we have a matching ground truth file for each sample and want to assess that, we wouldn't want the tools to merge several samples and give us only one output file for all samples. As we can't anticipate at the moment how merging or not merging of samples can be controlled for each tool, I think the most robust way is to also have samples with their own param code. Another possibility might be to require the execution workflows to have a rule/directive that controls whether the tool receives a group of samples or individual samples as an input. This directive would then have to be configured according to the `run_type` or `processing_directive` of the param code table.
That said, please do note that other param codes specify a group of samples, and those can be the same samples that earlier had their own param code. But now they would be allowed to be merged and would have to be compared to a merged ground truth. We never suggested having only sample-specific param codes.
There are other datasets, not included in this first param code table, where we do not have one ground truth per sample; there, of course, we wouldn't run individual samples with their own param code but would group them according to how their ground truths had been grouped.
@@ -1,23 +1,58 @@
# Parameter codes

## Input to summary workflows
Parameter codes uniquely identify what combinations of input files and parameters were used to run an execution workflow.
---
Considering the OpenEBench terminology, it may be better to call these challenge codes. What we currently call challenges (identification, quantification, differential usage) would then become benchmarking events, and what we currently call benchmarks would become metrics (again?).
It's annoying, but we're unlikely to change the OpenEBench specifications/schemas, which have been around for a while and have some good thinking in them. And it would be good to adopt one consistent terminology.
Resolves #75
This is a working proposal; please use the review thread for discussion.