docs: add parameter codes #93
Conversation
| PARAMCODE | dataset/samples | file name | matching ground truth | run type | genome/annotation | explicit param_settings | processing directive | benchmarks |
| --------- | --------------- | ---------- | --------------------- | -------- | ----------------- | ----------------------- | -------------------- | -------------- |
| AA | SRR6795720 | Mayr_NB_R1 | SRR1005606 | indiv | GRCh38 | 4 cores | individual sample | I2,I3,I4,Q2,Q3 |
---
This looks really nice! But I still think we should run all of those with 1, 4 and 16 cores. Perhaps we could append the cores to the parameter codes in the filenames so as not to explode the table (and then remove the column). So AA1, AA4, AA16 would be run AA with 1, 4 and 16 cores, respectively. Having these data will allow us to evaluate parallelization efficiency and scalability, and it should be easy to run these.
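As an aside, the parallelization efficiency mentioned here could be computed from the 1/4/16-core wall-clock times roughly like this (a sketch; all timing values are made-up placeholders, not real measurements):

```python
# Sketch: speedup and parallel efficiency from wall-clock times measured
# with different core counts. The timing values below are placeholders.
timings = {1: 120.0, 4: 40.0, 16: 15.0}  # cores -> wall-clock seconds

baseline = timings[1]  # single-core reference time
for cores in sorted(timings):
    speedup = baseline / timings[cores]  # T(1) / T(n)
    efficiency = speedup / cores         # speedup per core
    print(f"{cores:>2} cores: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```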
---
I like that idea. One thing I worry about is ending up with something like:
```
AJ1/MISO/AJ1_MISO_01.bed
AJ4/MISO/AJ4_MISO_01.bed
AJ16/MISO/AJ16_MISO_01.bed
```
where all those outputs should be identical. Which one(s) should, say, benchmark I2 assess?
The cores parameter should only matter for the I1 (#6), Q1 (#7), and D1 (#13) benchmarks (sidenote: we still need an output code for those). Perhaps a compromise would be that outputs that are agnostic to resources omit the core count, so
```
AJ/MISO/AJ_MISO_01.bed
```
while the resource-benchmarking outputs include one, like
```
AJ4/MISO/AJ4_MISO_04.json
```
Perhaps that is even more confusing? 😬
---
I really like the table! I think it includes everything.
As for the number of cores: it should just be a new parameter code. Most fields will be duplicated, but that shouldn't be a worry. It also gives the flexibility to adjust the benchmarks used, etc. E.g., AP run with one core would simply become AQ, and AP with 16 cores would become AR.
To make it more readable, we could use the first letter in a more logical way. E.g. instead of "A" we could use the first letter of the file name, in this case "M".
Maybe I'm just overthinking the problem.
---
I like it!
Remarks:
- When I suggested this format earlier, @uniqueg and @dominikburri were strongly opposed to individual samples having their own param code. Do you still want to address that issue?
- Let's keep the codes in consecutive alphabetical order; otherwise it will get complicated and it becomes easier to accidentally introduce duplicate codes.
- Agree with @mfansler that appending cores to param codes will produce triplicate directories and identical output files. Where will the number of cores be specified in Nextflow/Snakemake? For Snakemake: if it were in the config, one could use it only to build the outfile name for out 04, like so:

```yaml
# config.yaml
param_code: AJ
method_name: MISO  # added here so the Snakefile lookups below resolve
cores: 4
```

```python
# Snakefile (os is available in Snakefiles)
out1 = os.path.join(
    config["param_code"], config["method_name"],
    "_".join([config["param_code"], config["method_name"], "01"]) + ".bed")
out4 = os.path.join(
    config["param_code"], config["method_name"],
    "_".join([config["param_code"], config["method_name"], "04",
              str(config["cores"])]) + ".json")
```

- Adapting Mervin's example:

```
AJ/MISO/AJ_MISO_01.bed
AJ/MISO/AJ_MISO_04_4.json
```
Well, not sure... out 01-03 would then be overwritten at each new run with different cores. Is that ok? It might be confusing for executors. So maybe we should just keep the original outline.
---
What I meant is that we wouldn't need an I1, Q1 & D1 anymore, but would have compute performance monitored for every benchmark (or, to use the OpenEBench terminology, make compute performance part of the metrics of every challenge). This would give us more fine-grained insights into where individual tools struggle, performance-wise. It may not be all that interesting, but it seems to me that keeping track of performance parameters for every run is something like "good practice"? Now, whether to run every tool n times is debatable; I guess we'd need to consider cost/benefit here. I'm totally fine with leaving that and doing it only once.
However, just to respond to some arguments:
As for producing n times the data: I guess that can't be avoided if we want to run the tools n times, regardless of how we name the files.
As for choosing which one of many (supposedly identical) output files to use afterwards (for other benchmarks): we will need to define that arbitrarily (or have the tools define it for us via defaults) regardless of whether we run them n times or just once. And as for the outputs being identical or not: this would already give us another metric we could always check, for every run: are the results reproducible (using different numbers of cores)?
Yes, still against having parameter codes for individual samples, as those would then just be short names for samples and wouldn't reflect what is, in OEB parlance, a "challenge" (describing a specific set of inputs). To construct that we would then need to somehow merge different sample codes along some rules into a complex parameter/challenge code, and I think that's complicated. So, by all means, we should have one simple code that identifies a specific way each tool should be run (and that should reasonably lead to identical or extremely similar results every time it is run for that tool). Of course, that doesn't mean that a set of inputs couldn't be just one sample, if that is what we really want to have for a specific run (I can't really judge that because I was following the metrics discussions at best peripherally).
I also agree with @ninsch3000 that it's best to keep the 2-letter code in alphabetical order, without any semantics, for the reasons she mentioned. But perhaps even better, we could follow the rules for the challenge IDs defined in the OpenEBench schema definition, which call for codes generated in accordance with this regex: `^OEBX[0-9]{3}[A-Z0-9]{7}$`

So that would be, e.g., `OEBX123ABCD0EF` or `OEBX00000000AA`. Basically we could keep our 2-letter codes, but prefix them with `OEBX00000000`. Or we could use the first three digits to indicate the cores: `OEBX00100000AA`, `OEBX00400000AA`, `OEBX01600000AA`.
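The candidate codes above can be checked against that regex directly (a quick sketch; only the pattern itself comes from the OpenEBench schema as quoted):

```python
import re

# Challenge _id pattern as quoted from the OpenEBench schema definition
OEB_ID = re.compile(r"^OEBX[0-9]{3}[A-Z0-9]{7}$")

for code in ("OEBX123ABCD0EF", "OEBX00000000AA", "OEBX01600000AA", "AA"):
    print(f"{code}: {'valid' if OEB_ID.match(code) else 'invalid'}")
```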
Anyway, let's vote on a couple of things :)
---
❤️ = Generate parameter codes in alphabetical order (as is)
🚀 = Design semantic parameter codes
---
❤️ = Use 2-letter code, e.g., `AA` (as is)
🚀 = Use prefix to generate OEB challenge `_id`-compatible parameter codes, e.g., `OEBX00000000AA`
---
❤️ = run every challenge (i.e., set of inputs) once unless specifically designed to assess reproducibility or parallelization performance
🚀 = run every challenge multiple times to assess reproducibility and/or parallelization performance
---
❤️ = use separate param codes for different # of CPUs or technical replicates
🚀 = codify # of CPUs and technical replicates in param codes
🎉 = codify # of CPUs and technical replicates in output names only (and just give a general guideline for all challenges to be run, say, 3x with default settings and once each with 1, 4 & 16 cores, if supported by the method)
Note that conceptually, the first two options imply cross-challenge comparison metrics in OpenEBench if we want to assess reproducibility and/or parallelization performance. This might be particularly tricky/confusing if we go for challenge ID-type param codes.
Suggestions for codifying rules:

- When using 2-letter param codes: `AA_nnn_###` (example: `AD_003_016`)
- When using OEB-compatible param codes: `OEBXnnn###00AA`, where `nnn` signifies the technical replicate, `###` signifies the core number, and `AA` signifies the set of inputs (example: `OEBX00301600AD`)
- Use `000` to signify the default number of cores
- Always use `000` for cores and `001` for technical replicates for any metrics of that challenge that are not related to testing for reproducibility or parallelization
---
Agree with all suggestions, but I still think we should have 1 param code per sample in some cases (as proposed in the table). More specifically, when we have a matching ground truth file for each sample and want to assess that, we wouldn't want the tools to merge several samples and give us only one output file for all samples. As we can't anticipate at the moment how merging or not merging of samples can be controlled for each tool, I think the most robust way is to also have samples with their own param code. Another possibility might be to require the execution workflows to have a rule/directive that controls whether the tool receives a group of samples or individual samples as an input. This directive would then have to be configured according to the `run_type` or `processing_directive` of the param code table.
That said, please do note that other param codes specify a group of samples, and those can be the same samples that earlier had their own param code. But now they would be allowed to be merged and would have to be compared to a merged ground truth. We never suggested having only sample-specific param codes.
There are other datasets, not included in this first param code table, where we do not have one ground truth per sample; there, of course, we wouldn't run individual samples with their own param code but would group them according to how their ground truths had been grouped.
@@ -1,23 +1,58 @@
# Parameter codes

## Input to summary workflows
Parameter codes uniquely identify what combinations of input files and parameters were used to run an execution workflow.
---
Considering the OpenEBench terminology, it may be better to call these challenge codes. What we currently call challenges (identification, quantification, differential usage) would then become benchmarking events, and what we currently call benchmarks would become metrics (again?).
It's annoying, but we're unlikely to change the OpenEBench specifications/schemas, which have been around for a while and have some good thinking in them. And it would be good to adopt one consistent terminology.
Resolves #75
This is a working proposal; please use the review thread for discussion.