
docs: add parameter codes #93

Closed · wants to merge 1 commit

Conversation

@mfansler mfansler commented Jun 2, 2021

Resolves #75

This is a working proposal. Please use the review thread for discussion.


| PARAMCODE | dataset/samples | file name | matching ground truth | run type | genome/annotation | explicit param\_settings | processing directive | benchmarks |
| --------- | -------------------------------------------------------------------------------------------------- | --------------------------------- | -------------------------------------------------------------------------------------------------- | -------- | ----------------- | ------------------------ | ------------------------------------------------ | -------------- |
| AA | SRR6795720 | Mayr\_NB\_R1 | SRR1005606 | indiv | GRCh38 | 4 cores | individual sample | I2,I3,I4,Q2,Q3 |
Collaborator commented:

This looks really nice! But I still think we should run all of those with 1, 4, and 16 cores. Perhaps we could append the core count to the parameter codes in the filenames so as not to explode the table (and then remove the column). So AA1, AA4, AA16 would be run AA with 1, 4, and 16 cores. Having these data will allow us to evaluate parallelization efficiency and scalability. And it should be easy to run these.

@mfansler (author) commented:

I like that idea. One thing I worry about is ending up with something like:

AJ1/MISO/AJ1_MISO_01.bed
AJ4/MISO/AJ4_MISO_01.bed
AJ16/MISO/AJ16_MISO_01.bed

where all those outputs should be identical. Which one(s) should, say, benchmark I2 assess?

The cores parameter should only matter for the I1 (#6), Q1 (#7), and D1 (#13) benchmarks (side note: we still need an output code for those). Perhaps a compromise would be that outputs that are agnostic to resources omit the core count, so

AJ/MISO/AJ_MISO_01.bed

while the resource benchmarking outputs include one, like,

AJ4/MISO/AJ4_MISO_04.json

Perhaps that is even more confusing? 😬
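
For illustration, a minimal sketch of that compromise; the `output_path` helper and the extension mapping are made up for this example, not part of the proposal:

import os

def output_path(param_code, method, out_code, cores=None):
    """Append the core count to the param code only for resource-dependent outputs."""
    run_code = f"{param_code}{cores}" if cores is not None else param_code
    ext = ".json" if out_code == "04" else ".bed"  # assumed mapping, based on the examples above
    return os.path.join(run_code, method, f"{run_code}_{method}_{out_code}{ext}")

output_path("AJ", "MISO", "01")           # AJ/MISO/AJ_MISO_01.bed
output_path("AJ", "MISO", "04", cores=4)  # AJ4/MISO/AJ4_MISO_04.json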

Collaborator commented:

I really like the table! I think it includes everything.

As for the number of cores: it should just be a new parameter code. Most fields will be duplicated, but this shouldn't be a worry. It also gives us the flexibility to adjust the benchmarks used, etc. E.g., AP with one core would just be AQ, and AP with 16 cores would be AR.

To make it more readable, we could use the first letter in a more logical way. E.g. instead of "A" we could use the first letter of the file name, in this case "M".

Maybe I'm just overthinking the problem.

Collaborator commented:

I like it!
Remarks:

  • when I suggested this format earlier, @uniqueg and @dominikburri were strongly opposed to individual samples having their own param code. Do you still want to address that issue?
  • Let's keep the codes in consecutive alphabetical order; otherwise it will get complicated and it becomes easier to accidentally introduce duplicate codes.
  • Agree with @mfansler that appending cores to param codes will produce triplicate directories and identical output files. Where will the number of cores be specified in Nextflow/Snakemake? For Snakemake: if it were in the config, one could use it only to build the outfile name for out 04, like so:
# config.yaml
param_code: AJ
cores: 4

# Snakefile
import os
out1 = os.path.join(config["param_code"], config["method_name"],
                    "_".join([config["param_code"], config["method_name"], "01"]) + ".bed")
out4 = os.path.join(config["param_code"], config["method_name"],
                    "_".join([config["param_code"], config["method_name"], "04",
                              str(config["cores"])]) + ".json")
  • Adapting Mervin's example:
AJ/MISO/AJ_MISO_01.bed
AJ/MISO/AJ_MISO_04_4.json

Well, not sure... outputs 01-03 would then be overwritten at each new run with a different number of cores. Is that OK? It might be confusing for executors, so maybe we should just keep the original outline.

@uniqueg uniqueg commented Jun 3, 2021

What I meant is that we wouldn't need an I1, Q1 & D1 anymore, but would have compute performance monitored for every benchmark (or, to use the OpenEBench terminology, make compute performance part of the metrics of every challenge). This would give us more fine-grained insights into where individual tools struggle, performance-wise. It may not be all that interesting, but it seems to me that keeping track of performance parameters for every run is something like "good practice"? Now, whether to run every tool n times is debatable; I guess we'd need to consider cost/benefit here. I'm totally fine with leaving that and doing it only once.

However, just to respond to some arguments:

As for producing n times the data: I guess that can't be avoided if we want to run the tools n times, regardless of how we name the files.

As for choosing which one of many (supposedly identical) output files to use afterwards (for other benchmarks): we will need to arbitrarily define that (or have the tools define that for us by using defaults) regardless of whether we run it n times or just once. And as for them being identical or not: this would already give us another metric we could always check, for every run: are the results reproducible (using different numbers of cores)?

Yes, still against having parameter codes for individual samples, as those would then just be short names for samples and wouldn't reflect what is, in OEB parlance, a "challenge" (describing a specific set of inputs). To construct that we would then need to somehow merge different sample codes along some rules into a complex parameter/challenge code, and I think that's complicated. So, by all means, we should have one simple code that identifies a specific way each tool should be run (and that should reasonably lead to identical or extremely similar results every time it is run for that tool). Of course, that doesn't mean that a set of inputs couldn't be just one sample, if that is what we really want to have for a specific run (I can't really judge that because I was following the metrics discussions at best peripherally).

I also agree with @ninsch3000 that it's best to keep the 2-letter code in alphabetical order, without any semantics, for the reasons she mentioned. But perhaps even better, we could follow the rules for the challenge IDs defined in the OpenEBench schema definition, which call for codes generated in accordance with this regex: `^OEBX[0-9]{3}[A-Z0-9]{7}$`. That would be, e.g., OEBX123ABCD0EF or OEBX00000000AA, so basically we could keep our 2-letter codes but prefix them with OEBX00000000. Or we could use the first three digits to indicate the cores: OEBX00100000AA, OEBX00400000AA, OEBX01600000AA.
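
For illustration, a minimal sketch of building and checking such codes against that regex; the helper name and the "cores in the first three digits" layout just follow the suggestion above:

import re

OEB_ID = re.compile(r"^OEBX[0-9]{3}[A-Z0-9]{7}$")

def oeb_code(two_letter, cores=0):
    """Embed a 2-letter param code (and optionally the core count) in an OEB-style _id."""
    code = f"OEBX{cores:03d}00000{two_letter}"
    assert OEB_ID.match(code), code
    return code

oeb_code("AA")            # OEBX00000000AA
oeb_code("AA", cores=16)  # OEBX01600000AA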

Anyway, let's vote on a couple of things :)

@uniqueg uniqueg commented Jun 3, 2021

❤️ = Generate parameter codes in alphabetical order (as is)
🚀 = Design semantic parameter codes

@uniqueg uniqueg commented Jun 3, 2021

❤️ = Use 2-letter code, e.g., AA (as is)
🚀 = Use prefix to generate OEB challenge _id-compatible parameter codes, e.g., OEBX00000000AA

@uniqueg uniqueg commented Jun 3, 2021

❤️ = run every challenge (i.e., set of inputs) once unless specifically designed to assess reproducibility or parallelization performance
🚀 = run every challenge multiple times to assess reproducibility and/or parallelization performance

@uniqueg uniqueg commented Jun 3, 2021

❤️ = use param codes for different # of CPUs or technical replicates
🚀 = codify # of CPUs and technical replicates in param codes
🎉 = codify # of CPUs and technical replicates in output names only (and just give a general guideline for all challenges to be run, say, 3x with default settings and once each with 1, 4 & 16 cores, if supported by the method)

Note that conceptually, the first two options imply cross-challenge comparison metrics in OpenEBench if we want to assess reproducibility and/or parallelization performance. This might be particularly tricky/confusing if we go for challenge ID-type param codes.

Suggestions for codifying rules (see the sketch after this list):

  • When using 2-letter param codes: AA_nnn_### (example: AD_003_016)
  • When using OEB-compatible param codes: OEBXnnn###00AA, where nnn signifies the technical replicate, the ### signifies the core number and AA signifies the set of inputs (example: OEBX00301600AD)
  • Use 000 to signify the default number of cores
  • Always use 000 for cores and 001 for technical replicates for any metrics of that challenge that are not related to testing for reproducibility or parallelization
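
A minimal parsing sketch for the two suggested layouts; the helper and group names are illustrative only:

import re

TWO_LETTER = re.compile(r"^(?P<inputs>[A-Z]{2})_(?P<replicate>\d{3})_(?P<cores>\d{3})$")
OEB_STYLE = re.compile(r"^OEBX(?P<replicate>\d{3})(?P<cores>\d{3})00(?P<inputs>[A-Z]{2})$")

def parse_param_code(code):
    """Split a param code into input set, technical replicate, and number of cores."""
    m = TWO_LETTER.match(code) or OEB_STYLE.match(code)
    if m is None:
        raise ValueError(f"unrecognized parameter code: {code}")
    fields = m.groupdict()
    return {"inputs": fields["inputs"], "replicate": int(fields["replicate"]), "cores": int(fields["cores"])}

parse_param_code("AD_003_016")      # {'inputs': 'AD', 'replicate': 3, 'cores': 16}
parse_param_code("OEBX00301600AD")  # {'inputs': 'AD', 'replicate': 3, 'cores': 16}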

Collaborator commented:

Agree with all suggestions, but I still think we should have one param code per sample in some cases (as proposed in the table). More specifically, when we have a matching ground truth file for each sample and want to assess that, we wouldn't want the tools to merge several samples and give us only one output file for all samples. As we can't anticipate at the moment how merging or not merging of samples can be controlled for each tool, I think the most robust way is to also give samples their own param codes. Another possibility might be to require the execution workflows to have a rule/directive that controls whether the tool receives a group of samples or individual samples as input. This directive would then have to be configured according to the run_type or processing_directive of the param code table.
That said, please do note that other param codes specify a group of samples, and those can be the same samples that earlier had their own param code; but there they are allowed to be merged and will have to be compared to a merged ground truth. We never suggested having only sample-specific param codes.
There are other datasets not included in this first param code table where we do not have one ground truth per sample; there, of course, we wouldn't run individual samples with their own param codes but group them according to how their ground truths have been grouped.
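
For illustration, a minimal sketch of such a rule/directive driven by the run type; the table entries and field names here are hypothetical:

# Hypothetical param-code table excerpt; code "AB" and its samples are made up.
PARAM_TABLE = {
    "AA": {"run_type": "indiv", "samples": ["SRR6795720"]},
    "AB": {"run_type": "group", "samples": ["sample_1", "sample_2"]},
}

def tool_inputs(param_code):
    """Return one list of samples per tool invocation for the given param code."""
    entry = PARAM_TABLE[param_code]
    if entry["run_type"] == "indiv":
        # one run per sample, each assessed against its matching ground truth
        return [[sample] for sample in entry["samples"]]
    # one run over the whole group, assessed against a merged ground truth
    return [entry["samples"]]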

@uniqueg uniqueg changed the title from "Working Parameter Codes Table" to "docs: add parameter codes" on Jun 3, 2021
@@ -1,23 +1,58 @@
# Parameter codes

## Input to summary workflows
Parameter codes uniquely identify what combinations of input files and parameters were used to run an execution workflow.
Collaborator commented:

Considering the OpenEBench terminology, it may be better to call these challenge codes. What we currently call challenges (identification, quantification, differential usage) would then become benchmarking events, and what we currently call benchmarks would become metrics (again?).

It's annoying, but we're unlikely to change the OpenEBench specifications/schemas, which have been around for a while and have some good thinking in them. And it would be good to adopt one consistent terminology.

@ninsch3000 ninsch3000 marked this pull request as draft July 21, 2021 12:55
@ninsch3000 ninsch3000 closed this Aug 4, 2023
@ninsch3000 ninsch3000 deleted the paramcodes-mf branch August 8, 2023 13:04