Support parameterising study definitions #860

Merged: evansd merged 2 commits into main from evansd/study-def-params on Aug 31, 2022

evansd
Contributor

@evansd evansd commented Aug 29, 2022

This teaches generate_cohort to accept a new, multi-valued --param argument e.g.

    cohortextractor generate_cohort --param key1=value1 --param key2=value2 --param key3

The study definition can access these via the params dict in the cohortextractor module:

    from cohortextractor import params
    print(params)
    # {'key1': 'value1', 'key2': 'value2', 'key3': ''}
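
As a rough illustration of how repeated `--param key=value` flags can end up as a dict like the one above, here is a minimal standalone sketch using `argparse`. It is not the code from this PR: the `parse_params` helper is a hypothetical name, and the only behaviour it assumes is what the example output shows (a bare key maps to an empty string).

```python
import argparse


def parse_params(pairs):
    """Turn ["key1=value1", "key3"] into {"key1": "value1", "key3": ""}.

    Hypothetical helper for illustration only.
    """
    params = {}
    for pair in pairs:
        # partition() splits on the first "=" only; a bare key gets "".
        key, _, value = pair.partition("=")
        params[key] = value
    return params


parser = argparse.ArgumentParser(prog="generate_cohort")
# action="append" collects every occurrence of --param into one list.
parser.add_argument("--param", action="append", default=[])

args = parser.parse_args(
    ["--param", "key1=value1", "--param", "key2=value2", "--param", "key3"]
)
params = parse_params(args.param)
print(params)
# {'key1': 'value1', 'key2': 'value2', 'key3': ''}
```

Splitting on the first `=` only means a value may itself contain `=` characters.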

There was a suggestion earlier of using --arg for this, but I think --param avoids confusion with other kinds of thing which are also called "arguments", and fits neatly with the term "parameterised study definition".

This was all much harder than it ought to be because implicit in this feature is the need for the output file name to be customised. Previously the output directory and format could be changed, but the file name was determined by the study definition name e.g. study_definition_test.py would always produce a file called <output_dir>/input_test.<extension>.

So we now support an --output-file argument which sets the directory, base name and file format in one go e.g. some_dir/my_results.feather.
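
To make the "directory, base name and file format in one go" idea concrete, here is a small sketch of how such a path could be split into its three parts. The function name `split_output_file` is hypothetical and this is not necessarily how the PR does it. It assumes the base name contains no dots, so that compound extensions stay together.

```python
from pathlib import PurePosixPath


def split_output_file(output_file):
    """Split an output path into (directory, base name, extension).

    Hypothetical helper; assumes the base name itself contains no dots.
    """
    path = PurePosixPath(output_file)
    # Split the file name at its first dot so "x.csv.gz" yields "csv.gz".
    base_name, _, extension = path.name.partition(".")
    return str(path.parent), base_name, extension


print(split_output_file("some_dir/my_results.feather"))
# ('some_dir', 'my_results', 'feather')
print(split_output_file("output/input_2020.csv.gz"))
# ('output', 'input_2020', 'csv.gz')
```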

@evansd evansd force-pushed the evansd/study-def-params branch 3 times, most recently from d1e5672 to 805b3ac on August 30, 2022 12:26
Contributor

@rebkwok rebkwok left a comment

One typo and some questions that are all along the lines of "what happens if a user does this weird thing"; feel free to ignore if they're irrelevant

cohortextractor/cohortextractor.py (outdated)
expectations_population,
dummy_data_file,
index_date_range=index_date_range,
skip_existing=skip_existing,
output_format=output_format,
output_name=output_name or "input",

urgh

class PatchStack(ExitStack):
"""
Apply multiple `patch` context managers without nesting or using the ugly multi
argument form

👏

cohortextractor/cohortextractor.py
cohortextractor/cohortextractor.py (outdated)
@wjchulme
Contributor

wjchulme commented Aug 31, 2022

This is fab! Couple of questions:

If I do --param arg1 = 1 in the action and arg1 = params["arg1"] in the study def, is arg1 a string "1" or a numeric 1? What about arg1 = True? string or logical? I'm thinking of scenarios where we need the non-string type in the study definition, which might catch users out. For example I don't think find_first_match_in_period = "True" will work (with True as a string not a bool), so we'd need to do arg1 = params["arg1"]=="True" or something like that to get the correct type.

Can I do --param key1 = "string with white space" ?

@sebbacon
Contributor

🌊 🐦 : have you done documentation and "platform news" for this? Cos it's cool.

@evansd
Contributor Author

evansd commented Aug 31, 2022

Thanks @wjchulme. This is interesting feedback because it already exposes one misunderstanding which is that you can't write key = value, it has to be key=value because of the vagaries of bash's argument parsing (the former is considered to be three separate arguments, and would need to be quoted as 'key = value' if you wanted it treated as a single one). I think command line folk get so used to this that they forget not everyone has the same form of Stockholm syndrome :)
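
The splitting behaviour described here can be reproduced with Python's `shlex` module, which follows POSIX-style shell quoting rules. It is an approximation of what bash does rather than bash itself, but it gives the same result for these simple cases:

```python
import shlex

# Unquoted, the spaces split "key = value" into three separate arguments.
print(shlex.split("--param key = value"))
# ['--param', 'key', '=', 'value']

# Quoting keeps the whole thing together as a single argument.
print(shlex.split("--param 'key = value'"))
# ['--param', 'key = value']

# With no spaces around "=", no quoting is needed.
print(shlex.split("--param key=value"))
# ['--param', 'key=value']
```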

That said, I think we can make the space separated version work with a bit of fiddling. We'd need to let --param accept an arbitrary number of arguments and then join them together before parsing them. And given that you won't be the only person to use spaces I think we'll have to support that as well.

You can definitely include whitespace. If we make the change above then your example would work exactly as written. Otherwise you'd have to write it as --param 'key1=string with whitespace'.
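
A sketch of the "accept an arbitrary number of arguments and then join them" idea, using `argparse` with `nargs="+"`. This is an illustration under assumptions, not the implementation that was merged: `parse_param` is a hypothetical helper, and stripping whitespace around the key and value is a choice made for this example.

```python
import argparse
import shlex


def parse_param(tokens):
    """Join the tokens given to one --param and split them on the first "="."""
    joined = " ".join(tokens)
    key, _, value = joined.partition("=")
    # Assumption for this sketch: surrounding whitespace is not significant.
    return key.strip(), value.strip()


parser = argparse.ArgumentParser(prog="generate_cohort")
# nargs="+" lets each --param take several tokens; append keeps them grouped.
parser.add_argument("--param", nargs="+", action="append", default=[])

# shlex.split stands in for the shell's own argument splitting.
argv = shlex.split('--param key1 = "string with white space" --param key2=value2')
args = parser.parse_args(argv)
params = dict(parse_param(tokens) for tokens in args.param)
print(params)
# {'key1': 'string with white space', 'key2': 'value2'}
```

One limitation of this approach is that a value token beginning with `-` would be read by `argparse` as a new option, so such values would still need the `key=value` form.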

As for types, there isn't a neat solution with this approach. I think the only sane option is to make it clear that everything is a string and needs to be explicitly converted if you want something else.
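
For illustration, explicit conversion in a study definition might look something like the sketch below. A plain dict stands in for `from cohortextractor import params` so the snippet runs on its own; the key names and the date expression are made up for the example.

```python
# Stand-in for `from cohortextractor import params`; every value is a string.
params = {"days": "10", "first_match": "True"}

# Numbers need an explicit int() or float().
days = int(params["days"])

# Booleans need an explicit comparison. Note that bool("False") is True,
# because any non-empty string is truthy.
first_match = params["first_match"] == "True"
print(bool("False"))
# True

# Values that are only interpolated into other strings need no conversion.
date_expression = f"date + {params['days']} days"

print(days, first_match, date_expression)
# 10 True date + 10 days
```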

There is a totally different approach we could take though that would give type support, and that's to use the config: option in project.yaml. That would let you write something like:

generate_cohort:
  run: cohortextractor:latest generate_cohort --study-definition study_definition
  config:
    param_1: some string
    param_2: true
    param_3: 10
  outputs:
    highly_sensitive:
      cohort: output/input.csv

I actually think that would be less work than trying to add the support for spaces I described above. So maybe that's the way to go?

@evansd
Contributor Author

evansd commented Aug 31, 2022

> have you done documentation and "platform news" for this? Cos it's cool.

No, but was planning to do the docs once it was merged. And maybe the platform news after at least one person has tried it first?

@wjchulme
Contributor

> everything is a string and needs to be explicitly converted if you want something else

This is how the equivalent functionality works in R, so at least there's some consistency. The problem is that most users won't know how to do the conversion in Python. These are edge cases though -- I reckon the vast majority of parameters used will be needed as strings (even if conceptually they're numeric, e.g. f"date + {param1} days").

@wjchulme
Contributor

I think users would quickly get used to having to use key=value, not key = value. We could just advise that all key-value pairs should be expressed inside quotes to avoid problems.

I do like the config: suggestion though. It's probably less error-prone. More sturdy.

This allows specifying the full path to the output file, including the
output directory and extension. We need this in order to support
parameterising study definitions because at the moment the output file
name is determined by the study definition name, but we want to use the
same study definition to produce different output files.

We could achieve the same thing by adding an `--output-name` argument
and combining these with the pre-existing `--output-directory` and
`--output-format` arguments. But it seems much simpler to specify all
three together in a single argument.
@evansd evansd force-pushed the evansd/study-def-params branch 2 times, most recently from 77ce53b to 856faff on August 31, 2022 15:48
@evansd
Contributor Author

evansd commented Aug 31, 2022

> I do like the config: suggestion though. It's probably less error-prone. More sturdy.

Yeah, I go a bit back and forth on this. It relies on coupling bits of the system together in ways which make me a bit uncomfortable. Not saying we definitely shouldn't do it, but I'd rather avoid baking that behaviour in at this stage.

For now I've just kept the --param syntax but implemented the whitespace handling.

This teaches `generate_cohort` to accept a new multi-valued `--param`
argument e.g.

    generate_cohort --param key1=value1 --param key2=value2 --param key3

The study definition can access these via the `params` dict in the
`cohortextractor` module:

    from cohortextractor import params
    print(params)
    # {'key1': 'value1', 'key2': 'value2', 'key3': ''}
@evansd evansd force-pushed the evansd/study-def-params branch from 856faff to df6ec7d on August 31, 2022 15:59
@evansd evansd merged commit 3023bf0 into main Aug 31, 2022
@evansd evansd deleted the evansd/study-def-params branch August 31, 2022 16:14
evansd added a commit to opensafely/documentation that referenced this pull request Sep 6, 2022
* Document the `--param` argument to cohortextractor

As added in:
opensafely-core/cohort-extractor#860