Initial notes #1

Merged (2 commits, Apr 17, 2024)
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
notes/initial-cleanup.md (59 additions, 0 deletions)
# Notes
After a read-through of the codebase,
I've jotted down some thoughts about the steps to clean up this repository.
Broadly it is easy enough to follow where things live,
but the goal is to improve the consistency of this repo (and other OM repos).
This should make it easier to work as a team
and produce extensible software in the long-run.

The goal would be to make no science changes as part of this refactor,
i.e. the output data should not change.

This clean-up might not be the #1 priority, but should go on the roadmap.
Once we agree on the steps involved,
we can create a set of GH issues to track the process.

## Steps:
* Add an integration test
* Might need to use an environment variable to run in "reduced complexity mode"
* Run through from start to finish and verify that we get the same result.
* Run ruff over the codebase. This will converge on a common variable/function naming convention
* Refactor code into a package. Working name `om_prior`
* Split out the processing steps from the utilities
* Define an interface that all layers must implement
* Move each of the layer scripts into a common place
* Move inputs, intermediates, and outputs into a data folder
* Migrate to the CR copier template to define the repository structure and workflows.
* This will add linting, CI, versioning etc
* There will be some changes in the development workflow so there might be some training needed
* We should discuss what steps work for the rest of the team. There are parts of the template that we can pick and choose
* We've documented the technical decisions that we have made which we can share
* Run CI periodically to double-check that the processing is always ready
* Add licenseheaders to automate the generation of the license headers in all source files
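To make the "interface that all layers must implement" step concrete, here is a minimal sketch of what such an interface could look like. All names (`Layer`, `fetch_inputs`, `calculate`, the example layer) are assumptions for illustration; the real interface would be agreed as part of the refactor, and `calculate` would presumably return gridded data (e.g. an `xarray.Dataset`) rather than the plain dict used here.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class Layer(ABC):
    """Hypothetical common interface for an emissions layer.

    Names are illustrative only; the actual interface would be agreed
    as part of the refactor.
    """

    name: str  # short identifier, e.g. "example" (illustrative)

    @abstractmethod
    def fetch_inputs(self, data_dir: Path) -> None:
        """Download or locate the raw inputs this layer needs."""

    @abstractmethod
    def calculate(self, data_dir: Path) -> object:
        """Produce this layer's emissions (gridded data in the real code)."""


class ExampleLayer(Layer):
    name = "example"

    def fetch_inputs(self, data_dir: Path) -> None:
        # A real layer would fetch from prior.openmethane.org or similar here.
        pass

    def calculate(self, data_dir: Path) -> object:
        # Stand-in result; real layers would return gridded emissions.
        return {"layer": self.name}
```

With a shared base class like this, `omPrior.py` could iterate over a registry of layers rather than calling each layer script individually.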

## Questions for Peter
* How does the downloading of inputs work? I see that data is fetched from `https://prior.openmethane.org`. How does data get there?
* What do you do with the outputs? Are they uploaded to `https://prior.openmethane.org`?
Contributor (comment on lines +35 to +36):
Gerard showed me yesterday that prior.openmethane.org is an R2 bucket (CloudFlare's version of S3). His understanding is that the prior doesn't change very frequently unless new data or methodology becomes available. It can be computed based on (fairly) static inputs and placed in an easily accessible location for reference by the model.

There is some utility to this data and the methods for generating it being publicly available.

It probably makes sense for the prior to be "productionised" but possibly not automated. @prayner what are your thoughts?

Collaborator Author (@lewisjared, Apr 4, 2024):

At a glance it looks like the data in the R2 bucket is only the semi-static data that is updated semi-annually. I'm fine if these data are updated on an ad hoc basis rather than being checked monthly and automatically used assuming @prayner is also. The step of "productionising" those semi-static data might be as simple as documentation (public or private).

The GFAS data is downloaded during the processing of its layer and will be updated at least monthly.

@aethr When you say "prior" are you referring to this repository as a whole or just the data on R2?

Contributor:

> the data in the R2 bucket is only the semi-static data that is updated semi-annually. I'm fine if these data are updated on an ad hoc basis rather than being checked monthly and automatically used assuming @prayner is also. The step of "productionising" those semi-static data might be as simple as documentation (public or private).

@lewisjared that is my understanding as well. As I understand it, the data in the R2 bucket are all inputs to the prior which aren't known to change on a predictable cadence. @prayner will have to help us understand where these are sourced from and how we can tell when they might need to be updated.

> The GFAS data is downloaded during the processing of its layer and will be updated at least monthly.

I wasn't aware of that, and perhaps that's an argument that this should be done on a schedule, ignoring when the "semi-static" inputs are updated.

> When you say "prior" are you referring to this repository as a whole or just the data on R2?

I understood "the prior" to be a data set we compute in advance, based on semi-static data, which forms one of the inputs to the Open Methane model. The data in R2 is the input (along with GFAS), and the code in this repo transforms and combines the input to give us data in "our format" (ie, the grid).

Not sure what the plan is for the outputs, but if this only needs to get run once each time data changes (ie, new GFAS) it does seem like a good idea to store/cache it somewhere the model can access it.

Sorry, my comments probably aren't that helpful! Will wait for @prayner to illuminate things. :)

Collaborator Author:

Comments are always helpful. We are all trying to get on the same page and use the same language.

I think your description of the prior is correct, other than adding that this is our initial guess at the methane emissions across Australia.

I think that this has to be run at least monthly to pull in the new GFAS (and maybe wetlands) data. In our meeting notes, we said that the "above steps are run daily", which included the prior calculation, so it might even need to be run at a daily cadence. The current implementation allows individual layers to be rerun as needed, so this isn't a major concern and the results can be heavily cached.

Contributor:

> the "above steps are run daily" which included the prior calculation so it might even need to be run at a daily cadence.

If that is true then I think "the prior" is probably the process (ie "run the prior") rather than the data set (ie "access the prior").

* There is a mix of scripts and library code here. Broadly, as I understand it, the scripts are as follows:
* omDownloadInputs.py
* omCreateDomainInfo.py
* omPrior.py
* omPriorVerify.py
* each of the layer generation files can be run independently
* What should we test?
* Running the whole pipeline (10 minutes) might be too slow, but we might be able to run a reduced-complexity version of the slow step
* How to handle downloading inputs and caching between runs? How much data is there?
* Is this normally run locally or on NCI?
* Do you envision the need for more layers or different sources in future? The level of flux here may change how much structure is needed.
* Will type hints be useful anywhere?
* We typically use type hints to provide some additional context for future contributors and some additional guardrails. These type hints are then checked as part of the CI.
* This can be hard to add into an existing project and get good coverage.
* There might be useful places to add type hints and progressively enhance the coverage over time.
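The "reduced complexity mode" idea above could be wired up with a single environment variable that the pipeline consults. This is only a sketch: the variable name `OM_PRIOR_REDUCED_COMPLEXITY` and the grid shapes are made-up placeholders, not the real Open Methane domain.

```python
import os


def reduced_complexity() -> bool:
    """True when the (hypothetical) OM_PRIOR_REDUCED_COMPLEXITY flag is set."""
    return os.environ.get("OM_PRIOR_REDUCED_COMPLEXITY", "0") == "1"


def grid_shape() -> tuple[int, int]:
    # Shapes are illustrative placeholders, not the real domain size.
    # In reduced-complexity mode a coarse grid keeps the integration
    # test fast enough for CI.
    return (10, 10) if reduced_complexity() else (430, 460)
```

An integration test could then run the full pipeline with the flag set and compare the coarse output against a stored reference, while scheduled CI runs exercise the full-resolution path.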

## Future steps
* Are there additional validation steps to perform? There could be a couple of notebooks that visualise and compare the new prior against previous priors
* What can be automated?
* What needs a human in the loop?
* Are there any additional things to archive with a prior release?
* Develop and document a release process for the prior
* Documentation of each step and the required outputs
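As a starting point for the "compare the new prior against previous priors" validation step, here is a minimal sketch of an automatable check. It compares per-layer emission totals between two prior versions; a real check would compare the gridded NetCDF output (e.g. with xarray) rather than scalar totals, and the function name and tolerance are assumptions.

```python
import math


def compare_totals(new: dict, old: dict, rel_tol: float = 1e-6) -> dict:
    """Compare per-layer emission totals between two prior versions.

    `new` and `old` map layer name -> total flux. This is only the idea;
    a production check would operate on the gridded output files.
    """
    report = {}
    for layer in sorted(set(new) | set(old)):
        a, b = new.get(layer), old.get(layer)
        if a is None or b is None:
            report[layer] = "missing"
        elif math.isclose(a, b, rel_tol=rel_tol):
            report[layer] = "unchanged"
        else:
            report[layer] = f"changed ({(a - b) / b:+.2%})"
    return report
```

A report like this could run automatically on every prior build, with a human in the loop only when a layer is flagged as "changed" or "missing".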