-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow configuring the database name #159
Conversation
In order to be consistent with the "`database` hyphen `option name`" style of the rest of the database options, the option gets the `name` suffix to make it more obvious that the options specifies the database name.
Turns out I actually only need one direction and that can be done inline, so I don't really need that code. So commit the code, but revert it immediately after, to keep it in history in case I want to look it up later.
This reverts commit 378e078445a146f1c7b52e73bee6fe712605fa0e because the code isn't actually needed anymore, but should be kept in history for reference purposes.
This makes sure that the module doesn't contain executable code besides the function definition. The CLI entry point is still `main` but now `main` does all setup and initialization and then invokes the main CLI command. This is also the only way of catching exceptions raised during the commands execution, which has to be used later on.
If the configuration file doesn't exist, create one on startup by writing the default values of all command line options having defaults to a file in the correct format. In order for this to work, the command line options actually need defaults, so specify those and make them be shown in the `--help` message. Also put a note in the help message, explaining how to configure `egon-data` as well as the fact, that it will modify it's working directory. Since `egon-data` might be run multiple times in parallel, also write the combination of configuration values and command line overrides to a special configuration file from the first `egon-data` process started. Exclusively use the options from this configuration file in all other processes and make sure that the configuration file is deleted when the process exits by wrapping the main command in a `try` block where the file containing the running processes active option configuration is deleted in a `finally` block. Put a configuration file path discovery function in the `egon.data.config` module in order to support all this machinery. Finally, adapt the CHANGELOG.rst to these new changes.
The clarification was missing from the original announcement.
Keys are now the Python identifiers corresponding to the command line switches. This change was necessary because the translation to the keys used in "docker-compose.yml" is now done at the time of reading the file, instead of when writing the file.
Consistently use the standard configuration file "egon-data.configuration.yaml" instead.
Partly addresses issue #59.
Allow configuring the "Docker"ed database via command line arguments. In order to do so, convert the shipped "docker-compose.yml" to a template, which is copied to the "docker" subdirectory of the current working directory, with the template fields filled in with the corresponding values from the command line arguments. Run `docker-compose` using the copied template, instead of the original file shipped with "egon.data". Put a note about this behavior into `egon-data`'s help message. Stop reading database parameters from the shipped `docker-compose.yaml` because that's no longer possible, as the file is a template without meaningful values now. Finally, update the changelog to reflect this new feature.
I intentionally violated the style in the last commit, in order to keep the diff as small as possible. While fixing the style, I also changed the variable `docker_db_config` to `configuration`. Using the same variable name that is also used to store the "egon-data" part of the configuration file is done intentionally, because the new `configuration` dictionary is just the old one, with the keys filtered to the ones also present in `translated` and translated accordingly.
Group them loosely on how they belong together thematically, instead of ordering them alphabetically.
Previously the initial configuration file was generated using only the default values which IMHO is surprising behaviour if command line overrides are present.
Since default values are always present in the options to a command, the behaviour up to this commit meant that non-default values taken from the configuration file where overwritten with the default, in case no value was specified for a specific command line switch.
This has a lot of consequences. First and foremost, the "sql_alchemy_conn" configuration setting in "airflow.cfg" has to be determined dynamically from the database configuration settings. In order to do so, "airflow.cfg" was converted to a template, where everything in curly brackets, e.g. "{slot}", gets filled with appropriate values, which also means that curly braces not signifying a slot to be filled have to be escaped to "{{" and "}}". This is the cause for most of the changes in "airflow.cfg". The template gets handled similar to "docker-compose.yml", in that it gets copied to the "airflow" subdirectory of the current working directory and has its slots filled. As a consequence, the directory in which DAGs are discovered can no longer be determined from `$AIRFLOW_HOME` but has to be filled in as well, because our DAGs still reside in the "egon.data" package. This explains having to set the "dags_folder" configuration setting. Since the one existing database has a new purpose, it also gets a setting, with which it can be configured and which is used in "docker-compose.yml" to determine the database name. The database now has to be up prior to Airflow, so we can no longer initialize it in an Airflow task. Therefore starting the docker container is now done at the start of the `egon-data` command. Since the docker container can only be configured to contain one database initially, it's now checked whether the normal `egon-data` database exists and if it doesn't it is created. That way one can use multiple different databases, simply by supplying different database names via `--database-name`. Last but not least, the "entrypoints" make no sense anymore, because those would be applied to the Airflow metadata database. Extension creation and other initialization has therefore been moved to the `initdb` Airflow task, which finally fixes #76.
Now that we're using PostgreSQL for Airflow's metadata, this is finally possible.
That means the data is managed by Docker, so no longer simply available on the file system, but this was the only way, I managed to get rid of really annoying permission errors. See Docker's [documentation][0] for more information. [0]: https://docs.docker.com/storage/volumes/
Finally managed to get this working. Would you please try it out, @ClaraBuettner and @gplssm. But beware, a lot of things changed. I'm now expecting users to run |
I tested choosing another database and it worked fine. Since the tasks run in parallel now, the computation time is significantly reduced, which is great. And I get some error messages when I run
I also notices that I messed up the dependencies of import-nep-data which also needs zenus data in test mode. And fails now when the task run in parallel. I will fix this in a separate issue. |
The parameter was added in Python 3.8. Since we also support Python 3.7, ignore the error manually which is thrown if trying to remove a nonexistent file.
Would it also be possible to rephrase or put an example in the
Nice catch. Thanks. Fixed that. You might now have an |
Another question from me: now that there is the option of working with an arbitraty database, how do I tell the pipeline to be in "test mode" for one database and in "production mode" for another one? |
The new code (hopefully) does the same as the old code: collect option flags and the corresponding values for defaults and command line overrides in a dictionary so they can be merged and written to a configuration file. The new version is just shorter and hopefully not too obscure.
Did my part. The new option is available under two names: |
Thanks! 👍
No, I'm totally fine with |
Before I can approve/we can merge,
For the first one, I had a look into
|
When I do what you have done I get the expected output:
I created a new folder to get a new egon-data.configuration.yaml. Did you also do that? And did you already started with the tasks? I can also do some of them. |
No.
No, didn't started yet. Feel free to change things. I'll do the remaining ones. @gnn do you want to resolve the conflicts? You've changed most here and I think you're the fastest. |
I ran the workflow again and can confirm that is runs sucessfully. And indeed, it's way faster now. Very cool! 😎 I noticed one more thing that should be changed. Except for MaStR data, all downloaded data lands in the repo instead of being saved in the project directory. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked that the installation from this branch works fine and also run the workflow without any issues again. Since checking if solving conflicts with dev doesn't seem to be part of reviewing, I will approve this PR.
But @gnn if you want someone to test again after solving the conflicts, I can do that. You just need to ask for it.
Narf. Typing in responses isn't enough. You also have to send them. So this is what I wanted to send yesterday evening:
Nah, it's fine. Ill resolve the conflicts directly while merging into dev, so no further reviews required. Thanks for everything. Merge coming in soon. |
Quick question @gplssm: did you already solve the problem of the clashing container names? |
|
Damn. That means I can't use you to test whether changing the container name fixes that one. Ah, well, no big deal. I guess we'll see whether people still run into the issue or not. |
Before this commit, there was an ambiguity in the name `egon-data-local-database`: the name was used for both, the service as well as the container. As the container name will get configurable, it's better to use non ambiguous names from now on. While doing this, I also noticed that the name `egon-data-database-volume` does not use the same convention as the two new names introduced. In order to be consistent when naming things in this file, I also changed the name to `egon-data-database-volume`.
Not updating is in line with how "docker-compose.yml" is treated. If that behaviour should change, it's probably best to introduce extra command line flags or explicit commands for updating the configuration files.
The function is unnecessary because the option will only ever appear under its first name.
src/egon/data/config.py
Outdated
def dataset_boundaries(): | ||
"""Return dataset boundaries from configuration settings""" | ||
|
||
if '--dataset-boundary' in settings()['egon-data']: | ||
dataset = settings()['egon-data']['--dataset-boundary'] | ||
elif '--clip-datasets-to' in settings()['egon-data']: | ||
dataset = settings()['egon-data']['--clip-datasets-to'] | ||
|
||
return dataset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stumbled upon this during merging and just wanted to note that this function isn't necessary. I tried to anticipate this confusion with my remark that
The first one is the canonical one so this one will be used when writing to the configuration file or when obtaining settings via the newly added
egon.data.config.settings
function.
When using egon.data.config.settings
the option will always appear under the key "--dataset-boundary"
, so I removed "dataset_boundaries" in 7a090f8. I'll also remove the second option before the merge, as Guido requested.
It's still available as `"--dataset-boundary"`.
Unfortunately the `:pr:` and `:issue:` shortcuts are a Sphinx only feature so they can not be used in reST files which are also rendered by GitHub.
CHANGELOG.rst
Outdated
@@ -30,7 +30,7 @@ Added | |||
directtory in which ``egon-data`` is started, so it's probably best to | |||
run ``egon-data`` in a dedicated directory. | |||
There's also the new function `egon.data.config.settings` which | |||
returns the current configuration settings. | |||
returns the current configuration settings. See :pr:`159` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry to curb your enthusiasm here, especially since I'm the one introducing the :issue:
and :pr:
roles, but unfortunately these roles are only rendered by Sphinx, so it's probably better to not use them in reST files which will be viewed both, on GitHub and on Read the Docs.
So manually expanded :pr:
s in "CHANGELOG.rst" with 5b4b373 .
If the configured database doesn't exist, it will be created. This enables seamless usage of multiple databases and should fix #112.