
Allow configuring the database name #159

Merged 59 commits into dev on Mar 23, 2021
Conversation

@gnn (Collaborator) commented Mar 15, 2021

If the configured database doesn't exist, it will be created. This enables seamless usage of multiple databases and should fix #112.

gnn added 20 commits March 9, 2021 16:15
In order to be consistent with the "`database` hyphen `option name`"
style of the rest of the database options, the option gets the `name`
suffix to make it more obvious that the option specifies the database
name.
Turns out I actually only need one direction and that can be done
inline, so I don't really need that code. So commit the code, but revert
it immediately after, to keep it in history in case I want to look it up
later.
This reverts commit 378e078445a146f1c7b52e73bee6fe712605fa0e because the
code isn't actually needed anymore, but should be kept in history for
reference purposes.
This makes sure that the module doesn't contain executable code besides
the function definition. The CLI entry point is still `main` but now
`main` does all setup and initialization and then invokes the main CLI
command. This is also the only way of catching exceptions raised during
the command's execution, which will be needed later on.
If the configuration file doesn't exist, create one on startup by
writing the default values of all command line options having defaults
to a file in the correct format. In order for this to work, the command
line options actually need defaults, so specify those and make them be
shown in the `--help` message. Also put a note in the help message,
explaining how to configure `egon-data` as well as the fact that it
will modify its working directory. Since `egon-data` might be run
multiple times in parallel, also write the combination of configuration
values and command line overrides to a special configuration file from
the first `egon-data` process started. Exclusively use the options from
this configuration file in all other processes and make sure that the
configuration file is deleted when the process exits by wrapping the
main command in a `try` block where the file containing the running
processes active option configuration is deleted in a `finally` block.

Put a configuration file path discovery function in the
`egon.data.config` module in order to support all this machinery.

Finally, adapt the CHANGELOG.rst to these new changes.
The clarification was missing from the original announcement.
Keys are now the Python identifiers corresponding to the command line
switches. This change was necessary because the translation to the keys
used in "docker-compose.yml" is now done at the time of reading the
file, instead of when writing the file.
Consistently use the standard configuration file
"egon-data.configuration.yaml" instead.
Allow configuring the "Docker"ed database via command line arguments.
In order to do so, convert the shipped "docker-compose.yml" to a
template, which is copied to the "docker" subdirectory of the current
working directory, with the template fields filled in with the
corresponding values from the command line arguments. Run
`docker-compose` using the copied template, instead of the original file
shipped with "egon.data". Put a note about this behavior into
`egon-data`'s help message. Stop reading database parameters from the
shipped `docker-compose.yaml` because that's no longer possible, as the
file is a template without meaningful values now. Finally, update the
changelog to reflect this new feature.
I intentionally violated the style in the last commit, in order to keep
the diff as small as possible. While fixing the style, I also changed
the variable `docker_db_config` to `configuration`. Using the same
variable name that is also used to store the "egon-data" part of the
configuration file is done intentionally, because the new
`configuration` dictionary is just the old one, with the keys filtered
to the ones also present in `translated` and translated accordingly.
Group them loosely on how they belong together thematically, instead of
ordering them alphabetically.
Previously the initial configuration file was generated using only the
default values which IMHO is surprising behaviour if command line
overrides are present.
Since default values are always present in the options to a command, the
behaviour up to this commit meant that non-default values taken from the
configuration file were overwritten with the default whenever no value
was specified for a specific command line switch.
This has a lot of consequences. First and foremost, the
"sql_alchemy_conn" configuration setting in "airflow.cfg" has to be
determined dynamically from the database configuration settings. In
order to do so, "airflow.cfg" was converted to a template, where
everything in curly brackets, e.g. "{slot}", gets filled with
appropriate values, which also means that curly braces not signifying a
slot to be filled have to be escaped to "{{" and "}}". This is the
cause for most of the changes in "airflow.cfg".
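The escaping rule can be illustrated with a small `str.format` example (an illustrative snippet, not the actual "airflow.cfg" template):

```python
# In templates filled via str.format, single braces mark slots to be
# filled, while literal braces have to be doubled to "{{" and "}}".
template = (
    "sql_alchemy_conn = postgresql://{user}@{host}:{port}/{database}\n"
    "literal = {{this stays in braces}}\n"
)
rendered = template.format(
    user="egon", host="127.0.0.1", port=59734, database="airflow"
)
```

This doubling is what accounts for most of the diff in "airflow.cfg": every pre-existing literal brace had to be escaped so it survives the fill step unchanged.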

The template gets handled similar to "docker-compose.yml", in that it
gets copied to the "airflow" subdirectory of the current working
directory and has its slots filled. As a consequence, the directory in
which DAGs are discovered can no longer be determined from
`$AIRFLOW_HOME` but has to be filled in as well, because our DAGs still
reside in the "egon.data" package. This explains having to set the
"dags_folder" configuration setting.

Since the one existing database has a new purpose, it also gets a
setting, with which it can be configured and which is used in
"docker-compose.yml" to determine the database name. The database now
has to be up prior to Airflow, so we can no longer initialize it in an
Airflow task. Therefore starting the docker container is now done at the
start of the `egon-data` command. Since the docker container can only be
configured to contain one database initially, it's now checked whether
the normal `egon-data` database exists and if it doesn't it is created.
That way one can use multiple different databases, simply by supplying
different database names via `--database-name`.
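The "create the database if it doesn't exist" check can be sketched as below. This assumes a psycopg2-style DB-API connection and a hypothetical function name; the actual code in `egon.data` may differ:

```python
def ensure_database(connection, name):
    """Create database `name` if it doesn't exist yet.

    `connection` is assumed to be an autocommit DB-API connection
    (e.g. a psycopg2 connection) to the always-present maintenance
    database, since `CREATE DATABASE` can neither run inside a
    transaction nor from within the database being created.
    """
    with connection.cursor() as cursor:
        # pg_database lists all databases in the cluster.
        cursor.execute(
            "SELECT 1 FROM pg_database WHERE datname = %s", (name,)
        )
        if cursor.fetchone() is None:
            cursor.execute('CREATE DATABASE "{}"'.format(name))
```

Quoting the name with double quotes matters here because the default name, "egon-data", contains a hyphen and would otherwise be rejected by PostgreSQL.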

Last but not least, the "entrypoints" make no sense anymore, because
those would be applied to the Airflow metadata database. Extension
creation and other initialization has therefore been moved to the
`initdb` Airflow task, which finally fixes #76.
Now that we're using PostgreSQL for Airflow's metadata, this is finally
possible.
That means the data is managed by Docker, so it is no longer directly
available on the file system, but this was the only way I managed to get
rid of really annoying permission errors.
See Docker's [documentation][0] for more information.

[0]: https://docs.docker.com/storage/volumes/
@gnn (Collaborator, Author) commented Mar 15, 2021

Finally managed to get this working. Would you please try it out, @ClaraBuettner and @gplssm. But beware, a lot of things changed. I'm now expecting users to run egon-data in a dedicated directory, and the compose file changed, so the container will probably be rebuilt.

@ClaraBuettner (Contributor) commented

I tested choosing another database and it worked fine. Since the tasks run in parallel now, the computation time is significantly reduced, which is great.
At first, I had some problems setting the options in the correct order; egon-data --database 'test-egon-data' serve worked, but an example in the test mode section might be helpful.

And I get some error messages when I run egon-data --help but everything is working. I have Python 3.7.6 and click 7.1.2.

> (egon-data) (base) clara@clara-LIFEBOOK-U749:~/test-egon-data$ egon-data --help
> Usage: egon-data [OPTIONS] COMMAND [ARGS]...
> 
>   Run and control the eGo^n data processing pipeline.
> 
>   It is recommended to create a dedicated working directory in which to run
>   `egon-data` because `egon-data` because `egon-data` will use it's working
>   directory to store configuration files and other data generated during a
>   workflow run.
> 
>   You can configure `egon-data` by putting a file named "egon-
>   data.configuration.yaml" into the directory from which you are running
>   `egon-data`. If that file doesn't exist, `egon-data` will create one,
>   containing the command line parameters supplied, as well as the defaults
>   for those switches for which no value was supplied. Last but not least, if
>   you're using the default behaviour of setting up the database in a Docker
>   container, the working directory will also contain a directory called
>   "docker", containing the database data as well as other volumes used by
>   the dockered database.
> 
> Options:
>   --airflow-database-name DB      Specify the name of the airflow metadata
>                                   database.  [default: airflow]
> 
>   --database-name, --database DB  Specify the name of the local database. The
>                                   database will be created if it doesn't
>                                   already exist.
>                                   
>                                   Note: "--database" is deprecated and will
>                                   be removed in the future. Please use the
>                                   longer but consistent "--database-name".
>                                   [default: egon-data]
> 
>   --database-user USERNAME        Specify the user used to access the local
>                                   database.  [default: egon]
> 
>   --database-host HOST            Specify the host on which the local database
>                                   is running.  [default: 127.0.0.1]
> 
>   --database-port PORT            Specify the port on which the local DBMS is
>                                   listening.  [default: 59734]
> 
>   --database-password PW          Specify the password used to access the
>                                   local database.  [default: data]
> 
>   --jobs N                        Spawn at maximum N tasks in parallel.
>                                   Remember that in addition to that, there's
>                                   always the scheduler and probably the server
>                                   running.  [default: 16]
> 
>   --version                       Show the version and exit.
>   --help                          Show this message and exit.
> 
> Commands:
>   airflow
>   serve    Start the airflow webapp controlling the egon-data pipeline.
> Traceback (most recent call last):
>   File "/home/clara/venv/egon-data/lib/python3.7/site-packages/click/core.py", line 781, in main
>     with self.make_context(prog_name, args, **extra) as ctx:
>   File "/home/clara/venv/egon-data/lib/python3.7/site-packages/click/core.py", line 700, in make_context
>     self.parse_args(ctx, args)
>   File "/home/clara/venv/egon-data/lib/python3.7/site-packages/click/core.py", line 1212, in parse_args
>     rest = Command.parse_args(self, ctx, args)
>   File "/home/clara/venv/egon-data/lib/python3.7/site-packages/click/core.py", line 1048, in parse_args
>     value, args = param.handle_parse_result(ctx, opts, args)
>   File "/home/clara/venv/egon-data/lib/python3.7/site-packages/click/core.py", line 1630, in handle_parse_result
>     value = invoke_param_callback(self.callback, ctx, self, value)
>   File "/home/clara/venv/egon-data/lib/python3.7/site-packages/click/core.py", line 123, in invoke_param_callback
>     return callback(ctx, param, value)
>   File "/home/clara/venv/egon-data/lib/python3.7/site-packages/click/core.py", line 951, in show_help
>     ctx.exit()
>   File "/home/clara/venv/egon-data/lib/python3.7/site-packages/click/core.py", line 558, in exit
>     raise Exit(code)
> click.exceptions.Exit: 0
> 
> During handling of the above exception, another exception occurred:
> 
> Traceback (most recent call last):
>   File "/home/clara/GitHub/eGon-data/src/egon/data/cli.py", line 285, in main
>     egon_data.main(sys.argv[1:])
>   File "/home/clara/venv/egon-data/lib/python3.7/site-packages/click/core.py", line 810, in main
>     sys.exit(e.exit_code)
> SystemExit: 0
> 
> During handling of the above exception, another exception occurred:
> 
> Traceback (most recent call last):
>   File "/home/clara/venv/egon-data/bin/egon-data", line 33, in <module>
>     sys.exit(load_entry_point('egon.data', 'console_scripts', 'egon-data')())
>   File "/home/clara/GitHub/eGon-data/src/egon/data/cli.py", line 287, in main
>     config.paths(pid="current")[0].unlink(missing_ok=True)
> TypeError: unlink() got an unexpected keyword argument 'missing_ok'
> 

I also noticed that I messed up the dependencies of import-nep-data, which also needs zensus data in test mode and fails now that the tasks run in parallel. I will fix this in a separate issue.

The `missing_ok` parameter was added in Python 3.8. Since we also
support Python 3.7, manually ignore the error which is thrown when
trying to remove a nonexistent file.
@gnn (Collaborator, Author) commented Mar 15, 2021

At first, I had some problems setting the options in the correct order; egon-data --database 'test-egon-data' serve worked, but an example in the test mode section might be helpful.

Would it also be possible to rephrase or put an example in the --help message to clarify things? How would you want to read it? Because I'd really like the --help message to be the go-to help text and have it both on the command line and on Read the Docs. Since you apparently also weren't aware that --database is deprecated in favour of --database-name, I don't seem to be doing a good job of explaining things. :)

And I get some error messages when I run egon-data --help but everything is working. I have Python 3.7.6 and click 7.1.2.

> 
> Traceback (most recent call last):
>   File "/home/clara/venv/egon-data/bin/egon-data", line 33, in <module>
>     sys.exit(load_entry_point('egon.data', 'console_scripts', 'egon-data')())
>   File "/home/clara/GitHub/eGon-data/src/egon/data/cli.py", line 287, in main
>     config.paths(pid="current")[0].unlink(missing_ok=True)
> TypeError: unlink() got an unexpected keyword argument 'missing_ok'
> 

Nice catch. Thanks. Fixed that. You might now have an "egon-data.pid-SOME_NUMBER.configuration.yaml" file lying around, which may create problems down the line. You can delete that/these.

@gnn (Collaborator, Author) commented Mar 15, 2021

Another question from me: now that there is the option of working with an arbitrary database, how do I tell the pipeline to be in "test mode" for one database and in "production mode" for another one?

gnn added 3 commits March 16, 2021 03:23
The new code (hopefully) does the same as the old code: collect option
flags and the corresponding values for defaults and command line
overrides in a dictionary so they can be merged and written to a
configuration file. The new version is just shorter and hopefully not
too obscure.
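The collect-and-merge described above boils down to a standard dictionary merge (a minimal sketch with illustrative keys, not the actual code):

```python
# Defaults and command line overrides keyed by option flag; the
# overrides win wherever both dictionaries have a key.
defaults = {"--database-name": "egon-data", "--jobs": 1}
overrides = {"--jobs": 16}
merged = {**defaults, **overrides}
```

The merged dictionary is what then gets written to the configuration file, so overrides supplied on the command line end up in the file alongside the untouched defaults.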
@gnn (Collaborator, Author) commented Mar 16, 2021

Did my part. The new option is available under two names: --dataset-boundary and --clip-datasets-to. Both can be used interchangeably. The first one is the canonical one so this one will be used when writing to the configuration file or when obtaining settings via the newly added egon.data.config.settings function. If you like one more than the other, one variant can be removed. I couldn't decide, so I left both in.
Last question: I noticed that you are using PWD, @gplssm. Do you find that more intuitive than saying "the current working directory"?
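A two-name option like the one described can be declared in click roughly as follows. This is a sketch assuming click (which egon-data's CLI is built on); the default value and help text are illustrative:

```python
import click


@click.command()
@click.option(
    "--dataset-boundary",
    "--clip-datasets-to",
    "dataset_boundary",  # explicit parameter name; the first flag is canonical
    default="Everything",
    show_default=True,
    help="Clip all datasets to the given boundary.",
)
def egon_data(dataset_boundary):
    """Both flags feed the same parameter; either works on the CLI."""
    click.echo(dataset_boundary)
```

Because click stores the value under the single parameter name, anything reading the resulting settings only ever sees the canonical key, regardless of which flag the user typed.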

@gplssm (Contributor) commented Mar 17, 2021

Thanks! 👍

Did my part. The new option is available under two names: --dataset-boundary and --clip-datasets-to. Both can be used interchangeably. The first one is the canonical one so this one will be used when writing to the configuration file or when obtaining settings via the newly added egon.data.config.settings function. If you like one more than the other, one variant can be removed. I couldn't decide, so I left both in.

I don't have a strong preference here. It makes it easier for all of us to have only one name, in particular if this name is used elsewhere (e.g. egon.data.config.settings). I vote for keeping only --dataset-boundary.

Last question: I noticed that you are using PWD, @gplssm. Do you find that more intuitive than saying "the current working directory"?

No, I'm totally fine with CWD. I was just too lazy to search for the exact difference... Changed it.

@gplssm (Contributor) commented Mar 17, 2021

Before I can approve/we can merge, three things need to be done:

  • Incorporate --dataset-boundary in the pipeline/the respective tasks (yesterday we said that we prefer to remove the op_args and read from settings in the task functions directly)
  • Add a note to CHANGELOG (@gplssm double-checks)
  • Resolve conflicts with dev (@gnn)

For the first one, I had a look into settings. I expected that the key --dataset-boundary would be available here. Either with its default value "Everything" or with a custom dataset. See

import pprint
from egon.data.config import settings

pprint.pprint(settings())
{'egon-data': {'--airflow-database-name': 'airflow',
               '--database-host': '127.0.0.1',
               '--database-name': 'egon-data',
               '--database-password': 'data',
               '--database-port': '59734',
               '--database-user': 'egon',
               '--jobs': 1}}

@ClaraBuettner (Contributor) commented

For the first one, I had a look into settings. I expected that the key --dataset-boundary would be available here. Either with its default value "Everything" or with a custom dataset.

When I do what you have done I get the expected output:

> pprint.pprint(settings())
> {'egon-data': {'--airflow-database-name': 'airflow',
>                '--database-host': '127.0.0.1',
>                '--database-name': 'test-egon-data',
>                '--database-password': 'data',
>                '--database-port': '59734',
>                '--database-user': 'egon',
>                '--dataset-boundary': 'Schleswig-Holstein',
>                '--jobs': 1}}

I created a new folder to get a new egon-data.configuration.yaml. Did you also do that?

And did you already start with the tasks? I can also do some of them.

@gplssm (Contributor) commented Mar 17, 2021

I created a new folder to get a new egon-data.configuration.yaml. Did you also do that?

No.
I didn't re-create the config file after pulling the latest changes. This was the mistake. So, everything works fine 🎉 thanks for testing on your side!

And did you already start with the tasks? I can also do some of them.

No, I didn't start yet. Feel free to change things. I'll do the remaining ones.

@gnn do you want to resolve the conflicts? You've changed most here and I think you're the fastest.

@gplssm (Contributor) commented Mar 17, 2021

I ran the workflow again and can confirm that it runs successfully. And indeed, it's way faster now. Very cool! 😎

I noticed one more thing that should be changed. Except for MaStR data, all downloaded data lands in the repo instead of being saved in the project directory.

@ClaraBuettner (Contributor) left a review

I checked that the installation from this branch works fine and also ran the workflow again without any issues. Since solving conflicts with dev doesn't seem to be part of reviewing, I will approve this PR.
But @gnn, if you want someone to test again after solving the conflicts, I can do that. You just need to ask for it.

@gnn (Collaborator, Author) commented Mar 18, 2021

Narf. Typing in responses isn't enough. You also have to send them. So this is what I wanted to send yesterday evening:

I checked that the installation from this branch works fine and also ran the workflow again without any issues. Since solving conflicts with dev doesn't seem to be part of reviewing, I will approve this PR.
But @gnn, if you want someone to test again after solving the conflicts, I can do that. You just need to ask for it.

Nah, it's fine. I'll resolve the conflicts directly while merging into dev, so no further reviews are required. Thanks for everything. Merge coming in soon.

@gnn (Collaborator, Author) commented Mar 18, 2021

Quick question @gplssm: did you already solve the problem of the clashing container names?

@gplssm (Contributor) commented Mar 18, 2021

Quick question @gplssm: did you already solve the problem of the clashing container names?

Yes. https://egon-data--159.org.readthedocs.build/en/159/troubleshooting.html#error-cannot-create-container-for-service-egon-data-local-database

@gnn (Collaborator, Author) commented Mar 18, 2021

Quick question @gplssm: did you already solve the problem of the clashing container names?

Yes. https://egon-data--159.org.readthedocs.build/en/159/troubleshooting.html#error-cannot-create-container-for-service-egon-data-local-database

Damn. That means I can't use you to test whether changing the container name fixes that one. Ah, well, no big deal. I guess we'll see whether people still run into the issue or not.

gnn added 6 commits March 18, 2021 19:01
Before this commit, there was an ambiguity in the name
`egon-data-local-database`: the name was used both for the service and
for the container. As the container name will become configurable, it's
better to use unambiguous names from now on. While doing this, I also
noticed that the name `egon-data-database-volume` did not use the same
convention as the two new names introduced, so I also changed it to be
consistent when naming things in this file.
Not updating is in line with how "docker-compose.yml" is treated. If
that behaviour should change, it's probably best to introduce extra
command line flags or explicit commands for updating the configuration
files.
The function is unnecessary because the option will only ever appear
under its first name.
Comment on lines 56 to 64

def dataset_boundaries():
    """Return dataset boundaries from configuration settings"""

    if '--dataset-boundary' in settings()['egon-data']:
        dataset = settings()['egon-data']['--dataset-boundary']
    elif '--clip-datasets-to' in settings()['egon-data']:
        dataset = settings()['egon-data']['--clip-datasets-to']

    return dataset
@gnn (Collaborator, Author) commented

Stumbled upon this during merging and just wanted to note that this function isn't necessary. I tried to anticipate this confusion with my remark that

The first one is the canonical one so this one will be used when writing to the configuration file or when obtaining settings via the newly added egon.data.config.settings function.

When using egon.data.config.settings the option will always appear under the key "--dataset-boundary", so I removed "dataset_boundaries" in 7a090f8. I'll also remove the second option before the merge, as Guido requested.

gnn added 2 commits March 19, 2021 18:18
It's still available as `"--dataset-boundary"`.
Unfortunately the `:pr:` and `:issue:` shortcuts are a Sphinx-only
feature, so they cannot be used in reST files which are also rendered by
GitHub.
CHANGELOG.rst Outdated
@@ -30,7 +30,7 @@ Added
directtory in which ``egon-data`` is started, so it's probably best to
run ``egon-data`` in a dedicated directory.
There's also the new function `egon.data.config.settings` which
returns the current configuration settings.
returns the current configuration settings. See :pr:`159`
@gnn (Collaborator, Author) commented

Sorry to curb your enthusiasm here, especially since I'm the one who introduced the :issue: and :pr: roles, but unfortunately these roles are only rendered by Sphinx, so it's probably better not to use them in reST files which will be viewed both on GitHub and on Read the Docs.
So I manually expanded the :pr: roles in "CHANGELOG.rst" with 5b4b373.

@gnn gnn merged commit 2ebbae3 into dev Mar 23, 2021
@gnn gnn deleted the features/#112-option-testmode branch March 23, 2021 17:48
Successfully merging this pull request may close these issues.

Option for running workflow in test mode
3 participants