
Allow configuring the database name #159

Merged 59 commits into dev on Mar 23, 2021
Conversation

@gnn (Collaborator) commented Mar 15, 2021

If the configured database doesn't exist, it will be created. This enables seamless usage of multiple databases and should fix #112.

gnn added 20 commits March 9, 2021 16:15
In order to be consistent with the "`database` hyphen `option name`"
style of the rest of the database options, the option gets the `name`
suffix to make it more obvious that the option specifies the database
name.
Turns out I actually only need one direction and that can be done
inline, so I don't really need that code. So commit the code, but revert
it immediately after, to keep it in history in case I want to look it up
later.
This reverts commit 378e078445a146f1c7b52e73bee6fe712605fa0e because the
code isn't actually needed anymore, but should be kept in history for
reference purposes.
This makes sure that the module doesn't contain executable code besides
the function definition. The CLI entry point is still `main` but now
`main` does all setup and initialization and then invokes the main CLI
command. This is also the only way of catching exceptions raised during
the command's execution, which will be needed later on.
If the configuration file doesn't exist, create one on startup by
writing the default values of all command line options having defaults
to a file in the correct format. In order for this to work, the command
line options actually need defaults, so specify those and make them be
shown in the `--help` message. Also put a note in the help message,
explaining how to configure `egon-data` as well as the fact that it
will modify its working directory. Since `egon-data` might be run
multiple times in parallel, also write the combination of configuration
values and command line overrides to a special configuration file from
the first `egon-data` process started. Exclusively use the options from
this configuration file in all other processes and make sure that the
configuration file is deleted when the process exits by wrapping the
main command in a `try` block where the file containing the running
processes active option configuration is deleted in a `finally` block.

Put a configuration file path discovery function in the
`egon.data.config` module in order to support all this machinery.

Finally, adapt the CHANGELOG.rst to these new changes.
The clarification was missing from the original announcement.
Keys are now the Python identifiers corresponding to the command line
switches. This change was necessary because the translation to the keys
used in "docker-compose.yml" is now done at the time of reading the
file, instead of when writing the file.
Consistently use the standard configuration file
"egon-data.configuration.yaml" instead.
Allow configuring the "Docker"ed database via command line arguments.
In order to do so, convert the shipped "docker-compose.yml" to a
template, which is copied to the "docker" subdirectory of the current
working directory, with the template fields filled in with the
corresponding values from the command line arguments. Run
`docker-compose` using the copied template, instead of the original file
shipped with "egon.data". Put a note about this behavior into
`egon-data`'s help message. Stop reading database parameters from the
shipped `docker-compose.yaml` because that's no longer possible, as the
file is a template without meaningful values now. Finally, update the
changelog to reflect this new feature.
I intentionally violated the style in the last commit, in order to keep
the diff as small as possible. While fixing the style, I also changed
the variable `docker_db_config` to `configuration`. Using the same
variable name that is also used to store the "egon-data" part of the
configuration file is done intentionally, because the new
`configuration` dictionary is just the old one, with the keys filtered
to the ones also present in `translated` and translated accordingly.
Group them loosely on how they belong together thematically, instead of
ordering them alphabetically.
Previously the initial configuration file was generated using only the
default values which IMHO is surprising behaviour if command line
overrides are present.
Since default values are always present in the options to a command, the
behaviour up to this commit meant that non-default values taken from the
configuration file were overwritten with the default whenever no value
was specified for a specific command line switch.
This has a lot of consequences. First and foremost, the
"sql_alchemy_conn" configuration setting in "airflow.cfg" has to be
determined dynamically from the database configuration settings. In
order to do so, "airflow.cfg" was converted to a template, where
everything in curly brackets, e.g. "{slot}", gets filled with
appropriate values, which also means that curly braces not signifying a
slot to be filled have to be escaped to "{{" and "}}". This is the
cause for most of the changes in "airflow.cfg".
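The escaping rule can be illustrated with a small `str.format` example (an illustrative snippet, not the actual "airflow.cfg" template):

```python
# In templates filled via str.format, single braces mark slots to be
# filled, while literal braces have to be doubled to "{{" and "}}".
template = (
    "sql_alchemy_conn = postgresql://{user}@{host}:{port}/{database}\n"
    "literal = {{this stays in braces}}\n"
)
rendered = template.format(
    user="egon", host="127.0.0.1", port=59734, database="airflow"
)
```

This doubling is what accounts for most of the diff in "airflow.cfg": every pre-existing literal brace had to be escaped so it survives the fill step unchanged.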

The template gets handled similar to "docker-compose.yml", in that it
gets copied to the "airflow" subdirectory of the current working
directory and has its slots filled. As a consequence, the directory in
which DAGs are discovered can no longer be determined from
`$AIRFLOW_HOME` but has to be filled in as well, because our DAGs still
reside in the "egon.data" package. This explains having to set the
"dags_folder" configuration setting.

Since the one existing database has a new purpose, it also gets a
setting, with which it can be configured and which is used in
"docker-compose.yml" to determine the database name. The database now
has to be up prior to Airflow, so we can no longer initialize it in an
Airflow task. Therefore starting the docker container is now done at the
start of the `egon-data` command. Since the docker container can only be
configured to contain one database initially, it's now checked whether
the normal `egon-data` database exists and if it doesn't it is created.
That way one can use multiple different databases, simply by supplying
different database names via `--database-name`.
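The "create the database if it doesn't exist" check can be sketched as below. This assumes a psycopg2-style DB-API connection and a hypothetical function name; the actual code in `egon.data` may differ:

```python
def ensure_database(connection, name):
    """Create database `name` if it doesn't exist yet.

    `connection` is assumed to be an autocommit DB-API connection
    (e.g. a psycopg2 connection) to the always-present maintenance
    database, since `CREATE DATABASE` can neither run inside a
    transaction nor from within the database being created.
    """
    with connection.cursor() as cursor:
        # pg_database lists all databases in the cluster.
        cursor.execute(
            "SELECT 1 FROM pg_database WHERE datname = %s", (name,)
        )
        if cursor.fetchone() is None:
            cursor.execute('CREATE DATABASE "{}"'.format(name))
```

Quoting the name with double quotes matters here because the default name, "egon-data", contains a hyphen and would otherwise be rejected by PostgreSQL.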

Last but not least, the "entrypoints" make no sense anymore, because
those would be applied to the Airflow metadata database. Extension
creation and other initialization has therefore been moved to the
`initdb` Airflow task, which finally fixes #76.
Now that we're using PostgreSQL for Airflow's metadata, this is finally
possible.
That means the data is managed by Docker, so it is no longer directly
available on the file system, but this was the only way I managed to get
rid of really annoying permission errors.
See Docker's [documentation][0] for more information.

[0]: https://docs.docker.com/storage/volumes/
@gnn (Collaborator, Author) commented Mar 15, 2021

Finally managed to get this working. Would you please try it out, @ClaraBuettner and @gplssm. But beware, a lot of things changed. I'm now expecting users to run egon-data in a dedicated directory, and the compose file changed, so the container will probably be rebuilt.

@ClaraBuettner (Contributor) commented

I tested choosing another database and it worked fine. Since the tasks run in parallel now, the computation time is significantly reduced, which is great.
At first, I had some problems setting the options in the correct order; egon-data --database 'test-egon-data' serve worked, but an example in the test mode section might be helpful.

And I get some error messages when I run egon-data --help but everything is working. I have Python 3.7.6 and click 7.1.2.

> (egon-data) (base) clara@clara-LIFEBOOK-U749:~/test-egon-data$ egon-data --help
> Usage: egon-data [OPTIONS] COMMAND [ARGS]...
> 
>   Run and control the eGo^n data processing pipeline.
> 
>   It is recommended to create a dedicated working directory in which to run
>   `egon-data` because `egon-data` because `egon-data` will use it's working
>   directory to store configuration files and other data generated during a
>   workflow run.
> 
>   You can configure `egon-data` by putting a file named "egon-
>   data.configuration.yaml" into the directory from which you are running
>   `egon-data`. If that file doesn't exist, `egon-data` will create one,
>   containing the command line parameters supplied, as well as the defaults
>   for those switches for which no value was supplied. Last but not least, if
>   you're using the default behaviour of setting up the database in a Docker
>   container, the working directory will also contain a directory called
>   "docker", containing the database data as well as other volumes used by
>   the dockered database.
> 
> Options:
>   --airflow-database-name DB      Specify the name of the airflow metadata
>                                   database.  [default: airflow]
> 
>   --database-name, --database DB  Specify the name of the local database. The
>                                   database will be created if it doesn't
>                                   already exist.
>                                   
>                                   Note: "--database" is deprecated and will
>                                   be removed in the future. Please use the
>                                   longer but consistent "--database-name".
>                                   [default: egon-data]
> 
>   --database-user USERNAME        Specify the user used to access the local
>                                   database.  [default: egon]
> 
>   --database-host HOST            Specify the host on which the local database
>                                   is running.  [default: 127.0.0.1]
> 
>   --database-port PORT            Specify the port on which the local DBMS is
>                                   listening.  [default: 59734]
> 
>   --database-password PW          Specify the password used to access the
>                                   local database.  [default: data]
> 
>   --jobs N                        Spawn at maximum N tasks in parallel.
>                                   Remember that in addition to that, there's
>                                   always the scheduler and probably the server
>                                   running.  [default: 16]
> 
>   --version                       Show the version and exit.
>   --help                          Show this message and exit.
> 
> Commands:
>   airflow
>   serve    Start the airflow webapp controlling the egon-data pipeline.
> Traceback (most recent call last):
>   File "/home/clara/venv/egon-data/lib/python3.7/site-packages/click/core.py", line 781, in main
>     with self.make_context(prog_name, args, **extra) as ctx:
>   File "/home/clara/venv/egon-data/lib/python3.7/site-packages/click/core.py", line 700, in make_context
>     self.parse_args(ctx, args)
>   File "/home/clara/venv/egon-data/lib/python3.7/site-packages/click/core.py", line 1212, in parse_args
>     rest = Command.parse_args(self, ctx, args)
>   File "/home/clara/venv/egon-data/lib/python3.7/site-packages/click/core.py", line 1048, in parse_args
>     value, args = param.handle_parse_result(ctx, opts, args)
>   File "/home/clara/venv/egon-data/lib/python3.7/site-packages/click/core.py", line 1630, in handle_parse_result
>     value = invoke_param_callback(self.callback, ctx, self, value)
>   File "/home/clara/venv/egon-data/lib/python3.7/site-packages/click/core.py", line 123, in invoke_param_callback
>     return callback(ctx, param, value)
>   File "/home/clara/venv/egon-data/lib/python3.7/site-packages/click/core.py", line 951, in show_help
>     ctx.exit()
>   File "/home/clara/venv/egon-data/lib/python3.7/site-packages/click/core.py", line 558, in exit
>     raise Exit(code)
> click.exceptions.Exit: 0
> 
> During handling of the above exception, another exception occurred:
> 
> Traceback (most recent call last):
>   File "/home/clara/GitHub/eGon-data/src/egon/data/cli.py", line 285, in main
>     egon_data.main(sys.argv[1:])
>   File "/home/clara/venv/egon-data/lib/python3.7/site-packages/click/core.py", line 810, in main
>     sys.exit(e.exit_code)
> SystemExit: 0
> 
> During handling of the above exception, another exception occurred:
> 
> Traceback (most recent call last):
>   File "/home/clara/venv/egon-data/bin/egon-data", line 33, in <module>
>     sys.exit(load_entry_point('egon.data', 'console_scripts', 'egon-data')())
>   File "/home/clara/GitHub/eGon-data/src/egon/data/cli.py", line 287, in main
>     config.paths(pid="current")[0].unlink(missing_ok=True)
> TypeError: unlink() got an unexpected keyword argument 'missing_ok'
> 

I also noticed that I messed up the dependencies of import-nep-data, which also needs zensus data in test mode and fails now that the tasks run in parallel. I will fix this in a separate issue.

The `missing_ok` parameter was added in Python 3.8. Since we also
support Python 3.7, manually ignore the error which is thrown when
trying to remove a nonexistent file.
@gnn (Collaborator, Author) commented Mar 15, 2021

At first, I had some problems setting the options in the correct order; egon-data --database 'test-egon-data' serve worked, but an example in the test mode section might be helpful.

Would it also be possible to rephrase or put an example in the --help message to clarify things? How would you want to read it? Because I'd really like the --help message to be the go-to help text and have it both on the command line and on Read the Docs. Since you apparently also weren't aware that --database is deprecated in favour of --database-name, I don't seem to be doing a good job of explaining things. :)

And I get some error messages when I run egon-data --help but everything is working. I have Python 3.7.6 and click 7.1.2.

> 
> Traceback (most recent call last):
>   File "/home/clara/venv/egon-data/bin/egon-data", line 33, in <module>
>     sys.exit(load_entry_point('egon.data', 'console_scripts', 'egon-data')())
>   File "/home/clara/GitHub/eGon-data/src/egon/data/cli.py", line 287, in main
>     config.paths(pid="current")[0].unlink(missing_ok=True)
> TypeError: unlink() got an unexpected keyword argument 'missing_ok'
> 

Nice catch. Thanks. Fixed that. You might now have an "egon-data.pid-SOME_NUMBER.configuration.yaml" file lying around, which may create problems down the line. You can delete that/these.

@gnn (Collaborator, Author) commented Mar 15, 2021

Another question from me: now that there is the option of working with an arbitrary database, how do I tell the pipeline to be in "test mode" for one database and in "production mode" for another one?

gnn added 3 commits March 16, 2021 03:23
The new code (hopefully) does the same as the old code: collect option
flags and the corresponding values for defaults and command line
overrides in a dictionary so they can be merged and written to a
configuration file. The new version is just shorter and hopefully not
too obscure.
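The collect-and-merge described above boils down to a standard dictionary merge (a minimal sketch with illustrative keys, not the actual code):

```python
# Defaults and command line overrides keyed by option flag; the
# overrides win wherever both dictionaries have a key.
defaults = {"--database-name": "egon-data", "--jobs": 1}
overrides = {"--jobs": 16}
merged = {**defaults, **overrides}
```

The merged dictionary is what then gets written to the configuration file, so overrides supplied on the command line end up in the file alongside the untouched defaults.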
@gnn (Collaborator, Author) commented Mar 16, 2021

Did my part. The new option is available under two names: --dataset-boundary and --clip-datasets-to. Both can be used interchangeably. The first one is the canonical one so this one will be used when writing to the configuration file or when obtaining settings via the newly added egon.data.config.settings function. If you like one more than the other, one variant can be removed. I couldn't decide, so I left both in.
Last question: I noticed that you are using PWD, @gplssm. Do you find that more intuitive than saying "the current working directory"?
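A two-name option like the one described can be declared in click roughly as follows. This is a sketch assuming click (which egon-data's CLI is built on); the default value and help text are illustrative:

```python
import click


@click.command()
@click.option(
    "--dataset-boundary",
    "--clip-datasets-to",
    "dataset_boundary",  # explicit parameter name; the first flag is canonical
    default="Everything",
    show_default=True,
    help="Clip all datasets to the given boundary.",
)
def egon_data(dataset_boundary):
    """Both flags feed the same parameter; either works on the CLI."""
    click.echo(dataset_boundary)
```

Because click stores the value under the single parameter name, anything reading the resulting settings only ever sees the canonical key, regardless of which flag the user typed.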

@gplssm (Contributor) commented Mar 17, 2021

Thanks! 👍

Did my part. The new option is available under two names: --dataset-boundary and --clip-datasets-to. Both can be used interchangeably. The first one is the canonical one so this one will be used when writing to the configuration file or when obtaining settings via the newly added egon.data.config.settings function. If you like one more than the other, one variant can be removed. I couldn't decide, so I left both in.

I don't have a strong preference here. It makes it easier for all of us to have only one name, in particular if this name is used elsewhere (e.g. egon.data.config.settings). I vote for keeping only --dataset-boundary.

Last question: I noticed that you are using PWD, @gplssm. Do you find that more intuitive than saying "the current working directory"?

No, I'm totally fine with CWD. I was just too lazy to search for the exact difference... Changed it.

@gplssm (Contributor) commented Mar 17, 2021

Before I can approve/we can merge, three things need to be done:

  • Incorporate --dataset-boundary in the pipeline/the respective tasks (yesterday we said that we prefer to remove the op_args and read from settings in the task functions directly)
  • Add a note to CHANGELOG (@gplssm double-checks)
  • Resolve conflicts with dev (@gnn)

For the first one, I had a look into settings. I expected that the key --dataset-boundary would be available here. Either with its default value "Everything" or with a custom dataset. See

import pprint
from egon.data.config import settings

pprint.pprint(settings())
{'egon-data': {'--airflow-database-name': 'airflow',
               '--database-host': '127.0.0.1',
               '--database-name': 'egon-data',
               '--database-password': 'data',
               '--database-port': '59734',
               '--database-user': 'egon',
               '--jobs': 1}}

@ClaraBuettner (Contributor) commented

For the first one, I had a look into settings. I expected that the key --dataset-boundary would be available here. Either with its default value "Everything" or with a custom dataset.

When I do what you have done I get the expected output:

> pprint.pprint(settings())
> {'egon-data': {'--airflow-database-name': 'airflow',
>                '--database-host': '127.0.0.1',
>                '--database-name': 'test-egon-data',
>                '--database-password': 'data',
>                '--database-port': '59734',
>                '--database-user': 'egon',
>                '--dataset-boundary': 'Schleswig-Holstein',
>                '--jobs': 1}}

I created a new folder to get a new egon-data.configuration.yaml. Did you also do that?

And did you already start with the tasks? I can also do some of them.

@gplssm (Contributor) commented Mar 17, 2021

I created a new folder to get a new egon-data.configuration.yaml. Did you also do that?

No.
I didn't re-create the config file after pulling the latest changes. This was the mistake. So, everything works fine 🎉 thanks for testing on your side!

And did you already start with the tasks? I can also do some of them.

No, I didn't start yet. Feel free to change things. I'll do the remaining ones.

@gnn do you want to resolve the conflicts? You've changed most here and I think you're the fastest.

@gplssm (Contributor) commented Mar 17, 2021

I ran the workflow again and can confirm that it runs successfully. And indeed, it's way faster now. Very cool! 😎

I noticed one more thing that should be changed. Except for MaStR data, all downloaded data lands in the repo instead of being saved in the project directory.

@ClaraBuettner (Contributor) left a review

I checked that the installation from this branch works fine and also ran the workflow again without any issues. Since solving conflicts with dev doesn't seem to be part of reviewing, I will approve this PR.
But @gnn, if you want someone to test again after solving the conflicts, I can do that. You just need to ask for it.

@gnn (Collaborator, Author) commented Mar 18, 2021

Narf. Typing in responses isn't enough. You also have to send them. So this is what I wanted to send yesterday evening:

I checked that the installation from this branch works fine and also ran the workflow again without any issues. Since solving conflicts with dev doesn't seem to be part of reviewing, I will approve this PR.
But @gnn, if you want someone to test again after solving the conflicts, I can do that. You just need to ask for it.

Nah, it's fine. I'll resolve the conflicts directly while merging into dev, so no further reviews are required. Thanks for everything. Merge coming in soon.

@gnn (Collaborator, Author) commented Mar 18, 2021

Quick question @gplssm: did you already solve the problem of the clashing container names?

@gplssm (Contributor) commented Mar 18, 2021

Quick question @gplssm: did you already solve the problem of the clashing container names?

Yes. https://egon-data--159.org.readthedocs.build/en/159/troubleshooting.html#error-cannot-create-container-for-service-egon-data-local-database

@gnn (Collaborator, Author) commented Mar 18, 2021

Quick question @gplssm: did you already solve the problem of the clashing container names?

Yes. https://egon-data--159.org.readthedocs.build/en/159/troubleshooting.html#error-cannot-create-container-for-service-egon-data-local-database

Damn. That means I can't use you to test whether changing the container name fixes that one. Ah, well, no big deal. I guess we'll see whether people still run into the issue or not.

gnn added 6 commits March 18, 2021 19:01
Before this commit, there was an ambiguity in the name
`egon-data-local-database`: the name was used both for the service and
for the container. As the container name will become configurable, it's
better to use unambiguous names from now on. While doing this, I also
noticed that the name `egon-data-database-volume` did not use the same
convention as the two new names introduced, so I also changed it to be
consistent when naming things in this file.
Not updating is in line with how "docker-compose.yml" is treated. If
that behaviour should change, it's probably best to introduce extra
command line flags or explicit commands for updating the configuration
files.
The function is unnecessary because the option will only ever appear
under its first name.
Comment on lines 56 to 64

def dataset_boundaries():
    """Return dataset boundaries from configuration settings"""

    if '--dataset-boundary' in settings()['egon-data']:
        dataset = settings()['egon-data']['--dataset-boundary']
    elif '--clip-datasets-to' in settings()['egon-data']:
        dataset = settings()['egon-data']['--clip-datasets-to']

    return dataset
@gnn (Collaborator, Author) commented

Stumbled upon this during merging and just wanted to note that this function isn't necessary. I tried to anticipate this confusion with my remark that

The first one is the canonical one so this one will be used when writing to the configuration file or when obtaining settings via the newly added egon.data.config.settings function.

When using egon.data.config.settings the option will always appear under the key "--dataset-boundary", so I removed "dataset_boundaries" in 7a090f8. I'll also remove the second option before the merge, as Guido requested.

gnn added 2 commits March 19, 2021 18:18
It's still available as `"--dataset-boundary"`.
Unfortunately the `:pr:` and `:issue:` shortcuts are a Sphinx-only
feature, so they cannot be used in reST files which are also rendered by
GitHub.
CHANGELOG.rst Outdated
@@ -30,7 +30,7 @@ Added
directtory in which ``egon-data`` is started, so it's probably best to
run ``egon-data`` in a dedicated directory.
There's also the new function `egon.data.config.settings` which
returns the current configuration settings.
returns the current configuration settings. See :pr:`159`
@gnn (Collaborator, Author) commented

Sorry to curb your enthusiasm here, especially since I'm the one who introduced the :issue: and :pr: roles, but unfortunately these roles are only rendered by Sphinx, so it's probably better not to use them in reST files which will be viewed both on GitHub and on Read the Docs.
So I manually expanded the :pr: roles in "CHANGELOG.rst" with 5b4b373.

@gnn gnn merged commit 2ebbae3 into dev Mar 23, 2021
@gnn gnn deleted the features/#112-option-testmode branch March 23, 2021 17:48
Successfully merging this pull request may close these issues.

Option for running workflow in test mode
3 participants