Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable feeding the server configuration from environment variables #765

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

lfoppiano
Copy link
Collaborator

Following up #762. This PR enables the substitution of certain configuration item from environment variables.
In future it can be reused as long as the configuration is loaded from dropwizard.

@lfoppiano lfoppiano linked an issue May 30, 2021 that may be closed by this pull request
@coveralls
Copy link

Coverage Status

Coverage increased (+0.003%) to 39.077% when pulling 2c0e061 on feature/enable-env-variables into 5486030 on master.

@kermitt2
Copy link
Owner

In the new yaml configuration, there is no more specific server config file, just a unique one.

While this mechanism could work for the new yaml config, I am trying to have something more generic using Java reflection which does not require changing the yaml file and follow the same format as for the current property file (e.g. --env GROBID__NB_THREADS=10 ).
So this PR will probably not be merged, because the yaml-config branch is almost ready.

@kermitt2
Copy link
Owner

kermitt2 commented May 31, 2021

I am trying to have something more generic using Java reflection which does not require changing the yaml file and follow the same format as for the current property file (e.g. --env GROBID__NB_THREADS=10).

Ok I did not manage to find a generic solution. I think nothing reasonable can work with list. It would be probably better to forget about overriding config. parameter with environment variables? I think this can work only for trivial configuration.

@lfoppiano
Copy link
Collaborator Author

No problem.

For the YAML de-serialisation perhaps using Jackson would solve some of the problems in the mapping. If it works as the JSON/XML mapper it should be rather simple to implement even with complex nested YAML config files...

@kermitt2
Copy link
Owner

For the YAML de-serialisation perhaps using Jackson would solve some of the problems in the mapping. If it works as the JSON/XML mapper it should be rather simple to implement even with complex nested YAML config files...

The problem is the mapping of environment variable (which are kind of linear serialization) to override the configuration class/object, which is nested with lists.

@lfoppiano
Copy link
Collaborator Author

The solution adopted from dropwizard (we could borrow the idea.. and the code) is to specify the variable directly in the configuration, so that the complexity is left out, without losing flexibility.

@kermitt2
Copy link
Owner

Luca, how do you see the process with the new yaml config? It would be only user-driven? I don't think we can hack the yaml file for all the possible parameter values, because it has potentially much more than hundred of possible parameter keys now (each model has its own set of nested parameters).

@lfoppiano
Copy link
Collaborator Author

Luca, how do you see the process with the new yaml config? It would be only user-driven? I don't think we can hack the yaml file for all the possible parameter values, because it has potentially much more than hundred of possible parameter keys now (each model has its own set of nested parameters).

Yes, let's keep it simple. User can always define their own configuration and decide which variables can be read from the env.

@lfoppiano
Copy link
Collaborator Author

I've updated this PR. Now the users can select which information can be read from environment variables by using

${VARIABLE_NAME:- $$default_value$$}

E.g. for the grobid-home:

${GROBID__GROBID_HOME:- "grobid-home"}

To keep it simple we can just leave the user to decide what to replace (or we can add some variables that can be useful e.g. for the docker container"

@kermitt2 kermitt2 added this to the 0.7.1 milestone Aug 5, 2021
@kermitt2
Copy link
Owner

kermitt2 commented Aug 5, 2021

When I tries to write the documentation for this, I realized that there is no clear way to get it working with docker, for example:

docker run -t --rm --init -p 8080:8070 -p 8081:8071 \
    --env GROBID_MODEL_PRELOAD=true \
    --env GROBID_GROBID_HOME="/some/where/grobid-home" \
    lfoppiano/grobid:0.7.0

... because the grobid.yaml in the image does not have these environment variable definitions. Of course if we need to modify the configuration file, we are in the case where the configuration file that is mounted when running the container, so we don't need environment variables.

The user would need not only its own definitions in the configuration file, but also rebuild a new image with the modified configuration file.

@lfoppiano
Copy link
Collaborator Author

Yes indeed. My idea would be that once you've set up your configuration file with the parameters, then you can easily run the docker by changing the parameters at the command line, instead of editing the file.
Maybe others that are using more the docker architecture could provide more useful use cases? @de-code, what do you think?

@de-code
Copy link
Collaborator

de-code commented Aug 6, 2021

Yes, for me the main use-case would also look like that. Mainly with docker / container (but not strictly).
There would be a default configuration and environment variables make it easy to override some of those values (including setting previously unset values). This is to make deployment easier but could also make experimentation easier in the same way.

This to potentially consider:

  • Allow overriding only specific configuration options rather than providing everything (this could also be possible by merging multiple configuration files)
  • Running a single docker command setting environment variables may be easier to copy and paste by users as an example, than a multi-step example with a separate config file
  • Deploying container in Kubernetes would require be easier with environment variables than a more bespoke deployment config supporting the configuration file
  • Environment variables can almost always be set

(see also #480 - I had my previous implementation based on Airflow's config)

@kermitt2 kermitt2 modified the milestones: 0.7.1, 0.7.2 Aug 6, 2021
@kermitt2
Copy link
Owner

kermitt2 commented Aug 6, 2021

Sorry if I was not clear, the problem I raise is the complexity of the solution:

  • the idea it to have a mechanism to easily run the docker by changing the parameters at the command line, instead of editing the file,
  • but to realize this solution we have to
  1. edit the config file (anyway)
  2. rebuild an image (which takes ages for the complete one with DL)
  3. then we can run the docker by changing the parameters at the command line

The consequences are that the solution can never be compatible with the images on dockerhub and it requires some heavy preliminary work that finally makes it look more complicated that just specifying a config file as parameters at the command line.

I initially started to document it in the Docker section, but this might need to go to the "developer notes" because the existing docker images cannot support this.

@de-code
Copy link
Collaborator

de-code commented Aug 8, 2021

I guess you could include a default config file with all of the placeholders in Docker Hub?
Sub-projects would need to do the same, or anyone with a different default config.
But it does indeed look like a bit of effort to maintain all of those placeholders (and prone to copy and paste errors).

That is why Airflow' config opted for a more general pattern.
You mentioned that it doesn't work well with lists, which is true. One could try to make it work maybe. One challenge would be to select the right item, not based on the index (e.g. by model name).
But maybe it will be easier to revise the config to avoid lists for things that could avoid it.
For example I would maybe want to override the model parameters which are currently a list, but could also be a map (with the name as the key, using underscore instead of hyphen).

Perhaps a bit similar (although not environment variables), Helm install supports --set (and variations) to override values in the yaml file. It also has some limitations but works for many cases.

@lfoppiano
Copy link
Collaborator Author

Sorry if I was not clear, the problem I raise is the complexity of the solution:
[...]
Yes, the solution is supposed to be for advanced users not for simple users of the docker version... But IMHO I don't think a rebuild of the docker image is necessary.

I guess you could include a default config file with all of the placeholders in Docker Hub?

We can include some placeholder by default so that certain basic configurations can be modified from the docker image without supplying an additional config file. E.g. grobid home, number of threads, lazy/eager load of models... etc...

Personally, for more advance usege, I would modify the config file and add placeholders where I need to, supply such values to the docker image and I will have the possibility to run a different configuration without modifying the configuration file.

@kermitt2 kermitt2 removed this from the 0.7.2 milestone Oct 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support loading models eagerly
4 participants