Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[KED-2143] Adding a ConfigLoader instance into hook specs params #506

Closed
takikadiri opened this issue Sep 8, 2020 · 20 comments
Closed

[KED-2143] Adding a ConfigLoader instance into hook specs params #506

takikadiri opened this issue Sep 8, 2020 · 20 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@takikadiri
Copy link

Description

When developping a kedro plugin, i regularly need to access to configs and potentielly some plugin-specific configs files. Since the plugin use hook mechanism, i no longer can bring whatever context attribute to my hook implemantation (except the parameters defined in the hook specs).

Context

Here in the kedro-mlflow plugin we were forced to redefine a ConfigLoader instance inside the plugin.
That lead to incoherence between the context ConfigLoader property and the new Configloader created inside the hook.

Other plugins will need this functionality, i imagine a kedro-spark plugin that use hook mechanism and access a spark config file from project folder path (spark.yml), or a kedro-sas plugins that do the same thing (getting configs in order to create a parametrized session)

Possible Implementation

A possible implementation is to pass the context config_loader to the hook.

hook specs

 @hook_spec
    def before_pipeline_run(
        self, run_params: Dict[str, Any], pipeline: Pipeline, catalog: DataCatalog, config_loader: ConfigLoader
    ) -> None:

context

hook_manager = get_hook_manager()
hook_manager.hook.before_pipeline_run(  # pylint: disable=no-member
            run_params=record_data, pipeline=filtered_pipeline, catalog=catalog, config_loader=self.config_loader
        )
@takikadiri takikadiri added the Issue: Feature Request New feature or improvement to existing feature label Sep 8, 2020
@lorenabalan
Copy link
Contributor

Hi @takikadiri , you've highlighted a very good point. We thought about this and we've actually added a set of hooks to register library components, such as pipelines, data catalog, and config loader, with a Kedro project. I think might solve your use case.
This functionality will be made available in 0.16.5, which is going to be released very soon. :)

@takikadiri
Copy link
Author

Thank you @lorenabalan for the quick reply ! It's realy great having the possibility to registrer library component such as the config loader, i will certainly use it.
But my point here is about not having the possibility to pass the config loader instance (created with register_config_loader) to another hook let's say the pipeline_before_run hook.
There may be something that escapes me about the hook mechanisms :)

@Galileo-Galilei
Copy link
Member

Galileo-Galilei commented Sep 15, 2020

Hello @lorenabalan, I am not sure if I miss the point but I think this is not what is at stake here, correct me if I'm wrong.

I don't know if this is the best place to write this or if it should be in another issue, but here is a more detailed description of the problem and discussion on different design decisions and potential decisions.

Description

Since hooks have been released in kedro==0.16.0, they have become a popular tool among developers who create kedro plugins (to be honest the community is small but quite active 😉 ).

It is a common pattern for hook to need to access to configuration files (for instance to create a session with an external tool with credentials, to use parameters inside the hook and more likely in the case of kedro-mlflow to use a custom config file for the plugin.

I personnaly feel that this configuration file access must be template-independent. The hook is not supposed to assume anything on the template (which may be changed by the user) since the ProjectContext already have all the necessary informations (i.e. mainly the ConfigLoader initiated but potentially other attributes of the ProjectContext). If the hook needs to recreate any attributes of the ProjectContext (for instance the ConfigLoader), there is a high risk that the hook behaves differently than the ProjectContext, which is something we absolutely want to avoid.

Concrete use cases:

Use case 1: Accessing proprietary configuration file inside hook

  • Let's say that the user modifies the register_config_loader (for instance to use the TemplatedConfigLoader in your documentation):
from kedro.config import TemplatedConfigLoader

class ProjectHooks:
    @hook_impl
    def register_config_loader(self, conf_paths: Iterable[str]) -> ConfigLoader:
        return TemplatedConfigLoader(
            conf_paths,
            globals_pattern="*globals.yml",
            globals_dict={"param1": "pandas.CSVDataSet"}
        )
  • Now imagine I have another plugin that must access its own configuration file (say mlflow.yml) inside hook calls. For instance
class MlflowNodeHook:
    @hook_impl
    def before_node_run(
        self,
        node: Node,
        catalog: DataCatalog,
        inputs: Dict[str, Any],
        is_async: bool,
        run_id: str,
    ) -> None:
     # get the config loader of the current context
     config_loader = get_config_loader()  # actually, config_loader is not available here, this magic function does not exist! i need to eventually get the one registered in the project
     
    # do whatever I want using the conf and implementing my own logic
    conf_mlflow = config_loader.get("mlflow*", "mlflow*/**") 
    do_my_own_logic(conf_mlflow )

Use case 2: Accessing credentials file inside hook

Let's say that I want to create a connection with a remote server (say SAS) globally to interact before/afeter node, and eventually inside node

class MlflowPipelineHook:
    @hook_impl
    def before_pipeline_run(
        self, run_params: Dict[str, Any], pipeline: Pipeline, catalog: DataCatalog
    ) -> None:
      # get the config loader of the current context
     config_loader = get_config_loader()  # actually, config_loader is not available here, this magic function does not exist!
     
    # do whatever I want using the conf and implementing my own logic
    credentials = config_loader.get("credentials*", "credentials*/**") 
    saspy.SASsession(credentials)

@WaylonWalker @deepyaman You guys seem to develop a lot of hooks, do these use cases are hitting you too? I see you sometimes use environment variable for configuration of your hooks, I guess it is somehow related to this.

Overview of solutions to this problem

Existing solution

Existing solution 1 : recreate config loader locally

For instance, example 1 would become:

class MlflowNodeHook:
    @hook_impl
    def before_node_run(
        self,
        node: Node,
        catalog: DataCatalog,
        inputs: Dict[str, Any],
        is_async: bool,
        run_id: str,
    ) -> None:
     # recreate the config loader manually
        conf_paths = [
            str(self.project_path / self.CONF_ROOT / "base"), # these attributes are not accessible outside the context, they must be hardcoded actually
            str(self.project_path / self.CONF_ROOT / self.env), # suppressed
        ]
       hook_manager = get_hook_manager()
        config_loader = hook_manager.hook.register_config_loader(  # pylint: disable=no-member
            conf_paths=conf_paths
        ) or ConfigLoader(conf_paths)
     
    # do whatever I want using the conf and implementing my own logic
    conf_mlflow = config_loader.get("mlflow*", "mlflow*/**") 
    do_my_own_logic(conf_mlflow )

Pros:

  • it is a "better than nothing" solution

Cons:

  • project_path is hardcoded
  • env is not accessible
  • code is not reused, if CONF_ROOT changes
    This is neither a reliable, nor flexible, nor maintenable way to do this

Existing solution 2 : reload context when possible

Some hooks methods have access to some of the project context attributes: for instance, after_catalog_created can access credentials, before_pipeline_run and after_pipeline_run can access project_path. In these methods, we can call load_context(project_path) to access to all of the context attributes.

Pros:

  • it works perfectly

Cons:

  • it recreates a whole context object for very little use and might get slow
  • this solution does not work for all hooks (before_node_run and after_node_run do not have access to the project_path for instance)

Existing solution 3 : assume call is made at the root of the kedro project and go back to solution 2

For the hooks without access to the project_path, call load_context() without the project_path argument.

Pros:

  • it is the best solution I've found so far

Cons:

  • it assumes your working directory is the root of the kedro project
  • It prevents the user to use kedro interactively with a different working directory, and may also break using kedro inside jupyter notebook.

Potential solutions which need development on Kedro's side

Solution 1: Add config loader to all hooks

As the title of this issue states, a solution would be to pass the config loader to each @hook_spec parameters to make it accessible within hooks

Solution 2: Use the KedroSession ?

By digging in the code, I noticed a merged yet not documented feature called KedroSession. This creates a global variable which is accessible without any hypothesis on the template just by calling get_current_session(), and it contains the context, hence the ConfigLoader. It should be accessible in the hooks.

Pros:

  • the problem will be solved very smoothly (and in an even more general way, since any attributes of the context will be accessible).

Cons:

@DmitriiDeriabinQB it seems you are the one developing KedroSession, is it how it is intended to be used in the future?

@WaylonWalker
Copy link
Contributor

@Galileo-Galilei Steel Toes utilizes the project's context by defining your hooks as a property on the ProjectContext rather than a list.

from steel_toes import SteelToes

class ProjectContext(KedroContext):
   project_name = "kedro0160"
   project_version = "0.16.1"
   package_name = "kedro0160"

   @property
   def hooks(self):
      self._hooks = [ SteelToes(self), ]
      return self._hooks

You can see where the context is used inside the hook here. I do feel like this is a bit of a hack and asks users to implement hooks on their project in a non-traditional way. The next upcoming change will make the context argument not required. Note that context contains a config_loader method that might be useful for you.

I would really like to get access to the project's context inside of a hook, especially if we could configure hook behavior inside of .kedro.yml. I think this would align with how plugins work on pytest. I am do not know how it works, but I know when using a plugin like pytest-cov you can pass in command-line arguments, or add to a config file to configure how it is ran. https://pytest-cov.readthedocs.io/en/latest/config.html.

@Galileo-Galilei
Copy link
Member

Hello @WaylonWalker and thanks for the reply. This is a clever hack and works like a charm, but it breaks auto-discovery and configuration in kedro.yml as you mention (not to mention that it is a user facing change, even if it easy to setup). I feel that it can be a temporary way to make the hook more stable, but it is definitely not a long term solution and should be integrated to kedro core IMHO. Aligning on pytest sounds reasonable indeed.

By the way, it seems @tamsanh is hitting the same problem and need to access the context inside his KedroWings hook to be able to use interactively (which is the same issue some users mention here and here in kedro-mlflow.)

Some tests on KedroSession look promising (initalise a session before_pipeline_run and retrieve it anywhere you need it), but I don't want to rely on it since it is explicitly mentionned in the script that it is not stable and may change even between releases.

@DmitriiDeriabinQB
Copy link
Contributor

@Galileo-Galilei you are right assuming that KedroSession has been designed to eventually become responsible for carrying KedroContext (and project data in general) which would make the use case that you've describe much less painful. Hence, as you have already noticed it has been made a singleton to ensure its accessibility from hooks, for example.

However, this is still a work in progress and currently it's not at the stage where we can officially announce it and freeze the design. The general idea is that KedroSession will gradually take over the responsibility for the lifecycle events, while KedroContext will be treated as a "gatekeeper to the library components" (definition by @limdauto) in a new model.

@Galileo-Galilei
Copy link
Member

Galileo-Galilei commented Sep 23, 2020

Thanks for the insightful answer. I use one of above solutions as a better than nothing way to achieve what I want, and wait for the KedroSession to be more official 😄 If you need beta-testers, feel free to ask!

@921kiyo 921kiyo changed the title Adding a ConfigLoader instance into hook specs params [KED-2143] Adding a ConfigLoader instance into hook specs params Oct 2, 2020
@yetudada
Copy link
Contributor

@Galileo-Galilei We've watched your great work on kedro-mlflow and hope that product changes have made it easier. I'll close this issue for now but let me know if I should re-open it.

@Galileo-Galilei
Copy link
Member

Galileo-Galilei commented Mar 2, 2022

Hi @yetudada, would you mind reopening this issue? This works perfectly with kedro==0.17.x, but the get_current_session public function will be removed in kedro==0.18.x (it is already removed from the develop branch).

This completely prevents accessing configuration / and credentials inside hooks, which was the point of this issue. @datajoely sorry to ping directly but is there something I am missing here? Use cases described above are still prevalent and will no longer be doable with kedro==0.18.x, which feels like a regression. Even worse, above hacks no longer works.

@datajoely
Copy link
Contributor

@idanov @lorenabalan would you mind chiming in here? I'm not sure I have the knowledge

@antonymilne
Copy link
Contributor

@Galileo-Galilei @takikadiri We just had a discussion about all this. Not sure what the solution will be, but curious what your thoughts are on making context available in before_pipeline_run. context includes the config_loader (albeit currently hidden away since the public config_loader property was removed in 0.18.0 which was maybe a mistake...), and also params, catalog, env, _get_config_credentials. Our feeling is that maybe this would be better than just providing config_loader - interested to know what you think.

The other thing I'm wondering is whether it would actually be useful to have context available at some point before before_pipeline_run. Potential problems I see:

  1. before_pipeline_run is actually called some way through session.run. What if you need access to config_loader before this point in a kedro run? Would it be better to actually put it after_catalog_created or even some new hook after_session_created?
  2. All hooks are called only in session.run. What if I want access to config_loader outside a kedro run entirely? Currently I need to make a kedro session, load context and then get the config_loader from there. I'm not sure if there should be some better way of doing this.

@Galileo-Galilei
Copy link
Member

Galileo-Galilei commented Apr 14, 2022

curious what your thoughts are on making context available in before_pipeline_run

I think this would solve all our problems and is very flexible because of all the attributes / methods you quote (e.g it automatically solves the problem of accessing credentials as you suggest).

I'm wondering is whether it would actually be useful to have context available at some point before before_pipeline_run [...] or even some new hook after_session_created

I would love having an after_session_created hook (and I guess I already suggested this in another discussion, likely on the interactive workflow). The key point is that it is quite common to setup some external connections at the beginning of the pipeline, e.g. for kedro-mlflow:

with KedroSession.create(project_path, env) as session: 
    mlflow_config = get_mlflow_config(session)
    mlflow_config.setup()

which basically sets the mlflow tracking uri and the experiment. Most times people forget to add these two lines in their notebook / scripts and find out that their mlflow configuration is completely messed up. Obviously when it is run to the CLI these two lines are run in the before_pipeline_run and the configuration is set correctly.

This would also be the place to instantiate global connection (e.g. what we discussed here on kedro-engines, and especially for SparkContext instantiation).

if I want access to config_loader outside a kedro run entirely? Currently I need to make a kedro session, load context and then get the config_loader from there. I'm not sure if there should be some better way of doing this.

Not sure what the problem is here: if you want to access the config_loader outside a session, you can just recreate it from settings since you always have access to the env and the project_path in such a situation, can't you? :

from kedro.framework.project import settings

settings.CONFIG_LOADER_CLASS(
            conf_source=str(PROJECT_PATH / settings.CONF_SOURCE),
            env=env,
            **settings.CONFIG_LOADER_ARGS,
        )  

@antonymilne
Copy link
Contributor

Cool, thank you for the feedback! It has always been at the back of my mind that the current instantiation of spark via a custom KedroContext is annoying and that really this should happen in some sort of hook instead. I think one of the reasons this didn't already happen was precisely because config_loader isn't available in before_pipeline_run, so we couldn't load up spark.yml there (not actually sure whether spark.yml should be considered a configuration file like parameters and catalog anyway, but that's another matter). So if we do make context (and/or maybe even session??) accessible through some hook that lives outside session.run then I think we could kill two birds with one stone here and improve the way that we initialise spark also. To me this makes more sense than adding a new argument to before_pipeline_run.

As for accessing config_loader outside a session, you're absolutely right that's how you would create it. It just feels a bit awkward and fragile to me because you have to explicitly call code that looks like it's quite deep in kedro internals. But I guess it's only quite advanced users who would be looking to do that anyway. And I can't immediately think of a better way of us exposing this.

@Galileo-Galilei
Copy link
Member

Galileo-Galilei commented Apr 15, 2022

It has always been at the back of my mind that the current instantiation of spark via a custom KedroContext is annoying and that really this should happen in some sort of hook instead.

I think it does, and it very confusing for users too, as shown by this discussion. the user needs mlflow to be instantiated before Spark, and creating the spark session in the context (before any hook can be executed) instead of a hook completely prevents to "compose" logic nicely. The user ended up to move the spark instantiation logic to the before_pipeline_run hook. I personally advocated against following the official documentation (sorry for that, usually your docs are amazing!).

Adding after_session_created would help for this, and would introduce consistency between interactive and CLI workflow. The only drawback I can see is that it would fired up when launching plugins which do not want to run the session which may have unexpected consequences (e.g. the kedro viz command will fail if there are no mlflow.yml or spark.yml in the configuration, even it should simply not care about them).

@antonymilne
Copy link
Contributor

Very useful, thanks @Galileo-Galilei. That's a good observation about the drawback about such a hook. Possibly it would make more sense to have an after_context_created hook (which would be called in session.run but not session.create). I'm not sure - with all these it's obviously nice to be able to solve a few use cases with the hooks we create rather than designing something to solve one particular problem. So we'd need to do a bit of research to see exactly what would be most powerful for users here without exposing too much stuff (e.g. probably we don't want to make the whole session available in a hook).

Some use cases that currently spring to mind:

  • the original issue of making config loader available somehow
  • the above issue of initialising spark session, mlflow, etc. early on in a kedro session (or kedro run?), ideally before before_pipeline_run I think
  • following these comments about how kedro viz should be able to show node run status: ultimately I think all hooks should have the session, or at least the session store, available to them. The best way I can see to achieve this without cluttering up every single hook specification and lots of other function signatures is to make session or session_store available in the new after_*_created hook. Users can then register it to their hook class via self._session_store to use throughout all their hooks as required like this

Overall, pending more research, my initial idea would be to do this in a series of steps, roughly in order:

  1. Add after_session_created and/or after_context_created hook that exposes context. This would provide a much better solution for the immediate problem of how to access config loader (and also provides env, credentials and other things which sound like they would be useful)
  2. Complete the removal work of Remove logic to activate and deactivate KedroSession #1431 since there should hopefully be no need to access the session any more
  3. Provide a set of pyspark hooks in kedro to replace the custom context method. These would use the new after_*_created hook and be activated in settings.py with something like from kedro.extras.hooks.pyspark (could also be done in some other package like kedro-pyspark that a user pip installs but probably that overcomplicates it... possibly the mooted kedro-extras for datasets etc. would be the right place for it though?)
  4. Remove/simplify pyspark starters, documentation, etc.
  5. Add session_store (assuming that's sufficient rather than the whole session) to the after_*_created hook to enable kedro viz to track node run status etc.

Best of all, this is all non-breaking I believe.

@noklam
Copy link
Contributor

noklam commented Apr 20, 2022

Adding one minor item in case I forgot.

  • Bringing back the @Property config_loader, probably similar to what we do with catalog.

@Galileo-Galilei
Copy link
Member

Hi @AntonyMilneQB and @noklam,

I really like the direction of the developments, but I still have a few concerns / comments:

Add after_session_created and/or after_context_created hook that exposes context. This would provide a much better solution for the immediate problem of how to access config loader (and also provides env, credentials and other things which sound like they would be useful)

I agree, but unless I misunderstand something the current implementation in #1465 make the context a "frozen" attrs class. This has the major default of preventing hook to interoperate and force them to be enteierly independant, e.g. by passing some object which need to be reused by other hooks. A simple and common example would be the following:

  • add a mlflow hook which create a mlflow_client to interact with the database
  • make this hook a context attribute (context.mlflow_client=mlflow_client)
  • other hooks will be able to retrieve the client since they can access the context (e.g. a spark hook which want to tag something specific, or another mlflow hook which adds extra behaviour but still want to benefit to the configuraton setup made by the first hook). This relies on hooks register ordering though.

I think it is a requested feature for developers to have a way to make hooks communicate with one another. This is currently possible by recreating the entire session, so it does sound like a regression from a plugin developer point of view, even the current way does not look the right way to do this.

The simplest solution would enable to dynamically add attributes to the context, i.e. avoid making the context frozen.

Provide a set of pyspark hooks in kedro to replace the custom context method. These would use the new after_*_created hook and be activated in settings.py with something like from kedro.extras.hooks.pyspark (could also be done in some other package like kedro-pyspark that a user pip installs but probably that overcomplicates it... possibly the mooted kedro-extras for datasets etc. would be the right place for it though?)

Lim's comment ("There used to be 2 kinds of hooks: registration & life-cycle. Managing them using the same hook managers was a mistake") and the difficulties you have to locate this new hook is in line with #904 and the "engines" design patterns: registering objects is not the same as calling them during session lifecycle. kedro.extras is likely the right place for now, but it likely need some more thoughts on the design later.

@astrojuanlu
Copy link
Member

Hi everyone, I got directed to this issue by @antonymilne and, since it's a bit old, I'm trying to understand what's left to do here. The after_context_created hook is available and it passes a KedroContext around, which has a config_loader property:

https://docs.kedro.org/en/stable/kedro.framework.hooks.specs.KedroContextSpecs.html

KedroContext will be a frozen class in 0.19.0 as per #1465, and @Galileo-Galilei stated that "there is a workaround to add attributes even if the class is frozen".

@antonymilne expressed some more ideas in above but I feel they are beyond the scope of the original issue as stated by @takikadiri is complete.

Do you think we can close this one? Should we open follow-up tasks? Or am I missing something and we should keep it open?

@Galileo-Galilei
Copy link
Member

Galileo-Galilei commented Jun 14, 2023

Hi @astrojuanlu, I guess discussions arose from time to time (see slack) on various topics (credentials in hook, stateful hook for interoperability, upgrade between 0.17 and 0.18 about get_current_session() deprecation, kedro viz potential side effects (see #kedro-org/kedro-viz#1159, ...) linked to this issue and the issues MR related above. From my point of view, there is no more left on this specific issue and it should be closed, but I guess @antonymilne point was that there is some "advanced documentation" to write about the various thoughts that emerged from above discussion (and related issues / MR).

@astrojuanlu
Copy link
Member

Opened #2690 about documenting stateful hooks, and proceeding to close this issue. Thanks all who participated (here and in various discussions), and don't hesitate to open new issues if needed!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Issue: Feature Request New feature or improvement to existing feature
Projects
None yet
Development

No branches or pull requests

10 participants