Remove logic to activate and deactivate `KedroSession` #1431

merelcht · 2022-04-11T13:30:25Z

KedroSession is still keeping a global for the currently active session, which causes problems when trying to create a session in a notebook with the Kedro IPython extension and makes it unable to create/run the session. Here's the functions we need to drop:
https://github.com/kedro-org/kedro/blob/main/kedro/framework/session/session.py#L34-L47

And here is where they are used: https://github.com/kedro-org/kedro/blob/main/kedro/framework/session/session.py#L274-L287

Need to add a regression test:

Create a test that shows the bug: create a session, then try creating another session and see the failure
Then fix it in the code
Then this test should pass

antonymilne · 2022-04-11T13:56:19Z

I think we should take care here to understand exactly why this code exists in the first place and what would be the consequences of making this change, i.e. what will behave differently after removing this code. Presumably it was originally put there for a good reason to handle some particular cases. Maybe such cases are no longer possible now we have one run session, or maybe they're so marginal that we don't care about them. But it would be good to understand the context of why this logic is here currently.

noklam · 2022-04-12T11:09:42Z

@AntonyMilneQB

I checked the repository the _activate_session() and the global _active_session are only used by ipython.py and the KedroSession's context manager. It was also used by get_current_session() but it no longer exists now.

The activate/deactivate session mechanism doesn't look necessary to me, in fact, if you just do session.run() today it runs perfectly fine. This whole activate/deactivate thing only stops you if you are using context manager.

This will run

# Assume activate session exists but not closed
session.run() # this will run

with KedroSession.create('dummy') as new_session:
    new_session.run() # this will fail due to existence of unclosed session.

However, there is a test_nested_sessions test expecting this to fail.

kedro/tests/framework/session/test_session.py

Lines 521 to 532 in 307783b

    
           @pytest.mark.usefixtures("mock_settings") 
        
           def test_nested_sessions(self, fake_project, mock_package_name): 
        
               configure_project(mock_package_name) 
        
               session1 = KedroSession.create(mock_package_name, fake_project) 
        
               session2 = KedroSession.create(mock_package_name, fake_project) 
        
               with session1: 
        
                   pattern = ( 
        
                       "Cannot activate the session as another active session already exists" 
        
                   ) 
        
                   with pytest.raises(RuntimeError, match=pattern), session2: 
        
                       pass  # pragma: no cover

It was created since Ipython extension exists. [KED-1996] Create an IPython extension for Kedro (#853) · 80e2202 · GitHub](80e2202)

Galileo-Galilei · 2022-04-12T20:19:06Z

I beg you to not make this change (i.e. not merge #1434) this before you manage to provide a solution for #1411.

Many plugins (including steel-toes, kedro-wings, kedro-neptune and kedro-mlflow(e.g. here, here and here) needs to access the config_loader inside hooks (either to access credentials or a proprietary config file). This is described with many details in issue #506.

The key idea is that we theoretically should be able to do the following :

import settings
config_loader=settings.CONFIG_LOADER_CLASS(*settings.CONFIG_LOADER_ARGS)

inside a hook to retrieve some configuration. However, we have no way to retrieve the environment (--env) used, which is a key argument to retrieve the right configuration.

Notice that none of the workaround described in #506 works with 0.18. The discussion in #1411 shows that we could be able to create a custom context which would export the variable as an environment variable, and then get it back when needed but this looks that an horrible solution and I would hate having to force kedro-mlflow users use a complicated and unintuitive setup for doing simple things which currently works automatically. It really feels like a regression since it currently works "as is".

I hope I was convincing :)

P.S: Obviously, when #1411 will be solved natively in the core library, this would make sense to get rid of this global variable.

noklam · 2022-04-12T21:12:43Z

@Galileo-Galilei A quick question before I finished reading all the issues. The problem is you no longer have access to session/context/config loader, which is probably already a problem since the removal of get_current_session(). What are the current workaround you are using with kedro-mlflow since I can see it supports kedro==0.18.x.

Galileo-Galilei · 2022-04-13T11:33:06Z

Hi, I currently recreate the get_current_session function at the plugin level by importing the global variable. This is a terrible workaround but it has a low migration impact. If you have a better solution, I'd be glad to use it.

antonymilne · 2022-04-14T10:32:33Z

@Galileo-Galilei Thanks very much for your comments here. I for one am convinced by your plea! I thought I had a workaround here since env is available in run_params in the before_pipeline_run hook, and so can be used to register an instance of the config_loader that can be used in any subsequent hooks:

class Hooks:
    @hook_impl
    def before_pipeline_run(run_params: Dict[str, Any]):
        config_loader_class = settings.CONFIG_LOADER_CLASS
        return config_loader_class(
            conf_source=str(PROJECT_PATH / settings.CONF_SOURCE),
            env=run_params["env"],
            runtime_params=run_params,
            **settings.CONFIG_LOADER_ARGS,
        )

... but I'm not immediately sure where we can get PROJECT_PATH.

I've reopened #506. Sorry your comments on this were not noticed before (for future reference, I think many people don't notice Github notifications - if you really need something then probably best to send a message on discord so it doesn't get overlooked).

Ultimately, as I understand it, the fundamental problem here is that there has never been a good way to access the config_loader instance. Is that right? Currently it's possible to hack something together through accessing the current session.

Just to add some further context here, let me copy a discussion we had when the code around config_loader was refactored prior to 0.18. Unfortunately this was on an internal repo before we had joined the LF (link for people who do have access).

What we should deprecate though, would be context.config_loader as the component should be loaded from settings directly (in the KedroSession code and anywhere else we might be using it). A question I had around that was for the case when that property was a useful thing to work with in a jupyter notebook. settings.CONFIG_LOADER_CLASS won't give them the same as context.config_loader, so idk if there should be:
a) a method on the KedroSession object (and call session.get_config_loader() or
b) somewhere else (which they can import)
c) some magic that makes it importable as an instantiated class from kedro.framework.project
d) nothing, people should build it manually
e) something else I'm missing

There was some user research conducted around whether anyone used context.config_loader and it was found that no one did and instead, if required, users could make a new instance themselves through calling config_loader = settings.CONFIG_LOADER_CLASS(*settings.CONFIG_LOADER_ARGS) like you suggest. i.e. option d from the above list.

However, it seems that the very important case of plugin authors was not considered here. I wish we had seen #506 at the time, because that makes it clear that it would be valuable to have the config_loader instance available (rather than just the class and args given in settings.py, which is clearly not sufficient since it doesn't give env).

Do you think one of the other options (a-c) makes sense here? If access to the session is removed, option a would of course not be possible.

noklam · 2022-04-14T10:59:28Z

In kedro-mlflow, there are 2 places where session is needed currently. Case 2 will be addressed by providing config_loader, Case 1 will probably need the after_catalog_created hook to retrieve the conf_cred instead.

PROJECT_PATH is only stored inside session's store I am afraid.

def _export_credentials(self, session: KedroSession = None):
        session = session or _get_current_session()
        context = session.load_context()
        conf_creds = context._get_config_credentials()
        mlflow_creds = conf_creds.get(self.server.credentials, {})
        for key, value in mlflow_creds.items():
            os.environ[key] = value

def get_mlflow_config(session: Optional[KedroSession] = None):
    session = session or _get_current_session()
    context = session.load_context()
    try:
        conf_mlflow_yml = context._config_loader.get("mlflow*", "mlflow*/**")
    except MissingConfigException:
        raise KedroMlflowConfigError(
            "No 'mlflow.yml' config file found in environment. Use ``kedro mlflow init`` command in CLI to create a default config file."
        )
    conf_mlflow_yml["project_path"] = context.project_path
    mlflow_config = KedroMlflowConfig.parse_obj(conf_mlflow_yml)
    return mlflow_config

Galileo-Galilei · 2022-04-14T14:02:08Z

Thank you both for your answers and the time you dedicated to find out a solution for this.

Comments on your answers

@AntonyMilneQB wrote: I thought I had a workaround here since env is available in run_params in the ``before_pipeline_run hook, and so can be used to register an instance of the config_loader that can be used in any subsequent hooks:

I actually found the exact same things and I was about to write that it it seems to work, since project_path is available in run_params

This solves the problem for before_pipeline_run and after_pipeline_run hooks, but not for the other hooks.

@AntonyMilneQB wrote: Ultimately, as I understand it, the fundamental problem here is that there has never been a good way to access the config_loader instance. Is that right? Currently it's possible to hack something together through accessing the current session.

I almost agree. Accessing the ConfigLoader instance is the only use case I am aware of which justifies retrieving the KedroSession after its instantiation. Actually it is the original title of #506 😅 That said, and given noklam comments below yours, we may want to retrieve credentials too. there is currently a possibility to do it with the context._get_config_credentials() utility, but we can do this directly through the config loader so this is not a big deal to not have a specific management for credentials.

Functional use cases

The 2 identified use cases where the context is needed are summarized by @noklam above:

accessing a proprietary config file
accessing credentials

I think the problem depends more on where we want to access these informations rather than what we want to do with them:

access inside a node: I think you have made your point very clearly that nodes should never be aware of I/O operations. It is possible for now to access the session, but I think it is totally ok to deprecate it and nudge users who have potentially done this before to a more kedronic way to achieve what they want. => no real issue here
access inside a notebook or a script: the goal is often to "setup" a bunch of operations before manually running the pipeline (usually these operations are done in the before_pipeline_run hook but it may be skipped while running nodes step by step). In this situation, you have created the session manually so you have the information about the session, the environment and the project_apth, so you are able to pass them to the plugin if needed, e.g. :

with KedroSession.create(project_path, env) as session: 
    get_mlflow_config(session) # <- you don't need an active session, the user can pass the session (or the needed information) "manually"

=> no real issue here

accessing inside a custom CLI command: I have also seen user creating manually a session when creating a custom CLI to change the running logic, e.g. in kedro-kubeflow plugin. this is the same situation than in a notebook: since the user is creating the session manually, he is aware of all the informations (env, project path, entire session object) and can access/recreate the config loader (it does not need to look for a "potentially activated session") => no real issue here
accessing the config_loader instance inside a hook: in this situation, you need to access the env of the session you are called from and this is currently not possible (which is what is at stake here). It seems from above discussion that project_path and env are available for before_pipeline_run and after_pipeline_run => The issue only arise for other hooks.

Potential solutions

It seems quite easy to propagate the env and project_path variables to all hooks, but it requires to change their signature which is a breaking change 💥
If we want 1, maybe it is juste easier to propagate the config loader in all hooks specs. I don't really what is the benefit of recreating it in the hooks. Unfortunately it requires to change hooks signature which is a breaking change 💥
If we want to avoid breaking changes, a workaround would be to make user store the config_loader instance in the before_pipeline_run hook as an attribute (provided we can access the config loader in before_pipeline_run) to reuse it later:

class MyHook():
    @hook_impl
    def before_pipeline_run(
        self, run_params: Dict[str, Any], pipeline: Pipeline, catalog: DataCatalog
    ) -> None:
        
        self._config_loader= settings.CONFIG_LOADER_CLASS(
            conf_source=(Path(run_params["project_path"]) / settings.CONF_SOURCE).as_posix(),
            env=run_params["env"],
            runtime_params=run_params,
            **settings.CONFIG_LOADER_ARGS,
        )   
    @hook_impl
    def before_node_run(
        self, node: Node, catalog: DataCatalog, inputs: Dict[str, Any], is_async: bool
    ) -> None:
       self._config_loader.get("whatever") # self._config_loader is accessible

What we should deprecate though, would be context.config_loader as the component should be loaded from settings directly (in the KedroSession code and anywhere else we might be using it). A question I had around that was for the case when that property was a useful thing to work with in a jupyter notebook. settings.CONFIG_LOADER_CLASS won't give them the same as context.config_loader, so idk if there should be:
a) a method on the KedroSession object (and call session.get_config_loader() or
b) somewhere else (which they can import)
c) some magic that makes it importable as an instantiated class from kedro.framework.project
d) nothing, people should build it manually
e) something else I'm missing

Finally, I think that:

a) is not really necessary, since when the session is accessible it is somehow already trivial to access the config_loader. This is already ok for notebook use.
b) c) would make above use case possible, but it is likely that you would replace a global variable (session) by another (env or config_loader) so I am not sure if it is an improvement. The real issue is that the config_loader needs an information of the session object without communicating with it. This is only possible through a global / environment variable and I don't think this should be something completely generic. The issue above arise only when you want to interact within the session itself, e.g in nodes (=not supported) or hooks (=see above solution).
d) For all other use cases, d) seems to be the right choice indeed.

What to do

I will give a try in kedro-mlflow to solution 3. While writing this answer I just found out that it may already works as is 😅

antonymilne · 2022-04-14T14:18:08Z

Thanks very much @Galileo-Galilei, that all makes a lot of sense and is very useful. What you're trying to do in MyHooks by saving the self._config_loader so that it's accessible in all subsequent hooks is actually exactly what I meant to write in my suggestion above, just I accidentally put return instead 😅

I think this should work well actually (especially because you found project_path is in fact available also). This general pattern is something I've often suggested when the something you want isn't available in the hook specification but was available in a previously executed hook:

class Hooks:
    @hook_impl
    def before_pipeline_run(self, something):
        self._something = something
    
    @hook_impl
    def before_node_run(self): 
        do_stuff(self._something)

I'm not sure how much I like this, but it does work well. On the one hand, doing it a lot probably indicates that we're not making the right things available in the right hooks; but it does mean that we don't have to pass something to every single hook, so the hook specs don't become too bloated.

I still regard this as a bit of a workaround because requiring the user/plugin author to do settings.CONFIG_LOADER_CLASS(...) just doesn't look that nice to me compared to just making the config_loader instance available.

I left a couple of questions in #506 about what we should actually make available in the hooks 🙂

P.S. unless I'm missing something, adding an argument to a hook specification shouldn't be a breaking change since I believe you need to always write hook_impl using keyword rather than positional arguments?

noklam self-assigned this Apr 12, 2022

noklam mentioned this issue Apr 12, 2022

Remove KedroSession activate/deactivate mechanism #1434

Merged

5 tasks

antonymilne mentioned this issue Apr 19, 2022

[KED-2143] Adding a ConfigLoader instance into hook specs params #506

Closed

noklam mentioned this issue Apr 20, 2022

Adding a new after_context_created hook into Kedro's hooks collection #1458

Closed

4 tasks

noklam closed this as completed in #1434 May 9, 2022

Galileo-Galilei mentioned this issue Nov 8, 2022

Kedro-viz needs too much to run properly kedro-org/kedro-viz#1159

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove logic to activate and deactivate `KedroSession` #1431

Remove logic to activate and deactivate `KedroSession` #1431

merelcht commented Apr 11, 2022

antonymilne commented Apr 11, 2022

noklam commented Apr 12, 2022 •

edited

Loading

Galileo-Galilei commented Apr 12, 2022 •

edited

Loading

noklam commented Apr 12, 2022 •

edited

Loading

Galileo-Galilei commented Apr 13, 2022

antonymilne commented Apr 14, 2022

noklam commented Apr 14, 2022 •

edited

Loading

Galileo-Galilei commented Apr 14, 2022 •

edited

Loading

antonymilne commented Apr 14, 2022 •

edited

Loading

Remove logic to activate and deactivate KedroSession #1431

Remove logic to activate and deactivate KedroSession #1431

Comments

merelcht commented Apr 11, 2022

antonymilne commented Apr 11, 2022

noklam commented Apr 12, 2022 • edited Loading

Galileo-Galilei commented Apr 12, 2022 • edited Loading

noklam commented Apr 12, 2022 • edited Loading

Galileo-Galilei commented Apr 13, 2022

antonymilne commented Apr 14, 2022

noklam commented Apr 14, 2022 • edited Loading

Galileo-Galilei commented Apr 14, 2022 • edited Loading

Comments on your answers

Functional use cases

Potential solutions

What to do

antonymilne commented Apr 14, 2022 • edited Loading

Remove logic to activate and deactivate `KedroSession` #1431

Remove logic to activate and deactivate `KedroSession` #1431

noklam commented Apr 12, 2022 •

edited

Loading

Galileo-Galilei commented Apr 12, 2022 •

edited

Loading

noklam commented Apr 12, 2022 •

edited

Loading

noklam commented Apr 14, 2022 •

edited

Loading

Galileo-Galilei commented Apr 14, 2022 •

edited

Loading

antonymilne commented Apr 14, 2022 •

edited

Loading