Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove logic to activate and deactivate KedroSession #1431

Closed
merelcht opened this issue Apr 11, 2022 · 9 comments · Fixed by #1434
Closed

Remove logic to activate and deactivate KedroSession #1431

merelcht opened this issue Apr 11, 2022 · 9 comments · Fixed by #1434
Assignees

Comments

@merelcht
Copy link
Member

KedroSession is still keeping a global for the currently active session, which causes problems when trying to create a session in a notebook with the Kedro IPython extension and makes it unable to create/run the session. Here's the functions we need to drop:
https://github.com/kedro-org/kedro/blob/main/kedro/framework/session/session.py#L34-L47

And here is where they are used: https://github.com/kedro-org/kedro/blob/main/kedro/framework/session/session.py#L274-L287

Need to add a regression test:

  • Create a test that shows the bug: create a session, then try creating another session and see the failure
  • Then fix it in the code
  • Then this test should pass
@antonymilne
Copy link
Contributor

I think we should take care here to understand exactly why this code exists in the first place and what would be the consequences of making this change, i.e. what will behave differently after removing this code. Presumably it was originally put there for a good reason to handle some particular cases. Maybe such cases are no longer possible now we have one run session, or maybe they're so marginal that we don't care about them. But it would be good to understand the context of why this logic is here currently.

@noklam noklam self-assigned this Apr 12, 2022
@noklam
Copy link
Contributor

noklam commented Apr 12, 2022

@AntonyMilneQB

I checked the repository the _activate_session() and the global _active_session are only used by ipython.py and the KedroSession's context manager. It was also used by get_current_session() but it no longer exists now.

The activate/deactivate session mechanism doesn't look necessary to me, in fact, if you just do session.run() today it runs perfectly fine. This whole activate/deactivate thing only stops you if you are using context manager.

This will run

# Assume activate session exists but not closed
session.run() # this will run

with KedroSession.create('dummy') as new_session:
    new_session.run() # this will fail due to existence of unclosed session.

However, there is a test_nested_sessions test expecting this to fail.

@pytest.mark.usefixtures("mock_settings")
def test_nested_sessions(self, fake_project, mock_package_name):
configure_project(mock_package_name)
session1 = KedroSession.create(mock_package_name, fake_project)
session2 = KedroSession.create(mock_package_name, fake_project)
with session1:
pattern = (
"Cannot activate the session as another active session already exists"
)
with pytest.raises(RuntimeError, match=pattern), session2:
pass # pragma: no cover

It was created since Ipython extension exists. [KED-1996] Create an IPython extension for Kedro (#853) · 80e2202 · GitHub](80e2202)

@Galileo-Galilei
Copy link
Contributor

Galileo-Galilei commented Apr 12, 2022

I beg you to not make this change (i.e. not merge #1434) this before you manage to provide a solution for #1411.

Many plugins (including steel-toes, kedro-wings, kedro-neptune and kedro-mlflow(e.g. here, here and here) needs to access the config_loader inside hooks (either to access credentials or a proprietary config file). This is described with many details in issue #506.

The key idea is that we theoretically should be able to do the following :

import settings
config_loader=settings.CONFIG_LOADER_CLASS(*settings.CONFIG_LOADER_ARGS)

inside a hook to retrieve some configuration. However, we have no way to retrieve the environment (--env) used, which is a key argument to retrieve the right configuration.

Notice that none of the workaround described in #506 works with 0.18. The discussion in #1411 shows that we could be able to create a custom context which would export the variable as an environment variable, and then get it back when needed but this looks that an horrible solution and I would hate having to force kedro-mlflow users use a complicated and unintuitive setup for doing simple things which currently works automatically. It really feels like a regression since it currently works "as is".

I hope I was convincing :)

P.S: Obviously, when #1411 will be solved natively in the core library, this would make sense to get rid of this global variable.

@noklam
Copy link
Contributor

noklam commented Apr 12, 2022

@Galileo-Galilei A quick question before I finished reading all the issues. The problem is you no longer have access to session/context/config loader, which is probably already a problem since the removal of get_current_session(). What are the current workaround you are using with kedro-mlflow since I can see it supports kedro==0.18.x.

@Galileo-Galilei
Copy link
Contributor

Hi, I currently recreate the get_current_session function at the plugin level by importing the global variable. This is a terrible workaround but it has a low migration impact. If you have a better solution, I'd be glad to use it.

@antonymilne
Copy link
Contributor

@Galileo-Galilei Thanks very much for your comments here. I for one am convinced by your plea! I thought I had a workaround here since env is available in run_params in the before_pipeline_run hook, and so can be used to register an instance of the config_loader that can be used in any subsequent hooks:

class Hooks:
    @hook_impl
    def before_pipeline_run(run_params: Dict[str, Any]):
        config_loader_class = settings.CONFIG_LOADER_CLASS
        return config_loader_class(
            conf_source=str(PROJECT_PATH / settings.CONF_SOURCE),
            env=run_params["env"],
            runtime_params=run_params,
            **settings.CONFIG_LOADER_ARGS,
        )   

... but I'm not immediately sure where we can get PROJECT_PATH.

I've reopened #506. Sorry your comments on this were not noticed before (for future reference, I think many people don't notice Github notifications - if you really need something then probably best to send a message on discord so it doesn't get overlooked).

Ultimately, as I understand it, the fundamental problem here is that there has never been a good way to access the config_loader instance. Is that right? Currently it's possible to hack something together through accessing the current session.

Just to add some further context here, let me copy a discussion we had when the code around config_loader was refactored prior to 0.18. Unfortunately this was on an internal repo before we had joined the LF (link for people who do have access).

What we should deprecate though, would be context.config_loader as the component should be loaded from settings directly (in the KedroSession code and anywhere else we might be using it). A question I had around that was for the case when that property was a useful thing to work with in a jupyter notebook. settings.CONFIG_LOADER_CLASS won't give them the same as context.config_loader, so idk if there should be:
a) a method on the KedroSession object (and call session.get_config_loader() or
b) somewhere else (which they can import)
c) some magic that makes it importable as an instantiated class from kedro.framework.project
d) nothing, people should build it manually
e) something else I'm missing

There was some user research conducted around whether anyone used context.config_loader and it was found that no one did and instead, if required, users could make a new instance themselves through calling config_loader = settings.CONFIG_LOADER_CLASS(*settings.CONFIG_LOADER_ARGS) like you suggest. i.e. option d from the above list.

However, it seems that the very important case of plugin authors was not considered here. I wish we had seen #506 at the time, because that makes it clear that it would be valuable to have the config_loader instance available (rather than just the class and args given in settings.py, which is clearly not sufficient since it doesn't give env).

Do you think one of the other options (a-c) makes sense here? If access to the session is removed, option a would of course not be possible.

@noklam
Copy link
Contributor

noklam commented Apr 14, 2022

In kedro-mlflow, there are 2 places where session is needed currently. Case 2 will be addressed by providing config_loader, Case 1 will probably need the after_catalog_created hook to retrieve the conf_cred instead.

PROJECT_PATH is only stored inside session's store I am afraid.

def _export_credentials(self, session: KedroSession = None):
        session = session or _get_current_session()
        context = session.load_context()
        conf_creds = context._get_config_credentials()
        mlflow_creds = conf_creds.get(self.server.credentials, {})
        for key, value in mlflow_creds.items():
            os.environ[key] = value
def get_mlflow_config(session: Optional[KedroSession] = None):
    session = session or _get_current_session()
    context = session.load_context()
    try:
        conf_mlflow_yml = context._config_loader.get("mlflow*", "mlflow*/**")
    except MissingConfigException:
        raise KedroMlflowConfigError(
            "No 'mlflow.yml' config file found in environment. Use ``kedro mlflow init`` command in CLI to create a default config file."
        )
    conf_mlflow_yml["project_path"] = context.project_path
    mlflow_config = KedroMlflowConfig.parse_obj(conf_mlflow_yml)
    return mlflow_config

@Galileo-Galilei
Copy link
Contributor

Galileo-Galilei commented Apr 14, 2022

Thank you both for your answers and the time you dedicated to find out a solution for this.

Comments on your answers

@AntonyMilneQB wrote: I thought I had a workaround here since env is available in run_params in the ``before_pipeline_run hook, and so can be used to register an instance of the config_loader that can be used in any subsequent hooks:

I actually found the exact same things and I was about to write that it it seems to work, since project_path is available in run_params

This solves the problem for before_pipeline_run and after_pipeline_run hooks, but not for the other hooks.

@AntonyMilneQB wrote: Ultimately, as I understand it, the fundamental problem here is that there has never been a good way to access the config_loader instance. Is that right? Currently it's possible to hack something together through accessing the current session.

I almost agree. Accessing the ConfigLoader instance is the only use case I am aware of which justifies retrieving the KedroSession after its instantiation. Actually it is the original title of #506 😅 That said, and given noklam comments below yours, we may want to retrieve credentials too. there is currently a possibility to do it with the context._get_config_credentials() utility, but we can do this directly through the config loader so this is not a big deal to not have a specific management for credentials.

Functional use cases

The 2 identified use cases where the context is needed are summarized by @noklam above:

  • accessing a proprietary config file
  • accessing credentials

I think the problem depends more on where we want to access these informations rather than what we want to do with them:

  • access inside a node: I think you have made your point very clearly that nodes should never be aware of I/O operations. It is possible for now to access the session, but I think it is totally ok to deprecate it and nudge users who have potentially done this before to a more kedronic way to achieve what they want. => no real issue here
  • access inside a notebook or a script: the goal is often to "setup" a bunch of operations before manually running the pipeline (usually these operations are done in the before_pipeline_run hook but it may be skipped while running nodes step by step). In this situation, you have created the session manually so you have the information about the session, the environment and the project_apth, so you are able to pass them to the plugin if needed, e.g. :
with KedroSession.create(project_path, env) as session: 
    get_mlflow_config(session) # <- you don't need an active session, the user can pass the session (or the needed information) "manually"

=> no real issue here

  • accessing inside a custom CLI command: I have also seen user creating manually a session when creating a custom CLI to change the running logic, e.g. in kedro-kubeflow plugin. this is the same situation than in a notebook: since the user is creating the session manually, he is aware of all the informations (env, project path, entire session object) and can access/recreate the config loader (it does not need to look for a "potentially activated session") => no real issue here
  • accessing the config_loader instance inside a hook: in this situation, you need to access the env of the session you are called from and this is currently not possible (which is what is at stake here). It seems from above discussion that project_path and env are available for before_pipeline_run and after_pipeline_run => The issue only arise for other hooks.

Potential solutions

  1. It seems quite easy to propagate the env and project_path variables to all hooks, but it requires to change their signature which is a breaking change 💥
  2. If we want 1, maybe it is juste easier to propagate the config loader in all hooks specs. I don't really what is the benefit of recreating it in the hooks. Unfortunately it requires to change hooks signature which is a breaking change 💥
  3. If we want to avoid breaking changes, a workaround would be to make user store the config_loader instance in the before_pipeline_run hook as an attribute (provided we can access the config loader in before_pipeline_run) to reuse it later:
class MyHook():
    @hook_impl
    def before_pipeline_run(
        self, run_params: Dict[str, Any], pipeline: Pipeline, catalog: DataCatalog
    ) -> None:
        
        self._config_loader= settings.CONFIG_LOADER_CLASS(
            conf_source=(Path(run_params["project_path"]) / settings.CONF_SOURCE).as_posix(),
            env=run_params["env"],
            runtime_params=run_params,
            **settings.CONFIG_LOADER_ARGS,
        )   
    @hook_impl
    def before_node_run(
        self, node: Node, catalog: DataCatalog, inputs: Dict[str, Any], is_async: bool
    ) -> None:
       self._config_loader.get("whatever") # self._config_loader is accessible

What we should deprecate though, would be context.config_loader as the component should be loaded from settings directly (in the KedroSession code and anywhere else we might be using it). A question I had around that was for the case when that property was a useful thing to work with in a jupyter notebook. settings.CONFIG_LOADER_CLASS won't give them the same as context.config_loader, so idk if there should be:
a) a method on the KedroSession object (and call session.get_config_loader() or
b) somewhere else (which they can import)
c) some magic that makes it importable as an instantiated class from kedro.framework.project
d) nothing, people should build it manually
e) something else I'm missing

Finally, I think that:

  • a) is not really necessary, since when the session is accessible it is somehow already trivial to access the config_loader. This is already ok for notebook use.
  • b) c) would make above use case possible, but it is likely that you would replace a global variable (session) by another (env or config_loader) so I am not sure if it is an improvement. The real issue is that the config_loader needs an information of the session object without communicating with it. This is only possible through a global / environment variable and I don't think this should be something completely generic. The issue above arise only when you want to interact within the session itself, e.g in nodes (=not supported) or hooks (=see above solution).
  • d) For all other use cases, d) seems to be the right choice indeed.

What to do

I will give a try in kedro-mlflow to solution 3. While writing this answer I just found out that it may already works as is 😅

@antonymilne
Copy link
Contributor

antonymilne commented Apr 14, 2022

Thanks very much @Galileo-Galilei, that all makes a lot of sense and is very useful. What you're trying to do in MyHooks by saving the self._config_loader so that it's accessible in all subsequent hooks is actually exactly what I meant to write in my suggestion above, just I accidentally put return instead 😅

I think this should work well actually (especially because you found project_path is in fact available also). This general pattern is something I've often suggested when the something you want isn't available in the hook specification but was available in a previously executed hook:

class Hooks:
    @hook_impl
    def before_pipeline_run(self, something):
        self._something = something
    
    @hook_impl
    def before_node_run(self): 
        do_stuff(self._something)

I'm not sure how much I like this, but it does work well. On the one hand, doing it a lot probably indicates that we're not making the right things available in the right hooks; but it does mean that we don't have to pass something to every single hook, so the hook specs don't become too bloated.

I still regard this as a bit of a workaround because requiring the user/plugin author to do settings.CONFIG_LOADER_CLASS(...) just doesn't look that nice to me compared to just making the config_loader instance available.

I left a couple of questions in #506 about what we should actually make available in the hooks 🙂

P.S. unless I'm missing something, adding an argument to a hook specification shouldn't be a breaking change since I believe you need to always write hook_impl using keyword rather than positional arguments?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants