Allow injecting data into a KedroSession run #2169
The other potentially tangential user issue is that when users do go down the dynamic catalog route, they need to do it in two places.

Despite our preference for avoiding dynamic pipelines, enough users ask for this that I think we need to come up with some way that DRY can be achieved, or at the very least so that you only have to define this in one place.
Tech design 01/02/23:

Proposed implementation 1:

Proposed implementation 2:

Data that the user provides overrides any catalog definition with the same name. The naming of any additional parameter has implications for breaking changes and should be considered carefully.
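As a rough illustration of what such an API could look like — this is a hypothetical sketch, not an existing Kedro API; the `inputs` keyword name is invented, and the naming concern above applies directly to it:

```python
# Hypothetical sketch only: `inputs` is NOT an existing KedroSession.run
# parameter. It illustrates "data the user provides overrides any catalog
# definition with the same name".
from kedro.framework.session import KedroSession

preloaded_model = object()     # stands in for an artifact owned by the host app
request_payload = {"x": 1.0}   # stands in for data arriving with an HTTP request

with KedroSession.create(project_path=".") as session:
    session.run(
        pipeline_name="inference",
        inputs={"model": preloaded_model, "data": request_payload},  # hypothetical kwarg
    )
```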
Just off the top of my head: currently most of the ...
Looking through the code, it seems that Implementation 1 ...

Implementation 2 ...
@noklam note that ...
At our organisation we often needed to run a kedro session from outside kedro, for example from a Streamlit app or a REST API. For now, we haven't managed to do that; instead, we run the web app server inside of kedro, more specifically inside a custom kedro runner. Here are our current reasons for doing that:
We run the web app server inside kedro by using the runner abstraction. Our CLI instantiates our custom runners using a runner.yml before giving them to the kedro session. This gives us a way to embed a whole range of configurable applications inside kedro (web- and batch-oriented); those apps dictate how and when we run the pipeline, and with which elements of the catalog. We managed to get <100 ms latency with this setup. By the way, this pattern was recently advised for running kedro pipelines in Dask.

I believe that introducing the runner API (with a sort of runner.yml) will open up a whole new kind of plugins/runners that could be proposed by the community (a REST API that serves kedro pipelines, a Spark RDD runner, a Dask runner, ...). We already advocate this pattern with my colleague here. Even IncrementalDataset and PartitionedDataSet logic could be implemented more naturally in a runner.

I understand that this way of instrumenting kedro runs is not generalizable to all situations (the DataRobot app, as mentioned here). We are interested to see more progress in making kedro more integrable into other applications.
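To make the runner pattern described above concrete, here is a minimal sketch. `WebServerRunner` and `serve_forever` are illustrative names, not Kedro API, and `AbstractRunner`'s exact abstract methods vary across Kedro versions:

```python
from kedro.io import DataCatalog, MemoryDataSet
from kedro.pipeline import Pipeline
from kedro.runner import AbstractRunner


def serve_forever(pipeline: Pipeline, catalog: DataCatalog, host: str, port: int) -> None:
    """Placeholder for the embedded web app; see the per-request sketch below."""
    raise NotImplementedError


class WebServerRunner(AbstractRunner):
    """A runner that starts a web server instead of executing the pipeline once."""

    def __init__(self, host: str = "0.0.0.0", port: int = 8080):
        super().__init__()
        self.host = host
        self.port = port

    def create_default_data_set(self, ds_name: str) -> MemoryDataSet:
        # Keep unregistered datasets in memory so the embedded app can reuse them.
        return MemoryDataSet()

    def _run(self, pipeline: Pipeline, catalog: DataCatalog, *args, **kwargs) -> None:
        # The pipeline and catalog are handed to the web app, which decides
        # how and when to run the pipeline (once per HTTP request).
        serve_forever(pipeline, catalog, self.host, self.port)
```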
@takikadiri Thank you for your feedback, this is very interesting! There is an idea about reducing the overhead of KedroSession creation, which is more relevant for web applications. It might be similar to your point.

Can I get some clarification on this point? How does embedding a web server inside a KedroSession help with this? Do these HTTP calls trigger a new pipeline run, or do they simply fetch some data from the catalog and return something?
@noklam the embedded web app starts inside a runner. At that time, the KedroSession has already been created, and the context, pipeline and catalog have also already been created. Inside the runner, the web app has access to the pipeline and the catalog and decides how to use them. At each HTTP call, the web app replaces a given input dataset from the catalog with a MemoryDataSet that contains the request body, and runs the pipeline using a standard runner (a sequential runner, for example). The HTTP response is built from a given output dataset (which can be inferred if it is the only free output of the pipeline). All the free input datasets are materialized as MemoryDataSets at web app init time, except the input dataset.
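A rough sketch of those per-request mechanics, using public Kedro APIs (`DataCatalog.add` with `replace=True`, `SequentialRunner`). The dataset names are illustrative, and depending on the Kedro version `runner.run` may also require a `hook_manager` argument:

```python
from kedro.io import DataCatalog, MemoryDataSet
from kedro.pipeline import Pipeline
from kedro.runner import SequentialRunner


def handle_request(pipeline: Pipeline, catalog: DataCatalog, request_body: dict):
    # Replace the designated input dataset with the HTTP request body ...
    catalog.add("http_request_body", MemoryDataSet(request_body), replace=True)
    # ... run the pipeline with a standard runner ...
    free_outputs = SequentialRunner().run(pipeline, catalog)
    # ... and build the HTTP response from the designated output dataset
    # (free outputs come back from run(); registered ones need catalog.load()).
    return free_outputs["http_response"]
```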
@takikadiri The main job of the ...
@noklam yes exactly, the WebServerRunner doesn't run the pipeline but launches a web server, and alters the pipeline and catalog at each HTTP call. The input and output datasets are defined in the catalog. We use a runner.yml to define the args of our custom runners. Here is an example of the runner.yml:
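The original example was not preserved in this thread; the following is a reconstruction of what such a runner.yml could look like, with an entirely assumed schema:

```yaml
# Hypothetical runner.yml — keys and structure are assumptions, not Kedro config.
web_server_runner:
  type: my_project.runners.WebServerRunner   # import path of the custom runner
  host: 0.0.0.0
  port: 8080

batch_runner:
  type: kedro.runner.SequentialRunner
```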
The runners are then used this way: our custom CLI instantiates the selected runner using the runner.yml. Today, the native kedro CLI does not support runners that need args for their initialization.

With this design, our data scientists develop one pipeline that can be used as a batch or as a service, with a simple runner selection. At the same time, it separates responsibilities: we don't want data scientists to develop the web app (for performance, reliability and security reasons), but they do have a YAML API to declaratively define their web app.

We start the web server in a runner and not in the CLI because we want one generic CLI for all our kedro apps. When deploying our kedro apps, the platforms that run our apps run them as python packages with ...

I hope this helps; we welcome any feedback on this design, and we're looking forward to seeing kedro become more integrable with other services.
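A sketch of how such a custom CLI might wire this up, assuming the hypothetical runner.yml schema above; `load_obj` and `session.run(runner=...)` are real Kedro APIs, the rest is illustrative:

```python
import yaml

from kedro.framework.session import KedroSession
from kedro.utils import load_obj


def run_with_runner(runner_name: str, project_path: str = ".") -> None:
    # Read the (assumed) runner.yml and pick the selected runner's spec.
    with open("conf/base/runner.yml") as f:
        spec = yaml.safe_load(f)[runner_name]
    # Instantiate the runner with the init args the native CLI cannot pass today.
    runner_class = load_obj(spec.pop("type"))
    runner = runner_class(**spec)
    with KedroSession.create(project_path=project_path) as session:
        session.run(runner=runner)
```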
This is all very neat. I like that it mitigates the cost of session creation and uses the Kedro CLI consistently as an entry point. Initially, I found having a runner that launches a web server but doesn't run a pipeline a bit weird, but I understand this could be the best solution to improve performance. I am a bit nervous that it starts to feel highly coupled with Kedro. Conceptually it's more natural to launch a web server and run a Kedro pipeline, but I need more time to think about it.

How would a batch pipeline be different from a service pipeline? I imagine they will be different, as the service pipeline has at least two more nodes/datasets for the HTTP request body & response. How would the web server change the pipeline, does the parameter ...
It is indeed highly coupled with kedro, for better or worse :) It can be OK to embed a thin web app that somehow extends kedro (serves a pipeline), but I agree that it can feel a bit weird when the web app has a range of features that have nothing to do with kedro but needs kedro for just one specific feature (a DataRobot app or a Streamlit app, for example).

The batch pipeline and service pipeline are the same. While waiting for kedro to become more integrable with other applications, we'll keep using it as the entry point that runs our thin apps and manages their lifecycle, and above all gives us great abstractions for authoring business logic code.
Thanks for explaining your thinking so clearly, @takikadiri. We built a version of this in the early days of kedro and eventually deprecated it since it wasn't being used! This is super validating that sophisticated users like yourselves need this, and now is the time for us to think about the design we need to solve this problem space. Please keep in touch and let us know what you learn, as it will steer how we build this!
I agree with this, especially after our discussion about OmegaConfigLoader and #2530.
So I took this thinking further by trying to solve this problem space (integrating a kedro project into a larger app/system, dynamic catalog, multiple runs without session initialization overhead, injecting data into the session, ...). I propose a solution called kedro-boot.
Hey @takikadiri, thanks a lot for the detailed writeup, really appreciated ❤️ I think it intersects with #143 and possibly other issues. See also the long series of issues opened by @Galileo-Galilei (#770, #904, #1041), which may or may not overlap with these ideas too. We are now focusing on the upcoming 0.19.0 release, so it will take us some time to give this a proper read, but rest assured we will do it when we are ready.
This issue, among others, was mentioned in the 2024-02-14 tech design session, in which @Galileo-Galilei and @takikadiri showed kedro-boot (https://github.com/takikadiri/kedro-boot). There was agreement that implementing this, plus making the ...
xref #3540
This issue was mentioned again, among others, in the 2024-05-22 tech design session, in which @ankatiyar walked the team through the adoption of the TaskFlow API in ...
Description

Currently there's no easy way to inject data into a KedroSession interactively. The only way data is loaded is through the DataCatalog. This makes it hard to embed a Kedro project into an existing web server or a tool like DataRobot, where the entrypoint is owned by another application and the initial data is provided preloaded (see https://github.com/datarobot/datarobot-user-models/blob/master/model_templates/python3_sklearn/custom.py#L10 for example: the file custom.py is executed and the function transform already has two artifacts preloaded, the model and the data).

In an effort to make interactive sessions and potentially different deployment modes (e.g. calling a Kedro run from a FastAPI server with some data provided in an HTTP request) easier to work with, we should think of a way to allow injecting data into a KedroSession. Currently the only way to do that is through a Hook, which is clunky, not very intuitive, and also not at all documented.
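For reference, a minimal sketch of that hook-based workaround, using the real `after_catalog_created` hook spec; the artifact names follow the DataRobot example above:

```python
from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog, MemoryDataSet


class InjectDataHooks:
    """Inject preloaded objects into the catalog before any run."""

    def __init__(self, artifacts: dict):
        self.artifacts = artifacts

    @hook_impl
    def after_catalog_created(self, catalog: DataCatalog) -> None:
        # Override any catalog entry with the same name with in-memory data.
        for name, obj in self.artifacts.items():
            catalog.add(name, MemoryDataSet(obj), replace=True)


# Registered e.g. in settings.py:
# HOOKS = (InjectDataHooks({"model": model, "data": data}),)
```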