
Synthesis of research related to deployment of Kedro to modern MLOps platforms #3094

Closed
datajoely opened this issue Sep 28, 2023 · 11 comments
Labels
TD: should we? — Tech Design topic to discuss whether we should implement/solve a raised issue
TD: technical deepdive — Tech Design label that's meant for deep technical discussion on a topic. Expect deep dives into code

Comments

@datajoely
Contributor

datajoely commented Sep 28, 2023

Authored with @AlpAribal

Deploying Kedro to and integrating with MLOps Platforms

This document aims to cover the current state regarding deploying Kedro on
enterprise-grade MLOps platforms:

  • pain points observed integrating with distributed, container-based systems.
  • feedback we gathered from Kedro developers, users, and plugin developers such as GetInData.
  • learnings from implementing an MLRun-specific integration.

Common pain points

A high-level graphic (omitted here) summarised the identified problem space.

Deciding on granularity when translating to orchestrator DSL

  • A node within an orchestrator is typically an entire container.
  • There is often a significant conceptual mismatch between a single Kedro node and an orchestrator container node.
  • One needs to decide what a "node" means in the orchestrator's environment, i.e. the "granularity" of the nodes.

1:1 Mapping

This is where a single Kedro node is translated to a single orchestrator node.

  • Kedro encourages small, manageable nodes.
  • These nodes contain smaller logic units than typical orchestrator containers.
  • Distributing very small steps in orchestrators can lead to performance overhead. Consider running the pipeline in a single container mode (M:1 granularity) for efficiency.

Distributing each node also complicates the data flow between them:

  • When the pipeline is run locally, non-persisted data is passed around as MemoryDatasets.
  • When each step runs in isolation, this feature is lost, and most implementations require every intermediate dataset to be persisted. See this section for more details.

Currently, most deployment plugins use 1:1 mapping and hence are impacted by these drawbacks.

M:1 Mapping

This is where the whole Kedro pipeline is run as a single node on the target platform.

  • The main benefit is simplicity: One job goes to the orchestrator, executed on a single machine.

  • However, there are inefficiencies:

    1. Large setup: Setting up the single node to execute all tasks involves creating an environment, handling potentially conflicting requirements, and more.
    2. Limited parallelization: This approach often underutilizes available compute resources.

All dependencies need to be compatible in this configuration; see
Requirements management in large Kedro projects
for more details.

M:N Mapping

This is where the full pipeline is divided into a set of sub-pipelines that can be run separately. Today, there is no obvious way to do this.

This approach provides a middle ground that addresses shortcomings of both the 1:1 and M:1 mappings:

  • Small node groups form large buckets of work that justify the overhead of creating an execution environment.
  • The orchestrator is free to schedule the sub-pipelines to be run in parallel / isolation.

Kedro is a fast, iterative development tool largely because the user is not required to think about execution contexts. This unmanaged complexity is why it is difficult to resolve this granularity mismatch in production contexts.

Piecemeal localised conventions for describing M:N granularity have emerged across mature users:

| Convention | Merits | Drawbacks |
| --- | --- | --- |
| Node tags | Simple to use, CLI accessible, applies across pipelines | No bounded context, zero validation |
| Registered pipelines | Simple to use, conceptually maps to sub-pipelines, CLI accessible | No bounded context, zero validation |
| Pipeline namespaces | Bounded context, CLI accessible, visualisation integration | Harder to use, confusing error messages, verbose catalog¹ |

Each of these has merits and drawbacks. In every case, the user is given no easy way to validate if these groups are mutually exclusive or collectively exhaustive.

Despite the namespace option being the most robust approach available (since v0.16.x), none of these conventions is in wide use across our power-user base. There are several hypotheses for the low adoption rate:

| Hypothesis area | Comments |
| --- | --- |
| Confusing feature space | namespaces != modular pipelines != micropackaging. Overlapping features, all unrelated to deployment, obscure the value for the user. Today, namespaces are primarily used for visualisation and pipeline reuse, not deployment. Internal monorepo tooling now covers much of the micropackaging feature space. |
| UX | Users have reported they dislike the catalog verbosity introduced by namespaces¹. The error messages provided by Kedro when applying namespaces are unhelpful². |

¹ May be resolved by new dataset factory feature

² e.g. `Failed to map datasets and/or parameters: params:features`

Potential approaches to M:N grouping

Even for a mid-sized pipeline, it is not trivial to find the "optimum" grouping of nodes.

| Approach | Thoughts |
| --- | --- |
| Manual grouping | Pipeline developers are typically aware of broad groups (e.g. preprocessing, training). However, these groups may take a while to stabilise during development. |
| Via Kedro metadata (nodes, tags, namespaces) | See the M:N Mapping section above; each approach requires some human direction and relies on unvalidated conventions. |
| Via DAG branching | Nodes which split the pipeline graph into distinct branches can be used as sub-pipeline boundaries. This is a similar mechanism to that used by ParallelRunner and ThreadRunner. |
| Via persistence points | Nodes that persist data (i.e. nodes whose dataset type in the catalog is not MemoryDataset) become the starting nodes of new groups. The assumption is that users persist data after checkpointing meaningful work. In a theoretically perfect production system one would only persist at the very end of the pipeline. |
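The persistence-point idea can be sketched in a few lines. This is a simplified, stand-alone illustration — `Node` and the `persisted` set are stand-ins for Kedro's real node and catalog objects, and it assumes the nodes are already in topological order rather than handling the full DAG case:

```python
# Sketch: derive sub-pipeline groups by starting a new group whenever a node
# consumes a persisted dataset. Minimal stand-ins, not Kedro's actual API.
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    inputs: tuple
    outputs: tuple


def group_by_persistence(nodes, persisted):
    """Split a topologically sorted node list at persistence boundaries."""
    groups, current = [], []
    for node in nodes:
        # Cut before any node that reads a persisted (non-memory) dataset.
        if current and any(i in persisted for i in node.inputs):
            groups.append(current)
            current = []
        current.append(node.name)
    if current:
        groups.append(current)
    return groups


nodes = [
    Node("clean", ("raw",), ("clean_df",)),
    Node("featurise", ("clean_df",), ("features",)),  # "features" is persisted
    Node("train", ("features",), ("model",)),
    Node("evaluate", ("model",), ("metrics",)),
]
persisted = {"raw", "features"}
print(group_by_persistence(nodes, persisted))
# [['clean', 'featurise'], ['train', 'evaluate']]
```

A production version would operate on the pipeline graph rather than a linear node list, but the cut criterion — "a non-MemoryDataset input starts a new group" — is the same.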

Validating the groups

After nodes are mapped to groups, several sanity checks and open questions need to be addressed:

  • How do we enforce that the grouped graph is still acyclic?
  • Should a node be reusable across multiple groups?
  • How do we surface / manage ungrouped nodes?
  • Do we add validation to registered pipelines / node tags to better bound their context?

A possible solution here is to introduce formal before_pipelines_registered and after_pipelines_registered hooks, which would expose the pipelines in a state where grouping validation could be injected and applied (see issue #3000). There is no way to do this at a portable, plug-in level at the time of writing.
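The validation such a hook would run can be sketched independently of the hook mechanism itself. The helper below is hypothetical — nothing like it exists in Kedro today — and simply reports whether a proposed grouping is mutually exclusive and collectively exhaustive (MECE):

```python
# Sketch: MECE validation over a proposed node grouping.
# Hypothetical helper; Kedro has no such built-in check at the time of writing.
def validate_groups(all_nodes, groups):
    """Report duplicated, ungrouped, and unknown nodes for a grouping."""
    seen, duplicated = set(), set()
    for group in groups.values():
        for node in group:
            if node in seen:
                duplicated.add(node)  # violates mutual exclusivity
            seen.add(node)
    ungrouped = set(all_nodes) - seen  # violates collective exhaustiveness
    unknown = seen - set(all_nodes)    # grouping references a missing node
    return {"duplicated": duplicated, "ungrouped": ungrouped, "unknown": unknown}


report = validate_groups(
    all_nodes=["clean", "train", "evaluate"],
    groups={"prep": ["clean"], "model": ["train", "train"]},
)
print(report)
# {'duplicated': {'train'}, 'ungrouped': {'evaluate'}, 'unknown': set()}
```

Run inside an after_pipelines_registered-style hook, a non-empty report could either fail fast or downgrade to a warning, depending on whether node re-use across groups is permitted.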

Expressing the groups

  • After the pipeline is broken down into groups, there must be a way to express these groups.
  • This expression mechanism must be serialisable so that it can be stored, reused, and passed between Kedro core, plugins, and orchestrators.

A possible solution is to build upon Pipeline.filter. If run configuration parameters share the same names (from_nodes, tags, etc.), then at execution time we can retrieve the pipeline by name and simply execute pipe.filter(**args).
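Because Pipeline.filter takes plain names and tags, a group definition reduces to a JSON-serialisable dict of keyword arguments. A minimal sketch, assuming the group names and node names are illustrative:

```python
# Sketch: express node groups as serialisable Pipeline.filter() arguments.
# The keys mirror Kedro's run configuration (tags, from_nodes, to_nodes, ...);
# the final pipeline.filter(**args) call is assumed, not shown.
import json

group_spec = {
    "training": {"tags": ["training"]},
    "inference": {"from_nodes": ["load_model"], "to_nodes": ["write_predictions"]},
}

payload = json.dumps(group_spec)   # store in repo / pass to the orchestrator
restored = json.loads(payload)     # plugin side: round-trips losslessly
assert restored == group_spec

# At execution time, a runner or plugin would do something like:
#   sub_pipeline = pipelines["__default__"].filter(**restored["training"])
```

Since the spec is plain JSON, it can live in version control, be templated per environment, and be consumed by Kedro core, plugins, and orchestrator DSL translators alike.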

Requirements management in large Kedro projects

  • The Kedro project template comes with a single requirements file for the whole Kedro registry.
  • The requirements of individual pipelines and nodes are not captured. All pipelines are usually run using the same environment.
  • It can be hard to manage a single environment for large projects. We have evidence of users adopting a monorepo of distinct projects when this becomes a blocker.
  • Modular pipelines do support localised requirements.txt files, but it is still up to the user to make these work neatly in independent environments.

There is a 1:1 relationship between pipeline granularity and the dependencies required for that scope.
A full solution could include metadata such as dependencies, Docker base image, preferred execution engine (e.g. Pod, Spark job, Ray parallel processing), and other relevant aspects.

Separating pipeline definition and execution environments


This section is closely coupled with
Requirements management in large Kedro projects

  • Most project-scoped CLI commands eagerly load all pipelines of the project.
  • Since Kedro nodes keep a pointer to the function object that the node has to run, loading all pipelines means importing all modules at once:
    • This is very expensive in large pipelines. The most common manifestation of this problem is Kedro-Viz taking several minutes to load, despite not requiring a functional DAG.
    • This hinders the ability to isolate different work teams within a project, e.g. the data science team has to install Spark and the data engineering team has to install TensorFlow.

There are active initiatives to address this, but no concrete progress has been made at the time of writing.

No link between distributed KedroSessions of the same pipeline


As described below, most deployment plugins run the Kedro CLI under the hood.

  • When the execution of the pipeline is separated into multiple steps, a new KedroSession for each of these steps is created, and a separate session_id is assigned to each of them.
  • This makes it hard to have a single overview of the pipeline execution.

This point has been raised by the community and there is ongoing work by the Kedro team. Users often report bypassing Kedro's session_id and introducing their own mechanism.

Passing ephemeral data between distributed runs


Kedro, by default, uses MemoryDataset to hold intermediate data. However, this dataset type cannot be used in a distributed setting, since containers do not share main memory.

Deployment plugins usually replace the MemoryDataset by:

  • Having a Runner implementation with another default dataset type
  • Explicitly mapping catalog entries to another dataset type

In either case, ephemeral data is, at least temporarily, persisted to storage (cloud bucket, Kubernetes volume, etc.). The [de-]serialisation of data throttles pipeline execution and, in many cases, leads to worse performance in the distributed setting than in a local run.

There are some solutions, like the CNCF Vineyard project, that offer in-memory data access and might improve execution speed, though only in Kubernetes-specific situations.
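The substitution that deployment plugins perform can be illustrated with stdlib-only stand-ins. `PickleOnDisk` and `materialise` below are hypothetical simplifications — real plugins swap in something like a PickleDataset pointed at a cloud bucket or mounted volume:

```python
# Sketch: replace every would-be in-memory dataset with a file-backed one so
# that separate containers (sharing a volume/bucket) can exchange data.
import pickle
import tempfile
from pathlib import Path


class PickleOnDisk:
    """Minimal stand-in for a persisted dataset (save/load via pickle)."""

    def __init__(self, path):
        self.path = Path(path)

    def save(self, data):
        self.path.write_bytes(pickle.dumps(data))

    def load(self):
        return pickle.loads(self.path.read_bytes())


def materialise(catalog, memory_names, directory):
    """Map each ephemeral entry to a file-backed dataset in `directory`."""
    for name in memory_names:
        catalog[name] = PickleOnDisk(Path(directory) / f"{name}.pkl")
    return catalog


with tempfile.TemporaryDirectory() as tmp:
    catalog = materialise({}, ["features", "model"], tmp)
    catalog["features"].save([1, 2, 3])              # producer container
    assert catalog["features"].load() == [1, 2, 3]   # consumer container
```

The round trip through `save`/`load` is exactly the [de-]serialisation cost discussed above: every dataset crossing a container boundary pays it, which is why fusing memory-connected nodes into one container is attractive.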

Differentiating between data, model, and reporting artifacts


There is a wider point here: granular information about the entire Kedro execution lifecycle often needs to be exposed to the underlying MLOps platform in order to maximise the features available.

  • Most mature MLOps platforms differentiate between kinds of pipeline steps, models, and artifacts.
  • Kedro Dataset classes do not contain metadata about the kind of data they store or load.
    • For example, a PickleDataSet can store any Python object, and it is not known whether the dataset stores a model. In general, there is a strong argument that ONNX (LFAI) should be the default model serialisation mechanism within Kedro.
    • There are some 1st- and 3rd-party model-specific datasets, but it is a manual exercise to classify these.
  • Users who hit this problem are forced to rely on type hints or some sort of object introspection to retrieve this information (see example).
    • Kedro hooks can be utilised to inspect objects at the right time during pipeline execution.
    • At translation time, type annotations of the node functions can be used similarly.

A potential solution here is to establish and enforce conventions. Introducing something like AbstractModelDataSet would make this much easier. We could also use the new metadata catalog key, but the onus is on the user to update this.
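The type-hint workaround mentioned above can be sketched with the stdlib. The `Model` marker class and the `train` function are illustrative only — an AbstractModelDataSet-style convention would make this introspection unnecessary:

```python
# Sketch: classify a node's output as a "model" artifact from its return-type
# hint, at translation time. Hypothetical marker class; not part of Kedro.
import typing


class Model:
    """Hypothetical marker base class for trained-model objects."""


class GradientBoosting(Model):
    pass


def train(features: list) -> GradientBoosting:
    return GradientBoosting()


def output_is_model(func) -> bool:
    """True if the function's return annotation is a Model subclass."""
    returned = typing.get_type_hints(func).get("return")
    return isinstance(returned, type) and issubclass(returned, Model)


print(output_is_model(train))  # True
```

A deployment plugin could use this check when translating nodes, registering any output flagged as a model with the platform's model registry instead of its generic artifact store. The same check could run inside a Kedro hook at execution time, inspecting the actual object rather than the annotation.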

Lack of a standard pattern for iterative development


Currently, deployment plugins address the one-way task of converting a developed pipeline into a deployment. When deployment is viewed as an iterative process of development and deployment steps, additional gaps need to be bridged.

Linking source code to execution

There are two popular configurations of coupling between source code and platform: (1) tight and (2) loose:

  1. When the execution environment is not necessarily aware of ML concepts such as pipelines, models, and artifacts, it is on the user to ensure that deployments are versioned correctly. For example, steps must be taken to avoid pushing untracked code into deployment.
  2. When the execution environment is loosely coupled with the project's source code (e.g. Databricks Repos, AzureML Environment, MLRun Function), the deployment platform usually maintains the linkage between code and pipeline execution.

Keeping code and configuration separated

  • By design, Kedro separates code and configuration.
    • Configuration is not included in the output of kedro package, in strict adherence to the 12-factor app methodology.
  • However, in most deployment patterns, the configuration is baked into the deployment.
    • Since v0.18.5 it has been possible to pass a zip file containing configuration via the command line, but it is not easy to:
      • point to a shared or cloud bucket location; only local directories are supported.
      • inject configuration directly through the command line in a format like JSON.

One option is to use environment variables in the configuration and manage environment
variables at deployment time. There is significant complexity in doing this at scale.

Limiting duplicated build efforts

In a setup where the pipeline is continuously deployed, repeating the same deployment workflow may lead to inefficiencies:

  • re-translating the pipeline: Ideally, the pipeline is translated only for changes ("deltas") in the repo that alter the structure of the pipeline. For example, changes confined to a bound node function should not necessitate re-translation.
  • re-creating the environment: When source code is injected into the execution environment, the same Docker image should be re-used across several versions of the deployment. Some platforms support this out-of-the-box (MLRun, AzureML, Databricks).

It might be possible to implement a platform-agnostic solution, e.g. cloning the repo at execution time before executing the Kedro command.

Kedro dependency after deployment to orchestrator


There may be situations where integrating Kedro with a target platform leaves much of the platform's feature set under-utilised. From the platform's perspective, deployed Kedro pipelines may feel like "closed boxes".

For many deployment plugins, translating a Kedro pipeline means encapsulating the Kedro project within a Docker container and executing
specific nodes via the Kedro CLI.

So, pipeline execution depends on Kedro in two ways:

  1. Session management:
    • Kedro still manages the run context, execution order, as well as importing and running lifecycle hooks.
    • This gives the user a familiar way to modify execution behaviour but can also be limiting for the orchestrator. For example, the nodes in the pipeline may not be fully transparent to the orchestrator in the case of M:1 mapping.
    • While it might be possible to remove session management from a simple project, it becomes very challenging when the project heavily utilizes hooks or any sort of dynamic pipelining.
  2. I/O:
    • Kedro datasets contain arbitrary custom logic that cannot be reliably mapped to the native data-loading logic supported by the orchestrator or platform.
    • If platforms are opinionated in the way they handle artifact management (like SageMaker's /opt/models), these features are often bypassed and not automatically available to users.

Recommended changes to Kedro core

  1. Distributed session_id setting: Simplify session_id management in distributed Kedro pipelines (see issue #2182).
  2. Artifact kind assignment: Enhance dataset integration with artifact kinds. Make ONNX the default path.
  3. M:N groups in Kedro: Establish conventions for M:N groups with a deployment focus (see kedro-plugins PR #241).
  4. Modular requirements: Simplify pipeline deployments and development constraints (Slack conversation).
  5. Group-level validation hooks: Add hooks for enforcing constraints like MECE pipelines (see issue #3000).
  6. Lazy loading of pipeline structure: Enable DAG resolution without dependencies present in the environment (#2829).
  7. Make Kedro pipelines serialisable: Inputs, outputs, and fully qualified function references would enable easier translation into target DSLs. A JSON target seems reasonable.
  8. Deterministic toposort: Users often report that the sort order is not reproducible; this considerably affects any implicit grouping strategy.
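Recommendation 8 is cheap to satisfy: Kahn's algorithm with a name-ordered heap yields the same order on every run for the same graph. A minimal sketch (the dependency dict is illustrative, not Kedro's internal representation):

```python
# Sketch: deterministic topological sort — ties are broken alphabetically by
# node name, so the order is reproducible across runs and machines.
import heapq


def deterministic_toposort(deps):
    """deps: {node: set of upstream nodes}. Returns a stable topological order."""
    indegree = {n: len(up) for n, up in deps.items()}
    downstream = {n: [] for n in deps}
    for node, upstreams in deps.items():
        for up in upstreams:
            downstream[up].append(node)
    ready = [n for n, d in indegree.items() if d == 0]
    heapq.heapify(ready)  # smallest name first among ready nodes
    order = []
    while ready:
        node = heapq.heappop(ready)
        order.append(node)
        for nxt in downstream[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                heapq.heappush(ready, nxt)
    return order


deps = {"train": {"split"}, "split": set(), "evaluate": {"train"}, "report": {"train"}}
print(deterministic_toposort(deps))
# ['split', 'train', 'evaluate', 'report']
```

Python's stdlib graphlib.TopologicalSorter provides the traversal but not the tie-breaking, which is the part that matters for implicit grouping strategies: any strategy that cuts the sorted list into groups needs the cut points to land in the same place every time.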

Deployment plugins

Overview of plugins

Almost all plugins rely on a Docker image to wrap the Kedro project. The Docker image is usually built just before executing the pipeline, and source code is copied into the image as part of the build.

  • A Docker container is spun up from this image on the MLOps platform.
  • The Kedro pipeline is run through the Kedro CLI.
  • Plugins provide hooks and datasets to manage the communication between Kedro and the platform.
  • This communication includes mapping Kedro datasets to platform artifacts, managing experiment tracking via MLflow deployed on the platform and [de-]serialisation of MemoryDatasets.

It is also worth noting that beyond data management and experiment tracking, deployment plugins often fail to leverage or unlock the full potential of platform-specific capabilities.

These unused capabilities include:

  • serving trained models via an endpoint
  • labelling and retraining workflows
  • incorporating feature stores
  • model monitoring

Comparison

| Plugin | Mapping support | Handling memory datasets | Execution setup | Source code | Translation | Platform integration / reflections |
| --- | --- | --- | --- | --- | --- | --- |
| kedro-airflow [O] | 1:1 only | Not supported | Airflow-defined environment | Source available in executor; cwd set to project path | Kedro DAG → Python script using Airflow API | Designed for Airflow, not for container platforms. |
| kedro-docker [O] | M:1 only | N/A since M:1 | Dockerfile environment | Source available via Docker mount at execution time | No translation or orchestration | Introductory, platform-agnostic tutorial. |
| kedro-sagemaker [G] | 1:1 only | Cloudpickle & AWS bucket | Dockerfile environment | Source copied to container at build; auto-rebuilt | SageMakerPipeline object using SageMaker API | MLflow tracking, native pipeline visualization. |
| kedro-vertexai [G] | 1:1 only | Cloudpickle & GCS | Dockerfile environment | Source expected in container | Kubeflow Pipelines DSL | MLflow tracking, elastic machine allocation. |
| kedro-azureml [G] | 1:1 only | Cloudpickle & Azure Blob | AzureML Environment | Source in AzureML Environment | Inputs/outputs → AzureML counterparts | AzureMLPipelineDataSet, MLflow, distributed training. |
| kedro-kubeflow [G] | 1:1, M:1 | KFP Volumes | Dockerfile environment | Source expected in container | Kubeflow Pipelines DSL | Scaling issue with KFP Volumes. |
| kedro-mlrun | 1:1, M:N (prototype) | MLRun artifact | Dockerfile environment | Source fetched from repo by MLRun | Kubeflow Pipelines DSL | Native model tracking, serving, pipeline visualization. |

* [O] maintained by the Kedro org, [G] maintained by the GetInData org

@noklam noklam added TD: technical deepdive Tech Design label that's meant for deep technical discussion on a topic. Expect deep dives into code Component: Documentation 📄 Issue/PR for markdown and API documentation TD: should we? Tech Design topic to discuss whether we should implement/solve a raised issue labels Sep 28, 2023
@noklam noklam added this to the Make deployment to Orchestrator easier milestone Sep 28, 2023
@Lasica

Lasica commented Sep 28, 2023

I've developed an M:N functionality example for the kedro-airflow plugin based on tags. I believe that a similar solution can be achieved in more container-related plugins (it does not necessarily require containers). It relies on the fact that we have all the code available and can filter what to run with tags. Validating the tagging (my input) and passing the adequate command to run with Docker images can effectively group as many nodes as we want in M:N fashion (N < M).

I will soon write a blog post about the kedro-airflow example and release a code demo, probably around next week.

@takikadiri

takikadiri commented Sep 28, 2023

This is insightful and well framed, thank you for sharing this publicly :) I did some similar thinking about this problem space. Here is my takeaway:

Let's say a project is composed of two sets of requirements: functional requirements (business logic) and non-functional requirements (app lifecycle, config, data management, runner, logging, web server, ...). I see two approaches to satisfying these requirements:

  • The framework approach: It consists of somehow bundling the functional and non-functional requirements into the code/application. With this approach, the deployment/orchestration can be done using a generic platform that doesn't know anything about the data or the functionality of the app — "it just orchestrates/runs it". This is the path taken by modern web apps. It makes the application extremely portable and enables standardisation across the industry around a solution like Kubernetes (kube knows nearly nothing about the app).

  • The platform approach (AzureML, Sagemaker, Databricks, ...): In this approach, the platform serves the non-functional requirements and expects you to deploy/integrate plain Python code decorated with some platform-specific objects. The deployed code covers the functional requirements.

Beware: the platform approach can lock you into an execution environment with a fixed or slowly evolving set of features, mostly due to the coupling between the app code and the execution environment.

Beware also of the multiplication of platforms in your stack, where each platform targets a specific part of the ML workloads. This can lead to a complex and costly stack/system. An ML project is a combination of many workloads at the same time: ML engineering (training, eval, tracking, ...), data engineering (data modelling, data preparation, feature stores, ...), data analytics (ad hoc analysis, data viz, ...), and software engineering (APIs, ...).

That's why I double down on the framework approach, using Kedro as the base framework and extending it to cover these diverse workloads.

I believe that Kedro users are currently on a middle path between the framework and platform approaches, as Kedro does not yet cover all data workloads. This indeed makes deployment/integration harder, because users need to map Kedro concepts to their target platform's concepts.

If we take, for example, model-serving functionality, it's something Kedro lacks. This pushes users to adopt an MLOps platform alongside Kedro just because it offers such functionality. Integrating some of these "orchestration" features into Kedro would make the user workload/UX smoother and lower the need to map Kedro concepts to orchestrator/platform concepts. The application could then be orchestrated with a generic orchestrator (an Airflow with a Kedro operator? or plain Docker for an API).

I'm not a dbt expert, but I think we can draw a parallel: going down the framework path with dbt lowers the functionality needed from the orchestration platform (Airflow just runs dbt).

Hope this helps.

@sbrugman
Contributor

sbrugman commented Oct 9, 2023

@datajoely Great synthesis!

Grouping MemoryDataSets: when running nodes in separate environments, MemoryDataSets won't work anymore. For kedro-airflow, we group nodes that are connected through MemoryDataSets to ensure that they run in the same pod/machine (kedro-org/kedro-plugins#241). Solving this on the framework side might result in other choices (e.g. implementing it as a Runner) and could greatly simplify plugins.
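The fusing idea from kedro-org/kedro-plugins#241 — merge nodes connected through in-memory datasets so each fused group runs in one pod — amounts to a union-find over the memory-dataset edges. A minimal sketch with illustrative node/dataset names, not the plugin's actual implementation:

```python
# Sketch: fuse nodes that exchange in-memory datasets into single groups;
# persisted datasets remain group boundaries.
def fuse_memory_groups(edges, memory_datasets):
    """edges: (producer_node, dataset, consumer_node) triples."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for producer, dataset, consumer in edges:
        find(producer)
        find(consumer)
        if dataset in memory_datasets:
            union(producer, consumer)  # must share a pod/machine

    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return sorted(sorted(g) for g in groups.values())


edges = [
    ("clean", "clean_df", "featurise"),  # clean_df is in-memory -> fuse
    ("featurise", "features", "train"),  # features is persisted -> cut
]
print(fuse_memory_groups(edges, {"clean_df"}))
# [['clean', 'featurise'], ['train']]
```

Implemented framework-side (e.g. as a Runner), this would let every plugin inherit the behaviour instead of re-implementing the traversal per orchestrator.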

Grouping in general allows the user to make a logical separation using the Kedro framework, while not having to make an unnecessary trade-off for performance. If each node corresponds to a machine/pod/executor, then there is overhead in spinning them up. As a user, I need to be able to run multiple nodes on a single machine without collapsing them into one node. Controlling the grouping via tags seems a sensible choice.

Requirements management in large Kedro projects: This is a blocker for us to move to a mono-repository. The result: the overhead of maintaining multiple Kedro repositories. It's possible to work around this limitation of Kedro; however, it would be an enormous plus if supported out of the box. On our Spark cluster, most nodes run on an image with the default dependencies for a project. Some nodes have heavier, conflicting dependencies (e.g. numpy) and use different dedicated images (via Airflow's SparkSubmitOperator). One environment per pipeline would work, as more complex pipelines could then be split into multiple.

The other "common pain points" are not that relevant for me at this moment.

One topic that came up with kedro-airflow recently that might be relevant for other MLOps platforms too is deterministic ordering: kedro-org/kedro-plugins#380

MLOps topics whose needs are already addressed by Kedro:

@astrojuanlu
Member

> Solving this on the framework side might result in other choices (e.g. implementing it as a Runner) and could greatly simplify plugins.

Crazy idea: kedro run --runner=AirflowRunner, similar to ZenML concept of orchestrators https://docs.zenml.io/stacks-and-components/component-guide/orchestrators

@Lasica

Lasica commented Oct 26, 2023

I promised to publish a blog post about using the kedro-airflow plugin and a demo of the grouping mechanism, and here it is:
https://getindata.com/blog/deploying-kedro-pipelines-gcp-composer-airflow-node-grouping-mlflow

@stichbury stichbury removed the Component: Documentation 📄 Issue/PR for markdown and API documentation label Nov 1, 2023
@astrojuanlu
Member

Also, worth considering what happens when a target platform supports something that cannot be defined by Kedro DAGs, like conditionals https://www.databricks.com/blog/announcing-enhanced-control-flow-databricks-workflows

@datajoely
Contributor Author

To an extent, Airflow has had this for a long time:
https://airflow.apache.org/docs/apache-airflow/stable/_api/airflow/operators/python/index.html#airflow.operators.python.BranchPythonOperator

@astrojuanlu
Member

Yep. But one of the outcomes of kedro-org/kedro-devrel#94 is that platforms should probably be a priority over open-source orchestrators, because OSS orchestrators are more used to, well, orchestrating ETL/ELT tools (say Airflow + Airbyte, Prefect + meltano), whereas for "ML pipelines" (actually MLOps) commercial platforms seem to be much more widely used.

So maybe before we could afford ignoring this pesky bit, but the moment platforms start growing a more complex set of features, the gap widens.

@astrojuanlu
Member

Turned the research synthesis into a wiki page: https://github.com/kedro-org/kedro/wiki/Synthesis-of-research-related-to-deployment-of-Kedro-to-modern-MLOps-platforms. There's nothing else to do here.

@astrojuanlu astrojuanlu closed this as not planned Won't fix, can't repro, duplicate, stale Apr 16, 2024
@datajoely
Contributor Author

It would be great to see a parent ticket - I was using this to track the status of some of the recommendations

@astrojuanlu
Member

There will be a parent ticket soon, when the next steps are a bit more clear

Projects
Status: Done
Development

No branches or pull requests

7 participants