
[SPIKE] cases: list of ideas (related to prod envs) #2490

Closed
jorgeorpinel opened this issue May 19, 2021 · 20 comments
Assignees: jorgeorpinel
Labels: A: docs (Area: user documentation (gatsby-theme-iterative)), C: cases (Content of /doc/use-cases), p1-important (Active priorities to deal with in next sprints)

Comments

@jorgeorpinel (Contributor) commented May 19, 2021

Most of the existing ideas summarized here have something to do with ML models, I think.

Extracted from #820

UPDATE: Jump to #2490 (comment)

@jorgeorpinel added the A: docs (Area: user documentation (gatsby-theme-iterative)) label May 19, 2021
@jorgeorpinel added the ✨ epic (Placeholder ticket for multi-sprint direction, use story, improvement) label May 19, 2021
@jorgeorpinel changed the title cases: new directions → cases: MLOps direction May 19, 2021
@jorgeorpinel (Contributor, Author) commented May 19, 2021

And a side question: is this direction a higher priority than Experiments-related use cases atm? (see #2270)

@shcheklein (Member) commented:

Some thoughts:

  • CI/CD is taken care of by @casperdcl. It takes time to iterate but we will get there, and I think we'll be fine, at least with this specific title.
  • Deploying models for real-time inference - yep, feels too narrow; we need to find a better angle.
  • Model zoo - too high-level a concept, I think (a model zoo is close to a product). I think we can start with a model registry?

Some ideas for this list:

  • Model Management and/or Model Lifecycle - explain DVC from the models angle: we capture all the information that is relevant to models - data, weights, metrics, experiments - and let people navigate it.
  • Model Registry - discovery and reusability.
  • Experiment tracking/management - here we should sell against W&B, MLflow, etc.: rapid iterations, live metrics + other metrics + navigation.

@shcheklein

This comment has been minimized.

@jorgeorpinel self-assigned this May 19, 2021
@jorgeorpinel changed the title cases: MLOps direction → [SPIKE] cases: MLOps direction May 19, 2021
@jorgeorpinel (Contributor, Author) commented:

OK, we're going to turn this into a spike and try to come up with actionable items within 7 days or less, hopefully. Please help if you can, folks. I'll tag people via chat... ⌛

@dberenbaum (Collaborator) commented:

It might help to start with thesis statements instead of topics. Thesis statements would be like single-sentence use cases arguing for the utility of the products in given scenarios. Use cases are more persuasive writing compared to the explanatory writing of other docs, so a topic may not clarify what we plan to say about it. This will probably take more time and debate, but hopefully we will have more clarity in deciding which use cases to pursue and in writing the use cases. What do you think?

@jorgeorpinel removed the ✨ epic (Placeholder ticket for multi-sprint direction, use story, improvement) label May 19, 2021
@jorgeorpinel (Contributor, Author) commented May 19, 2021

The title cases: MLOps direction is confusing for me... why not just cases: list of use cases to write?

Because we also have cases: Experiments #2270. That seemed like a totally different direction from all the previous ideas summarized here (mainly from #820), which I think at least somewhat relate to MLOps? Happy to change the title, but this is not a comprehensive list of use case ideas in all possible product directions.

@jorgeorpinel jorgeorpinel changed the title [SPIKE] cases: MLOps direction [SPIKE] cases: next scenario to write May 19, 2021
@jorgeorpinel jorgeorpinel changed the title [SPIKE] cases: next scenario to write [SPIKE] cases: next scenario to write (ML model related?) May 19, 2021
@jorgeorpinel (Contributor, Author) commented May 19, 2021

Model Management and/or Model Lifecycle - explain DVC from the models angle - we capture all information that is relevant to models - data, weights, metrics, experiments - and allow people to navigate
Model Registry - discovery and reusability

I have a feeling that model registries aren't different enough from data registries to write another full use case on that. But maybe it can be part of a Model Mgmt/Lifecycle use case. I like that idea! It could also cover or mention some of the topics above (training remotely, deployment, real-time predictions).

@shcheklein (Member) commented:

Happy to change the title but this is not a comprehensive list of use case ideas in all possible product directions.

The way I initially understood the title cases: new directions and the purpose of this research was to consolidate all possible ideas (without this split into experiments vs. ML models, which is hard for me to understand, tbh - e.g. why are experiments not about models?).

To my mind, the ticket you mention about experiments was about one specific use case.

I have a feeling that model registries aren't different enough from data registries to write another full use case on that.

It's a matter of what we are optimizing for here. I would not try to generalize at the expense of the initial goal - more people come and see a high-level title that resonates with them. It's fine that the use cases overlap internally.

In this specific case, I think a model registry can be significantly different.

@jorgeorpinel changed the title [SPIKE] cases: next scenario to write (ML model related?) → [SPIKE] cases: next direction (ML models related) May 20, 2021
@jorgeorpinel (Contributor, Author) commented May 20, 2021

why experiments are not about models

Sure, it all connects. But here I'm thinking mostly about solutions for deploying and using ML models via DVC/CML, e.g. production environments, model deployment, etc. Sorry for the confusion...

So far, the better-defined scenarios seem to be:

  1. synchronizing ML models between development and production (cases: DVC in Production #862)
  2. ML model registry (construction? usage?)
  3. ML model lifecycle/management (see [SPIKE] cases: list of ideas (related to prod envs) #2490 (comment))

@jorgeorpinel (Contributor, Author) commented May 20, 2021

It might help to start with thesis statements instead of topics — single-sentence use cases arguing for the utility of the products in given scenarios.

@dberenbaum

  1. You can use DVC and CML to deploy ML models to production, and sync results/status back to the master repo.
  2. You can package and ship (pre-trained) ML models to a central registry, and build downstream DVC projects that use and depend on them.
  3. DVC helps you develop and manage ML models throughout their whole lifecycle (needs detailing).

Keep in mind a) this is not my area of expertise and b) this is based on preliminary understanding of the proposals, so my explanations above may be inexact.
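
To make thesis 2 slightly more concrete, here's a minimal sketch of how a downstream project could consume a pre-trained model from a registry-style DVC repo via the dvc.api Python module. The repo URL, file path, and tag below are hypothetical, just for illustration:

```python
import pickle

import dvc.api

# Hypothetical registry repo, model path, and tag -- placeholders only.
REGISTRY_REPO = "https://github.com/example-org/model-registry"
MODEL_PATH = "models/text-classifier.pkl"

# Open the model artifact tracked in the registry repo, pinned to a Git tag,
# streaming it from whatever DVC remote backs that repo.
with dvc.api.open(MODEL_PATH, repo=REGISTRY_REPO, rev="v1.2.0", mode="rb") as f:
    model = pickle.load(f)

print(model)  # the downstream project can now use the pre-trained model
```

Alternatively, `dvc import <repo> <path>` would track the same artifact as a dependency of the downstream project, so registry updates can later be pulled in with `dvc update`.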

@dberenbaum (Collaborator) commented:

Thanks, @jorgeorpinel! I didn't mean to suggest that you should bear responsibility for developing each thesis statement, or that each one needs to be perfected.

1. you can use DVC and CML to deploy ml models to production, and sync back the model learning to your development env/team

We have a few use case ideas around "production" and/or "deployment," and it's not clear to me what they mean. There are different scenarios that I have seen described as production deployments:
a. Automated training: Run a scheduled, automated training pipeline to keep your model updated with the latest data (this seems to be #862). The retrained model might then be used for the scoring scenarios below.
b. Batch scoring: Run a scheduled, automated scoring pipeline to always have updated predictions.
c. Real-time scoring: Submit data as needed to an API that returns model scores (see #2431).

I'd probably vote to focus on (b), since a solution for (c) might not be fully developed yet. (a) could maybe be included as part of it if it's not too complex, but to me it's already covered by the CI/CD use case in development.

3. DVC helps you develop and manage ML models throughout their whole lifecycle (needs detailing)

As @shcheklein has mentioned, this can either be about a single model or many models, which might be different use cases.

For a single model, track, visualize, and analyze everything about your experiment, including code, parameters, metrics, plots, data, training DAG, and any other artifacts included in your repo.

For many models, try many different experiments and track them, enabling you to compare, select, reproduce, and iterate on any experiments.

@jorgeorpinel added the p1-important (Active priorities to deal with in next sprints) label May 26, 2021
@jorgeorpinel (Contributor, Author) commented May 26, 2021

a. Automated training

This may or may not be considered related to "in production". Training somewhere seems more like a prerequisite. I think it has more to do with CI/CD (which can be part of a prod deployment workflow, so there's overlap). This can probably be covered initially in #2404 indeed. Cc @casperdcl

b. Batch scoring

Is this basically ETL, where E = get a chunk of data, T = run the pre-trained model, and L = store/upload the scores? That could be part of a use case but may still not be high-level enough.

c. Real-time scoring

Not sure I get how DVC plays a part in this. Probably just in the way the model is deployed (e.g. via the DVC API, which would be similar to this -- going back to the "model registry" idea). Still not high-level enough IMO, but b and c definitely seem related.

@jorgeorpinel (Contributor, Author) commented May 26, 2021

ml model lifecycle/management

this can either be about a single model or many models

Hmmm... By many models do you mean actually different models with different goals (which would relate to the "model registry"), or multiple versions of the same model in development? I usually assume the typical ML pipeline/project ends up producing a single model.

BTW, can we clarify what we mean by "model lifecycle"? Maybe training, active, inactive (related to "in production"), or planning, data engineering, modeling (a much broader topic). Cc @shcheklein

initial goal - more people come, see the high level title that resonates with them . It's fine that they will overlap

Going back to this (which is why titles are important too), I think "DVC in Production" is a really good umbrella concept to begin with, keeping in mind it would be the first use case in this direction. It can have a story (maybe in sections) that covers several of the scenarios we've discussed above. Later on we could split it into multiple use cases if that works better. WDYT?

UPDATE: See quick draft (idea) in #2506

@dberenbaum (Collaborator) commented:

Is this basically ETL where E=get chunk of data, T=run pre-trained model, L=store/upload scores ? That could be part of a use case but may still not be high-level enough.

Yup, although T could include other things in your pipeline (feature engineering).
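
As a rough sketch of that E/T/L framing (every file name, the id column, and the pickled model below are hypothetical, not taken from the discussion), the scoring step could be a plain Python script that a DVC pipeline stage runs over the latest data chunk:

```python
import pickle

import pandas as pd

# Hypothetical paths; in a real project these would be declared as
# dependencies/outputs of a dvc.yaml stage so the pipeline can re-run
# scoring whenever the data chunk or the model changes.
DATA_CHUNK = "data/new_batch.csv"
MODEL_FILE = "models/model.pkl"
SCORES_OUT = "scores/new_batch_scores.csv"


def main():
    # E: extract the latest chunk of data
    df = pd.read_csv(DATA_CHUNK)

    # T: transform -- feature engineering plus running the pre-trained model
    with open(MODEL_FILE, "rb") as f:
        model = pickle.load(f)
    df["score"] = model.predict(df.drop(columns=["id"]))

    # L: load -- store the scores so a downstream system (or the DVC remote) can pick them up
    df[["id", "score"]].to_csv(SCORES_OUT, index=False)


if __name__ == "__main__":
    main()
```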

Not sure I get how DVC plays a part in this. Probably just in the way the model is deployed (e.g. via the DVC API, which would be similar to this -- going back to the "model registry" idea). Still not high-level enough IMO, but b and c definitely seem related.

Right, other than the model registry idea, there's not much of a clear pattern here for how to use DVC.

Hmmm... By many models do you mean actually different models with different goals (which would relate to the "model registry"), or multiple versions of the same model in development? I usually assume the typical ML pipeline/project ends up producing a single model.

Sorry, I meant many experiments from the same pipeline.

@jorgeorpinel (Contributor, Author) commented May 29, 2021

More feedback (from https://iterativeai.slack.com/archives/C6YHPP2TB/p1621617453043300):

From @mnrozhkov

  • Batch Scoring project use case: it's common for large companies like Telecoms, Banks & FinTech
  • for production runs we could use Airflow

From @dmpetrov

☝️ From these comments I take it that 1) there's support for covering the "batch scoring" scenario, 2) there's interest in certain integrations, specifically Airflow (I need to play with it ⌛) -- maybe also MLflow? and 3) an e2e case could be a meaningful way to present some of these topics.


Also, @shcheklein shared https://neptune.ai/blog/model-registry-makes-mlops-work with me (on the "model registry" idea). I think this answers the Q of how model registries relate to MLOps/ "in production". Summary:

  • Collaborative hub where teams can work together at different stages of the ML lifecycle [from (after) experimentation to production]... allows teams to publish, test, monitor, govern and share [models].
  • All the key values (data, config, env, code, versions, and docs) are in one place.
  • Centralized tracking system that stores lineage, versioning, and related metadata for published ML models.
  • (1) Provide a mechanism to store model metadata; (2) connect independent model training and inference processes by acting as a communication layer.
  • [Metadata:] identifier, name, desc?, version, date, performance, path to the serialized model, and stage of deployment (dev, shadow-mode, prod, etc.)

@jorgeorpinel changed the title [SPIKE] cases: next direction (ML models related) → [SPIKE] cases: list of ideas (mostly related to production environments) May 29, 2021
@dberenbaum (Collaborator) commented:

Nice, @jorgeorpinel! The comments on batch scoring and model registry use cases look good to me.

there's interest in certain integrations, specifically Airflow (I need to play with it ⌛) -- maybe also MLflow?

Yes to Airflow, since it is the default choice for pipeline orchestration, although it might be worth looking into some alternatives like Prefect (see https://neptune.ai/blog/best-workflow-and-pipeline-orchestration-tools).

MLflow is probably better left for the experiment management use case, since its focus is on tracking and comparing experiments rather than executing pipelines.
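
Just to make the Airflow idea above a bit more tangible (a rough sketch only; the project path, DAG id, and schedule are hypothetical, and working out the real integration is exactly what the use case would cover), a scheduled batch-scoring DAG could simply shell out to DVC commands:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical location of a DVC project that contains a scoring pipeline.
PROJECT_DIR = "/opt/projects/batch-scoring"

with DAG(
    dag_id="dvc_batch_scoring",
    schedule_interval="@daily",   # run the scoring pipeline once a day
    start_date=datetime(2021, 6, 1),
    catchup=False,
) as dag:
    # Fetch the latest tracked data/model, re-run the scoring stage(s),
    # and push the resulting scores to the DVC remote.
    pull = BashOperator(task_id="dvc_pull", bash_command=f"cd {PROJECT_DIR} && dvc pull")
    score = BashOperator(task_id="dvc_repro", bash_command=f"cd {PROJECT_DIR} && dvc repro")
    push = BashOperator(task_id="dvc_push", bash_command=f"cd {PROJECT_DIR} && dvc push")

    pull >> score >> push
```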

@jorgeorpinel changed the title [SPIKE] cases: list of ideas (mostly related to production environments) → [SPIKE] cases: list of ideas (related to prod envs) Jun 3, 2021
@jorgeorpinel (Contributor, Author) commented Jun 3, 2021

Summary (again)

Here's a proposed list of 4 big ideas that group most of the concepts we've discussed (with some overlaps):

1. DVC in Production (rel. #2506) (intro to MLOps)
  • Training remotely
  • Deploying models (CLI or API)
  • Keeping pipelines and artifacts in sync between environments
  • Batch scoring, a.k.a. "DVC for ETL"
  • + Distributed computing
  • + Parallel execution?

2. ML Model Registry
  • Model lifecycle (training, shadow, active, inactive)
  • Automated/continuous training (remotely)
  • Discovery and reusability
  • Deploying models
  • Batch scoring example
  • + Real-time inference

3. Production Integrations
  • Databases (e.g. SQL dump versioning/preprocessing)
  • Spark (e.g. remote training)
  • Airflow (e.g. batch scoring)
  • Kafka (e.g. real-time predictions)

4. End-to-end scenario combining items from the above, e.g.:
  • Importing (versioning?) data from Spark
  • (Automated) training remotely
  • MLOps via a model registry
  • Batch scoring (Airflow integration)

@shcheklein (Member) commented:

Thanks @jorgeorpinel! Sounds good. Where can we find the full list of use cases that we have written or are considering writing, etc.? (I assume this ticket is still about "prod envs"?)

For example, where should we put the "Experiment tracking/management" / "ML bookkeeping" case?

@jorgeorpinel (Contributor, Author) commented Jun 6, 2021

All the use case ideas we have in GH have been consolidated here (see the original description), and we could even close some or all of them, except #2270 (an epic itself) and #2512 (new, still under discussion).

I should probably make an epic/story ticket so we can close this and maybe some of the other issues linked above ⌛

@jorgeorpinel mentioned this issue Jun 8, 2021
@jorgeorpinel (Contributor, Author) commented:

Resulting list of ideas: #2544

Closing spike.

@iesahin added the C: cases (Content of /doc/use-cases) label Oct 21, 2021