
Proposal: Support ML Monitoring #206

Open
skonto opened this issue Jun 3, 2022 · 6 comments


skonto commented Jun 3, 2022

Hello all,

I'm not sure if this is the right repo; apologies in advance if not, and I'd be glad to be redirected to the right place. I'm also not sure whether this has been discussed elsewhere before; it could be a long shot or completely off the mark. :)
I would like to propose making Machine Learning (ML) monitoring a first-class citizen of the specification.
ML-enabled applications deviate from traditional cloud-native ones, but they are being adopted heavily as part of the enterprise stack.
Monitoring an ML model is essential for ML in production: a model cannot really be operated without proper visibility. ML monitoring is also part of standard MLOps practices.
It is very common to have a model served via a service and to have that model emit metrics, e.g. scoring latency, performance-related metrics such as accuracy, or metrics related to concept drift.
Of course, this is only part of the story in the data-related observability domain. To be more specific, a new resource could be added to the OTel spec to capture the concept of a model (similar to FaaS). Specific metrics could then be defined per ML model category; for an example, check here.
Adding such support also helps connect the metadata that exists in ML metadata stores directly to the deployed models and their emitted metrics. Tracing could also be enhanced to be ML-specific, e.g. tracing ML operations over the input.
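To make this a bit more concrete, here is a rough sketch of what such a model resource plus one operational metric could look like with the existing OpenTelemetry Python SDK. The `ml.model.*` attribute names, the metric name, and the example service are purely hypothetical; nothing like them exists in the spec today.

```python
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)
from opentelemetry.sdk.resources import Resource

# Hypothetical attribute names: nothing like ml.model.* exists in the spec today.
resource = Resource.create({
    "service.name": "fraud-scoring-service",
    "ml.model.name": "fraud-detector",
    "ml.model.version": "1.4.2",
    "ml.model.type": "binary-classifier",
})

reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
metrics.set_meter_provider(MeterProvider(resource=resource, metric_readers=[reader]))
meter = metrics.get_meter("ml-monitoring-sketch")

# One operational metric: scoring latency recorded as a histogram.
scoring_latency = meter.create_histogram(
    "ml.model.scoring.duration",   # hypothetical metric name
    unit="ms",
    description="Time taken to score a single input",
)

def score(features):
    # ... run the model here, then record how long it took:
    scoring_latency.record(12.3, attributes={"ml.model.name": "fraud-detector"})
```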
I am sure existing concepts could be used to build something on top, but some key benefits of this approach are:
a) Establish a model for emitted information that is understandable by data scientists and others involved in ML.
b) Help with integration across different systems that produce similar information by defining common ground.
c) Make OpenTelemetry easy to use in an important domain so that users don't have to re-invent concepts.

Any feedback would be welcome.

Thank you!
Stavros

@jkwatson
Contributor

Hi @skonto. This is definitely something that I am interested in, being at verta. Have you thought about what kind of metric aggregations would be relevant for model monitoring, and how they might map to the OpenTelemetry metric instruments? Would we need some custom aggregations? Could they be done in-process, or would some sort of collector be more appropriate? Love to hear your thoughts on this topic.


skonto commented Sep 22, 2022

Hi @jkwatson, here are some initial thoughts. We already have enough to start with using existing instruments and aggregations.
For example, we can split model monitoring into two basic areas: model performance and model operational performance (terms can change). In the first area we need to identify what we want to measure and then try to map it to the standard or extend it. For example, we might find it useful to measure data drift using the Jensen-Shannon divergence implemented on top of a histogram (see the sketch after this comment). In the second area we might consider disk I/O, uptime, CPU/memory utilization, scoring latency, etc.
Of course we need input on what is useful to standardize, what people use in practice, etc.
We could start researching this by doing the exercise of putting together a set of metrics that provides a good overview of a model's performance. cc @dineshg13
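As a concrete illustration of the drift idea above, here is a minimal sketch of the Jensen-Shannon divergence computed between a reference histogram (e.g. captured at training time) and a histogram of recent production data, assuming both share the same bucket boundaries. The resulting scalar is what could then be reported through a standard OTel instrument; the bucket counts below are made up.

```python
import numpy as np

def js_divergence(p_counts, q_counts, eps=1e-12):
    """Jensen-Shannon divergence (base 2, in [0, 1]) between two histograms with shared bins."""
    p = np.asarray(p_counts, dtype=float)
    q = np.asarray(q_counts, dtype=float)
    p = p / (p.sum() + eps)          # normalize counts to probability distributions
    q = q / (q.sum() + eps)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0                 # 0 * log(0/x) is treated as 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Reference histogram captured at training time vs. counts from a recent window.
reference = [120, 340, 280, 90, 20]
recent    = [ 60, 150, 300, 250, 80]
print(js_divergence(reference, recent))   # 0.0 == identical distributions, 1.0 == disjoint
```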

@jkwatson
Contributor

I think the interesting new work would be in model performance, and not necessarily in operational performance, which I think would mostly fall under standard observability-style metrics.

With regards to drift... I can imagine a reference histogram being recorded somewhere, and then some kind of time-windowed cumulative histogram being used to compute the divergence. I'm not sure OTel itself really has facilities for "time windowed" cumulative histograms right now... there might be an opportunity to make a proposal to fulfill that need.
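One way this could be approximated with what exists today (a sketch only, not an existing OTel facility): keep a rolling window of recent observations in application code and expose the divergence against the reference histogram through an ObservableGauge callback. The metric name, bucket boundaries, and window size below are arbitrary choices for illustration.

```python
from collections import deque

import numpy as np
from scipy.spatial.distance import jensenshannon

from opentelemetry import metrics
from opentelemetry.metrics import CallbackOptions, Observation

BINS = np.array([0.0, 0.25, 0.5, 0.75, 1.0])       # shared bucket boundaries
reference_counts = np.array([120, 340, 280, 90])   # histogram captured at training time
window = deque(maxlen=10_000)                      # the "time window": last N observed values

def observe_drift(options: CallbackOptions):
    if not window:
        return
    counts, _ = np.histogram(list(window), bins=BINS)
    # scipy returns the JS *distance*; square it to get the divergence (base 2, in [0, 1]).
    yield Observation(jensenshannon(reference_counts, counts, base=2) ** 2)

meter = metrics.get_meter("drift-sketch")
meter.create_observable_gauge(
    "ml.model.feature.js_divergence",              # hypothetical metric name
    callbacks=[observe_drift],
    description="JS divergence of a feature vs. its training-time distribution",
)

# Elsewhere, each scored input appends the observed feature value: window.append(value)
```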


skonto commented Sep 26, 2022

> I'm not sure OTel itself really has facilities for "time windowed" cumulative histograms right now... there might be an opportunity to make a proposal to fulfill that need.

That would be interesting work, want to start a draft doc?

@jkwatson
Contributor

> > I'm not sure OTel itself really has facilities for "time windowed" cumulative histograms right now... there might be an opportunity to make a proposal to fulfill that need.
>
> That would be interesting work, want to start a draft doc?

I don't think I have time to work on that at the moment, but I'd be very happy to take a look at a proposal from someone who did have time.

@matiasdahl

Somewhat related: on the topic of MLOps observability, I have been working on an open-source ML experiment tracker (and combined task executor) that emits data as OpenTelemetry spans.

https://github.com/composable-logs/composable-logs

This focuses only on the ML training phase.

A potential advantage of OpenTelemetry is that one does not need a dedicated experiment tracker such as, e.g., MLflow. Rather, one could use the same logging service as for other software components.

A challenge is that OpenTelemetry might not be the best format for describing the computational DAGs common in ML workflows, and existing visualization tools seem to be more focused on the microservice setting. But the OpenTelemetry data is open and could be queried in other ways.

Below is a demo ML training pipeline built with this:

https://composable-logs.github.io/mnist-digits-demo-pipeline

The demo uses only services available with a free GitHub account (GitHub Actions for compute, GitHub artifacts to store the training logs, and GitHub Pages to host a static website for reporting). The static reporting site is built by converting the training logs, based on a fork of the MLflow UI.

This is experimental; any ideas are welcome.
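For what it's worth, here is a generic illustration of the idea of recording one training run as an OpenTelemetry span, with hyperparameters and resulting metrics attached as attributes. This is not composable-logs' actual API, just the plain OpenTelemetry Python tracing SDK with made-up attribute names.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Minimal tracer setup that prints spans to stdout.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("experiment-tracking-sketch")

# One training run recorded as a span; attribute names are hypothetical.
with tracer.start_as_current_span("train.mnist-cnn") as span:
    span.set_attribute("ml.hyperparam.learning_rate", 1e-3)
    span.set_attribute("ml.hyperparam.epochs", 5)
    # ... training loop goes here ...
    span.set_attribute("ml.metric.val_accuracy", 0.987)
```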
