logger: Add notifier to next_step? #90

Closed
daavoo opened this issue Jun 15, 2021 · 14 comments

Comments

@daavoo
Contributor

daavoo commented Jun 15, 2021

Depending on the type of model being trained, the time between calls to next_step may vary significantly. In common deep learning scenarios, e.g. the Keras callback, next_step is called at the end of each epoch, which can mean long gaps (possibly hours) between calls.

It could be useful to have built-in support for optionally sending a notification each time next_step is called.

Without changing dvclive, the user could just call a third-party library (e.g. https://github.com/liiight/notifiers) after next_step:

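import dvclive
from notifiers import notify
from tensorflow.keras.callbacks import Callback


class MetricsCallback(Callback):
    def on_epoch_end(self, epoch: int, logs: dict = None):
        logs = logs or {}
        for metric, value in logs.items():
            dvclive.log(metric, value)
        dvclive.next_step()
        # Send a push notification once the step has been recorded.
        notify('pushover', user='foo', token='bar', message=f'epoch: {epoch}')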

But building the notification step into MetricLogger would have some benefits, such as access to internals (e.g. _metrics) and configuration options, in addition to hiding complexity from the end user.

However, I'm not sure whether it is worth implementing this feature inside dvclive or whether it would be better to keep dvclive as lightweight as possible.

@dmpetrov
Member

dmpetrov commented Jun 15, 2021

@daavoo I'm trying to understand the motivation behind this? 😄
Could you please elaborate on this? Do you want to update the files more often? What does notify() mean?

EDIT: do you have any references in other ml logger frameworks to this functionality?

@daavoo
Contributor Author

daavoo commented Jun 15, 2021

> @daavoo I'm trying to understand the motivation behind this? 😄
> Could you please elaborate on this? Do you want to update the files more often? What does notify() mean?

Sorry about the lack of clarity.

The motivation comes from working with deep learning models that take a long time to train (e.g. hours or days). Under those circumstances we always ended up writing some sort of "notification" code to complement or integrate into the ML logger. The main reason was to be able to monitor the training loop remotely (i.e. without having to watch stdout in a terminal).

This notification code takes care of sending a message to some platform (e.g. e-mail or a Slack / Discord / Telegram channel) containing information such as the number of the finished epoch (a.k.a. step in dvclive) and the associated metrics. We also used it to report exceptions that occurred during the training loop.

notify() is usually a function that sends information as a message to an app.
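
As a rough illustration of the kind of helper described here, the sketch below formats a step and its metrics into a message and reports exceptions from the training loop. The names `format_step_message`, `notify_on_failure`, and the `send` callable are hypothetical and not part of dvclive; `send` stands in for whatever actually delivers the text (e-mail, Slack, Telegram, and so on).

```python
from contextlib import contextmanager
from typing import Callable, Dict

# Hypothetical type: any callable that delivers a text message somewhere.
Sender = Callable[[str], None]


def format_step_message(step: int, metrics: Dict[str, float]) -> str:
    """Render the finished step and its metrics as a one-line message."""
    parts = [f"{name}={value:.4f}" for name, value in sorted(metrics.items())]
    return f"step {step} finished: " + ", ".join(parts)


@contextmanager
def notify_on_failure(send: Sender):
    """Report any exception raised inside the block, then re-raise it."""
    try:
        yield
    except Exception as exc:
        send(f"training failed: {type(exc).__name__}: {exc}")
        raise


# Usage: collect messages in a list instead of calling a real service.
sent = []
with notify_on_failure(sent.append):
    for step in range(2):
        sent.append(format_step_message(step, {"loss": 1.0 / (step + 1)}))
```

Passing the delivery function in as an argument keeps the formatting logic independent of any particular messaging provider.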

> EDIT: do you have any references in other ml logger frameworks to this functionality?

I think other ML loggers usually have an associated UI with a view that updates automatically as plots and information are logged (related to this Studio issue: iterative/studio-support#13).

In addition to that, some ml loggers also provide "notification" utilities:

Beyond existing functionality in other ml loggers, I have found different teams and open source communities solving this problem, including some I work/have worked with:

@daavoo
Contributor Author

daavoo commented Jul 14, 2021

I've just discovered another open-source tool focused on this kind of functionality:

https://github.com/labmlai/labml

@pared
Contributor

pared commented Jul 14, 2021

Related to #91

@daavoo
Contributor Author

daavoo commented Sep 9, 2021

Another open source tool:

https://github.com/aporia-ai/mlnotify

@daavoo
Contributor Author

daavoo commented Oct 21, 2021

Interesting integration between DagsHub and New Relic highlighting alerts as one of the main features:

https://dagshub.com/blog/real-time-machine-learning-monitroing-new-relic-dagshub/

@dberenbaum
Contributor

Related to #91 (comment), I think the most useful integration here would be making it dead simple to send full reports (similar to the html today) through supported channels.

For example, the Slack API could probably be used to generate a message with the metrics and plot images, and similarly for email (personally, I would prioritize Slack because it's more collaborative and probably easier for users to set up).

The local html generated now could just be one report/alert format in that case (and the cml markdown report another).

@daavoo
Contributor Author

daavoo commented Feb 10, 2022

> Related to #91 (comment), I think the most useful integration here would be making it dead simple to send full reports (similar to the html today) through supported channels.
>
> For example, the Slack API could probably be used to generate a message with the metrics and plot images, and similarly for email (personally, I would prioritize Slack because it's more collaborative and probably easier for users to set up).
>
> The local html generated now could just be one report/alert format in that case (and the cml markdown report another).

That would be the way to go, and it was the original idea behind using https://github.com/liiight/notifiers.

For metrics this is very feasible. However, the images / rendered plots would be tricky because most channels don't support sending images directly. We could rely on cml publish to host the images and send the link (as in the CML markdown report), but this would make CML a dependency for every channel.

@dberenbaum
Contributor

> For metrics this is very feasible. However, the images / rendered plots would be tricky because most channels don't support sending images directly. We could rely on cml publish to host the images and send the link (as in the CML markdown report), but this would make CML a dependency for every channel.

Rather than wrapping a general-purpose text-based notifier with support for many providers, it might be more useful to focus on providers in which we can send the entire report, including images/rendered plots. AFAIK this should be feasible without hosting in Slack (https://api.slack.com/methods/files.upload) and email (https://docs.python.org/3/library/email.examples.html).
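
To illustrate the e-mail route, here is a minimal sketch that builds a message carrying the metrics as text and a rendered plot as an attachment, using only the standard library `email` package. The function name `build_report_email`, the addresses, and the placeholder image bytes are assumptions for illustration; nothing here is an existing dvclive API.

```python
from email.message import EmailMessage
from typing import Dict


def build_report_email(step: int, metrics: Dict[str, float], plot_png: bytes,
                       sender: str, recipient: str) -> EmailMessage:
    """Build (but do not send) an e-mail with metrics in the body and a plot attached."""
    msg = EmailMessage()
    msg["Subject"] = f"Training report: step {step}"
    msg["From"] = sender
    msg["To"] = recipient
    lines = [f"{name}: {value}" for name, value in sorted(metrics.items())]
    msg.set_content("\n".join(lines))
    # The plot travels inside the message, so no external hosting is needed.
    msg.add_attachment(plot_png, maintype="image", subtype="png",
                       filename=f"plots_step_{step}.png")
    return msg


# Placeholder bytes stand in for a real rendered plot.
report = build_report_email(3, {"loss": 0.25, "acc": 0.9}, b"\x89PNG placeholder",
                            "trainer@example.com", "team@example.com")
```

Sending is left out on purpose; with a reachable SMTP server it would be something like `smtplib.SMTP(host).send_message(report)`.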

I'm not sure text-based alerts add enough value (we could instead have a doc or blog post showing how to use dvclive + https://github.com/liiight/notifiers). Full reports with plots seem like a more unique feature, and they extend dvclive's initial value prop of lightweight live monitoring for model training, providing serverless alerting and reporting anywhere without needing to access the training machine. Since a lot of training happens in headless environments anyway, this seems pretty useful to me. What do you think?

@daavoo
Contributor Author

daavoo commented Feb 11, 2022

> I'm not sure text-based alerts add enough value (we could instead have a doc or blog post showing how to use dvclive + https://github.com/liiight/notifiers). Full reports with plots seem like a more unique feature, and they extend dvclive's initial value prop of lightweight live monitoring for model training, providing serverless alerting and reporting anywhere without needing to access the training machine. Since a lot of training happens in headless environments anyway, this seems pretty useful to me. What do you think?

I think it's useful and would be directly adding value for DVCLive.

I'm a little "worried" about how easy it would be to maintain, because Report Providers sound like integrations that could grow perpendicular to ML Frameworks.

So far, looking at slack and email APIs, it doesn't look that bad.

@dberenbaum
Contributor

@shcheklein mentioned that it might be worthwhile to look into RSS feed aggregators. There are some parallels in how RSS expects a particular schema of elements (https://validator.w3.org/feed/docs/rss2.html) and can publish them in a consistent format, so maybe it can give some ideas for how to implement.
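
As one possible reading of the RSS parallel, the sketch below maps each training step onto an RSS 2.0 `<item>` (title, description, guid) inside a `<channel>` with the three required channel elements. The helper names and the way metrics are packed into `description` are assumptions, not a proposed dvclive schema.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterable


def step_to_rss_item(step: int, metrics: Dict[str, float]) -> ET.Element:
    """Map one training step onto the RSS 2.0 item elements title/description/guid."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = f"Step {step}"
    ET.SubElement(item, "description").text = ", ".join(
        f"{name}={value}" for name, value in sorted(metrics.items()))
    guid = ET.SubElement(item, "guid", isPermaLink="false")
    guid.text = f"step-{step}"
    return item


def build_feed(title: str, link: str, items: Iterable[ET.Element]) -> str:
    """Wrap the items in a channel with the required title, link and description."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = "Training progress"
    channel.extend(items)
    return ET.tostring(rss, encoding="unicode")


feed_xml = build_feed("example run", "https://example.com/run",
                      [step_to_rss_item(s, {"loss": 1.0 / (s + 1)}) for s in range(2)])
```

The point of the fixed schema is that any provider only has to understand one item shape, regardless of which ML framework produced the metrics.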

@casperdcl

casperdcl commented Apr 4, 2022

So it's about tidying up this sort of thing?

from tqdm.contrib.slack import trange  # or tqdm.contrib.telegram / tqdm.contrib.discord

with trange(live.get_step(), epochs, unit="epoch") as pbar:
    for epoch in pbar:
        ...
        live.log("loss", loss)
        pbar.set_postfix(loss=loss)
        live.next_step()

i.e. providing a callback interface?

live.set_callback(
    on_log=lambda name, metric: pbar.set_postfix({name: metric}),
    on_step=lambda new_step: print(f"starting epoch {new_step:>5d}", file=some_log))

Or is it more advanced? live.notify_slack(on_step=True, channel="#...", token="...")
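
For the callback-interface reading of the question, a minimal sketch of how `on_log` / `on_step` hooks could be wired into a logger is shown below. `LiveSketch` is a hypothetical stand-in written for this example; `set_callback` is the method name floated in the snippet above, not an existing dvclive method.

```python
from typing import Callable, Dict, Optional


class LiveSketch:
    """Hypothetical stand-in for a logger with on_log / on_step hooks."""

    def __init__(self) -> None:
        self._step = 0
        self._metrics: Dict[str, float] = {}
        self._on_log: Optional[Callable[[str, float], None]] = None
        self._on_step: Optional[Callable[[int], None]] = None

    def set_callback(self, on_log=None, on_step=None) -> None:
        self._on_log = on_log
        self._on_step = on_step

    def log(self, name: str, value: float) -> None:
        self._metrics[name] = value
        if self._on_log is not None:
            self._on_log(name, value)

    def next_step(self) -> None:
        # Advance the step, clear per-step metrics, then fire the hook.
        self._step += 1
        self._metrics = {}
        if self._on_step is not None:
            self._on_step(self._step)


events = []
live = LiveSketch()
live.set_callback(
    on_log=lambda name, value: events.append(f"{name}={value}"),
    on_step=lambda new_step: events.append(f"starting epoch {new_step:>5d}"),
)
live.log("loss", 0.5)
live.next_step()
```

With hooks like these, a notifier becomes just another callback and the logger itself stays free of provider-specific code.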

@dberenbaum
Contributor

Sorry @casperdcl, I missed this comment. It's closer to the latter advanced usage. Probably channel, token, etc. can be set in environment variables, and the method can be something like live.make_report(type="slack").
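
A minimal sketch of that shape, assuming configuration comes from environment variables: the function below builds (but does not send) a report payload. The variable names `REPORT_SLACK_CHANNEL` and `REPORT_SLACK_TOKEN` are invented for this example since the thread does not specify any, and this `make_report` is a free function rather than a real `live` method. The parameter is called `type` only to mirror the call suggested above.

```python
import os
from typing import Dict, Mapping

# Hypothetical variable names; the thread does not specify any.
REQUIRED_ENV = {"slack": ("REPORT_SLACK_CHANNEL", "REPORT_SLACK_TOKEN")}


def make_report(metrics: Dict[str, float], type: str = "slack",
                env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Build (not send) a report payload, reading credentials from the environment."""
    if type not in REQUIRED_ENV:
        raise ValueError(f"unsupported report type: {type!r}")
    missing = [name for name in REQUIRED_ENV[type] if name not in env]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    channel_var, token_var = REQUIRED_ENV[type]
    text = "\n".join(f"{k}: {v}" for k, v in sorted(metrics.items()))
    return {"channel": env[channel_var], "token": env[token_var], "text": text}


# Usage with an explicit mapping instead of the real environment.
payload = make_report(
    {"loss": 0.25},
    env={"REPORT_SLACK_CHANNEL": "#training", "REPORT_SLACK_TOKEN": "xoxb-placeholder"},
)
```

Failing early with the list of missing variables keeps misconfiguration from surfacing hours into a training run.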

@dberenbaum
Contributor

I don't think we are likely to do this now that we have live metrics in Studio and other solutions exist for alerting.

@dberenbaum closed this as not planned on Mar 6, 2023