Integrate callback functionality into elephas (History, checkpoints, etc.) #131

Closed
sd12832 opened this issue Feb 10, 2019 · 13 comments

Comments

@sd12832

sd12832 commented Feb 10, 2019

https://github.com/keras-team/keras/blob/master/keras/callbacks.py#L341

The fit function in Keras returns a History object whose per-epoch losses and metrics can be plotted to determine whether the model is overfitting. This would be very useful in Elephas.
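For context, here is a minimal sketch of how that per-epoch record can be read. In Keras, the object returned by `fit` exposes a `history` attribute, a dict mapping metric names to per-epoch lists. The numbers below are made up for illustration, and `first_overfit_epoch` is a hypothetical helper, not part of Keras or elephas.

```python
from typing import Dict, List, Optional


def first_overfit_epoch(history: Dict[str, List[float]], patience: int = 2) -> Optional[int]:
    """Return the first epoch index of a run of `patience` consecutive epochs in
    which val_loss rose while training loss kept falling, or None if no such run."""
    loss, val_loss = history["loss"], history["val_loss"]
    streak = 0
    for epoch in range(1, len(loss)):
        if val_loss[epoch] > val_loss[epoch - 1] and loss[epoch] < loss[epoch - 1]:
            streak += 1
            if streak >= patience:
                return epoch - patience + 1
        else:
            streak = 0
    return None


# Made-up numbers in the shape of `History.history` (illustrative only).
example_history = {
    "loss": [0.90, 0.60, 0.45, 0.35, 0.28],
    "val_loss": [0.95, 0.70, 0.65, 0.72, 0.80],
}
overfit_epoch = first_overfit_epoch(example_history)
```

The heuristic is deliberately simple; in practice one would usually plot both curves and judge by eye.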

@maxpumperla
Owner

@sd12832 very interesting idea. with elephas, each of the N workers would receive a callback instance. The question is how to consolidate callback data once training is done, i.e. after we send updates back to the master network. A list of callbacks? Suggestions?

@sd12832
Author

sd12832 commented Feb 11, 2019

@maxpumperla Maybe each worker gets its own history based on asynchronous callbacks, with an option to average the values across all the workers? We can follow the Keras approach of having a callback base class that is subclassed for different purposes.

In terms of the list of callbacks, wouldn't n asynchronous callbacks suffice? Pardon my lack of understanding of the elephas code, I have not looked at it in too much detail.

I would very much like to accelerate the building of this feature by helping as much as I can. Please guide me on how to do so!
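The averaging option suggested above could look roughly like the sketch below. It assumes each worker hands back a dict in the shape of Keras's `History.history`, with the same metrics and the same number of epochs on every worker. `average_histories` is a hypothetical name, not an elephas function.

```python
from typing import Dict, List


def average_histories(worker_histories: List[Dict[str, List[float]]]) -> Dict[str, List[float]]:
    """Average each metric epoch by epoch across workers.

    Assumes every worker reports the same metrics for the same number of epochs.
    """
    if not worker_histories:
        return {}
    averaged: Dict[str, List[float]] = {}
    for metric in worker_histories[0]:
        per_worker = [history[metric] for history in worker_histories]
        # zip(*per_worker) groups the values of one epoch across all workers.
        averaged[metric] = [sum(values) / len(values) for values in zip(*per_worker)]
    return averaged


# Two hypothetical workers, two epochs each.
worker_histories = [
    {"loss": [1.0, 0.5]},
    {"loss": [2.0, 1.5]},
]
merged = average_histories(worker_histories)
```

This is an unweighted mean; if partitions differ in size, weighting each worker by its sample count would probably be the fairer choice.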

@maxpumperla
Owner

@sd12832 thanks for your feedback. yeah, so the first step is to init the master network with a callback, serialize it so that elephas can ship it to workers, and then deserialize the callback and set it on each of the N worker networks. that's the straightforward part (in a way). The question is what to do with the callback data. For model history you basically accumulate logs, right? hence my specific, naive suggestion of lists.
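A rough sketch of that serialize, ship, deserialize flow, followed by the naive list consolidation. Everything here is an assumption made for illustration: `LossLogger`, `CALLBACK_REGISTRY`, `serialize_callback` and `deserialize_callback` are hypothetical names, and shipping a JSON config is only one possible mechanism (pickling the callback would be another). Nothing in this thread settles how elephas would do it.

```python
import json
from typing import Dict, List


class LossLogger:
    """Hypothetical stand-in for a Keras-style callback; not a Keras or elephas class."""

    def __init__(self, metric: str = "loss"):
        self.metric = metric
        self.values: List[float] = []

    def on_epoch_end(self, epoch: int, logs: Dict[str, float]) -> None:
        self.values.append(logs[self.metric])

    def get_config(self) -> Dict[str, str]:
        return {"metric": self.metric}


# Hypothetical registry so a worker can rebuild a callback from its class name.
CALLBACK_REGISTRY = {"LossLogger": LossLogger}


def serialize_callback(callback) -> str:
    return json.dumps({"class_name": type(callback).__name__, "config": callback.get_config()})


def deserialize_callback(payload: str):
    spec = json.loads(payload)
    return CALLBACK_REGISTRY[spec["class_name"]](**spec["config"])


# Master side: configure the callback once and serialize it.
payload = serialize_callback(LossLogger(metric="loss"))

# Worker side: each of the N workers rebuilds its own independent instance.
num_workers = 3
worker_callbacks = [deserialize_callback(payload) for _ in range(num_workers)]
for worker_id, callback in enumerate(worker_callbacks):
    # Simulated training: each worker logs a different loss for epoch 0.
    callback.on_epoch_end(epoch=0, logs={"loss": 1.0 + worker_id})

# The "naive list" consolidation: one log list per worker.
collected = [callback.values for callback in worker_callbacks]
```

Because every worker deserializes its own copy, the logs do not interfere with each other, which is why some consolidation step on the master is needed afterwards.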

Now, what do you do about things like early stopping? Do you want individual workers to stop, or would you rather have the whole training process stopped? In the latter case the callback on master is enough and we just need to evaluate the callback properly.

Come to think of it, maybe this is what you want. I.e. do you just want to register callbacks on the master network? In the end, it doesn't really matter where the updates come from.
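To make the master-only option concrete, here is a small simulation under stated assumptions: the master sees one loss value after each round of worker updates and alone decides whether to stop the whole run. `MasterEarlyStopping` and `train` are hypothetical names loosely modelled on the patience idea behind Keras's `EarlyStopping`; they are not elephas or Keras APIs.

```python
from typing import List


class MasterEarlyStopping:
    """Hypothetical master-side check: stop when the loss has not improved
    for `patience` consecutive rounds."""

    def __init__(self, patience: int = 2, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.wait = 0

    def should_stop(self, master_loss: float) -> bool:
        if master_loss < self.best - self.min_delta:
            # Improvement: remember it and reset the patience counter.
            self.best = master_loss
            self.wait = 0
            return False
        self.wait += 1
        return self.wait >= self.patience


def train(master_losses: List[float], stopper: MasterEarlyStopping) -> int:
    """Simulate training rounds and return how many rounds actually ran."""
    rounds_run = 0
    for loss in master_losses:
        rounds_run += 1  # one round = workers train, master applies their updates
        if stopper.should_stop(loss):
            break
    return rounds_run


# Made-up master losses, one per round.
simulated_losses = [0.9, 0.7, 0.71, 0.72, 0.5]
rounds = train(simulated_losses, MasterEarlyStopping(patience=2))
```

With this design the workers never need to know about the callback at all; only the loop that drives the rounds has to consult it.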

You see, there are a lot of questions :D

@maxpumperla changed the title from "Add the functionality to get the history from the fit function" to "Integrate callback functionality into elephas (History, checkpoints, etc.)" on Feb 11, 2019
@maxpumperla
Owner

after all, elephas' master network is a keras model, which can have callbacks. it may really just be a question of how to incorporate this into the top level API of elephas.

@sd12832
Author

sd12832 commented Feb 11, 2019

So this is my understanding of how this could work in Elephas right now (please correct me if I'm wrong). We can just serialize the callbacks, which are registered on the master network, and ship them to the workers. Once we get the data back from the callbacks (grouped together?), we can accumulate it.

I think early stopping should cause the entire training process to stop, right? Why would we want early stopping to happen on a single node? So the callback on the master should be enough?

I think the top-level API should work the same way as Keras: the fit function returns the history automatically, and callbacks can be passed in.

@sd12832
Author

sd12832 commented Feb 12, 2019

Any updates? I can start exploring the code today.

@maxpumperla
Owner

@sd12832 I can barely follow my github notifications these days, sorry. so no "updates" from me. But your reasoning sounds good, if you poke around a little and see what we can do, I'm happy to discuss here. Let's come up with a plan first and execute in a bit (I can help).

@Ben-Epstein

Hello, are there any updates on this? I think this is a crucial tool

@danielenricocahall
Collaborator

I will start working on this for the next release, as it seems like a useful and desired feature!

@OscarDPan
Contributor

Hi @danielenricocahall, I was wondering whether you have started on this yet? I think for now Elephas doesn't support any callbacks? I was thinking I could help by adding some high-level integration to accept Callback objects. If so, maybe I can start working on it sometime next week.

@danielenricocahall
Collaborator

danielenricocahall commented Feb 12, 2021

I started working on it but I've been a bit busy - nothing too groundbreaking, just playing with the idea of making the callback objects broadcast variables. There were a few minor bumps I hit that I haven't worked around yet - I can push up my branch over the weekend in case you want to use it for reference, or if you want to collaborate on things.

@hassanmehmud

Hello, is the feature working now? How can we get the loss for every epoch when using Pipelines?

Thank you for your support

@danielenricocahall
Collaborator

Moved this issue to the new fork: danielenricocahall#9. Closing for now but still on the radar!


6 participants