Integrate callback functionality into elephas (History, checkpoints, etc.) #131
Comments
@sd12832 very interesting idea. with elephas, each of the N workers would receive a callback instance. The question is how to consolidate callback data once training is done, i.e. after we send updates back to the master network. A list of callbacks? Suggestions? |
@maxpumperla Maybe each worker gets their own history based on asynchronous callbacks, with an option to average the values among all the workers? We can follow the keras code of having a callback base class that is inherited into different purposes. In terms of the list of callbacks, wouldn't n asynchronous callbacks suffice? Pardon my lack of understanding of the elephas code, I have not looked at it in too much detail. I would very much like to accelerate the building of this feature by helping as much as I can. Please guide me on how to do so! |
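The "average the values among all the workers" option suggested above could look roughly like the sketch below. It assumes each worker returns a plain dict of per-epoch metric lists, shaped like the `history` attribute of a Keras `History` object. The function name `average_histories` and the sample numbers are hypothetical and not part of elephas.

```python
from statistics import mean
from typing import Dict, List

History = Dict[str, List[float]]


def average_histories(worker_histories: List[History]) -> History:
    """Average per-epoch metric values across workers (hypothetical helper)."""
    if not worker_histories:
        return {}
    # Only average metrics that every worker reported.
    shared_keys = set(worker_histories[0])
    for history in worker_histories[1:]:
        shared_keys &= set(history)
    averaged: History = {}
    for key in sorted(shared_keys):
        # zip() stops at the shortest list, so workers that logged
        # fewer epochs simply shorten the averaged curve.
        per_epoch = zip(*(history[key] for history in worker_histories))
        averaged[key] = [mean(values) for values in per_epoch]
    return averaged


# Made-up logs from two workers, two epochs each.
worker_histories = [
    {"loss": [1.0, 0.5], "acc": [0.5, 0.75]},
    {"loss": [3.0, 1.5], "acc": [0.25, 0.25]},
]
merged = average_histories(worker_histories)
print(merged)
```

Keeping the raw per-worker list next to the averaged result would leave both options open: individual histories for debugging, and one merged curve for plotting.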
@sd12832 thanks for your feedback. Yeah, so the first step is to init the master network with a callback, serialize it so that elephas can ship it to workers, and then deserialize the callback and set it on each of the N worker networks. That's the straightforward part (in a way). The question is what to do with the callback data. For model history you basically accumulate logs, right? Hence my specific, naive suggestion of lists. Now, what do you do about things like early stopping? Do you want individual workers to stop, or would you rather have the whole training process stopped? In the latter case the callback on master is enough and we just need to evaluate the callback properly. Come to think of it, maybe this is what you want, i.e. do you just want to register callbacks on the master network? In the end, it doesn't really matter where the updates come from. You see, there are a lot of questions :D
after all, elephas' master network is a keras model, which can have callbacks. it may really just be a question of how to incorporate this into the top level API of elephas. |
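As a rough illustration of the "callback on master is enough" case for early stopping, here is a minimal sketch in plain Python. It assumes the master can compute one aggregated loss after each round of worker updates. `MasterEarlyStopping` and `run_rounds` are hypothetical names and the loss values are made up. This is neither the Keras `EarlyStopping` callback nor anything elephas provides.

```python
from typing import Iterable, Optional


class MasterEarlyStopping:
    """Patience-based stopping rule evaluated only on the master (hypothetical)."""

    def __init__(self, patience: int = 2, min_delta: float = 0.0) -> None:
        self.patience = patience
        self.min_delta = min_delta
        self.best: Optional[float] = None
        self.wait = 0

    def should_stop(self, loss: float) -> bool:
        # An improvement resets the counter; anything else counts against patience.
        if self.best is None or loss < self.best - self.min_delta:
            self.best = loss
            self.wait = 0
            return False
        self.wait += 1
        return self.wait >= self.patience


def run_rounds(losses: Iterable[float], stopper: MasterEarlyStopping) -> int:
    """Return how many update rounds ran before the master stopped training."""
    rounds = 0
    for loss in losses:  # each value stands in for the master's loss after one round
        rounds += 1
        if stopper.should_stop(loss):
            break  # the whole training process stops, not a single worker
    return rounds


losses = [0.9, 0.7, 0.71, 0.72, 0.6]
rounds_run = run_rounds(losses, MasterEarlyStopping(patience=2))
print(rounds_run)
```

Because the decision is made in one place, workers need no stopping logic of their own; they just stop receiving work once the master's rule fires.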
So this is my understanding of how Elephas works right now (please correct me if I'm wrong). We can just serialize the callbacks, which are registered on the master network, and ship them to the workers. Once we get the data from the callbacks (which are grouped together?), we can accumulate it. I think early stopping would result in the entire training process stopping, right? Why would we want early stopping to happen on a single node? Therefore, the callback on the master should be enough? I think the top level API should work in the same manner as Keras, automatically getting the history back from a
Any updates? I can start exploring the code today. |
@sd12832 I can barely follow my github notifications these days, sorry. so no "updates" from me. But your reasoning sounds good, if you poke around a little and see what we can do, I'm happy to discuss here. Let's come up with a plan first and execute in a bit (I can help). |
Hello, are there any updates on this? I think this is a crucial tool |
I will start working on this for the next release, as it seems like a useful and desired feature! |
Hi @danielenricocahall, I was wondering if you have started on this yet? I think for now Elephas doesn't support any Callback? I was wondering whether I could help by adding some high level integration to accept Callback objects? If yes, maybe I can start working on it sometime next week.
I started working on it but I've been a bit busy - nothing too groundbreaking, just playing with the idea of making the callback objects broadcast variables. There were a few minor bumps I hit that I haven't worked around yet - I can push up my branch over the weekend in case you want to use it for reference, or if you want to collaborate on things.
Hello, is the feature working now? How can we get the loss for every epoch using Pipelines? Thank you for the support.
Moved this issue to the new fork: danielenricocahall#9. Closing for now but still on the radar! |
https://github.com/keras-team/keras/blob/master/keras/callbacks.py#L341
The fit function in Keras returns a History object whose recorded loss and metric values can be plotted to determine whether the model is overfitting. This would be very useful in Elephas.
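To make the overfitting check concrete, here is a small sketch that works on a plain dict shaped like the `history` attribute of the Keras `History` object (metric name mapped to a list of per-epoch values). The helper `overfitting_start` and the numbers are hypothetical. The rule it applies, training loss falling while validation loss rises, is only a crude heuristic and not something Keras computes for you.

```python
from typing import Dict, List, Optional


def overfitting_start(history: Dict[str, List[float]]) -> Optional[int]:
    """Return the first 0-based epoch where loss falls but val_loss rises.

    Hypothetical helper: compares each epoch with the one before it and
    returns None if the pattern never shows up.
    """
    loss = history["loss"]
    val_loss = history["val_loss"]
    for epoch in range(1, min(len(loss), len(val_loss))):
        train_improved = loss[epoch] < loss[epoch - 1]
        val_got_worse = val_loss[epoch] > val_loss[epoch - 1]
        if train_improved and val_got_worse:
            return epoch
    return None


# Made-up values in the shape of History.history.
history = {
    "loss": [1.0, 0.6, 0.4, 0.3],
    "val_loss": [1.1, 0.8, 0.85, 0.95],
}
epoch = overfitting_start(history)
print(epoch)
```

Plotting the same two lists against the epoch index gives the curve the issue asks for; the helper just flags the point where the two lines start to diverge.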