Why not select another subset S'? #5
Comments
Thanks for your questions.

Q1. We have tried both versions (selecting the same subset of devices for estimating the gradient and for updating, or selecting different subsets), and neither of them has good empirical performance (which is part of the message in the paper).

Q2. It is not a typo. To adapt DANE to federated settings, one way is to use a subset of devices to estimate the average gradient in the gradient correction term, and this subset doesn't need to be the same as the subset of devices we choose to update the model (see Section C in the paper for details).

Q3. What do you mean by 'first training loop'? To get the average gradients, we don't perform any training (i.e., we don't apply the gradients).

Q4. We allow for partial device participation, which addresses the issue that some devices may drop out of the network. But we assume that none of the selected devices drop after they are selected and before they send back their updates.
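Roughly, the two subsets enter the update as follows (a paraphrase of Eq. (3) and Section C; please see the paper for the exact statement). A subset S_t estimates the average gradient at w^{t-1}, and each device in a possibly different subset S'_t then solves its local subproblem, whose solutions the server averages to form w^t:

```latex
% Paraphrase of one FedDANE round (see Eq. (3) in the paper for the exact form):
% subset S_t estimates the average gradient, subset S'_t solves the local subproblems.
g_t = \frac{1}{|S_t|} \sum_{k \in S_t} \nabla F_k\!\left(w^{t-1}\right)

w_k^{t} \approx \arg\min_{w} \; F_k(w)
  + \left\langle\, g_t - \nabla F_k\!\left(w^{t-1}\right),\, w \right\rangle
  + \frac{\mu}{2}\,\bigl\lVert w - w^{t-1} \bigr\rVert^2,
  \qquad k \in S'_t
```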
Thank you so much for your time! Let me ask Q3 in a different way. In order to get the average gradients, we need to collect the gradients first. To get the gradients in a specific round, we run forward prop and then backward prop (to generate the gradients) without applying the gradients (since we don't want to update the weights). Is that correct? Many thanks for your help!
So the updating rule (Eq. 3) requires (1) first computing the average gradients at w^{t-1}, and then (2) having each selected device solve the local subproblem to update w^{t-1}. Therefore, it needs two communication rounds. This is adapted from DANE. At the start, the models are randomly initialized; w^0 is provided as an input.
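To make the two communication rounds concrete, here is a minimal toy sketch of one round on a synthetic least-squares problem. This is not the repo's trainer: the data, mu, step sizes, and the inner gradient-descent solver are made up for illustration.

```python
# Illustrative sketch of one FedDANE round on a toy least-squares problem (NOT the repo's code).
import numpy as np

rng = np.random.default_rng(0)
d, n_clients, mu, lr, local_steps = 5, 10, 0.1, 0.05, 20

# Each "device" holds its own data (A_k, b_k); F_k(w) = 0.5 * ||A_k w - b_k||^2 / n_k.
clients = [(rng.normal(size=(20, d)), rng.normal(size=20)) for _ in range(n_clients)]

def grad_Fk(A, b, w):
    # Gradient of the local loss at w; computing it does NOT change any weights.
    return A.T @ (A @ w - b) / len(b)

def feddane_round(w_prev, S, S_prime):
    # Communication round 1: devices in S send gradients at w^{t-1}; nothing is applied.
    g_t = np.mean([grad_Fk(*clients[k], w_prev) for k in S], axis=0)

    # Communication round 2: each device in S' approximately solves the local subproblem
    #   min_w F_k(w) + <g_t - grad F_k(w^{t-1}), w> + (mu/2) ||w - w^{t-1}||^2
    # (here by a few gradient-descent steps), and the server averages the solutions.
    updates = []
    for k in S_prime:
        A, b = clients[k]
        corr = g_t - grad_Fk(A, b, w_prev)          # gradient-correction term
        w = w_prev.copy()
        for _ in range(local_steps):
            w -= lr * (grad_Fk(A, b, w) + corr + mu * (w - w_prev))
        updates.append(w)
    return np.mean(updates, axis=0)

w = np.zeros(d)
for t in range(5):
    S = rng.choice(n_clients, size=3, replace=False)
    w = feddane_round(w, S, S)   # the released code reuses the same subset for S and S'
```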
Thank you so much for your efforts (papers + code).
There are a few pieces I did not understand. Would you please help?
Q1: I noticed that the algorithm in the FedDANE paper says we first need to choose a subset S to compute the gradients, and then choose another subset S' to run the actual training (update the clients' weights). However, in your code I noticed that in the FedDANE trainer you are passing the same seed:
selected_clients = self.select_clients(i, num_clients=self.clients_per_round)
This happens at lines 28 and 39 in the FedDANE trainer, so you are choosing the same subset again, not another subset S'.
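For example, I would have expected the second call to draw a different subset S', something along these lines (hypothetical, just to illustrate what I mean; the seed offset is arbitrary):

```python
# Hypothetical illustration only (not in the repo): vary the seed on the second call
# so that the update step draws a different subset S' than the gradient step's S.
grad_clients = self.select_clients(i, num_clients=self.clients_per_round)            # subset S
update_clients = self.select_clients(i + 12345, num_clients=self.clients_per_round)  # subset S'
```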
Q2: In the algorithm, it is mentioned that the averaging will take place over the subset S, not the subset S' that we actually trained. So I was wondering: is that a typo? If not, would you please explain why we need to train S' but then average over another set S?
Q3: When we run the first training loop to collect the gradients for averaging, we train for one epoch only, right? Running more than one epoch would overwrite the gradients.
Q4: Finally, I believe that in your code you assumed none of the devices will drop; is that correct?
Thank you so much for your time!