Why not to select another subset S' #5

Closed
Gharibim opened this issue Feb 11, 2021 · 3 comments

Comments

@Gharibim

Thank you so much for your efforts (papers + code).
There are a few pieces I did not understand. Could you please help?

Q1: In the FedDANE paper, the algorithm first chooses a subset S to compute the gradients, and then chooses another subset S' to run the actual training (updating the clients' weights). However, in the FedDANE trainer in your code, the same seed is passed in both places:
selected_clients = self.select_clients(i, num_clients=self.clients_per_round)
This call appears on lines 28 and 39 of the FedDANE trainer, so the same subset is chosen again rather than another subset S'.

Q2: The algorithm states that the averaging takes place over the subset S, not over S', which is the subset we actually trained. Is that a typo? If not, could you please explain why we train on S' and then average over a different set S?

Q3: When we run the first training loop to average the gradients, we train for one epoch only, right? Running more than one epoch would overwrite the gradients.

Q4: Finally, I believe your code assumes that none of the devices will drop out. Is that correct?

Thank you so much for your time!

@litian96
Owner

Thanks for your questions.

Q1. We have tried both versions (selecting the same subset of devices for estimating the gradient and for updating, or selecting a different one), and neither of them has good empirical performance (which is part of the message of the paper).

Q2. It is not a typo. To adapt DANE to federated settings, one way is to use a subset of devices to estimate the average gradients in the gradient correction term, and this subset doesn't need to be the same as the subset of devices we choose to update the model (see Section C in the paper for details).

Q3. What do you mean by 'first training loop'? To get the average gradients, we don't perform any training (i.e., don't apply the gradients).

Q4. We allow for partial device participation, which takes care of the issue that some devices may drop out of the network. But we assume that none of the selected devices drops out after it is selected and before it sends back its update.
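
To make the Q1/Q2 distinction concrete, here is a minimal sketch of per-round client sampling. It is not the repository's code: the `stream` parameter and the string-based seeding are assumptions introduced only for illustration. Seeding from the round index alone returns the same subset on every call within a round; mixing in a stream label draws a separate subset S' for the update step.

```python
import random

def select_clients(client_ids, round_idx, num_clients, stream="grad"):
    """Sample client ids without replacement, reproducibly per round.

    `stream` is a hypothetical label: calls with the same round and the
    same stream return the same subset, while a different stream gives a
    separately seeded draw (e.g. S for gradients, S' for updates).
    """
    num_clients = min(num_clients, len(client_ids))
    rng = random.Random(f"{round_idx}:{stream}")  # deterministic per (round, stream)
    return sorted(rng.sample(client_ids, num_clients))

clients = list(range(100))
S = select_clients(clients, round_idx=3, num_clients=10)                         # gradient subset
S_same = select_clients(clients, round_idx=3, num_clients=10)                    # identical to S
S_prime = select_clients(clients, round_idx=3, num_clients=10, stream="update")  # separate draw
```

Only devices returned by the sampler take part in a round, which is the partial participation described in Q4; the sketch does not model a device dropping out after it has been selected.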

@Gharibim
Author

Gharibim commented Feb 12, 2021

Thank you so much for your time!

Let me ask Q3 in a different way. To get the average gradients, we first need to collect the gradients, and to get the gradients in a specific round, we need to run forward prop and then backward prop (to generate the gradients) without applying them (since we don't want to update the weights). Is that correct?
Or do we just collect the current gradients (which were generated in the previous epoch) from all the models and average them? If that is the case, how do we average the gradients in the first round and first epoch, when the gradients are still null?

Many thanks for your help!

@litian96
Owner

litian96 commented Feb 14, 2021

The update rule (Eq. 3) requires (1) first computing the average gradient at w^{t-1}, and then (2) having each selected device solve its local subproblem to update w^{t-1}. Therefore, it needs two communication rounds. This is adapted from DANE. In the first round, the models are randomly initialized: w^0 is provided as an input, so the gradients are evaluated at w^0.
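
The two steps can be sketched on a toy problem. This is an illustration under stated assumptions, not the repository's implementation: each device has a scalar quadratic loss F_k(w) = 0.5 * a_k * (w - b_k)^2, the local subproblem is assumed to have the DANE form F_k(w) + (g - grad F_k(w^{t-1})) * w + (mu/2) * (w - w^{t-1})^2, and the final averaging is taken over the devices that produced the local solutions. All function names are hypothetical.

```python
def grad(device, w):
    """Gradient of the toy local loss F_k(w) = 0.5 * a * (w - b)**2."""
    a, b = device
    return a * (w - b)

def average_gradient(devices, subset, w_prev):
    """Step 1: evaluate gradients at w_prev and average them.

    No weights are updated here; the gradients are only collected.
    """
    return sum(grad(devices[k], w_prev) for k in subset) / len(subset)

def solve_local(device, w_prev, g_avg, mu, lr=0.05, steps=500):
    """Step 2: inexactly minimize the gradient-corrected subproblem
    F_k(w) + (g_avg - grad F_k(w_prev)) * w + (mu / 2) * (w - w_prev)**2
    with plain gradient descent started at w_prev.
    """
    correction = g_avg - grad(device, w_prev)
    w = w_prev
    for _ in range(steps):
        w -= lr * (grad(device, w) + correction + mu * (w - w_prev))
    return w

def feddane_round(devices, grad_subset, update_subset, w_prev, mu):
    """One round: average gradients over one subset, then update on another."""
    g_avg = average_gradient(devices, grad_subset, w_prev)
    local = [solve_local(devices[k], w_prev, g_avg, mu) for k in update_subset]
    return sum(local) / len(local)  # assumption: average the local solutions

devices = [(1.0, 0.0), (2.0, 1.0), (3.0, -1.0)]  # (a_k, b_k) per device
w0 = 0.5                                          # initial model, given as input
w1 = feddane_round(devices, grad_subset=[0, 1], update_subset=[1, 2], w_prev=w0, mu=1.0)
```

Because step 1 only evaluates gradients at the current model, the first round needs nothing beyond w^0. For this quadratic toy loss the subproblem has the closed-form solution w^{t-1} - g / (a_k + mu), which is a convenient check on the gradient-descent solver.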
