# Forget-SVGD: Particle-Based Bayesian Federated Unlearning

![1](https://drive.google.com/uc?export=view&id=1fPNP4yab9B3wPvDJGYJkV2sawNwGMzck)

## Setup

We consider a federated learning set-up with a set $\mathcal{K}=\{1,\ldots,K\}$ of $K$ agents communicating with a central node in a parameter server architecture. The local data set $\mathcal{D}_{k}=\{z_{k,n} \}_{n=1}^{N_k}$ of agent $k\in\mathcal{K}$ contains $N_k$ data points, and the associated training loss for model parameter $\theta$ is defined as
$$L_k(\theta)=\frac{1}{N_k}\sum_{n=1}^{N_k} \ell_k(z_{k,n}|\theta)$$
for some loss function $\ell_k(z|\theta)$. We also denote as $\mathcal{D}=\bigcup_{k=1}^K\mathcal{D}_k$ the global data set.

In variational Bayesian federated learning, the agents collectively aim at obtaining a *variational distribution* $q(\theta)$ on the model parameter space that minimizes the *global free energy*
\begin{align}
\min_{q(\theta)} \bigg\{ F(q(\theta))\!=\!\sum_{k=1 }^K&\mathbb{E}_{\theta\sim q(\theta)}[L_k(\theta)]+\alpha\cdot \mathbb{D}\big( q(\theta)\big\|p_0 (\theta)\big) \bigg\},
\end{align}
where $\alpha > 0$ is a "temperature" parameter, $\mathbb{D}\left(\cdot\|\cdot\right)$ denotes the Kullback–Leibler (KL) divergence, and $p_0(\theta)$ is a prior distribution. Minimizing the global free energy $F(q(\theta))$ seeks a distribution $q(\theta)$ that is close to the prior $p_0 (\theta)$ while also keeping the expected sum of the local training losses small. The unconstrained minimizer of the global free energy is the *global generalized posterior distribution*
\begin{align}
q^*(\theta|\mathcal{D})&=\frac{1}{Z}\cdot \tilde{q}^*(\theta|\mathcal{D})
\end{align} 
\begin{align}
\textrm{where}\quad\tilde{q}^*(\theta|\mathcal{D})&=p_0 (\theta) \exp \left(-\frac{1}{\alpha}\sum_{k=1}^K L_k(\theta)\right),
\end{align}
which coincides with the conventional posterior $p\big(\theta|\mathcal{D}\big)$ when we set $\alpha=1$ and the loss function is given by the log-loss $\ell_k(z|\theta)=-\log p(z|\theta)$.
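To make the generalized posterior concrete, the following minimal sketch evaluates $\log \tilde{q}^*(\theta|\mathcal{D})$ on a one-dimensional grid and normalizes it numerically. The quadratic loss, the standard Gaussian prior, the toy data, and all function names are assumptions chosen for illustration; they do not come from the source.

```python
import numpy as np

def log_unnormalized_posterior(theta, local_losses, log_prior, alpha=1.0):
    """log q~*(theta|D) = log p0(theta) - (1/alpha) * sum_k L_k(theta)."""
    total_loss = sum(loss(theta) for loss in local_losses)
    return log_prior(theta) - total_loss / alpha

def make_quadratic_loss(data):
    """Assumed toy loss: L_k(theta) = mean_n 0.5 * (z_{k,n} - theta)^2."""
    data = np.asarray(data, dtype=float)
    return lambda theta: 0.5 * np.mean((data[:, None] - theta[None, :]) ** 2, axis=0)

def standard_normal_log_prior(theta):
    """Assumed prior p0 = N(0, 1)."""
    return -0.5 * theta ** 2 - 0.5 * np.log(2.0 * np.pi)

# Two agents with hypothetical local data sets.
local_losses = [make_quadratic_loss([1.0, 3.0]), make_quadratic_loss([2.0, 2.0, 5.0])]

grid = np.linspace(-10.0, 10.0, 4001)
dx = grid[1] - grid[0]
log_q = log_unnormalized_posterior(grid, local_losses, standard_normal_log_prior, alpha=1.0)

# Approximate the normalizing constant Z with a Riemann sum on the grid.
q = np.exp(log_q - log_q.max())
q /= q.sum() * dx
posterior_mean = np.sum(grid * q) * dx
```

For this toy model the posterior is Gaussian with precision $1 + 2/\alpha$, so with $\alpha=1$ its mean is $(2+3)/3=5/3$, which `posterior_mean` reproduces numerically. Increasing `alpha` pulls the result toward the prior mean at zero.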

In practice, the minimization of the global free energy can only be carried out approximately, by either: (*i*) assuming a parametric form for the variational posterior $q(\theta|\mathcal{D})$, e.g., a Gaussian probability density function; or (*ii*) representing the variational posterior $q(\theta|\mathcal{D})$ in a non-parametric fashion through a number of samples, or particles, $\{\theta_n\}^{N}_{n=1}$.

## Federated Machine Unlearning

In this paper, we focus on the problem of machine unlearning. Accordingly, the goal is to "remove" information about the data of a subset $\mathcal{U}\subset \mathcal{K}$ of agents from the approximate solution $q(\theta|\mathcal{D})$ of the federated learning problem, i.e., of the minimization of the global free energy. Clearly, one could obtain a variational posterior $q(\theta|\mathcal{D}_{-\mathcal{U}})$ by retraining from scratch using the data set $\mathcal{D}_{-\mathcal{U}}=\mathcal{D}\setminus \mathcal{D}_{\mathcal{U}}$ that excludes the data sets $\mathcal{D}_{\mathcal{U}}=\{\mathcal{D}_k \}_{k\in\mathcal{U}}$ of the agents whose data are to be "forgotten". However, this may be costly in terms of computation and convergence time. Machine unlearning is concerned with developing more efficient protocols that do not require complete retraining from scratch.

We follow the variational unlearning formulation from prior work, whereby unlearning of a data set $\mathcal{D}_{\mathcal{U}}$ from the variational posterior $q(\theta|\mathcal{D})$ is formulated as the minimization of the *unlearning free energy*
\begin{align}
\min_{q(\theta)} \bigg\{ F_{\mathcal{U}}(q(\theta))= \sum_{k\in\mathcal{U}}&\mathbb{E}_{\theta\sim q(\theta)}[-L_k (\theta)]\nonumber\\
&+\alpha\cdot \mathbb{D}(q(\theta)\|q(\theta|\mathcal{D}))\bigg\}.
\end{align}
Note that, unlike the global free energy $F(q(\theta))$, the unlearning free energy $F_{\mathcal{U}}(q(\theta))$ includes as its first term the negative of the training loss of the data to be removed, while the role played by the prior $p_0 (\theta)$ in the global free energy is taken here by the variational posterior $q(\theta|\mathcal{D})$. Intuitively, minimization of the unlearning free energy aims at finding a distribution $q(\theta)$ that is close to the current variational posterior $q(\theta|\mathcal{D})$, while maximizing the average training loss for the data sets to be forgotten.
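As a short worked step, the unlearning free energy has the same structure as the global free energy, so its unconstrained minimizer can be written in the same Gibbs form. The notation $q_{\mathcal{U}}^*(\theta)$ is introduced here for illustration only:
\begin{align}
q_{\mathcal{U}}^*(\theta)\propto q(\theta|\mathcal{D})\exp\left(\frac{1}{\alpha}\sum_{k\in\mathcal{U}}L_k(\theta)\right).
\end{align}
Under the idealized assumption that learning was exact, i.e., $q(\theta|\mathcal{D})=q^*(\theta|\mathcal{D})$, the loss terms of the agents in $\mathcal{U}$ cancel:
\begin{align}
q_{\mathcal{U}}^*(\theta)\propto p_0(\theta)\exp\left(-\frac{1}{\alpha}\sum_{k\in\mathcal{K}\setminus\mathcal{U}}L_k(\theta)\right),
\end{align}
which is the generalized posterior that retraining from scratch on $\mathcal{D}_{-\mathcal{U}}$ would target. With an approximate variational posterior, this equivalence holds only approximately.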

## Distributed SVGD

In distributed SVGD (DSVGD), the server maintains a set of $N$ particles $\{\theta_n \}_{n=1}^N$ that are iteratively updated by a subset of agents. We focus here on the case of a single agent scheduled at each iteration, since the extension to more than one agent is direct. DSVGD minimizes the global free energy over the variational distribution $q(\theta)$ in a distributed manner. At the beginning of the $i$-th iteration, the server stores the current particles $\{\theta_n^{(i-1)} \}_{n=1}^N$, which represent the current iterate $q^{(i-1)}(\theta)$ of the global variational distribution. An explicit estimate of the distribution $q^{(i-1)} (\theta)$ can be obtained, e.g., via a kernel density estimator (KDE) with some kernel function $K(\theta, \theta')$. Following expectation propagation (EP) and partitioned variational inference (PVI), DSVGD interprets the variational distribution $q^{(i-1)}(\theta)$ as being factorized as $q^{(i-1)}(\theta)=p_0(\theta)\prod_{k=1}^K t_k^{(i-1)}(\theta)$, where the term $t_k^{(i-1)} (\theta)$ is known as the approximate likelihood of agent $k$. At each iteration $i$, the scheduled agent $k$ updates the variational distribution $q^{(i-1)}(\theta)$ by modifying its approximate likelihood to a new iterate $t_{k}^{(i)}(\theta)$.
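The KDE step mentioned above can be sketched as follows. This is a minimal illustration that assumes an isotropic Gaussian kernel with a fixed bandwidth; the source only refers to "some kernel function", so the kernel choice, the bandwidth value, and the function name `kde_log_density` are assumptions.

```python
import numpy as np

def kde_log_density(theta, particles, bandwidth):
    """Log-density of an isotropic Gaussian KDE built on `particles`.

    theta:     array of shape (M, d), evaluation points
    particles: array of shape (N, d), e.g. the particles stored at the server
    returns:   array of shape (M,)
    """
    theta = np.asarray(theta, dtype=float)
    particles = np.asarray(particles, dtype=float)
    d = particles.shape[1]
    diff = theta[:, None, :] - particles[None, :, :]          # (M, N, d)
    sq_dist = np.sum(diff ** 2, axis=-1)                      # (M, N)
    log_kernel = (-0.5 * sq_dist / bandwidth ** 2
                  - 0.5 * d * np.log(2.0 * np.pi * bandwidth ** 2))
    # Log-mean-exp over the particles for numerical stability.
    peak = log_kernel.max(axis=1, keepdims=True)
    return peak[:, 0] + np.log(np.mean(np.exp(log_kernel - peak), axis=1))

# Hypothetical one-dimensional particles.
particles = np.array([[-1.0], [0.0], [0.5], [2.0]])
grid = np.linspace(-8.0, 8.0, 3201)[:, None]
density = np.exp(kde_log_density(grid, particles, bandwidth=0.5))
mass = density.sum() * (grid[1, 0] - grid[0, 0])   # should be close to 1
```

Working in the log domain is convenient here because the cavity and tilted distributions are ratios of such estimates, which become differences of log-densities.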

To this end, at each iteration $i$, the scheduled agent $k$ updates the current set of particles $\{\theta_n^{(i-1)} \}_{n=1}^N$ with the goal of minimizing the *local free energy*
\begin{align}
 \min_{q(\theta)} \bigg\{F_k^{(i)}(q(\theta))=\mathbb{E}_{\theta\sim q(\theta)}[L_{k}(\theta)]+\alpha \mathbb{D}\big(q(\theta)\big\| \hat{p}_k^{(i)}(\theta)\big)\bigg\},
\end{align}
where the so-called cavity distribution $\hat{p}_k^{(i)}(\theta)$
\begin{align}
\hat{p}_k^{(i)}(\theta)\propto\frac{q^{(i-1)}(\theta)}{t_{k}^{(i-1)}(\theta)}
\end{align}
removes the approximate local likelihood of agent $k$, which is updated as 
\begin{align}
t_k^{(i)}(\theta)=\frac{q^{(i)}(\theta)}{q^{(i-1)}(\theta)} t_{k}^{(i-1)}(\theta).
\end{align}
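
As a consistency check on the three definitions above, consider the idealized case in which the local free energy is minimized exactly, which the particle-based updates only approximate. The minimizer then has the same Gibbs form as the global generalized posterior, with the cavity distribution in place of the prior:
\begin{align}
q^{(i)}(\theta)\propto \hat{p}_k^{(i)}(\theta)\exp\left(-\frac{1}{\alpha}L_k(\theta)\right).
\end{align}
Combining this with $\hat{p}_k^{(i)}(\theta)\propto q^{(i-1)}(\theta)/t_k^{(i-1)}(\theta)$, the approximate likelihood update gives
\begin{align}
t_k^{(i)}(\theta)=\frac{q^{(i)}(\theta)}{q^{(i-1)}(\theta)}\,t_k^{(i-1)}(\theta)\propto\frac{q^{(i)}(\theta)}{\hat{p}_k^{(i)}(\theta)}\propto\exp\left(-\frac{1}{\alpha}L_k(\theta)\right).
\end{align}
Under this assumption, the approximate likelihood of agent $k$ matches its tempered local likelihood up to a constant, and the factorization $q^{(i)}(\theta)=p_0(\theta)\,t_k^{(i)}(\theta)\prod_{k'\neq k}t_{k'}^{(i-1)}(\theta)$ is preserved.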

## Forget-SVGD
- **Initialization.** The initial set of $N$ particles $\{\theta_n^{(0)}\}_{n=1}^N$ represents the variational distribution obtained as a result of Bayesian federated learning. The local particles $\{\theta_{k,n}^{(0)} \}_{n=1}^N$ are initialized at random for all agents $k\in \mathcal{U}$.

- **Step 1.** At iteration $i$, the server schedules an agent $k\in\mathcal{U}$ in the set of agents whose data must be unlearned. Agent $k$ downloads the current global particles $\{\theta_n^{(i-1)} \}_{n=1}^N$ from the server.

- **Step 2.** Agent $k$ initializes $\{\theta_{n}^{[0]}=\theta_{n}^{(i-1)}\}_{n=1}^N$ and updates the downloaded particles through $L$ local SVGD iterations, where the (unnormalized) tilted distribution $\tilde{p}_k^{(i)}(\theta)$ targeted by these iterations is defined as
\begin{align}
\tilde{p}_k^{(i)}(\theta)=\frac{q^{(i-1)}(\theta)}{t_k^{(i-1)}(\theta)} \exp \left(\frac{1}{\alpha} L_k(\theta)\right),
\end{align}
where $q^{(i-1)}(\theta)$ and $t_k^{(i-1)}(\theta)$ are estimated via KDEs built on the global and local particles, respectively.

- **Step 3.** Agent $k$ sets $\{\theta_n^{(i)}=\theta_n^{[L]}\}_{n=1}^N$. The updated particles $\{\theta_n^{(i)} \}_{n=1}^N$ are sent to the server, which sets $\{\theta_n = \theta_n^{(i)} \}_{n=1}^N$. Agent $k$ then updates its local particles via $L_{\textrm{local}}$ distillation steps and stores the result as its current local particles, $\{\theta_{k,n}^{(i)}=\theta_{k,n}^{[L_{\textrm{local}}]}\}_{n=1}^N$, while the other agents $k'\neq k$ set $\{\theta_{k',n}^{(i)} =\theta_{k',n}^{(i-1)} \}_{n=1}^N$.
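
The local update in Step 2 can be sketched in code as below. This is a rough illustration under several assumptions that are not taken from the source: it uses the standard SVGD update with an RBF kernel, Gaussian KDEs with fixed bandwidths for $q^{(i-1)}(\theta)$ and $t_k^{(i-1)}(\theta)$, a toy quadratic loss, and hypothetical names such as `forget_svgd_local_update`. The distillation of local particles in Step 3 is omitted.

```python
import numpy as np

def rbf_kernel_and_grad(x, h):
    """k[j, n] = k(x_j, x_n); grad_k[j, n] = gradient of k(x_j, x_n) w.r.t. x_j."""
    diff = x[:, None, :] - x[None, :, :]                      # x_j - x_n, (N, N, d)
    k = np.exp(-0.5 * np.sum(diff ** 2, axis=-1) / h ** 2)
    grad_k = -diff / h ** 2 * k[:, :, None]
    return k, grad_k

def kde_grad_log(theta, centers, h):
    """Gradient w.r.t. theta of the log of a Gaussian KDE built on `centers`."""
    diff = centers[None, :, :] - theta[:, None, :]            # c_j - theta_m, (M, N, d)
    log_w = -0.5 * np.sum(diff ** 2, axis=-1) / h ** 2
    w = np.exp(log_w - log_w.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return np.einsum("mn,mnd->md", w, diff) / h ** 2

def forget_svgd_local_update(global_particles, local_particles, grad_loss,
                             alpha=1.0, num_steps=50, step_size=0.1,
                             h_svgd=0.5, h_kde=0.5):
    """Run L = num_steps SVGD iterations on the tilted distribution of Step 2."""
    theta = global_particles.copy()                           # theta^[0] = theta^(i-1)
    for _ in range(num_steps):
        # grad log p~ = grad log q^(i-1) - grad log t_k^(i-1) + (1/alpha) grad L_k
        grad_log_tilted = (kde_grad_log(theta, global_particles, h_kde)
                           - kde_grad_log(theta, local_particles, h_kde)
                           + grad_loss(theta) / alpha)
        k, grad_k = rbf_kernel_and_grad(theta, h_svgd)
        # Standard SVGD direction: kernel-weighted gradients plus repulsion.
        phi = (k.T @ grad_log_tilted + grad_k.sum(axis=0)) / theta.shape[0]
        theta = theta + step_size * phi
    return theta                                              # theta^[L]

# Toy one-dimensional setting with hypothetical data for the agent to forget.
data_k = np.array([1.0, 3.0])

def loss(theta):
    return 0.5 * np.mean((theta[:, :1] - data_k[None, :]) ** 2, axis=1)

def grad_loss(theta):
    return theta - data_k.mean()

global_particles = np.linspace(1.0, 2.0, 20)[:, None]        # stands in for q^(i-1)
local_particles = np.linspace(0.0, 4.0, 20)[:, None]         # stands in for t_k^(i-1)

loss_before = loss(global_particles).mean()
updated = forget_svgd_local_update(global_particles, local_particles, grad_loss)
loss_after = loss(updated).mean()
```

In this toy run the particles drift away from the minimizer of $L_k(\theta)$, so the average loss on the data to be forgotten increases, while the KDE term for $q^{(i-1)}(\theta)$ keeps them close to the downloaded particles. Dividing by a KDE estimate of $t_k^{(i-1)}(\theta)$ can be numerically delicate in higher dimensions, which this sketch does not address.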

# References
- J. Gong, J. Kang, O. Simeone and R. Kassab, "Forget-SVGD: Particle-Based Bayesian Federated Unlearning," 2022 IEEE Data Science and Learning Workshop (DSLW), 2022, pp. 1-6, doi: 10.1109/DSLW53931.2022.9820602. [[Paper](https://ieeexplore.ieee.org/document/9820602)]