Skip to content

Latest commit

 

History

History
26 lines (13 loc) · 4.57 KB

File metadata and controls

26 lines (13 loc) · 4.57 KB

FLoC Clustering Algorithm: k Random Centers

Based on Federated Learning of Cohorts (FLoC) original proposal, instead of including a unique identifier per user (such as a cookie) in each bid request, we assign each user to a large cohort of users with similar browsing histories, and include only cohort IDs in bid requests. A cohort ID reveals a user’s general interests, but since any single cohort ID is shared by many users, it cannot be easily used to track individual browsing behavior across sites.

We determine a user’s cohort ID by applying a locality-sensitive hash function to a vector representing the user’s browsing history. There are many ways to encode a user’s browsing history as a vector formula. For example, we can number all websites from formula to formula, and then let formula be a formula-dimensional vector whose formulath coordinate is formula if the user visited the formulath website and formula otherwise.

k-random centers is a simple locality-sensitive hash function. Given a formula-dimensional input vector formula, the hash value formula is determined as follows:

  1. Choose formula random points (‘centers’) in formula-dimensional space, and number them formula to formula.
  2. Let formula be the index of the center closest to x according to cosine similarity

We let formula be the cohort ID of a user with history vector formula. For comparison with other algorithms, note that since the cohort ID is the index of a center, the number of bits required to encode the ID is formula.

See figure below for an illustration of this hash function.

alt text

Each input vector is assigned to its closest center, and the hash value is the index of that center. In this example, vector formula is assigned to center 1, and vectors formula and formula are assigned to center 2.

Another way to understand the k-random centers hash function is to note that it is essentially equivalent to the first iteration of the standard k-means clustering algorithm.

A key advantage of k-random centers (and any locality-sensitive hash function) is that it can be implemented in a fully distributed manner without the need for a central server that stores private user data. If all users use the same pseudo-random seed to generate the random centers, then each user’s cohort ID can be calculated independently by each user without any communication or coordination among the users, and yet similar browsing histories will nonetheless be mapped to the same cohort ID.

The main downsides of a fully distributed cohort assignment algorithm are that (1) within-cohort similarity is weaker than for a more centralized algorithm, and (2) a minimum cohort size cannot be enforced. With respect to the latter, we have observed that in practice cohort sizes are roughly uniformly distributed, with the number of users per cohort decreasing as the number of centers increases.