Following the original Federated Learning of Cohorts (FLoC) proposal, instead of including a unique identifier per user (such as a cookie) in each bid request, we assign each user to a large cohort of users with similar browsing histories and include only cohort IDs in bid requests. A cohort ID reveals a user’s general interests, but because each cohort ID is shared by many users, it cannot easily be used to track an individual’s browsing behavior across sites.
We determine a user’s cohort ID by applying a locality-sensitive hash function to a vector representing the user’s browsing history. There are many ways to encode a user’s browsing history as a vector $x$. For example, we can number all websites from $1$ to $d$, and then let $x$ be a $d$-dimensional vector whose $i$th coordinate is $1$ if the user visited the $i$th website and $0$ otherwise.
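As a minimal sketch of this encoding, the snippet below builds the 0/1 vector for one user. The site names, the `site_index` mapping, and the `encode_history` helper are hypothetical and chosen only for illustration. Python lists are 0-indexed, so website number $i$ in the text corresponds to position $i - 1$ here.

```python
def encode_history(visited_sites, site_index):
    """Return a 0/1 vector with one coordinate per known website."""
    x = [0] * len(site_index)
    for site in visited_sites:
        # Sites outside the catalogue are ignored (an assumption of this sketch).
        if site in site_index:
            x[site_index[site]] = 1
    return x

# Hypothetical catalogue of d = 5 websites, numbered in list order.
sites = ["news.example", "shop.example", "mail.example",
         "video.example", "maps.example"]
site_index = {site: i for i, site in enumerate(sites)}

x = encode_history({"shop.example", "video.example"}, site_index)
print(x)  # [0, 1, 0, 1, 0]
```

Other encodings, such as visit counts or weighted coordinates, would fit the same interface; the binary version is only the example given in the text.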
k-random centers is a simple locality-sensitive hash function. Given a $d$-dimensional input vector $x$, the hash value $h(x)$ is determined as follows:
- Choose $k$ random points (‘centers’) in $d$-dimensional space, and number them $1$ to $k$.
- Let $h(x)$ be the index of the center closest to $x$ according to cosine similarity.
We let $h(x)$ be the cohort ID of a user with history vector $x$. For comparison with other algorithms, note that since the cohort ID is the index of a center, the number of bits required to encode the ID is $\lceil \log_2 k \rceil$.
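The two steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reference implementation: the text only says the centers are random, so drawing their coordinates from a standard Gaussian is our choice, and the function names (`random_centers`, `cohort_id`, `id_bits`) are hypothetical.

```python
import math
import random

def random_centers(k, d, seed):
    """Draw k pseudo-random centers in d dimensions (Gaussian: an assumption)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for an all-zero vector (an assumption)
    return dot / (norm_u * norm_v)

def cohort_id(x, centers):
    """Return the 1-based index of the center most similar to x."""
    sims = [cosine_similarity(x, c) for c in centers]
    return 1 + max(range(len(centers)), key=lambda j: sims[j])

def id_bits(k):
    """Bits needed to encode one of k IDs, i.e. ceil(log2(k)), in integer math."""
    return (k - 1).bit_length()

centers = random_centers(k=8, d=5, seed=42)
x = [0, 1, 0, 1, 0]
print(cohort_id(x, centers))  # an integer between 1 and 8
print(id_bits(8))             # 3
```

Because cosine similarity ignores vector length, scaling a history vector by a positive constant does not change its cohort ID in this sketch.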
See figure below for an illustration of this hash function.
Each input vector is assigned to its closest center, and the hash value is the index of that center. In this example, one vector is assigned to center 1, and two vectors are assigned to center 2.
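Another way to understand the k-random centers hash function is to note that it is essentially equivalent to the first iteration of the standard k-means clustering algorithm, stopped after the assignment step: the centers are initialized at random, each point is assigned to its closest center, and the centers are never updated afterwards.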
A key advantage of k-random centers (and of any locality-sensitive hash function) is that it can be implemented in a fully distributed manner, without the need for a central server that stores private user data. If all users use the same pseudo-random seed to generate the random centers, then each user’s cohort ID can be calculated independently by that user, without any communication or coordination among users, and yet similar browsing histories will nonetheless tend to be mapped to the same cohort ID.
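The shared-seed idea can be illustrated with a small simulation in which every client regenerates the centers locally and computes its own ID. The seed value, the parameters `K` and `D`, and the `local_cohort_id` helper are hypothetical, and Gaussian centers are again our assumption. A real deployment would also need every client to use the same fully specified pseudo-random generator, which this sketch takes for granted.

```python
import math
import random

SHARED_SEED = 2021  # hypothetical value known to every client
K, D = 16, 6        # number of centers and vector dimension (illustrative)

def local_cohort_id(x, k=K, d=D, seed=SHARED_SEED):
    """Compute a cohort ID on-device: regenerate the centers, pick the closest."""
    rng = random.Random(seed)
    centers = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]

    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0

    return 1 + max(range(k), key=lambda j: cos(x, centers[j]))

# Two simulated devices that never exchange data.
alice = [1, 1, 0, 1, 0, 0]
bob = [1, 1, 0, 1, 0, 0]  # same history as alice, computed separately

print(local_cohort_id(alice) == local_cohort_id(bob))  # True
```

Identical histories always collide under a shared seed. Histories that are merely similar are likely, but not guaranteed, to land in the same cohort.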
The main downsides of a fully distributed cohort assignment algorithm are that (1) within-cohort similarity is weaker than for a more centralized algorithm, and (2) a minimum cohort size cannot be enforced. With respect to the latter, we have observed that in practice cohort sizes are roughly uniformly distributed, with the number of users per cohort decreasing as the number of centers increases.
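One way to inspect cohort sizes is to count users per cohort for several values of $k$. The sketch below does this on synthetic data, so it does not reproduce the observations reported above: the user vectors are random 0/1 vectors, the centers are Gaussian, and the `cohort_sizes` helper is hypothetical. It does show that nothing in the algorithm prevents a cohort from being very small or empty.

```python
import math
import random
from collections import Counter

def cohort_sizes(users, k, seed=0):
    """Count users per cohort for k pseudo-random Gaussian centers."""
    d = len(users[0])
    rng = random.Random(seed)
    centers = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]
    norms = [math.sqrt(sum(c * c for c in center)) for center in centers]
    counts = Counter()
    for x in users:
        # Dividing by the norm of x would not change the argmax, so it is skipped.
        # An all-zero x scores 0 everywhere and falls into cohort 1 by tie-breaking.
        scores = [sum(a * b for a, b in zip(x, c)) / n
                  for c, n in zip(centers, norms)]
        counts[1 + scores.index(max(scores))] += 1
    return [counts.get(j, 0) for j in range(1, k + 1)]

# Synthetic population: 1000 users, 20 websites, each visited with probability 0.3.
data_rng = random.Random(1)
users = [[int(data_rng.random() < 0.3) for _ in range(20)] for _ in range(1000)]

for k in (4, 16, 64):
    sizes = cohort_sizes(users, k)
    print(k, min(sizes), sum(sizes) / k, max(sizes))  # k, min, mean, max size
```

The mean size is simply the number of users divided by $k$, so it necessarily shrinks as centers are added; the minimum and maximum show how uneven the assignment is for a given seed.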