Consider alternative per-caller filtering methods #114

Open
cilvento opened this issue Nov 8, 2022 · 0 comments

The Topics API aims to reduce the usefulness of topics as a fingerprinting surface by using a coarse taxonomy of ~350 human-curated topics and by returning a uniformly random topic from the taxonomy with 5% probability for each new site visit. In addition, API callers can only receive topics that they have previously observed for the user in a recent epoch. However, this per-caller filtering makes the Topics API harder to use: it reduces recall, and it introduces consistent cross-site signals that may be exploitable (e.g., #74).

Furthermore, per-caller filtering may make it more difficult to reason about changes or expansions to the topics taxonomy. Some feedback has suggested increasing the granularity or size of the taxonomy. However, if the taxonomy grows, recall may worsen because callers may observe disjoint sets of highly granular topics, counteracting any utility gained from the additional detail that topics provide.

Consider approximate per-caller filtering

The Topics API could allow API callers to receive topics within the same sub-tree as a topic that the caller has previously observed for the user. For example, a caller could be allowed to receive any topic that is a descendant of a topic it has observed for the user (which would provide more detail than the caller's prior observation), or a caller could be allowed to receive any topic that is an ancestor of a topic it has observed for the user (which provides less new information). More nuanced criteria could also be applied, e.g., allowing a caller to receive any topic whose least common ancestor with an observed topic is that observed topic's parent.

This would still require callers to have had some interaction with the user on a closely related topic in the past, but may improve recall when the user is assigned a topic that is similar to, but does not exactly match, one the caller has observed.
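The candidate rules above can be compared side by side in a minimal sketch. It assumes topics are represented as paths from the root of the taxonomy; the `may_receive` function, the rule names, and the topic names are hypothetical and used only for illustration, not part of the Topics API. For the least-common-ancestor rule, the sketch also assumes that top-level topics (whose parent is the root) do not unlock the whole taxonomy.

```python
from typing import Set, Tuple

# Hypothetical representation: a topic is its path from the taxonomy root.
Topic = Tuple[str, ...]


def is_ancestor_or_self(a: Topic, b: Topic) -> bool:
    """True if topic a equals topic b or is an ancestor of b."""
    return len(a) <= len(b) and b[:len(a)] == a


def least_common_ancestor(a: Topic, b: Topic) -> Topic:
    """Longest shared path prefix; the empty tuple stands for the root."""
    common = []
    for x, y in zip(a, b):
        if x != y:
            break
        common.append(x)
    return tuple(common)


def may_receive(candidate: Topic, observed: Set[Topic], rule: str) -> bool:
    """Decide whether a caller may receive `candidate` under a given rule."""
    if rule == "exact":
        # Exact-match filtering: only topics the caller has observed.
        return candidate in observed
    if rule == "descendant":
        # Candidate is an observed topic or lies below one.
        return any(is_ancestor_or_self(o, candidate) for o in observed)
    if rule == "ancestor":
        # Candidate is an observed topic or lies above one.
        return any(is_ancestor_or_self(candidate, o) for o in observed)
    if rule == "lca_parent":
        # Candidate's least common ancestor with some observed topic is
        # that topic's parent. Assumption: skip top-level observed topics,
        # whose parent is the root, so they do not match everything.
        return any(
            candidate == o
            or (len(o) > 1 and least_common_ancestor(candidate, o) == o[:-1])
            for o in observed
        )
    raise ValueError(f"unknown rule: {rule}")


# Illustrative topic names, not taken from the real taxonomy.
observed = {("Sports", "Soccer")}
print(may_receive(("Sports", "Tennis"), observed, "exact"))
print(may_receive(("Sports", "Tennis"), observed, "lca_parent"))
```

Under these assumptions, a sibling topic is rejected by exact matching but accepted by the least-common-ancestor rule, which is the recall improvement the proposal is after.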

Consider noisy per-caller filtering

An alternative approach could be to introduce noise into per-caller filtering. For example, at the end of each epoch, the browser could apply the following logic to each caller that has observed at least one topic in the previous epoch: for each topic in the taxonomy, independently sample an "observed" bit with a probability that decreases with the distance to the nearest observed topic in the taxonomy, so that nearby topics are more likely to be marked as observed. The new set of "observed" topics should not be shared with the caller directly, and should only be used in filtering. This logic could instead be applied per site visit (further breaking the consistency of matching topics signals across sites), but this may be cumbersome for the browser to implement. This randomization method would allow callers to (probabilistically) receive topics close to topics they have previously observed.
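The per-epoch sampling step could be sketched as follows. This is only an illustration under several assumptions: topics are paths from the taxonomy root, distance is the number of edges between two topics in the tree, nearer topics are more likely to be marked observed, and the inclusion probability is `decay ** distance` (so truly observed topics are always kept). The function names and the `decay` parameter are hypothetical; the issue does not specify a noise distribution.

```python
import random
from typing import Iterable, Optional, Set, Tuple

# Hypothetical representation: a topic is its path from the taxonomy root.
Topic = Tuple[str, ...]


def tree_distance(a: Topic, b: Topic) -> int:
    """Number of edges between two topics in the taxonomy tree."""
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return len(a) + len(b) - 2 * common


def inclusion_probability(topic: Topic, observed: Set[Topic],
                          decay: float = 0.25) -> float:
    """Assumed noise model: probability decays geometrically with distance."""
    nearest = min(tree_distance(topic, o) for o in observed)
    return decay ** nearest  # distance 0 gives probability 1


def sample_noisy_observed(taxonomy: Iterable[Topic], observed: Set[Topic],
                          decay: float = 0.25,
                          rng: Optional[random.Random] = None) -> Set[Topic]:
    """Sample the browser-private "observed" set used only for filtering."""
    if not observed:
        # Callers with no observations in the epoch get no noisy topics.
        return set()
    rng = rng or random.Random()
    return {
        t for t in taxonomy
        if rng.random() < inclusion_probability(t, observed, decay)
    }


# Illustrative topic names, not taken from the real taxonomy.
taxonomy = [("Sports",), ("Sports", "Soccer"), ("Sports", "Tennis"), ("News",)]
observed = {("Sports", "Soccer")}
noisy = sample_noisy_observed(taxonomy, observed, decay=0.25)
print(sorted(noisy))
```

The `decay` value controls the trade-off discussed in this issue: a value near 0 approaches exact per-caller filtering, while a value near 1 lets a caller receive almost any topic.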

Potential benefits of these approaches could be:

  • Improved recall for similar/nearby topics.
  • (For approximate per-caller filtering) Can use approximations that should be straightforward for callers and end-users to comprehend (e.g., subcategory).
  • (For noisy per-caller filtering) Reduces the certainty of coordinated cross-site attacks that rely on per-caller filtering consistency, e.g., #74. The size of this benefit will depend on the noise level and any limitations on the number of callers per site.
  • (For noisy per-caller filtering) In combination with re-randomization of the epoch topics per site (i.e., sampling topics close to the 5 topics assigned per epoch rather than consistently returning one of the same 5), this could significantly reduce the fingerprinting potential of topics, even if the taxonomy size increases.

Potential downsides of these approaches could be:

  • May not preserve the original goal of preventing cross-site information sharing beyond what third-party cookies already allow (depending on approximation and noise choices). A caller could learn about a topic or category that it had not observed for the user in the past, even a topic for which the caller has no existing site integrations (particularly with noisy per-caller filtering).
  • May be difficult for callers and end-users to reason about or understand noise in per-caller filtering.