
Debugging support in the aggregation service: Feedback Requested #42

Open
cylee81 opened this issue Mar 20, 2024 · 1 comment
Labels
enhancement New feature or request question Further information is requested

Comments


cylee81 commented Mar 20, 2024

Hi,

The Aggregation service team is looking for your feedback to improve debugging support in the service.

Adtechs can already retrieve metrics for their jobs (status, errors, execution time, etc.) from the cloud metadata store (DynamoDB on AWS and Spanner on GCP).
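As an illustration, once a job metadata item has been fetched from DynamoDB, its fields can be flattened into a plain summary. This is a minimal sketch; the field names and item shape below are assumptions for illustration, not the service's actual schema:

```python
from datetime import datetime, timezone

def summarize_job_item(item: dict) -> dict:
    """Flatten a DynamoDB-style job metadata item (hypothetical field
    names) into a plain summary dict for dashboards or alerting."""
    # DynamoDB items wrap every value in a type descriptor ({"S": ...}, {"N": ...}).
    status = item["JobStatus"]["S"]
    error_count = int(item.get("ErrorCount", {"N": "0"})["N"])
    started = datetime.fromtimestamp(int(item["StartTime"]["N"]), tz=timezone.utc)
    finished = datetime.fromtimestamp(int(item["EndTime"]["N"]), tz=timezone.utc)
    return {
        "status": status,
        "errors": error_count,
        "execution_seconds": (finished - started).total_seconds(),
    }
```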

We are exploring additional metrics, traces, and logs that could provide better insight into job processing within the Trusted Execution Environment without impacting privacy. We are considering providing CPU and memory metrics and total-execution-time traces for the adtech deployment, and would welcome your feedback on other metrics adtechs may find useful.

We are also considering adding logs that give information about job processing for debugging purposes, such as ‘Job at data reading stage’. This is subject to privacy review and approval.

Your input will be reviewed by the Privacy Sandbox team. We welcome any feedback on debugging Aggregation Service jobs.

Thank you!

@keke123 keke123 added enhancement New feature or request question Further information is requested labels Mar 21, 2024

CGossec commented May 17, 2024

At Criteo, we use the Aggregation Service when testing the end-to-end pipeline of ARA reports. We have been using it for months and have faced several issues when trying to run aggregation jobs. While the setup documentation is very clear, most of our effort with respect to the Aggregation Service has gone not into deploying or maintaining it, but into debugging it. Below we suggest features that we think would greatly improve visibility when debugging aggregation jobs, along with information we think should be part of the Aggregation Service documentation.

1. More details on PRIVACY_BUDGET_EXHAUSTED errors

The root causes of failed aggregation jobs are currently very opaque, and it is hard to tell where an error lies.

This is especially true for PRIVACY_BUDGET_EXHAUSTED errors. It would be much easier for us to locate and fix errors if an Aggregation Service failure reported either:

The report(s) causing the error, or at least the sharedID (or sharedIDs) related to the issue

The jobId of the aggregations related to the error: the aggregation that failed, but also any previous aggregation that could have consumed the privacy budget for the faulty sharedIDs
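As a concrete illustration of the kind of error detail we have in mind, a richer error payload could look roughly like the following (all field names here are purely hypothetical, not an existing API):

```json
{
  "jobId": "my-failed-aggregation-job",
  "returnCode": "PRIVACY_BUDGET_EXHAUSTED",
  "details": {
    "exhaustedSharedIds": ["..."],
    "budgetConsumingJobIds": ["..."]
  }
}
```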

2. Additional documentation on the AWS internal architecture

To simplify understanding of the AS architecture on AWS, it would be helpful to have a document explaining the various components of the Aggregation Service (job queue on SQS, job status table in DynamoDB, workers on EC2, access through API Gateway, etc.), along with what information each AWS tool exposes, in what format, and where to find it.

Additionally, once an adtech has changed the settings of an AS deployment already running online, a fresh deployment from Google’s cloned repositories will probably override those settings (although we haven’t tried this ourselves). It would be useful to support more options in the <filename>.auto.tfvars files so that a setup is more reproducible.
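For example, such settings could then be pinned in version control so a redeploy from a fresh clone reproduces them. The variable names below are hypothetical; only variables actually declared by the Terraform modules would work:

```hcl
# <filename>.auto.tfvars -- hypothetical settings we would like to pin
environment      = "my-prod-env"
region           = "eu-west-1"
instance_type    = "m5.2xlarge"
min_worker_count = 2
max_worker_count = 20
```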

3. Additional information on optimizing the AS within and beyond the AWS infrastructure

The sizing guidance provides useful guidelines for choosing EC2 instance types based on batch size. However, in our tests we observed that splitting the aggregation load into thousands of small batches (necessary to batch the data per client) leads to long end-to-end execution times, at least when done naively, even when the processing time of each individual batch is short. To help adtechs tune this process, it would be useful to have:

  • A description of how the processing is parallelized within aggregation service (across a single or different EC2 instances)
  • Any recommendation on sending parallel batch processing requests (e.g. how many batches of a certain size can be processed simultaneously by an EC2 instance of a given type).
  • Sizing recommendations for AWS components other than EC2, notably DynamoDB.
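To make the question concrete, this is a sketch of the kind of parallel submission loop such guidance would inform. The `submit_job` parameter is a stand-in for whatever client calls the service's createJob endpoint; `max_parallel` is exactly the knob we would like recommendations for:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable

def submit_batches(batches: Iterable[str],
                   submit_job: Callable[[str], str],
                   max_parallel: int = 8) -> list[str]:
    """Submit many small aggregation batches with bounded parallelism.

    `submit_job` is a hypothetical callable wrapping the createJob API;
    `max_parallel` caps how many requests are in flight at once.
    """
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # pool.map preserves input order, so job ids line up with batches.
        return list(pool.map(submit_job, batches))
```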
