Crawl well-known Resources introduced by The Privacy Sandbox:
- /.well-known/privacy-sandbox-attestations.json
  - Submit a form, JSON file sent by Google
  - No public list of who participates
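
Because there is no public list of participants, each crawled origin has to be probed individually for the well-known resource. The following is a minimal sketch of how the URL to probe could be derived from an origin; the function name `attestation_url` is hypothetical and not taken from this repository.

```python
from urllib.parse import urlsplit, urlunsplit

# Path of the well-known resource named above.
WELL_KNOWN_PATH = "/.well-known/privacy-sandbox-attestations.json"


def attestation_url(origin: str) -> str:
    """Return the attestation file URL for an origin such as 'https://example.com'.

    Hypothetical helper for illustration; the crawler in this repository
    may build its URLs differently.
    """
    parts = urlsplit(origin)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"not a valid origin: {origin!r}")
    # Drop any path, query, or fragment: the resource lives at the origin root.
    return urlunsplit((parts.scheme, parts.netloc, WELL_KNOWN_PATH, "", ""))


if __name__ == "__main__":
    print(attestation_url("https://example.com"))
```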
A Dockerfile is provided under .devcontainer/; for direct integration with
VS Code or to manually build the image and deploy the Docker container, follow
the instructions in this guide.
Required:
- CRUX_URL: The URL of the cached version of CrUX to use (https://github.com/zakird/crux-top-lists/raw/main/data/global/current.csv.gz)
- CRUX_TOP: The number of top-ranked origins to crawl (1000000)
- RWS_URL: The URL of the RWS canonical set (https://raw.githubusercontent.com/GoogleChrome/related-website-sets/main/related_website_sets.JSON)
Optional:
- S3_DATA_BUCKET: The S3 bucket where the raw crawl results are saved. If undefined, a local run is assumed.
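
As an illustration of how these variables fit together, here is a minimal sketch that validates the required variables and treats an unset S3_DATA_BUCKET as a local run. The names `CrawlConfig` and `load_config` are hypothetical; the crawl script in this repository may read its environment differently.

```python
import os
from typing import Mapping, NamedTuple, Optional

# Required variables, as listed above.
REQUIRED = ("CRUX_URL", "CRUX_TOP", "RWS_URL")


class CrawlConfig(NamedTuple):
    crux_url: str
    crux_top: int
    rws_url: str
    s3_data_bucket: Optional[str]  # None means a local run

    @property
    def is_local_run(self) -> bool:
        return self.s3_data_bucket is None


def load_config(env: Mapping[str, str] = os.environ) -> CrawlConfig:
    """Build a CrawlConfig from environment variables (hypothetical helper)."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError("missing required variables: " + ", ".join(missing))
    crux_top = int(env["CRUX_TOP"])
    if crux_top <= 0:
        raise ValueError("CRUX_TOP must be a positive integer")
    # An empty or unset bucket is treated as "not defined".
    bucket = env.get("S3_DATA_BUCKET") or None
    return CrawlConfig(env["CRUX_URL"], crux_top, env["RWS_URL"], bucket)
```

Passing a plain dictionary instead of `os.environ` makes the check easy to exercise without touching the real environment.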
./crawl_crux.sh

Define the following CI variables to have GitLab CI build and push the Docker image automatically, so that the ECS task stays up to date:
- AWS_ACCOUNT_ID: the AWS account ID
- AWS_REGION: the AWS region to use
- AWS_ACCESS_KEY_ID: the access key ID of an IAM user with the AmazonEC2ContainerRegistryPowerUser policy
- AWS_SECRET_ACCESS_KEY: the secret access key of an IAM user with the AmazonEC2ContainerRegistryPowerUser policy
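
For orientation, the account ID and region determine the Amazon ECR registry host that an image is pushed to. The sketch below assumes the standard AWS partition (hosts ending in `amazonaws.com`); the helper names and the repository name `my-crawler` are hypothetical and not taken from this project's CI configuration.

```python
def ecr_registry(account_id: str, region: str) -> str:
    """Return the ECR registry host for an account and region.

    Assumes the standard AWS partition; other partitions use different suffixes.
    """
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com"


def image_uri(account_id: str, region: str, repository: str, tag: str = "latest") -> str:
    """Return the full image reference to push (hypothetical helper)."""
    return f"{ecr_registry(account_id, region)}/{repository}:{tag}"


if __name__ == "__main__":
    # "my-crawler" is a placeholder repository name.
    print(image_uri("123456789012", "eu-west-1", "my-crawler"))
```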