# Public log asn matcher

A Postgres-based k8s-infra data pipeline that produces BigQuery tables for reviewing Kubernetes Public artifact traffic in DataStudio.

It utilises Postgres for some local querying, together with the Postgres container image's init folder, where several shell scripts are sourced and SQL queries are run. The process is split into multiple steps, which can be found in the ./pg-init.d folder in this repo (or /docker-entrypoint-initdb.d inside the container).
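
For context, the official Postgres image runs any *.sh and *.sql files it finds in /docker-entrypoint-initdb.d in sorted name order on first start, which is what sequences the steps; the filenames below are illustrative only, not the repo's actual scripts:

```bash
# Illustrative listing only; the real step scripts live in ./pg-init.d
ls /docker-entrypoint-initdb.d
# 10-fetch-asn-data.sh
# 20-load-asn-data.sql
# 30-export-to-bigquery.sh
```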

Some scripts call into the /app folder, which contains:

- some SQL to run against BigQuery
- some Python scripts for local data management
- a text file containing the names of buckets that contain access logs (with _usage emitted); a hypothetical layout is sketched below
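
A hypothetical layout of /app (the real filenames may differ):

```bash
# Hypothetical contents only; check the /app folder for the real files
ls /app
# usage_query.sql   # SQL run against BigQuery
# manage_data.py    # local data management helper
# buckets.txt       # names of buckets whose access logs (_usage) are read
```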

The output is finally loaded into a DataStudio report and reviewed by members of sig-k8s-infra, without PII displayed.

## Flow

1. Kubernetes Public GCS buckets for artifacts are configured to write public access logs to a GCS bucket called k8s-infra-artifacts-gcslogs in the k8s-infra-public-pii project
2. Data from ASN aggregators (PyASN, Potaroo) is collated in Postgres for matching ASN owners / companies to IP ranges
3. The ASN data is then loaded into BigQuery
4. Publicly known company vendor ASNs from the k8s.io repo are loaded into BigQuery
5. PeeringDB company information is fetched via their API and transformed in Postgres together with the Potaroo data before being uploaded to BigQuery
6. The Kubernetes Public GCS logs are then loaded into the usage_all_raw table of the BigQuery dataset
7. Various tables (such as IPs and companies) are formed out of the previously loaded data in BigQuery to make joining easier
8. The IPs are joined and expanded with the IP ranges from the ASN data, then loaded into the dataset as a table
9. All the data is then linked into a single table
10. Tables are then promoted to the stable ${GCP_BIGQUERY_DATASET} dataset for use in DataStudio
11. Wait until PID 1's cmdline is postgres (meaning init has finished), then halt Postgres cleanly with pg_ctl kill quit ${PID_FOR_POSTGRES} and exit 0 (a sketch of this step follows the list)
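
A minimal sketch of step 11, assuming a watcher that an init script backgrounds before the entrypoint execs the real server; the variable names and approach are illustrative, not the repo's actual code:

```bash
# Illustrative only: wait until PID 1 is the real postgres server
# (init scripts have finished), then stop it cleanly so the container
# exits 0.
(
  until tr '\0' ' ' < /proc/1/cmdline | grep -q '^postgres'; do
    sleep 5
  done
  PID_FOR_POSTGRES=1
  pg_ctl kill QUIT "${PID_FOR_POSTGRES}"
) &
```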

## Environment variables

| Name | Default | Description |
|------|---------|-------------|
| TZ | `` | Container time zone |
| POSTGRES_PASSWORD | postgres | The password to set for Postgres |
| GOOGLE_APPLICATION_CREDENTIALS | `` | The path to the GCP service account json key |
| GCP_PROJECT | k8s-infra-ii-sandbox | The project to target, which hosts ${GCP_BIGQUERY_DATASET} and will also be billed |
| GCP_BIGQUERY_DATASET | etl_script_generated_set | The dataset and basename to write to (the date is appended) |
| NO_PROMOTE | `` | Disable the promotion of ${GCP_BIGQUERY_DATASET}_${DATE} to ${GCP_BIGQUERY_DATASET} |
| ASN_DATA_PIPELINE_RETAIN | `` | Keeps Postgres running after the job has completed |
| GCP_BQ_DUMP_BUCKET | `` | A GCP bucket to dump content from BigQuery |
| DEBUG_MODE | `` | Toggles bash's debug mode |

## Prepare

Log into gcloud

```bash
gcloud auth login
```

Set the GCP project

```bash
gcloud config set project k8s-infra-ii-sandbox
```

Log into application-default

```bash
gcloud auth application-default login
```

## Running the Pipeline Manually

### Run in Docker

```bash
TMP_DIR_ETL=$HOME/.tmp/public-log-asn-matcher-$RANDOM
sudo mkdir -p $TMP_DIR_ETL
sudo chmod 0777 ${TMP_DIR_ETL}
sudo chown -R 999 ~/.config/gcloud # allow for postgres user
docker run \
    -d \
    -e DEBUG_MODE=true \
    -e TZ=$TZ \
    -e POSTGRES_PASSWORD="postgres" \
    -e GCP_PROJECT=k8s-infra-ii-sandbox \
    -e GCP_BIGQUERY_DATASET=etl_script_generated_set \
    -e GCP_BQ_DUMP_BUCKET=ii_bq_scratch_dump \
    -v $HOME/.config/gcloud:/var/lib/postgresql/.config/gcloud \
    -v "${TMP_DIR_ETL}:/tmp" \
    --name public-log-asn-matcher \
    gcr.io/k8s-staging-infra-tools/public-log-asn-matcher
docker logs -f public-log-asn-matcher
```
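
If the container is started with ASN_DATA_PIPELINE_RETAIN set (e.g. `-e ASN_DATA_PIPELINE_RETAIN=true`), Postgres keeps running after the job, so the intermediate data can be inspected; a hedged example (table names depend on the pg-init.d scripts):

```bash
# List the tables the pipeline created locally (names vary by script)
docker exec -it public-log-asn-matcher psql -U postgres -c '\dt'
```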

### Clean up

Change the permissions of ~/.config/gcloud back

```bash
sudo chown -R $(id -u) ~/.config/gcloud
```

## Generating the bucket list

A list of buckets is generated; each bucket's .*_usage counterpart stores the public access logs

```bash
./generate-buckets.sh
```
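
The real logic lives in ./generate-buckets.sh; a minimal sketch of the idea, assuming an authenticated gsutil and the default project from the table above (both assumptions):

```bash
# Hedged sketch, not the actual script: list the project's buckets and
# keep only the .*_usage ones that store the public access logs.
gsutil ls -p k8s-infra-ii-sandbox | grep '_usage/$'
```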

## Run a test build

```bash
TAG="$(date -u '+%Y%m%d')-$(git describe --tags --always --dirty)"
PROJECT="k8s-infra-ii-sandbox"
BUCKET="ii_bq_scratch_dump"

gcloud builds submit \
    --verbosity info \
    --config cloudbuild.yaml \
    --substitutions _TAG="${TAG}",_GIT_TAG="${TAG}" \
    --project ${PROJECT} \
    --gcs-source-staging-dir "gs://${BUCKET}/source" \
    .
```