# Public log asn matcher

A Postgres-based k8s-infra data pipeline that produces BigQuery tables for reviewing Kubernetes Public artifact traffic in DataStudio.

It utilises Postgres for some local querying, together with the Postgres container image's init folder, where several shell scripts are sourced and SQL queries are run. The process is split into multiple steps, which can be found in the ./pg-init.d folder in this repo (or /docker-entrypoint-initdb.d inside the container).
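
For context, the official Postgres image runs any *.sh and *.sql files it finds in /docker-entrypoint-initdb.d in sorted name order on first start, which is what sequences the steps; the filenames below are illustrative only, not the repo's actual scripts:

```bash
# Illustrative listing only; the real step scripts live in ./pg-init.d
ls /docker-entrypoint-initdb.d
# 10-fetch-asn-data.sh
# 20-load-asn-data.sql
# 30-export-to-bigquery.sh
```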

Some scripts call into the /app folder, which contains:

- some SQL to run against BigQuery
- some Python scripts for local data management
- a text file containing the names of buckets that contain access logs (with _usage emitted); a hypothetical layout is sketched below
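
A hypothetical layout of /app (the real filenames may differ):

```bash
# Hypothetical contents only; check the /app folder for the real files
ls /app
# usage_query.sql   # SQL run against BigQuery
# manage_data.py    # local data management helper
# buckets.txt       # names of buckets whose access logs (_usage) are read
```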

The output is finally loaded into a DataStudio report and reviewed by members of sig-k8s-infra, without PII displayed.

## Flow

1. Kubernetes Public GCS buckets for artifacts are configured to write public access logs to a GCS bucket called k8s-infra-artifacts-gcslogs in the k8s-infra-public-pii project
2. Data from ASN aggregators (PyASN, Potaroo) is collated in Postgres for matching ASN owners / companies to IP ranges
3. The ASN data is then loaded into BigQuery
4. Publicly known company vendor ASNs from the k8s.io repo are loaded into BigQuery
5. PeeringDB company information is fetched via their API and transformed in Postgres together with the Potaroo data before being uploaded to BigQuery
6. The Kubernetes Public GCS logs are then loaded into the usage_all_raw table of the BigQuery dataset
7. Various tables (such as IPs and companies) are formed out of the previously loaded data in BigQuery to make joining easier
8. The IPs are joined and expanded with the IP ranges from the ASN data, then loaded into the dataset as a table
9. All the data is then linked into a single table
10. Tables are then promoted to the stable ${GCP_BIGQUERY_DATASET} dataset for use in DataStudio
11. Wait until PID 1's cmdline is postgres (meaning init has finished), then halt Postgres cleanly with pg_ctl kill quit ${PID_FOR_POSTGRES} and exit 0 (a sketch of this step follows the list)
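
A minimal sketch of step 11, assuming a watcher that an init script backgrounds before the entrypoint execs the real server; the variable names and approach are illustrative, not the repo's actual code:

```bash
# Illustrative only: wait until PID 1 is the real postgres server
# (init scripts have finished), then stop it cleanly so the container
# exits 0.
(
  until tr '\0' ' ' < /proc/1/cmdline | grep -q '^postgres'; do
    sleep 5
  done
  PID_FOR_POSTGRES=1
  pg_ctl kill QUIT "${PID_FOR_POSTGRES}"
) &
```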

## Environment variables

| Name | Default | Description |
|------|---------|-------------|
| TZ | `` | Container time zone |
| POSTGRES_PASSWORD | postgres | The password to set for Postgres |
| GOOGLE_APPLICATION_CREDENTIALS | `` | The path to the GCP service account json key |
| GCP_PROJECT | k8s-infra-ii-sandbox | The project to target, which hosts ${GCP_BIGQUERY_DATASET} and will also be billed |
| GCP_BIGQUERY_DATASET | etl_script_generated_set | The dataset and basename to write to (the date is appended) |
| NO_PROMOTE | `` | Disable the promotion of ${GCP_BIGQUERY_DATASET}_${DATE} to ${GCP_BIGQUERY_DATASET} |
| ASN_DATA_PIPELINE_RETAIN | `` | Keeps Postgres running after the job has completed |
| GCP_BQ_DUMP_BUCKET | `` | A GCP bucket to dump content from BigQuery |
| DEBUG_MODE | `` | Toggles bash's debug mode |

## Prepare

Log into gcloud

```bash
gcloud auth login
```

Set the GCP project

```bash
gcloud config set project k8s-infra-ii-sandbox
```

Log into application-default

```bash
gcloud auth application-default login
```

## Running the Pipeline Manually

### Run in Docker

```bash
TMP_DIR_ETL=$HOME/.tmp/public-log-asn-matcher-$RANDOM
sudo mkdir -p $TMP_DIR_ETL
sudo chmod 0777 ${TMP_DIR_ETL}
sudo chown -R 999 ~/.config/gcloud # allow for postgres user
docker run \
    -d \
    -e DEBUG_MODE=true \
    -e TZ=$TZ \
    -e POSTGRES_PASSWORD="postgres" \
    -e GCP_PROJECT=k8s-infra-ii-sandbox \
    -e GCP_BIGQUERY_DATASET=etl_script_generated_set \
    -e GCP_BQ_DUMP_BUCKET=ii_bq_scratch_dump \
    -v $HOME/.config/gcloud:/var/lib/postgresql/.config/gcloud \
    -v "${TMP_DIR_ETL}:/tmp" \
    --name public-log-asn-matcher \
    gcr.io/k8s-staging-infra-tools/public-log-asn-matcher
docker logs -f public-log-asn-matcher
```
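
If the container is started with ASN_DATA_PIPELINE_RETAIN set (e.g. `-e ASN_DATA_PIPELINE_RETAIN=true`), Postgres keeps running after the job, so the intermediate data can be inspected; a hedged example (table names depend on the pg-init.d scripts):

```bash
# List the tables the pipeline created locally (names vary by script)
docker exec -it public-log-asn-matcher psql -U postgres -c '\dt'
```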

### Clean up

Change the permissions of ~/.config/gcloud back

```bash
sudo chown -R $(id -u) ~/.config/gcloud
```

## Generating the bucket list

A list of buckets is generated; each bucket's .*_usage counterpart stores the public access logs

```bash
./generate-buckets.sh
```
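
The real logic lives in ./generate-buckets.sh; a minimal sketch of the idea, assuming an authenticated gsutil and the default project from the table above (both assumptions):

```bash
# Hedged sketch, not the actual script: list the project's buckets and
# keep only the .*_usage ones that store the public access logs.
gsutil ls -p k8s-infra-ii-sandbox | grep '_usage/$'
```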

## Run a test build

```bash
TAG="$(date -u '+%Y%m%d')-$(git describe --tags --always --dirty)"
PROJECT="k8s-infra-ii-sandbox"
BUCKET="ii_bq_scratch_dump"

gcloud builds submit \
    --verbosity info \
    --config cloudbuild.yaml \
    --substitutions _TAG="${TAG}",_GIT_TAG="${TAG}" \
    --project ${PROJECT} \
    --gcs-source-staging-dir "gs://${BUCKET}/source" \
    .
```