Add script to generate and upload schemas #18

Merged
merged 34 commits into from
May 9, 2019
34 commits
e7f9d1a
Add script to generate and upload schemas
fbertsch Apr 25, 2019
639b25f
Update metadata-merge with correct output
fbertsch Apr 30, 2019
c49a65c
Add namespace/doctype whitelist
fbertsch Apr 30, 2019
4d9baf8
Update git user and commit handling
fbertsch Apr 30, 2019
0adf84b
Use published transpiler
fbertsch Apr 30, 2019
51317e6
Remove Avro schemas from script
fbertsch Apr 30, 2019
90046c5
Use dev branch of MPS
fbertsch Apr 30, 2019
e0c7062
Update whitelist
fbertsch Apr 30, 2019
6dc5f1d
Use correct credentials for git push
fbertsch Apr 30, 2019
4247724
Correct transpiler BQ output with jq
fbertsch Apr 30, 2019
7d3217b
Push only whitelisted schemas
fbertsch Apr 30, 2019
0ff53a6
Move git config to script
fbertsch Apr 30, 2019
47cd7a2
Remove controversial terminology
fbertsch Apr 30, 2019
1e9dd8a
Use local copy of mozilla-schema-generator
fbertsch May 2, 2019
b97d6c6
Use correct name for events ping
fbertsch May 2, 2019
df6aad6
Work on separate development branch
fbertsch May 2, 2019
7c15cce
Remove non allowed-list bq schemas
fbertsch May 2, 2019
b58ba56
Don't include metadata in Json schemas
fbertsch May 3, 2019
21a67f4
Don't include parquet files
fbertsch May 3, 2019
7e01548
Pin jsonschema-transpiler to v1.0.0
acmiyaguchi May 6, 2019
c4438f6
Assume binary location from PATH
acmiyaguchi May 7, 2019
0054cbe
Rename MPS_SSH_KEY_BASE64 and remove USER_ID
acmiyaguchi May 7, 2019
c0f227d
Add script doc and move transpiler dep to dockerfile
acmiyaguchi May 7, 2019
158ff13
Refactor script into modular parts
acmiyaguchi May 7, 2019
3c2d418
Rename variables and make shellcheck happy
acmiyaguchi May 7, 2019
9ee438e
Add newline
acmiyaguchi May 7, 2019
60591a8
Update bin/allowlist
jklukas May 8, 2019
857a131
Simplify dockerfile
acmiyaguchi May 8, 2019
3a5abec
Fix permissions, use python3 -m venv, and set -exuo pipefail
acmiyaguchi May 8, 2019
5b897e9
Only merge valid schemas; remove numbering
acmiyaguchi May 9, 2019
4f7605a
Simplify filtering logic
acmiyaguchi May 9, 2019
6558800
Move allowlist to project root
acmiyaguchi May 9, 2019
0e0af4e
Fix error on empty commit
acmiyaguchi May 9, 2019
1413ccf
Remove `|| exit` with addition of `set -e`
acmiyaguchi May 9, 2019
23 changes: 14 additions & 9 deletions Dockerfile
@@ -4,14 +4,12 @@ MAINTAINER Frank Bertsch <frank@mozilla.com>
 # Guidelines here: https://github.com/mozilla-services/Dockerflow/blob/master/docs/building-container.md
 ARG RUST_SPEC=stable
 ARG USER_ID="10001"
-ARG GROUP="app"
+ARG GROUP_ID="app"
 ARG HOME="/app"

 ENV HOME=${HOME}
-RUN mkdir ${HOME} && \
-    chown ${USER_ID}:${USER_ID} ${HOME} && \
-    groupadd --gid ${USER_ID} ${GROUP} && \
-    useradd --no-create-home --uid 10001 --gid 10001 --home-dir /app ${GROUP}
+RUN groupadd --gid ${USER_ID} ${GROUP_ID} && \
+    useradd --create-home --uid ${USER_ID} --gid ${GROUP_ID} --home-dir /app ${GROUP_ID}

 RUN apt-get update && \
     apt-get install -y --no-install-recommends \
@@ -20,17 +18,24 @@ RUN apt-get update && \
apt-get clean

# Install Rust and Cargo
RUN curl https://sh.rustup.rs -sSf | sh -s -- -y --default-toolchain=${RUST_SPEC}
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y --default-toolchain=${RUST_SPEC}

ENV CARGO_INSTALL_ROOT=${HOME}/.cargo
ENV PATH ${PATH}:${HOME}/.cargo/bin

# Install Google Cloud SDK
RUN curl -sSL https://sdk.cloud.google.com | bash
ENV PATH $PATH:$HOME/google-cloud-sdk/bin
# Install a tagged version of jsonschema-transpiler
RUN cargo install jsonschema-transpiler --version 1.0.0

# Upgrade pip
RUN pip install --upgrade pip

WORKDIR ${HOME}

ADD . ${HOME}/mozilla-schema-generator
ENV PATH $PATH:${HOME}/mozilla-schema-generator/bin

# Drop root and change ownership of the application folder to the user
RUN chown -R ${USER_ID}:${GROUP_ID} ${HOME}
USER ${USER_ID}

ENTRYPOINT ["schema_generator.sh"]
6 changes: 6 additions & 0 deletions allowlist
@@ -0,0 +1,6 @@
telemetry/core
org-mozilla-fenix/.*
org-mozilla-reference-browser/.*
activity-stream/.*
eng-workflow/bmobugs
mobile/activation
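
The allowlist entries above are consumed by `grep -v -f` in `bin/schema_generator.sh`, so each line acts as a regular expression that may match anywhere in a schema path. As a rough illustration only, the selection can be mimicked in Python as below. The helper names and the example paths are hypothetical, and Python's `re` stands in for grep's basic regular expressions, which behave the same for these particular patterns.

```python
import re

# Patterns copied from the allowlist file in this diff.
ALLOWLIST = [
    "telemetry/core",
    "org-mozilla-fenix/.*",
    "org-mozilla-reference-browser/.*",
    "activity-stream/.*",
    "eng-workflow/bmobugs",
    "mobile/activation",
]


def is_allowed(path, patterns=ALLOWLIST):
    """Return True if any pattern matches anywhere in the path (grep-style search)."""
    return any(re.search(pattern, path) for pattern in patterns)


def paths_to_remove(paths, patterns=ALLOWLIST):
    """Mimic `grep -v -f allowlist`: keep only the paths that match no pattern."""
    return [path for path in paths if not is_allowed(path, patterns)]


# Hypothetical example paths, not taken from the repository.
candidates = [
    "./telemetry/core/core.10.bq",
    "./telemetry/main/main.4.bq",
    "./org-mozilla-fenix/baseline/baseline.1.bq",
]
print(paths_to_remove(candidates))
```

Because the match is a search rather than a full match, an entry such as `telemetry/core` would also allow any path that merely contains that substring.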
57 changes: 57 additions & 0 deletions bin/metadata_merge
@@ -0,0 +1,57 @@
#!/usr/bin/env python3

import click
import json

@click.command()
@click.argument(
    'metadata',
    type=click.Path(
        dir_okay=False,
        file_okay=True,
        writable=False,
        exists=True,
    ),
    required=True
)
@click.argument(
    'schema',
    type=click.Path(
        dir_okay=False,
        file_okay=True,
        writable=True,
        exists=True,
    ),
    required=True
)
def main(metadata, schema):
    print("Merging metadata {} and schema {}".format(metadata, schema))

    with open(metadata, "r") as f:
        metadata_contents = json.load(f)

    with open(schema, "r") as f:
        schema_contents = json.load(f)

    properties = metadata_contents.get("properties", {})
    required = metadata_contents.get("required", [])

    if "properties" not in schema_contents:
        schema_contents["properties"] = {}

    if "required" not in schema_contents:
        schema_contents["required"] = []

    schema_contents["properties"].update(properties)
    schema_contents["required"] += required

    json_dump_args = {
        'indent': 2,
        'separators': (',', ': ')
    }

    with open(schema, "w") as f:
        f.write(json.dumps(schema_contents, **json_dump_args))

if __name__ == "__main__":
    main()
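
For reviewers who want to check the merge semantics without touching files, the core of `bin/metadata_merge` can be sketched as a pure function. This is a minimal sketch, not the script itself: `merge_metadata` and the sample dictionaries are hypothetical, and unlike the script it returns a copy instead of rewriting the schema file in place.

```python
import copy


def merge_metadata(metadata, schema):
    """Return a copy of `schema` with the metadata's properties and required keys merged in."""
    merged = copy.deepcopy(schema)
    # Metadata properties overwrite same-named schema properties (dict.update semantics).
    merged.setdefault("properties", {}).update(metadata.get("properties", {}))
    # Required keys are appended, so running the merge twice would duplicate them.
    merged["required"] = merged.get("required", []) + list(metadata.get("required", []))
    return merged


# Hypothetical inputs for illustration only.
metadata = {"properties": {"metadata": {"type": "object"}}, "required": ["metadata"]}
schema = {"type": "object", "properties": {"id": {"type": "string"}}}

merged = merge_metadata(metadata, schema)
print(sorted(merged["properties"]))
print(merged["required"])
```

The sketch makes two properties of the approach visible: a metadata property wins over a schema property with the same name, and the `required` list is concatenated rather than deduplicated.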
150 changes: 150 additions & 0 deletions bin/schema_generator.sh
@@ -0,0 +1,150 @@
#!/bin/bash

# A script for generating schemas that are deployed into the pipeline. This
# script handles preprocessing, filtering, and transpilation of schemas as part
# of a pre-deployment scheme. The resulting schemas are pushed to a branch of
# mozilla-pipeline-schemas.
#
# Environment variables:
# MPS_SSH_KEY_BASE64: A base64-encoded ssh secret key with permissions to push
# to mozilla-pipeline-schemas
#
# Example usage:
# export MPS_SSH_KEY_BASE64=$(cat ~/.ssh/id_rsa | base64)
# make build && make run
#
# TODO: Update schema mapping for validation
# TODO: Handle overwriting glean schemas
# TODO: Include Main Ping from schema generation
# TODO: What the heck to do with pioneer-study, a non-nested namespace

set -exuo pipefail

MPS_REPO_URL="git@github.com:mozilla-services/mozilla-pipeline-schemas.git"
MPS_BRANCH_SOURCE="dev"
MPS_BRANCH_WORKING="local-working-branch"
MPS_BRANCH_PUBLISH="generated-schemas"
MPS_SCHEMAS_DIR="schemas"

BASE_DIR="/app"
ALLOWLIST="$BASE_DIR/mozilla-schema-generator/allowlist"


function setup_git_ssh() {
    # Configure the container for pushing to github

    if [[ -z "$MPS_SSH_KEY_BASE64" ]]; then
        echo "Missing secret key" 1>&2
        exit 1
    fi

    git config --global user.name "Generated Schema Creator"
    git config --global user.email "dataops+pipeline-schemas@mozilla.com"

    mkdir -p "$HOME/.ssh"

    echo "$MPS_SSH_KEY_BASE64" | base64 --decode > /app/.ssh/id_ed25519
    # Makes the future git-push non-interactive
    ssh-keyscan github.com > /app/.ssh/known_hosts

    chown -R "$(id -u):$(id -g)" "$HOME/.ssh"
    chmod 700 "$HOME/.ssh"
    chmod 700 "$HOME/.ssh/id_ed25519"
}

function setup_dependencies() {
    # Installs mozilla-schema-generator in a virtual environment

    python3 -m venv msg-venv
    # shellcheck disable=SC1091
    source msg-venv/bin/activate
    pip install -e ./mozilla-schema-generator
}

function clone_and_configure_mps() {
    # Check out mozilla-pipeline-schemas and change directory to prepare for
    # schema generation.

    [[ -d mozilla-pipeline-schemas ]] && rm -r mozilla-pipeline-schemas

    git clone $MPS_REPO_URL
    cd mozilla-pipeline-schemas/$MPS_SCHEMAS_DIR
    git checkout $MPS_BRANCH_SOURCE
    git checkout -b $MPS_BRANCH_WORKING
}

function prepare_metadata() {
    local telemetry_metadata="metadata/telemetry-ingestion/telemetry-ingestion.1.schema.json"
    local structured_metadata="metadata/structured-ingestion/structured-ingestion.1.schema.json"

    find ./telemetry -name "*.schema.json" -type f \
        -exec metadata_merge $telemetry_metadata {} ";"
    find . -path ./telemetry -prune -o -name "*.schema.json" -type f \
        -exec metadata_merge $structured_metadata {} ";"
}

function filter_schemas() {
    # Remove metadata schemas
    rm -rf metadata

    # Pioneer-study is not nested, remove it
    rm -rf pioneer-study

    # Remove BigQuery schemas that are not in the allowlist
    find . -name '*.bq' | grep -v -f $ALLOWLIST | xargs rm -f
}

function commit_schemas() {
    # This method will keep a changelog of releases. If we delete and newly
    # checkout branches every time, that will contain a changelog of changes.
    # Assumes the current directory is the root of the repository

    find . -name "*.bq" -type f -exec git add {} +
    git checkout ./*.schema.json
    git commit -a -m "Interim Commit"

    git checkout $MPS_BRANCH_PUBLISH || git checkout -b $MPS_BRANCH_PUBLISH

    # Keep only the schemas dir
    find . -mindepth 1 -maxdepth 1 -not -name .git -exec rm -rf {} +
    git checkout $MPS_BRANCH_WORKING -- schemas
    git commit -a -m "Auto-push from schema generation" || echo "Nothing to commit"
}

function main() {
    cd $BASE_DIR

    # Setup ssh key and git config
    setup_git_ssh

    # Install dependencies
    setup_dependencies

    # Pull in all schemas from MPS and change directory
    clone_and_configure_mps

    # Generate new schemas
    mozilla-schema-generator generate-glean-ping --out-dir . --pretty

    # Remove all non-json schemas (e.g. parquet)
    find . -not -name "*.schema.json" -type f -exec rm {} +

    # Add metadata to all json schemas, drop metadata schemas
    prepare_metadata

    # Add transpiled BQ schemas
    find . -type f -name "*.schema.json" | while read -r fname; do
        bq_out=${fname/schema.json/bq}
        jsonschema-transpiler --type bigquery "$fname" > "$bq_out"
    done

    # Keep only allowed schemas
    filter_schemas

    # Push to branch of MPS
    cd ../
    commit_schemas
    git push
}

main "$@"
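
The transpile loop in `main` derives each BigQuery output path with the bash expansion `${fname/schema.json/bq}`, which replaces only the first occurrence of `schema.json` in the path. The Python sketch below shows the equivalent mapping. `bq_output_path` and the sample path are hypothetical names used for illustration.

```python
def bq_output_path(schema_path):
    """Mirror bash's ${fname/schema.json/bq}: replace the first occurrence only."""
    return schema_path.replace("schema.json", "bq", 1)


# Hypothetical path for illustration only.
print(bq_output_path("./telemetry/core/core.10.schema.json"))
```

Since only the first match is replaced, a directory that itself contained `schema.json` in its name would be rewritten instead of the file suffix. That does not appear to arise with the layout shown in this diff, but it is worth keeping in mind if paths change.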
2 changes: 2 additions & 0 deletions docker-compose.yml
@@ -7,3 +7,5 @@ services:
       dockerfile: Dockerfile
     restart: "no"
     command: "true"
+    environment:
+      - MPS_SSH_KEY_BASE64
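
The `MPS_SSH_KEY_BASE64` variable passed through here carries the deploy key as base64 text, which `setup_git_ssh` decodes with `base64 --decode`. A minimal round-trip sketch in Python follows, assuming a placeholder key. `encode_key`, `decode_key`, and `fake_key` are hypothetical and no real key material is involved.

```python
import base64


def encode_key(key_bytes):
    """Encode key bytes as base64 text, comparable to `cat key | base64`."""
    return base64.b64encode(key_bytes).decode("ascii")


def decode_key(encoded):
    """Decode base64 text back to bytes, comparable to `base64 --decode`."""
    # With the default validate=False, characters outside the base64 alphabet
    # (such as line breaks added by wrapping) are discarded before decoding.
    return base64.b64decode(encoded)


# Placeholder bytes standing in for a private key file.
fake_key = b"-----BEGIN OPENSSH PRIVATE KEY-----\nnot-a-real-key\n-----END OPENSSH PRIVATE KEY-----\n"

encoded = encode_key(fake_key)
print(decode_key(encoded) == fake_key)
```

The round trip keeps the trailing newline of the key file intact, which is one reason to ship the key encoded instead of as a raw multi-line environment value.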
2 changes: 1 addition & 1 deletion mozilla_schema_generator/glean_ping.py
@@ -17,7 +17,7 @@ class GleanPing(GenericPing):
     repos_url = "https://probeinfo.telemetry.mozilla.org/glean/repositories"

     default_probes_url = probes_url.format("glean")
-    default_pings = {"baseline", "event", "metrics"}
+    default_pings = {"baseline", "events", "metrics"}
     ignore_pings = {"default", "glean_ping_info", "glean_client_info"}

     def __init__(self, repo): # TODO: Make env-url optional
2 changes: 1 addition & 1 deletion tests/test_glean.py
@@ -31,7 +31,7 @@ def test_env_size(self, glean):
     def test_single_schema(self, glean, config):
         schemas = glean.generate_schema(config, split=False)

-        assert schemas.keys() == {"baseline", "event", "metrics"}
+        assert schemas.keys() == {"baseline", "events", "metrics"}

         final_schemas = {k: schemas[k][0].schema for k in schemas}
         for name, schema in final_schemas.items():