
Add script to generate and upload schemas #18

Merged: 34 commits merged into master on May 9, 2019

Conversation

fbertsch (Contributor)

No description provided.

fbertsch (Contributor, Author)

There are still a few TODOs:

  • Will this script create a tarball and push that to GCS, or will Jenkins see the new push to MPS and create its own?
  • We need to get the proper credentials and bucket for GCS public repos
  • We need to set up credentials to let this script push to MPS

fbertsch (Contributor, Author)

@whd, you seem most likely to have opinions on the TODOs listed above.

Review threads on bin/schema_generator.sh (outdated, resolved)
whd (Member) commented Apr 26, 2019

* Will this script create a tarball and push that to GCS, or will Jenkins see the new push to MPS and create its own?

The latter.

* We need to get the proper credentials and bucket for GCS public repos

Absent a compelling reason, this should also happen on Jenkins (as triggered by the MPS push).

* We need to set up credentials to let this script push to MPS

I will work on setting this up for Airflow.

Review thread on bin/schema_generator.sh (outdated, resolved):

# 2. Remove all non-json schemas (e.g. parquet)

find . -not -name "*.schema.json" -type f | xargs rm
Member:

Is this necessary? I can imagine a future in which we have a unified branch (master) that is updated by both humans and machines, where parquet schemas are simply ignored in the GCP pipeline.

Contributor Author:

Not really necessary, but I would prefer to only have the relevant schemas available. Why don't we add the parquet schemas back in later if they become useful?

Contributor Author:

Basically my goal is to have an easy way of answering "what is running on prod?" The parquet schemas are not included in that (for GCP).

Member:

This is fine for now, since the generated branch will be GCP-only. I'll want to revisit this in the hypothetical future where we have a single branch. Since the responsibility of this script will not include uploading to GCS, the deletion of parquetmr files could still happen farther down the pipeline.
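
For reference, a slightly more defensive version of that cleanup step (a sketch, not the script's final form) uses null-delimited output so unusual filenames can't break the pipeline:

# Remove every file that isn't a JSON schema (e.g. parquet output);
# -print0 / -0 keep filenames with spaces or newlines intact.
find . -type f -not -name "*.schema.json" -print0 | xargs -0 rm -f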

Review thread on Dockerfile (outdated, resolved)
git checkout "$MPS_BRANCH" || git checkout -b "$MPS_BRANCH"
git commit -a -m "Auto-push from schema generation"

git push --repo https://name:password@bitbucket.org/name/repo.git
whd (Member), Apr 29, 2019:

Per https://bugzilla.mozilla.org/show_bug.cgi?id=1547333#c2 I think the easiest thing to do here is to set the ssh key via an env var and have this script write the contents of that var to ~/.ssh/id_rsa or similar.
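
A minimal sketch of that approach (the env var name and paths are assumptions, not the final script):

# Write the deploy key from the environment so git-over-ssh can use it.
mkdir -p ~/.ssh
echo "$MPS_SSH_KEY" > ~/.ssh/id_rsa           # MPS_SSH_KEY is a hypothetical env var
chmod 600 ~/.ssh/id_rsa                       # ssh refuses keys with loose permissions
ssh-keyscan github.com >> ~/.ssh/known_hosts  # pre-trust the host key for non-interactive runs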

Contributor Author:

I've updated this, but when testing locally I still needed to enter a password. The only way I could force-push to GitHub was using a personal access token.

Reply:

I'm not sure about Bitbucket, but GitHub requires pushing to the SSH URL rather than the HTTPS URL for passwordless pushes.
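
For example (a sketch; org/repo is a placeholder):

# Switch the remote to the SSH URL so the deploy key is used instead of a password prompt.
git remote set-url origin git@github.com:org/repo.git
git push origin "$MPS_BRANCH"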

Contributor Author:

I did make that change; see the updated commit.

- Use ssh key for pushing to github
- Cherry-pick commit for updated schemas
- Remove GCS handling
fbertsch marked this pull request as ready for review April 30, 2019 14:43
fbertsch (Contributor, Author)

I've added a whitelist and fixed the cherry-picking. Testing led to this branch:

https://github.com/mozilla-services/mozilla-pipeline-schemas/tree/generated-schemas

However, I've yet to get the push working using SSH auth.


cd ../

rm -rf templates tests validation
Contributor Author:

We could consider removing the remaining files in the git repository, e.g. CMakeLists.txt, Dockerfile, etc. They aren't necessarily useful in the read-only git branch.
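
A sketch of what that could look like, extending the existing cleanup step (the exact file list beyond those mentioned is an assumption):

# In addition to templates/tests/validation, drop the build scaffolding
# from the generated read-only branch (file list is illustrative).
rm -rf CMakeLists.txt Dockerfile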

Member:

I couldn't find an emoji for "this seems reasonable for a generated read-only git branch".

whd (Member) commented Apr 30, 2019

I've tested the generated-schemas branch on the Jenkins infra and it mostly appears to work. The following issues remain:

The .bq schemas aren't valid. This is a combination of mozilla/jsonschema-transpiler#51 and (I think) the top-level fields being unnamed. @acmiyaguchi told me he's fixing at least the former of these.

The whitelist appears to work, but as a result only a very limited subset of bq schemas is generated (e.g. no activity-stream, webpagetest, etc.). Was this for testing purposes, or is bq table generation for less fancy schemas supposed to be incorporated into dev directly? Perhaps a blacklist would be better.
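
For context, BigQuery's JSON schema format requires a name on every field, so an unnamed top-level field can't be loaded. A valid field list looks something like this (field names here are illustrative):

[
  {"name": "document_id", "type": "STRING", "mode": "NULLABLE"},
  {"name": "metadata", "type": "RECORD", "mode": "NULLABLE", "fields": [
    {"name": "geo_country", "type": "STRING", "mode": "NULLABLE"}
  ]}
]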

fbertsch (Contributor, Author)

@whd I added the whitelist so we could do a slow roll-out of schemas while keeping the branch limited to the ones that are deployed. All BQ table generation should be done during this build step.

whd (Member) commented Apr 30, 2019

@fbertsch ok, can you whitelist activity-stream and eng-workflow (bmobugs only)? These (and core v9/10) are the current tables we're generating in GCP so having them in the first round of automated schemas would be ideal.
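
A first-round allowlist might then look something like this (entry format and paths are illustrative; the real file's layout is defined by the script):

activity-stream
eng-workflow/bmobugs
telemetry/core.9
telemetry/core.10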

Review thread on bin/schema_generator.sh (outdated, resolved)
jklukas (Contributor) commented Apr 30, 2019

can you whitelist activity-stream and eng-workflow

I want to be careful here, especially with activity-stream, that we don't accidentally recreate tables and drop existing data. As long as the logic here doesn't drop tables, this should be okay.

I've already updated the activity-stream, bmobugs, and core tables "by hand" to include the new metadata fields (and to relax the previously required metadata.document_id to nullable, since it is no longer being sent). These may throw errors when we try to update the schema, since they contain old metadata fields not in the JSON schemas.

After I validate that the new metadata fields are flowing properly, I plan to update these tables to drop the old fields (tomorrow (Wednesday), if all goes well).
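
For reference, that kind of in-place schema change (adding nullable fields, relaxing REQUIRED to NULLABLE) can be applied with the bq CLI; a sketch, with table and file names as placeholders:

# Fetch the current schema, edit it by hand, then apply the widened schema.
bq show --schema --format=prettyjson project:dataset.activity_stream > schema.json
bq update project:dataset.activity_stream schema.json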

whd (Member) commented Apr 30, 2019

can you whitelist activity-stream and eng-workflow

I want to be careful here, especially with activity-stream, that we don't accidentally recreate tables and drop existing data. As long as the logic here doesn't drop tables, this should be okay.

The hand-crafted tables have a _dev suffix on them, whereas the managed ones do not, so no tables are dropped. A separate config change needs to be applied at https://github.com/mozilla-services/cloudops-infra/blob/ingestion/projects/beam/tf/modules/resources/dataflow_jobs/bigquery.tf#L30 to point the dataflow jobs at the new managed tables.

jklukas (Contributor) commented Apr 30, 2019

Good point that we're transitioning to the non-dev namespaces now. I think this sounds good. Once we change the job to point at the new namespaces, I can work on backfilling the dev activity-stream tables to the new ones.

fbertsch (Contributor, Author) commented May 2, 2019

@whd addressed the feedback. The submission_timestamp fields are now DATETIME, so we may be able to test deploy.

Review thread on bin/schema_generator.sh (outdated, resolved)
whd (Member) left a comment:

Aside from some cleanup and minor inconsistencies in the use of the imperative form in comments, and the sprinkling of || exits that don't make sense to me, this looks good.

Review threads on Dockerfile and bin/schema_generator.sh (resolved)
# Replace newlines with backticks (hard to do with sed): cat | tr
# Remove the last backtick; it's the file-ending newline: rev | cut | rev
# Replace backticks with "\|" (can't do that with tr): sed
# Find directories that don't match any of the regex expressions: find
Member:

The goal of this step is to remove bq schemas not in the allowlist; we are keeping the directories and JSON schemas.
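
Putting those comments together, the step looks roughly like this (a sketch reconstructed from the comments; file names and the find invocation are assumptions, not the final script):

# Build an emacs-regex alternation ("name1\|name2\|...") from the allowlist,
# one schema name per line.
PATTERN=$(cat bin/allowlist | tr '\n' '`' | rev | cut -c 2- | rev | sed -e 's/`/\\|/g')
# Delete generated .bq files whose path matches none of the allowed names,
# leaving directories and JSON schemas in place.
find . -type f -name "*.bq" -not -regex ".*\($PATTERN\).*" -exec rm -f {} +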

Review threads on bin/schema_generator.sh (outdated, resolved)
fbertsch (Contributor, Author) commented May 8, 2019 via email

Review thread on bin/allowlist (outdated, resolved)
acmiyaguchi (Contributor)

Thanks for picking this up, Anthony! My only input is that the jsonschema-transpiler dep won't be updated if the install is in the Dockerfile. I would recommend at least checking for an update in the script.

Unfortunately, there isn't a mechanism for updates; instead, we would have to force-install over the original. We should just update the tag in the Dockerfile when necessary.
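
For illustration, pinning and force-installing the transpiler could look like this (the version tag is a placeholder):

# --force reinstalls over an existing binary; bump the pinned tag to upgrade.
cargo install jsonschema-transpiler --version 1.0.0 --force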


@whd r?

acmiyaguchi requested a review from whd May 9, 2019 00:38
whd (Member) left a comment:

Aside from possibly removing the || exits now that the script uses set -e, LGTM.

acmiyaguchi merged commit d956116 into master May 9, 2019
kik-kik deleted the metadata_merge branch November 23, 2021 16:27