
Add script to generate and upload schemas #18

Merged: 34 commits merged into master on May 9, 2019

Conversation

fbertsch (Contributor)

No description provided.

fbertsch (Contributor, Author)

There are still a few TODOs:

  • Will this script create a tarball and push that to GCS, or will Jenkins see the new push to MPS and create its own?
  • We need to get the proper credentials and bucket for GCS public repos
  • We need to set up credentials to let this script push to MPS

fbertsch (Contributor, Author)

@whd, you seem most likely to have opinions on the TODOs listed above.

Review threads on bin/schema_generator.sh (outdated, resolved)
whd (Member) commented Apr 26, 2019

* Will this script create a tarball and push that to GCS, or will Jenkins see the new push to MPS and create its own?

The latter.

* We need to get the proper credentials and bucket for GCS public repos

Absent a compelling reason, this should also happen on Jenkins (as triggered by the MPS push).

* We need to set up credentials to let this script push to MPS

I will work on setting this up for Airflow.

Review thread on bin/schema_generator.sh (outdated, resolved):

# 2. Remove all non-json schemas (e.g. parquet)

find . -not -name "*.schema.json" -type f | xargs rm
Member:

Is this necessary? I can imagine a future in which we have a unified branch (master) that is updated by both humans and machines, where parquet schemas are simply ignored in the GCP pipeline.

Contributor Author:

Not really necessary, but I would prefer to only have the relevant schemas available. Why don't we add the parquet schemas back in later if they become useful?

Contributor Author:

Basically my goal is to have an easy way of answering "what is running on prod?" The parquet schemas are not included in that (for GCP).

Member:

This is fine for now, since the generated branch will be GCP-only. I'll want to revisit this in the hypothetical future where we have a single branch. Since the responsibility of this script will not include uploading to GCS, the deletion of parquetmr files could still happen farther down the pipeline.
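
For reference, a slightly more defensive version of that cleanup step (a sketch, not the script's final form) uses null-delimited output so unusual filenames can't break the pipeline:

# Remove every file that isn't a JSON schema (e.g. parquet output);
# -print0 / -0 keep filenames with spaces or newlines intact.
find . -type f -not -name "*.schema.json" -print0 | xargs -0 rm -f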

Review thread on Dockerfile (outdated, resolved)
git checkout "$MPS_BRANCH" || git checkout -b "$MPS_BRANCH"
git commit -a -m "Auto-push from schema generation"

git push --repo https://name:password@bitbucket.org/name/repo.git
whd (Member), Apr 29, 2019:

Per https://bugzilla.mozilla.org/show_bug.cgi?id=1547333#c2 I think the easiest thing to do here is to set the ssh key via an env var and have this script write the contents of that var to ~/.ssh/id_rsa or similar.
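
A minimal sketch of that approach (the env var name and paths are assumptions, not the final script):

# Write the deploy key from the environment so git-over-ssh can use it.
mkdir -p ~/.ssh
echo "$MPS_SSH_KEY" > ~/.ssh/id_rsa           # MPS_SSH_KEY is a hypothetical env var
chmod 600 ~/.ssh/id_rsa                       # ssh refuses keys with loose permissions
ssh-keyscan github.com >> ~/.ssh/known_hosts  # pre-trust the host key for non-interactive runs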

Contributor Author:

I've updated this, but when testing locally I still needed to enter a password. The only way I could force-push to GitHub was using a personal access token.

Reply:

I'm not sure about Bitbucket, but GitHub requires pushing to the SSH URL rather than the HTTPS URL for passwordless pushes.
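
For example (a sketch; org/repo is a placeholder):

# Switch the remote to the SSH URL so the deploy key is used instead of a password prompt.
git remote set-url origin git@github.com:org/repo.git
git push origin "$MPS_BRANCH"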

Contributor Author:

I did make that change; see the updated commit.

- Use ssh key for pushing to github
- Cherry-pick commit for updated schemas
- Remove GCS handling
fbertsch marked this pull request as ready for review April 30, 2019 14:43
fbertsch (Contributor, Author)

I've added a whitelist and fixed the cherry-picking. Testing led to this branch:

https://github.com/mozilla-services/mozilla-pipeline-schemas/tree/generated-schemas

However, I've yet to get the push working using SSH auth.


cd ../

rm -rf templates tests validation
Contributor Author:

We could consider removing the remaining files in the git repository, e.g. CMakeLists.txt, Dockerfile, etc. They aren't necessarily useful in the read-only git branch.
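
A sketch of what that could look like, extending the existing cleanup step (the exact file list beyond those mentioned is an assumption):

# In addition to templates/tests/validation, drop the build scaffolding
# from the generated read-only branch (file list is illustrative).
rm -rf CMakeLists.txt Dockerfile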

Member:

I couldn't find an emoji for "this seems reasonable for a generated read-only git branch".

whd (Member) commented Apr 30, 2019

I've tested the generated-schemas branch on the Jenkins infra and it mostly appears to work. The following issues remain:

The .bq schemas aren't valid. This is a combination of mozilla/jsonschema-transpiler#51 and (I think) the top-level fields being unnamed. @acmiyaguchi told me he's fixing at least the former of these.

The whitelist appears to work, but as a result only a very limited subset of bq schemas is generated (e.g. no activity-stream, webpagetest, etc.). Was this for testing purposes, or is bq table generation for less fancy schemas supposed to be incorporated into dev directly? Perhaps a blacklist would be better.
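
For context, BigQuery's JSON schema format requires a name on every field, so an unnamed top-level field can't be loaded. A valid field list looks something like this (field names here are illustrative):

[
  {"name": "document_id", "type": "STRING", "mode": "NULLABLE"},
  {"name": "metadata", "type": "RECORD", "mode": "NULLABLE", "fields": [
    {"name": "geo_country", "type": "STRING", "mode": "NULLABLE"}
  ]}
]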

fbertsch (Contributor, Author)

@whd I added the whitelist so we could do a slow roll-out of schemas while keeping the branch limited to the ones that are deployed. All BQ table generation should be done during this build step.

whd (Member) commented Apr 30, 2019

@fbertsch ok, can you whitelist activity-stream and eng-workflow (bmobugs only)? These (and core v9/10) are the current tables we're generating in GCP so having them in the first round of automated schemas would be ideal.
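
A first-round allowlist might then look something like this (entry format and paths are illustrative; the real file's layout is defined by the script):

activity-stream
eng-workflow/bmobugs
telemetry/core.9
telemetry/core.10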

Review thread on bin/schema_generator.sh (outdated, resolved)
jklukas (Contributor) commented Apr 30, 2019

can you whitelist activity-stream and eng-workflow

I want to be careful here, especially with activity-stream, that we don't accidentally recreate tables and drop existing data. As long as the logic here doesn't drop tables, this should be okay.

I've already updated the activity-stream, bmobugs, and core tables "by hand" to include the new metadata fields (and to relax the previously required metadata.document_id to nullable, since it is no longer being sent). These may throw errors when we try to update the schema, since they contain old metadata fields not in the JSON schemas.

After I validate that the new metadata fields are flowing properly, I plan to update these tables to drop the old fields (tomorrow (Wednesday), if all goes well).
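
For reference, that kind of in-place schema change (adding nullable fields, relaxing REQUIRED to NULLABLE) can be applied with the bq CLI; a sketch, with table and file names as placeholders:

# Fetch the current schema, edit it by hand, then apply the widened schema.
bq show --schema --format=prettyjson project:dataset.activity_stream > schema.json
bq update project:dataset.activity_stream schema.json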

whd (Member) commented Apr 30, 2019

can you whitelist activity-stream and eng-workflow

I want to be careful here, especially with activity-stream, that we don't accidentally recreate tables and drop existing data. As long as the logic here doesn't drop tables, this should be okay.

The hand-crafted tables have a _dev suffix on them, whereas the managed ones do not, so no tables are dropped. A separate config change needs to be applied at https://github.com/mozilla-services/cloudops-infra/blob/ingestion/projects/beam/tf/modules/resources/dataflow_jobs/bigquery.tf#L30 to point the dataflow jobs at the new managed tables.

jklukas (Contributor) commented Apr 30, 2019

Good point that we're transitioning to the non-dev namespaces now. I think this sounds good. Once we change the job to point at the new namespaces, I can work on backfilling the dev activity-stream tables to the new ones.

fbertsch (Contributor, Author) commented May 2, 2019

@whd addressed the feedback. The submission_timestamp fields are now DATETIME, so we may be able to test deploy.

Review thread on bin/schema_generator.sh (outdated, resolved)
whd (Member) left a comment:

Aside from some cleanup and minor inconsistencies in the use of the imperative form in comments, and the sprinkling of || exits that don't make sense to me, this looks good.

Review threads on Dockerfile and bin/schema_generator.sh (resolved)
# Replace newlines with backticks (hard to do with sed): cat | tr
# Remove the last backtick; it's the file-ending newline: rev | cut | rev
# Replace backticks with "\|" (can't do that with tr): sed
# Find directories that don't match any of the regex expressions: find
Member:

The goal of this step is to remove bq schemas not in the allowlist; we are keeping the directories and JSON schemas.
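
Putting those comments together, the step looks roughly like this (a sketch reconstructed from the comments; file names and the find invocation are assumptions, not the final script):

# Build an emacs-regex alternation ("name1\|name2\|...") from the allowlist,
# one schema name per line.
PATTERN=$(cat bin/allowlist | tr '\n' '`' | rev | cut -c 2- | rev | sed -e 's/`/\\|/g')
# Delete generated .bq files whose path matches none of the allowed names,
# leaving directories and JSON schemas in place.
find . -type f -name "*.bq" -not -regex ".*\($PATTERN\).*" -exec rm -f {} +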

Review threads on bin/schema_generator.sh (outdated, resolved)
fbertsch (Contributor, Author) commented May 8, 2019 via email

Review thread on bin/allowlist (outdated, resolved)
acmiyaguchi (Contributor)

Thanks for picking this up, Anthony! My only input is that the jsonschema-transpiler dep won't be updated if the install is in the Dockerfile. I would recommend at least checking for an update in the script.

Unfortunately, there isn't a mechanism for updates; instead, we would have to force-install over the original. We should just update the tag in the Dockerfile when necessary.
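
For illustration, pinning and force-installing the transpiler could look like this (the version tag is a placeholder):

# --force reinstalls over an existing binary; bump the pinned tag to upgrade.
cargo install jsonschema-transpiler --version 1.0.0 --force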


@whd r?

acmiyaguchi requested a review from whd May 9, 2019 00:38
whd (Member) left a comment:

Aside from possibly removing the || exits now that the script uses set -e, LGTM.

acmiyaguchi merged commit d956116 into master May 9, 2019
kik-kik deleted the metadata_merge branch November 23, 2021 16:27