Add script to generate and upload schemas #18
Conversation
There are still a few TODOs:

@whd, you seem most likely to have opinions on the TODOs listed above.
The latter.
Absent a compelling reason, this should also happen on jenkins (as triggered by the MPS push).
I will work on setting this up for airflow.
bin/schema_generator.sh (Outdated)

```sh
# 2. Remove all non-json schemas (e.g. parquet)
find . -not -name "*.schema.json" -type f | xargs rm
```
Is this necessary? I can imagine a future in which we have a unified branch (master) that is updated by both humans and machines, where parquet schemas are simply ignored in the GCP pipeline.
Not really necessary, but I would prefer to only have the relevant schemas available. Why don't we add the parquet schemas back in later if they become useful?
Basically my goal is to have an easy way of answering "what is running on prod?" And the parquet schemas are not included in that (for GCP).
This is fine for now, since the generated branch will be gcp-only. I'll want to revisit this in the hypothetical future where we have a single branch. Since the responsibility of this script will not include uploading to GCS, the deletion of parquet-mr files could still happen farther down the pipeline.
bin/schema_generator.sh
Outdated
git checkout $MPS_BRANCH || git checkout -b $MPS_BRANCH | ||
git commit -a -m "Auto-push from schema generation" | ||
|
||
git push --repo https://name:password@bitbucket.org/name/repo.git |
Per https://bugzilla.mozilla.org/show_bug.cgi?id=1547333#c2 I think the easiest thing to do here is to set the ssh key via an env var and have this script write the contents of that var to ~/.ssh/id_rsa or similar.
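A minimal sketch of that suggestion (the env var name SSH_KEY is an assumption, and a temp directory stands in for ~/.ssh so the demo doesn't touch real keys):

```shell
# Sketch only: the CI job would export the private key in an env var
# (SSH_KEY is an assumed name) and the script materializes it on disk
# for git's ssh transport. A real job would write ~/.ssh/id_rsa.
SSH_KEY='fake key material for demo'
SSH_DIR=$(mktemp -d)
printf '%s\n' "$SSH_KEY" > "$SSH_DIR/id_rsa"
chmod 600 "$SSH_DIR/id_rsa"   # ssh refuses keys with looser permissions
```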
I've updated this, but testing locally I still needed to input a password. The only way I could force push to github was using a personal access token.
I'm not sure about Bitbucket, but GitHub requires pushing to the SSH URL rather than the HTTPS URL for passwordless pushes.
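For illustration, switching a clone's remote from the HTTPS URL to the SSH URL looks like this (run in a throwaway repo; the repo URL is the one from this PR):

```shell
# Demo in a throwaway repo: point origin at the SSH URL so pushes
# authenticate with a key instead of a password.
cd "$(mktemp -d)"
git init -q .
git remote add origin https://github.com/mozilla-services/mozilla-pipeline-schemas.git
git remote set-url origin git@github.com:mozilla-services/mozilla-pipeline-schemas.git
git remote get-url origin   # now the SSH form of the URL
```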
I did make that change, see the updated commit.
- Use ssh key for pushing to github
- Cherry-pick commit for updated schemas
- Remove GCS handling
I've added a whitelist and fixed the cherry-picking. Testing led to this branch: https://github.com/mozilla-services/mozilla-pipeline-schemas/tree/generated-schemas However, I've yet to get the push working using SSH auth.
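A rough sketch of the cherry-pick flow, run in a throwaway repo (file names and the "working" branch name are illustrative; "generated-schemas" is the branch from this PR):

```shell
# Throwaway-repo sketch: commit generated schemas on a working branch,
# then cherry-pick that commit onto the publish branch.
cd "$(mktemp -d)"
git init -q .
git config user.email demo@example.com
git config user.name demo
echo base > README; git add README; git commit -qm "base"
git checkout -qb working
echo '{}' > new.schema.json; git add new.schema.json; git commit -qm "Generated schemas"
git checkout -q -                      # back to the default branch
git checkout -qb generated-schemas     # publish branch (name from this PR)
git cherry-pick working >/dev/null     # carry the generated commit over
```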
bin/schema_generator.sh (Outdated)

```sh
cd ../

rm -rf templates tests validation
```
We could consider removing the remaining files in the git repository, e.g. CMakeLists.txt, Dockerfile, etc. They aren't necessarily useful in the read-only git branch.
I couldn't find an emoji for "this seems reasonable for a generated read-only git branch".
I've tested the generated-schemas branch on the jenkins infra and it mostly appears to work. The following issues still arise: the whitelist appears to work, but as a result a very limited subset of bq schemas are generated (e.g. no activity stream, webpagetest, etc.). Was this for testing purposes, or is bq table generation for less fancy schemas supposed to be incorporated into dev directly? Perhaps a blacklist would be better.
@whd I added the whitelist so we could do a slow roll-out of schemas, while keeping the branch limited to the ones that are deployed. All BQ table generation should be done during this build step.
@fbertsch ok, can you whitelist activity-stream and eng-workflow (bmobugs only)? These (and core v9/10) are the current tables we're generating in GCP, so having them in the first round of automated schemas would be ideal.
I want to be careful here, especially with activity-stream, that we don't accidentally recreate tables and drop existing data. As long as the logic here doesn't drop tables, this should be okay. I've already updated the activity-stream, bmobugs, and core tables "by hand" to include the new metadata fields (and to relax the previously required metadata.document_id to nullable, since it is no longer being sent). These may throw errors when we try to update the schema since they contain old metadata fields not in the json schemas. After I validate that the new metadata fields are flowing properly, I plan to update these tables to drop the old fields (tomorrow (Wednesday), if all goes well).
The hand-crafted tables have a |
Good point that we're transitioning to the non-dev namespaces now. I think this sounds good. Once we change the job to point at the new namespaces, I can work on backfilling the |
@whd addressed the feedback. The |
Aside from some cleanup and minor inconsistencies in usage of the imperative form in comments, and the sprinkling of `|| exit`s that don't make sense to me, this looks good.
bin/schema_generator.sh (Outdated)

```sh
# Replace newlines with backticks (hard to do with sed): cat | tr
# Remove the last backtick; it's the file-ending newline: rev | cut | rev
# Replace backticks with "\|" (can't do that with tr): sed
# Find directories that don't match any of the regex expressions: find
```
The goal of this step is to remove bq schemas not in the allowlist; we are keeping the directories and json schemas.
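Reading the script comments above together, the step could look roughly like this (a sketch, with allowlist contents and directory names invented for the demo):

```shell
# Demo of the allowlist filter: build a grep alternation from the allowlist
# file, then list schema directories that match none of the allowed patterns
# (those are the candidates for removal).
WORK=$(mktemp -d)
printf 'telemetry\nactivity-stream\n' > "$WORK/allowlist"

# newline -> backtick, drop the file-ending backtick, backtick -> "\|"
PATTERN=$(tr '\n' '`' < "$WORK/allowlist" | rev | cut -c 2- | rev | sed -e 's/`/\\|/g')

mkdir -p "$WORK/schemas/telemetry" "$WORK/schemas/webpagetest"
find "$WORK/schemas" -mindepth 1 -type d | grep -v "$PATTERN"
```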
Thanks for picking this up, Anthony! My only input is that the jsonschema-transpiler dep won't be updated if the install is in the Dockerfile. I would recommend at least checking for an update in the script.
…On Wed, May 8, 2019, 2:51 AM Anthony Miyaguchi wrote:

In bin/schema_generator.sh <#18 (comment)>:

```diff
+    # Assumes the current directory is the root of the repository
+
+    find . -name "*.bq" -type f -exec git add {} +
+    git checkout ./*.schema.json
+    git commit -a -m "Interim Commit"
+
+    git checkout $MPS_BRANCH_PUBLISH || git checkout -b $MPS_BRANCH_PUBLISH
+
+    # Keep only the schemas dir
+    find . -mindepth 1 -maxdepth 1 -not -name .git -exec rm -rf {} +
+    git checkout $MPS_BRANCH_WORKING -- schemas
+    git commit -a -m "Auto-push from schema generation"
+}
+
+function main() {
+    cd $BASE_DIR || exit
```
There are a couple of places where the folders and branches are being deleted without checking for existence. I can go back and add `set -eou pipefail` for errors, unset variables, and pipeline errors.
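A hedged sketch of that hardening, with paths invented for the demo:

```shell
#!/bin/bash
# Abort on errors, unset variables, and pipeline failures, and guard
# destructive steps with existence checks instead of assuming paths exist.
set -euo pipefail

WORKDIR=$(mktemp -d)            # stand-in for the repository checkout
mkdir -p "$WORKDIR/templates"

for d in templates tests validation; do
    if [ -d "$WORKDIR/$d" ]; then
        rm -rf "${WORKDIR:?}/$d"   # ${VAR:?} errors out if WORKDIR is empty
    fi
done
echo "cleanup complete"
```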
Co-Authored-By: whd <whd@users.noreply.github.com>

- Remove excess chown
- Rename GROUP to GROUP_ID
Unfortunately, there isn't a mechanism for updates. Instead, we would have to just force install over the original. We should just update the tag in the Dockerfile when necessary. @whd r?
Aside from possibly removing the `|| exit`s now that the script is `set -e`, LGTM.