-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add --normalize-case
option for snake_casing column names
#79
Conversation
I've updated the mozilla-pipeline-schema scripts to easily check the diff between different transpiler options. I've created a diff of the $ scripts/mps-generate-schemas.sh bq1 --type bigquery --resolve drop
...
80/132 succeded
$ scripts/mps-generate-schemas.sh bq2 --type bigquery --resolve drop --normalize-case
...
80/132 succeded
$ diff -q bq1/ bq2/
Files bq1/coverage.coverage.1.schema.json and bq2/coverage.coverage.1.schema.json differ
Files bq1/eng-workflow.hgpush.1.schema.json and bq2/eng-workflow.hgpush.1.schema.json differ
Files bq1/firefox-launcher-process.launcher-process-failure.1.schema.json and bq2/firefox-launcher-process.launcher-process-failure.1.schema.json differ
Files bq1/mozdata.event.1.schema.json and bq2/mozdata.event.1.schema.json differ
...
$ diff -q bq1/ bq2/ | wc -l
45
$ diff bq1/ bq2/ > normalize_case.diff |
There are a couple of interesting cases from the diff that I want to highlight:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minimal nits, but this looks good.
@badboy This PR has changed a bit from the last review, so I'm retagging you for review. We're looking to have a consistent implementation of snake-casing across the transpiler and ingestion, so I reimplemented the logic using regular expressions and string manipulation instead. This still maintains the same output as heck, but is portable to java and python3. I've added 3 separate test cases to ensure that the behavior stays the same:
I also did the following:
|
mkdir $outdir | ||
|
||
total=0 | ||
failed=0 | ||
for schema in $schemas; do | ||
namespace=$(basename $(dirname $(dirname $schema))) | ||
schema_filename=$(basename $schema | sed 's/schema.json/avro.json/g') | ||
schema_filename=$(basename $schema) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is the point where we need to normalize casing of namespace and doctype names. Or perhaps we just have a special case to modify untrustedModules
to untrusted_modules
, and we create an issue in mps to fail a test when adding a new schema with capital letters in the name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the renaming needs to happen here as well since these scripts are AFAIK unused by MSG.
Having MPS fail a test when adding other pings with this property sounds swell.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this script is just used for testing within this repo, then we don't need to worry about snake-casing here. I'd agree that it's the script in MSG where the change needs to happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, this script is the prototype of the script in gcp-ingestion/ingestion-beam
and is mostly for end-to-end testing of encoding json into avro, then importing into bigquery and reading the schema back. The transpiler reads from a file (or stdin), so it doesn't have a bias on how the schema repo is organized.
I'm going to leave this as is, for conciseness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am going to prep a PR for MSG
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation in the comment, that helped!
Code looks good so far, couple of minor nits in the comments.
src/casing.rs
Outdated
/// detected by a lowercase followed by an uppercase. Numbers can take on either | ||
/// case depending on the preceeding symbol. | ||
/// | ||
/// See: https://github.com/withoutboats/heck/blob/master/src/lib.rs#L7-L17 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe better to use a fixed-point-in-time link: https://github.com/withoutboats/heck/blob/093d56fbf001e1506e56dbfa38631d99b1066df1/src/lib.rs#L7-L17
src/casing.rs
Outdated
\b # standard word boundary | ||
|(?<=[a-z][A-Z])(?=\d*[A-Z]) # break on runs of uppercase e.g. A7Aa -> A7|Aa | ||
|(?<=[a-z][A-Z])(?=\d*[a-z]) # break in runs of lowercase e.g a7Aa -> a7|Aa | ||
|(?<=[A-Z])(?=\d*[a-z]) # ends with an uppercase e.g. a7A -> a7|A |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a regex fan, but ... oh boy :D
Did you come up with the regex here or is that copied from somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I came up with the regex in a separate project for testing all of the different casing functions/libraries that we've been using. Setting up the tests was the key result, I have an auto-generated reporting for different implementations of snake_casing, including this regex algorithm that's implemented in python.
Here are some other resources that I ended up using:
- [RexEgg] - Regex Boundaries and Delimiters—Standard and Advanced - This introduced the concept of word boundaries like
\b
and how to implement it from scratch using lookaheads and lookbehinds. - [StackOverflow] - RegEx to split camelCase or TitleCase (advanced) - This validated my approach as I was developing the regex in an online REPL. It does not include digits, which can be either uppercase or lowercase depending on the previous letter.
- [StackOverflow] - What's the technical reason for “lookbehind assertion MUST be fixed length” in regex? - So how can you tell whether a digit is lowercase or uppercase (e.g.
f("a7aA") == f("A7AA")
)? My first attempt was something similar to(?<=[a-z]\d*)(?=[A-Z][a-z])
for a "lowercase" digit, but this is not supported by most engines because dynamic look-behinds can perform very badly. I found a neat idea for "inverted" positive lookaheads, which can be used to figure out if a digit is lowercase or uppercase in the 2nd and 3rd lines of the regex.
These three resources, along with the definition of a word boundary and the reference test cases in heck, were useful with developing the regex.
We're going to need a version bump for this new option so that we can reference it in MSG. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
This PR fixes #77 by adding a new option to
snake_case
all column names in a schema. This should be used by adding a--normalize-case
flag to the command. By default, this option is turned off.I've chosen heck as the casing library, since it seems to have the largest number of active users. It uses the
unicode_segmentation
crate to find word boundaries and performssnake_casing
consistently across mixed casing.I've refactored the code to remove extra clones and to make the order of the functions flow better when reading top-down. I also added a few comments here and there.