Add `--normalize-case` option for snake_casing column names #79

acmiyaguchi · 2019-07-09T00:32:25Z

This PR fixes #77 by adding a new option to snake_case all column names in a schema. To use it, pass the --normalize-case flag to the command. The option is off by default.

I've chosen heck as the casing library, since it seems to have the largest number of active users. It uses the unicode_segmentation crate to find word boundaries and performs snake_casing consistently across mixed casing.

I've refactored the code to remove extra clones and to make the order of the functions flow better when reading top-down. I also added a few comments here and there.

acmiyaguchi · 2019-07-09T22:53:48Z

I've updated the mozilla-pipeline-schemas scripts to make it easy to check the diff between different transpiler options.

I've created a diff of the --normalize-case option: https://gist.github.com/acmiyaguchi/3f526c440b67ebe469bcb6ab2da5123f

$ scripts/mps-generate-schemas.sh bq1 --type bigquery --resolve drop
...
80/132 succeded

$ scripts/mps-generate-schemas.sh bq2 --type bigquery --resolve drop --normalize-case
...
80/132 succeded

$ diff -q bq1/ bq2/
Files bq1/coverage.coverage.1.schema.json and bq2/coverage.coverage.1.schema.json differ
Files bq1/eng-workflow.hgpush.1.schema.json and bq2/eng-workflow.hgpush.1.schema.json differ
Files bq1/firefox-launcher-process.launcher-process-failure.1.schema.json and bq2/firefox-launcher-process.launcher-process-failure.1.schema.json differ
Files bq1/mozdata.event.1.schema.json and bq2/mozdata.event.1.schema.json differ
...

$ diff -q bq1/ bq2/ | wc -l
      45

$ diff bq1/ bq2/ > normalize_case.diff

src/ast.rs

acmiyaguchi · 2019-07-09T23:16:32Z

There are a couple of interesting cases from the diff that I want to highlight:

l2cacheKB -> l2cache_kb
speedMHz -> speed_m_hz
D2DEnabled -> d2d_enabled
DWriteEnabled -> d_write_enabled
activeGMPlugins -> active_gm_plugins
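
These splits follow from how runs of uppercase letters and digits get grouped. As a rough illustration only, here is a minimal ASCII-only sketch in Rust that tracks the case of the current run and splits either after a lowercase run or before the last capital of an uppercase run. The names `Mode`, `lower`, and `snake_case_sketch` are hypothetical; this is not the code in this PR and not heck's source, just an approximation that reproduces the five cases above.

```rust
/// Case of the run currently being scanned (hypothetical helper type).
#[derive(Clone, Copy, PartialEq)]
enum Mode {
    Boundary, // start of a new word, no cased letter seen yet
    Lower,
    Upper,
}

/// Collect a slice of chars into a lowercased String.
fn lower(chars: &[char]) -> String {
    chars.iter().collect::<String>().to_ascii_lowercase()
}

/// ASCII-only sketch of mode-based snake casing (an assumption, not the PR's code).
fn snake_case_sketch(input: &str) -> String {
    let mut words: Vec<String> = Vec::new();
    // Anything that is not an ASCII letter or digit (including '_') separates words.
    for chunk in input.split(|c: char| !c.is_ascii_alphanumeric()) {
        let chars: Vec<char> = chunk.chars().collect();
        let mut start = 0;
        let mut mode = Mode::Boundary;
        for i in 0..chars.len() {
            let c = chars[i];
            let next = match chars.get(i + 1) {
                Some(&n) => n,
                None => {
                    // Last character: whatever remains is the final word.
                    words.push(lower(&chars[start..]));
                    break;
                }
            };
            // Digits inherit the case of the run they appear in.
            let next_mode = if c.is_ascii_lowercase() {
                Mode::Lower
            } else if c.is_ascii_uppercase() {
                Mode::Upper
            } else {
                mode
            };
            if next_mode == Mode::Lower && next.is_ascii_uppercase() {
                // lowercase run followed by a capital: split after `c`
                words.push(lower(&chars[start..=i]));
                start = i + 1;
                mode = Mode::Boundary;
            } else if mode == Mode::Upper && c.is_ascii_uppercase() && next.is_ascii_lowercase() {
                // uppercase run followed by a lowercase letter: split before `c`
                words.push(lower(&chars[start..i]));
                start = i;
                mode = Mode::Boundary;
            } else {
                mode = next_mode;
            }
        }
    }
    words.join("_")
}
```

Under these assumptions, `snake_case_sketch("speedMHz")` splits `M` from `Hz` because `H` is the last capital before a lowercase letter, which is why the unit comes out as `m_hz` rather than `mhz`.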

badboy

Minimal nits, but this looks good.

Cargo.toml

src/ast.rs

tests/normalize_case.rs

acmiyaguchi · 2019-07-15T22:56:17Z

@badboy This PR has changed a bit since the last review, so I'm retagging you for review. We're looking to have a consistent implementation of snake-casing across the transpiler and ingestion, so I reimplemented the logic using regular expressions and string manipulation instead. This still produces the same output as heck, but is portable to Java and Python 3.

I've added 3 separate test cases to ensure that the behavior stays the same:

alphanum_3 - all strings of length 3 drawn from the alphabet "aA7"
word_4 - all strings of length 4 drawn from the alphabet "aA7_"
mps-diff-integration - the strings that were generated by diffing mozilla-pipeline-schemas with this PR using heck
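
The two exhaustive cases are small enough to enumerate directly: 3^3 = 27 strings for alphanum_3 and 4^4 = 256 strings for word_4. A sketch of how such inputs could be generated in Rust, using a hypothetical helper named `all_strings` (the tests in this PR may build their inputs differently):

```rust
/// Enumerate every string of exactly `len` characters drawn from `alphabet`.
/// `all_strings` is a hypothetical helper name used for illustration.
fn all_strings(alphabet: &str, len: usize) -> Vec<String> {
    let symbols: Vec<char> = alphabet.chars().collect();
    // Start from the single empty string and extend by one symbol per round.
    let mut out = vec![String::new()];
    for _ in 0..len {
        let mut next = Vec::with_capacity(out.len() * symbols.len());
        for prefix in &out {
            for &c in &symbols {
                let mut s = prefix.clone();
                s.push(c);
                next.push(s);
            }
        }
        out = next;
    }
    out
}
```

Calling `all_strings("aA7", 3)` yields the alphanum_3 inputs and `all_strings("aA7_", 4)` the word_4 inputs; each string can then be run through both implementations and the outputs compared.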

I also did the following:

dropped the regex crate in favor of onig, a wrapper around oniguruma for lookaround support
made to_snake_case a function accessible via a public interface for testing.

jklukas · 2019-07-16T14:37:11Z

scripts/mps-generate-schemas.sh

 mkdir $outdir

 total=0
 failed=0
 for schema in $schemas; do
    namespace=$(basename $(dirname $(dirname $schema)))
-    schema_filename=$(basename $schema | sed 's/schema.json/avro.json/g')
+    schema_filename=$(basename $schema)


I think this is the point where we need to normalize casing of namespace and doctype names. Or perhaps we just have a special case to modify untrustedModules to untrusted_modules, and we create an issue in mps to fail a test when adding a new schema with capital letters in the name.

I think the renaming needs to happen here as well since these scripts are AFAIK unused by MSG.

Having MPS fail a test when adding other pings with this property sounds swell.

If this script is just used for testing within this repo, then we don't need to worry about snake-casing here. I'd agree that it's the script in MSG where the change needs to happen.

Filed mozilla-services/mozilla-pipeline-schemas#355

Yep, this script is the prototype of the script in gcp-ingestion/ingestion-beam and is mostly for end-to-end testing of encoding json into avro, then importing into bigquery and reading the schema back. The transpiler reads from a file (or stdin), so it doesn't have a bias on how the schema repo is organized.

I'm going to leave this as is, for conciseness.

I am going to prep a PR for MSG

See mozilla/mozilla-schema-generator#46

badboy

Thanks for the explanation in the comment, that helped!
Code looks good so far, couple of minor nits in the comments.

badboy · 2019-07-16T15:28:19Z

src/casing.rs

+/// detected by a lowercase followed by an uppercase. Numbers can take on either
+/// case depending on the preceeding symbol.
+///
+/// See: https://github.com/withoutboats/heck/blob/master/src/lib.rs#L7-L17


Maybe better to use a fixed-point-in-time link: https://github.com/withoutboats/heck/blob/093d56fbf001e1506e56dbfa38631d99b1066df1/src/lib.rs#L7-L17

badboy · 2019-07-16T15:29:32Z

src/casing.rs

+            \b                              # standard word boundary
+            |(?<=[a-z][A-Z])(?=\d*[A-Z])    # break on runs of uppercase e.g. A7Aa -> A7|Aa
+            |(?<=[a-z][A-Z])(?=\d*[a-z])    # break in runs of lowercase e.g a7Aa -> a7|Aa
+            |(?<=[A-Z])(?=\d*[a-z])         # ends with an uppercase e.g. a7A -> a7|A


I'm a regex fan, but ... oh boy :D
Did you come up with the regex here or is that copied from somewhere?

I came up with the regex in a separate project for testing all of the different casing functions/libraries that we've been using. Setting up the tests was the key result: I have auto-generated reporting for different implementations of snake_casing, including this regex algorithm, which is implemented in Python.

Here are some other resources that I ended up using:

[RexEgg] - Regex Boundaries and Delimiters—Standard and Advanced - This introduced the concept of word boundaries like \b and how to implement it from scratch using lookaheads and lookbehinds.

[StackOverflow] - RegEx to split camelCase or TitleCase (advanced) - This validated my approach as I was developing the regex in an online REPL. It does not include digits, which can be either uppercase or lowercase depending on the previous letter.

[StackOverflow] - What's the technical reason for “lookbehind assertion MUST be fixed length” in regex? - So how can you tell whether a digit is lowercase or uppercase (e.g. f("a7aA") == f("A7AA"))? My first attempt was something similar to (?<=[a-z]\d*)(?=[A-Z][a-z]) for a "lowercase" digit, but this is not supported by most engines because dynamic look-behinds can perform very badly. I found a neat idea for "inverted" positive lookaheads, which can be used to figure out if a digit is lowercase or uppercase in the 2nd and 3rd lines of the regex.

These three resources, along with the definition of a word boundary and the reference test cases in heck, were useful in developing the regex.

src/ast.rs

whd · 2019-07-16T17:40:39Z

We're going to need a version bump for this new option so that we can reference it in MSG.

jklukas

LGTM

acmiyaguchi added 9 commits July 8, 2019 15:16

Reorder functions and remove excess clones

3487666

Rename normalize_properties and refactor recurse_infer_name

447d561

Add --normalize-case and implement Default for Context

c577171

Add failing test for normalizing casing

80a2505

Add normalize_case to function definitions

2333ccc

Fix broken tests and sort property names properly

a8a5e5a

Use heck to snake_case column names

95024dc

Add a new test-case asserting names that start with numbers

cae2280

Rename prefix_numeric to normalize_numeric_prefix

ac64a47

acmiyaguchi marked this pull request as ready for review July 9, 2019 22:17

acmiyaguchi requested a review from badboy July 9, 2019 22:17

acmiyaguchi added 2 commits July 9, 2019 15:38

Update context in main

5132a17

Update scripts for generating a diff

5567a02

acmiyaguchi requested a review from jklukas July 9, 2019 22:54

badboy approved these changes Jul 11, 2019

acmiyaguchi mentioned this pull request Jul 15, 2019

Coerce camelCase field names to snake_case in BQ sink mozilla/gcp-ingestion#689

Merged

acmiyaguchi added 5 commits July 15, 2019 14:14

Add test cases for casing and move cases for translating schemas

708689f

Expose snake casing as a public module for integration testing

1ef4ced

Replace regex with oniguruma; implement to_snake_case with regexes

a50eda3

Check-in latest implementation of casing; use static_lazy

4a376fd

Add comment for test case when normalizing property names

c6f7e6f

acmiyaguchi requested a review from badboy July 15, 2019 22:56

Add docstring to to_snake_case

b8c9bb6

jklukas reviewed Jul 16, 2019

badboy approved these changes Jul 16, 2019

jklukas added a commit to mozilla/mozilla-schema-generator that referenced this pull request Jul 16, 2019

Snake case BQ table and field names

8dceb6e

Relies on mozilla/jsonschema-transpiler#79

jklukas mentioned this pull request Jul 16, 2019

Snake case BQ table and field names mozilla/mozilla-schema-generator#46

Merged

jklukas approved these changes Jul 16, 2019

jklukas added a commit to mozilla/mozilla-schema-generator that referenced this pull request Jul 16, 2019

Snake case BQ table and field names

478c159

Relies on mozilla/jsonschema-transpiler#79

acmiyaguchi added 2 commits July 16, 2019 12:57

Update documentation to be more specific

d307cc3

Bump version to 1.2.0

bc9c0ec

acmiyaguchi merged commit 536c6fd into mozilla:master Jul 16, 2019
