Add --normalize-case option for snake_casing column names #79

Merged
acmiyaguchi merged 19 commits into mozilla:master on Jul 16, 2019

Contributor

@acmiyaguchi acmiyaguchi commented Jul 9, 2019

This PR fixes #77 by adding a new option that snake_cases all column names in a schema. To enable it, pass the --normalize-case flag to the command. The option is off by default.

I've chosen heck as the casing library, since it seems to have the largest number of active users. It uses the unicode_segmentation crate to find word boundaries and performs snake_casing consistently across mixed casing.

I've refactored the code to remove extra clones and to make the order of the functions flow better when reading top-down. I also added a few comments here and there.

@acmiyaguchi acmiyaguchi marked this pull request as ready for review July 9, 2019 22:17
@acmiyaguchi acmiyaguchi requested a review from badboy July 9, 2019 22:17
@acmiyaguchi
Contributor Author

I've updated the mozilla-pipeline-schema scripts to easily check the diff between different transpiler options.

I've created a diff of the --normalize-case option: https://gist.github.com/acmiyaguchi/3f526c440b67ebe469bcb6ab2da5123f

$ scripts/mps-generate-schemas.sh bq1 --type bigquery --resolve drop
...
80/132 succeded

$ scripts/mps-generate-schemas.sh bq2 --type bigquery --resolve drop --normalize-case
...
80/132 succeded

$ diff -q bq1/ bq2/
Files bq1/coverage.coverage.1.schema.json and bq2/coverage.coverage.1.schema.json differ
Files bq1/eng-workflow.hgpush.1.schema.json and bq2/eng-workflow.hgpush.1.schema.json differ
Files bq1/firefox-launcher-process.launcher-process-failure.1.schema.json and bq2/firefox-launcher-process.launcher-process-failure.1.schema.json differ
Files bq1/mozdata.event.1.schema.json and bq2/mozdata.event.1.schema.json differ
...

$ diff -q bq1/ bq2/ | wc -l
      45

$ diff bq1/ bq2/ > normalize_case.diff

@acmiyaguchi acmiyaguchi requested a review from jklukas July 9, 2019 22:54
@acmiyaguchi
Contributor Author

There are a couple of interesting cases from the diff that I want to highlight:

l2cacheKB -> l2cache_kb
speedMHz -> speed_m_hz
D2DEnabled -> d2d_enabled
DWriteEnabled -> d_write_enabled
activeGMPlugins -> active_gm_plugins

Member

@badboy badboy left a comment

Minimal nits, but this looks good.

@acmiyaguchi
Contributor Author

@badboy This PR has changed a bit since the last review, so I'm re-requesting your review. We want a consistent implementation of snake-casing across the transpiler and ingestion, so I reimplemented the logic using regular expressions and string manipulation instead. It still produces the same output as heck, but is portable to Java and Python 3.

I've added 3 separate test cases to ensure that the behavior stays the same:

  • alphanum_3 - all strings of length 3 drawn from the alphabet "aA7"
  • word_4 - all strings of length 4 drawn from the alphabet "aA7_"
  • mps-diff-integration - the strings that were generated by diffing mozilla-pipeline-schemas with this PR using heck
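
As a rough illustration of how the two exhaustive inputs above can be enumerated, here is a minimal Python sketch. It is not the Rust test code from this PR, and the helper name `all_strings` is hypothetical.

```python
from itertools import product


def all_strings(alphabet: str, length: int) -> list:
    """Return every string of the given length drawn from the alphabet."""
    return ["".join(chars) for chars in product(alphabet, repeat=length)]


# alphanum_3: 3 symbols, length 3 -> 3**3 = 27 strings
alphanum_3 = all_strings("aA7", 3)

# word_4: 4 symbols, length 4 -> 4**4 = 256 strings
word_4 = all_strings("aA7_", 4)

print(len(alphanum_3), len(word_4))
```

Each generated string would then be passed through both casing implementations and the outputs compared.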

I also did the following:

  • dropped the regex crate in favor of onig, a wrapper around oniguruma, for lookaround support
  • made to_snake_case accessible via a public interface for testing

@acmiyaguchi acmiyaguchi requested a review from badboy July 15, 2019 22:56
mkdir $outdir

total=0
failed=0
for schema in $schemas; do
namespace=$(basename $(dirname $(dirname $schema)))
schema_filename=$(basename $schema | sed 's/schema.json/avro.json/g')
schema_filename=$(basename $schema)

I think this is the point where we need to normalize casing of namespace and doctype names. Or perhaps we just have a special case to modify untrustedModules to untrusted_modules, and we create an issue in mps to fail a test when adding a new schema with capital letters in the name.

Member

I think the renaming needs to happen here as well since these scripts are AFAIK unused by MSG.

Having MPS fail a test when adding other pings with this property sounds swell.
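
A check along those lines might look like the following Python sketch. It assumes the `namespace/doctype/file` layout implied by the `basename $(dirname $(dirname $schema))` line in the script; the function name `has_capitals` and the example paths are hypothetical, not taken from MPS.

```python
from pathlib import PurePosixPath


def has_capitals(schema_path: str) -> bool:
    """True if the namespace or doctype directory name contains an uppercase letter."""
    parts = PurePosixPath(schema_path).parts
    # Assumed layout: .../<namespace>/<doctype>/<file>
    namespace, doctype = parts[-3], parts[-2]
    return any(ch.isupper() for ch in namespace + doctype)


# Hypothetical paths, for illustration only.
print(has_capitals("schemas/telemetry/untrustedModules/untrustedModules.4.schema.json"))
print(has_capitals("schemas/telemetry/main/main.4.schema.json"))
```

A test suite could call this on every schema path and fail when it returns true.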

If this script is just used for testing within this repo, then we don't need to worry about snake-casing here. I'd agree that it's the script in MSG where the change needs to happen.

Contributor Author

Yep, this script is the prototype of the script in gcp-ingestion/ingestion-beam and is mostly for end-to-end testing of encoding json into avro, then importing into bigquery and reading the schema back. The transpiler reads from a file (or stdin), so it doesn't have a bias on how the schema repo is organized.

I'm going to leave this as is, for conciseness.

@jklukas jklukas Jul 16, 2019

I am going to prep a PR for MSG

Member

@badboy badboy left a comment

Thanks for the explanation in the comment, that helped!
Code looks good so far, couple of minor nits in the comments.

src/casing.rs Outdated
/// detected by a lowercase followed by an uppercase. Numbers can take on either
/// case depending on the preceeding symbol.
///
/// See: https://github.com/withoutboats/heck/blob/master/src/lib.rs#L7-L17

src/casing.rs Outdated
\b # standard word boundary
|(?<=[a-z][A-Z])(?=\d*[A-Z]) # break on runs of uppercase e.g. A7Aa -> A7|Aa
|(?<=[a-z][A-Z])(?=\d*[a-z]) # break in runs of lowercase e.g a7Aa -> a7|Aa
|(?<=[A-Z])(?=\d*[a-z]) # ends with an uppercase e.g. a7A -> a7|A

Member

I'm a regex fan, but ... oh boy :D
Did you come up with the regex here or is that copied from somewhere?

Contributor Author

I came up with the regex in a separate project for testing all of the different casing functions and libraries that we've been using. Setting up the tests was the key result: I have auto-generated reporting for different implementations of snake_casing, including this regex algorithm, which is implemented in Python.

Here are some other resources that I ended up using:

These three resources, along with the definition of a word boundary and the reference test cases in heck, were useful in developing the regex.
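
To make the boundary rules concrete, here is a minimal Python sketch built around the pattern quoted in the review. It rests on several assumptions. First, the pattern is applied to the reversed input, which is what the `A7Aa -> A7|Aa` examples in its comments imply. Second, every non-word character and `_` is treated as a separator. Third, Python's `re` stands in for onig. The names `SEPARATOR` and `REV_WORD_BOUNDARY` are hypothetical, and the sketch is not claimed to match heck on every input.

```python
import re

# Assumption: any non-word character, plus "_", only separates words.
SEPARATOR = re.compile(r"[^\w]|_")

# Boundary pattern from the review, matched against the REVERSED input
# (an assumption inferred from the examples in its comments).
REV_WORD_BOUNDARY = re.compile(
    r"""
    \b                              # standard word boundary
    |(?<=[a-z][A-Z])(?=\d*[A-Z])    # A7Aa -> A7|Aa
    |(?<=[a-z][A-Z])(?=\d*[a-z])    # a7Aa -> a7|Aa
    |(?<=[A-Z])(?=\d*[a-z])         # a7A -> a7|A
    """,
    re.VERBOSE,
)


def to_snake_case(name: str) -> str:
    """Reverse, mark separators and boundaries with spaces, re-join with '_'."""
    rev = SEPARATOR.sub(" ", name[::-1])
    rev = REV_WORD_BOUNDARY.sub(" ", rev)
    # split() drops the empty pieces produced by leading/trailing boundaries.
    return "_".join(rev.split()).lower()[::-1]


for name in ["l2cacheKB", "speedMHz", "D2DEnabled", "DWriteEnabled", "activeGMPlugins"]:
    print(name, "->", to_snake_case(name))
```

Reversing the string turns what would be variable-width lookbehinds into lookaheads, so only fixed-width lookbehinds remain. That keeps the pattern usable in engines that reject variable-width lookbehind, Python's `re` among them.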

@whd
Member

whd commented Jul 16, 2019

We're going to need a version bump for this new option so that we can reference it in MSG.

jklukas added a commit to mozilla/mozilla-schema-generator that referenced this pull request Jul 16, 2019
@jklukas jklukas left a comment

LGTM

jklukas added a commit to mozilla/mozilla-schema-generator that referenced this pull request Jul 16, 2019
@acmiyaguchi acmiyaguchi merged commit 536c6fd into mozilla:master Jul 16, 2019
Successfully merging this pull request may close these issues.

Convert camel case bigquery column names to snake case
4 participants