Coerce camelCase field names to snake_case in BQ sink #689
Codecov Report
@@ Coverage Diff @@
## master #689 +/- ##
============================================
- Coverage 90.77% 89.62% -1.16%
- Complexity 427 437 +10
============================================
Files 60 54 -6
Lines 2353 2111 -242
Branches 207 208 +1
============================================
- Hits 2136 1892 -244
- Misses 154 157 +3
+ Partials 63 62 -1
Continue to review full report at Codecov.
this is surprisingly simple
Some of the casing decisions made by Guava are inconsistent with the Rust snake_casing library that I used in mozilla/jsonschema-transpiler#79. I generated a diff against the latest branch of mozilla-pipeline-schemas and scraped the affected names into a file.
In this repository, I've created a simple Scala and Rust application for testing the column names that I scraped from the diff.
original | java | rust |
---|---|---|
BuildID | build_i_d | build_id |
D2DEnabled | d2_d_enabled | d2d_enabled |
GPUActive | g_p_u_active | gpu_active |
ProductID | product_i_d | product_id |
RAM | r_a_m | ram |
activeGMPlugins | active_g_m_plugins | active_gm_plugins |
changesetID | changeset_i_d | changeset_id |
closedTS | closed_t_s | closed_ts |
debugID | debug_i_d | debug_id |
deviceID | device_i_d | device_id |
engagedTS | engaged_t_s | engaged_ts |
expiredTS | expired_t_s | expired_ts |
l2cacheKB | l2cache_k_b | l2cache_kb |
l3cacheKB | l3cache_k_b | l3cache_kb |
learnMoreTS | learn_more_t_s | learn_more_ts |
loadDurationMS | load_duration_m_s | load_duration_ms |
memoryMB | memory_m_b | memory_mb |
offeredTS | offered_t_s | offered_ts |
processUptimeMS | process_uptime_m_s | process_uptime_ms |
submissionURL | submission_u_r_l | submission_url |
subsysID | subsys_i_d | subsys_id |
threadID | thread_i_d | thread_id |
totalPagesAM | total_pages_a_m | total_pages_am |
vendorID | vendor_i_d | vendor_id |
virtualMaxMB | virtual_max_m_b | virtual_max_mb |
votedTS | voted_t_s | voted_ts |
windowClosedTS | window_closed_t_s | window_closed_ts |
windowsUBR | windows_u_b_r | windows_ubr |
xulLoadDurationMS | xul_load_duration_m_s | xul_load_duration_ms |
We should choose a library that uses Unicode Standard Annex #29, which is the spec that the underlying unicode_segmentation implements for finding word boundaries. It seems to provide good word boundaries.
It looks like java.util.regex.Pattern supports word boundaries defined by Annex #29, so this is one potential alternative.
In any case, the libraries chosen by the schema transpiler and ingestion sink should match on all of the column names in mozilla-pipeline-schemas.
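To make the pattern in the "java" column concrete, here is a minimal Python sketch. It is not Guava's code; `per_capital_snake` is a hypothetical stand-in that assumes a word boundary before every capital letter, which happens to reproduce the outputs listed in the table above.

```python
import re


def per_capital_snake(name: str) -> str:
    """Insert "_" before every capital except a leading one, then lowercase.

    Illustrative stand-in only: it mimics the "java" column of the table,
    it is not the Guava implementation.
    """
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


for name in ["BuildID", "GPUActive", "RAM", "submissionURL"]:
    print(f"{name} -> {per_capital_snake(name)}")
```

Because every capital starts a new word under this rule, acronyms such as ID, GPU, and URL are split into single letters.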
Unicode segmentation was a red herring: it turns out that CaseFormat is an ASCII utility, and finding breaks between Unicode characters is mostly unrelated (just the usual separators). Instead, the We've already violated the assumption of reversibility by replacing periods and hyphens with underscores. Therefore, I suggest we implement heck's case formatting, because it produces names that seem more intuitive. For example,
For the previous implementation see https://github.com/mozilla-services/lua_sandbox_extensions/blob/master/parquet/parquet.cpp#L141-L164, which was aiming at Hive compatibility and is more similar to the Rust implementation. Generally I agree with making the Rust implementation the standard, as it produces the saner result in most cases. It is unfortunate that there is no well-accepted standard or specification for this conversion that multiple libraries can be written against. Absent such a standard, a test suite that includes all of these special cases across the different language implementations would be ideal.
The following Python snippet is an acceptance test that we could use.

from itertools import product
for c in product("Aa7_", repeat=4):
    word = "".join(c)
    print(word)

This generates 256 (4**4) strings that cover most (if not all) of the renaming cases. This is what it currently looks like.
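As a rough sketch of how the generated strings could serve as an acceptance test, the helper below runs two converters over all 256 inputs and reports where they disagree. The names `generate_cases` and `disagreements` are hypothetical, and the two converters in the usage lines are placeholders rather than the Guava or heck implementations.

```python
from itertools import product
from typing import Callable, List, Tuple


def generate_cases(alphabet: str = "Aa7_", length: int = 4) -> List[str]:
    """Build every string of the given length over the alphabet."""
    return ["".join(chars) for chars in product(alphabet, repeat=length)]


def disagreements(
    first: Callable[[str], str],
    second: Callable[[str], str],
    cases: List[str],
) -> List[Tuple[str, str, str]]:
    """Return (input, first output, second output) wherever the outputs differ."""
    return [(c, first(c), second(c)) for c in cases if first(c) != second(c)]


cases = generate_cases()
print(len(cases))  # 256

# Placeholder converters, standing in for two real implementations.
diffs = disagreements(str.lower, lambda s: s.strip("_").lower(), cases)
print(f"{len(diffs)} of {len(cases)} inputs differ")
```

An empty result from `disagreements` would mean the two implementations agree on every generated name.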
I've been trying to find a regular expression to match https://github.com/acmiyaguchi/test-casing/blob/master/RESULTS.md#rust-vs-python-cases
Perhaps we need to break this into two steps. First, we'd have a function to normalize runs of sequential capital letters such that only the first is capitalized (RAM -> Ram, ProductID -> ProductId), then pass the result to Guava's CaseFormat. We could also memoize these conversions if they're getting expensive.
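A minimal Python sketch of that two-step idea, under stated assumptions: `collapse_caps` and `two_step_snake` are hypothetical names, the second step is a simple split before each capital rather than Guava's CaseFormat, and `functools.lru_cache` stands in for the memoization.

```python
import re
from functools import lru_cache

# A run of two or more capitals that is not directly followed by a lowercase
# letter. Backtracking leaves the last capital of "GPUActive" to start "Active".
CAPS_RUN = re.compile(r"[A-Z]{2,}(?![a-z])")


def collapse_caps(name: str) -> str:
    """Step 1: keep only the first letter of each capital run uppercase."""
    return CAPS_RUN.sub(lambda m: m.group(0).capitalize(), name)


@lru_cache(maxsize=None)  # memoize repeated column names
def two_step_snake(name: str) -> str:
    """Step 2: split before each remaining capital, then lowercase."""
    collapsed = collapse_caps(name)
    return re.sub(r"(?<!^)(?=[A-Z])", "_", collapsed).lower()


for name in ["RAM", "ProductID", "GPUActive", "D2DEnabled"]:
    print(f"{name} -> {collapse_caps(name)} -> {two_step_snake(name)}")
```

In this sketch the digit in D2DEnabled interrupts the capital run, so the name still comes out as d2_d_enabled rather than the d2d_enabled listed in the rust column. Cases like this one suggest that collapsing capital runs is not sufficient on its own.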
I found a good solution that matches the Rust library by reversing the string to search for all the word boundaries. This passes all of the test cases:

import re

# Search for all camelCase situations in reverse with arbitrary lookaheads.
REV_WORD_BOUND_PAT = re.compile(
    r"""
    \b                                  # standard word boundary
    |(?<=[a-z][A-Z])(?=\d*[A-Z])        # A7Aa -> A7|Aa boundary
    |(?<=[a-z][A-Z])(?=\d*[a-z])        # a7Aa -> a7|Aa boundary
    |(?<=[A-Z])(?=\d*[a-z])             # a7A -> a7|A boundary
    """,
    re.VERBOSE,
)

def snake_case(line: str) -> str:
    # replace non-alphanumeric characters with spaces in the reversed line
    subbed = re.sub(r"[^\w]|_", " ", line[::-1])
    # apply the regex on the reversed string
    words = REV_WORD_BOUND_PAT.split(subbed)
    # filter spaces between words and snake_case and reverse again
    return "_".join([w.lower() for w in words if w.strip()])[::-1]

We should be able to re-implement this in pretty much any language with a built-in regex library. Updated test results can be found here.
Here are the test cases in CSV format: https://gist.github.com/acmiyaguchi/e377c4b6a53204d7d67e53a20dabc4a1
I've implemented the Requesting another review from @acmiyaguchi. |
LGTM, it's nice to have a consistent snake_casing methodology 🎉
assertEquals("untrusted_modules", PubsubMessageToTableRow.convertNameForBq("untrustedModules"));
assertEquals("xul_load_duration_ms",
    PubsubMessageToTableRow.convertNameForBq("xulLoadDurationMS"));
assertEquals("a11y_consumers", PubsubMessageToTableRow.convertNameForBq("A11Y_CONSUMERS"));
speedMHz is a good one to add; it has an interesting look.
// two words
assertEquals(SnakeCase.format("aA"), "a_a");
// underscores are word boundaries
assertEquals(SnakeCase.format("_a__a_"), "a_a");
Food for thought:
// underscores are word boundaries
assertEquals(SnakeCase.format("a__a"), "a_a");
// sentences can be italicized
assertEquals(SnakeCase.format("_a__a_"), "_a_a_");
// sentences can be bolded
assertEquals(SnakeCase.format("__a__a__"), "__a_a__");
Fixes #671
We cannot merge this until corresponding changes are made in the schema transpiler (see mozilla/jsonschema-transpiler#77) and the tables with the new structure are deployed.
Deploying this change will likely happen at the same time as moving destination tables to <namespace>_live datasets in the shared-prod project. See https://bugzilla.mozilla.org/show_bug.cgi?id=1563740