Coerce camelCase field names to snake_case in BQ sink #689

jklukas · 2019-07-05T18:22:17Z

Fixes #671

We cannot merge this until corresponding changes are made in the schema transpiler (see mozilla/jsonschema-transpiler#77) and the tables with the new structure are deployed.

Deploying this change will likely happen at the same time as moving destination tables to <namespace>_live datasets in the shared-prod project. See https://bugzilla.mozilla.org/show_bug.cgi?id=1563740

codecov-io · 2019-07-05T18:36:42Z

Codecov Report

Merging #689 into master will decrease coverage by 1.15%.
The diff coverage is 86.2%.

@@             Coverage Diff              @@
##             master     #689      +/-   ##
============================================
- Coverage     90.77%   89.62%   -1.16%     
- Complexity      427      437      +10     
============================================
  Files            60       54       -6     
  Lines          2353     2111     -242     
  Branches        207      208       +1     
============================================
- Hits           2136     1892     -244     
- Misses          154      157       +3     
+ Partials         63       62       -1

Flag	Coverage Δ	Complexity Δ
#ingestion_beam	`89.62% <86.2%> (-0.02%)`	`437 <12> (+10)`
#ingestion_edge	`?`	`?`

Impacted Files	Coverage Δ	Complexity Δ
.../telemetry/transforms/PubsubMessageToTableRow.java	`77.16% <85%> (+0.53%)`	`52 <8> (+6)`	⬆️
...ain/java/com/mozilla/telemetry/util/SnakeCase.java	`88.88% <88.88%> (ø)`	`4 <4> (?)`
ingestion-edge/ingestion_edge/dockerflow.py
ingestion-edge/ingestion_edge/create_app.py
ingestion-edge/ingestion_edge/config.py
ingestion-edge/ingestion_edge/wsgi.py
ingestion-edge/ingestion_edge/util.py
ingestion-edge/ingestion_edge/flush.py
ingestion-edge/ingestion_edge/publish.py
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8727587...6e1e012. Read the comment docs.

relud

this is surprisingly simple

acmiyaguchi

Some of the casing decisions made by Guava are inconsistent with Rust library for snake_casing that I've used in mozilla/jsonschema-transpiler#79. I generated a diff against the latest branch of mozilla-pipeline-schemas and scraped the affected names into a file.

In this repository, I've created a simple scala and rust application for testing the column names that I scraped from the diff.

original	java	rust
BuildID	build_i_d	build_id
D2DEnabled	d2_d_enabled	d2d_enabled
GPUActive	g_p_u_active	gpu_active
ProductID	product_i_d	product_id
RAM	r_a_m	ram
activeGMPlugins	active_g_m_plugins	active_gm_plugins
changesetID	changeset_i_d	changeset_id
closedTS	closed_t_s	closed_ts
debugID	debug_i_d	debug_id
deviceID	device_i_d	device_id
engagedTS	engaged_t_s	engaged_ts
expiredTS	expired_t_s	expired_ts
l2cacheKB	l2cache_k_b	l2cache_kb
l3cacheKB	l3cache_k_b	l3cache_kb
learnMoreTS	learn_more_t_s	learn_more_ts
loadDurationMS	load_duration_m_s	load_duration_ms
memoryMB	memory_m_b	memory_mb
offeredTS	offered_t_s	offered_ts
processUptimeMS	process_uptime_m_s	process_uptime_ms
submissionURL	submission_u_r_l	submission_url
subsysID	subsys_i_d	subsys_id
threadID	thread_i_d	thread_id
totalPagesAM	total_pages_a_m	total_pages_am
vendorID	vendor_i_d	vendor_id
virtualMaxMB	virtual_max_m_b	virtual_max_mb
votedTS	voted_t_s	voted_ts
windowClosedTS	window_closed_t_s	window_closed_ts
windowsUBR	windows_u_b_r	windows_ubr
xulLoadDurationMS	xul_load_duration_m_s	xul_load_duration_ms

We should choose a library that uses Unicode Standard Annex #29, which is the spec that the underlying unicode_segmentation implements for finding word boundaries. It seems to provide good word boundaries.

It looks like java.util.regex.Pattern supports word boundaries defined by annex 29, so this is one potential alternative.

In any case, the libraries chosen by the schema transpiler and ingestion sink should match on all of the column names in mozilla-pipeline-schemas.

acmiyaguchi · 2019-07-11T16:55:34Z

Unicode segmentation was a red herring, it turns out that CaseFormat is an ASCII utility and finding breaks between unicode characters is mostly unrelated (just the usual separators). Instead, the heck create uses its own definition of a word boundary that considers consecutive uppercase characters. This is purposely not implemented in CaseFormat because it's not reversible.

We've already violated the assumption of reversibility by replacing periods and hyphens into underscores. Therefore, I suggest we implement heck's case formatting, because it produces names that seem more intuitive. For example, clientID and clientId both map to client_id instead of client_i_d and client_id. These also match the casing we see in main_summary.

whd · 2019-07-11T19:33:01Z

For the previous implementation see https://github.com/mozilla-services/lua_sandbox_extensions/blob/master/parquet/parquet.cpp#L141-L164, which was aiming at hive compatibility and is more similar to the rust implementation. Generally I agree with making the rust implementation the standard as it produces the more sane result in most cases.

It is unfortunate that there is no well-accepted standard or specification for performing this conversion that multiple libraries can be written against. Absent such a standard, a test suite including all of these special cases across the different language implementations would be ideal.

acmiyaguchi · 2019-07-12T00:26:03Z

The following python snippet is an acceptance test that we could use.

from itertools import product

for c in product("Aa7_", repeat=4):
    word = "".join(c)
    print(word)

This generates 256 (4**4) strings that covers most (if not all) of the renaming cases. This is what it currently looks like.

acmiyaguchi · 2019-07-12T23:44:27Z

I've been trying to find a regular expression to match heck's word boundaries, but I'm not sure it's possible. I think numbers are treated to have the same case as the letter just prior to them (e.g. a77Aa -> a77_aa and A77AAa -> a77a_aa), which is troublesome for lookbehinds. I've put together 4 different test suites and compared heck against a regexp-based script, the direct2parquet logic, and the CaseFormat utility.

https://github.com/acmiyaguchi/test-casing/blob/master/RESULTS.md#rust-vs-python-cases

jklukas · 2019-07-15T15:49:54Z

Perhaps we need to break this into two steps. First, we'd have a function to normalize runs of sequential capital letters such that only the first is capitalized (RAM -> Ram, ProductID -> ProductId), then pass to Guava's CaseFormat.

We could also memoize these conversions if they're getting expensive.

acmiyaguchi · 2019-07-15T19:02:27Z

I found a good solution that matches the rust library by reversing the string to search for all the word boundaries.

This passes all of the test cases:

All strings of length 3 drawn with replacement from "aA7"
All string of length 4 drawn with replacement from "aA7_"
All column names from the diff against mozilla-pipeline-schemas using --type bigquery --normalize-case in this PR
Unit tests copied from the heck::snake module

We should be able to re-implement this in pretty much any language with a built in regex library. Updated test results can be found here.

import re

# Search for all camelCase situations in reverse with arbitrary lookaheads.
REV_WORD_BOUND_PAT = re.compile(
    r"""
    \b                                  # standard word boundary
    |(?<=[a-z][A-Z])(?=\d*[A-Z])        # A7Aa -> A7|Aa boundary
    |(?<=[a-z][A-Z])(?=\d*[a-z])        # a7Aa -> a7|Aa boundary
    |(?<=[A-Z])(?=\d*[a-z])             # a7A -> a7|A boundary
    """,
    re.VERBOSE,
)


def snake_case(line: str) -> str:
    # replace non-alphanumeric characters with spaces in the reversed line
    subbed = re.sub(r"[^\w]|_", " ", line[::-1])

    # apply the regex on the reversed string
    words = REV_WORD_BOUND_PAT.split(subbed)

    # filter spaces between words and snake_case and reverse again
    return "_".join([w.lower() for w in words if w.strip()])[::-1]

acmiyaguchi · 2019-07-15T20:18:32Z

Here are the test cases in csv in reference,expected format. A function implementing the above regex (reverse, sub, split, join, reverse) should pass all of the cases.

https://gist.github.com/acmiyaguchi/e377c4b6a53204d7d67e53a20dabc4a1

Fixes #671

jklukas · 2019-07-16T13:48:49Z

I've implemented the SnakeCase class with a regular expression, and it's passing the test cases. This is also now hooked into the BigQuery sink such that docTypes and namespaces will use the conversion when determining destination table names, as well as being applied to field names.

Requesting another review from @acmiyaguchi.

acmiyaguchi

LGTM, it's nice to have a consistent snake_casing methodology 🎉

acmiyaguchi · 2019-07-16T17:40:42Z

ingestion-beam/src/test/java/com/mozilla/telemetry/transforms/PubsubMessageToTableRowTest.java

+    assertEquals("untrusted_modules", PubsubMessageToTableRow.convertNameForBq("untrustedModules"));
+    assertEquals("xul_load_duration_ms",
+        PubsubMessageToTableRow.convertNameForBq("xulLoadDurationMS"));
+    assertEquals("a11y_consumers", PubsubMessageToTableRow.convertNameForBq("A11Y_CONSUMERS"));


speedMHz is a good one to add, it has an interesting look.

acmiyaguchi · 2019-07-16T17:47:25Z

ingestion-beam/src/test/java/com/mozilla/telemetry/util/SnakeCaseTest.java

+    // two words
+    assertEquals(SnakeCase.format("aA"), "a_a");
+    // underscores are word boundaries
+    assertEquals(SnakeCase.format("_a__a_"), "a_a");


Food for thought:

// underscores are word boundaries assertEquals(SnakeCase.format("a__a"), "a_a"); // sentences can italicized assertEquals(SnakeCase.format("_a__a_"), "_a_a_"); // sentences can bolded assertEquals(SnakeCase.format("__a__a__"), "__a_a__");

jklukas requested review from relud and acmiyaguchi July 5, 2019 18:22

jklukas force-pushed the snake-case branch from aaf56ff to ca7b22c Compare July 5, 2019 18:34

relud approved these changes Jul 9, 2019

View reviewed changes

acmiyaguchi suggested changes Jul 11, 2019

View reviewed changes

jklukas and others added 4 commits July 16, 2019 09:05

Coerce camelCase field names to snake_case in BQ sink

d7f87a1

Fixes #671

Refactor casing into a separate utility module

6924fe7

Add tests for consistent snakecasing

23bd653

Add cache and implement compliant regex-based snake casing

06ad2f7

jklukas force-pushed the snake-case branch 2 times, most recently from 961306f to 50ad7a8 Compare July 16, 2019 13:43

Make caching more targeted

6e1e012

jklukas force-pushed the snake-case branch from 50ad7a8 to 6e1e012 Compare July 16, 2019 13:43

jklukas requested a review from acmiyaguchi July 16, 2019 13:48

acmiyaguchi approved these changes Jul 16, 2019

View reviewed changes

Merge branch 'master' into snake-case

56d1734

jklukas changed the title ~~DO NOT MERGE: Coerce camelCase field names to snake_case in BQ sink~~ Coerce camelCase field names to snake_case in BQ sink Jul 17, 2019

jklukas merged commit 51e99f7 into master Jul 17, 2019

jklukas deleted the snake-case branch July 17, 2019 19:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Coerce camelCase field names to snake_case in BQ sink #689

Coerce camelCase field names to snake_case in BQ sink #689

jklukas commented Jul 5, 2019

codecov-io commented Jul 5, 2019 •

edited

Loading

relud left a comment

acmiyaguchi left a comment

acmiyaguchi commented Jul 11, 2019

whd commented Jul 11, 2019

acmiyaguchi commented Jul 12, 2019

acmiyaguchi commented Jul 12, 2019

jklukas commented Jul 15, 2019

acmiyaguchi commented Jul 15, 2019 •

edited

Loading

acmiyaguchi commented Jul 15, 2019

jklukas commented Jul 16, 2019

acmiyaguchi left a comment

acmiyaguchi Jul 16, 2019

acmiyaguchi Jul 16, 2019

Coerce camelCase field names to snake_case in BQ sink #689

Coerce camelCase field names to snake_case in BQ sink #689

Conversation

jklukas commented Jul 5, 2019

codecov-io commented Jul 5, 2019 • edited Loading

Codecov Report

relud left a comment

Choose a reason for hiding this comment

acmiyaguchi left a comment

Choose a reason for hiding this comment

acmiyaguchi commented Jul 11, 2019

whd commented Jul 11, 2019

acmiyaguchi commented Jul 12, 2019

acmiyaguchi commented Jul 12, 2019

jklukas commented Jul 15, 2019

acmiyaguchi commented Jul 15, 2019 • edited Loading

acmiyaguchi commented Jul 15, 2019

jklukas commented Jul 16, 2019

acmiyaguchi left a comment

Choose a reason for hiding this comment

acmiyaguchi Jul 16, 2019

Choose a reason for hiding this comment

acmiyaguchi Jul 16, 2019

Choose a reason for hiding this comment

codecov-io commented Jul 5, 2019 •

edited

Loading

acmiyaguchi commented Jul 15, 2019 •

edited

Loading