Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Coerce camelCase field names to snake_case in BQ sink #689

Merged
merged 6 commits into from
Jul 17, 2019
Merged

Conversation

jklukas
Copy link
Contributor

@jklukas jklukas commented Jul 5, 2019

Fixes #671

We cannot merge this until corresponding changes are made in the schema transpiler (see mozilla/jsonschema-transpiler#77) and the tables with the new structure are deployed.

Deploying this change will likely happen at the same time as moving destination tables to <namespace>_live datasets in the shared-prod project. See https://bugzilla.mozilla.org/show_bug.cgi?id=1563740

@codecov-io
Copy link

codecov-io commented Jul 5, 2019

Codecov Report

Merging #689 into master will decrease coverage by 1.15%.
The diff coverage is 86.2%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master     #689      +/-   ##
============================================
- Coverage     90.77%   89.62%   -1.16%     
- Complexity      427      437      +10     
============================================
  Files            60       54       -6     
  Lines          2353     2111     -242     
  Branches        207      208       +1     
============================================
- Hits           2136     1892     -244     
- Misses          154      157       +3     
+ Partials         63       62       -1
Flag Coverage Δ Complexity Δ
#ingestion_beam 89.62% <86.2%> (-0.02%) 437 <12> (+10)
#ingestion_edge ? ?
Impacted Files Coverage Δ Complexity Δ
.../telemetry/transforms/PubsubMessageToTableRow.java 77.16% <85%> (+0.53%) 52 <8> (+6) ⬆️
...ain/java/com/mozilla/telemetry/util/SnakeCase.java 88.88% <88.88%> (ø) 4 <4> (?)
ingestion-edge/ingestion_edge/dockerflow.py
ingestion-edge/ingestion_edge/create_app.py
ingestion-edge/ingestion_edge/config.py
ingestion-edge/ingestion_edge/wsgi.py
ingestion-edge/ingestion_edge/util.py
ingestion-edge/ingestion_edge/flush.py
ingestion-edge/ingestion_edge/publish.py
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8727587...6e1e012. Read the comment docs.

Copy link
Contributor

@relud relud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is surprisingly simple

Copy link
Contributor

@acmiyaguchi acmiyaguchi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of the casing decisions made by Guava are inconsistent with Rust library for snake_casing that I've used in mozilla/jsonschema-transpiler#79. I generated a diff against the latest branch of mozilla-pipeline-schemas and scraped the affected names into a file.

In this repository, I've created a simple scala and rust application for testing the column names that I scraped from the diff.

original java rust
BuildID build_i_d build_id
D2DEnabled d2_d_enabled d2d_enabled
GPUActive g_p_u_active gpu_active
ProductID product_i_d product_id
RAM r_a_m ram
activeGMPlugins active_g_m_plugins active_gm_plugins
changesetID changeset_i_d changeset_id
closedTS closed_t_s closed_ts
debugID debug_i_d debug_id
deviceID device_i_d device_id
engagedTS engaged_t_s engaged_ts
expiredTS expired_t_s expired_ts
l2cacheKB l2cache_k_b l2cache_kb
l3cacheKB l3cache_k_b l3cache_kb
learnMoreTS learn_more_t_s learn_more_ts
loadDurationMS load_duration_m_s load_duration_ms
memoryMB memory_m_b memory_mb
offeredTS offered_t_s offered_ts
processUptimeMS process_uptime_m_s process_uptime_ms
submissionURL submission_u_r_l submission_url
subsysID subsys_i_d subsys_id
threadID thread_i_d thread_id
totalPagesAM total_pages_a_m total_pages_am
vendorID vendor_i_d vendor_id
virtualMaxMB virtual_max_m_b virtual_max_mb
votedTS voted_t_s voted_ts
windowClosedTS window_closed_t_s window_closed_ts
windowsUBR windows_u_b_r windows_ubr
xulLoadDurationMS xul_load_duration_m_s xul_load_duration_ms

We should choose a library that uses Unicode Standard Annex #29, which is the spec that the underlying unicode_segmentation implements for finding word boundaries. It seems to provide good word boundaries.

It looks like java.util.regex.Pattern supports word boundaries defined by annex 29, so this is one potential alternative.

In any case, the libraries chosen by the schema transpiler and ingestion sink should match on all of the column names in mozilla-pipeline-schemas.

@acmiyaguchi
Copy link
Contributor

Unicode segmentation was a red herring, it turns out that CaseFormat is an ASCII utility and finding breaks between unicode characters is mostly unrelated (just the usual separators). Instead, the heck create uses its own definition of a word boundary that considers consecutive uppercase characters. This is purposely not implemented in CaseFormat because it's not reversible.

We've already violated the assumption of reversibility by replacing periods and hyphens into underscores. Therefore, I suggest we implement heck's case formatting, because it produces names that seem more intuitive. For example, clientID and clientId both map to client_id instead of client_i_d and client_id. These also match the casing we see in main_summary.

@whd
Copy link
Member

whd commented Jul 11, 2019

For the previous implementation see https://github.com/mozilla-services/lua_sandbox_extensions/blob/master/parquet/parquet.cpp#L141-L164, which was aiming at hive compatibility and is more similar to the rust implementation. Generally I agree with making the rust implementation the standard as it produces the more sane result in most cases.

It is unfortunate that there is no well-accepted standard or specification for performing this conversion that multiple libraries can be written against. Absent such a standard, a test suite including all of these special cases across the different language implementations would be ideal.

@acmiyaguchi
Copy link
Contributor

The following python snippet is an acceptance test that we could use.

from itertools import product

for c in product("Aa7_", repeat=4):
    word = "".join(c)
    print(word)

This generates 256 (4**4) strings that covers most (if not all) of the renaming cases. This is what it currently looks like.

@acmiyaguchi
Copy link
Contributor

I've been trying to find a regular expression to match heck's word boundaries, but I'm not sure it's possible. I think numbers are treated to have the same case as the letter just prior to them (e.g. a77Aa -> a77_aa and A77AAa -> a77a_aa), which is troublesome for lookbehinds. I've put together 4 different test suites and compared heck against a regexp-based script, the direct2parquet logic, and the CaseFormat utility.

https://github.com/acmiyaguchi/test-casing/blob/master/RESULTS.md#rust-vs-python-cases

@jklukas
Copy link
Contributor Author

jklukas commented Jul 15, 2019

Perhaps we need to break this into two steps. First, we'd have a function to normalize runs of sequential capital letters such that only the first is capitalized (RAM -> Ram, ProductID -> ProductId), then pass to Guava's CaseFormat.

We could also memoize these conversions if they're getting expensive.

@acmiyaguchi
Copy link
Contributor

acmiyaguchi commented Jul 15, 2019

I found a good solution that matches the rust library by reversing the string to search for all the word boundaries.

This passes all of the test cases:

  • All strings of length 3 drawn with replacement from "aA7"
  • All string of length 4 drawn with replacement from "aA7_"
  • All column names from the diff against mozilla-pipeline-schemas using --type bigquery --normalize-case in this PR
  • Unit tests copied from the heck::snake module

We should be able to re-implement this in pretty much any language with a built in regex library. Updated test results can be found here.

import re

# Search for all camelCase situations in reverse with arbitrary lookaheads.
REV_WORD_BOUND_PAT = re.compile(
    r"""
    \b                                  # standard word boundary
    |(?<=[a-z][A-Z])(?=\d*[A-Z])        # A7Aa -> A7|Aa boundary
    |(?<=[a-z][A-Z])(?=\d*[a-z])        # a7Aa -> a7|Aa boundary
    |(?<=[A-Z])(?=\d*[a-z])             # a7A -> a7|A boundary
    """,
    re.VERBOSE,
)


def snake_case(line: str) -> str:
    # replace non-alphanumeric characters with spaces in the reversed line
    subbed = re.sub(r"[^\w]|_", " ", line[::-1])

    # apply the regex on the reversed string
    words = REV_WORD_BOUND_PAT.split(subbed)

    # filter spaces between words and snake_case and reverse again
    return "_".join([w.lower() for w in words if w.strip()])[::-1]

@acmiyaguchi
Copy link
Contributor

Here are the test cases in csv in reference,expected format. A function implementing the above regex (reverse, sub, split, join, reverse) should pass all of the cases.

https://gist.github.com/acmiyaguchi/e377c4b6a53204d7d67e53a20dabc4a1

@jklukas jklukas force-pushed the snake-case branch 2 times, most recently from 961306f to 50ad7a8 Compare July 16, 2019 13:43
@jklukas
Copy link
Contributor Author

jklukas commented Jul 16, 2019

I've implemented the SnakeCase class with a regular expression, and it's passing the test cases. This is also now hooked into the BigQuery sink such that docTypes and namespaces will use the conversion when determining destination table names, as well as being applied to field names.

Requesting another review from @acmiyaguchi.

Copy link
Contributor

@acmiyaguchi acmiyaguchi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, it's nice to have a consistent snake_casing methodology 🎉

assertEquals("untrusted_modules", PubsubMessageToTableRow.convertNameForBq("untrustedModules"));
assertEquals("xul_load_duration_ms",
PubsubMessageToTableRow.convertNameForBq("xulLoadDurationMS"));
assertEquals("a11y_consumers", PubsubMessageToTableRow.convertNameForBq("A11Y_CONSUMERS"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

speedMHz is a good one to add, it has an interesting look.

// two words
assertEquals(SnakeCase.format("aA"), "a_a");
// underscores are word boundaries
assertEquals(SnakeCase.format("_a__a_"), "a_a");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Food for thought:

// underscores are word boundaries
assertEquals(SnakeCase.format("a__a"), "a_a");
// sentences can italicized
assertEquals(SnakeCase.format("_a__a_"), "_a_a_");
// sentences can bolded
assertEquals(SnakeCase.format("__a__a__"), "__a_a__");

@jklukas jklukas changed the title DO NOT MERGE: Coerce camelCase field names to snake_case in BQ sink Coerce camelCase field names to snake_case in BQ sink Jul 17, 2019
@jklukas jklukas merged commit 51e99f7 into master Jul 17, 2019
@jklukas jklukas deleted the snake-case branch July 17, 2019 19:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Convert camel case bigquery column names to snake case
5 participants