
Bug 1568042 - Allow ingesting heka-framed blobs #781

Merged
Merged 14 commits from heka-ingestion into mozilla:master on Sep 25, 2019

Conversation

@wlach (Contributor) commented Aug 28, 2019

Work in progress. This adds heka blob parsing, some tests for that, and the very beginnings of a heka->pubsub sink. Still needed:

  • Code to traverse heka blobs in a filesystem store (equivalent of Dataset from scala moztelemetry)
  • Make the json -> pubsub integration bits actually work. It seems like the existing text->pubsub and json->pubsub classes assume a 1:1 mapping, which might not be what we want here. There's a bunch I still need to figure out about how to make this work...
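For reference, a minimal sketch of what reading a heka-framed stream involves, based on the documented hekad framing (0x1E record separator, a one-byte header length, a protobuf header whose first field is the message length, 0x1F unit separator, then the message bytes). This is an illustration of the format under those assumptions, not this PR's actual HekaReader:

import java.io.DataInputStream;
import java.io.IOException;

public class HekaFrameSketch {

  private static final int RECORD_SEPARATOR = 0x1E;
  private static final int UNIT_SEPARATOR = 0x1F;

  /** Returns the raw protobuf bytes of the next framed message, or null at end of stream. */
  public static byte[] readFramedMessage(DataInputStream in) throws IOException {
    int b = in.read();
    if (b < 0) {
      return null; // clean end of stream
    }
    if (b != RECORD_SEPARATOR) {
      throw new IOException("expected record separator 0x1E, got " + b);
    }
    int headerLength = in.readUnsignedByte();
    byte[] header = new byte[headerLength];
    in.readFully(header);
    // The framing header is itself a protobuf; we assume field 1 (a varint,
    // tag byte 0x08) carries the message length, per the hekad message docs.
    if (headerLength < 2 || header[0] != 0x08) {
      throw new IOException("unexpected framing header");
    }
    int messageLength = readVarint(header, 1);
    if (in.readUnsignedByte() != UNIT_SEPARATOR) {
      throw new IOException("expected unit separator 0x1F");
    }
    byte[] message = new byte[messageLength];
    in.readFully(message);
    return message;
  }

  // Minimal protobuf varint decoder, reading from buf starting at offset.
  private static int readVarint(byte[] buf, int offset) {
    int value = 0;
    for (int shift = 0; ; shift += 7) {
      byte b = buf[offset++];
      value |= (b & 0x7F) << shift;
      if ((b & 0x80) == 0) {
        return value;
      }
    }
  }
}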

@codecov-io commented Aug 28, 2019

Codecov Report

Merging #781 into master will decrease coverage by 39.42%.
The diff coverage is 62.58%.


@@              Coverage Diff              @@
##             master     #781       +/-   ##
=============================================
- Coverage     89.04%   49.61%   -39.43%     
+ Complexity      562      555        -7     
=============================================
  Files            81       77        -4     
  Lines          2966     4815     +1849     
  Branches        260      701      +441     
=============================================
- Hits           2641     2389      -252     
- Misses          239     2315     +2076     
- Partials         86      111       +25
| Flag | Coverage Δ | Complexity Δ |
|---|---|---|
| #ingestion_beam | 48.52% <62.58%> (-39.47%) | 502 <30> (+1) |
| #ingestion_edge | ? | ? |
| #ingestion_sink | 61.32% <ø> (-19.34%) | 63 <ø> (-8) |
| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| ...src/main/java/com/mozilla/telemetry/heka/Heka.java | 13.05% <ø> (ø) | 2 <0> (?) |
| ...m/src/main/java/com/mozilla/telemetry/io/Read.java | 9.67% <0%> (-79.42%) | 1 <0> (ø) |
| ...c/main/java/com/mozilla/telemetry/heka/HekaIO.java | 0% <0%> (ø) | 0 <0> (?) |
| ...m/mozilla/telemetry/transforms/FailureMessage.java | 71.42% <0%> (-8.58%) | 8 <0> (ø) |
| .../java/com/mozilla/telemetry/options/InputType.java | 53.84% <50%> (-27.98%) | 1 <0> (ø) |
| ...src/main/java/com/mozilla/telemetry/util/Json.java | 83.67% <50%> (-3%) | 21 <1> (+1) |
| ...in/java/com/mozilla/telemetry/heka/HekaReader.java | 83.17% <83.17%> (ø) | 29 <29> (?) |
| ...mozilla/telemetry/schemas/BigQuerySchemaStore.java | 0% <0%> (-100%) | 0% <0%> (-4%) |
| ...om/mozilla/telemetry/ingestion/sink/io/Pubsub.java | 2.56% <0%> (-82.06%) | 0% <0%> (ø) |
| ... and 24 more | | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 41b45ff...f836980.

@wlach (Contributor, Author) commented Aug 28, 2019 via email

@relud (Contributor) commented Sep 9, 2019

> Gotcha, could you point me to where we can suppress spotless checks?

pom.xml#L189


insertPath(payload, Arrays.asList("meta", "Hostname"), message.getHostname());
insertPath(payload, Arrays.asList("meta", "Timestamp"), new Long(message.getTimestamp()));
insertPath(payload, Arrays.asList("meta", "Type"), message.getDtype());
Contributor:

Am I correct in understanding that these lines are injecting fields into the JSON payload? If we're going to be passing this JSON object through the decoder, it would be better to have these available as attributes on the PubsubMessage rather than in the payload to match the layout of attributes coming from the edge server (see https://mozilla.github.io/gcp-ingestion/architecture/edge_service_specification/#general-data-flow). We should probably have a chat to think through implications there.
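For concreteness, a sketch of the attribute-based layout being suggested, using the com.google.pubsub.v1.PubsubMessage builder; payloadBytes and the attribute names here are illustrative assumptions, not the names this repo's decoder actually expects:

import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;

// Carry heka metadata as PubsubMessage attributes, mirroring the edge
// server's layout, instead of injecting it into the JSON payload.
PubsubMessage converted = PubsubMessage.newBuilder()
    .setData(ByteString.copyFrom(payloadBytes)) // payloadBytes: serialized JSON body (assumed)
    .putAttributes("host", message.getHostname())
    .putAttributes("submission_timestamp", String.valueOf(message.getTimestamp()))
    .putAttributes("document_type", message.getDtype())
    .build();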

Contributor (Author):

Hmm, I think these fields might have been added by Hindsight to the JSON payload in our AWS implementation. @mreid-moz can you confirm?

Contributor:

We do something similar in the GCP pipeline, where we put various metadata into a metadata blob as one of the last steps of the decoder; see AddMetadata.java.

I believe the plan is that we'll pass these heka payloads through the Decoder rather than just the Sink so that they get validated before we try to send them to BigQuery. We'll need to make sure the message is in a format that the Decoder can understand. The Decoder will check the JSON payload to see if it contains a metadata struct and will extract various attributes from it if they exist, so it is still an option to provide them that way rather than directly making them PubsubMessage attributes. But in either case we need to make sure the names line up. Perhaps we should save that to a separate PR where @relud or I can apply our knowledge of the metadata schema that this repo expects.
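Roughly, the decoder-side behavior described above looks like this (a hedged sketch with Jackson, not the actual AddMetadata.java; the "metadata" key name is an assumption):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.node.ObjectNode;
import java.util.Map;

// Pull a metadata struct out of the JSON payload and promote its string
// members to PubsubMessage attributes.
static void extractMetadata(ObjectNode payload, Map<String, String> attributes) {
  JsonNode metadata = payload.remove("metadata");
  if (metadata != null && metadata.isObject()) {
    metadata.fields().forEachRemaining(entry -> {
      if (entry.getValue().isTextual()) {
        attributes.put(entry.getKey(), entry.getValue().textValue());
      }
    });
  }
}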

@wlach (Contributor, Author) commented Sep 13, 2019

@jklukas I think this is ready for another look. The ingestion-beam tests are now passing; the ingestion-sink integration tests aren't, but I don't think that's related to this PR.

AFAICT there are 3 issues remaining, but I'm not sure any of them block merging. It might be easier to get this in and then follow up. Thoughts?

@jklukas (Contributor) commented Sep 13, 2019

This needs a rebase on master before being ready to merge, which will require refactoring to use jackson ObjectNode rather than org.json JSONObject.
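For anyone following along, that refactor amounts to swapping org.json mutation calls for Jackson's tree model, e.g. (values here are illustrative only):

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

ObjectMapper mapper = new ObjectMapper();
// org.json equivalent: new JSONObject().put("meta", new JSONObject().put("Hostname", host))
ObjectNode payload = mapper.createObjectNode();
payload.with("meta").put("Hostname", "example-host"); // with() creates the nested object if absent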

@wlach changed the title from "Heka ingestion" to "Bug 1568042 - Allow ingesting heka-framed blobs" on Sep 16, 2019
@mreid-moz (Contributor) left a review:

Added some Heka arcana :)

: Arrays.asList("meta", key);
if (f.getValueType() == Heka.Field.ValueType.STRING) {
String value = f.getValueString(0);
if (value.charAt(0) == '{') {
Contributor:

I think we can also reliably look at the Heka Field's "representation", which should be set to "json" for anything that should be parsed as JSON. This needs to be verified, as my recollection is fairly hazy on that.

If this is the case, then everything with a "json" representation should be inserted into the main document body, and all other Heka message fields should be inserted into the "meta" struct.
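If that holds, the check might look something like this (hedged sketch only; getRepresentation() is assumed from the Heka protobuf Field definition, MAPPER is an assumed shared Jackson ObjectMapper, and error handling is omitted):

if ("json".equals(f.getRepresentation())) {
  // representation "json": parse into the main document body
  payload.set(f.getName(), MAPPER.readTree(f.getValueString(0)));
} else {
  // everything else lands under the "meta" struct
  insertPath(payload, Arrays.asList("meta", f.getName()), f.getValueString(0));
}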

Contributor (Author):

Hmm, yes, this might need a bit more refinement. It looks like the scala version attempts to parse the value as json, falling back to just setting the value to a string if that fails:

https://github.com/mozilla/moztelemetry/blob/63e2e57d098f923e5e005e478d447b8f481b64bc/src/main/scala/com/mozilla/telemetry/heka/package.scala#L116
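A Jackson equivalent of that fallback might be (sketch only; MAPPER is an assumed shared ObjectMapper):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.node.TextNode;
import java.io.IOException;

// Try to parse the field value as JSON; if parsing fails, keep it as a plain string.
static JsonNode parseOrString(String value) {
  try {
    return MAPPER.readTree(value);
  } catch (IOException e) {
    return TextNode.valueOf(value);
  }
}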

.ifPresent(s -> attributes.put(Attribute.GEO_SUBDIVISION1, s));
Optional.ofNullable(meta.path("geoSubdivision2").textValue())
.ifPresent(s -> attributes.put(Attribute.GEO_SUBDIVISION2, s));
// TODO: Do heka messages contain parsed user agent info? DNT? Method? Protocol?
Contributor:

@mreid-moz Do heka messages contain parsed user agent info? Other headers?

Contributor:

I don't think we parse/store user agent info in the telemetry data on AWS.

cc @trink can you confirm?

Member:

We do not store/parse user agent info for telemetry data in AWS (we do for structured). We do store these headers: https://github.com/mozilla-services/lua_sandbox_extensions/blob/master/moz_telemetry/io_modules/decoders/moz_ingest/telemetry.lua#L321-L324.

@@ -10,7 +10,7 @@
"version": "69.0",
"xpcomAbi": "x86-msvc"
},
"clientId": "a136763f-3b9a-492d-8964-bfe783b68dc6",
"clientId": "0e8e41d2-d46f-4818-b071-c6605257e6aa",
Contributor:

This test was failing before the commit. I'm assuming the json is just not up to date with the heka data for the test, but correct me if I'm wrong on that, @wlach

@jklukas force-pushed the heka-ingestion branch 2 times, most recently from bbd6649 to f836980, on September 24, 2019
@wlach (Contributor, Author) commented Sep 25, 2019

Let's merge this! It seems to work reasonably well for the data we've thrown at it so far. We can always address any issues in follow-ups.

@wlach merged commit 08c3944 into mozilla:master on Sep 25, 2019
@wlach deleted the heka-ingestion branch on September 25, 2019.