
Bug 1568042 - Allow ingesting heka-framed blobs #781

Merged
Merged 14 commits from heka-ingestion into mozilla:master on Sep 25, 2019

Conversation

@wlach (Contributor) commented Aug 28, 2019

Work in progress. This adds heka blob parsing, some tests for that, and the very beginnings of a heka->pubsub sink. Still needed:

  • Code to traverse heka blobs in a filesystem store (equivalent of Dataset from scala moztelemetry)
  • Make the json -> pubsub integration bits actually work. It seems like the existing text->pubsub and json->pubsub classes assume a 1:1 mapping, which might not be what we want here. There's a bunch I still need to figure out about how to make this work...
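For reference, a minimal sketch of what reading a heka-framed stream involves, based on the documented hekad framing (0x1E record separator, a one-byte header length, a protobuf header whose first field is the message length, 0x1F unit separator, then the message bytes). This is an illustration of the format under those assumptions, not this PR's actual HekaReader:

import java.io.DataInputStream;
import java.io.IOException;

public class HekaFrameSketch {

  private static final int RECORD_SEPARATOR = 0x1E;
  private static final int UNIT_SEPARATOR = 0x1F;

  /** Returns the raw protobuf bytes of the next framed message, or null at end of stream. */
  public static byte[] readFramedMessage(DataInputStream in) throws IOException {
    int b = in.read();
    if (b < 0) {
      return null; // clean end of stream
    }
    if (b != RECORD_SEPARATOR) {
      throw new IOException("expected record separator 0x1E, got " + b);
    }
    int headerLength = in.readUnsignedByte();
    byte[] header = new byte[headerLength];
    in.readFully(header);
    // The framing header is itself a protobuf; we assume field 1 (a varint,
    // tag byte 0x08) carries the message length, per the hekad message docs.
    if (headerLength < 2 || header[0] != 0x08) {
      throw new IOException("unexpected framing header");
    }
    int messageLength = readVarint(header, 1);
    if (in.readUnsignedByte() != UNIT_SEPARATOR) {
      throw new IOException("expected unit separator 0x1F");
    }
    byte[] message = new byte[messageLength];
    in.readFully(message);
    return message;
  }

  // Minimal protobuf varint decoder, reading from buf starting at offset.
  private static int readVarint(byte[] buf, int offset) {
    int value = 0;
    for (int shift = 0; ; shift += 7) {
      byte b = buf[offset++];
      value |= (b & 0x7F) << shift;
      if ((b & 0x80) == 0) {
        return value;
      }
    }
  }
}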

@codecov-io commented Aug 28, 2019

Codecov Report

Merging #781 into master will decrease coverage by 39.42%.
The diff coverage is 62.58%.


@@              Coverage Diff              @@
##             master     #781       +/-   ##
=============================================
- Coverage     89.04%   49.61%   -39.43%     
+ Complexity      562      555        -7     
=============================================
  Files            81       77        -4     
  Lines          2966     4815     +1849     
  Branches        260      701      +441     
=============================================
- Hits           2641     2389      -252     
- Misses          239     2315     +2076     
- Partials         86      111       +25
| Flag | Coverage Δ | Complexity Δ |
|---|---|---|
| #ingestion_beam | 48.52% <62.58%> (-39.47%) | 502 <30> (+1) |
| #ingestion_edge | ? | ? |
| #ingestion_sink | 61.32% <ø> (-19.34%) | 63 <ø> (-8) |
| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| ...src/main/java/com/mozilla/telemetry/heka/Heka.java | 13.05% <ø> (ø) | 2 <0> (?) |
| ...m/src/main/java/com/mozilla/telemetry/io/Read.java | 9.67% <0%> (-79.42%) | 1 <0> (ø) |
| ...c/main/java/com/mozilla/telemetry/heka/HekaIO.java | 0% <0%> (ø) | 0 <0> (?) |
| ...m/mozilla/telemetry/transforms/FailureMessage.java | 71.42% <0%> (-8.58%) | 8 <0> (ø) |
| .../java/com/mozilla/telemetry/options/InputType.java | 53.84% <50%> (-27.98%) | 1 <0> (ø) |
| ...src/main/java/com/mozilla/telemetry/util/Json.java | 83.67% <50%> (-3%) | 21 <1> (+1) |
| ...in/java/com/mozilla/telemetry/heka/HekaReader.java | 83.17% <83.17%> (ø) | 29 <29> (?) |
| ...mozilla/telemetry/schemas/BigQuerySchemaStore.java | 0% <0%> (-100%) | 0% <0%> (-4%) |
| ...om/mozilla/telemetry/ingestion/sink/io/Pubsub.java | 2.56% <0%> (-82.06%) | 0% <0%> (ø) |
| ... and 24 more | | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 41b45ff...f836980.

@wlach (Contributor, Author) commented Aug 28, 2019 via email

@relud (Contributor) commented Sep 9, 2019

> Gotcha, could you point me to where we can suppress spotless checks?

pom.xml#L189


insertPath(payload, Arrays.asList("meta", "Hostname"), message.getHostname());
insertPath(payload, Arrays.asList("meta", "Timestamp"), new Long(message.getTimestamp()));
insertPath(payload, Arrays.asList("meta", "Type"), message.getDtype());
Contributor:

Am I correct in understanding that these lines are injecting fields into the JSON payload? If we're going to be passing this JSON object through the decoder, it would be better to have these available as attributes on the PubsubMessage rather than in the payload to match the layout of attributes coming from the edge server (see https://mozilla.github.io/gcp-ingestion/architecture/edge_service_specification/#general-data-flow). We should probably have a chat to think through implications there.
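For concreteness, a sketch of the attribute-based layout being suggested, using the com.google.pubsub.v1.PubsubMessage builder; payloadBytes and the attribute names here are illustrative assumptions, not the names this repo's decoder actually expects:

import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;

// Carry heka metadata as PubsubMessage attributes, mirroring the edge
// server's layout, instead of injecting it into the JSON payload.
PubsubMessage converted = PubsubMessage.newBuilder()
    .setData(ByteString.copyFrom(payloadBytes)) // payloadBytes: serialized JSON body (assumed)
    .putAttributes("host", message.getHostname())
    .putAttributes("submission_timestamp", String.valueOf(message.getTimestamp()))
    .putAttributes("document_type", message.getDtype())
    .build();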

Contributor (Author):

Hmm, I think these fields might have been added by Hindsight to the JSON payload in our AWS implementation. @mreid-moz can you confirm?

Contributor:

We do something similar in the GCP pipeline, where we put various metadata into a metadata blob as one of the last steps of the decoder; see AddMetadata.java.

I believe the plan is that we'll pass these heka payloads through the Decoder rather than just the Sink so that they get validated before we try to send them to BigQuery. We'll need to make sure the message is in a format that the Decoder can understand. The Decoder will check the JSON payload to see if it contains a metadata struct and will extract various attributes from it if they exist, so it is still an option to provide them that way rather than directly making them PubsubMessage attributes. But in either case we need to make sure the names line up. Perhaps we should save that to a separate PR where @relud or I can apply our knowledge of the metadata schema that this repo expects.
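Roughly, the decoder-side behavior described above looks like this (a hedged sketch with Jackson, not the actual AddMetadata.java; the "metadata" key name is an assumption):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.node.ObjectNode;
import java.util.Map;

// Pull a metadata struct out of the JSON payload and promote its string
// members to PubsubMessage attributes.
static void extractMetadata(ObjectNode payload, Map<String, String> attributes) {
  JsonNode metadata = payload.remove("metadata");
  if (metadata != null && metadata.isObject()) {
    metadata.fields().forEachRemaining(entry -> {
      if (entry.getValue().isTextual()) {
        attributes.put(entry.getKey(), entry.getValue().textValue());
      }
    });
  }
}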

@wlach (Contributor, Author) commented Sep 13, 2019

@jklukas I think this is ready for another look. The ingestion-beam tests are now passing; the ingestion-sink integration tests aren't, but I don't think that's related to this PR.

AFAICT there are 3 issues remaining, but I'm not sure any of them block merging. It might be easier to get this in and then follow up. Thoughts?

@jklukas (Contributor) commented Sep 13, 2019

This needs a rebase on master before being ready to merge, which will require refactoring to use jackson ObjectNode rather than org.json JSONObject.
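For anyone following along, that refactor amounts to swapping org.json mutation calls for Jackson's tree model, e.g. (values here are illustrative only):

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

ObjectMapper mapper = new ObjectMapper();
// org.json equivalent: new JSONObject().put("meta", new JSONObject().put("Hostname", host))
ObjectNode payload = mapper.createObjectNode();
payload.with("meta").put("Hostname", "example-host"); // with() creates the nested object if absent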

@wlach changed the title from "Heka ingestion" to "Bug 1568042 - Allow ingesting heka-framed blobs" on Sep 16, 2019
@mreid-moz (Contributor) left a review:

Added some Heka arcana :)

: Arrays.asList("meta", key);
if (f.getValueType() == Heka.Field.ValueType.STRING) {
String value = f.getValueString(0);
if (value.charAt(0) == '{') {
Contributor:

I think we can also reliably look at the Heka Field's "representation", which should be set to "json" for anything that should be parsed as JSON. This needs to be verified, as my recollection is fairly hazy on that.

If this is the case, then everything with a "json" representation should be inserted into the main document body, and all other Heka message fields should be inserted into the "meta" struct.
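If that holds, the check might look something like this (hedged sketch only; getRepresentation() is assumed from the Heka protobuf Field definition, MAPPER is an assumed shared Jackson ObjectMapper, and error handling is omitted):

if ("json".equals(f.getRepresentation())) {
  // representation "json": parse into the main document body
  payload.set(f.getName(), MAPPER.readTree(f.getValueString(0)));
} else {
  // everything else lands under the "meta" struct
  insertPath(payload, Arrays.asList("meta", f.getName()), f.getValueString(0));
}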

Contributor (Author):

Hmm, yes, this might need a bit more refinement. It looks like the scala version attempts to parse the value as json, falling back to just setting the value to a string if that fails:

https://github.com/mozilla/moztelemetry/blob/63e2e57d098f923e5e005e478d447b8f481b64bc/src/main/scala/com/mozilla/telemetry/heka/package.scala#L116
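A Jackson equivalent of that fallback might be (sketch only; MAPPER is an assumed shared ObjectMapper):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.node.TextNode;
import java.io.IOException;

// Try to parse the field value as JSON; if parsing fails, keep it as a plain string.
static JsonNode parseOrString(String value) {
  try {
    return MAPPER.readTree(value);
  } catch (IOException e) {
    return TextNode.valueOf(value);
  }
}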

.ifPresent(s -> attributes.put(Attribute.GEO_SUBDIVISION1, s));
Optional.ofNullable(meta.path("geoSubdivision2").textValue())
.ifPresent(s -> attributes.put(Attribute.GEO_SUBDIVISION2, s));
// TODO: Do heka messages contain parsed user agent info? DNT? Method? Protocol?
Contributor:

@mreid-moz Do heka messages contain parsed user agent info? Other headers?

Contributor:

I don't think we parse/store user agent info in the telemetry data on AWS.

cc @trink can you confirm?

Member:

We do not store/parse user agent info for telemetry data in AWS (we do for structured). We do store these headers: https://github.com/mozilla-services/lua_sandbox_extensions/blob/master/moz_telemetry/io_modules/decoders/moz_ingest/telemetry.lua#L321-L324.

@@ -10,7 +10,7 @@
"version": "69.0",
"xpcomAbi": "x86-msvc"
},
"clientId": "a136763f-3b9a-492d-8964-bfe783b68dc6",
"clientId": "0e8e41d2-d46f-4818-b071-c6605257e6aa",
Contributor:

This test was failing before the commit. I'm assuming the json is just not up to date with the heka data for the test, but correct me if I'm wrong on that, @wlach

@jklukas force-pushed the heka-ingestion branch 2 times, most recently from bbd6649 to f836980, on September 24, 2019
@wlach (Contributor, Author) commented Sep 25, 2019

Let's merge this! It seems to work reasonably well for the data we've thrown at it so far. We can always address any issues in follow-ups.

@wlach merged commit 08c3944 into mozilla:master on Sep 25, 2019
@wlach deleted the heka-ingestion branch on September 25, 2019.