This repository was archived by the owner on Feb 14, 2025. It is now read-only.

Conversation

Uberi
Contributor

@Uberi Uberi commented Apr 29, 2016

Based on #56:

  • Convert the Churn and Executive streams to views rather than the old DerivedStream kind. When all of them are completely converted, we can remove the Avro stuff.
  • The executive view avoids a ton of shuffling by using getRecords with the channel name directly. This seems to make it a lot faster.
  • telemetry.utils.Telemetry contains some useful helper functions: getRecords and listOptions (see the sketch below).
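
For reference, here is a rough sketch of how these helpers are used (lifted from the view code later in this PR; `messageToRow` is the view's own message parser, and the path components are just the main-ping example):

```
// Load one day's worth of raw telemetry records directly from S3, filtered
// down to main pings from Firefox. Under the hood, listOptions expands the
// S3 prefix combinations and getRecords reads the matching objects.
val messages = Telemetry.getRecords(sc, currentDate, List("telemetry", "4", "main", "Firefox"))
// Turn each decoded message into a Row matching the SQL schema.
val rowRDD = messages.flatMap(messageToRow)
```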

@codecov-io

codecov-io commented Apr 29, 2016

Current coverage is 50.95%

Merging #62 into master will decrease coverage by 3.31%

  1. 2 files in src/main/scala were modified.
    • Misses -11
  2. 2 files (not in diff) in ...streams/main_summary were modified.
  3. 2 files (not in diff) in ...c/main/scala/streams were modified.
  4. 1 file (not in diff) in src/main/scala was modified.
    • Misses -2
@@             master        #62   diff @@
==========================================
  Files            17         19     +2   
  Lines          1539       1639   +100   
  Methods        1481       1584   +103   
  Messages          0          0          
  Branches         55         49     -6   
==========================================
  Hits            835        835          
- Misses          704        804   +100   
  Partials          0          0          

Powered by Codecov. Last updated by d36ff31...07dafc0

val JString(telemetryPrefix) = metaSources \\ "telemetry" \\ "prefix"

// get a stream of object summaries that match the desired criteria
val bucket = Bucket("net-mozaws-prod-us-west-2-pipeline-data")
Contributor

@mreid-moz mreid-moz May 2, 2016

You should get the bucket from the metadata as well: val JString(bucketName) = metaSources \\ "telemetry" \\ "bucket"
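
Putting the two together, the metadata-driven lookup would read roughly like this (a sketch using the json4s extraction already in the diff; `metaSources` is the parsed sources metadata used above):

```
// Pull both the prefix and the bucket for the telemetry source out of the
// sources metadata instead of hard-coding the bucket name.
val JString(telemetryPrefix) = metaSources \\ "telemetry" \\ "prefix"
val JString(bucketName) = metaSources \\ "telemetry" \\ "bucket"
val bucket = Bucket(bucketName)
```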

@Uberi
Contributor Author

Uberi commented May 2, 2016

Updated to use data bucket from metadata.

@vitillo
Contributor

vitillo commented May 3, 2016

Thanks for picking this up @Uberi! Could you please rebase it?

}
}

def appendToFile(p: String, s: String): Unit = {
Contributor

Where is this function used?

@vitillo
Contributor

vitillo commented May 3, 2016

@mreid-moz are you happy with the churn rewrite?

@mreid-moz
Contributor

Taking a look now

```
sbt "run-main telemetry.views.CrashAggregateView --from 20160410 --to 20160411"
```

**Note:** Currently, due to [avro-parquet issues](https://issues.apache.org/jira/browse/HIVE-12828), Parquet writing only works under the `spark-submit` commands - the above example will fail. This will be fixed when avro-parquet updates or is removed.
Contributor

@mreid-moz mreid-moz May 3, 2016

s/Currently/As of 2016-05-03/

This will hopefully help later when figuring out timelines while debugging the linked issue.

val schema = buildSchema()
val messages = Telemetry.getRecords(sc, currentDate, List("telemetry", "4", "main", "Firefox"))
val rowRDD = messages.flatMap(messageToRow).repartition(100) // TODO: partition by sampleId
val records = sqlContext.createDataFrame(rowRDD.coalesce(1), schema)
Contributor

Do we need the coalesce(1) here? Shouldn't we leave rowRDD as-is since it already has the desired number of partitions (in this case 100)?
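
Concretely, the suggestion amounts to something like this (a sketch; the names are the ones already used in the diff above):

```
// Build the rows with the desired 100 partitions...
val rowRDD = messages.flatMap(messageToRow).repartition(100) // TODO: partition by sampleId
// ...and let createDataFrame keep them, rather than collapsing everything into
// a single partition with coalesce(1).
val records = sqlContext.createDataFrame(rowRDD, schema)
```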

@Uberi
Contributor Author

Uberi commented May 7, 2016

Changes:

  • Implement all of the above feedback.
  • Note: Since I'm not in the AWS cloud services group, I haven't tested these on real data.

@mreid-moz
Contributor

Changes look good to me. Can you test it on real data @vitillo?

bucket,
List("").toStream,
List(telemetryPrefix, submissionDate.toString("yyyyMMdd")) ++ pingPath
).flatMap(prefix => s3.objectSummaries(bucket, prefix)).map(summary => summary.getKey())
Contributor

Should we include something like the logic in DerivedStream.groupBySize to try to balance out the size of each partition / task?
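
For reference, the rough idea behind that suggestion (an illustrative approximation, not the actual DerivedStream.groupBySize source; ObjSummary is a hypothetical stand-in for the S3 object summaries returned by s3.objectSummaries, which expose a key and a size):

```
// Hypothetical lightweight stand-in for an S3 object summary.
case class ObjSummary(key: String, size: Long)

// Greedily pack object summaries into groups whose combined size stays under a
// threshold, so that each Spark partition/task reads a comparable amount of data.
def groupBySize(summaries: List[ObjSummary],
                maxGroupBytes: Long = 2L * 1024 * 1024 * 1024): List[List[ObjSummary]] =
  summaries.foldLeft((0L, List(List.empty[ObjSummary]))) {
    case ((groupSize, groups), obj) =>
      if (groupSize + obj.size <= maxGroupBytes)
        (groupSize + obj.size, (obj :: groups.head) :: groups.tail)
      else
        (obj.size, List(obj) :: groups)
  }._2.filter(_.nonEmpty)
```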

@vitillo
Contributor

vitillo commented May 10, 2016

@mreid-moz I can test the executive view, can you take the churn view?

@mreid-moz
Contributor

@vitillo Yeah, I'll test out the churn view today.

@mreid-moz
Contributor

Ok, I ran a test of the ChurnView code and found the following for the data on 20160401 (running on a cluster of 20 nodes):

Test 1: DerivedStream-based code:

time spark-submit \
    --master yarn-client \
    --class telemetry.DerivedStream \
    telemetry-batch-view-1.1-orig.jar \
    --from-date 20160401 \
    --to-date 20160401 \
    Churn

Test 2: ChurnView code:

time spark-submit \
    --master yarn-client \
    --class telemetry.views.ChurnView \
    telemetry-batch-view-1.1-views.jar \
    --from 20160401 \
    --to 20160401
  • The old DerivedStream-based code ran in 26m31.701s and generated a correct-looking number of records.
  • The View-based code ran in 137m48.952s (about 5 times slower), and generated a dataset with the correct structure, but no actual records (a count of the dataframe was zero).
  • The tasks in the ChurnView code were very unbalanced time-wise, with the longest task taking 2.2hr. See this screenshot from the Spark UI:
    [screenshot from the Spark UI, 2016-05-12, showing the unbalanced task durations]
  • I left the generated datasets in telemetry-test-bucket/churn/v1 (DerivedStream based) and telemetry-test-bucket/churn/v2 (ChurnView) for further inspection, and I saved the "spark.log" for the test if that's of interest.
  • I also have a simple notebook I used to count the resulting datasets if needed.
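
For reference, the count check is only a couple of lines against the generated datasets (a sketch; bucket paths as listed above, with the s3:// scheme assumed):

```
// Count the rows in each output to compare the two runs.
val v1Count = sqlContext.read.parquet("s3://telemetry-test-bucket/churn/v1").count()
val v2Count = sqlContext.read.parquet("s3://telemetry-test-bucket/churn/v2").count()
println(s"DerivedStream: $v1Count, ChurnView: $v2Count")
```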

def buildSchema(): StructType = {
  StructType(
    StructField("clientId", StringType, false) ::
    StructField("sampleId", IntegerType, false) ::
Contributor

Either s/IntegerType/LongType/ here or output an Integer value in messageToRow. I noticed that Spark types are less forgiving of Int vs. Long conversion than the Avro stuff.
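
Either way the fix is small; a sketch of the two options (only one is needed, and the schema is truncated to the fields shown in the diff):

```
import org.apache.spark.sql.types._

// Option 1: widen the schema field so it matches the Long that messageToRow emits.
def buildSchema(): StructType =
  StructType(
    StructField("clientId", StringType, false) ::
    StructField("sampleId", LongType, false) ::
    Nil)

// Option 2: keep IntegerType in the schema and have messageToRow put an Int into
// the Row instead, e.g. by calling .toInt on the parsed sample id.
```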

mreid-moz added a commit to mreid-moz/telemetry-batch-view that referenced this pull request May 19, 2016
Per bug 1272334, refactor the Main Summary code to use Spark SQL types
directly (rather than using Avro Schema), and switch from the "Derived
Stream" approach to the "Views" approach.

This incorporates the "utils.Telemetry" code in mozilla#62.

Processing a single day's data on a 20-node cluster (April 1, 2016 was
my reference date) takes a bit more than 90 minutes with this current
code. Using the DerivedStreams approach from a few commits back takes
about half that amount of time.

Consumer-facing changes in the v3 dataset include:
- Renaming the "submission_date_s3" field to "submission_date"
- Changing the type of the "sample_id" field to a string due to
  using it as an s3 partitioning field.
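
For context on the sample_id change: Spark's partitioned parquet writer turns each partition column value into a directory name in the output path, which is why the commit stores sample_id as a string. A minimal sketch (the output location is illustrative; `records` is the DataFrame built by the view):

```
// Each distinct (submission_date, sample_id) pair becomes a directory such as
// .../submission_date=20160401/sample_id=42/, hence sample_id as a string.
records.write
  .mode("overwrite")
  .partitionBy("submission_date", "sample_id")
  .parquet("s3://telemetry-parquet/main_summary/v3")
```
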
vitillo pushed a commit that referenced this pull request May 23, 2016
* Refactor "Main Summary" to drop avro, use "Views".

Per bug 1272334, refactor the Main Summary code to use Spark SQL types
directly (rather than using Avro Schema), and switch from the "Derived
Stream" approach to the "Views" approach.

This incorporates the "utils.Telemetry" code in #62.

Processing a single day's data on a 20-node cluster (April 1, 2016 was
my reference date) takes a bit more than 90 minutes with this current
code. Using the DerivedStreams approach from a few commits back takes
about half that amount of time.

Consumer-facing changes in the v3 dataset include:
- Renaming the "submission_date_s3" field to "submission_date"
- Changing the type of the "sample_id" field to a string due to
  using it as an s3 partitioning field.

* Fix the S3 partitioning logic.

* Re-enable the tests for "Longitudinal".

The MainSummaryView serialization doesn't run when using
parquet-avro 1.8.1, so disable that for now. Also adjust
the .sbtopts file to avoid an OOM when running tests.

* Remove a spurious print stmt
@vitillo vitillo force-pushed the master branch 5 times, most recently from 1b94720 to ccfd798 on June 1, 2016 at 22:41
@vitillo vitillo closed this Jun 20, 2016
SamPenrose pushed a commit to SamPenrose/telemetry-batch-view that referenced this pull request May 20, 2017
* Refactor "Main Summary" to drop avro, use "Views".