Streams to views #62

Conversation
…l works in production
Current coverage is 50.95%

```
@@            master    #62   diff @@
=====================================
  Files           17     19     +2
  Lines         1539   1639   +100
  Methods       1481   1584   +103
  Messages         0      0
  Branches        55     49     -6
=====================================
  Hits           835    835
- Misses         704    804   +100
  Partials         0      0
```
src/main/scala/utils/Telemetry.scala
Outdated

```scala
val JString(telemetryPrefix) = metaSources \\ "telemetry" \\ "prefix"

// get a stream of object summaries that match the desired criteria
val bucket = Bucket("net-mozaws-prod-us-west-2-pipeline-data")
```
You should get the bucket from the metadata as well: `val JString(bucketName) = metaSources \\ "telemetry" \\ "bucket"`
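For context, a minimal sketch of that extraction with json4s; the JSON shape below is invented for illustration and may not match the real metadata:

```scala
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

// Hypothetical metadata shape, for illustration only.
val metaSources: JValue = parse(
  """{"telemetry": {"prefix": "telemetry-2", "bucket": "example-bucket"}}""")

// json4s recursive lookup plus pattern matching, as in the diff above.
val JString(telemetryPrefix) = metaSources \\ "telemetry" \\ "prefix"
val JString(bucketName)      = metaSources \\ "telemetry" \\ "bucket"
```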
Updated to use data bucket from metadata.
Thanks for picking this up @Uberi! Could you please rebase it?
src/main/scala/utils/Telemetry.scala
Outdated

```scala
  }
}

def appendToFile(p: String, s: String): Unit = {
```
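The hunk shows only the signature; for reference, a plausible body for such a helper (hypothetical, not the PR's actual code) might be:

```scala
import java.io.{File, FileWriter, PrintWriter}

// Append a line to the file at path p, creating the file if needed.
def appendToFile(p: String, s: String): Unit = {
  val writer = new PrintWriter(new FileWriter(new File(p), true)) // true = append mode
  try writer.println(s)
  finally writer.close()
}
```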
Where is this function used?
@mreid-moz are you happy with the churn rewrite?
Taking a look now
docs/CrashAggregateView.md
Outdated

```
sbt "run-main telemetry.views.CrashAggregateView --from 20160410 --to 20160411"
```

**Note:** Currently, due to [avro-parquet issues](https://issues.apache.org/jira/browse/HIVE-12828), Parquet writing only works under the `spark-submit` commands - the above example will fail. This will be fixed when avro-parquet updates or is removed.
s/Currently/As of 2016-05-03/

This will hopefully help later when figuring out timelines while debugging the linked issue.
src/main/scala/views/ChurnView.scala
Outdated

```scala
val schema = buildSchema()
val messages = Telemetry.getRecords(sc, currentDate, List("telemetry", "4", "main", "Firefox"))
val rowRDD = messages.flatMap(messageToRow).repartition(100) // TODO: partition by sampleId
val records = sqlContext.createDataFrame(rowRDD.coalesce(1), schema)
```
Do we need the `coalesce(1)` here? Shouldn't we leave `rowRDD` as-is since it already has the desired number of partitions (in this case 100)?
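A sketch of the change this suggests, reusing the names from the diff above: drop the `coalesce(1)` so the DataFrame keeps `rowRDD`'s 100 partitions.

```scala
val schema = buildSchema()
val messages = Telemetry.getRecords(sc, currentDate, List("telemetry", "4", "main", "Firefox"))
val rowRDD = messages.flatMap(messageToRow).repartition(100) // TODO: partition by sampleId
// Keep rowRDD as-is; coalesce(1) would funnel all rows through a single
// partition, serializing the work and discarding the parallelism above.
val records = sqlContext.createDataFrame(rowRDD, schema)
```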
…atch ordering in buildSchema()
Changes:
Changes look good to me. Can you test it on real data @vitillo?
```scala
  bucket,
  List("").toStream,
  List(telemetryPrefix, submissionDate.toString("yyyyMMdd")) ++ pingPath
).flatMap(prefix => s3.objectSummaries(bucket, prefix)).map(summary => summary.getKey())
```
Should we include something like the logic in `DerivedStream.groupBySize` to try to balance out the size of each partition / task?
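For reference, a hypothetical sketch of what such size-balanced grouping could look like; `DerivedStream.groupBySize`'s actual signature and threshold may differ:

```scala
// Greedily pack (key, size) pairs into groups whose total size stays under
// a threshold, so each partition/task reads a comparable amount of data.
// The 2 GB default threshold is an assumption for illustration.
def groupBySize(summaries: List[(String, Long)],
                maxGroupSize: Long = 2L * 1024 * 1024 * 1024): List[List[String]] = {
  val grouped = summaries.foldLeft((0L, List(List.empty[String]))) {
    case ((groupSize, group :: rest), (key, size)) =>
      if (groupSize + size <= maxGroupSize)
        (groupSize + size, (key :: group) :: rest) // fits: add to current group
      else
        (size, List(key) :: group :: rest)         // too big: start a new group
  }
  grouped._2.filter(_.nonEmpty)
}
```

Each resulting group could then be handed to Spark as one partition, e.g. via `sc.parallelize(groups, groups.size)`.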
@mreid-moz I can test the executive view, can you take the churn view?
@vitillo Yeah, I'll test out the churn view today.
```scala
def buildSchema(): StructType = {
  StructType(
    StructField("clientId", StringType, false) ::
    StructField("sampleId", IntegerType, false) ::
```
Either s/IntegerType/LongType/ here or output an Integer value in `messageToRow`. I noticed that Spark types are less forgiving of Int vs. Long conversion than the Avro stuff.
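A small sketch of the mismatch being described (field names from the diff above; the values are made up): with Spark SQL, the value placed in a `Row` must match the declared field type exactly.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = StructType(List(
  StructField("clientId", StringType, false),
  StructField("sampleId", IntegerType, false)))

val sampleId: Long = 42L
// A Long value under an IntegerType field fails at runtime once the
// DataFrame is evaluated; Spark will not narrow Long to Int for you.
// val bad = Row("some-client-id", sampleId)
// Convert explicitly so the Row value matches the declared type:
val good = Row("some-client-id", sampleId.toInt)
```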
Force-pushed from 1b94720 to ccfd798:

* Refactor "Main Summary" to drop avro, use "Views".

  Per bug 1272334, refactor the Main Summary code to use Spark SQL types directly (rather than using Avro Schema), and switch from the "Derived Stream" approach to the "Views" approach. This incorporates the "utils.Telemetry" code in #62.

  Processing a single day's data on a 20-node cluster (April 1, 2016 was my reference date) takes a bit more than 90 minutes with this current code. Using the DerivedStreams approach from a few commits back takes about half that amount of time.

  Consumer-facing changes in the v3 dataset include:
  - Renaming the "submission_date_s3" field to "submission_date"
  - Changing the type of the "sample_id" field to a string due to using it as an s3 partitioning field.

* Fix the S3 partitioning logic.

* Re-enable the tests for "Longitudinal". The MainSummaryView serialization doesn't run when using parquet-avro 1.8.1, so disable that for now. Also adjust the .sbtopts file to avoid an OOM when running tests.

* Remove a spurious print stmt
Based on #56:

* Converts views away from the `DerivedStream` kind. When all of them are completely converted, we can remove the Avro stuff.
* Calls `getRecords` with the channel name directly. This seems to make it a lot faster.
* `telemetry.utils.Telemetry` contains some useful functions, `getRecords` and `listOptions`.
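Putting the new utils together, a minimal sketch of a view driver using `Telemetry.getRecords` (the call matches the ChurnView diff above; the date type and Spark setup are assumptions):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.joda.time.DateTime
import telemetry.utils.Telemetry

object ExampleView {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ExampleView"))
    // Fetch one day's worth of main pings for the Firefox channel,
    // addressed by channel path components rather than a DerivedStream.
    val currentDate = DateTime.parse("2016-04-01") // assumed date type
    val messages = Telemetry.getRecords(sc, currentDate, List("telemetry", "4", "main", "Firefox"))
    println(s"Fetched ${messages.count()} pings")
    sc.stop()
  }
}
```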