Crash aggregates #56
Conversation
Current coverage is 54.26%

```diff
@@            master    #56   diff @@
======================================
  Files           32     17    -15
  Lines         1402   1539   +137
  Methods       1338   1481   +143
  Messages         0      0
  Branches        48     55     +7
======================================
+ Hits           759    835    +76
- Misses         643    704    +61
  Partials         0      0
```
src/main/scala/streams/Crash.scala (outdated)
```scala
).toMap
val statsMap = (statsNames, stats).zipped.toMap

val schema = buildSchema()
```
We are trying to move away from Avro schemas in new datasets and to use only the schema support provided by SparkSQL. Basically you would have to port over the same code you used in your Python version.
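For illustration, a minimal sketch of what a SparkSQL-native schema could look like, assuming hypothetical field names rather than the actual ones in Crash.scala:

```scala
import org.apache.spark.sql.types._

// Hypothetical sketch: declare the dataset's schema directly as a SparkSQL
// StructType instead of going through an Avro schema definition.
def buildSchema(): StructType = StructType(List(
  StructField("channel", StringType, nullable = false),
  StructField("build_version", StringType, nullable = false),
  StructField("crash_count", LongType, nullable = false)
))
```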
After a lot of debugging, I found out that parquet-avro 1.8.1 (the most recent version) entirely breaks writing Parquet from Spark. Basically, if we even install parquet-avro, writing Parquet files from DataFrames no longer works:

There's a resolved issue in Parquet for this, but the fix is only in 1.8.2-SNAPSHOT, so it isn't really considered stable yet. We should either (1) move this into a separate project (so we don't need to depend on parquet-avro at all), or (2) move away from the Avro stuff altogether.
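As a rough sketch of option (2), and assuming the hypothetical StructType above, writing Parquet needs nothing beyond the plain DataFrame API, so no parquet-avro dependency is involved (the app name, rows, and output path here are made up):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}

// Sketch: build a DataFrame from Rows plus the SparkSQL schema, then write
// Parquet directly through the DataFrame writer.
val sparkConf = new SparkConf().setAppName("CrashAggregates")
val sc = new SparkContext(sparkConf)
val sqlContext = new SQLContext(sc)

val rows = sc.parallelize(Seq(Row("nightly", "50.0a1", 7L)))  // placeholder data
val frame = sqlContext.createDataFrame(rows, buildSchema())
frame.write.parquet("s3://hypothetical-bucket/crash_aggregates/v1")  // placeholder path
```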
```scala
}

def main(args: Array[String]) {
  // load configuration for the
```
Comment is incomplete.
Changed:
This looks good. Did you compare the output generated by this job with the output generated by the Python one?
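For reference, one minimal way to spot-check equivalence would be to diff the two Parquet outputs with the DataFrame API; the paths below are placeholders, and `except` compares distinct rows only:

```scala
// Sketch: read both jobs' output and look for rows present in one but not
// the other; zero counts on both sides suggest the outputs match.
val scalaOut = sqlContext.read.parquet("s3://hypothetical-bucket/scala-run")
val pythonOut = sqlContext.read.parquet("s3://hypothetical-bucket/python-run")
val onlyInScala = scalaOut.except(pythonOut).count()
val onlyInPython = pythonOut.except(scalaOut).count()
println(s"only in Scala job: $onlyInScala, only in Python job: $onlyInPython")
```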
Could you please also add some documentation about the metrics collected in this job, similar to what you have written down for the Python one?
Changes:
…l works in production
Looks good, thanks!
Intended to have behaviour equivalent to https://github.com/mozilla/moz-crash-rate-aggregates, but faster.
It's also a little bit simpler, since the Spark Scala API is somewhat richer and more robust than PySpark.