This repository was archived by the owner on Feb 14, 2025. It is now read-only.

Conversation


@Uberi Uberi commented Apr 12, 2016

Intended to have behaviour equivalent to https://github.com/mozilla/moz-crash-rate-aggregates, but faster.

It's also a little bit simpler, since the Spark Scala API is somewhat richer and more robust than PySpark.


codecov-io commented Apr 12, 2016

Current coverage is 54.26%

Merging #56 into master will increase coverage by 0.08%

  1. 2 files (not in diff) in ...streams/main_summary were deleted.
  2. 4 files (not in diff) in ...c/main/scala/streams were deleted.
  3. 2 files (not in diff) in .../src/main/scala/heka were deleted.
  4. 8 files (not in diff) in ...-view/src/main/scala were deleted.
  5. File ...lientCountView.scala (not in diff) was modified.
    • Misses +2
    • Partials 0
    • Hits 0
@@             master        #56   diff @@
==========================================
  Files            32         17    -15   
  Lines          1402       1539   +137   
  Methods        1338       1481   +143   
  Messages          0          0          
  Branches         48         55     +7   
==========================================
+ Hits            759        835    +76   
- Misses          643        704    +61   
  Partials          0          0          

Last updated by d260b7b...35e3ea1

```scala
    ).toMap
val statsMap = (statsNames, stats).zipped.toMap

val schema = buildSchema()
```
Contributor

We are trying to move away from Avro schemas in new datasets and use only the schema support provided by SparkSQL. Basically you would have to port over the same code you used in your Python version.
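For reference, a minimal sketch of what a SparkSQL-native `buildSchema` could look like, defining the schema with `StructType` instead of Avro. The field names below are illustrative only, not the actual crash-aggregates columns:

```scala
import org.apache.spark.sql.types._

// Hypothetical sketch: define the dataset schema directly with SparkSQL's
// StructType API rather than an Avro schema. Field names are placeholders.
def buildSchema(): StructType = StructType(List(
  StructField("submission_date", StringType, nullable = false),
  StructField("channel", StringType, nullable = true),
  StructField("stats", MapType(StringType, DoubleType), nullable = true)
))
```

A schema built this way can be passed straight to `sqlContext.createDataFrame`, so no parquet-avro dependency is needed when writing the output.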


Uberi commented Apr 15, 2016

After a lot of debugging, I found that Parquet-Avro 1.8.1 (the most recent version) entirely breaks DataFrame.write.parquet:

```scala
libraryDependencies += "org.apache.parquet" % "parquet-avro" % "1.8.1",
```

Basically, if we even install parquet-avro, writing Parquet files from dataframes no longer works:

```
java.lang.NoSuchMethodError: org.apache.parquet.schema.Types$MessageTypeBuilder.addFields([Lorg/apache/parquet/schema/Type;)Lorg/apache/parquet/schema/Types$GroupBuilder;
(...16 more lines...)
```

There's a resolved issue in Parquet for this, but the fix only landed in 1.8.2-SNAPSHOT, which isn't considered stable yet.

We should either (1) move this into a separate project (so we don't need to depend on parquet-avro at all), or (2) move away from Avro entirely.
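As an illustration of option (2), Spark's built-in Parquet writer needs no parquet-avro dependency at all. A rough sketch using the Spark 1.x-era APIs (the app name, `rowRDD`/`schema` values, and output path below are all hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical sketch: write aggregates straight from a DataFrame using
// Spark's native Parquet support, with parquet-avro removed from the build.
val sc = new SparkContext(new SparkConf().setAppName("crash-aggregates"))
val sqlContext = new SQLContext(sc)
val records = sqlContext.createDataFrame(rowRDD, schema)  // rowRDD/schema assumed built elsewhere
records.write.parquet("s3://example-bucket/crash_aggregates/v1/")  // output path illustrative
```

Since the NoSuchMethodError comes from a classpath conflict, simply dropping the parquet-avro dependency should also make the stock writer work again.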

```scala
}

def main(args: Array[String]) {
// load configuration for the
```
Contributor

Comment is incomplete.


Uberi commented Apr 25, 2016

Changes:

  • Added a comment about ping aggregates.
  • Cleaned up the code a bit.
  • Used a list with a single element to identify top-level fields.


vitillo commented Apr 26, 2016

This looks good. Did you compare the output generated by this job with the output of the Python one?


vitillo commented Apr 27, 2016

Could you please also add some documentation about the metrics collected in this job, similar to what you wrote for the Python one?


Uberi commented Apr 28, 2016

Changes:

@Uberi Uberi force-pushed the crash-aggregates branch from 1460fbc to 35e3ea1 Compare April 29, 2016 19:10
@Uberi Uberi mentioned this pull request Apr 29, 2016

vitillo commented May 3, 2016

Looks good, thanks!

@vitillo vitillo merged commit 8f6afba into mozilla:master May 3, 2016