Spark schema main summary to view #67
Conversation
Please rebase.
Merged master. I'll squash all the WIP commits once this is ready to land :)
build.sbt
Outdated
ivyScala := ivyScala.value map { _.copy(overrideScalaVersion = true) },
libraryDependencies += "org.apache.avro" % "avro" % "1.7.7",
libraryDependencies += "org.apache.parquet" % "parquet-avro" % "1.8.1",
libraryDependencies += "org.apache.parquet" % "parquet-avro" % "1.7.0",
parquet-avro 1.8.1 conflicted with writing a DataFrame to Parquet format.
@Uberi that might fix the bug you had seen when writing the crash dataset.
It does seem like it. Are there any issues with downgrading though? 7c7494c seems to have upgraded for reasons related to functionality.
I think 1.8.1 added support for writing some of the data structures used by the Longitudinal dataset (Map of structs or something, I'm not sure). You can see the errors I got by checking out this PR and re-enabling the serialization test in src/test/scala/Longitudinal.scala, then running sbt test.
Current coverage is 54.03%
@@            master      #67    diff @@
==========================================
  Files           17       18      +1
  Lines         1541     1638     +97
  Methods       1483     1580     +97
  Messages         0        0
  Branches        55       50      -5
==========================================
+ Hits           837      885     +48
- Misses         704      753     +49
  Partials         0        0
src/test/scala/Longitudinal.scala
Outdated
"Records" can "be serialized" in { | ||
ParquetFile.serialize(List(fixture.record).toIterator, fixture.schema) | ||
println("TODO: fix Avro+Parquet Serialization test") |
Ideally we can just get rid of this by changing Longitudinal to use the SparkSQL / DataFrame serialization too.
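For illustration, a minimal sketch of what DataFrame-based serialization could look like, assuming Spark 1.x with an explicit StructType; the schema fields, paths, and context setup below are placeholders, not the actual Longitudinal code:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types._

// Hypothetical sketch: build Rows against an explicit Spark SQL schema and let
// the DataFrame writer produce Parquet, instead of going through Avro + parquet-avro.
val sc = new SparkContext(new SparkConf().setAppName("longitudinal-sketch").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

// Placeholder schema; the real Longitudinal schema is much larger.
val schema = StructType(List(
  StructField("client_id", StringType, nullable = false),
  StructField("submission_date", StringType, nullable = true)
))

val rows = sc.parallelize(Seq(Row("some-client-id", "20160401")))
val df = sqlContext.createDataFrame(rows, schema)
df.write.parquet("/tmp/longitudinal-sketch")  // placeholder output path
```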
Just FYI, I am planning to write a proper telemetry API in the coming weeks, which is going to provide fast access to telemetry records. @Uberi
Where can I find the output of this job?
You can find the sample data at
Do you know why
Also note that v2 has practically no skew and all files have the same size, while the same isn't true for v3.
The extra 3 files in
Re: skew, I believe that is due to actually repartitioning the records (by sampleId) in v2, whereas the current iteration of v3 just does a "coalesce", which combines existing partitions.
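To illustrate the distinction (the RDD and partition counts below are placeholders, not the actual job code):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// repartition() does a full shuffle, so the resulting partitions come out
// roughly equal in size; coalesce() only merges existing partitions, so any
// skew that is already present is preserved.
val sc = new SparkContext(new SparkConf().setAppName("skew-sketch").setMaster("local[*]"))
val records = sc.parallelize(1 to 1000000, 200)  // stand-in for the ping records

val balanced = records.repartition(100)  // shuffle; even partition sizes
val combined = records.coalesce(100)     // no shuffle; keeps the existing layout
```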
hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem") | ||
|
||
// We want to end up with reasonably large parquet files on S3. | ||
hadoopConf.setInt("parquet.block.size", 128 * 1024 * 1024) |
@vitillo Is 128MB a reasonable size for parquet files? Should we go larger? I've read conflicting advice.
Depends: how many files are generated per day (with 128MB as the size), and how many days are usually read by an analysis job?
There are about 20-30GB per day of data (in the latest gzipped-parquet form per this PR). If we partition by sample_id, we should have 100 partitions of ~235MB each. So likely two files per day+sample_id partition. Analysis jobs frequently look at MAU, so using 30 days of data is common (40 days, including data latency for activity date vs. submission date).
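Roughly, the arithmetic behind that estimate (illustrative numbers only, using the mid-range figure quoted above):

```scala
// ~23.5 GB/day split across 100 sample_id partitions is ~240 MB per partition,
// which at a 128 MB Parquet block size comes out to about two files each.
val bytesPerDay       = 23.5 * 1024 * 1024 * 1024
val partitions        = 100
val mbPerPartition    = bytesPerDay / partitions / (1024 * 1024)  // ~240 MB
val filesPerPartition = math.ceil(mbPerPartition / 128.0)         // 2
```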
We should have a single file per day+sample_id then.
Ok, I'll go with 256MB for now.
Let's make it 512 to avoid generating very small files.
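A minimal sketch of the size agreed on here, applied through the SparkContext's Hadoop configuration (the same hadoopConf object shown in the diff above); the context setup is a placeholder:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Bump the Parquet block size so each output file is large (512 MB),
// avoiding lots of small files on S3.
val sc = new SparkContext(new SparkConf().setAppName("block-size-sketch").setMaster("local[*]"))
val hadoopConf = sc.hadoopConfiguration
hadoopConf.setInt("parquet.block.size", 512 * 1024 * 1024)
```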
Ok, I tested that the data can be read from Spark, and that the record counts match exactly for one day's data between the existing v2 data, the v3 data using manual partitioning, and the v4 data using
r+ once you confirm that the dataset can be read by Presto as well.
I checked, and the v3 and v4 datasets can't be auto-imported into Hive with the current
One other issue I ran into is that
I manually edited the resulting Hive commands to make
Casting sample_id to an integer in the query might be good enough.
Ok, there's a workaround where you can just cast the partition field to a number in your query if you want to treat it as one. Null values work as expected too. These queries all do the expected thing:

SELECT count(*) FROM test_main_summary_v4 WHERE sample_id = '5'
SELECT count(*) FROM test_main_summary_v4 WHERE cast(sample_id AS bigint) < 10
SELECT count(*) FROM test_main_summary_v4 WHERE sample_id IS NULL
SELECT count(*) FROM test_main_summary_v4 WHERE cast(sample_id AS bigint) IS NULL
Per bug 1272334, refactor the Main Summary code to use Spark SQL types directly (rather than using Avro Schema), and switch from the "Derived Stream" approach to the "Views" approach. This incorporates the "utils.Telemetry" code in mozilla#62. Processing a single day's data on a 20-node cluster (April 1, 2016 was my reference date) takes a bit more than 90 minutes with this current code. Using the DerivedStreams approach from a few commits back takes about half that amount of time. Consumer-facing changes in the v3 dataset include:
- Renaming the "submission_date_s3" field to "submission_date"
- Changing the type of the "sample_id" field to a string due to using it as an s3 partitioning field.
75adce2 to 55ab210
@vitillo I've cleaned up the history in this branch; it should be ready for final review / merge.
Argh, I noticed a problem where the "overwrite" deletes the entire "version" prefix before adding data.
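One possible shape of a fix, purely as a sketch and not what this PR actually does: write each day under its own submission-date path so that "overwrite" only replaces that day rather than the whole version prefix. The bucket, prefix, date, and DataFrame names are placeholders:

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical: scope the write to a single day's partition path so that
// mode("overwrite") cannot delete the rest of the dataset.
val df: DataFrame = ???  // placeholder for one day's worth of main_summary rows
val date = "20160401"
val outputPath = s"s3://example-bucket/main_summary/v3/submission_date_s3=$date"
df.write.mode("overwrite").parquet(outputPath)
```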
As discussed on IRC, we should make sure the longitudinal dataset is not affected by the changes.
The MainSummaryView serialization doesn't run when using parquet-avro 1.8.1, so disable that for now. Also adjust the .sbtopts file to avoid an OOM when running tests.
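The commit mentions tweaking .sbtopts for memory; a hypothetical .sbtopts along these lines (the actual values in the PR may differ) passes a larger heap to the sbt JVM so the test suite doesn't hit an OOM:

```
-J-Xmx2G
```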
Ok, I've re-enabled the serialization test for Longitudinal. I confirmed that the MainSummaryView generation itself still works fine.
This also restores the "submission_date_s3" field, so that's one less change for consumers of the data between v2 and v3.
src/test/scala/Longitudinal.scala
Outdated
}

"Records" can "be serialized" in {
  println("TODO: fix Avro+Parquet Serialization test")
Please remove this line.
@mreid-moz this can be merged; we should probably do that when you are back so we can deploy it shortly afterwards.
Summary of where we ended up with this PR:
If this sounds acceptable, then we're ready to merge.
r+
Per bug 1272334, refactor the Main Summary code to use Spark SQL types directly (rather than using Avro Schema), and switch from the "Derived Stream" approach to the "Views" approach.
This builds on the code in #62, but still has some performance problems to work out.
Processing a single day's data on a 20-node cluster (April 1, 2016 was my reference date) takes a bit more than 90 minutes with this current code. Using the DerivedStreams approach from a few commits back takes about half that amount of time.
Note that I also added a step to Telemetry.getMessages to balance the partition sizes, which reduced the run time from more than 3 hours down to 90 minutes or so.

@vitillo @Uberi Feedback and suggestions appreciated, particularly about what I can do to identify and fix the performance difference.