This repository was archived by the owner on Feb 14, 2025. It is now read-only.

Conversation

mreid-moz
Contributor

Per bug 1272334, refactor the Main Summary code to use Spark SQL types directly (rather than using Avro Schema), and switch from the "Derived Stream" approach to the "Views" approach.

This builds on the code in #62, but still has some performance problems to work out.

Processing a single day's data on a 20-node cluster (April 1, 2016 was my reference date) takes a bit more than 90 minutes with this current code. Using the DerivedStreams approach from a few commits back takes about half that amount of time.

Note that I also added a step to Telemetry.getMessages to balance the partition sizes, which reduced the run time from more than 3 hours down to 90 minutes or so.

@vitillo @Uberi Feedback and suggestions appreciated, particularly about what I can do to identify and fix the performance difference.
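
For context, a minimal sketch of the kind of partition balancing mentioned above (illustrative only; the real change lives in utils.Telemetry.getMessages, and the names below are assumptions):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper: even out skewed input partitions so downstream tasks
// get roughly equal amounts of work. A full shuffle (repartition) rebalances
// record counts, unlike coalesce, which only merges existing partitions.
def balancePartitions[T](messages: RDD[T], sc: SparkContext): RDD[T] = {
  // Aim for a few partitions per available core to reduce stragglers.
  val targetPartitions = sc.defaultParallelism * 4
  messages.repartition(targetPartitions)
}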

@vitillo
Contributor

vitillo commented May 12, 2016

Please rebase.

@mreid-moz
Contributor Author

Merged master. I'll squash all the WIP commits once this is ready to land :)

build.sbt Outdated
ivyScala := ivyScala.value map { _.copy(overrideScalaVersion = true) },
libraryDependencies += "org.apache.avro" % "avro" % "1.7.7",
libraryDependencies += "org.apache.parquet" % "parquet-avro" % "1.8.1",
libraryDependencies += "org.apache.parquet" % "parquet-avro" % "1.7.0",
Contributor Author

parquet-avro 1.8.1 conflicted with writing a DataFrame to Parquet format.

Contributor
@vitillo vitillo May 13, 2016

@Uberi that might fix the bug you had seen when writing the crash dataset.

Contributor

It does seem like it. Are there any issues with downgrading though? 7c7494c seems to have upgraded for reasons related to functionality.

Contributor Author

I think 1.8.1 added support for writing some of the data structures used by the Longitudinal dataset (Map of structs or something, I'm not sure). You can see the errors I got by checking out this PR and re-enabling the serialization test in src/test/scala/Longitudinal.scala, then running sbt test.

@codecov-io

codecov-io commented May 12, 2016

Current coverage is 54.03%

Merging #67 into master will decrease coverage by 0.28%

  1. 3 files (not in diff) in ...c/main/scala/streams were modified.
  2. 1 file (not in diff) in src/main/scala was modified.
    • Misses +2
  3. 2 files in src/main/scala were modified.
    • Misses +1
    • Hits -1
@@             master        #67   diff @@
==========================================
  Files            17         18     +1   
  Lines          1541       1638    +97   
  Methods        1483       1580    +97   
  Messages          0          0          
  Branches         55         50     -5   
==========================================
+ Hits            837        885    +48   
- Misses          704        753    +49   
  Partials          0          0          



"Records" can "be serialized" in {
ParquetFile.serialize(List(fixture.record).toIterator, fixture.schema)
println("TODO: fix Avro+Parquet Serialization test")
Contributor Author

Ideally we can just get rid of this by changing Longitudinal to use the SparkSQL / DataFrame serialization too.
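
For illustration, the DataFrame-based path looks roughly like this (a sketch assuming a Spark 1.x SQLContext and an existing RDD[Row] plus StructType schema; this is not the actual Longitudinal code):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.StructType

// Build a DataFrame from rows plus a Spark SQL schema and let Spark write the
// Parquet files, instead of going through Avro + parquet-avro.
def writeAsParquet(sqlContext: SQLContext,
                   rows: RDD[Row],
                   schema: StructType,
                   outputPath: String): Unit = {
  val df = sqlContext.createDataFrame(rows, schema)
  df.write.mode("overwrite").parquet(outputPath)
}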

@vitillo
Contributor

vitillo commented May 13, 2016

Just FYI, I am planning to write a proper telemetry API in the coming weeks, which will provide fast access to telemetry records. @Uberi getRecords was just a temporary measure until that API lands.

@vitillo
Contributor

vitillo commented May 13, 2016

Where can I find the output of this job?

@mreid-moz
Contributor Author

You can find the sample data at telemetry-test-bucket/main_summary/v3

@vitillo
Contributor

vitillo commented May 13, 2016

Do you know why s3://telemetry-test-bucket/main_summary/v3/submission_date_s3=20160401/ has 103 files (instead of 100) and why it's 1/3 smaller than s3://telemetry-parquet/main_summary/v2/submission_date_s3=20160401/?

@vitillo
Contributor

vitillo commented May 13, 2016

Also note that v2 has practically no skew and all files have the same size, while the same isn't true for v3.

@mreid-moz
Contributor Author

The extra 3 files in s3://telemetry-test-bucket/main_summary/v3/submission_date_s3=20160401/ are the "_SUCCESS", "_metadata" and "_common_metadata" files that are auto-generated by the DataFrame serialization. I think it's smaller because the DataFrame writer uses gzip compression, whereas the Avro/Parquet code uses no compression (or snappy, I'm not sure).
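
If the size difference does come down to the codec, the DataFrame writer's compression can be pinned explicitly so the two datasets are compared like for like; a sketch assuming the Spark 1.x SQLContext setting:

import org.apache.spark.sql.SQLContext

// Sketch: set the Parquet compression codec used by DataFrame writes so the
// v2/v3 output sizes can be compared with the same codec.
def setParquetCompression(sqlContext: SQLContext, codec: String): Unit = {
  // Accepted values include "uncompressed", "snappy", "gzip" and "lzo".
  sqlContext.setConf("spark.sql.parquet.compression.codec", codec)
}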

@mreid-moz
Contributor Author

Re: skew, I believe that is because v2 actually repartitions the records (by sampleId), whereas the current iteration of v3 just does a "coalesce", which combines existing partitions.
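
A small sketch of the distinction being described (hypothetical names):

import org.apache.spark.sql.DataFrame

// repartition shuffles all rows and yields evenly sized partitions, at the
// cost of a full shuffle; coalesce only merges existing partitions, so any
// skew in the inputs carries through to the output files.
def evenPartitions(df: DataFrame, n: Int): DataFrame = df.repartition(n)
def mergedPartitions(df: DataFrame, n: Int): DataFrame = df.coalesce(n)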

hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")

// We want to end up with reasonably large parquet files on S3.
hadoopConf.setInt("parquet.block.size", 128 * 1024 * 1024)
Contributor Author

@vitillo Is 128MB a reasonable size for parquet files? Should we go larger? I've read conflicting advice.

Contributor

Depends: how many files are generated per day (with 128MB as the block size), and how many days are usually read by an analysis job?

Contributor Author

There are about 20-30GB per day of data (in the latest gzipped-parquet form per this PR). If we partition by sample_id, we should have 100 partitions of ~235MB each. So likely two files per day+sample_id partition. Analysis jobs frequently look at MAU, so using 30 days of data is common (40 days, including data latency for activity date vs. submission date).

Contributor

We should have a single file per day+sample_id then.

Contributor Author

Ok, I'll go with 256MB for now.

Contributor

Let's make it 512MB to avoid generating very small files.
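
For reference, the back-of-the-envelope file-count math behind this choice (using the rough figures quoted above, not measurements):

// ~20-30 GB/day spread over 100 sample_id partitions is roughly 235 MB each.
val partitionSizeMB = 235.0
val filesAt128 = math.ceil(partitionSizeMB / 128).toInt // 2 files per partition
val filesAt256 = math.ceil(partitionSizeMB / 256).toInt // 1 file per partition
val filesAt512 = math.ceil(partitionSizeMB / 512).toInt // 1 file per partition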

@mreid-moz
Contributor Author

Ok, I tested that the data can be read from Spark, and that the record counts match exactly for one day's data between the existing v2 data, the v3 data using manual partitioning, and the v4 data using partitionBy("submission_date", "sample_id").
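
For reference, the partitionBy-style write being compared here looks roughly like this (a sketch; the output path and write mode are assumptions):

import org.apache.spark.sql.DataFrame

// Let Spark lay out the S3 prefix structure itself, e.g.
//   .../submission_date=20160401/sample_id=5/part-*.parquet
def writePartitioned(df: DataFrame, outputPath: String): Unit = {
  df.write
    .mode("overwrite")
    .partitionBy("submission_date", "sample_id")
    .parquet(outputPath)
}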

@vitillo
Contributor

vitillo commented May 18, 2016

r+ once you confirm that the dataset can be read by Presto as well.

@mreid-moz
Contributor Author

mreid-moz commented May 18, 2016

I checked, and the v3 and v4 datasets can't be auto-imported into Hive with the current parquet2hive utility. I filed this PR with a fix and tested that I can access the resulting table in Presto via sql.tmo afterwards.

One other issue I ran into is that parquet2hive assumes all S3 partitioning fields are strings, so using the sample_id field in queries is somewhat awkward, since you can't easily use ranges of sample_ids.

I manually edited the resulting Hive commands to make sample_id a BIGINT type field, and that worked, but I'm not sure if there's a good general approach to this.

@vitillo
Contributor

vitillo commented May 19, 2016

Casting sample_id to an integer in the query might be good enough.

@mreid-moz mreid-moz changed the title from "DO NOT MERGE: Spark schema main summary to view" to "Spark schema main summary to view" on May 19, 2016
@mreid-moz
Contributor Author

Ok, there's a workaround where you can just cast the partition field to a number in your query if you want to treat it as one. Null values work as expected too. These queries all do the expected thing:

SELECT count(*) FROM test_main_summary_v4 WHERE sample_id = '5'
SELECT count(*) FROM test_main_summary_v4 WHERE cast(sample_id AS bigint) < 10
SELECT count(*) FROM test_main_summary_v4 WHERE sample_id IS NULL
SELECT count(*) FROM test_main_summary_v4 WHERE cast(sample_id AS bigint) IS NULL

Per bug 1272334, refactor the Main Summary code to use Spark SQL types
directly (rather than using Avro Schema), and switch from the "Derived
Stream" approach to the "Views" approach.

This incorporates the "utils.Telemetry" code in mozilla#62.

Processing a single day's data on a 20-node cluster (April 1, 2016 was
my reference date) takes a bit more than 90 minutes with this current
code. Using the DerivedStreams approach from a few commits back takes
about half that amount of time.

Consumer-facing changes in the v3 dataset include:
- Renaming the "submission_date_s3" field to "submission_date"
- Changing the type of the "sample_id" field to a string due to
  using it as an s3 partitioning field.
@mreid-moz mreid-moz force-pushed the spark_schema_main_summary_to_view branch from 75adce2 to 55ab210 on May 19, 2016 at 12:59
@mreid-moz
Contributor Author

@vitillo I've cleaned up the history in this branch, it should be ready for final review / merge.

@mreid-moz mreid-moz changed the title from "Spark schema main summary to view" to "DO NOT MERGE: Spark schema main summary to view" on May 19, 2016
@mreid-moz
Contributor Author

Argh, I noticed a problem where the "overwrite" deletes the entire "version" prefix before adding data.
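
One possible workaround (a sketch of the idea, not necessarily what landed here): scope the write to the specific day's partition path, so "overwrite" only replaces that day rather than everything under the version prefix. The path layout and column names below are assumptions.

import org.apache.spark.sql.DataFrame

// Write one day's data under its own partition directory so mode("overwrite")
// only touches that prefix, not the whole .../vN/ tree.
def writeOneDay(df: DataFrame, baseUri: String, submissionDate: String): Unit = {
  val dayPath = s"$baseUri/submission_date_s3=$submissionDate"
  df.drop("submission_date_s3") // the value is already encoded in the path
    .write
    .mode("overwrite")
    .partitionBy("sample_id")
    .parquet(dayPath)
}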

@vitillo
Contributor

vitillo commented May 19, 2016

As discussed on IRC we should make sure the longitudinal dataset is not affected by the changes.

The MainSummaryView serialization doesn't run when using
parquet-avro 1.8.1, so disable that for now. Also adjust
the .sbtopts file to avoid an OOM when running tests.
@mreid-moz
Contributor Author

Ok, I've re-enabled the serialization test for Longitudinal, instead disabling the serialization test for MainSummaryView.

I confirmed that the MainSummaryView generation itself still works fine.

@mreid-moz mreid-moz changed the title from "DO NOT MERGE: Spark schema main summary to view" to "Spark schema main summary to view" on May 19, 2016
@mreid-moz
Contributor Author

This also restores the "submission_date_s3" field, so that's one less change for consumers of the data between v2 and v3.

}

"Records" can "be serialized" in {
println("TODO: fix Avro+Parquet Serialization test")
Contributor

Please remove this line.

@vitillo
Contributor

vitillo commented May 20, 2016

@mreid-moz this can be merged; we should probably do that when you are back so we can deploy it shortly afterwards.

@mreid-moz
Contributor Author

Summary of where we ended up with this PR:

  • Performance is known to be sub-optimal for now. The original main_summary code (before this change) took about 30 minutes to process a day's data. The code that used the DerivedStreams approach, but changed to SparkSQL types instead of Avro took about 45 minutes. The current code using the Views approach takes about 90 minutes. There are plans to rewrite the S3-iterating code, so we will tackle the performance problem at that time.
  • The MainSummaryView test coverage does not include testing of the data serialization due to an incompatibility between versions of the parquet-avro library.
  • The submission_date_s3 field is still present in v3. We also introduce an S3 partition for sample_id, which is a string field, but can efficiently be cast to a number when sampling ranges are desired.
  • The src/main/scala/utils/Telemetry.scala file was copied (with minor modifications) from Streams to views #62, so one or the other of these PRs will need to be rebased.

If this sounds acceptable, then we're ready to merge.

@vitillo vitillo merged commit ca1d281 into mozilla:master May 23, 2016
@vitillo
Contributor

vitillo commented May 23, 2016

r+
