Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for Spark 3-based builds #9524

Merged
merged 20 commits into from Oct 1, 2020
Merged

Conversation

karenfeng
Copy link
Contributor

Although Dataproc does not have a public Spark 3-based GA release schedule yet, it'd probably be helpful to start supporting a Spark 3 build; tagging @tpoterba for context.

I'm not familiar with the release process internally, so let me know what other changes need to be made to accommodate this. In particular, this PR likely needs to change the PySpark requirements specified in https://github.com/hail-is/hail/blob/main/hail/python/requirements.txt.

This PR builds on changes from #9199.

The code changes are due to Scala 2.12 and Spark 3 changes:

  • y in x << y must be an int
  • mutable.Stack is deprecated
  • JavaConversions is deprecated
  • addTaskCompletionListener is overloaded
  • Row.merge() is deprecated

The build changes are as follows:

The following testing commands pass (at least to the degree that main does):

  • make -j8 test SCALA_VERSION=2.11.12 SPARK_VERSION=2.4.5
  • make -j8 test SCALA_VERSION=2.12.8 SPARK_VERSION=3.0.0

Signed-off-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Karen Feng <karen.feng@databricks.com>
@tpoterba tpoterba self-assigned this Sep 29, 2020
Signed-off-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Karen Feng <karen.feng@databricks.com>
@karenfeng
Copy link
Contributor Author

karenfeng commented Sep 30, 2020

Actually, it looks like I'm having some trouble with block matrices in Python.

@tpoterba
Copy link
Contributor

What errors?

I'll run CI on this, but I don't expect the spark 2 stuff to have any problems.

@karenfeng
Copy link
Contributor Author

I don't believe there were any glaring problems on the Spark 2 side, but some the Python tests on Spark 3 are failing, notably in test/hail/linalg/test_linalg.py. I can't quite diagnose the issue off the top of my head.

@tpoterba
Copy link
Contributor

Passing CI tests for Spark 2.4. Do you have a stack trace for a failure?

Sriram saw issues related to Breeze in #9199, but I think it was the bug you noted above.

@@ -176,13 +176,14 @@ class Method private[lir] (
def findBlocks(): Blocks = {
val blocksb = new ArrayBuilder[Block]()

val s = new mutable.Stack[Block]()
var s = List[Block]()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we just make this a new ArrayStack[Block]()? I saw this a few weeks ago and meant to fix it, so glad this came up again.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ArrayStack is a one of ours, is.hail.utils.ArrayStack

Signed-off-by: Karen Feng <karen.feng@databricks.com>
@karenfeng
Copy link
Contributor Author

Yep, I bumped into the Breeze bug a while back while trying to upgrade Hail to Spark 3 internally, and realized it'd be a blocker downstream. I've only seen issues on the Python side; for example:

______________________________________________________________________________ Tests.test_block_matrix_entries ______________________________________________________________________________

self = <test.hail.linalg.test_linalg.Tests testMethod=test_block_matrix_entries>

    @fails_local_backend()
    def test_block_matrix_entries(self):
        n_rows, n_cols = 5, 3
        rows = [{'i': i, 'j': j, 'entry': float(i + j)} for i in range(n_rows) for j in range(n_cols)]
        schema = hl.tstruct(i=hl.tint32, j=hl.tint32, entry=hl.tfloat64)
        table = hl.Table.parallelize([hl.struct(i=row['i'], j=row['j'], entry=row['entry']) for row in rows], schema)
        table = table.annotate(i=hl.int64(table.i),
                               j=hl.int64(table.j)).key_by('i', 'j')
    
        ndarray = np.reshape(list(map(lambda row: row['entry'], rows)), (n_rows, n_cols))
    
        for block_size in [1, 2, 1024]:
            block_matrix = BlockMatrix.from_numpy(ndarray, block_size)
            entries_table = block_matrix.entries()
            self.assertEqual(entries_table.count(), n_cols * n_rows)
            self.assertEqual(len(entries_table.row), 3)
>           self.assertTrue(table._same(entries_table))
E           AssertionError: False is not true

test/hail/linalg/test_linalg.py:868: AssertionError
----------------------------------------------------------------------------------- Captured stdout call ------------------------------------------------------------------------------------
Table._same: rows differ:
  Row mismatch:
    L: [Struct(entry=1.0)]
    R: [Struct(entry=0.0)]
  Row mismatch:
    L: [Struct(entry=2.0)]
    R: [Struct(entry=0.0)]
  Row mismatch:
    L: [Struct(entry=1.0)]
    R: [Struct(entry=0.0)]
  Row mismatch:
    L: [Struct(entry=2.0)]
    R: [Struct(entry=0.0)]
  Row mismatch:
    L: [Struct(entry=3.0)]
    R: [Struct(entry=0.0)]
  Row mismatch:
    L: [Struct(entry=2.0)]
    R: [Struct(entry=0.0)]
  Row mismatch:
    L: [Struct(entry=3.0)]
    R: [Struct(entry=0.0)]
  Row mismatch:
    L: [Struct(entry=4.0)]
    R: [Struct(entry=0.0)]
  Row mismatch:
    L: [Struct(entry=3.0)]
    R: [Struct(entry=0.0)]
  Row mismatch:
    L: [Struct(entry=4.0)]
    R: [Struct(entry=0.0)]

Signed-off-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Karen Feng <karen.feng@databricks.com>
@karenfeng
Copy link
Contributor Author

Turns out I needed to pull in the transitive breeze dependencies; without these, the Python tests failed while the Scala tests passed. I've re-run the tests on Spark 3 now; let me know if there are any other changes you'd like to see. Unfortunately, there's a small bug I saw on the Spark 3 MLLib side; I'll make sure this gets addressed ASAP: https://issues.apache.org/jira/browse/SPARK-33043

Signed-off-by: Karen Feng <karen.feng@databricks.com>
@johnc1231
Copy link
Contributor

Ran CI tests, looks like everything is passing now. Will leave approval to @tpoterba though

@johnc1231
Copy link
Contributor

Thanks for doing this! Should make things easier when dataproc actually releases a Spark 3 version

@tpoterba
Copy link
Contributor

tpoterba commented Oct 1, 2020

OK, looks good. When this goes in, I'll add a CI target to build against spark 3 so we don't accidentally break it again.

@danking danking merged commit 00b78e4 into hail-is:main Oct 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants