
[hail] PC Relate improvements #7962

Merged: 17 commits, Jan 24, 2020

Conversation

danking (Collaborator) commented Jan 24, 2020

I changed the Spark persists to writeRead, which writes the matrix to disk and then reads it back. I tried to maintain this invariant: a block matrix partition always reads a number of partitions linear in the number of referenced block matrices. In particular, the result of any matmul must be writeRead.
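
A minimal sketch of that pattern, using hypothetical writeRead/write/read helpers (Hail's real BlockMatrix methods take more parameters than this):

// Sketch: materialize to disk instead of Spark persist. Afterwards a
// downstream partition reads a bounded set of file-backed blocks rather
// than recomputing a deep multiplication lineage.
def writeRead(bm: BlockMatrix, tmpPath: String): BlockMatrix = {
  bm.write(tmpPath)          // write every block once
  BlockMatrix.read(tmpPath)  // later reads hit files, not lineage
}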

I removed the boxing of Doubles when checking for NaN.
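
For illustration, a primitive Double can be tested for NaN without allocating a boxed java.lang.Double, because NaN is the only value that is not equal to itself (a sketch, not the PR's exact code):

def containsNaN(a: Array[Double]): Boolean = {
  var i = 0
  while (i < a.length) {
    if (a(i) != a(i))  // true only for NaN; no boxing, no equals() call
      return true
    i += 1
  }
  false
}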

I avoided a lot of allocation when performing matmul by using a fused multiply-add operation (dgemm).
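
The point of dgemm here is that BLAS computes C := alpha * A * B + beta * C in place, so a sum of products can accumulate into a single buffer with no intermediate matrices. A minimal sketch against netlib-java, assuming column-major, contiguous (unsliced) Breeze matrices:

import breeze.linalg.{DenseMatrix => BDM}
import com.github.fommil.netlib.BLAS

def fusedMultiplyAdd(c: BDM[Double], a: BDM[Double], b: BDM[Double]): Unit = {
  require(a.cols == b.rows && c.rows == a.rows && c.cols == b.cols)
  BLAS.getInstance().dgemm(
    "N", "N",               // neither operand transposed
    a.rows, b.cols, a.cols, // m, n, k
    1.0, a.data, a.rows,    // alpha = 1; A and its leading dimension
    b.data, b.rows,         // B and its leading dimension
    1.0, c.data, c.rows)    // beta = 1: accumulate into C in place
}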

I sped up conversion to BlockMatrix somewhat by introducing an iterator that caches the firstElementOffset.
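
The shape of that optimization, as a hypothetical sketch (the real code iterates Hail's region-backed rows; the names here are illustrative): compute the offset of the first element once, then advance by a stride, rather than re-deriving the offset for every element.

final class StridedDoubleIterator(
    data: Array[Double], firstElementOffset: Int, stride: Int, n: Int)
  extends Iterator[Double] {
  private var i = 0
  private var off = firstElementOffset  // cached once, never recomputed
  def hasNext: Boolean = i < n
  def next(): Double = { val v = data(off); off += stride; i += 1; v }
}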

I substantially improved BlockMatrix.checkpoint by using the fast LZ4 codec.
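
LZ4 trades compression ratio for speed, which is why checkpointed files get bigger but write and read faster (see the documentation discussion below). A sketch of the underlying stream wrapping with lz4-java; Hail actually selects the codec through a codec spec rather than raw streams:

import java.io.{InputStream, OutputStream}
import net.jpountz.lz4.{LZ4BlockInputStream, LZ4BlockOutputStream}

// Fast compress on write and fast decompress on read, at a worse
// compression ratio than a heavier codec such as deflate.
def compressed(os: OutputStream): OutputStream = new LZ4BlockOutputStream(os)
def decompressed(is: InputStream): InputStream = new LZ4BlockInputStream(is)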

New: I also added some tasteful caching to PCRelate, which substantially reduced the time spent reading data from disk.

cc: @johnc1231 @konradjk

@@ -1660,18 +1657,35 @@ private class BlockMatrixMultiplyRDD(l: BlockMatrix, r: BlockMatrix)
}
})

def dgemm(c: BDM[Double], _a: BDM[Double], _b: BDM[Double]) {

johnc1231 (Contributor) commented Jan 24, 2020

Nit, but I'd rather we called this something more like fusedMultiplyAndAdd. It really isn't dgemm unless it has a million more arguments, and we actually have LAPACK.dgemm in our codebase now for native arrays.

tpoterba (Contributor) left a comment:

Need to add a benchmark

@@ -2466,6 +2466,21 @@ def checkpoint(self, output: str, overwrite: bool = False, stage_locally: bool =
>>> dataset = dataset.checkpoint('output/dataset_checkpoint.mt')

"""
if _codec_spec is None:

Contributor: this requires documentation

danking (Collaborator, Author): It's a hidden argument; I don't think it should be documented.

Contributor: checkpoint now has different semantics from write -- the files are bigger. That's the piece we should document.

danking (Collaborator, Author) commented Jan 24, 2020

Definitely need a benchmark. Will write one. How long is a benchmark allowed to take?

tpoterba (Contributor)

There's a 30-minute timeout. Also, things are a bit slower on Batch than on your laptop; I'd shoot for a couple of minutes.

danking (Collaborator, Author) commented Jan 24, 2020

OK, I added a 1-minute benchmark.

tpoterba (Contributor) commented Jan 24, 2020

how long did it take before?

danking (Collaborator, Author) commented Jan 24, 2020

You always ask the good questions, Tim. So, on that example, it got slower: it was 30s before. I think this is because I switched to writing to and reading from disk instead of using Spark persist.

danking (Collaborator, Author) commented Jan 24, 2020

I should think more about how to benchmark this properly.

danking (Collaborator, Author) commented Jan 24, 2020

I reran the benchmarks with nothing else running on my laptop. I think this should work, right?

git checkout pc-relate && \
git status && \
python3 -m benchmark_hail run -t pc_relate,pc_relate_big && \
git checkout master && \
(cd ../../hail && make install) && \
git status && \
git checkout pc-relate && \
python3 -m benchmark_hail run -t pc_relate,pc_relate_big

The benchmark should import the installed hail, so the second run measures master even though the pc-relate branch is checked out.

pc-relate branch:

2020-01-24 18:38:08,147: INFO: burn in: 30.09s
2020-01-24 18:38:35,904: INFO: run 1: 27.75s
2020-01-24 18:39:03,001: INFO: run 2: 27.09s
2020-01-24 18:39:29,144: INFO: run 3: 26.14s

master:

2020-01-24 18:41:08,254: INFO: burn in: 32.71s
2020-01-24 18:41:37,239: INFO: run 1: 28.98s
2020-01-24 18:42:05,393: INFO: run 2: 28.15s
2020-01-24 18:42:33,411: INFO: run 3: 28.01s

danking merged commit a03d620 into hail-is:master on Jan 24, 2020
tpoterba (Contributor)

yep, seems good
