Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[hail] staged import_matrix_table #6987

Merged
merged 24 commits into from Sep 7, 2019

Conversation

@danking
Copy link
Collaborator

commented Sep 3, 2019

I staged import_matrix_table and achieved substantial performance improvements. A few changes were necessary:

  • FunctionBuilder now accepts Code[Unit] to be added to the init method of the function object
  • SRVB now has an init method that should be called in the init method of a function object if many methods will share the SRVB
  • CodeChar now exists

The main change is in ImportMatrix.scala which is both staged and based on scanning the string rather than using String.split. The approach is essentially a simplified, staged version of import_vcf.

I benchmarked the change with:

In [2]: %%time 
   ...: import hail as hl 
   ...: m = hl.import_matrix_table('/tmp/foo.tsv.gz',  
   ...:                            row_fields={'f0': hl.tstr}, 
   ...:                            no_header=True,  
   ...:                            sep=' ', 
   ...:                            min_partitions=16) 
   ...: m = m.key_rows_by(locus=hl.parse_locus(m.f0)) 
   ...: m._force_count_rows()                                                                                                                                                                                                                                                   

/tmp/foo.tsv.gz is a gzipped (not blocked) 1GB of 1000 rows each containing one row column and 500k sample columns. The entries are the integers from 0 to 499,999.

The first column is the first run (when the JIT is warmed) and the second column is the mean of two subsequent runs. All times in seconds. Everything is necessarily executed on one core.

version cold warm
this PR 48 39.35
this PR with one monolithic method 235 73
master (5fe6737) 91s 83.5

I was disappointed with the performance of the monolithic method, so I dug in with -XX:+PrintCompilation and found that the JIT was having trouble doing on-stack replacement of the entry parsing loop. There was a cryptic message about the stack not being empty during an OSR compilation. I take this result as evidence that, in the JVM, small, fine-grained methods are critical for reliable performance.

The new code, after JIT warming, is reading at 250 MB/s (1GB / 40 seconds) which is a half to a third of the performance of cat. It's more than twice as fast as the old code.

Aside: using the staged stuff is hard, especially when using multiple methods. A couple thoughts:

  • Because SRVB generates a fresh SRVB for arrays and structs, you must thread the srvb through your code gen rather than using a single field, this is annoying and error prone
  • init is not a first class thing in FunctionBuilder and I've arguably made the whole situation more ugly by exposing addInitInstructions. Without the ability to place code in the constructor, it is hard to coordinate work between multiple methods.
  • When using lots of methods, there's a lot of bookkeeping. I would like a way to define a "staged class" that wraps up some of the boilerplate. Not totally clear what I want here, just less boilerplate.

Aside2: This is still pretty slow!? Splitting into multiple methods allowed me to interrogate where time was spent. The answer is "method4" which is parseEntries. This code includes the loop, the srvb state management, the Region manipulation, and checking for the missing value. I'd be surprised its the checking for missing value because I delegate to String.regionMatches for the heavy lifting and that does not show up in the profiler. I'm left to conclude that either srvb state management or writing/reading to regions is expensive.

Copy link
Collaborator

left a comment

fix tests please

@@ -157,15 +157,7 @@ object IRParser {
def error(t: Token, msg: String): Nothing = ParserUtils.error(t.pos, msg)

def deserialize[T](str: String)(implicit formats: Formats, mf: Manifest[T]): T = {
try {

This comment has been minimized.

Copy link
@tpoterba

tpoterba Sep 5, 2019

Collaborator

this produces worse user errors...

This comment has been minimized.

Copy link
@tpoterba

tpoterba Sep 5, 2019

Collaborator

the "error summary" will be a mapping exception, I think, not a "your VCF has this problem"

This comment has been minimized.

Copy link
@danking

danking Sep 5, 2019

Author Collaborator

Before I removed this I got a mapping exception that said it found a "JNothing" when it was looking for "object". When I removed this, I got an exception that wrapped the mapping exception and mentioned a field name that was missing in the JSON (I had misspelled the field name in python).

Wrapping lower level exceptions with exceptions containing more context is very common in Java libraries. I know we do this kind of transformation frequently, but I'm strongly against it. This isn't the first time I've had to remove one our try-catches so as to find a meaningful error message.

This comment has been minimized.

Copy link
@danking

danking Sep 5, 2019

Author Collaborator

This won't change the error summary either way, we always take the deepest message (which in this case is next to useless). This just adds ~50 lines of stack trace.

This comment has been minimized.

Copy link
@tpoterba

tpoterba Sep 5, 2019

Collaborator

ah, OK, sure. Must be confusing this wrapping with something else!

fatal("no paths specified for import_matrix_table.")
import LoadMatrix._
assert(rowFields.values.forall { t =>
t.isOfType(TString()) ||

This comment has been minimized.

Copy link
@tpoterba

tpoterba Sep 5, 2019

Collaborator

add same check to entry?

line := Code._null,
srvb.init()))

@transient private[this] val parseStringMb = fb.newMethod[Region, String]

This comment has been minimized.

Copy link
@tpoterba

tpoterba Sep 5, 2019

Collaborator

you can add string names to all these methods to make profiling a lot easier! I'm going to do that with as much of the codebase as possible soon.

This comment has been minimized.

Copy link
@danking

danking Sep 5, 2019

Author Collaborator

I went down this path but it changed several files and looked like EmitFunctionBuilder needed a refactor to reduce some duplicate code so I backed it all out after I finished profiling. Happy to add back in if another PR adds the functionality.

This comment has been minimized.

Copy link
@tpoterba

tpoterba Sep 5, 2019

Collaborator

can't you just do fb.newMethod[Region, String]("parseStringMB")?

@danking danking force-pushed the danking:staged-import-mt branch from b85d05a to 50e0ee4 Sep 5, 2019
danking added 10 commits Sep 6, 2019
fix
fix
fix
@danking

This comment has been minimized.

Copy link
Collaborator Author

commented Sep 6, 2019

@chrisvittal I took the plunge and also created a MatrixHybridReader. Changes (including above):

  • FunctionBuilder now accepts Code[Unit] to be added to the init method of the function object
  • SRVB now has an init method that should be called in the init method of a function object if many methods will share the SRVB
  • CodeChar now exists
  • TextMatrixReader exists which mimics TextTableReader; these should get unified at some point
  • minor documentation fixes to import_matrix_table
  • better error messages wrt using the name row_id, which is reserved for use by import_matrix_table when there are no keys specified
  • several new tests, including:
    • extensive testing of the product space of header, delimiter, header, and entry_type (including such weird things as using 9 as the missing value)
    • a pathological file: 9-separated values with 8 representing the missing value
    • several tests that trigger the pruner
  • rename LoadMatrix.scala to TextMatrixReader.scala
Copy link
Collaborator

left a comment

Retest. Otherwise LGTM.

@danking danking merged commit ab84587 into hail-is:master Sep 7, 2019
1 check passed
1 check passed
ci-test success
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.