Joins on PS1 MDF databases exceedingly slow #3
Profiling information:

```
         1137798 function calls (1114185 primitive calls) in 1133.420 CPU seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       74  539.657    7.293  539.657    7.293 {method 'isInsideV' of 'cPolygon.Polygon' objects}
       75  225.521    3.007  297.129    3.962 bhpix.py:9(proj_healpix)
      518  140.781    0.272  140.781    0.272 {method '_read_records' of 'tables.tableExtension.Table' objects}
       75  101.655    1.355  402.557    5.367 bhpix.py:38(proj_bhealpix)
      227   71.608    0.315   71.608    0.315 {numpy.core.multiarray.where}
       74   11.384    0.154   11.384    0.154 {lsd.native.table_join}
      965    9.384    0.010    9.412    0.010 colgroup.py:149(__getitem__)
     1408    7.022    0.005    7.022    0.005 {method 'any' of 'numpy.ndarray' objects}
      444    5.300    0.012    5.300    0.012 {numpy.core.multiarray.concatenate}
       79    3.772    0.048    3.772    0.048 {method 'astype' of 'numpy.ndarray' objects}
       74    3.366    0.045  962.203   13.003 join_ops.py:270(filter_space)
      153    3.017    0.020    3.017    0.020 {numpy.core.multiarray.arange}
      148    1.597    0.011 1131.155    7.643 join_ops.py:966(__iter__)
      296    1.330    0.004    1.384    0.005 {method '_g_new' of 'tables.hdf5Extension.File' objects}
   148/74    0.675    0.005 1111.929   15.026 join_ops.py:172(evaluate_join)
       74    0.325    0.004    0.325    0.004 {method 'fill' of 'numpy.ndarray' objects}
      148    0.322    0.002 1129.557    7.632 join_ops.py:511(__iter__)
    18574    0.267    0.000    0.267    0.000 {method '_g_getAttr' of 'tables.hdf5Extension.AttributeSet' objects}
      296    0.252    0.001    0.252    0.001 {method '_closeFile' of 'tables.hdf5Extension.File' objects}
```
The problem was in the computation of whether stars fall within the given boundaries. Since the boundaries are arbitrary polygons, this becomes slow once a cell with lots of objects gets hit. Compounding the problem, when joining tables this calculation gets repeated for each cell that is joined, so it runs roughly 70x as many times as when there is no join (for PanSTARRS MDF fields). I fixed it by introducing a "result cache" that remembers (on disk) the results of "heavy" functions and reuses them when those functions get called again. It's a general mechanism that will also allow caching of database files on local disk on the node, thus avoiding the slow network I/O to /n/panlfs (that feature will come later).
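The cache implementation itself isn't shown in this issue, but the idea amounts to disk-backed memoization keyed on a function's arguments. A minimal sketch, with the decorator name, key scheme, and cache directory being hypothetical rather than LSD's actual API:

```python
import functools
import hashlib
import os
import pickle

def result_cache(cache_dir='/tmp/lsd_result_cache'):
    """Disk-backed memoization for expensive, deterministic functions.

    Results are keyed by a hash of the function name and its pickled
    arguments, and stored on disk so repeated calls (e.g. one per joined
    cell) reuse the earlier answer instead of recomputing it.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            key = hashlib.sha1(
                pickle.dumps((func.__name__, args, kwargs), -1)).hexdigest()
            path = os.path.join(cache_dir, key + '.pkl')
            if os.path.exists(path):
                with open(path, 'rb') as f:
                    return pickle.load(f)
            result = func(*args, **kwargs)
            if not os.path.isdir(cache_dir):
                os.makedirs(cache_dir)
            with open(path, 'wb') as f:
                pickle.dump(result, f, -1)
            return result
        return wrapper
    return decorator

# Hypothetical usage: wrap the heavy per-cell polygon containment test so
# that a join does not recompute it for every cell it touches.
# @result_cache()
# def objects_in_bounds(cell_id, bounds):
#     ...
```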
Modify the lookup algorithm to do the following, given "... FROM A, B ...":

1. First look for A:B join files in A's directory.
2. If that fails, look for A:B join files in B's directory.
3. When a potential match is found, prefer resolving the other table to the same dbdir where the .join file resides.
4. If that is not possible, look for it in the other dbdirs, in the order they appear in LSD_DB.

This resolves the ambiguity that arises when tables with the same name exist in two different LSD_DB directories (see the sketch after this list).
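For illustration, here is a rough sketch of that search order. It assumes tables live as subdirectories of each dbdir and that join files are named like `.A:B.join`; both are assumptions, and the actual on-disk layout in LSD may differ:

```python
import os

def resolve_join(db_dirs, table_a, table_b):
    """Sketch of the proposed lookup order for '... FROM A, B ...'.

    Returns (join_file_path, dbdir_for_other_table), or (None, None) if no
    join file is found. The '.A:B.join' naming is hypothetical.
    """
    # Steps 1 and 2: look for the A:B join file in A's directory, then B's.
    for primary, other in ((table_a, table_b), (table_b, table_a)):
        for dbdir in db_dirs:
            join_path = os.path.join(
                dbdir, primary, '.%s:%s.join' % (table_a, table_b))
            if os.path.exists(join_path):
                # Step 3: prefer resolving the other table in the same dbdir
                # where the .join file resides.
                if os.path.isdir(os.path.join(dbdir, other)):
                    return join_path, dbdir
                # Step 4: otherwise fall back to the other dbdirs, in the
                # order they appear in LSD_DB.
                for fallback in db_dirs:
                    if os.path.isdir(os.path.join(fallback, other)):
                        return join_path, fallback
    return None, None

# db_dirs would typically come from the LSD_DB path, e.g.:
# db_dirs = os.environ['LSD_DB'].split(':')
```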
This profiles and reproduces the problem:
```
NWORKERS=1 /n/sw/python-2.7/lib/python2.7/cProfile.py -s time lsd-query --format=fits --bounds='beam(333.3978, 0.4723, 0.1)' 'SELECT obj_id, cal_psf_mag, cal_psf_mag_sig FROM md_obj, md_det'
```
The runtime on a random Odyssey node is ~1100 sec, while the runtime for a query with no join is ~15 sec.
(originally reported by Dae-Won Kim)