Joins on PS1 MDF databases exceedingly slow #3
Profiling information:

```
         1137798 function calls (1114185 primitive calls) in 1133.420 CPU seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       74  539.657    7.293  539.657    7.293 {method 'isInsideV' of 'cPolygon.Polygon' objects}
       75  225.521    3.007  297.129    3.962 bhpix.py:9(proj_healpix)
      518  140.781    0.272  140.781    0.272 {method '_read_records' of 'tables.tableExtension.Table' objects}
       75  101.655    1.355  402.557    5.367 bhpix.py:38(proj_bhealpix)
      227   71.608    0.315   71.608    0.315 {numpy.core.multiarray.where}
       74   11.384    0.154   11.384    0.154 {lsd.native.table_join}
      965    9.384    0.010    9.412    0.010 colgroup.py:149(__getitem__)
     1408    7.022    0.005    7.022    0.005 {method 'any' of 'numpy.ndarray' objects}
      444    5.300    0.012    5.300    0.012 {numpy.core.multiarray.concatenate}
       79    3.772    0.048    3.772    0.048 {method 'astype' of 'numpy.ndarray' objects}
       74    3.366    0.045  962.203   13.003 join_ops.py:270(filter_space)
      153    3.017    0.020    3.017    0.020 {numpy.core.multiarray.arange}
      148    1.597    0.011 1131.155    7.643 join_ops.py:966(__iter__)
      296    1.330    0.004    1.384    0.005 {method '_g_new' of 'tables.hdf5Extension.File' objects}
   148/74    0.675    0.005 1111.929   15.026 join_ops.py:172(evaluate_join)
       74    0.325    0.004    0.325    0.004 {method 'fill' of 'numpy.ndarray' objects}
      148    0.322    0.002 1129.557    7.632 join_ops.py:511(__iter__)
    18574    0.267    0.000    0.267    0.000 {method '_g_getAttr' of 'tables.hdf5Extension.AttributeSet' objects}
      296    0.252    0.001    0.252    0.001 {method '_closeFile' of 'tables.hdf5Extension.File' objects}
```
The problem was in the computation of whether stars fall within the given boundaries. Since the boundaries are arbitrary polygons, this becomes slow once a cell with lots of objects gets hit. Compounding the problem, when joining tables this calculation gets repeated for each cell that is joined, so it runs roughly 70x as many times as when there is no join (for PanSTARRS MDF fields). I fixed it by introducing a "result cache" that remembers (on disk) the results of "heavy" functions and reuses them when those functions get called again. It's a general mechanism that will also allow caching of database files on local disk on the node, thus avoiding the slow network I/O to /n/panlfs (that feature will come later).
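The cache implementation itself isn't shown in this issue, but the idea amounts to disk-backed memoization keyed on a function's arguments. A minimal sketch, with the decorator name, key scheme, and cache directory being hypothetical rather than LSD's actual API:

```python
import functools
import hashlib
import os
import pickle

def result_cache(cache_dir='/tmp/lsd_result_cache'):
    """Disk-backed memoization for expensive, deterministic functions.

    Results are keyed by a hash of the function name and its pickled
    arguments, and stored on disk so repeated calls (e.g. one per joined
    cell) reuse the earlier answer instead of recomputing it.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            key = hashlib.sha1(
                pickle.dumps((func.__name__, args, kwargs), -1)).hexdigest()
            path = os.path.join(cache_dir, key + '.pkl')
            if os.path.exists(path):
                with open(path, 'rb') as f:
                    return pickle.load(f)
            result = func(*args, **kwargs)
            if not os.path.isdir(cache_dir):
                os.makedirs(cache_dir)
            with open(path, 'wb') as f:
                pickle.dump(result, f, -1)
            return result
        return wrapper
    return decorator

# Hypothetical usage: wrap the heavy per-cell polygon containment test so
# that a join does not recompute it for every cell it touches.
# @result_cache()
# def objects_in_bounds(cell_id, bounds):
#     ...
```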
Modify the lookup algorithm to do the following, given "... FROM A, B ...":

1. First look for A:B join files in A's directory.
2. If that fails, look for A:B join files in B's directory.
3. When a potential match is found, prefer resolving the other table to the same dbdir where the .join file resides.
4. If that is not possible, look for it in the other dbdirs, in the order they appear in LSD_DB.

This resolves the ambiguity that arises when tables with the same name exist in two different LSD_DB directories (see the sketch after this list).
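For illustration, here is a rough sketch of that search order. It assumes tables live as subdirectories of each dbdir and that join files are named like `.A:B.join`; both are assumptions, and the actual on-disk layout in LSD may differ:

```python
import os

def resolve_join(db_dirs, table_a, table_b):
    """Sketch of the proposed lookup order for '... FROM A, B ...'.

    Returns (join_file_path, dbdir_for_other_table), or (None, None) if no
    join file is found. The '.A:B.join' naming is hypothetical.
    """
    # Steps 1 and 2: look for the A:B join file in A's directory, then B's.
    for primary, other in ((table_a, table_b), (table_b, table_a)):
        for dbdir in db_dirs:
            join_path = os.path.join(
                dbdir, primary, '.%s:%s.join' % (table_a, table_b))
            if os.path.exists(join_path):
                # Step 3: prefer resolving the other table in the same dbdir
                # where the .join file resides.
                if os.path.isdir(os.path.join(dbdir, other)):
                    return join_path, dbdir
                # Step 4: otherwise fall back to the other dbdirs, in the
                # order they appear in LSD_DB.
                for fallback in db_dirs:
                    if os.path.isdir(os.path.join(fallback, other)):
                        return join_path, fallback
    return None, None

# db_dirs would typically come from the LSD_DB path, e.g.:
# db_dirs = os.environ['LSD_DB'].split(':')
```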
This profiles and reproduces the problem:
```
NWORKERS=1 /n/sw/python-2.7/lib/python2.7/cProfile.py -s time lsd-query --format=fits --bounds='beam(333.3978, 0.4723, 0.1)' 'SELECT obj_id, cal_psf_mag, cal_psf_mag_sig FROM md_obj, md_det'
```
The runtime on a random Odyssey node is ~1100 sec, while the runtime for a query with no join is ~15 sec.
(originally reported by Dae-Won Kim)