Joins on PS1 MDF databases exceedingly slow #3

Closed
mjuric opened this issue Feb 25, 2011 · 3 comments
@mjuric
Owner

mjuric commented Feb 25, 2011

This profiles and reproduces the problem:

NWORKERS=1 /n/sw/python-2.7/lib/python2.7/cProfile.py -s time lsd-query --format=fits --bounds='beam(333.3978, 0.4723, 0.1)' 'SELECT obj_id, cal_psf_mag, cal_psf_mag_sig FROM md_obj, md_det'

The runtime on a random Odyssey node is ~1100 sec, while the runtime for a query with no join is ~15 sec.

(originally reported by Dae-Won Kim)

@mjuric
Owner Author

mjuric commented Feb 25, 2011

Profiling information:

         1137798 function calls (1114185 primitive calls) in 1133.420 CPU seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       74  539.657    7.293  539.657    7.293 {method 'isInsideV' of 'cPolygon.Polygon' objects}
       75  225.521    3.007  297.129    3.962 bhpix.py:9(proj_healpix)
      518  140.781    0.272  140.781    0.272 {method '_read_records' of 'tables.tableExtension.Table' objects}
       75  101.655    1.355  402.557    5.367 bhpix.py:38(proj_bhealpix)
      227   71.608    0.315   71.608    0.315 {numpy.core.multiarray.where}
       74   11.384    0.154   11.384    0.154 {lsd.native.table_join}
      965    9.384    0.010    9.412    0.010 colgroup.py:149(__getitem__)
     1408    7.022    0.005    7.022    0.005 {method 'any' of 'numpy.ndarray' objects}
      444    5.300    0.012    5.300    0.012 {numpy.core.multiarray.concatenate}
       79    3.772    0.048    3.772    0.048 {method 'astype' of 'numpy.ndarray' objects}
       74    3.366    0.045  962.203   13.003 join_ops.py:270(filter_space)
      153    3.017    0.020    3.017    0.020 {numpy.core.multiarray.arange}
      148    1.597    0.011 1131.155    7.643 join_ops.py:966(__iter__)
      296    1.330    0.004    1.384    0.005 {method '_g_new' of 'tables.hdf5Extension.File' objects}
   148/74    0.675    0.005 1111.929   15.026 join_ops.py:172(evaluate_join)
       74    0.325    0.004    0.325    0.004 {method 'fill' of 'numpy.ndarray' objects}
      148    0.322    0.002 1129.557    7.632 join_ops.py:511(__iter__)
    18574    0.267    0.000    0.267    0.000 {method '_g_getAttr' of 'tables.hdf5Extension.AttributeSet' objects}
      296    0.252    0.001    0.252    0.001 {method '_closeFile' of 'tables.hdf5Extension.File' objects}
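
To make the table easier to read, here is a small sketch that turns a few of the numbers above into shares of the total runtime. The figures are copied from the profile; the `share` helper and the dictionary names are only illustrative. It suggests that `filter_space` accounts for about 85% of the run (cumulative time), while the native `table_join` call itself is about 1%.

```python
# Total CPU seconds reported on the first line of the profile.
TOTAL_SECONDS = 1133.420

# Cumulative time of the spatial filter (includes everything it calls).
CUMTIME = {
    "join_ops.py:270(filter_space)": 962.203,
}

# Internal time of the heaviest individual functions.
TOTTIME = {
    "cPolygon.Polygon.isInsideV": 539.657,
    "bhpix.py:9(proj_healpix)": 225.521,
    "tables Table._read_records": 140.781,
    "bhpix.py:38(proj_bhealpix)": 101.655,
    "numpy where": 71.608,
    "lsd.native.table_join": 11.384,
}


def share(seconds, total=TOTAL_SECONDS):
    """Return `seconds` as a percentage of the total runtime."""
    return 100.0 * seconds / total


if __name__ == "__main__":
    for name, seconds in list(CUMTIME.items()) + list(TOTTIME.items()):
        print("%-32s %6.1f%%" % (name, share(seconds)))
```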

@mjuric
Owner Author

mjuric commented Mar 3, 2011

The problem was in computing whether stars fall within the given boundaries. Since the boundaries are arbitrary polygons, this becomes slow once a cell with lots of objects gets hit. Compounding the problem, when joining tables this calculation gets repeated for each cell that is joined, so roughly 70x as many times as when there is no join (for PanSTARRS MDF fields).

I fixed it by introducing a "result cache" that remembers (on disk) the results of "heavy" functions and reuses them when those functions get called again. It's a general mechanism that will also allow caching of database files on local disk on the node, thus avoiding the slow network I/O to /n/panlfs (that feature will come later).
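
For readers unfamiliar with the idea, an on-disk result cache can be sketched roughly as below. This is not the code from the commits referenced in this issue; it is a minimal Python 3 illustration, and `disk_cached`, `count_inside`, and the pickle-based key are all assumptions made for the example. It also assumes the arguments pickle to the same bytes on every call, which holds for simple values such as numbers, strings, and tuples.

```python
import functools
import hashlib
import os
import pickle
import tempfile


def disk_cached(cache_dir):
    """Decorator that stores a function's results on disk, keyed by its arguments."""
    os.makedirs(cache_dir, exist_ok=True)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Build a key from the function identity and its arguments.
            payload = pickle.dumps(
                (func.__module__, func.__qualname__, args, sorted(kwargs.items()))
            )
            path = os.path.join(cache_dir, hashlib.sha1(payload).hexdigest() + ".pkl")

            # Cache hit: reuse the stored result instead of recomputing.
            if os.path.exists(path):
                with open(path, "rb") as f:
                    return pickle.load(f)

            # Cache miss: compute, then write to a temporary file and rename it,
            # so that a concurrent reader never sees a half-written entry.
            result = func(*args, **kwargs)
            fd, tmp_path = tempfile.mkstemp(dir=cache_dir)
            with os.fdopen(fd, "wb") as f:
                pickle.dump(result, f)
            os.replace(tmp_path, path)
            return result

        return wrapper

    return decorator


# Hypothetical "heavy" function standing in for the bounds check.
CALLS = []


@disk_cached(tempfile.mkdtemp(prefix="result-cache-"))
def count_inside(cell_id, bounds):
    CALLS.append(cell_id)  # records how often the real work is done
    return {"cell": cell_id, "bounds": bounds}
```

Calling `count_inside(7, (333.3978, 0.4723, 0.1))` twice does the work once; the second call is served from the cache directory. A real implementation would also need some way to invalidate entries when the underlying data changes, which this sketch leaves out.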

@mjuric
Owner Author

mjuric commented Mar 4, 2011

Fixed in commits 741f275 through a981340

mjuric pushed a commit that referenced this issue Jun 16, 2011
Modify the lookup algorithm to do the following:

	given "... FROM A, B ..."

	#1 first look for A:B join files in A's directory
	#2 if that fails, look for A:B join files in B's directory
	#3 when a potential match is found, prefer resolving the other
	   table to the same dbdir where the .join file resides
	#4 if you can't, look for it in other dbdirs, in order they
	   show up in LSD_DB

tables with the same name in two different LSD_DB directories.
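
The lookup order described in the commit message above could be sketched along these lines. This is only an illustration, not the committed code: `find_join`, the `A:B.join` file name, and the one-subdirectory-per-table layout are assumptions, and `dbdirs` stands for the directories in the order they appear in `LSD_DB`.

```python
import os


def find_join(table_a, table_b, dbdirs, join_name="{a}:{b}.join"):
    """Return (join_path, dir_of_a, dir_of_b), or None if no usable join is found."""

    def table_dir(dbdir, table):
        # Assumed layout: each table lives in <dbdir>/<table>.
        path = os.path.join(dbdir, table)
        return path if os.path.isdir(path) else None

    def resolve_other(table, preferred_dbdir):
        # Step 3: prefer the dbdir that holds the .join file.
        # Step 4: otherwise fall back to the other dbdirs, in LSD_DB order.
        candidates = [preferred_dbdir] + [d for d in dbdirs if d != preferred_dbdir]
        for dbdir in candidates:
            found = table_dir(dbdir, table)
            if found is not None:
                return found
        return None

    file_name = join_name.format(a=table_a, b=table_b)

    # Step 1: look in A's directory. Step 2: look in B's directory.
    for owner, other in ((table_a, table_b), (table_b, table_a)):
        for dbdir in dbdirs:
            owner_dir = table_dir(dbdir, owner)
            if owner_dir is None:
                continue
            join_path = os.path.join(owner_dir, file_name)
            if not os.path.isfile(join_path):
                continue
            other_dir = resolve_other(other, dbdir)
            if other_dir is None:
                continue
            if owner == table_a:
                return join_path, owner_dir, other_dir
            return join_path, other_dir, owner_dir
    return None
```

With two dbdirs that both contain a table `B`, the copy of `B` sitting next to the `.join` file wins even if the other dbdir comes first in the list.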
This issue was closed.