
DM-1579: HTM indexed files. #42

Merged: SimonKrughoff merged 8 commits into master from tickets/DM-1579 on Apr 1, 2016

Conversation

SimonKrughoff (Contributor): No description provided.

default_config.colnames = ['id', 'ra', 'dec', 'a', 'a_err', 'b', 'b_err', 'is_phot', 'is_res',
                           'is_var', 'val1', 'val2', 'val3']
default_config.delimiter = '|'
#this also tests changing the delimiter
Member: Need a space between the # and the text.

SimonKrughoff (author): Thanks.

SimonKrughoff force-pushed the tickets/DM-1579 branch 2 times, most recently from d222b93 to 53c5c93 (March 24, 2016).
names = True
if self.config.colnames:
    names = self.config.colnames
arr = numpy.genfromtxt(filename, dtype=None, skip_header=self.config.header_lines,
Contributor: Why text input? Our catalogs are typically large, and in FITS.

SimonKrughoff (author): Text input is the less trivial of the two. I figured a FITS reader could be added once this is in. I could do it here if it's a no-go without it.

Contributor: The input mechanism isn't abstracted, so changing it isn't straightforward and would involve copying a lot of code.
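
For illustration, here is one shape the abstraction under discussion could take. This is a sketch, not this PR's API: the class and method names are hypothetical, and it only assumes the config fields already shown above (colnames, header_lines, delimiter). If the file-reading step were isolated behind a single method, a FITS-backed reader could override it without duplicating the rest of the ingestion logic.

```python
import numpy


class TextReaderTask(object):
    """Read a delimited text catalog into a numpy structured array."""

    def __init__(self, config):
        self.config = config

    def readFile(self, filename):
        # names=True tells genfromtxt to take column names from the header
        names = self.config.colnames if self.config.colnames else True
        return numpy.genfromtxt(filename, dtype=None, names=names,
                                skip_header=self.config.header_lines,
                                delimiter=self.config.delimiter)


class FitsReaderTask(TextReaderTask):
    """Same interface, but FITS-backed (sketched with astropy)."""

    def readFile(self, filename):
        from astropy.table import Table
        return Table.read(filename, format="fits").as_array()
```

With that split, the ingester only ever sees a structured array, whatever the on-disk format.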

PaulPrice (Contributor): I think several of these commits could be squashed together without losing anything.

PaulPrice (Contributor): Despite the filename choices, I think there's a lot in here that's not HTM-specific. I think it's worth making it more generic, so I can choose which indexing scheme I want to use. For example, you could make an abstract base class for the Indexer and allow the user to select (via Config) which Indexer subclass to use.
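
One possible shape for that suggestion, sketched with a plain-Python registry (the names are hypothetical; an lsst.pex.config registry could serve the same purpose):

```python
import abc

INDEXER_REGISTRY = {}  # maps a config-supplied name to an Indexer subclass


def register_indexer(name):
    def wrap(cls):
        INDEXER_REGISTRY[name] = cls
        return cls
    return wrap


class Indexer(abc.ABC):
    """Interface any sharding scheme must provide."""

    @abc.abstractmethod
    def index_points(self, ra_list, dec_list):
        """Return a shard (pixel/trixel) id for each (ra, dec) pair."""

    @abc.abstractmethod
    def get_pixel_ids(self, ctrCoord, radius):
        """Return (pixel_id_list, is_on_boundary) for a circular region."""


@register_indexer("HTM")
class HtmIndexer(Indexer):
    def __init__(self, depth=8):
        self.depth = depth

    def index_points(self, ra_list, dec_list):
        raise NotImplementedError("delegate to the HTM package here")

    def get_pixel_ids(self, ctrCoord, radius):
        raise NotImplementedError("delegate to the HTM package here")


def make_indexer(name, **kwargs):
    """Instantiate the indexer selected by name, e.g. from a Config field."""
    return INDEXER_REGISTRY[name](**kwargs)
```

A HEALPix or other scheme would then be one more registered subclass, with no change to the ingestion code.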


arr = numpy.array(zip(ident, ra, dec, a_mag, a_mag_err, b_mag, b_mag_err, is_photometric, is_resolved,
                  is_variable, extra_col1, extra_col2, extra_col3), dtype=dtype)
numpy.savetxt(out_path+"/ref.txt", arr, delimiter=",",
Contributor: I think this demonstrates my concern about having the ingestion read CSV rather than FITS. Converting FITS catalogs covering 3π steradians down to 21 mag into CSV is going to be very slow and consume a lot of disk space.

SimonKrughoff (author): I agree. I think it would be trivial to add, but I'd rather do that as another ticket so I can close this one. Please push back if that is not acceptable to you.

Contributor: I don't think the code is useful if it's not fast, because the reference catalog I usually use is very large (151 GB for the PS1 astrometry_net_data).

SimonKrughoff (author): I get your point, but I disagree that it's not useful if it's not fast. It will be very useful for me when I run simulations and have to make a new (small) reference catalog for every different run.

PaulPrice (Contributor): As I've thought about this some more, I've concluded that the ingestion code will have to be parallelised in order to ingest large (all-sky) catalogs. That doesn't mean the parallelisation needs to be done as part of this ticket, but I would suggest breaking the ingestion up now into pieces suitable for implementing the parallelisation later.

I see two operations that can be parallelised:

  1. Assigning each source an index: can be parallelised over input files, writing temporary files that join sources to indices.
  2. Writing each source to the output: can be parallelised over indices, writing the final product.
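
A minimal sketch of that two-stage split, with multiprocessing.Pool standing in for whatever parallelisation layer is eventually chosen; the helper bodies and filenames are placeholders, not this PR's code:

```python
import multiprocessing


def assign_indices(filename):
    """Stage 1, parallel over input files: read one file, compute an index
    for every source, and write a temporary file joining sources to indices.
    Returns the temporary filename."""
    # ... read sources, call indexer.index_points(ra, dec), dump to disk ...
    return filename + ".indexed.tmp"


def write_shard(index_id):
    """Stage 2, parallel over indices: collect every source carrying this
    index from the temporary files and write the final per-index product."""
    # ... scan the temporaries for rows with index_id, append to the shard ...
    return index_id


if __name__ == "__main__":
    input_files = ["cat_000.txt", "cat_001.txt"]  # hypothetical inputs
    with multiprocessing.Pool() as pool:
        temp_files = pool.map(assign_indices, input_files)
        pool.map(write_shard, sorted({131072, 131073}))  # example trixel ids
```

The key property is that stage 1 needs no global state and stage 2 only ever touches one output shard, so neither stage needs locking.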

@@ -36,4 +47,9 @@ def get_pixel_ids(self, ctrCoord, radius):
return pixel_id_list, is_on_boundary

def index_points(self, ra_list, dec_list):
"""!Generate trixel ids for each row in an input file

@param[in] coord_list List of coord objects to index
Contributor: Should list "ra_list" and "dec_list" as input parameters in the docstring.
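
For illustration, the corrected docstring might read (keeping the Doxygen style used in the diff above):

```python
def index_points(self, ra_list, dec_list):
    """!Generate trixel ids for each row in an input file

    @param[in] ra_list   List of RA values to index
    @param[in] dec_list  List of Dec values to index
    """
```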

Commit messages:

This implements an indexer that provides the methods necessary to
shard a catalog and then look up the correct shard later.  The package
I use currently has a problem if you call the lookup_id method without
first calling another function.  This hack will go away if the package
is fixed or if we get an LSST-maintained version.
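
The commit doesn't name the package or the function that must run first, so this is only a guess at the shape of the workaround: prime the underlying object once before lookup_id is ever used.

```python
class PrimedIndexer(object):
    """Wrap the third-party indexer and force its initialisation once."""

    def __init__(self, raw_indexer):
        self._raw = raw_indexer
        self._primed = False

    def lookup_id(self, ra, dec):
        if not self._primed:
            # Hypothetical priming call: the package misbehaves unless some
            # other method has run first, so run one with throwaway inputs.
            self._raw.intersect(0.0, 0.0, 1.0)
            self._primed = True
        return self._raw.lookup_id(ra, dec)
```
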
This will be used to read files from disk for indexing.

The catalog ingester will load files using a helper task
and put the contents in a set of SourceCatalogs, one
per HTM trixel.  Most of the config is to allow mapping of
columns in the file to the schema needed for reference catalogs.
This implements the task to take files and parse them into
SourceCatalogs.  Some of the steps implemented here are:
read the file into a numpy array (the whole file needs to fit
in memory), construct the SourceCatalog schema from the array
dtype, index the catalog, and fill the SourceCatalog.
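
A rough sketch of the dtype-to-schema step described here; the type mapping and field handling are assumptions about the afw.table API, not this PR's code:

```python
import lsst.afw.table as afwTable

# Assumed mapping from numpy type codes to afw.table field type strings.
DTYPE_TO_FIELD = {"f8": "D", "f4": "F", "i8": "L", "i4": "I"}


def schema_from_array(arr):
    """Build a schema whose fields mirror the structured array's dtype."""
    schema = afwTable.SourceTable.makeMinimalSchema()
    for name in arr.dtype.names:
        if name == "id":
            continue  # the minimal schema already defines an id field
        code = arr.dtype[name].str.lstrip("<>|=")  # e.g. '<f8' -> 'f8'
        schema.addField(name, type=DTYPE_TO_FIELD[code])
    return schema
```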

Command line tasks need a command runner.  Since this task
doesn't need to parse an `--id` argument, we can forgo that
step.  Essentially, this just calls the task run method and
gives a place to persist the config.
This involves reconstituting the task that ingested the files.

The only config is the name of the config to read to create the
ingester task.  The loader then borrows the indexer from the
ingester task to grab the files it needs from the repo to load the
reference catalog.  The shards are only clipped if they land on the
boundary of the circular aperture.

This is a minimal override of the reference catalog loader.  It may
be possible to optimize by overriding other default implementations.
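
An illustrative sketch of that boundary clipping (the helpers are hypothetical, and it assumes is_on_boundary parallels pixel_id_list, matching the get_pixel_ids return shown in the diff above):

```python
def load_region(ctrCoord, radius, indexer, load_shard, angular_separation):
    """Load every shard overlapping a circular aperture, filtering per
    source only for shards that straddle the aperture boundary."""
    pixel_ids, is_on_boundary = indexer.get_pixel_ids(ctrCoord, radius)
    sources = []
    for pix_id, on_boundary in zip(pixel_ids, is_on_boundary):
        shard = load_shard(pix_id)  # read the per-trixel catalog from the repo
        if on_boundary:
            # Only boundary shards pay the per-source distance test.
            shard = [s for s in shard
                     if angular_separation(s, ctrCoord) <= radius]
        sources.extend(shard)
    return sources
```
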
SimonKrughoff merged commit c56e30d into master on Apr 1, 2016.
ktlim deleted the tickets/DM-1579 branch on August 25, 2018.