DM-1579: HTM indexed files. #42
Conversation
Force-pushed from 9b4181a to b4b50ce
default_config.colnames = ['id', 'ra', 'dec', 'a', 'a_err', 'b', 'b_err', 'is_phot', 'is_res',
                           'is_var', 'val1', 'val2', 'val3']
default_config.delimiter = '|'
#this also tests changing the delimiter
Need a space between the # and the text.
Thanks.
Force-pushed from d222b93 to 53c5c93
names = True
if self.config.colnames:
    names = self.config.colnames
arr = numpy.genfromtxt(filename, dtype=None, skip_header=self.config.header_lines,
Why text input? Our catalogs are typically large, and in FITS.
Text input is the less trivial of the two. I figured a FITS reader could be added once this is in. I could do it here if it's a no-go without it.
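For reference, a minimal sketch of the kind of FITS reader discussed here, assuming astropy.io.fits is available; the function name and the hdu argument are illustrative and not part of this PR:

```python
from astropy.io import fits
import numpy


def read_fits_catalog(filename, hdu=1):
    """Read a FITS binary table into a numpy structured array, analogous
    to the numpy.genfromtxt call used for text input above.

    No delimiter or header_lines config is needed; the column names come
    from the table header.
    """
    with fits.open(filename) as hdulist:
        # Copy out of the FITS_rec so the file can be closed safely.
        arr = numpy.asarray(hdulist[hdu].data).copy()
    return arr
```

The rest of the ingester could then consume the returned array exactly as it consumes the genfromtxt output.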
The input mechanism isn't abstracted, so changing it isn't straightforward and would involve copying a lot of code.
I think several of these commits could be squashed together without losing anything.
Despite the filename choices, I think there's a lot in here that's not HTM-specific. I think it's worth making it more generic, so I can choose which indexing scheme I want to use. For example, you could make an abstract base class for the indexer.
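A sketch of what that abstraction could look like, built around the two methods the HTM indexer in this PR already exposes; the class name here is hypothetical:

```python
import abc


class BaseIndexer(metaclass=abc.ABCMeta):
    """Generic spatial-indexing interface; HTM would be one implementation,
    and other schemes (e.g. HEALPix) could be selected via config."""

    @abc.abstractmethod
    def get_pixel_ids(self, ctrCoord, radius):
        """Return (pixel_id_list, is_on_boundary) for a circular region."""

    @abc.abstractmethod
    def index_points(self, ra_list, dec_list):
        """Return a pixel (trixel, for HTM) id for each ra/dec pair."""
```

The existing HTM indexer would then subclass this, and the ingester and loader would depend only on the base interface.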
arr = numpy.array(zip(ident, ra, dec, a_mag, a_mag_err, b_mag, b_mag_err, is_photometric, is_resolved,
                      is_variable, extra_col1, extra_col2, extra_col3), dtype=dtype)
numpy.savetxt(out_path+"/ref.txt", arr, delimiter=",",
I think this demonstrates my concern about having the ingestion read CSV rather than FITS. Converting FITS catalogs covering 3pi steradians to 21 mag to CSV is going to be very slow and consume a lot of disk space.
I agree. I think it would be trivial to add, but I'd rather do that as another ticket so I can close this. Please push back if that is not acceptable to you.
I don't think the code is useful if it's not fast: the reference catalog I usually use is very large (151 GB for PS1 astrometry_net_data).
I get your point, but I disagree it's not useful if it's not fast. It will be very useful for me when I run simulations and I have to make a new (small) reference catalog for every different run I do.
As I've thought about this some more, I've concluded that the ingestion code is going to have to be parallelised in order to ingest large (all-sky) catalogs. That doesn't mean the parallelisation needs to be done as part of this ticket, but I would suggest that the ingestion be broken up now into suitable pieces for implementing the parallelisation later. I see two operations that can be parallelised.
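One obvious candidate piece is the per-file reading and indexing. A rough sketch of farming that out with multiprocessing follows; process_one_file and ingest_parallel are hypothetical names, and the merge step is left as the serial part:

```python
import multiprocessing


def process_one_file(filename):
    """Hypothetical stand-in: read one input file, index its rows with
    index_points, and return the rows grouped by trixel id."""
    # ... read the file, index it, group rows by trixel id ...
    return {}


def ingest_parallel(filenames, nprocs=4):
    """Run the per-file work in a process pool; merging the per-file
    groups into the final per-trixel catalogs stays serial for now."""
    pool = multiprocessing.Pool(processes=nprocs)
    try:
        per_file_results = pool.map(process_one_file, filenames)
    finally:
        pool.close()
        pool.join()
    return per_file_results
```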
@@ -36,4 +47,9 @@ def get_pixel_ids(self, ctrCoord, radius):
        return pixel_id_list, is_on_boundary

    def index_points(self, ra_list, dec_list):
        """!Generate trixel ids for each row in an input file

        @param[in] coord_list List of coord objects to index
Should list "ra_list" and "dec_list" as input parameters in the docstring.
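An updated docstring along those lines might read as follows; the wording of the parameter descriptions is only a suggestion:

```python
def index_points(self, ra_list, dec_list):
    """!Generate trixel ids for each row in an input file

    @param[in] ra_list   List of RA coordinates to index
    @param[in] dec_list  List of Dec coordinates to index
    """
```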
Force-pushed from 90227a8 to bbd76ce
This implements an indexer that provides the methods necessary to shard a catalog and then look up the correct shard later. The package I use currently has a problem if you call the lookup_id method without first calling another function. This hack will go away if the package is fixed or if we get an LSST-maintained version.
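As an illustration of the workaround being described, a thin wrapper like the following would guarantee the prerequisite call happens before lookup_id; the prime() method is a hypothetical stand-in for whatever call the underlying package actually requires, and this is not the code in the commit:

```python
class IndexerWorkaround:
    """Illustrative wrapper around the third-party HTM object."""

    def __init__(self, htm_obj):
        self._htm = htm_obj
        self._primed = False

    def lookup_id(self, ra, dec):
        # Make sure the package's prerequisite call has been made exactly
        # once before lookup_id is used (prime() is hypothetical).
        if not self._primed:
            self._htm.prime()
            self._primed = True
        return self._htm.lookup_id(ra, dec)
```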
This will be used to read files from disk for indexing.
The catalog ingester will load files using a helper task and put the contents in a set of SourceCatalogs, one per HTM trixel. Most of the config is to allow mapping of columns in the file to the schema needed for reference catalogs.
This implements the task to take files and parse them into SourceCatalogs. Some of the steps implemented here are: read the file into a numpy array (the whole file needs to fit in memory), construct the SourceCatalog schema from the array dtype, index the catalog, and fill the SourceCatalog. Command line tasks need a command runner. Since this task doesn't need to parse an `--id` argument, we can forego that step. Essentially, this just calls the task run method and gives a place to persist the config.
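A small sketch of the dtype-to-schema step described above, kept in plain numpy rather than the actual afw.table calls; make_schema_spec and the type codes it emits are illustrative assumptions:

```python
import numpy


def make_schema_spec(arr):
    """Derive a (name, type-code) list from a structured array's dtype.
    The real task would turn this into a SourceCatalog schema; the
    "L"/"D"/"string" codes here are only illustrative."""
    spec = []
    for name in arr.dtype.names:
        kind = arr.dtype[name].kind
        if kind in "iu":
            spec.append((name, "L"))        # integer column
        elif kind == "f":
            spec.append((name, "D"))        # floating-point column
        else:
            spec.append((name, "string"))   # everything else
    return spec


# For example, with a dtype like the one genfromtxt infers for the test file:
arr = numpy.array([(1, 10.0, -5.0)],
                  dtype=[("id", "i8"), ("ra", "f8"), ("dec", "f8")])
print(make_schema_spec(arr))  # [('id', 'L'), ('ra', 'D'), ('dec', 'D')]
```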
Force-pushed from bbd76ce to 38b65ee
This involves reconstituting the task that ingested the files. The only config option is the name of the persisted config to read in order to recreate the ingester task. The loader then borrows the indexer from the ingester task to grab the files it needs from the repo to load the reference catalog. The shards are only clipped if they land on the boundary of the circular aperture. This is a minimal override of the reference catalog loader. It may be possible to optimize by overriding other default implementations.
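A rough sketch of that clipping logic, using the pixel_id_list and is_on_boundary values that get_pixel_ids returns (assumed here to be parallel lists) and assuming each shard can be loaded as a numpy structured array with 'ra' and 'dec' columns in degrees; load_shard, load_circle, and angular_sep_deg are hypothetical helpers:

```python
import numpy


def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(numpy.radians, (ra1, dec1, ra2, dec2))
    s = numpy.sqrt(numpy.sin((dec2 - dec1) / 2.0) ** 2
                   + numpy.cos(dec1) * numpy.cos(dec2)
                   * numpy.sin((ra2 - ra1) / 2.0) ** 2)
    return numpy.degrees(2.0 * numpy.arcsin(s))


def load_circle(pixel_ids, is_on_boundary, load_shard, ctr_ra, ctr_dec, radius):
    """Keep interior shards whole; filter boundary shards by separation."""
    pieces = []
    for pid, on_boundary in zip(pixel_ids, is_on_boundary):
        shard = load_shard(pid)  # hypothetical: returns one trixel's rows
        if on_boundary:
            sep = angular_sep_deg(shard["ra"], shard["dec"], ctr_ra, ctr_dec)
            shard = shard[sep <= radius]
        pieces.append(shard)
    return numpy.concatenate(pieces)
```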
Force-pushed from 9e393e3 to c56e30d