
DM-1579: HTM indexed files. #42

Merged: SimonKrughoff merged 8 commits into master from tickets/DM-1579 on Apr 1, 2016

Conversation

SimonKrughoff (Contributor): No description provided.

default_config.colnames = ['id', 'ra', 'dec', 'a', 'a_err', 'b', 'b_err', 'is_phot', 'is_res',
                           'is_var', 'val1', 'val2', 'val3']
default_config.delimiter = '|'
#this also tests changing the delimiter
Member: Need a space between the # and the text.

SimonKrughoff (author): Thanks.

SimonKrughoff force-pushed the tickets/DM-1579 branch 2 times, most recently from d222b93 to 53c5c93 (March 24, 2016).
names = True
if self.config.colnames:
    names = self.config.colnames
arr = numpy.genfromtxt(filename, dtype=None, skip_header=self.config.header_lines,
Contributor: Why text input? Our catalogs are typically large, and in FITS.

SimonKrughoff (author): Text input is the less trivial of the two. I figured a FITS reader could be added once this is in. I could do it here if it's a no-go without it.

Contributor: The input mechanism isn't abstracted, so changing it isn't straightforward and would involve copying a lot of code.
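
For illustration, here is one shape the abstraction under discussion could take. This is a sketch, not this PR's API: the class and method names are hypothetical, and it only assumes the config fields already shown above (colnames, header_lines, delimiter). If the file-reading step were isolated behind a single method, a FITS-backed reader could override it without duplicating the rest of the ingestion logic.

```python
import numpy


class TextReaderTask(object):
    """Read a delimited text catalog into a numpy structured array."""

    def __init__(self, config):
        self.config = config

    def readFile(self, filename):
        # names=True tells genfromtxt to take column names from the header
        names = self.config.colnames if self.config.colnames else True
        return numpy.genfromtxt(filename, dtype=None, names=names,
                                skip_header=self.config.header_lines,
                                delimiter=self.config.delimiter)


class FitsReaderTask(TextReaderTask):
    """Same interface, but FITS-backed (sketched with astropy)."""

    def readFile(self, filename):
        from astropy.table import Table
        return Table.read(filename, format="fits").as_array()
```

With that split, the ingester only ever sees a structured array, whatever the on-disk format.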

PaulPrice (Contributor): I think several of these commits could be squashed together without losing anything.

PaulPrice (Contributor): Despite the filename choices, I think there's a lot in here that's not HTM-specific. I think it's worth making it more generic, so I can choose which indexing scheme I want to use. For example, you could make an abstract base class for the Indexer and allow the user to select (via Config) which Indexer subclass to use.
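
One possible shape for that suggestion, sketched with a plain-Python registry (the names are hypothetical; an lsst.pex.config registry could serve the same purpose):

```python
import abc

INDEXER_REGISTRY = {}  # maps a config-supplied name to an Indexer subclass


def register_indexer(name):
    def wrap(cls):
        INDEXER_REGISTRY[name] = cls
        return cls
    return wrap


class Indexer(abc.ABC):
    """Interface any sharding scheme must provide."""

    @abc.abstractmethod
    def index_points(self, ra_list, dec_list):
        """Return a shard (pixel/trixel) id for each (ra, dec) pair."""

    @abc.abstractmethod
    def get_pixel_ids(self, ctrCoord, radius):
        """Return (pixel_id_list, is_on_boundary) for a circular region."""


@register_indexer("HTM")
class HtmIndexer(Indexer):
    def __init__(self, depth=8):
        self.depth = depth

    def index_points(self, ra_list, dec_list):
        raise NotImplementedError("delegate to the HTM package here")

    def get_pixel_ids(self, ctrCoord, radius):
        raise NotImplementedError("delegate to the HTM package here")


def make_indexer(name, **kwargs):
    """Instantiate the indexer selected by name, e.g. from a Config field."""
    return INDEXER_REGISTRY[name](**kwargs)
```

A HEALPix or other scheme would then be one more registered subclass, with no change to the ingestion code.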


arr = numpy.array(zip(ident, ra, dec, a_mag, a_mag_err, b_mag, b_mag_err, is_photometric, is_resolved,
                  is_variable, extra_col1, extra_col2, extra_col3), dtype=dtype)
numpy.savetxt(out_path+"/ref.txt", arr, delimiter=",",
Contributor: I think this demonstrates my concern about having the ingestion read CSV rather than FITS. Converting FITS catalogs covering 3π steradians down to 21 mag into CSV is going to be very slow and consume a lot of disk space.

SimonKrughoff (author): I agree. I think it would be trivial to add, but I'd rather do that as another ticket so I can close this one. Please push back if that is not acceptable to you.

Contributor: I don't think the code is useful if it's not fast, because the reference catalog I usually use is very large (151 GB for the PS1 astrometry_net_data).

SimonKrughoff (author): I get your point, but I disagree that it's not useful if it's not fast. It will be very useful for me when I run simulations and have to make a new (small) reference catalog for every different run.

PaulPrice (Contributor): As I've thought about this some more, I've concluded that the ingestion code will have to be parallelised in order to ingest large (all-sky) catalogs. That doesn't mean the parallelisation needs to be done as part of this ticket, but I would suggest breaking the ingestion up now into pieces suitable for implementing the parallelisation later.

I see two operations that can be parallelised:

  1. Assigning each source an index: can be parallelised over input files, writing temporary files that join sources to indices.
  2. Writing each source to the output: can be parallelised over indices, writing the final product.
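
A minimal sketch of that two-stage split, with multiprocessing.Pool standing in for whatever parallelisation layer is eventually chosen; the helper bodies and filenames are placeholders, not this PR's code:

```python
import multiprocessing


def assign_indices(filename):
    """Stage 1, parallel over input files: read one file, compute an index
    for every source, and write a temporary file joining sources to indices.
    Returns the temporary filename."""
    # ... read sources, call indexer.index_points(ra, dec), dump to disk ...
    return filename + ".indexed.tmp"


def write_shard(index_id):
    """Stage 2, parallel over indices: collect every source carrying this
    index from the temporary files and write the final per-index product."""
    # ... scan the temporaries for rows with index_id, append to the shard ...
    return index_id


if __name__ == "__main__":
    input_files = ["cat_000.txt", "cat_001.txt"]  # hypothetical inputs
    with multiprocessing.Pool() as pool:
        temp_files = pool.map(assign_indices, input_files)
        pool.map(write_shard, sorted({131072, 131073}))  # example trixel ids
```

The key property is that stage 1 needs no global state and stage 2 only ever touches one output shard, so neither stage needs locking.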

@@ -36,4 +47,9 @@ def get_pixel_ids(self, ctrCoord, radius):
return pixel_id_list, is_on_boundary

def index_points(self, ra_list, dec_list):
"""!Generate trixel ids for each row in an input file

@param[in] coord_list List of coord objects to index
Contributor: Should list "ra_list" and "dec_list" as input parameters in the docstring.
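
For illustration, the corrected docstring might read (keeping the Doxygen style used in the diff above):

```python
def index_points(self, ra_list, dec_list):
    """!Generate trixel ids for each row in an input file

    @param[in] ra_list   List of RA values to index
    @param[in] dec_list  List of Dec values to index
    """
```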

Commit messages:

This implements an indexer that provides the methods necessary to
shard a catalog and then look up the correct shard later.  The package
I use currently has a problem if you call the lookup_id method without
first calling another function.  This hack will go away if the package
is fixed or if we get an LSST-maintained version.
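
The commit doesn't name the package or the function that must run first, so this is only a guess at the shape of the workaround: prime the underlying object once before lookup_id is ever used.

```python
class PrimedIndexer(object):
    """Wrap the third-party indexer and force its initialisation once."""

    def __init__(self, raw_indexer):
        self._raw = raw_indexer
        self._primed = False

    def lookup_id(self, ra, dec):
        if not self._primed:
            # Hypothetical priming call: the package misbehaves unless some
            # other method has run first, so run one with throwaway inputs.
            self._raw.intersect(0.0, 0.0, 1.0)
            self._primed = True
        return self._raw.lookup_id(ra, dec)
```
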
This will be used to read files from disk for indexing.

The catalog ingester will load files using a helper task
and put the contents in a set of SourceCatalogs, one
per HTM trixel.  Most of the config is to allow mapping of
columns in the file to the schema needed for reference catalogs.
This implements the task to take files and parse them into
SourceCatalogs.  Some of the steps implemented here are:
read the file into a numpy array (the whole file needs to fit
in memory), construct the SourceCatalog schema from the array
dtype, index the catalog, and fill the SourceCatalog.
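
A rough sketch of the dtype-to-schema step described here; the type mapping and field handling are assumptions about the afw.table API, not this PR's code:

```python
import lsst.afw.table as afwTable

# Assumed mapping from numpy type codes to afw.table field type strings.
DTYPE_TO_FIELD = {"f8": "D", "f4": "F", "i8": "L", "i4": "I"}


def schema_from_array(arr):
    """Build a schema whose fields mirror the structured array's dtype."""
    schema = afwTable.SourceTable.makeMinimalSchema()
    for name in arr.dtype.names:
        if name == "id":
            continue  # the minimal schema already defines an id field
        code = arr.dtype[name].str.lstrip("<>|=")  # e.g. '<f8' -> 'f8'
        schema.addField(name, type=DTYPE_TO_FIELD[code])
    return schema
```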

Command line tasks need a command runner.  Since this task
doesn't need to parse an `--id` argument, we can forgo that
step.  Essentially, this just calls the task run method and
gives a place to persist the config.
This involves reconstituting the task that ingested the files.

The only config is the name of the config to read to create the
ingester task.  The loader then borrows the indexer from the
ingester task to grab the files it needs from the repo to load the
reference catalog.  The shards are only clipped if they land on the
boundary of the circular aperture.

This is a minimal override of the reference catalog loader.  It may
be possible to optimize by overriding other default implementations.
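
An illustrative sketch of that boundary clipping (the helpers are hypothetical, and it assumes is_on_boundary parallels pixel_id_list, matching the get_pixel_ids return shown in the diff above):

```python
def load_region(ctrCoord, radius, indexer, load_shard, angular_separation):
    """Load every shard overlapping a circular aperture, filtering per
    source only for shards that straddle the aperture boundary."""
    pixel_ids, is_on_boundary = indexer.get_pixel_ids(ctrCoord, radius)
    sources = []
    for pix_id, on_boundary in zip(pixel_ids, is_on_boundary):
        shard = load_shard(pix_id)  # read the per-trixel catalog from the repo
        if on_boundary:
            # Only boundary shards pay the per-source distance test.
            shard = [s for s in shard
                     if angular_separation(s, ctrCoord) <= radius]
        sources.extend(shard)
    return sources
```
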
SimonKrughoff merged commit c56e30d into master on Apr 1, 2016.
ktlim deleted the tickets/DM-1579 branch on August 25, 2018.