New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DM-11021: Generalize prototype pipeline script #4
Conversation
Defect ingestion does not work presently because the defects live in a tar.gz file. Once this is fixed, the pipeline should work. Additional outstanding issues are noted with TODO in the code.
Gaia for astrometry, pan-starrs for photometry These changes effectively complete DM-10975
decam_process.py
Outdated
datafiles.extend(glob(os.path.join(args.dataset, 'raw', files))) | ||
allcalibdatafiles.extend(glob(os.path.join(args.dataset, 'calib', files))) | ||
|
||
# Ignore wtmaps and illumcors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might consider breaking up main()
into several functions, e.g., one for argument parsing, one for finding the files, one for the "run everything" bit... Stuff could be divided past that; for example, the code used to create datafiles
and allcalibdatafiles
could be split into a single function that takes a directory and an extension (or sequence thereof).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wholeheartedly agree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough! Mucking with main()
is a bit lower on my priority list since ap_verify
doesn't care about it, but you're both correct that this function is rather lengthy with multiple functionalities. I'll see what I can do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks like a very positive improvement.
decam_process.py
Outdated
@@ -15,324 +15,378 @@ | |||
import os | |||
import argparse | |||
import textwrap | |||
import tarfile |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Std lib imports should come before local lib imports: https://developer.lsst.io/v/DM-5063/coding/python_style_guide.html#imports-should-be-grouped-in-order-standard-lib-3rd-party-lib-local-lib
decam_process.py
Outdated
''' | ||
Process raw decam images with MasterCals from ingestion --> difference imaging | ||
|
||
TODO: Rename script and GitHub repo to ap_pipe (not decam_hits or decam_process). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Module-level docstring should come above the imports.
decam_process.py
Outdated
$ python hits_ingest.py ingestCalibs -f path/to/calibrations/somefiles*.fits.fz | ||
$ python hits_ingest.py processCcd | ||
$ python hits_ingest.py diffIm | ||
$ python decam_process.py -d dataset_root -o output_location -i "visit=12345, ccd=5" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will need to make sure that this line stays sync-ed with the one on L49 - or else get rid of it here...?
decam_process.py
Outdated
help="Location on disk of dataset_root, which contains subdirectories of raw data, calibs, etc.") | ||
parser.add_argument('-o', '--output', | ||
help="Location on disk where output repos will live.") | ||
# TODO: implement this argument |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a ticket for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is part of DM-11422 because using coadds as templates will require --id
-like metadata to be handled differently than when a visit is a template. I'll note this explicitly here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
decam_process.py
Outdated
ingest - Ingest raw DECam images into a repository with a | ||
corresponding registry. | ||
USAGE: | ||
$ python hits_ingest.py ingest -f path/to/rawimages/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did all of this extra functionality go away?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Essentially yes; a call to run the pipeline script via main()
will run all of the steps rather than just one at a time. This doesn't matter for ap_verify
(or any other external code using the pipeline) because it can call whichever function it wants whenever it wants. Is that OK?
decam_process.py
Outdated
tarfile.open(os.path.join('../../', defecttarball), 'r').extractall('defects') | ||
defectfiles = glob(os.path.join('defects', os.path.basename(defectfiles[0]), '*.fits')) | ||
defectargs.extend(defectfiles) | ||
defectargumentParser = IngestCalibsArgumentParser(name='ingestCalibs') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
defectArgumentParser
, etc...
decam_process.py
Outdated
print('If this is your first time ingesting these calibration files,') | ||
print(' make sure you only use image or wtmap MasterCals, not both.') | ||
print('~~~ !!! ~~~') | ||
flatBiasIngest(repo, calibrepo, calibdatafiles) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe not for this ticket but I think all of this code would benefit from going into a class, instead of having passing around all the different variables from one function to the next.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you be more specific about the benefits? You made a similar argument for ap_verify
and I'm still not sure it was the right thing to do; in particular, explicit function arguments make it much easier to see (and control) each task's data dependencies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy to discuss with both of you further, but "not for this ticket" indeed!
refObjLoader.filterMap = {"g": "g", | ||
"r": "r", | ||
"VR": "g"} | ||
** to run from bash, 'processccd_config.py' must exist and contain |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the equivalent of using the processccd_config.py
file for this script? Is it hardcoded somewhere, or not used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nevermind, I see it below. Might be good to enumerate the default values that are set in the config (e.g. using gaia for astrometric reference, etc.)
decam_process.py
Outdated
'calibrate.doPhotoCal=True'] | ||
ProcessCcdTask.parseAndRun(args=args, config=config) | ||
if os.path.isdir(os.path.join(processedrepo, '0'+visit)): | ||
print('ProcessCcd has already been run for visit {0}, skipping...'.format(visit)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, maybe add a return
statement here and move everything else out of the else
block.
decam_process.py
Outdated
|
||
Parameters | ||
---------- | ||
processedrepo: `str` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I first read this as proccess drepo
. I like camel-case or underscore (I don't care which but I think the style guide has us moving toward using underscore, e.g. processed_repo
.
Thanks for the thorough review! I think the only outstanding items are (1) adding tests and (2) possibly changing how defect file/tarball info is passed into doIngestCalibs. I would like to close this ticket so I can rename/reorganize the repo to finally be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All looks great. Just 2 minor comments.
decam_process.py
Outdated
@@ -207,158 +246,307 @@ def doIngest(repo, datafiles): | |||
config = IngestConfig() | |||
config.parse.retarget(DecamParseTask) | |||
# create an instance of the decam ingest task | |||
ingestTask = ingest.DecamIngestTask(config=config) | |||
IngestTask = ingest.DecamIngestTask(config=config) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason you changed this to capitalized? Capitalized names are for classes, whereas this is a class instance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably because the story of my life is being confused about what is a class and what is an instance... I'll change it back :)
decam_process.py
Outdated
argumentParser = IngestCalibsArgumentParser(name='ingestCalibs') | ||
config = IngestCalibsConfig() | ||
config.parse.retarget(ingestCalibs.DecamCalibsParseTask) | ||
ingestTask = IngestCalibsTask(config=config, name='ingestCalibs') | ||
CalibIngestTask = IngestCalibsTask(config=config, name='ingestCalibs') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above.
os.chdir(calib_repo) | ||
try: | ||
os.mkdir('defects') | ||
except OSError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the change over to try/except/else!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Never mind, it's always OSError
. 😲
You'll probably brand this as "not this ticket", but I don't think it's safe to assume a pre-existing directory is the only reason os.mkdir
raises -- it could also be lack of permissions, problems with the filesystem itself, etc...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point @kfindeisen, I will make the statement that prints with that error mention the fact that the exception is likely (but not necessarily) happening because the defect directory already exists.
The defects are now copied and extracted from the input dataset into the calibrepo, much in the same way ref_cats are extracted into repo. This is done during ingestion instead of during processing.
b0514c1
to
bc8f42c
Compare
No description provided.