
block-by-block IO - part 2 #488

Merged
merged 15 commits into from
Dec 13, 2020
Conversation

@yunjunz yunjunz commented Dec 13, 2020

Description of proposed changes

This PR, together with #478, addresses the memory issue for large datasets in the routine workflow smallbaselineApp.py, e.g., #199, #216, #473. Testing on a 128 GB ifgramStack.h5 file with unwrapPhase of shape (384, 8412, 5276) on my laptop (with 16 GB memory) shows a maximum memory usage of 4 GB.

  • block-by-block IO for the following scripts:

    • tropo_pyaps3.py
    • reference_date.py
    • timeseries2velocity.py
    • geocode.py (and objects/resample.py)
    • save_hdfeos5.py
  • memory-efficient view.py via readfile.read(x/ystep) with improved handling of large 3D matrix

  • timeseries2velocity: integrate the complex time function with bootstrap, so that one can use 1) bootstrap or 2) ordinary least squares with error propagation to estimate a complex time function and its uncertainty.

  • dem_error:

    • parallel support via dask and mintpy.compute.* options, in the same way as ifgram_inversion.py
    • do not write step_model for simplicity as it's now better supported in ts2vel.py
  • add mintpy.load.x/ystep option to support multilooking during the load_data step, to downsize the dataset

Reminders

  • Pass Codacy code review (green)
  • Pass Circle CI test (green)
  • Make sure that your code follows our style. Use the other functions/files as a basis.
  • If modifying functionality, describe changes to function behavior and arguments in a comment below the function declaration.
  • If adding new functionality, add a detailed description to the documentation and/or an example.

+ block-by-block IO for tropo_pyaps3 using writefile.layout_hdf5() and writefile.write_hdf5_block()

+ add run_or_skip() within calculate_delay_timeseries() for auto-skip
+ add --ram/--memory option for custom memory usage, with a default value of 2 GB and template reading support

+ import cluster.split_box2sub_boxes() for patch splitting
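The patch splitting could be sketched as a standalone function. This is a simplified reimplementation of the idea behind `cluster.split_box2sub_boxes()`; the real signature and behavior may differ:

```python
def split_box2sub_boxes(box, num_split, dimension='y'):
    """Split a bounding box (x0, y0, x1, y1) into contiguous sub-boxes
    along one dimension. A sketch, not the actual MintPy implementation."""
    x0, y0, x1, y1 = box
    length = (y1 - y0) if dimension == 'y' else (x1 - x0)
    step = -(-length // num_split)  # ceiling division
    sub_boxes = []
    for i in range(num_split):
        if dimension == 'y':
            r0, r1 = y0 + i * step, min(y0 + (i + 1) * step, y1)
            if r0 < r1:
                sub_boxes.append((x0, r0, x1, r1))
        else:
            c0, c1 = x0 + i * step, min(x0 + (i + 1) * step, x1)
            if c0 < c1:
                sub_boxes.append((c0, y0, c1, y1))
    return sub_boxes
```

Each sub-box can then be read, processed, and written independently, which caps peak memory at one patch.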

+ block-by-block processing using writefile.layout_hdf5/write_hdf5_block()
+ fully integrate the bootstrap method with complex time func support
+ add --ram/--memory option for max memory usage setup
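A `--ram`/`--memory` option of this kind could be defined with argparse roughly as follows. This is a sketch: the actual option lives in `utils.arg_group.py`, and the `dest` name and default used here are assumptions:

```python
import argparse

def add_memory_argument(parser):
    # Sketch of a max-memory option in GB; two option strings map to one dest.
    parser.add_argument('--ram', '--memory', dest='maxMemory', type=float,
                        default=2.0,
                        help='Max memory to allocate in GB (default: %(default)s).')
    return parser

parser = argparse.ArgumentParser()
add_memory_argument(parser)
inps = parser.parse_args(['--ram', '4'])  # inps.maxMemory == 4.0
```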

+ split run_timeseries2time_func() into:
   - read_inps2model() to get model dict and print key model info
   - layout_hdf5() to create HDF5 file with time func structure
   - write_hdf5_block() to write the block of time func
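The layout-then-write-blocks pattern can be sketched with plain h5py. MintPy's `writefile.layout_hdf5()` and `writefile.write_hdf5_block()` wrap similar logic; the signatures and the block tuple convention below are illustrative assumptions:

```python
import os
import tempfile

import h5py
import numpy as np

def layout_hdf5(fname, dset_name, shape, dtype=np.float32):
    # Pre-allocate the full-size dataset so blocks can be written incrementally.
    with h5py.File(fname, 'w') as f:
        f.create_dataset(dset_name, shape=shape, dtype=dtype)

def write_hdf5_block(fname, data, dset_name, block):
    # block = (y0, y1, x0, x1) indices into the full-size dataset.
    y0, y1, x0, x1 = block
    with h5py.File(fname, 'a') as f:
        f[dset_name][..., y0:y1, x0:x1] = data

# usage: fill a (num_date, length, width) stack one row-block at a time
fname = os.path.join(tempfile.mkdtemp(), 'ts.h5')
num_date, length, width = 3, 100, 40
layout_hdf5(fname, 'timeseries', (num_date, length, width))
for y0 in range(0, length, 30):
    y1 = min(y0 + 30, length)
    data = np.full((num_date, y1 - y0, width), y0, dtype=np.float32)
    write_hdf5_block(fname, data, 'timeseries', (y0, y1, 0, width))
```

Only one block of the stack is ever held in memory; the HDF5 file on disk accumulates the full result.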

+ objects.cluster.split_box2sub_boxes: refactor
+ dem_error: do not write step_model to file/disk because:
1. step function estimation is now supported via timeseries2velocity.h5, which has more powerful functionality
2. the step model HDF5 file from dem_error.py is different from the one from ts2vel.h5, and the latter is preferred for its simplicity.

+ drop support for common operations on the timeseriesStepModel.h5 file (which sometimes has only one date in its time dimension) from the following scripts:
   - geocode.py
   - mask.py
   - multilook.py
   - subset.py

+ dem_error: replace split2boxes with cluster.split_box2sub_boxes()

test_sbApp: also plot velocity alone for snap and aria

+ utils.readfile:
   - read_hdf5_file/binary(): fix the size discrepancy when x/ystep > 1 to be consistent with the multilooked output size
   - read_hdf5_file(): use for loop when ystep * xstep > 1 for 3D dataset to save memory
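The memory-saving for loop for 3D datasets can be sketched as follows, assuming a (num_date, length, width) dataset; the floor-division output size is chosen to match a multilooked file, per the fix above:

```python
import numpy as np

def read_3d_with_step(dset, ystep=1, xstep=1):
    """Read a 3D (num_date, length, width) dataset one date at a time with
    y/x sampling steps, keeping only one 2D slice in memory at once.
    A sketch of the idea, not the actual readfile.read_hdf5_file() code."""
    num_date, length, width = dset.shape
    out_len = length // ystep
    out_wid = width // xstep
    data = np.empty((num_date, out_len, out_wid), dtype=dset.dtype)
    for i in range(num_date):
        # nearest-neighbor sampling of one slice; works for h5py datasets too
        data[i] = dset[i][::ystep, ::xstep][:out_len, :out_wid]
    return data
```

Slicing slice-by-slice avoids materializing the full 3D array before downsampling it.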

+ multilook.multilook_dataset(): add method arg to support/switch between average and nearest
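The average vs. nearest switch could look like this minimal numpy sketch (not the actual MintPy implementation, which handles more cases):

```python
import numpy as np

def multilook_sketch(data, lks_y=1, lks_x=1, method='average'):
    """Multilook a 2D array by lks_y x lks_x windows; sketch of the switch."""
    if lks_y == 1 and lks_x == 1:
        return data  # nothing to do
    out_len = data.shape[0] // lks_y
    out_wid = data.shape[1] // lks_x
    if method == 'nearest':
        # one sample per window: no smoothing, cheap, preserves raw values
        return data[:out_len * lks_y:lks_y, :out_wid * lks_x:lks_x]
    # average over each lks_y x lks_x window
    crop = data[:out_len * lks_y, :out_wid * lks_x]
    return crop.reshape(out_len, lks_y, out_wid, lks_x).mean(axis=(1, 3))
```

Nearest sampling is what makes the fast `readfile.read(x/ystep)` path non-distorting for visualization, while averaging matches the multilooked products.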

+ view: expand multilook_num to all multiple-subplot scenarios to save memory,
  because readfile.read(x/ystep) now won't distort data (nearest sampling instead of the previous averaging)
For executable scripts, use

```
if __name__ == '__main__':
    main(sys.argv[1:])
```

instead of

```
if __name__ == '__main__':
    main()
```

because the latter returns an error in interactive Python when cmd_line_parse() is not called in the main body of main(), as is the case in tsview.py; the former is therefore more generic and useful.

+ defaults/auto_path: use watermask.msk for ARIA
+ use pyresample as the default software for geocoding datasets produced by isce (lut in radar-coord) and gamma (lut in geo-coord)

+ comparison of geocoding results between pyresample and scipy on the Wells EQ dataset gives identical results on 99.1% of all valid pixels; thus, change the default geocoding software from scipy to pyresample for:
1. consistency with the config for other processors
2. flexibility, i.e., customized SNWE and lat/lon step
3. efficiency: pyresample supports 3D matrices and is thus more efficient.

+ consistent internal definition

+ clean up the following concepts/variables in resample objects:
  - always refer to coordinates at the pixel center for interpolation
  - SNWE indicates the bounding box at the pixel outer boundary, consistent with the Y/X_FIRST definition, unless noted in the adjacent comments.
+ rename mintpy.compute.memorySize to mintpy.compute.maxMemory for a more intuitive name
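The SNWE outer-boundary convention (consistent with Y/X_FIRST) can be illustrated with a small helper, assuming the standard ROI_PAC-style metadata keys used by MintPy:

```python
def snwe_from_meta(meta):
    """Compute the (S, N, W, E) outer-boundary bounding box from metadata.

    Y_FIRST/X_FIRST give the outer corner of the top-left pixel (Y_STEP < 0
    for north-up grids), so no half-pixel shift is needed here; pixel-center
    coordinates would be offset by half a step. A sketch, not MintPy code.
    """
    length, width = int(meta['LENGTH']), int(meta['WIDTH'])
    y0, x0 = float(meta['Y_FIRST']), float(meta['X_FIRST'])
    y_step, x_step = float(meta['Y_STEP']), float(meta['X_STEP'])
    N, W = y0, x0
    S = y0 + y_step * length   # southern outer edge
    E = x0 + x_step * width    # eastern outer edge
    return S, N, W, E
```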

+ change default max memory from 2 GB to 4 GB

+ dem_error: add dask parallel option in prepare for dask support

+ sbApp(_auto).cfg: merge mintpy.geocode.latStep and mintpy.geocode.lonStep into one as mintpy.geocode.laloStep for consistency with mintpy.objects.resample object
* geocode.py
+ add --ram option from utils.arg_group.py
+ merge -y/x into --lalo-step for consistency with resample obj
+ more explicit checking / error message for --lalo-step option, since it's only customizable if radar2geo AND lut in radar-coord
+ block-by-block IO for both HDF5 and binary file, the latter is bbb in read only

* objects/resample.py
+ move all configurations into __init__() to simplify the run_resample()
+ consistent member variables across all scenarios (radar2geo/geo2radar, geo/radar-coord lookup table, scipy/pyresample), including:
   - lalo_step
   - SNWE
   - length/width
   - src/dest_box_list
   - src/dest_def_list (for pyresample)
   - src/dest_pts and interp_mask (for scipy)

+ add get_num_box()

+ prepare_geometry_definition_radar():
   - add block-by-block geometry preparation for radar2geo
   - add custom SNWE support for geo2radar
+ prepare_geometry_definition_geo()
   - add custom SNWE support for radar2geo
+ use max_memory to calc block size in temp_avg/pha_closure/ifg_inv
+ round block step to the nearest 10
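Deriving the block row step from a memory cap, with rounding to the nearest 10, might look like this sketch; the `num_copies` factor (how many full-width float copies coexist in memory) is an assumed approximation, not a value from the source:

```python
def get_block_step(max_memory, num_date, width, num_copies=3, dtype_size=4):
    """Estimate rows per processing block from a memory cap in GB.

    A sketch of the idea only: bytes per row of the (num_date, length, width)
    stack times an assumed number of in-flight copies must fit in max_memory.
    """
    bytes_per_row = num_date * width * dtype_size * num_copies
    step = int(max_memory * 1024**3 / bytes_per_row)
    step = max(10, round(step / 10) * 10)   # round to the nearest 10 rows
    return step
```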

+ add used time info for load_data and plot_sbApp

+ Update mkdocs.yml
+ change default mintpy.compute.cluster value from no to none, to be consistent with utils.arg_group.add_parallel_argument()

+ dem_error: add parallel computing support via dask
+ docs/hdfeos5.md: add a metadata section with require / recommend / auto-grab sub-sections, to facilitate manual specification

+ save_hdfeos5: change the following metadata
   - remove "frame" from UNAVCO definition completely and use first/last_frame only, for simplicity
   - uncomment processing_software
   - hardwire processing_type = LOS_TIMESERIES. This can be changed in the future if velocity/interferogram capability is added.
   - hardwire post_processing_method = MintPy. This can be changed in the future if the script supports products from other software.

+ save_hdfeos5: add date-by-date IO to save memory / handle big data
+ move the following sub-functions into a new sub-module utils.attribute:
   - utils.utils0.subset_attribute() --> update_attribute4subset()
   - multilook.multilook_attribute() --> update_attribute4multilook()
   - geocode.metadata_radar2geo() --> update_attribute4radar2geo()
   - geocode.metadata_geo2radar() --> update_attribute4geo2radar()

+ update docs/api/module_hierarchy.md for utils.attribute/arg_group
+ add `mintpy.load.x/ystep` with default value of 1 for smallbaselineApp.py

+ multilook.multilook_data():
   - add default lks_y/x value of 1
   - return directly if no multilook number is specified

+ load_data.py
   - use iDict to replace inpsDict for simplicity
   - read mintpy.load.x/ystep and pass them to ifgramStackDict/geometryDict object

+ objects/stackDict.py: support x/ystep in all write2hdf5()

+ prep_aria: support multilook via mintpy.load.x/ystep
@yunjunz yunjunz merged commit 8cda913 into insarlab:main Dec 13, 2020
@yunjunz yunjunz deleted the big_data branch December 13, 2020 20:10
@yunjunz yunjunz added this to the Big Data milestone Dec 13, 2020