
block-by-block IO - part 2 #488

Merged
merged 15 commits into from
Dec 13, 2020
Conversation

@yunjunz yunjunz commented Dec 13, 2020

Description of proposed changes

This PR, together with #478, addresses the memory issue for large datasets in the routine workflow smallbaselineApp.py, e.g., #199, #216, #473. Testing on a 128 GB ifgramStack.h5 file with unwrapPhase of shape (384, 8412, 5276) on my laptop (with 16 GB memory) shows a maximum memory usage of 4 GB.

  • block-by-block IO for the following scripts:

    • tropo_pyaps3.py
    • reference_date.py
    • timeseries2velocity.py
    • geocode.py (and objects/resample.py)
    • save_hdfeos5.py
  • memory-efficient view.py via readfile.read(x/ystep) with improved handling of large 3D matrix

  • timeseries2velocity: integrate the complex time function with bootstrap, so that one can use 1) bootstrap or 2) ordinary least squares with error propagation to estimate a complex time function and its uncertainty.

  • dem_error:

    • parallel support via dask and mintpy.compute.* options, in the same way as ifgram_inversion.py
    • do not write step_model for simplicity as it's now better supported in ts2vel.py
  • add mintpy.load.x/ystep option to support multilooking during the load_data step, to downsize the dataset

Reminders

  • Pass Codacy code review (green)
  • Pass Circle CI test (green)
  • Make sure that your code follows our style. Use the other functions/files as a basis.
  • If modifying functionality, describe changes to function behavior and arguments in a comment below the function declaration.
  • If adding new functionality, add a detailed description to the documentation and/or an example.

+ block-by-block IO for tropo_pyaps3 using writefile.layout_hdf5() and writefile.write_hdf5_block()

+ add run_or_skip() within calculate_delay_timeseries() for auto-skip
+ add --ram/--memory option for custom memory usage, with a default value of 2 GB and template reading support

+ import cluster.split_box2sub_boxes() for patch splitting
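The patch splitting could be sketched as a standalone function. This is a simplified reimplementation of the idea behind `cluster.split_box2sub_boxes()`; the real signature and behavior may differ:

```python
def split_box2sub_boxes(box, num_split, dimension='y'):
    """Split a bounding box (x0, y0, x1, y1) into contiguous sub-boxes
    along one dimension. A sketch, not the actual MintPy implementation."""
    x0, y0, x1, y1 = box
    length = (y1 - y0) if dimension == 'y' else (x1 - x0)
    step = -(-length // num_split)  # ceiling division
    sub_boxes = []
    for i in range(num_split):
        if dimension == 'y':
            r0, r1 = y0 + i * step, min(y0 + (i + 1) * step, y1)
            if r0 < r1:
                sub_boxes.append((x0, r0, x1, r1))
        else:
            c0, c1 = x0 + i * step, min(x0 + (i + 1) * step, x1)
            if c0 < c1:
                sub_boxes.append((c0, y0, c1, y1))
    return sub_boxes
```

Each sub-box can then be read, processed, and written independently, which caps peak memory at one patch.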

+ block-by-block processing using writefile.layout_hdf5/write_hdf5_block()
+ fully integrate the bootstrap method with complex time func support
+ add --ram/--memory option for max memory usage setup
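A `--ram`/`--memory` option of this kind could be defined with argparse roughly as follows. This is a sketch: the actual option lives in `utils.arg_group.py`, and the `dest` name and default used here are assumptions:

```python
import argparse

def add_memory_argument(parser):
    # Sketch of a max-memory option in GB; two option strings map to one dest.
    parser.add_argument('--ram', '--memory', dest='maxMemory', type=float,
                        default=2.0,
                        help='Max memory to allocate in GB (default: %(default)s).')
    return parser

parser = argparse.ArgumentParser()
add_memory_argument(parser)
inps = parser.parse_args(['--ram', '4'])  # inps.maxMemory == 4.0
```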

+ split run_timeseries2time_func() into:
   - read_inps2model() to get model dict and print key model info
   - layout_hdf5() to create HDF5 file with time func structure
   - write_hdf5_block() to write the block of time func
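The layout-then-write-blocks pattern can be sketched with plain h5py. MintPy's `writefile.layout_hdf5()` and `writefile.write_hdf5_block()` wrap similar logic; the signatures and the block tuple convention below are illustrative assumptions:

```python
import os
import tempfile

import h5py
import numpy as np

def layout_hdf5(fname, dset_name, shape, dtype=np.float32):
    # Pre-allocate the full-size dataset so blocks can be written incrementally.
    with h5py.File(fname, 'w') as f:
        f.create_dataset(dset_name, shape=shape, dtype=dtype)

def write_hdf5_block(fname, data, dset_name, block):
    # block = (y0, y1, x0, x1) indices into the full-size dataset.
    y0, y1, x0, x1 = block
    with h5py.File(fname, 'a') as f:
        f[dset_name][..., y0:y1, x0:x1] = data

# usage: fill a (num_date, length, width) stack one row-block at a time
fname = os.path.join(tempfile.mkdtemp(), 'ts.h5')
num_date, length, width = 3, 100, 40
layout_hdf5(fname, 'timeseries', (num_date, length, width))
for y0 in range(0, length, 30):
    y1 = min(y0 + 30, length)
    data = np.full((num_date, y1 - y0, width), y0, dtype=np.float32)
    write_hdf5_block(fname, data, 'timeseries', (y0, y1, 0, width))
```

Only one block of the stack is ever held in memory; the HDF5 file on disk accumulates the full result.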

+ objects.cluster.split_box2sub_boxes: refactor
+ dem_error: do not write step_model to file/disk because:
1. step function estimation is now supported via timeseries2velocity.h5, which has more powerful functionality
2. the step model HDF5 file from dem_error.py is different from the one from ts2vel.h5, and the latter is preferred for its simplicity.

+ drop support for common operations on the timeseriesStepModel.h5 file (which sometimes has only one date in its time dimension) from the following scripts:
   - geocode.py
   - mask.py
   - multilook.py
   - subset.py

+ dem_error: replace split2boxes with cluster.split_box2sub_boxes()

test_sbApp: also plot velocity alone for snap and aria

+ utils.readfile:
   - read_hdf5_file/binary(): fix the size discrepancy when x/ystep > 1 to be consistent with the multilooked output size
   - read_hdf5_file(): use for loop when ystep * xstep > 1 for 3D dataset to save memory
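The memory-saving for loop for 3D datasets can be sketched as follows, assuming a (num_date, length, width) dataset; the floor-division output size is chosen to match a multilooked file, per the fix above:

```python
import numpy as np

def read_3d_with_step(dset, ystep=1, xstep=1):
    """Read a 3D (num_date, length, width) dataset one date at a time with
    y/x sampling steps, keeping only one 2D slice in memory at once.
    A sketch of the idea, not the actual readfile.read_hdf5_file() code."""
    num_date, length, width = dset.shape
    out_len = length // ystep
    out_wid = width // xstep
    data = np.empty((num_date, out_len, out_wid), dtype=dset.dtype)
    for i in range(num_date):
        # nearest-neighbor sampling of one slice; works for h5py datasets too
        data[i] = dset[i][::ystep, ::xstep][:out_len, :out_wid]
    return data
```

Slicing slice-by-slice avoids materializing the full 3D array before downsampling it.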

+ multilook.multilook_dataset(): add method arg to support/switch between average and nearest
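The average vs. nearest switch could look like this minimal numpy sketch (not the actual MintPy implementation, which handles more cases):

```python
import numpy as np

def multilook_sketch(data, lks_y=1, lks_x=1, method='average'):
    """Multilook a 2D array by lks_y x lks_x windows; sketch of the switch."""
    if lks_y == 1 and lks_x == 1:
        return data  # nothing to do
    out_len = data.shape[0] // lks_y
    out_wid = data.shape[1] // lks_x
    if method == 'nearest':
        # one sample per window: no smoothing, cheap, preserves raw values
        return data[:out_len * lks_y:lks_y, :out_wid * lks_x:lks_x]
    # average over each lks_y x lks_x window
    crop = data[:out_len * lks_y, :out_wid * lks_x]
    return crop.reshape(out_len, lks_y, out_wid, lks_x).mean(axis=(1, 3))
```

Nearest sampling is what makes the fast `readfile.read(x/ystep)` path non-distorting for visualization, while averaging matches the multilooked products.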

+ view: expand multilook_num to all multiple-subplot scenarios to save memory,
  because readfile.read(x/ystep) now won't distort data (nearest sampling instead of the previous averaging)
For executable scripts, use

```
if __name__ == '__main__':
    main(sys.argv[1:])
```

instead of

```
if __name__ == '__main__':
    main()
```

because the latter returns an error in interactive Python when cmd_line_parse() is not called in the main body of main(), as is the case in tsview.py; the former is therefore more generic and useful.

+ defaults/auto_path: use watermask.msk for ARIA
+ use pyresample as the default software for geocoding datasets produced by isce (lut in radar-coord) and gamma (lut in geo-coord)

+ comparison of geocoding results between pyresample and scipy on the Wells EQ dataset gives identical results on 99.1% of all valid pixels; thus, change the default geocoding software from scipy to pyresample for:
1. consistency with the config for other processors
2. flexibility, i.e., customized SNWE and lat/lon step
3. efficiency: pyresample supports 3D matrices and is thus more efficient.

+ consistent internal definition

+ clean up the following concepts/variables in resample objects:
  - always refer to coordinates at the pixel center for interpolation
  - SNWE indicates the bounding box at the pixel outer boundary, consistent with the Y/X_FIRST definition, unless noted in the adjacent comments.
+ rename mintpy.compute.memorySize to mintpy.compute.maxMemory for a more intuitive name
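The SNWE outer-boundary convention (consistent with Y/X_FIRST) can be illustrated with a small helper, assuming the standard ROI_PAC-style metadata keys used by MintPy:

```python
def snwe_from_meta(meta):
    """Compute the (S, N, W, E) outer-boundary bounding box from metadata.

    Y_FIRST/X_FIRST give the outer corner of the top-left pixel (Y_STEP < 0
    for north-up grids), so no half-pixel shift is needed here; pixel-center
    coordinates would be offset by half a step. A sketch, not MintPy code.
    """
    length, width = int(meta['LENGTH']), int(meta['WIDTH'])
    y0, x0 = float(meta['Y_FIRST']), float(meta['X_FIRST'])
    y_step, x_step = float(meta['Y_STEP']), float(meta['X_STEP'])
    N, W = y0, x0
    S = y0 + y_step * length   # southern outer edge
    E = x0 + x_step * width    # eastern outer edge
    return S, N, W, E
```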

+ change default max memory from 2 GB to 4 GB

+ dem_error: add dask parallel option in prepare for dask support

+ sbApp(_auto).cfg: merge mintpy.geocode.latStep and mintpy.geocode.lonStep into one as mintpy.geocode.laloStep for consistency with mintpy.objects.resample object
* geocode.py
+ add --ram option from utils.arg_group.py
+ merge -y/x into --lalo-step for consistency with resample obj
+ more explicit checking / error message for --lalo-step option, since it's only customizable if radar2geo AND lut in radar-coord
+ block-by-block IO for both HDF5 and binary file, the latter is bbb in read only

* objects/resample.py
+ move all configurations into __init__() to simplify the run_resample()
+ consistent member variables across all scenarios (radar2geo/geo2radar, geo/radar-coord lookup table, scipy/pyresample), including:
   - lalo_step
   - SNWE
   - length/width
   - src/dest_box_list
   - src/dest_def_list (for pyresample)
   - src/dest_pts and interp_mask (for scipy)

+ add get_num_box()

+ prepare_geometry_definition_radar():
   - add block-by-block geometry preparation for radar2geo
   - add custom SNWE support for geo2radar
+ prepare_geometry_definition_geo()
   - add custom SNWE support for radar2geo
+ use max_memory to calc block size in temp_avg/pha_closure/ifg_inv
+ round block step to the nearest 10
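Deriving the block row step from a memory cap, with rounding to the nearest 10, might look like this sketch; the `num_copies` factor (how many full-width float copies coexist in memory) is an assumed approximation, not a value from the source:

```python
def get_block_step(max_memory, num_date, width, num_copies=3, dtype_size=4):
    """Estimate rows per processing block from a memory cap in GB.

    A sketch of the idea only: bytes per row of the (num_date, length, width)
    stack times an assumed number of in-flight copies must fit in max_memory.
    """
    bytes_per_row = num_date * width * dtype_size * num_copies
    step = int(max_memory * 1024**3 / bytes_per_row)
    step = max(10, round(step / 10) * 10)   # round to the nearest 10 rows
    return step
```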

+ add used time info for load_data and plot_sbApp

+ Update mkdocs.yml
+ change default mintpy.compute.cluster value from no to none, to be consistent with utils.arg_group.add_parallel_argument()

+ dem_error: add parallel computing support via dask
+ docs/hdfeos5.md: add a metadata section with require / recommend / auto-grab sub-sections, to facilitate manual specification

+ save_hdfeos5: change the following metadata
   - remove "frame" from UNAVCO definition completely and use first/last_frame only, for simplicity
   - uncomment processing_software
   - hardwire processing_type = LOS_TIMESERIES. This can be changed in the future if velocity/interferogram capability is added.
   - hardwire post_processing_method = MintPy. This can be changed in the future if the script supports products from other software.

+ save_hdfeos5: add date-by-date IO to save memory / handle big data
+ move the following sub-functions into a new sub-module utils.attribute:
   - utils.utils0.subset_attribute() --> update_attribute4subset()
   - multilook.multilook_attribute() --> update_attribute4multilook()
   - geocode.metadata_radar2geo() --> update_attribute4radar2geo()
   - geocode.metadata_geo2radar() --> update_attribute4geo2radar()

+ update docs/api/module_hierarchy.md for utils.attribute/arg_group
+ add `mintpy.load.x/ystep` with default value of 1 for smallbaselineApp.py

+ multilook.multilook_data():
   - add default lks_y/x value of 1
   - return directly if no multilook number is specified

+ load_data.py
   - use iDict to replace inpsDict for simplicity
   - read mintpy.load.x/ystep and pass them to ifgramStackDict/geometryDict object

+ objects/stackDict.py: support x/ystep in all write2hdf5()

+ prep_aria: support multilook via mintpy.load.x/ystep
@yunjunz yunjunz merged commit 8cda913 into insarlab:main Dec 13, 2020
@yunjunz yunjunz deleted the big_data branch December 13, 2020 20:10
@yunjunz yunjunz added this to the Big Data milestone Dec 13, 2020