750 save csv v2 #941

bhagemeier · 2022-03-25T13:14:39Z

Description

This feature adds saving of data in CSV format. To use MPI-IO and write in parallel, the written CSV is normalized to fixed-width fields. This is not the same as np.savetxt(), but may serve as a starting point for moving in that direction. All possible splits for 2D tensors are supported: None, 0, and 1. Like np.savetxt(), save_csv stores at most 2D data.
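
To illustrate why fixed-width fields help with parallel writing, here is a minimal sketch in plain Python (no MPI). The helper names `format_row` and `row_offset` are hypothetical and not part of Heat. It assumes every value fits into the chosen width, so each row has the same encoded length and the byte offset of any row can be computed without knowing the other rows' contents.

```python
def format_row(values, width, decimals, sep=","):
    """Format one row so that every field occupies exactly `width` characters."""
    fields = [f"{v:{width}.{decimals}f}" for v in values]
    return (sep.join(fields) + "\n").encode("utf-8")


def row_offset(row_index, n_cols, width, sep=","):
    """Byte offset of a row when all rows have the same encoded length."""
    row_bytes = n_cols * width + (n_cols - 1) * len(sep) + 1  # +1 for the newline
    return row_index * row_bytes


# Illustrative data only; width and decimals are assumptions for this sketch.
rows = [[1.5, -22.25, 3.0], [400.125, 5.0, -6.5]]
width, decimals = 8, 3
encoded = [format_row(r, width, decimals) for r in rows]
```

Because the offset depends only on the row index, a process holding rows i..j could write them at `row_offset(i, ...)` independently of the others.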

Issue/s resolved: #750

Changes proposed:

add ht.save_csv()

Type of change

New feature (non-breaking change which adds functionality)

Memory requirements

Requires a per-process buffer that holds the row currently being written. Except for very wide rows, this should be negligible. A 3595x3595 tensor taken from a real example worked well and showed only a small increase in the memory footprint.

Performance

split=None: no gain
split=0 reduces writing time substantially
split=1

Due Diligence

All supported (None, 0, 1) split configurations tested
Multiple dtypes tested in relevant functions
Documentation updated (if needed)
Updated changelog.md under the title "Pending Additions"

Does this change modify the behaviour of other functions? If so, which?

yes: the generic save function can now save CSVs

skip ci


ghost · 2022-03-28T06:17:10Z

CodeSee Review Map:

bhagemeier · 2022-03-28T09:04:54Z

Just found a bug with split=1: it does not work for nprocs > shape[1]. This needs to be fixed before merging.

With split 1, having more processes than chunks did not work. Rather than checking whether we are the last (overall) rank, we now check whether we hold the last chunk of data, and we write nothing if we have no data. Holding the last chunk determines whether a newline or a separator is appended to the end of our buffer.
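
The last-chunk check can be sketched in plain Python as follows. This is an illustration only, not the code in heat/core/io.py: the helpers `chunk_bounds` and `row_piece` are hypothetical, ranks are simulated in a loop, and the even block split (remainder given to the lowest ranks) is an assumption.

```python
def chunk_bounds(n_cols, n_ranks, rank):
    """Column range [start, stop) owned by `rank` under an even block split."""
    base, extra = divmod(n_cols, n_ranks)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def row_piece(row, n_ranks, rank, sep=","):
    """Text that `rank` contributes to one output row, or None if it owns no data."""
    start, stop = chunk_bounds(len(row), n_ranks, rank)
    if start == stop:
        return None  # ranks without data write nothing
    has_last_chunk = stop == len(row)
    # Only the holder of the last chunk ends the line; everyone else adds a separator.
    terminator = "\n" if has_last_chunk else sep
    return sep.join(str(v) for v in row[start:stop]) + terminator


# Five simulated ranks, three columns: ranks 3 and 4 own nothing.
row = [10, 20, 30]
pieces = [row_piece(row, n_ranks=5, rank=r) for r in range(5)]
line = "".join(p for p in pieces if p is not None)
```

With a check on the last overall rank instead, rank 4 would be expected to add the newline even though it holds no data, which matches the failure described for nprocs > shape[1].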

codecov · 2022-03-28T13:59:42Z

Codecov Report

Merging #941 (20b2023) into main (3bcab68) will decrease coverage by 4.41%.
The diff coverage is 92.10%.

@@            Coverage Diff             @@
##             main     #941      +/-   ##
==========================================
- Coverage   95.50%   91.08%   -4.42%     
==========================================
  Files          64       64              
  Lines        9801     9875      +74     
==========================================
- Hits         9360     8995     -365     
- Misses        441      880     +439

| Flag | Coverage Δ | |
|---|---|---|
| gpu | `?` | |
| unit | `91.08% <92.10%> (+<0.01%)` | ⬆️ |

| Impacted Files | Coverage Δ | |
|---|---|---|
| heat/core/io.py | `89.00% <92.10%> (+0.54%)` | ⬆️ |
| heat/optim/dp_optimizer.py | `24.19% <0.00%> (-71.89%)` | ⬇️ |
| heat/core/stride_tricks.py | `81.53% <0.00%> (-12.31%)` | ⬇️ |
| heat/core/devices.py | `86.66% <0.00%> (-11.12%)` | ⬇️ |
| heat/nn/data_parallel.py | `84.13% <0.00%> (-10.35%)` | ⬇️ |
| heat/cluster/spectral.py | `85.71% <0.00%> (-8.58%)` | ⬇️ |
| heat/core/communication.py | `89.90% <0.00%> (-6.86%)` | ⬇️ |
| heat/core/tests/test_suites/basic_test.py | `91.26% <0.00%> (-4.86%)` | ⬇️ |
| heat/core/linalg/qr.py | `97.30% <0.00%> (-2.70%)` | ⬇️ |
| heat/utils/data/partial_dataset.py | `92.30% <0.00%> (-2.06%)` | ⬇️ |
| ... and 8 more | | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 3bcab68...20b2023.

bhagemeier · 2022-03-30T08:00:47Z

After thorough testing, which uncovered two more bugs that are now resolved, I am sufficiently confident to merge this PR.

bhagemeier added 12 commits March 23, 2022 11:48

An initial implementation for save_csv

7fc2e0e

it has problems when running with mpirun, so will try a different approach on a different branch.

save_csv docstring

507d525

save_csv function

a5d82c2

This is an implementation of save_csv that covers split None and 0. resolves #750

Test whether test.csv exists before truncation

e4e8fa4

Use more appropriate MPI.File.Write_at

0fcd797

Using the collective MPI.File.Write_at_all led to problems with not perfectly balanced chunks. The ordinary Write_at is much better for this purpose. On the way, I also removed print statements and comments.

save_csv: fix float64 problem

38ac575

floating is the supertype of float32 and float64, not float, which is just an alias. Added a corresponding test.

Adjust format to match field width

c6ffcee

Truncate file before writing

d6d1d49

The way we use MPI-IO does not reset existing contents of files and therefore may leave garbage at the end if the data to be written has a shorter representation than the existing file. Therefore, we reset by default, but allow to omit this step.
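
The effect can be reproduced with ordinary file I/O. The following sketch does not use MPI-IO; it swaps in Python's built-in `open` in `r+b` mode, which likewise writes into an existing file without clearing it. The function names and the `truncate` flag here are illustrative assumptions.

```python
import os
import tempfile


def write_at_start(path, data, truncate=True):
    """Write `data` at offset 0 of an existing file, optionally clearing it first."""
    with open(path, "r+b") as f:
        if truncate:
            f.truncate(0)  # drop the old contents before writing
        f.seek(0)
        f.write(data)


def demo(truncate):
    """Overwrite a longer CSV with a shorter one and return the resulting bytes."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    os.close(fd)
    try:
        with open(path, "wb") as f:
            f.write(b"1111,2222,3333\n")
        write_at_start(path, b"1,2\n", truncate=truncate)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)


stale = demo(truncate=False)  # old bytes survive past the new data
clean = demo(truncate=True)   # only the new data remains
```

Without truncation, the tail of the previous file remains after the newly written bytes, which is the garbage described above.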

Remove print statements

eda9825

Add changelog entry for save_csv

ed54d31

Allow for 1D vectors

c5727ab

Add CSV to generic save function

d42f99c

bhagemeier requested review from Markus-Goetz and ClaudiaComito March 25, 2022 14:21

bhagemeier added 4 commits March 25, 2022 22:12

save_csv for split 1

b0407a3

the difference from split 0 is not so big after all

Remove unused import

3bf1319

sys was only there for debugging purposes

Code simplification

d1dbc38

remove unreachable else branch and start from common offset for all splits

Make save_csv synchronous

6fe203c

Not synchronizing at the end of writing the file may lead to strange effects for imbalanced tensors.

bhagemeier marked this pull request as ready for review March 28, 2022 08:15

Merge branch 'main' into 750-save-csv-v2

4e7a23c

coquelin77 previously approved these changes Mar 28, 2022

View reviewed changes

bhagemeier dismissed coquelin77’s stale review via a8740cb March 28, 2022 13:55

bhagemeier added 4 commits March 29, 2022 23:05

Write split 0 only on rank 0

7f76ef5

Enumerate test_csv cases and use individual temporary files

6fb1981

Barrier before unlink

7c2024f

Sometimes, load_csv complained about files not being available anymore. We need to sync through a barrier before unlinking files.

Use abs also for max value

7a673cd

The maximum of a tensor can be less than 0, so need abs around the max, too, before passing it to log10. Also, a 0 value must be excluded.
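
A minimal sketch of this kind of width computation, in plain Python: the helpers `integer_digits` and `field_width` are hypothetical and only illustrate why the absolute value is needed and why 0 must be handled separately. The sketch does not account for rounding that pushes a value to the next power of ten.

```python
import math


def integer_digits(values):
    """Digits before the decimal point needed for the largest magnitude."""
    largest = max((abs(v) for v in values), default=0)
    if largest == 0:
        return 1  # log10(0) is undefined; a lone "0" needs one digit
    # Magnitudes below 1 still print a leading "0", hence the lower bound of 1.
    return max(int(math.floor(math.log10(largest))) + 1, 1)


def field_width(values, decimals):
    """Width covering sign, integer digits, decimal point and decimals."""
    sign = 1 if any(v < 0 for v in values) else 0
    point = 1 if decimals > 0 else 0
    return sign + integer_digits(values) + point + decimals


# All values negative: without abs(), log10 would receive a negative argument.
negative_only = [-250.0, -3.0]
width = field_width(negative_only, decimals=2)
```

For the all-negative example, the computed width matches the length of the widest formatted value, `-250.00`.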

Merge branch 'main' into 750-save-csv-v2

20b2023

bhagemeier merged commit 2b10dd2 into main Mar 30, 2022

bhagemeier deleted the 750-save-csv-v2 branch March 30, 2022 08:01

bhagemeier mentioned this pull request Mar 30, 2022

Improve save_csv write performance #947

Closed
