Improve save_csv string formatting #948

bhagemeier · 2022-03-31T19:59:09Z

Using .format can take up to 2x as long as using %.

Also add a test covering an additional line of code.

Description

The title says it all. Improve write performance.

Issue/s resolved: #947

Changes proposed:

Use string interpolation operator rather than str.format()

Type of change

performance improvement

Memory requirements

no changes

Performance

Up to 2x faster in overall runtime than str.format(), e.g. cutting the runtime from 139s to 66s for the same data. See the performance comparison in #947.
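For readers who want to reproduce the comparison on their own machine, here is a minimal timing sketch. It is not the benchmark from #947: the sample values, the format width, and the function names are assumptions chosen for illustration, and the measured ratio will vary with Python version and data.

```python
import timeit

# Two equivalent format strings: one for the % operator, one for str.format().
fmt_percent = "%10.3f"
fmt_braces = "{:10.3f}"

# Hypothetical sample data standing in for one row of a larger array.
values = [i * 0.37 for i in range(1000)]


def with_percent():
    # Row formatting via the string interpolation operator.
    return ",".join(fmt_percent % v for v in values)


def with_format():
    # Row formatting via str.format().
    return ",".join(fmt_braces.format(v) for v in values)


# Both approaches should yield identical text for this data.
same_output = with_percent() == with_format()

if __name__ == "__main__":
    for fn in (with_percent, with_format):
        seconds = timeit.timeit(fn, number=200)
        print(f"{fn.__name__}: {seconds:.3f}s")
```

Checking `same_output` first confirms that any speed difference is not bought with a change in the written data.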

Due Diligence

All split configurations tested
Multiple dtypes tested in relevant functions
Documentation updated (if needed)
Updated changelog.md under the title "Pending Additions"

Does this change modify the behaviour of other functions? If so, which?

no

ghost · 2022-03-31T20:00:11Z

CodeSee Review Map:

codecov · 2022-03-31T20:16:29Z

Codecov Report

Merging #948 (e88ed36) into main (5f77902) will increase coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #948      +/-   ##
==========================================
+ Coverage   95.34%   95.36%   +0.02%     
==========================================
  Files          64       64              
  Lines        9898     9898              
==========================================
+ Hits         9437     9439       +2     
+ Misses        461      459       -2

Flag	Coverage Δ
gpu	`94.59% <100.00%> (+0.03%)`	⬆️
unit	`90.98% <100.00%> (+0.02%)`	⬆️

Impacted Files	Coverage Δ
heat/core/io.py	`89.43% <100.00%> (+0.43%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

bhagemeier · 2022-04-07T07:31:29Z

heat/core/io.py

@@ -1004,16 +1004,16 @@ def save_csv(
        decimals = 0
        dec_sep = 0
        if sign == 1:
-            fmt = "{: %dd}" % (pre_point_digits + 1)


The format string syntax differs between str.format() and the % string interpolation operator. Since % interpolation is significantly faster than str.format(), this PR switches to it.

mtar (Apr 12, 2022): Have you tried f-strings? How do they compare?

bhagemeier (Apr 12, 2022): Yes, I tried them, but creating them seems a bit convoluted here. f-strings are intended to appear in code as literals, whereas I generate the format string dynamically depending on the desired precision etc. You can give it a try if you like. For me personally, it wasn't quite worth the effort for an output format that is far from ideal in terms of performance anyway. It may turn out that they don't even work in the intended way.
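To make the "dynamically generated format string" point concrete, here is a small sketch of the two-stage build for the unsigned float case. The variable values are assumptions for illustration; only the two template strings are taken from the diff.

```python
# Hypothetical inputs; in save_csv these are derived from the data and arguments.
pre_point_digits = 3
decimals = 2
width = pre_point_digits + decimals + 1

# Stage 1: build the format string itself.
# In the % variant, "%%" is an escaped literal percent sign.
brace_fmt = "{:%d.%df}" % (width, decimals)    # old: for str.format()
percent_fmt = "%%%d.%df" % (width, decimals)   # new: for the % operator

# Stage 2: apply the generated format string to a value.
old_text = brace_fmt.format(3.14159)
new_text = percent_fmt % 3.14159

print(brace_fmt, percent_fmt, repr(new_text))
```

The width and precision are only known at run time, which is why the format string has to be assembled first and applied afterwards.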

bhagemeier · 2022-04-07T07:32:14Z

heat/core/io.py

@@ -1033,11 +1033,11 @@ def save_csv(
    for i in range(data.lshape[0]):
        # if lshape is of the form (x,), then there will only be a single element per row
        if len(data.lshape) == 1:
-            row = fmt.format(data.larray[i])
+            row = fmt % (data.larray[i])


The format strings have to be used differently now.
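A short sketch of what "used differently" means at the call site, using plain Python integers rather than array elements (an assumption made to keep the example self-contained). It also shows a general Python caveat: the parentheses in `fmt % (value)` are only grouping, not a tuple.

```python
value = 42
fmt_braces = "{:6d}"   # style consumed by str.format()
fmt_percent = "%6d"    # style consumed by the % operator

row_old = fmt_braces.format(value)
row_new = fmt_percent % (value)     # parentheses group only; same as fmt_percent % value
row_safe = fmt_percent % (value,)   # explicit one-element tuple

# If the right-hand side is a real tuple, each element is treated as a
# separate argument, so a single-placeholder format raises TypeError.
try:
    fmt_percent % (1, 2)
    tuple_error = None
except TypeError as exc:
    tuple_error = str(exc)

print(repr(row_new), tuple_error)
```

In the diff the right-hand side is a single array element, so the bare form works there; the explicit tuple form is simply the defensive spelling.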

bhagemeier · 2022-04-07T07:34:42Z

heat/core/tests/test_io.py

@@ -195,6 +195,21 @@ def test_save_csv(self):
                            if data.comm.rank == 0:
                                os.unlink(filename)

+        data = ht.random.randint(0, 100, size=(150,))


This captures a special case that had not been covered before.
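The special case is the one-dimensional branch of the row loop shown in the diff above. Below is a simplified, pure-Python model of the two branches. `format_rows` and its `ndim` parameter are hypothetical stand-ins and not part of heat; the real code inspects `data.lshape` instead.

```python
def format_rows(local_rows, ndim, fmt, sep=","):
    """Hypothetical model of the save_csv row loop, using plain lists."""
    lines = []
    for row in local_rows:
        if ndim == 1:
            # 1-D data: each "row" is a single element.
            line = fmt % row
        else:
            # 2-D data: format every item and join with the separator.
            line = sep.join(fmt % item for item in row)
        lines.append(line)
    return lines


one_dim = format_rows([7, 42, 99], ndim=1, fmt="%3d")
two_dim = format_rows([[7, 42], [99, 0]], ndim=2, fmt="%3d")
print(one_dim)
print(two_dim)
```

A test that only writes 2-D arrays never reaches the first branch, which is why a 1-D input such as `size=(150,)` is needed to cover it.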

mtar · 2022-04-13T11:38:00Z

heat/core/io.py

@@ -1004,16 +1004,16 @@ def save_csv(
        decimals = 0
        dec_sep = 0
        if sign == 1:
-            fmt = "{: %dd}" % (pre_point_digits + 1)
+            fmt = "%%%-dd" % (pre_point_digits + 1)


Suggested change

-            fmt = "%%%-dd" % (pre_point_digits + 1)
+            fmt = f"{pre_point_digits + 1}"

mtar · 2022-04-13T11:38:47Z

heat/core/io.py

        else:
-            fmt = "{:%dd}" % (pre_point_digits)
+            fmt = "%%%dd" % (pre_point_digits)


Suggested change

-            fmt = "%%%dd" % (pre_point_digits)
+            fmt = f"{pre_point_digits}"

mtar · 2022-04-13T11:39:19Z

heat/core/io.py

    elif types.issubdtype(data.dtype, types.floating):
        if decimals == -1:
            decimals = 7 if data.dtype is types.float32 else 15
        if sign == 1:
-            fmt = "{: %d.%df}" % (pre_point_digits + decimals + 2, decimals)
+            fmt = "%%%-d.%df" % (pre_point_digits + decimals + 2, decimals)


Suggested change

-            fmt = "%%%-d.%df" % (pre_point_digits + decimals + 2, decimals)
+            fmt = f"{pre_point_digits + decimals + 2}.{decimals}f"

mtar · 2022-04-13T11:39:40Z

heat/core/io.py

        else:
-            fmt = "{:%d.%df}" % (pre_point_digits + decimals + 1, decimals)
+            fmt = "%%%d.%df" % (pre_point_digits + decimals + 1, decimals)


f"{pre_point_digits + decimals + 1}.{decimals}f"

mtar · 2022-04-13T11:40:34Z

heat/core/io.py

        else:
-            fmt = "{:%d.%df}" % (pre_point_digits + decimals + 1, decimals)
+            fmt = "%%%d.%df" % (pre_point_digits + decimals + 1, decimals)


Suggested change

-            fmt = "%%%d.%df" % (pre_point_digits + decimals + 1, decimals)
+            fmt = f"{pre_point_digits + decimals + 1}.{decimals}f"

mtar · 2022-04-13T11:41:11Z

heat/core/io.py

@@ -1033,11 +1033,11 @@ def save_csv(
    for i in range(data.lshape[0]):
        # if lshape is of the form (x,), then there will only be a single element per row
        if len(data.lshape) == 1:
-            row = fmt.format(data.larray[i])
+            row = fmt % (data.larray[i])


Suggested change

-            row = fmt % (data.larray[i])
+            row = f"{data.larray[i]:{fmt}}"

mtar · 2022-04-13T11:41:48Z

heat/core/io.py

        else:
            if data.lshape[1] == 0:
                break
-            row = sep.join(fmt.format(item) for item in data.larray[i])
+            row = sep.join(fmt % (item) for item in data.larray[i])


Suggested change

-            row = sep.join(fmt % (item) for item in data.larray[i])
+            row = sep.join(f"{item:{fmt}}" for item in data.larray[i])
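The f-string suggestions in this review rely on nested replacement fields: the format spec is built once as a plain string and then substituted into `{item:{spec}}`. Here is a minimal sketch with plain Python floats, showing that this yields the same text as the % variant for the unsigned float case. The sample values are assumptions, and nothing here measures the relative speed of the two variants.

```python
# Hypothetical inputs for illustration.
pre_point_digits = 3
decimals = 2

# Suggested variant: a bare format spec, later used inside a nested f-string field.
spec = f"{pre_point_digits + decimals + 1}.{decimals}f"

# PR variant: a complete %-style format string.
fmt = "%%%d.%df" % (pre_point_digits + decimals + 1, decimals)

items = [1.5, -12.25, 123.0]
row_fstring = ",".join(f"{item:{spec}}" for item in items)
row_percent = ",".join(fmt % item for item in items)

print(spec, fmt)
print(row_fstring)
```

Both rows match for these values, so the choice between the two spellings comes down to readability and measured speed rather than output.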

ClaudiaComito · 2022-04-20T08:54:15Z

run tests

ClaudiaComito

Thanks a lot, @bhagemeier!

Improve save_csv string formatting

e370e45

Using .format can take up to 2x as long as using %. Also add a test covering an additional line of code.

Changelog message for issue #947

7b59a42

Merge branch 'main' into 947-improve-csv-string-formatting

6813e8f

bhagemeier commented Apr 7, 2022

ClaudiaComito added the PR talk label Apr 20, 2022

bhagemeier marked this pull request as ready for review April 20, 2022 08:17

mtar reviewed Apr 20, 2022

bhagemeier added 2 commits April 20, 2022 10:18

Merge branch 'main' into 947-improve-csv-string-formatting

4fc449e

Merge branch 'main' into 947-improve-csv-string-formatting

f838fd8

ClaudiaComito added the merge label Apr 20, 2022

bhagemeier added 2 commits April 20, 2022 11:40

Improve test coverage

bffcdaa

Merge branch 'main' into 947-improve-csv-string-formatting

e88ed36

ClaudiaComito approved these changes Apr 20, 2022

ClaudiaComito merged commit 0510437 into main Apr 20, 2022

ClaudiaComito deleted the 947-improve-csv-string-formatting branch April 20, 2022 12:25

mtar removed the PR talk label Sep 11, 2023
