Improve save_csv string formatting #948

Merged
merged 7 commits into from
Apr 20, 2022

Conversation

bhagemeier
Member

@bhagemeier bhagemeier commented Mar 31, 2022

Using .format can take up to 2x as long as using %.

Also add a test covering an additional line of code.

Description

The title says it all. Improve write performance.

Issue/s resolved: #947

Changes proposed:

  • Use string interpolation operator rather than str.format()

Type of change

  • performance improvement

Memory requirements

  • no changes

Performance

Up to 2x faster in overall runtime than str.format(), e.g. cutting the runtime from 139s to 66s for the same data. See the performance comparison in #947.
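
A rough way to reproduce this kind of comparison locally is a `timeit` micro-benchmark. The sketch below is not the benchmark from #947: the format strings, sample values, and repetition count are illustrative assumptions, and the measured ratio will vary with Python version and machine.

```python
import timeit

# Illustrative values, not taken from the PR or from issue #947.
values = [i * 0.37 for i in range(1000)]
fmt_new_style = "{:12.7f}"   # template for str.format()
fmt_percent = "%12.7f"       # template for the % operator


def with_format(items):
    """Format every item with str.format()."""
    return [fmt_new_style.format(x) for x in items]


def with_percent(items):
    """Format every item with the % operator."""
    return [fmt_percent % x for x in items]


if __name__ == "__main__":
    # Both approaches must produce identical text before timing means anything.
    assert with_format(values) == with_percent(values)
    t_format = timeit.timeit(lambda: with_format(values), number=200)
    t_percent = timeit.timeit(lambda: with_percent(values), number=200)
    print(f"str.format(): {t_format:.3f}s  % operator: {t_percent:.3f}s")
```

Timing only the formatting calls isolates the string-building cost; the end-to-end numbers quoted above also include I/O and communication.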

Due Diligence

  • All split configurations tested
  • Multiple dtypes tested in relevant functions
  • Documentation updated (if needed)
  • Updated changelog.md under the title "Pending Additions"

Does this change modify the behaviour of other functions? If so, which?

no

@codecov

codecov bot commented Mar 31, 2022

Codecov Report

Merging #948 (e88ed36) into main (5f77902) will increase coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #948      +/-   ##
==========================================
+ Coverage   95.34%   95.36%   +0.02%     
==========================================
  Files          64       64              
  Lines        9898     9898              
==========================================
+ Hits         9437     9439       +2     
+ Misses        461      459       -2     
Flag    Coverage Δ
gpu     94.59% <100.00%> (+0.03%) ⬆️
unit    90.98% <100.00%> (+0.02%) ⬆️

Impacted Files     Coverage Δ
heat/core/io.py    89.43% <100.00%> (+0.43%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@@ -1004,16 +1004,16 @@ def save_csv(
         decimals = 0
         dec_sep = 0
         if sign == 1:
-            fmt = "{: %dd}" % (pre_point_digits + 1)
Member Author

Format strings are different depending on whether you use str.format() or string interpolation. String interpolation is significantly faster than str.format(). Therefore, this PR switches to string interpolation.
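
To illustrate the difference: the format string is built in two stages. A first `%` pass fills in the field width, and the resulting template is then applied to each value. The sketch below mirrors the integer branch of the diff; `pre_point_digits = 4` is an arbitrary example value, not one taken from the PR.

```python
pre_point_digits = 4  # example value; save_csv derives this from the data

# Old approach: stage one produces a str.format() template.
old_fmt = "{: %dd}" % (pre_point_digits + 1)   # "{: 5d}"

# New approach: "%%" is an escaped percent sign and "%-d" inserts the width,
# so stage one produces a %-style template.
new_fmt = "%%%-dd" % (pre_point_digits + 1)    # "%5d"

for value in (42, -42):
    # For these values both templates yield the same right-aligned field.
    assert old_fmt.format(value) == new_fmt % value

print(repr(old_fmt), repr(new_fmt))
```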

Collaborator

Have you tried f-strings? How do they compete?

Member Author

@bhagemeier bhagemeier Apr 12, 2022

Yes, tried them. Seems a bit convoluted to create them. f-strings are intended to be in code as literals, whereas I dynamically generate the format string depending on desired precision etc. You can give it a try if you like. For me personally, it wasn't quite worth the effort for an output format that is far from ideal in terms of performance anyway. It may turn out that they don't even work in the intended way.
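
For reference, a format spec inside an f-string may itself contain replacement fields, which is one way to supply width and precision at run time. The sketch below uses made-up values and is not code from this PR; whether it would be faster than `%` inside `save_csv` would need to be measured.

```python
width, decimals = 10, 3   # illustrative values
value = 3.14159

# Nested replacement fields: width and precision are filled in at run time.
nested = f"{value:{width}.{decimals}f}"

# Equivalent form: a pre-built spec string passed to the built-in format().
spec = f"{width}.{decimals}f"          # "10.3f"
via_builtin = format(value, spec)

# The %-style template used by this PR gives the same text.
percent = ("%%%d.%df" % (width, decimals)) % value

assert nested == via_builtin == percent
print(repr(nested))
```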

@@ -1033,11 +1033,11 @@ def save_csv(
     for i in range(data.lshape[0]):
         # if lshape is of the form (x,), then there will only be a single element per row
         if len(data.lshape) == 1:
-            row = fmt.format(data.larray[i])
+            row = fmt % (data.larray[i])
Member Author

The format strings have to be used differently now.
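
One detail worth keeping in mind with the `%` operator: `(item)` is just a parenthesised value, not a tuple, so `fmt % (item)` and `fmt % item` are the same expression. That is fine for scalars, but a tuple on the right-hand side is unpacked into multiple arguments. A small sketch with plain Python numbers (the PR itself formats elements of `data.larray`):

```python
fmt_percent = "%6.2f"
fmt_format = "{:6.2f}"
item = 2.5

# str.format() is a method call; % is a binary operator.
assert fmt_format.format(item) == fmt_percent % item == "  2.50"

# Parentheses alone do not create a tuple.
assert fmt_percent % (item) == fmt_percent % item

# A one-element tuple is the explicit, unambiguous spelling.
assert fmt_percent % (item,) == "  2.50"

# A tuple operand is unpacked into arguments, so a single-field template
# cannot format a two-element tuple directly.
try:
    fmt_percent % (1.0, 2.0)
except TypeError as exc:
    print("TypeError:", exc)
```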

@@ -195,6 +195,21 @@ def test_save_csv(self):
         if data.comm.rank == 0:
             os.unlink(filename)
 
+        data = ht.random.randint(0, 100, size=(150,))
Member Author

This captures a special case that had not been covered before.
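
The special case is the branch where the local data is one-dimensional, so each row holds a single value and no separator is needed. A plain-Python sketch of that branching follows; `format_rows` and its list-based inputs are hypothetical stand-ins for the loop over `data.larray`, not Heat API.

```python
def format_rows(rows, fmt, sep=","):
    """Format rows with a %-style template.

    Hypothetical helper: `rows` is either a flat list of numbers (1-D case)
    or a list of lists (2-D case).
    """
    lines = []
    for row in rows:
        if isinstance(row, (list, tuple)):
            # 2-D case: format each item and join with the separator.
            lines.append(sep.join(fmt % item for item in row))
        else:
            # 1-D case: a single element per row, no separator.
            lines.append(fmt % row)
    return lines


print(format_rows([7, 42, 99], "%3d"))            # 1-D input
print(format_rows([[1, 2], [30, 40]], "%3d"))     # 2-D input
```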

@bhagemeier bhagemeier marked this pull request as ready for review April 20, 2022 08:17
@@ -1004,16 +1004,16 @@ def save_csv(
         decimals = 0
         dec_sep = 0
         if sign == 1:
-            fmt = "{: %dd}" % (pre_point_digits + 1)
+            fmt = "%%%-dd" % (pre_point_digits + 1)
Collaborator

Suggested change
-            fmt = "%%%-dd" % (pre_point_digits + 1)
+            fmt = f"{pre_point_digits + 1}"

         else:
-            fmt = "{:%dd}" % (pre_point_digits)
+            fmt = "%%%dd" % (pre_point_digits)
Collaborator

Suggested change
-            fmt = "%%%dd" % (pre_point_digits)
+            fmt = f"{pre_point_digits}"

     elif types.issubdtype(data.dtype, types.floating):
         if decimals == -1:
             decimals = 7 if data.dtype is types.float32 else 15
         if sign == 1:
-            fmt = "{: %d.%df}" % (pre_point_digits + decimals + 2, decimals)
+            fmt = "%%%-d.%df" % (pre_point_digits + decimals + 2, decimals)
Collaborator

Suggested change
-            fmt = "%%%-d.%df" % (pre_point_digits + decimals + 2, decimals)
+            fmt = f"{pre_point_digits + decimals + 2}.{decimals}f"

         else:
-            fmt = "{:%d.%df}" % (pre_point_digits + decimals + 1, decimals)
+            fmt = "%%%d.%df" % (pre_point_digits + decimals + 1, decimals)
Collaborator

f"{pre_point_digits + decimals + 1}.{decimals}f"

         else:
-            fmt = "{:%d.%df}" % (pre_point_digits + decimals + 1, decimals)
+            fmt = "%%%d.%df" % (pre_point_digits + decimals + 1, decimals)
Collaborator

Suggested change
-            fmt = "%%%d.%df" % (pre_point_digits + decimals + 1, decimals)
+            fmt = f"{pre_point_digits + decimals + 1}.{decimals}f"

@@ -1033,11 +1033,11 @@ def save_csv(
     for i in range(data.lshape[0]):
         # if lshape is of the form (x,), then there will only be a single element per row
         if len(data.lshape) == 1:
-            row = fmt.format(data.larray[i])
+            row = fmt % (data.larray[i])
Collaborator

Suggested change
-            row = fmt % (data.larray[i])
+            row = f"{data.larray[i]:{fmt}}"

         else:
             if data.lshape[1] == 0:
                 break
-            row = sep.join(fmt.format(item) for item in data.larray[i])
+            row = sep.join(fmt % (item) for item in data.larray[i])
Collaborator

Suggested change
-            row = sep.join(fmt % (item) for item in data.larray[i])
+            row = sep.join(f"{item:{fmt}}" for item in data.larray[i])
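
Note that inside `f"{item:{fmt}}"`, `fmt` has to be a bare format spec such as `6.2f`, without a leading `%`, so this suggestion only works together with the suggested `fmt` changes. A sketch comparing both row-building variants on made-up data (the values and variable names other than `fmt`, `sep`, `pre_point_digits`, and `decimals` are illustrative):

```python
pre_point_digits, decimals = 3, 2     # illustrative values
sep = ","
row_values = [1.5, -22.25, 333.125]

# Variant merged in this PR: %-style template.
fmt_percent = "%%%d.%df" % (pre_point_digits + decimals + 1, decimals)  # "%6.2f"
row_percent = sep.join(fmt_percent % item for item in row_values)

# Variant from the review suggestions: bare spec used inside an f-string.
fmt_spec = f"{pre_point_digits + decimals + 1}.{decimals}f"             # "6.2f"
row_fstring = sep.join(f"{item:{fmt_spec}}" for item in row_values)

# Both variants produce the same row text for this data.
assert row_percent == row_fstring
print(row_percent)
```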

@ClaudiaComito
Contributor

run tests

Contributor

@ClaudiaComito ClaudiaComito left a comment

Thanks a lot @bhagemeier !

@ClaudiaComito ClaudiaComito merged commit 0510437 into main Apr 20, 2022
@ClaudiaComito ClaudiaComito deleted the 947-improve-csv-string-formatting branch April 20, 2022 12:25
@mtar mtar removed the PR talk label Sep 11, 2023
Development

Successfully merging this pull request may close these issues.

Improve save_csv write performance
3 participants