Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix timer hostname bug #3043

Merged
merged 2 commits into from Sep 3, 2023
Merged

fix timer hostname bug #3043

merged 2 commits into from Sep 3, 2023

Conversation

zachglick
Copy link
Contributor

@zachglick zachglick commented Sep 2, 2023

Description

This PR fixes a bug that @swarthout and I ran into while running psi4 on AWS.

A non-negligible fraction of our psi4 calculations that run through qcschema (e.g. many body, cbs) result in the following non-deterministic error:

Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/psi4/driver/schema_wrapper.py", line 459, in run_qcschema
ret_data = run_json_qcschema(input_model.dict(), clean, False, keep_wfn=keep_wfn)
File "/opt/conda/lib/python3.10/site-packages/psi4/driver/schema_wrapper.py", line 721, in run_json_qcschema
json_data["native_files"] = {fl: flpath.read_text() for fl, flpath in files.items() if flpath.exists()}
File "/opt/conda/lib/python3.10/site-packages/psi4/driver/schema_wrapper.py", line 721, in <dictcomp>
json_data["native_files"] = {fl: flpath.read_text() for fl, flpath in files.items() if flpath.exists()}
File "/opt/conda/lib/python3.10/pathlib.py", line 1135, in read_text
return f.read()
File "/opt/conda/lib/python3.10/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 47: invalid continuation byte

This error tells us that psi4 is unable to read its own timer.dat file.

Upon further examination of these problematic timer.dat files, we noticed that the "host" field appears to be corrupted. Here is one such corrupted header of a timer.dat file, represented w/ a latin-1 encoding (since it can't be read w/ the standard utf-8 encoding):

Host: ip-172-31-XX-XXX.us-east-2.compute.interá^X

Timers On : Sat Sep  2 14:14:18 2023
Timers Off: Sat Sep  2 14:14:18 2023

Wall Time:        0.52 seconds

                                                       Time (seconds)
Module                                       User      System        Wall        Calls

--------------------------------------------------------------------------------------

In all of these problematic timer.dat files, the host name is truncated and ends with a random assortment of bytes. In the above example, the full host name should be ip-172-31-XX-XXX.us-east-2.compute.internal.

We then examined how psi4 determines and processes the host name. It turns out, psi4 uses the gethostname function from the C API to get up to the first 40 bytes of the host name, and then it writes those bytes to timer.dat. The host name of this particular compute cluster is over 40 chars/bytes. This is unsafe because if a host name has more than 40 characters, the null byte (\0) won't be written to timer.dat to signify the end of the string, and psi4 will continue to write whatever is in memory past the 40 chars/bytes until it hits a null byte. This also explains the original error, b/c writing random bytes to a file can lead to non-utf-8-compliant files.

It turns out that linux defines a maximum host name length of 64, so the easy fix here is to just increase the size of the host name buffer from 40 to 65 (== 64 + 1 for the null byte at the end). I have no idea why this length was previously limited to 40.

User API & Changelog headlines

  • Fix bug resulting in UnicodeDecodeError and corrupted timer.dat files

Dev notes & details

  • increase the buffer size used to retrieve the host name and enforce that the host name ends in a null byte before writing to timer.dat

Checklist

Status

  • Ready for review
  • Ready for merge

@zachglick zachglick added the bug label Sep 2, 2023
Copy link
Member

@loriab loriab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what a pain to debug! thanks for the fix. we'll backport this for 1.8.2

@loriab loriab added the backport label Sep 2, 2023
@loriab loriab added this to the Psi4 1.9 milestone Sep 2, 2023
@loriab loriab added this pull request to the merge queue Sep 3, 2023
Merged via the queue into master with commit 81f0da6 Sep 3, 2023
4 of 5 checks passed
@zachglick zachglick deleted the gethostname_fix branch September 3, 2023 15:13
@zachglick
Copy link
Contributor Author

zachglick commented Sep 3, 2023

what a pain to debug! thanks for the fix. we'll backport this for 1.8.2

Thank you! Hopefully it helps someone else

susilehtola added a commit to susilehtola/psi4 that referenced this pull request Sep 3, 2023
@susilehtola
Copy link
Member

Thanks for the patch.

The Linux man page states that

SUSv2 guarantees that "Host names are limited to 255 bytes".

On Windows the maximum length is 256 bytes.

Also one can remove some of the hard-coding. I will file a PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants