Python client: encoding error when writing tmp file to disk before upload of Py object. #7201

Closed
exalate-issue-sync bot opened this issue May 11, 2023 · 3 comments


@exalate-issue-sync

An example of error that occurs on Windows when trying to upload a pandas DataFrame to H2O:

{noformat}
File ".\src\anomaly_detection_95.py", line 36, in get_aggregated_rows
    h2o_df = h2o.H2OFrame(all_rows)
File "c:\users\mllaugel\desktop\humano_fraud_isof\venv\lib\site-packages\h2o\frame.py", line 109, in __init__
    self._upload_python_object(python_obj, destination_frame, header, separator,
File "c:\users\mllaugel\desktop\humano_fraud_isof\venv\lib\site-packages\h2o\frame.py", line 149, in _upload_python_object
    csv_writer.writerows(data_to_write)
File "C:\Users\mllaugel\AppData\Local\Programs\Python\Python38-32\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\ufffd' in position 205: character maps to <undefined>
{noformat}

Calling code:

{code:python}
import os
import h2o
import pandas as pd
h2o.init()
all_rows = pd.read_csv(os.path.join(self.location, "training_data.csv"))
h2o_df = h2o.H2OFrame(all_rows)
{code}

Apart from the fact that the snippet above is not best practice, it should not raise an error.
On Windows, pandas loads the {{.csv}} file as {{utf-8}} (see below). Before upload, the H2O Py client writes the frame to a tmp {{.csv}} file (see https://github.com/h2oai/h2o-3/blob/03cc6c86a179021418ae0f21df372ab7df0fdd86/h2o-py/h2o/frame.py#L142 ) using a different encoding.

The error occurs because pandas apparently loads {{.csv}} files as {{utf-8}} by default. This is not stated in the docs, but it can be seen in the pandas source:

{code:python}
# Windows does not default to utf-8. Set to utf-8 for a consistent behavior
encoding_passed, encoding = encoding, encoding or "utf-8"
{code}

Then, when writing, because we don’t pass the {{encoding='utf-8'}} param when opening the tmp file, Python falls back to the default {{cp1252}} encoding and raises the error on characters that {{cp1252}} cannot represent.
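The mismatch can be reproduced on any platform by passing the encoding explicitly. The following is a minimal sketch, not the actual H2O code: {{write_csv}}, {{try_write}}, and the sample rows are hypothetical names made up for illustration, and {{cp1252}} is passed by hand to mimic the Windows default.

```python
import csv
import os
import tempfile

# Sample data containing U+FFFD, the character from the traceback above.
ROWS = [["id", "comment"], [1, "bad byte \ufffd here"]]


def write_csv(path, rows, encoding):
    """Write rows to path as CSV using an explicit text encoding."""
    # newline="" is what the csv module documentation recommends for writers.
    with open(path, "w", encoding=encoding, newline="") as f:
        csv.writer(f).writerows(rows)


def try_write(rows, encoding):
    """Return None on success, or the UnicodeEncodeError that was raised."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "frame.csv")
        try:
            write_csv(path, rows, encoding)
        except UnicodeEncodeError as exc:
            return exc
    return None


# cp1252 has no mapping for U+FFFD, so this write fails like the one in the report.
cp1252_error = try_write(ROWS, "cp1252")
# utf-8 can encode every Unicode character, so the same rows are written fine.
utf8_error = try_write(ROWS, "utf-8")

print("cp1252:", type(cp1252_error).__name__)
print("utf-8: ", utf8_error)
```

Passing {{encoding="utf-8"}} at the point where the tmp file is opened is the kind of change suggested here; the exact shape of the fix in {{_upload_python_object}} is not shown in this issue.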

On top of fixing our {{_upload_python_object}} function, I’d recommend reviewing all usages of {{open(...)}} in our Py code base, for both reads and writes, and ensuring that we always enforce {{utf-8}} encoding for consistent behavior.
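As a rough aid for such a review, text-mode {{open(...)}} calls without an explicit encoding can be listed with the standard {{ast}} module. This is only a sketch under stated assumptions: {{find_open_without_encoding}} is a hypothetical helper, it only sees direct calls to the builtin name {{open}} (not {{io.open}}, {{codecs.open}}, or aliases), and it skips a call only when the mode is a string literal containing {{b}}.

```python
import ast


def find_open_without_encoding(source, filename="<string>"):
    """Return sorted line numbers of open(...) calls that lack an encoding.

    Calls whose mode is a string literal containing "b" are skipped,
    because binary mode does not take an encoding.
    """
    hits = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        is_open_call = (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "open"
        )
        if not is_open_call:
            continue
        keywords = {kw.arg: kw.value for kw in node.keywords}
        # encoding is the 4th positional parameter of open(), or a keyword.
        if "encoding" in keywords or len(node.args) > 3:
            continue
        # mode is the 2nd positional parameter of open(), or a keyword.
        mode = node.args[1] if len(node.args) > 1 else keywords.get("mode")
        if (
            isinstance(mode, ast.Constant)
            and isinstance(mode.value, str)
            and "b" in mode.value
        ):
            continue
        hits.append(node.lineno)
    return sorted(hits)


# Hypothetical sample input; line 1 of the string is the empty line after '''.
sample = '''
f1 = open("a.csv", "w")
f2 = open("b.csv", "w", encoding="utf-8")
f3 = open("c.bin", "rb")
f4 = open("d.csv", mode="r")
'''

flagged = find_open_without_encoding(sample)
print(flagged)  # [2, 5]
```

Running the helper over each {{.py}} file in the code base would give a list of candidate lines to inspect by hand; it is a starting point for the review, not a replacement for it.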

@exalate-issue-sync
Author

Zuzana Olajcová commented: Resolved in https://github.com//pull/5999

@h2o-ops-ro
Collaborator

JIRA Issue Details

Jira Issue: PUBDEV-8460
Assignee: Zuzana Olajcová
Reporter: Sebastien Poirier
State: Resolved
Fix Version: 3.36.0.2
Attachments: N/A
Development PRs: Available

@h2o-ops-ro
Collaborator

Linked PRs from JIRA

#5999
