as_factor() 'corrupts' dataframe if it fails #5011

exalate-issue-sync · 2023-05-22T18:09:10Z

as_factor() can corrupt a dataframe (example below) if it fails. The user has to cast the column to character (or another type) first before as_factor() will work.

Suggestions:

Support as_factor() better regardless of the existing column type
If as_factor() fails, do not corrupt the dataframe

Hi,

I have a df with several features that I want to apply asfactor() to. I'm using this code:

def pimpIt(df):
    for i in all_features[:]:
        print df[i].head(3)
        print df.types[i]
        df[i] = df[i].asfactor()
        print df[i].head(3)
        print df.types[i]
    return df

train_H2O_2 = pimpIt(train_H2O)

When the type of a column is 'real', it fails. E.g. I have a factor hotel_class with possible values 1.0, 2.0, 3.0, 4.0 and 5.0. (Somehow it became a double; I know how to cast it to int before passing it to this function, but I'd like to illustrate how the dataframe becomes unusable after running into this issue.)

H2OResponseError Traceback (most recent call last)
in ()
8 return df
9
---> 10 train_H2O_2 = pimpIt(train_H2O)

in pimpIt(df)
4 print df.types[i]
5 df[i] = df[i].asfactor()
----> 6 print df[i].head(3)
7 print df.types[i]
8 return df

/opt/sparkling-water/2.0.3/py/build/dist/h2o_pysparkling_2.0-2.0.3-py2.7.egg/h2o/frame.pyc in repr(self)
410 stk = traceback.extract_stack()
411 if not ("IPython" in stk[-2][0] and "info" == stk[-2][2]):
--> 412 self.show()
413 return ""
414

/opt/sparkling-water/2.0.3/py/build/dist/h2o_pysparkling_2.0-2.0.3-py2.7.egg/h2o/frame.pyc in show(self, use_pandas)
422 print("This H2OFrame has been removed.")
423 return
--> 424 if not self._ex._cache.is_valid(): self._frame()._ex._cache.fill()
425 if H2ODisplay._in_ipy():
426 import IPython.display

/opt/sparkling-water/2.0.3/py/build/dist/h2o_pysparkling_2.0-2.0.3-py2.7.egg/h2o/frame.pyc in _frame(self, fill_cache)
471
472 def _frame(self, fill_cache=False):
--> 473 self._ex._eager_frame()
474 if fill_cache:
475 self._ex._cache.fill()

/opt/sparkling-water/2.0.3/py/build/dist/h2o_pysparkling_2.0-2.0.3-py2.7.egg/h2o/expr.pyc in _eager_frame(self)
84 if not self._cache.is_empty(): return
85 if self._cache._id is not None: return # Data already computed under ID, but not cached locally
---> 86 self._eval_driver(True)
87
88 def _eager_scalar(self): # returns a scalar (or a list of scalars)

/opt/sparkling-water/2.0.3/py/build/dist/h2o_pysparkling_2.0-2.0.3-py2.7.egg/h2o/expr.pyc in _eval_driver(self, top)
98 def _eval_driver(self, top):
99 exec_str = self._get_ast_str(top)
--> 100 res = ExprNode.rapids(exec_str)
101 if 'scalar' in res:
102 if isinstance(res['scalar'], list):

/opt/sparkling-water/2.0.3/py/build/dist/h2o_pysparkling_2.0-2.0.3-py2.7.egg/h2o/expr.pyc in rapids(expr)
201 :returns: The JSON response (as a python dictionary) of the Rapids execution
202 """
--> 203 return h2o.api("POST /99/Rapids", data={"ast": expr, "session_id": h2o.connection().session_id})
204
205

/opt/sparkling-water/2.0.3/py/build/dist/h2o_pysparkling_2.0-2.0.3-py2.7.egg/h2o/h2o.pyc in api(endpoint, data, json, filename, save_to)
82 # type checks are performed in H2OConnection class
83 _check_connection()
---> 84 return h2oconn.request(endpoint, data=data, json=json, filename=filename, save_to=save_to)
85
86

/opt/sparkling-water/2.0.3/py/build/dist/h2o_pysparkling_2.0-2.0.3-py2.7.egg/h2o/backend/connection.pyc in request(self, endpoint, data, json, filename, save_to)
261 auth=self._auth, verify=self._verify_ssl_cert, proxies=self._proxies)
262 self._log_end_transaction(start_time, resp)
--> 263 return self._process_response(resp, save_to)
264
265 except (requests.exceptions.ConnectionError, requests.exceptions.HTTPError) as e:

/opt/sparkling-water/2.0.3/py/build/dist/h2o_pysparkling_2.0-2.0.3-py2.7.egg/h2o/backend/connection.pyc in _process_response(response, save_to)
581 # Client errors (400 = "Bad Request", 404 = "Not Found", 412 = "Precondition Failed")
582 if status_code in {400, 404, 412} and isinstance(data, (H2OErrorV3, H2OModelBuilderErrorV3)):
--> 583 raise H2OResponseError(data)
584
585 # Server errors (notably 500 = "Server Error")

H2OResponseError: Server error water.exceptions.H2OIllegalArgumentException:
Error: Categorical conversion can only currently be applied to integer columns.
Request: POST /99/Rapids
data: {u'session_id': '_sid_bb3c', u'ast': "(tmp= py_136_sid_bb3c (rows (cols_py (tmp= py_135_sid_bb3c (:= py_132_sid_bb3c (as.factor (cols_py py_132_sid_bb3c 'hotel_class')) 337 [])) 'hotel_class') [0:3]))"}
So I'll work around this by casting it to int in Spark, or by doing as.numeric first and then as.factor on the h2o frame.

The issue though is that now my h2o dataframe is rendered useless. E.g. if I try to inspect the dataframe, or the head of this column, I get:
H2OResponseError: Server error water.exceptions.H2OKeyNotFoundArgumentException:
Error: Object 'py_135_sid_bb3c' not found for argument: key
Request: GET /3/Frames/py_135_sid_bb3c
params: {u'row_count': '10'}
Other dataframe functions still work though. E.g. train_H2O.types gives the dictionary containing the column type information.

In Flow, I see the data is still there. So I guess the Python object has become a corrupt reference or something. I imagine you can create better error handling here, so the data scientist doesn't lose time restarting the Jupyter Sparkling Water kernel, reshipping the data, retrying the function, failing again, starting a new kernel, etc., and only after the third time finding out what is causing the issue and applying a root-cause fix (sorry for the exaggeration :).
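Until the error handling improves, one defensive pattern is to build the converted column on a separate handle, force its evaluation, and only assign it back once that succeeds. The following is a minimal sketch under stated assumptions: the helper name `convert_then_assign` is hypothetical, and `force` is a caller-supplied function that triggers evaluation of the lazy column. Which h2o call reliably forces evaluation is not confirmed by this issue, so it is left as a parameter.

```python
def convert_then_assign(frame, col, force):
    """Assign frame[col].asfactor() back to frame[col] only if it evaluates.

    frame: any object supporting frame[col] and frame[col] = value
           (assumed to behave like an H2OFrame).
    force: hypothetical caller-supplied callable that forces evaluation
           of the converted column and raises if the conversion fails.
    Returns (True, None) on success, or (False, error) on failure.
    """
    converted = frame[col].asfactor()   # separate handle, original untouched
    try:
        force(converted)                # a failure surfaces here
    except Exception as err:
        return False, err               # frame[col] was never reassigned
    frame[col] = converted              # rebind only after success
    return True, None
```

The intent is that a failed conversion only affects the temporary handle, so the original frame reference stays usable.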

exalate-issue-sync · 2023-05-22T18:09:12Z

Avkash Chauhan commented: [~accountid:557058:389d9607-5bd8-4611-8c6a-755fe9295223] Will this be part of the SW 2.1 release?

exalate-issue-sync · 2023-05-22T18:09:14Z

Vlad Patryshev commented: Fixed in h2oai/h2o-3#886

The solution is this: check the column type before sending the request over to categorize.
There was also a bug: the returned frame had only one type specified.

A test was added too.
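On versions without the fix, a similar check can be approximated on the client side. The sketch below is an illustration only and is not the code from the PR: the helper name `asfactor_checked` and the `allowed_types` default are assumptions, with the default following the error message above (integer columns only). It relies on the `types` dictionary and `asfactor()` call already used in the report.

```python
def asfactor_checked(frame, columns, allowed_types=("int",)):
    """Convert columns to factors only when their reported type allows it.

    frame: object with a `types` dict and column get/set by name
           (assumed to behave like an H2OFrame).
    allowed_types: assumed allow-list of type names, as reported
           by frame.types, that are safe to pass to asfactor().
    Returns a list of (column, type) pairs that were skipped.
    """
    skipped = []
    for col in columns:
        col_type = frame.types[col]
        if col_type == "enum":
            continue                          # already categorical
        if col_type not in allowed_types:
            skipped.append((col, col_type))   # e.g. ('hotel_class', 'real')
            continue
        frame[col] = frame[col].asfactor()
    return skipped
```

Columns reported as skipped can then be cast explicitly (for example to int) before retrying, instead of discovering the problem when the frame is next displayed.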

exalate-issue-sync · 2023-05-22T18:09:15Z

Vlad Patryshev commented: See h2oai/h2o-3#886

exalate-issue-sync · 2023-05-22T18:09:18Z

Jakub Hava commented: It was marked as resolved, but the PR is still not in master: [~accountid:557058:6deb9650-da01-46cf-9a0f-c9f353cb4311] was asked by [~accountid:557058:389d9607-5bd8-4611-8c6a-755fe9295223] for some additional changes.

exalate-issue-sync · 2023-05-22T18:09:19Z

Vlad Patryshev commented: See also SW-354, as a further development.

exalate-issue-sync · 2023-05-22T18:09:21Z

Michal Malohlava commented: Requires H2O version 3.10.4.2+

DinukaH2O · 2023-05-23T12:13:22Z

JIRA Issue Migration Info

Jira Issue: SW-334
Assignee: Vlad Patryshev
Reporter: Nick Karpov
State: Resolved
Fix Version: 2.1.3
Attachments: N/A
Development PRs: Available

Linked PRs from JIRA

h2oai/h2o-3#886

hasithjp · 2023-05-29T15:09:26Z

JIRA Issue Migration Info Cont'd

Jira Issue Created Date: 2017-02-20T09:49:45.631-0800

DinukaH2O closed this as completed May 23, 2023

DinukaH2O added the fixVersion/2.1.3 label May 23, 2023
