as_factor() 'corrupts' dataframe if it fails #5011

Closed
exalate-issue-sync bot opened this issue May 22, 2023 · 8 comments

@exalate-issue-sync

as_factor() can corrupt a dataframe if it fails (example below). The user has to cast the column to character (or another supported type) before as_factor() will work.

Suggestions:

  1. Support as_factor() better regardless of the existing column type
  2. If as_factor() fails, do not corrupt the dataframe

Hi,

I have a df with several features that I want to apply asfactor() to. I'm using this code:

def pimpIt(df):
    for i in all_features[:]:
        print df[i].head(3)
        print df.types[i]
        df[i] = df[i].asfactor()
        print df[i].head(3)
        print df.types[i]
    return df

train_H2O_2 = pimpIt(train_H2O)

When the type of a column is 'real', it fails. E.g. I have a factor hotel_class with possible values 1.0, 2.0, 3.0, 4.0 and 5.0. (Somehow it became a double; I know how to cast it to int before passing it to this function, but I'd like to illustrate how the dataframe becomes unusable after hitting this issue.)

H2OResponseError Traceback (most recent call last)
in ()
8 return df
9
---> 10 train_H2O_2 = pimpIt(train_H2O)

in pimpIt(df)
4 print df.types[i]
5 df[i] = df[i].asfactor()
----> 6 print df[i].head(3)
7 print df.types[i]
8 return df

/opt/sparkling-water/2.0.3/py/build/dist/h2o_pysparkling_2.0-2.0.3-py2.7.egg/h2o/frame.pyc in __repr__(self)
410 stk = traceback.extract_stack()
411 if not ("IPython" in stk[-2][0] and "info" == stk[-2][2]):
--> 412 self.show()
413 return ""
414

/opt/sparkling-water/2.0.3/py/build/dist/h2o_pysparkling_2.0-2.0.3-py2.7.egg/h2o/frame.pyc in show(self, use_pandas)
422 print("This H2OFrame has been removed.")
423 return
--> 424 if not self._ex._cache.is_valid(): self._frame()._ex._cache.fill()
425 if H2ODisplay._in_ipy():
426 import IPython.display

/opt/sparkling-water/2.0.3/py/build/dist/h2o_pysparkling_2.0-2.0.3-py2.7.egg/h2o/frame.pyc in _frame(self, fill_cache)
471
472 def _frame(self, fill_cache=False):
--> 473 self._ex._eager_frame()
474 if fill_cache:
475 self._ex._cache.fill()

/opt/sparkling-water/2.0.3/py/build/dist/h2o_pysparkling_2.0-2.0.3-py2.7.egg/h2o/expr.pyc in _eager_frame(self)
84 if not self._cache.is_empty(): return
85 if self._cache._id is not None: return # Data already computed under ID, but not cached locally
---> 86 self._eval_driver(True)
87
88 def _eager_scalar(self): # returns a scalar (or a list of scalars)

/opt/sparkling-water/2.0.3/py/build/dist/h2o_pysparkling_2.0-2.0.3-py2.7.egg/h2o/expr.pyc in _eval_driver(self, top)
98 def _eval_driver(self, top):
99 exec_str = self._get_ast_str(top)
--> 100 res = ExprNode.rapids(exec_str)
101 if 'scalar' in res:
102 if isinstance(res['scalar'], list):

/opt/sparkling-water/2.0.3/py/build/dist/h2o_pysparkling_2.0-2.0.3-py2.7.egg/h2o/expr.pyc in rapids(expr)
201 :returns: The JSON response (as a python dictionary) of the Rapids execution
202 """
--> 203 return h2o.api("POST /99/Rapids", data={"ast": expr, "session_id": h2o.connection().session_id})
204
205

/opt/sparkling-water/2.0.3/py/build/dist/h2o_pysparkling_2.0-2.0.3-py2.7.egg/h2o/h2o.pyc in api(endpoint, data, json, filename, save_to)
82 # type checks are performed in H2OConnection class
83 _check_connection()
---> 84 return h2oconn.request(endpoint, data=data, json=json, filename=filename, save_to=save_to)
85
86

/opt/sparkling-water/2.0.3/py/build/dist/h2o_pysparkling_2.0-2.0.3-py2.7.egg/h2o/backend/connection.pyc in request(self, endpoint, data, json, filename, save_to)
261 auth=self._auth, verify=self._verify_ssl_cert, proxies=self._proxies)
262 self._log_end_transaction(start_time, resp)
--> 263 return self._process_response(resp, save_to)
264
265 except (requests.exceptions.ConnectionError, requests.exceptions.HTTPError) as e:

/opt/sparkling-water/2.0.3/py/build/dist/h2o_pysparkling_2.0-2.0.3-py2.7.egg/h2o/backend/connection.pyc in _process_response(response, save_to)
581 # Client errors (400 = "Bad Request", 404 = "Not Found", 412 = "Precondition Failed")
582 if status_code in {400, 404, 412} and isinstance(data, (H2OErrorV3, H2OModelBuilderErrorV3)):
--> 583 raise H2OResponseError(data)
584
585 # Server errors (notably 500 = "Server Error")

H2OResponseError: Server error water.exceptions.H2OIllegalArgumentException:
Error: Categorical conversion can only currently be applied to integer columns.
Request: POST /99/Rapids
data: {u'session_id': '_sid_bb3c', u'ast': "(tmp= py_136_sid_bb3c (rows (cols_py (tmp= py_135_sid_bb3c (:= py_132_sid_bb3c (as.factor (cols_py py_132_sid_bb3c 'hotel_class')) 337 [])) 'hotel_class') [0:3]))"}

So I'll work around this by casting the column to int in Spark, or by applying as.numeric first and as.factor second on the H2O frame.
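
A minimal sketch of the character-cast workaround mentioned at the top of this issue. It assumes an H2OFrame-like object whose `types` attribute maps column names to type strings and whose columns provide `ascharacter()` and `asfactor()`, as in the h2o Python client. `to_factor_column` is a hypothetical helper name, and whether the character cast succeeds on a given H2O version has not been verified here.

```python
def to_factor_column(frame, name):
    """Return column `name` of `frame` as a factor, without modifying `frame`.

    Assumes an H2OFrame-like object: `frame.types` maps column names to type
    strings, and columns provide ascharacter() and asfactor().
    """
    column = frame[name]
    if frame.types[name] == "real":
        # asfactor() rejected real columns in this issue, so go through
        # character first (assumption: this cast is accepted by the server).
        column = column.ascharacter()
    return column.asfactor()
```

Because the helper returns the converted column instead of assigning it, the original frame variable is not rebound to an expression that may later fail; the caller can inspect the result first and only then write `df[name] = ...`.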

The issue though is that now my h2o dataframe is rendered useless. E.g. if I try to inspect the dataframe, or the head of this column, I get:
H2OResponseError: Server error water.exceptions.H2OKeyNotFoundArgumentException:
Error: Object 'py_135_sid_bb3c' not found for argument: key
Request: GET /3/Frames/py_135_sid_bb3c
params: {u'row_count': '10'}
Other dataframe functions still work though. E.g. train_H2O.types gives the dictionary containing the column type information.

In Flow, I see the data is still there. So I guess the Python object has become a corrupt reference or something. I imagine you can add better error handling here, so the data scientist doesn't lose time restarting the Jupyter Sparkling Water kernel, reshipping the data, retrying the function, failing again, starting a new kernel, and so on, only to find out on the third attempt what is causing the issue and apply a proper root-cause fix (sorry for the exaggeration :).

@exalate-issue-sync
Author

Avkash Chauhan commented: [~accountid:557058:389d9607-5bd8-4611-8c6a-755fe9295223] Will this be part of SW 2.1 release?

@exalate-issue-sync
Author

Vlad Patryshev commented: Fixed in h2oai/h2o-3#886

The solution: check the column type before sending the column over for categorical conversion.
There was also a bug: the returned frame had only one type specified.

A test was added too.
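
As an illustration of the "check the column type first" idea, here is a client-side sketch. It is not the code from the linked PR. It assumes a mapping shaped like `H2OFrame.types` (column name to type string) and treats only `'real'` as rejected, because that is the one failing case reported in this issue. `split_by_factor_support` is a hypothetical name.

```python
def split_by_factor_support(types, columns):
    """Partition `columns` into (convertible, rejected) by column type.

    `types` is a mapping like H2OFrame.types. Only 'real' is treated as
    rejected, since that is the one case reported in this issue.
    """
    convertible, rejected = [], []
    for name in columns:
        if types[name] == "real":
            rejected.append(name)
        else:
            convertible.append(name)
    return convertible, rejected
```

Running this before a loop such as `pimpIt` would let the caller skip or pre-cast the rejected columns instead of discovering the problem after the frame has been reassigned.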

@exalate-issue-sync
Author

Vlad Patryshev commented: See h2oai/h2o-3#886

@exalate-issue-sync
Author

Jakub Hava commented: It was marked as resolved, but the PR is still not in master: [~accountid:557058:6deb9650-da01-46cf-9a0f-c9f353cb4311] was asked by [~accountid:557058:389d9607-5bd8-4611-8c6a-755fe9295223] for some additional changes.

@exalate-issue-sync
Author

Vlad Patryshev commented: See also SW-354, as a further development.

@exalate-issue-sync
Author

Michal Malohlava commented: Requires H2O version 3.10.4.2+

@DinukaH2O

JIRA Issue Migration Info

Jira Issue: SW-334
Assignee: Vlad Patryshev
Reporter: Nick Karpov
State: Resolved
Fix Version: 2.1.3
Attachments: N/A
Development PRs: Available

Linked PRs from JIRA

h2oai/h2o-3#886

@hasithjp
Member

JIRA Issue Migration Info Cont'd

Jira Issue Created Date: 2017-02-20T09:49:45.631-0800
