[BUG] pyfunc cannot predict dataframe with None value properly #4827
@yxiong Thanks for raising this issue! Can you provide a more complete stacktrace for the failure?
Hi @dbczumar, what I posted is actually the entire stacktrace (only two function calls and then the exception). I also created a small piece of code that reproduces the issue. Hope that helps.

```python
import pandas as pd
import mlflow
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

X_train = pd.DataFrame({"feature": ['a', 'a', 'b', 'b', None, 'a', 'a', 'b', 'b']})
y_train = pd.DataFrame({"label": [0, 0, 1, 1, 1, 0, 0, 1, 1]})

one_hot_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(missing_values=None, strategy="constant", fill_value="")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
transformers = [("onehot", one_hot_pipeline, ["feature"])]
preprocessor = ColumnTransformer(transformers, remainder="passthrough", sparse_threshold=0)
model = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", DecisionTreeClassifier()),
])

mlflow.autolog()
with mlflow.start_run(run_name='decision_tree') as run:
    model.fit(X_train, y_train)

model.predict(X_train)  # ==> ok

model_pyfunc = mlflow.pyfunc.load_model(
    'runs:/{run_id}/model'.format(run_id=run.info.run_id))
model_pyfunc.predict(X_train)  # ==> AttributeError
```
```
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<command-1260285005520992> in <module>
      2     'runs:/{run_id}/model'.format(run_id=run.info.run_id))
      3
----> 4 model_pyfunc.predict(X_train)

/databricks/python/lib/python3.8/site-packages/mlflow/pyfunc/__init__.py in predict(self, data)
    594         if input_schema is not None:
    595             data = _enforce_schema(data, input_schema)
--> 596         return self._model_impl.predict(data)
    597
    598     @property

/databricks/python/lib/python3.8/site-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
    118
    119         # lambda, but not partial, allows help() to work with update_wrapper
--> 120         out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
    121         # update the docstring of the returned function
    122         update_wrapper(out, self.fn)
```
@yxiong Got it. This seems to be an issue with the dataset being erroneously transformed during the model schema enforcement that occurs within the pyfunc inference procedure. I'll loop in an area expert who can take a look.
Thanks for triaging, @dbczumar! I looked into this a little more and confirmed the issue. The following minimal example reproduces it without autologging:

```python
import random
import string

import mlflow
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

X_train = pd.DataFrame({"feature": ['a', 'a', 'b', 'b', None, 'a', 'a', 'b', 'b']})
y_train = pd.DataFrame({"label": [0, 0, 1, 1, 1, 0, 0, 1, 1]})

model = Pipeline([
    ("imputer", SimpleImputer(missing_values=None, strategy="constant", fill_value="")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ("classifier", DecisionTreeClassifier()),
])
model.fit(X_train, y_train)
model.predict(X_train)  # ==> ok

# Save the model, then load it back as a pyfunc.
path = "/tmp/sklearn-model/" + ''.join(
    random.choice(string.ascii_lowercase) for _ in range(6))
print("Model path =", path)
signature = mlflow.models.infer_signature(model_input=X_train)
mlflow.sklearn.save_model(model, path, signature=signature)
pyfunc_model = mlflow.pyfunc.load_model(path)
pyfunc_model.predict(X_train)  # AttributeError: 'bool' object has no attribute 'any'
```

If I remove the signature when saving the model, the loaded pyfunc works.
I believe I have identified the root cause. @tomasatdatabricks Do you have some suggestions on how this should be fixed?
[Update] After scikit-learn/scikit-learn#21114, the validation can pass if we set `missing_values=pd.NA` in the imputer.
Thanks for making the change to scikit-learn! It does seem like a good route to me. MLflow's schema enforcement seems to convert None values to pandas.NA, which doesn't seem totally unreasonable -- it just doesn't work with scikit-learn's SimpleImputer. I'm not sure if there's another "more correct" behavior for MLflow. We could keep None, but it's not obvious that that's better in all cases, so fixing this on the scikit-learn side seems like a reasonable long-term solution.
Thanks for the feedback, @aarondav. I was able to connect with @tomasatdatabricks, and we think it's probably best not to cast pandas objects to string for now, as it would break a few downstream places. #5134 fixed this issue.
Oh, nice, that's great. Thanks for the follow-up.
Willingness to contribute
The MLflow Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the MLflow code base?
System information
- MLflow version (run `mlflow --version`):

Describe the problem
My training data contains `None` values, and I built a sklearn pipeline with an imputer to handle them. Then I train the pipeline model with MLflow tracking enabled. The trained model itself is able to `predict` with no problem, but the `pyfunc` object I got from MLflow doesn't work. This happens only for data with `None` values: if I run inference on a subset of `X_train` without `None`, the `model_pyfunc.predict` function also works.

Code to reproduce issue
Provide a reproducible test case that is the bare minimum necessary to generate the problem.
Other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.
What component(s), interfaces, languages, and integrations does this bug affect?

Components
- `area/artifacts`: Artifact stores and artifact logging
- `area/build`: Build and test infrastructure for MLflow
- `area/docs`: MLflow documentation pages
- `area/examples`: Example code
- `area/model-registry`: Model Registry service, APIs, and the fluent client calls for Model Registry
- `area/models`: MLmodel format, model serialization/deserialization, flavors
- `area/projects`: MLproject format, project running backends
- `area/scoring`: MLflow Model server, model deployment tools, Spark UDFs
- `area/server-infra`: MLflow Tracking server backend
- `area/tracking`: Tracking Service, tracking client APIs, autologging

Interface
- `area/uiux`: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- `area/docker`: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
- `area/sqlalchemy`: Use of SQLAlchemy in the Tracking Service or Model Registry
- `area/windows`: Windows support

Language
- `language/r`: R APIs and clients
- `language/java`: Java APIs and clients
- `language/new`: Proposals for new client languages

Integrations
- `integrations/azure`: Azure and Azure ML integrations
- `integrations/sagemaker`: SageMaker integrations
- `integrations/databricks`: Databricks integrations