Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pmdarima flavor #5373

Merged
merged 22 commits into from
Feb 25, 2022
Merged

Pmdarima flavor #5373

merged 22 commits into from
Feb 25, 2022

Conversation

BenWilson2
Copy link
Member

What changes are proposed in this pull request?

Add pmdarima flavor

How is this patch tested?

unit tests

Does this PR change the documentation?

  • No. You can skip the rest of this section.
  • Yes. Make sure the changed pages / sections render correctly by following the steps below.
  1. Check the status of the ci/circleci: build_doc check. If it's successful, proceed to the
    next step, otherwise fix it.
  2. Click Details on the right to open the job page of CircleCI.
  3. Click the Artifacts tab.
  4. Click docs/build/html/index.html.
  5. Find the changed pages / sections and make sure they render correctly.

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

Adds the native pmdarima flavor to MLflow.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
@github-actions github-actions bot added area/docs Documentation issues area/examples Example code area/models MLmodel format, model serialization/deserialization, flavors rn/feature Mention under Features in Changelogs. labels Feb 11, 2022
Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
Copy link
Collaborator

@dbczumar dbczumar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BenWilson2 Looks great! Just left a few comments, should be good to go after!

Comment on lines 798 to 800
When predicting a ``pmdarima`` flavor, the ``predict`` method argument ``return_conf_int`` will control the
output format. When set to ``False`` or ``None`` (which is the default), the return type will be a
``pandas.series``. When set to ``True``, the return type a ``tuple[pandas.Series, tuple[pandas.Series]]``.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
When predicting a ``pmdarima`` flavor, the ``predict`` method argument ``return_conf_int`` will control the
output format. When set to ``False`` or ``None`` (which is the default), the return type will be a
``pandas.series``. When set to ``True``, the return type a ``tuple[pandas.Series, tuple[pandas.Series]]``.
When predicting a ``pmdarima`` flavor, the ``predict`` method argument ``return_conf_int`` controls the
output format. When set to ``False`` or ``None`` (which is the default), the return type is a
``pandas.series``. When set to ``True``, the return type is a ``tuple[pandas.Series, tuple[pandas.Series]]``.

Can we also explain what each data type represents?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this prediction signature compatible with future pmdarima API wrappers . extensions that we may want to introduce?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For other wrappers that we'll be building the return type for either a) without confidence intervals (pd.Series) and b) with confidence intervals (tuple[pd.Series, tuple[pd.Series]]) would be unified as a pd.DataFrame output.
Do you think that we should make make the "b" scenario a pd.DataFrame return type and handle the transformation for users here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that the pyfunc model inference format doesn't support tuple outputs (https://github.com/mlflow/mlflow/pull/5370/files#r808409658), I think it makes sense to use a dataframe for (b). In fact, can we use a dataframe for both (a) and (b) and add additional column(s) to the dataframe for (b)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100%. Will implement this

@@ -17,6 +17,7 @@ pytest tests/h2o --large
pytest tests/shap --large
pytest tests/paddle --large
pytest tests/prophet --large
pytest tests/pmdarima --large
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add these to cross version tests instead, since pmdarima, unlike Prophet, is maintained & somewhat frequently updated?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

definitely! added.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BenWilson2 Thanks! Can we remove this line from dev/run-python-flavor-tests.sh now that we've added pmdarima to cross version tests?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great catch. Since the prophet flavor is part of x-version testing in master now I'll remove that here as well.

mlflow/pmdarima.py Show resolved Hide resolved
Comment on lines 797 to 799
.. note::
When predicting a ``pmdarima`` flavor, the ``predict`` method argument ``return_conf_int`` will control the
output format. When set to ``False`` or ``None`` (which is the default), the return type will be a
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we provide an example input that populates all of the supported arguments? I know we don't do a good job of this for other flavors, but this is a great opportunity to start!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added an example showing the primary config for the input df signature to the pyfunc .predict() method. I'm leaving the X parameter out since none of the common open source datasets support exogeneous regressors, pmdarima is deprecating that functionality in a near-future release, and it's just really confusing to most users of these libraries to use exog regressor elements.
I'm providing an example output conversion for the confidence interval output to convert it to a DataFrame to illustrate what to do with the model's output as well.

Comment on lines 80 to 87
:param pmdarima_model:
:param path:
:param conda_env:
:param mlflow_model:
:param signature:
:param input_example:
:param pip_requirements:
:param extra_pip_requirements:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we complete these parameter docstrings and add a description of the overall method?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow. yikes. totally missed those. On it!

Comment on lines 156 to 167
:param pmdarima_model:
:param artifact_path:
:param conda_env:
:param registered_model_name:
:param signature:
:param input_example:
:param await_registration_for:
:param pip_requirements:
:param extra_pip_requirements:
:param kwargs:

:return:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we complete these parameter docstrings and add a description of the overall method?

Comment on lines 186 to 189
"""
:param model_uri:
:param dst_path:
"""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we complete these parameter docstrings and add a description of the overall method?

- pip
- pip:
- pmdarima
- mlflow>=1.23.0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MLflow 1.23.0 doesn't have this feature. I think we need mflow>1.23.1.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed!

@@ -0,0 +1,57 @@
import mlflow
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This takes quite awhile to run (several minutes). Can we make the workload smaller and add some print statement progress indicators prior to the training portion (e.g. "constructing model...")?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reduced the runtime to ~ 5 seconds and added additional status messages.

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
@BenWilson2
Copy link
Member Author

Docs rendering verified for table visualization

models:
minimum: "1.8.4"
maximum: "1.8.4"
requirements: ["prophet"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious why we need prophet.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just so that we can use the data generators in that test module without having to copy the generator into this module.

@harupy harupy added the enable-dev-tests Enables cross-version tests for dev versions label Feb 21, 2022
Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
…ts for dev branch

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
Comment on lines 361 to 373
# NB: dev version of pmdarima uses a __version__ tag of "0.0.0" which will fail the env build.
# To be as safe as possible in cross version tests, use the latest released version when
# testing dev pmdarima.

def get_latest_version():
ver = pmdarima.__version__
if ver == "0.0.0":
url = "https://pypi.org/pypi/pmdarima/json"
resp = requests.get(url).json()
releases = resp["releases"].keys()
return sorted(releases, key=parse_version, reverse=True)[0]
else:
return ver
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use --no-conda flag instead?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Example:

EXTRA_PYFUNC_SERVING_TEST_ARGS = [] if _is_available_on_pypi("xgboost") else ["--no-conda"]

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BenWilson2 Can we address this comment?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

most definitely. Thank you for the tip @harupy !

model flavor in native pmdarima formats.

.. note::
When predicting a ``pmdarima`` flavor, the ``predict`` method argument ``return_conf_int`` controls the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: return_conf_int isn't a method argument, right? It's an optional column in the input pandas dataframe.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very good point. Changed this note to make it far more clear what is going on with this configuration (and that by omitting the column in the configuration argument DataFrame, the default will be False for calculating CI values.

format via the :py:func:`mlflow.pmdarima.save_model()` and :py:func:`mlflow.pmdarima.log_model()` methods.
These methods also add the ``python_function`` flavor to the MLflow Models that they produce, allowing the
model to be interpreted as generic Python functions for inference via :py:func:`mlflow.pyfunc.load_model()`.
This loaded PyFunc model can only be scored with a DataFrame input.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we provide an overview of the accepted DataFrame input columns and length (i.e. single-row DataFrames)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a table, a listing of what the values mean, and warnings for ensuring that it's a single row DF and that n_periods is not optional.

pip install $CACHE_DIR/pmdarima-*.whl

models:
minimum: "1.8.4"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We seem to be compatible with pmdarima < 1.8.4 based on your warning logic for the case where exogenous regressors are specified with pmdarima < 1.8.0. Can we drop the minimum version down to 1.8.0, providing support for all versions of pmdarima released within the last year (+ December 2020)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed and upped the cross version test to 1.8.5 (released 2 days ago)



@format_docstring(LOG_MODEL_PARAM_DOCS.format(package_name=FLAVOR_NAME))
def save_model(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add the @experimental decorator to the public methods in this module?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

@@ -784,6 +784,60 @@ method to load MLflow Models with the ``prophet`` model flavor in native prophet

For more information, see :py:mod:`mlflow.prophet`.

Pmdarima (``pmdarima``)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we indicate that this is experimental?

Suggested change
Pmdarima (``pmdarima``)
Pmdarima (``pmdarima``) (Experimental)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep!

Copy link
Collaborator

@dbczumar dbczumar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BenWilson2 Looks fantastic! Should be ready to go after all remaining comments are addressed!

Comment on lines 426 to 427
assert len(forecast_with_ci == 10)
assert len(forecast_with_ci.columns.values == 3)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert len(forecast_with_ci == 10)
assert len(forecast_with_ci.columns.values == 3)
assert len(forecast_with_ci) == 10
assert len(forecast_with_ci.columns.values) == 3

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yikes that's quite a typo. Fixed.

Comment on lines 419 to 420
assert len(forecast_no_ci == 10)
assert len(forecast_no_ci.columns.values == 1)
Copy link
Member

@harupy harupy Feb 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert len(forecast_no_ci == 10)
assert len(forecast_no_ci.columns.values == 1)
assert len(forecast_no_ci) == 10
assert len(forecast_no_ci.columns.values) == 1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed typo

- pip
- pip:
- pmdarima
- mlflow>=1.23.1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we insert a newline?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep!

maximum: "1.8.4"
requirements: ["prophet"]
run: |
pytest tests/pmdarima/test_pmdarima_model_export.py --large
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we insert a newline?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

Comment on lines 308 to 311
raise MlflowException(
f"The provided prediction pd.DataFrame {dataframe.to_string()} "
f"contains {len(dataframe)} rows. Only 1 row should be supplied."
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line prints out an error message that looks like this:

The provided prediction pd.DataFrame    a  b
0  1  3
1  2  4 contains 2 rows. Only 1 row should be supplied.

Can we remove dataframe.to_string() and say the provided prediction pd.DataFrame contains ...?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed and cleaned up the wording

# `X` entries as a 2D array structure to the predict method.
exogoneous_regressor = attrs.get("X", None)

if exogoneous_regressor and self._pmdarima_version < "1.8.0":
Copy link
Member

@harupy harupy Feb 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if exogoneous_regressor and self._pmdarima_version < "1.8.0":
if exogoneous_regressor and Version(self._pmdarima_version) < Version("1.8.0"):

Can we use packaging.version.Version here to compare versions correctly?

>>> from packaging.version import Version
>>> "1.8.0.rc1" > "1.8.0"
True
>>> Version("1.8.0.rc1") > Version("1.8.0")
False

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great point. Updated.

Comment on lines 313 to 317
if not set(df_schema).issubset(schema):
raise MlflowException(
f"The provided schema {df_schema} contains invalid columns. "
f"Columns must be part of: {schema}"
)
Copy link
Member

@harupy harupy Feb 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current implementation throws an exception if the given dataframe looks like this:

pd.DataFrame(
    {
        # valid columns
        "n_periods": [0],
        "X": [0],
        "return_conf_int": [0],
        "alpha": [0],
        # invalid columns
        "foo": [0],
    }
)

Does the foo column cause any problems in the subsequent process? It appears the current implementation ignores it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed this implementation and simplified the validation check (only n_periods is required so we only throw if we can't get a valid value there). Any extra elements in the passed in DF will be ignored.

Comment on lines 96 to 102
.. code-block:: py

from mlflow.models.signature import infer_signature

model = pmdarima.auto_arima(data)
predictions = model.predict(n_periods=30, return_conf_int=False)
signature = infer_signature(data, predictions)
Copy link
Member

@harupy harupy Feb 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.. code-block:: py
from mlflow.models.signature import infer_signature
model = pmdarima.auto_arima(data)
predictions = model.predict(n_periods=30, return_conf_int=False)
signature = infer_signature(data, predictions)
.. code-block:: py
from mlflow.models.signature import infer_signature
model = pmdarima.auto_arima(data)
predictions = model.predict(n_periods=30, return_conf_int=False)
signature = infer_signature(data, predictions)

Can we indent this block?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep!

Copy link
Member

@harupy harupy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments, otherwise LGTM!

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
…ma-flavor

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
Copy link
Collaborator

@dbczumar dbczumar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM once note order is swapped!

…flow

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
@BenWilson2 BenWilson2 merged commit 3973283 into master Feb 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/docs Documentation issues area/examples Example code area/models MLmodel format, model serialization/deserialization, flavors enable-dev-tests Enables cross-version tests for dev versions rn/feature Mention under Features in Changelogs.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants