Initial commit for the SparkML model export. #72

Merged: 33 commits merged into mlflow:master on Jul 8, 2018

Conversation

tomasatdatabricks (Contributor)

Initial stab at exporting SparkML models. As part of this, I also moved spark_udf outside of pyfunc and created a separate flavor for it. The flavor is required because not all pyfuncs have to support Spark UDF mode; e.g., the pyfunc generated by the SparkML flavor does not.
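
For context, a rough sketch of the Spark UDF usage being discussed, assuming the pre-PR mlflow.pyfunc.spark_udf signature that appears later in this diff (the model path and column names are illustrative):

import mlflow.pyfunc
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Load a logged pyfunc model as a Spark UDF; "my_model" and the feature
# column names below are placeholders, and df is an existing Spark
# DataFrame containing those feature columns.
predict = mlflow.pyfunc.spark_udf(spark, "my_model")
df = df.withColumn("prediction", predict("feature1", "feature2"))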

A few open questions / TODOs:

  1. How should jar dependencies be specified? Unimplemented for now.
  2. What should the model output be? For now we only produce a 1-d prediction, to be consistent with other flavors. We should probably revisit that for all the models in one go.
  3. Maybe we should add a load-model-as-spark-transformer interface in addition to the existing spark_udf.

@mateiz (Contributor) commented Jun 21, 2018

Hey Tomas, instead of adding a new spark_udf flavor here, let's just add a runnable_in_spark: false flag in the python_function flavor to capture this case. We don't want to make everyone write 2 flavors if their library actually works anywhere in Python. (In fact this one might also work as a UDF if they create a second SparkContext, but it would just be weird).
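
A sketch of how a flavor author might record that flag under this proposal, using the existing Model.add_flavor helper; runnable_in_spark is the key proposed in this comment, not an existing one, and the loader module name is illustrative:

from mlflow.models import Model

mlflow_model = Model()
# Hypothetical: mark the pyfunc flavor as not usable inside a Spark UDF.
# "runnable_in_spark" is the flag proposed above, not an existing key.
mlflow_model.add_flavor("python_function",
                        loader_module="mlflow.spark",  # illustrative
                        runnable_in_spark=False)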

@tomasatdatabricks (Contributor, Author)

Oh, OK. My motivation for turning it into a separate flavor was to allow setting other properties, such as using a different underlying flavor (e.g., with MLeap you might want to use that instead of a pyfunc).

Initially I thought the SparkML model would be that (and that's why I added it), but then I found out it actually does not work as a UDF at all :/

@mateiz (Contributor) commented Jun 21, 2018

Yeah, the Spark ML model is different. The way I envision it, we would also add a java_function flavor that we can turn into a Spark UDF in the future for libraries like MLeap. I'd prefer not to require every library developer to know about every model consumer, which is why it would be better not to have a Spark UDF flavor if we can have more general ones.

from mlflow.version import VERSION as MLFLOW_VERSION


class CondaEnvironment(object):
Contributor:

Is this meant to be a public class/module or just an internal one?

Contributor Author:

I meant for this to be public. Environments come up all the time, and I thought it would be nice to have a convenience wrapper around them.

Contributor:

I don't think we should make it a class for now, mostly because there are other things in Conda that we don't capture here yet, and it creates a maintenance burden for us. Can you switch the Spark functions to take the Conda environment in a file like we do in pyfunc for now? It's also inconsistent right now because Spark takes a CondaEnvironment but our other functions take a file.

Contributor:

(And make this class private if we still want to use it, or just hold onto it to have another PR to add it later.)
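
A sketch of the file-based call shape being suggested, assuming the save function grows a conda_env path parameter like pyfunc's (module and argument names are illustrative; the module is still called mlflow.sparkml at this point in the PR):

import mlflow.spark

# Hypothetical: pass a path to a conda environment YAML file instead of a
# CondaEnvironment object, matching the pyfunc convention. spark_model is
# a fitted pyspark.ml PipelineModel.
mlflow.spark.save_model(spark_model, path="spark_model_dir", conda_env="conda.yaml")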

@@ -141,14 +141,14 @@ def _get_code_dirs(src_code_path, dst_code_path=None):
and not x == "__pycache__"]


def spark_udf(spark, path, run_id=None, result_type="double"):
def load_spark_udf(spark, path):
Contributor:

Why does this not take run_id anymore?

Contributor Author:

Oh, I meant to remove this one, I think. I will double-check, but I think it's a leftover from when I had spark_udf as a meta-flavor and flavors supporting spark_udf mode would provide a load_spark_udf method.


def load_pyfunc(path):
"""
Load the model as PuFunc.
Contributor:

Typo

Contributor Author:

Good catch.

@@ -0,0 +1,116 @@
"""
Sample MLflow integration for SparkML models.
Contributor:

Shouldn't say "sample" anymore

Contributor Author:

Good catch.

@codecov-io commented Jun 26, 2018

Codecov Report

Merging #72 into master will decrease coverage by 0.17%.
The diff coverage is 78.75%.


@@            Coverage Diff             @@
##           master      #72      +/-   ##
==========================================
- Coverage   50.21%   50.03%   -0.18%     
==========================================
  Files          87       89       +2     
  Lines        4192     4271      +79     
==========================================
+ Hits         2105     2137      +32     
- Misses       2087     2134      +47
Impacted Files Coverage Δ
mlflow/sklearn.py 78.68% <ø> (ø) ⬆️
mlflow/sagemaker/__init__.py 18.54% <ø> (-11.3%) ⬇️
mlflow/sagemaker/container/__init__.py 0% <0%> (ø) ⬆️
mlflow/utils/environment.py 100% <100%> (ø)
mlflow/utils/__init__.py 62.5% <100%> (+5.35%) ⬆️
mlflow/models/__init__.py 79.31% <33.33%> (-20.69%) ⬇️
mlflow/spark.py 85.41% <85.41%> (ø)
mlflow/utils/file_utils.py 75.86% <0%> (-14.66%) ⬇️
... and 1 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data


Log model using supplied flavor.

:param artifact_path: RUN-relative path identifying this model.
:param flavor: Flavor that can save the model.
Contributor:

Can you document the type of flavor? It's not obvious.

Contributor Author:

OK, fair point. I'll make CondaEnvironment private (I still need something to print out the environment with the pyspark and Python versions).

"""
Log model using supplied flavor.

:param artifact_path: RUN-relative path identifying this model.
Contributor:

RUN -> Run?

from mlflow import tracking


def load_udf(spark, path, run_id=None, result_type='double'):
Contributor:

Why did we move this to a new module now? Seems like it should stay back in pyfunc.

Contributor Author:

Oh, I thought you asked me to move it some time ago; I probably misunderstood. I'll move it back to pyfunc.


:param spark_model: Model to be saved.
:param path: Local path where the model is to be saved.
:param mlflow_model: MLflow model config.
Contributor:

mlflow_model is to add this to an existing model, right? You should document that and maybe also make the parameter name consistent in sklearn.save_model since it's called model there. Also I'd move this to be the last argument since it seems unlikely that users would directly pass this (since they'll use log_model instead).

Contributor Author:

Yes, I'll add documentation. I actually renamed the sklearn one too; I thought mlflow_model was clearer than just model.
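
A sketch of the resulting signature, with mlflow_model documented and moved to the end of the parameter list (defaults and wording are illustrative):

from mlflow.models import Model


def save_model(spark_model, path, conda_env=None, jars=None, mlflow_model=Model()):
    """
    Save a SparkML PipelineModel to a local path.

    :param mlflow_model: Existing :py:class:`mlflow.models.Model` config to which
        this flavor is added; most callers go through log_model instead of
        passing this directly.
    """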

FLAVOR_NAME = "sparkml"


def log_model(spark_model, artifact_path, env, jars=None):
Contributor:

Why is env required here? Also document its type.

Contributor Author:

Good catch, it should not be.

return SparkMLModel(spark, PipelineModel.load(path))


class SparkMLModel(object):
Contributor:

Do we want this to be a public class? It seems like an internal utility, so make it private for now. Also, I'd call it _PyFuncModelWrapper or something to make it clear that this is for the PyFunc interface.

Contributor Author:

Yeah that makes sense.
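
Roughly what the renamed wrapper could look like; a sketch of the pyfunc-facing interface, with the predict body illustrative:

class _PyFuncModelWrapper(object):
    """Wrap a SparkML PipelineModel so it can be scored through the pyfunc
    predict(pandas_df) interface."""

    def __init__(self, spark, spark_model):
        self.spark = spark
        self.spark_model = spark_model

    def predict(self, pandas_df):
        # Convert the pandas input to a Spark DataFrame, run the pipeline,
        # and return the 1-d prediction column as a list.
        spark_df = self.spark.createDataFrame(pandas_df)
        return [row.prediction for row in
                self.spark_model.transform(spark_df).select("prediction").collect()]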

Wrapper around SparkML PipelineModel providing interface for scoring pandas DataFrame.
"""

def __init__(self, spark, transformer):
Contributor:

Should we call it transformer or just spark_model here?

Contributor Author:

Hm, I thought transformer was more descriptive, but I am fine with calling it spark_model if you prefer that.


PYTHON_VERSION = "{major}.{minor}.{micro}".format(major=version_info.major,
minor=version_info.minor,
micro=version_info.micro)
Contributor:

This can be lower in the file, right? After the imports.

.travis.yml Outdated
@@ -31,6 +31,7 @@ install:
- pip install --upgrade pip
- pip install .
- pip install -r dev-requirements.txt
- mlflow sagemaker build-and-push-container --no-push --mlflow-home ./
Contributor:

Is this a requirement for tests or is this meant to be a test? Maybe it should go into the tox files so people get it when testing locally too? Or if it's meant to be a test, then we can add it as a test case.

Contributor Author:

Good point. It is a requirement for the tests. Previously I was building the container in the SageMaker test, but now I want to test other models in the SageMaker container too (e.g., testing the Spark model revealed missing Java), and I don't want to rebuild the container every time, as it takes a while.

Adding it to the tox files makes sense, but I also need it to see the local mlflow project. I'll look into it.

if FLAVOR_NAME not in m.flavors:
    raise Exception("Model does not have {} flavor".format(FLAVOR_NAME))
conf = m.flavors[FLAVOR_NAME]
return PipelineModel.load(conf['model_data'])
Contributor:

Is model_data a relative path within the artifact directory? If so, it seems that we should join path with conf[model_data] here, right? Can we have a test for that too?
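
Presumably something like the following, assuming model_data is stored relative to the model directory:

import os

conf = m.flavors[FLAVOR_NAME]
# Resolve model_data against the artifact directory instead of treating
# it as an absolute path.
return PipelineModel.load(os.path.join(path, conf["model_data"]))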

:param path: Local path
:return: The model as PyFunc.
"""
spark = pyspark.sql.SparkSession.builder.config(key="spark.python.worker.reuse", value=True) \
Contributor:

Nit: do we need to specify keyword args key and value here? It looks a bit weird.
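
That is, the positional form being suggested (other builder options unchanged):

import pyspark

# SparkSession.Builder.config accepts the key and value positionally.
spark = pyspark.sql.SparkSession.builder \
    .config("spark.python.worker.reuse", True) \
    .getOrCreate()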

# def _create_conda_env_file(path, name, channels = None, conda_deps=None, pip_deps=None):
# d = dependencies={'python': PYTHON_VERSION, "pip": {"mlflow": MLFLOW_VERSION}})
# with open(path, 'w') as out:
# yaml.safe_dump(d, stream=out, default_flow_style=False)
Contributor:

Delete this commented-out code if we don't need it.

dependencies:"""


def _mlflow_conda_env(path, additional_conda_deps=None, additional_pip_deps=None):
Contributor:

Add a short doc comment to say what this does.
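
For instance, a short docstring along these lines (wording illustrative):

def _mlflow_conda_env(path, additional_conda_deps=None, additional_pip_deps=None):
    """Write a conda environment YAML file to ``path`` pinning the current Python
    and MLflow versions, plus any additional conda and pip dependencies."""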

print("model_path", model_path)
assembler = VectorAssembler(
inputCols=iris.feature_names,
outputCol="features")
Contributor:

These would fit on 1 line
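
That is:

assembler = VectorAssembler(inputCols=iris.feature_names, outputCol="features")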

print("")
print(dir(tmpdir))
print(pandas_df)
print(spark_df.show())
Contributor:

Probably should get rid of the prints by default in tests

Contributor Author:

True, will do.

@@ -0,0 +1,122 @@
"""
Contributor:

What do you think about just calling this module spark.py? Nobody else really uses the name sparkml and there's only one ML library in Spark.

Contributor Author:

Yeah makes sense. I will rename it.

mlflow/spark.py Outdated
versions.
:param jars: List of jars needed by the model.
"""
return Model.log(artifact_path=artifact_path, flavor=mlflow.sparkml, spark_model=spark_model,
Contributor:

Should this be flavor=mlflow.spark now?

Contributor Author:

Yes, good catch. And it should be tested too!
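
The corrected call, sketched with the jars keyword taken from the documented parameters (any further arguments elided in the quoted snippet stay as they were):

import mlflow.spark
from mlflow.models import Model


def log_model(spark_model, artifact_path, jars=None):
    # The fix: reference the renamed module (mlflow.sparkml -> mlflow.spark).
    return Model.log(artifact_path=artifact_path, flavor=mlflow.spark,
                     spark_model=spark_model, jars=jars)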

@mateiz merged commit 541b2b3 into mlflow:master on Jul 8, 2018

@mateiz (Contributor) commented Jul 8, 2018

Looks good, thanks!
