Initial commit for the SparkML model export. #72

Merged: 33 commits merged into mlflow:master on Jul 8, 2018

Conversation

tomasatdatabricks (Contributor)

Initial stab at exporting SparkML models. As part of this, I also moved spark_udf outside of pyfunc and created a separate flavor for it. The flavor is required because not all pyfuncs have to support Spark UDF mode; e.g., the pyfunc generated by the SparkML flavor does not.
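
For context, a rough sketch of the Spark UDF usage being discussed, assuming the pre-PR mlflow.pyfunc.spark_udf signature that appears later in this diff (the model path and column names are illustrative):

import mlflow.pyfunc
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Load a logged pyfunc model as a Spark UDF; "my_model" and the feature
# column names below are placeholders, and df is an existing Spark
# DataFrame containing those feature columns.
predict = mlflow.pyfunc.spark_udf(spark, "my_model")
df = df.withColumn("prediction", predict("feature1", "feature2"))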

A few open questions / TODOs:

  1. How should jar dependencies be specified? Unimplemented for now.
  2. What should the model output be? For now we only produce a 1-d prediction, to be consistent with other flavors. We should probably revisit that for all the models in one go.
  3. Maybe we should add a load-model-as-spark-transformer interface in addition to the existing spark_udf.

@mateiz (Contributor) commented Jun 21, 2018

Hey Tomas, instead of adding a new spark_udf flavor here, let's just add a runnable_in_spark: false flag in the python_function flavor to capture this case. We don't want to make everyone write 2 flavors if their library actually works anywhere in Python. (In fact this one might also work as a UDF if they create a second SparkContext, but it would just be weird).
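
A sketch of how a flavor author might record that flag under this proposal, using the existing Model.add_flavor helper; runnable_in_spark is the key proposed in this comment, not an existing one, and the loader module name is illustrative:

from mlflow.models import Model

mlflow_model = Model()
# Hypothetical: mark the pyfunc flavor as not usable inside a Spark UDF.
# "runnable_in_spark" is the flag proposed above, not an existing key.
mlflow_model.add_flavor("python_function",
                        loader_module="mlflow.spark",  # illustrative
                        runnable_in_spark=False)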

@tomasatdatabricks (Contributor, Author)

Oh, OK. My motivation for turning it into a separate flavor was to allow setting other properties, such as using a different underlying flavor (e.g., with MLeap you might want to use that instead of a pyfunc).

Initially I thought the SparkML model would be that (and that's why I added it), but then I found out it actually does not work as a UDF at all :/

@mateiz (Contributor) commented Jun 21, 2018

Yeah, the Spark ML model is different. The way I envision it, we would also add a java_function flavor that we can turn into a Spark UDF in the future for libraries like MLeap. I'd prefer not to require every library developer to know about every model consumer, which is why it would be better not to have a Spark UDF flavor if we can have more general ones.

from mlflow.version import VERSION as MLFLOW_VERSION


class CondaEnvironment(object):
Contributor:

Is this meant to be a public class/module or just an internal one?

Contributor Author:

I meant for this to be public. Environments come up all the time, and I thought it would be nice to have a convenience wrapper around them.

Contributor:

I don't think we should make it a class for now, mostly because there are other things in Conda that we don't capture here yet, and it creates a maintenance burden for us. Can you switch the Spark functions to take the Conda environment in a file like we do in pyfunc for now? It's also inconsistent right now because Spark takes a CondaEnvironment but our other functions take a file.

Contributor:

(And make this class private if we still want to use it, or just hold onto it to have another PR to add it later.)
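
A sketch of the file-based call shape being suggested, assuming the save function grows a conda_env path parameter like pyfunc's (module and argument names are illustrative; the module is still called mlflow.sparkml at this point in the PR):

import mlflow.spark

# Hypothetical: pass a path to a conda environment YAML file instead of a
# CondaEnvironment object, matching the pyfunc convention. spark_model is
# a fitted pyspark.ml PipelineModel.
mlflow.spark.save_model(spark_model, path="spark_model_dir", conda_env="conda.yaml")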

@@ -141,14 +141,14 @@ def _get_code_dirs(src_code_path, dst_code_path=None):
and not x == "__pycache__"]


def spark_udf(spark, path, run_id=None, result_type="double"):
def load_spark_udf(spark, path):
Contributor:

Why does this not take run_id anymore?

Contributor Author:

Oh, I meant to remove this one, I think. I will double-check, but I think it's a leftover from when I had spark_udf as a meta-flavor and flavors supporting spark_udf mode would provide a load_spark_udf method.


def load_pyfunc(path):
"""
Load the model as PuFunc.
Contributor:

Typo

Contributor Author:

Good catch.

@@ -0,0 +1,116 @@
"""
Sample MLflow integration for SparkML models.
Contributor:

Shouldn't say "sample" anymore

Contributor Author:

Good catch.

@codecov-io commented Jun 26, 2018

Codecov Report

Merging #72 into master will decrease coverage by 0.17%.
The diff coverage is 78.75%.


@@            Coverage Diff             @@
##           master      #72      +/-   ##
==========================================
- Coverage   50.21%   50.03%   -0.18%     
==========================================
  Files          87       89       +2     
  Lines        4192     4271      +79     
==========================================
+ Hits         2105     2137      +32     
- Misses       2087     2134      +47
Impacted Files Coverage Δ
mlflow/sklearn.py 78.68% <ø> (ø) ⬆️
mlflow/sagemaker/__init__.py 18.54% <ø> (-11.3%) ⬇️
mlflow/sagemaker/container/__init__.py 0% <0%> (ø) ⬆️
mlflow/utils/environment.py 100% <100%> (ø)
mlflow/utils/__init__.py 62.5% <100%> (+5.35%) ⬆️
mlflow/models/__init__.py 79.31% <33.33%> (-20.69%) ⬇️
mlflow/spark.py 85.41% <85.41%> (ø)
mlflow/utils/file_utils.py 75.86% <0%> (-14.66%) ⬇️
... and 1 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data


Log model using supplied flavor.

:param artifact_path: RUN-relative path identifying this model.
:param flavor: Flavor that can save the model.
Contributor:

Can you document the type of flavor? It's not obvious.

Contributor Author:

OK, fair point. I'll make CondaEnvironment private (I still need something to print out the environment with the pyspark and Python versions).

"""
Log model using supplied flavor.

:param artifact_path: RUN-relative path identifying this model.
Contributor:

RUN -> Run?

from mlflow import tracking


def load_udf(spark, path, run_id=None, result_type='double'):
Contributor:

Why did we move this to a new module now? Seems like it should stay back in pyfunc.

Contributor Author:

Oh, I thought you asked me to move it some time ago; I probably misunderstood. I'll move it back to pyfunc.


:param spark_model: Model to be saved.
:param path: Local path where the model is to be saved.
:param mlflow_model: MLflow model config.
Contributor:

mlflow_model is to add this to an existing model, right? You should document that and maybe also make the parameter name consistent in sklearn.save_model since it's called model there. Also I'd move this to be the last argument since it seems unlikely that users would directly pass this (since they'll use log_model instead).

Contributor Author:

Yes, I'll add documentation. I actually renamed the sklearn one too; I thought mlflow_model was clearer than just model.
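
A sketch of the resulting signature, with mlflow_model documented and moved to the end of the parameter list (defaults and wording are illustrative):

from mlflow.models import Model


def save_model(spark_model, path, conda_env=None, jars=None, mlflow_model=Model()):
    """
    Save a SparkML PipelineModel to a local path.

    :param mlflow_model: Existing :py:class:`mlflow.models.Model` config to which
        this flavor is added; most callers go through log_model instead of
        passing this directly.
    """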

FLAVOR_NAME = "sparkml"


def log_model(spark_model, artifact_path, env, jars=None):
Contributor:

Why is env required here? Also document its type.

Contributor Author:

Good catch, it should not be.

return SparkMLModel(spark, PipelineModel.load(path))


class SparkMLModel(object):
Contributor:

Do we want this to be a public class? It seems like an internal utility, so make it private for now. Also, I'd call it _PyFuncModelWrapper or something to make it clear that this is for the PyFunc interface.

Contributor Author:

Yeah that makes sense.
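
Roughly what the renamed wrapper could look like; a sketch of the pyfunc-facing interface, with the predict body illustrative:

class _PyFuncModelWrapper(object):
    """Wrap a SparkML PipelineModel so it can be scored through the pyfunc
    predict(pandas_df) interface."""

    def __init__(self, spark, spark_model):
        self.spark = spark
        self.spark_model = spark_model

    def predict(self, pandas_df):
        # Convert the pandas input to a Spark DataFrame, run the pipeline,
        # and return the 1-d prediction column as a list.
        spark_df = self.spark.createDataFrame(pandas_df)
        return [row.prediction for row in
                self.spark_model.transform(spark_df).select("prediction").collect()]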

Wrapper around SparkML PipelineModel providing interface for scoring pandas DataFrame.
"""

def __init__(self, spark, transformer):
Contributor:

Should we call it transformer or just spark_model here?

Contributor Author:

Hm, I thought transformer was more descriptive, but I am fine with calling it spark_model if you prefer that.


PYTHON_VERSION = "{major}.{minor}.{micro}".format(major=version_info.major,
minor=version_info.minor,
micro=version_info.micro)
Contributor:

This can be lower in the file, right? After the imports.

.travis.yml Outdated
@@ -31,6 +31,7 @@ install:
- pip install --upgrade pip
- pip install .
- pip install -r dev-requirements.txt
- mlflow sagemaker build-and-push-container --no-push --mlflow-home ./
Contributor:

Is this a requirement for tests or is this meant to be a test? Maybe it should go into the tox files so people get it when testing locally too? Or if it's meant to be a test, then we can add it as a test case.

Contributor Author:

Good point. It is a requirement for the tests. Previously I was building the container in the SageMaker test, but now I want to test other models in the SageMaker container too (e.g., testing the Spark model revealed missing Java), and I don't want to rebuild the container every time, as it takes a while.

Adding it to the tox files makes sense, but I also need it to see the local mlflow project. I'll look into it.

if FLAVOR_NAME not in m.flavors:
    raise Exception("Model does not have {} flavor".format(FLAVOR_NAME))
conf = m.flavors[FLAVOR_NAME]
return PipelineModel.load(conf['model_data'])
Contributor:

Is model_data a relative path within the artifact directory? If so, it seems that we should join path with conf[model_data] here, right? Can we have a test for that too?
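
Presumably something like the following, assuming model_data is stored relative to the model directory:

import os

conf = m.flavors[FLAVOR_NAME]
# Resolve model_data against the artifact directory instead of treating
# it as an absolute path.
return PipelineModel.load(os.path.join(path, conf["model_data"]))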

:param path: Local path
:return: The model as PyFunc.
"""
spark = pyspark.sql.SparkSession.builder.config(key="spark.python.worker.reuse", value=True) \
Contributor:

Nit: do we need to specify keyword args key and value here? It looks a bit weird.
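
That is, the positional form being suggested (other builder options unchanged):

import pyspark

# SparkSession.Builder.config accepts the key and value positionally.
spark = pyspark.sql.SparkSession.builder \
    .config("spark.python.worker.reuse", True) \
    .getOrCreate()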

# def _create_conda_env_file(path, name, channels = None, conda_deps=None, pip_deps=None):
# d = dependencies={'python': PYTHON_VERSION, "pip": {"mlflow": MLFLOW_VERSION}})
# with open(path, 'w') as out:
# yaml.safe_dump(d, stream=out, default_flow_style=False)
Contributor:

Delete this commented-out code if we don't need it.

dependencies:"""


def _mlflow_conda_env(path, additional_conda_deps=None, additional_pip_deps=None):
Contributor:

Add a short doc comment to say what this does.
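
For instance, a short docstring along these lines (wording illustrative):

def _mlflow_conda_env(path, additional_conda_deps=None, additional_pip_deps=None):
    """Write a conda environment YAML file to ``path`` pinning the current Python
    and MLflow versions, plus any additional conda and pip dependencies."""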

print("model_path", model_path)
assembler = VectorAssembler(
inputCols=iris.feature_names,
outputCol="features")
Contributor:

These would fit on 1 line
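
That is:

assembler = VectorAssembler(inputCols=iris.feature_names, outputCol="features")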

print("")
print(dir(tmpdir))
print(pandas_df)
print(spark_df.show())
Contributor:

Probably should get rid of the prints by default in tests

Contributor Author:

True, will do.

@@ -0,0 +1,122 @@
"""
Contributor:

What do you think about just calling this module spark.py? Nobody else really uses the name sparkml and there's only one ML library in Spark.

Contributor Author:

Yeah makes sense. I will rename it.

mlflow/spark.py Outdated
versions.
:param jars: List of jars needed by the model.
"""
return Model.log(artifact_path=artifact_path, flavor=mlflow.sparkml, spark_model=spark_model,
Contributor:

Should this be flavor=mlflow.spark now?

Contributor Author:

Yes, good catch. And it should be tested too!
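
The corrected call, sketched with the jars keyword taken from the documented parameters (any further arguments elided in the quoted snippet stay as they were):

import mlflow.spark
from mlflow.models import Model


def log_model(spark_model, artifact_path, jars=None):
    # The fix: reference the renamed module (mlflow.sparkml -> mlflow.spark).
    return Model.log(artifact_path=artifact_path, flavor=mlflow.spark,
                     spark_model=spark_model, jars=jars)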

@mateiz merged commit 541b2b3 into mlflow:master on Jul 8, 2018

@mateiz (Contributor) commented Jul 8, 2018

Looks good, thanks!
