Initial commit for the SparkML model export. #72
Conversation
Hey Tomas, instead of adding a new spark_udf flavor here, let's just add a runnable_in_spark: false flag in the python_function flavor to capture this case. We don't want to make everyone write 2 flavors if their library actually works anywhere in Python. (In fact this one might also work as a UDF if they create a second SparkContext, but it would just be weird.)
Oh ok. My motivation for turning it into a separate flavor was to allow you to set other properties, such as using a different flavor (e.g. with MLeap you might want to use that instead of a PyFunc). Initially I thought the SparkML model would be like that (and that's why I added it), but then I found out it actually does not work as a UDF at all :/
Yeah, the Spark ML model is different. The way I envision it, we would also add a java_function flavor that we can turn into a Spark UDF in the future for libraries like MLeap. I'd prefer not to require every library developer to know about every model consumer, which is why it would be better not to have a Spark UDF flavor if we can have more general ones.
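For concreteness, the flag being proposed would live in a model's MLmodel flavors section, roughly like this sketch (the `runnable_in_spark` field and the version values are hypothetical, taken only from this discussion, not from any existing MLflow format):

```yaml
# Hypothetical MLmodel flavors section; runnable_in_spark is the flag
# proposed in this thread, not an existing MLflow field.
flavors:
  python_function:
    loader_module: mlflow.sparkml
    runnable_in_spark: false
  sparkml:
    pyspark_version: 2.3.0
    model_data: sparkml
```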
mlflow/environment.py
Outdated
from mlflow.version import VERSION as MLFLOW_VERSION


class CondaEnvironment(object):
Is this meant to be a public class/module or just an internal one?
I meant for this to be public. Environments come up all the time and I thought it would be nice to have a convenience wrapper around it.
I don't think we should make it a class for now, mostly because there are other things in Conda that we don't capture here yet, and it creates a maintenance burden for us. Can you switch the Spark functions to take the Conda environment in a file like we do in pyfunc for now? It's also inconsistent right now because Spark takes a CondaEnvironment but our other functions take a file.
(And make this class private if we still want to use it, or just hold onto it to have another PR to add it later.)
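For reference, the file-based environment being suggested here is an ordinary conda environment spec. A sketch of what such a file might contain (the exact package pins are illustrative assumptions, not from this PR):

```yaml
# Illustrative conda environment file passed by path, as pyfunc expects.
name: mlflow-env
channels:
  - defaults
dependencies:
  - python=3.6
  - pyspark
  - pip:
    - mlflow
```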
mlflow/pyfunc/__init__.py
Outdated
@@ -141,14 +141,14 @@ def _get_code_dirs(src_code_path, dst_code_path=None):
                     and not x == "__pycache__"]


-def spark_udf(spark, path, run_id=None, result_type="double"):
+def load_spark_udf(spark, path):
Why does this not take run_id anymore?
Oh, I meant to remove this one, I think. I will double check, but I think it's a leftover from when I had spark_udf as a meta-flavor and flavors supporting spark_udf mode would have a load_spark_udf method.
mlflow/sparkml.py
Outdated
def load_pyfunc(path):
    """
    Load the model as PuFunc.
Typo
Good catch.
mlflow/sparkml.py
Outdated
@@ -0,0 +1,116 @@
"""
Sample MLflow integration for SparkML models.
Shouldn't say "sample" anymore
Good catch.
Codecov Report
@@            Coverage Diff             @@
##           master      #72      +/-   ##
==========================================
- Coverage   50.21%   50.03%   -0.18%
==========================================
  Files          87       89       +2
  Lines        4192     4271      +79
==========================================
+ Hits         2105     2137      +32
- Misses       2087     2134      +47

Continue to review full report at Codecov.
mlflow/models/__init__.py
Outdated
Log model using supplied flavor.

:param artifact_path: RUN-relative path identifying this model.
:param flavor: Flavor that can save the model.
Can you document the type of `flavor`? It's not obvious.
Ok, fair point, I'll make the CondaEnvironment private (I still need something to print out the environment with the pyspark and python versions).
mlflow/models/__init__.py
Outdated
""" | ||
Log model using supplied flavor. | ||
|
||
:param artifact_path: RUN-relative path identifying this model. |
RUN -> Run?
mlflow/spark_model/__init__.py
Outdated
from mlflow import tracking


def load_udf(spark, path, run_id=None, result_type='double'):
Why did we move this to a new module now? Seems like it should stay back in pyfunc.
Oh, I thought you asked me to move it some time ago, I probably misunderstood. I'll move it back to pyfunc.
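Whichever module it ends up in, the core of a pyfunc-backed Spark UDF is just a function that gathers the incoming columns into a pandas DataFrame and delegates to the model's predict. A minimal sketch of that wrapper, where `make_batch_predictor` is a hypothetical helper name (the real spark_udf also resolves run_id and declares a result type):

```python
import pandas as pd

def make_batch_predictor(predict):
    """Wrap a pandas-DataFrame-based predict function so it can serve as
    the body of a Spark UDF (hypothetical helper, for illustration only)."""
    def udf_body(*cols):
        # Each column arrives as a pandas Series; assemble them into a
        # DataFrame with positional column names, then score the batch.
        pdf = pd.DataFrame({i: col for i, col in enumerate(cols)})
        return pd.Series(predict(pdf))
    return udf_body
```

In Spark this body would be registered via pandas_udf; here it is shown standalone so the batching logic is visible.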
mlflow/sparkml.py
Outdated
:param spark_model: Model to be saved.
:param path: Local path where the model is to be save.
:param mlflow_model: MLflow model config.
`mlflow_model` is to add this to an existing model, right? You should document that, and maybe also make the parameter name consistent with `sklearn.save_model`, since it's called `model` there. Also I'd move this to be the last argument, since it seems unlikely that users would pass it directly (they'll use log_model instead).
Yes, I'll add documentation. I actually renamed the sklearn one too; I thought mlflow_model was clearer than just model.
mlflow/sparkml.py
Outdated
FLAVOR_NAME = "sparkml"


def log_model(spark_model, artifact_path, env, jars=None):
Why is env required here? Also document its type.
Good catch, it should not be.
mlflow/sparkml.py
Outdated
return SparkMLModel(spark, PipelineModel.load(path))


class SparkMLModel(object):
Do we want this to be a public class? Seems like an internal utility so make it private for now. Also I'd call it _PyFuncModelWrapper or something to make it clear that this is for the PyFunc interface.
Yeah that makes sense.
mlflow/sparkml.py
Outdated
Wrapper around SparkML PipelineModel providing interface for scoring pandas DataFrame.
"""

def __init__(self, spark, transformer):
Should we call it transformer or just spark_model here?
Hm, I thought transformer was more descriptive, but I am fine calling it spark_model if you prefer that.
mlflow/utils/__init__.py
Outdated
PYTHON_VERSION = "{major}.{minor}.{micro}".format(major=version_info.major,
                                                  minor=version_info.minor,
                                                  micro=version_info.micro)
This can be lower in the file, right? After the imports.
…er container (Turns out java was missing in the container.)
…vis install script.
.travis.yml
Outdated
@@ -31,6 +31,7 @@ install:
   - pip install --upgrade pip
   - pip install .
   - pip install -r dev-requirements.txt
+  - mlflow sagemaker build-and-push-container --no-push --mlflow-home ./
Is this a requirement for tests or is this meant to be a test? Maybe it should go into the `tox` files so people get it when testing locally too? Or if it's meant to be a test, then we can add it as a test case.
Good point. It is a requirement for tests. Previously I was building the container in the SageMaker test, but now I want to test other models in the SageMaker container too (e.g. testing the Spark model revealed missing Java), and I don't want to rebuild the container all the time since that takes a while.
Adding it to the tox files makes sense, but I also need to see the local mlflow project. I'll look into it.
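The tox move being discussed could look roughly like this fragment (a sketch only; the env name and surrounding commands are assumptions about this repo's tox.ini, not taken from the PR):

```ini
; Hypothetical tox.ini fragment: build the SageMaker container once per
; test environment instead of rebuilding it inside individual tests.
[testenv]
commands =
    mlflow sagemaker build-and-push-container --no-push --mlflow-home {toxinidir}
    pytest tests
```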
…ent from spark ml model save method. For now the user is responsible for providing a conda environment file with pyspark. Moved building of the sagemaker docker container to tox.ini (let's see if it works).
…le was stored so it can be passed as an argument to save_model function.
…arkML -> Spark MLlib)
… specify python version in tutorial example - should investigate why)
mlflow/sparkml.py
Outdated
if FLAVOR_NAME not in m.flavors:
    raise Exception("Model does not have {} flavor".format(FLAVOR_NAME))
conf = m.flavors[FLAVOR_NAME]
return PipelineModel.load(conf['model_data'])
Is `model_data` a relative path within the artifact directory? If so, it seems that we should join `path` with `conf[model_data]` here, right? Can we have a test for that too?
mlflow/sparkml.py
Outdated
:param path: Local path
:return: The model as PyFunc.
"""
spark = pyspark.sql.SparkSession.builder.config(key="spark.python.worker.reuse", value=True) \
Nit: do we need to specify the keyword args `key` and `value` here? It looks a bit weird.
mlflow/utils/__init__.py
Outdated
# def _create_conda_env_file(path, name, channels = None, conda_deps=None, pip_deps=None):
#     d = dependencies={'python': PYTHON_VERSION, "pip": {"mlflow": MLFLOW_VERSION}})
#     with open(path, 'w') as out:
#         yaml.safe_dump(d, stream=out, default_flow_style=False)
Delete this commented out code if we don't need it.
dependencies:""" | ||
|
||
|
||
def _mlflow_conda_env(path, additional_conda_deps=None, additional_pip_deps=None): |
Add a short doc comment to say what this does.
print("model_path", model_path) | ||
assembler = VectorAssembler( | ||
inputCols=iris.feature_names, | ||
outputCol="features") |
These would fit on 1 line
print("") | ||
print(dir(tmpdir)) | ||
print(pandas_df) | ||
print(spark_df.show()) |
Probably should get rid of the prints by default in tests
True, will do.
mlflow/sparkml.py
Outdated
@@ -0,0 +1,122 @@
"""
What do you think about just calling this module `spark.py`? Nobody else really uses the name `sparkml`, and there's only one ML library in Spark.
Yeah makes sense. I will rename it.
…d need to be joined with the model path
mlflow/spark.py
Outdated
versions.
:param jars: List of jars needed by the model.
"""
return Model.log(artifact_path=artifact_path, flavor=mlflow.sparkml, spark_model=spark_model,
Should this be `flavor=mlflow.spark` now?
Yes, good catch. And it should be tested too!
Looks good, thanks!
Initial stab at exporting SparkML models. Also, as part of this, moved spark_udf outside of pyfunc and created a separate flavor for it. The flavor is required because not all pyfuncs have to support Spark UDF mode; e.g. the pyfunc generated by the SparkML flavor does not.
A few open questions / TODOs: