WIP
mhamilton723 committed Oct 13, 2021
1 parent 5a6933a commit 44402cd
Showing 314 changed files with 829 additions and 1,098 deletions.
2 changes: 1 addition & 1 deletion .chglog/CHANGELOG.tpl.md
@@ -27,7 +27,7 @@
{{ end -}}

## Acknowledgements
We would like to acknowledge the developers and contributors, both internal and external who helped create this version of MMLSpark.\n
We would like to acknowledge the developers and contributors, both internal and external who helped create this version of SynapseML.\n

{{ end -}}

2 changes: 1 addition & 1 deletion .chglog/config.yml
@@ -2,7 +2,7 @@ style: github
template: CHANGELOG.tpl.md
info:
title: CHANGELOG
repository_url: https://github.com/Azure/mmlspark
repository_url: https://github.com/Microsoft/SynapseML
options:
commit_groups:
title_maps:
4 changes: 2 additions & 2 deletions .github/ISSUE_TEMPLATE/bug_report.md
@@ -17,7 +17,7 @@ Steps to reproduce the behavior, code snippets encouraged
A clear and concise description of what you expected to happen.

**Info (please complete the following information):**
- MMLSpark Version: [e.g. v0.17]
- SynapseML Version: [e.g. v0.17]
- Spark Version [e.g. 2.4.3]
- Spark Platform [e.g. Databricks]

@@ -26,7 +26,7 @@ A clear and concise description of what you expected to happen.
Please post the stacktrace here if applicable
```

If the bug pertains to a specific feature please tag the appropriate [CODEOWNER](https://github.com/Azure/mmlspark/blob/master/CODEOWNERS) for better visibility
If the bug pertains to a specific feature please tag the appropriate [CODEOWNER](https://github.com/Microsoft/SynapseML/blob/master/CODEOWNERS) for better visibility

**Additional context**
Add any other context about the problem here.
2 changes: 1 addition & 1 deletion .github/config.yml
@@ -24,7 +24,7 @@ newPRWelcomeComment: >
- `style: Remove nulls from CNTKModel`
- `test: Add test coverage for CNTKModel`
Make sure to check out the [developer guide](https://github.com/Azure/mmlspark/blob/master/CONTRIBUTING.md) for guidance on testing your change.
Make sure to check out the [developer guide](https://github.com/Microsoft/SynapseML/blob/master/CONTRIBUTING.md) for guidance on testing your change.
# Configuration for first-pr-merge - https://github.com/behaviorbot/first-pr-merge

8 changes: 4 additions & 4 deletions CONTRIBUTING.md
@@ -1,4 +1,4 @@
## Interested in contributing to MMLSpark? We're excited to work with you.
## Interested in contributing to SynapseML? We're excited to work with you.

### You can contribute in many ways:

@@ -32,7 +32,7 @@ this process:

#### Implement your contribution

- Fork the MMLSpark repository.
- Fork the SynapseML repository.
- Implement your algorithm in Scala, using our wrapper generation mechanism to
produce PySpark bindings.
- Use SparkML `PipelineStage`s so your algorithm can be used as a part of
@@ -41,7 +41,7 @@ this process:
- Implement model saving and loading by extending SparkML `MLReadable`.
- Use good Scala style.
- Binary dependencies should be on Maven Central.
- See this [pull request](https://github.com/Azure/mmlspark/pull/22) for an
- See this [pull request](https://github.com/Microsoft/SynapseML/pull/22) for an
example contribution.

#### Implement tests
@@ -65,7 +65,7 @@ this process:

- In most cases, you should squash your commits into one.
- Open a pull request, and link it to the discussion issue you created earlier.
- An MMLSpark core team member will trigger a build to test your changes.
- An SynapseML core team member will trigger a build to test your changes.
- Fix any build failures. (The pull request will have comments from the build
with useful links.)
- Wait for code reviews from core team members and others.
78 changes: 39 additions & 39 deletions README.md
@@ -1,28 +1,28 @@
![MMLSpark](https://mmlspark.azureedge.net/icons/mmlspark.svg)
![SynapseML](https://mmlspark.azureedge.net/icons/mmlspark.svg)

# Microsoft Machine Learning for Apache Spark

[![Build Status](https://msdata.visualstudio.com/A365/_apis/build/status/microsoft.SynapseML?branchName=master)](https://msdata.visualstudio.com/A365/_build/latest?definitionId=17563&branchName=master) [![codecov](https://codecov.io/gh/Azure/mmlspark/branch/master/graph/badge.svg)](https://codecov.io/gh/Azure/mmlspark) [![Gitter](https://badges.gitter.im/Microsoft/MMLSpark.svg)](https://gitter.im/Microsoft/MMLSpark?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge)
[![Build Status](https://msdata.visualstudio.com/A365/_apis/build/status/microsoft.SynapseML?branchName=master)](https://msdata.visualstudio.com/A365/_build/latest?definitionId=17563&branchName=master) [![codecov](https://codecov.io/gh/Microsoft/SynapseML/branch/master/graph/badge.svg)](https://codecov.io/gh/Microsoft/SynapseML) [![Gitter](https://badges.gitter.im/Microsoft/MMLSpark.svg)](https://gitter.im/Microsoft/MMLSpark?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge)

[![Release Notes](https://img.shields.io/badge/release-notes-blue)](https://github.com/Azure/mmlspark/releases) [![Scala Docs](https://img.shields.io/static/v1?label=api%20docs&message=scala&color=blue&logo=scala)](https://mmlspark.blob.core.windows.net/docs/1.0.0-rc4/scala/index.html#package) [![PySpark Docs](https://img.shields.io/static/v1?label=api%20docs&message=python&color=blue&logo=python)](https://mmlspark.blob.core.windows.net/docs/1.0.0-rc4/pyspark/index.html) [![Academic Paper](https://img.shields.io/badge/academic-paper-7fdcf7)](https://arxiv.org/abs/1810.08744)
[![Release Notes](https://img.shields.io/badge/release-notes-blue)](https://github.com/Microsoft/SynapseML/releases) [![Scala Docs](https://img.shields.io/static/v1?label=api%20docs&message=scala&color=blue&logo=scala)](https://mmlspark.blob.core.windows.net/docs/1.0.0-rc4/scala/index.html#package) [![PySpark Docs](https://img.shields.io/static/v1?label=api%20docs&message=python&color=blue&logo=python)](https://mmlspark.blob.core.windows.net/docs/1.0.0-rc4/pyspark/index.html) [![Academic Paper](https://img.shields.io/badge/academic-paper-7fdcf7)](https://arxiv.org/abs/1810.08744)

[![Version](https://img.shields.io/badge/version-1.0.0--rc4-blue)](https://github.com/Azure/mmlspark/releases) [![Snapshot Version](https://mmlspark.blob.core.windows.net/icons/badges/master_version3.svg)](#sbt)
[![Version](https://img.shields.io/badge/version-1.0.0--rc4-blue)](https://github.com/Microsoft/SynapseML/releases) [![Snapshot Version](https://mmlspark.blob.core.windows.net/icons/badges/master_version3.svg)](#sbt)


MMLSpark is an ecosystem of tools aimed towards expanding the distributed computing framework
SynapseML is an ecosystem of tools aimed towards expanding the distributed computing framework
[Apache Spark](https://github.com/apache/spark) in several new directions.
MMLSpark adds many deep learning and data science tools to the Spark ecosystem,
SynapseML adds many deep learning and data science tools to the Spark ecosystem,
including seamless integration of Spark Machine Learning pipelines with [Microsoft Cognitive Toolkit
(CNTK)](https://github.com/Microsoft/CNTK), [LightGBM](https://github.com/Microsoft/LightGBM) and
[OpenCV](http://www.opencv.org/). These tools enable powerful and highly-scalable predictive and analytical models
for a variety of datasources.

MMLSpark also brings new networking capabilities to the Spark Ecosystem. With the HTTP on Spark project, users
can embed **any** web service into their SparkML models. In this vein, MMLSpark provides easy to use
SynapseML also brings new networking capabilities to the Spark Ecosystem. With the HTTP on Spark project, users
can embed **any** web service into their SparkML models. In this vein, SynapseML provides easy to use
SparkML transformers for a wide variety of [Microsoft Cognitive Services](https://azure.microsoft.com/en-us/services/cognitive-services/). For production grade deployment, the Spark Serving project enables high throughput,
sub-millisecond latency web services, backed by your Spark cluster.

MMLSpark requires Scala 2.12, Spark 3.0+, and Python 3.6+.
SynapseML requires Scala 2.12, Spark 3.0+, and Python 3.6+.
See the API documentation [for
Scala](https://mmlspark.blob.core.windows.net/docs/1.0.0-rc4/scala/index.html#package) and [for
PySpark](https://mmlspark.blob.core.windows.net/docs/1.0.0-rc4/pyspark/index.html).
@@ -60,7 +60,7 @@ PySpark](https://mmlspark.blob.core.windows.net/docs/1.0.0-rc4/pyspark/index.htm

| <img width="150" src="https://mmlspark.blob.core.windows.net/graphics/emails/isolation forest 3.svg"> |<img width="150" src="https://mmlspark.blob.core.windows.net/graphics/emails/cyberml.svg"> | <img width="150" src="https://mmlspark.blob.core.windows.net/graphics/emails/conditional_knn.svg"> |
|:--:|:--:|:--:|
| **Isolation Forest on Spark** | [**CyberML**](https://github.com/Azure/mmlspark/blob/master/notebooks/CyberML%20-%20Anomalous%20Access%20Detection.ipynb) | **Conditional KNN** |
| **Isolation Forest on Spark** | [**CyberML**](https://github.com/Microsoft/SynapseML/blob/master/notebooks/CyberML%20-%20Anomalous%20Access%20Detection.ipynb) | **Conditional KNN** |
| Distributed Nonlinear Outlier Detection | Machine Learning Tools for Cyber Security | Scalable KNN Models with Conditional Queries |


@@ -71,7 +71,7 @@ PySpark](https://mmlspark.blob.core.windows.net/docs/1.0.0-rc4/pyspark/index.htm
- Fit a LightGBM classification or regression model on a biochemical dataset
([example 3]), to learn more check out the [LightGBM documentation
page](docs/lightgbm.md).
- Deploy a deep network as a distributed web service with [MMLSpark
- Deploy a deep network as a distributed web service with [SynapseML
Serving](docs/mmlspark-serving.md)
- Use web services in Spark with [HTTP on Apache Spark](docs/http.md)
- Use Bi-directional LSTMs from Keras for medical entity extraction
@@ -97,7 +97,7 @@ See our [notebooks](notebooks/) for all examples.

[example 4]: notebooks/TextAnalytics%20-%20Amazon%20Book%20Reviews.ipynb "Amazon Book Reviews - TextFeaturizer"

[example 5]: notebooks/HyperParameterTuning%20-%20Fighting%20Breast%20Cancer.ipynb "Hyperparameter Tuning with MMLSpark"
[example 5]: notebooks/HyperParameterTuning%20-%20Fighting%20Breast%20Cancer.ipynb "Hyperparameter Tuning with SynapseML"

[example 6]: notebooks/DeepLearning%20-%20CIFAR10%20Convolutional%20Network.ipynb "CIFAR10 CNTK CNN Evaluation"

@@ -134,22 +134,22 @@ scoredImages = cntkModel.transform(imagesWithLabels)
...
```

See [other sample notebooks](notebooks/) as well as the MMLSpark
See [other sample notebooks](notebooks/) as well as the SynapseML
documentation for [Scala](http://mmlspark.azureedge.net/docs/scala/) and
[PySpark](http://mmlspark.azureedge.net/docs/pyspark/).

## Setup and installation

### Python

To try out MMLSpark on a Python (or Conda) installation you can get Spark
To try out SynapseML on a Python (or Conda) installation you can get Spark
installed via pip with `pip install pyspark`. You can then use `pyspark` as in
the above example, or from python:

```python
import pyspark
spark = pyspark.sql.SparkSession.builder.appName("MyApp") \
.config("spark.jars.packages", "com.microsoft.ml.spark:mmlspark:1.0.0-rc4") \
.config("spark.jars.packages", "com.microsoft.synapse.ml:mmlspark:1.0.0-rc4") \
.config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven") \
.getOrCreate()
import mmlspark
@@ -161,47 +161,47 @@ If you are building a Spark application in Scala, add the following lines to
your `build.sbt`:

```scala
resolvers += "MMLSpark" at "https://mmlspark.azureedge.net/maven"
libraryDependencies += "com.microsoft.ml.spark" %% "mmlspark" % "1.0.0-rc4"
resolvers += "SynapseML" at "https://mmlspark.azureedge.net/maven"
libraryDependencies += "com.microsoft.synapse.ml" %% "mmlspark" % "1.0.0-rc4"

```

### Spark package

MMLSpark can be conveniently installed on existing Spark clusters via the
SynapseML can be conveniently installed on existing Spark clusters via the
`--packages` option, examples:

```bash
spark-shell --packages com.microsoft.ml.spark:mmlspark:1.0.0-rc4
pyspark --packages com.microsoft.ml.spark:mmlspark:1.0.0-rc4
spark-submit --packages com.microsoft.ml.spark:mmlspark:1.0.0-rc4 MyApp.jar
spark-shell --packages com.microsoft.synapse.ml:mmlspark:1.0.0-rc4
pyspark --packages com.microsoft.synapse.ml:mmlspark:1.0.0-rc4
spark-submit --packages com.microsoft.synapse.ml:mmlspark:1.0.0-rc4 MyApp.jar
```

This can be used in other Spark contexts too. For example, you can use MMLSpark
This can be used in other Spark contexts too. For example, you can use SynapseML
in [AZTK](https://github.com/Azure/aztk/) by [adding it to the
`.aztk/spark-defaults.conf`
file](https://github.com/Azure/aztk/wiki/PySpark-on-Azure-with-AZTK#optional-set-up-mmlspark).

### Databricks

To install MMLSpark on the [Databricks
To install SynapseML on the [Databricks
cloud](http://community.cloud.databricks.com), create a new [library from Maven
coordinates](https://docs.databricks.com/user-guide/libraries.html#libraries-from-maven-pypi-or-spark-packages)
in your workspace.

For the coordinates use: `com.microsoft.ml.spark:mmlspark:1.0.0-rc4`
For the coordinates use: `com.microsoft.synapse.ml:mmlspark:1.0.0-rc4`
with the resolver: `https://mmlspark.azureedge.net/maven`. Ensure this library is
attached to your target cluster(s).

Finally, ensure that your Spark cluster has at least Spark 2.4 and Scala 2.11.

You can use MMLSpark in both your Scala and PySpark notebooks. To get started with our example notebooks import the following databricks archive:
You can use SynapseML in both your Scala and PySpark notebooks. To get started with our example notebooks import the following databricks archive:

`https://mmlspark.blob.core.windows.net/dbcs/MMLSparkExamplesv1.0.0-rc4.dbc`
`https://mmlspark.blob.core.windows.net/dbcs/SynapseMLExamplesv1.0.0-rc4.dbc`

### Apache Livy and HDInsight

To install MMLSpark from within a Jupyter notebook served by Apache Livy the following configure magic can be used. You will need to start a new session after this configure cell is executed.
To install SynapseML from within a Jupyter notebook served by Apache Livy the following configure magic can be used. You will need to start a new session after this configure cell is executed.

Excluding certain packages from the library may be necessary due to current issues with Livy 0.5

@@ -210,7 +210,7 @@ Excluding certain packages from the library may be necessary due to current issu
{
"name": "mmlspark",
"conf": {
"spark.jars.packages": "com.microsoft.ml.spark:mmlspark:1.0.0-rc4",
"spark.jars.packages": "com.microsoft.synapse.ml:mmlspark:1.0.0-rc4",
"spark.jars.repositories": "https://mmlspark.azureedge.net/maven",
"spark.jars.excludes": "org.scala-lang:scala-reflect,org.apache.spark:spark-tags_2.11,org.scalactic:scalactic_2.11,org.scalatest:scalatest_2.11"
}
@@ -219,7 +219,7 @@ Excluding certain packages from the library may be necessary due to current issu

### Docker

The easiest way to evaluate MMLSpark is via our pre-built Docker container. To
The easiest way to evaluate SynapseML is via our pre-built Docker container. To
do so, run the following command:

```bash
@@ -234,15 +234,15 @@ notebooks. See the [documentation](docs/docker.md) for more on Docker use.
### GPU VM Setup

MMLSpark can be used to train deep learning models on GPU nodes from a Spark
SynapseML can be used to train deep learning models on GPU nodes from a Spark
application. See the instructions for [setting up an Azure GPU
VM](docs/gpu-setup.md).



### Building from source

MMLSpark has recently transitioned to a new build infrastructure.
SynapseML has recently transitioned to a new build infrastructure.
For detailed developer docs please see the [Developer Readme](docs/developer-readme.md)

If you are an existing mmlspark developer, you will need to reconfigure your
@@ -252,7 +252,7 @@ better integrate with intellij and SBT.

### R (Beta)

To try out MMLSpark using the R autogenerated wrappers [see our
To try out SynapseML using the R autogenerated wrappers [see our
instructions](docs/R-setup.md). Note: This feature is still under development
and some necessary custom wrappers may be missing.

@@ -262,17 +262,17 @@ and some necessary custom wrappers may be missing.

- [Conditional Image Retrieval](https://arxiv.org/abs/2007.07177)

- [MMLSpark: Unifying Machine Learning Ecosystems at Massive Scales](https://arxiv.org/abs/1810.08744)
- [SynapseML: Unifying Machine Learning Ecosystems at Massive Scales](https://arxiv.org/abs/1810.08744)

- [Flexible and Scalable Deep Learning with MMLSpark](https://arxiv.org/abs/1804.04031)
- [Flexible and Scalable Deep Learning with SynapseML](https://arxiv.org/abs/1804.04031)

## Learn More

- Visit our [website].

- Watch our keynote demos at [the Spark+AI Summit 2019], [the Spark+AI European Summit 2018], and [the Spark+AI Summit 2018].

- See how MMLSpark is used to [help endangered species].
- See how SynapseML is used to [help endangered species].

- Explore generative adversarial artwork in [our collaboration with The MET and MIT].

@@ -286,17 +286,17 @@ and some necessary custom wrappers may be missing.

[the Spark+AI European Summit 2018]: https://youtu.be/N3ozCZXeOeU?t=472

[our paper]: https://arxiv.org/abs/1804.04031 "Flexible and Scalable Deep Learning with MMLSpark"
[our paper]: https://arxiv.org/abs/1804.04031 "Flexible and Scalable Deep Learning with SynapseML"

[help endangered species]: https://www.microsoft.com/en-us/ai/ai-lab-stories?activetab=pivot1:primaryr3 "Identifying snow leopards with AI"

[our collaboration with The MET and MIT]: https://www.microsoft.com/en-us/ai/ai-lab-stories?activetab=pivot1:primaryr4 "Generative art at the MET"

[our collaboration with Apache Spark]: https://blogs.technet.microsoft.com/machinelearning/2018/03/05/image-data-support-in-apache-spark/ "Image Data Support in Apache Spark"

[MMLSpark in Azure Machine Learning]: https://docs.microsoft.com/en-us/azure/machine-learning/preview/how-to-use-mmlspark "How to Use Microsoft Machine Learning Library for Apache Spark"
[SynapseML in Azure Machine Learning]: https://docs.microsoft.com/en-us/azure/machine-learning/preview/how-to-use-mmlspark "How to Use Microsoft Machine Learning Library for Apache Spark"

[MMLSpark at the Spark Summit]: https://databricks.com/session/mmlspark-lessons-from-building-a-sparkml-compatible-machine-learning-library-for-apache-spark "MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library for Apache Spark"
[SynapseML at the Spark Summit]: https://databricks.com/session/mmlspark-lessons-from-building-a-sparkml-compatible-machine-learning-library-for-apache-spark "MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library for Apache Spark"

## Contributing & feedback

@@ -323,7 +323,7 @@ Issue](https://help.github.com/articles/creating-an-issue/).

- [Recommenders](https://github.com/Microsoft/Recommenders)

- [JPMML-SparkML plugin for converting MMLSpark LightGBM models to
- [JPMML-SparkML plugin for converting SynapseML LightGBM models to
PMML](https://github.com/alipay/jpmml-sparkml-lightgbm)

- [Microsoft Cognitive Toolkit](https://github.com/Microsoft/CNTK)
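
The hunks in this commit follow a small set of recurring substitutions: the GitHub and Codecov paths move from `Azure/mmlspark` to `Microsoft/SynapseML`, the Maven group changes from `com.microsoft.ml.spark` to `com.microsoft.synapse.ml`, and the product name `MMLSpark` becomes `SynapseML`. A minimal sketch of that pattern is below. The `RENAMES` table and the `apply_renames` helper are hypothetical names introduced here for illustration; they are inferred from the visible hunks and are not the tooling the commit author used.

```python
# Illustrative sketch only: substitution pairs inferred from the hunks in
# this commit. RENAMES and apply_renames are hypothetical names, not part
# of the repository.

# Order matters: the more specific URL and coordinate rewrites run before
# the general product-name rewrite.
RENAMES = [
    ("github.com/Azure/mmlspark", "github.com/Microsoft/SynapseML"),
    ("codecov.io/gh/Azure/mmlspark", "codecov.io/gh/Microsoft/SynapseML"),
    ("com.microsoft.ml.spark", "com.microsoft.synapse.ml"),
    ("MMLSpark", "SynapseML"),
]


def apply_renames(text: str) -> str:
    """Apply each (old, new) pair in order using plain, case-sensitive replacement."""
    for old, new in RENAMES:
        text = text.replace(old, new)
    return text


if __name__ == "__main__":
    print(apply_renames("pyspark --packages com.microsoft.ml.spark:mmlspark:1.0.0-rc4"))
    print(apply_renames("- Fork the MMLSpark repository."))
```

Because the replacement is case-sensitive, lowercase occurrences such as the `mmlspark` artifact name and the `mmlspark.azureedge.net` host are left alone, which matches the hunks above. A blanket `MMLSpark` replacement is broader than what the commit did, however: the Gitter badge URL, for example, keeps `Microsoft/MMLSpark`, so a real migration would need an exclusion list or manual review.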