
Releases: microsoft/SynapseML

v0.16

New Features

New Examples

Updates and Improvements

General

  • MMLSpark Image Schema now unified with Spark Core
  • Bugfixes for Text Analytics services
  • PageSplitter now propagates nulls
  • HTTP on Spark now supports socket and read timeouts
  • HyperparamBuilder Python wrappers now return idiomatic Python objects

LightGBM on Spark

  • Added multiclass classification
  • Added multiple types of boosting (Gradient Boosting Decision Tree, Random Forest, Dropouts meet Multiple Additive Regression Trees, and Gradient-based One-Side Sampling)
  • Added Windows OS support and related bug fixes
  • LightGBM version bumped to 2.2.200
  • Added native support for categorical columns, either through Spark's StringIndexer, MMLSpark's ValueIndexer, or the indexes/slot names parameters (see the sketch after this list)
  • Added the isUnbalance parameter for unbalanced datasets
  • Added a boost-from-average parameter
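
A minimal Scala sketch of how these options come together on the LightGBM on Spark estimator, assuming the LightGBMClassifier setters shown here and an already-assembled training DataFrame `train` with hypothetical "features" and "label" columns:

    import com.microsoft.ml.spark.LightGBMClassifier

    val lgbm = new LightGBMClassifier()
      .setObjective("multiclass")                      // new multiclass support
      .setBoostingType("dart")                         // gbdt, rf, dart, or goss
      .setCategoricalSlotNames(Array("make", "fuel"))  // hypothetical slots handled natively as categorical
      .setFeaturesCol("features")
      .setLabelCol("label")
    // .setIsUnbalance(true) would apply to unbalanced binary problems instead
    val model = lgbm.fit(train)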

Acknowledgements

We would like to acknowledge the developers and contributors, both internal and external, who helped create this version of MMLSpark.

  • Ilya Matiach, Casey Hong, Daniel Ciborowski, Karthik Rajendran, Dalitso Banda, Manon Knoertzer, Sudarshan Raghunathan, Anand Raman, Markus Cozowicz, The Microsoft AI Development Acceleration Program, Cognitive Search Team, Azure Search Team

v0.15

New Features

  • Add the TagImage and DescribeImage services
  • Add Ranking Cross Validator and Evaluator

New Examples

Updates and Improvements

LightGBM

  • Fix issue with raw2probabilityInPlace
  • Add weight column support (see the sketch after this list)
  • Add getModel API to TrainClassifier and TrainRegressor
  • Improve robustness of getting executor cores
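
A minimal sketch of the new weight column support, assuming the setWeightCol setter and a DataFrame `train` with hypothetical "features", "label", and per-row "weight" columns:

    import com.microsoft.ml.spark.LightGBMRegressor

    val lgbm = new LightGBMRegressor()
      .setWeightCol("weight")        // each row contributes to training in proportion to this column
      .setFeaturesCol("features")
      .setLabelCol("label")
    val model = lgbm.fit(train)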

HTTP on Spark and Spark Serving

  • Improve robustness of Gateway creation and management
  • Improve Gateway documentation

Version Bumps

  • Updated to Spark 2.4.0
  • LightGBM version updated to 2.1.250

Misc

  • Fix Flaky Tests
  • Remove autogeneration of scalastyle
  • Increase training dataset size in snow leopard example

Acknowledgements

We would like to acknowledge the developers and contributors, both internal and external, who helped create this version of MMLSpark.

  • Ilya Matiach, Casey Hong, Karthik Rajendran, Daniel Ciborowski, Sebastien Thomas, Eli Barzilay, Sudarshan Raghunathan, @flybywind, @wentongxin, @haal

v0.14

New Features

  • The Cognitive Services on Spark: A simple and scalable integration between the Microsoft Cognitive Services and SparkML
    • Bing Image Search
    • Computer Vision: OCR, Recognize Text, Recognize Domain Specific Content,
      Analyze Image, Generate Thumbnails
    • Text Analytics: Language Detector, Entity Detector, Key Phrase Extractor,
      Sentiment Detector, Named Entity Recognition (see the sketch after this list)
    • Face: Detect, Find Similar, Identify, Group, Verify
  • Added distributed model interpretability with LIME on Spark
  • 100x lower latencies (<1ms) with Spark Serving
  • Expanded Spark Serving to cover the full HTTP protocol
  • Added the SuperpixelTransformer for segmenting images
  • Added a Fluent API, mlTransform and mlFit, for composing pipelines more elegantly
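
The Cognitive Services transformers generally follow the same pattern: set a key and region, point at an input column, and get a structured response column back. A minimal sketch using the Sentiment Detector, assuming the TextSentiment transformer and the setters shown; the key, region, and column names are placeholders:

    import com.microsoft.ml.spark.TextSentiment

    // `df` is an assumed DataFrame with a free-form "text" column.
    val sentiment = new TextSentiment()
      .setSubscriptionKey(sys.env("TEXT_ANALYTICS_KEY"))  // placeholder key source
      .setLocation("eastus")                              // region of the Text Analytics resource
      .setTextCol("text")
      .setOutputCol("sentiment")
    val scored = sentiment.transform(df)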

New Examples

  • Chain together cognitive services to understand the feelings of your favorite celebrities with CognitiveServices - Celebrity Quote Analysis.ipynb
  • Explore how you can use Bing Image Search and Distributed Model Interpretability to get an Object Detection system without labeling any data in ModelInterpretation - Snow Leopard Detection.ipynb
  • See how to deploy any Spark computation as a web service on any Spark platform with the SparkServing - Deploying a Classifier.ipynb notebook

Updates and Improvements

LightGBM

  • More APIs for loading LightGBM Native Models
  • LightGBM training checkpointing and continuation
  • Added the Tweedie variance power parameter to LightGBM
  • Added early stopping to LightGBM (see the sketch after this list)
  • Added feature importances to LightGBM
  • Added a PMML exporter for LightGBM on Spark
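
A minimal sketch of early stopping and feature importances, assuming the setEarlyStoppingRound and setValidationIndicatorCol setters and the getFeatureImportances accessor; `train` is an assumed DataFrame and "isVal" a hypothetical boolean column marking validation rows:

    import com.microsoft.ml.spark.LightGBMClassifier

    val lgbm = new LightGBMClassifier()
      .setEarlyStoppingRound(20)            // stop once the validation metric stalls for 20 rounds
      .setValidationIndicatorCol("isVal")   // rows with isVal == true form the validation set
      .setFeaturesCol("features")
      .setLabelCol("label")
    val model = lgbm.fit(train)
    val importances = model.getFeatureImportances("gain")   // or "split"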

HTTP on Spark

  • Added the VectorizableParam for creating column-parameterizable inputs
  • Added a handler parameter to HTTP services
  • HTTP on Spark now propagates nulls robustly

Version Bumps

  • Updated to Spark 2.3.1
  • LightGBM version updated to 2.1.250

Misc

  • Added Vagrantfile for easy Windows developer setup
  • Improved Image Reader fault tolerance
  • Reorganized Examples into Topics
  • Generalized Image Featurizer and other Image based code to handle Binary Files as well as Spark Images
  • Added ModelDownloader R wrapper
  • Added getBestModel and getBestModelInfo to TuneHyperparameters
  • Expanded Binary File Reading APIs
  • Added Explode and Lambda transformers
  • Added SparkBindings trait for automating Spark binding creation
  • Added retries and timeouts to ModelDownloader
  • Added ResizeImageTransformer to remove ImageFeaturizer dependence on OpenCV

Acknowledgements

We would like to acknowledge the developers and contributors, both internal and external, who helped create this version of MMLSpark (in alphabetical order).

  • Abhiram Eswaran, Anand Raman, Ari Green, Arvind Krishnaa Jagannathan, Ben Brodsky, Casey Hong, Courtney Cochrane, Henrik Frystyk Nielsen, Ilya Matiach, Janhavi Suresh Mahajan, Jaya Susan Mathew, Karthik Rajendran, Mario Inchiosa, Minsoo Thigpen, Soundar Srinivasan, Sudarshan Raghunathan, @terrytangyuan

v0.13

New Functionality:

  • Export trained LightGBM models for evaluation outside of Spark

  • LightGBM on Spark supports multiple cores per executor

  • CNTKModel works with multi-input multi-output models of any CNTK
    datatype

  • Added Minibatching and Flattening transformers for adding flexible
    batching logic to pipelines, deep networks, and web clients (a sketch
    follows this list).

  • Added Benchmark test API for tracking model performance across
    versions

  • Added PartitionConsolidator function for aggregating streaming data
    onto one partition per executor (for use with connection/rate-limited
    HTTP services)
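
As referenced above, a minimal sketch of the batching transformers, assuming the FixedMiniBatchTransformer and FlattenBatch class names and the setBatchSize setter:

    import com.microsoft.ml.spark.{FixedMiniBatchTransformer, FlattenBatch}

    // Group rows of `df` (an assumed DataFrame) into batches of 32, e.g. before
    // calling a deep network or a rate-limited web service, then flatten the
    // results back to one row per input.
    val batched  = new FixedMiniBatchTransformer().setBatchSize(32).transform(df)
    val restored = new FlattenBatch().transform(batched)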

Updates and Improvements:

  • Updated to Spark 2.3.0

  • Added Databricks notebook tests to build system

  • CNTKModel uses significantly less memory

  • Simplified example notebooks

  • Simplified APIs for MMLSpark Serving

  • Simplified APIs for CNTK on Spark

  • LightGBM stability improvements

  • ComputeModelStatistics stability improvements

Acknowledgements:

We would like to acknowledge the external contributors who helped create
this version of MMLSpark (in order of commit history):

v0.11

New functionality:

  • TuneHyperparameters: parallel distributed randomized grid search for
    SparkML and TrainClassifier/TrainRegressor parameters. A sample
    notebook and Python wrappers will be added in the near future.

  • Added PowerBIWriter for writing and streaming data frames to
    PowerBI.

  • Expanded image reading and writing capabilities, including using
    images with Spark Structured Streaming. Images can be read from and
    written to paths specified in a dataframe.

  • New functionality for convenient plotting in Python.

  • UDF transformer and additional UDFs.

  • Expanded pipeline support for arbitrary user code and libraries such
    as NLTK through UDFTransformer (a sketch follows this list).

  • Refactored fuzzing system and added test coverage.

  • GPU training supports multiple VMs.
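
As noted above, UDFTransformer lets arbitrary user code ride along as a pipeline stage. A minimal Scala sketch, assuming the setUDF, setInputCol, and setOutputCol setters; `df` and the column names are placeholders:

    import com.microsoft.ml.spark.UDFTransformer
    import org.apache.spark.sql.functions.udf

    // Wrap an ordinary function as a reusable pipeline stage; the same pattern
    // is what lets libraries such as NLTK be slotted into a SparkML pipeline.
    val toUpper = udf((s: String) => if (s == null) null else s.toUpperCase)
    val stage = new UDFTransformer()
      .setUDF(toUpper)
      .setInputCol("text")
      .setOutputCol("textUpper")
    val result = stage.transform(df)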

Updates:

  • Updated to Conda 4.3.31, which comes with Python 3.6.3.

  • Also updated SBT and JVM.

Improvements:

  • Additional bugfixes, stability, and notebook improvements.

v0.10

New functionality:

  • We now provide initial support for training on a GPU VM, and an ARM
    template to deploy an HDI Cluster with an associated GPU machine. See
    docs/gpu-setup.md for instructions on setting this up.

  • New auto-generated R wrappers for estimators and transformers. To
    import them into R, you can use devtools to import from the uploaded
    zip file. Tests and sample notebooks to come.

  • A new RenameColumn transformer for renaming columns within a
    pipeline.

New notebooks:

  • Notebook 104: An experiment to demonstrate regression models to
    predict automobile prices. This notebook demonstrates the use of
    Pipeline stages, CleanMissingData, and
    ComputePerInstanceStatistics.

  • Notebook 105: Demonstrates DataConversion to make some columns Categorical.

  • There is a 401 notebook in notebooks/gpu that demonstrates CNTK
    training when using a GPU VM. (It is not shown with the rest of the
    notebooks yet.)

Updates:

  • Updated to use CNTK 2.2. Note that this version of CNTK depends on
    libpng12 and libjasper1 -- which are included in our docker images.
    (This should get resolved in the upcoming CNTK 2.3 release.)

Improvements:

  • Local builds will always use a "0.0" version instead of a version
    based on the git repository. This should simplify the build process
    for developers and avoid hard-to-resolve update issues.

  • The TextPreprocessor transformer can be used to find all keys of an
    input map in a text column and replace them with the corresponding
    values.

  • Fixed a regression in the image reader where zip files with images no
    longer displayed the full path to the image inside a zip file.

  • Additional minor bug and stability fixes.

v0.9

New functionality:

  • Refactor ImageReader and BinaryFileReader to support streaming
    images, including a Python API. Also improved performance of the
    readers. Check the 302 notebook for a usage example.

  • Add ClassBalancer estimator for improving classification performance
    on highly imbalanced datasets.

  • Create an infrastructure for automated fuzzing, serialization, and
    Python wrapper tests.

  • Added a DropColumns pipeline stage.

New notebooks:

  • 305: A Flowers sample notebook demonstrating deep transfer learning
    with ImageFeaturizer.

Updates:

  • Our main build is now based on Spark 2.2.

Improvements:

  • Enable streaming through the EnsembleByKey transformer.

  • Miscellaneous fixes, including ImageReader and HDFS issues.

v0.8

New functionality:

  • We are now uploading MMLSpark as an Azure/mmlspark Spark package.
    Use --packages Azure:mmlspark:0.8 with the Spark command-line tools.

  • Add a bi-directional LSTM medical entity extractor to the
    ModelDownloader, and a new Jupyter notebook for medical entity
    extraction using NLTK, PubMed word embeddings, and the Bi-LSTM.

  • Add ImageSetAugmenter for easy dataset augmentation within image
    processing pipelines.

Improvements:

  • Optimize the performance of CNTKModel. It now broadcasts a loaded
    model to workers and shares model weights between partitions on the
    same worker. Minibatch padding (an internal workaround of a CNTK bug)
    is no longer used, eliminating excess computation when there is a
    mismatch between the partition size and minibatch size.

  • Bugfix: CNTKModel can work with models with unnamed outputs.

Docker image improvements:

  • Environment variables are now part of the docker image (in addition to
    being set in bash).

  • New docker images:

    • microsoft/mmlspark:latest: the plain image, as always.
    • microsoft/mmlspark:gpu: a GPU variant based on an nvidia/cuda image.
    • microsoft/mmlspark:plus and microsoft/mmlspark:plus-gpu: these
      images contain additional packages for internal use; they will
      probably be based on an older Conda version in future releases as well.

Updates:

  • The Conda environment now includes NLTK.

  • Updated Java and SBT versions.

v0.7

New functionality:

  • New transforms: EnsembleByKey, Cacher, and Timer; see the documentation.

Updates:

  • Miniconda version 4.3.21, including Python 3.6.

  • CNTK version 2.1, using Maven Central.

  • Use OpenCV from the OpenPnP project from Maven Central.

Improvements:

  • Spark's binaryFiles function had a performance regression in version
    2.1 relative to version 2.0; we work around it for now. DataFrame
    operations after a use of BinaryFileReader (e.g., reading images) are
    significantly faster as a result.

  • The Spark installation is now patched with hadoop-azure and
    azure-storage.

  • Includes additional bug fixes and improvements.

v0.6

New functionality:

  • Similar to Spark's StringIndexer, we have a ValueIndexer that can
    be used for indexing values of any type instead of only strings. Not
    only can it index these values, we also provide a reverse mapping via
    IndexToValue, similar to Spark's IndexToString transform (a sketch
    follows this list).

  • A new "clean missing" data estimator, example:

    import com.microsoft.ml.spark.CleanMissingData

    // Replace missing values in "some-column", writing the result back to
    // the same column; someCustomValue supplies the replacement value.
    val cmd = new CleanMissingData()
      .setInputCols(Array("some-column"))
      .setOutputCols(Array("some-column"))
      .setCleaningMode(CleanMissingData.customOpt)   // use the custom-value cleaning mode
      .setCustomValue(someCustomValue)
    val cmdModel = cmd.fit(dataset)
    val result = cmdModel.transform(dataset)
    
  • New default featurization for date and timestamp Spark types and our
    internal image type. Date columns are converted to double features:
    year, day of week, month, and day of month. Timestamp columns get the
    same features plus hour of day, minute of hour, and second of minute.
    Image columns use the image data converted to double along with width
    and height information.

  • Starting the docker image without an ACCEPT_EULA variable setting
    would throw an error. Instead, we now start a tiny web server that
    shows the EULA and replaces itself with the Jupyter interface when you
    click the AGREE button.
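
Picking up the ValueIndexer item above, a minimal sketch of indexing and the reverse mapping, assuming the setters shown; `dataset` and the "category" column are placeholders:

    import com.microsoft.ml.spark.{IndexToValue, ValueIndexer}

    // Index values of any type, not only strings.
    val indexerModel = new ValueIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")
      .fit(dataset)
    val indexed = indexerModel.transform(dataset)

    // Map the indices back to the original values, like Spark's IndexToString.
    val restored = new IndexToValue()
      .setInputCol("categoryIndex")
      .setOutputCol("categoryValue")
      .transform(indexed)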

Breaking changes:

  • Renamed ImageTransform to ImageTransformer.

Notable bug fixes and other changes:

  • Improved sample notebooks, and a new one: "303 - Transfer Learning by
    DNN Featurization - Airplane or Automobile".

  • Fix serialization bugs in generated Python PipelineStages.

Acknowledgments

Thanks to Ali Zaidi for some notebook beautifications.