Releases: microsoft/SynapseML

v0.13

18 Jul 02:17

New Functionality:

  • Export trained LightGBM models for evaluation outside of Spark

  • LightGBM on Spark supports multiple cores per executor

  • CNTKModel works with multi-input multi-output models of any CNTK
    datatype

  • Added Minibatching and Flattening transformers for adding flexible
    batching logic to pipelines, deep networks, and web clients.

  • Added Benchmark test API for tracking model performance across
    versions

  • Added PartitionConsolidator function for aggregating streaming data
    onto one partition per executor (for use with connection/rate-limited
    HTTP services)
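
As a rough illustration of the batching idea behind the Minibatching and Flattening transformers, the sketch below groups rows into fixed-size batches and then restores them to one element per row. It uses plain Scala collections rather than Spark, and the names MinibatchSketch, minibatch, and flatten are hypothetical; it is not the MMLSpark API or implementation.

```scala
object MinibatchSketch {
  // Group consecutive rows into batches of at most `batchSize` elements.
  // The final batch may be smaller when the row count is not a multiple
  // of `batchSize`.
  def minibatch[A](rows: Seq[A], batchSize: Int): Seq[Seq[A]] = {
    require(batchSize > 0, "batchSize must be positive")
    rows.grouped(batchSize).toList
  }

  // Undo the batching: back to one element per row, in the original order.
  def flatten[A](batches: Seq[Seq[A]]): Seq[A] = batches.flatten

  def main(args: Array[String]): Unit = {
    val rows = Seq(1, 2, 3, 4, 5)
    val batches = minibatch(rows, 2)
    println(batches)          // the last batch holds the single leftover row
    println(flatten(batches)) // same elements and order as `rows`
  }
}
```

Batching before a deep network or an HTTP call amortizes per-call overhead, and flattening afterwards gives downstream stages the original row shape.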

Updates and Improvements:

  • Updated to Spark 2.3.0

  • Added Databricks notebook tests to build system

  • CNTKModel uses significantly less memory

  • Simplified example notebooks

  • Simplified APIs for MMLSpark Serving

  • Simplified APIs for CNTK on Spark

  • LightGBM stability improvements

  • ComputeModelStatistics stability improvements

Acknowledgements:

We would like to acknowledge the external contributors who helped create
this version of MMLSpark (in order of commit history):

mmlspark-v0.11: v0.11

18 Jul 02:18

New functionality:

  • TuneHyperparameters: parallel distributed randomized grid search for
    SparkML and TrainClassifier/TrainRegressor parameters. A sample
    notebook and Python wrappers will be added in the near future.

  • Added PowerBIWriter for writing and streaming data frames to
    PowerBI.

  • Expanded image reading and writing capabilities, including using
    images with Spark Structured Streaming. Images can be read from and
    written to paths specified in a dataframe.

  • New functionality for convenient plotting in Python.

  • UDF transformer and additional UDFs.

  • Expanded pipeline support for arbitrary user code and libraries such
    as NLTK through UDFTransformer.

  • Refactored fuzzing system and added test coverage.

  • GPU training supports multiple VMs.
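
To make the idea of a randomized grid search concrete, here is a minimal sketch in plain Scala. It is not the TuneHyperparameters API: the object RandomSearchSketch, the Params alias, and the score function are all hypothetical, and the sketch runs sequentially on one machine rather than in parallel on a cluster.

```scala
import scala.util.Random

object RandomSearchSketch {
  // One candidate: a concrete value for every parameter name.
  type Params = Map[String, Double]

  // Draw one candidate by picking a random value for each parameter.
  // Every parameter must list at least one candidate value.
  def sample(grid: Map[String, Seq[Double]], rng: Random): Params =
    grid.map { case (name, values) => name -> values(rng.nextInt(values.length)) }

  // Evaluate `numRuns` random candidates and keep the highest-scoring one.
  def search(grid: Map[String, Seq[Double]], numRuns: Int, seed: Long)(
      score: Params => Double): (Params, Double) = {
    require(numRuns > 0, "numRuns must be positive")
    val rng = new Random(seed)
    val candidates = Seq.fill(numRuns)(sample(grid, rng))
    candidates.map(p => p -> score(p)).maxBy(_._2)
  }

  def main(args: Array[String]): Unit = {
    val grid = Map("depth" -> Seq(2.0, 4.0, 8.0), "rate" -> Seq(0.01, 0.1, 1.0))
    // Toy score standing in for a real train-and-evaluate step.
    val (best, bestScore) = search(grid, numRuns = 10, seed = 42L) { p =>
      -math.abs(p("depth") - 4.0) - math.abs(p("rate") - 0.1)
    }
    println(s"best params: $best, score: $bestScore")
  }
}
```

Because each candidate is scored independently of the others, the evaluation step is the natural place to distribute work.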

Updates:

  • Updated to Conda 4.3.31, which comes with Python 3.6.3.

  • Also updated SBT and JVM.

Improvements:

  • Additional bugfixes, stability, and notebook improvements.

mmlspark-v0.10: v0.10

18 Jul 02:18

New functionality:

  • We now provide initial support for training on a GPU VM, and an ARM
    template to deploy an HDI Cluster with an associated GPU machine. See
    docs/gpu-setup.md for instructions on setting this up.

  • New auto-generated R wrappers for estimators and transformers. To
    import them into R, you can use devtools to import from the uploaded
    zip file. Tests and sample notebooks to come.

  • A new RenameColumn transformer for renaming columns within a
    pipeline.

New notebooks:

  • Notebook 104: An experiment to demonstrate regression models to
    predict automobile prices. This notebook demonstrates the use of
    Pipeline stages, CleanMissingData, and
    ComputePerInstanceStatistics.

  • Notebook 105: Demonstrates DataConversion to make some columns Categorical.

  • There is a 401 notebook in notebooks/gpu which demonstrates CNTK
    training when using a GPU VM. (It is not shown with the rest of the
    notebooks yet.)

Updates:

  • Updated to use CNTK 2.2. Note that this version of CNTK depends on
    libpng12 and libjasper1 -- which are included in our docker images.
    (This should get resolved in the upcoming CNTK 2.3 release.)

Improvements:

  • Local builds will always use a "0.0" version instead of a version
    based on the git repository. This should simplify the build process
    for developers and avoid hard-to-resolve update issues.

  • The TextPreprocessor transformer can be used to find and replace all
    key value pairs in an input map.

  • Fixed a regression in the image reader where zip files with images no
    longer displayed the full path to the image inside a zip file.

  • Additional minor bug and stability fixes.
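
The find-and-replace behavior described for TextPreprocessor can be pictured with the naive sketch below. It is an assumption-laden illustration in plain Scala, not the library's implementation: the object FindReplaceSketch and its replaceAll method are hypothetical, and the longest-key-first ordering is a design choice made here, not something the release notes specify.

```scala
object FindReplaceSketch {
  // Replace every occurrence of each key in `text` with its mapped value.
  // Keys are applied longest first so a short key does not break up a
  // longer match. Empty keys are skipped because String.replace would
  // otherwise insert the value between every character.
  def replaceAll(text: String, replacements: Map[String, String]): String =
    replacements.toSeq
      .filter { case (key, _) => key.nonEmpty }
      .sortBy { case (key, _) => -key.length }
      .foldLeft(text) { case (acc, (key, value)) => acc.replace(key, value) }

  def main(args: Array[String]): Unit = {
    val normalized = replaceAll("the cat sat", Map("cat" -> "dog", "sat" -> "stood"))
    println(normalized)
  }
}
```

Since the replacements are applied one after another, text produced by an earlier replacement can be matched by a later key; a single-pass matcher would avoid that.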

v0.9

18 Jul 02:18

New functionality:

  • Refactor ImageReader and BinaryFileReader to support streaming
    images, including a Python API. Also improved performance of the
    readers. Check the 302 notebook for a usage example.

  • Add ClassBalancer estimator for improving classification performance
    on highly imbalanced datasets.

  • Create an infrastructure for automated fuzzing, serialization, and
    Python wrapper tests.

  • Added a DropColumns pipeline stage.
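
One common way to counter class imbalance is to give each row a weight that grows as its class gets rarer. The sketch below shows that idea in plain Scala; the weighting formula (largest class count divided by this class's count) is an assumption chosen for illustration, and ClassWeightSketch is a hypothetical name, not the ClassBalancer API.

```scala
object ClassWeightSketch {
  // Weight each class by (size of the largest class) / (size of this class),
  // so the most frequent class gets weight 1.0 and rarer classes get more.
  def classWeights[A](labels: Seq[A]): Map[A, Double] = {
    require(labels.nonEmpty, "labels must not be empty")
    val counts = labels.groupBy(identity).map { case (label, xs) => label -> xs.size }
    val maxCount = counts.values.max.toDouble
    counts.map { case (label, n) => label -> maxCount / n }
  }

  def main(args: Array[String]): Unit = {
    val labels = Seq("neg", "neg", "neg", "pos")
    println(classWeights(labels))
  }
}
```

The resulting weights can be attached to rows as a weight column for learners that accept one.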

New notebooks:

  • 305: A Flowers sample notebook demonstrating deep transfer learning
    with ImageFeaturizer.

Updates:

  • Our main build is now based on Spark 2.2.

Improvements:

  • Enable streaming through the EnsembleByKey transformer.

  • Fixes for ImageReader, an HDFS issue, etc.

v0.8

18 Jul 02:18

New functionality:

  • We are now uploading MMLSpark as an Azure/mmlspark spark package.
    Use --packages Azure:mmlspark:0.8 with the Spark command-line tools.

  • Add a bi-directional LSTM medical entity extractor to the
    ModelDownloader, and a new Jupyter notebook for medical entity
    extraction using NLTK, PubMed Word embeddings, and the Bi-LSTM.

  • Add ImageSetAugmenter for easy dataset augmentation within image
    processing pipelines.

Improvements:

  • Optimize the performance of CNTKModel. It now broadcasts a loaded
    model to workers and shares model weights between partitions on the
    same worker. Minibatch padding (an internal workaround of a CNTK bug)
    is no longer used, eliminating excess computations when there is a
    mismatch between the partition size and minibatch size.

  • Bugfix: CNTKModel can work with models with unnamed outputs.

Docker image improvements:

  • Environment variables are now part of the docker image (in addition to
    being set in bash).

  • New docker images:

    • microsoft/mmlspark:latest: plain image, as always,
    • microsoft/mmlspark:gpu: GPU variant based on an nvidia/cuda image.
    • microsoft/mmlspark:plus and microsoft/mmlspark:plus-gpu: these
      images contain additional packages for internal use; they will
      probably be based on an older Conda version too in future releases.

Updates:

  • The Conda environment now includes NLTK.

  • Updated Java and SBT versions.

v0.7

18 Jul 02:18

New functionality:

  • New transforms: EnsembleByKey, Cacher, Timer; see the documentation.

Updates:

  • Miniconda version 4.3.21, including Python 3.6.

  • CNTK version 2.1, using Maven Central.

  • Use OpenCV from the OpenPnP project from Maven Central.

Improvements:

  • Spark's binaryFiles function regressed in version 2.1 relative to
    version 2.0, causing performance issues; we work around that for
    now. With the workaround, data frame operations after a use of
    BinaryFileReader (e.g., reading images) are significantly faster.

  • The Spark installation is now patched with hadoop-azure and
    azure-storage.

  • Includes additional bug fixes and improvements.

v0.6

18 Jul 02:03

New functionality:

  • Similar to Spark's StringIndexer, we have a ValueIndexer that can
    be used for indexing any type of values instead of only strings. Not
    only can it index these values, we also provide a reverse mapping via
    IndexToValue, similar to Spark's IndexToString transform.

  • A new CleanMissingData estimator for cleaning missing data. Example:

    val cmd = new CleanMissingData()
      .setInputCols(Array("some-column"))
      .setOutputCols(Array("some-column"))
      .setCleaningMode(CleanMissingData.customOpt)
      .setCustomValue(someCustomValue)
    val cmdModel = cmd.fit(dataset)
    val result = cmdModel.transform(dataset)
    
  • New default featurization for date and timestamp Spark types and our
    internal image type. Date columns are converted to double features:
    year, day of week, month, day of month. Timestamp columns get the
    same features as date columns plus hour of day, minute of hour, and
    second of minute. Image columns use the image data converted to
    double, together with width and height info.

  • Previously, starting the docker image without an ACCEPT_EULA variable
    setting would throw an error. Instead, we now start a tiny web server
    that shows the EULA and replaces itself with the Jupyter interface
    when you click the AGREE button.
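
The date and timestamp featurization described above can be sketched with java.time as follows. This is an illustration only: DateFeaturizationSketch and its methods are hypothetical names, and the day-of-week numbering (1 = Monday through 7 = Sunday, as returned by java.time) is an assumption that may differ from what the library emits.

```scala
import java.time.{LocalDate, LocalDateTime}

object DateFeaturizationSketch {
  // Date -> [year, day of week, month, day of month], all as doubles.
  // Day of week follows java.time: 1 = Monday ... 7 = Sunday (assumption).
  def featurizeDate(d: LocalDate): Array[Double] =
    Array(d.getYear, d.getDayOfWeek.getValue, d.getMonthValue, d.getDayOfMonth)
      .map(_.toDouble)

  // Timestamp -> the date features plus [hour of day, minute of hour,
  // second of minute].
  def featurizeTimestamp(t: LocalDateTime): Array[Double] =
    featurizeDate(t.toLocalDate) ++
      Array(t.getHour, t.getMinute, t.getSecond).map(_.toDouble)

  def main(args: Array[String]): Unit = {
    println(featurizeDate(LocalDate.of(2017, 7, 4)).mkString(", "))
    println(featurizeTimestamp(LocalDateTime.of(2017, 7, 4, 13, 5, 9)).mkString(", "))
  }
}
```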

Breaking changes:

  • Renamed ImageTransform to ImageTransformer.

Notable bug fixes and other changes:

  • Improved sample notebooks, and a new one: "303 - Transfer Learning by
    DNN Featurization - Airplane or Automobile".

  • Fix serialization bugs in generated Python PipelineStages.

Acknowledgments

Thanks to Ali Zaidi for some notebook beautifications.

v0.5

18 Jul 02:03

Initial release.