@mailletf mailletf released this Jul 12, 2016 · 1226 commits to master since this release

MLDB is the Machine Learning Database. It's the best way to get machine learning or AI into your applications or personal projects. Head on over to MLDB.ai to try it right now or see Running MLDB for installation details.

We're happy to announce the immediate availability of MLDB version 2016.07.12.0. Since the latest release, we've been working on many exciting projects. For example, we've started applying MLDB to LiDAR data and header bidding. We're also gearing up for more projects on image classification and deep learning, which means MLDB's support for Tensorflow will keep on improving over the coming weeks and months.

This release contains 135 new commits, modifies 283 files and fixes 47 issues. On top of many bug fixes and performance improvements, here are some of the highlights of this release:

New tutorial

The Selecting Columns Programmatically Using Column Expressions Tutorial explains how column expressions can be used to programmatically choose which columns are returned by the SELECT statement. This is an example of the powerful additions MLDB made to standard SQL to make it possible to work efficiently with schema-free sparse datasets made up of millions of columns.

Improvements to import procedures

Support for the SELECT and NAMED arguments in the import.json procedure

Following the addition of the WHERE argument to the import.json procedure in MLDB's last release, the procedure now supports the SELECT and NAMED arguments.

The import.json procedure allows a user to import a dataset made up of JSON blobs. The SELECT argument allows a user to select which keys to import, while the NAMED argument allows a user to name each row, optionally using values from the JSON blob.

Given the following file that contains these two lines:

{"a": "b1", "c": {"d": 1}, "e": [0, 1]}
{"a": "b2", "c": {"d": 2}, "e": [0, 5]}

if we use the new arguments in the following way:

  • SELECT: c.d
  • NAMED: a

the resulting dataset will look like this:

_rowName    c.d
b1          1
b2          2
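
To make the semantics concrete, here is a minimal Python sketch that emulates, outside of MLDB, what the example describes: each JSON line is flattened, only the selected key is kept, and each row is named after the value of a. This is not MLDB code. The helpers flatten and import_json_lines are hypothetical names, and the dotted column naming (c.d) is assumed from the result shown above. The two input lines are written as valid JSON.

```python
import json

# The two example lines, written as valid JSON.
LINES = [
    '{"a": "b1", "c": {"d": 1}, "e": [0, 1]}',
    '{"a": "b2", "c": {"d": 2}, "e": [0, 5]}',
]


def flatten(obj, prefix=""):
    """Flatten nested dicts into dotted column names ({"c": {"d": 1}} -> {"c.d": 1})."""
    flat = {}
    for key, value in obj.items():
        name = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat


def import_json_lines(lines, select, named):
    """Hypothetical emulation: keep the `select` columns, name rows by the `named` column."""
    dataset = {}
    for line in lines:
        row = flatten(json.loads(line))
        dataset[row[named]] = {col: row[col] for col in select if col in row}
    return dataset


dataset = import_json_lines(LINES, select=["c.d"], named="a")
for row_name, columns in dataset.items():
    print(row_name, columns)
```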

Added rowHash() function to the import.text procedure

In the import.text procedure, certain functions are available in the SELECT, NAMED, WHERE and TIMESTAMP expressions.

This release adds the rowHash() function, which is mainly useful in the WHERE argument to do random sampling. For example, when importing a huge file into MLDB, if we know beforehand that we only want to load a random 10% sample of the data, we can now simply use WHERE: rowHash() % 10 = 0. This keeps only the required sample as we stream through the data, saving both time and memory.
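
The idea behind hash-based sampling can be sketched in a few lines of Python. This is an illustration of the technique only: row_hash is a stand-in built on SHA-256, since the hash MLDB actually uses for rowHash() is not described here, and sample_rows is a hypothetical helper.

```python
import hashlib


def row_hash(line_number, line):
    """Stand-in for rowHash(): a deterministic integer hash of a row (assumption)."""
    digest = hashlib.sha256(f"{line_number}:{line}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def sample_rows(lines, modulus=10):
    """Yield roughly 1/modulus of the rows while streaming, without buffering the input."""
    for line_number, line in enumerate(lines):
        if row_hash(line_number, line) % modulus == 0:
            yield line


# Stream 100,000 synthetic rows and keep about 10% of them.
rows = (f"user{i},{i % 7}" for i in range(100_000))
sample = list(sample_rows(rows))
print(len(sample))
```

Because the decision depends only on the hash of each row, the same rows are selected on every run, and rejected rows never need to be held in memory.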

Improvements to the SFTP protocol handler

In MLDB's last release, we introduced the sftp:// protocol handler. In this release, we made it more robust by adding support for non-standard SSH ports and by improving how it handles an unexpected loss of connection.

Machine learning

New summary.statistics procedure

As a first step in the modelling process, a data scientist usually wants to get a feel for the data. Looking at summary statistics is a great way to do this since they provide a high-level summary of the data.

The new summary.statistics procedure calculates summary statistics for the different columns in a dataset, and works for both numerical and categorical data.

The procedure calculates the number of unique values, the number of null values, and the most frequent items for both numerical and categorical data. In addition to those, the procedure calculates the mean, minimum and maximum values as well as the 1st quartile, median and 3rd quartile for columns containing only numerical data.
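
As a rough illustration of what these statistics look like for a single column, here is a minimal Python sketch using only the standard library. It is not the summary.statistics implementation: summarize_column is a hypothetical helper, the sample values are made up, and details such as the quartile interpolation method or whether nulls count as a unique value are assumptions that may differ from MLDB.

```python
from collections import Counter
import statistics


def summarize_column(values, top_n=3):
    """Hypothetical per-column summary; None stands for a null value."""
    present = [v for v in values if v is not None]
    summary = {
        "num_null": len(values) - len(present),
        "num_unique": len(set(present)),
        "most_frequent": Counter(present).most_common(top_n),
    }
    # Extra statistics only for columns that contain numbers exclusively.
    is_numeric = bool(present) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    if is_numeric and len(present) >= 2:
        q1, median, q3 = statistics.quantiles(present, n=4)
        summary.update(
            mean=statistics.mean(present),
            min=min(present),
            max=max(present),
            q1=q1,
            median=median,
            q3=q3,
        )
    return summary


# Made-up sample columns, one numerical and one categorical.
ages = [22, 38, 26, 35, None, 35, 54]
ports = ["S", "C", "S", None, "Q", "S"]
age_summary = summarize_column(ages)
port_summary = summarize_column(ports)
print(age_summary)
print(port_summary)
```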

This is an example of the statistics for numerical columns on the dataset used in the Predicting Titanic Survival Demo:

New feature_hasher feature generator function

Feature hashing, also known as the hashing trick, is a way to turn a potentially very large feature vector into a much smaller fixed-length vector. It works by applying a hash function to each feature in the original vector and using the result as an index in the smaller vector.

The feature_hasher function offers this functionality and can operate in two modes: on columns only, or on the union of columns and values. This gives it the flexibility to deal with both sparse and dense data, respectively.
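
The hashing trick itself can be sketched as follows. This is a generic Python illustration, not the feature_hasher implementation: the function names, the mode names "columns" and "columns_and_values", the use of SHA-256, and the choice to accumulate values (columns mode) or counts (columns and values mode) are all assumptions made for the example.

```python
import hashlib


def bucket(name, num_buckets):
    """Map a feature name to an index in the fixed-length output vector."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def hash_features(row, num_buckets=8, mode="columns"):
    """Hash a row (dict of column -> value) into a vector of length num_buckets."""
    vector = [0.0] * num_buckets
    for column, value in row.items():
        if mode == "columns":
            # Sparse data: hash the column name, accumulate its numeric value.
            vector[bucket(column, num_buckets)] += float(value)
        elif mode == "columns_and_values":
            # Dense data: hash the (column, value) pair, count its occurrence.
            vector[bucket(f"{column}={value}", num_buckets)] += 1.0
        else:
            raise ValueError(f"unknown mode: {mode}")
    return vector


sparse_row = {"word_the": 2, "word_cat": 1, "word_sat": 1}
dense_row = {"color": "red", "size": "L"}
sparse_vector = hash_features(sparse_row, mode="columns")
dense_vector = hash_features(dense_row, mode="columns_and_values")
print(sparse_vector)
print(dense_vector)
```

Distinct features can collide in the same bucket. That is the usual trade-off of the hashing trick: a bounded vector size in exchange for some loss of information.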

Updated Tensorflow to version 0.9.0

We updated the Tensorflow version shipped with MLDB to version 0.9.0. The new version includes many bug fixes and performance improvements.

This new Tensorflow release also includes contributions from a member of the MLDB team!

New functions

  • hash(expr): this function returns the hash of the value in expr.
  • extract_domain(str, {removeSubdomain: false}): this function extracts the domain name from a URL. It can be very useful when dealing with web data.
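
To illustrate the kind of result extract_domain is meant to produce, here is a rough Python equivalent built on the standard library. It is only a sketch of the behavior described above, not MLDB's implementation: the remove_subdomain handling is an assumption and is deliberately naive (it keeps the last two labels of the host name, which is wrong for suffixes such as co.uk).

```python
from urllib.parse import urlsplit


def extract_domain(url, remove_subdomain=False):
    """Return the host name of a URL, optionally stripped of its subdomain (naive)."""
    host = urlsplit(url).hostname or ""
    if remove_subdomain:
        # Naive assumption: the registrable domain is the last two labels.
        host = ".".join(host.split(".")[-2:])
    return host


print(extract_domain("http://www.example.com/path?q=1"))
print(extract_domain("http://www.example.com/path?q=1", remove_subdomain=True))
```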