@mailletf mailletf released this Jun 28, 2016 · 1361 commits to master since this release

MLDB is the Machine Learning Database. It's the best way to get machine learning or AI into your applications or personal projects. Head on over to MLDB.ai to try it right now or see Running MLDB for installation details.

We're happy to announce the immediate availability of MLDB version 2016.06.28.1. We've been hard at work using MLDB for several customer-facing projects and building internal features on it. We've also added a team member, Jonathan, who will be spending the summer building tutorials on MLDB. Welcome!

This release contains 112 new commits, modifies 114 files and fixes 41 issues. On top of many bug fixes and performance improvements, here are some of the highlights of this release:

New demo

  • The Investigating the Panama Papers demo shows off MLDB's SQL engine by exploring the raw data from the Offshore Leaks Database (or "Panama Papers" as they were called in the media). MLDB is a great tool to understand the basic structure of the dataset and to start to identify the predictive power of some of the attributes.

New tutorials

  • The [Executing JavaScript Code Directly in SQL Queries Using the jseval Function Tutorial](https://docs.mldb.ai/ipy/notebooks/_tutorials/_latest/Executing%20JavaScript%20Code%20Directly%20in%20SQL%20Queries%20Using%20the%20jseval%20Function%20Tutorial.html) showcases a unique feature of MLDB: the ability to embed JavaScript directly inside SQL queries in an extremely performant, multithreaded manner.
  • The Virtual Manipulation of Datasets Tutorial shows how to use the datasets of type sampled and merged. These are useful for splitting a dataset into testing and training sets and recombining them.
  • The Loading Data From An HTTP Server Tutorial shows how to load data from a public web server. Since MLDB is batteries included and much of machine learning is done over public datasets from the web, it's important to highlight how easy MLDB makes it to get started with one.

Improvements to import procedures

New autoGenerateHeaders option in import.text procedure

MLDB is a great way to deal with datasets that have lots of columns. The preferred way to import data is by using the import.text procedure. The procedure has lots of options to provide as much flexibility as possible when importing raw data.

However, if the imported file's first line did not contain the header (names of all the columns), it had to be provided as a list to the procedure. This added an extra, unnecessary step and slowed down the workflow. The new autoGenerateHeaders option solves this problem by automatically generating column names 0, 1, 2, etc.
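To make the naming scheme concrete, here is a minimal plain-Python sketch of what automatically generated headers look like for a headerless file. It only mimics the behaviour described above and is not MLDB's implementation; the helper name `read_with_generated_headers` and the sample data are made up for illustration.

```python
import csv
import io


def read_with_generated_headers(text, delimiter=","):
    """Parse headerless delimited text, naming columns "0", "1", "2", ...

    Hypothetical helper that imitates the effect of autoGenerateHeaders;
    it is not part of MLDB.
    """
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter=delimiter):
        if not fields:  # skip blank lines
            continue
        # Column names are simply the positional index of each field.
        rows.append({str(i): value for i, value in enumerate(fields)})
    return rows


# Illustrative data: a file whose first line is data, not a header.
raw = "5.1,3.5,setosa\n7.0,3.2,versicolor\n"
rows = read_with_generated_headers(raw)
print(rows[0])  # {'0': '5.1', '1': '3.5', '2': 'setosa'}
```

The first line is kept as data instead of being consumed as a header, which is the point of the option.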

Support for the WHERE argument in import.json procedure

When dealing with large real-life datasets, we don't always need all the rows of our raw data. Being able to filter unnecessary rows during the import stage can save both processing time and memory.

The addition of the WHERE argument to the import.json procedure allows users to filter rows on the values inside the JSON blobs they are importing using an SQL expression. Here is a quick example:

Given the following file that contains these two lines:

{"a": "b1", "c": {"d": 1}}
{"a": "b2", "c": {"d": 2}}

we can now filter like this: WHERE: "c.d = 1". The resulting dataset will only contain the first line:

| a  | c.d |
|----|-----|
| b1 | 1   |
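The effect of filtering at import time can be sketched in plain Python. This is only an illustration of the idea (flatten each JSON line into dotted column names, then keep the rows that match a predicate), not how MLDB evaluates the WHERE expression; `flatten` and `import_json_lines` are hypothetical helpers.

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested dicts into dotted column names, e.g. {"c": {"d": 1}} -> {"c.d": 1}."""
    out = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            out.update(flatten(value, name))
        else:
            out[name] = value
    return out


def import_json_lines(lines, where=lambda row: True):
    """Parse one JSON object per line and keep only rows accepted by `where`."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:  # ignore empty lines
            continue
        row = flatten(json.loads(line))
        if where(row):  # rows rejected here are never stored
            rows.append(row)
    return rows


lines = [
    '{"a": "b1", "c": {"d": 1}}',
    '{"a": "b2", "c": {"d": 2}}',
]
# Python stand-in for the SQL expression "c.d = 1".
kept = import_json_lines(lines, where=lambda row: row.get("c.d") == 1)
print(kept)  # [{'a': 'b1', 'c.d': 1}]
```

Because rejected rows are dropped before they are stored, they cost no memory in the resulting dataset.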

Machine Learning Improvements

The classifier.experiment procedure (as well as the classifier.test procedure) now reports accuracy when evaluating boolean or categorical classifiers. This complements the metrics already reported: precision, recall, F1-score and Area Under the Curve.
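As a reminder of how accuracy relates to the other metrics, here is a small sketch using the standard textbook definitions for a boolean classifier. The labels and predictions are toy data, and `binary_metrics` is a hypothetical helper; how MLDB computes and reports these values internally is not shown here.

```python
def binary_metrics(labels, predictions):
    """Compute accuracy, precision, recall and F1 from boolean labels and predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)          # true positives
    tn = sum(1 for y, p in pairs if not y and not p)  # true negatives
    fp = sum(1 for y, p in pairs if not y and p)      # false positives
    fn = sum(1 for y, p in pairs if y and not p)      # false negatives

    total = len(pairs)
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy data: 3 positive and 5 negative examples.
labels      = [True, True, True,  False, False, False, False, False]
predictions = [True, True, False, True,  False, False, False, False]

metrics = binary_metrics(labels, predictions)
print(metrics["accuracy"])  # 0.75
```

Accuracy counts every correct decision, positive or negative, whereas precision and recall only look at the positive class, so the numbers can differ noticeably on imbalanced data.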

A configuration preset for the Naive Bayes classifier has also been added to the classifier.train procedure.

Added JSON payload support to the /query endpoint and user functions

The /query endpoint and the application of functions via REST now support arguments passed in the JSON payload. Previously, they could only receive arguments in the query string. Since the query string is limited in size, this could be problematic for very large queries, for example when querying MLDB with the data extracted from an image. Supporting JSON payloads solves this shortcoming.
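The difference between the two styles can be sketched as follows. The snippet only builds the request pieces and does not contact a server. The host, the `/v1/query` path, the `q` argument name and the `my_fn` function are assumptions made for illustration, so check the MLDB REST documentation for the exact names.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint; adjust host, port and path to your MLDB instance.
ENDPOINT = "http://localhost:8080/v1/query"


def as_query_string(sql):
    """Old style: the whole query is URL-encoded into the query string."""
    return ENDPOINT + "?" + urlencode({"q": sql})


def as_json_payload(sql):
    """New style: the URL stays short and the arguments travel in a JSON body."""
    body = json.dumps({"q": sql}).encode("utf-8")
    return ENDPOINT, body


# A deliberately large query, e.g. pixel values passed to a user function
# (my_fn is a made-up function name).
pixels = ", ".join(str(i % 256) for i in range(5000))
sql = "SELECT my_fn({pixels: [" + pixels + "]}) AS *"

long_url = as_query_string(sql)
short_url, body = as_json_payload(sql)

print(len(long_url))   # grows with the size of the query
print(len(short_url))  # constant, independent of the query
print(len(body))       # the query now lives in the request body
```

Servers and proxies commonly cap URL length, while request bodies can be much larger, which is why moving the arguments into the payload helps for large queries.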

New SFTP protocol handler

It's easy to load data from a variety of sources with MLDB because of the different [protocols it can handle](https://docs.mldb.ai/doc/#builtin/Url.md.html). This release adds the SSH File Transfer Protocol (SFTP) to the list of supported protocols.

New functions

http.useragent

For users dealing with web data, the http.useragent function provides very easy parsing of [user agent strings](https://en.wikipedia.org/wiki/User_agent). After instantiating a ua_parser function, a user can make the following call:

SELECT ua_parser({ua: 'Mozilla/5.0 (iPhone; CPU iPhone OS 5_1_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B206 Safari/7534.48.3'}) as *

which will return:

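| rowName | browser.family | browser.version | device.brand | device.model | isSpider | os.family | os.version |
|---------|----------------|-----------------|--------------|--------------|----------|-----------|------------|
| result  | Mobile Safari  | 5.1.0           | Apple        | iPhone       | 0        | iOS       | 5.1.1      |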

sign

Adding to the varied list of numerical functions available, the sign(x) function has been added. It returns the sign of x (-1, 0, +1).
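The three possible results can be illustrated with a short Python equivalent. This is a sketch of the mathematical definition only; how MLDB's sign treats NULL or NaN inputs is not covered here.

```python
def sign(x):
    """Return -1 for negative numbers, 0 for zero and +1 for positive numbers."""
    # Booleans subtract as integers: True - False == 1, False - True == -1.
    return (x > 0) - (x < 0)


print(sign(-3.5))  # -1
print(sign(0))     # 0
print(sign(42))    # 1
```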

Performance Improvements

The previous release of MLDB included internal refactorings to allow structured paths for row and column names. These fixed many problems with modelling data with MLDB, but came at a significant runtime cost. This release reworks and optimizes those structured path names, reducing memory usage and making their manipulation much faster. In most cases, MLDB should now be at least as performant as it was in earlier releases.