# The mltoolbox package

The mltoolbox package is a collection of packages designed for quickly building, evaluating, and deploying TensorFlow models. Generally, each package provides tools for the main steps in developing a model: preprocessing, training, evaluation, and prediction. In addition, each package is structured so that these steps can be performed locally or using cloud services on Google Cloud Platform (GCP), without code changes.

This notebook is a quick introduction to two mltoolbox packages: regression and classification for structured data, collectively called the 'structured data' package.

<br>
# The structured data package

The structured data package (`mltoolbox.classification` and `mltoolbox.regression`) is designed for solving classification or regression problems where the data has a columnar structure. That is, the input data contains a set of features where each feature is a numerical value or a categorical value. 

As an example, consider the problem of predicting a car's value given its make, model, year, and mileage:

| example_id | value | make    | model            | year | mileage |
|------------|-------|---------|------------------|------|---------|
| 1          | 11000 | 'Mazda' | 'CZ-5'           | 2013 | 70000   |
| 2          | 45000 | 'BMW'   | '6 Series'       | 2015 | 28000   |
| 3          | 20000 | 'Ford'  | 'F150 Super Cab' | 2014 | 50000   |

This table has five different column types:
1. key column (example_id)
1. target column (value)
1. categorical string column (make, model)
1. categorical number column (year)
1. numerical column (mileage)
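To make the column types concrete, here is a minimal sketch in plain Python that labels each column of the example table with a role and computes simple per-column summaries (distinct values for categorical columns, a range for numerical ones). The role names and the `summarize` helper are hypothetical and exist only for this illustration; they are not the schema format or API of the structured data package.

```python
import csv
import io

# The three example rows from the table above, written as CSV text.
CSV_TEXT = """example_id,value,make,model,year,mileage
1,11000,Mazda,CZ-5,2013,70000
2,45000,BMW,6 Series,2015,28000
3,20000,Ford,F150 Super Cab,2014,50000
"""

# Hypothetical role labels for this sketch (not the package's own format).
COLUMN_ROLES = {
    "example_id": "key",
    "value": "target",
    "make": "categorical_string",
    "model": "categorical_string",
    "year": "categorical_number",
    "mileage": "numerical",
}


def summarize(csv_text, roles):
    """Return distinct values for categorical columns and (min, max) for numerical ones."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    summary = {}
    for column, role in roles.items():
        values = [row[column] for row in rows]
        if role.startswith("categorical"):
            # Categorical columns are described by their set of distinct values.
            summary[column] = sorted(set(values))
        elif role == "numerical":
            # Numerical columns are described by their range.
            numbers = [float(v) for v in values]
            summary[column] = (min(numbers), max(numbers))
        # Key and target columns are left out of this summary.
    return summary


summary = summarize(CSV_TEXT, COLUMN_ROLES)
for column, stats in summary.items():
    print(column, stats)
```

Note that `year` is treated as a category here rather than as a quantity, which is the distinction the list above draws between a categorical number column and a numerical column.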

The structured data package can work with problems that have these types of columns. It does not support image data: a column of your data cannot be an image file or a file path to an image. For image classification problems, see `mltoolbox.image`. The structured data package also does not support free-form text columns, so problems such as sentiment analysis are not supported.

A standard workflow when using the structured data package includes running these four functions:

1. `analyze`: Computes statistics over the training set, which the trainer then uses. Unlike other packages in the mltoolbox package, the structured data package does not have a preprocess step. The input data does not need to be transformed into another format.
1. `train`: Starts a TensorFlow training job.
1. `batch_predict`: Runs prediction where the data is stored in files. When these files contain the target column, this step is called evaluation.
1. `predict`: Runs prediction from an in-memory object. The cloud service version of this step makes a prediction call to a deployed model.

Each of these functions can run its task locally or using GCP services, depending on whether a `cloud` parameter is passed. Each function also has an asynchronous version (`analyze_async`, `train_async`, etc.) that returns a job object you can use to query the status of the task. This is useful, for example, when submitting multiple ML Engine training jobs at the same time from one notebook.
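The general submit-then-poll pattern behind asynchronous jobs can be sketched with Python's standard `concurrent.futures` module. This is only an analogy: `fake_train` is a hypothetical stand-in for a long-running task, and the job objects returned by the mltoolbox `*_async` functions may expose a different interface than the `Future` objects used here.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_train(learning_rate):
    """Hypothetical stand-in for a long-running training task."""
    time.sleep(0.1)  # pretend to do some work
    return {"learning_rate": learning_rate, "status": "done"}


with ThreadPoolExecutor(max_workers=2) as executor:
    # Submit two tasks at once; each call returns immediately with a job handle.
    jobs = [executor.submit(fake_train, lr) for lr in (0.01, 0.1)]

    # A handle can be polled without blocking...
    for job in jobs:
        print("finished yet?", job.done())

    # ...or waited on until its result is available.
    results = [job.result() for job in jobs]

print(results)
```

Submitting both tasks before waiting on either is what lets them run side by side, which is the benefit described above for multiple training jobs.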

<br>
# Structure of the notebook samples

The notebook samples in this folder first demonstrate running analysis, training, batch prediction, and online prediction 'locally' without using GCP services. This first 'local end to end' notebook writes the output of each step to the local file system. Each of the later notebooks demonstrates one of the four steps using cloud services.

To speed things up, the cloud service notebooks reuse the results of the previous step from the local notebook. For example, the notebooks that perform cloud prediction require a trained model to exist in GCS. Instead of running training using ML Engine, the cloud prediction notebooks copy the trained model that the first local notebook made from the local file system to a location on GCS. This means you only use the cloud service that the individual notebook demonstrates.

<br>
# Key column

This package requires the dataset to contain a key column. The values in this column are not used in any step of this package, and they do not even have to be unique. However, a key column is required because it is extremely useful in batch prediction. Batch prediction does not, in general, produce predictions in the same order as the input, so without a key value it would be impossible to align each predicted value with its input row. By building a key into the model, you can join predictions with the input data source or with other external sources that use the same key.
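As a minimal sketch of that join, the following plain-Python example matches out-of-order predictions back to their input rows by key. The prediction file layout, the `predicted_value` column name, and the predicted numbers are all made up for illustration; the actual output format of batch prediction may differ.

```python
import csv
import io

# Input rows, in their original order.
INPUT_CSV = """example_id,make,model,year,mileage
1,Mazda,CZ-5,2013,70000
2,BMW,6 Series,2015,28000
3,Ford,F150 Super Cab,2014,50000
"""

# Hypothetical prediction output, deliberately in a different order.
# The column name and values are invented for this sketch.
PREDICTIONS_CSV = """example_id,predicted_value
3,19500
1,11200
2,44000
"""


def join_on_key(input_text, prediction_text, key="example_id"):
    """Attach each prediction to its input row by looking it up by key."""
    predictions = {
        row[key]: row for row in csv.DictReader(io.StringIO(prediction_text))
    }
    joined = []
    for row in csv.DictReader(io.StringIO(input_text)):
        merged = dict(row)
        # Rows with no matching prediction are kept unchanged.
        merged.update(predictions.get(row[key], {}))
        joined.append(merged)
    return joined


joined = join_on_key(INPUT_CSV, PREDICTIONS_CSV)
for row in joined:
    print(row["example_id"], row["make"], row["predicted_value"])
```

This lookup assumes the key values are unique. With duplicate keys, the last prediction read for a key overwrites the earlier ones, so unique keys are the practical choice whenever you intend to join results this way.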

<br>
# Where can I get help?

Please post a question to Stack Overflow with the tag 'google-cloud-datalab'.
