# ML-SQL Updates and Model Persistence

## By: Neeraj Asthana (under Professor Robert Brunner)

### Summer 2016 UIUC

___

## Updates

1. Model Persistence Exploration (pickling, joblib, JSON-formatted file)
1. Created ML-SQL 3-step process
1. Added SQL-like commenting with the "--" operator
1. Added REPLACE keyword
1. General Bug Fixes
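
As an illustration of the "--" commenting item above, here is a minimal sketch of how SQL-like comments could be stripped from a command before parsing. The helper `strip_sql_comments` is hypothetical and is not taken from the mlsql code base; it also assumes "--" never appears inside a quoted value.

```python
def strip_sql_comments(command: str) -> str:
    """Drop everything from the first "--" to the end of each line.

    Hypothetical helper for illustration; assumes "--" never occurs
    inside a quoted string or a file path.
    """
    kept = []
    for line in command.splitlines():
        code = line.split("--", 1)[0].rstrip()
        if code:  # skip lines that held only a comment
            kept.append(code)
    return "\n".join(kept)


example = """-- load the iris data
LOAD iris.data () -- trailing comment"""

print(strip_sql_comments(example))  # prints: LOAD iris.data ()
```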

In [1]:
from mlsql import repl, execute

## Model Persistence

We must be able to save different models that are trained so that we can transfer models between different users/machines without having to retrain the model completely. 

Options:
1. Pickling
1. Joblib
1. Saving Coefficients and important configurations to a file (.mlsql extension) in JSON format

### Pickling

Idea: Use the "pickle.dumps()" and "pickle.loads()" functions from Python's pickle module to serialize a trained model object to a byte string and restore it later.

Benefits:
- Can easily save and load already trained models from memory
- Minimal coding involved
- All model object structures remain in place

Problems:
- Only works in Python (no direct equivalents in R, Java, Spark, etc.)
- "pickle.dumps()" returns an in-memory byte string rather than a file, so the model cannot be handed to other people or machines unless those bytes are written out separately
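
A minimal sketch of the round trip described above, using only the standard library. A plain dictionary stands in for a trained model (an assumption made to keep the example self-contained); a fitted scikit-learn estimator is pickled the same way. The file name `model.pkl` is illustrative.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model; any picklable Python object behaves the same.
model = {"algorithm": "SVC", "kernel": "rbf", "C": 1.0}

# In-memory round trip: dumps() returns bytes, loads() rebuilds the object.
payload = pickle.dumps(model)
restored = pickle.loads(payload)

# Putting the model on disk is a separate step, using dump()/load() on a file.
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)
with open(path, "rb") as f:
    from_disk = pickle.load(f)

print(type(payload).__name__, restored == model, from_disk == model)
```

Note that unpickling executes code chosen by whoever produced the bytes, so pickled models should only be loaded from trusted sources.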

### Joblib

Idea: Use the "joblib.dump()" and "joblib.load()" functions from Python's joblib module to store the model in a file. This is similar to pickling, but it writes directly to disk.

Benefits:
- Can easily save and load already trained models
- Minimal coding involved
- All model object structures remain in place
- Can transfer trained models to other people or machines (saves to a file)

Problems:
- Only works in Python (no direct equivalents in R, Java, Spark, etc.)
- Creates many small, unnecessary files (the SVM below creates 12 different small files)

Demo:

In [2]:
from sklearn import svm
from sklearn import datasets

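# Train an SVM classifier on the iris dataset
clf = svm.SVC()
iris = datasets.load_iris()
X, y = iris.data, iris.target
clf.fit(X, y)

# Joblib: write the fitted model to disk, then load it back.
# sklearn.externals.joblib was removed in scikit-learn 0.23; the standalone
# joblib package provides the same dump()/load() functions.
import joblib
joblib.dump(clf, 'filename.pkl')
clf = joblib.load('filename.pkl')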

### Saving a JSON file

Idea: Use the "json.dumps()" and "json.loads()" functions from Python's json module to write a model's parameters (from the get_params function) to a file. We can then use set_params to recreate the object.

Benefits:
- Can easily save and load already trained models
- Minimal coding involved
- All model object structures remain in place
- Can transfer trained models to other people or machines (saves to a file)
- Generalizes better than the other options (R, Java, Spark, etc. have modules for reading files and parsing JSON)

Problems:
- Must manually code initializers for all machine learning algorithms

Demo:

In [3]:
#Show current model
clf

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [7]:
from mlsql.functions.utils.modelIO import save_model, load_model
save_model("example.txt", clf)

with open("example.txt.mlsql", "r") as f:
    text = f.read()
    print(text)

example.txt
SVC
{"degree": 3, "C": 1.0, "decision_function_shape": null, "coef0": 0.0, "class_weight": null, "random_state": null, "cache_size": 200, "tol": 0.001, "probability": false, "kernel": "rbf", "shrinking": true, "max_iter": -1, "gamma": "auto", "verbose": false}


In [10]:
new_model = load_model("example.txt.mlsql")

new_model

SVC


SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
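
The file layout printed above (a class name followed by a JSON dictionary of parameters) can be reproduced with a short sketch. This is not the mlsql implementation: `TinyModel`, `CONSTRUCTORS`, `save_params`, and `load_params` are hypothetical names, and `TinyModel` only imitates the get_params/set_params interface so that the example runs without scikit-learn.

```python
import json
import os
import tempfile


class TinyModel:
    """Hypothetical stand-in with scikit-learn style get_params/set_params."""

    def __init__(self, C=1.0, kernel="rbf"):
        self.C = C
        self.kernel = kernel

    def get_params(self):
        return {"C": self.C, "kernel": self.kernel}

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self


# The manually coded initializers: one entry per supported algorithm.
CONSTRUCTORS = {"TinyModel": TinyModel}


def save_params(path, model):
    """Write the class name on one line and the JSON parameters on the next."""
    with open(path, "w") as f:
        f.write(type(model).__name__ + "\n")
        f.write(json.dumps(model.get_params()) + "\n")


def load_params(path):
    """Look up the constructor by class name, then restore the parameters."""
    with open(path) as f:
        class_name = f.readline().strip()
        params = json.loads(f.readline())
    return CONSTRUCTORS[class_name]().set_params(**params)


path = os.path.join(tempfile.mkdtemp(), "example.mlsql")
save_params(path, TinyModel(C=10.0, kernel="linear"))
restored = load_params(path)
print(restored.get_params())  # {'C': 10.0, 'kernel': 'linear'}
```

In scikit-learn, get_params() returns only the constructor hyperparameters, so a file written this way would not carry the learned coefficients; those would have to be stored separately or the model refit after loading.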

## ML-SQL steps

The ML-SQL language is now split into 3 distinct steps (Model, Apply, and Metrics).

The visual below shows where each step begins and ends and lists the keywords associated with each step.

![ML-SQL Steps](mlsql_steps.jpg)

In [6]:
# Iris dataset example with the language

command = "LOAD /home/ubuntu/notebooks/ML-SQL/dataflows/Classification/iris.data ()"

execute(command)

['read', '/home/ubuntu/notebooks/ML-SQL/dataflows/Classification/iris.data', ',', 'False']
filename: /home/ubuntu/notebooks/ML-SQL/dataflows/Classification/iris.data
header: 
separator: 
train size: 
test size: 
predictors: 
label: 
algorithm: 
replace columns: 
replace value: 
replace identifier: 

     0    1    2    3            4
0  5.1  3.5  1.4  0.2  Iris-setosa
1  4.9  3.0  1.4  0.2  Iris-setosa
2  4.7  3.2  1.3  0.2  Iris-setosa
3  4.6  3.1  1.5  0.2  Iris-setosa
4  5.0  3.6  1.4  0.2  Iris-setosa


___

### Acknowledgements

- http://scikit-learn.org/stable/modules/model_persistence.html

___ 