### Random Forest Example using Titanic data
In this notebook we demonstrate how to use python-mldb. The library is accessed through a single class, Dealer.

In [1]:
import os

from python_mldb import Dealer


path = os.path.abspath('./')
print(path)

/Users/johnnyhsu/my_repo/python-mldb/Example


In [2]:
dealer = Dealer.Dealer(os.path.join(path, 'config_file/config.yaml'))

Connection established.
Query: SHOW DATABASES; done.
Query: USE test done.
Dealer established, service start!


### Dealer Intro
Once the Dealer object is established, it can load a *.csv file into the database through dealer.dataset.
Dealer can then start a training procedure on that dataset using dealer.procedure.train.

In [3]:
train_path = os.path.join(path, '../data/train.csv')
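assert os.path.exists(train_path), train_path  # fail early if the CSV is missing
print(train_path)

# Reuse the table name below when loading the data back
raw_data_name = 'TitanicTrain'
dealer.dataset.save_to_database(train_path, raw_data_name)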

/Users/johnnyhsu/my_repo/python-mldb/Example/../data/train.csv
Failed : 1050 (42S01): Table 'titanictrain' already exists
Query: LOAD DATA LOCAL INFILE '/Users/johnnyhsu/my_repo/python-mldb/Example/../data/train.csv' INTO TABLE TitanicTrain FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY '\n' IGNORE 1 LINES done.
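
The log above suggests that save_to_database first tries to create the table and then runs LOAD DATA anyway, so rerunning the cell may load the same rows into the existing table a second time. As a minimal sketch of a CSV load that can be rerun safely, the cell below uses Python's built-in sqlite3 as a stand-in for the MySQL database. The helper load_csv, the in-memory database, and the CRLF line endings of the sample text are assumptions for illustration and are not part of python-mldb.

```python
import csv
import io
import sqlite3

# Two sample rows taken from the Titanic data; CRLF line endings are an assumption.
CSV_TEXT = (
    'PassengerId,Survived,Name\r\n'
    '1,0,"Braund, Mr. Owen Harris"\r\n'
    '3,1,"Heikkinen, Miss. Laina"\r\n'
)


def load_csv(conn, table, text):
    """Hypothetical helper: (re)load CSV text into `table`, replacing old rows."""
    reader = csv.reader(io.StringIO(text, newline=''))
    header = next(reader)  # the first line holds the column names
    cols = ', '.join(f'"{c}" TEXT' for c in header)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
    # Clear previous rows so that a rerun does not duplicate them.
    conn.execute(f'DELETE FROM "{table}"')
    marks = ', '.join('?' for _ in header)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', reader)
    conn.commit()


conn = sqlite3.connect(':memory:')
load_csv(conn, 'TitanicTrain', CSV_TEXT)
load_csv(conn, 'TitanicTrain', CSV_TEXT)  # second run, same data

row_count = conn.execute('SELECT COUNT(*) FROM "TitanicTrain"').fetchone()[0]
rows = conn.execute('SELECT * FROM "TitanicTrain"').fetchall()
print(row_count)  # prints 2
print(rows[0])
```

Python's csv module treats `\r\n` as a line ending, so no carriage return ends up in the last field. The `'S\r'` values printed later in this notebook suggest that train.csv uses CRLF line endings while the LOAD DATA statement splits lines on `\n` only.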


In [4]:
# Check the data we just saved into the database
raw_data = dealer.dataset.load_from_database(raw_data_name)

Query: SHOW COLUMNS FROM TitanicTrain done.
Query: SELECT * FROM TitanicTrain done.


In [5]:
print(raw_data.columns)
n = 5
guests = raw_data.head(n).values
for guest in guests:
    print(guest)

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
['1' '0' '3' 'Braund, Mr. Owen Harris' 'male' '22' '1' '0' 'A/5 21171'
 '7.25' '' 'S\r']
['2' '1' '1' 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)'
 'female' '38' '1' '0' 'PC 17599' '71.2833' 'C85' 'C\r']
['3' '1' '3' 'Heikkinen, Miss. Laina' 'female' '26' '0' '0'
 'STON/O2. 3101282' '7.925' '' 'S\r']
['4' '1' '1' 'Futrelle, Mrs. Jacques Heath (Lily May Peel)' 'female' '35'
 '1' '0' '113803' '53.1' 'C123' 'S\r']
['5' '0' '3' 'Allen, Mr. William Henry' 'male' '35' '0' '0' '373450'
 '8.05' '' 'S\r']


### Training Procedure
dealer.dataset loads the data we saved into the database and returns it as a pandas.DataFrame object. We now choose the columns we want for the training process.

In [6]:
# Select the features we want
train_feature_list = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']
train_label = ['Survived']

train_x = raw_data[train_feature_list].values
train_y = raw_data[train_label].values
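
The rows printed earlier show that every value comes back from the database as a string, that missing values are empty strings, and that the last column keeps a trailing `'\r'`. A model generally needs numeric features, so a cleaning step is probably required before train_x can be used for training. The cell below is a minimal sketch of such a step on two rows copied from that output. The 0/1 encoding for Sex and the choice to turn unparseable values into NaN are our assumptions, not something python-mldb prescribes.

```python
import pandas as pd

# Two rows copied from the output above; every value is a string.
raw = pd.DataFrame({
    'Survived': ['0', '1'],
    'Pclass': ['3', '1'],
    'Sex': ['male', 'female'],
    'Age': ['22', '38'],
    'SibSp': ['1', '1'],
    'Parch': ['0', '0'],
    'Fare': ['7.25', '71.2833'],
    'Embarked': ['S\r', 'C\r'],
})

clean = raw.copy()

# Drop the carriage return left over from CRLF line endings.
clean['Embarked'] = clean['Embarked'].str.rstrip('\r')

# Convert numeric columns; empty or malformed strings become NaN.
numeric_cols = ['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
clean[numeric_cols] = clean[numeric_cols].apply(pd.to_numeric, errors='coerce')

# Encode the categorical column (this mapping is an assumption).
clean['Sex'] = clean['Sex'].map({'male': 0, 'female': 1})

feature_cols = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']
clean_x = clean[feature_cols].values
clean_y = clean['Survived'].values
print(clean_x.shape)
```

Rows where Age or another feature ends up as NaN still need to be dropped or imputed before they are passed to a classifier that does not accept missing values.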


In [7]:
# Train with toy data generated by sklearn
from sklearn.datasets import make_classification
import pandas as pd
import numpy as np

In [8]:
columns = ['feature_1', 'feature_2', 'feature_3', 'feature_4', 'label']
X, y = make_classification(n_samples=1000, 
                           n_features=4,
                           n_informative=2, 
                           n_redundant=0,
                           random_state=0, 
                           shuffle=False)

y = np.expand_dims(y, axis=1)
feature_label_pair = np.concatenate((X, y), axis=1)

toy_data = pd.DataFrame(data=feature_label_pair,
                       columns=columns)

toy_data_table_name = 'ToyData'
with open('toy_data.csv', 'w') as f:
    toy_data.to_csv(f, columns=columns, index=False)
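
One detail of the cell above is worth spelling out: concatenating the float features with the integer labels produces a single float array, so the label column is written to the CSV as 0.0 and 1.0 rather than 0 and 1. The cell below is a small sketch of that effect with made-up numbers, not the actual toy data.

```python
import numpy as np

# Made-up stand-ins for X and y from make_classification.
X_small = np.array([[0.5, -1.2], [1.5, 0.3]])
y_small = np.array([0, 1])

# Turn the label vector of shape (2,) into a column of shape (2, 1).
y_col = np.expand_dims(y_small, axis=1)

# The integer labels are upcast to float to match the features.
pair = np.concatenate((X_small, y_col), axis=1)
print(pair.shape)  # (2, 3)
print(pair.dtype)  # float64

# Recover integer labels when they are needed.
labels = pair[:, -1].astype(int)
print(labels.tolist())  # [0, 1]
```

This is consistent with the float predictions printed at the end of the notebook.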

In [9]:
# Save toy data into database
dealer.dataset.save_to_database('toy_data.csv', toy_data_table_name)

Failed : 1050 (42S01): Table 'toydata' already exists
Query: LOAD DATA LOCAL INFILE 'toy_data.csv' INTO TABLE ToyData FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY '\n' IGNORE 1 LINES done.


In [10]:
# Check the data we just saved into the database
toy_data = dealer.dataset.load_from_database(toy_data_table_name)

Query: SHOW COLUMNS FROM ToyData done.
Query: SELECT * FROM ToyData done.


In [11]:
print(toy_data.columns)

Index(['feature_1', 'feature_2', 'feature_3', 'feature_4', 'label'], dtype='object')


### Add procedure to dealer's procedure list
Choose the model we want to train and register it in the dealer's procedure list.

In [12]:
from python_mldb import Procedure
model_name = 'rf_classifier'
rf_classifier = Procedure.RFClassifierProcedure(dealer.query_handler, dealer.dataset, model_name)
dealer.procedure_dict[model_name] = rf_classifier

In [13]:
# Call the procedure's train function
procedure = dealer.procedure_dict[model_name]
procedure.train(toy_data_table_name, label_col=[columns[4]], feature_col=columns[0: 4])

Query: SHOW COLUMNS FROM ToyData done.
Query: SELECT * FROM ToyData done.
Start training random forest classifier with dataset ToyData.


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.2s finished


Query: SHOW TABLES; done.
Query: CREATE TABLE RF_Model (name VARCHAR(25), savetime DATETIME, dataset VARCHAR(25), model_path VARCHAR(250), CONSTRAINT PK Primary Key (name, savetime, dataset)) done.
Query: INSERT INTO RF_Model VALUES ('rf_classifier','2018-12-16 01:15:32.214572','ToyData','/Users/johnnyhsu/my_repo/python-mldb/saved_model/2018-12-16T01:15:32.214572_ToyData_rf_classifier.pickle') done.
INSERT INTO RF_Model VALUES ('rf_classifier','2018-12-16 01:15:32.214572','ToyData','/Users/johnnyhsu/my_repo/python-mldb/saved_model/2018-12-16T01:15:32.214572_ToyData_rf_classifier.pickle')
Trained model is saved in database test, table RF_Model.
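
Judging from the log above, training ends by pickling the model to a timestamped file and recording its path in the RF_Model table, keyed by name, save time, and dataset. The cell below is a minimal sketch of that pattern, assuming sqlite3 in place of MySQL and a plain dict in place of the trained classifier so that it runs without a database server or scikit-learn. The helpers save_model and load_model are hypothetical and do not show python-mldb's actual implementation.

```python
import datetime
import os
import pickle
import sqlite3
import tempfile


def save_model(conn, model, name, dataset, model_dir):
    """Hypothetical helper: pickle `model` and record its path in RF_Model."""
    savetime = datetime.datetime.now()
    filename = f"{savetime.isoformat()}_{dataset}_{name}.pickle"
    model_path = os.path.join(model_dir, filename)
    with open(model_path, 'wb') as f:
        pickle.dump(model, f)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS RF_Model ("
        "name VARCHAR(25), savetime DATETIME, dataset VARCHAR(25), "
        "model_path VARCHAR(250), "
        "CONSTRAINT PK PRIMARY KEY (name, savetime, dataset))"
    )
    # Store the save time at second precision, as it appears in show_model().
    conn.execute(
        "INSERT INTO RF_Model VALUES (?, ?, ?, ?)",
        (name, savetime.strftime('%Y-%m-%d %H:%M:%S'), dataset, model_path),
    )
    conn.commit()
    return model_path


def load_model(conn, name, savetime, dataset):
    """Hypothetical helper: look up the pickle path and load the model."""
    row = conn.execute(
        "SELECT model_path FROM RF_Model "
        "WHERE name=? AND savetime=? AND dataset=?",
        (name, savetime, dataset),
    ).fetchone()
    if row is None:
        raise KeyError((name, savetime, dataset))
    with open(row[0], 'rb') as f:
        return pickle.load(f)


conn = sqlite3.connect(':memory:')
model_dir = tempfile.mkdtemp()
stand_in_model = {'kind': 'stand-in for a trained classifier'}

saved_path = save_model(conn, stand_in_model, 'rf_classifier', 'ToyData', model_dir)
saved_time = conn.execute("SELECT savetime FROM RF_Model").fetchone()[0]
restored = load_model(conn, 'rf_classifier', saved_time, 'ToyData')
print(restored == stand_in_model)  # True
```

Only the path is stored in the table, so the pickle file has to stay at that location for the model to remain loadable.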


### Add function to dealer's function list
Choose the model we want to use for referencing (running a trained model on a dataset) and register it in the dealer's function list.

In [14]:
# Register function for dealer
from python_mldb import Function
rf_classifier_func = Function.RFClassifierFunction(dealer.query_handler, dealer.dataset, model_name)

In [15]:
# Show models we have in the database
rf_classifier_func.show_model()

Query: SHOW TABLES; done.
Query: SELECT * FROM RF_Model done.
('rf_classifier', datetime.datetime(2018, 12, 16, 1, 15, 32), 'ToyData', '/Users/johnnyhsu/my_repo/python-mldb/saved_model/2018-12-16T01:15:32.214572_ToyData_rf_classifier.pickle')
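
The save time stored in the table has second precision, while the pickle file name keeps the microseconds. When looking up a model, the save time string therefore has to match the stored value, not the file name. As a small sketch, the cell below formats the datetime from the row printed above into such a string. It assumes the row is available as a tuple; show_model may only print it.

```python
import datetime

# Row copied from the show_model() output above.
row = (
    'rf_classifier',
    datetime.datetime(2018, 12, 16, 1, 15, 32),
    'ToyData',
    '/Users/johnnyhsu/my_repo/python-mldb/saved_model/'
    '2018-12-16T01:15:32.214572_ToyData_rf_classifier.pickle',
)
name, savetime, dataset, model_path = row

# Format the save time the way the RF_Model lookup expects it.
savetime_str = savetime.strftime('%Y-%m-%d %H:%M:%S')
print(savetime_str)  # 2018-12-16 01:15:32
```

Building the string this way avoids retyping the timestamp by hand for each saved model.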


In [17]:
# Reference using the model and dataset we chose
rf_classifier_func.reference(model_name, '2018-12-16 01:15:32', 'ToyData', 'ToyData', columns[0: 4])

Query: SHOW TABLES; done.
Query: SELECT model_path FROM RF_Model WHERE name='rf_classifier' AND savetime='2018-12-16 01:15:32' AND dataset='ToyData' done.
Query: SHOW COLUMNS FROM ToyData done.
Query: SELECT * FROM ToyData done.
[0. 0. 0. ... 1. 1. 1.]


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.0s finished
