# Example UDF Linear Regression

The following steps test training and inference of a linear regression model on a set of 1000 rows with 10 x-columns.

1. Generate input table `udf_example_lr_in`.
1. Create empty output table `udf_example_lr_out`.
1. Create non-distributed training script `udf_lr_train.py`.
1. Execute training script.
1. Create distributed inference script `udf_lr_infer.py`.
1. Execute inference script.
1. Analyze predictions with SQL.

See also:

* [Linear Regression](https://en.wikipedia.org/wiki/Linear_regression)
* [sklearn Linear Regression Example](http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html)
* [sklearn LinearRegression reference](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)


### Import dependencies

In [1]:
# Local libraries should automatically reload
%reload_ext autoreload
%autoreload 1

# to access Kinetica Jupyter I/O functions
import sys
sys.path.append('../KJIO') 

import numpy as np
import pandas as pd

%aimport kodbc_io
%aimport kapi_io

SCHEMA = 'TEST'



### Create input table and data

Create input table with randomly generated x-values.

In [2]:
NUM_ROWS = 1000
NUM_X_COLS = 10

_x_val = np.random.random([NUM_ROWS, NUM_X_COLS])
_x_df = pd.DataFrame(_x_val).add_prefix('x')

_y_val = _x_val.sum(axis=1) + np.random.random([NUM_ROWS])*3
_y_df = pd.DataFrame(_y_val).add_prefix('y')

# Create a combined dataframe
_input_df = pd.concat([_x_df, _y_df], axis=1)

# Give the index a name so a primary key is created from it.
_input_df.index.name = 'id'

# create the table.
INPUT_TABLE = 'udf_example_lr_in'
kapi_io.save_df(_input_df, INPUT_TABLE, SCHEMA)
_input_df.head()

Dropping table: <udf_example_lr_in>
Creating table: <udf_example_lr_in>
Column 0: <id> (long) ['primary_key']
Column 1: <x0> (double) []
Column 2: <x1> (double) []
Column 3: <x2> (double) []
Column 4: <x3> (double) []
Column 5: <x4> (double) []
Column 6: <x5> (double) []
Column 7: <x6> (double) []
Column 8: <x7> (double) []
Column 9: <x8> (double) []
Column 10: <x9> (double) []
Column 11: <y0> (double) []
Inserted rows into <TEST.udf_example_lr_in>: 1000


id,x0,x1,x2,x3,x4,x5,x6,x7,x8,x9,y0
0,0.578553,0.8308,0.25494,0.420119,0.481351,0.876553,0.057114,0.542622,0.122846,0.792798,6.480924
1,0.74414,0.876429,0.573956,0.838551,0.092669,0.110693,0.996091,0.111226,0.365169,0.462991,6.95607
2,0.408107,0.452961,0.485118,0.258277,0.723331,0.415332,0.118527,0.32422,0.795756,0.11573,4.185134
3,0.17486,0.118865,0.510145,0.98471,0.41507,0.887213,0.941575,0.514211,0.703649,0.327186,8.288303
4,0.974973,0.72161,0.786799,0.02446,0.801757,0.148838,0.071539,0.754976,0.66377,0.311761,5.681353
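Because `y0` is the sum of the ten x-values plus uniform noise on [0, 3), a least-squares fit should recover slopes near 1, an intercept near 1.5 (the noise mean), and a mean squared error near 0.75 (the noise variance, 9/12). The following is a minimal sketch of that expectation using plain NumPy rather than the UDF pipeline; the seed and variable names are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
num_rows, num_x_cols = 1000, 10

# Same recipe as the cell above: uniform x-values, y = sum(x) + 3 * uniform noise.
x_val = rng.random((num_rows, num_x_cols))
y_val = x_val.sum(axis=1) + rng.random(num_rows) * 3

# Ordinary least squares with an explicit intercept column.
design = np.column_stack([np.ones(num_rows), x_val])
coef, *_ = np.linalg.lstsq(design, y_val, rcond=None)
intercept, slopes = coef[0], coef[1:]

# In-sample error statistics of the fitted line.
residuals = y_val - design @ coef
mse = float(np.mean(residuals ** 2))
r2 = 1.0 - float(np.sum(residuals ** 2) / np.sum((y_val - y_val.mean()) ** 2))

print('intercept:', round(float(intercept), 3))
print('slopes:', np.round(slopes, 3))
print('mse:', round(mse, 3), 'r2:', round(r2, 3))
```

The noise accounts for roughly half of the variance in `y0`, so an R² score around 0.5 is the best a linear model can do on this data.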


### Create empty output table

Create an output table based on the input table with an additional `predict` column.

In [3]:
_output_df = pd.DataFrame(data=None, dtype='float32', columns=_input_df.columns)
_output_df['predict'] = pd.Series(None, dtype='float32')
_output_df['id'] = pd.Series(None, dtype='int32')

# set the index which will become the primary key
_output_df.set_index('id', inplace=True)

# create the table.
OUTPUT_TABLE = 'udf_example_lr_out'
kapi_io.save_df(_output_df, OUTPUT_TABLE, SCHEMA)

Dropping table: <udf_example_lr_out>
Creating table: <udf_example_lr_out>
Column 0: <id> (long) ['primary_key']
Column 1: <x0> (float) []
Column 2: <x1> (float) []
Column 3: <x2> (float) []
Column 4: <x3> (float) []
Column 5: <x4> (float) []
Column 6: <x5> (float) []
Column 7: <x6> (float) []
Column 8: <x7> (float) []
Column 9: <x8> (float) []
Column 10: <x9> (float) []
Column 11: <y0> (float) []
Column 12: <predict> (float) []
Inserted rows into <TEST.udf_example_lr_out>: 0
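The output table holds no rows at this point; only the column names and dtypes of the empty DataFrame matter. As a rough sketch of the same idea in isolation, an empty frame can be built from zero-length typed Series. The short column list below is a hypothetical stand-in for `_input_df.columns`, and nothing here touches the database.

```python
import pandas as pd

# Hypothetical column list standing in for _input_df.columns.
columns = ['x0', 'x1', 'y0', 'predict']

# One zero-length Series per column fixes the dtype without adding any rows.
typed_columns = {name: pd.Series(dtype='float32') for name in columns}
typed_columns['id'] = pd.Series(dtype='int32')

# The named index plays the role of the primary key, as in the cell above.
output_df = pd.DataFrame(typed_columns).set_index('id')

print(output_df.dtypes)
print(len(output_df))
```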


### Create UDF training script

This UDF will create a model from the input table and return the model ID.

In [4]:
%%writefile udf_lr_train.py
###########################################################
# Proc Name: lr_train
# Input Params: in_table_name
# Output Params: model_id, processed_rows
###########################################################

import gpudb
import numpy as np
import pandas as pd
import pickle
from sklearn.linear_model import LinearRegression
from kmodel_io import KModelIO
from kinetica_proc import ProcData
import kapi_io

PROC_DATA = ProcData()

# Log some proc details.
PROC_NAME = PROC_DATA.request_info['proc_name']
RUN_ID = PROC_DATA.request_info['run_id']
print('UDF Start: {} ({})'.format(PROC_NAME, RUN_ID))

IN_TABLE_NAME = PROC_DATA.params['in_table_name']
print('Got input table: {}'.format(IN_TABLE_NAME))
MODEL_NAME = 'LinearReg_Model'

# read input table to a dataframe
_in_df = kapi_io.load_df(IN_TABLE_NAME)
_y_df = _in_df['y0']
_x_df = _in_df.drop(['y0', 'id'], axis=1)

_model = LinearRegression()
_model.fit(X=_x_df, y=_y_df)
print('LinearRegression coefficients: {}'.format(_model.coef_))

# save model to database
_model_pickle = pickle.dumps(_model)

kio = KModelIO()
_model_id = kio.Model2Kinetica(pbfile=_model_pickle, 
                   ModelName=MODEL_NAME, 
                   Loss=-99, 
                   COLLECTION='TEST')
print('Saving model: {} ({})'.format(MODEL_NAME, _model_id))

_result_rows = str(_in_df.shape[0])
PROC_DATA.results['processed_rows'] = _result_rows
PROC_DATA.results['model_id'] = _model_id
PROC_DATA.complete()
print('UDF Complete: {} rows ({})'.format(_result_rows, RUN_ID))

Overwriting udf_lr_train.py
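The training script hands the fitted model to the database as pickled bytes, and the inference script later restores it with `pickle.loads`. The sketch below shows only that round trip, with no Kinetica calls; the data and variable names are assumptions made for illustration.

```python
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression

# Small synthetic data set built with the same recipe as the input table.
rng = np.random.default_rng(0)
x_val = rng.random((100, 10))
y_val = x_val.sum(axis=1) + rng.random(100) * 3

model = LinearRegression().fit(x_val, y_val)

# Serialize the fitted model to bytes, as the training script does before saving.
model_bytes = pickle.dumps(model)

# Restore it, as the inference script does after reading the bytes back.
restored = pickle.loads(model_bytes)

# The restored model should reproduce the original predictions.
same = bool(np.allclose(model.predict(x_val), restored.predict(x_val)))
print(type(model_bytes).__name__, len(model_bytes) > 0, same)
```

Pickled scikit-learn models are only reliably loadable with a compatible scikit-learn version, so the environment that runs inference should match the one used for training.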


### Run Training UDF

Pass the input table name as a parameter. The UDF creates a model and returns its model ID, which we save for the inference step.

In [5]:
%aimport kudf_io

kudf_io.create_proc(
    _proc_name='lr_train',
    _file_paths=['udf_lr_train.py', '../KJIO/kmodel_io.py', '../KJIO/kapi_io.py'],
    _execution_mode='nondistributed')

_results = kudf_io.submit_proc(_proc_name='lr_train', 
                       _input_table_names=[], 
                       _output_table_names=[],
                       _params={'in_table_name' : INPUT_TABLE})

_model_id = _results['0']['model_id']
print('Generated model: {}'.format(_model_id))

Reading file: udf_lr_train.py
Reading file: kmodel_io.py
Reading file: kapi_io.py
Creating UDF: lr_train [udf_lr_train.py, ../KJIO/kmodel_io.py, ../KJIO/kapi_io.py]
Dropping older version of proc: lr_train 
Starting UDF: lr_train (id=12)
   Input Tables: []
   Output Tables: []
[12] UDF Running... (0/1 complete) (time=0.0)
[12] UDF Running... (1/1 complete) (time=5.0)
[12] UDF finished with status: complete 
TOM 0: [complete] {'model_id': '99e4d4fc-903f-11e8-a356-0242ac130002', 'processed_rows': '1000'}  (time=3.6 sec)
Generated model: 99e4d4fc-903f-11e8-a356-0242ac130002


### View saved models

The model is saved to the `TFmodel` table. The next UDF loads it and performs distributed inference.

In [6]:
kodbc_io.get_df("""
SELECT
    model,
    model_id,
    accuracy,
    data_time_created
FROM TFmodel
order by data_time_created desc
""")

Connected to GPUdb ODBC Server (6.2.0.12.20180720232954)


,model,model_id,Accuracy,Data_Time_created
0,LinearReg_Model,99e4d4fc-903f-11e8-a356-0242ac130002,-99.0,2018-07-25 19:19:06
1,LinearReg_Model,073a40ba-903f-11e8-b693-0242ac130002,-99.0,2018-07-25 19:15:00


### Create UDF inference script

This UDF loads the model identified by the model ID we pass and generates predictions, which are saved in the output table.

In [7]:
%%writefile udf_lr_infer.py
###########################################################
# Proc Name: lr_infer
# Input Params: model_id
# Output Params: result_rows, mse, variance
###########################################################

import gpudb
import numpy as np
import pickle
from kmodel_io import KModelIO
from kinetica_proc import ProcData
from sklearn.metrics import mean_squared_error, r2_score

_proc_data = ProcData()

# Log some proc details.
PROC_NAME = _proc_data.request_info['proc_name']
DATA_SEGMENT_ID = _proc_data.request_info['data_segment_id']
RUN_ID = _proc_data.request_info['run_id']
print('UDF Start: {} ({}-{})'.format(PROC_NAME, RUN_ID, DATA_SEGMENT_ID))

_in_table = _proc_data.input_data[0]
_out_table = _proc_data.output_data[0]
_out_table.size = _in_table.size

# Load the model
_model_id = _proc_data.params['model_id']
print('Reading model: {}'.format(_model_id))
_kio = KModelIO()
_picklebytes = _kio.SkModel_from_Kinetica(_model_id)
_model = pickle.loads(_picklebytes)

# copy data columns to out table.
for _idx, _col in enumerate(_in_table):
    _out_table[_col.name][:] = _in_table[_col.name][:]

# copy each column by name into a numpy array
_y_values = _in_table['y0']
_x_values = np.zeros((_in_table.size, 10), dtype=float)
_x_col_names = ['x' + str(i) for i in range(10)]
for _idx, _x_col in enumerate(_x_col_names):
    _x_values[:,_idx] = _in_table[_x_col]

_y_predict = _model.predict(_x_values)
_out_table['predict'][:] = _y_predict

# Calculate stats
_mse = mean_squared_error(_y_values, _y_predict)
_proc_data.results['mse'] = str(_mse)

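# r2_score returns the coefficient of determination (R^2); it is reported under the 'variance' key.
_variance = r2_score(_y_values, _y_predict)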
_proc_data.results['variance'] = str(_variance)

_result_rows = str(_out_table.size)
_proc_data.results['result_rows'] = _result_rows

_proc_data.complete()

print('UDF Complete: {} rows ({}-{})'.format(_result_rows, RUN_ID, DATA_SEGMENT_ID))

Overwriting udf_lr_infer.py
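The inference script copies the x-columns by name into a single 2-D array before calling `predict`. The sketch below isolates that step; the plain dict is a hypothetical stand-in for the UDF input table, and only two x-columns are used to keep it short.

```python
import numpy as np

# Hypothetical stand-in for the UDF input table: column name -> values.
in_table = {
    'x0': [0.1, 0.2, 0.3],
    'x1': [1.0, 2.0, 3.0],
    'y0': [1.5, 2.5, 3.5],
}

x_col_names = ['x' + str(i) for i in range(2)]

# Stack the columns in a fixed order so that column i of the array holds feature xi.
x_values = np.column_stack(
    [np.asarray(in_table[name], dtype=float) for name in x_col_names])
y_values = np.asarray(in_table['y0'], dtype=float)

print(x_values.shape)
print(x_values[0])
```

The column order matters: the array passed to `predict` has to list the features in the same order as the data the model was fitted on.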


### Run UDF inference script

Run the inference UDF and pass the model ID generated by the training UDF.

In [8]:
%aimport kudf_io

kudf_io.create_proc(
    _proc_name='lr_infer',
    _file_paths=['udf_lr_infer.py', '../KJIO/kmodel_io.py'],
    _execution_mode='distributed')

print('Submitting proc with model: {}'.format(_model_id))
_results = kudf_io.submit_proc(_proc_name='lr_infer', 
                       _input_table_names=[INPUT_TABLE], 
                       _output_table_names=[OUTPUT_TABLE],
                       _params={'model_id' : _model_id})

Reading file: udf_lr_infer.py
Reading file: kmodel_io.py
Creating UDF: lr_infer [udf_lr_infer.py, ../KJIO/kmodel_io.py]
Dropping older version of proc: lr_infer 
Submitting proc with model: 99e4d4fc-903f-11e8-a356-0242ac130002
Starting UDF: lr_infer (id=13)
   Input Tables: ['udf_example_lr_in']
   Output Tables: ['udf_example_lr_out']
[13] UDF Running... (0/2 complete) (time=0.0)
[13] UDF Running... (2/2 complete) (time=5.0)
[13] UDF finished with status: complete 
TOM 0: [complete] {'mse': '0.8154020040933726', 'result_rows': '516', 'variance': '0.5373550737647687'}  (time=1.8 sec)
TOM 1000: [complete] {'mse': '0.7414310222615849', 'result_rows': '484', 'variance': '0.5220175504006552'}  (time=1.8 sec)
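In distributed mode each TOM reports statistics for its own data segment only. A per-segment MSE is a mean over that segment's rows, so an overall MSE can be recovered by weighting each value by its row count. The sketch below uses the values copied from the log above; the dict layout is an assumption modeled on the training results, not a documented return format.

```python
# Per-TOM results copied from the log above; the dict layout is an assumption.
tom_results = {
    '0': {'mse': '0.8154020040933726', 'result_rows': '516'},
    '1000': {'mse': '0.7414310222615849', 'result_rows': '484'},
}

total_rows = sum(int(r['result_rows']) for r in tom_results.values())

# Weight each segment's MSE by the number of rows it was computed over.
weighted_sum = sum(
    float(r['mse']) * int(r['result_rows']) for r in tom_results.values())
overall_mse = weighted_sum / total_rows

print(total_rows, round(overall_mse, 4))
```

The same trick does not apply to the R² values reported under `variance`, because each segment measures variance around its own mean of `y0`.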


### Analyze results

Use SQL to compare the actual values with the predictions.

In [9]:
kodbc_io.get_df("""
SELECT
    y0 AS real_value, 
    predict, 
    ABS(y0 - predict) AS error
FROM {}
LIMIT 10
""".format(OUTPUT_TABLE))

Connected to GPUdb ODBC Server (6.2.0.12.20180720232954)


,real_value,predict,error
0,6.480924,6.409591,0.071333
1,6.95607,6.728884,0.227186
2,4.185134,5.564436,1.379302
3,8.288303,7.179564,1.108739
4,5.681353,6.597044,0.91569
5,3.837377,5.169572,1.332196
6,6.397273,6.45026,0.052987
7,4.830026,5.28457,0.454544
8,6.626303,5.911208,0.715095
9,8.302239,7.144397,1.157842
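The per-row errors can also be summarized into a single figure. As a small sketch, the following computes the mean absolute error and root mean squared error in Python from the ten rows shown above (values copied from the output, so this is a sample and not the full table).

```python
import math

# (real_value, predict) pairs copied from the ten rows shown above.
rows = [
    (6.480924, 6.409591),
    (6.95607, 6.728884),
    (4.185134, 5.564436),
    (8.288303, 7.179564),
    (5.681353, 6.597044),
    (3.837377, 5.169572),
    (6.397273, 6.45026),
    (4.830026, 5.28457),
    (6.626303, 5.911208),
    (8.302239, 7.144397),
]

abs_errors = [abs(real - pred) for real, pred in rows]

# Mean absolute error and root mean squared error over the sample.
mae = sum(abs_errors) / len(abs_errors)
rmse = math.sqrt(sum(e * e for e in abs_errors) / len(abs_errors))

print(round(mae, 3), round(rmse, 3))
```

On these ten rows the mean absolute error comes out at roughly 0.74, which is in line with the per-TOM MSE values reported by the inference UDF.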
