# Example UDF: Sum of Squares

This tutorial performs the following steps:

1. Create the input table `udf_example_sos_in` with 1000 rows of random data for $x_1$ and $x_2$.
2. Create the output table `udf_example_sos_out` where the results will be stored.
3. Create a UDF that calculates $y = x_1^2 + x_2^2$ from the input table and saves the results to the output table.
4. Execute the UDF.
5. Compare the actual and expected results.
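
The calculation in step 3 can be sketched locally with NumPy before involving the database. This is a minimal illustration, not part of the notebook itself: the fixed seed and the use of `default_rng` are assumptions made here for reproducibility, while the row count, the scaling by 10, and the `float32` type mirror the input data created later in this tutorial.

```python
import numpy as np

NUM_ROWS = 1000

# Assumption: a seeded generator, so the sketch is reproducible.
rng = np.random.default_rng(0)

# Two float32 columns of normally distributed values scaled by 10.
x1 = (rng.standard_normal(NUM_ROWS) * 10).astype('float32')
x2 = (rng.standard_normal(NUM_ROWS) * 10).astype('float32')

# Sum of squares, computed element-wise for every row.
y = x1**2 + x2**2

print(y[:5])
```

The UDF defined later performs the same arithmetic row by row inside the database.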

See also:
* [Sum of Squares Tutorial](https://www.kinetica.com/docs/udf/python/examples/dist_noncuda_sum_of_squares/dist_noncuda_sum_of_squares.html)
* [Running Python UDFs](https://www.kinetica.com/docs/udf/python/running.html)
* [Python UDF API](https://www.kinetica.com/docs/udf/python/writing.html)
* [UDF Simulator](https://www.kinetica.com/docs/udf/simulating_udfs.html)


### Import dependencies

In [1]:
# Local libraries should automatically reload
%reload_ext autoreload
%autoreload 1

# to access Kinetica Jupyter I/O functions
import sys
sys.path.append('../KJIO') 

import numpy as np
import pandas as pd

%aimport kodbc_io
%aimport kapi_io

INPUT_TABLE = 'udf_example_sos_in'
OUTPUT_TABLE = 'udf_example_sos_out'
SCHEMA = 'TEST'

### Create input data table

Create a table named `udf_example_sos_in` with 1000 rows of random numbers in the `x1` and `x2` columns.

In [2]:
NUM_ROWS = 1000

# Create a dataframe from a dict of series. 
_input_df = pd.DataFrame({ 
    'id' : np.array(range(NUM_ROWS), dtype='int32'),
    'x1' : pd.Series(np.random.randn(NUM_ROWS)*10, dtype='float32'),
    'x2' : pd.Series(np.random.randn(NUM_ROWS)*10, dtype='float32'),
    }).set_index('id')

kapi_io.save_df(_input_df, INPUT_TABLE, SCHEMA)

Dropping table: <udf_example_sos_in>
Creating  table: <udf_example_sos_in>
Column 0: <id> (long) ['shard_key']
Column 1: <x1> (float) []
Column 2: <x2> (float) []
Inserted rows into <TEST.udf_example_sos_in>: 1000


### View input table contents

In [3]:
kodbc_io.get_df("""
select top 10 * from {}
""".format(INPUT_TABLE))

Connected to GPUdb ODBC Server (6.2.0.17.20180825221415)
Rows returned: 10


Unnamed: 0,id,x1,x2
0,0,-8.679238,-14.42838
1,1,1.172305,-18.570099
2,2,-12.150077,-2.034876
3,3,16.557261,-15.781698
4,4,10.953131,-4.461103
5,5,-4.104146,5.906353
6,9,-10.045493,17.56945
7,16,-9.13765,-10.306147
8,19,10.237325,-13.428774
9,22,-9.009595,-1.922585


### Create an empty output table

In [4]:
_output_df = pd.DataFrame({ 
    'id' : pd.Series(None, dtype='int32'),
    'y' : pd.Series(None, dtype='float32'),
    }).set_index('id')

kapi_io.save_df(_output_df, OUTPUT_TABLE, SCHEMA)

Dropping table: <udf_example_sos_out>
Creating  table: <udf_example_sos_out>
Column 0: <id> (long) ['shard_key']
Column 1: <y> (float) []
Inserted rows into <TEST.udf_example_sos_out>: 0


### Contents of the UDF

The cell below saves a Python file named `udf_sos_proc.py` in the current folder.

In [5]:
%%writefile udf_sos_proc.py
from kinetica_proc import ProcData
proc_data = ProcData()

proc_name = proc_data.request_info['proc_name']
data_segment_id = proc_data.request_info['data_segment_id']
run_id = proc_data.request_info['run_id']
print('UDF Start: {} ({}-{})'.format(proc_name, run_id, data_segment_id))

in_table = proc_data.input_data[0]
col_in_x1 = in_table['x1']
col_in_x2 = in_table['x2']
col_in_id = in_table['id']

out_table = proc_data.output_data[0]
col_out_y = out_table['y']
col_out_id = out_table['id']

# Extend the output table by the number of record entries in the input table
out_table.size = in_table.size

# Loop through all input rows, writing the sum of squares and the id
# to the corresponding output columns
for i in range(0, in_table.size):
    col_out_y[i] = col_in_x1[i]**2 + col_in_x2[i]**2
    col_out_id[i] = col_in_id[i]

# Report the row count; it is returned to the caller when the proc finishes
result_rows = str(out_table.size)
proc_data.results['result_rows'] = result_rows
proc_data.complete()

print('UDF Complete: {} rows ({}-{})'.format(result_rows, run_id, data_segment_id))

Overwriting udf_sos_proc.py


### Execute the UDF

Submit the script for execution and monitor the results.

In [6]:
%aimport kudf_io

kudf_io.create_proc(
    _proc_name='sos_proc',
    _file_paths=['./udf_sos_proc.py'])

_result = kudf_io.submit_proc(_proc_name='sos_proc', 
                    _params={},
                    _input_table_names=[INPUT_TABLE], 
                    _output_table_names=[OUTPUT_TABLE])

Reading file: udf_sos_proc.py
Creating UDF: sos_proc [./udf_sos_proc.py]
Dropping older version of proc: sos_proc 
Starting UDF: sos_proc (id=1)
   Input Tables: ['udf_example_sos_in']
   Output Tables: ['udf_example_sos_out']
[1] UDF Running... (0/2 complete) (time=0.0)
[1] UDF Running... (2/2 complete) (time=5.0)
[1] UDF finished with status: complete 
TOM 0: [complete] {'result_rows': '516'}  (time=3.1 sec)
TOM 1000: [complete] {'result_rows': '484'}  (time=3.1 sec)
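
Each TOM reports the number of rows it processed from its own data segment, so the per-TOM counts should add up to the size of the input table. The check below is a small sketch: the `tom_results` dictionary is a hypothetical structure filled in by hand from the log above, not the actual return value of `kudf_io.submit_proc`.

```python
NUM_ROWS = 1000

# Hypothetical structure: values copied by hand from the log output above.
tom_results = {
    'TOM 0': {'result_rows': '516'},
    'TOM 1000': {'result_rows': '484'},
}

# The UDF stores result_rows as a string, so convert before summing.
total_rows = sum(int(r['result_rows']) for r in tom_results.values())

print('Rows processed across all TOMs:', total_rows)
assert total_rows == NUM_ROWS
```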


### Query the results

The `diff` column should be zero for every row, apart from small single-precision (`float32`) rounding differences.

In [7]:
kodbc_io.get_df('''
SELECT 
    in_t.x1, 
    in_t.x2, 
    out_t.y AS actual_result, 
    FLOAT(in_t.x1 * in_t.x1 + in_t.x2 * in_t.x2) AS expected_result,
    FLOAT(in_t.x1 * in_t.x1 + in_t.x2 * in_t.x2) - out_t.y AS diff
FROM {} as out_t
INNER JOIN {} AS in_t 
    ON in_t.id = out_t.id
LIMIT 10
'''.format(OUTPUT_TABLE, INPUT_TABLE))

Connected to GPUdb ODBC Server (6.2.0.17.20180825221415)
Rows returned: 10


Unnamed: 0,x1,x2,actual_result,expected_result,diff
0,-8.679238,-14.42838,283.507324,283.507324,0.0
1,1.172305,-18.570099,346.22287,346.22287,0.0
2,-12.150077,-2.034876,151.765091,151.765091,0.0
3,16.557261,-15.781698,523.204895,523.204895,0.0
4,10.953131,-4.461103,139.872513,139.872513,0.0
5,-4.104146,5.906353,51.729023,51.729019,-4e-06
6,-10.045493,17.56945,409.597504,409.597504,0.0
7,-9.13765,-10.306147,189.713318,189.713318,0.0
8,10.237325,-13.428774,285.134796,285.134796,0.0
9,-9.009595,-1.922585,84.869133,84.869133,0.0
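
Row 5 shows a nonzero `diff` of about `4e-06`. That is roughly the spacing between adjacent `float32` values near 51.7, so it is consistent with single-precision rounding rather than a wrong result. The sketch below illustrates this using the two values from that row; the tolerance `rtol=1e-6` is an assumption chosen here for illustration, not a value taken from Kinetica.

```python
import numpy as np

# Values copied from row 5 of the result table above.
actual = np.float32(51.729023)
expected = np.float32(51.729019)

# Distance from `actual` to the next representable float32 value.
ulp = np.spacing(actual)

# Compare in double precision to avoid further rounding.
diff = abs(float(expected) - float(actual))

# Assumption: a relative tolerance of 1e-6 is acceptable for this check.
within_tolerance = bool(np.isclose(actual, expected, rtol=1e-6, atol=0.0))

print('float32 spacing near 51.7:', ulp)
print('observed difference:', diff)
print('within tolerance:', within_tolerance)
```

When comparing the full result set, a tolerance-based check such as `np.allclose` is a safer test than requiring `diff` to be exactly zero.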
