# Examples for Kinetica API

The following steps will test:
1. Saving a Pandas dataframe to a Kinetica table.
1. Saving a Pandas dataframe to a table with a shard key and column properties.
1. Loading a Kinetica table into a Pandas dataframe.

See also:
* [Kinetica Python Guide](https://www.kinetica.com/docs/6.2/tutorials/python_guide.html)
* [Kinetica Python API](https://www.kinetica.com/docs/api/python/index.html)
* [Intro to Pandas Dataframes](https://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe)
* [10 Minutes to Pandas](https://pandas.pydata.org/pandas-docs/stable/10min.html)


In [1]:
# Local libraries should automatically reload
%reload_ext autoreload
%autoreload 1

# to access Kinetica Jupyter I/O functions
import sys
sys.path.append('../KJIO') 

import numpy as np
import pandas as pd

### Create a test dataframe

This is the data we will save to a table.

In [2]:
# Create a dataframe from a dict of series. 
_test_df = pd.DataFrame({ 
    'str_col' : ['A', 'B', 'C', 'D'],
    #'cat_col' : pd.Categorical(["test","train","test","train"]),
    'double_col' : 1.,
    #'ts_col' : pd.date_range('1/1/2000', periods=4),
    'float_col' : pd.Series(range(4), dtype='float32'),
    'int_col' : np.array(np.random.randn(4)*10, dtype='int32')
    })

_test_df.head()

| | str_col | double_col | float_col | int_col |
|---|---|---|---|---|
| 0 | A | 1.0 | 0.0 | 1 |
| 1 | B | 1.0 | 1.0 | -3 |
| 2 | C | 1.0 | 2.0 | -14 |
| 3 | D | 1.0 | 3.0 | -6 |


### Understanding Dataframe Column Types

The `dtypes` property lists the type of each dataframe column. These types determine the column types of the Kinetica table.

In [3]:
_test_df.dtypes

str_col        object
double_col    float64
float_col     float32
int_col         int32
dtype: object
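To make the type conversion concrete, here is a minimal sketch of a dtype-to-table-type lookup. The mapping is inferred only from the `save_df` output shown later in this notebook (`object` to `string`, `float64` to `double`, and so on). The names `DTYPE_TO_KINETICA` and `describe_column_types` are hypothetical and are not part of `kapi_io`.

```python
import numpy as np
import pandas as pd

# Hypothetical mapping, inferred from the save_df output in this notebook.
# It is not taken from the kapi_io source and covers only these four dtypes.
DTYPE_TO_KINETICA = {
    'object': 'string',
    'float64': 'double',
    'float32': 'float',
    'int32': 'int',
}

def describe_column_types(df):
    """Return {column name: assumed Kinetica type} for a dataframe."""
    return {name: DTYPE_TO_KINETICA[str(dtype)]
            for name, dtype in df.dtypes.items()}

example_df = pd.DataFrame({
    'str_col': pd.Series(['A', 'B'], dtype='object'),
    'double_col': 1.0,
    'float_col': pd.Series([0, 1], dtype='float32'),
    'int_col': np.array([1, -3], dtype='int32'),
})

column_types = describe_column_types(example_df)
print(column_types)
```

A dtype outside the lookup raises a `KeyError`, which is a reasonable way to surface columns that need an explicit cast before saving.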

### Understanding Dataframe column and row indexes

Each dataframe has separate indexes for **rows** and **columns** that determine its dimensions. When the dataframe is converted to a table the indexes will be used to generate table attributes.

The **column index** is typically a list of strings. It can also be a range of numbers, but numeric labels make poor table column names, so we use strings.

In [4]:
_test_df.columns

Index(['str_col', 'double_col', 'float_col', 'int_col'], dtype='object')
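If a dataframe arrives with a numeric column index, one option is to rename the columns before saving. The sketch below uses plain Pandas; the `col_` prefix is an arbitrary choice for illustration.

```python
import pandas as pd

# A dataframe built from a nested list gets a numeric column index (0, 1).
numeric_df = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]])

# Replace the numeric labels with string names suitable for table columns.
named_df = numeric_df.copy()
named_df.columns = ['col_{}'.format(i) for i in numeric_df.columns]

print(named_df.columns)
```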

The **row index**, if it has a name, is converted to a table column and used as the shard key. Here we make `str_col` the index.

In [5]:
_test_df2 = _test_df.set_index('str_col')
print(_test_df2.index)
_test_df2

Index(['A', 'B', 'C', 'D'], dtype='object', name='str_col')


| str_col | double_col | float_col | int_col |
|---|---|---|---|
| A | 1.0 | 0.0 | 1 |
| B | 1.0 | 1.0 | -3 |
| C | 1.0 | 2.0 | -14 |
| D | 1.0 | 3.0 | -6 |
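The sketch below illustrates how a named row index could be turned back into an ordinary column while remembering its name as a shard key candidate. The helper `index_as_column` is hypothetical; how `kapi_io` actually handles the index is not shown in this notebook.

```python
import pandas as pd

def index_as_column(df):
    """Hypothetical helper: move a named row index into a regular column.

    Returns (dataframe, index name). The name is None when the index is
    unnamed, in which case the dataframe is returned unchanged.
    """
    if df.index.name is None:
        return df, None
    return df.reset_index(), df.index.name

plain_df = pd.DataFrame({'str_col': ['A', 'B'], 'double_col': [1.0, 1.0]})
indexed_df = plain_df.set_index('str_col')

flat_df, key_name = index_as_column(indexed_df)
same_df, no_key = index_as_column(plain_df)
print(key_name, list(flat_df.columns))
```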


### Set Kinetica schema and connection parameters

### Import kapi_io functions

In this example we use %aimport so changes to the source are reloaded each time we execute the cell.

In [6]:
%aimport kapi_io

SCHEMA = 'TEST'

### Create and save the dataframe to a table

Types are converted and string columns get a default `char16` attribute.

In [7]:
kapi_io.save_df(_test_df, 'test_df', SCHEMA)

Dropping table: <test_df>
Creating  table: <test_df>
Column 0: <str_col> (string) ['char16']
Column 1: <double_col> (double) []
Column 2: <float_col> (float) []
Column 3: <int_col> (int) []
Inserted rows into <TEST.test_df>: 4
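When the default `char16` is too small or too large for a string column, the width can be estimated from the data. The following is a sketch only: `smallest_char_property` is a hypothetical helper, and it assumes the width is measured in characters and that the available fixed widths are the powers of two from `char1` to `char256`.

```python
import pandas as pd

# Assumed fixed-width char sizes (char1 ... char256).
CHAR_SIZES = [1, 2, 4, 8, 16, 32, 64, 128, 256]

def smallest_char_property(values):
    """Return the smallest 'charN' name that fits every value, or None.

    Hypothetical helper; lengths are counted in characters, which is an
    assumption. None means no fixed-width char property fits.
    """
    longest = max(len(str(v)) for v in values)
    for size in CHAR_SIZES:
        if longest <= size:
            return 'char{}'.format(size)
    return None

names = pd.Series(['A', 'B', 'C', 'D'])
print(smallest_char_property(names))
```

The returned name could then be placed in a column properties dictionary in place of a hard-coded constant.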


### Save Dataframe with shard key and column properties

Finer control over the column properties is available through the `_col_props` argument.

In [8]:
import gpudb

# Convert str_col to a row index so it will be used as the shard_key
_test_df_sk = _test_df.set_index('str_col')

# you can specify additional column properties
_col_props = { 'str_col' : [gpudb.GPUdbColumnProperty.CHAR32], 
             'int_col' : [gpudb.GPUdbColumnProperty.NULLABLE]}

kapi_io.save_df(_test_df_sk, 'test_df_sk', SCHEMA, _col_props=_col_props)

Dropping table: <test_df_sk>
Creating  table: <test_df_sk>
Column 0: <str_col> (string) ['char32', 'shard_key']
Column 1: <double_col> (double) []
Column 2: <float_col> (float) []
Column 3: <int_col> (int) ['nullable']
Inserted rows into <TEST.test_df_sk>: 4


### Load the test_df table to a dataframe

In [9]:
_loaded_df = kapi_io.load_df('test_df')
_loaded_df

Getting 4 records from <test_df>.
Records Retrieved: (4, 4)


| | str_col | double_col | float_col | int_col |
|---|---|---|---|---|
| 0 | A | 1.0 | 0.0 | 1 |
| 1 | B | 1.0 | 1.0 | -3 |
| 2 | C | 1.0 | 2.0 | -14 |
| 3 | D | 1.0 | 3.0 | -6 |


### Set row index (optional)

If one of the columns should serve as the row index, set it after loading.

In [10]:
_loaded_df2 = _loaded_df.set_index('str_col')
_loaded_df2

| str_col | double_col | float_col | int_col |
|---|---|---|---|
| A | 1.0 | 0.0 | 1 |
| B | 1.0 | 1.0 | -3 |
| C | 1.0 | 2.0 | -14 |
| D | 1.0 | 3.0 | -6 |
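One way to confirm that a save and load round trip preserved the data is to compare the two dataframes with `pandas.testing.assert_frame_equal`. The sketch below uses a copy as a stand-in for the loaded dataframe, since it does not connect to Kinetica.

```python
import pandas as pd

original_df = pd.DataFrame({
    'str_col': ['A', 'B'],
    'double_col': [1.0, 1.0],
})

# Stand-in for the dataframe returned by a load; no database is contacted.
loaded_df = original_df.copy()

# Raises AssertionError if values, dtypes, or index differ.
pd.testing.assert_frame_equal(
    original_df.set_index('str_col'),
    loaded_df.set_index('str_col'),
)
round_trip_ok = True
print(round_trip_ok)
```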
