<h3 align="center" style="margin:0px">
    <img width="200" src="../_assets/images/logo_purple.png" alt="Kinetica Logo"/>
</h3>
<h5 align="center" style="margin:0px">
    <a href="https://www.kinetica.com/">Website</a>
    <span> | </span>
    <a href="https://docs.kinetica.com/7.2/">Docs</a>
    <span> | </span>
    <a href="https://docs.kinetica.com/7.2/api/">API Docs</a>
    <span> | </span>
    <a href="https://join.slack.com/t/kinetica-community/shared_invite/zt-1bt9x3mvr-uMKrXlSDXfy3oU~sKi84qg">Community Slack</a>   
</h5>

# Vector Dataframe I/O Demo

We will learn how to ingest and egress records with vector columns. This includes loading data from a CSV into a dataframe and using a dataframe to create a Kinetica table.

## Overview

In Kinetica 7.2 we introduced a vector datatype with [similarity search](https://docs.kinetica.com/7.2/vector_search/) functions. In this demo we will learn how to ingest and egress dataframes containing vector columns.

We will explore a number of I/O methods:

1. Loading a dataframe from a CSV.
2. Creating a table from a dataframe.
3. Loading a dataframe from a SQL statement.
4. Loading a dataframe from a table.
5. Incrementally ingesting dataframes into a table.

The data we will load are CSV files containing stock embeddings, each representing a 10-day window of stock prices. Given an example window of interest, we can use similarity search to find other similar windows to assist with predictions.
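To make the idea of "similar windows" concrete, here is a minimal local sketch in plain Python. It assumes Euclidean distance as the similarity measure, which is only one possible choice, and it does not use Kinetica's in-database vector functions. The function names and the toy vectors are hypothetical.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def most_similar(query, windows, k=1):
    """Return the names of the k windows closest to the query vector."""
    ranked = sorted(windows, key=lambda name: l2_distance(query, windows[name]))
    return ranked[:k]

# Made-up toy windows (not taken from the stock CSV files).
windows = {
    "AAA_w1": [1.0, 1.1, 1.2],
    "AAA_w2": [5.0, 4.0, 3.0],
    "BBB_w1": [1.0, 1.0, 1.3],
}
query = [1.0, 1.1, 1.25]
print(most_similar(query, windows, k=2))
```

In the database the same ranking would be computed over the stored embeddings rather than over an in-memory dictionary.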

The CSV has these columns:

* `stock`: Stock symbol
* `window_name`: Label for the 10-day stock window.
* `literal_vec`: An embedding that represents the window.

In the CSV, a single vector column is enclosed in quotes and brackets. For example:

```
"[ 0.51339287 0.51339287 0.5052084 ]"
```
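As an illustration of this format, the following sketch parses such a string into a list of floats using only the standard library and rejects input that is not bracketed. The name `parse_vector` is hypothetical and is not part of the Kinetica API. The notebook's own helper, defined later, uses NumPy instead.

```python
def parse_vector(text):
    """Parse a string such as '[ 1.1 2.2 3.3 ]' into a list of floats."""
    stripped = text.strip()
    if not (stripped.startswith("[") and stripped.endswith("]")):
        raise ValueError(f"expected a bracketed vector, got: {text!r}")
    # split() with no arguments ignores repeated or surrounding whitespace
    return [float(token) for token in stripped[1:-1].split()]

print(parse_vector("[ 0.51339287 0.51339287 0.5052084 ]"))
```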

## Setup

### Prerequisites 

To run this demonstration you will need:

* Python runtime >= 3.10
* Kinetica server >= 7.2
* Kinetica Python API >= 7.2

The next cell will install necessary dependencies.

In [4]:
# install Kinetica package
%pip install -U -q 'gpudb>=7.2' typeguard 

# install packages needed by this notebook
%pip install -U -q pandas pyarrow tqdm

# install packages needed for Jupyter widgets
%pip install -U -q ipykernel ipywidgets

Note: you may need to restart the kernel to use updated packages.


### DB Connection

You will need to configure environment variables with your connection information:

* `KINETICA_URL`
* `KINETICA_USER`
* `KINETICA_PASSWD`
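Before connecting, it can help to confirm that these variables are actually set. The following is a minimal sketch using only the standard library. The helper name `missing_vars` is hypothetical.

```python
import os

REQUIRED_VARS = ("KINETICA_URL", "KINETICA_USER", "KINETICA_PASSWD")

def missing_vars(env=os.environ, names=REQUIRED_VARS):
    """Return the names that are unset or empty in the given mapping."""
    return [name for name in names if not env.get(name)]

missing = missing_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All connection variables are set.")
```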

The cell below will validate your DB connection:

In [5]:
# load imports and connect to the Kinetica DB.

from gpudb import GPUdb
from pathlib import Path
import os

HOST = os.environ['KINETICA_URL']
USER = os.environ['KINETICA_USER']
PASSWORD = os.environ['KINETICA_PASSWD']

def create_kdbc(host: str, login: str, password: str) -> GPUdb:
    options = GPUdb.Options()
    options.username = login
    options.password = password
    options.skip_ssl_cert_verification = True
    options.disable_failover = True
    options.logging_level = 'INFO'
    kdbc = GPUdb(host=host, options = options)
    from importlib.metadata import version
    print(f"Connected to {kdbc.get_url()}. (api={version('gpudb')} server={str(kdbc.server_version)})")
    return kdbc

KDBC = create_kdbc(host=HOST, login=USER, password=PASSWORD)

# Directory containing stock CSV files.
DATA_DIR = Path("./data/stocks")

# name of test table we will create
TABLE_NAME = "demo.emb_vec"

Connected to http://172.31.33.30:9191. (api=7.2.0.1 server=7.2.0.1)


### Create some helper functions

In [6]:
import pandas as pd
import numpy as np
from numpy.typing import NDArray

def str_to_array(vec: str) -> NDArray:
    """
    Parameters:
        Input string contains a space separated vector of floats surrounded by brackets (e.g. '[ 1.1 2.2 3.3 ]')

    Returns: 
        Output is a numpy array.
    """

    # https://numpy.org/doc/stable/reference/generated/numpy.fromstring.html
    return np.fromstring(vec[1:-1], sep=" ", dtype=float)

def csv_to_df(csv_file: Path) -> pd.DataFrame:
    """
    Parameters: 
        Input path containing stock data.
    """

    # Read the CSV file into a dataframe
    # https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
    df = pd.read_csv(csv_file)

    # Convert the literal_vec column from string to array
    df['literal_vec'] = df['literal_vec'].apply(str_to_array)
    return df

## Examples

### Load dataframe from CSV

In [7]:
# We use the pandas `read_csv()` to load the CSV.
load_df = pd.read_csv(DATA_DIR / "AAPL.csv")

# the str_to_array() function will convert the string representation to an array.
load_df['literal_vec'] = load_df['literal_vec'].apply(str_to_array)
load_df

Unnamed: 0,stock,window_name,literal_vec
0,AAPL,AAPL_1980-12-12_1981-02-13,"[0.51339287, 0.51339287, 0.5052084, 0.50446427..."
1,AAPL,AAPL_2008-04-29_2008-07-01,"[24.444286, 25.007143, 25.17, 24.85, 24.994286..."
2,AAPL,AAPL_1983-09-08_1983-11-10,"[0.6183036, 0.56696427, 0.56696427, 0.546875, ..."
3,AAPL,AAPL_2008-05-09_2008-07-11,"[26.165714, 26.207144, 26.263332, 26.431429, 2..."
4,AAPL,AAPL_2008-05-19_2008-07-21,"[26.837143, 26.22857, 25.974285, 26.557142, 26..."
...,...,...,...
1425,AAPL,AAPL_2008-03-10_2008-05-12,"[17.425714, 17.098572, 17.72857, 18.192858, 18..."
1426,AAPL,AAPL_2008-03-20_2008-05-22,"[18.731428, 19.038572, 18.834642, 19.262144, 1..."
1427,AAPL,AAPL_2008-03-30_2008-06-01,"[20.397142, 20.476667, 20.467142, 20.5, 20.9, ..."
1428,AAPL,AAPL_2008-04-09_2008-06-11,"[21.901428, 21.634285, 21.59, 22.078571, 21.81..."


### Create a table from a dataframe

Using `from_df()` we can optionally specify table column types and creation options. 
The Pandas types are automatically converted to Kinetica column types.
Most types are supported including `string`, `int`, `boolean`, `vector` and `datetime`.
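The cell below types the `stock` column as `char8`. Assuming `char8` limits values to 8 characters, a quick pre-check like the following sketch can flag values that would not fit. The helper name `too_long` and the sample values are hypothetical.

```python
def too_long(values, max_len=8):
    """Return the distinct values whose string length exceeds max_len."""
    return sorted({str(v) for v in values if len(str(v)) > max_len})

symbols = ["AAPL", "MSFT", "VERYLONGSYMBOL"]  # made-up values
print(too_long(symbols))
```

Because the function accepts any iterable, a dataframe column such as `load_df['stock']` could be passed in the same way.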

In [8]:
from gpudb import GPUdbTable

# we can specify extended type information with the column_types parameter.
gpudb_table = GPUdbTable.from_df(load_df, db=KDBC, 
                                table_name=TABLE_NAME, 
                                clear_table=True,
                                load_data=True,
                                column_types = { 'stock': 'char8' })

# The `type_as_df()` method will return the table schema as a dataframe.
gpudb_table.type_as_df()

GPUdbException: "Error creating GPUdbTable: ''Schema does not exist: python_dev_guide (TM/SMc:7274)''"

### Load dataframe from a SQL statement and save to CSV

Kinetica column types are automatically converted to Pandas types.

In [None]:
# The stock to load
stock_name = 'AAPL'

save_df = KDBC.to_df(f""" select stock, window_name, literal_vec
from  {TABLE_NAME}
where stock = '{stock_name}'
""")

if save_df is None:
    raise ValueError(f"Stock not found: {stock_name}")

out_file = DATA_DIR / f"{stock_name}.csv"
save_df.to_csv(out_file, index=False)
save_df

Unnamed: 0,stock,window_name,literal_vec
0,AAPL,AAPL_1980-12-12_1981-02-13,"[0.51339287, 0.51339287, 0.5052084, 0.50446427..."
1,AAPL,AAPL_2008-04-29_2008-07-01,"[24.444286, 25.007143, 25.17, 24.85, 24.994286..."
2,AAPL,AAPL_1983-09-08_1983-11-10,"[0.6183036, 0.56696427, 0.56696427, 0.546875, ..."
3,AAPL,AAPL_2008-05-09_2008-07-11,"[26.165714, 26.207144, 26.263332, 26.431429, 2..."
4,AAPL,AAPL_2008-05-19_2008-07-21,"[26.837143, 26.22857, 25.974285, 26.557142, 26..."
...,...,...,...
1425,AAPL,AAPL_2008-03-10_2008-05-12,"[17.425714, 17.098572, 17.72857, 18.192858, 18..."
1426,AAPL,AAPL_2008-03-20_2008-05-22,"[18.731428, 19.038572, 18.834642, 19.262144, 1..."
1427,AAPL,AAPL_2008-03-30_2008-06-01,"[20.397142, 20.476667, 20.467142, 20.5, 20.9, ..."
1428,AAPL,AAPL_2008-04-09_2008-06-11,"[21.901428, 21.634285, 21.59, 22.078571, 21.81..."
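The cell above writes the dataframe with `to_csv()`. How the vector column is rendered as text depends on its in-memory representation, so the saved file may not match the space-separated bracket format described in the overview. The following sketch serializes a vector explicitly. It assumes each value is an iterable of numbers, and the name `vector_to_str` is hypothetical.

```python
def vector_to_str(vec):
    """Format an iterable of numbers as '[ v1 v2 v3 ]'."""
    # repr(float(x)) keeps full float precision and never wraps lines
    return "[ " + " ".join(repr(float(x)) for x in vec) + " ]"

print(vector_to_str([0.51339287, 0.51339287, 0.5052084]))
```

Applied before saving, for example with `save_df['literal_vec'].apply(vector_to_str)`, this would produce strings in the same shape that `str_to_array()` expects.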


### Load a dataframe from a table

Here we create a `GPUdbTable` and load the dataframe without the need for SQL.

> Note: The `GPUdbTable` can be used for things like filtering, inserts, aggregation, or inspection. See the [Python API documentation](https://docs.kinetica.com/7.1/api/python/frame/source/gpudbtable.html#) for more details.

In [None]:
from gpudb import GPUdbTable
gpudb_table = GPUdbTable(_type=None, name=TABLE_NAME, db=KDBC)
out_df = gpudb_table.to_df()
out_df

Unnamed: 0,stock,window_name,literal_vec
0,AAPL,AAPL_1980-12-12_1981-02-13,"[0.51339287, 0.51339287, 0.5052084, 0.50446427..."
1,AAPL,AAPL_2008-04-29_2008-07-01,"[24.444286, 25.007143, 25.17, 24.85, 24.994286..."
2,AAPL,AAPL_1983-09-08_1983-11-10,"[0.6183036, 0.56696427, 0.56696427, 0.546875, ..."
3,AAPL,AAPL_2008-05-09_2008-07-11,"[26.165714, 26.207144, 26.263332, 26.431429, 2..."
4,AAPL,AAPL_2008-05-19_2008-07-21,"[26.837143, 26.22857, 25.974285, 26.557142, 26..."
...,...,...,...
1425,AAPL,AAPL_2008-03-10_2008-05-12,"[17.425714, 17.098572, 17.72857, 18.192858, 18..."
1426,AAPL,AAPL_2008-03-20_2008-05-22,"[18.731428, 19.038572, 18.834642, 19.262144, 1..."
1427,AAPL,AAPL_2008-03-30_2008-06-01,"[20.397142, 20.476667, 20.467142, 20.5, 20.9, ..."
1428,AAPL,AAPL_2008-04-09_2008-06-11,"[21.901428, 21.634285, 21.59, 22.078571, 21.81..."


### Incrementally ingest dataframes

Sometimes you may have a large amount of data that will not fit into a single dataframe. For this you can use another workflow where you create an empty table and load it incrementally.

> Note: This example also shows how to create a progress bar in your notebook.

In [None]:
from gpudb import GPUdbTable
from tqdm.notebook import tqdm

# we will use another table name for this example
EMB_LARGE_TABLE = "demo.emb_vec_multi"

# get the list of CSV files to load
csv_list = list(DATA_DIR.glob("*.csv"))
csv_list.sort()

# create an empty table using the schema from the first dataframe.
load_df = csv_to_df(csv_list[0])
gpudb_table = GPUdbTable.from_df(load_df, db=KDBC, 
                                table_name=EMB_LARGE_TABLE, 
                                clear_table=True,
                                load_data=False,
                                column_types = { 'stock': 'char8' })

# incrementally load dataframes
with tqdm(csv_list) as pbar: 
    for file in pbar:
        pbar.set_description(f"Loading {file}", refresh=True)
        load_df = csv_to_df(file)
        gpudb_table.insert_df(load_df)


  0%|          | 0/6 [00:00<?, ?it/s]
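The same incremental pattern can apply when the data is already in memory as one large object rather than spread over several files. The sketch below splits any sliceable sequence into fixed-size batches. The name `iter_batches` is hypothetical, and a plain list stands in for the dataframe.

```python
def iter_batches(data, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

rows = list(range(10))  # stand-in for a large dataframe
for batch in iter_batches(rows, 4):
    print(len(batch), batch)
```

A pandas dataframe supports `len()` and positional row slicing with `df[a:b]`, so each batch could be passed to `gpudb_table.insert_df()` as in the loop above.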