# Data source management and data loading

In [None]:
from kywy.client.kawa_client import KawaClient as K

kawa = K.load_client_from_environment()
cmd = kawa.commands

## 1. Create a datasource 

Creating datasources in KAWA requires a data loader. 
To create a data loader, use the `new_data_loader` method on the KAWA client.

In [None]:
import pandas as pd
import datetime

df = pd.DataFrame([{
       'id': 1,
       'flag':True,
       'comment':'bar',
       'price': 1.124,
       'order_date': datetime.date(2035,1,1),
}])

# A data loader will be created, using the dataframe to understand what indicators
# will constitute the datasource.
# Here: id will be an integer, flag will be a boolean indicator, comment a text and price a decimal,
# order_date will be a date indicator
loader = kawa.new_data_loader(datasource_name='Sample Datasource', df=df)


Once the loader is created, use it to create a datasource. 
If a datasource with the same name already exists, this command has no effect. 

In particular, it will not update the existing datasource, even if the indicators differ.

__Primary keys__

You can also specify the primary keys of your datasource when you create it by passing a `primary_keys` argument.
If you omit it, KAWA adds a new indicator, `record_id`, to your dataframe (and your datasource). It is automatically
incremented and serves as a technical primary key.

---
**ℹ️ NOTE**

The primary keys will be in the same order as the columns of the dataframe, not in the order of the `primary_keys` array. Please refer to section 4 (Indexation and Partitioning) to learn more about how to order your primary keys.


---
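Since the key order follows the dataframe column order, one way to control it is to reorder the columns before creating the loader. The snippet below is a minimal local sketch using pandas only; the chosen key order (`order_date` first, then `id`) and the name `df_ordered` are illustrative assumptions.

```python
import datetime
import pandas as pd

df = pd.DataFrame([{
    'id': 1,
    'flag': True,
    'comment': 'bar',
    'price': 1.124,
    'order_date': datetime.date(2035, 1, 1),
}])

# Desired primary key order (illustrative): most filtered dimension first
primary_keys = ['order_date', 'id']

# Move the key columns to the front, in that order; the other columns keep their order
other_columns = [c for c in df.columns if c not in primary_keys]
df_ordered = df[primary_keys + other_columns]

print(list(df_ordered.columns))
```

The reordered dataframe would then be the one handed to `new_data_loader`, with the same list passed as `primary_keys`.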




Below are the supported types and their mapped type in KAWA.

__If a column of the dataframe has an unrecognised type, the datasource cannot be built.__

| Pandas type(s) | Kawa Type | Notes |
| --- | --- | --- |
| int, int8, int16, int32, int64, uint8, uint16, uint32, uint64 | integer |  |
| float, float16, float32, float64 | decimal | |
| date | date | |
| datetime, pd.Timestamp | date_time | If the timezone is not specified, the local timezone is used. |
| string | text | |
| boolean | boolean | |
| array of texts | list(text) | |
| array of integers | list(integer) | |
| array of floats | list(decimal) | |

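Before creating the loader, it can be useful to check locally that every column falls into one of the rows of the table above. The sketch below uses pandas only. `guess_kawa_type` is a hypothetical helper, not part of the KAWA client, and its inference only approximates the table (list columns are left out for brevity).

```python
import datetime
import pandas as pd
from pandas.api import types as ptypes


def guess_kawa_type(series):
    """Hypothetical helper: approximate the mapping table for one column."""
    if ptypes.is_bool_dtype(series):
        return 'boolean'
    if ptypes.is_integer_dtype(series):
        return 'integer'
    if ptypes.is_float_dtype(series):
        return 'decimal'
    if ptypes.is_datetime64_any_dtype(series):
        return 'date_time'
    # Remaining columns: look at the first non-null Python value
    values = series.dropna()
    if values.empty:
        return None
    first = values.iloc[0]
    # datetime is a subclass of date, so it must be tested first
    if isinstance(first, datetime.datetime):
        return 'date_time'
    if isinstance(first, datetime.date):
        return 'date'
    if isinstance(first, str):
        return 'text'
    return None  # not recognised by this sketch


sample = pd.DataFrame([{
    'id': 1,
    'flag': True,
    'comment': 'bar',
    'price': 1.124,
    'order_date': datetime.date(2035, 1, 1),
}])

guessed = {name: guess_kawa_type(sample[name]) for name in sample.columns}
unrecognised = [name for name, kawa_type in guessed.items() if kawa_type is None]
print(guessed)
print(unrecognised)
```

Any column listed in `unrecognised` is worth converting or dropping before the datasource is created.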

In [None]:
# Creates a datasource based on the schema of the dataframe, with the given name.
# This operation is idempotent, based on the data source name.
# NO DATA WILL BE INSERTED AT THAT POINT
loader.create_datasource()

## 2. Adding columns to an existing datasource
To add indicators to an existing datasource, create a new loader with a dataframe containing the new indicators and call the `add_new_indicators_to_datasource` method. 

This method has no effect if the dataframe contains no new indicator.
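
To anticipate which indicators would be added, the dataframe columns can be compared with the columns already known. This is a minimal local sketch: `existing_columns` is a hand-written assumption standing in for the current indicators of the datasource, not something fetched from KAWA.

```python
import datetime
import pandas as pd

# Assumption: the indicators the datasource already has
existing_columns = ['id', 'flag', 'comment', 'price', 'order_date']

candidate_df = pd.DataFrame([{
    'id': 1,
    'flag': True,
    'comment': 'bar',
    'price': 1.124,
    'order_date': datetime.date(2035, 1, 1),
    'client': 'Wayne Enterprises',
}])

# Columns present in the dataframe but not yet in the datasource
new_columns = [c for c in candidate_df.columns if c not in existing_columns]
print(new_columns)
```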


In [None]:
# Creates a new data loader on an existing datasource
df_with_new_client_indicator = pd.DataFrame([{
       'id': 1,
       'flag':True,
       'comment':'bar',
       'price': 1.124,
       'order_date': datetime.date(2035,1,1),
       'client':'Wayne Enterprises'
}])


loader = kawa.new_data_loader(
    datasource_name='Sample Datasource',
    df=df_with_new_client_indicator,
)

# The 'client' indicator was not part of the datasource,
# it will be added as a text indicator
loader.add_new_indicators_to_datasource();

## 3. Loading data into KAWA

Loading data into a datasource requires a data loader.



#### Incremental loads

This is driven by the `reset_before_insert` parameter. By default, this is `False`.

If set to `False`, the data of the new dataframe is added on top of what was there before. __This is an incremental load__. 

- If primary keys are defined, new values for existing keys replace the existing ones (upsert).
- If no primary keys were defined, KAWA will have introduced an auto-increment indicator. In that case, the incoming data is appended to the existing data without any replacement.

If set to `True`, the data of the new dataframe replaces whatever was there before.
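
The two modes can be mimicked locally to see what the datasource would contain after a load. This is only a pandas simulation of the semantics described above, under the assumption that `id` is the single primary key; it does not call KAWA.

```python
import pandas as pd

primary_keys = ['id']
existing = pd.DataFrame({'id': [1, 2], 'price': [1.0, 2.0]})
incoming = pd.DataFrame({'id': [2, 3], 'price': [20.0, 30.0]})

# reset_before_insert=False with primary keys:
# rows are appended, and incoming rows win when a key already exists (upsert)
incremental = (
    pd.concat([existing, incoming], ignore_index=True)
    .drop_duplicates(subset=primary_keys, keep='last')
    .sort_values(primary_keys)
    .reset_index(drop=True)
)

# reset_before_insert=True: only the incoming rows remain
full_reload = incoming.copy()

print(incremental)
print(full_reload)
```

Here the row with `id` 2 takes its price from `incoming` in the incremental case, while the row with `id` 1 disappears in the full reload.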

#### Automatic sheet creation

This is linked to the `create_sheet` input. If set to `True`, a sheet will be created after the load. Its URL will be printed in the standard output. By default, this is `False`.


#### Upload speed

To speed up the load, increase the `nb_threads` parameter. Do not specify a number above the number of cores of the ClickHouse server. In general, values above 10 do not improve the loading speed and are discouraged.

When this parameter has a value above 1, the dataframe is split into `nb_threads` parquet files and each one is sent to the server in a separate thread. This makes it possible to use multiple cores when streaming data into the warehouse.
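
To give an idea of what splitting a dataframe into `nb_threads` parts looks like, here is a minimal sketch. `split_dataframe` is a hypothetical helper written for illustration; how the client actually splits the data is not shown in this notebook and may differ.

```python
import pandas as pd


def split_dataframe(df, nb_threads):
    """Illustrative only: split df into at most nb_threads contiguous chunks."""
    nb_chunks = max(1, min(nb_threads, len(df)))
    base, extra = divmod(len(df), nb_chunks)
    chunks, start = [], 0
    for i in range(nb_chunks):
        # The first `extra` chunks receive one additional row
        stop = start + base + (1 if i < extra else 0)
        chunks.append(df.iloc[start:stop])
        start = stop
    return chunks


rows = pd.DataFrame({'id': range(10)})
chunks = split_dataframe(rows, nb_threads=3)
chunk_sizes = [len(chunk) for chunk in chunks]
print(chunk_sizes)
```

Each chunk could then be written to its own parquet file and uploaded independently, which is why more threads only help while the server has free cores.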

In [None]:
df = pd.DataFrame([{
       'id': 1,
       'flag':True,
       'comment':'bar',
       'price': 1.124,
       'order_date': datetime.date(2035,1,1),
       'client':'Wayne Enterprises'
}])

loader = kawa.new_data_loader(df=df, datasource_name='Sample Datasource')

# The two following lines are idempotent
loader.create_datasource()
loader.add_new_indicators_to_datasource();

# Loads the content of df into the datasource 'Sample Datasource'
loader.load_data(
    reset_before_insert=True,
    create_sheet=True,
    nb_threads=1
);

## 4. Indexation and Partitioning

When your data is stored in ClickHouse, you can use the Python API to index it and configure how it is partitioned.
The following documentation explains the architecture of ClickHouse indexes in detail:

https://clickhouse.com/docs/en/optimize/sparse-primary-indexes


The KAWA API exposes one command that lets you change the primary keys and the partition key of your table.


At any point, you can check the schema of your ClickHouse table by running:

In [None]:
datasource_id = kawa.entities.datasources().get_entity_id('Sample Datasource')

schema = kawa.entities.get_datasource_schema(datasource_id)
print(schema.get('schema').replace('\\n', '\n'))

#### Primary keys

KAWA always uses the primary key set as the order key set in ClickHouse. This means that the data is sorted by each of the primary keys, in that order.

This results in better performance for queries that filter on the first primary key column. The further down the list a column is, the smaller the speedup from filtering on it.


---
**⚠️ WARNING**

It is recommended to put first the dimension on which most of the views are filtered, a date for instance: `['date','portfolio','record_id']`

If a data source contains a natural hierarchy, listing its levels from the broadest to the most specific is also a good idea: `['region','country','city','record_id']`

To avoid losing data, 
__THE PRIMARY KEY THAT ENSURES UNIQUENESS MUST ALWAYS BE IN THE PROVIDED LIST__.


---
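Because a key list that does not identify records uniquely leads to data loss, it is worth checking a candidate list against a local copy of the data first. This is a minimal pandas sketch; `keys_are_unique` is a hypothetical helper and the `orders` dataframe is made-up sample data.

```python
import datetime
import pandas as pd


def keys_are_unique(df, primary_keys):
    """Return True when no two rows share the same combination of key values."""
    return not df.duplicated(subset=primary_keys).any()


orders = pd.DataFrame({
    'order_date': [
        datetime.date(2035, 1, 1),
        datetime.date(2035, 1, 1),
        datetime.date(2035, 1, 2),
    ],
    'id': [1, 2, 1],
    'price': [1.5, 2.5, 3.5],
})

# The pair (order_date, id) identifies each row; order_date alone does not
print(keys_are_unique(orders, ['order_date', 'id']))
print(keys_are_unique(orders, ['order_date']))
```

A check on a sample only shows that the sample has no collisions, so it complements knowledge of the data model rather than replacing it.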


#### Partition key

This tells ClickHouse how to partition the data on disk. It improves performance for ingestion and select queries. Please refer to this link for more details about partitions:

https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/custom-partitioning-key

Setting a partition key will also make it possible to cleanly remove old / unwanted partitions through the KAWA API.


In [None]:
# Command to set a new set of primary keys and a partition 
datasource_id = kawa.entities.datasources().get_entity_id('Sample Datasource')

cmd.replace_datasource_primary_keys(
   datasource=datasource_id,
    
   # This will define the new set of primary keys.
   # Please make sure that this array reflects the uniqueness of your records.
   # Failing to do so will result in data loss.
   new_primary_keys=['order_date','id'],
    
   # This will define the partition key, optional
   partition_key='order_date',
    
   # Popular values for this are: YEAR, YEAR_AND_MONTH, YEAR_AND_WEEK
   # This is optional
   partition_sampler='YEAR_AND_MONTH',
    
);


schema = kawa.entities.get_datasource_schema(datasource_id)
print(schema.get('schema').replace('\\n', '\n'))

## 5. Removing data from a Datasource

KAWA supports two ways to remove data from datasources.

- One is based on dropping entire partitions (this is the recommended way).
- The other one directly leverages the DELETE statement.

Running those commands requires the data admin privilege in the workspace.

In [None]:
import datetime

# Command to drop a partition of data
# Only partitions on date columns are supported for such drops (and without sampling)

datasource = kawa.entities.datasources().get_entity('Sample Datasource')

cmd.drop_date_partition(
   datasource=datasource, 
   date_partition=datetime.date(2024, 1, 1)
);

The command below deletes data based on a series of where_clauses.
It deletes the data that MATCHES the clauses.
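
Since a delete cannot be undone from the notebook, it can help to preview the matching rows on a local dataframe first. The sketch below reproduces, with pandas only, the two conditions used in the next cell (client in a list, order date on or after a given day) and assumes all clauses must match at once. `local_df` is made-up sample data, not a KAWA export.

```python
import datetime
import pandas as pd

local_df = pd.DataFrame({
    'client': ['Wonka', 'Wonka', 'Wayne Enterprises'],
    'order_date': [
        datetime.date(2023, 12, 31),
        datetime.date(2024, 1, 1),
        datetime.date(2024, 6, 1),
    ],
})

cutoff = datetime.date(2024, 1, 1)

# A row matches only if it satisfies every condition
matches = (
    local_df['client'].isin(['Wonka'])
    & local_df['order_date'].map(lambda d: d >= cutoff)
)

rows_to_delete = local_df[matches]
rows_kept = local_df[~matches]
print(rows_to_delete)
```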

In [None]:
datasource = kawa.entities.datasources().get_entity('Sample Datasource')

# Please refer to 03_compute_notebook for the syntax here.
# It is identical to the one for the filter operators
where_clauses = [
    K.where('client').in_list(['Wonka']),
    K.where('order_date').date_range(from_inclusive=datetime.date(2024, 1, 1))
]

# Will delete all records with client = Wonka and an order date on or after the 1st of January 2024.
cmd.delete_data(
    datasource=datasource, 
    delete_where=where_clauses,
);