# Superstore Example using KAWA's Python client

## 1. Load the data into KAWA

With this library, you can easily load any dataframe or parquet file into KAWA.
This is useful if you want to connect to your internal systems or APIs, ingest real-time data,
or ingest large volumes of data quickly.

Many options are available on the loader API. Please refer to our website and our GitHub repository for documentation and tutorials.

In [None]:
# Download superstore file and import it into a pandas dataframe
import requests
import pandas as pd

url = 'https://gist.githubusercontent.com/nnbphuong/38db511db14542f3ba9ef16e69d3814c/raw/3a77ff9d97c504d3ec3210b12fde7242b8c6ab63/Superstore.csv'
filename = '/tmp/superstore.csv'

response = requests.get(url, timeout=30)
response.raise_for_status()  # Fail early instead of saving an HTTP error page as CSV
with open(filename, 'wb') as file:
    file.write(response.content)


df = pd.read_csv(filename, 
                 parse_dates=['Order Date', 'Ship Date'], 
                 date_format='%Y-%m-%d')

# KAWA supports both date and datetime objects.
# These columns carry no time component, so convert them to plain dates.
df['Ship Date'] = df['Ship Date'].dt.date
df['Order Date'] = df['Order Date'].dt.date


In [None]:
# Connect to the KAWA API using KYWY
from kywy.client.kawa_client import KawaClient as K
kawa = K.load_client_from_environment()


In [None]:
# Load the dataframe into KAWA

loader = kawa.new_data_loader(
    df=df, 
    datasource_name='Super Store',
)

loader.create_datasource()

loader.load_data(
    create_sheet=True,
    reset_before_insert=True,
);


## 2. Perform computations

Use the KAWA query language to perform computations in the KAWA data warehouse.

The datasets that have been filtered and aggregated are then accessible from your Python runtime environment as pandas dataframes.

You can then combine them and manipulate them further to reach your objectives.

__The two main advantages of this approach are:__

- Very high performance: KAWA aggregates billions of rows in less than a second.
- Very low memory footprint in your environment: all the heavy lifting happens in KAWA's warehouse.
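
As a minimal sketch of the "combine and manipulate further" step, the example below joins two small aggregated results in pandas. The two dataframes are made-up stand-ins for what `query.compute()` might return; the column names and values are illustrative assumptions, not real Superstore results.

```python
import pandas as pd

# Made-up stand-ins for two aggregated results returned by query.compute()
profit_by_state = pd.DataFrame({
    'State': ['California', 'New York', 'Texas'],
    'Profit': [1200.0, 800.0, -150.0],
})
orders_by_state = pd.DataFrame({
    'State': ['California', 'New York', 'Texas'],
    'Num orders': [40, 20, 30],
})

# Join the two aggregates on their shared grouping column
combined = profit_by_state.merge(orders_by_state, on='State', how='inner')

# Derive a new metric locally; this is cheap because the frames are already aggregated
combined['Profit per order'] = combined['Profit'] / combined['Num orders']
print(combined)
```

Because only aggregated rows reach the Python process, this kind of post-processing stays small regardless of how many rows the underlying sheet holds.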


### 2.a Which 5 states are the most profitable?


In [None]:
query = (kawa
         .sheet('Super Store')
         .select(K.col('Profit').sum())
         .group_by('State')
         .order_by('Profit', ascending=False)
         .limit(5))

query.compute(use_group_names=True)
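
For readers who want to cross-check the result locally, here is a rough pandas equivalent of the query above. It assumes the query groups by `State`, sums `Profit`, sorts in descending order and keeps five rows, as the builder calls suggest. The sample rows are made up for illustration.

```python
import pandas as pd

# Made-up sample rows standing in for the sheet content
orders = pd.DataFrame({
    'State': ['California', 'California', 'Texas', 'New York',
              'Ohio', 'Washington', 'Florida'],
    'Profit': [100.0, 50.0, -20.0, 80.0, 10.0, 60.0, 5.0],
})

top_states = (orders
              .groupby('State', as_index=False)['Profit'].sum()  # total Profit per State
              .sort_values('Profit', ascending=False)            # most profitable first
              .head(5))                                          # keep the top 5

print(top_states)
```

Unlike the KAWA query, this version needs every row in local memory, so it is only practical for small samples.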

### 2.b Show the monthly average and median profit for the state of California in 2018

In [None]:
from datetime import date

query = (kawa
         .sheet('Super Store')
         .select(
             K.col('Order Date'),
             K.col('Profit').avg().alias('Avg Profit'),
             K.col('Profit').median().alias('Median Profit'),
         )
         .group_by('Order Date')
         .sample('YEAR_AND_MONTH')
         .filter(K.col('State').eq('California'))
         .filter(K.col('Order Date').date_range(
             from_inclusive=date(2018, 1, 1),
             to_inclusive=date(2018, 12, 31),
         ))
         .no_limit())

df = query.compute(use_group_names=True)

df.plot(x='Order Date', title="Median and Avg profit in 2018, California");
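
The monthly bucketing can be illustrated with a small pandas sketch. It assumes that `sample('YEAR_AND_MONTH')` groups each date into its calendar month, which is an interpretation of the call above rather than documented behavior. The rows are made up.

```python
import pandas as pd

# Made-up rows: five California orders in 2018, one Texas order, one 2019 order
orders = pd.DataFrame({
    'Order Date': pd.to_datetime(['2018-01-05', '2018-01-20', '2018-02-03',
                                  '2018-02-10', '2018-02-25', '2018-02-11',
                                  '2019-01-01']),
    'State': ['California'] * 5 + ['Texas', 'California'],
    'Profit': [10.0, 30.0, 5.0, 15.0, 100.0, 500.0, 999.0],
})

# Same two filters as the query: one state, one inclusive date range
in_2018 = orders['Order Date'].between(pd.Timestamp('2018-01-01'),
                                       pd.Timestamp('2018-12-31'))
california_2018 = orders[(orders['State'] == 'California') & in_2018]

# Bucket dates by calendar month, then aggregate Profit per bucket
monthly = (california_2018
           .groupby(california_2018['Order Date'].dt.to_period('M'))['Profit']
           .agg(['mean', 'median']))

print(monthly)
```

The February bucket shows why plotting both series is informative: a single large order pulls the mean well above the median.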

### 2.c Show all cities with an average profit greater than 50 and more than 100 unique orders

In [None]:
query = (kawa
         .sheet('Super Store')
         .select(
             K.col('Profit').avg().alias('Avg Profit'),
             K.col('Order Id').count_unique().alias('Num unique orders'),
         )
         .group_by('City')
         .filter(K.col('Profit').avg().gt(50))
         .filter(K.col('Order Id').count_unique().gt(100))
         .order_by('Avg Profit')
         .no_limit())

query.compute(use_group_names=True)
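
Filtering on aggregated values, as the query above does, corresponds to aggregating first and filtering afterwards in pandas. The sketch below shows that pattern on made-up rows. The order-count threshold is lowered to 1 so that the tiny sample returns something.

```python
import pandas as pd

# Made-up rows; Boston has a repeated Order Id to show unique counting
orders = pd.DataFrame({
    'City': ['Austin', 'Austin', 'Boston', 'Boston', 'Boston', 'Denver'],
    'Order Id': ['A1', 'A2', 'B1', 'B1', 'B2', 'D1'],
    'Profit': [80.0, 60.0, 10.0, 20.0, 30.0, 200.0],
})

# Step 1: aggregate per city
per_city = (orders
            .groupby('City')
            .agg(**{'Avg Profit': ('Profit', 'mean'),
                    'Num unique orders': ('Order Id', 'nunique')}))

# Step 2: filter on the aggregated columns (thresholds adapted to the toy data)
selected = (per_city[(per_city['Avg Profit'] > 50)
                     & (per_city['Num unique orders'] > 1)]
            .sort_values('Avg Profit'))

print(selected)
```

Denver has the highest average profit but only one order, so it is dropped by the second condition. Boston fails the first.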

### 2.d Show the distribution of customers per year and state

In [None]:
import seaborn as sns

query = (kawa
         .sheet('Super Store')
         .select(
             K.col('Order Date'),
             K.col('State'),
             K.col('Customer Name').count_unique().alias('num clients'),
         )
         .sample('YEAR', column_name='Order Date')
         .group_by('Order Date', 'State')
         .filter(K.col('State').in_list(['California', 'Texas', 'Washington', 'New York']))
         .no_limit())

df = query.compute(use_group_names=True)

matrix = df.pivot_table(index="group(0) Order Date", columns="State", values="num clients")
sns.heatmap(matrix, annot=True, fmt="0.0f");
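
The `pivot_table` call reshapes the long query result (one row per year and state) into a year-by-state matrix that the heatmap can draw. Here is a minimal sketch of that reshaping on made-up data. The `Year` column name is a hypothetical stand-in for the `group(0) Order Date` column used above.

```python
import pandas as pd

# Made-up long-format result: one row per (year, state) pair
long_df = pd.DataFrame({
    'Year': [2017, 2017, 2018, 2018],
    'State': ['California', 'Texas', 'California', 'Texas'],
    'num clients': [10, 4, 12, 6],
})

# Rows become years, columns become states, cells hold the client counts
matrix = long_df.pivot_table(index='Year', columns='State', values='num clients')

print(matrix)
```

`pivot_table` averages duplicate (index, column) pairs by default. Each pair appears once here, so the cell values are the original counts, returned as floats.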
