# Python SQL Tutorial

In this notebook, we show how to use Fugue SQL to work with SQL in a Python environment. Fugue SQL is a SQL interface that can run on Spark, Dask, Ray, and Pandas, and it is designed to be easy to use for data scientists and engineers.
It also lets us turn a Jupyter code cell into a SQL cell with the Jupyter magic command ```%%fsql```.

DuckDB is an in-process SQL OLAP database management system that performs well even on gigabytes of data on a local machine. Fugue integrates deeply with DuckDB: it not only uses DuckDB as the SQL engine, but also implements all execution engine methods using DuckDB SQL and relations. As a result, the tables stay in DuckDB for most of the workflow, and only in rare cases are they materialized and converted to Arrow dataframes.

In this notebook you will learn how to create tables and insert data into the created tables.

In [2]:
import os
import duckdb
import pandas as pd
import sqlalchemy
import json
from fugue_notebook import setup
import fugue_duckdb

setup()


We have two CSV files in our data directory:
- raw_orders.csv
- raw_payments.csv

and one JSON file:
- raw_customers.json


We will use pandas to load the data first:

In [3]:
df_customers = pd.read_json('data/raw_customers.json', orient='records', lines=True)
df_customers.head()

Unnamed: 0,id,first_name,last_name
0,1,Michael,P.
1,2,Shawn,M.
2,3,Kathleen,P.
3,4,Jimmy,C.
4,5,Katherine,R.
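
The `lines=True` argument tells pandas to expect JSON Lines input, that is, one JSON object per line. The sketch below shows what such input looks like, using an in-memory string with a couple of made-up rows; the exact layout of `raw_customers.json` is an assumption here.

```python
import io
import pandas as pd

# Hypothetical sample in JSON Lines format: one JSON object per line.
sample = io.StringIO(
    '{"id": 1, "first_name": "Michael", "last_name": "P."}\n'
    '{"id": 2, "first_name": "Shawn", "last_name": "M."}\n'
)

# lines=True parses each line as a separate record.
df_sample = pd.read_json(sample, orient="records", lines=True)
print(df_sample)
```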


In [4]:
df_orders = pd.read_csv('data/raw_orders.csv')
df_orders.head()

Unnamed: 0,id,user_id,order_date,status
0,1,1,2018-01-01,returned
1,2,3,2018-01-02,completed
2,3,94,2018-01-04,completed
3,4,50,2018-01-05,completed
4,5,64,2018-01-05,completed


In [5]:
df_payment = pd.read_csv('data/raw_payments.csv')
df_payment.head()

Unnamed: 0,id,order_id,payment_method,amount
0,1,1,credit_card,1000
1,2,2,credit_card,2000
2,3,3,coupon,100
3,4,4,coupon,2500
4,5,5,bank_transfer,1700


We could do the same with Fugue. The ```%%fsql``` magic converts the cell into a SQL cell in which we can query the data with Fugue SQL. ```duck``` is the name of the execution engine, which is DuckDB in our case.

With ```LOAD``` we can import our data and assign it to a variable. We can then use the variable name to query the data.

In [6]:
%%fsql duck

raw_customers = LOAD "data/raw_customers.json"

SELECT * FROM raw_customers
PRINT

Unnamed: 0,id:long,first_name:str,last_name:str
0,1,Michael,P.
1,2,Shawn,M.
2,3,Kathleen,P.
3,4,Jimmy,C.
4,5,Katherine,R.
5,6,Sarah,R.
6,7,Martin,M.
7,8,Frank,R.
8,9,Jennifer,F.
9,10,Henry,W.


As you can see, you can load data from various file formats such as CSV, JSON, and Parquet, and you can also load data from a database.

In [7]:
%%fsql duck

raw_orders = LOAD "data/raw_orders.csv" (header = "true")

SELECT * FROM raw_orders
PRINT

Unnamed: 0,id:str,user_id:str,order_date:str,status:str
0,1,1,2018-01-01,returned
1,2,3,2018-01-02,completed
2,3,94,2018-01-04,completed
3,4,50,2018-01-05,completed
4,5,64,2018-01-05,completed
5,6,54,2018-01-07,completed
6,7,88,2018-01-09,completed
7,8,2,2018-01-11,returned
8,9,53,2018-01-12,completed
9,10,7,2018-01-14,completed


In [8]:
%%fsql duck

raw_payments = LOAD "data/raw_payments.csv" (header = "true")

SELECT * FROM raw_payments

PRINT

Unnamed: 0,id:str,order_id:str,payment_method:str,amount:str
0,1,1,credit_card,1000
1,2,2,credit_card,2000
2,3,3,coupon,100
3,4,4,coupon,2500
4,5,5,bank_transfer,1700
5,6,6,credit_card,600
6,7,7,credit_card,1600
7,8,8,credit_card,2300
8,9,9,gift_card,2300
9,10,9,bank_transfer,0
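
In the two CSV outputs above, every column is reported as `str`, whereas `pd.read_csv` inferred numeric types for the same files. If typed values are needed later, the columns can be cast explicitly. The following is a sketch in plain pandas with made-up rows; reading with `dtype=str` is only a stand-in for the untyped result.

```python
import io
import pandas as pd

csv_text = (
    "id,user_id,order_date,status\n"
    "1,1,2018-01-01,returned\n"
    "2,3,2018-01-02,completed\n"
)

# dtype=str keeps every column as text, mimicking an untyped CSV load.
raw = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Cast each column to the type it is meant to have.
typed = raw.assign(
    id=raw["id"].astype("int64"),
    user_id=raw["user_id"].astype("int64"),
    order_date=pd.to_datetime(raw["order_date"], format="%Y-%m-%d"),
)
print(typed.dtypes)
```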


Now we can look at some questions, for example the RFM (Recency, Frequency, and Monetary) questions that are relevant for analysing customer behaviour. Here we can either load the data again with Fugue or combine pandas and Fugue by running Fugue on top of the pandas dataframes.

- When did the customers last purchase?

First we join the orders and the customers tables to get each customer's name and order date in one table. With ```YIELD``` we save the result in a dataframe ```df1``` that we can also use outside of this cell.

The data can also be saved to a database or to files with ```SAVE```.

In [9]:
%%fsql

SELECT o.user_id, o.order_date, c.first_name, c.last_name FROM df_orders AS o
JOIN df_customers AS c
ON o.user_id = c.id
YIELD DATAFRAME AS df1

PRINT

Unnamed: 0,user_id:long,order_date:str,first_name:str,last_name:str
0,1,2018-01-01,Michael,P.
1,1,2018-02-10,Michael,P.
2,3,2018-01-02,Kathleen,P.
3,3,2018-01-27,Kathleen,P.
4,3,2018-03-11,Kathleen,P.
5,94,2018-01-04,Gregory,H.
6,94,2018-01-29,Gregory,H.
7,50,2018-01-05,Billy,L.
8,50,2018-02-20,Billy,L.
9,64,2018-01-05,David,C.
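
For comparison, the same inner join can be written directly in pandas with `merge`. The sketch below uses a few made-up rows in place of `df_orders` and `df_customers`.

```python
import pandas as pd

orders = pd.DataFrame({
    "id": [1, 2, 3],
    "user_id": [1, 3, 1],
    "order_date": ["2018-01-01", "2018-01-02", "2018-02-10"],
})
customers = pd.DataFrame({
    "id": [1, 2, 3],
    "first_name": ["Michael", "Shawn", "Kathleen"],
    "last_name": ["P.", "M.", "P."],
})

# Inner join on orders.user_id = customers.id; both frames have an "id"
# column, so suffixes keep them apart before the final column selection.
joined = orders.merge(
    customers,
    left_on="user_id",
    right_on="id",
    how="inner",
    suffixes=("_order", "_customer"),
)[["user_id", "order_date", "first_name", "last_name"]]
print(joined)
```

Customers without any order (Shawn in this sample) do not appear in the result, which matches the behaviour of a plain SQL `JOIN`.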


Now we can write a Python function that we can use with Fugue in SQL. For that we have to define the output schema first. Here the function returns all input columns, so we use ```*``` to select all columns. The schema is declared in a comment directly above the function.

In [10]:
#schema: *
def get_latest_order_date_per_customer(df: pd.DataFrame) -> pd.DataFrame:
    return df.sort_values('order_date', ascending=False).groupby('user_id').first().reset_index()

This Python function returns, for each customer, the row with the most recent order date. It can now be used in a Fugue SQL query with the ```TRANSFORM``` command, which applies the function to the dataframe. The function to apply is named after the ```USING``` keyword. The result is saved in a new dataframe ```df2```.

In [11]:
%%fsql duck

df2 = TRANSFORM df1 USING get_latest_order_date_per_customer
YIELD DATAFRAME AS df2
PRINT

Unnamed: 0,user_id:long,order_date:str,first_name:str,last_name:str
0,1,2018-02-10,Michael,P.
1,2,2018-01-11,Shawn,M.
2,3,2018-03-11,Kathleen,P.
3,6,2018-02-19,Sarah,R.
4,7,2018-01-14,Martin,M.
5,8,2018-03-12,Frank,R.
6,9,2018-03-17,Jennifer,F.
7,11,2018-03-23,Fred,S.
8,12,2018-03-03,Amy,D.
9,13,2018-03-07,Kathleen,M.
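
Because the function depends only on pandas, it can be checked without Fugue by calling it on a small hand-made frame. The rows below are illustrative and not taken from the data files. Note that sorting the dates as strings works here only because they are in ISO `YYYY-MM-DD` format.

```python
import pandas as pd

def get_latest_order_date_per_customer(df: pd.DataFrame) -> pd.DataFrame:
    # Newest orders first, then keep the first (newest) row of each customer.
    return df.sort_values('order_date', ascending=False).groupby('user_id').first().reset_index()

sample = pd.DataFrame({
    "user_id": [1, 1, 3],
    "order_date": ["2018-01-01", "2018-02-10", "2018-01-02"],
    "first_name": ["Michael", "Michael", "Kathleen"],
    "last_name": ["P.", "P.", "P."],
})

latest = get_latest_order_date_per_customer(sample)
print(latest)
```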


The dataframe ```df2``` can then also be used again as a pandas DataFrame:

In [12]:
df2.head(5)

Unnamed: 0,user_id:long,order_date:str,first_name:str,last_name:str
0,1,2018-02-10,Michael,P.
1,2,2018-01-11,Shawn,M.
2,3,2018-03-11,Kathleen,P.
3,6,2018-02-19,Sarah,R.
4,7,2018-01-14,Martin,M.


## Exercise 1

-  What is the customer recency in days?

You can get the answer from the previous table by computing the difference between the current date and the last order date. Do this as an exercise.

In [13]:
from datetime import datetime

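#schema: *, recency:int
def calculate_recency(df: pd.DataFrame) -> pd.DataFrame:
    # Take "now" once so that every row is measured against the same moment.
    now = datetime.now()
    days = df['order_date'].apply(lambda x: (now - datetime.strptime(x, "%Y-%m-%d")).days)
    # assign returns a new frame instead of modifying the input in place.
    return df.assign(recency=days)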

In [14]:
%%fsql

df_recency = TRANSFORM df2 USING calculate_recency
YIELD DATAFRAME AS df_recency
PRINT


Unnamed: 0,user_id:long,order_date:str,first_name:str,last_name:str,recency:int
0,1,2018-02-10,Michael,P.,2074
1,2,2018-01-11,Shawn,M.,2104
2,3,2018-03-11,Kathleen,P.,2045
3,6,2018-02-19,Sarah,R.,2065
4,7,2018-01-14,Martin,M.,2101
5,8,2018-03-12,Frank,R.,2044
6,9,2018-03-17,Jennifer,F.,2039
7,11,2018-03-23,Fred,S.,2033
8,12,2018-03-03,Amy,D.,2053
9,13,2018-03-07,Kathleen,M.,2049


The recency is quite large because the data is from 2018 ;)
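
Since the result depends on the day the notebook is run, the numbers above will differ on every run. For a reproducible variant, the reference date can be passed in explicitly. The sketch below is one way to do this in plain pandas; the function name `calculate_recency_at` and the reference date are hypothetical choices for illustration.

```python
import pandas as pd

def calculate_recency_at(df: pd.DataFrame, reference_date: str) -> pd.DataFrame:
    # Parse the ISO date strings once and subtract them from a fixed date.
    ref = pd.Timestamp(reference_date)
    order_dates = pd.to_datetime(df["order_date"], format="%Y-%m-%d")
    return df.assign(recency=(ref - order_dates).dt.days)

sample = pd.DataFrame({
    "user_id": [1, 3],
    "order_date": ["2018-02-10", "2018-03-11"],
})

# Assumed reference date, chosen only to keep the example deterministic.
result = calculate_recency_at(sample, "2018-04-01")
print(result)
```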

-  How much did they spend?

To answer this question, let us join all three tables and save the result in a dataframe ```df3```.

Additionally, let's save the joined table as a Parquet file for possible future use. This also shows that we can use global variables inside Fugue SQL cells by wrapping them in double curly brackets, as in ```{{PATH}}```.

In [15]:
# Current working directory; os.path.join with a single argument is a no-op.
PATH = os.getcwd()

In [16]:
%%fsql

SELECT o.user_id, c.first_name, c.last_name, p.amount FROM df_orders AS o
JOIN df_customers AS c
ON o.user_id = c.id
JOIN df_payment AS p
ON o.id = p.order_id
YIELD DATAFRAME AS df3

SAVE OVERWRITE "{{PATH}}/data/joined_data.parquet" 

PRINT


Unnamed: 0,user_id:long,first_name:str,last_name:str,amount:long
0,1,Michael,P.,1000
1,1,Michael,P.,2300
2,3,Kathleen,P.,2000
3,3,Kathleen,P.,2600
4,3,Kathleen,P.,1900
5,94,Gregory,H.,100
6,94,Gregory,H.,2300
7,50,Billy,L.,2500
8,50,Billy,L.,2200
9,64,David,C.,1700


And now we can simply sum the amount per customer:

In [17]:
%%fsql duck

SELECT first_name, last_name, SUM(amount) FROM df3
GROUP BY first_name, last_name
PRINT

Unnamed: 0,first_name:str,last_name:str,`sum(amount)`:long
0,Michael,P.,3300
1,David,C.,3000
2,Victor,H.,2400
3,Amanda,H.,1200
4,Adam,W.,3900
5,Fred,S.,300
6,Todd,W.,2900
7,Willie,H.,2200
8,Billy,L.,4700
9,Rose,M.,5700
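
Grouping by `first_name` and `last_name` merges customers who happen to share a name, which is possible here because last names are abbreviated to an initial. The sketch below shows the difference on made-up rows, written in pandas, with a variant that also groups by `user_id`.

```python
import pandas as pd

payments = pd.DataFrame({
    "user_id": [1, 1, 3, 7],
    "first_name": ["Michael", "Michael", "Kathleen", "Michael"],
    "last_name": ["P.", "P.", "P.", "P."],
    "amount": [1000, 2300, 2000, 500],
})

# Grouping by name only: users 1 and 7 are both "Michael P." and get merged.
by_name = payments.groupby(["first_name", "last_name"], as_index=False)["amount"].sum()

# Including user_id in the key keeps the two customers apart.
by_user = payments.groupby(["user_id", "first_name", "last_name"], as_index=False)["amount"].sum()

print(by_name)
print(by_user)
```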


## Exercise 2

Now you can try to answer the next questions as an exercise. You can either write a Python function that you apply with ```TRANSFORM``` or answer the questions directly in SQL.

- What is the most common payment method?

In [18]:
%%fsql

SELECT  p.payment_method, COUNT(*) AS method_count FROM df_payment AS p
GROUP BY p.payment_method
ORDER BY method_count DESC
PRINT


Unnamed: 0,payment_method:str,method_count:long
2,credit_card,55
0,bank_transfer,33
1,coupon,13
3,gift_card,12


With a Python function and Fugue's ```TRANSFORM```:

In [41]:
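#schema: payment_method:str, count:int
def count_payment_methods(df: pd.DataFrame) -> pd.DataFrame:
    # Name both output columns explicitly; the default names produced by
    # value_counts().reset_index() differ between pandas versions.
    counts = df['payment_method'].value_counts()
    return counts.rename_axis('payment_method').reset_index(name='count')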

In [43]:
%%fsql duck

df_payment_method_count = TRANSFORM df_payment USING count_payment_methods

PRINT

Unnamed: 0,payment_method:str,count:int
0,credit_card,55
1,bank_transfer,33
2,coupon,13
3,gift_card,12
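
Both approaches return the full count table. To extract only the single most common method as a Python value, `value_counts` can be combined with `idxmax`. The sketch below uses a short made-up column.

```python
import pandas as pd

payments = pd.DataFrame({
    "payment_method": [
        "credit_card", "coupon", "credit_card",
        "bank_transfer", "credit_card", "coupon",
    ]
})

# value_counts sorts by frequency (descending); idxmax returns the label
# that has the highest count.
counts = payments["payment_method"].value_counts()
most_common = counts.idxmax()
print(most_common, int(counts.max()))
```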


## Exercise 3

- Create a table that includes the aggregated data from the questions above!

In [21]:
%%fsql

SELECT  df2.user_id, 
        df2.first_name, 
        df2.last_name, 
        df2.order_date, 
        df3.amount,
        df_recency.recency
FROM df2
JOIN df3 ON df3.user_id = df2.user_id
JOIN df_recency ON df_recency.user_id = df2.user_id

PRINT

Unnamed: 0,user_id:long,first_name:str,last_name:str,order_date:str,amount:long,recency:int
0,1,Michael,P.,2018-02-10,1000,2074
1,1,Michael,P.,2018-02-10,2300,2074
2,2,Shawn,M.,2018-01-11,2300,2104
3,3,Kathleen,P.,2018-03-11,2000,2045
4,3,Kathleen,P.,2018-03-11,2600,2045
5,3,Kathleen,P.,2018-03-11,1900,2045
6,6,Sarah,R.,2018-02-19,800,2065
7,7,Martin,M.,2018-01-14,2600,2101
8,8,Frank,R.,2018-03-12,1900,2044
9,8,Frank,R.,2018-03-12,2600,2044
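
The joined result above still has one row per payment, so customers with several payments appear more than once. One way to collapse it into a single row per customer is a grouped aggregation. The sketch below does this in pandas with made-up rows; the column names `frequency` and `monetary` and the reference date are assumptions for illustration.

```python
import pandas as pd

orders = pd.DataFrame({
    "user_id": [1, 1, 3],
    "order_id": [1, 2, 3],
    "order_date": ["2018-01-01", "2018-02-10", "2018-01-02"],
    "amount": [1000, 2300, 2000],
})

# Assumed fixed reference date so that the recency values are reproducible.
reference_date = pd.Timestamp("2018-04-01")

rfm = (
    orders.assign(order_date=pd.to_datetime(orders["order_date"], format="%Y-%m-%d"))
    .groupby("user_id", as_index=False)
    .agg(
        last_order=("order_date", "max"),   # most recent order
        frequency=("order_id", "nunique"),  # number of distinct orders
        monetary=("amount", "sum"),         # total amount spent
    )
)
rfm["recency"] = (reference_date - rfm["last_order"]).dt.days
print(rfm)
```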
