# Python SQL Tutorial

In this notebook, we show how to use Fugue SQL to work with SQL in a Python environment. Fugue SQL is a SQL interface that can run on Spark, Dask, Ray, and Pandas, and it is designed to be easy to use for data scientists and engineers.
It also lets us turn a Jupyter code cell into a SQL cell with the Jupyter cell magic ```%%fsql```.

DuckDB is an in-process SQL OLAP database management system. It is fast even on gigabytes of data on a local machine. Fugue integrates deeply with DuckDB: it not only uses DuckDB as the SQL engine, but also implements all execution engine methods using DuckDB SQL and relations. For most of the workflow, the data tables are therefore kept in DuckDB, and only in rare cases are the tables materialized and converted to Arrow dataframes.

In this notebook you will learn how to create tables and insert data into the created tables.

In [None]:
import os
import duckdb
import pandas as pd
import sqlalchemy
import json
from fugue_notebook import setup
import fugue_duckdb

setup()

We have two CSV files in our data directory:
- raw_orders.csv
- raw_payments.csv

and a JSON file:
- raw_customers.json

We will use pandas to load the data first:

In [None]:
df_customers = pd.read_json('data/raw_customers.json', orient='records', lines=True)
df_customers.head()
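The ```lines=True``` argument tells pandas that the file holds one JSON object per line. As a small illustration, the sketch below parses two made-up records from an in-memory string. The column names follow the queries used later in this notebook; the values are hypothetical and not taken from the real file.

```python
import io

import pandas as pd

# Hypothetical sample records; the real raw_customers.json may contain other values.
json_lines = io.StringIO(
    '{"id": 1, "first_name": "Michael", "last_name": "P."}\n'
    '{"id": 2, "first_name": "Shawn", "last_name": "M."}\n'
)

# lines=True makes pandas parse one JSON object per line.
sample_customers = pd.read_json(json_lines, orient="records", lines=True)
print(sample_customers)
```

Each line becomes one row, and the keys of the JSON objects become the column names.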

In [None]:
df_orders = pd.read_csv('data/raw_orders.csv')
df_orders.head()

In [None]:
df_payment = pd.read_csv('data/raw_payments.csv')
df_payment.head()

We could do the same with Fugue. With ```%%fsql``` we turn the cell into a SQL cell and use Fugue SQL to query the data. ```duck``` is the name of the execution engine, which is DuckDB in our case.

With ```LOAD``` we can import our data and assign it to a variable. We can then use the variable name to query the data.

In [None]:
%%fsql duck

raw_customers = LOAD "data/raw_customers.json"

SELECT * FROM raw_customers
PRINT

As you can see, we can load data from various sources such as CSV, JSON, and Parquet files, and we can also load data from a database.

In [None]:
%%fsql duck

raw_orders = LOAD "data/raw_orders.csv" (header = "true")

SELECT * FROM raw_orders
PRINT

In [None]:
%%fsql duck

raw_payments = LOAD "data/raw_payments.csv" (header = "true")

SELECT * FROM raw_payments

PRINT

Now we can look at some questions, for example the RFM (Recency, Frequency, and Monetary) questions that are relevant for analysing customer behaviour. Here we can either load the data again with Fugue, or combine pandas and Fugue and run Fugue SQL on top of the pandas dataframes.

- When did the customers last purchase?

First we join the orders and customers tables to get each customer's name and the order date in one table. With ```YIELD``` we save the result in a dataframe ```df1``` that we can also use outside of this cell.

There are also ways to save the data in a database or as files with ```SAVE```.

In [None]:
%%fsql

SELECT o.user_id, o.order_date, c.first_name, c.last_name FROM df_orders AS o
JOIN df_customers AS c
ON o.user_id = c.id
YIELD DATAFRAME AS df1

PRINT
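For comparison, the same inner join can be sketched directly in pandas with ```merge```. The toy rows below are made up for illustration; only the column names come from the query above.

```python
import pandas as pd

# Hypothetical toy rows; only the column names mirror the notebook's tables.
toy_orders = pd.DataFrame({
    "id": [10, 11, 12],
    "user_id": [1, 2, 1],
    "order_date": ["2018-01-01", "2018-01-05", "2018-02-10"],
})
toy_customers = pd.DataFrame({
    "id": [1, 2],
    "first_name": ["Michael", "Shawn"],
    "last_name": ["P.", "M."],
})

# Inner join on orders.user_id = customers.id, mirroring the JOIN ... ON clause.
# Both tables have an "id" column, so suffixes keep the two apart.
joined = toy_orders.merge(
    toy_customers,
    left_on="user_id",
    right_on="id",
    suffixes=("_order", "_customer"),
)[["user_id", "order_date", "first_name", "last_name"]]
print(joined)
```

Every order row is kept once and enriched with the matching customer's name, which is the same shape as ```df1```.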

Now we can write a Python function that we can use from Fugue SQL. For that we have to define the output schema first. Here the output has the same columns as the input, so we use ```*``` to keep all columns. The schema is declared in a comment directly above the function.

In [None]:
#schema: *
def get_latest_order_date_per_customer(df: pd.DataFrame) -> pd.DataFrame:
    return df.sort_values('order_date', ascending=False).groupby('user_id').first().reset_index()
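To get a feeling for what the function does, we can call it on a small hand-made dataframe in plain pandas. The rows are hypothetical; the function body is the same as in the cell above.

```python
import pandas as pd

# schema: *
def get_latest_order_date_per_customer(df: pd.DataFrame) -> pd.DataFrame:
    # Sort newest first, then keep the first (newest) row of each customer.
    return df.sort_values('order_date', ascending=False).groupby('user_id').first().reset_index()

# Hypothetical toy data with the same columns as df1.
toy_df1 = pd.DataFrame({
    "user_id": [1, 2, 1],
    "order_date": ["2018-01-01", "2018-01-05", "2018-02-10"],
    "first_name": ["Michael", "Shawn", "Michael"],
    "last_name": ["P.", "M.", "P."],
})

latest = get_latest_order_date_per_customer(toy_df1)
print(latest)
```

Customer 1 has two orders in the toy data, and only the more recent one survives. Sorting the dates as strings works here because they are in ISO format (year-month-day).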

This Python function keeps, for each customer, the row with the most recent order date. It can now be used in a Fugue SQL query with the ```TRANSFORM``` command, which applies the function to the dataframe. The function to apply is named after the ```USING``` keyword. The result is saved in a new dataframe ```df2```.

In [None]:
%%fsql duck

df2 = TRANSFORM df1 USING get_latest_order_date_per_customer
YIELD DATAFRAME AS df2
PRINT

The dataframe ```df2``` can then also be used again as a pandas DataFrame:

In [None]:
df2.head(5)

## Exercise 1

- What is the customer recency in days?

You can answer this question from the previous table by computing the difference in days between a reference date (for example today) and each customer's latest order date. Do this as an exercise.

In [None]:
%%fsql

- How much did they spend?

To answer this question, let us join all three tables and save the result in a dataframe ```df3```.

Additionally, let's save the joined table as a Parquet file for possible future use. Here we will also see that we can use Python global variables inside Fugue SQL cells if we wrap them in double curly braces ```{{}}```.

In [None]:
# Absolute path of the current working directory
PATH = os.getcwd()

In [None]:
%%fsql

SELECT o.user_id, c.first_name, c.last_name, p.amount FROM df_orders AS o
JOIN df_customers AS c
ON o.user_id = c.id
JOIN df_payment AS p
ON o.id = p.order_id
YIELD DATAFRAME AS df3

SAVE OVERWRITE "{{PATH}}/data/joined_data.parquet" 

PRINT
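The ```{{PATH}}``` placeholder is filled in before the SQL statement runs. Fugue SQL relies on Jinja templating for this; the sketch below is only a simplified stand-in written with the standard library to show the idea. The helper ```fill_placeholders``` is hypothetical and not part of Fugue.

```python
import os
import re


def fill_placeholders(sql: str, variables: dict) -> str:
    """Replace each {{name}} with str(variables[name]).

    Hypothetical helper: a simplified stand-in for Jinja templating,
    not the mechanism Fugue itself uses.
    """
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda match: str(variables[match.group(1)]),
        sql,
    )


PATH = os.getcwd()
statement = 'SAVE OVERWRITE "{{PATH}}/data/joined_data.parquet"'

# The placeholder is swapped for the current value of the Python variable.
rendered = fill_placeholders(statement, {"PATH": PATH})
print(rendered)
```

The rendered statement contains the absolute path, so the file location does not depend on how the SQL engine resolves relative paths.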


And now we can simply sum the amount per customer:

In [None]:
%%fsql duck

SELECT first_name, last_name, SUM(amount) AS total_amount FROM df3
GROUP BY first_name, last_name
PRINT
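The same aggregation can be sketched in pandas with ```groupby```. The rows below are hypothetical, and the output column name ```total_amount``` is our own choice for illustration.

```python
import pandas as pd

# Hypothetical toy data with the same columns as df3.
toy_df3 = pd.DataFrame({
    "user_id": [1, 1, 2],
    "first_name": ["Michael", "Michael", "Shawn"],
    "last_name": ["P.", "P.", "M."],
    "amount": [1000, 500, 2000],
})

# Equivalent of: SELECT first_name, last_name, SUM(amount) ... GROUP BY first_name, last_name
totals = (
    toy_df3.groupby(["first_name", "last_name"], as_index=False)["amount"]
    .sum()
    .rename(columns={"amount": "total_amount"})
)
print(totals)
```

Note that grouping by name merges two different customers who happen to share a first and last name. Adding ```user_id``` to the grouping columns would avoid that.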

## Exercise 2

Now you can try to answer the next questions as an exercise. You can either write a Python function that you apply with ```TRANSFORM```, or answer the questions in plain SQL. How does the query look without ```TRANSFORM```?

- What is the most common payment method?

In [None]:
%%fsql

## Exercise 3

- Create a table that includes the aggregated data from the questions above!

In [None]:
%%fsql