---
title: "Read data, combine tables, & aggregate numbers to understand business performance"
format:
  html:
    toc: true
execute:
    eval: false
    output: true
---

## Setup

In [None]:
%%capture
%%bash
python ./generate_data.py
python ./run_ddl.py

In [163]:
%%sql --show
use prod.db

In this chapter, we will go over SQL basics.


## A Spark catalog can have multiple schemas, & schemas can have multiple tables

Typically database servers can have multiple databases; each database can have multiple schemas. Each schema can have multiple tables, and each table can have multiple columns.

**Note:** We use Spark, which adds `catalogs` as a level above schemas; each catalog can connect to a different underlying system. Spark labels the levels below a catalog as `namespace`, as the outputs below show.

In our lab, we can check the available catalogs, their schemas, the tables in a schema, & the columns in a table, as shown below.


In [164]:
%%sql 
show catalogs;

catalog
demo
spark_catalog


In [165]:
%%sql
show schemas IN demo;

-- Catalog -> schema

namespace
prod


In [166]:
%%sql
show schemas IN prod;

-- schema -> namespace

namespace
prod.db


In [167]:
%%sql
show tables IN prod.db -- namespace -> Table

namespace,tableName,isTemporary
prod.db,customer,False
prod.db,lineitem,False
prod.db,nation,False
prod.db,orders,False
prod.db,part,False
prod.db,partsupp,False
prod.db,region,False
prod.db,supplier,False


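Note how, when listing the tables, we use the full path of the schema (`prod.db`). A table's full path has the form `database.schema.table_name`, e.g., `prod.db.lineitem`. We can skip the full path if we let Spark know which schema to use by default, which we did with `use prod.db` in the Setup section. This lets us refer to a table by its name alone, as shown below.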


In [168]:
%%sql
DESCRIBE lineitem

col_name,data_type,comment
l_orderkey,bigint,
l_partkey,bigint,
l_suppkey,bigint,
l_linenumber,int,
l_quantity,"decimal(15,2)",
l_extendedprice,"decimal(15,2)",
l_discount,"decimal(15,2)",
l_tax,"decimal(15,2)",
l_returnflag,string,
l_linestatus,string,


In [169]:
%%sql
DESCRIBE extended lineitem

col_name,data_type,comment
l_orderkey,bigint,
l_partkey,bigint,
l_suppkey,bigint,
l_linenumber,int,
l_quantity,"decimal(15,2)",
l_extendedprice,"decimal(15,2)",
l_discount,"decimal(15,2)",
l_tax,"decimal(15,2)",
l_returnflag,string,
l_linestatus,string,


## Use SELECT...FROM, LIMIT, WHERE, & ORDER BY to read the required data 

The most common use for querying is to read data in our tables. We can do this using a `SELECT ... FROM` statement, as shown below.


In [170]:
%%sql
-- use * to specify all columns
SELECT
  *
FROM
  orders
LIMIT
  4

o_orderkey,o_custkey,o_orderstatus,o_totalprice,o_orderdate,o_orderpriority,o_clerk,o_shippriority,o_comment
1,3691,O,194029.55,1996-01-02,5-LOW,Clerk#000000951,0,ly express platelets. deposits acc
2,7801,O,60951.63,1996-12-01,1-URGENT,Clerk#000000880,0,ve the furiously fluffy dependencies. carefully regular
3,12332,F,247296.05,1993-10-14,5-LOW,Clerk#000000955,0,after the asymptotes. instructions cajole after the foxes. carefully unu
4,13678,O,53829.87,1995-10-11,5-LOW,Clerk#000000124,0,st the furiously bold pinto beans. furiously pending theodolites cajol


In [171]:
%%sql
-- use column names to only read data from those columns
SELECT
  o_orderkey,
  o_totalprice
FROM
  orders
LIMIT
  4

o_orderkey,o_totalprice
1,194029.55
2,60951.63
3,247296.05
4,53829.87


However, running a `SELECT ... FROM` statement without a limit can be slow and return an unwieldy amount of output when the data set is large. If you only want to look at a sample of the data, use `LIMIT n` to tell Spark to return only n rows.

We can use the `WHERE` clause if we want to get only the rows that match specific criteria. We can specify one or more filters within the `WHERE` clause and combine them using `AND` and `OR`, as shown below.


In [172]:
%%sql
-- all customer rows that have c_nationkey = 20
SELECT
  *
FROM
  customer
WHERE
  c_nationkey = 20
LIMIT
  10;

c_custkey,c_name,c_address,c_nationkey,c_phone,c_acctbal,c_mktsegment,c_comment
6,Customer#000000006,"g1s,pzDenUEBW3O,2 pxu0f9n2g64rJrt5E",20,30-114-968-4951,7638.57,AUTOMOBILE,quickly silent asymptotes are slyly regular excuses. instructions wake furiously? quickly bold courts p
81,Customer#000000081,9jUFbrThIIeoUNd8 9,20,30-165-277-3269,2023.71,BUILDING,s against the ironic packages haggle carefully above the slyly express pinto beans
100,Customer#000000100,MBy6qq3OEGpV4u,20,30-749-445-4907,9889.89,FURNITURE,"dazzle carefully furiously final foxes. express, ironic packages among the qui"
210,Customer#000000210,",XOlfSzkZDAkm96adR41j,",20,30-876-248-9750,7250.14,HOUSEHOLD,es cajole bravely across the blithely
223,Customer#000000223,MyQxUcG0P QCetmG00GlF,20,30-193-643-1517,7476.2,BUILDING,"xcuses. silent theodolites across the carefully bold excuses sleep ironic, final courts. regular excuses"
228,Customer#000000228,"rZ1wxvHNByT71bUJWZjXMDROzlAch6FVu,dj8Zfq",20,30-435-915-1603,6868.12,FURNITURE,es. blithely permanent sentim
247,Customer#000000247,eSAW4XaakYFj2WToKU,20,30-151-905-3513,8495.92,HOUSEHOLD,"tes nag according to the blithe, even packages. sometimes unusual packages integrate"
278,Customer#000000278,XHAfHlrYQM3elmhJ,20,30-445-570-5841,7621.56,BUILDING,"ely unusual accounts. stealthily special instructions affix blithely. regular, ironic packages sleep even platelet"
285,Customer#000000285,rB6fTQKle64k3MvCCatad8DfMgR5OZA G4r,20,30-235-130-1313,7276.72,FURNITURE,slyly according to the blithely special instructions. ironic ideas against the blithely furious pac
321,Customer#000000321,LX0SKs3jqo9wH1yixIdGWp2ItclDiuL,20,30-114-675-9153,7718.77,FURNITURE,"ng the final, bold requests. furiously regular accounts inside the furiously pending"


In [173]:
%%sql
-- all customer rows that have c_nationkey = 20 and c_acctbal > 1000
SELECT
  *
FROM
  customer
WHERE
  c_nationkey = 20
  AND c_acctbal > 1000
LIMIT
  10;

c_custkey,c_name,c_address,c_nationkey,c_phone,c_acctbal,c_mktsegment,c_comment
6,Customer#000000006,"g1s,pzDenUEBW3O,2 pxu0f9n2g64rJrt5E",20,30-114-968-4951,7638.57,AUTOMOBILE,quickly silent asymptotes are slyly regular excuses. instructions wake furiously? quickly bold courts p
81,Customer#000000081,9jUFbrThIIeoUNd8 9,20,30-165-277-3269,2023.71,BUILDING,s against the ironic packages haggle carefully above the slyly express pinto beans
100,Customer#000000100,MBy6qq3OEGpV4u,20,30-749-445-4907,9889.89,FURNITURE,"dazzle carefully furiously final foxes. express, ironic packages among the qui"
210,Customer#000000210,",XOlfSzkZDAkm96adR41j,",20,30-876-248-9750,7250.14,HOUSEHOLD,es cajole bravely across the blithely
223,Customer#000000223,MyQxUcG0P QCetmG00GlF,20,30-193-643-1517,7476.2,BUILDING,"xcuses. silent theodolites across the carefully bold excuses sleep ironic, final courts. regular excuses"
228,Customer#000000228,"rZ1wxvHNByT71bUJWZjXMDROzlAch6FVu,dj8Zfq",20,30-435-915-1603,6868.12,FURNITURE,es. blithely permanent sentim
247,Customer#000000247,eSAW4XaakYFj2WToKU,20,30-151-905-3513,8495.92,HOUSEHOLD,"tes nag according to the blithe, even packages. sometimes unusual packages integrate"
278,Customer#000000278,XHAfHlrYQM3elmhJ,20,30-445-570-5841,7621.56,BUILDING,"ely unusual accounts. stealthily special instructions affix blithely. regular, ironic packages sleep even platelet"
285,Customer#000000285,rB6fTQKle64k3MvCCatad8DfMgR5OZA G4r,20,30-235-130-1313,7276.72,FURNITURE,slyly according to the blithely special instructions. ironic ideas against the blithely furious pac
321,Customer#000000321,LX0SKs3jqo9wH1yixIdGWp2ItclDiuL,20,30-114-675-9153,7718.77,FURNITURE,"ng the final, bold requests. furiously regular accounts inside the furiously pending"


In [174]:
%%sql
-- all customer rows that have c_nationkey = 20 or c_acctbal > 1000
SELECT
  *
FROM
  customer
WHERE
  c_nationkey = 20
  OR c_acctbal > 1000
LIMIT
  10;

c_custkey,c_name,c_address,c_nationkey,c_phone,c_acctbal,c_mktsegment,c_comment
3,Customer#000000003,fkRGN8nY4pkE,1,11-719-748-3364,7498.12,AUTOMOBILE,fully. carefully silent instructions sleep alongside of the slyly regular asymptotes. quickly regular
4,Customer#000000004,4u58h fqkyE,4,14-128-190-5944,2866.83,MACHINERY,sublate. fluffily even instructions are about th
6,Customer#000000006,"g1s,pzDenUEBW3O,2 pxu0f9n2g64rJrt5E",20,30-114-968-4951,7638.57,AUTOMOBILE,quickly silent asymptotes are slyly regular excuses. instructions wake furiously? quickly bold courts p
7,Customer#000000007,8OkMVLQ1dK6Mbu6WG9 w4pLGQ n7MQ,18,28-190-982-9759,9561.95,AUTOMOBILE,"ounts. ironic, regular accounts sleep. final requests haggle quickly after the"
8,Customer#000000008,"j,pZ,Qp,qtFEo0r0c 92qobZtlhSuOqbE4JGV",17,27-147-574-9335,6819.74,BUILDING,riously final excuses sublate quickly among the fluffily even foxes. quickly final packages haggle furiously furi
9,Customer#000000009,vgIql8H6zoyuLMFNdAMLyE7 H9,8,18-338-906-3675,8324.07,FURNITURE,ss pinto beans believe slyly quiet deposits-- doggedly bold packages boost. quickly ironic de
10,Customer#000000010,"Vf mQ6Ug9Ucf5OKGYq fsaX AtfsO7,rwY",5,15-741-346-9870,2753.54,HOUSEHOLD,g quickly after the evenly bold
12,Customer#000000012,Sb4gxKs7W1AZa,13,23-791-276-1263,3396.49,HOUSEHOLD,ickly regular dependencies boost blithely around the slyly ironic theodolites. furiously special dolp
13,Customer#000000013,Ez3ax0D5HnUbeUVSxoX8a8B,3,13-761-547-5974,3857.34,BUILDING,quickly brave foxes. blithely even packages against the pinto beans boost furiously against the re
14,Customer#000000014,h3GFMzeFfYiamqr,1,11-845-129-3851,5266.3,FURNITURE,"r, express foxes cajole slyly aga"


In [175]:
%%sql
-- all customer rows that have (c_nationkey = 20 and c_acctbal > 1000) or rows that have c_nationkey = 11
SELECT
  *
FROM
  customer
WHERE
  (
    c_nationkey = 20
    AND c_acctbal > 1000
  )
  OR c_nationkey = 11
LIMIT
  10;

c_custkey,c_name,c_address,c_nationkey,c_phone,c_acctbal,c_mktsegment,c_comment
6,Customer#000000006,"g1s,pzDenUEBW3O,2 pxu0f9n2g64rJrt5E",20,30-114-968-4951,7638.57,AUTOMOBILE,quickly silent asymptotes are slyly regular excuses. instructions wake furiously? quickly bold courts p
52,Customer#000000052,"UracAlAA8tSHL5V,poTZIOjh8o,",11,21-186-284-5998,5630.28,HOUSEHOLD,ts boost. carefully express waters across the blithely regular foxes inte
81,Customer#000000081,9jUFbrThIIeoUNd8 9,20,30-165-277-3269,2023.71,BUILDING,s against the ironic packages haggle carefully above the slyly express pinto beans
84,Customer#000000084,GB3sUmv RRXV DPzeOSbGxMIF9Z4Eq9 rop,11,21-546-818-3802,5174.71,FURNITURE,ounts. blithely express theodolites nag carefully ironic pinto beans. carefully final
100,Customer#000000100,MBy6qq3OEGpV4u,20,30-749-445-4907,9889.89,FURNITURE,"dazzle carefully furiously final foxes. express, ironic packages among the qui"
131,Customer#000000131,"ItdUFrHPZlzjZ, fo03sG4topAKTV",11,21-840-210-3572,8595.53,HOUSEHOLD,ly final Tiresias. slyly permanent theodolites cajole quickly. carefully unus
134,Customer#000000134,6I1TTaoG7bbiogCqRcptG6BYme,11,21-200-159-5932,4608.9,BUILDING,ly regular dolphins haggle blithely.
148,Customer#000000148,qJ8bFn4kwiit7RzwGrwo5m,11,21-562-498-6636,2135.6,HOUSEHOLD,e carefully pending ideas detect slyly along the furiously special excuses. instructions use carefully
190,Customer#000000190,"mY30kK8AfsTGrx,L4zI QlQnnmCUxikyc8QcZ7",11,21-730-373-8193,1657.46,AUTOMOBILE,y even packages engage furiously pending p
210,Customer#000000210,",XOlfSzkZDAkm96adR41j,",20,30-876-248-9750,7250.14,HOUSEHOLD,es cajole bravely across the blithely



We can combine multiple filter clauses, as seen above. We have seen examples of the equals (`=`) and greater than (`>`) conditional operators. There are six **conditional operators**:

1. `<` Less than
2. `>` Greater than
3. `<=` Less than or equal to
4. `>=` Greater than or equal to
5. `=` Equal
6. `<>` and `!=` both represent Not equal (some DBs only support one of these)
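
To see how these operators interact with `AND` and `OR`, here is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for the lab's Spark setup. The five-row `customer` table and the `keys` helper are made up for illustration. It shows that `AND` binds more tightly than `OR`, which is why the parentheses matter when mixing the two.

```python
import sqlite3

# Toy stand-in for the lab's customer table (values are made up).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (c_custkey INTEGER, c_nationkey INTEGER, c_acctbal REAL)"
)
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, 20, 500.0), (2, 20, 5000.0), (3, 11, 200.0), (4, 11, 9000.0), (5, 7, 100.0)],
)


def keys(where_clause):
    """Return the sorted c_custkey values that match a WHERE clause."""
    sql = f"SELECT c_custkey FROM customer WHERE {where_clause} ORDER BY c_custkey"
    return [row[0] for row in conn.execute(sql)]


# AND is evaluated before OR, so these two filters are equivalent.
no_parens = keys("c_nationkey = 20 AND c_acctbal > 1000 OR c_nationkey = 11")
with_parens = keys("(c_nationkey = 20 AND c_acctbal > 1000) OR c_nationkey = 11")

# Moving the parentheses changes which rows qualify.
regrouped = keys("c_nationkey = 20 AND (c_acctbal > 1000 OR c_nationkey = 11)")

# Both spellings of "not equal" work in SQLite.
not_equal_a = keys("c_nationkey <> 20")
not_equal_b = keys("c_nationkey != 20")

print(no_parens, with_parens, regrouped)
```

The first two filters return customers 2, 3, and 4, while the regrouped filter returns only customer 2.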

Additionally, for string types, we can do **pattern matching with a `LIKE` condition**. In a `LIKE` condition, `_` matches any single character, and `%` matches zero or more characters, for example:


In [176]:
%%sql
-- all customer rows where the name has a 381 in it
SELECT
  *
FROM
  customer
WHERE
  c_name LIKE '%381%';

c_custkey,c_name,c_address,c_nationkey,c_phone,c_acctbal,c_mktsegment,c_comment
381,Customer#000000381,wXs5zN2nPHqPsfFO,5,15-860-208-7093,9931.71,BUILDING,"ithely along the regular, regular theodolites. fluffily pending"
1381,Customer#000001381,kAgLl7nUiPStCleWOiKevH3QAOhqtg9dVvrdN,22,32-418-900-6494,367.82,BUILDING,posits sleep carefully around the slyly e
2381,Customer#000002381,"z7B43DZ7RGlkgEi3YaXfy,Aw2SZepYurvII41Do",5,15-493-990-8133,412.99,FURNITURE,ul requests use slyly quickly even deposits. slyly pending
3381,Customer#000003381,03jULkpVTm92eKW24meIj,13,23-441-750-5088,2473.54,AUTOMOBILE,er the carefully bold multipliers doze blithely along the furiousl
3810,Customer#000003810,hlRTIO4e4HNahc8A D,18,28-881-994-8196,9906.8,FURNITURE,bold requests after the furiousl
3811,Customer#000003811,b6vEJqifAgSbGhzTwTz,22,32-962-997-2221,5697.04,FURNITURE,he carefully special packages. regular deposits sleep blithely bl
3812,Customer#000003812,HGYp5dZtlA,14,24-653-654-5032,4204.53,FURNITURE,y ironic requests believe blithely
3813,Customer#000003813,Aeky0En0JO5V1zRgFZ9EvCcBWaTmW,6,16-983-191-7833,-494.03,HOUSEHOLD,rding to the express foxes. bold platelets main
3814,Customer#000003814,FQ3lWCA3znooc3S SmDCfwqdn4R9,20,30-833-732-5401,-207.83,AUTOMOBILE,ounts alongside of the fluffily pendin
3815,Customer#000003815,S5SIUeDCuVOKRTZqZ5M4CC,19,29-968-870-7672,2887.99,FURNITURE,ccounts. fluffily bold requests sleep furio


In [177]:
%%sql
-- all customer rows where the name ends with a 381
SELECT
  *
FROM
  customer
WHERE
  c_name LIKE '%381';

c_custkey,c_name,c_address,c_nationkey,c_phone,c_acctbal,c_mktsegment,c_comment
381,Customer#000000381,wXs5zN2nPHqPsfFO,5,15-860-208-7093,9931.71,BUILDING,"ithely along the regular, regular theodolites. fluffily pending"
1381,Customer#000001381,kAgLl7nUiPStCleWOiKevH3QAOhqtg9dVvrdN,22,32-418-900-6494,367.82,BUILDING,posits sleep carefully around the slyly e
2381,Customer#000002381,"z7B43DZ7RGlkgEi3YaXfy,Aw2SZepYurvII41Do",5,15-493-990-8133,412.99,FURNITURE,ul requests use slyly quickly even deposits. slyly pending
3381,Customer#000003381,03jULkpVTm92eKW24meIj,13,23-441-750-5088,2473.54,AUTOMOBILE,er the carefully bold multipliers doze blithely along the furiousl
4381,Customer#000004381,MIQXH5W6Zsup5cVYfCtWupiJtgi,2,12-570-797-1472,2542.55,HOUSEHOLD,r deposits. carefully even packages along
5381,Customer#000005381,"bXQ,KuigJB1nASXN73PDwNOvXCIkp5",5,15-700-184-7619,4130.88,MACHINERY,es. carefully ironic ideas sleep blithely about the i
6381,Customer#000006381,BKfk07DtN45gg2w4mMUK1,7,17-877-502-9214,7346.88,HOUSEHOLD,"inal asymptotes boost. bold, ironic requests are along the regular, special packages. pending account"
7381,Customer#000007381,yq7RXRmclCUi6wJspelKaEWSJ TfycLah,20,30-666-139-1602,73.39,BUILDING,fluffily special requests are about the fluffily unusual foxes. final frets are slyly fluffily final deposits. even
8381,Customer#000008381,7kbg8wegbgGmgiW8OQ4SbJ8colXl6rpBmHudJ,0,10-177-308-9094,6674.59,AUTOMOBILE,uests against the carefully bold excuses sleep blithely slyly final instructions; unusual requests about
9381,Customer#000009381,BhXODcEOpwNg6,17,27-708-588-6706,4788.15,HOUSEHOLD,sual hockey players use above the final packages. quickly ironic excuses sleep. slyly final pa


In [178]:
%%sql
-- all customer rows where the name starts with a 381
SELECT
  *
FROM
  customer
WHERE
  c_name LIKE '381%';

c_custkey,c_name,c_address,c_nationkey,c_phone,c_acctbal,c_mktsegment,c_comment


In [179]:
%%sql
-- all customer rows where the name has at least one character followed by 91
SELECT
  *
FROM
  customer
WHERE
  c_name LIKE '%_91%';

c_custkey,c_name,c_address,c_nationkey,c_phone,c_acctbal,c_mktsegment,c_comment
91,Customer#000000091,9Sce2m BjvDdjQkqMx8UnrUsJkk1IBAvZPTsA,8,18-239-400-3677,4643.14,AUTOMOBILE,yly ironic foxes lose slyly pending asymptotes. slyly final theodolites nag blithely ar
191,Customer#000000191,cZMo3 b4GwZtUmdbw,16,26-811-707-6869,2945.16,BUILDING,daringly quickly ironic foxes. care
291,Customer#000000291,2FfdPluDa2fxPaRh,8,18-657-656-2318,4261.68,HOUSEHOLD,"ld deposits. regularly ironic pinto beans cajole permanently furiously express packages. regular, unusual sheaves"
391,Customer#000000391,"BZ,850WgpZ0YSFs79Sb",11,21-604-451-4462,4801.3,HOUSEHOLD,tions wake about the blithely final instructions. excuses sleep regular requests. slyly
491,Customer#000000491,"AXsbcyMDujG,CAiEu4FmufbZ1k",0,10-856-259-7548,785.37,AUTOMOBILE,"ly final, even hockey players. carefully final ideas w"
591,Customer#000000591,wkmTqEmyI3UOEoG3q,20,30-584-309-7885,6344.66,MACHINERY,xpress deposits. slyly ironic ideas haggle: daringly even requests after the quickly final ideas boost q
691,Customer#000000691,0aGn3Vcf6ZKi82ogENfnso,16,26-741-688-4189,9566.15,MACHINERY,ven packages cajole fluffily fluffily unusual frays. ironic excuses sleep furiously. regular
791,Customer#000000791,Y14aVvMuDDgnmEuCEPK,13,23-575-775-4059,3694.81,HOUSEHOLD,beans use carefully furiously regular deposits. slyly
891,Customer#000000891,"r4,EU38BM0qdbjwqH",11,21-439-958-7518,6032.18,FURNITURE,ong the quickly quick patterns. slyly
910,Customer#000000910,bKS7h8o7ZEiRj,9,19-899-463-4292,5794.69,BUILDING,silent deposits are. blithely final foxes cajole slyly according to the furiously re
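
The `LIKE` patterns above can be checked on a handful of names. This is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for the lab's Spark setup; the names and the `matching` helper are made up for illustration. The last two patterns show the difference `_` makes: it requires at least one character before `91`.

```python
import sqlite3

# Made-up names in the same style as the lab's customer table.
names = [
    "Customer#000000091",
    "Customer#000000910",
    "Customer#000000381",
    "Customer#000003810",
    "91",
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (c_name TEXT)")
conn.executemany("INSERT INTO customer VALUES (?)", [(n,) for n in names])


def matching(pattern):
    """Return the set of names that satisfy c_name LIKE pattern."""
    rows = conn.execute(
        "SELECT c_name FROM customer WHERE c_name LIKE ?", (pattern,)
    )
    return {row[0] for row in rows}


contains_381 = matching("%381%")    # 381 anywhere in the name
ends_with_381 = matching("%381")    # nothing may follow 381
starts_with_381 = matching("381%")  # nothing may precede 381
char_then_91 = matching("%_91%")    # at least one character before 91
contains_91 = matching("%91%")      # 91 anywhere, even at the very start

print(sorted(char_then_91))
print(sorted(contains_91 - char_then_91))  # the name with nothing before 91
```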


We can also filter for more than one value using `IN` and `NOT IN`.

In [180]:
%%sql
-- all customer rows which have nationkey = 10 or nationkey = 20
SELECT
  *
FROM
  customer
WHERE
  c_nationkey IN (10, 20);

c_custkey,c_name,c_address,c_nationkey,c_phone,c_acctbal,c_mktsegment,c_comment
6,Customer#000000006,"g1s,pzDenUEBW3O,2 pxu0f9n2g64rJrt5E",20,30-114-968-4951,7638.57,AUTOMOBILE,quickly silent asymptotes are slyly regular excuses. instructions wake furiously? quickly bold courts p
16,Customer#000000016,"P2IQMff18ercaYrO,40",10,20-781-609-3107,4681.03,FURNITURE,ests cajole. pinto beans detect slyly. final packages cajole slyly
41,Customer#000000041,jeREsFtCuMqEwdvTFqTkY2NzGRYDG1m,10,20-917-711-4011,270.95,HOUSEHOLD,uctions wake carefully pending deposits: pinto beans along the carefully final deposits sleep blithely a
49,Customer#000000049,PdKqM4TlA OLTjaeRmvH7QWDu80USfslgqutF,10,20-908-631-4424,4573.94,FURNITURE,quests haggle! furiously unusual theodolites cajole carefully. t
55,Customer#000000055,ti9p9XgdmFsjsQI6XQrISDUMFAusnmKS SBoCE,10,20-180-440-8525,4572.11,MACHINERY,dolites. bold instructions wake fluffily regular ideas. regular theodolites are furiously carefully unusual ac
56,Customer#000000056,qh212iaGWtoVp,10,20-895-685-6920,6530.86,FURNITURE,quickly final dependencies. even dependencies are slyly regularly silent theodolites. slow a
81,Customer#000000081,9jUFbrThIIeoUNd8 9,20,30-165-277-3269,2023.71,BUILDING,s against the ironic packages haggle carefully above the slyly express pinto beans
100,Customer#000000100,MBy6qq3OEGpV4u,20,30-749-445-4907,9889.89,FURNITURE,"dazzle carefully furiously final foxes. express, ironic packages among the qui"
104,Customer#000000104,SEOogsfT y09vI2z PcSTnI18U6rNTf,10,20-966-284-8065,-588.38,FURNITURE,efully bold deposits. carefully
105,Customer#000000105,"XI8hMXfr8bIKTGhIRS2sYs,p",10,20-793-553-6417,9091.82,MACHINERY,"solve pending, final requests. regular, bold platele"


In [181]:
%%sql
-- all customer rows which do not have nationkey as 10 or 20
SELECT
  *
FROM
  customer
WHERE
  c_nationkey NOT IN (10, 20);

c_custkey,c_name,c_address,c_nationkey,c_phone,c_acctbal,c_mktsegment,c_comment
1,Customer#000000001,j5JsirBM9PsCy0O1m,15,25-989-741-2988,711.56,BUILDING,y final requests wake slyly quickly special accounts. blithely
2,Customer#000000002,487LW1dovn6Q4dMVymKwwLE9OKf3QG,13,23-768-687-3665,121.65,AUTOMOBILE,y carefully regular foxes. slyly regular requests about the bli
3,Customer#000000003,fkRGN8nY4pkE,1,11-719-748-3364,7498.12,AUTOMOBILE,fully. carefully silent instructions sleep alongside of the slyly regular asymptotes. quickly regular
4,Customer#000000004,4u58h fqkyE,4,14-128-190-5944,2866.83,MACHINERY,sublate. fluffily even instructions are about th
5,Customer#000000005,hwBtxkoBF qSW4KrIk5U 2B1AU7H,3,13-750-942-6364,794.47,HOUSEHOLD,equests haggle furiously against the pending packa
7,Customer#000000007,8OkMVLQ1dK6Mbu6WG9 w4pLGQ n7MQ,18,28-190-982-9759,9561.95,AUTOMOBILE,"ounts. ironic, regular accounts sleep. final requests haggle quickly after the"
8,Customer#000000008,"j,pZ,Qp,qtFEo0r0c 92qobZtlhSuOqbE4JGV",17,27-147-574-9335,6819.74,BUILDING,riously final excuses sublate quickly among the fluffily even foxes. quickly final packages haggle furiously furi
9,Customer#000000009,vgIql8H6zoyuLMFNdAMLyE7 H9,8,18-338-906-3675,8324.07,FURNITURE,ss pinto beans believe slyly quiet deposits-- doggedly bold packages boost. quickly ironic de
10,Customer#000000010,"Vf mQ6Ug9Ucf5OKGYq fsaX AtfsO7,rwY",5,15-741-346-9870,2753.54,HOUSEHOLD,g quickly after the evenly bold
11,Customer#000000011,cG48rYjF3Aw7xs hKUXXqmI,23,33-464-151-3439,-272.6,BUILDING,ng to the regular foxes. furiously final deposits across the final platelets cajole quickly above th
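
One detail worth checking on a small table is how `IN` and `NOT IN` treat missing values. The sketch below uses Python's built-in `sqlite3` module as a stand-in for the lab's Spark setup; the four-row table (including the row with a NULL `c_nationkey`) and the `keys` helper are assumptions made for illustration. It shows that `IN` behaves like a chain of `OR` comparisons, and that a row whose value is NULL matches neither `IN` nor `NOT IN`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (c_custkey INTEGER, c_nationkey INTEGER)")
# Customer 4 has no nation recorded (NULL); this row is made up for the demo.
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(1, 10), (2, 20), (3, 15), (4, None)],
)


def keys(where_clause):
    """Return the sorted c_custkey values that match a WHERE clause."""
    sql = f"SELECT c_custkey FROM customer WHERE {where_clause} ORDER BY c_custkey"
    return [row[0] for row in conn.execute(sql)]


in_list = keys("c_nationkey IN (10, 20)")
or_chain = keys("c_nationkey = 10 OR c_nationkey = 20")

# The NULL row is not returned: comparing NULL with anything is unknown.
not_in_list = keys("c_nationkey NOT IN (10, 20)")

# To keep rows with a missing value, ask for them explicitly.
not_in_or_null = keys("c_nationkey NOT IN (10, 20) OR c_nationkey IS NULL")

print(in_list, not_in_list, not_in_or_null)
```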


We can get the number of rows in a table using `count(*)` as shown below.

In [182]:
%%sql
SELECT
  COUNT(*)
FROM
  customer;

-- 15000

count(1)
15000


In [183]:
%%sql
SELECT
  COUNT(*)
FROM
  lineitem;

-- 600572

count(1)
600572
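
The join examples later in this chapter count a column (e.g., `COUNT(l.l_orderkey)`) rather than `*`, so it helps to know how the two differ. A minimal sketch, using Python's built-in `sqlite3` module as a stand-in for the lab's Spark setup and a made-up four-row `orders` table: `COUNT(*)` counts rows, `COUNT(column)` skips rows where that column is NULL, and `COUNT(DISTINCT column)` counts unique non-NULL values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (o_orderkey INTEGER, o_clerk TEXT)")
# Order 2 has no clerk recorded (NULL); the data is made up for the demo.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "Clerk#1"), (2, None), (3, "Clerk#1"), (4, "Clerk#2")],
)

all_rows, non_null_clerks, distinct_clerks = conn.execute(
    """
    SELECT
      COUNT(*),                -- every row
      COUNT(o_clerk),          -- rows where o_clerk is not NULL
      COUNT(DISTINCT o_clerk)  -- unique non-NULL o_clerk values
    FROM orders
    """
).fetchone()

print(all_rows, non_null_clerks, distinct_clerks)
```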


If we want to get the rows sorted by values in a specific column, we use `ORDER BY`, for example.

In [184]:
%%sql
-- Will show the first ten order records with the lowest custkey
-- rows are ordered in ASC order by default
SELECT
  *
FROM
  orders
ORDER BY
  o_custkey
LIMIT
  10;

o_orderkey,o_custkey,o_orderstatus,o_totalprice,o_orderdate,o_orderpriority,o_clerk,o_shippriority,o_comment
36422,1,O,268835.44,1997-03-04,3-MEDIUM,Clerk#000000532,0,s. slyly regular platelets doubt slyly after the thinly
224167,1,O,81485.84,1996-05-08,5-LOW,Clerk#000000657,0,ithely unusual deposits. slyly pending somas wake quickly according to
135943,1,F,263247.54,1993-06-22,4-NOT SPECIFIED,Clerk#000000685,0,ironic ideas affix furiously ac
164711,1,F,283261.47,1992-04-26,3-MEDIUM,Clerk#000000361,0,fully special ideas. fluffil
287619,1,O,11925.85,1996-12-26,5-LOW,Clerk#000000854,0,t pending requests. carefully ironic sheaves among the slyly final asymptotes
385825,1,O,235155.22,1995-11-01,2-HIGH,Clerk#000000465,0,ly express accounts. special requests according to the carefull
430243,1,F,35523.05,1994-12-24,4-NOT SPECIFIED,Clerk#000000121,0,e slyly along the furiously pending attainments
454791,1,F,83779.26,1992-04-19,1-URGENT,Clerk#000000815,0,ccounts sleep carefully along the slyly ev
579908,1,O,45744.09,1996-12-09,5-LOW,Clerk#000000783,0,"t packages hinder bold, even dolphins. slyly ironic packages wake fluffily a"
52263,2,F,36433.77,1994-05-08,4-NOT SPECIFIED,Clerk#000000080,0,"uests dazzle blithely against the final, final requests. regular theodo"


In [185]:
%%sql
-- Will show the first ten order records with the highest custkey
SELECT
  *
FROM
  orders
ORDER BY
  o_custkey DESC
LIMIT
  10;

o_orderkey,o_custkey,o_orderstatus,o_totalprice,o_orderdate,o_orderpriority,o_clerk,o_shippriority,o_comment
134848,14999,O,170212.14,1998-03-07,2-HIGH,Clerk#000000669,0,"uests alongside of the ironic, ironic instructions use above t"
129605,14999,P,172005.93,1995-03-27,3-MEDIUM,Clerk#000000578,0,ffix sometimes. regular ideas haggle carefu
94817,14999,F,193676.15,1992-08-02,4-NOT SPECIFIED,Clerk#000000650,0,d pearls. asymptotes haggle furiously regular ideas. furiously
67298,14999,O,296795.91,1995-09-15,1-URGENT,Clerk#000000213,0,carefully bold requests. careful
157894,14999,F,76068.72,1992-06-08,4-NOT SPECIFIED,Clerk#000000952,0,olites. unusual multipliers nag slyly even dependencies. slyly spec
158657,14999,O,164096.58,1998-02-28,3-MEDIUM,Clerk#000000977,0,eposits haggle slyly? blithely final packages about the regular p
178087,14999,F,320537.5,1994-04-16,5-LOW,Clerk#000000872,0,l asymptotes nag stealthily. fluffily ironic reques
190498,14999,O,102768.22,1998-03-13,5-LOW,Clerk#000000941,0,n foxes. theodolites integrate blithely. final packages lose quick
215168,14999,F,202248.65,1992-02-24,2-HIGH,Clerk#000000633,0,use quickly regular request
233956,14999,F,219593.16,1994-03-28,5-LOW,Clerk#000000719,0,pinto beans. regular pinto beans along
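
In the outputs above, many rows share the same `o_custkey`, and the order among those tied rows is not specified by the query. A minimal sketch of breaking ties with a second sort column, using Python's built-in `sqlite3` module as a stand-in for the lab's Spark setup and a made-up four-row `orders` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (o_orderkey INTEGER, o_custkey INTEGER, o_totalprice REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(101, 2, 50.0), (102, 1, 75.0), (103, 1, 20.0), (104, 3, 10.0)],
)

# Sort by customer first; within a customer, show the most expensive order first.
top_three = conn.execute(
    """
    SELECT o_orderkey, o_custkey, o_totalprice
    FROM orders
    ORDER BY o_custkey ASC, o_totalprice DESC
    LIMIT 3
    """
).fetchall()

for row in top_three:
    print(row)
```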


## Combine data from multiple tables using JOINs

We can combine data from multiple tables using joins. When we write a join query, we have a format as shown below.

```sql
SELECT
    a.*
FROM
    table_a a -- LEFT table a
    JOIN table_b b -- RIGHT table b
    ON a.id = b.id
```

The table specified first (table_a) is the left table, and the table specified second (table_b) is the right table. When we join more than two tables, the result of joining the first two tables acts as the left table and the third table as the right table. (This is only the logical order; the DB may execute the joins in a different order to optimize performance.)

```sql
SELECT
    a.*
FROM
    table_a a -- LEFT table a
    JOIN table_b b -- RIGHT table b
    ON a.id = b.id
    JOIN table_c c -- LEFT table is the joined data from table_a & table_b, right table is table_c
    ON a.c_id = c.id
```

There are five main types of joins:

### 1. Inner join (default): Get rows with same join keys from both tables

In [186]:
%%sql
SELECT
  o.o_orderkey,
  l.l_orderkey
FROM
  orders o
  JOIN lineitem l ON o.o_orderkey = l.l_orderkey
  AND o.o_orderdate BETWEEN l.l_shipdate - INTERVAL '5' DAY AND l.l_shipdate  + INTERVAL '5' DAY
LIMIT
  10;

o_orderkey,l_orderkey
7,7
32,32
33,33
69,69
71,71
132,132
133,133
198,198
259,259
260,260


In [187]:
%%sql
SELECT
  COUNT(o.o_orderkey) AS order_rows_count,
  COUNT(l.l_orderkey) AS lineitem_rows_count
FROM
  orders o
  JOIN lineitem l ON o.o_orderkey = l.l_orderkey
  AND o.o_orderdate BETWEEN l.l_shipdate - INTERVAL '5' DAY AND l.l_shipdate  + INTERVAL '5' DAY;
-- 24613, 24613

order_rows_count,lineitem_rows_count
24613,24613


**Note:** `JOIN` defaults to `INNER JOIN`.

The output will have rows from orders and lineitem that found at least one matching row from the other table with the specified join condition (same orderkey and orderdate within ship date +/- 5 days).

We can also see that 24,613 rows from the orders and lineitem tables matched.

### 2. Left outer join (aka left join): Get all rows from the left table and only matching rows from the right table.

In [188]:
%%sql

SELECT
  o.o_orderkey,
  l.l_orderkey
FROM
  orders o
  LEFT JOIN lineitem l ON o.o_orderkey = l.l_orderkey
  AND o.o_orderdate BETWEEN l.l_shipdate - INTERVAL '5' DAY AND l.l_shipdate  + INTERVAL '5' DAY
LIMIT
  10;

o_orderkey,l_orderkey
1,
2,
3,
4,
5,
6,
7,7.0
32,32.0
33,33.0
34,


In [189]:
%%sql
SELECT
  COUNT(o.o_orderkey) AS order_rows_count,
  COUNT(l.l_orderkey) AS lineitem_rows_count
FROM
  orders o
  LEFT JOIN lineitem l ON o.o_orderkey = l.l_orderkey
  AND o.o_orderdate BETWEEN l.l_shipdate - INTERVAL '5' DAY AND l.l_shipdate  + INTERVAL '5' DAY;
-- 151933, 24613

order_rows_count,lineitem_rows_count
151933,24613


The output will have all the rows from orders and only the rows from lineitem that found at least one matching row from the orders table with the specified join condition (same orderkey and orderdate within ship date +/- 5 days). Orders without a match show NULL for the lineitem columns.

We can also see that the number of rows from the orders table is 151,933 & from the lineitem table is 24,613. The number of rows in orders is 150,000, but the join produces 151,933 since some orders match with multiple lineitems.
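
The difference between the inner and left join counts can be reproduced on a tiny data set. This is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for the lab's Spark setup; the three orders and four lineitems are made up, and the join condition is simplified to the orderkey match only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (o_orderkey INTEGER)")
conn.execute("CREATE TABLE lineitem (l_orderkey INTEGER, l_linenumber INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?)", [(1,), (2,), (3,)])
# Order 1 has two lineitems, order 2 has none, and lineitem 4 has no order.
conn.executemany(
    "INSERT INTO lineitem VALUES (?, ?)", [(1, 1), (1, 2), (3, 1), (4, 1)]
)

count_sql = """
SELECT COUNT(o.o_orderkey), COUNT(l.l_orderkey)
FROM orders o
{join_type} lineitem l ON o.o_orderkey = l.l_orderkey
"""

# Inner join: only matched pairs survive.
inner_counts = conn.execute(count_sql.format(join_type="JOIN")).fetchone()

# Left join: every order survives; unmatched orders get NULL lineitem columns,
# which COUNT(l.l_orderkey) skips.
left_counts = conn.execute(count_sql.format(join_type="LEFT JOIN")).fetchone()

print(inner_counts, left_counts)
```

The left join reports four order rows even though the table holds three orders, because order 1 is repeated once per matching lineitem.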

### 3. Right outer join (aka right join): Get matching rows from the left and all rows from the right table.

In [190]:
%%sql
SELECT
  o.o_orderkey,
  l.l_orderkey
FROM
  orders o
  RIGHT JOIN lineitem l ON o.o_orderkey = l.l_orderkey
  AND o.o_orderdate BETWEEN l.l_shipdate - INTERVAL '5' DAY AND l.l_shipdate  + INTERVAL '5' DAY
LIMIT
  10;

o_orderkey,l_orderkey
,1
,1
,1
,1
,1
,1
,2
,3
,3
,3


In [191]:
%%sql
SELECT
  COUNT(o.o_orderkey) AS order_rows_count,
  COUNT(l.l_orderkey) AS lineitem_rows_count
FROM
  orders o
  RIGHT JOIN lineitem l ON o.o_orderkey = l.l_orderkey
  AND o.o_orderdate BETWEEN l.l_shipdate - INTERVAL '5' DAY AND l.l_shipdate  + INTERVAL '5' DAY;
-- 24613, 600572

order_rows_count,lineitem_rows_count
24613,600572


The output will have only the rows from orders that found at least one matching row from the lineitem table with the specified join condition (same orderkey and orderdate within ship date +/- 5 days) and all the rows from the lineitem table.

We can also see that the number of rows from the orders table is 24,613 & from the lineitem table is 600,572, which is every row in lineitem.
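
A right join is the mirror image of a left join: `a RIGHT JOIN b` returns the same rows as `b LEFT JOIN a`. The sketch below illustrates this with Python's built-in `sqlite3` module as a stand-in for the lab's Spark setup. It uses the swapped `LEFT JOIN` form because older SQLite builds do not support `RIGHT JOIN`; the tables and values are made up, and the join condition is simplified to the orderkey match only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (o_orderkey INTEGER)")
conn.execute("CREATE TABLE lineitem (l_orderkey INTEGER, l_linenumber INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?)", [(1,), (2,), (3,)])
# Lineitem 4 has no matching order, so its order columns come back as NULL.
conn.executemany(
    "INSERT INTO lineitem VALUES (?, ?)", [(1, 1), (1, 2), (3, 1), (4, 1)]
)

# Equivalent to: orders o RIGHT JOIN lineitem l ON o.o_orderkey = l.l_orderkey
order_rows, lineitem_rows = conn.execute(
    """
    SELECT COUNT(o.o_orderkey), COUNT(l.l_orderkey)
    FROM lineitem l
    LEFT JOIN orders o ON o.o_orderkey = l.l_orderkey
    """
).fetchone()

total_lineitems = conn.execute("SELECT COUNT(*) FROM lineitem").fetchone()[0]
print(order_rows, lineitem_rows, total_lineitems)
```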

### 4. Full outer join: Get matched and un-matched rows from both the tables.

In [192]:
%%sql
SELECT
  o.o_orderkey,
  l.l_orderkey
FROM
  orders o
  FULL OUTER JOIN lineitem l ON o.o_orderkey = l.l_orderkey
  AND o.o_orderdate BETWEEN l.l_shipdate - INTERVAL '5' DAY AND l.l_shipdate  + INTERVAL '5' DAY
LIMIT
  10

o_orderkey,l_orderkey
7.0,7
,7
,7
,7
,7
,7
,7
32.0,32
,32
,32


In [193]:
%%sql
SELECT
  COUNT(o.o_orderkey) AS order_rows_count,
  COUNT(l.l_orderkey) AS lineitem_rows_count
FROM
  orders o
  FULL OUTER JOIN lineitem l ON o.o_orderkey = l.l_orderkey
  AND o.o_orderdate BETWEEN l.l_shipdate - INTERVAL '5' DAY AND l.l_shipdate  + INTERVAL '5' DAY;
-- 151933, 600572

order_rows_count,lineitem_rows_count
151933,600572


The output will have all the rows from both the orders and lineitem tables. Rows that satisfy the specified join condition (same orderkey and orderdate within ship date +/- 5 days) are paired up, and rows without a match show NULL for the other table's columns.

We can also see that the number of rows from the orders table is 151,933 & from the lineitem table is 600,572.

### 5. Cross join: Join every row in left table with every row in the right table

In [194]:
%%sql
SELECT
  n.n_name AS nation_c_name,
  r.r_name AS region_c_name
FROM
  nation n
  CROSS JOIN region r;

nation_c_name,region_c_name
ALGERIA,AFRICA
ALGERIA,AMERICA
ALGERIA,ASIA
ALGERIA,EUROPE
ALGERIA,MIDDLE EAST
ARGENTINA,AFRICA
ARGENTINA,AMERICA
ARGENTINA,ASIA
ARGENTINA,EUROPE
ARGENTINA,MIDDLE EAST


The output will have every row of the nation joined with every row of the region. There are 25 nations and five regions, leading to 125 rows in our result from the cross-join.
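
The row count of a cross join is the product of the two table sizes. A minimal sketch of that arithmetic, using Python's built-in `sqlite3` module as a stand-in for the lab's Spark setup, with made-up three-row `nation` and two-row `region` tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nation (n_name TEXT)")
conn.execute("CREATE TABLE region (r_name TEXT)")
conn.executemany(
    "INSERT INTO nation VALUES (?)", [("ALGERIA",), ("ARGENTINA",), ("BRAZIL",)]
)
conn.executemany("INSERT INTO region VALUES (?)", [("AFRICA",), ("AMERICA",)])

# Every nation is paired with every region; there is no join condition.
pairs = conn.execute(
    "SELECT n.n_name, r.r_name FROM nation n CROSS JOIN region r"
).fetchall()

print(len(pairs))  # 3 nations x 2 regions
```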


There are cases where we need to join a table with itself, called a self-join. Let's consider an example.

1. For every customer order, get the other orders placed by the same customer in the same calendar week (Monday - Sunday, as computed by `weekofyear`, not the previous seven days). Only show customer orders that have at least one such order.

In [195]:
%%sql    
SELECT
    o1.o_custkey as o1_custkey,
    o1.o_totalprice as o1_totalprice,
    o1.o_orderdate as o1_orderdate,
    o2.o_totalprice as o2_totalprice,
    o2.o_orderdate as o2_orderdate
FROM
    orders o1
    JOIN orders o2 ON o1.o_custkey = o2.o_custkey
    AND year(o1.o_orderdate) = year(o2.o_orderdate)
    AND weekofyear(o1.o_orderdate) = weekofyear(o2.o_orderdate)
WHERE
    o1.o_orderkey != o2.o_orderkey
LIMIT
    10;

o1_custkey,o1_totalprice,o1_orderdate,o2_totalprice,o2_orderdate
8177,307811.89,1996-09-20,123887.45,1996-09-22
6049,280793.15,1995-10-21,88561.12,1995-10-19
12271,10429.67,1998-03-28,65768.61,1998-03-25
2227,51571.37,1992-01-13,20804.5,1992-01-14
6874,148501.65,1997-02-23,254438.67,1997-02-22
8143,228500.69,1992-10-21,171242.0,1992-10-23
5524,52232.65,1994-02-13,276444.19,1994-02-13
6476,87984.15,1995-10-08,173116.65,1995-10-02
13462,83237.16,1997-01-12,216550.22,1997-01-12
3220,74571.04,1993-06-20,235954.73,1993-06-16
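
The query above pairs each order with every other order from the same customer and week, whether it came before or after. If only earlier orders are wanted, one extra date comparison does it. The sketch below uses Python's built-in `sqlite3` module as a stand-in for the lab's Spark setup; SQLite has no `weekofyear`, so it uses `strftime('%W', ...)`, which also starts weeks on Monday but numbers them differently. The four orders are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (o_orderkey INTEGER, o_custkey INTEGER, o_orderdate TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 1, "1995-10-02"),  # Monday
        (2, 1, "1995-10-08"),  # Sunday of the same week
        (3, 1, "1995-10-09"),  # Monday of the next week
        (4, 2, "1995-10-03"),  # a different customer
    ],
)

base_sql = """
SELECT o1.o_orderkey, o2.o_orderkey
FROM orders o1
JOIN orders o2 ON o1.o_custkey = o2.o_custkey
  AND strftime('%Y', o1.o_orderdate) = strftime('%Y', o2.o_orderdate)
  AND strftime('%W', o1.o_orderdate) = strftime('%W', o2.o_orderdate)
WHERE {condition}
ORDER BY o1.o_orderkey, o2.o_orderkey
"""

# Any other order in the same week (the behaviour of the query above).
any_other = conn.execute(
    base_sql.format(condition="o1.o_orderkey != o2.o_orderkey")
).fetchall()

# Only orders placed strictly earlier in the same week.
earlier_only = conn.execute(
    base_sql.format(condition="o2.o_orderdate < o1.o_orderdate")
).fetchall()

print(any_other, earlier_only)
```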


## Combine data from multiple rows into one using GROUP BY

Most analytical queries require calculating metrics that involve combining data from multiple rows. `GROUP BY` allows us to perform aggregate calculations on sets of rows, where each set is identified by the values of the specified column(s). For example:

1. Create a report that shows the number of orders per orderpriority segment.

In [196]:
%%sql
SELECT
  o_orderpriority,
  COUNT(*) AS num_orders
FROM
  orders
GROUP BY
  o_orderpriority;

o_orderpriority,num_orders
5-LOW,30244
3-MEDIUM,29563
1-URGENT,30111
4-NOT SPECIFIED,29910
2-HIGH,30172


In the above query, we group the data by `o_orderpriority`, and the calculation `COUNT(*)` is applied separately to the rows that share each `o_orderpriority` value.

The calculations allowed are typically SUM/MIN/MAX/AVG/COUNT. However, some databases have more complex aggregate functions; check your DB documentation.

### Use HAVING to filter based on the aggregates created by GROUP BY 
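
`WHERE` filters rows before they are grouped, so it cannot reference an aggregate such as `COUNT(*)`. `HAVING` filters the groups after the aggregation has been computed. The sketch below shows the idea with Python's built-in `sqlite3` module on a tiny made-up `orders` table; SQLite only stands in for the notebook's engine, while the `GROUP BY ... HAVING` statement itself is standard SQL.

```python
import sqlite3

# In-memory SQLite database with hypothetical sample rows (assumption)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (o_orderkey INTEGER, o_orderpriority TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "1-URGENT"), (2, "1-URGENT"), (3, "1-URGENT"),
     (4, "2-HIGH"), (5, "2-HIGH"), (6, "5-LOW")],
)

# Keep only the priority segments that have at least two orders
query = """
SELECT o_orderpriority, COUNT(*) AS num_orders
FROM orders
GROUP BY o_orderpriority
HAVING COUNT(*) >= 2
ORDER BY o_orderpriority
"""
having_rows = conn.execute(query).fetchall()
print(having_rows)
```

The `5-LOW` segment has a single order, so it is dropped by the `HAVING` clause even though no individual row was filtered out.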

## Replicate IF...ELSE logic with CASE statements

We can do conditional logic in the `SELECT ... FROM` part of our query, as shown below.

In [197]:
%%sql
SELECT
    o_orderkey,
    o_totalprice,
    CASE
        WHEN o_totalprice > 100000 THEN 'high'
        WHEN o_totalprice BETWEEN 25000
        AND 100000 THEN 'medium'
        ELSE 'low'
    END AS order_price_bucket
FROM
    orders;

o_orderkey,o_totalprice,order_price_bucket
1,194029.55,high
2,60951.63,medium
3,247296.05,high
4,53829.87,medium
5,139660.54,high
6,65843.52,medium
7,231037.28,high
32,166802.63,high
33,118518.56,high
34,75662.77,medium


We can see how we display different values depending on the `o_totalprice` column. We can also combine multiple conditions in a single `WHEN` clause (e.g., `o_totalprice > 100000 AND o_orderpriority = '2-HIGH'`).
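
A `CASE` expression checks its `WHEN` clauses from top to bottom and returns the first match, so the more specific condition should come first. The sketch below illustrates a multi-condition `WHEN` using Python's built-in `sqlite3` module as a stand-in for the notebook's engine; the three rows and the bucket labels are made up for illustration.

```python
import sqlite3

# In-memory SQLite database with hypothetical sample rows (assumption)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (o_orderkey INTEGER, o_totalprice REAL, o_orderpriority TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 194029.55, "5-LOW"), (2, 60951.63, "1-URGENT"), (3, 247296.05, "2-HIGH")],
)

# The first WHEN combines two conditions with AND; the first matching WHEN wins
query = """
SELECT
    o_orderkey,
    CASE
        WHEN o_totalprice > 100000 AND o_orderpriority = '2-HIGH'
            THEN 'high value, high priority'
        WHEN o_totalprice > 100000 THEN 'high value'
        ELSE 'other'
    END AS order_bucket
FROM orders
ORDER BY o_orderkey
"""
case_rows = conn.execute(query).fetchall()
print(case_rows)
```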

## Stack tables on top of each other with UNION and UNION ALL, subtract tables with EXCEPT

When we want to combine data from tables by stacking them on top of each other, we use UNION or UNION ALL. `UNION` removes duplicate rows, and `UNION ALL` does not remove duplicate rows. Let's look at an example.

In [198]:
%%sql

SELECT c_custkey, c_name FROM customer WHERE c_name LIKE '%_91%' -- 25 rows

c_custkey,c_name
91,Customer#000000091
191,Customer#000000191
291,Customer#000000291
391,Customer#000000391
491,Customer#000000491
591,Customer#000000591
691,Customer#000000691
791,Customer#000000791
891,Customer#000000891
910,Customer#000000910


In [199]:
%%sql
-- UNION will remove duplicate rows; the below query will produce 25 rows
SELECT c_custkey, c_name FROM customer WHERE c_name LIKE '%_91%'
UNION
SELECT c_custkey, c_name FROM customer WHERE c_name LIKE '%_91%'
UNION
SELECT c_custkey, c_name FROM customer WHERE c_name LIKE '%_91'

c_custkey,c_name
8991,Customer#000008991
2916,Customer#000002916
7911,Customer#000007911
9140,Customer#000009140
3291,Customer#000003291
5914,Customer#000005914
9160,Customer#000009160
14991,Customer#000014991
9103,Customer#000009103
11191,Customer#000011191


In [200]:
%%sql
-- UNION ALL will not remove duplicate rows; the below query will produce 75 rows
SELECT c_custkey, c_name FROM customer WHERE c_name LIKE '%_91%'
UNION ALL
SELECT c_custkey, c_name FROM customer WHERE c_name LIKE '%_91%'
UNION ALL
SELECT c_custkey, c_name FROM customer WHERE c_name LIKE '%_91%';

c_custkey,c_name
91,Customer#000000091
191,Customer#000000191
291,Customer#000000291
391,Customer#000000391
491,Customer#000000491
591,Customer#000000591
691,Customer#000000691
791,Customer#000000791
891,Customer#000000891
910,Customer#000000910


When we want to get all the rows from the first dataset that are not in the second dataset, we can use `EXCEPT`.

In [201]:
%%sql
-- EXCEPT returns the rows in the first query result that are not in the second query result; here, 0 rows
SELECT c_custkey, c_name FROM customer WHERE c_name LIKE '%_91%'
EXCEPT
SELECT c_custkey, c_name FROM customer WHERE c_name LIKE '%_91%';

c_custkey,c_name


In [202]:
%%sql
-- The below query will result in 23 rows; the first query has 25 rows, and the second has two rows
SELECT c_custkey, c_name FROM customer WHERE c_name LIKE '%_91%'
EXCEPT
SELECT c_custkey, c_name FROM customer WHERE c_name LIKE '%191%';

c_custkey,c_name
8991,Customer#000008991
2916,Customer#000002916
7911,Customer#000007911
9140,Customer#000009140
3291,Customer#000003291
5914,Customer#000005914
9160,Customer#000009160
14991,Customer#000014991
9103,Customer#000009103
11991,Customer#000011991
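
To see how the row counts of `UNION`, `UNION ALL`, and `EXCEPT` relate to each other, the sketch below runs all three on a four-row made-up `customer` table. It uses Python's built-in `sqlite3` module as a stand-in for the notebook's engine, and the simplified `LIKE` patterns are chosen only to keep the counts easy to follow.

```python
import sqlite3

# In-memory SQLite database with hypothetical sample rows (assumption)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (c_custkey INTEGER, c_name TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [
        (91, "Customer#000000091"),
        (100, "Customer#000000100"),
        (191, "Customer#000000191"),
        (291, "Customer#000000291"),
    ],
)

# Three customers end in 91; one of them ends in 191
ends_in_91 = "SELECT c_custkey, c_name FROM customer WHERE c_name LIKE '%91'"
ends_in_191 = "SELECT c_custkey, c_name FROM customer WHERE c_name LIKE '%191'"


def count_rows(sql):
    """Run a query and return the number of rows it produces."""
    return len(conn.execute(sql).fetchall())


union_count = count_rows(f"{ends_in_91} UNION {ends_in_91}")          # duplicates removed
union_all_count = count_rows(f"{ends_in_91} UNION ALL {ends_in_91}")  # duplicates kept
except_self_count = count_rows(f"{ends_in_91} EXCEPT {ends_in_91}")   # nothing left
except_other_count = count_rows(f"{ends_in_91} EXCEPT {ends_in_191}") # one row removed
print(union_count, union_all_count, except_self_count, except_other_count)
```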


## Sub-query: Use query instead of a table

When we want to use the result of a query as a table in another query, we use subqueries. Let's consider an example:

1. Create a report that shows the nation, how many items it supplied (by suppliers in that nation), and how many items it purchased (by customers in that nation). 

In [203]:
%%sql
SELECT
  n.n_name AS nation_c_name,
  s.quantity AS supplied_items_quantity,
  c.quantity AS purchased_items_quantity
FROM
  nation n
  LEFT JOIN (
    SELECT
      n.n_nationkey,
      SUM(l.l_quantity) AS quantity
    FROM
      lineitem l
      JOIN supplier s ON l.l_suppkey = s.s_suppkey
      JOIN nation n ON s.s_nationkey = n.n_nationkey
    GROUP BY
      n.n_nationkey
  ) s ON n.n_nationkey = s.n_nationkey
  LEFT JOIN (
    SELECT
      n.n_nationkey,
      SUM(l.l_quantity) AS quantity
    FROM
      lineitem l
      JOIN orders o ON l.l_orderkey = o.o_orderkey
      JOIN customer c ON o.o_custkey = c.c_custkey
      JOIN nation n ON c.c_nationkey = n.n_nationkey
    GROUP BY
      n.n_nationkey
  ) c ON n.n_nationkey = c.n_nationkey;

nation_c_name,supplied_items_quantity,purchased_items_quantity
JAPAN,632809.0,594514.0
RUSSIA,719815.0,607446.0
ARGENTINA,583989.0,609330.0
JORDAN,435841.0,609850.0
FRANCE,534549.0,585564.0
MOZAMBIQUE,523591.0,613443.0
CANADA,569306.0,631774.0
SAUDI ARABIA,720977.0,569819.0
ETHIOPIA,506759.0,647056.0
ROMANIA,506500.0,628528.0


In the above query, we can see that there are two sub-queries, one to calculate the quantity supplied by a nation and the other to calculate the quantity purchased by the customers of a nation.

## Change data types (CAST) and handle NULLS (COALESCE)

Every column in a table has a specific data type. The data types fall under one of the following categories.

1. **`Numerical`**: Data types used to store numbers.
   1. Integer: Positive and negative numbers. Different types of Integer, such as tinyint, int, and bigint, allow storage of different ranges of values. Integers cannot have decimal digits.
   2. Floating: These can have decimal digits but store an approximate value.
   3. Decimal: These can have decimal digits and store the exact value. The decimal type allows you to specify the precision and scale, where precision denotes the total number of digits allowed & scale denotes the number of digits allowed after the decimal point. E.g., DECIMAL(8,3) allows eight digits in total, with three of them after the decimal point.
2. **`Boolean`**: Data types used to store True or False values.
3. **`String`**: Data types used to store alphanumeric characters.
   1. Varchar(n): Data type allows storage of variable character string, with a permitted max length n.
   2. Char(n): Data type allows storage of fixed character string. A column of char(n) type pads a string shorter than n characters with (n - length(string)) trailing spaces.
4. **`Date & time`**: Data types used to store dates, time, & timestamps (date + time).
5. **`Objects (JSON, ARRAY)`**: Data types used to store JSON and ARRAY data.
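
The difference between approximate floating-point values and exact decimal values can be illustrated outside the database. The sketch below uses Python's `float` and `decimal.Decimal` as an analogy for the Floating and Decimal column types; it is not how any particular database stores its values, but the behavior it shows is the same in spirit.

```python
from decimal import Decimal

# Floating point stores binary approximations, so small errors can appear
float_sum = 0.1 + 0.2
print(float_sum == 0.3)  # False

# Decimal stores the exact decimal digits
decimal_sum = Decimal("0.1") + Decimal("0.2")
print(decimal_sum == Decimal("0.3"))  # True

# 12345.678 has 8 digits in total (precision) and 3 after the point (scale),
# so it fits in a DECIMAL(8,3) column
sign, digits, exponent = Decimal("12345.678").as_tuple()
precision = len(digits)
scale = -exponent
print(precision, scale)
```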

Some databases have data types that are unique to them as well. We should check the database documents to understand the data types offered.

Functions such as `DATE_DIFF` and `ROUND` are specific to a data type. It is best practice to use the appropriate data type for your columns. We can convert a value from one data type to another using the `CAST` function, e.g., `CAST('2023-11-05' AS DATE)`.
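
A minimal sketch of `CAST`, run through Python's built-in `sqlite3` module as a stand-in for the notebook's engine. The type names (`INTEGER`, `TEXT`) are SQLite's, and the handling of edge cases, such as casting a non-numeric string, differs between databases, so treat this as an illustration of the syntax rather than of exact engine behavior.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# CAST(value AS type) converts a value to the requested data type
cast_row = conn.execute(
    """
    SELECT
        CAST('42' AS INTEGER),  -- string to integer
        CAST(3.99 AS INTEGER),  -- the fractional part is dropped, not rounded
        CAST(42 AS TEXT)        -- integer to string
    """
).fetchone()
print(cast_row)
```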

A `NULL` will be used for that field when a value is not present. In cases where we want to use the first non-NULL value from a list of columns, we use `COALESCE` as shown below.

Let's consider the example below. When `l.l_orderkey` is NULL (i.e., no lineitem row matched the join condition), the query returns `9999999` instead.

In [204]:
%%sql
SELECT
    o.o_orderkey,
    o.o_orderdate,
    COALESCE(l.l_orderkey, 9999999) AS lineitem_orderkey,
    l.l_shipdate
FROM
    orders o
    LEFT JOIN lineitem l ON o.o_orderkey = l.l_orderkey
    AND o.o_orderdate BETWEEN l.l_shipdate - INTERVAL '5' DAY
    AND l.l_shipdate + INTERVAL '5' DAY
LIMIT
    10;

o_orderkey,o_orderdate,lineitem_orderkey,l_shipdate
1,1996-01-02,9999999,
2,1996-12-01,9999999,
3,1993-10-14,9999999,
4,1995-10-11,9999999,
5,1994-07-30,9999999,
6,1992-02-21,9999999,
7,1996-01-10,7,1996-01-15
32,1995-07-16,32,1995-07-21
33,1993-10-27,33,1993-10-29
34,1998-07-21,9999999,


## Use these standard inbuilt DB functions for String, Time, and Numeric data manipulation

When processing data, more often than not, we will need to change values in columns; shown below are a few standard functions to be aware of:

1. **`String functions`**
   1. **LENGTH** is used to calculate the length of a string. E.g., `SELECT LENGTH('hi');` will output 2.
   2. **CONCAT** combines multiple string columns into one. E.g., `SELECT CONCAT(o_clerk, '-', o_orderpriority) FROM orders LIMIT 5;` will concatenate the `o_clerk` and `o_orderpriority` columns with a dash in between them.
   3. **SPLIT** is used to split a value into an array based on a given delimiter. E.g., `SELECT SPLIT(o_clerk, '#') FROM orders LIMIT 5;` will output a column with arrays formed by splitting `o_clerk` values on `#`.
   4. **SUBSTRING** is used to get a sub-string from a value, given the start position and the number of characters to return. E.g., `SELECT o_clerk, SUBSTR(o_clerk, 1, 5) FROM orders LIMIT 5;` will get the first five characters of the `o_clerk` column. Note that the indexing starts from 1.
   5. **TRIM** is used to remove empty spaces to the left and right of the value. E.g., `SELECT TRIM(' hi ');` will output `hi` without any spaces around it. LTRIM and RTRIM are similar but only remove spaces before and after the string, respectively.
2. **`Date and Time functions`**
   1. **Adding and subtracting dates**: These functions add and subtract periods; the syntax depends heavily on the DB. E.g., in Trino, the query
      ```sql
      SELECT
          date_diff('DAY', DATE '2022-10-01', DATE '2023-11-05') AS diff_in_days,
          date_diff('MONTH', DATE '2022-10-01', DATE '2023-11-05') AS diff_in_months,
          date_diff('YEAR', DATE '2022-10-01', DATE '2023-11-05') AS diff_in_years;
      ```
      will show the difference between the two dates in the specified period. We can also add/subtract an arbitrary period from a date/time column. E.g., `SELECT DATE '2022-11-05' + INTERVAL '10' DAY;` will show the output `2022-11-15`.
   2. **string <=> date/time conversions**: When we want to change the data type of a string to date/time, we can use the `DATE 'YYYY-MM-DD'` or `TIMESTAMP 'YYYY-MM-DD HH:mm:SS'` literals. But when the data is in a different date/time format such as `MM/DD/YYYY`, we will need to specify the input structure; in Trino, we do this using `date_parse`. E.g., `SELECT date_parse('11-05-2023', '%m-%d-%Y');`. We can convert a timestamp/date into a string with the required format using `date_format`. E.g., `SELECT DATE_FORMAT(o_orderdate, '%Y-%m-01') AS first_month_date FROM orders LIMIT 5;` will map every order date to the first day of its month. The function names and format specifiers depend on the DB (e.g., Spark SQL uses `to_date` with patterns such as `MM-dd-yyyy`); check your DB documentation.
   3. **Time frame functions (YEAR/MONTH/DAY)**: When we want to extract specific periods from a date/time column, we can use these functions. E.g., `SELECT year(date '2023-11-05');` will return 2023. Similarly, we have month, day, hour, minute, etc.
3. **`Numeric`**
   1. **ROUND** rounds a number to the specified number of digits after the decimal point. E.g., `SELECT ROUND(100.102345, 2);` will return `100.10`.
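
The functions above can be tried out without any tables. The sketch below runs a few of them through Python's built-in `sqlite3` module as a stand-in for the notebook's engine. Function names differ between databases: here SQLite's `||` operator replaces `CONCAT`, `date(..., '+10 days')` replaces the `INTERVAL` syntax, and `julianday` and `strftime` replace `date_diff` and `year`. The literal values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# String and numeric functions
string_row = conn.execute(
    """
    SELECT
        LENGTH('hi'),
        'Clerk#000000951' || '-' || '5-LOW',  -- SQLite's concatenation operator
        SUBSTR('Clerk#000000951', 1, 5),      -- start position 1, length 5
        TRIM(' hi '),
        ROUND(100.102345, 2)
    """
).fetchone()
print(string_row)

# Date functions, using SQLite's own syntax
date_row = conn.execute(
    """
    SELECT
        date('2022-11-05', '+10 days'),
        CAST(julianday('2023-11-05') - julianday('2022-10-01') AS INTEGER),
        CAST(strftime('%Y', '2023-11-05') AS INTEGER)
    """
).fetchone()
print(date_row)
```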

## Save queries as views for more straightforward reads

When we have large/complex queries that we need to run often, we can save them as views. Views are DB objects that operate similarly to a table. The OLAP DB executes the underlying query when we query a view. 

Use views to hide query complexities and limit column access (by exposing only specific table columns) for end-users.

For example, we can create a view for the nation-level report from the above section, as shown below.

In [205]:
%%sql
DROP VIEW IF EXISTS nation_supplied_purchased_quantity

In [206]:
%%sql
CREATE VIEW nation_supplied_purchased_quantity AS
SELECT
    n.n_name AS nation_name,
    s.quantity AS supplied_items_quantity,
    c.quantity AS purchased_items_quantity
FROM
    nation n
    LEFT JOIN (
        SELECT
            n_nationkey as nationkey,
            sum(l_quantity) AS quantity
        FROM
            lineitem l
            JOIN supplier s ON l.l_suppkey = s.s_suppkey
            JOIN nation n ON s.s_nationkey = n.n_nationkey
        GROUP BY
            n.n_nationkey
    ) s ON n.n_nationkey = s.nationkey
    LEFT JOIN (
        SELECT
            n_nationkey as nationkey,
            sum(l_quantity) AS quantity
        FROM
            lineitem l
            JOIN orders o ON l.l_orderkey = o.o_orderkey
            JOIN customer c ON o.o_custkey = c.c_custkey
            JOIN nation n ON c.c_nationkey = n.n_nationkey
        GROUP BY
            n.n_nationkey
    ) c ON n.n_nationkey = c.nationkey;

In [207]:
%%sql
SELECT
    *
FROM
    nation_supplied_purchased_quantity;

nation_name,supplied_items_quantity,purchased_items_quantity
JAPAN,632809.0,594514.0
RUSSIA,719815.0,607446.0
ARGENTINA,583989.0,609330.0
JORDAN,435841.0,609850.0
FRANCE,534549.0,585564.0
MOZAMBIQUE,523591.0,613443.0
CANADA,569306.0,631774.0
SAUDI ARABIA,720977.0,569819.0
ETHIOPIA,506759.0,647056.0
ROMANIA,506500.0,628528.0


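Now the view `nation_supplied_purchased_quantity` will run the underlying query whenever it is queried. Because the `CREATE VIEW` statement does not qualify the view name, the view is created in the schema currently in use. Note that some catalogs are read-only and do not allow the creation of views. For example, the `tpch` catalog that comes with Trino only allows read operations, so on Trino a view like this has to be created in a writable schema such as `minio.tpch`. Read more about [connectors here](https://trino.io/docs/current/overview/concepts.html#connector).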
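
Because a view stores the query and not its result, it reflects any change in the underlying tables the next time it is queried. The sketch below demonstrates this with Python's built-in `sqlite3` module as a stand-in for the notebook's engine; the `order_quantity` view and the sample `lineitem` rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database with hypothetical sample rows (assumption)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lineitem (l_orderkey INTEGER, l_quantity REAL)")
conn.executemany("INSERT INTO lineitem VALUES (?, ?)", [(1, 10), (1, 5), (2, 7)])

# The view saves the aggregation query under a name
conn.execute(
    """
    CREATE VIEW order_quantity AS
    SELECT l_orderkey, SUM(l_quantity) AS quantity
    FROM lineitem
    GROUP BY l_orderkey
    """
)

view_query = "SELECT l_orderkey, quantity FROM order_quantity ORDER BY l_orderkey"
before = conn.execute(view_query).fetchall()

# Add a row to the underlying table; the view picks it up on the next read
conn.execute("INSERT INTO lineitem VALUES (2, 3)")
after = conn.execute(view_query).fetchall()
print(before, after)
```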

## Exercises

1. Create a report that shows the number of returns for each region name
2. Find the top 10 best-selling parts
3. Find the sellers who sell at least one of the top 10 best-selling parts
4. Find the number of returns per order price bucket

Assume the price bucket logic is 
```sql
CASE
    WHEN o_totalprice > 100000 THEN 'high'
    WHEN o_totalprice BETWEEN 25000 AND 100000 THEN 'medium'
    ELSE 'low'
END AS order_price_bucket
```

5. Find the average time (in days) between `l_receiptdate` and `l_shipdate` for each nation

## Recommended reading

1. https://www.startdataengineering.com/post/improve-sql-skills-de/
2. https://www.startdataengineering.com/post/n-sql-tips-de/
3. https://www.startdataengineering.com/post/advanced-sql/