<a href="https://colab.research.google.com/github/nurfnick/Data_Viz/blob/main/04_SQL_Essentials.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SQL Essentials

In [1]:
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Authenticated


Let's start with the basics.  I'll continue to work with the liquor store data.  

In [2]:
%%bigquery --project pic-math
SELECT *
FROM `bigquery-public-data.iowa_liquor_sales.sales`
LIMIT 5

Unnamed: 0,invoice_and_item_number,date,store_number,store_name,address,city,zip_code,store_location,county_number,county,category,category_name,vendor_number,vendor_name,item_number,item_description,pack,bottle_volume_ml,state_bottle_cost,state_bottle_retail,bottles_sold,sale_dollars,volume_sold_liters,volume_sold_gallons
0,INV-37362200037,2021-06-09,4582,Jiffy #926 / Spirit Lake,"2402, 17th St",Spirit Lake,51360.0,POINT (-95.126585 43.42282),30,DICKINSON,1092100.0,Imported Distilled Spirit Specialty,421.0,SAZERAC COMPANY INC,77487,Tortilla Gold DSS,12,1000,4.85,7.28,60,436.8,60.0,15.85
1,INV-21177600024,2019-08-12,2465,Sid's Beverage Shop,2727 Dodge St,Dubuque,52003.0,POINT (-90.705328 42.491862),31,DUBUQUE,1081100.0,Coffee Liqueurs,65.0,Jim Beam Brands,67557,Kamora Coffee Liqueur,12,1000,8.39,12.59,6,75.54,6.0,1.58
2,INV-25272100048,2020-02-17,2671,Hy-Vee / Jefferson,"106, W Washington St",Jefferson,50129.0,POINT (-94.375508 42.017267),37,GREENE,1081500.0,Triple Sec,434.0,LUXCO INC,86507,Paramount Triple Sec,12,1000,3.84,5.76,4,23.04,4.0,1.05
3,S27599300045,2015-08-31,2191,Keokuk Spirits,1013 MAIN,KEOKUK,52632.0,POINT (-91.387797 40.400038),56,Lee,1081700.0,DISTILLED SPIRITS SPECIALTY,434.0,Luxco-St Louis,75087,Juarez Gold Dss,12,1000,4.92,7.38,48,354.24,48.0,12.68
4,S21884700116,2014-10-20,2605,Hy-Vee Drugstore #5 / Cedar Rapids,2001 BLAIRS FERRY ROAD NE,CEDAR RAPIDS,52402.0,POINT (-91.668909 42.034799),57,Linn,1081010.0,AMERICAN AMARETTO,421.0,"Sazerac Co., Inc.",71886,Amaretto E Dolce,12,750,3.34,5.01,7,35.07,5.25,1.39


Why did I run that command?  It gives me an idea of what is in the table, which columns I can reference, and what questions I might ask!  Let's see how many gallons of liquor have been sold.

In [7]:
%%bigquery --project pic-math
SELECT SUM(volume_sold_gallons) as Total_Gallons_of_Liquor
FROM `bigquery-public-data.iowa_liquor_sales.sales`

Unnamed: 0,Total_Gallons_of_Liquor
0,55607630.0


Let's do a small conversion just to see what that means.  An Olympic pool holds about 660,000 gallons, so

In [8]:
(5.560763*10**7)/660000

84.25398484848485

About 84 swimming pools of liquor in Iowa!  Fun times...
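
For reference, the 660,000 gallon figure can be recovered from the pool's dimensions. This is a minimal sketch that assumes a 50 m by 25 m pool at a 2 m depth (an assumption, since deeper pools hold more) and US liquid gallons. The variable names are only illustrative.

```python
# Assumed pool dimensions in meters (50 x 25, 2 m deep); not from the dataset.
POOL_LENGTH_M, POOL_WIDTH_M, POOL_DEPTH_M = 50, 25, 2
LITERS_PER_CUBIC_METER = 1000
LITERS_PER_US_GALLON = 3.785411784  # definition of the US liquid gallon

pool_liters = POOL_LENGTH_M * POOL_WIDTH_M * POOL_DEPTH_M * LITERS_PER_CUBIC_METER
pool_gallons = pool_liters / LITERS_PER_US_GALLON

total_gallons = 55_607_630  # result of the SUM query above
pools = total_gallons / pool_gallons

print(f"One pool is about {pool_gallons:,.0f} gallons")
print(f"Total sales fill about {pools:.1f} pools")
```

Under these assumptions the pool count stays close to the rounded estimate above.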

Let's make it more complicated.  Let's see what the dollars per gallon is on the full dataset.

In [9]:
%%bigquery --project pic-math
SELECT SUM(sale_dollars)/SUM(volume_sold_gallons) as Total_Dollars_Per_Gallon
FROM `bigquery-public-data.iowa_liquor_sales.sales`

Unnamed: 0,Total_Dollars_Per_Gallon
0,57.373869


Okay, not terribly interesting.  Let's take that question and add to it.  Let's create a column that is dollars per gallon of liquor for each row.

In [13]:
%%bigquery --project pic-math
SELECT sale_dollars/volume_sold_gallons as Dollars_Per_Gallon
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE volume_sold_gallons != 0
LIMIT 5

Unnamed: 0,Dollars_Per_Gallon
0,13.741325
1,94.537815
2,28.391167
3,48.692516
4,136.840336


I have added a few new things here.  The `WHERE` clause allows me to restrict which rows I consider; here it drops rows with zero gallons sold so that the division is defined.  You can combine several conditions logically with `and`:

In [15]:
%%bigquery --project pic-math
SELECT sale_dollars/volume_sold_gallons as Dollars_Per_Gallon, item_description
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE volume_sold_gallons != 0 and category_name = 'Coffee Liqueurs'

Unnamed: 0,Dollars_Per_Gallon,item_description
0,136.363636,Iowa Coffee Company Liqueur
1,136.134454,Iowa Coffee Company Liqueur
2,136.134454,Iowa Coffee Company Liqueur
3,135.000000,Iowa Coffee Company Liqueur
4,118.421053,Mozart Chocolate Coffee Cream Liqueur
...,...,...
78766,91.474576,Kahlua French Vanilla Liqueur DISCO
78767,91.474576,Kahlua French Vanilla Liqueur DISCO
78768,91.474576,Kahlua French Vanilla Liqueur DISCO
78769,91.474576,Kahlua French Vanilla Liqueur DISCO
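
To try the combined `WHERE` conditions without a BigQuery project, here is a small sketch using Python's built-in `sqlite3` on made-up rows. The table only mimics four columns of the real one, and the item names are invented. One difference to keep in mind: as far as I know BigQuery raises an error on division by zero, while SQLite quietly returns NULL, so the zero-volume filter matters more in BigQuery.

```python
import sqlite3

# Made-up rows that mimic four columns of the sales table.
rows = [
    ("Coffee Liqueurs", "Sample Coffee Liqueur", 75.54, 1.58),
    ("Coffee Liqueurs", "Sample Coffee Liqueur", 10.00, 0.0),  # zero volume
    ("Triple Sec", "Sample Triple Sec", 23.04, 1.05),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (category_name TEXT, item_description TEXT, "
    "sale_dollars REAL, volume_sold_gallons REAL)"
)
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)

# Both conditions must hold for a row to be kept.
query = """
SELECT sale_dollars / volume_sold_gallons AS Dollars_Per_Gallon, item_description
FROM sales
WHERE volume_sold_gallons != 0 AND category_name = 'Coffee Liqueurs'
"""
result = conn.execute(query).fetchall()
for dollars_per_gallon, item in result:
    print(f"{item}: {dollars_per_gallon:.2f}")
```

Only the first row survives: the second fails the volume test and the third fails the category test.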


I added `item_description` so that I could see which rows were different.  It is not used in the analysis yet.  Let's include it by getting the average price of coffee liqueurs for each description.  To do this I'll add the `GROUP BY` clause.

In [17]:
%%bigquery --project pic-math
SELECT AVG(sale_dollars/volume_sold_gallons) as Average_Dollars_Per_Gallon, item_description
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE volume_sold_gallons != 0 and category_name = 'Coffee Liqueurs'
GROUP BY item_description

Unnamed: 0,Average_Dollars_Per_Gallon,item_description
0,93.411691,Kahlua Coffee
1,776.615963,Kahlua Coffee Mini
2,92.405096,Kahlua Coffee Liqueur
3,48.811417,Kamora Coffee Liqueur
4,109.767153,Patron Xo Cafe
5,76.032478,Kahlua Coffee Liqueur Mini
6,187.292855,Iowa Coffee Company Liqueur
7,41.034034,Kapali Coffee Liqueur
8,94.736642,Kahlua Salted Caramel
9,38.571044,Chila Coffee Liqueur


Okay, but we pointed out earlier that there is no order...  Let's force an order on it.

In [18]:
%%bigquery --project pic-math
SELECT AVG(sale_dollars/volume_sold_gallons) as Average_Dollars_Per_Gallon, item_description
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE volume_sold_gallons != 0 and category_name = 'Coffee Liqueurs'
GROUP BY item_description
ORDER BY Average_Dollars_Per_Gallon DESC

Unnamed: 0,Average_Dollars_Per_Gallon,item_description
0,1136.210526,Kahlua Chili Chocolate
1,776.615963,Kahlua Coffee Mini
2,248.540562,CCD Coffee Liqueur
3,205.745228,Kahlua French Vanilla Liqueur
4,187.292855,Iowa Coffee Company Liqueur
5,137.784413,Original Secret Family Recipe - A Coffee Liqueur
6,118.170457,J. Rieger & Co. Caffe Amaro
7,117.936185,Tia Maria Coffee Liqueur
8,115.01068,Mozart Chocolate Coffee Cream Liqueur
9,109.767153,Patron Xo Cafe
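
One thing worth checking with `GROUP BY` is that `AVG(sale_dollars/volume_sold_gallons)` averages the per-row ratios, which is not the same as the `SUM(...)/SUM(...)` used on the full table earlier. The sketch below uses `sqlite3` and invented rows (the item names and numbers are made up) to show the two side by side, ordered the same way as the query above.

```python
import sqlite3

# Made-up invoices: (item_description, sale_dollars, volume_sold_gallons)
rows = [
    ("Item A", 100.0, 1.0),   # $100 per gallon
    ("Item A", 100.0, 10.0),  # $10 per gallon
    ("Item B", 60.0, 2.0),    # $30 per gallon
    ("Item B", 90.0, 3.0),    # $30 per gallon
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (item_description TEXT, "
    "sale_dollars REAL, volume_sold_gallons REAL)"
)
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

query = """
SELECT item_description,
       AVG(sale_dollars / volume_sold_gallons) AS avg_of_ratios,
       SUM(sale_dollars) / SUM(volume_sold_gallons) AS ratio_of_sums
FROM sales
WHERE volume_sold_gallons != 0
GROUP BY item_description
ORDER BY avg_of_ratios DESC
"""
result = conn.execute(query).fetchall()
for item, avg_of_ratios, ratio_of_sums in result:
    print(f"{item}: avg of ratios {avg_of_ratios:.2f}, ratio of sums {ratio_of_sums:.2f}")
```

For Item A the two numbers disagree because one small, expensive sale counts as much as one large, cheap sale in the average of ratios. Neither is wrong; they answer different questions.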


Let's keep going down the rabbit hole here.  What if we want to rank them?  There are several ways: `ROW_NUMBER`, `RANK` and `DENSE_RANK`.  I find them difficult to use because they require several other inputs.

The general call is something like

``ROW_NUMBER() OVER(PARTITION BY ________ ORDER BY _________)``
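
To see how the three functions treat ties, here is a small sketch on made-up prices using `sqlite3`. It assumes the SQLite bundled with Python is version 3.25 or newer, since older versions lack window functions. The `prices` table and its rows are invented for illustration.

```python
import sqlite3

# Made-up rows: (category_name, item_description, price)
rows = [
    ("Coffee", "Item A", 90.0),
    ("Coffee", "Item B", 90.0),  # ties with Item A
    ("Coffee", "Item C", 50.0),
    ("Vodka", "Item D", 30.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE prices (category_name TEXT, item_description TEXT, price REAL)"
)
conn.executemany("INSERT INTO prices VALUES (?, ?, ?)", rows)

# item_description breaks ties for ROW_NUMBER so its output is deterministic.
query = """
SELECT category_name, item_description,
       ROW_NUMBER() OVER (PARTITION BY category_name
                          ORDER BY price DESC, item_description) AS row_num,
       RANK()       OVER (PARTITION BY category_name ORDER BY price DESC) AS rk,
       DENSE_RANK() OVER (PARTITION BY category_name ORDER BY price DESC) AS dense_rk
FROM prices
ORDER BY category_name, row_num
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

`ROW_NUMBER` always counts 1, 2, 3 within a partition, `RANK` gives tied rows the same number and then skips ahead, and `DENSE_RANK` gives tied rows the same number without skipping.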

Partition is like grouping.  I'll add another liquor category to use it.

In [37]:
%%bigquery --project pic-math
SELECT 
  AVG(sale_dollars/volume_sold_gallons) as Average_Dollars_Per_Gallon, 
  item_description, 
  ROW_NUMBER() OVER(
    PARTITION BY category_name
    ORDER BY item_description) row_num
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE volume_sold_gallons != 0 and (category_name = 'Coffee Liqueurs' or category_name = 'Imported Vodkas')
GROUP BY item_description, category_name
ORDER BY Average_Dollars_Per_Gallon DESC

Unnamed: 0,Average_Dollars_Per_Gallon,item_description,row_num
0,17136.000000,E.T. 51 Premium Vodka Mini,44
1,16996.723367,E.T.51 Premium Vodka Mini,45
2,9000.000000,Outerspace Vodka Mini,86
3,2935.463736,Grey Goose Essences Watermelon & Basil Mini,54
4,2921.600000,Grey Goose Essences White Peach & Rosemary Mini,56
...,...,...,...
164,30.485213,Fris Danish Vodka,49
165,28.975845,Stolichnaya Russian Vodka 80 Prf,124
166,27.608509,Polar Ice,98
167,23.335834,SOOH Relska 80 Proof Vodka,109


You should notice that `ROW_NUMBER` didn't do what we needed: the numbers follow `item_description` within each category rather than the price.  We cannot number or rank on the column we created, because its alias is not yet available to the rest of the same `SELECT`.  This leads to subqueries, written here with a `WITH` clause.  Let's show one today and come back to it next class.

In [38]:
%%bigquery --project pic-math
WITH t as(
SELECT 
  AVG(sale_dollars/volume_sold_gallons) as Average_Dollars_Per_Gallon, 
  item_description, category_name
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE volume_sold_gallons != 0 and (category_name = 'Coffee Liqueurs' or category_name = 'Imported Vodkas')
GROUP BY item_description, category_name
ORDER BY Average_Dollars_Per_Gallon DESC
)

SELECT *, 
    RANK() OVER(
    PARTITION BY category_name
    ORDER BY item_description) rk_num
FROM t
ORDER BY rk_num

Unnamed: 0,Average_Dollars_Per_Gallon,item_description,category_name,rk_num
0,87.341772,42 Below Pure Vodka,Imported Vodkas,1
1,248.540562,CCD Coffee Liqueur,Coffee Liqueurs,1
2,178.770622,AO Vodka,Imported Vodkas,2
3,38.358325,Caffe Lolita Coffee Liqueur,Coffee Liqueurs,2
4,64.869387,Absolut 1.75L w/Powell & Mahoney Ginger Beer,Imported Vodkas,3
...,...,...,...,...
164,137.650268,Tom of Finland,Imported Vodkas,137
165,92.152534,Van Gogh Double Expresso Vodka,Imported Vodkas,138
166,1051.242604,Van Gogh Double Expresso Vodka Mini,Imported Vodkas,139
167,128.685240,Vektor Vodka,Imported Vodkas,140
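
If the goal is to rank by price within each category rather than alphabetically, the window's `ORDER BY` can point at the averaged column once it exists in the `WITH` table. Below is a sketch of that variant using `sqlite3` on made-up rows (again assuming SQLite 3.25 or newer). It is not one of the cells run above, so treat the BigQuery equivalent as untested here.

```python
import sqlite3

# Made-up rows: (category_name, item_description, sale_dollars, volume_sold_gallons)
rows = [
    ("Coffee Liqueurs", "Item A", 50.0, 1.0),
    ("Coffee Liqueurs", "Item A", 70.0, 1.0),
    ("Coffee Liqueurs", "Item B", 90.0, 1.0),
    ("Imported Vodkas", "Item C", 20.0, 1.0),
    ("Imported Vodkas", "Item D", 40.0, 1.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (category_name TEXT, item_description TEXT, "
    "sale_dollars REAL, volume_sold_gallons REAL)"
)
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)

# The inner query builds the average; the outer query can then rank on it.
query = """
WITH t AS (
  SELECT AVG(sale_dollars / volume_sold_gallons) AS Average_Dollars_Per_Gallon,
         item_description, category_name
  FROM sales
  WHERE volume_sold_gallons != 0
  GROUP BY item_description, category_name
)
SELECT *,
       RANK() OVER (PARTITION BY category_name
                    ORDER BY Average_Dollars_Per_Gallon DESC) AS rk_num
FROM t
ORDER BY category_name, rk_num
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Each category gets its own ranking that starts again at 1, with the most expensive item first.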


We will come back to this, but it is a nice taste of some of the really powerful aspects of SQL!

## Me Playing Around

I wanted to find out why 'vodka' wasn't in the table.  It was, but with other qualifiers.  I needed to use the string operator `LIKE`.  I also looked at `CONTAINS` but didn't get it to work.

In [36]:
%%bigquery --project pic-math
SELECT category_name
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE category_name LIKE '%Vodka%'
GROUP BY category_name


Unnamed: 0,category_name
0,Imported Vodka
1,American Vodka
2,American Vodkas
3,Imported Vodkas
4,American Flavored Vodka
5,Imported Flavored Vodka
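
The same pattern match can be tried locally. This is a minimal sketch with `sqlite3` and a made-up list of category names, where `%` matches any run of characters on either side of `Vodka`. One caveat: SQLite's `LIKE` ignores case for ASCII letters by default, while BigQuery's `LIKE` is case sensitive, so the two can disagree on something like `'%vodka%'`.

```python
import sqlite3

# Made-up category names, with one duplicate.
names = [
    "Imported Vodkas",
    "American Vodka",
    "Coffee Liqueurs",
    "Imported Vodkas",
    "Triple Sec",
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (category_name TEXT)")
conn.executemany("INSERT INTO sales VALUES (?)", [(n,) for n in names])

# DISTINCT plays the same role here as GROUP BY category_name in the cell above.
query = """
SELECT DISTINCT category_name
FROM sales
WHERE category_name LIKE '%Vodka%'
ORDER BY category_name
"""
matches = [row[0] for row in conn.execute(query)]
print(matches)
```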
