<a href="https://colab.research.google.com/github/nurfnick/Data_Viz/blob/main/Content/Data_Collecting/04_SQL_Essentials.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SQL Essentials

In [None]:
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Authenticated


Let's start with the basics.  I'll continue to work with the liquor store data.  

In [None]:
%%bigquery --project pic-math
SELECT *
FROM `bigquery-public-data.iowa_liquor_sales.sales`
LIMIT 5

Unnamed: 0,invoice_and_item_number,date,store_number,store_name,address,city,zip_code,store_location,county_number,county,category,category_name,vendor_number,vendor_name,item_number,item_description,pack,bottle_volume_ml,state_bottle_cost,state_bottle_retail,bottles_sold,sale_dollars,volume_sold_liters,volume_sold_gallons
0,INV-25210600054,2020-02-13,3385,Sam's Club 8162 / Cedar Rapids,2605 Blairs Ferry Rd NE,Cedar Rapids,52402.0,POINT (-91.67969 42.031819),57,LINN,1081500.0,Triple Sec,55,SAZERAC NORTH AMERICA,86390,Montezuma Triple Sec,12,1000,2.13,3.2,48,153.6,48.0,12.68
1,S23115100025,2014-12-22,5004,Ida Grove Food Pride,200 SUSAN LAWRENCE DR,IDA GROVE,51445.0,,47,Ida,1042100.0,IMPORTED DRY GINS,260,Diageo Americas,28867,Tanqueray Gin,12,1000,15.25,22.88,4,91.52,4.0,1.06
2,S18402900152,2014-04-14,4312,I-80 Liquor / Council Bluffs,2411 S 24TH ST #1,COUNCIL BLUFFS,51501.0,POINT (-95.8792 41.238092),78,Pottawattamie,1042100.0,IMPORTED DRY GINS,35,"Bacardi U.S.A., Inc.",28233,Bombay Sapphire Gin,12,1000,17.61,26.41,11,290.51,11.0,2.91
3,S14219300007,2013-08-29,4844,Iowa City Fast Break,"2580, NAPLES AVE",IOWA CITY,52240.0,POINT (-91.571064 41.632792),52,Johnson,1011500.0,STRAIGHT RYE WHISKIES,255,Wilson Daniels Ltd.,27102,Templeton Rye,6,750,18.08,27.13,18,488.34,13.5,3.57
4,S24682700083,2015-03-24,2604,Hy-Vee Wine and Spirits / Lemars,1201 12TH AVE SW,LEMARS,51031.0,POINT (-96.18335000000002 42.778257),75,Plymouth,1081365.0,TROPICAL FRUIT SCHNAPPS,421,"Sazerac Co., Inc.",83907,Maui Blue Hawaiian Schnapps,12,1000,4.54,6.81,4,27.24,4.0,1.06


Why did I run that command?  It gives me an idea of what is in the table, so I know which columns I can reference and can think about what questions I might ask!  Let's see how many gallons of liquor have been sold.

In [None]:
%%bigquery --project pic-math
SELECT SUM(volume_sold_gallons) as Total_Gallons_of_Liquor
FROM `bigquery-public-data.iowa_liquor_sales.sales`

Unnamed: 0,Total_Gallons_of_Liquor
0,55607630.0


Let's do a small conversion just to see what that means.  An Olympic pool holds about 660,000 gallons, so

In [None]:
(5.560763*10**7)/660000

84.25398484848485

About 84 swimming pools of liquor in Iowa!  Fun times...

Let's make it more complicated.  Let's see what the dollars per gallon is on the full dataset.

In [None]:
%%bigquery --project pic-math
SELECT SUM(sale_dollars)/SUM(volume_sold_gallons) as Total_Dollars_Per_Gallon
FROM `bigquery-public-data.iowa_liquor_sales.sales`

Unnamed: 0,Total_Dollars_Per_Gallon
0,57.373869


Okay, not terribly interesting.  Let's take that question and add to it.  Let's create a column that is dollars per gallon of liquor for each sale.

In [None]:
%%bigquery --project pic-math
SELECT sale_dollars/volume_sold_gallons as Dollars_Per_Gallon
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE volume_sold_gallons != 0
LIMIT 5

Unnamed: 0,Dollars_Per_Gallon
0,13.741325
1,94.537815
2,28.391167
3,48.692516
4,136.840336


I have added a few new things here.  The `WHERE` clause allows me to restrict which rows I consider; here it drops sales with zero volume so that I never divide by zero.  You can combine several conditions logically with `and` and `or`.

In [None]:
%%bigquery --project pic-math
SELECT sale_dollars/volume_sold_gallons as Dollars_Per_Gallon, item_description
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE volume_sold_gallons != 0 and category_name = 'Coffee Liqueurs'

Unnamed: 0,Dollars_Per_Gallon,item_description
0,136.363636,Iowa Coffee Company Liqueur
1,136.134454,Iowa Coffee Company Liqueur
2,136.134454,Iowa Coffee Company Liqueur
3,135.000000,Iowa Coffee Company Liqueur
4,118.421053,Mozart Chocolate Coffee Cream Liqueur
...,...,...
78766,91.474576,Kahlua French Vanilla Liqueur DISCO
78767,91.474576,Kahlua French Vanilla Liqueur DISCO
78768,91.474576,Kahlua French Vanilla Liqueur DISCO
78769,91.474576,Kahlua French Vanilla Liqueur DISCO
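
If you want to experiment with a combined `WHERE` filter without touching BigQuery, here is a minimal sketch using Python's built-in `sqlite3` module. The table and its four rows are made up for illustration (they are not real Iowa data), and SQLite is only a stand-in, so small dialect differences from BigQuery are possible.

```python
import sqlite3

# Tiny in-memory stand-in for the sales table (made-up rows, not real data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (item_description TEXT, category_name TEXT, "
    "sale_dollars REAL, volume_sold_gallons REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("Coffee A", "Coffee Liqueurs", 100.0, 1.0),
        ("Coffee B", "Coffee Liqueurs", 50.0, 0.0),   # zero volume: filtered out
        ("Gin C", "Imported Dry Gins", 90.0, 1.0),    # wrong category: filtered out
        ("Coffee D", "Coffee Liqueurs", 30.0, 0.5),
    ],
)

# A row is kept only when BOTH conditions are true.
rows = conn.execute(
    """
    SELECT item_description,
           sale_dollars / volume_sold_gallons AS dollars_per_gallon
    FROM sales
    WHERE volume_sold_gallons != 0 AND category_name = 'Coffee Liqueurs'
    ORDER BY item_description
    """
).fetchall()

for item, dollars_per_gallon in rows:
    print(item, dollars_per_gallon)
```

Only the two coffee liqueur rows with nonzero volume survive the filter. Swapping `AND` for `OR` would keep every row that satisfies either condition.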


I added the `item_description` so that I could see which ones were different.  It is not used in the analysis yet.  Let's include it by getting the average price of coffee liqueurs based on the description.  To do this I'll add the `GROUP BY` command.

In [None]:
%%bigquery --project pic-math
SELECT AVG(sale_dollars/volume_sold_gallons) as Average_Dollars_Per_Gallon, item_description
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE volume_sold_gallons != 0 and category_name = 'Coffee Liqueurs'
GROUP BY item_description

Unnamed: 0,Average_Dollars_Per_Gallon,item_description
0,93.411691,Kahlua Coffee
1,776.615963,Kahlua Coffee Mini
2,92.405096,Kahlua Coffee Liqueur
3,48.811417,Kamora Coffee Liqueur
4,109.767153,Patron Xo Cafe
5,76.032478,Kahlua Coffee Liqueur Mini
6,187.292855,Iowa Coffee Company Liqueur
7,41.034034,Kapali Coffee Liqueur
8,94.736642,Kahlua Salted Caramel
9,38.571044,Chila Coffee Liqueur
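
One thing to keep in mind about these numbers: `AVG(sale_dollars/volume_sold_gallons)` averages the per-row ratios, so every invoice line counts equally no matter how little was sold. That is a different quantity from the `SUM(sale_dollars)/SUM(volume_sold_gallons)` used earlier. The sketch below shows the gap on two made-up rows, using Python's built-in `sqlite3` as a local stand-in for BigQuery; the table contents are assumptions chosen only to make the arithmetic easy to check.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_dollars REAL, volume_sold_gallons REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        (1000.0, 20.0),   # a big sale at 50 dollars per gallon
        (100.0, 0.125),   # a tiny "mini" sale at 800 dollars per gallon
    ],
)

avg_of_ratios, ratio_of_totals = conn.execute(
    """
    SELECT AVG(sale_dollars / volume_sold_gallons),
           SUM(sale_dollars) / SUM(volume_sold_gallons)
    FROM sales
    WHERE volume_sold_gallons != 0
    """
).fetchone()

print("average of per-row ratios:", avg_of_ratios)
print("total dollars / total gallons:", ratio_of_totals)
```

The tiny sale pulls the average of ratios up to 425, while the ratio of totals stays near 55 because the big sale dominates the volume. Which one you want depends on the question being asked.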


Okay, but we pointed out earlier that there is no order...  Let's force an order on it.

In [None]:
%%bigquery --project pic-math
SELECT AVG(sale_dollars/volume_sold_gallons) as Average_Dollars_Per_Gallon, item_description
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE volume_sold_gallons != 0 and category_name = 'Coffee Liqueurs'
GROUP BY item_description
ORDER BY Average_Dollars_Per_Gallon DESC

Unnamed: 0,Average_Dollars_Per_Gallon,item_description
0,1136.210526,Kahlua Chili Chocolate
1,776.615963,Kahlua Coffee Mini
2,248.540562,CCD Coffee Liqueur
3,205.745228,Kahlua French Vanilla Liqueur
4,187.292855,Iowa Coffee Company Liqueur
5,137.784413,Original Secret Family Recipe - A Coffee Liqueur
6,118.170457,J. Rieger & Co. Caffe Amaro
7,117.936185,Tia Maria Coffee Liqueur
8,115.01068,Mozart Chocolate Coffee Cream Liqueur
9,109.767153,Patron Xo Cafe


Let's keep going down the rabbit hole here.  What if we want to rank them?  There are lots of ways: `ROW_NUMBER`, `RANK` and `DENSE_RANK`.  I find them difficult to use because they require lots of other inputs.

The general call is something like

``ROW_NUMBER() OVER(PARTITION BY ________ ORDER BY _________)``
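
To see how the three functions differ, here is a minimal sketch on a made-up table with a tie in it. It uses Python's built-in `sqlite3` as a local stand-in for BigQuery, and assumes the bundled SQLite is version 3.25 or newer (needed for window functions). The table name `prices` and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (category TEXT, item TEXT, price REAL)")
conn.executemany(
    "INSERT INTO prices VALUES (?, ?, ?)",
    [
        ("coffee", "a", 10.0),
        ("coffee", "b", 20.0),
        ("coffee", "c", 20.0),   # ties with item b
        ("coffee", "d", 30.0),
        ("vodka", "e", 5.0),
        ("vodka", "f", 7.0),
    ],
)

rows = conn.execute(
    """
    SELECT category, item,
           -- item is added as a tie-breaker so the numbering is repeatable
           ROW_NUMBER() OVER (PARTITION BY category ORDER BY price, item) AS row_num,
           RANK()       OVER (PARTITION BY category ORDER BY price)       AS rk,
           DENSE_RANK() OVER (PARTITION BY category ORDER BY price)       AS dense_rk
    FROM prices
    ORDER BY category, item
    """
).fetchall()

for row in rows:
    print(row)
```

On the tied items `b` and `c`, `ROW_NUMBER` still hands out 2 and 3, while `RANK` and `DENSE_RANK` both give 2. For item `d`, `RANK` skips to 4 and `DENSE_RANK` continues with 3. All three restart at 1 in the `vodka` rows because of `PARTITION BY category`.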

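Partition is like grouping: the numbering restarts within each partition.  With only one category in the query below there is just a single partition, so I'll add another liquor category afterwards to really use it.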

In [None]:
%%bigquery --project pic-math
SELECT 
  AVG(sale_dollars/volume_sold_gallons) as Average_Dollars_Per_Gallon, 
  item_description, 
  ROW_NUMBER() OVER(
    PARTITION BY category_name
    ORDER BY item_description) row_num
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE volume_sold_gallons != 0 and (category_name = 'Coffee Liqueurs')
GROUP BY item_description, category_name
ORDER BY Average_Dollars_Per_Gallon DESC

Unnamed: 0,Average_Dollars_Per_Gallon,item_description,row_num
0,1136.210526,Kahlua Chili Chocolate,11
1,776.615963,Kahlua Coffee Mini,15
2,248.540562,CCD Coffee Liqueur,1
3,205.745228,Kahlua French Vanilla Liqueur,17
4,187.292855,Iowa Coffee Company Liqueur,9
5,137.784413,Original Secret Family Recipe - A Coffee Liqueur,25
6,118.170457,J. Rieger & Co. Caffe Amaro,10
7,117.936185,Tia Maria Coffee Liqueur,28
8,115.01068,Mozart Chocolate Coffee Cream Liqueur,24
9,109.767153,Patron Xo Cafe,27


You should notice that the `ROW_NUMBER` didn't do what we needed: it numbered the items alphabetically by `item_description` rather than by price.  You will not be able to order the row number or rank by the column we created, because its alias is not yet available at that point in the SQL call.  This leads to subqueries.  Let's show one today and come back to it next class.

In [None]:
%%bigquery --project pic-math
WITH t as(
SELECT 
  AVG(sale_dollars/volume_sold_gallons) as Average_Dollars_Per_Gallon, 
  item_description, category_name
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE volume_sold_gallons != 0 and (category_name = 'Coffee Liqueurs' or category_name = 'Imported Vodkas')
GROUP BY item_description, category_name
ORDER BY Average_Dollars_Per_Gallon DESC
)

SELECT *, 
    RANK() OVER(
    PARTITION BY category_name
    ORDER BY Average_Dollars_Per_Gallon) rk_num
FROM t
ORDER BY rk_num

Unnamed: 0,Average_Dollars_Per_Gallon,item_description,category_name,rk_num
0,16.996845,Conciere Coffee Liqueur,Coffee Liqueurs,1
1,23.335834,SOOH Relska 80 Proof Vodka,Imported Vodkas,1
2,38.358325,Caffe Lolita Coffee Liqueur,Coffee Liqueurs,2
3,27.608509,Polar Ice,Imported Vodkas,2
4,38.571044,Chila Coffee Liqueur,Coffee Liqueurs,3
...,...,...,...,...
164,2921.600000,Grey Goose Essences Strawberry & Lemongrass Mini,Imported Vodkas,137
165,2935.463736,Grey Goose Essences Watermelon & Basil Mini,Imported Vodkas,138
166,9000.000000,Outerspace Vodka Mini,Imported Vodkas,139
167,16996.723367,E.T.51 Premium Vodka Mini,Imported Vodkas,140
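
The same two-step pattern (aggregate inside `WITH`, then rank in the outer query) can be tried locally. This is a minimal sketch with made-up rows, again using Python's built-in `sqlite3` as a stand-in for BigQuery and assuming SQLite 3.25 or newer for window functions. The alias `avg_dpg` is a shortened name chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (category_name TEXT, item_description TEXT, "
    "sale_dollars REAL, volume_sold_gallons REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("coffee", "X", 10.0, 1.0),
        ("coffee", "X", 30.0, 1.0),   # X averages 20 dollars per gallon
        ("coffee", "Y", 50.0, 1.0),
        ("vodka", "Z", 8.0, 1.0),
        ("vodka", "W", 4.0, 1.0),
        ("vodka", "W", 6.0, 1.0),     # W averages 5 dollars per gallon
    ],
)

rows = conn.execute(
    """
    WITH t AS (
        -- Step 1: one row per item with its average price
        SELECT category_name, item_description,
               AVG(sale_dollars / volume_sold_gallons) AS avg_dpg
        FROM sales
        WHERE volume_sold_gallons != 0
        GROUP BY category_name, item_description
    )
    -- Step 2: avg_dpg is now an ordinary column, so the window can order by it
    SELECT category_name, item_description, avg_dpg,
           RANK() OVER (PARTITION BY category_name ORDER BY avg_dpg) AS rk_num
    FROM t
    ORDER BY category_name, rk_num
    """
).fetchall()

for row in rows:
    print(row)
```

Sorting the final result by `category_name` first keeps each category's ranking together, instead of interleaving the rank 1 rows of every category as in the output above.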


We will come back to this, but it is a nice taste of some of the really powerful aspects of SQL!

## Assignment



Assignment for today:

1. Start a notebook and get BigQuery to work. Feel free to use the authentication cell at the top of this notebook.
2. Navigate to the dataset 'austin_bikeshare.bikeshare_trips'.
3. Compute the average time for a trip based on starting point.
4. Compute how many trips start at each starting point.




## Me Playing Around

I wanted to find why 'vodka' wasn't in the table.  Well, it was, but with other qualifiers.  I needed to use a string command, `LIKE`.  I also looked at `CONTAINS` but didn't get it to work.

In [None]:
%%bigquery --project pic-math
SELECT category_name
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE category_name LIKE '%Vodka%'
GROUP BY category_name


Unnamed: 0,category_name
0,Imported Vodka
1,American Vodka
2,American Vodkas
3,Imported Vodkas
4,American Flavored Vodka
5,Imported Flavored Vodka
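
One caveat: in BigQuery `LIKE` is case-sensitive, so `'%Vodka%'` would miss a category spelled in upper case (some gin categories in the first preview are written that way). A common workaround is to lower-case the column before matching. The sketch below tries this on made-up category names using Python's built-in `sqlite3`. Note that SQLite's own `LIKE` ignores ASCII case by default, so the `LOWER(...)` call matters for BigQuery rather than for this local stand-in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (category_name TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?)",
    [
        ("Imported Vodka",),
        ("AMERICAN VODKAS",),            # upper case (hypothetical spelling)
        ("imported flavored vodka",),    # lower case (hypothetical spelling)
        ("Coffee Liqueurs",),
        ("Imported Vodka",),             # duplicate, removed by DISTINCT
    ],
)

# Lower-case both sides of the comparison so the match ignores case.
rows = conn.execute(
    """
    SELECT DISTINCT category_name
    FROM sales
    WHERE LOWER(category_name) LIKE '%vodka%'
    """
).fetchall()

matches = {name for (name,) in rows}
print(sorted(matches))
```

`SELECT DISTINCT` plays the same role here as the `GROUP BY category_name` above: each matching category is listed once.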
