# Kaggle Intro to SQL (and BigQuery)
- https://www.kaggle.com/learn/intro-to-sql

## 4. Exercise: Order By
- plus EXTRACT for DATE or DATETIME fields.
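
A minimal sketch of what an EXTRACT query can look like. The table and column names (`nhtsa_traffic_fatalities.accident_2015`, `consecutive_number`, `timestamp_of_crash`) are assumptions taken from the course tutorial, not from this notebook. Only the query text is built here; running it needs a BigQuery client like the one created further down.

```python
# Query text only; nothing is sent to BigQuery in this sketch.
# Assumed table and columns (from the course tutorial, not from this notebook).
extract_query = """
    SELECT COUNT(consecutive_number) AS num_accidents,
           EXTRACT(DAYOFWEEK FROM timestamp_of_crash) AS day_of_week
    FROM `bigquery-public-data.nhtsa_traffic_fatalities.accident_2015`
    GROUP BY day_of_week
    ORDER BY num_accidents DESC
"""
```

In BigQuery, `DAYOFWEEK` returns a number from 1 (Sunday) to 7 (Saturday); other parts such as `YEAR`, `MONTH` or `DAY` follow the same `EXTRACT(part FROM column)` shape.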

### Introduction
- To get to know a new dataset, you can run a couple of SELECT queries.
- The World Bank has made tons of interesting education data available through BigQuery. Run the following cell to see the first few rows of the `international_education` table from the `world_bank_intl_education` dataset.

In [18]:
### Fetch the 'international_education' table from the 'world_bank_intl_education' dataset.
from google.cloud import bigquery

# Create a "Client" object
client = bigquery.Client('jmproject86385')

# Construct a reference to the "world_bank_intl_education" dataset
dataset_ref = client.dataset("world_bank_intl_education", project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

# Construct a reference to the "international_education" table
table_ref = dataset_ref.table("international_education")

# API request - fetch the table
table = client.get_table(table_ref)

# Preview the first five lines of the "international_education" table
client.list_rows(table, max_results=5).to_dataframe()


Unnamed: 0,country_name,country_code,indicator_name,indicator_code,value,year
0,Chad,TCD,"Enrolment in lower secondary education, both s...",UIS.E.2,321921.0,2012
1,Chad,TCD,"Enrolment in upper secondary education, both s...",UIS.E.3,68809.0,2006
2,Chad,TCD,"Enrolment in upper secondary education, both s...",UIS.E.3,30551.0,1999
3,Chad,TCD,"Enrolment in upper secondary education, both s...",UIS.E.3,79784.0,2007
4,Chad,TCD,"Repeaters in primary education, all grades, bo...",UIS.R.1,282699.0,2006


### Ex. 1) Government expenditure on education

- The value in the `indicator_code` column describes what type of data is shown in a given row.
- One interesting indicator code is `SE.XPD.TOTL.GD.ZS`, which corresponds to "Government expenditure on education as % of GDP (%)".
- Which countries spend the largest fraction of GDP on education?
- To answer this question, consider only the rows in the dataset corresponding to indicator code `SE.XPD.TOTL.GD.ZS`, and write a query that returns the average value in the `value` column for each country in the dataset between the years 2010-2017 (including 2010 and 2017 in the average).

Requirements:
- Your results should have the country name rather than the country code. You will have one row for each country.
- The aggregate function for average is **AVG()**.  Use the name `avg_ed_spending_pct` for the column created by this aggregation.
- Order the results so the countries that spend the largest fraction of GDP on education show up first.
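
The requirements above can be tried out locally before spending BigQuery quota. This is only a sketch on a toy table in an in-memory SQLite database: the country names and values are made up, and SQLite stands in for BigQuery, so only the query shape (WHERE, GROUP BY, AVG, ORDER BY) carries over.

```python
import sqlite3

# Toy stand-in for the BigQuery table; all rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE international_education "
    "(country_name TEXT, indicator_code TEXT, value REAL, year INTEGER)"
)
toy_rows = [
    ("Country A", "SE.XPD.TOTL.GD.ZS", 4.0, 2009),   # outside 2010-2017, ignored
    ("Country A", "SE.XPD.TOTL.GD.ZS", 6.0, 2010),
    ("Country A", "SE.XPD.TOTL.GD.ZS", 8.0, 2017),
    ("Country B", "SE.XPD.TOTL.GD.ZS", 3.0, 2012),
    ("Country B", "SE.XPD.TOTL.GD.ZS", 5.0, 2014),
    ("Country B", "SP.POP.TOTL", 1000.0, 2014),      # other indicator, ignored
]
conn.executemany("INSERT INTO international_education VALUES (?, ?, ?, ?)", toy_rows)

toy_query = """
    SELECT country_name, AVG(value) AS avg_ed_spending_pct
    FROM international_education
    WHERE indicator_code = 'SE.XPD.TOTL.GD.ZS' AND year BETWEEN 2010 AND 2017
    GROUP BY country_name
    ORDER BY avg_ed_spending_pct DESC
"""
toy_result = conn.execute(toy_query).fetchall()
print(toy_result)  # [('Country A', 7.0), ('Country B', 4.0)]
```

`BETWEEN 2010 AND 2017` is inclusive on both ends, so it matches the `year >= 2010 and year <= 2017` condition used in the BigQuery cell.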

In [29]:
### average the indicator value per country for 2010-2017, largest average first
q3 = '''
    SELECT country_name, AVG(value) AS avg_ed_spending_pct
    FROM `bigquery-public-data.world_bank_intl_education.international_education`
    WHERE indicator_code = 'SE.XPD.TOTL.GD.ZS' and year >= 2010 and year <= 2017
    GROUP BY country_name
    ORDER BY avg_ed_spending_pct DESC '''

safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(q3, job_config=safe_config)
q3_df = query_job.to_dataframe()

In [31]:
#q3_df.iloc[[0, 5, 9, -9, -5, -1]]
q3_df.head(9)

Unnamed: 0,country_name,avg_ed_spending_pct
0,Cuba,12.83727
1,"Micronesia, Fed. Sts.",12.46775
2,Solomon Islands,10.00108
3,Moldova,8.372153
4,Namibia,8.34961
5,Denmark,8.2743
6,Timor-Leste,7.975114
7,Iceland,7.48021
8,Sweden,7.233168


In [26]:
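### earlier run of the same cell; the values differ from In [31], presumably because the query did not yet filter on year
q3_df.head(9)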

Unnamed: 0,country_name,avg_ed_spending_pct
0,Tuvalu,1865421.0
1,Marshall Islands,11.8163
2,Zimbabwe,10.51361
3,Cuba,9.324694
4,Lesotho,8.727555
5,"Micronesia, Fed. Sts.",8.528567
6,Timor-Leste,8.474134
7,Djibouti,8.116339
8,"Yemen, Rep.",8.01846


In [19]:
### query only rows where indicator code is SE.XPD.TOTL.GD.ZS
q1 = '''
    SELECT *
    FROM `bigquery-public-data.world_bank_intl_education.international_education`
    WHERE indicator_code = 'SE.XPD.TOTL.GD.ZS' '''

safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(q1, job_config=safe_config)
q1_df = query_job.to_dataframe()
#q1_df.head()
q1_df.iloc[[0, 5, 9, -9, -5, -1]]

Unnamed: 0,country_name,country_code,indicator_name,indicator_code,value,year
0,Afghanistan,AFG,Government expenditure on education as % of GD...,SE.XPD.TOTL.GD.ZS,3.46149,2010
5,Angola,AGO,Government expenditure on education as % of GD...,SE.XPD.TOTL.GD.ZS,2.77713,2005
9,Albania,ALB,Government expenditure on education as % of GD...,SE.XPD.TOTL.GD.ZS,3.37669,1999
3529,Sao Tome and Principe,STP,Government expenditure on education as % of GD...,SE.XPD.TOTL.GD.ZS,3.52904,2004
3533,Central African Republic,CAF,Government expenditure on education as % of GD...,SE.XPD.TOTL.GD.ZS,1.69106,1999
3537,St. Vincent and the Grenadines,VCT,Government expenditure on education as % of GD...,SE.XPD.TOTL.GD.ZS,6.35081,2005


In [20]:
### group the q1 rows by country_name and sum the values
q2 = '''
    SELECT country_name, SUM(value) AS sum
    FROM `bigquery-public-data.world_bank_intl_education.international_education`
    WHERE indicator_code = 'SE.XPD.TOTL.GD.ZS'
    GROUP BY country_name
    ORDER BY sum DESC '''

safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(q2, job_config=safe_config)
q2_df = query_job.to_dataframe()
#q2_df.iloc[[0, 5, 9, -9, -5, -1]]

In [22]:
#q2_df.head(8)
q2_df.iloc[[0, 5, 9, -9, -5, -1]]

Unnamed: 0,country_name,sum
0,Tuvalu,3730841.0
5,Netherlands,230.301
9,Ireland,203.0445
187,Grenada,3.92674
191,Somalia,2.46197
195,Haiti,1.07399
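
One likely reason the exercise asks for AVG rather than SUM: a sum grows with the number of years a country reported, so it mixes spending level with data coverage. A small sketch on made-up rows (in-memory SQLite, hypothetical countries and values) shows the two rankings disagreeing.

```python
import sqlite3

# Hypothetical data: Country A reports four years at 3%, Country B one year at 6%.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spending (country_name TEXT, year INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO spending VALUES (?, ?, ?)",
    [
        ("Country A", 2010, 3.0),
        ("Country A", 2011, 3.0),
        ("Country A", 2012, 3.0),
        ("Country A", 2013, 3.0),
        ("Country B", 2010, 6.0),
    ],
)

# Ranking by SUM rewards the country with more reported years.
by_sum = [row[0] for row in conn.execute(
    "SELECT country_name FROM spending GROUP BY country_name ORDER BY SUM(value) DESC"
)]
# Ranking by AVG compares the typical yearly share instead.
by_avg = [row[0] for row in conn.execute(
    "SELECT country_name FROM spending GROUP BY country_name ORDER BY AVG(value) DESC"
)]
print(by_sum)  # ['Country A', 'Country B']
print(by_avg)  # ['Country B', 'Country A']
```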


### Ex. 2) Identify interesting codes to explore

- The last question started by telling you to focus on rows with the code `SE.XPD.TOTL.GD.ZS`. But how would you find more interesting indicator codes to explore?
- There are 1000s of codes in the dataset, so it would be time consuming to review them all. But many codes are available for only a few countries. When browsing the options for different codes, you might restrict yourself to codes that are reported by many countries.


Write a query below that selects the indicator code and indicator name for all codes with at least 175 rows in the year 2016.
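
A sketch of the idea on a toy table, assuming an in-memory SQLite database with made-up codes and a threshold of 3 rows instead of 175. WHERE filters individual rows before grouping, while HAVING filters the groups after counting.

```python
import sqlite3

# Toy stand-in for the BigQuery table; codes, names and countries are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE international_education "
    "(indicator_code TEXT, indicator_name TEXT, country_name TEXT, year INTEGER)"
)
conn.executemany(
    "INSERT INTO international_education VALUES (?, ?, ?, ?)",
    [
        ("CODE.A", "Indicator A", "Country 1", 2016),
        ("CODE.A", "Indicator A", "Country 2", 2016),
        ("CODE.A", "Indicator A", "Country 3", 2016),
        ("CODE.B", "Indicator B", "Country 1", 2016),
        ("CODE.B", "Indicator B", "Country 2", 2015),  # wrong year, dropped by WHERE
        ("CODE.B", "Indicator B", "Country 3", 2015),  # wrong year, dropped by WHERE
    ],
)

having_query = """
    SELECT indicator_code, indicator_name, COUNT(1) AS num_rows
    FROM international_education
    WHERE year = 2016
    GROUP BY indicator_name, indicator_code
    HAVING COUNT(1) >= 3
    ORDER BY num_rows DESC
"""
having_result = conn.execute(having_query).fetchall()
print(having_result)  # [('CODE.A', 'Indicator A', 3)]
```

`CODE.B` has three rows overall but only one in 2016, so it is excluded: the year filter runs before the count is taken.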

In [42]:
### ok begin with simple query
qa = '''
    SELECT indicator_name, indicator_code, COUNT(1) AS num_ind_code
    FROM `bigquery-public-data.world_bank_intl_education.international_education`
    GROUP BY indicator_name, indicator_code '''

safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(qa, job_config=safe_config)
qa_df = query_job.to_dataframe()  

In [43]:
qa_df.head(9)

Unnamed: 0,indicator_name,indicator_code,num_ind_code
0,"Enrolment in lower secondary education, both s...",UIS.E.2,3077
1,"Enrolment in upper secondary education, both s...",UIS.E.3,2905
2,Enrolment in post-secondary non-tertiary educa...,UIS.E.4,1643
3,"Enrolment in tertiary education, ISCED 6 progr...",UIS.E.6,297
4,"Repeaters in primary education, all grades, bo...",UIS.R.1,5121
5,"Enrolment in early childhood education, female...",UIS.E.0.F,1776
6,"Enrolment in early childhood education, both s...",UIS.E.0.T,1847
7,"Enrolment in lower secondary education, female...",UIS.E.2.F,3048
8,"Enrolment in lower secondary vocational, both ...",UIS.E.2.V,1232


In [39]:
### other query
qb = '''
    SELECT COUNT(indicator_code) AS num_ind_code
    FROM `bigquery-public-data.world_bank_intl_education.international_education`
    GROUP BY indicator_code '''

safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(qb, job_config=safe_config)
qb_df = query_job.to_dataframe()  

In [40]:
qb_df.head(9)

Unnamed: 0,num_ind_code
0,3077
1,2905
2,1643
3,297
4,5121
5,1776
6,1847
7,3048
8,1232


In [44]:
qc = """
    SELECT indicator_code, indicator_name, COUNT(1) AS num_rows
    FROM `bigquery-public-data.world_bank_intl_education.international_education`
    WHERE year = 2016
    GROUP BY indicator_name, indicator_code
    HAVING COUNT(1) >= 175
    ORDER BY COUNT(1) DESC """

safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(qc, job_config=safe_config)
qc_df = query_job.to_dataframe()  

In [45]:
qc_df.head(9)

Unnamed: 0,indicator_code,indicator_name,num_rows
0,SP.POP.GROW,Population growth (annual %),232
1,SP.POP.TOTL,"Population, total",232
2,IT.NET.USER.P2,Internet users (per 100 people),223
3,SH.DYN.MORT,"Mortality rate, under-5 (per 1,000)",213
4,SP.POP.0014.TO,"Population, ages 0-14, total",213
5,SP.POP.0014.TO.ZS,"Population, ages 0-14 (% of total)",213
6,SP.POP.1564.TO.ZS,"Population, ages 15-64 (% of total)",213
7,SP.POP.TOTL.MA.ZS,"Population, male (% of total)",213
8,SP.POP.1564.TO,"Population, ages 15-64, total",213
