# ORDER BY
ORDER BY is usually the last clause in your query, and it sorts the results returned by the rest of your query.
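
As a minimal sketch of how ORDER BY behaves, the snippet below uses Python's built-in `sqlite3` module as a stand-in for BigQuery, so it runs without credentials. The `pets` table and its rows are hypothetical and exist only for illustration.

```python
import sqlite3

# In-memory SQLite database with a small, made-up table (hypothetical data)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (id INTEGER, name TEXT, animal TEXT)")
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?)",
    [(3, "Ripley", "Cat"), (1, "Dr. Harris Bonkers", "Rabbit"), (2, "Moon", "Dog")],
)

# ORDER BY sorts in ascending order by default
ascending = [row[0] for row in conn.execute("SELECT id FROM pets ORDER BY id")]

# Adding DESC reverses the order
descending = [row[0] for row in conn.execute("SELECT id FROM pets ORDER BY id DESC")]

print(ascending)   # [1, 2, 3]
print(descending)  # [3, 2, 1]
conn.close()
```

The same `ORDER BY column` and `ORDER BY column DESC` forms appear in the BigQuery queries later in this notebook.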

## Dates
Dates come up very frequently in real-world databases. There are two ways that dates can be stored in BigQuery: as a DATE or as a DATETIME.

The DATE format has the year first, then the month, and then the day. It looks like this:

YYYY-[M]M-[D]D

- YYYY: Four-digit year
- [M]M: One or two digit month
- [D]D: One or two digit day

The DATETIME format is like the DATE format, but with the time added at the end.
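
To make the two formats concrete, here is a small sketch that parses them with Python's standard `datetime` module. This only illustrates the shape of the strings; it is not how BigQuery itself parses values, and the date `2019-4-8` is a made-up example.

```python
from datetime import datetime

# When parsing, "%Y-%m-%d" accepts one- or two-digit months and days
short = datetime.strptime("2019-4-8", "%Y-%m-%d").date()
padded = datetime.strptime("2019-04-08", "%Y-%m-%d").date()
print(short == padded)  # True

# A DATETIME-style value: the date followed by the time
stamp = datetime.strptime("2015-03-27 23:28:00", "%Y-%m-%d %H:%M:%S")
print(stamp.year, stamp.month, stamp.day, stamp.hour)  # 2015 3 27 23
```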



## EXTRACT
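Often you'll want to look at part of a date, like the year or the day. You can do this with EXTRACT: `EXTRACT(part FROM date_expression)` returns the requested part, such as YEAR, MONTH, DAY, or DAYOFWEEK.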

### Example: Which day of the week has the most fatal motor accidents?
Let's use the US Traffic Fatality Records database, which contains information on traffic accidents in the US where at least one person died.

In [3]:
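from google.cloud import bigquery

# Create a client; replace the project ID with your own
client = bigquery.Client(project="sqlbigquery7711")

# Construct a reference to the "nhtsa_traffic_fatalities" dataset
dataset_ref = client.dataset("nhtsa_traffic_fatalities", project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

# List the tables in the dataset
tables = list(client.list_tables(dataset))

for table in tables:
    print(table.table_id)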

 accident_2015
 accident_2016
 accident_2017
 accident_2018
 accident_2019
 accident_2020
 cevent_2015
 cevent_2016
 cevent_2017
 cevent_2018
 cevent_2019
 cevent_2020
 damage_2015
 damage_2016
 damage_2017
 damage_2018
 damage_2019
 damage_2020
 distract_2015
 distract_2016
 distract_2017
 distract_2018
 distract_2019
 distract_2020
 drimpair_2015
 drimpair_2016
 drimpair_2017
 drimpair_2018
 drimpair_2019
 drimpair_2020
 factor_2015
 factor_2016
 factor_2017
 factor_2018
 factor_2019
 factor_2020
 maneuver_2015
 maneuver_2016
 maneuver_2017
 maneuver_2018
 maneuver_2019
 maneuver_2020
 nmcrash_2015
 nmcrash_2016
 nmcrash_2017
 nmcrash_2018
 nmcrash_2019
 nmcrash_2020
 nmimpair_2015
 nmimpair_2016
 nmimpair_2017
 nmimpair_2018
 nmimpair_2019
 nmimpair_2020
 nmprior_2015
 nmprior_2016
 nmprior_2017
 nmprior_2018
 nmprior_2019
 nmprior_2020
 parkwork_2015
 parkwork_2016
 parkwork_2017
 parkwork_2018
 parkwork_2019
 parkwork_2020
 pbtype_2015
 pbtype_2016
 pbtype_2017
 pbtype_2018
 pbtyp

In [5]:
# Construct a reference to the "accident_2015" table
table_ref = dataset_ref.table("accident_2015")

# API request - fetch the table
table = client.get_table(table_ref) # tables[0]

client.list_rows(table, max_results=5).to_dataframe()


| | state_number | state_name | consecutive_number | number_of_vehicle_forms_submitted_all | number_of_motor_vehicles_in_transport_mvit | number_of_parked_working_vehicles | number_of_forms_submitted_for_persons_not_in_motor_vehicles | number_of_forms_submitted_for_persons_in_motor_vehicles | number_of_persons_in_motor_vehicles_in_transport_mvit | number_of_persons_not_in_motor_vehicles_in_transport_mvit | ... | minute_of_ems_arrival_at_hospital_name | related_factors_crash_level_1 | related_factors_crash_level_1_name | related_factors_crash_level_2 | related_factors_crash_level_2_name | related_factors_crash_level_3 | related_factors_crash_level_3_name | number_of_fatalities | number_of_drunk_drivers | timestamp_of_crash |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Alabama | 10115 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | ... | Unknown EMS Hospital Arrival Time | 0 | | 0 | | 0 | | 1 | 0 | 2015-03-27 23:28:00+00:00 |
| 1 | 1 | Alabama | 10191 | 1 | 1 | 0 | 0 | 2 | 2 | 0 | ... | Unknown EMS Hospital Arrival Time | 0 | | 0 | | 0 | | 1 | 1 | 2015-04-25 04:20:00+00:00 |
| 2 | 1 | Alabama | 10192 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | ... | Not Applicable (Not Transported) | 0 | | 0 | | 0 | | 1 | 0 | 2015-04-28 08:11:00+00:00 |
| 3 | 1 | Alabama | 10204 | 1 | 1 | 0 | 0 | 3 | 3 | 0 | ... | Unknown EMS Hospital Arrival Time | 0 | | 0 | | 0 | | 1 | 0 | 2015-05-01 18:15:00+00:00 |
| 4 | 1 | Alabama | 10231 | 1 | 1 | 0 | 0 | 2 | 2 | 0 | ... | 0 | 0 | | 0 | | 0 | | 1 | 0 | 2015-05-23 20:00:00+00:00 |


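Let's use the table to determine how the number of accidents varies with the day of the week. Since:

- the consecutive_number column contains a unique ID for each accident, and
- the timestamp_of_crash column contains the date and time of the accident,

we can EXTRACT the day of the week from the timestamp_of_crash column, GROUP BY that value, and COUNT the consecutive_number column to get the number of accidents for each day of the week. An ORDER BY clause then sorts the result so that the days with the most accidents come first.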

In [8]:
# Query to find out the number of accidents for each day of the week
query = """
        SELECT COUNT(consecutive_number) AS num_accidents, 
               EXTRACT(DAYOFWEEK FROM timestamp_of_crash) AS day_of_week
        FROM `bigquery-public-data.nhtsa_traffic_fatalities.accident_2015`
        GROUP BY day_of_week
        ORDER BY num_accidents DESC
        """

query_job = client.query(query)

df_query = query_job.to_dataframe()

df_query.head()

| | num_accidents | day_of_week |
|---|---|---|
| 0 | 5659 | 7 |
| 1 | 5298 | 1 |
| 2 | 4916 | 6 |
| 3 | 4460 | 5 |
| 4 | 4182 | 4 |
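
The result is sorted by num_accidents, with the busiest days first. In BigQuery, DAYOFWEEK returns an integer from 1 (Sunday) to 7 (Saturday), so the numbers in the day_of_week column can be translated into names. The sketch below does that in plain Python for the five rows shown above; the `DAY_NAMES` mapping and variable names are illustrative helpers, not part of the original notebook.

```python
# BigQuery's DAYOFWEEK returns an integer from 1 (Sunday) to 7 (Saturday)
DAY_NAMES = {
    1: "Sunday",
    2: "Monday",
    3: "Tuesday",
    4: "Wednesday",
    5: "Thursday",
    6: "Friday",
    7: "Saturday",
}

# (num_accidents, day_of_week) pairs copied from the output above
rows = [(5659, 7), (5298, 1), (4916, 6), (4460, 5), (4182, 4)]

# Replace each day number with its name, keeping the original order
labeled = [(DAY_NAMES[day], count) for count, day in rows]

for name, count in labeled:
    print(f"{name}: {count}")
```

Read this way, the two days with the most fatal accidents in the 2015 table are Saturday and Sunday.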


Now we are going to use the `international_education` table from the `world_bank_intl_education` dataset.

In [9]:
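from google.cloud import bigquery

# Create a client; replace the project ID with your own
client = bigquery.Client(project="sqlbigquery7711")

# Construct a reference to the "world_bank_intl_education" dataset
dataset_ref = client.dataset("world_bank_intl_education", project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

# List the tables in the dataset
tables = list(client.list_tables(dataset))

for table in tables:
    print(table.table_id)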

country_series_definitions
country_summary
international_education
series_summary


In [11]:
table_ref = dataset_ref.table("international_education")
table = client.get_table(table_ref)

client.list_rows(table, max_results=5).to_dataframe()

| | country_name | country_code | indicator_name | indicator_code | value | year |
|---|---|---|---|---|---|---|
| 0 | Chad | TCD | Enrolment in lower secondary education, both s... | UIS.E.2 | 321921.0 | 2012 |
| 1 | Chad | TCD | Enrolment in upper secondary education, both s... | UIS.E.3 | 68809.0 | 2006 |
| 2 | Chad | TCD | Enrolment in upper secondary education, both s... | UIS.E.3 | 30551.0 | 1999 |
| 3 | Chad | TCD | Enrolment in upper secondary education, both s... | UIS.E.3 | 79784.0 | 2007 |
| 4 | Chad | TCD | Repeaters in primary education, all grades, bo... | UIS.R.1 | 282699.0 | 2006 |


The value in the `indicator_code` column describes what type of data is shown in a given row.  
One interesting indicator code is `SE.XPD.TOTL.GD.ZS`, which corresponds to "Government expenditure on education as % of GDP (%)".

### 1. Government expenditure on education

Which countries spend the largest fraction of GDP on education?  
Considering only the rows with indicator code `SE.XPD.TOTL.GD.ZS`, write a query that returns the average value in the `value` column for each country in the dataset between the years 2010 and 2017 (including 2010 and 2017 in the average).

In [15]:
query = """
        SELECT country_name, 
               AVG(value) AS avg_ed_spending_pct
        FROM `bigquery-public-data.world_bank_intl_education.international_education`
        WHERE indicator_code='SE.XPD.TOTL.GD.ZS' AND year>=2010 AND year<=2017
        GROUP BY country_name
        ORDER BY country_name DESC
        """

query_job = client.query(query)

df_query = query_job.to_dataframe()

df_query.head()

| | country_name | avg_ed_spending_pct |
|---|---|---|
| 0 | Zimbabwe | 10.513611 |
| 1 | Zambia | 3.669393 |
| 2 | Yemen, Rep. | 8.01846 |
| 3 | West Bank and Gaza | 1.50376 |
| 4 | Vietnam | 5.136473 |
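
Note that the query above sorts by `country_name`, so the rows shown are in reverse alphabetical order rather than ranked by spending. To answer "which countries spend the largest fraction", one option is to sort by the averaged column instead. The variant below is a sketch of that change only; it has not been run here, so no output is shown.

```python
# Variant of the query above that sorts by the averaged column
# instead of by country name (not executed here)
query = """
        SELECT country_name,
               AVG(value) AS avg_ed_spending_pct
        FROM `bigquery-public-data.world_bank_intl_education.international_education`
        WHERE indicator_code = 'SE.XPD.TOTL.GD.ZS' AND year >= 2010 AND year <= 2017
        GROUP BY country_name
        ORDER BY avg_ed_spending_pct DESC
        """
```

It can be passed to `client.query(query)` and converted with `to_dataframe()` in the same way as the cell above.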


### 2. Identify interesting codes to explore

The last question started by telling you to focus on rows with the code `SE.XPD.TOTL.GD.ZS`. But how would you find more interesting indicator codes to explore?

There are thousands of codes in the dataset, so it would be time-consuming to review them all. Many codes are available for only a few countries, so when browsing the options you might restrict yourself to codes that are reported by many countries.

Write a query below that selects the indicator code and indicator name for all codes with at least 175 rows in the year 2016.

In [24]:
query = """
        SELECT indicator_code,
               indicator_name,
               COUNT(1) AS num_rows
        FROM `bigquery-public-data.world_bank_intl_education.international_education`
        WHERE year=2016
        GROUP BY indicator_code, indicator_name
        HAVING COUNT(1)>=175
        ORDER BY COUNT(1) DESC
        """

query_job = client.query(query)

df_query = query_job.to_dataframe()

df_query.head()

| | indicator_code | indicator_name | num_rows |
|---|---|---|---|
| 0 | SP.POP.TOTL | Population, total | 232 |
| 1 | SP.POP.GROW | Population growth (annual %) | 232 |
| 2 | IT.NET.USER.P2 | Internet users (per 100 people) | 223 |
| 3 | SP.POP.TOTL.FE.ZS | Population, female (% of total) | 213 |
| 4 | SH.DYN.MORT | Mortality rate, under-5 (per 1,000) | 213 |
