# Kaggle Intro to SQL (and BigQuery)
- https://www.kaggle.com/learn/intro-to-sql

## 4. Exercise: Order By
- plus EXTRACT for DATE or DATETIME fields.
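
As a rough illustration of how ORDER BY and EXTRACT combine, the sketch below builds a query that counts Hacker News posts per year and sorts the years by post count. It assumes the `bigquery-public-data.hacker_news.full` table has a `timestamp` column of type TIMESTAMP. The names `posts_per_year_query` and `run_query` are made up for this example, and nothing is sent to BigQuery until `run_query` is called with a real client.

```python
# Sketch only: assumes `timestamp` is a TIMESTAMP column in hacker_news.full.
posts_per_year_query = """
    SELECT EXTRACT(YEAR FROM timestamp) AS year,
           COUNT(1) AS num_posts
    FROM `bigquery-public-data.hacker_news.full`
    GROUP BY year
    ORDER BY num_posts DESC
"""


def run_query(client, query, max_bytes=10**10):
    """Run `query` with a bytes-billed cap and return a pandas DataFrame.

    Hypothetical helper; `client` is a google.cloud.bigquery.Client.
    """
    # Imported here so the query string can be inspected without the library.
    from google.cloud import bigquery

    config = bigquery.QueryJobConfig(maximum_bytes_billed=max_bytes)
    return client.query(query, job_config=config).to_dataframe()
```

With a client in hand, `run_query(client, posts_per_year_query).head()` would show the busiest years first, because of `ORDER BY num_posts DESC`.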

### Introduction
- To get to know a new dataset, you can run a couple of SELECT queries.
- The World Bank has made tons of interesting education data available through BigQuery. Run the following cell to see the first few rows of the `international_education` table from the `world_bank_intl_education` dataset.

In [1]:
# Fetch the 'international_education' table from the 'world_bank_intl_education' dataset.
from google.cloud import bigquery

# Create a "Client" object
client = bigquery.Client()

# Construct a reference to the "world_bank_intl_education" dataset
dataset_ref = client.dataset("world_bank_intl_education", project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

# Construct a reference to the "international_education" table
table_ref = dataset_ref.table("international_education")

# API request - fetch the table
table = client.get_table(table_ref)

# Preview the first five rows of the "international_education" table
client.list_rows(table, max_results=5).to_dataframe()


   country_name  country_code                                     indicator_name  indicator_code     value  year
0          Chad           TCD  Enrolment in lower secondary education, both s...         UIS.E.2  321921.0  2012
1          Chad           TCD  Enrolment in upper secondary education, both s...         UIS.E.3   68809.0  2006
2          Chad           TCD  Enrolment in upper secondary education, both s...         UIS.E.3   30551.0  1999
3          Chad           TCD  Enrolment in upper secondary education, both s...         UIS.E.3   79784.0  2007
4          Chad           TCD  Repeaters in primary education, all grades, bo...         UIS.R.1  282699.0  2006


### Ex. 1) Prolific commenters
Hacker News would like to send awards to everyone who has written more than 10,000 posts. Write a query that returns all authors (the `by` column) with more than 10,000 posts, along with their post counts. Call the column with post counts `NumPosts`.

In [2]:
# Query to select prolific commenters and post counts
prolific_commenters_query = '''
    SELECT `by`, COUNT(1) AS NumPosts
    FROM `bigquery-public-data.hacker_news.full`
    GROUP BY `by`
    HAVING COUNT(1) > 10000 '''
# `by` is a reserved keyword in BigQuery SQL, so the column name must be
# wrapped in backticks to be used as an identifier.

# Set up the query (cancel the query if it would use too much of 
# your quota, with the limit set to 10 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(prolific_commenters_query, job_config=safe_config)

# API request - run the query, and return a pandas DataFrame
prolific_commenters = query_job.to_dataframe()

# View top few rows of results
print(prolific_commenters.head())

            by  NumPosts
0       ncmncm     13621
1        pjc50     21417
2  dredmorbius     26568
3       nradov     13138
4      amelius     20985
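
The GROUP BY / HAVING pattern can be tried locally without spending BigQuery quota. The sketch below uses Python's built-in `sqlite3` module as a stand-in. This is an assumption made for illustration only: SQLite is not BigQuery, and the `posts` table, its rows, and the threshold of 1 are made up. Note that the example quotes the `by` identifier with double quotes, where the BigQuery query uses backticks.

```python
import sqlite3

# Toy in-memory table standing in for hacker_news.full (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE posts ("by" TEXT)')
rows = [("alice",)] * 3 + [("bob",)] * 1 + [("carol",)] * 2
conn.executemany('INSERT INTO posts ("by") VALUES (?)', rows)

# Same shape as the exercise query: group by author, keep groups above a threshold.
toy_query = '''
    SELECT "by", COUNT(1) AS NumPosts
    FROM posts
    GROUP BY "by"
    HAVING COUNT(1) > 1
    ORDER BY NumPosts DESC
'''
prolific = conn.execute(toy_query).fetchall()
conn.close()

print(prolific)  # [('alice', 3), ('carol', 2)]
```

HAVING filters whole groups after aggregation, which is why `bob` (one post) disappears while WHERE could not have expressed this condition.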


### Ex. 2) Deleted comments
- How many comments have been deleted?

In [5]:
# Query to count deleted comments
deleted_comments_query = '''
    SELECT deleted, SUM(1) AS num_del_comments
    FROM `bigquery-public-data.hacker_news.full`
    WHERE deleted = True
    GROUP BY deleted '''
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(deleted_comments_query, job_config=safe_config)
deleted_posts_df = query_job.to_dataframe()
print(deleted_posts_df)

   deleted  num_del_comments
0     True            968172


In [7]:
num_deleted_posts = deleted_posts_df.loc[0, 'num_del_comments']
num_deleted_posts

968172
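
Because the WHERE clause already keeps only deleted rows, the GROUP BY in the query above is optional: a plain COUNT over the filtered rows gives the same number. The sketch below checks this on a toy table with Python's built-in `sqlite3` module. The `full_toy` table and its rows are made up, and SQLite stores booleans as integers, so `deleted = 1` plays the role of `deleted = True`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE full_toy (id INTEGER, deleted INTEGER)")
# deleted: 1 = True, 0 = False, None = NULL (never marked as deleted)
conn.executemany(
    "INSERT INTO full_toy VALUES (?, ?)",
    [(1, 1), (2, None), (3, 1), (4, 0), (5, None)],
)

# Form used in the exercise: filter, then group by the filtered column.
grouped = conn.execute(
    "SELECT deleted, COUNT(1) FROM full_toy WHERE deleted = 1 GROUP BY deleted"
).fetchall()

# Simpler form: filter and count directly.
plain = conn.execute(
    "SELECT COUNT(1) FROM full_toy WHERE deleted = 1"
).fetchone()[0]
conn.close()

print(grouped, plain)  # [(1, 2)] 2
```

One difference to keep in mind: when no rows match, the grouped form returns no rows at all, while the plain form returns a single row containing 0.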