# Kaggle Intro to SQL (and BigQuery)
- https://www.kaggle.com/learn/intro-to-sql

## 3. Exercise: Count, Group By... Having
- Plus aliasing with AS, and COUNT(1)

### Introduction
- Queries with GROUP BY can be powerful.
- We are going to write queries using GROUP BY to answer questions from the Hacker News dataset.

In [1]:
### Fetch the 'full' table from the 'hacker_news' dataset.
from google.cloud import bigquery

# Create a "Client" object
client = bigquery.Client()

# Construct a reference to the "hacker_news" dataset
dataset_ref = client.dataset("hacker_news", project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

# Construct a reference to the "full" table
table_ref = dataset_ref.table("full")

# API request - fetch the table
table = client.get_table(table_ref)

# Preview the first five rows of the "full" table
client.list_rows(table, max_results=5).to_dataframe()


Unnamed: 0,title,url,text,dead,by,score,time,timestamp,type,id,parent,descendants,ranking,deleted
0,,,"I would rather just have wired earbuds, period...",,zeveb,,1591717736,2020-06-09 15:48:56+00:00,comment,23467666,23456782,,,
1,,,DNS?,,nly,,1572810465,2019-11-03 19:47:45+00:00,comment,21436112,21435130,,,
2,,,These benchmarks seem pretty good. Filterable...,,mrkeen,,1591717727,2020-06-09 15:48:47+00:00,comment,23467665,23467426,,,
3,,,Oh really?<p>* Excel alone uses 86.1MB of priv...,,oceanswave,,1462987532,2016-05-11 17:25:32+00:00,comment,11677248,11676886,,,
4,,,These systems are useless. Of the many flaws:...,,nyxxie,,1572810473,2019-11-03 19:47:53+00:00,comment,21436113,21435025,,,


### Ex. 1) Prolific commenters
Hacker News would like to send awards to everyone who has written more than 10,000 posts. Write a query that returns all authors (the `by` column) with more than 10,000 posts, along with their post counts. Call the column with post counts `NumPosts`.

In [2]:
# Query to select prolific commenters and post counts
prolific_commenters_query = '''
    SELECT `by`, COUNT(1) AS NumPosts
    FROM `bigquery-public-data.hacker_news.full`
    GROUP BY `by`
    HAVING COUNT(1) > 10000 ''' 
# `by` is also a reserved keyword, so BigQuery (like MySQL) needs
# backticks around the column name to treat it as an identifier

# Set up the query (cancel the query if it would use too much of 
# your quota, with the limit set to 10 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(prolific_commenters_query, job_config=safe_config)

# API request - run the query, and return a pandas DataFrame
prolific_commenters = query_job.to_dataframe()

# View top few rows of results
print(prolific_commenters.head())

            by  NumPosts
0       ncmncm     13621
1        pjc50     21417
2  dredmorbius     26568
3       nradov     13138
4      amelius     20985
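
The same GROUP BY ... HAVING pattern can be tried offline on a handful of rows. This is only a minimal sketch using Python's built-in `sqlite3` module instead of BigQuery. The `posts` table, the author names, and the threshold of 1 are all made up for illustration.

```python
import sqlite3

# Toy data: one row per post, holding only the author (hypothetical names).
rows = [("alice",), ("alice",), ("alice",), ("bob",), ("carol",), ("carol",)]

conn = sqlite3.connect(":memory:")
# "by" is a reserved word, so it must be quoted. SQLite takes standard
# double quotes here; BigQuery uses backticks instead.
conn.execute('CREATE TABLE posts ("by" TEXT)')
conn.executemany("INSERT INTO posts VALUES (?)", rows)

toy_query = '''
    SELECT "by", COUNT(1) AS NumPosts
    FROM posts
    GROUP BY "by"
    HAVING COUNT(1) > 1
    ORDER BY NumPosts DESC
'''
# GROUP BY makes one group per author, COUNT(1) counts the rows in each
# group, and HAVING keeps only the groups whose count passes the threshold.
prolific = conn.execute(toy_query).fetchall()
print(prolific)
conn.close()
```

WHERE filters individual rows before grouping, while HAVING filters whole groups after the counts are computed, which is why the threshold on `COUNT(1)` belongs in HAVING.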


### Ex. 2) Deleted comments
- How many comments have been deleted?

In [5]:
# Query to count deleted comments
deleted_comments_query = '''
    SELECT deleted, SUM(1) AS num_del_comments
    FROM `bigquery-public-data.hacker_news.full`
    WHERE deleted = True
    GROUP BY deleted '''
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(deleted_comments_query, job_config=safe_config)
deleted_posts_df = query_job.to_dataframe()
print(deleted_posts_df)

   deleted  num_del_comments
0     True            968172


In [7]:
num_deleted_posts = deleted_posts_df.loc[0, 'num_del_comments']
num_deleted_posts

968172
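
Since the WHERE clause already keeps only the deleted rows, the GROUP BY is optional here, and `COUNT(1)` gives the same total as `SUM(1)`. A small offline check with the built-in `sqlite3` module, assuming a made-up `posts` table where `deleted` is 1 for deleted rows and NULL otherwise (SQLite has no boolean type, so 1 stands in for True):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER, deleted INTEGER)")
# Hypothetical rows: three deleted posts, two with no deleted flag (NULL).
conn.executemany(
    "INSERT INTO posts VALUES (?, ?)",
    [(1, 1), (2, None), (3, 1), (4, None), (5, 1)],
)

# Same shape as the query above: filter, then group on the single value left.
with_group_by = conn.execute('''
    SELECT deleted, SUM(1) AS num_del_comments
    FROM posts
    WHERE deleted = 1
    GROUP BY deleted
''').fetchall()

# Simpler variant: no GROUP BY, COUNT(1) over the filtered rows.
without_group_by = conn.execute('''
    SELECT COUNT(1) AS num_del_comments
    FROM posts
    WHERE deleted = 1
''').fetchone()[0]

print(with_group_by, without_group_by)
conn.close()
```

The two forms differ when nothing matches the filter: the grouped query returns no rows at all, while the ungrouped `COUNT(1)` returns a single row containing 0.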

You got the right countries. Nice job! Some countries showed up many times in the results. To get each country only once you can run `SELECT DISTINCT country ...`.

The DISTINCT keyword ensures each distinct value (here, each country) shows up only once in the results, which you'll want in some cases.

##### Or to get each country just once, you could use:
```python
first_query = '''
    SELECT DISTINCT country
    FROM `bigquery-public-data.openaq.global_air_quality`
    WHERE unit = 'ppm' '''
```

### Ex. 2) High air quality
- Which pollution levels were reported to be exactly 0?

In [None]:
zero_pollution_query = '''
    SELECT *
    FROM `bigquery-public-data.openaq.global_air_quality`
    WHERE value=0 '''
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
zpq_job = client.query(zero_pollution_query, job_config=safe_config)
zero_pollution_results = zpq_job.to_dataframe()    # this is my 'df'
zero_pollution_results.iloc[[0, 5, 9, -9, -5, -1]]    # sample a few rows from the start and end

That query wasn't too complicated, and it got the data you want. But these SELECT queries don't organize data in a way that answers the most interesting questions. For that, we'll need the GROUP BY command.

If you know how to use groupby() in pandas, this is similar. But BigQuery works quickly with far larger datasets.
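
To make the comparison concrete, here is a rough pandas counterpart of a `GROUP BY ... HAVING COUNT(1) > 1` query. It is a sketch on a tiny in-memory DataFrame. The `country` values and the `NumReadings` column name are invented for illustration and do not come from the real table.

```python
import pandas as pd

# Hypothetical readings: one row per measurement, keeping only the country.
df = pd.DataFrame({"country": ["US", "US", "FR", "IN", "IN", "IN"]})

# SQL: SELECT country, COUNT(1) AS NumReadings ... GROUP BY country
counts = df.groupby("country").size().reset_index(name="NumReadings")

# SQL: HAVING COUNT(1) > 1  (filter applied to the grouped result)
frequent = counts[counts["NumReadings"] > 1].reset_index(drop=True)
print(frequent)
```

The pandas version needs the whole table in local memory first, whereas the SQL version runs inside BigQuery and sends back only the grouped result.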

### Ex. 3) JM- Whole table as a DataFrame. Problem with:
- `safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)`

In [None]:
query = '''
    SELECT *
    FROM `bigquery-public-data.openaq.global_air_quality` '''
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(query, job_config=safe_config)
df_whole_table = query_job.to_dataframe()

In [None]:
df_whole_table.shape