<a href="https://colab.research.google.com/github/lazlozerv/Kaggle_SQL_Summer_Camp/blob/master/Lesson3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Intro

Continuing from the previous lesson, we now look at a few more SQL clauses and functions.

Specifically, we are going to cover **GROUP BY**, **HAVING** and **COUNT()**.

#### COUNT()

**COUNT()** is an example of an **aggregate function**, which takes many values and returns one. (Other aggregate functions include **SUM()**, **AVG()**, **MIN()** and **MAX()**.)
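
To make "many values in, one value out" concrete, here is a minimal sketch. It assumes a hypothetical `pets` table with made-up rows, and it uses Python's built-in `sqlite3` module as a local stand-in for BigQuery so that it runs without any credentials.

```python
import sqlite3

# Assumption: a toy in-memory table; neither the table nor its rows
# come from the lesson's BigQuery dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (id INTEGER, animal TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?)",
    [(1, "Rabbit", 2), (2, "Dog", 5), (3, "Cat", 3), (4, "Rabbit", 4)],
)

# Each aggregate function collapses the four rows into a single value
aggregates = conn.execute(
    "SELECT COUNT(id), SUM(age), AVG(age), MIN(age), MAX(age) FROM pets"
).fetchone()

print(aggregates)  # (4, 14, 3.5, 2, 5)
```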

#### GROUP BY

**GROUP BY** groups together rows that have the same value in a given column. For example, in a table of pets with an `Animal` column and an `ID` column, we can group by `Animal` while using **COUNT()** to find out how many IDs we have in each group.
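
The grouping described above can be sketched as follows. The `pets` table and its rows are hypothetical, and `sqlite3` is used only as a local stand-in for BigQuery.

```python
import sqlite3

# Assumption: toy data invented for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (id INTEGER, animal TEXT)")
conn.executemany(
    "INSERT INTO pets VALUES (?, ?)",
    [(1, "Rabbit"), (2, "Dog"), (3, "Cat"), (4, "Rabbit")],
)

# One output row per distinct value of `animal`, with the number of ids in it
animal_counts = conn.execute(
    """
    SELECT animal, COUNT(id)
    FROM pets
    GROUP BY animal
    ORDER BY animal
    """
).fetchall()

print(animal_counts)  # [('Cat', 1), ('Dog', 1), ('Rabbit', 2)]
```

The `ORDER BY` is there only to make the output order predictable; `GROUP BY` on its own does not guarantee one.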

#### GROUP BY ... HAVING

**HAVING** is used in combination with **GROUP BY** to ignore groups that don't meet certain criteria, so the query returns only the groups that do.
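
As a small illustration of filtering groups, the sketch below keeps only the groups with more than one row. It assumes a hypothetical `pets` table and runs on `sqlite3` rather than BigQuery.

```python
import sqlite3

# Assumption: toy data invented for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (id INTEGER, animal TEXT)")
conn.executemany(
    "INSERT INTO pets VALUES (?, ?)",
    [(1, "Rabbit"), (2, "Dog"), (3, "Cat"), (4, "Rabbit")],
)

# HAVING is evaluated after grouping, so it can test the aggregate itself
frequent_animals = conn.execute(
    """
    SELECT animal, COUNT(id)
    FROM pets
    GROUP BY animal
    HAVING COUNT(id) > 1
    """
).fetchall()

print(frequent_animals)  # [('Rabbit', 2)]
```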

In [None]:
from google.cloud import bigquery

# Create a "Client" object
client = bigquery.Client()

In [None]:
# Construct a reference to the "hacker_news" dataset
dataset_ref = client.dataset("hacker_news", project="bigquery-public-data")


# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)


# Construct a reference to the "comments" table
table_ref = dataset_ref.table("comments")


# API request - fetch the table
table = client.get_table(table_ref)


# Preview the first five lines of the "comments" table
client.list_rows(table, max_results=5).to_dataframe()

In [None]:
# Query to select comments that received more than 10 replies
query_popular = """
                SELECT parent, COUNT(id)
                FROM `bigquery-public-data.hacker_news.comments`
                GROUP BY parent
                HAVING COUNT(id) > 10
                """

In [None]:
# Set up the query (cancel the query if it would use too much of 
# your quota, with the limit set to 10 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(query_popular, job_config=safe_config)

# API request - run the query, and convert the results to a pandas DataFrame
popular_comments = query_job.to_dataframe()

# Print the first five rows of the DataFrame
popular_comments.head()

#### Aliasing and other improvements

A couple of hints to make your queries even better:


* The column resulting from `COUNT(id)` was called `f0__`. That's not a very descriptive name. You can change the name by adding `AS NumPosts` after you specify the aggregation. This is called __aliasing__, and it will be covered in more detail in an upcoming lesson.

* If you are ever unsure what to put inside the **COUNT()** function, you can use `COUNT(1)` to count the rows in each group. Most people find it especially readable, because it makes clear that the count does not depend on any particular column. It also scans less data than supplying a column name would (making it faster and using less of your data access quota).

Using these tricks, we can rewrite our query:

In [None]:
# Improved version of earlier query, now with aliasing & improved readability
query_improved = """
                 SELECT parent, COUNT(1) AS NumPosts
                 FROM `bigquery-public-data.hacker_news.comments`
                 GROUP BY parent
                 HAVING COUNT(1) > 10
                 """

safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(query_improved, job_config=safe_config)

# API request - run the query, and convert the results to a pandas DataFrame
improved_df = query_job.to_dataframe()

# Print the first five rows of the DataFrame
improved_df.head()
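
One further difference, not covered in the lesson text, is that in standard SQL `COUNT(1)` counts every row in a group while `COUNT(column)` skips rows where that column is `NULL`. The sketch below shows this together with aliasing. It assumes a hypothetical toy `comments` table with one missing `id`, and uses `sqlite3` as a local stand-in for BigQuery.

```python
import sqlite3

# Assumption: toy rows invented for illustration; one id is deliberately NULL
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id INTEGER, parent INTEGER)")
conn.executemany(
    "INSERT INTO comments VALUES (?, ?)",
    [(1, 100), (2, 100), (None, 100), (4, 200)],
)

cursor = conn.execute(
    """
    SELECT parent, COUNT(1) AS NumPosts, COUNT(id) AS NumIds
    FROM comments
    GROUP BY parent
    ORDER BY parent
    """
)

# The aliases become the column names of the result
column_names = [col[0] for col in cursor.description]
counts = cursor.fetchall()

print(column_names)  # ['parent', 'NumPosts', 'NumIds']
print(counts)        # [(100, 3, 2), (200, 1, 1)]
```

For parent `100`, `NumPosts` is 3 but `NumIds` is 2, because the row with a missing `id` is counted only by `COUNT(1)`.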

#### Note on using **GROUP BY**

Because **GROUP BY** tells SQL how to apply aggregate functions (like **COUNT()**), it doesn't make sense to use **GROUP BY** without an aggregate function. Similarly, if a query has a **GROUP BY** clause, then every variable in the **SELECT** list must be passed to either

1. a **GROUP BY** clause, or
2. an aggregate function.

Consider the query below:

In [None]:
query_good = """
             SELECT parent, COUNT(id)
             FROM `bigquery-public-data.hacker_news.comments`
             GROUP BY parent
             """

Here `parent` is passed to the **GROUP BY** clause and `id` is passed to **COUNT()**, so every variable is accounted for.
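
For contrast, here is a sketch of a query that breaks the rule. It assumes the `comments` table has an `author` column, which is not shown in this lesson. That column is neither grouped nor aggregated, so BigQuery would be expected to reject the query with an error along the lines of "neither grouped nor aggregated". The query is only defined here, not run.

```python
# Hypothetical counter-example: `author` is an assumed column name.
# It appears in SELECT but is passed to neither GROUP BY nor an aggregate.
query_bad = """
            SELECT author, parent, COUNT(id)
            FROM `bigquery-public-data.hacker_news.comments`
            GROUP BY parent
            """

# To try it, pass it to the client created earlier in the notebook:
# client.query(query_bad).result()
```

Possible fixes are to add `author` to the **GROUP BY** clause or to drop it from the **SELECT** list.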