# Group By, Having & Count


GROUP BY, HAVING and COUNT() let you split rows into groups and count the rows in each group. This can help you answer questions like:

- How many of each kind of fruit has our store sold?
- How many species of animal has the vet office treated?
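To make the first question concrete before moving to BigQuery, here is a minimal sketch on a toy table. It assumes Python's built-in `sqlite3` module as a local stand-in (SQLite's dialect differs from BigQuery's in places), and the `sales` table, its rows, and the `num_sold` alias are all made up for illustration. The `ORDER BY` is only there to make the printed order predictable.

```python
import sqlite3

# Hypothetical toy data: one row per sale (not part of the Hacker News dataset).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, fruit TEXT)")
conn.executemany(
    "INSERT INTO sales (fruit) VALUES (?)",
    [("apple",), ("banana",), ("apple",), ("cherry",), ("apple",), ("banana",)],
)

# GROUP BY makes one group per fruit; COUNT(1) counts the rows in each group.
fruit_counts = conn.execute(
    "SELECT fruit, COUNT(1) AS num_sold FROM sales GROUP BY fruit ORDER BY fruit"
).fetchall()

print(fruit_counts)  # [('apple', 3), ('banana', 2), ('cherry', 1)]
```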

We are going to learn these commands with the Hacker News dataset:


In [2]:
from google.cloud import bigquery

client = bigquery.Client(project="sqlbigquery7711")

dataset_ref = client.dataset("hacker_news", project="bigquery-public-data")

dataset = client.get_dataset(dataset_ref)


In [3]:
tables = list(client.list_tables(dataset))

for table in tables:
    print(table.table_id)

full


In [4]:
# `table` still refers to the last table from the loop above (here the only one, full)
client.list_rows(table, max_results=5).to_dataframe()


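|  | title | url | text | dead | by | score | time | timestamp | type | id | parent | descendants | ranking | deleted |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 |  |  |  | True |  |  | 1437239977 | 2015-07-18 17:19:37+00:00 | story | 9908144 |  |  |  |  |
| 1 |  |  |  |  |  |  | 1437243699 | 2015-07-18 18:21:39+00:00 | story | 9908379 |  |  |  |  |
| 2 |  |  |  |  |  |  | 1437244469 | 2015-07-18 18:34:29+00:00 | story | 9908416 |  |  |  |  |
| 3 |  |  |  |  |  |  | 1437244659 | 2015-07-18 18:37:39+00:00 | story | 9908429 |  |  |  |  |
| 4 |  |  |  |  |  |  | 1437245389 | 2015-07-18 18:49:49+00:00 | story | 9908468 |  |  |  |  |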


Let's use the table to see which comments generated the most replies. Since:

- the parent column indicates the comment that was replied to, and
- the id column has the unique ID used to identify each comment.

we can GROUP BY the parent column and COUNT() the id column in order to figure out the number of comments that were made as responses to a specific comment. (This might not make sense immediately -- take your time here to ensure that everything is clear!)

Furthermore, since we're only interested in popular comments, we'll look at comments with more than ten replies. So, we'll only return groups HAVING more than ten IDs.
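If the grouping logic still feels abstract, it can help to see the same idea in plain Python. The sketch below is only an analogy, not how BigQuery executes the query: the `rows` list and the `replies_per_parent` helper are hypothetical, and the threshold is 1 rather than ten so the toy data stays small.

```python
from collections import Counter

# Hypothetical toy rows shaped like (id, parent); parent is the post replied to.
rows = [(10, 1), (11, 1), (12, 1), (13, 2), (14, 2), (15, 10)]


def replies_per_parent(rows, min_replies):
    """Mimic GROUP BY parent ... HAVING COUNT(id) > min_replies."""
    # GROUP BY parent + COUNT(id): tally how many rows share each parent.
    counts = Counter(parent for _id, parent in rows)
    # HAVING: keep only the groups whose count exceeds the threshold.
    return {parent: n for parent, n in counts.items() if n > min_replies}


popular = replies_per_parent(rows, min_replies=1)
print(popular)  # {1: 3, 2: 2}
```

Parent 10 has a single reply, so its group is dropped by the HAVING-style filter.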

In [6]:
query = """
        SELECT parent, COUNT(id)
        FROM `bigquery-public-data.hacker_news.full`
        GROUP BY parent
        HAVING COUNT(id) > 10
        """

query_job = client.query(query)

df_query = query_job.to_dataframe()

df_query.head()

|  | parent | f0_ |
| --- | --- | --- |
| 0 | 9910146 | 56 |
| 1 | 188489 | 143 |
| 2 | 7905584 | 46 |
| 3 | 7920642 | 42 |
| 4 | 9165278 | 44 |


### Aliasing and other improvements

- The column resulting from COUNT(id) was called f0_. That's not a very descriptive name. You can change the name by adding AS NumPosts after you specify the aggregation.
- If you are ever unsure what to put inside the COUNT() function, you can use COUNT(1) to count the rows in each group. Most people find it especially readable, because it makes clear that the count doesn't depend on any particular column. It also scans less data than supplying a column name would (making it faster and using less of your data access quota).
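One detail worth keeping in mind when choosing between the two forms: COUNT(1) counts every row in a group, while COUNT(column) skips rows where that column is NULL. Here is a small sketch of both points (aliasing and the NULL behaviour), again assuming the built-in `sqlite3` module as a stand-in and a made-up `posts` table with hypothetical column aliases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER, parent INTEGER, author TEXT)")
# Hypothetical toy rows; None is stored as NULL.
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?)",
    [(1, 100, "ann"), (2, 100, None), (3, 100, "bob"), (4, 200, None)],
)

cursor = conn.execute(
    """
    SELECT parent,
           COUNT(1)      AS num_posts,        -- every row in the group
           COUNT(author) AS num_with_author   -- only rows with a non-NULL author
    FROM posts
    GROUP BY parent
    ORDER BY parent
    """
)
column_names = [col[0] for col in cursor.description]
alias_rows = cursor.fetchall()

print(column_names)  # ['parent', 'num_posts', 'num_with_author']
print(alias_rows)    # [(100, 3, 2), (200, 1, 0)]
```

The aliases become the column names of the result, and the two counts differ wherever `author` is NULL.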

In [7]:
# Improved version of earlier query, now with aliasing & improved readability
query_improved = """
                 SELECT parent, COUNT(1) AS NumPosts
                 FROM `bigquery-public-data.hacker_news.full`
                 GROUP BY parent
                 HAVING COUNT(1) > 10
                 """

safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(query_improved, job_config=safe_config)

# API request - run the query, and convert the results to a pandas DataFrame
improved_df = query_job.to_dataframe()

# Print the first five rows of the DataFrame
improved_df.head()

|  | parent | NumPosts |
| --- | --- | --- |
| 0 | 9874521 | 44 |
| 1 | 9917442 | 49 |
| 2 | 9996333 | 632 |
| 3 | 8107222 | 95 |
| 4 | 8185461 | 63 |


### Exercises
#### 1. Prolific commenters
Hacker News would like to send awards to everyone who has written more than 10,000 posts. Write a query that returns all authors with more than 10,000 posts as well as their post counts. Call the column with post counts NumPosts.

In [16]:
query_improved = """
                 SELECT `by` AS author, COUNT(1) AS NumPosts
                 FROM `bigquery-public-data.hacker_news.full`
                 GROUP BY `by`
                 HAVING COUNT(1) > 10000
                 """
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(query_improved, job_config=safe_config)

# API request - run the query, and convert the results to a pandas DataFrame
improved_df = query_job.to_dataframe()

# Print the first five rows of the DataFrame
improved_df.head()

|  | author | NumPosts |
| --- | --- | --- |
| 0 | stcredzero | 17353 |
| 1 | jrockway | 19292 |
| 2 | userbinator | 18170 |
| 3 | adrianN | 10816 |
| 4 | pseudolus | 17814 |


#### 2. Deleted comments

How many comments have been deleted? (If a comment was deleted, the `deleted` column in the table will have the value `True`.)

In [18]:
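# Count rows per value of `deleted`. There is no HAVING filter here,
# so a small `True` group is not dropped from the result.
query_improved = """
                 SELECT deleted, COUNT(1)
                 FROM `bigquery-public-data.hacker_news.full`
                 GROUP BY deleted
                 """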
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(query_improved, job_config=safe_config)

# API request - run the query, and convert the results to a pandas DataFrame
improved_df = query_job.to_dataframe()

# Print the first five rows of the DataFrame
improved_df.head()

|  | deleted | f0_ |
| --- | --- | --- |
| 0 |  | 41905809 |
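A HAVING threshold removes every group at or below it, so it can hide exactly the group you are looking for. To count only deleted rows, filtering with WHERE and skipping GROUP BY is often simpler. The sketch below shows the difference on toy data. It assumes the built-in `sqlite3` module and a made-up `posts` table where 1 stands in for `True` (SQLite has no BOOL type) and NULL marks rows that were never flagged, so it illustrates the idea rather than the exact BigQuery query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER, deleted INTEGER)")
# Hypothetical toy rows: 1 stands in for True, None (NULL) for "not flagged".
conn.executemany(
    "INSERT INTO posts VALUES (?, ?)",
    [(1, None), (2, 1), (3, None), (4, 1), (5, None)],
)

# Grouping with a HAVING threshold: the deleted group (2 rows) is filtered out.
grouped = conn.execute(
    "SELECT deleted, COUNT(1) FROM posts GROUP BY deleted HAVING COUNT(1) > 2"
).fetchall()

# Filtering first with WHERE: a single number, the count of deleted rows.
num_deleted = conn.execute(
    "SELECT COUNT(1) FROM posts WHERE deleted = 1"
).fetchone()[0]

print(grouped)      # [(None, 3)]
print(num_deleted)  # 2
```

In BigQuery the filter would read `WHERE deleted = True`, since the column there holds boolean values.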
