# Group By and Having

### Introduction

So far, when we have queried for aggregate information, we have done so by asking questions about the entire table, or about a single subset of the table.  For example, we could answer a question about the average rating of a restaurant in the Bronx, and then write a separate query for the average restaurant rating in Queens.  But what if we'd like to organize our data into different groups based on the neighborhood, and then calculate the average rating of each group of restaurants?  For that we use GROUP BY.

### Working with Group By

Let's start with that exact query.  We'll group our restaurants based on neighborhood, which is stored in the `City` column of our data, and then calculate the average rating of each group.

In [1]:
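import pandas as pd
import sqlite3

# Open (or create) the SQLite database file
yelp_db = sqlite3.connect('yelp.db')
# Load the Yelp lunch data and write it to a table named restaurants
df = pd.read_csv('https://raw.githubusercontent.com/ledeprogram/courses/master/foundations/mapping/tilemill/yelp-lunch-nyc.csv')
# if_exists='replace' lets this cell be re-run without a "table already exists" error
df.to_sql('restaurants', yelp_db, if_exists='replace')
cursor = yelp_db.cursor()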

In [2]:
df[:2]

Unnamed: 0,Name,Address,City,Category,Rating,URL
0,Rambling House,4292 Katonah Ave,Bronx,Pubs,4.0,http://www.yelp.com/biz/rambling-house-bronx
1,Curry Spot,4268 Katonah Ave,Bronx,Indian,4.0,http://www.yelp.com/biz/curry-spot-bronx


In [3]:
cursor.execute('SELECT City, AVG(rating) from restaurants GROUP BY City LIMIT 5;')

<sqlite3.Cursor at 0x7ff4e8e152d0>

In [4]:
cursor.fetchall()

[('Arverne', 4.0),
 ('Astoria', 4.114130434782608),
 ('Bayonne', 4.0),
 ('Bayside', 3.907142857142857),
 ('Belle Harbor', 2.5)]

Without the `LIMIT`, the query is `SELECT City, AVG(rating) FROM restaurants GROUP BY City`.

One way to think of GROUP BY is that it first creates separate piles of the data based on the column provided, and then performs the calculation for each pile.
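To make the "piles" picture concrete, here is a minimal sketch that builds the piles by hand in Python and compares the result with GROUP BY.  The rows and the `sample_restaurants` table are made up for illustration and are not taken from the Yelp data.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample rows (city, rating), invented for this illustration.
rows = [
    ('Bronx', 4.0),
    ('Bronx', 3.0),
    ('Queens', 5.0),
    ('Queens', 4.0),
    ('Queens', 3.0),
]

# Step 1: place each rating into the pile for its city.
piles = defaultdict(list)
for city, rating in rows:
    piles[city].append(rating)

# Step 2: perform the calculation once per pile.
manual_averages = {
    city: sum(ratings) / len(ratings) for city, ratings in piles.items()
}

# The same calculation in SQL, using a throwaway in-memory database.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE sample_restaurants (City TEXT, Rating REAL)')
conn.executemany('INSERT INTO sample_restaurants VALUES (?, ?)', rows)
sql_averages = dict(conn.execute(
    'SELECT City, AVG(Rating) FROM sample_restaurants GROUP BY City'
).fetchall())

print(manual_averages)
print(sql_averages)
```

Both dictionaries should hold one average per city, which is the per-pile calculation GROUP BY performs for us.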

Note that we can use GROUP BY with any of our aggregate functions.  Let's do another one.

In [5]:
cursor.execute('SELECT City, COUNT(*) from restaurants GROUP BY City LIMIT 5;')
cursor.fetchall()

[('Arverne', 3),
 ('Astoria', 92),
 ('Bayonne', 4),
 ('Bayside', 70),
 ('Belle Harbor', 3)]

So now we have the number of restaurants in each of these cities.  Let's order by the count from largest to smallest.

In [6]:
cursor.execute('SELECT City, COUNT(*) as num_restaurants from restaurants GROUP BY City ORDER BY num_restaurants DESC LIMIT 5;')
cursor.fetchall()

[('Brooklyn', 1282),
 ('New York', 1149),
 ('Staten Island', 1034),
 ('Bronx', 817),
 ('Flushing', 161)]
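A single GROUP BY query can also return several aggregates per group at once.  The sketch below assumes a small made-up `sample_restaurants` table (not the Yelp data) and computes the count, minimum and maximum rating for each city.

```python
import sqlite3

# Throwaway in-memory database with hypothetical rows.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE sample_restaurants (City TEXT, Rating REAL)')
conn.executemany('INSERT INTO sample_restaurants VALUES (?, ?)', [
    ('Bronx', 4.0), ('Bronx', 2.5),
    ('Queens', 5.0), ('Queens', 4.0), ('Queens', 3.0),
])

# Each aggregate is calculated separately for every city's group of rows.
summary = conn.execute(
    'SELECT City, COUNT(*) AS num_restaurants, MIN(Rating), MAX(Rating) '
    'FROM sample_restaurants GROUP BY City ORDER BY num_restaurants DESC'
).fetchall()

print(summary)
```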

### Having

So far we have returned a result for every group, but sometimes this can be unwieldy.  For example, we may not wish to see the average rating for every neighborhood's restaurants, but only for those neighborhoods with over 50 restaurants.  We can do so with the following:

In [7]:
cursor.execute('SELECT City, AVG(rating) FROM restaurants GROUP BY City HAVING COUNT(*) > 50;')
cursor.fetchall()

[('Astoria', 4.114130434782608),
 ('Bayside', 3.907142857142857),
 ('Bronx', 3.821297429620563),
 ('Brooklyn', 3.985179407176287),
 ('Flushing', 3.919254658385093),
 ('Forest Hills', 3.8556701030927836),
 ('Jamaica', 3.913793103448276),
 ('Long Island City', 3.6785714285714284),
 ('New York', 3.987815491731941),
 ('Rockaway Park', 3.4491525423728815),
 ('Staten Island', 3.710348162475822)]

Notice that we are narrowing down our results, so it might seem that we should use a WHERE clause, yet we're using HAVING instead.  What gives?  We can't use WHERE here because of the order in which SQL evaluates the clauses of a query.

To select only those cities that have more than 50 restaurants, SQL must first group the rows by city and then count the rows in each group.  But WHERE is evaluated before GROUP BY, when the groups and their counts do not yet exist.  So to tell SQL not to filter our data until after it has been separated into groups and aggregated, we use the HAVING clause.
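The difference in timing can be seen in a small sketch.  It again assumes a made-up `sample_restaurants` table rather than the Yelp data: WHERE removes individual rows before the piles are formed, HAVING removes whole piles afterwards, and SQLite rejects an aggregate placed inside WHERE.

```python
import sqlite3

# Throwaway in-memory database with hypothetical rows.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE sample_restaurants (City TEXT, Rating REAL)')
conn.executemany('INSERT INTO sample_restaurants VALUES (?, ?)', [
    ('Bronx', 4.0), ('Bronx', 2.0),
    ('Queens', 5.0), ('Queens', 4.0), ('Queens', 3.0),
    ('Bayonne', 4.5),
])

# WHERE filters rows first, so the counts only include ratings of 4 or more.
where_result = conn.execute(
    'SELECT City, COUNT(*) FROM sample_restaurants '
    'WHERE Rating >= 4 GROUP BY City ORDER BY City'
).fetchall()

# HAVING filters after grouping, so it can test the count of each group.
having_result = conn.execute(
    'SELECT City, COUNT(*) FROM sample_restaurants '
    'GROUP BY City HAVING COUNT(*) >= 2 ORDER BY City'
).fetchall()

# Using an aggregate inside WHERE is an error in SQLite.
try:
    conn.execute(
        'SELECT City FROM sample_restaurants WHERE COUNT(*) >= 2 GROUP BY City'
    )
    where_error = None
except sqlite3.OperationalError as exc:
    where_error = str(exc)

print(where_result)
print(having_result)
print(where_error)
```

In the WHERE version every city still appears, only with fewer rows counted, while the HAVING version drops Bayonne entirely because its group has a single row.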

### Summary

In this lesson, we learned about using GROUP BY in SQL.  We saw that GROUP BY first places our rows of data into different piles, and then performs an aggregate calculation on each pile.  For example, the following query groups our rows by city and then counts the rows in each group.

In [8]:
cursor.execute('SELECT City, COUNT(*) from restaurants GROUP BY City LIMIT 5;')
cursor.fetchall()

[('Arverne', 3),
 ('Astoria', 92),
 ('Bayonne', 4),
 ('Bayside', 70),
 ('Belle Harbor', 3)]

When we want to filter based on an aggregate, we use the HAVING keyword, which tells SQL to delay the filtering until after the aggregation occurs.