# SQL with Python Reference Guide 6
# Aggregation
## (Justin M. Olds)
Based on Stanford SQL course: https://lagunita.stanford.edu/courses/DB/SQL/SelfPaced/info

---
**Aggregation overview** 

Aggregate functions appear in the SELECT clause and compute a single value across a set of selected rows. The basic ones are **MIN, MAX, SUM, AVG, and COUNT**.

Two additional clauses of the SELECT statement work together with aggregate functions: "GROUP BY" and "HAVING".
* **GROUP BY** partitions the selected rows into groups, and the aggregate functions are then computed once per group.
* **HAVING** filters whole groups, using conditions on their aggregate values.
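
As a quick illustration of the five functions named above, here is a minimal sketch that runs them over a tiny table. The `Scores` table and its values are hypothetical and are not part of the college admissions data used in the rest of this guide; an in-memory database is used so that `class.db` is not touched.

```python
import sqlite3

# In-memory database with a small hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Scores(name TEXT, score INT)")
conn.executemany(
    "INSERT INTO Scores VALUES (?, ?)",
    [("a", 10), ("b", 20), ("c", 30), ("d", 40)],
)

# Each aggregate collapses all selected rows into a single value.
row = conn.execute(
    "SELECT MIN(score), MAX(score), SUM(score), AVG(score), COUNT(*) FROM Scores"
).fetchone()
print(row)  # (10, 40, 100, 25.0, 4)
conn.close()
```

Without a GROUP BY clause, the whole result set is treated as one group, so the query returns exactly one row.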


In [3]:
import sqlite3
import pandas as pd

conn = sqlite3.connect("class.db")
c = conn.cursor()

---
### Tables and Insert code below (same as before--college admissions data)

In [4]:
c.execute('DROP TABLE IF EXISTS College')
c.execute('DROP TABLE IF EXISTS Student') 
c.execute('DROP TABLE IF EXISTS Apply') 

c.execute('CREATE TABLE College(cName TEXT, state TEXT, enrollment INT)')
c.execute('CREATE TABLE Student(sID INT, sName TEXT, GPA REAL, sizeHS INT)')
c.execute('CREATE TABLE Apply(sID INT, cName TEXT, major TEXT, decision TEXT)')
conn.commit()

In [5]:
c.execute('DELETE FROM Student')
c.execute('DELETE FROM College')
c.execute('DELETE FROM Apply')

c.execute("INSERT INTO Student VALUES (123, 'Amy', 3.9, 1000)")
c.execute("INSERT INTO Student values (234, 'Bob', 3.6, 1500)")
c.execute("INSERT INTO Student values (345, 'Craig', 3.5, 500)")
c.execute("INSERT INTO Student values (456, 'Doris', 3.9, 1000)")
c.execute("INSERT INTO Student values (567, 'Edward', 2.9, 2000)")
c.execute("INSERT INTO Student values (678, 'Fay', 3.8, 200)")
c.execute("INSERT INTO Student values (789, 'Gary', 3.4, 800)")
c.execute("INSERT INTO Student values (987, 'Helen', 3.7, 800)")
c.execute("INSERT INTO Student values (876, 'Irene', 3.9, 400)")
c.execute("INSERT INTO Student values (765, 'Jay', 2.9, 1500)")
c.execute("INSERT INTO Student values (654, 'Amy', 3.9, 1000)")
c.execute("INSERT INTO Student values (543, 'Craig', 3.4, 2000)")

c.execute("INSERT INTO College values ('Stanford', 'CA', 15000)")
c.execute("INSERT INTO College values ('Berkeley', 'CA', 36000)")
c.execute("INSERT INTO College values ('MIT', 'MA', 10000)")
c.execute("INSERT INTO College values ('Cornell', 'NY', 21000)")

c.execute("INSERT INTO Apply values (123, 'Stanford', 'CS', 'Y')")
c.execute("INSERT INTO Apply values (123, 'Stanford', 'EE', 'N')")
c.execute("INSERT INTO Apply values (123, 'Berkeley', 'CS', 'Y')")
c.execute("INSERT INTO Apply values (123, 'Cornell', 'EE', 'Y')")
c.execute("INSERT INTO Apply values (234, 'Berkeley', 'biology', 'N')")
c.execute("INSERT INTO Apply values (345, 'MIT', 'bioengineering', 'Y')")
c.execute("INSERT INTO Apply values (345, 'Cornell', 'bioengineering', 'N')")
c.execute("INSERT INTO Apply values (345, 'Cornell', 'CS', 'Y')")
c.execute("INSERT INTO Apply values (345, 'Cornell', 'EE', 'N')")
c.execute("INSERT INTO Apply values (678, 'Stanford', 'history', 'Y')")
c.execute("INSERT INTO Apply values (987, 'Stanford', 'CS', 'Y')")
c.execute("INSERT INTO Apply values (987, 'Berkeley', 'CS', 'Y')")
c.execute("INSERT INTO Apply values (876, 'Stanford', 'CS', 'N')")
c.execute("INSERT INTO Apply values (876, 'MIT', 'biology', 'Y')")
c.execute("INSERT INTO Apply values (876, 'MIT', 'marine biology', 'N')")
c.execute("INSERT INTO Apply values (765, 'Stanford', 'history', 'Y')")
c.execute("INSERT INTO Apply values (765, 'Cornell', 'history', 'N')")
c.execute("INSERT INTO Apply values (765, 'Cornell', 'psychology', 'Y')")
c.execute("INSERT INTO Apply values (543, 'MIT', 'CS', 'N')")
conn.commit()


---
### AVG (Average)

Start with a simple query that computes the average GPA of all students in the Student table.


In [6]:
df = pd.read_sql_query("""
    SELECT AVG(GPA)
    FROM Student
""", conn);df

Unnamed: 0,AVG(GPA)
0,3.566667


### Using AVG with joined tables
The query below finds the average GPA of students applying to CS.

In [8]:
df = pd.read_sql_query("""
    SELECT DISTINCT AVG(GPA)
    FROM Student, Apply
    WHERE Student.sID = Apply.sID AND major = 'CS'
""", conn);df

Unnamed: 0,AVG(GPA)
0,3.714286


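**PROBLEM:** This average counts some students more than once, because the join produces one row per CS application and a student who applied to CS at several colleges appears several times. The DISTINCT keyword does not help: it is applied to the single result row, after AVG has already been computed. The duplication can be observed by inspecting the joined rows with SELECT *.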

In [9]:
df = pd.read_sql_query("""
    SELECT *
    FROM Student, Apply 
    WHERE Student.sID = Apply.sID
        AND major = 'CS'
""", conn);df

Unnamed: 0,sID,sName,GPA,sizeHS,sID.1,cName,major,decision
0,123,Amy,3.9,1000,123,Stanford,CS,Y
1,123,Amy,3.9,1000,123,Berkeley,CS,Y
2,345,Craig,3.5,500,345,Cornell,CS,Y
3,987,Helen,3.7,800,987,Stanford,CS,Y
4,987,Helen,3.7,800,987,Berkeley,CS,Y
5,876,Irene,3.9,400,876,Stanford,CS,N
6,543,Craig,3.4,2000,543,MIT,CS,N


To avoid this problem, rewrite the query with a subquery so that each student is selected at most once.

In [20]:
df = pd.read_sql_query("""
    SELECT *
    FROM Student 
    WHERE sID IN (SELECT sID FROM Apply WHERE major = 'CS') 
""", conn);df

Unnamed: 0,sID,sName,GPA,sizeHS
0,123,Amy,3.9,1000
1,345,Craig,3.5,500
2,987,Helen,3.7,800
3,876,Irene,3.9,400
4,543,Craig,3.4,2000


In [21]:
df = pd.read_sql_query("""
    SELECT AVG(GPA)
    FROM Student 
    WHERE sID IN (SELECT sID FROM Apply WHERE major = 'CS') 
""", conn);df

Unnamed: 0,AVG(GPA)
0,3.68
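
It can be tempting to fix the duplicate problem with `AVG(DISTINCT GPA)` instead of a subquery. The sketch below, which assumes only the CS applicants and their CS applications from the data above (loaded into an in-memory database), suggests why that is not equivalent: DISTINCT removes duplicate GPA values, not duplicate students, so two different students with the same GPA are merged as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Student(sID INT, GPA REAL)")
conn.execute("CREATE TABLE Apply(sID INT, cName TEXT, major TEXT)")
# Only the CS applicants and their CS applications from the data set above.
conn.executemany("INSERT INTO Student VALUES (?, ?)",
                 [(123, 3.9), (345, 3.5), (987, 3.7), (876, 3.9), (543, 3.4)])
conn.executemany("INSERT INTO Apply VALUES (?, ?, 'CS')",
                 [(123, "Stanford"), (123, "Berkeley"), (345, "Cornell"),
                  (987, "Stanford"), (987, "Berkeley"), (876, "Stanford"),
                  (543, "MIT")])

join = "FROM Student, Apply WHERE Student.sID = Apply.sID AND major = 'CS'"

# One GPA per application: students with several CS applications count twice.
per_application = conn.execute(f"SELECT AVG(GPA) {join}").fetchone()[0]
# One GPA per distinct value: students 123 and 876 (both 3.9) are merged.
distinct_gpa = conn.execute(f"SELECT AVG(DISTINCT GPA) {join}").fetchone()[0]
# One GPA per student: the subquery form.
per_student = conn.execute(
    "SELECT AVG(GPA) FROM Student "
    "WHERE sID IN (SELECT sID FROM Apply WHERE major = 'CS')"
).fetchone()[0]

print(round(per_application, 3), round(distinct_gpa, 3), round(per_student, 3))
conn.close()
```

The three averages differ (about 3.714, 3.625 and 3.68), and only the subquery form averages over students.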


---
### COUNT 
COUNT returns the number of rows (tuples) selected by the query. The query below counts the applications to Cornell.


In [22]:
df = pd.read_sql_query("""
    SELECT COUNT(*)
    FROM Apply
    WHERE cName = 'Cornell'
""", conn);df

Unnamed: 0,COUNT(*)
0,6


### COUNT DISTINCT

Often we want the number of distinct values of an attribute rather than the number of rows: for example, how many different students applied to Cornell, rather than how many applications Cornell received. COUNT(*) counts a student who applied to several majors once per application.

In [23]:
df = pd.read_sql_query("""
    SELECT COUNT(DISTINCT sID)
    FROM Apply
    WHERE cName = 'Cornell'
""", conn);df

Unnamed: 0,COUNT(DISTINCT sID)
0,3
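
Both counts can also be requested in a single SELECT, which makes the difference easy to see side by side. This is a minimal sketch that assumes the six Cornell applications from the data above, plus one MIT row to show that the WHERE clause still applies, loaded into an in-memory database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Apply(sID INT, cName TEXT, major TEXT)")
conn.executemany("INSERT INTO Apply VALUES (?, ?, ?)", [
    (123, "Cornell", "EE"),
    (345, "Cornell", "bioengineering"),
    (345, "Cornell", "CS"),
    (345, "Cornell", "EE"),
    (765, "Cornell", "history"),
    (765, "Cornell", "psychology"),
    (543, "MIT", "CS"),  # filtered out by the WHERE clause
])

# COUNT(*) counts rows; COUNT(DISTINCT sID) counts different sID values.
total, unique_students = conn.execute(
    "SELECT COUNT(*), COUNT(DISTINCT sID) FROM Apply WHERE cName = 'Cornell'"
).fetchone()
print(total, unique_students)  # 6 3
conn.close()
```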


### Combining aggregate functions with a query
The following query returns the students for whom the number of other students with the same GPA equals the number of other students with the same high school size.

In [24]:
df = pd.read_sql_query("""
    SELECT *
    FROM Student S1
    WHERE (SELECT COUNT(*) FROM Student S2
        WHERE S2.sID <> S1.sID AND S2.GPA = S1.GPA) 
        =
        (SELECT COUNT(*) FROM Student S2
        WHERE S2.sID <> S1.sID AND S2.sizeHS = S1.sizeHS)
""", conn); df

Unnamed: 0,sID,sName,GPA,sizeHS
0,345,Craig,3.5,500
1,567,Edward,2.9,2000
2,678,Fay,3.8,200
3,789,Gary,3.4,800
4,765,Jay,2.9,1500
5,543,Craig,3.4,2000


### Combining aggregate functions with a query (another example)
The following query returns the amount by which the average GPA of students applying to CS exceeds the average GPA of students not applying to CS.

**Subquery reminder:** A subquery in the FROM clause lets you write a SELECT-FROM-WHERE expression and then use its result as if it were an actual table in the database.

In [26]:
df = pd.read_sql_query("""
    SELECT CS.avgGPA - NonCS.avgGPA
    FROM 
        (SELECT AVG(GPA) AS avgGPA 
        FROM Student
        WHERE sID IN (SELECT sID FROM Apply WHERE major = 'CS')) AS CS,
        (SELECT AVG(GPA) AS avgGPA
        FROM Student
        WHERE sID NOT IN (SELECT sID FROM Apply WHERE major = 'CS')) AS NonCS
""", conn); df

Unnamed: 0,CS.avgGPA - NonCS.avgGPA
0,0.194286


The same result can be obtained by using aggregate functions within subqueries placed in the SELECT clause. 

In [29]:
df = pd.read_sql_query("""
    SELECT (SELECT AVG(GPA) as avgGPA FROM Student
            WHERE sID IN (SELECT sID FROM Apply WHERE major = 'CS')) -
            (SELECT AVG(GPA) AS avgGPA FROM Student 
            WHERE sID NOT IN (SELECT sID FROM Apply WHERE major = 'CS'))
            AS d
    """, conn); df

Unnamed: 0,d
0,0.194286


---
### GROUP BY

Below: the number of applications to each college.

NOTE: It does not make much sense to include attributes other than cName (the grouped-by attribute) in the SELECT clause unless they are wrapped in an aggregate function. For example, if major were added to the SELECT clause, SQLite would return the major from an arbitrary row of each group, and many other database systems would reject the query.

In [43]:
df = pd.read_sql_query("""
    SELECT cName, COUNT(*)
    FROM Apply 
    GROUP BY cName
""", conn);df

Unnamed: 0,cName,COUNT(*)
0,Berkeley,3
1,Cornell,6
2,MIT,4
3,Stanford,6
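
The note about non-grouped attributes can be made concrete. In the sketch below, which assumes only the Berkeley and MIT applications from the data above in an in-memory database, the first query selects `major` as a bare column and the second wraps it in an aggregate. Which major the bare column shows is up to SQLite, so the code only checks that it is one of the majors in the table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Apply(sID INT, cName TEXT, major TEXT)")
# The three Berkeley and four MIT applications from the data set above.
conn.executemany("INSERT INTO Apply VALUES (?, ?, ?)", [
    (123, "Berkeley", "CS"), (234, "Berkeley", "biology"), (987, "Berkeley", "CS"),
    (345, "MIT", "bioengineering"), (876, "MIT", "biology"),
    (876, "MIT", "marine biology"), (543, "MIT", "CS"),
])

# Bare column: SQLite takes major from an arbitrary row of each group.
bare = conn.execute(
    "SELECT cName, COUNT(*), major FROM Apply GROUP BY cName"
).fetchall()
print(bare)

# Aggregated column: the number of distinct majors is well defined per group.
counted = sorted(conn.execute(
    "SELECT cName, COUNT(*), COUNT(DISTINCT major) FROM Apply GROUP BY cName"
).fetchall())
print(counted)  # [('Berkeley', 3, 2), ('MIT', 4, 4)]
conn.close()
```

If a per-group value of a non-grouped attribute is needed, an aggregate such as COUNT(DISTINCT ...), MIN or MAX gives a result that does not depend on row order.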


### GROUP BY -- another example

College enrollments by state

In [31]:
df = pd.read_sql_query("""
    SELECT state, SUM(enrollment)
    FROM College
    GROUP BY state
""", conn); df

Unnamed: 0,state,SUM(enrollment)
0,CA,51000
1,MA,10000
2,NY,21000
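
Several aggregates can be computed per group in the same query. The sketch below assumes the four College rows from above in an in-memory database and adds COUNT(*) and AVG(enrollment) next to the SUM. The rows are sorted in Python only to make the printed order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE College(cName TEXT, state TEXT, enrollment INT)")
conn.executemany("INSERT INTO College VALUES (?, ?, ?)", [
    ("Stanford", "CA", 15000), ("Berkeley", "CA", 36000),
    ("MIT", "MA", 10000), ("Cornell", "NY", 21000),
])

# One output row per state, with three aggregates over that state's rows.
rows = sorted(conn.execute("""
    SELECT state, COUNT(*), SUM(enrollment), AVG(enrollment)
    FROM College
    GROUP BY state
""").fetchall())
for row in rows:
    print(row)
# ('CA', 2, 51000, 25500.0)
# ('MA', 1, 10000, 10000.0)
# ('NY', 1, 21000, 21000.0)
conn.close()
```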


### GROUP BY using two attributes

Minimum and maximum GPAs of the applicants for each combination of college and major

In [32]:
df = pd.read_sql_query("""
    SELECT cName, major, MIN(GPA), MAX(GPA)
    FROM Student, Apply
    WHERE Student.sID = Apply.sID
    GROUP BY cName, major
""", conn); df

Unnamed: 0,cName,major,MIN(GPA),MAX(GPA)
0,Berkeley,CS,3.7,3.9
1,Berkeley,biology,3.6,3.6
2,Cornell,CS,3.5,3.5
3,Cornell,EE,3.5,3.9
4,Cornell,bioengineering,3.5,3.5
5,Cornell,history,2.9,2.9
6,Cornell,psychology,2.9,2.9
7,MIT,CS,3.4,3.4
8,MIT,bioengineering,3.5,3.5
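9,MIT,biology,3.9,3.9
10,MIT,marine biology,3.9,3.9
11,Stanford,CS,3.7,3.9
12,Stanford,EE,3.9,3.9
13,Stanford,history,2.9,3.8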


The grouped query above can itself be used as a subquery in the FROM clause to find, for example, the largest difference between the minimum and maximum GPA over all combinations of college and major.

In [34]:
df = pd.read_sql_query("""
    SELECT MAX(mx-mn)
    FROM
        (SELECT cName, major, MIN(GPA) AS mn, MAX(GPA) AS mx
        FROM Student, Apply
        WHERE Student.sID = Apply.sID
        GROUP BY cName, major) AS M 
""", conn); df

Unnamed: 0,MAX(mx-mn)
0,0.9
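
The query above reports the size of the largest spread but not which group it belongs to. One possible way to find the group is to sort the grouped rows by spread and keep the first one. This sketch uses ORDER BY and LIMIT, which are not otherwise covered in this guide, and assumes a three-group subset of the data above in an in-memory database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Student(sID INT, GPA REAL)")
conn.execute("CREATE TABLE Apply(sID INT, cName TEXT, major TEXT)")
# Subset of the data above: three (college, major) groups with two applicants each.
conn.executemany("INSERT INTO Student VALUES (?, ?)",
                 [(123, 3.9), (345, 3.5), (987, 3.7), (678, 3.8), (765, 2.9)])
conn.executemany("INSERT INTO Apply VALUES (?, ?, ?)", [
    (123, "Berkeley", "CS"), (987, "Berkeley", "CS"),
    (123, "Cornell", "EE"), (345, "Cornell", "EE"),
    (678, "Stanford", "history"), (765, "Stanford", "history"),
])

# Compute the spread per group, sort descending, and keep the top row.
top = conn.execute("""
    SELECT cName, major, MAX(GPA) - MIN(GPA) AS spread
    FROM Student, Apply
    WHERE Student.sID = Apply.sID
    GROUP BY cName, major
    ORDER BY spread DESC
    LIMIT 1
""").fetchone()
print(top[0], top[1], round(top[2], 1))  # Stanford history 0.9
conn.close()
```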


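Number of distinct colleges applied to by each student. Students with no applications do not appear in the result, because the join only keeps students that have at least one matching row in Apply.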

In [35]:
df = pd.read_sql_query("""
    SELECT Student.sID, COUNT(DISTINCT cName)
    FROM Student, Apply
    WHERE Student.sID = Apply.sID
    GROUP BY Student.sID
""", conn); df

Unnamed: 0,sID,COUNT(DISTINCT cName)
0,123,3
1,234,1
2,345,2
3,543,1
4,678,1
5,765,2
6,876,2
7,987,2
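
The result above has no rows for students such as 456 or 567, who never applied anywhere, because the join drops them. One possible way to report them with a count of 0 is a LEFT JOIN, sketched below. LEFT JOIN is not used elsewhere in this guide, and the example assumes a two-student subset of the data in an in-memory database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Student(sID INT, sName TEXT)")
conn.execute("CREATE TABLE Apply(sID INT, cName TEXT, major TEXT)")
# Subset of the data above: Amy applied to three colleges, Doris to none.
conn.executemany("INSERT INTO Student VALUES (?, ?)",
                 [(123, "Amy"), (456, "Doris")])
conn.executemany("INSERT INTO Apply VALUES (?, ?, ?)", [
    (123, "Stanford", "CS"), (123, "Stanford", "EE"),
    (123, "Berkeley", "CS"), (123, "Cornell", "EE"),
])

# LEFT JOIN keeps every student; unmatched students get NULL for cName,
# and COUNT(DISTINCT cName) ignores NULLs, so their count is 0.
rows = sorted(conn.execute("""
    SELECT Student.sID, COUNT(DISTINCT cName)
    FROM Student LEFT JOIN Apply ON Student.sID = Apply.sID
    GROUP BY Student.sID
""").fetchall())
print(rows)  # [(123, 3), (456, 0)]
conn.close()
```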


---
### HAVING

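HAVING filters the groups produced by GROUP BY, usually with a condition on an aggregate value. It is evaluated after grouping, whereas WHERE is evaluated on individual rows before grouping.

The following query returns the colleges with fewer than 5 applications.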

In [44]:
df = pd.read_sql_query("""
    SELECT cName
    FROM Apply 
    GROUP BY cName
    HAVING COUNT(*) < 5
""", conn); df

Unnamed: 0,cName
0,Berkeley
1,MIT
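
WHERE and HAVING can appear in the same query, and the order in which they apply matters. The sketch below assumes a subset of the Apply rows from above in an in-memory database: WHERE first keeps only CS applications, GROUP BY then forms one group per college, and HAVING finally keeps the colleges with at least two CS applications.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Apply(sID INT, cName TEXT, major TEXT)")
# Subset of the Apply rows above; Cornell has three rows but only one is CS.
conn.executemany("INSERT INTO Apply VALUES (?, ?, ?)", [
    (123, "Stanford", "CS"), (123, "Stanford", "EE"), (123, "Berkeley", "CS"),
    (345, "Cornell", "CS"), (345, "Cornell", "EE"),
    (345, "Cornell", "bioengineering"),
    (987, "Stanford", "CS"), (987, "Berkeley", "CS"),
    (876, "Stanford", "CS"), (543, "MIT", "CS"),
])

rows = sorted(conn.execute("""
    SELECT cName, COUNT(*)
    FROM Apply
    WHERE major = 'CS'        -- row filter, applied before grouping
    GROUP BY cName
    HAVING COUNT(*) >= 2      -- group filter, applied after grouping
""").fetchall())
print(rows)  # [('Berkeley', 2), ('Stanford', 3)]
conn.close()
```

Cornell is excluded even though it has three rows in this subset, because only one of them survives the WHERE clause.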


The query above returns colleges with fewer than 5 applications. It can be changed to return the colleges with applications from fewer than 5 distinct students as follows:

In [45]:
df = pd.read_sql_query("""
    SELECT cName
    FROM Apply 
    GROUP BY cName
    HAVING COUNT(DISTINCT sID) < 5
""", conn); df

Unnamed: 0,cName
0,Berkeley
1,Cornell
2,MIT


The last example below returns the majors whose applicants' maximum GPA is below the average GPA of all students.

In [46]:
df = pd.read_sql_query("""
    SELECT major
    FROM Student, Apply 
    WHERE Student.sID = Apply.sID
    GROUP BY major
    HAVING MAX(GPA) < (SELECT AVG(GPA) FROM Student)
""", conn); df

Unnamed: 0,major
0,bioengineering
1,psychology
