# NYC High Schools Aggregates

### Introduction
In this lab we will practice using SQL aggregate functions. These functions, such as AVG, MIN, and MAX, perform a calculation over a set of values and return a single value. We will also use the GROUP BY clause, which groups rows that have identical values in a column (or columns), usually so that an aggregate function can be applied to each group.

In the database for this lab, each row represents a school, and each column holds a metric or other information about that school. We could use an aggregate function to find the MAX total students across all the schools listed. But what if we wanted the MAX number of students by boro? With only a WHERE clause, that would require a separate statement for each boro. That's where GROUP BY comes in: with GROUP BY boro, a single query returns the result of the aggregate function for each boro.
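
To make the contrast between WHERE and GROUP BY concrete, here is a minimal sketch on a toy in-memory table. The `schools` table, its rows, and the `conn_demo` name are made up for illustration and are not part of the lab database.

```python
import sqlite3

# Toy in-memory table; names and numbers are hypothetical, not lab data.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute(
    "CREATE TABLE schools (name TEXT, boro TEXT, total_students INTEGER)"
)
conn_demo.executemany(
    "INSERT INTO schools VALUES (?, ?, ?)",
    [("A", "M", 200), ("B", "M", 500), ("C", "Q", 900), ("D", "Q", 300)],
)

# One aggregate over the whole table returns a single value.
overall_max = conn_demo.execute(
    "SELECT MAX(total_students) FROM schools"
).fetchall()

# WHERE restricts the aggregate to one boro, so each boro needs its own query.
manhattan_max = conn_demo.execute(
    "SELECT MAX(total_students) FROM schools WHERE boro = 'M'"
).fetchall()

# GROUP BY returns one row per boro from a single query.
max_by_boro = conn_demo.execute(
    "SELECT boro, MAX(total_students) FROM schools GROUP BY boro ORDER BY boro"
).fetchall()

print(overall_max)    # [(900,)]
print(manhattan_max)  # [(500,)]
print(max_by_boro)    # [('M', 500), ('Q', 900)]
```

The `ORDER BY boro` is only there to make the row order predictable in the sketch.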

Let's begin by using the `sqlite3` library to connect to the database and load the high schools CSV into a table.

In [9]:
import sqlite3
import pandas as pd

# Connect to (or create) the local SQLite database file
conn = sqlite3.connect('nyc_schools.db')
cursor = conn.cursor()

# Load the high schools CSV and write it to the high_schools table
hs_url = "https://raw.githubusercontent.com/eng-6-22/mod-1-sql-curriculum/master/sql-agg-hs-queries/highschools.csv"
high_school_df = pd.read_csv(hs_url)
high_school_df.to_sql('high_schools', conn, index=False, if_exists='replace')


356

In [10]:
high_school_df.head()


Unnamed: 0,id,dbn,name,num_test_takers,reading_avg,math_avg,writing_score,boro,total_students,graduation_rate,attendance_rate,college_career_rate
0,0,01M292,HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES,29.0,355.0,404.0,363.0,M,171,0.66,0.87,0.36
1,1,01M448,UNIVERSITY NEIGHBORHOOD HIGH SCHOOL,91.0,383.0,423.0,366.0,M,465,0.9,0.93,0.7
2,2,01M450,EAST SIDE COMMUNITY SCHOOL,70.0,377.0,402.0,370.0,M,683,0.92,0.94,0.77
3,3,01M509,MARTA VALLE HIGH SCHOOL,44.0,390.0,433.0,384.0,M,148,0.74,0.79,0.49
4,4,01M539,"NEW EXPLORATIONS INTO SCIENCE, TECHNOLOGY AND ...",159.0,522.0,574.0,525.0,M,1734,0.97,0.95,0.85


In [2]:
cursor.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
cursor.fetchall()

[('high_schools',)]

In [3]:
cursor.execute('PRAGMA table_info(high_schools)')
cursor.fetchall()

[(0, 'id', 'INTEGER', 0, None, 0),
 (1, 'dbn', 'TEXT', 0, None, 0),
 (2, 'name', 'TEXT', 0, None, 0),
 (3, 'num_test_takers', 'REAL', 0, None, 0),
 (4, 'reading_avg', 'REAL', 0, None, 0),
 (5, 'math_avg', 'REAL', 0, None, 0),
 (6, 'writing_score', 'REAL', 0, None, 0),
 (7, 'boro', 'TEXT', 0, None, 0),
 (8, 'total_students', 'INTEGER', 0, None, 0),
 (9, 'graduation_rate', 'REAL', 0, None, 0),
 (10, 'attendance_rate', 'REAL', 0, None, 0),
 (11, 'college_career_rate', 'REAL', 0, None, 0)]

### Aggregates

For each of the questions below, use a SQL aggregate function to find the solution. (Note that in the database, the boro column consists of the values "M" for Manhattan, "X" for the Bronx, "K" for Brooklyn, "Q" for Queens, and "R" for Staten Island.)

* What's the average number of students in Manhattan?

In [13]:
def avg_students_manhattan():
    statement = """SELECT AVG(total_students) FROM high_schools WHERE boro = 'M'"""
    cursor.execute(statement)
    return cursor.fetchall()

avg_students_manhattan()
# [(601.9666666666667,)]

[(601.9666666666667,)]

* What's the average attendance rate in Manhattan?

In [14]:
def avg_attendance_rate_in_hs():
    statement = """SELECT AVG(attendance_rate) FROM high_schools WHERE boro = 'M'"""
    cursor.execute(statement)
    return cursor.fetchall()

avg_attendance_rate_in_hs()
# [(0.8782222222222222,)]


[(0.8782222222222222,)]

* What's the largest difference between graduation_rate and college_career_rate?

In [18]:
def largest_diff_btwn_grad_rate_and_college_career_rate():
    statement = """SELECT MAX(graduation_rate - college_career_rate) FROM high_schools"""
    cursor.execute(statement)
    return cursor.fetchall()
largest_diff_btwn_grad_rate_and_college_career_rate()
# [(0.55,)]

[(0.55,)]

* What is the highest math_avg in Queens?

In [21]:
def highest_math_avg_queens():
    statement = """SELECT MAX(math_avg) FROM high_schools WHERE boro = 'Q'"""
    cursor.execute(statement)
    return cursor.fetchall()
highest_math_avg_queens()
# [(660.0,)]

[(660.0,)]

* What is the highest math_avg in Manhattan?

In [22]:
def highest_math_avg_manhattan():
    statement = """SELECT MAX(math_avg) FROM high_schools WHERE boro = 'M'"""
    cursor.execute(statement)
    return cursor.fetchall()
highest_math_avg_manhattan()

[(735.0,)]

* What is the highest combined score (math_avg plus reading_avg) in Manhattan?

In [25]:
def highest_combined_score():
    statement = """SELECT MAX(math_avg + reading_avg) FROM high_schools WHERE boro = 'M'"""
    cursor.execute(statement)
    return cursor.fetchall()
highest_combined_score()
# [(1414.0,)]

[(1414.0,)]
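
Some rows in the data have missing scores, so it helps to know how aggregates treat NULL. The sketch below uses a hypothetical `scores` table (not the lab data) to show that `COUNT(column)`, `AVG`, and `MAX` skip NULL values, and that an expression such as `math_avg + reading_avg` is NULL whenever either operand is NULL.

```python
import sqlite3

# Hypothetical toy table; values are invented for illustration.
null_demo = sqlite3.connect(":memory:")
null_demo.execute(
    "CREATE TABLE scores (name TEXT, math_avg REAL, reading_avg REAL)"
)
null_demo.executemany(
    "INSERT INTO scores VALUES (?, ?, ?)",
    [("A", 400.0, 380.0), ("B", 500.0, None), ("C", None, None)],
)

# COUNT(*) counts rows; COUNT(math_avg) counts only non-NULL values.
# AVG and MAX ignore NULLs; the sum is NULL for rows B and C, so only A counts.
summary = null_demo.execute(
    """SELECT COUNT(*), COUNT(math_avg), AVG(math_avg),
              MAX(math_avg + reading_avg)
       FROM scores"""
).fetchone()

print(summary)  # (3, 2, 450.0, 780.0)
```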

### Group By

* What's the average number of students in each borough?

In [27]:
def avg_num_of_students_per_borough():
    statement = """SELECT boro, AVG(total_students) FROM high_schools GROUP BY boro"""
    cursor.execute(statement)
    return cursor.fetchall()
avg_num_of_students_per_borough()
# [('K', 740.2884615384615),
#         ('M', 601.9666666666667),
#         ('Q', 1135.4615384615386),
#         ('R', 1863.2),
#         ('X', 523.4827586206897)]

[('K', 740.2884615384615),
 ('M', 601.9666666666667),
 ('Q', 1135.4615384615386),
 ('R', 1863.2),
 ('X', 523.4827586206897)]

* What's the average difference between graduation_rate and college_career_rate by borough?

In [28]:
def avg_diff_btwn_grad_rate_and_college_career_rate_by_boro():
    statement = """SELECT boro, AVG(graduation_rate - college_career_rate) FROM high_schools GROUP BY boro"""
    cursor.execute(statement)
    return cursor.fetchall()

avg_diff_btwn_grad_rate_and_college_career_rate_by_boro()

# [('K', 0.22480392156862752),
#             ('M', 0.17298850574712643),
#             ('Q', 0.1706153846153846),
#             ('R', 0.23200000000000004),
#             ('X', 0.21264367816091953)]

[('K', 0.22480392156862752),
 ('M', 0.17298850574712643),
 ('Q', 0.1706153846153846),
 ('R', 0.23200000000000004),
 ('X', 0.21264367816091953)]

* What's the average college career rate grouped by math_avg scores? (Hint: https://stackoverflow.com/questions/30929526/sqlite-group-by-range-of-1000s)

In [36]:
def avg_college_career_rate_by_math_avg():
    statement = """SELECT math_avg as range, AVG(college_career_rate) FROM high_schools GROUP BY range"""
    cursor.execute(statement)
    return cursor.fetchall()
avg_college_career_rate_by_math_avg()

[(None, 0.6124999999999999),
 (312.0, 0.41),
 (315.0, 0.43),
 (320.0, 0.76),
 (322.0, 0.49),
 (323.0, 0.27999999999999997),
 (324.0, 0.42),
 (333.0, 0.47),
 (335.0, 0.5),
 (339.0, 0.74),
 (342.0, 0.3),
 (346.0, 0.37),
 (349.0, 0.43833333333333324),
 (350.0, 0.72),
 (351.0, 0.5),
 (353.0, 0.5800000000000001),
 (355.0, 0.53),
 (356.0, 0.32),
 (357.0, 0.46),
 (358.0, 0.4766666666666666),
 (359.0, 0.53),
 (360.0, 0.35),
 (361.0, 0.23),
 (362.0, 0.46),
 (363.0, 0.45),
 (364.0, 0.5319999999999999),
 (365.0, 0.48),
 (366.0, 0.44),
 (367.0, 0.5675),
 (368.0, 0.465),
 (369.0, 0.33),
 (370.0, 0.48),
 (371.0, 0.372),
 (372.0, 0.44),
 (373.0, 0.69),
 (374.0, 0.42),
 (375.0, 0.5275),
 (376.0, 0.39),
 (377.0, 0.48),
 (378.0, 0.5357142857142857),
 (379.0, 0.45599999999999996),
 (380.0, 0.52),
 (381.0, 0.5187499999999999),
 (382.0, 0.57),
 (384.0, 0.4779999999999999),
 (385.0, 0.405),
 (386.0, 0.42400000000000004),
 (387.0, 0.64),
 (388.0, 0.61),
 (390.0, 0.54),
 (391.0, 0.5383333333333333),
 (392.0, 
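
The query above produces one group per distinct math_avg value. The hint points toward grouping by ranges instead. As a hedged sketch of that idea, the example below buckets scores into ranges of 100 using integer truncation. The `scores` table, its rows, the bucket width, and the `math_range` alias are all assumptions chosen for illustration.

```python
import sqlite3

# Hypothetical toy table; values are invented for illustration.
range_demo = sqlite3.connect(":memory:")
range_demo.execute(
    "CREATE TABLE scores (math_avg REAL, college_career_rate REAL)"
)
range_demo.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [(312.0, 0.25), (388.0, 0.75), (404.0, 0.5), (455.0, 0.5), (574.0, 1.0)],
)

# CAST(... AS INTEGER) truncates 3.12 to 3, so 312 falls in the 300 bucket.
# Rows with a NULL math_avg are left out rather than forming their own group.
buckets = range_demo.execute(
    """SELECT CAST(math_avg / 100 AS INTEGER) * 100 AS math_range,
              AVG(college_career_rate)
       FROM scores
       WHERE math_avg IS NOT NULL
       GROUP BY math_range
       ORDER BY math_range"""
).fetchall()

print(buckets)  # [(300, 0.5), (400, 0.5), (500, 1.0)]
```

A width of 100 is arbitrary here; a narrower or wider bucket only changes the divisor and multiplier.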

### HAVING
One important thing to note is that the WHERE clause cannot filter on the result of an aggregate function. WHERE is applied to individual rows before they are grouped, so the aggregate values do not exist yet at that point. For example, let's say we wanted to know the average number of students in each boro, but we only wanted the results for boros with an average of more than 1000. Here we would use the HAVING clause, which filters the groups after the aggregates have been computed. See the example below and then use the HAVING clause to find the solution for the next question.

In [37]:
cursor.execute('''SELECT boro, AVG(total_students)
FROM high_schools
GROUP BY boro HAVING AVG(total_students) > 1000''')
cursor.fetchall()

[('Q', 1135.4615384615386), ('R', 1863.2)]
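
WHERE and HAVING can also appear in the same query. The sketch below, on a hypothetical `schools` table with invented numbers, shows WHERE dropping individual rows before grouping and HAVING dropping whole groups afterwards. It also shows that SQLite rejects an aggregate placed in WHERE.

```python
import sqlite3

# Hypothetical toy table; values are invented for illustration.
having_demo = sqlite3.connect(":memory:")
having_demo.execute("CREATE TABLE schools (boro TEXT, total_students INTEGER)")
having_demo.executemany(
    "INSERT INTO schools VALUES (?, ?)",
    [("M", 100), ("M", 1500), ("Q", 1200), ("Q", 1400), ("X", 300), ("X", 500)],
)

# HAVING only: M averages 800, so only Q passes the filter.
having_only = having_demo.execute(
    """SELECT boro, AVG(total_students) FROM schools
       GROUP BY boro HAVING AVG(total_students) > 1000
       ORDER BY boro"""
).fetchall()

# WHERE first removes the 100-student row, so M now averages 1500 and passes.
where_then_having = having_demo.execute(
    """SELECT boro, AVG(total_students) FROM schools
       WHERE total_students >= 200
       GROUP BY boro HAVING AVG(total_students) > 1000
       ORDER BY boro"""
).fetchall()

# An aggregate inside WHERE is an error, because WHERE runs before grouping.
aggregate_in_where_failed = False
try:
    having_demo.execute(
        "SELECT boro FROM schools WHERE AVG(total_students) > 1000 GROUP BY boro"
    )
except sqlite3.OperationalError:
    aggregate_in_where_failed = True

print(having_only)        # [('Q', 1300.0)]
print(where_then_having)  # [('M', 1500.0), ('Q', 1300.0)]
```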

In [39]:
def boroughs_with_avg_total_students_over_one_thousand():
    cursor.execute('''SELECT boro, AVG(total_students)
FROM high_schools
GROUP BY boro HAVING AVG(total_students) > 1000''')
    return cursor.fetchall()

boroughs_with_avg_total_students_over_one_thousand()
# [('Q', 1135.4615384615386), ('R', 1863.2)]

[('Q', 1135.4615384615386), ('R', 1863.2)]

What is the average college career rate for each boro, selecting only boros with an average college career rate less than 0.6?

In [41]:
def boroughs_with_avg_college_career_under_point_six():
    cursor.execute('''SELECT boro, AVG(college_career_rate) FROM high_schools GROUP BY boro HAVING AVG(college_career_rate) < 0.6''')
    return cursor.fetchall()
boroughs_with_avg_college_career_under_point_six()
# [('K', 0.5471568627450981), ('X', 0.5295402298850576)]

[('K', 0.5471568627450981), ('X', 0.5295402298850576)]

### Conclusion
In this lab, we used aggregate functions, which perform mathematical operations on a set of values in our database and return a single result. We also used the GROUP BY clause, which let us apply those aggregate functions to several subsets of the data in one query. Finally, we used the HAVING clause to filter the grouped results by their aggregate values.