# 1: Introduction

In the last mission, we began investigating possible relationships between SAT scores and demographic factors. To do this, we acquired several data sets about New York City public schools. We manipulated these data sets, and found that we could combine them all using the DBN column. Each data set is currently stored as a value in the data dictionary, keyed by its name, and each one is a pandas dataframe.

In this mission, we'll clean the data a bit more, then combine it. Finally, we'll compute correlations and perform some analysis.

The first thing we'll need to do in preparation for the merge is condense some of the data sets. In the last mission, we noticed that the values in the DBN column were unique in the sat_results data set. Other data sets like class_size had duplicate DBN values, however.

We'll need to condense these data sets so that each value in the DBN column is unique. Otherwise, we'll run into issues when it comes time to combine the data sets.

While the main data set we want to analyze, sat_results, has unique DBN values for every high school in New York City, other data sets aren't as clean. A single row in the sat_results data set may match multiple rows in the class_size data set, for example. This situation will create problems, because we don't know which of the multiple entries in the class_size data set we should combine with the single matching entry in sat_results. Here's a diagram that illustrates the problem:

<img src='cartesian_product.png'>

In the diagram above, we can't just combine the rows from both data sets because there are several cases where multiple rows in class_size match a single row in sat_results.
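
The duplication problem can be reproduced on a small scale. The sketch below uses two made-up dataframes (the sat_score column name and all values are hypothetical, chosen only for illustration) and shows that a merge on DBN repeats a sat_results row once for every matching class_size row:

```python
import pandas as pd

# Hypothetical toy data; the column name sat_score and all values are invented.
sat_results = pd.DataFrame({
    "DBN": ["01M292", "01M448"],
    "sat_score": [1122, 1172],
})
class_size = pd.DataFrame({
    "DBN": ["01M292", "01M292", "01M292", "01M448"],
    "AVERAGE CLASS SIZE": [22.0, 25.0, 28.0, 21.0],
})

# DBN is unique in sat_results, but not in class_size.
print(sat_results["DBN"].is_unique)
print(class_size["DBN"].is_unique)

# A left merge keeps every sat_results row, but repeats it once
# for each class_size row that shares the same DBN.
merged = sat_results.merge(class_size, on="DBN", how="left")
print(len(sat_results), len(merged))
print(merged)
```

The merged result has more rows than sat_results, which is why we want one row per DBN in every data set before combining them.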

To resolve this issue, we'll condense the class_size, graduation, and demographics data sets so that each DBN is unique.

# 2: Condensing the Class Size Data Set

The first data set that we'll condense is class_size. The first few rows of class_size look like this:
	
       CSD 	BOROUGH 	SCHOOL CODE 	SCHOOL NAME 	GRADE 	PROGRAM TYPE 	CORE SUBJECT (MS CORE and 9-12 ONLY) 	CORE COURSE (MS CORE and 9-12 ONLY) 	SERVICE CATEGORY(K-9* ONLY) 	NUMBER OF STUDENTS / SEATS FILLED 	NUMBER OF SECTIONS 	AVERAGE CLASS SIZE 	SIZE OF SMALLEST CLASS 	SIZE OF LARGEST CLASS 	DATA SOURCE 	SCHOOLWIDE PUPIL-TEACHER RATIO 	padded_csd 	DBN
    0 	1 	M 	M015 	P.S. 015 Roberto Clemente 	0K 	GEN ED 	- 	- 	- 	19.0 	1.0 	19.0 	19.0 	19.0 	ATS 	NaN 	01 	01M015
    1 	1 	M 	M015 	P.S. 015 Roberto Clemente 	0K 	CTT 	- 	- 	- 	21.0 	1.0 	21.0 	21.0 	21.0 	ATS 	NaN 	01 	01M015
    2 	1 	M 	M015 	P.S. 015 Roberto Clemente 	01 	GEN ED 	- 	- 	- 	17.0 	1.0 	17.0 	17.0 	17.0 	ATS 	NaN 	01 	01M015
    3 	1 	M 	M015 	P.S. 015 Roberto Clemente 	01 	CTT 	- 	- 	- 	17.0 	1.0 	17.0 	17.0 	17.0 	ATS 	NaN 	01 	01M015
    4 	1 	M 	M015 	P.S. 015 Roberto Clemente 	02 	GEN ED 	- 	- 	- 	15.0 	1.0 	15.0 	15.0 	15.0 	ATS 	NaN 	01 	01M015

As you can see, the first few rows all pertain to the same school, which is why the DBN appears more than once. It looks like each school has multiple values for GRADE, PROGRAM TYPE, CORE SUBJECT (MS CORE and 9-12 ONLY), and CORE COURSE (MS CORE and 9-12 ONLY).

If we look at the unique values for GRADE, we get the following:

    array(['0K', '01', '02', '03', '04', '05', '0K-09', nan, '06', '07', '08',
           'MS Core', '09-12', '09'], dtype=object)

Because we're dealing with high schools, we're only concerned with grades 9 through 12. That means we only want to pick rows where the value in the GRADE column is 09-12.

If we look at the unique values for PROGRAM TYPE, we get the following:

    array(['GEN ED', 'CTT', 'SPEC ED', nan, 'G&T'], dtype=object)

Each school can have multiple program types. Because GEN ED is the largest category by far, let's only select rows where PROGRAM TYPE is GEN ED.
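
As a rough illustration of how these checks and filters work, the sketch below builds a small made-up dataframe (the rows are hypothetical; only the column names follow class_size, including the trailing space in the GRADE column name). It lists unique values, counts program types, and keeps only the rows we care about:

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; only the column names mirror class_size.
toy = pd.DataFrame({
    "GRADE ": ["0K", "09-12", "09-12", np.nan, "09-12"],
    "PROGRAM TYPE": ["GEN ED", "GEN ED", "CTT", "SPEC ED", "GEN ED"],
})

# Unique values in a column, including NaN if present.
print(toy["GRADE "].unique())

# Row counts per program type, largest first.
print(toy["PROGRAM TYPE"].value_counts())

# Boolean masks select rows; a comparison with NaN evaluates to False.
mask = (toy["GRADE "] == "09-12") & (toy["PROGRAM TYPE"] == "GEN ED")
filtered = toy[mask]
print(filtered)
```

Combining both conditions in one mask is equivalent to applying the two filters one after the other.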

# 3: Condensing the Class Size Data Set

## Instructions

    Create a new variable called class_size, and assign the value of data["class_size"] to it.
    Filter class_size so the GRADE column only contains the value 09-12. Note that the name of the GRADE column has a space at the end; you'll generate an error if you don't include it.
    Filter class_size so that the PROGRAM TYPE column only contains the value GEN ED.
    Display the first five rows of class_size to verify.



In [6]:
import pandas as pd
data_files = [
    "ap_2010.csv",
    "class_size.csv",
    "demographics.csv",
    "graduation.csv",
    "hs_directory.csv",
    "sat_results.csv"
]
data = {}

for file in data_files:
    df = pd.read_csv( "../data/" + file )
    data[file.split( "." )[0]] = df
    
# Normal string concatenation won't work here: CSD is numeric and needs zero-padding to two digits
def pad_csd(num):
    string_representation = str(num)
    if len(string_representation) > 1:
        return string_representation
    else:
        return string_representation.zfill(2)
    

data['hs_directory']['DBN'] = data['hs_directory']['dbn']

data['class_size']["padded_csd"] = data['class_size']['CSD'].apply( pad_csd )
data['class_size']["DBN"] = data['class_size']["padded_csd"] + data['class_size']["SCHOOL CODE"]
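
As a side note, the padding step can also be written without a helper function. This is only a sketch on a made-up dataframe (the CSD values and school codes are hypothetical), using the pandas string accessor as an alternative to pad_csd:

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration.
toy = pd.DataFrame({
    "CSD": [1, 9, 10, 32],
    "SCHOOL CODE": ["M015", "K123", "X456", "Q789"],
})

# Convert the numeric district to text, then left-pad with zeros to width 2.
toy["padded_csd"] = toy["CSD"].astype(str).str.zfill(2)

# Both columns now hold strings, so + concatenates them element-wise.
toy["DBN"] = toy["padded_csd"] + toy["SCHOOL CODE"]
print(toy["DBN"].tolist())
```

Values that already have two digits are left unchanged by zfill(2), so no length check is needed.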

In [7]:
class_size = data['class_size']
class_size = class_size[ class_size['GRADE '] == '09-12']
class_size = class_size[ class_size['PROGRAM TYPE'] == 'GEN ED']
print( class_size.head( 5 ))

     CSD BOROUGH SCHOOL CODE                                    SCHOOL NAME  \
225    1       M        M292  Henry Street School for International Studies   
226    1       M        M292  Henry Street School for International Studies   
227    1       M        M292  Henry Street School for International Studies   
228    1       M        M292  Henry Street School for International Studies   
229    1       M        M292  Henry Street School for International Studies   

    GRADE  PROGRAM TYPE CORE SUBJECT (MS CORE and 9-12 ONLY)  \
225  09-12       GEN ED                              ENGLISH   
226  09-12       GEN ED                              ENGLISH   
227  09-12       GEN ED                              ENGLISH   
228  09-12       GEN ED                              ENGLISH   
229  09-12       GEN ED                                 MATH   

    CORE COURSE (MS CORE and 9-12 ONLY) SERVICE CATEGORY(K-9* ONLY)  \
225                           English 9                           -  

# 4: Computing Average Class Sizes

As we saw when we displayed class_size on the last screen, DBN still isn't completely unique. This is due to the CORE COURSE (MS CORE and 9-12 ONLY) and CORE SUBJECT (MS CORE and 9-12 ONLY) columns.

CORE COURSE (MS CORE and 9-12 ONLY) and CORE SUBJECT (MS CORE and 9-12 ONLY) seem to pertain to different kinds of classes. For example, here are the unique values for CORE SUBJECT (MS CORE and 9-12 ONLY):

    array(['ENGLISH', 'MATH', 'SCIENCE', 'SOCIAL STUDIES'], dtype=object)

This column only seems to include certain subjects. We want our class size data to include every single class a school offers -- not just a subset of them. What we can do is take the average across all of the classes a school offers. This will give us unique DBN values, while also incorporating as much data as possible into the average.

Fortunately, we can use the pandas.DataFrame.groupby() method to help us with this. The DataFrame.groupby() method will split a dataframe up into unique groups, based on a given column. We can then use the agg() method on the resulting pandas.core.groupby object to find the mean of each column.

Let's say we have this data set:

<img src='classize_table.png'>

Using the groupby() method, we'll split this dataframe into four separate groups -- one with the DBN 01M292, one with the DBN 01M332, one with the DBN 01M378, and one with the DBN 01M448:

<img src='classsize_agg.png'>

Then, we can compute the averages for the AVERAGE CLASS SIZE column in each of the four groups using the agg() method:

<img src='classsize_result.png'>

After we group a dataframe and aggregate data based on it, the column we performed the grouping on (in this case DBN) will become the index, and will no longer appear as a column in the data itself. To undo this change and keep DBN as a column, we'll need to use pandas.DataFrame.reset_index(). This method will reset the index to a list of integers and make DBN a column again.
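
The grouping, aggregation, and index reset described above can be sketched on a small made-up dataframe (the class sizes below are hypothetical). Note one assumption: the sketch passes the string "mean" to agg() instead of numpy.mean, because some recent pandas versions emit a warning for the latter.

```python
import pandas as pd

# Hypothetical toy data using the four DBNs from the example above.
toy = pd.DataFrame({
    "DBN": ["01M292", "01M292", "01M332", "01M378", "01M378", "01M448"],
    "AVERAGE CLASS SIZE": [20.0, 24.0, 30.0, 18.0, 22.0, 25.0],
})

# Split the rows into one group per DBN.
grouped = toy.groupby("DBN")

# Average each column within each group; DBN becomes the index.
averages = grouped.agg("mean")
print(averages.index.name)

# Turn DBN back into an ordinary column with a fresh integer index.
averages = averages.reset_index()
print(averages)
```

On the full class_size dataframe, which also contains text columns, newer pandas versions may require restricting the aggregation to numeric columns, for example with groupby("DBN").mean(numeric_only=True).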

# 5: Computing Average Class Sizes

## Instructions

    Find the average values for each column associated with each DBN in class_size.
        Use the pandas.DataFrame.groupby() method to group class_size by DBN.
        Use the agg() method on the resulting pandas.core.groupby object, along with the numpy.mean() function as an argument, to calculate the average of each group.
        Assign the result back to class_size.
    Reset the index to make DBN a column again.
        Use the pandas.DataFrame.reset_index() method, along with the keyword argument inplace=True.
    Assign class_size back to the class_size key of the data dictionary.
    Display the first few rows of data["class_size"] to verify that everything went OK.

In [14]:
import numpy
test = class_size.groupby( 'DBN' )
test.agg?