# Class Size Report (2016-2017)

We will also use data from the [Kaggle 2016-2017 NYC Class Size Report](https://www.kaggle.com/marcomarchetti/20162017-nyc-class-size-report).

This dataset includes the following information:
- number of students
- number of classes
- average class size
- minimum class size
- maximum class size

Each of these is reported by School x Program Type x Department x Subject.

It also contains a school-wide pupil-to-teacher ratio.

In [None]:
# import necessary libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# set default options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 200)

%matplotlib inline

In [None]:
# load data
raw_class_size_df = pd.read_csv('February2017_Avg_ClassSize_School_all.csv')

In [None]:
# look at top-level stats on the dataset
raw_class_size_df.info()


## Observation:

Since almost a third of our data is missing Department and Subject-level data, we will aggregate number of students and classes at the school level. We will recalculate average class size from those totals, rather than averaging the provided averages, because a plain mean of the averages would weight every row equally regardless of how many classes it covers.
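To illustrate why recomputing from totals matters, here is a minimal sketch on a toy frame. The DBN `00X000` and all numbers are invented for illustration and do not come from the dataset; only the column names match.

```python
import pandas as pd

# Hypothetical rows for a single school (values invented for illustration)
toy_df = pd.DataFrame({
    'DBN': ['00X000', '00X000'],
    'Number of Students': [90, 10],
    'Number of Classes': [3, 1],
    'Average Class Size': [30.0, 10.0],
})

# Averaging the provided averages weights both rows equally
mean_of_means = toy_df.groupby('DBN')['Average Class Size'].mean()

# Recomputing from totals weights each row by its number of classes
totals = toy_df.groupby('DBN')[['Number of Students', 'Number of Classes']].sum()
recomputed = totals['Number of Students'] / totals['Number of Classes']

print(mean_of_means['00X000'])  # (30 + 10) / 2 = 20.0
print(recomputed['00X000'])     # 100 students / 4 classes = 25.0
```

In this toy case the two approaches differ by five students per class, which is why the totals-based calculation is used in the rest of the notebook.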

In [None]:
# preview the data
raw_class_size_df

In [None]:
# look at a specific school to get a sense of the data
raw_class_size_df[raw_class_size_df.DBN == '01M034']

## Observation:

Some rows contain data per department-subject.  These rows are specified by Grade Level = "MS Core".  Other rows for the same school contain data per grade level, but without subject.  These rows are specified by Grade Level in [K...8].

In [None]:
# count of non-null observations for all columns, by grade level
raw_class_size_df.groupby(['Grade Level']).count()

In [None]:
# mean value for all numeric columns, by grade level
# numeric_only=True skips text columns, which raise a TypeError in pandas >= 2.0
raw_class_size_df.groupby(['Grade Level']).mean(numeric_only=True)

In [None]:
# DEBUGGING / CROSS-CHECKING LOGIC

# Examine data for a few specific schools

# this school reported both ways (by class level and separately by grade)
# numbers appear pretty consistent between the two reporting approaches
# raw_class_size_df[raw_class_size_df.DBN == '01M034']

# this school reported both ways, but numbers aren't consistent between them
# raw_class_size_df[raw_class_size_df.DBN == '01M539']
# raw_class_size_df[raw_class_size_df.DBN == '31R024']

# this school reported only by program type
# raw_class_size_df[raw_class_size_df.DBN == '31R044']

## Observations:

Department data is only listed for "MS Core" and "HS Core" grade levels.  Based on small class sizes, I suspect that "K-8 SC" may be special ed classes.

The original dataset has minimal documentation, except to say that it's the merger of 3 datasets: "K-8 Avg, MS HS Avg, PTR".  PTR must mean pupil-teacher ratio, and appears to have been cleanly joined to all rows, presumably based on DBN.  Based on that description and the above observations re: department/subject columns, I'm going to assume that the "MS HS Avg" dataset is represented here as the "MS Core" and "HS Core" grade levels.  I'll assume that the individual grade levels [K, 1, ... 8] and "K-8 SC" come from the "K-8 Avg" dataset.  Spot checking a few middle school DBNs shows that there are records of both types in our dataset, but I'm not able to reconcile the numbers.

**As a result, I will stick with only the "MS HS Avg" dataset, with its additional PTR joined column.**  Since we only care about middle school for PASSNYC purposes, we only need to keep Grade Level == 'MS Core' (i.e., filter out Grade Level == 'HS Core').

In [None]:
# remove all except the 'MS Core' data
class_size_df = raw_class_size_df[raw_class_size_df['Grade Level'] == 'MS Core']

# we expect to still have multiple rows per school (because of program X department X subject variations)
class_size_df.info()

# we don't need to read too much into these stats, but worth taking a quick look
class_size_df.describe()

In [None]:
# taking another quick look at the dataframe
class_size_df

Next, we temporarily pull out the pupil-teacher ratios into a separate dataframe.  We do this because we'll need to do a groupby and pivot on the remaining columns in order to flatten the department X subject stats into columns.  Afterwards we will rejoin this info.

Note: all pupil-teacher ratio values for each school are identical, so the mean is just a convenient way of grabbing that value.  It's not actually averaging a wider distribution.  For example: `mean(9.0 x n records) = (9.0 x n) / n = 9.0`
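The claim that the ratio is constant within each school can be checked rather than assumed. The following is a minimal sketch on a toy frame; the DBNs and ratios are invented, and the same `nunique` check could be run against `class_size_df`.

```python
import pandas as pd

# Hypothetical rows: two schools, each repeating its school-wide ratio
toy_df = pd.DataFrame({
    'DBN': ['00X000', '00X000', '00X001', '00X001'],
    'School Pupil-Teacher Ratio': [9.0, 9.0, 12.5, 12.5],
})

# Count the distinct ratio values per school (NaN values are ignored)
distinct_ratios = toy_df.groupby('DBN')['School Pupil-Teacher Ratio'].nunique()

# True only if no school has more than one distinct ratio
all_constant = bool((distinct_ratios <= 1).all())
print(all_constant)

# When the check passes, the mean simply returns the repeated value
ratio_by_school = toy_df.groupby('DBN')['School Pupil-Teacher Ratio'].mean()
print(ratio_by_school)
```

If `all_constant` came back `False` on the real data, the mean would be hiding within-school variation and a different aggregation would be needed.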

In [None]:
# split out school-level pupil-teacher ratios (one row per school)
ratio_df = class_size_df.groupby(['DBN'])['School Pupil-Teacher Ratio'].mean()
ratio_df.describe()

In [None]:
# sum students and num classes by school x department (combining across different program types and subjects)
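# select the columns with a list (double brackets); tuple-style selection is not supported in current pandas
class_stats_df = class_size_df.groupby(['DBN','Department'])[['Number of Students','Number of Classes']].sum()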
class_stats_df

In [None]:
# derive an average class size column
class_stats_df['Average Class Size'] = class_stats_df['Number of Students'] / class_stats_df['Number of Classes']
class_stats_df

In [None]:
# reindex so we can pivot
class_stats_df = class_stats_df.reset_index()

# pivot to get department x stats into columns, not rows
class_stats_w_avg_df = class_stats_df.pivot(index='DBN', columns='Department')

In [None]:
# after pivot, we have all of our numbers in columns, with one row per school
class_stats_w_avg_df

## Finally join everything back together in a flattened dataset

In [None]:
# create column names based on the "levels" generated during groupby
class_size_out_df = class_stats_w_avg_df.copy(deep=False)
class_size_out_df.columns = [' '.join(col).strip() for col in class_size_out_df.columns.values]
class_size_out_df.columns
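The pivot and column-flattening steps above can be easier to follow on a toy frame. This is a sketch under assumed data: the DBNs and counts are invented, while the column names and the join expression mirror the cells above.

```python
import pandas as pd

# Hypothetical per-department totals (values invented for illustration)
toy_stats_df = pd.DataFrame({
    'DBN': ['00X000', '00X000', '00X001'],
    'Department': ['Math', 'Science', 'Math'],
    'Number of Students': [100, 80, 60],
    'Number of Classes': [4, 4, 3],
})
toy_stats_df['Average Class Size'] = (
    toy_stats_df['Number of Students'] / toy_stats_df['Number of Classes']
)

# Pivot: one row per school, columns become (statistic, department) pairs
wide_df = toy_stats_df.pivot(index='DBN', columns='Department')

# Flatten the two-level column index into single strings
wide_df.columns = [' '.join(col).strip() for col in wide_df.columns.values]

print(sorted(wide_df.columns))
print(wide_df.loc['00X000', 'Average Class Size Math'])
```

A school with no rows for a department (here `00X001` and Science) ends up with `NaN` in that department's columns, which is the source of the missing values counted in the final observations.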

In [None]:
# join the class size stats with the pupil-teacher ratio
class_size_out_df = class_size_out_df.join(ratio_df)
class_size_out_df

## Next we'll plot the key histograms

In [None]:
class_size_out_df.hist(column='School Pupil-Teacher Ratio')

In [None]:
class_size_out_df.hist(column='Average Class Size Math')

In [None]:
class_size_out_df.hist(column='Average Class Size Science')

In [None]:
class_size_out_df.hist(column='Average Class Size English')

In [None]:
class_size_out_df.hist(column='Average Class Size Social Studies')

## Write the cleaned and flattened dataset to disk as a csv file

In [None]:
class_size_out_df.to_csv('class_size_cleaned.csv')


In [None]:
class_size_out_df.info()

## Final observations about `class_size_cleaned.csv`

- we have data for 494 middle schools
- there are no duplicate entries (manually confirmed)
- we have pupil-teacher ratio for all schools
- we have avg science class size for all schools
- we are missing avg math class size for only one school
- we are missing avg English class size for 10 schools
- we are missing avg Social Studies class size for 20 schools
- all class size data and pupil-teacher ratio data appear approximately normally distributed, judging by the histograms above
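The school count, duplicate check, and missing-value counts listed above could also be confirmed programmatically instead of by hand. Here is a minimal sketch on a toy frame indexed by DBN; the values are invented, and the same three expressions could be applied to `class_size_out_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical flattened output with one row per school, indexed by DBN
toy_out_df = pd.DataFrame(
    {
        'Average Class Size Math': [25.0, np.nan, 28.0],
        'Average Class Size Science': [24.0, 27.0, 30.0],
        'School Pupil-Teacher Ratio': [9.0, 12.5, 14.0],
    },
    index=pd.Index(['00X000', '00X001', '00X002'], name='DBN'),
)

# Number of schools is the number of rows
n_schools = len(toy_out_df)

# True if any DBN appears more than once in the index
has_duplicate_dbn = bool(toy_out_df.index.duplicated().any())

# Count of missing values in each column
missing_per_column = toy_out_df.isna().sum()

print(n_schools, has_duplicate_dbn)
print(missing_per_column)
```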

If we assume schools keep class sizes fairly similar across subjects, then we could treat the avg science class size (100% complete data) as a proxy for the school's overall class size.  Since most of the specialized schools are STEM-focused, this also seems a reasonable reduction of dimensionality.
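One way to test the proxy assumption is to correlate science class size with the other subjects. The sketch below uses invented values on a toy frame; a high correlation on the real `class_size_out_df` would support the assumption, while a low one would argue against it.

```python
import numpy as np
import pandas as pd

# Hypothetical per-school averages (values invented for illustration)
toy_df = pd.DataFrame({
    'Average Class Size Science': [24.0, 27.0, 30.0, 22.0],
    'Average Class Size Math': [25.0, 28.0, 31.0, 23.0],
    'Average Class Size English': [26.0, np.nan, 29.0, 25.0],
})

other_subjects = ['Average Class Size Math', 'Average Class Size English']

# Pearson correlation of each subject with science; rows with a
# missing value are dropped separately for each pair of columns
corr_with_science = toy_df[other_subjects].corrwith(
    toy_df['Average Class Size Science']
)
print(corr_with_science)
```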