# Class Sizes Preliminary EDA / Cleaning Notebook
[Return to project overview](final_project_overview.ipynb)

### Andrew Larimer, Deepak Nagaraj, Daniel Olmstead, Michael Winton (W207-4-Summer 2018 Final Project)

The [2016-2017 NYC Class Size Report](https://www.kaggle.com/marcomarchetti/20162017-nyc-class-size-report) dataset includes the following information:

- number of students
- number of classes
- average class size
- minimum class size
- maximum class size

by School x Program Type x Department x Subject.

It also contains a school-wide pupil-to-teacher ratio.

In [None]:
# import necessary libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import util

# set default options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 20)

%matplotlib inline

In [None]:
# load dataset from CSV
raw_class_sizes_df = pd.read_csv('data_raw/February2017_Avg_ClassSize_School_all.csv')

# look at top-level stats on the dataset
raw_class_sizes_df.info()

In [None]:
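# keep only the rows that are part of the 'MS Core' dataset
# (copy so that later column renames act on an independent dataframe, not a view of the raw data)
class_sizes_df = raw_class_sizes_df[raw_class_sizes_df['Grade Level'] == 'MS Core'].copy()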

# check summary stats after filtering
class_sizes_df.info()

In [None]:
# preview the data
class_sizes_df.head(10)

## Clean up column names

In [None]:
class_sizes_df.columns = [util.sanitize_column_names(c) for c in class_sizes_df.columns]
class_sizes_df.head()
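
The `util.sanitize_column_names` helper lives in the project's `util` module and is not shown in this notebook. As a rough sketch only, a helper of this kind might lowercase each name and replace spaces and punctuation with underscores. The function name `sanitize_column_name_sketch` and its exact rules are assumptions for illustration, not the project's actual implementation.

```python
import re

def sanitize_column_name_sketch(name):
    """Hypothetical stand-in for util.sanitize_column_names (assumed behavior)."""
    # lowercase and trim surrounding whitespace
    cleaned = name.strip().lower()
    # replace each run of non-alphanumeric characters with a single underscore
    cleaned = re.sub(r'[^0-9a-z]+', '_', cleaned)
    # drop any leading or trailing underscores left behind
    return cleaned.strip('_')

print(sanitize_column_name_sketch('School Pupil-Teacher Ratio'))
```

Under these assumed rules, `'Number of Students'` would become `number_of_students`, which matches the column names used in the cells that follow.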

In [None]:
# split out school-level pupil-teacher ratios into a new Series with one row per school
# (the ratio is school-wide, so taking the mean just collapses the repeated values)
ratio_df = class_sizes_df.groupby(['dbn'])['school_pupil_teacher_ratio'].mean()
ratio_df.describe().round(2)
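
Taking the mean per school is only a safe way to collapse the ratio if every row for a school carries the same value. A minimal sketch of how that could be checked, using made-up toy rows (the `toy_df` values are illustrative and not from the dataset):

```python
import pandas as pd

# toy rows: the school-wide ratio repeats on every row for a given school
toy_df = pd.DataFrame({
    'dbn': ['A', 'A', 'A', 'B', 'B'],
    'school_pupil_teacher_ratio': [12.0, 12.0, 12.0, 14.5, 14.5],
})

# count distinct ratio values per school; 1 means the mean is just that value
distinct_per_school = toy_df.groupby('dbn')['school_pupil_teacher_ratio'].nunique()
toy_ratio = toy_df.groupby('dbn')['school_pupil_teacher_ratio'].mean()

print(distinct_per_school)
print(toy_ratio)
```

Running the same `nunique()` check on `class_sizes_df` would show whether any school has conflicting ratio values before they are averaged.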

In [None]:
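# sum num students and classes by school x department (combining across different program types and subjects)
# note: select multiple columns with a list (double brackets); tuple-style selection is not supported in recent pandas
class_stats_df = class_sizes_df.groupby(['dbn','department'])[['number_of_students','number_of_classes']].sum()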

# derive an average class size column
class_stats_df['average_class_size'] = class_stats_df['number_of_students'] / class_stats_df['number_of_classes']

# take a quick look at the output
class_stats_df.head(20).round(2)
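
The average class size is recomputed from the summed students and classes rather than by averaging the per-row averages, because the rows being combined can cover very different numbers of classes. A small sketch with made-up numbers (not from the dataset) shows how the two approaches differ:

```python
import pandas as pd

# toy rows for one school and department (illustrative values only)
toy_df = pd.DataFrame({
    'dbn': ['X', 'X'],
    'department': ['Math', 'Math'],
    'number_of_students': [90, 10],
    'number_of_classes': [3, 1],
    'average_class_size': [30.0, 10.0],
})

# pooled average: total students divided by total classes
totals = toy_df.groupby(['dbn', 'department'])[['number_of_students', 'number_of_classes']].sum()
pooled_avg = (totals['number_of_students'] / totals['number_of_classes']).iloc[0]

# naive alternative: unweighted mean of the per-row averages
mean_of_avgs = toy_df['average_class_size'].mean()

print(pooled_avg, mean_of_avgs)  # 25.0 20.0
```

The pooled figure weights each row by its number of classes, so a single small class does not pull the department average down as much as it would in the unweighted mean.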

In [None]:
# reset to an integer row index so we can pivot.  We want one column per department, not one row.
class_stats_df = class_stats_df.reset_index()

# pivot to get department x stats into columns, not rows.  Note that DBN is now the index
class_stats_pivot_df = class_stats_df.pivot(index='dbn', columns='department')

# take a quick look at the output
class_stats_pivot_df.head(20)
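
When `pivot` is called without a `values` argument, every remaining column is spread across the `columns` values, and the result has two-level column labels of the form (statistic, department). A minimal sketch on toy data (the `toy_long_df` values are made up) shows the shape of the result and one way to flatten the labels:

```python
import pandas as pd

# long format: one row per school x department (illustrative values only)
toy_long_df = pd.DataFrame({
    'dbn': ['A', 'A', 'B', 'B'],
    'department': ['Math', 'Science', 'Math', 'Science'],
    'number_of_students': [100, 80, 60, 50],
    'number_of_classes': [4, 4, 3, 2],
})

# wide format: one row per school, columns labeled by (statistic, department) tuples
toy_wide_df = toy_long_df.pivot(index='dbn', columns='department')
print(list(toy_wide_df.columns))

# flatten each (statistic, department) tuple into a single string label
flat_names = ['_'.join(col).lower() for col in toy_wide_df.columns]
print(flat_names)
```

The notebook itself joins the levels with a space and then passes the result through `util.sanitize_column_names`; the underscore join here is just a shortcut for the toy example.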

In [None]:
# flatten the two-level column labels (statistic, department) generated by the pivot into single strings
class_sizes_cleaned_df = class_stats_pivot_df.copy(deep=False)
class_sizes_cleaned_df.columns = [' '.join(col).strip() for col in class_sizes_cleaned_df.columns.values]

# clean up the new column names
class_sizes_cleaned_df.columns = [util.sanitize_column_names(c) for c in class_sizes_cleaned_df.columns]

# join the class size stats with the school-level pupil-teacher ratio (matched on the dbn index)
class_sizes_cleaned_df = class_sizes_cleaned_df.join(ratio_df)
class_sizes_cleaned_df.head(20)
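
The join above relies on two pandas behaviors: `DataFrame.join` aligns rows on the index (here `dbn`), and a named Series contributes a column with that name. A small sketch with made-up values (the `toy_` objects are illustrative only) shows that row order in the Series does not matter:

```python
import pandas as pd

# per-school stats indexed by dbn (illustrative values only)
toy_stats_df = pd.DataFrame(
    {'average_class_size_math': [25.0, 20.0]},
    index=pd.Index(['A', 'B'], name='dbn'),
)

# per-school ratio as a named Series, deliberately in a different row order
toy_ratio = pd.Series(
    [14.5, 12.0],
    index=pd.Index(['B', 'A'], name='dbn'),
    name='school_pupil_teacher_ratio',
)

# join aligns on the index labels, not on row position
toy_joined_df = toy_stats_df.join(toy_ratio)
print(toy_joined_df)
```

Because the default is a left join, every school in the stats dataframe is kept, and a school missing from the ratio Series would get `NaN` in the new column.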

## Summary stats and histograms of key columns

In [None]:
class_sizes_cleaned_df.describe().round(2)

In [None]:
class_sizes_cleaned_df.hist(column='school_pupil_teacher_ratio')

In [None]:
class_sizes_cleaned_df.hist(column='average_class_size_math')

In [None]:
class_sizes_cleaned_df.hist(column='average_class_size_science')

In [None]:
class_sizes_cleaned_df.hist(column='average_class_size_english')

In [None]:
class_sizes_cleaned_df.hist(column='average_class_size_social_studies')

In [None]:
# check final shape (rows = number of schools)
# should be (494, 13)
class_sizes_cleaned_df.shape
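
The expected 13 columns can be reasoned out from the steps above: three statistics for each department, plus the school-wide ratio. The sketch below assumes the 'MS Core' data contains exactly the four departments plotted in the histograms (English, Math, Science, Social Studies); that count is inferred from this notebook rather than stated by the dataset.

```python
# statistics produced by the groupby/derive step
stats = ['number_of_students', 'number_of_classes', 'average_class_size']

# assumption: these four departments are the only ones in the 'MS Core' rows
departments = ['english', 'math', 'science', 'social_studies']

# one column per statistic x department, plus the pupil-teacher ratio
expected_columns = [f'{s}_{d}' for s in stats for d in departments]
expected_columns.append('school_pupil_teacher_ratio')

print(len(expected_columns))  # 13
```

Comparing `set(expected_columns)` against `set(class_sizes_cleaned_df.columns)` would turn the shape comment into an explicit check.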

In [None]:
# save the cleaned dataset to CSV
class_sizes_cleaned_df.to_csv('data_cleaned/cleaned_class_sizes.csv')