# Using machine learning to prioritize NYC middle schools for intervention, with an intent to increase diverse student registrations for the specialized high school admissions test

### Andrew Larimer, Deepak Nagaraj, Daniel Olmstead, Michael Winton (W207-4-Summer 2018 Final Project)


Paraphrased from the Kaggle [PASSNYC: Data Science for Good Challenge](https://www.kaggle.com/passnyc/data-science-for-good) overview (June 2018):

## Problem Statement

PASSNYC is a not-for-profit organization dedicated to broadening educational opportunities for New York City's talented and underserved students. In recent years, the City’s specialized high schools - institutions with historically transformative impact on student outcomes - have seen a shift toward more homogeneous student body demographics.  PASSNYC aims to increase the diversity of students taking the Specialized High School Admissions Test (SHSAT). By focusing efforts in underperforming areas that are historically underrepresented in SHSAT registration, we will help pave the path to specialized high schools for a more diverse group of students.

PASSNYC and its partners provide outreach services that improve the chances of students taking the SHSAT and receiving placements in these specialized high schools. The current process of identifying schools is effective, but PASSNYC could have an even greater impact with a more informed, granular approach to quantifying the potential for outreach at a given school. Proxies that have been good indicators of these types of schools include data on English Language Learners, Students with Disabilities, Students on Free/Reduced Lunch, and Students with Temporary Housing.

Part of this challenge is to assess the needs of students by using publicly available data to quantify the challenges they face in taking the SHSAT. The best solutions will enable PASSNYC to identify the schools where minority and underserved students stand to gain the most from services like after school programs, test preparation, mentoring, or resources for parents.

More on [PASSNYC](http://www.passnyc.org/opportunity-explorer/).


## Overview of our approach

Our approach, in outline:
- Treat the task as a classification problem.
- Proposed "success" class label: the number of SHSAT registrants normalized by school size.
- Try multiple algorithms (logistic regression, SVM, KNN, random forest, etc.).
- Possibly apply PCA for dimensionality reduction, since many features are likely correlated.
- Use cross-validation to estimate accuracy (and/or F-score, precision, recall, etc.).
- Once we have a "best" model, use it to predict whether the remaining ~700 schools would be expected to have SHSAT registrations.  We would then rank those schools by decreasing level of diversity (and possibly decreasing economic need), in alignment with the PASSNYC mission.  This ranking is the end report that PASSNYC is asking for.
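
The labeling and cross-validation steps above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not our final pipeline: the column names (`grade_8_enrollment`, `num_registrants`, and so on), the median cutoff for the class label, and the choice of logistic regression are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; every column name here is hypothetical.
rng = np.random.default_rng(0)
n_schools = 200
schools = pd.DataFrame({
    "grade_8_enrollment": rng.integers(50, 400, n_schools),
    "attendance_rate": rng.uniform(0.80, 0.99, n_schools),
    "avg_math_score": rng.uniform(2.0, 4.0, n_schools),
})
# Simulate registrations that depend partly on the math score.
sim_rate = np.clip(0.1 * schools["avg_math_score"]
                   + rng.normal(0, 0.05, n_schools), 0, 1)
schools["num_registrants"] = (
    sim_rate * schools["grade_8_enrollment"]).round().astype(int)

# Class label: registrants normalized by school size, then thresholded.
# The median cutoff is an assumption chosen only for illustration.
schools["registration_rate"] = (
    schools["num_registrants"] / schools["grade_8_enrollment"])
threshold = schools["registration_rate"].median()
schools["high_registration"] = (
    schools["registration_rate"] > threshold).astype(int)

X = schools[["attendance_rate", "avg_math_score"]]
y = schools["high_registration"]

# 5-fold cross-validation reporting several metrics at once.
model = make_pipeline(StandardScaler(), LogisticRegression())
metrics = ["accuracy", "precision", "recall", "f1"]
scores = cross_validate(model, X, y, cv=5, scoring=metrics)
for metric in metrics:
    print(f"{metric}: {scores['test_' + metric].mean():.3f}")
```

Scaling inside the pipeline keeps each cross-validation fold from seeing statistics computed on its held-out schools.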


In [1]:
# import necessary libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# set default options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 200)

%matplotlib inline

## Datasets

We joined several datasets together for this project. We focused mainly on data from the 2016-2017 school year as predictors for the 2017 SHSAT, which is taken in the fall.

### 1. School Explorer 2016

The [2016 School Explorer](https://www.kaggle.com/passnyc/data-science-for-good#2016%20School%20Explorer.csv) dataset was provided by PASSNYC on Kaggle.  It contains key information about every school in NYC, such as name, location (address and latitude/longitude), grades, budget, demographics, ratings, and the number of students who test at level 4 in Math and ELA, broken down by demographic and economic group.  Sample row:

In [44]:
se_2016 = pd.read_csv('2016_school_explorer.csv')
# Filter the schools that have an 8th grade
se_2016_8th = se_2016[se_2016['Grades'].str.contains('08')]
# Show a sample row
se_2016_sample = pd.DataFrame(se_2016_8th.iloc[0,])
# Keep this school's Location Code (same format as DBN) to look it up in the other datasets
sample_idx = se_2016_8th.iloc[0]['Location Code']
se_2016_sample

Unnamed: 0,3
Adjusted Grade,
New?,
Other Location Code in LCGMS,
School Name,P.S. 034 FRANKLIN D. ROOSEVELT
SED Code,310100010034
Location Code,01M034
District,1
Latitude,40.7261
Longitude,-73.975
Address (Full),"730 E 12TH ST NEW YORK, NY 10009"


In [13]:
se_2016.shape

(1272, 161)

### 2. NYC SHSAT Test Results 2017

The [NYC SHSAT Test Results 2017](https://www.kaggle.com/willkoehrsen/nyc-shsat-test-results-2017/home) dataset contains data from the New York Times article ["See Where New York City’s Elite High Schools Get Their Students"](https://www.nytimes.com/interactive/2018/06/29/nyregion/nyc-high-schools-middle-schools-shsat-students.html) by Jasmine Lee, published June 29, 2018.  The data was parsed and uploaded to Kaggle by [Richard W DiSalvo](https://www.kaggle.com/rdisalv2).  It covers schools with students eligible to take the SHSAT and includes the number of students who took the test, the number of resulting offers, and the percentage of Black/Hispanic students at the school (i.e., among all students, not just test-takers).

In [47]:
shsat_2017 = pd.read_csv('nytdf.csv')
# Show a sample row
shsat_2017_sample = pd.DataFrame(shsat_2017.loc[shsat_2017['DBN'] == sample_idx]).T
shsat_2017_sample

Unnamed: 0,172
DBN,01M034
DataName,P.S. 034 FRANKLIN D. ROOSEVELT
SchoolName1,Public School 34
SchoolName2,The Franklin D. Roosevelt School
NumSHSATTestTakers,6
NumSpecializedOffers,0
OffersPerStudent,0
PctBlackOrHispanic,93%


In [15]:
shsat_2017.shape

(589, 8)

### 3. NYC Class Size Report 2016-2017

The [2016-2017 NYC Class Size Report](https://www.kaggle.com/marcomarchetti/20162017-nyc-class-size-report) dataset originally came from the [NYC Schools website](http://schools.nyc.gov/AboutUs/schools/data/classsize/classsize_2017_2_15.htm), but is no longer available there.  It was parsed and uploaded to Kaggle by [Marco Marchetti](https://www.kaggle.com/marcomarchetti).  It is a merge of three datasets: "K-8 Avg, MS HS Avg, PTR".  The "MS HS Avg" subset gives the average class size by program, department, and subject for each school.  The "PTR" data gives the pupil-teacher ratio for the school.

In [54]:
nyccs_2017 = pd.read_csv('February2017_Avg_ClassSize_School_all.csv')
# Show a sample row
nyccs_2017_sample = pd.DataFrame(nyccs_2017.loc[nyccs_2017['DBN'] == sample_idx].iloc[0,])
nyccs_2017_sample

Unnamed: 0,0
DBN,01M034
School Name,P.S. 034 FRANKLIN D. ROOSEVELT
Grade Level,MS Core
Program Type,Gen Ed
Number of Students,73
Number of Classes,3
Average Class Size,24.3
Minimum Class Size,14
Maximum Class Size,30
Department,English


In [17]:
nyccs_2017.shape

(31713, 12)

### 4. Demographic Snapshot School 2013-2018

This [2013-2018 Demographic Snapshot of NYC Schools](https://data.cityofnewyork.us/Education/2013-2018-Demographic-Snapshot-School/s52a-8aq6) was downloaded directly from the NYC Open Data project. It contains grade-level enrollments for each school.

In [56]:
dss = pd.read_csv('doe_demographic_snapshot_school.csv')
# Filter to just the 2016-17 school year
dss_2017 = dss[dss['Year']=='2016-17']
# Show a sample row
dss_2017_sample = pd.DataFrame(dss_2017.loc[dss_2017['DBN'] == sample_idx].iloc[0,])
dss_2017_sample

Unnamed: 0,18
DBN,01M034
School Name,P.S. 034 Franklin D. Roosevelt
Year,2016-17
Total Enrollment,350
Grade PK (Half Day & Full Day),13
Grade K,21
Grade 1,24
Grade 2,37
Grade 3,31
Grade 4,29


## Preliminary EDA and Data Cleaning

These datasets required varying degrees of cleaning before they could be joined together.  The EDA and cleaning of each dataset was done in a separate notebook, and the results were saved as CSV files.

1. [School Explorer EDA Notebook](prep_explorer.ipynb)
2. [SHSAT Results Notebook](prep_shsat_results.ipynb)
3. [Class Size Notebook](prep_class_sizes.ipynb)
4. [Demographic Snapshot Notebook](prep_demographic.ipynb)

Next, we load the resulting CSV files, join them into one master dataset, and save it as [combined_data.csv](combined_data.csv).

In [1]:
# Load CSV files

# Join CSV files

# Re-save master CSV file
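
Until the cell above is filled in, the join might look something like the following sketch. It uses tiny in-memory frames in place of the cleaned CSV files, whose actual file names and columns are not shown here. All values and the column names `economic_need_index`, `num_test_takers`, and `avg_class_size` are made up for illustration. The one detail taken from the data above is that the School Explorer's `Location Code` uses the same format as `DBN` in the other datasets.

```python
import pandas as pd

# Hypothetical stand-ins for the cleaned per-dataset CSV files.
explorer = pd.DataFrame({
    "Location Code": ["01M034", "01M140", "01M184"],
    "economic_need_index": [0.84, 0.79, 0.45],   # made-up values
})
shsat = pd.DataFrame({
    "DBN": ["01M034", "01M140"],
    "num_test_takers": [6, 11],                  # made-up values
})
class_size = pd.DataFrame({
    "DBN": ["01M034", "01M140", "01M184"],
    "avg_class_size": [24.3, 26.1, 27.8],        # made-up values
})

# Use DBN as the common key across all datasets.
explorer = explorer.rename(columns={"Location Code": "DBN"})

# Left joins keep every School Explorer row; validate= raises an error
# if a key is unexpectedly duplicated on either side.
combined = (
    explorer
    .merge(shsat, on="DBN", how="left", validate="one_to_one")
    .merge(class_size, on="DBN", how="left", validate="one_to_one")
)
print(combined)

# Saving the master file would then be:
# combined.to_csv("combined_data.csv", index=False)
```

Schools without SHSAT results end up with missing values in the SHSAT columns, which makes it easy to separate labeled schools from those we later want predictions for.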

## Highlights of Univariate and Bivariate EDA

**TODO: complete this section**

## Model Building

Next, we apply a variety of machine learning techniques to determine which provide the best classification results.  Each technique is performed in a separate notebook.  In each case, we also evaluate the effects of using PCA for dimensionality reduction.

1. [Decision Trees](model_trees.ipynb)
2. [Random Forests](model_forests.ipynb)
3. [Logistic Regression](model_logreg.ipynb)
4. [SVMs](model_svm.ipynb)
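
As a rough illustration of how each technique can be compared with and without PCA, here is a minimal sketch using scikit-learn on a synthetic dataset with deliberately redundant features. The hyperparameters, the 95% explained-variance setting for PCA, and the data itself are assumptions for the example, not the settings used in the notebooks listed above.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with many correlated (redundant) features.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_redundant=10, random_state=0)

classifiers = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "svm": SVC(),
}

results = {}
for name, clf in classifiers.items():
    plain = make_pipeline(StandardScaler(), clf)
    # Keep enough principal components to explain 95% of the variance.
    with_pca = make_pipeline(
        StandardScaler(), PCA(n_components=0.95, svd_solver="full"), clf)
    # cross_val_score clones the pipeline for each fold, so sharing clf is safe.
    results[name] = {
        "no_pca": cross_val_score(plain, X, y, cv=5).mean(),
        "pca": cross_val_score(with_pca, X, y, cv=5).mean(),
    }

for name, accs in results.items():
    print(f"{name:20s} no PCA: {accs['no_pca']:.3f}   PCA: {accs['pca']:.3f}")
```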

**TODO: add more if needed.  I like the idea of 4 since there are 4 of us, but we don't have to stick with it.**

## Conclusion

**TODO: describe findings, make a table of results, etc.**

## Further Work

The original PASSNYC challenge requested a prioritized list of schools that would help meet its objective of increasing SHSAT registrations by diverse students.  Since the current applicants are predominantly white, we chose not to train our models on demographic factors such as race, in order to avoid reinforcing the current bias (i.e., we didn't want a model to suggest emphasizing schools with a high percentage of white students simply because that percentage correlates with SHSAT registrations).  Instead, we tried to train a "race-blind" classifier.  Once the full set of NYC schools had been run through our model to generate predictions, we ranked the resulting schools by these diversity factors (in descending order).
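
The train-then-rank idea can be sketched as follows, under several assumptions: the data is synthetic, the school identifiers and column names (`pct_black_hispanic`, `economic_need_index`, etc.) are hypothetical, the random forest is just one possible classifier, and restricting the ranking to schools the model flags as positive is only one possible reading of the procedure.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; identifiers and column names are hypothetical.
rng = np.random.default_rng(1)
n = 120
schools = pd.DataFrame({
    "DBN": [f"S{i:03d}" for i in range(n)],
    "attendance_rate": rng.uniform(0.80, 0.99, n),
    "avg_ela_score": rng.uniform(2.0, 4.0, n),
    "pct_black_hispanic": rng.uniform(0.0, 1.0, n),
    "economic_need_index": rng.uniform(0.0, 1.0, n),
})

# Pretend the first 80 schools have known labels; the rest need predictions.
labeled = schools.iloc[:80].copy()
labeled["high_registration"] = (labeled["avg_ela_score"] > 3.0).astype(int)
unlabeled = schools.iloc[80:].copy()

# "Race-blind" training: demographic columns are left out of the features.
training_features = ["attendance_rate", "avg_ela_score"]
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(labeled[training_features], labeled["high_registration"])

# Predict for the remaining schools.
unlabeled["predicted"] = clf.predict(unlabeled[training_features])

# Assumption: shortlist the predicted-positive schools, then rank them by
# the diversity factors (descending), using demographics only at this stage.
candidates = unlabeled[unlabeled["predicted"] == 1]
ranked = candidates.sort_values(
    ["pct_black_hispanic", "economic_need_index"], ascending=False)
top_20 = ranked.head(20)
print(top_20[["DBN", "pct_black_hispanic", "economic_need_index"]])
```

Keeping the demographic columns out of `training_features` and using them only in the final sort is what separates the prediction step from the prioritization step.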

The results show that the top 20 schools that we'd recommend PASSNYC engage with are... **TODO: complete this**