# SHSAT Test Results Preliminary EDA / Cleaning Notebook
[Return to project overview](final_project_overview.ipynb)

### Andrew Larimer, Deepak Nagaraj, Daniel Olmstead, Michael Winton (W207-4-Summer 2018 Final Project)

In this notebook, we will prepare the "training data" needed to run and validate our classifier.

## About the datasets

We use two datasets to create the training labels.

* [New York Times dataset](https://www.kaggle.com/willkoehrsen/nyc-shsat-test-results-2017)
* [New York Department of Education dataset](https://data.cityofnewyork.us/Education/2013-2018-Demographic-Snapshot-School/s52a-8aq6)

The first comes from a NYT article about lack of diversity in students who attend New York's specialized high schools.  For a number of New York schools, it gives us information about how many students took the SHSAT, and what the school's racial composition is.

The second dataset gives us information about how many students were enrolled in specific grades.

## Reading data

First, let us read the datasets.

In [None]:
import pandas as pd
import numpy as np
import re
import util

school_df = pd.read_csv('data_raw/2016_school_explorer.csv')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 5)

nyt_df = pd.read_csv('data_raw/nytdf.csv')
doe_school_df = pd.read_csv('data_raw/doe_demographic_snapshot_school.csv')

## NYTimes School Data

Let us have a first look at NYTimes data.

In [None]:
nyt_df.head()

### Cleanup

We will now use our utility functions to clean up some columns and column names for easier analysis.  After this, we will have another look.

In [None]:
# Remove percent and convert to float
percent_columns = [
    'OffersPerStudent',
    'PctBlackOrHispanic',
]
for col in percent_columns:
    nyt_df[col] = util.pct_to_number(nyt_df, col)
nyt_df.columns = [util.sanitize_column_names(c) for c in nyt_df.columns]
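
The `util` module is not shown in this notebook, so the sketch below is only a guess at what the two helpers might do. The names `pct_to_number_sketch` and `sanitize_column_name_sketch` are hypothetical stand-ins, and their behavior is inferred from how the columns look after cleanup (for example, `OffersPerStudent` becoming `offersperstudent`).

```python
import re
import pandas as pd

def pct_to_number_sketch(df, col):
    """Hypothetical stand-in: strip a trailing '%' and convert to float."""
    stripped = df[col].astype(str).str.rstrip('%')
    # errors='coerce' turns anything unparseable into NaN instead of raising
    return pd.to_numeric(stripped, errors='coerce')

def sanitize_column_name_sketch(name):
    """Hypothetical stand-in: lowercase, non-alphanumerics become underscores."""
    cleaned = re.sub(r'[^0-9a-zA-Z]+', '_', name.strip())
    return cleaned.strip('_').lower()

# Toy data with made-up values
demo = pd.DataFrame({'PctBlackOrHispanic': ['12%', '95.5%']})
demo['PctBlackOrHispanic'] = pct_to_number_sketch(demo, 'PctBlackOrHispanic')
demo.columns = [sanitize_column_name_sketch(c) for c in demo.columns]
print(demo)
```

The real helpers in `util` may differ in detail; this only illustrates the kind of transformation the cell above relies on.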

Let us now get an overview of the NYTimes data.

In [None]:
nyt_df.head()

## DoE Demographics Data

We notice that the NYTimes dataset does not have school enrollment information.

We will look at the DoE data and clean it up as needed.

In [None]:
doe_school_df.head()

### Cleanup

In [None]:
doe_school_df.columns = [util.sanitize_column_names(c) for c in doe_school_df.columns]
doe_school_df.head()

### Filtering data

We will filter the DoE data down to the 2017-18 school year, which matches the SHSAT data above, and do some cleanup.  We use 2017-18 because the test is given at the very beginning of the school year:

> Registration is September 7-October 12, 2017... In 2017, tests were given October 21, 22; October 29, and November 4.

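We can also narrow the enrollment information to grades 8 and 9, since those are the eligible grades (the cell below also keeps grade 7 enrollment, which is carried into the output file):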

> All students in grades eight and nine who are current New York City residents are eligible. [Source](https://www.schools.nyc.gov/school-life/learning/testing/specialized-high-school-admissions-test)

In [None]:
shsat_eligible_class_size_df = doe_school_df \
    .query("year == '2017-18'") \
    [['dbn', 'grade_7', 'grade_8', 'grade_9']]
shsat_eligible_class_size_df.head()

### Combining data

We will now combine the enrollment data with the NYTimes data.

In [None]:
combined_df = nyt_df.merge(shsat_eligible_class_size_df, on='dbn', how='left')
combined_df.head()
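
A left merge silently keeps NYTimes rows that have no match in the DoE data, and it duplicates rows if a `dbn` appears more than once on the right. As a minimal sketch on toy data (the values below are made up), the `validate` and `indicator` parameters of `DataFrame.merge` can surface both problems:

```python
import pandas as pd

# Toy stand-ins for nyt_df and shsat_eligible_class_size_df (made-up values)
left = pd.DataFrame({'dbn': ['01M001', '02M002', '03M003'],
                     'numshsattesttakers': [10, 20, 5]})
right = pd.DataFrame({'dbn': ['01M001', '02M002'],
                      'grade_8': [50, 60],
                      'grade_9': [0, 0]})

# validate='many_to_one' raises if 'dbn' is duplicated in the right frame;
# indicator=True adds a '_merge' column recording where each row came from
merged = left.merge(right, on='dbn', how='left',
                    validate='many_to_one', indicator=True)

# Schools present in the left frame with no enrollment row
unmatched = merged.loc[merged['_merge'] == 'left_only', 'dbn'].tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would confirm that every school in the NYTimes data has enrollment figures.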

Let us run some quick sanity checks.

In [None]:
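# Rows with missing grade 8 or grade 9 enrollment
# (using the fact that np.nan != np.nan)
display(combined_df.query('grade_8 != grade_8 | grade_9 != grade_9'))
# Rows where test takers exceed the eligible enrollment
display(combined_df.query('numshsattesttakers > grade_8 + grade_9'))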
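
The `x != x` trick works because NaN is the only value that is not equal to itself. As a sketch on toy data (made-up values), the same rows can be found more explicitly with `isna()`, and the second check can be written as a boolean mask:

```python
import numpy as np
import pandas as pd

# Toy data (made-up values), using the notebook's column names
toy = pd.DataFrame({
    'numshsattesttakers': [10, 70, 5],
    'grade_8': [50.0, 60.0, np.nan],
    'grade_9': [0.0, 0.0, 12.0],
})

# NaN-detection via self-inequality, as in the cell above
missing_via_trick = toy.query('(grade_8 != grade_8) | (grade_9 != grade_9)')

# Equivalent, more explicit form
missing_via_isna = toy[toy[['grade_8', 'grade_9']].isna().any(axis=1)]

# More test takers than eligible students; comparisons against NaN are
# False, so rows with missing enrollment never show up in this check
too_many = toy[toy['numshsattesttakers'] > toy['grade_8'] + toy['grade_9']]

print(missing_via_isna.index.tolist(), too_many.index.tolist())
```

Because comparisons with NaN evaluate to False, the second check only makes sense once the first has confirmed there are no missing values.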

Looks good.  Both checks come back empty: there are no missing enrollment values, and no school has more test takers than eligible students.

### Percent test takers

We will define our outcome variable as the fraction of eligible students (grades 8 and 9) at each school who took the test.  PASSNYC would want this fraction to be high at every school, so we want to identify and emulate the schools where it already is.

Let us add a column for the percentage of test takers and look at its distribution.

In [None]:
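# Percentage of eligible (grade 8 + grade 9) students who took the SHSAT.
# Note: astype('int') truncates the fractional part, so 38.9 becomes 38.
combined_df['pct_test_takers'] = (combined_df['numshsattesttakers'] * 100
                / (combined_df['grade_8'] + combined_df['grade_9'])).astype('int')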
combined_df[['dataname', 'pct_test_takers']].head()
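
The cast to `int` above works here because the sanity checks found no missing enrollments, but it would raise if the division produced NaN or infinity (for example, a school with zero eligible students). A more defensive variant, sketched on toy data with made-up values, might look like this:

```python
import numpy as np
import pandas as pd

# Toy data (made-up values), using the notebook's column names
toy = pd.DataFrame({
    'numshsattesttakers': [77, 5, 3],
    'grade_8': [120.0, 0.0, np.nan],
    'grade_9': [80.0, 0.0, 10.0],
})

eligible = toy['grade_8'] + toy['grade_9']
raw_pct = toy['numshsattesttakers'] * 100 / eligible

# Division by zero yields inf; treat it as missing rather than a huge value
raw_pct = raw_pct.replace([np.inf, -np.inf], np.nan)

# floor matches int truncation for non-negative values; the nullable
# 'Int64' dtype keeps missing entries as <NA> instead of raising
pct_test_takers = np.floor(raw_pct).astype('Int64')
print(pct_test_takers.tolist())
```

This is an optional hardening step, not something the notebook's results depend on.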

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

plt.hist(combined_df['pct_test_takers'], bins=20)
plt.show()

In [None]:
plt.boxplot(combined_df['pct_test_takers'])
plt.show()

In [None]:
# calculate 75th percentile (corresponding to top of box plot)
pct75 = np.percentile(combined_df['pct_test_takers'], q=75)
pct75

### Defining success label

From the plots above, it looks like we have a sharp drop at around 40%.  Calculating the 75th percentile, which corresponds to the top of the box in the box plot, we see it's actually 38.0.  We will use that as a threshold, and label any school that has >38.0% registrations (i.e., above the 75th percentile) as "successful".

Let us also look at the distribution of offers per student.

In [None]:
plt.hist(combined_df['offersperstudent'], bins=20)
plt.show()

In [None]:
combined_df['high_registrations'] = (combined_df['pct_test_takers'] > pct75).astype('int')
combined_df.head()
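
To make the labeling rule concrete, here is a minimal sketch on a toy series (made-up values, not the real data). It shows that the strict `>` comparison leaves a school sitting exactly at the 75th percentile with a label of 0:

```python
import numpy as np
import pandas as pd

# Toy percentages (made-up values)
toy_pct = pd.Series([10, 20, 30, 40, 50])

# Default linear interpolation: the 75th percentile of these five values
threshold = np.percentile(toy_pct, q=75)

# Strictly greater than the threshold gets label 1, everything else 0
labels = (toy_pct > threshold).astype('int')
print(threshold, labels.tolist())
```

Because the comparison is strict and the real percentages are truncated to integers, ties at the threshold can push the share of positive labels below 25%.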

In [None]:
out_df = combined_df[['dbn', 'grade_7', 'numshsattesttakers', 'offersperstudent', 'pct_test_takers', 'high_registrations']]
# manually add underscores to column names for consistency
out_df.columns = ['dbn', 'grade_7_enrollment', 'num_shsat_test_takers', 'offers_per_student', 'pct_test_takers', 'high_registrations']
# check final shape (rows = number of schools)
out_df.shape

Finally, we will save the result as a CSV file.

In [None]:
out_df.to_csv('data_cleaned/cleaned_shsat_outcomes.csv', index=False)
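
As a quick check that the export behaves as expected, the sketch below round-trips a toy frame (made-up values) through an in-memory buffer instead of a file. With `index=False`, the row index is not written, so the reloaded frame has exactly the original columns:

```python
import io
import pandas as pd

# Toy stand-in for out_df (made-up values)
toy_out = pd.DataFrame({
    'dbn': ['01M001', '02M002'],
    'pct_test_takers': [38, 12],
    'high_registrations': [0, 0],
})

# Write to an in-memory buffer rather than to disk
buffer = io.StringIO()
toy_out.to_csv(buffer, index=False)

# Rewind and read back
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded.columns.tolist())
```

The same check could be run against `data_cleaned/cleaned_shsat_outcomes.csv` after the cell above.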