# SHSAT Test Results Merge Notebook
[Return to project overview](final_project_overview.ipynb)

### Andrew Larimer, Deepak Nagaraj, Daniel Olmstead, Michael Winton (W207-4-Summer 2018 Final Project)

In this notebook, we merge the data cleaned by the other "prep_" notebooks into a single CSV file.

### Importing dataframes, indexed by our primary key
While school names may change or be entered inconsistently, each school has a unique identifier, the DBN (sometimes referred to as a Location Code). By importing each cleaned dataset with the DBN as its index, we can easily join them into a single merged dataset.

In [7]:
import pandas as pd
import datetime
import re

In [2]:
# Load all datasets from CSV, using the DBN column as the index.
# Note: read_csv does not enforce index uniqueness, so we check for duplicates in the next section.
shsat_df = pd.read_csv('data_cleaned/cleaned_shsat_outcomes.csv', index_col="dbn")
print('SHSAT dataset:', shsat_df.shape)  # confirm that it's (589, 4); the index column is not counted

class_sizes_df = pd.read_csv('data_cleaned/cleaned_class_sizes.csv', index_col="DBN")
print('Class size dataset:', class_sizes_df.shape)  # confirm that it's (494, 13)

explorer_df = pd.read_csv('data_cleaned/cleaned_explorer.csv', index_col="dbn")
print('Explorer dataset:', explorer_df.shape)  # confirm that it's (596, 49)

SHSAT dataset: (589, 4)
Class size dataset: (494, 13)
Explorer dataset: (596, 49)


### Checking for duplicate entries
We do a quick check to make sure there are no duplicate entries.

In [3]:
shsat_dups = shsat_df.index.duplicated()
class_sizes_dups = class_sizes_df.index.duplicated()
explorer_dups = explorer_df.index.duplicated()
                            
print("True or False: there are duplicated indices within any dataframes?")
print("{0}.".format(bool(sum(shsat_dups) + sum(class_sizes_dups) + sum(explorer_dups))))

True or False: there are duplicated indices within any dataframes?
False.
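
As a cross-check on the approach above, pandas also exposes `Index.is_unique`, which answers the same question per dataframe. The sketch below runs on toy frames; the DBN-like labels and the `score` column are made up for illustration and are not taken from our datasets.

```python
import pandas as pd

# Toy frames standing in for the cleaned datasets (labels and values are illustrative only)
clean_df = pd.DataFrame({"score": [1, 2, 3]}, index=["01M001", "01M002", "01M003"])
dirty_df = pd.DataFrame({"score": [1, 2, 3]}, index=["01M001", "01M001", "01M003"])

# is_unique is True only when no index label repeats
clean_is_unique = clean_df.index.is_unique
dirty_is_unique = dirty_df.index.is_unique

# duplicated(keep=False) flags every row that shares a label, which helps when inspecting offenders
repeated_labels = sorted(set(dirty_df.index[dirty_df.index.duplicated(keep=False)]))
print(clean_is_unique, dirty_is_unique, repeated_labels)
```

Checking each dataframe separately makes it easier to tell which source file would need another cleaning pass if a duplicate ever appeared.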


### Inner joins for more complete data
We'll use inner joins to take the intersection of our datasets, keeping only the schools that appear in all three dataframes.

In [14]:
merged_df = shsat_df.join(explorer_df, how="inner")
merged_df = merged_df.join(class_sizes_df, how="inner")
print("Merged Dataframe shape:",merged_df.shape)

Merged Dataframe shape: (464, 66)


The inner joins leave us with a merged dataframe of 464 rows (schools) and 66 features, down from the 589 schools in the SHSAT dataset.
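
To make the effect of the inner join concrete, here is a minimal sketch on two toy frames. The index labels and the columns `a` and `b` are hypothetical; the point is only that labels missing from either side are dropped, and that `Index.difference` can list which ones.

```python
import pandas as pd

# Two toy frames sharing a "dbn" index (labels and values are illustrative only)
left = pd.DataFrame({"a": [1, 2, 3]}, index=pd.Index(["A", "B", "C"], name="dbn"))
right = pd.DataFrame({"b": [10, 30, 40]}, index=pd.Index(["A", "C", "D"], name="dbn"))

# An inner join keeps only the labels present in both frames
inner = left.join(right, how="inner")

# Labels that the join drops from each side
only_left = list(left.index.difference(right.index))
only_right = list(right.index.difference(left.index))

print(inner)
print("Dropped from left:", only_left, "| dropped from right:", only_right)
```

Applied to our dataframes, the same `difference` call would show which DBNs fall out of the merge and from which source.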

### Sanitizing column names
To have consistent naming conventions across the column names, we'll lowercase them, strip punctuation, replace spaces and hyphens with underscores, and spell out "%" as "percent".

In [15]:
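# Spaces and punctuation in column names are awkward to work with,
# so normalize each name to lowercase with underscores.
def sanitize_column_names(c):
    c = c.lower()
    c = re.sub(r'[?,()/]', '', c)   # drop punctuation
    c = re.sub(r'\s-\s', '_', c)    # " - " separators become a single underscore
    c = re.sub(r'[ -]', '_', c)     # remaining spaces and hyphens become underscores
    c = c.replace('%', 'percent')
    return c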

merged_df.columns = [sanitize_column_names(c) for c in merged_df.columns]
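
To illustrate what the sanitizer does, the sketch below repeats the function in self-contained form and applies it to a few sample names. The sample names are hypothetical, chosen to exercise each rule; they are not copied from the source files.

```python
import re

def sanitize_column_names(c):
    c = c.lower()
    c = re.sub(r'[?,()/]', '', c)   # drop punctuation
    c = re.sub(r'\s-\s', '_', c)    # " - " becomes a single underscore
    c = re.sub(r'[ -]', '_', c)     # remaining spaces and hyphens become underscores
    c = c.replace('%', 'percent')
    return c

# Hypothetical raw column names, for illustration only
samples = ["Average Class Size - Math", "Supportive Environment %", "% Black / Hispanic"]
cleaned = [sanitize_column_names(s) for s in samples]

for raw, new in zip(samples, cleaned):
    print("{0!r} -> {1!r}".format(raw, new))
```

One edge case worth knowing: a slash surrounded by spaces is removed before the spaces are converted, so a name like the third sample ends up with two adjacent underscores.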

### Evaluating Density
Let's take a look at how sparse our data is.

In [19]:
total_nulls = merged_df.isnull().sum().sum()
print("Total empty cells:", total_nulls)
print("Percent null: {0:.3f}%".format(100 * total_nulls / merged_df.size))

Total empty cells: 99
Percent null: 0.323%
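
The percentage above is simply the number of null cells divided by the total number of cells (rows times columns). A minimal sketch on a toy frame, with made-up columns `x` and `y`, shows the arithmetic and an equivalent one-liner:

```python
import numpy as np
import pandas as pd

# Toy frame with 3 nulls among 8 cells (values are illustrative only)
toy = pd.DataFrame({"x": [1.0, np.nan, 3.0, 4.0],
                    "y": [np.nan, np.nan, 1.0, 2.0]})

# Count nulls per column, then sum across columns
total_null = int(toy.isnull().sum().sum())

# DataFrame.size is rows * columns
percent_null = 100 * total_null / toy.size

# Equivalent: the mean of the boolean null mask over all cells
percent_null_alt = 100 * toy.isnull().to_numpy().mean()

print(total_null, percent_null, percent_null_alt)
```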


Let's take a look at our worst offending rows and columns to see if anything stands out enough to be removed:

#### Columns with Nulls

In [10]:
merged_df.isnull().sum()[merged_df.isnull().sum() > 0]\
    .sort_values(ascending=False)

average_class_size_social_studies    19
number_of_classes_social_studies     19
number_of_students_social_studies    19
average_class_size_english           10
number_of_classes_english            10
number_of_students_english           10
average_math_proficiency              3
average_ela_proficiency               3
economic_need_index                   3
average_class_size_math               1
number_of_classes_math                1
number_of_students_math               1
dtype: int64

#### Rows with Nulls

In [11]:
merged_df.isnull().sum(axis=1)[merged_df.isnull().sum(axis=1) > 0]\
    .sort_values(ascending=False)

02M407    9
02M312    6
02M225    6
02M255    6
06M209    6
02M413    6
01M839    6
15K839    6
19K404    6
17K484    6
03M291    3
01M188    3
28Q358    3
03M860    3
08X562    3
07X298    3
03M859    3
01M332    3
19K678    3
29Q355    3
11X462    3
13K265    3
dtype: int64

At the moment, none of these looks sparse enough to justify removal: the worst column is missing 19 of 464 values, and the worst row is missing 9 of 66.

### Saving a dated file

To allow updates to the merged dataframe without disrupting downstream modeling work until it is ready for them, we save the merged data under a dated filename.

In [20]:
# Build the filename from today's date in ISO format (YYYY-MM-DD).
filename = "combined_data_{0}.csv".format(datetime.date.today().isoformat())
print(filename)

combined_data_2018-07-17.csv


In [21]:
merged_df.to_csv("data_merged/{0}".format(filename))
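
Because the date is in ISO format, dated filenames sort chronologically as plain strings. A downstream notebook that wants the most recent file could use something like the sketch below. `latest_dated_file` is a hypothetical helper, not part of this project, and the second filename in the usage example is invented for illustration.

```python
import re

def latest_dated_file(filenames):
    """Return the newest 'combined_data_YYYY-MM-DD.csv' name, or None if none match.

    Hypothetical helper: relies on ISO dates sorting chronologically as strings.
    """
    pattern = re.compile(r"^combined_data_(\d{4}-\d{2}-\d{2})\.csv$")
    dated = []
    for name in filenames:
        match = pattern.match(name)
        if match:
            dated.append((match.group(1), name))
    return max(dated)[1] if dated else None

# Usage with an illustrative listing (in practice this might come from os.listdir("data_merged"))
listing = ["combined_data_2018-07-09.csv", "combined_data_2018-07-17.csv", "notes.txt"]
newest = latest_dated_file(listing)
print(newest)
```

Pinning a specific dated file in each modeling notebook is the safer default; picking the latest automatically trades that stability for convenience.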