# Create Public School Machine Learning Datasets
**This program creates all the _ML datasets in the NCEA repository.**
* This notebook reads each School Dataset file located at \EducationDataNC\ *schoolYear* \School Datasets\ as input data.
* Different school years are processed by changing the *schoolYear* parameter.
* Different input / output files are processed / created by changing the *inputFileName* parameter in the cell below.  
* While a single program is used to create all the _ML datasets, one program copy per dataset is maintained in the repository so the dataset-specific transformation reports may be reviewed. 

**Datasets ending in _ML are preprocessed for Machine Learning and go through the following transformations:**
1. Missing student body racial compositions are imputed using district averages.
2. Columns that have the same value in every single row are deleted.
3. Columns that have a unique value in every single row (all values are different) are deleted.
4. Empty columns (all values are NA or NULL) are deleted.
5. Numeric columns whose share of missing values is at or above the *missingThreshold* parameter are deleted.
6. Remaining numeric, non-race columns with missing values are imputed / populated with 0.  In many cases, schools are not reporting values when they are zero. However, mean imputation or some other more sophisticated strategy might be considered here.
7. Categorical / text based columns with >= *uniqueThreshold* unique values are deleted.
8. All remaining categorical / text based columns are one-hot encoded.  In categorical columns, one-hot encoding creates one new boolean / binary field per unique value in the target column, converting all categorical columns to a numeric data type. 
9. Duplicated or highly similar columns with > 95% correlation are deleted.

In [1]:
#import required Libraries
import pandas as pd
import numpy as np
import os
import string

#**********************************************************************************
# Set the following variables before running this code!!!
#**********************************************************************************
#All raw data files are processed for the year below
schoolYear = 2019

#Location where copies of the raw data files will be read in from csv files.
dataDir = 'D:/BenepactLLC/Belk/NC_Report_Card_Data/2020/October 2020/' + str(schoolYear) + '/School Datasets/'

#Name of the file to be processed
#inputFileName = 'PublicSchools2019'
#inputFileName = 'PublicHighSchools2019'
inputFileName = 'PublicMiddleSchools2019'
#inputFileName = 'PublicElementarySchools2019'

#Input file being transformed for machine learning 
inputFile = dataDir + inputFileName + '.csv'

#Location where the new school datasets will be created.
outputDir = 'D:/BenepactLLC/Belk/NC_Report_Card_Data/2020/October 2020/' + str(schoolYear) + '/Machine Learning Datasets/'

#Missing Data Threshold (Per Column)
missingThreshold = 0.60

#Unique Value Threshold (Per Column)
#Delete Columns >  uniqueThreshold unique values prior to one-hot encoding. 
#(each unique value becomes a new column during one-hot encoding)
uniqueThreshold = 200

#Read in the School Data File
schoolData = pd.read_csv(inputFile, low_memory=False, dtype={'agency_code': object})
print('*********Start: Beginning Column and Row Counts********************************************')
schoolData.info(verbose=False)

#Select only public schools as charter schools are missing data for many columns.
schoolData = schoolData[(schoolData['designation_type_loc'] == 'P')] #& (schoolData['avg_student_num'] > 0)] -rcd_admin missing

print('\r\n*********After: Selecting Only Public School Campuses**********************************')
schoolData.info(verbose=False)

#Save primary key
agency_code = schoolData['agency_code']
#Convert zip code to string
schoolData['zip_loc'] = schoolData['zip_loc'].astype('object')

*********Start: Beginning Column and Row Counts********************************************
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 846 entries, 0 to 845
Columns: 4674 entries, agency_code to welcome_url
dtypes: float64(4626), object(48)
memory usage: 30.2+ MB

*********After: Selecting Only Public School Campuses**********************************
<class 'pandas.core.frame.DataFrame'>
Int64Index: 691 entries, 2 to 845
Columns: 4674 entries, agency_code to welcome_url
dtypes: float64(4626), object(48)
memory usage: 24.6+ MB


In [2]:
schoolData.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 691 entries, 2 to 845
Data columns (total 4674 columns):
 #    Column                                         Dtype  
---   ------                                         -----  
 0    agency_code                                    object 
 1    category_code_loc                              object 
 2    agency_level_loc                               object 
 3    lea_code_loc                                   object 
 4    designation_type_loc                           object 
 5    name_loc                                       object 
 6    county_loc                                     object 
 7    street_addr_loc                                object 
 8    stree_addr2_loc                                object 
 9    city_loc                                       object 
 10   state_loc                                      object 
 11   zip_loc                                        object 
 12   phone_loc                        

# Prepare Consolidated Dataset for Machine Learning
**Below we perform operations on the entire dataset to remove columns and update row values that could cause problems during machine learning.**

## Student Body Racial Composition Features 
**Impute / update missing Student Body Racial Composition Fields using mean imputation.**
* When there are no racial composition percentages for a particular school campus / agency_code, fill in the missing values with the average for that school's district (*lea_code_loc*).
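
The district-level mean imputation can be sketched on a small toy frame. The column names mirror the notebook, but the values and district codes below are made up for illustration. Note that a district with no reported values at all has no mean to borrow, so its rows stay missing.

```python
import numpy as np
import pandas as pd

# Toy data: column names follow the notebook, the values are invented.
toy = pd.DataFrame({
    'lea_code_loc': ['010', '010', '010', '020', '020'],
    'black_pct':    [0.20, np.nan, 0.40, np.nan, np.nan],
    'white_pct':    [0.60, 0.50, np.nan, np.nan, np.nan],
})
race_cols = ['black_pct', 'white_pct']

# Fill each missing value with the mean of its own district (LEA).
toy[race_cols] = (toy.groupby('lea_code_loc')[race_cols]
                     .transform(lambda col: col.fillna(col.mean())))

# District 010 gaps are filled; district 020 has no values to average,
# so its rows remain NaN.
print(toy)
```

This is why some rows are still reported as missing after the imputation step: their district has nothing to average.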

In [3]:
#Get Student Body Racial Composition Fields
raceCompositionFields = schoolData.filter(regex='indian|asian|hispanic|black|white|pacific_island|two_or_more|minority')\
                                  .filter(regex='pct').columns
    
rowsBefore = schoolData[raceCompositionFields].isnull().T.any().T.sum()

#Update missing race values with the district average when available (No district averages for charter schools) 
schoolData[raceCompositionFields] = schoolData.groupby('lea_code_loc')[raceCompositionFields]\
                                              .transform(lambda x: x.fillna(x.mean()))

#Review dataset contents after Racial Composition Imputation
print('*********After: Updating Missing Racial Composition Values****************************')   
rowsAfter = schoolData[raceCompositionFields].isnull().T.any().T.sum()
rowsUpdated = rowsBefore - rowsAfter
print('Rows Updated / Imputed: ', rowsUpdated) 
print('\r\nTotal Rows Missing Racial Compositions By District Code') 
schoolData['lea_code_loc'][schoolData[raceCompositionFields].isnull().T.any().T].value_counts()

*********After: Updating Missing Racial Composition Values****************************
Rows Updated / Imputed:  4

Total Rows Missing Racial Compositions By District Code


998LEA    13
997LEA     9
298LEA     3
269LEA     2
679LEA     1
299LEA     1
Name: lea_code_loc, dtype: int64

## Remove Columns with Problematic Data
**Here we remove entire columns that could cause problems during machine learning.  The following operations are performed:**
* Remove any columns that have the same value in every single row.
* Remove any columns that have a unique value in every single row (all values are different).
* Remove empty columns (all values are NA or NULL).
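
The three checks can be tried on a toy frame before running them on the full dataset. This is only a sketch with invented column names. It also shows that counting unique values with `dropna=False` treats an all-NaN column as a single-valued column, so the constant-value check already catches empty columns.

```python
import numpy as np
import pandas as pd

# Toy data: hypothetical columns, one per case described above.
toy = pd.DataFrame({
    'state':  ['NC', 'NC', 'NC', 'NC'],            # same value in every row
    'school': ['A', 'B', 'C', 'D'],                # different value in every row
    'empty':  [np.nan, np.nan, np.nan, np.nan],    # no values at all
    'score':  [71.0, 71.0, 85.0, np.nan],          # ordinary column
})

n_rows = len(toy)
n_unique_with_na = toy.nunique(dropna=False)   # NaN counts as a value
n_unique = toy.nunique()                       # NaN is ignored

constant_cols = n_unique_with_na[n_unique_with_na == 1].index.tolist()
all_unique_cols = n_unique[n_unique == n_rows].index.tolist()
empty_cols = toy.columns[toy.isnull().all()].tolist()

print('Constant:  ', constant_cols)
print('All unique:', all_unique_cols)
print('Empty:     ', empty_cols)
```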

In [4]:
#Remove any fields that have the same value in all rows
UniqueValueCounts = schoolData.nunique(dropna=False)
SingleValueCols = UniqueValueCounts[UniqueValueCounts == 1].index
schoolData = schoolData.drop(SingleValueCols, axis=1)

#Review dataset contents after drops
print('*********After: Removing columns with the same value in every row.*******************')
schoolData.info(verbose=False)
print ('\r\nColumns Deleted: ', len(SingleValueCols))

*********After: Removing columns with the same value in every row.*******************
<class 'pandas.core.frame.DataFrame'>
Int64Index: 691 entries, 2 to 845
Columns: 4156 entries, agency_code to welcome_url
dtypes: float64(4111), object(45)
memory usage: 21.9+ MB

Columns Deleted:  518


In [5]:
#Remove any fields that have unique values in every row
schoolDataRecordCt = schoolData.shape[0]
UniqueValueCounts = schoolData.apply(pd.Series.nunique)
AllUniqueValueCols = UniqueValueCounts[UniqueValueCounts == schoolDataRecordCt].index
schoolData = schoolData.drop(AllUniqueValueCols, axis=1)

#Review dataset contents after drops
print('*********After: Removing columns with unique values in every row.*******************')
schoolData.info(verbose=False)
print ('\r\nColumns Deleted: ', len(AllUniqueValueCols))

*********After: Removing columns with unique values in every row.*******************
<class 'pandas.core.frame.DataFrame'>
Int64Index: 691 entries, 2 to 845
Columns: 4155 entries, category_code_loc to welcome_url
dtypes: float64(4111), object(44)
memory usage: 21.9+ MB

Columns Deleted:  1


In [6]:
#Remove any empty fields (null values in every row)
schoolDataRecordCt = schoolData.shape[0]
NullValueCounts = schoolData.isnull().sum()
NullValueCols = NullValueCounts[NullValueCounts == schoolDataRecordCt].index
schoolData = schoolData.drop(NullValueCols, axis=1)

#Review dataset contents after empty field drops
print('*********After: Removing columns with null / blank values in every row.*************')
schoolData.info(verbose=False)
print ('\r\nColumns Deleted: ', len(NullValueCols))

*********After: Removing columns with null / blank values in every row.*************
<class 'pandas.core.frame.DataFrame'>
Int64Index: 691 entries, 2 to 845
Columns: 4155 entries, category_code_loc to welcome_url
dtypes: float64(4111), object(44)
memory usage: 21.9+ MB

Columns Deleted:  0


## Handle Other Missing Value Types
* Here we eliminate any numeric columns whose share of missing values is at or above the *missingThreshold* parameter.
* All remaining non-race, numeric column missing values are populated with 0.
* In many cases, it seems that schools are simply not reporting values when they are zero. However, mean imputation or some other strategy might be considered.
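
A minimal sketch of the threshold rule and of the two imputation choices, using invented data. `missingThreshold` is the notebook's parameter; every other name here is hypothetical.

```python
import numpy as np
import pandas as pd

missingThreshold = 0.60   # same name and value as the notebook parameter

# Toy data: invented columns with different amounts of missing data.
toy = pd.DataFrame({
    'mostly_missing': [np.nan, np.nan, np.nan, 4.0, np.nan],   # 80% missing
    'some_missing':   [1.0, np.nan, 3.0, 4.0, 5.0],            # 20% missing
    'complete':       [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Fraction of missing rows per column, compared against the threshold.
missing_share = toy.isnull().mean()
too_sparse = missing_share[missing_share >= missingThreshold].index
toy = toy.drop(columns=too_sparse)

zero_filled = toy.fillna(0)            # the choice made in this notebook
mean_filled = toy.fillna(toy.mean())   # the alternative strategy mentioned above

print(zero_filled)
print(mean_filled)
```

Zero filling is reasonable only when a blank really means "none reported". Where a blank means "unknown", mean filling distorts the column less.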

In [7]:
#Isolate continuous and categorical data types
#These are indexers into the schoolData dataframe and may be used similar to the schoolData dataframe 
sD_boolean = schoolData.loc[:, (schoolData.dtypes == bool) ]
sD_nominal = schoolData.loc[:, (schoolData.dtypes == object)]
sD_continuous = schoolData.loc[:, (schoolData.dtypes != bool) & (schoolData.dtypes != object)]
print ("Boolean Columns: ", sD_boolean.shape[1])
print ("Nominal Columns: ", sD_nominal.shape[1])
print ("Continuous Columns: ", sD_continuous.shape[1])
print ("Columns Accounted for: ", sD_nominal.shape[1] + sD_continuous.shape[1] + sD_boolean.shape[1])

Boolean Columns:  0
Nominal Columns:  44
Continuous Columns:  4111
Columns Accounted for:  4155


In [8]:
#Eliminate continuous columns where the share of missing values is >= missingThreshold
schoolDataRecordCt = sD_continuous.shape[0]
missingValueLimit = schoolDataRecordCt * missingThreshold
NullValueCounts = sD_continuous.isnull().sum()
NullValueCols = NullValueCounts[NullValueCounts >= missingValueLimit].index
schoolData = schoolData.drop(NullValueCols, axis=1)

#Review dataset contents after empty field drops
print('*********After: Removing columns with >= missingThreshold % of missing values******')
schoolData.info(verbose=False)
print('\r\nColumns Deleted: ', len(NullValueCols))

*********After: Removing columns with >= missingThreshold % of missing values******
<class 'pandas.core.frame.DataFrame'>
Int64Index: 691 entries, 2 to 845
Columns: 1409 entries, category_code_loc to welcome_url
dtypes: float64(1365), object(44)
memory usage: 7.4+ MB

Columns Deleted:  2746


## One-Hot Encoding of Categorical Variables
**All categorical / string variables are converted to numeric variables via one-hot encoding.  Each unique value in a categorical column will become a new binary / numeric column in the dataset.**
* All remaining categorical columns are one-hot encoded.  
* In categorical columns, one-hot encoding creates one new boolean / binary field per unique value in the target column, converting all categorical columns to a numeric data type. 
* Prior to one-hot encoding, columns with >= *uniqueThreshold* unique values are deleted.  
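
The effect of the unique value threshold and of `drop_first=True` can be seen on a toy frame. This is a sketch only: the `calendar` column and all values are hypothetical, and the threshold is lowered to 3 so the tiny example triggers it.

```python
import pandas as pd

uniqueThreshold = 3   # small value for the toy data; the notebook uses 200

# Toy data: invented values, 'calendar' is a hypothetical column.
toy = pd.DataFrame({
    'name_loc':   ['Ash', 'Birch', 'Cedar', 'Dogwood'],   # 4 unique values
    'calendar':   ['Traditional', 'Year-Round', 'Traditional', 'Traditional'],
    'enrollment': [410.0, 380.0, 525.0, 290.0],
})

# Drop text columns that would explode into too many one-hot columns.
nominal_cols = toy.select_dtypes(exclude='number').columns
unique_counts = toy[nominal_cols].nunique()
high_cardinality = unique_counts[unique_counts >= uniqueThreshold].index
toy = toy.drop(columns=high_cardinality)

# One-hot encode what is left; drop_first removes one level per column
# because it is implied by the others.
remaining_nominal = toy.select_dtypes(exclude='number').columns
encoded = pd.get_dummies(toy, columns=remaining_nominal, drop_first=True)

print(encoded.columns.tolist())
```

With `drop_first=True` a column with k unique values yields k - 1 new columns, which is why the number of columns added in the notebook is smaller than the total count of unique values.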

In [9]:
#Delete categorical columns with >= uniqueThreshold unique values 
#(Each unique value becomes a column during one-hot encoding)
oneHotUniqueValueCounts = schoolData[sD_nominal.columns].apply(lambda x: x.nunique())
oneHotUniqueValueCols = oneHotUniqueValueCounts[oneHotUniqueValueCounts >= uniqueThreshold].index
schoolData.drop(oneHotUniqueValueCols, axis=1, inplace=True) 

#Remove categorical columns which do not add value to machine learning.
#The *_grades_dlmi columns can also be dropped since they all have indicator fields
#(device_access, device_home, BYOD) with fewer unique values to one-hot encode.
#They may or may not exist after running the cleanups above, so check for
#'device_access_grades_dlmi', 'device_home_grades_dlmi' and 'byod_grades_dlmi'
#and add them to the list below if they exist.
schoolData.drop(['welcome_url'], axis=1, inplace=True) 

#Review dataset contents after dropping columns with too many unique values
print('*********After: Removing columns with >= uniqueThreshold unique values***********')
schoolData.info(verbose=False)
print('\r\nColumns Deleted: ', len(oneHotUniqueValueCols))
print(oneHotUniqueValueCols)
print('Dropped Manually: welcome_url,device_access_grades_dlmi ,device_home_grades_dlmi ,byod_grades_dlmi ')

*********After: Removing columns with >= uniqueThreshold unique values***********
<class 'pandas.core.frame.DataFrame'>
Int64Index: 691 entries, 2 to 845
Columns: 1394 entries, category_code_loc to avg_size_nc_math_1_sar
dtypes: float64(1365), object(29)
memory usage: 7.4+ MB

Columns Deleted:  14
Index(['name_loc', 'street_addr_loc', 'city_loc', 'zip_loc', 'phone_loc',
       'url_loc', 'school_name', 'school_name_attendance',
       'ada_ct_2017_attendance', 'adm_ct_2017_attendance',
       'ada_ct_2018_attendance', 'adm_ct_2018_attendance',
       'ada_ct_2019_attendance', 'adm_ct_2019_attendance'],
      dtype='object')
Dropped Manually: welcome_url,device_access_grades_dlmi ,device_home_grades_dlmi ,byod_grades_dlmi 


In [10]:
#Isolate remaining categorical variables
begColumnCt = len(schoolData.columns)
sD_nominal = schoolData.loc[:, (schoolData.dtypes == object)]

#one hot encode categorical variables
schoolData = pd.get_dummies(data=schoolData, 
                       columns=sD_nominal.columns, drop_first=True)

#Determine change in column count
endColumnCt = len(schoolData.columns)
columnsAdded = endColumnCt - begColumnCt

#Review dataset contents after one-hot encoding
print('Columns To One-Hot Encode: ', len(sD_nominal.columns))
print('\r\n*********After: Adding New Columns Via One-Hot Encoding*************************')
schoolData.info(verbose=False)
print('\r\nNew Columns Created Via One-Hot Encoding: ', columnsAdded)

Columns To One-Hot Encode:  29

*********After: Adding New Columns Via One-Hot Encoding*************************
<class 'pandas.core.frame.DataFrame'>
Int64Index: 691 entries, 2 to 845
Columns: 1980 entries, lea to poverty_eq_NEITHER
dtypes: float64(1365), uint8(615)
memory usage: 7.6 MB

New Columns Created Via One-Hot Encoding:  586


In [11]:
#Remove any fields that have the same value in all rows after one-hot encoding
UniqueValueCounts = schoolData.nunique(dropna=False)
SingleValueCols = UniqueValueCounts[UniqueValueCounts == 1].index
schoolData = schoolData.drop(SingleValueCols, axis=1)

#Review dataset contents after drops
print('*********After: Removing one-hot encoded columns with the same value in every row.*******************')
schoolData.info(verbose=False)
print ('\r\nColumns Deleted: ', len(SingleValueCols))

*********After: Removing one-hot encoded columns with the same value in every row.*******************
<class 'pandas.core.frame.DataFrame'>
Int64Index: 691 entries, 2 to 845
Columns: 1980 entries, lea to poverty_eq_NEITHER
dtypes: float64(1365), uint8(615)
memory usage: 7.6 MB

Columns Deleted:  0


## Impute any Remaining Missing Values as Zero

In [12]:
#Print out all the missing value rows
pd.set_option('display.max_rows', 1500)

print('\r\n*********The Remaining Missing Values Below will be set to Zero!*************************')

#Check for Missing values 
missing_values = schoolData.isnull().sum().reset_index()
missing_values.columns = ['Variable Name', 'Number Missing Values']
missing_values = missing_values[missing_values['Number Missing Values'] > 0] 
missing_values




*********The Remaining Missing Values Below will be set to Zero!*************************


| | Variable Name | Number Missing Values |
|---|---|---|
| 0 | lea | 33 |
| 1 | school | 33 |
| 2 | indian_pct | 29 |
| 3 | asian_pct | 29 |
| 4 | hispanic_pct | 29 |
| 5 | black_pct | 29 |
| 6 | white_pct | 29 |
| 7 | pacific_island_pct | 29 |
| 8 | two_or_more_pct | 29 |
| 9 | minority_pct | 29 |


In [13]:
#Replace all remaining NaN with 0
schoolData = schoolData.fillna(0)

#Check for Missing values after final imputation 
missing_values = schoolData.isnull().sum().reset_index()
missing_values.columns = ['Variable Name', 'Number Missing Values']
missing_values = missing_values[missing_values['Number Missing Values'] > 0] 
missing_values

Unnamed: 0,Variable Name,Number Missing Values


## Identify and Remove Highly Correlated Features
**Find and remove any columns / features that are > 95% correlated**
* https://stackoverflow.com/questions/39409866/correlation-heatmap
* https://chrisalbon.com/machine_learning/feature_selection/drop_highly_correlated_features/
* https://codeyarns.com/2015/04/20/how-to-change-font-size-in-seaborn/
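
The upper-triangle technique from the links above can be sketched on toy data. The values are invented. This sketch drops the later column of each correlated pair (the `to_drop` form), whereas the cell that performs the drop in this notebook removes `Field2`, the earlier column of each pair.

```python
import numpy as np
import pandas as pd

# Toy data: 'b' is almost exactly 2 * 'a'; 'c' is only loosely related.
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [2.0, 4.0, 6.0, 8.0, 10.1],
    'c': [5.0, 3.0, 4.0, 1.0, 2.0],
})

# Absolute correlations, keeping only the cells above the diagonal so
# each pair of columns is examined exactly once.
corr_matrix = toy.corr().abs()
mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
upper = corr_matrix.where(mask)

# A column is dropped when it is > 95% correlated with an earlier column.
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
reduced = toy.drop(columns=to_drop)

print('Dropped:', to_drop)
print('Kept:   ', reduced.columns.tolist())
```

Either side of a correlated pair can be removed. What matters is that one member of each highly correlated group survives.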

In [14]:
# calculate the correlation matrix
corr_matrix  = schoolData.corr().abs()

# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find index of feature columns with correlation greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]

In [15]:
#Allow all of the correlated field rows to be displayed
pd.set_option('display.max_rows', 3500)

#Get all of the correlation values > 95%
x = np.where(upper > 0.95)

#Display all field combinations with > 95% correlation
cf = pd.DataFrame()
cf['Field1'] = upper.columns[x[1]]
cf['Field2'] = upper.index[x[0]]

#Get the correlation value for every field combination directly from the matrix positions in x
cf['Correlation'] = upper.to_numpy()[x]

print ('There are ', str(len(cf['Field1'])), ' field correlations > 95%.')
cf

There are  1861  field correlations > 95%.


| | Field1 | Field2 | Correlation |
|---|---|---|---|
| 0 | lea_no_attendance | lea | 0.994106 |
| 1 | school_no_attendance | school | 0.990736 |
| 2 | indian_male_pct | indian_pct | 0.993593 |
| 3 | indian_female_pct | indian_pct | 0.993122 |
| 4 | asian_male_pct | asian_pct | 0.97359 |
| 5 | asian_female_pct | asian_pct | 0.971578 |
| 6 | hispanic_male_pct | hispanic_pct | 0.956595 |
| 7 | hispanic_female_pct | hispanic_pct | 0.95556 |
| 8 | black_male_pct | black_pct | 0.954817 |
| 9 | white_male_pct | white_pct | 0.971395 |


In [16]:
print('Dropping the following ', str(len(cf['Field2'].unique())), ' highly correlated, unique fields.')
cf['Field2'].unique()

Dropping the following  780  highly correlated, unique fields.


array(['lea', 'school', 'indian_pct', 'asian_pct', 'hispanic_pct',
       'black_pct', 'white_pct', 'indian_male_pct',
       'ada_adm_ratio_2018_attendance', 'eg_index_all_all_eg',
       'eg_index_ma_all_eg', 'eg_score_all_bl7_eg', 'eg_score_all_els_eg',
       'eg_score_all_hi7_eg', 'eg_score_all_mu7_eg',
       'eg_score_all_swd_eg', 'eg_score_all_wh7_eg', 'eg_score_ma_all_eg',
       'eg_score_rd_all_eg', 'den_all_elp', 'den_els_elp', 'den_hi7_elp',
       'den_male_elp', 'pct_all_elp', 'pct_prof_6gr_irm',
       'target_assign_math_grades_3-8_ltg',
       'target_assign_reading_grades_3-8_ltg',
       'target_assign_total_targets_ltg', 'pct_met_math_grades_3-8_part',
       'pct_met_reading_grades_3-8_part',
       'pct_met_science_grades_5&8_part',
       'target_assign_math_grades_3-8_part',
       'target_assign_reading_grades_3-8_part',
       'target_assign_science_grades_5&8_part',
       'target_assign_total_targets_part',
       'target_met_math_grades_3-8_part',
       '

In [17]:
#Check columns before drop 
print('\r\n*********Before: Dropping Highly Correlated Fields*************************************')
schoolData.info(verbose=False)

# Drop the highly correlated features from our training data 
#schoolData = schoolData.drop(to_drop, axis=1)
schoolData = schoolData.drop(cf['Field2'], axis=1)

#Check columns after drop 
print('\r\n*********After: Dropping Highly Correlated Fields**************************************')
schoolData.info(verbose=False)


*********Before: Dropping Highly Correlated Fields*************************************
<class 'pandas.core.frame.DataFrame'>
Int64Index: 691 entries, 2 to 845
Columns: 1980 entries, lea to poverty_eq_NEITHER
dtypes: float64(1365), uint8(615)
memory usage: 7.6 MB

*********After: Dropping Highly Correlated Fields**************************************
<class 'pandas.core.frame.DataFrame'>
Int64Index: 691 entries, 2 to 845
Columns: 1200 entries, pacific_island_pct to poverty_eq_NEITHER
dtypes: float64(907), uint8(293)
memory usage: 5.0 MB


In [18]:
#Restore the agency_code (primary key) before saving
schoolData['agency_code'] = agency_code
#Save the final dataset to a .csv file
schoolData.to_csv(outputDir + inputFileName + '_ML.csv', sep=',', index=False)

In [19]:
print('*********FINAL DATASET DETAILS*********************************************************\r\n')
schoolData.info(verbose=True)

*********FINAL DATASET DETAILS*********************************************************

<class 'pandas.core.frame.DataFrame'>
Int64Index: 691 entries, 2 to 845
Data columns (total 1201 columns):
 #    Column                                                                                                                       Dtype  
---   ------                                                                                                                       -----  
 0    pacific_island_pct                                                                                                           float64
 1    two_or_more_pct                                                                                                              float64
 2    minority_pct                                                                                                                 float64
 3    asian_male_pct                                                                                         

In [20]:
import sklearn
import pandas as pd

print('Sklearn Version: ' + sklearn.__version__)
print('Pandas Version: ' + pd.__version__)

Sklearn Version: 0.22.1
Pandas Version: 1.0.3


In [21]:
print('Output File Location:\r\n\r\n' + outputDir + inputFileName + '_ML.csv')

Output File Location:

D:/BenepactLLC/Belk/NC_Report_Card_Data/2020/October 2020/2019/Machine Learning Datasets/PublicMiddleSchools2019_ML.csv
