## Using Four Bins to Model

To use more observations, all of the data, including large schools, were binned into four categories by graduation rate: 90%+, 80-89%, 60-79%, and 0-59%. The national graduation rate was 84.6% (school year 2016-17), so the four categories correspond to above-average, average, below-average, and poor-performing schools. The bins are unbalanced: 52.57% of schools are above average and 9.72% are poor performing. Nonetheless, the bins show very distinct patterns for four variables: chronic absenteeism, sports participation, days missed due to suspensions, and the number of non-certified teachers.

With the four bins, several models were explored: a random forest classifier, a random forest with class weights, a balanced random forest, and a random forest tuned with a grid search (run as a randomized search over the grid). Finding the best parameters took 6.9 hours. The tuned model classified 99.9% of the training data correctly but only 63.8% of the test data, which indicates overfitting.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor

In [2]:
binned = pd.read_csv('/Users/flatironschool/Absenteeism_Project/data/processed/binned.csv')



In [3]:
binned.head()

Unnamed: 0.2,Unnamed: 0,index,Unnamed: 0.1,Unnamed: 0.1.1,STNAM,LEANM,NCESSCH,SCHNAM,ALL_COHORT_1516,ALL_RATE_1516,...,total_suspension_days,suspensed_day_rate,harassed,harassed_rate,activities_funds_rate,non_cert_rate,counselor_rate,absent_teacher_rate,grad_slice,grad_rate_bin
0,0,0,0,0,ALABAMA,Albertville City,10000500871,Albertville High Sch,296,92,...,78.0,0.060232,0.0,0.0,2811.937359,0.0,0.003475,0.378788,92,90%+
1,1,1,1,1,ALABAMA,Marshall County,10000600872,Asbury Sch,67,95,...,10.0,0.018553,2.0,0.003711,4825.189777,0.0,0.002783,0.1,95,90%+
2,2,2,2,2,ALABAMA,Marshall County,10000600878,Douglas High Sch,153,85-89,...,18.0,0.030303,5.0,0.008418,5317.932795,0.0,0.001684,0.105263,85,80-89%
3,3,3,3,3,ALABAMA,Marshall County,10000600883,Kate D Smith DAR High Sch,120,80-84,...,10.0,0.021786,0.0,0.0,5909.375686,0.0,0.002179,0.068966,80,80-89%
4,4,4,4,4,ALABAMA,Marshall County,10000601585,Brindlee Mt High Sch,94,85-89,...,8.0,0.022039,0.0,0.0,3962.305785,0.0,0.002755,0.10101,85,80-89%


## Create Four Bins

In [4]:
#create the binned categories
#include_lowest=True keeps a graduation rate of exactly 0 in the '0-59%' bin
binned['four_rate_bins'] = pd.cut(binned['grad_slice'].astype(int),
    [0, 59, 79, 89, 100], labels = ['0-59%', '60-79%', '80-89%', '90%+'],
    include_lowest=True)
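
The bin edges passed to `pd.cut` are right-inclusive by default, so each label covers an interval of the form (left, right]. The sketch below uses a few made-up boundary values (not taken from the data set) to show where each one lands. A value of exactly 0 falls outside the first interval unless `include_lowest=True` is passed.

```python
import pandas as pd

# Hypothetical boundary values, chosen only to illustrate the bin edges
sample = pd.Series([0, 59, 60, 79, 80, 89, 90, 100])
edges = [0, 59, 79, 89, 100]
labels = ['0-59%', '60-79%', '80-89%', '90%+']

# Default behaviour: intervals are open on the left, closed on the right
default_bins = pd.cut(sample, edges, labels=labels)
# include_lowest=True also closes the first interval on the left
closed_bins = pd.cut(sample, edges, labels=labels, include_lowest=True)

print(pd.DataFrame({'value': sample,
                    'default': default_bins,
                    'include_lowest': closed_bins}))
```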

In [5]:
binned.tail()

Unnamed: 0.2,Unnamed: 0,index,Unnamed: 0.1,Unnamed: 0.1.1,STNAM,LEANM,NCESSCH,SCHNAM,ALL_COHORT_1516,ALL_RATE_1516,...,suspensed_day_rate,harassed,harassed_rate,activities_funds_rate,non_cert_rate,counselor_rate,absent_teacher_rate,grad_slice,grad_rate_bin,four_rate_bins
16568,16568,21854,21854,21854,WYOMING,Sheridan County School District #1,560569000311,Big Horn High School,37,90,...,0.013245,4.0,0.02649,1738.913907,0.0,0.006623,0.846154,90,90%+,90%+
16569,16569,21858,21858,21858,WYOMING,Sheridan County School District #2,560569500360,Sheridan High School,236,89,...,0.046796,0.0,0.0,1694.559176,0.0,0.005086,0.198779,89,80-89%,80-89%
16570,16570,21861,21861,21861,WYOMING,Sweetwater County School District #2,560576200324,Green River High School,176,85-89,...,0.14157,1.0,0.001287,1840.87749,0.0,0.003861,0.2,85,80-89%,80-89%
16571,16571,21863,21863,21863,WYOMING,Teton County School District #1,560583000335,Jackson Hole High School,127,95,...,0.040625,0.0,0.0,4153.518984,0.0,0.004687,0.118939,95,90%+,90%+
16572,16572,21866,21866,21866,WYOMING,Washakie County School District #1,560624000343,Worland High School,105,75-79,...,0.025,0.0,0.0,3015.755325,0.0,0.004275,0.331544,75,70-79%,60-79%


In [6]:
binned['four_rate_bins'].value_counts()

90%+      8712
80-89%    4257
60-79%    1993
0-59%     1611
Name: four_rate_bins, dtype: int64
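
The class imbalance quoted in the introduction follows directly from these counts. A minimal sketch, with the counts copied by hand from the output above:

```python
# Counts copied from the value_counts() output above
counts = {'90%+': 8712, '80-89%': 4257, '60-79%': 1993, '0-59%': 1611}
total = sum(counts.values())

# Share of schools in each bin, as a percentage
shares = {label: 100 * n / total for label, n in counts.items()}
for label, pct in shares.items():
    print('{:>7}: {:5.2f}%'.format(label, pct))
```

On the actual frame, `binned['four_rate_bins'].value_counts(normalize=True)` should give the same proportions as fractions.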

In [7]:
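#index=False avoids writing another 'Unnamed: 0' column every time the file is saved and re-read
binned.to_csv('binned_wo_imputed.csv', index=False)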

## Impute Mean for Missing Values

In [32]:
#impute the mean for the numerical feature variables
#total enrollment and ALL_RATE_1516 are not included
binned_feat = binned[['ap_ib_de_rate', 'sat_act_rate', 'pass_algebra_rate',
      'geometry_rate', 'algebra2_rate', 'calc_rate', 'chronic_absent_rate', 'activities_funds_rate',
      'sports_rate', 'suspensed_day_rate', 'harassed_rate',
      'non_cert_rate','counselor_rate','absent_teacher_rate']]

In [33]:
#fill each feature column's missing values with that column's mean
for col in binned_feat.columns:
    print('im here:', col)
    binned[col] = binned[col].fillna(binned[col].mean())

im here: ap_ib_de_rate
im here: sat_act_rate
im here: pass_algebra_rate
im here: geometry_rate
im here: algebra2_rate
im here: calc_rate
im here: chronic_absent_rate
im here: activities_funds_rate
im here: sports_rate
im here: suspensed_day_rate
im here: harassed_rate
im here: non_cert_rate
im here: counselor_rate
im here: absent_teacher_rate
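
Mean imputation of a set of feature columns can also be written without a loop. The sketch below uses a small made-up frame (the values and the `name` column are hypothetical) to show that only the listed numeric columns are filled, each with its own column mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real feature columns
toy = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'chronic_absent_rate': [0.10, np.nan, 0.30, 0.20],
    'sports_rate': [0.50, 0.25, np.nan, 0.75],
})
feature_cols = ['chronic_absent_rate', 'sports_rate']

# Fill each feature column with its own mean; other columns are left untouched
toy[feature_cols] = toy[feature_cols].fillna(toy[feature_cols].mean())
print(toy)
```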


In [34]:
#check remaining missing values; mean imputation only fills numeric columns
binned.isna().sum()

Unnamed: 0                 0
index                      0
Unnamed: 0.1               0
Unnamed: 0.1.1             0
STNAM                      0
LEANM                      0
NCESSCH                    0
SCHNAM                     0
ALL_COHORT_1516            0
ALL_RATE_1516              0
LEA_STATE                  0
LEA_STATE_NAME             0
LEAID_y                    0
LEA_NAME                   0
SCHID                      0
SCH_NAME                   0
COMBOKEY                   0
JJ                         0
SCH_STATUS_SPED            0
SCH_STATUS_MAGNET          0
SCH_STATUS_CHARTER         0
SCH_STATUS_ALT             0
SCH_MAGNETDETAIL         638
SCH_ALTFOCUS             498
TOT_ENR_M                  0
TOT_ENR_F                  0
TOT_GTENR_M                0
TOT_GTENR_F                0
TOT_DUAL_M                 0
TOT_DUAL_F                 0
                        ... 
districtID                 0
IDSCH                      0
total_enrollment           0
total_ap_ib_de

In [35]:
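#index=False avoids writing another 'Unnamed: 0' column every time the file is saved and re-read
binned.to_csv('binned_imputed.csv', index=False)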

In [49]:
X['chronic_absent_rate'].isnull().sum()

18

In [42]:
len(X)

16573

## Model 1 - Random Forest Classifier

In [131]:
y = binned['four_rate_bins']
X = binned[['ap_ib_de_rate', 'sat_act_rate', 'pass_algebra_rate',
      'geometry_rate', 'algebra2_rate', 'calc_rate', 'chronic_absent_rate', 'activities_funds_rate',
      'sports_rate', 'suspensed_day_rate', 'harassed_rate',
      'non_cert_rate','counselor_rate','absent_teacher_rate']]

In [132]:
X_train_rf, X_test_rf, y_train_rf, y_test_rf = train_test_split(X, y, test_size=0.33, random_state=42)
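
Because the four bins are unbalanced, one option (not used in this notebook) is to pass `stratify=y` to `train_test_split` so that the train and test sets keep roughly the same class proportions. The sketch below uses synthetic labels with made-up proportions, not the school data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (hypothetical proportions for illustration)
labels = np.array(['90%+'] * 520 + ['80-89%'] * 260 +
                  ['60-79%'] * 120 + ['0-59%'] * 100)
features = np.arange(len(labels)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.33, random_state=42, stratify=labels)

# Class shares should be nearly identical in both splits
for cls in np.unique(labels):
    print(cls, round(np.mean(y_tr == cls), 3), round(np.mean(y_te == cls), 3))
```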


In [144]:
rf = RandomForestClassifier(n_estimators=100, max_depth=5)

In [145]:
rf.fit(X_train_rf, y_train_rf)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=5, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [146]:
rf.score(X_train_rf, y_train_rf)

0.6245158966045213

In [147]:
rf.score(X_test_rf, y_test_rf)

0.6051188299817185

In [148]:
confusion_matrix(y_test_rf, rf.predict(X_test_rf))

array([[ 321,   27,   28,  201],
       [  91,   36,  163,  345],
       [  43,   17,  201, 1117],
       [  20,    5,  103, 2752]])
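
Overall accuracy hides how unevenly the classes are predicted. Assuming the rows of the matrix above are the true bins and the columns the predicted bins, in the sorted label order `0-59%`, `60-79%`, `80-89%`, `90%+`, per-class recall can be read off the matrix. A minimal sketch:

```python
import numpy as np

# Confusion matrix copied from the output above
# (assumed layout: rows = true bin, columns = predicted bin)
labels = ['0-59%', '60-79%', '80-89%', '90%+']
cm = np.array([[ 321,   27,   28,  201],
               [  91,   36,  163,  345],
               [  43,   17,  201, 1117],
               [  20,    5,  103, 2752]])

# Recall per class: correct predictions divided by the number of true members
recall = cm.diagonal() / cm.sum(axis=1)
# Accuracy: all correct predictions divided by all predictions
accuracy = cm.diagonal().sum() / cm.sum()

for label, r in zip(labels, recall):
    print('{:>7}: recall {:.3f}'.format(label, r))
print('accuracy: {:.4f}'.format(accuracy))
```

Under that assumption, most schools in the two middle bins are predicted as `90%+`, which is why the overall score stays close to the share of the largest class.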

## Model 2 - Random Forest with Class Weights

In [149]:
y = binned['four_rate_bins']
X = binned[['ap_ib_de_rate', 'sat_act_rate', 'pass_algebra_rate',
      'geometry_rate', 'algebra2_rate', 'calc_rate', 'chronic_absent_rate', 'activities_funds_rate',
      'sports_rate', 'suspensed_day_rate', 'harassed_rate',
      'non_cert_rate','counselor_rate','absent_teacher_rate']]

In [150]:
#with only four bins, dummy variables for the target are easy to inspect
#the result is only displayed, not assigned; the class weights below use the original labels
pd.get_dummies(data = binned, columns= ['four_rate_bins'])

Unnamed: 0.2,Unnamed: 0,index,Unnamed: 0.1,Unnamed: 0.1.1,STNAM,LEANM,NCESSCH,SCHNAM,ALL_COHORT_1516,ALL_RATE_1516,...,activities_funds_rate,non_cert_rate,counselor_rate,absent_teacher_rate,grad_slice,grad_rate_bin,four_rate_bins_0-59%,four_rate_bins_60-79%,four_rate_bins_80-89%,four_rate_bins_90%+
0,0,0,0,0,ALABAMA,Albertville City,10000500871,Albertville High Sch,296,92,...,2811.937359,0.000000,0.003475,0.378788,92,90%+,0,0,0,1
1,1,1,1,1,ALABAMA,Marshall County,10000600872,Asbury Sch,67,95,...,4825.189777,0.000000,0.002783,0.100000,95,90%+,0,0,0,1
2,2,2,2,2,ALABAMA,Marshall County,10000600878,Douglas High Sch,153,85-89,...,5317.932795,0.000000,0.001684,0.105263,85,80-89%,0,0,1,0
3,3,3,3,3,ALABAMA,Marshall County,10000600883,Kate D Smith DAR High Sch,120,80-84,...,5909.375686,0.000000,0.002179,0.068966,80,80-89%,0,0,1,0
4,4,4,4,4,ALABAMA,Marshall County,10000601585,Brindlee Mt High Sch,94,85-89,...,3962.305785,0.000000,0.002755,0.101010,85,80-89%,0,0,1,0
5,5,5,5,5,ALABAMA,Hoover City,10000700251,Hoover High Sch,714,92,...,434.833734,0.019066,0.001685,0.333651,92,90%+,0,0,0,1
6,6,7,7,7,ALABAMA,Hoover City,10000701456,Spain Park High Sch,412,94,...,511.925343,0.016920,0.001789,0.346870,94,90%+,0,0,0,1
7,7,8,8,8,ALABAMA,Madison City,10000800831,Bob Jones High Sch,451,97,...,72676.817158,0.000000,0.002278,0.019608,97,90%+,0,0,0,1
8,8,9,9,9,ALABAMA,Madison City,10000802198,James Clemens High School,404,96,...,3478.029292,0.000000,0.002635,0.232804,96,90%+,0,0,0,1
9,9,10,10,10,ALABAMA,Leeds City,10001102096,Leeds High Sch,123,90-94,...,3966.648380,0.000000,0.002132,0.343750,90,90%+,0,0,0,1


In [151]:
X_train_weight, X_test_weight, y_train_weight, y_test_weight = train_test_split(X, y, test_size=0.33, random_state=42)


In [152]:
from sklearn.utils.class_weight import compute_class_weight

In [153]:
#keyword arguments are required by newer scikit-learn releases
class_weight = compute_class_weight(class_weight="balanced", classes=np.unique(y), y=y)
class_weight

array([2.57184978, 2.07890115, 0.9732793 , 0.47557966])
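
The `"balanced"` weights are `n_samples / (n_classes * count)` for each class. A minimal sketch that reproduces the array above from the bin counts reported earlier in this notebook:

```python
import numpy as np

# Bin counts from the value_counts() output, in sorted label order
labels = ['0-59%', '60-79%', '80-89%', '90%+']
counts = np.array([1611, 1993, 4257, 8712])

# "balanced" heuristic: n_samples / (n_classes * class_count)
weights = counts.sum() / (len(counts) * counts)
weight = dict(zip(labels, weights))
print(weight)
```

A dictionary built this way can be passed as `class_weight`, which avoids retyping the numbers by hand. Passing `class_weight='balanced'` to `RandomForestClassifier` applies the same heuristic to the labels given to `fit`.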

In [117]:
weight={'0-59%': 2.57184978, '60-79%': 2.07890115, '80-89%':0.9732793, '90%+':0.47557966}

In [168]:
rf_weight = RandomForestClassifier(n_estimators=400, max_depth=5, 
    class_weight=weight)

In [169]:
rf_weight.fit(X_train_weight, y_train_weight)

RandomForestClassifier(bootstrap=True,
            class_weight={'0-59%': 1, '60-79%': 1, '80-89%': 1, '90%+': 1},
            criterion='gini', max_depth=5, max_features='auto',
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=400, n_jobs=None, oob_score=False,
            random_state=None, verbose=0, warm_start=False)

In [170]:
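#best score found by trying parameters by hand; this informed the grid search
#note: the fitted model above reports a class_weight of 1 for every bin,
#so the computed weights were not in effect for this run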
print('RF class weight train score: ', rf_weight.score(X_train_weight, y_train_weight))
print('RF class weight test score: ', rf_weight.score(X_test_weight, y_test_weight))

RF class weight train score:  0.6254165540844817
RF class weight test score:  0.6053016453382084


In [171]:
y_pred_weight = rf_weight.predict(X_test_weight)
confusion_matrix(y_test_weight, y_pred_weight)

array([[ 315,   34,   27,  201],
       [  86,   38,  170,  341],
       [  42,   17,  211, 1108],
       [  20,    5,  108, 2747]])

## Model 3 - Balanced Random Forest

In [172]:
from imblearn.ensemble import BalancedRandomForestClassifier

In [173]:
y = binned['four_rate_bins']
X = binned[['ap_ib_de_rate', 'sat_act_rate', 'pass_algebra_rate',
      'geometry_rate', 'algebra2_rate', 'calc_rate', 'chronic_absent_rate', 'activities_funds_rate',
      'sports_rate', 'suspensed_day_rate', 'harassed_rate',
      'non_cert_rate','counselor_rate','absent_teacher_rate']]

In [174]:
X_train_bal, X_test_bal, y_train_bal, y_test_bal = train_test_split(X, y, test_size=0.33, random_state=42)


In [176]:
balance = BalancedRandomForestClassifier(max_depth=5)

In [177]:
balance.fit(X_train_bal, y_train_bal)

BalancedRandomForestClassifier(bootstrap=True, class_weight=None,
                criterion='gini', max_depth=5, max_features='auto',
                max_leaf_nodes=None, min_impurity_decrease=0.0,
                min_samples_leaf=2, min_samples_split=2,
                min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
                oob_score=False, random_state=None, replacement=False,
                sampling_strategy='auto', verbose=0, warm_start=False)

In [178]:
balance.score(X_train_bal, y_train_bal)

0.5999279474016032

In [179]:
balance.score(X_test_bal, y_test_bal)

0.5839122486288848

In [180]:
confusion_matrix(y_test_bal, balance.predict(X_test_bal))

array([[ 500,   43,   11,   23],
       [ 176,  210,  141,  108],
       [ 120,  274,  452,  532],
       [ 164,  157,  527, 2032]])
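
The balanced forest scores lower on plain accuracy than Model 1, but accuracy is dominated by the large `90%+` bin. One hedged way to compare the two is the unweighted mean of per-class recall (often called balanced accuracy), computed here from the two test confusion matrices shown in this notebook, assuming rows are the true bins.

```python
import numpy as np

def mean_recall(cm):
    """Unweighted mean of per-class recall for a confusion matrix (rows = true)."""
    cm = np.asarray(cm, dtype=float)
    return np.mean(cm.diagonal() / cm.sum(axis=1))

# Test-set confusion matrices copied from the Model 1 and Model 3 outputs
cm_rf = [[321, 27, 28, 201], [91, 36, 163, 345],
         [43, 17, 201, 1117], [20, 5, 103, 2752]]
cm_balanced = [[500, 43, 11, 23], [176, 210, 141, 108],
               [120, 274, 452, 532], [164, 157, 527, 2032]]

print('random forest   : {:.3f}'.format(mean_recall(cm_rf)))
print('balanced forest : {:.3f}'.format(mean_recall(cm_balanced)))
```

By this measure the balanced forest does better, because it recovers far more of the schools in the three smaller bins at the cost of some `90%+` schools.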

## Model 4 - Grid Search

In [21]:
y = binned['four_rate_bins']
X = binned[['ap_ib_de_rate', 'sat_act_rate', 'pass_algebra_rate',
      'geometry_rate', 'algebra2_rate', 'calc_rate', 'chronic_absent_rate', 'activities_funds_rate',
      'sports_rate', 'suspensed_day_rate', 'harassed_rate',
      'non_cert_rate','counselor_rate','absent_teacher_rate']]

In [22]:
#use X, y from above
X_train_tune, X_test_tune, y_train_tune, y_test_tune = train_test_split(X, y, test_size=0.2, random_state=102)

In [23]:
rf = RandomForestClassifier()

In [184]:
# Create the random grid
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 4500, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]


random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
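
For a sense of scale, the grid above can be counted. The sketch below rebuilds the same lists and multiplies their lengths; the `n_iter=100` and `cv=3` values are taken from the `RandomizedSearchCV` call in this notebook.

```python
import math
import numpy as np

# Same candidate values as the grid defined above
n_estimators = [int(x) for x in np.linspace(start=200, stop=4500, num=10)]
max_depth = [int(x) for x in np.linspace(10, 110, num=11)] + [None]
grid = {'n_estimators': n_estimators,
        'max_features': ['auto', 'sqrt'],
        'max_depth': max_depth,
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4],
        'bootstrap': [True, False]}

# Total number of parameter combinations in the full grid
n_combinations = math.prod(len(values) for values in grid.values())

# A randomized search samples n_iter of them and fits each once per CV fold
n_iter, cv = 100, 3
print(n_combinations, 'combinations;', n_iter * cv, 'fits in the randomized search')
```

In scikit-learn versions that accept `'auto'`, it selects the same setting as `'sqrt'` for a classifier, so half of these combinations are effectively duplicates.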


In [50]:
rf_random = RandomizedSearchCV(estimator=rf, param_distributions = random_grid, cv = 3, n_iter=100, verbose=2, n_jobs = -1)


In [29]:
#left over from an earlier run; running it would overwrite the search object defined above
#rf_random = RandomForestClassifier()


In [31]:
X

Unnamed: 0,ap_ib_de_rate,sat_act_rate,pass_algebra_rate,geometry_rate,algebra2_rate,calc_rate,chronic_absent_rate,activities_funds_rate,sports_rate,suspensed_day_rate,harassed_rate,non_cert_rate,counselor_rate,absent_teacher_rate
0,,0.234749,,0.258687,0.132046,0.025483,0.213127,2811.937359,0.169884,0.060232,0.000000,0.000000,0.003475,0.378788
1,,0.224490,,0.128015,0.155844,0.029685,0.202226,4825.189777,0.324675,0.018553,0.003711,0.000000,0.002783,0.100000
2,,0.476431,,0.220539,0.321549,0.043771,0.264310,5317.932795,0.385522,0.030303,0.008418,0.000000,0.001684,0.105263
3,,0.505447,,0.226580,0.289760,,0.270153,5909.375686,0.525054,0.021786,0.000000,0.000000,0.002179,0.068966
4,,0.473829,,0.239669,0.366391,0.022039,0.322314,3962.305785,0.176309,0.022039,0.000000,0.000000,0.002755,0.101010
5,0.228851,0.416245,,0.242332,0.056623,0.065723,0.144253,434.833734,0.302663,0.053927,0.000337,0.019066,0.001685,0.333651
6,,0.288014,,0.269529,0.045915,0.064997,0.171139,511.925343,0.313655,0.138342,0.000000,0.016920,0.001789,0.346870
7,,0.396925,,0.126424,0.034169,0.068337,0.009112,72676.817158,0.321754,0.041002,0.001139,0.000000,0.002278,0.019608
8,,0.340749,,0.108314,0.014637,0.045667,0.103630,3478.029292,0.163934,0.078454,0.000000,0.000000,0.002635,0.232804
9,,0.373134,,0.260128,0.255864,0.023454,0.166311,3966.648380,0.539446,0.119403,0.000000,0.000000,0.002132,0.343750


In [39]:
binned['ap_ib_de_rate'].isna().sum()

0

In [51]:
rf_random.fit(X_train_tune, y_train_tune)

In [25]:
y_train_tune

15666    80-89%
8127     80-89%
996      80-89%
5243       90%+
10637      90%+
6134     80-89%
15529    60-79%
3238     80-89%
1514       90%+
8497     80-89%
9064     60-79%
7905     80-89%
5785       90%+
3860     80-89%
5951       90%+
13856      90%+
2037       90%+
13931      90%+
1779     80-89%
15332      90%+
12305      90%+
7351       90%+
14587    80-89%
8749       90%+
9024     80-89%
13604      90%+
13322      90%+
3278     60-79%
6171       90%+
6963     80-89%
          ...  
6041       90%+
3808     60-79%
565      60-79%
13555    80-89%
6442     60-79%
3751      0-59%
11205    80-89%
12491     0-59%
15654    80-89%
4139     60-79%
15234      90%+
13283    80-89%
217        90%+
13743      90%+
2820     80-89%
4167       90%+
7751       90%+
16358     0-59%
3040       90%+
8631     80-89%
9894       90%+
6916       90%+
15239    80-89%
79       80-89%
2376       90%+
4075     60-79%
13167    80-89%
14962      90%+
2290       90%+
10496      90%+
Name: four_rate_bins, Le

In [52]:
rf_random.best_params_

In [19]:
#With the best parameters the model fits the training data almost perfectly,
#but it is overfit and does not generalize well to the test data.
rf_random_best = RandomForestClassifier(n_estimators=2111, min_samples_split=5,
                                        min_samples_leaf=2, max_features='auto',
                                        max_depth=90,bootstrap=False)

In [20]:
rf_random_best.fit(X_train_tune, y_train_tune)

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

In [17]:
rf_random_best.score(X_train_tune, y_train_tune)

NotFittedError: This RandomForestClassifier instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.

In [192]:
rf_random_best.score(X_test_tune, y_test_tune)

0.6386123680241327

In [8]:
#predict takes only the features
rf_random_best.predict(X_test_tune)

NameError: name 'rf_random_best' is not defined

In [None]:
confusion_matrix(y_test_tune, rf_random_best.predict(X_test_tune))

In [109]:
def evaluate(model, test_features, test_labels):
    #the labels are categorical bins, so predictions - labels raises a TypeError;
    #report classification accuracy instead of RMSE
    predictions = model.predict(test_features)
    accuracy = np.mean(predictions == np.asarray(test_labels))
    print('Model Performance')
    print('Accuracy = {:0.4f}'.format(accuracy))

    return accuracy

In [113]:
base_model = RandomForestClassifier(n_estimators = 10)
base_model.fit(X_train_tune, y_train_tune)
base_accuracy = evaluate(base_model, X_test_tune, y_test_tune)