## Started Binning All Schools, Large and Small, to Get More Observations

Began with ten bins: 0-9, 10-19, 20-29, ..., 80-89, and 90 and above. Did a small amount of exploratory data analysis, then tried prediction with a random forest classifier and a tuned version of it.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

In [2]:
binned = pd.read_csv('/Users/flatironschool/Absenteeism_Project/data/processed/combo_cleaned.csv')

In [3]:
binned.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,STNAM,LEANM,NCESSCH,SCHNAM,ALL_COHORT_1516,ALL_RATE_1516,LEA_STATE,LEA_STATE_NAME,...,sports_part,sports_rate,total_suspension_days,suspensed_day_rate,harassed,harassed_rate,activities_funds_rate,non_cert_rate,counselor_rate,absent_teacher_rate
0,0,0,ALABAMA,Albertville City,10000500871,Albertville High Sch,296,92,AL,ALABAMA,...,220.0,0.169884,78.0,0.060232,0.0,0.0,2811.937359,0.0,0.003475,0.378788
1,1,1,ALABAMA,Marshall County,10000600872,Asbury Sch,67,GE95,AL,ALABAMA,...,175.0,0.324675,10.0,0.018553,2.0,0.003711,4825.189777,0.0,0.002783,0.1
2,2,2,ALABAMA,Marshall County,10000600878,Douglas High Sch,153,85-89,AL,ALABAMA,...,229.0,0.385522,18.0,0.030303,5.0,0.008418,5317.932795,0.0,0.001684,0.105263
3,3,3,ALABAMA,Marshall County,10000600883,Kate D Smith DAR High Sch,120,80-84,AL,ALABAMA,...,241.0,0.525054,10.0,0.021786,0.0,0.0,5909.375686,0.0,0.002179,0.068966
4,4,4,ALABAMA,Marshall County,10000601585,Brindlee Mt High Sch,94,85-89,AL,ALABAMA,...,64.0,0.176309,8.0,0.022039,0.0,0.0,3962.305785,0.0,0.002755,0.10101


## Binning Graduation Rates into 10 Bins





In [5]:
#keep schools with 31 or more students in the graduating cohort. Very small cohorts
#report rates as ranges that are too wide to be useful for modeling.
bigger = binned[binned['ALL_COHORT_1516'] >= 31]

In [9]:
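#look for schools with a rate of exactly 100. The rates are stored as strings,
#so comparing against the integer 100 returns no rows; '100' would be the matching value
bigger100 = bigger[bigger['ALL_RATE_1516'] == 100]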

In [10]:
bigger100.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 0 entries
Columns: 101 entries, Unnamed: 0 to absent_teacher_rate
dtypes: float64(78), int64(8), object(15)
memory usage: 0.0+ bytes


In [6]:
#strip the "GE" and "LE" prefixes from the rate values
bigger.replace({'ALL_RATE_1516': {'GE95': '95', 'GE90': '90', 'GE99': '99',
                                  'LE10': '10', 'LE1': '1', 'LE5': '5'}},
               inplace=True)


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  method=method)


In [7]:
#smallest range needs to be dealt with, has one digit before '-'
bigger.replace({'ALL_RATE_1516': '6-9'}, '6', inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  method=method)


In [8]:
#take first two digits of rates
bigger['grad_slice'] = bigger['ALL_RATE_1516'].str[:2]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


In [9]:
bigger.reset_index(inplace=True)

In [10]:
#create the binned categories
bigger['grad_rate_bin'] = pd.cut(bigger['grad_slice'].astype(int), [0, 9, 19, 29, 39, 49, 59, 69, 79, 89, 100],
      labels = ['0-9%', '10-19%', '20-29%', '30-39%', '40-49%', '50-59%',
               '60-69%', '70-79%', '80-89%', '90%+'])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.
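
The cleaning steps above (strip the `GE`/`LE` prefix, keep the first number of a range, then cut into ten bins) can be sketched on a toy frame. This is a minimal sketch, not the notebook's code: the helper name `rate_to_number` and the toy values are assumptions made for illustration. It works on an explicit `.copy()`, which is the usual way to avoid the copy-of-a-slice warning shown above, and it parses the whole number instead of the first two characters, since `.str[:2]` would turn a value of `'100'` into `'10'`.

```python
import pandas as pd

def rate_to_number(rate):
    """Return the leading number of a rate string such as 'GE95', '85-89' or '92'."""
    text = str(rate).strip()
    # Drop the 'GE' / 'LE' prefix if present
    if text[:2] in ('GE', 'LE'):
        text = text[2:]
    # For a range like '85-89', keep the part before the dash
    return int(text.split('-')[0])

# Toy data standing in for the real file (values are illustrative only)
toy = pd.DataFrame({
    'ALL_COHORT_1516': [296, 67, 153, 12, 40],
    'ALL_RATE_1516': ['92', 'GE95', '85-89', 'GE50', '6-9'],
})

# .copy() makes the filtered frame independent of the original
toy_bigger = toy[toy['ALL_COHORT_1516'] >= 31].copy()
toy_bigger['grad_slice'] = toy_bigger['ALL_RATE_1516'].map(rate_to_number)

edges = [0, 9, 19, 29, 39, 49, 59, 69, 79, 89, 100]
labels = ['0-9%', '10-19%', '20-29%', '30-39%', '40-49%', '50-59%',
          '60-69%', '70-79%', '80-89%', '90%+']
# include_lowest=True keeps a value of exactly 0 in the first bin
toy_bigger['grad_rate_bin'] = pd.cut(toy_bigger['grad_slice'], edges,
                                     labels=labels, include_lowest=True)
print(toy_bigger[['ALL_RATE_1516', 'grad_slice', 'grad_rate_bin']])
```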


In [11]:
bigger.tail()

Unnamed: 0.2,index,Unnamed: 0,Unnamed: 0.1,STNAM,LEANM,NCESSCH,SCHNAM,ALL_COHORT_1516,ALL_RATE_1516,LEA_STATE,...,total_suspension_days,suspensed_day_rate,harassed,harassed_rate,activities_funds_rate,non_cert_rate,counselor_rate,absent_teacher_rate,grad_slice,grad_rate_bin
16568,21854,21854,21854,WYOMING,Sheridan County School District #1,560569000311,Big Horn High School,37,90,WY,...,2.0,0.013245,4.0,0.02649,1738.913907,0.0,0.006623,0.846154,90,90%+
16569,21858,21858,21858,WYOMING,Sheridan County School District #2,560569500360,Sheridan High School,236,89,WY,...,46.0,0.046796,0.0,0.0,1694.559176,0.0,0.005086,0.198779,89,80-89%
16570,21861,21861,21861,WYOMING,Sweetwater County School District #2,560576200324,Green River High School,176,85-89,WY,...,110.0,0.14157,1.0,0.001287,1840.87749,0.0,0.003861,0.2,85,80-89%
16571,21863,21863,21863,WYOMING,Teton County School District #1,560583000335,Jackson Hole High School,127,95,WY,...,26.0,0.040625,0.0,0.0,4153.518984,0.0,0.004687,0.118939,95,90%+
16572,21866,21866,21866,WYOMING,Washakie County School District #1,560624000343,Worland High School,105,75-79,WY,...,10.0,0.025,0.0,0.0,3015.755325,0.0,0.004275,0.331544,75,70-79%


In [12]:
bigger['grad_rate_bin'].value_counts()

90%+      8712
80-89%    4257
70-79%    1457
60-69%     536
50-59%     352
10-19%     305
40-49%     278
30-39%     264
20-29%     264
0-9%       148
Name: grad_rate_bin, dtype: int64

In [13]:
#Juvenile Justice schools have very different graduation rates:
#35 of 41 schools have graduation rates of less than 50%.

byRate = bigger[bigger['JJ'] == 'Yes'].groupby('grad_rate_bin')

In [14]:
byRate.count()

grad_rate_bin,index,Unnamed: 0,Unnamed: 0.1,STNAM,LEANM,NCESSCH,SCHNAM,ALL_COHORT_1516,ALL_RATE_1516,LEA_STATE,...,sports_rate,total_suspension_days,suspensed_day_rate,harassed,harassed_rate,activities_funds_rate,non_cert_rate,counselor_rate,absent_teacher_rate,grad_slice
0-9%,8,8,8,8,8,8,8,8,8,8,...,0,8,8,8,8,8,8,8,8,8
10-19%,15,15,15,15,15,15,15,15,15,15,...,0,15,15,15,15,14,15,15,15,15
20-29%,7,7,7,7,7,7,7,7,7,7,...,0,7,7,7,7,7,7,7,7,7
30-39%,5,5,5,5,5,5,5,5,5,5,...,0,5,5,5,5,5,5,5,5,5
40-49%,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
50-59%,3,3,3,3,3,3,3,3,3,3,...,0,3,3,3,3,3,3,3,3,3
60-69%,1,1,1,1,1,1,1,1,1,1,...,0,1,1,1,1,1,1,1,1,1
70-79%,1,1,1,1,1,1,1,1,1,1,...,0,1,1,1,1,1,1,1,1,1
80-89%,1,1,1,1,1,1,1,1,1,1,...,0,1,1,1,1,1,1,1,1,1
90%+,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [15]:
#Alternative Schools have a different graduation rate pattern: of the
#1,055 Alternative Schools, 629 graduate less than half of their cohort.
byRate_alt = bigger[bigger['SCH_STATUS_ALT'] == 'Yes'].groupby('grad_rate_bin')
byRate_alt.count()

grad_rate_bin,index,Unnamed: 0,Unnamed: 0.1,STNAM,LEANM,NCESSCH,SCHNAM,ALL_COHORT_1516,ALL_RATE_1516,LEA_STATE,...,sports_rate,total_suspension_days,suspensed_day_rate,harassed,harassed_rate,activities_funds_rate,non_cert_rate,counselor_rate,absent_teacher_rate,grad_slice
0-9%,85,85,85,85,85,85,85,85,85,85,...,2,85,85,84,84,48,61,85,61,85
10-19%,162,162,162,162,162,162,162,162,162,162,...,7,162,162,160,160,125,146,162,146,162
20-29%,128,128,128,128,128,128,128,128,128,128,...,6,128,128,127,127,119,125,128,125,128
30-39%,128,128,128,128,128,128,128,128,128,128,...,10,128,128,127,127,123,126,128,126,128
40-49%,126,126,126,126,126,126,126,126,126,126,...,5,126,126,123,123,124,126,126,126,126
50-59%,106,106,106,106,106,106,106,106,106,106,...,7,106,106,106,106,106,105,106,105,106
60-69%,97,97,97,97,97,97,97,97,97,97,...,2,97,97,95,95,95,96,97,96,97
70-79%,96,96,96,96,96,96,96,96,96,96,...,3,96,96,95,95,93,94,96,94,96
80-89%,79,79,79,79,79,79,79,79,79,79,...,4,79,79,79,79,79,79,79,79,79
90%+,48,48,48,48,48,48,48,48,48,48,...,15,48,48,48,48,48,48,48,48,48


In [16]:
#The 75 Special Education Schools have a different graduation rate pattern.
#30 of them graduate less than half of their cohort; however,
#25 of them graduate 90% or more of their cohort.
byRate_sped = bigger[bigger['SCH_STATUS_SPED'] == 'Yes'].groupby('grad_rate_bin')
byRate_sped.count()

grad_rate_bin,index,Unnamed: 0,Unnamed: 0.1,STNAM,LEANM,NCESSCH,SCHNAM,ALL_COHORT_1516,ALL_RATE_1516,LEA_STATE,...,sports_rate,total_suspension_days,suspensed_day_rate,harassed,harassed_rate,activities_funds_rate,non_cert_rate,counselor_rate,absent_teacher_rate,grad_slice
0-9%,3,3,3,3,3,3,3,3,3,3,...,0,3,3,3,3,3,3,3,3,3
10-19%,16,16,16,16,16,16,16,16,16,16,...,1,16,16,16,16,15,15,16,15,16
20-29%,3,3,3,3,3,3,3,3,3,3,...,0,3,3,3,3,3,3,3,3,3
30-39%,3,3,3,3,3,3,3,3,3,3,...,0,3,3,3,3,2,3,3,3,3
40-49%,5,5,5,5,5,5,5,5,5,5,...,2,5,5,5,5,5,5,5,5,5
50-59%,2,2,2,2,2,2,2,2,2,2,...,0,2,2,2,2,2,2,2,2,2
60-69%,4,4,4,4,4,4,4,4,4,4,...,2,4,4,4,4,3,4,4,4,4
70-79%,5,5,5,5,5,5,5,5,5,5,...,4,5,5,5,5,5,5,5,5,5
80-89%,9,9,9,9,9,9,9,9,9,9,...,7,9,9,9,9,9,9,9,9,9
90%+,25,25,25,25,25,25,25,25,25,25,...,22,25,25,25,25,24,24,25,24,25
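
The three tables above repeat nearly the same number in every column because `groupby(...).count()` reports the non-null count for each column. A more compact view, sketched here on toy data, is a single count per school type and bin. The column names come from the notebook; the toy rows and the `counts` variable are made up for illustration.

```python
import pandas as pd

# Toy rows standing in for the real data (values are illustrative only)
toy = pd.DataFrame({
    'JJ': ['Yes', 'No', 'No', 'Yes', 'No', 'No'],
    'SCH_STATUS_ALT': ['No', 'Yes', 'Yes', 'No', 'No', 'Yes'],
    'SCH_STATUS_SPED': ['No', 'No', 'No', 'No', 'Yes', 'No'],
    'grad_rate_bin': ['10-19%', '10-19%', '90%+', '0-9%', '90%+', '10-19%'],
})

flags = ['JJ', 'SCH_STATUS_ALT', 'SCH_STATUS_SPED']
# One column of bin counts per school type, counting only rows flagged 'Yes'
counts = pd.DataFrame({
    flag: toy.loc[toy[flag] == 'Yes', 'grad_rate_bin'].value_counts()
    for flag in flags
}).fillna(0).astype(int)
print(counts)
```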


In [17]:
#save binned data for future use (index=False avoids writing an extra 'Unnamed: 0' column)
bigger.to_csv('binned.csv', index=False)

## Model 1 - Random Forest Classifier 

In [17]:
#select the numeric feature columns; their missing values are mean-imputed in the next cell
#total enrollment and ALL_RATE_1516 are deliberately left out
bigger_feat = bigger[['ap_ib_de_rate', 'sat_act_rate', 'pass_algebra_rate',
      'geometry_rate', 'algebra2_rate', 'calc_rate', 'chronic_absent_rate', 'activities_funds_rate',
      'sports_rate', 'suspensed_day_rate', 'harassed_rate',
      'non_cert_rate','counselor_rate','absent_teacher_rate']]

In [19]:
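#fillna with the column means fills every numeric column in one call, so it only
#needs to run once; the loop just lists the feature columns being covered
for col in bigger_feat.columns:
    print('im here:', col)
bigger.fillna(bigger.mean(numeric_only=True), inplace=True)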

im here: ap_ib_de_rate


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._update_inplace(new_data)


im here: sat_act_rate
im here: pass_algebra_rate
im here: geometry_rate
im here: algebra2_rate
im here: calc_rate
im here: chronic_absent_rate
im here: activities_funds_rate
im here: sports_rate
im here: suspensed_day_rate
im here: harassed_rate
im here: non_cert_rate
im here: counselor_rate
im here: absent_teacher_rate


In [21]:
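#No missing values remain in the numeric columns; the text columns
#SCH_MAGNETDETAIL and SCH_ALTFOCUS still have gaps but are not used as features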
bigger.isna().sum()

index                      0
Unnamed: 0                 0
Unnamed: 0.1               0
STNAM                      0
LEANM                      0
NCESSCH                    0
SCHNAM                     0
ALL_COHORT_1516            0
ALL_RATE_1516              0
LEA_STATE                  0
LEA_STATE_NAME             0
LEAID_y                    0
LEA_NAME                   0
SCHID                      0
SCH_NAME                   0
COMBOKEY                   0
JJ                         0
SCH_STATUS_SPED            0
SCH_STATUS_MAGNET          0
SCH_STATUS_CHARTER         0
SCH_STATUS_ALT             0
SCH_MAGNETDETAIL         638
SCH_ALTFOCUS             498
TOT_ENR_M                  0
TOT_ENR_F                  0
TOT_GTENR_M                0
TOT_GTENR_F                0
TOT_DUAL_M                 0
TOT_DUAL_F                 0
TOT_ALGENR_GS0910_M        0
                        ... 
SCH_FTETEACH_ABSENT        0
districtID                 0
IDSCH                      0
total_enrollme
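
Filling with the mean of the full table before splitting lets the test rows influence the imputed values. One common alternative, sketched below, learns the means from the training split only by placing scikit-learn's `SimpleImputer` inside a `Pipeline`. The synthetic data and the `*_demo` names are assumptions for illustration, not the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
# Synthetic stand-in for two of the rate features
X_demo = pd.DataFrame({
    'chronic_absent_rate': rng.random(200),
    'suspensed_day_rate': rng.random(200),
})
# Blank out every tenth value of the first column to create gaps
X_demo.iloc[::10, 0] = np.nan
y_demo = np.where(X_demo['suspensed_day_rate'] > 0.5, '80-89%', '90%+')

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.33, random_state=42)

# The imputer is fit inside the pipeline, so the means come from X_tr only
model = make_pipeline(
    SimpleImputer(strategy='mean'),
    RandomForestClassifier(n_estimators=50, max_depth=5, random_state=42),
)
model.fit(X_tr, y_tr)
print(model.score(X_te, y_te))
```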

In [22]:
y = bigger['grad_rate_bin']
X = bigger[['ap_ib_de_rate', 'sat_act_rate', 'pass_algebra_rate',
      'geometry_rate', 'algebra2_rate', 'calc_rate', 'chronic_absent_rate', 'activities_funds_rate',
      'sports_rate', 'suspensed_day_rate', 'harassed_rate',
      'non_cert_rate','counselor_rate','absent_teacher_rate']]

In [23]:
X_train_rf, X_test_rf, y_train_rf, y_test_rf = train_test_split(X, y, test_size=0.33, random_state=42)

In [24]:
rf = RandomForestClassifier(n_estimators=100, max_depth=5)

In [25]:
rf.fit(X_train_rf, y_train_rf)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=5, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [26]:
rf.score(X_test_rf, y_test_rf)

0.5561243144424132
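
An accuracy of about 0.556 is easier to judge against a baseline that always predicts the most common bin. The sketch below shows one way to get such a baseline with scikit-learn's `DummyClassifier`. The labels are synthetic and only roughly mimic the skew toward `90%+`; they are not the notebook's data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Synthetic, imbalanced labels (proportions are illustrative only)
y_demo = rng.choice(['90%+', '80-89%', '70-79%'], size=1000, p=[0.55, 0.30, 0.15])
X_demo = rng.random((1000, 3))  # features are ignored by the dummy model

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.33, random_state=42)

# Always predicts the most frequent label seen in the training split
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_tr, y_tr)
baseline_acc = baseline.score(X_te, y_te)
print('majority-class accuracy:', baseline_acc)
```

A model that scores only slightly above this number has learned little beyond the class frequencies.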

In [27]:
rf.feature_importances_

array([0.00113737, 0.09093924, 0.00653056, 0.01463708, 0.02414926,
       0.11523727, 0.25132241, 0.00913153, 0.11606474, 0.28011475,
       0.00455352, 0.02342948, 0.04506198, 0.01769081])
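
The raw array is easier to read once each value is paired with its column name. The sketch below reuses the importances printed above and the column order from the feature selection; it assumes that order is the one the model was trained on.

```python
import pandas as pd

feature_names = ['ap_ib_de_rate', 'sat_act_rate', 'pass_algebra_rate',
                 'geometry_rate', 'algebra2_rate', 'calc_rate',
                 'chronic_absent_rate', 'activities_funds_rate',
                 'sports_rate', 'suspensed_day_rate', 'harassed_rate',
                 'non_cert_rate', 'counselor_rate', 'absent_teacher_rate']
# Values copied from the rf.feature_importances_ output above
importances = [0.00113737, 0.09093924, 0.00653056, 0.01463708, 0.02414926,
               0.11523727, 0.25132241, 0.00913153, 0.11606474, 0.28011475,
               0.00455352, 0.02342948, 0.04506198, 0.01769081]

# Label each importance and sort from most to least important
ranked = pd.Series(importances, index=feature_names).sort_values(ascending=False)
print(ranked.head(5))
```

On these numbers the two largest importances belong to `suspensed_day_rate` and `chronic_absent_rate`.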

In [29]:
from sklearn.metrics import confusion_matrix, classification_report


In [30]:
confusion_matrix(y_test_rf, rf.predict(X_test_rf))

array([[   0,    1,    0,    0,    0,    0,    0,    0,   15,   39],
       [   0,    7,    0,    0,    0,    0,    1,    0,   20,   73],
       [   0,    2,    0,    0,    0,    0,    2,    0,   28,   53],
       [   0,    8,    0,    0,    0,    0,    4,    1,   33,   57],
       [   0,    7,    0,    0,    0,    0,    3,    0,   34,   61],
       [   0,    3,    0,    0,    0,    0,    1,    3,   56,   65],
       [   0,    3,    0,    0,    0,    0,    1,    0,   74,   98],
       [   0,    3,    0,    0,    0,    0,    2,    0,  180,  274],
       [   0,    1,    0,    0,    0,    0,    0,    0,  289, 1088],
       [   0,    0,    0,    0,    0,    0,    2,    0,  133, 2745]])
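
Each row of the matrix above is a true bin and each column a predicted bin. The sketch below recomputes overall accuracy and per-bin recall from the printed matrix. The label order is an assumption based on scikit-learn sorting string labels alphabetically, which here matches the order of the bins.

```python
import numpy as np

labels = ['0-9%', '10-19%', '20-29%', '30-39%', '40-49%',
          '50-59%', '60-69%', '70-79%', '80-89%', '90%+']
# Values copied from the confusion_matrix output above
cm = np.array([
    [0, 1, 0, 0, 0, 0, 0, 0,  15,   39],
    [0, 7, 0, 0, 0, 0, 1, 0,  20,   73],
    [0, 2, 0, 0, 0, 0, 2, 0,  28,   53],
    [0, 8, 0, 0, 0, 0, 4, 1,  33,   57],
    [0, 7, 0, 0, 0, 0, 3, 0,  34,   61],
    [0, 3, 0, 0, 0, 0, 1, 3,  56,   65],
    [0, 3, 0, 0, 0, 0, 1, 0,  74,   98],
    [0, 3, 0, 0, 0, 0, 2, 0, 180,  274],
    [0, 1, 0, 0, 0, 0, 0, 0, 289, 1088],
    [0, 0, 0, 0, 0, 0, 2, 0, 133, 2745],
])

# Correct predictions sit on the diagonal
accuracy = np.trace(cm) / cm.sum()
# Recall per bin: correct predictions divided by the true count in that row
recall = np.diag(cm) / cm.sum(axis=1)
for label, r in zip(labels, recall):
    print(f'{label:>7}: recall {r:.3f}')
print('accuracy:', accuracy)
```

Nearly all predictions fall in the `80-89%` and `90%+` columns, and six of the ten bins have no correct predictions at all.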

## Model 2 - Random Forest with Tuning

In [None]:
#use X, y from above
X_train_tune, X_test_tune, y_train_tune, y_test_tune = train_test_split(X, y, test_size=0.2, random_state=102)

In [None]:
# Create the random grid
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 4500, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]

random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

In [None]:
rf_tune = RandomForestClassifier()

In [None]:
#'auto' and 'sqrt' are the same setting for RandomForestClassifier, so switching max_features between them made no difference
grid_clf = GridSearchCV(rf_tune, random_grid, cv=10)
grid_clf.fit(X_train_tune, y_train_tune)
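
The grid defined above has 10 x 2 x 12 x 3 x 3 x 2 = 4,320 combinations, so an exhaustive `GridSearchCV` with `cv=10` would fit 43,200 forests. Since the dictionary is named `random_grid`, a randomized search may be what was intended; that is an assumption, not something the notebook states. The sketch below samples a handful of combinations with `RandomizedSearchCV` on synthetic data, using a deliberately small grid so that it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic data standing in for X, y
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=102)

# Small illustrative grid; the notebook's grid is much larger
param_distributions = {
    'n_estimators': [20, 50],
    'max_depth': [3, 5, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,        # number of sampled combinations
    cv=3,
    random_state=0,
)
search.fit(X_tr, y_tr)

# best_estimator_ is refit on the whole training split by default
best_rf = search.best_estimator_
print(search.best_params_)
print(best_rf.score(X_te, y_te))
```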

In [None]:
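#use the best model found by the search (already refit on the training data)
#rather than refitting the untuned default forest
rf_tune = grid_clf.best_estimator_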

In [None]:
rf_tune.score(X_test_tune, y_test_tune)

In [None]:
rf_tune.feature_importances_

In [None]:
feat_importances = pd.Series(rf_tune.feature_importances_, index=X.columns)
feat_importances.nlargest(10).plot(kind='barh')