# Predicting graduation rates group project

In this group project, you'll be predicting graduation rates (or, more specifically, the percent difference from the mean graduation rate) for high school students in different areas of the country.

As you may already know, [your illustrious instructor Dave Yerrington won a competition on this dataset.](http://devpost.com/software/sriozidave_datafordiploma)

---

### Dataset

The files for this project can be found in the `/DSI-SF-2/datasets/data_for_diplomas/` folder. The contents are:

    grad_train: the training data you will be building models on
    graduation_with_census_schema.pdf: the "codebook" for the grad_train.csv columns
    school_county_spending.csv: data on the spending of different schools
    school_county_spending_info.pdf: information about the spending data csv
    climate_data/: a folder that has climate data for every state for 2011 and 2012. There are csvs for precipitation and average temperature
    
This isn't all the data Dave used, but it's a decent amount of it. You're not expected to use _all_ of this data. After all, you don't have that long to build these models, but it's there if you want to.

**Target variable:**

In the spreadsheet there is a variable called `grad_pct_from_mean`. This is the difference in percent graduated for that school from the mean graduation rate across the country. In other words, it is:

    (schools graduation rate / (mean of all schools graduation rate) - 1.) * 100.
    
I changed this from the original rate column because, as you know, regression is not appropriate for prediction of rates since they are guaranteed to fall between 0 and 1.

**Do not include variables in the model that contain the same information as the target variable!** 

---

### Structure of the project

You will, in groups, try to build a model that predicts this `grad_pct_from_mean` variable using the information you have. 

**This project is also an excercise in using your time wisely.** You don't have that much time, so keep the scope of your process simple rather than complex. This will likely mean _not_ considering every variable available to you.

**I have left out 25% of the data as a testing set. At the end, groups will come up and get to test their model on the testing set.**

Since you will likely be cleaning the data, you will need to be able to run the testing code through the same cleaning and munging process as the training data. I recommend writing some functions that make this process faster/easier!

Good luck!

In [32]:
filepath = '/Users/tlee010/desktop/dsi-sf-2-timdavidlee/datasets/data_for_diplomas/grad_train.csv'
filepath2 = '/Users/tlee010/desktop/dsi-sf-2-timdavidlee/datasets/data_for_diplomas/school_county_spending.csv'

In [102]:
# two source data paths
filepath = '/Users/tlee010/desktop/dsi-sf-2-timdavidlee/datasets/data_for_diplomas/grad_train.csv'
filepath2 = '/Users/tlee010/desktop/dsi-sf-2-timdavidlee/datasets/data_for_diplomas/school_county_spending.csv'
df = pd.read_csv(filepath)
df2 = pd.read_csv(filepath2)

# adding string versions of the ID to do the merge between the two tables
df['leaid11_str'] = df['leaid11'].astype(str)
df2['NCESID_str'] = df2['NCESID'].astype(str)
df['leaid11_str'] = df['leaid11_str'].map(lambda x : '0'+x if len(x)==6 else x)


# doing the merge --- > adding in PPSALWG
grad_train = df.merge(df2[['NCESID_str','PPSALWG']], left_on = 'leaid11_str', right_on='NCESID_str',how='left')

#reducing to necessary columns
grad_train =  grad_train[['grad_pct_from_mean','State','County','Females_ACS_08_12','Females_CEN_2010','Males_ACS_08_12','Males_CEN_2010',
            'RURAL_POP_CEN_2010','URBAN_CLUSTER_POP_CEN_2010','URBANIZED_AREA_POP_CEN_2010',
         'LAND_AREA','ECD_COHORT_1112', 'MrdCple_Fmly_HHD_CEN_2010',
               'MrdCple_Fmly_HHD_ACS_08_12','MrdCple_Fmly_HHD_ACSMOE_08_12','Not_MrdCple_HHD_CEN_2010',
        'Not_MrdCple_HHD_ACS_08_12','Not_MrdCple_HHD_ACSMOE_08_12','pct_MrdCple_HHD_CEN_2010','pct_MrdCple_HHD_ACS_08_12',
        'pct_MrdCple_HHD_ACSMOE_08_12','pct_Not_MrdCple_HHD_CEN_2010','pct_Not_MrdCple_HHD_ACS_08_12',
            'pct_Not_MrdCple_HHD_ACSMOE_08_12', 'Pov_Univ_ACS_08_12', 'Pov_Univ_ACSMOE_08_12', 'Prs_Blw_Pov_Lev_ACS_08_12', 
        'Prs_Blw_Pov_Lev_ACSMOE_08_12', 'pct_Prs_Blw_Pov_Lev_ACS_08_12','PPSALWG']]





In [103]:
grad_train_norm =grad_train.dropna()
grad_train_norm.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7174 entries, 0 to 7337
Data columns (total 30 columns):
grad_pct_from_mean                  7174 non-null float64
State                               7174 non-null int64
County                              7174 non-null int64
Females_ACS_08_12                   7174 non-null float64
Females_CEN_2010                    7174 non-null int64
Males_ACS_08_12                     7174 non-null float64
Males_CEN_2010                      7174 non-null int64
RURAL_POP_CEN_2010                  7174 non-null int64
URBAN_CLUSTER_POP_CEN_2010          7174 non-null int64
URBANIZED_AREA_POP_CEN_2010         7174 non-null int64
LAND_AREA                           7174 non-null float64
ECD_COHORT_1112                     7174 non-null float64
MrdCple_Fmly_HHD_CEN_2010           7174 non-null int64
MrdCple_Fmly_HHD_ACS_08_12          7174 non-null float64
MrdCple_Fmly_HHD_ACSMOE_08_12       7174 non-null float64
Not_MrdCple_HHD_CEN_2010            717

In [107]:
grad_train_norm.info()

import patsy
formula = 'grad_pct_from_mean ~' + '+'.join([x for x in grad_train.columns.values if x !='grad_pct_from_mean'])+'-1'
y, X = patsy.dmatrices(formula,grad_train)


<class 'pandas.core.frame.DataFrame'>
Int64Index: 7174 entries, 0 to 7337
Data columns (total 30 columns):
grad_pct_from_mean                  7174 non-null float64
State                               7174 non-null int64
County                              7174 non-null int64
Females_ACS_08_12                   7174 non-null float64
Females_CEN_2010                    7174 non-null int64
Males_ACS_08_12                     7174 non-null float64
Males_CEN_2010                      7174 non-null int64
RURAL_POP_CEN_2010                  7174 non-null int64
URBAN_CLUSTER_POP_CEN_2010          7174 non-null int64
URBANIZED_AREA_POP_CEN_2010         7174 non-null int64
LAND_AREA                           7174 non-null float64
ECD_COHORT_1112                     7174 non-null float64
MrdCple_Fmly_HHD_CEN_2010           7174 non-null int64
MrdCple_Fmly_HHD_ACS_08_12          7174 non-null float64
MrdCple_Fmly_HHD_ACSMOE_08_12       7174 non-null float64
Not_MrdCple_HHD_CEN_2010            717

In [115]:

from sklearn.linear_model import RidgeCV
rcv = RidgeCV(cv = 5, alphas=[0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000,5000, 10000,20000,30000])
rcv_model = rcv.fit(X,y)
score = rcv_model.score(X,y)
print score
print rcv_model.alpha_
print rcv_model.coef_

0.18879971106
20000
[[  6.61798041e-02   1.31808654e-03   2.09590143e-03   3.55719979e-03
    2.74308107e-03  -3.08165734e-03  -4.96500260e-04   4.55585201e-04
    5.16457521e-04  -1.14838059e-03  -1.36763509e-03   3.48018003e-03
    5.08386390e-03  -3.61797240e-02  -2.26052342e-03  -1.54651057e-03
    5.02748151e-02   3.69767574e-02   2.57414424e-03   3.20831178e-02
   -3.69956773e-02  -2.57414429e-03   9.02390292e-02  -3.29052149e-03
   -7.36179309e-03  -7.79623703e-05   4.53198343e-03  -2.59899402e-01
   -8.02121332e-04]]


In [145]:



def run_my_stuff(trainpath, suppath,trainpath2, suppath2):
#     if trainpath=='-1':
#         filepath = '/Users/tlee010/desktop/dsi-sf-2-timdavidlee/datasets/data_for_diplomas/grad_train.csv'
#     if suppath=='-1':
#         filepath2 = '/Users/tlee010/desktop/dsi-sf-2-timdavidlee/datasets/data_for_diplomas/school_county_spending.csv'
#     if trainpath2=='-1':
#         filepath3 = '/Users/tlee010/desktop/dsi-sf-2-timdavidlee/datasets/data_for_diplomas/grad_train.csv'
#     if suppath2=='-1':
#         filepath4 = '/Users/tlee010/desktop/dsi-sf-2-timdavidlee/datasets/data_for_diplomas/school_county_spending.csv'
    
    
    df = pd.read_csv(filepath)
    df2 = pd.read_csv(filepath2)
    
    df3 = pd.read_csv(filepath3)
    df4 = pd.read_csv(filepath4)
     

    # adding string versions of the ID to do the merge between the two tables
    df['leaid11_str'] = df['leaid11'].astype(str)
    df2['NCESID_str'] = df2['NCESID'].astype(str)
    df['leaid11_str'] = df['leaid11_str'].map(lambda x : '0'+x if len(x)==6 else x)

    df3['leaid11_str'] = df3['leaid11'].astype(str)
    df4['NCESID_str'] = df4['NCESID'].astype(str)
    df3['leaid11_str'] = df3['leaid11_str'].map(lambda x : '0'+x if len(x)==6 else x)



    # doing the merge --- > adding in PPSALWG
    grad_train = df.merge(df2[['NCESID_str','PPSALWG']], left_on = 'leaid11_str', right_on='NCESID_str',how='left')
    grad_test = df3.merge(df4[['NCESID_str','PPSALWG']], left_on = 'leaid11_str', right_on='NCESID_str',how='left')


    #reducing to necessary columns
    grad_train =  grad_train[['grad_pct_from_mean','State','Females_ACS_08_12','Females_CEN_2010','Males_ACS_08_12','Males_CEN_2010',
                'RURAL_POP_CEN_2010','URBAN_CLUSTER_POP_CEN_2010','URBANIZED_AREA_POP_CEN_2010',
             'LAND_AREA','ECD_COHORT_1112', 'MrdCple_Fmly_HHD_CEN_2010',
                   'MrdCple_Fmly_HHD_ACS_08_12','MrdCple_Fmly_HHD_ACSMOE_08_12','Not_MrdCple_HHD_CEN_2010',
            'Not_MrdCple_HHD_ACS_08_12','Not_MrdCple_HHD_ACSMOE_08_12','pct_MrdCple_HHD_CEN_2010','pct_MrdCple_HHD_ACS_08_12',
            'pct_MrdCple_HHD_ACSMOE_08_12','pct_Not_MrdCple_HHD_CEN_2010','pct_Not_MrdCple_HHD_ACS_08_12',
                'pct_Not_MrdCple_HHD_ACSMOE_08_12', 'Pov_Univ_ACS_08_12', 'Pov_Univ_ACSMOE_08_12', 'Prs_Blw_Pov_Lev_ACS_08_12', 
            'Prs_Blw_Pov_Lev_ACSMOE_08_12', 'pct_Prs_Blw_Pov_Lev_ACS_08_12','PPSALWG']]
    
    grad_test =  grad_test[['grad_pct_from_mean','State','Females_ACS_08_12','Females_CEN_2010','Males_ACS_08_12','Males_CEN_2010',
                'RURAL_POP_CEN_2010','URBAN_CLUSTER_POP_CEN_2010','URBANIZED_AREA_POP_CEN_2010',
             'LAND_AREA','ECD_COHORT_1112', 'MrdCple_Fmly_HHD_CEN_2010',
                   'MrdCple_Fmly_HHD_ACS_08_12','MrdCple_Fmly_HHD_ACSMOE_08_12','Not_MrdCple_HHD_CEN_2010',
            'Not_MrdCple_HHD_ACS_08_12','Not_MrdCple_HHD_ACSMOE_08_12','pct_MrdCple_HHD_CEN_2010','pct_MrdCple_HHD_ACS_08_12',
            'pct_MrdCple_HHD_ACSMOE_08_12','pct_Not_MrdCple_HHD_CEN_2010','pct_Not_MrdCple_HHD_ACS_08_12',
                'pct_Not_MrdCple_HHD_ACSMOE_08_12', 'Pov_Univ_ACS_08_12', 'Pov_Univ_ACSMOE_08_12', 'Prs_Blw_Pov_Lev_ACS_08_12', 
            'Prs_Blw_Pov_Lev_ACSMOE_08_12', 'pct_Prs_Blw_Pov_Lev_ACS_08_12','PPSALWG']]


    
    import patsy
    formula = 'grad_pct_from_mean ~' + '+'.join([x for x in grad_train.columns.values if x !='grad_pct_from_mean'])+'-1'
    y, X = patsy.dmatrices(formula,grad_train)
    
    formula = 'grad_pct_from_mean ~' + '+'.join([x for x in grad_test.columns.values if x !='grad_pct_from_mean'])+'-1'
    y_test, X_test = patsy.dmatrices(formula,grad_test)
    
    
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    X = scaler.fit_transform(X)
    y = np.ravel(y)

    X_test = scaler.fit_transform(X_test)
    y_test = np.ravel(y_test)

    
    # ===================== models and scoring
    from sklearn.linear_model import RidgeCV
    rcv = RidgeCV(cv = 5, alphas=[0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 2000,3000, 5000, 10000])
    rcv_model = rcv.fit(X,y)
    score = rcv_model.score(X_test,y_test)
    print '='*60
    print "ridge score:"
    print score
    print "ridge alpha:"    
    print rcv_model.alpha_
    
    from sklearn.linear_model import LassoCV
    lcv = LassoCV(cv = 5,n_alphas=100 )
    lcv_model = lcv.fit(X,y)
    score = lcv_model.score(X_test,y_test)
    print '='*60    
    print "lasso score:"
    print score
    print "lasso alpha:"    
    print lcv_model.alpha_

#training data - Grad PCT   
filepath10 = '/Users/tlee010/desktop/dsi-sf-2-timdavidlee/datasets/data_for_diplomas/grad_train.csv'
filepath20 = '/Users/tlee010/desktop/dsi-sf-2-timdavidlee/datasets/data_for_diplomas/school_county_spending.csv'
filepath40 = '/Users/tlee010/desktop/dsi-sf-2-timdavidlee/datasets/data_for_diplomas/school_county_spending.csv'

#test data - 
filepath30 = '/Users/tlee010/desktop/dsi-sf-2/datasets/data_for_diplomas/grad_test.csv'


run_my_stuff(filepath10,filepath20,filepath30,filepath40)




ridge score:
0.2069986813
ridge alpha:
2000
ridge score:
0.224415617936
ridge alpha:
0.0135242518428


In [125]:
grad_train.head()

Unnamed: 0,grad_pct_from_mean,State,County,Females_ACS_08_12,Females_CEN_2010,Males_ACS_08_12,Males_CEN_2010,RURAL_POP_CEN_2010,URBAN_CLUSTER_POP_CEN_2010,URBANIZED_AREA_POP_CEN_2010,...,pct_MrdCple_HHD_ACSMOE_08_12,pct_Not_MrdCple_HHD_CEN_2010,pct_Not_MrdCple_HHD_ACS_08_12,pct_Not_MrdCple_HHD_ACSMOE_08_12,Pov_Univ_ACS_08_12,Pov_Univ_ACSMOE_08_12,Prs_Blw_Pov_Lev_ACS_08_12,Prs_Blw_Pov_Lev_ACSMOE_08_12,pct_Prs_Blw_Pov_Lev_ACS_08_12,PPSALWG
0,10.791194,18,53,2849.0,2898,2554.0,2786,3963,1721,0,...,6.76419,32.94,32.814021,6.368943,5403.0,387.0,457.0,216.0,8.458264,4614.0
1,3.565682,48,215,3027.0,2667,2418.0,2304,0,0,4971,...,11.6331,54.88,51.527861,12.008602,5445.0,723.0,2936.0,797.0,53.921028,5479.0
2,14.403951,48,239,2609.0,2475,2546.0,2354,1733,3096,0,...,5.856993,44.11,38.421053,8.063208,5038.0,454.0,497.0,199.0,9.865026,5318.0
3,14.403951,36,59,2210.0,2324,2306.0,2241,0,0,4565,...,4.624822,40.65,44.284822,9.530248,4516.0,235.0,393.0,189.0,8.702391,17585.0
4,4.769934,28,109,1837.0,1798,1375.0,1575,3373,0,0,...,6.450314,55.43,59.145775,8.946724,2688.0,253.0,710.0,221.0,26.41369,5089.0


In [96]:
import pandas as pd
import numpy as np
import patsy
from sklearn.cross_validation import cross_val_score, cross_val_predict


Unnamed: 0,IDCENSUS,NAME,CONUM,CSA,CBSA,NCESID,ENROLL,TOTALREV,TFEDREV,FEDRCOMP,...,PPSALWG,PPEMPBEN,PPITOTAL,PPISALWG,PPIEMBEN,PPSTOTAL,PPSPUPIL,PPSSTAFF,PPSGENAD,PPSSCHAD
0,1500100100000,AUTAUGA COUNTY SCHOOL DISTRICT,1001,388,33860,100240,9825,77270,7416,1268,...,4332,1680,4288,2966,1075,2229,345,205,155,422
1,1500200100000,BALDWIN COUNTY SCHOOL DISTRICT,1003,380,19300,100270,28700,277787,22367,5987,...,4783,1763,4757,3159,1136,2950,457,406,125,530
2,1500300100000,BARBOUR COUNTY SCHOOL DISTRICT,1005,N,21640,100300,1060,11213,2642,1084,...,5676,2139,5264,3397,1215,3752,569,345,574,681
3,1500300200000,EUFAULA CITY SCHOOL DISTRICT,1005,N,21640,101410,2759,23972,3094,892,...,4834,1866,4965,3219,1212,2542,456,242,369,518
4,1500400100000,BIBB COUNTY SCHOOL DISTRICT,1007,142,13820,100360,3539,32599,4286,1258,...,4750,1836,4372,3011,1107,2737,431,420,260,478


(7338, 580)

In [97]:
df.columns.values

array(['leaid11', 'STNAM', 'FIPST', 'leanm11', 'ALL_COHORT_1112',
       'ALL_RATE_1112', 'MAM_COHORT_1112', 'MAM_RATE_1112',
       'MAS_COHORT_1112', 'MAS_RATE_1112', 'MBL_COHORT_1112',
       'MBL_RATE_1112', 'MHI_COHORT_1112', 'MHI_RATE_1112',
       'MTR_COHORT_1112', 'MTR_RATE_1112', 'MWH_COHORT_1112',
       'MWH_RATE_1112', 'CWD_COHORT_1112', 'CWD_RATE_1112',
       'ECD_COHORT_1112', 'ECD_RATE_1112', 'LEP_COHORT_1112', 'Percentage',
       'State', 'County', 'Tract.Code', 'School.District', 'District.ID',
       'GIDTR', 'State.1', 'State_name', 'County.1', 'County_name',
       'Tract', 'Flag', 'Num_BGs_in_Tract', 'LAND_AREA', 'AIAN_LAND',
       'URBANIZED_AREA_POP_CEN_2010', 'URBAN_CLUSTER_POP_CEN_2010',
       'RURAL_POP_CEN_2010', 'Tot_Population_CEN_2010',
       'Tot_Population_ACS_08_12', 'Tot_Population_ACSMOE_08_12',
       'Males_CEN_2010', 'Males_ACS_08_12', 'Males_ACSMOE_08_12',
       'Females_CEN_2010', 'Females_ACS_08_12', 'Females_ACSMOE_08_12',
       'Pop_un

In [40]:
df.leaid11.unique()

array([1808340, 4831040, 4824150, ..., 4814790, 3025560, 2512630])

In [44]:
4831040



Unnamed: 0,IDCENSUS,NAME,CONUM,CSA,CBSA,NCESID,ENROLL,TOTALREV,TFEDREV,FEDRCOMP,...,PPSALWG,PPEMPBEN,PPITOTAL,PPISALWG,PPIEMBEN,PPSTOTAL,PPSPUPIL,PPSSTAFF,PPSGENAD,PPSSCHAD
12619,44512000300000,INDUSTRIAL IND SCH DIST 905,48239,N,N,4824150,1153,12061,346,78,...,5318,1146,4879,3540,735,3192,275,221,408,482


In [45]:
df2[df2['NCESID']=='3612510']

Unnamed: 0,IDCENSUS,NAME,CONUM,CSA,CBSA,NCESID,ENROLL,TOTALREV,TFEDREV,FEDRCOMP,...,PPSALWG,PPEMPBEN,PPITOTAL,PPISALWG,PPIEMBEN,PPSTOTAL,PPSPUPIL,PPSSTAFF,PPSGENAD,PPSSCHAD
8848,33503000800000,GREAT NECK UF SCH DIST,36059,408,35620,3612510,6553,202099,2441,416,...,17585,6853,18539,12803,5141,9318,944,431,419,1506


In [21]:
df.isnull().sum()

leaid11                                0
STNAM                                  0
FIPST                                  0
leanm11                                0
ALL_COHORT_1112                        0
ALL_RATE_1112                          0
MAM_COHORT_1112                     4467
MAM_RATE_1112                       4467
MAS_COHORT_1112                     3437
MAS_RATE_1112                       3437
MBL_COHORT_1112                     2608
MBL_RATE_1112                       2608
MHI_COHORT_1112                     1899
MHI_RATE_1112                       1899
MTR_COHORT_1112                     4237
MTR_RATE_1112                       4237
MWH_COHORT_1112                       77
MWH_RATE_1112                         77
CWD_COHORT_1112                      213
CWD_RATE_1112                        213
ECD_COHORT_1112                      102
ECD_RATE_1112                        102
LEP_COHORT_1112                     3893
Percentage                             0
State           

In [26]:
[x for x in df.columns.values if 'Mrd' in x]

for x in df.columns.values:
    if 'Mrd' in x:
        print x

# CEN2010 Married-couple family households in the 2010 Census
# Married-couple family households in the ACS
# the not indicates the non married
# pct ACS columns are Mrd / total households in ACS
# pct CEN2010 columns are Mrd / total households in CEN2010

MrdCple_Fmly_HHD_CEN_2010
MrdCple_Fmly_HHD_ACS_08_12
MrdCple_Fmly_HHD_ACSMOE_08_12
Not_MrdCple_HHD_CEN_2010
Not_MrdCple_HHD_ACS_08_12
Not_MrdCple_HHD_ACSMOE_08_12
pct_MrdCple_HHD_CEN_2010
pct_MrdCple_HHD_ACS_08_12
pct_MrdCple_HHD_ACSMOE_08_12
pct_Not_MrdCple_HHD_CEN_2010
pct_Not_MrdCple_HHD_ACS_08_12
pct_Not_MrdCple_HHD_ACSMOE_08_12


In [27]:
for x in df.columns.values:
    if 'ECD' in x:
        print x

ECD_COHORT_1112
ECD_RATE_1112


In [28]:
for x in df.columns.values:
    if 'PP' in x:
        print x

HHD_PPL_Und_18_CEN_2010
HHD_PPL_Und_18_ACS_08_12
HHD_PPL_Und_18_ACSMOE_08_12
pct_HHD_PPL_Und_18_CEN_2010
pct_HHD_PPL_Und_18_ACS_08_12
pct_HHD_PPL_Und_18_ACSMOE_08_12


In [None]:
#========================= married stats
MrdCple_Fmly_HHD_CEN_2010
MrdCple_Fmly_HHD_ACS_08_12
MrdCple_Fmly_HHD_ACSMOE_08_12
Not_MrdCple_HHD_CEN_2010
Not_MrdCple_HHD_ACS_08_12
Not_MrdCple_HHD_ACSMOE_08_12
pct_MrdCple_HHD_CEN_2010
pct_MrdCple_HHD_ACS_08_12

#=========================== economically disadvantaged 
ECD_COHORT_1011
ECD_RATE_1011

#======================= spending per pupil
PPSALWG

#========================

grad_train[['Pov_Univ_ACS_08_12', 'Pov_Univ_ACSMOE_08_12', 
            'Prs_Blw_Pov_Lev_ACS_08_12', 'Prs_Blw_Pov_Lev_ACSMOE_08_12']]