## Can we predict how long a couple will stay together?

We will use the modified data set saved during the exploratory data analysis (EDA) to predict how long couples stay together based on:

1) Household income
2) Religious attendance
3) Political differences
4) Age differences

We will explore other possible features in subsequent models, but let's begin with the four features listed above.

In [28]:
import numpy as np
import pandas as pd

#import dataset
pd.set_option('display.max_columns', None)
data_set = pd.read_csv('t_data.csv', index_col=0, header=0)

In [29]:
#create training and test sets: the first 95% of rows form the training set and the remaining 5% the test set
#(this is a positional split in file order, so no random seed is involved here)
features = ['Household_Income', 'Religious_Attendance', 'Pol_Diff', 'Age_Diff']
train_size = int(data_set.shape[0]*0.95)
X = data_set[features].iloc[:train_size].copy()
X_test = data_set[features].iloc[train_size:].copy()

#create target vector
y = data_set.Years_Together.iloc[:train_size].copy()
y_test = data_set.Years_Together.iloc[train_size:].copy()
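
The split above takes the first 95% of rows in file order, which can bias the test set if the file is sorted in any way. As a minimal sketch of a shuffled alternative, scikit-learn's `train_test_split` can draw the 5% test set at random with a fixed seed. The small frame below is synthetic stand-in data that only borrows the column names; it is not the real data set, and the `demo` names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (values are made up for illustration)
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    'Household_Income': rng.integers(1, 22, size=200),
    'Religious_Attendance': rng.integers(1, 7, size=200),
    'Pol_Diff': rng.integers(0, 7, size=200),
    'Age_Diff': rng.integers(0, 20, size=200),
    'Years_Together': rng.integers(0, 60, size=200),
})

feature_cols = ['Household_Income', 'Religious_Attendance', 'Pol_Diff', 'Age_Diff']

# Shuffle the rows and hold out 5% as a test set; random_state makes the split repeatable
X_demo, X_demo_test, y_demo, y_demo_test = train_test_split(
    demo[feature_cols], demo['Years_Together'], test_size=0.05, random_state=1
)
print(X_demo.shape, X_demo_test.shape)
```

Fixing `random_state` keeps the same rows in the test set on every run, so later model comparisons stay comparable.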

In [30]:
seed = 1
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
X = X.to_numpy(dtype='int8')
y = y.to_numpy(dtype='int8')
# Use 10-fold cross-validation to estimate how well each model generalizes to unseen data
ten_fold = KFold(n_splits=10, random_state=seed, shuffle=True)

In [31]:
#create random forest regressor with 100 trees and default settings (criterion is left at its default, squared error)
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(n_estimators=100, random_state=seed, max_depth=None, min_samples_split=2,
                            min_samples_leaf=1)
#for regressors, cross_val_score reports R^2 by default, not classification accuracy
rfr_results = cross_val_score(rfr, X, y, cv=ten_fold)
print(f'Mean R^2: {rfr_results.mean(): .2%}, Std. Dev: {rfr_results.std():.2%}')

Mean R^2: -15.87%, Std. Dev: 9.88%
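
A negative cross-validation score can look surprising. For a regressor, `cross_val_score` falls back to the estimator's own `score` method, which is the coefficient of determination R^2, and R^2 drops below zero whenever a model predicts worse than a constant equal to the mean of the target. The sketch below illustrates both points on synthetic data (the arrays are made up and unrelated to the survey data): the default score matches `scoring='r2'`, and a `DummyRegressor` that always predicts the training-fold mean never scores above zero.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data: the target is pure noise, so the features carry no signal
rng = np.random.default_rng(1)
X_demo = rng.integers(0, 20, size=(300, 4))
y_demo = rng.normal(loc=25, scale=15, size=300)

folds = KFold(n_splits=10, shuffle=True, random_state=1)

forest = RandomForestRegressor(n_estimators=50, random_state=1)
default_scores = cross_val_score(forest, X_demo, y_demo, cv=folds)
r2_scores = cross_val_score(forest, X_demo, y_demo, cv=folds, scoring='r2')

# Baseline that ignores the features and predicts the training-fold mean
baseline = DummyRegressor(strategy='mean')
baseline_scores = cross_val_score(baseline, X_demo, y_demo, cv=folds, scoring='r2')

print(f'Forest R^2:   {default_scores.mean():.3f}')
print(f'Baseline R^2: {baseline_scores.mean():.3f}')
```

A model that scores below this baseline is doing worse than ignoring the features altogether, which usually points to overfitting or to features with little signal.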


Let's use the ELI5 library to check our intuition on the relative importance of the input variables we have chosen:

In [5]:
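import eli5
from eli5.sklearn import PermutationImportance
#PermutationImportance expects an already fitted estimator by default, so fit the forest first
#(X and y are the training arrays defined above)
rfr.fit(X, y)
perm = PermutationImportance(rfr, random_state=seed).fit(X, y)
eli5.show_weights(perm, feature_names=['Household_Income', 'Religious_Attendance', 'Pol_Diff', 'Age_Diff'])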



Weight,Feature
0.7540  ± 0.0514,Household_Income
0.7219  ± 0.0222,Age_Diff
0.6842  ± 0.0278,Religious_Attendance
0.5393  ± 0.0382,Pol_Diff


The permutation importances from ELI5 rank household income as the strongest predictor of relationship length, consistent with what we saw in the EDA. Surprisingly, age difference ranks second, followed by religious attendance and, lastly, political differences.
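
Permutation importance measures how much a model's score drops when one feature's values are shuffled, so the ranking depends on which data the score is computed on. As a hedged illustration (not the ELI5 call used above), scikit-learn's `permutation_importance` can compute the same kind of measure on a held-out split. The data below are synthetic, with one informative feature and one pure-noise feature, so the expected ordering is known in advance.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: column 0 drives the target, column 1 is unrelated noise
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(500, 2))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=500)

X_fit, X_held, y_fit, y_held = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

forest = RandomForestRegressor(n_estimators=50, random_state=1).fit(X_fit, y_fit)

# Shuffle each column of the held-out data 10 times and record the drop in R^2
result = permutation_importance(forest, X_held, y_held, n_repeats=10, random_state=1)

for name, mean, std in zip(['informative', 'noise'],
                           result.importances_mean, result.importances_std):
    print(f'{name}: {mean:.3f} +/- {std:.3f}')
```

Scoring on held-out rows separates features the model merely memorized from features that help it generalize.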

In [6]:
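from sklearn.metrics import mean_squared_error as mse
#RMSE on the held-out test set; y_hat is assumed to hold the random forest's test-set predictions
y_hat = rfr.fit(X, y).predict(X_test.to_numpy(dtype='int8'))
np.sqrt(mse(y_test, y_hat))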

18.63759725615683

In [14]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr_results = cross_val_score(lr, X, y, cv=ten_fold)
print(f'Mean R^2: {lr_results.mean(): .2%}, Std. Dev: {lr_results.std():.2%}')

Mean R^2:  58.81%, Std. Dev: 3.13%


In [15]:
from xgboost import XGBRegressor
xg = XGBRegressor()
xg_results = cross_val_score(xg, X, y, cv=ten_fold)
print(f'Mean R^2: {xg_results.mean(): .2%}, Std. Dev: {xg_results.std():.2%}')

Mean R^2:  55.05%, Std. Dev: 4.46%
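
The scores that `cross_val_score` returns for these regressors are R^2 values, which are unitless and make it hard to judge how far off the predictions are in years. One option, sketched here on synthetic data rather than the survey data, is to pass `scoring='neg_root_mean_squared_error'` so that each fold reports an RMSE in the units of the target. scikit-learn negates error metrics so that larger is always better, hence the sign flip below. The model dictionary and variable names are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data with a weak linear signal plus noise
rng = np.random.default_rng(1)
X_demo = rng.integers(0, 20, size=(300, 4)).astype(float)
y_demo = 30 - 0.8 * X_demo[:, 0] + rng.normal(scale=15, size=300)

folds = KFold(n_splits=10, shuffle=True, random_state=1)
models = {
    'linear regression': LinearRegression(),
    'random forest': RandomForestRegressor(n_estimators=50, random_state=1),
}

rmse_by_model = {}
for name, model in models.items():
    # Scores come back as negative RMSE, so flip the sign to recover RMSE
    neg_rmse = cross_val_score(model, X_demo, y_demo, cv=folds,
                               scoring='neg_root_mean_squared_error')
    rmse_by_model[name] = -neg_rmse
    print(f'{name}: RMSE {rmse_by_model[name].mean():.2f} '
          f'+/- {rmse_by_model[name].std():.2f}')
```

Reusing the same `KFold` object for every model means each one is scored on identical folds, so differences in RMSE reflect the models rather than the split.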


In [9]:
import h2o
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
; OpenJDK 64-Bit Server VM 18.3 (build 10.0.2+13, mixed mode)
  Starting server from C:\Users\Grant\Documents\LearningStuff\DataScienceProjects\Date_Marriage_HCMST_2017\vrenv\Lib\site-packages\h2o\backend\bin\h2o.jar
  Ice root: C:\Users\Grant\AppData\Local\Temp\tmptp7evmfg
  JVM stdout: C:\Users\Grant\AppData\Local\Temp\tmptp7evmfg\h2o_Grant_started_from_python.out
  JVM stderr: C:\Users\Grant\AppData\Local\Temp\tmptp7evmfg\h2o_Grant_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


0,1
H2O_cluster_uptime:,12 secs
H2O_cluster_timezone:,America/Chicago
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.32.0.2
H2O_cluster_version_age:,"14 days, 6 hours and 39 minutes"
H2O_cluster_name:,H2O_from_python_Grant_af590o
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,3.965 Gb
H2O_cluster_total_cores:,8
H2O_cluster_allowed_cores:,8


In [13]:
from h2o import H2OFrame
#create h2o dataframe
h2o_df = H2OFrame(data_set[['Household_Income', 'Religious_Attendance', 'Pol_Diff', 'Age_Diff', 'Years_Together']].copy())

#train, test split
train, test = h2o_df.split_frame(ratios=[.80])

x_labels = train.columns
y_label = 'Years_Together'

#remove the column with the target variable
x_labels.remove(y_label)

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [14]:
h2o_df

Household_Income,Religious_Attendance,Pol_Diff,Age_Diff,Years_Together
17,5,1,3,34
19,2,0,2,11
18,4,0,0,34
13,1,3,1,36
11,1,3,1,51
12,5,1,0,50
20,5,0,10,9
19,5,2,2,10
14,4,0,4,15
15,3,2,3,10




In [15]:
from h2o.automl import H2OAutoML as haml
#run automl with user specified params
aml = haml(
    max_runtime_secs=600,
    #exclude_algos=['DeepLearning'] can be set to skip algorithms known to work poorly
    seed=1, #set seed for reproducibility (a time-limited run may still vary between runs)
    #max_models can be set to limit the number of models fitted
    project_name='Final',
    stopping_metric='RMSE'
)

#use the %time line magic to track how long training takes
%time aml.train(x=x_labels, y=y_label, training_frame=train)

AutoML progress: |
22:17:35.20: AutoML: XGBoost is not available; skipping it.
22:17:38.202: Skipping training of model DRF_1_AutoML_20201201_221734 due to exception: water.exceptions.H2OModelBuilderIllegalArgumentException: Illegal argument(s) for DRF model: DRF_1_AutoML_20201201_221734.  Details: ERRR on field: _stopping_metric: Stopping metric cannot be logloss for regression.

22:17:38.202: Skipping training of model GBM_1_AutoML_20201201_221734 due to exception: water.exceptions.H2OModelBuilderIllegalArgumentException: Illegal argument(s) for GBM model: GBM_1_AutoML_20201201_221734.  Details: ERRR on field: _stopping_metric: Stopping metric cannot be logloss for regression.

22:17:38.210: Skipping training of model GBM_2_AutoML_20201201_221734 due to exception: water.exceptions.H2OModelBuilderIllegalArgumentException: Illegal argument(s) for GBM model: GBM_2_AutoML_20201201_221734.  Details: ERRR on field: _stopping_metric: Stopping metric cannot be logloss for regression.

22:17:

In [16]:
lb = aml.leaderboard
lb.head(rows=lb.nrows)

model_id,mean_residual_deviance,rmse,mse,mae,rmsle
GLM_1_AutoML_20201201_221938,262.45,16.2003,262.45,13.5767,
StackedEnsemble_BestOfFamily_AutoML_20201201_221938,262.483,16.2013,262.483,13.5622,1.0205
GBM_5_AutoML_20201201_221938,264.551,16.265,264.551,13.632,1.02625
GBM_grid__1_AutoML_20201201_221938_model_3,265.037,16.28,265.037,13.6319,1.02442
GBM_grid__1_AutoML_20201201_221938_model_1,265.042,16.2801,265.042,13.646,1.02175
DeepLearning_grid__1_AutoML_20201201_221938_model_4,265.088,16.2815,265.088,13.7116,
DeepLearning_grid__3_AutoML_20201201_221938_model_1,265.197,16.2849,265.197,13.7911,
GBM_grid__1_AutoML_20201201_221938_model_7,265.378,16.2904,265.378,13.643,1.02299
DeepLearning_grid__2_AutoML_20201201_221938_model_1,267.048,16.3416,267.048,13.739,1.02572
GLM_1_AutoML_20201201_221734,267.22,16.3469,267.22,13.7473,1.02915




In [19]:
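#aml.leader is already a model object; h2o.get_model expects a model ID string, so pass its model_id
leader = aml.leader
best_model = h2o.get_model(leader.model_id)
best_model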

Model Details
H2OGeneralizedLinearEstimator :  Generalized Linear Modeling
Model Key:  GLM_1_AutoML_20201201_221938


GLM Model: summary


Unnamed: 0,Unnamed: 1,family,link,regularization,lambda_search,number_of_predictors_total,number_of_active_predictors,number_of_iterations,training_frame
0,,gaussian,identity,Ridge ( lambda = 0.01469 ),"nlambda = 30, lambda.max = 325.02, lambda.min = 0.01469, lambda.1s...",4,4,25,automl_training_py_33_sid_865b




ModelMetricsRegressionGLM: glm
** Reported on train data. **

MSE: 261.7812699729432
RMSE: 16.179656052368458
MAE: 13.560258790230629
RMSLE: 1.033504692381193
R^2: 0.08722161098877856
Mean Residual Deviance: 261.7812699729432
Null degrees of freedom: 2328
Residual degrees of freedom: 2324
Null deviance: 667948.0858737646
Residual deviance: 609688.5777669847
AIC: 19588.14486518512

ModelMetricsRegressionGLM: glm
** Reported on cross-validation data. **

MSE: 262.4495512938646
RMSE: 16.200294790338372
MAE: 13.576705645895334
RMSLE: NaN
R^2: 0.08489144906550594
Mean Residual Deviance: 262.4495512938646
Null degrees of freedom: 2328
Residual degrees of freedom: 2324
Null deviance: 668229.1976836439
Residual deviance: 611245.0049634107
AIC: 19594.08281456311

Cross-Validation Metrics Summary: 


Unnamed: 0,Unnamed: 1,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid,cv_5_valid
0,mae,13.588357,0.6247265,14.320624,13.635339,13.19769,14.02438,12.763752
1,mean_residual_deviance,262.24966,20.817219,282.0009,263.43307,260.60712,276.6153,228.59189
2,mse,262.24966,20.817219,282.0009,263.43307,260.60712,276.6153,228.59189
3,null_deviance,133645.84,8035.152,143556.94,131909.53,131944.11,138569.1,122249.54
4,r2,0.085686035,0.024799272,0.084253445,0.069353335,0.07899463,0.06752222,0.12830654
5,residual_deviance,122162.62,9793.32,131412.4,122759.81,121442.92,128902.734,106295.23
6,rmse,16.183569,0.653592,16.792881,16.230621,16.143332,16.631756,15.119255
7,rmsle,1.0333453,0.05809085,1.0681968,1.0207449,0.9570174,1.0874221,



Scoring History: 


Unnamed: 0,Unnamed: 1,timestamp,duration,iteration,lambda,predictors,deviance_train,deviance_test,deviance_xval,deviance_se,training_rmse,training_deviance,training_mae,training_r2
0,,2020-12-01 22:19:39,0.000 sec,1,330.0,5,286.614353,,286.762286,7.625175,,,,
1,,2020-12-01 22:19:39,0.008 sec,2,200.0,5,286.504453,,286.67493,7.627065,,,,
2,,2020-12-01 22:19:39,0.008 sec,3,130.0,5,286.329,,286.535188,7.630104,,,,
3,,2020-12-01 22:19:39,0.016 sec,4,78.0,5,286.050432,,286.312822,7.634978,,,,
4,,2020-12-01 22:19:39,0.016 sec,5,48.0,5,285.611704,,285.961001,7.64279,,,,
5,,2020-12-01 22:19:39,0.016 sec,6,30.0,5,284.92982,,285.410488,7.655267,,,,
6,,2020-12-01 22:19:39,0.016 sec,7,19.0,5,283.89181,,284.563411,7.675091,,,,
7,,2020-12-01 22:19:39,0.024 sec,8,12.0,5,282.361379,,283.293844,7.706317,,,,
8,,2020-12-01 22:19:39,0.024 sec,9,7.2,5,280.21126,,281.465743,7.754793,,,,
9,,2020-12-01 22:19:39,0.024 sec,10,4.5,5,277.395626,,278.985578,7.828248,,,,



See the whole table with table.as_data_frame()

Variable Importances: 


Unnamed: 0,variable,relative_importance,scaled_importance,percentage
0,Religious_Attendance,2.815405,1.0,0.314675
1,Age_Diff,2.30585,0.819012,0.257723
2,Household_Income,2.041623,0.725161,0.22819
3,Pol_Diff,1.784147,0.633709,0.199412




