# Przemysław Kaleta

Our goal is to build a model that predicts whether a given person is married based on a few features, and then to visualize the model's decisions.

Based on data from: https://data.stanford.edu/hcmst2017

Interesting app:
https://qz.com/quartzy/1551272/here-is-the-probability-you-will-break-up-with-your-partner/

In [139]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
import xgboost as xgb

In [4]:
data = pd.read_stata("hcmts.dta")
print(len(data))
data.head()

3510


| | CaseID | CASEID_NEW | qflag | weight1 | weight1_freqwt | weight2 | weight1a | weight1a_freqwt | weight_combo | weight_combo_freqwt | ... | hcm2017q24_met_through_family | hcm2017q24_met_through_friend | hcm2017q24_met_through_as_nghbrs | hcm2017q24_met_as_through_cowork | w6_subject_race | interracial_5cat | partner_mother_yrsed | subject_mother_yrsed | partner_yrsed | subject_yrsed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2014039 | Qualified | NaN | NaN | 0.8945 | NaN | NaN | 0.277188 | 19240.0 | ... | no | no | no | no | White | no | 12.0 | 14.0 | 12.0 | 14.0 |
| 1 | 3 | 2019003 | Qualified | 0.9078 | 71115.0 | NaN | 0.9026 | 70707.0 | 1.020621 | 70841.0 | ... | no | no | no | yes | White | no | 12.0 | 16.0 | 17.0 | 17.0 |
| 2 | 5 | 2145527 | Qualified | 0.7205 | 56442.0 | NaN | 0.7164 | 56121.0 | 0.810074 | 56227.0 | ... | no | no | no | no | White | no | 9.0 | 7.5 | 14.0 | 17.0 |
| 3 | 6 | 2648857 | Qualified | 1.2597 | 98682.0 | 1.3507 | 1.2524 | 98110.0 | 0.418556 | 29052.0 | ... | no | no | no | no | White | no | 16.0 | 12.0 | 12.0 | 12.0 |
| 4 | 7 | 2623465 | Qualified | 0.8686 | 68044.0 | NaN | 0.8636 | 67652.0 | 0.976522 | 67781.0 | ... | no | no | yes | no | White | no | 14.0 | 17.0 | 16.0 | 16.0 |


Explanations of variable names taken from:

https://stacks.stanford.edu/file/druid:vt073cc9067/HCMST_2017_fresh_Codeboodk_v1.1a.pdf

* Yes/no questions:
    * **Q5** Is [Partner name] the same sex as you? 
    * **Q25_2** Did you and [Partner name] attend the same high school?
    * **Q26_2** Did you and [Partner name] attend the same college or university?
    * **hcm2017q24_met_online** Met online
    
* Quantitative questions:
    * **w6_q9** Partner's age in 2017
    * **w6_q24_length** Length, in characters, of the answer to Q24 (how the respondent met their partner)

## Selecting data

In [49]:
binary_variables = ["Q5", "Q25_2", "Q26_2",  "hcm2017q24_met_online"]
continuous_variables = ["w6_q9", "w6_q24_length"]
predicted_variables = ["S1"]
variables = binary_variables + continuous_variables + predicted_variables
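# Keep rows with a known marital status; copy so later column assignments do not touch `data`.
mydata = data.loc[data.S1.notna(), variables].copy()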

In [46]:
mydata.head()

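| | Q5 | Q25_2 | Q26_2 | hcm2017q24_met_online | w6_q9 | w6_q24_length | S1 |
|---|---|---|---|---|---|---|---|
| 0 | NaN | Different High School | NaN | yes | 26.0 | 232.0 | No, I am not Married |
| 1 | NaN | NaN | NaN | no | 52.0 | 213.0 | Yes, I am Married |
| 2 | NaN | NaN | NaN | yes | 45.0 | 87.0 | Yes, I am Married |
| 3 | NaN | Different High School | NaN | yes | 26.0 | 80.0 | No, I am not Married |
| 4 | NaN | NaN | NaN | no | 59.0 | 648.0 | Yes, I am Married |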


Let's check how many values are missing in each variable. Missing data seems to be a big problem here.

In [47]:
mydata.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3510 entries, 0 to 3509
Data columns (total 7 columns):
Q5                       468 non-null category
Q25_2                    538 non-null category
Q26_2                    202 non-null category
hcm2017q24_met_online    3394 non-null category
w6_q9                    3374 non-null float64
w6_q24_length            3394 non-null float32
S1                       3510 non-null category
dtypes: category(5), float32(1), float64(1)
memory usage: 86.2 KB
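
The non-null counts above can also be read as missing fractions per column, which makes the sparsest variables easier to spot. A minimal sketch on a made-up frame (the `toy` frame and its values are illustrative, not taken from the survey):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `mydata`; the values are made up for illustration.
toy = pd.DataFrame({
    "Q5": [np.nan, np.nan, "Yes", np.nan],
    "w6_q9": [26.0, 52.0, np.nan, 45.0],
    "S1": ["No", "Yes", "Yes", "No"],
})

# isna() marks missing cells; the column mean of that mask is the missing fraction.
missing_fraction = toy.isna().mean().sort_values(ascending=False)
print(missing_fraction)
```

Applied to `mydata`, the same expression would rank the columns from most to least incomplete.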


Below we see, for example, that one respondent wrote 3855 characters describing how they first met their partner.

In [48]:
mydata.describe()

| | w6_q9 | w6_q24_length |
|---|---|---|
| count | 3374.0 | 3394.0 |
| mean | 48.777119 | 182.854446 |
| std | 17.119645 | 236.993225 |
| min | -1.0 | 0.0 |
| 25% | 34.0 | 54.0 |
| 50% | 50.0 | 124.0 |
| 75% | 62.0 | 222.75 |
| max | 95.0 | 3855.0 |
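
The summary also shows a minimum partner age of -1, which is not a valid age. One possible treatment, assuming that negative values are placeholder codes rather than real answers (an assumption, not something confirmed here), is to turn them into missing values before modelling. A sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Made-up ages for illustration; -1.0 plays the role of the suspicious minimum.
ages = pd.Series([26.0, 52.0, -1.0, 45.0, np.nan], name="w6_q9")

# Assumption: negative ages are placeholder codes, so treat them as missing.
# where() keeps values satisfying the condition and replaces the rest with NaN.
cleaned = ages.where(ages >= 0)
print(cleaned.describe())
```

This step is not part of the pipeline in this notebook; the model below is fitted on the raw values.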


## Model fitting

In [109]:
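# Replace each categorical column with its integer category codes; missing values become -1.
# For S1 this gives 0 = "Yes, I am Married" and 1 = "No, I am not Married".
for column_name in binary_variables + predicted_variables:
    mydata[column_name] = mydata[column_name].cat.codes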

In [112]:
mydata.head()

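| | Q5 | Q25_2 | Q26_2 | hcm2017q24_met_online | w6_q9 | w6_q24_length | S1 |
|---|---|---|---|---|---|---|---|
| 0 | -1 | 2 | -1 | 1 | 26.0 | 232.0 | 1 |
| 1 | -1 | -1 | -1 | 0 | 52.0 | 213.0 | 0 |
| 2 | -1 | -1 | -1 | 1 | 45.0 | 87.0 | 0 |
| 3 | -1 | 2 | -1 | 1 | 26.0 | 80.0 | 1 |
| 4 | -1 | -1 | -1 | 0 | 59.0 | 648.0 | 0 |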
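
The -1 entries in the table come from how pandas encodes categoricals: each category gets an integer code in category order, and a missing value gets the code -1. A small illustration on a made-up series (the `answers` series is hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up yes/no answers with one missing value.
answers = pd.Series(["no", "yes", np.nan, "yes"], dtype="category")

# Categories are sorted, so "no" -> 0 and "yes" -> 1; NaN is encoded as -1.
categories = list(answers.cat.categories)
codes = answers.cat.codes.tolist()
print(categories)  # ['no', 'yes']
print(codes)       # [0, 1, -1, 1]
```

So after the conversion, "missing" is effectively treated as one more level of each binary feature.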


In [122]:
X, y = mydata.drop("S1", axis=1), mydata["S1"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
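
The split above is purely random. An optional variation, not used in this notebook, is to pass `stratify` so that both parts keep the same class proportions. A sketch on made-up data (`X_toy` and `y_toy` are illustrative names):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 40 of them in class 1.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"feature": rng.normal(size=100)})
y_toy = pd.Series([1] * 40 + [0] * 60)

# stratify=y_toy asks for the same class ratio in the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(len(y_tr), len(y_te), y_tr.mean(), y_te.mean())
```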

In [183]:
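# Caution: 'reg:linear' is a regression objective; 'binary:logistic' is the usual choice
# for a 0/1 target. The outputs below were produced with the settings shown here.
xgb_model = xgb.XGBClassifier(objective='reg:linear', colsample_bytree=0.3, learning_rate=0.1,
                              max_depth=5, alpha=0, n_estimators=100)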

In [184]:
xgb_model.fit(X_train, y_train)

XGBClassifier(alpha=0, base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=0.3, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=5, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

In [185]:
y_pred = xgb_model.predict(X_test)

In [186]:
y_pred_train = xgb_model.predict(X_train)

In [187]:
def test_classifier(y_true, y_pred):
    """Print class balance (as fractions), accuracy and F1; class 1 ("No, I am not Married") is positive."""
    n = len(y_true)
    print(f"Positive/negative percentages in population: {sum(y_true) / n} / {sum(y_true==0) / n}")
    print(f"Accuracy {sum(y_true == y_pred) / n}")
    print(f"F1 score: {f1_score(y_true, y_pred)}")
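
To make the two reported metrics concrete, here is how accuracy and F1 follow from the counts of true positives, false positives and false negatives. The labels below are made up, and this is only a hand computation for illustration, not a replacement for `f1_score`:

```python
import numpy as np

# Made-up labels and predictions; class 1 is the positive class.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_hat = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = int(np.sum((y_true == 1) & (y_hat == 1)))  # predicted 1, actually 1
fp = int(np.sum((y_true == 0) & (y_hat == 1)))  # predicted 1, actually 0
fn = int(np.sum((y_true == 1) & (y_hat == 0)))  # predicted 0, actually 1

accuracy = float(np.mean(y_true == y_hat))
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(tp, fp, fn, accuracy, f1)
```

F1 ignores true negatives, which is why it can differ noticeably from accuracy when the classes are imbalanced.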

In [188]:
test_classifier(y_train, y_pred_train)

Positive/negative percentages in population: 0.405982905982906 / 0.594017094017094
Accuracy 0.842948717948718
F1 score: 0.7778337531486146


In [189]:
test_classifier(y_test, y_pred)

Positive/negative percentages in population: 0.405982905982906 / 0.594017094017094
Accuracy 0.8034188034188035
F1 score: 0.7124999999999999
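
For context, about 59% of the labels belong to class 0, so a model that always predicts class 0 would already reach roughly 0.59 accuracy; the test accuracy of about 0.80 should be read against that baseline. A sketch of the baseline on made-up labels with a similar class balance:

```python
import numpy as np

# Toy labels with roughly the class balance reported above (about 41% positives).
y_true = np.array([1] * 41 + [0] * 59)

# Find the most frequent class and predict it for every row.
values, counts = np.unique(y_true, return_counts=True)
majority_class = int(values[np.argmax(counts)])
baseline_pred = np.full_like(y_true, majority_class)

baseline_accuracy = float(np.mean(baseline_pred == y_true))
print(majority_class, baseline_accuracy)
```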
