## Comparison of different ML models by MSE

### Prepared by Junho Choi

### A. Preparation for model comparisons

In this section, I import the necessary modules and define the X- and Y-variables used in the model selection process.

#### A.1. Importing necessary modules

In [19]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings

from scipy.stats import uniform as sp_uniform
from scipy.stats import randint as sp_randint
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import RandomizedSearchCV

# ignore warnings about future updates, etc.
warnings.filterwarnings(action='ignore', category=FutureWarning)
warnings.filterwarnings(action='ignore', category=UserWarning)

#### A.2. Importing data and defining X- and Y-variables

In [10]:
baseline_df = pd.read_csv('baseline_df.csv')
xval_cols = ['age', 'female', 'urgency', 'urg_times_wd', 'aids_area',
             'exploit_area', 'christian_pct', 'schooling_yn',
             'mother', 'father', 'mother_irreg_emp', 'mother_reg_emp',
             'father_irreg_emp', 'father_reg_emp',
             'region_mo_inc_w_impu', 'asia', 'southame', 'centralame']
xvals = baseline_df[xval_cols].values
yvals = baseline_df['ever_matched'].values

### B. Optimization initialization

In [14]:
def model_optimization(xvals, yvals, model, search_dict,
                       randomness, model_rtn=False):
    '''
    Provides the best (hyper)parameters and MSE based on
    Randomized Search CV.
    
    Input:
    - xvals (np.array (2D)): values of feature variables
    - yvals (np.array (1D)): values of dependent variable
    - model: model initialization (e.g., an estimator provided
        by scikit-learn)
    - search_dict (dict): dictionary of hyperparameters
        to test
    - randomness (int): random seed
    - model_rtn (boolean): if True, also returns the fitted
        RandomizedSearchCV object (default: False)
    
    Output:
    - list of the best hyperparameters and the best MSE; if
        model_rtn is True, the fitted RandomizedSearchCV object
        (whose best_estimator_ is the refit best model) is
        appended as a third element
    '''
    
    random_search = RandomizedSearchCV(
        model, param_distributions=search_dict, n_iter=200,
        n_jobs=-1, cv=5, random_state=randomness,
        scoring='neg_mean_squared_error')
    
    random_search.fit(xvals, yvals)
    return_vars = [random_search.best_params_,
                   -random_search.best_score_]
    
    if model_rtn:
        return_vars.append(random_search)
        
    return return_vars
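
With `model_rtn=True`, the third element returned is the fitted `RandomizedSearchCV` object rather than a bare estimator. The sketch below shows how the best parameters, the sign-flipped score, and the refit model could be read from such an object. It is only an illustration: the synthetic data from `make_classification`, the names `X_demo`, `y_demo`, and `search`, and the small `n_iter=5` are assumptions chosen to keep it quick, not part of the analysis above.

```python
from scipy.stats import uniform as sp_uniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (hypothetical; not the notebook's baseline_df).
X_demo, y_demo = make_classification(n_samples=200, n_features=5,
                                     random_state=0)

# Same search setup as model_optimization, with a small n_iter for speed.
search = RandomizedSearchCV(
    LogisticRegression(solver='liblinear', random_state=0),
    param_distributions={'C': sp_uniform(loc=0.1, scale=10.0)},
    n_iter=5, cv=5, random_state=0,
    scoring='neg_mean_squared_error')
search.fit(X_demo, y_demo)

best_params = search.best_params_
# scikit-learn maximizes scores, so the MSE is stored negated; flip it back.
best_mse = -search.best_score_
# With the default refit=True, best_estimator_ is refit on all of the data.
best_model = search.best_estimator_
demo_preds = best_model.predict(X_demo[:5])
```
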

### C. Hyperparameter optimization

#### C.1. Logistic Regression

In [15]:
randomness = 60615
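# 'l1' requires a solver that supports it; 'liblinear' handles both penalties
# and was scikit-learn's default solver before version 0.22
LR = LogisticRegression(random_state=randomness, solver='liblinear')
param_dist_LR = {
    'penalty': ['l1', 'l2'],
    'C': sp_uniform(loc=0.1, scale=10.0)  # uniform on [0.1, 10.1]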
}

In [16]:
LR_param, LR_score = model_optimization(xvals, yvals, LR, param_dist_LR,
                                        randomness)
print('Best params are:', LR_param)
print('Best score (MSE) is:', str(round(LR_score, 6)))

Best params are: {'C': 0.4140831117072995, 'penalty': 'l1'}
Best score (MSE) is: 0.420782
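
A note on the search range for `C`: `scipy.stats.uniform` takes `(loc, scale)` rather than `(low, high)`, so `sp_uniform(0.1, 10.0)` covers the interval from `loc` to `loc + scale`. The short check below is a sketch for confirming the endpoints; the names `c_dist`, `lower`, `upper`, and `samples` are introduced here for illustration only.

```python
from scipy.stats import uniform as sp_uniform

# Positional arguments are (loc, scale), not (low, high).
c_dist = sp_uniform(0.1, 10.0)

# The 0th and 100th percentiles give the endpoints of the support.
lower, upper = c_dist.ppf(0.0), c_dist.ppf(1.0)

# Draws fall between loc and loc + scale.
samples = c_dist.rvs(size=1000, random_state=60615)
print(lower, upper, samples.min(), samples.max())
```
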


#### C.2. Decision Tree

In [20]:
DTC = DecisionTreeClassifier(random_state=randomness)
param_dist_DTC = {
    'criterion': ['gini', 'entropy'],
    'splitter': ['random', 'best'],
    'min_samples_split': list(range(2, 10))
}

In [21]:
DTC_param, DTC_score = model_optimization(xvals, yvals, DTC,
                                          param_dist_DTC, randomness)
print('Best params are:', DTC_param)
print('Best score (MSE) is:', str(round(DTC_score, 6)))

Best params are: {'splitter': 'random', 'min_samples_split': 9, 'criterion': 'entropy'}
Best score (MSE) is: 0.419311


#### C.3. Random Forest

In [22]:
RFC = RandomForestClassifier(random_state=randomness)
param_dist_RFC = {
    # sp_randint(low, high) draws integers from low to high - 1
    'n_estimators': sp_randint(5, 200),
    'max_depth': sp_randint(2, 4),  # 2 or 3
    'min_samples_split': sp_randint(2, 20),
    'min_samples_leaf': sp_randint(2, 20),
    'max_features': sp_randint(1, 4)  # 1, 2, or 3
}

In [23]:
RFC_param, RFC_score = model_optimization(xvals, yvals, RFC,
                                          param_dist_RFC, randomness)
print('Best params are:', RFC_param)
print('Best score (MSE) is:', str(round(RFC_score, 6)))

Best params are: {'max_depth': 3, 'max_features': 3, 'min_samples_leaf': 5, 'min_samples_split': 15, 'n_estimators': 14}
Best score (MSE) is: 0.394621
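
The integer ranges above are narrower than they may look, because `scipy.stats.randint` excludes its upper bound. A minimal sketch to verify this for the `max_depth` distribution follows; `depth_dist`, `draws`, and `depth_values` are illustrative names, not part of the search itself.

```python
from scipy.stats import randint as sp_randint

# Lower bound is inclusive, upper bound is exclusive.
depth_dist = sp_randint(2, 4)

# Collect the distinct values that actually get drawn.
draws = depth_dist.rvs(size=1000, random_state=60615)
depth_values = sorted(set(draws.tolist()))
print(depth_values)

# The upper bound itself has zero probability mass.
prob_of_upper = depth_dist.pmf(4)
```

If deeper trees are meant to be searched, the upper bound would need to be raised accordingly.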


#### C.4. Support Vector Machine

In [24]:
SVC_model = SVC(random_state=randomness, kernel='rbf')
param_dist_SVC = {
    'C': sp_uniform(loc=0.1, scale=10.0),
    'gamma': ['scale', 'auto'],
    'shrinking': [True, False]
}

In [30]:
SVC_param, SVC_score = model_optimization(xvals, yvals, SVC_model,
                                          param_dist_SVC, randomness)
print('Best params are:', SVC_param)
print('Best score (MSE) is:', str(round(SVC_score, 6)))

Best params are: {'C': 2.1936855213947792, 'gamma': 'auto', 'shrinking': True}
Best score (MSE) is: 0.414478


#### C.5. Quadratic Discriminant Analysis

In [26]:
QDA = QuadraticDiscriminantAnalysis()
param_dist_QDA = {
    'reg_param': np.linspace(0, 1, 11), 
    'tol': [1.0e-6, 1.0e-5, 1.0e-4, 1.0e-3]
}

In [27]:
QDA_param, QDA_score = model_optimization(xvals, yvals, QDA,
                                          param_dist_QDA, randomness)
print('Best params are:', QDA_param)
print('Best score (MSE) is:', str(round(QDA_score, 6)))

Best params are: {'tol': 1e-06, 'reg_param': 0.8}
Best score (MSE) is: 0.421937


#### C.6. Multilayer Perceptron

In [28]:
MLP = MLPClassifier(random_state=randomness)
param_dist_MLP = {
    'hidden_layer_sizes': sp_randint(1, 100), 
    'activation': ['logistic', 'relu'],
    'alpha': sp_uniform(loc=0.1, scale=10.0)
}

In [29]:
MLP_param, MLP_score = model_optimization(xvals, yvals, MLP,
                                          param_dist_MLP, randomness)
print('Best params are:', MLP_param)
print('Best score (MSE) is:', str(round(MLP_score, 6)))

Best params are: {'activation': 'relu', 'alpha': 1.9360643043685588, 'hidden_layer_sizes': 51}
Best score (MSE) is: 0.411326


### D. Brief comment

By cross-validated MSE, the "best" model is the hyperparameter-tuned random forest (0.394621). However, its MSE is not considerably lower than those of the other models, which range from 0.411326 (multilayer perceptron) to 0.421937 (quadratic discriminant analysis). Further work may be needed, such as adding more relevant features or tuning against a different target score.
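
One point that may help in reading these scores: when the target and the predictions are both coded 0/1, each squared error is either 0 or 1, so the MSE equals the misclassification rate (one minus accuracy). The sketch below illustrates this with made-up labels. It assumes `ever_matched` is coded 0/1, which the notebook does not state explicitly, and the helper names `mse` and `error_rate` are hypothetical.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equal-length label sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def error_rate(y_true, y_pred):
    """Share of predictions that differ from the true label."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up 0/1 labels for illustration (not taken from baseline_df).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0]

demo_mse = mse(y_true, y_pred)
demo_error = error_rate(y_true, y_pred)
print(demo_mse, demo_error)
```

Under that assumption, an MSE of about 0.39 would correspond to roughly 61% cross-validated accuracy, which is one way to judge how far apart the models really are.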