# Multivariate Regression Modeling


Moving on from the tuned univariate regression models, this notebook turns to multivariate models that combine several predictors.

## Problem Statement
As the proportion of the population that binge drinks increases, does the self-harm mortality rate change?



---
### Process

In this notebook, I'll investigate relationships between self-harm mortality rate and other variables, such as alcohol use/type, state, and sex, as well as between unemployment rate and labor force.

The models in this notebook are multivariate regressions that combine alcohol use prevalence with sex, state, unemployment, or labor force to predict self-harm mortality.


In [1]:
# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# imports - modeling
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor
from sklearn.metrics import mean_squared_error
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest

#sklearn.set_config(display='diagram')

In [2]:
# read in the joined self-harm / alcohol-use data (FIPS kept as strings to preserve leading zeros)
harm_alcohol_any = pd.read_csv('../data/cleaned/selfharm_alcohol_joins/selfharm_join_alcohol_any.csv', dtype={'FIPS': 'object'})
harm_alcohol_heavy = pd.read_csv('../data/cleaned/selfharm_alcohol_joins/selfharm_join_heavy_prop_heavy.csv', dtype={'FIPS': 'object'})
harm_alcohol_binge = pd.read_csv('../data/cleaned/selfharm_alcohol_joins/selfharm_join_binge_prop_binge.csv', dtype={'FIPS': 'object'})

#drop the NAs
harm_alcohol_any.dropna(inplace=True)
harm_alcohol_heavy.dropna(inplace=True)
harm_alcohol_binge.dropna(inplace=True)

In [3]:
# similar to before, I'll create a list of regression models, and re-use the function to run them with default parameters and return results
# instantiate different regression models and put them in a list
lr = LinearRegression()
lasso = Lasso()
ridge = Ridge()
rf = RandomForestRegressor()
ada = AdaBoostRegressor()
gboost = GradientBoostingRegressor()

estimators = [lr, lasso, ridge, rf, ada, gboost]

In [4]:
# helper: one-hot encode state/sex, fit each estimator, and return train/test R^2 and RMSE
def model_eval(df, independent_vars, dependent_var, estimator_list):
    
    #drop the "both" sexes rows, so we're left with only male and female
    df = df[df['sex'] != 'Both']
    
    X = df[independent_vars]
    y = df[dependent_var]
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    
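    # `sparse` was renamed to `sparse_output` in scikit-learn 1.2; use `sparse=False` on older versions
    ohe = OneHotEncoder(sparse_output=False, drop='if_binary', handle_unknown='ignore')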
    

#     ct = make_column_transformer(
#         (ohe, ['state']),
#         remainder='passthrough',
#         verbose_feature_names_out=False
#     )

    ct = make_column_transformer(
            (ohe, ['state', 'sex']),
            remainder='passthrough',
            verbose_feature_names_out=False
        )
    test_scores = []
    train_score_list = []
    rmse_list = []
    est_list_readable = ['Linear Regression', 'Lasso', 'Ridge', 'RandomForest Regressor', 'AdaBoost Regressor', 'GradientBoost Regressor']
    
    for estimator in estimator_list:
        pipe = make_pipeline(ct, estimator)

        pipe.fit(X_train, y_train)

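        # pipe.score returns R^2 for regressors (labelled "Accuracy" in the result tables)
        train_score = round(pipe.score(X_train, y_train), 2)
        test_score = round(pipe.score(X_test, y_test), 2)

        # take the square root explicitly; the `squared` argument is deprecated since scikit-learn 1.4
        rmse = round(np.sqrt(mean_squared_error(y_test, pipe.predict(X_test))), 2)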
        
        # append scores to list
        train_score_list.append(train_score)
        test_scores.append(test_score)
        rmse_list.append(rmse)
        
        #print(f'Estimator: {estimator}, Train/Test Accuracy: {train_score, test_score}, RMSE: {rmse}')
    df_eval_metrics = pd.DataFrame({'Estimator': est_list_readable, 'Train Accuracy': train_score_list, 'Test Accuracy': test_scores, 'RMSE': rmse_list})
    
    return df_eval_metrics
    

In [5]:
# ignoring alcohol, do state and sex alone make good predictors?
df_harm_state_sex = model_eval(harm_alcohol_any, ['state', 'sex'], 'mx', estimators)
df_harm_state_sex['vars_used'] = 'state / sex'
df_harm_state_sex = df_harm_state_sex[['Estimator', 'vars_used', 'Train Accuracy', 'Test Accuracy', 'RMSE']]
df_harm_state_sex

,Estimator,vars_used,Train Accuracy,Test Accuracy,RMSE
0,Linear Regression,state / sex,0.83,0.83,4.45
1,Lasso,state / sex,0.73,0.73,5.62
2,Ridge,state / sex,0.83,0.83,4.45
3,RandomForest Regressor,state / sex,0.85,0.85,4.16
4,AdaBoost Regressor,state / sex,0.72,0.73,5.67
5,GradientBoost Regressor,state / sex,0.84,0.84,4.3
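
The scores above come from a single train/test split (`random_state=42`). One way to check that they do not depend on that particular split is k-fold cross-validation with `cross_val_score`, which is imported above but not used. The following is a minimal sketch on synthetic data, not the notebook's actual data: the `toy` frame, its coefficients, and the noise level are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the joined data (all values are made up).
rng = np.random.default_rng(42)
n = 400
toy = pd.DataFrame({
    'state': rng.choice(['A', 'B', 'C', 'D'], size=n),
    'sex': rng.choice(['Male', 'Female'], size=n),
    'alcohol_any': rng.uniform(30, 70, size=n),
})
toy['mx'] = (
    10
    + 8 * (toy['sex'] == 'Male')
    + toy['state'].map({'A': 0, 'B': 2, 'C': 4, 'D': 6})
    + 0.05 * toy['alcohol_any']
    + rng.normal(0, 1, size=n)
)

ct = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), ['state', 'sex']),
    remainder='passthrough',
)
pipe = make_pipeline(ct, LinearRegression())

# Five-fold cross-validated R^2 instead of one train/test split.
cv_r2 = cross_val_score(
    pipe, toy[['alcohol_any', 'state', 'sex']], toy['mx'], cv=5, scoring='r2'
)
print(cv_r2.round(2), round(cv_r2.mean(), 2))
```

A small spread across folds would suggest the single-split scores are representative; a large spread would call for reporting the mean and standard deviation instead.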


In [6]:
# doing this with alcohol_any
df_harm_alcohol_any3_eval = model_eval(harm_alcohol_any, ['alcohol_any', 'state', 'sex'], 'mx', estimators)
df_harm_alcohol_any3_eval['vars_used'] = 'Alcohol_any / state / sex'
df_harm_alcohol_any3_eval = df_harm_alcohol_any3_eval[['Estimator', 'vars_used', 'Train Accuracy', 'Test Accuracy', 'RMSE']]
df_harm_alcohol_any3_eval

,Estimator,vars_used,Train Accuracy,Test Accuracy,RMSE
0,Linear Regression,Alcohol_any / state / sex,0.83,0.84,4.38
1,Lasso,Alcohol_any / state / sex,0.73,0.73,5.62
2,Ridge,Alcohol_any / state / sex,0.83,0.84,4.38
3,RandomForest Regressor,Alcohol_any / state / sex,0.93,0.87,3.86
4,AdaBoost Regressor,Alcohol_any / state / sex,0.55,0.55,7.34
5,GradientBoost Regressor,Alcohol_any / state / sex,0.87,0.87,3.98
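
Lasso trails Linear Regression and Ridge in the tables so far. One plausible explanation, not verified on this data, is that the default penalty `alpha=1.0` is strong enough to shrink many of the one-hot state coefficients to zero. The sketch below shows how `GridSearchCV` (imported above but unused) could search over `alpha`. The data is synthetic and the alpha grid is an arbitrary assumption.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the joined data (all values are made up).
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    'state': rng.choice(['A', 'B', 'C', 'D'], size=n),
    'sex': rng.choice(['Male', 'Female'], size=n),
    'alcohol_any': rng.uniform(30, 70, size=n),
})
toy['mx'] = (
    10
    + 8 * (toy['sex'] == 'Male')
    + toy['state'].map({'A': 0, 'B': 2, 'C': 4, 'D': 6})
    + 0.05 * toy['alcohol_any']
    + rng.normal(0, 1, size=n)
)

ct = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), ['state', 'sex']),
    remainder='passthrough',
)
pipe = make_pipeline(ct, Lasso(max_iter=10_000))

# make_pipeline names the step 'lasso', hence the 'lasso__alpha' key.
grid = GridSearchCV(
    pipe, {'lasso__alpha': [0.001, 0.01, 0.1, 1.0]}, cv=5, scoring='r2'
)
grid.fit(toy[['alcohol_any', 'state', 'sex']], toy['mx'])

results = pd.DataFrame(grid.cv_results_)[['param_lasso__alpha', 'mean_test_score']]
print(results)
print(grid.best_params_, round(grid.best_score_, 2))
```

On this toy data a weaker penalty scores higher than the default; whether the same holds for the real data would need to be checked by running the search on it.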


In [7]:
# doing this with alcohol_any and including year
df_harm_alcohol_any4_eval = model_eval(harm_alcohol_any, ['alcohol_any', 'state', 'sex', 'year_id'], 'mx', estimators)
df_harm_alcohol_any4_eval['vars_used'] = 'Alcohol_any / state / sex / year_id' 
df_harm_alcohol_any4_eval = df_harm_alcohol_any4_eval[['Estimator', 'vars_used', 'Train Accuracy', 'Test Accuracy', 'RMSE']]
df_harm_alcohol_any4_eval

# including year doesn't seem to make much of a difference in the best performing models, and seems to induce some overfitting in the RForest -- therefore we will not continue using it

,Estimator,vars_used,Train Accuracy,Test Accuracy,RMSE
0,Linear Regression,Alcohol_any / state / sex / year_id,0.84,0.84,4.33
1,Lasso,Alcohol_any / state / sex / year_id,0.73,0.74,5.6
2,Ridge,Alcohol_any / state / sex / year_id,0.84,0.84,4.33
3,RandomForest Regressor,Alcohol_any / state / sex / year_id,0.97,0.88,3.85
4,AdaBoost Regressor,Alcohol_any / state / sex / year_id,0.56,0.56,7.2
5,GradientBoost Regressor,Alcohol_any / state / sex / year_id,0.87,0.87,3.95
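
The overfitting noted in the comment above can be made concrete by comparing train and test scores directly. The sketch below copies the RandomForest rows from the result tables in this notebook into a small frame and computes the train/test gap; the `rf_scores` and `gap` names are illustrative.

```python
import pandas as pd

# RandomForest scores copied from the result tables in this notebook.
rf_scores = pd.DataFrame({
    'vars_used': [
        'state / sex',
        'Alcohol_any / state / sex',
        'Alcohol_any / state / sex / year_id',
    ],
    'Train Accuracy': [0.85, 0.93, 0.97],
    'Test Accuracy': [0.85, 0.87, 0.88],
})

# A growing gap between train and test scores is a sign of overfitting.
rf_scores['gap'] = (rf_scores['Train Accuracy'] - rf_scores['Test Accuracy']).round(2)
print(rf_scores)
```

Adding `year_id` raises the test score by 0.01 but widens the gap from 0.06 to 0.09, which is consistent with the decision to drop it.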


In [8]:
# do this with heavy drinking
df_harm_alcohol_heavy_eval = model_eval(harm_alcohol_heavy, ['alcohol_heavy', 'state', 'sex'], 'mx', estimators)
df_harm_alcohol_heavy_eval['vars_used'] = 'Alcohol_heavy / state / sex' 
df_harm_alcohol_heavy_eval = df_harm_alcohol_heavy_eval[['Estimator', 'vars_used', 'Train Accuracy', 'Test Accuracy', 'RMSE']]
df_harm_alcohol_heavy_eval

,Estimator,vars_used,Train Accuracy,Test Accuracy,RMSE
0,Linear Regression,Alcohol_heavy / state / sex,0.83,0.83,4.47
1,Lasso,Alcohol_heavy / state / sex,0.71,0.71,5.88
2,Ridge,Alcohol_heavy / state / sex,0.83,0.83,4.47
3,RandomForest Regressor,Alcohol_heavy / state / sex,0.91,0.85,4.23
4,AdaBoost Regressor,Alcohol_heavy / state / sex,0.73,0.72,5.77
5,GradientBoost Regressor,Alcohol_heavy / state / sex,0.86,0.86,4.13


In [9]:
# prop heavy 
df_harm_alcohol_propheavy_eval = model_eval(harm_alcohol_heavy, ['alcohol_prop_heavy', 'state', 'sex'], 'mx', estimators)
df_harm_alcohol_propheavy_eval['vars_used'] = 'Alcohol_prop_heavy / state / sex' 
df_harm_alcohol_propheavy_eval = df_harm_alcohol_propheavy_eval[['Estimator', 'vars_used', 'Train Accuracy', 'Test Accuracy', 'RMSE']]
df_harm_alcohol_propheavy_eval

,Estimator,vars_used,Train Accuracy,Test Accuracy,RMSE
0,Linear Regression,Alcohol_prop_heavy / state / sex,0.85,0.85,4.28
1,Lasso,Alcohol_prop_heavy / state / sex,0.74,0.73,5.64
2,Ridge,Alcohol_prop_heavy / state / sex,0.85,0.85,4.28
3,RandomForest Regressor,Alcohol_prop_heavy / state / sex,0.92,0.86,4.08
4,AdaBoost Regressor,Alcohol_prop_heavy / state / sex,0.68,0.67,6.25
5,GradientBoost Regressor,Alcohol_prop_heavy / state / sex,0.88,0.87,3.97


In [10]:
# do this with binge
df_harm_alcohol_binge_eval = model_eval(harm_alcohol_binge, ['alcohol_binge', 'state', 'sex'], 'mx', estimators)
df_harm_alcohol_binge_eval['vars_used'] = 'Alcohol_binge / state / sex' 
df_harm_alcohol_binge_eval = df_harm_alcohol_binge_eval[['Estimator', 'vars_used', 'Train Accuracy', 'Test Accuracy', 'RMSE']]
df_harm_alcohol_binge_eval

,Estimator,vars_used,Train Accuracy,Test Accuracy,RMSE
0,Linear Regression,Alcohol_binge / state / sex,0.83,0.83,4.45
1,Lasso,Alcohol_binge / state / sex,0.69,0.69,6.08
2,Ridge,Alcohol_binge / state / sex,0.83,0.83,4.45
3,RandomForest Regressor,Alcohol_binge / state / sex,0.89,0.84,4.35
4,AdaBoost Regressor,Alcohol_binge / state / sex,0.7,0.71,5.87
5,GradientBoost Regressor,Alcohol_binge / state / sex,0.84,0.85,4.24


In [11]:
# prop_binge
df_harm_alcohol_propbinge_eval = model_eval(harm_alcohol_binge, ['alcohol_prop_binge', 'state', 'sex'], 'mx', estimators)
df_harm_alcohol_propbinge_eval['vars_used'] = 'Alcohol_prop_binge / state / sex' 
df_harm_alcohol_propbinge_eval = df_harm_alcohol_propbinge_eval[['Estimator', 'vars_used', 'Train Accuracy', 'Test Accuracy', 'RMSE']]
df_harm_alcohol_propbinge_eval

,Estimator,vars_used,Train Accuracy,Test Accuracy,RMSE
0,Linear Regression,Alcohol_prop_binge / state / sex,0.84,0.84,4.31
1,Lasso,Alcohol_prop_binge / state / sex,0.65,0.65,6.41
2,Ridge,Alcohol_prop_binge / state / sex,0.84,0.84,4.31
3,RandomForest Regressor,Alcohol_prop_binge / state / sex,0.92,0.87,3.96
4,AdaBoost Regressor,Alcohol_prop_binge / state / sex,0.74,0.74,5.5
5,GradientBoost Regressor,Alcohol_prop_binge / state / sex,0.87,0.87,3.9


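Including `year_id` barely moves the test score (for example, from 0.87 to 0.88 for the random forest) while widening its train/test gap, so it is not worth keeping. State and sex account for most of the predictive power: on their own they already give test scores (R²) of 0.83 to 0.85 for the better models, and adding an alcohol measure raises the best test score only to 0.87. Linear Regression performs consistently across all variable combinations, with test scores between 0.83 and 0.85.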
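
Since every result table carries a `vars_used` column, the tables can be stacked and pivoted for a side-by-side comparison. The following is a minimal sketch: `combine_results` is a hypothetical helper, and the two small frames hold a subset of rows copied from the tables above rather than the full `model_eval` outputs.

```python
import pandas as pd

def combine_results(frames):
    """Stack per-run metric tables and pivot test scores by estimator and variable set."""
    stacked = pd.concat(frames, ignore_index=True)
    return stacked.pivot(index='Estimator', columns='vars_used', values='Test Accuracy')

# Subset of rows copied from the result tables in this notebook.
state_sex = pd.DataFrame({
    'Estimator': ['Linear Regression', 'GradientBoost Regressor'],
    'vars_used': 'state / sex',
    'Test Accuracy': [0.83, 0.84],
})
prop_binge = pd.DataFrame({
    'Estimator': ['Linear Regression', 'GradientBoost Regressor'],
    'vars_used': 'Alcohol_prop_binge / state / sex',
    'Test Accuracy': [0.84, 0.87],
})

summary = combine_results([state_sex, prop_binge])
print(summary)
```

In the notebook itself, the same helper could be called with the full frames, for example `combine_results([df_harm_state_sex, df_harm_alcohol_propbinge_eval])`.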

---

