# Predictive Modeling of US Suicide Deaths
Capstone Project for M.S. Data Analytics Program

Melissa Stone Rogers, [GitHub](https://github.com/meldstonerogers/capstone-stonerogers), April 4, 2025

## Introduction 
This is a professional project examining trends in suicide deaths over time. Data were gathered from the Centers for Disease Control and Prevention (CDC) using
the [Wide-ranging ONline Data for Epidemiologic Research (WONDER)](https://wonder.cdc.gov) system.

Commands were run on a Mac using zsh.

### Import and Read Data

In [2]:
import pandas as pd
df = pd.read_csv("data/cleaned_data.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7106 entries, 0 to 7105
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   state            7106 non-null   object
 1   state_code       7106 non-null   int64 
 2   age_group_years  7106 non-null   int64 
 3   sex              7106 non-null   int64 
 4   race             7106 non-null   object
 5   race_code        7106 non-null   int64 
 6   year             7106 non-null   int64 
 7   deaths           7106 non-null   int64 
 8   population       7106 non-null   int64 
dtypes: int64(7), object(2)
memory usage: 499.8+ KB


### Train/Test Data Split 

In [4]:
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(df,
                        test_size=0.2, random_state=123)
print('Train size: ', len(train_set), 'Test size: ', len(test_set))

Train size:  5684 Test size:  1422
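The printed sizes can be checked by hand. The sketch below assumes that a fractional `test_size` is rounded up to a whole number of test rows and that the training set receives the remainder, which is consistent with the output above; `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math


def split_sizes(n_samples: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test count is rounded up and the training set gets
    whatever is left, which matches the sizes printed in this notebook.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


# 7106 rows and test_size=0.2 are the values used in the cell above.
n_train, n_test = split_sizes(7106, 0.2)
print('Train size: ', n_train, 'Test size: ', n_test)
```

Since 0.2 × 7106 = 1421.2, the test set holds 1422 rows and the training set the remaining 5684.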


### Train and Evaluate Linear Regression Model 

In [9]:
from sklearn.linear_model import LinearRegression
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

X_train = train_set[['age_group_years', 'sex', 'race_code']]
y_train = train_set['deaths']

X_test = test_set[['age_group_years', 'sex', 'race_code']]
y_test = test_set['deaths']

lr_model = LinearRegression()
lr_model.fit(X_train,y_train)

y_pred = lr_model.predict(X_train)
print('Results for linear regression on training data')
print('  Default settings')
print('Internal parameters:')
print('   Bias is ', lr_model.intercept_)
print('   Coefficients', lr_model.coef_)
print('   Score', lr_model.score(X_train,y_train))
print('MAE is  ', mean_absolute_error(y_train, y_pred))
print('RMSE is ', np.sqrt(mean_squared_error(y_train, y_pred)))
print('MSE is ', mean_squared_error(y_train, y_pred))
print('R^2    ', r2_score(y_train, y_pred))

y_test_pred = lr_model.predict(X_test)
print()
print('Results for linear regression on test data')
print('MAE is  ', mean_absolute_error(y_test, y_test_pred))
print('RMSE is ', np.sqrt(mean_squared_error(y_test,
y_test_pred)))
print('MSE is ', mean_squared_error(y_test, y_test_pred))
print('R^2    ', r2_score(y_test,y_test_pred))

Results for linear regression on training data
  Default settings
Internal parameters:
   Bias is  -53.81872238742251
   Coefficients [-0.05172193 26.8194335  12.92654776]
   Score 0.1330797258467411
MAE is   22.039936927469412
RMSE is  35.89402636264416
MSE is  1288.381128522194
R^2     0.1330797258467411

Results for linear regression on test data
MAE is   22.620146863967054
RMSE is  36.75530187996859
MSE is  1350.9522162876228
R^2     0.1259184752621485
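As a quick sanity check on the output above, RMSE is the square root of MSE, so the two recorded values should agree for each split. The sketch below uses only the numbers printed by the linear regression cell; the dictionary layout and the name `rmse_from_mse` are illustrative assumptions.

```python
import math

# Values copied from the recorded linear regression output above.
recorded = {
    "train": {"mse": 1288.381128522194, "rmse": 35.89402636264416},
    "test": {"mse": 1350.9522162876228, "rmse": 36.75530187996859},
}


def rmse_from_mse(mse: float) -> float:
    """RMSE is the square root of MSE, expressed in the units of the target."""
    return math.sqrt(mse)


for split, metrics in recorded.items():
    recomputed = rmse_from_mse(metrics["mse"])
    matches = math.isclose(recomputed, metrics["rmse"], rel_tol=1e-9)
    print(split, recomputed, matches)
```

Because RMSE is in the same units as `deaths`, it is the easier of the two to interpret: the linear model is off by roughly 36 deaths per row on both sets.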


### Train and Evaluate Random Forest Regressor

In [13]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# Initialize the Random Forest Regressor
regr = RandomForestRegressor(max_depth=2, random_state=0)

# Fit the model on training data
regr.fit(X_train, y_train)

# Predict and evaluate on training data
# (score y_train_pred from the forest, not y_pred from the linear regression cell)
y_train_pred = regr.predict(X_train)
print('Results for random forest regressor on training data')
print('MAE is  ', mean_absolute_error(y_train, y_train_pred))
print('RMSE is ', np.sqrt(mean_squared_error(y_train, y_train_pred)))
print('MSE is ', mean_squared_error(y_train, y_train_pred))
print('R^2    ', r2_score(y_train, y_train_pred))

# Predict and evaluate on test data
y_test_pred = regr.predict(X_test)
print('Results for random forest regressor on test data')
print('MAE is  ', mean_absolute_error(y_test, y_test_pred))
print('RMSE is ', np.sqrt(mean_squared_error(y_test, y_test_pred)))
print('MSE is ', mean_squared_error(y_test, y_test_pred))
print('R^2    ', r2_score(y_test, y_test_pred))

Results for random forest regressor on training data
(Not recorded: the saved run scored `y_pred` from the linear regression cell, so its training metrics duplicated the linear regression training results.)
Results for random forest regressor on test data
MAE is   21.94673000328097
RMSE is  36.43730822147239
MSE is  1327.6774304265793
R^2     0.1409775277349834
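Every model in this notebook is scored with the same four metrics, and reusing a prediction variable from an earlier cell is an easy mistake to make. One way to avoid it is a small helper that always predicts from the model it is given. The following is a minimal, dependency-free sketch: `regression_report` and `MeanModel` are hypothetical names introduced here for illustration, and the metrics are computed from their textbook definitions rather than through `sklearn.metrics`.

```python
import math


def regression_report(model, X, y):
    """Predict with `model` on X, score against y, and return the metrics.

    Works with any object that exposes predict(X). Assumes y is not constant,
    since R^2 divides by the total sum of squares.
    """
    y_true = [float(v) for v in y]
    y_hat = [float(v) for v in model.predict(X)]
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_hat)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    r2 = 1.0 - (mse * n) / ss_tot
    return {"MAE": mae, "RMSE": math.sqrt(mse), "MSE": mse, "R^2": r2}


class MeanModel:
    """Hypothetical stand-in estimator: always predicts the mean seen in fit()."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


# Toy data for the demonstration only; not taken from the suicide dataset.
X_demo = [[0], [1], [2], [3]]
y_demo = [1.0, 2.0, 3.0, 4.0]
demo_report = regression_report(MeanModel().fit(X_demo, y_demo), X_demo, y_demo)
print(demo_report)
```

With the fitted models from this notebook, a call such as `regression_report(regr, X_train, y_train)` would score the forest's own predictions on each split. The mean-only model in the demo scores an R^2 of 0, which is a useful baseline when reading the low R^2 values reported here.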


### Train and Evaluate Linear Regression with Polynomial Features (Pipeline)

In [14]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Define the pipeline: impute, expand to polynomial features, scale, then fit.
# PolynomialFeatures() uses its defaults (degree=2, include_bias=True), which
# matches the ten coefficients printed below.
poly_pipe = Pipeline([
    ('median_transform', SimpleImputer(strategy='median')),
    ('poly_process', PolynomialFeatures()),
    ('scale_transform', StandardScaler()),
    ('lin_reg', LinearRegression())])

# Fit the pipeline on the training data
poly_pipe.fit(X_train, y_train)

# Output the intercept and coefficients
print("The stage bias is " ,
      poly_pipe.named_steps['lin_reg'].intercept_)
print("The stage feature coefficients are ",
      poly_pipe.named_steps['lin_reg'].coef_)

# Predict and evaluate on training data
y_pred = poly_pipe.predict(X_train)
print('Results for linear regression with polynomial features on training data with pipeline')
print('MAE is  ', mean_absolute_error(y_train, y_pred))
print('RMSE is ', np.sqrt(mean_squared_error(y_train, y_pred)))
print('MSE is ', mean_squared_error(y_train, y_pred))
print('R^2    ', r2_score(y_train, y_pred))

# Predict and evaluate on test data
y_test_pred = poly_pipe.predict(X_test)
print('Results for linear regression with polynomial features on test data with pipeline')
print('MAE is  ', mean_absolute_error(y_test, y_test_pred))
print('RMSE is ', np.sqrt(mean_squared_error(y_test, y_test_pred)))
print('MSE is ', mean_squared_error(y_test, y_test_pred))
print('R^2    ', r2_score(y_test, y_test_pred))

The stage bias is  38.149366643209035
The stage feature coefficients are  [  0.          28.02779352  -3.66739426   0.54258712 -36.794721
  -0.40769252   8.29152938  -3.66739426  24.27052257  -9.19817123]
Results for linear regression with polynomial features on training data with pipeline
MAE is   21.44059239314999
RMSE is  35.12414912096016
MSE is  1233.7058514714465
R^2     0.16986938778831862
Results for linear regression with polynomial features on test data with pipeline
MAE is   22.16738415974764
RMSE is  36.224740650779744
MSE is  1312.2318352162542
R^2     0.15097100437253919
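The ten coefficients printed above are what a degree-2 expansion of three input features produces when the constant (bias) column is included: the number of polynomial terms of total degree at most d in n variables is C(n + d, d). The sketch below works through that count; `n_polynomial_terms` is a hypothetical helper written for this note, not a scikit-learn function.

```python
from math import comb


def n_polynomial_terms(n_features: int, degree: int, include_bias: bool = True) -> int:
    """Count monomials of total degree <= `degree` in `n_features` variables."""
    total = comb(n_features + degree, degree)
    # Dropping the bias removes the single constant term.
    return total if include_bias else total - 1


# Three inputs (age_group_years, sex, race_code), degree 2, bias included.
print(n_polynomial_terms(3, 2))
# The same three inputs at degree 3 without the bias column.
print(n_polynomial_terms(3, 3, include_bias=False))
```

The leading coefficient of 0 in the output belongs to the constant column, which carries no information once the data are standardized. The value -3.66739426 appears twice in the coefficient list; a likely explanation, assuming `sex` takes only two values, is that its squared term becomes an identical column after scaling.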


### Train and Evaluate Random Forest Model

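## Results
Basic results for models that predict deaths from the cleaned suicide dataset. R^2 is reported as a percentage; `x` marks values not yet recorded.

| Model | Training Features | Set | RMSE | R^2 (%) |
|:---|:---|:---|:---|:---|
|Linear Regression|age_group_years, sex, race_code|Training|35.89|13.31|
|Linear Regression|age_group_years, sex, race_code|Test|36.76|12.59|
|Random Forest Regressor|age_group_years, sex, race_code|Training|x|x|
|Random Forest Regressor|age_group_years, sex, race_code|Test|36.44|14.10|
|Linear Regression with Polynomial Features (Pipeline)|age_group_years, sex, race_code|Training|35.12|16.99|
|Linear Regression with Polynomial Features (Pipeline)|age_group_years, sex, race_code|Test|36.22|15.10|
|Decision Tree Model|xx|Training|x|x|
|Decision Tree Model|xx|Test|x|x|
|Random Forest Model|xx|Training|x|x|
|Random Forest Model|xx|Test|x|x|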
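
The table rounds RMSE to two decimals and reports R^2 as a percentage of the raw score. A minimal sketch of that conversion, using the recorded linear regression metrics as input; `format_row` is a hypothetical helper introduced for illustration.

```python
def format_row(model: str, features: str, split: str, rmse: float, r2: float) -> str:
    """Format one results-table row: RMSE to 2 decimals, R^2 as a percentage."""
    return f"|{model}|{features}|{split}|{rmse:.2f}|{r2 * 100:.2f}|"


features = "age_group_years, sex, race_code"

# Raw metrics copied from the recorded linear regression output.
train_row = format_row("Linear Regression", features, "Training",
                       35.89402636264416, 0.1330797258467411)
test_row = format_row("Linear Regression", features, "Test",
                      36.75530187996859, 0.1259184752621485)
print(train_row)
print(test_row)
```

Generating rows this way keeps the table in step with the notebook output whenever a model is rerun.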


#### Discussion of Results
Model performance improved with each iteration, with the polynomial-features pipeline performing best (test R^2 of 15.10%, compared with 14.10% for the random forest regressor and 12.59% for linear regression). However, none of the models performed well: no R^2 score came close to 80%. I am not concerned about overfitting in any of the models, since each performed about as well on the test set as on the training set. I also ran the polynomial pipeline with a higher power and saw no discernible difference in model performance. I think my chosen features (age_group_years, sex, and race_code) are not the best features for predicting the target, deaths.

