# Predictive Modeling of US Suicide Deaths
Capstone Project for M.S. Data Analytics Program

Melissa Stone Rogers, [GitHub](https://github.com/meldstonerogers/capstone-stonerogers), April 4, 2025

## Introduction 
This is a professional project examining trends in suicide over time. Data were gathered from the Centers for Disease Control and Prevention (CDC) using
the Wide-ranging ONline Data for Epidemiologic Research ([WONDER](https://wonder.cdc.gov)) system.

Commands were run on a Mac using zsh.

### Import and Read Data

In [2]:
import pandas as pd
df = pd.read_csv("data/cleaned_data.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7106 entries, 0 to 7105
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   state            7106 non-null   object
 1   state_code       7106 non-null   int64 
 2   age_group_years  7106 non-null   int64 
 3   sex              7106 non-null   int64 
 4   race             7106 non-null   object
 5   race_code        7106 non-null   int64 
 6   year             7106 non-null   int64 
 7   deaths           7106 non-null   int64 
 8   population       7106 non-null   int64 
dtypes: int64(7), object(2)
memory usage: 499.8+ KB


In [3]:
# Preview the first rows and summary statistics of the DataFrame loaded above
print(df.head(n=10))
print(df.describe())

### Train/Test Data Split 

In [None]:
from sklearn.model_selection import train_test_split
# `autompg` is the auto-mpg DataFrame; it must be loaded before this cell runs
train_set, test_set = train_test_split(autompg,
                        test_size=0.2, random_state=123)
print('Train size: ', len(train_set), 'Test size: ', len(test_set))

Train size:  318 Test size:  80
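
The sizes above can be checked by hand. The sketch below is a plain-Python stand-in for a shuffled hold-out split, not scikit-learn's implementation: `shuffle_split` is a hypothetical helper, the 398-row total is inferred from 318 + 80, and the shuffle from Python's `random` module will not reproduce the rows that `train_test_split` selects with `random_state=123`. It only illustrates that a 20% test share of 398 rows, rounded up, gives 80 test rows and 318 training rows.

```python
import math
import random


def shuffle_split(n_samples, test_size=0.2, seed=123):
    """Return (train_idx, test_idx) for a shuffled hold-out split.

    Hypothetical helper for illustration; the test share is rounded up.
    """
    n_test = math.ceil(test_size * n_samples)
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # reproducible, but not sklearn's order
    return indices[n_test:], indices[:n_test]


# 398 rows is an assumption inferred from the reported 318 + 80
train_idx, test_idx = shuffle_split(398)
print('Train size: ', len(train_idx), 'Test size: ', len(test_idx))
```

Because the split is random, a different seed changes which rows land in each set but not the set sizes.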


### Train and Evaluate a Linear Regression Model 

In [None]:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Training features and target
X = train_set[['weight', 'displacement']]
y = train_set['mpg']

# Held-out test features and target
X_test = test_set[['weight', 'displacement']]
y_test = test_set['mpg']

# Fit an ordinary least-squares model with default settings
lr_model = LinearRegression()
lr_model.fit(X, y)

# Predict on the training data
y_pred = lr_model.predict(X)
print('Results for linear regression on training data')
print('  Default settings')
print('Internal parameters:')
print('   Bias is ', lr_model.intercept_)
print('   Coefficients', lr_model.coef_)
print('   Score', lr_model.score(X,y))
print('MAE is  ', mean_absolute_error(y, y_pred))
print('RMSE is ', np.sqrt(mean_squared_error(y, y_pred)))
print('MSE is ', mean_squared_error(y, y_pred))
print('R^2    ', r2_score(y,y_pred))

y_test_pred = lr_model.predict(X_test)
print()
print('Results for linear regression on test data')
print('MAE is  ', mean_absolute_error(y_test, y_test_pred))
print('RMSE is ', np.sqrt(mean_squared_error(y_test, y_test_pred)))
print('MSE is ', mean_squared_error(y_test, y_test_pred))
print('R^2    ', r2_score(y_test,y_test_pred))

Results for linear regression on training data
  Default settings
Internal parameters:
   Bias is  43.621433976443285
   Coefficients [-0.00556012 -0.01801839]
   Score 0.7024820582511935
MAE is   3.2290037032091603
RMSE is  4.2789696537676765
MSE is  18.309581297864668
R^2     0.7024820582511935

Results for linear regression on test data
MAE is   3.4980689722029554
RMSE is  4.343131313113012
MSE is  18.862789602942755
R^2     0.6765950182605451
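
The metrics printed above are closely related: RMSE is the square root of MSE, and the `Score` line repeats R^2 because `LinearRegression.score` returns the coefficient of determination. As a rough illustration, the sketch below recomputes the four metrics in plain Python. `regression_metrics` and the toy values are hypothetical and are not taken from the auto-mpg data; only the two MSE and RMSE pairs at the end are copied from the output above.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R^2 for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)  # total sum of squares
    ss_res = mse * n                                 # residual sum of squares
    return {'MAE': mae, 'MSE': mse, 'RMSE': math.sqrt(mse),
            'R2': 1.0 - ss_res / ss_tot}


# Toy values chosen for easy arithmetic (hypothetical)
y_true = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.5, 5.5, 7.0, 10.0]
metrics = regression_metrics(y_true, y_hat)
print(metrics)

# The reported RMSE values are the square roots of the reported MSE values
train_rmse = math.sqrt(18.309581297864668)
test_rmse = math.sqrt(18.862789602942755)
print(train_rmse, test_rmse)
```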


### Pipeline Models

#### Pipeline 1

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Define the pipeline
autompg_pipe = Pipeline([
    ('median_transform', SimpleImputer(strategy='median')),
    ('scale_transform', StandardScaler()),
    ('lin_reg', LinearRegression())])

# Fit the pipeline
autompg_pipe.fit(X,y)

# Output the intercept and coefficients 
print("The stage bias is " ,
      autompg_pipe.named_steps['lin_reg'].intercept_)
print("The stage feature coefficients are ",
      autompg_pipe.named_steps['lin_reg'].coef_)

# Predict and evaluate on training data
y_pred = autompg_pipe.predict(X)
print('Results for linear regression on training data with pipeline')
print('MAE is  ', mean_absolute_error(y, y_pred))
print('RMSE is ', np.sqrt(mean_squared_error(y, y_pred)))
print('MSE is ', mean_squared_error(y, y_pred))
print('R^2    ', r2_score(y, y_pred))

# Predict and evaluate on test data
y_test_pred = autompg_pipe.predict(X_test)
print('Results for linear regression on test data with pipeline')
print('MAE is  ', mean_absolute_error(y_test, y_test_pred))
print('RMSE is ', np.sqrt(mean_squared_error(y_test, y_test_pred)))
print('MSE is ', mean_squared_error(y_test, y_test_pred))
print('R^2    ', r2_score(y_test, y_test_pred))

The stage bias is  23.412578616352203
The stage feature coefficients are  [-4.76109404 -1.88955493]
Results for linear regression on training data with pipeline
MAE is   3.229003703209161
RMSE is  4.2789696537676765
MSE is  18.309581297864668
R^2     0.7024820582511935
Results for linear regression on test data with pipeline
MAE is   3.4980689722029568
RMSE is  4.3431313131130125
MSE is  18.862789602942758
R^2     0.676595018260545


#### Pipeline 2

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
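# NOTE: `power` and `poly_process` are defined here but never passed to the
# pipeline below, which calls PolynomialFeatures() with its defaults
# (degree=2, include_bias=True). The six coefficients in the output are
# consistent with those defaults.
power = 3
poly_process = PolynomialFeatures(degree=power, include_bias=False)

# Define the pipeline
autompg_pipe = Pipeline([
    ('median_transform', SimpleImputer(strategy='median')),
    ('poly_process', PolynomialFeatures()),
    ('scale_transform', StandardScaler()),
    ('lin_reg', LinearRegression())])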

# Fit the pipeline
autompg_pipe.fit(X,y)

# Output the intercept and coefficients 
print("The stage bias is " ,
      autompg_pipe.named_steps['lin_reg'].intercept_)
print("The stage feature coefficients are ",
      autompg_pipe.named_steps['lin_reg'].coef_)

# Predict and evaluate on training data
y_pred = autompg_pipe.predict(X)
print('Results for linear regression with polynomial features on training data with pipeline')
print('MAE is  ', mean_absolute_error(y, y_pred))
print('RMSE is ', np.sqrt(mean_squared_error(y, y_pred)))
print('MSE is ', mean_squared_error(y, y_pred))
print('R^2    ', r2_score(y, y_pred))

# Predict and evaluate on test data
y_test_pred = autompg_pipe.predict(X_test)
print('Results for linear regression with polynomial features on test data with pipeline')
print('MAE is  ', mean_absolute_error(y_test, y_test_pred))
print('RMSE is ', np.sqrt(mean_squared_error(y_test, y_test_pred)))
print('MSE is ', mean_squared_error(y_test, y_test_pred))
print('R^2    ', r2_score(y_test, y_test_pred))

The stage bias is  23.412578616352206
The stage feature coefficients are  [  0.          -1.49502986 -14.1635106  -11.16752447  24.49126934
  -4.53461007]
Results for linear regression with polynomial features on training data with pipeline
MAE is   2.9597527898224114
RMSE is  4.053977013800068
MSE is  16.43472962841932
R^2     0.7329470918695618
Results for linear regression with polynomial features on test data with pipeline
MAE is   3.1805704976155433
RMSE is  4.284941459858176
MSE is  18.360723314411516
R^2     0.6852030100948552


## Results
Basic results for models to predict mpg on the auto-mpg data.

| Model | Training Features | Set | RMSE | R^2 (%) |
|:---|:---|:---|:---|:---|
|Linear Regression|Weight, Displacement|Training|4.28|70.25|
|Linear Regression|Weight, Displacement|Test|4.34|67.66|
|Pipeline, Linear Regression|Weight, Displacement|Training|4.28|70.25|
|Pipeline, Linear Regression|Weight, Displacement|Test|4.34|67.66|
|Pipeline with Polynomial Features, Linear Regression|Weight, Displacement|Training|4.05|73.29|
|Pipeline with Polynomial Features, Linear Regression|Weight, Displacement|Test|4.28|68.52|


### Discussion of Results
The plain linear regression and the first pipeline produced identical metrics, since standardizing the features does not change a least-squares fit. Adding polynomial features improved performance, and that pipeline performed best. However, none of the models performed spectacularly: none of the R^2 scores came close to 80%. I am not concerned about over- or under-fitting in any of the models, as test performance stayed close to training performance. I even ran the polynomial pipeline with a higher power and saw no discernible difference in model performance, although the pipeline as written calls `PolynomialFeatures()` with its default settings, so changing `power` alone would not alter the fitted model. I think my chosen features are not the best features to predict the target, mpg.
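
One simple way to put a number on the over-fitting check is the gap between training and test R^2. The sketch below computes it from the R^2 values printed in the cell outputs. `r2_gap` is a hypothetical helper, and no particular threshold for an acceptable gap is implied.

```python
# R^2 values copied from the cell outputs, as (training, test) pairs
reported_r2 = {
    'Linear Regression': (0.7024820582511935, 0.6765950182605451),
    'Pipeline, Linear Regression': (0.7024820582511935, 0.676595018260545),
    'Pipeline with Polynomial Features': (0.7329470918695618, 0.6852030100948552),
}


def r2_gap(train_r2, test_r2):
    """Training R^2 minus test R^2; larger values suggest more over-fitting."""
    return train_r2 - test_r2


gaps = {name: r2_gap(*scores) for name, scores in reported_r2.items()}
for name, gap in gaps.items():
    print(f'{name}: train-test R^2 gap = {gap:.4f}')
```

The gap is about 0.026 for the two linear models and about 0.048 for the polynomial pipeline, so the polynomial model gives up somewhat more performance on unseen data even though its test score is the highest of the three.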

I was curious, so I duplicated the notebook to try the other features. I still believe that weight is an important feature as it relates to mpg, so I kept it and changed only the secondary feature. Cylinders, acceleration, and origin each performed similarly to displacement.

However, the pipeline with polynomial features, trained on weight and model year, scored markedly better: training RMSE 2.91 and R^2 86.26%; test RMSE 2.11 and R^2 85.88%. I suspect this reflects advances in car engineering that allowed more efficient use of fuel, and therefore better mpg, in later model years. If so, model year would be strongly associated with mpg, which would explain why it is such a useful predictive feature.
