# Predictive Modeling of US Suicide Deaths
Capstone Project for M.S. Data Analytics Program

Melissa Stone Rogers, [GitHub](https://github.com/meldstonerogers/capstone-stonerogers), April 4, 2025

## Introduction 
This is a professional project exaiming trends in suicide over time. Data has been gathered from Center for Disease Control using
the Wide-ranging ONline Data for Epidemiologic Research[(WONDER)](https://wonder.cdc.gov) system. 

Commands were used on a Mac machine running zsh.

### Import and Read Data

In [2]:
import pandas as pd
df = pd.read_csv("data/cleaned_data.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7106 entries, 0 to 7105
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   state            7106 non-null   object
 1   state_code       7106 non-null   int64 
 2   age_group_years  7106 non-null   int64 
 3   sex              7106 non-null   int64 
 4   race             7106 non-null   object
 5   race_code        7106 non-null   int64 
 6   year             7106 non-null   int64 
 7   deaths           7106 non-null   int64 
 8   population       7106 non-null   int64 
dtypes: int64(7), object(2)
memory usage: 499.8+ KB


### Train/Test Data Split 

In [4]:
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(df,
                        test_size=0.2, random_state=123)
print('Train size: ', len(train_set), 'Test size: ', len(test_set))

Train size:  5684 Test size:  1422


### Train and Evaluate Linear Regression Model 

In [62]:
from sklearn.linear_model import LinearRegression
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
import pickle

X_train = train_set[['age_group_years', 'sex', 'race_code', 'population']]
y_train = train_set['deaths']

X_test = test_set[['age_group_years', 'sex', 'race_code', 'population']]
y_test = test_set['deaths']

lr_model = LinearRegression()
lr_model.fit(X_train,y_train)

y_pred = lr_model.predict(X_train)
print('Results for linear regression on training data')
print('  Default settings')
print('Internal parameters:')
print('   Bias is ', lr_model.intercept_)
print('   Coefficients', lr_model.coef_)
print('   Score', lr_model.score(X_train,y_train))
print('MAE is  ', mean_absolute_error(y_train, y_pred))
print('RMSE is ', np.sqrt(mean_squared_error(y_train, y_pred)))
print('MSE is ', mean_squared_error(y_train, y_pred))
print('R^2    ', r2_score(y_train,y_pred))

y_test_pred = lr_model.predict(X_test)
print()
print('Results for linear regression on test data')
print('MAE is  ', mean_absolute_error(y_test, y_test_pred))
print('RMSE is ', np.sqrt(mean_squared_error(y_test,
y_test_pred)))
print('MSE is ', mean_squared_error(y_test, y_test_pred))
print('R^2    ', r2_score(y_test,y_test_pred))

# Save the trained model to a pickle file
lr_model_path = 'models/lr_model.pkl'
with open(lr_model_path, 'wb') as f:  
    pickle.dump(lr_model, f)

print(f"\nModel saved to '{lr_model_path}'")

Results for linear regression on training data
  Default settings
Internal parameters:
   Bias is  -96.00559296275848
   Coefficients [1.25845476e-01 4.19552460e+01 7.29870907e+00 1.58547912e-04]
   Score 0.6330167640008322
MAE is   15.571962860140275
RMSE is  23.35370072974408
MSE is  545.3953377744491
R^2     0.6330167640008322

Results for linear regression on test data
MAE is   15.140016736685686
RMSE is  22.089473187792617
MSE is  487.94482571420895
R^2     0.6842941207645077

Model saved to 'models/lr_model.pkl'


## Train and Evaluate Random Forest Regressor

In [63]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
from sklearn.datasets import make_regression

# Initialize the Random Forest Regressor
regr = RandomForestRegressor(max_depth=2, random_state=0)

X_train = train_set[['age_group_years', 'sex', 'race_code', 'population']]
y_train = train_set['deaths']

X_test = test_set[['age_group_years', 'sex', 'race_code', 'population']]
y_test = test_set['deaths']

# Fit the model on training data
regr.fit(X_train, y_train)

# Predict and evaluate on training data
y_train_pred = regr.predict(X_train)
print('Results for training data with Random Forest Regressor')
print('MAE is  ', mean_absolute_error(y_train, y_pred))
print('RMSE is ', np.sqrt(mean_squared_error(y_train, y_pred)))
print('MSE is ', mean_squared_error(y_train, y_pred))
print('R^2    ', r2_score(y_train, y_pred))

# Predict and evaluate on test data
y_test_pred = regr.predict(X_test)
print('Results for test data with Random Forest Regressor')
print('MAE is  ', mean_absolute_error(y_test, y_test_pred))
print('RMSE is ', np.sqrt(mean_squared_error(y_test, y_test_pred)))
print('MSE is ', mean_squared_error(y_test, y_test_pred))
print('R^2    ', r2_score(y_test, y_test_pred))

# Save the trained model to a pickle file
regr = 'models/regr.pkl'
with open(regr, 'wb') as f:  # Save model to file
    pickle.dump(regr, f)

print("\nModel saved to 'regr.pkl'")

Results for training data with Random Forest Regressor
MAE is   15.571962860140275
RMSE is  23.35370072974408
MSE is  545.3953377744491
R^2     0.6330167640008322
Results for test data with Random Forest Regressor
MAE is   17.765812577116282
RMSE is  24.017492249213614
MSE is  576.839933941036
R^2     0.6267779697090966

Model saved to 'regr.pkl'


## Train and Evaluate Decision Tree Classifier Model

In [64]:
from matplotlib import cm
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Assuming 'df' is your dataset
# Categorize 'deaths' into Low, Medium, and High
df['death_category'] = np.where(df['deaths'] < 15, 'Low',
                                np.where(df['deaths'] <= 47, 'Medium', 'High'))


# Use numeric features for classification 
X = df[['age_group_years', 'sex', 'race_code', 'year', 'population']]  
y = df['death_category']

# Stratified split to preserve percentage of each class in train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Initialize the Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42)

# Fit the model to the training data
clf.fit(X_train, y_train)

# Predict on the training and test data
y_train_pred = clf.predict(X_train)
y_test_pred = clf.predict(X_test)

# Evaluate the model's performance
print("Training Accuracy: ", accuracy_score(y_train, y_train_pred))
print("Test Accuracy: ", accuracy_score(y_test, y_test_pred))

# Print a detailed classification report
print("\nClassification Report on Test Data:")
print(classification_report(y_test, y_test_pred))

# Feature importance - Identify which variables matter most in predicting death category
feature_importances = clf.feature_importances_
print("\nFeature Importances:")
for feature, importance in zip(X.columns, feature_importances):
    print(f"{feature}: {importance}")

print("\nConfusion Matrix on Test Data:")
print(confusion_matrix(y_test, y_test_pred))

# Save the trained model to a pickle file
clf = 'models/clf_model.pkl'
with open(clf, 'wb') as f:  # Save model to file
    pickle.dump(clf, f)

print("\nModel saved to clf.pkl'")




Training Accuracy:  1.0
Test Accuracy:  0.7123769338959213

Classification Report on Test Data:
              precision    recall  f1-score   support

        High       0.77      0.78      0.77       346
         Low       0.63      0.63      0.63       341
      Medium       0.72      0.72      0.72       735

    accuracy                           0.71      1422
   macro avg       0.71      0.71      0.71      1422
weighted avg       0.71      0.71      0.71      1422


Feature Importances:
age_group_years: 0.1534313484439139
sex: 0.15924502752756703
race_code: 0.06016835293677794
year: 0.09539740509612805
population: 0.5317578659956133

Confusion Matrix on Test Data:
[[270   0  76]
 [  2 214 125]
 [ 80 126 529]]

Model saved to clf.pkl'


## Train and Evaluate Random Forest Model

In [65]:
from sklearn.ensemble import RandomForestClassifier

# Initialize the Random Forest Classifier
rf_clf = RandomForestClassifier(random_state=42)

# Fit the model to the training data
rf_clf.fit(X_train, y_train)

# Predict on the training and test data
y_train_pred_rf = rf_clf.predict(X_train)
y_test_pred_rf = rf_clf.predict(X_test)

# Evaluate the model's performance
print("Random Forest - Training Accuracy: ", accuracy_score(y_train, y_train_pred_rf))
print("Random Forest - Test Accuracy: ", accuracy_score(y_test, y_test_pred_rf))

# Print a detailed classification report
print("\nRandom Forest - Classification Report on Test Data:")
print(classification_report(y_test, y_test_pred_rf))

# Feature importance from Random Forest
rf_feature_importances = rf_clf.feature_importances_
print("\nRandom Forest - Feature Importances:")
for feature, importance in zip(X.columns, rf_feature_importances):
    print(f"{feature}: {importance}")

print("\nRandom Forest - Confusion Matrix on Test Data:")
print(confusion_matrix(y_test, y_test_pred_rf))

# Save the trained model to a pickle file
rf_clf = 'models/rf_clf_model.pkl'
with open(rf_clf, 'wb') as f:  # Save model to file
    pickle.dump(rf_clf, f)

print("\nModel saved to 'rf_clf.pkl'")

Random Forest - Training Accuracy:  1.0
Random Forest - Test Accuracy:  0.7461322081575246

Random Forest - Classification Report on Test Data:
              precision    recall  f1-score   support

        High       0.82      0.84      0.83       346
         Low       0.65      0.63      0.64       341
      Medium       0.75      0.75      0.75       735

    accuracy                           0.75      1422
   macro avg       0.74      0.74      0.74      1422
weighted avg       0.74      0.75      0.75      1422


Random Forest - Feature Importances:
age_group_years: 0.14013343936798117
sex: 0.13658922366409074
race_code: 0.047425309053392276
year: 0.08189632688896195
population: 0.5939557010255739

Random Forest - Confusion Matrix on Test Data:
[[291   0  55]
 [  0 216 125]
 [ 66 115 554]]

Model saved to 'rf_clf.pkl'


# Results
Basic results for models to predict deaths based on cleaned suicide dataset.
| Model | Training Features | Set | RMSE | R2 |
|:---|:---|:---|:---|:---|
|Linear Regression|age_group_years, sex, race_code|Training|35.89|13.31|
|Linear Regression|age_group_years, sex, race_code|Test|36.76|12.59|
|Linear Regression|age_group_years, sex, race_code, population|Training|23.35|63.30|
|Linear Regression|age_group_years, sex, race_code, population|Test|22.10|68.43|
|Random Forest Regressor|age_group_years, sex, race_code|Training|35.889|13.31|
|Random Forest Regressor|age_group_years, sex, race_code|Test|36.44|14.10|
|Random Forest Regressor|age_group_years, sex, race_code, population|Training|23.35|63.30|
|Random Forest Regressor|age_group_years, sex, race_code, population|Test|24.02|62.68|

Basic results for models to predict feature importance based on cleaned suicide dataset. 
| Model | Training Features | Set | Accuracy | 
|:---|:---|:---|:---|
|Decision Tree Model|age_group_years, sex, race_code, population|Training|1.0|
|Decision Tree Model|age_group_years, sex, race_code, population|Test|.71|
|Random Forest Model|age_group_years, sex, race_code, population|Training|1.0|
|Random Forest Model|age_group_years, sex, race_code, population|Test|.75|

**Decision Tree Classifier Feature Importance**
| Feature | Feature Importance |
|:---|:---|
|age_group_years|0.15|
|sex|0.14|
|race_code|.06|
|year|0.10|
|population|0.53|

**Random Forest Feature Importance**
| Feature | Feature Importance |
|:---|:---|
|age_group_years|0.14|
|sex|0.14|
|race_code|.05|
|year|0.08|
|population|0.59|

**Decision Tree Classifier Confusion Matrix**
| Predicted \ Actual | Low  | Medium | High |
|:---|:---|:---|:---|
| Low                | 270  | 0      | 76   |
| Medium             | 2    | 214    | 125  |
| High               | 80   | 126    | 529  |


**Random Tree Confusion Matrix**
| Predicted \ Actual | Low  | Medium | High |
|:---|:---|:---|:---|
| Low                | 291 | 0      | 55   |
| Medium             | 0    | 216    | 125  |
| High               | 66   | 115    | 524  |


#### Discussion of Results
The initial linear regression model performed poorly on both training and test sets noted by relatively high errors and very low R^2. This model does not fit well to the data. When "population" was added to the linear regression model, the RMSE improved and the R^2 increased significantly. This suggests "population" may be important in predicting the number of deaths. However, further analysis is needed to determine if the rate of death is proportional to the size of the population.

The Random Forest Regressor model's initial training and test sets were very similar to the initial linear regression models. Similar to the linear regression model, with the inclusion of "population" into the variables, the Random Forest Regressor performed better. However, when including population, the linear regression model's test model performed better.

When considering predicting the number of deaths, a linear regression model with all features is the best performing model. It is possible the data is not complex enough for the Random Forest Regressor model to offer a performance advantage.

The Decision Tree Classifier Model performed well on the training set with 100% accuracy, which suggests the model is overfitting. The test set performed okay, with 71% accuracy, but this is not generalizable to other data. The Decision Tree Classifier confusion matrix shows the model was good at classifying the "High" death category; the "High" category may be overemphasized in this model.

The Random Forest Classifier also had 100% accuracy on the training set, which again suggests overfitting. The test set performed better than the Decision Tree Classifier's test set with 75%, which is still not great for generalizability, but still better than the Decision Tree Classifier.
Regarding the Random Forest Classifiers, the confusion matrix did not do well in classifying "High" or "Low" categories, again suggesting the Decision Tree Classifier as the better performing model when deciding feature importance in death prediction.

Both classifier models note "population" as the most important feature, followed by "sex" and "age_group_years."
