# Context


This notebook summarizes the steps in the development process for our survey model.

### The data

The model uses a [public dataset](https://wwwn.cdc.gov/nchs/nhanes/default.aspx) from the National Health and Nutrition Examination Survey (NHANES). NHANES is a national survey that monitors the health and nutritional status of adults and children across the US. NHANES is run by the National Center for Health Statistics (NCHS), which is part of the Centers for Disease Control and Prevention (CDC) and is responsible for producing vital and health statistics for the nation. The public database contains data from 40+ surveys, each conducted 12 times between 1999 and 2020.

### EDA & Data Pipeline

We explored roughly half of the survey files available in the database. For our final model we are using {XXX} different survey files spanning the years {YYY}. Our repository contains links to the Jupyter notebooks used for creating our data pipeline {ZZZ} and performing exploratory data analysis (EDA) {AAA}.

### Model Development

#### Model Objective

The objective of our model is to predict whether a survey respondent is depressed.
- The outcome variable is a binary indicator variable called "MDD". The value is 1 if the individual has been diagnosed and is taking medication related to depression.
- The input variables cover demographic traits, medical & physical health, and mood/behavioral data.

#### Training Phase

We explored several options during the training phase in an effort to build the best model and better understand our target sample.


1) We explored the following 6 classifiers for our model:
- Logistic Regression: XXX
- Random Forest: XXX
- Decision Tree: XXX
- K-Nearest Neighbor (KNN): XXX
- Naive Bayes: XXX
- Gradient Boosting Classifier: XXX

2) Positive class imbalance: Across our entire dataset, ~4% of respondents have MDD=1. Among pregnant women, ~10% have MDD=1. We explored 3 strategies to account for this:
- SMOTE: XXX
- Test/Train Proportion: XXX
- Adding additional survey years: XXX
 
3) Segmentation: Our project is oriented around predicting postpartum depression (PPD), hence our model focuses on predicting depression among women who have been pregnant. We did, however, want to compare performance across several groups: A) all respondents, B) males only, C) females only, and D) only females who have been pregnant. This comparison helps us understand potential differences in predicting depression among different audiences.
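
To make the class imbalance in item 2 concrete, the sketch below uses made-up label counts at roughly the prevalences quoted above (it does not touch the survey data). It shows why accuracy alone is a weak yardstick here: a model that always predicts MDD=0 scores high accuracy while finding no depressed respondents. The helper name `majority_baseline` is hypothetical.

```python
def majority_baseline(n_total, n_positive):
    """Accuracy and positive-class recall of a model that always predicts MDD = 0."""
    y_true = [1] * n_positive + [0] * (n_total - n_positive)
    y_pred = [0] * n_total  # always predict the majority class
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    return correct / n_total, true_pos / n_positive


# Illustrative sample of 1,000 respondents per segment (assumed sizes).
segments = [("all respondents (~4%)", 40), ("pregnant women (~10%)", 100)]
for label, n_positive in segments:
    accuracy, recall = majority_baseline(1000, n_positive)
    print(f"{label}: accuracy={accuracy:.2f}, recall for MDD=1 = {recall:.2f}")
```

This is one reason the evaluation later in the notebook sorts models by recall rather than accuracy.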

#### Test Phase



- Evaluation criteria: 
- Feature selection:
- Hyperparameters: 
- Error analysis:

# Model Development

## Setup

### Load Packages

In [1]:
# TODO: add annotations describing usage of different modules

# Standard library
import json
import pickle
import random
import warnings
from collections import Counter
from operator import mod
from os import getcwd
from os.path import exists, join

# Data handling & plotting
import altair as alt
import joblib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from IPython.display import Image, display
# from ydata_profiling import ProfileReport

# Resampling for class imbalance
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler

# scikit-learn: data, preprocessing, models, evaluation
from sklearn import metrics, preprocessing, tree
from sklearn.datasets import fetch_california_housing, make_classification
from sklearn.decomposition import PCA, SparsePCA
from sklearn.ensemble import ExtraTreesClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, Normalizer, RobustScaler, StandardScaler
from sklearn.svm import SVC, SVR, LinearSVC
from sklearn.tree import DecisionTreeClassifier
# import xgboost as xgb

warnings.filterwarnings('ignore')



### Load Data

In [2]:
cdc_survey = pd.read_csv('../data/cdc_nhanes_survey_responses_clean.csv')

# filter to pregnant moms
cdc_survey_pmom = cdc_survey[cdc_survey['has_been_pregnant'] == 1]
cdc_survey_pmom

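| | SEQN | SMQ681 | SMQ690A | SMQ710 | SMQ720 | SMQ725 | SMQ690B | SMQ740 | SMQ690C | SMQ770 | ... | live_birth_count | age_at_first_birth | age_at_last_birth | months_since_birth | horomones_not_bc | smoked_100_cigs | currently_smoke | height_in | weight_lbs | attempt_weight_loss_1yr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | 109284 | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 4.0 | 17.0 | 24.0 | NaN | 2.0 | 2.0 | NaN | 60.0 | 178.0 | 1.0 |
| 11 | 109290 | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 3.0 | NaN | NaN | NaN | 2.0 | 2.0 | NaN | 63.0 | 155.0 | 1.0 |
| 12 | 109291 | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 4.0 | 21.0 | 39.0 | NaN | 2.0 | 2.0 | NaN | 64.0 | 148.0 | 2.0 |
| 15 | 109295 | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 3.0 | 18.0 | 24.0 | NaN | 2.0 | 2.0 | NaN | 63.0 | 137.0 | 2.0 |
| 18 | 109300 | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 2.0 | 30.0 | 33.0 | NaN | 2.0 | 2.0 | NaN | 59.0 | 130.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 29140 | 83700 | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 1.0 | NaN | 27.0 | NaN | 2.0 | 2.0 | NaN | 64.0 | 153.0 | 2.0 |
| 29141 | 83701 | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 1.0 | NaN | 26.0 | NaN | 2.0 | 2.0 | NaN | 65.0 | 160.0 | 2.0 |
| 29144 | 83711 | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 2.0 | 27.0 | 30.0 | NaN | 2.0 | 2.0 | NaN | 61.0 | 185.0 | NaN |
| 29146 | 83717 | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 7.0 | 19.0 | 40.0 | NaN | 2.0 | 2.0 | NaN | NaN | 93.0 | 2.0 |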


### Available Features

In [3]:
all_columns = [
    # Depression screener
    'little_interest_in_doing_things',
    'feeling_down_depressed_hopeless',
    'trouble_falling_or_staying_asleep',
    'feeling_tired_or_having_little_energy',
    'poor_appetitie_or_overeating',
    'feeling_bad_about_yourself',
    'trouble_concentrating',
    'moving_or_speaking_to_slowly_or_fast',
    'thoughts_you_would_be_better_off_dead',
    'difficult_doing_daytoday_tasks',
    # Alcohol & smoking
    'has_smoked_tabacco_last_5days',
    'alcoholic_drinks_past_12mo', 
    'drank_alc',
    'alc_drinking_freq',
    'alc_per_day',
    'times_with_4or5_alc',
    'times_with_8plus_alc',
    'times_with_12plus_alc',
    '4plus_alc_daily',
    'days_4plus_drinks_occasion',
    #Blood Pressure & Cholesterol
    'high_bp',
    'age_hypertension',
    'hypertension_prescription',
    'high_bp_prescription',
    'high_cholesterol',
    'cholesterol_prescription',
    #Cardiovascular Health
    'chest_discomfort',
    # Diet & Nutrition
    'how_healthy_is_your_diet',    
    'count_lost_10plus_pounds',
    'has_tried_to_lose_weight_12mo', 
    'breastfed',
    'milk_consumption_freq',
    'govmnt_meal_delivery',
    'nonhomemade_meals',
    'fastfood_meals',
    'readytoeat_meals',
    'frozen_pizza',
    #Food Security
    'emergency_food_received',
    'food_stamps_used',
    'wic_benefit_used',
    #Hospital Utilization & Access to Care
    'general_health',
    'regular_healthcare_place',
    'time_since_last_healthcare',
    'overnight_in_hospital',
    'seen_mental_health_professional',
    #Health Insurance
    'have_health_insurance',
    'have_private_insurance',
    'plan_cover_prescriptions',
    #Income
    'family_poverty_level',
    'family_poverty_level_category',
    #Medical Conditions
    'asthma',
    'anemia_treatment',
    'blood_transfusion',
    'arthritis',
    'heart_failure',
    'coronary_heart_disease',
    'angina_pectoris',
    'heart_attack',
    'stroke',
    'thyroid_issues',
    'respiratory_issues',
    'abdominal_pain',
    'gallstones',
    'gallbladder_surgery',
    'cancer',
    'dr_recommend_lose_weight',
    'dr_recommend_exercise',
    'dr_recommend_reduce_salt',
    'dr_recommend_reduce_fat',
    'currently_losing_weight',
    'currently_increase_exercise',
    'currently_reducing_salt',
    'currently_reducing_fat',
    'metal_objects',
    #Occupation
    'hours_worked',
    'over_35_hrs_worked',
    'work_schedule',
    #Physical Activity
    'vigorous_work',
    'walk_or_bicycle',
    'vigorous_recreation',
    'moderate_recreation',
    # Physical health & Medical History
    'count_days_seen_doctor_12mo',
    'duration_last_healthcare_visit',        
    'count_days_moderate_recreational_activity',   
    'count_minutes_moderate_recreational_activity',
    'count_minutes_moderate_sedentary_activity',
    'general_health_condition',    
    'has_diabetes',
    'has_overweight_diagnosis',  
    #Reproductive Health
    'regular_periods',
    'age_last_period',
    'try_pregnancy_1yr',
    'see_dr_fertility',
    'pelvic_infection',
    'pregnant_now',
    'pregnancy_count',
    'diabetes_pregnancy',
    'delivery_count',
    'live_birth_count',
    'age_at_first_birth',
    'age_at_last_birth',
    'months_since_birth',
    'horomones_not_bc',
    #Smoking
    'smoked_100_cigs',
    'currently_smoke',
    #Weight History
    'height_in',
    'weight_lbs',
    'attempt_weight_loss_1yr',
    # Demographic data
    'food_security_level_household',   
    'food_security_level_adult',    
    'monthly_poverty_index_category',
    'monthly_poverty_index',
    'count_hours_worked_last_week',
    'age_in_years',   
    'education_level',
    'is_usa_born',    
    'has_health_insurance',
    'has_health_insurance_gap'   
]
len(all_columns)

118

### Function to create test & train dataset

In [63]:
def get_model_data(original_df, 
                   columns, 
                   test_size_prop=0.2,
                   drop_null_rows=False,
                   null_imputer_strategy='median', # mean, median, most_frequent
                   use_value_scaler=True):
    """
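    Function to build feature matrices (x) & target vectors (y) for both train & test.
    """
    
    # add target column (MDD) as the first column
    cols_to_use = columns.copy()
    cols_to_use.insert(0, 'MDD')
    
    # copy so that dropping rows never modifies (or warns about) original_df
    df_to_use = original_df[cols_to_use].copy()
    
    if drop_null_rows:
        df_to_use = df_to_use.dropna()
    
    # Split into features (every column after MDD) & target
    x = df_to_use.iloc[:,1:].values
    y = df_to_use['MDD'].values
    
    # NOTE: the imputer & scaler below are fit on all rows before the split,
    # so test rows influence the imputed values & scaling statistics
    if not drop_null_rows:
        # SimpleImputer() = fill in missing values
        # note the imputer drops any column that has no observed values
        imputer = SimpleImputer(strategy=null_imputer_strategy)  
        x = imputer.fit_transform(x)

    # RobustScaler() = center on the median & scale by the interquartile range,
    # which limits the influence of outliers (it does not remove them)
    if use_value_scaler:
        trans = RobustScaler()
        x = trans.fit_transform(x)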

    x_train, x_test, y_train, y_test = train_test_split(
        x, 
        y, 
        test_size=test_size_prop, 
        random_state=42
    ) 
    
    return x_train, x_test, y_train, y_test
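
`get_model_data` fits the imputer and scaler on the full dataset before splitting. One common alternative, sketched here on synthetic data rather than the survey file, is to split first and fit the preprocessing steps on the training rows only, so the test set stays unseen. The array shapes, the missing-value rate, and the `stratify=y` argument are illustrative assumptions, not part of the original function.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the survey features: 200 rows, 3 columns, ~10% missing.
rng = np.random.default_rng(42)
x = rng.normal(size=(200, 3))
x[rng.random(x.shape) < 0.1] = np.nan
y = np.array([1] * 20 + [0] * 180)  # 10% positive class, as an assumption

# Split first; stratify keeps the positive rate similar in train and test.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42, stratify=y
)

# Fit imputation and scaling on the training rows only, then reuse on test.
preprocess = make_pipeline(SimpleImputer(strategy="median"), RobustScaler())
x_train_prepared = preprocess.fit_transform(x_train)
x_test_prepared = preprocess.transform(x_test)

print(x_train_prepared.shape, x_test_prepared.shape)
```

Results from this variant would not be directly comparable with the tables below, which were produced with the original function.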

### Functions to evaluate performance across different models

In [5]:
def get_performance_df(label_actual, label_pred, model_name):
    """
    Function to calculate performance metrics for model.
    Includes precision, recall, F1, & support.
    """
    # create classification report
    result_table = classification_report(label_actual, label_pred, output_dict=True)
    result_table = pd.DataFrame.from_dict(result_table)

    # store for later
    accuracies = result_table['accuracy']

    # rename grouping
    result_table.columns = [
        'depressed_no',
        'depressed_yes',
        'accuracy',
        'macro_avg',
        'weighted_avg'
    ]

    # create dataframe with 1 row per grouping
    result_table.drop(labels = 'accuracy', axis = 1, inplace=True)
    result_table = result_table.transpose()
    result_table['accuracy'] = list(accuracies)
    result_table = result_table.reset_index()
    result_table.rename(columns = {'index':'grouping'},inplace=True)
    result_table['model'] = model_name
    result_table = result_table[['model','grouping','precision','recall','f1-score','support','accuracy']]
    return result_table

def baseline_models(
    x_train, 
    y_train, 
    x_test, 
    y_test,
    do_smote=False,
    show_confusion_matrix=False,
    show_score_dataframe=False,
    show_all_groupings=False):
    """
    Function that trains and makes predictions using seven baseline classifiers
    (KNN, logistic regression, Bernoulli & Gaussian naive Bayes, random forest,
    decision tree, gradient boosting).
    Meant as a helper function for easier testing of different modeling pipelines.
    """

    # Optionally oversample the minority class with SMOTE (training data only)
    if do_smote:
        sm = SMOTE(random_state=42)
        x_train, y_train = sm.fit_resample(x_train, y_train)

    # K-Nearest Neighbors
    knn = KNeighborsClassifier()
    knn.fit(x_train, y_train)
    pred_labels_knn  = knn.predict(x_test)
    score_knn = get_performance_df(y_test, pred_labels_knn,'Knn')
    
    # Logistic Regression
    lm = LogisticRegression()
    lm.fit(x_train, y_train)
    pred_labels_lr  = lm.predict(x_test)
    score_lr = get_performance_df(y_test, pred_labels_lr,'Logistic Regression')
        
    # Bernoulli Naive Bayes
    bnb = BernoulliNB()
    bnb.fit(x_train, y_train)
    pred_labels_bnb  = bnb.predict(x_test)
    score_bnb = get_performance_df(y_test, pred_labels_bnb,'Bernoulli Naive Bayes')    
        
    # Gaussian Naive Bayes
    gnb = GaussianNB()
    gnb.fit(x_train, y_train)
    pred_labels_gnb  = gnb.predict(x_test)
    score_gnb = get_performance_df(y_test, pred_labels_gnb,'Gaussian Naive Bayes')    

    # Random Forest
    rf = RandomForestClassifier(random_state=0)
    rf.fit(x_train, y_train)
    pred_labels_rf  = rf.predict(x_test)
    predictions_posterior_rf = rf.predict_proba(x_test)
    score_rf = get_performance_df(y_test, pred_labels_rf,'Random Forest')   
    
    #Decision Tree
    dt = DecisionTreeClassifier()
    dt.fit(x_train, y_train)
    pred_labels_dt = dt.predict(x_test)
    score_dt = get_performance_df(y_test, pred_labels_dt,'Decision Tree')

    #Gradient Boosting Classifier
    gb = GradientBoostingClassifier()
    gb.fit(x_train, y_train)
    pred_labels_gb = gb.predict(x_test)
    score_gb = get_performance_df(y_test, pred_labels_gb,'Gradient Boosting Classifier')
    
    # make dataframe with scores
    scores = pd.concat([score_knn, score_lr, score_bnb, score_gnb, score_rf, score_dt, score_gb])
    scores = scores.sort_values(by = 'recall', ascending=False)
    
    if show_score_dataframe:
        # Styler.hide(axis='index') replaces hide_index(), which was removed in pandas 2.0
        display(scores.style.set_table_attributes('style="font-size: 17px"').hide(axis='index'))
    
    if show_confusion_matrix:
        print('\nK-Nearest Neighbors Confusion Matrix')
        plot_confusion_matrix(y_test, pred_labels_knn)
        print('Logistic Regression Confusion Matrix')
        plot_confusion_matrix(y_test, pred_labels_lr)
        print('Bernoulli Naive Bayes Confusion Matrix')
        plot_confusion_matrix(y_test, pred_labels_bnb)
        print('Gaussian Naive Bayes Confusion Matrix')
        plot_confusion_matrix(y_test, pred_labels_gnb)
        print('Random Forest Confusion Matrix')
        plot_confusion_matrix(y_test, pred_labels_rf)
        print('Decision Tree Confusion Matrix')
        plot_confusion_matrix(y_test, pred_labels_dt)
        print('Gradient Boosting Confusion Matrix')
        plot_confusion_matrix(y_test, pred_labels_gb)

    if not show_all_groupings:
        scores = scores[scores['grouping'] == 'macro_avg']

    return scores
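
The seven near-identical fit/predict blocks in `baseline_models` could be collapsed into a loop over a dictionary of estimators. The following is a minimal sketch on synthetic data from `make_classification`, using three of the classifiers and macro-averaged recall only; the dictionary name `classifiers` and the `max_iter=1000` setting are illustrative choices rather than the notebook's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the survey data (about 10% positives).
x, y = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=42
)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42, stratify=y
)

# One entry per model: adding a classifier means adding one line.
classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Gaussian Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

macro_recall = {}
for name, clf in classifiers.items():
    clf.fit(x_train, y_train)
    predictions = clf.predict(x_test)
    macro_recall[name] = recall_score(y_test, predictions, average="macro")

# Report models from highest to lowest macro recall.
for name, score in sorted(macro_recall.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Keeping predictions in a dictionary keyed by model name would also avoid variable-name slips when plotting the confusion matrices.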

## Baseline Model

In [73]:
x_train, x_test, y_train, y_test = get_model_data(
    use_value_scaler=True,
    drop_null_rows=False,
    null_imputer_strategy='median', # mean, median, most_frequent
    original_df = cdc_survey_pmom,
    columns = all_columns[0:10] # just depression screener
)

print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(6192, 10)
(1549, 10)
(6192,)
(1549,)


In [74]:
baseline_model = baseline_models(x_train, y_train, x_test, y_test, do_smote=False)
baseline_model[['model','precision','recall','f1-score']]

| | model | precision | recall | f1-score |
|---|---|---|---|---|
| 2 | Bernoulli Naive Bayes | 0.614506 | 0.675232 | 0.63239 |
| 2 | Gaussian Naive Bayes | 0.618641 | 0.660506 | 0.633699 |
| 2 | Decision Tree | 0.593721 | 0.559726 | 0.56993 |
| 2 | Random Forest | 0.688099 | 0.54384 | 0.555191 |
| 2 | Logistic Regression | 0.685605 | 0.53827 | 0.546401 |
| 2 | Knn | 0.655004 | 0.534046 | 0.539683 |
| 2 | Gradient Boosting Classifier | 0.719752 | 0.519854 | 0.513698 |


## Baseline Model plus permutations

In [78]:
model_performance = pd.DataFrame(columns = [
    'model','grouping','precision','recall','f1-score','support','accuracy',
    'null_imputer_strategy','use_value_scaler_option','drop_null_rows_option','do_smote_option']
)


columns_to_use = all_columns[0:10] # just depression screener     
null_imputer_strategies = ['median','mean','most_frequent']
use_value_scaler_options = [True,False]
drop_null_rows_options = [True,False]
do_smote_options = [True,False]

for null_imputer_strategy in null_imputer_strategies:
    for use_value_scaler_option in use_value_scaler_options:
        for drop_null_rows_option in drop_null_rows_options:
            for do_smote_option in do_smote_options:
                

                x_train, x_test, y_train, y_test = get_model_data(
                    original_df = cdc_survey_pmom,
                    columns = columns_to_use,
                    null_imputer_strategy=null_imputer_strategy,
                    use_value_scaler=use_value_scaler_option,
                    drop_null_rows=drop_null_rows_option
                )
                
                baseline_model = baseline_models(
                    x_train, 
                    y_train, 
                    x_test,
                    y_test, 
                    show_all_groupings=True,
                    do_smote=do_smote_option
                )
                
                baseline_model['null_imputer_strategy'] = null_imputer_strategy
                baseline_model['use_value_scaler_option'] = use_value_scaler_option
                baseline_model['drop_null_rows_option'] = drop_null_rows_option
                baseline_model['do_smote_option'] = do_smote_option
                
                model_performance = pd.concat([model_performance,baseline_model])
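
The four nested loops above can also be written as a single loop over `itertools.product`. Because `get_model_data` skips imputation whenever `drop_null_rows=True`, the imputer strategy has no effect on the data in those runs, so the sketch below (an optional refactor, with the `configs` list as a hypothetical name) keeps only one strategy for that case, reducing 24 runs to 16.

```python
from itertools import product

null_imputer_strategies = ["median", "mean", "most_frequent"]
use_value_scaler_options = [True, False]
drop_null_rows_options = [True, False]
do_smote_options = [True, False]

configs = []
for strategy, use_scaler, drop_nulls, do_smote in product(
    null_imputer_strategies,
    use_value_scaler_options,
    drop_null_rows_options,
    do_smote_options,
):
    # With rows dropped there is nothing left to impute, so the strategy is
    # irrelevant: keep only the first one to avoid duplicate runs.
    if drop_nulls and strategy != null_imputer_strategies[0]:
        continue
    configs.append(
        {
            "null_imputer_strategy": strategy,
            "use_value_scaler": use_scaler,
            "drop_null_rows": drop_nulls,
            "do_smote": do_smote,
        }
    )

print(len(configs), "configurations to evaluate")
```

Each dictionary in `configs` could then be passed to `get_model_data` and `baseline_models`, with the resulting frames collected in a list and concatenated once at the end. Where rows for such duplicate configurations differ in the results, the difference likely comes from estimators created without a fixed `random_state`.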

In [81]:
model_performance

| | model | grouping | precision | recall | f1-score | support | accuracy | null_imputer_strategy | use_value_scaler_option | drop_null_rows_option | do_smote_option |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Gradient Boosting Classifier | depressed_no | 0.908832 | 0.946588 | 0.927326 | 1011.0 | 0.866548 | median | True | True | True |
| 0 | Random Forest | depressed_no | 0.906008 | 0.924827 | 0.915321 | 1011.0 | 0.846085 | median | True | True | True |
| 0 | Decision Tree | depressed_no | 0.912210 | 0.894164 | 0.903097 | 1011.0 | 0.827402 | median | True | True | True |
| 3 | Gradient Boosting Classifier | weighted_avg | 0.841535 | 0.866548 | 0.852675 | 1124.0 | 0.866548 | median | True | True | True |
| 3 | Random Forest | weighted_avg | 0.832407 | 0.846085 | 0.838993 | 1124.0 | 0.846085 | median | True | True | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1 | Decision Tree | depressed_yes | 0.263158 | 0.157233 | 0.196850 | 159.0 | 0.868302 | most_frequent | False | False | False |
| 1 | Random Forest | depressed_yes | 0.441176 | 0.094340 | 0.155440 | 159.0 | 0.894771 | most_frequent | False | False | False |
| 1 | Logistic Regression | depressed_yes | 0.500000 | 0.088050 | 0.149733 | 159.0 | 0.897353 | most_frequent | False | False | False |
| 1 | Knn | depressed_yes | 0.406250 | 0.081761 | 0.136126 | 159.0 | 0.893480 | most_frequent | False | False | False |
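
The `grouping` column mixes per-class rows (`depressed_no`, `depressed_yes`) with two averages. As a rough illustration with made-up confusion-matrix counts (not taken from the survey data), the sketch below shows how a macro average weights both classes equally while a support-weighted average is dominated by the majority class. For recall, the weighted average coincides with accuracy, which is why those two columns match in the `weighted_avg` rows above.

```python
# Made-up confusion-matrix counts for illustration only (not survey results).
true_neg, false_pos = 81, 9   # actual MDD = 0 (support 90)
false_neg, true_pos = 6, 4    # actual MDD = 1 (support 10)

support_no = true_neg + false_pos
support_yes = false_neg + true_pos
total = support_no + support_yes

recall_no = true_neg / support_no     # share of non-depressed respondents identified
recall_yes = true_pos / support_yes   # share of depressed respondents identified

# Macro average: plain mean of the per-class values.
macro_recall = (recall_no + recall_yes) / 2
# Weighted average: per-class values weighted by class support.
weighted_recall = (recall_no * support_no + recall_yes * support_yes) / total
accuracy = (true_neg + true_pos) / total

print(f"macro recall    = {macro_recall:.2f}")
print(f"weighted recall = {weighted_recall:.2f}")
print(f"accuracy        = {accuracy:.2f}")
```

With these counts, macro recall is 0.65 while weighted recall and accuracy are both 0.85, so the macro rows are the more informative ones when the minority class is the class of interest.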


In [84]:
model_performance[model_performance['grouping'] == "macro_avg"].sort_values(by='recall', ascending=False)

| | model | grouping | precision | recall | f1-score | support | accuracy | null_imputer_strategy | use_value_scaler_option | drop_null_rows_option | do_smote_option |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | Logistic Regression | macro_avg | 0.592019 | 0.693521 | 0.597918 | 1549.0 | 0.759845 | median | False | False | True |
| 2 | Logistic Regression | macro_avg | 0.592019 | 0.693521 | 0.597918 | 1549.0 | 0.759845 | most_frequent | False | False | True |
| 2 | Logistic Regression | macro_avg | 0.592019 | 0.693521 | 0.597918 | 1549.0 | 0.759845 | median | True | False | True |
| 2 | Logistic Regression | macro_avg | 0.592019 | 0.693521 | 0.597918 | 1549.0 | 0.759845 | most_frequent | True | False | True |
| 2 | Logistic Regression | macro_avg | 0.589854 | 0.686245 | 0.596152 | 1549.0 | 0.761782 | mean | False | False | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2 | Gradient Boosting Classifier | macro_avg | 0.617339 | 0.510307 | 0.497309 | 1124.0 | 0.896797 | median | False | True | False |
| 2 | Gradient Boosting Classifier | macro_avg | 0.600628 | 0.509812 | 0.496861 | 1124.0 | 0.895907 | mean | False | True | False |
| 2 | Gradient Boosting Classifier | macro_avg | 0.600628 | 0.509812 | 0.496861 | 1124.0 | 0.895907 | most_frequent | True | True | False |
| 2 | Gradient Boosting Classifier | macro_avg | 0.600628 | 0.509812 | 0.496861 | 1124.0 | 0.895907 | median | True | True | False |


In [82]:
alt.Chart(model_performance).mark_point(
).encode(
    x='model:N',
    y='f1-score:Q',
).facet(
    facet='grouping:N',
    columns=2
)

In [85]:
alt.Chart(model_performance).mark_point(
).encode(
    x='model:N',
    y='recall:Q',
).facet(
    facet='grouping:N',
    columns=2
)