# AMS 595 Group Project: Identifying Risk Factors for Major Depressive Disorder

### Abby Bindelglass, Jane Condon, Nicholas Tardugno, Sydney Walters-Diaz

## Data Preparation 

### Importing Necessary Libraries

In [4]:
# Insert libraries here
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.neural_network import MLPClassifier

### Importing the Data

In [199]:
# Reading the dataset into python using pandas
df = pd.read_csv("mhcld_puf_2023.csv", low_memory=False)

In [200]:
# Looking at the first few rows to ensure that the data has been imported correctly
df.head()

Unnamed: 0,YEAR,AGE,EDUC,ETHNIC,RACE,SEX,SPHSERVICE,CMPSERVICE,OPISERVICE,RTCSERVICE,...,ODDFLG,PDDFLG,PERSONFLG,SCHIZOFLG,ALCSUBFLG,OTHERDISFLG,STATEFIP,DIVISION,REGION,CASEID
0,2023,4,-9,3,5,2,2,1,1,2,...,0,0,0,0,0,0,1,6,3,20230000001
1,2023,5,4,4,3,2,1,2,2,2,...,0,0,0,1,0,0,1,6,3,20230000002
2,2023,8,4,4,5,2,1,1,2,2,...,0,0,0,1,0,0,1,6,3,20230000003
3,2023,11,-9,3,5,2,2,1,2,2,...,0,0,0,0,0,1,1,6,3,20230000004
4,2023,3,3,4,5,2,1,1,2,2,...,0,0,0,0,0,0,1,6,3,20230000005


### Data Preprocessing

#### Mapping For Easier Readability

As shown above, the readability of the data is very poor, as is often seen with large survey data. To make the data easier to interpret and understand, we will map our variables to the values/explanations provided in the codebook. We will start with the target variable, "DEPRESSFLG". 

In [204]:
# Target variable: depressive disorder variable map

depressflg_map = {
    1: "Depressive disorder reported",
    0: "Not reported"
}


We can also map the predictor variables into something that is easier to understand. We start with the co-occurring mental health disorder variables. Since the values have the same meaning for each of the variables, (i.e., 0 = disorder reported, 1 = disorder not reported), we can combine them into a single map. 

In [206]:
# Predictor map: co-occurring mental health disorders map

disorder_flags = ["ANXIETYFLG", "ADHDFLG", "CONDUCTFLG", "DELIRDEMFLG", "BIPOLARFLG",
         "ODDFLG", "PDDFLG", "PERSONFLG", "SCHIZOFLG", "OTHERDISFLG",
         "TRAUSTREFLG", "ALCSUBFLG"]

binary_flag_map = {1: "Reported", 0: "Not reported"}

For our next predictor variable, substance use disorder, we have two variables that we may use:

* SAP (binary): Gives a value of 1 if substance use disorder present, 2 if substance use disorder not present, and 0 if the response to this survey question is missing.
* SUB (numeric): Gives a value of 1-13 indicating a client's substance use diagnosis during the reporting period (e.g., 9 = alcohol abuse, 10 = cocaine abuse), or a value of -9 for a missing/invalid diagnosis.

Since the category codes of these variables are different from those of the co-occurring mental health disorder variables and are NOT binary, we must create a separate map. We can use the SAP variable to construct a simple yes/no variable indicating whether or not a survey respondent has been diagnosed with any type of substance use disorder. If we want to look into this further, the SUB variable indicates WHICH substance use disorder an individual has been diagnosed with.

In [208]:
# Predictor map: substance use map (SUB)

sub_use_map = {
    1: "Trauma/stressor disorder",
    2: "Anxiety disorder",
    3: "Attention deficit/hyperactivity disorder (ADHD)",
    4: "Conduct disorder",
    5: "Delirium/dementia disorder",
    6: "Bipolar disorder",
    7: "Depressive disorder",
    8: "Oppositional defiant disorder",
    9: "Pervasive developmental disorder",
    10: "Personality disorder",
    11: "Schizophrenia/psychotic disorder",
    12: "Alcohol or Substance Use Disorder",
    13: "Other disorder",
    -9: "Missing",
    -8: "Not applicable"
}


In [209]:
# Predictor map: substance use map (SAP)
sap_map = {
    1: "Substance use problem reported",
    0: "No substance use problem reported",
    -9: "Missing",
    -8: "Not applicable"
}

# Applying to SAP column:
df["SAP_LABEL"] = df["SAP"].map(sap_map)

# Creating into a simple yes/no binary variable
df["HAS_SAP"] = df["SAP"].isin([1]).astype(int)

Some other useful predictor variables that we can map are age, sex, education status, marital status, residential status, veteran status, employment status, race/ethnicity, and geographic region (in the U.S.). 

In [211]:
# Predictor map: age

age_map = {
    1: "0–11",
    2: "12–14",
    3: "15–17",
    4: "18–20",
    5: "21–24",
    6: "25–29",
    7: "30–34",
    8: "35–39",
    9: "40–44",
    10: "45–49",
    11: "50–54",
    12: "55–59",
    13: "60–64",
    14: "65+",
    -9: "Missing",
    -8: "Not applicable"
}


In [212]:
# Predictor map: sex
sex_map = {
    1: "Male",
    2: "Female",
    -9: "Missing",
    -8: "Not applicable"
}


In [213]:
# Predictor map: education level
educ_map = {
    1: "Special education",
    2: "8 years or less",
    3: "9–11 years",
    4: "12 years or GED",
    5: "13 years or more",
    -9: "Missing",
    -8: "Not applicable"
}


In [214]:
# Predictor map: marital status
marstat_map = {
    1: "Never married",
    2: "Now married",
    3: "Separated",
    4: "Divorced or widowed",
    -9: "Missing",
    -8: "Not applicable"
}


In [215]:
# Predictor map: residential status
livarag_map = {
    1: "Experiencing homelessness",
    2: "Private residence",
    3: "Other",
    -9: "Missing",
    -8: "Not applicable"
}


In [216]:
# Predictor map: veteran status
veteran_map = {
    1: "Veteran",
    2: "Not a veteran",
    -9: "Missing",
    -8: "Not applicable"
}


In [217]:
# Predictor map: employment status
employ_map = {
    1: "Full-time",
    2: "Part-time",
    3: "Employed (not differentiated)",
    4: "Unemployed",
    5: "Not in labor force",
    -9: "Missing",
    -8: "Not applicable"
}

In [218]:
# Predictor map: ethnicity
ethnic_map = {
    1: "Hispanic or Latino",
    2: "Not Hispanic or Latino",
    -9: "Missing",
    -8: "Not applicable"
}

In [219]:
# Predictor map: race
race_map = {
    1: "White",
    2: "Black or African American",
    3: "American Indian or Alaska Native",
    4: "Asian",
    5: "Native Hawaiian or Pacific Islander",
    6: "Multiracial",
    -9: "Missing",
    -8: "Not applicable"
}

In [220]:
region_map = {
    1: "Northeast",
    2: "Midwest",
    3: "South",
    4: "West"
}

#### Creating a Pre-Processing Function

Next, we can create a pre-processing function that:
* Applies every mapping dictionary to the dataframe
* Creates binary flags (simple yes/no variable for appropriate variables)
* Handles missing values (using median/mode imputation)
* Returns a numeric dataframe for modeling, as well as a labeled dataframe that is easier to read/interpret

In [223]:
def preprocess_data(df):

    # Dealing with missing SAMHSA codes
    missing_codes = [-9, -8, -7, -6]
    df = df.replace(missing_codes, np.nan)

    # Dropping unnecessary columns
    predictor_vars = [
        "AGE", "SEX", "EDUC", "MARSTAT", "LIVARAG", "VETERAN",
        "EMPLOY", "ETHNIC", "RACE", "SUB", "SAP",
        "REGION", "DEPRESSFLG"
    ] + disorder_flags

    id_cols = ["CASEID", "STATEFIP"]
    df = df.drop(columns=[col for col in id_cols if col in df.columns])

    # Keep only variables listed in project proposal
    df = df[[col for col in df.columns if col in predictor_vars]]

    # Applying mappings to the dataframe
    df["AGE_LABEL"] = df["AGE"].map(age_map)
    df["SEX_LABEL"]  = df["SEX"].map(sex_map)
    df["EDUC_LABEL"] = df["EDUC"].map(educ_map)
    df["MARSTAT_LABEL"] = df["MARSTAT"].map(marstat_map)
    df["LIVARAG_LABEL"] = df["LIVARAG"].map(livarag_map)
    df["VETERAN_LABEL"] = df["VETERAN"].map(veteran_map)
    df["EMPLOY_LABEL"] = df["EMPLOY"].map(employ_map)
    df["ETHNIC_LABEL"] = df["ETHNIC"].map(ethnic_map)
    df["RACE_LABEL"] = df["RACE"].map(race_map)
    df["REGION_LABEL"] = df["REGION"].map(region_map)

    df["SAP_LABEL"] = df["SAP"].map(sap_map)
    df["DEPRESS_LABEL"] = df["DEPRESSFLG"].map(depressflg_map)

    for f in disorder_flags:
        df[f + "_LABEL"] = df[f].map(binary_flag_map)

    # Creating binary modeling flags (simple yes/no binary variables)
    df["HAS_SUBSTANCE_USE"] = (df["SUB"] == 12).astype(int)
    df["HAS_SAP"] = (df["SAP"] == 1).astype(int)
    df["MDD"] = (df["DEPRESSFLG"] == 1).astype(int)
    df["ANY_OTHER_MH_DISORDER"] = df[disorder_flags].fillna(0).max(axis=1)
    df["IS_VETERAN"] = (df["VETERAN"] == 1).astype(int)
    df["IS_HOMELESS"] = (df["LIVARAG"] == 1).astype(int)
    df["IS_MARRIED"] = (df["MARSTAT"] == 2).astype(int) if "MARSTAT" in df.columns else np.nan

    # Removing raw target before dummy encoding
    df = df.drop(columns=["DEPRESSFLG"], errors="ignore")

    # Using one-hot encoding for categorical variables
    categ_cols = [
        "AGE_LABEL", "SEX_LABEL", "EDUC_LABEL", "MARSTAT_LABEL",
        "LIVARAG_LABEL", "VETERAN_LABEL", "EMPLOY_LABEL",
        "ETHNIC_LABEL", "RACE_LABEL",
        "SAP_LABEL",
        "REGION_LABEL"
    ]

    model_df = pd.get_dummies(df.copy(), columns=categ_cols, drop_first=True)

    # Removing any SUB_LABEL dummies that leaked through copy()
    sub_cols = [c for c in model_df.columns if c.startswith("SUB_LABEL")]
    model_df = model_df.drop(columns=sub_cols, errors="ignore")

    # Dropping label columns
    label_cols = model_df.select_dtypes(include="object").columns
    model_df = model_df.drop(columns=label_cols)

    # Ensuring the DEPRESSFLG is no longer in the predictor matrix
    model_df = model_df.drop(columns=["DEPRESSFLG"], errors="ignore")

    # Handling missing values using imputation 
    num_cols = model_df.select_dtypes(include=["float64", "int64"]).columns
    cat_cols = model_df.select_dtypes(include=["object", "category"]).columns

    # For numerical variables: median imputation
    num_imputer = SimpleImputer(strategy="median")
    model_df[num_cols] = num_imputer.fit_transform(model_df[num_cols])

    # For categorical variables: mode imputation
    if len(cat_cols) > 0:
        cat_imputer = SimpleImputer(strategy="most_frequent")
        model_df[cat_cols] = cat_imputer.fit_transform(model_df[cat_cols])

    return df, model_df



In [224]:
# Calling the function
clean_df, model_df = preprocess_data(df)
model_df.isna().sum().sum() # Ensuring that missing values have been dealt with

0

#### Constructing Target Vector and Predictor Matrix

In [226]:
# Target variable
y = model_df["MDD"]

In [227]:
# Predictors matrix
X = model_df.drop(columns=["MDD", "DEPRESSFLG"], errors="ignore")

## Exploratory Data Analysis

In [229]:
# Sydney's section

## Data Visualization

In [38]:
# Nick's section

## Logistic Regression

In [40]:
# Abby's section

## Predictive Modeling

### Splitting Data Into Training Set and Test Set

In [238]:
model_df = model_df.reset_index(drop=True)

In [239]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,
    stratify=y,
    random_state=42
)

### Random Forest Model

In [45]:
# Defining random forest model
rf = RandomForestClassifier(
    n_estimators=200,
    max_depth=15,
    max_features="sqrt",
    n_jobs=-1,
    class_weight="balanced",
    random_state=42
)


In [46]:
# Fitting random forest model
rf.fit(X_train, y_train)

In [47]:
# Making predictions on y 
y_pred = rf.predict(X_test)
y_prob = rf.predict_proba(X_test)[:, 1]

In [48]:
print("RF AUC:", roc_auc_score(y_test, y_prob))
print(classification_report(y_test, y_pred))

RF AUC: 0.9949617723881715
              precision    recall  f1-score   support

           0       0.99      0.94      0.96   1285911
           1       0.86      0.97      0.91    473000

    accuracy                           0.95   1758911
   macro avg       0.92      0.96      0.94   1758911
weighted avg       0.95      0.95      0.95   1758911



### Stochastic Gradient Descent (SGD) Model

Since we are unable to use a support vector machine model with data of this size, we implement a stochastic gradient descent model instead.

In [271]:
# Scaling the data and fitting the model
sgd = Pipeline([
    ("scaler", StandardScaler(with_mean=False)), 
    ("sgd", SGDClassifier(
        loss="log_loss",
        max_iter=1000,
        tol=1e-3,
        random_state=42
    ))
])

In [273]:
# Fitting the model
sgd.fit(X_train, y_train)

In [274]:
# Making predictions on y
y_pred = sgd.predict(X_test)

In [275]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.82      0.87      0.84   1285911
           1       0.57      0.48      0.52    473000

    accuracy                           0.76   1758911
   macro avg       0.70      0.67      0.68   1758911
weighted avg       0.75      0.76      0.76   1758911



### Neural Network Model