# A workflow for exploring TEDS-A

### A final project to fulfill the requirements of "Applied Data Science for Practitioners" at 
#### Washington University in St. Louis

### Author: Mason T. Breitzig

##### Instructor: Asim Banskota

##### Date: 4/27/20

## Overview of the workflow and project

### Purpose of this workflow

Many national datasets are updated annually with survey results from the previous year. The "Treatment Episode Data Set: Admissions" (TEDS-A) is one such example. This survey is updated annually to provide data on admissions to substance abuse treatment facilities. It is backed by state laws that require the collection and reporting of data from all publicly funded facilities. It also includes some data from privately funded facilities.

Although the yearly data is sometimes merged with previous years, there is often a delay in the preparation of the combined datasets. The merged datasets also include all variables from all years, which can result in unwieldy file sizes. Rather than wait for the data to be merged by the managing organization, and to avoid battling with data wrangling as much as possible, this workflow automates the process in order to produce a curated dataset and associated models based on selected variables and informed user input (business objective).

More specifically, this workflow is tailored to the TEDS-A data and will run through a standard selection of data science processes as follows (outline of steps): 

1. Merge previous years with the most current data release, 
2. Select desired variables and drop all others, 
3. Conduct standard pre-processing procedures (including imputation and feature engineering), 
4. Generate descriptive and visual analyses of the data (exploratory data analysis), 
5. Conduct modelling, 
6. Evaluation and model selection, and 
7. Export (and deploy) the data and models. 

These objectives will be accomplished mostly from a data science point of view rather than a statistical one. Although some model assumptions and diagnostics will be observed, the overall purpose is to predict an outcome with the greatest accuracy achievable while still maintaining a research point of view regarding variable parsimony.

Furthermore, this workflow has two critical assumptions of its own: 1) the TEDS-A dataset(s) are based on surveys that are generally the same with few changes year-to-year (note that diagnosis criteria may change); 2) the user has a basic understanding of the dataset from the available documentation and has a pre-determined question. The user will be required to supply the separate datasets for merging as well as the variables of interest. For further simplification, the user will also be queried for input on data wrangling and other steps. The result will be a curated dataset and models tailored to the needs of the user.

### Example use of this workflow

In order to demonstrate the power and utility of this workflow, I have selected a particular research question that I would like to explore in the TEDS-A dataset (research/academic objective). I am aware that the TEDS-A dataset includes information on patients' mental health status based on diagnosis using the Diagnostic and Statistical Manual of Mental Disorders (DSM). Accordingly, I would like to determine whether the variables in this dataset, and particularly patterns and types of drugs used and arrest, can accurately predict certain mental health diagnoses. Building models to help facilitate medical diagnosis, or at least the appropriation of medical services, is an important task for further improving our ability to manage the well-being of a growing population (Cho, Yim, Choi, Ko, & Lee, 2019).

In [None]:
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline

# Set directory
PROJ_DIR = r"C:\Users\mbrei\Desktop\Example_data"
DATA_DIR = os.path.join(PROJ_DIR, 'data')

# Set directory of the source code
import sys
sys.path.append(os.path.join(PROJ_DIR, 'src'))

## Step 1 - Data Import & Merging

### Data import and merging overview:

The purpose of this step is to allow the user to import multiple years of TEDS-A data and merge them into a single dataset. This is useful for conducting specific multi-year analyses. However, this step assumes that the user is aware of any differences in the data across years. 

For example, in 2010 the survey might have requested data on whether or not an individual used psychedelics; whereas in 2016, psychedelics may be broken down into different types of drugs (e.g. psilocybin, LSD, peyote, etc.) to obtain more nuanced data. This type of inconsistency (i.e. gaps in the data across years) can quickly result in the inability to conduct meaningful analyses. Although the workflow incorporates a brief reminder, it largely assumes the user is attempting to merge datasets that have at least some non-demographic variables in common.
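One way to spot this kind of inconsistency before merging is to compare the column names of each year's file. The sketch below is illustrative only: the helper name `shared_columns`, the toy data frames, and the column names `PSYCHEDELICS` and `LSD` are assumptions made for the example and are not part of `teds_script` or the TEDS-A codebook.

```python
import pandas as pd


def shared_columns(frames):
    """Return the column names present in every data frame, in the order of the first frame."""
    common = set(frames[0].columns)
    for frame in frames[1:]:
        common &= set(frame.columns)
    return [col for col in frames[0].columns if col in common]


# Toy stand-ins for two survey years (hypothetical columns and values)
year_a = pd.DataFrame({"CASEID": [1, 2], "AGE": [3, 5], "PSYCHEDELICS": [0, 1]})
year_b = pd.DataFrame({"CASEID": [3, 4], "AGE": [4, 6], "LSD": [1, 0]})

common_cols = shared_columns([year_a, year_b])
print(common_cols)
```

Columns that drop out of the shared list are the ones that would contain gaps after merging.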

### Example for Step 1

For demonstration of this workflow, we will attempt to import and merge TEDS-A data from 2015, 2016, and 2017 (2017 is the most current data available publicly). For simplicity's sake, the workflow currently allows a maximum of five separate data files to be merged. We will be using three separate data files since years 2015-2017 of TEDS-A collected data on the same variables. This will also demonstrate how to merge fewer than five datasets.

All of the data is publicly available and can be downloaded from: https://www.datafiles.samhsa.gov/study-series/treatment-episode-data-set-admissions-teds-nid13518

Three randomly sampled and downsized versions of the 2015, 2016, and 2017 datasets are also available in the github repository for this workflow (code for the sampling/truncation also available in a separate notebook): https://github.com/mbreitzig/ADS_2020

### Import and merge the data

Users are asked to create a local directory and store all relevant TEDS-A .csv files in the new directory. The directory can then be entered and will be used by the workflow to import all TEDS-A data files and merge them. Since TEDS-A is a dataset with strict criteria, it should always be in a relatively standard format with a unique ID for each observation. The merging process assumes that the user is importing and merging TEDS-A data. Attempting to import other (incompatible) data might cause an error or result in messy/unusable data.

Functionality for importing TEDS-A data via URL could be added in the future. However, this poses a challenge as the data usually requires a user agreement. It is also a somewhat superfluous functionality given that the data would still need to be downloaded into a directory prior to import. Therefore, the following code only supports importing from a local folder.
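The general idea of importing from a local folder can be sketched as follows. This is a minimal illustration and not the actual `import_data` implementation: the helper name `merge_csv_folder`, the file names, and the choice to stack the yearly files row-wise are all assumptions.

```python
import glob
import os
import tempfile

import pandas as pd


def merge_csv_folder(data_dir, max_files=5):
    """Read every .csv file in data_dir and stack them row-wise (hypothetical helper)."""
    paths = sorted(glob.glob(os.path.join(data_dir, "*.csv")))
    if not paths:
        raise FileNotFoundError(f"No .csv files found in {data_dir}")
    if len(paths) > max_files:
        raise ValueError(f"Found {len(paths)} files; at most {max_files} are supported")
    frames = [pd.read_csv(path) for path in paths]
    # Assumption: each row is one observation, so years are appended rather than joined
    return pd.concat(frames, ignore_index=True, sort=False)


# Demonstration with two tiny made-up files written to a temporary folder
with tempfile.TemporaryDirectory() as tmp_dir:
    pd.DataFrame({"CASEID": [1, 2], "ADMYR": [2015, 2015]}).to_csv(
        os.path.join(tmp_dir, "year_2015.csv"), index=False)
    pd.DataFrame({"CASEID": [3], "ADMYR": [2016]}).to_csv(
        os.path.join(tmp_dir, "year_2016.csv"), index=False)
    merged_demo = merge_csv_folder(tmp_dir)

print(merged_demo.shape)
```

Raising an error when more than five files are found mirrors the five-file limit described above.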

For the example, please copy-paste and enter the following data path: 

In [None]:
# Grab the data and merge
from teds_script import import_data
TEDS_A_merged = import_data()

In [None]:
TEDS_A_merged

## Step 2 - Exploratory Data Analysis (EDA)

### EDA overview:

The purpose of this step is to take a first look at the data. Now that the data have been merged we can begin to look for important associations. This will inform which variables we wish to keep going forward (if not all). The workflow will automatically generate some plots for the variables in the data. However, the user can also generate additional plots to explore their own interests in the data.

For this example, we will be most interested in plotting the relation between some drug variables and mental health diagnoses as well as the distribution of mental health diagnoses across certain sociodemographic variables.
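For readers who want to build such a table by hand, the sketch below shows a cross-tabulation of a target against one predictor using `pandas.crosstab`. The data values and the target labels "A" and "B" are made up for illustration; this is not the `crosstabs` function from `teds_script`.

```python
import pandas as pd

# Made-up records; real TEDS-A values are numeric codes defined in the codebook
toy = pd.DataFrame({
    "DSMCRIT": ["A", "A", "B", "B", "B", "A"],
    "GENDER": [1, 2, 1, 1, 2, 2],
})

# Counts of each target level within each predictor level
counts = pd.crosstab(toy["GENDER"], toy["DSMCRIT"])

# Row-wise proportions make groups of different sizes comparable
row_props = pd.crosstab(toy["GENDER"], toy["DSMCRIT"], normalize="index")

print(counts)
print(row_props.round(2))
```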

1. Please copy-paste and enter: DSMCRIT
2. Please copy-paste and enter: EDUC GENDER AGE RACE SUB1 ARRESTS

In [None]:
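# Grab the EDA function
from teds_script import EDA
user_XY_list, info1, info2, TEDS_Ana, user_target, user_X_list = EDA(TEDS_A_merged)

# Cleanly print the info (display() is needed because these are not the last expression in the cell)
from IPython.display import display
display(pd.DataFrame(info1))
display(pd.DataFrame(info2))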

# Generate the plots
for i in user_XY_list:
    sns.countplot(y=TEDS_Ana[i], data=TEDS_Ana, color="c")
    plt.show()

In [None]:
# Grab the crosstabs function
from teds_script import crosstabs

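# Cleanly print the crosstabs (display() is needed to show output from inside a loop)
from IPython.display import display
crosstabs_out = crosstabs(TEDS_Ana, user_target, user_X_list)
for ctabs in crosstabs_out:
    display(crosstabs_out[ctabs])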

In [None]:
# Grab the visual crosstabs function
from teds_script import vis_crosstabs

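# Cleanly print the visual crosstabs (display() is needed to show output from inside a loop)
from IPython.display import display
vis_crosstabs_out = vis_crosstabs(TEDS_Ana, user_target, user_X_list)
for vctabs in vis_crosstabs_out:
    display(vis_crosstabs_out[vctabs])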

## Step 3 - Data Truncation / Selection

### Data truncation / selection overview:

The purpose of this step is to allow the user to drop variables they deem unnecessary for the aims of their analysis. This is a common step in research-focused analyses. However, it is less common in data science, which typically aims to build the most accurate predictive model (involving all available data).

### Example for Step 3

To demonstrate the utility of this workflow and keep the analysis simple, we will drop several variables. The user can specify as many variables as desired to be kept during this step. Here we will keep only the following variables to determine the relation between drug use and mental health on admission to a treatment facility:

1) Please copy and paste: ADMYR CASEID AGE EDUC ALCFLG RACE ETHNIC GENDER SUB1 SERVICES DSMCRIT ARRESTS

With greater processing power and time, it is possible to analyze all of the variables in the TEDS-A dataset. This selection of variables was made due to the interest in predicting mental health diagnosis based on the primary substance and some other select factors. It is in the best interest to try and predict the outcome (diagnosis) with as little information as possible since a patient's record may not always be complete and/or they may not be able/willing to provide certain information.
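Conceptually, keeping a user-specified set of variables amounts to splitting the typed string and indexing the data frame. The sketch below assumes a hypothetical helper `keep_columns` and toy data; it is not the `drop_data` function itself.

```python
import pandas as pd


def keep_columns(df, user_input):
    """Keep only the space-separated column names in user_input (hypothetical helper)."""
    wanted = user_input.split()
    missing = [col for col in wanted if col not in df.columns]
    if missing:
        raise KeyError(f"Columns not found in the data: {missing}")
    return df[wanted].copy()


# Toy data with made-up values
toy = pd.DataFrame({"CASEID": [1, 2], "AGE": [4, 7], "SUB1": [2, 5], "STFIPS": [29, 17]})
subset = keep_columns(toy, "CASEID AGE SUB1")
print(subset.columns.tolist())
```

Checking for misspelled names before indexing gives the user a clearer message than the default pandas error.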

In [None]:
# Grab the drop function
from teds_script import drop_data

# Run the drop function
TEDS_A = drop_data(TEDS_A_merged)

In [None]:
TEDS_A

## Step 4 - Data Pre-processing

### Data pre-processing overview:

The purpose of this step is to prepare the data for analysis. Pre-processing is a common step that involves recoding of values, data type reassignment (e.g. integer to string), imputation, feature engineering, and a range of other actions that occur before modelling.

This step will enable the user to tailor the data to their needs and conduct some standard preprocessing procedures.

To demonstrate the functionality of this step, we will walk through each substep with the objective (research question) of predicting mental health diagnosis.

### Step 4.1 - Target Selection & Multiple Imputation

The purpose of this step is to impute missing values. This can be done using a range of methods. Since the data in the TEDS-A datasets are dependent on patient/hospital records, it is possible that some missing values are missing completely at random (MCAR). However, self-report and interviewer biases are also very likely in this type of data. Therefore, this workflow will treat all missing values as missing not at random (MNAR).

For this type of missing data, it is necessary to use methods that are more advanced than replacement with a measure of central tendency (e.g. the mean). This workflow will utilize 'IterativeImputer.' This experimental imputer models each variable with missing observations as a function of the other variables until all selected missing values have been imputed.

More info on this method can be found at: 
https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html

This step is conducted before recoding since categorical variables can be difficult to handle alongside numeric variables when conducting multiple imputation. Additionally, up to this point the outcome has been ambiguous. During this step, the user will be asked to select a target variable. This will later facilitate splitting the data into training and testing sets (e.g. 80% to 20%, respectively; the exact split is specified by the user in Step 5.1).

This step also converts the non-standard missing value of -9 to NaN. At this time, TEDS-A codes all missing observations as -9. Therefore this code automatically imports the data recognizing -9 as NaN. It is easier to handle this pre-processing step here than later in the workflow.
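The two operations described above, converting -9 to NaN and then imputing, can be sketched on toy data as follows. The values are made up, and rounding the imputed values back to whole numbers is an assumption of this example rather than a documented behaviour of the `imputation` function. Note also that, with its default settings, scikit-learn's `IterativeImputer` returns a single completed dataset.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Made-up coded data in which -9 marks a missing value
toy = pd.DataFrame({
    "AGE": [3, 5, -9, 7, 4, 6],
    "EDUC": [2, 3, 3, -9, 2, 4],
    "ARRESTS": [0, 1, 0, 2, -9, 1],
})

# Step 1: convert the sentinel to NaN so the imputer can recognise it
toy_nan = toy.replace(-9, np.nan)

# Step 2: model each incomplete column from the other columns
imputer = IterativeImputer(max_iter=10, random_state=0)
imputed = pd.DataFrame(imputer.fit_transform(toy_nan), columns=toy.columns)

# Step 3 (assumption): round back to whole numbers because the values are category codes
imputed = imputed.round().astype(int)
print(imputed)
```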

In [None]:
TEDS_A.isna().sum()

In [None]:
# Grab the imputation function
from teds_script import imputation

# Multiple imputation using Iterative Imputer
TEDS_A_Imputed = imputation(TEDS_A)

In [None]:
TEDS_A_Imputed.isna().sum()

### Step 4.2 - Feature Engineering

The purpose of this step is to generate some features for unique analyses. Note that this step occurs before the official "data cleaning step" since the TEDS-A data is stored in a cleaned format that is appropriate for generation of polynomial features (or in this case interaction terms for categorical variables).

In the field of public health, and especially mental health, it is common for sociodemographic variables to demonstrate meaningful interactions. For example, someone who is a White male may face fewer health complications compared to a Black male, but especially fewer than a Black female (Silva, Loureiro, & Cardoso, 2016). This indicates that interaction terms among sociodemographic variables may reveal additional details within the data. It is also possible that interactions of sociodemographics with other variables in the data may reveal hidden interactions that explain more nuance of the data.

Specifically, in this example we will use sklearn to generate 2-degree polynomial features for the variables selected by the user. Allowing the user to specify which features to engineer is essential for both data science based objectives (e.g. highest accuracy prediction) and public health research (parsimonious consideration of interaction terms). 
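A minimal sketch of generating pairwise interaction terms with scikit-learn's `PolynomialFeatures` is shown below. The toy values and the `_x_` naming scheme are assumptions for illustration, and setting `interaction_only=True` is one possible configuration; the `user_interactions` function may be set up differently.

```python
from itertools import combinations

import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Toy coded data with made-up values
toy = pd.DataFrame({"EDUC": [1, 2, 3], "ARRESTS": [0, 1, 2], "GENDER": [1, 2, 1]})
selected = ["EDUC", "ARRESTS", "GENDER"]

# interaction_only=True keeps pairwise products and drops squared terms
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
values = poly.fit_transform(toy[selected])

# Output order: the original columns, then each pair in combinations() order
names = selected + [f"{a}_x_{b}" for a, b in combinations(selected, 2)]
interactions = pd.DataFrame(values, columns=names, index=toy.index)
print(interactions)
```

Restricting the input to user-selected columns keeps the number of generated terms manageable, since the count of pairs grows quadratically with the number of variables.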

For the example, we will enter the sociodemographic variables as well as the ARRESTS variable. This is based on the hypothesis that the patterns of arrest may be different between individuals of varying races, ethnicities, education levels, etc. and that this may impact the distributions of mental health (Silva, Loureiro, & Cardoso, 2016).

1) Please copy and paste: EDUC ARRESTS GENDER RACE

In [None]:
# Grab the feature engineering function
from teds_script import user_interactions

# Run the feature engineering function
TEDS_Af = user_interactions(TEDS_A_Imputed)

In [None]:
# Display the current data with polynomial features
TEDS_Af

### Step 4.3 - Data Cleaning

The purpose of this step is to ensure that the variables and their values are in the proper format. For the TEDS-A dataset, this means we need to check the variable types and recode any leveled (e.g. factor) variables with their proper categorical label. This can be handled without user input since the TEDS-A data rarely changes variables and/or data format. It is reasonable for this program to be updated to handle changes in the TEDS-A data given the low frequency of change.

Therefore, this iteration of the program will focus on properly pre-processing the TEDS-A data based on the codebook from years 2015-2017. Note that the pre-processing accuracy is liable to be affected by changes/variations in the TEDS-A data format in future years. For the time being, this workflow will also not handle any variables that were present in the years before 2015, but not included in 2015-2017.

This step will also give meaningful labels to the data for better understanding.

#### Note:

One issue with the TEDS-A dataset is that all variables are coded to be numeric when in fact many of them are actually categorical. This necessitates a substantial amount of recoding in order to be logical. This recoding process will be executed automatically by the workflow for variables present in the 2015-2017 datasets (based on the codebook).

Furthermore, not all variables that should be recoded currently have code implemented for recoding. Future iterations of this workflow could include additional recoding procedures.
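The recoding idea can be illustrated with a single variable. The label map below is an example only and should be checked against the TEDS-A codebook for the relevant years; it is not taken from `var_recode`.

```python
import pandas as pd

# Illustrative label map only; the authoritative codes are in the TEDS-A codebook
gender_labels = {1: "Male", 2: "Female"}

toy = pd.DataFrame({"GENDER": [1, 2, 2, 1]})

# Map numeric codes to labels and store the result as a categorical column
toy["GENDER"] = toy["GENDER"].map(gender_labels).astype("category")
print(toy["GENDER"].value_counts())
```

Any code that is absent from the map becomes NaN after `map`, which makes unexpected codes easy to detect.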

In [None]:
# Grab the data cleaning function
from teds_script import var_recode

# Run the data cleaning function
TEDS_A_Imputed, col_list_cat, col_list_ord, col_list_cat_f = var_recode(TEDS_A_Imputed, user_target, TEDS_Af)

In [None]:
TEDS_A_Imputed

### Step 4.4 - Feature Engineering Part 2

This step serves as an optional step to collapse categories of the target variable. This is often a crucial step for simplifying target variables. Here we will collapse the target variable DSMCRIT into three categories:

1) Please copy and paste: Reference
2) Please copy and paste: Alcohol abuse
3) Please copy and paste: Anxiety disorders

This will set 17 out of the 19 categories of the target as the "other" or "reference" category. The user can select any desired combination of three categories.
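Collapsing levels in this way can be sketched with `Series.where`. The helper name `collapse_levels` and the toy labels are assumptions for illustration and do not reproduce `combine_levels`.

```python
import pandas as pd


def collapse_levels(series, keep, other_label="Reference"):
    """Keep the levels listed in keep and pool every other level under other_label."""
    return series.where(series.isin(keep), other_label)


# Made-up diagnosis labels for illustration
toy = pd.Series(["Alcohol abuse", "Anxiety disorders", "Opioid dependence",
                 "Alcohol abuse", "Cannabis abuse"])
collapsed = collapse_levels(toy, keep=["Alcohol abuse", "Anxiety disorders"])
print(collapsed.value_counts())
```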

In [None]:
# Grab the second feature engineering function
from teds_script import combine_levels

# Run the second feature engineering function
TEDS_A_Imputed, user_combined1, user_combined2, user_combined3 = combine_levels(user_target, TEDS_A_Imputed)

In [None]:
TEDS_A_Imputed

### Step 4.5 - Data Cleaning Part 2

This step serves to conduct one-hot encoding and ordinal encoding for the variables in the data. Since TEDS-A is fairly standard, as mentioned previously, the admission year, age, education, and arrests will be coded as ordinal while all other variables will be treated as categorical. The case ID variable will be excluded from recoding. This step essentially is the final preparation for modeling.
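The difference between the two encodings is shown below on toy data. The level names are made up, and the combination of `OrdinalEncoder` with `pandas.get_dummies` is one possible approach rather than a description of `cleaning_pt2`.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

toy = pd.DataFrame({
    "EDUC": ["Low", "High", "Medium", "Low"],   # ordered levels (made-up labels)
    "RACE": ["A", "B", "A", "C"],               # unordered levels (made-up labels)
})

# Ordinal encoding: the category order is stated explicitly
ordinal = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
toy["EDUC"] = ordinal.fit_transform(toy[["EDUC"]]).ravel().astype(int)

# One-hot encoding: one indicator column per unordered level
encoded = pd.get_dummies(toy, columns=["RACE"], dtype=int)
print(encoded)
```

Ordinal encoding preserves the ranking of levels in a single column, whereas one-hot encoding avoids implying any order among unordered levels.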

In [None]:
# Grab the second data cleaning function
from teds_script import cleaning_pt2

# Run the second data cleaning function
TEDS_Aicf = cleaning_pt2(TEDS_A_Imputed, TEDS_Af, col_list_cat, col_list_cat_f, col_list_ord)

In [None]:
TEDS_Aicf

### Step 4.6 - Standardization

It is common for studies/analyses to standardize or conduct other transformations on numeric data. These types of actions can make it easier to visualize associations or compare different data. However, this step requires that the data be essentially continuous. Although several of the variables in TEDS-A may appear to be continuous variables, they are in fact categorical. Several of the variables have a large number of categories, but are still considered ordinal data. For example, age may be continuous in some data sets, but in TEDS-A it is a series of several categories with a meaningful order.

Given that there is no indication of the format of TEDS-A changing to include continuous variables, this step will be skipped for now. If it is needed in the future, it would be implemented by allowing the user to plot continuous variables and then decide whether to transform them with a selection of functions such as the 'MinMaxScaler' from sklearn.

## Step 5 - Modelling

### Modelling overview:

The purpose of this step is to take the curated data and generate models for the user's target variable. Since all of the variables in the TEDS-A data are categorical or ordinal, the selection of model options is simplified. 

In this example we will generate a baseline model using the default settings of a random forest classifier from sklearn. We will then test a range of hyperparameters for tuning using the randomized search method. Based on this output, a new model will be generated using the optimal hyperparameters.

Finally, the models will be examined (compared) and explored.

### Step 5.1 Train/Test Splitting

We will first split the data. The workflow allows the user to specify the percentage of splitting. It is recommended that the test data set size be between 20 and 40 percent.

1): Please copy and paste: 25

This will specify the percentage of the testing data.
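Converting a user-entered percentage into a split can be sketched with scikit-learn's `train_test_split`. The synthetic data and the use of `stratify` are assumptions of this example; `user_split` may behave differently.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, two predictors, one three-level target
rng = np.random.RandomState(0)
X = pd.DataFrame({"x1": rng.randint(0, 5, 100), "x2": rng.randint(0, 3, 100)})
y = pd.Series(np.tile([0, 1, 2, 0], 25), name="target")

test_percent = 25  # value typed by the user
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=test_percent / 100, stratify=y, random_state=0)

print(len(X_train), len(X_test))
```

Stratifying on the target keeps the class proportions similar in both sets, which can matter when one category is much larger than the others.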

In [None]:
# Grab the train/test split function
from teds_script import user_split

# Run the train/test split function
TEDS_X_train, TEDS_X_test, TEDS_Y_train, TEDS_Y_test, y_list = user_split(TEDS_Aicf, user_target)

### Step 5.2 Model Building & Hyperparameter Tuning

As mentioned, this step tunes the hyperparameters of the random forest classifier. For this workflow, we tune the n_estimators, the max_depth, and the min_samples_leaf. Once the tuning is complete, we re-run the model with the revised hyperparameters.
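The tuning procedure can be sketched with `RandomizedSearchCV` on synthetic data. The candidate values, the number of iterations, and the number of folds are illustrative assumptions and are not the settings used inside `model_building`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Made-up training data: 120 rows, 4 coded predictors, 3 target classes
rng = np.random.RandomState(0)
X_train = rng.randint(0, 4, size=(120, 4))
y_train = rng.randint(0, 3, size=120)

# Baseline model with default settings
baseline = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Candidate values for the three tuned hyperparameters (illustrative ranges)
param_distributions = {
    "n_estimators": [20, 50, 100],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,          # number of random combinations to try
    cv=3,              # 3-fold cross-validation on the training data
    scoring="accuracy",
    random_state=0,
)
search.fit(X_train, y_train)

# With the default refit=True, the best combination is refit on all training data
tuned = search.best_estimator_
print(search.best_params_)
```

A randomized search evaluates only a sample of the possible combinations, which trades some thoroughness for a shorter run time.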

In [None]:
# Grab the model building and hyperparameter tuning function
from teds_script import model_building

# Run the model building and hyperparameter tuning function
Y_pred, Y_pred_randomsearch, best_grid_randomsearch = model_building(TEDS_X_train, TEDS_Y_train, TEDS_X_test)

## Step 6 - Evaluation & Model Selection

### Evaluation overview:

Evaluation of the models is a critical step. It allows us to decide whether we have achieved our objective. For this workflow, a model that is highly predictive may have some applicability to the medical field (with substantial testing/vetting). 

Here we evaluate the baseline model and the model with tuned hyperparameters by comparing the mean absolute error and the accuracy score. We would expect the accuracy score of the tuned model to be higher than the baseline model, but sometimes the difference is very small, especially when the settings for tuning are not extensive (as in this case to save processing time).

As part of this evaluation, we visually explore the model's predictive capabilities by using a visual confusion matrix. As we can see, the model does an excellent job of classifying the "reference" category. This is expected since the number of observations in this aggregate category is far greater than the other two we selected. However, the model still does a reasonable job of identifying the others. Providing more data and conducting a more rigorous hyperparameter tuning process would likely improve the capabilities of this model.
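The sketch below shows how accuracy and a confusion matrix relate on a small made-up set of predictions. The labels and values are invented for illustration and are not results from this workflow. It also computes per-class recall, which can reveal weak performance on small classes that overall accuracy hides.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

labels = ["Reference", "Alcohol abuse", "Anxiety disorders"]

# Made-up true and predicted labels for eight test cases
y_true = ["Reference", "Reference", "Reference", "Reference",
          "Alcohol abuse", "Alcohol abuse", "Anxiety disorders", "Anxiety disorders"]
y_pred = ["Reference", "Reference", "Reference", "Alcohol abuse",
          "Alcohol abuse", "Reference", "Anxiety disorders", "Reference"]

accuracy = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes, both in the order of labels
matrix = confusion_matrix(y_true, y_pred, labels=labels)

# Per-class recall: correct predictions divided by the true count of each class
recall = matrix.diagonal() / matrix.sum(axis=1)

print(accuracy)
print(matrix)
print(recall)
```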

In [None]:
# Grab the evaluation function
from teds_script import model_eval

# Run the evaluation function
model_eval(TEDS_Y_test, Y_pred, Y_pred_randomsearch)

In [None]:
# Grab the confusion matrix function
from teds_script import model_confusion_matrix

# Run the confusion matrix function
model_confusion_matrix(user_combined1, user_combined2, user_combined3, 
                           TEDS_X_test, TEDS_Y_test, best_grid_randomsearch, TEDS_X_train, TEDS_Y_train)

## Step 7 - Export the data and models

### Exporting overview:

Finally, once we have completed the workflow, it is also important to export the end products. This workflow exports the model and cleaned data as pickles.
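A pickle round trip can be sketched as follows. The stand-in objects and file names are assumptions for illustration; `export_fun` may use different names and locations. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.

```python
import os
import pickle
import tempfile

import pandas as pd

# Stand-ins for the tuned model and the cleaned data
model_stub = {"n_estimators": 100, "max_depth": 5}
data_stub = pd.DataFrame({"CASEID": [1, 2], "AGE": [4, 7]})

with tempfile.TemporaryDirectory() as out_dir:
    model_path = os.path.join(out_dir, "model.pkl")
    data_path = os.path.join(out_dir, "data.pkl")

    # Write both objects to disk
    with open(model_path, "wb") as handle:
        pickle.dump(model_stub, handle)
    data_stub.to_pickle(data_path)

    # Read them back to confirm the round trip
    with open(model_path, "rb") as handle:
        model_loaded = pickle.load(handle)
    data_loaded = pd.read_pickle(data_path)

print(model_loaded == model_stub, data_loaded.equals(data_stub))
```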

In [None]:
# Grab the export function
from teds_script import export_fun

# Run the export function
export_fun(best_grid_randomsearch, TEDS_Aicf)

## Notes on limitations & future expansions

There are several limitations for this workflow that could be improved in the future. The first limitation is that the data can only be imported from the beginning "unclean" state, that is, straight from the TEDS-A repository. It would be better if the user could import a prepared dataset or a pickle of a dataset previously passed through the workflow. 

It would also be beneficial if the user could collapse the target into more than three categories and could select from a range of modeling methods. These limitations, and others, could be addressed by adding additional functionality and decision trees for the workflow.

For the final step, it would also be useful to export the model and the data as pickle files. The code for this has been written but not implemented in order to reduce unnecessary output.

## Citations

Cho, G., Yim, J., Choi, Y., Ko, J., & Lee, S. H. (2019, April 1). Review of machine learning algorithms for diagnosing mental 
    illness. Psychiatry Investigation. Korean Neuropsychiatric Association. https://doi.org/10.30773/pi.2018.12.21.2

scikit-learn.org. (2020). scikit-learn: machine learning in Python - scikit-learn 0.22.2 documentation. 
    Retrieved April 29, 2020, from https://scikit-learn.org/stable/

Silva, M., Loureiro, A., & Cardoso, G. (2016). Social determinants of mental health: A review of the evidence. 
    European Journal of Psychiatry. 
    Retrieved from http://scielo.isciii.es/scielo.php?script=sci_arttext&pid=S0213-61632016000400004