# Project Presentation

## Aim of the Project
As a group, we plan to develop a platform that identifies mental health disorders from symptoms provided by the patient. The platform will help classify mental disorders and support clinicians in diagnosing them. It may also raise awareness and encourage people to seek help. We will work with a large dataset with multiple variables, which will be evaluated by a machine learning algorithm.

## Project Deliverables
We aim to create a machine learning application that helps diagnose mental disorders. We will provide a machine learning model that acts as an indicator or classifier of a person's mental health state, in order to support early detection. We will analyze the given data and deliver code that presents the results.

If the project is a "full success", our model will be able to detect a mental disorder from several features or states of a person. Since we plan to deliver code that works on a dataset, a Jupyter Notebook will be an appropriate interface.

If the project is only a "partial success", we will at least be able to show relationships between features, for example how a willingness to seek help from psychologists relates to satisfaction with life. In this case we will deliver the data analysis as code, so a Jupyter Notebook will again be an appropriate interface.

## Data Acquisition

In [1]:
# standard imports
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

In [2]:
df = pd.read_sas("mhcld_puf_2019.sas7bdat")

In [3]:
df

| | YEAR | AGE | EDUC | ETHNIC | RACE | GENDER | SPHSERVICE | CMPSERVICE | OPISERVICE | RTCSERVICE | ... | ODDFLG | PDDFLG | PERSONFLG | SCHIZOFLG | ALCSUBFLG | OTHERDISFLG | STATEFIP | DIVISION | REGION | CASEID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019.0 | -9.0 | -9.0 | 4.0 | 2.0 | 1.0 | 2.0 | 1.0 | 2.0 | 2.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 6.0 | 3.0 | 2.019000e+10 |
| 1 | 2019.0 | 14.0 | 4.0 | 4.0 | 6.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 6.0 | 3.0 | 2.019000e+10 |
| 2 | 2019.0 | 12.0 | -9.0 | 4.0 | 3.0 | 2.0 | 1.0 | 1.0 | 2.0 | 2.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 6.0 | 3.0 | 2.019000e+10 |
| 3 | 2019.0 | 10.0 | -9.0 | 4.0 | 5.0 | 2.0 | 1.0 | 1.0 | 1.0 | 2.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 6.0 | 3.0 | 2.019000e+10 |
| 4 | 2019.0 | 2.0 | 2.0 | 4.0 | 5.0 | 2.0 | 1.0 | 1.0 | 2.0 | 2.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 6.0 | 3.0 | 2.019000e+10 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6362039 | 2019.0 | 5.0 | -9.0 | 4.0 | 5.0 | 1.0 | 2.0 | 1.0 | 2.0 | 2.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 99.0 | 0.0 | 0.0 | 2.019636e+10 |
| 6362040 | 2019.0 | 4.0 | 4.0 | 4.0 | 6.0 | 1.0 | 2.0 | 1.0 | 2.0 | 2.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 99.0 | 0.0 | 0.0 | 2.019636e+10 |
| 6362041 | 2019.0 | 8.0 | 1.0 | 4.0 | 2.0 | 1.0 | 2.0 | 1.0 | 2.0 | 2.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 99.0 | 0.0 | 0.0 | 2.019636e+10 |
| 6362042 | 2019.0 | 11.0 | 4.0 | 4.0 | 4.0 | 1.0 | 2.0 | 1.0 | 2.0 | 2.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 99.0 | 0.0 | 0.0 | 2.019636e+10 |


In [4]:
df.keys()

Index(['YEAR', 'AGE', 'EDUC', 'ETHNIC', 'RACE', 'GENDER', 'SPHSERVICE',
       'CMPSERVICE', 'OPISERVICE', 'RTCSERVICE', 'IJSSERVICE', 'MH1', 'MH2',
       'MH3', 'SUB', 'MARSTAT', 'SMISED', 'SAP', 'EMPLOY', 'DETNLF', 'VETERAN',
       'LIVARAG', 'NUMMHS', 'TRAUSTREFLG', 'ANXIETYFLG', 'ADHDFLG',
       'CONDUCTFLG', 'DELIRDEMFLG', 'BIPOLARFLG', 'DEPRESSFLG', 'ODDFLG',
       'PDDFLG', 'PERSONFLG', 'SCHIZOFLG', 'ALCSUBFLG', 'OTHERDISFLG',
       'STATEFIP', 'DIVISION', 'REGION', 'CASEID'],
      dtype='object')

In [12]:
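def prep_data(data_df):
    '''
    Prepares the data for analysis.
    data_df is the dataframe to be prepared/processed.
    Returns a dataframe with only the columns needed for the analysis
    and without rows that contain missing values.
    '''
    # keep only the columns used in the analysis
    data_df = data_df[["AGE", "EDUC", "ETHNIC", "RACE", "GENDER", "CMPSERVICE", "MH1", "MH2", "MH3", "SMISED", "SAP", "NUMMHS", "EMPLOY", "MARSTAT"]]
    # -9 is the code for missing/unknown/not collected/invalid
    data_df = data_df.replace(-9.0, np.nan)
    # drop every row with a missing value in any of the selected columns
    data_df = data_df.dropna()
    return data_df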

In [14]:
df1 = prep_data(df)
df1

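| | AGE | EDUC | ETHNIC | RACE | GENDER | CMPSERVICE | MH1 | MH2 | MH3 | SMISED | SAP | NUMMHS | EMPLOY | MARSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 131 | 8.0 | 3.0 | 4.0 | 5.0 | 2.0 | 1.0 | 6.0 | 3.0 | 1.0 | 1.0 | 1.0 | 3.0 | 5.0 | 4.0 |
| 193 | 7.0 | 4.0 | 4.0 | 5.0 | 1.0 | 1.0 | 6.0 | 2.0 | 3.0 | 1.0 | 2.0 | 3.0 | 1.0 | 2.0 |
| 203 | 3.0 | 3.0 | 4.0 | 3.0 | 1.0 | 1.0 | 8.0 | 9.0 | 3.0 | 2.0 | 2.0 | 3.0 | 5.0 | 1.0 |
| 235 | 3.0 | 3.0 | 4.0 | 3.0 | 1.0 | 1.0 | 7.0 | 3.0 | 2.0 | 2.0 | 2.0 | 3.0 | 5.0 | 1.0 |
| 308 | 8.0 | 2.0 | 4.0 | 5.0 | 2.0 | 1.0 | 10.0 | 1.0 | 7.0 | 1.0 | 1.0 | 3.0 | 2.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6361237 | 5.0 | 5.0 | 2.0 | 6.0 | 1.0 | 1.0 | 6.0 | 2.0 | 7.0 | 1.0 | 1.0 | 3.0 | 4.0 | 1.0 |
| 6361357 | 9.0 | 2.0 | 2.0 | 6.0 | 2.0 | 1.0 | 7.0 | 13.0 | 12.0 | 1.0 | 1.0 | 3.0 | 5.0 | 2.0 |
| 6361365 | 3.0 | 3.0 | 2.0 | 5.0 | 2.0 | 1.0 | 6.0 | 3.0 | 8.0 | 2.0 | 1.0 | 3.0 | 5.0 | 1.0 |
| 6361450 | 12.0 | 4.0 | 4.0 | 1.0 | 1.0 | 2.0 | 11.0 | 12.0 | 13.0 | 1.0 | 1.0 | 3.0 | 5.0 | 1.0 |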
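
Because `prep_data` drops every row that has a `-9` in any selected column, it may help to see which columns account for most of the loss. The following is a minimal sketch on a small made-up frame (`toy` is not the MH-CLD data, and `missing_share` is a hypothetical helper name introduced only for this illustration):

```python
import numpy as np
import pandas as pd

MISSING_CODE = -9.0  # code used in the data set for missing/unknown values


def missing_share(data_df, columns):
    """Return the fraction of rows coded as missing in each column."""
    subset = data_df[columns]
    return subset.eq(MISSING_CODE).mean().sort_values(ascending=False)


# Made-up example rows, for illustration only
toy = pd.DataFrame({
    "AGE":  [8.0, -9.0, 3.0, 12.0],
    "EDUC": [3.0, -9.0, -9.0, 4.0],
    "MH2":  [-9.0, -9.0, -9.0, 12.0],
})

shares = missing_share(toy, ["AGE", "EDUC", "MH2"])
print(shares)

# Rows that would survive the same replace/dropna filtering as prep_data
complete_rows = toy.replace(MISSING_CODE, np.nan).dropna()
print(len(complete_rows), "of", len(toy), "rows are complete")
```

Running the same helper on `df` with the column list from `prep_data` would show, per column, how many clients are excluded before any modelling starts.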


## Understanding the Data

`df["AGE"]`: Calculated from the client's date of birth at midpoint of the state's elected reporting period:

    1 0–11 years
    2 12–14 years
    3 15–17 years
    4 18–20 years
    5 21–24 years
    6 25–29 years
    7 30–34 years
    8 35–39 years
    9 40–44 years
    10 45–49 years
    11 50–54 years
    12 55–59 years
    13 60–64 years
    14 65 years and older
    -9 Missing/unknown/not collected/invalid

`df["EDUC"]`: Specifies the school grade level of three sub-populations of clients, as follows:

    1 Special education
    2 0 to 8
    3 9 to 11 
    4 12 (or GED) 
    5 More than 12
    -9 Missing/unknown/not collected/invalid 

`df["ETHNIC"]`: Identifies whether or not the client is of Hispanic or Latino origin. Report the most recent available information for ethnicity at the end of the reporting period:

    1 Mexican
    2 Puerto Rican
    3 Other Hispanic or Latino origin
    4 Not of Hispanic or Latino origin
    -9 Missing/unknown/not collected/invalid
    

`df["RACE"]`: Specifies the client's most recent reported race at the end of the reporting period:
    
    1 American Indian/Alaska Native
    2 Asian
    3 Black or African American
    4 Native Hawaiian or Other Pacific Islander
    5 White
    6 Some other race alone/two or more races
    -9 Missing/unknown/not collected/invalid

`df["GENDER"]`: Identifies the client's most recent reported sex at the end of the reporting period:
    
    1 Male
    2 Female
    -9 Missing/unknown/not collected/invalid



`df["CMPSERVICE"]`: This field identifies whether a client received services from any Community Mental Health Centers (CMHCs), outpatient
clinics, partial care organizations, partial hospitalization programs, PACT programs, consumer run programs (including Club
Houses and drop-in centers), and all community support programs (CSP):
    
    1 Served in SMHA-funded/operated community-based program
    2 Not served in SMHA-funded/operated community-based program

`df["MH1"]`: Mental health diagnosis 1. Specifies the client's current first mental health diagnosis during the reporting period:

    1 Trauma- and stressor-related disorders
    2 Anxiety disorders 
    3 Attention deficit/hyperactivity disorder (ADHD) 
    4 Conduct disorders 
    5 Delirium, dementia 
    6 Bipolar disorders 
    7 Depressive disorders 
    8 Oppositional defiant disorders 
    9 Pervasive developmental disorders 
    10 Personality disorders 
    11 Schizophrenia or other psychotic disorders
    12 Alcohol or substance use disorders 
    13 Other disorders/conditions 
    -9 Missing/unknown/not collected/invalid/no or deferred diagnosis

`df["MH2"]`: Mental health diagnosis 2. Specifies the client's current second mental health diagnosis during the reporting period:

    1 Trauma- and stressor-related disorders
    2 Anxiety disorders 
    3 Attention deficit/hyperactivity disorder (ADHD) 
    4 Conduct disorders 
    5 Delirium, dementia 
    6 Bipolar disorders 
    7 Depressive disorders 
    8 Oppositional defiant disorders 
    9 Pervasive developmental disorders 
    10 Personality disorders 
    11 Schizophrenia or other psychotic disorders
    12 Alcohol or substance use disorders 
    13 Other disorders/conditions 
    -9 Missing/unknown/not collected/invalid/no or deferred diagnosis

`df["MH3"]`: Mental health diagnosis 3. Specifies the client's current third mental health diagnosis during the reporting period:

    1 Trauma- and stressor-related disorders
    2 Anxiety disorders 
    3 Attention deficit/hyperactivity disorder (ADHD) 
    4 Conduct disorders 
    5 Delirium, dementia 
    6 Bipolar disorders 
    7 Depressive disorders 
    8 Oppositional defiant disorders 
    9 Pervasive developmental disorders 
    10 Personality disorders 
    11 Schizophrenia or other psychotic disorders
    12 Alcohol or substance use disorders 
    13 Other disorders/conditions 
    -9 Missing/unknown/not collected/invalid/no or deferred diagnosis

`df["SMISED"]`: Indicates whether the client has serious mental illness (SMI) or serious emotional disturbance (SED) using the state
definition. Use the most recent available status at the end of the reporting period:
    
    1 SMI 
    2 SED and/or at risk for SED 
    3 Not SMI/SED 
    -9 Missing/unknown/not collected/invalid

`df["SAP"]`: Substance use problem. Specifies the client's substance use problem based on a substance use diagnosis and/or another identification method, such as substance use screening results, enrollment in a substance use program, a substance use survey, service claims information, or other related sources of data:
    
    1 Yes 
    2 No 
    -9 Missing/unknown/not collected/invalid

`df["NUMMHS"]`: Number of mental health diagnoses reported. Calculates the number of valid mental health diagnoses (maximum of three) that are reported for each client. For instance, the value of this variable will be 3 if nonmissing values are provided for all three mental health diagnoses (MH1, MH2, and MH3):
    
    0 0 
    1 1 
    2 2 
    3 3
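
The definition of `NUMMHS` can be reproduced from the three diagnosis columns. The sketch below assumes that "valid" simply means "not coded `-9`" and uses made-up rows (`toy` and `nummhs_check` are illustrative names, not part of the data set):

```python
import numpy as np
import pandas as pd

# Made-up diagnosis codes; -9 marks a missing diagnosis
toy = pd.DataFrame({
    "MH1": [6.0, 7.0, -9.0, 11.0],
    "MH2": [3.0, -9.0, -9.0, 12.0],
    "MH3": [1.0, -9.0, -9.0, 13.0],
})

# Treat -9 as missing, then count the non-missing diagnoses per row
diagnoses = toy[["MH1", "MH2", "MH3"]].replace(-9.0, np.nan)
nummhs_check = diagnoses.notna().sum(axis=1)
print(nummhs_check.tolist())
```

A check of this kind could be compared against the stored `NUMMHS` column to confirm that it agrees with `MH1`, `MH2`, and `MH3`.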

`df["EMPLOY"]`: Specifies the client's employment status at discharge (for new clients) or the most recent available employment status at the end of the reporting period (for continuing clients). This data element is reported for all clients (16 years old and over) who are receiving services in a non-institutional setting. Institutional settings include correctional facilities such as prisons, jails, and detention centers, and mental health care facilities such as state hospitals, other psychiatric inpatient facilities, nursing homes, or other institutions that keep a person, otherwise able, from entering the labor force:

    1 Full-time 
    2 Part-time 
    3 Employed full-time/part-time not differentiated 
    4 Unemployed 
    5 Not in labor force 
    -9 Missing/unknown/not collected/invalid


`df["MARSTAT"]`: Identifies the client's marital status:

    1 Never married 
    2 Now married 
    3 Separated 
    4 Divorced, widowed 
    -9 Missing/unknown/not collected/invalid
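
Since every column stores numeric codes, plots and tables are easier to read once the codes are mapped back to their labels. The following is a minimal sketch, assuming the codes are stored as floats (as in the outputs above); the dictionaries copy the labels from the lists above, while `decode` and `toy` are hypothetical names used only for this example:

```python
import pandas as pd

# Labels copied from the code lists above; keys are floats because
# the columns are stored as floats
GENDER_LABELS = {1.0: "Male", 2.0: "Female"}
MARSTAT_LABELS = {
    1.0: "Never married",
    2.0: "Now married",
    3.0: "Separated",
    4.0: "Divorced, widowed",
}


def decode(series, labels):
    """Map numeric codes to labels; codes without a label (e.g. -9) become NaN."""
    return series.map(labels)


# Made-up example rows, for illustration only
toy = pd.DataFrame({
    "GENDER":  [1.0, 2.0, -9.0],
    "MARSTAT": [4.0, 1.0, 2.0],
})

gender_text = decode(toy["GENDER"], GENDER_LABELS)
marstat_text = decode(toy["MARSTAT"], MARSTAT_LABELS)
print(gender_text.tolist())
print(marstat_text.tolist())
```

The same pattern would extend to the other columns by adding one dictionary per variable.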

## Summary Table

In [15]:
def summary_table(group_cols, value_cols):
    # mean of value_cols for each combination of group_cols in df1 (global), rounded to 2 decimals
    return df1.groupby(group_cols)[value_cols].mean().round(2)

In [16]:
summary_table(["GENDER","EDUC"],["SAP","NUMMHS"])

| GENDER | EDUC | SAP | NUMMHS |
|---|---|---|---|
| 1.0 | 1.0 | 1.72 | 3.0 |
| 1.0 | 2.0 | 1.5 | 3.0 |
| 1.0 | 3.0 | 1.48 | 3.0 |
| 1.0 | 4.0 | 1.47 | 3.0 |
| 1.0 | 5.0 | 1.5 | 3.0 |
| 2.0 | 1.0 | 1.71 | 3.0 |
| 2.0 | 2.0 | 1.59 | 3.0 |
| 2.0 | 3.0 | 1.53 | 3.0 |
| 2.0 | 4.0 | 1.54 | 3.0 |
| 2.0 | 5.0 | 1.57 | 3.0 |
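
The `SAP` means above are averages of the codes 1 (Yes) and 2 (No), so they are not directly readable as rates. Once the `-9` rows are removed, the share of clients with a substance use problem equals 2 minus the mean code; for example, a mean of 1.72 would correspond to roughly 28%. The sketch below illustrates this on made-up rows (`toy`, `mean_code`, and `share_yes` are illustrative names):

```python
import pandas as pd

# Made-up example rows, for illustration only (SAP: 1 = Yes, 2 = No)
toy = pd.DataFrame({
    "GENDER": [1.0, 1.0, 1.0, 1.0, 2.0, 2.0],
    "SAP":    [1.0, 2.0, 2.0, 2.0, 1.0, 2.0],
})

# Mean of the raw codes, as computed by summary_table
mean_code = toy.groupby("GENDER")["SAP"].mean()

# Share of clients with a substance use problem (SAP == 1)
share_yes = toy["SAP"].eq(1.0).groupby(toy["GENDER"]).mean()

print(mean_code)
print(share_yes)
```

The constant `NUMMHS` value of 3.0 in every group is expected: `prep_data` drops all rows with a missing `MH1`, `MH2`, or `MH3`, so only clients with three reported diagnoses remain in `df1`.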
