# Capstone Project: Progress Report and Preliminary Findings

#### --Context--

##### Public Health Issue: Negative outcomes experienced by foster care youth who age out of the system

There is a sub-population of children in the foster care system (~10% of the foster care population) who experience a phenomenon known as foster care drift and end up aging out of the system (at age 18) with no permanent home. Because these youth experience extreme instability and usually lack knowledge and skills that lead to self-sufficiency in adulthood, this sub-population is at a higher risk of experiencing negative outcomes such as homelessness, unemployment, incarceration, and mental health issues.  

The John H. Chafee Foster Care Independence Program (CFCIP) was initiated to address this public health issue by providing federal funds to states for the design and administration of services geared towards helping these youth transition into adulthood successfully.

CFCIP considers a former foster youth as successfully transitioned into adulthood if he or she is able to obtain positive well-being, health, education, and financial outcomes.

##### CFCIP Logic Model

<img src="files/cfcip_logic_model.png">

#### --Capstone Project Specific Aim and Goals--

Build and develop a predictive model in order to:
    
1) Evaluate the John H. Chafee Foster Care Independence Program (CFCIP)

2) Identify which services provided lead to what outcomes

3) Identify outcomes experienced by former foster youth who receive services

4) Recommend which services to focus funds on
____________________________________________________________________________________________________________________

#### --Problem statement--

Which services provided by Chafee funded counties or agencies lead to positive (well-being, health, financial, and educational) outcomes for foster youth who age out of the system?

__________________________________________________________________________________________________________________

#### --EDA Highlights--

Plots are currently being created in Tableau and will be added soon.

Initial EDA:
    
- Reviewed distributions of categorical variables

- Reviewed comparisons of categorical variables for both cohorts

- Reviewed relationships between various variables
    
Refined EDA:
    
- Feature preprocessing

- Grouped variables

- Feature Importance

__________________________________________________________________________________________________________________

#### --Analysis Plan/Approach--

1) Check whether the samples are representative of the target population
    
- Use a one-sample hypothesis test for a dichotomous outcome to:

    - Check for a difference between the baseline populations and the W1/W2 populations for both cohorts in the data (services received)

    - Check for a difference between the Cohort 1 baseline population and the Cohort 2 baseline population (services received)
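
As a rough illustration of the one-sample test for a dichotomous outcome, the sketch below compares a sample proportion against a known baseline proportion using a normal approximation. The function name and the counts are hypothetical and are not taken from the project data.

```python
import math

def one_sample_proportion_ztest(successes, n, p0):
    """Two-sided z-test of a sample proportion against a known proportion p0."""
    p_hat = successes / n
    se = math.sqrt(p0 * (1 - p0) / n)  # standard error under the null hypothesis
    z = (p_hat - p0) / se
    # Two-sided p-value from the standard normal CDF (via the error function).
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical numbers: 60% of the baseline population received a service,
# while 330 of 600 Wave 1 respondents did.
z, p = one_sample_proportion_ztest(successes=330, n=600, p0=0.60)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value here would suggest that the wave sample differs from the baseline population on that service, which is the representativeness concern in step 1.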

2) If time permits: Check if changes between Wave 1 outcomes and Wave 2 outcomes for cohort 1 are significant
    
- If the model performs poorly, the results of this step may partly explain why
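
The plan does not name a test for this step. One common option for paired yes/no outcomes is McNemar's test, sketched below on the assumption that the same respondents are observed at both waves. The function name and the counts are hypothetical.

```python
import math

def mcnemar_test(b, c):
    """McNemar's chi-square test (with continuity correction) for paired binary data.

    b: respondents whose outcome changed from negative (Wave 1) to positive (Wave 2)
    c: respondents whose outcome changed from positive (Wave 1) to negative (Wave 2)
    """
    if b + c == 0:
        return 0.0, 1.0  # no discordant pairs, so no evidence of change
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts of discordant pairs for one outcome.
chi2, p_value = mcnemar_test(b=25, c=10)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
```

Only respondents whose answer changed between waves enter the statistic; those with the same answer at both waves carry no information about change.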

3) Modeling technique: Explore Tree-Based Models
    
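- A tree-based classifier is well suited to this problem: the features and targets are categorical, and scikit-learn's decision trees support multi-output prediction (see Step 5)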

4) Preprocessing if needed for chosen modeling techniques
    
- First, need to decide how to handle a significant portion of the population: respondents who left a question blank or declined to respond
    
- Then, need to use a label encoder to transform the categorical features and targets from strings to integer codes
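
A minimal sketch of the label-encoding step, assuming scikit-learn's `LabelEncoder` is the encoder in question. The response values are hypothetical and are not the dataset's actual codes. Here blank and declined answers are kept as their own categories, which is only one of the possible ways to handle them.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical survey responses for a single column.
responses = ["yes", "no", "declined", "yes", "blank", "no"]

encoder = LabelEncoder()
codes = encoder.fit_transform(responses)

# classes_ holds the sorted original labels; each label's index is its integer code.
print(list(encoder.classes_))  # ['blank', 'declined', 'no', 'yes']
print(list(codes))             # [3, 2, 1, 3, 0, 2]

# The mapping is reversible, which helps when interpreting predictions.
print(list(encoder.inverse_transform(codes[:2])))  # ['yes', 'no']
```

scikit-learn documents `LabelEncoder` as intended for target labels; `OrdinalEncoder` is the equivalent for a whole feature matrix.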
    
5) Dealing with a multi-output problem, where there are several y-variables to predict.
    
- See the following documentation:

    1) http://scikit-learn.org/stable/modules/tree.html#tree-multioutput

    2) http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4601565/pdf/TSWJ2015-821798.pdf

A) First, need to group y-variables into four target variables: financial, educational, well-being, and health outcomes. Each target variable would have class positive vs negative.

  ##### Currently at this point. Groupings will significantly impact results

- Need to decide what to include/how (weighting, scoring, etc.):

    - Financial outcome would be based on the following features: PubFoodAs, PubHousAs, OthrFinAs, CurrEmpl, SocScrty, EducAid, PubFinAs
    - Educational outcome would be based on the following features: HighEdCert, EmplySkills, CurrenRoll
    - Well-being outcome would be based on the following features: CnctAdult, Homeless, SubAbuse, Marriage, Children, Incarc
    - Health outcome would be based on the following features: PrescIn, MDCD, OthrHlthIn, MedicalIn, MntlHlthIn
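
One possible scoring rule for collapsing a group of items into a single positive/negative target is sketched below. The function, the `favorable` polarity map, the `"yes"`/`"no"` answer coding, and the 0.5 threshold are all hypothetical; choosing them is exactly the open decision described above, and the sketch only shows the mechanics.

```python
def composite_outcome(row, favorable, threshold=0.5):
    """Label a respondent 'positive' when the share of answered items that
    match their favorable value is at least `threshold`."""
    # Only items answered "yes" or "no" count; blank or declined items are skipped.
    answered = [item for item in favorable if row.get(item) in ("yes", "no")]
    if not answered:
        return None  # nothing usable, so no label is assigned
    hits = sum(row[item] == favorable[item] for item in answered)
    return "positive" if hits / len(answered) >= threshold else "negative"

# ASSUMPTION: hypothetical polarity for the educational items.
EDUCATION_FAVORABLE = {"HighEdCert": "yes", "EmplySkills": "yes", "CurrenRoll": "yes"}

respondent = {"HighEdCert": "yes", "EmplySkills": "no", "CurrenRoll": "declined"}
label = composite_outcome(respondent, EDUCATION_FAVORABLE)
print(label)  # positive (1 of 2 answered items is favorable)
```

Changing `threshold` or the polarity map changes the class balance of the target, which is why the groupings can significantly affect results.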

B) Then, need to incorporate information from documentation link #1 above:

<img src="files/multi_y_doc.png">
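
As a sketch of what the linked scikit-learn documentation describes, a single decision tree can be fit against a two-dimensional target so that it predicts all four outcomes at once. The data below is synthetic and only demonstrates the shapes involved; it does not represent the project data.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 6))  # synthetic: 6 binary service indicators
Y = rng.integers(0, 2, size=(200, 4))  # synthetic: 4 binary outcome columns

# Passing a 2-D Y makes the tree a multi-output classifier.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, Y)

pred = tree.predict(X[:5])
print(pred.shape)  # (5, 4): one prediction per respondent per outcome
```

The alternative is to fit one model per target (four models in total), which is simpler to interpret but ignores any correlation between the outcomes.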


6) Feature Importance in order to reduce dimensionality:
    
- Conduct feature importance in order to select most impactful features

- Depending on decision in Step 5, may need to conduct feature importance based on each target variable (total of four models) 
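
A minimal sketch of ranking features by importance, assuming a random forest's impurity-based `feature_importances_` is used. The feature names and data are synthetic, with the target deliberately driven by two of the eight features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
feature_names = [f"service_{i}" for i in range(8)]  # hypothetical names
X = rng.integers(0, 2, size=(300, 8))
y = X[:, 0] | X[:, 3]  # synthetic target: depends only on service_0 and service_3

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Pair each feature with its importance and sort from most to least important.
ranked = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked[:3]:
    print(f"{name}: {score:.3f}")
```

If one model is built per target variable, this ranking would be repeated for each of the four targets, and the selected features may differ between them.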

7) Build classifier

- Train and Test on 2011 data (Wave 2 outcomes based on Wave 1 Services and Demographics data)

- Build 3 classifiers, cross-validate on 2011 data; x = services_2011, y = outcomes_2013:

    - random forests

    - k-nearest neighbors (kNN, for comparison)

    - logistic regression (for comparison)
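
The comparison of the three classifiers could be sketched as follows, using 5-fold cross-validation on synthetic data. The hyperparameters shown are placeholders, not tuned choices, and the data stands in for the services and outcomes described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 8))  # synthetic stand-in for the service features
y = X[:, 0] | X[:, 3]                  # synthetic stand-in for one outcome

models = {
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "logistic regression": LogisticRegression(max_iter=1000),
}

# Mean accuracy across 5 folds for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```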

8) Model Evaluation:

- Compare performance scores/metrics

- Pick the best model (or try all of them, if time permits)

- Predict outcomes_2016 based on services_2014 (wave 2 of cohort 2)

    - Evaluate model using: confusion matrix, classification report, and accuracy score
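
The three evaluation tools named above are all available in `sklearn.metrics`. The sketch below applies them to a small set of hypothetical true and predicted labels (1 = positive outcome, 0 = negative outcome).

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical labels, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes (ordered 0, then 1).
cm = confusion_matrix(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)

print(cm)
print(f"accuracy = {acc:.2f}")
print(classification_report(y_true, y_pred, target_names=["negative", "positive"]))
```

The classification report adds per-class precision and recall, which matter more than raw accuracy if the positive and negative classes are imbalanced.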

9) Next steps:

- Get 2015 data

    - Predict outcomes_2015 based on services_2011 (wave 3 of cohort 1)

    - Evaluate model using: confusion matrix, classification report, and accuracy score


#### --Initial Findings--

1) Setbacks:
    
- Diminishing data: At each stage of the project, I have had to reduce the dataset in order to address missing values and other data quality issues

2) Roadblocks:
    
- I am currently deciding how best to deal with the multi-output problem (the decision will be based on research into the public health issue as well as research into methods)

3) Surprises:
    
- There are many assumptions, risks, and arbitrary groupings that I have to be mindful of, any of which could substantially change my analysis results.
