# Data: Dictionary, Risks and Assumptions


#### For project context, aims, and problem statement, see my accompanying GitHub blog post:
https://kgracia44.github.io/capstone_post/

____________________________________________________________________________________________________________________

### -- Data Dictionary (describing the contents of each variable) --

##### (For a CSV copy of the data dictionary, see data_dict_capstone.csv in the GitHub folder.)

In [1]:
import pandas as pd

data_dict = pd.read_csv('data_dict_capstone.csv')

data_dict

| Unnamed: 0 | Variable Name | Total (not null) Values | data type | variable type | values | description |
|---|---|---|---|---|---|---|
| 0 | Wave | 34017 | str (dtype object) | categorical | Wave 1: Age 17 Baseline Survey; Wave 2: Age 19... | Indicates what wave (first survey or follow-up... |
| 1 | StFCID | 34017 | str (dtype object) | unique ID/tracker | Confidential (combo of state and rec number) | Use this variable to follow records longitudin... |
| 2 | RepDate_outcomes | 34017 | datetime64[ns] | date -- periods A and B | yyyymm [where “yyyy” and “mm” correspond to th... | The report date corresponds with the end of th... |
| 3 | OutcmRpt | 34017 | str (dtype object) | categorical | Youth participated = 1; Youth declined = 2; Pa... | The outcomes reporting status represents the y... |
| 4 | OutcmDte | 29882 | datetime64[ns] | date | [mm/dd/yyyy] where: yyyy is the year; mm is th... | The date of outcome data collection is the lat... |
| 5 | OutcmFCS | 34017 | str (dtype object) | categorical | Yes, is in FC on Date; No, is not in FC on Dat... | The youth is in foster care if the youth is un... |
| 6 | CurrFTE | 34017 | str (dtype object) | categorical | Yes, employed full time; No; Declined; Blank | A youth is employed full-time if employed at l... |
| 7 | CurrPTE | 34017 | str (dtype object) | categorical | Yes, employed part time; No; Declined; Blank | A youth is employed part-time if employed betw... |
| 8 | EmplySklls | 34017 | str (dtype object) | categorical | Yes; No; Declined; Blank | A youth has obtained employment-related skills... |
| 9 | SocSecrty | 34017 | str (dtype object) | categorical | Yes; No; Declined; Blank | A youth is receiving some of Social Security i... |


____________________________________________________________________________________________________________________

### -- Risks and Assumptions --

- Risk: Data incomplete
    
  1) Issue: The full complement of baseline data is not complete until after May 15 of the "A" period of the next fiscal year
  
  2) Issue: For follow-up surveys (Waves 2+), responses can be collected at any time in the 6-month period
  
  3) Issue: States are encouraged, but not required, to collect data early in the period to avoid performing the survey 
     in one period and reporting the results in the next period.
       
       
  - Assumptions:
    
      - Surveys are collected and reported in the same fiscal year
      - Regardless of whether data is complete for each reporting period, the analysis looks at data by fiscal year
      - If a date is needed for any part of the analysis, the reported date is used
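
The fiscal-year grouping assumed above can be sketched as follows. This is a minimal illustration, assuming the US federal fiscal year (October 1 through September 30) and that `RepDate_outcomes` has already been parsed to `datetime64[ns]` as listed in the data dictionary. The function name `federal_fiscal_year` and the toy dates are hypothetical, not part of the original notebook.

```python
import pandas as pd

def federal_fiscal_year(dates: pd.Series) -> pd.Series:
    """Map each date to a US federal fiscal year (Oct 1 to Sep 30)."""
    dates = pd.to_datetime(dates)
    # October through December count toward the fiscal year
    # named after the following calendar year.
    return dates.dt.year + (dates.dt.month >= 10).astype(int)

# Toy report dates (hypothetical values, for illustration only)
toy = pd.DataFrame(
    {"RepDate_outcomes": pd.to_datetime(["2011-03-01", "2011-09-01", "2011-10-01"])}
)
toy["fiscal_year"] = federal_fiscal_year(toy["RepDate_outcomes"])
print(toy)
```

Grouping on a derived `fiscal_year` column keeps both reporting periods of one year together, even if a survey was performed in one period and reported in the next.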

____________________________________________________________________________________________________________________
____________________________________________________________________________________________________________________

- Risk: Sample data (n) not representative of baseline population (N)

  1) Data sets exclude data from the following states:
        
     a) Connecticut (due to confidentiality restrictions; excluded from the raw dataset by the National Data Archive on Child Abuse and Neglect (NDACAN))

     b) NY and Puerto Rico did not participate in wave 2 survey for cohort 1, so data for participants from those states are missing 
     
     c) I had to drop data from the following states: "HI", "IN", "KY", "MS", "OR", "TX", "TN", because the unique ID column for those states (the one I need for tracking participants from baseline through the follow-up surveys) was corrupted. I could not use those records in my dataset, and there is no way to impute or fix the IDs on my end.
    
  2) Self-selection bias:
       
     a) For baseline population demographic and services data: no sampling done; data collected for all population
     
     b) N = population consists of ALL youth in foster care at age 17 = baseline population
     
     c) N_no_states = baseline population excluding the 7 states dropped due to data quality issues
     
     d) Participation in survey is completely voluntary; no incentive to participate
     
     e) If respondents differ significantly from non-respondents (due to a combination of response rate variation across states and survey design constraints), the survey results are potentially biased. They would then not represent non-respondents and, overall, would not adequately represent the outcomes of the 17- and 19-year-old foster care youth population that the survey is intended to assess.
     
     f) n = cohort population (the survey sample)
        
     - Cohort 1, Wave 1: 
         - Self-selected, non-probabilistic sample of youth in baseline
         - Self-selection because youth are not randomly selected

     - Cohort 1 and Cohort 2, Wave 2+:
         - Optional probabilistic sampling; choice is left up to state
         - If a state chooses sampling, it is done only once (so the wave 2 sampled population = the wave 3 sampled
           population). The sample size calculation is standard (using the finite population correction, plus 30% of
           population size for attrition) and is based on the baseline population.

     - In FY 2011 (cohort 1, wave 2 and 3): 12 states opted for sampling.
         - Only these 12 states employed sampling methods that address selection bias, and only for wave 2 
           participants

     - No states had more than 5,000 youth in their cohorts
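
The sample size calculation mentioned above can be sketched as follows. This is only an illustration of a standard calculation (Cochran's formula with a finite population correction), not the exact rule the states followed. The confidence level (z = 1.645), margin of error (0.05), and proportion (0.5) are assumed defaults. The 30% attrition allowance is applied here as a 30% increase on the corrected sample size, capped at the baseline population, which is one reading of the note above. The function name `fpc_sample_size` is hypothetical.

```python
import math

def fpc_sample_size(population, z=1.645, margin=0.05, p=0.5, attrition=0.30):
    """Sample size with finite population correction and an attrition allowance.

    All default parameter values are assumptions for illustration.
    """
    # Cochran's sample size for an effectively infinite population
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)
    # Finite population correction for a baseline population of known size
    n_fpc = n0 / (1 + (n0 - 1) / population)
    # Inflate for expected attrition, but never sample more than the population
    return min(population, math.ceil(n_fpc * (1 + attrition)))

for baseline_n in (50, 1000, 5000):
    print(baseline_n, fpc_sample_size(baseline_n))
```

For small baseline populations the cap takes over, so sampling saves little there compared with surveying every youth.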

         
         
   - Assumptions: 

       - The Children’s Bureau employed a weighting methodology with the NYTD survey responses to identify and
         correct potential non-response bias. 
       - Assuming the weighting methodology was appropriate and effective, I assume there is no selection bias and 
         that the cohorts are representative.
       - I will conduct hypothesis testing during the exploratory data analysis to determine if the sample population 
         (n) is representative of the baseline population (N_no_states)
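
One way the representativeness check could look is sketched below. It first builds N_no_states by excluding the seven dropped states, then compares a sample proportion against the baseline proportion with a two-sided one-sample z-test. The column names `state` and `sex`, the helper names, and all toy numbers are hypothetical. The choice of a z-test on a single proportion is also an assumption, since the notes above do not name a specific test.

```python
import math
import pandas as pd

# The seven states dropped because of corrupted IDs (listed under the risk above)
DROPPED_STATES = ["HI", "IN", "KY", "MS", "OR", "TX", "TN"]

def exclude_states(df, state_col="state"):
    """Return only the rows whose state is not in DROPPED_STATES."""
    return df[~df[state_col].isin(DROPPED_STATES)]

def proportion_ztest(successes, n, p0):
    """Two-sided one-sample z-test of a sample proportion against p0."""
    p_hat = successes / n
    se = math.sqrt(p0 * (1 - p0) / n)  # standard error under the null
    z = (p_hat - p0) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return z, p_value

# Toy baseline population (hypothetical values, for illustration only)
baseline = pd.DataFrame({
    "state": ["CA"] * 6 + ["TX"] * 4,
    "sex": ["F", "M", "F", "M", "F", "M", "F", "F", "F", "F"],
})
baseline_kept = exclude_states(baseline)        # N_no_states
p0 = (baseline_kept["sex"] == "F").mean()       # baseline share of "F"

# Toy sample: 52 of 100 respondents are "F"
z, p_value = proportion_ztest(successes=52, n=100, p0=p0)
print(round(z, 3), round(p_value, 3))
```

A large p-value here would mean the sample share is consistent with the baseline share for that one attribute. It says nothing about other attributes, so the check would need repeating per variable.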
          
____________________________________________________________________________________________________________________
____________________________________________________________________________________________________________________

- Risk: Non-standardization of survey administration and data collection

  1) There is only one regulation concerning survey administration: Surveys are administered to the participant 
     directly, meaning no one can answer for youth, nor can data from other sources be used to answer survey 
     questions.
        
     - However, nothing is in place to enforce or ensure proper survey administration
    
  2) Surveys can be administered in person, online, or by phone
    
  3) Data collection procedures are not standardized across states or local entities       
       
       
   - Assumptions: 
   
       - Responses come from the foster youth participant and no one else
       - Response rates varied dramatically by state; this may be a reflection of variance in data collection 
         procedures
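
The state-level variation in response rates noted above could be quantified with a sketch like the following. It assumes a hypothetical `state` column and that `OutcmRpt` holds the label "Youth participated" for completed surveys. The exact stored labels or codes should be checked against the data dictionary before use.

```python
import pandas as pd

def response_rate_by_state(df, state_col="state", participated_label="Youth participated"):
    """Share of records per state whose OutcmRpt equals the participated label."""
    participated = df["OutcmRpt"].eq(participated_label)
    # The mean of a boolean series is the fraction of True values
    return participated.groupby(df[state_col]).mean().sort_values()

# Toy records (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "state": ["AL", "AL", "AK", "AK", "AK", "AK"],
    "OutcmRpt": [
        "Youth participated", "Youth declined",
        "Youth participated", "Youth participated",
        "Youth participated", "Youth declined",
    ],
})
rates = response_rate_by_state(toy)
print(rates)
```

Sorting the rates makes the lowest-responding states easy to spot, which is where differences in data collection procedures are most likely to matter.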
        
____________________________________________________________________________________________________________________
____________________________________________________________________________________________________________________

- Risk: Raw data/Initial dataset

  1) Raw data already had imputed values for data that was missing from initial reporting:
     
     - Missing values for covariates were imputed using a recursive hot deck algorithm seeded with a sort list of  
       state by sex. 
    
  2) For services data (raw dataset A) only: County FIPS codes with fewer than 1,000 records are recoded to 8 
     (LcLFIPSSv)
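
To make the imputation note above more concrete, here is a simplified sequential hot deck. It is not the recursive algorithm used on the raw data, only a sketch of the general idea: records are sorted by state and sex, and a missing value borrows the most recent observed value from a donor in the same state-by-sex cell. The column names `state`, `sex`, and `race` and the function name are hypothetical.

```python
import pandas as pd

def sequential_hot_deck(df, target, sort_cols):
    """Fill missing values in `target` from the previous donor in the same cell."""
    # Stable sort keeps the original record order inside each cell
    ordered = df.sort_values(sort_cols, kind="stable")
    # Within each state-by-sex cell, carry the last observed value forward
    filled = ordered.groupby(sort_cols)[target].ffill()
    out = df.copy()
    out[target] = filled  # assignment aligns on the original index
    return out

# Toy records (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "state": ["AL", "AL", "AL", "AK", "AK"],
    "sex": ["F", "F", "M", "F", "F"],
    "race": ["White", None, "Black", None, "Asian"],
})
imputed = sequential_hot_deck(toy, target="race", sort_cols=["state", "sex"])
print(imputed)
```

In this sketch a record with no earlier donor in its cell stays missing, which is one reason production hot deck procedures use fallback rules or repeated passes.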
    
    
   - Assumptions: 
   
       - Assuming the data imputation was done correctly and based on a valid methodology. 
       - I am excluding LcLFIPSSv from the analysis because I do not have a codebook for this feature's values
           - It could perhaps be used in future work