In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
%matplotlib inline
from matplotlib import pyplot as plt
import math
import psycopg2 as psy
from sqlalchemy import create_engine
from sqlalchemy.engine.url import URL

### Raw Data Source:

Raw data obtained from the National Youth in Transition Database (NYTD). See README for more information on sources.

##### Starting raw data sets:

###### A) Services and demographic data for Fiscal Years (FY) 2011 - 2014 (Services population for each FY)
        
- Cross-sectional data collected over 8 six-month periods

- For all foster youth receiving independent living services funded through CFCIP for each fiscal year

- Data collected from all 50 states (my data excludes data from Connecticut) and D.C. and Puerto Rico

- Total of 675645 rows and 39 columns
        
###### B) Outcomes data for Wave 1 and Wave 2 of Cohort 1
            
- Cohort 1 consists of foster youth who received services in FY 2011 AND participated in surveys

- Longitudinal data collected, 2 follow-up periods (every 2 years)

- About 5% of services population (Dataset A, FY 2011)

- Wave 1 = Outcomes Survey collected in FY 2011, within 45 days of youth's 17th birthday, from 49 states and D.C. and Puerto Rico (data excludes data from Connecticut). Outcomes data from Wave 1 for cohort 1 will not be used in the analysis because the data is outside the scope of this project. Instead, only the unique ID from this data will be used to identify the baseline population for cohort 1.

- Wave 2 = Outcomes Follow-Up Survey collected in FY 2013, within 45 days of youth's 19th birthday, from 48 states and D.C. (data excludes data from Connecticut, New York and Puerto Rico)

- Missing: Wave 3 = Outcomes Follow-Up Survey collected in FY 2015, within 45 days of youth's 21st birthday

- Total of 22811 rows, and 48 columns
        
###### C) Outcomes data for Wave 1 of Cohort 2
            
- Cohort 2 consists of foster youth who received services in FY 2014 AND participated in surveys

- Longitudinal data collected, 2 follow-up periods (every 2 years)

- Data collected from 49 states and D.C. and Puerto Rico (data excludes data from Connecticut) 

- Wave 1 = Outcomes Survey collected in FY 2014, within 45 days of youth's 17th birthday. Outcomes data from Wave 1 for cohort 2 will not be used in the analysis because the data is outside the scope of this project. Instead, only the unique ID from this data will be used to identify the baseline population for cohort 2. 

- Missing: Wave 2 = Outcomes Follow-Up Survey to be collected in FY 2016, within 45 days of youth's 19th birthday

- Missing: Wave 3 = Outcomes Follow-Up Survey to be collected in FY 2018, within 45 days of youth's 21st birthday

- Total of 23775 rows, and 49 columns


__________________________________________________________________________________________
### Data Cleaning and Munging Plan:
    
1) Load the 3 raw datasets, each into its own pandas dataframe

- As described in cell above: Dataset A, Dataset B, Dataset C

2) Create dataframe for Cohort 1 -- Services Population Dataset:

- Services and demographic data for foster youth in FY 2011 from dataset A 

- Initial Data Review (look at tables, dtypes, etc.)

- Clean data: 
   
    - Remove duplicates (based on unique IDs)
            
- Save cleaned dataset as CSV file
    
3) Create dataframe for Cohort 1 -- Baseline Population Excluding 7 States Dataset:

- Services and demographic data for foster youth in FY 2011 from dataset A 

- Outcomes data for Wave 1 Participants (specifically the unique ID column) from dataset B

- Unfortunately, several states have badly encoded unique ID values, which prevents me from tracking those records. 

- Need to drop data rows from the following states due to data quality issue: 
  ["HI", "IN", "KY", "MS", "OR", "TX", "TN"]
  
- Save dataset as CSV file
       
4) Create dataframe for Cohort 1 -- Wave 2 Outcomes Dataset:

- Outcomes data for Wave 2 Participants from dataset B

- Only include foster youth who completed surveys in Wave 2

- Clean data:

    - Drop rows that have missing values in multiple columns (>10)

    - Convert data type of columns AgeMP, EdLevlSv, RaceDcln, RaceUnkn to int

    - Convert data types of columns RepDates_outcomes, RepDates_services and DOB to datetime

    - Identify categorical variable columns that may need to be dummified for later analysis

    - Get rid of special characters (that are not UTF-8) in column HighEdCert
    
- Save cleaned dataset as CSV file and as table in local postgreSQL database

5) Create dataframe for Cohort 2 -- Baseline Population Dataset:

- Services and demographic data for foster youth in FY 2014 from dataset A

- Outcomes data for Wave 1 Participants (specifically the unique ID column) from dataset C

- Initial Data Review (look at tables, dtypes, etc.)

- Clean data: 
   
    - Remove duplicates (based on unique IDs)
            
- Save cleaned dataset as CSV file

**Please note:** This dataset does not contain any outcomes data for cohort 2. In order to complete my predictive model performance evaluation, I will need to obtain Wave 2 and Wave 3 data for cohort 2, which are currently unavailable. (They should be available in December 2017.)
__________________________________________________________________________________________

<b>Recap -- I will end up with the following datasets:</b>

1) Target population: All foster youth who receive services (funded by CFCIP) during fiscal year of interest

2) Target population for this project: All foster youth who receive services (funded by CFCIP) during fiscal year of interest, excluding youth from states with bad encoding

3) Sampling Frame for FY 2011: Cohort 1 Baseline Population (FY 2011)

4) Sample: Cohort 1, Wave 2 Population (Outcomes data from FY 2013)

5) Sampling Frame for FY 2014: Cohort 2 Baseline Population (FY 2014)

For detailed explanations of population terms (e.g., target population, sampling frame, etc.), check out these websites:

http://www.theanalysisfactor.com/target-population-sampling-frame/

http://www.socialresearchmethods.net/kb/sampterm.php

http://www.statisticshowto.com/sampling-frame/

# START Data Clean and Munge

In [2]:
# Load 3 initial raw datasets into pandas DF

In [3]:
services_11_14 = pd.read_csv('~/Desktop/dsi_projects_backup/capstone/data_to_use/raw/Services_2014.csv')



In [4]:
outcomes_2013 = pd.read_stata('~/Desktop/dsi_projects_backup/capstone/data_to_use/raw/Outcomes2011Wave2_.dta')

In [5]:
outcomes_2014 = pd.read_stata('~/Desktop/dsi_projects_backup/capstone/data_to_use/raw/Outcomes14_W1.dta')

In [6]:
list(services_11_14)

['FY',
 'RepDate',
 'StFIPS',
 'St',
 'RecNumbr',
 'DOB',
 'Sex',
 'AmIAKN',
 'Asian',
 'BlkAfrAm',
 'HawaiiPI',
 'White',
 'RaceUnkn',
 'RaceDcln',
 'HisOrgin',
 'FCStatSv',
 'LclFIPSsv',
 'TribeSv',
 'DelinqntSv',
 'EdLevlSv',
 'SpecEdSv',
 'ILNAsv',
 'AcSuppSv',
 'PSEdSuppSv',
 'CareerSv',
 'EmplyTrSv',
 'BudgetSv',
 'HousEdSv',
 'HlthEdSv',
 'FamSuppSv',
 'MentorSv',
 'SILsv',
 'RmBrdFASv',
 'EducFinaSv',
 'OthrFinaSv',
 'StFCID',
 'Race',
 'RaceEthn',
 'AgeMP']

In [7]:
services_11_14.describe()

Unnamed: 0,FY,RepDate,StFIPS,Sex,AmIAKN,Asian,BlkAfrAm,HawaiiPI,White,HisOrgin,...,HousEdSv,HlthEdSv,FamSuppSv,MentorSv,SILsv,RmBrdFASv,EducFinaSv,OthrFinaSv,Race,RaceEthn
count,675645.0,675645.0,675645.0,675645.0,675645.0,675645.0,675645.0,675645.0,675645.0,675645.0,...,675645.0,675645.0,675645.0,675645.0,675645.0,675645.0,675645.0,675645.0,675645.0,675645.0
mean,2012.462685,201252.261118,26.395412,1.516631,0.036691,0.020037,0.368907,0.011923,0.548146,0.318765,...,8.897445,8.878888,8.827611,8.738966,8.663393,8.696876,8.750952,8.793846,9.967709,5.345485
std,1.11324,111.310146,16.358375,0.499724,0.617899,0.605054,0.761098,0.598528,0.770803,1.419459,...,24.12384,24.129933,24.148215,24.181827,24.200287,24.194873,24.176128,24.167719,27.141931,15.597203
min,2011.0,201103.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
25%,2011.0,201109.0,12.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
50%,2012.0,201209.0,25.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0
75%,2013.0,201309.0,39.0,2.0,0.0,0.0,1.0,0.0,1.0,0.0,...,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,2.0,6.0
max,2014.0,201409.0,72.0,2.0,77.0,77.0,77.0,77.0,77.0,77.0,...,77.0,77.0,77.0,77.0,77.0,77.0,77.0,77.0,99.0,99.0


In [8]:
list(outcomes_2013)

['Wave',
 'StChID',
 'StFIPS',
 'St',
 'RecNumbr',
 'RepDate',
 'dob',
 'Sex',
 'AmIAKN',
 'Asian',
 'BlkAfrAm',
 'HawaiiPI',
 'White',
 'RaceUnkn',
 'RaceDcln',
 'HisOrgin',
 'OutcmRpt',
 'OutcmDte',
 'OutcmFCS',
 'CurrFTE',
 'CurrPTE',
 'EmplySklls',
 'SocSecrty',
 'EducAid',
 'PubFinAs',
 'PubFoodAs',
 'PubHousAs',
 'OthrFinAs',
 'HighEdCert',
 'CurrenRoll',
 'CnctAdult',
 'Homeless',
 'SubAbuse',
 'Incarc',
 'Children',
 'Marriage',
 'Medicaid',
 'OthrHlthIn',
 'MedicalIn',
 'MentlHlthIn',
 'PrescripIn',
 'SampleState',
 'InSample',
 'Baseline',
 'FY11Cohort',
 'Elig19',
 'Weight',
 'Responded']

In [9]:
outcomes_2013.describe()



Unnamed: 0,Weight
count,22811.0
mean,2.505278
std,2.241763
min,0.7
25%,
50%,
75%,
max,80.71


In [10]:
list(outcomes_2014)

['wave',
 'stfips',
 'st',
 'recnumbr',
 'repdate',
 'dob',
 'sex',
 'amiakn',
 'asian',
 'blkafram',
 'hawaiipi',
 'white',
 'raceunkn',
 'racedcln',
 'hisorgin',
 'outcmrpt',
 'outcmdte',
 'outcmfcs',
 'currfte',
 'currpte',
 'emplysklls',
 'socsecrty',
 'educaid',
 'pubfinas',
 'pubfoodas',
 'pubhousas',
 'othrfinas',
 'highedcert',
 'currenroll',
 'cnctadult',
 'homeless',
 'subabuse',
 'incarc',
 'children',
 'marriage',
 'medicaid',
 'othrhlthin',
 'medicalin',
 'mentlhlthin',
 'prescripin',
 'baseline',
 'fy14cohort',
 'elig19',
 'samplestate',
 'insample',
 'responded',
 'race',
 'raceethn',
 'stfcid']

In [11]:
outcomes_2014.describe()

Unnamed: 0,stfips,responded
count,23775.0,23775.0
mean,26.338212,0.73346
std,16.927417,0.442159
min,1.0,0.0
25%,9.0,0.0
50%,25.0,1.0
75%,40.0,1.0
max,72.0,1.0


In [12]:
# First I will create a data set that has services and demographic data (from 2011) for cohort 1 services population

In [13]:
# For cohort 1, only need services data for FY 2011
cohort_1 = services_11_14[services_11_14.FY == 2011]

# Dropping columns that are not needed for this project
cohort_1 = cohort_1.drop(['RecNumbr','LclFIPSsv'],axis= 1)

cohort_1.describe()

Unnamed: 0,FY,RepDate,StFIPS,Sex,AmIAKN,Asian,BlkAfrAm,HawaiiPI,White,HisOrgin,...,HousEdSv,HlthEdSv,FamSuppSv,MentorSv,SILsv,RmBrdFASv,EducFinaSv,OthrFinaSv,Race,RaceEthn
count,173184.0,173184.0,173184.0,173184.0,173184.0,173184.0,173184.0,173184.0,173184.0,173184.0,...,173184.0,173184.0,173184.0,173184.0,173184.0,173184.0,173184.0,173184.0,173184.0,173184.0
mean,2011.0,201106.089974,26.569672,1.519696,0.030886,0.014118,0.373782,0.006317,0.531926,0.292885,...,14.48169,14.470448,14.425882,14.342358,14.283796,14.309959,14.370282,14.401954,9.74698,5.642167
std,0.0,2.998659,16.210084,0.499613,0.17301,0.117978,0.483808,0.079228,0.498981,0.637311,...,29.737938,29.743221,29.764113,29.814385,29.836409,29.824874,29.796868,29.798552,26.86216,16.506973
min,2011.0,201103.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
25%,2011.0,201103.0,12.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
50%,2011.0,201109.0,25.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0
75%,2011.0,201109.0,36.0,2.0,0.0,0.0,1.0,0.0,1.0,0.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,6.0
max,2011.0,201109.0,72.0,2.0,1.0,1.0,1.0,1.0,1.0,3.0,...,77.0,77.0,77.0,77.0,77.0,77.0,77.0,77.0,99.0,99.0


In [14]:
# Checking to see if unique IDs are repeated. Then need to determine if repeat is just a duplicate or means something

ID_count = dict(cohort_1.StFCID.value_counts())
set(cohort_1.StFCID.value_counts())

{1, 2}

In [15]:
duplicate_count = pd.DataFrame.from_dict(ID_count, orient='index')
duplicate_count[0].value_counts()

2    59411
1    54362
Name: 0, dtype: int64

##### Based on the information above: 
    
    - There are 113,773 unique IDs spread across 173,184 rows of data.
    
    - Of those IDs, 59,411 appear twice (118,822 rows) and 54,362 appear once.

______________________________________________________________________
In the next two cells, I check the duplicate counts in two different ways by using 
the 'keep' keyword argument of the 'DataFrame.duplicated' method. For more information, see the documentation: 
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.duplicated.html
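To make the difference between the two 'keep' settings concrete, here is a minimal sketch on a made-up dataframe. The `toy` frame and its values are assumptions for illustration only; just the `StFCID` and `RepDate` column names come from the real data.

```python
import pandas as pd

# Toy stand-in for cohort_1: IDs "a" and "b" are reported twice, "c" once.
toy = pd.DataFrame({
    "StFCID": ["a", "a", "b", "b", "c"],
    "RepDate": [201103, 201109, 201103, 201109, 201103],
})

# keep=False flags every row whose ID occurs more than once.
all_repeats = toy.duplicated(["StFCID"], keep=False)

# keep='first' leaves the first occurrence unflagged and flags only later ones.
later_repeats = toy.duplicated(["StFCID"], keep="first")

n_rows_in_repeated_ids = int(all_repeats.sum())   # rows belonging to a repeated ID
n_extra_rows = int(later_repeats.sum())           # rows beyond the first per ID
n_unique_ids = int((~later_repeats).sum())        # one unflagged row per ID
```

With `keep=False` the True count is the number of rows that belong to repeated IDs, while with `keep='first'` the False count equals the number of unique IDs.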


In [16]:
dup_check_dict = dict(cohort_1.duplicated(['StFCID'],keep=False))

dup_count2 = pd.DataFrame.from_dict(dup_check_dict, orient='index')
dup_count2[0].value_counts()

True     118822
False     54362
Name: 0, dtype: int64

In [17]:
dup_check_dict = dict(cohort_1.duplicated(['StFCID'],keep='first'))

dup_count2 = pd.DataFrame.from_dict(dup_check_dict, orient='index')
dup_count2[0].value_counts()

False    113773
True      59411
Name: 0, dtype: int64

In [18]:
# Adding a new boolean column to the services population dataframe that flags 
# rows whose unique ID appears more than once (False=ID appears once; True=ID is repeated).
# Then sorting the repeated rows by the unique ID and RepDate columns

cohort_1['duplicated'] = cohort_1.duplicated(['StFCID'], keep=False)
cohort_1_dups = cohort_1[cohort_1['duplicated'] == True]
cohort_1_dups = cohort_1_dups.sort_values(['StFCID', 'RepDate'], ascending=[True, True])


In [19]:
# So, after examining resulting tables from cell above, I determined that the repeats are 
# those youth who were reported in both reporting periods for FY 2011 (RepDate: 201103 or 201109).
# Some repeats are duplicates of each other (containing the same information in both rows), 
# while some repeats need to be reviewed further.

# First, I will separate the repeats that are duplicates, drop one of the two rows for every unique ID, and merge 
# the remaining data rows with the non-duplicated data from cohort_1.

cohort_1_dups_b = cohort_1_dups.drop(['RepDate', 'AgeMP'], axis=1)

cohort_1_dups['duplicated_4_sure'] = cohort_1_dups_b.duplicated(keep=False)
cohort_1_dups.duplicated_4_sure.value_counts()


False    93602
True     25220
Name: duplicated_4_sure, dtype: int64
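The check above compares rows after setting aside the columns that are expected to differ between reporting periods. A small sketch of the same idea follows, assuming a made-up frame `toy_dups` with a single service column; the values are hypothetical.

```python
import pandas as pd

toy_dups = pd.DataFrame({
    "StFCID":   ["a", "a", "b", "b"],
    "RepDate":  [201103, 201109, 201103, 201109],
    "AgeMP":    [17, 18, 17, 17],
    "MentorSv": [1, 1, 0, 1],
})

# Set aside the columns that naturally change from one reporting period to the next.
comparable = toy_dups.drop(["RepDate", "AgeMP"], axis=1)

# True where some other row matches on every remaining column.
toy_dups["exact_repeat"] = comparable.duplicated(keep=False)
```

Here the two "a" rows are flagged as exact repeats, while the two "b" rows are not because `MentorSv` differs between them.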

In [20]:
cohort_1_dups_2bAdded = cohort_1_dups[cohort_1_dups['duplicated_4_sure'] == True]
cohort_1_dups_2bAdded = cohort_1_dups_2bAdded.drop_duplicates('StFCID')
cohort_1_dups_2bAdded.duplicated_4_sure.value_counts()


True    12610
Name: duplicated_4_sure, dtype: int64

In [21]:
# Merging remaining data rows that are now unique and 
# adding to new dataframe with other data rows that were already unique...

cohort_1_NoDups = cohort_1[cohort_1['duplicated'] == False]
cohort_1_NoDups['duplicated'].value_counts()

False    54362
Name: duplicated, dtype: int64

In [22]:
cohort_1_dups_2bAdded = cohort_1_dups_2bAdded.drop(['duplicated','duplicated_4_sure'], axis=1)
cohort_1_NoDups = cohort_1_NoDups.drop('duplicated',axis=1)

frames = [cohort_1_dups_2bAdded, cohort_1_NoDups]

cohort_1_NoDups = pd.concat(frames)
cohort_1_NoDups.describe()

Unnamed: 0,FY,RepDate,StFIPS,Sex,AmIAKN,Asian,BlkAfrAm,HawaiiPI,White,HisOrgin,...,HousEdSv,HlthEdSv,FamSuppSv,MentorSv,SILsv,RmBrdFASv,EducFinaSv,OthrFinaSv,Race,RaceEthn
count,66972.0,66972.0,66972.0,66972.0,66972.0,66972.0,66972.0,66972.0,66972.0,66972.0,...,66972.0,66972.0,66972.0,66972.0,66972.0,66972.0,66972.0,66972.0,66972.0,66972.0
mean,2011.0,201105.667801,25.980216,1.504823,0.030684,0.013767,0.375231,0.006555,0.513976,0.30023,...,18.860628,18.860793,18.81767,18.749895,18.701398,18.721167,18.773338,18.803425,11.020426,6.4346
std,0.0,2.981573,15.436015,0.49998,0.172463,0.116523,0.484186,0.080698,0.499808,0.651068,...,32.871587,32.871495,32.895545,32.933194,32.956046,32.953075,32.924173,32.915421,28.697648,18.517889
min,2011.0,201103.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
25%,2011.0,201103.0,12.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
50%,2011.0,201103.0,27.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0
75%,2011.0,201109.0,36.0,2.0,0.0,0.0,1.0,0.0,1.0,0.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,6.0
max,2011.0,201109.0,72.0,2.0,1.0,1.0,1.0,1.0,1.0,3.0,...,77.0,77.0,77.0,77.0,77.0,77.0,77.0,77.0,99.0,99.0


In [23]:
# Now I need to review the remaining repeated IDs, and determine which data row I should keep. 
# For each unique ID, I will keep the data row with the larger number of services received.

cohort_1_reviewDups = cohort_1_dups[cohort_1_dups['duplicated_4_sure'] == False]
services = cohort_1_reviewDups[['TribeSv','DelinqntSv', 'SpecEdSv', 'ILNAsv', 'AcSuppSv', \
            'PSEdSuppSv', 'CareerSv', 'EmplyTrSv', 'BudgetSv', 'HousEdSv', 'HlthEdSv', \
            'FamSuppSv', 'MentorSv', 'SILsv', 'RmBrdFASv', 'EducFinaSv', 'OthrFinaSv']]

services_count = services.T

cohort_1_reviewDups['Num_services'] = (services_count == 1).sum()
cohort_1_reviewDups.Num_services.value_counts()


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


1     16065
2     16010
3     12406
4      9290
5      7443
6      6699
7      5745
0      4875
8      4628
9      3645
10     2399
11     1652
12     1086
13      832
14      598
15      203
16       21
17        5
Name: Num_services, dtype: int64
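The same per-row count can be computed without transposing, and taking an explicit copy of the subset should avoid the SettingWithCopyWarning shown above. This is a sketch under assumptions: the `toy` frame is made up and `service_cols` is a shortened, illustrative list of the service columns.

```python
import pandas as pd

toy = pd.DataFrame({
    "StFCID":   ["a", "a", "b", "b"],
    "MentorSv": [1, 0, 1, 77],
    "BudgetSv": [1, 1, 0, 77],
    "SILsv":    [0, 0, 1, 1],
})
service_cols = ["MentorSv", "BudgetSv", "SILsv"]  # shortened list, for illustration

# .copy() makes the subset an independent frame, so adding a column to it
# is not an assignment on a slice of the original.
review = toy[toy["StFCID"].isin(["a", "b"])].copy()

# Count, row by row, how many service columns equal 1 (other codes such as 77 are not counted).
review["Num_services"] = review[service_cols].eq(1).sum(axis=1)
```

Only cells equal to 1 are counted, which matches the `(services_count == 1).sum()` logic in the cell above.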

In [24]:
# Sorting values based on the Unique ID and Num_services columns
# Then, for each unique ID, dropping the row with the smaller value in the Num_services column

cohort_1_reviewDups = cohort_1_reviewDups.sort_values(['StFCID', 'Num_services'], ascending=[True, True])
cohort_1_DupsKeep = cohort_1_reviewDups.drop_duplicates('StFCID',keep='last')
cohort_1_DupsKeep.describe()

Unnamed: 0,FY,RepDate,StFIPS,Sex,AmIAKN,Asian,BlkAfrAm,HawaiiPI,White,HisOrgin,...,HlthEdSv,FamSuppSv,MentorSv,SILsv,RmBrdFASv,EducFinaSv,OthrFinaSv,Race,RaceEthn,Num_services
count,46801.0,46801.0,46801.0,46801.0,46801.0,46801.0,46801.0,46801.0,46801.0,46801.0,...,46801.0,46801.0,46801.0,46801.0,46801.0,46801.0,46801.0,46801.0,46801.0,46801.0
mean,2011.0,201107.161706,25.759129,1.532895,0.034187,0.015299,0.361894,0.006795,0.556398,0.295336,...,4.210017,4.151514,4.005149,3.922032,3.958655,4.04923,4.117968,8.786329,4.369287,5.012072
std,0.0,2.765972,17.179924,0.498922,0.181712,0.12274,0.480553,0.08215,0.496814,0.64171,...,16.574347,16.587335,16.645194,16.654974,16.632378,16.613285,16.68486,25.319119,12.502751,3.373394
min,2011.0,201103.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
25%,2011.0,201103.0,12.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,2.0
50%,2011.0,201109.0,21.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,4.0
75%,2011.0,201109.0,40.0,2.0,0.0,0.0,1.0,0.0,1.0,0.0,...,1.0,1.0,1.0,0.0,0.0,1.0,1.0,2.0,5.0,7.0
max,2011.0,201109.0,72.0,2.0,1.0,1.0,1.0,1.0,1.0,3.0,...,77.0,77.0,77.0,77.0,77.0,77.0,77.0,99.0,99.0,17.0
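The sort-then-drop pattern used here can be seen on a small example. The frame below is hypothetical; only the column names mirror the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "StFCID":       ["a", "a", "b", "b"],
    "RepDate":      [201103, 201109, 201103, 201109],
    "Num_services": [2, 5, 4, 1],
})

# Sort so that, within each ID, the row with the most services comes last...
ordered = toy.sort_values(["StFCID", "Num_services"], ascending=[True, True])

# ...then keep only that last row for each ID.
kept = ordered.drop_duplicates("StFCID", keep="last")

kept_rep_dates = dict(zip(kept["StFCID"], kept["RepDate"]))
```

When both rows for an ID have the same `Num_services`, which one survives depends on the sort order; adding a third sort key such as `RepDate` would make that tie-break explicit.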


In [25]:
cohort_1_DupsKeep.columns

Index([u'FY', u'RepDate', u'StFIPS', u'St', u'DOB', u'Sex', u'AmIAKN',
       u'Asian', u'BlkAfrAm', u'HawaiiPI', u'White', u'RaceUnkn', u'RaceDcln',
       u'HisOrgin', u'FCStatSv', u'TribeSv', u'DelinqntSv', u'EdLevlSv',
       u'SpecEdSv', u'ILNAsv', u'AcSuppSv', u'PSEdSuppSv', u'CareerSv',
       u'EmplyTrSv', u'BudgetSv', u'HousEdSv', u'HlthEdSv', u'FamSuppSv',
       u'MentorSv', u'SILsv', u'RmBrdFASv', u'EducFinaSv', u'OthrFinaSv',
       u'StFCID', u'Race', u'RaceEthn', u'AgeMP', u'duplicated',
       u'duplicated_4_sure', u'Num_services'],
      dtype='object')

In [26]:
# Merging remaining data rows that are now unique and 
# adding to cohort_1_NoDups dataframe
# Resulting df is the Services Population for FY 2011 data set

cohort_1_DupsKeep = cohort_1_DupsKeep.drop(['Num_services','duplicated','duplicated_4_sure'], axis=1)

frames = [cohort_1_DupsKeep, cohort_1_NoDups]
cohort_1_NoDups = pd.concat(frames)
cohort_1_NoDups.describe()

Unnamed: 0,FY,RepDate,StFIPS,Sex,AmIAKN,Asian,BlkAfrAm,HawaiiPI,White,HisOrgin,...,HousEdSv,HlthEdSv,FamSuppSv,MentorSv,SILsv,RmBrdFASv,EducFinaSv,OthrFinaSv,Race,RaceEthn
count,113773.0,113773.0,113773.0,113773.0,113773.0,113773.0,113773.0,113773.0,113773.0,113773.0,...,113773.0,113773.0,113773.0,113773.0,113773.0,113773.0,113773.0,113773.0,113773.0,113773.0
mean,2011.0,201106.282325,25.889271,1.51637,0.032125,0.014397,0.369745,0.006654,0.531427,0.298217,...,12.844796,12.834135,12.784685,12.684582,12.621843,12.648546,12.716514,12.762501,10.101421,5.585025
std,0.0,2.986699,16.176449,0.499734,0.176334,0.119122,0.482738,0.081298,0.499014,0.647237,...,28.297735,28.302384,28.323883,28.373436,28.395768,28.386166,28.356981,28.363432,27.380399,16.345857
min,2011.0,201103.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
25%,2011.0,201103.0,12.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
50%,2011.0,201109.0,25.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0
75%,2011.0,201109.0,36.0,2.0,0.0,0.0,1.0,0.0,1.0,0.0,...,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,2.0,6.0
max,2011.0,201109.0,72.0,2.0,1.0,1.0,1.0,1.0,1.0,3.0,...,77.0,77.0,77.0,77.0,77.0,77.0,77.0,77.0,99.0,99.0
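As a sanity check on the combined frame, the three pieces should add up to one row per unique ID: 12,610 exact repeats kept + 46,801 reviewed repeats kept + 54,362 IDs that never repeated = 113,773 rows. A sketch of that check on made-up frames (the frame names and the `source` column are assumptions for illustration):

```python
import pandas as pd

exact_kept = pd.DataFrame({"StFCID": ["a"], "source": ["exact repeat"]})
reviewed_kept = pd.DataFrame({"StFCID": ["b"], "source": ["most services"]})
never_repeated = pd.DataFrame({"StFCID": ["c", "d"], "source": ["single", "single"]})

combined = pd.concat([exact_kept, reviewed_kept, never_repeated], ignore_index=True)

# After de-duplication, every ID should appear exactly once...
ids_are_unique = combined["StFCID"].is_unique

# ...and no rows should be lost or gained by the concatenation.
expected_rows = len(exact_kept) + len(reviewed_kept) + len(never_repeated)
```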


In [27]:
cohort_1_services_pop = cohort_1_NoDups.copy()

In [28]:
cohort_1_services_pop.to_csv('~/Desktop/capstone_clean_data/cohort_1_services_pop.csv')

In [29]:
# Now to get services N without data from states that have bad unique ID encodings...

# Issue with encoding the tracking ID (StFCID). Need to drop states that have the issue: HI, IN, KY, MS, OR, TX, TN

bad_encode_states = ["HI", "IN", "KY", "MS", "OR", "TX", "TN"]
cohort_1_services_pop_noBadCodeSt = cohort_1_NoDups.loc[~cohort_1_NoDups['St'].isin(bad_encode_states)]
cohort_1_services_pop_noBadCodeSt.describe()


Unnamed: 0,FY,RepDate,StFIPS,Sex,AmIAKN,Asian,BlkAfrAm,HawaiiPI,White,HisOrgin,...,HousEdSv,HlthEdSv,FamSuppSv,MentorSv,SILsv,RmBrdFASv,EducFinaSv,OthrFinaSv,Race,RaceEthn
count,97160.0,97160.0,97160.0,97160.0,97160.0,97160.0,97160.0,97160.0,97160.0,97160.0,...,97160.0,97160.0,97160.0,97160.0,97160.0,97160.0,97160.0,97160.0,97160.0,97160.0
mean,2011.0,201106.266035,24.450185,1.511671,0.034592,0.013843,0.383007,0.004179,0.508419,0.305445,...,14.944267,14.939255,14.886569,14.821912,14.756114,14.774979,14.838174,14.863864,11.11055,5.772077
std,0.0,2.988196,16.151715,0.499866,0.182746,0.11684,0.486123,0.064508,0.499932,0.652896,...,30.108971,30.111375,30.136583,30.167364,30.198514,30.189601,30.159638,30.147412,28.783247,16.743587
min,2011.0,201103.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
25%,2011.0,201103.0,8.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
50%,2011.0,201109.0,24.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0
75%,2011.0,201109.0,36.0,2.0,0.0,0.0,1.0,0.0,1.0,0.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,6.0
max,2011.0,201109.0,72.0,2.0,1.0,1.0,1.0,1.0,1.0,3.0,...,77.0,77.0,77.0,77.0,77.0,77.0,77.0,77.0,99.0,99.0
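The state filter removes 113,773 - 97,160 = 16,613 rows. A sketch of the same `isin` filter on a toy frame follows, with a per-state tally of what gets dropped; the frame and its values are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "St":     ["CA", "TX", "NY", "HI"],
    "StFCID": ["1", "2", "3", "4"],
})
bad_encode_states = ["HI", "IN", "KY", "MS", "OR", "TX", "TN"]

# Boolean mask: True for rows from states whose IDs cannot be tracked.
is_bad = toy["St"].isin(bad_encode_states)

# .copy() gives an independent frame, so later in-place changes do not touch toy.
kept = toy.loc[~is_bad].copy()

# How many rows each excluded state contributes.
dropped_per_state = toy.loc[is_bad, "St"].value_counts()
```

Taking a copy at this step may also avoid the SettingWithCopyWarning that appears later when the filtered frame is renamed in place.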


In [30]:
cohort_1_services_pop_noBadCodeSt.to_csv('~/Desktop/capstone_clean_data/cohort_1_services_pop_noBadCodeSt.csv')

---------------------------------------------------------------------------------------------------------------------
- Target population: FY 2011 Services Population = 113773 total foster youth who received services in FY 2011

- Target population for this project: FY 2011 Services Population, without states with bad encodings of unique ID (thus excluded from analysis) = 97160 total foster youth who received services in FY 2011 (excluding certain states)


Now I will make a dataset for Cohort 1, Baseline Population (N)....
---------------------------------------------------------------------------------------------------------------

In [31]:
outcomes_2013.Wave.value_counts()

Wave 1: Age 17 Baseline Survey    28635
Wave 2: Age 19 Followup           15235
Name: Wave, dtype: int64

In [32]:
# From outcomes_2013 dataset, I only want to examine data for foster youth in Wave 1.
# Resulting df is the Cohort 1 Baseline Population (FY 2011) data set

# I will do an inner join of the outcomes data from FY 2011 (Wave 1) 
# with services and demographics data from FY 2011, joining on unique ID

cohort_1_services_pop_noBadCodeSt.rename(columns={'RepDate':'RepDate_services'}, inplace=True)

outcomes_2013_keep = outcomes_2013.drop(['Weight','StFIPS','St','RecNumbr','dob','Sex','AmIAKN','Asian','BlkAfrAm',\
                                         'HawaiiPI','White','RaceUnkn','RaceDcln','HisOrgin'], axis=1)

outcomes_2013_keep.rename(columns={'RepDate':'RepDate_outcomes', 'StChID': 'StFCID'}, inplace=True)

cohort1_baseline = pd.merge(outcomes_2013_keep, cohort_1_services_pop_noBadCodeSt, on='StFCID', how='inner')
cohort1_baseline = cohort1_baseline[cohort1_baseline.Wave == 'Wave 1: Age 17 Baseline Survey']


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  **kwargs)


In [33]:
cohort1_baseline.describe()

Unnamed: 0,FY,RepDate_services,StFIPS,Sex,AmIAKN,Asian,BlkAfrAm,HawaiiPI,White,HisOrgin,...,HousEdSv,HlthEdSv,FamSuppSv,MentorSv,SILsv,RmBrdFASv,EducFinaSv,OthrFinaSv,Race,RaceEthn
count,13775.0,13775.0,13775.0,13775.0,13775.0,13775.0,13775.0,13775.0,13775.0,13775.0,...,13775.0,13775.0,13775.0,13775.0,13775.0,13775.0,13775.0,13775.0,13775.0,13775.0
mean,2011.0,201106.728058,23.929873,1.504247,0.034773,0.0151,0.352886,0.005009,0.527187,0.328639,...,10.84196,10.841815,10.77735,10.678984,10.578947,10.571833,10.6098,10.672886,12.183013,5.164283
std,0.0,2.91042,17.121552,0.5,0.183211,0.121954,0.477885,0.0706,0.499278,0.670547,...,26.298196,26.298253,26.323514,26.36171,26.400122,26.402837,26.388322,26.364064,30.198332,14.799711
min,2011.0,201103.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
25%,2011.0,201103.0,6.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
50%,2011.0,201109.0,22.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0
75%,2011.0,201109.0,36.0,2.0,0.0,0.0,1.0,0.0,1.0,1.0,...,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,2.0,6.0
max,2011.0,201109.0,72.0,2.0,1.0,1.0,1.0,1.0,1.0,3.0,...,77.0,77.0,77.0,77.0,77.0,77.0,77.0,77.0,99.0,99.0
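An inner join silently drops outcomes records whose ID has no match in the services data. One way to see how many are lost is a left join with pandas' `indicator` and `validate` options, sketched below on made-up frames. This is an illustrative alternative, not the join used above.

```python
import pandas as pd

outcomes = pd.DataFrame({
    "StFCID": ["a", "a", "b", "z"],
    "Wave": ["Wave 1: Age 17 Baseline Survey", "Wave 2: Age 19 Followup",
             "Wave 1: Age 17 Baseline Survey", "Wave 1: Age 17 Baseline Survey"],
})
services = pd.DataFrame({"StFCID": ["a", "b", "c"], "St": ["CA", "NY", "WA"]})

# validate="many_to_one" raises an error if an ID appears more than once in services.
# indicator=True adds a _merge column saying where each row found a match.
checked = pd.merge(outcomes, services, on="StFCID", how="left",
                   validate="many_to_one", indicator=True)

# Outcomes rows with no services record (an inner join would drop these).
unmatched = checked[checked["_merge"] == "left_only"]

# Equivalent of the inner join followed by the Wave 1 filter.
baseline = checked[(checked["_merge"] == "both") &
                   (checked["Wave"] == "Wave 1: Age 17 Baseline Survey")]
```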


In [34]:
cohort1_baseline.to_csv('~/Desktop/capstone_clean_data/cohort1_baseline.csv')

---------------------------------------------------------------------------------------------------------------------
<b>- Sampling Frame: Cohort 1 Baseline Population = N = 13775 total foster youth who were eligible for outcomes surveys</b>

Now I will make a dataset for Cohort 1, Wave 2 Population (n) -- which is the study sample....
---------------------------------------------------------------------------------------------------------------

In [35]:
outcomes_2013.Wave.value_counts()

Wave 1: Age 17 Baseline Survey    28635
Wave 2: Age 19 Followup           15235
Name: Wave, dtype: int64

In [38]:
# For now, I only want to examine data for foster youth who participated in the Wave 2 FU survey
# The project analysis and predictive model will be based on this Wave 2 population/dataset

# I will do an inner join of the outcomes data from FY 2013 (Wave 2) 
# with services and demographics data from FY 2011, joining on unique ID

cohort_1_services_pop_noBadCodeSt.rename(columns={'RepDate':'RepDate_services'}, inplace=True)

outcomes_2013_keep = outcomes_2013.drop(['Weight','StFIPS','St','RecNumbr','dob','Sex','AmIAKN','Asian','BlkAfrAm',\
                                         'HawaiiPI','White','RaceUnkn','RaceDcln','HisOrgin'], axis=1)

outcomes_2013_keep.rename(columns={'RepDate':'RepDate_outcomes', 'StChID': 'StFCID'}, inplace=True)

cohort1_wave2 = pd.merge(outcomes_2013_keep, cohort_1_services_pop_noBadCodeSt, on='StFCID', how='inner')
cohort1_wave2 = cohort1_wave2[cohort1_wave2.Wave == 'Wave 2: Age 19 Followup']


In [39]:
cohort1_wave2.describe()

Unnamed: 0,FY,RepDate_services,StFIPS,Sex,AmIAKN,Asian,BlkAfrAm,HawaiiPI,White,HisOrgin,...,HousEdSv,HlthEdSv,FamSuppSv,MentorSv,SILsv,RmBrdFASv,EducFinaSv,OthrFinaSv,Race,RaceEthn
count,7432.0,7432.0,7432.0,7432.0,7432.0,7432.0,7432.0,7432.0,7432.0,7432.0,...,7432.0,7432.0,7432.0,7432.0,7432.0,7432.0,7432.0,7432.0,7432.0,7432.0
mean,2011.0,201106.993003,24.903122,1.508746,0.042653,0.014128,0.335845,0.00592,0.572928,0.289693,...,3.312702,3.328983,3.255786,3.145587,3.008477,3.017626,3.060684,3.131862,9.295748,3.967169
std,0.0,2.831082,16.947751,0.499957,0.202088,0.118027,0.472317,0.076721,0.494686,0.648725,...,14.669816,14.666685,14.680616,14.700877,14.724896,14.723334,14.715906,14.703341,26.149488,11.277109
min,2011.0,201103.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
25%,2011.0,201103.0,8.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
50%,2011.0,201109.0,22.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0
75%,2011.0,201109.0,38.0,2.0,0.0,0.0,1.0,0.0,1.0,0.0,...,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,2.0,3.0
max,2011.0,201109.0,72.0,2.0,1.0,1.0,1.0,1.0,1.0,3.0,...,77.0,77.0,77.0,77.0,77.0,77.0,77.0,77.0,99.0,99.0


---------------------------------------------------------------------------------------------------------------------

Need to clean this Wave 2 dataset....
--------------------------------------------------------------------------------------------------------------

In [40]:
# Need to figure out how to treat missing values....

dict(cohort1_wave2.isnull().sum())

{'AcSuppSv': 0,
 'AgeMP': 0,
 'AmIAKN': 0,
 'Asian': 0,
 'Baseline': 0,
 'BlkAfrAm': 0,
 'BudgetSv': 0,
 'CareerSv': 0,
 'Children': 78,
 'CnctAdult': 78,
 'CurrFTE': 78,
 'CurrPTE': 78,
 'CurrenRoll': 78,
 'DOB': 0,
 'DelinqntSv': 0,
 'EdLevlSv': 0,
 'EducAid': 78,
 'EducFinaSv': 0,
 'Elig19': 0,
 'EmplySklls': 78,
 'EmplyTrSv': 0,
 'FCStatSv': 0,
 'FY': 0,
 'FY11Cohort': 0,
 'FamSuppSv': 0,
 'HawaiiPI': 0,
 'HighEdCert': 78,
 'HisOrgin': 0,
 'HlthEdSv': 0,
 'Homeless': 78,
 'HousEdSv': 0,
 'ILNAsv': 0,
 'InSample': 0,
 'Incarc': 78,
 'Marriage': 78,
 'Medicaid': 78,
 'MedicalIn': 78,
 'MentlHlthIn': 78,
 'MentorSv': 0,
 'OthrFinAs': 78,
 'OthrFinaSv': 0,
 'OthrHlthIn': 78,
 'OutcmDte': 0,
 'OutcmFCS': 78,
 'OutcmRpt': 78,
 'PSEdSuppSv': 0,
 'PrescripIn': 78,
 'PubFinAs': 78,
 'PubFoodAs': 78,
 'PubHousAs': 78,
 'Race': 0,
 'RaceDcln': 0,
 'RaceEthn': 0,
 'RaceUnkn': 0,
 'RepDate_outcomes': 0,
 'RepDate_services': 0,
 'Responded': 0,
 'RmBrdFASv': 0,
 'SILsv': 0,
 'SampleState': 0,
 '

In [41]:
# For now, I am dropping rows that have NaNs in 10+ columns, using HighEdCert as the indicator column (78 rows)....

cohort1_wave2 = cohort1_wave2[cohort1_wave2.HighEdCert.notnull()]
# cohort_1 = cohort_1[cohort_1.PubHousAs.notnull()]
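# A more direct version of the "NaNs in 10+ columns" rule counts the missing
# values in each row. This is a sketch on a made-up frame; max_missing is a
# hypothetical threshold (a threshold based on 10 would mirror the rule stated above).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "StFCID":     ["a", "b", "c"],
    "HighEdCert": ["yes", np.nan, "no"],
    "Homeless":   ["no", np.nan, np.nan],
    "Incarc":     ["no", np.nan, "no"],
})

# Number of missing values in each row.
nan_per_row = toy.isnull().sum(axis=1)

max_missing = 1  # hypothetical threshold for this toy frame
kept = toy[nan_per_row <= max_missing]
```

# Rows with only an isolated missing value (like "c") are kept, while rows
# that are missing across the board (like "b") are dropped.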

In [42]:
dict(cohort_1.isnull().sum())

{'AcSuppSv': 0,
 'AgeMP': 0,
 'AmIAKN': 0,
 'Asian': 0,
 'BlkAfrAm': 0,
 'BudgetSv': 0,
 'CareerSv': 0,
 'DOB': 0,
 'DelinqntSv': 0,
 'EdLevlSv': 0,
 'EducFinaSv': 0,
 'EmplyTrSv': 0,
 'FCStatSv': 0,
 'FY': 0,
 'FamSuppSv': 0,
 'HawaiiPI': 0,
 'HisOrgin': 0,
 'HlthEdSv': 0,
 'HousEdSv': 0,
 'ILNAsv': 0,
 'MentorSv': 0,
 'OthrFinaSv': 0,
 'PSEdSuppSv': 0,
 'Race': 0,
 'RaceDcln': 0,
 'RaceEthn': 0,
 'RaceUnkn': 0,
 'RepDate': 0,
 'RmBrdFASv': 0,
 'SILsv': 0,
 'Sex': 0,
 'SpecEdSv': 0,
 'St': 0,
 'StFCID': 0,
 'StFIPS': 0,
 'TribeSv': 0,
 'White': 0,
 'duplicated': 0}

In [43]:
dict(cohort1_wave2.dtypes)

{'AcSuppSv': dtype('int64'),
 'AgeMP': dtype('O'),
 'AmIAKN': dtype('int64'),
 'Asian': dtype('int64'),
 'Baseline': dtype('O'),
 'BlkAfrAm': dtype('int64'),
 'BudgetSv': dtype('int64'),
 'CareerSv': dtype('int64'),
 'Children': dtype('O'),
 'CnctAdult': dtype('O'),
 'CurrFTE': dtype('O'),
 'CurrPTE': dtype('O'),
 'CurrenRoll': dtype('O'),
 'DOB': dtype('O'),
 'DelinqntSv': dtype('int64'),
 'EdLevlSv': dtype('O'),
 'EducAid': dtype('O'),
 'EducFinaSv': dtype('int64'),
 'Elig19': dtype('O'),
 'EmplySklls': dtype('O'),
 'EmplyTrSv': dtype('int64'),
 'FCStatSv': dtype('int64'),
 'FY': dtype('int64'),
 'FY11Cohort': dtype('O'),
 'FamSuppSv': dtype('int64'),
 'HawaiiPI': dtype('int64'),
 'HighEdCert': dtype('O'),
 'HisOrgin': dtype('int64'),
 'HlthEdSv': dtype('int64'),
 'Homeless': dtype('O'),
 'HousEdSv': dtype('int64'),
 'ILNAsv': dtype('int64'),
 'InSample': dtype('O'),
 'Incarc': dtype('O'),
 'Marriage': dtype('O'),
 'Medicaid': dtype('O'),
 'MedicalIn': dtype('O'),
 'MentlHlthIn': dty

In [44]:
# Need to convert AgeMP and EdLevlSv to int

cohort1_wave2['AgeMP'] = cohort1_wave2['AgeMP'].astype(int)
cohort1_wave2['EdLevlSv'] = cohort1_wave2['EdLevlSv'].astype(int)

In [45]:
# Fix mixed types in RaceDcln, RaceUnkn (need to be type int):

def treat_blank_strings(row_with_string):
    """Find blank strings ("") and convert to value 77 (which represents blank in raw data codebook). 
    Also, convert all values to type int."""
    if type(row_with_string) == int:
        return row_with_string
    elif row_with_string == "0":
        return 0
    elif row_with_string == "1":
        return 1
    else:
        return 77
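
The recoding above relies on `type(x) == int`, which may not match NumPy integer types in every environment. A minimal sketch of a more defensive variant follows. The name `recode_flag` and the use of `numbers.Integral` are my own assumptions, not part of the original notebook, and unlike the original it also strips surrounding whitespace before comparing.

```python
import numbers

BLANK_CODE = 77  # codebook value for "blank", as used in the cell above


def recode_flag(value):
    """Return the integer flag, or 77 for blanks and anything unrecognised."""
    # Integers (including NumPy integer types) pass through unchanged
    if isinstance(value, numbers.Integral):
        return int(value)
    # Strings "0" and "1" (ignoring surrounding whitespace) become ints
    if isinstance(value, str) and value.strip() in ("0", "1"):
        return int(value.strip())
    # Blank strings and every other value map to the blank code
    return BLANK_CODE


print([recode_flag(v) for v in [0, 1, "0", "1", "", " "]])
```

It would be applied the same way as the original function, for example `cohort1_wave2['RaceDcln'].apply(recode_flag)`.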

In [46]:
print cohort1_wave2.RaceDcln.value_counts()
cohort1_wave2['RaceDcln'] = cohort1_wave2['RaceDcln'].apply(treat_blank_strings)
cohort1_wave2.RaceDcln.value_counts()

0    4670
0    1979
      282
1     269
1     154
Name: RaceDcln, dtype: int64


0     6649
1      423
77     282
Name: RaceDcln, dtype: int64

In [47]:
print cohort1_wave2.RaceUnkn.value_counts()
cohort1_wave2['RaceUnkn'] = cohort1_wave2['RaceUnkn'].apply(treat_blank_strings)
cohort1_wave2.RaceUnkn.value_counts()

0    6111
0    1087
1     102
       41
1      13
Name: RaceUnkn, dtype: int64


0     7198
1      115
77      41
Name: RaceUnkn, dtype: int64

In [48]:
#Need to convert RepDates and DOBs to datetime

cohort1_wave2.RepDate_services = pd.to_datetime(cohort1_wave2['RepDate_services'], format="%Y%m")
cohort1_wave2.DOB = pd.to_datetime(cohort1_wave2['DOB'])

In [49]:
# Need to address blanks in RepDate_outcomes column before converting to datetime....

print cohort1_wave2.RepDate_outcomes.value_counts()

def treat_blank_dates(row_with_string):
    """Find blank dates ("") and convert to value 199901 (I need to decide what to do with blanks later)."""
    if len(row_with_string) == 0:
        return int("199901")
    else:
        return row_with_string

cohort1_wave2['RepDate_outcomes'] = cohort1_wave2['RepDate_outcomes'].apply(treat_blank_dates)
print cohort1_wave2.RepDate_outcomes.value_counts()
cohort1_wave2.RepDate_outcomes = pd.to_datetime(cohort1_wave2['RepDate_outcomes'], format="%Y%m")

201309    3656
201303    3416
           282
Name: RepDate_outcomes, dtype: int64
201309    3656
201303    3416
199901     282
Name: RepDate_outcomes, dtype: int64
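
The placeholder 199901 keeps the column parseable, but it also creates a real-looking date (January 1999). As a hedged alternative, the blanks could be turned into missing timestamps instead. The sketch below uses toy data and the `errors="coerce"` option of `pd.to_datetime`; it is not what the notebook does.

```python
import pandas as pd

# Toy reporting-period strings in the same YYYYMM layout, with one blank
periods = pd.Series(["201309", "201303", ""])

# errors="coerce" turns unparseable entries (here, the blank) into NaT
parsed = pd.to_datetime(periods, format="%Y%m", errors="coerce")

print(parsed.isnull().sum())
```

With this approach the blank rows stay identifiable through `isnull()` rather than through a sentinel date.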


In [50]:
dict(cohort1_wave2.dtypes)

{'AcSuppSv': dtype('int64'),
 'AgeMP': dtype('int64'),
 'AmIAKN': dtype('int64'),
 'Asian': dtype('int64'),
 'Baseline': dtype('O'),
 'BlkAfrAm': dtype('int64'),
 'BudgetSv': dtype('int64'),
 'CareerSv': dtype('int64'),
 'Children': dtype('O'),
 'CnctAdult': dtype('O'),
 'CurrFTE': dtype('O'),
 'CurrPTE': dtype('O'),
 'CurrenRoll': dtype('O'),
 'DOB': dtype('<M8[ns]'),
 'DelinqntSv': dtype('int64'),
 'EdLevlSv': dtype('int64'),
 'EducAid': dtype('O'),
 'EducFinaSv': dtype('int64'),
 'Elig19': dtype('O'),
 'EmplySklls': dtype('O'),
 'EmplyTrSv': dtype('int64'),
 'FCStatSv': dtype('int64'),
 'FY': dtype('int64'),
 'FY11Cohort': dtype('O'),
 'FamSuppSv': dtype('int64'),
 'HawaiiPI': dtype('int64'),
 'HighEdCert': dtype('O'),
 'HisOrgin': dtype('int64'),
 'HlthEdSv': dtype('int64'),
 'Homeless': dtype('O'),
 'HousEdSv': dtype('int64'),
 'ILNAsv': dtype('int64'),
 'InSample': dtype('O'),
 'Incarc': dtype('O'),
 'Marriage': dtype('O'),
 'Medicaid': dtype('O'),
 'MedicalIn': dtype('O'),
 'Men

In [51]:
# Depending on the type of modeling techniques I will use, 
# I will need to turn the following into numeric categories: 

# Wave, OutcmRpt, OutcmFCS, CurrFTE, CurrPTE, EmplySklls, SocSecrty,
# 'EducAid','PubFinAs','PubFoodAs','PubHousAs','OthrFinAs','HighEdCert','CurrenRoll','CnctAdult',
# 'Marriage','Medicaid','OthrHlthIn','MedicalIn','MentlHlthIn','PrescripIn','Baseline','FY11Cohort',
# 'Elig19','SampleState','InSample','Responded'

In [52]:
# Need to fix encoding in column HighEdCert

cohort1_wave2.HighEdCert.value_counts()

Blank                      2829
High school diploma/GED    2508
None of the above          1708
78.0                        117
Declined                     95
Vocational certificate       62
Vocational license           19
Higher degree                 7
Associate�s degree            7
Bachelor�s degree             2
Name: HighEdCert, dtype: int64

In [53]:
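def change_weird_chars(row):
    """Replace degree labels that contain a garbled apostrophe with clean labels.
    Non-string values (such as the float 78.0 above) return None and become missing."""
    if type(row) == str:
        if 'Bachelor' in row:
            return "Bachelor's Degree"
        elif 'Associate' in row:
            return "Associate's Degree"
        else:
            return row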
    
cohort1_wave2.HighEdCert = cohort1_wave2['HighEdCert'].apply(change_weird_chars)

cohort1_wave2.HighEdCert.value_counts()

Blank                      2829
High school diploma/GED    2508
None of the above          1708
Declined                     95
Vocational certificate       62
Vocational license           19
Associate's Degree            7
Higher degree                 7
Bachelor's Degree             2
Name: HighEdCert, dtype: int64
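
The garbled apostrophe is typical of a file written in one encoding and read in another. A minimal sketch of the mechanism, assuming (without confirmation from the raw data) that the source file was saved as Windows-1252:

```python
# Assumption: the raw export was saved as Windows-1252, where byte 0x92
# is a right single quotation mark.
raw = b"Associate\x92s degree"

# Decoding as UTF-8 cannot interpret 0x92, so it becomes U+FFFD (the replacement character)
as_utf8 = raw.decode("utf-8", errors="replace")

# Decoding with the assumed source encoding recovers the apostrophe
as_cp1252 = raw.decode("cp1252")

print(as_utf8 == u"Associate\ufffds degree")
print(as_cp1252 == u"Associate\u2019s degree")
```

If that assumption holds, passing `encoding="cp1252"` to `pd.read_csv` when loading the raw file might avoid the need for a string fix afterwards.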

In [54]:
cohort1_wave2.head()

Unnamed: 0,Wave,StFCID,RepDate_outcomes,OutcmRpt,OutcmDte,OutcmFCS,CurrFTE,CurrPTE,EmplySklls,SocSecrty,...,HlthEdSv,FamSuppSv,MentorSv,SILsv,RmBrdFASv,EducFinaSv,OthrFinaSv,Race,RaceEthn,AgeMP
1,Wave 2: Age 19 Followup,AK450290395006,2013-03-01,Youth participated,2012-12-28,"No, is not in FC on Date",No,No,No,No,...,0,0,0,0,0,0,1,1,1,17
3,Wave 2: Age 19 Followup,AK450448396586,2013-03-01,Youth participated,2012-12-03,"No, is not in FC on Date",No,No,No,No,...,0,0,0,0,0,0,1,3,3,17
6,Wave 2: Age 19 Followup,AK450540097503,2013-03-01,Youth participated,2013-01-17,"No, is not in FC on Date",No,No,No,No,...,0,0,0,0,0,1,1,1,1,17
8,Wave 2: Age 19 Followup,AK450652098623,2013-03-01,Youth participated,2012-11-19,"No, is not in FC on Date",No,"Yes, employed part time",No,No,...,0,0,0,0,0,0,0,1,1,17
11,Wave 2: Age 19 Followup,AK451448406587,2013-09-01,Youth participated,2013-05-17,"No, is not in FC on Date","Yes, employed full time",No,No,No,...,0,0,0,0,0,0,1,3,3,17


In [55]:
cohort1_wave2.describe()

Unnamed: 0,FY,StFIPS,Sex,AmIAKN,Asian,BlkAfrAm,HawaiiPI,White,RaceUnkn,RaceDcln,...,HlthEdSv,FamSuppSv,MentorSv,SILsv,RmBrdFASv,EducFinaSv,OthrFinaSv,Race,RaceEthn,AgeMP
count,7354.0,7354.0,7354.0,7354.0,7354.0,7354.0,7354.0,7354.0,7354.0,7354.0,...,7354.0,7354.0,7354.0,7354.0,7354.0,7354.0,7354.0,7354.0,7354.0,7354.0
mean,2011.0,24.421539,1.507615,0.04297,0.014278,0.337775,0.005847,0.57275,0.444928,3.010199,...,3.354229,3.281208,3.177047,3.034539,3.044874,3.088251,3.159097,9.12755,3.937313,16.571118
std,0.0,16.349738,0.499976,0.202803,0.118642,0.472984,0.076248,0.494713,5.73389,14.77776,...,14.742194,14.756146,14.7754,14.800515,14.798741,14.791215,14.778642,25.87891,11.332458,0.495225
min,2011.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,16.0
25%,2011.0,8.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,16.0
50%,2011.0,22.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,17.0
75%,2011.0,37.0,2.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,...,1.0,1.0,1.0,0.0,0.0,0.0,0.0,2.0,3.0,17.0
max,2011.0,56.0,2.0,1.0,1.0,1.0,1.0,1.0,77.0,77.0,...,77.0,77.0,77.0,77.0,77.0,77.0,77.0,99.0,99.0,18.0


---------------------------------------------------------------------------------------------------------------------
- Cohort 1 Wave 2 Population = <b>n = 7354 total foster youth who participated in FY 2013 Survey</b> 

- n/N = 7354/13775 = 0.5339 ==> <b>sample is 53.4% of baseline population </b>

- n/N_no_states = 7354/97160 = 0.07569 ==> <b>sample is 7.57% of target population </b>
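
The two proportions above can be checked with a few lines of arithmetic. The counts are taken from the notes above; the variable names are only illustrative.

```python
n_wave2 = 7354        # Cohort 1 Wave 2 respondents
N_baseline = 13775    # Cohort 1 baseline population
N_no_states = 97160   # target population used as the second denominator above

# float() keeps the division exact under both Python 2 and Python 3
share_of_baseline = n_wave2 / float(N_baseline)
share_of_target = n_wave2 / float(N_no_states)

print("{:.1%} of baseline population".format(share_of_baseline))
print("{:.2%} of target population".format(share_of_target))
```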

Now that the cohort1_wave2 dataset is in a decent state, I will load it into a local Postgres DB for storage
--------------------------------------------------------------------------------------------------------------

In [57]:
engine = create_engine('postgresql://cguy@localhost:5432/nytd_clean_data')
cohort1_wave2.to_sql('cohort1_wave2', engine)
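
By default `to_sql` fails if the table already exists, so re-running the cell above would raise an error. The sketch below shows the `if_exists` and `index` options on toy data. An in-memory SQLite connection stands in for the local Postgres instance purely for illustration.

```python
import sqlite3

import pandas as pd

toy = pd.DataFrame({"StFCID": ["A", "B"], "AgeMP": [17, 16]})

# An in-memory SQLite database stands in for the local Postgres instance
conn = sqlite3.connect(":memory:")

# if_exists="replace" lets the cell be re-run; index=False skips the index column
toy.to_sql("cohort1_wave2", conn, if_exists="replace", index=False)
toy.to_sql("cohort1_wave2", conn, if_exists="replace", index=False)

row_count = pd.read_sql("SELECT COUNT(*) AS n FROM cohort1_wave2", conn)["n"][0]
print(row_count)
conn.close()
```

Note that the notebook keeps the default `index=True`, which is why an `index` column appears in the query result further down.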

In [58]:
%load_ext sql
# %reload_ext sql

  warn("IPython.utils.traitlets has moved to a top-level traitlets package.")


In [59]:
%%sql postgresql://cguy@localhost:5432/nytd_clean_data
        
SELECT * FROM cohort1_wave2 LIMIT 5;

5 rows affected.


index,Wave,StFCID,RepDate_outcomes,OutcmRpt,OutcmDte,OutcmFCS,CurrFTE,CurrPTE,EmplySklls,SocSecrty,EducAid,PubFinAs,PubFoodAs,PubHousAs,OthrFinAs,HighEdCert,CurrenRoll,CnctAdult,Homeless,SubAbuse,Incarc,Children,Marriage,Medicaid,OthrHlthIn,MedicalIn,MentlHlthIn,PrescripIn,SampleState,InSample,Baseline,FY11Cohort,Elig19,Responded,FY,RepDate_services,StFIPS,St,DOB,Sex,AmIAKN,Asian,BlkAfrAm,HawaiiPI,White,RaceUnkn,RaceDcln,HisOrgin,FCStatSv,TribeSv,DelinqntSv,EdLevlSv,SpecEdSv,ILNAsv,AcSuppSv,PSEdSuppSv,CareerSv,EmplyTrSv,BudgetSv,HousEdSv,HlthEdSv,FamSuppSv,MentorSv,SILsv,RmBrdFASv,EducFinaSv,OthrFinaSv,Race,RaceEthn,AgeMP
1,Wave 2: Age 19 Followup,AK450290395006,2013-03-01 00:00:00,Youth participated,2012-12-28 00:00:00,"No, is not in FC on Date",No,No,No,No,No,No,Yes,No,No,None of the above,No,Yes,Yes,No,No,No,Not Applicable,Yes,No,Not Applicable,Not Applicable,Not Applicable,No,No,Yes,Yes,Yes,Responded to Survey,2011,2011-09-01 00:00:00,2,AK,1993-10-15 00:00:00,2,0,0,0,0,1,0,0,0,1,0,0,11,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,1,1,17
3,Wave 2: Age 19 Followup,AK450448396586,2013-03-01 00:00:00,Youth participated,2012-12-03 00:00:00,"No, is not in FC on Date",No,No,No,No,No,No,No,No,No,High school diploma/GED,No,Yes,No,No,No,No,Not Applicable,No,No,Not Applicable,Not Applicable,Not Applicable,No,No,Yes,Yes,Yes,Responded to Survey,2011,2011-03-01 00:00:00,2,AK,1993-12-15 00:00:00,2,1,0,0,0,0,0,0,0,1,1,0,77,77,1,0,0,0,0,0,0,0,0,0,0,0,0,1,3,3,17
6,Wave 2: Age 19 Followup,AK450540097503,2013-03-01 00:00:00,Youth participated,2013-01-17 00:00:00,"No, is not in FC on Date",No,No,No,No,Yes,No,Yes,No,No,High school diploma/GED,Yes,Yes,No,Yes,Yes,No,Not Applicable,Yes,No,Not Applicable,Not Applicable,Not Applicable,No,No,Yes,Yes,Yes,Responded to Survey,2011,2011-09-01 00:00:00,2,AK,1993-10-15 00:00:00,2,0,0,0,0,1,0,0,0,1,0,0,11,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,17
8,Wave 2: Age 19 Followup,AK450652098623,2013-03-01 00:00:00,Youth participated,2012-11-19 00:00:00,"No, is not in FC on Date",No,"Yes, employed part time",No,No,No,No,Yes,No,No,High school diploma/GED,No,Yes,Yes,No,No,Yes,No,Yes,No,Not Applicable,Not Applicable,Not Applicable,No,No,Yes,Yes,Yes,Responded to Survey,2011,2011-09-01 00:00:00,2,AK,1994-02-15 00:00:00,1,0,0,0,0,1,0,0,0,1,0,0,11,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,17
11,Wave 2: Age 19 Followup,AK451448406587,2013-09-01 00:00:00,Youth participated,2013-05-17 00:00:00,"No, is not in FC on Date","Yes, employed full time",No,No,No,No,No,Yes,No,No,High school diploma/GED,No,Yes,Yes,No,No,Yes,No,Yes,Yes,Don't Know,Not Applicable,Not Applicable,No,No,Yes,Yes,Yes,Responded to Survey,2011,2011-09-01 00:00:00,2,AK,1994-05-15 00:00:00,2,1,0,0,0,0,0,0,0,1,1,0,11,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,3,3,17


In [60]:
cohort1_wave2.to_csv('~/Desktop/capstone_clean_data/cohort1_wave2.csv')

# Now Cohort 2, baseline population...

In [61]:
# First I will create a data set that has services and demographic data (from 2014) for cohort 2 services population

In [62]:
# For cohort 2, only need services and demographic data for FY 2014
cohort_2 = services_11_14[services_11_14.FY == 2014]

# Dropping columns that are not needed for this project
cohort_2 = cohort_2.drop(['RecNumbr','LclFIPSsv'],axis= 1)

cohort_2.describe()

Unnamed: 0,FY,RepDate,StFIPS,Sex,AmIAKN,Asian,BlkAfrAm,HawaiiPI,White,HisOrgin,...,HousEdSv,HlthEdSv,FamSuppSv,MentorSv,SILsv,RmBrdFASv,EducFinaSv,OthrFinaSv,Race,RaceEthn
count,161495.0,161495.0,161495.0,161495.0,161495.0,161495.0,161495.0,161495.0,161495.0,161495.0,...,161495.0,161495.0,161495.0,161495.0,161495.0,161495.0,161495.0,161495.0,161495.0,161495.0
mean,2014.0,201406.043673,26.060014,1.514096,0.052534,0.037264,0.384996,0.02727,0.574408,0.38434,...,0.405102,0.387733,0.312152,0.222638,0.128054,0.167782,0.221375,0.272151,9.336605,3.874838
std,0.0,2.999691,16.71916,0.499803,1.224435,1.218746,1.298554,1.214904,1.301511,2.66021,...,1.761729,1.740005,1.743896,1.710864,1.438206,1.722234,1.689343,1.749509,26.104197,9.955126
min,2014.0,201403.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
25%,2014.0,201403.0,8.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
50%,2014.0,201409.0,24.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0
75%,2014.0,201409.0,41.0,2.0,0.0,0.0,1.0,0.0,1.0,0.0,...,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,2.0,6.0
max,2014.0,201409.0,72.0,2.0,77.0,77.0,77.0,77.0,77.0,77.0,...,77.0,77.0,77.0,77.0,77.0,77.0,77.0,77.0,99.0,99.0


In [63]:
# Checking to see if unique IDs are repeated. Then need to determine if repeat is just a duplicate or means something

C2_ID_count = dict(cohort_2.StFCID.value_counts())
set(cohort_2.StFCID.value_counts())

{1, 2}

In [64]:
C2_duplicate_count = pd.DataFrame.from_dict(C2_ID_count, orient='index')
C2_duplicate_count[0].value_counts()


2    55336
1    50823
Name: 0, dtype: int64

##### Based on the information above: 
    
    - There are 161,495 total rows of data.
    
    - 55,336 unique IDs appear twice (110,672 rows); the other 50,823 IDs appear once.
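
A quick consistency check of these counts, using only the numbers printed above:

```python
ids_seen_twice = 55336   # IDs with two rows (value_counts == 2)
ids_seen_once = 50823    # IDs with one row (value_counts == 1)

# Every twice-seen ID contributes two rows, every once-seen ID one row
total_rows = 2 * ids_seen_twice + ids_seen_once
unique_ids = ids_seen_twice + ids_seen_once

print("rows: {}, unique IDs: {}".format(total_rows, unique_ids))
```

The row total matches the 161,495 rows reported by `describe()`, and the unique-ID total is the number of rows that should remain once every ID is reduced to one row.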

______________________________________________________________________
In the next two cells, I check the duplicate counts in two different ways using 
the `keep` kwarg of the `DataFrame.duplicated` method. For more information, see the documentation: 
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.duplicated.html

In [65]:
C2_dup_check_dict = dict(cohort_2.duplicated(['StFCID'],keep=False))

C2_dup_count2 = pd.DataFrame.from_dict(C2_dup_check_dict, orient='index')
C2_dup_count2[0].value_counts()

True     110672
False     50823
Name: 0, dtype: int64

In [66]:
C2_dup_check_dict = dict(cohort_2.duplicated(['StFCID'],keep='first'))

C2_dup_count2 = pd.DataFrame.from_dict(C2_dup_check_dict, orient='index')
C2_dup_count2[0].value_counts()

False    106159
True      55336
Name: 0, dtype: int64
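
The difference between the two counts comes from how `keep` marks rows. A small sketch on toy IDs (not the NYTD data) makes this concrete:

```python
import pandas as pd

toy = pd.DataFrame({"StFCID": ["A", "A", "B", "C", "C"]})

# keep=False flags every row whose ID occurs more than once
all_repeats = toy.duplicated(["StFCID"], keep=False)

# keep='first' leaves the first occurrence unflagged and flags only later ones
later_repeats = toy.duplicated(["StFCID"], keep="first")

print(all_repeats.tolist())
print(later_repeats.tolist())
```

In the real data this is why `keep=False` reports 110,672 flagged rows while `keep='first'` reports 55,336: each repeated ID has exactly two rows.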

In [67]:
# Adding a new boolean column to the cohort_2 dataframe that flags every row 
# whose unique ID (StFCID) appears more than once (False=Not repeated; True=Repeated).
# Then sorting values based on Unique ID and RepDate columns

cohort_2['duplicated'] = cohort_2.duplicated(['StFCID'], keep=False)
cohort_2_dups = cohort_2[cohort_2['duplicated'] == True]
cohort_2_dups = cohort_2_dups.sort_values(['StFCID', 'RepDate'], ascending=[True, True])

In [68]:
# So, after examining resulting tables from cell above, I determined that the repeats are 
# those youth who were reported in both reporting periods for FY 2014 (RepDate: 201403 or 201409).
# Some repeats are duplicates of each other (containing the same information in both rows), 
# while some repeats need to be reviewed further.

In [69]:
# First, I will separate the repeats that are duplicates, drop one of the two rows for every unique ID, and merge 
# the remaining data rows with the non-duplicated data from cohort_2.

cohort_2_dups_b = cohort_2_dups.drop(['RepDate', 'AgeMP'], axis=1)

cohort_2_dups['duplicated_4_sure'] = cohort_2_dups_b.duplicated(keep=False)
cohort_2_dups.duplicated_4_sure.value_counts()



False    87518
True     23154
Name: duplicated_4_sure, dtype: int64

In [70]:
cohort_2_dups_2bAdded = cohort_2_dups[cohort_2_dups['duplicated_4_sure'] == True]
cohort_2_dups_2bAdded = cohort_2_dups_2bAdded.drop_duplicates('StFCID')
cohort_2_dups_2bAdded.duplicated_4_sure.value_counts()


True    11577
Name: duplicated_4_sure, dtype: int64

In [71]:
# Merging remaining data rows that are now unique and 
# adding to new dataframe with other data rows that were already unique...

cohort_2_NoDups = cohort_2[cohort_2['duplicated'] == False]
cohort_2_NoDups['duplicated'].value_counts()


False    50823
Name: duplicated, dtype: int64

In [72]:
cohort_2_dups_2bAdded = cohort_2_dups_2bAdded.drop(['duplicated','duplicated_4_sure'], axis=1)
cohort_2_NoDups = cohort_2_NoDups.drop('duplicated',axis=1)

frames = [cohort_2_dups_2bAdded, cohort_2_NoDups]

cohort_2_NoDups = pd.concat(frames)
cohort_2_NoDups.describe()


Unnamed: 0,FY,RepDate,StFIPS,Sex,AmIAKN,Asian,BlkAfrAm,HawaiiPI,White,HisOrgin,...,HousEdSv,HlthEdSv,FamSuppSv,MentorSv,SILsv,RmBrdFASv,EducFinaSv,OthrFinaSv,Race,RaceEthn
count,62400.0,62400.0,62400.0,62400.0,62400.0,62400.0,62400.0,62400.0,62400.0,62400.0,...,62400.0,62400.0,62400.0,62400.0,62400.0,62400.0,62400.0,62400.0,62400.0,62400.0
mean,2014.0,201405.556442,27.401747,1.496506,0.069776,0.051571,0.400353,0.043173,0.593061,0.410817,...,0.384295,0.378253,0.307532,0.233365,0.158077,0.201715,0.232083,0.282308,8.986442,3.957388
std,0.0,2.967052,16.578142,0.499992,1.668721,1.663922,1.720406,1.661636,1.720775,3.03985,...,2.440781,2.382148,2.377479,2.370307,2.279781,2.405835,2.350295,2.414462,25.589749,10.63051
min,2014.0,201403.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
25%,2014.0,201403.0,12.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
50%,2014.0,201403.0,26.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0
75%,2014.0,201409.0,42.0,2.0,0.0,0.0,1.0,0.0,1.0,0.0,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,6.0
max,2014.0,201409.0,72.0,2.0,77.0,77.0,77.0,77.0,77.0,77.0,...,77.0,77.0,77.0,77.0,77.0,77.0,77.0,77.0,99.0,99.0


In [73]:
# Now I need to review the remaining repeated IDs, and determine which data row I should keep. 
# I will keep the data row that has the larger number of services received for the unique ID.

# .copy() gives an independent frame, so adding a column below does not trigger SettingWithCopyWarning
cohort_2_reviewDups = cohort_2_dups[cohort_2_dups['duplicated_4_sure'] == False].copy()
C2_services = cohort_2_reviewDups[['TribeSv', 'DelinqntSv', 'SpecEdSv', 'ILNAsv', 'AcSuppSv',
            'PSEdSuppSv', 'CareerSv', 'EmplyTrSv', 'BudgetSv', 'HousEdSv', 'HlthEdSv',
            'FamSuppSv', 'MentorSv', 'SILsv', 'RmBrdFASv', 'EducFinaSv', 'OthrFinaSv']]

# Transpose so that each column is one data row, then count the service flags equal to 1
C2_services_count = C2_services.T

cohort_2_reviewDups['Num_services'] = (C2_services_count == 1).sum()
cohort_2_reviewDups.Num_services.value_counts()


2     14565
3     12703
1     11168
4     10410
5      8527
6      7278
7      6537
8      5285
9      3930
10     2550
11     1668
12     1023
14      810
13      630
15      316
0        76
16       39
17        3
Name: Num_services, dtype: int64

In [74]:
# Sorting values based on Unique ID and Num_services columns
# Then dropping duplicates with smaller value in Num_services columns

cohort_2_reviewDups = cohort_2_reviewDups.sort_values(['StFCID', 'Num_services'], ascending=[True, True])
cohort_2_DupsKeep = cohort_2_reviewDups.drop_duplicates('StFCID',keep='last')
cohort_2_DupsKeep.describe()


Unnamed: 0,FY,RepDate,StFIPS,Sex,AmIAKN,Asian,BlkAfrAm,HawaiiPI,White,HisOrgin,...,HlthEdSv,FamSuppSv,MentorSv,SILsv,RmBrdFASv,EducFinaSv,OthrFinaSv,Race,RaceEthn,Num_services
count,43759.0,43759.0,43759.0,43759.0,43759.0,43759.0,43759.0,43759.0,43759.0,43759.0,...,43759.0,43759.0,43759.0,43759.0,43759.0,43759.0,43759.0,43759.0,43759.0,43759.0
mean,2014.0,201406.804932,24.218835,1.524829,0.037958,0.023607,0.362508,0.012203,0.563016,0.349345,...,0.488128,0.397175,0.258393,0.140474,0.159373,0.245732,0.329189,9.999429,3.920931,5.678923
std,0.0,2.89003,16.702254,0.499389,0.551352,0.539,0.706097,0.528702,0.716593,2.09599,...,0.619355,0.610875,0.437756,0.347482,0.366028,0.430525,0.469924,27.073695,9.828792,3.165849
min,2014.0,201403.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
25%,2014.0,201403.0,6.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,3.0
50%,2014.0,201409.0,21.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,5.0
75%,2014.0,201409.0,39.0,2.0,0.0,0.0,1.0,0.0,1.0,0.0,...,1.0,1.0,1.0,0.0,0.0,0.0,1.0,2.0,6.0,8.0
max,2014.0,201409.0,72.0,2.0,77.0,77.0,77.0,77.0,77.0,77.0,...,77.0,77.0,1.0,1.0,1.0,1.0,1.0,99.0,99.0,17.0
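
The "keep the row with the most services" rule can be illustrated on a toy frame. The IDs and service columns below are made up for the example; only the technique (count the 1s, sort, keep the last row per ID) mirrors the cells above.

```python
import pandas as pd

toy = pd.DataFrame({
    "StFCID": ["A", "A", "B", "B"],
    "RepDate": [201403, 201409, 201403, 201409],
    "AcSuppSv": [1, 0, 0, 1],
    "MentorSv": [1, 0, 77, 1],
})
service_cols = ["AcSuppSv", "MentorSv"]

# Count only cells equal to 1; code 77 (blank) is not counted as a service
toy["Num_services"] = (toy[service_cols] == 1).sum(axis=1)

# Ascending sort puts the row with the most services last within each ID
kept = (toy.sort_values(["StFCID", "Num_services"])
           .drop_duplicates("StFCID", keep="last"))

print(kept[["StFCID", "RepDate", "Num_services"]].values.tolist())
```

When two rows for the same ID tie on `Num_services`, which one survives depends on the sort order, so adding `RepDate` as a third sort key may be worth considering if a deterministic choice matters.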


In [75]:
cohort_2_DupsKeep.columns

Index([u'FY', u'RepDate', u'StFIPS', u'St', u'DOB', u'Sex', u'AmIAKN',
       u'Asian', u'BlkAfrAm', u'HawaiiPI', u'White', u'RaceUnkn', u'RaceDcln',
       u'HisOrgin', u'FCStatSv', u'TribeSv', u'DelinqntSv', u'EdLevlSv',
       u'SpecEdSv', u'ILNAsv', u'AcSuppSv', u'PSEdSuppSv', u'CareerSv',
       u'EmplyTrSv', u'BudgetSv', u'HousEdSv', u'HlthEdSv', u'FamSuppSv',
       u'MentorSv', u'SILsv', u'RmBrdFASv', u'EducFinaSv', u'OthrFinaSv',
       u'StFCID', u'Race', u'RaceEthn', u'AgeMP', u'duplicated',
       u'duplicated_4_sure', u'Num_services'],
      dtype='object')

In [76]:
# Merging remaining data rows that are now unique and 
# adding to cohort_2_NoDups dataframe
# Resulting df is the Services Population for FY 2014 data set

cohort_2_DupsKeep = cohort_2_DupsKeep.drop(['Num_services','duplicated','duplicated_4_sure'], axis=1)

frames = [cohort_2_DupsKeep, cohort_2_NoDups]
cohort_2_NoDups = pd.concat(frames)
cohort_2_NoDups.describe()

Unnamed: 0,FY,RepDate,StFIPS,Sex,AmIAKN,Asian,BlkAfrAm,HawaiiPI,White,HisOrgin,...,HousEdSv,HlthEdSv,FamSuppSv,MentorSv,SILsv,RmBrdFASv,EducFinaSv,OthrFinaSv,Race,RaceEthn
count,106159.0,106159.0,106159.0,106159.0,106159.0,106159.0,106159.0,106159.0,106159.0,106159.0,...,106159.0,106159.0,106159.0,106159.0,106159.0,106159.0,106159.0,106159.0,106159.0,106159.0
mean,2014.0,201406.071073,26.089743,1.508181,0.05666,0.040044,0.384753,0.030407,0.580676,0.385478,...,0.439765,0.423544,0.344483,0.243682,0.150821,0.184261,0.237709,0.301632,9.403998,3.94236
std,0.0,2.999172,16.702977,0.499935,1.32753,1.321865,1.39485,1.318472,1.397276,2.691354,...,1.899743,1.869908,1.864997,1.838906,1.762057,1.859525,1.823006,1.87568,26.216231,10.307565
min,2014.0,201403.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
25%,2014.0,201403.0,8.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
50%,2014.0,201409.0,25.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0
75%,2014.0,201409.0,41.0,2.0,0.0,0.0,1.0,0.0,1.0,0.0,...,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,2.0,6.0
max,2014.0,201409.0,72.0,2.0,77.0,77.0,77.0,77.0,77.0,77.0,...,77.0,77.0,77.0,77.0,77.0,77.0,77.0,77.0,99.0,99.0


In [77]:
cohort_2_services_pop = cohort_2_NoDups.copy()

In [78]:
cohort_2_services_pop.to_csv('~/Desktop/capstone_clean_data/cohort_2_services_pop.csv')

In [79]:
# Now to get services N without data from states that have bad unique ID encodings...

# Issue with encoding the tracking ID (StFCID). Need to drop states that have the issue: HI, IN, KY, MS, OR, TX, TN

bad_encode_states = ["HI", "IN", "KY", "MS", "OR", "TX", "TN"]
cohort_2_services_pop_noBadCodeSt = cohort_2_NoDups.loc[~cohort_2_NoDups['St'].isin(bad_encode_states)]
cohort_2_services_pop_noBadCodeSt.describe()


Unnamed: 0,FY,RepDate,StFIPS,Sex,AmIAKN,Asian,BlkAfrAm,HawaiiPI,White,HisOrgin,...,HousEdSv,HlthEdSv,FamSuppSv,MentorSv,SILsv,RmBrdFASv,EducFinaSv,OthrFinaSv,Race,RaceEthn
count,90526.0,90526.0,90526.0,90526.0,90526.0,90526.0,90526.0,90526.0,90526.0,90526.0,...,90526.0,90526.0,90526.0,90526.0,90526.0,90526.0,90526.0,90526.0,90526.0,90526.0
mean,2014.0,201406.042352,24.747487,1.503955,0.062844,0.04275,0.401851,0.031273,0.560237,0.410214,...,0.382045,0.372225,0.288072,0.22546,0.121888,0.155005,0.211354,0.268508,10.396008,3.947882
std,0.0,2.999718,16.838006,0.499987,1.436278,1.430008,1.498122,1.426287,1.500124,2.892926,...,1.037534,0.971994,0.957176,0.906135,0.703493,0.952219,0.865154,0.985979,27.689439,10.145152
min,2014.0,201403.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
25%,2014.0,201403.0,6.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
50%,2014.0,201409.0,24.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0
75%,2014.0,201409.0,39.0,2.0,0.0,0.0,1.0,0.0,1.0,0.0,...,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,2.0,6.0
max,2014.0,201409.0,72.0,2.0,77.0,77.0,77.0,77.0,77.0,77.0,...,77.0,77.0,77.0,77.0,77.0,77.0,77.0,77.0,99.0,99.0


---------------------------------------------------------------------------------------------------------------------
- Target population: FY 2014 Services Population = 106159 total foster youth who received services in FY 2014

- Target population for this project: FY 2014 Services Population, without states with bad encodings of unique ID (thus excluded from analysis) = 90526 total foster youth who received services in FY 2014 (excluding certain states)


Now I will make a dataset for Cohort 2, Baseline Population (N)....
---------------------------------------------------------------------------------------------------------------

In [85]:
outcomes_2014.wave.value_counts()


Age 17 Baseline Survey    23775
Name: wave, dtype: int64

In [86]:
# From outcomes_2014 dataset, I only want to examine data for foster youth in Wave 1.
# Resulting df is the Cohort 2 Baseline Population (FY 2014) data set

# I will do an inner join of the outcomes data from FY 2014 (Wave 1) 
# with services and demographics data from FY 2014, joining on unique ID

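# Reassigning (instead of inplace=True) avoids modifying a slice of cohort_2_NoDups in place
cohort_2_services_pop_noBadCodeSt = cohort_2_services_pop_noBadCodeSt.rename(columns={'RepDate':'RepDate_services'})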

outcomes_2014_keep = outcomes_2014.drop(['stfips','st','recnumbr','dob','sex','amiakn','asian','blkafram',\
                                         'hawaiipi','white','raceunkn','racedcln','hisorgin'], axis=1)

outcomes_2014_keep.rename(columns={'repdate':'RepDate_outcomes', 'stfcid': 'StFCID'}, inplace=True)

cohort2_baseline = pd.merge(outcomes_2014_keep, cohort_2_services_pop_noBadCodeSt, on='StFCID', how='inner')
cohort2_baseline = cohort2_baseline[cohort2_baseline.wave == 'Age 17 Baseline Survey']
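
Because the join relies on each unique ID appearing once on both sides, it may be worth asserting that explicitly. The sketch below uses toy frames and the `validate` argument of `pd.merge`; it is a suggestion, not part of the original workflow.

```python
import pandas as pd

outcomes = pd.DataFrame({"StFCID": ["A", "B", "D"], "responded": [1, 0, 1]})
services = pd.DataFrame({"StFCID": ["A", "B", "C"], "St": ["AK", "AK", "AL"]})

# validate raises MergeError if an ID is repeated on either side
baseline = pd.merge(outcomes, services, on="StFCID", how="inner",
                    validate="one_to_one")

# Only IDs present in both frames survive an inner join
print(baseline["StFCID"].tolist())
```

IDs found in only one of the two frames (here "C" and "D") are dropped silently by an inner join, which is how youth from the excluded states fall out of the baseline population.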

In [87]:
cohort2_baseline.describe()


Unnamed: 0,responded,FY,RepDate_services,StFIPS,Sex,AmIAKN,Asian,BlkAfrAm,HawaiiPI,White,...,HousEdSv,HlthEdSv,FamSuppSv,MentorSv,SILsv,RmBrdFASv,EducFinaSv,OthrFinaSv,Race,RaceEthn
count,11432.0,11432.0,11432.0,11432.0,11432.0,11432.0,11432.0,11432.0,11432.0,11432.0,...,11432.0,11432.0,11432.0,11432.0,11432.0,11432.0,11432.0,11432.0,11432.0,11432.0
mean,0.774318,2014.0,201406.679671,23.644594,1.500612,0.046361,0.023181,0.361704,0.012071,0.560444,...,0.40859,0.410952,0.323041,0.23338,0.096221,0.095609,0.138733,0.258922,11.381386,3.906578
std,0.418049,0.0,2.922122,16.429667,0.500021,0.745759,0.731154,0.861879,0.723787,0.87081,...,0.868105,0.868351,0.854778,0.831186,0.773894,0.773574,0.794631,0.838951,29.004389,9.436258
min,0.0,2014.0,201403.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
25%,1.0,2014.0,201403.0,6.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
50%,1.0,2014.0,201409.0,22.0,2.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0
75%,1.0,2014.0,201409.0,36.0,2.0,0.0,0.0,1.0,0.0,1.0,...,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,2.0,6.0
max,1.0,2014.0,201409.0,72.0,2.0,77.0,77.0,77.0,77.0,77.0,...,77.0,77.0,77.0,77.0,77.0,77.0,77.0,77.0,99.0,99.0


In [88]:
cohort2_baseline.to_csv('~/Desktop/capstone_clean_data/cohort2_baseline.csv')



---------------------------------------------------------------------------------------------------------------------
- Sampling Frame: Cohort 2 Baseline Population = <b>N = 11432 total foster youth who were eligible for outcomes surveys</b>

- Currently waiting for Cohort 2, Wave 2 Outcomes Data to be made available
---------------------------------------------------------------------------------------------------------------