# Analysis Part 2:
In Part 2 of this analysis, we produce an updated estimate of the number of Opportunity Youth (OY) in South King County.

We then dive deeper into this population by breaking down the number of Opportunity Youth by PUMA code, age group, educational attainment and whether they are looking for work.

## Imports:

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sys
import os

module_path = os.path.abspath(os.path.join(os.pardir, os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)

In [3]:
from src.data_cleaning import data_cleaning as dc

### 1. Updated Population Estimate of OY:
Using the person weights from the 2017 PUMS data, we estimate that there are 6,723 Opportunity Youth in South King County.

To generate this number, a pandas dataframe was created in an iterative way to isolate the OY population in South King County.  

As mentioned in the introduction to this analysis, we wrote functions that build a dataframe of the South King County population aged 16-24, with added columns that assign each row to an age group and flag whether it is OY.  To count the number of OY, we filtered to the rows where `is_oy` is `True` and then applied the `.sum()` method to the person weights to estimate how many OY are in South King County:

In [4]:
# create skc df and then subset the OY population:
skc_df = dc.final_skc_df()
oy_df = skc_df[skc_df['is_oy']]
oy_df['person_weight'].sum()

6723.0
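The reason we sum `person_weight` rather than count rows is that each PUMS row stands for several people. The sketch below illustrates this on a hypothetical toy dataframe (the values are invented for illustration and are not taken from the PUMS data); only the column names `person_weight` and `is_oy` match our dataframe.

```python
import pandas as pd

# Hypothetical toy data: each row is one survey respondent, and
# person_weight is the number of people that respondent represents.
toy_df = pd.DataFrame({
    "person_weight": [10.0, 25.0, 5.0, 40.0],
    "is_oy": [True, False, True, False],
})

# Boolean indexing keeps only the rows flagged as OY.
toy_oy_df = toy_df[toy_df["is_oy"]]

# Summing the weights estimates the population size;
# counting rows would only count survey respondents.
estimated_oy = toy_oy_df["person_weight"].sum()
respondent_count = len(toy_oy_df)

print(f"Estimated OY population: {estimated_oy}")
print(f"OY survey respondents:   {respondent_count}")
```

In this toy example, two respondents represent an estimated 15 people, which is why the row count and the population estimate differ.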

### 2. OY Population by PUMA:
We defined South King County by the following four PUMA codes:

In [5]:
puma_names = dc.create_skc_puma_df()
puma_names

| | state_fips | state_name | cpuma0010 | puma | geoid | gisjoin | puma_name |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 42 | 53 | Washington ... | 1044 | 11613 | 5311613 | G53011613 | King County (Southwest Central)--Kent City ... |
| 43 | 53 | Washington ... | 1044 | 11614 | 5311614 | G53011614 | King County (Southwest)--Auburn City & Lakelan... |
| 44 | 53 | Washington ... | 1044 | 11615 | 5311615 | G53011615 | King County (Southeast)--Maple Valley, Covingt... |
| 47 | 53 | Washington ... | 1046 | 11612 | 5311612 | G53011612 | King County (Far Southwest)--Federal Way, Des ... |


The Opportunity Youth population breaks down by PUMA code as follows:

In [43]:
dc.puma_breakdown(oy_df)

PUMA 11612: 1977.0
PUMA 11613: 2006.0
PUMA 11614: 1530.0
PUMA 11615: 1210.0


We can see this same breakdown with the percentages of the OY per PUMA code:

- PUMA 11612 - King County (Far Southwest): 1977 - 29%

- PUMA 11613 - King County (Southwest Central): 2006 - 30%

- PUMA 11614 - King County (Southwest): 1530 - 23%

- PUMA 11615 - King County (Southeast): 1210 - 18%

This distribution invites further investigation into why the three Southwest regions of King County together account for 82% of the OY population in South King County.  What are the differentiating factors between Southwest and Southeast King County?  
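The percentages above follow directly from the weighted counts returned by `dc.puma_breakdown(oy_df)`. As a quick arithmetic check, the sketch below recomputes them from those four counts; the variable names are ours and are not part of `data_cleaning.py`.

```python
# Weighted OY counts per PUMA, copied from the puma_breakdown output above.
oy_by_puma = {"11612": 1977, "11613": 2006, "11614": 1530, "11615": 1210}

total_oy = sum(oy_by_puma.values())

# Share of the total OY population in each PUMA, rounded to whole percent.
shares = {puma: round(count / total_oy * 100) for puma, count in oy_by_puma.items()}

# The three PUMAs whose names contain "Southwest".
southwest = ["11612", "11613", "11614"]
southwest_share = round(sum(oy_by_puma[p] for p in southwest) / total_oy * 100)

print(total_oy)
print(shares)
print(southwest_share)
```

The counts add up to the overall estimate of 6723, so the four shares describe the whole OY population.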

### 3. OY Population by Age Group:
We decided to break the OY population into three age groups: 16-18yo, 19-21yo and 22-24yo.  This lets us make comparisons with the 2016 report and capture the potentially unique circumstances of each group.  A subset of the dataframe was created to show the OY population in each age group, along with the percentage of that age group's total population that is OY:  

In [12]:
# Total weighted population of 16-24 year olds in each age group.
# Selecting 'person_weight' explicitly keeps the string columns out of the sum.
tot_pops_df = skc_df.groupby('age_group')[['person_weight']].sum()
tot_pops_df = tot_pops_df.rename(columns={'person_weight': 'population'})
tot_pop = tot_pops_df.sum()

# Weighted OY population in each age group, and the percentage of OY within that age group's total population.
oy_pop_df = oy_df.groupby('age_group')[['person_weight']].sum()
oy_pop_df = oy_pop_df.rename(columns={'person_weight': 'OY population'})
oy_pop_df['percentage'] = round(oy_pop_df['OY population'] / tot_pops_df['population'] * 100)
oy_pop_df

| age_group | OY population | percentage |
| --- | --- | --- |
| Ages 16-18 | 1230.0 | 6.0 |
| Ages 19-21 | 2541.0 | 14.0 |
| Ages 22-24 | 2952.0 | 15.0 |


We can see these numbers visually in the following bar graph:

![pop_ages.png](../../reports/figures/pop_ages.png)

From this bar graph, we can see that the OY population is highest among 22-24 year olds and lowest among 16-18 year olds.  This is a trend we might expect, since 16-18 year olds are more likely to still be under the supervision of family members and/or other adults, and so have more accountability to be in school or working.  

Further investigation into the living situations of OY 19 and older might shed light on the factors contributing to their school enrollment and employment status.  

### 4. Education Attainment Amongst Opportunity Youth:
Since education is one of the defining factors of OY, it is worth investigating the levels of education this population has attained, given that they are not currently in school.  

### OY Population with No Diploma:
We subsetted our data by education attainment category to investigate each level.  The 'No Diploma' category raised an issue with how we queried the data, which came down to how the data is structured.

It is important to note that all values in our dataframe, except 'age' and 'person weight', are strings, even though most of the response options in the survey are integers.  This means the exact format of each response matters when we query it.  For the 'education attainment' feature, responses 'bb' through '15' all indicate that a person has no diploma.

At first, we built our queries from `range(1, 16)`, converting the values to a list.  After querying the data this way, we noticed that we had missed a lot of the data.  On closer inspection of the response format, we saw that single-digit responses in the 'education attainment' column are zero-padded, e.g. '01' rather than '1'.  This meant that the values generated from `range()` failed to match any of the single-digit responses:

In [13]:
# Previously we missed data because we were searching for '1' instead of '01' for single-digit values:
skc_df[skc_df['education_attained'] == '1']

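| | id | age | sex | person_weight | puma | school_enrollment | education_attained | employment_status | avail_for_work | look_for_work | absent_from_work | layoff | age_group | is_oy |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |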


As opposed to:

In [14]:
# vs. :
skc_df[skc_df['education_attained'] == '01']

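| | id | age | sex | person_weight | puma | school_enrollment | education_attained | employment_status | avail_for_work | look_for_work | absent_from_work | layoff | age_group | is_oy |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4565 | 2013000916904 | 17.0 | 1 | 14.0 | 11614 | 1 | 1 | 6 | 5 | 2 | 2 | 2 | Ages 16-18 | True |
| 5968 | 2013001189997 | 23.0 | 2 | 20.0 | 11613 | 1 | 1 | 6 | 3 | 2 | 2 | 2 | Ages 22-24 | True |
| 7985 | 2014000095629 | 18.0 | 2 | 12.0 | 11613 | 1 | 1 | 3 | 1 | 1 | 3 | 2 | Ages 16-18 | True |
| 10373 | 2014000564460 | 22.0 | 2 | 2.0 | 11614 | 1 | 1 | 6 | 5 | 2 | 2 | 2 | Ages 22-24 | True |
| 14145 | 2014001327292 | 21.0 | 1 | 4.0 | 11614 | 1 | 1 | 3 | 1 | 1 | 2 | 2 | Ages 19-21 | True |
| 19015 | 2015000781393 | 21.0 | 2 | 12.0 | 11615 | 1 | 1 | 1 | 5 | 2 | 2 | 2 | Ages 19-21 | False |
| 21962 | 2015001371640 | 24.0 | 2 | 21.0 | 11614 | 1 | 1 | 6 | 5 | 2 | 2 | 2 | Ages 22-24 | True |
| 24170 | 2016000305053 | 23.0 | 2 | 50.0 | 11613 | 1 | 1 | 1 | 5 | 3 | 3 | 3 | Ages 22-24 | False |
| 26159 | 2016000707834 | 23.0 | 2 | 34.0 | 11613 | 1 | 1 | 6 | 5 | 2 | 2 | 2 | Ages 22-24 | True |
| 28897 | 2016001245397 | 21.0 | 1 | 5.0 | 11614 | 1 | 1 | 1 | 5 | 3 | 3 | 3 | Ages 19-21 | False |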


So we created a `pad` function that converts each number to a string and prefixes single-digit values with a '0':

In [15]:
# Create function to add 0's to single-digit entries: - function located in data_cleaning.py
def pad(num):
    if num < 10:
        return f'0{num}'
    else: 
        return str(num)  
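As an aside, Python's built-in `str.zfill` performs the same zero-padding. The sketch below is an alternative to `pad`, not the version in `data_cleaning.py`; the name `pad_zfill` is hypothetical. It also shows why plain string conversion of `range(1, 16)` misses the single-digit codes.

```python
def pad(num):
    # Copy of the pad function above, included so this block runs on its own.
    if num < 10:
        return f'0{num}'
    else:
        return str(num)


def pad_zfill(num):
    # Hypothetical alternative: left-pad the string with zeros to width 2.
    return str(num).zfill(2)


# Plain string conversion produces '1', '2', ... which never match '01', '02', ...
unpadded = [str(n) for n in range(1, 16)]

# Both padding approaches produce the zero-padded codes stored in the data.
padded = [pad(n) for n in range(1, 16)]
padded_zfill = [pad_zfill(n) for n in range(1, 16)]

print(unpadded[:3])
print(padded[:3])
print(padded == padded_zfill)
```

Either version works for the two-digit codes used here; `zfill` simply avoids the explicit branch.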

Thus, we were able to find all valid 'no diploma' entries as follows:

In [16]:
# What is the total number of 16-24 OY with no diploma?  Valid entries:  bb-15.
no_dip = ['bb'] + list(map(pad, range(1, 16)))

In [17]:
no_dip_oy_df = oy_df[oy_df['education_attained'].isin(no_dip)]
no_dip_oy_df

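| | id | age | sex | person_weight | puma | school_enrollment | education_attained | employment_status | avail_for_work | look_for_work | absent_from_work | layoff | age_group | is_oy |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 288 | 2013000058010 | 17.0 | 2 | 45.0 | 11614 | 1 | 13 | 6 | 5 | 2 | 2 | 2 | Ages 16-18 | True |
| 517 | 2013000100470 | 18.0 | 2 | 16.0 | 11613 | 1 | 13 | 3 | 5 | 3 | 3 | 3 | Ages 16-18 | True |
| 610 | 2013000118713 | 23.0 | 2 | 25.0 | 11613 | 1 | 14 | 6 | 5 | 2 | 2 | 2 | Ages 22-24 | True |
| 796 | 2013000155051 | 19.0 | 2 | 2.0 | 11614 | 1 | 14 | 6 | 5 | 2 | 2 | 2 | Ages 19-21 | True |
| 1028 | 2013000204235 | 17.0 | 1 | 17.0 | 11613 | 1 | 13 | 6 | 5 | 2 | 2 | 2 | Ages 16-18 | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 35945 | 2017001151675 | 17.0 | 2 | 26.0 | 11612 | 1 | 14 | 6 | 5 | 3 | 3 | 3 | Ages 16-18 | True |
| 36211 | 2017001208143 | 17.0 | 1 | 16.0 | 11612 | 1 | 13 | 6 | 5 | 2 | 2 | 2 | Ages 16-18 | True |
| 37134 | 2017001386502 | 18.0 | 1 | 17.0 | 11613 | 1 | 11 | 6 | 5 | 3 | 3 | 3 | Ages 16-18 | True |
| 37718 | 2017001470135 | 23.0 | 1 | 17.0 | 11613 | 1 | 14 | 6 | 5 | 3 | 3 | 3 | Ages 22-24 | True |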


For the remaining 'education attainment' categories, we queried the data in the same way.  This led us to create the following visualisation:

![education_oy.png](../../reports/figures/education_oy.png)

Across all ages, 49% of Opportunity Youth had a high school diploma or a GED.  Broken down by age group, the figure above shows that 55% of 19-21yo and 48% of 22-24yo have attained a high school diploma or GED, whereas 58% of 16-18yo have not attained a diploma.  This is not too surprising, given that 16-18 year olds are of school age and so are often still working toward a diploma.  Further investigation into these circumstances is necessary.  

### 5. Opportunity Youth Looking for Work:
As employment is the second defining factor of Opportunity Youth, it is compelling to look into the motivations Opportunity Youth have toward seeking employment.  

We queried the 'looking for work' category in a similar way to how we queried education.  The valid responses were:

b = N/A (less than 16 years old/at work/temporarily absent/informed of recall)

1 = Yes

2 = No

3 = Did not report

Again, all these numbers are string values in our data.  While we could have cast all our data to `int`, we chose to handle it on an as-needed basis.  We also considered renaming the responses to something more readable but decided that, for our purposes, it was not necessary.

In [20]:
# Create list of relevant responses:
response = ['1', '2', '3']

# Query the number of OY who are looking for work:
oy_df[oy_df['look_for_work'] == '1']['person_weight'].sum()

2107.0
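To get the share for every response at once, the same weighted sum can be combined with a `groupby`. The sketch below runs on a hypothetical toy dataframe (invented values, not the PUMS data), and the `labels` mapping is introduced only for readability; it is one possible approach, not necessarily the code behind the figure that follows.

```python
import pandas as pd

# Hypothetical toy data using the same string response codes as the survey.
toy_oy_df = pd.DataFrame({
    "look_for_work": ["1", "2", "2", "3", "b"],
    "person_weight": [20.0, 30.0, 30.0, 10.0, 10.0],
})

# Readable labels for the response codes listed above.
labels = {"b": "N/A", "1": "Yes", "2": "No", "3": "Did not report"}

# Weighted number of people per response.
weighted = toy_oy_df.groupby("look_for_work")["person_weight"].sum()

# Convert to a percentage of the total weighted population.
shares = (weighted / weighted.sum() * 100).round(1)
shares = shares.rename(index=labels)

print(shares)
```

Because the codes are strings, the keys of `labels` must be strings too; an integer key such as `1` would not match `'1'`.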

We continued in this way to develop an understanding of the percentage of OY looking for work and then generated the following visualisation:

![oy_lfw.png](../../reports/figures/oy_lfw.png)

While 12% of responses were not reported, 56.7% of Opportunity Youth answered 'No' to the 'Looking for Work' question in the survey (80% of these responses came from ages 19-24).  This raises the question: "Why are Opportunity Youth not looking for work if they are neither in school nor working?"  

## Summary
Our analysis so far has highlighted the need for further investigation into the education and employment statuses of Opportunity Youth.  

Variables such as motivation for seeking education or employment, living circumstances and accountability systems for each age group require further research to gain a deeper understanding of the characteristics defining Opportunity Youth.