In [5]:
%load_ext autoreload
%autoreload 2


In [6]:
import sys
import os

module_path = os.path.abspath(os.path.join(os.pardir, os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)

In [7]:
from src.data_cleaning import data_cleaning as dc

# Analysis of Opportunity Youth in South King County:

## Setting the Scene:

### Aims:

Our aims are threefold:
- Give an updated estimate of the number of Opportunity Youth (OY) in South King County (SKC).
- Get a closer look at what this population looks like, rather than draw concrete conclusions about it.
    - We take a deep dive into the 2 characteristics that define Opportunity Youth: education and employment.
    - We compare the OY population with the non-OY population on these 2 characteristics.
- Observe any trends between the 2014 data and our current data.

### How do we define OY and SKC?

- We define OY as people between the ages of 16 and 24 who are disengaged from both school and work.
     - We define the non-OY population as people between the ages of 16 and 24 who are working, in school, or both.
- We define SKC using the PUMA codes as per the ACS website. The PUMA codes we used to define SKC were:
    - 11612, 11613, 11614, 11615
    - These are the King County PUMAs whose names contain 'South'.
- Note that the area used in the 2014 data was defined differently, so our data is not a 1:1 match.

### Analysis Takeaways for Future Investigation:

- What role does educational attainment play in defining OY?
- Motivation to work, as measured by 'available for work' and 'looking for work'.
- What lifestyle factors influence 16-24 year olds to be disengaged from work and school?
- The need to better understand the OY population by looking into its defining factors, i.e. education and employment.

# EDA:

## Data Cleaning/Processing:

### Where did we get our data from?

- We sourced our data from the ACS website and primarily worked with the 2017 5-year person-level PUMS data.
    - We collated our data from the following tables:
        - pums_2017: data on weighted individuals
        - puma_names_2010: the PUMA codes and their labels
        - wa_geo_xwalk: geographical information for our data

### What features did we choose to address each part of the problem/analysis and why?

#### Defining SKC:
- To define the South King County area, we used the puma_names_2010 table.
- Using the state column, we filtered down to the data for Washington.
- Using the TIGERweb app, we identified the 16 PUMA codes that define King County, which let us filter down to all of the King County data.

In [8]:
king_county = dc.create_kc_df()
king_county

Unnamed: 0,state_fips,state_name,cpuma0010,puma,geoid,gisjoin,puma_name
32,53,Washington ...,1039,11601,5311601,G53011601,Seattle City (Northwest) ...
33,53,Washington ...,1040,11602,5311602,G53011602,Seattle City (Northeast) ...
34,53,Washington ...,1041,11603,5311603,G53011603,Seattle City (Downtown)--Queen Anne & Magnolia...
35,53,Washington ...,1042,11604,5311604,G53011604,Seattle City (Southeast)--Capitol Hill ...
36,53,Washington ...,1043,11605,5311605,G53011605,Seattle City (West)--Duwamish & Beacon Hill ...
37,53,Washington ...,1044,11606,5311606,G53011606,"King County (Northwest)--Shoreline, Kenmore & ..."
38,53,Washington ...,1044,11607,5311607,G53011607,"King County (Northwest)--Redmond, Kirkland Cit..."
39,53,Washington ...,1044,11608,5311608,G53011608,King County (Northwest Central)--Greater Belle...
40,53,Washington ...,1044,11609,5311609,G53011609,"King County (Central)--Sammamish, Issaquah, Me..."
41,53,Washington ...,1044,11610,5311610,G53011610,"King County (Central)--Renton City, Fairwood, ..."


- From here, we defined South King County as all of the PUMAs with 'South' in the PUMA name. This reduced the set to 4 PUMA codes:

In [9]:
skc_pumas = dc.create_skc_puma_df()
skc_pumas

Unnamed: 0,state_fips,state_name,cpuma0010,puma,geoid,gisjoin,puma_name
42,53,Washington ...,1044,11613,5311613,G53011613,King County (Southwest Central)--Kent City ...
43,53,Washington ...,1044,11614,5311614,G53011614,King County (Southwest)--Auburn City & Lakelan...
44,53,Washington ...,1044,11615,5311615,G53011615,"King County (Southeast)--Maple Valley, Covingt..."
47,53,Washington ...,1046,11612,5311612,G53011612,"King County (Far Southwest)--Federal Way, Des ..."


This is our final definition of South King County, and the one we perform our analysis on.
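
The two filtering steps above can be sketched in pandas as follows. This is only an illustration and not the code inside `dc.create_kc_df()` or `dc.create_skc_puma_df()`, which is not shown in this notebook. The toy rows use abbreviated, hypothetical PUMA names, and the set of King County codes is assumed to be 11601 through 11616.

```python
import pandas as pd

# Toy stand-in for the puma_names_2010 table (hypothetical, abbreviated rows).
puma_names = pd.DataFrame({
    "state_name": ["Washington ", "Washington ", "Washington ", "Washington ", "Oregon"],
    "puma": [11601, 11610, 11613, 11612, 90000],
    "puma_name": [
        "Seattle City (Northwest)",
        "King County (Central)",
        "King County (Southwest Central)",
        "King County (Far Southwest)",
        "Hypothetical South Region",
    ],
})

# Assumption: the 16 King County PUMA codes are 11601..11616.
KC_PUMAS = set(range(11601, 11617))

# Step 1: keep Washington rows whose PUMA code is one of the King County codes.
is_wa = puma_names["state_name"].str.strip() == "Washington"
king_county = puma_names[is_wa & puma_names["puma"].isin(KC_PUMAS)]

# Step 2: keep the King County PUMAs whose name contains 'South'.
skc_pumas = king_county[king_county["puma_name"].str.contains("South", regex=False)]

skc_codes = sorted(skc_pumas["puma"].tolist())
print(skc_codes)
```

Filtering on the state first matters because a name match on 'South' alone would also pick up PUMAs outside King County.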

#### Defining OY:
- The 2017 PUMS table gave us access to 286 features relating to (weighted) individual persons.
- We then made judgement calls on the most appropriate features to include in order to answer our initial exploratory questions.
- We first chose the features that would help us define the OY population:
    - 'Age' - lets us isolate the relevant age range of the OY population.
    - 'Employment Status' - lets us query the employment status of an individual so that, if they are not employed **and** not in school, we can categorise them as OY.
    - 'School Enrollment' - lets us query the school enrollment of an individual (used together with 'Employment Status', as above).
    - 'Person Weights' - since the data is only a sample of the SKC population, we need the person weights to scale the sample up to population estimates.
    - We also used the 'serialno' feature as an 'id' identifier for each row.
- Since OY are defined by employment and education, it seemed fitting to include more features relating to those two fields, as they might prove useful throughout our exploration. Thus, we also chose to include:
    - 'Grade Attending' - gives more insight into the non-OY population for the sake of comparison.
    - 'Education Attained' - shows the distribution of the levels of education that OY have attained.
    - 'Absent from work' - lets us investigate OY who are recorded as not working but may just be absent from work.
    - 'Available for work' - lets us query how many OY are available for work in comparison to the non-OY population, and whether there are any significant differences between the 2 populations.
    - 'Looking for work' - lets us investigate how many OY are actively searching for work (again in comparison to non-OY) and how many are unemployed and not looking for work.
    - 'Layoff from work' - lets us see how many OY may be on layoff from work.
- We also had to include the 'puma' feature so that we could restrict our data set to SKC using the PUMA codes we found in the puma_names_2010 table.
- Note that we also renamed the columns to something more readable.
- We then filtered down to the OY age range, so that our data set only contains individuals aged 16 to 24.

In [10]:
skc_df = dc.skc_df()
skc_df.head()

Unnamed: 0,id,age,sex,person_weight,puma,school_enrollment,education_attained,employment_status,avail_for_work,look_for_work,absent_from_work,layoff
32,2013000007063,19.0,1,30.0,11612,2,18,6,5,2,2,2
36,2013000008046,17.0,2,36.0,11613,2,13,6,5,2,2,2
48,2013000011255,17.0,2,13.0,11614,2,12,6,5,2,2,2
54,2013000012970,21.0,2,29.0,11612,3,18,6,5,2,2,2
57,2013000013525,18.0,2,24.0,11613,2,15,6,5,2,2,2
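
The selection, renaming and age filter described above can be sketched as follows. This is a minimal illustration, not the code inside `dc.skc_df()`. It assumes the raw table uses lowercase ACS PUMS variable names (`agep`, `pwgtp`, `sch`, `esr`), and the helper `select_skc_youth` is hypothetical. Only a subset of the features is shown to keep the example short.

```python
import pandas as pd

# Assumed mapping from lowercase PUMS variable names to readable column names.
RENAME = {
    "serialno": "id",
    "agep": "age",
    "pwgtp": "person_weight",
    "puma": "puma",
    "sch": "school_enrollment",
    "esr": "employment_status",
}
SKC_PUMAS = {11612, 11613, 11614, 11615}


def select_skc_youth(pums: pd.DataFrame) -> pd.DataFrame:
    """Keep the chosen columns, rename them, and filter to SKC residents aged 16-24."""
    df = pums[list(RENAME)].rename(columns=RENAME)
    in_skc = df["puma"].isin(SKC_PUMAS)
    in_age_range = df["age"].between(16, 24)  # inclusive on both ends
    return df[in_skc & in_age_range].reset_index(drop=True)


# Toy data (hypothetical values) with one extra, unused column.
pums = pd.DataFrame({
    "serialno": [1, 2, 3, 4],
    "agep": [19, 30, 17, 21],
    "pwgtp": [30, 12, 36, 29],
    "puma": [11612, 11613, 11601, 11615],
    "sch": [2, 1, 2, 3],
    "esr": [6, 1, 6, 6],
    "unused": [0, 0, 0, 0],
})
youth = select_skc_youth(pums)
print(youth)
```

In the toy data, row 2 is dropped for being outside the age range and row 3 for being outside SKC.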


- Finally, it became clear that it would be helpful to know at a glance whether an individual row was OY or not, and to break the age range into 3 groups. This led to adding 2 new columns to our dataframe:
    - 'is_oy' - a boolean column that holds True if the row is an OY and False otherwise. This column was created with a list comprehension that queries the 'employment status' and 'school enrollment' columns.
    - 'age_group' - categorises each person into one of 3 age groups: 'Ages 16-18', 'Ages 19-21', 'Ages 22-24'.
        - We decided to use these three age groups:
        1. because the 2016 report categorised its data in this way, which allows for comparisons, and
        2. because these age groups seem representative of slightly different life stages.
- Our final dataframe now looks like:

In [39]:
skc_df = dc.add_cols_skc(skc_df)
skc_df.head()

Unnamed: 0,id,age,sex,person_weight,puma,school_enrollment,education_attained,employment_status,avail_for_work,look_for_work,absent_from_work,layoff,age_group,is_oy
32,2013000007063,19.0,1,30.0,11612,2,18,6,5,2,2,2,Ages 19-21,False
36,2013000008046,17.0,2,36.0,11613,2,13,6,5,2,2,2,Ages 16-18,False
48,2013000011255,17.0,2,13.0,11614,2,12,6,5,2,2,2,Ages 16-18,False
54,2013000012970,21.0,2,29.0,11612,3,18,6,5,2,2,2,Ages 19-21,False
57,2013000013525,18.0,2,24.0,11613,2,15,6,5,2,2,2,Ages 16-18,False
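
A minimal sketch of how the two derived columns could be built is shown below. It is not the code inside `dc.add_cols_skc()`: it uses vectorised pandas operations instead of the list comprehension mentioned above, and the helper name `add_cols` is hypothetical. The code values are assumptions based on the ACS PUMS data dictionary: school enrollment code 1 is taken to mean "not enrolled", and employment status codes 3 and 6 are taken to mean "unemployed" and "not in labor force".

```python
import pandas as pd

NOT_ENROLLED = 1      # assumed school_enrollment code: not attending school
NOT_WORKING = {3, 6}  # assumed employment_status codes: unemployed, not in labor force


def add_cols(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with 'age_group' and 'is_oy' columns added."""
    out = df.copy()
    # Bins are right-inclusive: (15, 18], (18, 21], (21, 24].
    out["age_group"] = pd.cut(
        out["age"],
        bins=[15, 18, 21, 24],
        labels=["Ages 16-18", "Ages 19-21", "Ages 22-24"],
    )
    # OY: not enrolled in school AND not working.
    out["is_oy"] = (out["school_enrollment"] == NOT_ENROLLED) & out[
        "employment_status"
    ].isin(NOT_WORKING)
    return out


# Toy data (hypothetical values).
toy = pd.DataFrame({
    "age": [17.0, 19.0, 23.0, 24.0],
    "school_enrollment": [2, 1, 1, 1],
    "employment_status": [6, 6, 1, 3],
})
result = add_cols(toy)
print(result)
```

If the real columns are stored as strings rather than integers, the comparisons would need to use string codes instead.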


### What assumptions did we make when picking these features?

- When picking these features, we assumed that each row has exactly one value for each column, e.g. 'education_attained' does not list every grade of school attended, but only the highest level of education the person has attained.
- We assumed that there is no risk of 'double counting', for example someone who is unemployed also being categorised as 'absent from work' and/or laid off.

### What problems did we face when working with the data?

- Discrepancy in how SKC was defined in the 2014 data set
    - When trying to compare our 2017 data with the 2014 data, it was difficult to define SKC in the same way that the 2016 report seemed to.
    - We therefore needed a way to scale our data so that we could make reasonable comparisons.
- Further investigation is needed to determine whether someone who is unemployed could also be categorised as 'absent from work' and/or laid off.
    - If this is the case, we need to decide how we determine 'unemployed': is it simply being recorded as 'unemployed', or does it need to take these other two possibilities into account?
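
One way to start that investigation is a weighted cross-tabulation of employment status against one of the other flags. The sketch below is only an illustration on toy data. The helper `overlap_table` is hypothetical, and the code values are left uninterpreted because their meanings would need to be checked against the PUMS data dictionary.

```python
import pandas as pd


def overlap_table(df: pd.DataFrame, flag_col: str) -> pd.DataFrame:
    """Weighted counts for each (employment_status, flag_col) combination."""
    table = pd.crosstab(
        df["employment_status"],
        df[flag_col],
        values=df["person_weight"],
        aggfunc="sum",
    )
    # Combinations that never occur come back as NaN; show them as 0.
    return table.fillna(0)


# Toy data (hypothetical values).
toy = pd.DataFrame({
    "employment_status": [3, 3, 6, 1],
    "absent_from_work": [1, 2, 2, 2],
    "person_weight": [10, 20, 30, 40],
})
table = overlap_table(toy, "absent_from_work")
print(table)
```

Any non-zero cell that pairs an "unemployed" status with an "absent" or "laid off" flag would indicate that the double-counting assumption needs revisiting.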

### How did we overcome these problems?

- We chose to compare the 2014 and 2017 data using percentages, which show the proportions of each categorised population rather than raw counts.
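
As a rough sketch of the percentage approach, the weighted share of OY within each age group could be computed as below. This assumes the `person_weight`, `age_group` and `is_oy` columns described earlier. The helper `oy_share_by_group` is hypothetical and the toy values are made up, so the exact breakdown used in the analysis may differ.

```python
import pandas as pd


def oy_share_by_group(df: pd.DataFrame) -> pd.Series:
    """Weighted percentage of OY within each age group."""
    # Person weight counts towards the OY total only when the row is OY.
    weighted_oy = df["person_weight"].where(df["is_oy"], 0)
    totals = (
        df.assign(weighted_oy=weighted_oy)
        .groupby("age_group")[["weighted_oy", "person_weight"]]
        .sum()
    )
    return (100 * totals["weighted_oy"] / totals["person_weight"]).round(1)


# Toy data (hypothetical values).
toy = pd.DataFrame({
    "age_group": ["Ages 16-18", "Ages 16-18", "Ages 19-21", "Ages 19-21"],
    "person_weight": [30, 10, 20, 20],
    "is_oy": [False, True, True, False],
})
share = oy_share_by_group(toy)
print(share)
```

Summing the person weights, rather than counting rows, is what turns the sample into a population estimate before the percentage is taken.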

## Data Analysis:
This analysis was completed in four stages:
1. A map was generated to help visualise the parts of King County that make up South King County.

2. An updated estimate of the number of Opportunity Youth in South King County as well as a breakdown of this count by age group, PUMA code, educational attainment and employment availability.  

3. An update of the table "Opportunity Youth Status by Age" from the 2016 report "Opportunity Youth in the Road Map Project Region".

4. A visualisation highlighting a trend between the 2014 data and the current 2017 data.
