# Real-world Data Wrangling

In [2]:
# !python -m pip install kaggle==1.6.12

In [3]:
## !pip install --target=/workspace ucimlrepo numpy==1.24.3

**Note:** Restart the kernel to use updated package(s).

In [5]:
import pandas as pd
import numpy as np
import requests

## 1. Gather data

### **1.1.** Problem Statement
In this project, we investigate the educational attainment of individuals employed in computer-related occupations in King County, Washington, by gathering, cleaning, and analyzing relevant data to identify trends, distributions, and potential correlations within the local workforce. Data from the Seattle-Tacoma area is used as a proxy to answer the research questions.

#### Research Questions ####
1. What is the level of education of people employed in computer-related roles?
2. How does the level of education influence salary for computer-related occupations?

### **1.2.** Gathered U.S Bureau of Labor Statistics and U.S Census Bureau Dataset

##### RAW DATA SOURCE #####
- **U.S. Census Bureau - Visual tool to generate a custom U.S. Census table, with an API link provided**
    - https://data.census.gov/mdat/#/search?ds=ACSPUMS1Y2023&cv=WRK,SEX&rv=SOCP,ucgid&nv=SCHL&wt=PWGTP&g=795P200US5323301,5323302,5323303,5323304,5323305,5323306,5323307,5323308,5323310,5323311

- **U.S. Bureau of Labor Statistics**
    - https://www.bls.gov/oes/tables.htm
    - Metropolitan and nonmetropolitan area (HTML) (XLS)


#### **1.2.1** 2023 Occupational Employment and Wage (OEW) Statistics from U.S Bureau of Labor Statistics Dataset

Type: XLS file

Method: The data was gathered using the "Downloading files" method from U.S Bureau of Labor Statistics (Occupational Employment and Wage Statistics tables)

Dataset variables:

*   *AREA:* Area code
*   *AREA_TITLE:* Title of the metropolitan area
*   *OCC_CODE:* Occupational Code
*   *OCC_TITLE:* Occupational Title
*   *A_MEAN:* Mean Annual Wage

In [None]:
## Load OEW Dataset to access king county/seattle-tacoma area dataset
oew_data = pd.read_excel('./data/oew_statistics_2023_raw.xlsx')

In [None]:
## show the first 5 rows of dataset
oew_data.head(5)

#### **1.2.2** 2023 U.S Census Bureau Public Use Microdata Site (PUMS) Dataset - Custom Table

Type: API

Method: The data was gathered using the "API" method from the United States Census Bureau Public Use Microdata Sample site

Dataset variables:

*   *SOCP:* Standard Occupational Classification (SOC) codes for 2018 and later, based on 2018 SOC codes
*   *SCHL:* Educational Attainment
*   *SCHL_RC1:* Educational Attainment recode 

In [None]:
## Access PUMS API for census information for king county/seattle-tacoma dataset
url = 'https://api.census.gov/data/2023/acs/acs1/pums?get=PWGTP,SOCP,SCHL_RC1,SCHL&ucgid=795P200US5323304&recode+SCHL_RC1=%7B%22b%22:%22SCHL%22,%22d%22:%5B%5B%220%22,%2201%22,%2202%22,%2203%22,%2204%22,%2205%22,%2206%22,%2207%22,%2208%22,%2209%22,%2210%22,%2211%22,%2212%22,%2213%22,%2214%22,%2215%22%5D,%5B%2216%22,%2217%22%5D,%5B%2218%22,%2219%22%5D,%5B%2220%22,%2221%22%5D,%5B%2222%22,%2223%22,%2224%22%5D%5D%7D'
pums_response = requests.get(url)
pums_response.raise_for_status()

## Get the json
pums_response_data = pums_response.json()

## Create dataframe from json
columns = pums_response_data[0]
rows = pums_response_data[1:]
pums_data = pd.DataFrame(rows, columns=columns)
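
The `recode+SCHL_RC1` parameter in the request URL is a URL-encoded JSON object that groups raw `SCHL` codes into the `SCHL_RC1` categories. As a minimal sketch, it can be decoded with the standard library to see which codes fall into each group. The 1-based group numbering used here is an assumption, chosen to match the `'1'` to `'5'` keys used later in this notebook; the names `encoded_recode` and `recode_groups` are illustrative only.

```python
import json
from urllib.parse import unquote

# URL-encoded value of the recode+SCHL_RC1 parameter, copied from the request URL
encoded_recode = (
    "%7B%22b%22:%22SCHL%22,%22d%22:%5B%5B%220%22,%2201%22,%2202%22,%2203%22,"
    "%2204%22,%2205%22,%2206%22,%2207%22,%2208%22,%2209%22,%2210%22,%2211%22,"
    "%2212%22,%2213%22,%2214%22,%2215%22%5D,%5B%2216%22,%2217%22%5D,"
    "%5B%2218%22,%2219%22%5D,%5B%2220%22,%2221%22%5D,"
    "%5B%2222%22,%2223%22,%2224%22%5D%5D%7D"
)

recode = json.loads(unquote(encoded_recode))
base_variable = recode["b"]  # the variable being recoded
groups = recode["d"]         # one list of SCHL codes per recode group

# Assumption: groups are numbered from 1 in the order they are listed
recode_groups = {str(i): codes for i, codes in enumerate(groups, start=1)}
for key, codes in recode_groups.items():
    print(f"SCHL_RC1 {key}: SCHL codes {codes[0]} to {codes[-1]}")
```
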

In [None]:
## show the first 5 rows of dataset
print(pums_data.head(5))
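
The conversion above relies on the response being a list of lists whose first element is the header row. The sketch below repeats the same steps on a small, made-up payload (the values are hypothetical, not real PUMS records) and assumes, as the string lookup keys used later in this notebook suggest, that every field arrives as text and needs explicit conversion before numeric use.

```python
import pandas as pd

# Hypothetical payload: a header row followed by records (values are made up)
sample_payload = [
    ["PWGTP", "SOCP", "SCHL_RC1", "SCHL"],
    ["12", "151252", "4", "21"],
    ["30", "N", "2", "16"],
    ["8", "151211", "5", "22"],
]

# Split the header from the records and build the dataframe
header, *records = sample_payload
sample_df = pd.DataFrame(records, columns=header)

# Text fields that represent numbers need an explicit conversion
sample_df["PWGTP"] = pd.to_numeric(sample_df["PWGTP"])
print(sample_df.dtypes)
```
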

In [None]:
## Make copy of dataset before assessment and cleaning
pums_data_copy = pums_data.copy()

In [None]:
### header is the first row of data 
print(pums_data_copy.head(5))

In [None]:
# Saving raw PUMS dataset
pums_data.to_csv('./data/pums_2024_compsci_edlevel_king_county_raw.csv', index=False)

## 2. Assess data

Assessed the data according to data quality and tidiness metrics as reported below.

### Quality Issue 1: Invalid Values in A_MEAN Variable

In [None]:
#Inspecting dataframe for invalid characters in A_MEAN column
oew_data.head(10)

In [None]:
#Inspecting dataframe for invalid characters programmatically
oew_data['A_MEAN'].value_counts()

Issue and justification:
- **Issue**: Invalid Values in A_MEAN
- **Explanation**: Presence of invalid characters (‘*’ and ‘#’) in numerical fields affects data’s validity.
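
One possible way to list the offending values directly, rather than scanning `value_counts()` output, is to coerce the column to numbers and keep whatever fails to convert. This is only a sketch on a hypothetical series; the wage figures are made up, and `invalid_mask` is an illustrative name.

```python
import pandas as pd

# Hypothetical wage column mixing numbers with placeholder symbols
wages = pd.Series([98500, "*", 121300, "#", 76250], name="A_MEAN")

# With errors="coerce", values that cannot be parsed as numbers become NaN
as_numbers = pd.to_numeric(wages, errors="coerce")

# Invalid entries are those that were present originally but failed to convert
invalid_mask = as_numbers.isna() & wages.notna()
invalid_values = sorted(wages[invalid_mask].unique())
print(invalid_values)
```
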

### Quality Issue 2: Incorrect Data Type for A_MEAN

In [None]:
#Inspecting dataframe for invalid data type for A_MEAN
oew_data.info()

In [None]:
#Inspecting dataframe for invalid data type for A_MEAN programmatically
oew_data.dtypes

Issue and justification:
- **Issue**: Incorrect Data Type for A_MEAN 
- **Explanation**: A_MEAN is stored as a non-numeric type, which is a validity issue because it prevents numerical operations. (Completeness, by contrast, refers to missing data rather than incorrect types.)

### Tidiness Issue 1: Limited dataset to Seattle-Tacoma-Washington only

In [None]:
# Inspecting the dataframe visually
oew_data['AREA'].value_counts()

In [None]:
# Inspecting the dataframe programmatically
oew_data['AREA'].nunique()

Issue and justification:
- **Issue**: Dataset includes areas outside the Seattle-Tacoma-Washington region
- **Explanation**: The dataset includes regions beyond Seattle-Tacoma-Washington, which need to be filtered out to focus only on the relevant area.

### Tidiness Issue 2: Column Headers as Variables in PUMS Dataset

In [None]:
# Inspecting the dataframe visually
pums_data.head(5)

In [None]:
#Inspecting the dataframe programmatically
pums_data.info()

Issue and justification: 
- **Issue**: Column Headers as Variables
- **Explanation**: Columns acting as variables violate tidy data principles, thus categorizing it as a tidiness issue.

## 3. Clean data
- Cleaning data to solve the issues corresponding to data quality and tidiness found in the assessing step

In [None]:
## Make copy of dataset before cleaning
oew_cleaned = oew_data.copy()
pums_cleaned = pums_data.copy()

### **Quality Issue 1: Invalid Values in A_MEAN Variable**

In [None]:
# suppress warning
pd.set_option('future.no_silent_downcasting', True)

## clean - replace the * and # with np.nan
oew_cleaned['A_MEAN'] = oew_cleaned['A_MEAN'].replace(['*','#'], [np.nan, np.nan])

## drop NA values
#oew_cleaned = oew_cleaned.dropna()

In [None]:
## verify cleaning
oew_cleaned['A_MEAN'].value_counts()

Justification: 
- The column "A_MEAN" contains invalid values ('*' and '#'). Replacing them with np.nan marks them as missing, so the column can be converted to a numeric type and the affected rows can be dropped with dropna if needed.

### **Quality Issue 2: Incorrect Data Type for A_MEAN**

In [None]:
## convert A_MEAN datatype to float
oew_cleaned['A_MEAN'] = oew_cleaned['A_MEAN'].astype('float')

In [None]:
## verify data type conversion
assert oew_cleaned['A_MEAN'].dtype == 'float'

Justification: 
- *Converting A_MEAN from object to float so that calculations are accurate*

### **Tidiness Issue 1: Limited dataset to Seattle-Tacoma-Washington only**

In [None]:
# Limit OEW dataset to Seattle-Tacoma-Washington area only
seattle_area = 42660
oew_cleaned = oew_cleaned[oew_cleaned['AREA'] == seattle_area]

In [None]:
# Verify - Inspecting the dataframe visually
oew_cleaned.head()

Justification: 
- *Filtering by location focuses the dataset on a specific observational unit*
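
Whether `AREA` loads as an integer or as text depends on the source file, and an equality filter silently returns no rows if the types differ. The sketch below, on made-up rows, compares on a string representation so that either case works; this is an assumption about how one might harden the filter, not a change to the notebook's approach.

```python
import pandas as pd

# Hypothetical slice of the OEW table (area codes and wages are made up)
oew_sample = pd.DataFrame({
    "AREA": [42660, 42660, 38900, 31080],
    "OCC_CODE": ["15-1252", "15-1211", "15-1252", "15-1252"],
    "A_MEAN": [150000.0, 120000.0, 130000.0, 140000.0],
})

# Compare as text so the filter behaves the same for int and str columns
seattle_area = "42660"
in_seattle = oew_sample["AREA"].astype(str) == seattle_area
seattle_only = oew_sample[in_seattle]

# Programmatic check that exactly one area remains
print(seattle_only["AREA"].nunique())
```
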

### **Tidiness Issue 2: Column Headers as Variables in PUMS Dataset**

In [None]:
# Keep only the SOCP, SCHL_RC1, and SCHL variables
pums_cleaned = pums_cleaned[['SOCP','SCHL_RC1', 'SCHL']].reindex()

In [None]:
# Cleaning verification
pums_cleaned.columns

Justification: 
- *Easier to programmatically use column names rather than column id or number*

### **Remove unnecessary variables and combine datasets**

#### **Remove unnecessary variables** ###

In [None]:
# Remove Unnecessary Columns from OEW dataset
oew_cleaned = oew_cleaned[['AREA', 'OCC_CODE', 'OCC_TITLE', 'A_MEAN']].reindex()

In [None]:
# Limit OEW dataset to Computer related occupations 
comsci_startswith = '15-'
oew_cleaned = oew_cleaned[ oew_cleaned['OCC_CODE'].str.startswith(comsci_startswith)]
assert(oew_cleaned[~oew_cleaned['OCC_CODE'].str.startswith(comsci_startswith)].empty)

In [None]:
## Limit PUMS dataset Computer-Related Occupations only
pums_cleaned = pums_cleaned[pums_cleaned['SOCP'].str.contains(r'^15\d{4}$', case=False, regex=True)]
assert( pums_cleaned[~pums_cleaned['SOCP'].str.contains(r'^15\d{4}$', case=False, regex=True)].empty)

In [None]:
## Fix format of OEW/OCC_CODE to match PUMS/SOCP by removing hyphen
oew_cleaned['OCC_CODE'] = oew_cleaned['OCC_CODE'].replace('-', '', regex=True)
assert not (oew_cleaned['OCC_CODE'].str.contains('-').any())

In [None]:
# Remove invalid Data in SOCP ( == 'N' ) Variable that is not relevant in this context
pums_cleaned = pums_cleaned[pums_cleaned['SOCP'] != 'N']
assert( pums_cleaned[pums_cleaned['SOCP'] == 'N'].empty)

#### **Combine Dataset** ####

In [None]:
## Create lookup list for education level (ED_LEVEL)
ed_level_data = {
    'SCHL_RC1': ['1', '2', '3', '4', '5'],
    'ED_LEVEL': [
        'No high school diploma',    # SCHL_RC1 == '1'
        'High school diploma',       # SCHL_RC1 == '2'
        'Completed Some College',    # SCHL_RC1 == '3'
        'Graduated College',         # SCHL_RC1 == '4'
        'Completed Advanced Degree'  # SCHL_RC1 == '5'
    ]
}

ed_level_df = pd.DataFrame(ed_level_data)

# merge with pums dataset using SCHL_RC1 as key
pums_merged = pums_cleaned.merge(ed_level_df, on='SCHL_RC1', how='left')
pums_merged.head()
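
Because the lookup is joined with `how='left'`, any `SCHL_RC1` value missing from the lookup ends up with an empty `ED_LEVEL` rather than raising an error. A minimal sketch of a coverage check, using `Series.map` on hypothetical records (the code `'9'` is deliberately unknown and made up):

```python
import pandas as pd

# Same labels as the lookup table, keyed by the recode value as text
ed_level_map = {
    "1": "No high school diploma",
    "2": "High school diploma",
    "3": "Completed Some College",
    "4": "Graduated College",
    "5": "Completed Advanced Degree",
}

# Hypothetical records; '9' does not exist in the lookup
sample = pd.DataFrame({"SCHL_RC1": ["4", "5", "2", "9"]})
sample["ED_LEVEL"] = sample["SCHL_RC1"].map(ed_level_map)

# Rows whose code was not found in the lookup
unmapped = sample[sample["ED_LEVEL"].isna()]
print(unmapped)
```
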

In [None]:
# merge oew and pums
pum_oew_merged = pums_merged.merge(oew_cleaned, left_on='SOCP', right_on='OCC_CODE', how='left')
pum_oew_merged.head(10)
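
A left merge keeps every PUMS record even when no matching occupation exists in the OEW table, so it can be useful to count the unmatched rows. The sketch below uses made-up data with the `indicator` and `validate` options of `DataFrame.merge`; the occupation codes and wages are hypothetical.

```python
import pandas as pd

# Hypothetical person-level records and occupation-level wages (values made up)
pums_sample = pd.DataFrame({"SOCP": ["151252", "151211", "151299", "151252"]})
oew_sample = pd.DataFrame({
    "OCC_CODE": ["15-1252", "15-1211"],
    "A_MEAN": [150000.0, 120000.0],
})

# Normalize the key format by removing the hyphen as a literal character
oew_sample["OCC_CODE"] = oew_sample["OCC_CODE"].str.replace("-", "", regex=False)

# validate raises if OCC_CODE is not unique; indicator adds a _merge column
merged = pums_sample.merge(
    oew_sample,
    left_on="SOCP",
    right_on="OCC_CODE",
    how="left",
    validate="many_to_one",
    indicator=True,
)

unmatched = int((merged["_merge"] == "left_only").sum())
print(f"{unmatched} of {len(merged)} records have no wage data")
```
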

## 4. Update your data store
Updating local data store with the cleaned data

In [None]:
#Saving cleaned OEW dataset
oew_cleaned.to_csv('./data/oews_2024_compsci_wages_king_county_cleaned.csv', index=False)

In [None]:
# Saving cleaned PUMS dataset
pums_merged.to_csv('./data/pums_2024_compsci_edlevel_king_county_cleaned.csv', index=False)

In [None]:
#saving data
pum_oew_merged.to_csv('./data/pums_oews_2024_compsci_edlevel_merged.csv', index=False)

## 5. Answer the research question

#### Research Questions ####
1. What is the level of education of people employed in computer-related roles?
2. How does the level of education influence salary for computer-related occupations?

In [None]:
%matplotlib inline

In [None]:
## Plot of level of education to the size in the dataset 
pum_oew_ed_level = pum_oew_merged.groupby('ED_LEVEL').size()
pum_oew_ed_level.plot(kind='barh', ylabel='Educational Level', legend=False, grid=True);

*Answer to research question:* 
- A majority of those employed in computer-related occupations have graduated college or hold advanced degrees
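
Grouping on a plain text column sorts the levels alphabetically, which can make the chart harder to read. One option, sketched here on made-up records, is to declare `ED_LEVEL` as an ordered categorical so that counts follow the education order. The counts are computed without plotting; `level_order` is an illustrative name.

```python
import pandas as pd

# Education levels from lowest to highest, matching the lookup labels
level_order = [
    "No high school diploma",
    "High school diploma",
    "Completed Some College",
    "Graduated College",
    "Completed Advanced Degree",
]

# Hypothetical records (made up)
sample = pd.DataFrame({"ED_LEVEL": [
    "Graduated College",
    "Completed Advanced Degree",
    "Graduated College",
    "High school diploma",
    "Completed Some College",
]})

# An ordered categorical keeps groups in the declared order
sample["ED_LEVEL"] = pd.Categorical(
    sample["ED_LEVEL"], categories=level_order, ordered=True
)

# observed=False keeps levels with zero records in the result
level_counts = sample.groupby("ED_LEVEL", observed=False).size()
print(level_counts)
```

Calling `level_counts.plot(kind='barh')` on the result would then draw the bars in education order.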

In [None]:
# Plot of impact of the level of education on salary
pum_oew_merged.plot(kind='scatter', y='ED_LEVEL', x='A_MEAN', 
                    ylabel='Educational Level', xlabel='Mean Annual Income',
                    legend=False,grid=True);

In [None]:
# Calculate the Pearson correlation coefficient
# SCHL_RC1 is stored as text, so convert it to a number first
pearson_cc = pum_oew_merged['A_MEAN'].corr(pd.to_numeric(pum_oew_merged['SCHL_RC1']))
print(f'Pearson Correlation Coefficient is {pearson_cc:.3f}')

*Answer to research question:* 
- The scatter plot shows only a weak trend: mean annual income does not rise noticeably with the level of education attained in computer-related occupations. The Pearson coefficient of 0.152 is consistent with a weak positive correlation between educational level and salary.
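
Since the education level is an ordinal code rather than a continuous measure, a rank-based coefficient may be a useful cross-check on the Pearson value. The sketch below uses made-up numbers and computes a Spearman-style coefficient as the Pearson correlation of the ranks; it is an illustration under those assumptions, not a result from this dataset.

```python
import pandas as pd

# Hypothetical data: wages rise with education level, but not linearly
sample = pd.DataFrame({
    "SCHL_RC1": ["1", "2", "3", "4", "5"],
    "A_MEAN": [50000.0, 60000.0, 80000.0, 120000.0, 200000.0],
})

# Convert the text codes to numbers before correlating
level = pd.to_numeric(sample["SCHL_RC1"])

# Pearson measures linear association
pearson = sample["A_MEAN"].corr(level)

# Pearson on the ranks gives the Spearman rank correlation
spearman = sample["A_MEAN"].rank().corr(level.rank())

print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```
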

### **5.2:** Reflection
If I had more time for this project, I would explore the following areas further:
#### Data Related ####
- Investigate outliers in the mean annual income
- Ensure data values and formatting align with industry standards
#### Analysis Related ####
- Analyze trends over time to help forecast salary expectations in the tech field
- Explore job type distribution to see which occupations are most or least common