# Merging SOC 2018 and Census 2010 Occupational Codes



## Statement of the Problem

To explore the question of flexibility, the inherent properties of occupations should be considered. To do this, I use data from O\*Net to build an index of flexibility for a particular occupation. However, the O\*Net database uses the Standard Occupation Classification (SOC) Code, updated annually, while the CPS and ATUS use the 2010 Census Occupation Classification Codes. 

Linking the occupation codes requires utilizing the Crosswalk (https://www2.census.gov/programs-surveys/demo/guidance/industry-occupation/2018-occupation-code-list-and-crosswalk.xlsx). The design of the Crosswalk between 2018 SOC and 2010 Census is not friendly to easy linking. 

In order to use the Crosswalk to link the 2018 SOC with the 2010 Census, we must first extract some codes from the 2018 Census Title column, fill in empty cells (imported as NaN), create a dictionary, and append the 2010 Census codes to some O\*Net data. 

With the final .csv, we can merge the CPS/ATUS data with the O\*Net job characteristics via the 2010 Census occupation code. 

## Extracting 2018 SOC Codes

In [1]:
import pandas as pd

Download the crosswalk table, removing the first three rows of title, date updated, and white space. 

In [2]:
crosswalk = pd.read_csv('../ONET/2010 to 2018 Crosswalk -Table 1.csv',skiprows=3)

Looking at the first ten rows of the data, we see that there are SOC Codes in the Census Title column. We want to extract these codes and put them into the SOC Code column. 

In [3]:
crosswalk[0:10]

Unnamed: 0,2010 SOC code,2010 Census Code,2010 Census Title \n,2018 SOC Code,2018 Census Code,2018 Census Title
0,11-1011,10.0,Chief Executives,11-1011,10.0,Chief Executives
1,11-1021,20.0,General and Operations Managers,11-1021,20.0,General and Operations Managers
2,11-1031,30.0,Legislators,11-1031,30.0,Legislators
3,11-2011,40.0,Advertising and Promotions Managers,11-2011,40.0,Advertising and Promotions Managers
4,11-2020,50.0,Marketing and Sales Managers,,,
5,,,,11-2021,51.0,Marketing Managers
6,,,,11-2022,52.0,Sales Managers
7,11-2031,60.0,Public Relations and Fundraising Managers,11-2030,60.0,Public Relations and Fundraising Managers
8,,,,,,Public Relations Managers (11-2032)
9,,,,,,Fundraising Managers (11-2033)


To extract the codes, we are going to use the split function.

First we split along the first parenthesis.

In [4]:
crosswalk[['2018 Census Title short','tmp_Code']] = crosswalk['2018 Census Title '].str.split("(", expand=True)

In [24]:
crosswalk.head(10)

Unnamed: 0,2010 SOC code,2010 Census Code,2010 Census Title \n,2018 SOC Code,2018 Census Code,2018 Census Title,2018 Census Title short,Code
0,11-1011,10,Chief Executives,11-1011,10.0,Chief Executives,Chief Executives,
1,11-1021,20,General and Operations Managers,11-1021,20.0,General and Operations Managers,General and Operations Managers,
2,11-1031,30,Legislators,11-1031,30.0,Legislators,Legislators,
3,11-2011,40,Advertising and Promotions Managers,11-2011,40.0,Advertising and Promotions Managers,Advertising and Promotions Managers,
4,11-2020,50,Marketing and Sales Managers,,,,,
5,,50,,11-2021,51.0,Marketing Managers,Marketing Managers,
6,,50,,11-2022,52.0,Sales Managers,Sales Managers,
7,11-2031,60,Public Relations and Fundraising Managers,11-2030,60.0,Public Relations and Fundraising Managers,Public Relations and Fundraising Managers,
8,,60,,11-2032,,Public Relations Managers (11-2032),Public Relations Managers,11-2032
9,,60,,11-2033,,Fundraising Managers (11-2033),Fundraising Managers,11-2033


Now we split along the second parenthesis, and drop the unnecessary columns. 

In [6]:
crosswalk[['Code','paren']] = crosswalk['tmp_Code'].str.split(")", expand=True)
crosswalk.drop(['tmp_Code', 'paren'], axis = 1, inplace = True)

In [7]:
crosswalk.head(10)

Unnamed: 0,2010 SOC code,2010 Census Code,2010 Census Title \n,2018 SOC Code,2018 Census Code,2018 Census Title,2018 Census Title short,Code
0,11-1011,10.0,Chief Executives,11-1011,10.0,Chief Executives,Chief Executives,
1,11-1021,20.0,General and Operations Managers,11-1021,20.0,General and Operations Managers,General and Operations Managers,
2,11-1031,30.0,Legislators,11-1031,30.0,Legislators,Legislators,
3,11-2011,40.0,Advertising and Promotions Managers,11-2011,40.0,Advertising and Promotions Managers,Advertising and Promotions Managers,
4,11-2020,50.0,Marketing and Sales Managers,,,,,
5,,,,11-2021,51.0,Marketing Managers,Marketing Managers,
6,,,,11-2022,52.0,Sales Managers,Sales Managers,
7,11-2031,60.0,Public Relations and Fundraising Managers,11-2030,60.0,Public Relations and Fundraising Managers,Public Relations and Fundraising Managers,
8,,,,,,Public Relations Managers (11-2032),Public Relations Managers,11-2032
9,,,,,,Fundraising Managers (11-2033),Fundraising Managers,11-2033


Now that we have extracted the SOC Codes from the Census title, we need to move the values to the 2018 SOC Code column

In [8]:
crosswalk['2018 SOC Code'].fillna(value=crosswalk['Code'], inplace=True)

In [9]:
crosswalk.head(10)

Unnamed: 0,2010 SOC code,2010 Census Code,2010 Census Title \n,2018 SOC Code,2018 Census Code,2018 Census Title,2018 Census Title short,Code
0,11-1011,10.0,Chief Executives,11-1011,10.0,Chief Executives,Chief Executives,
1,11-1021,20.0,General and Operations Managers,11-1021,20.0,General and Operations Managers,General and Operations Managers,
2,11-1031,30.0,Legislators,11-1031,30.0,Legislators,Legislators,
3,11-2011,40.0,Advertising and Promotions Managers,11-2011,40.0,Advertising and Promotions Managers,Advertising and Promotions Managers,
4,11-2020,50.0,Marketing and Sales Managers,,,,,
5,,,,11-2021,51.0,Marketing Managers,Marketing Managers,
6,,,,11-2022,52.0,Sales Managers,Sales Managers,
7,11-2031,60.0,Public Relations and Fundraising Managers,11-2030,60.0,Public Relations and Fundraising Managers,Public Relations and Fundraising Managers,
8,,,,11-2032,,Public Relations Managers (11-2032),Public Relations Managers,11-2032
9,,,,11-2033,,Fundraising Managers (11-2033),Fundraising Managers,11-2033


Now, we want to forward fill the 2010 Census code to remove the NaN caused by blanks in the Excel spreadsheet.

In [10]:
crosswalk['2010 Census Code'].fillna(method = 'ffill', inplace = True)

In [11]:
crosswalk.head(15)

Unnamed: 0,2010 SOC code,2010 Census Code,2010 Census Title \n,2018 SOC Code,2018 Census Code,2018 Census Title,2018 Census Title short,Code
0,11-1011,10,Chief Executives,11-1011,10.0,Chief Executives,Chief Executives,
1,11-1021,20,General and Operations Managers,11-1021,20.0,General and Operations Managers,General and Operations Managers,
2,11-1031,30,Legislators,11-1031,30.0,Legislators,Legislators,
3,11-2011,40,Advertising and Promotions Managers,11-2011,40.0,Advertising and Promotions Managers,Advertising and Promotions Managers,
4,11-2020,50,Marketing and Sales Managers,,,,,
5,,50,,11-2021,51.0,Marketing Managers,Marketing Managers,
6,,50,,11-2022,52.0,Sales Managers,Sales Managers,
7,11-2031,60,Public Relations and Fundraising Managers,11-2030,60.0,Public Relations and Fundraising Managers,Public Relations and Fundraising Managers,
8,,60,,11-2032,,Public Relations Managers (11-2032),Public Relations Managers,11-2032
9,,60,,11-2033,,Fundraising Managers (11-2033),Fundraising Managers,11-2033


## Dictionary to Match Census and SOC Codes

We are ready to create the dictionary. As we have many SOC codes to one Census code, we want to groupby SOC code and pull the relevant Census code.

In [12]:
crosswalk_dict = dict(crosswalk.groupby('2018 SOC Code')['2010 Census Code'].apply(list))

In [13]:
crosswalk_dict

{'11-1011': ['0010'],
 '11-1021': ['0020'],
 '11-1031': ['0030'],
 '11-2011': ['0040'],
 '11-2021': ['0050'],
 '11-2022': ['0050'],
 '11-2030': ['0060'],
 '11-2032': ['0060'],
 '11-2033': ['0060'],
 '11-3012': ['0100'],
 '11-3013': ['0100'],
 '11-3021': ['0110'],
 '11-3031': ['0120'],
 '11-3051': ['0140'],
 '11-3061': ['0150'],
 '11-3071': ['0160'],
 '11-3111': ['0135'],
 '11-3121': ['0136'],
 '11-3131': ['0137'],
 '11-9013': ['0205'],
 '11-9021': ['0220'],
 '11-9030': ['0230'],
 '11-9031': ['0230'],
 '11-9032': ['0230'],
 '11-9033': ['0230'],
 '11-9039': ['0230'],
 '11-9041': ['0300'],
 '11-9051': ['0310'],
 '11-9070': ['0330'],
 '11-9071': ['0330'],
 '11-9072': ['0330'],
 '11-9081': ['0340'],
 '11-9111': ['0350'],
 '11-9121': ['0360'],
 '11-9131': ['0400'],
 '11-9141': ['0410'],
 '11-9151': ['0420'],
 '11-9161': ['0425'],
 '11-9171': ['0325'],
 '11-9179': ['0425'],
 '11-9199': ['0430'],
 '13-1011': ['0500'],
 '13-1021': ['0510'],
 '13-1022': ['0520'],
 '13-1023': ['0530'],
 '13-1030'

## Using Dictionary to Rename Occupation Codes

ONet to determine level of flexibility by occupation. Measures include: 
- Work Context — Freedom to Make Decisions (https://www.onetonline.org/find/descriptor/result/4.C.3.a.4)
    - No freedom 0 ~ 100 A lot of freedom
- Work Context — Structured versus Unstructured Work (https://www.onetonline.org/find/descriptor/result/4.C.3.b.8)
    - Structured (no freedom) 0 ~ Unstructured (a lot of freedom)
- Work Context — Time Pressure (https://www.onetonline.org/find/descriptor/result/4.C.3.d.1)
    - Never 0 ~ 100 Every day
- Work Context — Regular Work Schedules (https://www.onetonline.org/find/descriptor/result/4.C.3.d.4)
    - Regular/established schedule 0 ~ 100 Seasonal/only during certain times of the year
- Work Styles — Independence (https://www.onetonline.org/find/descriptor/result/1.C.6)
    - No independence 0 ~ 100 A lot of independence
    - "Job requires developing one's own ways of doing things, guiding oneself with little or no supervision, and depending on oneself to get things done."

These were determined to impact flexibility from the browse tool (https://www.onetonline.org/find/descriptor/browse/)

In [14]:
free = pd.read_csv('../ONET/Freedom_to_Make_Decisions.csv')
struct = pd.read_csv('../ONET/Structured_versus_Unstructured_Work.csv')
time = pd.read_csv('../ONET/Time_Pressure.csv')
sched = pd.read_csv('../ONET/Work_Schedules.csv')
indep = free = pd.read_csv('../ONET/Freedom_to_Make_Decisions.csv')

In [15]:
free = free.rename(columns = {'Context':'Freedom_to_Make_Decisions'})
struct = struct.rename(columns = {'Context':'Structured_v_Unstructured'})
time = time.rename(columns = {'Context':'Time_Pressure'})
sched = sched.rename(columns = {'Context':'Regular_Schedule'})
indep = indep.rename(columns = {'Context':'Independence'})

In [16]:
tmp = pd.merge(free, struct, how='left', on=['Code','Occupation'])
tmp = pd.merge(tmp, time, how='left', on=['Code','Occupation'])
tmp = pd.merge(tmp, sched, how='left', on=['Code','Occupation'])
tmp = pd.merge(tmp, indep, how='left', on=['Code','Occupation'])

onet = tmp
# onet['2018 SOC Code'] = onet['Code'].astype('str')
onet.rename(columns = {'Code': '2018 SOC Code'}, inplace=True)
onet.head()

Unnamed: 0,Freedom_to_Make_Decisions,2018 SOC Code,Occupation,Structured_v_Unstructured,Time_Pressure,Regular_Schedule,Independence
0,100,29-1213.00,Dermatologists,94,70,0,100
1,100,29-1215.00,Family Medicine Physicians,87,93,0,100
2,100,23-1023.00,"Judges, Magistrate Judges, and Magistrates",98,93,0,100
3,100,29-1024.00,Prosthodontists,93,77,6,100
4,100,29-1229.06,Sports Medicine Physicians,94,72,27,100


In [17]:
onet[['2018 SOC Code','drop']] = onet['2018 SOC Code'].str.split(".", expand=True)
onet.drop(columns='drop', inplace=True)

onet.head()

Unnamed: 0,Freedom_to_Make_Decisions,2018 SOC Code,Occupation,Structured_v_Unstructured,Time_Pressure,Regular_Schedule,Independence
0,100,29-1213,Dermatologists,94,70,0,100
1,100,29-1215,Family Medicine Physicians,87,93,0,100
2,100,23-1023,"Judges, Magistrate Judges, and Magistrates",98,93,0,100
3,100,29-1024,Prosthodontists,93,77,6,100
4,100,29-1229,Sports Medicine Physicians,94,72,27,100


Trying to remove the 2010 Census code from the list in the DateFrame showed that there were missing values in the Census Code column. Investigation found that there were two values not mapping in the dictionary. 

- 2018 SOC Code 53-6031 is listed as 56-6031 in the Crosswalk file, but I am confident that this is a typo as it is listed under 53-6030. The corresponding 2010 Census code is 9360.
- 2018 SOC Code 17-3012 is listed as having a space in the Crosswalk file that was not removed in the string split procedure above. The corresponding 2010 Census code is 1540.


In [18]:
onet['2010 Census Code'] = onet['2018 SOC Code'].map(crosswalk_dict)
nulls = onet[onet['2010 Census Code'].isna()]

nulls

Unnamed: 0,Freedom_to_Make_Decisions,2018 SOC Code,Occupation,Structured_v_Unstructured,Time_Pressure,Regular_Schedule,Independence,2010 Census Code
235,85,53-6031,Automotive and Watercraft Service Attendants,69,90,11,85,
667,70,17-3012,Electrical and Electronics Drafters,70,83,15,70,


In [19]:
CORRECTION1 = (onet['2018 SOC Code'] == '53-6031')
CORRECTION2 = (onet['2018 SOC Code'] == '17-3012')

onet.loc[CORRECTION1, '2010 Census Code'] = '[9360]'
onet.loc[CORRECTION2, '2010 Census Code'] = '[1540]'

In [20]:
onet.head()

Unnamed: 0,Freedom_to_Make_Decisions,2018 SOC Code,Occupation,Structured_v_Unstructured,Time_Pressure,Regular_Schedule,Independence,2010 Census Code
0,100,29-1213,Dermatologists,94,70,0,100,[3060]
1,100,29-1215,Family Medicine Physicians,87,93,0,100,[3060]
2,100,23-1023,"Judges, Magistrate Judges, and Magistrates",98,93,0,100,[2110]
3,100,29-1024,Prosthodontists,93,77,6,100,[3010]
4,100,29-1229,Sports Medicine Physicians,94,72,27,100,[3060]


Having corrected for the missing values, we can remove the code from the list. 

In [21]:
onet['2010 Census Code'] = onet['2010 Census Code'].apply(lambda x: x[0])

In [22]:
onet.head()

Unnamed: 0,Freedom_to_Make_Decisions,2018 SOC Code,Occupation,Structured_v_Unstructured,Time_Pressure,Regular_Schedule,Independence,2010 Census Code
0,100,29-1213,Dermatologists,94,70,0,100,3060
1,100,29-1215,Family Medicine Physicians,87,93,0,100,3060
2,100,23-1023,"Judges, Magistrate Judges, and Magistrates",98,93,0,100,2110
3,100,29-1024,Prosthodontists,93,77,6,100,3010
4,100,29-1229,Sports Medicine Physicians,94,72,27,100,3060


In [23]:
onet.to_csv('onet_flex.csv', index=False)