# DSI Summer Workshops Series

## June 28, 2018

Peggy Lindner<br>
Center for Advanced Computing & Data Science (CACDS)<br>
Data Science Institute (DSI)<br>
University of Houston  
plindner@uh.edu 


This Jupyter notebook is available at:
http://130.211.184.150/hub/login


## How Much Money Should Machines Earn? *
### - A journey into computerization (jobs that will be taken over by machines)

Let's learn some R by creating an interactive visualization of open data. Along the way you will practice several core skills of a data scientist:
* loading data,
* transforming data,
* combining data,
* cleaning data, and
* creating a suitable visualization.

### Datasets used

1. The probability of computerisation of 702 detailed occupations, obtained by Carl Benedikt Frey and Michael A. Osborne from the University of Oxford, using a Gaussian process classifier and published in [this paper](https://www.oxfordmartin.ox.ac.uk/downloads/academic/The_Future_of_Employment.pdf) in 2013.

2. Job statistics (employment, median annual wages, and typical education needed for entry) from the US Bureau of Labor Statistics, available here.


R needs some additional packages to do the work ...

In [2]:
# Load libraries
library(dplyr)
library(tabulizer)
library(rlist)
library(readxl)


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union
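
If one of the `library()` calls stops with a "there is no package called ..." error, that package still has to be installed. A minimal sketch for checking this up front, using only base R (the variable names `needed` and `missing_pkgs` are illustrative, and note that `tabulizer` also depends on a working Java installation):

```r
# Packages loaded in the cell above
needed <- c("dplyr", "tabulizer", "rlist", "readxl")

# system.file() returns "" when a package is not installed,
# and it does not load the package while checking
installed <- vapply(needed,
                    function(p) nzchar(system.file(package = p)),
                    logical(1))

# Names of the packages that still have to be installed
missing_pkgs <- needed[!installed]
missing_pkgs
```

An empty result (`character(0)`) means everything is in place.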



#### Data (Down)Loading

In [3]:
#############################################################################################
# Download and parse data about probability of computerisation
#############################################################################################

# set some variables to be used for download
urlfile <- "https://www.oxfordmartin.ox.ac.uk/downloads/academic/The_Future_of_Employment.pdf"
file <- "The_Future_of_Employment.pdf"

# download the pdf file (if we haven't done so already)
if (!file.exists(file)) {
    download.file(urlfile, destfile = file, mode = 'wb')
}
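
The "download only if the file is not there yet" pattern appears twice in this notebook, so it could be wrapped in a small function. The sketch below is one possible way to do that; `download_if_missing` is a hypothetical helper name, not part of any package, and the example URL is a placeholder that is never requested.

```r
# Hypothetical helper: download `url` to `dest` only if `dest` does not exist yet
download_if_missing <- function(url, dest) {
  if (!file.exists(dest)) {
    # mode = "wb" keeps binary files (PDF, xlsx) intact on Windows
    download.file(url, destfile = dest, mode = "wb")
  }
  invisible(dest)
}

# Demonstration without network access: the file already exists,
# so download.file() is never called and the content stays untouched
tmp <- tempfile(fileext = ".txt")
writeLines("already here", tmp)
download_if_missing("https://example.org/never-requested", tmp)
readLines(tmp)
```

With this helper, the cell above would reduce to `download_if_missing(urlfile, file)`.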

#### Extracting data from a pdf file

We use [Tabula](https://tabula.technology/) from within R, through the `tabulizer` package.

In [4]:
# Extract tables using tabulizer - that looks a little bit like magic (and it takes some time)
out <- extract_tables(file, encoding="UTF-8")

In [5]:
# let's have a look at the "thing" that we just got
out

0,1,2,3,4,5,6,7,8
0.0,,,0.0,,,0.0,,
0.0,0.5,1.0,0.0,0.5,1.0,0.0,,0.5 1
,Probability of,,,Probability of,,,,Probability of
,Computerisation,,,Computerisation,,,,Computerisation
0.0,,,100.0,,,80.0,,
,,,,Cramped work space 60,,,,Originalit
50.0,,,50.0,,,40.0,,
,,,,,,20.0,,
0.0,,,0.0,,,0.0,,
0.0,0.5,1.0,0.0,0.5,1.0,0.0,,0.5 1

0,1
Variable,Probability of Computerisation
,Low Medium High
Assisting and caring for others,48±20 41±17 34±10
Persuasion,48±7.1 35±9.8 32±7.8
Negotiation,44±7.6 33±9.3 30±8.9
Social perceptiveness,51±7.9 41±7.4 37±5.5
Fine arts,12±20 3.5±12 1.3±5.5
Originality,51±6.5 35±12 32±5.6
Manual dexterity,22±18 34±15 36±14
Finger dexterity,36±10 39±10 40±10

0,1,2,3,4
,Computerisable,,,
Rank,Probability,Label,SOC code,Occupation
1.,0.0028,,29-1125,Recreational Therapists
2.,0.003,,49-1011,"First-Line Supervisors of Mechanics, Installers, and Repairers"
3.,0.003,,11-9161,Emergency Management Directors
4.,0.0031,,21-1023,Mental Health and Substance Abuse Social Workers
5.,0.0033,,29-1181,Audiologists
6.,0.0035,,29-1122,Occupational Therapists
7.,0.0035,,29-2091,Orthotists and Prosthetists
8.,0.0035,,21-1022,Healthcare Social Workers

0,1,2,3,4
,Computerisable,,,
Rank,Probability,Label,SOC code,Occupation
40.,0.0077,,25-2054,"Special Education Teachers, Secondary School"
41.,0.0078,,25-2031,"Secondary School Teachers, Except Special and Career/Technical Edu-"
,,,,cation
42.,0.0081,0,21-2011,Clergy
43.,0.0081,,19-1032,Foresters
44.,0.0085,,21-1012,"Educational, Guidance, School, and Vocational Counselors"
45.,0.0088,,25-2032,"Career/Technical Education Teachers, Secondary School"
46.,0.009,0,29-1111,Registered Nurses

0,1,2,3,4
,Computerisable,,,
Rank,Probability,Label,SOC code,Occupation
88.,0.021,,17-2131,Materials Engineers
89.,0.021,0,27-1022,Fashion Designers
90.,0.021,,29-1123,Physical Therapists
91.,0.021,,27-4021,Photographers
92.,0.022,,27-2012,Producers and Directors
93.,0.022,,27-1025,Interior Designers
94.,0.023,,29-1023,Orthodontists
95.,0.023,,27-1011,Art Directors

0,1,2,3,4
,Computerisable,,,
Rank,Probability,Label,SOC code,Occupation
135.,0.047,,15-2021,Mathematicians
136.,0.047,,27-1023,Floral Designers
137.,0.047,,11-9013,"Farmers, Ranchers, and Other Agricultural Managers"
138.,0.048,,33-2022,Forest Fire Inspectors and Prevention Specialists
139.,0.049,,29-2041,Emergency Medical Technicians and Paramedics
140.,0.055,,27-3041,Editors
141.,0.055,,29-1024,Prosthodontists
142.,0.055,0,29-9799,"Healthcare Practitioners and Technical Workers, All Other"

0,1,2,3,4
,Computerisable,,,
Rank,Probability,Label,SOC code,Occupation
184.,0.13,,19-3051,Urban and Regional Planners
185.,0.13,,21-1093,Social and Human Service Assistants
186.,0.13,,25-3021,Self-Enrichment Education Teachers
187.,0.13,,27-4014,Sound Engineering Technicians
188.,0.14,,29-1041,Optometrists
189.,0.14,,17-2151,"Mining and Geological Engineers, Including Mining Safety Engineers"
190.,0.14,,29-1071,Physician Assistants
191.,0.15,,25-2012,"Kindergarten Teachers, Except Special Education"

0,1,2,3,4
,Computerisable,,,
Rank,Probability,Label,SOC code,Occupation
229.,0.26,,25-2023,"Career/Technical Education Teachers, Middle School"
230.,0.27,,53-5021,"Captains, Mates, and Pilots of Water Vessels"
231.,0.27,,31-2012,Occupational Therapy Aides
232.,0.27,,49-9062,Medical Equipment Repairers
233.,0.28,,41-1011,First-Line Supervisors of Retail Sales Workers
234.,0.28,0,27-2021,Athletes and Sports Competitors
235.,0.28,,39-1011,Gaming Supervisors
236.,0.29,,39-5094,Skincare Specialists

0,1,2,3,4
,Computerisable,,,
Rank,Probability,Label,SOC code,Occupation
275.,0.41,,51-2041,Structural Metal Fabricators and Fitters
276.,0.41,1,23-1012,Judicial Law Clerks
277.,0.41,,49-2094,"Electrical and Electronics Repairers, Commercial and Industrial Equip-"
,,,,ment
278.,0.42,,19-4093,Forest and Conservation Technicians
279.,0.42,,53-1021,"First-Line Supervisors of Helpers, Laborers, and Material Movers,"
,,,,Hand
280.,0.43,,39-3093,"Locker Room, Coatroom, and Dressing Room Attendants"

0,1,2,3,4,5,6,7
,Computerisable,,,,,,
Rank,Probability,Label,SOC code,Occupation,,,
322.,0.57,,33-3052,Transit and Railroad Police,,,
323.,0.57,,37-1012,"First-Line Supervisors of Landscaping,",Lawn,"Service,",and
,,,,Groundskeeping Workers,,,
324.,0.58,,13-2052,Personal Financial Advisors,,,
325.,0.59,,49-9044,Millwrights,,,
326.,0.59,,25-4013,Museum Technicians and Conservators,,,
327.,0.59,,47-5042,Mine Cutting and Channeling Machine Operators,,,
328.,0.59,0,11-3071,"Transportation, Storage, and Distribution Managers",,,

0,1,2,3,4
,Computerisable,,,
Rank,Probability,Label,SOC code,Occupation
369.,0.67,,51-9196,"Paper Goods Machine Setters, Operators, and Tenders"
370.,0.67,,51-4071,Foundry Mold and Coremakers
371.,0.67,,19-2021,Atmospheric and Space Scientists
372.,0.67,1,53-3021,"Bus Drivers, Transit and Intercity"
373.,0.67,,33-9092,"Lifeguards, Ski Patrol, and Other Recreational Protective Service Work-"
,,,,ers
374.,0.67,,49-9041,Industrial Machinery Mechanics
375.,0.68,,43-5052,Postal Service Mail Carriers

0,1,2,3,4
,Computerisable,,,
Rank,Probability,Label,SOC code,Occupation
417.,0.76,,49-2092,"Electric Motor, Power Tool, and Related Repairers"
418.,0.76,,45-4021,Fallers
419.,0.77,,19-4091,"Environmental Science and Protection Technicians, Including Health"
420.,0.77,,49-9094,Locksmiths and Safe Repairers
421.,0.77,,37-3013,Tree Trimmers and Pruners
422.,0.77,,35-3011,Bartenders
423.,0.77,,13-1023,"Purchasing Agents, Except Wholesale, Retail, and Farm Products"
424.,0.77,1,35-9021,Dishwashers

0,1,2,3,4
,Computerisable,,,
Rank,Probability,Label,SOC code,Occupation
463.,0.83,,47-3011,"Helpers–Brickmasons, Blockmasons, Stonemasons, and Tile and Mar-"
,,,,ble Setters
464.,0.83,,47-4091,Segmental Pavers
465.,0.83,,47-2131,"Insulation Workers, Floor, Ceiling, and Wall"
466.,0.83,,51-5112,Printing Press Operators
467.,0.83,,53-6031,Automotive and Watercraft Service Attendants
468.,0.83,,47-4071,Septic Tank Servicers and Sewer Pipe Cleaners
469.,0.83,,39-6011,Baggage Porters and Bellhops

0,1,2,3,4
,Computerisable,,,
Rank,Probability,Label,SOC code,Occupation
509.,0.87,,47-2043,Floor Sanders and Finishers
510.,0.87,1,53-6021,Parking Lot Attendants
511.,0.87,,47-4051,Highway Maintenance Workers
512.,0.88,,47-2061,Construction Laborers
513.,0.88,,43-5061,"Production, Planning, and Expediting Clerks"
514.,0.88,,51-9141,Semiconductor Processors
515.,0.88,,17-1021,Cartographers and Photogrammetrists
516.,0.88,,51-4051,Metal-Refining Furnace Operators and Tenders

0,1,2,3,4
,Computerisable,,,
Rank,Probability,Label,SOC code,Occupation
553.,0.91,,53-4013,"Rail Yard Engineers, Dinkey Operators, and Hostlers"
554.,0.91,,49-2093,"Electrical and Electronics Installers and Repairers, Transportation"
,,,,Equipment
555.,0.91,,35-9011,Dining Room and Cafeteria Attendants and Bartender Helpers
556.,0.91,,51-4191,"Heat Treating Equipment Setters, Operators, and Tenders, Metal and"
,,,,Plastic
557.,0.91,,19-4041,Geological and Petroleum Technicians
558.,0.91,,49-3021,Automotive Body and Related Repairers

0,1,2,3,4
,Computerisable,,,
Rank,Probability,Label,SOC code,Occupation
596.,0.94,,49-3091,Bicycle Repairers
597.,0.94,,49-9091,"Coin, Vending, and Amusement Machine Servicers and Repairers"
598.,0.94,,51-4121,"Welders, Cutters, Solderers, and Brazers"
599.,0.94,1,43-5021,Couriers and Messengers
600.,0.94,,43-4111,"Interviewers, Except Eligibility and Loan"
601.,0.94,,35-2015,"Cooks, Short Order"
602.,0.94,,53-7032,Excavating and Loading Machine and Dragline Operators
603.,0.94,,47-3014,"Helpers–Painters, Paperhangers, Plasterers, and Stucco Masons"

0,1,2,3,4
,Computerisable,,,
Rank,Probability,Label,SOC code,Occupation
640.,0.96,,49-9093,"Fabric Menders, Except Garment"
641.,0.96,,35-2014,"Cooks, Restaurant"
642.,0.96,,39-3031,"Ushers, Lobby Attendants, and Ticket Takers"
643.,0.96,,43-3021,Billing and Posting Clerks
644.,0.97,,53-6011,Bridge and Lock Tenders
645.,0.97,,51-7042,"Woodworking Machine Setters, Operators, and Tenders, Except Sawing"
646.,0.97,,51-2092,Team Assemblers
647.,0.97,,51-6042,Shoe Machine Operators and Tenders

0,1,2,3,4
,Computerisable,,,
Rank,Probability,Label,SOC code,Occupation
687.,0.98,,43-4151,Order Clerks
688.,0.98,,43-4011,Brokerage Clerks
689.,0.98,,43-9041,Insurance Claims and Policy Processing Clerks
690.,0.98,,51-2093,Timing Device Assemblers and Adjusters
691.,0.99,1,43-9021,Data Entry Keyers
692.,0.99,,25-4031,Library Technicians
693.,0.99,,43-4141,New Accounts Clerks
694.,0.99,,51-9151,Photographic Process Workers and Processing Machine Operators
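
Printing `out` in full is a lot to scroll through. Assuming `out` is a list of character matrices (which is what the output above suggests), a more compact overview is the number of tables and the shape of each one. The sketch below uses a toy list, `out_demo`, so that it runs without the PDF or Java; the shapes are made up for illustration.

```r
# Toy stand-in for `out`: a list of character matrices of different shapes
out_demo <- list(
  matrix("", nrow = 10, ncol = 9),
  matrix("", nrow = 10, ncol = 2),
  matrix("", nrow = 40, ncol = 5),
  matrix("", nrow = 41, ncol = 8)
)

# Number of extracted tables
length(out_demo)

# One row per table: number of rows, number of columns
shapes <- t(vapply(out_demo, dim, integer(2)))
shapes
```

Tables whose column count differs from the rest (here the first, second, and fourth) are the ones worth a closer look.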


#### Data Transformation

In [7]:
# We are not interested in the first two tables - so let's remove them
list.remove(out, c(1:2)) -> tables

# now let's look at what we got
tables

0,1,2,3,4
,Computerisable,,,
Rank,Probability,Label,SOC code,Occupation
1.,0.0028,,29-1125,Recreational Therapists
2.,0.003,,49-1011,"First-Line Supervisors of Mechanics, Installers, and Repairers"
3.,0.003,,11-9161,Emergency Management Directors
4.,0.0031,,21-1023,Mental Health and Substance Abuse Social Workers
5.,0.0033,,29-1181,Audiologists
6.,0.0035,,29-1122,Occupational Therapists
7.,0.0035,,29-2091,Orthotists and Prosthetists
8.,0.0035,,21-1022,Healthcare Social Workers

0,1,2,3,4
,Computerisable,,,
Rank,Probability,Label,SOC code,Occupation
40.,0.0077,,25-2054,"Special Education Teachers, Secondary School"
41.,0.0078,,25-2031,"Secondary School Teachers, Except Special and Career/Technical Edu-"
,,,,cation
42.,0.0081,0,21-2011,Clergy
43.,0.0081,,19-1032,Foresters
44.,0.0085,,21-1012,"Educational, Guidance, School, and Vocational Counselors"
45.,0.0088,,25-2032,"Career/Technical Education Teachers, Secondary School"
46.,0.009,0,29-1111,Registered Nurses

0,1,2,3,4
,Computerisable,,,
Rank,Probability,Label,SOC code,Occupation
88.,0.021,,17-2131,Materials Engineers
89.,0.021,0,27-1022,Fashion Designers
90.,0.021,,29-1123,Physical Therapists
91.,0.021,,27-4021,Photographers
92.,0.022,,27-2012,Producers and Directors
93.,0.022,,27-1025,Interior Designers
94.,0.023,,29-1023,Orthodontists
95.,0.023,,27-1011,Art Directors

0,1,2,3,4
,Computerisable,,,
Rank,Probability,Label,SOC code,Occupation
135.,0.047,,15-2021,Mathematicians
136.,0.047,,27-1023,Floral Designers
137.,0.047,,11-9013,"Farmers, Ranchers, and Other Agricultural Managers"
138.,0.048,,33-2022,Forest Fire Inspectors and Prevention Specialists
139.,0.049,,29-2041,Emergency Medical Technicians and Paramedics
140.,0.055,,27-3041,Editors
141.,0.055,,29-1024,Prosthodontists
142.,0.055,0,29-9799,"Healthcare Practitioners and Technical Workers, All Other"

0,1,2,3,4
,Computerisable,,,
Rank,Probability,Label,SOC code,Occupation
184.,0.13,,19-3051,Urban and Regional Planners
185.,0.13,,21-1093,Social and Human Service Assistants
186.,0.13,,25-3021,Self-Enrichment Education Teachers
187.,0.13,,27-4014,Sound Engineering Technicians
188.,0.14,,29-1041,Optometrists
189.,0.14,,17-2151,"Mining and Geological Engineers, Including Mining Safety Engineers"
190.,0.14,,29-1071,Physician Assistants
191.,0.15,,25-2012,"Kindergarten Teachers, Except Special Education"

0,1,2,3,4
,Computerisable,,,
Rank,Probability,Label,SOC code,Occupation
229.,0.26,,25-2023,"Career/Technical Education Teachers, Middle School"
230.,0.27,,53-5021,"Captains, Mates, and Pilots of Water Vessels"
231.,0.27,,31-2012,Occupational Therapy Aides
232.,0.27,,49-9062,Medical Equipment Repairers
233.,0.28,,41-1011,First-Line Supervisors of Retail Sales Workers
234.,0.28,0,27-2021,Athletes and Sports Competitors
235.,0.28,,39-1011,Gaming Supervisors
236.,0.29,,39-5094,Skincare Specialists

0,1,2,3,4
,Computerisable,,,
Rank,Probability,Label,SOC code,Occupation
275.,0.41,,51-2041,Structural Metal Fabricators and Fitters
276.,0.41,1,23-1012,Judicial Law Clerks
277.,0.41,,49-2094,"Electrical and Electronics Repairers, Commercial and Industrial Equip-"
,,,,ment
278.,0.42,,19-4093,Forest and Conservation Technicians
279.,0.42,,53-1021,"First-Line Supervisors of Helpers, Laborers, and Material Movers,"
,,,,Hand
280.,0.43,,39-3093,"Locker Room, Coatroom, and Dressing Room Attendants"

0,1,2,3,4,5,6,7
,Computerisable,,,,,,
Rank,Probability,Label,SOC code,Occupation,,,
322.,0.57,,33-3052,Transit and Railroad Police,,,
323.,0.57,,37-1012,"First-Line Supervisors of Landscaping,",Lawn,"Service,",and
,,,,Groundskeeping Workers,,,
324.,0.58,,13-2052,Personal Financial Advisors,,,
325.,0.59,,49-9044,Millwrights,,,
326.,0.59,,25-4013,Museum Technicians and Conservators,,,
327.,0.59,,47-5042,Mine Cutting and Channeling Machine Operators,,,
328.,0.59,0,11-3071,"Transportation, Storage, and Distribution Managers",,,

0,1,2,3,4
,Computerisable,,,
Rank,Probability,Label,SOC code,Occupation
369.,0.67,,51-9196,"Paper Goods Machine Setters, Operators, and Tenders"
370.,0.67,,51-4071,Foundry Mold and Coremakers
371.,0.67,,19-2021,Atmospheric and Space Scientists
372.,0.67,1,53-3021,"Bus Drivers, Transit and Intercity"
373.,0.67,,33-9092,"Lifeguards, Ski Patrol, and Other Recreational Protective Service Work-"
,,,,ers
374.,0.67,,49-9041,Industrial Machinery Mechanics
375.,0.68,,43-5052,Postal Service Mail Carriers

0,1,2,3,4
,Computerisable,,,
Rank,Probability,Label,SOC code,Occupation
417.,0.76,,49-2092,"Electric Motor, Power Tool, and Related Repairers"
418.,0.76,,45-4021,Fallers
419.,0.77,,19-4091,"Environmental Science and Protection Technicians, Including Health"
420.,0.77,,49-9094,Locksmiths and Safe Repairers
421.,0.77,,37-3013,Tree Trimmers and Pruners
422.,0.77,,35-3011,Bartenders
423.,0.77,,13-1023,"Purchasing Agents, Except Wholesale, Retail, and Farm Products"
424.,0.77,1,35-9021,Dishwashers

0,1,2,3,4
,Computerisable,,,
Rank,Probability,Label,SOC code,Occupation
463.,0.83,,47-3011,"Helpers–Brickmasons, Blockmasons, Stonemasons, and Tile and Mar-"
,,,,ble Setters
464.,0.83,,47-4091,Segmental Pavers
465.,0.83,,47-2131,"Insulation Workers, Floor, Ceiling, and Wall"
466.,0.83,,51-5112,Printing Press Operators
467.,0.83,,53-6031,Automotive and Watercraft Service Attendants
468.,0.83,,47-4071,Septic Tank Servicers and Sewer Pipe Cleaners
469.,0.83,,39-6011,Baggage Porters and Bellhops

0,1,2,3,4
,Computerisable,,,
Rank,Probability,Label,SOC code,Occupation
509.,0.87,,47-2043,Floor Sanders and Finishers
510.,0.87,1,53-6021,Parking Lot Attendants
511.,0.87,,47-4051,Highway Maintenance Workers
512.,0.88,,47-2061,Construction Laborers
513.,0.88,,43-5061,"Production, Planning, and Expediting Clerks"
514.,0.88,,51-9141,Semiconductor Processors
515.,0.88,,17-1021,Cartographers and Photogrammetrists
516.,0.88,,51-4051,Metal-Refining Furnace Operators and Tenders

0,1,2,3,4
,Computerisable,,,
Rank,Probability,Label,SOC code,Occupation
553.,0.91,,53-4013,"Rail Yard Engineers, Dinkey Operators, and Hostlers"
554.,0.91,,49-2093,"Electrical and Electronics Installers and Repairers, Transportation"
,,,,Equipment
555.,0.91,,35-9011,Dining Room and Cafeteria Attendants and Bartender Helpers
556.,0.91,,51-4191,"Heat Treating Equipment Setters, Operators, and Tenders, Metal and"
,,,,Plastic
557.,0.91,,19-4041,Geological and Petroleum Technicians
558.,0.91,,49-3021,Automotive Body and Related Repairers

0,1,2,3,4
,Computerisable,,,
Rank,Probability,Label,SOC code,Occupation
596.,0.94,,49-3091,Bicycle Repairers
597.,0.94,,49-9091,"Coin, Vending, and Amusement Machine Servicers and Repairers"
598.,0.94,,51-4121,"Welders, Cutters, Solderers, and Brazers"
599.,0.94,1,43-5021,Couriers and Messengers
600.,0.94,,43-4111,"Interviewers, Except Eligibility and Loan"
601.,0.94,,35-2015,"Cooks, Short Order"
602.,0.94,,53-7032,Excavating and Loading Machine and Dragline Operators
603.,0.94,,47-3014,"Helpers–Painters, Paperhangers, Plasterers, and Stucco Masons"

0,1,2,3,4
,Computerisable,,,
Rank,Probability,Label,SOC code,Occupation
640.,0.96,,49-9093,"Fabric Menders, Except Garment"
641.,0.96,,35-2014,"Cooks, Restaurant"
642.,0.96,,39-3031,"Ushers, Lobby Attendants, and Ticket Takers"
643.,0.96,,43-3021,Billing and Posting Clerks
644.,0.97,,53-6011,Bridge and Lock Tenders
645.,0.97,,51-7042,"Woodworking Machine Setters, Operators, and Tenders, Except Sawing"
646.,0.97,,51-2092,Team Assemblers
647.,0.97,,51-6042,Shoe Machine Operators and Tenders

0,1,2,3,4
,Computerisable,,,
Rank,Probability,Label,SOC code,Occupation
687.,0.98,,43-4151,Order Clerks
688.,0.98,,43-4011,Brokerage Clerks
689.,0.98,,43-9041,Insurance Claims and Policy Processing Clerks
690.,0.98,,51-2093,Timing Device Assemblers and Adjusters
691.,0.99,1,43-9021,Data Entry Keyers
692.,0.99,,25-4031,Library Technicians
693.,0.99,,43-4141,New Accounts Clerks
694.,0.99,,51-9151,Photographic Process Workers and Processing Machine Operators
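
Dropping list elements by position does not strictly require `rlist`. As a sketch, base R's negative indexing gives the same result; `out_demo` and `tables_demo` are toy names used here so the example runs on its own.

```r
# Toy list standing in for `out`
out_demo <- list("figure residue", "summary table", "page 1", "page 2")

# Negative indices drop elements; this is a base-R alternative to
# rlist::list.remove(out_demo, c(1:2))
tables_demo <- out_demo[-(1:2)]
tables_demo
```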


##### Parse table into something that can be used in the next step

In [8]:
# First we create a placeholder
prob_comput_df <- data.frame()

# Now we go over each of the tables
for (i in seq_along(tables))
{
  # We keep just rank, probability of computerisation and SOC code (columns 1, 2 and 4)
  # We also remove the first two lines of each table since they only hold the header
  # drop = FALSE keeps a matrix even if only one row is left
  tables[[i]][-c(1,2), c(1,2,4), drop = FALSE] %>% 
    as.data.frame(stringsAsFactors = FALSE) %>% 
    # the new rows are put on top, so the last table ends up first
    rbind(prob_comput_df) -> prob_comput_df
}
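
Growing a data frame inside a loop works, but the same step can also be written without a loop. The following is a sketch of an alternative, not the notebook's method: each table is trimmed with `lapply()` and everything is bound together once with `do.call(rbind, ...)`, which also keeps the tables in their original order. `tables_demo` is a toy list built from a few rows shown in the output above.

```r
# Two toy matrices shaped like the extracted tables: two header rows,
# then rank / probability / label / SOC code / occupation
header1 <- c("", "Computerisable", "", "", "")
header2 <- c("Rank", "Probability", "Label", "SOC code", "Occupation")
tables_demo <- list(
  rbind(header1, header2,
        c("1.", "0.0028", "", "29-1125", "Recreational Therapists")),
  rbind(header1, header2,
        c("41.", "0.0078", "", "25-2031",
          "Secondary School Teachers, Except Special and Career/Technical Edu-"),
        c("", "", "", "", "cation"))
)

# Trim every table: drop the two header rows, keep columns 1, 2 and 4
pieces <- lapply(tables_demo, function(m) {
  as.data.frame(m[-c(1, 2), c(1, 2, 4), drop = FALSE],
                stringsAsFactors = FALSE)
})

# Bind all pieces at once
demo_df <- do.call(rbind, pieces)
rownames(demo_df) <- NULL
demo_df
```

The last row of `demo_df` is empty apart from the dropped occupation text; rows like this come from occupation names that wrap onto a second line in the PDF.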

In [9]:
# Let's check what we got
prob_comput_df

V1,V2,V3
687.,0.98,43-4151
688.,0.98,43-4011
689.,0.98,43-9041
690.,0.98,51-2093
691.,0.99,43-9021
692.,0.99,25-4031
693.,0.99,43-4141
694.,0.99,51-9151
695.,0.99,13-2082
696.,0.99,43-5011


In [11]:
# Let's give this thing some proper column names
colnames(prob_comput_df) = c("rank", "probability", "soc")

prob_comput_df

rank,probability,soc
687.,0.98,43-4151
688.,0.98,43-4011
689.,0.98,43-9041
690.,0.98,51-2093
691.,0.99,43-9021
692.,0.99,25-4031
693.,0.99,43-4141
694.,0.99,51-9151
695.,0.99,13-2082
696.,0.99,43-5011


#### Data Cleaning

In [12]:
# what does R think it is looking at?
str(prob_comput_df)

'data.frame':	740 obs. of  3 variables:
 $ rank       : chr  "687." "688." "689." "690." ...
 $ probability: chr  "0.98" "0.98" "0.98" "0.98" ...
 $ soc        : chr  "43-4151" "43-4011" "43-9041" "51-2093" ...


In [13]:
prob_comput_df %>% 
  # convert things that look like numbers into numbers
  mutate(rank=gsub("\\.","", rank) %>% as.numeric()) %>% 
  #let's get rid of missing data
  na.omit() -> prob_comput_df

In [14]:
str(prob_comput_df)

'data.frame':	702 obs. of  3 variables:
 $ rank       : num  687 688 689 690 691 692 693 694 695 696 ...
 $ probability: chr  "0.98" "0.98" "0.98" "0.98" ...
 $ soc        : chr  "43-4151" "43-4011" "43-9041" "51-2093" ...
 - attr(*, "na.action")= 'omit' Named int  30 57 77 92 97 108 112 117 120 125 ...
  ..- attr(*, "names")= chr  "30" "57" "77" "92" ...
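
The row count drops from 740 to 702, which matches the 702 occupations in the paper. The 38 removed rows are presumably the continuation lines of wrapped occupation names, which have an empty rank. A small sketch of the mechanism on toy data (`demo` and `cleaned` are illustrative names):

```r
library(dplyr)

# Toy rows: two real entries and one continuation row with empty fields
demo <- data.frame(
  rank        = c("41.", "", "42."),
  probability = c("0.0078", "", "0.0081"),
  soc         = c("25-2031", "", "21-2011"),
  stringsAsFactors = FALSE
)

cleaned <- demo %>%
  # "41." -> "41" -> 41; the empty string becomes NA
  mutate(rank = as.numeric(gsub("\\.", "", rank))) %>%
  # rows containing an NA are dropped
  na.omit()

cleaned$rank
nrow(cleaned)
```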


In [None]:
# finally let's delete the file that we just downloaded
file.remove(file)

#### Data (Down)Loading 

In [17]:
#############################################################################################
# Download job statistics
#############################################################################################

# set some variables to be used for download
urlfile <- "https://www.bls.gov/emp/ind-occ-matrix/occupation.xlsx"
file <- "occupation.xlsx"
# Download xlsx file 
if (!file.exists(file)) {
    download.file(urlfile, destfile = file, mode = 'wb')
}

In [18]:
# read excel file into R
job_stats_df <- read_excel(file, 
                           sheet="Table 1.7", 
                           skip=3,
                           col_names = c("job_title",
                                         "soc",
                                         "occupation_type",
                                         "employment_2016",
                                         "employment_2026",
                                         "employment_change_2016_26_nu",
                                         "employment_change_2016_26_pe",
                                         "self_employed_2016_pe",
                                         "occupational_openings_2016_26_av",
                                         "median_annual_wage_2017",
                                         "typical_education_entry",
                                         "work_experience_related_occ",	
                                         "typical_training_needed"))

In [19]:
# now we can remove the downloaded file
file.remove(file)

In [22]:
# let's look what we got here
job_stats_df

job_title,soc,occupation_type,employment_2016,employment_2026,employment_change_2016_26_nu,employment_change_2016_26_pe,self_employed_2016_pe,occupational_openings_2016_26_av,median_annual_wage_2017,typical_education_entry,work_experience_related_occ,typical_training_needed
"Total, all occupations",00-0000,Summary,156063.8,167582.3,11518.6,7.4,6.1,18742.0,37690,—,—,—
Management occupations,11-0000,Summary,9533.1,10340.4,807.3,8.5,19.8,841.5,102590,—,—,—
Top executives,11-1000,Summary,2627.5,2824.5,197.0,7.5,3.2,235.0,103120,—,—,—
Chief executives,11-1011,Line item,308.9,296.8,-12.1,-3.9,22.8,20.0,183270,Bachelor's degree,5 years or more,
General and operations managers,11-1021,Line item,2263.1,2468.3,205.2,9.1,0.6,210.7,100410,Bachelor's degree,5 years or more,
Legislators,11-1031,Line item,55.5,59.4,3.9,7.1,—,4.4,25630,Bachelor's degree,Less than 5 years,
"Advertising, marketing, promotions, public relations, and sales managers",11-2000,Summary,708.6,768.9,60.3,8.5,3.3,67.9,123100,—,—,—
Advertising and promotions managers,11-2011,Line item,31.3,33.0,1.7,5.5,5.2,3.4,106130,Bachelor's degree,Less than 5 years,
Marketing and sales managers,11-2020,Summary,603.8,654.8,51.0,8.4,3.6,57.6,125290,—,—,—
Marketing managers,11-2021,Line item,218.3,240.4,22.1,10.1,3.5,21.3,132230,Bachelor's degree,5 years or more,


#### Data Transformation & Cleaning

We are going to merge (join) the 2 data sets and keep only the columns that we need.

In [23]:
#############################################################################################
# Join data frames
#############################################################################################
results = prob_comput_df %>% 
  inner_join(job_stats_df, by = "soc") %>% 
  select(job_title, 
         probability, 
         employment_2016, 
         median_annual_wage_2017, 
         typical_education_entry) %>% 
  mutate(probability=as.numeric(probability),
         median_annual_wage_2017=as.numeric(median_annual_wage_2017),
         typical_education_entry=iconv(typical_education_entry, "latin1", "ASCII")) %>% 
  # get rid of missing data
  na.omit()

“NAs introduced by coercion”

In [25]:
# Ahem, can we do that a little slower?
# first, we join using the soc column
first_step <- prob_comput_df %>% 
     inner_join(job_stats_df, by = "soc")

first_step

rank,probability,soc,job_title,occupation_type,employment_2016,employment_2026,employment_change_2016_26_nu,employment_change_2016_26_pe,self_employed_2016_pe,occupational_openings_2016_26_av,median_annual_wage_2017,typical_education_entry,work_experience_related_occ,typical_training_needed
687,0.98,43-4151,Order clerks,Line item,179.0,175.3,-3.7,-2.1,1.2,19.4,33510,High school diploma or equivalent,,Short-term on-the-job training
688,0.98,43-4011,Brokerage clerks,Line item,60.4,63.4,3.0,5.0,—,6.6,49800,High school diploma or equivalent,,Moderate-term on-the-job training
689,0.98,43-9041,Insurance claims and policy processing clerks,Line item,308.5,342.6,34.1,11.1,1,35.6,38790,High school diploma or equivalent,,Moderate-term on-the-job training
690,0.98,51-2093,Timing device assemblers and adjusters,Line item,0.8,0.6,-0.2,-20.1,—,0.1,34800,High school diploma or equivalent,,Moderate-term on-the-job training
691,0.99,43-9021,Data entry keyers,Line item,203.8,160.6,-43.3,-21.2,1.6,16.8,30930,High school diploma or equivalent,,Short-term on-the-job training
692,0.99,25-4031,Library technicians,Line item,99.2,108.2,9.0,9.1,—,14.4,33690,Postsecondary nondegree award,,
693,0.99,43-4141,New accounts clerks,Line item,42.0,39.4,-2.6,-6.2,—,4.0,35260,High school diploma or equivalent,,Moderate-term on-the-job training
694,0.99,51-9151,Photographic process workers and processing machine operators,Line item,26.9,22.0,-4.9,-18.1,3.3,3.3,27480,High school diploma or equivalent,,Short-term on-the-job training
695,0.99,13-2082,Tax preparers,Line item,95.9,106.2,10.3,10.8,23.6,11.5,38730,High school diploma or equivalent,,Moderate-term on-the-job training
696,0.99,43-5011,Cargo and freight agents,Line item,89.8,99.1,9.3,10.4,0.2,8.6,41820,High school diploma or equivalent,,Short-term on-the-job training


In [26]:
#second, we select only columns that we want
second_step <- first_step %>%
  select(job_title, 
         probability, 
         employment_2016, 
         median_annual_wage_2017, 
         typical_education_entry)

second_step

job_title,probability,employment_2016,median_annual_wage_2017,typical_education_entry
Order clerks,0.98,179.0,33510,High school diploma or equivalent
Brokerage clerks,0.98,60.4,49800,High school diploma or equivalent
Insurance claims and policy processing clerks,0.98,308.5,38790,High school diploma or equivalent
Timing device assemblers and adjusters,0.98,0.8,34800,High school diploma or equivalent
Data entry keyers,0.99,203.8,30930,High school diploma or equivalent
Library technicians,0.99,99.2,33690,Postsecondary nondegree award
New accounts clerks,0.99,42.0,35260,High school diploma or equivalent
Photographic process workers and processing machine operators,0.99,26.9,27480,High school diploma or equivalent
Tax preparers,0.99,95.9,38730,High school diploma or equivalent
Cargo and freight agents,0.99,89.8,41820,High school diploma or equivalent


In [28]:
#third, we convert three existing columns (the converted values replace the originals)

third_step <- second_step %>% 
  mutate(probability=as.numeric(probability),
         median_annual_wage_2017=as.numeric(median_annual_wage_2017),
         typical_education_entry=iconv(typical_education_entry, "latin1", "ASCII")) 

third_step

#that looks the same to me, but internally we change some data types
str(second_step)
str(third_step)

“NAs introduced by coercion”

job_title,probability,employment_2016,median_annual_wage_2017,typical_education_entry
Order clerks,0.98,179.0,33510,High school diploma or equivalent
Brokerage clerks,0.98,60.4,49800,High school diploma or equivalent
Insurance claims and policy processing clerks,0.98,308.5,38790,High school diploma or equivalent
Timing device assemblers and adjusters,0.98,0.8,34800,High school diploma or equivalent
Data entry keyers,0.99,203.8,30930,High school diploma or equivalent
Library technicians,0.99,99.2,33690,Postsecondary nondegree award
New accounts clerks,0.99,42.0,35260,High school diploma or equivalent
Photographic process workers and processing machine operators,0.99,26.9,27480,High school diploma or equivalent
Tax preparers,0.99,95.9,38730,High school diploma or equivalent
Cargo and freight agents,0.99,89.8,41820,High school diploma or equivalent


'data.frame':	687 obs. of  5 variables:
 $ job_title              : chr  "Order clerks" "Brokerage clerks" "Insurance claims and policy processing clerks" "Timing device assemblers and adjusters" ...
 $ probability            : chr  "0.98" "0.98" "0.98" "0.98" ...
 $ employment_2016        : num  179 60.4 308.5 0.8 203.8 ...
 $ median_annual_wage_2017: chr  "33510" "49800" "38790" "34800" ...
 $ typical_education_entry: chr  "High school diploma or equivalent" "High school diploma or equivalent" "High school diploma or equivalent" "High school diploma or equivalent" ...
'data.frame':	687 obs. of  5 variables:
 $ job_title              : chr  "Order clerks" "Brokerage clerks" "Insurance claims and policy processing clerks" "Timing device assemblers and adjusters" ...
 $ probability            : num  0.98 0.98 0.98 0.98 0.99 0.99 0.99 0.99 0.99 0.99 ...
 $ employment_2016        : num  179 60.4 308.5 0.8 203.8 ...
 $ median_annual_wage_2017: num  33510 49800 38790 34800 30930 ...
 $ typi
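
The warning "NAs introduced by coercion" comes from `as.numeric()` meeting text that is not a number. The sketch below shows the mechanism on a toy vector. It is an assumption that the wage column uses a dash as placeholder for "no value"; the spreadsheet output further up shows such dashes in other columns.

```r
# Toy wage column stored as text; "\u2014" is the dash placeholder
wage_chr <- c("33510", "49800", "\u2014")

# as.numeric() returns NA for anything that is not a number
# (the warning is suppressed here to keep the output clean)
wage_num <- suppressWarnings(as.numeric(wage_chr))
wage_num

# Which original entries failed to convert?
failed <- wage_chr[is.na(wage_num)]
failed
```

Looking at the failed entries before dropping them is a cheap way to check that nothing meaningful is lost.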

In [31]:
#do we have some missing data points?
is.na(third_step)

job_title,probability,employment_2016,median_annual_wage_2017,typical_education_entry
FALSE,FALSE,FALSE,FALSE,FALSE
FALSE,FALSE,FALSE,FALSE,FALSE
FALSE,FALSE,FALSE,FALSE,FALSE
FALSE,FALSE,FALSE,FALSE,FALSE
FALSE,FALSE,FALSE,FALSE,FALSE
FALSE,FALSE,FALSE,FALSE,FALSE
FALSE,FALSE,FALSE,FALSE,FALSE
FALSE,FALSE,FALSE,FALSE,FALSE
FALSE,FALSE,FALSE,FALSE,FALSE
FALSE,FALSE,FALSE,FALSE,FALSE


In [32]:
#show me the rows with missing data
third_step[!complete.cases(third_step),]

Unnamed: 0,job_title,probability,employment_2016,median_annual_wage_2017,typical_education_entry
13,Mathematical technicians,0.99,0.6,,Bachelor's degree
23,Team assemblers,0.97,1130.9,,High school diploma or equivalent
25,Electromechanical equipment assemblers,0.97,45.7,,High school diploma or equivalent
89,Electrical and electronic equipment assemblers,0.95,218.9,,High school diploma or equivalent
174,Medical and clinical laboratory technologists,0.9,171.4,,Bachelor's degree
187,Tour guides and escorts,0.91,45.8,,High school diploma or equivalent
195,Segmental pavers,0.83,2.1,,High school diploma or equivalent
232,Miscellaneous agricultural workers,0.87,847.3,23710.0,
237,"Buyers and purchasing agents, farm products",0.87,13.7,,Bachelor's degree
246,"Purchasing agents, except wholesale, retail, and farm products",0.77,309.4,,Bachelor's degree


In [30]:
# and last but not least we remove the rows with missing data
results <- third_step %>%
    na.omit()


In [33]:
#what did we get?
results

Unnamed: 0,job_title,probability,employment_2016,median_annual_wage_2017,typical_education_entry
1,Order clerks,0.98,179.0,33510,High school diploma or equivalent
2,Brokerage clerks,0.98,60.4,49800,High school diploma or equivalent
3,Insurance claims and policy processing clerks,0.98,308.5,38790,High school diploma or equivalent
4,Timing device assemblers and adjusters,0.98,0.8,34800,High school diploma or equivalent
5,Data entry keyers,0.99,203.8,30930,High school diploma or equivalent
6,Library technicians,0.99,99.2,33690,Postsecondary nondegree award
7,New accounts clerks,0.99,42.0,35260,High school diploma or equivalent
8,Photographic process workers and processing machine operators,0.99,26.9,27480,High school diploma or equivalent
9,Tax preparers,0.99,95.9,38730,High school diploma or equivalent
10,Cargo and freight agents,0.99,89.8,41820,High school diploma or equivalent
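
To see how many rows `na.omit()` dropped, compare the row counts before and after. This is a minimal sketch on made-up data (`toy` and `toy_clean` are hypothetical names standing in for `third_step` and `results`):

```r
# Toy stand-in for third_step (values are made up for illustration)
toy <- data.frame(
  job_title = c("A", "B", "C"),
  median_annual_wage_2017 = c(33510, NA, 23710),
  typical_education_entry = c("Bachelor's degree", "Bachelor's degree", NA),
  stringsAsFactors = FALSE
)

# na.omit() drops every row that has an NA in any column
toy_clean <- na.omit(toy)

# Number of rows removed by the cleaning step
rows_dropped <- nrow(toy) - nrow(toy_clean)
print(rows_dropped)

# complete.cases() flags the same rows, so both approaches keep the same count
stopifnot(sum(complete.cases(toy)) == nrow(toy_clean))
```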


#### Finally, let's create a visualization

We are going to use [Highcharter](http://jkunst.com/highcharter/index.html), which is just one of many ways to create interactive visualizations in R.

In [34]:
#we need some more packages 
library(highcharter)
library(htmlwidgets)
library(IRdisplay)

Highcharts (www.highcharts.com) is a Highsoft software product which is
not free for commercial and Governmental use


In [37]:
#let's create an object that is actually a visual
x <- hchart(results, 
            "scatter", 
            hcaes(x = probability * 100, 
                  y = median_annual_wage_2017, 
                  group = typical_education_entry, 
                  size = employment_2016)) %>% 
  hc_title(text = "How Much Money Should Machines Earn?") %>%
  hc_subtitle(text = "Probability of Computerisation and Wages by Job") %>% 
  hc_credits(enabled = TRUE, text = "Source: Oxford Martin School and US Department of Labor") %>% 
  hc_xAxis(title = list(text = "Probability of Computerisation"), labels = list(format = "{value}%")) %>% 
  hc_yAxis(title = list(text = "Median Annual Wage 2017"), labels = list(format = "{value}$")) %>% 
  hc_plotOptions(bubble = list(minSize = 3, maxSize = 35)) %>% 
  hc_tooltip(formatter = JS("function(){
                            return ('<b>'+ this.point.job_title + '</b><br>'+
                            'Probability of computerisation: '+ Highcharts.numberFormat(this.x, 0)+'%' + 
                            '<br>Median annual wage 2017 ($): '+ Highcharts.numberFormat(this.y, 0) + 
                            '<br>Employment 2016 (000s): '+ Highcharts.numberFormat(this.point.size, 0) )}")) %>% 
  hc_chart(zoomType = "xy") %>%
  hc_exporting(enabled = TRUE)

# it's an object!
str(x)

List of 8
 $ x            :List of 6
  ..$ hc_opts  :List of 11
  .. ..$ title             :List of 1
  .. .. ..$ text: chr "How Much Money Should Machines Earn?"
  .. ..$ yAxis             :List of 3
  .. .. ..$ title :List of 1
  .. .. .. ..$ text: chr "Median Annual Wage 2017"
  .. .. ..$ type  : chr "linear"
  .. .. ..$ labels:List of 1
  .. .. .. ..$ format: chr "{value}$"
  .. ..$ credits           :List of 2
  .. .. ..$ enabled: logi TRUE
  .. .. ..$ text   : chr "Source: Oxford Martin School and US Department of Labor"
  .. ..$ exporting         :List of 1
  .. .. ..$ enabled: logi TRUE
  .. ..$ plotOptions       :List of 4
  .. .. ..$ series :List of 3
  .. .. .. ..$ turboThreshold: num 0
  .. .. .. ..$ showInLegend  : logi TRUE
  .. .. .. ..$ marker        :List of 1
  .. .. .. .. ..$ enabled: logi TRUE
  .. .. ..$ treemap:List of 1
  .. .. .. ..$ layoutAlgorithm: chr "squarified"
  .. .. ..$ bubble :List of 2
  .. .. .. ..$ minSize: num 3
  .. .. .. ..$ maxSize: num 35
  .. 

In [38]:
# and now let's get this object showing up in our jupyter notebook
saveWidget(x, 'demox.html', selfcontained = FALSE)
display_html('<iframe src="demox.html" width="900" height="500"></iframe>')

A full size version of the visualization can be found [here](https://fronkonstin.com/wp-content/uploads/2018/06/machines_wage.html)


And thanks again to the person who wrote the [original post](https://fronkonstin.com/2018/06/17/how-much-money-should-machines-earn/)!

These are some insights:

* There is a moderate negative correlation between wages and probability of computerisation.
* Around 45% of US employment is threatened by machines (a computerisation probability higher than 80%); half of these jobs do not require formal education for entry.
* In fact, 78% of jobs that do not require formal education for entry are threatened by machines, while 0% of those that require a master’s degree are.
* Teachers are absolutely irreplaceable (0% are threatened by machines), but they earn 2.2% less than the average wage (unfortunately, I’m afraid this phenomenon occurs in many other countries as well).
* Don’t study to become a librarian or archivist: it seems a bad way to invest your time.
* Mathematicians will survive the machines.
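
The first two insights can be checked with a few lines of base R. The sketch below uses a small made-up data frame (`toy`) with the same column names as `results`, so its numbers are illustrative only. It assumes the share of threatened jobs is weighted by employment, which the list above does not state explicitly. The 0.8 threshold is the one named in the list.

```r
# Toy stand-in for results (values are made up for illustration)
toy <- data.frame(
  probability = c(0.99, 0.90, 0.50, 0.10, 0.02),
  employment_2016 = c(100, 300, 200, 250, 150),
  median_annual_wage_2017 = c(30000, 35000, 50000, 80000, 100000)
)

# Pearson correlation between probability of computerisation and wage
wage_prob_cor <- cor(toy$probability, toy$median_annual_wage_2017)
print(wage_prob_cor)

# Share of employment in jobs with a computerisation probability above 80%
threatened <- toy$probability > 0.8
threatened_share <- sum(toy$employment_2016[threatened]) / sum(toy$employment_2016)
print(threatened_share)  # 0.4 for this toy data
```

Swapping `toy` for `results` runs the same calculation on the workshop data.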

##### What do you see there?