# Probability Converter
This script converts the `occupational_fatalities.csv` dataset into an index of likelihoods of dying for each profession, then joins it to the `deaths_age_gender_race_mechanism_cause.csv` dataset on mechanism of death. Next, the full dataset is converted into daily probabilities of dying by age, gender, race, and occupation, using linear models that connect the probabilities at each age in years.
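As a rough illustration of the last step, the sketch below interpolates a daily probability linearly between adjacent integer ages. The numbers are made up, and dividing the annual probability by 365 is an assumption of this sketch, not necessarily the conversion used later in this notebook.

```python
import numpy as np

# Hypothetical annual death probabilities at integer ages (illustrative numbers only).
ages = np.array([30, 31, 32])
annual_prob = np.array([0.0010, 0.0012, 0.0015])

# Assumption: spread each annual probability evenly over 365 days.
daily_at_age = annual_prob / 365

def daily_prob(age_in_years):
    """Linearly interpolate the daily probability between adjacent integer ages."""
    return np.interp(age_in_years, ages, daily_at_age)

# Daily probability for someone halfway between their 30th and 31st birthdays.
halfway = daily_prob(30.5)
```

`np.interp` draws a straight line between each pair of neighboring ages, which is one simple way to read "linear models connecting each age".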

In [1]:
import pandas as pd
import numpy as np

In [2]:
job_deaths = pd.read_csv("../data/occupational_fatalities.csv")
cdc_deaths = pd.read_csv("../data/deaths_age_gender_race_mechanism_cause.csv")


### Converting BLS Data to Probabilities by Occupation

In [3]:
job_deaths.head()

Unnamed: 0,Occupation,Hierarchy Levels,variable,value
0,Total,0,Total fatal injuries (number),5250
1,Total,0,Transportation incidents,2080
2,Transportation and material moving occupations,0,Total fatal injuries (number),1443
3,Motor vehicle operators,1,Total fatal injuries (number),1044
4,Transportation and material moving occupations,0,Transportation incidents,1014


In [4]:
job_deaths.shape

(4361, 4)

The hierarchy has 4 levels (0 through 3), and users will have a poor experience if they must search through too many job titles to find one close to theirs, so choosing the right level is paramount. Comparing the number of titles and the degree of detail at each level should guide the choice. Level 3 is the most detailed, so we'll start there and work our way up:

In [5]:
job_deaths.Occupation[job_deaths['Hierarchy Levels']==3].value_counts()

 Sawing machine setters, operators, and tenders, wood                                           7
 Bill and account collectors                                                                    7
 First-line supervisors of transportation and material-moving machine and vehicle operators     7
 Amusement and recreation attendants                                                            7
 Flight attendants                                                                              7
 First-line supervisors of fire fighting and prevention workers                                 7
 Mining and geological engineers, including mining safety engineers                             7
 Forest and conservation technicians                                                            7
 Molders, shapers, and casters, except metal and plastic                                        7
 Rotary drill operators, oil and gas                                                            7
 Dishwashers        

Level 3 has 275 options, which is clearly too many.

In [6]:
job_deaths.Occupation[job_deaths['Hierarchy Levels']==2].value_counts()

 Roustabouts, oil and gas                                                                       7
 Miscellaneous protective service workers                                                       7
 Aircraft mechanics and service technicians                                                     7
 Pumping station operators                                                                      7
 Industrial production managers                                                                 7
 Tour and travel guides                                                                         7
 Bill and account collectors                                                                    7
 Mining and geological engineers, including mining safety engineers                             7
 Real estate brokers and sales agents                                                           7
 Hunters and trappers                                                                           7
 First-line supervis

Level 2 has 237 options, which is still far too many.

In [7]:
job_deaths.Occupation[job_deaths['Hierarchy Levels']==1].value_counts()

 Supervisors of sales workers                                                       7
 Forest, conservation, and logging workers                                          7
 Operations specialties managers                                                    7
 Counselors, social workers, and other community and social service specialists     7
 Funeral service workers                                                            7
 Architects, surveyors, and cartographers                                           7
 Other personal care and service workers                                            7
 Art and design workers                                                             7
 Physical scientists                                                                7
 Social scientists and related workers                                              7
 Supervisors of food preparation and serving workers                                7
 Other management occupations                         

Level 1 has 87 options, which is more reasonable but still tedious to search.

In [8]:
job_deaths.Occupation[job_deaths['Hierarchy Levels']==0].value_counts()

 Protective service occupations                                 7
 Computer and mathematical occupations                          7
 Legal occupations                                              7
 Personal care and service occupations                          7
 Architecture and engineering occupations                       7
 Business and financial operations occupations                  7
 Farming, fishing, and forestry occupations                     7
 Office and administrative support occupations                  7
 Healthcare support occupations                                 7
 Community and social services occupations                      7
 Military specific occupations(5)                               7
 Management occupations                                         7
 Healthcare practitioners and technical occupations             7
 Transportation and material moving occupations                 7
 Building and grounds cleaning and maintenance occupations      7
 Total    

In [9]:
len(job_deaths.Occupation[job_deaths['Hierarchy Levels']==0].value_counts())

24

Level 0 has 24 options, which is a reasonable number to choose from. Occupation will have an extraordinarily small effect on the overall likelihood of dying; it is included mainly so the result feels more personalized to the user.
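The per-level counts can also be obtained in a single step by grouping on the hierarchy level. This is a minimal sketch on a toy frame that only borrows the column names of `job_deaths`; the rows are illustrative.

```python
import pandas as pd

# Toy stand-in for job_deaths with the same column names as the BLS file.
toy = pd.DataFrame({
    "Occupation": ["Total", "Total", "Legal occupations",
                   "Motor vehicle operators", "Motor vehicle operators"],
    "Hierarchy Levels": [0, 0, 0, 1, 1],
})

# Number of distinct occupation titles at each hierarchy level.
options_per_level = toy.groupby("Hierarchy Levels")["Occupation"].nunique()
```

Run on the real `job_deaths` frame before it is filtered, the same call would give one count per level, which makes the comparison easier to read than four separate `value_counts()` outputs.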

To prep the data for joining, we'll need to remove the extraneous data and generate a probability for each occupation and mechanism: the number of deaths for that occupation and mechanism divided by the occupation's total fatal injuries.

In [10]:
# .copy() avoids a SettingWithCopyWarning when columns are removed or added below
job_deaths = job_deaths[job_deaths['Hierarchy Levels']==0].copy()

We'll want to remove the levels and clean up the column names:

In [11]:
del job_deaths['Hierarchy Levels']

In [12]:
job_deaths.columns = ['job','mechanism','deaths']

Reset the index so that rows can be looked up by position:

In [13]:
job_deaths.reset_index(drop = True, inplace = True)

In [14]:
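# For each row, divide its deaths by the "Total fatal injuries (number)" count
# for the same job, giving that mechanism's share of the job's fatal injuries.
total_label = "Total fatal injuries (number)"
probs = []
for i in range(job_deaths.shape[0]):
    job_total = job_deaths[(job_deaths.job == job_deaths.job.iloc[i]) &
                           (job_deaths.mechanism == total_label)].deaths.iloc[0]
    probs.append(job_deaths.deaths.iloc[i] / job_total)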

In [15]:
job_deaths['prob'] = probs
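The loop above can also be written without explicit iteration. The sketch below does the same calculation on a toy frame: it looks up each job's "Total fatal injuries (number)" count and divides every row by it. The job labels `A` and `B` and their counts are illustrative.

```python
import pandas as pd

# Toy frame in the cleaned job/mechanism/deaths layout (illustrative numbers).
toy = pd.DataFrame({
    "job": ["A", "A", "A", "B", "B"],
    "mechanism": ["Total fatal injuries (number)", "Transportation incidents",
                  "Falls, slips, trips",
                  "Total fatal injuries (number)", "Transportation incidents"],
    "deaths": [30, 21, 4, 10, 5],
})

# One total per job, indexed by job name.
totals = (toy[toy["mechanism"] == "Total fatal injuries (number)"]
          .set_index("job")["deaths"])

# Map each row's job to its total and divide element-wise.
toy["prob"] = toy["deaths"] / toy["job"].map(totals)
```

This relies on each job having exactly one "Total fatal injuries (number)" row, which is the same assumption the loop makes when it takes `.iloc[0]`.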

Just to make reading it easier, we'll sort it on job and deaths:

In [16]:
job_deaths = job_deaths.sort_values(["job", "deaths"], ascending = [True, False])

In [19]:
job_deaths.head(15)

Unnamed: 0,job,mechanism,deaths,prob
62,Architecture and engineering occupations,Total fatal injuries (number),30,1.0
74,Architecture and engineering occupations,Transportation incidents,21,0.7
122,Architecture and engineering occupations,"Falls, slips, trips",4,0.133333
132,Architecture and engineering occupations,Fires and explosions,1,0.033333
134,Architecture and engineering occupations,Violence and other injuries by persons or animals,0,0.0
153,Architecture and engineering occupations,Exposure to harmful substances or environments,0,0.0
159,Architecture and engineering occupations,Contact with objects and equipment,0,0.0
40,"Arts, design, entertainment, sports, and medi...",Total fatal injuries (number),71,1.0
64,"Arts, design, entertainment, sports, and medi...",Transportation incidents,29,0.408451
69,"Arts, design, entertainment, sports, and medi...",Violence and other injuries by persons or animals,25,0.352113


### Joining CDC and BLS Data

The CDC data has a `mechanism_of_death` field that corresponds to the BLS `mechanism` column (albeit with many more mechanisms). We'll first do what we did with the BLS data and create annual probabilities for each combination of age, gender, race, mechanism, and cause. Then, for the mechanisms that do match, we'll expand each CDC row into one row per occupation, combining the probabilities using the multiplication rule of probability.
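The expansion step might look roughly like the sketch below. All numbers are made up, and the `crosswalk` dictionary that translates BLS mechanism labels into CDC labels is hypothetical; the actual matching between the two datasets may be done differently.

```python
import pandas as pd

# Toy CDC-style rows (illustrative numbers only).
cdc = pd.DataFrame({
    "age": [40, 40],
    "gender": ["Female", "Female"],
    "race": ["White", "White"],
    "mechanism_of_death": ["Fall", "Suffocation"],
    "deaths": [20, 10],
    "population": [100000, 100000],
})
cdc["annual_prob"] = cdc["deaths"] / cdc["population"]

# Toy BLS-style rows in the cleaned job/mechanism/prob layout.
bls = pd.DataFrame({
    "job": ["Legal occupations", "Protective service occupations"],
    "mechanism": ["Falls, slips, trips", "Falls, slips, trips"],
    "prob": [0.1, 0.4],
})

# Hypothetical crosswalk from BLS mechanism labels to CDC mechanism labels.
crosswalk = {"Falls, slips, trips": "Fall"}
bls["mechanism_of_death"] = bls["mechanism"].map(crosswalk)

# Inner join: each matching CDC row fans out to one row per occupation.
joined = cdc.merge(bls, on="mechanism_of_death", how="inner")

# Multiplication rule: combine the two probabilities.
joined["combined_prob"] = joined["annual_prob"] * joined["prob"]
```

With `how="inner"`, the unmatched "Suffocation" row drops out of `joined`; keeping unmatched CDC rows would need a left join plus a rule for what probability to assign them.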

The pandas documentation on how to [split, apply, and combine grouped data](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html) is a useful alternative to the for-loop approach used above.
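As a small example of split-apply-combine, the sketch below rolls cause-level rows up to one row per age, gender, race, and mechanism, then computes an annual probability. The rows are toy data modeled on the CDC preview, and taking the first `population` value per group assumes the population is constant within a group.

```python
import pandas as pd

# Toy rows modeled on the CDC layout (one row per cause of death).
cdc = pd.DataFrame({
    "age": [0, 0, 0],
    "gender": ["Female"] * 3,
    "race": ["American Indian or Alaska Native"] * 3,
    "mechanism_of_death": ["Suffocation", "Suffocation", "Fire/Flame"],
    "deaths": [7, 3, 1],
    "population": [36615] * 3,
})

keys = ["age", "gender", "race", "mechanism_of_death"]

# Split by the key columns, sum deaths within each group, keep one population value.
by_mechanism = (cdc.groupby(keys, as_index=False)
                   .agg(deaths=("deaths", "sum"),
                        population=("population", "first")))

# Annual probability of dying by each mechanism for that demographic group.
by_mechanism["annual_prob"] = by_mechanism["deaths"] / by_mechanism["population"]
```

Keeping `cause_of_death` in `keys` instead would give the finer-grained, per-cause probabilities.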

In [20]:
cdc_deaths

Unnamed: 0,age,gender,race,mechanism_of_death,cause_of_death,deaths,population
0,0,Female,American Indian or Alaska Native,Fire/Flame,Exposure to uncontrolled fire in building or s...,1,36615
1,0,Female,American Indian or Alaska Native,Motor Vehicle Traffic,Car occupant injured in collision with heavy t...,1,36615
2,0,Female,American Indian or Alaska Native,Motor Vehicle Traffic,Person injured in unspecified motor-vehicle ac...,1,36615
3,0,Female,American Indian or Alaska Native,Suffocation,Accidental suffocation and strangulation in bed,7,36615
4,0,Female,American Indian or Alaska Native,Suffocation,Unspecified threat to breathing,3,36615
5,0,Female,American Indian or Alaska Native,Suffocation,"Hanging, strangulation and suffocation, undete...",1,36615
6,0,Female,American Indian or Alaska Native,Non-Injury: Intestinal infections,Campylobacter enteritis,1,36615
7,0,Female,American Indian or Alaska Native,Non-Injury: Intestinal infections,Other and unspecified gastroenteritis and coli...,1,36615
8,0,Female,American Indian or Alaska Native,Non-Injury: Septicemia,"Streptococcal septicaemia, unspecified",1,36615
9,0,Female,American Indian or Alaska Native,Non-Injury: Septicemia,"Septicaemia, unspecified",1,36615
