# Indexed Likelihood and Probability Converter
This script converts the Bureau of Labor Statistics (BLS) `occupational_hazards_data.csv` dataset into an index of the likelihood of dying for each occupation, and converts the Centers for Disease Control and Prevention (CDC) `deaths_age_gender_race_mechanism_cause.csv` dataset into annual probabilities of death by age, gender, and race.

The `death_simulator.py` script has a function that accepts user input for birthdate (converted to age in days), gender, race, and occupation (as well as other unused inputs). These inputs are used to subset the CDC and BLS datasets to the relevant rows. Next, the annual probabilities are converted into daily probabilities of dying by age (in days), gender, and race, and these are modulated by the indexed likelihood of dying by occupation. A dice roll for death is made for each day; if death occurs, a weighted probability roll determines the specific mechanism and cause of death. If time permits, the mechanism of death probability will also be modulated by occupation.
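
The annual-to-daily conversion is not spelled out here. As a minimal sketch, one common choice assumes a constant daily hazard within the year, so that surviving 365 days reproduces the annual survival probability. The function names, the 365-day year, and the multiplicative use of the occupation index are assumptions for illustration and may differ from what `death_simulator.py` actually does.

```python
import numpy as np

DAYS_PER_YEAR = 365  # assumption: leap years are ignored


def annual_to_daily(annual_prob):
    """Daily death probability that compounds to `annual_prob` over one year.

    Assumes the hazard is constant across all days of the year.
    """
    return 1.0 - (1.0 - annual_prob) ** (1.0 / DAYS_PER_YEAR)


def modulated_daily_prob(annual_prob, occupation_index):
    """Scale the daily probability by an occupation index (assumed multiplicative).

    The result is capped at 1 so it remains a valid probability.
    """
    return min(1.0, annual_to_daily(annual_prob) * occupation_index)


def dies_today(daily_prob, rng):
    """One 'dice roll': True if a uniform draw in [0, 1) falls below daily_prob."""
    return rng.random() < daily_prob


rng = np.random.default_rng(seed=0)
p_day = modulated_daily_prob(annual_prob=0.004, occupation_index=1.5)
print(p_day, dies_today(p_day, rng))
```

Compounding the unmodulated daily probability over 365 days recovers the annual probability, which is a quick way to sanity-check the conversion.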

In [1]:
import pandas as pd
import numpy as np

In [2]:
job_deaths = pd.read_csv("../data/occupational_hazards_data.csv")

In [3]:
cdc_deaths = pd.read_csv("../data/deaths_age_gender_race_mechanism_cause.csv")

### Cleaning the BLS Data

First, the data needs to be subset to the highest hierarchy level (level 0), since lower hierarchy levels give the user too many options to choose from (e.g. a dropdown menu of 200+ occupations). Level 0 provides 21 occupations, which is far more reasonable and less frustrating for the user to select from.

In [4]:
job_deaths.head()

Unnamed: 0,occupation,hierarchy_level,mechanism_of_death,deaths,population
0,Total,0,Total,5250,161038
1,Total,0,Cut/Pierce,828,161038
2,Total,0,Motor Vehicle Traffic,2080,161038
3,Total,0,Fire/Flame,115,161038
4,Total,0,Fall,791,161038


In [5]:
job_deaths = job_deaths[job_deaths['hierarchy_level']==0]

With subsetting complete, the hierarchy levels are no longer needed.

In [6]:
del job_deaths['hierarchy_level']

Currently, population is a string with commas and needs to be converted to an integer for calculations.

In [7]:
job_deaths.population = job_deaths.population.str.replace(',', '').astype('int')
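
As an alternative to cleaning the column after loading, `pd.read_csv` accepts a `thousands` argument that parses comma-grouped numbers directly. The sketch below uses a small inline CSV with made-up values (not the BLS file) to show that both approaches yield the same integers.

```python
import io

import pandas as pd

# Toy data with made-up values; the quotes keep each comma inside one field.
raw = 'occupation,deaths,population\nA,10,"1,200"\nB,5,"10,500"\n'

# Approach used above: load as strings, then strip commas and cast to int.
df_str = pd.read_csv(io.StringIO(raw))
df_str["population"] = df_str["population"].str.replace(",", "").astype(int)

# Alternative: let the parser handle the thousands separator at load time.
df_parsed = pd.read_csv(io.StringIO(raw), thousands=",")

print(df_str["population"].tolist())
print(df_parsed["population"].tolist())
```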

Because we're only interested in the indexed likelihood of dying by occupation, we won't need the `occupation == 'Total'` rows.

In [8]:
job_deaths = job_deaths[job_deaths.occupation != 'Total']

In [9]:
# current output
job_deaths.head()

Unnamed: 0,occupation,mechanism_of_death,deaths,population
7,Management occupations,Total,387,10193
8,Management occupations,Cut/Pierce,82,10193
9,Management occupations,Motor Vehicle Traffic,159,10193
10,Management occupations,Fire/Flame,14,10193
11,Management occupations,Fall,35,10193


### Creating the Indexed Likelihood of Death by Occupation

At this point, we'll want to split the data into two tables: the first will be used to augment the probability of death, and the second will be used to augment the mechanism of death if death occurs.
1. `job_indexed_likelihood`: A table which has the total deaths and population for each occupation
2. `job_mechanism_indexed_likelihood`: A table which has the non-total deaths and population for each mechanism

`job_indexed_likelihood` will be used to create the indexed likelihood of dying by occupation, and `job_mechanism_indexed_likelihood` will be used to modulate the probability of dying by a specific mechanism once the dice roll for death occurs. This section covers creating `job_indexed_likelihood`.

In [10]:
job_indexed_likelihood = job_deaths[job_deaths.mechanism_of_death == 'Total']

In [11]:
job_indexed_likelihood

Unnamed: 0,occupation,mechanism_of_death,deaths,population
7,Management occupations,Total,387,10193
280,Business and financial operations occupations,Total,38,8590
427,Computer and mathematical occupations,Total,12,4674
476,Architecture and engineering occupations,Total,30,2699
595,"Life, physical, and social science occupations",Total,18,1323
763,Legal occupations,Total,15,1346
812,"Education, training, and library occupations",Total,27,9647
924,"Arts, design, entertainment, sports, and media...",Total,71,2900
1127,Healthcare practitioners and technical occupat...,Total,65,9108
1274,Healthcare support occupations,Total,32,4316


We'll want to express deaths and population as a ratio before generating indexed values.

In [12]:
# assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on the filtered slice
job_indexed_likelihood = job_indexed_likelihood.assign(
    deaths_per_capita = job_indexed_likelihood.deaths / job_indexed_likelihood.population)


In [13]:
job_indexed_likelihood

Unnamed: 0,occupation,mechanism_of_death,deaths,population,deaths_per_capita
7,Management occupations,Total,387,10193,0.037967
280,Business and financial operations occupations,Total,38,8590,0.004424
427,Computer and mathematical occupations,Total,12,4674,0.002567
476,Architecture and engineering occupations,Total,30,2699,0.011115
595,"Life, physical, and social science occupations",Total,18,1323,0.013605
763,Legal occupations,Total,15,1346,0.011144
812,"Education, training, and library occupations",Total,27,9647,0.002799
924,"Arts, design, entertainment, sports, and media...",Total,71,2900,0.024483
1127,Healthcare practitioners and technical occupat...,Total,65,9108,0.007137
1274,Healthcare support occupations,Total,32,4316,0.007414


In [14]:
np.mean(job_indexed_likelihood.deaths_per_capita)

0.04272993952516797

To give this mean value an index of 1, we simply divide each occupation's deaths per capita by the mean.

In [15]:
# assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on the filtered slice
job_indexed_likelihood = job_indexed_likelihood.assign(
    indexed_likelihood = job_indexed_likelihood.deaths_per_capita /\
        np.mean(job_indexed_likelihood.deaths_per_capita))


In [16]:
job_indexed_likelihood = job_indexed_likelihood.sort_values("indexed_likelihood", ascending = False)
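
The indexing step can be checked on a toy table. The sketch below uses made-up rates for three hypothetical occupations and confirms that, after dividing by the mean, the index values themselves average to 1.

```python
import pandas as pd

# Made-up rates for three hypothetical occupations.
toy = pd.DataFrame({
    "occupation": ["A", "B", "C"],
    "deaths_per_capita": [0.01, 0.02, 0.03],
})

# Unweighted mean across occupations, as in the notebook (population is ignored).
mean_rate = toy["deaths_per_capita"].mean()
toy = toy.assign(indexed_likelihood=toy["deaths_per_capita"] / mean_rate)

print(toy)
print(toy["indexed_likelihood"].mean())
```

An index of 1.5 reads as a rate 50% above that of the average occupation. Because the mean is unweighted, every occupation counts equally regardless of how many people work in it.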

Finally, we can remove all the columns except `occupation` and `indexed_likelihood` and reset the index.

In [17]:
job_indexed_likelihood = job_indexed_likelihood.drop(['mechanism_of_death', 'deaths', 'population', 
                            'deaths_per_capita'], axis = 1).reset_index(drop = True)

In [18]:
job_indexed_likelihood

Unnamed: 0,occupation,indexed_likelihood
0,"Farming, fishing, and forestry occupations",5.589364
1,Construction and extraction occupations,3.254264
2,Transportation and material moving occupations,3.115048
3,Protective service occupations,1.774433
4,"Installation, maintenance, and repair occupations",1.609229
5,Building and grounds cleaning and maintenance ...,1.422786
6,Management occupations,0.888539
7,"Arts, design, entertainment, sports, and media...",0.572965
8,Production occupations,0.557091
9,Sales and related occupations,0.358578


Perfect! The results align with intuition: farming, fishing, forestry, construction, and extraction jobs are among the most dangerous, while computer, mathematical, and education jobs are among the least dangerous.

### Creating the Indexed Likelihood of Death by Mechanism within Occupation

This step is a nice-to-have given how small a difference it will make overall, so we'll deprioritize it for the MVP unless time allows.

### Creating the CDC Annual Death Probabilities by Age, Gender, and Race

As with the BLS data, we'll split the CDC data into two datasets where the first will be used to determine the overall probability of death and the second will be used to determine the mechanism and cause of death if death occurs.

1. `annualDeathProbs_age_gender_race` is, as the name suggests, the probability of dying within a year based on age, race, and gender.
2. `annualCauseofDeathProbs_age_gender_race` is the annual probability of each cause and related mechanism of death given that death occurs within a specific age, gender, and race.

This section covers the creation of `annualDeathProbs_age_gender_race`.

In [19]:
# overview of data
cdc_deaths.head()

Unnamed: 0,age,gender,race,mechanism_of_death,cause_of_death,deaths,population
0,0,Female,American Indian or Alaska Native,Fire/Flame,Exposure to uncontrolled fire in building or s...,1,36615
1,0,Female,American Indian or Alaska Native,Motor Vehicle Traffic,Car occupant injured in collision with heavy t...,1,36615
2,0,Female,American Indian or Alaska Native,Motor Vehicle Traffic,Person injured in unspecified motor-vehicle ac...,1,36615
3,0,Female,American Indian or Alaska Native,Suffocation,Accidental suffocation and strangulation in bed,7,36615
4,0,Female,American Indian or Alaska Native,Suffocation,Unspecified threat to breathing,3,36615


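To convert to annual probabilities, we'll group deaths on age, gender, race, and population, then divide the summed deaths by the population in that grouping. Population is constant within each age/gender/race combination, so including it in the grouping keeps it as a column instead of summing it.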

In [20]:
# sum only the deaths column; the text columns (mechanism, cause) are not needed here
annualDeathProbs_age_gender_race =\
    cdc_deaths.groupby(['age', 'gender', 'race', 'population'], as_index = False)['deaths'].sum()

In [21]:
annualDeathProbs_age_gender_race['annual_death_prob'] = annualDeathProbs_age_gender_race.deaths /\
                                                    annualDeathProbs_age_gender_race.population
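
The grouping logic above can be sketched on a toy table. The rows below are made up: one group has two causes of death and another has one, with the group's population repeated on every row as in the CDC data.

```python
import pandas as pd

# Made-up rows; population repeats for every row of the same group.
toy = pd.DataFrame({
    "age": [30, 30, 31],
    "gender": ["Female", "Female", "Female"],
    "race": ["White", "White", "White"],
    "cause_of_death": ["Fall", "Drowning", "Fall"],
    "deaths": [3, 1, 2],
    "population": [1000, 1000, 2000],
})

# Population is a grouping key, so it is carried through rather than summed.
probs = toy.groupby(
    ["age", "gender", "race", "population"], as_index=False
)["deaths"].sum()
probs["annual_death_prob"] = probs["deaths"] / probs["population"]

print(probs)
```

Had population been summed along with deaths, the first group would have been divided by 2000 instead of 1000, halving its probability.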

Finally, to tidy up, we can remove the population and deaths columns since they won't be needed by the death simulator now that we have probability.

In [22]:
annualDeathProbs_age_gender_race = annualDeathProbs_age_gender_race.drop(['population', 'deaths'], axis = 1)

In [23]:
# new data structure
annualDeathProbs_age_gender_race.head()

Unnamed: 0,age,gender,race,annual_death_prob
0,0,Female,American Indian or Alaska Native,0.003714
1,0,Female,Asian or Pacific Islander,0.003723
2,0,Female,Black or African American,0.009152
3,0,Female,White,0.004184
4,0,Male,American Indian or Alaska Native,0.004777


### Creating the CDC Annual Cause of Death Probabilities by Age, Gender, and Race

In [24]:
annualCauseofDeathProbs_age_gender_race = cdc_deaths.copy()

Here, population doesn't matter since that field is only used for calculating the probability of death (used in `annualDeathProbs_age_gender_race`). Instead, we want to know: of those who died, what fraction died by each cause? This being the case, we can drop population and calculate the cause of death probability within each age/gender/race combination.

In [25]:
del annualCauseofDeathProbs_age_gender_race['population']

In [26]:
annualCauseofDeathProbs_age_gender_race = \
    annualCauseofDeathProbs_age_gender_race.assign(cause_of_death_prob = \
        annualCauseofDeathProbs_age_gender_race.deaths /\
            annualCauseofDeathProbs_age_gender_race.groupby(['age', 'gender', 'race']).deaths.transform('sum'))
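
The `transform('sum')` call broadcasts each group's total deaths back onto every row of that group, so the division yields a within-group share. A minimal sketch on made-up rows, with a check that the shares sum to 1 inside each group:

```python
import pandas as pd

# Made-up rows: two causes in one group, a single cause in another.
toy = pd.DataFrame({
    "age": [30, 30, 31],
    "gender": ["Female", "Female", "Female"],
    "race": ["White", "White", "White"],
    "cause_of_death": ["Fall", "Drowning", "Fall"],
    "deaths": [3, 1, 2],
})

group_keys = ["age", "gender", "race"]

# Same length as toy: each row receives the total deaths of its own group.
group_totals = toy.groupby(group_keys)["deaths"].transform("sum")
toy = toy.assign(cause_of_death_prob=toy["deaths"] / group_totals)

# Within each group the shares should sum to 1.
share_sums = toy.groupby(group_keys)["cause_of_death_prob"].sum()

print(toy)
print(share_sums)
```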

The final step is removing the unneeded deaths column.

In [27]:
del annualCauseofDeathProbs_age_gender_race['deaths']

In [28]:
# inspect data structure
annualCauseofDeathProbs_age_gender_race.head()

Unnamed: 0,age,gender,race,mechanism_of_death,cause_of_death,cause_of_death_prob
0,0,Female,American Indian or Alaska Native,Fire/Flame,Exposure to uncontrolled fire in building or s...,0.007353
1,0,Female,American Indian or Alaska Native,Motor Vehicle Traffic,Car occupant injured in collision with heavy t...,0.007353
2,0,Female,American Indian or Alaska Native,Motor Vehicle Traffic,Person injured in unspecified motor-vehicle ac...,0.007353
3,0,Female,American Indian or Alaska Native,Suffocation,Accidental suffocation and strangulation in bed,0.051471
4,0,Female,American Indian or Alaska Native,Suffocation,Unspecified threat to breathing,0.022059
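
These conditional probabilities are what a weighted roll would draw from once death has occurred. The sketch below is one way such a roll might look; the `roll_cause` helper and the rows are hypothetical and are not taken from `death_simulator.py`.

```python
import numpy as np
import pandas as pd

# Made-up conditional probabilities for a single age/gender/race group.
group = pd.DataFrame({
    "mechanism_of_death": ["Fall", "Drowning", "Suffocation"],
    "cause_of_death": [
        "Fall on stairs",
        "Drowning in pool",
        "Unspecified threat to breathing",
    ],
    "cause_of_death_prob": [0.5, 0.3, 0.2],
})


def roll_cause(group, rng):
    """Pick one row, weighting each row by its cause_of_death_prob."""
    weights = group["cause_of_death_prob"].to_numpy(dtype=float)
    weights = weights / weights.sum()  # guard against rounding drift
    row = rng.choice(len(group), p=weights)
    return group.iloc[row][["mechanism_of_death", "cause_of_death"]].tolist()


rng = np.random.default_rng(seed=42)
print(roll_cause(group, rng))
```

Renormalizing the weights before the draw keeps `rng.choice` from rejecting probabilities that miss 1 by a rounding error.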


### Write Results to .csv

In [29]:
job_indexed_likelihood.to_csv('../data/job_indexed_likelihood.csv', index = False)
annualDeathProbs_age_gender_race.to_csv('../data/annualDeathProbs_age_gender_race.csv', index = False)
annualCauseofDeathProbs_age_gender_race.to_csv('../data/annualCauseofDeathProbs_age_gender_race.csv', index = False)

In [30]:
len(np.unique(annualCauseofDeathProbs_age_gender_race.cause_of_death))

3753