This notebook contains my work on the data analysis exercise for the Junior Data Scientist role at Our World in Data.

I've been asked to calculate the crude and age-standardised death rate for chronic obstructive pulmonary disease (COPD) in both the United States and Uganda in 2019.

In this calculation, I've used the population estimates for 1 July 2019, since a mid-year population better represents each country's average population over 2019 than the population in January 2019 or January 2020. For age-standardisation, I've used the WHO World Standard population, listed as Dataset 2 below.

To calculate the crude death rate for each country, I multiplied each age group's share of the country's population by that age group's death rate and summed the results. To calculate the age-standardised death rates, I did the same, but used the WHO World Standard population shares in place of each country's actual population shares.
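
In symbols, the two calculations can be written as below. The notation is introduced here for illustration only: $r_i$ is the death rate in age group $i$, $N_i$ is the country's population in that group, $p_i$ is the group's share of the country's population, and $w_i$ is the group's share of the WHO World Standard population.

$$
\text{crude rate} = \sum_i p_i \, r_i, \qquad p_i = \frac{N_i}{\sum_j N_j}
$$

$$
\text{age-standardised rate} = \sum_i w_i \, r_i, \qquad \sum_i w_i = 1
$$

The only difference between the two is the set of weights: $p_i$ changes from country to country, while $w_i$ is the same for every country.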

In our results, we found that the USA had a much higher crude death rate per 100,000 people than Uganda:
* USA: 57.2
* Uganda: 5.8

Looking at the data, though, we see that the USA has a lower death rate than Uganda in most age groups. Looking at the population data, we see that Uganda's population skews younger, whereas the USA's skews older. Since older people are more likely to die from COPD in both datasets, the crude death rate is much higher in the USA mostly because its population is older. So we expect age-standardisation to decrease the USA death rate and increase the Uganda death rate.
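
The effect described above can be reproduced with a tiny example. This is a minimal sketch using made-up numbers (two age groups and invented rates and population shares, not values from the datasets in this notebook): one country has the higher death rate in every age group, yet the lower crude rate, because far fewer of its people are in the older group.

```python
# Made-up illustrative numbers, NOT taken from the datasets in this notebook.
# Death rates per 100,000 people for two age groups: [younger, older].
rates_older_country = [1.0, 100.0]
rates_younger_country = [2.0, 150.0]  # higher in both age groups

# Share of each country's population in the two age groups (sums to 1).
shares_older_country = [0.60, 0.40]
shares_younger_country = [0.95, 0.05]


def weighted_rate(shares, rates):
    """Sum of (population share * age-specific death rate)."""
    return sum(s * r for s, r in zip(shares, rates))


crude_older = weighted_rate(shares_older_country, rates_older_country)
crude_younger = weighted_rate(shares_younger_country, rates_younger_country)

print(f"Crude rate, older country:   {crude_older:.1f}")
print(f"Crude rate, younger country: {crude_younger:.1f}")
```

In this toy case the older country's crude rate is several times higher, even though its rate is lower in both age groups.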

In our results, we found the age-standardised death rates per 100,000 people were:
* USA: 28.4
* Uganda: 28.7

As we expected, the age-standardised death rates are much closer to each other than the crude death rates. In fact, once differences in age structure are removed, the death rates in the two countries are nearly the same.

Data Sources:

1. UN World Population Prospects (2022) — Population Estimates 1950-2021 (obtained from https://population.un.org/wpp/Download/Standard/CSV/, Population on 01 July, by 5-year age groups.)
2. WHO Standard Population — Table 1 in 'Ahmad OB, Boschi-Pinto C, Lopez AD, Murray CJ, Lozano R, Inoue M (2001). Age standardization of rates: a new WHO standard.' (obtained from https://cdn.who.int/media/docs/default-source/gho-documents/global-health-estimates/gpe_discussion_paper_series_paper31_2001_age_standardization_rates.pdf)
3. Table of age-specific death rates of COPD (obtained from https://owid.notion.site/Data-analysis-exercise-Our-World-in-Data-Junior-Data-Scientist-application-ab287a3c07264b4d91aadc436021b8c0)

Firstly, let's load our data and see what we're working with.

In [1]:
import numpy as np
import pandas as pd 

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
dfPopulation = pd.read_csv('/kaggle/input/our-world-in-data-datasets/WPP2022_PopulationByAge5GroupSex_Medium/WPP2022_PopulationByAge5GroupSex_Medium.csv')
dfAgeStandard = pd.read_csv('/kaggle/input/our-world-in-data-datasets/WHO Standard Population Age Standardization.csv')
dfAgeDeath = pd.read_csv('/kaggle/input/our-world-in-data-datasets/Age Specific Death Rates of COPD.csv')

print('Dataset 1: UN World Population Prospects (2022) — Population Estimates 1950-2021')
print(dfPopulation.head())
print("Dataset 2: WHO Standard Population — Table 1 in 'Ahmad OB, Boschi-Pinto C, Lopez AD, Murray CJ, Lozano R, Inoue M (2001). Age standardization of rates: a new WHO standard.'")
print(dfAgeStandard.head())
print("Dataset 3: Table of age-specific death rates of COPD:")
print(dfAgeDeath.head())

/kaggle/input/our-world-in-data-datasets/WHO Standard Population Age Standardization.csv
/kaggle/input/our-world-in-data-datasets/Age Specific Death Rates of COPD.csv
/kaggle/input/our-world-in-data-datasets/WPP2022_PopulationBySingleAgeSex_Medium_1950-2021.csv
/kaggle/input/our-world-in-data-datasets/WPP2022_PopulationByAge5GroupSex_Medium/WPP2022_PopulationByAge5GroupSex_Medium.csv


Dataset 1: UN World Population Prospects (2022) — Population Estimates 1950-2021
   SortOrder  LocID Notes ISO3_code ISO2_code  SDMX_code  LocTypeID  \
0          1    900   NaN       NaN       NaN        1.0          1   
1          1    900   NaN       NaN       NaN        1.0          1   
2          1    900   NaN       NaN       NaN        1.0          1   
3          1    900   NaN       NaN       NaN        1.0          1   
4          1    900   NaN       NaN       NaN        1.0          1   

  LocTypeName  ParentID Location  VarID Variant  Time  MidPeriod AgeGrp  \
0       World         0    World      2  Medium  1950       1950    0-4   
1       World         0    World      2  Medium  1950       1950    5-9   
2       World         0    World      2  Medium  1950       1950  10-14   
3       World         0    World      2  Medium  1950       1950  15-19   
4       World         0    World      2  Medium  1950       1950  20-24   

   AgeGrpStart  AgeGrpSpan     PopMale   

Dataset 1 contains far more data than we need, so we filter it down. Our death rate data is for 2019, so we only keep population data for 2019. The population data also splits ages up to 100+, whereas our other datasets stop at 85+, so we combine the 85-89, 90-94, 95-99 and 100+ groups into a single 85+ group. Finally, we add each age group's share of the total population to aid the calculation.

In [2]:
# only data from 2019
dfPopulation = dfPopulation[dfPopulation.Time == 2019]

# split into USA and Uganda population data
important_columns = ['AgeGrp', 'PopTotal']
dfPopUSA = dfPopulation[dfPopulation.Location == 'United States of America'][important_columns].reset_index(drop=True)
dfPopUganda = dfPopulation[dfPopulation.Location == 'Uganda'][important_columns].reset_index(drop=True)

# add 85+ category to match other data
age_ranges = ['85-89','90-94','95-99','100+']
dfPopUSA.loc[len(dfPopUSA)] = {'AgeGrp':'85+','PopTotal':sum(dfPopUSA[dfPopUSA.AgeGrp.isin(age_ranges)].PopTotal)}
dfPopUSA = dfPopUSA[~dfPopUSA.AgeGrp.isin(age_ranges)]
dfPopUganda.loc[len(dfPopUganda)] = {'AgeGrp':'85+','PopTotal':sum(dfPopUganda[dfPopUganda.AgeGrp.isin(age_ranges)].PopTotal)}
dfPopUganda = dfPopUganda[~dfPopUganda.AgeGrp.isin(age_ranges)]

# calculate each age group's share of the total population (stored as a fraction between 0 and 1, not a percentage)
dfPopUSA['PopPercent'] = dfPopUSA.PopTotal / dfPopUSA.PopTotal.sum()
dfPopUganda['PopPercent'] = dfPopUganda.PopTotal / dfPopUganda.PopTotal.sum()

print('Ages and Populations in USA:')
print(dfPopUSA)
print('Ages and Populations in Uganda:')
print(dfPopUganda)

Ages and Populations in USA:
   AgeGrp   PopTotal  PopPercent
0     0-4  19848.556    0.059370
1     5-9  20697.075    0.061908
2   10-14  22092.167    0.066081
3   15-19  21895.123    0.065492
4   20-24  21871.808    0.065422
5   25-29  23406.928    0.070014
6   30-34  22842.151    0.068324
7   35-39  22296.952    0.066694
8   40-44  20694.555    0.061901
9   45-49  21244.258    0.063545
10  50-54  21346.434    0.063850
11  55-59  22347.500    0.066845
12  60-64  20941.064    0.062638
13  65-69  17500.872    0.052348
14  70-74  13688.595    0.040945
15  75-79   9272.809    0.027736
16  80-84   6118.867    0.018302
21    85+   6213.954    0.018587
Ages and Populations in Uganda:
   AgeGrp  PopTotal  PopPercent
0     0-4  7328.968    0.170643
1     5-9  6614.421    0.154006
2   10-14  5899.400    0.137358
3   15-19  5151.082    0.119935
4   20-24  4348.173    0.101240
5   25-29  3499.504    0.081480
6   30-34  2618.559    0.060969
7   35-39  1903.175    0.044312
8   40-44  1503.669    0
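
The 85+ aggregation in the cell above is written out once per country. As a hedged alternative, the same steps could be wrapped in a small helper. The function name `collapse_oldest_groups` and the toy populations are hypothetical and introduced only for this sketch; the column names `AgeGrp`, `PopTotal` and `PopPercent` follow the ones used above.

```python
import pandas as pd


def collapse_oldest_groups(df, groups, label="85+"):
    """Replace the rows whose AgeGrp is in `groups` with one summed row.

    Hypothetical helper: returns a new frame with a fresh 0..n-1 index and a
    PopPercent column holding each group's share of the total (a fraction).
    """
    is_old = df["AgeGrp"].isin(groups)
    combined = pd.DataFrame(
        {"AgeGrp": [label], "PopTotal": [df.loc[is_old, "PopTotal"].sum()]}
    )
    out = pd.concat(
        [df.loc[~is_old, ["AgeGrp", "PopTotal"]], combined], ignore_index=True
    )
    out["PopPercent"] = out["PopTotal"] / out["PopTotal"].sum()
    return out


# Toy input with made-up populations, only to show the behaviour.
toy = pd.DataFrame(
    {
        "AgeGrp": ["80-84", "85-89", "90-94", "95-99", "100+"],
        "PopTotal": [50.0, 30.0, 15.0, 4.0, 1.0],
    }
)
result = collapse_oldest_groups(toy, ["85-89", "90-94", "95-99", "100+"])
print(result)
```

Because the helper rebuilds the index, the combined row would not keep the index 21 seen in the output above.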

Now we can start our calculation for crude and age-standardised death rates.

The crude death rate is the average death rate across the whole population. We calculate it by weighting each age group's death rate by that group's share of the country's actual population. This is simple to compute, but it can be skewed by the population's age structure (e.g. young adults are in general more likely to survive a disease).
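
Weighting age-specific rates by population share gives the same answer as dividing total deaths by total population. The following is a minimal sketch with made-up numbers (the age groups, populations and rates are invented for illustration), assuming populations in thousands and death rates per 100,000 people.

```python
import pandas as pd

# Made-up numbers for illustration only, NOT from the datasets above.
toy = pd.DataFrame(
    {
        "AgeGrp": ["0-39", "40-69", "70+"],
        "PopTotal": [6000.0, 3000.0, 1000.0],  # assumed to be in thousands
        "DeathRate": [0.5, 20.0, 300.0],       # deaths per 100,000 people
    }
)

# Route 1: population-weighted average of the age-specific rates.
share = toy["PopTotal"] / toy["PopTotal"].sum()
crude_weighted = (share * toy["DeathRate"]).sum()

# Route 2: total deaths divided by total population, scaled to per 100,000.
people = toy["PopTotal"] * 1000
deaths = people * toy["DeathRate"] / 100_000
crude_direct = deaths.sum() / people.sum() * 100_000

print(f"Weighted average of rates: {crude_weighted:.1f}")
print(f"Total deaths / population: {crude_direct:.1f}")
```

Both routes print the same value, which is why the calculation only needs population shares rather than absolute death counts.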

In [3]:
# merge with death rate from dataset 3
dfPopUSA = pd.merge(dfPopUSA, dfAgeDeath[['Age group (years)', 'Death rate United States 2019']], left_on='AgeGrp',right_on='Age group (years)')[['AgeGrp','PopTotal','PopPercent','Death rate United States 2019']]
dfPopUganda = pd.merge(dfPopUganda, dfAgeDeath[['Age group (years)', 'Death rate Uganda 2019']], left_on='AgeGrp',right_on='Age group (years)')[['AgeGrp','PopTotal','PopPercent','Death rate Uganda 2019']]

# calculate crude death rates
crude_usa = round(sum(dfPopUSA.PopPercent*dfPopUSA['Death rate United States 2019']),1)
crude_uganda = round(sum(dfPopUganda.PopPercent*dfPopUganda['Death rate Uganda 2019']),1)

print(f'Crude death rates:\nUSA: {crude_usa}\nUganda: {crude_uganda}')

Crude death rates:
USA: 57.2
Uganda: 5.8


Crude death rates per 100,000 people:
* USA: 57.2
* Uganda: 5.8

As you can see, the crude death rate of COPD in Uganda is much lower than in the USA. There could be many reasons for this, but notice that Uganda's population is much younger than the USA's. Even though Uganda has a higher death rate than the USA in most age groups, death rates in both countries are much higher among older people, and the USA has a far larger share of older people, so the USA ends up with a higher crude death rate.

Now let's compare this to the age-standardised death rate. Here we apply the same age distribution to each country, rather than relying on each country's own population. Since Uganda's death rate is higher than the USA's in most age groups, we expect Uganda's age-standardised rate to be higher than the USA's.

Here, we're using the WHO World Standard population.
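
As a rough sketch of the idea, direct standardisation can be written as a small function. The name `age_standardised_rate` and the toy numbers are hypothetical and only for illustration; in particular, the weights below are made up and are not the WHO World Standard values.

```python
import pandas as pd


def age_standardised_rate(rates, standard_weights):
    """Weight age-specific rates by a standard population's age shares.

    Both arguments are Series indexed by age group. The weights may be given
    as percentages or fractions, since they are rescaled to sum to 1.
    """
    if not rates.index.equals(standard_weights.index):
        raise ValueError("rates and standard_weights must share the same age groups")
    weights = standard_weights / standard_weights.sum()
    return (rates * weights).sum()


# Made-up rates and weights, NOT the WHO World Standard values.
age_groups = ["0-39", "40-69", "70+"]
toy_rates = pd.Series([1.0, 10.0, 100.0], index=age_groups)
toy_weights_percent = pd.Series([50.0, 30.0, 20.0], index=age_groups)

toy_result = age_standardised_rate(toy_rates, toy_weights_percent)
print(f"Toy age-standardised rate: {toy_result:.1f}")
```

Rescaling the weights inside the function means it does not matter whether they arrive as percentages or as fractions.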

In [4]:
dfAgeStandard['WHO World Standard'] /= 100 # convert percent to decimal

# merge with dataset 2 to get WHO Age Standard
dfPopUSA = pd.merge(dfPopUSA, dfAgeStandard[['Age group', 'WHO World Standard']], left_on='AgeGrp',right_on='Age group')[['AgeGrp','PopTotal','PopPercent','Death rate United States 2019', 'WHO World Standard']]
dfPopUganda = pd.merge(dfPopUganda, dfAgeStandard[['Age group', 'WHO World Standard']], left_on='AgeGrp',right_on='Age group')[['AgeGrp','PopTotal','PopPercent','Death rate Uganda 2019', 'WHO World Standard']]

# calculate age-standardised death rate
age_standard_usa = round(sum(dfPopUSA['WHO World Standard']*dfPopUSA['Death rate United States 2019']),1)
age_standard_uganda = round(sum(dfPopUganda['WHO World Standard']*dfPopUganda['Death rate Uganda 2019']),1)

print(f'Age-standardised death rates:\nUSA: {age_standard_usa}\nUganda: {age_standard_uganda}')

Age-standardised death rates:
USA: 28.4
Uganda: 28.7


Age-standardised death rates per 100,000 people:
* USA: 28.4
* Uganda: 28.7


This is mostly as we expected. The USA's population skews older than the WHO World Standard population, so its death rate has decreased after standardisation, and Uganda's population skews younger, so its death rate has increased.
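
One way to summarise the change is to compare the ratio of the two countries' rates before and after standardisation. This short sketch uses only the rounded results reported above, so the ratios are approximate.

```python
# Rounded results reported in this notebook (deaths per 100,000 people).
crude = {"USA": 57.2, "Uganda": 5.8}
age_standardised = {"USA": 28.4, "Uganda": 28.7}

# Ratio of the USA rate to the Uganda rate under each measure.
crude_ratio = crude["USA"] / crude["Uganda"]
standardised_ratio = age_standardised["USA"] / age_standardised["Uganda"]

print(f"Crude rate ratio (USA / Uganda): {crude_ratio:.1f}")
print(f"Age-standardised rate ratio (USA / Uganda): {standardised_ratio:.2f}")
```

The crude ratio is close to 10, while the age-standardised ratio is close to 1.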