# COPD exercise for "Our World in Data"

Given age-specific death rates from chronic obstructive pulmonary disease (COPD) for Uganda and the USA in 2019, calculate both the crude death rate and the age-standardized death rate for COPD for all ages. The rates are given as deaths per 100,000 people, and the results should be stated in the same form, to one decimal place.

## Data sources

1. UN WPP 2022. First I checked whether anything like the UN-provided WPP2022 R package exists for Python, but found no equivalent. Then I considered learning the API (https://population.un.org/dataportalapi/index.html) but decided an xlsx download would be faster for a one-off task. At first I downloaded only the total population from WPP2022; later I went back for population by "5-year age groups, both sexes": https://population.un.org/wpp/Download/Standard/Population/

2. WHO Standard Population: Table 1 in 'Ahmad et al (2001). Age standardization of rates: a new WHO standard.' https://cdn.who.int/media/docs/default-source/gho-documents/global-health-estimates/gpe_discussion_paper_series_paper31_2001_age_standardization_rates.pdf
Copying the table from the PDF was not clean, so to avoid manual edits (which would likely introduce errors) I used https://seer.cancer.gov/stdpopulations/world.who.html instead.

3. The provided table was saved to COPD_age-specific_USA_Uganda_2019.csv

## Calculation
I don't know how to do this. ChatGPT advises:

### Crude death rate
1. For each country, sum the age-specific death rates across all age groups.
2. Divide the total deaths by the total population of the country.
3. Multiply the result by 100,000.

### Age-Standardized death rate (ASDR)
For each country, calculate the weighted average of the age-specific death rates using the WHO standard population percentages.

1. Multiply the age-specific death rate for each age group by the corresponding WHO standard percentage.
2. Sum up the weighted death rates across all age groups.
3. Multiply the result by 100,000.
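
For comparison, the textbook definitions can be written out as below. This is a sketch of direct standardization, not something taken from the exercise itself. The symbols are my own assumptions: $r_i$ is the age-specific death rate per 100,000 in age group $i$, $N_i$ is the country's population in that age group, and $w_i$ is the WHO standard share of that age group as a fraction (the percentage divided by 100).

$$
\mathrm{CDR} = \frac{\sum_i r_i \, N_i}{\sum_i N_i},
\qquad
\mathrm{ASDR} = \sum_i r_i \, w_i,
\qquad
\sum_i w_i = 1
$$

Two things follow if these definitions apply here. Because each $r_i$ is already expressed per 100,000, both results come out per 100,000 without any further scaling. And the crude rate needs the population of each age group, not only the total: adding up the rates themselves does not give a count of deaths.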

## Report

Initial results are wildly wrong:

Crude Death Rate (CDR) per 100,000 population:
USA: 0.6
Uganda: 4.8

Age-Standardized Death Rate (ASDR) per 100,000 population:
USA: 2848211.1
Uganda: 2872446.0

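Since the USA and Uganda have different age structures (the USA is older), if COPD mortality is higher in old age (an assumption), the older US population raises the US crude rate relative to its age-standardized rate. I would therefore expect the Uganda:USA ratio of death rates to be lower for CDR than for ASDR.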
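
A toy example can make the direction of this effect concrete. The sketch below uses entirely hypothetical numbers (three age groups, identical age-specific rates in both countries, made-up age structures and a made-up standard); the function names are mine. It only illustrates how age structure moves the crude rate while leaving the standardized rate unchanged.

```python
import numpy as np

def crude_rate(rates_per_100k, population):
    """Average of age-specific rates weighted by the country's own population."""
    rates = np.asarray(rates_per_100k, dtype=float)
    pop = np.asarray(population, dtype=float)
    return float((rates * pop).sum() / pop.sum())

def standardized_rate(rates_per_100k, standard_weights):
    """Average of age-specific rates weighted by a shared standard population."""
    rates = np.asarray(rates_per_100k, dtype=float)
    weights = np.asarray(standard_weights, dtype=float)
    return float((rates * weights).sum() / weights.sum())

# Hypothetical rates per 100,000 for young, middle and old age groups,
# deliberately identical in both countries.
rates = [1.0, 10.0, 200.0]

# Hypothetical age structures (any unit, only the proportions matter).
young_country = [70, 25, 5]
old_country = [40, 40, 20]
standard = [50, 35, 15]

crude_young = crude_rate(rates, young_country)
crude_old = crude_rate(rates, old_country)
asdr = standardized_rate(rates, standard)  # same for both countries

print("Crude, young country:", round(crude_young, 1))  # 13.2
print("Crude, old country:  ", round(crude_old, 1))    # 44.4
print("ASDR, either country:", round(asdr, 1))         # 34.0
```

In this toy case the young:old ratio is about 0.3 for the crude rate and exactly 1 for the standardized rate, because the only difference between the two countries is their age structure.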

In [3]:
import numpy as np
import pandas as pd
import openpyxl as xl
import math

In [4]:
# Began with source 3
COPD_original = pd.read_csv("data/COPD_age-specific_USA_Uganda_2019.csv", skiprows=1, skipfooter=2, engine='python')
print(COPD_original)
# death rates per 100,000 population

   Age group (years)  Death rate, United States, 2019  \
0                0-4                             0.04   
1                5-9                             0.02   
2              10-14                             0.02   
3              15-19                             0.02   
4              20-24                             0.06   
5              25-29                             0.11   
6              30-34                             0.29   
7              35-39                             0.56   
8              40-44                             1.42   
9              45-49                             4.00   
10             50-54                            14.13   
11             55-59                            37.22   
12             60-64                            66.48   
13             65-69                           108.66   
14             70-74                           213.10   
15             75-79                           333.06   
16             80-84           

In [5]:
# Next, source 2
WHO_standard = pd.read_csv("data/WHO_World_Standard_from_NIH_NCI.csv", usecols=['Age Group', 'WHO World Standard (%)'], skiprows=1, skipfooter=2, engine='python')
WHO_standard = WHO_standard.rename(columns={
    'Age Group': 'Age group'
})
print(WHO_standard)

   Age group  WHO World Standard (%)
0        0-4                   8.860
1        5-9                   8.690
2      10-14                   8.600
3      15-19                   8.470
4      20-24                   8.220
5      25-29                   7.930
6      30-34                   7.610
7      35-39                   7.150
8      40-44                   6.590
9      45-49                   6.040
10     50-54                   5.370
11     55-59                   4.550
12     60-64                   3.720
13     65-69                   2.960
14     70-74                   2.210
15     75-79                   1.520
16     80-84                   0.910
17     85-89                   0.440
18     90-94                   0.150
19     95-99                   0.040
20      100+                   0.005


In [6]:
# In COPD data the final row is age 85+
# so for compatibility merge older cohorts in NIH's WHO_standard.
# Check it matches the percentage in Ahmad et al paper's table 1 85+ row (0.63%)
sum_percentage = WHO_standard.iloc[-4:]['WHO World Standard (%)'].sum()
print(sum_percentage)

0.635


In [7]:
# delete last 4 rows, create new 85+ row
WHO_standard.drop(WHO_standard.index[-4:], inplace=True)
new_row = pd.DataFrame({'Age group': ['85+'], 'WHO World Standard (%)': [sum_percentage]})
WHO_standard = pd.concat([WHO_standard, new_row], ignore_index=True)
print(WHO_standard)

# alternative one-step method
# WHO_standard = pd.concat([WHO_standard.iloc[:-4], new_row], ignore_index=True)

   Age group  WHO World Standard (%)
0        0-4                   8.860
1        5-9                   8.690
2      10-14                   8.600
3      15-19                   8.470
4      20-24                   8.220
5      25-29                   7.930
6      30-34                   7.610
7      35-39                   7.150
8      40-44                   6.590
9      45-49                   6.040
10     50-54                   5.370
11     55-59                   4.550
12     60-64                   3.720
13     65-69                   2.960
14     70-74                   2.210
15     75-79                   1.520
16     80-84                   0.910
17       85+                   0.635


In [None]:
# Lastly, source 1
wpp = pd.read_excel("data/WPP2022_GEN_F01_DEMOGRAPHIC_INDICATORS_COMPACT_REV1.xlsx", skiprows=16)
# first pass inspect columns and identify those needed
print(wpp.columns.tolist())

In [33]:
# In an early effort I also retrieved CDR (all causes), which is not needed.
wpp_originalColNames = pd.read_excel("data/WPP2022_GEN_F01_DEMOGRAPHIC_INDICATORS_COMPACT_REV1.xlsx", skiprows=16, usecols=['Region, subregion, country or area *', 'Year', 'Total Population, as of 1 July (thousands)', 'Crude Death Rate (deaths per 1,000 population)'])
print(wpp_originalColNames.columns.tolist())

['Region, subregion, country or area *', 'Year', 'Total Population, as of 1 July (thousands)', 'Crude Death Rate (deaths per 1,000 population)']


In [34]:
# Shorten column names, in case this needs to interact with the other two data sources.
wpp = wpp_originalColNames.rename(columns={
    'Region, subregion, country or area *': 'Country',
    'Total Population, as of 1 July (thousands)': 'Population',
    'Crude Death Rate (deaths per 1,000 population)': 'CDR'
})
print(wpp.columns.tolist())

['Country', 'Year', 'Population', 'CDR']


In [12]:
print(wpp.iloc[0])

Country             WORLD
Year               1950.0
Population    2499322.157
CDR                19.518
Name: 0, dtype: object


In [13]:
# 'Year' is a blank cell in the row between each country; drop those rows
# so that Year can be restored to 4-digit integer form
wpp.dropna(subset=['Year'], inplace=True)

In [15]:
wpp['Year'] = wpp['Year'].astype(int)
print(wpp.iloc[0]['Year'])

1950


In [17]:
# Filter Uganda and USA
wpp_country_filter = wpp[(wpp['Country'] == 'United States of America') | (wpp['Country'] == 'Uganda')]
print(wpp_country_filter)
# I bet long-form "United States of America" will require changing later.

                        Country  Year  Population     CDR
2884                     Uganda  1950    5750.637  25.067
2885                     Uganda  1951    5909.819  24.859
2886                     Uganda  1952    6073.833   24.36
2887                     Uganda  1953    6243.883  23.845
2888                     Uganda  1954    6419.882  23.318
...                         ...   ...         ...     ...
18575  United States of America  2017  329791.231   8.424
18576  United States of America  2018  332140.037   8.387
18577  United States of America  2019  334319.671   8.325
18578  United States of America  2020  335942.003   9.651
18579  United States of America  2021  336997.624   9.743

[144 rows x 4 columns]


In [18]:
# Filter year 2019
WPP_2019 = wpp_country_filter[wpp_country_filter['Year'] == 2019]
print(WPP_2019)
# Population data unit is 1,000 - a thousand

                        Country  Year  Population    CDR
2953                     Uganda  2019    42949.08  5.823
18577  United States of America  2019  334319.671  8.325


In [27]:
# Total population of each of the 2 countries
population_usa = WPP_2019.loc[WPP_2019['Country'] == 'United States of America', 'Population'].values[0] * 1000
population_uganda = WPP_2019.loc[WPP_2019['Country'] == 'Uganda', 'Population'].values[0] * 1000
print("Population total number in 2019")
print("Uganda  {}".format(math.trunc(population_uganda)))
print("USA    {}".format(math.trunc(population_usa)))

Population total number in 2019
Uganda  42949080
USA    334319671


In [20]:
# harmonise age column name with WHO Standard Population Distribution
# COPD = COPD_original.copy()
COPD = COPD_original.rename(columns={
    'Age group (years)': 'Age group'
})
print(COPD)

   Age group  Death rate, United States, 2019  Death rate, Uganda, 2019
0        0-4                             0.04                      0.40
1        5-9                             0.02                      0.17
2      10-14                             0.02                      0.07
3      15-19                             0.02                      0.23
4      20-24                             0.06                      0.38
5      25-29                             0.11                      0.40
6      30-34                             0.29                      0.75
7      35-39                             0.56                      1.11
8      40-44                             1.42                      2.04
9      45-49                             4.00                      5.51
10     50-54                            14.13                     13.26
11     55-59                            37.22                     33.25
12     60-64                            66.48                   

In [None]:
# Data ready, begin calculations (with help from ChatGPT)
# (more harmonising of column names may be needed)

# Death rates are conventionally expressed as a number per 100,000

# WHO World Standard is a percentage so need to divide by 100

In [28]:
# Calculate Crude Death Rate (CDR) for each country
total_deaths_usa = COPD['Death rate, United States, 2019'].sum()
total_deaths_uganda = COPD['Death rate, Uganda, 2019'].sum()

cdr_usa = (total_deaths_usa / population_usa) * 100000
cdr_uganda = (total_deaths_uganda / population_uganda) * 100000

print("Crude Death Rate (CDR) per 100,000 population:")
print("Uganda:", round(cdr_uganda, 1))
print("United States of America:", round(cdr_usa, 1))

Crude Death Rate (CDR) per 100,000 population:
Uganda: 4.8
United States of America: 0.6


In [24]:
# Calculate Crude Death Rate (CDR) for each country
total_deaths_usa = COPD['Death rate, United States, 2019'].sum()
total_deaths_uganda = COPD['Death rate, Uganda, 2019'].sum()

# original WPP numbers are in thousands
population_usa = WPP_2019.loc[WPP_2019['Country'] == 'United States of America', 'Population'].values[0]
population_uganda = WPP_2019.loc[WPP_2019['Country'] == 'Uganda', 'Population'].values[0]

cdr_usa = (total_deaths_usa / population_usa) * 100000
cdr_uganda = (total_deaths_uganda / population_uganda) * 100000

print("Crude Death Rate (CDR) per 100,000 population:")
print("Uganda:", round(cdr_uganda, 1))
print("United States of America:", round(cdr_usa, 1))

Crude Death Rate (CDR) per 100,000 population:
Uganda: 4793.2
United States of America: 647.5


In [29]:
# Calculate Age-Standardized Death Rate (ASDR) for each country
asdr_usa = 0
asdr_uganda = 0

for i, row in COPD.iterrows():
    asdr_usa += row['Death rate, United States, 2019'] * WHO_standard.loc[i, 'WHO World Standard (%)'] / 100
    asdr_uganda += row['Death rate, Uganda, 2019'] * WHO_standard.loc[i, 'WHO World Standard (%)'] / 100

asdr_usa *= 100000
asdr_uganda *= 100000

print("Age-Standardized Death Rate (ASDR) per 100,000 population:")
print("USA:", round(asdr_usa, 1))
print("Uganda:", round(asdr_uganda, 1))

Age-Standardized Death Rate (ASDR) per 100,000 population:
USA: 2848211.1
Uganda: 2872446.0
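
The loop above pairs the two tables by row position (`WHO_standard.loc[i, ...]`), which works only while both list the age groups in the same order. A minimal sketch of an alternative is to join on the age-group label and take the weighted average in one step. The function name and the toy data frames below are hypothetical; the column names follow the ones used in this notebook. The sketch assumes the input rates are already per 100,000, so it applies no final scaling, and it normalises the weights in case the percentages do not sum to exactly 100.

```python
import pandas as pd

def age_standardized_rate(rates, standard, rate_col,
                          weight_col="WHO World Standard (%)",
                          key="Age group"):
    """Weighted average of age-specific rates, aligned on the age-group label."""
    merged = rates.merge(standard, on=key, how="left", validate="one_to_one")
    unmatched = merged.loc[merged[weight_col].isna(), key].tolist()
    if unmatched:
        raise ValueError(f"No standard weight for age groups: {unmatched}")
    weights = merged[weight_col] / merged[weight_col].sum()
    return float((merged[rate_col] * weights).sum())

# Hypothetical toy data; note the standard is listed in a different order.
rates_toy = pd.DataFrame({
    "Age group": ["0-4", "5-9", "10+"],
    "Death rate": [0.4, 0.2, 50.0],
})
standard_toy = pd.DataFrame({
    "Age group": ["10+", "0-4", "5-9"],
    "WHO World Standard (%)": [80.0, 10.0, 10.0],
})

result = age_standardized_rate(rates_toy, standard_toy, rate_col="Death rate")
print(round(result, 2))  # 40.06
```

If the same reasoning applies to the cell above, its weighted sums before the last scaling step were about 28.5 (USA) and 28.7 (Uganda), that is, the printed values divided by 100,000.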


In [None]:
# Are these results in ballpark of reality e.g. USA COPD rates should be easy to find
# or orders of magnitude wrong?
# Ran out of time...
# Did I need WPP source total population breaking down by age cohort?
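
On the last question: a crude rate built from age-specific rates probably does need population by age group, since each rate has to be weighted by the number of people it applies to. The sketch below shows one way this could look. Everything in it is an assumption for illustration: the helper names, the age-group labels of the population table, and the toy numbers. The real "5-year age groups, both sexes" file would need to be loaded and checked separately.

```python
import pandas as pd

def collapse_open_ended(pop_by_age, last_closed="80-84", open_label="85+"):
    """Sum every age group after `last_closed` into one open-ended group."""
    cut = pop_by_age.index.get_loc(last_closed) + 1
    collapsed = pop_by_age.iloc[:cut].copy()
    collapsed[open_label] = pop_by_age.iloc[cut:].sum()
    return collapsed

def crude_rate_per_100k(rates, pop_by_age):
    """Population-weighted average of age-specific rates (already per 100,000)."""
    pop = pop_by_age.reindex(rates.index)
    if pop.isna().any():
        raise ValueError("Population is missing for some age groups")
    return float((rates * pop).sum() / pop.sum())

# Hypothetical population in thousands, indexed by age group; only the
# oldest groups are shown to keep the example short.
pop_toy = pd.Series({
    "75-79": 300.0, "80-84": 200.0,
    "85-89": 60.0, "90-94": 30.0, "95-99": 8.0, "100+": 2.0,
})
# Hypothetical death rates per 100,000 with a final 85+ group.
rates_toy = pd.Series({"75-79": 300.0, "80-84": 500.0, "85+": 900.0})

pop_collapsed = collapse_open_ended(pop_toy)
crude_toy = crude_rate_per_100k(rates_toy, pop_collapsed)
print(pop_collapsed.to_dict())
print(round(crude_toy, 1))  # 466.7
```

The population unit (thousands here) cancels out in the ratio, so no conversion to single persons is needed for this step.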

In [31]:
# Earlier approach (from Gemini!)
ASDR_usa = 0
ASDR_uganda = 0
total_deaths_usa = 0
total_deaths_uganda = 0

for index, row in COPD.iterrows():
    age_group = row['Age group']
    death_rate_usa = row['Death rate, United States, 2019']
    death_rate_uganda = row['Death rate, Uganda, 2019']
    WHO_standard_percentage = WHO_standard.loc[WHO_standard['Age group'] == age_group, 'WHO World Standard (%)'].values[0]

    total_deaths_usa += death_rate_usa
    total_deaths_uganda += death_rate_uganda
    
    ASDR_usa += (death_rate_usa * WHO_standard_percentage / 100)
    ASDR_uganda += (death_rate_uganda * WHO_standard_percentage / 100)

# Calculate Crude Death Rate (CDR)
crude_death_rate_usa = (total_deaths_usa / population_usa) 
crude_death_rate_uganda = (total_deaths_uganda / population_uganda)

print("COPD Crude Death Rate for Uganda:", math.trunc(crude_death_rate_uganda))
print("COPD Crude Death Rate for USA:", math.trunc(crude_death_rate_usa))

print("Age-Standardized Death Rate for COPD in Uganda:", math.trunc(ASDR_uganda))
print("Age-Standardized Death Rate for COPD in USA:", math.trunc(ASDR_usa))


COPD Crude Death Rate for Uganda: 0
COPD Crude Death Rate for USA: 0
Age-Standardized Death Rate for COPD in Uganda: 28
Age-Standardized Death Rate for COPD in USA: 28


In [None]:
# No, those results are clearly wrong; go back to the results from the first formula, which at least don't feature any zeros!
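
One cheap way to compare the two sets of results is a range check: any weighted average of age-specific rates, crude or standardized, has to lie between the smallest and the largest age-specific rate. The sketch below applies that check using only the USA rates visible in the truncated output earlier in this notebook (the two oldest groups are missing, which can only widen the range). The function name is hypothetical, and passing the check does not prove a result is right; it only rules out impossible values.

```python
def is_plausible_weighted_average(value, age_specific_rates):
    """True if `value` lies within the range of the age-specific rates."""
    lowest = min(age_specific_rates)
    highest = max(age_specific_rates)
    return lowest <= value <= highest

# USA rates per 100,000 for ages 0-4 through 75-79, as printed above.
visible_usa_rates = [
    0.04, 0.02, 0.02, 0.02, 0.06, 0.11, 0.29, 0.56,
    1.42, 4.00, 14.13, 37.22, 66.48, 108.66, 213.10, 333.06,
]

for candidate in (2848211.1, 28.5):
    print(candidate, is_plausible_weighted_average(candidate, visible_usa_rates))
```

By this check a value near 28 is at least possible, whereas 2,848,211.1 per 100,000 is not, since it exceeds both the largest visible rate and 100,000 itself.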