# Process Description for Calculating Death Rates




### This report details the process used to calculate the crude and age-standardized death rates from Chronic Obstructive Pulmonary Disease (COPD) in Uganda and the United States for 2019.
Data Acquisition and Preprocessing:
-   Population Data:
    -   Obtained from the World Population Prospects database.
    -	Filtered to include only data for Uganda and the USA in 2019.
    -   Grouped by location and age group to obtain population totals for each age category.

-   COPD Death Rates:
    -	Obtained from a separate CSV file.
    -	Merged with the population data based on the corresponding age groups.

-   WHO Standard Population:
    -	Extracted from a PDF file using the "tabula" library.
    -	Processed to separate the standard population percentages for different regions into separate columns.
    -	Merged with the combined population and death rate data.


Death Rate Calculations:
-   Crude Death Rate (CDR):
    -	Calculated by dividing the total number of deaths from COPD by the total population and multiplying by 100,000.
-   Age-Standardized Death Rate (ASDR):
    -	Calculated by weighting the age-specific death rates by the corresponding WHO standard population and then summing these weighted values.
    -	The final result is divided by the total WHO standard population and multiplied by 100,000 to express the rate as deaths per 100,000 people.
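
The two calculations described above can be sketched on a small made-up example. The age groups, rates, populations, and weights below are hypothetical values chosen only for illustration, and the function names are not taken from the analysis. The sketch assumes the age-specific rates are already expressed per 100,000, so the standardized rate needs no further scaling.

```python
import pandas as pd

# Hypothetical toy data (not from the analysis): rates are deaths per 100,000
toy = pd.DataFrame({
    "age_group": ["0-39", "40-69", "70+"],
    "rate_per_100k": [1.0, 20.0, 300.0],
    "population": [800_000, 150_000, 50_000],
    "standard_weight": [60.0, 30.0, 10.0],  # percent of a standard population
})

def crude_rate(df):
    # Total deaths divided by total population, expressed per 100,000
    deaths = (df["rate_per_100k"] / 100_000 * df["population"]).sum()
    return deaths / df["population"].sum() * 100_000

def age_standardized_rate(df):
    # Weighted average of the age-specific rates using the standard weights;
    # the country's own population sizes do not enter this calculation
    weights = df["standard_weight"]
    return (df["rate_per_100k"] * weights).sum() / weights.sum()

toy_crude = crude_rate(toy)
toy_asdr = age_standardized_rate(toy)
print(round(toy_crude, 1), round(toy_asdr, 1))  # 18.8 36.6
```

In this toy case the population is younger than the standard, so the standardized rate (36.6) is higher than the crude rate (18.8).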

#### - Assumptions made
-   The age-specific death rates provided were accurate and represented the entire population of each country. 
-   The WHO world standard population reasonably represents the global age structure, allowing a fair comparison between countries with differing age distributions.

#### - Outcome of the Analysis
Uganda's crude death rate (5.8 deaths per 100,000) is lower than the USA's (56.9 deaths per 100,000), and its age-standardized rate as computed here (75.3 deaths per 100,000) is also far lower than the USA's (3410.6 deaths per 100,000). This suggests that age-specific mortality patterns differ between the two countries.

## Data Acquisition and Preprocessing

### Population Data: population estimates 1950-2021 for both the United States and Uganda from the UN World Population Prospects (2022)

- Load the population estimates data

In [1]:
# Load the population estimates data
import pandas as pd

#  path to the CSV file
csv_file_path = 'C:/Users/obypa/OWiDApplication/WPP2022_Population1JanuaryBySingleAgeSex_Medium_1950-2021.csv'

# Read the CSV file
df = pd.read_csv(csv_file_path, low_memory=False)

# Display the first few rows of the dataframe
print(df.head())


   SortOrder  LocID Notes ISO3_code ISO2_code  SDMX_code  LocTypeID  \
0          1    900   NaN       NaN       NaN        1.0          1   
1          1    900   NaN       NaN       NaN        1.0          1   
2          1    900   NaN       NaN       NaN        1.0          1   
3          1    900   NaN       NaN       NaN        1.0          1   
4          1    900   NaN       NaN       NaN        1.0          1   

  LocTypeName  ParentID Location  VarID Variant  Time  MidPeriod AgeGrp  \
0       World         0    World      2  Medium  1950       1950      0   
1       World         0    World      2  Medium  1950       1950      1   
2       World         0    World      2  Medium  1950       1950      2   
3       World         0    World      2  Medium  1950       1950      3   
4       World         0    World      2  Medium  1950       1950      4   

   AgeGrpStart  AgeGrpSpan    PopMale  PopFemale   PopTotal  
0            0           1  41312.322  39439.289  80751.611 

- Filter the dataset to get the data for the United States of America and Uganda
    - The following code keeps only rows for the year 2019 and for the two countries of interest, the United States and Uganda.

In [2]:
# Filter the data for the year 2019 and for the United States and Uganda
population= df[(df['Time'] == 2019) & (df['Location'].isin(['United States of America', 'Uganda']))]

# Display the first few rows of the filtered dataset
print(population.head())


        SortOrder  LocID Notes ISO3_code ISO2_code  SDMX_code  LocTypeID  \
297849         45    800   NaN       UGA        UG      800.0          4   
297850         45    800   NaN       UGA        UG      800.0          4   
297851         45    800   NaN       UGA        UG      800.0          4   
297852         45    800   NaN       UGA        UG      800.0          4   
297853         45    800   NaN       UGA        UG      800.0          4   

         LocTypeName  ParentID Location  VarID Variant  Time  MidPeriod  \
297849  Country/Area       910   Uganda      2  Medium  2019       2019   
297850  Country/Area       910   Uganda      2  Medium  2019       2019   
297851  Country/Area       910   Uganda      2  Medium  2019       2019   
297852  Country/Area       910   Uganda      2  Medium  2019       2019   
297853  Country/Area       910   Uganda      2  Medium  2019       2019   

       AgeGrp  AgeGrpStart  AgeGrpSpan  PopMale  PopFemale  PopTotal  
297849      0        

- Sum the total population by location and single-year age group

In [3]:
# Define a function to filter and summarize population data for a given location
def summarize_population_by_location(dataframe, location):
    return dataframe[dataframe['Location'] == location].groupby(['Location', 'AgeGrp'])['PopTotal'].sum().reset_index()

# Use the function to get the population summaries for Uganda and USA
uganda_population_summary = summarize_population_by_location(population, 'Uganda')
usa_population_summary = summarize_population_by_location(population, 'United States of America')

# Display the summaries
print("Uganda Population Summary:")
print(uganda_population_summary)

print("\nUSA Population Summary:")
print(usa_population_summary)


Uganda Population Summary:
    Location AgeGrp  PopTotal
0     Uganda      0  1535.099
1     Uganda      1  1481.227
2     Uganda     10  1218.088
3     Uganda   100+     0.085
4     Uganda     11  1193.915
..       ...    ...       ...
96    Uganda     95     0.204
97    Uganda     96     0.139
98    Uganda     97     0.094
99    Uganda     98     0.070
100   Uganda     99     0.054

[101 rows x 3 columns]

USA Population Summary:
                     Location AgeGrp  PopTotal
0    United States of America      0  3847.417
1    United States of America      1  3932.045
2    United States of America     10  4390.123
3    United States of America   100+    75.878
4    United States of America     11  4482.967
..                        ...    ...       ...
96   United States of America     95   196.333
97   United States of America     96   119.838
98   United States of America     97    90.328
99   United States of America     98    67.270
100  United States of America     99    45.762


- Transform the raw population data, which is broken down by single year of age, into a more manageable structure grouped by location and broader age categories.
    -   Mapping age groups: define a function that maps each single-year age (AgeGrp column) to a five-year age group (AgeCategory column), with all ages 85 and over combined into 85+.

In [4]:
def map_age_to_group(age):
    if age == '100+':
        return '85+'
    age = int(age)  
    if age < 5:
        return '0-4'
    elif age < 10:
        return '5-9'
    elif age < 15:
        return '10-14'
    elif age < 20:
        return '15-19'
    elif age < 25:
        return '20-24'
    elif age < 30:
        return '25-29'
    elif age < 35:
        return '30-34'
    elif age < 40:
        return '35-39'
    elif age < 45:
        return '40-44'
    elif age < 50:
        return '45-49'
    elif age < 55:
        return '50-54'
    elif age < 60:
        return '55-59'
    elif age < 65:
        return '60-64'
    elif age < 70:
        return '65-69'
    elif age < 75:
        return '70-74'
    elif age < 80:
        return '75-79'
    elif age < 85:
        return '80-84'
    else:
        return '85+'
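
As an alternative to the chain of comparisons above, the same five-year binning can be sketched with `pd.cut`. This is a minimal illustration on a handful of example ages, not the method used in the analysis. The variable names are hypothetical, and the open-ended `'100+'` label is assumed to be the only non-numeric value in the column.

```python
import pandas as pd

# Example ages in the same string form as the AgeGrp column
ages = pd.Series(["0", "4", "5", "84", "85", "100+"])

# Treat the open-ended '100+' label as age 100 so it can be binned numerically
age_years = pd.to_numeric(ages.replace("100+", "100"))

# Edges 0, 5, ..., 85, inf give left-closed intervals [0, 5), ..., [85, inf)
edges = list(range(0, 90, 5)) + [float("inf")]
labels = [f"{lo}-{lo + 4}" for lo in range(0, 85, 5)] + ["85+"]

age_category = pd.cut(age_years, bins=edges, labels=labels, right=False)
print(age_category.tolist())
```

Setting `right=False` makes each bin include its lower edge, so age 5 falls in `5-9` rather than `0-4`.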

-   Grouping and sorting population data: define a custom age-category order to ensure consistent sorting across different locations.

In [5]:
# Define a custom sort order for age categories
age_category_order = ['0-4', '5-9', '10-14', '15-19', '20-24', '25-29', '30-34', '35-39', '40-44', '45-49', '50-54', '55-59', '60-64', '65-69', '70-74', '75-79', '80-84', '85+']
# Convert age_category_order to a dict that maps each category to its index for sorting
age_category_sort_order = {key: index for index, key in enumerate(age_category_order)}

In [6]:
def process_population_data(df):
    # Map ages to age groups
    df['AgeCategory'] = df['AgeGrp'].apply(map_age_to_group)
    
    # Sum population totals by Location and AgeCategory
    grouped_df = df.groupby(['Location', 'AgeCategory'], as_index=False)['PopTotal'].sum()
    
    # Add a sorting key based on the custom order
    grouped_df['SortKey'] = grouped_df['AgeCategory'].map(age_category_sort_order)
    
    # Sort by this key and drop it
    sorted_grouped_df = grouped_df.sort_values(['Location', 'SortKey']).drop(columns=['SortKey'])
    
    return sorted_grouped_df

# Process the dataframes 
uganda_pop_data = process_population_data(uganda_population_summary)
usa_pop_data = process_population_data(usa_population_summary)

# Display the corrected result
print("Uganda Population Grouped Corrected Order:")
print(uganda_pop_data)
print("\nUSA Population Grouped Corrected Order:")
print(usa_pop_data)

Uganda Population Grouped Corrected Order:
   Location AgeCategory  PopTotal
0    Uganda         0-4  7244.221
9    Uganda         5-9  6551.165
1    Uganda       10-14  5830.086
2    Uganda       15-19  5067.026
3    Uganda       20-24  4264.126
4    Uganda       25-29  3404.442
5    Uganda       30-34  2532.182
6    Uganda       35-39  1851.236
7    Uganda       40-44  1475.060
8    Uganda       45-49  1213.931
10   Uganda       50-54   928.466
11   Uganda       55-59   670.962
12   Uganda       60-64   489.872
13   Uganda       65-69   346.978
14   Uganda       70-74   187.581
15   Uganda       75-79    91.499
16   Uganda       80-84    43.954
17   Uganda         85+    19.510

USA Population Grouped Corrected Order:
                    Location AgeCategory   PopTotal
0   United States of America         0-4  19960.257
9   United States of America         5-9  20732.962
1   United States of America       10-14  22052.246
2   United States of America       15-19  21839.549
3   United
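
The custom ordering above relies on a temporary sort key. A comparable result can be sketched with an ordered categorical column, which carries the order with the data. The small frame and the shortened category list below are hypothetical and serve only to show the mechanism.

```python
import pandas as pd

# Hypothetical, deliberately shuffled rows
demo = pd.DataFrame({
    "AgeCategory": ["5-9", "85+", "0-4", "10-14"],
    "PopTotal": [2.0, 4.0, 1.0, 3.0],
})
order = ["0-4", "5-9", "10-14", "85+"]

# An ordered categorical sorts by the position of each label in `order`,
# not alphabetically
demo["AgeCategory"] = pd.Categorical(demo["AgeCategory"], categories=order, ordered=True)
demo_sorted = demo.sort_values("AgeCategory").reset_index(drop=True)
print(demo_sorted)
```

A plain string sort would place `10-14` before `5-9`, which is why some explicit ordering is needed in either approach.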

### Load WHO Standard Population: Table 1 in Ahmad OB, Boschi-Pinto C, Lopez AD, Murray CJ, Lozano R, Inoue M (2001). Age standardization of rates: a new WHO standard.

- Extract the WHO standard population data from a PDF, clean and transform it into a usable format for further analysis.

In [7]:
import tabula

# Path to the PDF file
file_path = "C:/Users/obypa/OWiDApplication/AgeStandardizationofRates-ANewWHOStandard.pdf"

# extract tables into a list of DataFrame objects
dfs = tabula.read_pdf(file_path, pages="11", multiple_tables=True)

# print the list of tables found in the PDF, each one as a separate DataFrame
for table in dfs:
    print(table)


Error importing jpype dependencies. Fallback to subprocess.
No module named 'jpype'


                                           Unnamed: 0  \
0                                           Age group   
1                                                 0-4   
2                                                 5-9   
3                                               10-14   
4                                               15-19   
5                                               20-24   
6                                               25-29   
7                                               30-34   
8                                               35-39   
9                                               40-44   
10                                              45-49   
11                                              50-54   
12                                              55-59   
13                                              60-64   
14                                              65-69   
15                                              70-74   
16                             



- Select Desired Table

In [8]:
# The table needed is the first one on the page
data = dfs[0] if dfs else None
print(data.head())

  Unnamed: 0 Table 1. Standard Population Distribution (percent)
0  Age group  Segi (“world”) standard  Scandinavian (“Europe... 
1        0-4                                    12.00 8.00 8.86 
2        5-9                                    10.00 7.00 8.69 
3      10-14                                     9.00 7.00 8.60 
4      15-19                                     9.00 7.00 8.47 


- Preprocess the data: drop the first row (a leftover header), split the combined column into one column per standard population, and drop trailing rows that are not needed.

In [9]:
# First, we remove the header row since it complicates splitting.
whostandard_pop = data.iloc[1:].copy()

# Splitting the 'Table 1. Standard Population Distribution (percent)' column into three new columns
whostandard_pop[['Segi Standard', 'Scandinavian Standard', 'WHO World Standard']] = whostandard_pop['Table 1. Standard Population Distribution (percent)'].str.split(expand=True)
whostandard_pop['Age group'] = whostandard_pop['Unnamed: 0']

# Drop the original combined column and the 'Unnamed: 0' column as they are no longer needed
whostandard_pop = whostandard_pop.drop(columns=['Table 1. Standard Population Distribution (percent)', 'Unnamed: 0'])

# Reorder the DataFrame to have 'Age group' as the first column
whostandard_pop = whostandard_pop[['Age group', 'Segi Standard', 'Scandinavian Standard', 'WHO World Standard']]

# Reset index
whostandard_pop.reset_index(drop=True, inplace=True)

# Remove the last two rows which are not needed 
whostandard_pop = whostandard_pop.iloc[:-2]

print(whostandard_pop.head())

  Age group Segi Standard Scandinavian Standard WHO World Standard
0       0-4         12.00                  8.00               8.86
1       5-9         10.00                  7.00               8.69
2     10-14          9.00                  7.00               8.60
3     15-19          9.00                  7.00               8.47
4     20-24          8.00                  7.00               8.22


- Convert the "WHO World Standard" column to numerical data type (float)

In [10]:
# Convert the WHO World Standard column from strings to floats
whostandard_pop["WHO World Standard"] = whostandard_pop["WHO World Standard"].astype(float)
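
Values extracted from a PDF are easy to mis-parse, so a quick check after the conversion can be useful. The sketch below uses only the first five WHO World Standard percentages printed above, kept as strings. The variable names are hypothetical and this check is not part of the original analysis.

```python
import pandas as pd

# First five WHO World Standard percentages as printed above, still as strings
weights = pd.Series(["8.86", "8.69", "8.60", "8.47", "8.22"], name="WHO World Standard")

# errors="coerce" turns any unparseable cell into NaN instead of raising
numeric = pd.to_numeric(weights, errors="coerce")

# Fail early if any cell could not be parsed
assert numeric.notna().all(), "some weights failed to parse"

partial_sum = numeric.sum()
print(round(partial_sum, 2))  # 42.84
```

Applied to the full column, the same sum would be expected to come out close to 100, since the table is a percentage distribution. A noticeably smaller total would suggest that some age groups were dropped along the way.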

### Load Table of age-specific death rates of COPD:

-   Load the COPD death rate data, then rename the columns of each DataFrame for clarity in the later analysis.

In [11]:
#  path to the CSV file
copd_file_path = 'C:/Users/obypa/OWiDApplication/age specific death rates of COPD.csv'
# Read the CSV file
copd_rates = pd.read_csv(copd_file_path, low_memory=False)

# Display the first few rows of the dataframe
print(copd_rates.head())

  Age group (years)  Death rate, United States, 2019  Death rate, Uganda, 2019
0               0-4                             0.04                      0.40
1               5-9                             0.02                      0.17
2             10-14                             0.02                      0.07
3             15-19                             0.02                      0.23
4             20-24                             0.06                      0.38


-   Rename the columns in the DataFrames to Python-style snake_case names

In [12]:
copd_rates.columns = ['age_group', 'death_rate_us_2019', 'death_rate_uganda_2019']
usa_pop_data.columns = ['location', 'age_group', 'population_total']
uganda_pop_data.columns = ['location', 'age_group', 'population_total']
whostandard_pop.columns = ['age_group', 'segi_standard', 'scandinavian_standard', 'who_world_standard']

### Merge the three tables to get a single data frame for the analysis

- Merge the necessary data for each country

In [13]:
# Merge population data with COPD death rates and WHO standard population for the USA
usa_merged_df = pd.merge(pd.merge(usa_pop_data, copd_rates.drop(columns="death_rate_uganda_2019"), on="age_group"), whostandard_pop, on="age_group")

# Merge population data with COPD death rates and WHO standard population for Uganda
uganda_merged_df = pd.merge(pd.merge(uganda_pop_data, copd_rates.drop(columns="death_rate_us_2019"), on="age_group"), whostandard_pop, on="age_group")
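
Because `pd.merge` performs an inner join by default, any age group whose label differs between tables is dropped without warning. The sketch below shows one way to detect this, using hypothetical frames in which the oldest group is labelled `85+` in one table and `85-89` in the other. Whether such a mismatch occurs in this analysis depends on the labels in the extracted WHO table, which are not fully shown above.

```python
import pandas as pd

# Hypothetical frames whose oldest age group is labelled differently
pop = pd.DataFrame({
    "age_group": ["75-79", "80-84", "85+"],
    "population_total": [3.0, 2.0, 1.0],
})
std = pd.DataFrame({
    "age_group": ["75-79", "80-84", "85-89"],
    "who_world_standard": [1.5, 0.9, 0.4],  # made-up weights
})

# A left join with indicator=True keeps every population row and records
# whether a matching row was found in the standard table;
# validate="one_to_one" raises if an age group appears more than once
checked = pop.merge(std, on="age_group", how="left", indicator=True, validate="one_to_one")
unmatched = checked.loc[checked["_merge"] == "left_only", "age_group"].tolist()
print(unmatched)  # ['85+']
```

An empty `unmatched` list would indicate that every population age group found a weight.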


### Calculate the number of deaths

- Based on death rates and population totals, get the number of deaths

In [14]:
# calculate the number of age specific deaths for the uganda population
uganda_merged_df['number_of_deaths'] = uganda_merged_df.apply(
    lambda row: round(row['death_rate_uganda_2019'] * row['population_total'], 2), axis=1
)

# calculate the number of age specific deaths for the USA population
usa_merged_df['number_of_deaths'] = usa_merged_df.apply(
    lambda row: round(row['death_rate_us_2019'] * row['population_total'], 2), axis=1
)
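
The units matter when turning a rate into a death count. The sketch below works through one row, using the Uganda 0-4 values printed above. It assumes the rate is expressed per 100,000 people and relies on World Population Prospects reporting `PopTotal` in thousands. The variable names are hypothetical.

```python
# Values for Uganda, age group 0-4, as printed in the outputs above
rate_per_100k = 0.40             # assumed to be deaths per 100,000 people
population_thousands = 7244.221  # WPP reports PopTotal in thousands

# deaths = rate / 100,000 * (population in thousands * 1,000)
deaths = rate_per_100k / 100_000 * population_thousands * 1_000
print(round(deaths, 1))  # 29.0
```

Under these assumptions, the `number_of_deaths` column computed above (rate times population, without this scaling) is 100 times the actual count. The crude rate is unaffected, because dividing the summed column by the summed population gives a population-weighted average of the rates in their original units.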

## Death Rate Calculations

- Compute both crude and age-standardized death rates

In [15]:


def calculate_death_rates(df, death_rate_column):
    """
    Calculates both crude and age-standardized death rates for a given dataframe.
  
    Args:
        df (pandas.DataFrame): The dataframe containing population and death rate data.
        death_rate_column (str): The name of the column containing death rates for the specific country.
  
    Returns:
        dict: A dictionary containing the crude death rate and age-standardized death rate.
    """
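    # Calculate crude death rate.
    # number_of_deaths is rate * population, so this ratio is the
    # population-weighted mean of the age-specific rates (same units as the rates).
    total_deaths = df['number_of_deaths'].sum()
    total_population = df['population_total'].sum()
    crude_death_rate = total_deaths / total_population

    # Calculate age-standardized death rate.
    # Note: each term below is also multiplied by population_total, so the result
    # scales with the size of each age group rather than depending only on the
    # age-specific rates and the WHO weights.
    df['standardized_deaths'] = df[death_rate_column] / 100000 * df['population_total'] * df['who_world_standard'] / 100
    age_standardized_rate = df['standardized_deaths'].sum() / df['who_world_standard'].sum() * 100000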
  
    return {
        "crude_death_rate": round(crude_death_rate, 1),
        "age_standardized_death_rate": round(age_standardized_rate, 1)
    }

# Example usage:
uganda_death_rates = calculate_death_rates(uganda_merged_df.copy(), 'death_rate_uganda_2019')
usa_death_rates = calculate_death_rates(usa_merged_df.copy(), 'death_rate_us_2019')

# Print results
print("Uganda:")
print(f"Crude death rate: {uganda_death_rates['crude_death_rate']:.1f} deaths per 100,000")
print(f"Age-standardized death rate: {uganda_death_rates['age_standardized_death_rate']:.1f} deaths per 100,000")

print("\nUSA:")
print(f"Crude death rate: {usa_death_rates['crude_death_rate']:.1f} deaths per 100,000")
print(f"Age-standardized death rate: {usa_death_rates['age_standardized_death_rate']:.1f} deaths per 100,000")


Uganda:
Crude death rate: 5.8 deaths per 100,000
Age-standardized death rate: 75.3 deaths per 100,000

USA:
Crude death rate: 56.9 deaths per 100,000
Age-standardized death rate: 3410.6 deaths per 100,000
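
One simple sanity check for a directly standardized rate is that it must lie between the lowest and highest age-specific rates, because it is a weighted average with non-negative weights. The sketch below illustrates that check. The helper name and the example rates are hypothetical and not taken from the analysis.

```python
def within_rate_bounds(age_specific_rates, standardized_rate):
    # A weighted average with non-negative weights cannot fall outside
    # the range of the values being averaged
    return min(age_specific_rates) <= standardized_rate <= max(age_specific_rates)

# Hypothetical age-specific rates per 100,000 (not from the analysis)
example_rates = [0.5, 12.0, 250.0]

inside = within_rate_bounds(example_rates, 40.0)
outside = within_rate_bounds(example_rates, 3000.0)
print(inside, outside)  # True False
```

A standardized rate outside this range would suggest that an extra factor, such as population size, or a unit mismatch has entered the calculation.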
