# Step 2: Data Cleaning & Preparation

## Objective
This notebook focuses on preparing the global mortality dataset for analysis
and modeling. Tasks include verifying data consistency, standardizing
categorical values, checking for duplicates, and saving a clean version
of the dataset for downstream use.

In [1]:
import pandas as pd
import numpy as np

# Load the raw synthetic mortality dataset
df = pd.read_csv("../data/raw/deaths_and_causes_synthetic.csv")
df.head()

   Year  Country  Gender  Age_Group       Cause_of_Death  Number_of_Deaths  Mortality_Rate_per_1000
0  2022    India    Male       0-14    Natural Disasters             73272                     2.79
1  2025  Germany  Female       0-14             Homicide            421169                     0.71
2  2020    Japan  Female      30-44  Infectious Diseases            103315                     0.75
3  2023  Germany    Male      30-44              Suicide            220423                    11.86
4  2015  Nigeria  Female      15-29               Stroke            157810                     9.50


In [2]:
# Checking missing values
df.isna().sum()

Year                       0
Country                    0
Gender                     0
Age_Group                  0
Cause_of_Death             0
Number_of_Deaths           0
Mortality_Rate_per_1000    0
dtype: int64
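
A column with no missing values can still hold implausible numbers, so a range check is a natural companion to the null count. The sketch below is illustrative only: the `range_violations` helper and its three rules are assumptions rather than part of this notebook, and the sample values are made up.

```python
import pandas as pd

def range_violations(frame: pd.DataFrame) -> dict:
    """Count rows that break simple, assumed sanity rules."""
    return {
        "negative_deaths": int((frame["Number_of_Deaths"] < 0).sum()),
        "negative_rate": int((frame["Mortality_Rate_per_1000"] < 0).sum()),
        # A rate per 1,000 cannot exceed 1,000.
        "rate_above_1000": int((frame["Mortality_Rate_per_1000"] > 1000).sum()),
    }

# Made-up rows: one negative count and one impossible rate.
sample = pd.DataFrame({
    "Number_of_Deaths": [73272, -5, 157810],
    "Mortality_Rate_per_1000": [2.79, 0.71, 1200.0],
})
print(range_violations(sample))
```

Calling `range_violations(df)` on the loaded data would report how many rows, if any, fall outside these bounds.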

In [5]:
# Checking for duplicates
df.duplicated().sum()

0
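
`df.duplicated()` only flags rows that match in every column. Two rows describing the same year, country, gender, age group, and cause but with different counts would pass that check. A minimal sketch of a key-level check follows. The choice of `KEY_COLS` as the identifying columns is an assumption about this dataset, and the sample rows are made up.

```python
import pandas as pd

# Assumed key: the columns taken to identify one record.
KEY_COLS = ["Year", "Country", "Gender", "Age_Group", "Cause_of_Death"]

def count_key_duplicates(frame: pd.DataFrame, key_cols=KEY_COLS) -> int:
    """Count rows that repeat an earlier row's key combination."""
    return int(frame.duplicated(subset=key_cols).sum())

# Made-up rows: the first two share a key but differ in Number_of_Deaths.
sample = pd.DataFrame({
    "Year": [2020, 2020, 2021],
    "Country": ["Japan", "Japan", "India"],
    "Gender": ["Female", "Female", "Male"],
    "Age_Group": ["30-44", "30-44", "0-14"],
    "Cause_of_Death": ["Stroke", "Stroke", "Suicide"],
    "Number_of_Deaths": [100, 250, 300],
})

print(sample.duplicated().sum())     # full-row duplicates
print(count_key_duplicates(sample))  # key-level duplicates
```

In a synthetic dataset, repeated keys may be expected, so a non-zero count is a prompt to inspect the rows rather than an error by itself.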

In [6]:
# Standardize categorical text columns by stripping leading/trailing whitespace

text_cols = ['Country', 'Gender', 'Age_Group', 'Cause_of_Death']

for col in text_cols:
    df[col] = df[col].str.strip()
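
`str.strip()` removes whitespace at the ends of a string but leaves repeated spaces inside it untouched. If that matters, a slightly broader normalization could look like the sketch below. The `normalize_text` helper is hypothetical and the sample strings are made up.

```python
import pandas as pd

def normalize_text(series: pd.Series) -> pd.Series:
    """Trim both ends, then collapse internal whitespace runs to one space."""
    return series.str.strip().str.replace(r"\s+", " ", regex=True)

# Made-up values, including a missing entry, which stays missing.
sample = pd.Series(["  Infectious   Diseases ", "Stroke", None])
cleaned = normalize_text(sample)
print(cleaned.tolist())
```

It could be applied in the same loop, for example `df[col] = normalize_text(df[col])`.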

In [7]:
for col in ['Gender', 'Age_Group']:
    print(f"\n{col} unique values:")
    print(df[col].unique())


Gender unique values:
['Male' 'Female']

Age_Group unique values:
['0-14' '30-44' '15-29' '60+' '45-59']
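
Reading the unique values by eye works for two columns, but the same check can be automated so that a stray label such as a lowercase `female` is caught on later runs. The following is a sketch: the expected sets are copied from the output above, and the `unexpected_values` helper is hypothetical.

```python
import pandas as pd

# Expected labels, copied from the unique-value output above.
EXPECTED = {
    "Gender": {"Male", "Female"},
    "Age_Group": {"0-14", "15-29", "30-44", "45-59", "60+"},
}

def unexpected_values(frame: pd.DataFrame, expected=EXPECTED) -> dict:
    """Return {column: sorted values that fall outside the expected set}."""
    problems = {}
    for col, allowed in expected.items():
        extra = set(frame[col].dropna().unique()) - allowed
        if extra:
            problems[col] = sorted(extra)
    return problems

# Made-up rows containing one inconsistent label.
sample = pd.DataFrame({
    "Gender": ["Male", "female"],
    "Age_Group": ["0-14", "60+"],
})
print(unexpected_values(sample))
```

An empty dictionary from `unexpected_values(df)` would mean every label matches the expected sets.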


In [8]:
# Sort chronologically by year and reset the index

df = df.sort_values(by="Year").reset_index(drop=True)
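
Sorting on `Year` alone leaves the order of rows within the same year unspecified, because the default `quicksort` is not guaranteed to be stable. If a reproducible row order matters for the saved file, adding tie-break columns is one option. This is a sketch with made-up rows, and the choice of `Country` as the tie-breaker is an assumption.

```python
import pandas as pd

# Made-up rows: two share the same year.
sample = pd.DataFrame({
    "Year": [2021, 2020, 2020],
    "Country": ["India", "Japan", "Germany"],
    "Number_of_Deaths": [300, 100, 200],
})

# Tie-break on Country so rows within a year always land in the same order.
ordered = sample.sort_values(by=["Year", "Country"]).reset_index(drop=True)
print(ordered)
```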

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 260 entries, 0 to 259
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Year                     260 non-null    int64  
 1   Country                  260 non-null    object 
 2   Gender                   260 non-null    object 
 3   Age_Group                260 non-null    object 
 4   Cause_of_Death           260 non-null    object 
 5   Number_of_Deaths         260 non-null    int64  
 6   Mortality_Rate_per_1000  260 non-null    float64
dtypes: float64(1), int64(2), object(4)
memory usage: 14.3+ KB
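
The four text columns are stored as generic `object` values. For in-memory analysis they could optionally be converted to pandas categoricals, which also lets `Age_Group` sort in age order rather than alphabetically. This is a sketch on made-up rows, and the `age_order` list is an assumption based on the labels seen earlier.

```python
import pandas as pd

# Made-up rows for illustration.
sample = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Age_Group": ["60+", "0-14", "30-44"],
})

# Assumed ordering of the age bands, youngest to oldest.
age_order = ["0-14", "15-29", "30-44", "45-59", "60+"]

sample["Gender"] = sample["Gender"].astype("category")
sample["Age_Group"] = pd.Categorical(
    sample["Age_Group"], categories=age_order, ordered=True
)

print(sample.dtypes)
print(sample.sort_values("Age_Group")["Age_Group"].tolist())
```

CSV files do not store dtypes, so a conversion like this would need to be reapplied after reloading the saved file.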


In [10]:
df.to_csv("../data/processed/cleaned_global_mortality_data.csv", index=False)
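
`to_csv` raises an error if the target directory does not exist, and a quick reload is a cheap way to confirm the file was written as intended. The sketch below shows both ideas on a made-up frame, using a temporary directory and a hypothetical file name so that it does not touch the project folders.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up frame standing in for the cleaned data.
sample = pd.DataFrame({"Year": [2020, 2021], "Country": ["Japan", "India"]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "processed" / "cleaned_sample.csv"
    # Create the output folder first if it is missing.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    sample.to_csv(out_path, index=False)
    # Reload to confirm the round trip preserves columns and values.
    reloaded = pd.read_csv(out_path)

print(list(reloaded.columns))
print(reloaded.to_dict("list") == sample.to_dict("list"))
```

Writing with `index=False`, as the cell above does, keeps the row index out of the file, so no extra `Unnamed: 0` column appears on reload.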

## Step 2 Summary

- Verified absence of missing values and duplicates
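- Stripped leading and trailing whitespace from the categorical text columns
- Reviewed the unique values of `Gender` and `Age_Group` and the column data types reported by `df.info()`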
- Sorted data chronologically by year
- Saved a clean, processed dataset for downstream analysis
