# Project 2 Part 1
## Kaylah Benton 003049986

### 1. Dataset Information 
**Title**: A Socioeconomic Analysis of Life Expectancy Across Nations <br>
**Source**: https://www.kaggle.com/datasets/mjshri23/life-expectancy-and-socio-economic-world-bank?resource=download <br>
**Real-world Context**: Life expectancy at birth is a standard public-health metric that summarizes prevailing mortality patterns. This dataset pairs life expectancy with socioeconomic and health-related indicators (undernourishment, CO2 emissions, health/education expenditure, unemployment, corruption perception, sanitation, and disease burden). The goal is to explore how these factors associate with life expectancy across countries and over time. <br>
**Short Summary**: This notebook performs an initial exploration of the life_expectancy.csv dataset. We will load the data, inspect structure and types, summarize distributions, detect missing/duplicate values, create a derived column, and extract interesting subsets to prepare for Part 2 visualization and modeling.

In [1]:
# Import the Libraries
import pandas as pd
import numpy as np

### 2. Basic Data Exploration

In [2]:
# Load the dataset
df = pd.read_csv("life_expectancy.csv")
df

Unnamed: 0,Country Name,Country Code,Region,IncomeGroup,Year,Life Expectancy World Bank,Prevelance of Undernourishment,CO2,Health Expenditure %,Education Expenditure %,Unemployment,Corruption,Sanitation,Injuries,Communicable,NonCommunicable
0,Afghanistan,AFG,South Asia,Low income,2001,56.308,47.8,730.000000,,,10.809000,,,2179727.10,9689193.70,5795426.38
1,Angola,AGO,Sub-Saharan Africa,Lower middle income,2001,47.059,67.5,15960.000000,4.483516,,4.004000,,,1392080.71,11190210.53,2663516.34
2,Albania,ALB,Europe & Central Asia,Upper middle income,2001,74.288,4.9,3230.000000,7.139524,3.45870,18.575001,,40.520895,117081.67,140894.78,532324.75
3,Andorra,AND,Europe & Central Asia,High income,2001,,,520.000000,5.865939,,,,21.788660,1697.99,695.56,13636.64
4,United Arab Emirates,ARE,Middle East & North Africa,High income,2001,74.544,2.8,97200.000000,2.484370,,2.493000,,,144678.14,65271.91,481740.70
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3301,Vanuatu,VUT,East Asia & Pacific,Lower middle income,2019,70.474,12.4,209.999993,3.360347,1.77788,1.801000,3.0,,12484.18,26032.56,69213.56
3302,Samoa,WSM,East Asia & Pacific,Lower middle income,2019,73.321,4.4,300.000012,6.363094,4.70625,8.406000,4.0,47.698788,6652.84,9095.19,43798.62
3303,South Africa,ZAF,Sub-Saharan Africa,Upper middle income,2019,64.131,6.3,439640.014648,9.109355,5.91771,28.469999,,,3174676.10,13198944.71,10214261.89
3304,Zambia,ZMB,Sub-Saharan Africa,Low income,2019,63.886,,6800.000191,5.312203,4.46518,12.520000,2.5,,510982.75,4837094.00,2649687.82


In [3]:
print("First 5 rows of the dataset:\n")
print(df.head())

print("\nLast 5 rows of the dataset:\n")
print(df.tail())

print("\nSize of the dataset (rows, columns):\n")
print(df.shape)

First 5 rows of the dataset:

           Country Name Country Code                      Region  \
0           Afghanistan          AFG                  South Asia   
1                Angola          AGO          Sub-Saharan Africa   
2               Albania          ALB       Europe & Central Asia   
3               Andorra          AND       Europe & Central Asia   
4  United Arab Emirates          ARE  Middle East & North Africa   

           IncomeGroup  Year  Life Expectancy World Bank  \
0           Low income  2001                      56.308   
1  Lower middle income  2001                      47.059   
2  Upper middle income  2001                      74.288   
3          High income  2001                         NaN   
4          High income  2001                      74.544   

   Prevelance of Undernourishment      CO2  Health Expenditure %  \
0                            47.8    730.0                   NaN   
1                            67.5  15960.0              4.483516

The dataset contains 3,306 rows and 16 columns, covering countries, years, and health-related indicators. It includes life expectancy, socioeconomic factors, healthcare and education expenditures, environmental measures, and disease burden. The identifier columns (`Country Name`, `Country Code`, `Region`, `IncomeGroup`, `Year`) and the three disease-burden columns (`Injuries`, `Communicable`, `NonCommunicable`) are complete. The remaining columns (`Life Expectancy World Bank`, `Prevelance of Undernourishment`, `CO2`, `Health Expenditure %`, `Education Expenditure %`, `Unemployment`, `Corruption`, and `Sanitation`) have varying amounts of missing values, which must be handled before any analysis can be considered reliable.

### 3. Data Types and Structures

This section identifies the types of data in each column — such as numerical (for measurable values like CO₂ or life expectancy) and categorical (for labels like region or income group). Recognizing data types helps determine what statistical methods and visualizations can be applied.

In [4]:
print("\nSummary of the dataset:\n")
df.info()

print("\nSum of null values per column:\n")
print(df.isnull().sum())

print("\nData types of each column:\n")
print(df.dtypes)


Summary of the dataset:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3306 entries, 0 to 3305
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Country Name                    3306 non-null   object 
 1   Country Code                    3306 non-null   object 
 2   Region                          3306 non-null   object 
 3   IncomeGroup                     3306 non-null   object 
 4   Year                            3306 non-null   int64  
 5   Life Expectancy World Bank      3118 non-null   float64
 6   Prevelance of Undernourishment  2622 non-null   float64
 7   CO2                             3154 non-null   float64
 8   Health Expenditure %            3126 non-null   float64
 9   Education Expenditure %         2216 non-null   float64
 10  Unemployment                    3002 non-null   float64
 11  Corruption                      975 non-null    float64
 12  Sanitati

### **Identifying Numerical vs. Categorical Columns**

In this dataset, the columns can be divided into **numerical** and **categorical (object)** types based on the kind of information they represent.


#### **Numerical Columns**
These contain measurable, quantitative values that can be used for statistical analysis. Examples include:

- `Year`  
- `Life Expectancy World Bank`  
- `Prevelance of Undernourishment` (spelled this way in the dataset)  
- `CO2`  
- `Health Expenditure %`  
- `Education Expenditure %`  
- `Unemployment`  
- `Corruption`  
- `Sanitation`  
- `Injuries`  
- `Communicable`  
- `NonCommunicable`


#### **Categorical (Object) Columns**
These represent descriptive or group-based information, often used for classification or grouping in analysis. Examples include:

- `Country Name`  
- `Country Code`  
- `Region`  
- `IncomeGroup`


#### **Explanation**
Categorical columns help identify the country or classification (region, income level), while numerical columns provide measurable indicators that can be summarized or correlated. Recognizing these types early is essential for selecting the right statistical and visualization methods in later analysis steps.
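
The split described above can also be derived from the dtypes instead of being listed by hand. The following is a minimal sketch on a hypothetical three-row sample (the `sample` frame and both list names are illustrative, not part of the notebook); it assumes that every non-numeric column should be treated as categorical.

```python
import pandas as pd

# Hypothetical three-row sample that mirrors a few of the dataset's columns.
sample = pd.DataFrame({
    "Country Name": ["Afghanistan", "Angola", "Albania"],
    "Region": ["South Asia", "Sub-Saharan Africa", "Europe & Central Asia"],
    "Year": [2001, 2001, 2001],
    "Life Expectancy World Bank": [56.308, 47.059, 74.288],
})

# Columns with a numeric dtype (int64, float64, ...).
numerical_cols = sample.select_dtypes(include="number").columns.tolist()

# Everything else is treated as categorical in this sketch.
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numerical_cols)
print("Categorical:", categorical_cols)
```

Note that `Year` lands in the numerical list because it is stored as `int64`, even though it is often used as a grouping variable.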


### 4. Descriptive Summary

This step summarizes both **categorical** and **numerical** columns to better understand the spread, range, and general characteristics of the data.  
The `.describe()` method gives key statistics (count, mean, standard deviation, minimum, maximum, and quartiles at 25%, 50%, and 75%) for numerical columns. With `include='all'`, it also reports the number of unique values, the most frequent value (`top`), and that value's frequency (`freq`) for categorical columns.

**Key observations:**
- The dataset covers **3,306 records** across **174 countries**, between **2001 and 2019**.
- **Average life expectancy** is about **69.7 years**, with a range from **40.37 to 84.36**.
- **Undernourishment** averages **10.66%** and reaches **70.9%**, suggesting strong inequality across regions. The minimum and the 25th percentile are both **2.5**, which suggests that low values are reported at a floor of 2.5.
- **CO₂ emissions** vary greatly, from **10** to over **10 million**, indicating large disparities between industrialized and developing countries.
- **Health expenditure** averages **6.36%** of GDP, while **education expenditure** averages **4.59%**.
- **Unemployment** averages **7.89%**, but goes as high as **37.25%** in some nations.
- The **Corruption** index averages **2.86** (likely on a 1–5 scale; observed values run from 1.0 to 4.5), with moderate variation (std = 0.62).
- **Sanitation access** ranges from **2.38% to 100%**, showing global inequality in infrastructure.
- Disease-burden variables (*Injuries, Communicable, NonCommunicable*) have standard deviations several times their means, which likely reflects differing population sizes or reporting metrics across countries.

Overall, this summary highlights **significant variation between countries** in economic, health, and environmental factors, with some missing data concentrated in *Corruption*, *Education Expenditure %*, and *Sanitation* columns.

In [5]:
desc = df.describe(include="all")
display(desc)

print("\nCategorical Summary:\n")
display(df.select_dtypes(include='object').describe())

print("\nNumerical Summary:\n")
display(df.select_dtypes(include=['float64', 'int64']).describe())

Unnamed: 0,Country Name,Country Code,Region,IncomeGroup,Year,Life Expectancy World Bank,Prevelance of Undernourishment,CO2,Health Expenditure %,Education Expenditure %,Unemployment,Corruption,Sanitation,Injuries,Communicable,NonCommunicable
count,3306,3306,3306,3306,3306.0,3118.0,2622.0,3154.0,3126.0,2216.0,3002.0,975.0,2059.0,3306.0,3306.0,3306.0
unique,174,174,7,4,,,,,,,,,,,,
top,Afghanistan,AFG,Europe & Central Asia,High income,,,,,,,,,,,,
freq,19,19,893,1083,,,,,,,,,,,,
mean,,,,,2010.0,69.748362,10.663654,157492.4,6.364059,4.589014,7.89076,2.860513,52.738785,1318219.0,4686289.0,7392488.0
std,,,,,5.478054,9.408154,11.285897,772641.5,2.842844,2.119165,6.270832,0.621343,30.126762,5214068.0,18437270.0,29326880.0
min,,,,,2001.0,40.369,2.5,10.0,1.263576,0.85032,0.1,1.0,2.377647,430.49,330.16,2481.82
25%,,,,,2005.0,63.642,2.5,2002.5,4.205443,3.136118,3.733,2.5,24.746007,62456.88,57764.75,318475.8
50%,,,,,2010.0,72.1685,6.2,10205.0,5.892352,4.371465,5.92,3.0,49.317481,245691.0,314769.3,1350146.0
75%,,,,,2015.0,76.809,14.775,58772.5,8.119166,5.519825,10.0975,3.25,80.278847,846559.1,2831636.0,3918468.0



Categorical Summary:



Unnamed: 0,Country Name,Country Code,Region,IncomeGroup
count,3306,3306,3306,3306
unique,174,174,7,4
top,Afghanistan,AFG,Europe & Central Asia,High income
freq,19,19,893,1083



Numerical Summary:



Unnamed: 0,Year,Life Expectancy World Bank,Prevelance of Undernourishment,CO2,Health Expenditure %,Education Expenditure %,Unemployment,Corruption,Sanitation,Injuries,Communicable,NonCommunicable
count,3306.0,3118.0,2622.0,3154.0,3126.0,2216.0,3002.0,975.0,2059.0,3306.0,3306.0,3306.0
mean,2010.0,69.748362,10.663654,157492.4,6.364059,4.589014,7.89076,2.860513,52.738785,1318219.0,4686289.0,7392488.0
std,5.478054,9.408154,11.285897,772641.5,2.842844,2.119165,6.270832,0.621343,30.126762,5214068.0,18437270.0,29326880.0
min,2001.0,40.369,2.5,10.0,1.263576,0.85032,0.1,1.0,2.377647,430.49,330.16,2481.82
25%,2005.0,63.642,2.5,2002.5,4.205443,3.136118,3.733,2.5,24.746007,62456.88,57764.75,318475.8
50%,2010.0,72.1685,6.2,10205.0,5.892352,4.371465,5.92,3.0,49.317481,245691.0,314769.3,1350146.0
75%,2015.0,76.809,14.775,58772.5,8.119166,5.519825,10.0975,3.25,80.278847,846559.1,2831636.0,3918468.0
max,2019.0,84.356341,70.9,10707220.0,24.23068,23.27,37.25,4.5,100.000004,55636760.0,268564600.0,324637800.0


### 5. Missing or Duplicate Data

In this step, we analyze the completeness of the dataset by checking for **missing values** and **duplicate rows**. Missing data can significantly affect the accuracy of analysis and model performance if not handled properly.

**Observations from the dataset:**
- Several columns contain missing values:
  - `Corruption` has **2,331 missing entries**, making it the column with the highest proportion of null values.
  - `Education Expenditure %` and `Sanitation` also have **1,090** and **1,247** missing entries respectively.
  - Other columns like `Prevelance of Undernourishment`, `Unemployment`, `Life Expectancy World Bank`, `Health Expenditure %`, and `CO2` have moderate missingness (152 to 684 entries).
- Columns such as `Country Name`, `Country Code`, `Region`, `IncomeGroup`, `Year`, `Injuries`, `Communicable`, and `NonCommunicable` have **no missing values**.
- No fully duplicated rows were found (`df.duplicated().sum()` returns 0).

**Impact on analysis:**
- With about 70% of its values missing, `Corruption` is a candidate for **exclusion**. If it is kept, it would need **imputation** (for example, median substitution), and any result that depends on it should be read with caution.
- Missing values in key socioeconomic variables like `Education Expenditure %` and `Sanitation` could bias comparisons across countries and should be handled carefully before visualization or modeling.
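
Raw counts are easier to compare when expressed as a share of all rows. On the full dataset, 2,331 of 3,306 rows is about 70.5% for `Corruption`, 1,247 is about 37.7% for `Sanitation`, and 1,090 is about 33.0% for `Education Expenditure %`. The sketch below shows one way to compute such percentages; the `sample` frame is a hypothetical four-row stand-in with illustrative values, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row sample; the values are illustrative only.
sample = pd.DataFrame({
    "Country Name": ["A", "B", "C", "D"],
    "Corruption": [np.nan, np.nan, np.nan, 3.0],
    "Sanitation": [40.5, np.nan, 21.8, np.nan],
})

# isna().mean() gives the fraction of missing entries per column;
# multiplying by 100 converts it to a percentage.
missing_pct = (sample.isna().mean() * 100).sort_values(ascending=False)

print(missing_pct)
```

Sorting in descending order puts the least complete columns first, which makes it easier to decide which ones to drop or impute.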

In [6]:
missing_values = df.isnull().sum()
print("Missing Values per Column:\n")
print(missing_values)

duplicates = df.duplicated().sum()
print(f"\nNumber of duplicate rows: {duplicates}")

print("\nColumns with Missing Data:")
print(missing_values[missing_values > 0])

Missing Values per Column:

Country Name                         0
Country Code                         0
Region                               0
IncomeGroup                          0
Year                                 0
Life Expectancy World Bank         188
Prevelance of Undernourishment     684
CO2                                152
Health Expenditure %               180
Education Expenditure %           1090
Unemployment                       304
Corruption                        2331
Sanitation                        1247
Injuries                             0
Communicable                         0
NonCommunicable                      0
dtype: int64

Number of duplicate rows: 0

Columns with Missing Data:
Life Expectancy World Bank         188
Prevelance of Undernourishment     684
CO2                                152
Health Expenditure %               180
Education Expenditure %           1090
Unemployment                       304
Corruption                        2331
Sanit

### 6. Add a New Column Using Existing Columns

In this step, we create a new column derived from existing numerical data.  
The goal is to combine or transform variables in a way that adds meaningful insight into the dataset.

#### **New Column Created: `Health Investment Index`**

This column combines two key factors:
- **Health Expenditure %** (how much of GDP is spent on healthcare)
- **Education Expenditure %** (how much of GDP is spent on education)

By adding these two percentages together, we get a rough indicator of **total human capital investment**, showing how much a country allocates toward improving its population's well-being and future development. The index is only defined for rows where both inputs are present.

#### **Purpose**
This new column allows for comparison between countries that prioritize social investment versus those that may allocate fewer resources to health and education.  
It could also be useful for examining whether higher investment correlates with greater life expectancy.

#### **Example Calculation**
\[
\text{Health Investment Index} = \text{Health Expenditure}\,\% + \text{Education Expenditure}\,\%
\]

In [7]:
df["Health Investment Index"] = df["Health Expenditure %"] + df["Education Expenditure %"]

print("Preview of new column added:\n")
print(df[["Country Name", "Year", "Health Expenditure %", "Education Expenditure %", "Health Investment Index"]].head())

print("\nSummary Statistics for Health Investment Index:\n")
print(df["Health Investment Index"].describe())

Preview of new column added:

           Country Name  Year  Health Expenditure %  Education Expenditure %  \
0           Afghanistan  2001                   NaN                      NaN   
1                Angola  2001              4.483516                      NaN   
2               Albania  2001              7.139524                   3.4587   
3               Andorra  2001              5.865939                      NaN   
4  United Arab Emirates  2001              2.484370                      NaN   

   Health Investment Index  
0                      NaN  
1                      NaN  
2                10.598224  
3                      NaN  
4                      NaN  

Summary Statistics for Health Investment Index:

count    2195.000000
mean       10.963694
std         3.991684
min         2.677777
25%         7.963849
50%        10.717630
75%        13.290015
max        39.697855
Name: Health Investment Index, dtype: float64
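
The index has 2,195 non-null values, fewer than either input column (3,126 for health and 2,216 for education). This is because `+` on two Series returns `NaN` whenever either operand is missing, as the Angola row in the preview shows. The sketch below reproduces that behaviour on a hypothetical four-row sample; `investment_index` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one complete row, two partial rows, one empty row.
sample = pd.DataFrame({
    "Health Expenditure %": [7.0, 4.5, np.nan, np.nan],
    "Education Expenditure %": [3.5, np.nan, 2.0, np.nan],
})

# Element-wise addition: the result is NaN if either input is NaN.
investment_index = sample["Health Expenditure %"] + sample["Education Expenditure %"]

print(investment_index)
print("Non-null values:", investment_index.count())
```

Only the first row, where both inputs are present, yields a number. Treating a missing input as zero would instead understate the index, so keeping `NaN` is arguably the safer default.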


In [8]:
# Normalize column names: trim whitespace and replace spaces with underscores
df.columns = df.columns.str.strip().str.replace(' ', '_')
# Drop the trailing '_%' from the two expenditure column names
df.rename(columns={
    'Health_Expenditure_%': 'Health_Expenditure',
    'Education_Expenditure_%': 'Education_Expenditure',
}, inplace=True)

### 7. Filter the Data

In this step, we’ll apply two different filters to extract interesting subsets of data.


#### **Filter 1: Countries with High Health Investment**
We will filter countries where the **Health Investment Index** (Health + Education Expenditure %) is greater than **15%**, indicating nations that heavily invest in health and education.  
This subset highlights countries that prioritize human capital development, which could be linked to higher life expectancy.


#### **Filter 2: Low-Income Countries with Low Life Expectancy**
We will filter rows where:
- `IncomeGroup` = **"Low income"**
- `Life Expectancy World Bank` < **60**

This subset helps identify developing countries that face health and economic challenges, possibly showing where investment and aid could make the greatest impact.


#### **Purpose**
These filtered views help us explore relationships between **economic status**, **investment levels**, and **health outcomes**.  
For example, we can investigate whether countries with higher health and education spending tend to achieve longer life expectancies.
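
One simple way to start on that question is a Pearson correlation between the two columns. The sketch below uses hypothetical values chosen only to make the example runnable; it says nothing about the strength of the relationship in the real data, and correlation alone would not establish causation.

```python
import pandas as pd

# Hypothetical values, for illustration only.
sample = pd.DataFrame({
    "Health_Investment_Index": [5.0, 8.0, 11.0, 14.0, 17.0],
    "Life_Expectancy_World_Bank": [52.0, 61.0, 66.0, 74.0, 79.0],
})

# Pearson correlation; Series.corr ignores rows where either value is missing.
r = sample["Health_Investment_Index"].corr(sample["Life_Expectancy_World_Bank"])

print(f"Pearson r = {r:.3f}")
```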


In [9]:
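# Fill missing Education Expenditure, Life Expectancy, and Health Expenditure values
# with the median of each (IncomeGroup, Region) group.
# Note: Health_Investment_Index was computed before this step and is not recomputed here,
# so it stays NaN in rows where either input was originally missing.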
df['Education_Expenditure'] = df.groupby(['IncomeGroup', 'Region'],observed=True)['Education_Expenditure'].transform(lambda x: x.fillna(x.median()))
df['Life_Expectancy_World_Bank'] = df.groupby(['IncomeGroup', 'Region'],observed=True)['Life_Expectancy_World_Bank'].transform(lambda x: x.fillna(x.median()))
df['Health_Expenditure'] = df.groupby(['IncomeGroup', 'Region'],observed=True)['Health_Expenditure'].transform(lambda x: x.fillna(x.median()))


df

Unnamed: 0,Country_Name,Country_Code,Region,IncomeGroup,Year,Life_Expectancy_World_Bank,Prevelance_of_Undernourishment,CO2,Health_Expenditure,Education_Expenditure,Unemployment,Corruption,Sanitation,Injuries,Communicable,NonCommunicable,Health_Investment_Index
0,Afghanistan,AFG,South Asia,Low income,2001,56.308000,47.8,730.000000,9.861581,3.373310,10.809000,,,2179727.10,9689193.70,5795426.38,
1,Angola,AGO,Sub-Saharan Africa,Lower middle income,2001,47.059000,67.5,15960.000000,4.483516,3.876120,4.004000,,,1392080.71,11190210.53,2663516.34,
2,Albania,ALB,Europe & Central Asia,Upper middle income,2001,74.288000,4.9,3230.000000,7.139524,3.458700,18.575001,,40.520895,117081.67,140894.78,532324.75,10.598224
3,Andorra,AND,Europe & Central Asia,High income,2001,79.759756,,520.000000,5.865939,5.062995,,,21.788660,1697.99,695.56,13636.64,
4,United Arab Emirates,ARE,Middle East & North Africa,High income,2001,74.544000,2.8,97200.000000,2.484370,4.771480,2.493000,,,144678.14,65271.91,481740.70,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3301,Vanuatu,VUT,East Asia & Pacific,Lower middle income,2019,70.474000,12.4,209.999993,3.360347,1.777880,1.801000,3.0,,12484.18,26032.56,69213.56,5.138227
3302,Samoa,WSM,East Asia & Pacific,Lower middle income,2019,73.321000,4.4,300.000012,6.363094,4.706250,8.406000,4.0,47.698788,6652.84,9095.19,43798.62,11.069344
3303,South Africa,ZAF,Sub-Saharan Africa,Upper middle income,2019,64.131000,6.3,439640.014648,9.109355,5.917710,28.469999,,,3174676.10,13198944.71,10214261.89,15.027065
3304,Zambia,ZMB,Sub-Saharan Africa,Low income,2019,63.886000,,6800.000191,5.312203,4.465180,12.520000,2.5,,510982.75,4837094.00,2649687.82,9.777383
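
In the output above, Afghanistan's row now has values for `Health_Expenditure` and `Education_Expenditure`, yet `Health_Investment_Index` is still empty. The index was computed before the imputation, so it does not reflect the filled values. The sketch below illustrates this ordering effect on a hypothetical six-row sample grouped by `IncomeGroup` only; `Index_Before` and `Index_After` are illustrative names. Whether to recompute the index from imputed inputs is an analysis choice, and the notebook's filters use the index as originally computed.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: two income groups, one missing value in each column.
sample = pd.DataFrame({
    "IncomeGroup": ["Low income", "Low income", "Low income",
                    "High income", "High income", "High income"],
    "Health_Expenditure": [4.0, np.nan, 6.0, 9.0, 10.0, 11.0],
    "Education_Expenditure": [3.0, 3.5, 4.0, 5.0, np.nan, 6.0],
})

# Index computed before imputation: NaN wherever an input is missing.
sample["Index_Before"] = sample["Health_Expenditure"] + sample["Education_Expenditure"]

# Group-median imputation, following the same pattern as the notebook cell.
for col in ["Health_Expenditure", "Education_Expenditure"]:
    sample[col] = sample.groupby("IncomeGroup")[col].transform(
        lambda s: s.fillna(s.median())
    )

# Index recomputed after imputation: no missing values remain in this sample.
sample["Index_After"] = sample["Health_Expenditure"] + sample["Education_Expenditure"]

print(sample[["Index_Before", "Index_After"]])
```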


In [10]:
high_investment = df[df["Health_Investment_Index"] > 15]
print("Filter 1: Countries with High Health Investment Index (>15%)")
print(f"Total Records: {len(high_investment)}\n")
display(high_investment[["Country_Name", "Year", "IncomeGroup", "Health_Investment_Index", "Life_Expectancy_World_Bank"]].head())

low_income_low_life = df[(df["IncomeGroup"] == "Low income") & (df["Life_Expectancy_World_Bank"] < 60)]
print("\nFilter 2: Low-Income Countries with Life Expectancy < 60")
print(f"Total Records: {len(low_income_low_life)}\n")
display(low_income_low_life[["Country_Name", "Year", "IncomeGroup", "Life_Expectancy_World_Bank", "Health_Investment_Index"]].head())

Filter 1: Countries with High Health Investment Index (>15%)
Total Records: 299



Unnamed: 0,Country_Name,Year,IncomeGroup,Health_Investment_Index,Life_Expectancy_World_Bank
38,Cuba,2001,Upper middle income,15.340188,76.905
43,Denmark,2001,High income,16.663776,76.792683
75,Iceland,2001,High income,15.491385,80.690244
84,Kiribati,2001,Lower middle income,24.317465,63.446
90,Lesotho,2001,Lower middle income,16.642509,46.197



Filter 2: Low-Income Countries with Life Expectancy < 60
Total Records: 305



Unnamed: 0,Country_Name,Year,IncomeGroup,Life_Expectancy_World_Bank,Health_Investment_Index
0,Afghanistan,2001,Low income,56.308,
12,Burundi,2001,Low income,49.93,9.308754
15,Burkina Faso,2001,Low income,50.893,
28,Central African Republic,2001,Low income,44.061,
47,Eritrea,2001,Low income,55.864,9.966476


### 8. Unique Values and Categories

Categorical variables are important for grouping, comparison, and visualization.  
This step lists unique values for each categorical column and highlights which are most useful for analysis.

#### **Categorical Columns**
1. `Region` – Groups countries by geographical region.  
2. `IncomeGroup` – Classifies countries by economic status.  
3. `Country_Name` – Identifies individual countries.  
4. `Country_Code` – Provides a standardized 3-letter code for each country.


#### **Purpose**
- Understanding unique categories helps identify how to **group or filter** the dataset in later analysis.  
- For example, we can compare **average life expectancy** or **Health Investment Index** by region or income group.
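
A grouped comparison of this kind could look like the sketch below. The `sample` frame and its values are hypothetical; on the real data, the same pattern would apply with `Region` or `IncomeGroup` as the grouping column.

```python
import pandas as pd

# Hypothetical four-row sample, for illustration only.
sample = pd.DataFrame({
    "IncomeGroup": ["Low income", "Low income", "High income", "High income"],
    "Life_Expectancy_World_Bank": [50.0, 56.0, 78.0, 80.0],
})

# Mean life expectancy per income group, sorted from lowest to highest.
mean_by_group = (
    sample.groupby("IncomeGroup")["Life_Expectancy_World_Bank"]
    .mean()
    .sort_values()
)

print(mean_by_group)
```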


In [12]:
categorical_columns = ["Region", "IncomeGroup", "Country_Name", "Country_Code"]

for col in categorical_columns:
    print(f"Unique values in {col} ({df[col].nunique()} total):")
    print(df[col].unique())
    print("\n")

Unique values in Region (7 total):
['South Asia' 'Sub-Saharan Africa' 'Europe & Central Asia'
 'Middle East & North Africa' 'Latin America & Caribbean'
 'East Asia & Pacific' 'North America']


Unique values in IncomeGroup (4 total):
['Low income' 'Lower middle income' 'Upper middle income' 'High income']


Unique values in Country_Name (174 total):
['Afghanistan' 'Angola' 'Albania' 'Andorra' 'United Arab Emirates'
 'Argentina' 'Armenia' 'American Samoa' 'Antigua and Barbuda' 'Australia'
 'Austria' 'Azerbaijan' 'Burundi' 'Belgium' 'Benin' 'Burkina Faso'
 'Bangladesh' 'Bulgaria' 'Bahrain' 'Bosnia and Herzegovina' 'Belarus'
 'Belize' 'Bermuda' 'Bolivia' 'Brazil' 'Barbados' 'Bhutan' 'Botswana'
 'Central African Republic' 'Canada' 'Switzerland' 'Chile' 'China'
 "Cote d'Ivoire" 'Cameroon' 'Colombia' 'Comoros' 'Costa Rica' 'Cuba'
 'Cyprus' 'Germany' 'Djibouti' 'Dominica' 'Denmark' 'Dominican Republic'
 'Algeria' 'Ecuador' 'Eritrea' 'Spain' 'Estonia' 'Ethiopia' 'Finland'
 'Fiji' 'France' 'Gab

### 9. Summary and Conclusion

After exploring the life expectancy dataset, several key observations emerged:

- The dataset contains **3,306 records** across **174 countries** from **2001 to 2019**, with 16 columns covering health, socioeconomic, and environmental indicators.
- Columns are clearly divided into **numerical variables** (e.g., Life Expectancy, CO2 emissions, Health and Education Expenditure) and **categorical variables** (e.g., Country, Region, IncomeGroup), which helps guide analysis methods.
- **Descriptive statistics** show large disparities across countries:
  - Life expectancy ranges from **40.37 to 84.36 years**, averaging around **69.7 years**.
  - CO2 emissions and disease-related variables have extreme variation, reflecting differences between industrialized and developing nations.
  - Health expenditure ranges from about 1.3% to 24.2% of GDP, and education expenditure from about 0.9% to 23.3%.
- **Missing values** are concentrated in columns like `Corruption`, `Education Expenditure %`, and `Sanitation`, while other columns are mostly complete. No duplicate rows were detected.
- A derived column, **Health Investment Index** (Health + Education Expenditure), was added, providing a single metric to compare countries’ investments in human capital.
- **Filtered subsets** highlighted:
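  - **299 records** with a **Health Investment Index above 15%**. The first five rows include high-income (Denmark, Iceland), upper-middle-income (Cuba), and lower-middle-income (Kiribati, Lesotho) countries, so high investment is not limited to high-income nations.
  - **305 records** for **low-income countries with life expectancy below 60 years**, revealing regions with critical health and economic challenges.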
- **Categorical analysis** showed:
  - 7 regions and 4 income groups, which can be used to group data for comparative analysis.
  - Over 170 unique countries, ensuring broad global representation.

**Overall Conclusion:**  
The dataset reveals significant global disparities in life expectancy, health investment, and disease burden. The expectation that high-income countries invest more in health and education and have higher life expectancies, while low-income countries face lower investment and shorter life spans, has not yet been quantified in this notebook and should be tested in Part 2. The derived index, the imputed columns, and the filtered subsets provide a foundation for that further analysis, visualization, and exploration of relationships between socioeconomic factors and population health.


In [13]:
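# Save the cleaned data; index=False avoids writing the row index as an extra unnamed column
df.to_csv('life_expectancy_cleaned.csv', index=False)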
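
A note on the export: `DataFrame.to_csv` writes the row index as an additional first column unless `index=False` is passed, and that column comes back as `Unnamed: 0` when the file is read again. The sketch below demonstrates the difference with in-memory buffers and a hypothetical two-row frame, so no file is written.

```python
import io

import pandas as pd

# Hypothetical two-row sample, for illustration only.
sample = pd.DataFrame({"Country_Name": ["Albania", "Samoa"], "Year": [2001, 2019]})

# Default behaviour: the row index is written as an extra, unnamed first column.
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# With index=False, only the data columns are written.
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```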