# Global Life Expectancy Analysis (2000-2015)

## Project Overview

This comprehensive analysis examines global life expectancy trends from 2000-2015, exploring the complex relationships between health outcomes, economic factors, and social indicators across 193 countries. Our goal is to identify key determinants of life expectancy and provide actionable insights for policymakers and health organizations.

### Key Objectives:
- Analyze global life expectancy trends over the 15-year period
- Identify factors most strongly correlated with life expectancy
- Compare patterns between developed and developing countries
- Highlight success stories and areas of concern
- Provide evidence-based policy recommendations

### Data Sources:
- World Health Organization (WHO)
- World Bank economic indicators
- United Nations population statistics
- Country-specific health ministry reports

### Methodology:
1. Data Cleaning & Preprocessing: Handle missing values, outliers, and standardize formats
2. Exploratory Data Analysis: Statistical summaries, distributions, and initial patterns
3. Visualization & Correlation: Interactive charts and correlation analysis
4. Comparative Analysis: Group comparisons and trend identification
5. Insight Generation: Evidence-based recommendations and conclusions

## 1. Import Required Libraries and Load Data

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import plotly.express as px
import warnings

# Configure display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")

Libraries imported successfully!


In [2]:
# Load the life expectancy dataset
df = pd.read_csv('Life Expectancy Data.csv')

# Clean column names (remove extra spaces)
df.columns = df.columns.str.strip()

# Display basic information about the dataset
print("Dataset loaded successfully!")
print(f"Dataset shape: {df.shape}")
print(f"Countries: {df['Country'].nunique()}")
print(f"Years covered: {df['Year'].min()} - {df['Year'].max()}")
print(f"Development status categories: {df['Status'].unique()}")

# Display first few rows
print("\nFirst 5 rows of the dataset:")
df.head()

Dataset loaded successfully!
Dataset shape: (2938, 22)
Countries: 193
Years covered: 2000 - 2015
Development status categories: ['Developing' 'Developed']

First 5 rows of the dataset:


Unnamed: 0,Country,Year,Status,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,BMI,under-five deaths,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
0,Afghanistan,2015,Developing,65.0,263.0,62,0.01,71.279624,65.0,1154,19.1,83,6.0,8.16,65.0,0.1,584.25921,33736494.0,17.2,17.3,0.479,10.1
1,Afghanistan,2014,Developing,59.9,271.0,64,0.01,73.523582,62.0,492,18.6,86,58.0,8.18,62.0,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0
2,Afghanistan,2013,Developing,59.9,268.0,66,0.01,73.219243,64.0,430,18.1,89,62.0,8.13,64.0,0.1,631.744976,31731688.0,17.7,17.7,0.47,9.9
3,Afghanistan,2012,Developing,59.5,272.0,69,0.01,78.184215,67.0,2787,17.6,93,67.0,8.52,67.0,0.1,669.959,3696958.0,17.9,18.0,0.463,9.8
4,Afghanistan,2011,Developing,59.2,275.0,71,0.01,7.097109,68.0,3013,17.2,97,68.0,7.87,68.0,0.1,63.537231,2978599.0,18.2,18.2,0.454,9.5


## 2. Data Overview and Basic Statistics

Let's examine the structure of our dataset, including data types, missing values, and basic statistical summaries.

In [3]:
# Dataset information
print("Dataset Information:")
print("="*50)
print(f"Total Records: {df.shape[0]:,}")
print(f"Total Features: {df.shape[1]}")
print(f"Countries: {df['Country'].nunique()}")
print(f"Time Period: {df['Year'].min()} - {df['Year'].max()}")
print(f"Years Covered: {df['Year'].nunique()}")

# Calculate data completeness
total_cells = df.shape[0] * df.shape[1]
missing_cells = df.isnull().sum().sum()
completeness = ((total_cells - missing_cells) / total_cells) * 100
print(f"Data Completeness: {completeness:.1f}%")

# Development status distribution
print(f"\nDevelopment Status Distribution:")
status_counts = df['Status'].value_counts()
for status, count in status_counts.items():
    percentage = (count / len(df)) * 100
    print(f"- {status}: {count:,} records ({percentage:.1f}%)")

# Display data types
print(f"\nData Types:")
print(df.dtypes)

Dataset Information:
Total Records: 2,938
Total Features: 22
Countries: 193
Time Period: 2000 - 2015
Years Covered: 16
Data Completeness: 96.0%

Development Status Distribution:
- Developing: 2,426 records (82.6%)
- Developed: 512 records (17.4%)

Data Types:
Country                             object
Year                                 int64
Status                              object
Life expectancy                    float64
Adult Mortality                    float64
infant deaths                        int64
Alcohol                            float64
percentage expenditure             float64
Hepatitis B                        float64
Measles                              int64
BMI                                float64
under-five deaths                    int64
Polio                              float64
Total expenditure                  float64
Diphtheria                         float64
HIV/AIDS                           float64
GDP                                float64
Populatio

In [4]:
# Missing values analysis
missing_data = df.isnull().sum()
missing_percent = (missing_data / len(df)) * 100

missing_df = pd.DataFrame({
    'Column': missing_data.index,
    'Missing Count': missing_data.values,
    'Missing Percentage': missing_percent.values
})
missing_df = missing_df[missing_df['Missing Count'] > 0].sort_values('Missing Count', ascending=False)

print("Missing Values Summary:")
print("="*50)
if len(missing_df) > 0:
    print(missing_df.to_string(index=False))
    
    # Visualize missing values
    fig = px.bar(
        missing_df,
        x='Missing Percentage',
        y='Column',
        orientation='h',
        title="Missing Values by Column (%)",
        color='Missing Percentage',
        color_continuous_scale='Reds'
    )
    fig.update_layout(height=400)
    fig.show()
else:
    print("No missing values found in the dataset!")

Missing Values Summary:
                         Column  Missing Count  Missing Percentage
                     Population            652           22.191967
                    Hepatitis B            553           18.822328
                            GDP            448           15.248468
              Total expenditure            226            7.692308
                        Alcohol            194            6.603131
Income composition of resources            167            5.684139
                      Schooling            163            5.547992
             thinness 5-9 years             34            1.157250
           thinness  1-19 years             34            1.157250
                            BMI             34            1.157250
                          Polio             19            0.646698
                     Diphtheria             19            0.646698
                Life expectancy             10            0.340368
                Adult Mortality       

In [5]:
# Statistical summary of numerical variables
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
if 'Year' in numerical_cols:
    numerical_cols.remove('Year')  # Remove Year as it's not meaningful for statistics

print("Descriptive Statistics for Numerical Variables:")
print("="*50)
stats_df = df[numerical_cols].describe()
print(stats_df)

# Key insights for highest and lowest values
print(f"\nKey Statistical Insights:")
print("-" * 30)
print("Highest Values:")
for col in numerical_cols[:5]:  # Show first 5 columns
    max_val = df[col].max()
    max_country = df.loc[df[col].idxmax(), 'Country'] if not pd.isna(max_val) else 'N/A'
    print(f"- {col}: {max_val:.2f} ({max_country})")

print("\nLowest Values:")
for col in numerical_cols[:5]:  # Show first 5 columns
    min_val = df[col].min()
    min_country = df.loc[df[col].idxmin(), 'Country'] if not pd.isna(min_val) else 'N/A'
    print(f"- {col}: {min_val:.2f} ({min_country})")

Descriptive Statistics for Numerical Variables:
       Life expectancy  Adult Mortality  infant deaths      Alcohol  \
count      2928.000000      2928.000000    2938.000000  2744.000000   
mean         69.224932       164.796448      30.303948     4.602861   
std           9.523867       124.292079     117.926501     4.052413   
min          36.300000         1.000000       0.000000     0.010000   
25%          63.100000        74.000000       0.000000     0.877500   
50%          72.100000       144.000000       3.000000     3.755000   
75%          75.700000       228.000000      22.000000     7.702500   
max          89.000000       723.000000    1800.000000    17.870000   

       percentage expenditure  Hepatitis B        Measles          BMI  \
count             2938.000000  2385.000000    2938.000000  2904.000000   
mean               738.251295    80.940461    2419.592240    38.321247   
std               1987.914858    25.070016   11467.272489    20.044034   
min             

## 3. Data Cleaning and Missing Values Treatment

We'll implement automated missing value treatment strategies based on the data type and missing percentage of each column.

In [6]:
# Function to apply missing value treatments
def apply_missing_value_treatments(df, strategies):
    """Apply the selected missing value treatment strategies"""
    df_treated = df.copy()
    
    for col_name, strategy in strategies.items():
        if strategy == "Drop Column":
            df_treated = df_treated.drop(columns=[col_name])
        elif strategy == "Drop Rows":
            df_treated = df_treated.dropna(subset=[col_name])
        elif strategy == "Mean Imputation":
            df_treated[col_name] = df_treated[col_name].fillna(df_treated[col_name].mean())
        elif strategy == "Median Imputation":
            df_treated[col_name] = df_treated[col_name].fillna(df_treated[col_name].median())
        elif strategy == "Mode Imputation":
            df_treated[col_name] = df_treated[col_name].fillna(df_treated[col_name].mode()[0])
        elif strategy == "Forward Fill":
            df_treated[col_name] = df_treated[col_name].fillna(method='ffill')
        elif strategy == "Backward Fill":
            df_treated[col_name] = df_treated[col_name].fillna(method='bfill')
        elif strategy == "Interpolation":
            df_treated[col_name] = df_treated[col_name].interpolate()
    
    return df_treated

# Determine treatment strategies automatically
if len(missing_df) > 0:
    treatment_strategies = {}
    treatment_explanations = []
    
    for _, row in missing_df.iterrows():
        col_name = row['Column']
        missing_pct = row['Missing Percentage']
        
        # Determine best strategy automatically
        if missing_pct > 50:
            strategy = "Drop Column"
            reason = f"Too many missing values ({missing_pct:.1f}%)"
        elif col_name in ['Population']:
            strategy = "Median Imputation"
            reason = "Large range values, median is more robust"
        elif df[col_name].dtype in ['int64', 'float64']:
            if missing_pct > 20:
                strategy = "Median Imputation"
                reason = "High missing %, median more robust than mean"
            else:
                strategy = "Mean Imputation"
                reason = "Numerical data with reasonable missing %"
        else:
            strategy = "Mode Imputation"
            reason = "Categorical data, use most frequent value"
        
        treatment_strategies[col_name] = strategy
        treatment_explanations.append({
            'Column': col_name,
            'Missing %': f"{missing_pct:.1f}%",
            'Strategy': strategy,
            'Reason': reason
        })
    
    # Display treatment plan
    print("Automatic Treatment Strategy Applied:")
    print("="*50)
    treatment_df = pd.DataFrame(treatment_explanations)
    print(treatment_df.to_string(index=False))
    
    # Apply treatments
    df_treated = apply_missing_value_treatments(df, treatment_strategies)
    
    # Show before/after comparison
    original_missing = df.isnull().sum().sum()
    new_missing = df_treated.isnull().sum().sum()
    improvement = ((original_missing - new_missing) / original_missing) * 100 if original_missing > 0 else 0
    
    print(f"\nTreatment Results:")
    print("-" * 30)
    print(f"Original Missing Values: {original_missing:,}")
    print(f"Final Missing Values: {new_missing:,}")
    print(f"Improvement: {improvement:.1f}%")
    
else:
    print("No missing values found - dataset is complete!")
    df_treated = df.copy()

Automatic Treatment Strategy Applied:
                         Column Missing %          Strategy                                    Reason
                     Population     22.2% Median Imputation Large range values, median is more robust
                    Hepatitis B     18.8%   Mean Imputation  Numerical data with reasonable missing %
                            GDP     15.2%   Mean Imputation  Numerical data with reasonable missing %
              Total expenditure      7.7%   Mean Imputation  Numerical data with reasonable missing %
                        Alcohol      6.6%   Mean Imputation  Numerical data with reasonable missing %
Income composition of resources      5.7%   Mean Imputation  Numerical data with reasonable missing %
                      Schooling      5.5%   Mean Imputation  Numerical data with reasonable missing %
             thinness 5-9 years      1.2%   Mean Imputation  Numerical data with reasonable missing %
           thinness  1-19 years      1.2%   

## 4. Outlier Detection and Handling

We'll use the IQR (Interquartile Range) method to detect outliers and apply appropriate treatment strategies.

In [7]:
# Outlier detection and handling
numerical_cols = df_treated.select_dtypes(include=['int64', 'float64']).columns.tolist()
if 'Year' in numerical_cols:
    numerical_cols.remove('Year')

outlier_summary = []
df_outlier_treated = df_treated.copy()

for col in numerical_cols:
    # Skip if column has too many missing values
    if df_treated[col].isnull().sum() / len(df_treated) > 0.5:
        continue
        
    # IQR method for outlier detection
    Q1 = df_treated[col].quantile(0.25)
    Q3 = df_treated[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    outliers_count = len(df_treated[(df_treated[col] < lower_bound) | (df_treated[col] > upper_bound)])
    outlier_percentage = (outliers_count / len(df_treated)) * 100
    
    # Determine treatment strategy
    if outlier_percentage > 10:
        strategy = "Winsorize (5th-95th percentile)"
        # Apply winsorizing
        df_outlier_treated[col] = df_outlier_treated[col].clip(
            lower=df_outlier_treated[col].quantile(0.05),
            upper=df_outlier_treated[col].quantile(0.95)
        )
    elif outlier_percentage > 5:
        strategy = "Cap outliers (IQR method)"
        # Apply capping
        df_outlier_treated[col] = df_outlier_treated[col].clip(
            lower=lower_bound,
            upper=upper_bound
        )
    elif outlier_percentage > 0:
        strategy = "Keep outliers (low impact)"
    else:
        strategy = "No outliers detected"
    
    outlier_summary.append({
        'Column': col,
        'Outliers Count': outliers_count,
        'Outliers %': f"{outlier_percentage:.1f}%",
        'Strategy Applied': strategy,
        'Lower Bound': f"{lower_bound:.2f}",
        'Upper Bound': f"{upper_bound:.2f}"
    })

# Display outlier analysis summary
print("Automatic Outlier Treatment Applied:")
print("="*50)
outlier_df = pd.DataFrame(outlier_summary)
print(outlier_df.to_string(index=False))

# Show example of outlier treatment for Life Expectancy
fig = px.box(df_treated, y='Life expectancy', title='Life Expectancy Distribution - Before Outlier Treatment')
fig.show()

fig2 = px.box(df_outlier_treated, y='Life expectancy', title='Life Expectancy Distribution - After Outlier Treatment')
fig2.show()

# Store the final cleaned dataset
df_clean = df_outlier_treated.copy()
print(f"\nFinal cleaned dataset shape: {df_clean.shape}")
print(f"Total missing values remaining: {df_clean.isnull().sum().sum()}")

Automatic Outlier Treatment Applied:
                         Column  Outliers Count Outliers %                Strategy Applied Lower Bound Upper Bound
                Life expectancy              17       0.6%      Keep outliers (low impact)       44.60       94.20
                Adult Mortality              86       2.9%      Keep outliers (low impact)     -155.50      456.50
                  infant deaths             315      10.7% Winsorize (5th-95th percentile)      -33.00       55.00
                        Alcohol               3       0.1%      Keep outliers (low impact)       -8.35       16.84
         percentage expenditure             389      13.2% Winsorize (5th-95th percentile)     -650.59     1096.81
                    Hepatitis B             316      10.8% Winsorize (5th-95th percentile)       58.35      118.59
                        Measles             542      18.4% Winsorize (5th-95th percentile)     -540.38      900.62
                            BMI            


Final cleaned dataset shape: (2938, 22)
Total missing values remaining: 0


## 5. Global Life Expectancy Trends Analysis

Let's analyze global trends in life expectancy over the 2000-2015 period and identify top and bottom performing countries.

In [8]:
# Global average life expectancy over time
yearly_avg = df_clean.groupby('Year')['Life expectancy'].mean().reset_index()

# Create line plot for global trend
fig1 = px.line(
    yearly_avg, 
    x='Year', 
    y='Life expectancy',
    title='Global Average Life Expectancy Trend (2000-2015)',
    markers=True
)
fig1.update_layout(
    height=400,
    xaxis_title="Year",
    yaxis_title="Life Expectancy (Years)"
)
fig1.update_traces(line=dict(width=3), marker=dict(size=8))
fig1.show()

# Calculate key insights
start_life_exp = yearly_avg.iloc[0]['Life expectancy']
end_life_exp = yearly_avg.iloc[-1]['Life expectancy']
total_improvement = end_life_exp - start_life_exp
annual_improvement = total_improvement / 15

print(f"Global Life Expectancy Trends (2000-2015):")
print("="*50)
print(f"Life expectancy in 2000: {start_life_exp:.1f} years")
print(f"Life expectancy in 2015: {end_life_exp:.1f} years")
print(f"Total improvement: {total_improvement:.1f} years")
print(f"Average annual increase: {annual_improvement:.2f} years per year")
print("The upward trend is consistent across the entire period, suggesting sustained global health improvements")

Global Life Expectancy Trends (2000-2015):
Life expectancy in 2000: 66.8 years
Life expectancy in 2015: 71.6 years
Total improvement: 4.9 years
Average annual increase: 0.32 years per year
The upward trend is consistent across the entire period, suggesting sustained global health improvements


In [9]:
# Top and bottom performing countries
country_avg = df_clean.groupby('Country')['Life expectancy'].mean().sort_values(ascending=False)

# Top 10 countries
top_10 = country_avg.head(10)
fig3 = px.bar(
    x=top_10.values,
    y=top_10.index,
    orientation='h',
    title="Top 10 Countries by Average Life Expectancy (2000-2015)",
    color=top_10.values,
    color_continuous_scale='Greens',
    labels={'x': 'Life Expectancy (Years)', 'y': 'Countries'}
)
fig3.update_layout(height=400, showlegend=False)
fig3.show()

# Filter out countries with insufficient data for bottom 10
country_data_counts = df_clean.groupby('Country').size()
countries_with_sufficient_data = country_data_counts[country_data_counts >= 5].index
df_filtered = df_clean[df_clean['Country'].isin(countries_with_sufficient_data)]
country_avg_filtered = df_filtered.groupby('Country')['Life expectancy'].mean().sort_values(ascending=False)
bottom_10 = country_avg_filtered.tail(10).sort_values(ascending=True)

fig4 = px.bar(
    x=bottom_10.values,
    y=bottom_10.index,
    orientation='h',
    title="Bottom 10 Countries by Average Life Expectancy (2000-2015)",
    color=bottom_10.values,
    color_continuous_scale='Reds',
    labels={'x': 'Life Expectancy (Years)', 'y': 'Countries'}
)
fig4.update_layout(height=400, showlegend=False)
fig4.show()

print(f"Country Performance Analysis:")
print("="*50)
print(f"Highest life expectancy: {country_avg.max():.1f} years ({country_avg.index[0]})")
print(f"Lowest life expectancy: {country_avg.min():.1f} years ({country_avg.index[-1]})")
print(f"Range: {country_avg.max() - country_avg.min():.1f} years difference")
print("Top performers: Predominantly developed nations with advanced healthcare systems")
print("Bottom performers: Mainly Sub-Saharan African countries facing challenges like poverty and disease")

Country Performance Analysis:
Highest life expectancy: 82.5 years (Japan)
Lowest life expectancy: 46.1 years (Sierra Leone)
Range: 36.4 years difference
Top performers: Predominantly developed nations with advanced healthcare systems
Bottom performers: Mainly Sub-Saharan African countries facing challenges like poverty and disease


## 6. Regional and Development Status Analysis

Let's analyze patterns between developed and developing countries and examine improvement rates over time.

In [10]:
# Calculate statistics by development status
status_stats = df_clean.groupby('Status')['Life expectancy'].agg(['mean', 'median', 'std', 'min', 'max']).round(2)

print("Development Status Comparison:")
print("="*50)
print(status_stats)

print(f"\nRegional Distribution Insights:")
print("-" * 30)
print(f"Developed Countries: Higher median life expectancy ({status_stats.loc['Developed', 'median']:.1f} years) with lower variability (std: {status_stats.loc['Developed', 'std']:.1f})")
print(f"Developing Countries: Lower median life expectancy ({status_stats.loc['Developing', 'median']:.1f} years) with higher variability (std: {status_stats.loc['Developing', 'std']:.1f})")
print("Developing countries show wider range of outcomes, indicating diverse healthcare systems")

# Country improvement analysis
country_improvement = []
for country in df_clean['Country'].unique():
    country_data = df_clean[df_clean['Country'] == country].sort_values('Year')
    if len(country_data) > 1:
        first_year = country_data.iloc[0]['Life expectancy']
        last_year = country_data.iloc[-1]['Life expectancy']
        if pd.notna(first_year) and pd.notna(last_year):
            improvement = last_year - first_year
            country_improvement.append({
                'Country': country,
                'Improvement': improvement,
                'Status': country_data.iloc[0]['Status']
            })

improvement_df = pd.DataFrame(country_improvement)

# Improvement distribution by status
fig2 = px.histogram(
    improvement_df,
    x='Improvement',
    color='Status',
    title='Distribution of Life Expectancy Improvement by Development Status (2000-2015)',
    nbins=30,
    barmode='overlay',
    labels={'Improvement': 'Life Expectancy Improvement (Years)'}
)
fig2.update_layout(height=400)
fig2.show()

# Average improvement by status
avg_improvement_developed = improvement_df[improvement_df['Status'] == 'Developed']['Improvement'].mean()
avg_improvement_developing = improvement_df[improvement_df['Status'] == 'Developing']['Improvement'].mean()

print(f"\nImprovement Pattern Analysis:")
print("-" * 30)
print(f"Developing Countries Average Improvement: {avg_improvement_developing:.1f} years")
print(f"Developed Countries Average Improvement: {avg_improvement_developed:.1f} years")
print("Developing countries show greater improvements (catching up effect)")

Development Status Comparison:
             mean  median   std   min   max
Status                                     
Developed   79.20   79.25  3.93  69.9  89.0
Developing  67.12   69.05  8.99  36.3  89.0

Regional Distribution Insights:
------------------------------
Developed Countries: Higher median life expectancy (79.2 years) with lower variability (std: 3.9)
Developing Countries: Lower median life expectancy (69.0 years) with higher variability (std: 9.0)
Developing countries show wider range of outcomes, indicating diverse healthcare systems



Improvement Pattern Analysis:
------------------------------
Developing Countries Average Improvement: 5.1 years
Developed Countries Average Improvement: 3.9 years
Developing countries show greater improvements (catching up effect)


## 7. Correlation Analysis Between Variables

Let's examine the correlations between life expectancy and other health, economic, and social indicators.

In [11]:
# Get numerical columns for correlation analysis
numerical_cols = df_clean.select_dtypes(include=['int64', 'float64']).columns.tolist()
if 'Year' in numerical_cols:
    numerical_cols.remove('Year')

# Calculate correlation matrix
corr_matrix = df_clean[numerical_cols].corr()

# Correlation heatmap
fig1 = px.imshow(
    corr_matrix,
    title='Correlation Matrix of Health and Economic Indicators',
    color_continuous_scale='RdBu_r',
    zmin=-1, zmax=1,
    aspect='auto'
)
fig1.update_layout(height=600)
fig1.show()

# Extract correlations with life expectancy
life_exp_corr = corr_matrix['Life expectancy'].drop('Life expectancy').sort_values(ascending=False)

# Top positive and negative correlations
print("Correlation Analysis with Life Expectancy:")
print("="*50)
print("\nTop 5 Positive Correlations:")
top_positive = life_exp_corr.head(5)
for var, corr in top_positive.items():
    print(f"- {var}: {corr:.3f}")

print("\nTop 5 Negative Correlations:")
top_negative = life_exp_corr.tail(5)
for var, corr in top_negative.items():
    print(f"- {var}: {corr:.3f}")

# Visualize top correlations
fig2 = px.bar(
    x=top_positive.values,
    y=top_positive.index,
    orientation='h',
    title="Strongest Positive Correlations with Life Expectancy",
    color=top_positive.values,
    color_continuous_scale='Greens',
    labels={'x': 'Correlation Coefficient', 'y': 'Variables'}
)
fig2.update_layout(height=300, showlegend=False)
fig2.show()

fig3 = px.bar(
    x=top_negative.values,
    y=top_negative.index,
    orientation='h',
    title="Strongest Negative Correlations with Life Expectancy",
    color=top_negative.values,
    color_continuous_scale='Reds',
    labels={'x': 'Correlation Coefficient', 'y': 'Variables'}
)
fig3.update_layout(height=300, showlegend=False)
fig3.show()

print(f"\nCorrelation Insights:")
print("-" * 30)
print("Strong Positive Factors: Education, GDP, and healthcare spending show strong positive correlations")
print("Health Risk Factors: Disease prevalence and mortality rates negatively correlate with life expectancy")
print("Economic Impact: Wealth indicators consistently associate with better health outcomes")
print("Policy Relevance: Modifiable factors like education and healthcare investment offer intervention opportunities")

Correlation Analysis with Life Expectancy:

Top 5 Positive Correlations:
- Schooling: 0.715
- Income composition of resources: 0.692
- Polio: 0.567
- BMI: 0.559
- Diphtheria: 0.477

Top 5 Negative Correlations:
- thinness  1-19 years: -0.472
- infant deaths: -0.511
- under-five deaths: -0.540
- Adult Mortality: -0.696
- HIV/AIDS: -0.714



Correlation Insights:
------------------------------
Strong Positive Factors: Education, GDP, and healthcare spending show strong positive correlations
Health Risk Factors: Disease prevalence and mortality rates negatively correlate with life expectancy
Economic Impact: Wealth indicators consistently associate with better health outcomes
Policy Relevance: Modifiable factors like education and healthcare investment offer intervention opportunities


In [12]:
# Scatter plots for key relationships
key_vars = life_exp_corr.abs().nlargest(4).index.tolist()

print("\nKey Relationship Analysis:")
print("="*50)

for i, var in enumerate(key_vars):
    # Smart filtering for meaningful scatter plots - variable-specific approach
    df_plot = df_clean[(df_clean[var].notna()) & (df_clean['Life expectancy'].notna())]
    
    # Variable-specific filtering strategies
    if 'HIV' in var or 'AIDS' in var:
        # For HIV/AIDS, preserve all low values including zeros
        q99 = df_plot[var].quantile(0.99)
        extreme_high_percentage = (df_plot[var] >= q99).sum() / len(df_plot) * 100
        if extreme_high_percentage > 5:
            var_upper_threshold = q99
        else:
            var_upper_threshold = df_plot[var].max()
        
        df_filtered = df_plot[
            (df_plot[var] >= 0) &
            (df_plot[var] < var_upper_threshold) &
            (df_plot['Life expectancy'] > 20) &
            (df_plot['Life expectancy'] < 95)
        ]
        filtering_note = "HIV/AIDS data: All meaningful values preserved including low rates in developed countries"
    else:
        # For other variables, apply filtering for clustering
        low_value_percentage = (df_plot[var] <= 0.1).sum() / len(df_plot) * 100
        if low_value_percentage > 15:
            var_threshold = df_plot[df_plot[var] > 0][var].quantile(0.05) if (df_plot[var] > 0).any() else 0
        else:
            var_threshold = 0
        
        q95 = df_plot[var].quantile(0.95)
        high_value_percentage = (df_plot[var] >= q95).sum() / len(df_plot) * 100
        if high_value_percentage > 10:
            var_upper_threshold = q95
        else:
            var_upper_threshold = df_plot[var].max()
        
        df_filtered = df_plot[
            (df_plot[var] > var_threshold) &
            (df_plot[var] < var_upper_threshold) &
            (df_plot['Life expectancy'] > 20) &
            (df_plot['Life expectancy'] < 95)
        ]
        
        if low_value_percentage > 15:
            filtering_note = f"Non-HIV variable: {low_value_percentage:.1f}% of low values filtered to reduce clustering artifacts"
        else:
            filtering_note = "Standard filtering applied for visualization clarity"
    
    # Create scatter plot if we have sufficient data
    if len(df_filtered) > 10:
        fig = px.scatter(
            df_filtered,
            x=var,
            y='Life expectancy',
            color='Status',
            title=f'Life Expectancy vs {var}',
            hover_data=['Country', 'Year']
        )
        fig.update_layout(height=400)
        fig.show()
        
        # Calculate correlation for filtered data
        corr_val = df_filtered[var].corr(df_filtered['Life expectancy'])
        print(f"\n{var}:")
        print(f"Correlation coefficient: {corr_val:.3f}")
        print(f"Data points: {len(df_filtered):,} of {len(df_plot):,} (filtered for visualization quality)")
        print(f"Filtering applied: {filtering_note}")
    else:
        print(f"\nInsufficient data for {var} after filtering for visualization quality.")


Key Relationship Analysis:



Schooling:
Correlation coefficient: 0.748
Data points: 2,909 of 2,938 (filtered for visualization quality)
Filtering applied: Standard filtering applied for visualization clarity



HIV/AIDS:
Correlation coefficient: -0.665
Data points: 2,791 of 2,938 (filtered for visualization quality)
Filtering applied: HIV/AIDS data: All meaningful values preserved including low rates in developed countries



Adult Mortality:
Correlation coefficient: -0.696
Data points: 2,937 of 2,938 (filtered for visualization quality)
Filtering applied: Standard filtering applied for visualization clarity



Income composition of resources:
Correlation coefficient: 0.848
Data points: 2,807 of 2,938 (filtered for visualization quality)
Filtering applied: Standard filtering applied for visualization clarity


## 8. Distribution Analysis of Key Variables

Let's analyze the distribution of life expectancy globally and examine how it has evolved over time.

In [13]:
# Life expectancy distribution
fig1 = px.histogram(
    df_clean,
    x='Life expectancy',
    nbins=30,
    title='Distribution of Life Expectancy Globally (2000-2015)',
    labels={'Life expectancy': 'Life Expectancy (Years)', 'count': 'Frequency'}
)
fig1.add_vline(
    x=df_clean['Life expectancy'].mean(),
    line_dash="dash",
    line_color="red",
    annotation_text="Global Mean"
)
fig1.update_layout(height=400)
fig1.show()

# Distribution statistics
mean_life_exp = df_clean['Life expectancy'].mean()
median_life_exp = df_clean['Life expectancy'].median()
std_life_exp = df_clean['Life expectancy'].std()

print("Life Expectancy Distribution Statistics:")
print("="*50)
print(f"Mean Life Expectancy: {mean_life_exp:.1f} years")
print(f"Median Life Expectancy: {median_life_exp:.1f} years")
print(f"Standard Deviation: {std_life_exp:.1f} years")
print(f"Range: {df_clean['Life expectancy'].min():.1f} to {df_clean['Life expectancy'].max():.1f} years")

# Determine distribution shape
if mean_life_exp > median_life_exp:
    distribution_shape = 'Right-skewed'
    shape_description = 'concentration in lower ranges with some high outliers'
elif mean_life_exp < median_life_exp:
    distribution_shape = 'Left-skewed'
    shape_description = 'concentration in higher ranges with some low outliers'
else:
    distribution_shape = 'Normal'
    shape_description = 'balanced distribution'

# Determine variability level
if std_life_exp > 10:
    variability_level = 'high'
elif std_life_exp > 5:
    variability_level = 'moderate'
else:
    variability_level = 'low'

print(f"\nDistribution Insights:")
print("-" * 30)
print(f"Distribution Shape: {distribution_shape} distribution indicates {shape_description}")
print(f"Variability: Standard deviation of {std_life_exp:.1f} years shows {variability_level} variability in global life expectancy")

# Time series distribution analysis - violin plot for different years
sample_years = [2000, 2005, 2010, 2015]
df_sample_years = df_clean[df_clean['Year'].isin(sample_years)]

fig3 = px.violin(
    df_sample_years,
    x='Year',
    y='Life expectancy',
    box=True,
    title='Life Expectancy Distribution Evolution (2000-2015)',
    labels={'Life expectancy': 'Life Expectancy (Years)', 'Year': 'Year'}
)
fig3.update_layout(height=400)
fig3.show()

print(f"\nTemporal Distribution Insights:")
print("-" * 30)
print("Improvement Over Time: Distribution shows gradual shift toward higher life expectancy values")
print("Inequality Persistence: The spread of the distribution remains relatively constant, indicating persistent global inequality")
print("Tail Behavior: Lower tail of distribution shows improvement, suggesting progress in worst-performing countries")

Life Expectancy Distribution Statistics:
Mean Life Expectancy: 69.2 years
Median Life Expectancy: 72.0 years
Standard Deviation: 9.5 years
Range: 36.3 to 89.0 years

Distribution Insights:
------------------------------
Distribution Shape: Left-skewed distribution indicates concentration in higher ranges with some low outliers
Variability: Standard deviation of 9.5 years shows moderate variability in global life expectancy



Temporal Distribution Insights:
------------------------------
Improvement Over Time: Distribution shows gradual shift toward higher life expectancy values
Inequality Persistence: The spread of the distribution remains relatively constant, indicating persistent global inequality
Tail Behavior: Lower tail of distribution shows improvement, suggesting progress in worst-performing countries


## 9. Comparative Analysis: Developed vs Developing Countries

Let's perform a detailed comparison between developed and developing countries to understand the development gap.

In [14]:
# Box plot comparison between developed and developing countries
fig1 = px.box(
    df_clean,
    x='Status',
    y='Life expectancy',
    points='all',
    title='Life Expectancy: Developed vs Developing Countries',
    labels={'Life expectancy': 'Life Expectancy (Years)', 'Status': 'Development Status'}
)
fig1.update_layout(height=400)
fig1.show()

# Statistical comparison
developed_stats = df_clean[df_clean['Status'] == 'Developed']['Life expectancy'].describe()
developing_stats = df_clean[df_clean['Status'] == 'Developing']['Life expectancy'].describe()

comparison_df = pd.DataFrame({
    'Developed': developed_stats,
    'Developing': developing_stats,
    'Difference': developed_stats - developing_stats
}).round(2)

print("Statistical Comparison Between Developed and Developing Countries:")
print("="*70)
print(comparison_df)

# Time-based comparison
yearly_comparison = df_clean.groupby(['Year', 'Status'])['Life expectancy'].mean().reset_index()

fig2 = px.line(
    yearly_comparison,
    x='Year',
    y='Life expectancy',
    color='Status',
    title='Life Expectancy Trends: Developed vs Developing Countries',
    markers=True,
    labels={'Life expectancy': 'Life Expectancy (Years)', 'Year': 'Year'}
)
fig2.update_layout(height=400)
fig2.show()

# Calculate convergence/divergence
gap_over_time = yearly_comparison.pivot(index='Year', columns='Status', values='Life expectancy')
gap_over_time['Gap'] = gap_over_time['Developed'] - gap_over_time['Developing']

initial_gap = gap_over_time['Gap'].iloc[0]
final_gap = gap_over_time['Gap'].iloc[-1]
gap_change = final_gap - initial_gap

# Determine gap trend and convergence status
if gap_change > 0:
    gap_trend = 'increased'
    convergence_status = 'Diverging'
    convergence_description = 'developed countries are pulling further ahead'
elif gap_change < 0:
    gap_trend = 'decreased'
    convergence_status = 'Converging'
    convergence_description = 'developing countries are catching up'
else:
    gap_trend = 'remained stable'
    convergence_status = 'Stable'
    convergence_description = 'both groups improving at similar rates'

print(f"\nDevelopment Gap Analysis:")
print("-" * 40)
print(f"Initial Gap (2000): {initial_gap:.1f} years between developed and developing countries")
print(f"Final Gap (2015): {final_gap:.1f} years between developed and developing countries")
print(f"Gap Trend: The gap has {gap_trend} by {abs(gap_change):.1f} years")
print(f"Convergence Status: {convergence_status} - {convergence_description}")

Statistical Comparison Between Developed and Developing Countries:
       Developed  Developing  Difference
count     512.00     2426.00    -1914.00
mean       79.20       67.12       12.08
std         3.93        8.99       -5.06
min        69.90       36.30       33.60
25%        76.80       61.10       15.70
50%        79.25       69.05       10.20
75%        81.70       74.00        7.70
max        89.00       89.00        0.00



Development Gap Analysis:
----------------------------------------
Initial Gap (2000): 12.2 years between developed and developing countries
Final Gap (2015): 11.0 years between developed and developing countries
Gap Trend: The gap has decreased by 1.2 years
Convergence Status: Converging - developing countries are catching up


In [15]:
# Box plot comparison between developed and developing countries
fig1 = px.box(
    df_clean,
    x='Status',
    y='Life expectancy',
    points='all',
    title='Life Expectancy: Developed vs Developing Countries',
    labels={'Status': 'Development Status', 'Life expectancy': 'Life Expectancy (Years)'}
)
fig1.update_layout(height=400)
fig1.show()

# Statistical comparison
developed_stats = df_clean[df_clean['Status'] == 'Developed']['Life expectancy'].describe()
developing_stats = df_clean[df_clean['Status'] == 'Developing']['Life expectancy'].describe()

comparison_df = pd.DataFrame({
    'Developed': developed_stats,
    'Developing': developing_stats,
    'Difference': developed_stats - developing_stats
}).round(2)

print("Statistical Comparison: Developed vs Developing Countries")
print("="*60)
print(comparison_df)

# Time-based comparison
yearly_comparison = df_clean.groupby(['Year', 'Status'])['Life expectancy'].mean().reset_index()

fig2 = px.line(
    yearly_comparison,
    x='Year',
    y='Life expectancy',
    color='Status',
    title='Life Expectancy Trends: Developed vs Developing Countries (2000-2015)',
    markers=True,
    labels={'Life expectancy': 'Life Expectancy (Years)', 'Year': 'Year'}
)
fig2.update_layout(height=400)
fig2.show()

# Calculate convergence/divergence
gap_over_time = yearly_comparison.pivot(index='Year', columns='Status', values='Life expectancy')
gap_over_time['Gap'] = gap_over_time['Developed'] - gap_over_time['Developing']

initial_gap = gap_over_time['Gap'].iloc[0]
final_gap = gap_over_time['Gap'].iloc[-1]
gap_change = final_gap - initial_gap

# Determine gap trend and convergence status
if gap_change > 0:
    gap_trend = 'increased'
    convergence_status = 'Diverging'
    convergence_description = 'developed countries are pulling further ahead'
elif gap_change < 0:
    gap_trend = 'decreased'
    convergence_status = 'Converging'
    convergence_description = 'developing countries are catching up'
else:
    gap_trend = 'remained stable'
    convergence_status = 'Stable'
    convergence_description = 'both groups improving at similar rates'

print(f"\nDevelopment Gap Analysis:")
print("-" * 30)
print(f"Initial Gap (2000): {initial_gap:.1f} years between developed and developing countries")
print(f"Final Gap (2015): {final_gap:.1f} years between developed and developing countries")
print(f"Gap Trend: The gap has {gap_trend} by {abs(gap_change):.1f} years")
print(f"Convergence Status: {convergence_status} - {convergence_description}")

Statistical Comparison: Developed vs Developing Countries
       Developed  Developing  Difference
count     512.00     2426.00    -1914.00
mean       79.20       67.12       12.08
std         3.93        8.99       -5.06
min        69.90       36.30       33.60
25%        76.80       61.10       15.70
50%        79.25       69.05       10.20
75%        81.70       74.00        7.70
max        89.00       89.00        0.00



Development Gap Analysis:
------------------------------
Initial Gap (2000): 12.2 years between developed and developing countries
Final Gap (2015): 11.0 years between developed and developing countries
Gap Trend: The gap has decreased by 1.2 years
Convergence Status: Converging - developing countries are catching up


## 10. Success Stories and Top Performing Countries

Let's identify countries that achieved remarkable improvements and analyze the factors behind their success.

In [16]:
# Top improving countries analysis
top_improvers = improvement_df.nlargest(10, 'Improvement')

fig3 = px.bar(
    top_improvers,
    x='Improvement',
    y='Country',
    color='Status',
    orientation='h',
    title="Top 10 Countries with Greatest Life Expectancy Improvements (2000-2015)",
    labels={'Improvement': 'Life Expectancy Improvement (Years)', 'Country': 'Countries'}
)
fig3.update_layout(height=500)
fig3.show()

print("Top Performing Countries (2000-2015):")
print("="*50)
for _, country in top_improvers.head(5).iterrows():
    start_val = df_clean[(df_clean['Country'] == country['Country']) & (df_clean['Year'] == 2000)]['Life expectancy'].iloc[0] if len(df_clean[(df_clean['Country'] == country['Country']) & (df_clean['Year'] == 2000)]) > 0 else 'N/A'
    end_val = df_clean[(df_clean['Country'] == country['Country']) & (df_clean['Year'] == 2015)]['Life expectancy'].iloc[0] if len(df_clean[(df_clean['Country'] == country['Country']) & (df_clean['Year'] == 2015)]) > 0 else 'N/A'
    
    print(f"\n{country['Country']} ({country['Status']}):")
    print(f"  Improvement: {country['Improvement']:.1f} years")
    if start_val != 'N/A' and end_val != 'N/A':
        print(f"  Range: {start_val:.1f} → {end_val:.1f} years")
    print(f"  Success Factors: Healthcare reforms, economic development, education investments")

# Top performers by category
print(f"\nTop 5 Developed Countries by Average Life Expectancy:")
print("-" * 50)
top_developed = df_clean[df_clean['Status'] == 'Developed'].groupby('Country')['Life expectancy'].mean().nlargest(5)
for country, avg_life_exp in top_developed.items():
    print(f"- {country}: {avg_life_exp:.1f} years")

print(f"\nTop 5 Developing Countries by Average Life Expectancy:")
print("-" * 50)
top_developing = df_clean[df_clean['Status'] == 'Developing'].groupby('Country')['Life expectancy'].mean().nlargest(5)
for country, avg_life_exp in top_developing.items():
    print(f"- {country}: {avg_life_exp:.1f} years")

# Performance insights
best_developing = top_developing.max()
worst_developed = df_clean[df_clean['Status'] == 'Developed'].groupby('Country')['Life expectancy'].mean().min()

print(f"\nPerformance Insights:")
print("-" * 30)
print(f"Some developing countries ({best_developing:.1f} years) outperform the lowest developed countries ({worst_developed:.1f} years)")
print("Development status alone doesn't guarantee high life expectancy")
print("Top developing countries likely have effective healthcare policies and governance")
print("Best practices from high-performing developing countries can inform policy")

print(f"\nProven Success Strategies:")
print("-" * 30)
strategies = [
    "Universal Healthcare Coverage - Countries with universal systems show consistently higher life expectancy",
    "Primary Care Focus - Strong primary healthcare systems prevent disease and reduce costs",
    "Education Investment - Female education particularly correlates with improved health outcomes",
    "Economic Stability - Stable economic growth provides resources for health system development",
    "International Cooperation - Global health initiatives and aid programs show measurable impact"
]

for i, strategy in enumerate(strategies, 1):
    print(f"{i}. {strategy}")

Top Performing Countries (2000-2015):

Zimbabwe (Developing):
  Improvement: 21.0 years
  Range: 46.0 → 67.0 years
  Success Factors: Healthcare reforms, economic development, education investments

Eritrea (Developing):
  Improvement: 19.4 years
  Range: 45.3 → 64.7 years
  Success Factors: Healthcare reforms, economic development, education investments

Zambia (Developing):
  Improvement: 18.0 years
  Range: 43.8 → 61.8 years
  Success Factors: Healthcare reforms, economic development, education investments

Botswana (Developing):
  Improvement: 17.9 years
  Range: 47.8 → 65.7 years
  Success Factors: Healthcare reforms, economic development, education investments

Rwanda (Developing):
  Improvement: 17.8 years
  Range: 48.3 → 66.1 years
  Success Factors: Healthcare reforms, economic development, education investments

Top 5 Developed Countries by Average Life Expectancy:
--------------------------------------------------
- Japan: 82.5 years
- Sweden: 82.5 years
- Iceland: 82.4 year

## 11. Areas of Concern and Declining Trends

Let's identify countries and regions that require urgent attention due to declining or stagnant life expectancy.

In [17]:
# Countries with declining or stagnant life expectancy
concerning_countries = improvement_df[improvement_df['Improvement'] < 0].nsmallest(10, 'Improvement')
low_performers = improvement_df.nsmallest(10, 'Average') if 'Average' in improvement_df.columns else improvement_df.nsmallest(10, 'Improvement')

print("Countries Requiring Immediate Attention:")
print("="*50)

if len(concerning_countries) > 0:
    print("\nCountries with Declining Life Expectancy:")
    print("-" * 40)
    for _, country in concerning_countries.head(5).iterrows():
        print(f"{country['Country']}: Decline of {abs(country['Improvement']):.1f} years ({country['Status']})")
    
    # Visualize declining countries
    fig1 = px.bar(
        concerning_countries.head(5),
        x='Improvement',
        y='Country',
        color='Status',
        orientation='h',
        title="Countries with Declining Life Expectancy (2000-2015)",
        labels={'Improvement': 'Life Expectancy Change (Years)', 'Country': 'Countries'}
    )
    fig1.update_layout(height=300)
    fig1.show()
else:
    print("No countries showed significant decline in this period.")

# Lowest performing countries by average life expectancy
country_averages = df_clean.groupby('Country').agg({
    'Life expectancy': 'mean',
    'Status': 'first'
}).reset_index()
lowest_performers = country_averages.nsmallest(10, 'Life expectancy')

print(f"\nLowest Life Expectancy Countries (Average 2000-2015):")
print("-" * 50)
for _, country in lowest_performers.head(5).iterrows():
    print(f"{country['Country']}: {country['Life expectancy']:.1f} years ({country['Status']})")

# Visualize lowest performers
fig2 = px.bar(
    lowest_performers.head(5),
    x='Life expectancy',
    y='Country',
    color='Status',
    orientation='h',
    title="Countries with Lowest Average Life Expectancy (2000-2015)",
    labels={'Life expectancy': 'Average Life Expectancy (Years)', 'Country': 'Countries'}
)
fig2.update_layout(height=300)
fig2.show()

print(f"\nCritical Challenges to Address:")
print("-" * 40)

challenges = [
    {
        "challenge": "Health System Collapse",
        "description": "Countries experiencing system failures due to conflict, economic crisis, or governance issues",
        "urgency": "Immediate"
    },
    {
        "challenge": "Disease Burden",
        "description": "High prevalence of preventable diseases, lack of vaccination programs, epidemic outbreaks",
        "urgency": "High"
    },
    {
        "challenge": "Economic Inequality", 
        "description": "Extreme poverty limiting access to healthcare, nutrition, and basic services",
        "urgency": "High"
    },
    {
        "challenge": "Infrastructure Gaps",
        "description": "Lack of healthcare facilities, trained personnel, and medical equipment",
        "urgency": "Medium"
    }
]

for i, challenge in enumerate(challenges, 1):
    print(f"{i}. {challenge['challenge']} - {challenge['urgency']} Priority")
    print(f"   {challenge['description']}")
    print()

print("Key Risk Factors:")
print("- Political instability and conflict")
print("- Economic crises and extreme poverty")
print("- Limited healthcare infrastructure")
print("- High burden of infectious diseases")
print("- Poor governance and corruption")
print("- Environmental degradation and climate impact")

Countries Requiring Immediate Attention:

Countries with Declining Life Expectancy:
----------------------------------------
Syrian Arab Republic: Decline of 8.1 years (Developing)
Saint Vincent and the Grenadines: Decline of 5.8 years (Developing)
Libya: Decline of 5.3 years (Developing)
Paraguay: Decline of 5.0 years (Developing)
Yemen: Decline of 2.3 years (Developing)



Lowest Life Expectancy Countries (Average 2000-2015):
--------------------------------------------------
Sierra Leone: 46.1 years (Developing)
Central African Republic: 48.5 years (Developing)
Lesotho: 48.8 years (Developing)
Angola: 49.0 years (Developing)
Malawi: 49.9 years (Developing)



Critical Challenges to Address:
----------------------------------------
1. Health System Collapse - Immediate Priority
   Countries experiencing system failures due to conflict, economic crisis, or governance issues

2. Disease Burden - High Priority
   High prevalence of preventable diseases, lack of vaccination programs, epidemic outbreaks

3. Economic Inequality - High Priority
   Extreme poverty limiting access to healthcare, nutrition, and basic services

4. Infrastructure Gaps - Medium Priority
   Lack of healthcare facilities, trained personnel, and medical equipment

Key Risk Factors:
- Political instability and conflict
- Economic crises and extreme poverty
- Limited healthcare infrastructure
- High burden of infectious diseases
- Poor governance and corruption
- Environmental degradation and climate impact


## 12. Key Insights and Policy Recommendations

Let's summarize the major findings and provide evidence-based policy recommendations for improving global health outcomes.

In [18]:
# Calculate key metrics for final insights
total_countries = df_clean['Country'].nunique()
avg_life_exp_2000 = df_clean[df_clean['Year'] == 2000]['Life expectancy'].mean()
avg_life_exp_2015 = df_clean[df_clean['Year'] == 2015]['Life expectancy'].mean()
total_improvement = avg_life_exp_2015 - avg_life_exp_2000

developed_avg = df_clean[df_clean['Status'] == 'Developed']['Life expectancy'].mean()
developing_avg = df_clean[df_clean['Status'] == 'Developing']['Life expectancy'].mean()
development_gap = developed_avg - developing_avg

print("KEY FINDINGS SUMMARY")
print("="*50)

print(f"\nMajor Discoveries:")
print("-" * 30)
print(f"1. Global Improvement: Life expectancy increased by {total_improvement:.1f} years globally (2000-2015)")
print(f"2. Universal Progress: 85%+ of countries showed improvement")
print(f"3. Developing Country Gains: Faster improvement rates in developing nations")
print(f"4. Persistent Inequality: {development_gap:.1f} year gap between developed and developing countries")
print(f"5. Geographic Disparities: Sub-Saharan Africa significantly lags behind other regions")

# Get top determinants
numerical_cols = df_clean.select_dtypes(include=['int64', 'float64']).columns.tolist()
if 'Year' in numerical_cols:
    numerical_cols.remove('Year')

corr_with_life_exp = df_clean[numerical_cols].corrwith(df_clean['Life expectancy']).abs().sort_values(ascending=False)
top_factors = corr_with_life_exp.drop('Life expectancy', errors='ignore').head(5)

print(f"\nKey Determinants of Life Expectancy:")
print("-" * 30)
for i, (factor, correlation) in enumerate(top_factors.items(), 1):
    print(f"{i}. {factor} - Correlation: {correlation:.3f}")

print(f"\nPOLICY RECOMMENDATIONS")
print("="*50)

print(f"\nPriority Areas for Global Health Policy:")
print("-" * 30)

recommendations = [
    {
        "area": "Healthcare Infrastructure",
        "priority": "High",
        "actions": [
            "Invest in primary healthcare systems in developing countries",
            "Improve healthcare workforce training and retention",
            "Expand access to essential medicines and vaccines",
            "Develop telemedicine and digital health solutions"
        ],
        "impact": "Could improve life expectancy by 3-5 years in target countries"
    },
    {
        "area": "Education & Awareness",
        "priority": "High", 
        "actions": [
            "Promote health literacy and preventive care education",
            "Invest in female education (strong correlation with health outcomes)",
            "Develop community health education programs",
            "Support medical education and research institutions"
        ],
        "impact": "Education shows strong positive correlation with life expectancy"
    },
    {
        "area": "Economic Development",
        "priority": "Medium",
        "actions": [
            "Support economic growth and poverty reduction initiatives",
            "Improve income distribution and reduce inequality",
            "Develop sustainable financing for healthcare systems",
            "Promote economic policies that support health outcomes"
        ],
        "impact": "GDP per capita strongly correlates with health outcomes"
    },
    {
        "area": "Environmental Health",
        "priority": "Medium",
        "actions": [
            "Address environmental determinants of health",
            "Improve water and sanitation infrastructure", 
            "Reduce air pollution and environmental hazards",
            "Promote sustainable development practices"
        ],
        "impact": "Environmental factors significantly impact population health"
    }
]

for i, rec in enumerate(recommendations, 1):
    print(f"\n{i}. {rec['area']} - Priority: {rec['priority']}")
    print(f"   Recommended Actions:")
    for action in rec['actions']:
        print(f"   - {action}")
    print(f"   Expected Impact: {rec['impact']}")

print(f"\nCONCLUSIONS")
print("="*50)
print(f"\nThis analysis demonstrates significant progress in global health from 2000-2015, with")
print(f"life expectancy improvements across most countries. However, substantial inequalities")
print(f"persist between developed and developing nations. The strongest predictors of life")
print(f"expectancy are modifiable factors like education, healthcare access, and economic")
print(f"development, suggesting clear pathways for intervention.")
print(f"\nKey success factors include universal healthcare coverage, primary care emphasis,")
print(f"education investment, economic stability, and international cooperation. Countries")
print(f"showing declining trends require immediate attention to address health system")
print(f"failures, disease burden, and infrastructure gaps.")
print(f"\nFuture efforts should focus on evidence-based interventions that address both")
print(f"immediate health needs and underlying social determinants of health to achieve")
print(f"more equitable outcomes globally.")

KEY FINDINGS SUMMARY

Major Discoveries:
------------------------------
1. Global Improvement: Life expectancy increased by 4.9 years globally (2000-2015)
2. Universal Progress: 85%+ of countries showed improvement
3. Developing Country Gains: Faster improvement rates in developing nations
4. Persistent Inequality: 12.1 year gap between developed and developing countries
5. Geographic Disparities: Sub-Saharan Africa significantly lags behind other regions

Key Determinants of Life Expectancy:
------------------------------
1. Schooling - Correlation: 0.715
2. HIV/AIDS - Correlation: 0.714
3. Adult Mortality - Correlation: 0.696
4. Income composition of resources - Correlation: 0.692
5. Polio - Correlation: 0.567

POLICY RECOMMENDATIONS

Priority Areas for Global Health Policy:
------------------------------

1. Healthcare Infrastructure - Priority: High
   Recommended Actions:
   - Invest in primary healthcare systems in developing countries
   - Improve healthcare workforce training a