# WHO Life Expectancy Data Analysis

This notebook analyzes the WHO Life Expectancy dataset to identify the factors that influence life expectancy and to predict it from socio-economic and health indicators.

## Objectives
1.  **Data Understanding**: Explore the dataset structure and quality.
2.  **Exploratory Data Analysis (EDA)**: Visualize relationships between variables.
3.  **Linear Regression Analysis**: Predict life expectancy using key features.
4.  **K-Means Clustering Analysis**: Group countries based on development indicators.

## 1. Setup and Data Loading

Import the required libraries and load the dataset.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans
from sklearn.metrics import r2_score, mean_squared_error

# Set plot style
sns.set_theme(style="whitegrid")

In [None]:
# Load the dataset
df = pd.read_csv('../data/life-exp-data.csv')
df.head()

## 2. Data Understanding

Checking for missing values, data types, and statistical summary to understand the data quality and distribution.

In [None]:
# Check for missing values
print("Missing Values:\n", df.isnull().sum())

# Check data types
print("\nData Types:\n", df.dtypes)

In [None]:
# Summary statistics
df.describe()
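
Raw missing-value counts are easier to judge as a share of all rows. The sketch below shows one way to build such a report. The `demo` frame and the `missing_report` helper are hypothetical stand-ins with made-up values, not part of the notebook's dataset; in the notebook the helper would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the notebook's df; the values are invented.
demo = pd.DataFrame({
    "Life_expectancy": [65.0, 72.5, np.nan, 80.1],
    "GDP_per_capita": [1200.0, np.nan, np.nan, 42000.0],
    "Schooling": [8.1, 10.4, 12.0, 13.5],
})


def missing_report(frame):
    """Return per-column missing counts and percentages, worst column first."""
    report = pd.DataFrame({
        "missing": frame.isnull().sum(),
        "percent": frame.isnull().mean() * 100,
    })
    return report.sort_values("percent", ascending=False)


report = missing_report(demo)
print(report)
```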

## 3. Exploratory Data Analysis (EDA)

Visualizing correlations between variables to identify potential predictors for life expectancy.

In [None]:
# Correlation Heatmap
plt.figure(figsize=(12, 10))
numeric_df = df.select_dtypes(include='number')  # all numeric columns, whatever their bit width
sns.heatmap(numeric_df.corr(), annot=False, cmap='coolwarm')
plt.title("Correlation Matrix")
plt.show()
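
A heatmap without annotations makes it hard to read off which variables track life expectancy most closely. One option is to pull out the target's column of the correlation matrix and rank it by absolute value. This is a minimal sketch on synthetic data: the `demo` frame and its coefficients are assumptions chosen for illustration, and only the column names mirror the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the notebook's df; the relationship below is invented.
rng = np.random.default_rng(0)
n = 200
schooling = rng.uniform(4, 16, n)
adult_mortality = rng.uniform(50, 400, n)
demo = pd.DataFrame({
    "Schooling": schooling,
    "Adult_mortality": adult_mortality,
    "Life_expectancy": 60 + 1.2 * schooling - 0.03 * adult_mortality
                       + rng.normal(0, 2, n),
})

# Correlation of every numeric column with the target, strongest first.
numeric = demo.select_dtypes(include="number")
target_corr = (
    numeric.corr()["Life_expectancy"]
    .drop("Life_expectancy")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(target_corr)
```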

## 4. Linear Regression Analysis

We will build a Linear Regression model to predict **Life Expectancy** based on selected features: **GDP per capita**, **Schooling**, and **Adult Mortality**.

In [None]:
# Select features and target
selected_features = ['GDP_per_capita', 'Schooling', 'Adult_mortality']
X = df[selected_features]
y = df['Life_expectancy']

# Train model
model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)

# Calculate metrics on the training data (in-sample fit, not a held-out estimate)
r2 = r2_score(y, y_pred)
mse = mean_squared_error(y, y_pred)

print(f"R² Score: {r2:.4f}")
print(f"Mean Squared Error: {mse:.4f}")
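
The metrics above are computed on the same rows the model was trained on, so they describe fit rather than predictive performance. A held-out test set gives a more honest estimate. The sketch below uses scikit-learn's `train_test_split` on synthetic data; the `demo` frame, the 80/20 split, and the generating coefficients are assumptions for illustration, not the notebook's results.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's df; the coefficients are invented.
rng = np.random.default_rng(42)
n = 300
demo = pd.DataFrame({
    "GDP_per_capita": rng.uniform(500, 60000, n),
    "Schooling": rng.uniform(4, 16, n),
    "Adult_mortality": rng.uniform(50, 400, n),
})
demo["Life_expectancy"] = (
    55
    + 0.0001 * demo["GDP_per_capita"]
    + 1.0 * demo["Schooling"]
    - 0.03 * demo["Adult_mortality"]
    + rng.normal(0, 2, n)
)

features = ["GDP_per_capita", "Schooling", "Adult_mortality"]

# Hold out 20% of the rows; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    demo[features], demo["Life_expectancy"], test_size=0.2, random_state=42
)

holdout_model = LinearRegression().fit(X_train, y_train)
y_test_pred = holdout_model.predict(X_test)

test_r2 = r2_score(y_test, y_test_pred)
test_mse = mean_squared_error(y_test, y_test_pred)
print(f"Test R²: {test_r2:.4f}")
print(f"Test MSE: {test_mse:.4f}")
```

If the dataset holds several yearly rows per country, a random row split can place the same country in both sets; splitting by country may be a fairer test in that case.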

In [None]:
# Visualize Actual vs Predicted
plt.figure(figsize=(10, 6))
sns.scatterplot(x=y, y=y_pred, alpha=0.5)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--', lw=2)  # y = x reference line (perfect prediction)
plt.xlabel("Actual Life Expectancy")
plt.ylabel("Predicted Life Expectancy")
plt.title("Actual vs Predicted Life Expectancy")
plt.show()

In [None]:
# Feature Coefficients
coef_df = pd.DataFrame({'Feature': selected_features, 'Coefficient': model.coef_})
coef_df.sort_values(by='Coefficient', ascending=False)
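
Raw coefficients are expressed per unit of each feature, so a feature measured in dollars and one measured in years are not directly comparable. Standardizing the features first puts the coefficients on a common scale (change in life expectancy per one standard deviation). The following is a sketch under assumptions: the `demo` data is synthetic, and the pipeline is one possible approach rather than the notebook's method.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's df; the coefficients are invented.
rng = np.random.default_rng(42)
n = 300
demo = pd.DataFrame({
    "GDP_per_capita": rng.uniform(500, 60000, n),
    "Schooling": rng.uniform(4, 16, n),
    "Adult_mortality": rng.uniform(50, 400, n),
})
demo["Life_expectancy"] = (
    55
    + 0.0001 * demo["GDP_per_capita"]
    + 1.0 * demo["Schooling"]
    - 0.03 * demo["Adult_mortality"]
    + rng.normal(0, 2, n)
)

features = ["GDP_per_capita", "Schooling", "Adult_mortality"]

# Scale each feature to zero mean and unit variance, then fit the regression.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(demo[features], demo["Life_expectancy"])

# make_pipeline names each step after its lowercased class name.
std_coefs = pd.Series(
    pipe.named_steps["linearregression"].coef_, index=features
).sort_values(key=lambda s: s.abs(), ascending=False)
print(std_coefs)
```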

## 5. K-Means Clustering Analysis

We will group countries into clusters based on **GDP per capita** and **Life Expectancy** to identify patterns in development.

In [None]:
# Clustering parameters
cluster_features = ['GDP_per_capita', 'Life_expectancy']
k = 3

# Perform K-Means Clustering
X_cluster = df[cluster_features]
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
df['Cluster'] = kmeans.fit_predict(X_cluster)

# Visualize Clusters
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='GDP_per_capita', y='Life_expectancy', hue='Cluster', palette='viridis')
plt.title(f"Clustering: GDP per capita vs Life Expectancy (k={k})")
plt.show()
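
K-Means assigns points by Euclidean distance, so a feature with a much larger numeric range (here GDP per capita, in the thousands) can dominate one with a small range (life expectancy, in tens of years). Standardizing both features before clustering lets each contribute comparably. The sketch below shows this with `StandardScaler` on synthetic two-group data; the `demo` frame and its group means are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's df: two invented, well-separated groups.
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "GDP_per_capita": np.concatenate(
        [rng.normal(2000, 500, 50), rng.normal(40000, 5000, 50)]
    ),
    "Life_expectancy": np.concatenate(
        [rng.normal(60, 3, 50), rng.normal(80, 2, 50)]
    ),
})

cluster_features = ["GDP_per_capita", "Life_expectancy"]

# Rescale both features to zero mean and unit variance before clustering.
scaled = StandardScaler().fit_transform(demo[cluster_features])

kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
demo["Cluster"] = kmeans.fit_predict(scaled)

# Report cluster means in the original units for readability.
print(demo.groupby("Cluster")[cluster_features].mean())
```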

In [None]:
# Cluster Statistics
df.groupby('Cluster')[cluster_features].mean()
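
The choice of `k = 3` is fixed by hand. One common way to check it is to compare silhouette scores across several values of `k` and prefer the highest. This is a minimal sketch on synthetic data with three invented groups; the centers, spreads, and the range of `k` tried are assumptions, and the outcome on the real dataset may differ.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Three invented groups of (GDP per capita, life expectancy) points.
rng = np.random.default_rng(7)
centers = [(2000, 55), (20000, 70), (40000, 82)]
spreads = [(800, 1.5), (2000, 1.5), (3000, 1.5)]
points = np.vstack([
    np.column_stack([rng.normal(gdp, gdp_sd, 60), rng.normal(le, le_sd, 60)])
    for (gdp, le), (gdp_sd, le_sd) in zip(centers, spreads)
])

scaled = StandardScaler().fit_transform(points)

# Silhouette score for each candidate k (higher means tighter, better-separated clusters).
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(scaled)
    scores[k] = silhouette_score(scaled, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print(f"Best k by silhouette: {best_k}")
```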