# Learning Objectives
## In this session you will:

1. Data Import and Loading
- Load the dataset and inspect its structure.

2. Exploratory Data Analysis (EDA) and Preprocessing
- Check and remove missing values.
- Detect and handle outliers; scale numerical data if needed.
- Visualize data distributions using histograms and boxplots.

3. Basic statistics
- Perform a two-sample t-test.
- Use the p-value to test a hypothesis.

4. Correlation and Linear Modeling
- Calculate pairwise correlations and visualize them using a heatmap.
- Fit a linear regression model to study relationships between features and the target variable, and use it to predict new data. Evaluate model performance using R².


# Introduction to the dataset

The dataset we will investigate today is the WHO world life expectancy data, adapted from https://www.kaggle.com/datasets/vikramamin/life-expectancy-who.

About this file
The CSV file contains 22 variables and 2938 rows. It covers the life expectancy of different countries from 2000 to 2015. The columns are:

1. Country: Country name
2. Year: Year of the data
3. Status: Country status of developed or developing
4. Life_Expectancy: Life expectancy in years
5. Adult_Mortality: Adult mortality rate of both sexes (probability of dying between 15 and 60 years per 1000 population)
6. infant.deaths: Number of infant deaths per 1000 population
7. Alcohol: Alcohol, recorded per capita (15+) consumption (in litres of pure alcohol)
8. percentage.expenditure: Expenditure on health as a percentage of Gross Domestic Product per capita (%)
9. Hepatitis.B: Hepatitis B (HepB) immunization coverage among 1-year-olds (%)
10. Measles: Number of reported cases per 1000 population
11. BMI: Average Body Mass Index of entire population
12. under.five.deaths: Number of under-five deaths per 1000 population
13. Polio: Polio (Pol3) immunization coverage among 1-year-olds (%)
14. Total.expenditure: General government expenditure on health as a percentage of total government expenditure (%)
15. Diphtheria: Diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%)
16. HIV.AIDS: Deaths per 1000 live births due to HIV/AIDS (0-4 years)
17. GDP: Gross Domestic Product per capita (in USD)
18. Population: Population of the country
19. thinness..1.19.years: Prevalence of thinness among children and adolescents aged 10 to 19 (%)
20. thinness.5.9.years: Prevalence of thinness among children aged 5 to 9 (%)
21. Income.composition.of.resources: Human Development Index in terms of income composition of resources (index ranging from 0 to 1)
22. Schooling: Number of years of schooling (years)

# Download the dataset
Right-click the link below to download the dataset:
https://github.com/holab-hku/2025-HKU-Budding-Researcher-Programme/blob/main/life_expectancy_data_WHO.csv
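
As an alternative to downloading by hand, the file can probably be read straight from GitHub. The sketch below is only an illustration: the helper name `github_blob_to_raw` is hypothetical, and the raw-file address it builds assumes GitHub's usual `raw.githubusercontent.com` convention rather than anything stated in this notebook.

```python
def github_blob_to_raw(url: str) -> str:
    """Turn a github.com '/blob/' page URL into the matching raw-file URL.

    Assumption: the repository follows GitHub's usual raw-content layout.
    """
    return (
        url.replace("https://github.com/", "https://raw.githubusercontent.com/", 1)
           .replace("/blob/", "/", 1)
    )


blob_url = (
    "https://github.com/holab-hku/2025-HKU-Budding-Researcher-Programme"
    "/blob/main/life_expectancy_data_WHO.csv"
)
raw_url = github_blob_to_raw(blob_url)
print(raw_url)

# With internet access, pandas can usually read such a URL directly:
# import pandas as pd
# df = pd.read_csv(raw_url)
```

If the download fails, fall back to saving the file manually next to the notebook.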

# 1. Data Loading

In [None]:
# 1.1 Import and Load Data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from scipy.stats.mstats import winsorize
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
import os
%matplotlib inline

In [None]:
df = pd.read_csv('life_expectancy_data_WHO.csv') # read data using pandas

In [None]:
df.head() # check data is imported

In [None]:
df['Year'].dtype

In [None]:
df.Year

In [None]:
df.dtypes

# 2. Data Cleaning

In [None]:
# 2.1 Rename messy column names
df.columns

In [None]:
df.rename(columns={'Total expenditure':'total_expenditure'}, inplace=True)

In [None]:
df.rename(columns={'Life expectancy ':'life_expectancy'}, inplace=True)

In [None]:
df.columns

In [None]:
## 2.2 Check missing values
df.isna().sum()

In [None]:
## 2.3 Dealing with Missing Value: Remove NAs
df = df.dropna() # drops every row that has at least one NA; there are more sensible ways to handle missing values, covered in detail in this tutorial link:

In [None]:
df.isna().sum()
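
Dropping every row with a missing value is simple but can discard a lot of data. One common alternative is to fill numeric gaps with the column median. The sketch below uses a small made-up table (`toy`, with invented values) rather than the WHO data, and median imputation is just one possible choice, not the method used in this session.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "GDP": [100.0, np.nan, 300.0, 500.0],
    "Schooling": [10.0, 12.0, np.nan, 14.0],
})

# Fill each numeric column's gaps with that column's median
numeric_cols = toy.select_dtypes(include="number").columns
filled = toy.copy()
filled[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].median())

print(filled)
print(filled.isna().sum())  # no missing values remain, and no rows were lost
```

Unlike `dropna()`, this keeps all four rows, at the cost of inventing plausible values for the gaps.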

In [None]:
## 2.4: Removing Outliers

In [None]:
plt.boxplot(df['total_expenditure'])

In [None]:
sns.boxplot(y=df["total_expenditure"], color="skyblue", width=0.3)

In [None]:
sns.histplot(df["total_expenditure"], color="lightgreen")

In [None]:
def outlier_diagnostics(df, cols):
    records = []
    for col in cols:
        x = df[col].dropna()
        q1, q3 = x.quantile([0.25, 0.75])
        iqr = q3 - q1
        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        count = ((df[col] < lower) | (df[col] > upper)).sum()
        lower_frac = (x < lower).mean()
        upper_frac = (x > upper).mean()
        records.append({
            'Column': col,
            'Count': count,
            'Lower Limit': lower_frac,
            'Upper Limit': upper_frac,
            'Lower Limit (%)': round(lower_frac*100, 2),
            'Upper Limit (%)': round(upper_frac*100, 2),
            'Percentage of outliers': round(count / df[col].dropna().shape[0] * 100, 2)
        })
    return pd.DataFrame(records).sort_values('Upper Limit (%)', ascending=False).reset_index(drop=True)

In [None]:
# Example
cols = ["total_expenditure"]
diagnostics = outlier_diagnostics(df, cols)
print(diagnostics)
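
To see what the 1.5 × IQR rule inside `outlier_diagnostics` does, here is the same calculation on ten made-up numbers (the series `x` is invented for illustration): anything below Q1 - 1.5·IQR or above Q3 + 1.5·IQR is flagged.

```python
import pandas as pd

# Made-up values; 100 is an obvious outlier
x = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

q1, q3 = x.quantile([0.25, 0.75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = x[(x < lower) | (x > upper)]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"fences: [{lower}, {upper}]")
print("outliers:", outliers.tolist())
print("fraction above upper fence:", (x > upper).mean())
```

The fraction above the upper fence is the kind of value that can then be passed as an upper limit when winsorizing.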

In [None]:
from scipy.stats.mstats import winsorize

def winsorize_outlier(df, col, lower_limit=0, upper_limit=0, show_plot=True):
    # limits are the fractions of the data to clip at each tail (e.g. 0.05 = 5%);
    # winsorize returns a new array and leaves df unchanged
    wins_data = winsorize(df[col], limits=(lower_limit, upper_limit))
    print(wins_data)
    if show_plot:
        sns.boxplot(y=wins_data, color="skyblue", width=0.3)
        plt.title('wins=({},{}) {}'.format(lower_limit, upper_limit, col))
        plt.show()

In [None]:
winsorize_outlier(df, col="total_expenditure", upper_limit=0.004245, show_plot=True)
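
Winsorizing does not delete outliers; it replaces the most extreme values with the nearest value that is kept. A small sketch on made-up numbers (the array `data` is invented for illustration) shows the effect of an upper limit of 10%:

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Made-up data: the integers 1..10
data = np.arange(1, 11)

# Clip the top 10% of values (here: one value) down to the next largest value
clipped = winsorize(data, limits=(0, 0.1))

print("original :", data.tolist())
print("winsorized:", np.asarray(clipped).tolist())
```

Because the input is left unchanged, the winsorized values have to be assigned back to a column if they are meant to be used in later steps.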

# 3. Basic Statistical Analysis (t-test)

In [None]:
df['Status']

In [None]:
developing_country = (df.query("Status == 'Developing'")) # equivalent to df[df.Status == 'Developing']
developed_country = (df.query("Status == 'Developed'"))

In [None]:
mean_exp_developing_country = developing_country['life_expectancy'].mean()
mean_exp_developed_country = developed_country['life_expectancy'].mean()

In [None]:
significance = 0.05
# Welch's t-test (equal_var=False): compare developing vs. developed countries
t_test = stats.ttest_ind(a = developing_country['life_expectancy'], b = developed_country['life_expectancy'], equal_var = False )
t_test

In [None]:
p_value = t_test[1]
p_value

In [None]:
def two_sample_t_test(s, p):
    # s: significance level (e.g. 0.05); p: p-value from the t-test
    if p < s:
        return 'reject null hypothesis'
    else:
        return 'cannot reject null hypothesis'

In [None]:
two_sample_t_test(significance, p_value)
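
To make the test statistic less of a black box, the sketch below runs Welch's t-test on two small made-up samples (the values in `a` and `b` are invented) and recomputes t by hand as the difference in means divided by its standard error.

```python
import numpy as np
import scipy.stats as stats

# Made-up samples for illustration only
a = np.array([70.0, 72.0, 74.0, 76.0, 78.0])
b = np.array([60.0, 62.0, 64.0, 66.0, 68.0])

# Welch's t-test: does not assume equal variances in the two groups
result = stats.ttest_ind(a, b, equal_var=False)

# Same statistic by hand: (mean_a - mean_b) / sqrt(var_a/n_a + var_b/n_b)
se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
t_manual = (a.mean() - b.mean()) / se

print(f"t (scipy)  = {result.statistic:.3f}")
print(f"t (manual) = {t_manual:.3f}")
print(f"p-value    = {result.pvalue:.5f}")
print("reject H0 at 0.05?", result.pvalue < 0.05)
```

The null hypothesis here is that both groups have the same mean; a p-value below the chosen significance level is evidence against it.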

# 4. Advanced Statistics (correlation and linear regression)

In [None]:
# 4.1 Correlation

In [None]:
# compute correlation matrix
corr_matrix = df.corr(numeric_only=True)  # ensures only numeric cols

# create figure and plot
plt.figure(figsize=(16, 12))
sns.heatmap(corr_matrix, annot=True,cmap='coolwarm', annot_kws={"size": 12} )
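
A heatmap of all pairs can be hard to read. One way to focus on the target is to pull out a single column of the correlation matrix and sort it. The sketch below does this on a small made-up table (`toy`, invented values with perfectly linear relationships), so the numbers say nothing about the WHO data.

```python
import pandas as pd

# Made-up data: schooling rises with life expectancy, mortality falls
toy = pd.DataFrame({
    "life_expectancy": [60.0, 65.0, 70.0, 75.0],
    "schooling": [8.0, 10.0, 12.0, 14.0],
    "mortality": [300.0, 250.0, 200.0, 150.0],
})

corr = toy.corr(numeric_only=True)

# Correlation of every other column with the target, strongest positive first
with_target = (
    corr["life_expectancy"]
    .drop("life_expectancy")
    .sort_values(ascending=False)
)
print(with_target)
```

Values near +1 or -1 indicate a strong linear relationship; values near 0 indicate little linear relationship.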


In [None]:
## 4.2 Linear regression modelling
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

X = df[['total_expenditure']]
y = df['life_expectancy']

model = LinearRegression()
model.fit(X, y)


print(f"R² = {model.score(X, y):.3f}") # R² is the share of the variance in y explained by the model (1 = perfect fit)

In [None]:
model.coef_ # slope b in y = a + b*x

In [None]:
model.intercept_ # intercept a in y = a + b*x

In [None]:
### Make prediction
new_sample = pd.DataFrame({
    'total_expenditure': [8]})

model.predict(new_sample)
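
A prediction from a one-feature linear model is just `a + b*x`, using the intercept and slope shown above. The sketch below checks this on made-up data that follows y = 1 + 2x exactly (the column name `x` and all values are invented for illustration).

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data lying exactly on the line y = 1 + 2x
X_toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]})
y_toy = pd.Series([3.0, 5.0, 7.0, 9.0])

toy_model = LinearRegression().fit(X_toy, y_toy)

a = toy_model.intercept_   # intercept
b = toy_model.coef_[0]     # slope

new_x = pd.DataFrame({"x": [8.0]})
pred = toy_model.predict(new_x)[0]
manual = a + b * 8.0

print(f"a = {a:.3f}, b = {b:.3f}")
print(f"model.predict -> {pred:.3f}")
print(f"a + b*x       -> {manual:.3f}")
print(f"R² = {toy_model.score(X_toy, y_toy):.3f}")
```

Real data will not lie exactly on a line, so R² will be below 1 and predictions should be read as estimates.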

In [None]:
# (Optional) Plot the fitted regression line over the data
plt.figure(figsize=(8, 6))

sns.regplot(
    data=df,
    x="total_expenditure",
    y="life_expectancy",
    scatter_kws={'s': 60, 'color': '#EF553B', 'alpha': 0.7, 'edgecolor': 'black'},
    line_kws={'color': '#636EFA', 'linewidth': 2},
)

plt.title("Total expenditure vs. Life Expectancy", fontsize=14, weight='bold')
plt.xlabel("Total expenditure", fontsize=12)
plt.ylabel("Life Expectancy (Years)", fontsize=12)

plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

In [None]:
## 4.3 Rank features by the R² of a single-feature linear regression
from sklearn.linear_model import LinearRegression
factors = ['Schooling', 'total_expenditure', 'percentage expenditure']

target = 'life_expectancy'

scores = []
for col in factors:
    X = df[[col]].dropna()
    y = df.loc[X.index, target]  # align target with non-missing X
    model = LinearRegression().fit(X, y)
    r2 = model.score(X, y)
    scores.append((col, r2))

# Sort descending by R²
ranked = pd.DataFrame(scores, columns=['Feature', 'R²']).sort_values('R²', ascending=False).reset_index(drop=True)
print(ranked)