

# **WHO PROJECT** 
### by Hena Naeem, Bekim Gerguri, Ginny Rendall, Ewen Henderson

 <p style="text-align:right;">
<img src="https://cdn.pixabay.com/photo/2020/03/30/16/40/who-4984801_1280.jpg"
     width="350" height="350" style="float: right; margin-right: 0px;" />
</p> 

## 'Prediction of life expectancies across countries.'

Section 2: Function File

<b>Models</b>
- The minimalist model that upholds the highest standards of data ethics - <b>"Sensitive Friendly Model"</b>
- The most accurate and elaborate predictive model - <b>"The Max Model"</b>

<b>Why our models?</b>
- How they align with WHO ethical standards
- Improvability, accuracy and efficiency of our models

## Importing Libraries

In [14]:
# Importing libraries

# Importing the following libraries for data manipulation and mathematical operations
import pandas as pd 
import numpy as np

# Importing the following libraries for data visualisation
import seaborn as sns
import matplotlib.pyplot as plt

# Importing the following libraries for statistical analysis
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
import statsmodels.tools
from sklearn.preprocessing import RobustScaler
from scipy.stats import zscore

## Ethical Message

In [2]:
# Prints a standard thank-you and framing message after user inputs

def print_ethics_message():
    print("\nThank you for providing those details.")
    print("This model uses health, education, and economic indicators to estimate life expectancy.")
    print("It’s designed to highlight disparities and support responsible, data-informed decision-making.\n")

## Input Validation Functions

In [8]:
# Prompt user to enter economy status
def get_valid_economy_status():                 
    while True:
        try:
            status = int(input("Enter economy status (1 for developed, 0 for not developed): "))
            if status in [0, 1]:
                return status
            else:
                print("Invalid entry. Please enter 1 or 0.")
        except ValueError:
            print("Please enter a numeric value (1 or 0).")

# Prompt user to select a valid region from a given list
def get_valid_region(regions):
    while True:
        region = input("Enter region (choose from: {}): ".format(", ".join(regions)))
        if region in regions:
            return region
        else:
            print("Invalid region. Please try again.")

# Prompt user to enter GDP per capita
def get_valid_GDP():
    while True:
        try:
            value = float(input("Please enter the GDP (in USD) per capita for the country you would like to know the average life expectancy for:"))
            if 0 < value:
                return value
            else:
                print("Please enter a value greater than 0.")
        except ValueError:
            print("Please enter a numeric value")
            
# Prompt user to enter thinness prevalence
def get_valid_thinness():
    while True:
        try:
            value = float(input("Enter thinness prevalence (10–19 yrs) in % (typical range 0-30): "))
            if 0.0 <= value <= 100.0:
                return value
            else:
                print("Please enter a value between 0 and 100.")
        except ValueError:
            print("Please enter a numeric value.")

# Prompt user to enter average BMI(Body Mass Index)
def get_valid_BMI():
    while True:
        try:
            value = float(input("Please enter the average BMI for the country’s population. Typical range: (15-35):"))
            if 0 < value:
                return value
            else:
                print("Please enter a value greater than 0.")
        except ValueError:
            print("Please enter a numeric value")

# Prompt user to enter adult mortality rate
def get_valid_adult_mortality():
    while True:
        try:
            value = float(input("Enter adult mortality rate (per 1,000 people aged 15–60). Typical range: (40–800): "))
            if 0 <= value <= 1000:
                return value
            else:
                print("Please enter a value between 0 and 1000.")
        except ValueError:
            print("Please enter a numeric value.")

# Prompt user to enter under-five death rate
def get_valid_under_five_deaths():
    while True:
        try:
            value = float(input("Enter under-five deaths per 1,000 population (typical range: 2–250): "))
            if 0 <= value <= 1000:
                return value
            else:
                print("Please enter a value between 0 and 1000.")
        except ValueError:
            print("Please enter a numeric value.")


# Prompt user to enter HIV incidents
def get_valid_hiv_incidents():
    while True:
        try:
            value = float(input("Enter number of HIV cases per 1,000 people (typical range: 0–25): "))
            if 0 <= value <= 1000:
                return value
            else:
                print("Please enter a value between 0 and 1000.")
        except ValueError:
            print("Please enter a numeric value.")

# Prompt user to enter average years of schooling
def get_valid_schooling_years():
    while True:
        try:
            value = float(input("Enter average years of schooling (typical range: 1–15): "))
            if 0 <= value <= 25:
                return value
            else:
                print("Please enter a value between 0 and 25.")
        except ValueError:
            print("Please enter a numeric value.")
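
The numeric validators above all follow the same pattern: prompt, convert to `float`, check a range, and repeat on failure. A minimal sketch of how that pattern could be factored into one helper is shown below. The name `prompt_float` and its parameters are hypothetical and are not part of the project code; the `read` argument is an assumption added so the helper can be exercised without keyboard input.

```python
from typing import Callable, Optional


def prompt_float(
    prompt: str,
    minimum: float,
    maximum: Optional[float] = None,
    include_minimum: bool = True,
    read: Callable[[str], str] = input,
) -> float:
    """Keep asking until the reply is a number inside the accepted range."""
    while True:
        try:
            value = float(read(prompt))
        except ValueError:
            print("Please enter a numeric value.")
            continue
        # Lower bound can be inclusive (e.g. mortality >= 0) or exclusive (e.g. GDP > 0)
        above = value >= minimum if include_minimum else value > minimum
        below = maximum is None or value <= maximum
        if above and below:
            return value
        print("Value out of range. Please try again.")


# Scripted replies stand in for keyboard input: two rejected entries, then a valid one
replies = iter(["abc", "-5", "55"])
adult_mortality = prompt_float(
    "Adult mortality (0-1000): ", 0, 1000, read=lambda _: next(replies)
)
print(adult_mortality)
```

With a helper like this, each validator would reduce to a single call that states its prompt and accepted range, which keeps the error message and the range check from drifting apart.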


## Sensitive Friendly Model

In [9]:
def min_model():
    # Load life expectancy dataset from CSV file into a DataFrame
    df = pd.read_csv('Life Expectancy Data.csv')

    # Identify outlier rows (4 or more columns with |z-score| > 3)
    z_scores = zscore(df.drop(columns = ['Country', 'Region']))
    outlier_counts = (np.abs(z_scores) > 3).sum(axis=1)
    outliers = df[outlier_counts >= 4]

    # Drop outliers
    df_out = df.drop(index = outliers.index)
    
    X_m = df_out[['GDP_per_capita', 'Economy_status_Developed', 'Region', 'Adult_mortality']].copy() # Feature columns
    y_m = df_out['Life_expectancy'] # Target column


    X_m['log_GDP'] = np.log(X_m.GDP_per_capita) # Apply log scale to GDP_per_capita as a new column
    X_m.drop(columns = ['GDP_per_capita'], inplace = True)  # Drop GDP_per_capita

    X_m = pd.get_dummies(X_m, columns = ['Region'], drop_first = True, dtype = int) # OHE 'Region'

    X_train_m, X_test_m, y_train_m, y_test_m = train_test_split(X_m, y_m, test_size = 0.2, random_state = 264) # Train-test split

    # Indexes and column names of the test & train data for post-scaling use
    columns_m = X_train_m.columns
    index_train_m = X_train_m.index
    index_test_m = X_test_m.index

    # Scale the columns
    rob_m = RobustScaler()
    X_train_m_scaled = rob_m.fit_transform(X_train_m)
    X_train_m_scaled = pd.DataFrame(X_train_m_scaled, columns = columns_m, index = index_train_m)

    X_test_m_scaled = rob_m.transform(X_test_m)
    X_test_m_scaled = pd.DataFrame(X_test_m_scaled, columns = columns_m, index = index_test_m)

    X_train_m_scaled = sm.add_constant(X_train_m_scaled)
    X_test_m_scaled = sm.add_constant(X_test_m_scaled)

    # Train and fit the model
    lm = sm.OLS(y_train_m, X_train_m_scaled)
    res = lm.fit()

    # Predict life expectancy and calculate RMSE
    y_test_pred_m = res.predict(X_test_m_scaled)
    rmse_m = sm.tools.eval_measures.rmse(y_test_m, y_test_pred_m)

    # Define list of valid regions for user input validation
    regions = ['Middle East', 'European Union', 'Asia', 'South America',
               'Central America and Caribbean', 'Rest of Europe', 'Africa',
               'Oceania', 'North America']
    
    # Collect validated user inputs for key life expectancy predictors
    GDP_per_capita = get_valid_GDP()                          
    Adult_mortality = get_valid_adult_mortality()
    Economy_status_Developed = get_valid_economy_status()
    Region = get_valid_region(regions)

    # Display ethical message after user input
    print_ethics_message() 
    
    # Create a dataframe of the same format as what is expected
    X_inputted = pd.DataFrame([{'Economy_status_Developed': Economy_status_Developed,'Adult_mortality': Adult_mortality,
                                'log_GDP': np.log(GDP_per_capita),'Region_Asia':0,
                                'Region_Central America and Caribbean':0, 'Region_European Union':0,
                                'Region_Middle East':0, 'Region_North America':0, 'Region_Oceania':0,
                                'Region_Rest of Europe':0, 'Region_South America':0}])
    
    # Set the dummy column for the chosen region
    # ('Africa' is the dropped baseline category, so all dummies stay 0 for it)
    if Region != 'Africa':
        X_inputted[f'Region_{Region}'] = 1


    # Scale inputted values
    columns_inputted = X_inputted.columns
    index_inputted = X_inputted.index
    X_inputted_scaled = rob_m.transform(X_inputted)
    X_inputted_scaled = pd.DataFrame(X_inputted_scaled, columns = columns_inputted, index = index_inputted)
    X_inputted_scaled = sm.add_constant(X_inputted_scaled, has_constant = 'add') # Add constant

    # Return prediction from the model
    life_expectancy = res.predict(X_inputted_scaled)
    print(f'The predicted average life expectancy is: {round(list(life_expectancy)[0],2)} ± {round(rmse_m, 2)} years')
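
The model function builds the `Region_*` dummy columns by hand so that the single user-supplied row matches the columns produced by `pd.get_dummies(..., drop_first = True)` during training. A minimal sketch of that encoding in plain Python follows. `REGIONS` and `encode_region` are hypothetical names used for illustration only, and the sketch assumes the dropped baseline is the alphabetically first region (`Africa`), which is consistent with the dummy columns listed in the function above.

```python
REGIONS = [
    "Africa", "Asia", "Central America and Caribbean", "European Union",
    "Middle East", "North America", "Oceania", "Rest of Europe", "South America",
]


def encode_region(region: str, regions=REGIONS) -> dict:
    """Return Region_* dummy values for one region, omitting the first (baseline) category."""
    if region not in regions:
        raise ValueError(f"Unknown region: {region!r}")
    # Sort the categories and skip the first one, mirroring drop_first
    non_baseline = sorted(regions)[1:]
    return {f"Region_{name}": int(name == region) for name in non_baseline}


middle_east = encode_region("Middle East")
africa = encode_region("Africa")

print(middle_east["Region_Middle East"])  # 1
print(sum(africa.values()))               # 0, because Africa is the baseline
```

Because the baseline region is represented by all zeros, its effect is absorbed into the model's intercept, and each `Region_*` coefficient is read relative to it.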

    

## Max Model

In [10]:
def max_model():
    # Load life expectancy dataset from CSV file into a DataFrame
    df = pd.read_csv('Life Expectancy Data.csv')

    # Identify outlier rows (4 or more columns with |z-score| > 3)
    z_scores = zscore(df.drop(columns = ['Country', 'Region']))
    outlier_counts = (np.abs(z_scores) > 3).sum(axis=1)
    outliers = df[outlier_counts >= 4]

    # Remove outliers
    df_out = df.drop(index = outliers.index)

    # Remove columns that have high correlation with other predictors OR are uncorrelated to the target
    df_out.drop(columns = ['Measles', 'Polio', 'Hepatitis_B', 'Diphtheria', 'Population_mln', 'Alcohol_consumption', 'Infant_deaths', 'Thinness_five_nine_years', 'Economy_status_Developing', 'Country', 'Year'], inplace = True)
    
    df_out['log_GDP'] = np.log(df_out.GDP_per_capita) # Apply log scale to GDP_per_capita as a new column
    df_out['log_BMI'] = np.log(df_out.BMI) # Apply log scale to BMI as a new column

    df_out.drop(columns = ['GDP_per_capita', 'BMI'], inplace = True) # Drop GDP_per_capita and BMI columns

    df_out = pd.get_dummies(df_out, columns = ['Region'], drop_first = True, dtype = int) # OHE 'Region'

    X_log_out = df_out.drop(columns = ['Life_expectancy']) # Drop Life_expectancy column
    y_log_out = df_out.Life_expectancy

    X_train_out, X_test_out, y_train_out, y_test_out = train_test_split(X_log_out, y_log_out, test_size = 0.2, random_state = 264) # Train-test split

    columns_out = X_train_out.columns
    index_train_out = X_train_out.index
    index_test_out = X_test_out.index

    # Scale the columns
    rob_out = RobustScaler()
    X_train_out_scaled = rob_out.fit_transform(X_train_out)
    X_train_out_scaled = pd.DataFrame(X_train_out_scaled, columns = columns_out, index = index_train_out)

    X_test_out_scaled = rob_out.transform(X_test_out)
    X_test_out_scaled = pd.DataFrame(X_test_out_scaled, columns = columns_out, index = index_test_out)

    X_train_out_scaled = sm.add_constant(X_train_out_scaled)
    X_test_out_scaled = sm.add_constant(X_test_out_scaled)

    # Train and fit the model
    model = sm.OLS(y_train_out, X_train_out_scaled)
    result = model.fit()

    # Predict life expectancy and calculate RMSE
    y_test_pred_out = result.predict(X_test_out_scaled)
    rmse_out = sm.tools.eval_measures.rmse(y_test_out, y_test_pred_out)


    # Define list of valid regions for user input validation
    regions = ['Middle East', 'European Union', 'Asia', 'South America',
               'Central America and Caribbean', 'Rest of Europe', 'Africa',
               'Oceania', 'North America']
    
    # Collect validated user inputs for key life expectancy predictors
    GDP_per_capita = get_valid_GDP()
    Economy_status_Developed = get_valid_economy_status()
    Region = get_valid_region(regions)
    Thinness_ten_nineteen_years = get_valid_thinness()
    BMI = get_valid_BMI()
    Under_five_deaths = get_valid_under_five_deaths()
    Adult_mortality = get_valid_adult_mortality()
    Incidents_HIV = get_valid_hiv_incidents()
    Schooling = get_valid_schooling_years()

    # Display ethical message after user input
    print_ethics_message() 
    
    # Create a dataframe of the same format as what is expected
    X_inputted = pd.DataFrame([{'Under_five_deaths': Under_five_deaths, 'Adult_mortality': Adult_mortality,
                                'Incidents_HIV': Incidents_HIV, 'Thinness_ten_nineteen_years': Thinness_ten_nineteen_years,
                                'Schooling': Schooling, 'Economy_status_Developed': Economy_status_Developed,'log_GDP': np.log(GDP_per_capita), 'log_BMI': np.log(BMI), 
                                'Region_Asia':0, 'Region_Central America and Caribbean':0, 'Region_European Union':0,
                                'Region_Middle East':0, 'Region_North America':0, 'Region_Oceania':0,
                                'Region_Rest of Europe':0, 'Region_South America':0}])

    # Set the dummy column for the chosen region
    # ('Africa' is the dropped baseline category, so all dummies stay 0 for it)
    if Region != 'Africa':
        X_inputted[f'Region_{Region}'] = 1

    # Scale inputted values
    columns_inputted = X_inputted.columns
    index_inputted = X_inputted.index
    X_inputted_scaled = rob_out.transform(X_inputted)
    X_inputted_scaled = pd.DataFrame(X_inputted_scaled, columns = columns_inputted, index = index_inputted)
    X_inputted_scaled = sm.add_constant(X_inputted_scaled, has_constant = 'add') # Add constant

    # Return prediction from the model
    life_expectancy = result.predict(X_inputted_scaled)
    print(f'The predicted average life expectancy is: {round(list(life_expectancy)[0],2)} ± {round(rmse_out, 2)} years')
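
Both model functions remove a row only when four or more of its numeric columns have an absolute z-score above 3. The sketch below restates that rule in plain Python on made-up numbers, to show why a row with three extreme values is kept. `outlier_rows` is a hypothetical helper, and the population standard deviation is used on the assumption that `scipy.stats.zscore` is called with its default `ddof=0`, as in the functions above.

```python
from statistics import mean, pstdev


def outlier_rows(rows, z_limit=3.0, min_flags=4):
    """Return indexes of rows with at least `min_flags` values whose |z-score| exceeds `z_limit`."""
    columns = list(zip(*rows))
    # Per-column mean and population standard deviation
    stats = [(mean(col), pstdev(col)) for col in columns]
    flagged = []
    for i, row in enumerate(rows):
        flags = sum(
            1
            for value, (mu, sigma) in zip(row, stats)
            if sigma > 0 and abs(value - mu) / sigma > z_limit
        )
        if flags >= min_flags:
            flagged.append(i)
    return flagged


# Made-up data: 30 ordinary rows, one row extreme in all 4 columns,
# and one row extreme in only 3 columns
rows = [[float(i % 5)] * 4 for i in range(30)]
rows.append([1000.0, 1000.0, 1000.0, 1000.0])  # index 30
rows.append([1000.0, 1000.0, 1000.0, 2.0])     # index 31

print(outlier_rows(rows))  # [30]
```

Requiring several flagged columns at once means a country-year that is unusual on a single indicator stays in the training data.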


## Main Function
> Prompts user for consent to use advanced population data, then runs either the Max Model or Sensitive Friendly Model based on their response.

In [11]:
def final_fun():
    # Ask user for consent to use advanced population data
    consent = input("Do you consent to using advanced population data, which may include protected information, for better accuracy? (Y/N)")
    # Loop until the user gives a valid answer
    while True:
        # If user consents
        if (consent.lower() == 'y') or (consent.lower() == 'yes'):
            max_model()  # Run the maximal model
            break
        # If user declines
        elif (consent.lower() == 'n') or (consent.lower() == 'no'):
            min_model() # Run the sensitive friendly model
            break
        # If input is invalid
        else:
            # Ask for input again
            consent = input("Sorry, you didn't enter Y/N. Do you consent to using advanced population data, which may include protected information, for better accuracy? (Y/N)")
    

## Function Demo

* Please ensure you have run the previous cells before running the cell below.
* This function is intended for a single observation.

In [12]:
final_fun()

Do you consent to using advanced population data, which may include protected information, for better accuracy? (Y/N) n
Please enter the GDP (in USD) per capita for the country you would like to know the average life expectancy for: 38000
Enter adult mortality rate (per 1,000 people aged 15–60). Typical range: (40–800):  55
Enter economy status (1 for developed, 0 for not developed):  1
Enter region (choose from: Middle East, European Union, Asia, South America, Central America and Caribbean, Rest of Europe, Africa, Oceania, North America):  Middle East



Thank you for providing those details.
This model uses health, education, and economic indicators to estimate life expectancy.
It’s designed to highlight disparities and support responsible, data-informed decision-making.

The predicted average life expectancy is: 79.99 ± 2.07 years
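
The `±` figure in the output is the model's RMSE on the held-out test set, so it should be read as a typical prediction error rather than a formal confidence interval. A minimal sketch of the calculation, using made-up values and a hypothetical `rmse` helper in place of `statsmodels`' `rmse`:

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up life expectancies, for illustration only
actual = [70.0, 80.0, 65.0, 75.0]
predicted = [72.0, 78.0, 66.0, 74.0]

error = rmse(actual, predicted)
print(f"{predicted[0]:.2f} ± {error:.2f} years")  # 72.00 ± 1.58 years
```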


## Key Features at a Glance


 |Feature            | Max Model          | Min Model         | Benefit to WHO |
 |---------          |-----------         |-----------        |----------------|
 | Accuracy          | RMSE 1.18          | RMSE 2.07         | Better decision making |
 | Data Points       | 9 inputs           | 4 inputs          | Flexible privacy options |
 | Regional Analysis | Included           | Included          | Localized insights |
 | Ethical Compliance| Advanced consent   | Privacy first     | Regulatory ready |

## Why Choose Our Model?

### Our 'Sensitive Friendly Model':
* Has an RMSE of 2.07, which gives a fairly accurate prediction without compromising private information
* Is built for WHO's strict privacy standards
* Can perform predictions for privacy-sensitive regions

### Our 'Max Model':
* Has an RMSE of 1.18, which is a 34% reduction from 1.80
* Is optimised for policy planning, resource allocation and impact evaluation
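
The 34% figure quoted for the Max Model can be checked with a short calculation, taking the stated starting RMSE of 1.80 and the final RMSE of 1.18. `percent_reduction` is a hypothetical helper used only for this check.

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100


reduction = percent_reduction(1.80, 1.18)
print(f"{reduction:.1f}%")  # 34.4%
```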

### Both Models
* Support reproducibility
* Are designed with audience safety in mind
* Use ethical framing, with reusable input validation
* Print a standardised thank-you message
