### Our models explained

For the WHO, we have built two separate models, an <b><u>accurate/robust</u></b> model and a <b><u>simple/ethical</u></b> model. Each model has its pros and cons, and we advise the WHO to use each according to its purpose.

Our robust model is built specifically for <b>accuracy</b>, aiming for as low a Root Mean Squared Error (RMSE) as possible. This has the benefit of being able to predict the life expectancy of any country much more accurately than our simple model. However, the aim for accuracy has a number of downsides:

- Generalisation issues: While the model is very accurate, it will not adapt well to data that is very different from our training/test data. It will try to force established patterns, even if they are no longer relevant, e.g. an African country significantly improving its healthcare will still be heavily penalised for being in Africa
- It is less easy to understand: Most predictors in our robust model have been scaled, and we have also One Hot Encoded and feature combined others. This makes it difficult to understand some of the values in our transformed dataset
- No ethical considerations: This model includes many variables (e.g. disease frequency) that some countries might not wish to share for ethical reasons
- A higher condition number: As we are including more variables, there is a higher likelihood of dependency between predictors due to correlation, i.e. multicollinearity
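The link between correlated predictors and a higher condition number can be illustrated with a minimal sketch. The data below is synthetic and the variable names (`x1`, `x2`, `x3`) are hypothetical; it is not our actual dataset, only an illustration of the effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Two hypothetical, independent predictors (synthetic data, not the WHO dataset)
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# A third predictor that is almost a copy of x1, i.e. strongly correlated with it
x3 = x1 + rng.normal(scale=0.01, size=n)

# Design matrices with a constant column, without and with the correlated predictor
X_independent = np.column_stack([np.ones(n), x1, x2])
X_collinear = np.column_stack([np.ones(n), x1, x2, x3])

cond_independent = np.linalg.cond(X_independent)
cond_collinear = np.linalg.cond(X_collinear)

print(f"Condition number without correlated predictor: {cond_independent:.1f}")
print(f"Condition number with correlated predictor:    {cond_collinear:.1f}")
```

Adding the near-duplicate predictor raises the condition number sharply, which is the behaviour we try to limit by combining or excluding correlated variables.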

Meanwhile, our ethical model is built for usability, ease of understanding, and, above all, an ethical use of the data. We only use two predictors, both of which are relatively ethical and unbiased, particularly without context from other variables. However, these characteristics come at a cost compared to our robust model:

- Lower accuracy: While our ethical model meets the requirements the WHO set us with an RMSE of 1.6, this means it is roughly 3 times less accurate than our robust model
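As a reminder of what the RMSE measures, here is a minimal sketch of the calculation. The `rmse` helper and the example values are hypothetical illustrations, and the ratio at the end assumes that the 0.51 figure quoted for the robust model in the predictor table is its final RMSE.

```python
import math


def rmse(actual, predicted):
    """Root Mean Squared Error between two equally long sequences of numbers."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up life expectancy values (in years), purely for illustration
actual = [70.0, 65.0, 80.0]
predicted = [71.0, 63.0, 80.0]
print(f"Example RMSE: {rmse(actual, predicted):.2f} years")

# Ratio between the two quoted RMSE values (assumption: 0.51 is the robust model's RMSE)
ratio = 1.6 / 0.51
print(f"Ethical RMSE / robust RMSE: {ratio:.1f}")
```

An RMSE of 1.6 can be read as a typical prediction error of about 1.6 years of life expectancy.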

Nonetheless, we recommend that users working with new data that shows significant change from previous years use our <b> ethical </b> model over our <b> robust </b> model. While less accurate, our ethical model is able to generalise much better, and make more accurate predictions for countries that might have significantly improved or deteriorated in terms of their predictors compared to the training/test data. Furthermore, it only makes use of two variables that are as unbiased as possible (for the data), making them more ethical to use.

### Included Predictors


|Field|Inclusion in Robust Model|Inclusion in Ethical Model|Reason|Explanation|
|:---|:---|:---|---|:---|
|Country|<b>Yes</b>|No|Significant bias and ethical issues| Including 'country' significantly increases the accuracy of our model, but also means our model would become very dependent on the 'country' predictor. This would lead to significant generalisation issues and limit its use cases. Importantly, there is also the <b>ethical</b> consideration: keeping country in our model would lead to the model continuing to bias against so-called underdeveloped countries in e.g. Africa, regardless of the progress they make in improving health standards. We therefore include 'country' in our robust model for accuracy, but exclude it from our ethical model for bias and ethical considerations |
|Year|<b>Yes</b>|No|Generalisation issues| Although 'year' can significantly improve the accuracy of our model, including it runs the risk of our model becoming too reliant on the year the data is from, instead of being able to generalise to new trends. However, since it is a good predictor (including 'year' reduces the RMSE from 0.67 to 0.51 in our robust model), it is included in our robust model|
|Status|No|No|Ethics issues|Whether a country is categorised as developed or developing is subjective and it would be understandable for users of this model not to want to input something that they might not consider a fair reflection of the state of their country. Furthermore, in future years countries might change their 'status', making the 'status' of a country in this model inaccurate. For the robust model, adding 'status' also does not improve our accuracy, and it is therefore also excluded|
|Adult mortality|<b>Yes</b>|<b>Yes</b>|Unbiased float|We chose to include 'adult mortality' because without the context of other predictors, this variable is simply a float without meaning. It does not impart bias and is actually useful to include for generalisation purposes. All countries record deaths, so it should not be a variable a user would have an issue with inputting. We have also included it in our robust model|
|Infant deaths|No*|<b>Yes</b>|Unbiased float|Similar to 'adult mortality', but slightly more contentious. While all countries also experience infant deaths, some might be less willing to share this data. Nonetheless, it is a useful predictor from an accuracy perspective and, as with 'adult mortality', it carries little meaning without the context of other predictors. For these reasons we have included it in our ethical model. We have not included it in our robust model as we have combined the variable with 'under five deaths' into the new predictor 'mortality rates'. Including both would lead to multicollinearity issues.|
|Under five deaths|No*|No|Multicollinearity issues|Including both 'under five deaths' and 'infant deaths' in our model leads to multicollinearity. In our robust model we have solved this by feature combining the two variables into one, 'mortality rates', but we have excluded it from our ethical model to keep our model simple and understandable|
|Alcohol|No|No|Minimalisation/multicollinearity|Adding 'alcohol' does not make our model significantly better, and we have therefore excluded it to keep our model simple and intuitive. Even if we did include it, there would be bias issues due to e.g. certain countries not consuming/banning alcohol for religious purposes. There could be a number of countries unwilling to input this variable. We have also excluded it from our robust model for multicollinearity issues|
|Percentage expenditure|No|No|Not included in dataset|While the dataset dictionary makes mention of this variable, it is not included in the dataset|
|Hepatitis B|No|No|Ethical and bias issues|While diseases can be a useful predictor, some countries might not want to report on health-specific issues for privacy reasons. Furthermore, unlike e.g. adult mortality, not all countries experience all diseases in our dataset equally, or at all, leading to bias. It is however useful in terms of accuracy, and is therefore included in our robust model, but as part of 'illnesses'|
|Measles|No|No|Ethical and bias issues|While diseases can be a useful predictor, some countries might not want to report on health-specific issues for privacy reasons. Furthermore, unlike e.g. adult mortality, not all countries experience all diseases in our dataset equally, or at all, leading to bias. From the perspective of our robust model, it also doesn't add anything in terms of accuracy|
|BMI|No|No|Ethical issues|BMI can be considered a variable that countries would not want to report or use as an input. While we largely excluded this predictor for ethical reasons, it is also true that there is a higher correlation between 'developed' countries and 'BMI', which might lead to generalisation issues and nonsense extrapolations such as the high correlation value between schooling and BMI |
|Polio|No|No|Ethical and bias issues|While diseases can be a useful predictor, some countries might not want to report on health-specific issues for privacy reasons. Furthermore, unlike e.g. adult mortality, not all countries experience all diseases in our dataset equally, or at all, leading to bias. This is particularly the case with 'polio' as many countries no longer have polio cases. We have however included it in our robust model due to it increasing the accuracy of our model, as part of 'illnesses'|
|Total expenditure|No|No|Not included in dataset|While the dataset dictionary makes mention of this variable, it is not included in the dataset|
|Diphtheria|No|No|Ethical and bias issues|While diseases can be a useful predictor, some countries might not want to report on health-specific issues for privacy reasons. Furthermore, unlike e.g. adult mortality, not all countries experience all diseases in our dataset equally, or at all, leading to bias. We have however included it in our robust model, as part of 'illnesses', due to it increasing the accuracy of our model|
|HIV/AIDS|No|No|Ethical and bias issues|While diseases can be a useful predictor, some countries might not want to report on health-specific issues for privacy reasons. Furthermore, unlike e.g. adult mortality, not all countries experience all diseases in our dataset equally, or at all, leading to bias. The ethical component is particularly strong with HIV, due to the negative stereotypes associated with high HIV rates|
|GDP|No|No|Multicollinearity / minimalism|GDP per capita overlaps a lot with other metrics such as 'status' and 'schooling'. Multicollinearity would be an issue if we included this, but we excluded it anyway due to the focus on a minimalistic model, with 'GDP' not contributing significantly to the accuracy of the model compared to the included predictors  |
|Population|No|No|Bias and ethics issues|'Population' has low correlation rates with other predictors, and does not seem to contribute much to the accuracy of the model either. This is logical due to population being an absolute number, compared to many of the other variables in our dataset being per capita. There would also be generalisation issues with including this predictor, as it would train the model to associate high/low population with mortality, which does not seem sensible. Including it could also lead to ethical issues as it would be easy to predict which country relates to which population number|
|Thinness 1-19 years|No|No|Ethics issues / minimalism|Excluded for similar ethical concerns as with BMI. Could be merged with the thinness predictor below to avoid multicollinearity, but as these predictors do not affect our accuracy as significantly as our two included predictors, they would be excluded regardless for the ease of use of the model. We have also excluded it from our robust model due to multicollinearity issues with other included predictors, which add more to the accuracy|
|Thinness 5-9 years|No|No|Ethics issues / minimalism|Excluded for similar ethical concerns as with BMI. Could be merged with the thinness predictor above to avoid multicollinearity, but as these predictors do not affect our accuracy as significantly as our two included predictors, they would be excluded regardless for the ease of use of the model. We have also excluded it from our robust model due to multicollinearity issues with other included predictors, which add more to the accuracy|
|Income composition of resources|No|No|Not included in dataset|While the dataset dictionary makes mention of this variable, it is not included in the dataset|
|Schooling|No|No|Multicollinearity and ethical issues|'Schooling' could be considered problematic from an inclusion perspective, due to the biases associated with education. There are also multicollinearity issues associated with 'schooling'. While it is a predictor that does add to the overall accuracy of the model, there are too many issues with the predictor to include it in the model. For similar reasons we have also excluded it from our robust model|
|Region|No|No|Bias and ethical issues|Not included in the data dictionary, but present in the dataset. The issues with 'region' mirror those with 'country' and lead to generalisation issues, as well as, by extension in this case, ethical issues. We have excluded it from our robust model due to multicollinearity issues with 'country'|
|<i>Illnesses</i>|<b>Yes</b>|No|Feature combination|'Illnesses' is the combined feature of 'hepatitis B', 'diphtheria' and 'polio'. We have combined these variables for our robust model as each separate predictor by itself improves the accuracy of the model, but including them individually leads to multicollinearity. We have excluded it from our ethical model due to it adding complexity, and three extra inputs, when we wish to keep this model simple and explainable|
|<i>*Mortality rates</i>|<b>Yes</b>|No|Feature combination|'Mortality rates' is the combination of the features 'infant deaths' and 'under five deaths'. Similar to 'illnesses', we have combined these variables for our robust model as each separate predictor by itself improves the accuracy of the model, but including them individually leads to multicollinearity. The reasons for excluding it from our ethical model are the same too, and we already include one of the two predictors|
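The feature combination described for 'illnesses' and 'mortality rates' can be sketched as follows. The products mirror the multiplications used in our predictor function, but the rows are made-up values, and dropping the original columns afterwards is an assumption about how the combined features replace them.

```python
import pandas as pd

# Made-up example rows, purely for illustration (not taken from the WHO dataset)
df = pd.DataFrame({
    "Infant_deaths": [10.0, 40.0],
    "Under_five_deaths": [12.0, 55.0],
    "Hepatitis_B": [95.0, 60.0],
    "Polio": [96.0, 65.0],
    "Diphtheria": [97.0, 62.0],
})

# Combine correlated predictors into single features by multiplying them
df["mortality_rates"] = df["Infant_deaths"] * df["Under_five_deaths"]
df["illnesses"] = df["Hepatitis_B"] * df["Polio"] * df["Diphtheria"]

# Assumption: the original columns are then dropped so that only the
# combined features remain as predictors
combined = df.drop(
    columns=["Infant_deaths", "Under_five_deaths", "Hepatitis_B", "Polio", "Diphtheria"]
)
print(combined)
```

Each combined feature keeps the information from its components while reducing the number of correlated columns the model sees.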

### W.H.O. Life Expectancy Predictor Function

In [None]:
# Importing essential libraries
import pandas as pd  # For data manipulation and analysis
import numpy as np  # For numerical operations
import statsmodels.api as sm  # For statistical modeling and regression analysis
from sklearn.preprocessing import StandardScaler  # For feature scaling
from sklearn.model_selection import train_test_split  # For splitting data into training and testing sets
from sklearn.metrics import mean_squared_error  # For calculating model evaluation metrics (e.g., RMSE)
from IPython.display import display, HTML  # For displaying dataframes or HTML content in Jupyter Notebooks
import joblib  # For saving and loading machine learning models


def select_model():
    # Display a stylish HTML question to either use the Ethical Model or the Robust Model
    display(HTML("""
        <h2 style="color: rgb(37 71 134 / 85%); font-family: Arial, sans-serif;">Hello and welcome to the <b>W.H.O. Life Expectancy Predictor</b></h2>
        <p style="font-size: 16px;">Would you like to use an <b>Ethical Model (97.1% Accuracy)</b> or a <b>Robust Model (99.7% Accuracy)</b>?</p>
        <p style="font-size: 16px;"><b>Please note</b> that by accessing the function, you are giving us <u>permission to use your data</u>.</p>
        <p style="font-size: 16px;">If you have any sensitive data, please use our <b>Ethical Model</b>.</p>
        <h4>Please enter <b>ethical</b> for the ethical model or <b>robust</b> for the robust model:</h4>
    """))
    
    choice = input("").strip().lower()


###### Ethical model function ######
    def ethical_model():
        
        """
        This function predicts life expectancy based on two factors:
        - Adult mortality rate
        - Infant deaths
        The function asks the user to input these values and calculates the predicted life expectancy.
        """

### DATA LOADING ###
    ## Step 1: Load coefficients from CSV and transfer into a dictionary
        # We read the coefficients from 'feature_coefficients_ethical.csv', which was saved when we trained our model
        coeffs_df = pd.read_csv("feature_coefficients_ethical.csv", index_col="Feature_cols")
        # We store these coefficients in a dictionary so that each variable name (e.g., 'Adult_mortality') is associated with its coefficient value.
        coeffs = coeffs_df['Coefficient'].to_dict()



#### USER INPUT FOR PREDICTION ###
    ## Step 2: Prompt the user to input values for adult mortality and infant deaths
        print("Please enter the following values:")

        # Function to validate input
        def get_valid_input(prompt, limit):
            while True:
                try:
                    value = float(input(prompt))
                    if value < 0:
                        print("The value cannot be negative. Please enter a positive number.")
                    elif value > limit:
                        print(f"The value cannot exceed {limit}. Please try again.")
                    else:
                        return value
                except ValueError:
                    print("Invalid input. Please enter a numeric value.")
    
        # Get valid inputs for adult mortality and infant deaths
        adult_mortality = get_valid_input(
            "Adult mortality - Rates of both sexes (probability of dying between 15 and 60 years per 1000 population): ", 
            limit=1000
        )
        infant_deaths = get_valid_input(
            "Infant deaths - Number of Infant Deaths per 1000 population: ",
            limit=1000
        )
    
        # Prepare the input data in a dictionary format to be used in the prediction calculation
        input_data = {
            'const': 1.0,                         # This represents the constant term in the model (it's always 1)
            'Adult_mortality': adult_mortality,   # The user input for adult mortality
            'Infant_deaths': infant_deaths        # The user input for infant deaths
        }
    
        # Step 3: Calculate the prediction
        # We use the coefficients dictionary to calculate the prediction
        prediction = sum(coeffs.get(feature, 0) * input_data[feature] for feature in input_data)



        # Display the predicted life expectancy result to the user in a nice format
        # This uses HTML to show a message with the prediction rounded to two decimal places.
        display(HTML(f"""
            <h4 style="color: #4CAF50;">Predicted Life Expectancy: {prediction:.2f} years</h4>
            <p style="font-size: 16px;">If you would like to try again, please re-run the function.</p>
                    """))


####### Robust model function ######
    def robust_model():
        """
        This function predicts life expectancy based on various health-related factors and country-specific data.
        It asks the user for input, processes the data, and then applies a trained model to make the prediction.
        """
    
### DATA LOADING ###
    # Step 1: Load the model coefficients from the CSV file. This file contains the values for each feature in the model.
        coeffs_df = pd.read_csv("feature_coefficients.csv")
        
        # Extract columns related to country-specific dummy variables (those starting with 'Country_')
        country_columns = [col for col in coeffs_df['Feature_cols'] if col.startswith("Country_")]
    
    ### Step 2: Collect User Input for Prediction ###
        print("Please enter the following values:")
    
        # Function to validate numeric input
        def get_valid_input(prompt, min_value=None, max_value=None, is_int=False):
            while True:
                try:
                    value = int(input(prompt)) if is_int else float(input(prompt))
                    if min_value is not None and value < min_value:
                        print(f"The value cannot be less than {min_value}. Please try again.")
                    elif max_value is not None and value > max_value:
                        print(f"The value cannot exceed {max_value}. Please try again.")
                    else:
                        return value
                except ValueError:
                    print("Invalid input. Please enter a valid numeric value.")
        
        # Function to validate string input
        def get_valid_string(prompt):
            while True:
                value = input(prompt).strip()
                if value:
                    return value
                else:
                    print("Input cannot be empty. Please enter a valid value.")
        
        # Prompt the user for input values
        adult_mortality = get_valid_input(
            "Adult mortality - Rates of both sexes (probability of dying between 15 and 60 years per 1000 population): ",
            min_value=0,
            max_value=1000
        )
        infant_deaths = get_valid_input(
            "Infant deaths - Number of Infant Deaths per 1000 population: ",
            min_value=0,
            max_value=1000
        )
        under_five_deaths = get_valid_input(
            "Under 5 deaths - Number of under-five deaths per 1000 population: ",
            min_value=0,
            max_value=1000
        )
        hepatitis_b = get_valid_input(
            "Hepatitis B - (HepB) immunization coverage among 1-year-olds (%): ",
            min_value=0,
            max_value=100
        )
        polio = get_valid_input(
            "Polio - (Pol3) immunization coverage among 1-year-olds (%): ",
            min_value=0,
            max_value=100
        )
        diphtheria = get_valid_input(
            "Diphtheria - (DTP3) immunization coverage among 1-year-olds (%): ",
            min_value=0,
            max_value=100
        )
        year = get_valid_input(
            "Year: ",
            min_value=2000,
            max_value=2100,
            is_int=True  # A year should be entered as a whole number
        )

            
        country = get_valid_string("Country (e.g., 'Algeria', 'Brazil'): ")
        
    ### Step 3: Prepare the User Input Data ###
        # Store the user's input values in a DataFrame (a table format) for easy processing
        input_data = pd.DataFrame([[adult_mortality, infant_deaths, under_five_deaths, hepatitis_b, polio, diphtheria, year]], 
                                  columns=['Adult_mortality', 'Infant_deaths', 'Under_five_deaths', 'Hepatitis_B', 'Polio', 'Diphtheria', 'Year'])
    
    ### Step 4: Handle Country-Specific Dummy Variables ###
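        # Create a dictionary that generates country-specific dummy variables (1 if the country matches, 0 otherwise)
        country_data = {col: [1 if col == f'Country_{country}' else 0] for col in country_columns}

        # If no dummy column matches, every country dummy is 0 and the model falls back to its baseline country
        if f'Country_{country}' not in country_columns:
            print(f"Note: no coefficient found for '{country}'; the prediction will use the model's baseline country.")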
        
        # Convert the dictionary into a DataFrame for easy merging with the user input data
        country_df = pd.DataFrame(country_data)
        
        # Combine the country-specific data with the rest of the user input data
        input_data = pd.concat([input_data, country_df], axis=1)
    
    ### Step 5: Add Interaction Terms to the Data ###
        # Create interaction terms by multiplying related factors to account for their combined effects on life expectancy
        input_data['illnesses'] = input_data['Hepatitis_B'] * input_data['Polio'] * input_data['Diphtheria']
        input_data['mortality_rates'] = input_data['Infant_deaths'] * input_data['Under_five_deaths']
    
    ### Step 6: Add Constant Term for Intercept ###
        # Add a constant column (for the intercept) that is necessary for the prediction model
        input_data = sm.add_constant(input_data, has_constant='add')
    
    ### Step 7: Apply Feature Scaling ###
        try:
            # Load the previously fitted scaler (used to scale the input data to match the model's training data)
            scaler = joblib.load("scaler.pkl")
            
            # Remove the constant column for scaling (we'll add it back later)
            input_data_scaled = input_data.drop(columns=['const'])
            
            # Apply scaling to the input data using the loaded scaler
            input_data_scaled = scaler.transform(input_data_scaled)
            
            # Convert the scaled data back into a DataFrame with proper column names
            input_data_scaled = pd.DataFrame(input_data_scaled, columns=input_data.columns[1:])
            
            # Re-add the constant column (which was dropped earlier for scaling)
            input_data_scaled = sm.add_constant(input_data_scaled, has_constant='add')
        
        except FileNotFoundError:
            # If the scaler file is missing, display an error message
            print("Scaler file not found. Please ensure the scaler is saved during training.")
            return
    
    ### Step 8: Make the Prediction ###
        try:
            # Load the previously trained model (the model that makes predictions based on the input data)
            model = joblib.load("model.pkl")
            
            # Use the model to make a prediction based on the scaled input data
            prediction = model.predict(input_data_scaled)
            
            # Display the predicted life expectancy to the user in a formatted message
            display(HTML(f"""
                <h4 style="color: #4CAF50;">Predicted Life Expectancy: {prediction[0]:.2f} years</h4>
                <p style="font-size: 16px;">If you would like to try again, please re-run the function.</p>
            """))
        
        except FileNotFoundError:
            # If the model file is missing, display an error message
            print("Model file not found. Please ensure the model is saved during training.")
            return
    
    
    # Example usage of the function:
    # This part allows the user to choose which model to run (either 'ethical' or 'robust').
    if choice == 'ethical':
        ethical_model()
    elif choice == 'robust':
        robust_model()
    else:
        # If the user enters an invalid choice, display an error message
        display(HTML("<p style='color: red;'>Invalid input. Please enter 'ethical' or 'robust'.</p>"))

# Call the select_model function to run the entire process
select_model()
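For testing without the interactive prompts, the ethical model's calculation can be separated into a small helper. The sketch below is a hypothetical illustration: the function name `predict_life_expectancy` is not part of our function above, and the coefficient values are invented placeholders, not the trained coefficients stored in `feature_coefficients_ethical.csv`.

```python
def predict_life_expectancy(coeffs, adult_mortality, infant_deaths):
    """Apply linear model coefficients to the two ethical-model inputs.

    `coeffs` maps the names 'const', 'Adult_mortality' and 'Infant_deaths'
    to their coefficient values.
    """
    inputs = {
        "const": 1.0,                        # Intercept term, always 1
        "Adult_mortality": adult_mortality,
        "Infant_deaths": infant_deaths,
    }
    # Fail loudly if a coefficient is missing instead of silently treating it as 0
    missing = [name for name in inputs if name not in coeffs]
    if missing:
        raise KeyError(f"Missing coefficients: {missing}")
    return sum(coeffs[name] * value for name, value in inputs.items())


# Invented placeholder coefficients, for illustration only
example_coeffs = {"const": 80.0, "Adult_mortality": -0.05, "Infant_deaths": -0.1}
prediction = predict_life_expectancy(example_coeffs, adult_mortality=100.0, infant_deaths=20.0)
print(f"Predicted life expectancy: {prediction:.2f} years")
```

Unlike the `coeffs.get(feature, 0)` lookup in the function above, this sketch raises an error when a coefficient is missing, which makes a mismatched coefficient file easier to spot.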
