<div align="center"> <h1> WHO Life Expectancy Prediction Model Report</h1> </div>

<img src="https://data.org/wp-content/uploads/2024/01/WHO_logo.svg" style="width:100%; height:200px; object-fit:cover;" />

The **objective** of this project is to build a robust linear regression model that predicts life expectancy with a low margin of error.

### Table of Contents
 - 01: Exploratory Data Analysis
 - 02: Modelling
 - 03: Function
 - 04: Conclusions

## 01: Exploratory Data Analysis

This section explores the data, looking only at the features.

In [None]:
# Pair plot to check visually whether features correlate
# (X_train comes from the train/test split in section 2.2)
sns.pairplot(X_train)
plt.show()
# Many features turned out to be correlated; these are discussed below.

In [None]:
# Correlation matrix to check numerically whether features correlate
X_train.corr(numeric_only=True)
# Many features are correlated; these are explored below.
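
A full correlation matrix is hard to read by eye, so it can help to list only the strongly correlated pairs. The following is a minimal sketch, not the method used in this project: the helper name `top_correlated_pairs`, the 0.8 threshold and the toy data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, threshold=0.8):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Toy data (illustrative values only, not the real dataset)
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "Under_five_deaths": base,
    "Infant_deaths": base + rng.normal(scale=0.1, size=200),  # nearly a copy of base
    "Schooling": rng.normal(size=200),                        # independent of base
})

pairs = top_correlated_pairs(toy)
print(pairs)
```

On the real data, the same helper could be called as `top_correlated_pairs(X_train)`.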

#### 1.1 Ethical Considerations:
- Medical records, such as vaccination rates, are sensitive data.
- Countries are omitted for both ethical and modelling reasons.
- Some columns were already omitted: percentage expenditure, total expenditure, income composition of resources

#### 1.2) Observations
- Thinness in the two age groups is correlated, as the ages overlap.
- Strong correlation between infant deaths and under-five deaths (multicollinearity).
- Some negative correlation between child and adolescent thinness and BMI.
- Positive correlation between diphtheria, polio and hepatitis B.

**1.3) Columns that show potential multicollinearity:**   
After analysing the correlation matrix of the features, we decided to drop the following columns to reduce multicollinearity in the model.

Infant deaths: 
- Deaths between ages 0 and 1 are already represented in under-five deaths.

Thinness 5-9 and Thinness 10-19:
- Both are correlated with each other and with other features, e.g. ‘Schooling’.

HIV: 
- Highly correlated with ‘Adult Mortality’, which we considered more valuable for predicting life expectancy.

Diphtheria and Polio: 
- Both are highly correlated with infant deaths, each other and Hepatitis B.

BMI: 
- BMI can be considered flawed and racially biased.
- It is not a very representative indicator of health.

Alcohol: 
- Correlated with ‘Schooling’ and ‘Economy Status’ (both Developed and Developing).

Economy Status Developing: 
- Only one of ‘Economy Status Developed’ and ‘Economy Status Developing’ is needed.
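
One way to put a number on the multicollinearity described above is the variance inflation factor (VIF). For standardised features, the VIF of each column equals the corresponding diagonal entry of the inverse correlation matrix. The sketch below uses that identity on toy data; the helper name `vif_from_corr` and the data are assumptions, and this is not necessarily how VIF was calculated in this project.

```python
import numpy as np
import pandas as pd

def vif_from_corr(df):
    """VIF per numeric column, read off the diagonal of the inverse correlation matrix."""
    corr = df.corr(numeric_only=True)
    vif = np.diag(np.linalg.inv(corr.to_numpy()))
    return pd.Series(vif, index=corr.columns, name="VIF")

# Toy data (illustrative values only): two near-duplicate columns and one independent column
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "Under_five_deaths": base,
    "Infant_deaths": base + rng.normal(scale=0.1, size=200),
    "Schooling": rng.normal(size=200),
})

vif = vif_from_corr(toy)
print(vif.sort_values(ascending=False))
```

A VIF close to 1 means the column is barely explained by the others. The two near-duplicate columns get large values, which is the pattern that motivates dropping one of them.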

**1.4) Columns that do not seem useful:**
- Year


## 02: Modelling

#### 2.1: Importing libraries and reading the data

In [None]:
#Importing the four main libraries
import numpy as np # for maths
import seaborn as sns # to augment matplotlib/visuals
import matplotlib.pyplot as plt # visuals
import pandas as pd # data

# We use train/test split for our data
from sklearn.model_selection import train_test_split

# Scaling
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler

# Modelling library: statsmodels
import statsmodels.api as sm # contains the model
import statsmodels.tools # contains the metrics

# Metrics from sklearn
from sklearn import metrics

In [None]:
df = pd.read_csv('Life Expectancy Data.csv')

In [None]:
pd.set_option('display.max_columns', None)

In [None]:
df.head()

In [None]:
df.columns

#### 2.2: Train/test split

In [None]:
def train_test(df):
    df = df.copy()
    feature_cols = list(df.columns) # Selects all the columns
    feature_cols.remove('Life_expectancy') # Removes the target
    X = df[feature_cols] # Dataframe which holds all the features
    y = df['Life_expectancy'] # The target to predict
    # Train/test split data
    X_train, X_test, y_train, y_test = train_test_split(
        X,                # features
        y,                # target
        test_size=0.2,    # hold out 20% of rows for testing
        random_state=56,  # fixed seed so the shuffle is reproducible
    )
    return X_train, X_test, y_train, y_test

In [None]:
# Run function to create train test split
X_train, X_test, y_train, y_test = train_test(df)

In [None]:
# Sanity check for the split
print(all(X_train.index == y_train.index)) # Check training indices match
print(all(X_test.index == y_test.index)) # Check testing indices match

In [None]:
# To look at central tendencies of training dataset.
X_train.describe()

#### 2.3: Feature Engineering

In [None]:
## Applying feature engineering on X-train

def feature_eng(df):
    # Create a local copy of the data
    df = df.copy()
    # Drop the Country and Year columns if present (errors='ignore' skips missing ones)
    df = df.drop(columns=['Country', 'Year'], errors='ignore')
    # One-hot encode Region, dropping the first level as the baseline
    if 'Region' in df.columns:
        df = pd.get_dummies(df, columns=['Region'], drop_first=True, prefix='Region', dtype=int)
    return df
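
Because `feature_eng` is applied to the training and test sets separately, a region that is missing from one split would produce a different set of dummy columns. The sketch below shows one possible safeguard: reindexing the test frame to the training columns. The helper name `align_to_train` and the toy values are assumptions for illustration.

```python
import pandas as pd

def align_to_train(train_fe, test_fe):
    """Give the test frame exactly the training columns, filling missing dummies with 0."""
    return test_fe.reindex(columns=train_fe.columns, fill_value=0)

# Toy frames (illustrative values only): 'Oceania' appears in train but not in test
train = pd.DataFrame({"Region": ["Asia", "Africa", "Oceania"], "Schooling": [8.0, 5.5, 11.2]})
test = pd.DataFrame({"Region": ["Asia", "Africa"], "Schooling": [9.1, 6.0]})

train_fe = pd.get_dummies(train, columns=["Region"], drop_first=True, prefix="Region", dtype=int)
test_fe = pd.get_dummies(test, columns=["Region"], drop_first=True, prefix="Region", dtype=int)

test_aligned = align_to_train(train_fe, test_fe)
print(list(test_fe.columns))       # lacks the Region_Oceania dummy
print(list(test_aligned.columns))  # same columns as train_fe
```

With the full dataset every region probably appears in both splits, so this matters mainly when predicting on small or single-country inputs.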


In [None]:
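# Robust scaler
def scaler_rob(df):
    rob = RobustScaler()  # Initialise the scaler
    rob.fit(df)  # Fit on the data passed in (median and IQR per column)
    df_scaled = rob.transform(df)
    # Rebuild a DataFrame so the column names and index are preserved
    df_scaled = pd.DataFrame(df_scaled, columns=df.columns, index=df.index)
    # Add the intercept column expected by statsmodels OLS
    df_scaled = sm.add_constant(df_scaled)
    return df_scaled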
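
Note that `scaler_rob` fits a new scaler on whatever frame it receives, so calling it on `X_test` scales the test set with the test set's own medians and quartiles. A common alternative is to fit once on the training features and reuse the fitted scaler on the test features. The following is a minimal sketch of that pattern, assuming the hypothetical helper names `fit_scaler` and `apply_scaler` and toy values; it is not the code that produced the results in this report.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

def fit_scaler(train_df):
    """Fit a RobustScaler on the training features only."""
    return RobustScaler().fit(train_df)

def apply_scaler(scaler, df):
    """Scale df with an already-fitted scaler, keeping column names and index."""
    scaled = pd.DataFrame(scaler.transform(df), columns=df.columns, index=df.index)
    scaled.insert(0, "const", 1.0)  # intercept column placed first
    return scaled

# Toy data (illustrative values only)
train = pd.DataFrame({"GDP_per_capita": [1000.0, 2000.0, 3000.0, 4000.0, 5000.0]})
test = pd.DataFrame({"GDP_per_capita": [3000.0, 6000.0]}, index=[10, 11])

scaler = fit_scaler(train)
train_scaled = apply_scaler(scaler, train)
test_scaled = apply_scaler(scaler, test)   # uses the training median and IQR

# For comparison: refitting on the test set gives different scaled values
test_refit = RobustScaler().fit_transform(test)

print(test_scaled)
print(test_refit)
```

In the first result, 3000 maps to 0.0 because it is the training median. In the refit version the same value is scaled against the test median instead, so the model would see inconsistent inputs.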

In [None]:
## Transform X_train
X_train_fe = feature_eng(X_train)

In [None]:
## Scale X_train
X_train_fe = scaler_rob(X_train_fe)

In [None]:
X_train_fe.head()

#### 2.4.1: Statistical Model: Basic

This is the model that does not use sensitive data.

In [None]:
## Fit an OLS linear regression on the training data (used for both models)
def ols_lin(y_train, X_train_fe, feature_cols):
    lin_reg = sm.OLS(y_train, X_train_fe[feature_cols]) # creates model
    results = lin_reg.fit() # fit model and store it
    return results

In [None]:
# Basic features:
# After scaling, the model had a low RMSE (2.36) and a low condition number (19.7)
feature_cols = ['const',
       'Adult_mortality', 'GDP_per_capita', 'Economy_status_Developed'] 

In [None]:
## Summary table
results = ols_lin(y_train, X_train_fe, feature_cols)
results.summary()

In [None]:
# Gets our predictions and stores them in y_pred
y_pred = results.predict(X_train_fe[feature_cols])

# Gets the RMSE of our model: y_train(real) against y_pred (predicted)
rmse = statsmodels.tools.eval_measures.rmse(y_train, y_pred)

print(rmse)
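
For reference, the RMSE is the square root of the mean squared difference between the actual and predicted values, so it is expressed in the same units as the target (years of life expectancy). The sketch below computes it by hand with NumPy on made-up numbers; the function name `rmse_manual` and the values are assumptions for illustration.

```python
import numpy as np

def rmse_manual(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up life expectancies in years (illustrative only)
y_true_toy = [70.0, 65.0, 80.0, 75.0]
y_pred_toy = [72.0, 64.0, 77.0, 75.0]

# Errors are -2, 1, 3, 0, so the mean squared error is (4 + 1 + 9 + 0) / 4 = 3.5
rmse_toy = rmse_manual(y_true_toy, y_pred_toy)
print(rmse_toy)
```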

In [None]:
## Feature engineering the X_test set 

# FE X_test using the same function
X_test_fe = feature_eng(X_test)

## Scale X_test
X_test_fe = scaler_rob(X_test_fe)

# Select exact same features
X_test_fe = X_test_fe[feature_cols]


In [None]:
# Test prediction
y_test_pred = results.predict(X_test_fe) # holds predictions on the testing set

# Testing RMSE
rmse_test = statsmodels.tools.eval_measures.rmse(y_test, y_test_pred)
print(rmse_test)

#### 2.4.2: Statistical Model: Optimised

This is the model that uses sensitive data.

In [None]:
# Sensitive features:  
feature_cols_sens = ['const', 'Region_Asia',
       'Region_Central America and Caribbean', 'Region_European Union',
       'Region_Middle East', 'Region_North America', 'Region_Oceania',
       'Region_Rest of Europe', 'Region_South America', 'Under_five_deaths',
       'Adult_mortality', 'Hepatitis_B', 'GDP_per_capita',
        'Schooling', 'Economy_status_Developed'] 
# After scaling, the RMSE is 1.23 and the condition number is 22.5


In [None]:
## Summary table
results_sens = ols_lin(y_train, X_train_fe, feature_cols_sens)
results_sens.summary()

In [None]:
## Actual performance on training data

# Gets our predictions and stores them in y_pred_sens
y_pred_sens = results_sens.predict(X_train_fe[feature_cols_sens])

# Gets the RMSE of our model: y_train (real) against y_pred_sens (predicted)
rmse_sens = statsmodels.tools.eval_measures.rmse(y_train, y_pred_sens)

print(rmse_sens)

In [None]:
## Feature engineering the X_test set 

# FE X_test using the same function
X_test_fe_sens = feature_eng(X_test)

## Scale X_test
X_test_fe_sens = scaler_rob(X_test_fe_sens)

# Select exact same features
X_test_fe_sens = X_test_fe_sens[feature_cols_sens]


In [None]:
# Test prediction
y_test_pred_sens = results_sens.predict(X_test_fe_sens) # holds predictions on the testing set

# Testing RMSE
rmse_test_sens = statsmodels.tools.eval_measures.rmse(y_test, y_test_pred_sens)
print(rmse_test_sens)

#### 2.5: Observations after running the model
**Second iteration of EDA**

**Columns with low statistical significance (high p-values):**
- Alcohol
- Measles
- Population

**VIF and Stepwise**
- Both were used before scaling and were not effective: the condition number was very high, so the model was not robust.
- After scaling, the condition number was suitably small, so we did not try to reduce it further at the cost of the RMSE or the quality of the model.
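
The effect of scaling on the condition number can be illustrated on synthetic data. The sketch below builds a design matrix with an intercept and two features on very different scales, then compares `np.linalg.cond` (the ratio of the largest to the smallest singular value) before and after a median/IQR scaling in the style of `RobustScaler`. The distributions, the seed and the helper names are assumptions chosen for illustration; the numbers are not those of the real model.

```python
import numpy as np

rng = np.random.default_rng(56)
n = 500
gdp = rng.lognormal(mean=9.0, sigma=1.0, size=n)    # synthetic values in the thousands
schooling = rng.normal(loc=8.0, scale=3.0, size=n)  # synthetic values around ten

def with_const(*cols):
    """Stack an intercept column of ones with the given feature columns."""
    return np.column_stack([np.ones(len(cols[0])), *cols])

def robust_scale(x):
    """Centre on the median and divide by the interquartile range."""
    q25, q50, q75 = np.percentile(x, [25, 50, 75])
    return (x - q50) / (q75 - q25)

cond_raw = np.linalg.cond(with_const(gdp, schooling))
cond_scaled = np.linalg.cond(with_const(robust_scale(gdp), robust_scale(schooling)))

print(f"condition number, raw features:    {cond_raw:.1f}")
print(f"condition number, scaled features: {cond_scaled:.1f}")
```

The raw matrix is badly conditioned mostly because of the mismatch in units rather than because the features are collinear. That may be why VIF and stepwise selection alone did not bring the condition number down before scaling.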

## 03: Function

This can be found in the function notebook.

## 04: Conclusions

### 4.1 Evaluation of basic model:
- Performance was worse than the optimised model, with an RMSE of 2.7 (the RMSE tells us the margin of error, in years, of the model's life expectancy predictions).
- This model performs quite well considering the limited amount of data used for predictions (Region / Adult mortality / GDP per capita / Economy status).
- In addition, it had an excellent condition number (19.7), indicating very low collinearity and few numerical issues. This is understandable since the model has few features.

### 4.1.1 Evaluation of optimised model:
- This model is suited to more accurate predictions, with a low margin of error (RMSE of 1.23).
- It requires sensitive data, so it may not be usable by everybody.
- It kept the condition number low at 22.5 (slightly worse than the basic model, but expected given the larger number of features).

### 4.2 Limitations:
- The model groups countries by region and so assumes that countries within a region are not unique, which is false.
- Potentially outdated data (data from the early 2000s is now 20+ years old).
- Limited dataset: more information (e.g. smoking or drug use) would be required to build a better model.
- The model does not handle a country moving from developing to developed status within the time period.

### 4.3 Implications and Applications:
- A model like this could be used to inform public health policies e.g. vaccinations.
- Can lead to discrimination: individuals with lower life expectancy can face barriers when accessing healthcare.
- If a country chooses to provide more sensitive information, data protection should be in place.
- Effect on individuals: a life expectancy prediction can lead to reduced mental well-being.