# TUNED RANDOM FOREST MODEL

#### I built this model to see whether its results offer useful insights, and to make predictions from the data.

Random forest models have certain advantages over a single decision tree: averaging many trees trained on bootstrap samples reduces variance and the likelihood of overfitting.
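The comparison with a single tree can be checked on synthetic data. The sketch below is an illustration only: it uses `make_classification` rather than the WHO data, and the sample size, feature counts and the helper name `compare_tree_and_forest` are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def compare_tree_and_forest(random_state=42):
    """Return 5-fold accuracy scores for a single tree and a random forest."""
    # Synthetic three-class problem (hypothetical stand-in for the real data)
    X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                               n_classes=3, random_state=random_state)
    tree = DecisionTreeClassifier(random_state=random_state)
    forest = RandomForestClassifier(n_estimators=50, random_state=random_state)
    tree_scores = cross_val_score(tree, X, y, cv=5)
    forest_scores = cross_val_score(forest, X, y, cv=5)
    return tree_scores, forest_scores


tree_scores, forest_scores = compare_tree_and_forest()
# Compare the mean and the spread of the fold scores for each model
print(f'Tree:   mean={tree_scores.mean():.3f}, std={tree_scores.std():.3f}')
print(f'Forest: mean={forest_scores.mean():.3f}, std={forest_scores.std():.3f}')
```

The spread of the fold scores gives a rough sense of variance; the exact numbers depend on the synthetic data and the seed.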

Running the WHO data through the random forest model resulted in high scores:


| Model               | F1       | Recall   | Precision | Accuracy |
|:--------------------|:--------:|:--------:|:---------:|---------:|
| Tuned Random Forest | 0.979004 | 0.979167 | 0.979938  | 0.979167 |




In [None]:
####################
# IMPORT LIBRARIES #
####################

# Import `numpy`, `pandas`, `pickle`, and `csv`.
# Import the relevant functions from `sklearn.preprocessing`, `sklearn.ensemble`,
# `sklearn.model_selection`, and `sklearn.metrics`.

import numpy as np
import pandas as pd
import pickle as pkl
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, PredefinedSplit, GridSearchCV
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score
from IPython.display import display
import csv

In [None]:
# Import data

WHO_data = pd.read_csv('web_download_data_WHO.csv')

In [None]:
#############################
# EXPLORATORY DATA ANALYSIS #
#############################

print('Display first 10 rows:')
display(WHO_data.head(10))
print('\n')

print('Display data types:')
display(WHO_data.dtypes)
print('\n')

print('Display data shape:')
display(WHO_data.shape)
print('\n')

print('Display data summary:')
display(WHO_data.describe())
print('\n')

# Check for number of rows that contain missing values
print('Check for missing values:')
WHO_data_isna = WHO_data.isna().any(axis=1).sum()
print(f'There are {WHO_data_isna} rows with missing values.')


Display first 10 rows:


|   | IND_NAME | DIM_GEO_NAME | IND_CODE | DIM_GEO_CODE | DIM_TIME_YEAR | DIM_1_CODE | VALUE_NUMERIC | VALUE_STRING | VALUE_COMMENTS |
|:--|:---------|:-------------|:---------|:-------------|--------------:|:-----------|--------------:|-------------:|:---------------|
| 0 | Adolescent birth rate (per 1000 women) | Afghanistan | MDG_0000000003 | AFG | 2021 | AGEGROUP_YEARS15-19 | 62.0 | 62.0 | Afghanistan 2022-2023 Multiple Indicator Clust... |
| 1 | Adolescent birth rate (per 1000 women) | Afghanistan | MDG_0000000003 | AFG | 2021 | AGEGROUP_YEARS10-14 | 18.0 | 18.0 | Afghanistan 2022-2023 Multiple Indicator Clust... |
| 2 | Age-standardized mortality rate attributed to ... | Afghanistan | SDGAIRBODA | AFG | 2019 | SEX_BTSX | 265.66452 | 265.7 | |
| 3 | Age-standardized prevalence of hypertension am... | Afghanistan | NCD_HYP_PREVALENCE_A | AFG | 2019 | SEX_BTSX | 40.200001 | 40.2 | |
| 4 | Age-standardized prevalence of obesity among a... | Afghanistan | NCD_BMI_30A | AFG | 2022 | SEX_BTSX | 19.222589 | 19.2 | |
| 5 | Age-standardized prevalence of tobacco use amo... | Afghanistan | M_Est_tob_curr_std | AFG | 2022 | SEX_BTSX | 22.700001 | 22.7 | The most recent survey was conducted in 2019. ... |
| 6 | Amount of water- and sanitation-related offici... | Afghanistan | SDGODAWS | AFG | 2022 | | 67.955803 | 67.96 | |
| 7 | Annual mean concentrations of fine particulate... | Afghanistan | SDGPM25 | AFG | 2019 | | 75.18718 | 75.2 | |
| 8 | Average of 15 International Health Regulations... | Afghanistan | SDGIHR2021 | AFG | 2023 | | 38.0667 | 38.0 | |
| 9 | Density of dentists (per 10 000 population) | Afghanistan | HWF_0010 | AFG | 2019 | | 0.714 | 0.7 | Includes Dentists Stock Total. Data source: WH... |




Display data types:


| Column         | dtype   |
|:---------------|:--------|
| IND_NAME       | object  |
| DIM_GEO_NAME   | object  |
| IND_CODE       | object  |
| DIM_GEO_CODE   | object  |
| DIM_TIME_YEAR  | int64   |
| DIM_1_CODE     | object  |
| VALUE_NUMERIC  | float64 |
| VALUE_STRING   | object  |
| VALUE_COMMENTS | object  |




Display data shape:


(10503, 9)



Display data summary:


|       | DIM_TIME_YEAR | VALUE_NUMERIC |
|:------|--------------:|--------------:|
| count | 10503.0       | 10503.0       |
| mean  | 2020.753499   | 462594.8      |
| std   | 1.641991      | 19857620.0    |
| min   | 2014.0        | 0.0           |
| 25%   | 2020.0        | 5.433967      |
| 50%   | 2021.0        | 22.04675      |
| 75%   | 2022.0        | 64.90374      |
| max   | 2023.0        | 1619405000.0  |
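The gap between the mean and the median of `VALUE_NUMERIC` suggests that the column mixes indicators measured in different units, so a summary per indicator may be easier to read. A minimal sketch on a toy frame, using three values copied from the preview above; the frame name `toy` and the choice of statistics are assumptions for illustration.

```python
import pandas as pd

# Toy rows in the same long format as the WHO download
toy = pd.DataFrame({
    'IND_NAME': ['Adolescent birth rate (per 1000 women)',
                 'Adolescent birth rate (per 1000 women)',
                 'Density of dentists (per 10 000 population)'],
    'VALUE_NUMERIC': [62.0, 18.0, 0.714],
})

# Summarise each indicator separately instead of pooling all values
per_indicator = (toy.groupby('IND_NAME')['VALUE_NUMERIC']
                    .agg(['count', 'min', 'median', 'max']))
print(per_indicator)
```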




Check for missing values:
There are 9678 rows with missing values.
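The row count above flags a row if any of its columns is missing, which is why it is so high when optional columns such as `DIM_1_CODE` and `VALUE_COMMENTS` are often empty. A small sketch contrasting the row-level count with a per-column count; the toy frame is hypothetical and only borrows column names from the dataset.

```python
import pandas as pd

# Hypothetical three-row frame with gaps in the optional columns
toy = pd.DataFrame({
    'DIM_1_CODE': ['SEX_BTSX', None, None],
    'VALUE_NUMERIC': [265.7, 67.96, 0.7],
    'VALUE_COMMENTS': [None, None, 'Includes Dentists Stock Total.'],
})

# Rows with at least one missing value (same expression as in the cell above)
rows_with_missing = int(toy.isna().any(axis=1).sum())

# Missing values counted separately for each column
missing_per_column = toy.isna().sum()

print(rows_with_missing)   # 3
print(missing_per_column)
```

Here every row is flagged even though `VALUE_NUMERIC` is complete, so the per-column view is the more informative of the two.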


In [None]:
################
# DATA SHAPING #
################

# Pivot dataframe to create columns based on IND_NAME
WHO_data_pivot = WHO_data.pivot(index=['DIM_GEO_NAME'],
                        columns=['IND_NAME','DIM_1_CODE'], values='VALUE_NUMERIC')

WHO_data = WHO_data_pivot.reset_index()

# Flatten multi-index pivot into dataframe
WHO_data.columns = WHO_data.columns.to_flat_index()
WHO_data.columns = ['_'.join(str(col) for col in multi_col) for multi_col in WHO_data.columns]

# One-hot encode DIM_GEO_NAME_
WHO_data = pd.get_dummies(WHO_data, columns=['DIM_GEO_NAME_'], dtype=int)

# Import CSV file as dictionary, used to rename long column names with shorter names
def csv_to_dict_no_header(filename):
    """Imports a two-column CSV file into a dictionary.

    Args:
        filename (str): The filename of the CSV file.

    Returns:
        dict: A dictionary where the first column is the key and the second is the value.
    """
    data_dict = {}
    with open(filename, 'r') as file:
        reader = csv.reader(file)
        for row in reader:
            if row:  # Ensure the row is not empty
                key = row[0]
                value = row[1]
                data_dict[key] = value
    return data_dict

filename = 'WHO_data_1_columns_dict.csv'
data_as_dict = csv_to_dict_no_header(filename)

WHO_data.rename(columns=data_as_dict, inplace=True)

# Create 'Life_expectancy_category' column
condition1 = (WHO_data['Life_expect_at_brth_yrs_BTSX'] <= 65)
condition2 = ((WHO_data['Life_expect_at_brth_yrs_BTSX'] > 65)
  & (WHO_data['Life_expect_at_brth_yrs_BTSX'] <= 75))
condition3 = (WHO_data['Life_expect_at_brth_yrs_BTSX'] > 75)

value1 = 'Low'
value2 = 'Medium'
value3 = 'High'
defaultvalue = 'NaN'

WHO_data['Life_expectancy_category'] = np.select(
    [condition1, condition2, condition3],
    [value1, value2, value3],
    default=defaultvalue)

WHO_data = WHO_data.fillna(0)

# Remove records with Life_expectancy_category == NaN
WHO_data = WHO_data[WHO_data['Life_expectancy_category'] != 'NaN']

# Drop column Life_expect_at_brth_yrs_BTSX, used in creating Life_expectancy_category
WHO_data = WHO_data.drop(columns=['Life_expect_at_brth_yrs_BTSX'])

# Display data types of the variables
print('Display data types:')
display(WHO_data.dtypes)
print('\n')

# Display data shape
print('Display data shape:')
display(WHO_data.shape)
print('\n')
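The pivot-and-flatten step in the cell above can be tried on a small frame to see how the column names are formed. This is a sketch under assumptions: the indicator names, codes and the values for Albania are hypothetical, chosen only to keep the example short.

```python
import pandas as pd

# Long format: one row per (country, indicator, dimension) combination
long_df = pd.DataFrame({
    'DIM_GEO_NAME': ['Afghanistan', 'Afghanistan', 'Albania', 'Albania'],
    'IND_NAME': ['Birth rate', 'Obesity', 'Birth rate', 'Obesity'],
    'DIM_1_CODE': ['AGE_15-19', 'SEX_BTSX', 'AGE_15-19', 'SEX_BTSX'],
    'VALUE_NUMERIC': [62.0, 19.2, 14.0, 23.5],
})

# Wide format: one row per country, one column per (indicator, dimension) pair
wide = long_df.pivot(index='DIM_GEO_NAME',
                     columns=['IND_NAME', 'DIM_1_CODE'],
                     values='VALUE_NUMERIC').reset_index()

# Join the two column levels into a single string per column
wide.columns = ['_'.join(str(part) for part in col)
                for col in wide.columns.to_flat_index()]
print(wide.columns.tolist())
```

After `reset_index()` the country column has an empty second level, which is why it ends up named `DIM_GEO_NAME_` with a trailing underscore.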



Display data types:


| Column | dtype |
|:-------|:------|
| Adlcnt_birth_rate_per1k_wm_agrp_yr_15_19 | float64 |
| Adlcnt_birth_rate_per1k_wm_agrp_yr_10_14 | float64 |
| Age_std_mort_rt_hhold_ambnt_air_poltn_100K_BTSX | float64 |
| Age_std_prev_hyptsn_adlts_age_30-79_yr_pct_BTSX | float64 |
| Age_std_prev_obsty_adlts_18pls_yr_pct_BTSX | float64 |
| ... | ... |
| DIM_GEO_NAME__Yemen | int64 |
| DIM_GEO_NAME__Zambia | int64 |
| DIM_GEO_NAME__Zimbabwe | int64 |
| DIM_GEO_NAME__occupied Palestinian territory, including east Jerusalem | int64 |




Display data shape:


(192, 271)
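The life-expectancy binning used to build `Life_expectancy_category` can be reproduced on a handful of values. The sketch below repeats the `np.select` approach and, as an alternative that is not used in this notebook, shows `pd.cut` with the same thresholds; the input values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical life expectancies, including both boundary values and a gap
life_expectancy = pd.Series([58.2, 65.0, 70.4, 75.0, 81.3, np.nan])

# Same rule as the notebook: <= 65 is Low, (65, 75] is Medium, > 75 is High
conditions = [
    life_expectancy <= 65,
    (life_expectancy > 65) & (life_expectancy <= 75),
    life_expectancy > 75,
]
by_select = np.select(conditions, ['Low', 'Medium', 'High'], default='NaN')

# Alternative: pd.cut with right-closed bins gives the same categories
# and leaves missing inputs as real missing values rather than the string 'NaN'
by_cut = pd.cut(life_expectancy, bins=[-np.inf, 65, 75, np.inf],
                labels=['Low', 'Medium', 'High'])

print(list(by_select))
print(list(by_cut))
```

With `pd.cut`, rows without a category can be dropped with `dropna()` instead of comparing against a sentinel string.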





In [None]:
##################
# MODEL BUILDING #
##################

%%time

# Separate the dataset into labels (y) and features (X)
y = WHO_data['Life_expectancy_category']
X = WHO_data.drop(columns=['Life_expectancy_category'])

# Separate into train, validate and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

# Encoder for label encoding of the outcome variable
y_encoder = LabelEncoder()

# Fit the encoder on all labels, then transform the training, validation and test labels.
# The encoded arrays are kept under separate names: scikit-learn classifiers accept string
# labels directly, so the model below is fit and evaluated on the original labels.
y_encoder.fit(y)
y_train_encoded = y_encoder.transform(y_train)
y_val_encoded = y_encoder.transform(y_val)
y_test_encoded = y_encoder.transform(y_test)

#########################
# HYPERPARAMETER TUNING #
#########################

# Determine the set of hyperparameters
cv_params = {
    'n_estimators': [50, 100],
    'max_depth': [10, 50],
    'min_samples_split': [0.001, 0.01],
    'min_samples_leaf': [0.5, 1],
    'max_features': ['sqrt'],
    'max_samples': [0.5, 0.9]
}

# Create list of split indices
split_index = [0 if x in X_val.index else -1 for x in X_train.index]
custom_split = PredefinedSplit(split_index)

# Instantiate model
rf = RandomForestClassifier(random_state=42)

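# Search over specified parameters.
# Note: no `scoring` is passed, so candidates are ranked by the default classifier score
# (accuracy); `refit='f1'` only acts as a truthy flag here. Pass `scoring='f1_weighted'`
# to rank candidates by weighted F1 instead.
rf_val = GridSearchCV(rf, cv_params, cv=custom_split, refit='f1', n_jobs=-1, verbose=1)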

# Fit the model
rf_val.fit(X_train, y_train)

# Obtain optimal parameters
best_parameters = rf_val.best_params_
print('\nOptimal parameters:')
display(best_parameters)
print('\n')

# Instantiate a random forest with the optimal parameters found by GridSearchCV
rf_opt = RandomForestClassifier(**best_parameters, random_state=42)

# Fit the optimal model
rf_opt.fit(X_train, y_train)

# Predict on test set
y_pred = rf_opt.predict(X_test)

# Display a comparison of actual vs. predicted labels
results_df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
print(f'Actual vs. predicted labels:\n{results_df.head()} \n')


Fitting 1 folds for each of 32 candidates, totalling 32 fits
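The single fold reported above comes from the predefined split: rows marked `-1` always stay in the training part and rows marked `0` form the one validation fold. A minimal sketch with hypothetical index labels shows what `PredefinedSplit` yields.

```python
from sklearn.model_selection import PredefinedSplit

# Hypothetical row labels: six training rows, two of them held out for validation
train_index = [10, 11, 12, 13, 14, 15]
val_index = {12, 15}

# -1 keeps a row in training for every split; 0 assigns it to validation fold 0
test_fold = [0 if idx in val_index else -1 for idx in train_index]
split = PredefinedSplit(test_fold)

# split() yields positional indices, not the original row labels
folds = [(train_pos.tolist(), val_pos.tolist())
         for train_pos, val_pos in split.split()]

print(split.get_n_splits())   # 1
print(folds)                  # [([0, 1, 3, 4], [2, 5])]
```

Because there is only one fold, each candidate is scored on a single validation set, so the ranking can be sensitive to which rows happen to land in it.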

Optimal parameters:


{'max_depth': 10,
 'max_features': 'sqrt',
 'max_samples': 0.5,
 'min_samples_leaf': 1,
 'min_samples_split': 0.001,
 'n_estimators': 50}
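When `GridSearchCV` is created without a `scoring` argument, scikit-learn ranks candidates by the estimator's default score, which is accuracy for classifiers. The sketch below shows one way to rank by weighted F1 instead. It is an illustration under assumptions: the data come from `make_classification`, the grid is reduced, and 3-fold cross-validation replaces the predefined split, so the selected parameters are not comparable to those above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic three-class data (hypothetical stand-in for the WHO features)
X_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     n_informative=5, n_classes=3,
                                     random_state=42)

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={'n_estimators': [10, 20], 'max_depth': [3, 5]},
    scoring='f1_weighted',   # rank candidates by support-weighted F1
    refit=True,              # refit the best candidate on all of X_demo
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

With `refit=True`, `search.best_estimator_` is already fitted on the full training data and can be used for prediction directly.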



Actual vs. predicted labels:
     Actual Predicted
47   Medium    Medium
145    High      High
80   Medium    Medium
153  Medium    Medium
119  Medium    Medium 

CPU times: user 270 ms, sys: 12.2 ms, total: 282 ms
Wall time: 6.38 s
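For reference, this is how `LabelEncoder` maps the three category labels to integers and back. The label list is a hypothetical sample; the class names match those used in the notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of outcome labels
labels = ['Medium', 'High', 'Low', 'Medium']

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# Classes are stored in sorted order, so High -> 0, Low -> 1, Medium -> 2
print(list(encoder.classes_))   # ['High', 'Low', 'Medium']
print(encoded.tolist())         # [2, 0, 1, 2]

# inverse_transform recovers the original string labels
decoded = encoder.inverse_transform(encoded)
print(list(decoded))            # ['Medium', 'High', 'Low', 'Medium']
```

The integer codes follow alphabetical order, not the Low/Medium/High ranking, so they should not be read as an ordinal scale.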


In [None]:
##########################
# RESULTS AND EVALUATION #
##########################

### OBTAIN PERFORMANCE SCORES ###

# Obtain precision score
pc_test = precision_score(y_test, y_pred, average='weighted')
print(f'The precision score is {pc_test:.3f}')

# Obtain recall score
rc_test = recall_score(y_test, y_pred, average='weighted')
print(f'The recall score is {rc_test:.3f}')

# Obtain accuracy score
ac_test = accuracy_score(y_test, y_pred)
print(f'The accuracy score is {ac_test:.3f}')

# Obtain F1 score
f1_test = f1_score(y_test, y_pred, average='weighted')
print(f'The f1 score is {f1_test:.3f}')

# Precision score on test data set
print(f'\nThe precision score is: {pc_test:.3f} for the test set,' +
      f'\nwhich means that, weighted by class support, {pc_test:.3f} of the predictions ' +
      'made for each class are correct.')

# Recall score on test data set
print(f'\nThe recall score is: {rc_test:.3f} for the test set,' +
      f'\nwhich means that, weighted by class support, {rc_test:.3f} of the actual cases ' +
      'in each class are predicted correctly.')

# Accuracy score on test data set
print(f'\nThe accuracy score is: {ac_test:.3f} for the test set,' +
      f'\nwhich means {ac_test:.3f} of all cases in the test set are classified correctly.')

# F1 score on test data set
print(f'\nThe f1 score is: {f1_test:.3f} for the test set,' +
      '\nwhich is the support-weighted average of the per-class harmonic means ' +
      'of precision and recall.\n')

### EVALUATE THE MODEL ###

# Create table of results
table = pd.DataFrame({'Model': ["Tuned Random Forest"],
                      'F1': [f1_test],
                      'Recall': [rc_test],
                      'Precision': [pc_test],
                      'Accuracy': [ac_test]
                      }
                     )

print('\nResults of tuned random forest model:')
table

The precision score is 0.980
The recall score is 0.979
The accuracy score is 0.979
The f1 score is 0.979

The precision score is: 0.980 for the test set,
which means that, weighted by class support, 0.980 of the predictions made for each class are correct.

The recall score is: 0.979 for the test set,
which means that, weighted by class support, 0.979 of the actual cases in each class are predicted correctly.

The accuracy score is: 0.979 for the test set,
which means 0.979 of all cases in the test set are classified correctly.

The f1 score is: 0.979 for the test set,
which is the support-weighted average of the per-class harmonic means of precision and recall.


Results of tuned random forest model:


|   | Model               | F1       | Recall   | Precision | Accuracy |
|:--|:--------------------|:--------:|:--------:|:---------:|---------:|
| 0 | Tuned Random Forest | 0.979004 | 0.979167 | 0.979938  | 0.979167 |
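The scores above use `average='weighted'`, which computes the metric for each class and then averages the results weighted by the number of true cases in each class. A small worked example with hypothetical labels reproduces the weighted precision by hand and compares it with scikit-learn's result.

```python
import numpy as np
from sklearn.metrics import precision_score

# Hypothetical true and predicted labels for six cases
y_true = ['High', 'High', 'Low', 'Medium', 'Medium', 'Medium']
y_hat = ['High', 'Medium', 'Low', 'Medium', 'Medium', 'High']
labels = ['High', 'Low', 'Medium']

# Per-class precision: High = 1/2, Low = 1/1, Medium = 2/3
per_class = precision_score(y_true, y_hat, labels=labels, average=None)

# Support = number of true cases per class: High = 2, Low = 1, Medium = 3
support = np.array([y_true.count(label) for label in labels])

# Weighted precision = sum(precision * support) / total = (1 + 1 + 2) / 6
manual_weighted = float(np.sum(per_class * support) / support.sum())
sklearn_weighted = precision_score(y_true, y_hat, average='weighted')

print(f'{manual_weighted:.4f}')    # 0.6667
print(f'{sklearn_weighted:.4f}')   # 0.6667
```

Recall and F1 are averaged the same way, which is why the weighted recall equals the accuracy in the results table.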
