# Life Expectancy Analysis & Modeling Using WHO, UN data

## Recursive Feature Elimination

To:&nbsp;&nbsp;&nbsp;&nbsp; [Magnimind](https://magnimindacademy.com/)

From: Matt Curcio, matt.curcio.us@gmail.com

Date: 2023-01-29

Re:&nbsp;&nbsp;&nbsp; 

---

## Executive Summary

Recursive Feature Elimination (RFE), available in scikit-learn as `sklearn.feature_selection.RFE`, repeatedly fits a model, removes the least important feature, and ranks all features by the order in which they were eliminated.

- Features retained by RFE (rank 1): `Status`, `AdultMort`, `BMI`, `Polio`, `DTP`, `HIV`, `Income`, `Education`

---

- Input: `Clean_LE_Data_FEng_4.csv`

<!-- 
Idea derived from
https://www.kaggle.com/code/goyalshalini93/car-price-prediction-linear-regression-rfe
-->

In [9]:
# Common Python Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

# Libraries from Sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

import statsmodels.api as sm

In [2]:
path = '../data/processed/'
fileName = 'Clean_LE_Data_FEng_4.csv'

df = pd.read_csv(path+fileName, header=0)

# Convert 3 vars to categorical
df['Country'] = pd.Categorical(df['Country'])
df['Status'] = pd.Categorical(df['Status'])
df['Region'] = pd.Categorical(df['Region'])

df.head()

| | Country | Year | Status | LifeExpectancy | AdultMort | EtOH | PercExpen | Measles | BMI | lt5yD | Polio | TotalExpen | DTP | HIV | Thin1_19y | Income | Education | Region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 2015 | 0 | 65.0 | 263.0 | 0.01 | 71.279624 | 1154 | 19.1 | 83 | 6.0 | 8.16 | 65.0 | 0.1 | 17.2 | 0.479 | 10.1 | 2 |
| 1 | Afghanistan | 2014 | 0 | 59.9 | 271.0 | 0.01 | 73.523582 | 492 | 18.6 | 86 | 58.0 | 8.18 | 62.0 | 0.1 | 17.5 | 0.476 | 10.0 | 2 |
| 2 | Afghanistan | 2013 | 0 | 59.9 | 268.0 | 0.01 | 73.219243 | 430 | 18.1 | 89 | 62.0 | 8.13 | 64.0 | 0.1 | 17.7 | 0.47 | 9.9 | 2 |
| 3 | Afghanistan | 2012 | 0 | 59.5 | 272.0 | 0.01 | 78.184215 | 2787 | 17.6 | 93 | 67.0 | 8.52 | 67.0 | 0.1 | 17.9 | 0.463 | 9.8 | 2 |
| 4 | Afghanistan | 2011 | 0 | 59.2 | 275.0 | 0.01 | 7.097109 | 3013 | 17.2 | 97 | 68.0 | 7.87 | 68.0 | 0.1 | 18.2 | 0.454 | 9.5 | 2 |


## Recursive Feature Elimination

### train_test_split Section

In [3]:
x = df.drop(['LifeExpectancy'], axis=1)
y = df['LifeExpectancy']

x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                    test_size=0.3,
                                                    random_state=100)

print(f'\nShape of x_train is {x_train.shape}')
print(f'\nShape of y_train is {y_train.shape}')

print(f'\nShape of x_test is {x_test.shape}')
print(f'\nShape of y_test is {y_test.shape}')


Shape of x_train is (2049, 17)

Shape of y_train is (2049,)

Shape of x_test is (879, 17)

Shape of y_test is (879,)


### NOTE 1:
- The `stratify` option of `train_test_split` cannot be used here: at least one class has too few samples to keep the split ratio equal to `test_size`.
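
The failure mode behind this note can be reproduced on a toy data set. The sketch below uses made-up arrays (`X_demo` and `labels_demo` are illustrative names, not part of this notebook) in which one class has a single member, so a stratified split cannot be formed:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): nine samples of class 0, one sample of class 1
X_demo = np.arange(20).reshape(10, 2)
labels_demo = np.array([0] * 9 + [1])

try:
    train_test_split(X_demo, labels_demo, test_size=0.3,
                     random_state=100, stratify=labels_demo)
    stratify_failed = False
except ValueError as err:
    # scikit-learn refuses to stratify when a class has fewer than 2 members
    stratify_failed = True
    print("Stratified split failed:", err)
```

Dropping `stratify`, as done in the split above, avoids the error at the cost of not guaranteeing equal class proportions in the two subsets.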

In [4]:
x_train.head()

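| | Country | Year | Status | AdultMort | EtOH | PercExpen | Measles | BMI | lt5yD | Polio | TotalExpen | DTP | HIV | Thin1_19y | Income | Education | Region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1346 | Kazakhstan | 2013 | 0 | 28.0 | 6.48 | 26.407266 | 73 | 51.4 | 6 | 98.0 | 4.3 | 98.0 | 0.1 | 2.4 | 0.782 | 15.0 | 1 |
| 2073 | Qatar | 2006 | 0 | 83.0 | 1.28 | 448.595299 | 144 | 65.0 | 0 | 95.0 | 2.58 | 96.0 | 0.1 | 4.9 | 0.835 | 13.7 | 2 |
| 746 | Denmark | 2005 | 1 | 92.0 | 11.28 | 7627.412444 | 2 | 55.0 | 0 | 93.0 | 9.77 | 93.0 | 0.1 | 1.2 | 0.897 | 16.6 | 5 |
| 2667 | Tunisia | 2004 | 0 | 15.0 | 1.36 | 379.765905 | 1 | 51.4 | 4 | 97.0 | 5.63 | 97.0 | 0.1 | 6.4 | 0.673 | 13.6 | 2 |
| 348 | Botswana | 2003 | 0 | 693.0 | 5.51 | 299.367125 | 59 | 31.6 | 4 | 96.0 | 4.65 | 96.0 | 31.9 | 1.9 | 0.567 | 11.8 | 7 |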


In [5]:
y_train.head()

1346    69.5
2073    76.6
746     78.1
2667    74.0
348     46.4
Name: LifeExpectancy, dtype: float64

In [7]:
# Save everything just in case
x_train.to_csv(path+'x_train.csv', index=False)
x_test.to_csv(path+'x_test.csv', index=False)

y_train.to_csv(path+'y_train.csv', index=False)
y_test.to_csv(path+'y_test.csv', index=False)

In [13]:
scaler = StandardScaler()

numerical_vars = ['AdultMort','EtOH','PercExpen',
                  'Measles','BMI','lt5yD','Polio','TotalExpen',
                  'DTP','HIV','Thin1_19y','Income','Education']
                  
x_train[numerical_vars] = scaler.fit_transform(x_train[numerical_vars])
x_train.drop(columns=['Country'], inplace=True)
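
The cell above scales only the training features. If the held-out set is scored later, it would normally be transformed with the statistics learned from the training set rather than refit. A minimal sketch on synthetic data follows; the frames `train_demo` and `test_demo` and the object `scaler_demo` are illustrative stand-ins, not the notebook's data:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
cols = ['AdultMort', 'BMI']  # two column names borrowed from the notebook
train_demo = pd.DataFrame(rng.normal(50, 10, size=(100, 2)), columns=cols)
test_demo = pd.DataFrame(rng.normal(50, 10, size=(40, 2)), columns=cols)

scaler_demo = StandardScaler()
# Learn mean and standard deviation from the training rows only
train_scaled = scaler_demo.fit_transform(train_demo[cols])
# Reuse those training statistics on the test rows (transform, not fit_transform)
test_scaled = scaler_demo.transform(test_demo[cols])

print(train_scaled.mean(axis=0).round(6))  # close to 0 by construction
print(test_scaled.mean(axis=0).round(3))   # generally not exactly 0
```

Calling `fit_transform` on the test rows instead would leak test-set statistics into the preprocessing step.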

In [14]:
x_train.head()

| | Year | Status | AdultMort | EtOH | PercExpen | Measles | BMI | lt5yD | Polio | TotalExpen | DTP | HIV | Thin1_19y | Income | Education | Region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1346 | 2013 | 0 | -1.111033 | 0.465999 | -0.35835 | -0.206915 | 0.644948 | -0.219162 | 0.663112 | -0.695903 | 0.662289 | -0.318614 | -0.543703 | 0.749034 | 0.908488 | 1 |
| 2073 | 2006 | 0 | -0.664399 | -0.858965 | -0.152524 | -0.200388 | 1.326527 | -0.254921 | 0.531484 | -1.400205 | 0.576187 | -0.318614 | 0.018928 | 1.01115 | 0.51202 | 2 |
| 746 | 2005 | 1 | -0.591314 | 1.689043 | 3.347303 | -0.213441 | 0.825366 | -0.254921 | 0.443733 | 1.543944 | 0.447034 | -0.318614 | -0.813766 | 1.317777 | 1.396449 | 5 |
| 2667 | 2004 | 0 | -1.2166 | -0.838581 | -0.18608 | -0.213533 | 0.644948 | -0.231082 | 0.619236 | -0.151296 | 0.619238 | -0.318614 | 0.356507 | 0.209964 | 0.481522 | 2 |
| 348 | 2003 | 0 | 4.289169 | 0.218842 | -0.225276 | -0.208202 | -0.347352 | -0.231082 | 0.57536 | -0.552585 | 0.576187 | 5.769986 | -0.656229 | -0.31427 | -0.067434 | 7 |


### RFE fitting

In [17]:
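lm = LinearRegression()
lm.fit(x_train, y_train)  # baseline fit on all features; RFE refits its own clone of lm
# n_features_to_select is left at its default, so RFE keeps half of the features
rfe = RFE(lm)
rfe = rfe.fit(x_train, y_train)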
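
To make the elimination step concrete, here is a minimal hand-rolled sketch of the idea: fit the model, drop the feature with the smallest absolute coefficient, and repeat. It runs on synthetic data, and the helper `manual_rfe` is a hypothetical name introduced for illustration; it is a simplified stand-in for scikit-learn's `RFE`, not a copy of it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def manual_rfe(X, y, n_keep):
    """Drop the weakest feature one at a time until n_keep remain.

    Returns (kept column indices, eliminated indices in order of removal).
    """
    remaining = list(range(X.shape[1]))
    eliminated = []
    while len(remaining) > n_keep:
        model = LinearRegression().fit(X[:, remaining], y)
        # Smallest |coefficient| is treated as least important;
        # this is only meaningful when features share a common scale
        weakest = int(np.argmin(np.abs(model.coef_)))
        eliminated.append(remaining.pop(weakest))
    return remaining, eliminated

# Synthetic data (assumption): only columns 0 and 1 drive the target
rng = np.random.default_rng(100)
X_demo = rng.normal(size=(200, 4))
y_demo = 5 * X_demo[:, 0] + 3 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

kept, dropped = manual_rfe(X_demo, y_demo, n_keep=2)
print("kept:", kept, "dropped in order:", dropped)
```

Because coefficient size stands in for importance, unscaled columns can distort the elimination order, which is one reason to standardize the numerical features first.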

In [18]:
feature_importance = list(zip(x_train.columns,rfe.support_,rfe.ranking_))

In [19]:
def Sort_Tuple(tup):
    """Sort a list of (feature, selected, rank) tuples in place by rank.

    The sort is ascending (reverse is left at its default) and the key
    is the third element of each tuple, the RFE ranking.
    """
    tup.sort(key=lambda x: x[2])
    return tup


# printing the sorted list of tuples
Sort_Tuple(feature_importance)

[('Status', True, 1),
 ('AdultMort', True, 1),
 ('BMI', True, 1),
 ('Polio', True, 1),
 ('DTP', True, 1),
 ('HIV', True, 1),
 ('Income', True, 1),
 ('Education', True, 1),
 ('PercExpen', False, 2),
 ('Thin1_19y', False, 3),
 ('Measles', False, 4),
 ('EtOH', False, 5),
 ('Region', False, 6),
 ('TotalExpen', False, 7),
 ('lt5yD', False, 8),
 ('Year', False, 9)]
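
Eight features share rank 1 because `RFE` was created without `n_features_to_select`; in that case scikit-learn keeps half of the features (8 of the 16 here) and assigns rank 1 to all of them. The sketch below, on synthetic data with hypothetical names (`X_demo`, `rfe_default`, `rfe_full`), shows the default and how setting `n_features_to_select=1` would yield a full ordering instead:

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(100)
X_demo = rng.normal(size=(200, 6))
# Coefficients of decreasing size so the features have a clear ordering
true_coefs = np.array([6.0, 5.0, 4.0, 3.0, 2.0, 1.0])
y_demo = X_demo @ true_coefs + rng.normal(scale=0.1, size=200)

# Default: keep half of the features, all tied at rank 1
rfe_default = RFE(LinearRegression()).fit(X_demo, y_demo)
print(rfe_default.support_.sum(), rfe_default.ranking_)

# Keep a single feature: ranking_ becomes a full ordering 1..n_features
rfe_full = RFE(LinearRegression(), n_features_to_select=1).fit(X_demo, y_demo)
print(rfe_full.ranking_)
```

With a full ordering, the rank-1 features of this notebook could be compared with one another rather than only against the eliminated ones.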

### NOTE 2: Inference
    
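- RFE retained eight features at rank 1: `Status`, `AdultMort`, `BMI`, `Polio`, `DTP`, `HIV`, `Income`, `Education`. The remaining eight were eliminated in the order given by their rank, with `Year` (rank 9) removed first.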