# Life_Expectancy_WHO_UN_Analysis_Modeling

## Recursive_Feature_Elimination

To: [Magnimind](https://magnimindacademy.com/)

From: Matt Curcio, matt.curcio.ri@gmail.com

Date: 2023-01-29

Re: NOTEBOOK #6

---


- **Use** `Clean_LE_Data_FEng_4.csv`

<!-- Idea derived from
https://www.kaggle.com/code/goyalshalini93/car-price-prediction-linear-regression-rfe
-->

In [7]:
# Common Python Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# Suppress library warnings to keep the notebook output readable
import warnings
warnings.filterwarnings("ignore")

# Libraries from Sklearn
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

import statsmodels.api as sm 

from statsmodels.stats.outliers_influence import variance_inflation_factor

In [8]:
# Reality check ;)
path = '../data/processed/'
fileName = 'Clean_LE_Data_FEng_4.csv'

df = pd.read_csv(path+fileName, header=0)

# Convert 4 vars to categorical
df['Country'] = pd.Categorical(df['Country'])
df['Year'] = pd.Categorical(df['Year'])
df['Status'] = pd.Categorical(df['Status'])
df['Region'] = pd.Categorical(df['Region'])

df.head(3)

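| | Country | Year | Status | LifeExpectancy | AdultMort | EtOH | PercExpen | Measles | BMI | lt5yD | Polio | TotalExpen | DTP | HIV | Thin1_19y | Income | Education | Region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 2015 | 0 | 65.0 | 263.0 | 0.01 | 71.279624 | 1154 | 19.1 | 83 | 6.0 | 8.16 | 65.0 | 0.1 | 17.2 | 0.479 | 10.1 | 2 |
| 1 | Afghanistan | 2014 | 0 | 59.9 | 271.0 | 0.01 | 73.523582 | 492 | 18.6 | 86 | 58.0 | 8.18 | 62.0 | 0.1 | 17.5 | 0.476 | 10.0 | 2 |
| 2 | Afghanistan | 2013 | 0 | 59.9 | 268.0 | 0.01 | 73.219243 | 430 | 18.1 | 89 | 62.0 | 8.13 | 64.0 | 0.1 | 17.7 | 0.47 | 9.9 | 2 |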


## Recursive Feature Elimination

### train_test_split Section

In [9]:
x = df.drop(['LifeExpectancy','Country'], axis=1)
y = df['LifeExpectancy']

x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                    test_size=0.3,
                                                    random_state=100)

print(f'\nShape of x_train is {x_train.shape}')
print(f'\nShape of y_train is {y_train.shape}')

print(f'\nShape of x_test is {x_test.shape}')
print(f'\nShape of y_test is {y_test.shape}')


Shape of x_train is (2049, 16)

Shape of y_train is (2049,)

Shape of x_test is (879, 16)

Shape of y_test is (879,)
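
The split sizes printed above follow from the row count and `test_size`. As a minimal sketch, assuming scikit-learn rounds the test share up and gives the remaining rows to the training set (which matches the shapes shown here), the arithmetic can be checked with the standard library alone. The helper name `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test share is rounded up and the training set
    receives the remaining rows.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

# 2049 training rows + 879 test rows appear in the shapes printed above
n_samples = 2049 + 879
n_train, n_test = split_sizes(n_samples, test_size=0.3)
print(n_train, n_test)  # 2049 879
```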


### NOTE 1:
- The `stratify` option of `train_test_split` cannot be used here: at least one class has too few members to preserve the split ratio set by `test_size`.
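
A quick way to see which classes block stratification is to count their members before splitting. The sketch below uses only the standard library. The helper name and the label values are hypothetical, and the threshold of two members per class is an assumption based on a stratified split needing at least one member of each class on each side.

```python
from collections import Counter

def classes_too_small_to_stratify(labels, min_members=2):
    """Return the labels that occur fewer than min_members times."""
    counts = Counter(labels)
    return sorted(label for label, n in counts.items() if n < min_members)

# Hypothetical labels: one class has a single member
labels = ["Developing"] * 5 + ["Developed"] * 3 + ["Unknown"]
too_small = classes_too_small_to_stratify(labels)
print(too_small)  # ['Unknown']
```

An empty result suggests the column is a candidate for `stratify`. A continuous column will usually return many labels, because most of its values occur only once.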

In [10]:
x_train.head()

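| | Year | Status | AdultMort | EtOH | PercExpen | Measles | BMI | lt5yD | Polio | TotalExpen | DTP | HIV | Thin1_19y | Income | Education | Region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1346 | 2013 | 0 | 28.0 | 6.48 | 26.407266 | 73 | 51.4 | 6 | 98.0 | 4.3 | 98.0 | 0.1 | 2.4 | 0.782 | 15.0 | 1 |
| 2073 | 2006 | 0 | 83.0 | 1.28 | 448.595299 | 144 | 65.0 | 0 | 95.0 | 2.58 | 96.0 | 0.1 | 4.9 | 0.835 | 13.7 | 2 |
| 746 | 2005 | 1 | 92.0 | 11.28 | 7627.412444 | 2 | 55.0 | 0 | 93.0 | 9.77 | 93.0 | 0.1 | 1.2 | 0.897 | 16.6 | 5 |
| 2667 | 2004 | 0 | 15.0 | 1.36 | 379.765905 | 1 | 51.4 | 4 | 97.0 | 5.63 | 97.0 | 0.1 | 6.4 | 0.673 | 13.6 | 2 |
| 348 | 2003 | 0 | 693.0 | 5.51 | 299.367125 | 59 | 31.6 | 4 | 96.0 | 4.65 | 96.0 | 31.9 | 1.9 | 0.567 | 11.8 | 7 |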


In [11]:
y_train.head()

1346    69.5
2073    76.6
746     78.1
2667    74.0
348     46.4
Name: LifeExpectancy, dtype: float64

In [12]:
# Save Train/Test data
x_train.to_csv('x_train.csv', index=False)

x_test.to_csv('x_test.csv', index=False)

y_train.to_csv('y_train.csv', index=False)

y_test.to_csv('y_test.csv', index=False)

In [13]:
# Scale the numerical columns to the [0, 1] range.
# The scaler is fit on the training split only.
scaler = MinMaxScaler()

numerical_vars = ['AdultMort','EtOH','PercExpen',
                  'Measles','BMI','lt5yD','Polio','TotalExpen',
                  'DTP','HIV','Thin1_19y','Income','Education']

x_train[numerical_vars] = scaler.fit_transform(x_train[numerical_vars])

In [14]:
x_train.head()

| | Year | Status | AdultMort | EtOH | PercExpen | Measles | BMI | lt5yD | Polio | TotalExpen | DTP | HIV | Thin1_19y | Income | Education | Region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1346 | 2013 | 0 | 0.037396 | 0.362262 | 0.001356 | 0.0004 | 0.657963 | 0.0024 | 0.989583 | 0.211151 | 0.989691 | 0.0 | 0.083333 | 0.827513 | 0.724638 | 1 |
| 2073 | 2006 | 0 | 0.113573 | 0.071109 | 0.023029 | 0.000789 | 0.835509 | 0.0 | 0.958333 | 0.109134 | 0.969072 | 0.0 | 0.173913 | 0.883598 | 0.661836 | 2 |
| 746 | 2005 | 1 | 0.126039 | 0.631019 | 0.391553 | 1.1e-05 | 0.704961 | 0.0 | 0.9375 | 0.535587 | 0.938144 | 0.0 | 0.039855 | 0.949206 | 0.801932 | 5 |
| 2667 | 2004 | 0 | 0.019391 | 0.075588 | 0.019495 | 5e-06 | 0.657963 | 0.0016 | 0.979167 | 0.290036 | 0.979381 | 0.0 | 0.228261 | 0.712169 | 0.657005 | 2 |
| 348 | 2003 | 0 | 0.958449 | 0.307951 | 0.015368 | 0.000323 | 0.399478 | 0.0016 | 0.96875 | 0.23191 | 0.969072 | 0.629703 | 0.065217 | 0.6 | 0.570048 | 7 |
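
The scaled values can be reproduced by hand, which also shows how the test split would be treated. The sketch below is a library-free illustration of min-max scaling, not the notebook's own code: the function names are hypothetical, and the training minimum (1.0) and maximum (723.0) for `AdultMort` are assumptions inferred from the raw and scaled values displayed above. In the notebook itself, the equivalent step for the test split would be `scaler.transform(x_test[numerical_vars])`, reusing the scaler fitted on the training data.

```python
def fit_min_max(values):
    """Learn the scaling parameters from the training values only."""
    return min(values), max(values)

def transform_min_max(values, lo, hi):
    """Apply (v - lo) / (hi - lo) using previously learned parameters."""
    return [(v - lo) / (hi - lo) for v in values]

# AdultMort from the five training rows shown above, plus 1.0 and 723.0,
# which are ASSUMED to be the training minimum and maximum
train_adult_mort = [28.0, 83.0, 92.0, 15.0, 693.0, 1.0, 723.0]
lo, hi = fit_min_max(train_adult_mort)
scaled_train = transform_min_max(train_adult_mort, lo, hi)
print([round(v, 6) for v in scaled_train[:5]])

# Hypothetical test values, scaled with the TRAINING parameters
test_adult_mort = [263.0, 800.0]
scaled_test = transform_min_max(test_adult_mort, lo, hi)
print(scaled_test)
```

Because the parameters come from the training split, a test value above the training maximum (800.0 here) scales to more than 1. That is expected and does not indicate an error.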


### RFE fitting

In [15]:
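lm = LinearRegression()
lm.fit(x_train,y_train)

# RFE refits a clone of `lm` at every elimination step. With no
# n_features_to_select given, it keeps half of the features (8 of 16 here).
rfe = RFE(lm)
rfe = rfe.fit(x_train, y_train)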

In [16]:
feature_importance = list(zip(x_train.columns,rfe.support_,rfe.ranking_))

In [17]:
def Sort_Tuple(tup):
    """Sort a list of tuples in place, in ascending order of the
    third element (index 2, the RFE ranking), and return it.
    """
    tup.sort(key = lambda x: x[2])
    return tup
 

# printing the sorted list of tuples
Sort_Tuple(feature_importance)

[('AdultMort', True, 1),
 ('PercExpen', True, 1),
 ('Measles', True, 1),
 ('BMI', True, 1),
 ('DTP', True, 1),
 ('HIV', True, 1),
 ('Income', True, 1),
 ('Education', True, 1),
 ('Polio', False, 2),
 ('Thin1_19y', False, 3),
 ('Status', False, 4),
 ('lt5yD', False, 5),
 ('TotalExpen', False, 6),
 ('EtOH', False, 7),
 ('Region', False, 8),
 ('Year', False, 9)]
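
The ranks above run from 1 to 9 because 16 features go in, eight are kept (rank 1), and the other eight are ranked by how late they were dropped. The loop below is a simplified, library-free sketch of that bookkeeping, not scikit-learn's implementation. `rfe_ranking` and `toy_score` are hypothetical names, and the fixed toy weights stand in for the squared coefficients that a refitted linear model would supply at each round.

```python
def rfe_ranking(features, score_fn, n_select=None):
    """Drop the lowest-scoring feature one at a time.

    score_fn(remaining) must return {feature: importance} for the
    features still in play; a real run would refit the model here.
    """
    if n_select is None:
        n_select = len(features) // 2  # assumed default: keep half
    remaining = list(features)
    ranking = {f: 1 for f in features}
    while len(remaining) > n_select:
        scores = score_fn(remaining)
        weakest = min(remaining, key=lambda f: scores[f])
        remaining.remove(weakest)
        # Every feature eliminated so far moves one rank further down
        for f in features:
            if f not in remaining:
                ranking[f] += 1
    return remaining, ranking

# Toy importances (hypothetical), squared like linear-model coefficients
toy_weights = {"a": 0.9, "b": -0.1, "c": 0.5, "d": 0.3}

def toy_score(remaining):
    return {f: toy_weights[f] ** 2 for f in remaining}

kept, ranking = rfe_ranking(list(toy_weights), toy_score)
print(kept)     # ['a', 'c']
print(ranking)  # {'a': 1, 'b': 3, 'c': 1, 'd': 2}
```

The feature dropped first ends with the highest rank, which is why `Year` (rank 9) is the weakest feature in the output above and `Polio` (rank 2) was the last one eliminated.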

### NOTE 2: Inference
    
- **USE FEATURES**: `'AdultMort','PercExpen','Measles','BMI','DTP','HIV','Income','Education'`.
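
Rather than copying the names by hand, the selected features can be read from the `(name, support, ranking)` tuples. The sketch below is self-contained, with the tuples copied from the sorted RFE output above. The variable name `selected` is an assumption, not part of the original notebook.

```python
# (name, support, ranking) tuples copied from the sorted RFE output
feature_importance = [
    ('AdultMort', True, 1), ('PercExpen', True, 1), ('Measles', True, 1),
    ('BMI', True, 1), ('DTP', True, 1), ('HIV', True, 1),
    ('Income', True, 1), ('Education', True, 1),
    ('Polio', False, 2), ('Thin1_19y', False, 3), ('Status', False, 4),
    ('lt5yD', False, 5), ('TotalExpen', False, 6), ('EtOH', False, 7),
    ('Region', False, 8), ('Year', False, 9),
]

# Keep only the features that RFE marked as supported
selected = [name for name, keep, _rank in feature_importance if keep]
print(selected)
```

In the notebook, `x_train[selected]` would then restrict the training data to these eight columns.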