# Predicting Life Expectancy with Linear Regression
Author: Keilyn Yuzuki | February 2020

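A quick look at how the accuracy of life expectancy predictions changes with the number of features used to train a linear regression model. It demonstrates the drawback of oversimplifying a large dataset: the single-feature models below score far lower than the model trained on all features.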

**Data Source:** https://www.kaggle.com/kumarajarshi/life-expectancy-who


In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import ipywidgets as widgets
from ipywidgets import interact

In [2]:
le = pd.read_csv('Life-Expectancy-Data.csv')

In [3]:
le.head()

| | Country | Year | Status | Life expectancy | Adult Mortality | infant deaths | Alcohol | percentage expenditure | Hepatitis B | Measles | ... | Polio | Total expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness 1-19 years | thinness 5-9 years | Income composition of resources | Schooling |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 2015 | Developing | 65.0 | 263.0 | 62 | 0.01 | 71.279624 | 65.0 | 1154 | ... | 6.0 | 8.16 | 65.0 | 0.1 | 584.25921 | 33736494.0 | 17.2 | 17.3 | 0.479 | 10.1 |
| 1 | Afghanistan | 2014 | Developing | 59.9 | 271.0 | 64 | 0.01 | 73.523582 | 62.0 | 492 | ... | 58.0 | 8.18 | 62.0 | 0.1 | 612.696514 | 327582.0 | 17.5 | 17.5 | 0.476 | 10.0 |
| 2 | Afghanistan | 2013 | Developing | 59.9 | 268.0 | 66 | 0.01 | 73.219243 | 64.0 | 430 | ... | 62.0 | 8.13 | 64.0 | 0.1 | 631.744976 | 31731688.0 | 17.7 | 17.7 | 0.47 | 9.9 |
| 3 | Afghanistan | 2012 | Developing | 59.5 | 272.0 | 69 | 0.01 | 78.184215 | 67.0 | 2787 | ... | 67.0 | 8.52 | 67.0 | 0.1 | 669.959 | 3696958.0 | 17.9 | 18.0 | 0.463 | 9.8 |
| 4 | Afghanistan | 2011 | Developing | 59.2 | 275.0 | 71 | 0.01 | 7.097109 | 68.0 | 3013 | ... | 68.0 | 7.87 | 68.0 | 0.1 | 63.537231 | 2978599.0 | 18.2 | 18.2 | 0.454 | 9.5 |


In [4]:
le.shape

(2938, 22)

In [5]:
le = le.dropna()

In [6]:
le.dtypes

Country                             object
Year                                 int64
Status                              object
Life expectancy                    float64
Adult Mortality                    float64
infant deaths                        int64
Alcohol                            float64
percentage expenditure             float64
Hepatitis B                        float64
Measles                              int64
 BMI                               float64
under-five deaths                    int64
Polio                              float64
Total expenditure                  float64
Diphtheria                         float64
 HIV/AIDS                          float64
GDP                                float64
Population                         float64
 thinness  1-19 years              float64
 thinness 5-9 years                float64
Income composition of resources    float64
Schooling                          float64
dtype: object

In [7]:
le.columns

Index(['Country', 'Year', 'Status', 'Life expectancy ', 'Adult Mortality',
       'infant deaths', 'Alcohol', 'percentage expenditure', 'Hepatitis B',
       'Measles ', ' BMI ', 'under-five deaths ', 'Polio', 'Total expenditure',
       'Diphtheria ', ' HIV/AIDS', 'GDP', 'Population',
       ' thinness  1-19 years', ' thinness 5-9 years',
       'Income composition of resources', 'Schooling'],
      dtype='object')
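
Several of the column names printed above carry leading, trailing, or doubled spaces (for example `'Life expectancy '` and `' thinness  1-19 years'`), which is why the later cells have to spell `'Life expectancy '` with a trailing space. A minimal sketch of one way to normalise such names, using a handful of the names shown above. The `clean_name` helper is hypothetical and not part of the original notebook; applying it would mean updating the column names used in the later cells.

```python
# Hypothetical helper (not in the original notebook): collapse runs of
# whitespace and trim both ends of a column name.
def clean_name(name):
    return " ".join(name.split())

# A few of the raw names printed by le.columns above.
raw_columns = ['Life expectancy ', 'Measles ', ' BMI ',
               ' thinness  1-19 years', 'Schooling']

cleaned = [clean_name(c) for c in raw_columns]
print(cleaned)
```

On the DataFrame itself, this could be applied with `le = le.rename(columns=clean_name)`.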

In [8]:
train, test = train_test_split(le, test_size = 0.1)
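
Because `train_test_split` shuffles the rows at random, the scores in the cells that follow will differ each time the notebook is rerun. A minimal sketch, on made-up data rather than the WHO dataset, of how passing `random_state` makes the split repeatable. The seed value 42 is an arbitrary choice, not something the original notebook uses.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the DataFrame: 20 rows, 2 columns.
data = np.arange(40).reshape(20, 2)

# The same seed gives the same split on every call.
train_a, test_a = train_test_split(data, test_size=0.1, random_state=42)
train_b, test_b = train_test_split(data, test_size=0.1, random_state=42)

print(test_a.shape)                    # (2, 2)
print(np.array_equal(test_a, test_b))  # True
```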

In [9]:
X_train = train.drop(['Country', 'Year', 'Status', 'Life expectancy '], axis = 1)
Y_train = train[['Life expectancy ']]

X_test = test.drop(['Country', 'Year', 'Status', 'Life expectancy '], axis = 1)
Y_test = test[['Life expectancy ']]

In [10]:
model = LinearRegression()
model.fit(X_train, Y_train)

training_accuracy = model.score(X_train, Y_train)

In [11]:
training_accuracy

0.8352927265631496

In [12]:
model.score(X_test, Y_test)

0.8256324044502747
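
For `LinearRegression`, `score` returns the coefficient of determination (R^2), not a classification-style accuracy. A value near 0.83 therefore means the predictions account for roughly 83% of the variance in life expectancy. A minimal sketch of that calculation on made-up numbers; the `r_squared` helper is hypothetical and written out only to show the formula.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up life expectancy values for illustration only.
y_true = [60.0, 65.0, 70.0, 75.0]

print(r_squared(y_true, y_true))      # perfect predictions -> 1.0
print(r_squared(y_true, [67.5] * 4))  # always predicting the mean -> 0.0
```

A model that predicts worse than the mean gets a negative score, so very low single-feature scores on the test set are not necessarily bounded below by zero.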

In [13]:
X_train = train.drop(['Country', 'Year', 'Status', 'Life expectancy '], axis = 1)
Y_train = train[['Life expectancy ']]

X_test = test.drop(['Country', 'Year', 'Status', 'Life expectancy '], axis = 1)
Y_test = test[['Life expectancy ']]

model = LinearRegression()
model.fit(X_train, Y_train)

print('Training accuracy: ', model.score(X_train, Y_train))
print('Test accuracy: ', model.score(X_test, Y_test))

Training accuracy:  0.8352927265631496
Test accuracy:  0.8256324044502747


In [14]:
list(le.columns)[4:]

['Adult Mortality',
 'infant deaths',
 'Alcohol',
 'percentage expenditure',
 'Hepatitis B',
 'Measles ',
 ' BMI ',
 'under-five deaths ',
 'Polio',
 'Total expenditure',
 'Diphtheria ',
 ' HIV/AIDS',
 'GDP',
 'Population',
 ' thinness  1-19 years',
 ' thinness 5-9 years',
 'Income composition of resources',
 'Schooling']

In [15]:
for col in list(le.columns)[4:]:
    X_train = train[[col]]
    Y_train = train[['Life expectancy ']]

    X_test = test[[col]]
    Y_test = test[['Life expectancy ']]

    model = LinearRegression()
    model.fit(X_train, Y_train)
    
    print('Model trained only on: ', col)
    print('Training accuracy: ', model.score(X_train, Y_train))
    print('Test accuracy: ', model.score(X_test, Y_test))
    print(' ')

Model trained only on:  Adult Mortality
Training accuracy:  0.5063197419049972
Test accuracy:  0.37679270802646525
 
Model trained only on:  infant deaths
Training accuracy:  0.027393772828626206
Test accuracy:  0.03666199182206187
 
Model trained only on:  Alcohol
Training accuracy:  0.16606564263914414
Test accuracy:  0.12528878350483852
 
Model trained only on:  percentage expenditure
Training accuracy:  0.17063960217956506
Test accuracy:  0.1391727361436722
 
Model trained only on:  Hepatitis B
Training accuracy:  0.03808135298673054
Test accuracy:  0.05412817176118201
 
Model trained only on:  Measles 
Training accuracy:  0.0038527533424170812
Test accuracy:  0.009554333570671258
 
Model trained only on:   BMI 
Training accuracy:  0.3036830196182293
Test accuracy:  0.20297717787412506
 
Model trained only on:  under-five deaths 
Training accuracy:  0.035598347943653685
Test accuracy:  0.04661700961615578
 
Model trained only on:  Polio
Training accuracy:  0.10398141768657365
Test 

In [16]:
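def predict_life_expectancy(variables):
    # SelectMultiple passes a tuple of column names; convert it once.
    variables = list(variables)
    if not variables:
        # LinearRegression cannot be fit on zero features.
        print('Select at least one variable.')
        return

    X_train = train[variables]
    X_test = test[variables]

    Y_train = train[['Life expectancy ']]
    Y_test = test[['Life expectancy ']]

    model = LinearRegression()
    model.fit(X_train, Y_train)

    # Print a single selection as a bare name rather than a one-item list.
    label = variables[0] if len(variables) == 1 else variables
    print('Model trained on: ', label)
    # model.score returns R^2, labelled "accuracy" throughout this notebook.
    print('Training accuracy: ', model.score(X_train, Y_train))
    print('Test accuracy: ', model.score(X_test, Y_test))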

In [17]:
var = widgets.SelectMultiple(
    options=list(le.columns)[4:],
    value=['Adult Mortality'],
    rows=10,
    description='Variables',
    disabled=False
)

interact(predict_life_expectancy,  variables = var);

interactive(children=(SelectMultiple(description='Variables', index=(0,), options=('Adult Mortality', 'infant …

To do: add a bar graph of the accuracy for the current selection, or one that updates with a history of past selections. Show test accuracy only, not training accuracy.
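
One possible way to build the history chart described in the note above, sketched on made-up scores. The `history` list and the `plot_test_history` helper are hypothetical, and matplotlib is an assumed extra dependency that the notebook does not currently import.

```python
import matplotlib.pyplot as plt

def plot_test_history(history):
    """Draw one bar per past selection, showing its test score."""
    labels = [label for label, _ in history]
    scores = [score for _, score in history]
    fig, ax = plt.subplots()
    ax.bar(range(len(scores)), scores)
    ax.set_xticks(range(len(labels)))
    ax.set_xticklabels(labels, rotation=45, ha='right')
    ax.set_ylabel('Test accuracy (R^2)')
    fig.tight_layout()
    return ax

# Made-up (selection label, test score) pairs for illustration only.
history = [('A', 0.40), ('A + B', 0.55), ('A + B + C', 0.60)]
ax = plot_test_history(history)
```

In the notebook, `predict_life_expectancy` could append each selection and its test score to such a list and redraw the chart on every widget change.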

Look at Measles, and ...

Exercise: which single factor gives the best result, and which gives the worst? Which combination of factors gives the best result? Hint: it should include at least 80% of the features.
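
A minimal sketch of how the single-factor part of this question could be answered programmatically. It runs on a small synthetic table, not the WHO data, and the `rank_single_features` helper and the column names `strong`, `weak`, and `target` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def rank_single_features(train, test, target):
    """Fit one model per feature; return (feature, test R^2), best first."""
    results = []
    for col in train.columns.drop(target):
        model = LinearRegression().fit(train[[col]], train[target])
        results.append((col, model.score(test[[col]], test[target])))
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# Synthetic data: 'strong' drives the target, 'weak' is pure noise.
rng = np.random.default_rng(0)
toy = pd.DataFrame({'strong': rng.normal(size=200),
                    'weak': rng.normal(size=200)})
toy['target'] = 3 * toy['strong'] + rng.normal(scale=0.5, size=200)

toy_train, toy_test = train_test_split(toy, test_size=0.1, random_state=0)
ranking = rank_single_features(toy_train, toy_test, 'target')
for name, score in ranking:
    print(name, round(score, 3))
```

With the notebook's data, the helper could be called on `train` and `test` after dropping `'Country'`, `'Year'`, and `'Status'`, with `'Life expectancy '` as the target.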
