# Multivariate Linear Regression
## Estimating life expectancy with WHO data
Life expectancy is a statistic that estimates the average number of years
an individual is expected to live.

The data come from the World Health Organization (WHO).

The dataset has 2938 records with 22 columns each. It covers the years
2000 to 2015 for the 193 countries that reported data to the WHO. Not every country
has a record for every year.

*Description and context of the Life Expectancy (WHO) dataset can be found here.*

https://www.kaggle.com/kumarajarshi/life-expectancy-who

In [309]:
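# Import libraries
import pandas as pd
import numpy as np
# matmul and inv are imported from their public locations;
# numpy.linalg.linalg is a private module and should not be relied on.
from numpy import matmul
from numpy.linalg import inv
import os
import matplotlib.pyplot as plt

le = pd.read_csv("Life_Expectancy_Data.csv", delimiter=",")

le.head()

# descriptive statistics
le.describe()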


Unnamed: 0,Year,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,BMI,under-five deaths,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
count,2938.0,2928.0,2928.0,2938.0,2744.0,2938.0,2385.0,2938.0,2904.0,2938.0,2919.0,2712.0,2919.0,2938.0,2490.0,2286.0,2904.0,2904.0,2771.0,2775.0
mean,2007.51872,69.224932,164.796448,30.303948,4.602861,738.251295,80.940461,2419.59224,38.321247,42.035739,82.550188,5.93819,82.324084,1.742103,7483.158469,12753380.0,4.839704,4.870317,0.627551,11.992793
std,4.613841,9.523867,124.292079,117.926501,4.052413,1987.914858,25.070016,11467.272489,20.044034,160.445548,23.428046,2.49832,23.716912,5.077785,14270.169342,61012100.0,4.420195,4.508882,0.210904,3.35892
min,2000.0,36.3,1.0,0.0,0.01,0.0,1.0,0.0,1.0,0.0,3.0,0.37,2.0,0.1,1.68135,34.0,0.1,0.1,0.0,0.0
25%,2004.0,63.1,74.0,0.0,0.8775,4.685343,77.0,0.0,19.3,0.0,78.0,4.26,78.0,0.1,463.935626,195793.2,1.6,1.5,0.493,10.1
50%,2008.0,72.1,144.0,3.0,3.755,64.912906,92.0,17.0,43.5,4.0,93.0,5.755,93.0,0.1,1766.947595,1386542.0,3.3,3.3,0.677,12.3
75%,2012.0,75.7,228.0,22.0,7.7025,441.534144,97.0,360.25,56.2,28.0,97.0,7.4925,97.0,0.8,5910.806335,7420359.0,7.2,7.2,0.779,14.3
max,2015.0,89.0,723.0,1800.0,17.87,19479.91161,99.0,212183.0,87.3,2500.0,99.0,17.6,99.0,50.6,119172.7418,1293859000.0,27.7,28.6,0.948,20.7


Now that the data are imported, note that the documentation reports
null values in some columns, so a preprocessing step is needed
to handle the missing data.

We use a method described by the Kaggle community user harshini564.

The method fills the null values by linear interpolation.

In [310]:
# rename the columns because they contain unnecessary spaces

le.rename(columns={" BMI ":"BMI","Life expectancy ":"Life_Expectancy","Adult Mortality":"Adult_Mortality",
                   "infant deaths":"Infant_Deaths","percentage expenditure":"Percentage_Exp","Hepatitis B":"HepatitisB",
                  "Measles ":"Measles","under-five deaths ":"Under_Five_Deaths","Diphtheria ":"Diphtheria",
                  " HIV/AIDS":"HIV/AIDS"," thinness  1-19 years":"thinness_1to19_years"," thinness 5-9 years":"thinness_5to9_years",
                   "Income composition of resources":"Income_Comp_Of_Resources",
                   "Total expenditure":"Tot_Exp"},inplace=True)



In [311]:
# Identify percentage of null values in each column.
le.isnull().sum()*100/le.isnull().count()

Country                      0.000000
Year                         0.000000
Status                       0.000000
Life_Expectancy              0.340368
Adult_Mortality              0.340368
Infant_Deaths                0.000000
Alcohol                      6.603131
Percentage_Exp               0.000000
HepatitisB                  18.822328
Measles                      0.000000
BMI                          1.157250
Under_Five_Deaths            0.000000
Polio                        0.646698
Tot_Exp                      7.692308
Diphtheria                   0.646698
HIV/AIDS                     0.000000
GDP                         15.248468
Population                  22.191967
thinness_1to19_years         1.157250
thinness_5to9_years          1.157250
Income_Comp_Of_Resources     5.684139
Schooling                    5.547992
dtype: float64

In [312]:
country_list = le.Country.unique()
fill_list = ['Life_Expectancy','Adult_Mortality','Alcohol','HepatitisB','BMI','Polio','Tot_Exp','Diphtheria','GDP','Population',
             'thinness_1to19_years','thinness_5to9_years','Income_Comp_Of_Resources','Schooling']

# Treat null values using interpolation.
for country in country_list:
    le.loc[le['Country'] == country,fill_list] = le.loc[le['Country'] == country,fill_list].interpolate()

# Drop remaining null values after interpolation.
le.dropna(inplace=True)
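
To see why some nulls can survive the per-country interpolation and still have to be dropped, here is a minimal sketch on a made-up table (the `toy` frame and its values are illustrative, not taken from the WHO data). With its default settings, `interpolate()` fills gaps between known values but leaves a missing value at the start of a series untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two countries, three records each.
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "B"],
    "GDP": [100.0, np.nan, 300.0, np.nan, 50.0, 70.0],
})

# Interpolate within each country, mirroring the per-country loop above.
toy["GDP_filled"] = toy.groupby("Country")["GDP"].transform(
    lambda s: s.interpolate()
)

print(toy)
```

Country A's gap sits between two known values, so it is filled with their midpoint. Country B's first record has no earlier value to interpolate from, so it stays null and would be removed by `dropna`.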


In [313]:
# Verifying null-values after applying above methods.
le.isnull().sum()

Country                     0
Year                        0
Status                      0
Life_Expectancy             0
Adult_Mortality             0
Infant_Deaths               0
Alcohol                     0
Percentage_Exp              0
HepatitisB                  0
Measles                     0
BMI                         0
Under_Five_Deaths           0
Polio                       0
Tot_Exp                     0
Diphtheria                  0
HIV/AIDS                    0
GDP                         0
Population                  0
thinness_1to19_years        0
thinness_5to9_years         0
Income_Comp_Of_Resources    0
Schooling                   0
dtype: int64

We now select our object of study. We would like to estimate the life expectancy
of as many countries as possible for the year 2014.


In [314]:
# select the records for the year 2014 (all columns are kept)
y2014 = le.loc[le['Year'] == 2014]
y2014.head()

Unnamed: 0,Country,Year,Status,Life_Expectancy,Adult_Mortality,Infant_Deaths,Alcohol,Percentage_Exp,HepatitisB,Measles,...,Polio,Tot_Exp,Diphtheria,HIV/AIDS,GDP,Population,thinness_1to19_years,thinness_5to9_years,Income_Comp_Of_Resources,Schooling
1,Afghanistan,2014,Developing,59.9,271.0,64,0.01,73.523582,62.0,492,...,58.0,8.18,62.0,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0
17,Albania,2014,Developing,77.5,8.0,0,4.51,428.749067,98.0,0,...,98.0,5.88,98.0,0.1,4575.763787,288914.0,1.2,1.3,0.761,14.2
33,Algeria,2014,Developing,75.4,11.0,21,0.01,54.237318,95.0,0,...,95.0,7.21,95.0,0.1,547.8517,39113313.0,6.0,5.8,0.741,14.4
49,Angola,2014,Developing,51.7,348.0,67,8.33,23.965612,64.0,11699,...,68.0,3.31,64.0,2.0,479.31224,2692466.0,8.5,8.3,0.527,11.4
81,Argentina,2014,Developing,76.2,118.0,8,7.93,847.371746,94.0,1,...,92.0,4.79,94.0,0.1,12245.25645,42981515.0,1.0,0.9,0.825,17.3


However, we still have columns with categorical data. They must be removed because
linear regression works only with numeric data. For simplicity we do not encode
these columns, although doing so is possible. We also remove "Year", which is constant
once the data are filtered to 2014.
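
As a rough illustration of the encoding this notebook skips, the sketch below one-hot encodes a `Status` column with `pd.get_dummies`. The `toy` frame and its values are made up for the example; this is one possible approach, not the one used here.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real data.
toy = pd.DataFrame({
    "Status": ["Developing", "Developed", "Developing"],
    "Schooling": [10.0, 16.5, 12.3],
})

# drop_first=True keeps a single indicator column, which avoids perfect
# collinearity with the intercept column of ones.
encoded = pd.get_dummies(toy, columns=["Status"], drop_first=True, dtype=float)

print(encoded)
```

The resulting `Status_Developing` column is numeric, so it could enter the data matrix like any other predictor.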

In [315]:
# drop the columns "Status", "Year" and "Country"
df = y2014.drop(columns=['Status', 'Year', 'Country'])
df.head()

Unnamed: 0,Life_Expectancy,Adult_Mortality,Infant_Deaths,Alcohol,Percentage_Exp,HepatitisB,Measles,BMI,Under_Five_Deaths,Polio,Tot_Exp,Diphtheria,HIV/AIDS,GDP,Population,thinness_1to19_years,thinness_5to9_years,Income_Comp_Of_Resources,Schooling
1,59.9,271.0,64,0.01,73.523582,62.0,492,18.6,86,58.0,8.18,62.0,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0
17,77.5,8.0,0,4.51,428.749067,98.0,0,57.2,1,98.0,5.88,98.0,0.1,4575.763787,288914.0,1.2,1.3,0.761,14.2
33,75.4,11.0,21,0.01,54.237318,95.0,0,58.4,24,95.0,7.21,95.0,0.1,547.8517,39113313.0,6.0,5.8,0.741,14.4
49,51.7,348.0,67,8.33,23.965612,64.0,11699,22.7,101,68.0,3.31,64.0,2.0,479.31224,2692466.0,8.5,8.3,0.527,11.4
81,76.2,118.0,8,7.93,847.371746,94.0,1,62.2,9,92.0,4.79,94.0,0.1,12245.25645,42981515.0,1.0,0.9,0.825,17.3


In [316]:
# extract the response variable (Life_Expectancy)
ve = df['Life_Expectancy']
df = df.drop(columns=['Life_Expectancy'])
y = ve.values

In [317]:
# X holds our data matrix, with a column of ones prepended for the intercept.
X = df.values
X = np.c_[np.ones(X.shape[0]), X]


So far:

**X** has size 131x19 (131 countries, 18 predictors plus the column of ones).

**y** has size 131x1.

We carry out the computations following the ordinary least squares (OLS) method.
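
For reference, the OLS estimator minimizes the sum of squared residuals and, when $X^\top X$ is invertible, has the closed form below. This is the standard textbook expression written with this notebook's symbols **X**, **y** and **b**; $\beta$ is only a placeholder for a candidate coefficient vector.

$$
b = \arg\min_{\beta} \lVert y - X\beta \rVert^2 = (X^\top X)^{-1} X^\top y
$$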

In [318]:
X_t = X.transpose()
# (X^T X)^{-1}
XtX_inv = inv(matmul(X_t, X))
# X^T y
Xy = matmul(X_t, y)
# OLS estimator: b = (X^T X)^{-1} X^T y
b = matmul(XtX_inv, Xy)

Now we have the estimator b!
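
Forming the inverse explicitly is fine for a small, well-conditioned problem, but a least-squares solver is often preferred numerically. The sketch below compares the two approaches on synthetic data; `X_demo`, `y_demo` and `true_b` are made up for illustration and are not part of the notebook.

```python
import numpy as np

# Synthetic, well-conditioned problem: intercept plus three predictors.
rng = np.random.default_rng(0)
m, k = 50, 3
X_demo = np.c_[np.ones(m), rng.normal(size=(m, k))]
true_b = np.array([2.0, 1.0, -0.5, 0.25])
y_demo = X_demo @ true_b + rng.normal(scale=0.1, size=m)

# Closed form via the explicit inverse, as in the cell above.
b_normal = np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T @ y_demo

# Same estimate from NumPy's least-squares solver.
b_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

print(np.allclose(b_normal, b_lstsq))
```

Both routes should agree up to floating-point rounding here. The solver becomes the safer choice when columns of the data matrix are nearly collinear.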

In [319]:
def predict(x, b):
    # b[0] is the intercept; the remaining entries are the slope coefficients
    x0 = b[0:1]
    bb = b[1:]
    return np.dot(bb, x)+x0

# pick one country (the first row, Afghanistan) to see what the model predicts
rx = df.iloc[0]
print("Actual value: 59.9\nEstimated value: {}".format(predict(rx, b)))

Actual value: 59.9
Estimated value: [61.59550675]


It seems to work...

Now we can check the model's goodness-of-fit measures.

In [320]:
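# fitted value for every country; predict returns a one-element array,
# so we take its single entry instead of calling float() on the array
y_hat = np.array([predict(x, b)[0] for x in df.values])
# residuals
u = y - y_hat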
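
Because the data matrix already carries the column of ones, the fitted values can also be obtained with a single matrix product. A minimal sketch on synthetic data (`X_demo` and `b_demo` are illustrative names, not the notebook's variables):

```python
import numpy as np

rng = np.random.default_rng(1)
# Intercept column plus two made-up predictors.
X_demo = np.c_[np.ones(5), rng.normal(size=(5, 2))]
b_demo = np.array([1.0, 2.0, -3.0])

# Row-by-row prediction: slopes times predictors, plus the intercept.
y_hat_loop = np.array(
    [np.dot(b_demo[1:], row[1:]) + b_demo[0] for row in X_demo]
)

# Vectorized equivalent.
y_hat_vec = X_demo @ b_demo

print(np.allclose(y_hat_loop, y_hat_vec))
```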

We have now computed every fitted value with our model, and next we check
the variability.

In [321]:
def VT(y):
    # total variability: squared deviations of y from its mean
    return np.sum(np.power(y - np.mean(y), 2))

def VE(y, y_h):
    # explained variability: squared deviations of the fitted values from the mean of y
    return np.sum(np.power(y_h - np.mean(y), 2))

def VNE(y, y_h):
    # unexplained variability: sum of squared residuals
    return np.sum(np.power(y - y_h, 2))

print("Total variability: {}".format(VT(y)))
print("Explained variability: {}".format(VE(y, y_hat)))
print("Unexplained variability: {}".format(VNE(y, y_hat)))
print("VE + VNE = {}".format(VE(y, y_hat)+VNE(y, y_hat)))

Total variability: 9626.488396946563
Explained variability: 8486.507574053016
Unexplained variability: 1139.9808228927434
VE + VNE = 9626.488396945759


We observe that, up to floating-point rounding, indeed

VT = VNE + VE
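
Written out with the quantities computed by `VT`, `VE` and `VNE`, the decomposition reads as follows, where $\bar{y}$ is the mean of **y**, $\hat{y}_i$ are the fitted values and $m$ is the number of observations (131 here):

$$
\underbrace{\sum_{i=1}^{m} (y_i - \bar{y})^2}_{VT}
= \underbrace{\sum_{i=1}^{m} (\hat{y}_i - \bar{y})^2}_{VE}
+ \underbrace{\sum_{i=1}^{m} (y_i - \hat{y}_i)^2}_{VNE}
$$

The identity holds for OLS fits that include an intercept, which this model does through the column of ones in **X**.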


The R squared coefficient and the adjusted R squared are as follows:

In [322]:
m = y.shape[0]  # number of observations
n = X.shape[1]  # number of columns of X, intercept included

# residual variance estimate: sum of squared residuals over (m - n)
var_re = (1/(m-n))*np.sum(np.power(u, 2))
r_sq = VE(y, y_hat)/VT(y)

r_sq_adj = 1 - (m-1)/(m-n) * (1 - r_sq)
print("R squared: {}".format(r_sq))
print("Adjusted R squared: {}".format(r_sq_adj))

R squared: 0.8815787464871262
Adjusted R squared: 0.8625467593154144
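
For reference, these are the two formulas the cell evaluates, with $m$ the number of observations and $n$ the number of columns of **X** (intercept included):

$$
R^2 = \frac{VE}{VT}, \qquad
\bar{R}^2 = 1 - \frac{m-1}{m-n}\,\bigl(1 - R^2\bigr)
$$

With $m = 131$ and $n = 19$ the correction factor is $130/112 \approx 1.1607$, so $\bar{R}^2 \approx 1 - 1.1607 \times 0.1184 \approx 0.8625$, in line with the printed value.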


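We have a good model! With an adjusted R squared of about 0.86, it explains most of the variability in life expectancy across the 2014 records it was fitted on.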
