# Challenge - Classification from an econometric perspective

**Name:** Luis Porras

## Challenge 1: Prepare the working environment

* sbp: Systolic blood pressure.
* tobacco: Average tobacco consumed per day.
* ldl: Low-density lipoprotein.
* adiposity: Adiposity.
* famhist: Family history of heart disease. (Binary)
* typea: Type A personality.
* obesity: Obesity.
* alcohol: Current alcohol consumption.
* age: Age.
* chd: Coronary heart disease. (Dummy)

In [10]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
plt.style.use('seaborn')
plt.rcParams['figure.figsize'] = (6,3)

In [89]:
df = pd.read_csv('southafricanheart.csv')

In [90]:
df.head()

Unnamed: 0.1,Unnamed: 0,sbp,tobacco,ldl,adiposity,famhist,typea,obesity,alcohol,age,chd
0,1,160,12.0,5.73,23.11,Present,49,25.3,97.2,52,1
1,2,144,0.01,4.41,28.61,Absent,55,28.87,2.06,63,1
2,3,118,0.08,3.48,32.28,Present,52,29.14,3.81,46,0
3,4,170,7.5,6.41,38.03,Present,51,31.99,24.26,58,1
4,5,134,13.6,3.5,27.78,Present,60,25.99,57.34,49,1


## Challenge 2

Carry out the following steps:

1. Recode `famhist` as a dummy variable, assigning 1 to the minority category.

In [93]:
df['famhist'].value_counts()

Absent     270
Present    192
Name: famhist, dtype: int64

In [94]:
df['famhist_present'] = np.where(df['famhist'] == 'Present', 1, 0)

In [95]:
df['famhist_present'].value_counts()

0    270
1    192
Name: famhist_present, dtype: int64
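
The recoding above hard-codes `'Present'` after inspecting the counts by eye. As a minimal sketch, the minority category can also be picked programmatically with `value_counts().idxmin()`. The small series below is a hypothetical stand-in for the real `famhist` column, not the actual data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the `famhist` column (hypothetical values, not the real data).
famhist = pd.Series(['Absent', 'Present', 'Absent', 'Absent', 'Present'])

# idxmin() on the counts returns the label of the least frequent category.
minority = famhist.value_counts().idxmin()

# 1 for the minority category, 0 otherwise.
famhist_dummy = np.where(famhist == minority, 1, 0)
print(minority, famhist_dummy.tolist())
```

This avoids editing the code by hand if the category balance changes in another sample.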

2. Use `smf.logit` to estimate the model.

In [96]:
var_dep = ['chd']
var_indps = ['famhist_present', 'sbp', 'tobacco', 'ldl', 'adiposity', 'typea', 'obesity', 'alcohol', 'age']
columns = var_dep + var_indps

In [97]:
db_subset = df.loc[:, columns]

In [98]:
db_subset.head()

Unnamed: 0,chd,famhist_present,sbp,tobacco,ldl,adiposity,typea,obesity,alcohol,age
0,1,1,160,12.0,5.73,23.11,49,25.3,97.2,52
1,1,0,144,0.01,4.41,28.61,55,28.87,2.06,63
2,0,1,118,0.08,3.48,32.28,52,29.14,3.81,46
3,1,1,170,7.5,6.41,38.03,51,31.99,24.26,58
4,1,1,134,13.6,3.5,27.78,60,25.99,57.34,49


In [99]:
df['chd'].value_counts()

0    302
1    160
Name: chd, dtype: int64

In [27]:
modelo_logit = smf.logit('chd ~ famhist_present', db_subset).fit()

Optimization terminated successfully.
         Current function value: 0.608111
         Iterations 5


3. Implement an `inverse_logit` function that maps log-odds to probabilities.

In [28]:
def inverse_logit(x):
    return 1 / (1 + np.exp(-x))
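
A quick sanity check of the mapping can be sketched as follows. The `logit` helper is an assumption added here for illustration (it is not part of the notebook); it is the inverse of `inverse_logit`, so a round trip should recover the original probabilities.

```python
import numpy as np

def logit(p):
    # Log-odds of a probability p in (0, 1). Hypothetical helper for this check.
    return np.log(p / (1 - p))

def inverse_logit(x):
    # Same definition as in the cell above.
    return 1 / (1 + np.exp(-x))

# Log-odds of 0 correspond to even odds, i.e. a probability of 0.5.
p_zero = inverse_logit(0.0)

# Applying logit and then inverse_logit should recover the original probabilities.
probs = np.array([0.1, 0.25, 0.5, 0.9])
round_trip = inverse_logit(logit(probs))
print(p_zero, np.allclose(round_trip, probs))
```

SciPy offers the same function as `scipy.special.expit`, which may be preferable for very large negative inputs.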

4. Using the estimated model, answer the following:

* What is the probability that an individual with a family history has coronary heart disease?

In [30]:
modelo_logit.summary()

0,1,2,3
Dep. Variable:,chd,No. Observations:,462.0
Model:,Logit,Df Residuals:,460.0
Method:,MLE,Df Model:,1.0
Date:,"Tue, 27 Aug 2019",Pseudo R-squ.:,0.0574
Time:,21:01:39,Log-Likelihood:,-280.95
converged:,True,LL-Null:,-298.05
,,LLR p-value:,4.937e-09

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-1.1690,0.143,-8.169,0.000,-1.449,-0.889
famhist_present,1.1690,0.203,5.751,0.000,0.771,1.567


In [52]:
estimate_chd_1 = modelo_logit.params['Intercept'] + modelo_logit.params['famhist_present']
print(f"The estimated log-odds is {estimate_chd_1}")

The estimated log-odds is 0.0


In [58]:
prob_chd_1 = inverse_logit(estimate_chd_1) * 100
print(f"The probability that an individual with a family history has coronary heart disease is: {prob_chd_1}%")

The probability that an individual with a family history has coronary heart disease is: 50.0%


* What is the probability that an individual without a family history has coronary heart disease?

In [54]:
estimate_chd_2 = modelo_logit.params['Intercept']
print(f"The estimated log-odds is {estimate_chd_2}")

The estimated log-odds is -1.1689930854299098


In [57]:
prob_chd_2 = inverse_logit(estimate_chd_2) * 100
print(f"The probability that an individual without a family history has coronary heart disease is: {round(prob_chd_2, 2)}%")

The probability that an individual without a family history has coronary heart disease is: 23.7%


* What is the difference in probability between an individual with a family history and one without?

In [61]:
print(f"The difference in probability between an individual with a family history and one without is: {round(prob_chd_1 - prob_chd_2, 2)} percentage points")

The difference in probability between an individual with a family history and one without is: 26.3 percentage points


* Replicate the model with `smf.ols` and comment on the similarities between the estimated coefficients.

In [62]:
modelo_ols = smf.ols('chd ~ famhist_present', db_subset).fit()

In [63]:
modelo_ols.summary()

0,1,2,3
Dep. Variable:,chd,R-squared:,0.074
Model:,OLS,Adj. R-squared:,0.072
Method:,Least Squares,F-statistic:,36.86
Date:,"Tue, 27 Aug 2019",Prob (F-statistic):,2.66e-09
Time:,21:32:12,Log-Likelihood:,-294.59
No. Observations:,462,AIC:,593.2
Df Residuals:,460,BIC:,601.4
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,0.2370,0.028,8.489,0.000,0.182,0.292
famhist_present,0.2630,0.043,6.071,0.000,0.178,0.348

0,1,2,3
Omnibus:,768.898,Durbin-Watson:,1.961
Prob(Omnibus):,0.0,Jarque-Bera (JB):,58.778
Skew:,0.579,Prob(JB):,1.72e-13
Kurtosis:,1.692,Cond. No.,2.47


In [64]:
print(f"According to the linear regression model, the probability that an individual without a family history has coronary heart disease is {round(modelo_ols.params['Intercept'] * 100, 2)} %")

According to the linear regression model, the probability that an individual without a family history has coronary heart disease is 23.7 %


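The probability is exactly the same as the one from the logit model! This is expected: with a single binary regressor, both models reproduce the observed proportion of `chd` in each group.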

In [66]:
print(f"According to the linear regression model, the probability that an individual with a family history has coronary heart disease is: {round((modelo_ols.params['Intercept'] + modelo_ols.params['famhist_present']) * 100, 2)} %")

According to the linear regression model, the probability that an individual with a family history has coronary heart disease is: 50.0 %


Again, the probability is exactly the same as the one from the logit model!
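
The agreement between the two models can be reproduced by hand. The sketch below assumes 64 cases among the 270 individuals without a family history and 96 cases among the 192 with one. These counts are not printed in the notebook; they are inferred from the group sizes and the reported probabilities (23.7% and 50%), and they add up to the 160 cases in `chd`.

```python
import numpy as np

# Assumed counts, inferred from the group sizes (270 / 192) and the
# reported probabilities (23.7% / 50%); they are not printed in the notebook.
n_absent, chd_absent = 270, 64
n_present, chd_present = 192, 96

# Observed proportion of chd in each group.
p_absent = chd_absent / n_absent
p_present = chd_present / n_present

# With one binary regressor, the logit MLE has a closed form:
# the intercept is the log-odds of the baseline group and the slope
# is the difference in log-odds between the two groups.
intercept = np.log(p_absent / (1 - p_absent))
slope = np.log(p_present / (1 - p_present)) - intercept

# The OLS coefficients are the baseline mean and the difference in means.
ols_intercept = p_absent
ols_slope = p_present - p_absent

print(round(intercept, 4), round(slope, 4))
print(round(ols_intercept, 4), round(ols_slope, 4))
```

Under these assumed counts, both sets of coefficients describe the same two group proportions, only on different scales (log-odds versus probability).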

* Estimate the same model with an LPM (linear probability model).
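
A linear probability model is OLS applied to a binary outcome, so the `smf.ols` fit above is already an LPM. Because the errors of an LPM are heteroskedastic by construction, it is common to request robust standard errors. The sketch below is one possible way to do that; it rebuilds a two-column data set from assumed counts (64 of 270 and 96 of 192, inferred from the outputs above rather than read from the CSV), and the name `modelo_lpm` is hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Rebuild the two-variable data from assumed counts (inferred from the
# outputs above, not read from the CSV): 64 of 270 cases without family
# history and 96 of 192 cases with family history.
toy = pd.DataFrame({
    'famhist_present': np.repeat([0, 1], [270, 192]),
    'chd': np.concatenate([
        np.repeat([1, 0], [64, 206]),   # famhist_present == 0
        np.repeat([1, 0], [96, 96]),    # famhist_present == 1
    ]),
})

# An LPM is OLS on a binary outcome; HC1 requests
# heteroskedasticity-robust standard errors.
modelo_lpm = smf.ols('chd ~ famhist_present', toy).fit(cov_type='HC1')
print(modelo_lpm.params.round(4))
print(modelo_lpm.bse.round(4))
```

The point estimates do not depend on `cov_type`; only the standard errors and the inference built on them change.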

## Challenge 3: Full estimation

Implement a model with more than one independent variable.

In [86]:
model_right_term = ' + '.join(var_indps)
model_formula = f"chd ~ {model_right_term}"
model_formula

'chd ~ famhist_present + sbp + tobacco + ldl + adiposity + typea + obesity + alcohol + age'

In [87]:
modelo2_logit = smf.logit(model_formula, db_subset).fit()

Optimization terminated successfully.
         Current function value: 0.510974
         Iterations 6


In [88]:
modelo2_logit.summary()

0,1,2,3
Dep. Variable:,chd,No. Observations:,462.0
Model:,Logit,Df Residuals:,452.0
Method:,MLE,Df Model:,9.0
Date:,"Tue, 27 Aug 2019",Pseudo R-squ.:,0.208
Time:,21:58:50,Log-Likelihood:,-236.07
converged:,True,LL-Null:,-298.05
,,LLR p-value:,2.055e-22

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-6.1507,1.308,-4.701,0.000,-8.715,-3.587
famhist_present,0.9254,0.228,4.061,0.000,0.479,1.372
sbp,0.0065,0.006,1.135,0.256,-0.005,0.018
tobacco,0.0794,0.027,2.984,0.003,0.027,0.132
ldl,0.1739,0.060,2.915,0.004,0.057,0.291
adiposity,0.0186,0.029,0.635,0.526,-0.039,0.076
typea,0.0396,0.012,3.214,0.001,0.015,0.064
obesity,-0.0629,0.044,-1.422,0.155,-0.150,0.024
alcohol,0.0001,0.004,0.027,0.978,-0.009,0.009


* Refine the model, keeping only the variables that are statistically significant at the 95% level.

The variables that are statistically significant at the 95% level are: `famhist_present`, `tobacco`, `ldl`, `typea`, `age`
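
The selection can also be done in code rather than by reading the table. The sketch below filters a series of p-values at the 0.05 threshold. The values are copied by hand from the summary table above; the `age` row is not shown in that table, so it is left out here. With a fitted statsmodels result, the same filter could be applied to its `pvalues` attribute (after dropping `Intercept`).

```python
import pandas as pd

# P>|z| values copied from the summary table above (the `age` row is not
# shown there, so it is left out of this hand-built series).
pvalues = pd.Series({
    'famhist_present': 0.000,
    'sbp': 0.256,
    'tobacco': 0.003,
    'ldl': 0.004,
    'adiposity': 0.526,
    'typea': 0.001,
    'obesity': 0.155,
    'alcohol': 0.978,
})

# Keep the regressors whose p-value is below 0.05.
significant = pvalues[pvalues < 0.05].index.tolist()
formula = 'chd ~ ' + ' + '.join(significant)
print(formula)
```
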

In [101]:
modelo3_logit = smf.logit('chd ~ famhist_present + tobacco + ldl + typea + age', db_subset).fit()

Optimization terminated successfully.
         Current function value: 0.514811
         Iterations 6


In [102]:
modelo3_logit.summary()

0,1,2,3
Dep. Variable:,chd,No. Observations:,462.0
Model:,Logit,Df Residuals:,456.0
Method:,MLE,Df Model:,5.0
Date:,"Tue, 27 Aug 2019",Pseudo R-squ.:,0.202
Time:,22:03:23,Log-Likelihood:,-237.84
converged:,True,LL-Null:,-298.05
,,LLR p-value:,2.554e-24

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-6.4464,0.921,-7.000,0.000,-8.251,-4.642
famhist_present,0.9082,0.226,4.023,0.000,0.466,1.351
tobacco,0.0804,0.026,3.106,0.002,0.030,0.131
ldl,0.1620,0.055,2.947,0.003,0.054,0.270
typea,0.0371,0.012,3.051,0.002,0.013,0.061
age,0.0505,0.010,4.944,0.000,0.030,0.070


* Compare the goodness-of-fit statistics of the two models.

In [103]:
0.2080 - 0.2020

0.005999999999999978

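The difference between the pseudo R-squared values is about 0.006 (0.208 vs. 0.202), and the log-likelihood changes little as well (-236.07 vs. -237.84), so dropping the non-significant variables barely affects the goodness of fit.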
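
Because the reduced model is nested in the full one, the comparison can be made more formal with a likelihood-ratio test. This is a minimal sketch that uses only the log-likelihoods and `Df Model` values reported in the two summaries; the variable names are hypothetical.

```python
from scipy import stats

# Log-likelihoods and model degrees of freedom from the two summaries above.
ll_full, df_full = -236.07, 9
ll_reduced, df_reduced = -237.84, 5

# LR statistic: twice the gain in log-likelihood from the extra regressors.
lr_stat = 2 * (ll_full - ll_reduced)
df_diff = df_full - df_reduced

# Upper-tail chi-squared probability of the statistic.
p_value = stats.chi2.sf(lr_stat, df_diff)
print(round(lr_stat, 2), df_diff, p_value)
```

A p-value above 0.05 would indicate that the four dropped variables do not jointly improve the fit at the 5% level.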

* Briefly report the effect of the variables on the log-odds of having coronary heart disease.
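
One compact way to report these effects is to exponentiate the coefficients, which turns each change in log-odds into an odds ratio. The sketch below copies the coefficients by hand from the `modelo3_logit` summary; with the fitted model available, `np.exp(modelo3_logit.params)` should give the same quantities without rounding.

```python
import numpy as np
import pandas as pd

# Coefficients (log-odds) copied from the modelo3_logit summary above.
coefs = pd.Series({
    'famhist_present': 0.9082,
    'tobacco': 0.0804,
    'ldl': 0.1620,
    'typea': 0.0371,
    'age': 0.0505,
})

# exp(coef) is the multiplicative change in the odds of chd for a
# one-unit increase in the regressor, holding the others fixed.
odds_ratios = np.exp(coefs)
print(odds_ratios.round(3))
```

All five coefficients are positive, so each variable raises the log-odds of `chd`; `famhist_present` has the largest effect per unit, since it switches from 0 to 1.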
