# Challenge - Classification from an econometric perspective


## Challenge 1: Prepare the working environment
- The steps to follow are detailed.
- tip: Tips and suggestions are preceded by `tip`.
- Two notebooks are generated: one with the solutions and one with the exercises.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns
import scipy.stats as stats

import statsmodels.api as sm
import statsmodels.formula.api as smf
import warnings
warnings.filterwarnings('ignore')

# On matplotlib >= 3.6 this style is named 'seaborn-v0_8-whitegrid'
plt.style.use('seaborn-whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['figure.dpi'] = 200

In [2]:
# Load the data and drop the leftover index column stored in the CSV
df = pd.read_csv('southafricanheart.csv').drop('Unnamed: 0', axis=1)

## Challenge 2

In [3]:
df['famhist'].value_counts()

Absent     270
Present    192
Name: famhist, dtype: int64

In [4]:
df.head()

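| | sbp | tobacco | ldl | adiposity | famhist | typea | obesity | alcohol | age | chd |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 160 | 12.0 | 5.73 | 23.11 | Present | 49 | 25.3 | 97.2 | 52 | 1 |
| 1 | 144 | 0.01 | 4.41 | 28.61 | Absent | 55 | 28.87 | 2.06 | 63 | 1 |
| 2 | 118 | 0.08 | 3.48 | 32.28 | Present | 52 | 29.14 | 3.81 | 46 | 0 |
| 3 | 170 | 7.5 | 6.41 | 38.03 | Present | 51 | 31.99 | 24.26 | 58 | 1 |
| 4 | 134 | 13.6 | 3.5 | 27.78 | Present | 60 | 25.99 | 57.34 | 49 | 1 |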


In [5]:
# Encode family history as a binary indicator: 1 = Present, 0 = Absent
df['famhist_bin'] = np.where(df['famhist'] == 'Present', 1, 0)

In [6]:
m1_logit = smf.logit('chd ~ famhist_bin', df).fit()

m1_logit.summary2()

Optimization terminated successfully.
         Current function value: 0.608111
         Iterations 5


| Model: | Logit | Pseudo R-squared: | 0.057 |
|---|---|---|---|
| Dependent Variable: | chd | AIC: | 565.8944 |
| Date: | 2019-06-13 23:14 | BIC: | 574.1655 |
| No. Observations: | 462 | Log-Likelihood: | -280.95 |
| Df Model: | 1 | LL-Null: | -298.05 |
| Df Residuals: | 460 | LLR p-value: | 4.9371e-09 |
| Converged: | 1.0000 | Scale: | 1.0 |
| No. Iterations: | 5.0000 | | |

| | Coef. | Std.Err. | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -1.1690 | 0.1431 | -8.1687 | 0.0000 | -1.4495 | -0.8885 |
| famhist_bin | 1.1690 | 0.2033 | 5.7514 | 0.0000 | 0.7706 | 1.5674 |


In [7]:
def inverse_logit(x):
    """Map log-odds x to a probability in (0, 1)."""
    return 1 / (1 + np.exp(-x))

In [8]:
# Log-odds for an individual with (1) and without (0) a family history
estimate_y = m1_logit.params['Intercept'] + m1_logit.params['famhist_bin'] * 1
estimate_y2 = m1_logit.params['Intercept'] + m1_logit.params['famhist_bin'] * 0

# Convert the log-odds to probabilities
prob_famhist_1 = inverse_logit(estimate_y)
prob_famhist_0 = inverse_logit(estimate_y2)

print('The probability of coronary heart disease for an individual with a family history is:',
      round(prob_famhist_1, 2))
print('The probability of coronary heart disease for an individual without a family history is:',
      round(prob_famhist_0, 2))
print('The difference in probability between an individual with and one without a family history is:',
      round(prob_famhist_1 - prob_famhist_0, 3))


The probability of coronary heart disease for an individual with a family history is: 0.5
The probability of coronary heart disease for an individual without a family history is: 0.24
The difference in probability between an individual with and one without a family history is: 0.263
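
The same coefficient can also be read on the odds scale. The sketch below is an illustrative addition: it hard-codes the two coefficients reported in the `m1_logit` summary (rounded to four decimals) instead of refitting the model, and the names `odds_ratio` and `odds_ratio_from_probs` are hypothetical.

```python
import math

def inverse_logit(x):
    """Map log-odds x to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-x))

# Coefficients copied from the m1_logit summary (rounded values, an assumption)
intercept = -1.1690
coef_famhist = 1.1690

# Exponentiating a logit coefficient gives an odds ratio
odds_ratio = math.exp(coef_famhist)

# Cross-check: rebuild the odds ratio from the two predicted probabilities
p1 = inverse_logit(intercept + coef_famhist)  # family history present
p0 = inverse_logit(intercept)                 # family history absent
odds_ratio_from_probs = (p1 / (1 - p1)) / (p0 / (1 - p0))

print(round(odds_ratio, 2), round(odds_ratio_from_probs, 2))
```

Under these values the odds of coronary heart disease are a bit more than three times higher with a family history. This is the same information as the 0.24 and 0.5 probabilities, expressed multiplicatively.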


In [9]:
# Linear probability model (OLS) with the same specification, for comparison
m1_ols = smf.ols('chd ~ famhist_bin', df).fit()
resultado_m1_ols = m1_ols.summary()
diagnostico_m1_ols = resultado_m1_ols.tables[0]
coeficientes_m1_ols = resultado_m1_ols.tables[1]
coeficientes_m1_ols

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 0.2370 | 0.028 | 8.489 | 0.000 | 0.182 | 0.292 |
| famhist_bin | 0.2630 | 0.043 | 6.071 | 0.000 | 0.178 | 0.348 |
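
With a single binary predictor, the linear probability model and the logit should imply the same two fitted probabilities, because both reduce to the observed share of `chd` in each group. The following sketch checks this with the rounded coefficients reported in the two summaries (hard-coded here as an assumption, not refitted); the variable names are illustrative.

```python
import math

def inverse_logit(x):
    """Map log-odds x to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-x))

# Rounded coefficients copied from the m1_ols and m1_logit summaries
ols_intercept, ols_slope = 0.2370, 0.2630
logit_intercept, logit_slope = -1.1690, 1.1690

# Linear probability model: fitted values are already probabilities
ols_p0 = ols_intercept
ols_p1 = ols_intercept + ols_slope

# Logit: fitted log-odds must be passed through the inverse logit
logit_p0 = inverse_logit(logit_intercept)
logit_p1 = inverse_logit(logit_intercept + logit_slope)

print('without family history:', round(ols_p0, 3), round(logit_p0, 3))
print('with family history:   ', round(ols_p1, 3), round(logit_p1, 3))
```

The agreement is specific to this single-dummy case. Once continuous predictors enter the model the two approaches generally diverge, and OLS can produce fitted values outside [0, 1].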


In [10]:
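# Divide-by-4 rule of thumb: coef/4 approximates the maximum change in probability per unit change in the predictor
m1_logit.params / 4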

Intercept     -0.292248
famhist_bin    0.292248
dtype: float64
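
Dividing a logit coefficient by 4 is a common rule of thumb: the logistic curve is steepest at a probability of 0.5, where its slope equals 1/4, so coefficient/4 bounds the change in probability per unit change in the predictor. The sketch below, an illustrative addition using only the standard library, checks that slope numerically and applies the rule to the `famhist_bin` coefficient reported above (hard-coded and rounded, which is an assumption of the example).

```python
import math

def inverse_logit(x):
    """Map log-odds x to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-x))

# Central finite difference of the inverse logit at x = 0 (where p = 0.5)
h = 1e-6
slope_at_zero = (inverse_logit(h) - inverse_logit(-h)) / (2 * h)

# Rule-of-thumb upper bound for the famhist_bin effect on the probability scale
coef_famhist = 1.1690  # rounded value from the m1_logit summary
upper_bound = coef_famhist / 4

print(round(slope_at_zero, 4), round(upper_bound, 4))
```

Here the bound of about 0.29 sits slightly above the 0.263 difference estimated by both the logit and OLS fits, as expected for an upper bound.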

## Challenge 3: Full estimation


In [11]:
# Full model with every available predictor
m2_logit = smf.logit('chd ~ sbp + tobacco + ldl + adiposity + typea + obesity + alcohol + age + famhist_bin', df).fit()

m2_logit.summary2()

Optimization terminated successfully.
         Current function value: 0.510974
         Iterations 6


| Model: | Logit | Pseudo R-squared: | 0.208 |
|---|---|---|---|
| Dependent Variable: | chd | AIC: | 492.14 |
| Date: | 2019-06-13 23:14 | BIC: | 533.4957 |
| No. Observations: | 462 | Log-Likelihood: | -236.07 |
| Df Model: | 9 | LL-Null: | -298.05 |
| Df Residuals: | 452 | LLR p-value: | 2.0548e-22 |
| Converged: | 1.0000 | Scale: | 1.0 |
| No. Iterations: | 6.0000 | | |

| | Coef. | Std.Err. | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -6.1507 | 1.3083 | -4.7015 | 0.0000 | -8.7149 | -3.5866 |
| sbp | 0.0065 | 0.0057 | 1.1350 | 0.2564 | -0.0047 | 0.0177 |
| tobacco | 0.0794 | 0.0266 | 2.9838 | 0.0028 | 0.0272 | 0.1315 |
| ldl | 0.1739 | 0.0597 | 2.9152 | 0.0036 | 0.0570 | 0.2909 |
| adiposity | 0.0186 | 0.0293 | 0.6346 | 0.5257 | -0.0388 | 0.0760 |
| typea | 0.0396 | 0.0123 | 3.2138 | 0.0013 | 0.0154 | 0.0637 |
| obesity | -0.0629 | 0.0442 | -1.4218 | 0.1551 | -0.1496 | 0.0238 |
| alcohol | 0.0001 | 0.0045 | 0.0271 | 0.9784 | -0.0087 | 0.0089 |
| age | 0.0452 | 0.0121 | 3.7285 | 0.0002 | 0.0215 | 0.0690 |


In [12]:
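# Refined model: keep only the predictors retained after dropping sbp, adiposity, obesity and alcohol
m2_logit_depur = smf.logit('chd ~ tobacco + ldl + typea + age + famhist_bin', df).fit()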
m2_logit_depur.summary2()

Optimization terminated successfully.
         Current function value: 0.514811
         Iterations 6


| Model: | Logit | Pseudo R-squared: | 0.202 |
|---|---|---|---|
| Dependent Variable: | chd | AIC: | 487.6856 |
| Date: | 2019-06-13 23:14 | BIC: | 512.499 |
| No. Observations: | 462 | Log-Likelihood: | -237.84 |
| Df Model: | 5 | LL-Null: | -298.05 |
| Df Residuals: | 456 | LLR p-value: | 2.5537e-24 |
| Converged: | 1.0000 | Scale: | 1.0 |
| No. Iterations: | 6.0000 | | |

| | Coef. | Std.Err. | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -6.4464 | 0.9209 | -7.0004 | 0.0000 | -8.2513 | -4.6416 |
| tobacco | 0.0804 | 0.0259 | 3.1057 | 0.0019 | 0.0297 | 0.1311 |
| ldl | 0.1620 | 0.0550 | 2.9470 | 0.0032 | 0.0543 | 0.2697 |
| typea | 0.0371 | 0.0122 | 3.0505 | 0.0023 | 0.0133 | 0.0610 |
| age | 0.0505 | 0.0102 | 4.9442 | 0.0000 | 0.0305 | 0.0705 |
| famhist_bin | 0.9082 | 0.2258 | 4.0228 | 0.0001 | 0.4657 | 1.3507 |
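
One way to check that dropping `sbp`, `adiposity`, `obesity` and `alcohol` costs little fit is a likelihood-ratio test between the two nested models. The sketch below is an illustrative addition, not part of the original solution: it hard-codes the log-likelihoods and model degrees of freedom printed in the two summaries (rounded to two decimals, so the statistic is approximate), and `chi2_sf_even_df` is a hypothetical helper that uses the closed-form chi-square tail for even degrees of freedom to avoid extra dependencies.

```python
import math

# Values copied from the summaries of m2_logit (full) and m2_logit_depur (reduced)
ll_full, df_full = -236.07, 9
ll_reduced, df_reduced = -237.84, 5

# Likelihood-ratio statistic and its degrees of freedom
lr_stat = 2 * (ll_full - ll_reduced)
df_diff = df_full - df_reduced

def chi2_sf_even_df(x, df):
    """Chi-square survival function; this closed form is valid only for even df."""
    half = df // 2
    return math.exp(-x / 2) * sum((x / 2) ** i / math.factorial(i) for i in range(half))

p_value = chi2_sf_even_df(lr_stat, df_diff)
print(round(lr_stat, 2), df_diff, round(p_value, 2))
```

A p-value well above 0.05 would suggest that the four dropped predictors do not jointly improve the fit, in line with the lower AIC and BIC of the refined model. With SciPy available, `stats.chi2.sf(lr_stat, df_diff)` computes the same tail probability for any degrees of freedom.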


## Challenge 4: Profile estimation
Using the refined model, generate the estimates in log-odds and then transform them
into probabilities with `inverse_logit`. The profiles to estimate are the following:
- The probability of coronary heart disease for an individual with characteristics similar to the sample.
- The probability of coronary heart disease for an individual with high levels of low-density lipoprotein, holding all other characteristics constant.
- The probability of coronary heart disease for an individual with low levels of low-density lipoprotein, holding all other characteristics constant.

In [13]:
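# Coefficients of the refined model on the log-odds scale (despite the name, these are not probabilities)
prob = m2_logit_depur.params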

In [14]:
prob

Intercept     -6.446445
tobacco        0.080375
ldl            0.161992
typea          0.037115
age            0.050460
famhist_bin    0.908175
dtype: float64

In [15]:
# Probability of coronary heart disease for an individual with characteristics similar to the sample
log_odds_mean = (prob['Intercept'] + prob['tobacco']*df['tobacco'].mean() + prob['ldl']*df['ldl'].mean()
                 + prob['typea']*df['typea'].mean() + prob['age']*df['age'].mean())
print('famhist_bin = 1:', round(inverse_logit(log_odds_mean + prob['famhist_bin']), 3))
print('famhist_bin = 0:', round(inverse_logit(log_odds_mean), 3))

famhist_bin = 1: 0.414
famhist_bin = 0: 0.222


In [16]:
# Probability of coronary heart disease for an individual with high levels of low-density lipoprotein (sample maximum), holding all other characteristics constant

log_odds_high_ldl = (prob['Intercept'] + prob['tobacco']*df['tobacco'].mean() + prob['ldl']*df['ldl'].max()
                     + prob['typea']*df['typea'].mean() + prob['age']*df['age'].mean())
print('famhist_bin = 1:', round(inverse_logit(log_odds_high_ldl + prob['famhist_bin']), 3))
print('famhist_bin = 0:', round(inverse_logit(log_odds_high_ldl), 3))

famhist_bin = 1: 0.797
famhist_bin = 0: 0.613


In [17]:
# Probability of coronary heart disease for an individual with low levels of low-density lipoprotein (sample minimum), holding all other characteristics constant

log_odds_low_ldl = (prob['Intercept'] + prob['tobacco']*df['tobacco'].mean() + prob['ldl']*df['ldl'].min()
                    + prob['typea']*df['typea'].mean() + prob['age']*df['age'].mean())
print('famhist_bin = 1:', round(inverse_logit(log_odds_low_ldl + prob['famhist_bin']), 3))
print('famhist_bin = 0:', round(inverse_logit(log_odds_low_ldl), 3))

famhist_bin = 1: 0.278
famhist_bin = 0: 0.134
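
The three cells above repeat the same linear combination with one value swapped, and a small helper can make that pattern explicit. The sketch below is an illustrative addition: `predict_profile` is a hypothetical name, the coefficients are copied from the `prob` output above, and the profile values are placeholders chosen for the example, not the sample means of `df`.

```python
import math

def inverse_logit(x):
    """Map log-odds x to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-x))

# Coefficients copied from the m2_logit_depur output above
coefs = {
    'Intercept': -6.446445, 'tobacco': 0.080375, 'ldl': 0.161992,
    'typea': 0.037115, 'age': 0.050460, 'famhist_bin': 0.908175,
}

def predict_profile(coefs, profile):
    """Return the predicted probability for a dict of predictor values."""
    log_odds = coefs['Intercept'] + sum(coefs[name] * value for name, value in profile.items())
    return inverse_logit(log_odds)

# Placeholder profile (hypothetical values, NOT the sample means)
base = {'tobacco': 4.0, 'ldl': 5.0, 'typea': 50, 'age': 40, 'famhist_bin': 0}
high_ldl = {**base, 'ldl': 10.0}  # same profile with a higher LDL level

print(round(predict_profile(coefs, base), 3), round(predict_profile(coefs, high_ldl), 3))
```

Swapping in `df['ldl'].max()` or `df['ldl'].min()` for the LDL value, and the column means for the remaining predictors, would reproduce the profiles computed above.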
