# Analytics 
## Data Science
### Prof. Dr. Neylson Crepalde

---

## Logistic Regression

In [38]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
from wooldridge import dataWoo  # only the dataset loader is used below
%matplotlib inline

In [6]:
# Get the list of available datasets
print(dataWoo())

  J.M. Wooldridge (2016) Introductory Econometrics: A Modern Approach,
  Cengage Learning, 6th edition.

  401k       401ksubs    admnrev       affairs     airfare
  alcohol    apple       approval      athlet1     athlet2
  attend     audit       barium        beauty      benefits
  beveridge  big9salary  bwght         bwght2      campus
  card       catholic    cement        census2000  ceosal1
  ceosal2    charity     consump       corn        countymurders
  cps78_85   cps91       crime1        crime2      crime3
  crime4     discrim     driving       earns       econmath
  elem94_95  engin       expendshares  ezanders    ezunem
  fair       fertil1     fertil2       fertil3     fish
  fringe     gpa1        gpa2          gpa3        happiness
  hprice1    hprice2     hprice3       hseinv      htv
  infmrt     injury      intdef        intqrt      inven
  jtrain     jtrain2     jtrain3       kielmc      lawsch85
  loanapp    lowbrth     mathpnl       meap00_01   meap01
  meap93    

In [8]:
mroz = dataWoo('mroz')
mroz.head()

Unnamed: 0,inlf,hours,kidslt6,kidsge6,age,educ,wage,repwage,hushrs,husage,huseduc,huswage,faminc,mtr,motheduc,fatheduc,unem,city,exper,nwifeinc,lwage,expersq
0,1,1610,1,0,32,12,3.354,2.65,2708,34,12,4.0288,16310.0,0.7215,12,7,5.0,0,14,10.91006,1.210154,196
1,1,1656,0,2,30,12,1.3889,2.65,2310,30,9,8.4416,21800.0,0.6615,7,7,11.0,1,5,19.499981,0.328512,25
2,1,1980,1,3,35,12,4.5455,4.04,3072,40,12,3.5807,21040.0,0.6915,12,7,5.0,0,15,12.03991,1.514138,225
3,1,456,0,3,34,12,1.0965,3.25,1920,53,10,3.5417,7300.0,0.7815,7,7,5.0,0,6,6.799996,0.092123,36
4,1,1568,1,2,31,14,4.5918,3.6,2000,32,12,10.0,27300.0,0.6215,12,14,9.5,1,7,20.100058,1.524272,49


## Wooldridge, example 17.1
### Labor supply of married, working women

We will investigate the probability that a woman is in the labor force as a function of her sociodemographic characteristics.

`mroz` is a pandas DataFrame with 753 observations and 22 variables:

- inlf: =1 if in labor force, 1975
- hours: hours worked, 1975
- kidslt6: number of kids under 6 years old
- kidsge6: number of kids aged 6-18
- age: woman's age in years
- educ: years of schooling
- wage: estimated wage from earnings and hours
- repwage: reported wage at interview in 1976
- hushrs: hours worked by husband, 1975
- husage: husband's age
- huseduc: husband's years of schooling
- huswage: husband's hourly wage, 1975
- faminc: family income, 1975
- mtr: federal marginal tax rate facing woman
- motheduc: mother's years of schooling
- fatheduc: father's years of schooling
- unem: unemployment rate in county of residence
- city: =1 if living in an SMSA
- exper: actual labor market experience
- nwifeinc: (faminc - wage*hours)/1000
- lwage: log(wage)
- expersq: exper^2

Implement the following model:

$$P(inlf = 1 \mid \mathbf{x}) = \Lambda(\beta_0 + \beta_1 nwifeinc + \beta_2 educ + \beta_3 exper + \beta_4 exper^2 + \beta_5 age + \beta_6 kidslt6 + \beta_7 kidsge6), \qquad \Lambda(z) = \frac{e^{z}}{1 + e^{z}}$$
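
Since this section is about logistic regression, the linear combination of covariates is turned into a probability through the logistic function. The sketch below only illustrates that link. The helper names `logistic` and `logit` are illustrative and do not come from the original notebook.

```python
import math

def logistic(z):
    """Map a linear index z (any real number) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Inverse of the logistic function: the log-odds of probability p."""
    return math.log(p / (1.0 - p))

# An index of zero corresponds to a probability of one half.
p_zero = logistic(0.0)

# The two functions undo each other, so the coefficients live on the log-odds scale.
roundtrip = logit(logistic(1.5))
```

Whatever value the linear index takes, the resulting probability stays strictly between 0 and 1.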

In [36]:
mroz.describe()

Unnamed: 0,inlf,hours,kidslt6,kidsge6,age,educ,wage,repwage,hushrs,husage,huseduc,huswage,faminc,mtr,motheduc,fatheduc,unem,city,exper,nwifeinc,lwage,expersq
count,753.0,753.0,753.0,753.0,753.0,753.0,428.0,753.0,753.0,753.0,753.0,753.0,753.0,753.0,753.0,753.0,753.0,753.0,753.0,753.0,428.0,753.0
mean,0.568393,740.576361,0.237716,1.353254,42.537849,12.286853,4.177682,1.849734,2267.270916,45.12085,12.491368,7.482179,23080.594954,0.678863,9.250996,8.808765,8.623506,0.642762,10.63081,20.128964,1.190173,178.038513
std,0.49563,871.314216,0.523959,1.319874,8.072574,2.280246,3.310282,2.419887,595.566649,8.058793,3.020804,4.230559,12190.202026,0.083496,3.367468,3.57229,3.114934,0.479504,8.06913,11.634797,0.723198,249.630849
min,0.0,0.0,0.0,0.0,30.0,5.0,0.1282,0.0,175.0,30.0,3.0,0.4121,1500.0,0.4415,0.0,0.0,3.0,0.0,0.0,-0.029057,-2.054164,0.0
25%,0.0,0.0,0.0,0.0,36.0,12.0,2.2626,0.0,1928.0,38.0,11.0,4.7883,15428.0,0.6215,7.0,7.0,7.5,0.0,4.0,13.02504,0.816509,16.0
50%,1.0,288.0,0.0,1.0,43.0,12.0,3.4819,0.0,2164.0,46.0,12.0,6.9758,20880.0,0.6915,10.0,7.0,7.5,1.0,9.0,17.700001,1.247574,81.0
75%,1.0,1516.0,0.0,2.0,49.0,13.0,4.97075,3.58,2553.0,52.0,15.0,9.1667,28200.0,0.7215,12.0,12.0,11.0,1.0,15.0,24.466,1.603571,225.0
max,1.0,4950.0,3.0,8.0,60.0,17.0,25.0,9.98,5010.0,60.0,17.0,40.508999,96000.0,0.9415,17.0,17.0,14.0,1.0,45.0,96.0,3.218876,2025.0


In [53]:
mroz.isna().sum()

inlf          0
hours         0
kidslt6       0
kidsge6       0
age           0
educ          0
wage        325
repwage       0
hushrs        0
husage        0
huseduc       0
huswage       0
faminc        0
mtr           0
motheduc      0
fatheduc      0
unem          0
city          0
exper         0
nwifeinc      0
lwage       325
expersq       0
dtype: int64

If we try to fit a linear model by OLS:

In [66]:
fit0 = smf.ols(formula='inlf ~ nwifeinc + educ + exper + np.power(exper, 2) + age + kidslt6 + kidsge6', 
               data=mroz).fit()
print(fit0.summary())

                            OLS Regression Results                            
Dep. Variable:                   inlf   R-squared:                       0.264
Model:                            OLS   Adj. R-squared:                  0.257
Method:                 Least Squares   F-statistic:                     38.22
Date:                Mon, 28 Oct 2019   Prob (F-statistic):           6.90e-46
Time:                        22:00:08   Log-Likelihood:                -423.89
No. Observations:                 753   AIC:                             863.8
Df Residuals:                     745   BIC:                             900.8
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept              0.5855      0
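
One well-known limitation of a linear fit on a 0/1 outcome is that the fitted values are not constrained to lie between 0 and 1. The toy example below uses made-up data (not `mroz`) and a hand-written one-regressor OLS, purely to illustrate the issue.

```python
# Toy data (hypothetical): the outcome switches from 0 to 1 as x grows.
x = list(range(10))
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Closed-form OLS for a single regressor with an intercept.
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Fitted "probabilities" from the linear model.
fitted = [intercept + slope * xi for xi in x]
print(min(fitted), max(fitted))  # the extremes fall below 0 and above 1
```

This is one reason to prefer a model whose predictions are probabilities by construction.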

In [65]:
fit1 = smf.glm(formula='inlf ~ nwifeinc + educ + exper + np.power(exper, 2) + age + kidslt6 + kidsge6', 
               data=mroz, family=sm.families.Binomial()).fit()
print(fit1.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:                   inlf   No. Observations:                  753
Model:                            GLM   Df Residuals:                      745
Model Family:                Binomial   Df Model:                            7
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -401.77
Date:                Mon, 28 Oct 2019   Deviance:                       803.53
Time:                        21:59:45   Pearson chi2:                     732.
No. Iterations:                     5                                         
Covariance Type:            nonrobust                                         
                         coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept              0.4255      0
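
The summary reports `IRLS` as the fitting method. For the logit link this is equivalent to Newton-Raphson on the log-likelihood. The following is a minimal sketch of that iteration for one regressor plus an intercept, on made-up data. It is not the statsmodels implementation, and all names are illustrative.

```python
import math

# Toy data (hypothetical, not mroz): the classes overlap, so the MLE is finite.
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [0, 0, 1, 0, 1, 0, 1, 1]

def probs(b0, b1):
    """Fitted probabilities for intercept b0 and slope b1."""
    return [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]

b0, b1 = 0.0, 0.0  # start from zero coefficients
for _ in range(25):
    p = probs(b0, b1)
    # Score: gradient of the log-likelihood.
    g0 = sum(yi - pi for yi, pi in zip(y, p))
    g1 = sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, p))
    # Information matrix X'WX with weights w = p(1 - p).
    w = [pi * (1.0 - pi) for pi in p]
    h00 = sum(w)
    h01 = sum(wi * xi for wi, xi in zip(w, x))
    h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
    det = h00 * h11 - h01 * h01
    # Newton step: beta <- beta + (X'WX)^{-1} score, written out for the 2x2 case.
    b0 += (h11 * g0 - h01 * g1) / det
    b1 += (h00 * g1 - h01 * g0) / det

p = probs(b0, b1)
score = (sum(yi - pi for yi, pi in zip(y, p)),
         sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, p)))
print(b0, b1, score)
```

At convergence the score is numerically zero, which with an intercept implies that the fitted probabilities sum to the number of observed ones.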

In [64]:
fit1.params

Intercept             0.425452
nwifeinc             -0.021345
educ                  0.221170
exper                 0.205870
np.power(exper, 2)   -0.003154
age                  -0.088024
kidslt6              -1.443354
kidsge6               0.060112
dtype: float64
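
These coefficients are on the log-odds scale. As a rough illustration, the snippet below plugs them into the logistic function for a hypothetical woman and shows how the predicted probability changes with one child under six. The profile values are assumptions chosen near the sample means, not a row of `mroz`, and `predict_prob` is an illustrative helper.

```python
import math

# Coefficients copied from fit1.params above.
params = {
    "Intercept": 0.425452, "nwifeinc": -0.021345, "educ": 0.221170,
    "exper": 0.205870, "expersq": -0.003154, "age": -0.088024,
    "kidslt6": -1.443354, "kidsge6": 0.060112,
}

def predict_prob(profile):
    """Predicted P(inlf = 1) for a dict of covariate values."""
    z = params["Intercept"] + sum(params[k] * v for k, v in profile.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical profile (assumed values, roughly at the sample means).
base = {"nwifeinc": 20.0, "educ": 12, "exper": 10, "expersq": 100,
        "age": 42, "kidslt6": 0, "kidsge6": 1}
with_young_child = dict(base, kidslt6=1)

p_base = predict_prob(base)
p_child = predict_prob(with_young_child)

# exp(coefficient) is the multiplicative change in the odds per unit increase.
odds_ratio_kidslt6 = math.exp(params["kidslt6"])
print(round(p_base, 3), round(p_child, 3), round(odds_ratio_kidslt6, 3))
```

Under these assumed values, the predicted probability of being in the labor force falls from roughly two thirds to roughly one third when a child under six is present.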

In [22]:
from sklearn.linear_model import LogisticRegression

In [67]:
# No regularization, to match the statsmodels fit (scikit-learn < 1.2 spells this penalty='none')
fit2 = LogisticRegression(penalty=None, solver='newton-cg')
y = mroz.inlf
X = mroz[['nwifeinc', 'educ', 'exper','expersq', 'age','kidslt6','kidsge6']]
fit2.fit(X, y)
print('Intercept:', fit2.intercept_)
print('Coefs:', fit2.coef_)

Intercept: [0.42545235]
Coefs: [[-0.02134517  0.22117037  0.20586953 -0.0031541  -0.08802437 -1.44335413
   0.06011222]]


In [76]:
from sklearn.metrics import confusion_matrix, roc_auc_score

In [77]:
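# Rows are the true classes, columns the predicted classes
print(confusion_matrix(mroz.inlf, fit2.predict(X)))
# Note: the score below is computed from hard 0/1 predictions, not predicted probabilities
print(roc_auc_score(mroz.inlf, fit2.predict(X)))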

[[207 118]
 [ 81 347]]
0.7238353702372393
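
The confusion matrix follows scikit-learn's convention of true classes in rows and predicted classes in columns. The snippet below recomputes a few summary rates from the printed counts. It also shows that the reported score of about 0.724 equals the average of sensitivity and specificity, which is what `roc_auc_score` reduces to when it receives hard 0/1 predictions. A ranking-based AUC would instead be computed from `fit2.predict_proba(X)[:, 1]`. The variable names are illustrative.

```python
# Counts copied from the confusion matrix above: rows = true class, columns = predicted.
tn, fp = 207, 118
fn, tp = 81, 347

accuracy = (tn + tp) / (tn + fp + fn + tp)
sensitivity = tp / (tp + fn)   # women in the labor force who are predicted as such
specificity = tn / (tn + fp)   # women out of the labor force who are predicted as such

# With 0/1 predictions the ROC curve has a single interior point, so its area
# is the mean of sensitivity and specificity.
auc_from_labels = (sensitivity + specificity) / 2
print(round(accuracy, 4), round(sensitivity, 4),
      round(specificity, 4), round(auc_from_labels, 4))
```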
