# EBAC - Regression II - multiple regression

## Task I

#### Income prediction

We will work with the dataset 'previsao_de_renda.csv', which is the dataset for your next project. We will apply the techniques covered so far to it.

|variable|description|
|-|-|
|data_ref                | Reference date on which the variables were collected |
|index                   | Customer identification code|
|sexo                    | Customer's sex|
|posse_de_veiculo        | Whether the customer owns a vehicle|
|posse_de_imovel         | Whether the customer owns real estate|
|qtd_filhos              | Number of children the customer has|
|tipo_renda              | Customer's type of income|
|educacao                | Customer's level of education|
|estado_civil            | Customer's marital status|
|tipo_residencia         | Customer's type of residence (owned, rented, etc.)|
|idade                   | Customer's age|
|tempo_emprego           | Time in current job|
|qt_pessoas_residencia   | Number of people living in the residence|
|renda                   | Income in reais|

In [3]:
import pandas as pd
import seaborn as sns
from seaborn import load_dataset
import matplotlib.pyplot as plt
import numpy as np

import patsy
import statsmodels.api as sm
import statsmodels.formula.api as smf

%matplotlib inline

In [4]:
df = pd.read_csv('previsao_de_renda.csv')

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15000 entries, 0 to 14999
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             15000 non-null  int64  
 1   data_ref               15000 non-null  object 
 2   id_cliente             15000 non-null  int64  
 3   sexo                   15000 non-null  object 
 4   posse_de_veiculo       15000 non-null  bool   
 5   posse_de_imovel        15000 non-null  bool   
 6   qtd_filhos             15000 non-null  int64  
 7   tipo_renda             15000 non-null  object 
 8   educacao               15000 non-null  object 
 9   estado_civil           15000 non-null  object 
 10  tipo_residencia        15000 non-null  object 
 11  idade                  15000 non-null  int64  
 12  tempo_emprego          12427 non-null  float64
 13  qt_pessoas_residencia  15000 non-null  float64
 14  renda                  15000 non-null  float64
dtypes:
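
The `info()` output shows `tempo_emprego` with 12427 non-null values out of 15000 rows, so 2573 rows are missing that field. Formula-based fitting through patsy drops rows with missing values by default, which is why the design matrices later in this notebook have 12427 rows. A minimal sketch of checking this with pandas, on a small made-up frame (the values are illustrative assumptions, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; all values are made up.
toy = pd.DataFrame({
    "renda": [1000.0, 2500.0, 1800.0, 4000.0],
    "tempo_emprego": [2.0, np.nan, 5.5, np.nan],
})

# Count missing values per column.
missing_per_column = toy.isna().sum()

# Keep only the rows a formula-based fit would actually use.
complete_rows = toy.dropna(subset=["tempo_emprego"])

print(missing_per_column)
print(len(complete_rows), "rows remain after dropping missing tempo_emprego")
```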

1. Fit a model to predict log(renda) using all available covariates.
    - Using Patsy, encode the qualitative variables as *dummies*.
    - Always keep the most frequent category as the reference level.
    - Evaluate the parameters and check whether they make practical sense.


2. Remove the least significant variable and analyze:
    - Look at the indicators we have covered and assess whether, in your opinion, the model improved or got worse.
    - Look at the parameters and check whether any of them changed substantially.


3. Keep removing the least significant variables for as long as the largest *p-value* is greater than 5%. Compare the final model with the initial one. Look at the indicators and conclude whether the model appears better.
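
The procedure in steps 2 and 3 is a backward elimination. It can be sketched roughly as below, assuming synthetic data and a hypothetical helper named `backward_eliminate`. This simplified version handles numeric terms only, where each term has exactly one parameter; a categorical term produces one p-value per level and would need a joint test instead.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def backward_eliminate(data, response, terms, alpha=0.05):
    """Hypothetical helper: drop the term with the largest p-value until all are <= alpha."""
    terms = list(terms)
    while terms:
        model = smf.ols(f"{response} ~ {' + '.join(terms)}", data=data).fit()
        pvalues = model.pvalues.drop("Intercept")
        worst = pvalues.idxmax()
        if pvalues[worst] <= alpha:
            return model, terms
        # Numeric terms only: the parameter name equals the term name.
        terms.remove(worst)
    # Every term was removed: fall back to an intercept-only model.
    return smf.ols(f"{response} ~ 1", data=data).fit(), terms

# Synthetic data (an assumption for illustration): y depends on x1 and x2, not on noise.
rng = np.random.default_rng(0)
n = 500
data = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
data["y"] = 1.0 + 2.0 * data["x1"] + 0.5 * data["x2"] + rng.normal(size=n)

final_model, kept = backward_eliminate(data, "y", ["x1", "x2", "noise"])
print(kept)
```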
    

### 1. Fitting a model to predict log(renda)

In [8]:
# Qualitative variables: sexo, tipo_renda, educacao, estado_civil, tipo_residencia

In [9]:
df['sexo'].value_counts()  # most frequent category: F

sexo
F    10119
M     4881
Name: count, dtype: int64

In [10]:
df['tipo_renda'].value_counts()  # most frequent category: Assalariado

tipo_renda
Assalariado         7633
Empresário          3508
Pensionista         2582
Servidor público    1268
Bolsista               9
Name: count, dtype: int64

In [11]:
df['educacao'].value_counts()  # most frequent category: Secundário

educacao
Secundário             8895
Superior completo      5335
Superior incompleto     579
Primário                165
Pós graduação            26
Name: count, dtype: int64

In [12]:
df['estado_civil'].value_counts()  # most frequent category: Casado

estado_civil
Casado      10534
Solteiro     1798
União        1078
Separado      879
Viúvo         711
Name: count, dtype: int64

In [13]:
df['tipo_residencia'].value_counts()  # most frequent category: Casa

tipo_residencia
Casa             13532
Com os pais        675
Governamental      452
Aluguel            194
Estúdio             83
Comunitário         64
Name: count, dtype: int64
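
The reference levels above were read off each `value_counts()` output by hand. A small sketch of deriving them programmatically is shown below; `treatment_term` is a hypothetical helper name and the toy frame is made up for illustration.

```python
import pandas as pd

def treatment_term(data, column):
    """Hypothetical helper: build a patsy term whose reference level is the most frequent category."""
    reference = data[column].value_counts().idxmax()
    return f'C({column}, Treatment("{reference}"))'

# Toy frame with made-up rows; the column names follow the dataset.
toy = pd.DataFrame({
    "sexo": ["F", "F", "M"],
    "tipo_renda": ["Assalariado", "Empresário", "Assalariado"],
})

terms = [treatment_term(toy, col) for col in ["sexo", "tipo_renda"]]
print(" + ".join(terms))
```

The resulting strings can be joined into the right-hand side of a formula, which avoids typing each reference level manually.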

In [14]:
# Same formula as before, split across lines for readability.
formula = (
    'np.log(renda) ~ data_ref'
    ' + C(sexo, Treatment("F"))'
    ' + posse_de_veiculo + posse_de_imovel + qtd_filhos'
    ' + C(tipo_renda, Treatment("Assalariado"))'
    ' + C(educacao, Treatment("Secundário"))'
    ' + C(estado_civil, Treatment("Casado"))'
    ' + C(tipo_residencia, Treatment("Casa"))'
    ' + idade + tempo_emprego + qt_pessoas_residencia'
)
y, X = patsy.dmatrices(formula, data=df)
X

DesignMatrix with shape (12427, 39)
  Columns:
    ['Intercept',
     'data_ref[T.2015-02-01]',
     'data_ref[T.2015-03-01]',
     'data_ref[T.2015-04-01]',
     'data_ref[T.2015-05-01]',
     'data_ref[T.2015-06-01]',
     'data_ref[T.2015-07-01]',
     'data_ref[T.2015-08-01]',
     'data_ref[T.2015-09-01]',
     'data_ref[T.2015-10-01]',
     'data_ref[T.2015-11-01]',
     'data_ref[T.2015-12-01]',
     'data_ref[T.2016-01-01]',
     'data_ref[T.2016-02-01]',
     'data_ref[T.2016-03-01]',
     'C(sexo, Treatment("F"))[T.M]',
     'posse_de_veiculo[T.True]',
     'posse_de_imovel[T.True]',
     'C(tipo_renda, Treatment("Assalariado"))[T.Bolsista]',
     'C(tipo_renda, Treatment("Assalariado"))[T.Empresário]',
     'C(tipo_renda, Treatment("Assalariado"))[T.Pensionista]',
     'C(tipo_renda, Treatment("Assalariado"))[T.Servidor público]',
     'C(educacao, Treatment("Secundário"))[T.Primário]',
     'C(educacao, Treatment("Secundário"))[T.Pós graduação]',
     'C(educacao, Treatment

In [15]:
y

DesignMatrix with shape (12427, 1)
  np.log(renda)
        8.99471
        7.52410
        7.72041
        8.79494
        8.77585
        7.27647
        7.45358
        7.83042
        8.13750
        9.46801
        8.76443
        6.36506
       10.14091
        7.06603
        8.20940
        9.89158
        9.52359
        8.57316
       10.17252
        9.01970
        8.29814
        8.44590
        8.63262
        5.93891
        9.07289
        6.96150
        7.60007
        8.99363
        8.76293
        7.84689
  [12397 rows omitted]
  Terms:
    'np.log(renda)' (column 0)
  (to view full data, use np.asarray(this_obj))

In [16]:
# Full model: all covariates, most frequent category as reference level.
formula = (
    'np.log(renda) ~ data_ref'
    ' + C(sexo, Treatment("F"))'
    ' + posse_de_veiculo + posse_de_imovel + qtd_filhos'
    ' + C(tipo_renda, Treatment("Assalariado"))'
    ' + C(educacao, Treatment("Secundário"))'
    ' + C(estado_civil, Treatment("Casado"))'
    ' + C(tipo_residencia, Treatment("Casa"))'
    ' + idade + tempo_emprego + qt_pessoas_residencia'
)
reg = smf.ols(formula, data=df).fit()
df['res'] = reg.resid
reg.summary()

| | | | |
|-|-|-|-|
| Dep. Variable: | np.log(renda) | R-squared: | 0.359 |
| Model: | OLS | Adj. R-squared: | 0.357 |
| Method: | Least Squares | F-statistic: | 182.5 |
| Date: | Thu, 23 Jan 2025 | Prob (F-statistic): | 0.0 |
| Time: | 00:47:07 | Log-Likelihood: | -13554.0 |
| No. Observations: | 12427 | AIC: | 27190.0 |
| Df Residuals: | 12388 | BIC: | 27480.0 |
| Df Model: | 38 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|-|-|-|-|-|-|-|
| Intercept | 6.5144 | 0.220 | 29.619 | 0.000 | 6.083 | 6.945 |
| data_ref[T.2015-02-01] | 0.0031 | 0.035 | 0.088 | 0.930 | -0.066 | 0.073 |
| data_ref[T.2015-03-01] | 0.0505 | 0.036 | 1.421 | 0.155 | -0.019 | 0.120 |
| data_ref[T.2015-04-01] | 0.0494 | 0.035 | 1.394 | 0.163 | -0.020 | 0.119 |
| data_ref[T.2015-05-01] | -0.0183 | 0.035 | -0.518 | 0.605 | -0.088 | 0.051 |
| data_ref[T.2015-06-01] | 0.0729 | 0.035 | 2.056 | 0.040 | 0.003 | 0.142 |
| data_ref[T.2015-07-01] | 0.0285 | 0.035 | 0.806 | 0.421 | -0.041 | 0.098 |
| data_ref[T.2015-08-01] | 0.0010 | 0.035 | 0.027 | 0.978 | -0.069 | 0.070 |
| data_ref[T.2015-09-01] | -0.0072 | 0.035 | -0.204 | 0.838 | -0.076 | 0.062 |

| | | | |
|-|-|-|-|
| Omnibus: | 1.087 | Durbin-Watson: | 2.028 |
| Prob(Omnibus): | 0.581 | Jarque-Bera (JB): | 1.064 |
| Skew: | 0.021 | Prob(JB): | 0.587 |
| Kurtosis: | 3.016 | Cond. No. | 2140.0 |


### 2. Removing the least significant variable and analyzing

Removing `data_ref`:

In [19]:
# Same model as before, without data_ref.
formula = (
    'np.log(renda) ~ C(sexo, Treatment("F"))'
    ' + posse_de_veiculo + posse_de_imovel + qtd_filhos'
    ' + C(tipo_renda, Treatment("Assalariado"))'
    ' + C(educacao, Treatment("Secundário"))'
    ' + C(estado_civil, Treatment("Casado"))'
    ' + C(tipo_residencia, Treatment("Casa"))'
    ' + idade + tempo_emprego + qt_pessoas_residencia'
)
reg = smf.ols(formula, data=df).fit()
df['res'] = reg.resid
reg.summary()

| | | | |
|-|-|-|-|
| Dep. Variable: | np.log(renda) | R-squared: | 0.357 |
| Model: | OLS | Adj. R-squared: | 0.356 |
| Method: | Least Squares | F-statistic: | 287.5 |
| Date: | Thu, 23 Jan 2025 | Prob (F-statistic): | 0.0 |
| Time: | 00:47:07 | Log-Likelihood: | -13568.0 |
| No. Observations: | 12427 | AIC: | 27190.0 |
| Df Residuals: | 12402 | BIC: | 27370.0 |
| Df Model: | 24 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|-|-|-|-|-|-|-|
| Intercept | 6.5264 | 0.219 | 29.853 | 0.000 | 6.098 | 6.955 |
| C(sexo, Treatment("F"))[T.M] | 0.7874 | 0.015 | 53.723 | 0.000 | 0.759 | 0.816 |
| posse_de_veiculo[T.True] | 0.0441 | 0.014 | 3.119 | 0.002 | 0.016 | 0.072 |
| posse_de_imovel[T.True] | 0.0829 | 0.014 | 5.926 | 0.000 | 0.055 | 0.110 |
| C(tipo_renda, Treatment("Assalariado"))[T.Bolsista] | 0.2209 | 0.241 | 0.916 | 0.360 | -0.252 | 0.694 |
| C(tipo_renda, Treatment("Assalariado"))[T.Empresário] | 0.1551 | 0.015 | 10.387 | 0.000 | 0.126 | 0.184 |
| C(tipo_renda, Treatment("Assalariado"))[T.Pensionista] | -0.3087 | 0.241 | -1.280 | 0.201 | -0.782 | 0.164 |
| C(tipo_renda, Treatment("Assalariado"))[T.Servidor público] | 0.0576 | 0.022 | 2.591 | 0.010 | 0.014 | 0.101 |
| C(educacao, Treatment("Secundário"))[T.Primário] | 0.0141 | 0.072 | 0.196 | 0.844 | -0.127 | 0.155 |

| | | | |
|-|-|-|-|
| Omnibus: | 0.858 | Durbin-Watson: | 2.023 |
| Prob(Omnibus): | 0.651 | Jarque-Bera (JB): | 0.839 |
| Skew: | 0.019 | Prob(JB): | 0.657 |
| Kurtosis: | 3.012 | Cond. No. | 2130.0 |


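The model changed very little after removing `data_ref`: R-squared went from 0.359 to 0.357, adjusted R-squared from 0.357 to 0.356, the AIC stayed at 27190, and the BIC fell from 27480 to 27370. This makes sense, since the variable has little influence on log(renda).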
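
Because `data_ref` enters the model as a set of dummies, its contribution as a whole can also be checked with a joint F-test between the full and the reduced model, rather than by reading each dummy's p-value separately. A rough sketch using statsmodels' `compare_f_test` on synthetic data follows; the `month` factor and all values are assumptions for illustration, not the notebook's data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: y depends on x only; the three-level factor has no real effect.
rng = np.random.default_rng(2)
n = 300
toy = pd.DataFrame({
    "x": rng.normal(size=n),
    "month": rng.choice(["2015-01", "2015-02", "2015-03"], size=n),
})
toy["y"] = 1.0 + 0.8 * toy["x"] + rng.normal(size=n)

full = smf.ols("y ~ x + C(month)", data=toy).fit()
reduced = smf.ols("y ~ x", data=toy).fit()

# Joint test of all dummies of the factor: H0 says every one of them is zero.
f_value, p_value, df_diff = full.compare_f_test(reduced)
print(f"F = {f_value:.3f}, p = {p_value:.3f}, dummies tested = {df_diff:.0f}")
```

A large p-value here would support dropping the whole factor at once.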

### 3. Removing the least significant variables while the *p-value* is greater than 5%

The variables removed are: `data_ref`, `tipo_renda`, `educacao`, `estado_civil`, `tipo_residencia`, and `qt_pessoas_residencia`.

In [23]:
reg = smf.ols('np.log(renda) ~ C(sexo, Treatment("F")) + posse_de_veiculo + posse_de_imovel + qtd_filhos + idade + tempo_emprego', data=df).fit()
df['res'] = reg.resid
reg.summary()

| | | | |
|-|-|-|-|
| Dep. Variable: | np.log(renda) | R-squared: | 0.347 |
| Model: | OLS | Adj. R-squared: | 0.346 |
| Method: | Least Squares | F-statistic: | 1098.0 |
| Date: | Thu, 23 Jan 2025 | Prob (F-statistic): | 0.0 |
| Time: | 00:47:07 | Log-Likelihood: | -13673.0 |
| No. Observations: | 12427 | AIC: | 27360.0 |
| Df Residuals: | 12420 | BIC: | 27410.0 |
| Df Model: | 6 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|-|-|-|-|-|-|-|
| Intercept | 7.2223 | 0.034 | 213.646 | 0.000 | 7.156 | 7.289 |
| C(sexo, Treatment("F"))[T.M] | 0.7688 | 0.015 | 52.768 | 0.000 | 0.740 | 0.797 |
| posse_de_veiculo[T.True] | 0.0561 | 0.014 | 3.986 | 0.000 | 0.029 | 0.084 |
| posse_de_imovel[T.True] | 0.0866 | 0.014 | 6.278 | 0.000 | 0.060 | 0.114 |
| qtd_filhos | 0.0239 | 0.009 | 2.767 | 0.006 | 0.007 | 0.041 |
| idade | 0.0049 | 0.001 | 6.399 | 0.000 | 0.003 | 0.006 |
| tempo_emprego | 0.0610 | 0.001 | 59.084 | 0.000 | 0.059 | 0.063 |

| | | | |
|-|-|-|-|
| Omnibus: | 1.243 | Durbin-Watson: | 2.026 |
| Prob(Omnibus): | 0.537 | Jarque-Bera (JB): | 1.216 |
| Skew: | 0.022 | Prob(JB): | 0.545 |
| Kurtosis: | 3.02 | Cond. No. | 223.0 |
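
Since the response is log(renda), a coefficient b corresponds to a multiplicative effect of exp(b) on renda, that is, a change of (exp(b) - 1) × 100 percent per one-unit increase in the covariate, holding the others fixed. The short sketch below applies this to three coefficients copied from the final summary above; `pct_effect` is a hypothetical helper name.

```python
import math

def pct_effect(coef):
    """Percentage change in renda implied by a log-scale coefficient, per one-unit increase."""
    return (math.exp(coef) - 1.0) * 100.0

# Coefficients copied from the final model's summary table.
coefs = {
    'C(sexo, Treatment("F"))[T.M]': 0.7688,
    "tempo_emprego": 0.0610,
    "idade": 0.0049,
}

for name, coef in coefs.items():
    print(f"{name}: {pct_effect(coef):+.1f}%")
```

Reading the coefficients on this scale can make it easier to judge whether they make practical sense, as the task asks.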


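- Without the non-significant variables, the final R-squared (0.347) is lower than that of the model with all covariates (0.359), and the adjusted R-squared fell from 0.357 to 0.346. This is expected, since R-squared cannot increase when variables are removed from the model.
- The AIC increased slightly, from 27190 to 27360, while the BIC decreased from 27480 to 27410.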
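
The AIC and BIC values in the three summaries can be reproduced from the reported log-likelihoods, using AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, where k is the number of estimated parameters (Df Model plus the intercept) and n is the number of observations. The sketch below uses only numbers shown in the summaries; because those log-likelihoods are rounded, the results match the reported values only up to the displayed precision.

```python
import math

n_obs = 12427

# (log-likelihood, number of estimated parameters = Df Model + 1 for the intercept)
models = {
    "all covariates": (-13554.0, 39),
    "without data_ref": (-13568.0, 25),
    "final": (-13673.0, 7),
}

def aic(loglik, k):
    """Akaike information criterion."""
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    """Bayesian information criterion."""
    return k * math.log(n) - 2 * loglik

for name, (loglik, k) in models.items():
    print(f"{name}: AIC = {aic(loglik, k):.0f}, BIC = {bic(loglik, k, n_obs):.0f}")
```

The BIC penalizes each parameter by ln(n), roughly 9.4 here, instead of 2, which is why it favors the smaller model while the AIC does not.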