# EBAC - Regression II - multiple regression

## Task I

#### Income prediction

We will work with the 'previsao_de_renda.csv' dataset, which is the dataset for your next project, and apply to it the techniques covered so far.

|variable|description|
|-|-|
|data_ref                | Reference date on which the variables were collected |
|index                   | Client identification code|
|sexo                    | Client's sex|
|posse_de_veiculo        | Whether the client owns a vehicle|
|posse_de_imovel         | Whether the client owns real estate|
|qtd_filhos              | Number of children the client has|
|tipo_renda              | Client's type of income|
|educacao                | Client's education level|
|estado_civil            | Client's marital status|
|tipo_residencia         | Client's type of residence (owned, rented, etc.)|
|idade                   | Client's age|
|tempo_emprego           | Time in current job|
|qt_pessoas_residencia   | Number of people living in the residence|
|renda                   | Income in reais|

In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import patsy

In [2]:
df = pd.read_csv('previsao_de_renda.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15000 entries, 0 to 14999
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             15000 non-null  int64  
 1   data_ref               15000 non-null  object 
 2   id_cliente             15000 non-null  int64  
 3   sexo                   15000 non-null  object 
 4   posse_de_veiculo       15000 non-null  bool   
 5   posse_de_imovel        15000 non-null  bool   
 6   qtd_filhos             15000 non-null  int64  
 7   tipo_renda             15000 non-null  object 
 8   educacao               15000 non-null  object 
 9   estado_civil           15000 non-null  object 
 10  tipo_residencia        15000 non-null  object 
 11  idade                  15000 non-null  int64  
 12  tempo_emprego          12427 non-null  float64
 13  qt_pessoas_residencia  15000 non-null  float64
 14  renda                  15000 non-null  float64
dtypes:

1. Fit a model to predict log(renda) using all available covariates.
    - Using Patsy, encode the qualitative variables as *dummies*.
    - Always keep the most frequent category as the reference level.
    - Evaluate the parameters and check whether they make practical sense.  


2. Remove the least significant variable and analyze:
    - Look at the indicators we have seen and assess whether, in your opinion, the model improved or got worse.
    - Look at the parameters and check whether any of them changed much.  


3. Keep removing the least significant variables for as long as the largest *p-value* is greater than 5%. Compare the final model with the initial one. Look at the indicators and conclude whether the model looks better. 
    

In [4]:
# Task 1

# Drop rows with missing values (only tempo_emprego has nulls).
df = df.dropna()
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 12427 entries, 0 to 14999
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             12427 non-null  int64  
 1   data_ref               12427 non-null  object 
 2   id_cliente             12427 non-null  int64  
 3   sexo                   12427 non-null  object 
 4   posse_de_veiculo       12427 non-null  bool   
 5   posse_de_imovel        12427 non-null  bool   
 6   qtd_filhos             12427 non-null  int64  
 7   tipo_renda             12427 non-null  object 
 8   educacao               12427 non-null  object 
 9   estado_civil           12427 non-null  object 
 10  tipo_residencia        12427 non-null  object 
 11  idade                  12427 non-null  int64  
 12  tempo_emprego          12427 non-null  float64
 13  qt_pessoas_residencia  12427 non-null  float64
 14  renda                  12427 non-null  float64
dtypes: bool

In [5]:
df['log_renda'] = np.log(df['renda'])

In [6]:
# Most frequent value of each column, used to pick the reference levels.
df.mode().iloc[0]

Unnamed: 0                         0
data_ref                  2015-09-01
id_cliente                    5573.0
sexo                               F
posse_de_veiculo               False
posse_de_imovel                 True
qtd_filhos                       0.0
tipo_renda               Assalariado
educacao                  Secundário
estado_civil                  Casado
tipo_residencia                 Casa
idade                           40.0
tempo_emprego               4.216438
qt_pessoas_residencia            2.0
renda                         728.96
log_renda                   6.591619
Name: 0, dtype: object
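
The reference levels used in the next cell are the modes listed above. As a minimal sketch, the same `Treatment` terms could be generated from the data instead of being typed by hand. The helper `treatment_term` and the tiny `demo` frame are hypothetical names introduced only for illustration; they are not part of the notebook.

```python
import pandas as pd


def treatment_term(frame, col):
    """Build a Patsy term whose reference level is the most frequent category."""
    ref = frame[col].value_counts().idxmax()
    return f'C({col}, Treatment("{ref}"))'


# Tiny illustrative frame (made-up rows), not the real dataset.
demo = pd.DataFrame({
    'sexo': ['F', 'F', 'M'],
    'tipo_renda': ['Assalariado', 'Empresário', 'Assalariado'],
})

terms = [treatment_term(demo, c) for c in ['sexo', 'tipo_renda']]
formula = 'log_renda ~ ' + ' + '.join(terms)
print(formula)
```

On the real dataframe, the same helper would be called once per qualitative column and the numeric columns appended to the formula as plain names.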

In [7]:
y, X = patsy.dmatrices('log_renda ~ C(sexo, Treatment("F")) + C(tipo_renda, Treatment("Assalariado")) + \
           C(educacao, Treatment("Secundário")) + C(estado_civil, Treatment("Casado")) + \
           C(tipo_residencia, Treatment("Casa")) + posse_de_veiculo + posse_de_imovel + \
           qtd_filhos + idade + tempo_emprego + qt_pessoas_residencia',
   df)
X

DesignMatrix with shape (12427, 25)
  Columns:
    ['Intercept',
     'C(sexo, Treatment("F"))[T.M]',
     'C(tipo_renda, Treatment("Assalariado"))[T.Bolsista]',
     'C(tipo_renda, Treatment("Assalariado"))[T.Empresário]',
     'C(tipo_renda, Treatment("Assalariado"))[T.Pensionista]',
     'C(tipo_renda, Treatment("Assalariado"))[T.Servidor público]',
     'C(educacao, Treatment("Secundário"))[T.Primário]',
     'C(educacao, Treatment("Secundário"))[T.Pós graduação]',
     'C(educacao, Treatment("Secundário"))[T.Superior completo]',
     'C(educacao, Treatment("Secundário"))[T.Superior incompleto]',
     'C(estado_civil, Treatment("Casado"))[T.Separado]',
     'C(estado_civil, Treatment("Casado"))[T.Solteiro]',
     'C(estado_civil, Treatment("Casado"))[T.União]',
     'C(estado_civil, Treatment("Casado"))[T.Viúvo]',
     'C(tipo_residencia, Treatment("Casa"))[T.Aluguel]',
     'C(tipo_residencia, Treatment("Casa"))[T.Com os pais]',
     'C(tipo_residencia, Treatment("Casa"))[T.Comuni

In [8]:
y

DesignMatrix with shape (12427, 1)
  log_renda
    8.99471
    7.52410
    7.72041
    8.79494
    8.77585
    7.27647
    7.45358
    7.83042
    8.13750
    9.46801
    8.76443
    6.36506
   10.14091
    7.06603
    8.20940
    9.89158
    9.52359
    8.57316
   10.17252
    9.01970
    8.29814
    8.44590
    8.63262
    5.93891
    9.07289
    6.96150
    7.60007
    8.99363
    8.76293
    7.84689
  [12397 rows omitted]
  Terms:
    'log_renda' (column 0)
  (to view full data, use np.asarray(this_obj))

In [9]:
sm.OLS(y, X).fit().summary()

| | | | |
|-|-|-|-|
|Dep. Variable:|log_renda|R-squared:|0.357|
|Model:|OLS|Adj. R-squared:|0.356|
|Method:|Least Squares|F-statistic:|287.5|
|Date:|Sat, 22 Feb 2025|Prob (F-statistic):|0.0|
|Time:|14:48:06|Log-Likelihood:|-13568.0|
|No. Observations:|12427|AIC:|27190.0|
|Df Residuals:|12402|BIC:|27370.0|
|Df Model:|24| | |
|Covariance Type:|nonrobust| | |

| |coef|std err|t|P>\|t\||[0.025|0.975]|
|-|-|-|-|-|-|-|
|Intercept|6.5264|0.219|29.853|0.000|6.098|6.955|
|C(sexo, Treatment("F"))[T.M]|0.7874|0.015|53.723|0.000|0.759|0.816|
|C(tipo_renda, Treatment("Assalariado"))[T.Bolsista]|0.2209|0.241|0.916|0.360|-0.252|0.694|
|C(tipo_renda, Treatment("Assalariado"))[T.Empresário]|0.1551|0.015|10.387|0.000|0.126|0.184|
|C(tipo_renda, Treatment("Assalariado"))[T.Pensionista]|-0.3087|0.241|-1.280|0.201|-0.782|0.164|
|C(tipo_renda, Treatment("Assalariado"))[T.Servidor público]|0.0576|0.022|2.591|0.010|0.014|0.101|
|C(educacao, Treatment("Secundário"))[T.Primário]|0.0141|0.072|0.196|0.844|-0.127|0.155|
|C(educacao, Treatment("Secundário"))[T.Pós graduação]|0.1212|0.142|0.853|0.394|-0.157|0.400|
|C(educacao, Treatment("Secundário"))[T.Superior completo]|0.1079|0.014|7.763|0.000|0.081|0.135|

| | | | |
|-|-|-|-|
|Omnibus:|0.858|Durbin-Watson:|2.023|
|Prob(Omnibus):|0.651|Jarque-Bera (JB):|0.839|
|Skew:|0.019|Prob(JB):|0.657|
|Kurtosis:|3.012|Cond. No.|2130.0|


In the first model, R² and adjusted R² are very close: 35.7% and 35.6% respectively. The term with the highest p-value is `C(educacao, Treatment("Secundário"))[T.Primário]` (0.844), but other levels of educacao are clearly significant to the model. So only this level will be excluded from the model.
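
Task 1 also asks whether the parameters make practical sense. Because the response is log(renda), a dummy coefficient β corresponds to a multiplicative effect of exp(β) on renda relative to the reference level. The sketch below converts a few coefficients copied from the first summary into percentage changes; the shortened labels are hypothetical and chosen only for readability.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the first summary (log_renda scale).
# The short labels are illustrative, not the names Patsy generates.
coefs = pd.Series({
    'sexo[T.M]': 0.7874,
    'tipo_renda[T.Empresário]': 0.1551,
    'tipo_renda[T.Servidor público]': 0.0576,
    'educacao[T.Superior completo]': 0.1079,
})

# exp(beta) - 1 is the relative change in renda versus the reference level,
# holding the other covariates fixed.
pct_change = (np.exp(coefs) - 1) * 100
print(pct_change.round(1))
```

Read this way, the coefficient for sexo M corresponds to an income a little more than twice that of the reference level F, all else equal, which is a large effect worth checking against the data.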

In [10]:
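# Task 2

# Drop every row whose educacao is 'Primário', so that this level
# no longer appears among the dummies of the next model.
df = df.drop(df[df['educacao'] == 'Primário'].index)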
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 12324 entries, 0 to 14999
Data columns (total 16 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             12324 non-null  int64  
 1   data_ref               12324 non-null  object 
 2   id_cliente             12324 non-null  int64  
 3   sexo                   12324 non-null  object 
 4   posse_de_veiculo       12324 non-null  bool   
 5   posse_de_imovel        12324 non-null  bool   
 6   qtd_filhos             12324 non-null  int64  
 7   tipo_renda             12324 non-null  object 
 8   educacao               12324 non-null  object 
 9   estado_civil           12324 non-null  object 
 10  tipo_residencia        12324 non-null  object 
 11  idade                  12324 non-null  int64  
 12  tempo_emprego          12324 non-null  float64
 13  qt_pessoas_residencia  12324 non-null  float64
 14  renda                  12324 non-null  float64
 15  log_ren

In [11]:
y, X = patsy.dmatrices('log_renda ~ C(sexo, Treatment("F")) + C(tipo_renda, Treatment("Assalariado")) + \
           C(educacao, Treatment("Secundário")) + C(estado_civil, Treatment("Casado")) + \
           C(tipo_residencia, Treatment("Casa")) + posse_de_veiculo + posse_de_imovel + \
           qtd_filhos + idade + tempo_emprego + qt_pessoas_residencia',
   df)

sm.OLS(y, X).fit().summary()

| | | | |
|-|-|-|-|
|Dep. Variable:|log_renda|R-squared:|0.358|
|Model:|OLS|Adj. R-squared:|0.356|
|Method:|Least Squares|F-statistic:|297.7|
|Date:|Sat, 22 Feb 2025|Prob (F-statistic):|0.0|
|Time:|14:48:09|Log-Likelihood:|-13466.0|
|No. Observations:|12324|AIC:|26980.0|
|Df Residuals:|12300|BIC:|27160.0|
|Df Model:|23| | |
|Covariance Type:|nonrobust| | |

| |coef|std err|t|P>\|t\||[0.025|0.975]|
|-|-|-|-|-|-|-|
|Intercept|6.5243|0.219|29.817|0.000|6.095|6.953|
|C(sexo, Treatment("F"))[T.M]|0.7889|0.015|53.500|0.000|0.760|0.818|
|C(tipo_renda, Treatment("Assalariado"))[T.Bolsista]|0.2230|0.241|0.924|0.356|-0.250|0.696|
|C(tipo_renda, Treatment("Assalariado"))[T.Empresário]|0.1576|0.015|10.512|0.000|0.128|0.187|
|C(tipo_renda, Treatment("Assalariado"))[T.Pensionista]|-0.3099|0.241|-1.283|0.199|-0.783|0.163|
|C(tipo_renda, Treatment("Assalariado"))[T.Servidor público]|0.0592|0.022|2.655|0.008|0.015|0.103|
|C(educacao, Treatment("Secundário"))[T.Pós graduação]|0.1201|0.142|0.844|0.398|-0.159|0.399|
|C(educacao, Treatment("Secundário"))[T.Superior completo]|0.1076|0.014|7.727|0.000|0.080|0.135|
|C(educacao, Treatment("Secundário"))[T.Superior incompleto]|-0.0295|0.032|-0.915|0.360|-0.093|0.034|

| | | | |
|-|-|-|-|
|Omnibus:|0.778|Durbin-Watson:|2.024|
|Prob(Omnibus):|0.678|Jarque-Bera (JB):|0.755|
|Skew:|0.018|Prob(JB):|0.685|
|Kurtosis:|3.015|Cond. No.|2130.0|


R² increased slightly and adjusted R² stayed the same. The variable with the highest p-value is tipo_residencia, and every level of this variable has a p-value above 5%, so it will be removed from the model.

In [12]:
y, X = patsy.dmatrices('log_renda ~ C(sexo, Treatment("F")) + C(tipo_renda, Treatment("Assalariado")) + \
           C(educacao, Treatment("Secundário")) + C(estado_civil, Treatment("Casado")) + \
           posse_de_veiculo + posse_de_imovel + \
           qtd_filhos + idade + tempo_emprego + qt_pessoas_residencia',
   df)

sm.OLS(y, X).fit().summary()

| | | | |
|-|-|-|-|
|Dep. Variable:|log_renda|R-squared:|0.357|
|Model:|OLS|Adj. R-squared:|0.356|
|Method:|Least Squares|F-statistic:|380.3|
|Date:|Sat, 22 Feb 2025|Prob (F-statistic):|0.0|
|Time:|14:48:10|Log-Likelihood:|-13467.0|
|No. Observations:|12324|AIC:|26970.0|
|Df Residuals:|12305|BIC:|27110.0|
|Df Model:|18| | |
|Covariance Type:|nonrobust| | |

| |coef|std err|t|P>\|t\||[0.025|0.975]|
|-|-|-|-|-|-|-|
|Intercept|6.5294|0.219|29.859|0.000|6.101|6.958|
|C(sexo, Treatment("F"))[T.M]|0.7909|0.015|53.797|0.000|0.762|0.820|
|C(tipo_renda, Treatment("Assalariado"))[T.Bolsista]|0.2241|0.241|0.928|0.353|-0.249|0.697|
|C(tipo_renda, Treatment("Assalariado"))[T.Empresário]|0.1580|0.015|10.556|0.000|0.129|0.187|
|C(tipo_renda, Treatment("Assalariado"))[T.Pensionista]|-0.3126|0.241|-1.295|0.195|-0.786|0.161|
|C(tipo_renda, Treatment("Assalariado"))[T.Servidor público]|0.0605|0.022|2.717|0.007|0.017|0.104|
|C(educacao, Treatment("Secundário"))[T.Pós graduação]|0.1191|0.142|0.837|0.403|-0.160|0.398|
|C(educacao, Treatment("Secundário"))[T.Superior completo]|0.1074|0.014|7.735|0.000|0.080|0.135|
|C(educacao, Treatment("Secundário"))[T.Superior incompleto]|-0.0291|0.032|-0.902|0.367|-0.092|0.034|

| | | | |
|-|-|-|-|
|Omnibus:|0.738|Durbin-Watson:|2.024|
|Prob(Omnibus):|0.691|Jarque-Bera (JB):|0.72|
|Skew:|0.018|Prob(JB):|0.698|
|Kurtosis:|3.012|Cond. No.|2130.0|


R² dropped slightly. The same treatment applied earlier to educacao will now be applied to another of its levels, the one with the highest p-value (Pós graduação).

In [13]:
df = df.drop(df[df['educacao'] == 'Pós graduação'].index)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 12298 entries, 0 to 14999
Data columns (total 16 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             12298 non-null  int64  
 1   data_ref               12298 non-null  object 
 2   id_cliente             12298 non-null  int64  
 3   sexo                   12298 non-null  object 
 4   posse_de_veiculo       12298 non-null  bool   
 5   posse_de_imovel        12298 non-null  bool   
 6   qtd_filhos             12298 non-null  int64  
 7   tipo_renda             12298 non-null  object 
 8   educacao               12298 non-null  object 
 9   estado_civil           12298 non-null  object 
 10  tipo_residencia        12298 non-null  object 
 11  idade                  12298 non-null  int64  
 12  tempo_emprego          12298 non-null  float64
 13  qt_pessoas_residencia  12298 non-null  float64
 14  renda                  12298 non-null  float64
 15  log_ren

In [14]:
y, X = patsy.dmatrices('log_renda ~ C(sexo, Treatment("F")) + C(tipo_renda, Treatment("Assalariado")) + \
           C(educacao, Treatment("Secundário")) + C(estado_civil, Treatment("Casado")) + \
           posse_de_veiculo + posse_de_imovel + \
           qtd_filhos + idade + tempo_emprego + qt_pessoas_residencia',
   df)

sm.OLS(y, X).fit().summary()

| | | | |
|-|-|-|-|
|Dep. Variable:|log_renda|R-squared:|0.358|
|Model:|OLS|Adj. R-squared:|0.357|
|Method:|Least Squares|F-statistic:|402.0|
|Date:|Sat, 22 Feb 2025|Prob (F-statistic):|0.0|
|Time:|14:48:12|Log-Likelihood:|-13445.0|
|No. Observations:|12298|AIC:|26930.0|
|Df Residuals:|12280|BIC:|27060.0|
|Df Model:|17| | |
|Covariance Type:|nonrobust| | |

| |coef|std err|t|P>\|t\||[0.025|0.975]|
|-|-|-|-|-|-|-|
|Intercept|6.5278|0.219|29.837|0.000|6.099|6.957|
|C(sexo, Treatment("F"))[T.M]|0.7913|0.015|53.726|0.000|0.762|0.820|
|C(tipo_renda, Treatment("Assalariado"))[T.Bolsista]|0.2239|0.241|0.927|0.354|-0.249|0.697|
|C(tipo_renda, Treatment("Assalariado"))[T.Empresário]|0.1580|0.015|10.532|0.000|0.129|0.187|
|C(tipo_renda, Treatment("Assalariado"))[T.Pensionista]|-0.3120|0.242|-1.292|0.196|-0.785|0.161|
|C(tipo_renda, Treatment("Assalariado"))[T.Servidor público]|0.0606|0.022|2.719|0.007|0.017|0.104|
|C(educacao, Treatment("Secundário"))[T.Superior completo]|0.1075|0.014|7.737|0.000|0.080|0.135|
|C(educacao, Treatment("Secundário"))[T.Superior incompleto]|-0.0288|0.032|-0.894|0.372|-0.092|0.034|
|C(estado_civil, Treatment("Casado"))[T.Separado]|0.3268|0.111|2.933|0.003|0.108|0.545|

| | | | |
|-|-|-|-|
|Omnibus:|0.647|Durbin-Watson:|2.022|
|Prob(Omnibus):|0.724|Jarque-Bera (JB):|0.63|
|Skew:|0.017|Prob(JB):|0.73|
|Kurtosis:|3.01|Cond. No.|2120.0|


R² went up again. The same treatment applied earlier to educacao will now be applied to tipo_renda, to the level with the highest p-value (Bolsista). This procedure will be repeated until the model contains only terms with a p-value below 5%.

In [15]:
df = df.drop(df[df['tipo_renda'] == 'Bolsista'].index)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 12289 entries, 0 to 14999
Data columns (total 16 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             12289 non-null  int64  
 1   data_ref               12289 non-null  object 
 2   id_cliente             12289 non-null  int64  
 3   sexo                   12289 non-null  object 
 4   posse_de_veiculo       12289 non-null  bool   
 5   posse_de_imovel        12289 non-null  bool   
 6   qtd_filhos             12289 non-null  int64  
 7   tipo_renda             12289 non-null  object 
 8   educacao               12289 non-null  object 
 9   estado_civil           12289 non-null  object 
 10  tipo_residencia        12289 non-null  object 
 11  idade                  12289 non-null  int64  
 12  tempo_emprego          12289 non-null  float64
 13  qt_pessoas_residencia  12289 non-null  float64
 14  renda                  12289 non-null  float64
 15  log_ren

In [16]:
y, X = patsy.dmatrices('log_renda ~ C(sexo, Treatment("F")) + C(tipo_renda, Treatment("Assalariado")) + \
           C(educacao, Treatment("Secundário")) + C(estado_civil, Treatment("Casado")) + \
           posse_de_veiculo + posse_de_imovel + \
           qtd_filhos + idade + tempo_emprego + qt_pessoas_residencia',
   df)

sm.OLS(y, X).fit().summary()

| | | | |
|-|-|-|-|
|Dep. Variable:|log_renda|R-squared:|0.358|
|Model:|OLS|Adj. R-squared:|0.357|
|Method:|Least Squares|F-statistic:|426.9|
|Date:|Sat, 22 Feb 2025|Prob (F-statistic):|0.0|
|Time:|14:48:13|Log-Likelihood:|-13439.0|
|No. Observations:|12289|AIC:|26910.0|
|Df Residuals:|12272|BIC:|27040.0|
|Df Model:|16| | |
|Covariance Type:|nonrobust| | |

| |coef|std err|t|P>\|t\||[0.025|0.975]|
|-|-|-|-|-|-|-|
|Intercept|6.5278|0.219|29.829|0.000|6.099|6.957|
|C(sexo, Treatment("F"))[T.M]|0.7913|0.015|53.712|0.000|0.762|0.820|
|C(tipo_renda, Treatment("Assalariado"))[T.Empresário]|0.1580|0.015|10.529|0.000|0.129|0.187|
|C(tipo_renda, Treatment("Assalariado"))[T.Pensionista]|-0.3120|0.242|-1.291|0.197|-0.786|0.162|
|C(tipo_renda, Treatment("Assalariado"))[T.Servidor público]|0.0606|0.022|2.718|0.007|0.017|0.104|
|C(educacao, Treatment("Secundário"))[T.Superior completo]|0.1075|0.014|7.735|0.000|0.080|0.135|
|C(educacao, Treatment("Secundário"))[T.Superior incompleto]|-0.0288|0.032|-0.893|0.372|-0.092|0.034|
|C(estado_civil, Treatment("Casado"))[T.Separado]|0.3268|0.111|2.933|0.003|0.108|0.545|
|C(estado_civil, Treatment("Casado"))[T.Solteiro]|0.2695|0.109|2.471|0.013|0.056|0.483|

| | | | |
|-|-|-|-|
|Omnibus:|0.628|Durbin-Watson:|2.023|
|Prob(Omnibus):|0.731|Jarque-Bera (JB):|0.613|
|Skew:|0.017|Prob(JB):|0.736|
|Kurtosis:|3.009|Cond. No.|2120.0|


In [17]:
df = df.drop(df[df['educacao'] == 'Superior incompleto'].index)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 11731 entries, 0 to 14999
Data columns (total 16 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             11731 non-null  int64  
 1   data_ref               11731 non-null  object 
 2   id_cliente             11731 non-null  int64  
 3   sexo                   11731 non-null  object 
 4   posse_de_veiculo       11731 non-null  bool   
 5   posse_de_imovel        11731 non-null  bool   
 6   qtd_filhos             11731 non-null  int64  
 7   tipo_renda             11731 non-null  object 
 8   educacao               11731 non-null  object 
 9   estado_civil           11731 non-null  object 
 10  tipo_residencia        11731 non-null  object 
 11  idade                  11731 non-null  int64  
 12  tempo_emprego          11731 non-null  float64
 13  qt_pessoas_residencia  11731 non-null  float64
 14  renda                  11731 non-null  float64
 15  log_ren

In [18]:
y, X = patsy.dmatrices('log_renda ~ C(sexo, Treatment("F")) + C(tipo_renda, Treatment("Assalariado")) + \
           C(educacao, Treatment("Secundário")) + C(estado_civil, Treatment("Casado")) + \
           posse_de_veiculo + posse_de_imovel + \
           qtd_filhos + idade + tempo_emprego + qt_pessoas_residencia',
   df)

sm.OLS(y, X).fit().summary()

| | | | |
|-|-|-|-|
|Dep. Variable:|log_renda|R-squared:|0.36|
|Model:|OLS|Adj. R-squared:|0.36|
|Method:|Least Squares|F-statistic:|440.3|
|Date:|Sat, 22 Feb 2025|Prob (F-statistic):|0.0|
|Time:|14:48:15|Log-Likelihood:|-12836.0|
|No. Observations:|11731|AIC:|25700.0|
|Df Residuals:|11715|BIC:|25820.0|
|Df Model:|15| | |
|Covariance Type:|nonrobust| | |

| |coef|std err|t|P>\|t\||[0.025|0.975]|
|-|-|-|-|-|-|-|
|Intercept|6.5417|0.219|29.854|0.000|6.112|6.971|
|C(sexo, Treatment("F"))[T.M]|0.7983|0.015|52.790|0.000|0.769|0.828|
|C(tipo_renda, Treatment("Assalariado"))[T.Empresário]|0.1570|0.015|10.198|0.000|0.127|0.187|
|C(tipo_renda, Treatment("Assalariado"))[T.Pensionista]|-0.3135|0.242|-1.297|0.195|-0.787|0.160|
|C(tipo_renda, Treatment("Assalariado"))[T.Servidor público]|0.0578|0.023|2.553|0.011|0.013|0.102|
|C(educacao, Treatment("Secundário"))[T.Superior completo]|0.1077|0.014|7.740|0.000|0.080|0.135|
|C(estado_civil, Treatment("Casado"))[T.Separado]|0.3123|0.112|2.799|0.005|0.094|0.531|
|C(estado_civil, Treatment("Casado"))[T.Solteiro]|0.2669|0.109|2.444|0.015|0.053|0.481|
|C(estado_civil, Treatment("Casado"))[T.União]|-0.0389|0.026|-1.490|0.136|-0.090|0.012|

| | | | |
|-|-|-|-|
|Omnibus:|0.818|Durbin-Watson:|2.01|
|Prob(Omnibus):|0.664|Jarque-Bera (JB):|0.808|
|Skew:|0.02|Prob(JB):|0.668|
|Kurtosis:|3.006|Cond. No.|2090.0|


In [19]:
df = df.drop(df[df['tipo_renda'] == 'Pensionista'].index)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 11722 entries, 0 to 14999
Data columns (total 16 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             11722 non-null  int64  
 1   data_ref               11722 non-null  object 
 2   id_cliente             11722 non-null  int64  
 3   sexo                   11722 non-null  object 
 4   posse_de_veiculo       11722 non-null  bool   
 5   posse_de_imovel        11722 non-null  bool   
 6   qtd_filhos             11722 non-null  int64  
 7   tipo_renda             11722 non-null  object 
 8   educacao               11722 non-null  object 
 9   estado_civil           11722 non-null  object 
 10  tipo_residencia        11722 non-null  object 
 11  idade                  11722 non-null  int64  
 12  tempo_emprego          11722 non-null  float64
 13  qt_pessoas_residencia  11722 non-null  float64
 14  renda                  11722 non-null  float64
 15  log_ren

In [20]:
y, X = patsy.dmatrices('log_renda ~ C(sexo, Treatment("F")) + C(tipo_renda, Treatment("Assalariado")) + \
           C(educacao, Treatment("Secundário")) + C(estado_civil, Treatment("Casado")) + \
           posse_de_veiculo + posse_de_imovel + \
           qtd_filhos + idade + tempo_emprego + qt_pessoas_residencia',
   df)

sm.OLS(y, X).fit().summary()

| | | | |
|-|-|-|-|
|Dep. Variable:|log_renda|R-squared:|0.36|
|Model:|OLS|Adj. R-squared:|0.36|
|Method:|Least Squares|F-statistic:|471.2|
|Date:|Sat, 22 Feb 2025|Prob (F-statistic):|0.0|
|Time:|14:48:16|Log-Likelihood:|-12830.0|
|No. Observations:|11722|AIC:|25690.0|
|Df Residuals:|11707|BIC:|25800.0|
|Df Model:|14| | |
|Covariance Type:|nonrobust| | |

| |coef|std err|t|P>\|t\||[0.025|0.975]|
|-|-|-|-|-|-|-|
|Intercept|6.5415|0.219|29.845|0.000|6.112|6.971|
|C(sexo, Treatment("F"))[T.M]|0.7986|0.015|52.771|0.000|0.769|0.828|
|C(tipo_renda, Treatment("Assalariado"))[T.Empresário]|0.1571|0.015|10.197|0.000|0.127|0.187|
|C(tipo_renda, Treatment("Assalariado"))[T.Servidor público]|0.0579|0.023|2.553|0.011|0.013|0.102|
|C(educacao, Treatment("Secundário"))[T.Superior completo]|0.1075|0.014|7.726|0.000|0.080|0.135|
|C(estado_civil, Treatment("Casado"))[T.Separado]|0.3123|0.112|2.798|0.005|0.094|0.531|
|C(estado_civil, Treatment("Casado"))[T.Solteiro]|0.2677|0.109|2.451|0.014|0.054|0.482|
|C(estado_civil, Treatment("Casado"))[T.União]|-0.0387|0.026|-1.483|0.138|-0.090|0.012|
|C(estado_civil, Treatment("Casado"))[T.Viúvo]|0.3649|0.116|3.143|0.002|0.137|0.593|

| | | | |
|-|-|-|-|
|Omnibus:|0.798|Durbin-Watson:|2.01|
|Prob(Omnibus):|0.671|Jarque-Bera (JB):|0.79|
|Skew:|0.02|Prob(JB):|0.674|
|Kurtosis:|3.004|Cond. No.|2090.0|


In [21]:
df = df.drop(df[df['estado_civil'] == 'União'].index)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10860 entries, 0 to 14998
Data columns (total 16 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             10860 non-null  int64  
 1   data_ref               10860 non-null  object 
 2   id_cliente             10860 non-null  int64  
 3   sexo                   10860 non-null  object 
 4   posse_de_veiculo       10860 non-null  bool   
 5   posse_de_imovel        10860 non-null  bool   
 6   qtd_filhos             10860 non-null  int64  
 7   tipo_renda             10860 non-null  object 
 8   educacao               10860 non-null  object 
 9   estado_civil           10860 non-null  object 
 10  tipo_residencia        10860 non-null  object 
 11  idade                  10860 non-null  int64  
 12  tempo_emprego          10860 non-null  float64
 13  qt_pessoas_residencia  10860 non-null  float64
 14  renda                  10860 non-null  float64
 15  log_ren

In [22]:
y, X = patsy.dmatrices('log_renda ~ C(sexo, Treatment("F")) + C(tipo_renda, Treatment("Assalariado")) + \
           C(educacao, Treatment("Secundário")) + C(estado_civil, Treatment("Casado")) + \
           posse_de_veiculo + posse_de_imovel + \
           qtd_filhos + idade + tempo_emprego + qt_pessoas_residencia',
   df)

sm.OLS(y, X).fit().summary()

| | | | |
|-|-|-|-|
|Dep. Variable:|log_renda|R-squared:|0.363|
|Model:|OLS|Adj. R-squared:|0.362|
|Method:|Least Squares|F-statistic:|475.4|
|Date:|Sat, 22 Feb 2025|Prob (F-statistic):|0.0|
|Time:|14:48:18|Log-Likelihood:|-11913.0|
|No. Observations:|10860|AIC:|23850.0|
|Df Residuals:|10846|BIC:|23960.0|
|Df Model:|13| | |
|Covariance Type:|nonrobust| | |

| |coef|std err|t|P>\|t\||[0.025|0.975]|
|-|-|-|-|-|-|-|
|Intercept|6.4909|0.230|28.209|0.000|6.040|6.942|
|C(sexo, Treatment("F"))[T.M]|0.8075|0.016|51.335|0.000|0.777|0.838|
|C(tipo_renda, Treatment("Assalariado"))[T.Empresário]|0.1624|0.016|10.077|0.000|0.131|0.194|
|C(tipo_renda, Treatment("Assalariado"))[T.Servidor público]|0.0562|0.023|2.397|0.017|0.010|0.102|
|C(educacao, Treatment("Secundário"))[T.Superior completo]|0.1053|0.015|7.252|0.000|0.077|0.134|
|C(estado_civil, Treatment("Casado"))[T.Separado]|0.3360|0.117|2.874|0.004|0.107|0.565|
|C(estado_civil, Treatment("Casado"))[T.Solteiro]|0.2903|0.115|2.534|0.011|0.066|0.515|
|C(estado_civil, Treatment("Casado"))[T.Viúvo]|0.3902|0.121|3.219|0.001|0.153|0.628|
|posse_de_veiculo[T.True]|0.0437|0.015|2.878|0.004|0.014|0.073|

| | | | |
|-|-|-|-|
|Omnibus:|0.54|Durbin-Watson:|2.021|
|Prob(Omnibus):|0.763|Jarque-Bera (JB):|0.532|
|Skew:|0.017|Prob(JB):|0.767|
|Kurtosis:|3.004|Cond. No.|2110.0|


# 3

The dataframe initially had 15,000 rows; it was reduced to 12,427 before the first model, and the last model used 10,860. From the first model to the last, the dataframe shrank by more than 12%. The last model looks better, with a higher R² (36.3% versus 35.7%), and its adjusted R² was very close (36.2%), showing that the penalty for the number of variables was applied but was minimal.
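
The size of that penalty can be checked directly from the summaries with the usual adjusted R² formula, 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p is "Df Model". The sketch below plugs in the rounded values reported for the first and last models, so it reproduces the reported figures only up to rounding; `adjusted_r2` is a helper name introduced for illustration.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared from R-squared, sample size and number of predictors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Values taken from the first and last summaries (R-squared is already rounded).
first = adjusted_r2(0.357, 12427, 24)
last = adjusted_r2(0.363, 10860, 13)

print(round(first, 3), round(last, 3))  # 0.356 0.362
```

With more than ten thousand rows and at most 24 predictors, the factor (n - 1)/(n - p - 1) stays very close to 1, which is why R² and adjusted R² differ only in the third decimal place in both models.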