# Regression With Grouped Data

- heteroskedasticity
  - occurs when the variance of the error term is not constant across all values of the features
  - It arises most often when the data are grouped.
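
To see why grouping produces heteroskedasticity, here is a minimal simulation sketch. It assumes each individual outcome is an independent draw with the same standard deviation `sigma`; the function name `variance_of_group_mean` and the group sizes are made up for illustration and do not come from the wage data. Under that assumption, the mean of a group of size `n` has variance `sigma**2 / n`, so large groups give much less noisy points than small ones.

```python
import random
import statistics


def variance_of_group_mean(group_size, sigma=1.0, n_reps=2000, seed=0):
    """Empirical variance of the mean of `group_size` i.i.d. N(0, sigma^2) draws."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(0.0, sigma) for _ in range(group_size))
        for _ in range(n_reps)
    ]
    return statistics.pvariance(means)


# Hypothetical group sizes: the simulated variance should track sigma^2 / n.
for n in (10, 40, 160):
    print(n, round(variance_of_group_mean(n), 4), "theory:", 1.0 / n)
```

Because the noise level of a group mean depends on the group size, a regression on group means treats unequally reliable points as if they were equally reliable unless the sizes are taken into account.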

In [4]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
from scipy import stats
import statsmodels.formula.api as sm

In [3]:
wage = pd.read_csv("../data/wage.csv")[["wage", "lhwage", "educ", "IQ"]]

wage.head()

| | wage | lhwage | educ | IQ |
|---|---|---|---|---|
| 0 | 769 | 2.956212 | 12 | 93 |
| 1 | 808 | 2.782539 | 18 | 119 |
| 2 | 825 | 3.026504 | 14 | 108 |
| 3 | 650 | 2.788093 | 12 | 96 |
| 4 | 562 | 2.642622 | 11 | 74 |


In [7]:
model_1 = sm.ols('lhwage ~ educ', data=wage).fit()
model_1.summary().tables[1]

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 2.2954 | 0.089 | 25.754 | 0.000 | 2.121 | 2.470 |
| educ | 0.0529 | 0.007 | 8.107 | 0.000 | 0.040 | 0.066 |


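- The regression above uses the raw, ungrouped data.
- How can we fit the same regression when the data have been grouped?
- We can use weighted least squares (WLS), weighting each group by its number of observations.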
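
Why group counts are the natural weights can be seen from a short decomposition. The notation is introduced here for illustration and is not part of the original notes: $y_{ig}$ is the outcome of individual $i$ in group $g$, $x_g$ is the feature value shared by the group (here `educ`), $n_g$ is the group size, and $\bar{y}_g$ is the group mean.

$$
\sum_{g}\sum_{i=1}^{n_g}\left(y_{ig}-\alpha-\beta x_g\right)^2
= \sum_{g}\sum_{i=1}^{n_g}\left(y_{ig}-\bar{y}_g\right)^2
+ \sum_{g} n_g\left(\bar{y}_g-\alpha-\beta x_g\right)^2
$$

The cross term vanishes because $\sum_{i}(y_{ig}-\bar{y}_g)=0$ within each group. The first term on the right does not depend on $\alpha$ or $\beta$, so minimizing the raw sum of squares gives the same $\alpha$ and $\beta$ as minimizing the count-weighted sum of squares over the group means.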

In [8]:
group_wage = (wage
              .assign(count=1)
              .groupby("educ")
              .agg({"lhwage":"mean", "count":"count"})
              .reset_index())

group_wage

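| | educ | lhwage | count |
|---|---|---|---|
| 0 | 9 | 2.856475 | 10 |
| 1 | 10 | 2.786911 | 35 |
| 2 | 11 | 2.855997 | 43 |
| 3 | 12 | 2.922168 | 393 |
| 4 | 13 | 3.021182 | 85 |
| 5 | 14 | 3.042352 | 77 |
| 6 | 15 | 3.090766 | 45 |
| 7 | 16 | 3.176184 | 150 |
| 8 | 17 | 3.246566 | 40 |
| 9 | 18 | 3.144257 | 57 |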


In [9]:
model_2 = sm.wls('lhwage ~ educ', data=group_wage, weights=group_wage["count"]).fit()
model_2.summary().tables[1]

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 2.2954 | 0.078 | 29.327 | 0.000 | 2.115 | 2.476 |
| educ | 0.0529 | 0.006 | 9.231 | 0.000 | 0.040 | 0.066 |


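- The coefficient estimates match those of `model_1` to the displayed precision (intercept 2.2954, `educ` 0.0529).
- The standard errors are also close (0.006 vs. 0.007 for `educ`).
- If `weights` is not specified, the results differ somewhat, as `model_3` below shows.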

In [10]:
model_3 = sm.ols('lhwage ~ educ', data=group_wage).fit()
model_3.summary().tables[1]

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 2.3650 | 0.082 | 28.988 | 0.000 | 2.177 | 2.553 |
| educ | 0.0481 | 0.006 | 8.136 | 0.000 | 0.034 | 0.062 |
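
The comparison between the three models can be reproduced from scratch on synthetic data. The sketch below is an illustration, not the notebook's method: it uses only the standard library rather than `statsmodels`, the helper `weighted_line_fit` is a hypothetical name, and the data-generating values (intercept 2.3, slope 0.05, noise 0.4) are made up. It fits a line to the raw data, to the group means weighted by count, and to the group means without weights.

```python
import random
from collections import defaultdict


def weighted_line_fit(x, y, w):
    """Closed-form weighted least squares for y = a + b*x; returns (a, b)."""
    total = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Synthetic stand-in for the wage data (assumed values, unbalanced groups).
rng = random.Random(42)
educ = [rng.choice([9, 10, 12, 12, 12, 14, 16, 16, 18]) for _ in range(500)]
lhwage = [2.3 + 0.05 * e + rng.gauss(0, 0.4) for e in educ]

# 1) OLS on the raw rows: every observation has weight 1.
raw_fit = weighted_line_fit(educ, lhwage, [1.0] * len(educ))

# 2) Collapse to one row per education level: mean outcome and group size.
groups = defaultdict(list)
for e, y in zip(educ, lhwage):
    groups[e].append(y)
levels = sorted(groups)
means = [sum(groups[e]) / len(groups[e]) for e in levels]
counts = [len(groups[e]) for e in levels]

# 3) WLS on group means with counts as weights, and the unweighted fit.
grouped_wls = weighted_line_fit(levels, means, counts)
grouped_ols = weighted_line_fit(levels, means, [1.0] * len(levels))

print("raw OLS        :", raw_fit)
print("grouped WLS    :", grouped_wls)
print("grouped, no w. :", grouped_ols)
```

The raw fit and the count-weighted grouped fit agree up to floating-point error, which mirrors `model_1` and `model_2`. The unweighted grouped fit gives every education level equal say regardless of its size, which is why it drifts away, as `model_3` does.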


# Regression for Dummies

https://matheusfacure.github.io/python-causality-handbook/06-Grouped-and-Dummy-Regression.html