## Least Squares (LS)
- Authors: Prof. 한치록 (Department of Economics, Korea University) and 이창훈 (Manager, Data Science Team)

For convenience, the `bok_da` library provides the following modules for least squares (LS) estimation.

* `regress`: OLS and WLS
* `prais`: Prais-Winsten estimation (including Cochrane-Orcutt)
* `het_test`, `ac_test`, `reset_test`: heteroskedasticity test, autocorrelation test, and RESET test

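This document compares the results with those from Stata. For details on running Stata from Python, see the `7-01` and `7-02` manuals. Throughout this document we use the convenience modules implemented in the `bok_da` library.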

## regress

The `regress` function takes a small set of arguments and runs OLS or WLS. Because it only calls the relevant `statsmodels` functions, it returns a `RegressionResults` object just as those functions do, so the results can be displayed afterwards with the `summary` method.

### OLS

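We start with an OLS regression. Below, `I(smoke-aged)` and `I(smoke+aged)` are deliberately added to the model to induce collinearity, so that we can check whether the `regress` function in `bok_da` detects and handles it properly. `I()` is the identity function in the formula syntax: it tells the formula parser to evaluate the sum or difference of the two variables arithmetically and use the result as a new variable. Both new variables are exact linear combinations of regressors that are already in the model, so they are perfectly collinear with them.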
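
Why the two `I(...)` terms are redundant can be seen from the rank of the design matrix. The sketch below uses synthetic data and plain NumPy; the variable values and the greedy column scan are assumptions made for illustration, not a description of how `regress` detects collinearity internally.

```python
import numpy as np

# Synthetic stand-ins for the regressors (values are made up).
rng = np.random.default_rng(0)
n = 50
smoke = rng.normal(25, 3, n)
drink = rng.normal(55, 5, n)
aged = rng.normal(15, 4, n)

names = ["Intercept", "smoke", "drink", "aged", "I(smoke - aged)", "I(smoke + aged)"]
X = np.column_stack([np.ones(n), smoke, drink, aged, smoke - aged, smoke + aged])

# Six columns, but only four are linearly independent.
print(X.shape[1], np.linalg.matrix_rank(X))

# Greedy scan: keep a column only if it raises the rank of the kept set.
kept, dropped = [], []
for j, name in enumerate(names):
    if np.linalg.matrix_rank(X[:, kept + [j]]) == len(kept) + 1:
        kept.append(j)
    else:
        dropped.append(name)
print(dropped)
```

Scanning left to right keeps the original regressors and drops the later, redundant terms, which is the same choice reported in the notes printed by `regress` and by Stata.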


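#### **(Note) Because of licensing issues, the Stata features in this manual are not yet available in the BIDAS environment. The Stata-related code in this manual is therefore commented out. When working in a local environment (internal network or internet network), uncomment it to use it.**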

In [3]:
import bok_da as bd
import pandas as pd
from bok_da.linear.lm import regress

df = pd.read_csv('../data/Death.csv')
fm = 'deathrate~smoke+drink+aged+I(smoke-aged)+I(smoke+aged)+C(year)'
ols = regress(fm, data=df, vce='cl', cluster='region')
print(ols.summary(slim=True))

note: I(smoke - aged) omitted because of collinearity.
note: I(smoke + aged) omitted because of collinearity.
                            OLS Regression Results                            
Dep. Variable:              deathrate   R-squared:                      0.9205
Model:                            OLS   Adj. R-squared:                 0.9189
No. Observations:                 258   F-statistic:                     585.2
Covariance Type:              cluster   Prob (F-statistic):             0.0000
                      coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept          0.27932    0.89277       0.31      0.754    -1.47048     2.02913
C(year)[T.2009]   -0.36224    0.07250      -5.00      0.000    -0.50433    -0.22015
C(year)[T.2010]   -0.30478    0.07896      -3.86      0.000    -0.45955    -0.15001
smoke              0.04025    0.02118       1.90      0

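The notes at the top of the output show that the collinearity has been handled: the two redundant terms were dropped from the model.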

- The result above can be reproduced in Stata with `pystata`, called through the `bok_da` library, as follows. First install the required modules.

```bash
pip install pystata stata_setup
```

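After installing the packages, proceed as follows. (Because of licensing, the Stata-related code in this manual can only be run in a local PC environment, on either the internal network or the internet network. The BIDAS environment does not yet support Stata.)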

In [2]:
# import bok_da as bd
# from bok_da.stata import Stata

# stata = Stata('/Applications/Stata', 'mp')
# stata.get_ready()
# df = bd.read.csv('Death.csv')
# df['sma'] = df.smoke - df.aged
# df['spa'] = df.smoke + df.aged
# stata.use(df, force=True)
# stata.run('reg deathrate smoke drink aged sma spa i.year, vce(cl region)')

. reg deathrate smoke drink aged sma spa i.year, vce(cl region)
note: sma omitted because of collinearity.
note: spa omitted because of collinearity.

Linear regression                               Number of obs     =        258
                                                F(5, 85)          =     585.21
                                                Prob > F          =     0.0000
                                                R-squared         =     0.9205
                                                Root MSE          =     .61453

                                (Std. err. adjusted for 86 clusters in region)
------------------------------------------------------------------------------
             |               Robust
   deathrate | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       smoke |   .0402485    .021176     1.90   0.061    -.0018551    .0823521
       drink |   .0050

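What `regress` returns above is the result of fitting a `statsmodels` OLS model (with cluster-robust standard errors). The Stata output confirms the collinearity check: both programs drop the same two redundant terms, and the coefficient and standard error for `smoke` agree (0.04025 and 0.02118 versus .0402485 and .021176). Note that the Python table reports z statistics, whereas Stata reports t statistics.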
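
The cluster-robust standard errors used above follow the usual sandwich formula. The following NumPy sketch spells it out on synthetic data; the function name `cluster_se` is hypothetical, and the small-sample factor G/(G-1) x (N-1)/(N-K) is the one commonly documented for Stata's `vce(cluster)`, assumed here rather than taken from the `bok_da` source.

```python
import numpy as np

def cluster_se(X, resid, groups):
    """Cluster-robust standard errors with a Stata-style small-sample factor."""
    n, k = X.shape
    labels = np.unique(groups)
    g = len(labels)
    bread = np.linalg.inv(X.T @ X)
    meat = np.zeros((k, k))
    for lab in labels:
        idx = groups == lab
        score = X[idx].T @ resid[idx]   # sum of x_i * e_i within one cluster
        meat += np.outer(score, score)
    factor = g / (g - 1) * (n - 1) / (n - k)
    V = factor * (bread @ meat @ bread)
    return np.sqrt(np.diag(V))

# Synthetic data with a common shock inside each of 12 clusters.
rng = np.random.default_rng(1)
n, g = 120, 12
groups = np.repeat(np.arange(g), n // g)
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.5]) + rng.normal(size=g)[groups] + rng.normal(size=n)

beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
se_cluster = cluster_se(X, resid, groups)
se_singletons = cluster_se(X, resid, np.arange(n))  # every observation its own cluster
print(se_cluster, se_singletons)
```

When every observation forms its own cluster, the factor reduces to N/(N-K) and the formula collapses to the HC1 heteroskedasticity-robust estimator, which is a convenient sanity check.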

### t test

In [6]:
# python
ols.t_test('smoke=drink')

<class 'statsmodels.stats.contrast.ContrastResults'>
                             Test for Constraints                             
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
c0            0.03519    0.03143       1.12      0.263    -0.02642     0.09680
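
The row `c0` tests the single restriction `smoke - drink = 0`. Its standard error is the square root of Var(b_smoke) + Var(b_drink) - 2 Cov(b_smoke, b_drink), taken from the cluster-robust covariance matrix. As a small check, the remaining columns can be recomputed from the two printed numbers, using only the standard library and assuming the normal reference distribution indicated by the `z` column.

```python
import math

# Estimated difference and its standard error, copied from the table above.
coef, se = 0.03519, 0.03143

z = coef / se
p = math.erfc(abs(z) / math.sqrt(2))   # two-sided p-value under N(0, 1)
crit = 1.959964                        # 97.5% quantile of N(0, 1)
lower, upper = coef - crit * se, coef + crit * se

print(round(z, 2), round(p, 3), round(lower, 4), round(upper, 4))
```

Small differences in the last digit relative to the table are expected, because the inputs here are already rounded.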

### Wald test

In [7]:
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
print(ols.wald_test('smoke=0,drink=0'))

<Wald test (chi2): statistic=[[6.0494385]], p-value=0.04857145581121495, df_denom=2>
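
A joint hypothesis such as `smoke=0,drink=0` is written as R b = q, and the Wald statistic is the quadratic form (R b - q)' (R V R')^{-1} (R b - q), compared with a chi-square distribution whose degrees of freedom equal the number of restrictions. The sketch below uses hypothetical coefficient and covariance values (they are not the estimates above), and `wald_chi2` is an illustrative helper, not a `bok_da` function. The last lines recompute the reported p-value from the reported statistic.

```python
import math
import numpy as np

def wald_chi2(beta, V, R, q=None):
    """Wald statistic for H0: R @ beta = q, plus its degrees of freedom."""
    R = np.atleast_2d(R)
    q = np.zeros(R.shape[0]) if q is None else np.asarray(q, dtype=float)
    diff = R @ beta - q
    stat = float(diff @ np.linalg.solve(R @ V @ R.T, diff))
    return stat, R.shape[0]

# Hypothetical estimates and covariance matrix, for illustration only.
beta = np.array([0.3, 0.04, 0.005])
V = np.diag([0.8, 0.0004, 0.0001])
R = np.array([[0.0, 1.0, 0.0],    # second coefficient = 0
              [0.0, 0.0, 1.0]])   # third coefficient = 0
stat, df = wald_chi2(beta, V, R)
print(stat, df)

# With 2 degrees of freedom the chi-square survival function is exp(-x / 2),
# so the p-value printed above can be recovered from the printed statistic.
p_reported = math.exp(-6.0494385 / 2)
print(p_reported)
```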


### WLS

The following shows WLS estimation. Simply pass the `weights=...` argument.

In [9]:
df = pd.read_csv('../data/Death.csv')
fm = 'deathrate~smoke+drink+aged+I(smoke-aged)+I(smoke+aged)+C(year)'
wls = regress(fm, data=df, weights='regpop', vce='cl', cluster='region')
print(wls.summary(slim=True))

note: I(smoke - aged) omitted because of collinearity.
note: I(smoke + aged) omitted because of collinearity.
                            WLS Regression Results                            
Dep. Variable:              deathrate   R-squared:                      0.9407
Model:                            WLS   Adj. R-squared:                 0.9395
No. Observations:                 258   F-statistic:                     806.7
Covariance Type:              cluster   Prob (F-statistic):             0.0000
                      coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept         -0.37040    1.33423      -0.28      0.781    -2.98545     2.24465
C(year)[T.2009]   -0.29085    0.07084      -4.11      0.000    -0.42970    -0.15200
C(year)[T.2010]   -0.28009    0.08379      -3.34      0.001    -0.44431    -0.11587
smoke              0.03566    0.02315       1.54      0

Let us compare this with the Stata result.

In [6]:
# import bok_da as bd
# from bok_da.stata import Stata

# stata = Stata('/Applications/Stata', 'mp')
# stata.get_ready()
# df = bd.read.csv('Death.csv')
# df['sma'] = df.smoke - df.aged
# df['spa'] = df.smoke + df.aged
# stata.use(df, force=True)
# stata.run('reg deathrate smoke drink aged sma spa i.year [aw=regpop], vce(cl region)')

. reg deathrate smoke drink aged sma spa i.year [aw=regpop], vce(cl region)
(sum of wgt is 14,356,343)
note: sma omitted because of collinearity.
note: spa omitted because of collinearity.

Linear regression                               Number of obs     =        258
                                                F(5, 85)          =     806.71
                                                Prob > F          =     0.0000
                                                R-squared         =     0.9407
                                                Root MSE          =     .59603

                                (Std. err. adjusted for 86 clusters in region)
------------------------------------------------------------------------------
             |               Robust
   deathrate | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       smoke |   .0356597   .0231473     1.54   0.127    -.010
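
The WLS point estimates solve the weighted normal equations (X'WX) b = X'Wy, which is the same as running OLS after multiplying every row by the square root of its weight. The sketch below checks this equivalence with NumPy on synthetic data; the weights are made up and merely play the role of `regpop`.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)
w = rng.uniform(0.5, 5.0, size=n)                  # hypothetical analytic weights
y = 1.0 + 2.0 * x + rng.normal(size=n) / np.sqrt(w)
X = np.column_stack([np.ones(n), x])

# Route 1: weighted normal equations (X'WX) b = X'Wy.
b_normal = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

# Route 2: OLS after scaling each row by sqrt(w).
sw = np.sqrt(w)
b_scaled = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]

# Multiplying all weights by a constant leaves the coefficients unchanged.
b_rescaled = np.linalg.solve(X.T @ (10 * w[:, None] * X), X.T @ (10 * w * y))

print(b_normal, b_scaled, b_rescaled)
```

The invariance to rescaling is why only the relative size of the weights matters for the coefficients.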

## Prais-Winsten

In [11]:
from bok_da.linear.lm import prais

qsales = pd.read_stata('../data/qsales.dta')
pw = prais('csales~isales', qsales, vce='HC1')
print(pw.summary(slim=True))

Iteration 0:  rho = 0.0000
Iteration 1:  rho = 0.6312
Iteration 2:  rho = 0.6500
Iteration 3:  rho = 0.6528
Iteration 4:  rho = 0.6532
Iteration 5:  rho = 0.6533
Iteration 6:  rho = 0.6533
Iteration 7:  rho = 0.6533
Iteration 8:  rho = 0.6533
       Prais-Winsten AR(1) regression with iterative estimates Results        
Dep. Variable:                 csales   R-squared:                      0.9987
Model:                          Prais   Adj. R-squared:                 0.9986
No. Observations:                  20   F-statistic:                     6074.
Covariance Type:                  HC1   Prob (F-statistic):             0.0000
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -1.26782    0.34283      -3.70      0.000    -1.93975    -0.59589
isales        0.17499    0.00225      77.93      0.000     0.17059     0.17939
rho                                    0.65329

Let us compare this with Stata.

In [8]:
# import bok_da as bd
# from bok_da.stata import Stata

# stata = Stata('/Applications/Stata', 'mp')
# stata.get_ready()
# stata.run('''use qsales, clear
# prais csales isales, vce(r)''')


. use qsales, clear

. prais csales isales, vce(r)

Iteration 0:  rho = 0.0000
Iteration 1:  rho = 0.6312
Iteration 2:  rho = 0.6500
Iteration 3:  rho = 0.6528
Iteration 4:  rho = 0.6532
Iteration 5:  rho = 0.6533
Iteration 6:  rho = 0.6533
Iteration 7:  rho = 0.6533
Iteration 8:  rho = 0.6533

Prais–Winsten AR(1) regression with iterated estimates

Linear regression                               Number of obs     =         20
                                                F(1, 18)          =    6073.66
                                                Prob > F          =     0.0000
                                                R-squared         =     0.9987
                                                Root MSE          =     .06627

------------------------------------------------------------------------------
             |             Semirobust
      csales | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+-----------------------------------------
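
The iteration log above starts from rho = 0, which is plain OLS, and then alternates between estimating the coefficients on transformed data and re-estimating rho from the residuals. The following NumPy sketch illustrates that loop on simulated data. It assumes rho is updated by regressing the residual on its own lag, and the function name `prais_winsten` and the stopping rule are hypothetical; none of this is taken from the `bok_da` implementation.

```python
import numpy as np

def prais_winsten(y, X, tol=1e-6, max_iter=100):
    """Iterated Prais-Winsten estimates (beta, rho) for AR(1) errors."""
    n = len(y)
    rho = 0.0                                   # iteration 0 is plain OLS
    for _ in range(max_iter):
        ys = np.empty(n)
        Xs = np.empty_like(X, dtype=float)
        # First row is scaled by sqrt(1 - rho^2); the rest are quasi-differenced.
        scale = np.sqrt(1.0 - rho ** 2)
        ys[0], Xs[0] = scale * y[0], scale * X[0]
        ys[1:] = y[1:] - rho * y[:-1]
        Xs[1:] = X[1:] - rho * X[:-1]
        beta = np.linalg.lstsq(Xs, ys, rcond=None)[0]
        # Update rho from the residuals of the untransformed model.
        e = y - X @ beta
        rho_new = (e[1:] @ e[:-1]) / (e[:-1] @ e[:-1])
        converged = abs(rho_new - rho) < tol
        rho = rho_new
        if converged:
            break
    return beta, rho

# Simulated regression with AR(1) errors (true slope 2, true rho 0.6).
rng = np.random.default_rng(3)
n = 500
x = rng.normal(size=n)
eps = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.6 * u[t - 1] + eps[t]
y = 1.0 + 2.0 * x + u
X = np.column_stack([np.ones(n), x])

beta, rho = prais_winsten(y, X)
print(beta, rho)
```

Dropping the first transformed row instead of rescaling it gives the Cochrane-Orcutt variant.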

## Tests

### Heteroskedasticity tests

#### Breusch-Pagan test

In [13]:
from bok_da.linear.test import het_test

df = pd.read_csv('../data/Death.csv')
fm = 'deathrate~smoke+drink+aged'
ols = regress(fm, data=df[df.year==2010])
bp = het_test(ols, verbose=True)

Breusch-Pagan test for heteroskedasticity
Variables: All independent variables

H0: Constant variance

    chi2(3) =   6.28    F(3, 82) =   2.15
Prob > chi2 = 0.0989    Prob > F = 0.1000


The same result can be obtained in Stata as follows.

In [10]:
# import bok_da as bd
# from bok_da.stata import Stata

# stata = Stata('/Applications/Stata', 'mp')
# stata.get_ready()
# df = bd.read.csv('Death.csv')
# stata.use(df, force=True)
# stata.run(
#     "reg deathrate smoke drink aged if year==2010\n"
#     "estat hett, rhs iid\n"
#     "estat hett, rhs fs"
# )


. reg deathrate smoke drink aged if year==2010

      Source |       SS           df       MS      Number of obs   =        86
-------------+----------------------------------   F(3, 82)        =    320.90
       Model |  385.368639         3  128.456213   Prob > F        =    0.0000
    Residual |  32.8242936        82  .400296264   R-squared       =    0.9215
-------------+----------------------------------   Adj R-squared   =    0.9186
       Total |  418.192933        85  4.91991686   Root MSE        =    .63269

------------------------------------------------------------------------------
   deathrate | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       smoke |    .061505   .0370383     1.66   0.101     -.012176     .135186
       drink |   .0158218   .0197301     0.80   0.425    -.0234276    .0550712
        aged |   .4079032   .0162442    25.11   0.000     .3755883    .4402181
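
The two statistics printed by `het_test` are linked through the R-squared of an auxiliary regression of the squared residuals on the regressors: the chi-square version is n R^2 and the F version is (R^2 / q) / ((1 - R^2) / (n - k)). The sketch below implements both with NumPy on synthetic data, assuming this studentized (n R^2) form of the test; the function names are hypothetical. The last line plugs in the printed values (chi2 = 6.28, n = 86, q = 3, residual df = 82) as a consistency check.

```python
import numpy as np

def r_squared(target, A):
    """Centered R^2 from regressing target on A (A includes a constant)."""
    fit = A @ np.linalg.lstsq(A, target, rcond=None)[0]
    return 1.0 - ((target - fit) ** 2).sum() / ((target - target.mean()) ** 2).sum()

def breusch_pagan(y, X):
    """LM (n * R^2) and F versions of the Breusch-Pagan test."""
    n, k = X.shape
    e2 = (y - X @ np.linalg.lstsq(X, y, rcond=None)[0]) ** 2
    r2 = r_squared(e2, X)
    q = k - 1                                   # slope terms in the auxiliary regression
    return n * r2, (r2 / q) / ((1.0 - r2) / (n - k))

def f_from_lm(lm, n, q, df_resid):
    """Convert an n * R^2 statistic into the corresponding F statistic."""
    r2 = lm / n
    return (r2 / q) / ((1.0 - r2) / df_resid)

# Synthetic data whose error variance depends on the first regressor.
rng = np.random.default_rng(4)
n = 86
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([1.0, 0.5, -0.3, 0.2]) + rng.normal(size=n) * np.exp(0.3 * X[:, 1])

lm, f = breusch_pagan(y, X)
print(round(lm, 2), round(f, 2))
print(round(f_from_lm(6.28, 86, 3, 82), 2))     # reproduces the printed F of 2.15
```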
  

#### White test

In [14]:
df = pd.read_csv('../data/Death.csv')
fm = 'deathrate~smoke+drink+aged'
ols = regress(fm, data=df[df.year==2010])
w = het_test(ols, method='w', verbose=True)

White test for heteroskedasticity
Variables: All independent variables and quadratic

H0: Constant variance

    chi2(9) =  15.75    F(9, 82) =   1.89
Prob > chi2 = 0.0723    Prob > F = 0.0656


In [12]:
# import bok_da as bd
# from bok_da.stata import Stata

# stata = Stata('/Applications/Stata', 'mp')
# stata.get_ready()
# df = bd.read.csv('Death.csv')
# stata.use(df, force=True)
# stata.run(
#     "reg deathrate smoke drink aged if year==2010\n"
#     "estat imtest, white"
# )


. reg deathrate smoke drink aged if year==2010

      Source |       SS           df       MS      Number of obs   =        86
-------------+----------------------------------   F(3, 82)        =    320.90
       Model |  385.368639         3  128.456213   Prob > F        =    0.0000
    Residual |  32.8242936        82  .400296264   R-squared       =    0.9215
-------------+----------------------------------   Adj R-squared   =    0.9186
       Total |  418.192933        85  4.91991686   Root MSE        =    .63269

------------------------------------------------------------------------------
   deathrate | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       smoke |    .061505   .0370383     1.66   0.101     -.012176     .135186
       drink |   .0158218   .0197301     0.80   0.425    -.0234276    .0550712
        aged |   .4079032   .0162442    25.11   0.000     .3755883    .4402181
  

In [15]:
df = pd.read_csv('../data/Death.csv')
fm = 'deathrate~smoke+drink+aged'
ols = regress(fm, data=df[df.year==2010])
ws = het_test(ols, method='ws', verbose=True)

White test for heteroskedasticity
Variables: Fitted values and square

H0: Constant variance

    chi2(2) =   0.77    F(2, 82) =   0.37
Prob > chi2 = 0.6808    Prob > F = 0.6888
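
The two White variants differ only in the regressors of the auxiliary regression for the squared residuals. With three explanatory variables, the full version uses their levels, squares and cross-products (9 terms, hence `chi2(9)`), while the `'ws'` version uses the fitted values and their squares (2 terms, hence `chi2(2)`). The NumPy sketch below builds both sets on synthetic data; `white_terms` and `lm_stat` are illustrative helpers, not part of `bok_da`.

```python
from itertools import combinations_with_replacement

import numpy as np

def white_terms(Z):
    """Levels, squares and cross-products of the columns of Z (no constant)."""
    k = Z.shape[1]
    cols = [Z[:, j] for j in range(k)]
    cols += [Z[:, i] * Z[:, j] for i, j in combinations_with_replacement(range(k), 2)]
    return np.column_stack(cols)

def lm_stat(e2, A):
    """n * R^2 from regressing e2 on a constant and A, with its degrees of freedom."""
    n = len(e2)
    A1 = np.column_stack([np.ones(n), A])
    fit = A1 @ np.linalg.lstsq(A1, e2, rcond=None)[0]
    r2 = 1.0 - ((e2 - fit) ** 2).sum() / ((e2 - e2.mean()) ** 2).sum()
    return n * r2, A.shape[1]

# Synthetic data with three regressors.
rng = np.random.default_rng(5)
n = 86
Z = rng.normal(size=(n, 3))
X = np.column_stack([np.ones(n), Z])
y = X @ np.array([1.0, 0.5, -0.3, 0.2]) + rng.normal(size=n)
yhat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
e2 = (y - yhat) ** 2

lm_full, df_full = lm_stat(e2, white_terms(Z))
lm_special, df_special = lm_stat(e2, np.column_stack([yhat, yhat ** 2]))
print(df_full, df_special)
```

The special form saves degrees of freedom, which matters in small samples such as the 86 observations used here.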


### Autocorrelation test

#### Breusch-Godfrey test

In [17]:
from bok_da.linear.test import ac_test

df = pd.read_stata('../data/klein.dta')
fm = 'consump~wagegovt'
ols = regress(fm, data=df)
bg = ac_test(ols, nlags=1, verbose=True)

Breusch-Godfrey test for autocorrelation

H0: No serial correlation

    chi2(1) =  14.26    F(1, 20) =  35.04
Prob > chi2 = 0.0002    Prob > F = 0.0000


This matches the Stata result.

In [15]:
# from bok_da.stata import Stata

# # Stata
# stata = Stata('/Applications/Stata', 'mp')
# stata.get_ready()
# stata.use(df, force=True)
# stata.run('tsset yr')
# stata.run('reg consump wagegovt', quietly = True)
# stata.run('estat bgodfrey')

. tsset yr

Time variable: yr, 1920 to 1941
        Delta: 1 unit
. qui reg consump wagegovt
. estat bgodfrey

Breusch–Godfrey LM test for autocorrelation
---------------------------------------------------------------------------
    lags(p)  |          chi2               df                 Prob > chi2
-------------+-------------------------------------------------------------
       1     |         14.264               1                   0.0002
---------------------------------------------------------------------------
                        H0: no serial correlation
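
The Breusch-Godfrey statistic is n R^2 from an auxiliary regression of the OLS residuals on the original regressors and the lagged residuals. With the 22 annual observations above, the printed chi2 of 14.264 corresponds to an auxiliary R^2 of about 0.65. The NumPy sketch below shows the computation on simulated data. It assumes the missing initial lags are filled with zeros, and `breusch_godfrey` is a hypothetical name, not the `bok_da` function.

```python
import numpy as np

def breusch_godfrey(y, X, nlags=1):
    """n * R^2 version of the Breusch-Godfrey test and its degrees of freedom."""
    n = len(y)
    e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    # Lagged residuals; the unavailable initial values are set to zero.
    lags = np.column_stack(
        [np.concatenate([np.zeros(j), e[:-j]]) for j in range(1, nlags + 1)]
    )
    A = np.column_stack([X, lags])
    fit = A @ np.linalg.lstsq(A, e, rcond=None)[0]
    r2 = 1.0 - ((e - fit) ** 2).sum() / ((e - e.mean()) ** 2).sum()
    return n * r2, nlags

# Simulated regression with strongly autocorrelated errors (rho = 0.8).
rng = np.random.default_rng(6)
n = 200
x = rng.normal(size=n)
eps = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.8 * u[t - 1] + eps[t]
y = 1.0 + 0.5 * x + u
X = np.column_stack([np.ones(n), x])

lm, df = breusch_godfrey(y, X, nlags=1)
print(round(lm, 2), df)
```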


### RESET test

In [19]:
from bok_da.linear.test import reset_test

df = pd.read_csv('../data/Death.csv')
fm = 'deathrate~smoke+drink+aged'
ols = regress(fm, data=df[df.year==2010])
reset = reset_test(ols, power=4, verbose=True)

Ramsey RESET test

H0: Model is correctly linearly specified

F(3, 79) =   1.51
Prob > F = 0.2189


The Stata result is identical.

In [17]:
# # Stata
# stata = Stata('/Applications/Stata', 'mp')
# stata.get_ready()
# stata.use(df, force=True)
# stata.run('reg deathrate smoke drink aged if year==2010', quietly = True)
# stata.run('estat ovtest')

. qui reg deathrate smoke drink aged if year==2010
. estat ovtest

Ramsey RESET test for omitted variables
Omitted: Powers of fitted values of deathrate

H0: Model has no omitted variables

F(3, 79) =   1.51
Prob > F = 0.2189
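
With `power=4`, the RESET test adds the second, third and fourth powers of the fitted values to the regression and tests their joint significance, which gives 3 numerator degrees of freedom and 86 - 4 - 3 = 79 denominator degrees of freedom, as in `F(3, 79)` above. The NumPy sketch below computes this F statistic on synthetic data; `reset_f` is an illustrative helper, not the `bok_da` function.

```python
import numpy as np

def reset_f(y, X, power=4):
    """Ramsey RESET F statistic using powers 2..power of the fitted values."""
    n, k = X.shape
    yhat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    rss0 = ((y - yhat) ** 2).sum()
    # Standardizing the fitted values only improves conditioning: the augmented
    # regressors span the same space, so the F statistic is unchanged.
    s = (yhat - yhat.mean()) / yhat.std()
    extra = np.column_stack([s ** p for p in range(2, power + 1)])
    A = np.column_stack([X, extra])
    rss1 = ((y - A @ np.linalg.lstsq(A, y, rcond=None)[0]) ** 2).sum()
    q = extra.shape[1]
    df2 = n - k - q
    f = ((rss0 - rss1) / q) / (rss1 / df2)
    return f, (q, df2)

# Synthetic, correctly specified linear model with 86 observations.
rng = np.random.default_rng(7)
n = 86
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([1.0, 0.5, -0.3, 0.2]) + rng.normal(size=n)

f, dfs = reset_f(y, X, power=4)
print(round(f, 2), dfs)
```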
