# Multiple Regression Analysis with Qualitative Information: Binary (or Dummy) Variables
---

## Describing Qualitative Information

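Qualitative information often comes in binary form: a person is male or female; a person does or does not own a personal computer. Such information can be captured by defining a **binary variable**, also called a **zero-one variable**.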

<div align=center>
<img src="./pic/w015.jpg" width = "50%" />
</div>

Discussion: could two values other than 0 and 1 be used to encode the qualitative information?

## A Single Dummy Variable

Consider, for example, the following model of hourly wage determination:

$$wage=\beta_{0}+\delta_{0} \text { female }+\beta_{1} educ+\mu$$

Then

$$\delta_{0}=\mathrm{E}(\text { wage } | \text { female}=1, \text {educ}) - \mathrm{E}(\text { wage } | \text { female}=0, \text {educ})$$

that is,

$$\delta_{0}=\mathrm{E}(\text { wage } | \text { female,educ }) - \mathrm{E}(\text { wage } | \text { male,educ })$$

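Here men are the **base group**, and $\delta_{0}$ shifts the intercept for women at every level of education, as the figure below illustrates.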

<div align=center>
<img src="./pic/w016.jpg" width = "50%" />
</div>

Discussion: could we also include a second dummy variable, $male$, in the model?

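Answer: No. Using both dummy variables in the model above causes perfect multicollinearity, because $male + female = 1$ reproduces the intercept column. If the intercept is dropped, a dummy variable for every group can be included; for example, the model above can be written as $wage=\beta_{0} \text { male }+\alpha_{0} \text { female }+\beta_{1} educ+\mu$. However, there is no generally agreed-upon way to compute $R^{2}$ in a regression without an intercept, so this formulation is rarely used.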
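The collinearity problem can be seen on a tiny example. The sketch below uses made-up data (the six observations are an assumption, not taken from WAGE1) and plain Python: it builds the columns intercept, $female$ and $male$, and checks that the matrix $X'X$ that OLS must invert is singular.

```python
# Toy data (made up for illustration): 1 = female, 0 = male.
female = [1, 0, 0, 1, 1, 0]
male = [1 - f for f in female]
const = [1] * len(female)

# Columns of the design matrix: intercept, female, male.
columns = [const, female, male]

# Gram matrix X'X; OLS requires this matrix to be invertible.
xtx = [[sum(a * b for a, b in zip(ci, cj)) for cj in columns] for ci in columns]


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


# female + male equals the intercept column in every row ...
collinear = all(f + m == c for f, m, c in zip(female, male, const))
# ... so X'X has determinant zero and cannot be inverted.
det_xtx = det3(xtx)
print(collinear, det_xtx)
```

Dropping any one of the three columns (the intercept or one of the dummies) removes the exact linear dependence.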

<br>

**Example: hourly wage equation**

In [1]:
import ipystata

In [2]:
%%stata

cd "D:\github\notebook\Teaching\Courses\Undergraduate\Econometrics\data"


D:\github\notebook\Teaching\Courses\Undergraduate\Econometrics\data



In [7]:
%%stata

use WAGE1, clear

eststo clear
eststo: quietly reg wage female
eststo: quietly reg wage female educ exper tenure
esttab, se r2 ar2 sca(rss)


(est1 stored)

(est2 stored)

--------------------------------------------
                      (1)             (2)   
                     wage            wage   
--------------------------------------------
female             -2.512***       -1.811***
                  (0.303)         (0.265)   

educ                                0.572***
                                 (0.0493)   

exper                              0.0254*  
                                 (0.0116)   

tenure                              0.141***
                                 (0.0212)   

_cons               7.099***       -1.568*  
                  (0.210)         (0.725)   
--------------------------------------------
N                     526             526   
R-sq                0.116           0.364   
adj. R-sq           0.114           0.359   
rss                6332.2          4557.3   
--------------------------------------------
Standard errors in parentheses
* p<0.05, ** p<0.01, *** p<0.001
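In column (1), where $female$ is the only regressor, the intercept (7.099) is the average wage of men and the intercept plus the $female$ coefficient (7.099 - 2.512 = 4.587) is the average wage of women. The sketch below checks this property of simple regression on a dummy using made-up numbers (the six wages are an assumption, not WAGE1 data) and the textbook formulas for the OLS slope and intercept.

```python
# Made-up hourly wages; female = 1 for women, 0 for men.
wage = [5.0, 7.5, 6.0, 9.0, 4.5, 8.0]
female = [1, 0, 1, 0, 1, 0]

n = len(wage)
mean_w = sum(wage) / n
mean_f = sum(female) / n

# Simple-regression OLS: slope = sample cov(x, y) / sample var(x).
sxy = sum((f - mean_f) * (w - mean_w) for f, w in zip(female, wage))
sxx = sum((f - mean_f) ** 2 for f in female)
slope = sxy / sxx
intercept = mean_w - slope * mean_f

# Group means computed directly.
mean_men = sum(w for w, f in zip(wage, female) if f == 0) / female.count(0)
mean_women = sum(w for w, f in zip(wage, female) if f == 1) / female.count(1)

print(intercept, mean_men)             # intercept equals the base-group mean
print(slope, mean_women - mean_men)    # slope equals the difference in means
```

Once other regressors such as $educ$ are added, as in column (2), the coefficient no longer equals the raw difference in means; it measures the difference holding those regressors fixed.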



**Program evaluation**

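In the simplest program evaluation, subjects are divided into two groups. The control group does not participate in the program, while the experimental group, or treatment group, does.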

<br>

**Example: effect of job training grants on hours of training**

In [8]:
%%stata

use JTRAIN, clear

reg hrsemp grant lsales lemploy if year==1988


      Source |       SS           df       MS      Number of obs   =       105
-------------+----------------------------------   F(3, 101)       =     10.44
       Model |  18622.7268         3  6207.57559   Prob > F        =    0.0000
    Residual |  60031.0921       101  594.367249   R-squared       =    0.2368
-------------+----------------------------------   Adj R-squared   =    0.2141
       Total |  78653.8189       104   756.28672   Root MSE        =     24.38

------------------------------------------------------------------------------
      hrsemp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       grant |    26.2545   5.591765     4.70   0.000     15.16194    37.34705
      lsales |  -.9845809   3.539903    -0.28   0.781    -8.006797    6.037635
     lemploy |  -6.069871   3.882893    -1.56   0.121    -13.77249    1.632744
       _cons |   46.66508    43.4121     1.07   0.

### Interpreting coefficients on dummy explanatory variables when the dependent variable is $\log(y)$

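In general, if $\hat{\beta}_{1}$ is the coefficient on a dummy variable, say $x_{1}$, and $\log(y)$ is the dependent variable, the exact percentage difference in the predicted $y$ when $x_{1}=1$ versus $x_{1}=0$ is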

$$100 \cdot\left[\exp \left(\hat{\beta}_{1}\right)-1\right]$$
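To see how much the exact formula differs from the usual approximation $100 \cdot \hat{\beta}_{1}$, the short sketch below plugs in the $female$ coefficient from the log hourly wage example that follows (-.296511). The function name `exact_pct_change` is an illustrative choice, not part of any library.

```python
import math


def exact_pct_change(beta_hat):
    """Exact percentage difference implied by a dummy coefficient in a log(y) model."""
    return 100.0 * (math.exp(beta_hat) - 1.0)


b_female = -0.296511             # coefficient on female in the lwage regression
approx = 100.0 * b_female        # rough reading: about -29.7 percent
exact = exact_pct_change(b_female)

print(round(approx, 1), round(exact, 1))
```

The exact difference is noticeably smaller in magnitude than the approximation, and the gap grows with the absolute size of the coefficient.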

<br>

**Example: log hourly wage**

In [10]:
%%stata

use WAGE1, clear

reg lwage female educ exper expersq tenure tenursq
di exp(_b[female]*1)-1


      Source |       SS           df       MS      Number of obs   =       526
-------------+----------------------------------   F(6, 519)       =     68.18
       Model |  65.3791009         6  10.8965168   Prob > F        =    0.0000
    Residual |  82.9506505       519  .159827843   R-squared       =    0.4408
-------------+----------------------------------   Adj R-squared   =    0.4343
       Total |  148.329751       525   .28253286   Root MSE        =    .39978

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   -.296511   .0358055    -8.28   0.000    -.3668524   -.2261696
        educ |   .0801967   .0067573    11.87   0.000     .0669217    .0934716
       exper |   .0294324   .0049752     5.92   0.000     .0196585    .0392063
     expersq |  -.0005827   .0001073    -5.43   0.

## Using Dummy Variables for Multiple Categories

**Example: log hourly wage equation**

In [11]:
%%stata

use WAGE1, clear

gen male = (!female)
gen single = (~married)
gen marrmale = (married & ~female)
gen marrfem = (married & female)
gen singfem = (female & ~married)
gen singmale = (~female & ~married)

eststo clear
eststo: quietly reg lwage marrmale marrfem singfem educ exper expersq tenure tenursq
eststo: quietly reg lwage marrmale singmale singfem educ exper expersq tenure tenursq
esttab, se r2 ar2 sca(rss)


(est1 stored)

(est2 stored)

--------------------------------------------
                      (1)             (2)   
                    lwage           lwage   
--------------------------------------------
marrmale            0.213***        0.411***
                 (0.0554)        (0.0458)   

marrfem            -0.198***                
                 (0.0578)                   

singfem            -0.110*         0.0879   
                 (0.0557)        (0.0523)   

educ               0.0789***       0.0789***
                (0.00669)       (0.00669)   

exper              0.0268***       0.0268***
                (0.00524)       (0.00524)   

expersq         -0.000535***    -0.000535***
               (0.000110)      (0.000110)   

tenure             0.0291***       0.0291***
                (0.00676)       (0.00676)   

tenursq         -0.000533*      -0.000533*  
               (0.000231)      (0.000231)   

singmale                            0.198***
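The two columns fit the same model with different base groups (single men in column (1), married women in column (2)), which is why the other coefficients are identical. Switching the base group only subtracts the new base group's coefficient from every group coefficient. The sketch below checks this with the column (1) estimates shown in the table; `rebase` is a hypothetical helper name.

```python
# Column (1) group coefficients from the table (base group: single men, so 0.0).
col1 = {"marrmale": 0.213, "marrfem": -0.198, "singfem": -0.110, "singmale": 0.0}


def rebase(coefs, new_base):
    """Re-express group coefficients relative to a different base group."""
    shift = coefs[new_base]
    return {group: round(b - shift, 3) for group, b in coefs.items()}


# Column (2) uses married women as the base group.
col2 = rebase(col1, "marrfem")
print(col2)
```

Up to rounding of the reported estimates, the result agrees with the column (2) coefficients on `marrmale`, `singfem` and `singmale`.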
                

### Incorporating ordinal information by using dummy variables

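Suppose we want to estimate the effect of city credit ratings on the municipal bond interest rate ($MBR$). For simplicity, suppose the ratings take values in $\{0,1,2,3,4\}$, with $0$ the lowest credit rating and $4$ the highest. This is an example of an **ordinal variable**; call it $CR$. How can $CR$ be included in a model that explains $MBR$?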

One possibility is

$$MBR=\beta_{0}+\beta_{1} CR+ \text{other factors}$$

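Another possibility is to define a dummy variable $CR_{j}$ that equals one if $CR=j$ and zero otherwise, with $CR=0$ as the base group: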

$$MBR=\beta_{0}+\delta_{1} C R_{1}+\delta_{2} C R_{2}+\delta_{3} C R_{3}+\delta_{4} C R_{4}+ \text{other factors}$$

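The second model is more flexible, because it allows the move between each pair of credit ratings to have a different effect. In fact, the first model is a special case of the second, obtained by imposing the restrictions $\delta_{2}=2 \delta_{1}$, $\delta_{3}=3 \delta_{1}$, $\delta_{4}=4 \delta_{1}$.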
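The nesting can be verified directly. Assuming, as the notation suggests, that $CR_{j}$ equals one when $CR=j$ and zero otherwise (with $CR=0$ as the base group), the rating itself can be written in terms of the dummies:

$$\begin{aligned}
CR &= CR_{1}+2\, CR_{2}+3\, CR_{3}+4\, CR_{4} \\
\beta_{1} CR &= \beta_{1} CR_{1}+2 \beta_{1} CR_{2}+3 \beta_{1} CR_{3}+4 \beta_{1} CR_{4}
\end{aligned}$$

Matching terms with the dummy-variable model gives $\delta_{j}=j \beta_{1}$, so $\delta_{1}=\beta_{1}$ and the three restrictions above follow. The linear specification can therefore be tested against the dummy specification with an $F$ test of these three restrictions.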

<br>

**Example: effects of physical attractiveness on wages**

In [17]:
%%stata

use beauty, clear
tab looks

eststo clear
eststo: quietly reg lwage looks if female == 1
eststo: quietly reg lwage belavg abvavg if female == 1
eststo: quietly reg lwage belavg abvavg educ exper expersq if female == 1
esttab, se r2 ar2 sca(rss) star(* 0.10 ** 0.05 *** 0.01)

eststo clear
eststo: quietly reg lwage looks if female == 0
eststo: quietly reg lwage belavg abvavg if female == 0
eststo: quietly reg lwage belavg abvavg educ exper expersq if female == 0
esttab, se r2 ar2 sca(rss) star(* 0.10 ** 0.05 *** 0.01)


from 1 to 5 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |         13        1.03        1.03
          2 |        142       11.27       12.30
          3 |        722       57.30       69.60
          4 |        364       28.89       98.49
          5 |         19        1.51      100.00
------------+-----------------------------------
      Total |      1,260      100.00

(est1 stored)

(est2 stored)

(est3 stored)

------------------------------------------------------------
                      (1)             (2)             (3)   
                    lwage           lwage           lwage   
------------------------------------------------------------
looks              0.0734**                                 
                 (0.0349)                                   

belavg                             -0.138*         -0.126*  
                                 (0.0762)        (0.0686)   

abvavg                            

## Interactions Involving Dummy Variables

### Interactions among dummy variables

**Example: log wage equation**

In [18]:
%%stata

use WAGE1, clear

gen male = (!female)
gen single = (~married)
gen marrmale = (married & ~female)
gen marrfem = (married & female)
gen singfem = (female & ~married)
gen singmale = (~female & ~married)

eststo clear
eststo: quietly reg lwage marrmale marrfem singfem
eststo: quietly reg lwage marrmale singmale singfem
eststo: quietly reg lwage female married marrfem
esttab, se r2 ar2 sca(rss)


(est1 stored)

(est2 stored)

(est3 stored)

------------------------------------------------------------
                      (1)             (2)             (3)   
                    lwage           lwage           lwage   
------------------------------------------------------------
marrmale            0.427***        0.506***                
                 (0.0616)        (0.0537)                   

marrfem           -0.0797                          -0.375***
                 (0.0655)                        (0.0857)   

singfem            -0.132*        -0.0519                   
                 (0.0668)        (0.0596)                   

singmale                           0.0797                   
                                 (0.0655)                   

female                                             -0.132*  
                                                 (0.0668)   

married                                             0.427***
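Column (3) describes the same four groups as columns (1) and (2), but through $female$, $married$ and their interaction `marrfem`. The sketch below recovers each group's differential relative to single men from the column (3) estimates shown in the table; `group_gap` is a hypothetical helper name.

```python
# Column (3) estimates from the table: lwage on female, married, marrfem.
b_female, b_married, b_marrfem = -0.132, 0.427, -0.375


def group_gap(female, married):
    """Log-wage differential relative to single men (female = 0, married = 0)."""
    return b_female * female + b_married * married + b_marrfem * female * married


gaps = {
    "singmale": group_gap(0, 0),
    "marrmale": group_gap(0, 1),
    "singfem": group_gap(1, 0),
    "marrfem": group_gap(1, 1),
}
print({group: round(gap, 3) for group, gap in gaps.items()})
```

The implied differentials reproduce column (1) up to rounding: 0.427 for married men, -0.132 for single women, and about -0.080 for married women.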
                                  

### Allowing for different slopes

In the wage equation, we may also want to test whether the return to education is the same for men and women.

<div align=center>
<img src="./pic/w017.jpg" width = "50%" />
</div>

To do so, we need to specify the following model:

$$\log (\text {wage})=\beta_{0}+\delta_{0} \text {female }+\beta_{1} \text {educ }+\delta_{1} \text {female} \cdot \text {educ}+\mu$$
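Setting $female$ to 0 and 1 in turn shows what each parameter measures. Assuming the error has zero conditional mean given $female$ and $educ$, the two regression lines are

$$\begin{aligned}
\text{female}=0: &\quad \mathrm{E}[\log(\text{wage}) \mid \text{educ}]=\beta_{0}+\beta_{1}\, \text{educ} \\
\text{female}=1: &\quad \mathrm{E}[\log(\text{wage}) \mid \text{educ}]=\left(\beta_{0}+\delta_{0}\right)+\left(\beta_{1}+\delta_{1}\right) \text{educ}
\end{aligned}$$

So $\delta_{0}$ is the difference in intercepts and $\delta_{1}$ is the difference in the return to education between women and men. The hypothesis of equal returns is $\mathrm{H}_{0}: \delta_{1}=0$, while $\mathrm{H}_{0}: \delta_{0}=0, \delta_{1}=0$ states that average log wages are the same for men and women at every level of education.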

In [20]:
%%stata

use WAGE1, clear

gen femed = female*educ

eststo clear
eststo: quietly reg lwage female educ exper expersq tenure tenursq
eststo: quietly reg lwage female educ femed exper expersq tenure tenursq
esttab, se r2 ar2 sca(rss)

test female femed


(est1 stored)

(est2 stored)

--------------------------------------------
                      (1)             (2)   
                    lwage           lwage   
--------------------------------------------
female             -0.297***       -0.227   
                 (0.0358)         (0.168)   

educ               0.0802***       0.0824***
                (0.00676)       (0.00847)   

exper              0.0294***       0.0293***
                (0.00498)       (0.00498)   

expersq         -0.000583***    -0.000580***
               (0.000107)      (0.000108)   

tenure             0.0317***       0.0319***
                (0.00685)       (0.00686)   

tenursq         -0.000585*      -0.000590*  
               (0.000235)      (0.000235)   

femed                            -0.00556   
                                 (0.0131)   

_cons               0.417***        0.389** 
                 (0.0989)         (0.119)   
--------------------------------------------
N                

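### Testing for differences in regression functions across groups

Suppose we want to test whether the same regression model describes college GPA for male and female college athletes. The equation is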

$$cumgpa =\beta_{0}+\beta_{1} sat+\beta_{2} hsperc+\beta_{3} \text {tothrs}+\mu$$

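where $sat$ is the SAT score, $hsperc$ is the high school rank percentile, and $tothrs$ is total hours of college courses. To test for any difference between men and women, we must allow both the intercept and the slopes to differ across the two groups: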

$$\begin{aligned}
cumgpa &= \beta_{0}+\delta_{0}female+\beta_{1}sat+\delta_{1}female \cdot sat + \beta_{2}hsperc \\
&\quad + \delta_{2}female \cdot hsperc + \beta_{3}tothrs+ \delta_{3}female \cdot tothrs +\mu
\end{aligned}$$

The null hypothesis that $cumgpa$ follows the same model for men and women is

$$\mathrm{H}_{0} : \delta_{0}=0, \delta_{1}=0, \delta_{2}=0, \delta_{3}=0$$
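Under the usual homoskedasticity assumption, this joint hypothesis can be tested with the standard $F$ statistic, $F=\frac{(SSR_{r}-SSR_{ur})/q}{SSR_{ur}/df_{ur}}$, where $SSR_{r}$ comes from the restricted model without $female$ and its interactions, $SSR_{ur}$ from the model above, $q=4$ is the number of restrictions, and $df_{ur}$ is the residual degrees of freedom of the unrestricted model. The sketch below only illustrates the arithmetic; the function name `f_statistic` and the numbers passed to it are made up and are not GPA3 estimates.

```python
def f_statistic(ssr_r, ssr_ur, q, df_ur):
    """F statistic for q exclusion restrictions, from restricted and unrestricted SSRs."""
    return ((ssr_r - ssr_ur) / q) / (ssr_ur / df_ur)


# Toy inputs (hypothetical): SSR falls from 120 to 100 when 4 terms are added,
# and the unrestricted model has 50 residual degrees of freedom.
f_toy = f_statistic(ssr_r=120.0, ssr_ur=100.0, q=4, df_ur=50)
print(f_toy)
```

In the Stata cell that follows, the `test` command after `reg` reports this kind of joint test directly, so the restricted regression does not have to be run by hand.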

In [21]:
%%stata

use GPA3, clear

gen female_sat = female * sat
gen female_hsperc = female * hsperc
gen female_tothrs = female * tothrs

reg cumgpa female sat female_sat hsperc female_hsperc tothrs female_tothrs if spring == 1

test female female_sat female_hsperc female_tothrs


      Source |       SS           df       MS      Number of obs   =       366
-------------+----------------------------------   F(7, 358)       =     34.95
       Model |  53.5391808         7   7.6484544   Prob > F        =    0.0000
    Residual |  78.3545052       358  .218867333   R-squared       =    0.4059
-------------+----------------------------------   Adj R-squared   =    0.3943
       Total |  131.893686       365  .361352564   Root MSE        =    .46783

-------------------------------------------------------------------------------
       cumgpa |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
       female |  -.3534862   .4105293    -0.86   0.390    -1.160838    .4538659
          sat |   .0010516   .0001811     5.81   0.000     .0006955    .0014078
   female_sat |   .0007506   .0003852     1.95   0.052    -6.88e-06    .0015081
       hsperc |  -.0084516   .0013704    -6.