---
title: "14-Generalized Linear Models (GLM)"
subtitle: "Generalized Linear Model"
author: "Simon Zhou"
date: "2025-05-08"
format: 
    html:
        code-fold: false
        fig_caption: true
        number-sections: true
        toc: true
        toc-depth: 2
---

In [1]:
import stata_setup
stata_setup.config('C:/Program Files/Stata18', 'mp', splash=False)

In [13]:
%%stata
sysuse auto.dta,clear

(1978 automobile data)


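The bias that appears when the outcome frequency exceeds 15% arises only when we use odds to estimate risk.

1. In cohort studies and clinical trials, the outcome is the incidence of the event;
2. In cross-sectional studies, the outcome is the prevalence of the event;
3. In case-control studies, we report only odds, not risk, so the problem does not arise.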
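A minimal Python sketch of why the threshold matters, assuming a hypothetical exposure that doubles the risk (the baseline risks and the risk ratio of 2 are illustrative values, not taken from any dataset). As the outcome becomes more common, the odds ratio drifts further from the risk ratio it is meant to approximate.

```python
def odds(p):
    """Convert a risk (probability) into odds."""
    return p / (1 - p)


def compare(risk_unexposed, risk_ratio):
    """Return (risk ratio, odds ratio) for a given baseline risk."""
    risk_exposed = risk_unexposed * risk_ratio
    odds_ratio = odds(risk_exposed) / odds(risk_unexposed)
    return risk_ratio, odds_ratio


# Hypothetical baseline risks, from a rare outcome to a common one.
for baseline in (0.01, 0.05, 0.15, 0.30):
    rr, or_ = compare(baseline, risk_ratio=2.0)
    print(f"baseline risk {baseline:.2f}: RR = {rr:.2f}, OR = {or_:.2f}")
```

With a rare outcome the two measures nearly coincide; with a common outcome the odds ratio clearly overstates the risk ratio.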

## What is a generalized linear model

- Linear regression:
    - $\mathrm{E}(Y) = \beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n$
- Generalized linear model:
    - $f(\mathrm{E}(Y)) = \beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n$
    - Linear regression is a special case of GLM
    - The same holds for logistic regression, Poisson regression, and log-binomial regression
    - Even for Cox regression (Aug. 31st & Sep. 7th) and the GEE model (Sep. 14th)
- Summary: a GLM transforms the expected value of Y, written $\mathrm{E}(Y)$, so that the transformed value is linear in the X's

## Simple GLM vs. Multiple GLM

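### Simple generalized linear model

$$f(\mathrm{E}(Y)) = \beta_0 + \beta_1 X_1$$

### Multiple generalized linear model

$$f(\mathrm{E}(Y)) = \beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n$$

The X variables can be of any type (continuous or categorical).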

## Specify Options

### Family

The distribution of the Y variable.

### Link

How Y is transformed so that it is linear in X.
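The pairing of family and link can be sketched in plain Python. The family/link pairs are the ones these notes list for the `glm` command; the coefficients and covariate value are hypothetical, chosen so that the linear predictor equals 0. This is only an illustration of what a link function does, not how Stata fits the model.

```python
import math

# Each link maps mu = E(Y) to the linear predictor eta;
# the inverse link maps eta back to mu.
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "logit": (lambda mu: math.log(mu / (1 - mu)),
              lambda eta: 1 / (1 + math.exp(-eta))),
    "log": (lambda mu: math.log(mu), lambda eta: math.exp(eta)),
}

# Model name -> (family, link), as listed in these notes.
MODELS = {
    "linear": ("gaussian", "identity"),
    "logistic": ("binomial", "logit"),
    "poisson": ("poisson", "log"),
    "log-binomial": ("binomial", "log"),
}


def linear_predictor(betas, xs):
    """beta_0 + beta_1*x_1 + ... + beta_n*x_n (betas[0] is the intercept)."""
    return betas[0] + sum(b * x for b, x in zip(betas[1:], xs))


def expected_y(link_name, betas, xs):
    """Apply the inverse link to the linear predictor to recover E(Y)."""
    _, inverse = LINKS[link_name]
    return inverse(linear_predictor(betas, xs))


betas = [-1.0, 0.5]  # hypothetical coefficients
xs = [2.0]           # hypothetical covariate value
for model, (family, link) in MODELS.items():
    print(model, family, link, round(expected_y(link, betas, xs), 4))
```

The same linear predictor yields a different expected value under each link, which is why the link must match the scale on which Y is meant to be linear in X.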

## Linear regression

### Y var

1. Continuous variable
2. Distribution: Gaussian
3. Model:
   -  $$f(\mathrm{E}(Y)) = \beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n$$
   -  How is Y transformed so that it is linear in X?
      - No transformation is needed
      - Link: identity

## Logistic regression

### Y var

1. Binary variable
2. Distribution: binomial
3. Model:
   - $$\ln\left[\frac{P(Y=1)}{P(Y=0)}\right] = \beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n$$
   - $\mathrm{E}(Y) = P(Y=1)$
   - How is Y transformed so that it is linear in X? Link: logit

## Poisson regression

### Y var

1. Count variable: non-negative integers (0, 1, 2, ..., n)
2. Distribution: Poisson
3. Model:
   - $$\ln(\text{risk of event}) = \beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n$$
   - How is Y transformed so that it is linear in X? Link: log

## Log-binomial Regression

### Y var

1. Binary variable
2. Distribution: binomial
3. Model:
   - $$\ln[P(Y=1)] = \beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n$$
   - $\mathrm{E}(Y) = P(Y=1)$
   - How is Y transformed so that it is linear in X? Link: log
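Two consequences of the log link for a binary outcome can be checked with a short Python sketch. The intercept and slope are hypothetical values (baseline risk 0.2, risk ratio 1.5), chosen only for illustration. First, $e^{\beta_1}$ is a risk ratio. Second, nothing in the log link keeps the predicted probability below 1.

```python
import math


def predicted_risk(b0, b1, x):
    """Inverse of the log link: P(Y=1) = exp(b0 + b1*x)."""
    return math.exp(b0 + b1 * x)


# Hypothetical coefficients: baseline risk 0.2, risk ratio 1.5 per unit of x.
b0, b1 = math.log(0.2), math.log(1.5)

# A one-unit increase in x multiplies the risk by exp(b1).
ratio = predicted_risk(b0, b1, 1) / predicted_risk(b0, b1, 0)
print("risk ratio per unit of x:", round(ratio, 4))

# The log link does not cap the prediction at 1.
for x in range(6):
    p = predicted_risk(b0, b1, x)
    flag = "  <- not a valid probability" if p > 1 else ""
    print(x, round(p, 4), flag)
```

Predictions above 1 are not valid probabilities under a binomial family, so the estimation has to stay away from such coefficient values.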

In [14]:
%%stata
help glm


[R] glm -- Generalized linear models
           (View complete PDF manual entry)


Syntax
------

        glm depvar [indepvars] [if] [in] [weight] [, options]

    options                     Description
    -------------------------------------------------------------------------
    Model
      family(familyname)        distribution of depvar; default is
                                  family(gaussian)
      link(linkname)            link function; default is canonical link for
                                  family() specified

    Model 2
      noconstant                suppress constant term
      exposure(varname)         include ln(varname) in model with coefficient
                                  constrained to 1
      offset(varname)           include varname in model with coefficient
                                  constrained to 1
      constraints(constraints)  apply specified linear constraints
      asis                      retain perfect predictor variables
  

The `glm` command in Stata

- Umbrella command
- Linear regression: family(**gau**ssian) link(**i**dentity)
- Logistic regression: family(**b**inomial) link(**l**ogit)
- Poisson regression: family(**p**oisson) link(**log**)
- Log-binomial regression: family(**b**inomial) link(**log**)

> The bold letters are the shortest abbreviation that can be typed in actual code.

## Translating the commands

### Linear regression:

- `regress price weight length mpg i.rep78`
- `glm price weight length mpg i.rep78, family(gaussian) link(identity)`

### Logistic regression:

- `logistic low age i.smoke i.race`
- `glm low age i.smoke i.race, family(binomial) link(logit)`
- `glm low age i.smoke i.race, family(binomial) link(logit) eform`

> `eform` reports the exponentiated coefficients $e^{\beta}$ directly, which for this model are odds ratios (OR).
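The relationship between a coefficient and its exponentiated form can be checked by hand. The sketch below uses the Smoker row of the `logistic low age i.smoke i.race` output reproduced in these notes. The critical value 1.959964 is the usual 97.5th percentile of the standard normal.

```python
import math

# Smoker row of the `logistic` output in these notes.
odds_ratio = 3.00582
se_odds_ratio = 1.118001

beta = math.log(odds_ratio)           # coefficient on the log-odds scale
se_beta = se_odds_ratio / odds_ratio  # delta method: se(OR) = OR * se(beta)
z = beta / se_beta                    # test statistic, computed on the beta scale

z_crit = 1.959964                     # 97.5th percentile of the standard normal
ci_low = math.exp(beta - z_crit * se_beta)
ci_high = math.exp(beta + z_crit * se_beta)

print("beta:", round(beta, 4))
print("z:", round(z, 2))
print("95% CI for the OR:", round(ci_low, 3), round(ci_high, 3))
```

Up to rounding, this reproduces the z statistic and the confidence interval printed in that row. The interval is built on the coefficient scale and then exponentiated, which is why it is not symmetric around the odds ratio.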

## Comparing the linear regression models

In [6]:
%%stata
regress price weight length mpg i.rep78


      Source |       SS           df       MS      Number of obs   =        69
-------------+----------------------------------   F(7, 61)        =      7.25
       Model |   262008114         7  37429730.6   Prob > F        =    0.0000
    Residual |   314788844        61  5160472.86   R-squared       =    0.4542
-------------+----------------------------------   Adj R-squared   =    0.3916
       Total |   576796959        68  8482308.22   Root MSE        =    2271.7

------------------------------------------------------------------------------
       price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      weight |   5.186695   1.163383     4.46   0.000     2.860367    7.513022
      length |  -124.1544   40.07637    -3.10   0.003     -204.292   -44.01671
         mpg |  -126.8367   84.49819    -1.50   0.138    -295.8012    42.12791
             |
       rep78 |
          2  |   113

In [15]:
%%stata
glm price weight length mpg i.rep78, family(gaussian) link(identity)




Iteration 0:  Log likelihood = -626.90582  

Generalized linear models                         Number of obs   =         69
Optimization     : ML                             Residual df     =         61
                                                  Scale parameter =    5160473
Deviance         =  314788844.4                   (1/df) Deviance =    5160473
Pearson          =  314788844.4                   (1/df) Pearson  =    5160473

Variance function: V(u) = 1                       [Gaussian]
Link function    : g(u) = u                       [Identity]

                                                  AIC             =   18.40307
Log likelihood   = -626.9058204                   BIC             =   3.15e+08

------------------------------------------------------------------------------
             |                 OIM
       price | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
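The headers of the two outputs describe the same fit in different vocabulary. This can be checked with a little arithmetic in Python, using only figures printed above. The mapping in the comments (deviance as residual sum of squares, scale parameter as residual mean square) holds for the Gaussian family with identity link.

```python
import math

# Figures copied from the `regress` and `glm` output above.
residual_ss = 314788844.4  # glm "Deviance"; regress reports 314788844
residual_df = 61           # glm "Residual df"; regress residual df
model_ss = 262008114       # regress model SS
total_ss = 576796959       # regress total SS

scale = residual_ss / residual_df  # glm "Scale parameter" = regress residual MS
root_mse = math.sqrt(scale)        # regress "Root MSE"
r_squared = model_ss / total_ss    # regress "R-squared"

print(round(scale, 2), round(root_mse, 1), round(r_squared, 4))
```

The results agree with the scale parameter, Root MSE, and R-squared printed in the two headers.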
     

## Comparing the logistic regressions

Load a different dataset:

In [7]:
%%stata
webuse lbw,clear

(Hosmer & Lemeshow data)


In [8]:
%%stata
logistic low age i.smoke i.race


Logistic regression                                     Number of obs =    189
                                                        LR chi2(4)    =  15.81
                                                        Prob > chi2   = 0.0033
Log likelihood = -109.4311                              Pseudo R2     = 0.0674

------------------------------------------------------------------------------
         low | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   .9657186   .0322573    -1.04   0.296     .9045206    1.031057
             |
       smoke |
     Smoker  |    3.00582   1.118001     2.96   0.003     1.449982    6.231081
             |
        race |
      Black  |   2.749483   1.356659     2.05   0.040     1.045318    7.231924
      Other  |   2.876948   1.167921     2.60   0.009     1.298314    6.375062
             |
       _cons |    .365111   .3146026    -1.17   0.242 

In [10]:
%%stata
glm low age i.smoke i.race,family(binomial) link(logit) eform


Iteration 0:  Log likelihood = -109.53148  
Iteration 1:  Log likelihood = -109.43111  
Iteration 2:  Log likelihood =  -109.4311  
Iteration 3:  Log likelihood =  -109.4311  

Generalized linear models                         Number of obs   =        189
Optimization     : ML                             Residual df     =        184
                                                  Scale parameter =          1
Deviance         =  218.8621974                   (1/df) Deviance =   1.189468
Pearson          =  182.9642078                   (1/df) Pearson  =   .9943707

Variance function: V(u) = u*(1-u)                 [Bernoulli]
Link function    : g(u) = ln(u/(1-u))             [Logit]

                                                  AIC             =   1.210911
Log likelihood   = -109.4310987                   BIC             =  -745.6193

------------------------------------------------------------------------------
             |                 OIM
         low | Odds ratio   std.

## Log-Binomial Model

`glm low age i.smoke i.race, family(binomial) link(log)`

- An alternative to logistic regression
- May fail to converge
- Fallback: Poisson regression with a robust variance estimate

To get around the failure to converge, fit a Poisson regression with a log link and a robust variance estimate:

`glm low age i.smoke i.race, family(poisson) link(log) robust`

- `robust` can be abbreviated to `r`

In [12]:
%%stata
glm low age i.smoke i.race,family(poisson) link(log) robust


Iteration 0:  Log pseudolikelihood = -124.55888  
Iteration 1:  Log pseudolikelihood = -122.39663  
Iteration 2:  Log pseudolikelihood = -122.39591  
Iteration 3:  Log pseudolikelihood = -122.39591  

Generalized linear models                         Number of obs   =        189
Optimization     : ML                             Residual df     =        184
                                                  Scale parameter =          1
Deviance         =   126.791811                   (1/df) Deviance =   .6890859
Pearson          =  124.4629927                   (1/df) Pearson  =   .6764293

Variance function: V(u) = u                       [Poisson]
Link function    : g(u) = ln(u)                   [Log]

                                                  AIC             =   1.348105
Log pseudolikelihood = -122.3959055               BIC             =  -837.6896

------------------------------------------------------------------------------
             |               Robust
         lo