---
title: "13-Binary Logistic Regression"
subtitle: "Logistic regression for binary outcomes"
author: "Simon Zhou"
date: "2025-05-07"
format: 
    html:
        code-fold: false
        fig_caption: true
        number-sections: true
        toc: true
        toc-depth: 2
---

In [1]:
import stata_setup
stata_setup.config('C:/Program Files/Stata18', 'mp', splash=False)

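## When to use logistic regression

When the outcome incidence exceeds 15%, the OR from logistic regression no longer approximates the RR. It lies further from 1 than the true RR, so it overstates the strength of the association.

Commonly used approaches:

- Conventional logistic regression (yields an OR)
- Mantel-Haenszel method (yields an RR)
- Poisson regression with a robust variance estimate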

### Newer methods

- Zhang and Yu (1998), *What's the Relative Risk?*

$$RR=\frac{OR}{(1-P_0)+(P_0\times OR)}$$

where $P_0$ is the outcome incidence in the unexposed group.

- McNutt et al. (2003), *Estimating the Relative Risk in Cohort Studies and Clinical Trials of Common Outcomes*

- **Gold standard: log-binomial regression**
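The Zhang and Yu correction above can be sketched in a few lines of Python. This is a minimal illustration rather than part of the original analysis: the function name `or_to_rr` and the example values are assumptions, and `p0` stands for $P_0$, the outcome incidence in the unexposed (reference) group.

```python
def or_to_rr(odds_ratio: float, p0: float) -> float:
    """Zhang & Yu (1998) approximation: RR = OR / ((1 - P0) + P0 * OR).

    p0 is the outcome incidence in the unexposed group (a proportion).
    """
    if not 0.0 <= p0 <= 1.0:
        raise ValueError("p0 must be a proportion between 0 and 1")
    if odds_ratio <= 0:
        raise ValueError("odds_ratio must be positive")
    return odds_ratio / ((1.0 - p0) + p0 * odds_ratio)


# Hypothetical values: the same OR of 2.0 at increasing baseline incidence.
for p0 in (0.01, 0.15, 0.30):
    print(f"P0={p0:.2f}  OR=2.00  RR={or_to_rr(2.0, p0):.2f}")
```

With these illustrative inputs the converted RR is 1.98, 1.74, and 1.54: the more common the outcome, the further the OR drifts from the RR.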

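## Assumptions of the binary logistic model

1. The dependent variable (outcome) is binary.
2. There is at least one independent variable, which may be continuous or categorical.
3. Observations are independent of one another. The categories of every categorical variable (dependent or independent) must be exhaustive and mutually exclusive.
4. The minimum sample size is 15 times the number of independent variables, although some researchers argue it should be 50 times.
5. Each continuous independent variable is linearly related to the logit of the dependent variable.
6. There is no multicollinearity among the independent variables.
7. There are no marked outliers, high-leverage points, or highly influential points.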
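The sample-size rule of thumb in the list above is easy to check before fitting a model. The sketch below is only an illustration: `min_sample_size` is a hypothetical helper, and the example figures (189 observations, 4 predictor terms) are assumed values, with each dummy variable of a categorical predictor counted as one term.

```python
def min_sample_size(n_predictors: int, per_predictor: int = 15) -> int:
    """Smallest sample size implied by an 'N observations per predictor' rule."""
    if n_predictors < 1:
        raise ValueError("at least one predictor is required")
    return n_predictors * per_predictor


# Hypothetical example: 189 observations and 4 predictor terms.
n_obs, k = 189, 4
for rule in (15, 50):
    needed = min_sample_size(k, rule)
    status = "meets" if n_obs >= needed else "falls short of"
    print(f"{rule}x rule: need {needed}, n={n_obs} {status} the threshold")
```

With these numbers the 15x rule is satisfied (60 needed) while the stricter 50x rule is not (200 needed).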

## Requirements for logistic regression

1. Y is a binary variable.
2. The incidence of Y is below 15%.

## 导入数据

The variable `low` is our outcome. We want to see which factors are associated with low birthweight in the infant.

In [None]:
%%stata
webuse lbw,clear

(Hosmer & Lemeshow data)


In [3]:
%%stata
tab low


Birthweight |
     <2500g |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        130       68.78       68.78
          1 |         59       31.22      100.00
------------+-----------------------------------
      Total |        189      100.00


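**Disclaimer:** In this section's dataset the outcome incidence (31.22%) is far above 15%, so a log-binomial model should be used. Logistic regression is used here only to demonstrate the workflow in Stata and to set up the log-binomial model in the next section.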

## Logistic regression

Syntax:

```stata
logistic y x1 x2 x3 ...
```

### Model 1: $\operatorname{logit}P(low=1)=\beta_0+\beta_1 age$

In [4]:
%%stata
logistic low age


Logistic regression                                     Number of obs =    189
                                                        LR chi2(1)    =   2.76
                                                        Prob > chi2   = 0.0966
Log likelihood = -115.95598                             Pseudo R2     = 0.0118

------------------------------------------------------------------------------
         low | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   .9501333   .0299423    -1.62   0.105     .8932232    1.010669
       _cons |      1.469   1.075492     0.53   0.599     .3498129    6.168901
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.


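$\beta_1$ (reported as $OR=e^{\beta_1}$): for each additional year of maternal age, the odds of low birthweight are multiplied by 0.95 (95% CI: 0.89, 1.01). The interval includes 1, so the association is not statistically significant (p = 0.105).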
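To see how the odds ratio, its standard error, the z statistic, and the confidence interval in the output fit together, the figures for `age` can be recomputed by hand. This is a sketch under the assumption that the reported standard error is on the odds-ratio scale (delta method) and that the interval is the exponentiated interval of the coefficient; the variable names are illustrative.

```python
import math

# Values copied from the Stata output for age in the age-only model.
odds_ratio = 0.9501333
se_or = 0.0299423                # standard error on the odds-ratio scale

beta = math.log(odds_ratio)      # coefficient on the log-odds scale
se_beta = se_or / odds_ratio     # delta-method SE on the log-odds scale
z = beta / se_beta               # Wald z statistic

z_crit = 1.959964                # 97.5th percentile of the standard normal
ci_low = math.exp(beta - z_crit * se_beta)
ci_high = math.exp(beta + z_crit * se_beta)

print(f"beta={beta:.4f}  se={se_beta:.4f}  z={z:.2f}")
print(f"95% CI for OR: ({ci_low:.4f}, {ci_high:.4f})")
```

The recomputed z and interval agree with the table to the displayed precision, which is why an OR below 1 corresponds to a negative coefficient and a negative z.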

### Model 2: $\operatorname{logit}P(low=1)=\beta_0+\beta_1 age+\beta_2 smoke$

In [5]:
%%stata
logistic low age i.smoke


Logistic regression                                     Number of obs =    189
                                                        LR chi2(2)    =   7.40
                                                        Prob > chi2   = 0.0248
Log likelihood = -113.63815                             Pseudo R2     = 0.0315

------------------------------------------------------------------------------
         low | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   .9514394   .0304194    -1.56   0.119     .8936482    1.012968
             |
       smoke |
     Smoker  |   1.997405    .642777     2.15   0.032     1.063027    3.753081
       _cons |   1.062798   .8048781     0.08   0.936     .2408901    4.689025
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.


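$\beta_1$ (reported as $OR=e^{\beta_1}$): after adjusting for maternal smoking status, each additional year of maternal age multiplies the odds of low birthweight by 0.95 (95% CI: 0.89, 1.01).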
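Because the age-only model is nested in the model that adds smoking, the two log likelihoods printed above can be compared with a likelihood-ratio test. The following is a hand calculation offered as a sketch, not output from the notebook; it uses the identity that the upper tail of a chi-square distribution with 1 degree of freedom equals `erfc(sqrt(x / 2))`.

```python
import math

# Log likelihoods copied from the Stata output above.
ll_age_only = -115.95598     # low ~ age
ll_age_smoke = -113.63815    # low ~ age + smoke

lr_stat = 2.0 * (ll_age_smoke - ll_age_only)   # likelihood-ratio chi-square
df = 1                                         # one added parameter (smoke)

# For 1 degree of freedom: P(chi2 > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(lr_stat / 2.0))

print(f"LR chi2({df}) = {lr_stat:.2f}, p = {p_value:.4f}")
```

The statistic is about 4.64, which matches the difference between the two reported `LR chi2` values (7.40 and 2.76), and the p-value of roughly 0.03 is in line with the Wald test for smoking in the table. In Stata the same comparison is normally done with `estimates store` followed by `lrtest`.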

### Model 3: $\operatorname{logit}P(low=1)=\beta_0+\beta_1 age+\beta_2 smoke+\beta_3 race$

In [6]:
%%stata
logistic low age i.smoke i.race


Logistic regression                                     Number of obs =    189
                                                        LR chi2(4)    =  15.81
                                                        Prob > chi2   = 0.0033
Log likelihood = -109.4311                              Pseudo R2     = 0.0674

------------------------------------------------------------------------------
         low | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   .9657186   .0322573    -1.04   0.296     .9045206    1.031057
             |
       smoke |
     Smoker  |    3.00582   1.118001     2.96   0.003     1.449982    6.231081
             |
        race |
      Black  |   2.749483   1.356659     2.05   0.040     1.045318    7.231924
      Other  |   2.876948   1.167921     2.60   0.009     1.298314    6.375062
             |
       _cons |    .365111   .3146026    -1.17   0.242 

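$\beta_1$ (reported as $OR=e^{\beta_1}$): after adjusting for maternal smoking status and race, each additional year of maternal age multiplies the odds of low birthweight by 0.97 (95% CI: 0.90, 1.03).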
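A per-year odds ratio this close to 1 can be easier to read on a larger scale. Since the model is linear on the log-odds scale, the OR for a multi-year difference is the one-year OR raised to that power. The sketch below applies this to the estimate above; the 10-year step is an arbitrary choice made for illustration.

```python
# Values copied from the Stata output for age in the model
# adjusted for smoking and race.
or_per_year = 0.9657186
ci_per_year = (0.9045206, 1.031057)

years = 10  # hypothetical rescaling chosen for illustration

# Effects add on the log-odds scale, so the OR for a `years`-unit
# increase is the one-unit OR raised to that power.
or_rescaled = or_per_year ** years
ci_rescaled = tuple(bound ** years for bound in ci_per_year)

print(f"OR per {years} years: {or_rescaled:.2f} "
      f"(95% CI: {ci_rescaled[0]:.2f}, {ci_rescaled[1]:.2f})")
```

This prints an OR of 0.71 (95% CI: 0.37, 1.36) per 10 years of maternal age. The interval still includes 1, so rescaling changes the presentation but not the conclusion.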