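# 📌 Course Assignment: HW1 (100 points total)

## **Submission Instructions**
📩 **How to submit**: email your assignment to **jiqing@cup.edu.cn**

📌 **Email subject format**:
```ECO-HW1-Name-StudentID```

📂 **File name format**:
```ECO_HW1_Name_StudentID.ipynb```

📅 **Submission deadline**:
👉 **March 24, 2025 (Monday, before class)**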

---

## **Model**
The model we study is:
$
\ln(wage) = \beta_0 + \beta_1 \cdot educ + \beta_2 \cdot exper + \beta_3 \cdot expersq + \epsilon
$

- Here $educ$ is the **endogenous variable**, and the **instruments** are:
  - $constant$ (constant term)
  - $exper$ (work experience)
  - $expersq$ (experience squared)
  - $motheduc$ (mother's education)
  - $fatheduc$ (father's education)



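## **Data Import and Preprocessing**
This assignment uses the **Wooldridge econometrics datasets**. Make sure the `wooldridge` package is installed, then load the **Mroz** dataset with the following code: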

---

In [1]:
import wooldridge as woo
mroz = woo.dataWoo('mroz') # load the Mroz dataset
mroz = mroz.dropna(subset=['lwage']) # drop observations with missing wage data
mroz.head() # show the first few rows

Unnamed: 0,inlf,hours,kidslt6,kidsge6,age,educ,wage,repwage,hushrs,husage,...,faminc,mtr,motheduc,fatheduc,unem,city,exper,nwifeinc,lwage,expersq
0,1,1610,1,0,32,12,3.354,2.65,2708,34,...,16310.0,0.7215,12,7,5.0,0,14,10.91006,1.210154,196
1,1,1656,0,2,30,12,1.3889,2.65,2310,30,...,21800.0,0.6615,7,7,11.0,1,5,19.499981,0.328512,25
2,1,1980,1,3,35,12,4.5455,4.04,3072,40,...,21040.0,0.6915,12,7,5.0,0,15,12.03991,1.514138,225
3,1,456,0,3,34,12,1.0965,3.25,1920,53,...,7300.0,0.7815,7,7,5.0,0,6,6.799996,0.092123,36
4,1,1568,1,2,31,14,4.5918,3.6,2000,32,...,27300.0,0.6215,12,14,9.5,1,7,20.100058,1.524272,49


### **Task 1: Compute the coefficient variances and t-statistics for two-stage least squares (50 points)**


### **Two-Stage Least Squares (2SLS)**
**Closed-form estimator**:
$
\beta = (X'Z(Z'Z)^{-1}Z'X)^{-1} X'Z(Z'Z)^{-1}Z'y
$
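To make the closed-form expression concrete, the following minimal sketch checks on synthetic data (not the Mroz sample) that it coincides with running the two stages explicitly: project $X$ on $Z$, then regress $y$ on the fitted values. The data-generating process and all variable names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic data (hypothetical): two excluded instruments and one exogenous regressor
z1, z2, w = rng.normal(size=(3, n))
v = rng.normal(size=n)                        # shock that makes x_endog endogenous
x_endog = 0.8 * z1 + 0.5 * z2 + v
y = 1.0 + 2.0 * x_endog - 1.0 * w + (0.7 * v + rng.normal(size=n))

X = np.column_stack([np.ones(n), x_endog, w])   # regressors
Z = np.column_stack([np.ones(n), w, z1, z2])    # instruments, including exogenous regressors

# Closed form: beta = (X'Z (Z'Z)^{-1} Z'X)^{-1} X'Z (Z'Z)^{-1} Z'y
ZtZ_inv_ZtX = np.linalg.solve(Z.T @ Z, Z.T @ X)   # (Z'Z)^{-1} Z'X
ZtZ_inv_Zty = np.linalg.solve(Z.T @ Z, Z.T @ y)   # (Z'Z)^{-1} Z'y
beta_closed = np.linalg.solve(X.T @ Z @ ZtZ_inv_ZtX, X.T @ Z @ ZtZ_inv_Zty)

# Two explicit stages: fitted values of X from Z, then OLS of y on them
X_hat = Z @ ZtZ_inv_ZtX
beta_two_stage = np.linalg.lstsq(X_hat, y, rcond=None)[0]

print(np.allclose(beta_closed, beta_two_stage))
```

Using `np.linalg.solve` instead of explicit inverses is a numerical-stability choice; the algebra is the same as in the formula.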

**Estimation results**:
$
\hat{\ln(wage)} = 0.0481 + 0.0614 \cdot educ + 0.0442 \cdot exper - 0.0009 \cdot expersq
$

---

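#### **Task 1.1: Compute the coefficient estimates**
- Use matrix operations to compute the **$\beta$ coefficients**, i.e. the 2SLS **regression coefficient estimates** (10 points)

#### **Task 1.2: Assume independent, homoskedastic errors**
- Use matrix operations to compute the **coefficient variances** and **t-statistics** (10 points)

#### **Task 1.3: Assume independent, heteroskedastic errors**
- Use matrix operations to compute the **coefficient variances** and **t-statistics** (10 points)

#### **Task 1.4: Assume within-group correlated errors (clustered at the city level)**
- Use matrix operations to compute the **coefficient variances (cluster-robust)** and **t-statistics** (10 points)

#### **Task 1.5: Verify the three cases above with `linearmodels.iv`**
- Use `from linearmodels.iv import IV2SLS` to verify, and compare the results (10 points)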

---

In [2]:
import wooldridge as woo
import numpy as np
mroz = woo.dataWoo('mroz') # load the Mroz dataset
mroz = mroz.dropna(subset=['lwage']) # drop observations with missing wage data

# Extract the variables used in the model
ln_wage = np.log(mroz['wage'])
cons = np.ones_like(ln_wage)
educ = mroz['educ']
exper = mroz['exper']
expersq = mroz['expersq']
motheduc = mroz['motheduc']
fatheduc = mroz['fatheduc']

## Task 1.1 Solution:

- 2SLS coefficient estimator:
$\beta = (X'Z(Z'Z)^{-1}Z'X)^{-1} X'Z(Z'Z)^{-1}Z'y$


In [3]:
import numpy as np
import pandas as pd
from scipy.stats import norm

# === Step 1: 2SLS coefficient estimates ===
X = np.c_[cons, educ, exper, expersq]
Z = np.c_[cons, exper, expersq, motheduc, fatheduc]
y = ln_wage
temp1 = np.linalg.inv(Z.T @ Z)
beta = np.linalg.inv(X.T @ Z @ temp1 @ Z.T @ X) @ X.T @ Z @ temp1 @ Z.T @ y

# === Step 2: Collect and print the results ===
results_df_homoskedastic = pd.DataFrame({
    'Coef.': beta.flatten(),
}, index=['constant', 'education', 'experience', 'exper_sq']).round(4)

print("2SLS coefficient estimates: \n")
print(results_df_homoskedastic)

2SLS coefficient estimates: 

             Coef.
constant    0.0481
education   0.0614
experience  0.0442
exper_sq   -0.0009


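## Task 1.2 Solution:

- 2SLS coefficient estimator:
$\beta = (X'Z(Z'Z)^{-1}Z'X)^{-1} X'Z(Z'Z)^{-1}Z'y$

- General form of the 2SLS coefficient variance ($\Omega$ is the error covariance matrix):
$Var[\beta] = (X'Z(Z'Z)^{-1}Z'X)^{-1} X'Z(Z'Z)^{-1}Z' \Omega Z(Z'Z)^{-1}Z'X(X'Z(Z'Z)^{-1}Z'X)^{-1}$

- Expanded form of the 2SLS coefficient variance (homoskedastic case, $\Omega = \sigma^2 I$):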
$Var[\beta] = \sigma^2 (X'Z(Z'Z)^{-1}Z'X)^{-1}$


In [4]:
import numpy as np
import pandas as pd
from scipy.stats import norm

# === Step 1: 2SLS coefficient estimates ===
X = np.c_[cons, educ, exper, expersq]
Z = np.c_[cons, exper, expersq, motheduc, fatheduc]
y = ln_wage
temp1 = np.linalg.inv(Z.T @ Z)
beta = np.linalg.inv(X.T @ Z @ temp1 @ Z.T @ X) @ X.T @ Z @ temp1 @ Z.T @ y

# === Step 2: Residuals and sigma^2 ===
u = y - X @ beta
n = len(y)
k = X.shape[1]
sigma_squared = (u.T @ u) / (n-k)  # Option 1: small-sample adjustment (n-k)
# sigma_squared = (u.T @ u) / n  # Option 2: large-sample approximation (n)

# === Step 3: Covariance matrix & standard errors ===
temp2 = np.linalg.inv(X.T @ Z @ temp1 @ Z.T @ X)
beta_var = sigma_squared * temp2
beta_std = np.sqrt(np.diag(beta_var))

# === Step 4: t-statistics & p-values (normal approximation) ===
t_stats = beta.flatten() / beta_std  # flatten ensures a 1-D array
p_values = 2 * (1 - norm.cdf(np.abs(t_stats)))  # large-sample normal approximation

# === Step 5: Collect and print the results ===
results_df_homoskedastic = pd.DataFrame({
    'Coef.': beta.flatten(),
    'Std.Err.': beta_std,
    'T-stat': t_stats,
    'P-value': p_values
}, index=['constant', 'education', 'experience', 'exper_sq']).round(4)

print("2SLS results (conventional standard errors, small-sample adjustment, n-k): \n")
print(results_df_homoskedastic)

2SLS results (conventional standard errors, small-sample adjustment, n-k): 

             Coef.  Std.Err.  T-stat  P-value
constant    0.0481    0.4003  0.1202   0.9044
education   0.0614    0.0314  1.9530   0.0508
experience  0.0442    0.0134  3.2883   0.0010
exper_sq   -0.0009    0.0004 -2.2380   0.0252


In [5]:
import numpy as np
import pandas as pd
from scipy.stats import norm

# === Step 1: 2SLS coefficient estimates ===
X = np.c_[cons, educ, exper, expersq]
Z = np.c_[cons, exper, expersq, motheduc, fatheduc]
y = ln_wage
temp1 = np.linalg.inv(Z.T @ Z)
beta = np.linalg.inv(X.T @ Z @ temp1 @ Z.T @ X) @ X.T @ Z @ temp1 @ Z.T @ y

# === Step 2: Residuals and sigma^2 ===
u = y - X @ beta
n = len(y)
k = X.shape[1]
# sigma_squared = (u.T @ u) / (n-k)  # Option 1: small-sample adjustment (n-k)
sigma_squared = (u.T @ u) / n  # Option 2: large-sample approximation (n)

# === Step 3: Covariance matrix & standard errors ===
temp2 = np.linalg.inv(X.T @ Z @ temp1 @ Z.T @ X)
beta_var = sigma_squared * temp2
beta_std = np.sqrt(np.diag(beta_var))

# === Step 4: t-statistics & p-values (normal approximation) ===
t_stats = beta.flatten() / beta_std  # flatten ensures a 1-D array
p_values = 2 * (1 - norm.cdf(np.abs(t_stats)))  # large-sample normal approximation

# === Step 5: Collect and print the results ===
results_df_homoskedastic = pd.DataFrame({
    'Coef.': beta.flatten(),
    'Std.Err.': beta_std,
    'T-stat': t_stats,
    'P-value': p_values
}, index=['constant', 'education', 'experience', 'exper_sq']).round(4)

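print("2SLS results (conventional standard errors, large-sample approximation, n): \n")
print(results_df_homoskedastic)

2SLS results (conventional standard errors, large-sample approximation, n): 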

             Coef.  Std.Err.  T-stat  P-value
constant    0.0481    0.3985  0.1207   0.9039
education   0.0614    0.0313  1.9622   0.0497
experience  0.0442    0.0134  3.3038   0.0010
exper_sq   -0.0009    0.0004 -2.2485   0.0245
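The two homoskedastic tables differ only in the divisor of $\hat{\sigma}^2$, so every standard error in the $n-k$ version should be the $n$ version multiplied by $\sqrt{n/(n-k)}$, and every t-statistic divided by the same factor. A small arithmetic check, assuming $n = 428$ (the size of the working sample after dropping missing `lwage`) and $k = 4$, and using the rounded figures printed above:

```python
import math

n, k = 428, 4                        # observations and regressors (assumed from the sample above)
ratio = math.sqrt(n / (n - k))       # SE with divisor (n-k) relative to SE with divisor n

se_n = 0.3985                        # constant: Std.Err. from the table that divides by n
t_n = 1.9622                         # education: T-stat from the table that divides by n

se_nk = se_n * ratio                 # implied Std.Err. under the (n-k) divisor
t_nk = t_n / ratio                   # implied T-stat under the (n-k) divisor
print(ratio, se_nk, t_nk)
```

Up to rounding of the inputs, the implied values agree with the 0.4003 and 1.9530 reported in the $n-k$ table.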


## Task 1.5 (homoskedastic) verification:

In [6]:
from linearmodels.iv import IV2SLS
import pandas as pd

# === Step 1: Build the variables (as DataFrames so every column is named) ===
exog = pd.DataFrame({'constant': cons, 'experience': exper, 'exper_sq': expersq})
endog = pd.DataFrame({'education': educ})
instruments = pd.DataFrame({'mother_educ': motheduc, 'father_educ': fatheduc})
dependent = pd.DataFrame({'ln_wage': ln_wage})

# === Step 2: Build the IV model (homoskedastic covariance estimator) ===
model = IV2SLS(dependent, exog, endog, instruments)
results = model.fit(cov_type="homoskedastic")

# === Step 3: Print the results ===
print(results.summary)

                          IV-2SLS Estimation Summary                          
Dep. Variable:                ln_wage   R-squared:                      0.1357
Estimator:                    IV-2SLS   Adj. R-squared:                 0.1296
No. Observations:                 428   F-statistic:                    24.653
Date:                Tue, Mar 25 2025   P-value (F-stat)                0.0000
Time:                        11:03:28   Distribution:                  chi2(3)
Cov. Estimator:         homoskedastic                                         
                                                                              
                             Parameter Estimates                              
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
constant       0.0481     0.3985     0.1207     0.9039     -0.7329      0.8291
experience     0.0442     0.0134     3.3038     0.00

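## Task 1.3 Solution:

- 2SLS coefficient estimator:
$\beta = (X'Z(Z'Z)^{-1}Z'X)^{-1} X'Z(Z'Z)^{-1}Z'y$

- General form of the 2SLS coefficient variance ($\Omega$ is the error covariance matrix):
$Var[\beta] = (X'Z(Z'Z)^{-1}Z'X)^{-1} X'Z(Z'Z)^{-1}Z' \Omega Z(Z'Z)^{-1}Z'X(X'Z(Z'Z)^{-1}Z'X)^{-1}$

- Expanded form of the 2SLS coefficient variance (heteroskedastic case):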
$
Var[\hat{\beta}] = (X^{\prime}Z (Z^{\prime}Z)^{-1} Z^{\prime}X)^{-1}
X^{\prime}Z (Z^{\prime}Z)^{-1}
\left( \sum_{i=1}^{N} Z_i^{\prime} \widehat{u}_i \widehat{u}_i^{\prime} Z_i \right)
(Z^{\prime}Z)^{-1} Z^{\prime}X
(X^{\prime}Z (Z^{\prime}Z)^{-1} Z^{\prime}X)^{-1}
$
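The middle term $\sum_i Z_i' \hat{u}_i \hat{u}_i' Z_i$ equals $Z' \,\mathrm{diag}(\hat{u}^2)\, Z$, but it can be formed without building an $n \times n$ diagonal matrix. A minimal sketch on random inputs (the arrays below are placeholders, not the Mroz data) comparing the two ways of computing it:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 200, 5
Z = rng.normal(size=(n, m))          # hypothetical instrument matrix
u = rng.normal(size=n)               # hypothetical residuals

# Textbook form: Z' diag(u^2) Z, which materialises an n x n matrix
meat_dense = Z.T @ np.diag(u**2) @ Z

# Equivalent form: scale each row of Z by its residual, then take the cross product
Zu = Z * u[:, None]
meat_rows = Zu.T @ Zu

print(np.allclose(meat_dense, meat_rows))
```

The row-scaling form uses $O(nm)$ memory rather than $O(n^2)$, which matters little at this sample size but helps on larger data.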

In [7]:
import numpy as np
import pandas as pd
from scipy.stats import norm  # normal approximation

# === Step 1: Estimate β̂ by hand ===
ZTZ_inv = np.linalg.inv(Z.T @ Z)
beta = np.linalg.inv(X.T @ Z @ ZTZ_inv @ Z.T @ X) @ X.T @ Z @ ZTZ_inv @ Z.T @ y

# === Step 2: Residuals ===
u = y - X @ beta

# === Step 3: White heteroskedasticity-robust covariance matrix ===
sigma_hat_matrix = np.diagflat(u**2)

A = np.linalg.inv(X.T @ Z @ ZTZ_inv @ Z.T @ X) @ X.T @ Z @ ZTZ_inv @ Z.T
beta_cov = A @ sigma_hat_matrix @ A.T
beta_std = np.sqrt(np.diag(beta_cov))

# === Step 4: t-statistics & p-values ===
t_stats = beta.flatten() / beta_std
p_values = 2 * (1 - norm.cdf(np.abs(t_stats)))  # two-sided p-values (normal distribution)

# === Step 5: Collect the results ===
results_df_heteroskedastic = pd.DataFrame({
    'Coef.': beta.flatten(),
    'Std.Err.': beta_std,
    'T-stat': t_stats,
    'P-value': p_values
}, index=['constant', 'education', 'experience', 'exper_sq']).round(4)

# === Step 6: Print the results ===
print("2SLS results (robust standard errors): \n")
print(results_df_heteroskedastic)

2SLS results (robust standard errors): 

             Coef.  Std.Err.  T-stat  P-value
constant    0.0481    0.4278  0.1124   0.9105
education   0.0614    0.0332  1.8503   0.0643
experience  0.0442    0.0155  2.8546   0.0043
exper_sq   -0.0009    0.0004 -2.1001   0.0357


## Task 1.5 (heteroskedastic) verification:

In [8]:
from linearmodels.iv import IV2SLS
import pandas as pd

# === Step 1: Build the variables (as DataFrames so every column is named) ===
exog = pd.DataFrame({'constant': cons, 'experience': exper, 'exper_sq': expersq})
endog = pd.DataFrame({'education': educ})
instruments = pd.DataFrame({'mother_educ': motheduc, 'father_educ': fatheduc})
dependent = pd.DataFrame({'ln_wage': ln_wage})

# === Step 2: Build the IV model (heteroskedasticity-robust covariance estimator) ===
model = IV2SLS(dependent, exog, endog, instruments)
results = model.fit(cov_type = "heteroskedastic")  # robust standard errors

# === Step 3: Print the results ===
print(results.summary)

                          IV-2SLS Estimation Summary                          
Dep. Variable:                ln_wage   R-squared:                      0.1357
Estimator:                    IV-2SLS   Adj. R-squared:                 0.1296
No. Observations:                 428   F-statistic:                    18.611
Date:                Tue, Mar 25 2025   P-value (F-stat)                0.0003
Time:                        11:03:29   Distribution:                  chi2(3)
Cov. Estimator:       heteroskedastic                                         
                                                                              
                             Parameter Estimates                              
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
constant       0.0481     0.4278     0.1124     0.9105     -0.7903      0.8865
experience     0.0442     0.0155     2.8546     0.00

## Task 1.4 Solution:

- 2SLS coefficient estimator:
$\beta = (X'Z(Z'Z)^{-1}Z'X)^{-1} X'Z(Z'Z)^{-1}Z'y$

- General form of the 2SLS coefficient variance ($\Omega$ is the error covariance matrix):
$Var[\beta] = (X'Z(Z'Z)^{-1}Z'X)^{-1} X'Z(Z'Z)^{-1}Z' \Omega Z(Z'Z)^{-1}Z'X(X'Z(Z'Z)^{-1}Z'X)^{-1}$

- Expanded form of the 2SLS coefficient variance (cluster-robust), form 1:
$$
Var[\hat{\beta}] = (X^{\prime}Z (Z^{\prime}Z)^{-1} Z^{\prime}X)^{-1}
 X^{\prime}Z (Z^{\prime}Z)^{-1}
\left( \sum_{g=1}^{G} Z_g^{\prime} \widehat{u}_g \widehat{u}_g^{\prime} Z_g \right)
(Z^{\prime}Z)^{-1} Z^{\prime}X 
(X^{\prime}Z (Z^{\prime}Z)^{-1} Z^{\prime}X)^{-1}
$$

- Form 2, written with the rows of the projected regressors $Z (Z^{\prime}Z)^{-1} Z^{\prime}X$ that belong to cluster $g$:
$$
Var[\hat{\beta}] = (X^{\prime}Z (Z^{\prime}Z)^{-1} Z^{\prime}X)^{-1}
\left( \sum_{g=1}^{G}
 ((Z (Z^{\prime}Z)^{-1}
 Z^{\prime}X)_g)^{\prime} \widehat{u}_g \widehat{u}_g^{\prime} (Z (Z^{\prime}Z)^{-1}
 Z^{\prime}X)_g
\right)
(X^{\prime}Z (Z^{\prime}Z)^{-1} Z^{\prime}X)^{-1}
$$
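The two cluster-robust expressions above are algebraically the same, because $((Z(Z'Z)^{-1}Z'X)_g)' \hat{u}_g = X'Z(Z'Z)^{-1} Z_g' \hat{u}_g$. The sketch below illustrates this numerically on synthetic inputs; the data, the six hypothetical clusters, and the variable names are assumptions for illustration, and no degrees-of-freedom correction is applied.

```python
import numpy as np

rng = np.random.default_rng(2)
n, G = 300, 6
Z = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])                  # hypothetical instruments
X = np.column_stack([np.ones(n), Z[:, 1] + rng.normal(size=n), Z[:, 2]])    # hypothetical regressors
u = rng.normal(size=n)                                                      # hypothetical residuals
groups = rng.integers(0, G, size=n)                                         # hypothetical cluster labels

ZtZ_inv = np.linalg.inv(Z.T @ Z)
bread = np.linalg.inv(X.T @ Z @ ZtZ_inv @ Z.T @ X)
PzX = Z @ ZtZ_inv @ Z.T @ X                    # projected regressors

S_z = np.zeros((Z.shape[1], Z.shape[1]))       # middle matrix in instrument space
S_x = np.zeros((X.shape[1], X.shape[1]))       # middle matrix in projected-regressor space
for g in np.unique(groups):
    idx = groups == g
    score_z = Z[idx].T @ u[idx]                # Z_g' u_g
    score_x = PzX[idx].T @ u[idx]              # (P_Z X)_g' u_g
    S_z += np.outer(score_z, score_z)
    S_x += np.outer(score_x, score_x)

A = bread @ X.T @ Z @ ZtZ_inv
cov_form1 = A @ S_z @ A.T                      # first expression
cov_form2 = bread @ S_x @ bread                # second expression

print(np.allclose(cov_form1, cov_form2))
```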

In [9]:
import numpy as np
import pandas as pd
from scipy.stats import norm  # standard normal distribution for the p-values

# === Step 1: Estimate β̂ by hand (same as in the White case) ===
ZTZ_inv = np.linalg.inv(Z.T @ Z)
beta = np.linalg.inv(X.T @ Z @ ZTZ_inv @ Z.T @ X) @ X.T @ Z @ ZTZ_inv @ Z.T @ y

# === Step 2: Residuals ===
u = y - X @ beta

# === Step 3: Cluster-robust covariance matrix (accumulated cluster by cluster) ===
cluster_ids = mroz['city'].values        # clustering variable (city)
cluster_labels = np.unique(cluster_ids)  # unique cluster labels
G = len(cluster_labels)                  # number of clusters

S = np.zeros((Z.shape[1], Z.shape[1]))   # initialise the middle matrix S

# Loop over the clusters and accumulate
for g in cluster_labels:
    idx = (cluster_ids == g)             # rows belonging to the current cluster
    Z_g = Z[idx, :]
    u_g = u[idx].to_numpy().reshape(-1, 1)
    S += Z_g.T @ u_g @ u_g.T @ Z_g       # this cluster's contribution to the middle matrix


# === Step 4: Covariance matrix ===
N = len(y)
K = X.shape[1]
dof_correction = (G / (G - 1)) * ((N - 1) / (N - K))  # additional degrees-of-freedom correction
A = np.linalg.inv(X.T @ Z @ ZTZ_inv @ Z.T @ X) @ X.T @ Z @ ZTZ_inv
beta_cov = A @ S @ A.T * dof_correction                   # cluster-robust covariance matrix
beta_std = np.sqrt(np.diag(beta_cov))   # standard errors

# === Step 5: t-statistics & p-values ===
t_stats = beta.flatten() / beta_std
p_values = 2 * (1 - norm.cdf(np.abs(t_stats)))  # normal approximation (large sample)

# === Step 6: Results table ===
results_df_cluster = pd.DataFrame({
    'Coef.': beta.flatten(),
    'Std.Err.': beta_std,
    'T-stat': t_stats,
    'P-value': p_values
}, index=['constant', 'education', 'experience', 'exper_sq']).round(4)

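print("2SLS results (cluster-robust standard errors, with degrees-of-freedom correction, dof_correction):\n")
print(results_df_cluster)

2SLS results (cluster-robust standard errors, with degrees-of-freedom correction, dof_correction):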

             Coef.  Std.Err.  T-stat  P-value
constant    0.0481    0.1755  0.2741   0.7840
education   0.0614    0.0179  3.4384   0.0006
experience  0.0442    0.0112  3.9568   0.0001
exper_sq   -0.0009    0.0002 -3.8472   0.0001


In [10]:
import numpy as np
import pandas as pd
from scipy.stats import norm  # standard normal distribution for the p-values

# === Step 1: Estimate β̂ by hand (same as in the White case) ===
ZTZ_inv = np.linalg.inv(Z.T @ Z)
beta = np.linalg.inv(X.T @ Z @ ZTZ_inv @ Z.T @ X) @ X.T @ Z @ ZTZ_inv @ Z.T @ y

# === Step 2: Residuals ===
u = y - X @ beta

# === Step 3: Cluster-robust covariance matrix (form 2: accumulate using the rows of P_Z X) ===
cluster_ids = mroz['city'].values        # clustering variable (city)
cluster_labels = np.unique(cluster_ids)  # unique cluster labels
G = len(cluster_labels)                  # number of clusters

Pz = Z @ np.linalg.inv(Z.T @ Z) @ Z.T
PzX = Pz @ X
S = np.zeros((X.shape[1], X.shape[1]))   # initialise the middle matrix S
# Loop over the clusters and accumulate
for g in cluster_labels:
    idx = (cluster_ids == g)             # rows belonging to the current cluster
    Xg_hat = PzX[idx, :]                 # rows of P_Z X for cluster g (was the undefined name X_hat)
    u_g = u[idx].to_numpy().reshape(-1, 1)
    S += Xg_hat.T @ u_g @ u_g.T @ Xg_hat


# === Step 4: Covariance matrix ===
N = len(y)
K = X.shape[1]
dof_correction = (G / (G - 1)) * ((N - 1) / (N - K))  # additional degrees-of-freedom correction
A = np.linalg.inv(X.T @ Z @ ZTZ_inv @ Z.T @ X)
beta_cov = A @ S @ A.T * dof_correction
beta_std = np.sqrt(np.diag(beta_cov))   # standard errors

# === Step 5: t-statistics & p-values ===
t_stats = beta.flatten() / beta_std
p_values = 2 * (1 - norm.cdf(np.abs(t_stats)))  # normal approximation (large sample)

# === Step 6: Results table ===
results_df_cluster = pd.DataFrame({
    'Coef.': beta.flatten(),
    'Std.Err.': beta_std,
    'T-stat': t_stats,
    'P-value': p_values
}, index=['constant', 'education', 'experience', 'exper_sq']).round(4)

print("2SLS results (cluster-robust standard errors, form 2, with degrees-of-freedom correction):\n")
print(results_df_cluster)

In [None]:
import numpy as np
import pandas as pd
from scipy.stats import norm  # standard normal distribution for the p-values

# === Step 1: Estimate β̂ by hand (same as in the White case) ===
ZTZ_inv = np.linalg.inv(Z.T @ Z)
beta = np.linalg.inv(X.T @ Z @ ZTZ_inv @ Z.T @ X) @ X.T @ Z @ ZTZ_inv @ Z.T @ y

# === Step 2: Residuals ===
u = y - X @ beta

# === Step 3: Cluster-robust covariance matrix (accumulated cluster by cluster) ===
cluster_ids = mroz['city'].values        # clustering variable (city)
cluster_labels = np.unique(cluster_ids)  # unique cluster labels
G = len(cluster_labels)                  # number of clusters

S = np.zeros((Z.shape[1], Z.shape[1]))   # initialise the middle matrix S

# Loop over the clusters and accumulate
for g in cluster_labels:
    idx = (cluster_ids == g)             # rows belonging to the current cluster
    Z_g = Z[idx, :]
    u_g = u[idx].to_numpy().reshape(-1, 1)
    S += Z_g.T @ u_g @ u_g.T @ Z_g       # this cluster's contribution to the middle matrix

# === Step 4: Covariance matrix (no degrees-of-freedom correction) ===
A = np.linalg.inv(X.T @ Z @ ZTZ_inv @ Z.T @ X) @ X.T @ Z @ ZTZ_inv
beta_cov = A @ S @ A.T
beta_std = np.sqrt(np.diag(beta_cov))   # standard errors

# === Step 5: t-statistics & p-values ===
t_stats = beta.flatten() / beta_std
p_values = 2 * (1 - norm.cdf(np.abs(t_stats)))  # normal approximation (large sample)

# === Step 6: Results table ===
results_df_cluster = pd.DataFrame({
    'Coef.': beta.flatten(),
    'Std.Err.': beta_std,
    'T-stat': t_stats,
    'P-value': p_values
}, index=['constant', 'education', 'experience', 'exper_sq']).round(4)

print("2SLS results (cluster-robust standard errors, no degrees-of-freedom correction):\n")
print(results_df_cluster)

## Task 1.5 (cluster-robust) verification:

In [None]:
from linearmodels.iv import IV2SLS
import pandas as pd

# === Step 1: Build the variables (as DataFrames so every column is named) ===
exog = pd.DataFrame({'constant': cons, 'experience': exper, 'exper_sq': expersq})
endog = pd.DataFrame({'education': educ})
instruments = pd.DataFrame({'mother_educ': motheduc, 'father_educ': fatheduc})
dependent = pd.DataFrame({'ln_wage': ln_wage})

# === Step 2: Build the IV model (cluster-robust standard errors) ===
model = IV2SLS(dependent, exog, endog, instruments)
results = model.fit(cov_type='clustered', clusters=mroz['city'])

# === Step 3: Print the results ===
print(results.summary)

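### **Task 2: Compute the coefficient variances and t-statistics for the generalized method of moments (GMM) (50 points)**

### **Generalized Method of Moments (GMM)**
**Two-step estimator**: $\hat{\beta}_{gmm} = (X' Z \Phi Z' X)^{-1} X' Z \Phi Z' y$
- Here the **optimal weighting matrix** is computed in two steps:
  - Step 1: obtain the residuals $\hat{\epsilon}^{(1)}$ from the estimate that uses $\Phi = (Z'Z)^{-1}$ (i.e. 2SLS)
  - Step 2: $\Phi = \left( \sum_{i=1}^{N} Z_i^{\prime} \hat{\epsilon}_i^{(1)} \hat{\epsilon}_i^{(1)\prime} Z_i \right)^{-1}$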
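The two steps can be sketched as follows on synthetic, heteroskedastic data. This is an illustrative sketch, not the Mroz computation: the helper `gmm_linear`, the data-generating process, and all names are assumptions. The last line also illustrates that rescaling the weighting matrix (for example dividing the moment covariance by $n$) leaves the point estimate unchanged.

```python
import numpy as np

def gmm_linear(X, Z, y, W):
    """Linear GMM estimate (X'Z W Z'X)^{-1} X'Z W Z'y for a given weight matrix W."""
    XZ = X.T @ Z
    return np.linalg.solve(XZ @ W @ Z.T @ X, XZ @ W @ Z.T @ y)

rng = np.random.default_rng(3)
n = 400
z = rng.normal(size=(n, 3))                         # hypothetical excluded instruments
v = rng.normal(size=n)
x_endog = z @ np.array([0.6, 0.4, 0.3]) + v         # endogenous regressor
eps = (0.5 * v + rng.normal(size=n)) * (1 + 0.5 * np.abs(z[:, 0]))  # heteroskedastic error
y = 0.5 + 1.5 * x_endog + eps

X = np.column_stack([np.ones(n), x_endog])
Z = np.column_stack([np.ones(n), z])

# Step 1: 2SLS weight, Phi = (Z'Z)^{-1}
W1 = np.linalg.inv(Z.T @ Z)
beta_step1 = gmm_linear(X, Z, y, W1)

# Step 2: weight built from the first-step residuals, Phi = (sum_i e_i^2 z_i z_i')^{-1}
u1 = y - X @ beta_step1
Zu = Z * u1[:, None]
W2 = np.linalg.inv(Zu.T @ Zu)
beta_step2 = gmm_linear(X, Z, y, W2)

# Rescaling the weight matrix does not change the estimate
beta_step2_scaled = gmm_linear(X, Z, y, W2 * n)
print(beta_step1, beta_step2, np.allclose(beta_step2, beta_step2_scaled))
```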

**Estimation results**:
$\hat{\ln(wage)} = 0.0477 + 0.0611 \cdot educ + 0.0451 \cdot exper - 0.0009 \cdot expersq$

---

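#### **Task 2.1: Compute the coefficient estimates**
- Use matrix operations to compute the **$\beta$ coefficients**, i.e. the GMM **regression coefficient estimates** (10 points)

#### **Task 2.2: Assume independent, homoskedastic errors**
- Use matrix operations to compute the **coefficient variances** and **t-statistics** (10 points)

#### **Task 2.3: Assume independent, heteroskedastic errors**
- Use matrix operations to compute the **coefficient variances** and **t-statistics** (10 points)

#### **Task 2.4: Assume within-group correlated errors (clustered at the city level)**
- Use matrix operations to compute the **coefficient variances (cluster-robust)** and **t-statistics** (10 points)

#### **Task 2.5: Verify the three cases above with `linearmodels.iv`**
- Use `from linearmodels.iv import IVGMM` to verify, and compare the results (10 points)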

---

## Task 2.1 Solution:

- GMM coefficient estimator:
$\hat{\beta}^{(2)} = (X' Z \Phi^{(2)} Z' X)^{-1} X' Z \Phi^{(2)} Z' y$


In [None]:
import numpy as np
import pandas as pd
from scipy.stats import norm

# === Step 1: Build the X, Z, y matrices ===
X = np.c_[cons, educ, exper, expersq]
Z = np.c_[cons, exper, expersq, motheduc, fatheduc]
y = ln_wage
n = len(y)

# === Step 2: Initial estimate (2SLS, used as the first GMM step) ===
Phi_1 = np.linalg.inv(Z.T @ Z)
beta_2sls = np.linalg.inv(X.T @ Z @ Phi_1 @ Z.T @ X) @ X.T @ Z @ Phi_1 @ Z.T @ y

# === Step 3: Build the weighting matrix from the residuals (estimate of Ω̂) ===
u_hat = y - X @ beta_2sls
Zu = Z * u_hat.values.reshape(-1, 1)
Phi_2 = np.linalg.inv((Zu.T @ Zu) / n)  # GMM weighting matrix (second step)

# === Step 4: Two-step GMM estimate β̂_GMM ===
beta_gmm = np.linalg.inv(X.T @ Z @ Phi_2 @ Z.T @ X) @ X.T @ Z @ Phi_2 @ Z.T @ y


results_df_gmm = pd.DataFrame({
    'Coef.': beta_gmm.flatten(),
}, index=['constant', 'education', 'experience', 'exper_sq']).round(4)

print("Two-Step GMM coefficient estimates: \n")
print(results_df_gmm)

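## Task 2.2 Solution:

- GMM coefficient estimator:
$\hat{\beta}^{(2)} = (X' Z \Phi^{(2)} Z' X)^{-1} X' Z \Phi^{(2)} Z' y$

- General form of the GMM coefficient variance ($\Omega$ is the error covariance matrix):
$Var[\beta^{(2)}] = (X'Z\Phi^{(2)}Z'X)^{-1} X'Z\Phi^{(2)}Z' \Omega Z\Phi^{(2)}Z'X(X'Z\Phi^{(2)}Z'X)^{-1}$

- Expanded form of the GMM coefficient variance (homoskedastic case, $\Omega = \sigma^2 I$):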
$Var[\beta] = \sigma^2 (X'Z\Phi^{(2)}Z'X)^{-1} X'Z\Phi^{(2)}Z' Z\Phi^{(2)}Z'X(X'Z\Phi^{(2)}Z'X)^{-1}$
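A useful sanity check on the homoskedastic sandwich: if the weighting matrix is set to the 2SLS choice $(Z'Z)^{-1}$, the middle terms cancel and the expression reduces to $\sigma^2 (X'Z(Z'Z)^{-1}Z'X)^{-1}$, the formula used in Task 1.2. The sketch below verifies this on synthetic matrices; the helper `homoskedastic_cov`, the data, and the value of `sigma2` are hypothetical.

```python
import numpy as np

def homoskedastic_cov(X, Z, W, sigma2):
    """sigma^2 * B X'Z W (Z'Z) W Z'X B with B = (X'Z W Z'X)^{-1}; W is assumed symmetric."""
    XZW = X.T @ Z @ W
    B = np.linalg.inv(XZW @ Z.T @ X)
    return sigma2 * B @ XZW @ (Z.T @ Z) @ XZW.T @ B

rng = np.random.default_rng(4)
n = 250
Z = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])     # hypothetical instruments
x_endog = Z[:, 1:] @ np.array([0.5, 0.3, 0.2]) + rng.normal(size=n)
X = np.column_stack([np.ones(n), x_endog])                     # hypothetical regressors
sigma2 = 1.7                                                   # hypothetical error variance

W_2sls = np.linalg.inv(Z.T @ Z)
cov_sandwich = homoskedastic_cov(X, Z, W_2sls, sigma2)
cov_short = sigma2 * np.linalg.inv(X.T @ Z @ W_2sls @ Z.T @ X)

print(np.allclose(cov_sandwich, cov_short))
```

With the second-step weight $\Phi^{(2)}$ the terms no longer cancel, which is why the full sandwich is needed in this task.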

In [None]:
import numpy as np
import pandas as pd
from scipy.stats import norm

# === Step 1: Build the X, Z, y matrices ===
X = np.c_[cons, educ, exper, expersq]
Z = np.c_[cons, exper, expersq, motheduc, fatheduc]
y = ln_wage
n = len(y)
k = X.shape[1]  # number of regressors, needed for the (n-k) divisor below

# === Step 2: Initial estimate (2SLS, used as the first GMM step) ===
Phi_1 = np.linalg.inv(Z.T @ Z)
beta_2sls = np.linalg.inv(X.T @ Z @ Phi_1 @ Z.T @ X) @ X.T @ Z @ Phi_1 @ Z.T @ y

# === Step 3: Build the weighting matrix from the residuals (estimate of Ω̂) ===
u_hat = y - X @ beta_2sls
Zu = Z * u_hat.values.reshape(-1, 1)
Phi_2 = np.linalg.inv((Zu.T @ Zu) / n)  # GMM weighting matrix (second step)

# === Step 4: Two-step GMM estimate β̂_GMM ===
beta_gmm = np.linalg.inv(X.T @ Z @ Phi_2 @ Z.T @ X) @ X.T @ Z @ Phi_2 @ Z.T @ y

# === Step 5: Covariance matrix & standard errors by hand (homoskedastic assumption) ===
u = y - X @ beta_gmm
sigma_squared = (u.T @ u) / (n-k)  # Option 1: small-sample adjustment (n-k)
# sigma_squared = (u.T @ u) / n  # Option 2: large-sample approximation (n)

A = np.linalg.inv(X.T @ Z @ Phi_2 @ Z.T @ X) @ X.T @ Z @ Phi_2 @ Z.T
beta_cov = sigma_squared * (A @ A.T)
beta_std = np.sqrt(np.diag(beta_cov))

# === Step 6: t-statistics, p-values, and results ===
t_stats = beta_gmm.flatten() / beta_std
p_values = 2 * (1 - norm.cdf(np.abs(t_stats)))  # normal approximation

results_df_gmm = pd.DataFrame({
    'Coef.': beta_gmm.flatten(),
    'Std.Err.': beta_std,
    'T-stat': t_stats,
    'P-value': p_values
}, index=['constant', 'education', 'experience', 'exper_sq']).round(4)

print("Two-Step GMM results (conventional standard errors, small-sample adjustment, n-k): \n")
print(results_df_gmm)

In [None]:
import numpy as np
import pandas as pd
from scipy.stats import norm

# === Step 1: Build the X, Z, y matrices ===
X = np.c_[cons, educ, exper, expersq]
Z = np.c_[cons, exper, expersq, motheduc, fatheduc]
y = ln_wage
n = len(y)

# === Step 2: Initial estimate (2SLS, used as the first GMM step) ===
Phi_1 = np.linalg.inv(Z.T @ Z)
beta_2sls = np.linalg.inv(X.T @ Z @ Phi_1 @ Z.T @ X) @ X.T @ Z @ Phi_1 @ Z.T @ y

# === Step 3: Build the weighting matrix from the residuals (estimate of Ω̂) ===
u_hat = y - X @ beta_2sls
Zu = Z * u_hat.values.reshape(-1, 1)
Phi_2 = np.linalg.inv((Zu.T @ Zu) / n)  # GMM weighting matrix (second step)

# === Step 4: Two-step GMM estimate β̂_GMM ===
beta_gmm = np.linalg.inv(X.T @ Z @ Phi_2 @ Z.T @ X) @ X.T @ Z @ Phi_2 @ Z.T @ y

# === Step 5: Covariance matrix & standard errors by hand (homoskedastic assumption) ===
u = y - X @ beta_gmm
# sigma_squared = (u.T @ u) / (n-k)  # Option 1: small-sample adjustment (n-k)
sigma_squared = (u.T @ u) / n  # Option 2: large-sample approximation (n)

A = np.linalg.inv(X.T @ Z @ Phi_2 @ Z.T @ X) @ X.T @ Z @ Phi_2 @ Z.T
beta_cov = sigma_squared * (A @ A.T)
beta_std = np.sqrt(np.diag(beta_cov))

# === Step 6: t-statistics, p-values, and results ===
t_stats = beta_gmm.flatten() / beta_std
p_values = 2 * (1 - norm.cdf(np.abs(t_stats)))  # normal approximation

results_df_gmm = pd.DataFrame({
    'Coef.': beta_gmm.flatten(),
    'Std.Err.': beta_std,
    'T-stat': t_stats,
    'P-value': p_values
}, index=['constant', 'education', 'experience', 'exper_sq']).round(4)

print("Two-Step GMM results (conventional standard errors, large-sample approximation, n): \n")
print(results_df_gmm)

## Task 2.5 (homoskedastic) verification:

In [None]:
from linearmodels.iv import IVGMM
import pandas as pd

# === Step 1: Build the variables (as DataFrames with clear column names) ===
exog = pd.DataFrame({'constant': cons,'experience': exper,'exper_sq': expersq})
endog = pd.DataFrame({'education': educ})
instruments = pd.DataFrame({'mother_educ': motheduc,'father_educ': fatheduc})
dependent = pd.DataFrame({'ln_wage': ln_wage})

# === Step 2: Build the IV-GMM model and request the homoskedastic covariance estimator ===
model = IVGMM(dependent, exog, endog, instruments)
results = model.fit(cov_type='homoskedastic')  # other options: "robust", "clustered", "kernel"

# === Step 3: Print the results ===
print("IV-GMM Estimation Results:\n")
print(results.summary)

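## Task 2.3 Solution:

- GMM coefficient estimator:
$\hat{\beta}^{(2)} = (X' Z \Phi^{(2)} Z' X)^{-1} X' Z \Phi^{(2)} Z' y$

- General form of the GMM coefficient variance ($\Omega$ is the error covariance matrix):
$Var[\beta^{(2)}] = (X'Z\Phi^{(2)}Z'X)^{-1} X'Z\Phi^{(2)}Z' \Omega Z\Phi^{(2)}Z'X(X'Z\Phi^{(2)}Z'X)^{-1}$

- Expanded form of the GMM coefficient variance (heteroskedastic case):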
$Var[\beta] = (X'Z\Phi^{(2)}Z'X)^{-1} X'Z\Phi^{(2)}\left( \sum_{i=1}^{N} Z_i^{\prime} \widehat{u}_i \widehat{u}_i^{\prime} Z_i \right)\Phi^{(2)}Z'X(X'Z\Phi^{(2)}Z'X)^{-1}$
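When the weighting matrix is the inverse of the same middle matrix, $\Phi = \left(\sum_i Z_i' \hat{u}_i \hat{u}_i' Z_i\right)^{-1}$ built from the same residuals, the robust sandwich collapses to $(X'Z \Phi Z'X)^{-1}$. The sketch below checks this on synthetic inputs; everything in it is hypothetical. In the workflow of this assignment the weight comes from first-step residuals while the middle matrix uses second-step residuals, so the collapse holds only approximately there.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 250
Z = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])     # hypothetical instruments
x_endog = Z[:, 1:] @ np.array([0.5, 0.3, 0.2]) + rng.normal(size=n)
X = np.column_stack([np.ones(n), x_endog])                     # hypothetical regressors
u = rng.normal(size=n) * (1 + np.abs(Z[:, 1]))                 # hypothetical heteroskedastic residuals

Zu = Z * u[:, None]
S = Zu.T @ Zu                       # sum_i u_i^2 z_i z_i'
Phi = np.linalg.inv(S)              # efficient weight built from the SAME residuals

B = np.linalg.inv(X.T @ Z @ Phi @ Z.T @ X)
cov_sandwich = B @ X.T @ Z @ Phi @ S @ Phi @ Z.T @ X @ B       # full robust sandwich
cov_short = B                                                  # collapsed form

print(np.allclose(cov_sandwich, cov_short))
```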

In [None]:
import numpy as np
import pandas as pd
from scipy.stats import norm

# === Step 1: Build the X, Z, y matrices ===
X = np.c_[cons, educ, exper, expersq]
Z = np.c_[cons, exper, expersq, motheduc, fatheduc]
y = ln_wage
n = len(y)

# === Step 2: Initial estimate (2SLS), used to build the GMM weighting matrix ===
Phi_1 = np.linalg.inv(Z.T @ Z)
beta_2sls = np.linalg.inv(X.T @ Z @ Phi_1 @ Z.T @ X) @ X.T @ Z @ Phi_1 @ Z.T @ y

# === Step 3: Build the weighting matrix (from the first-step residuals) ===
u_hat = y - X @ beta_2sls
Zu = Z * u_hat.values.reshape(-1, 1)
Phi_2 = np.linalg.inv((Zu.T @ Zu) / n)

# === Step 4: Two-step GMM coefficient estimates ===
beta_gmm = np.linalg.inv(X.T @ Z @ Phi_2 @ Z.T @ X) @ X.T @ Z @ Phi_2 @ Z.T @ y

# === Step 5: White heteroskedasticity-robust standard errors (diagonal residual matrix) ===
u = y - X @ beta_gmm
sigma_matrix = np.diagflat(u.values**2)  # n × n diagonal matrix
A = np.linalg.inv(X.T @ Z @ Phi_2 @ Z.T @ X) @ X.T @ Z @ Phi_2 @ Z.T
beta_cov = A @ sigma_matrix @ A.T
beta_std = np.sqrt(np.diag(beta_cov))

# === Step 6: t-statistics & p-values ===
t_stats = beta_gmm.flatten() / beta_std
p_values = 2 * (1 - norm.cdf(np.abs(t_stats)))

# === Step 7: Collect and print the results ===
results_df_gmm_white = pd.DataFrame({
    'Coef.': beta_gmm.flatten(),
    'Std.Err.': beta_std,
    'T-stat': t_stats,
    'P-value': p_values
}, index=['constant', 'education', 'experience', 'exper_sq']).round(4)

print("Two-Step GMM results (robust standard errors): \n")
print(results_df_gmm_white)

## Task 2.5 (heteroskedastic) verification:

In [None]:
from linearmodels.iv import IVGMM
import pandas as pd

# === Step 1: Build the variables (as DataFrames with named columns) ===
exog = pd.DataFrame({'constant': cons,'experience': exper,'exper_sq': expersq})
endog = pd.DataFrame({'education': mroz['educ']})
instruments = pd.DataFrame({'mother_educ': mroz['motheduc'],'father_educ': mroz['fatheduc']})
dependent = pd.DataFrame({'ln_wage': ln_wage})

# === Step 2: Build the IV-GMM model with the heteroskedasticity-robust covariance estimator ===
model = IVGMM(dependent, exog, endog, instruments)
results_gmm_robust = model.fit(cov_type='heteroskedastic')  # White-robust SE

# === Step 3: Print the results ===
print("IV-GMM Estimation Results (White Heteroskedasticity-Robust):\n")
print(results_gmm_robust.summary)

## Task 2.4 Solution:

- GMM coefficient estimator:
$\hat{\beta}^{(2)} = (X' Z \Phi^{(2)} Z' X)^{-1} X' Z \Phi^{(2)} Z' y$

- General form of the GMM coefficient variance ($\Omega$ is the error covariance matrix):
$Var[\beta^{(2)}] = (X'Z\Phi^{(2)}Z'X)^{-1} X'Z\Phi^{(2)}Z' \Omega Z\Phi^{(2)}Z'X(X'Z\Phi^{(2)}Z'X)^{-1}$

- Expanded form of the GMM coefficient variance (cluster-robust):
$Var[\beta] = (X'Z\Phi^{(2)}Z'X)^{-1} X'Z\Phi^{(2)}\left( \sum_{g=1}^{G} Z_g^{\prime} \widehat{u}_g \widehat{u}_g^{\prime} Z_g \right)\Phi^{(2)}Z'X(X'Z\Phi^{(2)}Z'X)^{-1}$
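The clustered middle matrix $\sum_g Z_g' \hat{u}_g \hat{u}_g' Z_g$ is the cross product of the per-cluster score sums $Z_g' \hat{u}_g$. As an alternative to an explicit loop over clusters, those sums can be accumulated in one pass with `np.add.at`. The sketch below compares both approaches on random inputs; the arrays and the five hypothetical clusters are placeholders, not the Mroz data.

```python
import numpy as np

rng = np.random.default_rng(6)
n, m, G = 300, 4, 5
Z = rng.normal(size=(n, m))                 # hypothetical instrument matrix
u = rng.normal(size=n)                      # hypothetical residuals
groups = rng.integers(0, G, size=n)         # hypothetical cluster labels

# Loop version: accumulate the outer product of each cluster's score sum
S_loop = np.zeros((m, m))
for g in np.unique(groups):
    idx = groups == g
    s_g = Z[idx].T @ u[idx]                 # Z_g' u_g
    S_loop += np.outer(s_g, s_g)

# Vectorised version: sum the rows z_i * u_i within each cluster, then cross-multiply
labels, codes = np.unique(groups, return_inverse=True)
scores = np.zeros((len(labels), m))
np.add.at(scores, codes, Z * u[:, None])    # scores[c] += z_i * u_i for rows i in cluster c
S_vec = scores.T @ scores

print(np.allclose(S_loop, S_vec))
```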

In [None]:
import numpy as np
import pandas as pd
from scipy.stats import norm

# === Step 1: Build X, Z, y ===
X = np.c_[cons, educ, exper, expersq]
Z = np.c_[cons, exper, expersq, motheduc, fatheduc]
y = ln_wage
n = len(y)
k = X.shape[1]

# === Step 2: Initial 2SLS estimate, used to build the weighting matrix ===
Phi_1 = np.linalg.inv(Z.T @ Z)
beta_2sls = np.linalg.inv(X.T @ Z @ Phi_1 @ Z.T @ X) @ X.T @ Z @ Phi_1 @ Z.T @ y

# === Step 3: Build the two-step GMM weighting matrix ===
u_hat = y - X @ beta_2sls
Zu = Z * u_hat.values.reshape(-1, 1)
Phi_2 = np.linalg.inv((Zu.T @ Zu) / n)

# === Step 4: Two-step GMM coefficient estimates ===
beta_gmm = np.linalg.inv(X.T @ Z @ Phi_2 @ Z.T @ X) @ X.T @ Z @ Phi_2 @ Z.T @ y

# === Step 5: Build the covariance matrix clustered by city ===
u = y - X @ beta_gmm
cluster_ids = mroz['city'].values  # NumPy array, so the boolean mask can index Z directly
cluster_labels = np.unique(cluster_ids)
G = len(cluster_labels)

# Initialise the S matrix
S = np.zeros((Z.shape[1], Z.shape[1]))

# Accumulate the S_g contribution of each cluster
for g in cluster_labels:
    idx = (cluster_ids == g)
    Z_g = Z[idx, :]
    u_g = u[idx].to_numpy().reshape(-1, 1)
    S += Z_g.T @ u_g @ u_g.T @ Z_g

# === Step 6: Cluster-robust covariance matrix ===
N = len(y)
K = X.shape[1]
dof_correction = (G / (G - 1)) * ((N - 1) / (N - K))  # additional degrees-of-freedom correction
A = np.linalg.inv(X.T @ Z @ Phi_2 @ Z.T @ X) @ X.T @ Z @ Phi_2
beta_cov_cluster = A @ S @ A.T * dof_correction  # apply the degrees-of-freedom correction
beta_std = np.sqrt(np.diag(beta_cov_cluster))

# === Step 7: t-statistics & p-values ===
t_stats = beta_gmm.flatten() / beta_std
p_values = 2 * (1 - norm.cdf(np.abs(t_stats)))

# === Step 8: Results table ===
results_df_gmm_cluster = pd.DataFrame({
    'Coef.': beta_gmm.flatten(),
    'Std.Err.': beta_std,
    'T-stat': t_stats,
    'P-value': p_values
}, index=['constant', 'education', 'experience', 'exper_sq']).round(4)

print("Two-Step GMM results (cluster-robust standard errors, with degrees-of-freedom correction, dof_correction):\n")
print(results_df_gmm_cluster)

In [None]:
import numpy as np
import pandas as pd
from scipy.stats import norm

# === Step 1: Build X, Z, y ===
X = np.c_[cons, educ, exper, expersq]
Z = np.c_[cons, exper, expersq, motheduc, fatheduc]
y = ln_wage
n = len(y)
k = X.shape[1]

# === Step 2: Initial 2SLS estimate, used to build the weighting matrix ===
Phi_1 = np.linalg.inv(Z.T @ Z)
beta_2sls = np.linalg.inv(X.T @ Z @ Phi_1 @ Z.T @ X) @ X.T @ Z @ Phi_1 @ Z.T @ y

# === Step 3: Build the two-step GMM weighting matrix ===
u_hat = y - X @ beta_2sls
Zu = Z * u_hat.values.reshape(-1, 1)
Phi_2 = np.linalg.inv((Zu.T @ Zu) / n)

# === Step 4: Two-step GMM coefficient estimates ===
beta_gmm = np.linalg.inv(X.T @ Z @ Phi_2 @ Z.T @ X) @ X.T @ Z @ Phi_2 @ Z.T @ y

# === Step 5: Build the covariance matrix clustered by city ===
u = y - X @ beta_gmm
cluster_ids = mroz['city'].values  # NumPy array, so the boolean mask can index Z directly
cluster_labels = np.unique(cluster_ids)
G = len(cluster_labels)

# Initialise the S matrix
S = np.zeros((Z.shape[1], Z.shape[1]))

# Accumulate the S_g contribution of each cluster
for g in cluster_labels:
    idx = (cluster_ids == g)
    Z_g = Z[idx, :]
    u_g = u[idx].to_numpy().reshape(-1, 1)
    S += Z_g.T @ u_g @ u_g.T @ Z_g

# === Step 6: Cluster-robust covariance matrix (no degrees-of-freedom correction) ===
A = np.linalg.inv(X.T @ Z @ Phi_2 @ Z.T @ X) @ X.T @ Z @ Phi_2
beta_cov_cluster = A @ S @ A.T
beta_std = np.sqrt(np.diag(beta_cov_cluster))

# === Step 7: t-statistics & p-values ===
t_stats = beta_gmm.flatten() / beta_std
p_values = 2 * (1 - norm.cdf(np.abs(t_stats)))

# === Step 8: Results table ===
results_df_gmm_cluster = pd.DataFrame({
    'Coef.': beta_gmm.flatten(),
    'Std.Err.': beta_std,
    'T-stat': t_stats,
    'P-value': p_values
}, index=['constant', 'education', 'experience', 'exper_sq']).round(4)

print("Two-Step GMM results (cluster-robust standard errors, no degrees-of-freedom correction):\n")
print(results_df_gmm_cluster)

## Task 2.5 (cluster-robust) verification:

In [None]:
from linearmodels.iv import IVGMM
import pandas as pd

# === Step 1: Build the variables (as DataFrames with clear column names) ===
exog = pd.DataFrame({'constant': cons,'experience': exper,'exper_sq': expersq})
endog = pd.DataFrame({'education': mroz['educ']})
instruments = pd.DataFrame({'mother_educ': mroz['motheduc'],'father_educ': mroz['fatheduc']})
dependent = pd.DataFrame({'ln_wage': ln_wage})
clusters = mroz['city']

# === Step 2: Build the IV-GMM model with the cluster-robust covariance estimator ===
model = IVGMM(dependent, exog, endog, instruments)
results_cluster = model.fit(cov_type='clustered', clusters=clusters)

# === Step 3: Print the results ===
print("IV-GMM Estimation Results (Cluster-Robust SE, clustered by city):\n")
print(results_cluster.summary)