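# CSI 1000 Factor Analysis in Practice

This notebook walks through a complete factor-analysis workflow:
- Data loading and preprocessing
- Filtering out limit-up and thinly traded stocks
- IC (information coefficient) analysis
- Quantile return backtest
- Visualization of results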

---

## 1. Environment Setup and Data Loading

### 1.1 Import the required libraries

In [2]:
import pandas as pd
import numpy as np
from scipy import stats
from typing import Dict, List, Tuple, Optional
import warnings
warnings.filterwarnings('ignore')

# Project-local module
from mylib.get_local_data import get_local_data, list_data_files

print("Libraries imported successfully!")

Libraries imported successfully!


### 1.2 Inspect the data layout

First, look at how the data files are organized:

In [3]:
# List the daily data files
files = list_data_files('daily')
print(f"Number of daily data files: {len(files)}")
print(f"First 3 files: {files[:3]}")
print(f"Last 3 files: {files[-3:]}")

Number of daily data files: 270
First 3 files: [(20250102, 'daily_data/daily/2025/01/daily_20250102.parquet'), (20250103, 'daily_data/daily/2025/01/daily_20250103.parquet'), (20250106, 'daily_data/daily/2025/01/daily_20250106.parquet')]
Last 3 files: [(20260206, 'daily_data/daily/2026/02/daily_20260206.parquet'), (20260209, 'daily_data/daily/2026/02/daily_20260209.parquet'), (20260210, 'daily_data/daily/2026/02/daily_20260210.parquet')]


The data is organized into year/month directories, one parquet file per trading day:
```
daily_data/daily/
├── 2025/
│   ├── 01/daily_20250102.parquet
│   ├── 02/daily_20250206.parquet
│   └── ...
└── 2026/
    └── ...
```
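As a small illustration of this layout, the path for a given trading day can be derived from its `YYYYMMDD` integer. The helper name `daily_path` is hypothetical (it is not part of `mylib`); the pattern itself is simply read off the file listing above.

```python
def daily_path(trade_date: int, root: str = "daily_data/daily") -> str:
    """Build the parquet path for one trading day (hypothetical helper)."""
    s = str(trade_date)            # e.g. "20250102"
    year, month = s[:4], s[4:6]    # directory levels: year, then month
    return f"{root}/{year}/{month}/daily_{s}.parquet"


example_path = daily_path(20250102)
print(example_path)
```

In practice `list_data_files` already returns these paths, so a helper like this is only useful for spot-checking a single date.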

### 1.3 Load factor data

Use the `get_local_data` function to load the data for a given field:

In [4]:
# Load close prices
close_df = get_local_data(
    sec_list=None,  # None means all stocks
    start='20250101',
    end='20251231',
    filed='close',
    data_type='daily'
)

print(f"Close price data shape: {close_df.shape}")
print(f"Trading days: {close_df.shape[0]}")
print(f"Stocks: {close_df.shape[1]}")
print(f"\nData preview:")
close_df.head()

Close price data shape: (243, 5500)
Trading days: 243
Stocks: 5500

Data preview:


| date | 000001.SZ | 000002.SZ | 000004.SZ | 000006.SZ | 000007.SZ | 000008.SZ | 000009.SZ | 000010.SZ | 000011.SZ | 000012.SZ | ... | 920964.BJ | 920970.BJ | 920971.BJ | 920974.BJ | 920976.BJ | 920978.BJ | 920981.BJ | 920982.BJ | 920985.BJ | 920992.BJ |
|------|------|------|------|------|------|------|------|------|------|------|-----|------|------|------|------|------|------|------|------|------|------|
| 2025-01-02 | 11.43 | 7.11 | 14.18 | 7.25 | 7.02 | 2.84 | 8.9 | 2.71 | 8.62 | 5.14 | ... | 6.93 | 5.65 | 26.5 | 6.94 | 18.8 | 13.3 | 25.7 | 205.9 | 9.89 | 13.29 |
| 2025-01-03 | 11.38 | 7.0 | 12.8 | 6.91 | 6.6 | 2.65 | 8.72 | 2.64 | 8.27 | 5.08 | ... | 6.8 | 5.77 | 26.16 | 7.27 | 18.51 | 13.26 | 26.0 | 209.32 | 9.82 | 12.95 |
| 2025-01-06 | 11.44 | 6.98 | 12.52 | 6.7 | 6.63 | 2.58 | 8.74 | 2.64 | 8.3 | 5.12 | ... | 6.49 | 5.8 | 24.86 | 7.25 | 17.56 | 13.18 | 26.99 | 216.62 | 10.04 | 13.15 |
| 2025-01-07 | 11.51 | 7.05 | 13.11 | 6.85 | 6.93 | 2.66 | 8.73 | 2.9 | 8.53 | 5.09 | ... | 6.6 | 5.91 | 25.4 | 7.69 | 18.11 | 13.54 | 27.57 | 214.51 | 10.18 | 13.27 |
| 2025-01-08 | 11.5 | 6.96 | 13.4 | 6.96 | 6.93 | 2.64 | 8.63 | 2.96 | 8.45 | 5.06 | ... | 6.62 | 5.95 | 27.18 | 7.63 | 18.36 | 13.83 | 27.54 | 217.72 | 10.09 | 13.57 |


**What is returned**:
- `index`: dates (datetime)
- `columns`: stock codes (e.g. 000001.SZ)
- `values`: the values of the requested field (here, close prices used as the factor)

---

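## 2. Computing Returns

The core of factor analysis is to compute forward returns and then measure the correlation between the factor values and those returns (the IC).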

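In [5]:
def compute_returns(prices: pd.DataFrame, forward_periods: int = 1) -> pd.DataFrame:
    """
    Compute forward returns.

    Args:
        prices: DataFrame of prices (dates x stocks)
        forward_periods: horizon in trading days (default 1)

    Returns:
        DataFrame of forward returns, indexed by the date the factor is observed
    """
    # The return over [t, t + n] is moved back to row t. The look-back window and
    # the shift must both equal n; otherwise n > 1 would yield a shifted 1-day return.
    returns = prices.pct_change(forward_periods).shift(-forward_periods)
    return returns.iloc[:-forward_periods]

# Compute 1-day forward returns
returns_df = compute_returns(close_df)
print(f"Returns data shape: {returns_df.shape}")
print(f"\nReturn statistics:")
print(returns_df.stack().describe())

Returns data shape: (242, 5500)

Return statistics:
count    1.315685e+06
mean     1.409965e-03
std      3.070254e-02
min     -8.932584e-01
25%     -1.232966e-02
50%      0.000000e+00
75%      1.286174e-02
max      5.784098e-01
dtype: float64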


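**Key concepts**:
- `shift(-1)`: pairs the factor value on day t with the return realized from day t to day t+1 (the forward return)
- `stack()`: converts the wide table into a long one, which is convenient for summary statistics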
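The alignment is easy to verify on a toy price series. The sketch below uses a single hypothetical ticker `AAA` whose price rises 10% per day, so every 1-day forward return should be 0.1 and every 2-day forward return 0.21.

```python
import numpy as np
import pandas as pd

# Toy prices for one hypothetical stock over four trading days
toy_prices = pd.DataFrame(
    {"AAA": [10.0, 11.0, 12.1, 13.31]},
    index=pd.to_datetime(["2025-01-02", "2025-01-03", "2025-01-06", "2025-01-07"]),
)

# 1-day forward return: the row for day t holds close[t+1] / close[t] - 1
fwd_1d = toy_prices.pct_change().shift(-1).iloc[:-1]

# 2-day forward return: the look-back window and the shift are both 2
fwd_2d = toy_prices.pct_change(2).shift(-2).iloc[:-2]

# stack() turns the wide (date x stock) table into a long Series
long_1d = fwd_1d.stack()
print(long_1d)
```

The last row (or last two rows for the 2-day horizon) is dropped because its forward return is not yet known.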

---

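## 3. Filtering Limit-Up and Thinly Traded Stocks

### 3.1 Load the data needed for filtering

In [6]:
# Load daily percentage change and turnover (traded value)
pct_chg_df = get_local_data(
    start='20250101', end='20251231',
    filed='pct_chg', data_type='daily'
)

amount_df = get_local_data(
    start='20250101', end='20251231',
    filed='amount', data_type='daily'
)

print(f"pct_chg data shape: {pct_chg_df.shape}")
print(f"amount data shape: {amount_df.shape}")

# Note: amount is treated as being in units of 10,000 yuan (万元)
print(f"\nSample amount values (10,000 yuan):")
print(amount_df.head())

pct_chg data shape: (243, 5500)
amount data shape: (243, 5500)

Sample amount values (10,000 yuan):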
ts_code       000001.SZ   000002.SZ   000004.SZ   000006.SZ  000007.SZ  \
date                                                                     
2025-01-02  2102923.078  854487.563  167987.024  225252.942  48456.355   
2025-01-03  1320520.978  795154.845  164927.377  275530.027  53622.184   
2025-01-06  1234305.778  591154.773  106111.266  216461.997  25114.735   
2025-01-07   858329.049  504087.513  110966.824  160237.518  21530.880   
2025-01-08  1223598.997  632775.676  132231.151  253004.650  26524.725   

ts_code      000008.SZ   000009.SZ   000010.SZ  000011.SZ  000012.SZ  ...  \
date                                                                  ...   
2025-01-02  316812.796  215305.581   95665.836  36956.749  85557.805  ...   
2025-01-03  319412.025  126688.166   75658.571  40593.401  82624.045  ...   
2025-01-06  236957.732   94111.507   70149.192  24182.482  65223.298  ...   
2025-01-07  186369.920   88586.233  1411

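### 3.2 Define the filter conditions

Filter logic:
- **Limit-up filter**: `pct_chg >= 9.5` (in percent; approximates the 10% daily limit of main-board A-shares)
- **Low-turnover filter**: `amount < 100`, i.e. turnover below 1 million yuan (with `amount` in units of 10,000 yuan, 1,000,000 / 10,000 = 100)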

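In [7]:
# Thresholds
LIMIT_UP_THRESHOLD = 9.5  # limit-up threshold (%)
MIN_AMOUNT = 100  # minimum turnover in units of 10,000 yuan, i.e. 1 million yuan

# Build the filter masks (True = drop this stock on this day)
limit_up_mask = pct_chg_df >= LIMIT_UP_THRESHOLD
low_amount_mask = amount_df < MIN_AMOUNT
filter_mask = limit_up_mask | low_amount_mask

# Counts are stock-day observations, not distinct stocks
print(f"Limit-up stock-days: {limit_up_mask.sum().sum()}")
print(f"Low-turnover stock-days: {low_amount_mask.sum().sum()}")
print(f"Total filtered: {filter_mask.sum().sum()}")
print(f"Filtered ratio: {filter_mask.sum().sum() / pct_chg_df.size * 100:.2f}%")

Limit-up stock-days: 22226
Low-turnover stock-days: 5
Total filtered: 22231
Filtered ratio: 1.66%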


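### 3.3 Apply the filter

Set the factor value to NaN for limit-up and thinly traded stocks:

In [8]:
# Copy the factor data and apply the filter
factor_filtered = close_df.copy()
factor_filtered[filter_mask] = np.nan

print(f"Original data points: {close_df.size}")
print(f"Valid data points after filtering: {factor_filtered.size - factor_filtered.isna().sum().sum()}")
print(f"Retention ratio: {(~factor_filtered.isna()).sum().sum() / close_df.size * 100:.2f}%")

Original data points: 1336500
Valid data points after filtering: 1291667
Retention ratio: 96.65%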
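The same masking step can be checked on a tiny example. The sketch below uses made-up values for two hypothetical tickers and `DataFrame.mask`, which returns a filtered copy instead of assigning in place; it is an equivalent alternative to the copy-then-assign pattern above, not the notebook's own code.

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(["2025-01-02", "2025-01-03"])
cols = ["AAA", "BBB"]  # hypothetical tickers

toy_factor = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], index=idx, columns=cols)
toy_pct_chg = pd.DataFrame([[10.0, 1.0], [0.5, -2.0]], index=idx, columns=cols)
toy_amount = pd.DataFrame([[500.0, 50.0], [300.0, 400.0]], index=idx, columns=cols)

# True where the observation should be dropped:
# AAA is limit-up on the first day, BBB trades too little on the first day
toy_mask = (toy_pct_chg >= 9.5) | (toy_amount < 100)

# DataFrame.mask replaces values with NaN wherever the condition is True
toy_filtered = toy_factor.mask(toy_mask)
print(toy_filtered)
```

Because the masked values become NaN, the later IC and quantile steps skip them automatically through their NaN checks.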


---

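## 4. IC (Information Coefficient) Analysis

### 4.1 How the IC is computed

The IC (Information Coefficient) measures a factor's predictive power. For each trading day it is the cross-sectional correlation between factor values and forward returns:
- **Pearson IC**: the Pearson correlation between factor values and returns
- **Spearman IC**: the Spearman correlation between factor values and returns (a rank correlation, more robust to outliers)

This analysis uses the **Spearman IC**.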
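To see why the rank-based version is often preferred, consider a relationship that is perfectly monotonic but nonlinear. The data below is made up for illustration: Spearman sees a perfect ordering, while Pearson is pulled below 1 by the curvature.

```python
import numpy as np
from scipy import stats

toy_factor_vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
toy_return_vals = toy_factor_vals ** 3  # monotonic but strongly nonlinear

pearson_ic, _ = stats.pearsonr(toy_factor_vals, toy_return_vals)
spearman_ic, _ = stats.spearmanr(toy_factor_vals, toy_return_vals)

print(f"Pearson: {pearson_ic:.4f}, Spearman: {spearman_ic:.4f}")
```

Raw prices, used as the factor in this notebook, are heavily skewed across stocks, which is one situation where ranks behave better than raw values.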

In [10]:
def compute_ic(
    factor_df: pd.DataFrame,
    returns_df: pd.DataFrame,
    method: str = 'spearman'
) -> Tuple[np.ndarray, Dict]:
    """
    Compute the information coefficient (IC).

    Args:
        factor_df: DataFrame of factor values
        returns_df: DataFrame of forward returns
        method: 'spearman' or 'pearson'

    Returns:
        ic_series: daily IC values
        ic_stats: summary statistics of the IC
    """
    # Align dates
    common_dates = factor_df.index.intersection(returns_df.index)
    factor_aligned = factor_df.loc[common_dates]
    returns_aligned = returns_df.loc[common_dates]

    # Keep only stocks present in both frames
    common_stocks = factor_aligned.columns.intersection(returns_aligned.columns)
    factor_aligned = factor_aligned[common_stocks]
    returns_aligned = returns_aligned[common_stocks]

    ic_series = []

    for date in common_dates:
        factor_vals = factor_aligned.loc[date].values
        return_vals = returns_aligned.loc[date].values

        # Drop NaN; skip days with fewer than 10 valid stocks
        mask = ~(np.isnan(factor_vals) | np.isnan(return_vals))
        if mask.sum() < 10:
            continue

        factor_vals = factor_vals[mask]
        return_vals = return_vals[mask]

        if method == 'spearman':
            ic, _ = stats.spearmanr(factor_vals, return_vals)
        else:
            ic, _ = stats.pearsonr(factor_vals, return_vals)

        if not np.isnan(ic):
            ic_series.append(ic)

    ic_series = np.array(ic_series)

    if len(ic_series) > 0:
        ic_stats = {
            'ic_mean': float(np.mean(ic_series)),
            'ic_std': float(np.std(ic_series)),
            'ic_ir': float(np.mean(ic_series) / np.std(ic_series)) if np.std(ic_series) > 0 else 0,
            'ic_positive_ratio': float((ic_series > 0).mean()),
            'ic_t_stat': float(np.mean(ic_series) / (np.std(ic_series) / np.sqrt(len(ic_series)))) if np.std(ic_series) > 0 else 0,
            'ic_count': len(ic_series)
        }
    else:
        ic_stats = {}

    return ic_series, ic_stats

# Compute the IC
ic_series, ic_stats = compute_ic(factor_filtered, returns_df)

# Default to NaN (not a string) so the numeric format specs still work
# when ic_stats is empty
print("="*50)
print("IC analysis results")
print("="*50)
print(f"IC mean: {ic_stats.get('ic_mean', np.nan):.4f}")
print(f"IC std: {ic_stats.get('ic_std', np.nan):.4f}")
print(f"IC IR (information ratio): {ic_stats.get('ic_ir', np.nan):.4f}")
print(f"Positive IC ratio: {ic_stats.get('ic_positive_ratio', np.nan)*100:.2f}%")
print(f"IC t-statistic: {ic_stats.get('ic_t_stat', np.nan):.4f}")
print(f"Trading days: {ic_stats.get('ic_count', 0)}")

IC analysis results
IC mean: -0.0161
IC std: 0.1714
IC IR (information ratio): -0.0942
Positive IC ratio: 47.93%
IC t-statistic: -1.4659
Trading days: 242
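The statistics are linked: in `compute_ic` the t-statistic is the IC IR multiplied by the square root of the number of daily IC values. A quick check with the rounded figures printed above (so the agreement is only approximate):

```python
import math

ic_ir = -0.0942       # IC IR from the output above (rounded)
n_days = 242          # number of daily IC values
reported_t = -1.4659  # t-statistic from the output above (rounded)

# t = mean(IC) / (std(IC) / sqrt(N)) = IC_IR * sqrt(N)
t_from_ir = ic_ir * math.sqrt(n_days)
print(f"{t_from_ir:.4f}")
```

As a rough rule of thumb, an absolute t-statistic below about 2 means the mean IC is not distinguishable from zero at the usual 5% level, which is the case for raw close price here.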


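### 4.2 Interpreting the IC

| IC IR range | Factor rating | Notes |
|----------|---------|------|
| IR > 0.5 | Excellent | Strong predictive power |
| 0.3 < IR <= 0.5 | Good | Stable predictive power |
| 0.1 < IR <= 0.3 | Fair | Some predictive power |
| 0 <= IR <= 0.1 | Neutral | Weak predictive power |
| IR < 0 | Invalid/reversed | The factor is ineffective or works in the opposite direction |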
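The bands in the table translate directly into a lookup. The function name `rate_ic_ir` and the English rating labels are choices made for this sketch, not part of the notebook's code.

```python
def rate_ic_ir(ic_ir: float) -> str:
    """Map an IC IR value to the rating bands in the table (hypothetical helper)."""
    if ic_ir > 0.5:
        return "excellent"
    if ic_ir > 0.3:
        return "good"
    if ic_ir > 0.1:
        return "fair"
    if ic_ir >= 0:
        return "neutral"
    return "invalid/reversed"


# IC IR of the close-price factor from section 4.1
print(rate_ic_ir(-0.0942))
```

A negative IR does not necessarily make a factor useless: if it is consistently negative, flipping the factor's sign gives the same magnitude with the opposite direction.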

---

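## 5. Quantile Return Backtest

### 5.1 Grouping method

On each trading day, stocks are sorted by factor value into 5 groups (Q1-Q5). Q1 holds the lowest factor values and Q5 the highest.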
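For a single cross-section, the grouping can be sketched with `pandas.qcut`. The ten factor values and returns below are made up. Note that `qcut` uses right-closed bins, whereas the notebook's own function uses left-closed percentile bounds, so stocks sitting exactly on a boundary may land in a different group.

```python
import pandas as pd

# One hypothetical cross-section: factor values and next-day returns for 10 stocks
toy = pd.DataFrame({
    "factor": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "fwd_ret": [0.00, 0.02, 0.01, 0.03, 0.02, 0.04, 0.03, 0.05, 0.04, 0.06],
})

labels = ["Q1", "Q2", "Q3", "Q4", "Q5"]
toy["group"] = pd.qcut(toy["factor"], q=5, labels=labels)

# Equal-weighted mean forward return of each group
group_means = toy.groupby("group", observed=True)["fwd_ret"].mean()
print(group_means)
```

Repeating this for every date and averaging the group means over time gives the quantile returns reported in this section.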

In [None]:
def compute_quantile_returns(
    factor_df: pd.DataFrame,
    returns_df: pd.DataFrame,
    quantiles: int = 5
) -> Dict:
    """
    Compute the returns of factor-sorted quantile portfolios.
    """
    # Align dates and stocks
    common_dates = factor_df.index.intersection(returns_df.index)
    common_stocks = factor_df.columns.intersection(returns_df.columns)
    factor_aligned = factor_df.loc[common_dates, common_stocks]
    returns_aligned = returns_df.loc[common_dates, common_stocks]

    keys = [f'Q{i + 1}' for i in range(quantiles)]
    # One entry per date for every group (NaN if the group is empty), so the
    # daily series stay aligned when computing the long-short spread
    daily = {key: [] for key in keys}

    for date in common_dates:
        factor_vals = factor_aligned.loc[date].values
        return_vals = returns_aligned.loc[date].values

        # Drop NaN; skip days with too few valid stocks
        mask = ~(np.isnan(factor_vals) | np.isnan(return_vals))
        if mask.sum() < quantiles * 2:
            continue

        valid_factor = factor_vals[mask]
        valid_returns = return_vals[mask]

        # Quantile boundaries; open the outer edges so every stock lands in a group
        quantile_bounds = np.percentile(valid_factor, np.linspace(0, 100, quantiles + 1))
        quantile_bounds[0] = -np.inf
        quantile_bounds[-1] = np.inf

        for i, key in enumerate(keys):
            group_mask = (valid_factor >= quantile_bounds[i]) & (valid_factor < quantile_bounds[i + 1])
            daily[key].append(valid_returns[group_mask].mean() if group_mask.any() else np.nan)

    results = {}
    for key in keys:
        series = np.array(daily[key], dtype=float)
        if np.isfinite(series).any():
            results[key] = {
                'mean_return': float(np.nanmean(series)),
                'std': float(np.nanstd(series)),
                'count': int(np.isfinite(series).sum())
            }

    # Long-short: daily top-minus-bottom spread on dates where both groups exist
    spread = np.array(daily[keys[-1]], dtype=float) - np.array(daily[keys[0]], dtype=float)
    spread = spread[np.isfinite(spread)]
    if len(spread) > 0:
        results['Long_Short'] = {
            'mean_return': float(spread.mean()),
            'std': float(spread.std()),
            'win_rate': float((spread > 0).mean())
        }

    return results

# Compute quantile returns
quantile_returns = compute_quantile_returns(factor_filtered, returns_df)

print("="*50)
print("Quantile return analysis")
print("="*50)
for key in ['Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Long_Short']:
    if key in quantile_returns:
        r = quantile_returns[key]
        print(f"{key}: mean={r['mean_return']*100:.4f}%, std={r['std']*100:.4f}%")

Quantile return analysis
Q1: mean=0.1089%, std=1.2565%
Q2: mean=0.1143%, std=1.2663%
Q3: mean=0.1375%, std=1.4203%
Q4: mean=0.1373%, std=1.5388%
Q5: mean=0.1168%, std=1.6423%

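### 5.2 Interpreting the quantile returns

Ideally, for a factor with positive predictive power:
- Q1 (low factor values) earns the lowest return
- Q5 (high factor values) earns the highest return
- Long_Short (long Q5, short Q1) earns a significantly positive return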
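These conditions can be checked against the group means reported in the output above. The sketch below hard-codes those rounded figures, so the spread it prints is only an approximation of the true long-short mean.

```python
import numpy as np

# Mean daily returns (%) of Q1..Q5 as reported in the output above
q_means = np.array([0.1089, 0.1143, 0.1375, 0.1373, 0.1168])

# Strictly increasing from Q1 to Q5?
is_monotonic = bool(np.all(np.diff(q_means) > 0))

# Top-minus-bottom spread from the rounded means
top_minus_bottom = q_means[-1] - q_means[0]

print(f"monotonic increasing: {is_monotonic}")
print(f"Q5 - Q1 (from rounded means): {top_minus_bottom:.4f}%")
```

For raw close price the profile is hump-shaped rather than monotonic (the middle groups earn the most), which agrees with the weak, statistically insignificant IC found in section 4.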

---

## 6. Complete Factor Analysis Function

In [None]:
def analyze_single_factor(
    factor_name: str,
    data_type: str,
    field: str,
    stocks: List[str],
    start_date: str,
    end_date: str,
    limit_up_threshold: float = 9.5,
    min_amount: float = 100
) -> Dict:
    """
    End-to-end analysis of a single factor.
    """
    print(f"Analyzing factor: {factor_name}")

    # 1. Load factor data
    factor_df = get_local_data(
        sec_list=stocks, start=start_date, end=end_date,
        filed=field, data_type=data_type
    )

    if factor_df.empty:
        return {'error': 'empty data'}

    # 2. Compute forward returns from close prices, not from the factor values
    close_df = get_local_data(
        sec_list=stocks, start=start_date, end=end_date,
        filed='close', data_type='daily'
    )
    returns_df = compute_returns(close_df)

    # 3. Align dates
    common_dates = factor_df.index.intersection(returns_df.index)
    factor_df = factor_df.loc[common_dates]
    returns_df = returns_df.loc[common_dates]

    # 4. Load the filter data
    try:
        pct_chg_df = get_local_data(
            sec_list=stocks, start=start_date, end=end_date,
            filed='pct_chg', data_type='daily'
        )
        amount_df = get_local_data(
            sec_list=stocks, start=start_date, end=end_date,
            filed='amount', data_type='daily'
        )

        pct_chg_df = pct_chg_df.loc[common_dates]
        amount_df = amount_df.loc[common_dates]

        common_stocks = factor_df.columns.intersection(
            pct_chg_df.columns
        ).intersection(amount_df.columns)
        factor_df = factor_df[common_stocks]
        pct_chg_df = pct_chg_df[common_stocks]
        amount_df = amount_df[common_stocks]

        limit_up_mask = pct_chg_df >= limit_up_threshold
        low_amount_mask = amount_df < min_amount
        filter_mask = limit_up_mask | low_amount_mask

        factor_filtered = factor_df.copy()
        factor_filtered[filter_mask] = np.nan

        total = factor_df.size
        filtered = filter_mask.sum().sum()
        print(f"  Filtered: {filtered}/{total} ({filtered/total*100:.1f}%)")

    except Exception as e:
        # Fall back to the unfiltered factor if the filter data cannot be used
        print(f"  Warning: filtering failed ({e})")
        factor_filtered = factor_df

    # 5. Compute the IC
    ic_series, ic_stats = compute_ic(factor_filtered, returns_df)

    # 6. Compute quantile returns
    quantile_returns = compute_quantile_returns(factor_filtered, returns_df)

    result = {
        'factor': factor_name,
        'data_type': data_type,
        'field': field,
        'start_date': start_date,
        'end_date': end_date,
        'stock_count': len(factor_df.columns),
        'date_count': len(factor_df),
        'ic_stats': ic_stats,
        'quantile_returns': quantile_returns,
        'ic_series': ic_series.tolist() if len(ic_series) > 0 else []
    }

    # NaN default keeps the format spec valid when no IC could be computed
    print(f"  Done: IC mean={ic_stats.get('ic_mean', np.nan):.4f}")

    return result

---

## 7. Batch Factor Analysis Example

In [None]:
# Factors to analyze: (name, data_type, field)
factors_to_analyze = [
    ('volume_ratio', 'daily_basic', 'volume_ratio'),
    ('turnover_rate', 'daily_basic', 'turnover_rate'),
    ('close', 'daily', 'close'),
]

# Sample stock list
# Note: compute_ic skips any day with fewer than 10 valid stocks, so a universe
# this small can yield few or no IC values; use a larger list for real analysis
stocks = ['000001.SZ', '000002.SZ', '000004.SZ', '000005.SZ', '000006.SZ',
          '000007.SZ', '000008.SZ', '000009.SZ', '000010.SZ', '000011.SZ']

# Run the analysis
results = []
for name, dtype, field in factors_to_analyze:
    result = analyze_single_factor(
        factor_name=name, data_type=dtype, field=field,
        stocks=stocks, start_date='20250101', end_date='20250131'
    )
    if 'error' not in result:
        results.append(result)

print("\n" + "="*50)
print("Factor comparison")
print("="*50)
for r in sorted(results, key=lambda x: x['ic_stats'].get('ic_ir', 0), reverse=True):
    print(f"{r['factor']}: IC IR={r['ic_stats'].get('ic_ir', 0):.4f}, IC mean={r['ic_stats'].get('ic_mean', 0):.4f}")

---

## 8. Visualization

In [None]:
import matplotlib.pyplot as plt

if len(results) > 0:
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # IC time series (first two factors)
    ax1 = axes[0]
    for r in results[:2]:
        ic_series = np.array(r['ic_series'])
        if len(ic_series) > 0:
            ax1.plot(ic_series, alpha=0.7, label=r['factor'])
    ax1.axhline(y=0, color='r', linestyle='--')
    ax1.set_title("IC time series")
    ax1.set_xlabel("Trading day")
    ax1.set_ylabel("IC")
    ax1.legend()
    ax1.grid(True, alpha=0.3)

    # Quantile returns as grouped bars (first two factors)
    ax2 = axes[1]
    x = np.arange(5)
    width = 0.35
    for i, r in enumerate(results[:2]):
        means = [r['quantile_returns'].get(f'Q{j+1}', {}).get('mean_return', 0) * 100 for j in range(5)]
        ax2.bar(x + i * width, means, width, label=r['factor'])

    ax2.set_xlabel("Quantile group")
    ax2.set_ylabel("Mean daily return (%)")
    ax2.set_title("Quantile return comparison")
    ax2.set_xticks(x + width / 2)
    ax2.set_xticklabels(['Q1', 'Q2', 'Q3', 'Q4', 'Q5'])
    ax2.legend()
    ax2.grid(True, alpha=0.3)
    ax2.axhline(y=0, color='r', linestyle='--')

    plt.tight_layout()
    plt.savefig('factor_analysis.png', dpi=150)
    plt.show()
    print("Chart saved to factor_analysis.png")

---

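## 9. Summary

### Key steps

1. **Data loading**: `get_local_data()` reads from the parquet files
2. **Return calculation**: `prices.pct_change().shift(-1)` gives the 1-day forward return
3. **Filtering**: limit-up (`pct_chg` >= 9.5%) plus low turnover (`amount` below 1 million yuan)
4. **IC calculation**: Spearman rank correlation, computed cross-sectionally per day
5. **Quantile backtest**: mean returns of the Q1-Q5 groups

### Notes

- **Unit of `amount`**: this notebook treats `amount` as being in units of 10,000 yuan (万元), so 1 million yuan = 100. Confirm this against your data source before relying on the threshold: Tushare's `daily` interface documents `amount` in thousands of yuan, and the sample values shown in section 3.1 are worth checking against known turnover figures.
- **NaN handling**: filtered observations are set to NaN so that later steps skip them
- **Date alignment**: make sure the factor, return, and filter data cover the same dates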
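As a minimal sketch of the alignment step, `DataFrame.align` with `join="inner"` keeps only the dates and stocks present in both frames. The two toy frames and the tickers `AAA` and `BBB` below are made up; the notebook's own functions achieve the same result with `index.intersection` and `columns.intersection`.

```python
import pandas as pd

toy_factor = pd.DataFrame(
    {"AAA": [1.0, 2.0, 3.0]},
    index=pd.to_datetime(["2025-01-02", "2025-01-03", "2025-01-06"]),
)
toy_fwd_ret = pd.DataFrame(
    {"AAA": [0.01, 0.02], "BBB": [0.03, 0.04]},
    index=pd.to_datetime(["2025-01-03", "2025-01-06"]),
)

# Inner join on both axes: common dates (rows) and common stocks (columns)
factor_al, ret_al = toy_factor.align(toy_fwd_ret, join="inner")

print(factor_al)
print(ret_al)
```

After alignment both frames have identical indexes and columns, so row-by-row cross-sectional statistics compare the same stocks on the same days.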

In [None]:
print("\n" + "="*50)
print("Notebook run complete!")
print("="*50)