In [1]:
import os

os.environ["http_proxy"] = "http://localhost:7890"
os.environ["https_proxy"] = "http://localhost:7890"

4.1 Data
1) Initialization

Before running a Qlib program in Python, you must first initialize the runtime environment with qlib.init, as follows:

In [7]:
import qlib
from qlib.constant import REG_CN

# from qlib.contrib.model.pytorch_alstm_ts import ALSTM

data_uri = 'H:/data/qlib/qlib_data/cn_data'
# GetData().qlib_data(target_dir=data_uri, region=REG_CN)
qlib.init(provider_uri=data_uri, region=REG_CN)

[19576:MainThread](2024-01-27 12:42:42,050) INFO - qlib.Initialization - [config.py:416] - default_conf: client.
[19576:MainThread](2024-01-27 12:42:42,053) INFO - qlib.Initialization - [__init__.py:74] - qlib successfully initialized based on client settings.
[19576:MainThread](2024-01-27 12:42:42,054) INFO - qlib.Initialization - [__init__.py:76] - data_path={'__DEFAULT_FREQ': WindowsPath('H:/data/qlib/qlib_data/cn_data')}


2) Getting trading dates and all stock codes

In [9]:
# Get the trading calendar
from qlib.data import D

tradedate = D.calendar(start_time='2020-01-01', end_time='2020-11-30', freq='day')
print(tradedate[:5])
# Get all instrument codes
instruments = D.instruments(market='all')
stock_list = D.list_instruments(instruments=instruments,
                                start_time='2020-07-01',
                                end_time='2020-11-30',
                                as_list=True)
print(stock_list[-5:])

[Timestamp('2020-01-02 00:00:00') Timestamp('2020-01-03 00:00:00')
 Timestamp('2020-01-06 00:00:00') Timestamp('2020-01-07 00:00:00')
 Timestamp('2020-01-08 00:00:00')]
['SZ300891', 'SZ300892', 'SZ300893', 'SZ300895', 'SH000905']


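3) Getting field data

Calling D.features from the qlib.data module retrieves the specified fields for the specified stocks and dates. For example, the code below retrieves the daily backward-adjusted closing price and trading volume of Huiyun Titanium (SZ300891) from 2020-01-01 to 2020-11-30.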

In [10]:
# 3. Get the specified fields for the specified stocks and dates
features_df = D.features(instruments=['SZ300891'],
                         fields=['$close', '$volume'],
                         start_time='2020-01-01',
                         end_time='2020-11-30',
                         freq='day')
print(features_df.head())

                         $close       $volume
instrument datetime                          
SZ300891   2020-09-17  1.000000  1.617669e+09
           2020-09-18  0.881890  1.203641e+09
           2020-09-21  0.871391  7.776675e+08
           2020-09-22  0.843832  6.129917e+08
           2020-09-23  0.851269  6.875450e+08


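4.2 Stock Pool

First, use qlib.data.filter.NameDFilter to filter stocks statically by name. Its name_rule_re parameter is a regular expression for the stock codes to include; for example, HK[0-9!] matches codes that start with HK followed by digits or an exclamation mark, where the exclamation mark denotes a stock that has since been delisted.

Next, use qlib.data.filter.ExpressionDFilter to filter stocks dynamically by a factor expression. Its rule_expression parameter is the expression a stock must satisfy; for example, $close>=1 requires the closing price to be at least 1 yuan.

Then combine the two filters through the filter_pipe parameter of D.instruments (from qlib.data), as in the code below:

- Without the if __name__ == '__main__' guard, the code fails with "RuntimeError: An attempt has been made to start a new process before the current process has finished its bootstrapping phase."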


In [12]:
if __name__ == '__main__':
    from qlib.data.filter import NameDFilter, ExpressionDFilter

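    # Static filter: Shenzhen Stock Exchange A-shares
    nameDFilter = NameDFilter(name_rule_re='SZ[0-9!]')
    # Dynamic filter: backward-adjusted closing price of at least 5 yuan
    expressionDFilter = ExpressionDFilter(rule_expression='$close>=5')
    # Build the new set of stock codes from the two filters above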
    instruments = D.instruments(market='all', filter_pipe=[nameDFilter, expressionDFilter])
    stock_list = D.list_instruments(instruments=instruments,
                                    start_time='2020-07-01',
                                    end_time='2020-11-30',
                                    as_list=True)
    # Show the last 5 stock codes after filtering
    print(stock_list[-5:])

['SZ300782', 'SZ300803', 'SZ300831', 'SZ300841', 'SZ300846']
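
To make the two filtering stages concrete, here is a minimal pure-Python sketch that mimics them on toy data. It does not call Qlib: `toy_close`, `name_filter`, and `expression_filter` are hypothetical names, the prices are invented, and the sketch assumes the name rule is applied as a prefix match in the manner of Python's `re.match`.

```python
import re

# Hypothetical toy data: instrument code -> {date: adjusted close}.
toy_close = {
    "SZ300001": {"2020-07-01": 6.2, "2020-07-02": 4.8},
    "SZ300002": {"2020-07-01": 3.1, "2020-07-02": 3.0},
    "SH600000": {"2020-07-01": 10.4, "2020-07-02": 10.6},
}


def name_filter(codes, name_rule_re):
    """Static stage: keep codes whose beginning matches the regular expression."""
    pattern = re.compile(name_rule_re)
    return [code for code in codes if pattern.match(code)]


def expression_filter(codes, closes, min_close):
    """Dynamic stage: for each code, keep only the dates where close >= min_close."""
    kept = {}
    for code in codes:
        dates = [d for d, c in sorted(closes[code].items()) if c >= min_close]
        if dates:  # drop instruments that never satisfy the rule
            kept[code] = dates
    return kept


static_pool = name_filter(toy_close, "SZ[0-9!]")
dynamic_pool = expression_filter(static_pool, toy_close, min_close=5)
print(static_pool)
print(dynamic_pool)
```

In this sketch SZ300002 is dropped because it never reaches the threshold, and SZ300001 stays in the pool only on the one date where it does, which is what makes the second stage "dynamic".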


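4.3 Factors

Qlib provides two price-volume factor libraries, Alpha158 and Alpha360, and users can also define custom factor libraries as needed. The source code is in qlib/contrib/data/handler.py and consists mainly of four classes:

Alpha360(DataHandlerLP), Alpha360vwap(Alpha360)
Alpha158(DataHandlerLP), Alpha158vwap(Alpha158)
The code documentation for the configuration is as follows: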

```python
def parse_config_to_fields(config):
    """Create factors from config.

    config = {
        'kbar': {},  # whether to use some hard-coded kbar features
        'price': {  # whether to use raw price features
            'windows': [0, 1, 2, 3, 4],  # use price at n days ago
            'feature': ['OPEN', 'HIGH', 'LOW'],  # which price fields to use
        },
        'volume': {  # whether to use raw volume features
            'windows': [0, 1, 2, 3, 4],  # use volume at n days ago
        },
        'rolling': {  # whether to use rolling-operator-based features
            'windows': [5, 10, 20, 30, 60],  # rolling window sizes
            'include': ['ROC', 'MA', 'STD'],  # rolling operators to use
            # if include is None, the default operators are used
            'exclude': ['RANK'],  # rolling operators not to use
        }
    }
    """
    ...  # function body omitted in this excerpt
```
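
As a rough illustration of how such a config could expand into factor expressions, the sketch below turns the `price` and `rolling` entries into (expression, name) pairs. It is not the Qlib implementation: `expand_price` and `expand_rolling` are hypothetical helpers, and the expression templates are assumptions modeled on factor names such as OPEN0, ROC5, MA5, and STD5, so they should be checked against the installed handler.py.

```python
# Assumed expression templates for three rolling operators (illustrative only).
ROLLING_TEMPLATES = {
    "ROC": "Ref($close, {d})/$close",
    "MA": "Mean($close, {d})/$close",
    "STD": "Std($close, {d})/$close",
}


def expand_price(config):
    """Expand config['price'] into (expression, name) pairs."""
    price = config.get("price", {})
    pairs = []
    for feature in price.get("feature", []):
        field = feature.lower()
        for d in price.get("windows", []):
            # Window 0 is today's price; window d > 0 is the price d days ago.
            expr = f"${field}/$close" if d == 0 else f"Ref(${field}, {d})/$close"
            pairs.append((expr, f"{feature}{d}"))
    return pairs


def expand_rolling(config):
    """Expand config['rolling'] into (expression, name) pairs."""
    rolling = config.get("rolling", {})
    include = rolling.get("include") or list(ROLLING_TEMPLATES)
    exclude = set(rolling.get("exclude", []))
    pairs = []
    for op in include:
        if op in exclude or op not in ROLLING_TEMPLATES:
            continue  # skip excluded operators and ones this sketch does not model
        for d in rolling.get("windows", []):
            pairs.append((ROLLING_TEMPLATES[op].format(d=d), f"{op}{d}"))
    return pairs


example_config = {
    "price": {"windows": [0, 1], "feature": ["OPEN"]},
    "rolling": {"windows": [5], "include": ["ROC", "MA", "RANK"], "exclude": ["RANK"]},
}
price_pairs = expand_price(example_config)
rolling_pairs = expand_rolling(example_config)
print(price_pairs)
print(rolling_pairs)
```

Dividing every price-based expression by `$close` keeps the features scale-free, so stocks trading at very different price levels remain comparable.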
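Here the data_handler_config parameter acts as a configuration file. It is a dict that defines the start and end dates of the full dataset (start_time and end_time), the start and end dates of the fitting data (fit_start_time and fit_end_time), the stock pool (instruments), and so on.

The fitting date range should be a subset of the full date range.

Dates inside the fitting range (training and validation sets) and the remaining dates (test set) are preprocessed differently, which the next chapter discusses.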
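
The subset requirement can be checked before building the handler. The following is a small sketch using only the standard library; `check_fit_window` is a hypothetical helper (not part of Qlib), and it assumes all four dates are ISO-formatted strings.

```python
from datetime import date


def check_fit_window(config):
    """Validate that the fit window lies inside the full window.

    Returns (calendar days in the fit window, calendar days after it).
    """
    start = date.fromisoformat(config["start_time"])
    end = date.fromisoformat(config["end_time"])
    fit_start = date.fromisoformat(config["fit_start_time"])
    fit_end = date.fromisoformat(config["fit_end_time"])
    if not (start <= fit_start <= fit_end <= end):
        raise ValueError("fit window must be a subset of [start_time, end_time]")
    fit_days = (fit_end - fit_start).days + 1  # inclusive of both endpoints
    remaining_days = (end - fit_end).days      # days left over for testing
    return fit_days, remaining_days


example_dates = {
    "start_time": "2020-01-01",
    "end_time": "2020-11-30",
    "fit_start_time": "2020-01-01",
    "fit_end_time": "2020-06-30",
}
fit_days, remaining_days = check_fit_window(example_dates)
print(fit_days, remaining_days)
```

The counts are calendar days rather than trading days; counting trading days would require the trading calendar.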

To generate the Alpha158 factors, use the Alpha158 class in the qlib.contrib.data.handler module:

```python
from qlib.contrib.data.handler import Alpha158
h = Alpha158(**data_handler_config)
```

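After the commands above run, the program computes the **current-period factor values and next-period returns** from start_time to end_time, which serve respectively as the **features and labels for subsequent AI model training**.

The code uses two sets of processors: infer_processors for model prediction and learn_processors for model training.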

```python
    infer_processors = check_transform_proc(infer_processors, fit_start_time, fit_end_time)
    learn_processors = check_transform_proc(learn_processors, fit_start_time, fit_end_time)
```
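
One way to read the fit_start_time and fit_end_time arguments passed to both processor lists: any statistics a processor needs are estimated on the fit window only and then applied to every date, so nothing from the test period leaks into preprocessing. The sketch below illustrates that idea with a toy z-score normalizer. `FitWindowZScore` is a hypothetical class written for this note, not a Qlib processor, and it assumes dates are ISO-formatted strings (which compare correctly as text).

```python
from statistics import mean, pstdev


class FitWindowZScore:
    """Toy processor: learn mean/std on the fit window, apply them to all dates."""

    def __init__(self, fit_start_time, fit_end_time):
        self.fit_start_time = fit_start_time
        self.fit_end_time = fit_end_time
        self.mean_ = None
        self.std_ = None

    def fit(self, series):
        """series maps an ISO date string to a float value."""
        window = [
            value for day, value in series.items()
            if self.fit_start_time <= day <= self.fit_end_time
        ]
        self.mean_ = mean(window)
        self.std_ = pstdev(window) or 1.0  # avoid dividing by zero
        return self

    def transform(self, series):
        """Normalize every date, including those outside the fit window."""
        return {day: (value - self.mean_) / self.std_ for day, value in series.items()}


toy_series = {"2020-01-02": 1.0, "2020-03-02": 3.0, "2020-08-03": 5.0}
proc = FitWindowZScore("2020-01-01", "2020-06-30").fit(toy_series)
normalized = proc.transform(toy_series)
print(normalized)
```

The August value is normalized with statistics learned from January to June, so it may fall well outside the range seen during fitting.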

In [16]:
from qlib.contrib.data.handler import Alpha158

instruments = D.instruments(market='all')
# stock_list = D.list_instruments(instruments=instruments,
#                                 start_time='2020-01-01',
#                                 end_time='2020-11-30',
#                                 as_list=True)
# Set the dates, stock pool, and other parameters
data_handler_config = {
    "start_time": "2020-01-01",
    "end_time": "2020-11-30",
    "fit_start_time": "2020-01-01",
    "fit_end_time": "2020-06-30",
    "instruments": instruments
}
h = Alpha158(**data_handler_config)
# Get the column names (factor names)
print(h.get_cols())

[19576:MainThread](2024-01-27 13:00:12,306) INFO - qlib.timer - [log.py:127] - Time cost: 125.072s | Loading data Done
[19576:MainThread](2024-01-27 13:00:12,893) INFO - qlib.timer - [log.py:127] - Time cost: 0.201s | DropnaLabel Done
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[cols] = df[cols].groupby("datetime", group_keys=False).apply(self.zscore_func)
[19576:MainThread](2024-01-27 13:00:13,451) INFO - qlib.timer - [log.py:127] - Time cost: 0.557s | CSZScoreNorm Done
[19576:MainThread](2024-01-27 13:00:13,478) INFO - qlib.timer - [log.py:127] - Time cost: 1.171s | fit & process data Done
[19576:MainThread](2024-01-27 13:00:13,479) INFO - qlib.timer - [log.py:127] - Time cost: 126.246s | Init data Done


['KMID', 'KLEN', 'KMID2', 'KUP', 'KUP2', 'KLOW', 'KLOW2', 'KSFT', 'KSFT2', 'OPEN0', 'HIGH0', 'LOW0', 'VWAP0', 'ROC5', 'ROC10', 'ROC20', 'ROC30', 'ROC60', 'MA5', 'MA10', 'MA20', 'MA30', 'MA60', 'STD5', 'STD10', 'STD20', 'STD30', 'STD60', 'BETA5', 'BETA10', 'BETA20', 'BETA30', 'BETA60', 'RSQR5', 'RSQR10', 'RSQR20', 'RSQR30', 'RSQR60', 'RESI5', 'RESI10', 'RESI20', 'RESI30', 'RESI60', 'MAX5', 'MAX10', 'MAX20', 'MAX30', 'MAX60', 'MIN5', 'MIN10', 'MIN20', 'MIN30', 'MIN60', 'QTLU5', 'QTLU10', 'QTLU20', 'QTLU30', 'QTLU60', 'QTLD5', 'QTLD10', 'QTLD20', 'QTLD30', 'QTLD60', 'RANK5', 'RANK10', 'RANK20', 'RANK30', 'RANK60', 'RSV5', 'RSV10', 'RSV20', 'RSV30', 'RSV60', 'IMAX5', 'IMAX10', 'IMAX20', 'IMAX30', 'IMAX60', 'IMIN5', 'IMIN10', 'IMIN20', 'IMIN30', 'IMIN60', 'IMXD5', 'IMXD10', 'IMXD20', 'IMXD30', 'IMXD60', 'CORR5', 'CORR10', 'CORR20', 'CORR30', 'CORR60', 'CORD5', 'CORD10', 'CORD20', 'CORD30', 'CORD60', 'CNTP5', 'CNTP10', 'CNTP20', 'CNTP30', 'CNTP60', 'CNTN5', 'CNTN10', 'CNTN20', 'CNTN30', 'CNT

The operator formulas behind these factors can be found in handler.py:

"G:\ProgramData\miniconda3\envs\qlibenv\Lib\site-packages\qlib\contrib\data\handler.py"

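As an example of what those formulas look like, the sketch below computes the first five k-bar factors (KMID, KLEN, KMID2, KUP, KUP2) for a single bar in plain Python. The definitions are assumptions about how handler.py expresses them and are worth verifying against your installed version; `kbar_features` is a hypothetical helper and the sample prices are invented.

```python
def kbar_features(open_, high, low, close, eps=1e-12):
    """Compute five candlestick-shape factors for one daily bar."""
    body = close - open_                     # signed size of the candle body
    span = high - low                        # full high-low range
    upper_shadow = high - max(open_, close)  # wick above the body
    return {
        "KMID": body / open_,
        "KLEN": span / open_,
        "KMID2": body / (span + eps),        # eps guards against a zero range
        "KUP": upper_shadow / open_,
        "KUP2": upper_shadow / (span + eps),
    }


bar = kbar_features(open_=10.0, high=11.0, low=9.5, close=10.5)
for name, value in bar.items():
    print(name, round(value, 6))
```

Under these definitions KMID2 equals KMID divided by KLEN and KUP2 equals KUP divided by KLEN, up to the small eps term.
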
In [17]:
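# Fetch the labels with the code below:
# Get each stock's label on day T (the return, i.e. the percentage price change)
Alpha158_df_label = h.fetch(col_set="label")
print(Alpha158_df_label)
# With the default parameters, a stock's label on day t is the change of the day t+2 close relative to the day t+1 close, which corresponds to generating the signal after the close on day t, opening the position at the close on day t+1, and closing it at the close on day t+2.
# Below, the label of the CSI 300 (SH000300) on 2020-01-02 is -0.003778, i.e. its change on 2020-01-06 was -0.3778% (2020-01-04 and 2020-01-05 were non-trading days).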

                         LABEL0
datetime   instrument          
2020-01-02 SH000300   -0.003778
           SH000903   -0.006925
           SH000905    0.010076
           SH600000   -0.011111
           SH600004   -0.017261
...                         ...
2020-09-25 SZ300890         NaN
           SZ300891         NaN
           SZ300892         NaN
           SZ300893         NaN
           SZ300895         NaN

[676964 rows x 1 columns]
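
The label timing described in the cell above can be reproduced on a plain list of closing prices. This is a minimal sketch with invented prices; `forward_labels` is a hypothetical helper, and it assumes the default label is the day t+2 close divided by the day t+1 close, minus 1.

```python
def forward_labels(closes):
    """Label for day t = closes[t + 2] / closes[t + 1] - 1.

    The last two days have no label because the required future closes
    are not available yet; they are returned as None.
    """
    labels = []
    for t in range(len(closes)):
        if t + 2 < len(closes):
            labels.append(closes[t + 2] / closes[t + 1] - 1)
        else:
            labels.append(None)
    return labels


toy_closes = [10.0, 11.0, 12.1, 12.1]
toy_labels = forward_labels(toy_closes)
print(toy_labels)
```

Note that the day 0 label ignores the move from day 0 to day 1 entirely: it only measures the return earned between the day 1 close and the day 2 close.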


In [18]:
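# The code below fetches the feature (factor) values on day T.
# Get each stock's feature (factor) values on day T
Alpha158_df_feature = h.fetch(col_set="feature")
print(Alpha158_df_feature)
# With the default parameters, a stock's features on day t are the factor values computed after the close on day t.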

                           KMID      KLEN     KMID2       KUP      KUP2  \
datetime   instrument                                                     
2020-01-02 SH000300    0.007495  0.012450  0.602024  0.004955  0.397976   
           SH000903    0.006225  0.011553  0.538800  0.005328  0.461200   
           SH000905    0.011207  0.016346  0.685605  0.001653  0.101117   
           SH600000    0.000000  0.015237  0.000000  0.013633  0.894736   
           SH600004   -0.002278  0.012529 -0.181813  0.003417  0.272729   
...                         ...       ...       ...       ...       ...   
2020-09-25 SZ300890   -0.058000  0.085636 -0.677282  0.027273  0.318471   
           SZ300891   -0.044214  0.074236 -0.595587  0.028930  0.389707   
           SZ300892   -0.110807  0.148837 -0.744485  0.038030  0.255515   
           SZ300893    0.032195  0.245854  0.130953  0.153659  0.625000   
           SZ300895   -0.104925  0.152836 -0.686523  0.040896  0.267578   

                        