# Qlib Step Forward

In [3]:
import qlib
qlib.init(provider_uri="../data/qlib-day/")

[1352:MainThread](2022-08-24 17:33:43,456) INFO - qlib.Initialization - [config.py:413] - default_conf: client.
[1352:MainThread](2022-08-24 17:33:43,755) INFO - qlib.Initialization - [__init__.py:74] - qlib successfully initialized based on client settings.
[1352:MainThread](2022-08-24 17:33:43,755) INFO - qlib.Initialization - [__init__.py:76] - data_path={'__DEFAULT_FREQ': PosixPath('/home/ppoak/Quant/data/qlib-day')}


## Learn Process vs Infer Process

The learn process needs to learn some parameters from a given past time period, which is set by the parameters `fit_start_time` and `fit_end_time`. For example, the mean and standard deviation used for z-score normalization can be calculated from that period. The learned parameters are then applied directly to newly arriving data.

Every processor implements a `__call__` method, which is the entry point when the processor is actually applied. The difference is that processors used for learning also have a `fit` method, which summarizes the information (experience) from the past time period.
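
The split between `fit` and `__call__` can be sketched without Qlib. The example below is only an illustration, not Qlib's implementation: `ZScoreSketch` is a hypothetical class, and it assumes the statistics are learned only from rows between `fit_start_time` and `fit_end_time` and then reused on every row, including later dates.

```python
import pandas as pd


class ZScoreSketch:
    """Hypothetical stand-in for a learnable processor (not a Qlib class)."""

    def __init__(self, fit_start_time, fit_end_time):
        self.fit_start_time = pd.Timestamp(fit_start_time)
        self.fit_end_time = pd.Timestamp(fit_end_time)

    def fit(self, df):
        # Learn mean and standard deviation from the fit window only.
        window = df.loc[self.fit_start_time:self.fit_end_time]
        self.mean_ = window.mean()
        self.std_ = window.std()
        return self

    def __call__(self, df):
        # Apply the learned statistics to any data, including unseen dates.
        return (df - self.mean_) / self.std_


dates = pd.date_range("2020-01-01", periods=6, freq="D")
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]}, index=dates)

proc = ZScoreSketch("2020-01-01", "2020-01-03").fit(df)
out = proc(df)
print(out)
```

The last three rows are scaled with the mean and standard deviation of the first three, which is the behaviour a learn processor is meant to have on new data.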

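Here we implement a processor that performs cross-sectional MAD (median absolute deviation) de-extreme: each day's values are clipped to within `n` MADs of that day's median.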

In [19]:
import pandas as pd
from qlib.data.dataset.processor import Processor, get_group_columns

class CSMADDeextreme(Processor):
    def __init__(self, fields_group: str = None, n: int = 5):
        self.fields_group = fields_group
        self.n = n
    
    def __call__(self, df: pd.DataFrame):
        cols = get_group_columns(df, self.fields_group)
        csmed = df[cols].groupby("datetime").median()
        csmad = df[cols].groupby("datetime").apply(
            lambda x: x - csmed).abs().groupby("datetime").median()
        return df[cols].clip(csmed - self.n * csmad, csmed + self.n * csmad)
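
To make the clipping rule concrete, here is a small standalone sketch on toy data. It is an alternative formulation, not the processor above: `cs_mad_clip` is a hypothetical helper that works on a single `Series` and uses `groupby(...).transform` so that the per-day median and MAD keep the same index as the input.

```python
import pandas as pd


def cs_mad_clip(s: pd.Series, n: float = 5) -> pd.Series:
    """Clip each value to [median - n * MAD, median + n * MAD] of its own day."""
    med = s.groupby(level="datetime").transform("median")
    mad = (s - med).abs().groupby(level="datetime").transform("median")
    return s.clip(lower=med - n * mad, upper=med + n * mad)


idx = pd.MultiIndex.from_product(
    [pd.to_datetime(["2020-01-02"]), ["A", "B", "C", "D", "E"]],
    names=["datetime", "instrument"],
)
s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0], index=idx)

# Median is 3 and MAD is 1, so with n=2 the bounds are [1, 5].
clipped = cs_mad_clip(s, n=2)
print(clipped)
```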

In [38]:
from qlib.utils import init_instance_by_config

config = {
    "class": "DataHandlerLP",
    "module_path": "qlib.data.dataset.handler",
    "kwargs": {
        "instruments": "000016.XSHG",
        "start_time": "20200101", 
        "end_time": "20210101",
        "data_loader": {
            "class": "QlibDataLoader",
            "module_path": "qlib.data.dataset.loader",
            "kwargs": {
                "config": [("$close / Ref($close, 60) - 1",), ("MOM60",)],
            },
        },
        "infer_processors": [CSMADDeextreme(n=7)],
    }
}

h = init_instance_by_config(config)

[8633:MainThread](2022-08-23 15:56:31,828) INFO - qlib.timer - [log.py:117] - Time cost: 0.572s | Loading data Done
[8633:MainThread](2022-08-23 15:56:32,129) INFO - qlib.timer - [log.py:117] - Time cost: 0.298s | CSMADDeextreme Done
[8633:MainThread](2022-08-23 15:56:32,132) INFO - qlib.timer - [log.py:117] - Time cost: 0.302s | fit & process data Done
[8633:MainThread](2022-08-23 15:56:32,134) INFO - qlib.timer - [log.py:117] - Time cost: 0.879s | Init data Done


In [39]:
h.fetch(data_key="infer", squeeze=True)

datetime    instrument 
2020-01-02  600000.XSHG    0.040033
            600009.XSHG   -0.009855
            600010.XSHG   -0.101351
            600016.XSHG    0.035831
            600028.XSHG    0.031936
                             ...   
2020-12-31  603160.XSHG   -0.011062
            603259.XSHG    0.327291
            603288.XSHG    0.237137
            603501.XSHG    0.302926
            603986.XSHG    0.140893
Name: MOM60, Length: 13608, dtype: float32

## Operations

Qlib provides a series of data operators. There are five main operator base classes:

1. NpElemOperator
2. NpPairOperator
3. Rolling
4. PairRolling
5. TResample

*The base classes above are second-level base classes beneath the user interface; they do not cover all base classes in `qlib.data.ops`.*

First, `NpElemOperator` handles element-wise features without any time-related parameter, such as taking the sign of a feature with `Sign` or its absolute value with `Abs`.

Second, `NpPairOperator` is mainly used for interactions between two features, such as adding two features with `Add` or checking whether one feature is not less than the other with `Ge`. The `_load_internal` method of `NpPairOperator` and `NpElemOperator`, where the actual calculation takes place, shows that these operators work through `getattr(np, func)`.
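
The `getattr(np, func)` principle can be shown with plain NumPy. This is a minimal sketch, not Qlib's code: `elem_op` and `pair_op` are hypothetical helpers, and the mapping of operator names to NumPy function names (`sign`, `abs`, `add`, `greater_equal`) is an assumption made for illustration.

```python
import numpy as np


def elem_op(func: str, values):
    # Look up a one-argument NumPy function by name and apply it.
    return getattr(np, func)(values)


def pair_op(func: str, left, right):
    # Look up a two-argument NumPy function by name and apply it.
    return getattr(np, func)(left, right)


x = np.array([-2.0, 0.0, 3.0])
y = np.array([1.0, 0.0, 5.0])

signs = elem_op("sign", x)               # analogous to Sign
absolute = elem_op("abs", x)             # analogous to Abs
added = pair_op("add", x, y)             # analogous to Add
not_less = pair_op("greater_equal", x, y)  # analogous to Ge
print(signs, absolute, added, not_less)
```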

It is worth mentioning that the ternary conditional operator, common in programming languages, is also implemented: `If` evaluates a condition and picks a value depending on the result. Its data loading lives in the `_load_internal` method, which shows that the operator is implemented with `np.where`.
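
As a quick illustration of the `np.where` behaviour behind `If`, the snippet below picks the larger of two toy price arrays element by element. The arrays and variable names are made up for the example.

```python
import numpy as np

close = np.array([10.0, 9.0, 11.0])
open_ = np.array([9.5, 9.5, 10.0])

# Where the condition holds take `close`, otherwise take `open_`.
picked = np.where(close > open_, close, open_)
print(picked)
```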

Third, the `Rolling` operator takes a window length as input, and classes based on it apply functions over the rolling window. We can easily compute the mean, skewness or other basic statistics with `Mean`, `Skew` and similar operators. The operators `Slope`, `Rsquare` and `Resi`, however, use Cython as the backend to calculate the slope, R-squared and residual of a linear regression over the data in the rolling window.
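
For intuition about what a rolling slope computes, here is a slow pure-pandas sketch. It is not Qlib's Cython implementation: `rolling_slope` is a hypothetical helper, and it assumes the regression is run against the position inside each window (0, 1, ..., n-1).

```python
import numpy as np
import pandas as pd


def rolling_slope(s: pd.Series, n: int) -> pd.Series:
    """Slope of a least-squares line fitted to each window of length n."""
    x = np.arange(n)
    return s.rolling(n).apply(lambda y: np.polyfit(x, y, 1)[0], raw=True)


s = pd.Series([1.0, 3.0, 5.0, 7.0, 6.0])

slope = rolling_slope(s, 3)
mean = s.rolling(3).mean()  # the kind of statistic pandas provides directly
print(pd.DataFrame({"value": s, "mean": mean, "slope": slope}))
```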

Fourth, `PairRolling` is something like a *multi-column rolling* in pandas rolling processing. Based on it, we can compute rolling-window correlation and covariance with `Corr` and `Cov`. In the `Rolling` and `PairRolling` operators, most functions are implemented with pandas: Qlib takes the expression as input, constructs a rolling window, gets the corresponding method of the rolling window object and finally applies it.
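
The "get the method of the rolling window and apply it" pattern can be sketched with pandas alone. The helpers `rolling_apply` and `pair_rolling_apply` are hypothetical names for this example, not Qlib functions.

```python
import pandas as pd


def rolling_apply(series: pd.Series, n: int, func: str) -> pd.Series:
    # Fetch the method by name from the rolling window object, then call it.
    return getattr(series.rolling(n), func)()


def pair_rolling_apply(left: pd.Series, right: pd.Series, n: int, func: str) -> pd.Series:
    # Two-series variant: the second series is passed to the rolling method.
    return getattr(left.rolling(n), func)(right)


a = pd.Series([1.0, 2.0, 3.0, 4.0])
b = pd.Series([2.0, 4.0, 6.0, 8.0])

roll_mean = rolling_apply(a, 3, "mean")
roll_corr = pair_rolling_apply(a, b, 3, "corr")
roll_cov = pair_rolling_apply(a, b, 3, "cov")
print(pd.DataFrame({"mean": roll_mean, "corr": roll_corr, "cov": roll_cov}))
```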

Fifth, `TResample` is an operator used for resampling. Internally it calls the pandas resample interface. At present it is not implemented for broader usage.
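
The pandas resample interface mentioned above can be tried directly. The sketch below assumes the aggregation is looked up by name on the resampler object; `resample_apply` is a hypothetical helper and the two-day frequency is chosen only to keep the toy data small.

```python
import pandas as pd


def resample_apply(series: pd.Series, freq: str, func: str) -> pd.Series:
    # Build the resampler, then fetch the aggregation method by name.
    return getattr(series.resample(freq), func)()


dates = pd.date_range("2020-01-01", periods=6, freq="D")
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=dates)

two_day_sum = resample_apply(s, "2D", "sum")
print(two_day_sum)
```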

The real data-fetching interface of an `Expression` class is the `_load_internal` method, which underlies the `load` method of the `DataLoader` class. When the `load` method of a `DataLoader` is called, the data loading jobs are distributed to workers, whose number depends on `min(len(instruments), system_cores)`. Once the jobs are distributed, the `joblib` module runs the tasks in parallel, computes the expressions and finally returns the data. So **Cython and parallel execution are the keys to fast factor construction (mostly parallel execution, because the Cython part accounts for so little of the work)**
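
The per-instrument job distribution can be sketched with the standard library. This is not Qlib's code: it uses `ThreadPoolExecutor` in place of `joblib`, `load_expression` is a hypothetical stand-in for evaluating an expression on one instrument, and capping the worker count by both the number of instruments and the number of cores is an assumption of this sketch.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def load_expression(instrument: str):
    # Hypothetical stand-in for computing one expression on one instrument.
    return instrument, len(instrument)


instruments = ["000001.XSHE", "600519.XSHG", "600000.XSHG"]

# Never start more workers than there are instruments or cores (assumed rule).
workers = max(min(len(instruments), os.cpu_count() or 1), 1)

with ThreadPoolExecutor(max_workers=workers) as pool:
    # One task per instrument; results are collected into a dict at the end.
    results = dict(pool.map(load_expression, instruments))

print(workers, results)
```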

In the next cell we implement a dummy operator as an example of a user-defined operator.

In [11]:
import numpy as np
from qlib.data.ops import NpElemOperator, Operators

class AbsPlusLog(NpElemOperator):
    """Dummy operator: element-wise abs(feature) + log(feature)."""
    def __init__(self, feature):
        # "foo" is only a placeholder func name; `_load_internal` is overridden below
        super().__init__(feature, "foo")

    def _load_internal(self, instrument, start_index, end_index, *args):
        series = self.feature.load(instrument, start_index, end_index, *args)
        return np.abs(series) + np.log(series)
    
Operators.register([AbsPlusLog])

In [18]:
from qlib.data.dataset.loader import QlibDataLoader

qdl = QlibDataLoader(config=[('AbsPlusLog($chgPct)', '$chgPct'), ('ChgPctAbsPlusLogChgPct', 'ChgPct')])
data = qdl.load(instruments=['000001.XSHE'], start_time='20200101', end_time='20201231')

Now we verify the result.

In [19]:
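# equal_nan=True: the log of a non-positive value is NaN, and NaN never equals NaN under ==
np.allclose(data.ChgPct.abs().add(np.log(data.ChgPct)), data.ChgPctAbsPlusLogChgPct, equal_nan=True)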

True

Note that string expressions are not the only option: you can also build expressions in code. Here is an example that works the same way as the examples above. I believe this is more explicit when dealing with complex expressions.

In [16]:
from qlib.data.base import Feature
from qlib.data import D

# the '$' prefix is no longer needed
f1 = Feature("high") / Feature("close")
f2 = Feature("open") / Feature("close")
f3 = f1 + f2
f4 = f3 * f3 / f3

# however, an index name can no longer be used as `instruments`
data = D.features(instruments=['600519.XSHG'], fields=[f4], start_time='20200101')
data

| instrument | datetime | Div(Mul(Add(Div($high,$close),Div($open,$close)),Add(Div($high,$close),Div($open,$close))),Add(Div($high,$close),Div($open,$close))) |
| --- | --- | --- |
| 600519.XSHG | 2020-01-02 | 2.011558 |
| 600519.XSHG | 2020-01-03 | 2.071280 |
| 600519.XSHG | 2020-01-06 | 2.007217 |
| 600519.XSHG | 2020-01-07 | 1.988525 |
| 600519.XSHG | 2020-01-08 | 2.003924 |
| 600519.XSHG | ... | ... |
| 600519.XSHG | 2022-08-15 | 2.010511 |
| 600519.XSHG | 2022-08-16 | 2.011497 |
| 600519.XSHG | 2022-08-17 | 2.012513 |
| 600519.XSHG | 2022-08-18 | 2.015869 |
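
The long column name in the output comes from the way arithmetic on feature objects builds a nested expression. The sketch below imitates that mechanism with operator overloading. The classes `Expr`, `Feat` and `BinOp` are hypothetical miniatures written for this example, not Qlib's classes.

```python
class Expr:
    """Hypothetical miniature of an expression node (not Qlib's class)."""

    def __add__(self, other):
        return BinOp("Add", self, other)

    def __mul__(self, other):
        return BinOp("Mul", self, other)

    def __truediv__(self, other):
        return BinOp("Div", self, other)


class Feat(Expr):
    def __init__(self, name):
        self.name = name

    def __str__(self):
        return "$" + self.name


class BinOp(Expr):
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

    def __str__(self):
        # Each node prints itself as Op(left,right), recursively.
        return f"{self.op}({self.left},{self.right})"


f1 = Feat("high") / Feat("close")
f2 = Feat("open") / Feat("close")
f3 = f1 + f2
f4 = f3 * f3 / f3

name = str(f4)
print(name)
```

Each Python operator returns a new node instead of a number, so nothing is computed until the expression is loaded, and the string form of the tree doubles as the column name.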


## Qlib Rolling model training

The key point of rolling model training is that, as the rolling window moves, the same data might be regenerated many times. For example, in the first rolling window, ...