# Qlib Explorer

## Installation

On a regular x86 machine (Intel or AMD), create a conda virtual environment with Python 3.8 and install Qlib with `pip install`:

```shell
conda create -n qlib python=3.8
conda activate qlib
pip install pyqlib
```

On an Apple M1 (ARM) Mac the process is slightly different. Start by creating a conda environment:

```shell
conda create -n qlib python=3.8
conda activate qlib
```

Then clone the Qlib project into a suitable directory, such as `Desktop`:

```shell
git clone https://github.com/microsoft/qlib
```

Next, `cd` into the cloned directory and install the following dependencies with conda first:

```shell
conda install lightgbm ecos pytables cvxpy mlflow fire ruamel
```

These dependencies must be installed with conda because they are not pure Python packages; installing them with pip alone may fail. Once they are installed, the command below installs Qlib locally on the M1 Mac.

```shell
export HDF5_DIR=/path/to/somewhere/you/want/to/store/hdf5file
pip install .
```

After a short wait the installation completes, and the following cell runs successfully.

In [2]:
import qlib

qlib.__version__

'0.8.6.99'

## Data Transformation

### Daily Data

Qlib provides a storage layout that suits most quant researchers, shown below:

![qlib-storage-structure](./images/qlib-storage-structure.png)

The `features` directory holds the features of the dataset: one sub-directory per stock, and one bin file per feature inside it, as shown below:

![qlib-storage-features](./images/qlib-storage-features.png)

The three core kinds of file in these directories are `day.txt`, `all.txt`, and the bin files. `day.txt` stores the trading calendar across all instruments, that is, the union of every instrument's trading days. `all.txt` records the entry date and exit date of each instrument, one instrument per line. The bin files are slightly special: each one holds a single feature column for one specific instrument, but the **first value is the index of the feature's start date in `day.txt`**, and the file is stored as a raw NumPy binary dump without any extra metadata.

These files are easy to generate with NumPy or pandas. For `day.txt`, build the union of all dates in your data and write it out with `np.savetxt`, or put the dates in a Series (or DataFrame) and call `data.to_csv('day.txt', header=False, index=False)`. For `all.txt`, take the minimum and maximum date of each instrument and write them with `np.savetxt` or `data.to_csv` as above; `pandas.DataFrame.groupby` makes this step fast. For the bin files, take each instrument in turn, look up the index of its first date in the calendar, prepend it with `np.hstack([date_index, data[feature].values])`, and write the result with `.tofile`. When updating existing files (appending new values), the `hstack` step is not necessary.
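
The steps above can be sketched with pandas and NumPy. This is only a minimal sketch, not Qlib's own dumper: the helper name `dump_to_qlib`, the column names `date` and `instrument`, the `calendars/`, `instruments/` and `features/` sub-directories, the tab-separated `all.txt`, the lower-cased instrument directory and the little-endian float32 dtype are all assumptions made for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def dump_to_qlib(data, root, features):
    """Write the calendar, the instrument list and one bin file per feature.

    `data` is a long-format frame with hypothetical columns `date`
    (datetime64), `instrument` (str) and one column per feature.
    """
    cal_dir = os.path.join(root, "calendars")
    inst_dir = os.path.join(root, "instruments")
    os.makedirs(cal_dir, exist_ok=True)
    os.makedirs(inst_dir, exist_ok=True)

    # day.txt: union of every instrument's trading days, one date per line
    calendar = pd.DatetimeIndex(data["date"].unique()).sort_values()
    pd.Series(calendar.strftime("%Y-%m-%d")).to_csv(
        os.path.join(cal_dir, "day.txt"), header=False, index=False
    )

    # all.txt: instrument, entry date, exit date (tab separated by assumption)
    span = data.groupby("instrument")["date"].agg(["min", "max"])
    span.to_csv(
        os.path.join(inst_dir, "all.txt"),
        sep="\t", header=False, date_format="%Y-%m-%d",
    )

    # bin files: [index of first date in the calendar, value, value, ...]
    for inst, df in data.groupby("instrument"):
        df = df.set_index("date").sort_index()
        # align on the calendar so that days without data become NaN
        days = calendar[calendar.slice_indexer(df.index[0], df.index[-1])]
        start = calendar.get_loc(days[0])
        out_dir = os.path.join(root, "features", inst.lower())
        os.makedirs(out_dir, exist_ok=True)
        for feature in features:
            values = df[feature].reindex(days).values
            np.hstack([start, values]).astype("<f").tofile(
                os.path.join(out_dir, f"{feature}.day.bin")
            )


# toy data: instrument B is listed one day later than A
data = pd.DataFrame({
    "date": pd.to_datetime(
        ["2021-01-04", "2021-01-05", "2021-01-06", "2021-01-05", "2021-01-06"]
    ),
    "instrument": ["A", "A", "A", "B", "B"],
    "close": [1.0, 2.0, 3.0, 10.0, 11.0],
})
root = tempfile.mkdtemp()
dump_to_qlib(data, root, ["close"])
b_close = np.fromfile(
    os.path.join(root, "features", "b", "close.day.bin"), dtype="<f"
)
```

Reading the file back with `np.fromfile` shows the layout: the first element of `b_close` is the calendar position of B's first trading day, and the remaining elements are the values.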

Qlib provides a `scripts/dump_bin.py` script that simplifies dumping data into bin files, but it only handles `csv` files; it cannot help if your files are in `feather` or `parquet` format. For those cases we provide a `SingleFileDumper` in [library](../library/dumper.py) that you can try.

In [1]:
from library.dumper import FileDumper

dumper = FileDumper(
    file_path = "../data/index-market-daily/index-market-daily.parquet",
    file_type = "parquet",
    date_field = None,
    inst_field = None,
    dump_field = None,
    dump_path = "../data/qlib-day",
    dump_mode = "update",
    name_pattern = None,
    freq = "day",
)
dumper.dump()

AttributeError: 'str' object has no attribute 'strftime'

In [1]:
from library.dumper import IndexCompDumper

dumper = IndexCompDumper(
    file_path = "../data/index/index_weights.parquet",
    file_type = "parquet",
    date_field = None,
    inst_field = None,
    index_field = 1,
    dump_path = "../data/qlib-day/",
    name_pattern = None,
    name_col = None
)
dumper.dump()

### PIT Data

PIT (point-in-time) data can easily be processed and stored in a continuous time-series file or database. However, storing it that way is both redundant and time consuming. Qlib offers a smarter way to store PIT data.

PIT data is stored in the same directory as the daily k-line data, but in an independent sub-directory called `financial`. Each instrument has its own sub-directory, as in the `features` directory. The difference is that each feature depends on two files: a binary data file and a binary index file. Let's start with the index file.

The index file ends with the suffix `.index`, and every feature name in the `financial` data ends with `_q` or `_a`: `_q` marks quarterly data and `_a` marks annual data. The first element of the index file is `start_year`, an unsigned integer (`struct` format `'I'`) giving the year of the first report period in the data. It is followed by a sequence of **byte indexes**, each of which is the actual file cursor offset of a position in the `.data` file.

Now turn to the `.data` file, where the data is actually stored. The data forms a 4-column array (it is not stored in NumPy's binary array format; the original data is generated with the `struct` package). The first column is the announcement date, the date on which the data was actually published. The second column is the report period to which the data belongs. The third column is the value itself. The fourth column is the byte index at which the same period appears again (its next revision); it points to an infinitely large position if the record is the latest one for that period. The structure resembles a linked list: given a date, we can easily access the latest data published as of that date, and we can just as easily follow the chain to the latest data "in the future" (quoted because the future is relative to the given date).

![pit-data](./images/pit-data.png)
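
To make the chained layout more concrete, here is a small in-memory sketch built with `struct`. It is not Qlib's reader or writer: the exact field types (`uint32` dates and periods, `float64` values), the `0xFFFFFFFF` sentinel standing in for the "infinitely large" position, the `year * 100 + quarter` period encoding and the helper names `slot_of`, `build_pit` and `query` are assumptions chosen for illustration.

```python
import struct

# Assumed layouts (illustrative, not taken from the Qlib source):
#   index: start_year (uint32), then one uint32 byte offset per period slot
#   data : date (uint32), period (uint32), value (float64), next (uint32)
RECORD_FMT = "<IIdI"
NEXT_OFFSET = struct.calcsize("<IId")  # position of `next` inside a record
NA_INDEX = 0xFFFFFFFF                  # sentinel: no later revision exists


def slot_of(period, start_year):
    """Map a quarterly period such as 201902 to its position in the index."""
    return (period // 100 - start_year) * 4 + (period % 100 - 1)


def build_pit(records, start_year, n_slots):
    """Pack (date, period, value) tuples, sorted by date, into two byte strings."""
    index = [NA_INDEX] * n_slots  # offset of the first record of each period
    latest = {}                   # slot -> offset of its most recent record
    data = bytearray()
    for date, period, value in records:
        slot = slot_of(period, start_year)
        offset = len(data)
        data += struct.pack(RECORD_FMT, date, period, value, NA_INDEX)
        if slot in latest:
            # chain the previous revision of this period to the new record
            struct.pack_into("<I", data, latest[slot] + NEXT_OFFSET, offset)
        else:
            index[slot] = offset
        latest[slot] = offset
    index_bytes = struct.pack(f"<{n_slots + 1}I", start_year, *index)
    return index_bytes, bytes(data)


def query(index_bytes, data, period, as_of):
    """Return the latest value of `period` published on or before `as_of`."""
    (start_year,) = struct.unpack_from("<I", index_bytes, 0)
    slot = slot_of(period, start_year)
    (offset,) = struct.unpack_from("<I", index_bytes, 4 + 4 * slot)
    value = None
    while offset != NA_INDEX:
        date, _, candidate, nxt = struct.unpack_from(RECORD_FMT, data, offset)
        if date > as_of:
            break  # this revision was not yet published on `as_of`
        value, offset = candidate, nxt
    return value


# 2019Q1 is first published in April and revised in August
records = [
    (20190425, 201901, 1.0),
    (20190715, 201902, 2.0),
    (20190820, 201901, 1.5),
]
index_bytes, data = build_pit(records, start_year=2019, n_slots=4)
before_revision = query(index_bytes, data, 201901, as_of=20190601)
after_revision = query(index_bytes, data, 201901, as_of=20190901)
not_published = query(index_bytes, data, 201902, as_of=20190601)
```

Following the `next` pointers is what lets a query stop at the last revision visible on a given date without scanning the whole file.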

Moreover, when accessing PIT data, prefix the feature name with the identifier `"$$"`.

## Data Plot

Now that the daily trading data has been converted into Qlib format, we can use the Qlib API to load it into memory.

In [2]:
import plotly.graph_objects as go
from qlib.data import D

In [3]:
qlib.init(provider_uri='../data/qlib-day')

[3747:MainThread](2022-08-24 15:15:01,934) INFO - qlib.Initialization - [config.py:413] - default_conf: client.
[3747:MainThread](2022-08-24 15:15:02,174) INFO - qlib.Initialization - [__init__.py:74] - qlib successfully initialized based on client settings.
[3747:MainThread](2022-08-24 15:15:02,175) INFO - qlib.Initialization - [__init__.py:76] - data_path={'__DEFAULT_FREQ': PosixPath('/home/ppoak/Quant/data/qlib-day')}


In [7]:
ohlc = D.features(instruments=['600519.xshg'], start_time='2021-01-01', end_time='20220819', fields=['$open', '$high', '$low', '$close'])
go.Figure(data = go.Candlestick(
    x = ohlc.index.levels[1],
    open = ohlc['$open'],
    high = ohlc['$high'],
    low = ohlc['$low'],
    close = ohlc['$close'],
))

## Data Loader

### QlibDataLoader

`QlibDataLoader` is Qlib's built-in data reader; it can load any cross-sectional or time-series data. Initialize it with a `config` parameter that holds two lists: the expressions to evaluate and the names to give the resulting columns.

In [4]:
from qlib.data.dataset.loader import QlibDataLoader

# like sql `as` expression, the latter list indicates the names of the column
qdl = QlibDataLoader(config=(['$close', '$high'], ['close', 'high']))
qdl.load(instruments=['600519.xshg'], start_time='2022-08-01', end_time='20220819')

Unnamed: 0_level_0,Unnamed: 1_level_0,close,high
datetime,instrument,Unnamed: 2_level_1,Unnamed: 3_level_1
2022-08-01,600519.xshg,1890.300049,1908.0
2022-08-02,600519.xshg,1879.97998,1887.97998
2022-08-03,600519.xshg,1885.0,1904.0
2022-08-04,600519.xshg,1916.01001,1923.099976
2022-08-05,600519.xshg,1923.959961,1935.0
2022-08-08,600519.xshg,1911.530029,1932.880005
2022-08-09,600519.xshg,1910.0,1918.98999
2022-08-10,600519.xshg,1876.0,1918.0
2022-08-11,600519.xshg,1906.900024,1907.0
2022-08-12,600519.xshg,1928.0,1930.0


We can also use the Qlib expression engine to load calculated features into memory:

In [5]:
insts = ['600000.XSHG', '600004.XSHG', '600009.XSHG']
close_ma = ['EMA($close, 10)', 'EMA($close, 30)']
ma_names = ['EMA10', 'EMA30']
qdl_ma = QlibDataLoader(config=(close_ma, ma_names))
qdl_ma.load(instruments=insts, start_time='20210101', end_time='20210110')

Unnamed: 0_level_0,Unnamed: 1_level_0,EMA10,EMA30
datetime,instrument,Unnamed: 2_level_1,Unnamed: 3_level_1
2021-01-04,600000.XSHG,9.606045,9.729256
2021-01-04,600004.XSHG,13.699723,14.055768
2021-01-04,600009.XSHG,74.138649,75.316963
2021-01-05,600000.XSHG,9.621153,9.725617
2021-01-05,600004.XSHG,13.722251,14.037616
2021-01-05,600009.XSHG,74.784477,75.463432
2021-01-06,600000.XSHG,9.660882,9.732524
2021-01-06,600004.XSHG,13.685838,14.001202
2021-01-06,600009.XSHG,75.325035,75.611725
2021-01-07,600000.XSHG,9.690149,9.738145


Other expression operators can be found in `qlib/data/ops.py`.

Sometimes we need to group the calculated columns; the most frequently used group names are `feature` and `label`. To do this, change the config from the list format to a dict format.

In [6]:
insts = ['600000.XSHG', '600004.XSHG', '600009.XSHG']
close_ma = ['EMA($close, 10)', 'EMA($close, 30)']
ma_names = ['EMA10', 'EMA30']
ret = ['Ref($close, -2) / Ref($close, -1) - 1',]
ret_name = ['forward',]
qdl_ma_gp = QlibDataLoader(config={'feature': (close_ma, ma_names), 'label': (ret, ret_name)})
qdl_ma_gp.load(instruments=insts, start_time='20210101', end_time='20210110')

Unnamed: 0_level_0,Unnamed: 1_level_0,feature,feature,label
Unnamed: 0_level_1,Unnamed: 1_level_1,EMA10,EMA30,forward
datetime,instrument,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
2021-01-04,600000.XSHG,9.606045,9.729256,0.014463
2021-01-04,600004.XSHG,13.699723,14.055768,-0.019551
2021-01-04,600009.XSHG,74.138649,75.316963,0.002458
2021-01-05,600000.XSHG,9.621153,9.725617,-0.001018
2021-01-05,600004.XSHG,13.722251,14.037616,-0.052437
2021-01-05,600009.XSHG,74.784477,75.463432,-0.021164
2021-01-06,600000.XSHG,9.660882,9.732524,0.002039
2021-01-06,600004.XSHG,13.685838,14.001202,0.02728
2021-01-06,600009.XSHG,75.325035,75.611725,0.001714
2021-01-07,600000.XSHG,9.690149,9.738145,-0.014242
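
The label expression `Ref($close, -2) / Ref($close, -1) - 1` may look odd at first sight. A negative offset in `Ref` looks into the future, so the label is the return from tomorrow's close to the close of the day after. The pandas sketch below reproduces the idea for a single instrument, using `shift` as a stand-in for `Ref`; the prices are made up.

```python
import pandas as pd

close = pd.Series(
    [10.0, 11.0, 12.1, 12.1],
    index=pd.to_datetime(["2021-01-04", "2021-01-05", "2021-01-06", "2021-01-07"]),
)

# Ref($close, -n) reads the value n rows ahead, which is shift(-n) in pandas
forward = close.shift(-2) / close.shift(-1) - 1
```

The last two rows of `forward` are `NaN` because the future prices they need are not in the data yet. With several instruments in one frame, the shift would have to be applied per instrument, for example inside a `groupby`.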


The `filter_pipe` parameter restricts the result to the stocks that satisfy preset conditions.

In [None]:
from qlib.data.filter import ExpressionDFilter

insts = 'all'
close_ma = ['EMA($close, 10)', 'EMA($close, 30)']
ma_names = ['EMA10', 'EMA30']

filter_rule = ExpressionDFilter(rule_expression='EMA($close, 10) > EMA($close, 30)')

qdl_fil = QlibDataLoader(config=(close_ma, ma_names), filter_pipe=[filter_rule,])
qdl_fil.load(instruments=insts, start_time='20210104', end_time='20210104')
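
What the filter rule `EMA($close, 10) > EMA($close, 30)` selects can be previewed without Qlib. The sketch below uses pandas `ewm` as a stand-in for the `EMA` operator (an assumption; the two may differ in details), with shorter spans and made-up prices so that the result is easy to check by hand.

```python
import pandas as pd

dates = pd.to_datetime(["2021-01-04", "2021-01-05", "2021-01-06", "2021-01-07"])
close = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0, 4.0], "B": [4.0, 3.0, 2.0, 1.0]},  # A rises, B falls
    index=dates,
)

fast = close.ewm(span=2, min_periods=1).mean()
slow = close.ewm(span=4, min_periods=1).mean()
mask = fast > slow  # True where the fast average is above the slow one

# instruments that pass the rule on each date
selected = {
    date.strftime("%Y-%m-%d"): mask.columns[row.values].tolist()
    for date, row in mask.iterrows()
}
```

Only the rising instrument passes the rule, and only from the second day on, because both averages are equal on the first day.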

### StaticDataLoader

Qlib also provides a `StaticDataLoader` for direct access to external data files. It reads pickle files by default, and with a dictionary-format `config` parameter it also supports CSV and HDF5 (`.h5`) files.

`StaticDataLoader` can read not only static files but also in-memory DataFrames:

In [22]:
qdl = QlibDataLoader(config=(['$open', '$high', '$low', '$close'], ['open', 'high', 'low', 'close']))
df = qdl.load(instruments=['600000.XSHG', '600004.XSHG', '600009.XSHG'], start_time='20210101', end_time='20211231')
df.to_pickle('../data/other/sample.pkl')

In [24]:
from qlib.data.dataset.loader import StaticDataLoader

sdl_pkl = StaticDataLoader(config={'feature': '../data/other/sample.pkl'})
sdl_pkl.load(instruments=['600009.XSHG'], start_time='20210101', end_time='20210131')

Unnamed: 0_level_0,Unnamed: 1_level_0,feature,feature,feature,feature
Unnamed: 0_level_1,Unnamed: 1_level_1,open,high,low,close
datetime,instrument,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
2021-01-04,600009.XSHG,75.699997,78.769997,75.169998,77.489998
2021-01-05,600009.XSHG,77.5,77.5,76.0,77.300003
2021-01-06,600009.XSHG,77.110001,79.480003,76.800003,77.489998
2021-01-07,600009.XSHG,77.660004,77.959999,75.0,75.849998
2021-01-08,600009.XSHG,76.150002,77.120003,75.43,75.980003
2021-01-11,600009.XSHG,76.089996,76.589996,73.129997,73.410004
2021-01-12,600009.XSHG,73.300003,74.0,72.029999,73.720001
2021-01-13,600009.XSHG,73.800003,77.699997,73.730003,77.589996
2021-01-14,600009.XSHG,77.110001,77.230003,73.199997,73.269997
2021-01-15,600009.XSHG,72.849998,73.660004,71.889999,71.940002


In [25]:
data = sdl_pkl.load(instruments=['600009.XSHG'], start_time='20210101', end_time='20210131')
sdl_df = StaticDataLoader(config=data)
sdl_df.load()

Unnamed: 0_level_0,Unnamed: 1_level_0,feature,feature,feature,feature
Unnamed: 0_level_1,Unnamed: 1_level_1,open,high,low,close
datetime,instrument,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
2021-01-04,600009.XSHG,75.699997,78.769997,75.169998,77.489998
2021-01-05,600009.XSHG,77.5,77.5,76.0,77.300003
2021-01-06,600009.XSHG,77.110001,79.480003,76.800003,77.489998
2021-01-07,600009.XSHG,77.660004,77.959999,75.0,75.849998
2021-01-08,600009.XSHG,76.150002,77.120003,75.43,75.980003
2021-01-11,600009.XSHG,76.089996,76.589996,73.129997,73.410004
2021-01-12,600009.XSHG,73.300003,74.0,72.029999,73.720001
2021-01-13,600009.XSHG,73.800003,77.699997,73.730003,77.589996
2021-01-14,600009.XSHG,77.110001,77.230003,73.199997,73.269997
2021-01-15,600009.XSHG,72.849998,73.660004,71.889999,71.940002


## Data Handler

Before training a model we must preprocess the data, for example by handling missing values or standardizing the dataset. `DataHandler` is designed for this.

DataHandler takes three key parameters in the process of handling data:

1. infer_processors: **learn parameters during fitting time, and process data during non-fitting time**
2. learn_processors: **process data without learning**
3. shared_processors: **shared processors**

Because `infer_processors` involve a learning step, their classes provide a `fit` method.

DataHandler keeps both the raw data and the processed data. When the user fetches data, it returns a different version according to the data key.

> *From my point of view, the `learn_processors` are more like time-series based data processors, and the `infer_processors` are cross-section based data processors.*

1. `data_key = DataHandlerLP.DK_I`, return infer df
2. `data_key = DataHandlerLP.DK_L`, return learn df
3. `data_key = DataHandlerLP.DK_R`, return raw df

There is another parameter called `process_type`, which decides the order in which the processors are applied:

1. `process_type = DataHandlerLP.PTYPE_I`, independent processing
2. `process_type = DataHandlerLP.PTYPE_A`, append processing

![process-type](./images/process-type.jpg)
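
The difference between the two process types can be written down as plain function composition. The sketch below is a toy model of the pipeline, not Qlib's implementation: processors are ordinary functions on lists, and the names `run` and `process` are made up. It assumes that shared processors always run first, that the "independent" mode applies `learn_processors` directly to the shared output, and that the "append" mode applies them on top of the inference output.

```python
from functools import reduce


def run(processors, data):
    """Apply the processors one after another."""
    return reduce(lambda current, proc: proc(current), processors, data)


def process(raw, shared, infer, learn, process_type="append"):
    """Return (infer_data, learn_data) for the given process type."""
    shared_out = run(shared, raw)
    infer_out = run(infer, shared_out)
    if process_type == "independent":
        learn_out = run(learn, shared_out)  # learn branch skips infer processors
    elif process_type == "append":
        learn_out = run(learn, infer_out)   # learn branch continues after infer
    else:
        raise ValueError(f"unknown process type: {process_type}")
    return infer_out, learn_out


def drop_none(values):
    return [v for v in values if v is not None]


def double(values):
    return [2 * v for v in values]


def plus_one(values):
    return [v + 1 for v in values]


raw = [1, None, 3]
infer_i, learn_i = process(raw, [drop_none], [double], [plus_one], "independent")
infer_a, learn_a = process(raw, [drop_none], [double], [plus_one], "append")
```

In both modes the inference data is the same; only the learning data changes, because in "append" mode it also goes through the inference processors.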

In [8]:
from qlib.data.dataset.processor import CSZScoreNorm, DropnaProcessor, ZScoreNorm
from qlib.data.dataset.handler import DataHandlerLP

shared_processors = [DropnaProcessor()]
learn_processors = [CSZScoreNorm()]
infer_processors = [ZScoreNorm(fit_start_time='20210101', fit_end_time='20210110')]

dh_pr_test = DataHandlerLP(
    instruments = ['600000.XSHG', '600004.XSHG', '600009.XSHG'],
    start_time = '20210101',
    end_time = '20210120',
    process_type = DataHandlerLP.PTYPE_I,
    learn_processors = learn_processors,
    shared_processors = shared_processors,
    infer_processors = infer_processors,
    data_loader = qdl,
)

[11854:MainThread](2022-08-24 13:31:25,527) INFO - qlib.timer - [log.py:117] - Time cost: 0.292s | Loading data Done
[11854:MainThread](2022-08-24 13:31:25,572) INFO - qlib.timer - [log.py:117] - Time cost: 0.043s | DropnaProcessor Done
[11854:MainThread](2022-08-24 13:31:25,702) INFO - qlib.timer - [log.py:117] - Time cost: 0.129s | ZScoreNorm Done
[11854:MainThread](2022-08-24 13:31:25,744) INFO - qlib.timer - [log.py:117] - Time cost: 0.041s | CSZScoreNorm Done
[11854:MainThread](2022-08-24 13:31:25,746) INFO - qlib.timer - [log.py:117] - Time cost: 0.217s | fit & process data Done
[11854:MainThread](2022-08-24 13:31:25,747) INFO - qlib.timer - [log.py:117] - Time cost: 0.512s | Init data Done
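
The two normalizers used in this cell behave quite differently, which is easy to see on a toy frame. The pandas sketch below imitates the idea only: a time-series z-score learns one mean and standard deviation on a fit window and applies them everywhere, while a cross-sectional z-score normalizes each date on its own. The data and the choice of `ddof` are assumptions and are not taken from the Qlib processors.

```python
import pandas as pd

dates = pd.to_datetime(["2021-01-04", "2021-01-05", "2021-01-06"])
idx = pd.MultiIndex.from_product(
    [dates, ["A", "B", "C"]], names=["datetime", "instrument"]
)
df = pd.DataFrame(
    {"close": [1.0, 2.0, 3.0, 2.0, 4.0, 6.0, 10.0, 20.0, 30.0]}, index=idx
)

# time-series style: fit on the first two days, then apply to the whole frame
in_fit = df.index.get_level_values("datetime") <= pd.Timestamp("2021-01-05")
mean, std = df[in_fit].mean(), df[in_fit].std(ddof=0)
ts_norm = (df - mean) / std

# cross-sectional style: every date is normalized with its own mean and std
by_date = df.groupby(level="datetime")
cs_norm = (df - by_date.transform("mean")) / by_date.transform("std")
```

On the last day, which lies outside the fit window, `ts_norm` produces very large values because prices jumped, whereas `cs_norm` still yields -1, 0 and 1 for the three instruments.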


Okay, let's fetch the results

In [7]:
raw_df = dh_pr_test.fetch(data_key=DataHandlerLP.DK_R)
infer_df = dh_pr_test.fetch(data_key=DataHandlerLP.DK_I)
learn_df = dh_pr_test.fetch(data_key=DataHandlerLP.DK_L)

infer_df.isna().sum(), learn_df.isna().sum()

(close    0
 high     0
 dtype: int64,
 close    0
 high     0
 dtype: int64)

In [8]:
learn_df.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,close,high
datetime,instrument,Unnamed: 2_level_1,Unnamed: 3_level_1
2021-01-04,600000.XSHG,-0.633712,-0.633847
2021-01-04,600004.XSHG,-0.519091,-0.518947
2021-01-04,600009.XSHG,1.152803,1.152793
2021-01-05,600000.XSHG,-0.630972,-0.63186
2021-01-05,600004.XSHG,-0.522014,-0.521067


In [9]:
infer_df.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,close,high
datetime,instrument,Unnamed: 2_level_1,Unnamed: 3_level_1
2021-01-04,600000.XSHG,-0.768963,-0.772224
2021-01-04,600004.XSHG,-0.627621,-0.630508
2021-01-04,600009.XSHG,1.434015,1.431394
2021-01-05,600000.XSHG,-0.769288,-0.77382
2021-01-05,600004.XSHG,-0.635094,-0.639445


## DataSet

Now we come to the last step before training: splitting the dataset.

The dataset classes are defined directly in `qlib.data.dataset`, so this is a good point for an overview of the package. The workflow is to initialize a `DataLoader` for a `DataHandler`, and then pass the `DataHandler` to a final `DatasetH`, where the 'H' stands for `Handler`, indicating that the dataset is constructed from a `DataHandler`.

In [10]:
from qlib.data.dataset import DatasetH

ds = DatasetH(handler=dh_pr_test, segments={"train": ("20210101", "20210105"), "test": ("20210106", "20210110")})

To fetch the split data, call `ds.prepare` with the name of the segment:

In [11]:
ds.prepare('train').head()

Unnamed: 0_level_0,Unnamed: 1_level_0,close,high
datetime,instrument,Unnamed: 2_level_1,Unnamed: 3_level_1
2021-01-04,600000.XSHG,-0.768963,-0.772224
2021-01-04,600004.XSHG,-0.627621,-0.630508
2021-01-04,600009.XSHG,1.434015,1.431394
2021-01-05,600000.XSHG,-0.769288,-0.77382
2021-01-05,600004.XSHG,-0.635094,-0.639445


In [12]:
ds.prepare('test').head()

Unnamed: 0_level_0,Unnamed: 1_level_0,close,high
datetime,instrument,Unnamed: 2_level_1,Unnamed: 3_level_1
2021-01-06,600000.XSHG,-0.764739,-0.769032
2021-01-06,600004.XSHG,-0.643867,-0.638807
2021-01-06,600009.XSHG,1.434015,1.454056
2021-01-07,600000.XSHG,-0.765064,-0.76584
2021-01-07,600004.XSHG,-0.666937,-0.648382


In practice, some models take time-series data as input, or are trained on a rolling window. Qlib provides a class called `TSDatasetH` that generates such time-series datasets. The next cell creates a ten-day rolling-window dataset. As the days roll forward, you can get the latest window with `data[date]`-style indexing and feed it to the model for rolling training.

**NOTE: although `TSDataSampler` can access data through `[datetime, asset]`, unlike a pandas index it does not check whether the asset exists. It actually uses `bisect.bisect_right` to locate the position of the asset, so a nonexistent asset may be accepted, and the returned data will then belong to a different asset.**
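
The pitfall in the note is easy to reproduce with the standard `bisect` module alone. The lookup below is a toy, not the sampler's code (the helper `locate` is made up), but it shows why a position-based search never raises for an unknown key.

```python
import bisect

assets = ["600000.XSHG", "600004.XSHG", "600009.XSHG"]  # must be sorted


def locate(asset):
    """Position of the right-most entry that is <= `asset`."""
    return bisect.bisect_right(assets, asset) - 1


known = locate("600004.XSHG")
unknown = locate("600005.XSHG")  # not in the list, yet no error is raised
```

Both calls return the same position, so a request for the missing `600005.XSHG` would silently resolve to `600004.XSHG`. Checking membership first (for example with `asset in assets`) avoids the surprise.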

In [97]:
from qlib.data.dataset import TSDatasetH
from qlib.data.dataset.loader import QlibDataLoader
from qlib.data.dataset.handler import DataHandlerLP
from qlib.data.dataset.processor import CSZScoreNorm

dl = QlibDataLoader(
    config = [
        ("$open", "$high", "$low", "$close"),
        ("open", "high", "low", "close"),
    ]
)
dh = DataHandlerLP(
    instruments = "000016.XSHG",
    start_time = "20200101",
    end_time = "20220801",
    process_type = "independent",
    # infer_processors = [CSZScoreNorm()],
    data_loader = dl
)
ds = TSDatasetH(
    step_len = 10,
    segments = {
        "train": ("20200101", "20220101"),
        "valid": ("20220104", "20220601"),
        "test": ("20220602", "20220801"),
    },
    handler = dh,
)
train_sampler = ds.prepare('train')

[3747:MainThread](2022-08-24 17:12:31,099) INFO - qlib.timer - [log.py:117] - Time cost: 0.559s | Loading data Done
[3747:MainThread](2022-08-24 17:12:31,101) INFO - qlib.timer - [log.py:117] - Time cost: 0.000s | fit & process data Done
[3747:MainThread](2022-08-24 17:12:31,102) INFO - qlib.timer - [log.py:117] - Time cost: 0.563s | Init data Done


`TSDatasetH` returns a `TSDataSampler`, a class that provides rolling-window data. It can be indexed the pandas way with `[datetime, asset]`, or with a single integer.

When you index the pandas way you get an `np.ndarray`, and the first few samples may contain many `nan` values. That is because there are not yet enough rows to fill the rolling window, so Qlib pads the window with `nan` automatically. For `nan` values in the middle of the data, the `fillna_type` parameter controls how they are filled.

In [98]:
import pandas as pd

dl.load(instruments=['600000.XSHG'], start_time='20200917', end_time='20201001').droplevel(1)

Unnamed: 0_level_0,open,high,low,close
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2020-09-17,9.87,9.89,9.8,9.83
2020-09-18,9.84,9.94,9.8,9.94
2020-09-21,9.94,9.97,9.83,9.84
2020-09-22,9.81,9.84,9.69,9.7
2020-09-23,9.71,9.72,9.62,9.63
2020-09-24,9.61,9.62,9.46,9.5
2020-09-25,9.5,9.53,9.46,9.47
2020-09-28,9.49,9.53,9.46,9.46
2020-09-29,9.5,9.52,9.43,9.43
2020-09-30,9.45,9.49,9.35,9.39


In [99]:
train_sampler['20201001', "600000.XSHG"]

array([[9.87, 9.89, 9.8 , 9.83],
       [9.84, 9.94, 9.8 , 9.94],
       [9.94, 9.97, 9.83, 9.84],
       [9.81, 9.84, 9.69, 9.7 ],
       [9.71, 9.72, 9.62, 9.63],
       [9.61, 9.62, 9.46, 9.5 ],
       [9.5 , 9.53, 9.46, 9.47],
       [9.49, 9.53, 9.46, 9.46],
       [9.5 , 9.52, 9.43, 9.43],
       [9.45, 9.49, 9.35, 9.39]], dtype=float32)

As mentioned above, the other way to select data from the sampler is with an integer. The logic is:

1. treat the integer as a position in the sampler's data index, which you can access via `train_sampler.data_index`, and convert it into a tuple;
2. in that tuple, the first value is the date index and the second is the instrument index. The sampler then selects the latest `step_len` days of that asset and returns the data.

This method is less explicit, so the pandas-style indexing is probably more useful.

In [112]:
# train_sampler.idx_df stores the index mapping in DataFrame format
n = train_sampler.data_index.get_indexer([('20200930', '600000.XSHG')])
train_sampler[n]

array([[[9.87, 9.89, 9.8 , 9.83],
        [9.84, 9.94, 9.8 , 9.94],
        [9.94, 9.97, 9.83, 9.84],
        [9.81, 9.84, 9.69, 9.7 ],
        [9.71, 9.72, 9.62, 9.63],
        [9.61, 9.62, 9.46, 9.5 ],
        [9.5 , 9.53, 9.46, 9.47],
        [9.49, 9.53, 9.46, 9.46],
        [9.5 , 9.52, 9.43, 9.43],
        [9.45, 9.49, 9.35, 9.39]]], dtype=float32)

Last but not least, here are some useful attributes that may help you find the logic of slicing:

1. `train_sampler.data_index`, a data index which is taken from the original dataset
2. `train_sampler.idx_df`, an index mapping for `train_sampler.data_index`; the values of the dataframe are the positions in `train_sampler.data_index`
3. `train_sampler.data_arr`, original dataset in array format
4. `train_sampler.idx_map`, the ndarray form index storing the numerical index for `train_sampler.data_index`
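
The integer slicing logic and the attributes above can be imitated with a few NumPy arrays. This is a toy reconstruction for intuition only: the array names mirror the attributes, but their real layout inside `TSDataSampler` may differ (here `-1` marks a missing cell, which is an assumption), and the `sample` helper is made up.

```python
import numpy as np

# flat data sorted by (asset, date): rows 0-2 belong to asset 0, rows 3-4 to asset 1
data_arr = np.array([[1.0], [2.0], [3.0], [10.0], [20.0]], dtype=np.float32)

# position table: rows = dates, columns = assets, value = row in data_arr
idx_df = np.array([[0, -1], [1, 3], [2, 4]])  # asset 1 has no data on the first date

# sample number -> (date position, asset position), skipping missing cells
idx_map = np.array([
    (d, a)
    for a in range(idx_df.shape[1])
    for d in range(idx_df.shape[0])
    if idx_df[d, a] >= 0
])


def sample(i, step_len):
    """Return the window of `step_len` days ending at sample `i`, NaN padded."""
    d, a = idx_map[i]
    rows = idx_df[max(0, d - step_len + 1): d + 1, a]
    out = np.full((step_len, data_arr.shape[1]), np.nan, dtype=data_arr.dtype)
    offset = step_len - len(rows)  # leading rows stay NaN for short histories
    for k, row in enumerate(rows):
        if row >= 0:
            out[offset + k] = data_arr[row]
    return out


full_window = sample(2, step_len=2)    # asset 0, third date
padded_window = sample(3, step_len=2)  # asset 1, its first available date
```

`full_window` holds two real rows, while `padded_window` starts with a `nan` row because asset 1 has only one day of history at that point.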

## Model Training

We now have a processed dataset with features and a label. The next step is to train different models on it. Qlib ships many built-in models, located in `qlib.contrib.model`.

In [None]:
from qlib.contrib.data.handler import Alpha158
from qlib.data.dataset import TSDatasetH
from qlib.contrib.model.pytorch_alstm_ts import ALSTM

train_period = ("2021-01-01", "2021-01-10")
valid_period = ("2021-01-11", "2021-01-15")
test_period = ("2021-01-16", "2021-01-20")

dh = Alpha158(
    instruments = ['600000.XSHG', '600004.XSHG', '600009.XSHG'],
    start_time = train_period[0],
    end_time = test_period[1],
    infer_processors = {}
)
ds = TSDatasetH(
    handler = dh,
    step_len = 40,
    segments = {
        "train": train_period,
        "valid": valid_period,
        "test": test_period,
    },
)
model = ALSTM(
    d_feat = 158,
    metric = "mse",
    run_type = "GRU",
    batch_size = 800,
    early_stop = 10,
)
model.fit(dataset = ds, save_path = None)
model.predict(dataset=ds, segment='test')

## Configuration

Qlib also provides a more convenient way to initialize a `QlibDataLoader`, `DataHandler`, `Dataset`, or even a `Model`: the `init_instance_by_config` function.

In [14]:
from qlib.utils import init_instance_by_config

qdl_config = {
    "class": "QlibDataLoader",
    "module_path": "qlib.data.dataset.loader",
    "kwargs": {
        "config": {
            "feature": (['EMA($close, 10)', 'EMA($close, 30)'], ['EMA10', 'EMA30']),
            "label": (['Ref($close, -2) / Ref($close, -1) - 1'], ['Forward',]),
        },
        "freq": "day",
    }
}

qdl = init_instance_by_config(qdl_config)
insts = ['600000.XSHG', '600004.XSHG', '600009.XSHG']
qdl.load(instruments=insts, start_time='20210101', end_time='20210131').head()

Unnamed: 0_level_0,Unnamed: 1_level_0,feature,feature,label
Unnamed: 0_level_1,Unnamed: 1_level_1,EMA10,EMA30,Forward
datetime,instrument,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
2021-01-04,600000.XSHG,9.606045,9.729256,0.014463
2021-01-04,600004.XSHG,13.699723,14.055768,-0.019551
2021-01-04,600009.XSHG,74.138649,75.316963,0.002458
2021-01-05,600000.XSHG,9.621153,9.725617,-0.001018
2021-01-05,600004.XSHG,13.722251,14.037616,-0.052437


The `module_path` key in `qdl_config` can also be omitted, in which case the `class` key must hold the full import path:

```python
qdl_config = {
    "class": "qlib.data.dataset.loader.QlibDataLoader",
    "kwargs": {
        ...
    }
}
```
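
How such a config can be turned into an object is easy to sketch with `importlib`. The function below is a toy re-implementation for intuition, not the code behind `init_instance_by_config`; the name `init_by_config` is made up, and a standard-library class is used so that the snippet runs without Qlib.

```python
import importlib


def init_by_config(config):
    """Resolve `class` (plus an optional `module_path`) and instantiate it."""
    class_path = config["class"]
    module_path = config.get("module_path")
    if module_path is None:
        # no module_path: the class key carries the full dotted path
        module_path, class_name = class_path.rsplit(".", 1)
    else:
        class_name = class_path
    cls = getattr(importlib.import_module(module_path), class_name)
    return cls(**config.get("kwargs", {}))


with_module = init_by_config({
    "class": "Fraction",
    "module_path": "fractions",
    "kwargs": {"numerator": 1, "denominator": 3},
})
full_path = init_by_config({
    "class": "fractions.Fraction",
    "kwargs": {"numerator": 1, "denominator": 3},
})
```

Both spellings of the config resolve to the same class and produce equal objects.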

Since a `DataHandler` can also be initialized by `init_instance_by_config`, we can store factor definitions in a config dictionary. To start, we subclass `DataHandlerLP`, so that the handled data can be fetched with the `fetch` method and a `Dataset` can be built easily.

In [None]:
class MACDRSIFeature(DataHandlerLP):
    def __init__(
        self,
        instruments = None,
        start_time = None,
        end_time = None,
        freq = "day",
        infer_processors = [],
        learn_processors = [],
        fit_start_time = None,
        fit_end_time = None,
        process_type=DataHandlerLP.PTYPE_A,
        **kwargs,
    ):
        data_loader = {
            "class": "QlibDataLoader",
            "kwargs": {
                "config": {
                    "feature": self.get_feature_config(),
                    "label": kwargs.get("label", self.get_label_config()),
                },
                "freq": freq,
            }
        }
        super().__init__(
            instruments = instruments,
            start_time = start_time,
            end_time = end_time,
            data_loader = data_loader,
            infer_processors = infer_processors,
            learn_processors = learn_processors,
            process_type = process_type,
        )
    
    def get_feature_config(self):
        macd = '(EMA($close, 12) - EMA($close, 26)) / $close - EMA((EMA($close, 12) - EMA($close, 26)) / $close, 9) / $close'
        rsi = ('100 - 100 / (1 + (Sum(Greater($close - Ref($close, 1), 0), 14) / Count(($close - Ref($close, 1)) > 0, 14)) /'
            '(Sum(Abs(Greater(Ref($close, 1) - $close, 0)), 14) / Count(($close - Ref($close, 1)) < 0, 14)))')
        return [macd, rsi], ['MACD', 'RSI']
    def get_label_config(self):
        return (["Ref($close, -2) / Ref($close, -1) - 1", ], ["Forward", ])
    
feature = MACDRSIFeature(instruments=['600000.XSHG', '600004.XSHG', '600009.XSHG'], start_time='20210101', end_time='2021-01-31')
feature.fetch().head()

Inheriting from `DataHandlerLP` is not an arbitrary choice: the built-in `Alpha158` and `Alpha360` handlers do the same. Moreover, most models and preset factors live in `qlib.contrib`, which you can explore later.

Here is a snippet of the `Alpha360` definition in Qlib:

```python
class Alpha360(DataHandlerLP):
    def __init__(
        self,
        instruments="csi500",
        start_time=None,
        end_time=None,
        freq="day",
        infer_processors=_DEFAULT_INFER_PROCESSORS,
        learn_processors=_DEFAULT_LEARN_PROCESSORS,
        fit_start_time=None,
        fit_end_time=None,
        filter_pipe=None,
        inst_processor=None,
        **kwargs,
    ):
    ...
```

With this kind of configuration we can easily construct a dataset and a model; together they form a workflow.

In [None]:
ds_config = {
    "class": "TSDatasetH",
    "module_path": "qlib.data.dataset",
    "kwargs": {
        "handler": {
            "class": "Alpha158",
            "module_path": "qlib.contrib.data.handler",
            "kwargs": {
                "start_time": "2015-01-01",
                "end_time": "2022-03-01",
                "fit_start_time": "2015-01-01",
                "fit_end_time": "2019-12-31",
                "instruments": ['600000.XSHG', '600004.XSHG', '600009.XSHG'],
                "infer_processors": [
                    {
                        "class": "RobustZScoreNorm",
                        "kwargs": {
                            "fields_group": "feature",
                            "clip_outlier": True,
                        },
                    },
                    {
                        "class": "Fillna",
                        "kwargs": {
                            "fields_group": "feature",
                        },
                    },
                ],
                "learn_processors": [
                    {
                        "class": "DropnaLabel",
                    },
                    {
                        "class": "CSRankNorm",
                        "kwargs": {
                            "fields_group": "label"
                        }
                    },
                ],
                "label": ["Ref($close, -2) / Ref($close, -1) - 1"],
            }
        },
        "segments": {
            "train": ["2015-01-01", "2019-12-31"],
            "valid": ["2020-01-01", "2020-12-31"],
            "test": ["2021-01-01", "2022-03-01"],
        },
        "step_len": 40,
    }
}
model_config = {
    "class": "ALSTM",
    "module_path": "qlib.contrib.model.pytorch_alstm_ts",
    "kwargs": {
        "d_feat": 158,
        "hidden_size": 64,
        "num_layers": 2,
        "dropout": 0.0,
        "n_epochs": 200,
        "lr": 1e-3,
        "early_stop": 10,
        "batch_size": 800,
        "metric": "loss",
        "loss": "mse",
        "n_jobs": 20,
        "GPU": 0,
        "run_type": "GRU",
    }
}

ds = init_instance_by_config(ds_config)
model = init_instance_by_config(model_config)
model.fit(dataset=ds)

Configurations in YAML format can also be loaded with the `yaml` package and run in the same way.

## Recorder

A recorder keeps track of the machine learning process. It lets us save intermediate results or partially trained models to disk. When the `resume` parameter is set to `True`, training continues from where the last run ended.

There are three typical recorders in Qlib:

1. SignalRecord, for test data, generate predict value with trained model
2. SigAnaRecord, analyze generated predict value, calculating IC, IR etc.
3. PortAnaRecord, backtest predict factor value, to see the backtest performance
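
The IC mentioned in the list is simply the per-day correlation between predictions and realized labels. The pandas sketch below shows the calculation on made-up numbers; the column names `score` and `label` and the helper `daily_ic` are assumptions, and this is not the code used by `SigAnaRecord`.

```python
import pandas as pd

dates = pd.to_datetime(["2021-01-04", "2021-01-05"])
idx = pd.MultiIndex.from_product(
    [dates, ["A", "B", "C"]], names=["datetime", "instrument"]
)
pred = pd.DataFrame(
    {
        "score": [0.1, 0.2, 0.3, 0.3, 0.2, 0.1],        # model predictions
        "label": [0.01, 0.02, 0.04, 0.01, 0.02, 0.04],  # realized returns
    },
    index=idx,
)


def daily_ic(frame, rank=False):
    """Correlation between score and label on each date (rank=True: rank IC)."""
    def corr(group):
        score, label = group["score"], group["label"]
        if rank:
            score, label = score.rank(), label.rank()
        return score.corr(label)

    return frame.groupby(level="datetime").apply(corr)


ic = daily_ic(pred)
rank_ic = daily_ic(pred, rank=True)
ir = ic.mean() / ic.std()  # information ratio of the IC series
```

On the first day the predicted order matches the realized order, so the rank IC is 1; on the second day the order is reversed and the rank IC is -1.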

## Strategy

Users can define their own strategy class in Qlib by inheriting from `BaseStrategy`; the core logic goes in the `generate_trade_decision` method. This is similar to the `handle_data` function in the zipline framework.

There are 3 key parameters in class `BaseStrategy` initialization:

1. level_infra: level infrastructure, some common components, like trading calendar
2. common_infra: common infrastructure, some common components, like trade_positions, exchange market
3. trade_exchange: when set, use this to represent exchange_market, else turn to find exchange parameter in `common_infra`

A backtest in Qlib starts with `get_signal`. Your strategy class then produces a trade decision in `generate_trade_decision`, and the decision is passed to the Executor, the class that connects the strategy to the exchange. The executor delivers the orders to the exchange, which generates an `execute_result`. Finally, the result is returned to the strategy itself.
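
The loop described above can be mimicked with three tiny classes. Everything in this sketch is hypothetical (`ToyExchange`, `ToyExecutor`, `ToyTopStrategy`, the order tuples and the extra `step` argument); it only illustrates how the decision flows to the exchange and how the result flows back, and it does not use Qlib's real classes or signatures.

```python
class ToyExchange:
    """Hypothetical exchange: fills every order at the price of the step."""

    def __init__(self, prices):
        self.prices = prices  # {instrument: [price per step]}

    def deal(self, order, step):
        instrument, amount = order
        return {
            "instrument": instrument,
            "amount": amount,
            "price": self.prices[instrument][step],
        }


class ToyExecutor:
    """Hypothetical executor: forwards each order of a decision to the exchange."""

    def __init__(self, exchange):
        self.exchange = exchange

    def execute(self, decision, step):
        return [self.exchange.deal(order, step) for order in decision]


class ToyTopStrategy:
    """Buys 100 shares of the best-scored instrument at every step."""

    def __init__(self, signal):
        self.signal = signal  # one {instrument: score} dict per step
        self.feedback = []    # execution results handed back to the strategy

    def generate_trade_decision(self, step, execute_result=None):
        if execute_result is not None:
            self.feedback.append(execute_result)
        scores = self.signal[step]
        best = max(scores, key=scores.get)
        return [(best, 100)]


signal = [{"A": 0.2, "B": 0.1}, {"A": 0.1, "B": 0.3}]
prices = {"A": [10.0, 11.0], "B": [20.0, 19.0]}
strategy = ToyTopStrategy(signal)
executor = ToyExecutor(ToyExchange(prices))

execute_result = None
for step in range(len(signal)):
    decision = strategy.generate_trade_decision(step, execute_result)
    execute_result = executor.execute(decision, step)
```

At each step the strategy first receives the result of the previous step and then decides what to trade next, which is the feedback loop the paragraph describes.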

In [2]:
%load_ext dmind
%dmindheader

In [3]:
%%dmind text filetree classic

Qlib Strategy

    BaseStrategy
        property
            level_infra
            common_infra
            trade_exchange
        method
            trade_calendar
            trade_position
            trade_exchange
            generate_trade_decision
    
    BaseSignalStrategy
        property
            signal
        
    BuiltinStrategy
        TopkDropoutStrategy
        EnhancedIndexingStrategy