# Running Stock Predictor (US Stocks) on Google Colab



## Get US stock data from Yahoo Finance and save it to Google Drive
- Your Google Drive must be mounted in Colab.
- This only needs to be done once.
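
The download below writes to `/content/drive/MyDrive`, so Drive has to be mounted first. A minimal sketch, assuming the default Colab mount point `/content/drive`; the helper names `drive_is_mounted` and `mount_drive` are hypothetical, and the `google.colab` module only exists inside a Colab runtime:

```python
import os


def drive_is_mounted(mount_point="/content/drive"):
    """Return True if a Drive folder is visible under the mount point."""
    return os.path.isdir(os.path.join(mount_point, "MyDrive"))


def mount_drive(mount_point="/content/drive"):
    """Mount Google Drive when running in Colab; return whether it is mounted."""
    if drive_is_mounted(mount_point):
        return True
    try:
        # This module is only available inside a Colab runtime.
        from google.colab import drive
    except ImportError:
        return False
    drive.mount(mount_point)  # prompts for authorization in Colab
    return drive_is_mounted(mount_point)


print("Drive mounted:", mount_drive())
```

Outside Colab the import fails and the helper simply reports `False`, so the cell is safe to run anywhere.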


In [None]:
!pip install numpy
!pip install --upgrade cython
!git clone https://github.com/microsoft/qlib.git

In [None]:
!cd qlib && python setup.py install

In [10]:
!python /content/qlib/scripts/get_data.py qlib_data --target_dir /content/drive/MyDrive/qlib_us_data --region us

2022-09-02 04:02:52.425 | INFO     | qlib.tests.data:_download_data:59 - qlib_data_us_1d_latest.zip downloading......
450095104it [00:10, 43224598.86it/s]
2022-09-02 04:03:02.841 | INFO     | qlib.tests.data:_unzip:85 - /content/drive/MyDrive/qlib_us_data/20220902040252_qlib_data_us_1d_latest.zip unzipping......
100% 71959/71959 [08:41<00:00, 138.01it/s]
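
Because the download only needs to run once, later sessions can check the target directory before downloading again. A minimal sketch, assuming the usual qlib data layout with `calendars`, `instruments`, and `features` subdirectories; the helper name `missing_qlib_dirs` is hypothetical:

```python
from pathlib import Path

# Subdirectories a qlib data directory is assumed to contain.
EXPECTED_DIRS = ("calendars", "instruments", "features")


def missing_qlib_dirs(data_dir):
    """Return the expected subdirectories that are absent from data_dir."""
    root = Path(data_dir).expanduser()
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


missing = missing_qlib_dirs("/content/drive/MyDrive/qlib_us_data")
if missing:
    print("Data looks incomplete; missing:", missing)
else:
    print("Data already present; the download step can be skipped.")
```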


## Installation
* Clone the StockPredictor repo and install pyqlib.
* Restart the runtime after installing.
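
After restarting the runtime, it can be worth confirming that the packages are importable before running the later cells. A minimal sketch using only the standard library; `qlib` is the import name of the `pyqlib` package, and `StockPredictor` is assumed to be importable only when the working directory contains the cloned repo:

```python
import importlib.util


def is_importable(module_name):
    """Return True if Python can locate the top-level module without importing it."""
    return importlib.util.find_spec(module_name) is not None


for name in ("qlib", "StockPredictor"):
    status = "ok" if is_importable(name) else "missing"
    print(f"{name}: {status}")
```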

In [3]:
!git clone https://github.com/jingedawang/StockPredictor.git

Cloning into 'StockPredictor'...
remote: Enumerating objects: 95, done.

In [6]:
!pip install pyqlib

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sacred>=0.7.4
  Downloading sacred-0.8.2-py2.py3-none-any.whl (106 kB)
Collecting python-socketio
  Downloading python_socketio-5.7.1-py3-none-any.whl (56 kB)
Collecting redis>=3.0.1
  Downloading redis-4.3.4-py3-none-any.whl (246 kB)
Collecting python-redis-lock>=3.3.1
  Downloading python_redis_lock-3.7.0-py2.py3-none-any.whl (12 kB)
Collecting schedule>=0.6.0
  Downloading schedule-1.1.0-py2.py3-none-any.whl (10 kB)
Collecting fire>=0.3.1
  Downloading fire-0.4.0.tar.gz (87 kB)
Collecting matplotlib>=3.3
  Downloading matplotlib-3.5.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (11.2 MB)

In [4]:
import qlib
from qlib.constant import REG_US
from qlib.data.dataset import DatasetH
from qlib.utils import init_instance_by_config, flatten_dict
from qlib.workflow import R
from qlib.tests.data import GetData

from StockPredictor.algorithm.stock_predictor.data_handler import Alpha158TwoWeeks

import pickle

In [5]:
# use default data
provider_uri = "/content/drive/MyDrive/qlib_us_data"  # target_dir
qlib.init(provider_uri=provider_uri, region=REG_US)

[667:MainThread](2022-09-02 04:14:36,564) INFO - qlib.Initialization - [config.py:413] - default_conf: client.
INFO:qlib.Initialization:default_conf: client.
[667:MainThread](2022-09-02 04:14:36,578) INFO - qlib.Initialization - [__init__.py:74] - qlib successfully initialized based on client settings.
INFO:qlib.Initialization:qlib successfully initialized based on client settings.
[667:MainThread](2022-09-02 04:14:36,589) INFO - qlib.Initialization - [__init__.py:76] - data_path={'__DEFAULT_FREQ': PosixPath('/content/drive/MyDrive/qlib_us_data')}
INFO:qlib.Initialization:data_path={'__DEFAULT_FREQ': PosixPath('/content/drive/MyDrive/qlib_us_data')}


In [8]:
# Load data with our customized data handler.
# Alpha158TwoWeeks differs from Alpha158 only in its labels.
# TODO: Data matters for model training; try other adjustments to the data handler to achieve better results.
data_handler = Alpha158TwoWeeks(instruments='sp500')
dataset = DatasetH(
    handler=data_handler,
    segments={
        "train": ["2008-01-01", "2014-12-31"],
        "valid": ["2015-01-01", "2016-12-31"],
        "test": ["2017-01-01", "2020-08-01"],
    },
)

[667:MainThread](2022-09-02 05:15:33,159) INFO - qlib.timer - [log.py:117] - Time cost: 456.453s | Loading data Done
INFO:qlib.timer:Time cost: 456.453s | Loading data Done
[667:MainThread](2022-09-02 05:15:36,568) INFO - qlib.timer - [log.py:117] - Time cost: 0.691s | DropnaLabel Done
INFO:qlib.timer:Time cost: 0.691s | DropnaLabel Done
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[k1] = value[k2]
[667:MainThread](2022-09-02 05:15:50,598) INFO - qlib.timer - [log.py:117] - Time cost: 14.025s | CSZScoreNorm Done
INFO:qlib.timer:Time cost: 14.025s | CSZScoreNorm Done
[667:MainThread](2022-09-02 05:15:50,648) INFO - qlib.timer - [log.py:117] - Time cost: 17.482s | fit & process data Done
INFO:qlib.timer:Time cost: 17.482s | fit & process data Done
[667:MainThread](202
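
The comment in the cell above says `Alpha158TwoWeeks` changes only the labels. Its definition is not shown in this notebook, so the following is only an illustration of how a longer-horizon return label can be computed with pandas. It assumes the label follows the pattern of qlib's default `Alpha158` label, `Ref($close, -2) / Ref($close, -1) - 1`, stretched to a longer horizon (two weeks would be roughly 10 trading days). The function name `forward_return` and the prices are made up:

```python
import pandas as pd


def forward_return(close, horizon):
    """Return from the next day's close to the close `horizon` days after that.

    horizon=1 corresponds to Ref($close, -2) / Ref($close, -1) - 1.
    """
    return close.shift(-(horizon + 1)) / close.shift(-1) - 1


# Made-up closing prices for a single instrument.
close = pd.Series(
    [100.0, 101.0, 102.0, 104.0, 103.0, 105.0],
    index=pd.date_range("2008-01-02", periods=6, freq="B"),
)

one_day = forward_return(close, horizon=1)
three_day = forward_return(close, horizon=3)  # short stand-in for a longer horizon
print(pd.DataFrame({"1d": one_day, "3d": three_day}))
```

Rows near the end of the series have no future price to look up, so their labels are NaN; that is the kind of row a `DropnaLabel` step removes.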

In [9]:
# NOTE: This cell is optional.
# Show the prepared training data to make sure we are using the correct data for training.
example_df = dataset.prepare("train")
print(example_df.head())

                           KMID      KLEN     KMID2       KUP      KUP2  \
datetime   instrument                                                     
2008-01-02 A          -0.010090  0.018544 -0.544115  0.003545  0.191178   
           AA         -0.009051  0.021393 -0.423081  0.008777  0.410255   
           AAPL       -0.022231  0.038691 -0.574578  0.004968  0.128406   
           ABC         0.001799  0.013717  0.131146  0.011468  0.836067   
           ABK         0.076923  0.076923  1.000000  0.000000  0.000000   

                           KLOW     KLOW2      KSFT     KSFT2     OPEN0  ...  \
datetime   instrument                                                    ...   
2008-01-02 A           0.004909  0.264707 -0.008726 -0.470587  1.010193  ...   
           AA          0.003565  0.166664 -0.014262 -0.666672  1.009134  ...   
           AAPL        0.011492  0.297017 -0.015707 -0.405967  1.022737  ...   
           ABC         0.000450  0.032788 -0.009220 -0.672133  0.998204  .
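
One concrete check on the prepared frame is that every row's date falls inside the intended segment. A minimal sketch on a toy frame with the same `(datetime, instrument)` index layout as the output above; the helper name `dates_within` and the toy values are illustrative only:

```python
import pandas as pd


def dates_within(df, start, end):
    """Return True if every 'datetime' index value lies in [start, end]."""
    dates = df.index.get_level_values("datetime")
    in_range = (dates >= pd.Timestamp(start)) & (dates <= pd.Timestamp(end))
    return bool(in_range.all())


# Toy frame shaped like the prepared training data.
index = pd.MultiIndex.from_tuples(
    [
        (pd.Timestamp("2008-01-02"), "A"),
        (pd.Timestamp("2008-01-02"), "AAPL"),
        (pd.Timestamp("2014-12-31"), "A"),
    ],
    names=["datetime", "instrument"],
)
toy = pd.DataFrame({"KMID": [-0.010090, -0.022231, 0.0]}, index=index)

inside_train = dates_within(toy, "2008-01-01", "2014-12-31")
inside_shorter = dates_within(toy, "2008-01-01", "2013-12-31")
print(inside_train, inside_shorter)
```

With the real data, the equivalent call would be `dates_within(example_df, "2008-01-01", "2014-12-31")`.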

In [15]:
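def get_dataset_config(
    dataset_class="Alpha158",
    train=("2008-01-01", "2014-12-31"),
    valid=("2015-01-01", "2016-12-31"),
    test=("2017-01-01", "2020-08-01"),
    handler_kwargs=None,
):
    """Build a DatasetH config dict for the given handler class and date segments."""
    # Avoid a mutable default argument; fall back to the S&P 500 universe.
    if handler_kwargs is None:
        handler_kwargs = {"instruments": "sp500"}
    return {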
        "class": "DatasetH",
        "module_path": "qlib.data.dataset",
        "kwargs": {
            "handler": {
                "class": dataset_class,
                "module_path": "qlib.contrib.data.handler",
                "kwargs": get_data_handler_config(**handler_kwargs),
            },
            "segments": {
                "train": train,
                "valid": valid,
                "test": test,
            },
        },
    }

def get_data_handler_config(
    start_time="2008-01-01",
    end_time="2020-08-01",
    fit_start_time="<dataset.kwargs.segments.train.0>",
    fit_end_time="<dataset.kwargs.segments.train.1>",
    instruments="sp500",
):
    return {
        "start_time": start_time,
        "end_time": end_time,
        "fit_start_time": fit_start_time,
        "fit_end_time": fit_end_time,
        "instruments": instruments,
    }
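
The `fit_start_time` and `fit_end_time` defaults are placeholder strings that point at another part of the task dictionary rather than literal dates. How qlib resolves them is not shown in this notebook; the hypothetical helper `resolve_placeholder` below only illustrates what such a dotted path refers to, using a hand-built dictionary with the same shape:

```python
def resolve_placeholder(config, placeholder):
    """Follow a '<a.b.0>' style path through nested dicts and sequences."""
    path = placeholder.strip("<>").split(".")
    node = config
    for key in path:
        # Numeric path parts index into lists or tuples; others are dict keys.
        node = node[int(key)] if key.isdigit() else node[key]
    return node


# Hand-built stand-in for the "dataset" part of a task dictionary.
task = {
    "dataset": {
        "kwargs": {
            "segments": {
                "train": ("2008-01-01", "2014-12-31"),
                "valid": ("2015-01-01", "2016-12-31"),
            }
        }
    }
}

fit_start = resolve_placeholder(task, "<dataset.kwargs.segments.train.0>")
fit_end = resolve_placeholder(task, "<dataset.kwargs.segments.train.1>")
print(fit_start, fit_end)
```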

In [16]:
# Use a GBDT model.
# TODO: Model architecture also matters; try different models to achieve better results.
GBDT_MODEL = {
    "class": "LGBModel",
    "module_path": "qlib.contrib.model.gbdt",
    "kwargs": {
        "loss": "mse",
        "colsample_bytree": 0.8879,
        "learning_rate": 0.0421,
        "subsample": 0.8789,
        "lambda_l1": 205.6999,
        "lambda_l2": 580.9768,
        "max_depth": 8,
        "num_leaves": 210,
        "num_threads": 20,
    },
}

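def get_gbdt_task(dataset_kwargs=None, handler_kwargs=None):
    """Bundle the GBDT model config with a dataset config into one task dict."""
    # Avoid mutable default arguments.
    dataset_kwargs = dataset_kwargs or {}
    handler_kwargs = handler_kwargs or {"instruments": "sp500"}
    return {
        "model": GBDT_MODEL,
        "dataset": get_dataset_config(**dataset_kwargs, handler_kwargs=handler_kwargs),
    }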


SP500_GBDT_TASK = get_gbdt_task(handler_kwargs={"instruments": "sp500"})
model = init_instance_by_config(SP500_GBDT_TASK["model"])

ModuleNotFoundError. CatBoostModel are skipped. (optional: maybe installing CatBoostModel can fix it.)
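
`init_instance_by_config` turns a `class` / `module_path` / `kwargs` dictionary into an object. The sketch below shows the general pattern with `importlib` and a standard-library class; it is not qlib's implementation, and the name `build_from_config` is hypothetical:

```python
import importlib


def build_from_config(config):
    """Import config['module_path'], look up config['class'], and call it with kwargs."""
    module = importlib.import_module(config["module_path"])
    cls = getattr(module, config["class"])
    return cls(**config.get("kwargs", {}))


# Same three keys as GBDT_MODEL, but pointing at a standard-library class.
FRACTION_CONFIG = {
    "class": "Fraction",
    "module_path": "fractions",
    "kwargs": {"numerator": 3, "denominator": 4},
}

value = build_from_config(FRACTION_CONFIG)
print(value)
```

Keeping the model as a plain dictionary is what lets the same settings be logged with `R.log_params` and reused across experiments.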


In [17]:
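# Start the experiment.
# NOTE: The logged task config names Alpha158, while `dataset` was built with Alpha158TwoWeeks.
with R.start(experiment_name="workflow"):
    R.log_params(**flatten_dict(SP500_GBDT_TASK))
    model.fit(dataset)
    R.save_objects(**{"params.pkl": model})

    # predict() scores the "test" segment by default.
    pred = model.predict(dataset)
    print('pred', pred)

# TODO: Run a backtest to evaluate the model.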

[667:MainThread](2022-09-02 05:24:28,919) INFO - qlib.workflow - [expm.py:315] - <mlflow.tracking.client.MlflowClient object at 0x7feade9fcb50>
INFO:qlib.workflow:<mlflow.tracking.client.MlflowClient object at 0x7feade9fcb50>
[667:MainThread](2022-09-02 05:24:28,945) INFO - qlib.workflow - [exp.py:260] - Experiment 1 starts running ...
INFO:qlib.workflow:Experiment 1 starts running ...
[667:MainThread](2022-09-02 05:24:29,710) INFO - qlib.workflow - [recorder.py:339] - Recorder d5ff3e01701b4613a4bebdc15ea931c1 starts running under Experiment 1 ...
INFO:qlib.workflow:Recorder d5ff3e01701b4613a4bebdc15ea931c1 starts running under Experiment 1 ...
[667:MainThread](2022-09-02 05:24:29,908) INFO - qlib.workflow - [recorder.py:372] - Fail to log the uncommitted code of $CWD when run `git diff`
INFO:qlib.workflow:Fail to log the uncommitted code of $CWD when run `git diff`
[667:MainThread](2022-09-02 05:24:30,118) INFO - qlib.workflow - [recorder.py:372] - Fail to log the uncommitted code of 

Training until validation scores don't improve for 50 rounds
[20]	train's l2: 0.982006	valid's l2: 0.997296
[40]	train's l2: 0.970632	valid's l2: 0.997015
[60]	train's l2: 0.96202	valid's l2: 0.997033
[80]	train's l2: 0.955098	valid's l2: 0.997051
[100]	train's l2: 0.949218	valid's l2: 0.997134
Early stopping, best iteration is:
[68]	train's l2: 0.959171	valid's l2: 0.996992


[667:MainThread](2022-09-02 05:25:44,240) INFO - qlib.timer - [log.py:117] - Time cost: 0.000s | waiting `async_log` Done
INFO:qlib.timer:Time cost: 0.000s | waiting `async_log` Done


pred datetime    instrument
2017-01-03  A            -0.013359
            AAL           0.038548
            AAP           0.065628
            AAPL         -0.011128
            ABBV          0.014177
                            ...   
2020-07-31  YUM          -0.004051
            ZBH          -0.029806
            ZBRA         -0.039705
            ZION          0.046226
            ZTS          -0.043556
Length: 443290, dtype: float64
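
The training log reports early stopping with a patience of 50 rounds and a best iteration of 68. The sketch below illustrates that stopping rule on a made-up loss curve; it is a simplified stand-in, not LightGBM's implementation, and `early_stop` is a hypothetical name:

```python
def early_stop(valid_losses, patience):
    """Return (best_round, stop_round), both 1-based.

    Training stops once `patience` rounds have passed since the best
    validation loss without any improvement on it.
    """
    best_loss = float("inf")
    best_round = 0
    for round_number, loss in enumerate(valid_losses, start=1):
        if loss < best_loss:
            best_loss, best_round = loss, round_number
        elif round_number - best_round >= patience:
            return best_round, round_number
    # Ran out of rounds before the patience was exhausted.
    return best_round, len(valid_losses)


# Made-up validation losses: improvement stops after round 3.
losses = [1.00, 0.99, 0.98, 0.985, 0.99, 0.995, 1.00]
best_round, stop_round = early_stop(losses, patience=3)
print(best_round, stop_round)
```

Under this rule, a best iteration of 68 with a patience of 50 would stop at round 118, which fits the last progress line in the log being round 100.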


In [19]:
pred.shape

(443290,)

In [20]:
import pandas as pd
dfp = pd.DataFrame(pred)

In [22]:
dfp.tail()

                              0
datetime   instrument          
2020-07-31 YUM        -0.004051
           ZBH        -0.029806
           ZBRA       -0.039705
           ZION        0.046226
           ZTS        -0.043556
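
The prediction frame is indexed by `(datetime, instrument)`, so scores can be ranked within each day. The sketch below rebuilds a small Series from the ten rows printed in the prediction output (not the full universe) and selects the two highest-scoring instruments per date; the column name `score` and the helper `top_k_per_day` are choices made here:

```python
import pandas as pd

# The ten (date, instrument) -> score rows visible in the printed predictions.
rows = {
    ("2017-01-03", "A"): -0.013359,
    ("2017-01-03", "AAL"): 0.038548,
    ("2017-01-03", "AAP"): 0.065628,
    ("2017-01-03", "AAPL"): -0.011128,
    ("2017-01-03", "ABBV"): 0.014177,
    ("2020-07-31", "YUM"): -0.004051,
    ("2020-07-31", "ZBH"): -0.029806,
    ("2020-07-31", "ZBRA"): -0.039705,
    ("2020-07-31", "ZION"): 0.046226,
    ("2020-07-31", "ZTS"): -0.043556,
}
index = pd.MultiIndex.from_tuples(
    [(pd.Timestamp(date), instrument) for date, instrument in rows],
    names=["datetime", "instrument"],
)
pred = pd.Series(list(rows.values()), index=index)

# Naming the column avoids the anonymous "0" header.
scores = pred.to_frame("score")


def top_k_per_day(scores, k):
    """Keep the k highest-scoring instruments for each date."""
    ranked = scores.sort_values("score", ascending=False)
    return ranked.groupby(level="datetime").head(k).sort_index()


top = top_k_per_day(scores, k=2)
print(top)
```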
