# 🪙 G-Research Crypto Kale Pipeline
![](./images/vector-blockchain-poster.jpg)

---


In this [Kaggle competition](https://www.kaggle.com/competitions/g-research-crypto-forecasting/overview), you'll use your machine learning expertise to forecast short-term returns in 14 popular cryptocurrencies. The dataset contains information on historical trades for several cryptoassets, such as Bitcoin and Ethereum.

> G-Research is a leading quantitative research and technology company. By using the latest scientific techniques, they produce world-beating predictive research and build advanced technology to analyse the world's data.

## Install necessary packages

We can install the necessary packages either by running `pip install --user <package_name>` for each one, or by listing everything in a `requirements.txt` file and running `pip install --user -r requirements.txt`. The dependencies are already listed in a `requirements.txt` file, so we use the latter method.

NOTE: Do not forget to use the `--user` argument. It is necessary if you want to use Kale to transform this notebook into a Kubeflow pipeline. After installing the Python packages, restart the notebook kernel before proceeding.

In [2]:
!pip install -r requirements.txt --user --quiet

## Imports

In this section we import the packages we need for this example. Make it a habit to gather your imports in a single place. It will make your life easier if you are going to transform this notebook into a Kubeflow pipeline using Kale.

In [2]:
import os, random, subprocess
import pandas as pd
import numpy as np
import time, datetime, zipfile
import joblib, talib
from tqdm import tqdm
import lightgbm as lgb

import warnings
warnings.filterwarnings("ignore")

## Project hyper-parameters

In this cell, we define the different hyper-parameters. Defining them in one place makes it easier to experiment with their values and also facilitates the execution of HP Tuning experiments using Kale and Katib.

In [3]:
# Hyper-parameters
LR = 0.01
N_EST = 1200

Set random seed for reproducibility

In [4]:
def fix_all_seeds(seed):
    np.random.seed(seed)
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)

fix_all_seeds(2022)
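
As a quick check of what seeding buys us, the sketch below repeats `fix_all_seeds` so that it is self-contained and confirms that re-seeding reproduces the same draws. Note that assigning `PYTHONHASHSEED` inside a running interpreter only affects child processes started afterwards; hash randomization of the current process is fixed at startup.

```python
import os
import random

import numpy as np


def fix_all_seeds(seed):
    # Same body as the notebook's function, repeated to keep the sketch runnable.
    np.random.seed(seed)
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)


fix_all_seeds(2022)
first = (np.random.rand(3), random.random())

fix_all_seeds(2022)
second = (np.random.rand(3), random.random())

# Both generators produce identical values after re-seeding.
print(np.array_equal(first[0], second[0]), first[1] == second[1])
```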

## Download data

In this section, we download the data from Kaggle using the Kaggle API credentials, which are read from files under /secret/kaggle-secret.

In [5]:
# name of the Kaggle competition to download
dataset = "g-research-crypto-forecasting"

# read the Kaggle API credentials from the mounted secret files
with open('/secret/kaggle-secret/password', 'r') as file:
    kaggle_key = file.read().rstrip()
with open('/secret/kaggle-secret/username', 'r') as file:
    kaggle_user = file.read().rstrip()

os.environ['KAGGLE_USERNAME'], os.environ['KAGGLE_KEY'] = kaggle_user, kaggle_key

# download kaggle's g-research-crypto-forecast data
subprocess.run(["kaggle","competitions", "download", "-c", dataset])

CompletedProcess(args=['kaggle', 'competitions', 'download', '-c', 'g-research-crypto-forecasting'], returncode=0)
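
The `returncode=0` in the output above indicates that the download succeeded. A minimal sketch of making that check explicit, assuming a hypothetical helper `run_checked` that is not part of the notebook, is to pass `check=True` so that a failed command raises an exception instead of passing silently:

```python
import subprocess
import sys


def run_checked(cmd):
    """Run a command and raise CalledProcessError on a non-zero exit code.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    return subprocess.run(cmd, check=True, capture_output=True, text=True)


# Demonstrated with the Python interpreter itself so the sketch runs anywhere;
# in the notebook the command would be the kaggle download shown above.
result = run_checked([sys.executable, "-c", "print('ok')"])
print(result.returncode)
```

With this pattern a missing credential or a network failure would stop the cell at the download step rather than at the later unzip step.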

In [6]:
# directory to extract the data into
data_path = 'data'

# extract the required files from g-research-crypto-forecasting.zip into data_path
with zipfile.ZipFile(f"{dataset}.zip","r") as zip_ref:
    zip_ref.extractall(data_path, members=['train.csv', 'asset_details.csv'])

## Load the dataset

First, let us load and inspect the data.

The data is in CSV format, so we use the pandas read_csv function.

In [7]:
TRAIN_CSV = f'{data_path}/train.csv'
ASSET_DETAILS_CSV = f'{data_path}/asset_details.csv'

In [8]:
df_train = pd.read_csv(TRAIN_CSV)

In [9]:
df_train.shape

(24236806, 10)

In [10]:
df_asset_details = pd.read_csv(ASSET_DETAILS_CSV).sort_values("Asset_ID")
df_asset_details

| | Asset_ID | Weight | Asset_Name |
|---|---|---|---|
| 1 | 0 | 4.304065 | Binance Coin |
| 2 | 1 | 6.779922 | Bitcoin |
| 0 | 2 | 2.397895 | Bitcoin Cash |
| 10 | 3 | 4.406719 | Cardano |
| 13 | 4 | 3.555348 | Dogecoin |
| 3 | 5 | 1.386294 | EOS.IO |
| 5 | 6 | 5.894403 | Ethereum |
| 4 | 7 | 2.079442 | Ethereum Classic |
| 11 | 8 | 1.098612 | IOTA |
| 6 | 9 | 2.397895 | Litecoin |


In [11]:
df_train['datetime'] = pd.to_datetime(df_train['timestamp'], unit='s')
df_train = df_train[df_train['datetime'] >= '2020-01-01 00:00:00'].copy()

In [12]:
df_train.shape

(12228898, 11)

In [13]:
df_train['datetime'].max()

Timestamp('2021-09-21 00:00:00')

In [14]:
df_train.isna().sum()

timestamp         0
Asset_ID          0
Count             0
Open              0
High              0
Low               0
Close             0
Volume            0
VWAP              9
Target       262453
datetime          0
dtype: int64
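
The counts above show that a number of rows have no `Target`. The notebook later fills every remaining NaN with 0 inside `get_features`. As an alternative, and purely as an assumption rather than what the notebook does, unlabeled rows could be dropped once the features have been computed. The toy frame below uses made-up values to sketch both the count and the drop:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_train; the values are made up for illustration.
toy = pd.DataFrame({
    "Asset_ID": [0, 0, 1, 1],
    "Close": [10.0, 10.5, 200.0, 198.0],
    "Target": [0.001, np.nan, -0.002, np.nan],
})

# Count missing values per column, as df_train.isna().sum() does above.
missing = toy.isna().sum()

# Assumption: rows without a label are dropped rather than filled with 0.
labeled = toy.dropna(subset=["Target"])

print(missing["Target"], len(labeled))
```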

## Define Pipeline Functions

In [15]:
# define the evaluation metric
def weighted_correlation(a, train_data):
    
    weights = train_data.add_w.values.flatten()
    b = train_data.get_label()
    
    
    w = np.ravel(weights)
    a = np.ravel(a)
    b = np.ravel(b)

    sum_w = np.sum(w)
    mean_a = np.sum(a * w) / sum_w
    mean_b = np.sum(b * w) / sum_w
    var_a = np.sum(w * np.square(a - mean_a)) / sum_w
    var_b = np.sum(w * np.square(b - mean_b)) / sum_w

    cov = np.sum((a * b * w)) / np.sum(w) - mean_a * mean_b
    corr = cov / np.sqrt(var_a * var_b)

    return 'eval_wcorr', corr, True
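
The metric above is written for LightGBM's custom evaluation interface, so it reads the labels and weights from the `Dataset` object. A standalone sketch of the same formula on plain NumPy arrays, using a hypothetical helper named `weighted_corr_np`, makes it easier to sanity-check: with equal weights it should agree with the ordinary Pearson correlation.

```python
import numpy as np


def weighted_corr_np(a, b, w):
    """Weighted Pearson correlation of a and b with weights w.

    Hypothetical standalone helper; it mirrors the formula used in
    weighted_correlation above without the LightGBM Dataset wrapper.
    """
    a, b, w = np.ravel(a), np.ravel(b), np.ravel(w)
    sum_w = np.sum(w)
    mean_a = np.sum(a * w) / sum_w
    mean_b = np.sum(b * w) / sum_w
    var_a = np.sum(w * np.square(a - mean_a)) / sum_w
    var_b = np.sum(w * np.square(b - mean_b)) / sum_w
    cov = np.sum(a * b * w) / sum_w - mean_a * mean_b
    return cov / np.sqrt(var_a * var_b)


rng = np.random.default_rng(0)
preds = rng.normal(size=100)
target = 0.5 * preds + rng.normal(size=100)

# With equal weights the result matches the ordinary Pearson correlation.
equal = weighted_corr_np(preds, target, np.ones(100))
pearson = np.corrcoef(preds, target)[0, 1]
print(np.isclose(equal, pearson))
```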

In [16]:
def RSI(df, n):
    return talib.RSI(df['Close'], n)

def ATR(df, n):
    return talib.ATR(df['High'], df['Low'], df['Close'], n)

# Double Exponential Moving Average (DEMA)
def DEMA(data, time_period):
    # Exponential Moving Average over a span of time_period rows
    ema = data['Close'].ewm(span=time_period, adjust=False).mean()
    # DEMA = 2 * EMA - EMA(EMA)
    dema = 2 * ema - ema.ewm(span=time_period, adjust=False).mean()
    return dema

def upper_shadow(df):
    return df['High'] - np.maximum(df['Close'], df['Open'])

def lower_shadow(df):
    return np.minimum(df['Close'], df['Open']) - df['Low']
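
To make the shadow and DEMA definitions concrete, here is a small sketch on four made-up candles. It uses only pandas and NumPy (no talib), and the span of 2 is chosen purely for illustration.

```python
import numpy as np
import pandas as pd

# Made-up candles for illustration only.
candles = pd.DataFrame({
    "Open":  [10.0, 10.4, 10.2, 10.8],
    "High":  [10.6, 10.5, 10.9, 11.0],
    "Low":   [9.9, 10.0, 10.1, 10.3],
    "Close": [10.4, 10.2, 10.8, 10.5],
})

# Shadows: distance from the candle body to the high and to the low.
upper = candles["High"] - np.maximum(candles["Close"], candles["Open"])
lower = np.minimum(candles["Close"], candles["Open"]) - candles["Low"]

# DEMA with a span of 2 rows: 2 * EMA - EMA(EMA).
ema = candles["Close"].ewm(span=2, adjust=False).mean()
dema = 2 * ema - ema.ewm(span=2, adjust=False).mean()

print(upper.round(2).tolist())
print(lower.round(2).tolist())
print(dema.round(4).tolist())
```

Both shadows are non-negative whenever High and Low bound the candle body, and the first DEMA value equals the first close because the exponential averages start from it.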

## Feature Engineering

In [17]:
def get_features(df, 
                 asset_id, 
                 train=True):
    '''
    This function takes a dataframe with all asset data and returns the features for a single asset.
    
    df - full dataframe with all assets included
    asset_id - integer from 0 to 13 inclusive that identifies a cryptocurrency asset
    train - True: build features for training (adds the train_flg column)
          - False: build features for submission via the API
    '''
    # filter based on asset id
    df = df[df['Asset_ID']==asset_id]
    
    # sort based on time stamp
    df = df.sort_values('timestamp')
    
    if train:
        df_feat = df.copy()
        
        # define a train_flg column to split your data into train and validation
        totimestamp = lambda s: np.int32(time.mktime(datetime.datetime.strptime(s, "%d/%m/%Y").timetuple()))
        valid_window = [totimestamp("01/05/2021")]
        
        df_feat['train_flg'] = np.where(df_feat['timestamp']>=valid_window[0], 0,1)
        df_feat = df_feat[['timestamp','Asset_ID', 'High', 'Low', 'Open', 'Close', 'Volume','Target','train_flg']].copy()
    else:
        df = df.sort_values('row_id')
        # keep 'timestamp' as well: the datetime features below are derived from it
        df_feat = df[['timestamp', 'Asset_ID', 'High', 'Low', 'Open', 'Close', 'Volume','row_id']].copy()
        
    for i in tqdm([30, 120, 240]):
        # creating technical indicators
        df_feat[f'RSI_{i}'] = RSI(df_feat, i)
        df_feat[f'ATR_{i}'] = ATR(df_feat, i)
        df_feat[f'DEMA_{i}'] = DEMA(df_feat, i)

    for i in tqdm([30, 120, 240]):
        # creating lag features
        df_feat[f'sma_{i}'] = df_feat['Close'].rolling(i).mean()/df_feat['Close'] -1
        df_feat[f'return_{i}'] = df_feat['Close']/df_feat['Close'].shift(i) -1
    
    # candlestick features on a log scale
    # (np.log yields NaN or -inf for non-positive inputs; both end up as 0 after the fillna below)
    df_feat['HL'] = np.log(df_feat['High'] - df_feat['Low'])
    df_feat['OC'] = np.log(df_feat['Close'] - df_feat['Open'])
    
    df_feat['lower_shadow'] = np.log(lower_shadow(df_feat))
    df_feat['upper_shadow'] = np.log(upper_shadow(df_feat))
    
    # replace inf with nan
    df_feat.replace([np.inf, -np.inf], np.nan, inplace=True)
    
    # datetime features
    df_feat['Date'] = pd.to_datetime(df_feat['timestamp'], unit='s')
    df_feat['Day'] = df_feat['Date'].dt.weekday.astype(np.int32)
    df_feat["dayofyear"] = df_feat['Date'].dt.dayofyear
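    # .dt.weekofyear was removed in pandas 2.0; isocalendar().week is its replacement
    df_feat["weekofyear"] = df_feat['Date'].dt.isocalendar().week.astype(np.int32)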
    df_feat["season"] = ((df_feat['Date'].dt.month)%12 + 3)//3
    

    df_feat = df_feat.drop(['Open','Close','High','Low', 'Volume', 'Date'], axis=1)
    
    # fill nan values with 0
    df_feat = df_feat.fillna(0)
    
    return df_feat
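
Two details of `get_features` are worth a closer look. The validation cutoff is computed with `time.mktime`, which interprets the date in the local time zone of the machine, so the split can shift by a few hours between environments. The `season` feature maps months to the values 1 to 4 with integer arithmetic. The sketch below uses two hypothetical helpers, `utc_timestamp` and `season`, and assumes the cutoff is meant to be midnight UTC, which the notebook does not state.

```python
import calendar
import datetime


def utc_timestamp(date_str):
    """Convert a "dd/mm/YYYY" string to a Unix timestamp at midnight UTC.

    Hypothetical helper; the notebook's totimestamp uses time.mktime instead,
    which interprets the date in the local time zone.
    """
    parsed = datetime.datetime.strptime(date_str, "%d/%m/%Y")
    return calendar.timegm(parsed.timetuple())


def season(month):
    """Same arithmetic as the 'season' feature.

    Returns 1 for December to February, 2 for March to May,
    3 for June to August and 4 for September to November.
    """
    return (month % 12 + 3) // 3


cutoff = utc_timestamp("01/05/2021")
print(cutoff)
print([season(m) for m in range(1, 13)])
```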

In [None]:
# create the feature dataframe for each asset, then concatenate them once
asset_features = []
for i in range(14):
    print(i)
    asset_features.append(get_features(df_train, i, train=True))
feature_df = pd.concat(asset_features)

0


100%|██████████| 3/3 [00:00<00:00,  7.64it/s]
100%|██████████| 3/3 [00:00<00:00, 12.66it/s]


1


100%|██████████| 3/3 [00:00<00:00,  7.61it/s]
100%|██████████| 3/3 [00:00<00:00, 16.32it/s]


2


100%|██████████| 3/3 [00:00<00:00,  9.89it/s]
100%|██████████| 3/3 [00:00<00:00, 22.05it/s]


3


100%|██████████| 3/3 [00:00<00:00,  8.52it/s]
100%|██████████| 3/3 [00:00<00:00, 17.92it/s]


4


100%|██████████| 3/3 [00:00<00:00, 10.35it/s]
100%|██████████| 3/3 [00:00<00:00, 22.56it/s]


## Merge Assets Features

In [None]:
# add the asset weight column to the feature dataframe
feature_df = pd.merge(feature_df, df_asset_details[['Asset_ID','Weight']], how='left', on=['Asset_ID'])
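
The left join keeps every feature row and attaches the weight of its asset. A small sketch with made-up rows shows the effect. The `validate="many_to_one"` argument is an optional addition, not used in the notebook, that makes pandas raise an error if an `Asset_ID` appears more than once in the asset table, which would otherwise silently duplicate feature rows.

```python
import pandas as pd

# Made-up rows standing in for feature_df and df_asset_details.
toy_features = pd.DataFrame({
    "Asset_ID": [0, 0, 1, 1, 1],
    "RSI_30": [55.0, 48.0, 61.0, 39.0, 50.0],
})
toy_assets = pd.DataFrame({"Asset_ID": [0, 1], "Weight": [4.304065, 6.779922]})

# Left join on Asset_ID; the row count of the left table is preserved.
merged = pd.merge(
    toy_features,
    toy_assets,
    how="left",
    on=["Asset_ID"],
    validate="many_to_one",
)

print(len(merged), merged["Weight"].tolist())
```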

In [None]:
feature_df.columns

## Modelling

In [None]:
# define features for LGBM
features = ['Asset_ID', 'RSI_30', 'ATR_30',
       'DEMA_30', 'RSI_120', 'ATR_120', 'DEMA_120', 'RSI_240', 'ATR_240',
       'DEMA_240', 'sma_30', 'return_30', 'sma_120', 'return_120', 'sma_240',
       'return_240', 'HL', 'OC', 'lower_shadow', 'upper_shadow', 'Day',
       'dayofyear', 'weekofyear', 'season']
categoricals = ['Asset_ID']

In [None]:
# define train and validation weights and datasets
weights_train = feature_df.query('train_flg == 1')[['Weight']]
weights_test = feature_df.query('train_flg == 0')[['Weight']]

train_dataset = lgb.Dataset(feature_df.query('train_flg == 1')[features], 
                            feature_df.query('train_flg == 1')['Target'].values, 
                            feature_name = features,
                           categorical_feature= categoricals)
val_dataset = lgb.Dataset(feature_df.query('train_flg == 0')[features], 
                          feature_df.query('train_flg == 0')['Target'].values, 
                          feature_name = features,
                         categorical_feature= categoricals)

train_dataset.add_w = weights_train
val_dataset.add_w = weights_test

evals_result = {}
params = {'n_estimators': int(N_EST),
        'objective': 'regression',
        'metric': 'rmse',
        'boosting_type': 'gbdt',
        'max_depth': -1, 
        'learning_rate': float(LR),
        'seed': 2022,
        'verbose': -1,
        }

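# train the LightGBM model
# (early_stopping_rounds, verbose_eval and evals_result are keyword arguments of
# lgb.train in LightGBM < 4.0; later versions use callbacks instead)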
model = lgb.train(params = params,
                  train_set = train_dataset, 
                  valid_sets = [val_dataset],
                  early_stopping_rounds=60,
                  verbose_eval = 30,
                  feval=weighted_correlation,
                  evals_result = evals_result 
                 )

joblib.dump(model, 'lgb.jl')

In [None]:
fea_imp = pd.DataFrame({'imp':model.feature_importance(), 'col': features})
fea_imp = fea_imp.sort_values(['imp', 'col'], ascending=[True, False]).iloc[-30:]
_ = fea_imp.plot(kind='barh', x='col', y='imp', figsize=(20, 10))

## Evaluation

In [None]:
model = joblib.load('lgb.jl')

In [None]:
root_mean_squared_error = model.best_score.get('valid_0').get('rmse')

In [None]:
# stored as weighted_corr so the weighted_correlation function defined earlier is not overwritten
weighted_corr = model.best_score.get('valid_0').get('eval_wcorr')

## Pipeline Metrics

In [None]:
print(root_mean_squared_error)

In [None]:
print(weighted_corr)