# Experiment 1: LightGBM Algorithm with Training Data

The data exploration section made it evident that we are dealing with a large data set (2.4 million data points). It is therefore crucial to work with a model that can process data quickly and efficiently.

LightGBM uses leaf-wise (best-first) tree growth. At each step it splits the leaf that reduces the loss the most, which allows the tree to grow unbalanced. Because it grows leaf-wise rather than level-wise, overfitting can occur if tree depth is not controlled.
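
As a minimal sketch of what controlling tree depth can look like, the snippet below builds a small parameter dictionary. `max_depth`, `num_leaves` and `min_child_samples` are real `LGBMRegressor` parameters; the helper name `make_tree_params` and the chosen values are assumptions for illustration, not tuned settings.

```python
def make_tree_params(max_depth=6, num_leaves=31, min_child_samples=100):
    """Return tree-shape parameters, keeping num_leaves consistent with max_depth.

    Hypothetical helper: the default values are illustrative, not tuned.
    """
    # A binary tree of depth d has at most 2**d leaves, so a larger
    # num_leaves could never be reached under this depth limit.
    max_leaves_for_depth = 2 ** max_depth
    return {
        "max_depth": max_depth,
        "num_leaves": min(num_leaves, max_leaves_for_depth),
        "min_child_samples": min_child_samples,  # minimum rows required in a leaf
    }


params = make_tree_params(max_depth=4, num_leaves=31)
print(params)  # {'max_depth': 4, 'num_leaves': 16, 'min_child_samples': 100}
```

A dictionary like this could be passed to the model as `LGBMRegressor(**params)`.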

LightGBM's main strengths are its high training speed, its ability to handle large data sets, its low memory usage and, most importantly, its focus on accuracy.

Despite these strengths, LightGBM is sensitive to overfitting and can easily overfit the data if its parameters are not tuned carefully. High accuracy on the training data can come at the cost of high variance, meaning the model generalizes poorly to unseen data.
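
One common way to notice overfitting is to hold out the most recent rows and compare the error on the training part with the error on the held-out part. The sketch below only shows the split itself. The helper name `time_ordered_split`, the 40% fraction and the toy values are assumptions for illustration and are not part of this notebook's pipeline.

```python
import pandas as pd


def time_ordered_split(df, time_col="timestamp", valid_fraction=0.2):
    """Split df into (train, valid) so that no validation row is earlier than a training row."""
    df_sorted = df.sort_values(time_col, kind="mergesort")  # mergesort is a stable sort
    n_valid = int(round(len(df_sorted) * valid_fraction))
    split_at = len(df_sorted) - n_valid
    return df_sorted.iloc[:split_at], df_sorted.iloc[split_at:]


# Toy data standing in for one asset's rows (values are made up).
toy = pd.DataFrame({
    "timestamp": [60, 120, 180, 240, 300],
    "Target": [0.001, -0.002, 0.0005, 0.003, -0.001],
})
train_part, valid_part = time_ordered_split(toy, valid_fraction=0.4)
print(len(train_part), len(valid_part))  # 3 2
```

A large gap between training error and validation error would be a sign that the model is fitting noise in the training rows.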

The biggest challenge is going to be parameter tuning, that is, choosing which parameters to set and what values to give them.

Now let's dive in.






In [1]:
# Loading the libraries and data

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from datetime import datetime
import gresearch_crypto
import datatable as dt
from lightgbm import LGBMRegressor
import os


data_folder = "../input/g-research-crypto-forecasting/"
!ls $data_folder

In [2]:
# datatable reads the large CSV files quickly; convert to pandas for the rest of the notebook
asset_df = dt.fread(data_folder + "asset_details.csv").to_pandas()
train_df = dt.fread(data_folder + "train.csv").to_pandas()

In [3]:
asset_df

# Data Structure for train_df

train_df - Training dataset

The columns are as follows:

1. timestamp - A timestamp for the minute covered by the row.

2. Asset_ID - An ID code for the cryptoasset.

3. Count - The number of trades that took place this minute.

4. Open - The USD price at the beginning of the minute.

5. High - The highest USD price during the minute.

6. Low - The lowest USD price during the minute.

7. Close - The USD price at the end of the minute.

8. Volume - The number of cryptoasset units traded during the minute.

9. VWAP - The volume-weighted average price for the minute.

10. Target - 15-minute residualized returns. See the 'Prediction and Evaluation' section of this notebook for details of how the target is calculated.

11. Weight - Weight, defined by the competition hosts.

12. Asset_Name - Human readable Asset name.

In [4]:
train_df

# TRAINING

## Feature Extraction 

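The three features I chose for the training model are as follows:

- hlco_ratio: the ratio of the High-Low range to the Close-Open change, i.e. (High - Low) / (Close - Open).

- upper_shadow: the upper wick of the candlestick, i.e. High minus the larger of Close and Open.

- lower_shadow: the lower wick of the candlestick, i.e. the smaller of Close and Open minus Low.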
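
To make these definitions concrete, here is a small sketch that computes the three quantities directly on two made-up candles. The numbers are invented for illustration; the second candle closes exactly where it opened, to show what happens to the ratio in that case.

```python
import numpy as np
import pandas as pd

# Two made-up one-minute candles; the second one has Close == Open.
candles = pd.DataFrame({
    "Open":  [10.0, 10.0],
    "High":  [12.0, 11.0],
    "Low":   [9.0, 9.0],
    "Close": [11.0, 10.0],
})

upper = candles["High"] - np.maximum(candles["Close"], candles["Open"])
lower = np.minimum(candles["Close"], candles["Open"]) - candles["Low"]
ratio = (candles["High"] - candles["Low"]) / (candles["Close"] - candles["Open"])

print(upper.tolist())  # [1.0, 1.0]
print(lower.tolist())  # [1.0, 1.0]
print(ratio.tolist())  # [3.0, inf]
```

Note that `dropna` does not remove `inf` by default, so a candle with Close equal to Open and High above Low keeps an infinite ratio unless it is handled explicitly, for example with `ratio.replace([np.inf, -np.inf], np.nan)`.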


In [5]:
# Feature Extraction 

def hlco_ratio(df):
    # High-Low range relative to the Close-Open change (infinite when Close == Open and High > Low)
    return (df['High'] - df['Low']) / (df['Close'] - df['Open'])

def upper_shadow(df):
    # Distance from the top of the candle body to the High
    return df['High'] - np.maximum(df['Close'], df['Open'])

def lower_shadow(df):
    # Distance from the Low to the bottom of the candle body
    return np.minimum(df['Close'], df['Open']) - df['Low']

# A utility function to build features from the original df

def get_features(df):
    df_feat = df[['Count', 'Open', 'High', 'Low', 'Close', 'Volume', 'VWAP']].copy()
    df_feat['Upper_Shadow'] = upper_shadow(df_feat)
    df_feat['Lower_Shadow'] = lower_shadow(df_feat)
    df_feat['hlco_ratio'] = hlco_ratio(df_feat)
    return df_feat


In [6]:
# Main Training Function 

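def get_Xy_and_model_for_asset(df_train, asset_id):
    # Keep only the rows that belong to this asset
    df = df_train[df_train["Asset_ID"] == asset_id]

    # Build the features, attach the target and drop rows with missing values
    df_proc = get_features(df)
    df_proc['y'] = df['Target']
    df_proc = df_proc.dropna(how="any")

    X = df_proc.drop("y", axis=1)
    y = df_proc["y"]

    # Default parameters for now; only the device is set
    model = LGBMRegressor(device='gpu')
    model.fit(X, y)
    return X, y, model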


In [7]:
# Loop over all assets

Xs = {}
ys = {}
models = {}

for asset_id, asset_name in zip(asset_df['Asset_ID'], asset_df['Asset_Name']):
    print(f"Training model for {asset_name:<16} (ID={asset_id:<2})")
    try:
        X, y, model = get_Xy_and_model_for_asset(train_df, asset_id)
        Xs[asset_id], ys[asset_id], models[asset_id] = X, y, model
    except Exception as err:
        # Report the failure instead of hiding it, then record the asset as untrained
        print(f"  Training failed: {err}")
        Xs[asset_id], ys[asset_id], models[asset_id] = None, None, None


In [8]:
# Check the model interface
# Build the features for a single row and predict with the model for Asset_ID 0
x = get_features(train_df.iloc[1])
y_pred = models[0].predict(pd.DataFrame([x]))
y_pred[0]

This is the first experiment. As we can see, the predicted value is extremely low. This could mean one of a few things, the first being that the model has a high error rate. That could be due to the parameters in the main training function, which have not been tuned yet.
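
A single prediction is hard to judge on its own, so one option is to compare predictions against known targets on held-out rows. The sketch below uses RMSE and Pearson correlation. Both metrics, the helper names and the toy numbers are assumptions for illustration; the competition's own scoring is described in the section referenced earlier.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def pearson(y_true, y_pred):
    """Pearson correlation coefficient between targets and predictions."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])


# Made-up targets and predictions, only to show the calls.
y_true = [0.002, -0.001, 0.003, -0.004]
y_pred = [0.001, -0.0005, 0.0015, -0.002]  # exactly half of each target

print(pearson(y_true, y_pred))
print(rmse(y_true, y_pred))
```

In this toy case the predictions are perfectly correlated with the targets even though every value is too small, which is why looking at more than one number can help.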

The objective is to keep fine-tuning the chosen parameters until the model yields accurate results.
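
A simple way to organize this tuning is a grid search: try every combination of a few candidate values and keep the best one. The sketch below is an assumption about how that loop could look, not the method used in this notebook. `grid_search` and `toy_score` are hypothetical names, the grid values are guesses, and the scorer is a stand-in so the example runs without training a model.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every combination in param_grid and return (best_params, best_score).

    score_fn is a hypothetical callable that takes a parameter dict and
    returns a validation score where higher is better.
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Illustrative grid over real LGBMRegressor parameter names; the values are guesses.
grid = {
    "num_leaves": [15, 31, 63],
    "learning_rate": [0.01, 0.05, 0.1],
}


def toy_score(params):
    # Stand-in scorer: peaks at num_leaves=31 and learning_rate=0.05.
    return -abs(params["num_leaves"] - 31) - abs(params["learning_rate"] - 0.05)


best_params, best_score = grid_search(grid, toy_score)
print(best_params)  # {'num_leaves': 31, 'learning_rate': 0.05}
```

In practice `score_fn` would fit `LGBMRegressor(**params)` on earlier rows and return a score measured on later, held-out rows.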

This work is still in progress; I will continue uploading more experiments. Stay tuned.