# 🪙 G-Research Crypto - Starter LGBM Pipeline
![](https://storage.googleapis.com/kaggle-competitions/kaggle/30894/logos/header.png)

Based on:  https://www.kaggle.com/julian3833/g-research-starter-lgbm-pipeline-lb-0-164
 

### Just a simple pipeline going from zero to a valid submission

We train one `LGBMRegressor` per asset on a very simple set of features, then use the API iterator to generate predictions in the required format and submit.


## Please upvote if you find this useful!

## References:
* [Detailed API Introduction](https://www.kaggle.com/sohier/detailed-api-introduction)
* [Basic Submission Template](https://www.kaggle.com/sohier/basic-submission-template)
* [Tutorial to the G-Research Crypto Competition](https://www.kaggle.com/cstein06/tutorial-to-the-g-research-crypto-competition)



# Import and load dfs

References: [Tutorial to the G-Research Crypto Competition](https://www.kaggle.com/cstein06/tutorial-to-the-g-research-crypto-competition)

In [1]:
import pandas as pd
import numpy as np
from lightgbm import LGBMRegressor
import gresearch_crypto


TRAIN_CSV = '/kaggle/input/g-research-crypto-forecasting/train.csv'
ASSET_DETAILS_CSV = '/kaggle/input/g-research-crypto-forecasting/asset_details.csv'

In [2]:
df_train = pd.read_csv(TRAIN_CSV)#,nrows=9234567)
# df_train = df_train.sort_values(["timestamp","Asset_ID"]) # needed if adding lag features (problematic for target)
df_train.head()

| | timestamp | Asset_ID | Count | Open | High | Low | Close | Volume | VWAP | Target |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1514764860 | 2 | 40.0 | 2376.58 | 2399.5 | 2357.14 | 2374.59 | 19.233005 | 2373.116392 | -0.004218 |
| 1 | 1514764860 | 0 | 5.0 | 8.53 | 8.53 | 8.53 | 8.53 | 78.38 | 8.53 | -0.014399 |
| 2 | 1514764860 | 1 | 229.0 | 13835.194 | 14013.8 | 13666.11 | 13850.176 | 31.550062 | 13827.062093 | -0.014643 |
| 3 | 1514764860 | 5 | 32.0 | 7.6596 | 7.6596 | 7.6567 | 7.6576 | 6626.71337 | 7.657713 | -0.013922 |
| 4 | 1514764860 | 7 | 5.0 | 25.92 | 25.92 | 25.874 | 25.877 | 121.08731 | 25.891363 | -0.008264 |


In [3]:
df_asset_details = pd.read_csv(ASSET_DETAILS_CSV).sort_values("Asset_ID")
df_asset_details

| | Asset_ID | Weight | Asset_Name |
|---|---|---|---|
| 1 | 0 | 4.304065 | Binance Coin |
| 2 | 1 | 6.779922 | Bitcoin |
| 0 | 2 | 2.397895 | Bitcoin Cash |
| 10 | 3 | 4.406719 | Cardano |
| 13 | 4 | 3.555348 | Dogecoin |
| 3 | 5 | 1.386294 | EOS.IO |
| 5 | 6 | 5.894403 | Ethereum |
| 4 | 7 | 2.079442 | Ethereum Classic |
| 11 | 8 | 1.098612 | IOTA |
| 6 | 9 | 2.397895 | Litecoin |


# Training

## Utility functions to train a model for one asset

In [4]:
# Two new features from the competition tutorial
def upper_shadow(df):
    return df['High'] - np.maximum(df['Close'], df['Open'])

def lower_shadow(df):
    return np.minimum(df['Close'], df['Open']) - df['Low']

# A utility function to build features from the original df
# It also works for single rows (pass row=True), so we can reuse it at prediction time.
def get_features(df,row=False):
    df_feat = df[["Asset_ID",'Count', 'Open', 'High', 'Low', 'Close', 'Volume', 'VWAP',"timestamp"]].copy()
    df_feat['Upper_Shadow'] = upper_shadow(df_feat)
    df_feat['Lower_Shadow'] = lower_shadow(df_feat)
    
    ## Add some more features
    df_feat["high_div_low"] = df_feat["High"]/df_feat["Low"]
    df_feat["open_sub_close"] = df_feat["Open"]-df_feat["Close"]

    ## Possible seasonality / datetime features (unlikely to be meaningful, given the very short time frames)
    ### TODO: add cyclical features for seasonality
    times = pd.to_datetime(df_feat["timestamp"], unit="s")  # epoch seconds -> datetime
    if row:
        # Single row: `times` is a scalar Timestamp, so there is no .dt accessor
        df_feat["hour"] = times.hour
        df_feat["dayofweek"] = times.dayofweek
        df_feat["day"] = times.day
    else:
        # DataFrame: `times` is a Series, so use the .dt accessor
        df_feat["hour"] = times.dt.hour
        df_feat["dayofweek"] = times.dt.dayofweek
        df_feat["day"] = times.dt.day
    # No "time" column is created above, so this is a no-op safeguard; the epoch "timestamp" stays as a feature
    df_feat.drop(columns=["time"], errors="ignore", inplace=True)

    
    ## TODO: features of other crypto assets in the same time period (pivot table; avoid the target column)
    ## TODO: lags of the target (careful about leakage: is the target even available in the test-time input?)
    ## TODO: more features
    ### e.g. time-window / time-series features
    return df_feat

def get_Xy_and_model_for_asset(df_train, asset_id):
    df = df_train[df_train["Asset_ID"] == asset_id]
    
    # TODO: Try different features here!
    df_proc = get_features(df)
    df_proc['y'] = df['Target']
    df_proc = df_proc.dropna(how="any")
    
    X = df_proc.drop("y", axis=1)
    y = df_proc["y"]
    
    # TODO: Try different models here!
    model = LGBMRegressor(n_estimators=700)
    model.fit(X, y)
    return X, y, model
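
The TODO in `get_features` about cyclical seasonality features could be approached with a sine/cosine encoding, so that hour 23 and hour 0 end up close together instead of 23 units apart. The sketch below is one possible way to do it and is not part of the original pipeline: the helper name `add_cyclical_time_features` and the new column names are assumptions made for illustration, and it relies only on `timestamp` being in epoch seconds, as in `train.csv`.

```python
import numpy as np
import pandas as pd

def add_cyclical_time_features(df_feat):
    """Return a copy of df_feat with sin/cos encodings of hour and day of week.

    Hypothetical helper: assumes a 'timestamp' column in epoch seconds.
    """
    out = df_feat.copy()
    times = pd.to_datetime(out["timestamp"], unit="s")

    # Fractional hour of the day in [0, 24)
    hour = times.dt.hour + times.dt.minute / 60.0
    out["hour_sin"] = np.sin(2 * np.pi * hour / 24)
    out["hour_cos"] = np.cos(2 * np.pi * hour / 24)

    # Day of week in {0, ..., 6}, Monday = 0
    dow = times.dt.dayofweek
    out["dow_sin"] = np.sin(2 * np.pi * dow / 7)
    out["dow_cos"] = np.cos(2 * np.pi * dow / 7)
    return out

# Two timestamps twelve hours apart, starting at the first timestamp in train.csv
demo = pd.DataFrame({"timestamp": [1514764860, 1514764860 + 12 * 3600]})
demo_feat = add_cyclical_time_features(demo)
print(demo_feat.round(3))
```

If features like these were added, they would have to be computed the same way at prediction time. The `row=True` path in `get_features` would need scalar equivalents, since a single `Timestamp` has no `.dt` accessor.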

## Loop over all assets

In [5]:
%%time
Xs = {}
ys = {}
models = {}

for asset_id, asset_name in zip(df_asset_details['Asset_ID'], df_asset_details['Asset_Name']):
    print(f"Training model for {asset_name:<16} (ID={asset_id:<2})")
    X, y, model = get_Xy_and_model_for_asset(df_train, asset_id)    
    Xs[asset_id], ys[asset_id], models[asset_id] = X, y, model

Training model for Binance Coin     (ID=0 )
Training model for Bitcoin          (ID=1 )
Training model for Bitcoin Cash     (ID=2 )
Training model for Cardano          (ID=3 )
Training model for Dogecoin         (ID=4 )
Training model for EOS.IO           (ID=5 )
Training model for Ethereum         (ID=6 )
Training model for Ethereum Classic (ID=7 )
Training model for IOTA             (ID=8 )
Training model for Litecoin         (ID=9 )
Training model for Maker            (ID=10)
Training model for Monero           (ID=11)
Training model for Stellar          (ID=12)
Training model for TRON             (ID=13)
CPU times: user 22min 40s, sys: 11 s, total: 22min 51s
Wall time: 6min 8s


In [6]:
%%time
# Check the model interface
x = get_features(df_train.iloc[0],row=True)
y_pred = models[0].predict([x])
y_pred[0]

CPU times: user 14 ms, sys: 1.01 ms, total: 15 ms
Wall time: 10.7 ms


-0.025412796955613767

# Predict & submit

References: [Detailed API Introduction](https://www.kaggle.com/sohier/detailed-api-introduction)

Something that helped me understand this iterator was adding a pdb checkpoint inside of the for loop:

```python
import pdb; pdb.set_trace()
```

See [Python Debugging With Pdb](https://realpython.com/python-debugging-pdb/) if you would like to try this but have not used pdb before.
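
For experimenting with the loop outside the competition environment, where `gresearch_crypto` is not available, a small stand-in generator can imitate the shape of what the iterator yields. This is a hypothetical mock, not the real API: the name `mock_iter_test` is invented, and the layout (one batch per timestamp, a `row_id` column, and a prediction frame with `row_id` and `Target`) is an assumption inferred from the outputs shown in this notebook.

```python
import pandas as pd

def mock_iter_test(df):
    """Hypothetical stand-in for env.iter_test().

    Yields (df_test, df_pred) pairs, one per timestamp, in chronological order.
    """
    next_row_id = 0
    for _, batch in df.groupby("timestamp", sort=True):
        # The target is not part of the test-time input, so drop it if present
        df_test = batch.drop(columns=["Target"], errors="ignore").reset_index(drop=True)
        # Assumption: row_id is a running counter across batches
        df_test["row_id"] = range(next_row_id, next_row_id + len(df_test))
        next_row_id += len(df_test)
        # Prediction frame to be filled in by the caller
        df_pred = pd.DataFrame({"row_id": df_test["row_id"], "Target": 0.0})
        yield df_test, df_pred

# Tiny made-up frame: two timestamps, two assets (values are illustrative only)
toy = pd.DataFrame({
    "timestamp": [60, 60, 120],
    "Asset_ID": [2, 0, 2],
    "Close": [100.0, 10.0, 101.0],
    "Target": [0.0, 0.0, 0.0],
})

batches = list(mock_iter_test(toy))
for df_test, df_pred in batches:
    # A convenient place for `import pdb; pdb.set_trace()` while exploring
    print(df_test.shape, df_pred.shape)
```

The real iterator may differ in details such as extra columns or batch sizes, so anything developed against a mock like this should still be checked in the competition environment.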


In [7]:
env = gresearch_crypto.make_env()
iter_test = env.iter_test()

for i, (df_test, df_pred) in enumerate(iter_test):
    for j, row in df_test.iterrows():
        
        model = models[row['Asset_ID']]
        x_test = get_features(row,row=True)
        y_pred = model.predict([x_test])[0]
        
        df_pred.loc[df_pred['row_id'] == row['row_id'], 'Target'] = y_pred
        
        # Print just one sample row to get a feeling of what it looks like
        if i == 0 and j == 0:
            display(x_test)

    # Display the first prediction dataframe
    if i == 0:
        display(df_pred)

    # Send submissions
    env.predict(df_pred)

This version of the API is not optimized and should not be used to estimate the runtime of your code on the hidden test set.


Asset_ID          3.000000e+00
Count             1.201000e+03
Open              1.478556e+00
High              1.486030e+00
Low               1.478000e+00
Close             1.483681e+00
Volume            6.547996e+05
VWAP              1.481439e+00
timestamp         1.623542e+09
Upper_Shadow      2.348667e-03
Lower_Shadow      5.558333e-04
high_div_low      1.005433e+00
open_sub_close   -5.125500e-03
hour              0.000000e+00
dayofweek         6.000000e+00
day               1.300000e+01
Name: 0, dtype: float64

| | row_id | Target |
|---|---|---|
| 0 | 0 | -0.0003084373 |
| 1 | 1 | -0.0005186861 |
| 2 | 2 | 0.0001169469 |
| 3 | 3 | -7.995537e-06 |
| 4 | 4 | 9.085875e-07 |
| 5 | 5 | 3.267314e-05 |
| 6 | 6 | -9.723699e-05 |
| 7 | 7 | 1.876633e-05 |
| 8 | 8 | 0.001097334 |
| 9 | 9 | -2.848795e-05 |
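
The submission loop above calls `model.predict` once per row. Since all rows of a batch that belong to one asset go to the same model, one alternative is to group the batch by `Asset_ID` and predict each group in a single call, which may reduce per-call overhead. The following is a minimal sketch under stated assumptions, not a tested replacement for the loop: `predict_batch` and `ScaledCloseModel` are hypothetical names, and the stub model only stands in for a fitted `LGBMRegressor` so that the example runs without LightGBM or the competition API.

```python
import numpy as np
import pandas as pd

class ScaledCloseModel:
    """Hypothetical stub standing in for a fitted model: only offers .predict()."""
    def __init__(self, scale):
        self.scale = scale

    def predict(self, X):
        return np.asarray(X["Close"], dtype=float) * self.scale

def predict_batch(df_test, df_pred, models, feature_fn):
    """Fill 'Target' using one predict call per asset instead of one per row."""
    # Predictions indexed by row_id (assumed unique within the batch)
    preds = pd.Series(np.nan, index=df_test["row_id"].to_numpy(), dtype=float)
    for asset_id, group in df_test.groupby("Asset_ID"):
        X = feature_fn(group)  # DataFrame path, i.e. row=False in get_features
        preds.loc[group["row_id"].to_numpy()] = models[asset_id].predict(X)
    out = df_pred.copy()
    out["Target"] = out["row_id"].map(preds)
    return out

# Made-up batch with two assets
df_test = pd.DataFrame({
    "row_id": [0, 1, 2],
    "Asset_ID": [1, 0, 1],
    "Close": [10.0, 2.0, 20.0],
})
df_pred = pd.DataFrame({"row_id": [0, 1, 2], "Target": [0.0, 0.0, 0.0]})
models = {0: ScaledCloseModel(0.5), 1: ScaledCloseModel(0.1)}

filled = predict_batch(df_test, df_pred, models, feature_fn=lambda g: g[["Close"]])
print(filled)
```

In the notebook, the body of the iterator loop would presumably become `df_pred = predict_batch(df_test, df_pred, models, get_features)` followed by `env.predict(df_pred)`. Whether this reproduces the row-wise predictions exactly should be verified before submitting.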
