<h2>1. Introduction and loading data</h2>

There are two sets of data in this 'kernels only' competition: News and Prices/Returns. The ideia is to use both sets to predict the movement of a given financial asset in the next 10 days. We have data from 2007 to 2017 for training and must predict the movement of assets from Jan 2017 to July 2019.

Two Sigma and Kaggle created a custom package for this competition:

In [0]:
import gc
from datetime import datetime, timedelta
import numpy as np
import pandas as pd
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
import seaborn as sns
import matplotlib.pyplot as plt
pd.set_option('max_columns', 50)

In [2]:
import os
from google.colab import drive
drive.mount('/content/drive')
os.chdir('/content/drive/My Drive/ML/FinalProject')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


In [0]:
news_train = pd.read_csv('dataset/news_market_train_uni1_v1.csv')

In [0]:
market_train = pd.read_csv('dataset/market_train_uni1_v1.csv')

Let's remove data before 2009 (optional):

In [0]:
start = datetime(2009, 1, 1, 0, 0, 0).date()
market_train = market_train.loc[pd.to_datetime(market_train['time']).dt.date >= start].reset_index(drop=True)
news_train = news_train.loc[pd.to_datetime(news_train['time']).dt.date >= start].reset_index(drop=True)

In [6]:
market_train.head(3)

Unnamed: 0,time,assetCode,assetName,volume,close,open,returnsClosePrevRaw1,returnsOpenPrevRaw1,returnsClosePrevMktres1,returnsOpenPrevMktres1,returnsClosePrevRaw10,returnsOpenPrevRaw10,returnsClosePrevMktres10,returnsOpenPrevMktres10,returnsOpenNextMktres10,close_open_diff,volume_money_mean,binary_returnsNextMktres10,cluster_open_close,cluster_volume,cluster_prev_returns
0,2009-01-02 22:00:00+00:00,A.N,Agilent Technologies Inc,3030118.0,16.24,15.6,0.039028,0.045576,0.029112,0.042122,-0.005511,-0.037037,-0.026992,-0.033293,0.179633,1.041026,48239480.0,True,1,0,0
1,2009-01-02 22:00:00+00:00,AAP.N,Advance Auto Parts Inc,795900.0,34.14,33.86,0.014562,0.022652,-0.010692,0.009156,0.035283,0.047398,-0.00526,0.054363,0.029782,1.008269,27060600.0,True,6,0,0
2,2009-01-02 22:00:00+00:00,AAPL.O,Apple Inc,26964210.0,90.75,85.58,0.063269,-0.004884,0.033274,-0.015174,0.017833,-0.05956,-0.029117,-0.05191,-0.026166,1.060411,2377300000.0,False,8,8,0


In [10]:
news_train.head(3)

Unnamed: 0,time,sourceTimestamp,firstCreated,sourceId,headline,urgency,takeSequence,provider,subjects,audiences,bodySize,companyCount,headlineTag,marketCommentary,sentenceCount,wordCount,assetCodes,assetName,firstMentionSentence,relevance,sentimentClass,sentimentNegative,sentimentNeutral,sentimentPositive,sentimentWordCount,noveltyCount12H,noveltyCount24H,noveltyCount3D,noveltyCount5D,noveltyCount7D,volumeCounts12H,volumeCounts24H,volumeCounts3D,volumeCounts5D,volumeCounts7D
0,2009-01-01 00:45:48+00:00,2009-01-01 00:45:47+00:00,2009-01-01 00:45:47+00:00,ecee3d8fc0bd9b8b,SNAPSHOT - Financial Crisis - 0045 GMT,3,1,RTRS,"{'COEN', 'INDS', 'SG', 'RTRS', 'EMRG', 'AUTO',...","{'G', 'GRO', 'ELN', 'Z', 'T', 'SOF', 'PSC', 'M...",2133,4,SNAPSHOT,False,19,406,"{'MOT.N', 'MOT.DE'}",Motorola Solutions Inc,0,0.447214,-1,0.726735,0.158177,0.115088,406,0,0,0,0,0,3,5,11,11,11
1,2009-01-01 02:13:05+00:00,2009-01-01 02:13:05+00:00,2009-01-01 02:13:05+00:00,086113b466359c3f,"UPDATE 2-GM gets $4 bln rescue loan, Chrysler ...",3,1,RTRS,"{'FUND', 'AUTO', 'CYCS', 'DBT', 'NEWS', 'WASH'...","{'C', 'MTL', 'T', 'G', 'O', 'SOF', 'MNI', 'U',...",2940,2,UPDATE 2,False,22,540,"{'F.PA', 'F.F', 'F.DE', 'F.N'}",Ford Motor Co,18,0.072548,-1,0.670384,0.118503,0.211113,108,0,0,0,0,0,8,9,62,62,72
2,2009-01-01 02:24:53+00:00,2009-01-01 02:24:53+00:00,2009-01-01 02:24:53+00:00,cf5c8cec18e3c3ac,Japan's Mizuho to change 3 group presidents-Ni...,3,1,RTRS,"{'FIN', 'ASIA', 'BACT', 'FINS', 'JP', 'BNK', '...","{'PCO', 'T', 'DNP', 'PSC', 'D', 'RNP', 'J', 'P...",1413,1,,False,12,258,"{'MFG.N', '8411.T'}",Mizuho Financial Group Inc,1,1.0,-1,0.747284,0.171259,0.081457,258,0,0,0,0,0,0,0,0,0,1


<h2>2. Preprocessing News</h2>

We are going to remove some columns for now and apply label encoding to a few others:

In [0]:
def preprocess_news(news_train):
    drop_list = [
        'audiences', 'subjects', 'assetName',
        'headline', 'firstCreated', 'sourceTimestamp',
    ]
    news_train.drop(drop_list, axis=1, inplace=True)
    
    # Factorize categorical columns
    for col in ['headlineTag', 'provider', 'sourceId']:
        news_train[col], uniques = pd.factorize(news_train[col])
        del uniques
    
    # Remove {} and '' from assetCodes column
    news_train['assetCodes'] = news_train['assetCodes'].apply(lambda x: x[1:-1].replace("'", ""))
    return news_train

news_train = preprocess_news(news_train)

<h2>3. Unstacking news</h2>

Assets are actually a list of codes in the news frame, but we need to merge with market data which has individual asset codes. Therefore, we are going to unstack each asset code and save the original index with the following function. This is probably not the best way of doing that, but it is simple:

In [8]:
def unstack_asset_codes(news_train):
    codes = []
    indexes = []
    for i, values in news_train['assetCodes'].iteritems():
        explode = values.split(", ")
        codes.extend(explode)
        repeat_index = [int(i)]*len(explode)
        indexes.extend(repeat_index)
    index_df = pd.DataFrame({'news_index': indexes, 'assetCode': codes})
    del codes, indexes
    gc.collect()
    return index_df

index_df = unstack_asset_codes(news_train)
index_df.head()

Unnamed: 0,news_index,assetCode
0,0,MOT.N
1,0,MOT.DE
2,1,F.PA
3,1,F.F
4,1,F.DE


Now we can merge the news on this frame:

In [9]:
def merge_news_on_index(news_train, index_df):
    news_train['news_index'] = news_train.index.copy()

    # Merge news on unstacked assets
    news_unstack = index_df.merge(news_train, how='left', on='news_index')
    news_unstack.drop(['news_index', 'assetCodes'], axis=1, inplace=True)
    return news_unstack

news_unstack = merge_news_on_index(news_train, index_df)
del news_train, index_df
gc.collect()
news_unstack.head(3)

Unnamed: 0,assetCode,time,sourceId,urgency,takeSequence,provider,bodySize,companyCount,headlineTag,marketCommentary,sentenceCount,wordCount,firstMentionSentence,relevance,sentimentClass,sentimentNegative,sentimentNeutral,sentimentPositive,sentimentWordCount,noveltyCount12H,noveltyCount24H,noveltyCount3D,noveltyCount5D,noveltyCount7D,volumeCounts12H,volumeCounts24H,volumeCounts3D,volumeCounts5D,volumeCounts7D
0,MOT.N,2009-01-01 00:45:48+00:00,0,3,1,0,2133,4,0,False,19,406,0,0.447214,-1,0.726735,0.158177,0.115088,406,0,0,0,0,0,3,5,11,11,11
1,MOT.DE,2009-01-01 00:45:48+00:00,0,3,1,0,2133,4,0,False,19,406,0,0.447214,-1,0.726735,0.158177,0.115088,406,0,0,0,0,0,3,5,11,11,11
2,F.PA,2009-01-01 02:13:05+00:00,1,3,1,0,2940,2,1,False,22,540,18,0.072548,-1,0.670384,0.118503,0.211113,108,0,0,0,0,0,8,9,62,62,72


<h2>4. Group by date and asset</h2>

There can be many News for a single date and asset, so we need to group this data. I'll be using a simple mean, but you can use more intelligent features.

In [11]:
def group_news(news_frame):
    news_frame['date'] = pd.to_datetime(news_frame.time).dt.date  # Add date column
    
    aggregations = ['mean']
    gp = news_frame.groupby(['assetCode', 'date']).agg(aggregations)
    gp.columns = pd.Index(["{}_{}".format(e[0], e[1]) for e in gp.columns.tolist()])
    gp.reset_index(inplace=True)
    # Set datatype to float32
    float_cols = {c: 'float32' for c in gp.columns if c not in ['assetCode', 'date']}
    return gp.astype(float_cols)

news_agg = group_news(news_unstack)
del news_unstack; gc.collect()
news_agg.head(3)

Unnamed: 0,assetCode,date,sourceId_mean,urgency_mean,takeSequence_mean,provider_mean,bodySize_mean,companyCount_mean,headlineTag_mean,marketCommentary_mean,sentenceCount_mean,wordCount_mean,firstMentionSentence_mean,relevance_mean,sentimentClass_mean,sentimentNegative_mean,sentimentNeutral_mean,sentimentPositive_mean,sentimentWordCount_mean,noveltyCount12H_mean,noveltyCount24H_mean,noveltyCount3D_mean,noveltyCount5D_mean,noveltyCount7D_mean,volumeCounts12H_mean,volumeCounts24H_mean,volumeCounts3D_mean,volumeCounts5D_mean,volumeCounts7D_mean
0,0005.HK,2009-01-02,48.0,3.0,1.0,4.0,539.0,1.0,-1.0,0.0,5.0,74.0,0.0,1.0,-1.0,0.508528,0.24084,0.250632,74.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,7.0,13.0,14.0
1,0005.HK,2009-01-05,722.0,2.666667,1.5,0.0,2040.333374,3.5,7.5,0.0,12.333333,366.833344,2.333333,0.587591,-0.166667,0.506576,0.24485,0.248575,292.166656,0.166667,0.166667,0.166667,0.166667,0.166667,4.833333,5.5,5.5,8.333333,15.666667
2,0005.HK,2009-01-06,1757.444458,3.0,1.0,1.111111,2950.111084,3.444444,5.333333,0.111111,22.0,531.333313,4.222222,0.438993,0.0,0.363292,0.390779,0.245929,306.888885,0.111111,0.111111,0.111111,0.111111,0.111111,4.0,10.555555,16.0,18.0,23.444445


<h2>5. Merge on Market data</h2>

The final preprocessing step is to merge news data with market data:

In [12]:
market_train['date'] = pd.to_datetime(market_train.time).dt.date
df = market_train.merge(news_agg, how='left', on=['assetCode', 'date'])
del market_train, news_agg
gc.collect()
df.head(3)

Unnamed: 0,time,assetCode,assetName,volume,close,open,returnsClosePrevRaw1,returnsOpenPrevRaw1,returnsClosePrevMktres1,returnsOpenPrevMktres1,returnsClosePrevRaw10,returnsOpenPrevRaw10,returnsClosePrevMktres10,returnsOpenPrevMktres10,returnsOpenNextMktres10,close_open_diff,volume_money_mean,binary_returnsNextMktres10,cluster_open_close,cluster_volume,cluster_prev_returns,date,sourceId_mean,urgency_mean,takeSequence_mean,provider_mean,bodySize_mean,companyCount_mean,headlineTag_mean,marketCommentary_mean,sentenceCount_mean,wordCount_mean,firstMentionSentence_mean,relevance_mean,sentimentClass_mean,sentimentNegative_mean,sentimentNeutral_mean,sentimentPositive_mean,sentimentWordCount_mean,noveltyCount12H_mean,noveltyCount24H_mean,noveltyCount3D_mean,noveltyCount5D_mean,noveltyCount7D_mean,volumeCounts12H_mean,volumeCounts24H_mean,volumeCounts3D_mean,volumeCounts5D_mean,volumeCounts7D_mean
0,2009-01-02 22:00:00+00:00,A.N,Agilent Technologies Inc,3030118.0,16.24,15.6,0.039028,0.045576,0.029112,0.042122,-0.005511,-0.037037,-0.026992,-0.033293,0.179633,1.041026,48239480.0,True,1,0,0,2009-01-02,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,2009-01-02 22:00:00+00:00,AAP.N,Advance Auto Parts Inc,795900.0,34.14,33.86,0.014562,0.022652,-0.010692,0.009156,0.035283,0.047398,-0.00526,0.054363,0.029782,1.008269,27060600.0,True,6,0,0,2009-01-02,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,2009-01-02 22:00:00+00:00,AAPL.O,Apple Inc,26964210.0,90.75,85.58,0.063269,-0.004884,0.033274,-0.015174,0.017833,-0.05956,-0.029117,-0.05191,-0.026166,1.060411,2377300000.0,False,8,8,0,2009-01-02,374.545441,3.0,1.454545,0.0,4142.36377,5.363636,18.363636,1.0,53.636364,722.545471,13.363636,0.151112,0.636364,0.259742,0.144537,0.595721,74.818184,2.272727,2.272727,3.727273,3.727273,3.727273,5.454545,5.454545,16.454546,22.454546,22.454546


In [0]:
import pickle as cPickle
export_csv = df.to_csv ('dataset/processed/train.csv', index = None, header=True)
#with open('dataset/processed/base_line_model_RF.pk', 'wb') as f:
#    cPickle.dump(model, f)

<h2>6. Train GBM model </h2>

This competition has a custom metric (check the evaluation tab). The following function returns the custom metric from the probability of each example being positive (or 1).

In [0]:
def custom_metric(date, pred_proba, num_target, universe):
    y = pred_proba*2 - 1
    r = num_target.clip(-1,1) # get rid of outliers
    x = y * r * 1
    result = pd.DataFrame({'day' : date, 'x' : x})
    x_t = result.groupby('day').sum().values
    return np.mean(x_t) / np.std(x_t)

Drop columns that we don't need and set type to float32:

In [20]:
date = df.date
num_target = df.returnsOpenNextMktres10.astype('float32')
bin_target = (df.returnsOpenNextMktres10 >= 0).astype('int8')
# Drop columns that are not features
df.drop(['returnsOpenNextMktres10', 'date', 'assetCode', 'assetName', 'time'], 
        axis=1, inplace=True)
df = df.astype('float32')  # Set all remaining columns to float32 datatype
gc.collect()

787

We can use the last 10% of data (aprox one year) to validate our model. You have to be careful when using different validation techniques like KFold since the time is important here.

In [0]:
train_index, test_index = train_test_split(df.index.values, test_size=0.2, shuffle=False)

In [0]:
def evaluate_model(df, target, train_index, test_index, params):
    params['n_jobs'] = 2  # Use 2 cores/threads
    #model = XGBClassifier(**params)
    model = LGBMClassifier(**params)
    model.fit(df.iloc[train_index], target.iloc[train_index])
    return log_loss(target.iloc[test_index], model.predict_proba(df.iloc[test_index]))

We can use a simple random search to find some hyperparameters:

In [24]:
param_grid = {
    'learning_rate': [0.15, 0.1, 0.05, 0.02, 0.01],
    'num_leaves': [i for i in range(12, 90, 6)],
    'n_estimators': [50, 200, 400, 600, 800],
    'min_child_samples': [i for i in range(10, 100, 10)],
    'colsample_bytree': [0.8, 0.9, 0.95, 1],
    'subsample': [0.8, 0.9, 0.95, 1],
    'reg_alpha': [0.1, 0.2, 0.4, 0.6, 0.8],
    'reg_lambda': [0.1, 0.2, 0.4, 0.6, 0.8],
}

best_eval_score = 0
for i in range(100):  # Hundred runs
    params = {k: np.random.choice(v) for k, v in param_grid.items()}
    score = evaluate_model(df, bin_target, train_index, test_index, params)
    if score < best_eval_score or best_eval_score == 0:
        best_eval_score = score
        best_params = params
print("Best evaluation logloss", best_eval_score)

Best evaluation logloss 1.2639342823427868e-07


<h2>7. Make predictions and submit</h2>

In order to make predictions for the test set we must use the *env.predict* function and for our final submission *env.write_submission_file*. Otherwise, the script will fail in the second stage, when Kaggle will replace the "fake" data for 2019 market data and run our scripts again.

In [25]:
# Train model with full data
clf = LGBMClassifier(**best_params)
clf.fit(df, bin_target)

LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               importance_type='split', learning_rate=0.1, max_depth=-1,
               min_child_samples=80, min_child_weight=0.001, min_split_gain=0.0,
               n_estimators=800, n_jobs=2, num_leaves=78, objective=None,
               random_state=None, reg_alpha=0.1, reg_lambda=0.1, silent=True,
               subsample=0.9, subsample_for_bin=200000, subsample_freq=0)

In [0]:
import pickle as cPickle
with open('base_line_model_RF.pk', 'wb') as f:
    cPickle.dump(clf, f)

In [0]:
def write_submission(model, env):
    days = env.get_prediction_days()
    for (market_obs_df, news_obs_df, predictions_template_df) in days:
        news_obs_df = preprocess_news(news_obs_df)
        # Unstack news
        index_df = unstack_asset_codes(news_obs_df)
        news_unstack = merge_news_on_index(news_obs_df, index_df)
        # Group and and get aggregations (mean)
        news_obs_agg = group_news(news_unstack)

        # Join market and news frames
        market_obs_df['date'] = market_obs_df.time.dt.date
        obs_df = market_obs_df.merge(news_obs_agg, how='left', on=['assetCode', 'date'])
        del market_obs_df, news_obs_agg, news_obs_df, news_unstack, index_df
        gc.collect()
        obs_df = obs_df[obs_df.assetCode.isin(predictions_template_df.assetCode)]
        
        # Drop cols that are not features
        feats = [c for c in obs_df.columns if c not in ['date', 'assetCode', 'assetName', 'time']]

        preds = model.predict_proba(obs_df[feats])[:, 1] * 2 - 1
        sub = pd.DataFrame({'assetCode': obs_df['assetCode'], 'confidence': preds})
        predictions_template_df = predictions_template_df.merge(sub, how='left').drop(
            'confidenceValue', axis=1).fillna(0).rename(columns={'confidence':'confidenceValue'})
        
        env.predict(predictions_template_df)
        del obs_df, predictions_template_df, preds, sub
        gc.collect()
    env.write_submission_file()
    
write_submission(clf, env)

<h2>8. Feature importance</h2>

We can use Seaborn to plot the feature importance (with gain or number of splits criteria):

In [0]:
feat_importance = pd.DataFrame()
feat_importance["feature"] = df.columns
feat_importance["gain"] = clf.booster_.feature_importance(importance_type='gain')
feat_importance.sort_values(by='gain', ascending=False, inplace=True)
plt.figure(figsize=(8,10))
ax = sns.barplot(y="feature", x="gain", data=feat_importance)

TODO

* Use custom metric
* Improve aggregations

Thanks for reading! Please upvote if you find usefull.