# Amateur Hour - Predicting Stocks using LightGBM (Market Data Only)
### Starter Kernel by ``Magichanics`` 
*([GitHub](https://github.com/Magichanics) - [Kaggle](https://www.kaggle.com/magichanics))*

This notebook is mainly a better-organized and more efficient version of my previous one. I decided to make a Market Data Only kernel because its runtime is much shorter.

Feel free to post suggestions or criticisms! 

## Table of Contents

* [Step 1. Cleaning Dataset](#section1)
* [Step 2. Feature Engineering](#section2)
* [Step 3. Modelling using LightGBM](#section3)
* [Step 4. Applying the Model](#section4)

In [1]:
import numpy as np
import pandas as pd
import os
import gc
from itertools import chain

import matplotlib.pyplot as plt

In [2]:
import warnings
warnings.filterwarnings("ignore")

In [3]:
# import environment for data
from kaggle.competitions import twosigmanews
env = twosigmanews.make_env()

Loading the data... This could take a minute.
Done!


In [4]:
(market_train_df, news_train_df) = env.get_training_data()
sampling = False
if sampling:
    market_train_df = market_train_df.tail(400_000)
del news_train_df

<a id='section1'></a>
## Step 1. Cleaning Dataset

We'll drop part of the market dataset: the rows from before February 2010 and the columns we won't use as features.

In [5]:
market_train_df.head()

Unnamed: 0,time,assetCode,assetName,volume,close,open,returnsClosePrevRaw1,returnsOpenPrevRaw1,returnsClosePrevMktres1,returnsOpenPrevMktres1,returnsClosePrevRaw10,returnsOpenPrevRaw10,returnsClosePrevMktres10,returnsOpenPrevMktres10,returnsOpenNextMktres10,universe
0,2007-02-01 22:00:00+00:00,A.N,Agilent Technologies Inc,2606900.0,32.19,32.17,0.005938,0.005312,,,-0.00186,0.000622,,,0.034672,1.0
1,2007-02-01 22:00:00+00:00,AAI.N,AirTran Holdings Inc,2051600.0,11.12,11.08,0.004517,-0.007168,,,-0.078708,-0.088066,,,0.027803,0.0
2,2007-02-01 22:00:00+00:00,AAP.N,Advance Auto Parts Inc,1164800.0,37.51,37.99,-0.011594,0.025648,,,0.014332,0.045405,,,0.024433,1.0
3,2007-02-01 22:00:00+00:00,AAPL.O,Apple Inc,23747329.0,84.74,86.23,-0.011548,0.016324,,,-0.048613,-0.037182,,,-0.007425,1.0
4,2007-02-01 22:00:00+00:00,ABB.N,ABB Ltd,1208600.0,18.02,18.01,0.011791,0.025043,,,0.012929,0.020397,,,-0.017994,1.0


### Getting rid of Data prior to 2010
Data affected by the Financial Crisis may not benefit this model, so we keep only rows dated 2010-02-01 or later (January 2010 is dropped as well).

In [6]:
from datetime import datetime, timedelta
start = datetime(2010, 2, 1, 0, 0, 0).date()
market_train_df = market_train_df.loc[market_train_df['time'].dt.date >= start].reset_index(drop=True)

### Cleaning Data
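We will keep only the features that correlate reasonably well with the target, along with `time` and `assetCode`. The target, `returnsOpenNextMktres10`, is kept only for the training data.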

In [7]:
def clean_data(market_df, train=True):
    
    # get only what's necessary
    valid_market_cols = ['time', 'assetCode', 'volume', 'close', 'open',\
                           'returnsClosePrevRaw10', 'returnsOpenPrevRaw10',\
                           'returnsClosePrevMktres10', 'returnsOpenPrevMktres10']
    if train:
        valid_market_cols += ['returnsOpenNextMktres10']
    market_df = market_df[valid_market_cols]

    return market_df
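
The notebook does not show how the correlation was measured. As a rough sketch of one way to check it, assuming plain Pearson correlation against the target, the snippet below ranks columns on a synthetic frame (the `toy` frame and all of its values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the market frame; every value here is synthetic.
rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=n)
toy = pd.DataFrame({
    "returnsClosePrevRaw10": signal + rng.normal(scale=0.5, size=n),
    "volume": rng.normal(size=n),  # unrelated noise
    "returnsOpenNextMktres10": signal,
})

# Absolute Pearson correlation of every column with the target.
target = "returnsOpenNextMktres10"
corr = toy.corr()[target].drop(target).abs().sort_values(ascending=False)
print(corr)
```

Columns near the bottom of such a ranking would be candidates for removal.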

In [8]:
X_train = clean_data(market_train_df, train=True)

In [9]:
X_train.shape

(2918114, 10)

In [10]:
del market_train_df

<a id='section2'></a>
## Step 2. Feature Engineering

Quant features built from the market data: lag statistics, price clusters, and a few simple price and volume features.

In [11]:
X_train.tail()

Unnamed: 0,time,assetCode,volume,close,open,returnsClosePrevRaw10,returnsOpenPrevRaw10,returnsClosePrevMktres10,returnsOpenPrevMktres10,returnsOpenNextMktres10
2918109,2016-12-30 22:00:00+00:00,ZIOP.O,1608829.0,5.35,5.37,-0.165367,-0.138042,-0.139597,-0.135913,0.051189
2918110,2016-12-30 22:00:00+00:00,ZLTQ.O,347830.0,43.52,43.62,0.002996,0.002989,0.008213,0.00321,-0.048555
2918111,2016-12-30 22:00:00+00:00,ZNGA.O,7396601.0,2.57,2.58,-0.091873,-0.078571,-0.077252,-0.077188,0.011703
2918112,2016-12-30 22:00:00+00:00,ZTO.N,3146519.0,12.07,12.5,-0.065066,-0.042146,-0.078104,-0.043813,0.083367
2918113,2016-12-30 22:00:00+00:00,ZTS.N,1701204.0,53.53,53.64,0.023127,0.028177,0.026566,0.028719,-0.01622


### Entire Market and Individual Asset Lag Features
We derive quant features for each individual asset, grouped by `assetCode`: the rolling mean, max, and min of every column in `return_features` over the previous 3, 7, and 14 rows. Windows without enough history are filled with -1.

Source: https://www.kaggle.com/qqgeogor/eda-script-67

In [12]:
from multiprocessing import Pool

def create_lag(df_code, n_lag=(3, 7, 14), shift_size=1):
    # df_code holds the rows of a single asset; return_features is a global list
    df_code = df_code.copy()  # avoid writing into a slice of the original frame

    for col in return_features:
        for window in n_lag:
            # shift first so each window only covers earlier rows
            rolled = df_code[col].shift(shift_size).rolling(window=window)
            df_code['%s_lag_%s_mean'%(col,window)] = rolled.mean()
            df_code['%s_lag_%s_max'%(col,window)] = rolled.max()
            df_code['%s_lag_%s_min'%(col,window)] = rolled.min()

    # rows without enough history get -1
    return df_code.fillna(-1)

def generate_lag_features(df, n_lag=(3, 7, 14)):
    # note: n_lag is not forwarded; create_lag uses its own default windows

    # one sub-frame per asset, with only the columns create_lag needs
    df_codes = [group[['time', 'assetCode'] + return_features]
                for _, group in df.groupby('assetCode')]

    # compute the lag features for each asset in parallel
    pool = Pool(4)
    all_df = pool.map(create_lag, df_codes)
    pool.close()
    pool.join()

    # keep only the new lag columns and align them with df by index
    new_df = pd.concat(all_df)
    new_df.drop(return_features + ['time', 'assetCode'], axis=1, inplace=True)
    new_df = pd.concat([df, new_df], axis=1, sort=False)

    return new_df

In [13]:
%%time
return_features = ['returnsClosePrevMktres10','returnsClosePrevRaw10','open','close', 'volume']
X_train = generate_lag_features(X_train)

CPU times: user 15.5 s, sys: 7.17 s, total: 22.7 s
Wall time: 1min 12s
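
To make the shift-then-roll step concrete, here is a minimal sketch on a synthetic price series for one hypothetical asset. It uses the same `shift(1).rolling(window)` pattern and the same `-1` fill as `create_lag`; the prices themselves are invented.

```python
import pandas as pd

# Synthetic closing prices for one hypothetical asset, oldest first.
close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0], name="close")

# Shift by one row so each window only sees earlier rows, then roll.
rolled = close.shift(1).rolling(window=3)
lags = pd.DataFrame({
    "close": close,
    "close_lag_3_mean": rolled.mean(),
    "close_lag_3_max": rolled.max(),
    "close_lag_3_min": rolled.min(),
}).fillna(-1)
print(lags)
```

The first three rows have fewer than three earlier prices, so their lag columns are -1. Row 3 summarizes the prices 10, 11, and 12, and never includes its own close of 13.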


In [14]:
X_train.dropna().head()

Unnamed: 0,time,assetCode,volume,close,open,returnsClosePrevRaw10,returnsOpenPrevRaw10,returnsClosePrevMktres10,returnsOpenPrevMktres10,returnsOpenNextMktres10,returnsClosePrevMktres10_lag_3_mean,returnsClosePrevMktres10_lag_3_max,returnsClosePrevMktres10_lag_3_min,returnsClosePrevMktres10_lag_7_mean,returnsClosePrevMktres10_lag_7_max,returnsClosePrevMktres10_lag_7_min,returnsClosePrevMktres10_lag_14_mean,returnsClosePrevMktres10_lag_14_max,returnsClosePrevMktres10_lag_14_min,returnsClosePrevRaw10_lag_3_mean,returnsClosePrevRaw10_lag_3_max,returnsClosePrevRaw10_lag_3_min,returnsClosePrevRaw10_lag_7_mean,returnsClosePrevRaw10_lag_7_max,returnsClosePrevRaw10_lag_7_min,returnsClosePrevRaw10_lag_14_mean,returnsClosePrevRaw10_lag_14_max,returnsClosePrevRaw10_lag_14_min,open_lag_3_mean,open_lag_3_max,open_lag_3_min,open_lag_7_mean,open_lag_7_max,open_lag_7_min,open_lag_14_mean,open_lag_14_max,open_lag_14_min,close_lag_3_mean,close_lag_3_max,close_lag_3_min,close_lag_7_mean,close_lag_7_max,close_lag_7_min,close_lag_14_mean,close_lag_14_max,close_lag_14_min,volume_lag_3_mean,volume_lag_3_max,volume_lag_3_min,volume_lag_7_mean,volume_lag_7_max,volume_lag_7_min,volume_lag_14_mean,volume_lag_14_max,volume_lag_14_min
0,2010-02-01 22:00:00+00:00,A.N,4001809.0,29.13,28.16,-0.042721,-0.098014,0.005433,-0.023894,0.032263,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
1,2010-02-01 22:00:00+00:00,AAI.N,3240233.0,4.86,4.94,-0.10989,-0.10991,-0.090886,-0.11004,0.013942,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
2,2010-02-01 22:00:00+00:00,AAP.N,1107196.0,40.49,39.55,0.030018,0.005338,0.016256,-0.0257,0.056036,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
3,2010-02-01 22:00:00+00:00,AAPL.O,26781203.0,194.73,192.58,-0.054387,-0.089026,-0.000965,-0.007019,0.036244,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
4,2010-02-01 22:00:00+00:00,AAV.N,492243.0,6.5,6.43,-0.075391,-0.085349,0.0205,0.066237,0.079596,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0


### Clustering
We cluster the rows on their `open` and `close` prices using KMeans (4 clusters) and store the cluster label as a new feature, `cluster_open_close`.

In [15]:
from sklearn.cluster import KMeans

# suggestion: add multiprocessing to KMeans

def clustering(X):

    def cluster_modelling(features):
        df_set = X[features].fillna(0)
        cluster_model = KMeans(n_clusters = 4)
        cluster_model.fit(df_set)
        return cluster_model.predict(df_set)
    
    X['cluster_open_close'] = cluster_modelling(['open', 'close'])
    
    return X

In [16]:
%%time
X_train = clustering(X_train)

CPU times: user 50.5 s, sys: 33.2 s, total: 1min 23s
Wall time: 1min 9s
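
As a small illustration of what the cluster label captures, the sketch below runs KMeans on a handful of synthetic (open, close) pairs. The prices, the choice of two clusters, and the fixed `random_state` and `n_init` are assumptions made only to keep the toy example repeatable; the notebook itself uses four clusters and default settings.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic (open, close) pairs forming two obvious price groups.
prices = np.array([
    [5.0, 5.1], [5.2, 5.0], [4.9, 5.0],
    [200.0, 201.0], [198.0, 199.5], [202.0, 200.0],
])

# random_state and n_init are fixed only to make this toy run repeatable.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(prices)
print(labels)
```

Because the raw prices are not scaled, the clusters mostly separate assets by price level. The label values are arbitrary identifiers, not an ordering.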


### Misc. Features
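Includes the following features:
* Daily Difference (`close - open`)
* Close-to-open ratio (`close_to_open` and `bartrend`)
* Average of open and close
* Price volume (`volume * close`)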

In [21]:
def misc_features(X):
    # modifies X in place and returns None
    
    # Adding daily difference
    new_col = X["close"] - X["open"]
    X.insert(loc=6, column="daily_diff", value=new_col)
    X['close_to_open'] =  np.abs(X['close'] / X['open'])
    
    # extra features
    # (bartrend equals close_to_open whenever both prices are positive)
    X['bartrend'] = X['close'] / X['open']
    X['average'] = (X['close'] + X['open'])/2
    X['pricevolume'] = X['volume'] * X['close']


In [22]:
misc_features(X_train)

In [23]:
X_train.dropna().tail()

Unnamed: 0,time,assetCode,volume,close,open,returnsClosePrevRaw10,daily_diff,returnsOpenPrevRaw10,returnsClosePrevMktres10,returnsOpenPrevMktres10,returnsOpenNextMktres10,returnsClosePrevMktres10_lag_3_mean,returnsClosePrevMktres10_lag_3_max,returnsClosePrevMktres10_lag_3_min,returnsClosePrevMktres10_lag_7_mean,returnsClosePrevMktres10_lag_7_max,returnsClosePrevMktres10_lag_7_min,returnsClosePrevMktres10_lag_14_mean,returnsClosePrevMktres10_lag_14_max,returnsClosePrevMktres10_lag_14_min,returnsClosePrevRaw10_lag_3_mean,returnsClosePrevRaw10_lag_3_max,returnsClosePrevRaw10_lag_3_min,returnsClosePrevRaw10_lag_7_mean,returnsClosePrevRaw10_lag_7_max,returnsClosePrevRaw10_lag_7_min,returnsClosePrevRaw10_lag_14_mean,returnsClosePrevRaw10_lag_14_max,returnsClosePrevRaw10_lag_14_min,open_lag_3_mean,open_lag_3_max,open_lag_3_min,open_lag_7_mean,open_lag_7_max,open_lag_7_min,open_lag_14_mean,open_lag_14_max,open_lag_14_min,close_lag_3_mean,close_lag_3_max,close_lag_3_min,close_lag_7_mean,close_lag_7_max,close_lag_7_min,close_lag_14_mean,close_lag_14_max,close_lag_14_min,volume_lag_3_mean,volume_lag_3_max,volume_lag_3_min,volume_lag_7_mean,volume_lag_7_max,volume_lag_7_min,volume_lag_14_mean,volume_lag_14_max,volume_lag_14_min,cluster_open_close,close_to_open,bartrend,average,pricevolume
2918109,2016-12-30 22:00:00+00:00,ZIOP.O,1608829.0,5.35,5.37,-0.165367,-0.02,-0.138042,-0.139597,-0.135913,0.051189,-0.13526,-0.13047,-0.139499,-0.122319,-0.092268,-0.139499,-0.160523,-0.092268,-0.252451,-0.140565,-0.123824,-0.154331,-0.118618,-0.08589,-0.154331,-0.082796,0.020701,-0.154331,5.626667,5.91,5.37,5.88,6.35,5.37,6.143571,6.61,5.37,5.443333,5.59,5.37,5.734286,6.18,5.37,6.045714,6.54,5.37,1384033.0,1565767.0,1261965.0,1297352.0,1565767.0,1121395.0,1547080.0,3897459.0,1096145.0,0,0.996276,0.996276,5.36,8607235.15
2918110,2016-12-30 22:00:00+00:00,ZLTQ.O,347830.0,43.52,43.62,0.002996,-0.1,0.002989,0.008213,0.00321,-0.048555,-0.005185,0.003636,-0.016203,-0.019191,0.003636,-0.063223,-0.065826,0.003636,-0.150018,-0.006356,0.002994,-0.015433,-0.020382,0.002994,-0.065559,-0.002413,0.046804,-0.065559,43.566667,43.68,43.4,43.304286,43.68,42.89,43.704286,46.0,42.89,43.463333,43.55,43.38,43.272857,43.61,42.76,43.552857,44.4,42.76,299696.7,360774.0,238702.0,400007.6,638235.0,238702.0,597071.1,1555770.0,238702.0,0,0.997707,0.997707,43.57,15137561.6
2918111,2016-12-30 22:00:00+00:00,ZNGA.O,7396601.0,2.57,2.58,-0.091873,-0.01,-0.078571,-0.077252,-0.077188,0.011703,-0.08068,-0.066097,-0.103902,-0.087819,-0.01583,-0.144792,-0.046938,0.023406,-0.144792,-0.083548,-0.0681,-0.100694,-0.10166,-0.0681,-0.149502,-0.06211,0.017361,-0.149502,2.593333,2.62,2.58,2.63,2.69,2.58,2.748571,3.05,2.58,2.59,2.6,2.58,2.608571,2.69,2.56,2.71,2.93,2.56,5251877.0,6829916.0,4177147.0,7934562.0,15014025.0,4177147.0,9865433.0,18371508.0,4177147.0,0,0.996124,0.996124,2.575,19009264.57
2918112,2016-12-30 22:00:00+00:00,ZTO.N,3146519.0,12.07,12.5,-0.065066,-0.43,-0.042146,-0.078104,-0.043813,0.083367,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-0.051198,-0.04053,-0.057635,-0.062396,-0.029499,-0.110403,-1.0,-1.0,-1.0,12.543333,12.7,12.41,12.771429,13.07,12.41,-1.0,-1.0,-1.0,12.446667,12.59,12.31,12.724286,13.16,12.31,-1.0,-1.0,-1.0,2372087.0,2851496.0,1817528.0,1656624.0,2851496.0,855289.0,-1.0,-1.0,-1.0,0,0.9656,0.9656,12.285,37978484.33
2918113,2016-12-30 22:00:00+00:00,ZTS.N,1701204.0,53.53,53.64,0.023127,-0.11,0.028177,0.026566,0.028719,-0.01622,0.052461,0.055288,0.050286,0.038251,0.055288,0.022189,0.038112,0.068147,0.013969,0.052162,0.054988,0.046817,0.045302,0.057348,0.030573,0.040848,0.068432,0.009131,53.671567,53.78,53.4747,53.347814,53.78,52.67,52.606764,53.78,51.03,53.593333,53.72,53.44,53.434286,53.78,53.1,52.565,53.78,50.84,1211282.0,1344976.0,1047017.0,1775436.0,2755192.0,1047017.0,2733968.0,5177724.0,1047017.0,3,0.997949,0.997949,53.585,91065450.12


<a id='section3'></a>
## Step 3. Modelling using LightGBM

### Preparing Datasets for Modelling

In [24]:
y_train = X_train['returnsOpenNextMktres10']
X_train = X_train[[f for f in X_train.columns if f not in ['time', 'assetCode', 'universe', 'assetName', 'returnsOpenNextMktres10',
                                                          'headline', 'subjects', 'audiences']]].fillna(0)

### Fixed Training Split
We use a fixed split that holds out the last rows of the training dataset for validation instead of a random split. Choosing rows at random would scatter the validation set across all timestamps, mixing it with the training period and producing misleading validation results.

In [25]:
def fixed_train_test_split(X, y, train_size):
    
    # number of rows in the training part
    train_size = int(train_size * len(X))
    
    # split data: earliest rows for training, last rows for validation
    X_train, y_train = X[:train_size], y[:train_size]
    X_valid, y_valid = X[train_size:], y[train_size:]
    
    return X_train, y_train, X_valid, y_valid

In [26]:
X_train, y_train, X_valid, y_valid = fixed_train_test_split(X_train, y_train, 0.85)
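
A chronological split of this kind can be illustrated on a toy frame. This is a minimal sketch with synthetic rows that are assumed to be sorted by time already; the names `X_tr`, `X_va`, and `cut` are used only for this example.

```python
import pandas as pd

# Ten synthetic rows already sorted by time (oldest first).
X = pd.DataFrame({"day": range(10), "feature": range(100, 110)})
y = pd.Series(range(10), name="target")

cut = int(0.8 * len(X))  # position of the first validation row

# Oldest 80% for training, newest 20% for validation.
X_tr, y_tr = X.iloc[:cut], y.iloc[:cut]
X_va, y_va = X.iloc[cut:], y.iloc[cut:]

print(X_tr["day"].max(), X_va["day"].min())
```

Every training row is older than every validation row, which mirrors how the model is later applied to days it has not seen.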

### Using LightGBM for modelling

Model from:
https://www.kaggle.com/rabaman/0-64-in-100-lines/code

In [None]:
import lightgbm as lgb

params = {"objective" : "binary",
          "metric" : "binary_logloss",
          "num_leaves" : 60,
          "max_depth": -1,
          "learning_rate" : 0.01,
          "bagging_fraction" : 0.9,  # subsample
          "feature_fraction" : 0.9,  # colsample_bytree
          "bagging_freq" : 5,        # subsample_freq
          "bagging_seed" : 2018,
          "verbosity" : -1 }

lgtrain, lgval = lgb.Dataset(X_train, y_train), lgb.Dataset(X_valid, y_valid)
lgb_model = lgb.train(params, lgtrain, 2000, valid_sets=[lgtrain, lgval], early_stopping_rounds=100, verbose_eval=200)


<a id='section4'></a>
## Step 4. Applying the Model


Predictions are made in a for loop over the prediction days, applying all of the functions above to each day's market data.

In [None]:
def get_X(market_test_df):
    
    X_test = clean_data(market_test_df, train=False)
    X_test = generate_lag_features(X_test)
    X_test = clustering(X_test)
    misc_features(X_test)
    X_test = X_test[[f for f in X_test.columns if f not in ['time', 'assetCode', 'universe', \
                                                            'assetName', 'returnsOpenNextMktres10',\
                                                          'headline', 'subjects', 'audiences']]].fillna(0)
    
    return X_test

def make_predictions(market_obs_df):
    
    # predict using given model
    X_test = get_X(market_obs_df)
    prediction_values = np.clip(lgb_model.predict(X_test, num_iteration=lgb_model.best_iteration), -1, 1)

    return prediction_values
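
With `"objective" : "binary"`, `lgb_model.predict` returns probabilities between 0 and 1, so clipping to [-1, 1] leaves them unchanged, while `confidenceValue` is meant to span -1 to 1. One possible mapping, shown here as a sketch and not as the author's method, rescales the probability linearly; `to_confidence` is a hypothetical helper introduced only for this example.

```python
import numpy as np

def to_confidence(prob_up):
    """Map P(return > 0) in [0, 1] to a signed confidence in [-1, 1]."""
    prob_up = np.asarray(prob_up, dtype=float)
    # 0.5 (no opinion) maps to 0; the clip guards against out-of-range input
    return np.clip(2.0 * prob_up - 1.0, -1.0, 1.0)

print(to_confidence([0.0, 0.25, 0.5, 1.0]))
```

Under this mapping a probability below 0.5 becomes a negative confidence, which expresses a bet that the asset will underperform.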


In [None]:
%%time
n_days = 0
for (market_obs_df, news_obs_df, predictions_template_df) in env.get_prediction_days(): # Looping over days from start of 2017 to 2019-07-15
    
    n_days +=1
    if n_days % 50 == 0:
        print(n_days)
    
    # make predictions
    predictions_template_df['confidenceValue'] = make_predictions(market_obs_df)
    
    # save predictions
    env.predict(predictions_template_df)
    
env.write_submission_file()

**Sources:**
* [Amateur Hour - Using Headlines to Predict Stocks by Magichanics](https://www.kaggle.com/magichanics/amateur-hour-using-headlines-to-predict-stocks)
* [Simple Quant Features by Youhanlee](https://www.kaggle.com/youhanlee/simple-quant-features-using-python)
* [>0.64 in 100 lines by rabaman](https://www.kaggle.com/rabaman/0-64-in-100-lines/comments)
* [eda script 67 by qqgeogor](https://www.kaggle.com/qqgeogor/eda-script-67)