# Amateur Hour - Predicting Stocks using LightGBM
### Starter Kernel by ``Magichanics`` 
*([GitHub](https://github.com/Magichanics) - [Kaggle](https://www.kaggle.com/magichanics))*

Feel free to post suggestions or criticisms!

## Table of Contents

* [Step 1. Merging Datasets](#section1)
* [Step 2. Feature Engineering](#section2)
* [Step 3. Modelling using LightGBM](#section3)
* [Step 4. Applying the Model](#section4)

In [1]:
import numpy as np
import pandas as pd
import os
import gc
from itertools import chain

import matplotlib.pyplot as plt

In [2]:
# import environment for data
from kaggle.competitions import twosigmanews
env = twosigmanews.make_env()

Loading the data... This could take a minute.
Done!


In [3]:
(market_train_df, news_train_df) = env.get_training_data()
sampling = True
if sampling:
    market_train_df = market_train_df.tail(400_000)
    news_train_df = news_train_df.tail(1_000_000)

<a id='section1'></a>
## Step 1. Merging Datasets

While most notebooks focus only on the market dataset, I'm going to attempt to bring the news and market datasets together.

In [4]:
market_train_df.head()

Unnamed: 0,time,assetCode,assetName,volume,close,open,returnsClosePrevRaw1,returnsOpenPrevRaw1,returnsClosePrevMktres1,returnsOpenPrevMktres1,returnsClosePrevRaw10,returnsOpenPrevRaw10,returnsClosePrevMktres10,returnsOpenPrevMktres10,returnsOpenNextMktres10,universe
3672956,2016-02-19 22:00:00+00:00,SPWR.O,SunPower Corp,4109129.0,21.14,22.25,-0.074431,-0.094055,-0.074354,-0.081761,-0.175828,-0.102823,-0.180055,-0.111746,0.068726,1.0
3672957,2016-02-19 22:00:00+00:00,SQM.N,Sociedad Quimica y Minera de Chile SA,414021.0,17.15,16.94,0.002924,-0.033105,0.002975,-0.025269,0.051502,0.057428,0.049969,0.054736,-0.003696,0.0
3672958,2016-02-19 22:00:00+00:00,SRC.N,Spirit Realty Capital Inc,7481287.0,11.09,11.09,-0.0018,0.024954,-0.001777,0.029012,0.03839,0.052182,0.035912,0.046627,-0.067333,1.0
3672959,2016-02-19 22:00:00+00:00,SRCL.O,Stericycle Inc,898932.0,109.66,111.3,-0.016855,0.004241,-0.016824,0.008331,-0.054166,-0.052121,-0.055767,-0.054056,-0.044206,1.0
3672960,2016-02-19 22:00:00+00:00,SRE.N,Sempra Energy,2143306.0,97.25,96.66,0.003819,0.014058,0.003833,0.015573,0.020355,0.012359,0.019882,0.011141,-0.006034,1.0


In [5]:
news_train_df.head()

Unnamed: 0,time,sourceTimestamp,firstCreated,sourceId,headline,urgency,takeSequence,provider,subjects,audiences,bodySize,companyCount,headlineTag,marketCommentary,sentenceCount,wordCount,assetCodes,assetName,firstMentionSentence,relevance,sentimentClass,sentimentNegative,sentimentNeutral,sentimentPositive,sentimentWordCount,noveltyCount12H,noveltyCount24H,noveltyCount3D,noveltyCount5D,noveltyCount7D,volumeCounts12H,volumeCounts24H,volumeCounts3D,volumeCounts5D,volumeCounts7D
8328750,2015-12-08 13:56:53+00:00,2015-12-08 13:56:53+00:00,2015-12-08 13:56:53+00:00,f9c4067a6d20f21b,"CHESAPEAKE ENERGY CORP SHARES EXTEND LOSSES, N...",1,1,RTRS,"{'BLR', 'STX', 'OILG', 'EXPL', 'HOT', 'ENER', ...","{'E', 'U'}",0,1,,False,2,21,{'CHK.N'},Chesapeake Energy Corp,1,1.0,-1,0.819143,0.125228,0.055629,21,2,3,4,7,9,17,23,24,41,63
8328751,2015-12-08 13:57:20+00:00,2015-12-08 13:57:20+00:00,2015-12-08 13:57:20+00:00,749e57557c589fca,REG - Societe Generale SA Anheuser-Busch InBev...,3,1,LSE,"{'NEWR', 'FOBE', 'WEU', 'BEVS', 'NCYC', 'LEN',...",{'LSEN'},21427,1,,False,47,1528,"{'ABI.BR', 'BUD.N'}",Anheuser Busch Inbev SA,1,1.0,0,0.014524,0.801992,0.183484,62,0,0,0,1,3,22,30,51,90,167
8328752,2015-12-08 13:57:20+00:00,2015-12-08 13:57:19+00:00,2015-12-08 13:57:19+00:00,e61c180b2be5eb45,REG - Societe Generale SA Anheuser-Busch InBev...,3,1,LSE,"{'NEWR', 'FOBE', 'WEU', 'BEVS', 'NCYC', 'LEN',...",{'LSEN'},59958,1,,False,53,4563,"{'ABI.BR', 'BUD.N'}",Anheuser Busch Inbev SA,1,1.0,1,0.035002,0.161918,0.80308,176,19,25,46,74,133,21,29,50,89,166
8328753,2015-12-08 13:57:37+00:00,2015-12-08 13:57:37+00:00,2015-12-08 13:57:37+00:00,35e01becdbd06d17,IIROC Trade Resumption - BIP.PR.B <BIP.N>,3,1,CNW,"{'NEWR', 'LEN', 'ELEU', 'FINS', 'US', 'DFIN', ...","{'CNR', 'CNW'}",753,1,,False,8,137,{'BIP.N'},Brookfield Infrastructure Partners LP,4,0.57735,-1,0.811987,0.129426,0.058586,87,1,1,1,1,1,1,1,1,1,10
8328754,2015-12-08 13:57:41+00:00,2015-12-08 13:57:41+00:00,2015-12-08 13:57:41+00:00,36a59986b3a81936,TRANSOCEAN'S U.S.-LISTED SHARES DOWN 2.41 PCT ...,1,1,RTRS,"{'BLR', 'STX', 'WEU', 'HOT', 'CH', 'DRIL', 'EN...","{'E', 'U'}",0,1,,False,1,14,"{'RIG.N', 'RIGN.VX', 'RIGN.BN'}",Transocean Ltd,1,1.0,-1,0.819123,0.125241,0.055637,14,0,0,0,1,1,1,3,14,17,17


### Getting rid of Data prior to 2009
Data from the 2008 financial crisis may not benefit this model, so rows dated before 2009 are dropped.

In [6]:
from datetime import datetime, timedelta
start = datetime(2009, 1, 1, 0, 0, 0).date()
market_train_df = market_train_df.loc[market_train_df['time'].dt.date >= start].reset_index(drop=True)
news_train_df = news_train_df.loc[news_train_df['time'].dt.date >= start].reset_index(drop=True)

### Time difference between time and firstCreated
A gap between `firstCreated` and `time` may mean the news is less urgent, so the difference (in seconds) is kept as a feature.

In [7]:
# WIP, will work on it later on
def time_diff(news_df):
    
    news_df['num_publishing_diff_secs'] = news_df['time'] - news_df['firstCreated']
    news_df['num_publishing_diff_secs'] = news_df['num_publishing_diff_secs'] / np.timedelta64(1, 's')
    return news_df

In [8]:
news_train_df = time_diff(news_train_df)
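As a small illustration of what `time_diff` computes, the sketch below builds a two-row frame with made-up timestamps (the values are assumptions, not competition data) and converts the gap to seconds. It uses `.dt.total_seconds()`, which should be equivalent to dividing by `np.timedelta64(1, 's')`.

```python
import pandas as pd

# Toy frame with made-up timestamps; only the column names match the news data.
toy_news = pd.DataFrame({
    'time': pd.to_datetime(['2016-01-04 14:00:30', '2016-01-04 15:00:00'], utc=True),
    'firstCreated': pd.to_datetime(['2016-01-04 14:00:00', '2016-01-04 15:00:00'], utc=True),
})

# Seconds elapsed between first creation and publication of each item.
toy_news['num_publishing_diff_secs'] = (
    toy_news['time'] - toy_news['firstCreated']
).dt.total_seconds()

print(toy_news['num_publishing_diff_secs'].tolist())
```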

### Cleaning Data
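We will be removing the news rows with the following qualities:
* Empty headlines
* Duplicate headlines (same `headline` and `assetCodes`)
* Urgency of 2
* Null assetName (listed as a goal; `clean_data` does not filter on it yet)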

In [9]:
def clean_data(market_df, news_df, train=True):
    
    # get rid of invalid rows
    news_df = news_df[news_df.headline != '']
    news_df = news_df[news_df.urgency != 2]
    
    # remove duplicate headlines with the same assetCodes
    news_df = news_df.drop_duplicates(subset=['assetCodes', 'headline'],keep='first')

    return market_df, news_df

In [10]:
market_train_df, news_train_df = clean_data(market_train_df, news_train_df, train=True)

In [11]:
news_train_df.shape

(869377, 36)

In [12]:
market_train_df.shape

(400000, 16)

### Expanding News data
Each news row can list several asset codes in `assetCodes`, so we expand the news data into one row per asset code.

In [13]:
def expanding_news(news_df):
    
    # split to list
    news_output = news_df.copy()
    news_output['assetCodes'] = news_output['assetCodes'].str.findall(r"'([\w\./]+)'")
    
    # separate to assetcodes
    assetCodes_expanded = list(chain(*news_output['assetCodes']))
    assetCodes_index = news_df.index.repeat(news_output['assetCodes'].apply(len))
    assert len(assetCodes_index) == len(assetCodes_expanded)
    df_assetCodes = pd.DataFrame({'level_0': assetCodes_index, 'assetCode': assetCodes_expanded})
    
    # merge to dataframe
    merging_cols = [f for f in news_output if f not in ['assetCodes', 'sourceId']]
    news_df_expanded = pd.merge(df_assetCodes, news_output[merging_cols], left_on='level_0', 
                                right_index=True, suffixes=('', '_old'))
    
    return news_df_expanded

In [14]:
expand_train_df = expanding_news(news_train_df)
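The expansion can also be sketched with `DataFrame.explode` (available in pandas 0.25 and later). This is an alternative to the `chain`/`repeat` approach above, not the method used in this notebook, and the two toy rows are made up.

```python
import pandas as pd

# Toy news rows; assetCodes is stored as a set-like string, as in the real data.
toy_news = pd.DataFrame({
    'headline': ['Headline A', 'Headline B'],
    'assetCodes': ["{'ABI.BR', 'BUD.N'}", "{'CHK.N'}"],
})

# Parse the string into a list of codes, then emit one row per code.
toy_news['assetCode'] = toy_news['assetCodes'].str.findall(r"'([\w./]+)'")
toy_expanded = toy_news.explode('assetCode').drop(columns='assetCodes')

print(toy_expanded)
```

The original row index is repeated for every code, which plays the same role as the `level_0` column above.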

In [15]:
expand_train_df.tail()

Unnamed: 0,level_0,assetCode,time,sourceTimestamp,firstCreated,headline,urgency,takeSequence,provider,subjects,audiences,bodySize,companyCount,headlineTag,marketCommentary,sentenceCount,wordCount,assetName,firstMentionSentence,relevance,sentimentClass,sentimentNegative,sentimentNeutral,sentimentPositive,sentimentWordCount,noveltyCount12H,noveltyCount24H,noveltyCount3D,noveltyCount5D,noveltyCount7D,volumeCounts12H,volumeCounts24H,volumeCounts3D,volumeCounts5D,volumeCounts7D,num_publishing_diff_secs
1608941,999997,SGEN.O,2016-12-30 21:57:00+00:00,2016-12-30 21:57:00+00:00,2016-12-30 21:57:00+00:00,EQUITY ALERT: Rosen Law Firm Announces Investi...,3,1,BSW,"{'CMSS', 'CLJ', 'GEN', 'NEWR', 'HECA', 'PHMR',...","{'BSW', 'CNR'}",3734,1,,False,16,664,Seattle Genetics Inc,1,1.0,-1,0.6519,0.227707,0.120393,360,0,0,3,4,4,1,2,18,41,41,0.0
1608942,999997,SGEN.OQ,2016-12-30 21:57:00+00:00,2016-12-30 21:57:00+00:00,2016-12-30 21:57:00+00:00,EQUITY ALERT: Rosen Law Firm Announces Investi...,3,1,BSW,"{'CMSS', 'CLJ', 'GEN', 'NEWR', 'HECA', 'PHMR',...","{'BSW', 'CNR'}",3734,1,,False,16,664,Seattle Genetics Inc,1,1.0,-1,0.6519,0.227707,0.120393,360,0,0,3,4,4,1,2,18,41,41,0.0
1608943,999998,IPDN.O,2016-12-30 21:58:53+00:00,2016-12-30 21:58:53+00:00,2016-12-30 21:58:53+00:00,PROFESSIONAL DIVERSITY NETWORK INC - FILES FOR...,1,1,RTRS,"{'BLR', 'SWIT', 'ITSE', 'SISU', 'BACT', 'TMT',...","{'E', 'U'}",0,1,,False,1,23,Professional Diversity Network Inc,1,1.0,-1,0.816252,0.126928,0.056819,23,0,0,0,0,0,0,0,3,3,3,0.0
1608944,999998,IPDN.OQ,2016-12-30 21:58:53+00:00,2016-12-30 21:58:53+00:00,2016-12-30 21:58:53+00:00,PROFESSIONAL DIVERSITY NETWORK INC - FILES FOR...,1,1,RTRS,"{'BLR', 'SWIT', 'ITSE', 'SISU', 'BACT', 'TMT',...","{'E', 'U'}",0,1,,False,1,23,Professional Diversity Network Inc,1,1.0,-1,0.816252,0.126928,0.056819,23,0,0,0,0,0,0,0,3,3,3,0.0
1608945,999999,JFC.N,2016-12-30 22:00:00+00:00,2016-12-30 22:00:00+00:00,2016-12-30 22:00:00+00:00,"JPMorgan China Region Fund, Inc. Board to Subm...",3,1,BSW,"{'CMSS', 'NEWR', 'INVT', 'BACT', 'BSUP', 'INDS...","{'BSW', 'CNR'}",2969,1,,False,15,492,JPMorgan China Region Fund Inc,1,1.0,1,0.130152,0.388845,0.481002,383,0,0,0,0,0,0,0,0,0,0,0.0


In [16]:
market_train_df.tail()

Unnamed: 0,time,assetCode,assetName,volume,close,open,returnsClosePrevRaw1,returnsOpenPrevRaw1,returnsClosePrevMktres1,returnsOpenPrevMktres1,returnsClosePrevRaw10,returnsOpenPrevRaw10,returnsClosePrevMktres10,returnsOpenPrevMktres10,returnsOpenNextMktres10,universe
399995,2016-12-30 22:00:00+00:00,ZIOP.O,ZIOPHARM Oncology Inc,1608829.0,5.35,5.37,-0.003724,0.0,0.000536,-0.001868,-0.165367,-0.138042,-0.139597,-0.135913,0.051189,0.0
399996,2016-12-30 22:00:00+00:00,ZLTQ.O,ZELTIQ Aesthetics Inc,347830.0,43.52,43.62,-0.000689,0.0,-0.000515,0.000493,0.002996,0.002989,0.008213,0.00321,-0.048555,0.0
399997,2016-12-30 22:00:00+00:00,ZNGA.O,Zynga Inc,7396601.0,2.57,2.58,-0.011538,0.0,-0.006004,-0.001034,-0.091873,-0.078571,-0.077252,-0.077188,0.011703,0.0
399998,2016-12-30 22:00:00+00:00,ZTO.N,Unknown,3146519.0,12.07,12.5,-0.029743,0.007252,-0.02846,0.006719,-0.065066,-0.042146,-0.078104,-0.043813,0.083367,1.0
399999,2016-12-30 22:00:00+00:00,ZTS.N,Zoetis Inc,1701204.0,53.53,53.64,-0.001678,0.003091,0.00506,0.002885,0.023127,0.028177,0.026566,0.028719,-0.01622,1.0


### Cleaning Headlines
The following function reduces each headline to lowercase, stemmed words with stopwords removed, which is all the text processing needs.

In [17]:
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
import re

ps = PorterStemmer()
sw = set(stopwords.words('english'))  # set membership checks are faster than list lookups

# this takes up a lot of time, so apply it when getting coefficients to filter out words.
def clean_headlines(headline):
    
    # replace non-letters (digits, punctuation) with spaces and convert to lowercase
    headline = re.sub('[^a-zA-Z]', ' ', headline)
    headline = headline.lower()
    
    # split on whitespace; split() without arguments drops empty tokens
    headline_words_rough = headline.split()
    
    # drop stopwords and stem the remaining words
    headline_words = []
    for word in headline_words_rough:
        if word not in sw:
            headline_words.append(ps.stem(word))
    
    # join sentence back again
    return ' '.join(headline_words)
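To see what the regex and stopword steps do without needing the NLTK resources, here is a reduced sketch. `TOY_STOPWORDS` and `toy_clean` are hypothetical names for this illustration only, and stemming is left out.

```python
import re

# Hypothetical, tiny stop-word set used only for this illustration.
TOY_STOPWORDS = {'the', 'of', 'to', 'in', 'a'}

def toy_clean(headline):
    # Replace every non-letter with a space, then lowercase.
    letters_only = re.sub('[^a-zA-Z]', ' ', headline).lower()
    # split() without arguments drops the empty strings left by repeated spaces.
    return ' '.join(w for w in letters_only.split() if w not in TOY_STOPWORDS)

toy_cleaned = toy_clean("TRANSOCEAN'S U.S.-LISTED SHARES DOWN 2.41 PCT")
print(toy_cleaned)
```

Note that single-letter fragments such as `s` and `u` survive this kind of cleaning, so a minimum token length could be worth considering.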

### Categorical Groupby
This will merge groups of categorical data together into either lists or sets.

In [18]:
def categorical_groupby(expand_df):

    # get categorical groupbys
    main_cols = ['time', 'assetCode']
    expand_headline_groupby = expand_df[main_cols + ['headline']].groupby(['time', 'assetCode'])
    expand_cat_groupby = expand_df[main_cols + ['subjects', 'audiences']].groupby(['time', 'assetCode'])
    
    # split subjects and audiences
    def cat_to_list(x):
        if x.name not in ['time', 'assetCode'] and x.name != 'headline':
            result = []
            for item in x:
                result += item
            return list(set(result)) # returns unique audiences/subjects
        elif x.name == 'headline':
            return list(x)
    
    # convert groupby to dataframes
    expand_cat_df = expand_cat_groupby.transform(lambda x: cat_to_list(x))
    expand_headline_df = expand_headline_groupby.transform(lambda x: cat_to_list(x)) # can't iterate through?

    # merge to categorical dataframes
    return pd.concat([expand_cat_df, expand_headline_df], axis=1)
    

### Numerical Groupby
This will merge groups of numerical data together through aggregating the data.

In [19]:
# get aggregated columns + aggregation map
news_agg_cols = [f for f in news_train_df.columns if 'novelty' in f or
                'volume' in f or
                'sentiment' in f or
                'bodySize' in f or
                'Count' in f or
                'marketCommentary' in f or
                'relevance' in f or
                'num_' in f]
news_agg_dict = {}
for col in news_agg_cols:
    news_agg_dict[col] = ['mean']
news_agg_dict['urgency'] = ['min', 'count']
news_agg_dict['takeSequence'] = ['max']
    
def numerical_groupby(expand_df):
    
    # aggregate dataframe
    expand_agg_groupby = expand_df[['time', 'assetCode'] + sorted(list(news_agg_dict.keys()))].groupby(['time', 'assetCode'])
    expand_agg_df = expand_agg_groupby.agg(news_agg_dict).apply(np.float32)
    expand_agg_df.columns = ['_'.join(col).strip() for col in expand_agg_df.columns.values]
    
    return expand_agg_df
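A toy run of the same aggregation pattern, with made-up values and a reduced aggregation map (`toy_agg_dict` is a stand-in for `news_agg_dict`), shows how the two-level column index is flattened into names such as `urgency_min`:

```python
import pandas as pd

toy = pd.DataFrame({
    'time': ['t1', 't1', 't2'],
    'assetCode': ['A.N', 'A.N', 'A.N'],
    'relevance': [1.0, 0.5, 0.2],
    'urgency': [3, 1, 3],
})

# Same shape of aggregation map as news_agg_dict, reduced to two columns.
toy_agg_dict = {'relevance': ['mean'], 'urgency': ['min', 'count']}

toy_agg = toy.groupby(['time', 'assetCode']).agg(toy_agg_dict)
# Flatten the two-level column index: ('urgency', 'min') -> 'urgency_min'.
toy_agg.columns = ['_'.join(col) for col in toy_agg.columns]

print(toy_agg)
```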

### Merge by time & assetCode to News Article
We will be merging rows with the same time and assetCode.

In [20]:
def get_matches(market_df, expand_df):
    
    # get temporary columns as data
    temp_market_df = market_df[['time', 'assetCode']].copy()
    temp_expand_df = expand_df[['time', 'assetCode']].copy()
    
    # get indices
    temp_expand_df['expand_index'] = temp_expand_df.index.values
    
    # join the two
    temp_expand_df.set_index(['time', 'assetCode'], inplace=True)
    temp_expand_market_df = temp_market_df.join(temp_expand_df, on=['time', 'assetCode'])
    
    # keep only the market rows that matched a news row
    temp_expand_market_df = temp_expand_market_df[temp_expand_market_df.expand_index.notnull()]
    # the join introduces NaNs, which turns the index column into floats; cast back to int
    expand_indices = temp_expand_market_df['expand_index'].astype(int).tolist()
    
    # do final cleanup
    del temp_market_df
    del temp_expand_df
    
    # fetch matches
    return expand_df.loc[expand_indices]

def merge_by_code(market_df, expand_df):
    
    # use a copy
    market_df_copy = market_df.copy()
    
    # get expansion of rows
    expand_df = get_matches(market_df, expand_df)
    
    # convert to lists
    expand_df['subjects'] = expand_df['subjects'].str.findall(r"'([\w\./]+)'")
    expand_df['audiences'] = expand_df['audiences'].str.findall(r"'([\w\./]+)'")
    
    # clean headlines
    expand_df['headline'] = expand_df['headline'].apply(clean_headlines)
    
    # group categoricals
    expand_cat_df = categorical_groupby(expand_df)
    expand_cat_df['time'] = expand_df['time']
    expand_cat_df['assetCode'] = expand_df['assetCode']
    
    # convert lists to tuples so the values are hashable
    for cat_col in ['subjects', 'audiences']:
        expand_cat_df[cat_col] = expand_cat_df[cat_col].apply(tuple)
        
    # remove duplicate rows
    expand_cat_df = expand_cat_df.drop_duplicates(subset=['time', 'assetCode'],
                                                  keep='first')

    # group numericals
    expand_num_df = numerical_groupby(expand_df)
    
    # merge datasets
    expanded_market_df = expand_cat_df.join(expand_num_df, on=['time', 'assetCode'])
    expanded_market_df = pd.merge(market_df_copy, expanded_market_df, 
                                  on=['time', 'assetCode'], how='left')
    
    return expanded_market_df
    

In [21]:
%%time
X_train = merge_by_code(market_train_df, expand_train_df)

CPU times: user 21.4 s, sys: 7.2 s, total: 28.6 s
Wall time: 28.5 s


In [22]:
X_train.tail()

Unnamed: 0,time,assetCode,assetName,volume,close,open,returnsClosePrevRaw1,returnsOpenPrevRaw1,returnsClosePrevMktres1,returnsOpenPrevMktres1,returnsClosePrevRaw10,returnsOpenPrevRaw10,returnsClosePrevMktres10,returnsOpenPrevMktres10,returnsOpenNextMktres10,universe,subjects,audiences,headline,bodySize_mean,companyCount_mean,marketCommentary_mean,sentenceCount_mean,wordCount_mean,relevance_mean,sentimentClass_mean,sentimentNegative_mean,sentimentNeutral_mean,sentimentPositive_mean,sentimentWordCount_mean,noveltyCount12H_mean,noveltyCount24H_mean,noveltyCount3D_mean,noveltyCount5D_mean,noveltyCount7D_mean,volumeCounts12H_mean,volumeCounts24H_mean,volumeCounts3D_mean,volumeCounts5D_mean,volumeCounts7D_mean,num_publishing_diff_secs_mean,urgency_min,urgency_count,takeSequence_max
399995,2016-12-30 22:00:00+00:00,ZIOP.O,ZIOPHARM Oncology Inc,1608829.0,5.35,5.37,-0.003724,0.0,0.000536,-0.001868,-0.165367,-0.138042,-0.139597,-0.135913,0.051189,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,
399996,2016-12-30 22:00:00+00:00,ZLTQ.O,ZELTIQ Aesthetics Inc,347830.0,43.52,43.62,-0.000689,0.0,-0.000515,0.000493,0.002996,0.002989,0.008213,0.00321,-0.048555,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,
399997,2016-12-30 22:00:00+00:00,ZNGA.O,Zynga Inc,7396601.0,2.57,2.58,-0.011538,0.0,-0.006004,-0.001034,-0.091873,-0.078571,-0.077252,-0.077188,0.011703,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,
399998,2016-12-30 22:00:00+00:00,ZTO.N,Unknown,3146519.0,12.07,12.5,-0.029743,0.007252,-0.02846,0.006719,-0.065066,-0.042146,-0.078104,-0.043813,0.083367,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,
399999,2016-12-30 22:00:00+00:00,ZTS.N,Zoetis Inc,1701204.0,53.53,53.64,-0.001678,0.003091,0.00506,0.002885,0.023127,0.028177,0.026566,0.028719,-0.01622,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,
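In the rows shown above, every news column is empty. One possible reason (an assumption, not something verified here) is that the merge requires the news `time` to equal the market `time` exactly, while market rows are stamped at 22:00:00 and news rows carry to-the-second timestamps. The sketch below uses made-up rows to contrast an exact-timestamp join with a hypothetical join on the calendar date.

```python
import pandas as pd

toy_market = pd.DataFrame({
    'time': pd.to_datetime(['2016-12-30 22:00:00'], utc=True),
    'assetCode': ['ZTS.N'],
    'close': [53.53],
})
# Made-up news row for the same asset, earlier on the same day.
toy_news = pd.DataFrame({
    'time': pd.to_datetime(['2016-12-30 13:56:53'], utc=True),
    'assetCode': ['ZTS.N'],
    'relevance': [1.0],
})

# Exact-timestamp join: 22:00:00 never equals 13:56:53, so no news is attached.
exact = toy_market.merge(toy_news, on=['time', 'assetCode'], how='left')

# Hypothetical alternative: join on the calendar date instead of the full timestamp.
toy_market['date'] = toy_market['time'].dt.date
toy_news['date'] = toy_news['time'].dt.date
by_date = toy_market.merge(
    toy_news.drop(columns='time'), on=['date', 'assetCode'], how='left'
)

print(exact['relevance'].isna().all(), by_date['relevance'].tolist())
```

Joining by date is a design choice with its own trade-off: news published after the market close would be attached to that same day's row.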


In [23]:
def data_step1(market_test_df, news_test_df):
    
    news_test_df = time_diff(news_test_df)
    market_test_df, news_test_df = clean_data(market_test_df, news_test_df, train=False)
    expand_test_df = expanding_news(news_test_df)
    X_test = merge_by_code(market_test_df, expand_test_df)
    
    return X_test
    

Step 1 with the Full dataset: 

* CPU times: user 38.1 s, sys: 9.73 s, total: 47.8 s

* Wall time: 47.8 s

Compared to [Amateur Hour's](https://www.kaggle.com/magichanics/amateur-hour-using-headlines-to-predict-stocks) method of merging datasets, this one performs the merge a lot faster.

<a id='section2'></a>
## Step 2. Feature Engineering

This step adds features ranging from quant (rolling-window) statistics to text-processing scores.

In [24]:
X_train.tail()

Unnamed: 0,time,assetCode,assetName,volume,close,open,returnsClosePrevRaw1,returnsOpenPrevRaw1,returnsClosePrevMktres1,returnsOpenPrevMktres1,returnsClosePrevRaw10,returnsOpenPrevRaw10,returnsClosePrevMktres10,returnsOpenPrevMktres10,returnsOpenNextMktres10,universe,subjects,audiences,headline,bodySize_mean,companyCount_mean,marketCommentary_mean,sentenceCount_mean,wordCount_mean,relevance_mean,sentimentClass_mean,sentimentNegative_mean,sentimentNeutral_mean,sentimentPositive_mean,sentimentWordCount_mean,noveltyCount12H_mean,noveltyCount24H_mean,noveltyCount3D_mean,noveltyCount5D_mean,noveltyCount7D_mean,volumeCounts12H_mean,volumeCounts24H_mean,volumeCounts3D_mean,volumeCounts5D_mean,volumeCounts7D_mean,num_publishing_diff_secs_mean,urgency_min,urgency_count,takeSequence_max
399995,2016-12-30 22:00:00+00:00,ZIOP.O,ZIOPHARM Oncology Inc,1608829.0,5.35,5.37,-0.003724,0.0,0.000536,-0.001868,-0.165367,-0.138042,-0.139597,-0.135913,0.051189,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,
399996,2016-12-30 22:00:00+00:00,ZLTQ.O,ZELTIQ Aesthetics Inc,347830.0,43.52,43.62,-0.000689,0.0,-0.000515,0.000493,0.002996,0.002989,0.008213,0.00321,-0.048555,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,
399997,2016-12-30 22:00:00+00:00,ZNGA.O,Zynga Inc,7396601.0,2.57,2.58,-0.011538,0.0,-0.006004,-0.001034,-0.091873,-0.078571,-0.077252,-0.077188,0.011703,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,
399998,2016-12-30 22:00:00+00:00,ZTO.N,Unknown,3146519.0,12.07,12.5,-0.029743,0.007252,-0.02846,0.006719,-0.065066,-0.042146,-0.078104,-0.043813,0.083367,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,
399999,2016-12-30 22:00:00+00:00,ZTS.N,Zoetis Inc,1701204.0,53.53,53.64,-0.001678,0.003091,0.00506,0.002885,0.023127,0.028177,0.026566,0.028719,-0.01622,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,


### Entire Market and Individual Asset Quant Features
We are going to be obtaining Quant Features from both the entire market dataframe and from each individual asset based on assetCode.

In [25]:
def quant_feats(X):
    
    def moving_average(X_MA, columns, str_type):
        
        windows = [3, 7, 14]
        
        for col in columns:
            
            for window in windows:
                
                roll_col = X_MA[col].rolling(window=window)
                X_MA['%s_%s_%sMA'%(str_type, col, window)] = roll_col.mean()
                X_MA['%s_%s_%sSTD'%(str_type, col, window)] = roll_col.std()
                X_MA['%s_%s_%sMAX'%(str_type, col, window)] = roll_col.max()
                X_MA['%s_%s_%sMIN'%(str_type, col, window)] = roll_col.min()
        
        # return the frame with the rolling-window columns added
        return X_MA
    
    # get std and moving average of the entire dataset
    X = moving_average(X, columns=['close', 'volume'], str_type='global')
    
    print('finished global')
    
    # get std and moving average of each individual asset based on assetCode
    iterations = 0
    for asset in X['assetCode'].unique():
        
        # get indices (faster)
        asset_indices = X[X.assetCode == asset].index.values
        
        # get std and ma
        X.loc[asset_indices] = moving_average(X.loc[asset_indices], columns=['close', 'volume'], str_type='asset')
        
        # display iterations
        if iterations % 250 == 0:
            print('On asset: ' + str(iterations) + ' of ' + str(len(X['assetCode'].unique())))
        iterations += 1
        
    return X

Timing with the sampled data (`market_train_df.tail(400_000)` and `news_train_df.tail(1_000_000)`):
* CPU times: user 10min 51s, sys: 76 ms, total: 10min 52s
* Wall time: 10min 52s

In [None]:
%%time
X_train = quant_feats(X_train)
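The per-asset loop is the slow part of this step. As a sketch of an alternative (not what the notebook runs), a rolling statistic can be computed within each asset using `groupby(...).transform(...)`. The toy prices and the column name `asset_close_2MA` are made up for the example.

```python
import pandas as pd

# Two assets interleaved, as they would be in a time-ordered market frame.
toy = pd.DataFrame({
    'assetCode': ['A.N', 'B.N', 'A.N', 'B.N', 'A.N', 'B.N'],
    'close': [10.0, 100.0, 11.0, 110.0, 12.0, 120.0],
})

# Rolling mean computed within each asset; rows of other assets are ignored.
toy['asset_close_2MA'] = (
    toy.groupby('assetCode')['close']
       .transform(lambda s: s.rolling(window=2).mean())
)

print(toy)
```

Whichever approach is used, it is worth checking that the `asset_*` columns actually appear in the resulting frame.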

### Text Processing with CountVectorizer
We are going to be using CountVectorizer on headlines, audiences and subjects to determine their influence on the target column.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

class TextToCoeff:

    def vectorize(self, X_):

        # get required dataset
        X_text = X_.copy()[['returnsOpenNextMktres10', self.col]].dropna()

        # round data (if objective : binary)
        def round_scores(x):
            if x >= 0:
                return 1
            else:
                return 0
        X_text['returnsOpenNextMktres10'] = X_text['returnsOpenNextMktres10'].apply(round_scores)

        # convert tuples to string format if applicable
        def tuple_to_str(x):
            if isinstance(x, tuple):
                return ' '.join(x)
            else:
                return x
        X_text[self.col] = X_text[self.col].apply(tuple_to_str)

        # get lists
        text_lst = list(X_text[self.col])
        target_lst = list(X_text['returnsOpenNextMktres10'])

        # vectorize text features
        vectorizer = CountVectorizer()
        text_vectorized = vectorizer.fit_transform(text_lst)

        # model data (will use other modelling methods in the future)
        text_model = LogisticRegression()
        text_model = text_model.fit(text_vectorized, target_lst)

        # get coefficients
        basictext = vectorizer.get_feature_names()
        basiccoeffs = text_model.coef_.tolist()[0]
        coeff_df = pd.DataFrame({'Text' : basictext, 
                                'Coefficient' : basiccoeffs})

        # convert dataframe to dictionary of coefficients
        self.coeff_dict = dict(zip(coeff_df.Text, coeff_df.Coefficient))

        # get value that accounts for nulls
        self.coeff_default = coeff_df['Coefficient'].mean()

    def predict_data(self, X_):

        def get_coeff(x):
            
            try:

                # iterate through each set of text data
                coeff_total_score = 0
                
                if isinstance(x, tuple):
                    x = ' '.join(x)
                    
                if isinstance(x, str):
                    x = [x]
                
                for textset in x:

                    text_lst = textset.split(' ')
                
                    # iter through every word
                    coeff_sum = 0
                    for text in text_lst:
                        text = text.lower()
                        if text in self.coeff_dict:
                            coeff_sum += self.coeff_dict[text]
                        else:
                            coeff_sum += self.coeff_default

                    # get average coefficient
                    coeff_total_score += coeff_sum / len(text_lst)

                return coeff_total_score / len(x)
            
            except TypeError:
                
                return np.nan

        X_[self.col + '_coeff_mean'] = X_[self.col].apply(get_coeff)
        
        return X_

    # store the name of the text column to process
    def __init__(self, input_col):
        self.col = input_col

In [None]:
def text_processing_train(X):
    
    # get list of text to coeff converters [headline, subjects, audiences]
    text_to_coeff_lst = []
    for f in ['headline', 'subjects', 'audiences']:
        text_to_coeff = TextToCoeff(f)
        text_to_coeff.vectorize(X)
        text_to_coeff_lst.append(text_to_coeff)
    
    return text_to_coeff_lst

def text_processing_test(X, text_to_coeff_lst):
    
    for text_to_coeff in text_to_coeff_lst:
        X = text_to_coeff.predict_data(X)
    
    return X

In [None]:
%%time
text_to_coeff_lst = text_processing_train(X_train)
X_train = text_processing_test(X_train, text_to_coeff_lst)
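The scoring idea in `predict_data` can be reduced to a few lines: each token is looked up in a dictionary of coefficients, unseen tokens get a default value, and the result is the average. The sketch below uses hypothetical coefficients (`toy_coeff_dict`, `toy_default`) rather than values fitted by `TextToCoeff`.

```python
# Hypothetical per-token coefficients, standing in for TextToCoeff.coeff_dict.
toy_coeff_dict = {'profit': 0.4, 'loss': -0.6}
toy_default = -0.1  # stand-in for coeff_default (mean of all coefficients)

def mean_coeff(text, coeff_dict, default):
    # Average the coefficient of every token; unseen tokens get the default.
    tokens = text.lower().split()
    if not tokens:
        return float('nan')
    return sum(coeff_dict.get(tok, default) for tok in tokens) / len(tokens)

score = mean_coeff('Profit warning', toy_coeff_dict, toy_default)
print(round(score, 3))
```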

### Clustering
We will be using KMeans to build three cluster features: one from `open` and `close`, one from the volume-related columns and one from the novelty columns.

In [None]:
from sklearn.cluster import KMeans

def clustering(X):

    def cluster_modelling(features):
        df_set = X[features]
        cluster_model = KMeans(n_clusters = 8)
        cluster_model.fit(df_set)
        return cluster_model.predict(df_set)
    
    # get columns:
    vol_cols = [f for f in X.columns if f != 'volume' and 'volume' in f]
    novelty_cols = [f for f in X.columns if 'novelty' in f]
    
    # fill nulls
    cluster_cols = novelty_cols + vol_cols + ['open', 'close']
    X[cluster_cols] = X[cluster_cols].fillna(0)
    
    X['cluster_open_close'] = cluster_modelling(['open', 'close'])
    X['cluster_volume'] = cluster_modelling(vol_cols)
    X['cluster_novelty'] = cluster_modelling(novelty_cols)
    
    return X

In [None]:
X_train = clustering(X_train)
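Because `clustering` fits a new KMeans model on whatever frame it receives, cluster ids are not guaranteed to mean the same thing on training and test data. A minimal sketch of one way around this, assuming it is acceptable to keep the fitted model around, is to fit once and reuse it with `predict`. The toy arrays are made up.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated toy groups of (open, close) pairs.
toy_train = np.array([[1.0, 1.1], [1.2, 1.0], [50.0, 51.0], [49.0, 50.5]])
toy_test = np.array([[1.1, 1.05], [50.5, 50.0]])

# Fit once on training rows; random_state makes the labelling repeatable.
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_train)

train_labels = toy_kmeans.labels_
test_labels = toy_kmeans.predict(toy_test)  # reuse the fitted centroids

print(train_labels, test_labels)
```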

### Misc. Features
Includes the following features:
* Daily difference (`close - open`)
* Close-to-open ratio (`close / open`)

In [None]:
def misc_features(X):
    
    # Adding daily difference
    new_col = X["close"] - X["open"]
    X.insert(loc=6, column="daily_diff", value=new_col)
    X['close_to_open'] =  np.abs(X['close'] / X['open'])


In [None]:
misc_features(X_train)

In [None]:
X_train.dropna().tail()

In [None]:
del market_train_df, news_train_df

In [None]:
def data_step2(X_test):
    
    X_test = quant_feats(X_test)
    X_test = text_processing_test(X_test, text_to_coeff_lst)
    X_test = clustering(X_test)
    misc_features(X_test)
    
    return X_test

<a id='section3'></a>
## Step 3. Modelling using LightGBM

### Preparing Datasets for Modelling

In [None]:
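# binary target (1 if the next 10-day return is non-negative), as required by the "binary" objective
y_train = (X_train['returnsOpenNextMktres10'] >= 0).astype(int)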
X_train = X_train[[f for f in X_train.columns if f not in ['time', 'assetCode', 'universe', 'assetName', 'returnsOpenNextMktres10',
                                                          'headline', 'subjects', 'audiences']]].fillna(0)

### Fixed Training Split
We use a fixed split rather than a random one: the last rows of the time-ordered training data are held out for validation. Randomly choosing rows would mix timestamps between the training and validation sets and give misleading validation results.

In [None]:
def fixed_train_test_split(X, y, train_size):
    
    # convert the fraction to a row count
    train_size = int(train_size * len(X))
    
    # split data: earliest rows for training, latest rows for validation
    X_train, y_train = X[:train_size], y[:train_size]
    X_valid, y_valid = X[train_size:], y[train_size:]
    
    return X_train, y_train, X_valid, y_valid

In [None]:
X_train, y_train, X_valid, y_valid = fixed_train_test_split(X_train, y_train, 0.85)
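A toy check of a chronological split, assuming the rows are already sorted by time: the first part of the frame is used for training and the most recent rows are held out. `chronological_split` is a hypothetical helper written for this example.

```python
import pandas as pd

def chronological_split(X, y, train_size):
    # First train_size fraction of rows for training, the remainder for validation.
    cut = int(train_size * len(X))
    return X.iloc[:cut], y.iloc[:cut], X.iloc[cut:], y.iloc[cut:]

toy_X = pd.DataFrame({'feature': range(10)})  # rows assumed sorted by time
toy_y = pd.Series(range(10))

X_tr, y_tr, X_va, y_va = chronological_split(toy_X, toy_y, 0.8)
print(len(X_tr), len(X_va), X_va['feature'].tolist())
```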

### Using LightGBM for modelling

In [None]:
import lightgbm as lgb

params = {"objective" : "binary",
          "metric" : "binary_logloss",
          "num_leaves" : 60,
          "max_depth": -1,
          "learning_rate" : 0.01,
          "bagging_fraction" : 0.9,  # subsample
          "feature_fraction" : 0.9,  # colsample_bytree
          "bagging_freq" : 5,        # subsample_freq
          "bagging_seed" : 2018,
          "verbosity" : -1 }

lgtrain, lgval = lgb.Dataset(X_train, y_train), lgb.Dataset(X_valid, y_valid)
lgb_model = lgb.train(params, lgtrain, 2000, valid_sets=[lgtrain, lgval], early_stopping_rounds=100, verbose_eval=200)


<a id='section4'></a>
## Step 4. Applying the Model


Predictions are made in a for loop over the prediction days, applying all of the functions above to each day's test data.

In [None]:
def get_X(market_test_df, news_test_df):
    X_test = data_step1(market_test_df, news_test_df)
    X_test = data_step2(X_test)
    X_test = X_test[[f for f in X_train.columns if f not in ['time', 'assetCode', 'universe', 'assetName', 'returnsOpenNextMktres10',
                                                          'headline', 'subjects', 'audiences']]].fillna(0)
    return X_test

def make_predictions(market_obs_df, news_obs_df):
    
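    # predict using given model; with a binary objective, predict() returns a probability in [0, 1]
    X_test = get_X(market_obs_df, news_obs_df)
    probabilities = lgb_model.predict(X_test, num_iteration=lgb_model.best_iteration)
    # rescale to the [-1, 1] confidence range
    prediction_values = np.clip(2 * probabilities - 1, -1, 1)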

    return prediction_values
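With `"objective" : "binary"`, the model outputs probabilities in [0, 1], while `confidenceValue` is expected to lie in [-1, 1]. One common mapping, shown here as a sketch with made-up probabilities, is the linear rescaling `2 * p - 1`:

```python
import numpy as np

# Made-up class-1 probabilities, as a binary classifier would return them.
toy_proba = np.array([0.0, 0.25, 0.5, 1.0])

# Linear map from [0, 1] to [-1, 1]; 0.5 (no opinion) becomes 0.
toy_confidence = 2.0 * toy_proba - 1.0

print(toy_confidence.tolist())
```

Under this mapping a probability below 0.5 becomes a negative confidence, which clipping alone would not produce.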


In [None]:
%%time
for (market_obs_df, news_obs_df, predictions_template_df) in env.get_prediction_days(): # Looping over days from start of 2017 to 2019-07-15
    
    # make predictions
    predictions_template_df['confidenceValue'] = make_predictions(market_obs_df, news_obs_df)
    
    # save predictions
    env.predict(predictions_template_df)
    

In [None]:
env.write_submission_file()

**Sources:**
* [Market Data NN Baseline by Christofhenkel](https://www.kaggle.com/christofhenkel/market-data-nn-baseline)
* [a simple model using the market and news data by Bguberfain](https://www.kaggle.com/bguberfain/a-simple-model-using-the-market-and-news-data)
* [Amateur Hour - Using Headlines to Predict Stocks by Magichanics](https://www.kaggle.com/magichanics/amateur-hour-using-headlines-to-predict-stocks)
* [Simple Quant Features by Youhanlee](https://www.kaggle.com/youhanlee/simple-quant-features-using-python)
* [>0.64 in 100 lines by rabaman](https://www.kaggle.com/rabaman/0-64-in-100-lines/comments)