# "Stock" Grade Neural Network
### Starter Kernel by ``Magichanics`` 
*([GitHub](https://github.com/Magichanics) - [Kaggle](https://www.kaggle.com/magichanics))*

With more features from public kernels, as well as the idea of using Neural Networks for modelling, I've decided to do some experimenting myself in hopes of producing the best results. Feel free to post suggestions or criticisms!

## Table of Contents

* [Step 1. Merging Datasets](#section1)
* [Step 2. Feature Engineering](#section2)
* [Step 3. Modelling using Keras' Neural Network](#section3)
* [Step 4. Applying the Model](#section4)

In [1]:
import numpy as np
import pandas as pd
import os
import gc
from itertools import chain

import matplotlib.pyplot as plt

In [2]:
# import environment for data
from kaggle.competitions import twosigmanews
env = twosigmanews.make_env()

Loading the data... This could take a minute.
Done!


In [3]:
(market_train_df, news_train_df) = env.get_training_data()
market_train_df = market_train_df.tail(400_000)
news_train_df = news_train_df.tail(1_000_000)

<a id='section1'></a>
## Step 1. Merging Datasets

While most notebooks focus only on the market dataset, I'm going to attempt to bring the news and market datasets together.

In [4]:
market_train_df.head()

Unnamed: 0,time,assetCode,assetName,volume,close,open,returnsClosePrevRaw1,returnsOpenPrevRaw1,returnsClosePrevMktres1,returnsOpenPrevMktres1,returnsClosePrevRaw10,returnsOpenPrevRaw10,returnsClosePrevMktres10,returnsOpenPrevMktres10,returnsOpenNextMktres10,universe
3672956,2016-02-19 22:00:00+00:00,SPWR.O,SunPower Corp,4109129.0,21.14,22.25,-0.074431,-0.094055,-0.074354,-0.081761,-0.175828,-0.102823,-0.180055,-0.111746,0.068726,1.0
3672957,2016-02-19 22:00:00+00:00,SQM.N,Sociedad Quimica y Minera de Chile SA,414021.0,17.15,16.94,0.002924,-0.033105,0.002975,-0.025269,0.051502,0.057428,0.049969,0.054736,-0.003696,0.0
3672958,2016-02-19 22:00:00+00:00,SRC.N,Spirit Realty Capital Inc,7481287.0,11.09,11.09,-0.0018,0.024954,-0.001777,0.029012,0.03839,0.052182,0.035912,0.046627,-0.067333,1.0
3672959,2016-02-19 22:00:00+00:00,SRCL.O,Stericycle Inc,898932.0,109.66,111.3,-0.016855,0.004241,-0.016824,0.008331,-0.054166,-0.052121,-0.055767,-0.054056,-0.044206,1.0
3672960,2016-02-19 22:00:00+00:00,SRE.N,Sempra Energy,2143306.0,97.25,96.66,0.003819,0.014058,0.003833,0.015573,0.020355,0.012359,0.019882,0.011141,-0.006034,1.0


In [5]:
news_train_df.head()

Unnamed: 0,time,sourceTimestamp,firstCreated,sourceId,headline,urgency,takeSequence,provider,subjects,audiences,bodySize,companyCount,headlineTag,marketCommentary,sentenceCount,wordCount,assetCodes,assetName,firstMentionSentence,relevance,sentimentClass,sentimentNegative,sentimentNeutral,sentimentPositive,sentimentWordCount,noveltyCount12H,noveltyCount24H,noveltyCount3D,noveltyCount5D,noveltyCount7D,volumeCounts12H,volumeCounts24H,volumeCounts3D,volumeCounts5D,volumeCounts7D
8328750,2015-12-08 13:56:53+00:00,2015-12-08 13:56:53+00:00,2015-12-08 13:56:53+00:00,f9c4067a6d20f21b,"CHESAPEAKE ENERGY CORP SHARES EXTEND LOSSES, N...",1,1,RTRS,"{'BLR', 'STX', 'OILG', 'EXPL', 'HOT', 'ENER', ...","{'E', 'U'}",0,1,,False,2,21,{'CHK.N'},Chesapeake Energy Corp,1,1.0,-1,0.819143,0.125228,0.055629,21,2,3,4,7,9,17,23,24,41,63
8328751,2015-12-08 13:57:20+00:00,2015-12-08 13:57:20+00:00,2015-12-08 13:57:20+00:00,749e57557c589fca,REG - Societe Generale SA Anheuser-Busch InBev...,3,1,LSE,"{'NEWR', 'FOBE', 'WEU', 'BEVS', 'NCYC', 'LEN',...",{'LSEN'},21427,1,,False,47,1528,"{'ABI.BR', 'BUD.N'}",Anheuser Busch Inbev SA,1,1.0,0,0.014524,0.801992,0.183484,62,0,0,0,1,3,22,30,51,90,167
8328752,2015-12-08 13:57:20+00:00,2015-12-08 13:57:19+00:00,2015-12-08 13:57:19+00:00,e61c180b2be5eb45,REG - Societe Generale SA Anheuser-Busch InBev...,3,1,LSE,"{'NEWR', 'FOBE', 'WEU', 'BEVS', 'NCYC', 'LEN',...",{'LSEN'},59958,1,,False,53,4563,"{'ABI.BR', 'BUD.N'}",Anheuser Busch Inbev SA,1,1.0,1,0.035002,0.161918,0.80308,176,19,25,46,74,133,21,29,50,89,166
8328753,2015-12-08 13:57:37+00:00,2015-12-08 13:57:37+00:00,2015-12-08 13:57:37+00:00,35e01becdbd06d17,IIROC Trade Resumption - BIP.PR.B <BIP.N>,3,1,CNW,"{'NEWR', 'LEN', 'ELEU', 'FINS', 'US', 'DFIN', ...","{'CNR', 'CNW'}",753,1,,False,8,137,{'BIP.N'},Brookfield Infrastructure Partners LP,4,0.57735,-1,0.811987,0.129426,0.058586,87,1,1,1,1,1,1,1,1,1,10
8328754,2015-12-08 13:57:41+00:00,2015-12-08 13:57:41+00:00,2015-12-08 13:57:41+00:00,36a59986b3a81936,TRANSOCEAN'S U.S.-LISTED SHARES DOWN 2.41 PCT ...,1,1,RTRS,"{'BLR', 'STX', 'WEU', 'HOT', 'CH', 'DRIL', 'EN...","{'E', 'U'}",0,1,,False,1,14,"{'RIG.N', 'RIGN.VX', 'RIGN.BN'}",Transocean Ltd,1,1.0,-1,0.819123,0.125241,0.055637,14,0,0,0,1,1,1,3,14,17,17


### Cleaning Data
We will be removing the news rows with the following qualities:
* Empty headlines
* Repeated headlines (same headline and assetCodes)
* Urgency of 2
* Null assetName (not yet implemented in `clean_data`)

In [6]:
def clean_data(market_df, news_df, train=True):
    
    # get rid of invalid rows
    news_df = news_df[news_df.headline != '']
    news_df = news_df[news_df.urgency != 2]
    
    # remove duplicate headlines with the same assetCodes
    news_df = news_df.drop_duplicates(subset=['assetCodes', 'headline'],keep='first')
    
#     if train:
#         market_df.drop('assetName', axis=1, inplace=True)

    return market_df, news_df

In [7]:
market_train_df, news_train_df = clean_data(market_train_df, news_train_df, train=True)

In [8]:
news_train_df.shape

(869377, 35)

In [9]:
market_train_df.shape

(400000, 16)

### Expanding News data
We are going to be splitting the news data by assetCode.

In [10]:
def expanding_news(news_df):
    
    # split to list
    news_output = news_df.copy()
    news_output['assetCodes'] = news_output['assetCodes'].str.findall(r"'([\w\./]+)'")
    
    # separate to assetcodes
    assetCodes_expanded = list(chain(*news_output['assetCodes']))
    assetCodes_index = news_df.index.repeat(news_output['assetCodes'].apply(len))
    assert len(assetCodes_index) == len(assetCodes_expanded)
    df_assetCodes = pd.DataFrame({'level_0': assetCodes_index, 'assetCode': assetCodes_expanded})
    
    # merge to dataframe
    merging_cols = [f for f in news_output if f not in ['assetCodes', 'sourceId']]
    news_df_expanded = pd.merge(df_assetCodes, news_output[merging_cols], left_on='level_0', 
                                right_index=True, suffixes=(['','_old']))
    
    return news_df_expanded
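
As a rough cross-check of the expansion above, here is a minimal sketch on a toy frame using `DataFrame.explode` (available in pandas 0.25 and later). The toy data and the variable names `news`, `codes` and `expanded` are assumptions made for illustration; this is not the kernel's own implementation, only an alternative way to get one row per asset code.

```python
import pandas as pd

# Toy news frame; assetCodes is stored as a string that looks like a Python set.
news = pd.DataFrame({
    'headline': ['A rises', 'B falls'],
    'assetCodes': ["{'ABI.BR', 'BUD.N'}", "{'CHK.N'}"],
})

# Pull the quoted codes out of each string, giving one list per row.
codes = news['assetCodes'].str.findall(r"'([\w\./]+)'")

# Emit one row per code and keep the original row index as level_0.
expanded = (news.assign(assetCode=codes)
                .explode('assetCode')
                .drop(columns='assetCodes')
                .reset_index()
                .rename(columns={'index': 'level_0'}))
```

Each original row is repeated once per asset code, and `level_0` still points back at the source row, which is the same shape `expanding_news` aims for.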

In [11]:
expand_train_df = expanding_news(news_train_df)

In [12]:
expand_train_df.tail()

Unnamed: 0,level_0,assetCode,time,sourceTimestamp,firstCreated,headline,urgency,takeSequence,provider,subjects,audiences,bodySize,companyCount,headlineTag,marketCommentary,sentenceCount,wordCount,assetName,firstMentionSentence,relevance,sentimentClass,sentimentNegative,sentimentNeutral,sentimentPositive,sentimentWordCount,noveltyCount12H,noveltyCount24H,noveltyCount3D,noveltyCount5D,noveltyCount7D,volumeCounts12H,volumeCounts24H,volumeCounts3D,volumeCounts5D,volumeCounts7D
1608941,9328747,SGEN.O,2016-12-30 21:57:00+00:00,2016-12-30 21:57:00+00:00,2016-12-30 21:57:00+00:00,EQUITY ALERT: Rosen Law Firm Announces Investi...,3,1,BSW,"{'CMSS', 'CLJ', 'GEN', 'NEWR', 'HECA', 'PHMR',...","{'BSW', 'CNR'}",3734,1,,False,16,664,Seattle Genetics Inc,1,1.0,-1,0.6519,0.227707,0.120393,360,0,0,3,4,4,1,2,18,41,41
1608942,9328747,SGEN.OQ,2016-12-30 21:57:00+00:00,2016-12-30 21:57:00+00:00,2016-12-30 21:57:00+00:00,EQUITY ALERT: Rosen Law Firm Announces Investi...,3,1,BSW,"{'CMSS', 'CLJ', 'GEN', 'NEWR', 'HECA', 'PHMR',...","{'BSW', 'CNR'}",3734,1,,False,16,664,Seattle Genetics Inc,1,1.0,-1,0.6519,0.227707,0.120393,360,0,0,3,4,4,1,2,18,41,41
1608943,9328748,IPDN.O,2016-12-30 21:58:53+00:00,2016-12-30 21:58:53+00:00,2016-12-30 21:58:53+00:00,PROFESSIONAL DIVERSITY NETWORK INC - FILES FOR...,1,1,RTRS,"{'BLR', 'SWIT', 'ITSE', 'SISU', 'BACT', 'TMT',...","{'E', 'U'}",0,1,,False,1,23,Professional Diversity Network Inc,1,1.0,-1,0.816252,0.126928,0.056819,23,0,0,0,0,0,0,0,3,3,3
1608944,9328748,IPDN.OQ,2016-12-30 21:58:53+00:00,2016-12-30 21:58:53+00:00,2016-12-30 21:58:53+00:00,PROFESSIONAL DIVERSITY NETWORK INC - FILES FOR...,1,1,RTRS,"{'BLR', 'SWIT', 'ITSE', 'SISU', 'BACT', 'TMT',...","{'E', 'U'}",0,1,,False,1,23,Professional Diversity Network Inc,1,1.0,-1,0.816252,0.126928,0.056819,23,0,0,0,0,0,0,0,3,3,3
1608945,9328749,JFC.N,2016-12-30 22:00:00+00:00,2016-12-30 22:00:00+00:00,2016-12-30 22:00:00+00:00,"JPMorgan China Region Fund, Inc. Board to Subm...",3,1,BSW,"{'CMSS', 'NEWR', 'INVT', 'BACT', 'BSUP', 'INDS...","{'BSW', 'CNR'}",2969,1,,False,15,492,JPMorgan China Region Fund Inc,1,1.0,1,0.130152,0.388845,0.481002,383,0,0,0,0,0,0,0,0,0,0


In [13]:
market_train_df.tail()

Unnamed: 0,time,assetCode,assetName,volume,close,open,returnsClosePrevRaw1,returnsOpenPrevRaw1,returnsClosePrevMktres1,returnsOpenPrevMktres1,returnsClosePrevRaw10,returnsOpenPrevRaw10,returnsClosePrevMktres10,returnsOpenPrevMktres10,returnsOpenNextMktres10,universe
4072951,2016-12-30 22:00:00+00:00,ZIOP.O,ZIOPHARM Oncology Inc,1608829.0,5.35,5.37,-0.003724,0.0,0.000536,-0.001868,-0.165367,-0.138042,-0.139597,-0.135913,0.051189,0.0
4072952,2016-12-30 22:00:00+00:00,ZLTQ.O,ZELTIQ Aesthetics Inc,347830.0,43.52,43.62,-0.000689,0.0,-0.000515,0.000493,0.002996,0.002989,0.008213,0.00321,-0.048555,0.0
4072953,2016-12-30 22:00:00+00:00,ZNGA.O,Zynga Inc,7396601.0,2.57,2.58,-0.011538,0.0,-0.006004,-0.001034,-0.091873,-0.078571,-0.077252,-0.077188,0.011703,0.0
4072954,2016-12-30 22:00:00+00:00,ZTO.N,Unknown,3146519.0,12.07,12.5,-0.029743,0.007252,-0.02846,0.006719,-0.065066,-0.042146,-0.078104,-0.043813,0.083367,1.0
4072955,2016-12-30 22:00:00+00:00,ZTS.N,Zoetis Inc,1701204.0,53.53,53.64,-0.001678,0.003091,0.00506,0.002885,0.023127,0.028177,0.026566,0.028719,-0.01622,1.0


### Merge News Articles by time & assetCode
We will be merging rows with the same time and assetCode.
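
To make the goal of this subsection concrete, here is a small sketch of the intended end result on toy data: news rows are aggregated per (time, assetCode) and then left-joined onto the market rows. The toy frames and the output column names `sentimentNegative_mean` and `news_count` are assumptions for illustration. The sketch also assumes the news timestamps have already been aligned to the market timestamps, which the raw data does not guarantee.

```python
import pandas as pd

market = pd.DataFrame({
    'time': pd.to_datetime(['2016-12-29', '2016-12-30', '2016-12-30']),
    'assetCode': ['ZTS.N', 'ZTS.N', 'ZNGA.O'],
    'close': [53.0, 53.53, 2.57],
})
news = pd.DataFrame({
    'time': pd.to_datetime(['2016-12-30', '2016-12-30', '2016-12-30']),
    'assetCode': ['ZTS.N', 'ZTS.N', 'ZNGA.O'],
    'sentimentNegative': [0.2, 0.6, 0.1],
})

# Collapse the news to one row per (time, assetCode).
news_agg = (news.groupby(['time', 'assetCode'], as_index=False)
                .agg(sentimentNegative_mean=('sentimentNegative', 'mean'),
                     news_count=('sentimentNegative', 'count')))

# Left join keeps every market row; rows without news get NaN.
merged = market.merge(news_agg, on=['time', 'assetCode'], how='left')
```

A left join is used so that market rows without any matching article survive the merge; their news columns are simply null and can be filled later.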

In [33]:
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
import re

ps = PorterStemmer()
sw = stopwords.words('english')

# this is slow, so apply it only when getting coefficients to filter out words.
def clean_headlines(headline):
    
    # replace every non-letter character with a space and convert to lowercase
    headline = re.sub('[^a-zA-Z]', ' ', headline)
    headline = headline.lower()
    
    # split on whitespace; split() without arguments drops empty strings
    headline_words_rough = headline.split()
    
    # drop stopwords and stem the remaining words
    headline_words = []
    for word in headline_words_rough:
        if word not in sw:
            # use stemming to simplify
            headline_words.append(ps.stem(word))
    
    # join sentence back again
    return ' '.join(headline_words)

In [135]:
def categorical_groupby(expand_df):
    
    # get categorical groupbys (use the argument rather than the global expand_train_df)
    main_cols = ['time', 'assetCode']
    expand_headline_groupby = expand_df[main_cols + ['headline']].groupby(main_cols)
    expand_cat_groupby = expand_df[main_cols + ['subjects', 'audiences']].groupby(main_cols)
    
    # flatten the per-row lists of subjects and audiences into one list per group
    def cat_to_list(x):
        result = []
        for item in x:
            result += item
        return result
    
    # aggregate to one row per (time, assetCode); transform would keep one row per article
    expand_cat_df = expand_cat_groupby.agg(cat_to_list)
    expand_headline_df = expand_headline_groupby.agg(lambda x: set(x))
    
    # merge to categorical dataframes
    return pd.concat([expand_cat_df, expand_headline_df], axis=1)
    

In [None]:
# WIP
def numerical_groupby(expand_df):
    
    # numerical news columns to aggregate
    news_agg_cols = [f for f in expand_df.columns if 'novelty' in f or
                    'volume' in f or
                    'sentiment' in f or
                    'bodySize' in f or
                    'Count' in f or
                    'marketCommentary' in f or
                    'relevance' in f]
    news_agg_dict = {}
    for col in news_agg_cols:
        news_agg_dict[col] = ['mean', 'sum', 'max', 'min']
    news_agg_dict['urgency'] = ['min', 'count']
    news_agg_dict['takeSequence'] = ['max']
    
    # group every column named in the aggregation dict by time and assetCode
    group_cols = ['time', 'assetCode']
    expand_agg_groupby = expand_df[group_cols + list(news_agg_dict)].groupby(group_cols)
    
    return expand_agg_groupby.agg(news_agg_dict)

In [16]:
def get_matches(market_df, expand_df):
    
    # get temporary columns as data
    temp_market_df = market_df[['time', 'assetCode']].copy()
    temp_expand_df = expand_df[['time', 'assetCode']].copy()
    
    # get indices
    temp_expand_df['expand_index'] = temp_expand_df.index
    
    # join the two
    temp_expand_df.set_index(['time', 'assetCode'], inplace=True)
    temp_expand_market_df = temp_market_df.join(temp_expand_df, on=['time', 'assetCode'])
    
    # remove nulls (expand_index becomes float after the join, so cast it back to int)
    temp_expand_market_df = temp_expand_market_df[temp_expand_market_df.expand_index.notnull()]
    expand_indicies = temp_expand_market_df['expand_index'].astype(int).tolist()
    
    # do final cleanup
    del temp_market_df
    del temp_expand_df
    
    # fetch matches
    return expand_df.loc[expand_indicies]

def merge_by_code(market_df, expand_df):
    
    # get expansion of rows
    expand_df = get_matches(market_df, expand_df)
    
    # prepare categorical features for merging (on the matched rows, not the global frame)
    expand_df['subjects'] = expand_df['subjects'].str.findall(r"'([\w\./]+)'")
    expand_df['audiences'] = expand_df['audiences'].str.findall(r"'([\w\./]+)'")
    expand_df['headline'] = expand_df['headline'].apply(clean_headlines)
    
    # groupby datasets
    expand_cat_df = categorical_groupby(expand_df)
    
    # perform aggregations
    

<a id='section2'></a>
## Step 2. Feature Engineering

This step covers features ranging from quant features to text processing features.

### News Features
* Last News Article - The number of days since a news article last targeted the given assetCode.
* Number of Articles Today/Week/Month - The number of articles written about the assetCode during the given timeframe.
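
The two news features listed above could be sketched roughly as follows. This is only an illustration on toy data: the column names `days_since_last_article`, `articles_today` and `articles_month` are hypothetical, and the counts use calendar buckets (same day, same month) rather than trailing windows, which is one of several reasonable readings of "Today/Week/Month".

```python
import pandas as pd

news = pd.DataFrame({
    'assetCode': ['A', 'A', 'A', 'B'],
    'time': pd.to_datetime(['2016-01-01 09:00', '2016-01-04 10:00',
                            '2016-01-04 15:00', '2016-01-03 12:00']),
}).sort_values(['assetCode', 'time'])

# Days since the previous article about the same asset (NaN for the first one).
news['days_since_last_article'] = news.groupby('assetCode')['time'].diff().dt.days

# Calendar buckets used for counting.
day = news['time'].dt.floor('D').rename('day')
month = news['time'].dt.to_period('M').rename('month')

# Number of articles about the same asset in the same day / month.
news['articles_today'] = news.groupby(['assetCode', day])['assetCode'].transform('count')
news['articles_month'] = news.groupby(['assetCode', month])['assetCode'].transform('count')
```

A weekly count would follow the same pattern with `dt.to_period('W')`.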

### Entire Market and Individual Asset Quant Features
We will compute Quant Features both over the entire market dataframe and for each individual asset, grouped by assetCode.
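
As a hedged illustration of the difference between the two kinds of feature, the sketch below computes one per-asset feature and one whole-market feature on toy data. The specific features (a 2-day moving average and a same-timestamp market mean) and the column names `close_ma2`, `market_close_mean` and `close_vs_market` are assumptions chosen for the example, not the features this kernel settles on.

```python
import pandas as pd

market = pd.DataFrame({
    'time': pd.to_datetime(['2016-01-01', '2016-01-01', '2016-01-02', '2016-01-02']),
    'assetCode': ['A', 'B', 'A', 'B'],
    'close': [10.0, 20.0, 12.0, 18.0],
})

# Individual-asset feature: 2-day moving average of close, computed per assetCode.
market['close_ma2'] = (market.groupby('assetCode')['close']
                             .transform(lambda s: s.rolling(2, min_periods=1).mean()))

# Whole-market feature: mean close across all assets at the same timestamp.
market['market_close_mean'] = market.groupby('time')['close'].transform('mean')

# Relative feature: how an asset's close compares with the market mean.
market['close_vs_market'] = market['close'] / market['market_close_mean']
```

Using `transform` keeps the result aligned with the original rows, so the new columns can be assigned straight back onto the market dataframe.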

### Text Processing with CountVectorizer and TfidfVectorizer
We are going to be using CountVectorizer and TfidfVectorizer on the headlines to determine their influence on the target column.
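
For readers unfamiliar with the two vectorizers, here is a minimal sketch of what they produce. The three headlines are made up for illustration, and both vectorizers are used with their default settings. Relating the resulting columns to the target (for example through a linear model's coefficients) is a separate step that is not shown here.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up headlines, already lowercased.
headlines = [
    'shares extend losses',
    'shares rise after earnings',
    'earnings beat estimates',
]

# Raw term counts: one row per headline, one column per vocabulary word.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(headlines)

# TF-IDF weights: same shape, but words shared by many headlines are down-weighted.
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(headlines)

# vocabulary_ maps each word to its column index.
shares_col = count_vec.vocabulary_['shares']
```

Both results are sparse matrices, which keeps memory use manageable when the vocabulary grows to thousands of words.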

### Clustering
We will be clustering the open and close features, as well as the volume count and novelty count features, using KMeans.

In [None]:
from sklearn.cluster import KMeans

def clustering(X):

    # fit KMeans on the given feature columns and return each row's cluster label
    def cluster_modelling(features):
        df_set = X[features]
        cluster_model = KMeans(n_clusters = 8)
        cluster_model.fit(df_set)
        return cluster_model.predict(df_set)
    
    # get columns:
    vol_cols = [f for f in X.columns if f != 'volume' and 'volume' in f]
    novelty_cols = [f for f in X.columns if 'novelty' in f]
    
    # fill nulls
    cluster_cols = novelty_cols + vol_cols + ['open', 'close']
    X[cluster_cols] = X[cluster_cols].fillna(0)
    
    X['cluster_open_close'] = cluster_modelling(['open', 'close'])
    X['cluster_volume'] = cluster_modelling(vol_cols)
    X['cluster_novelty'] = cluster_modelling(novelty_cols)
    
    return X

<a id='section3'></a>
## Step 3. Modelling using Keras' Neural Network

### Preparing Datasets for Modelling
We will convert all the numerical and categorical datasets into rows that the neural network can process.

In [None]:
from sklearn.preprocessing import StandardScaler

# scale numerical columns
scaler = StandardScaler()

X_train = scaler.fit_transform(market_train_df[test_cols].fillna(0))

# binary target, as in the baseline kernel: True when the next 10-day return is non-negative
y_train = market_train_df['returnsOpenNextMktres10'] >= 0

In [None]:
def get_cols(X_train):
    
    # get numerical and categorical columns, leaving out the universe flag and the target
    excluded = ['universe', 'returnsOpenNextMktres10']
    num_cols = [f for f in X_train.columns
                if (X_train[f].dtype == 'int' or X_train[f].dtype == 'float') and f not in excluded]
    cat_cols = [f for f in X_train.columns if f not in num_cols and f not in excluded]
    
    return num_cols, cat_cols
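
Categorical columns such as assetCode need to be turned into integer ids before they can feed an embedding layer. The sketch below shows one way to do that with `pd.factorize`. The names `encoders` and the `_id` suffix are hypothetical, the choice to reserve id 0 for unseen values is an assumption, and `embed_sizes` is computed here only to illustrate what such a list might contain.

```python
import pandas as pd

df = pd.DataFrame({'assetCode': ['ZTS.N', 'ZNGA.O', 'ZTS.N', 'ZTO.N']})
cat_cols = ['assetCode']

encoders = {}      # per-column mapping from raw value to integer id
embed_sizes = []   # per-column number of distinct ids, i.e. the embedding input size

for col in cat_cols:
    codes, uniques = pd.factorize(df[col])
    # Shift ids by one so that 0 stays free for values unseen during training.
    df[col + '_id'] = codes + 1
    encoders[col] = {value: i + 1 for i, value in enumerate(uniques)}
    embed_sizes.append(len(uniques) + 1)

# At prediction time, unseen asset codes fall back to id 0.
new_ids = pd.Series(['ZTS.N', 'NEW.N']).map(encoders['assetCode']).fillna(0).astype(int)
```

Keeping the mapping in `encoders` matters because the test data must be encoded with the ids learned on the training data, not refactorized from scratch.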

### Fixed Training Split
We need a fixed train/validation split that holds out the last rows of the training dataset. Randomly choosing rows would fill the validation dataset with rows whose timestamps are interleaved with those of the training rows, so the model would be validated on periods it has effectively already seen, which gives odd and overly optimistic results.

In [None]:
def fixed_train_test_split(X, y, train_size):
    
    # convert the training fraction into a row count
    train_size = int(train_size * len(X))
    
    # split data: the first rows are for training, the last rows for validation
    X_train, y_train = X[:train_size], y[:train_size]
    X_valid, y_valid = X[train_size:], y[train_size:]
    
    return X_train, y_train, X_valid, y_valid

In [None]:
# train_size is a fraction of the rows; 0.8 is an assumed value
X_train, y_train, X_valid, y_valid = fixed_train_test_split(X_train, y_train, 0.8)
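
A variation worth considering is to cut on a timestamp boundary rather than a row count, so that rows sharing one timestamp never end up on both sides of the split. The helper below, `time_based_split`, is hypothetical and not part of this kernel; it is a sketch that assumes the dataframe has a `time` column.

```python
import pandas as pd

def time_based_split(df, valid_fraction=0.2, time_col='time'):
    """Hypothetical helper: hold out the most recent timestamps for validation."""
    times = df[time_col].sort_values().unique()
    # Number of distinct timestamps to hold out (at least one).
    n_valid = max(1, int(len(times) * valid_fraction))
    cutoff = times[-n_valid]
    train = df[df[time_col] < cutoff]
    valid = df[df[time_col] >= cutoff]
    return train, valid

# Toy data: five days with two rows each.
toy = pd.DataFrame({
    'time': pd.to_datetime(['2016-01-01'] * 2 + ['2016-01-02'] * 2 + ['2016-01-03'] * 2
                           + ['2016-01-04'] * 2 + ['2016-01-05'] * 2),
    'x': range(10),
})
train, valid = time_based_split(toy, valid_fraction=0.2)
```

Every validation timestamp is later than every training timestamp, which mirrors how the model will be used on future data.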

In [None]:
# original from https://www.kaggle.com/christofhenkel/market-data-nn-baseline
from keras.models import Model
from keras.layers import Input, Dense, Embedding, Concatenate, Flatten, BatchNormalization
from keras.losses import binary_crossentropy

# categorical data
categorical_inputs = []
for cat in cat_cols:
    categorical_inputs.append(Input(shape=[1], name=cat))

categorical_embeddings = []
for i, cat in enumerate(cat_cols):
    categorical_embeddings.append(Embedding(embed_sizes[i], 10)(categorical_inputs[i]))
    
categorical_logits = Flatten()(categorical_embeddings[0])
categorical_logits = Dense(32,activation='relu')(categorical_logits)

# numerical data
numerical_inputs = Input(shape=(11,), name='num')
numerical_logits = numerical_inputs
numerical_logits = BatchNormalization()(numerical_logits)

numerical_logits = Dense(128,activation='relu')(numerical_logits)
numerical_logits = Dense(64,activation='relu')(numerical_logits)

# combined
logits = Concatenate()([numerical_logits,categorical_logits])
logits = Dense(128,activation='relu')(logits)
logits = Dense(64,activation='relu')(logits)
out = Dense(1, activation='sigmoid')(logits)

model = Model(inputs = categorical_inputs + [numerical_inputs], outputs=out)
model.compile(optimizer='adam',loss=binary_crossentropy)

In [None]:
num_cols, cat_cols = get_cols(market_train_df)

In [None]:
from keras.callbacks import EarlyStopping, ModelCheckpoint, LearningRateScheduler

# decay the learning rate exponentially each epoch
learning_rate = 1e-4
dynamic_lr = LearningRateScheduler(lambda epoch: learning_rate * 0.99 ** epoch)

# set early stopping
early_stop = EarlyStopping(patience=3)

model.fit(X_train,y_train.astype(int),
          validation_data=(X_valid,y_valid.astype(int)),
          epochs=200,
          verbose=0,
         callbacks=[dynamic_lr, early_stop]) 
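
To get a feel for the schedule passed to `LearningRateScheduler` above, the same lambda can be evaluated in plain Python. The function name `schedule` and the sampled epochs are chosen here for illustration only.

```python
learning_rate = 1e-4

def schedule(epoch):
    # Same rule as the lambda above: multiply the rate by 0.99 once per epoch.
    return learning_rate * 0.99 ** epoch

# Learning rate at a few sample epochs.
rates = [schedule(epoch) for epoch in (0, 1, 10, 100, 199)]

# Fraction of the initial rate that is left after 100 epochs.
remaining_after_100 = rates[3] / learning_rate
```

The rate only ever shrinks, dropping to roughly 37% of its starting value after 100 epochs, so this is an exponential decay rather than a cyclical schedule. With `EarlyStopping(patience=3)` in place, training will usually stop long before the decay becomes large.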

In [None]:
model.predict(X_valid)

<a id='section4'></a>
## Step 4. Applying the Model


**Sources:**
* [Market Data NN Baseline by Christofhenkel](https://www.kaggle.com/christofhenkel/market-data-nn-baseline)
* [a simple model using the market and news data by Bguberfain](https://www.kaggle.com/bguberfain/a-simple-model-using-the-market-and-news-data)
* [Amateur Hour - Using Headlines to Predict Stocks by Magichanics](https://www.kaggle.com/magichanics/amateur-hour-using-headlines-to-predict-stocks)
* [Simple Quant Features by Youhanlee](https://www.kaggle.com/youhanlee/simple-quant-features-using-python)