# "Stock" Grade Neural Network
### Starter Kernel by ``Magichanics`` 
*([GitHub](https://github.com/Magichanics) - [Kaggle](https://www.kaggle.com/magichanics))*

Building on features from public kernels and on the idea of using neural networks for modelling, I've decided to do some experimenting of my own in hopes of producing the best results. Feel free to post suggestions or criticisms!

## Table of Contents

* [Step 1. Merging Datasets](#section1)
* [Step 2. Feature Engineering](#section2)
* [Step 3. Modelling using Keras' Neural Network](#section3)
* [Step 4. Applying the Model](#section4)

In [None]:
import numpy as np
import pandas as pd
import os
import gc
from itertools import chain

import matplotlib.pyplot as plt

In [None]:
# import environment for data
from kaggle.competitions import twosigmanews
env = twosigmanews.make_env()

In [None]:
(market_train_df, news_train_df) = env.get_training_data()
market_train_df = market_train_df.tail(400_000)
news_train_df = news_train_df.tail(1_000_000)

<a id='section1'></a>
## Step 1. Merging Datasets

While most notebooks focus only on the market dataset, I'm going to attempt to bring the news and market datasets together.

In [None]:
market_train_df.head()

In [None]:
news_train_df.head()

### Time difference between time and firstCreated
Maybe a news item isn't that urgent if there is a gap between its `firstCreated` and `time` values, so we record that gap in seconds.

In [None]:
# WIP, will work on it later on
def time_diff(news_df):
    # seconds between the two timestamps, computed on the frame that was passed in
    news_df['num_publishing_diff_secs'] = news_df['time'] - news_df['firstCreated']
    news_df['num_publishing_diff_secs'] = news_df['num_publishing_diff_secs'] / np.timedelta64(1, 's')
    return news_df

In [None]:
news_train_df = time_diff(news_train_df)
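
As a minimal sketch of the conversion on toy data (the two rows and their timestamps are made up), subtracting two datetime columns gives timedeltas, and dividing by a one-second timedelta turns them into plain seconds:

```python
import numpy as np
import pandas as pd

# toy news rows (hypothetical values, not taken from the competition data)
toy_news = pd.DataFrame({
    'firstCreated': pd.to_datetime(['2016-01-04 09:00:00', '2016-01-04 10:30:00'], utc=True),
    'time':         pd.to_datetime(['2016-01-04 09:00:00', '2016-01-04 10:31:30'], utc=True),
})

# datetime - datetime gives Timedelta values;
# dividing by a one-second timedelta converts them to float seconds
gap = toy_news['time'] - toy_news['firstCreated']
toy_news['num_publishing_diff_secs'] = gap / np.timedelta64(1, 's')

diff_secs = toy_news['num_publishing_diff_secs'].tolist()
print(diff_secs)  # [0.0, 90.0]
```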

### Cleaning Data
We will remove rows that have any of the following:
* Empty headlines
* Duplicate headlines (same `assetCodes` and headline)
* Urgency of 2
* Null assetName

In [None]:
def clean_data(market_df, news_df, train=True):
    
    # get rid of invalid rows
    news_df = news_df[news_df.headline != '']
    news_df = news_df[news_df.urgency != 2]
    
    # remove duplicate headlines with the same assetCodes
    news_df = news_df.drop_duplicates(subset=['assetCodes', 'headline'],keep='first')
    
#     if train:
#         market_df.drop('assetName', axis=1, inplace=True)

    return market_df, news_df

In [None]:
market_train_df, news_train_df = clean_data(market_train_df, news_train_df, train=True)

In [None]:
news_train_df.shape

In [None]:
market_train_df.shape

### Expanding News data
Each news row lists every asset it mentions in `assetCodes`, so we are going to expand the news data into one row per assetCode.

In [None]:
def expanding_news(news_df):
    
    # split to list
    news_output = news_df.copy()
    news_output['assetCodes'] = news_output['assetCodes'].str.findall(r"'([\w\./]+)'")
    
    # separate to assetcodes
    assetCodes_expanded = list(chain(*news_output['assetCodes']))
    assetCodes_index = news_df.index.repeat(news_output['assetCodes'].apply(len))
    assert len(assetCodes_index) == len(assetCodes_expanded)
    df_assetCodes = pd.DataFrame({'level_0': assetCodes_index, 'assetCode': assetCodes_expanded})
    
    # merge to dataframe
    merging_cols = [f for f in news_output if f not in ['assetCodes', 'sourceId']]
    news_df_expanded = pd.merge(df_assetCodes, news_output[merging_cols], left_on='level_0', 
                                right_index=True, suffixes=('', '_old'))
    
    return news_df_expanded

In [None]:
expand_train_df = expanding_news(news_train_df)
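
To make the expansion concrete, here is a small sketch in plain Python that uses the same regular expression as `expanding_news`. The two toy articles are made up; the raw `assetCodes` values are strings that look like Python sets.

```python
import re

# toy articles (hypothetical); assetCodes is a string, not a real set
toy_news = [
    {'headline': 'Apple unveils new phone', 'assetCodes': "{'AAPL.O', 'AAPL.OQ'}"},
    {'headline': 'Oil prices slip',         'assetCodes': "{'XOM.N'}"},
]

# same pattern as in expanding_news: capture every quoted code
code_pattern = re.compile(r"'([\w\./]+)'")

expanded_rows = []
for row in toy_news:
    for code in code_pattern.findall(row['assetCodes']):
        # one output row per (article, assetCode) pair
        expanded_rows.append({'assetCode': code, 'headline': row['headline']})

for row in expanded_rows:
    print(row)
```

The first article mentions two codes, so it becomes two rows; the second stays a single row.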

In [None]:
expand_train_df.tail()

In [None]:
market_train_df.tail()

### Cleaning Headlines
The following function reduces each headline to the words needed for text processing: it strips non-letter characters, lowercases the text, drops English stopwords, and stems what remains.

In [None]:
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
import re

ps = PorterStemmer()
sw = set(stopwords.words('english'))  # a set makes the membership test below fast

# this takes up a lot of time, so apply it when getting coefficients to filter out words.
def clean_headlines(headline):
    
    # replace non-letter characters with spaces and convert to lowercase
    headline = re.sub('[^a-zA-Z]', ' ', headline)
    headline = headline.lower()
    
    # split on whitespace; split() also discards the empty strings left by repeated spaces
    headline_words_rough = headline.split()
    
    # drop stopwords and stem the remaining words
    headline_words = []
    for word in headline_words_rough:
        if word not in sw:
            # use stemming to simplify
            headline_words.append(ps.stem(word))
    
    # join sentence back again
    return ' '.join(headline_words)
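
For a quick feel of what the cleaning does, here is a simplified sketch that needs no NLTK download. It is not the function above: the stopword set is a tiny hand-picked stand-in and there is no stemming, so real output will differ.

```python
import re

# tiny hand-picked stopword set, for illustration only;
# the notebook uses NLTK's English list and also stems each word
TOY_STOPWORDS = {'the', 'a', 'of', 'to', 'in', 'on', 'for', 'and'}

def clean_headline_sketch(headline):
    # keep letters only, lowercase, then drop stopwords
    letters_only = re.sub('[^a-zA-Z]', ' ', headline).lower()
    return ' '.join(word for word in letters_only.split() if word not in TOY_STOPWORDS)

cleaned = clean_headline_sketch('Shares of ACME jump 12% on the earnings beat')
print(cleaned)  # shares acme jump earnings beat
```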

### Categorical Groupby
For each (time, assetCode) group, this pools the categorical columns (subjects, audiences and headline) into lists.

In [None]:
def categorical_groupby(expand_df):

    main_cols = ['time', 'assetCode']
    
    # concatenate the per-article lists of a group into one list
    def flatten(lists):
        return list(chain.from_iterable(lists))
    
    # one row per (time, assetCode): pooled subjects and audiences, plus every headline
    grouped_df = expand_df.groupby(main_cols).agg({'subjects': flatten,
                                                   'audiences': flatten,
                                                   'headline': list})
    
    # broadcast the pooled values back onto the news rows, so the output keeps
    # the same index as expand_df (what the transform calls were meant to return)
    return expand_df[main_cols].join(grouped_df, on=main_cols)[['subjects', 'audiences', 'headline']]
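
The pooling idea can be sketched on a toy frame with `groupby` and `agg`. The rows below are made up, and this is only one way to do it: list-valued columns are chained together, while headlines are simply collected.

```python
from itertools import chain
import pandas as pd

# toy expanded news rows (hypothetical values)
toy_df = pd.DataFrame({
    'time':      ['2016-01-04', '2016-01-04', '2016-01-05'],
    'assetCode': ['AAPL.O', 'AAPL.O', 'XOM.N'],
    'subjects':  [['TECH', 'US'], ['EARN'], ['ENER']],
    'headline':  ['appl unveil phone', 'appl beat estim', 'oil price slip'],
})

pooled = toy_df.groupby(['time', 'assetCode']).agg({
    # chain the per-article subject lists into one list per group
    'subjects': lambda lists: list(chain.from_iterable(lists)),
    # collect the group's headlines into a list
    'headline': list,
})
print(pooled)
```

The two AAPL.O rows from the same time collapse into one row whose `subjects` list holds all three tags.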
    

### Numerical Groupby
For each (time, assetCode) group, this aggregates the numerical news columns (mean, sum, max and min for most of them).

In [None]:
def numerical_groupby(expand_df):
    
    # get aggregated columns + aggregation map (read from the frame that was passed in)
    agg_keywords = ['novelty', 'volume', 'sentiment', 'bodySize', 'Count',
                    'marketCommentary', 'relevance', 'num_']
    news_agg_cols = [f for f in expand_df.columns if any(k in f for k in agg_keywords)]
    news_agg_dict = {}
    for col in news_agg_cols:
        news_agg_dict[col] = ['mean', 'sum', 'max', 'min']
    news_agg_dict['urgency'] = ['min', 'count']
    news_agg_dict['takeSequence'] = ['max']
    
    # aggregate dataframe
    expand_agg_groupby = expand_df[['time', 'assetCode'] + sorted(news_agg_dict)].groupby(['time', 'assetCode'])
    expand_agg_df = expand_agg_groupby.agg(news_agg_dict).astype(np.float32)
    
    # flatten the two-level column names, e.g. ('urgency', 'min') -> 'urgency_min'
    expand_agg_df.columns = ['_'.join(col).strip() for col in expand_agg_df.columns.values]
    
    return expand_agg_df
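
Here is the same aggregation pattern on a toy frame (values made up, and only two columns instead of the full keyword list), showing how the two-level column names produced by `agg` are flattened:

```python
import numpy as np
import pandas as pd

# toy expanded news rows (hypothetical values)
toy_df = pd.DataFrame({
    'time':           ['2016-01-04', '2016-01-04', '2016-01-05'],
    'assetCode':      ['AAPL.O', 'AAPL.O', 'XOM.N'],
    'sentimentClass': [1, -1, 1],
    'urgency':        [3, 1, 3],
})

agg_map = {'sentimentClass': ['mean', 'sum'], 'urgency': ['min', 'count']}
toy_agg = toy_df.groupby(['time', 'assetCode']).agg(agg_map).astype(np.float32)

# agg with a dict of lists yields two-level column names such as
# ('urgency', 'min'); join the levels into flat names like 'urgency_min'
toy_agg.columns = ['_'.join(col) for col in toy_agg.columns.values]
print(toy_agg)
```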

### Merge by time & assetCode to News Article
We will attach the grouped news data to the market rows that share the same time and assetCode.

In [None]:
def get_matches(market_df, expand_df):
    
    # get temporary columns as data
    temp_market_df = market_df[['time', 'assetCode']].copy()
    temp_expand_df = expand_df[['time', 'assetCode']].copy()
    
    # get indices
    temp_expand_df['expand_index'] = temp_expand_df.index.values
    
    # join the two
    temp_expand_df.set_index(['time', 'assetCode'], inplace=True)
    temp_expand_market_df = temp_market_df.join(temp_expand_df, on=['time', 'assetCode'])
    
    # remove nulls; unmatched rows turn the column into floats, so cast back to int
    temp_expand_market_df = temp_expand_market_df[temp_expand_market_df.expand_index.notnull()]
    expand_indices = temp_expand_market_df['expand_index'].astype(int).tolist()
    
    # do final cleanup
    del temp_market_df
    del temp_expand_df
    
    # fetch matches
    return expand_df.loc[expand_indices]

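def merge_by_code(market_df, expand_df):
    
    # get expansion of rows (copy so the column edits below don't touch the caller's frame)
    expand_df = get_matches(market_df, expand_df).copy()
    
    # prepare categorical features for merging
    expand_df['subjects'] = expand_df['subjects'].str.findall(r"'([\w\./]+)'")
    expand_df['audiences'] = expand_df['audiences'].str.findall(r"'([\w\./]+)'")
    expand_df['headline'] = expand_df['headline'].apply(clean_headlines)
    
    # groupby datasets
    expand_cat_df = categorical_groupby(expand_df)
    expand_num_df = numerical_groupby(expand_df)
    
    # categorical_groupby is aligned with the news rows, so key it by
    # (time, assetCode) and keep one row per key before joining
    expand_cat_df = pd.concat([expand_df[['time', 'assetCode']], expand_cat_df], axis=1)
    expand_cat_df = expand_cat_df.drop_duplicates(subset=['time', 'assetCode']).set_index(['time', 'assetCode'])
    
    # merge datasets on the shared keys rather than on the unrelated row indices
    expanded_market_df = market_df.join(expand_num_df, on=['time', 'assetCode'])
    expanded_market_df = expanded_market_df.join(expand_cat_df, on=['time', 'assetCode'])
    
    return expanded_market_df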
    

In [None]:
m = merge_by_code(market_train_df, expand_train_df)
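
As a small illustration of the join on toy data (all values made up), a left join keyed on `time` and `assetCode` keeps every market row and leaves NaN where no news matched:

```python
import pandas as pd

# toy market rows (hypothetical values)
toy_market = pd.DataFrame({
    'time':      ['2016-01-04', '2016-01-04', '2016-01-05'],
    'assetCode': ['AAPL.O', 'XOM.N', 'AAPL.O'],
    'close':     [105.3, 77.5, 102.7],
})

# aggregated news features, indexed by the two join keys
toy_news_agg = pd.DataFrame({
    'time':          ['2016-01-04'],
    'assetCode':     ['AAPL.O'],
    'urgency_count': [2.0],
}).set_index(['time', 'assetCode'])

# left join: every market row is kept, rows without news get NaN
toy_merged = toy_market.join(toy_news_agg, on=['time', 'assetCode'])
print(toy_merged)
```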

In [None]:
n = merge_by_code(market_train_df, expand_train_df)

In [None]:
m.tail()

In [None]:
m.index.values

In [None]:
n.head()

In [None]:
n.index.values

In [None]:
m[m.headline.notnull()].head()

In [None]:
m.shape

<a id='section2'></a>
## Step 2. Feature Engineering

From Quant features to text processing features.

### News Features
* Last News Article - The number of days since a news article last targeted the given assetCode.
* Number of Articles Today/Week/Month - The number of articles written about the assetCode during the given timeframe.
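
These two features are not implemented yet. One possible way to compute them for a single assetCode is sketched below in plain Python; the function names, the dates, and the choice of an inclusive window ending today are all assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right
from datetime import date, timedelta

# hypothetical publication dates of articles about one assetCode, sorted ascending
article_dates = [date(2016, 1, 4), date(2016, 1, 4), date(2016, 1, 20), date(2016, 2, 1)]

def days_since_last_article(dates, today):
    # position just past the last article published on or before `today`
    pos = bisect_right(dates, today)
    if pos == 0:
        return None  # no article on or before `today`
    return (today - dates[pos - 1]).days

def articles_in_window(dates, today, window_days):
    # count articles in the window_days-long window that ends on `today` (inclusive)
    start = today - timedelta(days=window_days - 1)
    return bisect_right(dates, today) - bisect_left(dates, start)

today = date(2016, 2, 3)
print(days_since_last_article(article_dates, today))  # 2
print(articles_in_window(article_dates, today, 1))    # 0 (today)
print(articles_in_window(article_dates, today, 7))    # 1 (week)
print(articles_in_window(article_dates, today, 30))   # 2 (month)
```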

### Entire Market and Individual Asset Quant Features
We are going to compute quant features at two levels: across the entire market dataframe, and for each individual asset, grouped by assetCode.
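
Which quant features will be used is not settled here, so the sketch below picks two stand-ins purely to show the two levels of grouping: a 2-day moving average per asset and a market-wide mean per day. The column names `asset_close_ma2` and `market_close_mean` and all values are hypothetical.

```python
import pandas as pd

# toy market rows (hypothetical values), sorted by time
toy_market = pd.DataFrame({
    'time':      ['d1', 'd1', 'd2', 'd2', 'd3', 'd3'],
    'assetCode': ['A', 'B', 'A', 'B', 'A', 'B'],
    'close':     [10.0, 20.0, 12.0, 18.0, 14.0, 22.0],
})

# per-asset feature: 2-day moving average of close within each assetCode
toy_market['asset_close_ma2'] = (
    toy_market.groupby('assetCode')['close']
              .transform(lambda s: s.rolling(2, min_periods=1).mean())
)

# market-wide feature: mean close over all assets on the same day
toy_market['market_close_mean'] = toy_market.groupby('time')['close'].transform('mean')
print(toy_market)
```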

### Text Processing with CountVectorizer and TfidfVectorizer
We are going to use CountVectorizer and TfidfVectorizer on the headlines to determine their influence on the target column.
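
A minimal sketch of the two vectorizers on three made-up, already cleaned headlines. It only shows what the resulting matrices look like; relating the columns to the target is left out.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# toy cleaned headlines (hypothetical)
toy_headlines = ['profit beat estim', 'profit warn', 'oil price slip']

# CountVectorizer: one column per word, values are raw counts
count_vec = CountVectorizer()
count_matrix = count_vec.fit_transform(toy_headlines)

# TfidfVectorizer: same columns, but words found in many headlines are down-weighted
tfidf_vec = TfidfVectorizer()
tfidf_matrix = tfidf_vec.fit_transform(toy_headlines)

print(sorted(count_vec.vocabulary_))
print(count_matrix.toarray())
print(tfidf_matrix.toarray().round(2))
```

In the second headline, "warn" gets a higher TF-IDF weight than "profit" because "profit" also appears in the first headline.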

### Clustering
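We will cluster the rows with KMeans three times: on the open and close prices, on the columns containing `volume` (other than the market `volume` column itself), and on the novelty columns.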

In [None]:
from sklearn.cluster import KMeans

def clustering(X):

    # fit KMeans on the given columns and return each row's cluster label
    def cluster_modelling(features):
        df_set = X[features]
        cluster_model = KMeans(n_clusters = 8)
        cluster_model.fit(df_set)
        return cluster_model.predict(df_set)
    
    # get columns:
    vol_cols = [f for f in X.columns if f != 'volume' and 'volume' in f]
    novelty_cols = [f for f in X.columns if 'novelty' in f]
    
    # fill nulls
    cluster_cols = novelty_cols + vol_cols + ['open', 'close']
    X[cluster_cols] = X[cluster_cols].fillna(0)
    
    X['cluster_open_close'] = cluster_modelling(['open', 'close'])
    X['cluster_volume'] = cluster_modelling(vol_cols)
    X['cluster_novelty'] = cluster_modelling(novelty_cols)
    
    return X
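
To show what the cluster label column looks like, here is a toy run with two obvious price groups. The data is made up, and `n_clusters=2`, `n_init=10` and `random_state=0` are choices for this sketch only (the function above uses 8 clusters).

```python
import numpy as np
from sklearn.cluster import KMeans

# toy (open, close) pairs forming two obvious price groups (hypothetical values)
prices = np.array([[10.0, 10.5], [10.2, 10.1], [9.8, 10.0],
                   [99.0, 101.0], [100.5, 100.0], [101.0, 102.0]])

# random_state and n_init are fixed only to make the sketch repeatable
cluster_model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = cluster_model.fit_predict(prices)
print(labels)
```

The label numbers themselves are arbitrary; what matters is that the three cheap rows share one label and the three expensive rows share the other.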

<a id='section3'></a>
## Step 3. Modelling using Keras' Neural Network

### Preparing Datasets for Modelling
We will convert the numerical and categorical columns into inputs that the neural network can process.

In [None]:
from sklearn.preprocessing import StandardScaler

# scale numerical columns
scaler = StandardScaler()

X_train = scaler.fit_transform(market_train_df[test_cols].fillna(0))

y_train = market_train_df['returnsOpenNextMktres10']

In [None]:
def get_cols(X_train):
    
    # the target and the universe flag are not features
    excluded = ['universe', 'returnsOpenNextMktres10']
    
    # get numerical and categorical columns
    # (is_numeric_dtype also accepts float32 and the narrower int types)
    num_cols = [f for f in X_train.columns
                if pd.api.types.is_numeric_dtype(X_train[f]) and f not in excluded]
    cat_cols = [f for f in X_train.columns if f not in num_cols and f not in excluded]
    
    return num_cols, cat_cols

### Fixed Training Split
We use a fixed split that holds out the last rows of the training dataset for validation instead of a random split. Randomly chosen rows would fill the validation set with timestamps scattered throughout the training period, so the model would be validated on dates it has effectively already seen, and the score would give odd, overly optimistic results.

In [None]:
def fixed_train_test_split(X, y, train_size):
    
    # convert the training fraction into a row count
    train_size = int(train_size * len(X))
    
    # split data: earliest rows for training, last rows for validation
    X_train, y_train = X[:train_size], y[:train_size]
    X_valid, y_valid = X[train_size:], y[train_size:]
    
    return X_train, y_train, X_valid, y_valid

In [None]:
# train_size is a fraction of the rows; 0.8 is an assumed value
X_train, y_train, X_valid, y_valid = fixed_train_test_split(X_train, y_train, 0.8)
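
The idea of a time-ordered split can be checked on a toy list. This is a sketch with a hypothetical helper name, assuming the rows are already sorted by time and that the size is given as a fraction:

```python
# ten toy rows already sorted by time (day index 0..9)
rows = list(range(10))
targets = [day % 2 for day in rows]

def chronological_split(X, y, train_fraction):
    # everything before the cut is training data, everything after is validation
    cut = int(train_fraction * len(X))
    return X[:cut], y[:cut], X[cut:], y[cut:]

X_tr, y_tr, X_va, y_va = chronological_split(rows, targets, 0.8)
print(X_tr)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(X_va)  # [8, 9]
```

Every validation row is later than every training row, which a random split cannot guarantee.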

In [None]:
# original from https://www.kaggle.com/christofhenkel/market-data-nn-baseline
from keras.models import Model
from keras.layers import Input, Dense, Embedding, Concatenate, Flatten, BatchNormalization
from keras.losses import binary_crossentropy

# categorical data
categorical_inputs = []
for cat in cat_cols:
    categorical_inputs.append(Input(shape=[1], name=cat))

categorical_embeddings = []
for i, cat in enumerate(cat_cols):
    categorical_embeddings.append(Embedding(embed_sizes[i], 10)(categorical_inputs[i]))
    
categorical_logits = Flatten()(categorical_embeddings[0])
categorical_logits = Dense(32,activation='relu')(categorical_logits)

# numerical data
numerical_inputs = Input(shape=(11,), name='num')
numerical_logits = numerical_inputs
numerical_logits = BatchNormalization()(numerical_logits)

numerical_logits = Dense(128,activation='relu')(numerical_logits)
numerical_logits = Dense(64,activation='relu')(numerical_logits)

# combined
logits = Concatenate()([numerical_logits,categorical_logits])
logits = Dense(128,activation='relu')(logits)
logits = Dense(64,activation='relu')(logits)
out = Dense(1, activation='sigmoid')(logits)

model = Model(inputs = categorical_inputs + [numerical_inputs], outputs=out)
model.compile(optimizer='adam',loss=binary_crossentropy)
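
The cell above uses `cat_cols` and `embed_sizes`, which are not defined anywhere in this notebook yet. One common way to prepare a categorical column for an `Embedding` layer is to map each category to an integer id; the sketch below is an assumption about how that could look, not the code the model was built with.

```python
# hypothetical categorical column values
asset_codes = ['AAPL.O', 'XOM.N', 'AAPL.O', 'MSFT.O']

# map each distinct category to an integer id, reserving 0 for unseen values
encoder = {code: idx + 1 for idx, code in enumerate(sorted(set(asset_codes)))}
encoded = [encoder.get(code, 0) for code in asset_codes]

# an Embedding layer's input_dim must be larger than the biggest id it will receive
embed_size = len(encoder) + 1

print(encoder)     # {'AAPL.O': 1, 'MSFT.O': 2, 'XOM.N': 3}
print(encoded)     # [1, 3, 1, 2]
print(embed_size)  # 4
```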

In [None]:
get_cols(market_train_df)

In [None]:
from keras.callbacks import EarlyStopping, ModelCheckpoint, LearningRateScheduler

# decay the learning rate by 1% each epoch
learning_rate = 1e-4
dynamic_lr = LearningRateScheduler(lambda epoch: learning_rate * 0.99 ** epoch)

# set early stopping
early_stop = EarlyStopping(patience=3)

# the sigmoid output predicts whether the return is positive, so binarize the target;
# astype(int) alone would truncate almost every return to 0
model.fit(X_train, (y_train >= 0).astype(int),
          validation_data=(X_valid, (y_valid >= 0).astype(int)),
          epochs=200,
          verbose=0,
          callbacks=[dynamic_lr, early_stop])
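
The schedule passed to `LearningRateScheduler` is a plain function of the epoch number, so its effect can be inspected without Keras. A quick sketch (the name `decayed_lr` is made up):

```python
base_lr = 1e-4

def decayed_lr(epoch):
    # same rule as the LearningRateScheduler lambda: shrink by 1% per epoch
    return base_lr * 0.99 ** epoch

for epoch in (0, 1, 10, 100):
    print(epoch, decayed_lr(epoch))
```

After 100 epochs the rate has dropped to roughly 37% of its starting value, so the decay is gentle compared with the early-stopping patience of 3 epochs.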

In [None]:
model.predict(X_valid)

<a id='section4'></a>
## Step 4. Applying the Model
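
This step is still to be written. Since the competition asks for a confidence value between -1 and 1 for each asset, and the model outputs a sigmoid probability between 0 and 1, one simple mapping (an assumption here, not a settled choice) is to rescale the probability linearly:

```python
def to_confidence(probability):
    # map a sigmoid output in [0, 1] to a signed confidence in [-1, 1]
    return 2.0 * probability - 1.0

for p in (0.0, 0.5, 1.0):
    print(p, to_confidence(p))
```

A probability of 0.5 becomes a confidence of 0, meaning no position either way.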


**Sources:**
* [Market Data NN Baseline by Christofhenkel](https://www.kaggle.com/christofhenkel/market-data-nn-baseline)
* [a simple model using the market and news data by Bguberfain](https://www.kaggle.com/bguberfain/a-simple-model-using-the-market-and-news-data)
* [Amateur Hour - Using Headlines to Predict Stocks by Magichanics](https://www.kaggle.com/magichanics/amateur-hour-using-headlines-to-predict-stocks)
* [Simple Quant Features by Youhanlee](https://www.kaggle.com/youhanlee/simple-quant-features-using-python)