# Two Sigma Financial News Competition Official Getting Started Kernel
## Introduction
In this competition you will predict how stocks will change based on the market state and news articles.  You will loop through a long series of trading days; for each day, you'll receive an updated state of the market, and a series of news articles which were published since the last trading day, along with impacted stocks and sentiment analysis.  You'll use this information to predict whether each stock will have increased or decreased ten trading days into the future.  Once you make these predictions, you can move on to the next trading day. 

This competition is different from most Kaggle Competitions in that:
* You can only submit from Kaggle Kernels, and you may not use other data sources, GPU, or internet access.
* This is a **two-stage competition**.  In Stage One you can edit your Kernels and improve your model, where Public Leaderboard scores are based on their predictions relative to past market data.  At the beginning of Stage Two, your Kernels are locked, and we will re-run your Kernels over the next six months, scoring them based on their predictions relative to live data as those six months unfold.
* You must use our custom **`kaggle.competitions.twosigmanews`** Python module.  The purpose of this module is to control the flow of information to ensure that you are not using future data to make predictions for the current trading day.

In this Starter Kernel, we'll show how to use the **`twosigmanews`** module to get the training data, get test features and make predictions, and write the submission file.

## TL;DR: End-to-End Usage Example
```python
from kaggle.competitions import twosigmanews
env = twosigmanews.make_env()

(market_train_df, news_train_df) = env.get_training_data()
train_my_model(market_train_df, news_train_df)

for (market_obs_df, news_obs_df, predictions_template_df) in env.get_prediction_days():
  predictions_df = make_my_predictions(market_obs_df, news_obs_df, predictions_template_df)
  env.predict(predictions_df)
  
env.write_submission_file()
```
Note that `train_my_model` and `make_my_predictions` are functions you need to write for the above example to work.
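
As a starting point, the two placeholder functions could be stubbed out as below. This is only a sketch: both function bodies are hypothetical, the "model" is a do-nothing baseline, and `toy_template` is a made-up stand-in for the template DataFrame that the environment supplies.

```python
import pandas as pd


def train_my_model(market_train_df, news_train_df):
    """Hypothetical placeholder: a real version would fit a model here."""
    return None


def make_my_predictions(market_obs_df, news_obs_df, predictions_template_df):
    """Hypothetical baseline: give every asset a neutral confidence of 0.0."""
    predictions_df = predictions_template_df.copy()
    predictions_df['confidenceValue'] = 0.0
    return predictions_df


# Made-up template with the two columns the environment is described as providing.
toy_template = pd.DataFrame({'assetCode': ['A.N', 'AAPL.O'],
                             'confidenceValue': [0.0, 0.0]})
toy_predictions = make_my_predictions(None, None, toy_template)
print(toy_predictions)
```

Returning a copy of the template keeps the row set identical to what was handed in, which is what `predict` expects.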

## In-depth Introduction
First let's import the module and create an environment.

In [None]:
import numpy as np
import pandas as pd
import gc

In [None]:
from kaggle.competitions import twosigmanews
# You can only call make_env() once, so don't lose it!
env = twosigmanews.make_env()

## **`get_training_data`** function

Returns the training data DataFrames as a tuple of:
* `market_train_df`: DataFrame with market training data
* `news_train_df`: DataFrame with news training data

These DataFrames contain all market and news data from February 2007 to December 2016.  See the [competition's Data tab](https://www.kaggle.com/c/two-sigma-financial-news/data) for more information on what columns are included in each DataFrame.

In [None]:
(market_train_df, news_train_df) = env.get_training_data()

Find out the shape of each df.

In [None]:
#market_train_df.shape #(4072956, 16)

In [None]:
#market_train_df.head()

In [None]:
#market_train_df.tail()

In [None]:
#market_train_df.nunique()

In [None]:
#market_train_df.dtypes

In [None]:
#market_train_df.isna().sum()

Pre-process market data.

In [None]:
def market_process(market_train_df):
    
    market_train_df['time'] = market_train_df.time.dt.date
    market_train_df['bartrend'] = market_train_df['close'] / market_train_df['open']
    market_train_df['average'] = (market_train_df['close'] + market_train_df['open'])/2
    market_train_df['pricevolume'] = market_train_df['volume'] * market_train_df['close']
    
    # drop nans or not?
    market_train_df.dropna(axis=0, inplace=True)
    market_train_df.drop('assetName', axis=1, inplace=True)
    
    #market_train_df.columns = pd.Index(["{}_{}".format(e[0], e[1]) for e in market_train_df.columns.tolist()])
    #market_train_df.reset_index(inplace=True)
    # Set datatype to float32
    float_cols = {c: 'float32' for c in market_train_df.columns if c not in ['assetCode', 'time']}
    
    return market_train_df.astype(float_cols)


market_train_df = market_process(market_train_df)
#market_train_df.shape # (4072956, 19) dropnans(3979902, 15)
#market_train_df.head()
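
To see what the three derived columns contain, the same arithmetic can be run on a tiny hand-made frame. The values below are invented for illustration and only the columns needed for the calculation are included.

```python
import pandas as pd

# Invented two-row sample; the real market data has many more columns.
toy_market = pd.DataFrame({
    'time': pd.to_datetime(['2016-12-29 22:00', '2016-12-30 22:00']),
    'assetCode': ['A.N', 'A.N'],
    'open': [10.0, 20.0],
    'close': [11.0, 19.0],
    'volume': [1000.0, 500.0],
})

toy_market['time'] = toy_market['time'].dt.date                           # keep the calendar date only
toy_market['bartrend'] = toy_market['close'] / toy_market['open']          # > 1 means the close was above the open
toy_market['average'] = (toy_market['close'] + toy_market['open']) / 2     # midpoint of open and close
toy_market['pricevolume'] = toy_market['volume'] * toy_market['close']     # traded value at the close

print(toy_market[['time', 'bartrend', 'average', 'pricevolume']])
```

Truncating `time` to a date is what later allows market rows and news rows to be joined on the same day.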

In [None]:
#news_train_df.shape #(9328750, 35) no nans

In [None]:
#news_train_df.dtypes

In [None]:
#news_train_df.nunique()

In [None]:
#news_train_df.head()

In [None]:
#news_train_df.tail()

Pre-process news data.

In [None]:
def news_process(news_train_df):
    
    news_train_df['time'] = news_train_df.time.dt.date
    news_train_df['position'] = news_train_df['firstMentionSentence'] / news_train_df['sentenceCount']
    news_train_df['coverage'] = news_train_df['sentimentWordCount'] / news_train_df['wordCount']
    droplist = ['sourceTimestamp','firstCreated','sourceId','headline','takeSequence','provider',
            'firstMentionSentence','headlineTag','marketCommentary','subjects','audiences',
            'assetName','noveltyCount12H','noveltyCount24H','noveltyCount3D','noveltyCount5D',
            'noveltyCount7D','urgency','sentimentClass']
    news_train_df.drop(droplist, axis=1, inplace=True)
    
    # Factorize categorical columns
#     for col in ['headlineTag', 'provider', 'sourceId']:
#         news_train[col], uniques = pd.factorize(news_train[col])
#         del uniques
    
    # Remove {} and '' from assetCodes column
    news_train_df['assetCodes'] = news_train_df['assetCodes'].apply(lambda x: x[1:-1].replace("'", ""))
    return news_train_df

news_train_df = news_process(news_train_df)
gc.collect()
#news_train_df.head()

Unstack assetCodes.

In [None]:
def unstack_asset_codes(news_train_df):
    codes = []
    indexes = []
    for i, values in news_train_df['assetCodes'].items():
        explode = values.split(", ")
        codes.extend(explode)
        repeat_index = [int(i)]*len(explode)
        indexes.extend(repeat_index)
    index_df = pd.DataFrame({'news_index': indexes, 'assetCode': codes})
    del codes, indexes
    gc.collect()
    return index_df

index_df = unstack_asset_codes(news_train_df)
#index_df.head(3)
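
The same unstacking can likely be written without an explicit Python loop. The sketch below assumes pandas 0.25 or newer (for `Series.explode`) and uses an invented two-row frame; it is an alternative to the function above, not a description of it.

```python
import pandas as pd

# Invented sample: 'assetCodes' already cleaned to a comma-separated string.
toy_news = pd.DataFrame(
    {'assetCodes': ['A.N, AA.N', 'AAPL.O']},
    index=[10, 11],
)

# Split each string into a list, then emit one row per list element.
# explode() repeats the original index label for every element.
exploded = toy_news['assetCodes'].str.split(', ').explode()

toy_index_df = pd.DataFrame({
    'news_index': exploded.index.to_numpy(),
    'assetCode': exploded.to_numpy(),
})
print(toy_index_df)
```

Each news row ends up paired with every asset it mentions, which is the shape the subsequent merge on `news_index` needs.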

In [None]:
def merge_news_on_index(news_train_df, index_df):
    news_train_df['news_index'] = news_train_df.index.copy()

    # Merge news on unstacked assets
    news_unstack_df = index_df.merge(news_train_df, how='left', on='news_index')
    news_unstack_df.drop(['news_index', 'assetCodes'], axis=1, inplace=True)
    return news_unstack_df

news_unstack_df = merge_news_on_index(news_train_df, index_df)
del news_train_df, index_df
gc.collect()
#news_unstack_df.head(3)
#news_unstack_df.shape #(18821885, 23)

Combine multiple news reports for the same asset on the same day.

In [None]:
def group_news(news_frame):
    
    aggregations = ['mean']
    gp = news_frame.groupby(['assetCode', 'time']).agg(aggregations)
    gp.columns = pd.Index(["{}_{}".format(e[0], e[1]) for e in gp.columns.tolist()])
    gp.reset_index(inplace=True)
    # Set datatype to float32
    float_cols = {c: 'float32' for c in gp.columns if c not in ['assetCode', 'time']}
    return gp.astype(float_cols)

news_agg_df = group_news(news_unstack_df)
del news_unstack_df
gc.collect()
#news_agg_df.head(3)
#news_agg_df.shape #((3839367, 23))
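
The effect of the aggregation and the column renaming is easier to see on a small example. The rows below are invented; the pattern is the same one used in `group_news`.

```python
import pandas as pd

# Invented sample: two news items for A.N and one for AAPL.O on the same day.
toy_unstacked = pd.DataFrame({
    'assetCode': ['A.N', 'A.N', 'AAPL.O'],
    'time': ['2016-12-30', '2016-12-30', '2016-12-30'],
    'sentimentPositive': [0.2, 0.6, 0.9],
    'relevance': [1.0, 0.5, 1.0],
})

# agg() with a list of statistics produces two-level column names
# such as ('sentimentPositive', 'mean').
gp = toy_unstacked.groupby(['assetCode', 'time']).agg(['mean'])

# Flatten each (column, statistic) pair into a single name.
gp.columns = pd.Index(['{}_{}'.format(col, stat) for col, stat in gp.columns.tolist()])
gp.reset_index(inplace=True)
print(gp)
```

After this step there is exactly one row per asset and day, so the join with the market data cannot multiply market rows.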

In [None]:
def merge(market_train_df,news_agg_df):
    
    df = market_train_df.merge(news_agg_df, how='left', on=['time','assetCode'])
    # drop nans or not?
    df.dropna(axis=0, inplace=True)
    
    del market_train_df, news_agg_df
    return df

df = merge(market_train_df,news_agg_df)
gc.collect()
#df.head(3)
#df.shape # (4072956, 35) dropnans(1121521, 36)
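
The row counts noted in the comment above drop sharply once NaNs are removed. A small invented example shows why: a left merge keeps every market row, and `dropna` then discards the rows for which no news was found.

```python
import pandas as pd

# Invented sample: three assets traded, but only one had news that day.
toy_market = pd.DataFrame({
    'time': ['2016-12-30'] * 3,
    'assetCode': ['A.N', 'AA.N', 'AAPL.O'],
    'close': [45.6, 18.9, 115.8],
})
toy_news_agg = pd.DataFrame({
    'time': ['2016-12-30'],
    'assetCode': ['AAPL.O'],
    'sentimentPositive_mean': [0.7],
})

# Left merge: all market rows survive, news columns are NaN where nothing matched.
merged = toy_market.merge(toy_news_agg, how='left', on=['time', 'assetCode'])

# dropna removes every asset/day without news.
with_news_only = merged.dropna(axis=0)
print(len(merged), len(with_news_only))
```

Whether to drop these rows is a modelling choice; keeping them would require a model or an imputation step that tolerates missing news features.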

In [None]:
dates = df.time  # named "dates" so the later "import time" does not overwrite it
num_target = df.returnsOpenNextMktres10.astype('float32')
bin_target = (df.returnsOpenNextMktres10 >= 0).astype('int8')
universe = df.universe.astype('int8')
# Drop columns that are not features
df.drop(['returnsOpenNextMktres10', 'universe', 'assetCode', 'time'], axis=1, inplace=True)
gc.collect()
df.head(3)
# shape (, 30)

In [None]:
from sklearn import metrics, model_selection
from lightgbm import LGBMClassifier
import time
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt

In [None]:
train_index, test_index = model_selection.train_test_split(df.index.values, test_size=0.25, 
                                                           random_state = 11)

In [None]:
train_index, val_index = model_selection.train_test_split(train_index, test_size=0.25, 
                                                           random_state = 11)

Tuning parameters.

In [None]:
# def evaluate_model(df, target, train_index, val_index, params):
#     model = LGBMClassifier(objective='binary',
#                            boosting='gbdt',
#                            #'n_jobs': 4,
#                            **params)
#     model.fit(df.loc[train_index],bin_target.loc[train_index])
#     return metrics.log_loss(target.loc[val_index], 
#                             model.predict_proba(df.loc[val_index]))

In [None]:
# param_grid = {
#     'learning_rate': [0.1, 0.05, 0.01],
#     'max_depth': [-1, 8],
#     'num_leaves': [60, 70, 80],
#     'n_estimators': [200, 400], #default class*iteration=2*100
#     #'min_child_samples': [100, 500],
#     'bagging_fraction' : [0.8, 0.9],  # subsample
#     'feature_fraction' : [0.8, 0.9],  # colsample_bytree
#     #'subsample': [0.8, 0.9, 1],
#     'reg_alpha': [0.2, 0.6, 0.8],
#     'reg_lambda': [0.4, 0.6, 0.8]
# }

# print('Tuning begins...')
# best_eval_score = 0
# for i in range(50):
#     params = {k: np.random.choice(v) for k, v in param_grid.items()}
#     score = evaluate_model(df, bin_target, train_index, val_index, params)
#     if score < best_eval_score or best_eval_score == 0:
#         best_eval_score = score
#         best_params = params
# print("Best evaluation logloss", best_eval_score)

In [None]:
#best_params

In [None]:
lgb = LGBMClassifier(
    objective='binary',
    boosting='gbdt',
    learning_rate = 0.05,
    max_depth = 8,
    num_leaves = 80,
    n_estimators = 400,
    bagging_fraction = 0.8,
    feature_fraction = 0.9,
    reg_alpha = 0.2,
    reg_lambda = 0.4)

LGBMClassifier(bagging_fraction=0.8, boosting_type='gbdt', class_weight=None,
        colsample_bytree=1.0, feature_fraction=0.9,
        importance_type='split', learning_rate=0.05, max-depth=8,
        max_depth=-1, min_child_samples=20, min_child_weight=0.001,
        min_split_gain=0.0, n_estimators=400, n_jobs=-1, num_leaves=80,
        objective=None, random_state=None, reg_alpha=0.6, reg_lambda=0.4,
        silent=True, subsample=1.0, subsample_for_bin=200000,
        subsample_freq=0)

In [None]:
#t = time.time()
print('Fitting Up')
lgb.fit(df.loc[train_index],bin_target.loc[train_index])
print('Done')
#print(f'Done, time = {time.time() - t}')

In [None]:
# scikit-learn metrics take (y_true, y_pred) in that order
metrics.accuracy_score(bin_target.loc[test_index], lgb.predict(df.loc[test_index]))
#print("AUC Score : %f" % metrics.roc_auc_score(bin_target.loc[test_index], lgb.predict_proba(df.loc[test_index])[:, 1]))

This is said to be the competition's scoring metric.

## `get_prediction_days` function

Generator which loops through each "prediction day" (trading day) and provides all market and news observations which occurred since the last data you've received.  Once you call **`predict`** to make your future predictions, you can continue on to the next prediction day.

Yields:
* While there are more prediction day(s) and `predict` was called successfully since the last yield, yields a tuple of:
    * `market_observations_df`: DataFrame with market observations for the next prediction day.
    * `news_observations_df`: DataFrame with news observations for the next prediction day.
    * `predictions_template_df`: DataFrame with `assetCode` and `confidenceValue` columns, prefilled with `confidenceValue = 0`, to be filled in and passed back to the `predict` function.
* If `predict` has not been called since the last yield, yields `None`.
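
The call order this implies (take a day, call `predict`, take the next day) can be sketched with a small mock. `MockEnv` is entirely hypothetical and is not the `twosigmanews` implementation; in particular, raising an error when `predict` is skipped is an assumption made for the illustration.

```python
import pandas as pd


class MockEnv:
    """Hypothetical stand-in for the competition environment (illustration only)."""

    def __init__(self, n_days):
        self._n_days = n_days
        self._awaiting_prediction = False
        self.predictions = []

    def get_prediction_days(self):
        for day in range(self._n_days):
            # Refuse to advance until the previous day has been predicted.
            if self._awaiting_prediction:
                raise RuntimeError('call predict() before moving to the next day')
            self._awaiting_prediction = True
            template = pd.DataFrame({'assetCode': ['A.N', 'AAPL.O'],
                                     'confidenceValue': [0.0, 0.0]})
            yield day, template

    def predict(self, predictions_df):
        self.predictions.append(predictions_df)
        self._awaiting_prediction = False


mock_env = MockEnv(n_days=3)
for day, template in mock_env.get_prediction_days():
    mock_env.predict(template)  # one predict() per yielded day
print(len(mock_env.predictions))
```

Because the generator keeps its position between iterations, it can only be consumed once, which is why the result of `get_prediction_days()` should be stored in a variable.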

In [None]:
# You can only iterate through a result from `get_prediction_days()` once
# so be careful not to lose it once you start iterating.
days = env.get_prediction_days()

In [None]:
n_days = 0

for (market_obs_df, news_obs_df, predictions_template_df) in days:
    n_days += 1
    print(n_days,end=' ')
    
    # process market data
    market_obs_df = market_process(market_obs_df)
    
    # process news data
    news_obs_df = news_process(news_obs_df)
    index_df = unstack_asset_codes(news_obs_df)
    news_unstack = merge_news_on_index(news_obs_df, index_df)
    news_obs_agg = group_news(news_unstack)

    # merge
    obs_df = merge(market_obs_df,news_obs_agg)
    del market_obs_df, news_obs_agg, news_obs_df, news_unstack, index_df
    gc.collect()
    obs_df = obs_df[obs_df.assetCode.isin(predictions_template_df.assetCode)]

    # Drop cols that are not features
    feats = [c for c in obs_df.columns if c not in ['assetCode', 'time']]

    #t = time.time()
    preds = lgb.predict_proba(obs_df[feats])[:, 1] * 2 - 1
    sub = pd.DataFrame({'assetCode': obs_df['assetCode'], 'confidence': preds})
    predictions_template_df = predictions_template_df.merge(sub, how='left').drop(
        'confidenceValue', axis=1).fillna(0).rename(columns={'confidence':'confidenceValue'})

    env.predict(predictions_template_df)
    del obs_df, predictions_template_df, preds, sub
    gc.collect()
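
Two details of the loop above are easy to miss: the probability is rescaled from `[0, 1]` to `[-1, 1]`, and assets that were dropped during preprocessing still receive a row (with confidence 0) because the result is merged back onto the template. A sketch with invented numbers, where `prob_up` is a hypothetical name standing in for the classifier's positive-class probability:

```python
import pandas as pd

# Invented template with three assets; only two of them were scored.
toy_template = pd.DataFrame({'assetCode': ['A.N', 'AA.N', 'AAPL.O'],
                             'confidenceValue': [0.0, 0.0, 0.0]})
toy_scored = pd.DataFrame({'assetCode': ['A.N', 'AAPL.O'],
                           'prob_up': [0.75, 0.25]})

# Map a probability p in [0, 1] to a signed confidence 2p - 1 in [-1, 1].
toy_scored['confidence'] = toy_scored['prob_up'] * 2 - 1

# A left merge keeps every template row; unscored assets get NaN, then 0.
filled = (toy_template
          .merge(toy_scored[['assetCode', 'confidence']], how='left', on='assetCode')
          .drop('confidenceValue', axis=1)
          .fillna(0)
          .rename(columns={'confidence': 'confidenceValue'}))
print(filled)
```

A confidence of 0 expresses no view on an asset, which is a reasonable default when no features are available for it.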


In [None]:
env.write_submission_file()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

`feature_importances_` holds one importance value per feature. With the default `importance_type='split'`, each value is the number of times the feature is used in the model. With `importance_type='gain'`, it is the total gain of the splits that use the feature.

In [None]:
feat_importance = pd.DataFrame()
feat_importance["feature"] = df.columns
feat_importance["weight"] = lgb.feature_importances_
feat_importance.sort_values(by='weight', ascending=False, inplace=True)

plt.figure(figsize=(8,10))
ax = sns.barplot(y="feature", x="weight", data=feat_importance)

In [None]:
#(market_obs_df, news_obs_df, predictions_template_df) = next(days)

In [None]:
#market_obs_df.head()

In [None]:
#market_obs_df.isna().sum()

In [None]:
#news_obs_df.head()

In [None]:
#news_obs_df.isna().sum()

In [None]:
#predictions_template_df.head()

Note that we'll get an error if we try to continue on to the next prediction day without making our predictions for the current day.

In [None]:
# next(days)

### **`predict`** function
Stores your predictions for the current prediction day.  Expects the same format as you saw in `predictions_template_df` returned from `get_prediction_days`.

Args:
* `predictions_df`: DataFrame which must have the following columns:
    * `assetCode`: The market asset.
    * `confidenceValue`: Your confidence whether the asset will increase or decrease in 10 trading days.  All values must be in the range `[-1.0, 1.0]`.

The `predictions_df` you send **must** contain the exact set of rows which were given to you in the `predictions_template_df` returned from `get_prediction_days`.  The `predict` function does not validate this, but if you are missing any `assetCode`s or add any extraneous `assetCode`s, then your submission will fail.
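
Since `predict` does not validate its input, a local check before each call may catch problems early. `check_predictions` below is a hypothetical helper, not part of the `twosigmanews` module, and the sample frames are invented.

```python
import pandas as pd


def check_predictions(predictions_df, predictions_template_df):
    """Hypothetical helper: return a list of problems, empty if none were found."""
    problems = []
    expected = set(predictions_template_df['assetCode'])
    received = set(predictions_df['assetCode'])

    missing = sorted(expected - received)
    extra = sorted(received - expected)
    if missing:
        problems.append('missing assetCodes: {}'.format(missing))
    if extra:
        problems.append('unexpected assetCodes: {}'.format(extra))
    if predictions_df['assetCode'].duplicated().any():
        problems.append('duplicated assetCodes')

    # between() is inclusive on both ends and is False for NaN.
    in_range = predictions_df['confidenceValue'].between(-1.0, 1.0)
    if not in_range.all():
        problems.append('confidenceValue outside [-1.0, 1.0] or missing')
    return problems


toy_template = pd.DataFrame({'assetCode': ['A.N', 'AAPL.O'],
                             'confidenceValue': [0.0, 0.0]})
good = pd.DataFrame({'assetCode': ['A.N', 'AAPL.O'], 'confidenceValue': [0.3, -0.8]})
bad = pd.DataFrame({'assetCode': ['A.N', 'XYZ.N'], 'confidenceValue': [0.3, 1.5]})

good_problems = check_predictions(good, toy_template)
bad_problems = check_predictions(bad, toy_template)
print(good_problems)
print(bad_problems)
```

In the prediction loop, such a check could run just before `env.predict(...)` and raise if the returned list is not empty.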

Let's make random predictions for the first day:

In [None]:
# import numpy as np
# def make_random_predictions(predictions_df):
#     predictions_df.confidenceValue = 2.0 * np.random.rand(len(predictions_df)) - 1.0

In [None]:
# make_random_predictions(predictions_template_df)
# env.predict(predictions_template_df)

Now we can continue on to the next prediction day and make another round of random predictions for it:

In [None]:
# (market_obs_df, news_obs_df, predictions_template_df) = next(days)

In [None]:
# market_obs_df.head()

In [None]:
# news_obs_df.head()

In [None]:
# predictions_template_df.head()

In [None]:
# make_random_predictions(predictions_template_df)
# env.predict(predictions_template_df)

## Main Loop
Let's loop through all the days and make our random predictions.  The `days` generator (returned from `get_prediction_days`) will simply stop returning values once you've reached the end.

In [None]:
# for (market_obs_df, news_obs_df, predictions_template_df) in days:
#     make_random_predictions(predictions_template_df)
#     env.predict(predictions_template_df)
# print('Done!')

## **`write_submission_file`** function

Writes your predictions to a CSV file (`submission.csv`) in the current working directory.

In [None]:
# env.write_submission_file()

In [None]:
# We've got a submission file!
import os
print([filename for filename in os.listdir('.') if '.csv' in filename])

As indicated by the helper message, calling `write_submission_file` on its own does **not** make a submission to the competition.  It merely tells the module to write the `submission.csv` file as part of the Kernel's output.  To make a submission to the competition, you'll have to **Commit** your Kernel and find the generated `submission.csv` file in that Kernel Version's Output tab (note this is _outside_ of the Kernel Editor), then click "Submit to Competition".  When we re-run your Kernel during Stage Two, we will run the Kernel Version (generated when you hit "Commit") linked to your chosen Submission.

## Restart the Kernel to run your code again
In order to combat cheating, you are only allowed to call `make_env` or iterate through `get_prediction_days` once per Kernel run.  However, while you're iterating on your model it's reasonable to try something out, change the model a bit, and try it again.  Unfortunately, if you try to simply re-run the code, or even refresh the browser page, you'll still be running on the same Kernel execution session you had been running before, and the `twosigmanews` module will still throw errors.  To get around this, you need to explicitly restart your Kernel execution session, which you can do by pressing the Restart button in the Kernel Editor's bottom Console tab:
![Restart button](https://i.imgur.com/hudu8jF.png)