# Template for Machine Learning Course Project


***1. Feature Engineering***
- Merge Market & News Data
- Impute returns data using NOCB 
   https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4
- Bin numerical to binary when there is not much data for factors.
- Create new Features to account for Time Series auto Correlation between rows.

***2. Data reduction & Exploration***
- Subset of data for the top companies that appear consistently in the news. We selected companies with ample news data based on our research.

The reason for doing so was the abundance of news articles and market data available for these companies in particular. The stocks we considered were:

Barclays PLC, Citigroup Inc, Apple Inc, JPMorgan Chase & Co, Bank of America Corp, HSBC Holdings PLC, Goldman Sachs Group Inc, Deutsche Bank AG, BHP Billiton PLC, BP PLC, Google Inc, Boeing Co, Rio Tinto PLC, Royal Dutch Shell PLC, Ford Motor Co, General Electric Co, Morgan Stanley, Microsoft Corp, Exxon Mobil Corp, UBS AG

- Subset of Data from 2013
- Use numeric news data (novelty counts, volume counts, sentenceCount, relevance, takeSequence etc.) & returns data columns
- Spot outliers and plot correlation


**3. Split train and Test**
1. Transform target variable to binary Stock-Movement Up/Down (0/1)
2. Stock-Movement Up/Down will be the label for training.
3. TimeSeries Training is different from regular dataset training
http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html
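
To illustrate why time-series training differs from a regular shuffled split, here is a minimal sketch of how `TimeSeriesSplit` builds its folds. The toy array `X_demo` is an assumption made for illustration and is not the competition data.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve time-ordered observations (toy data, not the competition data).
X_demo = np.arange(12).reshape(-1, 1)

splitter = TimeSeriesSplit(n_splits=3)
folds = list(splitter.split(X_demo))
for train_idx, test_idx in folds:
    # Every training index precedes every test index, so the model
    # is never evaluated on observations older than its training data.
    print("train:", train_idx, "test:", test_idx)
```

Each fold extends the training window forward in time and tests on the block that follows it.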

**4. Use Gradient Descent to tune parameters.**

**5. Fit Classifier with Train**
Using Classifiers that work well with Mixed Data.

    1. Random Forest
    2. BaggingClassifier with DecisionTrees
    3. XGBoost
    4. Neural Networks

**6. Cross validation to estimate test error for this model.**

**7. Use GridSearchCV to tune hyper parameters.**

**8. Use the best estimator for test prediction and accuracy.**

**Tips/Tricks:**

Measures to improve Test Accuracy of the models:
1. Use companies that have data for VolumeCounts7D/NoveltyCount7D.

2. Find outliers of all variables and treat them.
https://www.kaggle.com/artgor/eda-feature-engineering-and-everything

3. Create new features to capture autocorrelation:
    e.g: https://www.kaggle.com/youhanlee/simple-quant-features-using-python
    Make these specific to a particular Stock.
    
4. Use RandomizedSearchCV instead of GridSearchCV

5. HyperParameter Tuning
https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/

6. Find means to add more data.

7. Top 10 Features impacting next 10 day movement

8. Splitting the date into discrete components can allow decision trees to make better guesses.

9. Make assetCode-specific datasets and train separate assetCode-specific models on them.
Create an ensemble of assetCode specific models?
https://www.kaggle.com/arthurtok/introduction-to-ensembling-stacking-in-python
https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-for-ensemble-models/

10. Overfit vs. Underfit curve plot
http://scikit-learn.org/stable/modules/learning_curve.html
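
As a small illustration of tip 8, the sketch below splits a timestamp into discrete components that a tree can split on. The frame `dates` and the extra `DayOfWeek` column are assumptions made for this example; the notebook itself derives `Year`, `Month`, `Day` and `Week`.

```python
import pandas as pd

# Toy timestamps (hypothetical; the notebook uses the market `time` column).
dates = pd.DataFrame({"time": pd.to_datetime(["2013-01-15", "2013-06-30", "2014-12-01"])})

# Each component becomes its own integer feature.
dates["Year"] = dates["time"].dt.year
dates["Month"] = dates["time"].dt.month
dates["Day"] = dates["time"].dt.day
dates["DayOfWeek"] = dates["time"].dt.dayofweek  # Monday=0 ... Sunday=6

print(dates)
```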


# Imports

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn import *
from kaggle.competitions import twosigmanews
env = twosigmanews.make_env()
(market_inputMain, news_inputMain) = env.get_training_data()

In [None]:
print(market_inputMain.head())
print(news_inputMain.head())

In [None]:
for i, j in zip([-1, 0, 1], ['negative', 'neutral', 'positive']):
    df_sentiment = news_inputMain.loc[news_inputMain['sentimentClass'] == i, 'assetCodes']
    print(f'Top mentioned companies for {j} sentiment are:')
    print(df_sentiment.value_counts().head(5))
    print('')

In [None]:
df_volumeCount = news_inputMain.loc[news_inputMain['volumeCounts7D'] > 0, 'assetName']
print('Top mentioned companies with volumeCounts7D > 0 are:')
print(df_volumeCount.value_counts().head(30))
print('')

In [None]:
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
from sklearn import model_selection
from sklearn.metrics import accuracy_score

# 1. Feature Engineering

#### **Function to merge Market & News Datasets**

In [None]:
#Make a deep copy to keep the main dataset. Environment cannot be restarted at will.
dfm = market_inputMain.copy(deep=True)
dfn = news_inputMain.copy(deep=True)


In [None]:
#print(dfm[dfm["assetName"].str.startswith('Goo')]["assetName"].unique())
 
dfm[dfm["assetName"].isin([ 'Apple Inc'\
     ,'JPMorgan Chase & Co'\
     ,'Bank of America Corp'\
     ,'HSBC Holdings PLC'\
     ,'Goldman Sachs Group Inc'\
     ,'Deutsche Bank AG'\
     ,'BHP Billiton PLC'\
     ,'BP PLC'\
     ,'Google Inc'\
     ,'Boeing Co'\
     ,'Rio Tinto PLC'\
     ,'Royal Dutch Shell PLC'\
     ,'Ford Motor Co'\
     ,'General Electric Co'\
     ,'Morgan Stanley'\
     ,'Microsoft Corp'\
     ,'Exxon Mobil Corp'\
     ,'UBS AG'\
    ,'Toyota Motor Corp'\
,'Royal Bank of Scotland Group PLC'\
,'Wal-Mart Stores Inc'\
,'BHP Billiton Ltd'\
,'General Motors Co'\
,'Verizon Communications Inc'\
,'AT&T Inc'\
,'Wells Fargo & Co'\
,'Amazon.com Inc'\
,'Lloyds Banking Group PLC'
       ])]['assetCode'].unique()

In [None]:
dfm = dfm[dfm["assetName"].isin([   'Apple Inc'\
     ,'JPMorgan Chase & Co'\
     ,'Bank of America Corp'\
     ,'HSBC Holdings PLC'\
     ,'Goldman Sachs Group Inc'\
     ,'Deutsche Bank AG'\
     ,'BHP Billiton PLC'\
     ,'BP PLC'\
     ,'Google Inc'\
     ,'Boeing Co'\
     ,'Rio Tinto PLC'\
     ,'Royal Dutch Shell PLC'\
     ,'Ford Motor Co'\
     ,'General Electric Co'\
     ,'Morgan Stanley'\
     ,'Microsoft Corp'\
     ,'Exxon Mobil Corp'\
     ,'UBS AG'\
       ])]

Other company names considered but not included in the filter above: 'Intel Corp', 'Cisco Systems Inc', 'Oracle Corp', 'International Business Machines Corp', 'Barclays PLC', 'Citigroup Inc'.

In [None]:
market_inputMain.shape

In [None]:
dfm.shape

In [None]:
market_inputMain.head()

In [None]:
news_inputMain.shape

In [None]:
dfn.shape

#### Cutdown datasets

In [None]:
#Work in UTC so the cutoff can be compared with the tz-aware time columns
import pytz

utc=pytz.UTC

#Cut both datasets down to rows after the start date
startdate = pd.to_datetime("2012-01-01").replace(tzinfo=utc)
dfm = dfm[dfm.time > startdate]
dfn = dfn[dfn.time > startdate]


#### EXPAND NEWS Dataset as each "assetCodes" field is a  list of assetCodes

In [None]:
#News dataset shape before expanding
news_df = dfn
news_df.shape

In [None]:
dfm.shape

In [None]:
#First five asset codes of the non-expanded News Dataset
news_df["assetCodes"].head(5)

In [None]:
#Expanding assetCodes
from itertools import chain
news_cols = news_df.columns.values
#Raw string: the pattern has no placeholders, and \w is not a valid escape in a normal string
news_df['assetCodes'] = news_df['assetCodes'].str.findall(r"'([\w\./]+)'")  
#print(chain(*news_df['assetCodes']))
assetCodes_expanded = list(chain(*news_df['assetCodes']))


In [None]:
dfm["assetCode"].unique()

In [None]:
#assetCodes_expanded = assetCodes_expanded

assetCodes_index = news_df.index.repeat( news_df['assetCodes'].apply(len) )
assert len(assetCodes_index) == len(assetCodes_expanded)
assetCodes = pd.DataFrame({'level_0': assetCodes_index, 'assetCode': assetCodes_expanded})
assetCodes = assetCodes[assetCodes["assetCode"].isin(['AAPL.O', 'AMZN.O', 'BA.N', 'BAC.N', 'BHP.N', 'BP.N', 'F.N',
       'GE.N', 'GS.N', 'HBC.N', 'JPM.N', 'MS.N', 'MSFT.O', 'RDSa.N',
       'RDSb.N', 'RTP.N', 'T.N', 'TM.N', 'VZ.N', 'WFC.N', 'XOM.N', 'DB.N',
       'LYG.N', 'BBL.N', 'RBS.N', 'RIO.N', 'GM.N', 'HSBC.N'
       ])]
news_df_expanded = pd.merge(assetCodes, news_df[news_cols], left_on='level_0', right_index=True, suffixes=('', '_old'))

Earlier asset-code list, kept for reference: 'AAPL.O', 'BA.N', 'BAC.N', 'BBL.N', 'BCS.N', 'BP.N', 'DB.N', 'F.N', 'GE.N', 'GS.N', 'HBC.N', 'JPM.N', 'MS.N', 'MSFT.O', 'RDSa.N', 'RDSb.N', 'RTP.N', 'XOM.N', 'RIO.N', 'C.N', 'HSBC.N'

In [None]:
#Shape of news_df after expanding
print(news_df_expanded.shape)

In [None]:
news_df_expanded.iloc[:5, :10]
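
The expansion step can also be sketched on a toy frame. This is a minimal illustration, assuming the `assetCodes` field is a string such as `"{'RDSa.N', 'RDSb.N'}"` and that `DataFrame.explode` is available (pandas 0.25 or later); `news_demo` and its values are hypothetical.

```python
import pandas as pd

# Two toy news rows; the second one mentions two asset codes.
news_demo = pd.DataFrame({
    "headline": ["story A", "story B"],
    "assetCodes": ["{'AAPL.O'}", "{'RDSa.N', 'RDSb.N'}"],
})

# Extract every quoted code into a Python list per row.
news_demo["assetCode"] = news_demo["assetCodes"].str.findall(r"'([\w\./]+)'")

# One output row per (news item, asset code) pair.
expanded = news_demo.explode("assetCode").reset_index(drop=True)
print(expanded[["headline", "assetCode"]])
```

The notebook's `index.repeat` plus `merge` approach produces the same one-row-per-code shape and also works on older pandas versions.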

In [None]:
#Checking to see if there are missing values in news
news_df_expanded.isna().sum()

#### We found that there are no missing values (NAs) in the news dataset

### Merge Market vs. News datasets

In [None]:
#"data_prep" will do Merge and some basic cleaning

def data_prep(market_df,news_df):
    asset_code_dict = {k: v for v, k in enumerate(market_df['assetCode'].unique())}
    columns_tobe_retained = ['time','assetCode', 'assetName' ,'volume', 'open', 'close','returnsClosePrevRaw1',\
                             'returnsOpenPrevRaw1','returnsClosePrevMktres1','returnsOpenPrevMktres1',\
                             'returnsClosePrevRaw10','returnsOpenPrevRaw10','returnsClosePrevMktres10',\
                             'returnsOpenPrevMktres10','returnsOpenNextMktres10',\
                             'assetCodeT','urgency', 'takeSequence', 'companyCount','marketCommentary','sentenceCount',\
           'firstMentionSentence','relevance','sentimentClass','sentimentWordCount','noveltyCount24H',\
           'firstCreated',   \
                      # 'asset_sentiment_count', 'asset_sentence_mean','len_audiences',\
           'noveltyCount3D', 'noveltyCount5D', 'noveltyCount7D','volumeCounts24H','volumeCounts3D','volumeCounts5D','volumeCounts7D']
    market_df['date'] = market_df['time'].dt.date
    market_df['close_to_open'] = market_df['close'] / market_df['open']
    market_df['assetCodeT'] = market_df['assetCode'].map(asset_code_dict)
    #News data feature creation
    #news_df['time'] = news_df.time.dt.hour
    #news_df['sourceTimestamp']= news_df.sourceTimestamp.dt.hour
    #news_df['firstCreated'] = news_df['firstCreated'].dt.date 
    #news_df['asset_sentiment_count'] = news_df.groupby(['assetName', 'sentimentClass'])['time'].transform('count')
    #news_df['asset_sentence_mean'] = news_df.groupby(['assetName', 'sentenceCount'])['time'].transform('mean')
    #news_df['len_audiences'] = news_df['audiences'].map(lambda x: len(eval(x)))
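    #Average the numeric news columns so there is one news row per (date, assetCode) key
    kcol = ['firstCreated', 'assetCode']
    news_df = news_df.groupby(kcol, as_index=False).mean(numeric_only=True)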

    # Merge news and market data. Only keep numeric columns
    market_df_merge = pd.merge(market_df, news_df, how='left', left_on=['date', 'assetCode'], 
                            right_on=['firstCreated', 'assetCode'])

    #return only data for the numeric columns + key information (assetCode, time)
    return market_df_merge[columns_tobe_retained]


In [None]:
#Group News Data by firstCreated & assetCode
kcol = ['firstCreated', 'assetCode']
d = news_df_expanded.sort_values('firstCreated').copy(deep=True)
d['firstCreated'] = d['firstCreated'].dt.date
#Average only the numeric columns; text columns cannot be averaged
d = d.groupby(kcol, as_index=False).mean(numeric_only=True)

In [None]:
d.tail(10)

In [None]:
dfmI = dfm.copy(deep=True)
dfnI = d.copy(deep=True)

### Merge Market & News data

In [None]:
len(dfnI)

In [None]:
#"data_prep" will do Merge of Market & News Datasets
merged_dataset = data_prep(dfmI,dfnI)

In [None]:
#Look at missing value summary
merged_dataset.count()

In [None]:
#Checking for NAs
merged_dataset.isna().sum()

In [None]:
# Function to plot time series data
import matplotlib.pyplot as plt

def plot_vs_time(data_frame, column, calculation='mean', span=10):
    # Aggregate the column per day
    grouped = data_frame.groupby('firstCreated')[column]
    if calculation == 'mean':
        group_temp = grouped.mean().reset_index()
    elif calculation == 'count':
        group_temp = grouped.count().reset_index()
    elif calculation == 'nunique':
        group_temp = grouped.nunique().reset_index()
    else:
        raise ValueError("calculation must be 'mean', 'count' or 'nunique'")
    # Smooth only the value column; the date column cannot be averaged
    group_temp[column] = group_temp[column].ewm(span=span).mean()
    fig = plt.figure(figsize=(10,3))
    plt.plot(group_temp['firstCreated'], group_temp[column])
    plt.xlabel('Time')
    plt.ylabel(column)
    plt.title('%s versus time' %column)

### Look at NA values of returnsOpenPrevMktres10. We will re-check this graph after interpolating the missing (NA) values in the returns variables.

In [None]:
d = merged_dataset[merged_dataset['assetCode'] == 'AAPL.O']
import matplotlib.pyplot as plt
#print(d[d['returnsOpenPrevMktres10'].isna()]['assetCode'])
print(d.head())
plt.plot(d['time'], d['returnsOpenPrevMktres10'])


#### Impute Market Returns data using  (NOCB) imputation: https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4

### Fill nan values in MktRes columns using NOCB interpolation.

In [None]:
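# Fill NaN values in the MktRes columns with nearest-neighbour interpolation.
# Note: method='nearest' takes the closest valid value on either side, so it only
# approximates NOCB; a strict NOCB fill would use .bfill().
market_fill = merged_dataset.copy(deep=True)
column_market = ['returnsClosePrevMktres1','returnsOpenPrevMktres1','returnsClosePrevMktres10', 'returnsOpenPrevMktres10']
column_raw = ['returnsClosePrevRaw1', 'returnsOpenPrevRaw1','returnsClosePrevRaw10', 'returnsOpenPrevRaw10']
for col in column_market:
    # Assign back rather than using inplace=True on a column selection, which may act on a temporary copy
    market_fill[col] = market_fill[col].interpolate(method='nearest')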
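
For comparison, a strict Next Observation Carried Backward (NOCB) fill can be sketched with a per-asset backward fill. This is an illustration on toy data, not the notebook's method; the frame `demo` and the column `ret` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy returns for two assets with gaps (hypothetical values).
demo = pd.DataFrame({
    "assetCode": ["A", "A", "A", "B", "B", "B"],
    "ret": [np.nan, 0.02, 0.03, 0.10, np.nan, np.nan],
})

# NOCB: each missing value takes the next observed value of the same asset.
# Grouping by asset keeps one company's returns from leaking into another's.
demo["ret_nocb"] = demo.groupby("assetCode")["ret"].bfill()
print(demo)
```

Trailing gaps (the last two rows of asset B here) stay missing because there is no later observation to carry backward, so they need a separate decision.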

In [None]:
market_fill[market_fill['assetCode'] == 'AAPL.O'].head()

### Checking to make sure NAs are filled

In [None]:
d = market_fill[market_fill['assetCode'] == 'AAPL.O']
import matplotlib.pyplot as plt
#print(d[d['returnsOpenPrevMktres10'].isna()]['assetCode'])
print(d.head())
plt.plot(d['time'], d['returnsOpenPrevMktres10'])

In [None]:
#Checking for NAs before extracting data for only the selected companies
market_fill.isna().sum()

# 2. Data Reduction to include only 10 Companies

##### Almost all of them have NAs

# 1. Feature Engineering Contd.

#### Bin numerical to binary when there is not much data for factors.


#### Create new Features to account for Time Series auto Correlation between rows.

In [None]:
def rsiFunc(prices, n=14):
    deltas = np.diff(prices)
    seed = deltas[:n+1]
    up = seed[seed>=0].sum()/n
    down = -seed[seed<0].sum()/n
    rs = up/down
    rsi = np.zeros_like(prices)
    rsi[:n] = 100. - 100./(1.+rs)

    for i in range(n, len(prices)):
        delta = deltas[i-1] # cause the diff is 1 shorter

        if delta>0:
            upval = delta
            downval = 0.
        else:
            upval = 0.
            downval = -delta

        up = (up*(n-1) + upval)/n
        down = (down*(n-1) + downval)/n

        rs = up/down
        rsi[i] = 100. - 100./(1.+rs)

    return rsi
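
The recursive update in `rsiFunc`, `up = (up*(n-1) + upval)/n`, is an exponentially weighted mean with `alpha = 1/n`. Under that observation, a vectorized pandas variant could look like the sketch below. The name `rsi_wilder` is hypothetical, and the early values will not match `rsiFunc` exactly because the seeding differs.

```python
import pandas as pd

def rsi_wilder(close: pd.Series, n: int = 14) -> pd.Series:
    """RSI with Wilder-style smoothing, expressed through pandas ewm."""
    delta = close.diff()
    gain = delta.clip(lower=0)       # positive moves, zero otherwise
    loss = (-delta).clip(lower=0)    # size of negative moves, zero otherwise
    # adjust=False gives the recursion avg = avg*(n-1)/n + value/n
    avg_gain = gain.ewm(alpha=1.0 / n, adjust=False, min_periods=n).mean()
    avg_loss = loss.ewm(alpha=1.0 / n, adjust=False, min_periods=n).mean()
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)

rising = pd.Series(range(1, 31), dtype=float)
falling = pd.Series(range(30, 0, -1), dtype=float)
zigzag = pd.Series([10.0, 11.0] * 15)
print(rsi_wilder(rising).iloc[-1], rsi_wilder(falling).iloc[-1])
```

A series that only rises ends at 100 and one that only falls ends at 0, which matches the usual reading of the indicator.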

In [None]:
#'AAPL.O',  'CSCO.O', 'IBM.N', 'INTC.O', 'MSFT.O', 'ORCL.O', 'ORCL.N'
full_dataset = pd.DataFrame()
for assetCode in ['AAPL.O', 'BA.N', 'BAC.N', 'BCS.N', 'BP.N', 'CSCO.O', 'F.N',
       'GE.N', 'GS.N', 'HBC.N', 'IBM.N', 'INTC.O', 'JPM.N', 'MS.N',
       'MSFT.O', 'ORCL.O', 'RDSa.N', 'RDSb.N', 'RTP.N', 'XOM.N', 'DB.N',
       'BBL.N', 'RIO.N', 'C.N', 'HSBC.N', 'ORCL.N']:
   # Gather asset specific data (copy so the new columns are not written to a slice of market_fill)
   df = market_fill[market_fill["assetCode"] == assetCode].copy()
   df['rsi20D'] = rsiFunc(df['close'].values, 20)
   #Calculating all the trend variables for this assetCode
   df['volume10DMA'] = df["volume"].rolling(window=10).mean() 
   #Create new feature for close price moving average.
   df['close10DMA'] = df['close'].rolling(window=10).mean()
   #Note: despite the '_20DMA' suffix, the news features below use a 7-row rolling window
   df['sentenceCount_20DMA'] = df['sentenceCount'].rolling(window=7).mean()
   df['firstMentionSentence_20DMA'] = df['firstMentionSentence'].rolling(window=7).mean()
   df['relevance_20DMA'] = df['relevance'].rolling(window=7).mean()
   df['sentimentWordCount_20DMA'] = df['sentimentWordCount'].rolling(window=7).mean()
   df['sentimentClass_20DMA'] = df['sentimentClass'].rolling(window=7).mean()
   #Exponential Moving Average
   ewma = pd.Series.ewm
   df['close_10EMA'] = ewma(df["close"], span=10).mean()

   #Moving average convergence divergence (MACD) is a trend-following momentum indicator that shows the 
   #relationship between two moving averages of prices.
   #The MACD is calculated by subtracting the 26-day exponential moving average (EMA) from the 12-day EMA
   #ref. https://www.investopedia.com/terms/m/macd.asp
   df['close_26EMA'] = ewma(df["close"], span=26).mean()
   df['close_12EMA'] = ewma(df["close"], span=12).mean()
   df['MACD'] = df['close_12EMA'] - df['close_26EMA']

   #Bollinger Bands are a type of statistical chart characterizing the prices and 
   #volatility over time of a financial instrument or commodity, using a formulaic method 
   #propounded by John Bollinger in the 1980s. Financial traders employ these charts as 
   #a methodical tool to inform trading decisions, control automated trading systems, 
   #or as a component of technical analysis. Bollinger Bands display a graphical band 
   #(the envelope maximum and minimum of moving averages, similar to
   #Keltner or Donchian channels) and volatility (expressed by the width of the envelope) 
   #in one two-dimensional chart.
   #ref. https://en.wikipedia.org/wiki/Bollinger_Bands 
   no_of_std = 2
   df['MA_10MA'] = df['close'].rolling(window=10).mean()
   df['MA_10MA_std'] = df['close'].rolling(window=10).std() 
   df['MA_10MA_BB_high'] = df['MA_10MA'] + no_of_std * df['MA_10MA_std']
   df['MA_10MA_BB_low'] = df['MA_10MA'] - no_of_std * df['MA_10MA_std']
   #DataFrame.append was removed in pandas 2.0; pd.concat works on old and new versions
   full_dataset = pd.concat([full_dataset, df])



full_dataset["firstCreated"] = full_dataset["time"].dt.date
full_dataset['Year'] = full_dataset.time.dt.year
full_dataset['Month'] = full_dataset.time.dt.month
full_dataset['Day'] = full_dataset.time.dt.day
full_dataset['Week'] = full_dataset.time.dt.week
full_dataset = full_dataset.sort_values('firstCreated')
#ref. https://www.fidelity.com/learning-center/trading-investing/technical-analysis/technical-indicator-guide/RSI
#full_dataset['rsi10D'] = rsiFunc(full_dataset['close'].values, 10)

#The Relative Strength Index (RSI) to identify general trend.

#full_dataset[full_dataset['urgency'].isna()]["assetName"].unique()
#Look for missing values in news and which Companies have NAs
print(full_dataset.isna().sum())
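
The per-asset loop can also be expressed with `groupby(...).transform`, which keeps every rolling window inside a single asset. The sketch below is one possible formulation rather than the notebook's code; the helper name `add_trend_features` and the toy frame `demo` are assumptions.

```python
import pandas as pd

def add_trend_features(prices: pd.DataFrame) -> pd.DataFrame:
    """Add MACD and 10-row Bollinger Bands, computed separately per assetCode."""
    out = prices.sort_values(["assetCode", "time"]).copy()
    close = out.groupby("assetCode")["close"]
    # MACD: 12-period EMA minus 26-period EMA
    ema12 = close.transform(lambda s: s.ewm(span=12).mean())
    ema26 = close.transform(lambda s: s.ewm(span=26).mean())
    out["MACD"] = ema12 - ema26
    # Bollinger Bands: moving average plus/minus two standard deviations
    ma10 = close.transform(lambda s: s.rolling(window=10).mean())
    sd10 = close.transform(lambda s: s.rolling(window=10).std())
    out["MA_10MA_BB_high"] = ma10 + 2 * sd10
    out["MA_10MA_BB_low"] = ma10 - 2 * sd10
    return out

# Toy data: asset A has a flat price, asset B rises by 1 each day.
demo = pd.DataFrame({
    "assetCode": ["A"] * 12 + ["B"] * 12,
    "time": list(pd.date_range("2013-01-01", periods=12)) * 2,
    "close": [5.0] * 12 + [float(i) for i in range(1, 13)],
})
features = add_trend_features(demo)
print(features.tail(3))
```

The first nine rows of each asset have no Bollinger values because the 10-row window is not yet full.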

 


In [None]:
#Fill NaN values in the news variables
#Note: this step uses linear interpolation, not NOCB
 
#firstCreated is left out: it is a date column derived from time, so it has no gaps and cannot be interpolated
column_news = ["urgency", "takeSequence","companyCount","marketCommentary",
"sentenceCount","firstMentionSentence","relevance",
"sentimentClass","sentimentWordCount","noveltyCount24H",
"noveltyCount3D","noveltyCount5D",
"noveltyCount7D","volumeCounts24H","volumeCounts3D"
,"volumeCounts5D","volumeCounts7D"]
#column_raw = ['returnsClosePrevRaw1', 'returnsOpenPrevRaw1','returnsClosePrevRaw10', 'returnsOpenPrevRaw10']
for col in column_news:
    #Assign back rather than using inplace=True on a column selection
    full_dataset[col] = full_dataset[col].interpolate(method='linear')

#### Spot outlier Companies for close/open price difference

market_train_df = full_dataset.copy(deep=True)
market_train_df['price_diff'] = market_train_df['close'] - market_train_df['open']
grouped = market_train_df.groupby('time').agg({'price_diff': ['std', 'min']}).reset_index()
grouped.sort_values(('price_diff', 'std'), ascending=False)[:10].head()


g = grouped.sort_values(('price_diff', 'std'), ascending=False)[:10]
g['min_text'] = 'Maximum price drop: ' + (-1 * g['price_diff']['min']).astype(str)
trace = go.Scatter(
    x = g['time'].dt.strftime(date_format='%Y-%m-%d').values,
    y = g['price_diff']['std'].values,
    mode='markers',
    marker=dict(
        size = g['price_diff']['std'].values,
        color = g['price_diff']['std'].values,
        colorscale='Portland',
        showscale=True
    ),
    text = g['min_text'].values
    #text = f"Maximum price drop: {g['price_diff']['min'].values}"
    #g['time'].dt.strftime(date_format='%Y-%m-%d').values
)
data = [trace]

layout= go.Layout(
    autosize= True,
    title= 'Top 10 days by standard deviation of price change within a day',
    hovermode= 'closest',
    yaxis=dict(
        title= 'price_diff',
        ticklen= 5,
        gridwidth= 2,
    ),
    showlegend= False
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig,filename='scatter2010')

### No surprising outliers to remove in terms of prices

#### Plot Correlations of Market Data

In [None]:
market_train_df = full_dataset.copy(deep=True)

In [None]:
import seaborn as sns
market_train_df["target_stockMoveUp"] = market_train_df.returnsOpenNextMktres10 > 0
#rsi20D is the RSI column created above; the duplicate returnsClosePrevMktres10 entry is dropped
columns_corr_market = ['volume', 'returnsClosePrevRaw1','returnsOpenPrevRaw1',\
           'returnsClosePrevMktres1','returnsOpenPrevMktres1','returnsOpenPrevRaw10','MA_10MA_BB_high', 'MA_10MA_BB_low'\
          , 'returnsClosePrevMktres10', 'returnsOpenPrevMktres10', 'volume10DMA', 'close10DMA','rsi20D','close_10EMA','MACD',\
                       'target_stockMoveUp']
colormap = plt.cm.RdBu
plt.figure(figsize=(18,15))
sns.heatmap(market_train_df[columns_corr_market].astype(float).corr(), linewidths=0.1, vmax=1.0, vmin=-1., square=True, cmap=colormap, linecolor='white', annot=True)
plt.title('Pair-wise correlation')

**Conclusions:**

1. Stock volumes have some positive impact on the Stock movement.
2. All of the returns variables have positive correlation with each other.
3. Close & Open prices have strong correlation.

In [None]:
columns_corr_merge = ['volume','open','close','returnsOpenPrevRaw1','returnsOpenPrevMktres1'\
                     ,'returnsOpenPrevRaw10' ,'returnsOpenPrevMktres10','target_stockMoveUp'\
                      ,'noveltyCount7D','volumeCounts7D' ]
colormap = plt.cm.RdBu
# Scaling 
df = market_train_df[columns_corr_merge]
mins = np.min(df, axis=0)
maxs = np.max(df, axis=0)
rng = maxs - mins
df = 1 - ((maxs - df) / rng)

plt.figure(figsize=(18,15))
sns.heatmap(df.astype(float).corr(), linewidths=0.1, vmax=1.0, vmin=-1., square=True, cmap=colormap, linecolor='white', annot=True)
plt.title('Pair-wise correlation market and news')

**Conclusions:**

1. Stock volumes have positive correlation with Stock Movement variables and the news Novelty/Volume.
2. Novelty of the content seems to have correlation with Stock Closing Price.
3. Novelty indicators and volume counts are positively correlated with each other.
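
One point worth noting about the min-max scaling applied before this heatmap: Pearson correlation is unchanged by a positive linear rescaling of each column, so the scaled and unscaled frames give the same correlation matrix. A small check on synthetic data (all names and values below are made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
raw = pd.DataFrame({
    "volume": rng.normal(1e6, 2e5, 200),
    "close": rng.normal(50, 5, 200),
})
# A column that is deliberately related to close.
raw["noveltyCount7D"] = 0.5 * raw["close"] + rng.normal(0, 1, 200)

# Min-max scaling, equivalent to 1 - ((maxs - df) / rng) in the cell above.
scaled = (raw - raw.min()) / (raw.max() - raw.min())

same = np.allclose(raw.corr(), scaled.corr())
print("correlations identical after scaling:", same)
```

The scaling is therefore harmless here, but it is not what drives the pattern in the heatmap.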


# 3. Split Train and Test

In [None]:
df1 = full_dataset.copy(deep=True)
y = df1.returnsOpenNextMktres10 > 0
# Rest of the dataset is X
cols = ['Year', 'Day', 'Week', 'Month', 'volume', 'close', 'open' \
        ,'returnsOpenPrevMktres10',  'rsi20D', 'MA_10MA_BB_high','MA_10MA_BB_low','MACD'\
        ,'volumeCounts7D','takeSequence','assetCodeT'\
        ,'sentenceCount_20DMA','firstMentionSentence_20DMA','relevance_20DMA'\
        ,'sentimentWordCount','sentimentClass_20DMA'
       ]
X = df1[cols] 
train_size = int(len(X) * 0.66)
X_train, X_test = X[0:train_size], X[train_size:len(X)]
y_train, y_test = y[0:train_size], y[train_size:len(X)]

print('Observations: %d' % (len(X)))
print('Training Observations: %d' % (len(X_train)))
print('Testing Observations: %d' % (len(X_test)))

# The target is binary


# 5. Fit Classifier after Cross Validation using GridSearchCV

In [None]:
import xgboost as xgb
from sklearn.tree import DecisionTreeClassifier 
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
import numpy as np
 
xgb_model = xgb.XGBClassifier()
 

## Training Accuracy

In [None]:
import matplotlib.pylab as plt
from matplotlib import pyplot
from xgboost import plot_importance
from sklearn.metrics  import accuracy_score
xgb_model.fit(X_train,y_train)
plot_importance(xgb_model, max_num_features=20) # top 20 most important features
plt.show()
accuracy_score(y_train, xgb_model.predict(X_train))

#DTC.fit(X_train,y_train)
#plot_importance(DTC, max_num_features=20) # top 10 most important features
#plt.show()
#accuracy_score(DTC.predict(X_train),y_train)

# **6. Cross validation - Rolling Cross Validation for TimeSeries data.**


In [None]:
#brute force scan for all parameters, here are the tricks
#usually max_depth is 6,7,8
#learning rate is around 0.05, but small changes may make big diff
#tuning min_child_weight subsample colsample_bytree can have 
#much fun of fighting against overfit 
#n_estimators is how many round of boosting
#finally, ensemble xgboost with multiple seeds may reduce variance
from sklearn.model_selection import TimeSeriesSplit
tss = TimeSeriesSplit(n_splits=10).split(X_train)

params = {
       'min_child_weight': [1, 5, 10],
       'gamma': [1.5],
       # 'subsample': [0.6, 0.8, 1.0],
       # 'colsample_bytree': [0.6, 0.8, 1.0],
        'max_depth': [3, 4, 5,6],
        'n_estimators': [100]
        }

gsearch = RandomizedSearchCV(xgb_model, params, n_jobs=5,cv=tss, scoring='accuracy',verbose=1, refit=True)

# **7. Use RandomizedSearchCV to tune hyperparameters.**

In [None]:
import warnings
warnings.filterwarnings('ignore')
gsearch.fit(X_train, y_train)

# 8. Validation set accuracy.

In [None]:
#from sklearn.metrics import accuracy_score
#trust your CV!
#grid_scores_ was removed from scikit-learn; best_params_ and best_score_ describe the best candidate
best_parameters, score = gsearch.best_params_, gsearch.best_score_
print('Cross validation Accuracy score:', score)
for param_name in sorted(best_parameters.keys()):
    print("%s: %r" % (param_name, best_parameters[param_name]))
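
The search pattern used here can be tried end to end on synthetic data. The sketch below is illustrative only: it swaps in a `DecisionTreeClassifier` so that it runs without XGBoost, and the data, parameter ranges and variable names are assumptions.

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit
from sklearn.tree import DecisionTreeClassifier

# Synthetic, time-ordered toy data: the label depends on the first feature.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions={"max_depth": [1, 2, 3, 4], "min_samples_leaf": [1, 5, 10]},
    n_iter=5,                        # number of sampled parameter settings
    cv=TimeSeriesSplit(n_splits=4),  # pass the splitter itself, not a one-shot generator
    scoring="accuracy",
    random_state=0,
)
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```

Passing the `TimeSeriesSplit` object as `cv` lets the search be refitted repeatedly, whereas a generator from `.split(...)` is consumed after one use.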

# 9. Predict Test  and calculate accuracy

In [None]:
accuracy_score(y_test, gsearch.predict(X_test))

# Build NeuralNet

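cat_cols = ['sentimentClass','assetCodeT', 'relevance','takeSequence']
num_cols = ['volume', 'close', 'open', 'returnsClosePrevRaw1', 'returnsOpenPrevRaw1', 'returnsClosePrevMktres1', \
                    'returnsOpenPrevMktres1', 'returnsClosePrevRaw10', 'returnsOpenPrevRaw10', 'returnsClosePrevMktres10',\
                    'returnsOpenPrevMktres10','noveltyCount7D','volumeCounts7D']

#The XGBoost feature matrix does not contain all of these columns, so rebuild the
#training frame for the network from the full dataset, using the same chronological cut
X_train = df1[cat_cols + num_cols][0:train_size].copy()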

# Handling Categorical Variables

def encode(encoder, x):
    len_encoder = len(encoder)
    try:
        id = encoder[x]
    except KeyError:
        id = len_encoder
    return id

encoders = [{} for cat in cat_cols]


for i, cat in enumerate(cat_cols):
    print('encoding %s ...' % cat, end=' ')
    encoders[i] = {l: id for id, l in enumerate(X_train.loc[:, cat].astype(str).unique())}
    X_train[cat] = X_train[cat].astype(str).apply(lambda x: encode(encoders[i], x))
    print('Done')

embed_sizes = [len(encoder) + 1 for encoder in encoders] #+1 for possible unknown assets
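
To make the role of the `+ 1` concrete, here is a small standalone sketch of the same encoding idea: known labels get indices `0..k-1`, and any label unseen during training maps to the extra index `k`. The helper name `build_encoder` and the sample codes are hypothetical.

```python
def build_encoder(values):
    """Map each distinct label to an integer, in order of first appearance."""
    return {label: idx for idx, label in enumerate(dict.fromkeys(values))}

def encode(encoder, x):
    """Return the label's index, or len(encoder) for labels not seen in training."""
    return encoder.get(x, len(encoder))

train_codes = ["AAPL.O", "BA.N", "AAPL.O", "GE.N"]
encoder = build_encoder(train_codes)

known = encode(encoder, "BA.N")
unknown = encode(encoder, "TSLA.O")   # not present in the training labels
embed_size = len(encoder) + 1         # one extra embedding row for unknown labels
print(known, unknown, embed_size)
```

Without the extra row, an unseen asset at prediction time would index past the end of the embedding table.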

from sklearn.preprocessing import StandardScaler
 
X_train[num_cols] = X_train[num_cols].fillna(0)
print('scaling numerical columns')

#Standardise the numerical columns (missing values were already filled with 0 above)
scaler = StandardScaler()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])

from keras.models import Model
from keras.layers import Input, Dense, Embedding, Concatenate, Flatten, BatchNormalization
from keras.losses import binary_crossentropy, mse

categorical_inputs = []
for cat in cat_cols:
    categorical_inputs.append(Input(shape=[1], name=cat))

categorical_embeddings = []
for i, cat in enumerate(cat_cols):
    categorical_embeddings.append(Embedding(embed_sizes[i], 10)(categorical_inputs[i]))

#categorical_logits = Concatenate()([Flatten()(cat_emb) for cat_emb in categorical_embeddings])
categorical_logits = Flatten()(categorical_embeddings[0])
categorical_logits = Dense(32,activation='relu')(categorical_logits)

#The input width must match the number of numerical columns fed to the model
numerical_inputs = Input(shape=(len(num_cols),), name='num')
numerical_logits = numerical_inputs
numerical_logits = BatchNormalization()(numerical_logits)

numerical_logits = Dense(128,activation='relu')(numerical_logits)
numerical_logits = Dense(64,activation='relu')(numerical_logits)

logits = Concatenate()([numerical_logits,categorical_logits])
logits = Dense(64,activation='relu')(logits)
out = Dense(1, activation='sigmoid')(logits)

model = Model(inputs = categorical_inputs + [numerical_inputs], outputs=out)
model.compile(optimizer='adam',loss=binary_crossentropy)

# Train NN Model

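from keras.callbacks import EarlyStopping, ModelCheckpoint

check_point = ModelCheckpoint('model.hdf5',verbose=True, save_best_only=True)
early_stop = EarlyStopping(patience=5,verbose=True)

#The model has one named input per categorical column plus 'num', so pass a dict keyed by input name
train_inputs = {cat: X_train[cat].values for cat in cat_cols}
train_inputs['num'] = X_train[num_cols].values

#Validate on the most recent 10% of the training rows instead of on the training data itself;
#Keras takes validation_split from the end of the arrays, which suits time-ordered data
model.fit(train_inputs,y_train.astype(int).values,
          validation_split=0.1,
          epochs=2,
          verbose=True,
          callbacks=[early_stop,check_point]) 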