# PROJECT OVERVIEW

James M. Irving

Flatiron Full Time Data Science 021119 Cohort


>- **Note: this notebook (`Capstone_Project_part1_time_series.ipynb`) is one of 3 project notebooks:** 
    1. **Tweet Preprocessing and NLP Classifications**
    2. Time Series Modeling of S&P 500
    3. Combined NLP + Time Series Modeling with S&P500 and Trump's Tweets

# 📚 ABSTRACT:

> Stock Market prices are notoriously difficult to model, but advances in machine learning algorithms in recent years provide renewed possibilities in accurately modeling market performance. One notable addition in modern machine learning is that of Natural Language Processing (NLP). For those modeling a specific stock, performing NLP feature extraction and analysis on the collection of news headlines, shareholder documents, or social media postings that mention the company can provide additional information about the human/social elements to predicting market behaviors. These insights could not be captured by historical price data and technical indicators alone.

> President Donald J. Trump is one of the most prolific users of social media, specifically Twitter, using it as a direct messaging channel to his followers, avoiding the traditional filtering and restriction that normally controls the public influence of the President of the United States. An additional element of the presidency that Trump has avoided is that of financial transparency and divesting of assets. Historically, this is done in order to avoid conflicts of interest, apparent or actual. The president is also known to target companies directly with his Tweets, advocating for specific changes/decisions by the company, or simply airing his grievances. This leads to the natural question: how much influence *does* President Trump exert over the financial markets? 

> To explore this question, we built several types of models, using the S&P 500 as our market index. First, we built a classification model to predict the change in stock price 60 minutes after a tweet. We trained Word2Vec embeddings on President Trump's tweets since his election and used them as the embedding layer for LSTM and GRU neural networks. 

> We next built a baseline time series regression model, using historical price data alone to predict price by trading hour. We then built upon this, adding several technical indicators of market performance as additional features. 
Finally, we combined the predictions of our classification model with several other metrics about the tweets (sentiment scores, number of retweets/favorites, uppercase-to-lowercase ratio, etc.) to see whether combining all of these sources of information could explain even more of the variance in stock market prices. 



## Table of Contents Legend

- 📚: Info sections
- 🕹: Coding sections
    - 🎛: **Hyperparameters to tune**
    - 🏋️: fitting models
    - 🤔: New Things to Potentially Try 
- Use the Table of Contents view on the left sidebar to find the relevant sections (button looks like a bulleted list)

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#PROJECT-OVERVIEW" data-toc-modified-id="PROJECT-OVERVIEW-1">PROJECT OVERVIEW</a></span></li><li><span><a href="#📚-ABSTRACT:" data-toc-modified-id="📚-ABSTRACT:-2">📚 ABSTRACT:</a></span><ul class="toc-item"><li><span><a href="#Table-of-Contents-Legend" data-toc-modified-id="Table-of-Contents-Legend-2.1">Table of Contents Legend</a></span></li><li><span><a href="#📚-MAIN-QUESTION:" data-toc-modified-id="📚-MAIN-QUESTION:-2.2">📚 MAIN QUESTION:</a></span><ul class="toc-item"><li><span><a href="#REFERENCES-/-INSPIRATION:" data-toc-modified-id="REFERENCES-/-INSPIRATION:-2.2.1">REFERENCES / INSPIRATION:</a></span></li></ul></li><li><span><a href="#OVERVIEW-OF-DATA/FEATURES-USED-PER-MODEL" data-toc-modified-id="OVERVIEW-OF-DATA/FEATURES-USED-PER-MODEL-2.3">OVERVIEW OF DATA/FEATURES USED PER MODEL</a></span><ul class="toc-item"><li><span><a href="#FINAL-MODEL:-COMBINING-STOCK-MARKET-DATA,--NLP-CLASSIFICATION,-AND-OTHER-TWEET-METRICS" data-toc-modified-id="FINAL-MODEL:-COMBINING-STOCK-MARKET-DATA,--NLP-CLASSIFICATION,-AND-OTHER-TWEET-METRICS-2.3.1">FINAL MODEL: COMBINING STOCK MARKET DATA,  NLP CLASSIFICATION, AND OTHER TWEET METRICS</a></span></li></ul></li><li><span><a href="#OSEMN-FRAMEWORK" data-toc-modified-id="OSEMN-FRAMEWORK-2.4">OSEMN FRAMEWORK</a></span><ul class="toc-item"><li><span><a href="#OBTAIN" data-toc-modified-id="OBTAIN-2.4.1"><a href="#OBTAIN">OBTAIN</a></a></span></li><li><span><a href="#SCRUB" data-toc-modified-id="SCRUB-2.4.2"><a href="#SCRUB">SCRUB</a></a></span></li><li><span><a href="#EXPLORE-/-VISUALIZE" data-toc-modified-id="EXPLORE-/-VISUALIZE-2.4.3"><a href="#EXPLORE/VISUALIZE">EXPLORE / VISUALIZE</a></a></span></li><li><span><a href="#MODELING-(Initial)" data-toc-modified-id="MODELING-(Initial)-2.4.4"><a href="#INITIAL-MODELING">MODELING (Initial)</a></a></span></li><li><span><a href="#iNTERPRETATION" 
data-toc-modified-id="iNTERPRETATION-2.4.5">iNTERPRETATION</a></span></li></ul></li></ul></li><li><span><a href="#OBTAIN" data-toc-modified-id="OBTAIN-3">OBTAIN</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#📚-DATA-SOURCES:" data-toc-modified-id="📚-DATA-SOURCES:-3.0.1">📚 DATA SOURCES:</a></span></li></ul></li></ul></li><li><span><a href="#SCRUB" data-toc-modified-id="SCRUB-4">SCRUB</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Importing-Functions" data-toc-modified-id="Importing-Functions-4.0.1">Importing Functions</a></span></li></ul></li></ul></li><li><span><a href="#FORECASTING-STOCK-MARKET-PRICE" data-toc-modified-id="FORECASTING-STOCK-MARKET-PRICE-5">FORECASTING STOCK MARKET PRICE</a></span><ul class="toc-item"><li><span><a href="#Loading-&amp;-Processing-Stock-Data-(SCRUB)" data-toc-modified-id="Loading-&amp;-Processing-Stock-Data-(SCRUB)-5.1">Loading &amp; Processing Stock Data (SCRUB)</a></span></li><li><span><a href="#Load-in-raw-text-file-with-minute-resolutin-S&amp;P-500-prices" data-toc-modified-id="Load-in-raw-text-file-with-minute-resolutin-S&amp;P-500-prices-5.2">Load in raw text file with minute-resolutin S&amp;P 500 prices</a></span></li></ul></li><li><span><a href="#BOOKMARK-06/18" data-toc-modified-id="BOOKMARK-06/18-6">BOOKMARK 06/18</a></span><ul class="toc-item"><li><span><a href="#Model-1:-Using-Price-as-only-feature" data-toc-modified-id="Model-1:-Using-Price-as-only-feature-6.1">Model 1: Using Price as only feature</a></span><ul class="toc-item"><li><span><a href="#Model-1-Summary" data-toc-modified-id="Model-1-Summary-6.1.1">Model 1 Summary</a></span></li></ul></li><li><span><a href="#Model-2:-Stock-Price-+-Technical-Indicators" data-toc-modified-id="Model-2:-Stock-Price-+-Technical-Indicators-6.2">Model 2: Stock Price + Technical Indicators</a></span><ul class="toc-item"><li><span><a href="#Technical-Indicator-Details" 
data-toc-modified-id="Technical-Indicator-Details-6.2.1">Technical Indicator Details</a></span></li><li><span><a href="#Model-2:-Summary" data-toc-modified-id="Model-2:-Summary-6.2.2">Model 2: Summary</a></span></li></ul></li></ul></li><li><span><a href="#COMBINING-TWEET-STATS,-NLP-CLASSIFICATION,-AND-MARKET-DATA" data-toc-modified-id="COMBINING-TWEET-STATS,-NLP-CLASSIFICATION,-AND-MARKET-DATA-7">COMBINING TWEET STATS, NLP CLASSIFICATION, AND MARKET DATA</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Loading-in-NLP-Model-for-Predictions" data-toc-modified-id="Loading-in-NLP-Model-for-Predictions-7.0.1">Loading in NLP Model for Predictions</a></span></li><li><span><a href="#Get-Predictions-for-Hour-Binned-Tweets" data-toc-modified-id="Get-Predictions-for-Hour-Binned-Tweets-7.0.2">Get Predictions for Hour-Binned Tweets</a></span></li></ul></li><li><span><a href="#Model-3:-Stock-Price-+-Indicators-+-NLP-Preds-&amp;-Tweet-Features" data-toc-modified-id="Model-3:-Stock-Price-+-Indicators-+-NLP-Preds-&amp;-Tweet-Features-7.1">Model 3: Stock Price + Indicators + NLP Preds &amp; Tweet Features</a></span><ul class="toc-item"><li><span><a href="#Finalize-colums-for-final-model" data-toc-modified-id="Finalize-colums-for-final-model-7.1.1">Finalize colums for final model</a></span></li><li><span><a href="#Model-3-Summary" data-toc-modified-id="Model-3-Summary-7.1.2">Model 3 Summary</a></span></li></ul></li><li><span><a href="#Model-X:-XGB-Regression-+-Feature-Importance" data-toc-modified-id="Model-X:-XGB-Regression-+-Feature-Importance-7.2">Model X: XGB Regression + Feature Importance</a></span><ul class="toc-item"><li><span><a href="#Model-Interpretation" data-toc-modified-id="Model-Interpretation-7.2.1">Model Interpretation</a></span></li><li><span><a href="#Model-X-Summary" data-toc-modified-id="Model-X-Summary-7.2.2">Model X Summary</a></span></li></ul></li></ul></li><li><span><a href="#Summary" 
data-toc-modified-id="Summary-8">Summary</a></span></li></ul></div>

## 📚 MAIN QUESTION:

> #### **Can the Twitter activity of Donald Trump explain fluctuations in the stock market?**

**We will combine traditional stock market forecasting with Natural Language Processing and word embeddings from President Trump's tweets to predict fluctuations in the stock market (using the S&P 500 as the index).**

- Question 1: Can we predict if stock prices will go up or down at a fixed time point, based on the language in Trump's tweets?
    - [NLP Model 0](#Model-0)<br><br>
    
>- **Question 2: How well can we explain stock market fluctuations using only historical price data?**
    - [Stock Market Model 1](#Model-1:-Using-Price-as-only-feature)<br><br>
>- **Question 3: Does adding technical market indicators to our model improve its ability to predict stock prices?**
    - [Stock Market Model 2](#Model-2:-Stock-Price-+-Technical-Indicators)<br><br>
- Question 4: Can the NLP predictions from Question 1, combined with all of the features from Question 3, as well as additional information regarding Trump's Tweets explain even more of the stock market fluctuations?
    - Stock Market Model 3
    - Stock Market Model X<br><br>

    

### REFERENCES / INSPIRATION:

1. **Stanford Scientific Poster Using NLP ALONE to predict if stock prices increase or decrease 5 mins after Trump tweets.**  
    - [Poster PDF LINK](http://cs229.stanford.edu/proj2017/final-posters/5140843.pdf)
    - Best accuracy was X, goal 1 is to create a classifier on a longer timescale with superior results.
    

2. **TowardsDataScience Blog Post on "Using the latest advancements in deep learning to predict stock price movements."** 
     - [Blog Post link](https://towardsdatascience.com/aifortrading-2edd6fac689d)

## OVERVIEW OF DATA/FEATURES USED PER MODEL


#### TWITTER DATA - CLASSIFICATION MODEL
**Trained Word2Vec embeddings on a collection of Donald Trump's tweets.**
- Used the skip-gram method with negative sampling to best represent infrequently used words.
    
**Classified tweets based on change in stock price (delta_price)**
- Calculated price change from time of tweet to 60 mins later.
    - "No Change" if the absolute delta price was < \\$0.05 
    - "Increase" if delta price was > +\\$0.05
    - "Decrease" if delta price was < -\\$0.05
    
*NOTE: This model's predictions will become a feature in our final model.*
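
The labeling rule above can be sketched as a small function. This is a minimal illustration, not the project's actual labeling code: the function name `classify_delta_price` is hypothetical, and assigning a change of exactly 0.05 dollars to "No Change" is an assumption, since the rule above does not say which class the boundary belongs to.

```python
def classify_delta_price(delta_price, threshold=0.05):
    """Label a price change (in dollars) as Increase, Decrease, or No Change."""
    if delta_price > threshold:
        return "Increase"
    if delta_price < -threshold:
        return "Decrease"
    # Assumption: changes of exactly +/- threshold count as "No Change"
    return "No Change"


# Example: three hypothetical 60-minute price changes
labels = [classify_delta_price(d) for d in (0.12, -0.30, 0.01)]
print(labels)
```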


#### STOCK MARKET (S&P 500) DATA :
##### TIME SERIES FORECASTING USING MARKET DATA
**Model 1: Use price alone to forecast hourly price.**
- Train model using time sequences of 7-trading-hours (1 day) to predict the following hour. 
    * [x] ~~SARIMAX model~~
    * [x] LSTM neural network 
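
To make the windowing idea concrete, here is a minimal NumPy sketch of turning a price series into (input sequence, next value) pairs. The helper name `make_windows` is hypothetical; the notebook itself builds its sequences through `ji.make_train_test_series_gens`, whose internals are not shown here.

```python
import numpy as np


def make_windows(series, x_window=7):
    """Split a 1-D series into (X, y) pairs: x_window past values -> next value."""
    values = np.asarray(series, dtype=float)
    # Each row of X holds x_window consecutive values; y is the value that follows
    X = np.array([values[i:i + x_window] for i in range(len(values) - x_window)])
    y = values[x_window:]
    return X, y


# Ten hourly prices produce three windows of 7 hours each
prices = np.arange(10.0)
X, y = make_windows(prices, x_window=7)
print(X.shape, y)
```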

**Model 2: Use price combined with technical indicators.**
    * LSTM neural network
- **Calculate 7 technical indicators from S&P 500 hourly closing price.**
    * [x] 7 days moving average 
    * [x] 21 days moving average
    * [x] exponential moving average
    * [x] momentum
    * [x] Bollinger bands
    * [x] MACD
    
  

### FINAL MODEL: COMBINING STOCK MARKET DATA,  NLP CLASSIFICATION, AND OTHER TWEET METRICS

- **FEATURES FOR FINAL MODEL:**<br><br>
    - **Stock Data:**
        * [x] 7 days moving average 
        * [x] 21 days moving average
        * [x] exponential moving average
        * [x] momentum
        * [x] Bollinger bands
        * [x] MACD<br><br>
    - **Tweet Data:**
        * [x] 'delta_price' prediction classification for body of tweets from prior hour (model 0)
        * [x] Number of tweets in hour
        * [x] Ratio of uppercase to lowercase characters (case_ratio)
        * [x] Total # of favorites for the tweets
        * [x] Total # of retweets for the tweets
        * [x] Sentiment Scores:
            - [x] Individual negative, neutral, and positive sentiment scores
            - [x] Compound Sentiment Score (combines all 3)
            - [x] sentiment class (+/- compound score)    
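
As an illustration of one of the tweet features listed above, the case ratio could be computed roughly as follows. This is a sketch under assumptions: the project's exact definition of `case_ratio` is not shown in this notebook, so counting only letters and returning `None` when a tweet has no lowercase letters are choices made here for illustration.

```python
def case_ratio(text):
    """Ratio of uppercase to lowercase letters in a string.

    Returns None when the text has no lowercase letters (assumed convention).
    """
    n_upper = sum(ch.isupper() for ch in text)
    n_lower = sum(ch.islower() for ch in text)
    if n_lower == 0:
        return None
    return n_upper / n_lower


print(case_ratio("MAKE America great"))
```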

## OSEMN FRAMEWORK

### [OBTAIN](#OBTAIN)
- Obtaining 1-min resolution stock market data (S&P 500 Index)
- Obtain batch of historical tweets by President Trump 

### [SCRUB](#SCRUB)
1. **[Tweets](#TRUMP'S-TWEETS)**
    - Preprocessing for Natural Language Processing<br><br>
2. **[Stock Market](#Loading-&-Processing-Stock-Data-(SCRUB))**
    - Time frequency conversion
    - Technical Indicator Calculation
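
A rough sketch of the time frequency conversion step, using plain pandas resampling on synthetic data. Note that this swaps in a simpler technique than the notebook's own helpers (`ji.set_timeindex_freq` and `ji.custom_BH_freq`, which are not reproduced here): it bins minute data into 60-minute intervals that start at 09:30 and keeps the last price in each bin. It assumes a pandas version that supports the `offset` argument of `resample`.

```python
import pandas as pd

# Synthetic minute-level prices for two hours of one trading day (illustrative data only)
idx = pd.date_range("2019-06-03 09:30", periods=120, freq="min")
minute_prices = pd.Series(range(120), index=idx, name="BidClose", dtype=float)

# 60-minute bins shifted by 30 minutes so that they start at 09:30, 10:30, ...
hourly_close = minute_prices.resample("60min", offset="30min").last()
print(hourly_close)
```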

### [EXPLORE / VISUALIZE](#EXPLORE/VISUALIZE)
- [Tweet Delta Price Classes](#Delta-Price-Classes) 
- [NLP Figures / Example Tweets](#Natural-Language-Processing)
- [S&P 500 Price](#Model-1:-Using-Price-as-only-feature)
- [S&P 500 Technical Indicators](#Technical-Indicator-Details)

### [MODELING (Initial)](#INITIAL-MODELING)
- [Delta-Stock-Price NLP Classifier](#TWEET-DELTA-PRICE-CLASSIFICATON)
- S&P 500 Neural Network (price only)

### iNTERPRETATION 
- Delta-Stock-Price NLP Models
    - Model 0A Summary
    - Model 0B Summary
    
- Stock-Market-Forecasting
    - Model 1 Summary
    - Model 2 Summary
    - Model 3 Summary
    - Model 4 Summary
- Final Summary

# OBTAIN

### 📚 DATA SOURCES:

* **All Donald Trump tweets from 12/01/2016 (pre-inauguration) through 08/23/2018**
    * Extracted from http://www.trumptwitterarchive.com/

* **Minute-resolution data for the S&P500 covering the same time period.**
    * IVE S&P500 Index from http://www.kibot.com/free_historical_data.aspx
    
    
* NOTE: Both sources required manual extraction. One-minute historical stock data and batches of historical tweets are difficult to obtain without paying for developer memberships costing \\$150-\\$2000 per month. 

# SCRUB

### Importing Functions

In [None]:
## Personal Functions 
# Note: the bs_ds package on pip is not compatible with python 3.8+
# Therefore I am importing it locally instead

## IMPORT CUSTOM CAPSTONE FUNCTIONS
%load_ext autoreload
%autoreload 2 
import bsds as bs
from bsds import ihelp,ihelp_menu,reload, inspect_variables
from bsds import functions_combined_BEST as ji
from bsds import functions_io as io
# from bsds.imports import *


## The Basics 
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import glob,sys,os,time
#Set pd.set_options for tweet visibility
pd.set_option('display.max_colwidth',200)
pd.set_option('display.max_columns',50)


print(f"Pandas v: \t {pd.__version__:>5}")
print(f"Numpy v: \t{np.__version__:>5}")
print(f"Seaborn v:\t {sns.__version__:>5}")

In [None]:
## NLP TOOLS
import nltk
nltk.download('vader_lexicon')

In [None]:
# Import plotly and cufflinks for iplots
import plotly.express as px
import cufflinks as cf
from plotly import graph_objs as go
from plotly.offline import iplot
cf.go_offline()

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
## IMPORT CONVENIENCE FUNCTIONS
from pprint import pprint
# import qgrid
import json

In [None]:
file_dict = io.def_filename_dictionary(load_prior=False, save_directory=True)
# file_dict = ji.load_filename_directory()

# FORECASTING STOCK MARKET PRICE

In [None]:
DOWNLOAD_STOCK_DATA = False


In [None]:
## DOWNLOAD THE DATA IF REQUESTED
if DOWNLOAD_STOCK_DATA:
    print('[i] Downloading data sets...')
    stock_df = bs.data.download_stock_data()
    
else: 
    ## Or Load in Most Recent Files
    print('[i] Loading most recent data sets...')
    
    ## Check for pre-existing files
    files_glob = glob.glob('data/*.csv.gz')
    stock_files = list(filter(lambda x: 'ive_minute' in x, files_glob))


    ## Get Time Files Modified 
    STOCK_FILES = {f:pd.to_datetime(time.ctime(os.path.getmtime(f))) for f in stock_files}

    ## Get the most recently modified file using idxmax
    recent_stocks = pd.Series(STOCK_FILES).idxmax()
    
    ## Load in the csvs with datetime indices
    stock_df = pd.read_csv(recent_stocks,parse_dates=['datetime'],index_col='datetime')
    
    ## Sort timeseries
    stock_df.sort_index(inplace=True)
    
    
## Display Preview of DFs
stock_df

In [None]:
# ts = stock_df[['BidClose']].sort_index()#.reset_index()#.asfreq('T')
# px.line(ts)

In [None]:
stock_df.tail()

## Loading & Processing Stock Data (SCRUB)

In [None]:
# # DISPLAY CODE TO BE USED BELOW TO LOAD AND PROCESS STOCK DATA
# functions_used=['ji.load_processed_stock_data', # This script combines the original 4 used:
#                 'ji.load_raw_stock_data_from_txt',
#                 'ji.set_timeindex_freq','ji.custom_BH_freq',
#                'ji.get_technical_indicators']

# ji.ihelp_menu(functions_used)

## Load in raw text file with minute-resolutin S&P 500 prices

In [None]:
try:
    stock_df
except NameError:
    # stock_df has not been defined yet, so load the processed data
    print('loading')
    stock_df = ji.load_processed_stock_data()

In [None]:
stock_df

In [None]:
reload(ji)
# fname = file_dict['stock_df']['raw_csv_file']
raw_stock_df = ji.load_raw_stock_data_from_txt(filename = 'data/ive_minute_tick_bidask_API_2021_06-18-21.csv.gz',#fname,
                                               verbose=2)


In [None]:
stock_df = ji.get_technical_indicators(raw_stock_df,make_price_from='BidClose')
stock_df

In [None]:
stock_df.index

In [None]:
# # ## Plot Time Series and Calculate Technical Indicators 
# # fig = ji.plotly_time_series(raw_stock_df,x_col='Date', y_col='BidClose',as_figure=True)
# def custom_BH_freq():
#     CBH = pd.tseries.offsets.CustomBusinessHour(start='09:30',end='16:30')
#     return CBH

# stock_df.resample(custom_BH_freq()).first()

In [None]:
# def get_technical_indicators(dataset,make_price_from='BidClose'):

#     dataset['price'] = dataset[make_price_from].copy()
#     if dataset.index.freq == custom_BH_freq():
#         days = get_day_window_size_from_freq(dataset)#,freq='CBH')
#     else:
#         days = get_day_window_size_from_freq(dataset)
        
#     # Create 7 and 21 days Moving Average
#     dataset['ma7'] = dataset['price'].rolling(window=7*days).mean()
#     dataset['ma21'] = dataset['price'].rolling(window=21*days).mean()
    
#     # Create MACD
#     dataset['26ema'] = dataset['price'].ewm(span=26*days).mean()
# #     dataset['12ema'] = pd.ewma(dataset['price'], span=12)
#     dataset['12ema'] = dataset['price'].ewm(span=12*days).mean()

#     dataset['MACD'] = (dataset['12ema']-dataset['26ema'])

#     # Create Bollinger Bands
# #     dataset['20sd'] = pd.stats.moments.rolling_std(dataset['price'],20)
#     dataset['20sd'] = dataset['price'].rolling(20*days).std()
#     dataset['upper_band'] = dataset['ma21'] + (dataset['20sd']*2)
#     dataset['lower_band'] = dataset['ma21'] - (dataset['20sd']*2)
    
#     # Create Exponential moving average
#     dataset['ema'] = dataset['price'].ewm(com=0.5).mean()
    
#     # Create Momentum
#     dataset['momentum'] = dataset['price']-days*1
    
#     return dataset

# # del raw_stock_df
# # stock_df

In [None]:
# SELECT DESIRED COLUMNS (copy to avoid modifying a view of the original frame)
stock_df = stock_df[[
    'price','ma7','ma21','26ema','12ema','MACD','20sd',
    'upper_band','lower_band','ema','momentum']].copy()

# Drop rows with NaNs (e.g., from the rolling-window warm-up period)
stock_df.dropna(inplace=True)
ji.index_report(stock_df)
display(stock_df.head(3))

In [None]:
# ihelp_menu([ji.train_test_split_by_last_days,
#            ji.make_scaler_library,
#            ji.transform_cols_from_library,
#            ji.make_train_test_series_gens])

# BOOKMARK 06/18

In [None]:
## SPECIFY # OF TRAINING TEST DAYS 
num_test_days=5
num_train_days = 260
### SPECIFY Number of days included in each X_sequence (each prediction)
days_for_x_window=1

# Calculate number of rows to bin for x_windows
periods_per_day = ji.get_day_window_size_from_freq( stock_df, ji.custom_BH_freq() )


## Get the number of rows for x_window 
x_window = periods_per_day * days_for_x_window#data_params['days_for_x_window'] 
print(f'X_window size = {x_window} -- ({days_for_x_window} day(s) * {periods_per_day} rows/day)\n')

## Train-test-split by the # of days
df_train, df_test = ji.train_test_split_by_last_days(stock_df,
                                                     periods_per_day =periods_per_day, 
                                                     num_test_days   = num_test_days,
                                                     num_train_days  = num_train_days,
                                                     verbose=1, iplot=True)
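
The day-based split above can be pictured with the following minimal sketch. It is an assumption about the general idea (hold out the final days for testing and train on the days immediately before them), not a copy of `ji.train_test_split_by_last_days`; the helper name `split_by_last_days` is hypothetical.

```python
import pandas as pd


def split_by_last_days(df, periods_per_day, num_test_days, num_train_days):
    """Hold out the final num_test_days for testing; train on the num_train_days before them."""
    n_test = periods_per_day * num_test_days
    n_train = periods_per_day * num_train_days
    test = df.iloc[-n_test:]
    train = df.iloc[-(n_test + n_train):-n_test]
    return train, test


# Ten synthetic "days" of 7 rows each: train on 5 days, test on the last 2
demo = pd.DataFrame({"price": range(70)})
train, test = split_by_last_days(demo, periods_per_day=7,
                                 num_test_days=2, num_train_days=5)
print(len(train), len(test))
```

Because the split is chronological, no test rows precede any training rows, which avoids leaking future prices into training.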



## Model 1: Using Price as only feature

In [None]:
###### RESCALE DATA USING MinMaxScalers FIT ON TRAINING DATA's COLUMNS ######
display(df_train.head(2).style.set_caption('df_train - pre-scaling'))

scaler_library, df_train = ji.make_scaler_library(df_train, transform=True, verbose=1)

df_test = ji.transform_cols_from_library(df_test, col_list=None,
                                         scaler_library=scaler_library,
                                         inverse=False)
display(df_train.head(2).style.set_caption('df_train - post-scaling'))

# Show transformed dataset
# display( df_train.head(2).round(3).style.set_caption('training data - scaled'))

# Create timeseries generators
train_generator, test_generator = ji.make_train_test_series_gens( 
    df_train['price'], df_test['price'], 
    x_window=x_window,n_features=1,batch_size=1, verbose=0)
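
The key point of the scaling step is that the scaler is fit on the training data only and then reused for the test data. A minimal NumPy sketch of that idea follows; the function names are hypothetical stand-ins, and the notebook's own `ji.make_scaler_library` and `ji.transform_cols_from_library` are not reproduced here.

```python
import numpy as np


def fit_min_max(train_values):
    """Return the (min, max) learned from the training data only."""
    train_values = np.asarray(train_values, dtype=float)
    return train_values.min(), train_values.max()


def transform(values, lo, hi):
    """Scale values relative to the training range (training data maps to [0, 1])."""
    return (np.asarray(values, dtype=float) - lo) / (hi - lo)


def inverse_transform(scaled, lo, hi):
    """Map scaled values back to the original price units."""
    return np.asarray(scaled, dtype=float) * (hi - lo) + lo


train_price = [100.0, 110.0, 120.0]
test_price = [115.0, 130.0]

lo, hi = fit_min_max(train_price)            # fit on training data only
test_scaled = transform(test_price, lo, hi)  # test values may fall outside [0, 1]
print(test_scaled)
```

Test prices above the training maximum scale to values greater than 1, which is expected when the scaler never sees the test data.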

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras import optimizers
from tensorflow.keras.layers import Bidirectional, Dense, LSTM, Dropout
from tensorflow.keras.regularizers import l2

# Specifying input shape: (timesteps per sample, features per timestep)
n_input = x_window
n_features = 1 # just stock Price

print(f'input shape: ({n_input},{n_features})')
input_shape=(n_input, n_features)

# Create model architecture
model1 = Sequential()
model1.add(LSTM(units=50, input_shape =input_shape,return_sequences=True))#,  kernel_regularizer=l2(0.01),recurrent_regularizer=l2(0.01),
#     model.add(Dropout(0.2))
model1.add(LSTM(units=50, activation='relu'))
model1.add(Dense(1))

model1.compile(loss=ji.my_rmse, metrics=['acc'],
              optimizer=optimizers.Nadam())

display(model1.summary())


In [None]:
## FIT MODEL
dashes = '---'*20
print(f"{dashes}\n\tFITTING MODEL:\n{dashes}")

## set params
epochs=5

# override keras warnings
ji.quiet_mode(True,True,True)

# Instantiating clock timer
clock = bs.Clock()
clock.tic('')

# Fit the model
history = model1.fit_generator(train_generator,
                               epochs=epochs,
                               verbose=2, 
                               use_multiprocessing=True,
                               workers=3)


clock.toc('')


model_key = "model_1"
hist_fname = file_dict[model_key]['fig_keras_history.ext']
summary_fname = file_dict[model_key]['model_summary']

# eval_results = ji.evaluate_model_plot_history(model1, train_generator, test_generator)
ji.evaluate_regression_model(model1,history,
                             train_generator=train_generator,
                             test_generator=test_generator,
                            true_test_series=df_test['price'],
                            true_train_series =df_train['price'],
                             save_history=True,history_filename=hist_fname,
                             save_summary=True, summary_filename=summary_fname)

In [None]:
### PREFER NEW WAY - GET DF_MODEL FIRST THEN GET EVALUATE_REGRESSION INFORMATION?
## Get true vs pred data as a dataframe and iplot
df_model1 = ji.get_model_preds_df(model1, 
                                  test_generator = test_generator,
                                  true_train_series = df_train['price'],
                                  true_test_series = df_test['price'],
                                  include_train_data=True,
                                  inverse_tf = True, 
                                  scaler = scaler_library['price'],
                                  preds_from_gen = True, 
                                  preds_from_train_preds = True, 
                                  preds_from_test_preds = True,
                                  iplot = True,
                                  verbose=0)
#                                   subplot_mode='lines+markers')
    
# Get evaluation metrics
df_results1, dfs_results1, df_shifted1 =\
ji.compare_eval_metrics_for_shifts(df_model1['true_test_price'],
                                   df_model1['pred_from_gen'],
                                   shift_list=np.arange(-4,4,1),
                                   true_train_series_to_add=df_model1['true_train_price'],
                                   display_results=True,
                                   display_U_info=True,
                                   return_results=True,
                                   return_styled_df=True,
                                   return_shifted_df=True)
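
One reason to compare metrics across shifts is to check whether a model is merely echoing the previous price: if the error is lowest when predictions are shifted back by one period, the forecast is effectively lagging the true series. The sketch below illustrates that idea with RMSE only. It is an assumption about the general approach, not the implementation of `ji.compare_eval_metrics_for_shifts`, and it does not reproduce the Theil's U output; the helper name `rmse_for_shifts` is hypothetical.

```python
import numpy as np
import pandas as pd


def rmse_for_shifts(true, pred, shifts=range(-4, 4)):
    """RMSE between the true series and predictions shifted by each number of periods."""
    results = {}
    for s in shifts:
        err = (true - pred.shift(s)).dropna()
        results[s] = float(np.sqrt((err ** 2).mean()))
    return pd.Series(results, name="rmse")


true = pd.Series([1.0, 2.0, 4.0, 7.0, 11.0, 16.0])
# A naive "model" that simply repeats the previous observed value
lagged_pred = true.shift(1).bfill()

scores = rmse_for_shifts(true, lagged_pred)
print(scores)
```

For this naive forecast the error is smallest at a shift of -1, which is the signature of a lagging prediction.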

In [None]:
ji.dict_dropdown(file_dict)

In [None]:
reload(ji)
save_model=True
ji.save_model_dfs(file_dict, 'model_1',df_model1,dfs_results1,df_shifted1)

filename_prefix = file_dict['model_1']['base_filename']
if save_model ==True:
    model_1_output_files = ji.save_model_weights_params(model1,
                                 filename_prefix=filename_prefix,
                                 auto_increment_name=True,
                                 auto_filename_suffix=True, 
                                 suffix_time_format='%m-%d-%y_%I%M%p',
                                 save_model_layer_config_xlsx=True)

### Model 1 Summary

## Model 2: Stock Price + Technical Indicators

### Technical Indicator Details

In [None]:
# SELECT DESIRED COLUMNS (copy to avoid modifying a view of the original frame)
stock_df = stock_df[[
    'price','ma7','ma21','26ema','12ema','MACD','20sd',
    'upper_band','lower_band','ema','momentum']].copy()

# Drop rows with NaNs (e.g., from the rolling-window warm-up period)
stock_df.dropna(inplace=True)
ji.index_report(stock_df)
display(stock_df.head(3))

In [None]:
fig =ji.plotly_technical_indicators(stock_df)

1. **7 and 21 day moving averages**
```python
df['ma7'] = df['price'].rolling(window = 7 ).mean() #window of 7 if daily data
df['ma21'] = df['price'].rolling(window = 21).mean() #window of 21 if daily data
```    
2. **MACD(Moving Average Convergence Divergence)**

> Moving Average Convergence Divergence (MACD) is a trend-following momentum indicator that shows the relationship between two moving averages of a security’s price. The MACD is calculated by subtracting the 26-period Exponential Moving Average (EMA) from the 12-period EMA.

>The result of that calculation is the MACD line. A nine-day EMA of the MACD, called the "signal line," is then plotted on top of the MACD line, which can function as a trigger for buy and sell signals. 

> Traders may buy the security when the MACD crosses above its signal line and sell - or short - the security when the MACD crosses below the signal line. Moving Average Convergence Divergence (MACD) indicators can be interpreted in several ways, but the more common methods are crossovers, divergences, and rapid rises/falls.  - _[from Investopedia](https://www.investopedia.com/terms/m/macd.asp)_

```python
df['26ema'] = df['price'].ewm(span=26).mean()
df['12ema'] = df['price'].ewm(span=12).mean()
df['MACD'] = (df['12ema']-df['26ema'])
```
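
The quoted description also mentions a signal line, a nine-period EMA of the MACD. A minimal sketch of computing it alongside the MACD is shown below. The column name `signal` and the helper `add_macd` are hypothetical; the notebook's feature set does not include a signal line, so this is purely illustrative.

```python
import pandas as pd


def add_macd(df, price_col="price"):
    """Add 12/26-period EMAs, the MACD line, and a 9-period signal line."""
    out = df.copy()
    out["12ema"] = out[price_col].ewm(span=12).mean()
    out["26ema"] = out[price_col].ewm(span=26).mean()
    out["MACD"] = out["12ema"] - out["26ema"]
    # Signal line: EMA of the MACD itself (hypothetical column, not used by the models)
    out["signal"] = out["MACD"].ewm(span=9).mean()
    return out


# Steadily rising synthetic prices: the faster EMA sits above the slower one
demo = pd.DataFrame({"price": [float(p) for p in range(100, 160)]})
macd_df = add_macd(demo)
print(macd_df[["MACD", "signal"]].tail(3))
```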
3. **Exponentially weighted moving average**
```python
dataset['ema'] = dataset['price'].ewm(com=0.5).mean()
```

4. **Bollinger bands**
    > "Bollinger Bands® are a popular technical indicators used by traders in all markets, including stocks, futures and currencies. There are a number of uses for Bollinger Bands®, including determining overbought and oversold levels, as a trend following tool, and monitoring for breakouts. There are also some pitfalls of the indicators. In this article, we will address all these areas."
> Bollinger bands are composed of three lines. One of the more common calculations of Bollinger Bands uses a 20-day simple moving average (SMA) for the middle band. The upper band is calculated by taking the middle band and adding twice the daily standard deviation, the lower band is the same but subtracts twice the daily std. - _[from Investopedia](https://www.investopedia.com/trading/using-bollinger-bands-to-gauge-trends/)_

    - Bollinger Upper Band:<br>
    $BOLU = MA(TP, n) + m * \sigma[TP, n]$<br><br>
    - Bollinger Lower Band:<br>
    $BOLD = MA(TP, n) - m * \sigma[TP, n]$
    - Where:
        - $MA$ = moving average
        - $TP$ (typical price) = $(High + Low + Close) / 3$
        - $n$ is the number of days in the smoothing period
        - $m$ is the number of standard deviations
        - $\sigma[TP, n]$ = standard deviation over the last $n$ periods of $TP$

```python
# Create Bollinger Bands
dataset['20sd'] = dataset['price'].rolling(20).std()
dataset['upper_band'] = dataset['ma21'] + (dataset['20sd']*2)
dataset['lower_band'] = dataset['ma21'] - (dataset['20sd']*2)
```


5. **Momentum**
> "Momentum is the rate of acceleration of a security's price or volume – that is, the speed at which the price is changing. Simply put, it refers to the rate of change on price movements for a particular asset and is usually defined as a rate. In technical analysis, momentum is considered an oscillator and is used to help identify trend lines." - _[from Investopedia](https://www.investopedia.com/articles/technical/081501.asp)_

    - $Momentum = V - V_x$
    - Where:
        - $V$ = latest price
        - $V_x$ = closing price $x$ days ago
        - $x$ = number of days ago

```python
# Create Momentum
dataset['momentum'] = dataset['price']-1
```
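
The snippet above subtracts a constant from the price, which is not the same as the $V - V_x$ formula. For comparison, here is a minimal sketch of the formula itself, assuming that $x$ counts rows of the price series (days for daily data, hours for hourly data). The function name `momentum` is chosen here for illustration.

```python
import pandas as pd


def momentum(price, x=1):
    """Price change relative to the price x periods earlier (V - V_x)."""
    return price - price.shift(x)


price = pd.Series([10.0, 12.0, 11.0, 15.0])
mom = momentum(price, x=2)
print(mom)
```

The first `x` values are NaN because there is no earlier price to compare against.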

In [None]:
## SPECIFY # OF TRAINING TEST DAYS 
reload(ji)
num_test_days=10
num_train_days=260
### SPECIFY Number of days included in each X_sequence (each prediction)
days_for_x_window=5

# Calculate number of rows to bin for x_windows
periods_per_day = ji.get_day_window_size_from_freq( stock_df, ji.custom_BH_freq() )


## Get the number of rows for x_window 
x_window = periods_per_day * days_for_x_window#data_params['days_for_x_window'] 
print(f'X_window size = {x_window} -- ({days_for_x_window} day(s) * {periods_per_day} rows/day)\n')

## Train-test-split by the # of days
df_train, df_test = ji.train_test_split_by_last_days(stock_df,
                                                     periods_per_day =periods_per_day, 
                                                     num_test_days   = num_test_days,
                                                     num_train_days  = num_train_days,
                                                     verbose=1, iplot=True)



In [None]:
###### RESCALE DATA USING MinMaxScalers FIT ON TRAINING DATA's COLUMNS ######
display(df_train.head(2).style.set_caption('df_train - pre-scaling'))

scaler_library, df_train = ji.make_scaler_library(df_train, transform=True, verbose=1)

df_test = ji.transform_cols_from_library(df_test, col_list=None,
                                         scaler_library=scaler_library,
                                         inverse=False)
display(df_train.head(2).style.set_caption('df_train - post-scaling'))

# Show transformed dataset
# display( df_train.head(2).round(3).style.set_caption('training data - scaled'))

# Create timeseries generators
train_generator, test_generator = ji.make_train_test_series_gens( 
    df_train['price'], df_test['price'], 
    x_window=x_window,n_features=1,batch_size=1, verbose=0)

In [None]:
## Make new time series generators with all stock_indicators for X_sequences
train_generator, test_generator = ji.make_train_test_series_gens(
    train_data_series=df_train,
    test_data_series=df_test,
    y_cols='price',
    x_window=x_window,
    n_features=len(df_train.columns),
    batch_size=1, verbose=1)
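
Each sample fed to the LSTM is a block of `x_window` consecutive rows containing every feature column, and the target is the price at the row that follows the block. The sketch below builds such samples with NumPy. It assumes this windowing convention for illustration, and `make_windows` is a hypothetical helper, not the generator that `ji.make_train_test_series_gens` returns.

```python
import numpy as np


def make_windows(data, target_col, x_window):
    """Build (X, y): X[i] is rows i..i+x_window-1, y[i] is the target at row i+x_window."""
    data = np.asarray(data, dtype=float)
    n_samples = len(data) - x_window
    X = np.stack([data[i:i + x_window] for i in range(n_samples)])
    y = data[x_window:, target_col]
    return X, y


# 10 time steps, 3 features; column 0 plays the role of 'price'
toy_series = np.arange(30, dtype=float).reshape(10, 3)
X_toy, y_toy = make_windows(toy_series, target_col=0, x_window=4)
print(X_toy.shape, y_toy.shape)  # (6, 4, 3) (6,)
```

The last two dimensions of `X_toy` correspond to the `(n_input, n_features)` input shape passed to the first LSTM layer.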

In [None]:
# Create keras model from model_params
# import functions_combined_BEST as ji
from tensorflow.keras.models import Sequential
from tensorflow.keras import optimizers  # used by model2.compile below
from tensorflow.keras.layers import Bidirectional, Dense, LSTM, Dropout
from IPython.display import display
from tensorflow.keras.regularizers import l2

# Specify input shape: (timesteps per sample, features per timestep)
n_input = x_window #model_params['input_params']['n_input']
n_features = len(df_train.columns) # model_params['input_params']['n_features']

print(f'input shape: ({n_input},{n_features})')
input_shape=(n_input, n_features)

# Create model architecture
model2 = Sequential()
model2.add(LSTM(units=50, input_shape =input_shape,return_sequences=True))#,  kernel_regularizer=l2(0.01),recurrent_regularizer=l2(0.01),
# model2.add(Dropout(0.2))
model2.add(LSTM(units=50, activation='relu'))
model2.add(Dense(1))

model2.compile(loss=ji.my_rmse, metrics=['acc',ji.my_rmse],
              optimizer=optimizers.Nadam())

display(model2.summary())

In [None]:
epochs=5

clock = bs.Clock()
print('---'*20)
print('\tFITTING MODEL:')
print('---'*20,'\n')     

# start the timer
clock.tic('')

# Fit the model
history = model2.fit_generator(train_generator,epochs=epochs) 
clock.toc('')

model_key = "model_2"
hist_fname = file_dict[model_key]['fig_keras_history.ext']
summary_fname = file_dict[model_key]['model_summary']

# eval_results = ji.evaluate_model_plot_history(model1, train_generator, test_generator)
ji.evaluate_regression_model(model2,history,
                             train_generator=train_generator,
                             test_generator=test_generator,
                            true_test_series=df_test['price'],
                            true_train_series =df_train['price'],
                             save_history=True,history_filename=hist_fname,
                             save_summary=True, summary_filename=summary_fname)

In [None]:
### PREFER NEW WAY - GET DF_MODEL FIRST THEN GET EVALUATE_REGRESSION INFORMATION?
## Get true vs pred data as a dataframe and iplot
df_model2 = ji.get_model_preds_df(model2, 
                                  test_generator=test_generator,
                                  true_train_series = df_train['price'],
                                  true_test_series = df_test['price'],
                                  x_window=x_window,
                                  n_features=len(df_train.columns),
                                  scaler=scaler_library['price'],
                                  preds_from_gen=True, 
                                  inverse_tf=True,
                                  iplot=True)

# Compare predictions if predictions timebins shifted
df_results2, dfs_results2, df_shifted2 =\
ji.compare_eval_metrics_for_shifts(df_model2['true_test_price'],
                                   df_model2['pred_from_gen'],
                                   shift_list=np.arange(-4,5,1),
                                   true_train_series_to_add=df_model2['true_train_price'],
                                   display_results=True,
                                   return_styled_df=True,
                                   display_U_info=False,
                                   return_shifted_df=True,
                                   return_results=True)
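
One reason to compare metrics across shifts: a model that mostly echoes the previous price scores best when its predictions are moved back one step. The sketch below illustrates this with RMSE only. `rmse_for_shifts` is a hypothetical helper and is not the implementation of `ji.compare_eval_metrics_for_shifts`.

```python
import numpy as np
import pandas as pd


def rmse_for_shifts(true_series, pred_series, shifts):
    """Return the RMSE between the true series and the predictions shifted by each offset."""
    results = {}
    for s in shifts:
        shifted = pred_series.shift(int(s))
        # Keep only rows where both series have values after shifting
        pair = pd.concat([true_series, shifted], axis=1, keys=["true", "pred"]).dropna()
        results[int(s)] = float(np.sqrt(((pair["true"] - pair["pred"]) ** 2).mean()))
    return pd.Series(results, name="rmse")


true_demo = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
# Predictions that simply repeat the previous true value (a lagging model)
pred_demo = true_demo.shift(1).bfill()
shift_rmse = rmse_for_shifts(true_demo, pred_demo, np.arange(-2, 3))
print(shift_rmse)
```

If the lowest error sits at a shift of -1 rather than 0, the predictions are tracking the last observed price more than anticipating the next one.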

In [None]:
##SAVING DFS
ji.save_model_dfs(file_dict,'model_2',
               df_model=df_model2,
              df_results=dfs_results2,
              df_shifted=df_shifted2)

### Model 2: Summary

# COMBINING TWEET STATS, NLP CLASSIFICATION, AND MARKET DATA

1. Load the stock data in CBH (custom business hour) form
2. Load the Twitter data without NLP processing
3. Create `time_interval_bins`
    - from the *stock CBH* time index
4. Check `twitter_df` for any tweets from the hour before each stock timestamp
5. Extract the 'content' column and the retweet/favorite counts
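
The steps above can be sketched on toy data with a pandas `IntervalIndex`. This is a simplified illustration under stated assumptions, not the `ji` pipeline: the prices, tweet text, and counts are made up, and the column names `int_bins`, `group_content`, `num_tweets`, and `total_retweet_count` are borrowed from the cells that follow.

```python
import pandas as pd

# Toy hourly stock index (stand-in for the CBH-indexed stock data)
stock_times = pd.to_datetime(["2019-08-01 09:30", "2019-08-01 10:30", "2019-08-01 11:30"])
stocks = pd.DataFrame({"price": [2950.0, 2955.0, 2948.0]}, index=stock_times)

tweets = pd.DataFrame(
    {"content": ["tweet a", "tweet b", "tweet c"], "retweet_count": [10, 20, 5]},
    index=pd.to_datetime(["2019-08-01 08:45", "2019-08-01 09:10", "2019-08-01 10:50"]),
)

# Each interval covers the hour ending at a stock timestamp: (t - 1h, t]
intervals = pd.IntervalIndex.from_arrays(
    stock_times - pd.Timedelta(hours=1), stock_times, closed="right"
)

# Give every tweet the integer code of the interval it falls in (-1 if none)
tweets["int_bins"] = intervals.get_indexer(tweets.index)
stocks["int_bins"] = range(len(stocks))

# Collapse tweets so there is one row per bin
tweets_grouped = (
    tweets[tweets["int_bins"] >= 0]
    .groupby("int_bins")
    .agg(group_content=("content", " ".join),
         num_tweets=("content", "size"),
         total_retweet_count=("retweet_count", "sum"))
    .reset_index()
)

# Left merge keeps every stock row, including hours without tweets
combined = stocks.reset_index().merge(tweets_grouped, on="int_bins", how="left")
combined["num_tweets"] = combined["num_tweets"].fillna(0).astype(int)
print(combined)
```

A left merge on the stock bins leaves nulls in the tweet columns for hours with no tweets, which is why those columns need to be filled after merging.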


    

In [None]:
file_dict=ji.def_filename_dictionary(load_prior=False,save_directory=True)

In [None]:
# LOAD IN FULL STOCK DATASET using ClosingBig S&P500 WITH INDEX.FREQ=CBH
fname = file_dict['stock_df']['stock_df_with_indicators']
full_df = ji.load_processed_stock_data(processed_data_filename=fname)

# SELECT DESIRED COLUMNS
stock_df = full_df[[
    'price','ma7','ma21','26ema','12ema','MACD',
    '20sd','upper_band','lower_band','ema','momentum'
]].copy()  # copy so the column added below does not modify a view of full_df

stock_df.head()

stock_df['date_time'] = stock_df.index.to_series()
ji.index_report(stock_df)

stock_df.sort_index(inplace=True)
display(stock_df.head(2),stock_df.tail(2))
del full_df

In [None]:
## LOAD IN RAW TWITTER DATA, NO PROCESSING
twitter_df= ji.load_raw_twitter_file(filename='data/trumptwitterarchive_export_iphone_only__08_23_2019.csv',
                                     date_as_index=True,
                                     rename_map={'text': 'content', 'created_at': 'date'})
twitter_df = ji.check_twitter_df(twitter_df,text_col='content',remove_duplicates=True, remove_long_strings=True)


In [None]:
# MAKE TIME INTERVALS BASED ON BUSINESS HOUR START (09:30-10:30)
clock = bs.Clock(verbose=1)
clock.tic()

time_intervals= \
ji.make_time_index_intervals(stock_df,
                             col='date_time', 
                             closed='right',
                             return_interval_dicts=False) 
clock.lap('time_intervals created.')


## USE THE TIME INDEX TO FILTER OUT TWEETS FROM THE HOUR PRIOR
twitter_df, bin_codes = ji.bin_df_by_date_intervals(twitter_df ,time_intervals)
stock_df, bin_codes_stock = ji.bin_df_by_date_intervals(stock_df, time_intervals, column='date_time')

clock.lap('bins added to dataframes')
# display(twitter_df.head(2), stock_df.head(2))

## COLLAPSE DFs BY CODED BINS
twitter_grouped = ji.collapse_df_by_group_index_col(twitter_df,
                                                    group_index_col='int_bins',
                                                    drop_orig=True,
                                                    verbose=0)

stocks_grouped = ji.collapse_df_by_group_index_col(stock_df,
                                                    drop_orig=True,
                                                    group_index_col='int_bins', 
                                                  verbose=0)

clock.toc('collapsed dfs to _grouped')
display(twitter_grouped.head(3),stocks_grouped.head(3))

In [None]:
ihelp_menu(ji.merge_stocks_and_tweets)

In [None]:
## STOCKS AND TWEETS 
df_combined = ji.merge_stocks_and_tweets(stocks_grouped, 
                                      twitter_grouped,
                                      on='int_bins',how='left',
                                      show_summary=True)

In [None]:
ji.column_report(df_combined)

In [None]:
## Check for and address new null values
ji.check_null_small(df_combined);
cols_to_fill_zeros = ['num_tweets','total_retweet_count','total_favorite_count']
for col in cols_to_fill_zeros:
    idx_null = ji.find_null_idx(df_combined, column=col)
    df_combined.loc[idx_null,col] = 0

cols_to_fill_blank_str = ['group_content','source','tweet_times','is_retweet']
for col in cols_to_fill_blank_str:
    idx_null = ji.find_null_idx(df_combined, column=col)
    df_combined.loc[idx_null, col] = ""
ji.check_null_small(df_combined);

In [None]:
ji.dict_dropdown(file_dict)

fname = file_dict['df_combined']['pre_nlp']
df_combined.to_csv(fname)
print(fname)

In [None]:
## Add nlp
df_nlp = ji.full_twitter_df_processing(df_combined,'group_content',force=True)
ji.column_report(df_nlp, as_qgrid=True)

In [None]:
df_nlp.head()

In [None]:
## Use case ratio null values as index to replace values
idx_null= ji.check_null_small(df_nlp,null_index_column='case_ratio')
df_nlp.loc[idx_null,'case_ratio'] = 0.0
ji.check_null_small(df_nlp)

## replace sentiment_class, set =-1
cols_to_replace_misleading_values = ['sentiment_class']
for col in cols_to_replace_misleading_values:
    df_nlp.loc[idx_null,col] = -1

## remap sentiment class
sent_class_mapper = {'neg':0,
                     -1:1,
                    'pos':2}
df_nlp['sentiment_class'] = df_nlp['sentiment_class'].apply(lambda x: sent_class_mapper[x])

bool_cols_to_ints = ['has_tweets']
for col in bool_cols_to_ints:
    df_nlp[col] = df_nlp[col].apply(lambda x: 1 if x==True else 0)
    

In [None]:
df_nlp.head()

In [None]:
ji.display_same_tweet_diff_cols(df_nlp.groupby('has_tweets').get_group(True),
                                columns=['group_content','content_min_clean','cleaned_stopped_lemmas'],as_md=True)

In [None]:
ji.check_twitter_df(df_nlp,char_limit=61*350)
# get_floats = df_nlp['content_min_clean'].apply(lambda x: isinstance(x,float))


In [None]:
fname =file_dict['df_combined']['post_nlp']
df_nlp.to_csv(fname)
print(f'saved to {fname}')

### Loading in NLP Model for Predictions

In [None]:
ji.dict_dropdown(file_dict)

In [None]:
def get_most_recent_filenames(full_filename,str_to_find=None):
    """List files in the folder of `full_filename`, newest first, indexed by date modified.
    If str_to_find is given, only files whose names contain it are included."""
    import os
    import time
    # Split the path to get the containing folder
    fparts = full_filename.split('/')
    folder = '/'.join(fparts[0:-1])
    
    filelist = os.listdir(folder)

    # Header row, followed by one [file, date modified] row per matching file
    mtimes = [['file','date modified']]
    for file in filelist:
        if str_to_find is None or str_to_find in file:
            mtimes.append([file, time.ctime(os.path.getmtime(folder+'/'+file))])
    res = bs.list2df(mtimes)
    res['date modified'] = pd.to_datetime(res['date modified'])
    res.set_index('date modified',inplace=True)
    res.sort_index(ascending=False, inplace=True)
    
    return res

In [None]:
res = get_most_recent_filenames(file_dict['model_0A']['base_filename'])
res.iloc[0:10]
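
For comparison, the same "newest file first" lookup can be written with only the standard library. This is an alternative sketch, not the notebook's helper: `list_files_by_mtime` is a hypothetical name, and the file names in the demo are invented.

```python
import os
import tempfile
from pathlib import Path


def list_files_by_mtime(folder, str_to_find=None):
    """Return (name, mtime) pairs for files in `folder`, newest first."""
    entries = [
        (p.name, p.stat().st_mtime)
        for p in Path(folder).iterdir()
        if p.is_file() and (str_to_find is None or str_to_find in p.name)
    ]
    return sorted(entries, key=lambda pair: pair[1], reverse=True)


# Demo in a temporary folder with explicit, increasing modification times
with tempfile.TemporaryDirectory() as tmp:
    for i, name in enumerate(["model_a.h5", "model_b.h5", "notes.txt"]):
        path = Path(tmp) / name
        path.write_text("x")
        os.utime(path, (1_000 + i, 1_000 + i))
    newest_models = [name for name, _ in list_files_by_mtime(tmp, "model")]

print(newest_models)
```

Sorting on the numeric modification time avoids the round trip through `time.ctime` strings and `pd.to_datetime`.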

In [None]:
reload(ji)

In [None]:
## Load the nlp model and weights with layers set trainable=False
base_fname = file_dict['nlp_model_for_predictions']['base_filename']
nlp_model,df_model_layers =  ji.load_model_weights_params(base_filename= base_fname,#'models/NLP/nlp_model0B__09-02-2019_0121pm',
                                        load_model_params=False,
                                        load_model_layers_excel=True,
                                        trainable=False)
## Load in Word2Vec model from earlier
w2v_model = io.load_word2vec(file_dict=file_dict)

### Get Predictions for Hour-Binned Tweets

In [None]:
ihelp_menu([ji.get_tokenizer_and_text_sequences,
           ji.replace_embedding_layer])

In [None]:
ji.column_report(df_nlp)

In [None]:
## GET X_SEQUENCES FOR BINNED TWEETS AND CREATE NEW EMBEDDING LAYER FOR THEIR SIZE
reload(ji)
text_data=df_nlp['cleaned_stopped_lemmas']
tokenizer, X_sequences = ji.get_tokenizer_and_text_sequences(w2v_model,text_data)

new_nlp_model = ji.replace_embedding_layer(nlp_model,w2v_model,text_data,verbose=2)
new_nlp_model.summary()

In [None]:
## GET PREDICTIONS FROM NEW MODEL
preds = new_nlp_model.predict_classes(X_sequences)
print(type(preds), preds.shape)
ji.check_y_class_balance(preds)

In [None]:
## add to df
df_nlp['pred_classes_int'] = preds
mapper= {0:'neg',
        1:'no_change',
        2:'pos'}
df_nlp['pred_classes'] = df_nlp['pred_classes_int'].apply(lambda x: mapper[x])
display(df_nlp.head())

In [None]:
ji.dict_dropdown(file_dict)

In [None]:
# fname = file_dict['df_combined']['with_preds']

# df_nlp.to_csv(fname)
# print(fname)

## Model 3: Stock Price + Indicators + NLP Preds & Tweet Features

In [None]:
## IMPORT CUSTOM CAPSTONE FUNCTIONS
import functions_combined_BEST as ji
import functions_io as io

from functions_combined_BEST import ihelp, ihelp_menu,\
reload, inspect_variables

## IMPORT MY PUBLISHED PYPI PACKAGE 
import bs_ds as  bs
from bs_ds.imports import *

## IMPORT CONVENIENCE FUNCTIONS
from pprint import pprint
import qgrid
import json

# Import plotly and cufflinks for iplots
import plotly
import cufflinks as cf
from plotly import graph_objs as go
from plotly.offline import iplot
cf.go_offline()

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

#Set pd.set_options for tweet visibility
pd.set_option('display.max_colwidth',100)
pd.set_option('display.max_columns',50)

file_dict = io.def_filename_dictionary(load_prior=False, save_directory=True)
# file_dict = ji.load_filename_directory()

In [None]:
df_combined = pd.read_csv('data/__combined_stock_data_with_tweet_preds.csv', index_col=0,parse_dates=True)
df_combined.head()

### Finalize columns for final model

In [None]:
model_col_list = ['price', 'ma7', 'ma21', '26ema', '12ema', 'MACD', '20sd', 'upper_band','lower_band', 'ema', 'momentum',
                  'has_tweets','num_tweets','case_ratio', 'compound_score','pos','neu','neg','sentiment_class',
                  'pred_classes','pred_classes_int','total_favorite_count','total_retweet_count']

df_combined = ji.set_timeindex_freq(df_combined,fill_nulls=False)

df_to_model = df_combined[model_col_list].copy()#df_nlp[model_col_list].copy()
# df_to_model.to_csv('data/_df_to_model_final_model.csv')
df_to_model.head()

In [None]:
# del_me= ['X_sequences','df_nlp','twitter_grouped','bin_codes_stock','bin_codes']#list of variable names
# for me in del_me:    
#     try: 
#         exec(f'del {me}')
#         print(f'del {me} succeeded')
#     except:
#         print(f'del {me} failed')
#         continue
# ji.inspect_variables(locals())


In [None]:
## SPECIFY # OF TRAINING TEST DAYS 
reload(ji)
num_test_days=5
num_train_days=260
### SPECIFY Number of days included in each X_sequence (each prediction)
days_for_x_window=1

cols_to_exclude = ['pred_classes','has_tweets']
# Calculate number of rows to bin for x_windows
periods_per_day = ji.get_day_window_size_from_freq(df_to_model.drop(cols_to_exclude,axis=1), ji.custom_BH_freq() )


## Get the number of rows for x_window 
x_window = periods_per_day * days_for_x_window#data_params['days_for_x_window'] 
print(f'X_window size = {x_window} -- ({days_for_x_window} day(s) * {periods_per_day} rows/day)\n')

## Train-test-split by the # of days
df_train, df_test = ji.train_test_split_by_last_days(df_to_model.drop(cols_to_exclude,axis=1),
                                                     periods_per_day =periods_per_day, 
                                                     num_test_days   = num_test_days,
                                                     num_train_days  = num_train_days,
                                                     verbose=1, iplot=True)



In [None]:
###### RESCALE DATA USING MinMaxScalers FIT ON TRAINING DATA's COLUMNS ######
display(df_train.head(2).style.set_caption('df_train - pre-scaling'))

scaler_library, df_train = ji.make_scaler_library(df_train, transform=True, verbose=1)

df_test = ji.transform_cols_from_library(df_test, col_list=None,
                                         scaler_library=scaler_library,
                                         inverse=False)
display(df_train.head(2).style.set_caption('df_train - post-scaling'))

# Show transformed dataset
# display( df_train.head(2).round(3).style.set_caption('training data - scaled'))

# Create timeseries generators
train_generator, test_generator = ji.make_train_test_series_gens(
    train_data_series=df_train,
    test_data_series=df_test,
    y_cols='price',
    x_window=x_window,
    n_features=len(df_train.columns),
    batch_size=1, verbose=1)

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras import optimizers
from tensorflow.keras.layers import Bidirectional, Dense, LSTM, Dropout
from IPython.display import display
from tensorflow.keras.regularizers import l2

# Specify input shape: (timesteps per sample, features per timestep)
n_input =x_window
n_features = len(df_train.columns)
print(f'input shape: ({n_input},{n_features})')
input_shape=(n_input, n_features)

# Create model architecture
model3 = Sequential()
model3.add(LSTM(units=100, input_shape =input_shape,return_sequences=True,dropout=0.3,recurrent_dropout=0.3))#,  kernel_regularizer=l2(0.01),recurrent_regularizer=l2(0.01),
model3.add(LSTM(units=100, activation='relu', return_sequences=False,dropout=0.3,recurrent_dropout=0.3))
#     model.add(Dense(units=10, activation='relu'))
model3.add(Dense(1))#,activation='relu'))


model3.compile(loss=ji.my_rmse, metrics=['acc'],optimizer=optimizers.Nadam())
    
model3.summary()

In [None]:
## FIT MODEL
dashes = '---'*20
print(f"{dashes}\n\tFITTING MODEL:\n{dashes}")

## set params
epochs=5

# override keras warnings
ji.quiet_mode(True,True,True)

# Instantiating clock timer
clock = bs.Clock()
clock.tic('')

# Fit the model
history = model3.fit_generator(train_generator,
                               epochs=epochs,
                               verbose=2, 
                               use_multiprocessing=True,
                               workers=3)
clock.toc('')

model_key = "model_3"
hist_fname = file_dict[model_key]['fig_keras_history.ext']
summary_fname = file_dict[model_key]['model_summary']

# eval_results = ji.evaluate_model_plot_history(model1, train_generator, test_generator)
ji.evaluate_regression_model(model3,history,
                             train_generator=train_generator,
                             test_generator=test_generator,
                            true_test_series=df_test['price'],
                            true_train_series =df_train['price'],
                             save_history=True,history_filename=hist_fname,
                             save_summary=True, summary_filename=summary_fname)

In [None]:
### PREFER NEW WAY - GET DF_MODEL FIRST THEN GET EVALUATE_REGRESSION INFORMATION?
## Get true vs pred data as a dataframe and iplot
df_model3 = ji.get_model_preds_df(model3, 
                                  test_generator = test_generator,
                                  true_train_series = df_train['price'],
                                  true_test_series = df_test['price'],
                                  include_train_data=True,
                                  inverse_tf = True, 
                                  scaler = scaler_library['price'],
                                  preds_from_gen = True, 
                                  iplot = False,
                                  verbose=1)
#                                   subplot_mode='lines+markers')
ji.plotly_true_vs_preds_subplots(df_model3)
    
# Get evaluation metrics
df_results3, dfs_results3, df_shifted3 =\
ji.compare_eval_metrics_for_shifts(df_model3['true_test_price'],
                                   df_model3['pred_from_gen'],
                                   shift_list=np.arange(-4,4,1),
                                   true_train_series_to_add=df_model3['true_train_price'],
                                   display_results=True,
                                   display_U_info=True,
                                   return_results=True,
                                   return_styled_df=True,
                                   return_shifted_df=True)


save_model=True
ji.save_model_dfs(file_dict, 'model_3',df_model3,dfs_results3,df_shifted3)

In [None]:
reload(ji)
filename_prefix = file_dict['model_3']['base_filename']
if save_model:
    model_3_output_files = bs.save_model_weights_params(model3,
                                 filename_prefix=filename_prefix,
                                 auto_increment_name=True,
                                 auto_filename_suffix=True, 
                                 suffix_time_format='%m-%d-%y_%I%M%p',
                                 save_model_layer_config_xlsx=True)

### Model 3 Summary

## Model X: XGB Regression + Feature Importance


In [None]:
## SPECIFY # OF TRAINING TEST DAYS 
reload(ji)
num_test_days=20
num_train_days=2*52*5
### SPECIFY Number of days included in each X_sequence (each prediction)
days_for_x_window=1

cols_to_exclude = ['pred_classes','has_tweets']
# Calculate number of rows to bin for x_windows
periods_per_day = ji.get_day_window_size_from_freq(df_to_model.drop(cols_to_exclude,axis=1), ji.custom_BH_freq() )


## Get the number of rows for x_window 
x_window = periods_per_day * days_for_x_window#data_params['days_for_x_window'] 
print(f'X_window size = {x_window} -- ({days_for_x_window} day(s) * {periods_per_day} rows/day)\n')

## Train-test-split by the # of days
df_train, df_test = ji.train_test_split_by_last_days(df_to_model.drop(cols_to_exclude,axis=1),
                                                     periods_per_day =periods_per_day, 
                                                     num_test_days   = num_test_days,
                                                     num_train_days  = num_train_days,
                                                     verbose=1, iplot=True)

###### RESCALE DATA USING MinMaxScalers FIT ON TRAINING DATA's COLUMNS ######
display(df_train.head(2).style.set_caption('df_train - pre-scaling'))

scaler_library, df_train = ji.make_scaler_library(df_train, transform=True, verbose=1)

df_test = ji.transform_cols_from_library(df_test, col_list=None,
                                         scaler_library=scaler_library,
                                         inverse=False)
display(df_train.head(2).style.set_caption('df_train - post-scaling'))


In [None]:
## Shift price values such that the y-value being predicted is the following hour's Closing Price
df_train['price_shifted'] = df_train['price'].shift(-1)
df_test['price_shifted'] = df_test['price'].shift(-1)

display(df_train[['price','price_shifted','momentum','ema','num_tweets',]].head(10))

# Drop the null value that the shift creates in the last row of each DataFrame
df_train.dropna(subset=['price_shifted'], inplace=True)
df_test.dropna(subset=['price_shifted'], inplace=True)

## Drop columns and make train-test-X and y
target_col = 'price_shifted'
drop_cols = ['price_shifted','price']

X_train = df_train.drop(drop_cols,axis=1)
y_train = df_train[target_col]
X_test = df_test.drop(drop_cols,axis=1)
y_test = df_test[target_col]
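
The effect of the shift is easiest to see on a few rows. In this toy sketch (the values are invented), each row's target becomes the next row's price, and the current price is removed from the features along with the target.

```python
import pandas as pd

demo = pd.DataFrame({"price": [10.0, 11.0, 12.0, 13.0], "num_tweets": [0, 2, 0, 1]})

# Target for row t is the price at row t+1
demo["price_shifted"] = demo["price"].shift(-1)

# The last row has no "next" price, so it is dropped
demo = demo.dropna(subset=["price_shifted"])

# Features exclude both the target and the current price
X_demo = demo.drop(columns=["price", "price_shifted"])
y_demo = demo["price_shifted"]
print(X_demo.join(y_demo))
```

Because the target at row t is the price at row t+1, a prediction stored at index t describes the following time bin when it is compared with the unshifted `price` column.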

In [None]:
import xgboost as xgb
from xgboost import plot_importance, plot_tree
from sklearn.metrics import mean_squared_error, mean_absolute_error

reg = xgb.XGBRegressor(n_estimators=1000,silent=False,max_depth=4)

reg.fit(X_train, y_train,
        eval_set=[(X_train, y_train), (X_test, y_test)],
        early_stopping_rounds=50,
       verbose=False)


## Get Predictions
pred_price = reg.predict(X_test)
pred_price_series = pd.Series(pred_price,index=df_test.index,name='pred_test_price')#.plot()
df_xgb = pd.concat([df_train['price'].rename('true_train_price'), pred_price_series,df_test['price'].rename('true_test_price')],axis=1)


# Evaluate against y_test, the next-hour price the model was trained to predict
df_results = ji.evaluate_regression(y_test, pred_price_series,show_results=True);


fig = ji.plotly_true_vs_preds_subplots(df_xgb,true_train_col='true_train_price',
                                true_test_col='true_test_price',
                                pred_test_columns='pred_test_price')


## PLOT FEATURE IMPORTANCE
feature_importance={}
for import_type in ['weight','gain','cover']:
    reg.importance_type = import_type
    cur_importances = reg.feature_importances_
    feature_importance[import_type] = pd.Series(data = cur_importances,
                                               index=df_train.drop(drop_cols,axis=1).columns,
                                               name=import_type)

df_importance = pd.DataFrame(feature_importance)
    
importance_fig = df_importance.sort_values(by='weight', ascending=True).iplot(kind='barh',theme='solar',
                                                                    title='Feature Importance',
                                                                    xTitle='Relative Importance<br>(sum=1.0)',
                                                                    asFigure=True)

iplot(importance_fig)

In [None]:
# from plotly.offline import plot,iplot
# html_fig = plot(importance_fig,output_type='div')

# with open ('html_importance_fig.html','w') as f:
#     f.write(html_fig)

In [None]:
# Compare predictions if predictions timebins shifted
df_resultsX, dfs_resultsX, df_shiftedX =\
ji.compare_eval_metrics_for_shifts(df_xgb['true_test_price'],
                                   df_xgb['pred_test_price'],
                                   shift_list=np.arange(-4,5,1),
                                   true_train_series_to_add=df_xgb['true_train_price'],
                                   display_results=True,
                                   return_styled_df=True,
                                   display_U_info=False,
                                   return_shifted_df=True,
                                   return_results=True)
df_importance.to_csv('results/modelxgb/df_importance.csv')

ji.save_model_dfs(file_dict, 'model_xgb',df_xgb,dfs_resultsX,df_shiftedX)

In [None]:
tree_vis = xgb.to_graphviz(reg)#,**{'format':'svg'})

tree_vis.render("xgb_full_model_",format="pdf",)

### Model Interpretation

In [None]:
import shap
shap.initjs()
explainer = shap.TreeExplainer(reg)
shap_values = explainer.shap_values(X_train)
shap_interaction_values = explainer.shap_interaction_values(X_train)

In [None]:
shap.summary_plot(shap_interaction_values,X_train)

In [None]:
shap.summary_plot(shap_values, features=X_train)

### Model X Summary

In [None]:
# importance_fig = df_importance.sort_values(by='weight', ascending=True).iplot(kind='barh',theme='solar',
#                                                                     title='Feature Importance',
#                                                                     xTitle='Relative Importance<br>(sum=1.0)',
#                                                                     asFigure=True)

# iplot(importance_fig)

# Summary

In [None]:
dfs_list = {'Model 1':dfs_results1,
            'Model 2':dfs_results2,
            'Model 3':dfs_results3,
            'XGB Regressor':dfs_resultsX}
for k,v in dfs_list.items():
    new_cap = f'Evaluation Metrics for {k}'
    display(v.set_caption(new_cap))
#     [display(x.set_cat) for x in dfs_list]