## Capstone Project

In this capstone project we will work with daily stock price data from MSN Money, which can be found at:
./Data/histroical-stock-prices.csv

In this notebook we will look at financial forecasting with machine learning. This is one of the hardest problems in machine learning, because some of the most advanced and best-funded teams in the world are already using machine learning and other techniques to find patterns in financial data. When they find a pattern and trade on it, prices move in a way that makes the pattern less pronounced over time. That does not mean this is not a fun and rewarding area. Just do not get discouraged if you don't find an instant money machine.

### Outline:
1. Preparing our tools and getting and describing the data.

2. Exploring, cleaning and visualizing the data

3. Developing analytics

4. Preparing and splitting our data

5. Building our first model

6. Extending to other ML models

7. Ideas for further strategies

8. Wrapping up

### Options
As we progress, you are encouraged to take this dataset further. Explore any aspect of the data and develop your own algorithms. Be explicit about your inquiry and about your success in predicting effects on our world.

### Warning: Not financial advice
This exercise is meant purely for educational purposes, uses many simplifications, and is not intended as, nor should it be considered, financial advice. Implementing financial trading strategies involves many risks that are neither considered nor described here.

### Setting up
If you have not yet set up your environment, you can do so with VS Code, the Python extension, and Anaconda.  
#TODO Add more details and links

*Notes from Sarah*  
Could you include a bit more setup for why learners are here, what the goal of this might be, and what the data is that they are about to engage with? You could even suggest that they open the CSV file to check it out before getting started.
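One way to check out the CSV before loading it with pandas is to read only its header and first few rows. The following is a minimal sketch using just the standard library. `peek_csv` is a hypothetical helper, and the tiny file written here is a stand-in so the example runs anywhere; point the helper at the course file to see the real columns.

```python
import csv
import tempfile
from itertools import islice
from pathlib import Path


def peek_csv(path, n_rows=5):
    """Return the header and the first n_rows data rows of a CSV file."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        rows = list(islice(reader, n_rows))
    return header, rows


# Stand-in data (not the course file), written to a temporary directory.
sample = (
    "date,ticker,open,close\n"
    "2013-01-02,RRC,63.65,62.05\n"
    "2013-01-03,RRC,61.97,63.40\n"
)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    demo_path.write_text(sample)
    header, rows = peek_csv(demo_path, n_rows=1)

print(header)
print(rows)
```

Reading only a few rows keeps this fast even when the full file has hundreds of thousands of lines.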

### 1. Preparing our tools and getting and describing the data.

*Notes from Sarah*  
Can you add a small paragraph here about what they are importing, you don't have to go into details, but just mentioning why these are important

In [1]:
# Bring our tools in:
import numpy             as np
import pandas            as pd
import matplotlib.pyplot as plt
import seaborn as sns

from pandas_profiling import ProfileReport
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import TimeSeriesSplit
import sys

%matplotlib inline

In [2]:
# rows can be set if you only want to import a small subset.
rows = None
stocks = pd.read_csv('./Data/daily-historical-sample.csv', nrows=rows, parse_dates=True, index_col='date')

Now that we have our data loaded, let's explore what we have.

*Notes from Sarah*  
Can you describe why you're dropping the following columns?

In [3]:
stocks = stocks.drop(['Unnamed: 0', 'dt'], axis=1)
stocks.head()
#stocks.columns

date,ticker,open,close,adj_close,low,high,volume
2013-01-02,RRC,63.650002,62.049999,60.994125,61.25,63.650002,2271200
2013-01-03,RRC,61.970001,63.400002,62.321163,61.630001,63.919998,1842400
2013-01-04,RRC,63.630001,65.139999,64.031532,63.439999,65.190002,1268200
2013-01-07,RRC,65.110001,65.209999,64.100365,64.43,65.610001,1692900
2013-01-08,RRC,65.019997,64.169998,63.078049,63.389999,65.099998,1776700


While `head` gives a good preview of a piece of the data, it may not represent the entire dataset well, especially for larger datasets or ones that have been sorted at some point.  We can summarize the numerical columns with `describe`.  

In [4]:
stocks.describe()

,open,close,adj_close,low,high,volume
count,619636.0,619636.0,619636.0,619636.0,619636.0,619636.0
mean,79.067758,79.082518,75.544285,78.324008,79.797608,4579010.0
std,84.71234,84.713877,84.457066,83.921011,85.45104,8639048.0
min,1.66,1.59,1.59,1.5,1.71,100.0
25%,38.310001,38.310001,35.458687,37.919998,38.700001,1210700.0
50%,60.009998,60.040001,56.34938,59.439999,60.606785,2308300.0
75%,92.629997,92.660004,88.014315,91.830002,93.43,4624100.0
max,1919.390015,1919.650024,1919.650024,1902.540039,1925.0,616620500.0


*Notes from Sarah*  
Can you move the comments in the cell below to be markdown above it and add more explanation? We can have comments be more applied to the actual code written and the more abstract comments be moved to markdown.

In [5]:
# Another way to summarize the data.
# This method reports the type of each column, how many non-null values there are,
# and the size of the data set, among other things.
# Note: the `null_counts` argument was deprecated in pandas 1.2 in favor of `show_counts`.
stocks.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 619636 entries, 2013-01-02 to 2018-08-24
Data columns (total 7 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   ticker     619636 non-null  object 
 1   open       619636 non-null  float64
 2   close      619636 non-null  float64
 3   adj_close  619636 non-null  float64
 4   low        619636 non-null  float64
 5   high       619636 non-null  float64
 6   volume     619636 non-null  int64  
dtypes: float64(5), int64(1), object(1)
memory usage: 37.8+ MB


### 2. Exploring, cleaning and visualizing the data

#### Modeling our data
One of the most important steps in preparing your data is deciding how you want to think about it.  For those who have worked with SQL, this is like identifying the key of your table. For this data exploration, we can think of the index into the data as a compound key of both the ticker and the date.  We will process the data in a way that gives us a time series model for each stock, predicting the future of an individual stock from what has happened to that stock in the past.  

Thought Question: Can you think of other ways that you might want to consider this data?  What questions might prompt you to think of this data as having a different data model?  The data model is not fixed, but is a lens that lets us look at our data.  
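To make the compound key concrete, here is a minimal sketch on a tiny stand-in frame shaped like `stocks` (a date index plus a `ticker` column). The tickers `AAA` and `BBB` and their prices are made up for illustration. It builds the (ticker, date) key explicitly and checks that it identifies each row uniquely.

```python
import pandas as pd

# Stand-in data with the same layout as `stocks`; values are made up.
demo = pd.DataFrame(
    {
        "ticker": ["AAA", "AAA", "BBB", "BBB"],
        "close": [10.0, 10.5, 20.0, 19.5],
    },
    index=pd.to_datetime(
        ["2013-01-02", "2013-01-03", "2013-01-02", "2013-01-03"]
    ).rename("date"),
)

# Build the compound key as a (ticker, date) MultiIndex.
keyed = demo.set_index("ticker", append=True).swaplevel().sort_index()

# A key should identify each row uniquely.
key_is_unique = keyed.index.is_unique

# Each ticker's rows form one time series.
rows_per_ticker = demo.groupby("ticker").size()

print(keyed)
print("Key is unique:", key_is_unique)
print(rows_per_ticker)
```

A different question would suggest a different key. For example, comparing all stocks on one day treats the date as the grouping variable and the tickers as the observations.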

#### Visualizing our data
It is often helpful to look at our data visually to see if there are any issues that "look funny."  With experience, just looking at the data can help us understand it in a short amount of time.  This is often the step where domain expertise (in this case, knowledge of the financial markets) is especially useful.

#### Pandas Profiling
Often data exploration can start with a turnkey tool like pandas-profiling.  The important thing is to make sure you actually look at the report and digest the output.  Make it the start of your investigation.

Exercise: Examine this report.  What do you notice about the data?  Are there ways to slice the data that would give more information?  Are there other visualizations that would give you greater understanding?  


*Notes from Sarah*  
For open ended questions, can you provide either some examples or talking points/suggestions for instructors? Imagine a learner wants to continue exploring, and asks the instructor about it, what might the instructor say in order to get that learner started?

In [6]:

# Minimal avoids expensive calculations that won't have much meaning for us.
profile = ProfileReport(stocks, minimal=True) 
profile.to_notebook_iframe()

*Notes from Sarah*  
Can you describe the above?

Now that you have looked at the data report, are there other things that would help you understand the data?  Perhaps it would be helpful to see the data for a single stock.

In [7]:
profile = ProfileReport(stocks[stocks['ticker']=='MSFT'], minimal=False)  # This time run the full report.
profile.to_notebook_iframe()

Looking at an individual stock gives a much clearer impression of the distribution of each column.  Even if you cannot do this for every stock, taking a sample can be very helpful.  

#### Optional Exercise: 
If you are interested, take the time to investigate your data further.  It is usually time well spent.
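As a starting point for further exploration, the sketch below shows three checks one might run: the date coverage of each ticker, the typical volume of each ticker, and whether every row's low and high actually bracket its open and close. It uses a small made-up frame with the same columns as `stocks`; the last row is deliberately inconsistent so that the check has something to find.

```python
import pandas as pd

# Stand-in data with the same columns as `stocks`; values are made up.
demo = pd.DataFrame(
    {
        "ticker": ["AAA", "AAA", "AAA", "BBB", "BBB"],
        "open": [10.0, 10.4, 10.9, 20.0, 19.8],
        "close": [10.5, 10.8, 10.6, 19.7, 20.1],
        "low": [9.9, 10.3, 10.5, 19.5, 19.9],
        "high": [10.6, 11.0, 11.0, 20.2, 20.3],
        "volume": [1000, 1500, 900, 5000, 4000],
    },
    index=pd.to_datetime(
        ["2013-01-02", "2013-01-03", "2013-01-04", "2013-01-02", "2013-01-03"]
    ).rename("date"),
)

# 1. Date coverage: first date, last date and number of rows per ticker.
coverage = demo.reset_index().groupby("ticker")["date"].agg(["min", "max", "count"])

# 2. Typical trading volume per ticker.
median_volume = demo.groupby("ticker")["volume"].median()

# 3. Consistency: low should not exceed open/close, high should not be below them.
body_low = demo[["open", "close"]].min(axis=1)
body_high = demo[["open", "close"]].max(axis=1)
bad_rows = demo[(demo["low"] > body_low) | (demo["high"] < body_high)]

print(coverage)
print(median_volume)
print(bad_rows)
```

Running the same three checks on `stocks` is one way to find tickers that start late, trade thinly, or contain suspicious rows.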

*Notes from Sarah*  
Again, either some examples in code, or simply examples described in English for what learners could do to continue exploring

In machine learning, we need something to predict, typically called the target (or label). The inputs we predict from are called features or predictors. Let's predict the value of the stock price 5 days into the future.

### Feature engineering
When modelling data with machine learning, it helps to transform the data into a form that is closer to the theoretical conditions under which ML models perform well. Let's transform the prices into returns and generate other features.  We use log returns because financial research suggests that they are closer to normally distributed and more (statistically) stable than raw prices. 
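A small worked example shows one practical property of log returns: they add up over time, while simple percentage returns do not. The price series below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up closing prices: up 10%, down 10%, up 10%.
close = pd.Series([100.0, 110.0, 99.0, 108.9])

simple_returns = close.pct_change().dropna()
log_returns = np.log(close / close.shift(1)).dropna()

# Summing log returns gives the log return over the whole period.
total_log_return = log_returns.sum()
direct_log_return = np.log(close.iloc[-1] / close.iloc[0])

# Converting back recovers the actual overall change in price.
overall_change = np.exp(total_log_return) - 1

print("Sum of simple returns:", simple_returns.sum())
print("Sum of log returns:   ", total_log_return)
print("Direct log return:    ", direct_log_return)
print("Overall price change: ", overall_change)
```

The simple returns sum to about 10%, although the price only rose by about 8.9% overall. The log returns sum to exactly the log of the overall price ratio.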

In [8]:
def feature_target_generation(df):
    """
    Build model features for a single stock.

    df: a pandas DataFrame indexed by date with open, close, low, high and volume
        columns (plus, optionally, a ticker column holding a single ticker).
    Returns a DataFrame of features (f01 to f15) on the same, date-sorted index.
    """

    # Drop the ticker column when the frame holds a single stock.
    # Reassigning (rather than inplace=True) leaves the caller's frame untouched.
    if 'ticker' in df.columns and df.ticker.nunique() == 1:
        df = df.drop(columns=['ticker'])

    # This ensures the data is in date order before we shift and roll
    df = df.sort_index()
    features = pd.DataFrame(index=df.index)
    features['f01'] = np.log(df.close / df.open) # intra-day log return
    features['f02'] = np.log(df.open / df.close.shift(1)) # overnight log return
    features['f03'] = df.volume # raw volume
    features['f04'] = np.log(df.volume) # log volume
    features['f05'] = df.volume.diff() # 1 day change in volume
    features['f06'] = df.volume.pct_change() # 1 day percentage change in volume
    features['f07'] = df.volume.rolling(5, min_periods=1).mean().apply(np.log) # log of 5 day average volume
    features['f08'] = df.volume.rolling(10, min_periods=1).mean().apply(np.log) # log of 10 day average volume
    features['f09'] = df.volume.rolling(30, min_periods=1).mean().apply(np.log) # log of 30 day average volume
    features['f10'] = df.low
    features['f11'] = df.high
    features['f12'] = df.close
    features['f13'] = np.log(df.close / df.close.shift(1)) # 1 day log return
    features['f14'] = np.log(df.close / df.close.shift(5)) # 5 day log return
    features['f15'] = np.log(df.close / df.close.shift(10)) # 10 day log return

    return features

*Notes from Sarah*  
When you add more English description above code cells, could you add more detail? For example, below you say "these are hyperparameters you can play with or tune"; can you describe how one might change them and what effect that might have on the output?

In [9]:
# Let's generate a list of tickers so we can easily select them
ticker_list = stocks.ticker.unique()

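# These are settings you can play with or tune
prediction_horizon = -5 # negative number: shift(-5) pulls the close from 5 rows (trading days) ahead into the current row
ticker = ticker_list[19] # choose any ticker
n_splits = 5 # number of train/test folds produced by TimeSeriesSplit

# Let's build the features for the chosen ticker (one model per ticker)
features = feature_target_generation(stocks[stocks.ticker==ticker])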
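To see what a negative `prediction_horizon` does, here is a minimal sketch on a made-up price series. The horizon of -2 is chosen only to keep the example short; the mechanics are the same for -5.

```python
import pandas as pd

# Made-up closing prices, one row per trading day.
close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0], name="close")
prediction_horizon = -2  # demo value: look 2 rows ahead

# A negative shift moves future values up into earlier rows.
target = close.shift(prediction_horizon)
aligned = pd.DataFrame({"close_today": close, "target": target})

# The last 2 rows have no future value yet, so they cannot be used for training.
usable = aligned.dropna()

print(aligned)
print("Rows lost to the horizon:", len(aligned) - len(usable))
```

A larger horizon (for example -20) asks the model to look further ahead, which is generally harder, and it also removes more rows from the end of the data.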

### Separating our data
It is important that we separate our training, validation and test data.  Since our rows are already ordered by time, we can split them by position.  This is one of the areas where time series data differs from other machine learning problems: we must not shuffle the rows, because each test period has to come after the data the model was trained on.  

In [10]:
# We are trying to predict the price abs(prediction_horizon) days in the future, so we take the future value and shift it back by that many rows to line up our data in the scikit-learn format.
y = features.f12.shift(prediction_horizon)
# The last abs(prediction_horizon) rows will be NaN because we have no future data, so let's drop them.
has_target = y.notna()
X = features[has_target]
y = y[has_target]

tscv = TimeSeriesSplit(n_splits=n_splits)
print(tscv)

TimeSeriesSplit(max_train_size=None, n_splits=5)
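To see what `TimeSeriesSplit` produces, the sketch below splits 12 made-up rows into 3 folds and prints the row positions in each. The small sizes are chosen only so that the output is easy to read.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve rows standing in for twelve consecutive trading days.
X_demo = np.arange(12).reshape(-1, 1)
tscv_demo = TimeSeriesSplit(n_splits=3)

folds = [
    (train_ind.tolist(), test_ind.tolist())
    for train_ind, test_ind in tscv_demo.split(X_demo)
]

for number, (train_ind, test_ind) in enumerate(folds, start=1):
    print(f"Fold {number}: train rows {train_ind}, test rows {test_ind}")
```

The training window grows with each fold, and every test block comes strictly after its training rows, so the model is never evaluated on data older than what it was trained on.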


### Model fun.
This is a regression problem.  Why? 

#### Linear regression
In our ML framework we can use linear regression, just as in standard statistics or econometrics.


*Notes from Sarah*  
Can you provide explanation for why it is a regression problem? Even if it is talking points for an instructor (if you wanted the learners to think about that on their own)

In [11]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

model = LinearRegression()
for train_ind, test_ind in tscv.split(X):
    print(f'Train is from {X.iloc[train_ind].index.min()} to {X.iloc[train_ind].index.max()}. ')
    print(f'Test is from {X.iloc[test_ind].index.min()} to {X.iloc[test_ind].index.max()}. ')

    # Since linear regression cannot deal with NaN, we need to impute.  This may not be the best choice.
    # fillna returns new frames, so we do not modify slices of X in place.
    X_train, X_test = X.iloc[train_ind].fillna(0), X.iloc[test_ind].fillna(0)
    y_train, y_test = y.iloc[train_ind], y.iloc[test_ind]

    model.fit(X_train, y_train)

    y_pred_train = model.predict(X_train)
    print("Training results:")
    # RMSE is the square root of the MSE; taking np.sqrt works across scikit-learn versions.
    print("RMSE:", np.sqrt(mean_squared_error(y_train, y_pred_train)))
    print("MAE:", mean_absolute_error(y_train, y_pred_train))
    print("R^2", r2_score(y_train, y_pred_train))

    y_pred_test = model.predict(X_test)
    print("Test results:")
    print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_test)))
    print("MAE:", mean_absolute_error(y_test, y_pred_test))
    print("R^2", r2_score(y_test, y_pred_test))



Train is from 2013-01-02 00:00:00 to 2013-12-10 00:00:00. 
Test is from 2013-12-11 00:00:00 to 2014-11-17 00:00:00. 
Training results:
RMSE: 12.194687620849338
MAE: 8.269789181468507
R^2 0.971794668636506
Test results:
RMSE: 25.08803386279769
MAE: 18.88693951339991
R^2 0.8346855708624527
Train is from 2013-01-02 00:00:00 to 2014-11-17 00:00:00. 
Test is from 2014-11-18 00:00:00 to 2015-10-26 00:00:00. 
Training results:
RMSE: 17.680579749474273
MAE: 11.934407810948635
R^2 0.9782110088737253
Test results:
RMSE: 23.39645909020436
MAE: 16.396257604855755
R^2 0.6973798830428417
Train is from 2013-01-02 00:00:00 to 2015-10-26 00:00:00. 
Test is from 2015-10-27 00:00:00 to 2016-10-03 00:00:00. 
Training results:
RMSE: 19.445258479768935
MAE: 13.241340666253965
R^2 0.9785426911578539
Test results:
RMSE: 26.035471691886507
MAE: 19.98573882362275
R^2 0.8101300107145598
Train is from 2013-01-02 00:00:00 to 2016-10-03 00:00:00. 
Test is from 2016-10-04 00:00:00 to 2017-09-11 00:00:00. 
Training r

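As we look at our results, we can see that in each period we do better in training than in testing.  That is typical in finance and in machine learning. 
Interestingly, our first linear model explains a large share of the variance: in the folds shown above, R^2 is about 0.97 on the training data and between about 0.70 and 0.83 on the test data.  Keep in mind that we are predicting a price level and that today's close (`f12`) is one of the features. Prices five days apart are usually close to each other, so a high R^2 here does not by itself mean the model has found a tradable pattern.  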
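One way to put an R^2 for price levels in context is to compare it with a naive forecast that simply says "the price in 5 days will equal today's price." The sketch below does this on a synthetic random walk, which is an assumption made for illustration and not the course data. The naive forecast is a comparison point added here, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score

# Synthetic random-walk "prices" (made up, seeded for repeatability).
rng = np.random.default_rng(0)
close = pd.Series(100 + rng.normal(0, 1, 500).cumsum())

horizon = 5
future = close.shift(-horizon)  # price `horizon` rows ahead
has_future = future.notna()

# Naive forecast: the future price equals today's price.
naive_r2 = r2_score(future[has_future], close[has_future])
print("R^2 of the naive forecast:", naive_r2)
```

If a model's test R^2 is not clearly better than this kind of baseline on the same rows, most of its apparent skill comes from the fact that prices move slowly relative to their overall range.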

If you have questions about the root-mean-squared error (RMSE), the mean absolute error (MAE) or the R-squared goodness-of-fit measure, please see the Microsoft Learn module on machine learning.
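As a quick refresher, the three metrics can be computed by hand with NumPy. The numbers below are made up so that the arithmetic is easy to follow.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])
errors = y_true - y_pred  # [1, 0, -1, -2]

# RMSE: square the errors, average them, take the square root.
rmse = np.sqrt(np.mean(errors ** 2))

# MAE: average of the absolute errors.
mae = np.mean(np.abs(errors))

# R^2: 1 minus (squared error of the model / squared error of predicting the mean).
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("RMSE:", rmse)
print("MAE:", mae)
print("R^2:", r2)
```

RMSE penalizes large errors more heavily than MAE does. R^2 becomes negative whenever the model's squared error is larger than that of always predicting the mean of the data being scored.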

#### Ensemble Model
Let's try a random forest, a common model that blends a group of decision trees, each trained on a random sub-sample of the data (and, depending on the settings, of the features).  This is NOT a classic model in econometrics.  However, it is widely used for predicting things other than prices.  


In [12]:
from sklearn import ensemble
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

#model=ensemble.ExtraTreesRegressor()
model = RandomForestRegressor()
for train_ind, test_ind in tscv.split(X):
    print(f'Train is from {X.iloc[train_ind].index.min()} to {X.iloc[train_ind].index.max()}. ')
    print(f'Test is from {X.iloc[test_ind].index.min()} to {X.iloc[test_ind].index.max()}. ')

    # Impute NaN with 0, as in the linear regression cell, so both models see the same inputs.
    # This may not be the best choice.
    X_train, X_test = X.iloc[train_ind].fillna(0), X.iloc[test_ind].fillna(0)
    y_train, y_test = y.iloc[train_ind], y.iloc[test_ind]

    model.fit(X_train, y_train)

    y_pred_train = model.predict(X_train)
    print("Training results:")
    # RMSE is the square root of the MSE; taking np.sqrt works across scikit-learn versions.
    print("RMSE:", np.sqrt(mean_squared_error(y_train, y_pred_train)))
    print("MAE:", mean_absolute_error(y_train, y_pred_train))
    print("R^2", r2_score(y_train, y_pred_train))

    y_pred_test = model.predict(X_test)
    print("Test results:")
    print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_test)))
    print("MAE:", mean_absolute_error(y_test, y_pred_test))
    print("R^2", r2_score(y_test, y_pred_test))

Train is from 2013-01-02 00:00:00 to 2013-12-10 00:00:00. 
Test is from 2013-12-11 00:00:00 to 2014-11-17 00:00:00. 
Training results:
RMSE: 3.3447128957408365
MAE: 2.004199599578613
R^2 0.9978781849000666
Test results:
RMSE: 91.7050512974462
MAE: 73.71684743590274
R^2 -1.2088388061050512
Train is from 2013-01-02 00:00:00 to 2014-11-17 00:00:00. 
Test is from 2014-11-18 00:00:00 to 2015-10-26 00:00:00. 
Training results:
RMSE: 5.030309724665998
MAE: 3.149532310389282
R^2 0.9982362640334342
Test results:
RMSE: 36.59228129723509
MAE: 28.189713952015993
R^2 0.2597532705242832
Train is from 2013-01-02 00:00:00 to 2015-10-26 00:00:00. 
Test is from 2015-10-27 00:00:00 to 2016-10-03 00:00:00. 
Training results:
RMSE: 5.947169943683202
MAE: 3.9384867665465437
R^2 0.9979929023195632
Test results:
RMSE: 42.2954274028084
MAE: 33.301960367752315
R^2 0.49891400910024863
Train is from 2013-01-02 00:00:00 to 2016-10-03 00:00:00. 
Test is from 2016-10-04 00:00:00 to 2017-09-11 00:00:00. 
Training res

Wow, we brought our training error metrics down by a lot!  However, our test results are worse for most of the time slices, and R^2 is even negative in the first one.  This is an indication that we are overfitting the model to the training data.  This model would likely do worse in production than its training metrics suggest.
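One common response to overfitting in a random forest is to limit how closely each tree can fit the training rows, for example with `max_depth` and `min_samples_leaf`. The sketch below illustrates the idea on synthetic noisy data, which is an assumption made so that the example is self-contained. The specific parameter values are arbitrary starting points, not recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: one informative feature plus a lot of noise.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 5))
y_demo = X_demo[:, 0] + rng.normal(scale=1.0, size=300)

# Keep the time order: the first 200 rows train, the last 100 test.
X_train, X_test = X_demo[:200], X_demo[200:]
y_train, y_test = y_demo[:200], y_demo[200:]

flexible = RandomForestRegressor(n_estimators=50, random_state=0)
constrained = RandomForestRegressor(
    n_estimators=50, max_depth=3, min_samples_leaf=10, random_state=0
)

results = {}
for name, forest in [("flexible", flexible), ("constrained", constrained)]:
    forest.fit(X_train, y_train)
    # .score returns R^2 for regressors.
    results[name] = (forest.score(X_train, y_train), forest.score(X_test, y_test))
    print(f"{name}: train R^2 = {results[name][0]:.3f}, test R^2 = {results[name][1]:.3f}")
```

The gap between training and test R^2 is the quantity to watch. A constrained forest gives up some training accuracy, and the goal is for the test score to hold up or improve in exchange.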

### Open Ended Exercise

Now it is your turn to improve these models.  Some areas that might help: 
- Tune the existing models (random forest has a number of parameters that may help)
- Try this for more stocks (just because it did not work for one stock does not mean it won't be useful for others)
- Get more features, through transformations or outside data 
- Clean the existing data (how are we dealing with weekends and holidays?)
- Try other models such as Support Vector Regressor, Extra Trees Regressor or ElasticNet
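As one possible starting point for the tuning and "other models" ideas, the sketch below compares a few candidate models and tunes a random forest, always using `TimeSeriesSplit` so that each validation block comes after its training rows. The data is synthetic, and the parameter grids and values are arbitrary assumptions for illustration. In the notebook you would pass the imputed `X` and `y` instead.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the feature matrix and target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(240, 4))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.5, size=240)

tscv_demo = TimeSeriesSplit(n_splits=4)

# SVR and ElasticNet are sensitive to feature scale, so scale inside a pipeline.
# The scaler is then fitted on each training fold only.
candidates = {
    "elastic_net": make_pipeline(StandardScaler(), ElasticNet(alpha=0.1)),
    "svr": make_pipeline(StandardScaler(), SVR(C=10.0)),
    "extra_trees": ExtraTreesRegressor(n_estimators=50, random_state=0),
}
scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=tscv_demo, scoring="r2").mean()
    for name, model in candidates.items()
}
for name, score in scores.items():
    print(f"{name}: mean test R^2 = {score:.3f}")

# Tune a random forest over a small grid, scored by mean absolute error.
search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    param_grid={"max_depth": [3, None], "min_samples_leaf": [1, 10]},
    cv=tscv_demo,
    scoring="neg_mean_absolute_error",
)
search.fit(X_demo, y_demo)
best_params = search.best_params_
print("Best random forest settings:", best_params)
```

Scores chosen this way are still optimistic, because the same folds are used both to select and to evaluate the model. Holding back the most recent period as a final test set gives a more honest estimate.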
  

*Notes from Sarah*  
Please provide either an example or more specific talking points for the instructor here.