## Capstone Project

In this Capstone Project we will work with the stock price data from MSN Money, which can be found at:
./Data/prices-split-adjusted.csv

In this notebook we will look at financial forecasting with machine learning.  This is inherently one of the hardest problems in machine learning, because some of the most advanced and well-funded technical teams in the world are using machine learning and other techniques to find patterns in financial data.  When they find patterns and trade on them, prices move in a way that makes those patterns less pronounced over time.  It is still a fun and rewarding area.  Just do not get discouraged if you don't find an instant money machine.

### Outline:
0. Background

1. Preparing our tools

2. Importing and describing the data.

3. Exploring, cleaning and visualizing the data

4. Developing analytics

5. Preparing and splitting our data

6. Building our first model

7. Extending to other ML models

8. Ideas for further strategies

9. Wrapping up

### Options
As we progress you are encouraged to take this dataset further. You are also encouraged to explore any aspects of the data. Develop your own algorithms. Be explicit about your inquiry and your success in predicting effects on our world.

### Warning: Not financial advice
This exercise is meant purely for educational purposes, uses many simplifications and is not intended, nor should be considered as financial advice. There are many risks involved in implementation of financial trading strategies that are not considered nor described here.

### Setting up
If you have not yet set up your environment, you can easily do so with VS Code, the Python extension and Anaconda.

For VS Code, go here: <https://code.visualstudio.com/>

and then you can follow these instructions:
<https://code.visualstudio.com/docs/python/data-science-tutorial>

### 0. Background

Machine learning is of increasing importance in finance. As volumes of data grow ever faster, machine-driven models that can find patterns in that data become ever more important.  In the accelerating race to turn data into predictions about securities prices, machine learning has become an important tool.  Today we will examine patterns in stock prices themselves to practice developing models that predict future prices. If there are systematic trends, patterns or reversals, then we may detect them.

While the chance that we discover totally new and unexploited price patterns today is low, we will practice organizing our data and creating and analyzing machine learning models, which will give us the tools to develop state-of-the-art signals of value.

Goals: 

1. Become familiar with, and practice, the process of building machine learning models as they relate to financial data.

2. Understand the special processing that is required when working with time series data such as those found in finance.

Data: Today we will be working with a sample of daily price data on several hundred stocks over a period of 5 years.  
The data is from [Kaggle](https://www.kaggle.com/dgawlik/nyse#prices-split-adjusted.csv) and is available in its original form there under a CC0 license.  We will be using a slightly preprocessed version in the repo.


### 1. Preparing our tools.

Let's review our standard imports:
- numpy for rapid numerical calculations with fast vectorized C implementations
- pandas for processing data
- matplotlib and Seaborn for visualizing charts
- scikit-learn (imported as sklearn) is the de facto standard machine learning library in the pydata ecosystem.  
 
Additionally, we will be using [pandas_profiling](https://github.com/pandas-profiling/pandas-profiling), which is a newer convenience package that helps by putting together much of our initial *boilerplate* exploratory data analysis code.


In [2]:
# Bring our tools in:
import numpy             as np
import pandas            as pd
import matplotlib.pyplot as plt
import seaborn as sns

from pandas_profiling import ProfileReport
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import TimeSeriesSplit

### 2. Importing and describing the data

Now we are ready to import our data.  The variable `rows` can be set to a number if you only want to import a small subset due to computer memory or speed constraints.  Leaving it as `None` imports every row.

In [3]:
# We are using information from the data source description to know that date is a column containing just what it says
rows = None
stocks = pd.read_csv('./Data/prices-split-adjusted.csv', nrows=rows, parse_dates=['date'])

Now that we have our data successfully loaded, let's explore what we have. First, summarize the dataframe via the info method to validate the data reading and parsing.  When looking at the info report, it is best practice to check that each column has the expected type, keeping in mind that strings are reported as object.  Also note whether there are null values, how many rows there are and what the columns are.

We do not have a data dictionary in this case.  But if you were lucky enough to have access to a data dictionary, this is a good time to check that the dictionary matches what you actually have.  Discrepancies could be the result of mis-parsing, undocumented schema changes, documentation that is not up to date, or a number of other reasons.  
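The checks described above can also be written down as code, which may be handy when the same file is reloaded often. The following is a minimal sketch on a tiny in-memory CSV. The sample rows, the `AAA` symbol and the `expected_kinds` mapping are made up for illustration and are not part of the course data.

```python
import io
import pandas as pd

# A tiny stand-in for the real file (hypothetical sample rows)
csv_text = """date,symbol,open,close,low,high,volume
2016-01-05,AAA,10.0,10.5,9.8,10.6,1000.0
2016-01-06,AAA,10.5,10.2,10.1,10.7,1200.0
2016-01-07,AAA,10.2,10.9,10.2,11.0,900.0
"""

# nrows limits how many data rows are read; None would read everything
sample = pd.read_csv(io.StringIO(csv_text), nrows=2, parse_dates=['date'])

# Expected dtype "kind" per column: 'M' = datetime, 'O' = object (string), 'f' = float
expected_kinds = {'date': 'M', 'symbol': 'O', 'open': 'f', 'close': 'f',
                  'low': 'f', 'high': 'f', 'volume': 'f'}
actual_kinds = {col: sample[col].dtype.kind for col in sample.columns}

# Collect any column whose parsed type differs from what we expected
mismatches = {col: kind for col, kind in actual_kinds.items()
              if expected_kinds.get(col) != kind}
null_counts = sample.isna().sum()

print("Rows read:", len(sample))
print("Type mismatches:", mismatches)
print("Columns with nulls:", null_counts[null_counts > 0].to_dict())
```

If a data dictionary were available, `expected_kinds` is where its contents would go.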

In [4]:
stocks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 851264 entries, 0 to 851263
Data columns (total 7 columns):
 #   Column  Non-Null Count   Dtype         
---  ------  --------------   -----         
 0   date    851264 non-null  datetime64[ns]
 1   symbol  851264 non-null  object        
 2   open    851264 non-null  float64       
 3   close   851264 non-null  float64       
 4   low     851264 non-null  float64       
 5   high    851264 non-null  float64       
 6   volume  851264 non-null  float64       
dtypes: datetime64[ns](1), float64(5), object(1)
memory usage: 45.5+ MB


Everything looks as expected in info.  The string column symbol is the trading symbol, also known in finance as the ticker.  The dates were parsed as expected and all the other columns are numeric.  Now we can look at the first few rows of data to get a sample view.  

In [5]:
stocks.head(10)

| | date | symbol | open | close | low | high | volume |
|---|---|---|---|---|---|---|---|
| 0 | 2016-01-05 | WLTW | 123.43 | 125.839996 | 122.309998 | 126.25 | 2163600.0 |
| 1 | 2016-01-06 | WLTW | 125.239998 | 119.980003 | 119.940002 | 125.540001 | 2386400.0 |
| 2 | 2016-01-07 | WLTW | 116.379997 | 114.949997 | 114.93 | 119.739998 | 2489500.0 |
| 3 | 2016-01-08 | WLTW | 115.480003 | 116.620003 | 113.5 | 117.440002 | 2006300.0 |
| 4 | 2016-01-11 | WLTW | 117.010002 | 114.970001 | 114.089996 | 117.330002 | 1408600.0 |
| 5 | 2016-01-12 | WLTW | 115.510002 | 115.550003 | 114.5 | 116.059998 | 1098000.0 |
| 6 | 2016-01-13 | WLTW | 116.459999 | 112.849998 | 112.589996 | 117.07 | 949600.0 |
| 7 | 2016-01-14 | WLTW | 113.510002 | 114.379997 | 110.050003 | 115.029999 | 785300.0 |
| 8 | 2016-01-15 | WLTW | 113.330002 | 112.529999 | 111.919998 | 114.879997 | 1093700.0 |
| 9 | 2016-01-19 | WLTW | 113.660004 | 110.379997 | 109.870003 | 115.870003 | 1523500.0 |


While the head gives a good preview of a piece of the data, it may not be a great overall view of the entire dataset, especially for larger data sets or ones that may have been sorted at some point.  However, we can now be more comfortable that the dates were parsed correctly.  We can investigate numerical columns with describe.

In [6]:
stocks.describe()

| | open | close | low | high | volume |
|---|---|---|---|---|---|
| count | 851264.0 | 851264.0 | 851264.0 | 851264.0 | 851264.0 |
| mean | 64.993618 | 65.011913 | 64.336541 | 65.639748 | 5415113.0 |
| std | 75.203893 | 75.201216 | 74.459518 | 75.906861 | 12494680.0 |
| min | 1.66 | 1.59 | 1.5 | 1.81 | 0.0 |
| 25% | 31.27 | 31.292776 | 30.940001 | 31.620001 | 1221500.0 |
| 50% | 48.459999 | 48.48 | 47.970001 | 48.959999 | 2476250.0 |
| 75% | 75.120003 | 75.139999 | 74.400002 | 75.849998 | 5222500.0 |
| max | 1584.439941 | 1578.130005 | 1549.939941 | 1600.930054 | 859643400.0 |



#### Pandas Profiling
Data exploration can start with a turnkey tool like pandas-profiling.  The important thing is to make sure you actually look at the report and digest the output.  Make it the start of your investigation.

Instructor note:

Let students take a minute to look at this report.  See what they find.  You may want to do a think, pair, share or another form of reflection.  We run this report with minimal=True because the data set is large and this avoids slow calculations.

Then go through the report and show students what they should be looking at for example: the number of variables, the number of observations, duplicate rows.  
Be sure to point out that there is a warning that symbol has high cardinality. This warning is OK because we have data on a large number of stocks. It is expected with this data set and is typical of many finance data sets where there are many instruments.  

Students may be interested to note the long tail in the numeric columns that can be seen in the histograms of each variable (in the variables section).  The histogram combines data from many different stocks so the variation in variables like closing price "close", or in "volume" is greater and shows a long tail.

You can show the output in either a more dynamic widget or in an iframe using either of these lines of code:  

profile.to_widgets()

profile.to_notebook_iframe()


In [17]:
# Minimal avoids expensive calculations that won't have much insight for us and are slow.
profile = ProfileReport(stocks, minimal=True)
profile.to_widgets()


Exercise: Examine this report and see what insights you notice.  What do you notice about the data?  Are there ways to slice the data that would give more information?  Are there other visualizations that would give you greater understanding?  

Instructor notes: 
Students can take a few minutes to do their own exploration.  If students don't see much or you covered this thoroughly, you can suggest that they dig into a particular stock by running a report on an individual stock, letting them choose their own symbol.  A sample code response follows the empty cell provided for student responses.
In the provided answer highlighting Microsoft, you can see that the variables have some tail, but not the same extremes as when many stocks were combined.  Point out that if they were doing this at work, they would probably look at each stock individually and really get to know their data.  Or at least do so for a representative sample, if not every stock.  

In [None]:
# Blank cell left for student exploration

In [20]:
# Instructor sample answer
profile = ProfileReport(stocks[stocks['symbol']=='MSFT'], minimal=False)
profile.to_widgets()


### 3. Exploring, cleaning and visualizing the data

#### Modeling our data
One of the most important steps in preparing your data is to decide how you want to think about it.  For those who have worked with SQL, this is like identifying what the key to your table is. For this data exploration, we can think of the index into the data as being a compound key of both the ticker and the date.  We will process the data in a way that produces a time series model for each stock, predicting the future based on what has happened in the past for that individual stock.

Thought Question: Can you think of other ways that you might want to consider this data?  What questions might prompt you to think of this data having a different data model?  The data model is not fixed, but is a lens that lets us look at our data.
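One alternative lens, offered here as a possible talking point, is a wide table with one row per date and one column per symbol, which suits questions that compare stocks on the same day. The sketch below uses a tiny made-up frame with the same column names as `stocks`. The symbols `AAA` and `BBB` and their prices are illustrative only.

```python
import pandas as pd

# Hypothetical long-format data: one row per (symbol, date)
toy = pd.DataFrame({
    'date': pd.to_datetime(['2016-01-05', '2016-01-05', '2016-01-06', '2016-01-06']),
    'symbol': ['AAA', 'BBB', 'AAA', 'BBB'],
    'close': [10.0, 20.0, 11.0, 19.0],
})

# Long view: the key is the (symbol, date) pair, as in the text above
long_view = toy.set_index(['symbol', 'date']).sort_index()

# Wide view: one row per date, one column per symbol
wide_view = toy.pivot(index='date', columns='symbol', values='close')

# A cross-sectional question: which stock had the best daily return on each date?
daily_returns = wide_view.pct_change()
best_per_day = daily_returns.iloc[1:].idxmax(axis=1)

print(wide_view)
print(best_per_day)
```

The long view fits per-stock time series models, while the wide view makes it easier to ask how stocks move relative to each other.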

#### Visualizing our data
It is often helpful to look at our data visually to see if there are any issues that "look funny."  With experience, just looking at the data can help us understand it in a short amount of time.  This is often the step where domain expertise (in this case the financial markets) is especially useful.


*Notes from Sarah*  
For open ended questions, can you provide either some examples or talking points/suggestions for instructors? Imagine a learner wants to continue exploring, and asks the instructor about it, what might the instructor say in order to get that learner started?

*Notes from Sarah*  
Can you describe the above?

Now that you have looked at the data report, are there other things that would help you understand the data?  Perhaps it would be helpful to see the data for a single stock.

In [9]:
profile = ProfileReport(stocks[stocks['symbol']=='MSFT'], minimal=False)  # This time run the full report.
profile.to_notebook_iframe()

Looking at an individual stock gives a much clearer impression of the distributions of each column.  Even if you cannot do this for every stock, taking a sample can be very helpful.
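As one way to get started with per-stock exploration, the sketch below summarizes each symbol separately and then draws a small random sample of symbols to inspect in detail. It runs on a made-up frame. The symbols, dates and prices are hypothetical, and the summary columns are only suggestions.

```python
import pandas as pd

# Hypothetical data for two symbols with very different price levels
toy = pd.DataFrame({
    'date': pd.to_datetime(['2016-01-05', '2016-01-06', '2016-01-07',
                            '2016-01-05', '2016-01-06']),
    'symbol': ['AAA', 'AAA', 'AAA', 'BBB', 'BBB'],
    'close': [10.0, 11.0, 12.0, 200.0, 190.0],
})

# One summary row per symbol instead of one pooled summary
per_symbol = toy.groupby('symbol').agg(
    n_rows=('close', 'size'),
    first_date=('date', 'min'),
    last_date=('date', 'max'),
    median_close=('close', 'median'),
)
print(per_symbol)

# Pick a small random sample of symbols to look at more closely
sampled_symbols = per_symbol.sample(n=1, random_state=0).index.tolist()
print(sampled_symbols)
```

Differences in `n_rows` or in the date range between symbols are worth following up, since they can point to stocks that were listed late or have missing days.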

#### Optional Exercise: 
If you are interested, take the time to investigate your data further.  It is usually time well spent.

*Notes from Sarah*  
Again, either some examples in code, or simply examples described in English for what learners could do to continue exploring

In machine learning, we need to predict something, typically called our target or label. Let's predict the value of the stock price 5 days into the future.
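To make "the price some days into the future" concrete, here is a minimal sketch of how a future value can be lined up with the row it is predicted from. The prices are made up, and the `horizon` of 2 rows is a hypothetical value chosen to keep the example short.

```python
import pandas as pd

# Made-up closing prices on five consecutive business days
close = pd.Series([100.0, 101.0, 102.0, 103.0, 104.0],
                  index=pd.date_range('2016-01-04', periods=5, freq='B'),
                  name='close')

horizon = 2  # hypothetical horizon, counted in rows (trading days)

# shift(-horizon) moves each future value back onto the row it is predicted from
aligned = pd.DataFrame({'close_today': close,
                        'target': close.shift(-horizon)})
print(aligned)

# The last `horizon` rows have no known future value yet, so they cannot be used
usable = aligned.dropna()
print(len(usable), "usable rows")
```

Each row of `usable` now pairs what we know today with what we want to predict.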

### Feature engineering
For modelling data with machine learning, it is helpful to transform the data into a form that is closer to the theoretical expectations where the ML models should perform well. Let's transform the data into returns and generate other features.  We will use log returns, because financial research suggests that they are closer to normally distributed and more (statistically) stable than raw prices.
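One practical property of log returns is that they add up across days, while simple returns have to be compounded. The short sketch below checks this on made-up prices.

```python
import numpy as np

prices = np.array([100.0, 102.0, 99.0, 105.0])  # made-up closing prices

simple_returns = prices[1:] / prices[:-1] - 1
log_returns = np.log(prices[1:] / prices[:-1])

# Log returns sum to the log return over the whole period
total_log_return = np.log(prices[-1] / prices[0])
print(np.isclose(log_returns.sum(), total_log_return))

# Simple returns must be compounded rather than summed
total_simple_return = prices[-1] / prices[0] - 1
print(np.isclose(np.prod(1 + simple_returns) - 1, total_simple_return))
```

This additivity is why a 5 day log return can be read as the sum of five 1 day log returns.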

In [10]:
def feature_target_generation(df):
    """
    Generate model features for a single stock.

    df: a pandas dataframe for one symbol with date, open, close, low, high and volume columns.
    Returns a dataframe of features indexed by date.
    """

    # Index by date and sort so the data is in date order.
    # set_index returns a new dataframe, so the caller's dataframe is left untouched.
    df = df.set_index('date').sort_index()
    if 'symbol' in df.columns and df['symbol'].nunique() == 1:
        df = df.drop(columns=['symbol'])

    features = pd.DataFrame(index=df.index)
    features['f01'] = np.log(df.close / df.open) # intra-day log return
    features['f02'] = np.log(df.open / df.close.shift(1)) # overnight log return
    features['f03'] = df.volume
    features['f04'] = np.log(df.volume) # log volume
    features['f05'] = df.volume.diff() # day-over-day change in volume
    features['f06'] = df.volume.pct_change() # day-over-day percent change in volume
    features['f07'] = df.volume.rolling(5, min_periods=1).mean().apply(np.log) # log of 5 day average volume
    features['f08'] = df.volume.rolling(10, min_periods=1).mean().apply(np.log) # log of 10 day average volume
    features['f09'] = df.volume.rolling(30, min_periods=1).mean().apply(np.log) # log of 30 day average volume
    features['f10'] = df.low
    features['f11'] = df.high
    features['f12'] = df.close
    features['f13'] = np.log(df.close / df.close.shift(1)) # 1 day log return
    features['f14'] = np.log(df.close / df.close.shift(5)) # 5 day log return
    features['f15'] = np.log(df.close / df.close.shift(10)) # 10 day log return

    return features

*Notes from Sarah*  
When you add more English description above code cells, could you add more detail? For example, below you say "these are hyperparameters you can play with or tune"; can you describe how one might change them and what effect that might have on the output?

In [11]:
# Let's generate a list of tickers so we can easily select them
ticker_list = stocks['symbol'].unique()

# These are hyperparameters you can play with or tune
prediction_horizon = -5 # negative number of rows: -5 means we predict the close 5 trading days ahead
ticker = ticker_list[19] # choose any ticker; each ticker gets its own model
n_splits = 5 # number of train/test folds; more folds give more, but shorter, test windows

# Let's make an individual model for each ticker
features = feature_target_generation(stocks[stocks['symbol']==ticker])

### Separating our data
It is important that we separate our training, validation and test data.  Since our rows are already time ordered, we can split by position, always training on earlier rows and testing on later ones.  This is one of the areas where time series data is different from other machine learning problems, where rows are usually shuffled before splitting.
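To see what a time-ordered split looks like, the sketch below runs scikit-learn's `TimeSeriesSplit` on 12 placeholder observations. The array `X_toy` stands in for real features and is only an illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_toy = np.arange(12).reshape(-1, 1)  # 12 time-ordered placeholder observations

splitter = TimeSeriesSplit(n_splits=3)
folds = list(splitter.split(X_toy))

# In every fold, the training rows all come before the test rows
for fold, (train_ind, test_ind) in enumerate(folds):
    print(f"Fold {fold}: train rows {train_ind.min()}-{train_ind.max()}, "
          f"test rows {test_ind.min()}-{test_ind.max()}")
```

The training window grows from fold to fold while each test window covers the rows that immediately follow it, so the model is never evaluated on data older than what it was trained on.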

In [12]:
# We are trying to predict the price prediction_horizon days in the future.  So we take the future value and move it prediction_horizon rows into the past to line up our data in the scikit-learn format.
y = features.f12.shift(prediction_horizon)
# The latest (prediction_horizon) rows will have NaNs because we have no future data, so let's drop them.
has_target = y.notna()
X = features[has_target]
y = y[has_target]

tscv = TimeSeriesSplit(n_splits=n_splits)
print(tscv)

### Model fun.
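This is a regression problem.  Why?  Our target, the closing price 5 days ahead, is a continuous number rather than one of a fixed set of categories.  If we instead predicted whether the price would go up or down, it would be a classification problem.

#### Linear regression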
In our ML framework we can use linear regression, just as in standard statistics or econometrics.


*Notes from Sarah*  
Can you provide explanation for why it is a regression problem? Even if it is talking points for an instructor (if you wanted the learners to think about that on their own)

In [13]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

model = LinearRegression()
for train_ind, test_ind in tscv.split(X):
    print(f'Train is from {X.iloc[train_ind].index.min()} to {X.iloc[train_ind].index.max()}. ')
    print(f'Test is from {X.iloc[test_ind].index.min()} to {X.iloc[test_ind].index.max()}. ')
    y_train, y_test = y.iloc[train_ind], y.iloc[test_ind]

    # Linear regression cannot deal with NaN or infinite values (zero volume can produce them), so we impute with 0.
    # This may not be the best choice.  replace and fillna return new dataframes, so X itself is not modified.
    X_train = X.iloc[train_ind].replace([np.inf, -np.inf], np.nan).fillna(0)
    X_test = X.iloc[test_ind].replace([np.inf, -np.inf], np.nan).fillna(0)

    model.fit(X_train, y_train)

    # RMSE is computed as the square root of the MSE, since newer scikit-learn releases no longer accept squared=False
    y_pred_train = model.predict(X_train)
    print("Training results:")
    print("RMSE:", np.sqrt(mean_squared_error(y_train, y_pred_train)))
    print("MAE:", mean_absolute_error(y_train, y_pred_train))
    print("R^2", r2_score(y_train, y_pred_train))

    y_pred_test = model.predict(X_test)
    print("Test results:")
    print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_test)))
    print("MAE:", mean_absolute_error(y_test, y_pred_test))
    print("R^2", r2_score(y_test, y_pred_test))

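As we look at our results we can see that in each period we do better in training than in testing.  That is typical of finance and machine learning.
Interestingly, we are able to explain 90% of the variance with our first linear model.  Keep in mind that today's closing price (f12) is one of the features and prices usually move only a little over 5 days, so much of that fit likely reflects persistence in the price level rather than forecasting skill.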

If you have questions about the root-mean-squared error (RMSE), mean absolute error (MAE) or the R squared goodness of fit measure, please see the Microsoft Learn module on machine learning.
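As a quick refresher, the three metrics can be computed by hand in a few lines. The sketch below uses made-up actual and predicted values and compares the hand-written results against scikit-learn's functions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([10.0, 12.0, 11.0, 13.0])  # made-up actual values
y_hat = np.array([11.0, 11.0, 11.0, 14.0])   # made-up predictions

errors = y_true - y_hat

# RMSE: square the errors, average them, then take the square root
rmse = np.sqrt(np.mean(errors ** 2))
# MAE: average of the absolute errors
mae = np.mean(np.abs(errors))
# R^2: 1 minus (squared error of the model / squared error of always predicting the mean)
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print("RMSE:", rmse, "MAE:", mae, "R^2:", r2)

# The hand-written versions should agree with scikit-learn
print(np.isclose(rmse, np.sqrt(mean_squared_error(y_true, y_hat))))
print(np.isclose(mae, mean_absolute_error(y_true, y_hat)))
print(np.isclose(r2, r2_score(y_true, y_hat)))
```

RMSE penalizes large errors more heavily than MAE, and R^2 compares the model against simply predicting the average.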

#### Ensemble Model
Let's try a random forest, which is a common model that blends a group of decision trees, each of which has access to a sub-sample of features.  This is NOT a classic model commonly used in econometrics.  However, it is widely used for prediction problems other than prices.


In [14]:
from sklearn import ensemble
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

#model=ensemble.ExtraTreesRegressor()
model = RandomForestRegressor()
for train_ind, test_ind in tscv.split(X):
    print(f'Train is from {X.iloc[train_ind].index.min()} to {X.iloc[train_ind].index.max()}. ')
    print(f'Test is from {X.iloc[test_ind].index.min()} to {X.iloc[test_ind].index.max()}. ')
    y_train, y_test = y.iloc[train_ind], y.iloc[test_ind]

    # To stay consistent with the linear model, we replace NaN and infinite values with 0.
    # This may not be the best choice.  replace and fillna return new dataframes, so X itself is not modified.
    X_train = X.iloc[train_ind].replace([np.inf, -np.inf], np.nan).fillna(0)
    X_test = X.iloc[test_ind].replace([np.inf, -np.inf], np.nan).fillna(0)

    model.fit(X_train, y_train)

    # RMSE is computed as the square root of the MSE, since newer scikit-learn releases no longer accept squared=False
    y_pred_train = model.predict(X_train)
    print("Training results:")
    print("RMSE:", np.sqrt(mean_squared_error(y_train, y_pred_train)))
    print("MAE:", mean_absolute_error(y_train, y_pred_train))
    print("R^2", r2_score(y_train, y_pred_train))

    y_pred_test = model.predict(X_test)
    print("Test results:")
    print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_test)))
    print("MAE:", mean_absolute_error(y_test, y_pred_test))
    print("R^2", r2_score(y_test, y_pred_test))

Wow, we are able to bring down our training data error metrics by a lot!  However, our test results are not as good for most of the time slices.  This is an indication that we are overfitting our model to the training data.  This model would likely not do as well in production as it would in our backtests.

### Open Ended Exercise

Now it is your turn to go ahead and improve these models.  Some areas that might help could be to:
- Tune the existing models (Random forest has a number of parameters that may help)
- Try this for more stocks (Just because it did not work for one stock does not mean it will not be useful for other stocks)
- Get more features, through transformations or outside data 
- Clean the existing data (How are we dealing with weekends and holidays?)
- Try other models such as Support Vector Regressor, Extra Trees Regressor or ElasticNet
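As a possible starting point for the tuning idea in the list above, the sketch below combines `GridSearchCV` with `TimeSeriesSplit` so that every candidate setting is scored on later data than it was trained on. It runs on synthetic data. The arrays `X_demo` and `y_demo`, the small parameter grid and the choice of mean absolute error for scoring are all assumptions made for illustration, not recommended values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic, time-ordered stand-ins for the real X and y
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(scale=0.5, size=200)

# Shallower trees and larger leaves both limit how closely the forest can fit the training data
param_grid = {
    'max_depth': [2, 5],
    'min_samples_leaf': [1, 10],
}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=0),
    param_grid=param_grid,
    cv=TimeSeriesSplit(n_splits=3),       # time-ordered folds, no shuffling
    scoring='neg_mean_absolute_error',    # scikit-learn maximizes scores, hence the negative sign
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validated MAE:", -search.best_score_)
```

With the notebook's own data, `X_demo` and `y_demo` would be replaced by the imputed `X` and `y`, and the same pattern could be reused for other estimators such as `SVR`.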
  

*Notes from Sarah*  
Please provide either an example or more specific talking points for the instructor here.