# Timing the Market vs Time in the Market
   ## Comparison of Supervised Learning Models to a Simulated 401(k) Approach
  
  ### Unit 3 Capstone Project
  ### Matthew Kennedy, August 2017

   ## Section 1: Overview of Dataset and Analysis of Data
   
   The dataset used in this project comes from Kaggle user "CNuge." The dataset contains historical stock prices over the last five years for all companies in the S&P 500 index and can be found at https://www.kaggle.com/camnugent/sandp500. This project will use the files that have the historical prices for individual stocks.   
       
   The dataset contains the following columns: 
       
       Date - In the format of yy-mm-dd
       Open - Price of the stock in USD at market open
       High - Highest price reached in the day
       Low - Lowest price reached in the day
       Close - The price the stock had at the end of the day
       Volume - Number of shares traded
       Name - The stock's ticker name
       
   The user collected the data by using the Python library 'pandas_datareader' to scrape Google Finance.

In [1]:
# Import the necessary modules
import pandas as pd
import numpy as np
import scipy
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn import linear_model
from sklearn import preprocessing
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVR
from sklearn.svm import SVC
from sklearn import ensemble
from sklearn import datasets
from sklearn.utils import shuffle
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score

In [2]:
# This project will use the historical prices of GOOGL

# Read the Dataset, store the original
# This is the filepath on my laptop:
#original = pd.read_csv('C:\\Users\\mkennedy\\sandp500\\individual_stocks_5yr\\GOOGL_data.csv', encoding='utf-8-sig')
# This is the filepath on my desktop:
original = pd.read_csv('D:\\Data\\sandp500\\GOOGL_data.csv')

FileNotFoundError: File b'D:\\Data\\sandp500\\GOOGL_data.csv' does not exist

In [None]:
# Copy the original dataframe so it can be manipulated without altering the original
data = original.copy()

# Print the first five rows of the dataframe
data.head()

In [None]:
# Check the footer to make sure there are no rows of text
data.tail()

There are no footers that need to be excluded.
There are 1257 rows of stock data. 

In [None]:
# The describe method provides some additional information about the data
data.describe()

In [None]:
# The dtypes call will display the data types. 
# This is used to make sure all numerical values have the correct data type to work with in the models.
print(data.dtypes)

The dataset appears to be clean and easy to work with.

Observe the correlations between columns by using seaborn's heatmap.

In [None]:
# Create a heatmap to compare the correlation of the columns.
import seaborn as sns

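# Correlate only the numeric columns (the 'Name' column holds text)
corrmat = data.select_dtypes(include='number').corr()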

# Set up the matplotlib figure.
f, ax = plt.subplots(figsize=(12, 9))

# Draw the heatmap using seaborn
sns.heatmap(corrmat, vmax=.8, square=True)
plt.show()

As expected, the values are highly correlated. Building prediction models directly on the raw time-series columns will not help reach the goal of this project (to determine whether to buy or sell the stock at the next market open). 

Features will need to be created to generate accurate predictions from the models. 
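
As a minimal sketch of the kind of features meant here, the example below derives a daily direction, a signed run length, and a next-day target from a made-up list of closing prices. The prices and the names `toy`, `Direction`, `Run`, and `NextDirection` are illustrative assumptions, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices, for illustration only
toy = pd.DataFrame({'Close': [10.0, 11.0, 12.0, 11.5, 11.0, 11.0, 12.0]})

# Direction of each day's move: +1 up, -1 down, 0 unchanged or first row
toy['Direction'] = np.sign(toy['Close'].diff()).fillna(0).astype(int)

# Signed run length: count consecutive days moving in the same direction
run = []
for d in toy['Direction']:
    if d != 0 and run and np.sign(run[-1]) == d:
        run.append(run[-1] + d)
    else:
        run.append(d)
toy['Run'] = run

# Next-day direction is the prediction target; the last row has no next day
toy['NextDirection'] = toy['Direction'].shift(-1)

print(toy)
```

In this sketch a reversal starts a new run at +1 or -1, which is one possible convention; the notebook's own 'Streak' column resets to 0 on a reversal instead.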

In [None]:
# The momentum shows the direction of each day's move: +1 if the close
# rose from the previous close, -1 if it fell, and 0 if it was unchanged.
# The first row has no previous close, so its momentum is 0.
data['Momentum'] = np.sign(data['Close'].diff()).fillna(0).astype(int)
data.head()

In [None]:
# Print out the total momentum and the average momentum across all rows.
total_mom = sum(data.Momentum)
print(total_mom)
ave_mom = data.Momentum.mean()
print(ave_mom)

In [None]:
# The streak shows how many days in a row the stock has moved up
# (positive) or down (negative). A reversal resets the streak to 0,
# and an unchanged close leaves it at 0.
streak = [0] * len(data)
close = data['Close'].values
# Calculate the streaks and store them in the new column, 'Streak'
for i in range(1, len(data)):
    if close[i] > close[i-1]:
        streak[i] = streak[i-1]+1 if streak[i-1] >= 0 else 0
    elif close[i] < close[i-1]:
        streak[i] = streak[i-1]-1 if streak[i-1] <= 0 else 0

data['Streak'] = streak
data.head()

In [None]:
# Create a 'Future Momentum' feature that the model will attempt to predict.
data['Future Momentum'] = data.Momentum.shift(-1)
data.head()

In [None]:
data.tail()

In [None]:
# Drop the last row to get rid of the NaN values
data = data.iloc[:len(data)-1,:]

In [None]:
# Look at the tail to make sure the data looks good
data.tail()

In [None]:
# Create a heatmap to compare the correlation of the columns.
import seaborn as sns

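# Correlate only the numeric columns (the 'Name' column holds text)
corrmat = data.select_dtypes(include='number').corr()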

# Set up the matplotlib figure.
f, ax = plt.subplots(figsize=(12, 9))

# Draw the heatmap using seaborn
sns.heatmap(corrmat, vmax=.8, square=True)
plt.show()

In [None]:
# Plot the change in GOOGL stock prices over time
plt.plot(data.Close)
plt.show()

# Section 2: Creation and Comparison of Predictive Models

Now that the data has been analyzed to ensure it can be manipulated, it is time to create some predictive models. For comparison, the scores from the models will be stored in a new table, titled "Model Comparison."

In [None]:
# Create a table to store the scores for each model.
# Title: Model Comparison
# Columns: Model, Scoring Metric, Scoring Value
# Model values: Logistic Regression, Ridge Regression, Lasso Regression,
#   Support Vector Regression, Support Vector Classifier, Gradient Boosting Classifier
models = {'Model':[], 'Scoring Metric':[], 'Scoring Value':[]}
columns = models.keys()
model_comparison = pd.DataFrame(data=models, columns=columns)
model_comparison

In [None]:
# Set the variables. 
# Use the next day's momentum ('Future Momentum') for Y
# Use the closing price, the volume, and the new features for X
Y = data['Future Momentum']
X = data[['Close', 'Volume', 'Momentum', 'Streak']]

# Create training and test sets, keeping the rows in time order.
offset = int(X.shape[0] * 0.8)

# Put the first 80% of the data in the training set.
X_train, Y_train = X[:offset], Y[:offset]

# And put the last 20% in the test set.
X_test, Y_test = X[offset:], Y[offset:]
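
Because the rows are ordered by date, the split above trains on the past and tests on the most recent data. As a hedged sketch of how the same idea could carry over to cross-validation, scikit-learn's `TimeSeriesSplit` yields folds in which every test fold comes after its training fold. The synthetic arrays `X_demo` and `y_demo` are assumptions for illustration; this is an alternative to plain `cv=10` folds, not what the notebook itself runs.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Synthetic, time-ordered stand-ins for the feature matrix and target
X_demo = np.arange(20).reshape(-1, 1)
y_demo = np.where(np.arange(20) % 2 == 0, 1, -1)

tscv = TimeSeriesSplit(n_splits=4)
folds = list(tscv.split(X_demo))
for train_idx, test_idx in folds:
    # Every training index precedes every test index in each fold
    print(train_idx.max(), '<', test_idx.min(), '| test size', len(test_idx))
```

The `tscv` object can be passed as the `cv` argument of `cross_val_score` in place of an integer.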

In [None]:
# Print the average momentum for the test set to use in later analysis:
print(Y_test.mean())

# Logistic Regression

In [None]:
# Declare a logistic regression classifier.
# Larger C's lead to reduced regularization of parameters, but because there are
#   few features, the value of C has a trivial effect (tested for many C's)
lr = LogisticRegression(C=1e9)

# Fit the model and predict the test set.
lr.fit(X_train,Y_train)
y_pred = lr.predict(X_test)


print('Confusion Matrix of the Model:')
conf_mat = confusion_matrix(Y_test, y_pred)
print(conf_mat)

# Use Accuracy for Logistic Regression Scoring Metric
cv = cross_val_score(lr, X, Y, cv=10, scoring='accuracy') 
print('Accuracy Score:')
print(cv)
# Print the average of the accuracy and store it in the table
print('Average of the Accuracy Score:')
print(cv.mean())

# Store the data in the model_comparison table
models = {'Model':['Logistic Regression'], 'Scoring Metric':['Accuracy'], 'Scoring Value':[cv.mean()]}
model_comparison = pd.concat([model_comparison, pd.DataFrame(data=models)], ignore_index=True)

In [None]:
print(y_pred)
print(y_pred.mean())

# Ridge Regression

In [None]:
# Fitting a ridge regression model. Alpha is the regularization
# parameter (usually called lambda). As alpha gets larger, parameter
# shrinkage grows more pronounced. Note that by convention, the
# intercept is not regularized. With fit_intercept=False the model is
# fit without an intercept; the features here are not standardized,
# so this forces the fit through the origin.

ridgeregr = linear_model.Ridge(alpha=10, fit_intercept=False) 
ridgeregr.fit(X_train, Y_train)

y_pred = ridgeregr.predict(X_test)

# Print the R2.
cv = cross_val_score(ridgeregr, X, Y, cv=10, scoring='r2') 
print('R2 Score:')
print(cv)
# Print the average of the R2 and store it in the table
print('Average of the R2 Score:')
print(cv.mean())

# Store the data in the model_comparison table
models = {'Model':['Ridge Regression'], 'Scoring Metric':['R2'], 'Scoring Value':[cv.mean()]}
model_comparison = pd.concat([model_comparison, pd.DataFrame(data=models)], ignore_index=True)

In [None]:
print(y_pred)
print(y_pred.mean())

# Lasso Regression

In [None]:
lasso = linear_model.Lasso(alpha=.35)
lasso.fit(X_train, Y_train)

y_pred = lasso.predict(X_test)

# Print the R-Squared value on the test set
print('R-Squared of the model:') 
score = r2_score(Y_test, y_pred)
print(score)

# Print the cross-validated R2.
cv = cross_val_score(lasso, X, Y, cv=10, scoring='r2') 
print('R2 Score:')
print(cv)
# Print the average of the R2 and store it in the table
print('Average of the R2 Score:')
print(cv.mean())

# Store the data in the model_comparison table
models = {'Model':['Lasso Regression'], 'Scoring Metric':['R2'], 'Scoring Value':[cv.mean()]}
model_comparison = pd.concat([model_comparison, pd.DataFrame(data=models)], ignore_index=True)

In [None]:
print(y_pred)
print(y_pred.mean())

# Support Vector Regression

In [None]:
# Changing value for epsilon may reduce the overfitting

# Make a model using SVR here
svr = SVR(epsilon=.5)
svr.fit(X_train,Y_train)
y_pred = svr.predict(X_test)

# Use R2 for the Scoring Metric
cv = cross_val_score(svr, X, Y, cv=10, scoring='r2') 
print('R2 Score:')
print(cv)
# Print the average of the R2 and store it in the table
print('Average of the R2 Score:')
print(cv.mean())

# Store the data in the model_comparison table
models = {'Model':['Support Vector Regression'], 'Scoring Metric':['R2'], 'Scoring Value':[cv.mean()]}
model_comparison = pd.concat([model_comparison, pd.DataFrame(data=models)], ignore_index=True)

In [None]:
print(y_pred)
print(y_pred.mean())

# Support Vector Classifier

In [None]:
# Use GridSearchCV to determine the best gamma and C values for SVC.
# C: any positive float, default is 1.0
# Kernel types: 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed', default is 'rbf'
# Gamma: float, default was 1/n_features when this notebook was written
from sklearn.model_selection import GridSearchCV
parameters = {'gamma':[0.1,1], 'C':[0.1,1]}
svc = SVC()
clf = GridSearchCV(svc, parameters)
clf.fit(X_train, Y_train)

In [None]:
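# Show the best parameters found by the grid search and their mean cross-validated score
print(clf.best_params_)
print(clf.best_score_)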

In [None]:
svc = SVC(gamma = .9)
svc.fit(X_train,Y_train)
y_pred = svc.predict(X_test)

print('Confusion Matrix of the model:')
conf_mat = confusion_matrix(Y_test, y_pred)
print(conf_mat)

# Print the Accuracy.
cv = cross_val_score(svc, X, Y, cv=10, scoring='accuracy') 
print('Accuracy Score:')
print(cv)
# Print the average of the accuracy and store it in the table
print('Average of the Accuracy Score:')
print(cv.mean())


# Store the data in the model_comparison table
models = {'Model':['Support Vector Classifier'], 'Scoring Metric':['Accuracy'], 'Scoring Value':[cv.mean()]}
model_comparison = pd.concat([model_comparison, pd.DataFrame(data=models)], ignore_index=True)

### Testing for many combinations of C, gamma, and kernel.

All params default: Average Accuracy (AA) = 0.524266973886

C = .25, other params default: AA = 0.524266973886

C = .75, other params default: AA = 0.524266973886

Kernel = 'linear,' other params default: AA = Would not run

Kernel = 'poly,' 'sigmoid,' or 'precomputed,' other params default: ----> 9 svc.fit(X_train,Y_train) TypeError: must be real number, not str

Gamma = .1, other params default: AA = 0.524266973886

Gamma = .9, other params default: AA = 0.524266973886
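
The `TypeError: must be real number, not str` above is what one would expect if the kernel name were passed positionally, since `C` is the first parameter of `SVC`; that is a guess about the cause, not something these notes confirm. The sketch below passes `kernel` by keyword and searches over `C`, `gamma`, and `kernel` with `GridSearchCV`. It runs on synthetic data from `make_classification`, and the `StandardScaler` step is an added assumption (the notebook does not scale its features).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic two-class data standing in for the stock features
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

# Scale the features, then fit an SVC; all SVC parameters are set by keyword
pipe = Pipeline([('scale', StandardScaler()), ('svc', SVC())])
param_grid = {
    'svc__C': [0.1, 1.0],
    'svc__gamma': [0.1, 1.0],
    'svc__kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X_demo, y_demo)

# Best combination found and its mean cross-validated accuracy
print(search.best_params_)
print(search.best_score_)
```

Reading `best_params_` and `best_score_` from the fitted search object replaces the manual trial-and-error listed above.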

In [None]:
print(y_pred)
print(y_pred.mean())

# Gradient Boosting Classifier

In [None]:
# 500 iterations, using 5-deep trees. The loss is left at its default,
# the log loss (named 'deviance' in older scikit-learn releases).
params = {'n_estimators': 500,
          'max_depth': 5}

# Initialize and fit the model.
clf = ensemble.GradientBoostingClassifier(**params)
clf.fit(X_train,Y_train)
y_pred = clf.predict(X_test)

print('Confusion Matrix of the model:')
conf_mat = confusion_matrix(Y_test, y_pred)
print(conf_mat)

# Print the Accuracy.
cv = cross_val_score(clf, X, Y, cv=10, scoring='accuracy') 
print('Accuracy Score:')
print(cv)
# Print the average of the accuracy and store it in the table
print('Average of the Accuracy Score:')
print(cv.mean())

# Store the data in the model_comparison table
models = {'Model':['Gradient Boosting Classifier'], 'Scoring Metric':['Accuracy'], 'Scoring Value':[cv.mean()]}
model_comparison = pd.concat([model_comparison, pd.DataFrame(data=models)], ignore_index=True)

In [None]:
print(y_pred)
print(y_pred.mean())

# Section 3: Selection and Analysis of the Best Performing Model

In [None]:
model_comparison

# Given one thousand dollars in cash every ten business days, how will the models perform compared to one another as well as to a 401(k) approach?
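
The trading rules used in this section can be sketched on a handful of made-up prices before running them on the real data. The example below assumes a hypothetical `simulate` helper, a deposit every two days instead of every ten, and an invented price list and signal list; it illustrates the buy-all / sell-all logic and is not the notebook's own set of functions.

```python
def simulate(prices, signals, deposit=1000, every=2):
    """Toy back-test: buy whole shares on +1, sell everything on -1."""
    cash, shares = 0.0, 0
    for i, (price, signal) in enumerate(zip(prices, signals)):
        if i % every == 0:
            cash += deposit              # periodic contribution
        if signal > 0:
            bought = int(cash // price)  # as many whole shares as cash allows
            shares += bought
            cash -= bought * price
        elif signal < 0:
            cash += shares * price       # liquidate the whole position
            shares = 0
    # Value the final holdings at the last price
    return shares * prices[-1] + cash

prices = [100.0, 110.0, 90.0, 120.0]      # hypothetical closing prices
always_buy = simulate(prices, [1, 1, 1, 1])
timed = simulate(prices, [1, -1, 1, 1])

print(always_buy)  # 2530.0
print(timed)       # 2790.0
```

Both runs receive 2000 in deposits, so the toy profits are 530 for buying every day and 790 for the run that sells before the dip. A model only beats the always-buy rule when its sell signals land ahead of real declines.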

In [None]:
# Create a cash_available list that stores how much cash is available 
# to buy the stocks. It will be $1000 every ten business days.
cash_available = 0

# Start out with 0 stocks_owned
stocks_owned = 0

# Create DataFrame for profits
profits = {'Model':[],'Profit':[]}
columns = profits.keys()
model_profits = pd.DataFrame(data=profits, columns=columns)

In [None]:
# Create a pred_return function to calculate the returns of the prediction models
def pred_return(y_pred, data, cash_available, stocks_owned):
    # Collect the holdings after each trading day
    rows = []
    for i in range(len(y_pred)):
        # For every tenth iteration of i, add 1000 to cash_available
        if i%10 == 0:
            cash_available = cash_available+1000
        # data.Close[i] is the closing price in row i of the dataset
        # (as in safe_invest); the test-set rows themselves start at row offset.
        # If the predicted value is greater than zero, buy more stock
        if y_pred[i] > 0:
            stocks_owned, cash_available = buy_stock(cash_available, 
                                                     stocks_owned, y_pred, 
                                                     data.Close[i])
        # If the predicted value is less than zero, sell stock
        elif y_pred[i] < 0:
            stocks_owned, cash_available = sell_stock(cash_available, 
                                                      stocks_owned, y_pred, 
                                                      data.Close[i])
        rows.append({'Stocks': stocks_owned, 'Cash': cash_available})
    # Build the table of daily holdings in one step
    return pd.DataFrame(rows, columns=['Stocks', 'Cash'])

In [None]:
# Create a buy_stock function that buys as many stocks as can be afforded
def buy_stock(cash_available, stocks_owned, y_pred, value):
    # Set number of stocks to buy
    num_stocks_buy = int(cash_available/value)
    # Subtract the purchase cost from cash_available and add the shares to stocks_owned
    cash_available = cash_available-num_stocks_buy*value
    stocks_owned = stocks_owned+num_stocks_buy
    return(stocks_owned, cash_available)

In [None]:
# Create a sell_stock function that sells all stocks
def sell_stock(cash_available, stocks_owned, y_pred, value):
    sell_value = stocks_owned*value
    cash_available = cash_available + sell_value
    stocks_owned = 0
    return(stocks_owned, cash_available)

In [None]:
# Create the 401k approach simulator, safe_invest, 
# that buys as much stock as can be afforded on every trading day,
# with $1000 added every two weeks / ten business days.
def safe_invest(data, cash_available, stocks_owned):
    # Collect the holdings after each trading day
    rows = []
    for i in range(len(data.loc[offset:,:])):
        # For every tenth iteration of i, add 1000 to cash_available
        if i%10 == 0:
            cash_available = cash_available+1000
        num_stocks_buy = int(cash_available/data.Close[i])
        stocks_owned = num_stocks_buy + stocks_owned
        cash_available = cash_available-num_stocks_buy*data.Close[i]
        # Store the values for this day
        rows.append({'Stocks': stocks_owned, 'Cash': cash_available})

    # Build the table of daily holdings in one step
    return pd.DataFrame(rows, columns=['Stocks', 'Cash'])

In [None]:
# Run the pred_return function for lr
y_pred = lr.fit(X_train,Y_train).predict(X_test)
cash_available = 0
stocks_owned = 0
values = pred_return(y_pred, data, cash_available, stocks_owned)

In [None]:
# Calculate the ending returns from the lr model
lr_returns = values.loc[len(values)-1,'Stocks']*data.loc[len(y_pred)-1,'Close'] + values.loc[len(values)-1, 'Cash']
print(lr_returns)

In [None]:
# 1257 rows of data
# $1000 granted every 10 days = 125,700 dollars granted.
# Since we are predicting the last 20% of the data, multiply by .20 
# An estimated total of $25,140 granted in the last 20% of the data
# Get total profit by subtracting the dollars granted from the returns.
lr_profit = lr_returns - 25140
print(lr_profit)

# Store the profit in the model_profits table
profits = {'Model':['Logistic Regression'], 'Profit':[lr_profit]}
model_profits = pd.concat([model_profits, pd.DataFrame(data=profits)], ignore_index=True)

In [None]:
# Run the pred_return function for ridge
y_pred = ridgeregr.fit(X_train,Y_train).predict(X_test)
cash_available = 0
stocks_owned = 0
values = pred_return(y_pred, data, cash_available, stocks_owned)

In [None]:
# Calculate the ending returns from the ridge model
ridgeregr_returns = values.loc[len(values)-1,'Stocks']*data.loc[len(y_pred)-1,'Close'] + values.loc[len(values)-1, 'Cash']
print(ridgeregr_returns)

In [None]:
# 1257 rows of data
# $1000 granted every 10 days = 125,700 dollars granted.
# Since we are predicting the last 20% of the data, multiply by .20 
# An estimated total of $25,140 granted in the last 20% of the data
# Get total profit by subtracting the dollars granted from the returns.
ridgeregr_profit = ridgeregr_returns - 25140
print(ridgeregr_profit)

# Store the profit in the model_profits table
profits = {'Model':['Ridge Regression'], 'Profit':[ridgeregr_profit]}
model_profits = pd.concat([model_profits, pd.DataFrame(data=profits)], ignore_index=True)

In [None]:
# Run the pred_return function for lasso
y_pred = lasso.fit(X_train,Y_train).predict(X_test)
cash_available = 0
stocks_owned = 0
values = pred_return(y_pred, data, cash_available, stocks_owned)

In [None]:
# Calculate the ending returns from the lasso model
lasso_returns = values.loc[len(values)-1,'Stocks']*data.loc[len(y_pred)-1,'Close'] + values.loc[len(values)-1, 'Cash']
print(lasso_returns)

In [None]:
# 1257 rows of data
# $1000 granted every 10 days = 125,700 dollars granted.
# Since we are predicting the last 20% of the data, multiply by .20 
# An estimated total of $25,140 granted in the last 20% of the data
# Get total profit by subtracting the dollars granted from the returns.
lasso_profit = lasso_returns - 25140
print(lasso_profit)

# Store the profit in the model_profits table
profits = {'Model':['Lasso Regression'], 'Profit':[lasso_profit]}
model_profits = pd.concat([model_profits, pd.DataFrame(data=profits)], ignore_index=True)

In [None]:
# Run the pred_return function for svr
y_pred = svr.fit(X_train,Y_train).predict(X_test)
cash_available = 0
stocks_owned = 0
values = pred_return(y_pred, data, cash_available, stocks_owned)

In [None]:
# Calculate the ending returns from the svr model
svr_returns = values.loc[len(values)-1,'Stocks']*data.loc[len(y_pred)-1,'Close'] + values.loc[len(values)-1, 'Cash']
print(svr_returns)

In [None]:
# 1257 rows of data
# $1000 granted every 10 days = 125,700 dollars granted.
# Since we are predicting the last 20% of the data, multiply by .20 
# An estimated total of $25,140 granted in the last 20% of the data
# Get total profit by subtracting the dollars granted from the returns.
svr_profit = svr_returns - 25140
print(svr_profit)

# Store the profit in the model_profits table
profits = {'Model':['Support Vector Regression'], 'Profit':[svr_profit]}
model_profits = pd.concat([model_profits, pd.DataFrame(data=profits)], ignore_index=True)

In [None]:
# Run the pred_return function for svc
y_pred = svc.fit(X_train,Y_train).predict(X_test)
cash_available = 0
stocks_owned = 0
values = pred_return(y_pred, data, cash_available, stocks_owned)

In [None]:
# Calculate the ending returns from the svc model
svc_returns = values.loc[len(values)-1,'Stocks']*data.loc[len(y_pred)-1,'Close'] + values.loc[len(values)-1, 'Cash']
print(svc_returns)

In [None]:
# 1257 rows of data
# $1000 granted every 10 days = 125,700 dollars granted.
# Since we are predicting the last 20% of the data, multiply by .20 
# An estimated total of $25,140 granted in the last 20% of the data
# Get total profit by subtracting the dollars granted from the returns.
svc_profit = svc_returns - 25140
print(svc_profit)

# Store the profit in the model_profits table
profits = {'Model':['Support Vector Classifier'], 'Profit':[svc_profit]}
model_profits = pd.concat([model_profits, pd.DataFrame(data=profits)], ignore_index=True)

In [None]:
# Run the pred_return function for gradient boosting classifier
y_pred = clf.fit(X_train,Y_train).predict(X_test)
cash_available = 0
stocks_owned = 0
values = pred_return(y_pred, data, cash_available, stocks_owned)

In [None]:
# Calculate the ending returns from the clf model
clf_returns = values.loc[len(values)-1,'Stocks']*data.loc[len(y_pred)-1,'Close'] + values.loc[len(values)-1, 'Cash']
print(clf_returns)

In [None]:
# 1257 rows of data
# $1000 granted every 10 days = 125,700 dollars granted.
# Since we are predicting the last 20% of the data, multiply by .20 
# An estimated total of $25,140 granted in the last 20% of the data
# Get total profit by subtracting the dollars granted from the returns.
clf_profit = clf_returns - 25140
print(clf_profit)

# Store the profit in the model_profits table
profits = {'Model':['Gradient Boosting Classifier'], 'Profit':[clf_profit]}
model_profits = pd.concat([model_profits, pd.DataFrame(data=profits)], ignore_index=True)

In [None]:
# Run the safe_invest function
cash_available = 0
stocks_owned = 0
values = safe_invest(data, cash_available, stocks_owned)

In [None]:
# Calculate the ending returns from the safe_invest function
safe_returns = values.loc[len(values)-1,'Stocks']*data.loc[len(data.loc[offset:,:])-1,'Close'] + values.loc[len(values)-1, 'Cash']
print(safe_returns)

In [None]:
# 1257 rows of data
# $1000 granted every 10 days = 125,700 dollars granted.
# Since we are predicting the last 20% of the data, multiply by .20 
# An estimated total of $25,140 granted in the last 20% of the data
# Get total profit by subtracting the dollars granted from the returns.
safe_profit = safe_returns - 25140
print(safe_profit)

# Store the profit in the model_profits table
profits = {'Model':['401k Simulator'], 'Profit':[safe_profit]}
model_profits = pd.concat([model_profits, pd.DataFrame(data=profits)], ignore_index=True)

In [None]:
model_profits

# Analysis:

The test data had an average momentum of 0.07142857142857142, which reflects the overall growth of the stock over the test period. This average will be compared to the average predicted momentum (the mean of y_pred) for each model. 

Logistic Regression and the Support Vector Classifier each had an average y_pred of 1.0, meaning that both models predicted an upward movement for every point and therefore acted in the same manner as the 401(k) simulator. 
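
A short worked check ties these numbers together. If every day in the test set is labeled +1 or -1 (an assumption; a day with an unchanged close would not fit either label), a mean momentum of m implies that a fraction (1 + m) / 2 of the days were up days, which is also the accuracy of a model that always predicts +1.

```python
# Mean momentum of the test set, as reported in the analysis
mean_momentum = 0.07142857142857142

# With labels restricted to +1 and -1: mean = p_up - (1 - p_up)
p_up = (1 + mean_momentum) / 2

# Accuracy of a model that predicts +1 on every day equals the up-day share
print(round(p_up, 4))  # 0.5357
```

Under that assumption an always-up model would score about .54 on the test set, close to the cross-validated .52 reported for these two classifiers.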

The Gradient Boosting Classifier predicted both rises and falls in the stock value, with an overall average of -0.166666666667. The model behaved appropriately in that it predicted rises and falls, but it did not perform as well as the 401(k) simulator. The results make sense, since the Gradient Boosting Classifier had an average accuracy of .50, whereas the Logistic Regression and Support Vector Classifier models had average accuracies of .52.

Lasso Regression and Support Vector Regression had average y_pred values of 0.0228357109921 and 0.0213930348259, respectively. This suggests that these two models predicted both rises and falls, and they generated a profit equal to that of the 401(k) simulator. 

# What happened with Ridge Regression?

Ridge Regression had an average y_pred of 0.0248827351445. The model behaved appropriately in that it predicted the rises and falls of the stock, and it returned a higher profit than the 401(k) simulator.

Ridge Regression likely performed better than Lasso Regression because Ridge shrinks the coefficients of all features toward zero without eliminating any of them, whereas Lasso can set some coefficients exactly to zero and so drop features from the model. 
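
The difference between the two penalties can be seen on synthetic data. The sketch below is an illustration under assumed settings (random features, a single informative column, and arbitrarily chosen `alpha` values); it does not reproduce the notebook's results.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)

# Four synthetic features; only the first one drives the target
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] + 0.1 * rng.normal(size=200)

ridge = Ridge(alpha=10).fit(X_demo, y_demo)
lasso = Lasso(alpha=0.5).fit(X_demo, y_demo)

# Ridge shrinks every coefficient but leaves them non-zero;
# Lasso sets the coefficients of the uninformative features to exactly zero
print('Ridge:', ridge.coef_)
print('Lasso:', lasso.coef_)
```

Whether keeping or dropping weak features helps depends on the data, so this only shows the mechanism, not which model should win.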

There is some multicollinearity caused by adding the streak feature (it has a high correlation with momentum), but it was kept.

Note that Ridge and Lasso Regression both minimize the squared error on the training set plus a penalty on the size of the coefficients; neither is fit to the test set.

# Future Work

The models could be improved by creating additional features, such as a 52-week moving average, and by using Natural Language Processing to gather sentiment about stocks from the web.
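
As a hedged sketch of the moving-average idea, pandas can compute a rolling mean with `Series.rolling`. The toy prices and the 3-day window are assumptions chosen to keep the output readable; a 52-week window would be roughly 252 trading days.

```python
import pandas as pd

# Hypothetical closing prices, for illustration only
close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0], name='Close')

# Rolling mean over the current day and the two days before it;
# the first two rows are NaN because the window is not yet full
moving_avg = close.rolling(window=3).mean()

# Example feature: is the close above its own moving average?
above_avg = (close > moving_avg).astype(int)

print(moving_avg.tolist())
print(above_avg.tolist())   # [0, 0, 1, 1, 1]
```

Rows where the window is not yet full compare as False here, so they would need to be dropped or filled before being used as model input.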