# Unit 3 Capstone Project
## Matthew Kennedy, August 2017

## Section 1: Overview of Dataset and Analysis of Data
   
   The dataset used in this project comes from Kaggle user "CNuge." The dataset contains historical stock prices over the last five years for all companies in the S&P 500 index and can be found at https://www.kaggle.com/camnugent/sandp500. This project will use the files that have the historical prices for individual stocks.   
       
   The dataset contains the following columns: 
       
       Date - In the format of yyyy-mm-dd
       Open - Price of the stock in USD at market open
       High - Highest price reached in the day
       Low - Lowest price reached in the day
       Close - Price of the stock in USD at market close
       Volume - Number of shares traded
       Name - The stock's ticker name
       
   The user collected the data by using the Python library 'pandas_datareader' to scrape Google Finance.

In [28]:
# Import the necessary modules
import pandas as pd
import numpy as np
import scipy
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn import linear_model
from sklearn import preprocessing
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVR
from sklearn import ensemble
from sklearn import datasets
from sklearn.utils import shuffle
from sklearn.metrics import mean_squared_error

In [32]:
# Change this to the individual stock files

# Read the Dataset, store the original
original = pd.read_csv('C:\\Users\\mkennedy\\sandp500\\all_stocks_5yr.csv', encoding='utf-8-sig')

In [19]:
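# Make an independent copy of the original data to manipulate.
# (Plain assignment would only create a second name for the same dataframe.)
data = original.copy()

# Display the first five rows of the dataframe
data.head()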

Unnamed: 0,Date,Open,High,Low,Close,Volume,Name
0,2012-08-13,92.29,92.59,91.74,92.4,2075391.0,MMM
1,2012-08-14,92.36,92.5,92.01,92.3,1843476.0,MMM
2,2012-08-15,92.0,92.74,91.94,92.54,1983395.0,MMM
3,2012-08-16,92.75,93.87,92.21,93.74,3395145.0,MMM
4,2012-08-17,93.93,94.3,93.59,94.24,3069513.0,MMM


In [20]:
# Check the footer to make sure there are no rows of text
data.tail()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Name
606796,2017-08-07,62.12,62.34,61.25,61.83,4208287.0,ZTS
606797,2017-08-08,60.49,61.0,59.5,60.0,4663668.0,ZTS
606798,2017-08-09,59.95,60.87,59.76,60.81,4017297.0,ZTS
606799,2017-08-10,60.87,61.37,59.71,59.74,2690725.0,ZTS
606800,2017-08-11,60.05,60.22,59.64,59.73,2285863.0,ZTS


There are no footers that need to be excluded.
There are 606,801 rows of stock data (index 0 through 606,800).

In [21]:
# The describe method provides some additional information about the data
data.describe()

Unnamed: 0,Open,High,Low,Close,Volume
count,606417.0,606593.0,606574.0,606801.0,606395.0
mean,79.529041,80.257435,78.799338,79.55792,4500925.0
std,93.383162,94.187977,92.5353,93.382168,9336171.0
min,1.62,1.69,1.5,1.59,0.0
25%,38.07,38.46,37.7,38.09,1077091.0
50%,59.24,59.79,58.69,59.27,2131913.0
75%,89.39,90.15,88.62,89.43,4442768.0
max,2044.0,2067.99,2035.11,2049.0,618237600.0


In [22]:
# The dtypes call will display the data types. 
# This is used to make sure all numerical values have the correct data type to work with in the models.
print(data.dtypes)

Date       object
Open      float64
High      float64
Low       float64
Close     float64
Volume    float64
Name       object
dtype: object


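The data types are as expected, but the differing counts in the describe output show that the Open, High, Low, and Volume columns contain missing values. These will need to be handled before the data can be used in the models.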
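The describe output above reports a different count for each column (for example, 606,417 for Open against 606,801 for Close), which suggests that some rows are missing values. A minimal sketch of how that could be confirmed is shown here; the `sample` frame and its values are hypothetical stand-ins for the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for the stock data.
sample = pd.DataFrame({
    'Open': [92.29, np.nan, 92.00, 92.75],
    'High': [92.59, 92.50, np.nan, 93.87],
    'Low': [91.74, 92.01, 91.94, 92.21],
    'Close': [92.40, 92.30, 92.54, 93.74],
    'Volume': [2075391.0, 1843476.0, np.nan, 3395145.0],
})

# count() ignores NaN, so a count below the row total signals missing data.
non_null_counts = sample.count()

# isna().sum() gives the number of missing values per column directly.
missing_per_column = sample.isna().sum()

print(pd.DataFrame({'non_null': non_null_counts,
                    'missing': missing_per_column}))
```

Running the same two calls on `data` would show which columns need attention before any model is fitted.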

## Section 2: Creation and Comparison of Predictive Models

Now that the data has been analyzed to ensure it can be manipulated, it is time to create some predictive models. For comparison, the scores from the models will be stored in a new table, titled "Model Comparison."

In [31]:
# Create a table to store the scores for each model.
# Title: Model Comparison
# Columns: Model, R^2, Run_Time
# Model values: Linear Regression, Ridge Regression, Lasso Regression, Support Vector Regression, Gradient Boost Classification
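The table described in the comments above could be built as a pandas dataframe. This is only a sketch: the variable name `model_comparison`, the helper `record_result`, and the numbers passed to it are assumptions made for illustration, not results from this project.

```python
import numpy as np
import pandas as pd

model_names = [
    'Linear Regression',
    'Ridge Regression',
    'Lasso Regression',
    'Support Vector Regression',
    'Gradient Boost Classification',
]

# One row per model; scores and run times start out empty (NaN).
model_comparison = pd.DataFrame(
    {'R^2': np.nan, 'Run_Time': np.nan},
    index=pd.Index(model_names, name='Model'),
)


def record_result(table, model, r2, run_time):
    """Store a score and a run time (in seconds) in the row for one model."""
    table.loc[model, ['R^2', 'Run_Time']] = [r2, run_time]


# Made-up numbers, used only to show how a row would be filled in.
record_result(model_comparison, 'Ridge Regression', 0.5, 1.25)
print(model_comparison)
```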

New features will be created to show a one-week trend as well as a four-week average.
### NOTE: I want to add features that capture momentum (how many days in a row a stock has increased, compared to its average increase or decrease), as well as the one-month, one-quarter, and one-year averages.
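One possible way to build the one-week trend and four-week average is sketched below. Several assumptions are made here that are not in the original: a trading week is taken as 5 rows and four weeks as 20 rows, the trend is defined as the fractional change in Close over one week, and the column names `Trend_1w` and `Avg_4w` and the tickers `AAA` and `BBB` are hypothetical. Grouping by Name keeps one stock's prices from leaking into another's features.

```python
import numpy as np
import pandas as pd

# Hypothetical prices for two tickers, 30 trading days each.
days = pd.bdate_range('2017-01-02', periods=30)
frames = []
for name, start in [('AAA', 100.0), ('BBB', 50.0)]:
    frames.append(pd.DataFrame({
        'Date': days,
        'Close': start + np.arange(30, dtype=float),
        'Name': name,
    }))
prices = pd.concat(frames, ignore_index=True)

# Sort so that each stock's rows are in date order before shifting or rolling.
prices = prices.sort_values(['Name', 'Date'])
close_by_stock = prices.groupby('Name')['Close']

# Assumption: one trading week = 5 rows, four weeks = 20 rows.
# One-week trend: fractional change in Close compared with 5 rows earlier.
prices['Trend_1w'] = prices['Close'] / close_by_stock.shift(5) - 1

# Four-week average: rolling mean of Close over the last 20 rows.
prices['Avg_4w'] = close_by_stock.transform(
    lambda s: s.rolling(window=20).mean()
)

print(prices.groupby('Name').tail(2))
```

The first rows of each stock have no earlier prices to compare against, so both features are NaN there and those rows would need to be dropped or filled before modeling.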

In [33]:
# Set the variables. Y is the closing value, X is everything else except date and name.
Y = data['Close'].values.reshape(-1, 1)
X = data.loc[:, ~(data.columns).isin(['Close', 'Date', 'Name'])]


# Create training and test sets.
offset = int(X.shape[0] * 0.9)

# Put 90% of the data in the training set.
X_train, Y_train = X[:offset], Y[:offset]

# And put 10% in the test set.
X_test, Y_test = X[offset:], Y[offset:]
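The split above keeps every row, including any with missing values. A hedged sketch of the same 90/10 split with incomplete rows dropped first is shown below; the `frame` data is hypothetical, and dropping rows is only one option (filling the gaps is another). The target is also kept one-dimensional here, which is the shape most scikit-learn estimators expect.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same columns as the stock data.
n_rows = 20
frame = pd.DataFrame({
    'Date': pd.bdate_range('2017-01-02', periods=n_rows).strftime('%Y-%m-%d'),
    'Open': np.linspace(90.0, 99.5, n_rows),
    'High': np.linspace(91.0, 100.5, n_rows),
    'Low': np.linspace(89.0, 98.5, n_rows),
    'Close': np.linspace(90.5, 100.0, n_rows),
    'Volume': np.linspace(1e6, 2e6, n_rows),
    'Name': 'AAA',
})

# Knock out two values to imitate the gaps in the real data.
frame.loc[3, 'Open'] = np.nan
frame.loc[7, 'Volume'] = np.nan

# Drop every row that has at least one missing value.
clean = frame.dropna()

# Y is the closing value (1-D); X is everything except Close, Date and Name.
Y = clean['Close'].values
X = clean.drop(columns=['Close', 'Date', 'Name'])

# First 90% of the rows for training, the remaining 10% for testing.
offset = int(X.shape[0] * 0.9)
X_train, Y_train = X[:offset], Y[:offset]
X_test, Y_test = X[offset:], Y[offset:]

print(X_train.shape, X_test.shape)
```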

### Binary Logistic Regression Classifier

In [34]:
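# Declare a logistic regression classifier. The very large C makes the
# regularization negligible.
# Note: LogisticRegression expects discrete class labels, so the continuous
# Close target would first need to be converted to classes (for example,
# up/down). Its score() method returns mean accuracy, not R².
lr = LogisticRegression(C=1e9)

# Fit the model.
lr.fit(X_train, Y_train)
print('\nAccuracy for the model with train set:')
print(lr.score(X_train, Y_train))
print('Run time for the model with train set:')
#print(Runtime)
print('\nAccuracy for the model with test set:')
print(lr.score(X_test, Y_test))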
print('Run time for the model with test set:')
#print(Runtime)
# Store the run time from the test set to the appropriate row/column for the model.

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
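The `#print(Runtime)` placeholders in the model cells could be filled in by timing each fit. The sketch below uses `time.perf_counter` for that; the helper name `fit_and_time`, the synthetic data, and the use of plain linear regression as the example model are all assumptions made for illustration.

```python
import time

import numpy as np
from sklearn import linear_model

# Synthetic data, used only so that the example runs on its own.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)


def fit_and_time(model, X, y):
    """Fit the model and return (score on X and y, seconds spent fitting)."""
    start = time.perf_counter()
    model.fit(X, y)
    run_time = time.perf_counter() - start
    return model.score(X, y), run_time


r2, run_time = fit_and_time(linear_model.LinearRegression(), X_demo, y_demo)
print('R² for the model with train set:', r2)
print('Run time for the model with train set (s):', run_time)
```

The returned pair could then be written to the matching row of the comparison table.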

### Ridge Regression

In [35]:
# Fitting a ridge regression model. Alpha is the regularization
# parameter (usually called lambda). As alpha gets larger, parameter
# shrinkage grows more pronounced. Note that by convention, the
# intercept is not regularized. The data has not been standardized,
# so the intercept is kept (fit_intercept defaults to True).

ridgeregr = linear_model.Ridge(alpha=10)
ridgeregr.fit(X_train, Y_train)
print('\nR² for the model with train set:')
print(ridgeregr.score(X_train, Y_train))
print('Run time for the model with train set:')
#print(Runtime)
print('\nR² for the model with test set:')
print(ridgeregr.score(X_test, Y_test))
print('Run time for the model with test set:')
#print(Runtime)
# Store the run time from the test set to the appropriate row/column for the model.

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
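Ridge penalizes coefficients by their size, so features on very different scales (prices in the tens of dollars, volume in the millions of shares) are often standardized first. One way that could be done is sketched below with a scikit-learn pipeline; the synthetic data and the name `ridge_pipeline` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales: a price and a share volume.
rng = np.random.default_rng(1)
X_demo = rng.normal(loc=[50.0, 2e6], scale=[10.0, 5e5], size=(300, 2))
y_demo = (0.8 * X_demo[:, 0] + 1e-6 * X_demo[:, 1]
          + rng.normal(scale=0.5, size=300))

# The scaler is fitted on the training rows only, then reused for the test
# rows, so no information from the test set leaks into the model.
ridge_pipeline = make_pipeline(StandardScaler(), Ridge(alpha=10))
ridge_pipeline.fit(X_demo[:270], y_demo[:270])

test_r2 = ridge_pipeline.score(X_demo[270:], y_demo[270:])
print('R² for the model with test set:', test_r2)
```

The intercept is left at its default here, so the model does not depend on the target being centered.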

### Lasso Regression

In [36]:
lasso = linear_model.Lasso(alpha=.35)
lasso.fit(X_train, Y_train)
print('\nR² for the model with train set:')
print(lasso.score(X_train, Y_train))
print('Run time for the model with train set:')
#print(Runtime)
print('\nR² for the model with test set:')
print(lasso.score(X_test, Y_test))
print('Run time for the model with test set:')
#print(Runtime)
# Store the run time from the test set to the appropriate row/column for the model.

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

### Support Vector Regression

In [37]:
# Fit a support vector regression model with the default settings (RBF kernel).
svr = SVR()
svr.fit(X_train, Y_train)
print('\nR² for the model with train set:')
print(svr.score(X_train, Y_train))
print('Run time for the model with train set:')
#print(Runtime)
print('\nR² for the model with test set:')
print(svr.score(X_test, Y_test))
print('Run time for the model with test set:')
#print(Runtime)
# Store the run time from the test set to the appropriate row/column for the model.

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

### Gradient Boosting Classifier

In [38]:
# 500 iterations, using 2-deep trees, and loss function 'deviance'.
# Note: GradientBoostingClassifier expects discrete class labels, and its
# score() method returns mean accuracy, not R².
params = {'n_estimators': 500,
          'max_depth': 2,
          'loss': 'deviance'}

# Initialize and fit the model.
clf = ensemble.GradientBoostingClassifier(**params)
clf.fit(X_train, Y_train)
print('\nAccuracy for the model with train set:')
print(clf.score(X_train, Y_train))
print('Run time for the model with train set:')
#print(Runtime)
print('\nAccuracy for the model with test set:')
print(clf.score(X_test, Y_test))
print('Run time for the model with test set:')
#print(Runtime)
# Store the run time from the test set to the appropriate row/column for the model.

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
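Because the closing price is a continuous value, a gradient boosting regressor may be a closer fit for this target than a classifier. The sketch below swaps in `GradientBoostingRegressor` with the same number of trees and depth; this is a substitution and not the model used above, and the synthetic data is an assumption so that the example runs on its own.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data, used only so that the example runs on its own.
rng = np.random.default_rng(2)
X_demo = rng.uniform(0.0, 10.0, size=(300, 3))
y_demo = (2.0 * X_demo[:, 0] - X_demo[:, 1]
          + rng.normal(scale=0.5, size=300))

# 500 iterations using 2-deep trees; the loss is left at the regressor default.
gbr = GradientBoostingRegressor(n_estimators=500, max_depth=2)
gbr.fit(X_demo[:270], y_demo[:270])

# For a regressor, score() returns R².
train_r2 = gbr.score(X_demo[:270], y_demo[:270])
test_r2 = gbr.score(X_demo[270:], y_demo[270:])
print('R² for the model with train set:', train_r2)
print('R² for the model with test set:', test_r2)
```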

## Section 3: Selection and Analysis of the Best Performing Model

Display the model comparison table here. Write up an analysis. 
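Once the scores are recorded, the table could be ranked so that the best performing model appears first. This is a sketch with placeholder model names and placeholder numbers; none of the values are results from this project.

```python
import pandas as pd

# Placeholder rows, not results from this project.
model_comparison = pd.DataFrame({
    'Model': ['Model A', 'Model B', 'Model C'],
    'R^2': [0.10, 0.30, 0.20],
    'Run_Time': [1.0, 3.0, 2.0],
})

# Highest R² first; run time is kept alongside for the write-up.
ranked = (model_comparison
          .sort_values('R^2', ascending=False)
          .reset_index(drop=True))
best_model = ranked.loc[0, 'Model']

print(ranked)
print('Best model by R²:', best_model)
```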