## Predicting the Stock Market

In this project, you'll be working with data from the S&P500 Index.

We'll be using historical data on the price of the S&P500 Index to make predictions about future prices. Predicting whether an index will go up or down will help us forecast how the stock market as a whole will perform. Since stocks tend to correlate with how well the economy as a whole is performing, it can also help us make economic forecasts.

In this project, you'll be working with a CSV file containing index prices. Each row in the file contains a daily record of the price of the S&P500 Index from 1950 to 2015. The dataset is stored in sphist.csv.

We'll be using this dataset to develop a predictive model. We'll train the model with data from 1950-2012, and try to make predictions from 2013-2015.

In [3]:
# PREDICTING THE STOCK MARKET

import pandas as pd
import numpy as np
from datetime import datetime

# READ IN THE DATA

df = pd.read_csv('sphist.csv')
df['DateTime'] = pd.to_datetime(df.Date)
df_ordered = df.sort_values('DateTime', ascending=True)
df_ordered['index'] = range(0, df.shape[0], 1)
df_ordered = df_ordered.set_index('index') #set_index returns a new DataFrame, so keep the result

#GENERATING A FEW INDICATORS

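#Flag rows dated after 2015-04-01 (not used as a feature below)
df_ordered['date_after_april1_2015'] = df_ordered.DateTime > datetime(year=2015, month=4, day=1)

#Average closing price over the last 5 and 365 rows (one row per trading day)
#Older pandas equivalent, shifted to exclude the current day:
#data_mean_5day = pd.rolling_mean(df_ordered.Close, window=5).shift(1)
#data_mean_365day = pd.rolling_mean(df_ordered.Close, window=365).shift(1)

#NOTE: without .shift(1), each window includes the current day's Close
data_mean_5day = df_ordered['Close'].rolling(window=5, center=False).mean()
data_mean_365day = df_ordered['Close'].rolling(window=365, center=False).mean()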

data_mean_ratio = data_mean_5day/data_mean_365day

#Standard deviation of closing price over the last 5 and 365 rows (one row per trading day)
#Older pandas equivalent, shifted to exclude the current day:
#data_std_5day = pd.rolling_std(df_ordered.Close, window=5).shift(1)
#data_std_365day = pd.rolling_std(df_ordered.Close, window=365).shift(1)

#NOTE: without .shift(1), each window includes the current day's Close
data_std_5day = df_ordered['Close'].rolling(window=5, center=False).std()
data_std_365day = df_ordered['Close'].rolling(window=365, center=False).std()

data_std_ratio = data_std_5day/data_std_365day

#Create these new indicator columns
df_ordered['data_mean_5day'] = data_mean_5day
df_ordered['data_mean_365day'] = data_mean_365day
df_ordered['data_mean_ratio'] = data_mean_ratio
df_ordered['data_std_5day'] = data_std_5day
df_ordered['data_std_365day'] = data_std_365day
df_ordered['data_std_ratio'] = data_std_ratio

# SPLIT UP THE DATA

#Remove rows before 1951-01-03 because there isn't enough historical data to compute the indicators
df_new = df_ordered[df_ordered["DateTime"] > datetime(year=1951, month=1, day=2)]
df_no_NA = df_new.dropna(axis=0) #remove rows with NaN values

df_train = df_no_NA[df_no_NA['DateTime'] < datetime(year=2013, month=1, day=1)]

df_test = df_no_NA[df_no_NA['DateTime'] >= datetime(year=2013, month=1, day=1)]

#MAKING PREDICTIONS

from sklearn.linear_model import LinearRegression
model = LinearRegression()

#Leave the original columns out of training; they contain knowledge of the future
features = ['data_mean_5day', 'data_mean_365day', 'data_mean_ratio', 'data_std_5day', 'data_std_365day', 'data_std_ratio']
X = df_train[features]
X_test = df_test[features]
y = df_train.Close
y_test = df_test.Close

model.fit(X, y)
pred = model.predict(X_test)

#Mean absolute error on the test set (2013-2015)
MAE = sum(abs(pred - y_test))/len(pred)
print(MAE)
#R^2 score on the training set
print(model.score(X, y))




11.978757971300423
0.999735486959
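
The commented-out `pd.rolling_mean` lines in the cell above apply `.shift(1)`, while the active `rolling(...)` lines do not. The difference matters: an unshifted window ending on a given row already contains that row's closing price. Here is a minimal sketch of the effect, using a small made-up series rather than the S&P500 data:

```python
import pandas as pd

# Synthetic closing prices (illustrative only, not S&P500 data)
close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0])

# The window ends on the current row, so the current price is part of the average
mean_incl = close.rolling(window=3).mean()

# Shifting by one row makes each value depend only on earlier rows
mean_prior = close.rolling(window=3).mean().shift(1)

print(pd.DataFrame({"close": close,
                    "incl_today": mean_incl,
                    "prior_only": mean_prior}))
```

With the shift, the first usable row moves one position later, so one more row is lost to `dropna`.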


Future addition: Add 2 additional indicators to your dataframe, and see if the error is reduced. You'll need to insert these indicators at the same point where you insert the others, before you clean out rows with NaN values and split the dataframe into train and test.

Here are some ideas that might be helpful:

- The average volume over the past five days.
- The average volume over the past year.
- The ratio between the average volume for the past five days, and the average volume for the past year.
- The standard deviation of the average volume over the past five days.
- The standard deviation of the average volume over the past year.
- The ratio between the standard deviation of the average volume for the past five days, and the standard deviation of the average volume for the past year.
- The year component of the date.
- The ratio between the lowest price in the past year and the current price.
- The ratio between the highest price in the past year and the current price.
- The month component of the date.
- The day of week.
- The day component of the date.
- The number of holidays in the prior month.
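
A few of these ideas can be sketched as follows. The frame below is a synthetic stand-in for `df_ordered`, and the `Volume` column name is an assumption about sphist.csv. The new column names (`vol_mean_5day` and so on) are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_ordered (random values, not real index data)
rng = np.random.default_rng(0)
dates = pd.bdate_range("2014-01-01", periods=400)
demo = pd.DataFrame({
    "DateTime": dates,
    "Close": 1800 + rng.normal(0, 5, len(dates)).cumsum(),
    "Volume": rng.integers(1_000_000, 5_000_000, len(dates)).astype(float),
})

# Volume averages over the previous 5 and 365 rows, excluding the current row
demo["vol_mean_5day"] = demo["Volume"].rolling(window=5).mean().shift(1)
demo["vol_mean_365day"] = demo["Volume"].rolling(window=365).mean().shift(1)
demo["vol_mean_ratio"] = demo["vol_mean_5day"] / demo["vol_mean_365day"]

# Calendar components of the date
demo["year"] = demo["DateTime"].dt.year
demo["month"] = demo["DateTime"].dt.month
demo["day_of_week"] = demo["DateTime"].dt.dayofweek  # Monday=0

# Drop rows that lack enough history, as the main cell does
demo_clean = demo.dropna(axis=0)
print(demo_clean.shape)
```

To try these in the project, compute the same columns on `df_ordered` before the `dropna` step and append their names to `features`.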

Future addition 2:

There's a lot of improvement still to be made on the indicator side, and we urge you to think of better indicators that you could use for prediction. We can also make significant structural improvements to the algorithm, and pull in data from other sources.

Accuracy would improve greatly by making predictions only one day ahead. For example, train a model using data from 1951-01-03 to 2013-01-02, make predictions for 2013-01-03, and then train another model using data from 1951-01-03 to 2013-01-03, make predictions for 2013-01-04, and so on. This more closely simulates what you'd do if you were trading using the algorithm.
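
One way to sketch this one-day-ahead loop is shown below. It runs on a synthetic random walk with two made-up indicator columns, so the error it prints says nothing about the real index. The 20-row test window is an arbitrary choice for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic random-walk prices (illustrative only)
rng = np.random.default_rng(1)
demo = pd.DataFrame({"Close": 100 + rng.normal(0, 1, 300).cumsum()})
demo["mean_5day"] = demo["Close"].rolling(window=5).mean().shift(1)
demo["std_5day"] = demo["Close"].rolling(window=5).std().shift(1)
demo = demo.dropna(axis=0).reset_index(drop=True)

features = ["mean_5day", "std_5day"]
first_test_row = len(demo) - 20  # hold out the last 20 rows for testing

predictions = []
for i in range(first_test_row, len(demo)):
    train = demo.iloc[:i]    # every row strictly before row i
    today = demo.iloc[[i]]   # the single row to predict
    model = LinearRegression().fit(train[features], train["Close"])
    predictions.append(model.predict(today[features])[0])

actual = demo["Close"].iloc[first_test_row:].to_numpy()
mae = np.mean(np.abs(np.array(predictions) - actual))
print(mae)
```

Refitting on every step is affordable for a linear model on a few thousand rows. A slower model might be refit less often, for example weekly.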

You can also significantly improve the algorithm itself. Try other techniques, like a random forest, and see if they perform better.
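
As a rough sketch of such a comparison, the snippet below fits a linear regression and scikit-learn's `RandomForestRegressor` on the same synthetic data and reports the test error of each. The data, the column names, and the 400-row split point are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Synthetic random-walk prices (illustrative only)
rng = np.random.default_rng(2)
demo = pd.DataFrame({"Close": 100 + rng.normal(0, 1, 500).cumsum()})
demo["mean_5day"] = demo["Close"].rolling(window=5).mean().shift(1)
demo["mean_30day"] = demo["Close"].rolling(window=30).mean().shift(1)
demo = demo.dropna(axis=0)

features = ["mean_5day", "mean_30day"]
train, test = demo.iloc[:400], demo.iloc[400:]  # chronological split

models = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

errors = {}
for name, model in models.items():
    model.fit(train[features], train["Close"])
    errors[name] = mean_absolute_error(test["Close"], model.predict(test[features]))

print(errors)
```

One thing to watch for: a random forest predicts averages of target values seen in training, so it cannot predict a price above the training maximum. On an index that trends upward over decades, that may favor predicting a ratio or a daily change instead of the raw price.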

You can also incorporate outside data, such as the weather in New York City (where most trading happens) the day before, and the amount of Twitter activity around certain stocks.
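
Joining an outside series comes down to aligning dates. The sketch below attaches the previous calendar day's temperature to each trading day. Both frames hold made-up values, and the `temp_c` column is a hypothetical name, not taken from any particular weather source.

```python
import pandas as pd

# Made-up prices for three trading days
prices = pd.DataFrame({
    "DateTime": pd.to_datetime(["2015-01-05", "2015-01-06", "2015-01-07"]),
    "Close": [2020.0, 2002.0, 2025.0],
})

# Made-up daily temperatures (hypothetical outside data)
weather = pd.DataFrame({
    "DateTime": pd.to_datetime(["2015-01-04", "2015-01-05", "2015-01-06"]),
    "temp_c": [3.0, -1.0, -4.0],
})

# Move each observation forward one calendar day so it lines up with
# the trading day that follows it
weather_prev = (
    weather.assign(DateTime=weather["DateTime"] + pd.Timedelta(days=1))
           .rename(columns={"temp_c": "temp_c_prev_day"})
)

# A left join keeps every trading day, even when no weather row matches
merged = prices.merge(weather_prev, on="DateTime", how="left")
print(merged)
```

Trading days with no matching observation get NaN in the new column and would be removed by the existing `dropna` step.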

You can also make the system real-time by writing an automated script to download the latest data when the market closes, and make predictions for the next day.

Finally, you can make the system "higher-resolution". You're currently making daily predictions, but you could make hourly, minute-by-minute, or second-by-second predictions. This will require obtaining more data, though. You could also make predictions for individual stocks instead of the S&P500.