#### Read stock history data into a data frame

In [77]:
import pandas as pd

df = pd.read_csv('sphist.csv')

In [78]:
df.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
0,2015-12-07,2090.419922,2090.419922,2066.780029,2077.070068,4043820000.0,2077.070068
1,2015-12-04,2051.23999,2093.840088,2051.23999,2091.689941,4214910000.0,2091.689941
2,2015-12-03,2080.709961,2085.0,2042.349976,2049.620117,4306490000.0,2049.620117
3,2015-12-02,2101.709961,2104.27002,2077.110107,2079.51001,3950640000.0,2079.51001
4,2015-12-01,2082.929932,2103.370117,2082.929932,2102.629883,3712120000.0,2102.629883


#### Convert the date column to the datetime type and sort in ascending order

In [79]:
df['Date'] = pd.to_datetime(df['Date'])

In [80]:
from datetime import datetime

df.sort_values(by='Date', ascending=True, inplace=True)
df.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
16589,1950-01-03,16.66,16.66,16.66,16.66,1260000.0,16.66
16588,1950-01-04,16.85,16.85,16.85,16.85,1890000.0,16.85
16587,1950-01-05,16.93,16.93,16.93,16.93,2550000.0,16.93
16586,1950-01-06,16.98,16.98,16.98,16.98,2010000.0,16.98
16585,1950-01-09,17.08,17.08,17.08,17.08,2520000.0,17.08


### Pick 3 indicators:

* Average price for the past 30 days
* Standard deviation for the past 5 days
* Ratio of the average price for the past 5 days to the average price for the past 365 days

In [81]:
# Add column for average price over previous 30 days
mean_prev30 = df.Close.rolling(window=30,center=False).mean()
mean_prev30 = mean_prev30.shift(1)
df['mean_prev30'] = mean_prev30

# Add column for standard deviation of price over previous 5 days
std_prev5 = df.Close.rolling(window=5,center=False).std()
std_prev5 = std_prev5.shift(1)
df['std_prev5'] = std_prev5

# Add column for average price over previous 5 days
mean_prev5 = df.Close.rolling(window=5,center=False).mean()
mean_prev5 = mean_prev5.shift(1)
df['mean_prev5'] = mean_prev5

# Add column for average price over previous 365 trading days (rolling counts rows, not calendar days)
mean_prev365 = df.Close.rolling(window=365,center=False).mean()
mean_prev365 = mean_prev365.shift(1)
df['mean_prev365'] = mean_prev365

# Add column for the ratio of the 5-day average to the 365-day average
mean_ratio_5to365 = mean_prev5 / mean_prev365
df['ratio_5to365'] = mean_ratio_5to365

### Clean and split the data

In [82]:
# Keep rows after the first year of data, then drop any rows that still lack a full indicator history
df = df[df['Date'] > datetime(year=1951, month=1, day=2)]
df = df.dropna(axis=0)

train = df[df['Date'] < datetime(year=2013, month=1, day=1)]
test = df[df['Date'] >= datetime(year=2013, month=1, day=1)]

### Make predictions

In [83]:
from sklearn.linear_model import LinearRegression
import numpy as np

features = ["mean_prev5", "mean_prev365", "mean_prev30", "std_prev5", "ratio_5to365"]

lr = LinearRegression()
lr.fit(train[features].values, train['Close'])

predictions = lr.predict(test[features].values)

# Sum of absolute errors over the test set
error = np.abs(test['Close'].values - predictions)
print(error.sum())

11919.403471
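
The summed absolute error grows with the number of rows in the test set, so it is hard to compare across test periods of different lengths. A per-day average (mean absolute error) may be easier to interpret. The sketch below uses small made-up arrays as stand-ins for `test['Close'].values` and `predictions`; the names `actual` and `predicted` are illustrative only.

```python
import numpy as np

# Hypothetical stand-ins for test['Close'].values and predictions.
actual = np.array([2050.0, 2060.0, 2045.0, 2070.0])
predicted = np.array([2040.0, 2065.0, 2050.0, 2060.0])

abs_error = np.abs(actual - predicted)

total_abs_error = abs_error.sum()   # the quantity printed in the cell above
mean_abs_error = abs_error.mean()   # average error per trading day

print(total_abs_error, mean_abs_error)
```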


### Ideas for additional indicators
* The average volume over the past five days.
* The average volume over the past year.
* The ratio between the average volume for the past five days, and the average volume for the past year.
* The standard deviation of the average volume over the past five days.
* The standard deviation of the average volume over the past year.
* The ratio between the standard deviation of the average volume for the past five days, and the standard deviation of the average volume for the past year.
* The year component of the date.
* The ratio between the lowest price in the past year and the current price.
* The ratio between the highest price in the past year and the current price.
* The month component of the date.
* The day of week.
* The day component of the date.
* The number of holidays in the prior month.
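
A few of the indicators above can be sketched with the same rolling-then-shift pattern used for the price columns. This is a minimal illustration on a synthetic frame, not the real `sphist.csv` data; the frame `demo` and all new column names are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the date-sorted stock frame (all values are made up).
rng = np.random.default_rng(0)
dates = pd.bdate_range("2010-01-04", periods=400)
demo = pd.DataFrame({
    "Date": dates,
    "Close": 1000 + rng.normal(0, 5, size=400).cumsum(),
    "Volume": rng.integers(1_000_000, 5_000_000, size=400).astype(float),
})

# Rolling volume averages, shifted one row so each row only sees prior days.
demo["vol_mean_prev5"] = demo["Volume"].rolling(window=5).mean().shift(1)
demo["vol_mean_prev365"] = demo["Volume"].rolling(window=365).mean().shift(1)
demo["vol_ratio_5to365"] = demo["vol_mean_prev5"] / demo["vol_mean_prev365"]

# Date components taken from the datetime column.
demo["year"] = demo["Date"].dt.year
demo["month"] = demo["Date"].dt.month
demo["day_of_week"] = demo["Date"].dt.dayofweek  # Monday is 0
demo["day"] = demo["Date"].dt.day

print(demo.tail())
```

As with the price indicators, the `shift(1)` keeps the current day's volume out of its own features.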

### Ideas for next steps
There's a lot of improvement still to be made on the indicator side, and we urge you to think of better indicators that you could use for prediction.

We can also make significant structural improvements to the algorithm, and pull in data from other sources.

Accuracy should improve if you make predictions only one day ahead. For example, train a model using data from 1951-01-03 to 2013-01-02 and make a prediction for 2013-01-03, then train another model using data from 1951-01-03 to 2013-01-03 and make a prediction for 2013-01-04, and so on. This more closely simulates what you'd do if you were trading using the algorithm.
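
One way to sketch this one-day-ahead scheme is a loop that refits the model on all earlier rows before each prediction. The example runs on synthetic prices with a single indicator; `demo`, `features`, and `first_test_row` are hypothetical names, and refitting for every test day will be much slower on the full data set.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the cleaned frame (all values are made up).
rng = np.random.default_rng(0)
demo = pd.DataFrame({"Close": 1000 + rng.normal(0, 5, size=120).cumsum()})
demo["mean_prev5"] = demo["Close"].rolling(window=5).mean().shift(1)
demo = demo.dropna().reset_index(drop=True)

features = ["mean_prev5"]
first_test_row = 100  # hypothetical split point

predictions = []
for i in range(first_test_row, len(demo)):
    history = demo.iloc[:i]  # every row before the day being predicted
    model = LinearRegression()
    model.fit(history[features].values, history["Close"].values)
    next_day = demo.iloc[[i]]  # single-row frame for the day to predict
    predictions.append(model.predict(next_day[features].values)[0])

predictions = np.array(predictions)
actual = demo["Close"].values[first_test_row:]
print(np.abs(actual - predictions).mean())
```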

You can also improve the algorithm used significantly. Try other techniques, like a random forest, and see if they perform better.
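
As a rough sketch of such a comparison, the snippet below fits a linear regression and scikit-learn's `RandomForestRegressor` on the same synthetic data and reports the mean absolute error of each. The data, the split point, and the hyperparameters are assumptions chosen for illustration, not tuned values.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the cleaned frame (all values are made up).
rng = np.random.default_rng(0)
demo = pd.DataFrame({"Close": 1000 + rng.normal(0, 5, size=300).cumsum()})
demo["mean_prev5"] = demo["Close"].rolling(window=5).mean().shift(1)
demo["std_prev5"] = demo["Close"].rolling(window=5).std().shift(1)
demo = demo.dropna()

features = ["mean_prev5", "std_prev5"]
train, test = demo.iloc[:250], demo.iloc[250:]  # hypothetical chronological split

models = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=100, random_state=1),
}

mae = {}
for name, model in models.items():
    model.fit(train[features].values, train["Close"].values)
    preds = model.predict(test[features].values)
    mae[name] = np.abs(test["Close"].values - preds).mean()

print(mae)
```

Tree-based models cannot predict values outside the range of the training targets, so on an index that trends upward over decades a random forest trained on raw prices may do worse than the linear model.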

You can also incorporate outside data, such as the weather in New York City (where most trading happens) the day before, and the amount of Twitter activity around certain stocks.

You can also make the system real-time by writing an automated script to download the latest data when the market closes, and make predictions for the next day.

Finally, you can make the system "higher-resolution". You're currently making daily predictions, but you could make hourly, minute-by-minute, or second-by-second predictions. This will require obtaining more data, though. You could also make predictions for individual stocks instead of the S&P 500.
