In [25]:
import pandas as pd
from datetime import datetime
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

## 1. The Dataset

In this mission, you'll be working with a CSV file containing index prices. Each row in the file contains a daily record of the price of the S&P500 Index from 1950 to 2015. The dataset is stored in sphist.csv.

The columns of the dataset are:

- Date -- The date of the record.
- Open -- The opening price of the day (when trading starts).
- High -- The highest trade price during the day.
- Low -- The lowest trade price during the day.
- Close -- The closing price for the day (when trading is finished).
- Volume -- The number of shares traded.
- Adj Close -- The daily closing price, adjusted retroactively to include any corporate actions.

You'll be using this dataset to develop a predictive model. You'll train the model with data from 1950-2012, and try to make predictions from 2013-2015.

## 2. Reading in the data

You'll need to read the data into Python, do some processing to set the right column types, and then sort the dataframe. You can do this in the predict.py script.

### Instructions

Here are the steps you'll need to take, at a high level:


- Read the data into a Pandas DataFrame. You can use the read_csv Pandas function for this.
- Convert the Date column to a Pandas date type. This will allow you to do date comparisons with the column.
  - You can perform this conversion with the to_datetime function in Pandas.
  - Once you convert the column, you can perform comparisons with df["Date"] > datetime(year=2015, month=4, day=1). This will generate a Boolean series that tells you if each item in the Date column is after 2015-04-01. You'll have to import the datetime module from the datetime library first with from datetime import datetime.
- Sort the dataframe on the Date column. It's currently in descending order, but we'll want it to be in ascending order for some of the next steps. You can use the DataFrame.sort_values() method on data frames for this.

Make sure to run the predict.py script using python predict.py as you work through the steps.
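
As a quick illustration of the conversion, sort, and date comparison described above, here is a minimal sketch on a made-up three-row frame. The `toy` frame and its values are assumptions standing in for sphist.csv, not data from the real file.

```python
from datetime import datetime

import pandas as pd

# Hypothetical stand-in for sphist.csv: string dates, newest row first.
toy = pd.DataFrame({
    "Date": ["2015-04-02", "2015-04-01", "2015-03-31"],
    "Close": [103.0, 102.0, 101.0],
})

# Convert the strings to a datetime column, then sort oldest to newest.
toy["Date"] = pd.to_datetime(toy["Date"])
toy = toy.sort_values("Date").reset_index(drop=True)

# Boolean series: True where the row's date is strictly after 2015-04-01.
after_april_1 = toy["Date"] > datetime(year=2015, month=4, day=1)
print(after_april_1.tolist())
```

Only the last row (2015-04-02) is strictly after the cutoff, so only that entry is True.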

In [26]:
df = pd.read_csv('sphist.csv')
df['Date'] = pd.to_datetime(df.Date)
df.sort_values(by="Date", ascending=True, inplace=True)
df.reset_index(drop=True, inplace=True)
df

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
0,1950-01-03,16.660000,16.660000,16.660000,16.660000,1.260000e+06,16.660000
1,1950-01-04,16.850000,16.850000,16.850000,16.850000,1.890000e+06,16.850000
2,1950-01-05,16.930000,16.930000,16.930000,16.930000,2.550000e+06,16.930000
3,1950-01-06,16.980000,16.980000,16.980000,16.980000,2.010000e+06,16.980000
4,1950-01-09,17.080000,17.080000,17.080000,17.080000,2.520000e+06,17.080000
...,...,...,...,...,...,...,...
16585,2015-12-01,2082.929932,2103.370117,2082.929932,2102.629883,3.712120e+09,2102.629883
16586,2015-12-02,2101.709961,2104.270020,2077.110107,2079.510010,3.950640e+09,2079.510010
16587,2015-12-03,2080.709961,2085.000000,2042.349976,2049.620117,4.306490e+09,2049.620117
16588,2015-12-04,2051.239990,2093.840088,2051.239990,2091.689941,4.214910e+09,2091.689941


## 3. Generating indicators

Datasets taken from the stock market need to be handled differently than datasets from other sectors when it comes time to make predictions. In a normal machine learning exercise, we treat each row as independent. Stock market data is sequential, and each observation comes a day after the previous observation. Thus, the observations are not all independent, and you can't treat them as such.

This means you have to be extra careful to not inject "future" knowledge into past rows when you do training and prediction. Injecting future knowledge will make our model look good when you're training and testing it, but will make it fail in the real world. This is how many algorithmic traders lose money.

The time series nature of the data means that you can generate indicators to make our model more accurate. For instance, you can create a new column that contains the average price of the previous 10 trading days for each row. This will incorporate information from multiple prior rows into one, and can make predictions much more accurate.

When you do this, you have to be careful not to use the current row in the values you average. You want to teach the model how to predict the current price from historical prices. If you include the current price in the prices you average, it will be equivalent to handing the answers to the model upfront, and will make it impossible to use in the "real world", where you don't know the price upfront.

Here are some indicators that are interesting to generate for each row:

- The average price from the past 5 days.
- The average price for the past 30 days.
- The average price for the past 365 days.
- The ratio between the average price for the past 5 days, and the average price for the past 365 days.
- The standard deviation of the price over the past 5 days.
- The standard deviation of the price over the past 365 days.
- The ratio between the standard deviation for the past 5 days, and the standard deviation for the past 365 days.

"Days" means "trading days" -- so if you're computing the average of the past 5 days, it should be the 5 most recent dates before the current one. Assume that "price" means the Close column. Always be careful not to include the current price in these indicators! You're predicting the next day's price, so our indicators are designed to predict the current price from the previous prices.

Some of these indicators require a year of historical data to compute. Our first day of data falls on 1950-01-03, so the first day you can start computing indicators on is 1951-01-03.

To compute indicators, you'll need to loop through each day from 1951-01-03 to 2015-12-07 (the last day you have prices for). For instance, to compute the average price from the past 5 days, you'd start at 1951-01-03, get the prices for each trading day from 1950-12-26 to 1951-01-02, and find the average. The window starts on the 26th and spans more than 5 calendar days because the stock market is shut down on weekends and certain holidays. Since we're looking at the past 5 trading days, we need to look at more than 5 calendar days to find them.

We'd keep repeating this process to compute all of the averages. Note that when we compute the average of the past 5 days for 1951-01-04, we don't include the price from 1951-01-04 itself. Including the current day's price is exactly the mistake to avoid, or our model won't work in the "real world".
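
To make the "don't include the current row" rule concrete, here is a small sketch comparing a rolling mean that leaks the current price with one that doesn't. The five prices and the 3-row window are made-up values chosen so the arithmetic is easy to check.

```python
import pandas as pd

# Made-up closing prices, oldest first.
close = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# Leaky: the 3-row window ends ON the current row, so row 3 averages 20, 30, 40.
leaky = close.rolling(3).mean()

# Safe: shift(1) moves every value down one row, so row 3 averages 10, 20, 30.
safe = close.rolling(3).mean().shift(1)

print(pd.DataFrame({"close": close, "leaky": leaky, "safe": safe}))
```

For the row where close is 40, the leaky column already contains 40 in its average, while the safe column is built only from the three earlier prices.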

### Instructions

Pick 3 indicators to compute, and generate a different column for each one.

There are a few different ways to do this:

- You can use a for loop along with the [iterrows method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iterrows.html) to loop over the rows in the DataFrame and compute the indicators. This is the recommended way, as it's a bit simpler to understand what's happening. Since you'll be looping over all of the rows, for any date that comes before there is enough historical data to compute an indicator, just fill in 0.
- Pandas has some [time series tools](https://pandas.pydata.org/pandas-docs/stable/user_guide/computation.html) that can help, including [the rolling function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rolling.html), which will do most of the hard computation for you. Set the window equal to the number of trading days in the past you want to use to compute the indicators. This will add in NaN values for any row where there aren't enough historical trading days to do the computation. Note: There is a giant caveat here, which is that the rolling mean will use the current day's price. You'll need to reindex the resulting series to shift all the values "forward" one day. For example, the rolling mean calculated for 1950-01-03 will need to be assigned to 1950-01-04, and so on. You can use the [shift method](https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.DataFrame.shift.html) on Dataframes to do this.
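
The loop-based option from the first bullet might look roughly like the sketch below. The helper name `add_avg_loop` and the toy prices are assumptions for illustration; the notebook itself uses the rolling approach.

```python
import pandas as pd


def add_avg_loop(frame, window=5, column="Close"):
    """Add the mean of the `window` rows before each row; 0 when history is short."""
    frame = frame.reset_index(drop=True)  # positions 0..n-1, oldest row first
    values = []
    for i, _ in frame.iterrows():
        if i < window:
            # Not enough prior rows yet, so fill in 0 as the instructions suggest.
            values.append(0)
        else:
            # Rows i-window .. i-1 only: the current row i is excluded.
            values.append(frame[column].iloc[i - window:i].mean())
    frame[f"avg_{window}"] = values
    return frame


# Made-up prices, assumed to be sorted in ascending date order already.
toy = pd.DataFrame({"Close": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]})
result = add_avg_loop(toy, window=5)
print(result)
```

On rows with enough history this gives the same values as `rolling(5).mean().shift(1)`; the difference is that short-history rows hold 0 here instead of NaN, so they need to be filtered by date rather than with dropna.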


In [38]:
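# Rolling averages of Close over the previous 5, 30 and 365 trading days.
# shift(1) moves each value down one row so a row never sees its own Close.
df["avg_5"] = df["Close"].rolling(5).mean().shift(1)
df["avg_30"] = df["Close"].rolling(30).mean().shift(1)
df["avg_365"] = df["Close"].rolling(365).mean().shift(1)

# Rolling standard deviations, shifted the same way.
df["std_5"] = df["Close"].rolling(5).std().shift(1)
df["std_365"] = df["Close"].rolling(365).std().shift(1)

# Ratios of short-term to long-term statistics.
df["avg_5/avg_365"] = df["avg_5"]/df["avg_365"]
df["std_5/std_365"] = df["std_5"]/df["std_365"]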

## 4. Splitting up the data

Since you're computing indicators that use historical data, there are some rows where there isn't enough historical data to generate them. Some of the indicators use 365 days of historical data, and the dataset starts on 1950-01-03. Thus, any rows that fall before 1951-01-03 don't have enough historical data to compute all the indicators. You'll need to remove these rows before you split the data.

If you have a Dataframe df, you can select any rows with the Date column greater than 1951-01-02 using df[df["Date"] > datetime(year=1951, month=1, day=2)].

### Instructions

- Remove any rows from the DataFrame that fall before 1951-01-03.
- Use the dropna method to remove any rows with NaN values. Pass in the axis=0 argument to drop rows.
- Generate two new dataframes to use in making our algorithm. train should contain any rows in the data with a date less than 2013-01-01. test should contain any rows with a date greater than or equal to 2013-01-01.
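
The three steps above can be sketched on a tiny made-up frame as follows. The dates, prices, and the single `avg_5` column are assumptions chosen so that each step visibly removes or routes at least one row.

```python
from datetime import datetime

import pandas as pd

# Hypothetical frame with one indicator column already computed.
toy = pd.DataFrame({
    "Date": pd.to_datetime([
        "1950-06-01", "1951-01-03", "1951-01-04", "2012-12-31", "2013-01-02",
    ]),
    "Close": [1.0, 2.0, 3.0, 4.0, 5.0],
    "avg_5": [0.5, 1.5, float("nan"), 3.5, 4.5],
})

# 1. Keep only rows on or after 1951-01-03.
toy = toy[toy["Date"] > datetime(year=1951, month=1, day=2)]

# 2. Drop any remaining rows that contain NaN values.
toy = toy.dropna(axis=0)

# 3. Split on the 2013-01-01 cutoff.
cutoff = datetime(year=2013, month=1, day=1)
train = toy[toy["Date"] < cutoff]
test = toy[toy["Date"] >= cutoff]
print(len(train), len(test))
```

Because the split is by date, every training row is strictly older than every test row, which is what keeps future information out of training.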

In [39]:
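# The 365-row window leaves NaN in the first 365 rows, which covers every
# date before 1951-01-03, so dropna also removes the rows lacking history.
df.dropna(axis=0, inplace=True)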
df_train = df[df["Date"] < datetime(year=2013, month=1, day=1)]
df_test = df[df["Date"] >= datetime(year=2013, month=1, day=1)]

## 5. Making predictions

Now, you can define an error metric, train a model using the train data, and make predictions on the test data.

It's recommended to use Mean Absolute Error, also called MAE, as an error metric, because it will show you how "close" you were to the price in intuitive terms. Mean Squared Error, or MSE, is an alternative that is more commonly used, but makes it harder to intuitively tell how far off you are from the true price because it squares the error.
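
To see why MAE is easier to read than MSE, here is a small hand-computed sketch. The actual and predicted prices are made up; the two formulas are written out in plain Python rather than taken from a library.

```python
# Made-up prices: the predictions miss by 1, 2 and 3 points.
actual = [100.0, 102.0, 104.0]
predicted = [101.0, 100.0, 107.0]

errors = [p - a for p, a in zip(predicted, actual)]

# MAE: average size of the miss, in the same units as the price.
mae = sum(abs(e) for e in errors) / len(errors)

# MSE: average of the squared misses, in squared price units.
mse = sum(e ** 2 for e in errors) / len(errors)

print("MAE:", mae)
print("MSE:", mse)
```

The MAE of 2.0 reads directly as "off by about 2 points on average", while the MSE (14/3, roughly 4.67) is in squared units and is dominated by the largest miss.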

### Instructions

- Pick an error metric.
- Initialize an instance of the LinearRegression class.
- Train a linear regression model, using the train Dataframe. Leave out all of the original columns (Close, High, Low, Open, Volume, Adj Close, Date) when training your model. These all contain knowledge of the future that you don't want to feed the model. Use the Close column as the target.
- Make predictions for the Close column of the test data, using the same columns for training as you did with train.
- Compute the error between the predictions and the Close column of test.

In [41]:
model = LinearRegression()
features = ["avg_5", "avg_30", "avg_365", "std_5", "std_365", "avg_5/avg_365", "std_5/std_365"]
# Feature matrices and targets for the training and test periods
x = df_train[features]
x_test = df_test[features]
y = df_train.Close
y_test = df_test.Close

model.fit(x, y)
predictions = model.predict(x_test)

# Calculate error metrics
mae = mean_absolute_error(y_test, predictions)
mse = mean_squared_error(y_test, predictions)
print("MAE: ", mae)
print("MSE: ", mse)
# R^2 on the training data (not a test-set score)
print(model.score(x, y))

MAE:  16.145140609743393
MSE:  492.9230344450359
0.9995223668123336


In [31]:
print(df.head(15))

         Date       Open       High        Low      Close     Volume  \
0  1950-01-03  16.660000  16.660000  16.660000  16.660000  1260000.0   
1  1950-01-04  16.850000  16.850000  16.850000  16.850000  1890000.0   
2  1950-01-05  16.930000  16.930000  16.930000  16.930000  2550000.0   
3  1950-01-06  16.980000  16.980000  16.980000  16.980000  2010000.0   
4  1950-01-09  17.080000  17.080000  17.080000  17.080000  2520000.0   
5  1950-01-10  17.030001  17.030001  17.030001  17.030001  2160000.0   
6  1950-01-11  17.090000  17.090000  17.090000  17.090000  2630000.0   
7  1950-01-12  16.760000  16.760000  16.760000  16.760000  2970000.0   
8  1950-01-13  16.670000  16.670000  16.670000  16.670000  3330000.0   
9  1950-01-16  16.719999  16.719999  16.719999  16.719999  1460000.0   
10 1950-01-17  16.860001  16.860001  16.860001  16.860001  1790000.0   
11 1950-01-18  16.850000  16.850000  16.850000  16.850000  1570000.0   
12 1950-01-19  16.870001  16.870001  16.870001  16.870001  11700

In [32]:
print(df[df["Date"] == datetime(year=1951, month=1, day=2)].index)

Int64Index([249], dtype='int64')


## 6. Improving error

Congratulations! You can now predict the S&P500 (with some error). You can improve the error of this model significantly, though. Think about some indicators that might be helpful to compute.

Here are some ideas that might be helpful:

- The average volume over the past five days.
- The average volume over the past year.
- The ratio between the average volume for the past five days, and the average volume for the past year.
- The standard deviation of the average volume over the past five days.
- The standard deviation of the average volume over the past year.
- The ratio between the standard deviation of the average volume for the past five days, and the standard deviation of the average volume for the past year.
- The year component of the date.
- The ratio between the lowest price in the past year and the current price.
- The ratio between the highest price in the past year and the current price.
- The month component of the date.
- The day of week.
- The day component of the date.
- The number of holidays in the prior month.

### Instructions

Add 2 additional indicators to your dataframe, and see if the error is reduced. You'll need to insert these indicators at the same point where you insert the others, before you clean out rows with NaN values and split the dataframe into train and test.
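
As one possible starting point, the sketch below builds a volume-ratio indicator and a day-of-week indicator on a made-up frame. The column names, the volumes, and the short windows of 2 and 4 rows (standing in for 5 days and a year) are all assumptions for illustration.

```python
import pandas as pd

# Six consecutive business days with made-up volumes.
toy = pd.DataFrame({
    "Date": pd.date_range("2015-12-01", periods=6, freq="B"),
    "Volume": [100.0, 200.0, 300.0, 400.0, 500.0, 600.0],
})

# Volume is only known after the day ends, so shift(1) as with the price indicators.
toy["vol_avg_2"] = toy["Volume"].rolling(2).mean().shift(1)
toy["vol_avg_4"] = toy["Volume"].rolling(4).mean().shift(1)
toy["vol_avg_2/vol_avg_4"] = toy["vol_avg_2"] / toy["vol_avg_4"]

# The calendar is known in advance, so date components need no shift.
toy["day_of_week"] = toy["Date"].dt.dayofweek  # Monday is 0

print(toy)
```

On the real data you would swap in windows of 5 and 365, add the new column names to the features list, and compare the resulting MAE with the earlier run.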

## 7. Next steps

There's a lot of improvement still to be made on the indicator side, and we urge you to think of better indicators that you could use for prediction. We can also make significant structural improvements to the algorithm, and pull in data from other sources.

- Accuracy should improve if you make predictions only one day ahead. For example, train a model using data from 1951-01-03 to 2013-01-02, make predictions for 2013-01-03, and then train another model using data from 1951-01-03 to 2013-01-03, make predictions for 2013-01-04, and so on. This more closely simulates what you'd do if you were trading using the algorithm.

- You can also improve the algorithm used significantly. Try other techniques, like a random forest, and see if they perform better.

- You can also incorporate outside data, such as the weather in New York City (where most trading happens) the day before, and the amount of Twitter activity around certain stocks.

- You can also make the system real-time by writing an automated script to download the latest data when the market closes, and make predictions for the next day.

- Finally, you can make the system "higher-resolution". You're currently making daily predictions, but you could make hourly, minute-by-minute, or second-by-second predictions. This will require obtaining more data, though. You could also make predictions for individual stocks instead of the S&P500.
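
The one-day-ahead idea from the first bullet could be sketched as below. The helper `walk_forward_predictions`, the synthetic straight-line prices, and the 3-row window are assumptions for illustration, and refitting a model for every test day will be noticeably slower on the full dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error


def walk_forward_predictions(frame, features, target, start):
    """For each row i >= start, fit on rows before i and predict row i."""
    predictions = []
    for i in range(start, len(frame)):
        model = LinearRegression()
        # Train only on rows strictly before the day being predicted.
        model.fit(frame[features].iloc[:i], frame[target].iloc[:i])
        predictions.append(model.predict(frame[features].iloc[i:i + 1])[0])
    return pd.Series(predictions, index=frame.index[start:])


# Synthetic, perfectly linear prices so the result is easy to sanity-check.
toy = pd.DataFrame({"Close": np.arange(1.0, 41.0)})
toy["avg_3"] = toy["Close"].rolling(3).mean().shift(1)
toy = toy.dropna().reset_index(drop=True)

preds = walk_forward_predictions(toy, ["avg_3"], "Close", start=30)
mae = mean_absolute_error(toy["Close"].iloc[30:], preds)
print("Walk-forward MAE:", mae)
```

On this synthetic series the error is essentially zero because Close is an exact linear function of avg_3; real prices will not behave that way, so treat the number only as a check that the loop is wired correctly.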