# Predicting the Stock Market

In this project, we'll be working with data from the [S&P500 Index](https://en.wikipedia.org/wiki/S%26P_500_Index). The S&P500 is a stock market index. Before we get into what an index is, we'll need to get into the basics of the stock market. 

Some companies are publicly traded, which means that anyone can buy and sell their shares on the open market. A share entitles the owner to some control over the direction of the company, and to some percentage (or share) of the earnings of the company. When you buy or sell shares, it's common to say that you're trading a stock.

The price of a share is based mainly on supply and demand for a given stock. Stock price is also influenced by other factors, including the number of shares a company has issued. Stocks are traded daily, and the price can rise or fall from the beginning of a trading day to the end based on demand. Stocks that are more in demand, typically those of larger companies, are traded more often than stocks of smaller companies.

Indexes aggregate the prices of multiple stocks together, and allow you to see how the market as a whole is performing. For example, the Dow Jones Industrial Average aggregates the stock prices of 30 large American companies together. The S&P500 Index aggregates the stock prices of 500 large companies. When an index goes up or down, you can say that the underlying market or sector it represents is also going up or down. For example, if the Dow Jones Industrial Average goes down one day, you can say that American stocks overall went down (i.e., American stocks, taken together, lost value that day, even if some individual stocks rose).
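
To make the idea of aggregation concrete, here is a minimal sketch that combines three stock prices into a single number with a plain average. The tickers, prices, and the `simple_price_index` helper are all hypothetical. Real indexes use more involved formulas: the Dow Jones Industrial Average divides the sum of prices by an adjusted divisor, and the S&P500 weights companies by market capitalization.

```python
# Hypothetical closing prices for three made-up companies (not real data).
prices_day1 = {"AAA": 100.0, "BBB": 50.0, "CCC": 30.0}
prices_day2 = {"AAA": 98.0, "BBB": 51.0, "CCC": 28.0}


def simple_price_index(prices):
    """Aggregate stock prices into one number by taking their plain average."""
    return sum(prices.values()) / len(prices)


index_day1 = simple_price_index(prices_day1)
index_day2 = simple_price_index(prices_day2)
change = index_day2 - index_day1

print("Day 1 index:", index_day1)
print("Day 2 index:", index_day2)
print("Change:", change)
```

In this toy data the index falls by one point even though `BBB` rose, which is why a falling index describes the group as a whole rather than every stock in it.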

We'll be using historical data on the price of the S&P500 Index to make predictions about future prices. Predicting whether an index will go up or down will help us forecast how the stock market as a whole will perform. Since stocks tend to correlate with how well the economy as a whole is performing, it can also help us make economic forecasts.

## Importing and Cleaning the Data

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

sp500 = pd.read_csv('sphist.csv')
sp500.head(5)

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
0,2015-12-07,2090.419922,2090.419922,2066.780029,2077.070068,4043820000.0,2077.070068
1,2015-12-04,2051.23999,2093.840088,2051.23999,2091.689941,4214910000.0,2091.689941
2,2015-12-03,2080.709961,2085.0,2042.349976,2049.620117,4306490000.0,2049.620117
3,2015-12-02,2101.709961,2104.27002,2077.110107,2079.51001,3950640000.0,2079.51001
4,2015-12-01,2082.929932,2103.370117,2082.929932,2102.629883,3712120000.0,2102.629883


In [2]:
sp500['Date'] = pd.to_datetime(sp500['Date'],format='%Y-%m-%d')
sp500.dtypes

Date         datetime64[ns]
Open                float64
High                float64
Low                 float64
Close               float64
Volume              float64
Adj Close           float64
dtype: object

In [3]:
sp500.sort_values(by=['Date'],inplace=True)
sp500.head(5)

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
16589,1950-01-03,16.66,16.66,16.66,16.66,1260000.0,16.66
16588,1950-01-04,16.85,16.85,16.85,16.85,1890000.0,16.85
16587,1950-01-05,16.93,16.93,16.93,16.93,2550000.0,16.93
16586,1950-01-06,16.98,16.98,16.98,16.98,2010000.0,16.98
16585,1950-01-09,17.08,17.08,17.08,17.08,2520000.0,17.08


In [4]:
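# rolling(n) aggregates the last n rows (trading days), including the current one;
# shift(1) then moves each value down a row so that a day's indicators
# are computed only from earlier days and never from its own closing price.
sp500['avg5'] = sp500['Close'].rolling(5).mean().shift(1)
sp500['avg365'] = sp500['Close'].rolling(365).mean().shift(1)
sp500['stdev5'] = sp500['Close'].rolling(5).std().shift(1)
sp500['stdev365'] = sp500['Close'].rolling(365).std().shift(1)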
sp500.iloc[250:].head(10)

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close,avg5,avg365,stdev5,stdev365
16339,1951-01-03,20.690001,20.690001,20.690001,20.690001,3370000.0,20.690001,20.36,,0.304385,
16338,1951-01-04,20.870001,20.870001,20.870001,20.870001,3390000.0,20.870001,20.514,,0.204524,
16337,1951-01-05,20.870001,20.870001,20.870001,20.870001,3390000.0,20.870001,20.628,,0.214057,
16336,1951-01-08,21.0,21.0,21.0,21.0,2780000.0,21.0,20.726001,,0.181879,
16335,1951-01-09,21.120001,21.120001,21.120001,21.120001,3800000.0,21.120001,20.840001,,0.117047,
16334,1951-01-10,20.85,20.85,20.85,20.85,3270000.0,20.85,20.910001,,0.16109,
16333,1951-01-11,21.190001,21.190001,21.190001,21.190001,3490000.0,21.190001,20.942001,,0.11606,
16332,1951-01-12,21.110001,21.110001,21.110001,21.110001,2950000.0,21.110001,21.006001,,0.149767,
16331,1951-01-15,21.299999,21.299999,21.299999,21.299999,2830000.0,21.299999,21.054001,,0.132778,
16330,1951-01-16,21.459999,21.459999,21.459999,21.459999,3740000.0,21.459999,21.114,,0.165922,


In [5]:
print('Our dataset has',sp500.shape[0],'rows')

Our dataset has 16590 rows


Since some of our indicators use a 365-row window (365 trading days, which is about a year and a half of calendar time), the earliest rows don't have enough history to compute them, so we'll get rid of these rows. As we can observe below, the first row with all valid indicators is from June 19th, 1951.

In [6]:
sp500.iloc[363:].head(5)

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close,avg5,avg365,stdev5,stdev365
16226,1951-06-15,22.040001,22.040001,22.040001,22.040001,1370000.0,22.040001,21.602,,0.14025,
16225,1951-06-18,22.049999,22.049999,22.049999,22.049999,1050000.0,22.049999,21.712,,0.222194,
16224,1951-06-19,22.02,22.02,22.02,22.02,1100000.0,22.02,21.8,19.447726,0.256223,1.790253
16223,1951-06-20,21.91,21.91,21.91,21.91,1120000.0,21.91,21.9,19.462411,0.213659,1.789307
16222,1951-06-21,21.780001,21.780001,21.780001,21.780001,1100000.0,21.780001,21.972,19.476274,0.092574,1.788613


In [7]:
# Keep only rows dated after the last row that still has missing indicators
sp500_1 = sp500[sp500['Date'] > datetime(year=1951, month=6, day=18)].copy()
print('Our dataset has',sp500_1.shape[0],'rows, which means we eliminated',sp500.shape[0]-sp500_1.shape[0],'rows')

Our dataset has 16225 rows, which means we eliminated 365 rows
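
The 365 rows removed above match the size of the longest rolling window. The sketch below reproduces that effect on a toy series with a window of 3, and shows `dropna` as one possible alternative to a hard-coded cutoff date. The values and the names `toy`, `avg3`, and `toy_clean` are made up for illustration and are not part of the analysis.

```python
import pandas as pd

# Toy closing prices (made-up values), one per trading day, oldest first.
close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0])

window = 3
# rolling(window).mean() at row i averages rows i-window+1 .. i (row i included).
# shift(1) moves every value down one row, so row i only sees earlier rows.
avg_prev = close.rolling(window).mean().shift(1)

toy = pd.DataFrame({"Close": close, "avg3": avg_prev})

# The first `window` rows lack a complete history, so their indicator is NaN.
n_missing = int(toy["avg3"].isna().sum())

# dropna removes exactly those rows without referring to a specific date.
toy_clean = toy.dropna(subset=["avg3"])

print(toy)
print("Rows with missing indicator:", n_missing)
print("Rows kept:", len(toy_clean))
```

The first kept row has `avg3` equal to the mean of the three closes before it, which is the same relationship `avg5` and `avg365` have to `Close` in the real data.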


## Training and Testing

In [9]:
train = sp500_1[sp500_1['Date'] < datetime(year=2013, month=1, day=1)]
test = sp500_1[sp500_1['Date'] >= datetime(year=2013, month=1, day=1)]

print('Train set has',train.shape[0],'rows, corresponding to the',round((train.shape[0]/sp500_1.shape[0])*100,2),'%')
print('Test set has',test.shape[0],'rows, corresponding to the',round((test.shape[0]/sp500_1.shape[0])*100,2),'%')

Train set has 15486 rows, corresponding to the 95.45 %
Test set has 739 rows, corresponding to the 4.55 %


In [10]:
lr = LinearRegression()
features = ['avg5','avg365','stdev5','stdev365']
target = 'Close'
lr.fit(train[features],train[target])
predictions = lr.predict(test[features])
mae = mean_absolute_error(test[target],predictions)
mae

16.129867989265527
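
The error above is a mean absolute error (MAE): the average of the absolute differences between actual and predicted closing values, expressed in the same units as `Close`. As a minimal sketch of the calculation, the hypothetical helper below computes it by hand on made-up numbers. It is only an illustration and is not scikit-learn's implementation.

```python
def mean_absolute_error_by_hand(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up closing values and predictions, for illustration only.
actual = [2000.0, 2010.0, 1990.0, 2020.0]
predicted = [1990.0, 2030.0, 1995.0, 2015.0]

# Absolute errors are 10, 20, 5 and 5, so the average is 40 / 4.
mae_toy = mean_absolute_error_by_hand(actual, predicted)
print(mae_toy)
```

Read this way, an MAE of about 16.13 means the model's predictions on the test set were off by roughly 16 index points on average.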

## Adding More Features

In [12]:
sp500['avgvol5'] = sp500['Volume'].rolling(5).mean().shift(1)
sp500['avgvol365'] = sp500['Volume'].rolling(365).mean().shift(1)
sp500_2 = sp500[sp500['Date'] > datetime(year=1951, month=6, day=18)].copy()
sp500_2.head(5)

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close,avg5,avg365,stdev5,stdev365,avgvol5,avgvol365
16224,1951-06-19,22.02,22.02,22.02,22.02,1100000.0,22.02,21.8,19.447726,0.256223,1.790253,1196000.0,1989479.0
16223,1951-06-20,21.91,21.91,21.91,21.91,1120000.0,21.91,21.9,19.462411,0.213659,1.789307,1176000.0,1989041.0
16222,1951-06-21,21.780001,21.780001,21.780001,21.780001,1100000.0,21.780001,21.972,19.476274,0.092574,1.788613,1188000.0,1986932.0
16221,1951-06-22,21.549999,21.549999,21.549999,21.549999,1340000.0,21.549999,21.96,19.489562,0.115108,1.787659,1148000.0,1982959.0
16220,1951-06-25,21.290001,21.290001,21.290001,21.290001,2440000.0,21.290001,21.862,19.502082,0.204132,1.786038,1142000.0,1981123.0


In [13]:
train = sp500_2[sp500_2['Date'] < datetime(year=2013, month=1, day=1)]
test = sp500_2[sp500_2['Date'] >= datetime(year=2013, month=1, day=1)]

In [14]:
lr = LinearRegression()
features = ['avg5','avg365','stdev5','stdev365','avgvol5','avgvol365']
target = 'Close'
lr.fit(train[features],train[target])
predictions = lr.predict(test[features])
mae = mean_absolute_error(test[target],predictions)
mae

16.142543940307196