# Imports

In [19]:
import pandas as pd
import quandl
import math

# Data import and preprocessing

In [24]:
df = quandl.get('WIKI/GOOGL')

LimitExceededError: (Status 429) (Quandl Error QELx01) You have exceeded the anonymous user limit of 50 calls per day. To make more calls today, please register for a free Quandl account and then include your API key with your requests.

In [31]:
df.head()

Date,Adj. Close,HL_PCT,PCT_change,Adj. Volume,label
2004-08-19,50.322842,8.072956,0.324968,44659000.0,67.739104
2004-08-20,54.322689,7.921706,7.227007,22834300.0,69.399229
2004-08-23,54.869377,4.04936,-1.22788,18256100.0,68.752232
2004-08-24,52.597363,7.657099,-5.726357,15247300.0,69.639972
2004-08-25,53.164113,3.886792,1.183658,9188600.0,69.078238


You want features, but only meaningful ones that actually bring information to the regression model.
You need to simplify your data as much as possible. Useless or strongly correlated features can cause more problems than the information they add is worth.
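As a rough illustration of why the raw price columns are redundant, the sketch below builds synthetic prices (made-up data, not the GOOGL series) and prints their pairwise correlation. The `prices` frame and the noise levels are assumptions chosen only for the demonstration.

```python
import numpy as np
import pandas as pd

# Synthetic daily prices (hypothetical data, not the GOOGL series).
rng = np.random.default_rng(0)
close = 100 + np.cumsum(rng.normal(0, 1, 250))
prices = pd.DataFrame({
    "Adj. Open": close + rng.normal(0, 0.3, 250),
    "Adj. High": close + np.abs(rng.normal(0, 0.5, 250)),
    "Adj. Low": close - np.abs(rng.normal(0, 0.5, 250)),
    "Adj. Close": close,
})

# Pairwise Pearson correlation between the raw price columns.
corr = prices.corr()
print(corr.round(3))
```

On data shaped like this the four columns move almost in lockstep, which is the motivation for replacing them with a few derived features.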

In [32]:
df = df[['Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close', 'Adj. Volume']]

KeyError: "['Adj. High', 'Adj. Open', 'Adj. Low'] not in index"

We want to keep features that have a meaningful relationship. E.g., Adj. High and Adj. Low together tell us something about the volatility of the market that day. We also want to drop features that don't bring additional info.

In [33]:
df['HL_PCT'] = (df['Adj. High'] - df['Adj. Low']) / df['Adj. Close'] * 100.0
df['PCT_change'] = (df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open'] * 100.0

KeyError: 'Adj. High'
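To make the two formulas in the cell above concrete, here is a minimal worked example on a single made-up trading day. The `day` frame and its numbers are hypothetical; only the column names and the formulas come from the notebook.

```python
import pandas as pd

# One hypothetical trading day; the numbers are made up for illustration.
day = pd.DataFrame({
    "Adj. Open": [100.0],
    "Adj. High": [110.0],
    "Adj. Low": [95.0],
    "Adj. Close": [105.0],
})

# Intraday range as a percentage of the close: (110 - 95) / 105 * 100
day["HL_PCT"] = (day["Adj. High"] - day["Adj. Low"]) / day["Adj. Close"] * 100.0
# Open-to-close move as a percentage of the open: (105 - 100) / 100 * 100
day["PCT_change"] = (day["Adj. Close"] - day["Adj. Open"]) / day["Adj. Open"] * 100.0

print(day[["HL_PCT", "PCT_change"]])
```

Here HL_PCT comes out at roughly 14.29 and PCT_change at 5.0, so both features are expressed in percent rather than in dollars.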

In [34]:
df = df[['Adj. Close', 'HL_PCT', 'PCT_change', 'Adj. Volume']]

In [35]:
df.head()

Date,Adj. Close,HL_PCT,PCT_change,Adj. Volume
2004-08-19,50.322842,8.072956,0.324968,44659000.0
2004-08-20,54.322689,7.921706,7.227007,22834300.0
2004-08-23,54.869377,4.04936,-1.22788,18256100.0
2004-08-24,52.597363,7.657099,-5.726357,15247300.0
2004-08-25,53.164113,3.886792,1.183658,9188600.0


Features are the input attributes the model learns from, and labels are the values it tries to predict.

So, which column is the label and which are the features?

Adj. Close can be a feature, or neither a feature nor a label. It could be a label only if we had chosen other features, because we wouldn't know the High - Low range or the percent change until the close had already occurred.

If you trained an algorithm to predict that value from these features, it would be very biased.

What we'll do later is take the last 10 values of Adj. Close and try to predict another value. In that setup Adj. Close is a feature, but that's for when we write the algorithm ourselves.

A label will be a future price, and the only column that fits is Adj. Close, but taken for the next day or for the next 5 days.

In [36]:
forecast_col = 'Adj. Close'

We can't use rows with missing values in ML, so we need to fill the NaNs. Filling them with an extreme value such as -99999 means such rows (examples) will be treated as outliers.

In [37]:
df.fillna(-99999, inplace = True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  downcast=downcast,
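The warning above typically appears when a DataFrame obtained by selecting from another one is modified in place. One common way to avoid it, sketched here on a small made-up frame, is to take an explicit copy of the selection before filling. The `raw` and `subset` names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps; column names mirror the ones used above.
raw = pd.DataFrame({
    "Adj. Close": [50.0, np.nan, 54.0],
    "Adj. Volume": [1000.0, 2000.0, np.nan],
    "unused": [1, 2, 3],
})

# .copy() makes the selection an independent DataFrame rather than a
# possible view of `raw`, so modifying it does not trigger the warning.
subset = raw[["Adj. Close", "Adj. Volume"]].copy()

# Replace missing values with a sentinel far outside the data range.
subset = subset.fillna(-99999)

print(subset)
```

Assigning the result of `fillna` back to the variable, instead of using `inplace=True`, also keeps the original `raw` frame untouched.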


We try to forecast out 1% of the dataframe's length. In reality, you would want tomorrow's price, and the next day's price. Here, each prediction uses data that came forecast_out days earlier.

The 0.01 fraction can be played with.

In [38]:
forecast_out = int(math.ceil(0.01*len(df)))
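The horizon computed in the cell above is 1% of the row count, rounded up. A small sketch of that arithmetic for a few made-up dataset lengths follows; `forecast_horizon` is a hypothetical helper introduced only for this illustration.

```python
import math

def forecast_horizon(n_rows, fraction=0.01):
    """Number of rows to look ahead: `fraction` of the dataset, rounded up."""
    return int(math.ceil(fraction * n_rows))

# Hypothetical dataset lengths, chosen only to show the rounding.
for n_rows in (95, 250, 3424):
    print(n_rows, forecast_horizon(n_rows))
```

Because of the ceiling, even a short dataset gets a horizon of at least one row, and raising the fraction lengthens the horizon proportionally.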

We define the label by shifting the rows of the Adj. Close column negatively, so up. This way each row's label will be the Adjusted Close price forecast_out days into the future.

Our features are the attributes that, we consider, may cause the adjusted close price forecast_out days ahead to change. With the 0.01 fraction, that horizon is 1% of the timeframe.

In [28]:
df['label'] = df[forecast_col].shift(-forecast_out)

In [29]:
df.dropna(inplace = True)

In [30]:
df.head()

Date,Adj. Close,HL_PCT,PCT_change,Adj. Volume,label
2004-08-19,50.322842,8.072956,0.324968,44659000.0,67.739104
2004-08-20,54.322689,7.921706,7.227007,22834300.0,69.399229
2004-08-23,54.869377,4.04936,-1.22788,18256100.0,68.752232
2004-08-24,52.597363,7.657099,-5.726357,15247300.0,69.639972
2004-08-25,53.164113,3.886792,1.183658,9188600.0,69.078238
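
The effect of the negative shift followed by dropna is easier to see on a tiny frame. This is a minimal sketch with six made-up prices and an assumed horizon of two rows; `toy`, `horizon` and `labeled` are hypothetical names.

```python
import pandas as pd

# Six hypothetical closing prices and a horizon of two rows.
toy = pd.DataFrame({"Adj. Close": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]})
horizon = 2

# shift(-horizon) moves values up, so row i receives the close from row i + horizon.
toy["label"] = toy["Adj. Close"].shift(-horizon)

# The last `horizon` rows have no future value and become NaN; drop them.
labeled = toy.dropna()

print(labeled)
```

The first row pairs a close of 10.0 with a label of 12.0, and the last two rows are dropped because their future price is unknown.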
