# Regression - Training and Testing

Here, we're going to build a regression model to predict the "Adj Close" price of a stock.

## Goal

Implement a regression algorithm that trains and tests on a trading symbol to predict its price.

## Import libraries

In [11]:
import pandas as pd
import numpy as np  
import datetime as dt
import math

# To fetch data
from pandas_datareader import data as pdr   
import fix_yahoo_finance as yf  
yf.pdr_override()   

from sklearn import preprocessing, cross_validation, svm
from sklearn.linear_model import LinearRegression

from util import create_df_benchmark, get_data, fetchOnlineData
from strategyLearner import strategyLearner
from marketsim import compute_portvals_single_symbol, market_simulator
from indicators import (
    get_momentum, get_sma, get_sma_indicator, compute_bollinger_value,
    get_RSI, plot_cum_return, plot_momentum, plot_sma_indicator,
    plot_rsi_indicator, plot_momentum_sma_indicator, plot_stock_prices,
)

## Initial variables



In [4]:
symbol = "AMZN"
# Specify the start and end dates for this period.
start_d = dt.datetime(2008, 1, 1)
#end_d = dt.datetime(2018, 10, 30)
yesterday = dt.date.today() - dt.timedelta(1)
end_d = yesterday

### Get portfolio data from Yahoo

In [5]:
portf_value = fetchOnlineData(start_d, end_d, symbol)

[*********************100%***********************]  1 of 1 downloaded


In [16]:
portf_value

| Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|
| 2007-12-31 | 93.809998 | 94.370003 | 92.449997 | 92.639999 | 92.639999 | 5755200 |
| 2008-01-02 | 95.349998 | 97.430000 | 94.699997 | 96.250000 | 96.250000 | 13858700 |
| 2008-01-03 | 96.059998 | 97.250000 | 94.519997 | 95.209999 | 95.209999 | 9122500 |
| 2008-01-04 | 93.260002 | 93.400002 | 88.500000 | 88.790001 | 88.790001 | 10270000 |
| 2008-01-07 | 88.620003 | 90.570000 | 85.470001 | 88.820000 | 88.820000 | 9981600 |
| 2008-01-08 | 87.550003 | 91.830002 | 86.930000 | 87.879997 | 87.879997 | 12283300 |
| 2008-01-09 | 87.559998 | 87.800003 | 80.239998 | 85.220001 | 85.220001 | 16410900 |
| 2008-01-10 | 83.980003 | 85.970001 | 82.970001 | 84.260002 | 84.260002 | 11609900 |
| 2008-01-11 | 84.029999 | 84.029999 | 80.290001 | 81.080002 | 81.080002 | 10624300 |
| 2008-01-14 | 82.180000 | 83.320000 | 78.870003 | 82.870003 | 82.870003 | 9056100 |


## Features and label

In our case, what are the features and what is the label? We're trying to predict the price, so is price the label? If so, what are the features? When forecasting, the label, the thing we're hoping to predict, is actually the future price. Our features are therefore the current price, the high-minus-low percent, the open-to-close percent change (a rough measure of volatility), and the volume. The label is the price at some chosen point in the future.

In [44]:
# Work on a copy so that adding columns does not trigger SettingWithCopyWarning
df = portf_value[['Open', 'High', 'Low', 'Close', 'Volume']].copy()

df['HL_PCT'] = (df['High'] - df['Low']) / df['Close'] * 100.0
df['PCT_change'] = (df['Close'] - df['Open']) / df['Open'] * 100.0

df = df[['Close', 'HL_PCT', 'PCT_change', 'Volume']]
print(df.head())

                Close    HL_PCT  PCT_change    Volume
Date                                                 
2007-12-31  92.639999  2.072545   -1.247201   5755200
2008-01-02  96.250000  2.836367    0.943893  13858700
2008-01-03  95.209999  2.867349   -0.884863   9122500
2008-01-04  88.790001  5.518642   -4.793053  10270000
2008-01-07  88.820000  5.741949    0.225679   9981600
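
As a quick check of the two derived features, the sketch below recomputes them by hand for the first row of the fetched data (2007-12-31). The variable names are only for illustration and are not part of the notebook.

```python
# First row of the fetched data (2007-12-31), copied from the output shown earlier.
open_, high, low, close = 93.809998, 94.370003, 92.449997, 92.639999

# High-minus-low range as a percentage of the close.
hl_pct = (high - low) / close * 100.0
# Open-to-close move as a percentage of the open.
pct_change = (close - open_) / open_ * 100.0

print(round(hl_pct, 6), round(pct_change, 6))
```

Both values should agree with the `HL_PCT` and `PCT_change` columns printed for that date.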


## Fill NaN data

We fill any NaN data with -99999, a popular placeholder for missing data. Many machine learning algorithms will simply treat such a value as an outlier.

In [45]:
forecast_col = 'Adj Close'
df.fillna(value=-99999, inplace=True)
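
To see what the sentinel does, here is a minimal sketch on a made-up frame. The `toy` data is hypothetical and only stands in for the real quotes.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value (made-up numbers, not real quotes).
toy = pd.DataFrame({"Close": [10.0, np.nan, 12.0], "Volume": [100, 200, 300]})

missing_before = int(toy.isna().sum().sum())
filled = toy.fillna(value=-99999)  # returns a new frame; toy is unchanged
missing_after = int(filled.isna().sum().sum())

print(missing_before, missing_after)
print(filled)
```

The missing entry is replaced by -99999 while every other value is left as it was.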

## Forecast out

We want to forecast out 1% of the entire length of the dataset. Thus, if our data covers 100 days of stock prices, we predict the price 1 day into the future. Choose whatever horizon you like. If you are only trying to predict tomorrow's price, forecast 1 day out. If you forecast 10 days out, you can generate a forecast for each of the next 10 days.

In [46]:
forecast_out = int(math.ceil(0.01 * len(df)))
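
The arithmetic can be checked for a few dataset lengths. This is a small sketch; the helper name `forecast_horizon` and its `fraction` parameter are assumptions made for illustration.

```python
import math

def forecast_horizon(n_rows, fraction=0.01):
    """Hypothetical helper: forecast horizon as a fraction of the rows, rounded up."""
    return int(math.ceil(fraction * n_rows))

for n_rows in (100, 250, 2725):
    print(n_rows, "rows ->", forecast_horizon(n_rows), "day(s) out")
```

Because of the rounding up, any dataset with at least one row yields a horizon of at least 1 day.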

## Shift the values

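`shift(-forecast_out)` moves the "Adj Close" column up by `forecast_out` rows, so each row's label is the price `forecast_out` rows later. The last `forecast_out` rows have no future price and become NaN. For more on `shift`, see https://stackoverflow.com/questions/44675650/what-is-meant-by-shift-in-dataframe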

In [47]:
df['label'] = portf_value[forecast_col].shift(-forecast_out)
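
A negative shift is easiest to see on a tiny example. The sketch below uses made-up prices and a horizon of 2 rows; none of these numbers come from the AMZN data.

```python
import pandas as pd

# Made-up prices; the index stands in for consecutive trading days.
prices = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0], name="Adj Close")
forecast_out = 2

toy = pd.DataFrame({"price": prices})
# Negative shift pulls future values up: row i gets the price from row i + 2.
toy["label"] = prices.shift(-forecast_out)
print(toy)
```

The last `forecast_out` rows have no future price to borrow, so their label is NaN.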

## Drop NaN values

We then drop any rows that still contain NaN. After the shift, these are the last `forecast_out` rows, which have no label.

In [48]:
df.dropna(inplace=True)

## Define our features and labels as arrays and preprocessing

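Generally, you want your features in machine learning to be in a range of roughly -1 to 1. Scaling is not guaranteed to help, but it usually speeds up processing and can also improve accuracy. Note that the cell below only builds the arrays and does not apply any scaling.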

In [49]:
X = np.array(df.drop(['label'], axis=1))
y = np.array(df['label'])
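
One common way to bring features onto a comparable scale is standardization: subtract each column's mean and divide by its standard deviation. The sketch below does this with plain NumPy on a made-up matrix. By default, `preprocessing.scale(X)` from the module imported at the top performs the same kind of standardization. Note that standardized values cluster around zero but are not strictly confined to -1 to 1.

```python
import numpy as np

# Made-up feature matrix: a price-like column and a volume-like column.
X_toy = np.array([[90.0, 5.0e6],
                  [95.0, 1.4e7],
                  [88.0, 9.0e6],
                  [85.0, 1.6e7]])

# Standardize each column: subtract its mean, divide by its standard deviation.
X_scaled = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(X_scaled.mean(axis=0))  # close to 0 for every column
print(X_scaled.std(axis=0))   # close to 1 for every column
```

After scaling, the price-like and volume-like columns have the same spread, even though their raw magnitudes differ by several orders.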

## Create the label 'y'

In [50]:
y = np.array(df['label'])

# Training and testing

Now comes the training and testing. The way this works is you take, for example, 75% of your data and use it to train the model. Then you take the remaining 25% and test the model on it. Since this is your sample data, you have both the features and the known labels, so testing on the held-out portion gives a measure of accuracy and reliability, often called the confidence score. There are many ways to do this, but a convenient one is the built-in `cross_validation.train_test_split`, since it also shuffles your data for you (in scikit-learn 0.20 and later, this function lives in `sklearn.model_selection`). The cell below holds out 20% of the data for testing (`test_size=0.2`).

In [51]:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.2)
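
Conceptually, a shuffled split permutes the row indices and holds out a fraction of them. The sketch below illustrates that idea with NumPy. `shuffled_split` is a hypothetical helper written for this example, not scikit-learn's implementation.

```python
import numpy as np

def shuffled_split(X, y, test_size=0.2, seed=0):
    """Hypothetical helper: shuffle row indices, then hold out a fraction."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(round(test_size * len(X)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X_toy = np.arange(20).reshape(10, 2)  # 10 made-up samples, 2 features
y_toy = np.arange(10)

X_tr, X_te, y_tr, y_te = shuffled_split(X_toy, y_toy)
print(len(X_tr), len(X_te))
```

Features and labels are indexed with the same shuffled positions, so each row of `X` stays paired with its own label.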

## Classifier

Let's use `LinearRegression` from scikit-learn's `linear_model` module.

In [52]:
clf = LinearRegression()

## Training

In [53]:
clf.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

## Testing

In [54]:
confidence = clf.score(X_test, y_test)

## Confidence

In [55]:
print("Confidence: ", confidence)

Confidence:  0.9811923006018388
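
For a regressor such as `LinearRegression`, the value returned by `score` is the coefficient of determination, R². As a rough illustration of what that number measures, the sketch below fits a line by ordinary least squares with NumPy and computes R² by hand on held-out points. The data is made up, and the split (every 5th point held out) is an assumption chosen to keep the example deterministic.

```python
import numpy as np

# Made-up data: a straight line plus a small deterministic wiggle.
x = np.arange(50) / 5.0
y = 2.0 * x + 1.0 + 0.5 * np.sin(7.0 * x)

# Hold out every 5th point for testing, train on the rest.
test_mask = np.arange(50) % 5 == 0
A = np.column_stack([x, np.ones_like(x)])  # columns: slope term, intercept term

# Ordinary least squares fit on the training rows.
coef, *_ = np.linalg.lstsq(A[~test_mask], y[~test_mask], rcond=None)

# R^2 = 1 - SS_res / SS_tot, evaluated on the held-out rows.
y_pred = A[test_mask] @ coef
ss_res = np.sum((y[test_mask] - y_pred) ** 2)
ss_tot = np.sum((y[test_mask] - y[test_mask].mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
print("R^2 on held-out data:", r2)
```

A value near 1 means the predictions explain most of the variation in the held-out labels. A value near 0 means the model does no better than always predicting their mean.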


In [None]:
for k in ['linear','poly','rbf','sigmoid']:
    clf = svm.SVR(kernel=k)
    clf.fit(X_train, y_train)
    confidence = clf.score(X_test, y_test)
    print(k,confidence)