# Regression - Training and Testing

Here, we're going to make a regression algorithm to predict "Adj Close" price of a stock.

## Goal

Implement a Regression algorithm that trains and tests a trading a symbol to predict its price value.

## Import libraries

In [24]:
import pandas as pd
import numpy as np  
import datetime as dt
import math
import datetime

# To fetch data
from pandas_datareader import data as pdr   
import fix_yahoo_finance as yf  
yf.pdr_override()   

# sklearn
from sklearn import preprocessing, cross_validation, svm
from sklearn.linear_model import LinearRegression

# Add plotly for interactive charts
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
import plotly.plotly as py
import plotly.graph_objs as go
from plotly import tools

## Initial variables



In [2]:
symbol = "AMZN"
# Specify the start and end dates for this period.
start_d = dt.datetime(2008, 1, 1)
#end_d = dt.datetime(2018, 10, 30)
yesterday = dt.date.today() - dt.timedelta(1)
end_d = yesterday

### Get portfolio data from Yahoo

In [3]:
portf_value = fetchOnlineData(start_d, end_d, symbol)

[*********************100%***********************]  1 of 1 downloaded


In [4]:
portf_value

Unnamed: 0_level_0,Open,High,Low,Close,Adj Close,Volume
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2007-12-31,93.809998,94.370003,92.449997,92.639999,92.639999,5755200
2008-01-02,95.349998,97.430000,94.699997,96.250000,96.250000,13858700
2008-01-03,96.059998,97.250000,94.519997,95.209999,95.209999,9122500
2008-01-04,93.260002,93.400002,88.500000,88.790001,88.790001,10270000
2008-01-07,88.620003,90.570000,85.470001,88.820000,88.820000,9981600
2008-01-08,87.550003,91.830002,86.930000,87.879997,87.879997,12283300
2008-01-09,87.559998,87.800003,80.239998,85.220001,85.220001,16410900
2008-01-10,83.980003,85.970001,82.970001,84.260002,84.260002,11609900
2008-01-11,84.029999,84.029999,80.290001,81.080002,81.080002,10624300
2008-01-14,82.180000,83.320000,78.870003,82.870003,82.870003,9056100


## Features and label

In our case, what are the features and what is the label? We're trying to predict the price, so is price the label? If so, what are the featuers? When it comes to forecasting out the price, our label, the thing we're hoping to predict, is actually the future price. As such, our features are actually: current price, high minus low percent, and the percent change volatility. The price that is the label shall be the price at some determined point the future. 

In [5]:
df = portf_value[['Open',  'High',  'Low',  'Close', 'Volume']]

df['HL_PCT'] = (df['High'] - df['Low']) / df['Close'] * 100.0
df['PCT_change'] = (df['Close'] - df['Open']) / df['Open'] * 100.0

df = df[['Close', 'HL_PCT', 'PCT_change', 'Volume']]
print(df.head())

                Close    HL_PCT  PCT_change    Volume
Date                                                 
2007-12-31  92.639999  2.072545   -1.247201   5755200
2008-01-02  96.250000  2.836367    0.943893  13858700
2008-01-03  95.209999  2.867349   -0.884863   9122500
2008-01-04  88.790001  5.518642   -4.793053  10270000
2008-01-07  88.820000  5.741949    0.225679   9981600


## Fill NaN data

We fill any NaN data with -99999. It's a popular option is to replace missing data with -99,999. With many machine learning classifiers, this will just be recognized and treated as an outlier feature.

In [30]:
forecast_col = 'Adj Close'
df.fillna(value=-99999, inplace=True)

Unnamed: 0_level_0,Close,HL_PCT,PCT_change,Volume,label,Forecast
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2007-12-31 00:00:00,92.639999,2.072545,-1.247201,5755200.0,75.190002,-99999.000000
2008-01-02 00:00:00,96.250000,2.836367,0.943893,13858700.0,74.449997,-99999.000000
2008-01-03 00:00:00,95.209999,2.867349,-0.884863,9122500.0,77.730003,-99999.000000
2008-01-04 00:00:00,88.790001,5.518642,-4.793053,10270000.0,75.800003,-99999.000000
2008-01-07 00:00:00,88.820000,5.741949,0.225679,9981600.0,72.959999,-99999.000000
2008-01-08 00:00:00,87.879997,5.575788,0.376921,12283300.0,72.080002,-99999.000000
2008-01-09 00:00:00,85.220001,8.871163,-2.672450,16410900.0,73.639999,-99999.000000
2008-01-10 00:00:00,84.260002,3.560408,0.333412,11609900.0,69.900002,-99999.000000
2008-01-11 00:00:00,81.080002,4.612726,-3.510647,10624300.0,72.080002,-99999.000000
2008-01-14 00:00:00,82.870003,5.369853,0.839624,9056100.0,73.269997,-99999.000000


## Forecast out

We're saying we want to forecast out 1% of the entire length of the dataset. Thus, if our data is 100 days of stock prices, we want to be able to predict the price 1 day out into the future. Choose whatever you like. If you are just trying to predict tomorrow's price, then you would just do 1 day out, and the forecast would be just one day out. If you predict 10 days out, we can actually generate a forcast for every day, for the next week and a half.

In [7]:
forecast_out = int(math.ceil(0.01 * len(df)))

## Shift the values

https://stackoverflow.com/questions/44675650/what-is-meant-by-shift-in-dataframe

In [8]:
df['label'] = portf_value[forecast_col].shift(-forecast_out)

## Drop NaN values

We'll then drop any still NaN information from the dataframe.

In [9]:
df.dropna(inplace=True)

##  Define our features and labels as arrays and preprocessing

Generally, you want your features in machine learning to be in a range of -1 to 1. This may do nothing, but it usually speeds up processing and can also help with accuracy. 

In [10]:
X = np.array(df.drop(['label'], 1))
y = np.array(df['label'])

## Create the label 'y'

In [11]:
y = np.array(df['label'])

# Training and testing

Now comes the training and testing. The way this works is you take, for example, 75% of your data, and use this to train the machine learning classifier. Then you take the remaining 25% of your data, and test the classifier. Since this is your sample data, you should have the features and known labels. Thus, if you test on the last 25% of your data, you can get a sort of accuracy and reliability, often called the confidence score. There are many ways to do this, but, probably the best way is using the build in cross_validation provided, since this also shuffles your data for you. 

In [12]:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.2)

## Classifier

Let's use Linear Regression from Scikit-Learn's svm package

In [13]:
clf = LinearRegression()

## Training

In [14]:
clf.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

## Testing

In [15]:
confidence = clf.score(X_test, y_test)

## Confidence

In [16]:
print("Confidence: ", confidence)

Confidence:  0.9853637951893229


## Forecasting results

In [18]:
X_lately = X[-forecast_out:]
forecast_set = clf.predict(X_lately)

In [19]:
print(forecast_set, confidence, forecast_out)

[2003.39283124 2000.28317094 1986.88511159 2038.03002512 2039.39517548
 2039.86020424 2019.86094352 1952.62396598 1990.29015963 1974.78114468
 1994.09177365 1959.17447329 1984.91341126 2023.99869339 2024.08102553
 2062.93886528 2052.87102375 2054.88750931 2018.93187172 1999.8817764
 1953.42782123 1934.35432202 1908.32684636 1917.26308776 1791.79990896
 1751.03334325 1827.67044737 1803.27072368] 0.9853637951893229 28


# Visualizing

Stock prices are daily, for 5 days, and then there are no prices on the weekends. I recognize this fact, but we're going to keep things simple, and plot each forecast as if it is simply 1 day out. If you want to try to work in the weekend gaps (don't forget holidays) go for it, but we'll keep it simple.

## Adding forecast column

We set the value as a NaN first, but we'll populate some shortly. We said we're going to just start the forecasts as tomorrow (recall that we predict 10% out into the future, and we saved that last 10% of our data to do this, thus, we can begin immediately predicting since -10% has data that we can predict 10% out and be the next prediction)

In [20]:
df['Forecast'] = np.nan

## Setting dates

Wset the value as a NaN first, but we'll populate some shortly. We said we're going to just start the forecasts as tomorrow (recall that we predict 10% out into the future, and we saved that last 10% of our data to do this, thus, we can begin immediately predicting since -10% has data that we can predict 10% out and be the next prediction). We need to first grab the last day in the dataframe, and begin assigning each new forecast to a new day.

In [21]:
last_date = df.iloc[-1].name
last_unix = last_date.timestamp()
one_day = 86400
next_unix = last_unix + one_day

## Forecasting the existing dataframe

Now we have the next day we wish to use, and one_day is 86,400 seconds. Now we add the forecast to the existing dataframe.

So here all we're doing is iterating through the forecast set, taking each forecast and day, and then setting those values in the dataframe (making the future "features" NaNs). The last line's code just simply takes all of the first columns, setting them to NaNs, and then the final column is whatever i is (the forecast in this case). I have chosen to do this one-liner for loop like this so that, if we decide to change up the dataframe and features, the code can still work. 

In [25]:
for i in forecast_set:
    next_date = datetime.datetime.fromtimestamp(next_unix)
    next_unix += 86400
    df.loc[next_date] = [np.nan for _ in range(len(df.columns)-1)]+[i]

In [29]:
df

Unnamed: 0_level_0,Close,HL_PCT,PCT_change,Volume,label,Forecast
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2007-12-31 00:00:00,92.639999,2.072545,-1.247201,5755200.0,75.190002,
2008-01-02 00:00:00,96.250000,2.836367,0.943893,13858700.0,74.449997,
2008-01-03 00:00:00,95.209999,2.867349,-0.884863,9122500.0,77.730003,
2008-01-04 00:00:00,88.790001,5.518642,-4.793053,10270000.0,75.800003,
2008-01-07 00:00:00,88.820000,5.741949,0.225679,9981600.0,72.959999,
2008-01-08 00:00:00,87.879997,5.575788,0.376921,12283300.0,72.080002,
2008-01-09 00:00:00,85.220001,8.871163,-2.672450,16410900.0,73.639999,
2008-01-10 00:00:00,84.260002,3.560408,0.333412,11609900.0,69.900002,
2008-01-11 00:00:00,81.080002,4.612726,-3.510647,10624300.0,72.080002,
2008-01-14 00:00:00,82.870003,5.369853,0.839624,9056100.0,73.269997,


## Plotting

In [39]:
def plot_stock_prices(df_index, price, forecast, symbol="AMZN", title="Stock prices forecatsting", xlabel="Date", ylabel="Price", fig_size=(12, 6)):
    """Plot Stock Prices.

    Parameters:
    df_index: Date index
    price: Price, typically adjusted close price, series of symbol
    forecast: Forecast price
    symbol: Stock symbol
    title: Chart title
    xlabel: X axis title
    ylable: Y axis title
    fig_size: Width and height of the chart in inches
    
    Returns:
    Plot forecast prices
    """
    trace_prices = go.Scatter(
                x=df_index,
                y=price,
                name = symbol,
                line = dict(color = '#17BECF'),
                opacity = 0.8)

    trace_forecast_prices = go.Scatter(
                x=df_index,
                y=forecast,
                name = symbol,
                line = dict(color = '#FF8000'),
                opacity = 0.8)
        

    data = [trace_prices, trace_forecast_prices]

    layout = dict(
        title = title,
        showlegend=True,
        xaxis = dict(
                title=xlabel,
                linecolor='#000', linewidth=1,
                rangeselector=dict(
                        buttons=list([
                            dict(count=1,
                                 label='1m',
                                 step='month',
                                 stepmode='backward'),
                            dict(count=6,
                                 label='6m',
                                 step='month',
                                 stepmode='backward'),
                            dict(step='all')
                        ])
                ),
                range = [df_index.values[0], df_index.values[1]]),
            
        yaxis = dict(
                title=ylabel,
                linecolor='#000', linewidth=1
                ),
    )
        
        
        

    fig = dict(data=data, layout=layout)
    iplot(fig)

In [42]:
plot_price_forecast(portf_value.index, portf_value['Adj Close'], df['Forecast'])


DatetimeIndex(['2007-12-31', '2008-01-02', '2008-01-03', '2008-01-04',
               '2008-01-07', '2008-01-08', '2008-01-09', '2008-01-10',
               '2008-01-11', '2008-01-14',
               ...
               '2018-11-09', '2018-11-12', '2018-11-13', '2018-11-14',
               '2018-11-15', '2018-11-16', '2018-11-19', '2018-11-20',
               '2018-11-21', '2018-11-23'],
              dtype='datetime64[ns]', name='Date', length=2746, freq=None)