# Linear Regression on Time Series with scikit-learn and pandas

This notebook demonstrates linear regression on a simple time series using:

* scikit-learn
* pandas
* numpy

## Imports

Import the required libraries.

In [1]:
import numpy as np
import pandas as pd
import datetime

## Create time series data

There are many ways to create time series data in pandas. Refer to the [Time series](https://pandas.pydata.org/docs/getting_started/10min.html#time-series) section in the pandas documentation for more details.

Here, we build a datetime index with one entry per day of 2020. Because 2020 is a leap year, the index has 366 entries.

In [2]:
start = datetime.datetime(2020, 1, 1)
end = datetime.datetime(2020, 12, 31)
index = pd.date_range(start, end)
index, len(index)

(DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
                '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',
                '2020-01-09', '2020-01-10',
                ...
                '2020-12-22', '2020-12-23', '2020-12-24', '2020-12-25',
                '2020-12-26', '2020-12-27', '2020-12-28', '2020-12-29',
                '2020-12-30', '2020-12-31'],
               dtype='datetime64[ns]', length=366, freq='D'), 366)
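
As one of the other ways of building the index, here is a minimal sketch that uses string dates and a `periods` count instead of `datetime` objects. The variable names `index_from_range` and `index_from_periods` are illustrative and do not appear in the rest of the notebook.

```python
import pandas as pd

# Same daily index for 2020, built from start and end date strings.
index_from_range = pd.date_range("2020-01-01", "2020-12-31", freq="D")

# Equivalent construction from a start date and a number of periods.
index_from_periods = pd.date_range("2020-01-01", periods=366, freq="D")

# Both approaches should describe the same 366 days.
same_index = index_from_range.equals(index_from_periods)
print(len(index_from_range), same_index)
```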

## Create a pandas DataFrame from the index

Next, we put this index in a pandas DataFrame. We add an artificial "value" column that increases by 5 each day (0, 5, 10, ...) to serve as generic target data.

In [3]:
multiple = 5
values = list(range(0, len(index) * multiple, multiple))
df = pd.DataFrame(values, index=index)
df.index.name = "date"
df.columns = ["value"]
df

date,value
2020-01-01,0
2020-01-02,5
2020-01-03,10
2020-01-04,15
2020-01-05,20
...,...
2020-12-27,1805
2020-12-28,1810
2020-12-29,1815
2020-12-30,1820


## Simple feature engineering from time series

We need a numeric feature to predict from. One simple option is to convert each date in the index into the number of days elapsed since the first date:

In [4]:
df['days_from_start'] = (df.index - df.index[0]).days
df

date,value,days_from_start
2020-01-01,0,0
2020-01-02,5,1
2020-01-03,10,2
2020-01-04,15,3
2020-01-05,20,4
...,...,...
2020-12-27,1805,361
2020-12-28,1810,362
2020-12-29,1815,363
2020-12-30,1820,364


## Simple Regression

We can now pull out the columns for simple linear regression.

In [5]:
x = df['days_from_start'].values
y = df['value'].values
x, y

(array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
         13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
         26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
         39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
         52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,
         65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,
         78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
         91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103,
        104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
        117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
        130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,
        143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155,
        156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168,
        169, 170, 171, 172, 173, 174, 175, 176, 177

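Note that scikit-learn expects the input features as a 2D array of shape (n_samples, n_features), so the 1D array x has to be reshaped into a single column before it is passed to the model.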

In [6]:
x = x.reshape(-1, 1)
x

array([[  0],
       [  1],
       [  2],
       [  3],
       [  4],
       [  5],
       [  6],
       [  7],
       [  8],
       [  9],
       [ 10],
       [ 11],
       [ 12],
       [ 13],
       [ 14],
       [ 15],
       [ 16],
       [ 17],
       [ 18],
       [ 19],
       [ 20],
       [ 21],
       [ 22],
       [ 23],
       [ 24],
       [ 25],
       [ 26],
       [ 27],
       [ 28],
       [ 29],
       [ 30],
       [ 31],
       [ 32],
       [ 33],
       [ 34],
       [ 35],
       [ 36],
       [ 37],
       [ 38],
       [ 39],
       [ 40],
       [ 41],
       [ 42],
       [ 43],
       [ 44],
       [ 45],
       [ 46],
       [ 47],
       [ 48],
       [ 49],
       [ 50],
       [ 51],
       [ 52],
       [ 53],
       [ 54],
       [ 55],
       [ 56],
       [ 57],
       [ 58],
       [ 59],
       [ 60],
       [ 61],
       [ 62],
       [ 63],
       [ 64],
       [ 65],
       [ 66],
       [ 67],
       [ 68],
       [ 69],
       [ 70],
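
To make the effect of the reshape easier to see than in the long printout, the following sketch compares array shapes before and after. It uses a stand-in array, `x_flat`, built with `np.arange` rather than the notebook's DataFrame column, so the names here are assumptions for illustration only.

```python
import numpy as np

# Stand-in for the days_from_start column: one integer per day of 2020.
x_flat = np.arange(366)

# -1 tells NumPy to infer the number of rows; 1 fixes a single feature column.
x_column = x_flat.reshape(-1, 1)

print(x_flat.shape)    # 1D: one axis of length 366
print(x_column.shape)  # 2D: 366 samples, 1 feature
```

The values are unchanged; only the layout differs, with each sample now occupying its own row.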
      

In [7]:
from sklearn import linear_model

# Fit ordinary least squares on the day offsets (x) and target values (y).
model = linear_model.LinearRegression().fit(x, y)
model  # display the fitted estimator

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
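
Since the target grows by exactly 5 per day starting from 0, a fitted line should have a slope close to 5 and an intercept close to 0. The sketch below checks this by rebuilding equivalent arrays with NumPy (an assumption made so the block runs on its own) and reading the `coef_` and `intercept_` attributes of the fitted estimator.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Rebuild data equivalent to the notebook's columns: day offsets and 5 * day.
x = np.arange(366).reshape(-1, 1)
y = 5 * np.arange(366)

model = LinearRegression().fit(x, y)

slope = model.coef_[0]        # change in value per day
intercept = model.intercept_  # predicted value on day 0
r_squared = model.score(x, y) # coefficient of determination on the training data

print(slope, intercept, r_squared)
```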

In [8]:
model.predict([[1], [7], [50]])

array([  5.,  35., 250.])
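
The inputs to `predict` are day offsets, not dates. As a rough sketch of how a calendar date could be turned into that feature first, the block below defines a hypothetical helper, `predict_for_date`, which is not part of the original notebook. It rebuilds the data so that it can run on its own.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Rebuild the notebook's data: daily index for 2020 and a value that grows by 5 per day.
index = pd.date_range("2020-01-01", "2020-12-31")
df = pd.DataFrame({"value": 5 * np.arange(len(index))}, index=index)
df["days_from_start"] = (df.index - df.index[0]).days

x = df["days_from_start"].to_numpy().reshape(-1, 1)
y = df["value"].to_numpy()
model = LinearRegression().fit(x, y)

def predict_for_date(date_str):
    """Hypothetical helper: convert a date string to a day offset and predict."""
    days = (pd.Timestamp(date_str) - df.index[0]).days
    return model.predict([[days]])[0]

# 2021-01-10 is 375 days after 2020-01-01, so this extrapolates past the training range.
prediction = predict_for_date("2021-01-10")
print(prediction)
```

Because the artificial data is perfectly linear, extrapolation works here; with real time series, predictions outside the training range deserve more caution.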