# Store Sales - Multiple Features Forecasting

https://www.kaggle.com/competitions/store-sales-time-series-forecasting

***In this project, the forecasts also take the 'promotion' feature into account.<br>
Forecasting is carried out separately for each product family.***

The evaluation metric for this competition is Root Mean Squared Logarithmic Error.

The RMSLE is calculated as:
$\sqrt{ \frac{1}{n} \sum_{i=1}^n \left(\log (1 + \hat{y}_i) - \log (1 + y_i)\right)^2}$
where:

$n$ is the total number of instances,<br>
$\hat{y}_i$ is the predicted value of the target for instance $i$,<br>
$y_i$ is the actual value of the target for instance $i$, and<br>
log is the natural logarithm.
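
To make the metric concrete, here is a minimal sketch of the formula above on made-up numbers. The function name `rmsle` and the clipping of negative predictions to zero are assumptions of this sketch, not part of the competition definition.

```python
import numpy as np

def rmsle(y_pred, y_true):
    """RMSLE; negative predictions are clipped to 0 (an assumption of this sketch)."""
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0, None)
    y_true = np.asarray(y_true, dtype=float)
    # log1p(x) computes log(1 + x) with the natural logarithm
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Hypothetical values, not taken from the competition data
small = rmsle([9.0], [19.0])     # log(10) vs log(20)
large = rmsle([99.0], [199.0])   # log(100) vs log(200)
print(small, large)
```

Both calls compare values whose `1 + y` differ by a factor of two, so they give the same error: the metric penalizes relative rather than absolute differences.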

The training data:<br>
***store_nbr*** identifies the store at which the products are sold.<br>
***family*** identifies the type of product sold.<br>
***sales*** gives the total sales for a product family at a particular store at a given date.
Fractional values are possible since products can be sold in fractional units (1.5 kg of cheese, for instance, as opposed to 1 bag of chips).<br>
***onpromotion*** gives the total number of items in a product family that were being promoted at a store at a given date.

## Blueprint

1. Investigate the dataset (unique values, data types, etc.).
2. How to encode the *family* feature numerically?
3. How to convert *date* to time features?
4. Split the *train* dataset into *ourtrain* and *ourtest* for pre-validation.
5. Apply various ML models (Trend, Periodogram, Cycles, Hybrid).
6. Choose the best model and apply it to our test set.
7. Apply it to the competition test set and write the csv file for submission.


## Preprocessing

In [3]:
# Import packages
from pathlib import Path
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.linear_model import LinearRegression
import datetime
import math
from statsmodels.tsa.deterministic import CalendarFourier, DeterministicProcess
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

# Ignore Future Warning
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [4]:
# Load dataset
train = pd.read_csv('train.csv', parse_dates=["date"])
test = pd.read_csv('test.csv', parse_dates=["date"])

In [5]:
train.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
0,0,2013-01-01,1,AUTOMOTIVE,0.0,0
1,1,2013-01-01,1,BABY CARE,0.0,0
2,2,2013-01-01,1,BEAUTY,0.0,0
3,3,2013-01-01,1,BEVERAGES,0.0,0
4,4,2013-01-01,1,BOOKS,0.0,0


#### - Error function : 
$\sqrt{ \frac{1}{n} \sum_{i=1}^n \left(\log (1 + \hat{y}_i) - \log (1 + y_i)\right)^2}$

$n$ is the total number of instances,<br>
$\hat{y}_i$ is the predicted value of the target for instance $i$,<br>
$y_i$ is the actual value of the target for instance $i$

In [26]:
# Error Function (RMSLE)
def error(y_p, y_t):    # y_p(sales, id), y_t(sales)
    # Clip negative predictions to 0 so that log(1 + prediction) is defined
    pred = np.clip(np.asarray(y_p["sales"], dtype=float), 0, None)
    pred_log = np.log1p(pred)
    act_log = np.log1p(np.asarray(y_t, dtype=float))

    # Square root of the mean squared difference between the logs
    linear_error = np.sqrt(np.mean((pred_log - act_log)**2))
    return round(float(linear_error), 4)

In [None]:
# Compute error for each model
def errors_model(model):
    errors_list = []
    for store in store_list:
        for family in family_list:
            # split ourtrain and ourtest
            ourtrain, ourtest = split_train_test(date_features(store_family_subsets(store)[family]))
            
            # apply the given model
            y_test = ourtest['sales']
            y_fore = model(ourtrain, ourtest)
            
            # compute errors
            errors = error(y_fore, y_test)
            errors_list.append(round(errors, 2))

    return sum(errors_list)

## 1. Data investigation

#### - train dataset
* shape : 3000888 × 6
* null : none
<br><br>
* *date* : timestamp. 2013-01-01 ~ 2017-08-15
* *store_nbr* : numpy. 1 ~ 54
* *family* : str. ['AUTOMOTIVE', 'BABY CARE', 'BEAUTY', 'BEVERAGES', 'BOOKS',
       'BREAD/BAKERY', 'CELEBRATION', 'CLEANING', 'DAIRY', 'DELI', 'EGGS',
       'FROZEN FOODS', 'GROCERY I', 'GROCERY II', 'HARDWARE',
       'HOME AND KITCHEN I', 'HOME AND KITCHEN II', 'HOME APPLIANCES',
       'HOME CARE', 'LADIESWEAR', 'LAWN AND GARDEN', 'LINGERIE',
       'LIQUOR,WINE,BEER', 'MAGAZINES', 'MEATS', 'PERSONAL CARE',
       'PET SUPPLIES', 'PLAYERS AND ELECTRONICS', 'POULTRY',
       'PREPARED FOODS', 'PRODUCE', 'SCHOOL AND OFFICE SUPPLIES',
       'SEAFOOD']
* *sales* : numpy. 0 ~ 124717
* *onpromotion* : numpy. 0 ~ 741


#### - Correlation

In [7]:
# Check correlation
train.select_dtypes("number").corr()    # numeric columns only; 'date' and 'family' are excluded

Unnamed: 0,id,store_nbr,sales,onpromotion
id,1.0,0.000301,0.085784,0.20626
store_nbr,0.000301,1.0,0.041196,0.007286
sales,0.085784,0.041196,1.0,0.427923
onpromotion,0.20626,0.007286,0.427923,1.0


## 2. Generate subsets

In [8]:
# Generate subsets for each store number
def storenbr_subsets(df, key):
    subset = df.loc[df["store_nbr"]==key, :]
    subset = subset.drop(columns=["store_nbr"])     
    return subset

# Save the subsets in dictionary
store_list = train["store_nbr"].unique()
train_subsets = {}
for store in store_list:
    train_subsets.update({store : storenbr_subsets(train, store)})

print(type(train_subsets[1]), train_subsets[1].shape)
train_subsets[1]    # train dataframe for store number 1

<class 'pandas.core.frame.DataFrame'> (55572, 5)


Unnamed: 0,id,date,family,sales,onpromotion
0,0,2013-01-01,AUTOMOTIVE,0.000000,0
1,1,2013-01-01,BABY CARE,0.000000,0
2,2,2013-01-01,BEAUTY,0.000000,0
3,3,2013-01-01,BEVERAGES,0.000000,0
4,4,2013-01-01,BOOKS,0.000000,0
...,...,...,...,...,...
2999134,2999134,2017-08-15,POULTRY,234.892000,0
2999135,2999135,2017-08-15,PREPARED FOODS,42.822998,0
2999136,2999136,2017-08-15,PRODUCE,2240.230000,7
2999137,2999137,2017-08-15,SCHOOL AND OFFICE SUPPLIES,0.000000,0


In [9]:
# Generate subsets for each family
def family_subsets(df, key):
    subset = df.loc[df["family"]==key, :]
    subset = subset.drop(columns=["family"])     
    return subset

# Save the subsets in dictionary
family_list = train["family"].unique()
def store_family_subsets(storenbr):
    subsets = {}
    for family in family_list:
        subsets.update({family : family_subsets(train_subsets[storenbr], family)})
    return subsets

# store_family_subsets(1)  # categorized dataframe for store number 1
store_family_subsets(1)['AUTOMOTIVE']   # sales dataframe for 'AUTOMOTIVE' in store number 1

Unnamed: 0,id,date,sales,onpromotion
0,0,2013-01-01,0.0,0
1782,1782,2013-01-02,2.0,0
3564,3564,2013-01-03,3.0,0
5346,5346,2013-01-04,3.0,0
7128,7128,2013-01-05,5.0,0
...,...,...,...,...
2991978,2991978,2017-08-11,1.0,0
2993760,2993760,2017-08-12,6.0,0
2995542,2995542,2017-08-13,1.0,0
2997324,2997324,2017-08-14,1.0,0
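
The two-level dictionaries above can also be built in a single pass with `groupby`. The following is a sketch on a made-up frame with the same columns as `train.csv`; the names `toy` and `subsets` are hypothetical and this is not the code the notebook actually runs.

```python
import pandas as pd

# Toy frame with the same columns as train.csv (all values are made up)
toy = pd.DataFrame({
    "id": range(8),
    "date": pd.to_datetime(["2013-01-01"] * 4 + ["2013-01-02"] * 4),
    "store_nbr": [1, 1, 2, 2] * 2,
    "family": ["AUTOMOTIVE", "BEAUTY"] * 4,
    "sales": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
    "onpromotion": [0] * 8,
})

# One pass over the data: key = (store_nbr, family), value = the rows of that series
subsets = {
    key: group.drop(columns=["store_nbr", "family"])
    for key, group in toy.groupby(["store_nbr", "family"])
}

print(subsets[(1, "AUTOMOTIVE")])
```

Building the dictionary once and looking series up by `(store, family)` avoids re-filtering the store subset every time a series is requested inside the evaluation loops.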


## 3. One Hot Encode *'family'* features

In [None]:
# # Integer encode
# label_encoderF = LabelEncoder()
# integer_encodedF = label_encoderF.fit_transform(whole_train['family'])

# # Binary encode
# onehot_encoderF = OneHotEncoder(sparse=False)
# integer_encodedF = integer_encodedF.reshape(len(integer_encodedF), 1)
# onehot_encodedF = onehot_encoderF.fit_transform(integer_encodedF)
# # print(onehot_encodedF.shape, '\n', onehot_encodedF)

In [None]:
# # Rename
# onehot_encodedF = pd.DataFrame(onehot_encodedF)
# onehot_encodedF = onehot_encodedF.rename(columns = {0:'AUTOMOTIVE', 1:'BABY CARE', 2:'BEAUTY', 3:'BEVERAGES', 
#         4:'BOOKS', 5:'BREAD/BAKERY', 6:'CELEBRATION', 7:'CLEANING', 8:'DAIRY', 9:'DELI', 10:'EGGS',
#         11:'FROZEN FOODS', 12:'GROCERY I', 13:'GROCERY II', 14:'HARDWARE',
#         15:'HOME AND KITCHEN I', 16:'HOME AND KITCHEN II', 17:'HOME APPLIANCES',
#         18:'HOME CARE', 19:'LADIESWEAR', 20:'LAWN AND GARDEN', 21:'LINGERIE',
#         22:'LIQUOR,WINE,BEER', 23:'MAGAZINES', 24:'MEATS', 25:'PERSONAL CARE',
#         26:'PET SUPPLIES', 27:'PLAYERS AND ELECTRONICS', 28:'POULTRY',
#         29:'PREPARED FOODS', 30:'PRODUCE', 31:'SCHOOL AND OFFICE SUPPLIES',
#         32:'SEAFOOD'})

In [None]:
# # Add to train dataset
# whole_train = pd.concat([whole_train, onehot_encodedF], axis=1) # Combine
# whole_train = whole_train.drop(columns=['family'])              # Drop string values
# print(whole_train.shape)
# whole_train.head()

## 4. One Hot Encode *'store_nbr'* features

In [None]:
# # Convert to numpy
# integer_encodedS = np.array(whole_train['store_nbr'])

# # Binary encode
# onehot_encoderS = OneHotEncoder(sparse=False)
# integer_encodedS = integer_encodedS.reshape(len(integer_encodedS), 1)
# onehot_encodedS = onehot_encoderS.fit_transform(integer_encodedS)
# # print(onehot_encodedS.shape, '\n', onehot_encodedS)

In [None]:
# # Add to train dataset
# onehot_encodedS = pd.DataFrame(onehot_encodedS)
# whole_train = pd.concat([whole_train, onehot_encodedS], axis=1) # Combine
# whole_train = whole_train.drop(columns=['store_nbr'])           # Drop string values
# print(whole_train.shape)
# whole_train.head()

## 5. Convert *'date'* to time features

In [10]:
# Split 'date' into detailed features
def date_features(df):
    df = df.set_index('date')   # Make 'date' as an index
    df = df.to_period('D')

    df["day"] = df.index.dayofweek
    df["week"] = df.index.week
    df["dayofyear"] = df.index.dayofyear
    df["year"] = df.index.year

    # df = df.set_index('id')     # Make 'id' as an index
    return df

date_features(store_family_subsets(1)['AUTOMOTIVE'])

Unnamed: 0_level_0,id,sales,onpromotion,day,week,dayofyear,year
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2013-01-01,0,0.0,0,1,1,1,2013
2013-01-02,1782,2.0,0,2,1,2,2013
2013-01-03,3564,3.0,0,3,1,3,2013
2013-01-04,5346,3.0,0,4,1,4,2013
2013-01-05,7128,5.0,0,5,1,5,2013
...,...,...,...,...,...,...,...
2017-08-11,2991978,1.0,0,4,32,223,2017
2017-08-12,2993760,6.0,0,5,32,224,2017
2017-08-13,2995542,1.0,0,6,32,225,2017
2017-08-14,2997324,1.0,0,0,33,226,2017


## 6. Split *train* dataset into *ourtrain* and *ourtest*

In [13]:
def split_train_test(df):
    ourtrain = df[df.index < '2017-01-01']    # 2013-01-01 ~ 2016-12-31
    ourtest = df[df.index > '2016-12-31']     # 2017-01-01 ~ 2017-08-15
    return ourtrain, ourtest

x, y = split_train_test(date_features(store_family_subsets(1)["BEAUTY"]))
print(x.shape, type(x))
x

(1457, 7) <class 'pandas.core.frame.DataFrame'>


Unnamed: 0_level_0,id,sales,onpromotion,day,week,dayofyear,year
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2013-01-01,2,0.0,0,1,1,1,2013
2013-01-02,1784,2.0,0,2,1,2,2013
2013-01-03,3566,0.0,0,3,1,3,2013
2013-01-04,5348,3.0,0,4,1,4,2013
2013-01-05,7130,3.0,0,5,1,5,2013
...,...,...,...,...,...,...,...
2016-12-27,2587466,2.0,0,1,52,362,2016
2016-12-28,2589248,6.0,1,2,52,363,2016
2016-12-29,2591030,1.0,1,3,52,364,2016
2016-12-30,2592812,3.0,0,4,52,365,2016


## 7. Apply various ML models

### 1) Trend

In [24]:
# Fit data to trend model
def trend(ourtrain, ourtest):   # return y_fore

    # Targets
    y_train = ourtrain['sales']
    test_id = ourtest['id']
    
    # Create features
    trend_dp = DeterministicProcess(
        index=ourtrain.index,   # dates from the training data
        constant=True,          # dummy feature for the bias (y_intercept)
        order=2,                # quadratic trend (time dummy and its square)
        drop=True,              # drop terms if necessary to avoid collinearity
    )

    # `in_sample` creates features for the dates given in the `index` argument
    X_train = trend_dp.in_sample()

    # Fit model
    model = LinearRegression(fit_intercept=False)
    model.fit(X_train, y_train)

    # Out of Sample 
    X_oos = trend_dp.out_of_sample(steps=len(ourtest.index))
    y_fore = pd.Series(model.predict(X_oos), index=X_oos.index)
    y_fore = pd.concat([y_fore, test_id], axis=1)
    y_fore = y_fore.rename(columns={0: 'sales'})

    return y_fore

In [25]:
trend(x, y)

Unnamed: 0,sales,id
2017-01-01,3.299815,2596376
2017-01-02,3.302020,2598158
2017-01-03,3.304227,2599940
2017-01-04,3.306435,2601722
2017-01-05,3.308645,2603504
...,...,...
2017-08-11,3.829881,2991980
2017-08-12,3.832453,2993762
2017-08-13,3.835026,2995544
2017-08-14,3.837602,2997326
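
To show what an order-2 trend fit amounts to, here is a hand-built sketch in plain NumPy. It is a stand-in for the features that `DeterministicProcess` generates, not a reproduction of its API; the helper `quadratic_features` and the toy series are hypothetical.

```python
import numpy as np

# Hypothetical toy series that is exactly quadratic in the time step t = 1..10
t_train = np.arange(1, 11, dtype=float)
y_train = 2.0 + 0.5 * t_train + 0.1 * t_train ** 2

def quadratic_features(t):
    """Columns: constant, t, t^2 (hand-built stand-in for an order-2 trend)."""
    return np.column_stack([np.ones_like(t), t, t ** 2])

# Least squares without a separate intercept: the constant column plays that role,
# which mirrors LinearRegression(fit_intercept=False) on features with a constant
coef, *_ = np.linalg.lstsq(quadratic_features(t_train), y_train, rcond=None)

# Out-of-sample steps continue the time index past the end of the training range
t_fore = np.arange(11, 14, dtype=float)
y_fore = quadratic_features(t_fore) @ coef
print(coef, y_fore)
```

Because the forecast only extrapolates a fitted parabola, it changes smoothly from day to day, which matches the slowly increasing values in the output above.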


In [30]:
trend_errors = []
for store in store_list:
    for family in family_list:
        # split ourtrain and ourtest
        ourtrain, ourtest = split_train_test(date_features(store_family_subsets(store)[family]))
        
        # apply trend model
        y_test = ourtest['sales']
        y_trend_fore = trend(ourtrain, ourtest)
        
        # compute errors
        errors = error(y_trend_fore, y_test)
        trend_errors.append(round(errors, 2))

sum(trend_errors)

1221.5500000000027

### 2) Periodogram

In [33]:
def seasonal(ourtrain, ourtest):    # return y_fore

    test_id = ourtest['id']
    
    # 12 sin/cos pairs for "A"nnual seasonality
    fourier = CalendarFourier(freq="A", order=12)
    season_dp = DeterministicProcess(
        index=ourtrain.index,
        constant=True,               # dummy feature for bias (y-intercept)
        order=1,                     # trend (order 1 means linear)
        seasonal=True,               # weekly seasonality (indicators)
        additional_terms=[fourier],  # annual seasonality (fourier)
        drop=True,                   # drop terms to avoid collinearity
    )

    X = season_dp.in_sample()  # create features for dates in ourtrain.index
    y = ourtrain["sales"]

    season_model = LinearRegression(fit_intercept=False)
    _ = season_model.fit(X, y)

    # Forecasting for 2017-01-01 ~ 2017-08-15
    X_fore = season_dp.out_of_sample(steps=len(ourtest.index))
    y_fore = pd.Series(season_model.predict(X_fore), index=X_fore.index)
    y_fore = pd.concat([y_fore, test_id], axis=1)
    y_fore = y_fore.rename(columns={0: 'sales'})

    return y_fore

In [34]:
seasonal(x, y)

Unnamed: 0,sales,id
2017-01-01,3.155316,2596376
2017-01-02,2.844285,2598158
2017-01-03,2.869616,2599940
2017-01-04,2.789500,2601722
2017-01-05,2.950803,2603504
...,...,...
2017-08-11,3.165873,2991980
2017-08-12,2.940561,2993762
2017-08-13,3.015122,2995544
2017-08-14,2.731969,2997326
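
The annual Fourier terms can be pictured as sine/cosine pairs whose periods are one year, half a year, and so on. Below is an approximate hand-built sketch: it fixes the year length at 365.25 days, and the function `annual_fourier` and its column names are hypothetical, so the values will not match `CalendarFourier` exactly.

```python
import numpy as np
import pandas as pd

def annual_fourier(index, order):
    """sin/cos pairs with periods 1 year, 1/2 year, ..., 1/order year (approximate)."""
    # Position within the year as a fraction, assuming a fixed 365.25-day year
    frac = (np.asarray(index.dayofyear, dtype=float) - 1) / 365.25
    features = {}
    for k in range(1, order + 1):
        features[f"sin_{k}"] = np.sin(2 * np.pi * k * frac)
        features[f"cos_{k}"] = np.cos(2 * np.pi * k * frac)
    return pd.DataFrame(features, index=index)

dates = pd.period_range("2016-01-01", "2016-12-31", freq="D")
X_fourier = annual_fourier(dates, order=12)
print(X_fourier.shape)
```

With `order=12` this yields 24 columns, which a linear model can combine to approximate smooth yearly patterns with far fewer features than one indicator per day of the year.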


In [36]:
seasonal_errors = []
for store in store_list:
    for family in family_list:
        # split ourtrain and ourtest
        ourtrain, ourtest = split_train_test(date_features(store_family_subsets(store)[family]))
        
        # apply seasonal model
        y_test = ourtest['sales']
        y_seasonal_fore = seasonal(ourtrain, ourtest)
        
        # compute errors
        errors = error(y_seasonal_fore, y_test)
        seasonal_errors.append(round(errors, 2))

sum(seasonal_errors)

1184.7700000000004
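
The two sums are easier to compare as an average per series. A small sketch of that arithmetic, using the totals reported above and the 54 stores × 33 families found in the data investigation; since each per-series error was rounded to two decimals before summing, the averages are approximate.

```python
# Totals reported by the two evaluation loops
trend_total = 1221.55
seasonal_total = 1184.77

# 54 stores x 33 product families = 1782 (store, family) series
n_series = 54 * 33

trend_mean = trend_total / n_series
seasonal_mean = seasonal_total / n_series

print(round(trend_mean, 4))      # about 0.6855
print(round(seasonal_mean, 4))   # about 0.6649
```

On this hold-out period the seasonal model has the lower average RMSLE per series.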

### 3) Cycles

In [None]:
def lagplot(x, y=None, lag=1, standardize=False, ax=None, **kwargs):
    from matplotlib.offsetbox import AnchoredText
    x_ = x.shift(lag)
    if standardize:
        x_ = (x_ - x_.mean()) / x_.std()
    if y is not None:
        y_ = (y - y.mean()) / y.std() if standardize else y
    else:
        y_ = x
    corr = y_.corr(x_)
    if ax is None:
        fig, ax = plt.subplots()
    scatter_kws = dict(
        alpha=0.75,
        s=3,
    )
    line_kws = dict(color='C3', )
    ax = sns.regplot(x=x_,
                     y=y_,
                     scatter_kws=scatter_kws,
                     line_kws=line_kws,
                     lowess=True,
                     ax=ax,
                     **kwargs)
    at = AnchoredText(
        f"{corr:.2f}",
        prop=dict(size="large"),
        frameon=True,
        loc="upper left",
    )
    at.patch.set_boxstyle("square, pad=0.0")
    ax.add_artist(at)
    ax.set(title=f"Lag {lag}", xlabel=x_.name, ylabel=y_.name)
    return ax
    

In [None]:
def plot_lags(x, y=None, lags=6, nrows=1, lagplot_kwargs={}, **kwargs):
    import math
    kwargs.setdefault('nrows', nrows)
    kwargs.setdefault('ncols', math.ceil(lags / nrows))
    kwargs.setdefault('figsize', (kwargs['ncols'] * 2, nrows * 2 + 0.5))
    fig, axs = plt.subplots(sharex=True, sharey=True, squeeze=False, **kwargs)
    for ax, k in zip(fig.get_axes(), range(kwargs['nrows'] * kwargs['ncols'])):
        if k + 1 <= lags:
            ax = lagplot(x, y, lag=k + 1, ax=ax, **lagplot_kwargs)
            ax.set_title(f"Lag {k + 1}", fontdict=dict(fontsize=14))
            ax.set(xlabel="", ylabel="")
        else:
            ax.axis('off')
    plt.setp(axs[-1, :], xlabel=x.name)
    plt.setp(axs[:, 0], ylabel=y.name if y is not None else x.name)
    fig.tight_layout(w_pad=0.1, h_pad=0.1)
    return fig

In [None]:
# Partial Autocorrelation
from statsmodels.graphics.tsaplots import plot_pacf

plot_lags(ourtrain.sales, lags=12, nrows=2)
_ = plot_pacf(ourtrain.sales, lags=12)

In [None]:
def make_lags(ts, lags):
    return pd.concat(
        {
            f'y_lag_{i}': ts.shift(i)
            for i in range(1, lags + 1)
        },
        axis=1)
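
A quick illustration of what `make_lags` returns, on a made-up four-day series. The helper is repeated here only so that the sketch runs on its own.

```python
import pandas as pd

def make_lags(ts, lags):
    # Same helper as above: column y_lag_i holds the value from i steps earlier
    return pd.concat(
        {f"y_lag_{i}": ts.shift(i) for i in range(1, lags + 1)},
        axis=1)

sales = pd.Series([10.0, 20.0, 30.0, 40.0], name="sales")  # hypothetical values
X = make_lags(sales, lags=2)
print(X)
```

The first rows have no earlier observations, so they contain `NaN`; filling those with `0.0` treats the missing history as zero sales.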

In [None]:
from sklearn.model_selection import train_test_split

def cycle(ourtrain):    # return y_fore and y_test for the hold-out part of ourtrain

    X = make_lags(ourtrain.sales, lags=7)
    X = X.fillna(0.0)
    y = ourtrain.sales

    # Hold out the last 30% of ourtrain; shuffle=False keeps the time order
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=int(len(X.index)*0.3), shuffle=False)

    ts_model = LinearRegression()  # `fit_intercept=True` since we didn't use DeterministicProcess
    ts_model.fit(X_train, y_train)
    y_fore = pd.Series(ts_model.predict(X_test), index=X_test.index)
    y_fore = pd.concat([y_fore, ourtrain.loc[X_test.index, 'id']], axis=1)   # ids of the hold-out rows
    y_fore = y_fore.rename(columns={0: 'sales'})

    return y_fore, y_test

In [None]:
# Evaluate on the hold-out part of ourtrain (the last series left over from the loops above)
y_cycle_fore, y_cycle_test = cycle(ourtrain)
cycle_error = error(y_cycle_fore, y_cycle_test)
print("Total Error from Cycle model: ", round(cycle_error, 4))

### 4) Hybrid

## 8. Modify test dataset

### - One Hot Encode 'family' features

In [None]:
# Integer encode
label_encoderTEST = LabelEncoder() 
integer_encodedTEST = label_encoderTEST.fit_transform(test['family'])

# Binary encode
onehot_encoderTEST = OneHotEncoder(sparse=False)   
integer_encodedTEST = integer_encodedTEST.reshape(len(integer_encodedTEST), 1)
onehot_encodedTEST = onehot_encoderTEST.fit_transform(integer_encodedTEST)

onehot_encodedTEST = pd.DataFrame(onehot_encodedTEST)
onehot_encodedTEST = onehot_encodedTEST.rename(columns = {0:'AUTOMOTIVE', 1:'BABY CARE', 2:'BEAUTY', 3:'BEVERAGES', 
        4:'BOOKS', 5:'BREAD/BAKERY', 6:'CELEBRATION', 7:'CLEANING', 8:'DAIRY', 9:'DELI', 10:'EGGS',
        11:'FROZEN FOODS', 12:'GROCERY I', 13:'GROCERY II', 14:'HARDWARE',
        15:'HOME AND KITCHEN I', 16:'HOME AND KITCHEN II', 17:'HOME APPLIANCES',
        18:'HOME CARE', 19:'LADIESWEAR', 20:'LAWN AND GARDEN', 21:'LINGERIE',
        22:'LIQUOR,WINE,BEER', 23:'MAGAZINES', 24:'MEATS', 25:'PERSONAL CARE',
        26:'PET SUPPLIES', 27:'PLAYERS AND ELECTRONICS', 28:'POULTRY',
        29:'PREPARED FOODS', 30:'PRODUCE', 31:'SCHOOL AND OFFICE SUPPLIES',
        32:'SEAFOOD'})  # Rename

# Add to test dataset
test = pd.concat([test, onehot_encodedTEST], axis=1) # Combine
test = test.drop(columns=['family'])              # Drop string values
print(test.shape)
test.head()
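
The hard-coded rename above relies on `LabelEncoder` numbering the labels in sorted order. As a shorter alternative, `pd.get_dummies` names the columns after the categories directly. This is a sketch on a made-up frame (`toy_test` and `encoded` are hypothetical names), not the notebook's own method.

```python
import pandas as pd

# Made-up stand-in for the test set
toy_test = pd.DataFrame({
    "id": [0, 1, 2],
    "family": ["BEAUTY", "AUTOMOTIVE", "BEAUTY"],
})

# One indicator column per category, named after the category itself
dummies = pd.get_dummies(toy_test["family"], dtype=float)

# Replace the string column with its indicator columns
encoded = pd.concat([toy_test.drop(columns=["family"]), dummies], axis=1)
print(encoded)
```

Either way, encoding train and test separately produces matching columns only if both contain the same set of families.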

### - One Hot Encode 'store_nbr' features

In [None]:
# Convert to numpy
integer_encodedTEST2 = np.array(test['store_nbr']) 

# Binary encode
onehot_encoderTEST2 = OneHotEncoder(sparse=False)
integer_encodedTEST2 = integer_encodedTEST2.reshape(len(integer_encodedTEST2), 1)
onehot_encodedTEST2 = onehot_encoderTEST2.fit_transform(integer_encodedTEST2)

# Add to test dataset
onehot_encodedTEST2 = pd.DataFrame(onehot_encodedTEST2)
test = pd.concat([test, onehot_encodedTEST2], axis=1) # Combine
test = test.drop(columns=['store_nbr'])           # Drop the original store number column
print(test.shape)
test.head()

In [None]:
# Make 'date' as an index
test = pd.DataFrame(test)
test = test.set_index('date')
test = test.to_period('D')

# Split 'date' into detailed features
test["day"] = test.index.dayofweek
test["week"] = test.index.week
test["dayofyear"] = test.index.dayofyear
test["year"] = test.index.year

print(test.shape)
test.head()

## 9. Proceed forecasting with the best model

In [None]:
# Modify the trend model
def trend_modified(train, test):  

    # Targets
    y_train = train['sales']
    
    # Create features
    trend_dp = DeterministicProcess(
        index=train.index,   # dates from the training data
        constant=True,       # dummy feature for the bias (y_intercept)
        order=2,             # quadratic trend (time dummy and its square)
        drop=True,           # drop terms if necessary to avoid collinearity
    )

    # `in_sample` creates features for the dates given in the `index` argument
    X_train = trend_dp.in_sample()

    # Fit model
    model = LinearRegression(fit_intercept=False)
    model.fit(X_train, y_train)

    # Out of Sample 
    X_oos = trend_dp.out_of_sample(steps=len(test.index))
    y_fore = pd.Series(model.predict(X_oos), index=X_oos.index)

    return y_fore

In [None]:
# train : 2013-01-01 ~ 2017-08-15
# test : 2017-08-16 ~ 2017-08-31

y_fore = trend_modified(train, test)
print(y_fore.shape, type(y_fore))
y_fore

In [None]:
print(test.shape, type(test))

## 10. Generate csv file
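
This step is not implemented yet. A minimal sketch of how it could look, assuming the submission file needs an `id` column and a `sales` column; the forecast values are made up, the `forecast` frame is hypothetical, and an in-memory buffer stands in for a path such as `'submission.csv'`.

```python
import io
import pandas as pd

# Hypothetical forecasts keyed by test-set id (values are made up)
forecast = pd.DataFrame({
    "id": [3000888, 3000889, 3000890],
    "sales": [3.2, -0.4, 12.0],
})

submission = forecast[["id", "sales"]].copy()
# Sales cannot be negative, and RMSLE is undefined below -1, so clip at zero
submission["sales"] = submission["sales"].clip(lower=0)

buffer = io.StringIO()                   # stand-in for a file path
submission.to_csv(buffer, index=False)   # index=False keeps only id and sales
print(buffer.getvalue())
```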