# Kaggle Getting Started Prediction Competition: Store Sales - Time Series Forecasting

In this [competition](https://www.kaggle.com/competitions/store-sales-time-series-forecasting), we use time-series forecasting to predict store sales using data from Corporación Favorita, a large Ecuador-based grocery retailer. The notebook builds on the hands-on exercises from the Kaggle Learn [Time Series course](https://www.kaggle.com/learn/time-series), which teaches how to leverage periodic trends for forecasting and how to combine different models, such as linear regression and XGBoost, to improve forecasts. For the purpose of this tutorial, we focus on forecasting with periodic trends.

Install the kfp package by uncommenting the line below, then restart the kernel. Comment the line out again once the kernel has restarted.

In [1]:
# Install the kfp 
# !pip install kfp --upgrade 

## Imports

The following imports are required to build the Kubeflow pipeline and pass data between its components.

In [2]:
import kfp
from kfp.components import func_to_container_op
import kfp.components as comp
# from typing import NamedTuple

All the packages required inside a pipeline component are collected in a list, which is then passed to each pipeline component. This can become inefficient when there are many packages and dependencies; in that case, you can build a Docker image containing them and pass that image to each pipeline component instead.

In [3]:
# scikit-learn is listed under its PyPI name; the deprecated 'sklearn' alias is omitted
import_packages = ['pandas', 'statsmodels', 'kaggle', 'pyarrow', 'scikit-learn']

In the following implementation of the Kubeflow pipeline, we use [lightweight Python function components](https://www.kubeflow.org/docs/components/pipelines/sdk/python-function-components/) to build the pipeline. Data is passed between component instances (tasks) using `InputPath` and `OutputPath`. Different ways of storing and passing data between components are explored in the rest of this notebook.
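To make the path-based hand-off concrete, here is a minimal sketch that mimics it without kfp. In a real pipeline, Kubeflow generates the file locations and injects them into parameters annotated with `comp.OutputPath()` and `comp.InputPath()`; in this sketch a temporary directory stands in for that mechanism, and the names `produce`, `consume`, and `numbers_path` are hypothetical.

```python
import os
import pickle
import tempfile


def produce(numbers_path: str) -> None:
    """Upstream step: write its result to the path it was given."""
    with open(numbers_path, "wb") as f:
        pickle.dump([1, 2, 3], f)


def consume(numbers_path: str) -> int:
    """Downstream step: read the upstream result from the path it was given."""
    with open(numbers_path, "rb") as f:
        numbers = pickle.load(f)
    return sum(numbers)


# A temporary file plays the role of the artifact location that
# Kubeflow Pipelines would normally provide to both steps.
with tempfile.TemporaryDirectory() as workdir:
    artifact = os.path.join(workdir, "numbers.pkl")
    produce(artifact)
    total = consume(artifact)

print(total)
```

In the components below, the `_path` suffix of such parameters is dropped from the input and output names, which is why the pipeline function refers to `train_data` rather than `train_data_path`.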

The pipeline is divided into five components:

1. Download the data from Kaggle
2. Load the data
3. Create features
4. Train and evaluate the model
5. Forecast sales

## Download data from Kaggle

Follow the prerequisites in the GitHub README.md to create a secret for your Kaggle credentials and mount it to the pod using a PodDefault resource. Once the secret is mounted, you can use it to access the username and key needed to download the files from Kaggle.

In [4]:
def download_kaggle_dataset(path:str)->str:

    import os

    # Retrieve the credentials from the mounted secret and
    # expose them as environment variables
    with open('/secret/kaggle/KAGGLE_KEY', 'r') as file:
        kaggle_key = file.read().rstrip()
    with open('/secret/kaggle/KAGGLE_USERNAME', 'r') as file:
        kaggle_user = file.read().rstrip()
    os.environ['KAGGLE_USERNAME'] = kaggle_user
    os.environ['KAGGLE_KEY'] = kaggle_key

    # Create the target directory if it does not exist and move into it
    os.makedirs(path, exist_ok=True)
    os.chdir(path)

    # Use the Kaggle public API to download the dataset.
    # The import is placed after the credentials are set so that
    # they are available when the kaggle package loads.
    from kaggle.api.kaggle_api_extended import KaggleApi

    api = KaggleApi()
    api.authenticate()

    # Download all competition files as a single zip archive
    api.competition_download_files('store-sales-time-series-forecasting')

    return path

In [5]:
download_data_op = func_to_container_op(download_kaggle_dataset, packages_to_install = import_packages)

## Load the data

In [6]:
def load_data(path:str, train_data_path: comp.OutputPath(), holidays_data_path: comp.OutputPath())->str:
    
    # Imports
    import pandas as pd
    import os
    from zipfile import ZipFile 
    from pyarrow import parquet
    import pickle
    
    # List the downloaded files, then move into the data directory
    print(os.listdir(path))
    os.chdir(path)

    # Extract all files from the competition zip file
    with ZipFile('store-sales-time-series-forecasting.zip', 'r') as zipfile:
        zipfile.extractall()
    
    # Paths to the extracted CSV files
    train_data_filepath = os.path.join(path, "train.csv")
    test_data_filepath = os.path.join(path, "test.csv")
    holidays_filepath = os.path.join(path, "holidays_events.csv")

    # Read the csv files into dataframes
    # Training data
    train_sales = pd.read_csv(train_data_filepath,
        usecols=['store_nbr', 'family', 'date', 'sales'],
        dtype={
            'store_nbr': 'category',
            'family': 'category',
            'sales': 'float32',
        },
        parse_dates=['date'],
        infer_datetime_format=True,
    )
    train_sales['date'] = train_sales.date.dt.to_period('D')
    train_sales = train_sales.set_index(['store_nbr', 'family', 'date']).sort_index()

    # Holiday features dataset
    holidays_events = pd.read_csv(
        holidays_filepath,
        dtype={
            'type': 'category',
            'locale': 'category',
            'locale_name': 'category',
            'description': 'category',
            'transferred': 'bool',
        },
        parse_dates=['date'],
        infer_datetime_format=True,
    )
    holidays_events = holidays_events.set_index('date').to_period('D')

    # Test data (loaded here for inspection; it is not passed to later components)
    df_test = pd.read_csv(
        test_data_filepath,
        dtype={
            'store_nbr': 'category',
            'family': 'category',
            'onpromotion': 'uint32',
        },
        parse_dates=['date'],
        infer_datetime_format=True,
    )
    df_test['date'] = df_test.date.dt.to_period('D')
    df_test = df_test.set_index(['store_nbr', 'family', 'date']).sort_index()
    
    # Serialize the dataframes with pickle for the next component
    with open(train_data_path, 'wb') as f:
        pickle.dump(train_sales, f)
    with open(holidays_data_path, 'wb') as f:
        pickle.dump(holidays_events, f)
  
    return path
    

In [7]:
load_data_op = func_to_container_op(load_data,packages_to_install = import_packages)

## Create features

This component creates three kinds of features:

1. Indicators for weekly seasons
2. Fourier features of order 4 for monthly seasons
3. Holiday features built from the holidays provided in the Store Sales dataset
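As a rough illustration of the first two feature types, the sketch below computes them by hand with the standard library only. It is an approximation of what `DeterministicProcess` and `CalendarFourier` produce, not their implementation: the helper names `weekly_indicators` and `monthly_fourier` are hypothetical, and the choice of Monday as the dropped reference day is an assumption made here for simplicity.

```python
import calendar
import math
from datetime import date


def weekly_indicators(day: date) -> list:
    """One-hot day-of-week indicators with the Monday column dropped."""
    return [1.0 if day.weekday() == k else 0.0 for k in range(1, 7)]


def monthly_fourier(day: date, order: int = 4) -> list:
    """sin/cos pairs for k = 1..order over the position within the month."""
    days_in_month = calendar.monthrange(day.year, day.month)[1]
    position = (day.day - 1) / days_in_month  # 0.0 on the first of the month
    features = []
    for k in range(1, order + 1):
        angle = 2 * math.pi * k * position
        features.extend([math.sin(angle), math.cos(angle)])
    return features


first = date(2017, 8, 1)  # a Tuesday and the first day of its month
print(weekly_indicators(first))
print(monthly_fourier(first))
```

Order 4 yields eight Fourier columns (four sine and cosine pairs), which lets the model capture within-month patterns with far fewer columns than one indicator per day of the month.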

In [8]:
def create_features(path:str, train_data_path: comp.InputPath(), holidays_data_path: comp.InputPath(), 
                    holidays_feat_path:comp.OutputPath(), data_path: comp.OutputPath('ApacheParquet'), 
                    target_data_path: comp.OutputPath(), dp_feat_path: comp.OutputPath())->str:
    
    # Imports
    import pandas as pd
    from pyarrow import parquet
    import pyarrow as pa
    from statsmodels.tsa.deterministic import CalendarFourier, DeterministicProcess
    from sklearn.preprocessing import OneHotEncoder
    import pickle
    
    train_sales = pickle.load(open(train_data_path, 'rb'))
    holidays_events = pickle.load(open(holidays_data_path, 'rb'))
    
    # National and regional holidays of Ecuador in the training set
    # Holiday features
    holidays = (
        holidays_events
        .query("locale in ['National', 'Regional']")
        .loc['2017':'2017-08-15', ['description']]
        .assign(description=lambda x: x.description.cat.remove_unused_categories())
    )
    
    # Create training data features
    y = train_sales.unstack(['store_nbr', 'family']).loc["2017"]

    # Using CalendarFourier to create fourier features 
    fourier = CalendarFourier(freq='M', order=4)

    # Using DeterministicProcess to create indicators for both 
    # weekly and monthly seasons
    dp = DeterministicProcess(
        index=y.index,
        constant=True,
        order=1,
        seasonal=True,               # weekly seasonality (indicators)
        additional_terms=[fourier],  # monthly seasonality (fourier)
        drop=True,
    )

    # `in_sample` creates features for the dates given in the `index` argument
    X = dp.in_sample()

    # One-hot encode the holiday descriptions. `dtype=float` keeps the
    # columns numeric so they join cleanly with the other features.
    X_holidays = pd.get_dummies(holidays, dtype=float)

    # Join holiday features to training data
    X_2 = X.join(X_holidays, on='date').fillna(0.0)
      
    table = pa.Table.from_pandas(X_2)
    parquet.write_table(table, data_path)
    
    pickle.dump(X_holidays, open(holidays_feat_path, 'wb'))
    pickle.dump(y, open(target_data_path, 'wb'))
    pickle.dump(dp, open(dp_feat_path, 'wb'))
    
    return path

In [9]:
create_features_op = func_to_container_op(create_features,packages_to_install = import_packages)

## Train and evaluate the model

In [10]:
def train_and_evaluate_model(path:str, data_path: comp.InputPath('ApacheParquet'), target_data_path: comp.InputPath(), 
                             model_path: comp.OutputPath())->str:
    
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_absolute_error
    from pyarrow import parquet
    import pyarrow as pa
    import pickle
    
    X_2 = parquet.read_pandas(data_path).to_pandas()
    y = pickle.load(open(target_data_path, 'rb'))
    
    # Split the data to train and valid datasets
    X_train, X_valid, y_train, y_valid = train_test_split(X_2, y, test_size=0.1, shuffle=False)

    # Train the model
    model = LinearRegression(fit_intercept=False)
    model.fit(X_train, y_train)

    # Get the training and valid data predictions
    y_train_pred = pd.DataFrame(model.predict(X_train), index=X_train.index, columns=y.columns)
    y_valid_pred = pd.DataFrame(model.predict(X_valid), index=X_valid.index, columns=y.columns)
    # Evaluate the model on the validation set using mean absolute error
    print(mean_absolute_error(y_valid, y_valid_pred))

    # Save the trained model for the forecasting component
    with open(model_path, 'wb') as f:
        pickle.dump(model, f)
    
    return path

In [11]:
train_and_evaluate_model_op = func_to_container_op(train_and_evaluate_model,packages_to_install = import_packages)
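
The component above holds out the last 10% of the time-ordered rows (`shuffle=False`) and reports mean absolute error. The competition itself is scored on root mean squared logarithmic error (RMSLE), so the following standard-library sketch shows both ideas side by side. The helper names `chronological_split` and `rmsle` are hypothetical, and clipping negative predictions to zero is an assumption made here because a linear model can predict negative sales, for which the logarithm is undefined.

```python
import math


def chronological_split(rows, test_size=0.1):
    """Hold out the last `test_size` fraction of time-ordered rows."""
    n_valid = math.ceil(len(rows) * test_size)
    return rows[:-n_valid], rows[-n_valid:]


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, clipping negative predictions to 0."""
    squared = [
        (math.log1p(max(p, 0.0)) - math.log1p(t)) ** 2
        for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared) / len(squared))


days = list(range(1, 21))  # 20 time-ordered observations
train, valid = chronological_split(days)
print(train[-1], valid)

# A perfect prediction and a clipped negative prediction of a zero target
print(rmsle([10.0, 0.0], [10.0, -3.0]))
```

Keeping the validation rows at the end of the series avoids training on observations that come after the ones being predicted.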

## Forecast sales

In [12]:
def forecast_sales(path:str, model_path: comp.InputPath(), holidays_feat_path: comp.InputPath(), 
                   target_data_path: comp.InputPath(), dp_feat_path: comp.InputPath()):
    
    import pandas as pd
    from statsmodels.tsa.deterministic import CalendarFourier, DeterministicProcess
    import pickle
    
    model = pickle.load(open(model_path, 'rb'))
    dp = pickle.load(open(dp_feat_path, 'rb'))
    
    X_holidays = pickle.load(open(holidays_feat_path, 'rb'))
    y = pickle.load(open(target_data_path, 'rb'))
    
    # Create features for the test set.
    # "Out of sample" refers to times outside the observation period of the training data.
    # We forecast the 16 days that follow the last date in the training data.
    test = dp.out_of_sample(steps=16)
    test.index.name = 'date'
    X_test = test.join(X_holidays, on='date').fillna(0.0)
    y_forecast = pd.DataFrame(model.predict(X_test), index=X_test.index, columns=y.columns)
    print(y_forecast)


In [13]:
forecast_sales_op = func_to_container_op(forecast_sales, packages_to_install = import_packages)
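
To see which dates the 16 out-of-sample steps cover, here is a small sketch using only the standard library. It assumes the training window ends on 2017-08-15, the end date used in the holiday slice of the feature component.

```python
from datetime import date, timedelta

last_training_day = date(2017, 8, 15)  # assumed end of the training window
steps = 16

# Each step is one daily period after the last training day
forecast_days = [last_training_day + timedelta(days=i) for i in range(1, steps + 1)]
print(forecast_days[0], forecast_days[-1])
```

Under that assumption the forecast runs from 2017-08-16 through 2017-08-31.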

## Define the function that implements the pipeline

In [14]:
def kfp_pipeline():
    
    # Create a persistent volume that all tasks share
    vop = kfp.dsl.VolumeOp(
        name="create-volume",
        resource_name="store-sales-pvc",
        size="5Gi",
        modes=kfp.dsl.VOLUME_MODE_RWM,
    )

    # The pod label marks this task for the PodDefault that mounts the Kaggle secret
    download_task = (
        download_data_op("/mnt/data/")
        .add_pvolumes({"/mnt": vop.volume})
        .add_pod_label("kaggle-secret", "true")
    )
    load_data_task = load_data_op(download_task.output).add_pvolumes({"/mnt": vop.volume})
    create_features_task = create_features_op(
        path=load_data_task.outputs['Output'],
        train_data=load_data_task.outputs['train_data'],
        holidays_data=load_data_task.outputs['holidays_data'],
    ).add_pvolumes({"/mnt": vop.volume})
    train_and_evaluate_model_task = train_and_evaluate_model_op(
        path=create_features_task.outputs['Output'],
        data=create_features_task.outputs['data'],
        target_data=create_features_task.outputs['target_data'],
    ).add_pvolumes({"/mnt": vop.volume})
    forecast_sales_task = forecast_sales_op(
        path=train_and_evaluate_model_task.outputs['Output'],
        model=train_and_evaluate_model_task.outputs['model'],
        holidays_feat=create_features_task.outputs['holidays_feat'],
        target_data=create_features_task.outputs['target_data'],
        dp_feat=create_features_task.outputs['dp_feat'],
    ).add_pvolumes({"/mnt": vop.volume})

In [15]:
# Using kfp.Client() to run the pipeline from notebook itself
client = kfp.Client() # change arguments accordingly

# Running the pipeline
client.create_run_from_pipeline_func(
    kfp_pipeline,
    arguments={},
)

RunPipelineResult(run_id=db0cbcb3-9e78-4542-b4d5-7e061e170af7)