**Dataset Description**
In this competition, you will predict sales for the thousands of product families sold at Favorita stores located in Ecuador. The training data includes dates, store and product information, whether that item was being promoted, as well as the sales numbers. Additional files include supplementary information that may be useful in building your models.

**File Descriptions and Data Field Information**


  ***train.csv***
         
          The training data, comprising time series of the features store_nbr, family, and onpromotion, as well as the target sales.
          store_nbr identifies the store at which the products are sold
          family identifies the type of product sold
          sales gives the total sales for a product family at a particular store at a given date. Fractional values are possible since products can be sold in fractional units (1.5 kg of cheese, for instance, as opposed to 1 bag of chips).

  ***test.csv***

      The test data, having the same features as the training data. You will predict the target sales for the dates in this file.
      The dates in the test data are for the 15 days after the last date in the training data.

  ***sample_submission.csv***

      A sample submission file in the correct format.

  ***stores.csv***

     Store metadata, including city, state, type, and cluster
     cluster is a grouping of similar stores

  ***oil.csv***

     Daily oil price. Includes values during both the train and test data timeframes. (Ecuador is an oil-dependent country and its economic health is highly vulnerable to shocks in oil prices.)

  ***holidays_events.csv***
      
      Holidays and Events, with metadata
      
      NOTE: Pay special attention to the transferred column. A holiday that is transferred officially falls on that calendar day, but was moved to another date by the government. A transferred day is more like a normal day than a holiday. To find the day on which it was actually celebrated, look for the corresponding row where type is Transfer. For example, the holiday Independencia de Guayaquil was transferred from 2012-10-09 to 2012-10-12, which means it was celebrated on 2012-10-12. Days of type Bridge are extra days added to a holiday (e.g., to extend the break across a long weekend). These are frequently made up by a day of type Work Day, which is a day not normally scheduled for work (e.g., Saturday) that is meant to pay back the Bridge.
       
       Additional holidays are days added to a regular calendar holiday, as typically happens around Christmas (making Christmas Eve a holiday).
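
The rules in the note above can be sketched in pandas. This is a minimal illustration on a toy frame, not the real file: the `date` and `description` column names and every row except the Independencia de Guayaquil pair are assumptions made for the example, and `celebrated_days` is a hypothetical helper name.

```python
import pandas as pd

# Toy rows shaped like holidays_events.csv; only the first two dates come
# from the note above, the remaining rows are made up for illustration.
holidays = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2012-10-09", "2012-10-12", "2012-12-24", "2013-01-05"]
        ),
        "type": ["Holiday", "Transfer", "Additional", "Work Day"],
        "description": [
            "Independencia de Guayaquil",
            "Independencia de Guayaquil (moved)",
            "Christmas Eve",
            "Make-up working day",
        ],
        "transferred": [True, False, False, False],
    }
)


def celebrated_days(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the rows for days that were actually observed as days off."""
    # A transferred holiday behaves like a normal day on its original date.
    observed = df[~df["transferred"]]
    # A Work Day is a make-up working day, so it is not a day off either.
    observed = observed[observed["type"] != "Work Day"]
    return observed.reset_index(drop=True)


days_off = celebrated_days(holidays)
print(days_off[["date", "type"]])
```

The resulting frame could then be joined to the sales data on the date column to build a holiday indicator.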

  **Additional Notes**
    
     Wages in the public sector are paid every two weeks, on the 15th and on the last day of the month. Supermarket sales could be affected by this.

     A magnitude 7.8 earthquake struck Ecuador on April 16, 2016. People rallied in relief efforts, donating water and other basic necessities, which greatly affected supermarket sales for several weeks after the earthquake.
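
Both notes above translate naturally into calendar features. The sketch below is one possible encoding, not part of the competition material: `add_calendar_flags` is a hypothetical helper, the `date` column name is assumed, and the four-week earthquake window is an arbitrary stand-in for "several weeks".

```python
import pandas as pd


def add_calendar_flags(df, date_col="date", quake_date="2016-04-16", quake_weeks=4):
    """Add payday and post-earthquake indicator columns to a copy of df."""
    out = df.copy()
    dates = pd.to_datetime(out[date_col])
    # Public-sector wages are paid on the 15th and on the last day of the month.
    out["is_payday"] = (dates.dt.day == 15) | dates.dt.is_month_end
    # Assumption: the earthquake effect is modeled as a fixed window of weeks.
    start = pd.Timestamp(quake_date)
    end = start + pd.Timedelta(weeks=quake_weeks)
    out["post_earthquake"] = (dates >= start) & (dates < end)
    return out


sample = pd.DataFrame(
    {"date": ["2016-04-15", "2016-04-16", "2016-04-30", "2016-06-01"]}
)
flags = add_calendar_flags(sample)
print(flags)
```

Such flags could later be passed to a forecasting model as covariates; the window length is worth tuning rather than trusting the default used here.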




In [None]:
import pandas as pd
# pandas_profiling is imported after it has been installed below

In [None]:
! pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip

In [None]:
from pandas_profiling import ProfileReport


In [None]:
oil=pd.read_csv('/content/oil.csv')
sample_submission=pd.read_csv('/content/sample_submission.csv')
stores=pd.read_csv('/content/stores.csv')
test=pd.read_csv('/content/test.csv')
train=pd.read_csv('/content/train.csv')
transactions=pd.read_csv('/content/transactions.csv')
holidays_events=pd.read_csv('/content/holidays_events.csv')

In [None]:
profile = ProfileReport(oil , html = {'style' : {'full_width':True}})
profile.to_file(output_file="report.html")
profile.to_notebook_iframe()

In [None]:
profile = ProfileReport(train , html = {'style' : {'full_width':True}})
profile.to_file(output_file="report.html")
profile.to_notebook_iframe()

In [None]:
profile = ProfileReport(stores , html = {'style' : {'full_width':True}})
profile.to_file(output_file="report.html")
profile.to_notebook_iframe()

In [None]:
import os
# Kill the kernel process to force a runtime restart; all variables are lost
os._exit(0)

In [None]:
sample_submission.head(20)

In [None]:
profile = ProfileReport( transactions , html = {'style' : {'full_width':True}})
profile.to_file(output_file="report.html")
profile.to_notebook_iframe()

In [None]:
#merge training  data with stores 
train_merged = pd.merge(train , stores , on='store_nbr')
train_merged = train_merged.astype({'store_nbr' : 'str' ,'family' : 'str' ,'city' :'str' ,'state':'str' , 'type': 'str' , 'cluster' :'str'})

# transactions data
transactions_pivoted = pd.pivot_table(transactions , values ='transactions' , index='date' , columns = ['store_nbr']).reset_index().rename_axis(None , axis =1)
transactions_pivoted = transactions_pivoted.rename(columns = {transactions_pivoted.columns[0] : 'date'})
# test data 
test_dropped = test.drop(['onpromotion'] , axis = 1) 
test_dropped = test_dropped.sort_values(by = ['store_nbr' , 'family'])

In [None]:
train_merged.head()

In [None]:
test_dropped.head()

In [None]:
transactions_pivoted.head()

In [None]:
! pip install Darts

In [None]:
# Create Darts-specific TimeSeries Objects
import darts
import numpy as np
 # Transactions
transactions_TS = darts.TimeSeries.from_dataframe(transactions_pivoted, 
                                            time_col = 'date',
                                            fill_missing_dates=True, 
                                            freq='D',
                                            fillna_value = 0)
transactions_TS = transactions_TS.astype(np.float32)

In [None]:
transactions_pivoted

In [None]:
transactions_TS[1]

In [None]:
# Training Data

train_sequence = darts.TimeSeries.from_group_dataframe(
    train_merged,
    time_col="date",
    group_cols=["store_nbr","family"],  # individual time series are extracted by grouping `df` by `group_cols`
    static_cols=["city","state","type","cluster"], # also extract these additional columns as static covariates
    value_cols="sales",
    fill_missing_dates=True,
    freq='D')

for i in range(0,len(train_sequence)):
    train_sequence[i] = train_sequence[i].astype(np.float32)
    
train_sequence = sorted(train_sequence, key=lambda ts: int(float(ts.static_covariates_values()[0,0])))



In [None]:
train_merged[train_merged['cluster'].astype(np.float32) == 13]

In [None]:
train_merged

In [None]:
! pip install matplotlib
import matplotlib.pyplot as plt

In [None]:
# Let's plot two of the 1782 TimeSeries

# Static covariates are ordered store_nbr, family, city, ...
plt.figure(figsize=(15, 6))
plt.subplot(1, 2, 1) # left panel: full history
train_sequence[5].plot(label='Sales for {}, store {} in {}'.format(train_sequence[5].static_covariates_values()[0,1], 
                                                train_sequence[5].static_covariates_values()[0,0],
                                                train_sequence[5].static_covariates_values()[0,2]))

train_sequence[600].plot(label='Sales for {}, store {} in {}'.format(train_sequence[600].static_covariates_values()[0,1], 
                                                train_sequence[600].static_covariates_values()[0,0],
                                                train_sequence[600].static_covariates_values()[0,2]))

plt.title("Two Out Of 1782 TimeSeries")
           
plt.subplot(1, 2, 2) # right panel: last year only
train_sequence[5][-365:].plot(label='Sales for {}, store {} in {}'.format(train_sequence[5].static_covariates_values()[0,1], 
                                                train_sequence[5].static_covariates_values()[0,0],
                                                train_sequence[5].static_covariates_values()[0,2]))

train_sequence[600][-365:].plot(label='Sales for {}, store {} in {}'.format(train_sequence[600].static_covariates_values()[0,1], 
                                                train_sequence[600].static_covariates_values()[0,0],
                                                train_sequence[600].static_covariates_values()[0,2]))

plt.title("Only The Last 365 Days")

plt.show()


In [None]:
train_sequence[50].static_covariates_values()

In [None]:
from darts.utils.statistics import plot_acf, check_seasonality
from darts.utils.missing_values import fill_missing_values

plot_acf(fill_missing_values(train_sequence[5]), m=7, alpha=0.05)
plt.title("{}, store {} in {}".format(train_sequence[5].static_covariates_values()[0,1], 
                                                train_sequence[5].static_covariates_values()[0,0],
                                                train_sequence[5].static_covariates_values()[0,2]))

plot_acf(fill_missing_values(train_sequence[600]), alpha=0.05)
plt.title("{}, store {} in {}".format(train_sequence[600].static_covariates_values()[0,1], 
                                                train_sequence[600].static_covariates_values()[0,0],
                                                train_sequence[600].static_covariates_values()[0,2]))

As we can see, the BREAD/BAKERY series displays strong weekly seasonality, as we would expect. The CELEBRATION series, however, has a much less clear seasonal pattern.
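
To make the idea of weekly seasonality concrete, here is a small NumPy sketch of what an autocorrelation at lag 7 measures. It uses synthetic data rather than the competition series, and `autocorr` is a hypothetical helper written for this illustration; it is not the Darts `plot_acf` implementation.

```python
import numpy as np


def autocorr(x, lag):
    """Sample autocorrelation of x at a positive integer lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    # Covariance between the series and its lagged copy, normalized by
    # the total (lag-0) sum of squares.
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))


rng = np.random.default_rng(0)
days = np.arange(364)

# Synthetic series with a 7-day cycle plus a little noise.
weekly = 10 + 3 * np.sin(2 * np.pi * days / 7) + rng.normal(0, 0.5, days.size)
# Synthetic series with no cycle at all.
noise = rng.normal(10, 1, days.size)

print("weekly series, lag 7:", round(autocorr(weekly, 7), 3))
print("noise series,  lag 7:", round(autocorr(noise, 7), 3))
```

A series with a weekly cycle yields a lag-7 autocorrelation close to 1, while a series without one stays near 0, which is the contrast visible between the two ACF plots.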

In the next step, we encode the static covariates, fill missing values, and apply a log-transformation followed by 0-1 scaling to all series:

In [None]:
from darts import TimeSeries
from darts.models import NaiveSeasonal, ExponentialSmoothing, Prophet
from darts.metrics import rmsle
from darts.utils.timeseries_generation import datetime_attribute_timeseries
from darts.dataprocessing.transformers import Scaler,MissingValuesFiller,StaticCovariatesTransformer,InvertibleMapper
from darts.dataprocessing import Pipeline
from tqdm import tqdm

import sklearn
from sklearn import preprocessing

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import gc

%matplotlib inline

In [None]:
## Pre-Processing

# Encode Static Covariates

static_cov_transformer = StaticCovariatesTransformer(transformer_cat = sklearn.preprocessing.OrdinalEncoder()) #OneHot would be better, but takes much longer
static_cov_transformed = static_cov_transformer.fit_transform(train_sequence)

for i, (ts, ts_scaled) in enumerate(zip(train_sequence[32:33], static_cov_transformed[32:33])):
    print(f"Original series {i}")
    print(ts.static_covariates)
    print(f"Transformed series {i}")
    print(ts_scaled.static_covariates)
    print("")
    


# Fill missing values, log-transform, then scale each series to [0, 1]
train_filler = MissingValuesFiller(verbose=False, n_jobs=-1, name="Faster Filler")

log_transformer = InvertibleMapper(np.log1p, np.expm1, verbose=False, n_jobs=-1, name="Faster Log")

train_scaler = Scaler(verbose=False, n_jobs=-1, name="Faster Scaler")

train_pipeline = Pipeline([train_filler,
                           log_transformer,
                           train_scaler])
training_transformed = train_pipeline.fit_transform(static_cov_transformed)

# Plot the Transformed Series

plt.figure(figsize=(15, 6))
plt.subplot(1, 2, 1) # left panel: full history
training_transformed[5].plot(label='Sales for {}, store {} in {}'.format(train_sequence[5].static_covariates_values()[0,1], 
                                                train_sequence[5].static_covariates_values()[0,0],
                                                train_sequence[5].static_covariates_values()[0,2]))

plt.title("TimeSeries After Scaling and Log-Transform")
           
plt.subplot(1, 2, 2) # right panel: last year only
training_transformed[5][-365:].plot(label='Sales for {}, store {} in {}'.format(train_sequence[5].static_covariates_values()[0,1], 
                                                train_sequence[5].static_covariates_values()[0,0],
                                                train_sequence[5].static_covariates_values()[0,2]))

plt.title("Only The Last 365 Days")
plt.show()

plt.figure(figsize=(15, 6))
plt.subplot(1, 2, 1) # left panel: full history
training_transformed[600].plot(label='Sales for {}, store {} in {}'.format(train_sequence[600].static_covariates_values()[0,1], 
                                                train_sequence[600].static_covariates_values()[0,0],
                                                train_sequence[600].static_covariates_values()[0,2]))

plt.title("TimeSeries After Scaling and Log-Transform")
           
plt.subplot(1, 2, 2) # right panel: last year only
training_transformed[600][-365:].plot(label='Sales for {}, store {} in {}'.format(train_sequence[600].static_covariates_values()[0,1], 
                                                train_sequence[600].static_covariates_values()[0,0],
                                                train_sequence[600].static_covariates_values()[0,2]))

plt.title("Only The Last 365 Days")
plt.show()
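
The log-transform and 0-1 scaling applied above must be undone before predictions can be read as sales. The following is a plain NumPy sketch of that round trip on a made-up array; it assumes min-max scaling per series and is not the Darts `Pipeline` code itself.

```python
import numpy as np

# Made-up sales values, including a zero and a fractional quantity.
sales = np.array([0.0, 1.5, 10.0, 250.0])

# Forward: log1p handles zeros safely, then min-max scale to [0, 1].
logged = np.log1p(sales)
lo, hi = logged.min(), logged.max()
scaled = (logged - lo) / (hi - lo)

# Inverse: undo the scaling first, then undo the log with expm1.
restored = np.expm1(scaled * (hi - lo) + lo)

print("scaled:  ", np.round(scaled, 3))
print("restored:", np.round(restored, 3))
```

The inverse steps run in the opposite order of the forward steps, which is also how an invertible pipeline would be expected to map forecasts back to the original units.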