<a href="https://colab.research.google.com/github/imadksiddiqui/Sales-Prediction/blob/main/store_sales_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%pip install -Uq upgini catboost


In [None]:
from os.path import exists
import pandas as pd

# Use the local copy if it has already been downloaded; otherwise read it from GitHub
df_path = "train.csv.zip" if exists("train.csv.zip") else "https://github.com/upgini/upgini/raw/main/notebooks/train.csv.zip"
df = pd.read_csv(df_path)
df = df.sample(n=10_000, random_state=0)
df["store"] = df["store"].astype(str)
df["item"] = df["item"].astype(str)

df["date"] = pd.to_datetime(df["date"])

df.sort_values("date", inplace=True)
df.reset_index(inplace=True, drop=True)
df.head()


,date,store,item,sales
0,2013-01-01,3,12,38
1,2013-01-01,4,9,19
2,2013-01-01,10,21,33
3,2013-01-01,3,27,11
4,2013-01-01,2,3,19


Rows dated before 2017-01-01 are used to train the model; rows from 2017 are held out to test it.

In [None]:
train = df[df["date"] < "2017-01-01"]
test = df[df["date"] >= "2017-01-01"]
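A minimal sketch of the same date-based split on a toy frame, to make the boundary behaviour explicit. The `toy` frame and its values are made up for illustration; only the cutoff date comes from the cell above.

```python
import pandas as pd

# Hypothetical toy frame standing in for the sampled sales data
toy = pd.DataFrame({
    "date": pd.to_datetime(["2016-12-30", "2016-12-31", "2017-01-01", "2017-01-02"]),
    "sales": [10, 12, 11, 13],
})

cutoff = pd.Timestamp("2017-01-01")
toy_train = toy[toy["date"] < cutoff]   # strictly before the cutoff
toy_test = toy[toy["date"] >= cutoff]   # the cutoff day itself goes to the test set

# Every row lands in exactly one part, and training data precedes test data
assert len(toy_train) + len(toy_test) == len(toy)
assert toy_train["date"].max() < toy_test["date"].min()
```

Splitting on a date rather than at random keeps future rows out of the training set, which is what a forecasting evaluation needs.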

In [None]:
train_features = train.drop(columns=["sales"])
train_target = train["sales"]
test_features = test.drop(columns=["sales"])
test_target = test["sales"]

Use Upgini to search external data sources for features relevant to the target, using the date column as the search key.

In [None]:
from upgini import FeaturesEnricher, SearchKey
from upgini.metadata import CVType

enricher = FeaturesEnricher(
    search_keys= {
        "date": SearchKey.DATE,
    },
    cv = CVType.time_series
)
enricher.fit(train_features,
             train_target,
             eval_set=[(test_features, test_target)])


Detected task type: ModelTaskType.REGRESSION


Column name,Status,Description
date,All valid,All values in this column are good to go
target,All valid,All values in this column are good to go


Running search request with search_id=93739408-7661-4f38-a758-4321b2d150f9
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com
Done

12 relevant feature(s) found with the search keys: ['date']


provider,source,feature name,shap value,coverage %,type,feature type
,,item,0.466819,100.0,categorical,
,,store,0.160156,100.0,categorical,
Upgini,Public/Comm. shared,f_weather_date_weather_pca_0_d7e0a1fc,0.046755,100.0,numerical,Free
Upgini,Public/Comm. shared,f_events_date_week_sin1_847b5db1,0.038227,100.0,numerical,Free
Upgini,Public/Comm. shared,f_weather_date_weather_umap_33_89bb7578,0.026237,100.0,numerical,Free
Upgini,Public/Comm. shared,f_weather_date_weather_umap_48_b39cd0c4,0.024478,100.0,numerical,Free
Upgini,Public/Comm. shared,f_events_date_week_cos1_f6a8c1fc,0.019963,100.0,numerical,Free
Upgini,Public/Comm. shared,f_events_date_year_cos1_9014a856,0.015023,100.0,numerical,Free
Upgini,Public/Comm. shared,f_weather_date_weather_umap_24_2e14c9a6,0.014776,100.0,numerical,Free
Upgini,Public/Comm. shared,f_financial_date_dow_jones_fe02128f,0.007141,100.0,numerical,Free
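One way to picture enrichment on a date search key is a left join between the input rows and an external table that has one row per calendar date. The sketch below is only an analogy in pandas, not Upgini's implementation; the `external` table and its `temperature_c` column are hypothetical.

```python
import pandas as pd

# A few input rows shaped like train_features
features = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-01", "2013-01-01", "2013-01-02"]),
    "store": ["3", "4", "3"],
    "item": ["12", "9", "12"],
})

# Hypothetical external source: one row per calendar date
external = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-01", "2013-01-02"]),
    "temperature_c": [4.5, 6.0],
})

# Left join keeps every input row; validate guards against duplicate dates in the external table
enriched = features.merge(external, on="date", how="left", validate="many_to_one")

# Share of input rows that received a value, comparable in spirit to the "coverage %" column
coverage = enriched["temperature_c"].notna().mean() * 100
```

Because the join key is the date alone, every store and item sold on the same day receives the same external values.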


In [None]:
from catboost import CatBoostRegressor
from catboost.utils import eval_metric

model = CatBoostRegressor(verbose=False, allow_writing_files=False, random_state=0)
enricher.calculate_metrics(
    train_features, train_target,
    eval_set=[(test_features, test_target)],
    estimator = model,
    scoring = "mean_absolute_percentage_error"
)

Calculating metrics...
Done


,match_rate,baseline mean_absolute_percentage_error,enriched mean_absolute_percentage_error,uplift
train,100.0,0.257526,0.158546,0.098981
eval 1,100.0,0.2732,0.186723,0.086477
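The uplift column in the table above is consistent with a plain difference between the baseline and enriched errors. The short check below reproduces it from the reported values; the relative improvement is an extra derived figure, not something `calculate_metrics` reports.

```python
# MAPE values copied from the metrics table above
baseline_mape = {"train": 0.257526, "eval 1": 0.2732}
enriched_mape = {"train": 0.158546, "eval 1": 0.186723}

# Absolute uplift: how much the error dropped after enrichment
uplift = {k: round(baseline_mape[k] - enriched_mape[k], 6) for k in baseline_mape}

# Relative improvement in percent of the baseline error (derived here for illustration)
relative_pct = {k: round(100 * uplift[k] / baseline_mape[k], 1) for k in baseline_mape}

print(uplift)
print(relative_pct)
```

The differences match the reported uplift to within rounding of the displayed values.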


In [None]:
enriched_train_features = enricher.transform(train_features, keep_input=True)
enriched_test_features = enricher.transform(test_features, keep_input=True)
enriched_train_features.head()



Column name,Status,Description
date,All valid,All values in this column are good to go


Running search request with search_id=2402c77d-3b87-4e30-8980-fa4e7e297862
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com
Done

Collecting selected features...
Done


Column name,Status,Description
date,All valid,All values in this column are good to go


Running search request with search_id=f7f8787f-ce5c-4417-b9ff-7d467868963f
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com


The next two cells compare test-set SMAPE (lower is better) for the baseline model, trained on the original features, with the same model trained on the enriched features.

In [None]:
model.fit(train_features, train_target)
preds = model.predict(test_features)
eval_metric(test_target.values, preds, "SMAPE")

[39.56447289695444]

In [None]:
model.fit(enriched_train_features, train_target)
enriched_preds = model.predict(enriched_test_features)
eval_metric(test_target.values, enriched_preds, "SMAPE")

[14.965073430366601]
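For reference, a minimal NumPy sketch of the symmetric mean absolute percentage error as it is commonly defined, on a 0 to 200 scale. It is assumed, not verified here, to match what CatBoost's `eval_metric` computes for "SMAPE"; the `smape` helper and the handling of zero denominators are illustrative choices.

```python
import numpy as np

def smape(actual, predicted):
    """Symmetric MAPE in percent: 100 * mean(|a - p| / ((|a| + |p|) / 2))."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    denom = (np.abs(actual) + np.abs(predicted)) / 2.0
    # Assumption: a pair where both values are zero contributes zero error
    ratio = np.divide(
        np.abs(actual - predicted),
        denom,
        out=np.zeros_like(denom),
        where=denom != 0,
    )
    return 100.0 * ratio.mean()

print(smape([100, 200], [110, 180]))
```

Note that this metric differs from the mean_absolute_percentage_error used in `calculate_metrics`, so the two sets of numbers are not directly comparable.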