<a href="https://colab.research.google.com/github/jonkarrer/sales-forecasting-ml/blob/main/Sales_Time_Series.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Install Kaggle

In [1]:
!pip install kaggle



Sign in to Kaggle using Colab secrets

In [5]:
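import json
from pathlib import Path

from google.colab import userdata

# Read the Kaggle credentials stored as Colab secrets
kaggle_username = userdata.get('KAGGLE_USERNAME')
kaggle_key = userdata.get('KAGGLE_KEY')
# json.dumps escapes any special characters in the values
creds = json.dumps({"username": kaggle_username, "key": kaggle_key})

# Write the credentials where the Kaggle client looks for them
cred_path = Path('~/.kaggle/kaggle.json').expanduser()
if not cred_path.exists():
    cred_path.parent.mkdir(exist_ok=True)
    cred_path.write_text(creds)
    cred_path.chmod(0o600)  # owner read/write only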


Grab the dataset from Kaggle

In [12]:
import zipfile
from os import path

import pandas as pd
from kaggle import api

COMPETITION_NAME = "store-sales-time-series-forecasting"

# Download via the Python API only if the data has not been extracted yet
if not path.exists(COMPETITION_NAME):
    api.competition_download_cli(COMPETITION_NAME)

# CLI equivalent of the call above; it skips the download when a local copy exists
!kaggle competitions download -c store-sales-time-series-forecasting

# Unzip the dataset
with zipfile.ZipFile(f"{COMPETITION_NAME}.zip", "r") as zip_ref:
    zip_ref.extractall(f"{COMPETITION_NAME}")

# Load the training table and preview the first rows
data_path = f"{COMPETITION_NAME}/train.csv"
df = pd.read_csv(data_path)
print(df.head())


store-sales-time-series-forecasting.zip: Skipping, found more recently modified local copy (use --force to force download)
   id        date  store_nbr      family  sales  onpromotion
0   0  2013-01-01          1  AUTOMOTIVE    0.0            0
1   1  2013-01-01          1   BABY CARE    0.0            0
2   2  2013-01-01          1      BEAUTY    0.0            0
3   3  2013-01-01          1   BEVERAGES    0.0            0
4   4  2013-01-01          1       BOOKS    0.0            0
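
The cell above extracts the archive on every run. A minimal sketch of skipping that step when the target folder is already populated is shown here; `extract_once` is a hypothetical helper, not part of the Kaggle client or of this notebook.

```python
import zipfile
from pathlib import Path


def extract_once(zip_path, target_dir):
    """Extract zip_path into target_dir unless target_dir already contains files.

    Returns True if extraction ran and False if it was skipped.
    (Hypothetical helper, written for illustration only.)
    """
    target = Path(target_dir)
    if target.is_dir() and any(target.iterdir()):
        return False  # already extracted on a previous run
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(target)
    return True
```

In this notebook it could be called as `extract_once(f"{COMPETITION_NAME}.zip", COMPETITION_NAME)`.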


## Understanding the Data

In [14]:
tables_in_data = ["train.csv", "oil.csv", "stores.csv", "holidays_events.csv", "transactions.csv", "test.csv", "sample_submission.csv"]
for table in tables_in_data:
    print(table)
    df = pd.read_csv(f"{COMPETITION_NAME}/{table}")
    print(df.head())

train.csv
   id        date  store_nbr      family  sales  onpromotion
0   0  2013-01-01          1  AUTOMOTIVE    0.0            0
1   1  2013-01-01          1   BABY CARE    0.0            0
2   2  2013-01-01          1      BEAUTY    0.0            0
3   3  2013-01-01          1   BEVERAGES    0.0            0
4   4  2013-01-01          1       BOOKS    0.0            0
oil.csv
         date  dcoilwtico
0  2013-01-01         NaN
1  2013-01-02       93.14
2  2013-01-03       92.97
3  2013-01-04       93.12
4  2013-01-07       93.20
stores.csv
   store_nbr           city                           state type  cluster
0          1          Quito                       Pichincha    D       13
1          2          Quito                       Pichincha    D       13
2          3          Quito                       Pichincha    D        8
3          4          Quito                       Pichincha    D        9
4          5  Santo Domingo  Santo Domingo de los Tsachilas    D        4
holid

### Training Set

In [15]:
df = pd.read_csv(f"{COMPETITION_NAME}/train.csv", low_memory=False)
df.columns

Index(['id', 'date', 'store_nbr', 'family', 'sales', 'onpromotion'], dtype='object')

The training data comprises time series of the features id, date, store_nbr, family, and onpromotion, as well as the target, sales.

- store_nbr identifies the store at which the products are sold.
- family identifies the type of product sold.
- sales gives the total sales for a product family at a particular store on a given date. Fractional values are possible since products can be sold in fractional units (1.5 kg of cheese, for instance, as opposed to 1 bag of chips).
- onpromotion gives the total number of items in a product family that were being promoted at a store on a given date.
- date is when the sales occurred.
- id is the row identifier.
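
Because each store and family pair forms its own time series, it can help to parse `date` as a datetime and pull out a single series. The sketch below does this on the five preview rows printed earlier, loaded from an in-memory string so it runs without the download; the variable names are illustrative only.

```python
import io

import pandas as pd

# First rows of train.csv as shown in the preview; the real file is much larger.
sample_csv = io.StringIO(
    "id,date,store_nbr,family,sales,onpromotion\n"
    "0,2013-01-01,1,AUTOMOTIVE,0.0,0\n"
    "1,2013-01-01,1,BABY CARE,0.0,0\n"
    "2,2013-01-01,1,BEAUTY,0.0,0\n"
    "3,2013-01-01,1,BEVERAGES,0.0,0\n"
    "4,2013-01-01,1,BOOKS,0.0,0\n"
)

# parse_dates converts the date strings into datetime64 values.
train = pd.read_csv(sample_csv, parse_dates=["date"])

# Select one time series: a single family at a single store, indexed by date.
mask = (train["store_nbr"] == 1) & (train["family"] == "BOOKS")
books_store_1 = train.loc[mask].set_index("date")["sales"]
print(books_store_1)
```

On the full file, the same `parse_dates=["date"]` argument can be passed to the `pd.read_csv` call that loads `train.csv`.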

In [16]:
df.nunique()

id             3000888
date              1684
store_nbr           54
family              33
sales           379610
onpromotion        362


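The store number and family appear to define the structure of this table: 54 stores × 33 families gives 1,782 store-family series, and 1,782 × 1,684 dates = 3,000,888, which matches the number of unique ids. In other words, there seems to be exactly one row per date, store, and family.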
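
One way to check whether the table really has one row per (date, store_nbr, family) combination is to compare the row count with the product of the unique counts and confirm that there are no duplicate keys. This is a sketch on a toy frame; `is_complete_panel` is a hypothetical helper name.

```python
import pandas as pd


def is_complete_panel(df):
    """Return True if every (date, store_nbr, family) combination appears exactly once."""
    n_expected = (
        df["date"].nunique() * df["store_nbr"].nunique() * df["family"].nunique()
    )
    no_duplicates = not df.duplicated(["date", "store_nbr", "family"]).any()
    return no_duplicates and len(df) == n_expected


# Toy frame: 2 dates x 2 stores x 2 families = 8 rows.
toy = pd.DataFrame(
    [
        (d, s, f)
        for d in ["2013-01-01", "2013-01-02"]
        for s in [1, 2]
        for f in ["BOOKS", "DAIRY"]
    ],
    columns=["date", "store_nbr", "family"],
)
print(is_complete_panel(toy))            # True
print(is_complete_panel(toy.iloc[:-1]))  # False: one combination is missing
```

Calling `is_complete_panel(df)` on the full training frame would run the same check on the real data.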

In [22]:
print("** Store Number")
print(df["store_nbr"].unique())

print("** Family")
print(df["family"].unique())

** Store Number
[ 1 10 11 12 13 14 15 16 17 18 19  2 20 21 22 23 24 25 26 27 28 29  3 30
 31 32 33 34 35 36 37 38 39  4 40 41 42 43 44 45 46 47 48 49  5 50 51 52
 53 54  6  7  8  9]
** Family


array(['AUTOMOTIVE', 'BABY CARE', 'BEAUTY', 'BEVERAGES', 'BOOKS',
       'BREAD/BAKERY', 'CELEBRATION', 'CLEANING', 'DAIRY', 'DELI', 'EGGS',
       'FROZEN FOODS', 'GROCERY I', 'GROCERY II', 'HARDWARE',
       'HOME AND KITCHEN I', 'HOME AND KITCHEN II', 'HOME APPLIANCES',
       'HOME CARE', 'LADIESWEAR', 'LAWN AND GARDEN', 'LINGERIE',
       'LIQUOR,WINE,BEER', 'MAGAZINES', 'MEATS', 'PERSONAL CARE',
       'PET SUPPLIES', 'PLAYERS AND ELECTRONICS', 'POULTRY',
       'PREPARED FOODS', 'PRODUCE', 'SCHOOL AND OFFICE SUPPLIES',
       'SEAFOOD'], dtype=object)

### Stores

- stores.csv holds store metadata: city, state, type, and cluster.
- cluster is a grouping of similar stores.
- store_nbr should match the unique store_nbr values in the training set; both tables list 54 distinct values.
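
A minimal sketch of checking that both tables reference the same stores, and of attaching the store metadata to the sales rows, is shown below. The toy frames are shaped like `train.csv` and `stores.csv` but contain made-up rows, and the variable names are assumptions.

```python
import pandas as pd

# Toy stand-ins shaped like train.csv and stores.csv.
train = pd.DataFrame(
    {
        "store_nbr": [1, 1, 2],
        "family": ["AUTOMOTIVE", "BOOKS", "BOOKS"],
        "sales": [0.0, 0.0, 0.0],
    }
)
stores = pd.DataFrame(
    {
        "store_nbr": [1, 2],
        "city": ["Quito", "Quito"],
        "state": ["Pichincha", "Pichincha"],
        "type": ["D", "D"],
        "cluster": [13, 13],
    }
)

# The two tables should reference exactly the same set of stores.
same_stores = set(train["store_nbr"]) == set(stores["store_nbr"])
print(same_stores)

# validate="many_to_one" raises if store_nbr is not unique in stores.
merged = train.merge(stores, on="store_nbr", how="left", validate="many_to_one")
print(merged[["store_nbr", "family", "city", "cluster"]])
```

The left join keeps every sales row, so a missing store would show up as NaN in the metadata columns rather than silently dropping rows.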

In [23]:
df = pd.read_csv(f"{COMPETITION_NAME}/stores.csv", low_memory=False)
df.nunique()

store_nbr    54
city         22
state        16
type          5
cluster      17


In [24]:
print("** Type")
print(df["type"].unique())

print("** Cluster")
print(df["cluster"].unique())

print("** City")
print(df["city"].unique())

print("** State")
print(df["state"].unique())

** Type
['D' 'B' 'C' 'E' 'A']
** Cluster
[13  8  9  4  6 15  7  3 12 16  1 10  2  5 11 14 17]
** City
['Quito' 'Santo Domingo' 'Cayambe' 'Latacunga' 'Riobamba' 'Ibarra'
 'Guaranda' 'Puyo' 'Ambato' 'Guayaquil' 'Salinas' 'Daule' 'Babahoyo'
 'Quevedo' 'Playas' 'Libertad' 'Cuenca' 'Loja' 'Machala' 'Esmeraldas'
 'Manta' 'El Carmen']
** State
['Pichincha' 'Santo Domingo de los Tsachilas' 'Cotopaxi' 'Chimborazo'
 'Imbabura' 'Bolivar' 'Pastaza' 'Tungurahua' 'Guayas' 'Santa Elena'
 'Los Rios' 'Azuay' 'Loja' 'El Oro' 'Esmeraldas' 'Manabi']
