# Data Exploration

## Loading Data

The data has been placed in a public S3 bucket for this tutorial. We will be using the data from Kaggle's [M5 Forecasting Accuracy](https://www.kaggle.com/competitions/m5-forecasting-accuracy) competition. This contains Walmart sales data for the USA.

The data comprises 3049 individual products from 3 categories and 7 departments, sold in 10 stores across 3 states. The hierarchical aggregation captures the combinations of these factors. For instance, we can create 1 time series for all sales, 3 time series for sales per state, and so on. The most granular level is the sales of each of the 3049 products in each of the 10 stores, which gives 30490 time series.
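
To make the aggregation levels concrete, here is a minimal sketch on a toy table. The frame `toy` and its values are made up for illustration and are not M5 data; only the column naming (`item_id`, `store_id`, `state_id`, `d_*`) mirrors the dataset.

```python
import pandas as pd

# Toy stand-in for the wide sales table (made-up values, not M5 data).
toy = pd.DataFrame({
    "item_id":  ["A", "B", "A", "B"],
    "store_id": ["CA_1", "CA_1", "TX_1", "TX_1"],
    "state_id": ["CA", "CA", "TX", "TX"],
    "d_1": [1, 0, 2, 3],
    "d_2": [0, 4, 1, 1],
})
day_cols = ["d_1", "d_2"]

# Top level: all sales collapse into a single series.
total = toy[day_cols].sum()

# State level: one series per state.
per_state = toy.groupby("state_id")[day_cols].sum()

# Bottom level: one series per (item, store) pair, i.e. the rows themselves.
n_bottom = toy.groupby(["item_id", "store_id"]).ngroups

# With the counts quoted above, the bottom level has 3049 * 10 series.
n_bottom_m5 = 3049 * 10
```

Every intermediate level follows the same pattern: group by the relevant id columns and sum the day columns.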

We start by downloading and unzipping the contents. This will give us the CSV files to work with.

In [None]:
!wget -q -O tmp.zip https://fugue-data.s3.us-east-2.amazonaws.com/m5-forecasting-accuracy.zip && mkdir -p 'data' && mv 'tmp.zip' 'data/tmp.zip' && unzip -o 'data/tmp.zip' -d 'data' && rm 'data/tmp.zip'
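
If `wget` or `unzip` is not available (for example on Windows), the same steps can be sketched with the Python standard library. The helper names `extract_zip` and `download_and_extract` are hypothetical and not part of the tutorial; this is one possible equivalent, not the tutorial's own method.

```python
import urllib.request
import zipfile
from pathlib import Path
from typing import List


def extract_zip(zip_path: str, dest_dir: str) -> List[str]:
    """Extract every member of zip_path into dest_dir and return the member names."""
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest_dir)
        return zf.namelist()


def download_and_extract(url: str, dest_dir: str = "data") -> List[str]:
    """Download a zip archive into dest_dir, unpack it there, then delete the archive."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    zip_path = dest / "tmp.zip"
    urllib.request.urlretrieve(url, str(zip_path))
    try:
        return extract_zip(str(zip_path), dest_dir)
    finally:
        zip_path.unlink()  # mirror the `rm` step of the shell command


# Usage (performs a network download, so it is left commented out here):
# download_and_extract(
#     "https://fugue-data.s3.us-east-2.amazonaws.com/m5-forecasting-accuracy.zip"
# )
```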

## First Look at Data

We'll take a quick look at the data to understand the problem better. Most of the code snippets here are taken from [Rob Mulla's Starter Notebook](https://www.kaggle.com/code/robikscube/m5-forecasting-starter-data-exploration). We're not going to go too deep into the details. We're only interested in doing some quick visualization.

In [None]:
import pandas as pd
import os

# Read in the data
INPUT_DIR = os.path.abspath('data')
WORKING_DIR = os.path.abspath("data/working")
os.makedirs(WORKING_DIR, exist_ok=True)  # create the working directory if missing
calendar = pd.read_csv(f'{INPUT_DIR}/calendar.csv')
training_data = pd.read_csv(f'{INPUT_DIR}/sales_train_evaluation.csv')
sell_prices = pd.read_csv(f'{INPUT_DIR}/sell_prices.csv')


**Training Data**

We take a look at the head. Note the following:
1. The data is hierarchical: each row has a `dept_id` and a `cat_id`.
2. The combination of `store_id` and `item_id` is a unique identifier.
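3. Columns `d_1` to `d_1941` form our time series of daily unit sales (`sales_train_evaluation.csv` extends the `d_1` to `d_1913` range of the validation file by 28 days).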
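
The points above can be checked with a few lines of pandas. The sketch below runs on a toy frame with made-up values (not M5 data); on the real table, the same calls would be applied to `training_data`.

```python
import pandas as pd

# Toy stand-in for the wide training table (made-up values, not M5 data).
toy = pd.DataFrame({
    "id": ["A_CA_1_evaluation", "A_TX_1_evaluation", "B_CA_1_evaluation"],
    "item_id": ["A", "A", "B"],
    "store_id": ["CA_1", "TX_1", "CA_1"],
    "d_1": [0, 2, 1],
    "d_2": [3, 0, 1],
    "d_3": [1, 1, 0],
})

# Point 2: no (store_id, item_id) pair should occur more than once.
is_unique_key = not toy.duplicated(subset=["store_id", "item_id"]).any()

# Point 3: the day columns are the ones whose names start with "d_".
day_cols = [c for c in toy.columns if c.startswith("d_")]

# Melt wide -> long: one row per (series, day), which is easier to join and plot.
long_df = toy.melt(
    id_vars=["id", "item_id", "store_id"],
    value_vars=day_cols,
    var_name="d",
    value_name="sales",
)
```

The long format has one row per series and day, so its length is the number of series times the number of day columns.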

In [None]:
training_data.head()

**Sell Prices**

In [None]:
sell_prices.head()

**Calendar**

The calendar maps each `d_*` column of `training_data` to a date. It also carries `wm_yr_wk`, so by joining `sell_prices` with `calendar` on `wm_yr_wk` we can map the prices to dates.

In [None]:
calendar.head()
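
As a minimal sketch of mapping weekly prices to dates, the example below joins two toy frames on `wm_yr_wk`. The values are illustrative only, and the `sell_price` column name is an assumption about the price table's layout; the frames `calendar_toy` and `prices_toy` are hypothetical stand-ins.

```python
import pandas as pd

# Toy calendar: several days share one week id (illustrative values).
calendar_toy = pd.DataFrame({
    "date": ["2011-01-29", "2011-01-30", "2011-02-05", "2011-02-06"],
    "wm_yr_wk": [11101, 11101, 11102, 11102],
    "d": ["d_1", "d_2", "d_8", "d_9"],
})

# Toy price table: one price per store, item, and week
# (the `sell_price` column name is assumed).
prices_toy = pd.DataFrame({
    "store_id": ["CA_1", "CA_1"],
    "item_id": ["A", "A"],
    "wm_yr_wk": [11101, 11102],
    "sell_price": [2.0, 2.5],
})

# Each weekly price row is repeated for every calendar day in that week.
daily_prices = prices_toy.merge(
    calendar_toy[["wm_yr_wk", "date", "d"]],
    on="wm_yr_wk",
    how="left",
)
```

Because the result carries the `d` column, it can in turn be joined to a long-format sales table on `store_id`, `item_id`, and `d`.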

## Initial Plots

We are not too concerned with getting the best model. We just want to understand the data better and see what the time series look like.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from itertools import cycle

plt.style.use('bmh')
color_pal = plt.rcParams['axes.prop_cycle'].by_key()['color']
color_cycle = cycle(color_pal)  # endlessly iterate over the style's colors

In [None]:
d_cols = [c for c in training_data.columns if c.startswith('d_')] # sales data columns

def plot_one(data: pd.DataFrame, calendar: pd.DataFrame, series_id: str) -> None:
    """Plot the daily sales of one series against its calendar dates."""
    row = data.loc[data['id'] == series_id]
    idx = row.index[0]  # after transposing, the sales column is named by this row label
    example = (
        row[d_cols]
        .T
        .reset_index()
        .rename(columns={'index': 'd'})
        .merge(calendar, how='left', on='d', validate='one_to_one')
        .set_index('date')[idx]
    )
    example.plot(figsize=(15, 5),
                 color=next(color_cycle),
                 title=f'{series_id} sales by actual sale dates')
    plt.show()

plot_one(training_data, calendar, 'HOBBIES_1_234_CA_3_evaluation')
plot_one(training_data, calendar, 'FOODS_3_090_CA_3_evaluation')
plot_one(training_data, calendar, 'HOUSEHOLD_1_118_CA_3_evaluation')

## Next Steps

In this section, we took an initial look at the data. In the next section, we'll begin preprocessing it.