# Preprocessing with Fugue

## Loading in Data

We'll take a quick look at the data to understand the problem better. Most of the code snippets here are taken from [Rob Mulla's Starter Notebook](https://www.kaggle.com/code/robikscube/m5-forecasting-starter-data-exploration). We won't go too deep into every detail; the goal is only to set up an end-to-end modelling pipeline.

In [None]:
import pandas as pd
import os

# Read in the data
INPUT_DIR = os.path.abspath('data')
WORKING_DIR = os.path.abspath("data/working")
training_data = pd.read_csv(f'{INPUT_DIR}/sales_train_evaluation.csv')


## Training Data

In [None]:
training_data.iloc[0:1]

In [None]:
def get_calendar_data():
    df = pd.read_csv(f'{INPUT_DIR}/calendar.csv')
    df["date"] = pd.to_datetime(df["date"])
    return df

In [None]:
from typing import Iterable, List, Any, Dict
from fugue import transform
from datetime import timedelta

start = get_calendar_data()['date'].min()

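# schema: unique_id:str,item_id:str,store_id:str,ds:date,y:double
def format_sales(df:Iterable[List[Any]], start) -> Iterable[List[Any]]:
    for row in df:
        # Columns 0-5 are identifiers; columns 6 onward are the daily sales d_1, d_2, ...
        for offset, y in enumerate(row[6:]):
            # Replace zeros with a small positive value to help with convergence
            # (this is why y is declared as double rather than int)
            if y == 0:
                y = y + 0.01
            # d_1 is the first calendar date, so the day offset starts at 0
            date = start + timedelta(days=offset)
            yield row[:2] + [row[4]] + [date, y]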

In [None]:
transform(training_data.iloc[0:1], format_sales, params={"start": start})
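
To make the reshaping concrete, here is a minimal pure-Python sketch of the same wide-to-long idea on a single toy row. The helper name `wide_to_long`, the row values, and the start date are all made up for illustration; the sketch leaves out the zero adjustment and counts days from `start` itself.

```python
from datetime import date, timedelta
from typing import Any, Iterable, List


def wide_to_long(rows: Iterable[List[Any]], start: date) -> Iterable[List[Any]]:
    """Yield one [id, item_id, store_id, ds, y] record per daily column."""
    for row in rows:
        # Columns 0-5 are identifiers; columns 6 onward are daily sales.
        for offset, y in enumerate(row[6:]):
            yield row[:2] + [row[4]] + [start + timedelta(days=offset), y]


# Hypothetical row laid out like sales_train_evaluation.csv:
# id, item_id, dept_id, cat_id, store_id, state_id, then three daily values.
toy_row = ["ITEM_A_STORE_1", "ITEM_A", "DEPT_1", "CAT_1", "STORE_1", "STATE_1", 3, 0, 5]
long_rows = list(wide_to_long([toy_row], start=date(2021, 1, 1)))

for record in long_rows:
    print(record)
```

Each wide row becomes as many long rows as it has daily columns, which is why the output of `transform` is much longer than its input.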

In [None]:
ddf = transform(training_data[0:100], 
                format_sales, 
                params={"start": start}, 
                engine="dask")
ddf.compute().head(5)

## Exogenous Regressors

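Next we add the sell price as an exogenous regressor. Prices are recorded per store, item, and week, so the sales rows first need the week key (`wm_yr_wk`) from the calendar.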

In [None]:
sell_prices = pd.read_csv(f'{INPUT_DIR}/sell_prices.csv')
sell_prices.head(2)

In [None]:
get_calendar_data().head(2)

In [None]:
from fugue import FugueWorkflow

sampled_sales = training_data.iloc[0:2]
calendar = get_calendar_data()
start = calendar['date'].min()

def process_data(sample=True) -> FugueWorkflow:
    dag = FugueWorkflow()
    if sample:
        sales = dag.df(sampled_sales)
    else:
        sales = dag.load(f'{INPUT_DIR}/sales_train_evaluation.csv', header=True)
    prices = dag.load(f'{INPUT_DIR}/sell_prices.csv', header=True)
    calendar = dag.load(f'{INPUT_DIR}/calendar.csv', header=True).rename({"date": "ds"}).alter_columns("ds:date")
    sales = sales.transform(format_sales, params={"start": start})
    combined = sales.join(calendar[["ds","wm_yr_wk"]], how="left_outer")\
                    .join(prices, how="inner")
    combined.show()
    combined.save(f"{WORKING_DIR}/combined.parquet")
    return dag

In [None]:
dag = process_data(sample=True)
dag.run()
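
The two joins in `process_data` can be sketched in plain pandas on toy data. This is only an illustration: the values are made up, and the `sell_prices.csv` column names (`store_id`, `item_id`, `wm_yr_wk`, `sell_price`) are assumed from the M5 dataset. Fugue infers the join keys from the columns the two sides share, whereas pandas needs them spelled out.

```python
import pandas as pd

# Toy tables with made-up values.
sales = pd.DataFrame({
    "item_id": ["ITEM_A", "ITEM_A"],
    "store_id": ["STORE_1", "STORE_1"],
    "ds": pd.to_datetime(["2021-01-01", "2021-01-02"]),
    "y": [3.0, 5.0],
})
calendar = pd.DataFrame({
    "ds": pd.to_datetime(["2021-01-01", "2021-01-02"]),
    "wm_yr_wk": [12101, 12101],
})
prices = pd.DataFrame({
    "store_id": ["STORE_1"],
    "item_id": ["ITEM_A"],
    "wm_yr_wk": [12101],
    "sell_price": [9.99],
})

# Step 1: attach the week key to every daily sales row.
with_week = sales.merge(calendar[["ds", "wm_yr_wk"]], on="ds", how="left")

# Step 2: attach the weekly price; the inner join drops rows without a price.
combined = with_week.merge(prices, on=["store_id", "item_id", "wm_yr_wk"], how="inner")
print(combined)
```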

To run on the full dataset and produce the full combined file, execute:

```python
dag = process_data(sample=False)
dag.run(spark)
```

where `spark` is an active `SparkSession`.

## Hierarchical Preprocessing

We need to keep the hierarchical columns (item, department, category, store, and state) so that we can aggregate along them later.

In [None]:
start = get_calendar_data()['date'].min()

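# schema: unique_id:str,item_id:str,dept_id:str,cat_id:str,store_id:str,state_id:str,ds:date,y:double
def format_sales_hierarchical(df:Iterable[List[Any]], start) -> Iterable[List[Any]]:
    for row in df:
        # Columns 0-5 are the hierarchy identifiers; columns 6 onward are daily sales
        for offset, y in enumerate(row[6:]):
            # Replace zeros with a small positive value to help with convergence
            # (this is why y is declared as double rather than int)
            if y == 0:
                y = y + 0.01
            # d_1 is the first calendar date, so the day offset starts at 0
            date = start + timedelta(days=offset)
            yield row[:6] + [date, y]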

In [None]:
transform(training_data, 
          format_sales_hierarchical, 
          params={"start": start}, 
          engine="spark", 
          save_path=f"{WORKING_DIR}/hierarchical.parquet")
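
As a rough sketch of why the hierarchy columns are worth keeping, the long-format table can be rolled up to any level with a group-by. The example below uses plain pandas on made-up rows rather than the saved parquet file, and the variable names are hypothetical.

```python
import pandas as pd

# Toy long-format rows with made-up values, one row per series and day.
long_df = pd.DataFrame({
    "unique_id": ["A_S1", "B_S1", "A_S2"],
    "item_id": ["A", "B", "A"],
    "dept_id": ["D1", "D1", "D1"],
    "cat_id": ["C1", "C1", "C1"],
    "store_id": ["S1", "S1", "S2"],
    "state_id": ["X", "X", "X"],
    "ds": pd.to_datetime(["2021-01-01"] * 3),
    "y": [3.0, 5.0, 2.0],
})

# Roll item-level sales up to store level, then to state level.
by_store = long_df.groupby(["store_id", "ds"], as_index=False)["y"].sum()
by_state = long_df.groupby(["state_id", "ds"], as_index=False)["y"].sum()

print(by_store)
print(by_state)
```

Without `store_id` and `state_id` on every row, these totals would need another join back to the original wide file.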