# Introduction

TODO:
- Frame the business problem (why we care) and 3-5 potential use cases for this technology in the retail sector
- Define success in terms of 1-2 specific metrics (recommend using MAPE and RMSE)
- Provide a high-level overview of your approach
- Write a high-level overview of your data
    - Where it came from
    - How it was collected
    - Potential biases it may have
- When you're all finished, write a 1-paragraph executive summary for this engagement

# Import

TODO:
- Load pkl into a dask dataframe for parallelized processing
- Print .head() and .describe()

OPTIONAL:
- Run your processing on a distributed dask cluster. See here: https://docs.dask.org/en/latest/setup/cloud.html

In [2]:
import pandas as pd
import numpy as np
import dask.dataframe as dd

import seaborn as sns

import xgboost as xgb
import lightgbm as lgb

import shap

pandas_df = pd.read_pickle("./data/raw_weekly_df.pkl")
dask_df = dd.from_pandas(pandas_df, npartitions=8)

In [3]:
pandas_df.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6841121 entries, 0 to 6841120
Data columns (total 7 columns):
 #   Column    Dtype         
---  ------    -----         
 0   dept_id   category      
 1   cat_id    category      
 2   item_id   category      
 3   state_id  category      
 4   store_id  category      
 5   datetime  datetime64[ns]
 6   sales     float32       
dtypes: category(5), datetime64[ns](1), float32(1)
memory usage: 117.5 MB


# Data Exploration & Transformation

TODO:
- Check for outliers and missing data
- Check for duplicates
- Check datatypes and comment on whether they make sense
- Plot univariate / bivariate graphs to better understand your data, especially for your target variable, in seaborn
- Standardize, normalize, or log-transform your target variable
- Split your data into training, testing, and holdout sets and explain why this step is important
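
The split in the last item matters because error measured on rows the model has already seen is optimistic, and with weekly sales a random split would also let the model learn from weeks that come after the ones it is scored on. A minimal sketch of a date-based split follows. The function name `split_by_time` and the 8-week window sizes are assumptions chosen for illustration, not values taken from this engagement.

```python
import pandas as pd


def split_by_time(df, time_col="datetime", test_weeks=8, holdout_weeks=8):
    """Split df into train/test/holdout by date, keeping the latest weeks for holdout.

    The window sizes are illustrative defaults (an assumption, not a recommendation).
    """
    last = df[time_col].max()
    holdout_start = last - pd.Timedelta(weeks=holdout_weeks)
    test_start = holdout_start - pd.Timedelta(weeks=test_weeks)

    # Every training row precedes every test row, which precedes every holdout row.
    train = df[df[time_col] <= test_start]
    test = df[(df[time_col] > test_start) & (df[time_col] <= holdout_start)]
    holdout = df[df[time_col] > holdout_start]
    return train, test, holdout
```

Calling `split_by_time(pandas_df)` would return three frames; the holdout frame should stay untouched until the final evaluation.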

In [5]:
pandas_df.duplicated()

0          False
1          False
2          False
3          False
4          False
           ...  
6841116    False
6841117    False
6841118    False
6841119    False
6841120    False
Length: 6841121, dtype: bool

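- the truncated preview above only shows the first and last five rows, all `False`; the filter in the next cell returns an empty frame, which confirms that none of the rows are duplicated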

In [6]:
pandas_df.loc[pandas_df.duplicated()]

Unnamed: 0,dept_id,cat_id,item_id,state_id,store_id,datetime,sales
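
The duplicate check above can be paired with the missing-data check from the TODO list in one small helper. This is a sketch only; `quality_summary` is a hypothetical name, and it reports counts rather than printing the full boolean series.

```python
import pandas as pd


def quality_summary(df):
    """Return the row count, duplicate-row count, and per-column missing-value counts."""
    return {
        "n_rows": len(df),
        # duplicated() marks every repeat of an earlier row as True
        "n_duplicate_rows": int(df.duplicated().sum()),
        "missing_per_column": df.isna().sum().to_dict(),
    }
```

`quality_summary(pandas_df)` would give a single dictionary that is easier to read than a 6.8-million-row boolean series.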


 - the data set contains 6,841,121 rows
 - the mean of sales across all rows is 9.60301 (standard deviation 25.84863)
 - sales range from 0 to 4220
 - at least 25% of the sales values are 0 (the 25th percentile equals the minimum), 50% are at or below 3 (the median), and 75% are at or below 9
 - the sales data is skewed right: the mean (9.6) is well above the median (3), and the maximum (4220) is far above the 75th percentile (9)

In [7]:
pandas_df.describe()

Unnamed: 0,sales
count,6841121.0
mean,9.60301
std,25.84863
min,0.0
25%,0.0
50%,3.0
75%,9.0
max,4220.0
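
Given the right skew in these statistics, one option for the outlier and target-transformation items in the TODO list is sketched below. The helper names are hypothetical, and the 1.5 x IQR fence is a common rule of thumb rather than a method this notebook prescribes. With the quartiles reported above (Q1 = 0, Q3 = 9), that fence sits at 9 + 1.5 x 9 = 22.5.

```python
import numpy as np
import pandas as pd


def log_transform(sales):
    """Apply log1p, which maps zero sales to zero and compresses the long right tail."""
    return np.log1p(sales)


def iqr_outlier_mask(sales, k=1.5):
    """Flag values above Q3 + k * IQR (a rule of thumb, assumed here for illustration)."""
    q1, q3 = sales.quantile([0.25, 0.75])
    return sales > q3 + k * (q3 - q1)
```

Predictions made on the log1p scale can be mapped back to units sold with `np.expm1`.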


# Feature Engineering

TODO:
- Encode any categorical features
- Develop 3+ new time-related features and 3+ new product-related features as functions in a script called 'feature_engineering.py' in ./src
    - Time-related features: MonthYear, holiday flags, etc.
    - Product-related features: average number of items sold per month in each store (frequency), last time product sold in the store (recency), etc. 
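
As a rough illustration of what functions in `feature_engineering.py` might look like, the sketch below adds calendar features and a recency feature. The function names, the new column names, and the choice to measure recency in days are assumptions; the column names `datetime`, `sales`, `store_id`, and `item_id` come from the data frame shown earlier.

```python
import pandas as pd


def add_time_features(df, time_col="datetime"):
    """Add calendar features derived from the timestamp column."""
    out = df.copy()
    out["year"] = out[time_col].dt.year
    out["month"] = out[time_col].dt.month
    out["week_of_year"] = out[time_col].dt.isocalendar().week.astype("int32")
    return out


def add_recency(df, keys=("store_id", "item_id"), time_col="datetime", target="sales"):
    """Add days since the item last sold in the store, looking only at earlier rows."""
    out = df.sort_values([*keys, time_col]).copy()
    group_keys = [out[k] for k in keys]

    # Keep the timestamp only on rows with a sale, then shift within each group
    # so the current row never sees its own sales value.
    sold_at = out[time_col].where(out[target] > 0)
    prev_sale = sold_at.groupby(group_keys, observed=True).shift()
    last_sale = prev_sale.groupby(group_keys, observed=True).ffill()

    # NaN where the item has no earlier sale in that store.
    out["days_since_last_sale"] = (out[time_col] - last_sale).dt.days
    return out
```

The shift before the forward fill is deliberate: without it the feature would leak the current week's sales into its own predictor.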

# Modeling

TODO: 
- Write a function that trains your lightgbm and xgboost models on your training data and reports both RMSE and MAPE, in-sample (IS) and out-of-sample (OOS)
- Use your function to 'tune' both algorithms by passing in different combinations of hyperparameters (max depth, subsampling, etc.) to optimize for OOS RMSE
- Plot your OOS RMSE on the y-axis and your various hyperparameter combinations on the x-axis to select the best combination
- Retrain your models on your train + test data and measure your IS and OOS error metrics on your holdout data 
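
The metric and tuning steps above could be sketched as follows. This is a library-agnostic outline under stated assumptions: `fit_predict` is a hypothetical callback that trains a model with the given hyperparameters and returns predictions for the test rows, so no lightgbm or xgboost call is shown. Because at least a quarter of the sales values are zero, MAPE is undefined on those rows; skipping them is an assumption made here, not something the TODO specifies.

```python
from itertools import product

import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mape(y_true, y_pred):
    """MAPE in percent over rows with non-zero actuals; NaN if no such rows exist."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    mask = y_true != 0  # assumption: rows with zero actual sales are skipped
    if not mask.any():
        return float("nan")
    return float(np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])) * 100)


def param_grid(options):
    """Expand {"name": [values, ...]} into a list of hyperparameter dictionaries."""
    keys = list(options)
    return [dict(zip(keys, combo)) for combo in product(*(options[k] for k in keys))]


def tune(fit_predict, grid, y_test):
    """Return (best OOS RMSE, best params); fit_predict is a hypothetical callback."""
    scores = [(rmse(y_test, fit_predict(params)), params) for params in grid]
    return min(scores, key=lambda pair: pair[0])
```

The list of `(rmse, params)` pairs built inside `tune` is also what the OOS RMSE plot would be drawn from.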

OPTIONAL:
- Validate that your results are accurate using k-fold cross-validation. Note that this step is extremely important, but can be time-intensive; including it here as optional to be conscious of our time constraints.
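
One caveat for weekly sales data: shuffled k-fold would train on weeks that come after the validation weeks. The sketch below swaps in an expanding-window variant, which is a different technique from plain k-fold and is offered here as an assumption about what suits this data. The function name is hypothetical; scikit-learn's `TimeSeriesSplit` provides comparable behavior.

```python
import numpy as np


def expanding_window_folds(n_periods, n_folds):
    """Yield (train_idx, valid_idx) over period positions, oldest first.

    Periods are cut into n_folds + 1 contiguous chunks; fold i trains on the
    first i chunks and validates on the next one, so training always precedes
    validation in time.
    """
    chunks = np.array_split(np.arange(n_periods), n_folds + 1)
    for i in range(1, n_folds + 1):
        yield np.concatenate(chunks[:i]), chunks[i]
```

The indices refer to positions in the sorted list of unique weeks, so each fold would be mapped back to rows by filtering on `datetime`.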

# Interpretation

TODO:
- Create a SHAP plot for our best model and interpret its results
