<a href="https://colab.research.google.com/github/jeffheaton/present/blob/master/WUSTL/CABI-Demand/lab-2-features.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Washington University [Olin School of Business](https://olin.wustl.edu/EN-US/Pages/default.aspx)
[Center for Analytics and Business Insights](https://olin.wustl.edu/EN-US/Faculty-Research/research-centers/center-analytics-business-insights/Pages/default.aspx) (CABI)  
[Deep Learning for Demand Forecasting](https://github.com/jeffheaton/present/tree/master/WUSTL/CABI-Demand)  
Copyright 2022 by [Jeff Heaton](https://www.youtube.com/c/HeatonResearch), Released under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) 

# Lab 2: Features

Try to derive one or more of the following features from the data.

* For each product, how many other products were also for sale on a given day?
* On each day, how long has each product been on the market?
* What percentage of each day's total sales does each product account for?

You can use the following starter code.
First, connect Google Drive so that you can write out any results.

In [1]:
try:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    COLAB = True
    print("Note: using Google CoLab")
except ImportError:
    print("Note: not using Google CoLab")
    COLAB = False

Mounted at /content/drive
Note: using Google CoLab


Load the three data files.

In [2]:
import pandas as pd
import os

PATH = "/content/drive/MyDrive/projects/demand/"

df_sales = pd.read_csv("https://data.heatonresearch.com/wustl/CABI/demand-forecast/sales_train.csv", parse_dates=['date'])
df_items = pd.read_csv("https://data.heatonresearch.com/wustl/CABI/demand-forecast/items.csv")
df_resturant = pd.read_csv("https://data.heatonresearch.com/wustl/CABI/demand-forecast/resturants.csv")

Generate features. Add yours here.

In [3]:
df_sales['dow'] = df_sales['date'].dt.dayofweek
df_sales['doy'] = df_sales['date'].dt.dayofyear
df_sales

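| | date | item_id | price | item_count | dow | doy |
|---|---|---|---|---|---|---|
| 0 | 2019-01-01 | 3 | 29.22 | 2.0 | 1 | 1 |
| 1 | 2019-01-01 | 4 | 26.42 | 22.0 | 1 | 1 |
| 2 | 2019-01-01 | 12 | 4.87 | 7.0 | 1 | 1 |
| 3 | 2019-01-01 | 13 | 4.18 | 12.0 | 1 | 1 |
| 4 | 2019-01-01 | 16 | 3.21 | 136.0 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... |
| 109595 | 2021-12-31 | 96 | 21.93 | 0.0 | 4 | 365 |
| 109596 | 2021-12-31 | 97 | 28.65 | 0.0 | 4 | 365 |
| 109597 | 2021-12-31 | 98 | 5.00 | 0.0 | 4 | 365 |
| 109598 | 2021-12-31 | 99 | 5.32 | 0.0 | 4 | 365 |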
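
As a quick check on the two calendar features, the sketch below applies the same `dt.dayofweek` and `dt.dayofyear` accessors to the first and last dates shown in the output. pandas numbers weekdays from Monday=0 to Sunday=6. The `is_weekend` column is a hypothetical extra feature added for illustration; it is not part of the starter code.

```python
import pandas as pd

# The first and last dates that appear in the output above.
dates = pd.Series(pd.to_datetime(["2019-01-01", "2021-12-31"]))

dow = dates.dt.dayofweek   # Monday=0 ... Sunday=6
doy = dates.dt.dayofyear   # 1-based day of the year
is_weekend = dow >= 5      # hypothetical extra feature: Saturday or Sunday

print(pd.DataFrame({"date": dates, "dow": dow, "doy": doy, "is_weekend": is_weekend}))
```

2019-01-01 was a Tuesday (`dow` 1) and 2021-12-31 was a Friday (`dow` 4), which agrees with the table.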


In [4]:
import tqdm

# For each product, how many units did all *other* products sell on a given day?
# Note: this sums item_count; it does not count the number of other products.
dates = df_sales.date.unique()
items = df_sales.item_id.unique()
df_sales['other_sales'] = 0

for d in tqdm.tqdm(dates):
  for item in items:
    target = (df_sales.date == d) & (df_sales.item_id==item)  
    #assert sum(target) == 1
    others = df_sales[(df_sales.item_id!=item) & (df_sales.date==d) ].item_count.sum()
    df_sales.loc[target,'other_sales'] = others

100%|██████████| 1096/1096 [05:02<00:00,  3.62it/s]
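
The loop above filters the whole frame once per (date, item) pair and took about five minutes in the recorded run. A vectorized alternative is sketched below on a small made-up frame (`toy` and its values are hypothetical). It assumes each (date, item_id) pair appears at most once, in which case "units sold by the other items" equals the day's total minus the row's own units. The `other_items` column is a second, separate sketch for the product-count question, under the assumption that an item counts as "for sale" on a day only if it sold at least one unit that day.

```python
import pandas as pd

# Toy stand-in for df_sales (values are made up for illustration).
toy = pd.DataFrame({
    "date": pd.to_datetime(["2019-01-01"] * 3 + ["2019-01-02"] * 3),
    "item_id": [1, 2, 3, 1, 2, 3],
    "item_count": [2.0, 5.0, 0.0, 0.0, 4.0, 1.0],
})

# Total units sold per day, broadcast back to every row of that day.
day_total = toy.groupby("date")["item_count"].transform("sum")
# Units sold by all other items = day total minus this row's own units.
toy["other_sales"] = day_total - toy["item_count"]

# Assumption: an item is "for sale" on a day if it sold at least one unit.
sold = (toy["item_count"] > 0).astype(int)
# Items sold that day, minus this item itself if it sold.
toy["other_items"] = sold.groupby(toy["date"]).transform("sum") - sold

print(toy)
```

Because `transform` returns a result aligned with the original rows, no explicit loop or `.loc` assignment is needed.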


In [5]:
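# How long has each product been on the market for each day?
# First sale date per item: the earliest date with a non-zero item_count.
df_first_sale = df_sales[df_sales.item_count>0]
df_first_sale = df_first_sale[['item_id','date']].groupby('item_id',as_index=False).min().sort_values(['date'],ascending=False)
df_first_sale.columns = ['item_id','first_sale']
# Inner merge on item_id: rows for items with no recorded sale are dropped.
df_sales = df_sales.merge(df_first_sale, on='item_id')
# Age in days; negative for dates before an item's first sale.
df_sales['age'] = (df_sales.date - df_sales.first_sale).dt.days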
df_sales
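
The merge above is an inner join, so any item without a recorded sale disappears from `df_sales`. If those rows should be kept, one possible variant is sketched below on a made-up frame (`toy` and its values are hypothetical). It computes the first sale date with `groupby(...).transform("min")` instead of a merge, leaving `age` as NaN for items that never sold.

```python
import pandas as pd

# Toy stand-in for df_sales: item 1 first sells on day 2, item 2 never sells.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-03"] * 2),
    "item_id": [1, 1, 1, 2, 2, 2],
    "item_count": [0.0, 3.0, 1.0, 0.0, 0.0, 0.0],
})

# Keep the date only on rows where a sale happened; NaT elsewhere.
sale_dates = toy["date"].where(toy["item_count"] > 0)
# Earliest sale date per item, broadcast to every row (NaT if never sold).
toy["first_sale"] = sale_dates.groupby(toy["item_id"]).transform("min")
# Days since first sale: negative before launch, NaN for items never sold.
toy["age"] = (toy["date"] - toy["first_sale"]).dt.days

print(toy)
```

Whether negative or missing ages should be clipped or filled before training depends on the model; that choice is left open here.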

In [12]:
# What percent of sales does each product have per day?
# Stored as a fraction (0 to 1); the result is NaN on a day with no sales at all.
df_sales['pct_total'] = df_sales.item_count/(df_sales.item_count+df_sales.other_sales)
df_sales

| | date | item_id | price | item_count | dow | doy | other_sales | first_sale | age | pct_total |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-01-01 | 3 | 29.22 | 2.0 | 1 | 1 | 425 | 2019-01-01 | 0 | 0.004684 |
| 1 | 2019-01-02 | 3 | 29.22 | 0.0 | 2 | 2 | 337 | 2019-01-01 | 1 | 0.000000 |
| 2 | 2019-01-03 | 3 | 29.22 | 0.0 | 3 | 3 | 445 | 2019-01-01 | 2 | 0.000000 |
| 3 | 2019-01-04 | 3 | 29.22 | 6.0 | 4 | 4 | 558 | 2019-01-01 | 3 | 0.010638 |
| 4 | 2019-01-05 | 3 | 29.22 | 4.0 | 5 | 5 | 548 | 2019-01-01 | 4 | 0.007246 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 100827 | 2021-12-27 | 100 | 2.48 | 0.0 | 0 | 361 | 192 | 2019-02-18 | 1043 | 0.000000 |
| 100828 | 2021-12-28 | 100 | 2.48 | 0.0 | 1 | 362 | 344 | 2019-02-18 | 1044 | 0.000000 |
| 100829 | 2021-12-29 | 100 | 2.48 | 0.0 | 2 | 363 | 371 | 2019-02-18 | 1045 | 0.000000 |
| 100830 | 2021-12-30 | 100 | 2.48 | 0.0 | 3 | 364 | 527 | 2019-02-18 | 1046 | 0.000000 |
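
The share computed above divides by the day's total units, which gives NaN (0/0) on any day where nothing sold. A minimal sketch of one way to guard against that follows. The first two rows reuse `item_count` and `other_sales` values from the output above; the third row is a made-up zero-sales day. Filling with 0 is an assumption, and leaving NaN in place may be preferable for some models.

```python
import numpy as np
import pandas as pd

# Rows 1-2 reuse values from the output above; row 3 is a made-up zero-sales day.
toy = pd.DataFrame({
    "item_count": [2.0, 6.0, 0.0],
    "other_sales": [425.0, 558.0, 0.0],
})

total = toy["item_count"] + toy["other_sales"]
# Replace a zero total with NaN so the division yields NaN, then fill with 0.
toy["pct_total"] = (toy["item_count"] / total.replace(0, np.nan)).fillna(0.0)

print(toy)
```

Multiplying the result by 100 would turn the fraction into an actual percentage, if that scale is preferred.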
