<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#-Forecasting-Demand-for-Optimized-Inventory-Planning-" data-toc-modified-id="-Forecasting-Demand-for-Optimized-Inventory-Planning--1"><center> Forecasting Demand for Optimized Inventory Planning </center></a></span><ul class="toc-item"><li><span><a href="#Overview" data-toc-modified-id="Overview-1.1">Overview</a></span></li><li><span><a href="#Imports" data-toc-modified-id="Imports-1.2">Imports</a></span></li><li><span><a href="#Read-Data" data-toc-modified-id="Read-Data-1.3">Read Data</a></span></li><li><span><a href="#Time-things" data-toc-modified-id="Time-things-1.4">Time things</a></span></li><li><span><a href="#Auto-time-EDA" data-toc-modified-id="Auto-time-EDA-1.5">Auto time EDA</a></span></li></ul></li></ul></div>

- Author: Bruno
- Start: 16/04


<h1><center> Forecasting Demand for Optimized Inventory Planning </center></h1>


## Overview
An established retailer wants to optimize its inventory planning to not only significantly reduce storage space, but also its costs and need for logistical operations.
It plans to **restock its inventory every other week** and only **keep** in stock the items that it has actually **sold** during that period.

The goal of the participating teams is to create a machine learning model to **predict the demand for every product over the two-week period**.
It is important to point out that some products will be promoted for limited periods of time.
Products that are promoted during the simulation period will be earmarked.
However, the transaction data needs to indicate whether a product is being promoted during the training period.

Finally, the model does not need to be able to respond to price changes during the simulation period.
To simplify matters, prices will not be changed during the period.
In order to create this model, the teams obtain information about the **exact time of every transaction during a period of six months and about other features that describe the products.**


## Imports

In [1]:
import zipfile
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', 50)

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
import seaborn as sns
sns.set()

## Read Data

In [2]:
DATA_DIR = "../main/datasets/"
DATA_FILE = "1.0v.zip"
with zipfile.ZipFile(DATA_DIR+DATA_FILE) as z:
    # I am saving the data again to use in my auto eda script;
    # Too lazy to change it :)
    dfs = []
    for name in ["infos", "items", "orders"]:
        dfs.append(pd.read_csv(z.open(f"1.0v/{name}.csv"), sep="|"))
    infos, items, orders = dfs

In [3]:
infos.head(2)

Unnamed: 0,itemID,simulationPrice,promotion
0,1,3.43,
1,2,9.15,


In [4]:
infos.isna().sum()

itemID                0
simulationPrice       0
promotion          8620
dtype: int64

In [5]:
items.head(2)

Unnamed: 0,itemID,brand,manufacturer,customerRating,category1,category2,category3,recommendedRetailPrice
0,1,0,1,4.38,1,1,1,8.84
1,2,0,2,3.0,1,2,1,16.92


In [6]:
items.isna().sum()

itemID                    0
brand                     0
manufacturer              0
customerRating            0
category1                 0
category2                 0
category3                 0
recommendedRetailPrice    0
dtype: int64

In [7]:
orders.head(2)

Unnamed: 0,time,transactID,itemID,order,salesPrice
0,2018-01-01 00:01:56,2278968,450,1,17.42
1,2018-01-01 00:01:56,2278968,83,1,5.19


In [8]:
orders.isna().sum()

time          0
transactID    0
itemID        0
order         0
salesPrice    0
dtype: int64

In [9]:
# Make sure each item and info is 1 row only
assert infos["itemID"].nunique() == len(infos)
assert items["itemID"].nunique() == len(items)
# Make sure they are both the same
assert (items["itemID"].unique() == infos["itemID"].unique()).all()

# Make sure every item in order is in items/info
assert (orders["itemID"].isin(items["itemID"])).all()

# But the opposite isn't true!
missing = (~items["itemID"].isin(orders["itemID"])).sum()
print(f"We have {missing} = {100*missing/len(items):.2f}%"
      " products in info/items but not in orders")

We have 623 = 5.95% products in info/items but not in orders
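
To list which items never show up in the orders (not just how many), one option is a left merge with `indicator=True`. This is a minimal sketch on made-up toy frames standing in for `items` and `orders`; the values are illustrative, not taken from the dataset.

```python
import pandas as pd

# Toy stand-ins for the real `items` and `orders` frames (values are made up).
items = pd.DataFrame({"itemID": [1, 2, 3, 4]})
orders = pd.DataFrame({"itemID": [1, 1, 3], "order": [2, 1, 5]})

# Left-merge the catalogue against the distinct sold itemIDs;
# `_merge == "left_only"` marks items that never appear in orders.
sold = orders[["itemID"]].drop_duplicates()
flagged = items.merge(sold, on="itemID", how="left", indicator=True)
never_sold = flagged.loc[flagged["_merge"] == "left_only", "itemID"].tolist()
print(never_sold)
```

On the real data, the length of `never_sold` should match the `missing` count computed above.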




- - - 


## Time things

If we are told to ```Use the period starting on 30 June 2018 00:00:00, the day after the last date from the transaction files.```, that means 29 June is included in the training data, but 30 June is not (it is the first day of our test data).

Also, the last 14 days before the test period should be 16-29 June (the 15th should not be included!).

So we index "week_backwards", which counts how many 14-day periods BACKWARDS from test time a row falls in (i.e., 0 would mean we are at TEST TIME). Therefore, 0 doesn't exist for now :)
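
As a quick sanity check of this indexing, here is a small worked sketch on a few hand-picked dates (the dates are chosen for illustration; the arithmetic is the same day-of-year difference and `ceil(days / 14)` used in the cell below):

```python
import numpy as np
import pandas as pd

test_start = pd.Timestamp("2018-06-30")  # first day of the test period
dates = pd.Series(pd.to_datetime(
    ["2018-06-29", "2018-06-16", "2018-06-15", "2018-06-02", "2018-06-01"]
))

# Days between each date and the start of the test period (both in 2018).
days_backwards = test_start.dayofyear - dates.dt.dayofyear
# Every block of 14 days maps to one "week_backwards" index, starting at 1.
week_backwards = np.ceil(days_backwards / 14).astype(int)
print(week_backwards.tolist())
```

The 29th and the 16th both land in period 1, the 15th and the 2nd in period 2, and the 1st in period 3.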

In [57]:
def process_time(df, should_print=False,
                 test_start=pd.to_datetime("30 June 2018 00:00:00")):
    df["time"] = pd.to_datetime(df["time"])
    
    # Make sure we only have data for 2018
    assert (df["time"].dt.year != 2018).sum() == 0
    if should_print:
        print("The first timestamp is", df["time"].min(),
              "and the last is", df["time"].max())

    df["days"] = df["time"].dt.dayofyear

    # Make sure we have data for every single day (1 Jan = day 1, 29 June = day 180)
    assert np.array_equal(np.sort(df["days"].unique()), np.arange(1, 181))

    df["days_backwards"] = test_start.dayofyear - df["days"]
    df["week_backwards"] = np.ceil(df["days_backwards"] / 14).astype(int)
    # Make sure we didn't make any mistake - 16th/06 should be 1
    assert not (df.set_index("time").loc["16 June 2018 00:00:00":"16 June 2018 23:59:59",
                                             "week_backwards"] != 1).sum()
    # 15th/06 should be 2
    assert not (df.set_index("time").loc["15 June 2018 00:00:00":"15 June 2018 23:59:59",
                                             "week_backwards"] != 2).sum()

In [59]:
process_time(orders)



- - - 


## Auto time EDA

The code below formats the dataframe for use in my auto time EDA script.

In [65]:
aggs = {"order": "sum", "salesPrice": "mean",
        "days": "mean"}
df = orders.groupby(["week_backwards", "itemID"], as_index=False).agg(aggs)
df.rename(columns={x: x+"_"+y for x, y in aggs.items()}, inplace=True)

In [66]:
df.head(2)

Unnamed: 0,week_backwards,itemID,order_sum,salesPrice_mean,days_mean
0,1,1,3,3.43,174.0
1,1,3,140,14.04,178.909091
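
The same column names can also be produced in one step with pandas named aggregation, which avoids the separate `rename`. This is only an alternative sketch on a made-up toy `orders` frame, not the approach used in the cell above:

```python
import pandas as pd

# Toy stand-in for the processed `orders` frame (values are made up).
orders = pd.DataFrame({
    "week_backwards": [1, 1, 1, 2],
    "itemID": [1, 1, 3, 1],
    "order": [1, 2, 5, 4],
    "salesPrice": [3.0, 4.0, 14.0, 3.5],
    "days": [170, 174, 178, 160],
})

# Named aggregation: output_column=(input_column, aggregation).
df = orders.groupby(["week_backwards", "itemID"], as_index=False).agg(
    order_sum=("order", "sum"),
    salesPrice_mean=("salesPrice", "mean"),
    days_mean=("days", "mean"),
)
print(df)
```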


In [71]:
col = "itemID"
df = pd.merge(df, items, on=col, validate="m:1")
df = pd.merge(df, infos, on=col, validate="m:1")
df.head()

Unnamed: 0,week_backwards,itemID,order_sum,salesPrice_mean,days_mean,avg_time,brand,manufacturer,customerRating,category1,category2,category3,recommendedRetailPrice,simulationPrice,promotion
0,1,1,3,3.43,174.0,2018-06-23,0,1,4.38,1,1,1,8.84,3.43,
1,3,1,31,3.11,143.807692,2018-05-23,0,1,4.38,1,1,1,8.84,3.43,
2,4,1,3,3.11,132.0,2018-05-12,0,1,4.38,1,1,1,8.84,3.43,
3,5,1,299,3.11,113.057143,2018-04-23,0,1,4.38,1,1,1,8.84,3.43,
4,6,1,2,3.11,105.5,2018-04-15,0,1,4.38,1,1,1,8.84,3.43,


In [123]:
# convert week_backwards back to a datetime for my script:
# start from day 183 of 2018 (2 July) and go back 14 days per week_backwards,
# then parse the resulting integer as year + day-of-year (%Y%j)
# Idea from: https://stackoverflow.com/questions/34258892/converting-year-and-day-of-year-into-datetime-index-in-pandas
df["week_datetime"] = pd.to_datetime((2018*1000 + 183) 
                                     -14*df["week_backwards"], format='%Y%j')
# add binary is_promotion
df["is_promotion"] = df["promotion"].notna()
df.head()

Unnamed: 0,week_backwards,itemID,order_sum,salesPrice_mean,days_mean,avg_time,brand,manufacturer,customerRating,category1,category2,category3,recommendedRetailPrice,simulationPrice,promotion,is_promotion,week_datetime
0,1,1,3,3.43,174.0,2018-06-23,0,1,4.38,1,1,1,8.84,3.43,,False,2018-06-18
1,3,1,31,3.11,143.807692,2018-05-23,0,1,4.38,1,1,1,8.84,3.43,,False,2018-05-21
2,4,1,3,3.11,132.0,2018-05-12,0,1,4.38,1,1,1,8.84,3.43,,False,2018-05-07
3,5,1,299,3.11,113.057143,2018-04-23,0,1,4.38,1,1,1,8.84,3.43,,False,2018-04-23
4,6,1,2,3.11,105.5,2018-04-15,0,1,4.38,1,1,1,8.84,3.43,,False,2018-04-09
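
The year + day-of-year integer trick above can also be written with plain date arithmetic. A minimal sketch, assuming the same anchor (day 183 of 2018, i.e. 2 July) and a hand-picked set of `week_backwards` values:

```python
import pandas as pd

# Hand-picked week_backwards values for illustration.
week_backwards = pd.Series([1, 3, 4, 5, 6, 13])

# Same anchor as the integer version: day 183 of 2018.
anchor = pd.Timestamp("2018-07-02")
# Step back 14 days for every week_backwards unit.
week_datetime = anchor - pd.to_timedelta(14 * week_backwards, unit="D")
print(week_datetime.dt.strftime("%Y-%m-%d").tolist())
```

Both versions give the same labels for 2018; the timedelta form would also keep working if the range ever crossed a year boundary, where subtracting from a `%Y%j` integer would not.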


In [124]:
df.to_csv("full.csv", index=False)

In [129]:
temp = df.query("week_datetime == '2018-02-12'")
#.groupby("itemID")["order_sum"].sum().sort_values(ascending=False)

In [137]:
prods = [7789, 5035]
temp.query("itemID in @prods")#["order_sum"].sum() / temp["order_sum"].sum()

Unnamed: 0,week_backwards,itemID,order_sum,salesPrice_mean,days_mean,avg_time,brand,manufacturer,customerRating,category1,category2,category3,recommendedRetailPrice,simulationPrice,promotion,is_promotion,week_datetime
11801,10,5035,1235,44.14,46.033058,2018-02-15,90,80,3.41,5,21,4,17.49,31.44,2018-07-04,True,2018-02-12
15668,10,7789,1604,14.96,42.724284,2018-02-11,0,186,4.68,4,39,7,14.7,12.14,2018-06-30,True,2018-02-12
