In [29]:
%load_ext autoreload
%autoreload 2
import numpy as np
import pandas as pd
from bokeh.plotting import figure, show, output_notebook

# add project dir to path
import sys
from os import getcwd
from os.path import dirname
sys.path.append(dirname(getcwd()))

from inv_changes.load import load_from_csv, load_from_hdf, write_to_hdf

%matplotlib inline

READ_FROM_CSV = True

if READ_FROM_CSV:
    df = load_from_csv()
    write_to_hdf(df)
else:
    df = load_from_hdf()

# Inputs
There are two inputs: the raw data set, and a prepared data set derived from it.
## CSV
The CSV input is the raw format of the source data.  It contains about 6.4 million rows, each representing an interval over which a SKU's inventory dimensions are effective.  An interval may span any amount of time.  A change in any inventory dimension may cause the interval to expire and a new one to start, even if that dimension isn't important to us (e.g. a product remained "in-stock" but switched from "is dropshipped" to "not dropshipped").
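
Because an interval can end on a change we don't care about, one stretch of unchanged stock status may be split across several rows. The following is a minimal sketch, on toy data, of collapsing adjacent rows that share a stock status. The `is_dropshipped` column and the status values are hypothetical stand-ins, and the sketch assumes a SKU's intervals are contiguous; it is not necessarily what `load_from_csv` does.

```python
import pandas as pd

# Toy intervals for one SKU; `is_dropshipped` is a hypothetical irrelevant dimension.
intervals = pd.DataFrame({
    "sku": ["A", "A", "A"],
    "stock_status": ["in-stock", "in-stock", "out-of-stock"],
    "is_dropshipped": [True, False, False],
    "effective_on": pd.to_datetime(["2016-01-01", "2016-01-05", "2016-01-09"]),
    "expired_on": pd.to_datetime(["2016-01-05", "2016-01-09", "2016-01-12"]),
})

intervals = intervals.sort_values(["sku", "effective_on"])

# A new run starts whenever the SKU or the stock status differs from the previous row.
new_run = (
    (intervals["sku"] != intervals["sku"].shift())
    | (intervals["stock_status"] != intervals["stock_status"].shift())
)
intervals["run_id"] = new_run.cumsum()

# Each run becomes one interval: earliest start, latest end.
merged = (
    intervals.groupby(["sku", "run_id", "stock_status"], as_index=False)
    .agg(effective_on=("effective_on", "min"), expired_on=("expired_on", "max"))
    .drop(columns="run_id")
)
print(merged)
```

The two "in-stock" rows, which differ only in the dropship dimension, end up as a single interval.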

### Dataset Cleansing
Cleansing the dataset drops the total row count from 6.4 million to about 1.75 million.  The first insight, then, is that much of the stored data is not pertinent to this analysis.

#### Filtering known irrelevant SKUs
Some of the rows in the data set are unnecessary.  SKUs in certain sizes, and special-order products, are not sold on the web; keeping them would only slow down and complicate the analysis.
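
A rough sketch of this kind of filter is below. The excluded size codes and the special-order flag value are made up for illustration, and using `lifecycle_status_flag` to identify special-order products is an assumption; the real rules live in the loading code.

```python
import pandas as pd

# Hypothetical values: the real excluded sizes and special-order flag are not listed here.
EXCLUDED_SIZE_CODES = {"XX", "ZZ"}
SPECIAL_ORDER_FLAG = "S"

raw = pd.DataFrame({
    "sku": ["A", "B", "C", "D"],
    "size_code": ["M", "XX", "L", "ZZ"],
    "lifecycle_status_flag": ["A", "A", "S", "S"],
})

is_excluded_size = raw["size_code"].isin(EXCLUDED_SIZE_CODES)
is_special_order = raw["lifecycle_status_flag"] == SPECIAL_ORDER_FLAG

# Keep only rows that match neither exclusion rule.
web_skus = raw[~(is_excluded_size | is_special_order)]
print(web_skus)
```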

#### Filtering "always out of stock" products
It's difficult to know which products in the dataset are published.  This information isn't stored in the database and, as far as I know, can't be derived from anywhere else.  However, it's fair to assume that any product considered "out of stock" for every recorded interval has probably never been published and never will be.
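
One way to express this rule in pandas is sketched below on toy data. The status labels are assumed values, not necessarily the ones stored in `stock_status`.

```python
import pandas as pd

intervals = pd.DataFrame({
    "sku": ["A", "A", "B", "B"],
    "stock_status": ["in-stock", "out-of-stock", "out-of-stock", "out-of-stock"],
})

# True for every row of a SKU whose intervals are all out of stock.
is_out = intervals["stock_status"] == "out-of-stock"
always_out = is_out.groupby(intervals["sku"]).transform("all")

# Drop the SKUs that were never in stock.
published = intervals[~always_out]
print(published)
```

SKU `B` is removed entirely, while SKU `A` keeps both of its rows, including the out-of-stock one.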

## HDF5
The HDF5 input is the processed CSV input.  Loading it avoids repeating the processing, which saves us time.

In [34]:
# whole days per interval (NaN when either timestamp is missing)
df['status_duration_in_days'] = (df['expired_on'] - df['effective_on']).dt.days
# intervals under one day have 0 whole days; mask them so the division yields NaT, not infinity
df['average_daily_duration'] = df['status_duration'] / df['status_duration_in_days'].where(df['status_duration_in_days'] > 0)
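
Intervals shorter than one day have a whole-day duration of zero, and dividing a timedelta by zero produces infinity, which cannot be stored as a timedelta. The sketch below shows, on toy data, one way to guard against that by masking zero-day intervals before dividing. The toy timestamps are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "effective_on": pd.to_datetime(["2016-01-01 08:00", "2016-01-03 09:00"]),
    "expired_on": pd.to_datetime(["2016-01-03 08:00", "2016-01-03 17:00"]),
})
toy["status_duration"] = toy["expired_on"] - toy["effective_on"]

# Whole days per interval; the second interval lasts 8 hours, so it has 0 whole days.
days = toy["status_duration"].dt.days

# Replace zero with NaN so the division yields NaT instead of attempting infinity.
toy["average_daily_duration"] = toy["status_duration"] / days.where(days > 0)
print(toy)
```

Whether sub-day intervals should be left as missing or handled some other way (for example, counted as a single day) depends on what the average is meant to capture.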

In [None]:
# total status_duration per SKU and day, one column per stock_status
df.pivot_table(index=['sku', 'effective_on_date'], 
               columns='stock_status', 
               values='status_duration', 
               aggfunc='sum').fillna(np.timedelta64(0, 'D'))
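
To make the shape of that result concrete, here is a small sketch of the same pivot on toy data. It converts the durations to hours first, which is an assumption made here to keep the aggregation numeric; the status labels and values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "sku": ["A", "A", "A", "B"],
    "effective_on_date": pd.to_datetime(["2016-01-01"] * 4),
    "stock_status": ["in-stock", "out-of-stock", "in-stock", "in-stock"],
    "status_duration": pd.to_timedelta(["6h", "4h", "2h", "24h"]),
})

# Work in hours so the pivot aggregates plain floats.
toy["duration_hours"] = toy["status_duration"].dt.total_seconds() / 3600

# One row per (sku, day), one column per stock status, summed hours in each cell.
daily = toy.pivot_table(index=["sku", "effective_on_date"],
                        columns="stock_status",
                        values="duration_hours",
                        aggfunc="sum",
                        fill_value=0)
print(daily)
```

SKU `A` has two "in-stock" intervals on the same day, so their hours are added together; SKU `B` has no "out-of-stock" interval, so that cell is filled with 0.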

In [33]:
df.columns

Index(['sku', 'bulk_id', 'size_code', 'quantity_on_hand',
       'lifecycle_status_flag', 'expected_date', 'effective_on', 'expired_on',
       'effective_on_date', 'expired_on_date', 'stock_status',
       'status_duration'],
      dtype='object')