# Data Cleaning and Preprocessing Notebook

## Dataset Description:
**Medication Inventory Management Dataset - Michigan Medicine**

The dataset contains historical medication inventory and transaction data from Michigan Medicine, covering 3 hospitals, 40 outpatient locations, and over 120 clinics. It includes detailed information on inventory levels, transactions (picks, restocks, wastes, etc.), medication details, locations, and costs.

|                  |                                                     |
|--------------------------|---------------------------------------------------------------|
| **Estimated size**       | 3,807,314 rows and 32 columns (approximately 1GB)             |
| **Format**               | Comma Separated Values (.csv)                                 |
| **Location & Access**    | Access provided by the Pharmacy Manager (Michigan Medicine - Pharmacy Administration) |

### More About the Variables
| Column Name             | Description                                                                                             |
|:------------------------|:--------------------------------------------------------------------------------------------------------|
| **daily_inv_location**  | The physical location where inventory was counted on a given day.                                       |
| **daily_inv_date**      | The date of the inventory count.                                                                        |
| **isa_name**            | The name of the inventory storage area.                                                                 |
| **daily_inv_med_id**    | A unique identifier for the medication counted during inventory.                                        |
| **med_id_clean**        | A standardized version of the medication identifier. Often merges several suffixed variants of a med ID (e.g., bulk and non-bulk) into a single identifier. |
| **med_description**     | A brief description of the medication.                                                                  |
| **first_count_of_day**  | The initial count of the medication on the inventory date.                                              |
| **last_count_of_day**   | The final count of the medication on the inventory date.                                                |
| **next_daily_inv_date** | The date of the next transaction (and therefore inventory change).                                      |
| **calendar_dt**         | A specific calendar date for reference. Should always be present regardless of whether there is a transaction for the med on a given day. |
| **pick**                | The quantity of medication picked from inventory.                                                       |
| **cycle_count**         | The quantity of medication cycle counted.                                                               |
| **waste**               | The quantity of medication wasted or discarded.                                                         |
| **destock**             | The quantity of medication removed from active inventory.                                               |
| **batch_pick**          | The quantity of medication picked in a batch process.                                                   |
| **load**                | The quantity of medication loaded into a dispensing system.                                             |
| **inventory**           | The total quantity of medication processed using the inventory function.                                |
| **restock**             | The quantity of medication added back into inventory.                                                   |
| **current_inv_med_id**  | The current unique identifier for the medication in inventory.                                          |
| **current_inv_location**| The current location of the medication in inventory.                                                    |
| **current_inv_min**     | The minimum quantity of the medication to be kept in inventory. Inventory below this value will create an order for restock. |
| **current_inv_max**     | The maximum quantity of the medication to be kept in inventory. This is the order up to level.          |
| **current_inv_qoh**     | The current quantity on hand (QOH) of the medication.                                                   |
| **pref_ndc_vendor_name**| The preferred vendor's name for the medication's National Drug Code (NDC).                              |
| **pref_ndc_package_size**| The preferred package size associated with the medication's NDC.                                       |
| **pref_ndc**            | The preferred National Drug Code (NDC) for the medication.                                              |
| **scd_name**            | The Semantic Clinical Drug (SCD) name from RxNorm.                                                      |
| **scd_rxcui**           | The RxNorm Concept Unique Identifier (RxCUI) for the SCD.                                               |
| **in_min_name**         | The ingredient (IN) or multiple-ingredient (MIN) name from RxNorm.                                      |
| **in_min_rxcui**        | The RxNorm Concept Unique Identifier (RxCUI) for the IN or MIN.                                         |
| **source_description**  | A description of the med from the source system.                                                        |


### Key Variables
|                          |                                                                                                         |
|:-------------------------|:--------------------------------------------------------------------------------------------------------|
| **calendar_dt**       | This will be crucial for time-based analysis and seasonality detection                                  |
| **daily_inv_med_id**         | Unique medication identifier that also distinguishes whether the medication is sourced from a bulk or non-bulk container. |
| **med_id_clean**         | Standardized medication identifier, agnostic of source container size. This will be crucial for consistent analysis across different medication types |
| **pick**                 | Best surrogate for what left the inventory system. This will be key for demand forecasting              |
| **current_inv_min & current_inv_max** | These will be important for understanding whether individual medications are at a surplus, ideal, or scarce level of inventory. |

### Key Considerations

|                  |                                                     |
|--------------------------|---------------------------------------------------------------|
| **Geographic Scope:** | Single health system in Michigan, with a central pharmacy distributing to multiple locations       |
| **Business Objective:** | Forecast future demand for medications. Provide strategic recommendations on inventory management to help reduce waste while minimizing stockouts.       |
| **Key Metrics:** | Picks, days sales of inventory, stockout rate, scarce inventory rate, inventory turnover ratio, mean inventory, coefficient of variation of daily picks, and seasonality metrics       |
| **Potential Factors:** | Spikes in inventory followed by long drifts down may indicate bulk buys due to anticipated shortages |
| **Stockouts:** | Can be caused by supply shortages, but we don't have data on shortage timelines                                 |

## Setup and Data Loading:
### Import Libraries
To start data cleaning and manipulation, we'll import the helper libraries needed to process the data.

In [23]:
# Load black to help with notebook formatting
%load_ext nb_black

The nb_black extension is already loaded. To reload it, use:
  %reload_ext nb_black



In [24]:
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import skew, kurtosis

pd.set_option("display.max_columns", None)


### Load Dataset
Next, we are going to load the dataset and examine its structure, including its shape, column names, and data types.

In [25]:
raw_data = pd.read_csv(
    "../inputs/pharmacy_central_inventory_and_transactions_by_date.csv",
)
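
Several identifier columns mix numeric-looking and alphanumeric values (e.g., `015586` and `TEMP0391`), so pandas may infer inconsistent types while reading the file. A minimal sketch, assuming we want identifiers preserved as text, passes an explicit `dtype` mapping to `pd.read_csv`. The inline sample CSV is made up for illustration; only the column names come from the dataset, and this is not how the notebook itself loads the data.

```python
import io

import pandas as pd

# Hypothetical three-row sample; only the column names come from the dataset.
sample_csv = io.StringIO(
    "daily_inv_med_id,med_id_clean,pick\n"
    "015586,015586,-2\n"
    "034188bulk,034188,-10\n"
    "TEMP0391,TEMP0391,\n"
)

# Reading identifier columns as strings keeps leading zeros intact.
id_dtypes = {"daily_inv_med_id": "string", "med_id_clean": "string"}
sample = pd.read_csv(sample_csv, dtype=id_dtypes)

print(sample.dtypes)
print(sample["med_id_clean"].tolist())
```

With the default type inference, a column holding only values such as `015586` would be parsed as an integer and lose its leading zero, which is one plausible source of the ID mismatches examined later in this notebook.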


In [26]:
data = raw_data.copy()
print(f"Dataset shape: {data.shape}")
print("First few rows of the dataset:")
data.head(5)

Dataset shape: (3807314, 32)
First few rows of the dataset:


Unnamed: 0,daily_inv_location,daily_inv_date,isa_name,daily_inv_med_id,med_id_clean,med_description,first_count_of_day,last_count_of_day,next_daily_inv_date,calendar_dt,pick,cycle_count,waste,destock,batch_pick,load,inventory,restock,current_inv_med_id,current_inv_location,current_inv_min,current_inv_max,current_inv_qoh,reviewed_number,pref_ndc_vendor_name,pref_ndc_package_size,pref_ndc,scd_name,scd_rxcui,in_min_name,in_min_rxcui,source_description
0,2C224-01-01-01-01,2022-03-14,,1185,1185,sodium bicarbonate 8.4 % solution - 50 mL vial,21909.0,22909.0,2022-03-15,2022-03-14 00:00:00.000,-25.0,,,,,,,1000.0,,,,,,3.22,AB Short A,25.0,409662514.0,,,,,
1,2C224-01-01-01-01,2022-03-15,,1185,1185,sodium bicarbonate 8.4 % solution - 50 mL vial,22909.0,22509.0,2022-03-16,2022-03-15 00:00:00.000,-424.0,,,,,,,,,,,,,3.22,AB Short A,25.0,409662514.0,,,,,
2,2C224-01-01-01-01,2022-03-16,,1185,1185,sodium bicarbonate 8.4 % solution - 50 mL vial,22509.0,22509.0,2022-03-17,2022-03-16 00:00:00.000,-20.0,,,,,,,,,,,,,3.22,AB Short A,25.0,409662514.0,,,,,
3,2C224-01-01-01-01,2022-03-17,,1185,1185,sodium bicarbonate 8.4 % solution - 50 mL vial,22509.0,22009.0,2022-03-20,2022-03-17 00:00:00.000,-618.0,,,,,,,,,,,,,3.22,AB Short A,25.0,409662514.0,,,,,
4,2C224-01-01-01-01,2022-03-17,,1185,1185,sodium bicarbonate 8.4 % solution - 50 mL vial,22009.0,22009.0,2022-03-20,2022-03-18 00:00:00.000,,,,,,,,,,,,,,3.22,AB Short A,25.0,409662514.0,,,,,



In [27]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3807314 entries, 0 to 3807313
Data columns (total 32 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   daily_inv_location     object 
 1   daily_inv_date         object 
 2   isa_name               object 
 3   daily_inv_med_id       object 
 4   med_id_clean           object 
 5   med_description        object 
 6   first_count_of_day     float64
 7   last_count_of_day      float64
 8   next_daily_inv_date    object 
 9   calendar_dt            object 
 10  pick                   float64
 11  cycle_count            float64
 12  waste                  float64
 13  destock                float64
 14  batch_pick             float64
 15  load                   float64
 16  inventory              float64
 17  restock                float64
 18  current_inv_med_id     object 
 19  current_inv_location   object 
 20  current_inv_min        float64
 21  current_inv_max        float64
 22  current_inv_qoh   


## Data Cleaning and Preprocessing:
### Check for Duplicates and Identify Missing Values
Next, we will check for duplicate rows to ensure that each row is unique and avoid any redundancy in the dataset.

In [28]:
# Check for duplicate records
print(f"Number of duplicate records: {data.duplicated().sum()}")

Number of duplicate records: 0



In [29]:
# Check for missing values
data.isnull().sum()

daily_inv_location             0
daily_inv_date                 0
isa_name                 1344638
daily_inv_med_id               0
med_id_clean               23009
med_description            23009
first_count_of_day          4450
last_count_of_day           3658
next_daily_inv_date            0
calendar_dt                    0
pick                     2889223
cycle_count              3779489
waste                    3805317
destock                  3805036
batch_pick               3807314
load                     3804828
inventory                3777277
restock                  3494103
current_inv_med_id       1344638
current_inv_location     1344638
current_inv_min          1344638
current_inv_max          1344638
current_inv_qoh          1344638
reviewed_number          2834014
pref_ndc_vendor_name      236635
pref_ndc_package_size     236635
pref_ndc                  236635
scd_name                 2042786
scd_rxcui                2042786
in_min_name              1754990
in_min_rxc


### Fill in Missing or Inconsistent Inventory Min./Max. Levels

For every medication (`daily_inv_med_id`) there should be a specified inventory minimum (`current_inv_min`) and maximum (`current_inv_max`). Generally, a reorder is triggered automatically when a medication's quantity on hand reaches the minimum threshold. By default, the quantity reordered is whatever volume brings the quantity on hand back to the maximum threshold. However, this can be overridden by whoever is placing the order.

While a medication's min./max. values can change over time (albeit not often), the values in the dataset reflect the values at the time the data was queried (i.e., historical min./max. values are not maintained).

We'll use the following logic to fill in missing or inconsistent values for each `daily_inv_med_id` value:


*   If a medication has only one distinct min./max. value, that value will be used to backfill missing values.
*   If a medication has multiple distinct min./max. values, the most frequent value will be used in place of the inconsistent values and to backfill missing ones.
*   If a medication has no min./max. values:
    *   We'll use the 5th percentile of all `last_count_of_day` values to estimate the inventory minimum.
    *   We'll use the 95th percentile of all `first_count_of_day` values to estimate the inventory maximum.

This may raise the question, “Why use the 5th percentile of `last_count_of_day` values as an estimate for the inventory minimum and not the absolute minimum?” We chose to do so because we would expect the lowest `last_count_of_day` values to denote the days when a restock was delayed, thus sinking the inventory volume beneath the true inventory minimum. With this in mind, we opted to use the 5th percentile instead of the 0th percentile (i.e., the absolute minimum) to ensure we did not underestimate the true inventory minimum.

Similarly, we chose to use the 95th percentile of `first_count_of_day` values to estimate the maximum inventory level. We did so because we would expect the highest `first_count_of_day` values to denote the days when inventory was intentionally over-stocked, thus pushing the inventory volume above the true inventory maximum. With this in mind, we opted to use the 95th percentile instead of the 100th percentile (i.e., the absolute maximum) to ensure we did not overestimate the true inventory maximum.
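
To make the percentile reasoning concrete, here is a small sketch on made-up end-of-day counts (the numbers are hypothetical and not drawn from the dataset). A single delayed restock drags the absolute minimum down to zero, while the 5th percentile stays close to the level the inventory normally bottoms out at.

```python
import pandas as pd

# Hypothetical end-of-day counts for one medication: mostly 20 to 29 units,
# with a single day at 0 caused by a delayed restock.
last_counts = pd.Series([0] + list(range(20, 30)) * 2)

absolute_min = last_counts.min()
fifth_percentile = last_counts.quantile(0.05)

print(f"Absolute minimum: {absolute_min}")
print(f"5th percentile:   {fifth_percentile}")
```

Here the absolute minimum is 0, whereas the 5th percentile is 20.0, which is a more plausible estimate of the reorder threshold for this toy series.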

In [30]:
# We'll create a grouped df of all medications which tracks the following for both the min and max inventory levels:
# - Count of null values
# - Count of unique values
# - Most frequent value
# - 5th percentile of the last_count_of_day / 95th percentile of the first_count_of_day

min_max_ref = data.groupby("daily_inv_med_id").agg(
    min_nulls=("current_inv_min", lambda x: x.isna().sum()),  # Number of null values
    min_unique=(
        "current_inv_min",
        lambda x: x.nunique(dropna=True),
    ),  # Number of unique values excluding nulls
    min_most_freq=(
        "current_inv_min",
        lambda x: x.mode().iloc[0] if not x.mode().empty else None,
    ),  # Most frequent value
    last_ct_percentile=(
        "last_count_of_day",
        lambda x: x.quantile(0.05),
    ),  # 5th percentile
    max_nulls=("current_inv_max", lambda x: x.isna().sum()),  # Number of null values
    max_unique=(
        "current_inv_max",
        lambda x: x.nunique(dropna=True),
    ),  # Number of unique values excluding nulls
    max_most_freq=(
        "current_inv_max",
        lambda x: x.mode().iloc[0] if not x.mode().empty else None,
    ),  # Most frequent value
    first_ct_percentile=(
        "first_count_of_day",
        lambda x: x.quantile(0.95),
    ),  # 95th percentile
)

# Create new columns to store our "clean" values for the inventory min/max values
min_max_ref[["clean_current_inv_min", "clean_current_inv_max"]] = None, None

# Apply business rules to calculate new inventory minimum levels
min_max_ref["clean_current_inv_min"] = min_max_ref.apply(
    lambda row: (
        row["min_most_freq"]
        if pd.notna(row["min_most_freq"])
        else row["last_ct_percentile"]
    ),
    axis=1,
)

# Apply business rules to calculate new inventory maximum levels
min_max_ref["clean_current_inv_max"] = min_max_ref.apply(
    lambda row: (
        row["max_most_freq"]
        if pd.notna(row["max_most_freq"])
        else row["first_ct_percentile"]
    ),
    axis=1,
)

# View a sample of the dataframe we created to check the results
min_max_ref.sample(5)

Unnamed: 0_level_0,min_nulls,min_unique,min_most_freq,last_ct_percentile,max_nulls,max_unique,max_most_freq,first_ct_percentile,clean_current_inv_min,clean_current_inv_max
daily_inv_med_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
DERM112,755,1,2.0,0.0,755,1,4.0,4.0,2.0,4.0
P3414,881,0,,0.0,881,0,,0.0,0.0,0.0
3571221,0,1,15.0,16.0,0,1,57.0,56.0,15.0,57.0
TEMP021021Z,122,0,,0.0,122,0,,0.0,0.0,0.0
000346,0,1,300.0,43.0,0,1,400.0,367.0,300.0,400.0



In [31]:
# Now let's merge in the "clean" min / max values back to the main df
data = pd.merge(
    data,
    min_max_ref[["clean_current_inv_min", "clean_current_inv_max"]],
    left_on="daily_inv_med_id",
    right_index=True,
)
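
As an optional safeguard, the merge can be asked to verify its own assumptions. The sketch below uses hypothetical miniature stand-ins for `data` and `min_max_ref` (`toy_data` and `toy_ref` are illustrative names) and shows the `how` and `validate` parameters of `pd.merge`; it is a suggestion, not the configuration used above.

```python
import pandas as pd

# Hypothetical miniature versions of `data` and `min_max_ref`.
toy_data = pd.DataFrame(
    {"daily_inv_med_id": ["A1", "A1", "B2"], "pick": [-1.0, -3.0, None]}
)
toy_ref = pd.DataFrame(
    {"clean_current_inv_min": [2.0, 0.0], "clean_current_inv_max": [4.0, 10.0]},
    index=pd.Index(["A1", "B2"], name="daily_inv_med_id"),
)

merged = pd.merge(
    toy_data,
    toy_ref,
    how="left",  # keep every transaction row, even without a reference match
    left_on="daily_inv_med_id",
    right_index=True,
    validate="many_to_one",  # raise if the reference has duplicate medication IDs
)

# The merge should neither drop nor duplicate rows.
assert len(merged) == len(toy_data)
print(merged)
```

Because `min_max_ref` is built by grouping on `daily_inv_med_id`, every row in `data` should find exactly one match, so the row count before and after the merge is expected to be identical.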


## Data Standardization

### Standardize Medication IDs

Next, we will take a closer look at the missing values for `med_id_clean` since it will be crucial for consistent analysis across different medication types.

In [32]:
# Check identical and different values for 'daily_inv_med_id' and 'med_id_clean' for each row
identical_ids = data["daily_inv_med_id"] == data["med_id_clean"]
different_ids = data["daily_inv_med_id"] != data["med_id_clean"]

# Count how many are identical and how many are different
identical_count = identical_ids.sum()
different_count = different_ids.sum()

print(
    f"Number of rows where 'daily_inv_med_id' and 'med_id_clean' are identical: {identical_count}"
)
print(
    f"Number of rows where 'daily_inv_med_id' and 'med_id_clean' are different: {different_count}"
)

Number of rows where 'daily_inv_med_id' and 'med_id_clean' are identical: 3133789
Number of rows where 'daily_inv_med_id' and 'med_id_clean' are different: 673525



Overall, the majority of rows (3,133,789, about 82%) have identical values in `daily_inv_med_id` and `med_id_clean`. However, 673,525 rows (about 18%) differ, including rows where `med_id_clean` is missing.

Next, we will check for any inconsistencies and try to observe some common patterns between the two.

In [33]:
identical_rows = data[identical_ids]
print("Rows where 'daily_inv_med_id' and 'med_id_clean' are identical:")
print(identical_rows[["daily_inv_med_id", "med_id_clean"]].sample(10))

Rows where 'daily_inv_med_id' and 'med_id_clean' are identical:
        daily_inv_med_id med_id_clean
3050176        067043701    067043701
1386836           028090       028090
2420345         TEMP0391     TEMP0391
2776056         002766AA     002766AA
2854552        053713P02    053713P02
3397950           081454       081454
1340112           059081       059081
356608            039500       039500
1018352           021986       021986
292973            007894       007894



In [34]:
different_rows = data[different_ids]
print("Rows where 'daily_inv_med_id' and 'med_id_clean' are different:")
print(different_rows[["daily_inv_med_id", "med_id_clean"]].sample(20))

Rows where 'daily_inv_med_id' and 'med_id_clean' are different:
        daily_inv_med_id med_id_clean
1796952           015586        15586
2102174           040550        40550
1132025           005090         5090
542790            007801         7801
2468415         TEMP2010          NaN
1132222           039766        39766
976818            006858       6858.0
163118        009622BULK       009622
1901243       034188bulk        34188
1882685           071489        71489
1156733           006787         6787
1987506       023989bulk        23989
2126641       009323bulk         9323
2118209       004680bulk         4680
408998            011590        11590
1197904           000390          390
2045785       004654bulk       004654
1938001       004004bulk       004004
1158173           079924        79924
2338929           065227        65227



Although the `med_id_clean` column appears to be a standardized medication identifier, it has some inconsistencies: i) leading zeros are usually kept but sometimes removed; ii) data types are mixed, with most values stored as strings but some formatted as floats (e.g., `6858.0`); iii) some values are missing.

For the 673,525 rows where the two IDs differ, we'll derive a more consistent identifier from `daily_inv_med_id` by removing a trailing `bulk` suffix (in any letter case). Because `daily_inv_med_id` is stored as a string, this keeps leading zeros and avoids float-formatted values.

In [35]:
# Create a new 'cleaned_med_id' column, initialize it with 'med_id_clean' values
data["cleaned_med_id"] = data["med_id_clean"]

# Apply the cleaning operation only to rows that are different
data.loc[different_ids, "cleaned_med_id"] = data.loc[
    different_ids, "daily_inv_med_id"
].apply(lambda x: re.sub(r"bulk$", "", str(x), flags=re.IGNORECASE).strip())
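
To see what the substitution does in isolation, the sketch below wraps the same regular expression in a helper (`strip_bulk_suffix` is a hypothetical name introduced only for illustration) and runs it on a few IDs taken from the sample output above.

```python
import re


def strip_bulk_suffix(med_id):
    """Remove a trailing 'bulk' suffix (any letter case) and surrounding whitespace."""
    return re.sub(r"bulk$", "", str(med_id), flags=re.IGNORECASE).strip()


# IDs taken from the sample output above.
examples = ["034188bulk", "009622BULK", "015586", "TEMP2010"]
for raw_id in examples:
    print(f"{raw_id!r} -> {strip_bulk_suffix(raw_id)!r}")
```

One caveat of this ordering: the pattern is anchored with `$` and runs before `.strip()`, so an ID with whitespace after the suffix (for example `'004680bulk '`) would keep its suffix. None of the samples shown above have that shape, but it may be worth checking on the full dataset.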


In [36]:
# Let's confirm
different_rows = data[different_ids]
print("Rows where 'daily_inv_med_id' and 'med_id_clean' are different, after cleaning:")
print(different_rows[["daily_inv_med_id", "cleaned_med_id", "med_id_clean"]].sample(10))

Rows where 'daily_inv_med_id' and 'med_id_clean' are different, after cleaning:
        daily_inv_med_id cleaned_med_id med_id_clean
1783806       024484bulk         024484       024484
2056395       000509bulk         000509       000509
1140968           008854         008854         8854
406001            003389         003389         3389
3593540            P2566          P2566          NaN
404856            044186         044186        44186
2325495           065452         065452        65452
1855363       042943bulk         042943       042943
541711            074807         074807        74807
547346            004752         004752         4752



In [37]:
data[
    [
        "daily_inv_location",
        "daily_inv_med_id",
        "med_id_clean",
        "cleaned_med_id",
        "pick",
        "waste",
        "restock",
        "current_inv_min",
        "current_inv_max",
        "current_inv_qoh",
    ]
].isna().sum()

daily_inv_location          0
daily_inv_med_id            0
med_id_clean            23009
cleaned_med_id              0
pick                  2889223
waste                 3805317
restock               3494103
current_inv_min       1344638
current_inv_max       1344638
current_inv_qoh       1344638
dtype: int64


After standardization, `cleaned_med_id` has no missing values, compared with 23,009 in `med_id_clean`.

### Date & Numerical Feature Standardization
Some time-series features in the raw dataset are stored as dates, while others are stored as timestamps. We'll convert all three date features to pandas datetimes so that they share a single type.

Additionally, subtractions from the inventory (e.g., `pick` and `waste`) are represented as negative values. We'll store their absolute values in new columns (`abs_pick` and `abs_waste`).

In [38]:
# Convert date columns to datetime type
date_columns = ["daily_inv_date", "next_daily_inv_date", "calendar_dt"]
data[date_columns] = data[date_columns].apply(pd.to_datetime)

# Convert 'pick' and 'waste' to absolute values before aggregation
data["abs_pick"] = data["pick"].abs()
data["abs_waste"] = data["waste"].abs()
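
The effect of both steps can be checked on a tiny example. The rows below are hypothetical but mirror the two string formats seen in the raw data (a plain date and a midnight timestamp) and a missing `pick` value.

```python
import pandas as pd

# Hypothetical rows mirroring the two formats seen in the raw data:
# a plain date string and a timestamp string at midnight.
toy = pd.DataFrame(
    {
        "daily_inv_date": ["2022-03-14", "2022-03-15"],
        "calendar_dt": ["2022-03-14 00:00:00.000", "2022-03-15 00:00:00.000"],
        "pick": [-25.0, None],
    }
)

date_columns = ["daily_inv_date", "calendar_dt"]
toy[date_columns] = toy[date_columns].apply(pd.to_datetime)

# Once parsed, the two columns can be compared directly.
print((toy["daily_inv_date"] == toy["calendar_dt"]).all())

# abs() flips the sign of subtractions and leaves missing values as NaN.
toy["abs_pick"] = toy["pick"].abs()
print(toy["abs_pick"].tolist())
```

Missing picks stay missing after `abs()`; they only become zeros later, when the aggregation sums them.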


## Export Cleaned Timeseries Data
At this point, we've now completed the cleaning of our raw timeseries data. We'll now export a copy of this for local exploratory data analysis using Tableau and subsequent machine learning analyses.

In [39]:
data.to_csv("../outputs/cleaned_inventory_data.csv", index=False)


## Data Aggregation:

### Aggregate time-series data by `cleaned_med_id`

Up to this point, the data is disaggregated by:
1.   Individual calendar days
2.   Location (although those locations may be in very close proximity to one another, e.g., an adjacent storage room)
3.   What size container they're stored in (e.g., a container of 500 pills vs. a storage container with a small quantity of individual pills).

Our goal is to create an aggregated dataframe where each observation represents one day of inventory data for a unique medication by dosage amount and dosage form. To achieve this, we'll group the data by the `cleaned_med_id` value and perform aggregations on the key features of interest.

**Limitation**: a critical limitation of this approach is that pick values may represent medications being transferred from a bulk container to a non-bulk container. For example, if 500 pills of a given medication are picked from a bulk container and used to restock smaller containers across various locations, the data records this as a pick, whereas in reality it isn't *really* a pick: it's a redistribution of inventory. Given how we received the data, we are unable to differentiate these picks from picks in which medications *actually* leave the inventory system to be given to end users. Because of this, all `cleaned_med_id` values with both a bulk and a non-bulk `daily_inv_med_id` derivative may be susceptible to double-counting picks.
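
Although we cannot separate redistribution picks from true picks, we can at least list which medications are exposed to this risk. The sketch below is one possible approach on hypothetical ID pairs; `is_bulk` and `at_risk_ids` are illustrative names that do not appear elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical ID pairs; only the column names come from the notebook.
toy = pd.DataFrame(
    {
        "daily_inv_med_id": ["004680bulk", "004680", "015586", "023989bulk"],
        "cleaned_med_id": ["004680", "004680", "015586", "023989"],
    }
)

# Mark source IDs that carry the bulk suffix (case-insensitive).
toy["is_bulk"] = toy["daily_inv_med_id"].str.contains(r"bulk$", case=False, regex=True)

# A cleaned ID is at risk when it maps to both bulk and non-bulk source IDs,
# i.e. when its group contains both True and False flags.
flag_variety = toy.groupby("cleaned_med_id")["is_bulk"].nunique()
at_risk_ids = flag_variety[flag_variety > 1].index.tolist()

print(at_risk_ids)
```

In this toy data only `004680` is flagged, because it is the only cleaned ID with both a bulk and a non-bulk source. Such a list could be used to caveat or exclude affected medications in downstream demand forecasts.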

In [40]:
df_agg = (
    data.groupby(["cleaned_med_id", "calendar_dt"])
    .agg(
        {
            "abs_pick": "sum",
            "clean_current_inv_min": "sum",
            "clean_current_inv_max": "sum",
            "first_count_of_day": "sum",
            "last_count_of_day": "sum",
        }
    )
    .reset_index()
)

df_agg.columns = [
    "cleaned_med_id",
    "calendar_dt",
    "pick",
    "clean_inv_min",
    "clean_inv_max",
    "first_count_of_day",
    "last_count_of_day",
]

df_agg.head(5)

Unnamed: 0,cleaned_med_id,calendar_dt,pick,clean_inv_min,clean_inv_max,first_count_of_day,last_count_of_day
0,231,2022-03-14,5.0,0.0,48.0,0.0,25.0
1,231,2022-03-15,0.0,0.0,48.0,25.0,25.0
2,231,2022-03-16,0.0,0.0,48.0,25.0,25.0
3,231,2022-03-17,0.0,0.0,48.0,25.0,25.0
4,231,2022-03-18,0.0,0.0,48.0,25.0,25.0



### Calculate Medication-level Inventory KPIs
We'll now create a new dataframe that further aggregates the grouped `cleaned_med_id` time-series data so that we have one row per unique medication. For each medication value we'll track:

1.   Minimum inventory level
2.   Maximum inventory level
3.   Mean first count of day
4.   Mean last count of day
5.   Total picks
6.   Mean picks per day
7.   Median picks per day
8.   Variance of picks per day
9.   Standard deviation of picks per day
10.   Range of picks per day
11.   Interquartile range of picks per day
12.   Skewness of picks per day
13.   Kurtosis of picks per day
14.   Share of days with non-zero picks (stored as a fraction between 0 and 1)
15.   Number of days with data present (even if there was no movement)
16.   Number of days with a stockout (defined as `last_count_of_day` equal to 0)
17.   Number of days with scarce inventory (defined as `last_count_of_day` < `clean_inv_min`)
18.   Number of days with excess inventory (defined as `last_count_of_day` > `clean_inv_max`)

In [41]:
def calculate_statistics(group):
    pick_count = group["pick"].notnull() & (group["pick"] != 0)
    stockout = group["last_count_of_day"] == 0
    below_min = group["last_count_of_day"] < group["clean_inv_min"]
    above_max = group["last_count_of_day"] > group["clean_inv_max"]

    return pd.Series(
        {
            "inv_min": group["clean_inv_min"].mean(),
            "inv_max": group["clean_inv_max"].mean(),
            "mean_first_count": group["first_count_of_day"].mean(),
            "mean_last_count": group["last_count_of_day"].mean(),
            "total_picks": group["pick"].sum(),
            "mean_picks": group["pick"].mean(),
            "median_picks": group["pick"].median(),
            "variance_picks": group["pick"].var(),
            "std_dev_picks": group["pick"].std(),
            "range_picks": group["pick"].max() - group["pick"].min(),
            "iqr_picks": group["pick"].quantile(0.75) - group["pick"].quantile(0.25),
            "skewness_picks": skew(group["pick"]),
            "kurtosis_picks": kurtosis(group["pick"]),
            "percent_days_w_pick": pick_count.sum() / group["calendar_dt"].count(),
            "days_of_data": group["calendar_dt"].count(),
            "days_w_stockout": stockout.sum(),
            "days_w_scarce_inv": below_min.sum(),
            "days_w_excess_inv": above_max.sum(),
        }
    )


In [42]:
# Iterate through each medication and execute the calculate_statistics function
summary_df = df_agg.groupby("cleaned_med_id").apply(calculate_statistics).reset_index()


With the aggregate statistics now available, we'll engineer some additional features:


*   **Mean Inventory** - Measures the average volume on hand across the period of data (the average of the mean first and mean last counts of day).
*   **Inventory Turnover Ratio** - Measures the number of times the average volume on hand is exhausted over the period of data (total picks divided by mean inventory). A higher ratio implies efficient inventory management. A lower ratio may indicate overstocking.
*   **Days Sales of Inventory (DSI)** - Measures the average number of days it takes to exhaust the average inventory level (mean inventory divided by mean picks per day). A lower DSI indicates quick inventory turnover and efficient inventory management. A higher DSI suggests slower inventory turnover.
*   **Stockout, Scarce Inventory, and Excess Inventory Rates** - Measure the fraction of days on which there was, respectively, a stockout, scarce inventory, or excess inventory.
*   **Coefficient of Variation** - Normalizes the standard deviation of daily picks by the mean daily picks.
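
As a worked example of these definitions, the sketch below computes the ratios for two hypothetical medications (`med_a` and `med_b`, with made-up values). It uses column arithmetic rather than row-wise `apply`, which is an alternative formulation under the same definitions, and replaces zero denominators with `NaN` so that undefined ratios stay undefined.

```python
import numpy as np
import pandas as pd

# Hypothetical summary rows for two medications; the second has no picks.
toy = pd.DataFrame(
    {
        "mean_first_count": [110.0, 40.0],
        "mean_last_count": [90.0, 40.0],
        "total_picks": [500.0, 0.0],
        "mean_picks": [5.0, 0.0],
        "std_dev_picks": [10.0, 0.0],
    },
    index=["med_a", "med_b"],
)

toy["mean_inventory"] = (toy["mean_first_count"] + toy["mean_last_count"]) / 2

# Replacing zero denominators with NaN leaves the ratio undefined instead of infinite.
toy["inv_turnover_ratio"] = toy["total_picks"] / toy["mean_inventory"].replace(0, np.nan)
toy["days_sales_inventory"] = toy["mean_inventory"] / toy["mean_picks"].replace(0, np.nan)
toy["coef_var"] = toy["std_dev_picks"] / toy["mean_picks"].replace(0, np.nan)

print(toy[["mean_inventory", "inv_turnover_ratio", "days_sales_inventory", "coef_var"]])
```

For `med_a`, the mean inventory is 100, so the turnover ratio is 500 / 100 = 5, the DSI is 100 / 5 = 20 days, and the coefficient of variation is 10 / 5 = 2. For `med_b`, which has no picks, the DSI and coefficient of variation are undefined.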

In [43]:
summary_df["mean_inventory"] = summary_df.apply(
    lambda x: ((x.mean_first_count + x.mean_last_count) / 2), axis=1
)
summary_df["inv_turnover_ratio"] = summary_df.apply(
    lambda x: x.total_picks / x.mean_inventory if x.mean_inventory != 0 else np.nan,
    axis=1,
)
summary_df["days_sales_inventory"] = summary_df.apply(
    lambda x: x.mean_inventory / x.mean_picks if x.mean_picks != 0 else np.nan, axis=1
)
summary_df["stockout_rate"] = summary_df.apply(
    lambda x: x.days_w_stockout / x.days_of_data, axis=1
)
summary_df["excess_rate"] = summary_df.apply(
    lambda x: x.days_w_excess_inv / x.days_of_data, axis=1
)
summary_df["scarce_rate"] = summary_df.apply(
    lambda x: x.days_w_scarce_inv / x.days_of_data, axis=1
)
summary_df["coef_var"] = summary_df.apply(
    lambda x: x.std_dev_picks / x.mean_picks if x.mean_picks != 0 else np.nan, axis=1
)

summary_df.set_index("cleaned_med_id", drop=True, inplace=True)
summary_df.describe()

Unnamed: 0,inv_min,inv_max,mean_first_count,mean_last_count,total_picks,mean_picks,median_picks,variance_picks,std_dev_picks,range_picks,iqr_picks,skewness_picks,kurtosis_picks,percent_days_w_pick,days_of_data,days_w_stockout,days_w_scarce_inv,days_w_excess_inv,mean_inventory,inv_turnover_ratio,days_sales_inventory,stockout_rate,excess_rate,scarce_rate,coef_var
count,4593.0,4593.0,4593.0,4593.0,4593.0,4593.0,4593.0,4592.0,4592.0,4593.0,4593.0,3795.0,3795.0,4593.0,4593.0,4593.0,4593.0,4593.0,4593.0,3915.0,3795.0,4593.0,4593.0,4593.0,3795.0
mean,108.462499,207.126687,545.26623,544.660901,12625.75,15.778952,5.194426,1902345.0,152.162864,3987.915306,8.745972,9.095319,149.107123,0.243645,683.247115,119.734814,109.490529,190.25626,544.963566,629.9139,1396.594,0.259409,0.251089,0.140845,6.604906
std,910.460055,1417.151325,8648.447441,8639.612309,81561.08,100.288268,34.468183,30507990.0,1370.985439,38605.305442,103.707385,7.328474,210.110974,0.316009,230.915142,210.649,179.552887,223.760386,8644.025455,18978.8,44282.77,0.401275,0.291911,0.227393,6.777938
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.117382,-1.417104,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.195339
25%,0.0,0.0,2.150062,1.982895,8.0,0.013819,0.0,0.0261651,0.161756,2.0,0.0,3.669349,16.425501,0.006258,676.0,1.0,0.0,2.0,2.034314,4.642522,13.40504,0.001218,0.003645,0.0,2.0613
50%,6.0,17.0,21.212766,21.064304,243.0,0.3382,0.0,2.261757,1.503914,18.0,0.0,6.626117,55.946768,0.084158,812.0,8.0,18.0,98.0,21.077491,18.98519,32.62819,0.010936,0.128797,0.02445,3.817028
75%,40.0,111.0,101.309002,101.209246,2364.0,3.17983,0.0,56.03201,7.485452,83.0,2.0,12.503739,180.254633,0.38848,822.0,140.0,141.0,315.0,101.290754,50.01896,102.2619,0.474201,0.411192,0.181929,8.533286
max,32000.0,50000.0,364342.823815,364291.623329,2813873.0,3423.203163,1531.0,1117813000.0,33433.706056,959166.0,5315.0,28.635663,818.001217,1.0,823.0,823.0,823.0,823.0,364317.223572,1141090.0,2631315.0,1.0,1.0,1.0,28.687977



## Export Medication-level Data
We'll now export a copy of the data frame containing the aggregate summary statistics. This will be used for local exploratory data analysis using Tableau and subsequent machine learning analyses.

In [44]:
summary_df.to_csv(
    "../outputs/inventory_summary_statistics_wo_anomalies.csv", index=True
)
