# Online Retail Dataset: Data Preparation

In this notebook, I'll prepare the dataset for analysis.

## Imports

In [1]:
from pathlib import Path
from typing import cast

import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal

## Read dataset

In [2]:
# File path for dataset
file_path = Path.cwd().parents[1] / "data" / "online_retail.xlsx"
assert file_path.exists(), f"file doesn't exist: {file_path}"
assert file_path.is_file(), f"not a file: {file_path}"

In [3]:
# Columns I'll actually use
cols = ["InvoiceNo", "InvoiceDate", "CustomerID", "Quantity", "UnitPrice"]

df = pd.read_excel(
    file_path,
    usecols=cols,
    dtype={col: object for col in cols},
).loc[:, cols]
df = cast(pd.DataFrame, df)

In [4]:
df.head(10)

   InvoiceNo          InvoiceDate  CustomerID  Quantity  UnitPrice
0     536365  2010-12-01 08:26:00       17850         6       2.55
1     536365  2010-12-01 08:26:00       17850         6       3.39
2     536365  2010-12-01 08:26:00       17850         8       2.75
3     536365  2010-12-01 08:26:00       17850         6       3.39
4     536365  2010-12-01 08:26:00       17850         6       3.39
5     536365  2010-12-01 08:26:00       17850         2       7.65
6     536365  2010-12-01 08:26:00       17850         6       4.25
7     536366  2010-12-01 08:28:00       17850         6       1.85
8     536366  2010-12-01 08:28:00       17850         6       1.85
9     536367  2010-12-01 08:34:00       13047        32       1.69


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 5 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    541909 non-null  object        
 1   InvoiceDate  541909 non-null  datetime64[ns]
 2   CustomerID   406829 non-null  object        
 3   Quantity     541909 non-null  object        
 4   UnitPrice    541909 non-null  object        
dtypes: datetime64[ns](1), object(4)
memory usage: 20.7+ MB


## Data cleaning

In [6]:
# Missing values
df.isna().sum()

InvoiceNo           0
InvoiceDate         0
CustomerID     135080
Quantity            0
UnitPrice           0
dtype: int64

I need to know who bought what, so rows with a missing `CustomerID` have to
go.

In [7]:
df = df.dropna()
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 406829 entries, 0 to 541908
Data columns (total 5 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    406829 non-null  object        
 1   InvoiceDate  406829 non-null  datetime64[ns]
 2   CustomerID   406829 non-null  object        
 3   Quantity     406829 non-null  object        
 4   UnitPrice    406829 non-null  object        
dtypes: datetime64[ns](1), object(4)
memory usage: 18.6+ MB
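
Called without arguments, `dropna()` removes a row if any of its columns is missing. Here that gives the same result as targeting `CustomerID`, because it is the only column with missing values. The following is a minimal sketch, on made-up toy data, of how `subset` makes that intent explicit (the values below are hypothetical, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): one missing CustomerID, one missing UnitPrice
toy = pd.DataFrame(
    {
        "InvoiceNo": ["A1", "A2", "A3"],
        "CustomerID": [17850, np.nan, 13047],
        "UnitPrice": [2.55, 3.39, np.nan],
    }
)

# Drops every row that has a missing value in any column
dropped_any = toy.dropna()

# Drops only the rows whose CustomerID is missing
dropped_customer = toy.dropna(subset=["CustomerID"])

print(len(dropped_any), len(dropped_customer))
```

With `subset`, a row that lacks only a price would survive, which may or may not be what is wanted.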


In [8]:
# Look for invalid quantities
(df["Quantity"] <= 0).sum()

8905

Not every row corresponds to a sale. When the invoice number starts with "C",
that transaction was canceled. That explains the observations with
non-positive quantities.

In [9]:
df["InvoiceNo"].astype(str).str.startswith("C").sum()

8905
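
Matching counts alone don't guarantee that the two conditions select the same rows. A row-by-row comparison of the two masks would settle it. Below is a minimal sketch of that comparison on hypothetical toy data (the invoice numbers are made up for illustration):

```python
import pandas as pd

# Toy data (hypothetical): two regular invoices and two canceled ones
toy = pd.DataFrame(
    {
        "InvoiceNo": [536365, "C536379", 536366, "C536383"],
        "Quantity": [6, -1, 8, -12],
    }
)

is_canceled = toy["InvoiceNo"].astype(str).str.startswith("C")
is_non_positive = toy["Quantity"] <= 0

# True only if both masks flag exactly the same rows
same_rows = bool((is_canceled == is_non_positive).all())
print(same_rows)
```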

I chose to remove those rows:

In [10]:
df = df.loc[df["Quantity"] > 0, :]
df = cast(pd.DataFrame, df)

# Quick check
assert df["InvoiceNo"].astype(str).str.startswith("C").sum() == 0, "there are remaining canceled transactions"

In [11]:
# Look for invalid prices
(df["UnitPrice"] == 0.0).sum()

40

I can't explain these values. They are few and add nothing to an invoice
total, so I chose to drop them:

In [12]:
df = df.loc[df["UnitPrice"] != 0.0, :]
df = cast(pd.DataFrame, df)

## Finish fixing dataset

In [13]:
df = df.reset_index(drop=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 397884 entries, 0 to 397883
Data columns (total 5 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    397884 non-null  object        
 1   InvoiceDate  397884 non-null  datetime64[ns]
 2   CustomerID   397884 non-null  object        
 3   Quantity     397884 non-null  object        
 4   UnitPrice    397884 non-null  object        
dtypes: datetime64[ns](1), object(4)
memory usage: 15.2+ MB


In [14]:
# Use appropriate data types
df["InvoiceNo"] = df["InvoiceNo"].astype("category")
df["CustomerID"] = df["CustomerID"].astype("category")
df["Quantity"] = df["Quantity"].astype(np.int_)
df["UnitPrice"] = df["UnitPrice"].astype(np.float64)

df.dtypes

InvoiceNo            category
InvoiceDate    datetime64[ns]
CustomerID           category
Quantity                int64
UnitPrice             float64
dtype: object

The only part of `InvoiceDate` that matters is the date. The following
command sets all the times to midnight:

In [15]:
df["InvoiceDate"] = df["InvoiceDate"].dt.normalize()
df["InvoiceDate"].head()

0   2010-12-01
1   2010-12-01
2   2010-12-01
3   2010-12-01
4   2010-12-01
Name: InvoiceDate, dtype: datetime64[ns]
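
`dt.normalize()` keeps the datetime dtype and only resets the time of day, so two purchases made on the same day compare equal afterwards. A small sketch on two timestamps from the first rows of the dataset:

```python
import pandas as pd

stamps = pd.Series(pd.to_datetime(["2010-12-01 08:26:00", "2010-12-01 08:34:00"]))
days = stamps.dt.normalize()

# The dtype is still a datetime type; only the time-of-day component is reset
print(days.dtype)
print(days.nunique())
```

If plain `datetime.date` objects were needed instead, `dt.date` would be the alternative, at the cost of falling back to the `object` dtype.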

For convenience, I'll collect the essential parts of the code above into a
function:

In [16]:
def get_clean_data(file_path: Path) -> pd.DataFrame:
    """Read the Excel file and return the cleaned transaction rows."""
    cols = ["InvoiceNo", "InvoiceDate", "CustomerID", "Quantity", "UnitPrice"]
    df = pd.read_excel(file_path, usecols=cols, dtype={col: object for col in cols}).loc[:, cols]
    df = cast(pd.DataFrame, df)

    # Drop rows with missing values (only `CustomerID` has any)
    df = df.dropna()

    # Drop canceled transactions (non-positive quantities)
    df = df.loc[df["Quantity"] > 0, :]
    df = cast(pd.DataFrame, df)

    # Drop rows with a zero unit price
    df = df.loc[df["UnitPrice"] != 0.0, :]
    df = cast(pd.DataFrame, df)

    df = df.reset_index(drop=True)

    # Use appropriate data types
    df["InvoiceNo"] = df["InvoiceNo"].astype("category")
    df["CustomerID"] = df["CustomerID"].astype("category")
    df["Quantity"] = df["Quantity"].astype(np.int_)
    df["UnitPrice"] = df["UnitPrice"].astype(np.float64)

    # Keep only the date part of `InvoiceDate`
    df["InvoiceDate"] = df["InvoiceDate"].dt.normalize()

    return df

In [17]:
# Quick check
df_func = get_clean_data(file_path)
assert_frame_equal(df_func, df)
del df_func

## Aggregate data

Before aggregating the data, I'll do some more consistency tests. Rows with
the same `InvoiceNo` must also have the same `InvoiceDate`. For a specific
value of `InvoiceNo`, this can be tested as follows:

In [18]:
df.loc[df["InvoiceNo"] == 536365, "InvoiceDate"].nunique() == 1

True

To test all values of `InvoiceNo`, one can do the following:

In [19]:
df.groupby(by="InvoiceNo", observed=True).InvoiceDate.nunique().eq(1).all()

True

Similarly, rows with the same `InvoiceNo` must also have the same
`CustomerID`. Checking if this is true:

In [20]:
# Single value
df.loc[df["InvoiceNo"] == 536365, "CustomerID"].nunique() == 1

True

In [21]:
# All values
df.groupby(by="InvoiceNo", observed=True).CustomerID.nunique().eq(1).all()

True
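
Had one of these checks returned `False`, the next step would be to find the offending invoices. Here is a minimal sketch of how that could look, on hypothetical toy data in which one invoice is deliberately inconsistent:

```python
import pandas as pd

# Toy data (hypothetical): invoice 2 is assigned to two different customers
toy = pd.DataFrame(
    {
        "InvoiceNo": [1, 1, 2, 2],
        "CustomerID": [10, 10, 20, 21],
    }
)

customers_per_invoice = toy.groupby("InvoiceNo")["CustomerID"].nunique()

# Invoices that break the "one customer per invoice" rule
bad_invoices = customers_per_invoice[customers_per_invoice > 1].index.tolist()
print(bad_invoices)
```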

Both checks pass, so I'll compute the total amount spent for each
`InvoiceNo`.

In [22]:
# Figuring out how to do it
tmp_df = df.loc[df["InvoiceNo"] == 536365, :]
tmp_df = cast(pd.DataFrame, tmp_df)
tmp_df

   InvoiceNo  InvoiceDate  CustomerID  Quantity  UnitPrice
0     536365   2010-12-01       17850         6       2.55
1     536365   2010-12-01       17850         6       3.39
2     536365   2010-12-01       17850         8       2.75
3     536365   2010-12-01       17850         6       3.39
4     536365   2010-12-01       17850         6       3.39
5     536365   2010-12-01       17850         2       7.65
6     536365   2010-12-01       17850         6       4.25


In [23]:
tmp_row = pd.Series(
    data={
        "InvoiceDate": tmp_df["InvoiceDate"].iloc[0],
        "CustomerID": tmp_df["CustomerID"].iloc[0],
        "TotalPrice": (tmp_df["Quantity"] * tmp_df["UnitPrice"]).sum(),
    }
)
tmp_row

InvoiceDate    2010-12-01 00:00:00
CustomerID                   17850
TotalPrice                  139.12
dtype: object

In [24]:
del tmp_df
del tmp_row

In [25]:
# Actual calculation
def compute_total_price(df_group: pd.DataFrame) -> pd.Series:
    """Summarize one invoice: its date, its customer, and the total amount spent."""
    return pd.Series(
        data={
            "InvoiceDate": df_group["InvoiceDate"].iloc[0],
            "CustomerID": df_group["CustomerID"].iloc[0],
            "TotalPrice": (df_group["Quantity"] * df_group["UnitPrice"]).sum(),
        }
    )


df_total = df.groupby(by="InvoiceNo", observed=True).apply(compute_total_price).reset_index()

In [26]:
df_total.head()

   InvoiceNo  InvoiceDate  CustomerID  TotalPrice
0     536365   2010-12-01       17850      139.12
1     536366   2010-12-01       17850        22.2
2     536367   2010-12-01       13047      278.73
3     536368   2010-12-01       13047       70.05
4     536369   2010-12-01       13047       17.85


In [27]:
df_total.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18532 entries, 0 to 18531
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   InvoiceNo    18532 non-null  category      
 1   InvoiceDate  18532 non-null  datetime64[ns]
 2   CustomerID   18532 non-null  int64         
 3   TotalPrice   18532 non-null  float64       
dtypes: category(1), datetime64[ns](1), float64(1), int64(1)
memory usage: 1.1 MB
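
`groupby(...).apply` calls a Python function once per invoice, which can get slow with many groups. One possible alternative, sketched here on a few toy rows borrowed from the first two invoices, computes the per-row amount first and then uses named aggregation. The column name `LineTotal` is an assumption introduced only for this sketch, and the resulting dtypes may differ from those of the `apply` version:

```python
import pandas as pd

# Toy rows: two lines from invoice 536365 and both lines of invoice 536366
toy = pd.DataFrame(
    {
        "InvoiceNo": [536365, 536365, 536366, 536366],
        "InvoiceDate": pd.to_datetime(["2010-12-01"] * 4),
        "CustomerID": [17850, 17850, 17850, 17850],
        "Quantity": [6, 8, 6, 6],
        "UnitPrice": [2.55, 2.75, 1.85, 1.85],
    }
)

toy_total = (
    toy.assign(LineTotal=toy["Quantity"] * toy["UnitPrice"])  # hypothetical helper column
    .groupby("InvoiceNo", observed=True)
    .agg(
        InvoiceDate=("InvoiceDate", "first"),
        CustomerID=("CustomerID", "first"),
        TotalPrice=("LineTotal", "sum"),
    )
    .reset_index()
)
print(toy_total)
```

Taking the first `InvoiceDate` and `CustomerID` of each group is safe only because the consistency checks above confirmed a single value per invoice.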


## Save prepared data

The aggregated dataset is much smaller than the original: 18,532 invoices
versus 541,909 rows. To avoid having to repeat the above steps, I'll save the
new `DataFrame` to a CSV file.

In [28]:
# File path for output CSV
out_file = file_path.parent / "online_retail.csv"

df_total.to_csv(out_file, index=False)
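
A CSV file stores no dtype information, so `InvoiceDate` comes back as plain strings unless it is parsed on load. Below is a minimal sketch of reading the data back, using an in-memory buffer as a stand-in for the saved file:

```python
import io

import pandas as pd

# Stand-in for the saved file: the first two rows of the aggregated data
csv_text = (
    "InvoiceNo,InvoiceDate,CustomerID,TotalPrice\n"
    "536365,2010-12-01,17850,139.12\n"
    "536366,2010-12-01,17850,22.2\n"
)

# parse_dates restores the datetime dtype that CSV cannot store
loaded = pd.read_csv(io.StringIO(csv_text), parse_dates=["InvoiceDate"])
print(loaded.dtypes)
```

`InvoiceNo` and `CustomerID` are read back as integers here, so any categorical conversion would have to be reapplied after loading.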

## Summarizing through a function

To conclude, I'll wrap everything done in this notebook into a single
function.

In [29]:
def prepare_and_save_data(file_path: Path) -> None:
    """Clean the Excel data, aggregate it per invoice, and save it as CSV next to the input."""
    clean_data = get_clean_data(file_path)
    aggregated_data = clean_data.groupby("InvoiceNo", observed=True).apply(compute_total_price).reset_index()
    out_file = file_path.with_suffix(".csv")
    aggregated_data.to_csv(out_file, index=False)
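
`Path.with_suffix` replaces only the final extension, so it yields the same location as the `out_file` built earlier from `file_path.parent`. A quick sketch with a relative path (the `data` directory here is just a placeholder):

```python
from pathlib import Path

file_path = Path("data") / "online_retail.xlsx"

via_suffix = file_path.with_suffix(".csv")
via_parent = file_path.parent / "online_retail.csv"

# Both expressions point to the same output file
print(via_suffix == via_parent)
```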

In [30]:
# prepare_and_save_data(Path.cwd().parents[1] / "data" / "online_retail.xlsx")