# Data Prep - Raw Data
This notebook documents my work on the [UK High-value Customers dataset](https://www.kaggle.com/vik2012kvs/high-value-customers-identification) from a Data Engineering and Feature Engineering perspective.

An essential part of any Data Science project is understanding the data that we are working with. I will use this workspace to identify the major characteristics of the dataset, identify its granularity and address issues with it. I will also process the data in a way that can be used in further analysis. 

The plan for this part of the project is the following:

1. Process the raw dataset so that it can be realibly used for modelling and analysis;
2. Split the raw dataset into different views that are relevant to the context of Customer Segmentation and Ecommerce companies;
3. Prepare the dataset for downstream tasks;

In [1]:
!pip install inflection nb_black >> ../configs/dependencies/package_installation.txt

In [2]:
# loading magic commands:
%load_ext nb_black
%load_ext autoreload
%autoreload 2

<IPython.core.display.Javascript object>

In [4]:
###### Loading the necessary libraries #########

# PySpark dependencies:
import pyspark
from pyspark import SparkConf
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
import pyspark.sql.functions as F
from pyspark.sql.functions import udf
import pyspark.sql.types as Ts
from pyspark.sql.window import Window

# database utilities:
import pandas as pd

# other relevant libraries:
import warnings
import inflection
import unicodedata
from datetime import datetime, timedelta
import json
import re
import os
from glob import glob
import shutil
import itertools

# setting global parameters for visualizations:
warnings.filterwarnings("ignore")
pd.set_option("display.precision", 4)
pd.set_option("display.float_format", lambda x: "%.2f" % x)

<IPython.core.display.Javascript object>

# 0. Configuring Spark

In [5]:
# loading the configurations needed for Spark
def init_spark(app_name):

    spark = (
        SparkSession.builder.appName(app_name)
        .config("spark.files.overwrite", "true")
        .config("spark.sql.repl.eagerEval.enabled", True)
        .config("spark.sql.repl.eagerEval.maxNumRows", 5)
        .config("spark.sql.legacy.timeParserPolicy", "LEGACY")
        .config("spark.sql.parquet.compression.codec", "gzip")
        .enableHiveSupport()
        .getOrCreate()
    )

    return spark


# init the spark session:
spark = init_spark("Raw Data Preparation")

<IPython.core.display.Javascript object>

In [7]:
# verifying the spark session:
spark

<IPython.core.display.Javascript object>

# 1. Utility Functions

In [22]:
def save_to_filesystem(df, target_path, parquet_path, filename):
    """Helper function to save pyspark dataframes as parquets in a way that is similar to writing to local files

    Args:
        df (pyspark.sql.dataframe.DataFrame): dataframe to be saved
        target_path (str): path that will store the file
        filename (str): name of the resulting file

    Returns:
        None
    """
    PARQUET_FILE = f"{target_path}/{parquet_path}"
    OUTPUT_FILE = f"{target_path}/{filename}"

    if os.path.exists(PARQUET_FILE):
        shutil.rmtree(
            PARQUET_FILE
        )  # if the directory already exists, remove it (throws error if not)

    # saves the dataframe:
    df.coalesce(1).write.save(PARQUET_FILE)

    # retrieves file resulting from the saving procedure:
    original_file = glob(f"{PARQUET_FILE}/*.parquet")[0]

    # renames the resulting file and saves it to the target directory:
    os.rename(original_file, OUTPUT_FILE)

    shutil.rmtree(PARQUET_FILE)

    return True


def apply_category_map(category_map):
    """Helper function to convert strings given a map

    Note:
        This function uses the function generator scheme, much like the PySpark code

    Args:
        original_category (str): the original category name
        category_map (dict): the hash table or dictionary for converting the values:

    Returns:
        new_category (str): the resulting category

    """

    def func(row):
        try:
            result = category_map[row]
        except:
            result = None
        return result

    return F.udf(func)


def get_datetime_features(df, time_col):
    """Function to extract time-based features from pyspark dataframes

    Args:
        df (pyspark.sql.dataframe.DataFrame): the original dataframe that needs to be enriched
        time_col (str): the string name of the column containing the date object

    Returns:
        df (pyspark.sql.dataframe.DataFrame): resulting pyspark dataframe with the added features
            -> See list of attribute the source code for the attributes

    """

    # applying date-related functions:

    # day-level attributes:
    df = df.withColumn("day_of_week", F.dayofweek(F.col(time_col)))

    df = df.withColumn("day_of_month", F.dayofmonth(F.col(time_col)))

    df = df.withColumn("day_of_year", F.dayofyear(F.col(time_col)))

    # week-level attributes:
    df = df.withColumn("week_of_year", F.weekofyear(F.col(time_col)))

    # month-level attributes:
    df = df.withColumn("month", F.month(F.col(time_col)))

    df = df.withColumn("quarter", F.quarter(F.col(time_col)))

    # year-level attributes:
    df = df.withColumn("year", F.year(F.col(time_col)))

    return df


def bulk_aggregate(df, group_col, aggs, target_cols):
    """Wrapper function to apply multiple aggregations when performing group bys

    It utilizes the spark's SQL Context and string interpolation to perform the aggregation using SQL syntax.

    Args:
        df (pyspark.sql.dataframe.DataFrame): dataframe with raw data
        group_col (str): the column that will be used for grouping
        aggs (list): list of aggregations that want to be made (must be the same name as pyspark.sql.functions)
        target_cols (str): columns in which aggregations will be performed

    Returns:
        df_grouped (pyspark.sql.dataframe.DataFrame): dataframe with the grouped data
    """

    # buils the cartersian product of the lists
    aggs_to_perform = itertools.product(aggs, target_cols)

    Q_LAYOUT = """
    SELECT
        {},
        {}
        FROM df
        GROUP BY {}
    """

    aggregations = []
    for agg, col in aggs_to_perform:

        # builds the string for aggregation
        statement = f"{agg.upper()}({col}) as {agg}_{col}"
        aggregations.append(statement)

    full_statement = ",\n".join(aggregations)

    # uses string interpolation to build the full query statement
    QUERY = Q_LAYOUT.format(group_col, full_statement, group_col)

    # registers the dataframe as temporary table:
    df.registerTempTable("df")
    df_grouped = spark.sql(QUERY)

    # rounds values:
    for column in df_grouped.columns:
        df_grouped = df_grouped.withColumn(column, F.round(F.col(column), 1))

    return df_grouped


######### Text Processing Functions ########
@udf("string")
def normalize_text(text):
    """Helper function to normalize text data to ASCII and lower case, removing spaces

    Args:
        text (string): the string that needs to be normalized

    Returns:
        text (string): cleaned up string

    """
    regex = r"[^a-zA-Z0-9]+"

    if text is not None:

        text = str(text)
        text = text.lower()
        text = re.sub(regex, " ", text)
        text = text.strip()
        text = str(
            unicodedata.normalize("NFKD", text).encode("ASCII", "ignore"), "utf-8"
        )

    return text


def get_null_columns(df, normalize=False):
    """Helper function to print the number of null records for each column of a PySpark DataFrame.

    Args:
        df (pyspark.sql.dataframe.DataFrame): a PySpark Dataframe object

    Returns:
        None -> prints to standard out

    """

    if normalize:
        total = df.count()

        df_nulls = df.select(
            [
                (F.sum(F.when(F.col(column).isNull(), 1).otherwise(0)) / total).alias(
                    column
                )
                for column in df.columns
            ]
        )

    else:
        df_nulls = df.select(
            [
                F.sum(F.when(F.col(column).isNull(), 1).otherwise(0)).alias(column)
                for column in df.columns
            ]
        )

    # displaying the results to standard out
    df_nulls.show(1, truncate=False, vertical=True)


@udf("boolean")
def is_set_or_pack(text):

    # description entries to match:
    set_descriptions = {"set", "set of", "pack", "pack of", "box", "box of"}

    if text is not None:
        text = str(text)

        if text in set_descriptions:
            return True

        else:
            return False

    else:
        return False


@udf("integer")
def get_unit_size(text):

    if text is not None:
        check_if_digit = len(re.findall(r"(\d+)", text)) > 0

        if check_if_digit:
            set_size = int(re.findall(r"(\d+)", text)[0])
            return set_size

        else:
            return 1

    else:
        return 1


@udf("boolean")
def has_non_digits_only(text):
    """Function to match entries in the dataset that are purely non-digit characters

    Args:
        text (str): string containing the invoice code

    Returns:
        boolean: whether the text contains non-digit characters and is not related to cancellations

    """

    if text is not None:
        condition = all(character.isalpha() for character in text)

        if condition:
            return True

        else:
            return False

    else:
        return False

<IPython.core.display.Javascript object>

# 1. Loading and Inspecting the Data

In [26]:
# loading the raw dataset:
df_raw = spark.read.option("header", True).csv("../data/raw/Ecommerce.csv")

<IPython.core.display.Javascript object>

In [27]:
# verifying the data schema:
df_raw.printSchema()

root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: string (nullable = true)
 |-- InvoiceDate: string (nullable = true)
 |-- UnitPrice: string (nullable = true)
 |-- CustomerID: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- _c8: string (nullable = true)



<IPython.core.display.Javascript object>

In [28]:
# displaying some records from the data:
df_raw

InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,_c8
536365,85123A,WHITE HANGING HEA...,6,29-Nov-16,2.55,17850,United Kingdom,
536365,71053,WHITE METAL LANTERN,6,29-Nov-16,3.39,17850,United Kingdom,
536365,84406B,CREAM CUPID HEART...,8,29-Nov-16,2.75,17850,United Kingdom,
536365,84029G,KNITTED UNION FLA...,6,29-Nov-16,3.39,17850,United Kingdom,
536365,84029E,RED WOOLLY HOTTIE...,6,29-Nov-16,3.39,17850,United Kingdom,


<IPython.core.display.Javascript object>

In [29]:
# counting nulls in the dataframe:
get_null_columns(df_raw)

-RECORD 0-------------
 InvoiceNo   | 0      
 StockCode   | 0      
 Description | 1454   
 Quantity    | 0      
 InvoiceDate | 0      
 UnitPrice   | 0      
 CustomerID  | 135080 
 Country     | 0      
 _c8         | 541909 



<IPython.core.display.Javascript object>

There are changes I mapped out for this dataset:

1. Snake case column names for easier manipulation; 
2. Remove unnecessary columns (i.e: `Unnamed: 8`);
3. Convert `InvoiceDate` to a `DateTime` object;
4. Convert `CustomerID` to `int` (it is an unique code);
5. Normalizing all text data in `Description` and `Country` to remove unnecessary characters;
6. Handling the missing values in `CustomerID`;

In [30]:
# dropping the unnecessary column (last one):
df_raw = df_raw.drop("_c8")

<IPython.core.display.Javascript object>

## 1.1 Fixing Column Names

In [31]:
# fixing column names with snake casing:
for col in df_raw.columns:
    df_raw = df_raw.withColumnRenamed(col, inflection.underscore(col))

<IPython.core.display.Javascript object>

## 1.2 Fixing Data Types and Missing Values

In [32]:
# converting invoice_date to date:
df_raw = df_raw.withColumn(
    "invoice_date",
    F.from_unixtime(F.unix_timestamp("invoice_date", "dd-MMM-yy")).cast("date"),
)

<IPython.core.display.Javascript object>

In [33]:
# converting customer id to an integer:
df_raw = df_raw.withColumn("customer_id", F.col("customer_id").cast("int"))

<IPython.core.display.Javascript object>

In [34]:
# handling missing values and customer ids:
df_raw = df_raw.withColumn(
    "customer_id",
    F.when(
        F.col("customer_id").isNull(), -1
    ).otherwise(  # missing values will be given a -1 indicator
        F.col("customer_id")
    ),
)

<IPython.core.display.Javascript object>

In [35]:
# adding a indicator for missing customer id:
df_raw = df_raw.withColumn(
    "is_missing_customer_id", F.when(df_raw.customer_id == -1, True).otherwise(False)
)

<IPython.core.display.Javascript object>

## 1.3 Normalizing Text Data

In [36]:
# applying the normalize_text UDF previously defined to the columns listed:
cols_to_process = ["country", "description"]

for column in cols_to_process:
    df_raw = df_raw.withColumn(column, normalize_text(F.col(column)))

<IPython.core.display.Javascript object>

# 2. Raw Data Feature Prepation
The E-commerce dataset I am working with, as the dataset description suggests, is from an *website that sells primarily gifts*. I will split in a few different ways, as the outline below suggests. These `entities` are: 

1. `Customer`: features related to the customer, such as total time as a customer, time since last purchase, favorite products;
2. `Invoice`: features related to a single order, such as basket size (number of distinct items), size of the order, total paid;"
3. `Product`: features related to a specific product, such as last time sold, total sold at a specific period of time;

The schema I envisioned to this dataset can be found in the figure below.

<img src="../reports/figures/Ecommerce Schema.png" alt = "Ecommerce Dataset" style = "width:1182px; height=702px;">

I will add features to the raw dataset at the `invoice-item` granularity it is originally written to so that we can build the necessary views later on.

## 2.1 Datetime attribute features

In [37]:
# adding date-time attributes to invoice entries:
df_raw = get_datetime_features(df_raw, "invoice_date")

<IPython.core.display.Javascript object>

## 2.2 Holiday-related features
Given that the context of this E-commerce is that of a place that sells mostly gifts and novelty items, it is quite pertinent that we address the seasonal variations and trends related to commemorative dates. With that in mind, we will add features related to the context of holidays in the UK (the main location of the website, where the majority of customers are) as well as some global holidays.

Let's first clarify what I am considering a Commercial Holiday and a Bank Holiday:

1. **Commercial Holidays**: dates in which (most of the times) there aren't any interruptions to services (i.e banks), but that have some kind of stimulus for consumption. Purchases that are stimulated by these dates occur either on the specific day (such as Black Friday) or very close to it;
2. **Bank Holidays**: dates in which services are interrupted and people often don't work. Purchases stimulated by these dates usually occur earlier (i.e Christmas)

In [38]:
# defining a dictionary of commercial holidays:
commercial_holidays = {
    "Black Friday 2017": pd.to_datetime("2017-11-24"),
    "Mother's Day  2017": pd.to_datetime("2017-03-26"),
    "Father's Day 2017": pd.to_datetime("2017-06-18"),
    "Valentine's Day 2017": pd.to_datetime("2017-02-14"),
    "Boxing Day 2016": pd.to_datetime("2016-12-26"),
    "Boxing Day 2017": pd.to_datetime("2017-12-26"),
    "New Year's 2017": pd.to_datetime("2017-01-01"),
    "Saint Patrick's Day": pd.to_datetime("2017-03-17"),
}

bank_holidays = {
    "Christmas 2016": pd.to_datetime("2016-12-25"),
    "Christmas 2017": pd.to_datetime("2017-12-25"),
    "New Year's 2017": pd.to_datetime("2017-01-01"),
    "Saint Patrick's Day": pd.to_datetime("2017-03-17"),
    "Hogmanay 2016": pd.to_datetime("2016-12-31"),
    "Hogmanay 2017": pd.to_datetime("2017-12-31"),
    "Easter 2017": pd.to_datetime("2017-04-16"),
}

# holiday dates become:
commercial_dates = set(commercial_holidays.values())
bank_dates = set(bank_holidays.values())

# weeks of year and months that contain holidays:
weeks_commercial_dates = set([date.weekofyear for date in commercial_dates])
months_commercial_dates = set([date.month for date in commercial_dates])
weeks_bank_dates = set([date.weekofyear for date in bank_dates])
months_bank_dates = set([date.month for date in bank_dates])

<IPython.core.display.Javascript object>

In [39]:
# adding boolean flags for the commercial dates:
df_raw = df_raw.withColumn(
    "is_commercial_holiday", F.col("invoice_date").isin(commercial_dates)
)

df_raw = df_raw.withColumn(
    "is_commercial_holiday_week", F.col("week_of_year").isin(weeks_commercial_dates)
)

df_raw = df_raw.withColumn(
    "is_commercial_holiday_month", F.col("month").isin(months_commercial_dates)
)

<IPython.core.display.Javascript object>

In [40]:
# adding boolean flags for the bank dates:
df_raw = df_raw.withColumn("is_bank_holiday", F.col("invoice_date").isin(bank_dates))

df_raw = df_raw.withColumn(
    "is_bank_holiday_week", F.col("week_of_year").isin(weeks_bank_dates)
)

df_raw = df_raw.withColumn(
    "is_bank_holiday_month", F.col("month").isin(months_bank_dates)
)

<IPython.core.display.Javascript object>

## 2.3 Returns, Cancellations and Free items
There still some kinds of behaviors we would like to capture, such if the order/invoice refers to a cancellation. According to a similar dataset in UCI Machine Learning repository, we can differentiate these with invoice numbers that start with a C.

Also, we might need to address the items that are freebies (have unit price = 0) as another feature.

In [41]:
# addinng handles for cancellations:
df_raw = df_raw.withColumn("is_cancelled", F.col("invoice_no").startswith("C"))

<IPython.core.display.Javascript object>

In [42]:
# adding boolean handles for free items (price == 0)
df_raw = df_raw.withColumn(
    "is_return", ((df_raw.quantity < 0) & (df_raw.is_cancelled == False))
)

<IPython.core.display.Javascript object>

In [43]:
# adding free items:
df_raw = df_raw.withColumn(
    "is_free_item", ((df_raw.unit_price == 0) & (df_raw.is_cancelled == False))
)

<IPython.core.display.Javascript object>

## 2.4 Price features
We will add the total price paid for an item in the invoice to the dataset and address issues with different prices being associated with specific products. Since we don't have direct flags that indicate that a certain invoice-item-product combination contains discounts, I will use an heuristic to address it.

In [44]:
# adding the total paid per item:
df_raw = df_raw.withColumn("total_item_price", F.col("unit_price") * F.col("quantity"))

<IPython.core.display.Javascript object>

In [45]:
# grouping products to grab the most common price:
df_products = df_raw.groupby(["description", "unit_price"]).count()

<IPython.core.display.Javascript object>

In [46]:
# setting a window function to get the most common value:
mode_window = Window.partitionBy("description").orderBy(F.desc("count"))

# adding the ranking to the dataframe:
df_products = df_products.withColumn("row_idx", F.row_number().over(mode_window))

# retrieving only the most common item:
df_products = df_products.filter(df_products.row_idx == 1)

df_products = df_products.withColumnRenamed("unit_price", "retail_price").select(
    "description", "retail_price"
)

<IPython.core.display.Javascript object>

In [47]:
# retrieving product price statistics:
df_product_stats = df_raw.groupby("description").agg(
    F.min(F.abs(F.col("unit_price"))).alias("min_unit_price"),
    F.avg(F.abs(F.col("unit_price"))).alias("avg_unit_price"),
    F.percentile_approx(F.abs(F.col("unit_price")), 0.5).alias("median_unit_price"),
    F.max(F.abs(F.col("unit_price"))).alias("max_unit_price"),
)

# joining the datasets:
df_products_full = df_products.join(df_product_stats, how="inner", on=["description"])

# dropping the duplicates;
df_products_full = df_products_full.drop_duplicates()

<IPython.core.display.Javascript object>

In [48]:
# adding price statistics to the main dataframe:
df_raw = df_raw.join(df_products_full, on=["description"], how="left")

<IPython.core.display.Javascript object>

In [49]:
# adding a boolean handle to denote discounts:
df_raw = df_raw.withColumn(
    "is_discounted_item", (df_raw.unit_price < df_raw.retail_price)
)

<IPython.core.display.Javascript object>

## 2.5 Addressing non-product items
As we saw, there seems to be some items that are not related to the products themselves. For example, itens related to postage expenses. Removing these might be not as beneficial, so I will figure out a way to handle them.

In [50]:
# adding boolean handle for further exploration:
df_raw = df_raw.withColumn("has_non_digit", has_non_digits_only(F.col("stock_code")))

<IPython.core.display.Javascript object>

In [51]:
# visualizing some of the results:
df_raw.filter(df_raw.has_non_digit == True)

description,invoice_no,stock_code,quantity,invoice_date,unit_price,customer_id,country,is_missing_customer_id,day_of_week,day_of_month,day_of_year,week_of_year,month,quarter,year,is_commercial_holiday,is_commercial_holiday_week,is_commercial_holiday_month,is_bank_holiday,is_bank_holiday_week,is_bank_holiday_month,is_cancelled,is_return,is_free_item,total_item_price,retail_price,min_unit_price,avg_unit_price,median_unit_price,max_unit_price,is_discounted_item,has_non_digit
discount,C536379,D,-1,2016-11-29,27.5,14527,united kingdom,False,3,29,334,48,11,4,2016,False,False,True,False,False,False,True,False,False,-27.5,11.84,0.01,72.48454545454545,22.97,1867.86,False,True
discount,C537164,D,-1,2016-12-03,29.29,14527,united kingdom,False,7,3,338,48,12,4,2016,False,False,True,False,False,True,True,False,False,-29.29,11.84,0.01,72.48454545454545,22.97,1867.86,False,True
discount,C537597,D,-1,2016-12-05,281.0,15498,united kingdom,False,2,5,340,49,12,4,2016,False,False,True,False,False,True,True,False,False,-281.0,11.84,0.01,72.48454545454545,22.97,1867.86,False,True
discount,C537857,D,-1,2016-12-06,267.12,17340,united kingdom,False,3,6,341,49,12,4,2016,False,False,True,False,False,True,True,False,False,-267.12,11.84,0.01,72.48454545454545,22.97,1867.86,False,True
discount,C538897,D,-1,2016-12-13,5.76,16422,united kingdom,False,3,13,348,50,12,4,2016,False,False,True,False,False,True,True,False,False,-5.76,11.84,0.01,72.48454545454545,22.97,1867.86,False,True


<IPython.core.display.Javascript object>

It seems that there are items associated with distinct categories:
1. Postage and packaging costs (`POST`, `DOT`, `PADS`, `DCGSSGRIL` and `DCGSSBOY`);
2. Manuals (`M`);
3. Discounts and Samples (`D`, `S`);
4. Fees (`AMAZONFEE`);

In [52]:
# generating features regarding theses items:
postage = {"POST", "DOT", "PADS", "DCGSSGRIL", "DCGSSBOY"}

manuals = {"M"}

discounts = {"D", "S"}

fees = {"AMAZONFEE"}

## adding the boolean indicators for all the different cases:
df_raw = df_raw.withColumn("is_postage", F.col("stock_code").isin(postage))

df_raw = df_raw.withColumn("is_manual", F.col("stock_code").isin(manuals))

df_raw = df_raw.withColumn("is_discount", F.col("stock_code").isin(discounts))

df_raw = df_raw.withColumn("is_fee", F.col("stock_code").isin(fees))

<IPython.core.display.Javascript object>

## 2.6 Addressing time to the next holiday
Given the context of the dataset, it is important to evaluate the time before certain key dates in the calendar that are commonly associated with gift-giving.

In [53]:
# transforming the previously defined sets into a row in the original a smaller dataframe:
commercial_dates_list = list(commercial_dates)
bank_dates_list = list(bank_dates)

# extracting the invoice dates for each invoice record (this reduces the size of the dataset by removing the item-level granularity)
df_invoice_dates = df_raw.groupby("invoice_no").agg(
    F.first(F.col("invoice_date")).alias("invoice_date")
)

<IPython.core.display.Javascript object>

In [54]:
# adding the arrays as temporary columns:
df_commercial_dates = df_invoice_dates.withColumn(
    "commercial_dates",
    F.array([F.to_date(F.lit(date)) for date in commercial_dates_list]),
)

df_bank_dates = df_invoice_dates.withColumn(
    "bank_dates", F.array([F.to_date(F.lit(date)) for date in bank_dates_list])
)

# exploding the columns:
df_commercial_dates = df_commercial_dates.select(
    df_commercial_dates.invoice_no,
    df_commercial_dates.invoice_date,
    F.explode(df_commercial_dates.commercial_dates).alias("commercial_date"),
)

df_bank_dates = df_bank_dates.select(
    df_bank_dates.invoice_no,
    df_bank_dates.invoice_date,
    F.explode(df_bank_dates.bank_dates).alias("bank_date"),
)

<IPython.core.display.Javascript object>

In [55]:
# calculating the differences in dates:
df_commercial_dates = df_commercial_dates.withColumn(
    "diff_in_days",
    F.datediff(df_commercial_dates.commercial_date, df_commercial_dates.invoice_date),
)

df_bank_dates = df_bank_dates.withColumn(
    "diff_in_days", F.datediff(df_bank_dates.bank_date, df_bank_dates.invoice_date)
)

<IPython.core.display.Javascript object>

In [56]:
# filtering out the negative results:
df_commercial_dates = df_commercial_dates.filter(F.col("diff_in_days") > 0)

df_bank_dates = df_bank_dates.filter(F.col("diff_in_days") > 0)

<IPython.core.display.Javascript object>

In [57]:
# filtering out the negative results:
df_commercial_dates = df_commercial_dates.groupby("invoice_no", "invoice_date").agg(
    F.min(F.col("diff_in_days")).alias("days_to_next_commercial_holiday")
)

df_bank_dates = df_bank_dates.groupby("invoice_no", "invoice_date").agg(
    F.min(F.col("diff_in_days")).alias("days_to_next_bank_holiday")
)

<IPython.core.display.Javascript object>

In [58]:
# joining the dataset to the original dataframe:
df_raw = df_raw.join(df_commercial_dates, on=["invoice_no", "invoice_date"], how="left")

df_raw = df_raw.join(df_bank_dates, on=["invoice_no", "invoice_date"], how="left")

<IPython.core.display.Javascript object>

In [59]:
# verifying the data integrity:
df_raw.count()  # should match the same number of the records in the previous dataset

541909

<IPython.core.display.Javascript object>

In [60]:
# verifying schema:
df_raw.printSchema()

root
 |-- invoice_no: string (nullable = true)
 |-- invoice_date: date (nullable = true)
 |-- description: string (nullable = true)
 |-- stock_code: string (nullable = true)
 |-- quantity: string (nullable = true)
 |-- unit_price: string (nullable = true)
 |-- customer_id: integer (nullable = true)
 |-- country: string (nullable = true)
 |-- is_missing_customer_id: boolean (nullable = false)
 |-- day_of_week: integer (nullable = true)
 |-- day_of_month: integer (nullable = true)
 |-- day_of_year: integer (nullable = true)
 |-- week_of_year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- quarter: integer (nullable = true)
 |-- year: integer (nullable = true)
 |-- is_commercial_holiday: boolean (nullable = true)
 |-- is_commercial_holiday_week: boolean (nullable = true)
 |-- is_commercial_holiday_month: boolean (nullable = true)
 |-- is_bank_holiday: boolean (nullable = true)
 |-- is_bank_holiday_week: boolean (nullable = true)
 |-- is_bank_holiday_month: boolean (n

<IPython.core.display.Javascript object>

## 2.7 Addressing missing customers
Previously, we observed that there are about 25% of records in the raw dataset that are associated with missing customer ids. I will now address these cases by trying to given unique identifiers to qualified records.

In [61]:
# visualizing some of the results with missing customers:
df_missing_customer = df_raw.filter(df_raw.is_missing_customer_id == True)

<IPython.core.display.Javascript object>

In [62]:
# how many invoices with missing customer ids are there?
df_missing_customer.select(
    F.countDistinct(F.col("invoice_no")).alias("n_invoices_missing_customer")
)

n_invoices_missing_customer
3710


<IPython.core.display.Javascript object>

There are `3710` invoices missing customer ids, which represents about `14%` of the unique invoices. Let's first try to identify system-based invoices.

In [63]:
# grouping on invoice level to get the invoices that are purely associated with returns and cancellations
df_invoice_missing = df_missing_customer.groupby("invoice_no").agg(
    F.first(F.col("customer_id")).alias("customer_id"),
    F.min(F.abs(F.col("quantity"))).alias("min_quantity"),
    F.max(F.abs(F.col("quantity"))).alias("max_quantity"),
    F.avg(F.abs(F.col("quantity"))).alias("avg_quantity"),
    F.min(F.abs(F.col("unit_price"))).alias("min_price"),
    F.max(F.abs(F.col("unit_price"))).alias("max_price"),
    F.avg(F.abs(F.col("unit_price"))).alias("avg_price"),
    F.countDistinct(F.col("description")).alias("n_items"),
    F.sum(
        F.when(F.col("is_return") == True, F.abs(F.col("quantity"))).otherwise(0)
    ).alias("n_returned_items"),
    F.sum(
        F.when(F.col("is_free_item") == True, F.abs(F.col("quantity"))).otherwise(0)
    ).alias("n_free_items"),
    F.sum(
        F.when(F.col("is_cancelled") == True, F.abs(F.col("quantity"))).otherwise(0)
    ).alias("n_cancelled_items"),
)

<IPython.core.display.Javascript object>

In [64]:
# originally, we have the following number of records:
df_invoice_missing.count()

3710

<IPython.core.display.Javascript object>

In [65]:
# removing the invoices that are related to one single item that is a return of free item
df_invoice_missing_clean = df_invoice_missing.filter(
    ~(F.col("min_quantity") == F.col("max_quantity"))
    & ~(F.col("min_quantity") == F.col("n_free_items"))
    & ~(F.col("n_items") == F.col("n_free_items"))
    & ~(F.col("n_items") == F.col("n_returned_items"))
    & ~(F.col("n_items") == F.col("n_cancelled_items"))
)

<IPython.core.display.Javascript object>

In [66]:
# verifying the resulting dataframe:
df_invoice_missing_clean

invoice_no,customer_id,min_quantity,max_quantity,avg_quantity,min_price,max_price,avg_price,n_items,n_returned_items,n_free_items,n_cancelled_items
570592,-1,1.0,75.0,7.438356164383562,1.63,24.96,5.493972602739724,72,0.0,0.0,0.0
580739,-1,2.0,3.0,2.5,2.08,2.1,2.09,2,0.0,0.0,0.0
538177,-1,1.0,29.0,2.445692883895131,0.42,847.42,6.64492509363296,530,0.0,143.0,0.0
546892,-1,1.0,16.0,1.7908496732026145,0.42,192.44,5.022745098039215,153,0.0,32.0,0.0
578539,-1,4.0,24.0,13.323529411764708,0.42,7.95,2.1073529411764707,34,0.0,200.0,0.0


<IPython.core.display.Javascript object>

In [67]:
# looking at the number of records in the cleaned data:
df_invoice_missing_clean.count()

1001

<IPython.core.display.Javascript object>

These `1001` invoices can be treated as unique invoices with unique customers such that we can add artificial ids for such "customers".

In [68]:
# adding an id column that starts from the max id in the dataset:
max_id = (
    df_raw.filter(F.col("customer_id") != -1)
    .select(F.max(F.col("customer_id").cast("int")).alias("customer_ids"))
    .collect()[0]["customer_ids"]
)

# offsetting the max id by one to avoid duplicated ids
max_id += 1


# defining a Window function to generate the uniquee index:
seq_id_window = Window.partitionBy("customer_id").orderBy("invoice_no")

df_invoice_missing_clean = df_invoice_missing_clean.withColumn(
    "temp_id", F.row_number().over(seq_id_window) + F.lit(max_id)
)

<IPython.core.display.Javascript object>

In [69]:
# selecting the relevant columnss:
df_invoice_missing = df_invoice_missing_clean.select("invoice_no", "temp_id")

<IPython.core.display.Javascript object>

In [70]:
# joining back the identified ids to the original dataset:
df_raw = df_raw.join(df_invoice_missing, how="left", on=["invoice_no"])

<IPython.core.display.Javascript object>

In [71]:
# performing the substitution:
df_raw = df_raw.withColumn(
    "customer_id",
    F.when(
        (F.col("customer_id") == -1) & (F.col("temp_id").isNotNull()), F.col("temp_id")
    ).otherwise(F.col("customer_id")),
)

<IPython.core.display.Javascript object>

In [72]:
# filtering out the data related to customers that are still missing:
df_clean = df_raw.filter(F.col("customer_id") != -1)

<IPython.core.display.Javascript object>

In [73]:
df_clean.select(
    F.countDistinct(F.col("invoice_no")).alias("n_invoices_remaining")
)  # should return the expected amount of ids

n_invoices_remaining
23191


<IPython.core.display.Javascript object>

In [74]:
# dropping the unnecessary column:
df_clean = df_clean.drop("temp_id")

<IPython.core.display.Javascript object>

# 3. Saving the Dataset
In the real world, I would often approach saving a dataset in a few different ways, such as:

1. Writing the dataset to an object store for use in further steps of the data pipeline (Amazon's `s3`, for example);
2. Writing the dataset as a table in some distributed file system (`hdfs` or even `s3`) with relevant metadata associated (such that it can be queried through `Hive` or Amazon's `Athena`);
3. Writing the results of the dataset directly to a database for use in different applications (`Postgresql`, `Redshift`, et cetera);

For this project, I won't be using such infra-structure, but I will simulate it by writing to a "object store", which, in this case, is the file system of my computer.

In [75]:
# saving the enhanced raw data as parquet in the processed step of the pipeline
PROCESSED_DATA_DIR = "../data/processed"

# using the helper function to save the file:
save_to_filesystem(df_clean, PROCESSED_DATA_DIR, "tb_ecommerce", "tb_ecommerce.parquet")

True

<IPython.core.display.Javascript object>