 # Rolling Window Features

This notebook showcases an example workflow for creating rolling window features and building a model to predict which customers will buy in the next 4 weeks.

This uses dummy sales data but the idea can be implemented on actual sales data and can also be expanded to include other available data sources such as clickstream data, call centre data, email contacts data, etc.

***

<b>Spark 3.1.2</b> (with Python 3.8) has been used for this notebook.<br>
Refer to [spark documentation](https://spark.apache.org/docs/3.1.2/api/sql/index.html) for help with <b>data ops functions</b>.<br>
Refer to [this article](https://medium.com/analytics-vidhya/installing-and-using-pyspark-on-windows-machine-59c2d64af76e) to <b>install and use PySpark on a Windows machine</b>.

### Building a spark session
To create a SparkSession, use the following builder pattern:
 
`spark = SparkSession\
    .builder\
    .master("local")\
    .appName("Word Count")\
    .config("spark.some.config.option", "some-value")\
    .getOrCreate()`

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import Window
from pyspark.sql.types import FloatType

In [2]:
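# stop any Spark session that is already active so that the config below takes effect
active_session = SparkSession.getActiveSession()
if active_session is not None:
    active_session.stop()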

In [3]:
spark = SparkSession\
    .builder\
    .appName("rolling_window")\
    .config("spark.executor.memory", "1536m")\
    .config("spark.driver.memory", "2g")\
    .getOrCreate()

In [4]:
spark

## Data prep

We will be using window functions to compute relative features for all dates. We will first aggregate the data to customer x week level so it is easier to handle.

<mark>The week level date that we create will serve as the 'reference date' from which everything will be relative.</mark>

All the required dimension tables have to be joined with the sales table before aggregation so that we can create all the required features.

### Read input datasets

In [5]:
import pandas as pd

In [6]:
df_sales = spark.read.csv('./data/rw_sales.csv',inferSchema=True,header=True)
df_customer = spark.read.csv('./data/clustering_customer.csv',inferSchema=True,header=True)
df_product = spark.read.csv('./data/clustering_product.csv',inferSchema=True,header=True)
df_payment = spark.read.csv('./data/clustering_payment.csv',inferSchema=True,header=True)

<b>Quick exploration of the datasets:</b>
1. We have sales data that captures date, customer id, product, quantity, dollar amount & payment type at order x item level. `order_item_id` identifies each unique product within an order.
2. We have corresponding dimension tables for customer info, product info, and payment tender info

In [7]:
df_sales.show(5)

+--------+-------------+----------+-----------+-------+---+----------+---------------+
|order_id|order_item_id|   tran_dt|customer_id|dollars|qty|product_id|payment_type_id|
+--------+-------------+----------+-----------+-------+---+----------+---------------+
|       1|            1|2020-01-01|        572|    550|  1|        20|              2|
|       2|            2|2020-01-01|        532|    630|  3|        11|              2|
|       3|            3|2020-01-01|        608|    450|  2|        18|              4|
|       4|            4|2020-01-01|        424|    110|  2|        10|              2|
|       5|            5|2020-01-01|        584|    250|  1|         8|              4|
+--------+-------------+----------+-----------+-------+---+----------+---------------+
only showing top 5 rows



In [8]:
# order_item_id is the primary key
(df_sales.count(),
 df_sales.selectExpr('count(Distinct order_item_id)').collect()[0][0],
 df_sales.selectExpr('count(Distinct order_id)').collect()[0][0])

(20000, 20000, 19622)

In [9]:
df_sales.printSchema()

root
 |-- order_id: integer (nullable = true)
 |-- order_item_id: integer (nullable = true)
 |-- tran_dt: string (nullable = true)
 |-- customer_id: integer (nullable = true)
 |-- dollars: integer (nullable = true)
 |-- qty: integer (nullable = true)
 |-- product_id: integer (nullable = true)
 |-- payment_type_id: integer (nullable = true)



In [10]:
# fix date type for tran_dt
df_sales = df_sales.withColumn('tran_dt', F.to_date('tran_dt'))

In [11]:
df_customer.show(5)

+-----------+---+---------+------------+----------------+
|customer_id|age|hh_income|omni_shopper|email_subscribed|
+-----------+---+---------+------------+----------------+
|          1| 46|   640000|           0|               0|
|          2| 32|   890000|           1|               1|
|          3| 45|   772000|           0|               0|
|          4| 46|   303000|           0|               1|
|          5| 38|   412000|           0|               0|
+-----------+---+---------+------------+----------------+
only showing top 5 rows



In [12]:
# we have 1k unique customers in sales data with all their info in customer dimension table
(df_sales.selectExpr('count(Distinct customer_id)').collect()[0][0],
 df_customer.count(),
 df_customer.selectExpr('count(Distinct customer_id)').collect()[0][0])

(1000, 1000, 1000)

In [13]:
# product dimension table provides category and price for each product
df_product.show(5)

+----------+--------+-----+
|product_id|category|price|
+----------+--------+-----+
|         1|       A|  450|
|         2|       B|   80|
|         3|       C|  250|
|         4|       D|  400|
|         5|       E|   50|
+----------+--------+-----+
only showing top 5 rows



In [14]:
(df_product.count(),
 df_product.selectExpr('count(Distinct product_id)').collect()[0][0])

(22, 22)

In [15]:
# payment type table maps the payment type id from sales table
df_payment.show(5)

+---------------+------------+
|payment_type_id|payment_type|
+---------------+------------+
|              1|        cash|
|              2| credit card|
|              3|  debit card|
|              4|   gift card|
|              5|      others|
+---------------+------------+



### Join all dim tables and add week_end column

In [16]:
df_sales = df_sales.join(df_product.select('product_id','category'), on=['product_id'], how='left')
df_sales = df_sales.join(df_payment, on=['payment_type_id'], how='left')

<b>week_end column: Saturday of every week</b>

`dayofweek()` returns 1-7 corresponding to Sun-Sat for a date.

Using this, we will convert each date to the date corresponding to the Saturday of that week (week: Sun-Sat) using below logic:<br/>
`date + 7 - dayofweek()`
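
As a quick sanity check of this formula outside Spark, the sketch below reproduces it with the Python standard library. The helper names `spark_dayofweek` and `week_end` are illustrative and not part of the notebook; the mapping assumes the Sun=1 ... Sat=7 convention described above.

```python
from datetime import date, timedelta

def spark_dayofweek(d: date) -> int:
    """Mimic Spark's dayofweek(): 1 = Sunday ... 7 = Saturday."""
    # isoweekday() is 1 = Monday ... 7 = Sunday, so shift by one and wrap around
    return d.isoweekday() % 7 + 1

def week_end(d: date) -> date:
    """Saturday of the Sun-Sat week containing d, i.e. date + 7 - dayofweek."""
    return d + timedelta(days=7 - spark_dayofweek(d))

print(week_end(date(2020, 1, 1)))  # a Wednesday -> 2020-01-04
print(week_end(date(2020, 1, 4)))  # a Saturday maps to itself
print(week_end(date(2020, 1, 5)))  # a Sunday starts a new week -> 2020-01-11
```

The first case matches the `week_end` value that the notebook produces for `tran_dt = 2020-01-01`.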

In [17]:
df_sales.printSchema()

root
 |-- payment_type_id: integer (nullable = true)
 |-- product_id: integer (nullable = true)
 |-- order_id: integer (nullable = true)
 |-- order_item_id: integer (nullable = true)
 |-- tran_dt: date (nullable = true)
 |-- customer_id: integer (nullable = true)
 |-- dollars: integer (nullable = true)
 |-- qty: integer (nullable = true)
 |-- category: string (nullable = true)
 |-- payment_type: string (nullable = true)



In [18]:
df_sales = df_sales.withColumn('week_end',
    F.col('tran_dt') + 7 - F.dayofweek('tran_dt'))

In [19]:
df_sales.show(5)

+---------------+----------+--------+-------------+----------+-----------+-------+---+--------+------------+----------+
|payment_type_id|product_id|order_id|order_item_id|   tran_dt|customer_id|dollars|qty|category|payment_type|  week_end|
+---------------+----------+--------+-------------+----------+-----------+-------+---+--------+------------+----------+
|              2|        20|       1|            1|2020-01-01|        572|    550|  1|       D| credit card|2020-01-04|
|              2|        11|       2|            2|2020-01-01|        532|    630|  3|       A| credit card|2020-01-04|
|              4|        18|       3|            3|2020-01-01|        608|    450|  2|       C|   gift card|2020-01-04|
|              2|        10|       4|            4|2020-01-01|        424|    110|  2|       E| credit card|2020-01-04|
|              4|         8|       5|            5|2020-01-01|        584|    250|  1|       C|   gift card|2020-01-04|
+---------------+----------+--------+-------------+----------+-----------+-------+---+--------+------------+----------+

### customer_id x week_end aggregation
We will create the following features at the weekly level. These will then be aggregated over multiple time frames using window functions for the final dataset.
1. Sales
2. No. of orders
3. No. of units
4. Sales split by category
5. Sales split by payment type
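
To make the target shape concrete, here is a minimal plain-Python sketch of the same customer x week aggregation on a few made-up order lines (the rows are illustrative, not taken from `rw_sales.csv`). It computes sales, distinct orders, units, and the category split; the payment type split would follow the same pattern.

```python
from collections import defaultdict

# made-up order lines: (order_id, customer_id, week_end, dollars, qty, category)
lines = [
    (1, 67, "2019-10-05", 1200, 3, "D"),
    (2, 67, "2019-10-05", 1100, 2, "D"),
    (3, 80, "2020-01-11",  500, 1, "A"),
    (3, 80, "2020-01-11",  400, 1, "B"),
]

# accumulate per (customer_id, week_end)
weekly = defaultdict(lambda: {"sales": 0, "order_ids": set(), "units": 0})
for order_id, cust, week, dollars, qty, cat in lines:
    row = weekly[(cust, week)]
    row["sales"] += dollars
    row["order_ids"].add(order_id)        # a set, so orders are counted once
    row["units"] += qty
    cat_col = "cat_" + cat                # pivoted column name, e.g. cat_D
    row[cat_col] = row.get(cat_col, 0) + dollars

# replace the helper set with the distinct order count
result = {}
for key, row in weekly.items():
    out = {k: v for k, v in row.items() if k != "order_ids"}
    out["orders"] = len(row["order_ids"])
    result[key] = out

for key in sorted(result):
    print(key, result[key])
```

Categories a customer did not buy in a given week are simply absent here, which corresponds to the nulls produced by the Spark pivot.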

In [20]:
df_sales_agg = df_sales.groupBy('customer_id','week_end').agg(
    F.sum('dollars').alias('sales'),
    F.countDistinct('order_id').alias('orders'),
    F.sum('qty').alias('units'))

In [21]:
# category split pivot
df_sales_cat_agg = df_sales.withColumn('category', F.concat(F.lit('cat_'), F.col('category')))

df_sales_cat_agg = df_sales_cat_agg.groupBy('customer_id','week_end').pivot('category').agg(F.sum('dollars'))

In [22]:
# payment type split pivot
# clean-up values in payment type column
df_payment_agg = df_sales.withColumn(
    'payment_type',
    F.concat(F.lit('pay_'), F.regexp_replace(F.col('payment_type'),' ','_')))

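# note: F.max keeps only the largest single line item per payment type;
# use F.sum('dollars') instead if the split should add up to weekly sales, as the category pivot does
df_payment_agg = df_payment_agg.groupby('customer_id','week_end').pivot('payment_type').agg(F.max('dollars'))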

In [23]:
# join all together
df_sales_agg = df_sales_agg.join(df_sales_cat_agg, on=['customer_id','week_end'], how='left')
df_sales_agg = df_sales_agg.join(df_payment_agg,   on=['customer_id','week_end'], how='left')

In [24]:
df_sales_agg = df_sales_agg.persist()
df_sales_agg.count()

17488

In [25]:
df_sales_agg.show(5)

+-----------+----------+-----+------+-----+-----+-----+-----+-----+-----+--------+---------------+--------------+-------------+----------+
|customer_id|  week_end|sales|orders|units|cat_A|cat_B|cat_C|cat_D|cat_E|pay_cash|pay_credit_card|pay_debit_card|pay_gift_card|pay_others|
+-----------+----------+-----+------+-----+-----+-----+-----+-----+-----+--------+---------------+--------------+-------------+----------+
|         67|2019-10-05| 2300|     2|    5| null| null| null| 2300| null|    null|           1200|          null|         null|      null|
|         80|2020-01-11|  900|     1|    2|  900| null| null| null| null|    null|            900|          null|         null|      null|
|         81|2020-08-01|  450|     1|    3| null|  450| null| null| null|    null|            450|          null|         null|      null|
|         86|2020-02-08|  550|     1|    1|  550| null| null| null| null|    null|            550|          null|         null|      null|
|         88|2019-06-01|  7

### Fill Missing weeks

In [26]:
# cust level min and max weeks
df_cust = df_sales_agg.groupBy('customer_id').agg(
    F.min('week_end').alias('min_week'),
    F.max('week_end').alias('max_week'))

In [27]:
# function to get a dataframe with 1 row per date in provided range
def pandas_date_range(start, end):
    dt_rng = pd.date_range(start=start, end=end, freq='W-SAT') # W-SAT required as we want all Saturdays
    df_date = pd.DataFrame(dt_rng, columns=['date'])
    return df_date

In [28]:
# use the cust level table and create a df with all Saturdays in our range
date_list = df_cust.selectExpr('min(min_week)', 'max(max_week)').collect()[0]
min_date = date_list[0]
max_date = date_list[1]

# use the function and create df
df_date_range = spark.createDataFrame(pandas_date_range(min_date, max_date))

# date format
df_date_range = df_date_range.withColumn('date',F.to_date('date'))

In [29]:
df_date_range = df_date_range.repartition(1).persist()
df_date_range.count()

101

<b>Cross join date list df with cust table to create filled base table</b>

In [30]:
df_base = df_cust.crossJoin(F.broadcast(df_date_range))

# filter to keep only week_end since first week per customer
df_base = df_base.where(F.col('date')>=F.col('min_week'))

# rename date to week_end
df_base = df_base.withColumnRenamed('date','week_end')

<b>Join with the aggregated week level table to create full base table</b>

In [31]:
df_base = df_base.join(df_sales_agg, on=['customer_id','week_end'], how='left')
df_base = df_base.fillna(0)

In [32]:
df_base = df_base.persist()
df_base.count()

95197
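
The fill logic above can be illustrated for a single customer in plain Python. This is a sketch on made-up values; `saturdays`, `sales` and `global_max` are illustrative names. As in the notebook, the filled range runs from the customer's first purchase week to the latest week in the whole dataset, not to the customer's own last purchase.

```python
from datetime import date, timedelta

def saturdays(start: date, end: date):
    """Dates from start to end (inclusive) in 7-day steps; start is assumed to be a Saturday."""
    out, d = [], start
    while d <= end:
        out.append(d)
        d += timedelta(weeks=1)
    return out

# made-up weekly sales for one customer: only weeks with a purchase are present
sales = {date(2020, 1, 4): 900, date(2020, 1, 25): 450}
global_max = date(2020, 2, 1)   # latest week_end across all customers (made up)

first_week = min(sales)
# missing weeks get 0, mirroring the left join followed by fillna(0)
filled = {wk: sales.get(wk, 0) for wk in saturdays(first_week, global_max)}

for wk, amount in filled.items():
    print(wk, amount)
```

Filling the gaps matters for the later steps: the rolling windows count rows, so every row must represent exactly one week.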

In [33]:
# write base table as parquet
df_base.repartition(8).write.parquet('./data/rw_base/', mode='overwrite')

In [34]:
df_base = spark.read.parquet('./data/rw_base/')

## y-variable

We flag whether a customer makes a purchase in the 4 weeks following the current week.

In [35]:
# flag 1/0 for weeks with purchases
df_base = df_base.withColumn('purchase_flag', F.when(F.col('sales')>0,1).otherwise(0))

In [36]:
# window to aggregate the flag over next 4 weeks
df_base = df_base.withColumn(
    'purchase_flag_next_4w',
    F.max('purchase_flag').over(
        Window.partitionBy('customer_id').orderBy('week_end').rowsBetween(1,4)))
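
A plain-Python sketch of what `rowsBetween(1,4)` with `max` computes for one customer's ordered weekly flags. The function name `next_n_max` is illustrative. `None` stands in for the null that Spark returns when the frame contains no rows.

```python
def next_n_max(flags, n=4):
    """For each position, the max of the next n values; None if no rows lie ahead."""
    out = []
    for i in range(len(flags)):
        ahead = flags[i + 1 : i + 1 + n]   # rows 1..n after the current row
        out.append(max(ahead) if ahead else None)
    return out

weekly_flags = [1, 0, 0, 0, 0, 0, 1]       # made-up purchase_flag values
print(next_n_max(weekly_flags))
```

Near the end of a customer's history the frame holds fewer than 4 rows, so the label is based on an incomplete look-forward period. This is why the most recent weeks are excluded before modelling.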

## Features
We will be aggregating the features columns over various time intervals (1/4/13/26/52 weeks) to create a rich set of lookback features. We will also create derived features post aggregation.

In [37]:
# we can create and keep Window() objects that can be referenced in multiple formulas
# we don't need a window definition for 1w features as these are already present
window_4w  = Window.partitionBy('customer_id').orderBy('week_end').rowsBetween(-3,Window.currentRow)
window_13w = Window.partitionBy('customer_id').orderBy('week_end').rowsBetween(-12,Window.currentRow)
window_26w = Window.partitionBy('customer_id').orderBy('week_end').rowsBetween(-25,Window.currentRow)
window_52w = Window.partitionBy('customer_id').orderBy('week_end').rowsBetween(-51,Window.currentRow)
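
For intuition, `rowsBetween(-3, Window.currentRow)` covers the current row plus the 3 rows before it, i.e. 4 weekly rows. The sketch below mimics a `sum` over such a window in plain Python; `trailing_sum` is an illustrative helper, not a Spark function.

```python
def trailing_sum(values, width):
    """Sum of the current value and up to width-1 preceding values."""
    return [sum(values[max(0, i - width + 1) : i + 1]) for i in range(len(values))]

weekly_sales = [10, 0, 5, 0, 20, 0]        # made-up weekly sales, oldest first
print(trailing_sum(weekly_sales, 4))
```

At the start of a customer's history the window simply contains fewer rows. Note also that a row-based window equals a calendar window only because the missing weeks were filled with zeros earlier.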

In [38]:
df_base.columns

['customer_id',
 'week_end',
 'min_week',
 'max_week',
 'sales',
 'orders',
 'units',
 'cat_A',
 'cat_B',
 'cat_C',
 'cat_D',
 'cat_E',
 'pay_cash',
 'pay_credit_card',
 'pay_debit_card',
 'pay_gift_card',
 'pay_others',
 'purchase_flag',
 'purchase_flag_next_4w']

<b>Direct features</b>

In [39]:
cols_skip = ['customer_id','week_end','min_week','max_week','purchase_flag_next_4w']
for cols in df_base.drop(*cols_skip).columns:
    df_base = df_base.withColumn(cols+'_4w',  F.sum(F.col(cols)).over(window_4w))
    df_base = df_base.withColumn(cols+'_13w', F.sum(F.col(cols)).over(window_13w))
    df_base = df_base.withColumn(cols+'_26w', F.sum(F.col(cols)).over(window_26w))
    df_base = df_base.withColumn(cols+'_52w', F.sum(F.col(cols)).over(window_52w))

<b>Derived features</b>

In [40]:
# aov (sales per order), aur (sales per unit), upt (units per order) at each time cut
# the 1w columns carry no suffix, hence the empty string
for time_cuts in ['','_4w','_13w','_26w','_52w']:
    df_base = df_base.withColumn('aov'+time_cuts, F.col('sales'+time_cuts)/F.col('orders'+time_cuts))
    df_base = df_base.withColumn('aur'+time_cuts, F.col('sales'+time_cuts)/F.col('units'+time_cuts))
    df_base = df_base.withColumn('upt'+time_cuts, F.col('units'+time_cuts)/F.col('orders'+time_cuts))

In [41]:
# % split of category and payment type for 26w (can be extended to other timeframes as well)
for cat in ['A','B','C','D','E']:
    df_base = df_base.withColumn('cat_'+cat+'_26w_perc', F.col('cat_'+cat+'_26w')/F.col('sales_26w'))

for pay in ['cash', 'credit_card', 'debit_card', 'gift_card', 'others']:
    df_base = df_base.withColumn('pay_'+pay+'_26w_perc', F.col('pay_'+pay+'_26w')/F.col('sales_26w'))

In [42]:
# all columns
df_base.columns

['customer_id',
 'week_end',
 'min_week',
 'max_week',
 'sales',
 'orders',
 'units',
 'cat_A',
 'cat_B',
 'cat_C',
 'cat_D',
 'cat_E',
 'pay_cash',
 'pay_credit_card',
 'pay_debit_card',
 'pay_gift_card',
 'pay_others',
 'purchase_flag',
 'purchase_flag_next_4w',
 'sales_4w',
 'sales_13w',
 'sales_26w',
 'sales_52w',
 'orders_4w',
 'orders_13w',
 'orders_26w',
 'orders_52w',
 'units_4w',
 'units_13w',
 'units_26w',
 'units_52w',
 'cat_A_4w',
 'cat_A_13w',
 'cat_A_26w',
 'cat_A_52w',
 'cat_B_4w',
 'cat_B_13w',
 'cat_B_26w',
 'cat_B_52w',
 'cat_C_4w',
 'cat_C_13w',
 'cat_C_26w',
 'cat_C_52w',
 'cat_D_4w',
 'cat_D_13w',
 'cat_D_26w',
 'cat_D_52w',
 'cat_E_4w',
 'cat_E_13w',
 'cat_E_26w',
 'cat_E_52w',
 'pay_cash_4w',
 'pay_cash_13w',
 'pay_cash_26w',
 'pay_cash_52w',
 'pay_credit_card_4w',
 'pay_credit_card_13w',
 'pay_credit_card_26w',
 'pay_credit_card_52w',
 'pay_debit_card_4w',
 'pay_debit_card_13w',
 'pay_debit_card_26w',
 'pay_debit_card_52w',
 'pay_gift_card_4w',
 'pay_gift_card_1

<b>Derived features: trend vars</b>

In [43]:
# we will take ratio of sales for different timeframes to estimate trend features
# that depict whether a customer has an increasing trend or not
df_base = df_base.withColumn('sales_1w_over_4w',   F.col('sales')/    F.col('sales_4w'))
df_base = df_base.withColumn('sales_4w_over_13w',  F.col('sales_4w')/ F.col('sales_13w'))
df_base = df_base.withColumn('sales_13w_over_26w', F.col('sales_13w')/F.col('sales_26w'))
df_base = df_base.withColumn('sales_26w_over_52w', F.col('sales_26w')/F.col('sales_52w'))
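
One way to read these ratios, sketched in plain Python on made-up numbers: a customer who spends the same amount every week has `sales_4w_over_13w` of 4/13 (about 0.31), so values well above that suggest spend concentrated in recent weeks. The `ratio` helper is illustrative; it returns `None` for a zero denominator, mirroring the null that Spark's division returns by default.

```python
def ratio(num, den):
    """Division that yields None instead of failing when the denominator is zero."""
    return num / den if den else None

steady = [100] * 13                 # same spend in each of the last 13 weeks
recent = [0] * 9 + [100] * 4        # all spend in the latest 4 weeks

for name, weekly in [("steady", steady), ("recent", recent)]:
    sales_4w, sales_13w = sum(weekly[-4:]), sum(weekly)
    print(name, ratio(sales_4w, sales_13w))

print(ratio(0, 0))                  # no sales in either window
```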

<b>Time elements</b>

In [44]:
# extract year, month, and week of year from week_end to be used as features
df_base = df_base.withColumn('year', F.year('week_end'))
df_base = df_base.withColumn('month', F.month('week_end'))
df_base = df_base.withColumn('weekofyear', F.weekofyear('week_end'))

<b>More derived features</b>:<br/>
We can add many more derived features as well, as required.

e.g. lag variables of existing features, trend ratios for other features, % change (Q-o-Q, M-o-M type) using lag variables, etc.

In [45]:
# save sample rows to csv for checks
df_base.limit(50).toPandas().to_csv('./files/rw_features_qc.csv',index=False)

In [46]:
# save features dataset as parquet
df_base.repartition(8).write.parquet('./data/rw_features/', mode='overwrite')

In [47]:
df_features = spark.read.parquet('./data/rw_features/')

## Model Build

### Dataset for modeling

<b>Sample one week_end per month</b>

In [48]:
df_wk_sample = df_features.select('week_end').withColumn('month', F.substring(F.col('week_end'), 1,7))
df_wk_sample = df_wk_sample.groupBy('month').agg(F.max('week_end').alias('week_end'))

df_wk_sample = df_wk_sample.repartition(1).persist()
df_wk_sample.count()

24

In [49]:
df_wk_sample.sort('week_end').show(5)

+-------+----------+
|  month|  week_end|
+-------+----------+
|2019-01|2019-01-26|
|2019-02|2019-02-23|
|2019-03|2019-03-30|
|2019-04|2019-04-27|
|2019-05|2019-05-25|
+-------+----------+
only showing top 5 rows



In [50]:
count_features = df_features.count()

In [51]:
# join back to filter
df_model = df_features.join(F.broadcast(df_wk_sample.select('week_end')), on=['week_end'], how='inner')
count_wk_sample = df_model.count()

<b>Eligibility filter</b>: The customer should have been active in the year leading up to the reference date

In [52]:
# use sales_52w for elig. filter
df_model = df_model.where(F.col('sales_52w')>0)
count_elig = df_model.count()

In [53]:
# count of rows at each stage
print(count_features, count_wk_sample, count_elig)

95197 22938 22938


<b>Removing latest 4 week_end dates</b>: As we have a look-forward period of 4 weeks, latest 4 week_end dates in the data cannot be used for our model as these do not have 4 weeks ahead of them for the y-variable.

In [54]:
# see latest week_end dates (in the dataframe prior to monthly sampling)
df_features.select('week_end').drop_duplicates().sort(F.col('week_end').desc()).show(5)

+----------+
|  week_end|
+----------+
|2020-12-05|
|2020-11-28|
|2020-11-21|
|2020-11-14|
|2020-11-07|
+----------+
only showing top 5 rows



In [55]:
# filter
df_model = df_model.where(F.col('week_end')<'2020-11-14')
count_4w_rm = df_model.count()

In [56]:
# count of rows at each stage
print(count_features, count_wk_sample, count_elig, count_4w_rm)

95197 22938 22938 20938
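
The cutoff date does not have to be hard-coded; it can be derived from the latest `week_end` and the look-forward length. A minimal sketch follows, where `LOOK_FORWARD_WEEKS` and `cutoff` are illustrative names.

```python
from datetime import date, timedelta

LOOK_FORWARD_WEEKS = 4
max_week_end = date(2020, 12, 5)    # latest week_end in the features table

# the latest 4 week_end dates are max, max-1w, max-2w and max-3w;
# rows strictly before the earliest of them are kept
cutoff = max_week_end - timedelta(weeks=LOOK_FORWARD_WEEKS - 1)
print(cutoff)  # 2020-11-14
```

In the notebook, such a value could presumably replace the date literal in the filter, e.g. by comparing `week_end` against `F.lit(cutoff)`.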


### Model Dataset Summary
Let's look at event rate for our dataset and also get a quick summary of all features.

The y-variable is balanced here because this is a dummy dataset. <mark>In most real scenarios it will not be balanced, and the model build exercise will involve sampling to balance the classes.</mark>
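
As a rough illustration of one such balancing approach (random down-sampling of the majority class), here is a plain-Python sketch on made-up labels. It is not part of this notebook's workflow, and `downsample_majority` is a hypothetical helper. On a Spark dataframe, `DataFrame.sampleBy` with per-class fractions would be a natural starting point.

```python
import random

def downsample_majority(labels, seed=125):
    """Indices of all minority rows plus an equal-sized random sample of majority rows."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    rng = random.Random(seed)            # fixed seed for reproducibility
    return sorted(minority + rng.sample(majority, len(minority)))

labels = [1] * 10 + [0] * 90             # made-up data with a 10% event rate
idx = downsample_majority(labels)
balanced = [labels[i] for i in idx]
print(len(balanced), sum(balanced) / len(balanced))
```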

In [57]:
df_model.groupBy('purchase_flag_next_4w').count().sort('purchase_flag_next_4w').show()

+---------------------+-----+
|purchase_flag_next_4w|count|
+---------------------+-----+
|                    0|10464|
|                    1|10474|
+---------------------+-----+



In [58]:
df_model.groupBy().agg(F.avg('purchase_flag_next_4w').alias('event_rate'), F.avg('purchase_flag').alias('wk_evt_rt')).show()

+------------------+-------------------+
|        event_rate|          wk_evt_rt|
+------------------+-------------------+
|0.5002388002674563|0.18182252364122647|
+------------------+-------------------+



<b>Saving summary of all numerical features as a csv</b>

In [59]:
summary_metrics =\
    ('count','mean','stddev','min','0.10%','1.00%','5.00%','10.00%','20.00%','25.00%','30.00%',
     '40.00%','50.00%','60.00%','70.00%','75.00%','80.00%','90.00%','95.00%','99.00%','99.90%','max')

df_summary_numeric = df_model.summary(*summary_metrics)
df_summary_numeric.toPandas().T.to_csv('./files/rw_features_summary.csv')

In [60]:
# fillna
df_model = df_model.fillna(0)

### Train-Test Split

80-20 split

In [61]:
train, test = df_model.randomSplit([0.8, 0.2], seed=125)

In [62]:
train.columns

['week_end',
 'customer_id',
 'min_week',
 'max_week',
 'sales',
 'orders',
 'units',
 'cat_A',
 'cat_B',
 'cat_C',
 'cat_D',
 'cat_E',
 'pay_cash',
 'pay_credit_card',
 'pay_debit_card',
 'pay_gift_card',
 'pay_others',
 'purchase_flag',
 'purchase_flag_next_4w',
 'sales_4w',
 'sales_13w',
 'sales_26w',
 'sales_52w',
 'orders_4w',
 'orders_13w',
 'orders_26w',
 'orders_52w',
 'units_4w',
 'units_13w',
 'units_26w',
 'units_52w',
 'cat_A_4w',
 'cat_A_13w',
 'cat_A_26w',
 'cat_A_52w',
 'cat_B_4w',
 'cat_B_13w',
 'cat_B_26w',
 'cat_B_52w',
 'cat_C_4w',
 'cat_C_13w',
 'cat_C_26w',
 'cat_C_52w',
 'cat_D_4w',
 'cat_D_13w',
 'cat_D_26w',
 'cat_D_52w',
 'cat_E_4w',
 'cat_E_13w',
 'cat_E_26w',
 'cat_E_52w',
 'pay_cash_4w',
 'pay_cash_13w',
 'pay_cash_26w',
 'pay_cash_52w',
 'pay_credit_card_4w',
 'pay_credit_card_13w',
 'pay_credit_card_26w',
 'pay_credit_card_52w',
 'pay_debit_card_4w',
 'pay_debit_card_13w',
 'pay_debit_card_26w',
 'pay_debit_card_52w',
 'pay_gift_card_4w',
 'pay_gift_card_1

### Data Prep
Spark Models require a vector of features as input. Categorical columns also need to be String Indexed before they can be used.

As we don't have any categorical columns currently, we will directly go with VectorAssembly.

<b>We will add it to a pipeline model that can be saved to be used on test & scoring datasets.</b>

In [63]:
# model related imports (RF)
from pyspark.ml.classification import RandomForestClassifier, RandomForestClassificationModel
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [64]:
# list of features: remove identifier columns and the y-var
col_list = df_model.drop('week_end','customer_id','min_week','max_week','purchase_flag_next_4w').columns

stages = []
assembler = VectorAssembler(inputCols=col_list, outputCol='features')
stages.append(assembler)

pipe = Pipeline(stages=stages)
pipe_model = pipe.fit(train)

pipe_model.write().overwrite().save('./files/model_objects/rw_pipe/')

In [65]:
pipe_model = PipelineModel.load('./files/model_objects/rw_pipe/')

<b>Apply the transformation pipeline</b>

Also keep the identifier columns and y-var in the transformed dataframe.

In [66]:
train_pr = pipe_model.transform(train)
train_pr = train_pr.select('customer_id','week_end','purchase_flag_next_4w','features')
train_pr = train_pr.persist()
train_pr.count()

16776

In [67]:
test_pr = pipe_model.transform(test)
test_pr = test_pr.select('customer_id','week_end','purchase_flag_next_4w','features')
test_pr = test_pr.persist()
test_pr.count()

4162

### Model Training
We will train a single iteration of a Random Forest model as a showcase.

In a real scenario, you will have to iterate through the training step multiple times for feature selection and hyperparameter tuning to get a good final model.

In [68]:
train_pr.show(5)

+-----------+----------+---------------------+--------------------+
|customer_id|  week_end|purchase_flag_next_4w|            features|
+-----------+----------+---------------------+--------------------+
|          3|2019-01-26|                    0|(102,[14,15,16,17...|
|         14|2019-01-26|                    0|[1200.0,1.0,3.0,0...|
|         17|2019-01-26|                    0|(102,[14,15,16,17...|
|         19|2019-01-26|                    1|(102,[14,15,16,17...|
|         31|2019-01-26|                    1|(102,[0,1,2,6,9,1...|
+-----------+----------+---------------------+--------------------+
only showing top 5 rows



In [69]:
model_params = {
    'labelCol': 'purchase_flag_next_4w',
    'numTrees': 128, # Spark default: 20
    'maxDepth': 12,  # Spark default: 5
    'featuresCol': 'features',
    'minInstancesPerNode': 25,
    'maxBins': 128,
    'minInfoGain': 0.0,
    'subsamplingRate': 0.7,
    'featureSubsetStrategy': '0.3',
    'impurity': 'gini',
    'seed': 125,
    'cacheNodeIds': False,
    'maxMemoryInMB': 256
    }

clf = RandomForestClassifier(**model_params)

In [70]:
trained_clf = clf.fit(train_pr)

### Feature Importance
We will save feature importance as a csv.

In [71]:
# Feature importance
feature_importance_list = trained_clf.featureImportances
feature_list = pd.DataFrame(train_pr.schema['features'].metadata['ml_attr']['attrs']['numeric']).sort_values('idx')

feature_importance_list = pd.DataFrame(
    data=feature_importance_list.toArray(),
    columns=['relative_importance'],
    index=feature_list['name'])
feature_importance_list = feature_importance_list.sort_values('relative_importance', ascending=False)

feature_importance_list.to_csv('./files/rw_rf_feat_imp.csv')

### Predict on train and test

In [72]:
secondelement = F.udf(lambda v: float(v[1]), FloatType())

train_pred = trained_clf.transform(train_pr).withColumn('score',secondelement(F.col('probability')))
test_pred =  trained_clf.transform(test_pr).withColumn('score', secondelement(F.col('probability')))
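
The `score` column is the second entry of the `probability` vector, i.e. the predicted probability of class 1. The sketch below shows how that relates to `rawPrediction` for a random forest, using the first raw value of the first row of `test_pred.show(5)`. It assumes that the raw entries sum to the number of trees (128 here), and that the probability is the raw vector divided by its sum; `to_probability` is an illustrative helper.

```python
NUM_TREES = 128                      # numTrees used for the forest above

def to_probability(raw):
    """Normalise a raw prediction vector so that its entries sum to 1."""
    total = sum(raw)
    return [r / total for r in raw]

raw_class0 = 53.0784863119179        # first rawPrediction entry of the first test row
# assumption: the second entry is whatever remains of the 128 tree votes
raw = [raw_class0, NUM_TREES - raw_class0]

score = to_probability(raw)[1]
print(round(score, 6))
```

On Spark 3.0+, `pyspark.ml.functions.vector_to_array` is an alternative to the Python UDF for extracting this element.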

In [73]:
test_pred.show(5)

+-----------+----------+---------------------+--------------------+--------------------+--------------------+----------+----------+
|customer_id|  week_end|purchase_flag_next_4w|            features|       rawPrediction|         probability|prediction|     score|
+-----------+----------+---------------------+--------------------+--------------------+--------------------+----------+----------+
|         15|2019-01-26|                    1|(102,[0,1,2,6,9,1...|[53.0784863119179...|[0.41467567431185...|       1.0|0.58532435|
|         27|2019-01-26|                    1|(102,[14,15,16,17...|[51.1393971441854...|[0.39952654018894...|       1.0|0.60047346|
|         28|2019-01-26|                    1|(102,[14,15,16,17...|[49.8725279055357...|[0.38962912426199...|       1.0| 0.6103709|
|        170|2019-01-26|                    0|(102,[0,1,2,7,10,...|[54.1789500442362...|[0.42327304722059...|       1.0|  0.576727|
|        192|2019-01-26|                    0|(102,[0,1,2,7,10,...|[57.53121

### Test Set Evaluation

In [74]:
evaluator = BinaryClassificationEvaluator(
        rawPredictionCol='rawPrediction',
        labelCol='purchase_flag_next_4w',
        metricName='areaUnderROC')

In [75]:
# areaUnderROC on train (the next cell evaluates test)
evaluator.evaluate(train_pred)

0.8116811886255015

In [76]:
evaluator.evaluate(test_pred)

0.7412597272276923

In [77]:
# confusion matrix (test set)
test_pred.groupBy('purchase_flag_next_4w','prediction').count().sort('purchase_flag_next_4w','prediction').show()

+---------------------+----------+-----+
|purchase_flag_next_4w|prediction|count|
+---------------------+----------+-----+
|                    0|       0.0| 1655|
|                    0|       1.0|  450|
|                    1|       0.0|  937|
|                    1|       1.0| 1120|
+---------------------+----------+-----+



In [78]:
# accuracy
test_pred.where(F.col('purchase_flag_next_4w')==F.col('prediction')).count()/test_pred.count()

0.6667467563671312
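
The accuracy above can be reproduced from the confusion matrix counts, and precision and recall for class 1 follow from the same four numbers. A small plain-Python sketch using the test-set counts shown earlier:

```python
# test-set confusion matrix counts (actual, predicted)
tn, fp, fn, tp = 1655, 450, 937, 1120

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # share of predicted buyers who actually bought
recall = tp / (tp + fn)      # share of actual buyers the model caught

print(total, accuracy)
print(round(precision, 2), round(recall, 2))  # roughly 0.71 and 0.54
```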

### Save Model

In [79]:
trained_clf.write().overwrite().save('./files/model_objects/rw_rf_model/')

In [80]:
trained_clf = RandomForestClassificationModel.load('./files/model_objects/rw_rf_model/')

## Scoring
We will take the records for the latest week_end from df_features and score them using our trained model.

In [81]:
df_features = spark.read.parquet('./data/rw_features/')

In [82]:
max_we = df_features.selectExpr('max(week_end)').collect()[0][0]
max_we

datetime.date(2020, 12, 5)

In [88]:
df_scoring = df_features.where(F.col('week_end')==max_we)

In [89]:
df_scoring.count()

1000

In [90]:
# fillna
df_scoring = df_scoring.fillna(0)

# transformation pipeline
pipe_model = PipelineModel.load('./files/model_objects/rw_pipe/')

# apply
df_scoring = pipe_model.transform(df_scoring)
df_scoring = df_scoring.select('customer_id','week_end','features')

# rf model
trained_clf = RandomForestClassificationModel.load('./files/model_objects/rw_rf_model/')

#apply
secondelement = F.udf(lambda v: float(v[1]), FloatType())

df_scoring = trained_clf.transform(df_scoring).withColumn('score',secondelement(F.col('probability')))

In [91]:
df_scoring.show(5)

+-----------+----------+--------------------+--------------------+--------------------+----------+----------+
|customer_id|  week_end|            features|       rawPrediction|         probability|prediction|     score|
+-----------+----------+--------------------+--------------------+--------------------+----------+----------+
|        148|2020-12-05|[1200.0,1.0,3.0,1...|[68.4381452599104...|[0.53467300984305...|       0.0|  0.465327|
|        787|2020-12-05|(102,[15,16,17,19...|[96.4135634068059...|[0.75323096411567...|       0.0|0.24676904|
|        906|2020-12-05|(102,[16,17,20,21...|[90.4284145792717...|[0.70647198890056...|       0.0|0.29352802|
|        182|2020-12-05|(102,[16,17,20,21...|[90.9264612678324...|[0.71036297865494...|       0.0|0.28963703|
|        442|2020-12-05|(102,[16,17,20,21...|[105.838704191258...|[0.82686487649421...|       0.0|0.17313512|
+-----------+----------+--------------------+--------------------+--------------------+----------+----------+
only showing top 5 rows

In [92]:
# save scored output
df_scoring.repartition(8).write.parquet('./data/rw_scored/', mode='overwrite')