In [1]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from lazypredict.Supervised import LazyClassifier, LazyRegressor
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.svm import SVC
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
import xgboost as xgb
import catboost as cb

from utils import *



# Predicting next purchase day

Knowing what the customer wants even before the customer buys a product is an imperative to thrive in this digital world. It gets more customer-centric, with organizations wanting to know every bit about the customer, predict everything around the customer and take suitable action to create a swell in the number of loyal customers.

And there is customer behavior data that offers nuggets of customer wisdom. As part of this ‘predicting everything around the customer’ exercise, there is also a growing need to predict customer next day purchase, with the behavioral data rising in relevance to know when customers would buy next.

One of the key aspects in predicting customer’s next day purchase would be the behavioral data. Many data wrangling and feature engineering techniques are applied to bring out the most from data towards predicting the customer’s next purchase day. Some of the most common features leveraged in predicting the next purchase day would be the RFM metrics.

*Source:* [Saksoft](https://www.saksoft.com/blog/predicting-customer-next-purchase-day/)

Just like before, we'll be using the datasets from Group 1 as the basis for our visualizations and analyses. Visualizations for the other groups can be seen in the Streamlit app.

Lez go.

In [2]:
# Group 1:
# items1 = pd.read_csv('data/Created in part 01/group1_items.csv', index_col='Invoice', parse_dates=['InvoiceDate'])
invoices = (
    pd.read_csv('../data/Created in part 01/group1_invoices.csv', index_col='Invoice', parse_dates=['InvoiceDate'])
    .pipe(adjust_time_window)
    .pipe(normalize_invoicedate)
    .pipe(clean_customer_id)
)   # importing our dataset of invoices and using all that preprocessing from part 02

In [3]:
invoices.rename({'Customer ID': 'CustomerID'}, axis=1, inplace=True)

___
- # Labels

Labels are what we are trying to predict. In this case, that's the number of days between a customer's last purchase before a cutoff date and their first purchase on or after that date.

In other words, labels are the **y** of our prediction problem.

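We'll split the data at a cutoff date: purchases before it (about the first 8 months) are used to analyze customer behavior, and purchases from the cutoff onward (about 3 months) are used to check if our predictions are correct.

That's respectively `invoices_jan_aug` and `invoices_sep_nov`. Heads-up: the cutoff in the code below is `2010-08-01`, so despite the names, `invoices_jan_aug` ends on July 31 and `invoices_sep_nov` starts on August 1.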

In [4]:
invoices_jan_aug = invoices.query("InvoiceDate < '2010-08-01'")   # used to "train"
invoices_sep_nov = invoices.query("InvoiceDate >= '2010-08-01'")   # used to "test"
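A quick sanity check for a split like this: the two parts shouldn't overlap, and together they should cover every row. Here's a minimal sketch on a made-up df (the `toy` frame and its values are illustrative, not taken from our dataset; plain boolean masks are used instead of `.query`):

```python
import pandas as pd

# Toy invoices (made-up values) with the same column names as `invoices`
toy = pd.DataFrame({
    'CustomerID': ['A', 'A', 'B', 'C'],
    'InvoiceDate': pd.to_datetime(['2010-07-31', '2010-08-01', '2010-03-10', '2010-09-15']),
})

cutoff = pd.Timestamp('2010-08-01')
before = toy[toy['InvoiceDate'] < cutoff]    # strictly before the cutoff
after = toy[toy['InvoiceDate'] >= cutoff]    # the cutoff day itself lands here

# Every row ends up in exactly one of the two parts
assert len(before) + len(after) == len(toy)
print(before['InvoiceDate'].max(), after['InvoiceDate'].min())
```

Note how the invoice from July 31 falls in the "train" part and the one from August 1 in the "test" part.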

Let's start off by getting the last known purchase from the 8-month df, for every customer, and then the first purchase on the other 3-month span.

In [5]:
last_purchase = invoices_jan_aug.groupby('CustomerID')['InvoiceDate'].max().rename('LastPurchase')
first_purchase = invoices_sep_nov.groupby('CustomerID')['InvoiceDate'].min().rename('FirstPurchase')

Now let's create a df of unique CustomerID's and populate it with relevant info, starting by merging the previously created Series into this new df.

"Why is `invoices_jan_aug` being used instead of `invoices`?", you may ask. Well, some customers made their first purchase during the 3-month span we are using to test our model, i.e., they have no earlier data to learn from and can't have their next purchase predicted. Using `invoices_jan_aug` captures only customers who made at least one purchase before the cutoff.

In [6]:
customers = pd.DataFrame({'CustomerID': invoices_jan_aug['CustomerID'].unique()})

# Merges. how='left' keeps every row of `customers` and never adds customers that aren't already in it
customers = customers.merge(last_purchase, how='left', on='CustomerID')
customers = customers.merge(first_purchase, how='left', on='CustomerID')
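To see what a left merge does with customers that have no match, here's a small sketch on made-up data (`customers_toy` and `first_purchase_toy` are hypothetical stand-ins for the real objects):

```python
import pandas as pd

# Made-up data: customer 'C' has no purchase after the cutoff
customers_toy = pd.DataFrame({'CustomerID': ['A', 'B', 'C']})
first_purchase_toy = pd.Series(
    pd.to_datetime(['2010-08-20', '2010-08-08']),
    index=pd.Index(['A', 'B'], name='CustomerID'),
    name='FirstPurchase',
)

# A left merge keeps all three customers; the missing date becomes NaT
merged = customers_toy.merge(first_purchase_toy, how='left', on='CustomerID')
print(merged)
```

Customer `C` stays in the result with `NaT` as `FirstPurchase`, which is the same thing that happens to customers who never came back in our data.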

And now the day difference between the last purchase and the next one (the one to be predicted).

In [7]:
customers['NextDayPurchase'] = (customers['FirstPurchase'] - customers['LastPurchase']).dt.days

customers.head()   # just checking how our df is looking so far

Unnamed: 0,CustomerID,LastPurchase,FirstPurchase,NextDayPurchase
0,14739,2010-07-28,2010-08-20,23.0
1,14370,2010-06-05,2010-08-08,64.0
2,12810,2010-06-23,NaT,
3,16684,2010-07-06,2010-08-19,44.0
4,14047,2010-07-19,2010-08-17,29.0
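In case you're wondering why `NextDayPurchase` shows up as floats (`23.0`) instead of integers: subtracting a date from `NaT` gives a missing value, and the column is stored as float so it can hold `NaN`. A tiny sketch with the first and third rows from the table above:

```python
import pandas as pd

last = pd.Series(pd.to_datetime(['2010-07-28', '2010-06-23']))
first = pd.Series(pd.to_datetime(['2010-08-20', None]))   # second customer never came back

# Timedelta -> whole days; the missing date propagates as NaN
gap = (first - last).dt.days
print(gap.tolist())   # [23.0, nan]
```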


____
- # Features

Our model needs features, right? Let's go get those **X**'s for our prediction problem.

Our feature candidates are:
- Days between the last 3 purchases;
- Standard deviation and mean of the difference of days between purchases;
- RFM data (check the notebook from part 03).

Right. So first we need a new df for the next steps. To get the difference in days of the last 3 purchases, we must first remove same-day purchases.

In [8]:
last_3_purchases = (
    invoices_jan_aug
    [['CustomerID','InvoiceDate']]
    .sort_values(['CustomerID', 'InvoiceDate'])
    .drop_duplicates()   # removes duplicated rows, i.e., same-day purchases from a single customer
)
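This only works because `InvoiceDate` no longer carries the time of day. The sketch below assumes that `normalize_invoicedate` does something along the lines of `.dt.normalize()` (check `utils.py` for what it really does); the data is made up:

```python
import pandas as pd

toy = pd.DataFrame(
    {'CustomerID': ['A', 'A', 'A'],
     'InvoiceDate': pd.to_datetime(['2010-05-21 09:15', '2010-05-21 16:40', '2010-05-24 11:00'])},
    index=pd.Index([1, 2, 3], name='Invoice'),
)

# With the time of day still there, the two invoices from May 21 are different rows
assert len(toy.drop_duplicates()) == 3

# Once the time of day is dropped, they collapse into a single purchase day
toy['InvoiceDate'] = toy['InvoiceDate'].dt.normalize()
purchase_days = toy.drop_duplicates()   # the index (Invoice) is ignored when comparing rows
print(purchase_days)
```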

Things get tricky here! 

To get the day differences between a purchase and the three before it, we shift each customer's purchase dates down by 1, 2 and 3 rows, so that every row also carries the dates of the previous three purchases. Then we subtract those shifted dates from the row's own date. Confusing, right? Let's go step by step, checking how that would look for one of our customers.

We *randomly* select you, customer whose ID is `12431`!

Let's see how their data looks inside our almighty df:

In [9]:
last_3_purchases.query("`CustomerID` == '12431'")

Invoice,CustomerID,InvoiceDate
536389,12431,2010-01-12
494511,12431,2010-01-15
521203,12431,2010-02-09
500008,12431,2010-04-03
509212,12431,2010-05-21
509572,12431,2010-05-24
514810,12431,2010-06-07
513310,12431,2010-06-23
516307,12431,2010-07-19


In [10]:
# And then some math...
last_3_purchases['InvoiceDate-1'] = last_3_purchases.groupby('CustomerID')['InvoiceDate'].shift(1)
last_3_purchases['InvoiceDate-2'] = last_3_purchases.groupby('CustomerID')['InvoiceDate'].shift(2)
last_3_purchases['InvoiceDate-3'] = last_3_purchases.groupby('CustomerID')['InvoiceDate'].shift(3)

Let's check how customer `12431` is doing.

In [11]:
# Each row now also holds the dates of that customer's previous 1, 2 and 3 purchases
last_3_purchases.query("`CustomerID` == '12431'")

Invoice,CustomerID,InvoiceDate,InvoiceDate-1,InvoiceDate-2,InvoiceDate-3
536389,12431,2010-01-12,NaT,NaT,NaT
494511,12431,2010-01-15,2010-01-12,NaT,NaT
521203,12431,2010-02-09,2010-01-15,2010-01-12,NaT
500008,12431,2010-04-03,2010-02-09,2010-01-15,2010-01-12
509212,12431,2010-05-21,2010-04-03,2010-02-09,2010-01-15
509572,12431,2010-05-24,2010-05-21,2010-04-03,2010-02-09
514810,12431,2010-06-07,2010-05-24,2010-05-21,2010-04-03
513310,12431,2010-06-23,2010-06-07,2010-05-24,2010-05-21
516307,12431,2010-07-19,2010-06-23,2010-06-07,2010-05-24


Good.

Now let's do the math and get the differences in days, not in timestamps.

In [12]:
last_3_purchases['DayDiff1'] = (last_3_purchases['InvoiceDate'] - last_3_purchases['InvoiceDate-1']).dt.days
last_3_purchases['DayDiff2'] = (last_3_purchases['InvoiceDate'] - last_3_purchases['InvoiceDate-2']).dt.days
last_3_purchases['DayDiff3'] = (last_3_purchases['InvoiceDate'] - last_3_purchases['InvoiceDate-3']).dt.days

Step up, `12431`!

In [13]:
# The last 3 columns have the difference in days between the current invoice and each of the previous 3
last_3_purchases.query("`CustomerID` == '12431'")

Invoice,CustomerID,InvoiceDate,InvoiceDate-1,InvoiceDate-2,InvoiceDate-3,DayDiff1,DayDiff2,DayDiff3
536389,12431,2010-01-12,NaT,NaT,NaT,,,
494511,12431,2010-01-15,2010-01-12,NaT,NaT,3.0,,
521203,12431,2010-02-09,2010-01-15,2010-01-12,NaT,25.0,28.0,
500008,12431,2010-04-03,2010-02-09,2010-01-15,2010-01-12,53.0,78.0,81.0
509212,12431,2010-05-21,2010-04-03,2010-02-09,2010-01-15,48.0,101.0,126.0
509572,12431,2010-05-24,2010-05-21,2010-04-03,2010-02-09,3.0,51.0,104.0
514810,12431,2010-06-07,2010-05-24,2010-05-21,2010-04-03,14.0,17.0,65.0
513310,12431,2010-06-23,2010-06-07,2010-05-24,2010-05-21,16.0,30.0,33.0
516307,12431,2010-07-19,2010-06-23,2010-06-07,2010-05-24,26.0,42.0,56.0


I hope that made sense...
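Here is the same trick once more on a tiny made-up df with two customers. The point of this sketch is that the shift happens inside each customer's group, so dates never leak from one customer to the next (the `Previous` column name is just for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    'CustomerID': ['A', 'A', 'A', 'B', 'B'],
    'InvoiceDate': pd.to_datetime(
        ['2010-01-12', '2010-01-15', '2010-02-09', '2010-03-01', '2010-03-11']),
})

# shift(1) within each group: customer B's first row gets NaT, not A's last date
toy['Previous'] = toy.groupby('CustomerID')['InvoiceDate'].shift(1)
toy['DayDiff1'] = (toy['InvoiceDate'] - toy['Previous']).dt.days
print(toy['DayDiff1'].tolist())   # [nan, 3.0, 25.0, nan, 10.0]
```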

Now we need the mean and the standard deviation of the difference between purchases.

In [14]:
mean_and_std_df = last_3_purchases.groupby('CustomerID', as_index=False).agg(
    DayDiffmean=pd.NamedAgg('DayDiff1', 'mean'),
    DayDiffstd=pd.NamedAgg('DayDiff1', 'std'),
)
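To get a feel for what these two aggregates return when a customer has very few purchases, here's a sketch on made-up `DayDiff1` values (customer `A` has 4 purchase days, `B` has 2, `C` has only 1):

```python
import pandas as pd

toy = pd.DataFrame({
    'CustomerID': ['A', 'A', 'A', 'A', 'B', 'B', 'C'],
    'DayDiff1': [None, 3.0, 25.0, 53.0, None, 10.0, None],
})

stats = toy.groupby('CustomerID', as_index=False).agg(
    DayDiffmean=pd.NamedAgg('DayDiff1', 'mean'),
    DayDiffstd=pd.NamedAgg('DayDiff1', 'std'),
)
print(stats)
```

`A` gets both a mean and a standard deviation. `B` has a single gap, so the mean exists but the (sample) standard deviation is `NaN`. `C` has no gap at all, so both are `NaN`.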

Some customers only had 1 or 2 purchases, hence so many `NaN`'s. When predicting stuff, the more data you have for training, the better. Predicting the next purchase day of a customer who only had 1 or 2 purchases is not going to be reliable.

We should focus on *frequent* customers, i.e., customers with ***at least*** 4 purchases (on different days). For customers with more than 4 purchases, we are gonna select only the data from the last 4 purchases.

`pandas`-ly speaking, this means:

In [15]:
last_3_purchases = (
    last_3_purchases
    .dropna()   # if a row has a NaN, it means that row does not have all data from the last 4 purchases
    .drop_duplicates(subset=['CustomerID'],keep='last')   # ok, this customer has at least 4 purchases, but we only want the last 4
)
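A small sketch of what those two calls do together, on made-up values. It relies on the rows already being sorted by date within each customer, which is the case for our df because of the earlier `sort_values`:

```python
import pandas as pd

toy = pd.DataFrame({
    'CustomerID': ['A'] * 5 + ['B'] * 2,
    'DayDiff1': [None, 3, 25, 53, 48, None, 10],
    'DayDiff2': [None, None, 28, 78, 101, None, None],
    'DayDiff3': [None, None, None, 81, 126, None, None],
})

latest = (
    toy
    .dropna()                                             # keeps only rows with all three differences
    .drop_duplicates(subset=['CustomerID'], keep='last')  # one row per customer: the most recent one
)
print(latest)
```

Customer `B` (only 2 purchase days) disappears entirely, and customer `A` is reduced to the single row describing their latest purchase.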

And finally, our RFM data from part 03!

In [16]:
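# This function generates a df with RFM from any df of invoices. Check `utils.py` for more info.
# Note: it receives the full `invoices` df here, so the RFM values also reflect purchases made after the cutoff.
rfm_df = preprocessing_part_03(invoices.rename({'CustomerID': 'Customer ID'}, axis=1))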

Before merging everything into our `customers` df, let's clean up all unneeded columns from each df to be merged (and from `customers` itself).

In [17]:
customers = customers[['CustomerID', 'NextDayPurchase']]
last_3_purchases = last_3_purchases[['CustomerID', 'DayDiff1','DayDiff2','DayDiff3']]
mean_and_std_df = mean_and_std_df[['CustomerID', 'DayDiffmean','DayDiffstd']]
rfm_df = rfm_df[['CustomerID', 'Recency', 'RecencyCluster', 'Frequency', 'FrequencyCluster', 'Monetary', 'MonetaryCluster', 'Score', 'Segment']]

In [18]:
customers

Unnamed: 0,CustomerID,NextDayPurchase
0,14739,23.00
1,14370,64.00
2,12810,
3,16684,44.00
4,14047,29.00
...,...,...
3153,15021,14.00
3154,14346,72.00
3155,16428,84.00
3156,17109,91.00


In [19]:
customers = customers.merge(last_3_purchases, on='CustomerID')
customers = customers.merge(mean_and_std_df, on='CustomerID')
customers = customers.merge(rfm_df, on='CustomerID')

customers   # let's see how it looks

Unnamed: 0,CustomerID,NextDayPurchase,DayDiff1,DayDiff2,DayDiff3,DayDiffmean,DayDiffstd,Recency,RecencyCluster,Frequency,FrequencyCluster,Monetary,MonetaryCluster,Score,Segment
0,14739,23.00,12.00,16.00,62.00,20.70,14.35,30,4,16,1,439.72,1,6,Mid
1,16684,44.00,6.00,71.00,76.00,30.83,25.72,5,4,24,2,1158.64,1,7,Mid
2,14047,29.00,67.00,69.00,99.00,33.00,21.17,105,2,10,1,522.52,1,4,Mid
3,12540,76.00,34.00,62.00,66.00,23.57,28.24,4,4,14,1,718.20,1,6,Mid
4,17969,42.00,21.00,22.00,42.00,18.55,10.04,86,3,13,1,614.84,1,5,Mid
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
652,16059,113.00,15.00,17.00,30.00,13.80,7.26,13,4,7,1,654.22,1,6,Mid
653,14745,,49.00,51.00,62.00,15.75,22.62,127,2,6,1,106.12,0,3,Low
654,13324,39.00,15.00,17.00,46.00,15.33,13.50,77,3,8,1,191.41,0,4,Mid
655,14626,53.00,16.00,19.00,55.00,18.33,16.62,65,3,6,1,249.31,0,4,Mid


___
- # Model training

Now we need to define the classes that will be assigned to our labels (reminder: the `NextDayPurchase` column). Here they are:
- Class 2: customers that will purchase in the next 0 - 20 days;
- Class 1: customers that will purchase in the next 21 - 50 days;
- Class 0: customers that will purchase in more than 50 days.

Before assigning labels, we must check whether there are any `NaN`'s in the `NextDayPurchase` column.

In [20]:
customers.NextDayPurchase.isna().sum()

58

Why is that so? 

Well, some customers might have made 4 purchases in those first 8 months, but they didn't come back in the other 3 months. Bummer, I know.

In other words, their next day of purchase is over 3 months (90+ days). Or maybe never. Either way, they belong to class 0, so let's just fill those `NaN`'s with a high numeric value.

In [21]:
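# Fill only the label column, so no feature column gets a 90 by accident
customers['NextDayPurchase'] = customers['NextDayPurchase'].fillna(90)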

customers['Labels'] = 2
customers.loc[customers['NextDayPurchase'] > 20, 'Labels'] = 1
customers.loc[customers['NextDayPurchase'] > 50, 'Labels'] = 0
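To double-check where the class boundaries fall, here's a sketch that labels a few hand-picked day counts in two ways: with the same `.loc` pattern as above, and with `np.select` as a one-step alternative (the alternative is my suggestion, not something used elsewhere in this notebook):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'NextDayPurchase': [0, 20, 21, 50, 51, 90]})

# Same pattern as above: start at class 2, then overwrite
toy['Labels'] = 2
toy.loc[toy['NextDayPurchase'] > 20, 'Labels'] = 1
toy.loc[toy['NextDayPurchase'] > 50, 'Labels'] = 0

# One-step alternative: conditions are checked in order, first match wins
alt = np.select(
    [toy['NextDayPurchase'] > 50, toy['NextDayPurchase'] > 20],
    [0, 1],
    default=2,
)

print(toy['Labels'].tolist())   # [2, 2, 1, 1, 0, 0]
```

So 20 days still counts as class 2, 50 days still counts as class 1, and the filled-in value of 90 lands in class 0.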

Before that old train-and-test-split thing, let's make some copies (**hint**: these will be useful in the future).

In [22]:
X = customers.copy()

y = X.pop('Labels')
X = X.drop(columns=['CustomerID', 'NextDayPurchase'])

In [23]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=13)
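One optional variation worth knowing: if the three classes turn out to be imbalanced, `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in both parts. The split above doesn't use it; this is just a sketch on made-up labels:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up, imbalanced labels: 60% class 0, 30% class 1, 10% class 2
y_toy = pd.Series([0] * 60 + [1] * 30 + [2] * 10)
X_toy = pd.DataFrame({'feature': range(len(y_toy))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=13, stratify=y_toy
)

# Class counts in the test part follow the same 60/30/10 proportions
print(y_te.value_counts().sort_index().tolist())
```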

In [24]:
# model = cb.CatBoostRegressor()
# model.fit(X_train.values, y_train.values)
# preds = model.predict(X_test.values)

In [25]:
# reg = LazyClassifier(verbose=0, ignore_warnings=False, custom_metric=None)
# models, predictions = reg.fit(X_train, X_test, y_train, y_test)
# print(models)
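The two cells above are left commented out. As a rough sketch of what evaluating a single classifier with cross-validation could look like, here's an example on random made-up data (so the score itself is meaningless). Everything in it is an assumption for illustration: the choice of `DecisionTreeClassifier`, the one-hot encoding of the text column `Segment`, and the `'High'` segment value.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Random stand-in for a few of our feature columns (not real data)
rng = np.random.default_rng(13)
n = 200
X_toy = pd.DataFrame({
    'DayDiff1': rng.integers(1, 100, n),
    'Recency': rng.integers(0, 200, n),
    'Segment': rng.choice(['Low', 'Mid', 'High'], n),
})
y_toy = pd.Series(rng.integers(0, 3, n))   # random classes 0, 1, 2

# Most scikit-learn models need numbers, so the text column is one-hot encoded
X_encoded = pd.get_dummies(X_toy, columns=['Segment'])

cv = KFold(n_splits=5, shuffle=True, random_state=13)
scores = cross_val_score(DecisionTreeClassifier(random_state=13), X_encoded, y_toy, cv=cv)
print(scores.mean())   # mean accuracy over the 5 folds
```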