# Summary notebook:

This notebook aims to present the key findings from the research in the following sections: 
1. dataset

2. modelling

3. implementation of the ranking system

4. results and recommendations

Each section also explains the methodology used, as well as the assumptions and limitations that have to be considered.

In [8]:
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt
import numpy as np
from IPython import display
import plotly.graph_objects as go
from pyspark.sql import SparkSession
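# import only the Spark functions used below; a wildcard import would shadow built-ins such as sum, min and max
from pyspark.sql.functions import col, count, countDistinct, when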

In [9]:
# Create a spark session (which will run spark jobs)
spark = (
    SparkSession.builder.appName("MAST30034 Project 2")
    .config("spark.sql.repl.eagerEval.enabled", True)
    .config("spark.sql.parquet.cacheMetadata", "true")
    .config("spark.sql.session.timeZone", "Etc/UTC")
    .config('spark.driver.memory', '4g')
    .config('spark.executor.memory', '2g')
    .getOrCreate()
)

## 1. Datasets

**Given datasets**

We were given four main types of business datasets:

- transaction data 

    - around 14,000,000 transactions were recorded, from Feb 2021 to Oct 2022

- merchant details

    - around 4,000 merchants from various industries, with different take rates

- consumer details

    - around 500,000 consumers, with postcode and gender recorded

- fraud data

    - around 4,000 instances of merchant fraud and 80,000 instances of consumer fraud, each with a fraud probability

**External Dataset**

Demographic and income datasets were obtained from census data published by the Australian Bureau of Statistics (ABS). These data are provided at the SA2 level, where an SA2 (Statistical Area Level 2) is a geographical area defined by the ABS for statistical purposes. 

Postcodes in the consumer dataset were used to merge in the SA2 data, although the correspondence between postcodes and SA2 regions may be inaccurate to some extent.

- Demographic data

    - 2021 and 2022 population and population density at each SA2.

- Income data

    - median age of earners in each SA2

    - median and sum income of each SA2
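
The postcode-to-SA2 merge is approximate because one postcode can overlap several SA2 regions. The sketch below shows one possible way to resolve this, assuming a correspondence table that gives an overlap ratio for each (postcode, SA2) pair. The table, its values, and the "largest overlap wins" rule are illustrative assumptions, not necessarily what the ETL script does.

```python
# Hypothetical correspondence rows: (postcode, SA2 code, share of the postcode inside that SA2).
# The codes and ratios are made up for illustration.
correspondence = [
    ("3000", "206041122", 0.70),
    ("3000", "206041117", 0.30),
    ("3053", "206041117", 1.00),
]


def dominant_sa2(rows):
    """Map each postcode to the SA2 that covers the largest share of it."""
    best = {}
    for postcode, sa2_code, ratio in rows:
        # keep the SA2 with the highest overlap ratio seen so far
        if postcode not in best or ratio > best[postcode][1]:
            best[postcode] = (sa2_code, ratio)
    return {postcode: sa2_code for postcode, (sa2_code, _) in best.items()}


postcode_to_sa2 = dominant_sa2(correspondence)
```

Under this rule, every consumer in the toy postcode 3000 inherits the features of a single SA2 even though 30% of the postcode lies elsewhere, which is the kind of inaccuracy mentioned above.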

In [10]:
# Running the ETL script creates the data below, which combines all the given datasets into one transaction dataset
# (each row is a unique transaction, with all the details about the associated merchant and consumer included)
data = spark.read.parquet('../data/raw/transactions_data/')

# merge with the external datasets
consumer_external = spark.read.format("csv") \
                .options(header=True, delimiter=",") \
                .load('../data/raw/addedInfo_transaction/tbl_consumer_demo_income.csv') \
                .withColumnRenamed("name", "consumer_name")
data = data.join(consumer_external, 
                 on=['consumer_id', 'consumer_name', 'address', 'state', 'postcode', 'gender'], 
                 how='inner')

In [11]:
# Example of one transaction after merging 
data.show(1)


+-----------+-------------+--------------------+-----+--------+------+------------+--------------+-------+-----------------+--------------------+-------------------+--------------------+--------------------+-------------+--------------------+------------------+-------------------+----------+--------------------+-------+-------+----------+----------+----+--------------+-----------+-----------------+-----------+-------------+-----------+
|consumer_id|consumer_name|             address|state|postcode|gender|merchant_abn|order_datetime|user_id|     dollar_value|            order_id|consumer_fraud_prob|       merchant_name|                tags|merchant_desc|merchant_revenue_lvl|merchant_take_rate|merchant_fraud_prob|SA2_CODE21|          SA2_NAME21|2021ERP|2022ERP|ERPchange#|ERPchange%|Area|popDensity2022|num_earners|medianAge_earners| sum_income|median_income|mean_income|
+-----------+-------------+--------------------+-----+--------+------+------------+--------------+-------+--------------

                                                                                

### 1. i) Preprocessing

Some merchants have no information about themselves registered in the dataset, such as name or industry type.

Since merchants with no information are probably not our clients, we removed them from the dataset.

In [12]:
# print out the number of merchants that do have information (i.e. the merchants we keep)
filtered = data.filter(col('merchant_name').isNotNull())
filtered.agg(*[countDistinct(col('merchant_abn')).alias('# of merchants with information')]).show()



+-------------------------------+
|# of merchants with information|
+-------------------------------+
|                           4026|
+-------------------------------+



                                                                                

Consequently, we removed around 580,000 transactions associated with merchants without any information from our dataset.

In [13]:
# print out the number of transactions without any merchant information
# (transactions made by merchants that are not among those counted above)
columns_to_inspect = ['merchant_name']
data.select([count(when(col(c).isNull(), c)).alias(c) for c in columns_to_inspect]).show()



+-------------+
|merchant_name|
+-------------+
|       580830|
+-------------+



                                                                                

### 1. ii) Preliminary analysis with visualizations

<div>
<img src="../plots/numbers_of_transaction_in_date.png" width="600"/>
</div>

Observations:

- evidence of an annual trend, so predictions should be made on a yearly basis

- there is a spike in the number of transactions from Nov 2021 to Jan 2022, which corresponds to the holiday season

<div>
<img src="../plots/revenue_IndustryComparison.png" width="600"/>
</div>

Observations:
- the computer and tent industries generate exceptionally high total revenue compared to other industries, meaning that a lot of money flows from consumers to merchants in these two industries

- potential benefit for the BNPL company in approaching these industries specifically

<div>
<img src='../plots/merchantFraud_IndustryComparison.png' width="600"/>
</div>

Observation:

- antique and jewelry are the two industries with the most fraudulent merchant transactions

- both industries are in the luxury segment, with high average revenue per transaction, so a single fraud can bring a large gain to the fraudster and a large loss to the other party

- merchants from these industries should be treated with special attention

## 2. Models

**Approach taken:**
- Forecast model
    - predict revenue, number of transactions and number of consumers in 2023 for each merchant, based on 2021 and 2022 data

    - missing months (2021/1~2 and 2022/11~12) were imputed using an average and a time series model, respectively

    - implemented with linear regression

- Fraud Detection model
    - classify whether the transaction is suspicious or not

    - transactions with fraud probability over 50% were labelled as 'suspicious' (true) to ensure a large enough sample to train the model on
    
    - implemented with random forest classifier

**Assumptions and Limitations:**
- Assumptions:
    - Transactions follow a similar pattern every year

    - The distribution of the fraud dataset is representative of the whole transaction dataset

- Limitations:
    - Limited amount of transaction and fraud data

    - The conversion between SA2 and postcode was approximated, so the external features are also approximate

<!-- **Steps for implementation:**
1. Prepare training and test dataset for forecast and fraud detection model
    
    - Forecast model: missing months imputation

    - Fraud Detection model: put labels on fraud dataset

2. Train the models on the training dataset

3. Test the models on the test dataset
    
    - Forecast model: use R^2, MSE, etc
    
    - Fraud Detection model: use Precision, Recall, F1, etc 

4. Make predictions

    - Forecast model: predict 2023 data
    
    - Fraud Detection model: predict suspicious transactions among all the past transactions -->

### 2. i) Forecast model: Linear model

**Missing months imputation:**

We need 2 years of historical data to train and test the linear regression model, but the following 4 months are missing:

- Jan, Feb of 2021: as there is no prior data, imputed with the average of the following 3 months

- Nov, Dec of 2022: imputed by fitting an ARIMA model with a seasonal component for every merchant
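
The first imputation rule can be sketched as follows. This is a minimal illustration in plain Python, assuming monthly values are stored in calendar order with `None` for the missing months. The function name and the toy numbers are hypothetical and not taken from the project code.

```python
import statistics


def impute_leading_months(monthly_values, n_missing=2, window=3):
    """Fill the first `n_missing` months with the mean of the next `window` observed months."""
    observed = monthly_values[n_missing:n_missing + window]
    fill_value = statistics.mean(observed)
    return [fill_value] * n_missing + list(monthly_values[n_missing:])


# Toy monthly revenue for one merchant; None marks the missing Jan and Feb.
revenue_2021 = [None, None, 120.0, 90.0, 150.0, 110.0]
imputed = impute_leading_months(revenue_2021)
# Jan and Feb are both filled with the mean of Mar, Apr and May: (120 + 90 + 150) / 3 = 120.0
```

The seasonal ARIMA step for Nov and Dec 2022 is not sketched here, since it depends on the model configuration chosen for each merchant.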


Revenue prediction example|User count prediction example|Transaction count prediction example
-|-|-
![Alt Text](../plots/rev_pred.png) | ![Alt Text](../plots/user_count_pred.png) | ![Alt Text](../plots/trans_count_pred.png)

**Fitting linear model:**

After imputing the missing months, a linear model was fitted on the following features to predict 2023 data:
- Previous year's revenue, user count, transaction count

- Category of the merchant

- Take rate

- Demographic / Income features of the consumer base

The performance of the model against the benchmark 1R model is shown below.

<center>
<img src="../plots/Linear_model_perf.png" width="400" />
</center>

It shows that the linear model (blue) performs significantly better than the 1R benchmark (red), as it achieves a smaller MAE.
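
For reference, MAE (mean absolute error) is the average absolute gap between actual and predicted values, so a smaller value means better predictions. The sketch below uses made-up toy numbers, not project results, and the constant-prediction baseline is only a stand-in for illustration rather than the 1R benchmark itself.

```python
import math
import statistics


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return statistics.mean(math.fabs(a - p) for a, p in zip(actual, predicted))


# Toy values for three merchants (not project results).
actual = [100.0, 200.0, 300.0]
linear_pred = [110.0, 190.0, 310.0]      # hypothetical linear-model predictions
baseline_pred = [150.0, 150.0, 150.0]    # hypothetical baseline that predicts one constant

mae_linear = mean_absolute_error(actual, linear_pred)      # (10 + 10 + 10) / 3 = 10.0
mae_baseline = mean_absolute_error(actual, baseline_pred)  # (50 + 50 + 150) / 3
```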

### 2. ii) Fraud detection model: Random Forest Classifier

**Model setup:**

- The fraud detection model is trained on 12 features, including transaction details, consumer details, merchant details, and demographic and income data. 

- There are two types of fraud probability, consumer fraud and merchant fraud, which are assumed to have different properties. Therefore, a separate model was trained for each.

- All transactions with fraud probability greater than 50% were labelled 'Fraudulent' (is_fraud=True) so that there are enough positive examples to train the model on. 
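
The labelling rule can be sketched as below. This is a minimal illustration that assumes the fraud probability is stored on a 0-100 percentage scale. The constant, function name, and toy values are hypothetical.

```python
FRAUD_THRESHOLD = 50.0  # assumption: probabilities are on a 0-100 percentage scale


def label_fraud(fraud_probabilities, threshold=FRAUD_THRESHOLD):
    """Return True for records whose fraud probability is strictly above the threshold."""
    return [p > threshold for p in fraud_probabilities]


# Toy probabilities, in percent.
probs = [9.8, 50.0, 50.1, 94.7]
is_fraud = label_fraud(probs)
# A record at exactly 50% is not labelled, because the rule is "greater than 50%".
```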

**Results**

After training the models on the fraud datasets, we predicted labels for all transactions. 

The following dataframe contains all the predictions with the following attributes: 
- order_id as the primary key, for every transaction
- consumer_is_fraud and merchant_is_fraud, indicating the labels for is_fraud according to consumer and merchant models respectively.

In [14]:
# printing some examples of fraud predictions
fraud_prediction = pd.read_parquet('../data/raw/transactions_fraud')
fraud_prediction.head(5)

Unnamed: 0,order_id,consumer_is_fraud,merchant_is_fraud
0,50ad85ec-9f43-4dee-b429-8a2f82fdc27f,False,False
1,d90fee6a-06e8-4b1a-8259-e756a04eb21c,False,False
2,9623bf34-90e1-41c6-a679-33102ef4a18c,False,False
3,765be87d-700f-43f3-8962-37e6f9a17558,False,False
4,715a2510-5798-4c35-bcac-4fe84fcfdd45,False,False


In [15]:
# print the proportion of transactions labelled as suspicious by consumer fraud model, out of the entire transactions
sus_count_c = (fraud_prediction['consumer_is_fraud'] == True).sum()
print('Consumer Fraud model:')
print(f'{sus_count_c} instances labelled as suspicious and {fraud_prediction.shape[0]-sus_count_c} labelled otherwise.')
print(f'Proportion of fraudulent data: {np.round(100*sus_count_c/(fraud_prediction.shape[0]), 2)}%\n')

# print the proportion of transactions labelled as suspicious by merchant fraud model, out of the entire transactions
sus_count_m = (fraud_prediction['merchant_is_fraud'] == True).sum()
print('Merchant Fraud model:')
print(f'{sus_count_m} instances labelled as suspicious and {fraud_prediction.shape[0]-sus_count_m} labelled otherwise.')
print(f'Proportion of fraudulent data: {np.round(100*sus_count_m/(fraud_prediction.shape[0]), 2)}%')

Consumer Fraud model:
3209 instances labelled as suspicious and 13610452 labelled otherwise.
Proportion of fraudulent data: 0.02%

Merchant Fraud model:
510611 instances labelled as suspicious and 13103050 labelled otherwise.
Proportion of fraudulent data: 3.75%


We conducted cross-validation to evaluate model performance on the fraud data. Although the data is imbalanced, both models score highly on accuracy and precision, while recall is lower.

In [16]:
# plot the model performances for consumer fraud detection and merchant fraud detection
categories = ['accuracy', 'precision', 'recall', 'f1-score']

fig = go.Figure()

fig.add_trace(go.Scatterpolar(
      r=[0.9913, 0.9716, 0.7971, 0.8279],
      theta=categories,
      fill='toself',
      name='Merchant Fraud Detection Performance'
))
fig.add_trace(go.Scatterpolar(
      r=[0.9896, 0.9452, 0.6564, 0.7322],
      theta=categories,
      fill='toself',
      name='Consumer Fraud Detection Performance'
))
fig.update_layout(
  polar=dict(
    radialaxis=dict(
      visible=True,
      range=[0, 1]
    )),
  showlegend=True
)
fig.show()

| Metric | Merchant fraud detection | Consumer fraud detection |
| --- | --- | --- |
| Precision | 0.97 | 0.95 |
| Recall | 0.80 | 0.66 |
We can see that the merchant fraud model outperforms the consumer fraud model on every metric. 

- Their accuracies are very high, but accuracy is not informative because the data is very imbalanced.

- Precision is very high for both models, above 90%. Precision is the most important metric in our case because we want to minimize false positives (i.e. genuine transactions being labelled as fraudulent) and so avoid false accusations, which could significantly undermine our clients' trust in us.

- Recall is also important, to avoid false negatives (i.e. fraudulent transactions labelled as benign) and the resulting losses to our company. Improving recall is one area for future research.
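
To make the point about imbalance concrete, the sketch below computes the four metrics from confusion-matrix counts. The counts are made up (1% positives) and are not the project's results; the function name is hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Toy counts: 100 fraudulent and 9,900 genuine transactions.
metrics = classification_metrics(tp=60, fp=5, fn=40, tn=9895)

# A model that never flags anything gets every genuine transaction right and every fraud wrong.
all_negative_accuracy = 9900 / 10000
```

In this toy case accuracy is 0.9955 while recall is only 0.6, and the model that flags nothing still reaches 0.99 accuracy, which is why accuracy alone says little on imbalanced data.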

## 3. Ranking system

**Approach taken:** Weighted sum method

**Heuristics used and their relative weights:**
- Predicted gain: 0.4

- Predicted number of consumers: 0.25

- Predicted number of transactions: 0.25

- Reliability of the merchant: 0.05

- Reliability of consumer base: 0.05

**Definitions:**
- Predicted gain = predicted revenue * take rate

- Reliability of the merchant = 1 - (number of suspicious transactions by the merchant) / (total number of transactions made by the merchant)

- Reliability of consumer base = 1 - (number of suspicious transactions by the consumer base) / (total number of transactions made by the consumer base)
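
A small worked example of these definitions, using made-up numbers. The function names are hypothetical, and the take rate is assumed here to be expressed as a fraction (0.05 for 5%); if it is stored as a percentage it would need dividing by 100 first.

```python
def predicted_gain(predicted_revenue, take_rate):
    """Gain for the BNPL company: predicted merchant revenue times the take rate (as a fraction)."""
    return predicted_revenue * take_rate


def reliability(num_suspicious, num_total):
    """Share of transactions that were NOT labelled as suspicious."""
    return 1 - num_suspicious / num_total


# Toy merchant: $1,000,000 predicted revenue, 5% take rate, 15 suspicious out of 1,000 transactions.
gain = predicted_gain(predicted_revenue=1_000_000.0, take_rate=0.05)
merchant_reliability = reliability(num_suspicious=15, num_total=1000)
```

Here the gain is about $50,000 and the reliability is about 0.985, which is in the same range as the reliability values in the ranking tables.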

<!-- **Steps for implementation:**
1. Normalize each feature

2. Get weighted sum of features values for each merchant, accordingly as above

3. Sort the merchants by the weighted sum -->
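
Written out, the ranking score for merchant $m$ is the weighted sum of its min-max normalized features. The symbols $s_m$, $x_{m,k}$ and $w_k$ are notation introduced here for illustration, with $k$ ranging over the five heuristics and $j$ over all merchants:

$$
s_m = \sum_{k} w_k \, \frac{x_{m,k} - \min_{j} x_{j,k}}{\max_{j} x_{j,k} - \min_{j} x_{j,k}}, \qquad \sum_{k} w_k = 0.4 + 0.25 + 0.25 + 0.05 + 0.05 = 1
$$

Because every normalized feature lies in $[0, 1]$ and the weights sum to 1, the score also lies in $[0, 1]$.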

In [17]:
merchants_preSorted = pd.read_csv("../data/curated/merchants_preSorted")

In [18]:
# define feature weights
default_weights = {"pred_gain": 0.4, "pred_num_consumer": 0.25, "pred_num_trans": 0.25, "merchant_reliability": 0.05, "consumer_reliability": 0.05}

In [19]:
# step 1: min-max normalize each feature to the range [0, 1]
criteria = list(default_weights)
merchants_subset = merchants_preSorted[criteria]
normalized_df = (merchants_subset - merchants_subset.min()) / (merchants_subset.max() - merchants_subset.min())

In [20]:
# step 2: get weighted sum of feature values for each merchant
# (the lambda argument is named `feature` to avoid confusion with pyspark's `col`)
default_weighted_sum = normalized_df.apply(lambda feature: feature * default_weights[feature.name]).sum(axis=1)

In [21]:
# step 3: sort the merchants by the weighted sum
merchants_preSorted["default_weighted_sum"] = default_weighted_sum
merchants_sorted = merchants_preSorted.sort_values(by='default_weighted_sum', ascending=False)
merchants_sorted = merchants_sorted.reset_index(drop=True)

# add a column that indicates the ranking of each merchant
merchants_sorted["rank"] = merchants_sorted.index + 1

## 4. Results and Recommendations

In [22]:
merchants_sorted = pd.read_csv("../data/curated/merchants_sorted.csv")

### Top 10 of the 100 recommended merchants (across all industries)

- 8 of the top 10 merchants are from the recreational segment

- the top 3 merchants are: Leo In Consulting, Lacus Consulting, Est Nunc Consulting

In [23]:
# top 100 merchants
top100_merchants = merchants_sorted.head(100)
top100_merchants.head(10)

Unnamed: 0,merchant_name,category,segment,pred_num_trans,pred_gain,pred_num_consumer,merchant_reliability,consumer_reliability,default_weighted_sum,rank
0,Leo In Consulting,watch,luxury,173536.0,390640.318996,130107.0,0.985228,0.999791,0.953038,1
1,Lacus Consulting,gift,recreational,145209.0,372704.465598,113733.0,0.985503,0.999825,0.867038,2
2,Est Nunc Consulting,tent,recreational,137610.0,340722.911863,108809.0,0.983872,0.99981,0.816735,3
3,Non Vestibulum Industries,tent,recreational,157091.0,264025.57371,120701.0,0.981982,0.999778,0.791217,4
4,Erat Vitae LLP,florists,recreational,183854.0,162804.614803,135426.0,0.991085,0.999827,0.757767,5
5,Lorem Ipsum Sodales Industries,florists,recreational,127566.0,257363.423451,102536.0,0.981246,0.999806,0.711089,6
6,Pede Nonummy Corp.,tent,recreational,165578.0,137339.140382,125632.0,0.991058,0.999831,0.690313,7
7,Mauris Non Institute,cable,IT,76224.0,363062.301237,66851.0,0.983958,0.999834,0.677335,8
8,Lobortis Ultrices Company,music,recreational,71875.0,354104.129551,63557.0,0.98471,0.999789,0.656749,9
9,Orci In Consequat Corporation,gift,recreational,54456.0,394349.647795,49569.0,0.984868,0.999872,0.646013,10


### Top 10 merchants in Luxury segment

In [24]:
merchants_sorted[merchants_sorted['segment']=='luxury'].head(10)

Unnamed: 0,merchant_name,category,segment,pred_num_trans,pred_gain,pred_num_consumer,merchant_reliability,consumer_reliability,default_weighted_sum,rank
0,Leo In Consulting,watch,luxury,173536.0,390640.318996,130107.0,0.985228,0.999791,0.953038,1
21,Gravida Mauris Incorporated,watch,luxury,32761.0,315383.681366,30950.0,0.985126,0.999846,0.50611,22
35,Magna Sed Industries,art,luxury,2548.0,283722.417216,2515.0,0.984983,1.0,0.38205,36
43,Aliquam Auctor Associates,antique,luxury,22496.0,204023.70845,21608.0,0.988552,0.999775,0.367832,44
50,Commodo Ipsum Industries,jewelry,luxury,539.0,260535.873579,505.0,0.987864,0.921117,0.349478,51
51,Lacus Aliquam Corporation,antique,luxury,997.0,255175.224632,954.0,0.990832,0.97446,0.348583,52
67,Iaculis LLC,watch,luxury,24990.0,118325.336781,23941.0,0.990856,0.999874,0.29312,68
79,Hendrerit A Corporation,watch,luxury,12332.0,141882.656231,12079.0,0.984299,0.999795,0.276367,80
80,Nulla Facilisis Institute,watch,luxury,15685.0,130273.686267,15599.0,0.984161,0.999733,0.276234,81
86,Tellus Id LLC,watch,luxury,28057.0,84862.913634,26668.0,0.990769,0.999865,0.270094,87


### Top 10 merchants in Recreational segment

In [25]:
merchants_sorted[merchants_sorted['segment']=='recreational'].head(10)

Unnamed: 0,merchant_name,category,segment,pred_num_trans,pred_gain,pred_num_consumer,merchant_reliability,consumer_reliability,default_weighted_sum,rank
1,Lacus Consulting,gift,recreational,145209.0,372704.465598,113733.0,0.985503,0.999825,0.867038,2
2,Est Nunc Consulting,tent,recreational,137610.0,340722.911863,108809.0,0.983872,0.99981,0.816735,3
3,Non Vestibulum Industries,tent,recreational,157091.0,264025.57371,120701.0,0.981982,0.999778,0.791217,4
4,Erat Vitae LLP,florists,recreational,183854.0,162804.614803,135426.0,0.991085,0.999827,0.757767,5
5,Lorem Ipsum Sodales Industries,florists,recreational,127566.0,257363.423451,102536.0,0.981246,0.999806,0.711089,6
6,Pede Nonummy Corp.,tent,recreational,165578.0,137339.140382,125632.0,0.991058,0.999831,0.690313,7
8,Lobortis Ultrices Company,music,recreational,71875.0,354104.129551,63557.0,0.98471,0.999789,0.656749,9
9,Orci In Consequat Corporation,gift,recreational,54456.0,394349.647795,49569.0,0.984868,0.999872,0.646013,10
12,Vehicula Pellentesque Corporation,artist supply,recreational,115995.0,182123.924153,95089.0,0.985258,0.999747,0.609349,13
13,Dictum Phasellus In Institute,gift,recreational,62801.0,326806.57115,56325.0,0.983635,0.999788,0.604719,14


### Top 10 merchants in IT segment

In [26]:
merchants_sorted[merchants_sorted['segment']=='IT'].head(10)

Unnamed: 0,merchant_name,category,segment,pred_num_trans,pred_gain,pred_num_consumer,merchant_reliability,consumer_reliability,default_weighted_sum,rank
7,Mauris Non Institute,cable,IT,76224.0,363062.301237,66851.0,0.983958,0.999834,0.677335,8
14,Nullam Consulting,digital,IT,64694.0,285565.048651,57869.0,0.985686,0.999764,0.570528,15
16,Placerat Eget Venenatis Limited,computer,IT,114740.0,135076.715559,94331.0,0.991058,0.999862,0.561231,17
25,Arcu Sed Eu Incorporated,computer,IT,23964.0,288472.536157,22893.0,0.984204,0.999868,0.453314,26
26,Nec Incorporated,telecom,IT,3287.0,352339.084579,3238.0,0.981128,0.999615,0.450257,27
28,Suspendisse Ac Associates,digital,IT,43193.0,213524.365587,39949.0,0.983954,0.999883,0.438755,29
30,Eu Placerat LLC,computer,IT,19605.0,244907.129737,18964.0,0.982935,0.999872,0.398117,31
33,Adipiscing Elit Foundation,computer,IT,9944.0,263406.268416,9757.0,0.98032,0.999937,0.385673,34
34,Diam At Foundation,computer,IT,16108.0,242980.495986,15355.0,0.984443,0.99985,0.384919,35
36,Eleifend PC,computer,IT,23482.0,211997.496238,22529.0,0.982345,0.999787,0.378242,37


### Top 10 merchants in Daily segment

In [27]:
merchants_sorted[merchants_sorted['segment']=='daily'].head(10)

Unnamed: 0,merchant_name,category,segment,pred_num_trans,pred_gain,pred_num_consumer,merchant_reliability,consumer_reliability,default_weighted_sum,rank
10,Suspendisse Dui Corporation,opticians,daily,149493.0,131831.895979,115946.0,0.991091,0.999821,0.645259,11
11,Dignissim Maecenas Foundation,opticians,daily,42141.0,396059.134193,39180.0,0.984331,0.999774,0.611706,12
55,Pretium Et LLC,stationery,daily,6504.0,225464.189394,6403.0,0.980948,0.999805,0.338289,56
56,Dolor Quisque Inc.,shoe,daily,13816.0,191980.487468,13454.0,0.992078,0.999817,0.329559,57
57,Blandit At LLC,shoe,daily,18839.0,174541.87229,18286.0,0.982501,0.999766,0.328033,58
61,Placerat Orci Institute,stationery,daily,16154.0,166463.090924,15681.0,0.990358,0.999846,0.31219,62
63,Tempor Est Foundation,stationery,daily,8093.0,186225.140963,7962.0,0.989932,0.99969,0.305983,64
71,Sociosqu Corp.,shoe,daily,14442.0,142626.596718,14091.0,0.985182,0.99987,0.283714,72
82,At Pede Inc.,opticians,daily,7149.0,156588.289218,7066.0,0.983459,0.999825,0.274187,83
97,Nec Tellus Ltd,health,daily,10258.0,120315.072914,10067.0,0.987795,0.999938,0.249244,98


### Top 10 merchants in Housing segment

In [28]:
merchants_sorted[merchants_sorted['segment']=='housing'].head(10)

Unnamed: 0,merchant_name,category,segment,pred_num_trans,pred_gain,pred_num_consumer,merchant_reliability,consumer_reliability,default_weighted_sum,rank
18,Ornare Limited,motor,housing,19811.0,370840.725521,19120.0,0.983362,0.999936,0.519987,19
19,Amet Risus Inc.,furniture,housing,2941.0,413863.935874,2883.0,0.981242,1.0,0.508407,20
27,Phasellus At Limited,furniture,housing,27421.0,274124.187706,26133.0,0.98371,0.999771,0.450148,28
39,Interdum Feugiat Sed Inc.,furniture,housing,32830.0,177538.92557,29531.0,0.991277,0.999846,0.371143,40
44,Eu Inc.,garden,housing,21740.0,205711.666319,20913.0,0.983545,0.999884,0.366902,45
45,Magna Praesent PC,motor,housing,11107.0,239722.783067,10893.0,0.983184,0.999777,0.366678,46
54,Semper Auctor PC,motor,housing,2370.0,239189.22847,2322.0,0.981637,0.999738,0.338384,55
59,Phasellus Dapibus Incorporated,furniture,housing,22056.0,153112.340361,21203.0,0.990583,0.999829,0.317561,60
60,Suspendisse Non Leo PC,motor,housing,14783.0,174988.749573,14401.0,0.990796,0.999872,0.316197,61
62,At Pretium Corp.,motor,housing,859.0,211072.937425,843.0,0.983582,0.999254,0.306596,63


### Proportion of each segment in top 100 merchants

<div>
<img src="../plots/proportion_bySegments.png" width="600"/>
</div>

- the greatest number of merchants are from the recreational segment (44%), followed by housing (19%) and IT (17%)

- implies a potential benefit in targeting the recreational segment

### Proportion of each category in top 100 merchants

<div>
<img src="../plots/proportion_byCategories.png" width="700"/>
</div>

- the greatest number of merchants are from tent (17%), followed by computer (11%) and motor (8%)

- potential benefit in focusing on these categories

### Comparison between top 100 and next 100 merchants across the heuristics

<div>
<img src="../plots/comparison_top100_next100.png" width="500"/>
</div>

- predicted gain, predicted number of consumers, and predicted number of transactions differ substantially between the top 100 merchants and the next 100, while both reliability measures are very high and similar in the two groups

    - this indicates that our top 200 merchants are mostly trustworthy, although some high-performing merchants may have been downgraded and excluded from the top 200

- predicted gain is around 3 times higher in the top 100 than in the next 100 merchants, predicted number of consumers around 2 times higher, and predicted number of transactions around 2.5 times higher

### Statistics under our recommended top 100 merchants

In 2023:
- predicted total gain = $20,504,561

- predicted total number of transactions = 3,376,651

- total number of unique consumers reachable through these merchants (counted from past transactions) = 24,081

as shown in the output below.

In [29]:
# compute the total predicted gain and number of transactions with our recommended top 100 merchants
criteria2 = ['pred_gain', 'pred_num_trans']
for criterion in criteria2:
    print(criterion, '- sum: ', top100_merchants[criterion].sum())

pred_gain - sum:  20504560.659092326
pred_num_trans - sum:  3376651.0


In [30]:
# compute the total number of unique consumers that the BNPL company can reach through our recommended top 100 merchants
top100_merchants_name = list(top100_merchants['merchant_name'])
transactions = pd.read_parquet('../data/raw/transactions_clean')
transactions_filtered = transactions[transactions['merchant_name'].isin(top100_merchants_name)]
len(set(transactions_filtered['consumer_id']))

24081