#### 1. Overview

The company `Food in Baggins` (FiB) is a food delivery app expanding its operations across Middle Earth.  
One of its greatest challenges in 2021 was to grow its MUB (Monthly Unique Buyers) base while spending the large amount of money injected into the company as efficiently as possible.  
  
To do this, the FiB marketing team decided to invest in discount vouchers, on the hypothesis that they would incentivize more customers to convert.  
The dataset `mkt_test_assignment_sample` contains data from a marketing A/B test run in June 2021 to validate this hypothesis. The test simply consisted of sending discount vouchers to all customers in the treatment variant, so that we could analyze the results and decide whether the discount campaign should be rolled out as a growth lever.  

Link about A/B testing

In [0]:
from pyspark.sql import functions as f

# `spark` is the SparkSession provided by the notebook environment.
orders_sample = spark.read.parquet("s3://dev-ifood-kairos/misc/bda/ord_sample")
kairos_sample = spark.read.parquet("s3://dev-ifood-kairos/misc/bda/mkt_test_assignment_sample")

In [0]:
# Aggregate June 2021 orders per account and reference date.
orders_by_account = (
    orders_sample
    .filter(f.col('reference_date')>='2021-06-01')
    .filter(f.col('reference_date')<'2021-07-01')
    .groupBy('account_id', 'reference_date')
    .agg(
        f.countDistinct('order_id').alias('orders'),
        f.sum('discount_vouchers_used').alias('vouchers_used'),
        f.sum('subsidy').alias('subsidy'),
        f.avg('order_total').alias('aov')
    )
)

# Join the test assignments with the order aggregates and summarize each variant.
(
    kairos_sample
    .filter(f.col('reference_month')=='2021-06-01')
    .join(orders_by_account, ['account_id', 'reference_date'], 'left')
    .groupBy('test_assignment')
    .agg(
        f.count('account_id').alias('test_size'),
        f.round((f.sum( f.when(f.col('orders') > 0, 1).otherwise(0) )), 3).alias('conversions'),
        f.round((f.sum( f.when(f.col('orders') > 0, 1).otherwise(0) ) / f.count('account_id') ), 3).alias('conversion_rate'),
        f.round(f.sum('subsidy'), 0).alias('sub_total'),
        (f.sum( f.when(f.col('vouchers_used') > 0, 1).otherwise(0) ) ).alias('vouchers_used'),
        f.round(f.avg('aov'),0).alias('aov')
    )    
    .display()
)

| test_assignment | test_size | conversions | conversion_rate | sub_total | vouchers_used | aov |
|---|---|---|---|---|---|---|
| control | 147962 | 10899 | 0.074 | 0.0 | 0 | 2330.0 |
| test | 164460 | 12093 | 0.074 | 339914.0 | 882 | 2298.0 |


#### 2. Important Metrics

**Incremental Response Rate (IRR)**: the difference between the conversion rate observed in the group that received the treatment and the conversion rate of the group that did not. It represents the part of the conversion rate that is attributable to the treatment.

$$IRR = \text{Conversion}_{treatment} - \text{Conversion}_{control}$$ 

**Incremental Conversions (or incremental MUB)**: the number of conversions in the treatment group that are attributable to the treatment.

$$MUB_{i} = \text{IRR} \times \text{N}_{treatment}$$ 

**Cannibalization**: estimates the proportion of subsidized conversions that would have happened anyway, even if we had not sent the incentive.

$$Cannib_{treatment} = \frac{\text{Conversions with subsidy} - (\text{IRR} \times \text{N}_{treatment})}{\text{Conversions with subsidy}}$$

**Net Incremental Revenue (NIR)**: how much revenue is gained (or lost) by sending out the promotion. It is important to normalize by the number of customers in the treatment and control groups when their sizes differ.

$$NIR = \text{Revenue}_{treatment} - \text{Revenue}_{control}$$  
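One possible way to write the normalization explicitly is to compare revenue per customer and scale the difference back to the treatment group size. This is a sketch, assuming revenue per customer can be approximated by AOV times conversion rate (the approximation used in the worked example of section 3); the symbols $N_{treatment}$ and $N_{control}$ denote the group sizes.

$$NIR \approx \left( \frac{\text{Revenue}_{treatment}}{N_{treatment}} - \frac{\text{Revenue}_{control}}{N_{control}} \right) \times N_{treatment}$$

$$\frac{\text{Revenue}}{N} \approx \text{AOV} \times \text{Conversion}$$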

**Cost per Increment (CPI)**: how much we pay for each incremental conversion.

$$CPI = \frac{\text{Investment}_{treatment}}{MUB_{i}}$$

Considering the metrics above, we could evaluate the overall results of our test as follows:

- The incremental response rate (IRR) was 0%, because there was no difference in conversion rate between test and control;
- The incremental MUB and the NIR are zero;
- The cannibalization rate is 100%, because we had no incremental conversions;
- The Cost per Increment is N/A, because we had no incremental conversions;
- **The resulting campaign P/L is NIR - total subsidy = 0 - 339,914 = -MU$ 339,914**
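The bullets above can be reproduced from the summary table with a small helper. This is a minimal sketch in plain Python, not part of the original notebook: the names `Variant` and `campaign_metrics` are hypothetical, conversion rates are rounded to 3 decimals to mirror the table, and the `vouchers_used` column is assumed to stand in for "conversions with subsidy".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Variant:
    """Aggregated results for one test variant (one row of the summary table)."""
    size: int            # customers assigned to the variant
    conversions: int     # customers with at least one order
    subsidy: float       # total voucher subsidy, in MU$
    vouchers_used: int   # assumed proxy for "conversions with subsidy"


def campaign_metrics(treatment: Variant, control: Variant, ndigits: int = 3) -> dict:
    """Compute IRR, incremental MUB, cannibalization and CPI for a test."""
    # Round the rates as the summary table does (assumption: 3 decimals).
    cr_treatment = round(treatment.conversions / treatment.size, ndigits)
    cr_control = round(control.conversions / control.size, ndigits)
    irr = round(cr_treatment - cr_control, ndigits)
    incremental_mub = irr * treatment.size

    # Share of subsidized conversions that were not incremental.
    cannibalization: Optional[float] = None
    if treatment.vouchers_used > 0:
        cannibalization = (
            (treatment.vouchers_used - incremental_mub) / treatment.vouchers_used
        )

    # CPI is undefined (N/A) when there are no incremental conversions.
    cpi: Optional[float] = None
    if incremental_mub > 0:
        cpi = treatment.subsidy / incremental_mub

    return {
        "irr": irr,
        "incremental_mub": incremental_mub,
        "cannibalization": cannibalization,
        "cpi": cpi,
    }


# Values copied from the summary table above.
control = Variant(size=147962, conversions=10899, subsidy=0.0, vouchers_used=0)
test = Variant(size=164460, conversions=12093, subsidy=339914.0, vouchers_used=882)

metrics = campaign_metrics(test, control)
print(metrics)
```

Returning `None` for the CPI keeps the "N/A" case explicit instead of dividing by zero.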

At first glance we could say that this test flopped and that we should not roll out this treatment, i.e., sending discount vouchers to foster growth.  

However, this is where things get more interesting: we could develop a targeting strategy to roll out this treatment only to groups of users who do, in fact, show **incremental behavior**.

#### 3. A targeting example with a simple heuristic

After some analyses, the Data Analytics team recommended targeting this treatment only at customers who had an average order value (AOV) below MU$ 2,000 in the previous month.  
This recommendation could trigger a post-hoc test or the rollout itself.  
Segmenting our test by this heuristic, we would get the following results:

In [0]:
# Customers whose May 2021 average order value (AOV) was below MU$ 2,000.
target_customers = (
  orders_sample
  .filter(f.col('reference_date')>='2021-05-01')
  .filter(f.col('reference_date')<'2021-06-01')
  .groupBy('account_id')
  .agg(
      f.avg('order_total').alias('aov')
  )
  .filter(f.col('aov')<2000)
  .select('account_id')
  .distinct()
)

# Same summary as before, restricted to the targeted customers in both variants.
(
    kairos_sample
    .filter(f.col('reference_month')=='2021-06-01')
    .join(target_customers, ['account_id'], 'inner')
    .join(orders_by_account,  ['account_id', 'reference_date'], 'left')
    .groupBy('test_assignment')
    .agg(
        f.count('account_id').alias('test_size'),
        f.round((f.sum( f.when(f.col('orders') > 0, 1).otherwise(0) )), 3).alias('conversions'),
        f.round((f.sum( f.when(f.col('orders') > 0, 1).otherwise(0) ) / f.count('account_id') ), 3).alias('conversion_rate'),
        f.round(f.sum('subsidy'), 0).alias('sub_total'),
        (f.sum( f.when(f.col('vouchers_used') > 0, 1).otherwise(0) ) ).alias('vouchers_used'),
        f.avg('aov').alias('aov')
    )    
    .display()
)

| test_assignment | test_size | conversions | conversion_rate | sub_total | vouchers_used | aov |
|---|---|---|---|---|---|---|
| control | 19306 | 4779 | 0.248 | 0.0 | 0 | 1557.3070068703385 |
| test | 21602 | 5563 | 0.258 | 194044.0 | 504 | 1533.9450220204956 |


Now let's evaluate the results:  

- The incremental response rate (IRR) was +1.0%;
- The incremental MUB is 1.0% * 21773 = 218 conversions;
- The NIR is ( (1534 * 0.258) - (1557 * 0.248) ) * 21773 = MU$ 209,804;
- The cannibalization rate is (504 - 218) / 504 = 56.7%. It means that around 57% of the vouchers were used by customers who did not need them to make a purchase;
- The Cost per Increment is roughly MU$ 194k / 218 = MU$ 890;
- The campaign incremental profit is NIR - total subsidy = 209,804 - 194,044 = MU$ 15,760.
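The arithmetic in the bullets above can be checked with a few lines of plain Python. This is only a sketch: the inputs are the rounded values quoted in the bullets, revenue per customer is approximated as AOV times conversion rate, and the variable names are illustrative.

```python
# Inputs copied from the bullets above (rounded values, as used in the text).
n_treatment = 21773
cr_treatment, cr_control = 0.258, 0.248
aov_treatment, aov_control = 1534, 1557
vouchers_used = 504          # stands in for "conversions with subsidy"
subsidy = 194044             # total subsidy, in MU$

irr = cr_treatment - cr_control

# Rounded to whole conversions, as in the text.
incremental_mub = round(irr * n_treatment)

# Per-customer revenue difference, scaled to the treatment group size.
nir = (aov_treatment * cr_treatment - aov_control * cr_control) * n_treatment

cannibalization = (vouchers_used - incremental_mub) / vouchers_used
cpi = subsidy / incremental_mub
profit = nir - subsidy

print(f"IRR:             {irr:.1%}")
print(f"Incremental MUB: {incremental_mub}")
print(f"NIR:             MU$ {nir:,.0f}")
print(f"Cannibalization: {cannibalization:.1%}")
print(f"CPI:             MU$ {cpi:,.0f}")
print(f"Profit:          MU$ {profit:,.0f}")
```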

**Is that good enough?**

Sending the promotional campaign to everyone resulted in zero incremental conversions and a loss of MU$ 339,914. If we had sent it only to users with an average order value below MU$ 2,000 in the previous month, we would have had an incremental response rate of 1%, at MU$ 890 per incremental conversion, resulting in a profit of MU$ 15,760.

#### 4. Your Assignment

You need to develop a targeting strategy using machine learning. You are going to create a model that will support the decision of who should be targeted by this promotional campaign, in order to yield the best possible result.  

There are different ways to do this with unsupervised or supervised learning, and you are free to use your creativity. Some hints are provided at the end of the notebook.

Regardless of the approach you choose, these are the expected deliverables: 

- 1. When creating the dataframe that will be used to train your model, you need to create at least 3 features for each member of your team (at least 1 from each of the items, orders and sessions datasets);
  - Features documentation: describe the features you are creating and their meaning;
- 2. Train and tune a `PipelineModel` that executes the preprocessing and modelling steps;
  - Present an explanation of why you chose each step;
  - Present an analysis of the experiment results of your hyperparameter tuning;
- 3. Similar to what was presented in section 3 of this notebook, evaluate the expected results when your targeting is applied. Compare them to the baseline results in section 3, using the metrics provided in section 2;
  - This analysis should be done using the **July test results**, which are not yet available. Three days before the deadline, you will be able to download the July test results and perform the final analysis before submitting the project.
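As an illustration of the third deliverable, the sketch below shows one way a targeting rule could be evaluated on test data: the rule is applied to both variants, and the IRR is then computed among the targeted customers only. It is plain Python with made-up toy data; the `Customer` record, the `score` field and the `irr_for_targeted` function are hypothetical names, not part of the provided datasets.

```python
from typing import Iterable, NamedTuple


class Customer(NamedTuple):
    score: float      # hypothetical model output: higher = more likely incremental
    treated: bool     # True if assigned to the test (voucher) variant
    converted: bool   # True if the customer ordered during the test month


def irr_for_targeted(customers: Iterable[Customer], threshold: float) -> float:
    """IRR among customers whose score is at or above the threshold."""
    # Apply the same targeting rule to both variants before comparing them.
    targeted = [c for c in customers if c.score >= threshold]
    treated = [c for c in targeted if c.treated]
    control = [c for c in targeted if not c.treated]
    if not treated or not control:
        raise ValueError("both variants need at least one targeted customer")
    cr_treated = sum(c.converted for c in treated) / len(treated)
    cr_control = sum(c.converted for c in control) / len(control)
    return cr_treated - cr_control


# Toy, made-up data purely to exercise the function.
sample = [
    Customer(0.9, True, True),
    Customer(0.8, True, True),
    Customer(0.9, False, True),
    Customer(0.7, False, False),
    Customer(0.2, True, False),
    Customer(0.1, False, False),
]

print(irr_for_targeted(sample, threshold=0.5))
```

Filtering before splitting by variant mirrors the inner join with `target_customers` in section 3, which keeps the targeted control customers as the comparison group.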

The final output should be a `.zip` file containing (1) the `Blueprint Notebook` provided and (2) the `Pipeline Model artefacts`.

#### 5. Evaluation Criteria

>40% Data Manipulation
- Your code is well structured: it is correctly indented, PySpark method chains are well organized, and functions and variables have descriptive names in lowercase words separated by underscores;
- Your features are well documented, and the code creating them matches the explanation presented;

>30% ML Pipeline
- You have trained a pipeline model and uploaded the artefacts;
- You have presented a discussion of why you chose each step of your pipeline;
- You have presented the tuning results of your model;

>30% Targeting strategy
- You have analyzed the results of the test using your targeting, similar to section 2 of this notebook, on the July test;
- You have compared the results of your targeting to the baseline heuristic (target = users with AOV < MU$ 2,000);
- Your ranking compared to other solutions (in terms of IRR and CPI);

#### 6. Some Hints

- Here is a complete hands-on example of how to tackle this kind of problem: https://medium.com/@nesreensada/how-to-build-a-profitable-promotion-strategy-easily-with-uplift-modeling-26b2addc3e46

- No idea where to start? Here you can find some inspiration on which model to use: https://medium.com/@nesreensada/how-to-build-a-profitable-promotion-strategy-easily-with-uplift-modeling-26b2addc3e46. **Remember:** you will not be evaluated on the quality of your model; the goal of this practical exam is to test your knowledge in understanding data and applying this knowledge in a data pipeline in Spark.

- If you want a deeper yet gentle introduction to uplift modelling, check this paper: https://proceedings.mlr.press/v67/gutierrez17a/gutierrez17a.pdf

- The content above should help you decide how to model your `y` label and how to tackle the problem itself. In case you are still struggling after checking it out, we can provide further tips in `Lab 14`.
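As one example of a `y` label discussed in the uplift modelling literature (and not necessarily the approach expected here), the class transformation relabels each customer as 1 when they were treated and converted, or were in control and did not convert. Under the assumption of a 50/50 split between variants, the uplift can then be estimated as 2 * P(z = 1 | x) - 1. The function names below are hypothetical, and the split in this test is not exactly 50/50, so treat this as a sketch only.

```python
def transformed_label(treated: bool, converted: bool) -> int:
    """Return 1 for treated converters and control non-converters, else 0."""
    return int(treated == converted)


def uplift_from_probability(p_z: float) -> float:
    """Uplift implied by P(z = 1 | x); assumes a 50/50 treatment/control split."""
    return 2.0 * p_z - 1.0


# The four possible (treated, converted) combinations.
for treated in (True, False):
    for converted in (True, False):
        print(treated, converted, transformed_label(treated, converted))
```

With this label, any binary classifier that outputs probabilities can be trained on `z`, and its score ranks customers by estimated uplift.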