# Experiment: Multi-Touch Customer Analysis
## Overview

Behind the growth of every consumer-facing product is the acquisition and retention of an engaged user base. When it comes to acquisition, the goal is to attract high-quality users as cost-effectively as possible. With marketing dollars dispersed across a wide array of campaigns, channels, and creatives, however, measuring effectiveness is a challenge. In other words, it's difficult to know how to assign credit where credit is due. Enter multi-touch attribution. With multi-touch attribution, credit can be assigned in a variety of ways, but at a high level, it's typically done using one of two methods: `heuristic` or `data-driven`.

* Broadly speaking, heuristic methods are rule-based and consist of both `single-touch` and `multi-touch` approaches. Single-touch methods, such as `first-touch` and `last-touch`, assign credit to the first channel, or the last channel, associated with a conversion. Multi-touch methods, such as `linear` and `time-decay`, assign credit to multiple channels associated with a conversion. In the case of linear, credit is assigned uniformly across all channels, whereas for time-decay, an increasing amount of credit is assigned to the channels that appear closer in time to the conversion event.

* In contrast to heuristic methods, data-driven methods determine assignment using probabilities and statistics. Examples of data-driven methods include `Markov Chains` and `SHAP`. In this series of notebooks, we cover the use of Markov Chains and include a comparison to a few heuristic methods.
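
The heuristic rules described above can be sketched in a few lines of plain Python. This is only an illustration, not code from this notebook series: the function name `heuristic_credit` and the `half_life` parameter are hypothetical, and time-decay is approximated here by touch position rather than by actual timestamps.

```python
from collections import defaultdict

def heuristic_credit(touches, method="linear", half_life=2.0):
    """Split one conversion's credit across an ordered list of channel touches.

    touches: channel names in chronological order, ending at the conversion.
    half_life: hypothetical decay parameter, measured in touch positions.
    """
    n = len(touches)
    if method == "first_touch":
        weights = [1.0] + [0.0] * (n - 1)
    elif method == "last_touch":
        weights = [0.0] * (n - 1) + [1.0]
    elif method == "linear":
        weights = [1.0] * n
    elif method == "time_decay":
        # The weight halves for every `half_life` positions away from the conversion.
        weights = [0.5 ** ((n - 1 - i) / half_life) for i in range(n)]
    else:
        raise ValueError(f"unknown method: {method}")

    # Normalise so the credit for one conversion sums to 1.
    total = sum(weights)
    credit = defaultdict(float)
    for channel, weight in zip(touches, weights):
        credit[channel] += weight / total
    return dict(credit)

journey = ["Email", "Social Network", "Affiliates", "Social Network"]
for method in ("first_touch", "last_touch", "linear", "time_decay"):
    print(method, heuristic_credit(journey, method))
```

A channel that appears more than once in a journey accumulates the credit of each of its touches, which is why `Social Network` receives half of the credit under the linear rule in this example.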

## Current Status

We are currently working with the marketing organisation to obtain real data. In the meantime, this notebook exercises the multi-touch analysis process on fabricated data. As soon as the real data is available, we'll switch to it and re-run the experiment.

## About the Data
* The data is designed to reflect what is used in practice. This data includes the following columns:
<table>
   <thead>
      <tr>
         <th>Column Name</th>
         <th>Description</th>
         <th>Data Type</th>
      </tr>
   </thead>
   <tbody>
      <tr>
         <td>uid</td>
         <td>A unique identifier for each individual customer.</td>
         <td>String</td>
      </tr>
      <tr>
         <td>time</td>
         <td>The date and time at which the interaction occurred.</td>
         <td>Timestamp</td>
      </tr>
       <tr>
         <td>interaction</td>
         <td>Denotes whether the interaction was an impression or conversion.</td>
         <td>String</td>
      </tr>
      <tr>
         <td>channel</td>
         <td>The ad channel associated with the interaction.</td>
         <td>String</td>
      </tr>
     <tr>
         <td>conversion</td>
         <td>Conversion flag: 1 for a conversion event, 0 for an impression.</td>
         <td>Int</td>
      </tr>
   </tbody>
</table>

* The notebook first generates the dataset and then transforms it for use with Markov Chains.
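
As a rough idea of what that transformation could look like, the sketch below groups event rows into one ordered path per user and estimates transition probabilities between states. It is a minimal pure-Python illustration under stated assumptions, not the notebook's actual implementation: the function names and the `Start`, `Conversion`, and `Null` state labels are hypothetical, and the converting channel is assumed to be part of the path.

```python
from collections import Counter, defaultdict

def build_paths(events):
    """Group (uid, time, interaction, channel, conversion) rows into ordered paths.

    Each path has the form Start > channel > ... > Conversion or Null.
    """
    by_user = defaultdict(list)
    for uid, time, _interaction, channel, conversion in events:
        by_user[uid].append((time, channel, conversion))

    paths = []
    for touches in by_user.values():
        # 'YYYY-MM-DD HH:MM:SS' strings sort chronologically as plain text.
        touches.sort(key=lambda touch: touch[0])
        converted = any(flag == 1 for _, _, flag in touches)
        path = ["Start"] + [channel for _, channel, _ in touches]
        path.append("Conversion" if converted else "Null")
        paths.append(path)
    return paths

def transition_probabilities(paths):
    """Estimate P(next state | current state) from observed paths."""
    counts = defaultdict(Counter)
    for path in paths:
        for src, dst in zip(path, path[1:]):
            counts[src][dst] += 1
    return {
        src: {dst: n / sum(dsts.values()) for dst, n in dsts.items()}
        for src, dsts in counts.items()
    }

events = [
    ("u1", "2020-05-18 13:09:38", "impression", "Email", 0),
    ("u1", "2020-05-19 08:00:00", "conversion", "Affiliates", 1),
    ("u2", "2020-05-20 10:00:00", "impression", "Email", 0),
    ("u2", "2020-05-21 09:30:00", "impression", "Social Network", 0),
]
paths = build_paths(events)
probabilities = transition_probabilities(paths)
print(paths)
print(probabilities["Email"])
```

In this toy input, half of the journeys that pass through `Email` continue to `Affiliates` and half to `Social Network`. A Markov Chain attribution model would build on such a transition matrix.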

## Runtime Dependencies

In [0]:
from pyspark.sql import Row
import json
import pandas as pd
import uuid
import random
from random import randrange
from datetime import datetime, timedelta 

## Generate the Data (Future: import data)

In [0]:
def _get_random_datetime(start, end):
    """
    This function will return a random datetime between two datetime 
    objects.
    """
    delta = end - start
    int_delta = (delta.days * 24 * 60 * 60) + delta.seconds
    random_second = randrange(int_delta)
    return start + timedelta(seconds=random_second)

def data_gen_to_df():
  """
  This function will generate synthetic data for multi-touch
  attribution modeling and return it as a Spark DataFrame.
  """

  DataRow = Row("uid", "time", "interaction", "channel", "conversion")
  
  # Specify the number of unique user IDs that data will be generated for
  unique_id_count = 500000
  
  # Specify the desired conversion rate for the full data set
  total_conversion_rate_for_campaign = .14
  
  # Specify the desired channels and channel-level conversion rates
  base_conversion_rate_per_channel = {'Social Network':.3, 
                                      'Search Engine Marketing':.2, 
                                      'Google Display Network':.1, 
                                      'Affiliates':.39, 
                                      'Email':0.01}
  
  
  channel_list = list(base_conversion_rate_per_channel.keys())
  
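  # Weights for picking the channel recorded on a converting touch
  base_conversion_weight = tuple(base_conversion_rate_per_channel.values())
  # Weights for picking the channel of each impression touch
  intermediate_channel_weight = (20, 30, 15, 30, 5)
  # Uniform weights; defined but not used anywhere in this function
  channel_probability_weights = (20, 20, 20, 20, 20)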
  
  # Generate list of random user IDs
  uid_list = []
  
  for _ in range(unique_id_count):
    uid_list.append(str(uuid.uuid4()).replace('-',''))
  
  rows = []
  
  # Generate data / user journey for each unique user ID
  for uid in uid_list:
      user_journey_end = random.choices(['impression', 'conversion'], 
                                        (1-total_conversion_rate_for_campaign, total_conversion_rate_for_campaign), k=1)[0]
      
      steps_in_customer_journey = random.choice(range(1,10))
      
      d1 = datetime.strptime('5/17/2020 1:30 PM', '%m/%d/%Y %I:%M %p')
      d2 = datetime.strptime('6/10/2020 4:50 AM', '%m/%d/%Y %I:%M %p')

      final_channel = random.choices(channel_list, base_conversion_weight , k=1)[0]
      
      for i in range(steps_in_customer_journey):
        next_step_in_user_journey = random.choices(channel_list, weights=intermediate_channel_weight, k=1)[0] 
        time = str(_get_random_datetime(d1, d2))
        d1 = datetime.strptime(time, '%Y-%m-%d %H:%M:%S')
        
        # Only the last step of a converting journey is recorded as a conversion
        if user_journey_end == 'conversion' and i == (steps_in_customer_journey-1):
          rows.append(DataRow(uid,time,"conversion",final_channel,1))
        else:
          rows.append(DataRow(uid,time,"impression",next_step_in_user_journey,0))
  
  # Return the rows as a DataFrame (`spark` is assumed to be the session provided by the notebook runtime)
  return spark.createDataFrame(rows)

In [0]:
source_events_df = data_gen_to_df()
display(source_events_df.limit(100))

uid,time,interaction,channel,conversion
07c335c8ab2148dfab0d45d4e2d54152,2020-05-18 13:09:38,impression,Search Engine Marketing,0
a6e632c369ec4d7da0e35bc7b20b4a3c,2020-06-04 10:03:56,impression,Affiliates,0
a6e632c369ec4d7da0e35bc7b20b4a3c,2020-06-06 18:48:09,impression,Search Engine Marketing,0
a6e632c369ec4d7da0e35bc7b20b4a3c,2020-06-09 18:11:30,impression,Affiliates,0
a6e632c369ec4d7da0e35bc7b20b4a3c,2020-06-10 00:52:57,impression,Google Display Network,0
a6e632c369ec4d7da0e35bc7b20b4a3c,2020-06-10 03:15:08,impression,Search Engine Marketing,0
a6e632c369ec4d7da0e35bc7b20b4a3c,2020-06-10 04:15:02,impression,Affiliates,0
a6e632c369ec4d7da0e35bc7b20b4a3c,2020-06-10 04:33:32,impression,Affiliates,0
90f326ae3dea4cb3911df88a2d8f39eb,2020-06-01 08:13:34,impression,Affiliates,0
90f326ae3dea4cb3911df88a2d8f39eb,2020-06-05 06:27:02,impression,Search Engine Marketing,0
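
One way to sanity-check generated events like these is to compare the observed share of converting users against the 0.14 target set in the generator. The sketch below is a pure-Python illustration on a handful of hand-written rows. The helper name `observed_conversion_rate` is hypothetical, and on the full 500,000-user dataset an equivalent Spark aggregation would likely be a better fit than collecting rows to the driver.

```python
def observed_conversion_rate(rows):
    """Share of distinct users with at least one conversion event.

    rows: iterable of (uid, time, interaction, channel, conversion) tuples.
    """
    users = set()
    converters = set()
    for uid, _time, _interaction, _channel, conversion in rows:
        users.add(uid)
        if conversion == 1:
            converters.add(uid)
    return len(converters) / len(users) if users else 0.0

# Hand-written sample rows in the same shape as the generated data
sample_rows = [
    ("u1", "2020-05-18 13:09:38", "impression", "Search Engine Marketing", 0),
    ("u2", "2020-06-04 10:03:56", "impression", "Affiliates", 0),
    ("u2", "2020-06-06 18:48:09", "conversion", "Affiliates", 1),
    ("u3", "2020-06-01 08:13:34", "impression", "Email", 0),
    ("u4", "2020-06-05 06:27:02", "impression", "Email", 0),
]
print(observed_conversion_rate(sample_rows))  # 1 of 4 users converted
```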


## Save the generated data to the lake

In [0]:
source_events_df.write.format("delta").save("/mnt/da-global-raw-box/POC/MultiTouch")