# About project

Advertisers use a variety of online marketing channels to reach consumers, and they typically want to know how much each channel contributes to their marketing success. This problem is known as multi-channel attribution. Advertisers often approach it with simple heuristic models, which tend to underestimate the importance of some marketing channels. Common attribution models include:<br>

1. First Touch Conversion: A user's conversion is attributed to the first channel/touchpoint in the journey.
2. Last Touch Conversion: A user's conversion is attributed to the last channel/touchpoint in the journey.
3. Linear Touch Conversion: All channels/touchpoints in the journey are given equal credit for a user's conversion.
4. Markov chains: A probabilistic model that represents the buyer's journey as a graph, with the nodes representing channels/touchpoints and the edges representing observed transitions between them. The number of times buyers have transitioned between two states is converted into a probability, which can then be used to measure the importance of each channel and to find the most likely channel paths to success.<br>
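
To make the three heuristic models above concrete, here is a minimal sketch that applies them to a handful of converting journeys. The `paths` list and the function names are hypothetical and chosen only for illustration; each converting journey is assumed to carry exactly one unit of credit.

```python
from collections import defaultdict

# Hypothetical converting journeys: each list is one user's ordered touchpoints.
paths = [
    ["Instagram", "Online Display", "Paid Search"],
    ["Paid Search", "Online Display"],
    ["Online Display"],
]


def first_touch(paths):
    """Give all credit for each conversion to the first touchpoint."""
    credit = defaultdict(float)
    for path in paths:
        credit[path[0]] += 1.0
    return dict(credit)


def last_touch(paths):
    """Give all credit for each conversion to the last touchpoint."""
    credit = defaultdict(float)
    for path in paths:
        credit[path[-1]] += 1.0
    return dict(credit)


def linear_touch(paths):
    """Split the credit for each conversion equally across its touchpoints."""
    credit = defaultdict(float)
    for path in paths:
        share = 1.0 / len(path)
        for channel in path:
            credit[channel] += share
    return dict(credit)


print("First touch: ", first_touch(paths))
print("Last touch:  ", last_touch(paths))
print("Linear touch:", linear_touch(paths))
```

In all three models the credits sum to the number of conversions; only the split across channels changes.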

In this exercise, we use a dataset to compare the attribution models described above. Based on the results, marketers and advertisers can refine or optimize their marketing spend across channels.
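
The Markov chain description above says that transition counts are converted into probabilities. A minimal sketch of that step follows. The `journeys` data, the `Start`, `Conversion`, and `Null` state names, and the function name are assumptions made for illustration and are not taken from the dataset.

```python
from collections import Counter, defaultdict

# Hypothetical journeys with explicit start and end states (assumed for this sketch).
journeys = [
    ["Start", "Instagram", "Online Display", "Conversion"],
    ["Start", "Paid Search", "Null"],
    ["Start", "Online Display", "Paid Search", "Conversion"],
    ["Start", "Online Display", "Null"],
]


def transition_probabilities(journeys):
    """Count observed state-to-state transitions and normalise each row to sum to 1."""
    counts = defaultdict(Counter)
    for journey in journeys:
        # Pair every state with the state that follows it.
        for src, dst in zip(journey, journey[1:]):
            counts[src][dst] += 1

    probs = {}
    for src, dsts in counts.items():
        total = sum(dsts.values())
        probs[src] = {dst: n / total for dst, n in dsts.items()}
    return probs


probs = transition_probabilities(journeys)
for src, row in probs.items():
    print(src, "->", row)
```

Each row of the resulting table is a probability distribution over the next state, which is the building block for estimating channel importance in a Markov attribution model.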

# About data
For this analysis, we use a dataset of digital marketing data of the kind one would typically encounter in a production environment. The dataset contains ~586,000 marketing touchpoints from July 2018, covering ~240,000 unique customers who generated ~18,000 conversions.<br>

The features are described in more detail below:<br>

1. Cookie (`cookie`): Anonymous customer ID that lets us track the progression of a given customer
2. Timestamp (`time`): Date and time when the visit took place
3. Interaction (`interaction`): Categorical variable indicating the type of interaction that took place
4. Conversion (`conversion`): Boolean variable indicating whether a conversion took place
5. Conversion Value (`conversion_value`): Value of the potential conversion event (revenue)
6. Channel (`channel`): The marketing channel that brought the customer to our site

### Imports

In [1]:
# Data imports
import pandas as pd 
import numpy as np 

# Visualization
import matplotlib.pyplot as plt 
import seaborn as sns

### Data load

In [2]:
data = pd.read_csv("./data/attribution_data.csv", sep=",")
print("Shape of data: {0}".format(data.shape))

Shape of data: (586737, 6)


### Data sanity checks

In [4]:
# Preview the first five rows
data.head(5)

| | cookie | time | interaction | conversion | conversion_value | channel |
|---|---|---|---|---|---|---|
| 0 | 00000FkCnDfDDf0iC97iC703B | 2018-07-03T13:02:11Z | impression | 0 | 0.0 | Instagram |
| 1 | 00000FkCnDfDDf0iC97iC703B | 2018-07-17T19:15:07Z | impression | 0 | 0.0 | Online Display |
| 2 | 00000FkCnDfDDf0iC97iC703B | 2018-07-24T15:51:46Z | impression | 0 | 0.0 | Online Display |
| 3 | 00000FkCnDfDDf0iC97iC703B | 2018-07-29T07:44:51Z | impression | 0 | 0.0 | Online Display |
| 4 | 0000nACkD9nFkBBDECD3ki00E | 2018-07-03T09:44:57Z | impression | 0 | 0.0 | Paid Search |


In [8]:
# Check data type of each column
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 586737 entries, 0 to 586736
Data columns (total 6 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   cookie            586737 non-null  object 
 1   time              586737 non-null  object 
 2   interaction       586737 non-null  object 
 3   conversion        586737 non-null  int64  
 4   conversion_value  586737 non-null  float64
 5   channel           586737 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 26.9+ MB
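
The `info()` output shows that `time` is stored as `object` (plain strings). Ordering each customer's touchpoints usually requires a proper datetime column, so one possible next step is sketched below. It runs on a small hand-built `sample` frame (a hypothetical stand-in that reuses a few rows from the preview above) rather than on the full file, and the choice to parse as UTC is an assumption based on the trailing `Z` in the timestamps.

```python
import pandas as pd

# Small stand-in for the full dataset, deliberately out of chronological order.
sample = pd.DataFrame(
    {
        "cookie": [
            "00000FkCnDfDDf0iC97iC703B",
            "00000FkCnDfDDf0iC97iC703B",
            "0000nACkD9nFkBBDECD3ki00E",
        ],
        "time": [
            "2018-07-17T19:15:07Z",
            "2018-07-03T13:02:11Z",
            "2018-07-03T09:44:57Z",
        ],
        "channel": ["Online Display", "Instagram", "Paid Search"],
    }
)

# Parse the ISO 8601 strings into timezone-aware timestamps.
sample["time"] = pd.to_datetime(sample["time"], utc=True)

# Order each customer's touchpoints chronologically.
sample = sample.sort_values(["cookie", "time"]).reset_index(drop=True)

print(sample.dtypes)
print(sample)
```

The same two calls should apply to the full `data` frame, provided every value in `time` follows the format seen in the preview.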


In [6]:
# Check number of unique values for each column
data.nunique()

cookie              240108
time                485110
interaction              2
conversion               2
conversion_value        11
channel                  5
dtype: int64

Findings:
1. The dataset contains 240,108 unique users
2. There are 2 types of interaction
3. There are 5 different channels

In [9]:
# Check high-level stats for the numeric columns
data.describe()

| | conversion | conversion_value |
|---|---|---|
| count | 586737.0 | 586737.0 |
| mean | 0.030063 | 0.187871 |
| std | 0.17076 | 1.084498 |
| min | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 |
| 50% | 0.0 | 0.0 |
| 75% | 0.0 | 0.0 |
| max | 1.0 | 8.5 |
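
As a quick sanity check, the summary statistics can be tied back to the figures quoted in the data description. Because `conversion` is a 0/1 flag, its mean multiplied by the row count approximates the number of conversions. The sketch below uses only numbers printed in the outputs above; the variable names are illustrative, and the result is approximate because the printed mean is rounded.

```python
# Figures copied from the outputs above.
n_rows = 586_737            # rows in the dataset (data.shape)
n_users = 240_108           # unique cookies (data.nunique())
mean_conversion = 0.030063  # mean of the 0/1 conversion flag (data.describe())

# The mean of a 0/1 column times the row count approximates the number of ones.
est_conversions = n_rows * mean_conversion
touchpoints_per_user = n_rows / n_users

print(f"Estimated conversions: {est_conversions:,.0f}")
print(f"Average touchpoints per user: {touchpoints_per_user:.2f}")
```

The estimate comes out at roughly 17,600 conversions, in line with the ~18,000 quoted earlier, with an average of about 2.4 touchpoints per user.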
