# Maven Reward Challenge
I'll play the role of a Sr. Marketing Analyst at Maven Cafe.

Over a 30-day period, Maven Cafe tested different types of offers with Rewards members.

Now that the 30-day period for the test has concluded, my task is to identify key customer segments and develop a data-driven strategy for future promotional messaging & targeting.

The data simulates the behavior of Cafe Rewards members over a 30-day period, including their transactions and responses to promotional offers.

The data is contained in three files: one with details on each offer, another with demographic information on each customer, and a third with the activity for each customer during the period.

The activities are divided into offer received, offer viewed, offer completed, and transaction.

For a transaction to be attributed to an offer, it must occur at the same time as when the offer was "completed" by the customer.

I have already cleaned the raw datasets; the cleaned versions are `cleaned_offers.csv`, `cleaned_events.csv`, and `cleaned_customer_data.csv`.

## Objectives: 

The goal of this project was to answer three main questions:

* Which customers respond best to offers?
* What types of offers work best?
* How should we reach them to maximize impact?

Insights from this test will guide future promotional campaigns to bring in more revenue and improve customer loyalty.


## Part 1: Setup and Data Loading

### 1.1: Import Libraries
Import all packages, grouped by function to keep the setup clean, with comments explaining their purpose.

In [1]:
# Core libraries for data manipulation and numerical operations
import pandas as pd
import numpy as np

# Libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# --- Configuration ---
# Set a consistent visual style for all plots
sns.set_style('whitegrid')
# Suppress routine warnings (e.g., from library versions) to keep output clean
import warnings
warnings.filterwarnings('ignore')

# Import the helper script that provides the initial_report() overview function
from initial_report import *

### 1.2: Load Data
Load the three CSV files for immediate inspection

In [2]:
# Load DataFrames
df_customer = pd.read_csv('cleaned_customer_data.csv')
df_event = pd.read_csv('cleaned_events.csv')
df_offer = pd.read_csv('cleaned_offers.csv')

### 1.3: Data Overview Function

I already created a data overview function (`initial_report`) to make data cleaning easier.

Now run `initial_report` on all three datasets.

In [3]:
# --- Inspect Customer Data ---
initial_report(df_customer)

 *** DATA CLEANING CHECKLIST ***
----------------------------------------
*** Structure:
- Total Rows: 14825
- Total Columns: 5
- Column Names: ['customer_id', 'became_member_on', 'gender', 'age', 'income']

*** Data Types:
  customer_id: object
  became_member_on: object
  gender: object
  age: int64
  income: float64

*** Mixed Data Types:

*** Distinct Values per Column:
  customer_id: 14825
  became_member_on: 1707
  gender: 3
  age: 84
  income: 91

*** Null Values and Percentages:


*** Duplicates: 0

*** Negative or Zero Values:

*** Basic Statistics:
                age         income
count  14825.000000   14825.000000
mean      54.393524   65404.991568
std       17.383705   21598.299410
min       18.000000   30000.000000
25%       42.000000   49000.000000
50%       55.000000   64000.000000
75%       66.000000   80000.000000
max      101.000000  120000.000000

*** Category Description:
                             customer_id became_member_on gender
count                       

**Comment:** The total number of unique customers is 14,825.

In [4]:
# --- Inspect Event Data ---
initial_report(df_event)

 *** DATA CLEANING CHECKLIST ***
----------------------------------------
*** Structure:
- Total Rows: 306137
- Total Columns: 6
- Column Names: ['customer_id', 'event', 'time', 'offer_id', 'amount', 'reward']

*** Data Types:
  customer_id: object
  event: object
  time: int64
  offer_id: object
  amount: float64
  reward: float64

*** Mixed Data Types:
  offer_id:
    - str: 167184
    - float: 138953

*** Distinct Values per Column:
  customer_id: 17000
  event: 4
  time: 120
  offer_id: 10
  amount: 5103
  reward: 4

*** Null Values and Percentages:
  offer_id: Missing Values: 138953, Pct: 45.389%
  amount: Missing Values: 167184, Pct: 54.611%
  reward: Missing Values: 272955, Pct: 89.161%


*** Duplicates: 0

*** Negative or Zero Values:
  time: 15561

*** Basic Statistics:
                time         amount        reward
count  306137.000000  138953.000000  33182.000000
mean      366.185015      12.777356      4.902628
std       200.348174      30.250529      2.887201
min       

In [5]:
# --- Inspect Offer Data ---
initial_report(df_offer)

 *** DATA CLEANING CHECKLIST ***
----------------------------------------
*** Structure:
- Total Rows: 10
- Total Columns: 6
- Column Names: ['offer_id', 'offer_type', 'difficulty', 'reward', 'duration', 'channels']

*** Data Types:
  offer_id: object
  offer_type: object
  difficulty: int64
  reward: int64
  duration: int64
  channels: object

*** Mixed Data Types:

*** Distinct Values per Column:
  offer_id: 10
  offer_type: 3
  difficulty: 5
  reward: 5
  duration: 5
  channels: 4

*** Null Values and Percentages:


*** Duplicates: 0

*** Negative or Zero Values:
  difficulty: 2
  reward: 2

*** Basic Statistics:
       difficulty     reward   duration
count   10.000000  10.000000  10.000000
mean     7.700000   4.200000   6.500000
std      5.831905   3.583915   2.321398
min      0.000000   0.000000   3.000000
25%      5.000000   2.000000   5.000000
50%      8.500000   4.000000   7.000000
75%     10.000000   5.000000   7.000000
max     20.000000  10.000000  10.000000

*** Category De

### 1.4: Initial Findings & Cleaning Plan

Based on the overview:

1.  **`df_customer`**:
    * `became_member_on`: Is an `object` (string). It **must** be converted to `datetime` to analyze membership tenure.
    * `gender`: Contains 3 values ('M', 'F', 'O').
    * `age` & `income`: Are numerical with no missing values. I will bin these later to analyze cohorts. 

2.  **`df_event`**:
    * `event`: Has 4 unique values as expected ('offer received', 'offer viewed', 'offer completed', 'transaction').
    * `offer_id`, `amount`, `reward`: Have significant missing values. This is **expected** and not an error.
        * `offer_id` is null only for 'transaction' events.
        * `amount` is null only for 'offer' events.
        * `reward` is null for all except 'offer completed' events.
    * This structure confirms a central challenge: `offer_id` and `amount` are in separate rows.

3.  **`df_offer`**:
    * `duration`: This is in **days**, while the `df_event['time']` column is in **hours**. The units are incompatible, so `duration` must be converted to hours before the two can be compared.
    * `channels`: Is an `object` (a string that looks like a list). It is unusable for analysis in this form and must be parsed before analyzing channel effectiveness.

**Conclusion:** The data is structurally sound, but the `duration`/`time` unit mismatch is a critical flaw that must be fixed. The `became_member_on` conversion is also mandatory.
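The `channels` cleanup mentioned above could look roughly like the sketch below. It assumes the strings have the form `"['web', 'email', 'mobile']"`, as they appear in the merged output later in this notebook. The toy frame and the `channel_` column prefix are hypothetical, not part of the original pipeline.

```python
import ast

import pandas as pd

# Toy rows mimicking the string-encoded lists in the `channels` column
toy_offers = pd.DataFrame({
    "offer_key": ["bogo-5-5-7", "discount-20-5-10"],
    "channels": ["['web', 'email', 'mobile']", "['web', 'email']"],
})

# Parse each string into a real Python list.
# ast.literal_eval only evaluates literals, so it is safer than eval().
toy_offers["channel_list"] = toy_offers["channels"].apply(ast.literal_eval)

# One 0/1 indicator column per channel (the `channel_` prefix is an assumption)
for channel in ["web", "email", "mobile", "social"]:
    toy_offers[f"channel_{channel}"] = toy_offers["channel_list"].apply(
        lambda chans, c=channel: int(c in chans)
    )

print(toy_offers[["offer_key", "channel_web", "channel_mobile", "channel_social"]])
```

With indicator columns like these, completion rates can later be compared per channel with an ordinary `groupby`.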

## Part 2: Data Cleaning

### 2.1: Preprocess `df_offer`

Two tasks:
1.  **Fix Units:** Convert `duration` from days to hours to match the `time` column in `df_event`.
2.  **Create `offer_key`:** The `offer_id` hash is useless for interpretation. I'll create a readable key (e.g., `bogo-10-10-7`) that summarizes the offer's properties.

In [6]:
# Create a copy to ensure our original loaded data remains untouched
offers_cleaned = df_offer.copy()

# 1. Convert duration from days to hours
# This is a critical step for logical comparisons with the 'time' column
offers_cleaned['duration_hours'] = offers_cleaned['duration'] * 24

# 2. Create a human-readable offer_key for easier analysis and plotting
# This combines the core attributes of each offer into a simple string
offers_cleaned['offer_key'] = (
    offers_cleaned['offer_type'] + '-' +
    offers_cleaned['difficulty'].astype(str) + '-' +
    offers_cleaned['reward'].astype(str) + '-' +
    offers_cleaned['duration'].astype(str)
)

print("Cleaned Offer Data:")
# Display the original columns and the new, corrected/created ones
offers_cleaned[['offer_id', 'offer_key', 'duration', 'duration_hours']].head()

Cleaned Offer Data:


Unnamed: 0,offer_id,offer_key,duration,duration_hours
0,ae264e3637204a6fb9bb56bc8210ddfd,bogo-10-10-7,7,168
1,4d5c57ea9a6940dd891ad53e9dbe8da0,bogo-10-10-5,5,120
2,3f207df678b143eea3cee63160fa8bed,informational-0-0-4,4,96
3,9b98b8c7a33c4b65b9aebfe6a799e6d9,bogo-5-5-7,7,168
4,0b1e1539f2cc45b7b9fa7c272da2e1d7,discount-20-5-10,10,240


### 2.2: Preprocess `df_customer`

Here, I'll convert the membership date and create the demographic cohorts.

1.  **Convert `became_member_on`**: Change from string to a `datetime` object.
2.  **Create Bins**:
    * `membership_year`: Extract the year to analyze tenure.
    * `age_group`: Bin ages into standard cohorts.
    * `income_group`: Bin incomes into 'Low', 'Middle', and 'High'.

In [7]:
# Create a copy for processing
customers_cleaned = df_customer.copy()

# 1. Convert membership date to datetime
customers_cleaned['became_member_on'] = pd.to_datetime(customers_cleaned['became_member_on'])

# 2. Extract membership year
customers_cleaned['membership_year'] = customers_cleaned['became_member_on'].dt.year

# 3. Create age_group bins
# Using 17 as the lower bound since 18 is the min age
age_bins = [17, 34, 49, 64, 79, 110]
age_labels = ['Young Adult', 'Middle Age Adult', 'Older Adult', 'Senior', 'Elderly']
customers_cleaned['age_group'] = pd.cut(customers_cleaned['age'], bins=age_bins, labels=age_labels, right=True)

# 4. Create income_group bins
income_bins = [0, 44000, 84000, float('inf')]
income_labels = ['Low Income', 'Middle Income', 'High Income']
customers_cleaned['income_group'] = pd.cut(customers_cleaned['income'], bins=income_bins, labels=income_labels, right=True)
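Because `right=True` makes every bin a half-open interval `(lower, upper]`, values that sit exactly on an edge fall into the lower group. A quick check on a few hand-picked edge values (toy data, not taken from the customer file) makes this visible:

```python
import pandas as pd

# Same bin definitions as in the cell above
age_bins = [17, 34, 49, 64, 79, 110]
age_labels = ['Young Adult', 'Middle Age Adult', 'Older Adult', 'Senior', 'Elderly']
income_bins = [0, 44000, 84000, float('inf')]
income_labels = ['Low Income', 'Middle Income', 'High Income']

# Hand-picked values that sit on or next to the bin edges
edge_ages = pd.Series([18, 34, 35, 64, 65, 101])
edge_incomes = pd.Series([30000.0, 44000.0, 44001.0, 84000.0, 120000.0])

# right=True: 34 is still 'Young Adult', 44000 is still 'Low Income'
edge_age_groups = pd.cut(edge_ages, bins=age_bins, labels=age_labels, right=True)
edge_income_groups = pd.cut(edge_incomes, bins=income_bins, labels=income_labels, right=True)

print(pd.DataFrame({"age": edge_ages, "age_group": edge_age_groups}))
print(pd.DataFrame({"income": edge_incomes, "income_group": edge_income_groups}))
```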

In [8]:
print("Cleaned Customer Data with New Features:")
customers_cleaned[['customer_id', 'age', 'age_group', 'income', 'income_group', 'membership_year']].head()

Cleaned Customer Data with New Features:


Unnamed: 0,customer_id,age,age_group,income,income_group,membership_year
0,0610b486422d4921ae7d2bf64640c50b,55,Older Adult,112000.0,High Income,2017
1,78afa995795e4d85b5d9ceeca43f5fef,75,Senior,100000.0,High Income,2017
2,e2127556f4f64592b11af22de27a7932,68,Senior,70000.0,Middle Income,2018
3,389bc3fa690240e798340f5a15918d5c,65,Senior,53000.0,Middle Income,2018
4,2eeac8d8feae4a8cad5a6af0499a211d,58,Older Adult,51000.0,Middle Income,2017


In [9]:
print("\nAge Group Distribution:")
customers_cleaned['age_group'].value_counts().sort_index()


Age Group Distribution:


age_group
Young Adult         2256
Middle Age Adult    3153
Older Adult         5150
Senior              3164
Elderly             1102
Name: count, dtype: int64

In [10]:
print("\nIncome Group Distribution:")
customers_cleaned['income_group'].value_counts().sort_index()


Income Group Distribution:


income_group
Low Income       2869
Middle Income    8941
High Income      3015
Name: count, dtype: int64

### 2.3: Preprocess `df_event` & Merge

The `df_event` data is already clean; its role is to link customers and offers. The next step is to create one master DataFrame that contains all information for every event.

Informational offers do **not** have an 'offer completed' event. A customer "completes" one by viewing it and then making any transaction within its duration.
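The informational rule described above (viewed, then any transaction before the offer expires) is not implemented in this notebook. A minimal sketch of how it could work for one customer is shown below. The helper name `informational_conversions` and the toy event log are hypothetical, and the sketch assumes the columns of the merged master table (`event`, `time`, `offer_id`, `offer_type`, `duration_hours`, `amount`).

```python
import pandas as pd

def informational_conversions(events: pd.DataFrame) -> list:
    """Hypothetical helper: one dict per informational offer that was viewed
    and then followed by a transaction before the offer expired."""
    results = []
    received = events[(events["event"] == "offer received")
                      & (events["offer_type"] == "informational")]
    transactions = events[events["event"] == "transaction"]

    for _, offer in received.iterrows():
        start = offer["time"]
        expires = start + offer["duration_hours"]

        # The view must fall inside the offer window (bounds are inclusive)
        views = events[(events["event"] == "offer viewed")
                       & (events["offer_id"] == offer["offer_id"])
                       & (events["time"].between(start, expires))]
        if views.empty:
            continue
        view_time = views["time"].min()

        # Any transaction between the first view and expiry counts
        followed = transactions[transactions["time"].between(view_time, expires)]
        if followed.empty:
            continue
        first = followed.sort_values("time").iloc[0]
        results.append({
            "offer_id": offer["offer_id"],
            "time_viewed": view_time,
            "time_transaction": first["time"],
            "transaction_amount": first["amount"],
        })
    return results

# Toy log: the first offer is viewed and followed by a purchase, the second is not
toy_events = pd.DataFrame({
    "event": ["offer received", "offer viewed", "transaction", "offer received", "transaction"],
    "time": [168, 192, 228, 336, 500],
    "offer_id": ["info_a", "info_a", None, "info_b", None],
    "offer_type": ["informational", "informational", None, "informational", None],
    "duration_hours": [72.0, 72.0, None, 96.0, None],
    "amount": [None, None, 22.16, None, 5.0],
})
toy_conversions = informational_conversions(toy_events)
print(toy_conversions)
```

Note that a transaction counted this way may also coincide with a BOGO or discount completion, so the two attribution rules can overlap.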

In [11]:
# Create a copy for preprocessing
events_cleaned = df_event.copy()

### Merge all 3 cleaned dataframes into one master table
This brings customer demographics and offer details into one timeline.

In [12]:
# Check events_cleaned before the final merge
print(events_cleaned.shape)
events_cleaned.head()

(306137, 6)


Unnamed: 0,customer_id,event,time,offer_id,amount,reward
0,78afa995795e4d85b5d9ceeca43f5fef,offer received,0,9b98b8c7a33c4b65b9aebfe6a799e6d9,,
1,a03223e636434f42ac4c3df47e8bac43,offer received,0,0b1e1539f2cc45b7b9fa7c272da2e1d7,,
2,e2127556f4f64592b11af22de27a7932,offer received,0,2906b810c7d4411798c6938adc9daaa5,,
3,8ec6ce2a7e7949b1bf142def7d0e0586,offer received,0,fafdcd668e3743c1bb461111dcafc2a4,,
4,68617ca6246f4fbc85e91a2a49552598,offer received,0,4d5c57ea9a6940dd891ad53e9dbe8da0,,


In [13]:
# Check customers_cleaned before the final merge
print(customers_cleaned.shape)
customers_cleaned.head()

(14825, 8)


Unnamed: 0,customer_id,became_member_on,gender,age,income,membership_year,age_group,income_group
0,0610b486422d4921ae7d2bf64640c50b,2017-07-15,F,55,112000.0,2017,Older Adult,High Income
1,78afa995795e4d85b5d9ceeca43f5fef,2017-05-09,F,75,100000.0,2017,Senior,High Income
2,e2127556f4f64592b11af22de27a7932,2018-04-26,M,68,70000.0,2018,Senior,Middle Income
3,389bc3fa690240e798340f5a15918d5c,2018-02-09,M,65,53000.0,2018,Senior,Middle Income
4,2eeac8d8feae4a8cad5a6af0499a211d,2017-11-11,M,58,51000.0,2017,Older Adult,Middle Income


In [14]:
###  Merging Master Data
# I will use the 'events_cleaned' as the LEFT table.
# This ensures we keep all 306,137 events.
# Events from customers not in 'customers_cleaned' will have NaN

df_master = pd.merge(events_cleaned, customers_cleaned, on='customer_id', how='left')

print(f"Total rows in events log: {len(events_cleaned)}")
print(f"Total rows after merge:   {len(df_master)}")
print("-" * 30)

# Check for orphan events (events whose customer has no demographic record)
orphan_events = df_master['age_group'].isnull().sum()
print(f"Found {orphan_events} 'orphan' events with no matching customer.")

Total rows in events log: 306137
Total rows after merge:   306137
------------------------------
Found 33749 'orphan' events with no matching customer.
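The orphan rows are consistent with the overview above, where the event log reports 17,000 distinct `customer_id` values while the customer table holds 14,825. As an alternative to checking a demographic column for nulls, `pd.merge` can flag unmatched rows directly through its `indicator` and `validate` parameters. The sketch below uses toy frames, not the real data:

```python
import pandas as pd

toy_events = pd.DataFrame({
    "customer_id": ["c1", "c1", "c2", "c3"],
    "event": ["offer received", "transaction", "transaction", "offer received"],
})
toy_customers = pd.DataFrame({"customer_id": ["c1", "c3"], "age": [55, 68]})

# indicator=True adds a `_merge` column ('both' or 'left_only');
# validate="many_to_one" raises if customer_id is duplicated on the right side
merged = toy_events.merge(
    toy_customers, on="customer_id", how="left",
    indicator=True, validate="many_to_one",
)

is_orphan = merged["_merge"] == "left_only"
orphan_rows = int(is_orphan.sum())
orphan_customers = merged.loc[is_orphan, "customer_id"].nunique()
print(f"{orphan_rows} orphan event rows from {orphan_customers} customers")
```

Counting orphan customers as well as orphan rows shows how many members, not just events, are lost when the rows are dropped.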


In [15]:
### Dropping Orphan Data
# I cannot analyze events if I don't know the customer's demographics.
# Therefore, it is best to drop these 33,749 rows.

rows_before_drop = df_master.shape[0]
df_master = df_master.dropna(subset=['age_group', 'income_group', 'gender'])
rows_after_drop = df_master.shape[0]

print(f"Dropped {rows_before_drop - rows_after_drop} orphan event rows.")
print(f"New master DataFrame shape: {df_master.shape}")


Dropped 33749 orphan event rows.
New master DataFrame shape: (272388, 13)


In [16]:
# Now, need to merge the offer data.
# I use 'left' because transaction events won't have an offer_id.
df_master = pd.merge(df_master, offers_cleaned, on='offer_id', how='left')
df_master.head()

Unnamed: 0,customer_id,event,time,offer_id,amount,reward_x,became_member_on,gender,age,income,membership_year,age_group,income_group,offer_type,difficulty,reward_y,duration,channels,duration_hours,offer_key
0,78afa995795e4d85b5d9ceeca43f5fef,offer received,0,9b98b8c7a33c4b65b9aebfe6a799e6d9,,,2017-05-09,F,75.0,100000.0,2017.0,Senior,High Income,bogo,5.0,5.0,7.0,"['web', 'email', 'mobile']",168.0,bogo-5-5-7
1,e2127556f4f64592b11af22de27a7932,offer received,0,2906b810c7d4411798c6938adc9daaa5,,,2018-04-26,M,68.0,70000.0,2018.0,Senior,Middle Income,discount,10.0,2.0,7.0,"['web', 'email', 'mobile']",168.0,discount-10-2-7
2,389bc3fa690240e798340f5a15918d5c,offer received,0,f19421c1d4aa40978ebb69ca19b0e20d,,,2018-02-09,M,65.0,53000.0,2018.0,Senior,Middle Income,bogo,5.0,5.0,5.0,"['web', 'email', 'mobile', 'social']",120.0,bogo-5-5-5
3,2eeac8d8feae4a8cad5a6af0499a211d,offer received,0,3f207df678b143eea3cee63160fa8bed,,,2017-11-11,M,58.0,51000.0,2017.0,Older Adult,Middle Income,informational,0.0,0.0,4.0,"['web', 'email', 'mobile']",96.0,informational-0-0-4
4,aa4862eba776480b8bb9c68455b8c2e1,offer received,0,0b1e1539f2cc45b7b9fa7c272da2e1d7,,,2017-09-11,F,61.0,57000.0,2017.0,Older Adult,Middle Income,discount,20.0,5.0,10.0,"['web', 'email']",240.0,discount-20-5-10
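Both tables contain a `reward` column, so the merge renames them `reward_x` (from the event log, the reward actually paid out) and `reward_y` (from the offer table, the reward on offer). The `suffixes` parameter of `pd.merge` can make this explicit. The sketch below uses toy frames, and the suffix names `_earned` and `_offered` are my own choice, not part of the original notebook:

```python
import pandas as pd

toy_master = pd.DataFrame({
    "offer_id": ["o1", "o1", None],
    "event": ["offer received", "offer completed", "transaction"],
    "reward": [None, 5.0, None],  # reward recorded on the event row
})
toy_offers = pd.DataFrame({
    "offer_id": ["o1"],
    "offer_type": ["bogo"],
    "reward": [5],  # reward defined by the offer
})

# Overlapping non-key columns get the given suffixes instead of _x / _y
merged = toy_master.merge(
    toy_offers, on="offer_id", how="left",
    suffixes=("_earned", "_offered"),
)
print(merged.columns.tolist())
```

If the suffixes were changed in the real pipeline, any later code that reads `reward_x` would need to be updated to match.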


In [17]:
# Sort chronologically for each customer. This is crucial for the iteration logic.
df_master = df_master.sort_values(['customer_id', 'time'])

print(f"Final, cleaned master DataFrame shape: {df_master.shape}")
df_master.head()

Final, cleaned master DataFrame shape: (272388, 20)


Unnamed: 0,customer_id,event,time,offer_id,amount,reward_x,became_member_on,gender,age,income,membership_year,age_group,income_group,offer_type,difficulty,reward_y,duration,channels,duration_hours,offer_key
49809,0009655768c64bdeb2e877511632db8f,offer received,168,5a8bc65990b245e5a138643cd4eb9837,,,2017-04-21,M,33.0,72000.0,2017.0,Young Adult,Middle Income,informational,0.0,0.0,3.0,"['email', 'mobile', 'social']",72.0,informational-0-0-3
68952,0009655768c64bdeb2e877511632db8f,offer viewed,192,5a8bc65990b245e5a138643cd4eb9837,,,2017-04-21,M,33.0,72000.0,2017.0,Young Adult,Middle Income,informational,0.0,0.0,3.0,"['email', 'mobile', 'social']",72.0,informational-0-0-3
79386,0009655768c64bdeb2e877511632db8f,transaction,228,,22.16,,2017-04-21,M,33.0,72000.0,2017.0,Young Adult,Middle Income,,,,,,,
101111,0009655768c64bdeb2e877511632db8f,offer received,336,3f207df678b143eea3cee63160fa8bed,,,2017-04-21,M,33.0,72000.0,2017.0,Young Adult,Middle Income,informational,0.0,0.0,4.0,"['web', 'email', 'mobile']",96.0,informational-0-0-4
124364,0009655768c64bdeb2e877511632db8f,offer viewed,372,3f207df678b143eea3cee63160fa8bed,,,2017-04-21,M,33.0,72000.0,2017.0,Young Adult,Middle Income,informational,0.0,0.0,4.0,"['web', 'email', 'mobile']",96.0,informational-0-0-4


In [18]:
##check total number of duplicated rows
df_master.duplicated().sum() 

np.int64(0)

In [19]:
# Count the number of events of each type
df_master.event.value_counts()

event
transaction        123957
offer received      66501
offer viewed        49860
offer completed     32070
Name: count, dtype: int64

In [20]:
# Inspect one customer's event history to understand the data
df_master[df_master.customer_id=="0009655768c64bdeb2e877511632db8f"]

Unnamed: 0,customer_id,event,time,offer_id,amount,reward_x,became_member_on,gender,age,income,membership_year,age_group,income_group,offer_type,difficulty,reward_y,duration,channels,duration_hours,offer_key
49809,0009655768c64bdeb2e877511632db8f,offer received,168,5a8bc65990b245e5a138643cd4eb9837,,,2017-04-21,M,33.0,72000.0,2017.0,Young Adult,Middle Income,informational,0.0,0.0,3.0,"['email', 'mobile', 'social']",72.0,informational-0-0-3
68952,0009655768c64bdeb2e877511632db8f,offer viewed,192,5a8bc65990b245e5a138643cd4eb9837,,,2017-04-21,M,33.0,72000.0,2017.0,Young Adult,Middle Income,informational,0.0,0.0,3.0,"['email', 'mobile', 'social']",72.0,informational-0-0-3
79386,0009655768c64bdeb2e877511632db8f,transaction,228,,22.16,,2017-04-21,M,33.0,72000.0,2017.0,Young Adult,Middle Income,,,,,,,
101111,0009655768c64bdeb2e877511632db8f,offer received,336,3f207df678b143eea3cee63160fa8bed,,,2017-04-21,M,33.0,72000.0,2017.0,Young Adult,Middle Income,informational,0.0,0.0,4.0,"['web', 'email', 'mobile']",96.0,informational-0-0-4
124364,0009655768c64bdeb2e877511632db8f,offer viewed,372,3f207df678b143eea3cee63160fa8bed,,,2017-04-21,M,33.0,72000.0,2017.0,Young Adult,Middle Income,informational,0.0,0.0,4.0,"['web', 'email', 'mobile']",96.0,informational-0-0-4
136355,0009655768c64bdeb2e877511632db8f,offer received,408,f19421c1d4aa40978ebb69ca19b0e20d,,,2017-04-21,M,33.0,72000.0,2017.0,Young Adult,Middle Income,bogo,5.0,5.0,5.0,"['web', 'email', 'mobile', 'social']",120.0,bogo-5-5-5
149503,0009655768c64bdeb2e877511632db8f,transaction,414,,8.57,,2017-04-21,M,33.0,72000.0,2017.0,Young Adult,Middle Income,,,,,,,
149504,0009655768c64bdeb2e877511632db8f,offer completed,414,f19421c1d4aa40978ebb69ca19b0e20d,,5.0,2017-04-21,M,33.0,72000.0,2017.0,Young Adult,Middle Income,bogo,5.0,5.0,5.0,"['web', 'email', 'mobile', 'social']",120.0,bogo-5-5-5
166644,0009655768c64bdeb2e877511632db8f,offer viewed,456,f19421c1d4aa40978ebb69ca19b0e20d,,,2017-04-21,M,33.0,72000.0,2017.0,Young Adult,Middle Income,bogo,5.0,5.0,5.0,"['web', 'email', 'mobile', 'social']",120.0,bogo-5-5-5
181620,0009655768c64bdeb2e877511632db8f,offer received,504,fafdcd668e3743c1bb461111dcafc2a4,,,2017-04-21,M,33.0,72000.0,2017.0,Young Adult,Middle Income,discount,10.0,2.0,10.0,"['web', 'email', 'mobile', 'social']",240.0,discount-10-2-10


In [21]:
# Inspect a second customer's event history
df_master[df_master.customer_id=="78afa995795e4d85b5d9ceeca43f5fef"]

Unnamed: 0,customer_id,event,time,offer_id,amount,reward_x,became_member_on,gender,age,income,membership_year,age_group,income_group,offer_type,difficulty,reward_y,duration,channels,duration_hours,offer_key
0,78afa995795e4d85b5d9ceeca43f5fef,offer received,0,9b98b8c7a33c4b65b9aebfe6a799e6d9,,,2017-05-09,F,75.0,100000.0,2017.0,Senior,High Income,bogo,5.0,5.0,7.0,"['web', 'email', 'mobile']",168.0,bogo-5-5-7
13545,78afa995795e4d85b5d9ceeca43f5fef,offer viewed,6,9b98b8c7a33c4b65b9aebfe6a799e6d9,,,2017-05-09,F,75.0,100000.0,2017.0,Senior,High Income,bogo,5.0,5.0,7.0,"['web', 'email', 'mobile']",168.0,bogo-5-5-7
42328,78afa995795e4d85b5d9ceeca43f5fef,transaction,132,,19.89,,2017-05-09,F,75.0,100000.0,2017.0,Senior,High Income,,,,,,,
42329,78afa995795e4d85b5d9ceeca43f5fef,offer completed,132,9b98b8c7a33c4b65b9aebfe6a799e6d9,,5.0,2017-05-09,F,75.0,100000.0,2017.0,Senior,High Income,bogo,5.0,5.0,7.0,"['web', 'email', 'mobile']",168.0,bogo-5-5-7
44065,78afa995795e4d85b5d9ceeca43f5fef,transaction,144,,17.78,,2017-05-09,F,75.0,100000.0,2017.0,Senior,High Income,,,,,,,
47385,78afa995795e4d85b5d9ceeca43f5fef,offer received,168,5a8bc65990b245e5a138643cd4eb9837,,,2017-05-09,F,75.0,100000.0,2017.0,Senior,High Income,informational,0.0,0.0,3.0,"['email', 'mobile', 'social']",72.0,informational-0-0-3
75792,78afa995795e4d85b5d9ceeca43f5fef,offer viewed,216,5a8bc65990b245e5a138643cd4eb9837,,,2017-05-09,F,75.0,100000.0,2017.0,Senior,High Income,informational,0.0,0.0,3.0,"['email', 'mobile', 'social']",72.0,informational-0-0-3
77440,78afa995795e4d85b5d9ceeca43f5fef,transaction,222,,19.67,,2017-05-09,F,75.0,100000.0,2017.0,Senior,High Income,,,,,,,
81925,78afa995795e4d85b5d9ceeca43f5fef,transaction,240,,29.72,,2017-05-09,F,75.0,100000.0,2017.0,Senior,High Income,,,,,,,
125790,78afa995795e4d85b5d9ceeca43f5fef,transaction,378,,23.93,,2017-05-09,F,75.0,100000.0,2017.0,Senior,High Income,,,,,,,


In [22]:
# Inspect a third customer's event history
df_master[df_master.customer_id=="e2127556f4f64592b11af22de27a7932"]

Unnamed: 0,customer_id,event,time,offer_id,amount,reward_x,became_member_on,gender,age,income,membership_year,age_group,income_group,offer_type,difficulty,reward_y,duration,channels,duration_hours,offer_key
1,e2127556f4f64592b11af22de27a7932,offer received,0,2906b810c7d4411798c6938adc9daaa5,,,2018-04-26,M,68.0,70000.0,2018.0,Senior,Middle Income,discount,10.0,2.0,7.0,"['web', 'email', 'mobile']",168.0,discount-10-2-7
17760,e2127556f4f64592b11af22de27a7932,offer viewed,18,2906b810c7d4411798c6938adc9daaa5,,,2018-04-26,M,68.0,70000.0,2018.0,Senior,Middle Income,discount,10.0,2.0,7.0,"['web', 'email', 'mobile']",168.0,discount-10-2-7
91389,e2127556f4f64592b11af22de27a7932,transaction,288,,17.88,,2018-04-26,M,68.0,70000.0,2018.0,Senior,Middle Income,,,,,,,
92467,e2127556f4f64592b11af22de27a7932,transaction,294,,21.43,,2018-04-26,M,68.0,70000.0,2018.0,Senior,Middle Income,,,,,,,
98704,e2127556f4f64592b11af22de27a7932,offer received,336,3f207df678b143eea3cee63160fa8bed,,,2018-04-26,M,68.0,70000.0,2018.0,Senior,Middle Income,informational,0.0,0.0,4.0,"['web', 'email', 'mobile']",96.0,informational-0-0-4
133940,e2127556f4f64592b11af22de27a7932,offer received,408,9b98b8c7a33c4b65b9aebfe6a799e6d9,,,2018-04-26,M,68.0,70000.0,2018.0,Senior,Middle Income,bogo,5.0,5.0,7.0,"['web', 'email', 'mobile']",168.0,bogo-5-5-7
151980,e2127556f4f64592b11af22de27a7932,offer viewed,420,9b98b8c7a33c4b65b9aebfe6a799e6d9,,,2018-04-26,M,68.0,70000.0,2018.0,Senior,Middle Income,bogo,5.0,5.0,7.0,"['web', 'email', 'mobile']",168.0,bogo-5-5-7
179226,e2127556f4f64592b11af22de27a7932,offer received,504,fafdcd668e3743c1bb461111dcafc2a4,,,2018-04-26,M,68.0,70000.0,2018.0,Senior,Middle Income,discount,10.0,2.0,10.0,"['web', 'email', 'mobile', 'social']",240.0,discount-10-2-10
199771,e2127556f4f64592b11af22de27a7932,offer viewed,522,fafdcd668e3743c1bb461111dcafc2a4,,,2018-04-26,M,68.0,70000.0,2018.0,Senior,Middle Income,discount,10.0,2.0,10.0,"['web', 'email', 'mobile', 'social']",240.0,discount-10-2-10
199772,e2127556f4f64592b11af22de27a7932,transaction,522,,18.42,,2018-04-26,M,68.0,70000.0,2018.0,Senior,Middle Income,,,,,,,


### What I Found
1. A `transaction` may happen without an `offer completed` event. These cannot be counted as offer-related transactions, because the rule is: **For a transaction to be attributed to an offer, it must occur at the same time as when the offer was "completed" by the customer**.
2. In some cases, offers are received and viewed but not completed.
3. In some cases, offers are received and completed but not viewed. I will not count these as offer-related transactions either, because the purchase could be a coincidence rather than one prompted by the offer.
4. A customer can receive multiple offers without viewing the previous one.
5. A customer can complete two offers with one transaction. I need to check how one transaction is associated with two offers.

So, a transaction is attributed to an offer only if it follows this sequence: offer_received --> offer_viewed --> offer_completed, with a transaction at the same time as the completion. Events that do not fit this sequence are excluded from the final dataset. This defines a **truly influenced purchase**.


## Part 3: Building the Offer Attribution Funnel

This is the most critical step. The dataset does not explicitly link transactions to offers, so we must build this link logically.

**My Logic for a "Successful Offer":**
A successful, attributed offer must follow this exact sequence:
1.  **Offer Received:** The `offer received` event starts the clock.
2.  **Offer Viewed:** The customer must view the offer *after* receiving it and *before* it expires.
3.  **Offer Completed:** The customer must complete the offer *after* viewing it and *before* it expires.
4.  **Transaction Match:** A `transaction` event must exist at the *exact same time* as the `offer completed` event.
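The four rules can be condensed into a single predicate. The sketch below is a simplified illustration, not the function used in this notebook: the name `is_attributed` is hypothetical, all times are hours since the start of the test, and the comparisons are inclusive. The example times are taken from the customer logs shown earlier.

```python
def is_attributed(received, viewed, completed, duration_hours, transaction_times):
    """Check the four funnel rules for a single offer (times in hours)."""
    expires = received + duration_hours

    # Rule 2: viewed after receipt and before expiry
    if viewed is None or not (received <= viewed <= expires):
        return False

    # Rule 3: completed after the view and before expiry
    if completed is None or not (viewed <= completed <= expires):
        return False

    # Rule 4: a transaction exists at exactly the completion time
    return completed in transaction_times

# bogo-5-5-7: received 0, viewed 6, completed 132 alongside a transaction
print(is_attributed(0, 6, 132, 168, {132, 144}))   # True

# bogo-5-5-5: completed at 414 but only viewed at 456, so not attributed
print(is_attributed(408, 456, 414, 120, {414}))    # False
```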

### 3.1: Building the Attribution Function

This function will process the event history for a *single customer* and return a list of all their successful offer journeys.

**Note on Informational Offers:**
Informational offers have no 'offer completed' event. Their goal is just to drive a transaction. For now, I will focus **only on BOGO and Discount offers** (which have 'offer completed' events). I will explicitly skip informational offers for now.

In [23]:
### 3.1: Building the attribution function
def process_customer_events(events):

    """
    Process the chronologically sorted events of a single customer and
    identify valid, completed BOGO/discount offers.

    Takes that customer's slice of the master DataFrame and returns a list of
    dictionaries, one per successfully attributed offer journey.
    """

    successful_journey= []

    # --- Keep track of completion events that have already been used ---
    used_completion_indices = set()

    #get all transaction for quick lookup later
    all_transactions=events[events.event=="transaction"]

    #start with offer_received event
    received_offers=events[events.event=="offer received"]

    for _,offer in received_offers.iterrows():

        #skip informational offers
        if offer["offer_type"] not in ["bogo","discount"]:
            continue

        #calculate when offer expires
        offer_start=offer["time"]
        offer_expires=offer_start+offer["duration_hours"]
        offer_id=offer["offer_id"]

        #--Did they view the offer?---
        viewed=events[
            (events.event=="offer viewed") &
            (events.offer_id==offer_id) &
            (events.time>=offer_start) &
            (events.time<=offer_expires)
        ]

        first_view=viewed.head(1)

        if first_view.empty:
            continue #offer never viewed ->skip

        view_time = first_view['time'].iloc[0] #use first view


        #---Did they complete the offer?--

        completed=events[
            (events.event=="offer completed") &
            (events.offer_id==offer_id) &
            (events.time>=view_time) &
            (events.time<=offer_expires)
        ]

        first_completion = completed.head(1)

        if first_completion.empty:
            continue

        completion_time=first_completion['time'].iloc[0]

        # --- Check if this completion has already been used ---
        completion_index = first_completion.index[0]
        if completion_index in used_completion_indices:
            continue # This completion was already attributed, skip this funnel

        # --- Was there a transaction at the completion time? ---
        transaction=all_transactions[all_transactions.time==completion_time]

        if transaction.empty:
            continue #no transaction -> skip

        #----save this journey---

        journey={
            "customer_id":offer["customer_id"],
            "offer_id":offer_id,
            "offer_key":offer["offer_key"],
            "offer_type":offer["offer_type"],
            "channels":offer["channels"],

            #customer demographics
            "age_group":offer["age_group"],
            "gender":offer["gender"],
            "income_group":offer["income_group"],
            "membership_year":offer["membership_year"],

            #journey timeline
            "time_rcvd": offer_start,
            "time_viewed": view_time,
            "time_completed": completion_time,

            #financial data
            "transaction_amount":transaction.iloc[0]["amount"],
            "reward": first_completion.iloc[0]["reward_x"]
        }

        successful_journey.append(journey)

        # ---Mark this completion as "used" ---
        used_completion_indices.add(completion_index)


    return successful_journey
            
    
    

### 3.2: Applying the Function and Creating the Final DataFrame

Now I will apply this function to every customer, using `.groupby('customer_id')` and `.apply()` to run it on each customer's block of events. This will take a moment.
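The group-then-flatten pattern can be illustrated on toy data. This sketch loops over the groups explicitly instead of calling `.apply()`, which gives the same per-customer lists here; `list_transactions` is a hypothetical stand-in for the real per-customer function.

```python
from itertools import chain

import pandas as pd

toy = pd.DataFrame({
    "customer_id": ["c1", "c1", "c2"],
    "event": ["transaction", "transaction", "offer received"],
    "amount": [4.0, 6.0, None],
})

def list_transactions(group):
    """Return one dict per transaction row (stand-in for the real function)."""
    rows = group[group["event"] == "transaction"]
    return [{"amount": amt} for amt in rows["amount"]]

# Iterating over a groupby object yields (key, sub-DataFrame) pairs
per_customer = [list_transactions(g) for _, g in toy.groupby("customer_id")]

# Each customer contributes a list (possibly empty); flatten into one list
flat = list(chain.from_iterable(per_customer))
print(flat)
```

Customers with no qualifying rows contribute an empty list, which simply disappears when flattened.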

In [24]:
### 3.2: Applying the Function and Creating the Final DataFrame

print("Processing customer events... This may take a minute.")

# Group the master table by customer_id and apply our function
all_journeys = df_master.groupby('customer_id').apply(process_customer_events)

# Flatten the per-customer lists into a single list of journeys
final_data = [item for sublist in all_journeys for item in sublist]

# Create our final, clean DataFrame
df_final = pd.DataFrame(final_data)

print("Process complete.")
print(f"Created final DataFrame with {df_final.shape[0]} successfully attributed BOGO/Discount completions.")
df_final.head()

Processing customer events... This may take a minute.
Process complete.
Created final DataFrame with 22533 successfully attributed BOGO/Discount completions.


Unnamed: 0,customer_id,offer_id,offer_key,offer_type,channels,age_group,gender,income_group,membership_year,time_rcvd,time_viewed,time_completed,transaction_amount,reward
0,0011e0d4e6b944f998e987f904e8c1e5,2298d6c36e964ae4a3e7e9706d1fb8c2,discount-7-3-7,discount,"['web', 'email', 'mobile', 'social']",Middle Age Adult,O,Middle Income,2018.0,168,186,252,11.93,3.0
1,0011e0d4e6b944f998e987f904e8c1e5,0b1e1539f2cc45b7b9fa7c272da2e1d7,discount-20-5-10,discount,"['web', 'email']",Middle Age Adult,O,Middle Income,2018.0,408,432,576,22.05,5.0
2,0011e0d4e6b944f998e987f904e8c1e5,9b98b8c7a33c4b65b9aebfe6a799e6d9,bogo-5-5-7,bogo,"['web', 'email', 'mobile']",Middle Age Adult,O,Middle Income,2018.0,504,516,576,22.05,5.0
3,0020c2b971eb4e9188eac86d93036a77,fafdcd668e3743c1bb461111dcafc2a4,discount-10-2-10,discount,"['web', 'email', 'mobile', 'social']",Older Adult,F,High Income,2016.0,0,12,54,17.63,2.0
4,0020c2b971eb4e9188eac86d93036a77,4d5c57ea9a6940dd891ad53e9dbe8da0,bogo-10-10-5,bogo,"['web', 'email', 'mobile', 'social']",Older Adult,F,High Income,2016.0,408,426,510,17.24,10.0


### 3.3: Verify That the Event Sequence Is Correct

In [25]:
# Check for rows that violate time_rcvd <= time_viewed (expect none)
df_final[df_final.time_rcvd>df_final.time_viewed]

Unnamed: 0,customer_id,offer_id,offer_key,offer_type,channels,age_group,gender,income_group,membership_year,time_rcvd,time_viewed,time_completed,transaction_amount,reward


In [26]:
# Check for rows that violate time_viewed <= time_completed (expect none)
df_final[df_final.time_viewed > df_final.time_completed]

Unnamed: 0,customer_id,offer_id,offer_key,offer_type,channels,age_group,gender,income_group,membership_year,time_rcvd,time_viewed,time_completed,transaction_amount,reward
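
The two checks above can also be combined into a single assertion, so the notebook fails loudly if a later change breaks the ordering. This is a minimal sketch on a toy frame whose three rows reuse timestamps shown in this notebook. The name `in_order` is my own.

```python
import pandas as pd

# Timestamps taken from three rows displayed in this notebook.
toy = pd.DataFrame({
    "time_rcvd": [168, 408, 504],
    "time_viewed": [186, 432, 522],
    "time_completed": [252, 576, 522],
})

# A row is valid when received <= viewed <= completed.
in_order = (toy["time_rcvd"] <= toy["time_viewed"]) & (
    toy["time_viewed"] <= toy["time_completed"]
)
violations = toy[~in_order]

assert violations.empty, f"{len(violations)} rows are out of sequence"
print("All rows follow received <= viewed <= completed")
```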


In [27]:
# Spot-check a customer inspected earlier in the notebook
df_final[df_final.customer_id=="0009655768c64bdeb2e877511632db8f"]

Unnamed: 0,customer_id,offer_id,offer_key,offer_type,channels,age_group,gender,income_group,membership_year,time_rcvd,time_viewed,time_completed,transaction_amount,reward


In [28]:
# Spot-check another customer inspected earlier in the notebook
df_final[df_final.customer_id=="e2127556f4f64592b11af22de27a7932"]

Unnamed: 0,customer_id,offer_id,offer_key,offer_type,channels,age_group,gender,income_group,membership_year,time_rcvd,time_viewed,time_completed,transaction_amount,reward
19879,e2127556f4f64592b11af22de27a7932,9b98b8c7a33c4b65b9aebfe6a799e6d9,bogo-5-5-7,bogo,"['web', 'email', 'mobile']",Senior,M,Middle Income,2018.0,408,420,522,18.42,5.0
19880,e2127556f4f64592b11af22de27a7932,fafdcd668e3743c1bb461111dcafc2a4,discount-10-2-10,discount,"['web', 'email', 'mobile', 'social']",Senior,M,Middle Income,2018.0,504,522,522,18.42,2.0


Earlier I saw that this customer had a single `transaction` that coincided with two `offer completed` events. Here that one transaction is linked to two rows. Because `transaction` events carry no `offer_id`, a transaction cannot be tied to one specific offer, so revenue summed over these rows counts the same purchase twice.
I will keep these rows because they show the potential of each offer.
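
To see how often this happens, rows that share a transaction can be flagged by looking for repeated `(customer_id, time_completed)` pairs. The sketch below uses the two rows above plus one unrelated row, with customer IDs shortened for readability. It assumes that a customer has at most one transaction per timestamp.

```python
import pandas as pd

# Two rows from the customer above plus one unrelated row (IDs shortened).
toy_final = pd.DataFrame({
    "customer_id": ["e2127556", "e2127556", "0020c2b9"],
    "offer_key": ["bogo-5-5-7", "discount-10-2-10", "bogo-10-10-5"],
    "time_completed": [522, 522, 510],
    "transaction_amount": [18.42, 18.42, 17.24],
})

# keep=False marks every member of a duplicated group, not just the repeats.
shared = toy_final.duplicated(subset=["customer_id", "time_completed"], keep=False)

print(toy_final[shared])
print(f"Rows sharing a transaction: {shared.sum()}")
```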

### 3.4: Inspect the Final Analysis-Ready Data

The final dataset has 22,533 offer-related rows, which I will use for the rest of the analysis.
Let's do a quick check on this new, clean DataFrame.

In [29]:
# Check for any duplicates. This should be 0.
print(f"Duplicate rows: {df_final.duplicated().sum()}")

# Check for nulls. This should be 0.
print(f"Null values:\n{df_final.isnull().sum()}")

# Look at the descriptive data
df_final.describe()

Duplicate rows: 0
Null values:
customer_id           0
offer_id              0
offer_key             0
offer_type            0
channels              0
age_group             0
gender                0
income_group          0
membership_year       0
time_rcvd             0
time_viewed           0
time_completed        0
transaction_amount    0
reward                0
dtype: int64


Unnamed: 0,membership_year,time_rcvd,time_viewed,time_completed,transaction_amount,reward
count,22533.0,22533.0,22533.0,22533.0,22533.0,22533.0
mean,2016.409089,326.194648,345.37079,390.040474,20.464685,4.965118
std,1.146143,194.858809,194.80457,196.567161,40.897619,3.029037
min,2013.0,0.0,0.0,0.0,0.15,2.0
25%,2016.0,168.0,180.0,222.0,11.23,2.0
50%,2017.0,336.0,384.0,426.0,16.66,5.0
75%,2017.0,504.0,510.0,552.0,23.11,5.0
max,2018.0,576.0,708.0,714.0,1015.73,10.0


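**Comment:**
The average offer-related transaction amount is $20.46, and the average reward is $4.97.
The median transaction is lower ($16.66) and the maximum is $1,015.73, so a few large purchases pull the mean up.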

## Part 4: Analysis

With this clean `df_final` DataFrame, I can now confidently analyze offer effectiveness and customer behavior.

This analysis will answer three key questions:
1.  **Financial Impact:** What is the real financial impact of these offers? (e.g., ROI, average spend)
2.  **Offer Performance:** Which specific offers are the most effective at driving completions?
3.  **Customer Performance:** Which demographic segments (age, income, gender) are most responsive to offers?

### 4.1: Financial Impact Analysis

#### 4.1.1: Reward-to-Sales Ratio
I will calculate the total revenue generated *directly* from our attributed offers and compare it to the total cost (rewards paid out). This gives us a clear reward-to-sales ratio.

In [30]:
# --- Calculate Revenue and Costs ---

# Total revenue from successfully completed BOGO/Discount offers
total_offer_revenue = df_final['transaction_amount'].sum()

# Total cost of rewards for these offers
total_offer_reward = df_final['reward'].sum()

# Reward-to-Sales Ratio
# This tells us what percentage of the offer_driven revenue
# is given back to the customer as a reward.
reward_to_sale_ratio = (total_offer_reward / total_offer_revenue) * 100

print(f"--- Offer Financials (BOGO/Discount) ---")
print(f"Total Attributed Revenue: ${total_offer_revenue:,.2f}")
print(f"Total Rewards Paid:       ${total_offer_reward:,.2f}")
print(f"Reward-to-Sales Ratio:    {reward_to_sale_ratio:.2f}%")
print(f"\nThis means for every $1.00 in sales generated by an offer, we spent ${reward_to_sale_ratio/100:.2f} on the reward.")

--- Offer Financials (BOGO/Discount) ---
Total Attributed Revenue: $461,130.75
Total Rewards Paid:       $111,879.00
Reward-to-Sales Ratio:    24.26%

This means for every $1.00 in sales generated by an offer, we spent $0.24 on the reward.
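
The same ratio can be broken down by offer type with a `groupby`. The sketch below uses only the five rows shown in the `df_final.head()` output in section 3.2, so the numbers illustrate the calculation and say nothing about the full dataset. The column name `reward_to_sales_pct` is my own.

```python
import pandas as pd

# The five rows shown by df_final.head() in section 3.2 (relevant columns only).
toy_final = pd.DataFrame({
    "offer_type": ["discount", "discount", "bogo", "discount", "bogo"],
    "transaction_amount": [11.93, 22.05, 22.05, 17.63, 17.24],
    "reward": [3.0, 5.0, 5.0, 2.0, 10.0],
})

# Sum revenue and rewards per offer type, then take the ratio of the sums.
by_type = toy_final.groupby("offer_type")[["transaction_amount", "reward"]].sum()
by_type["reward_to_sales_pct"] = (
    by_type["reward"] / by_type["transaction_amount"] * 100
).round(2)

print(by_type)
```

Taking the ratio of the sums, rather than averaging per-row ratios, matches how the overall reward-to-sales ratio is computed above.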


#### 4.1.2: Offer-Related Share of Total Revenue
I want to compare offer-related revenue with total revenue.
First, I create a `df_transaction` DataFrame containing every transaction and sum its amounts to get total revenue.
Then I compute the share of total revenue that is offer related.

In [31]:
# Create df_transaction from df_master
df_transaction = df_master[df_master.event == "transaction"]
print(df_transaction.shape)
# Get total revenue
total_transaction_amount = df_transaction.amount.sum()
print(f"Total Revenue is ${total_transaction_amount:.2f}")
# Overall average transaction amount, with or without an offer
print(f"Average revenue with or without offer is: ${total_transaction_amount/(df_transaction.shape[0]):.2f}")
# What percentage of total revenue is offer related?
print(f"{total_offer_revenue/total_transaction_amount*100:.2f} percent of total revenue is offer related")

(123957, 20)
Total Revenue is $1734942.40
Average revenue with or without offer is: $14.00
26.58 percent of total revenue is offer related


#### 4.1.3: Average Transaction Value With and Without Offers
Which transactions are larger: those tied to an offer or those without one?
Comparing the mean transaction amount gives a first answer.
To do that, I need to separate offer-related transactions from non-offer transactions.

In [32]:
# 1. Index df_transaction by (customer_id, time)
df_transaction = df_transaction.set_index(['customer_id', 'time'])
# 2. Index df_final by (customer_id, time_completed); this index is not unique,
#    because one transaction can be linked to several offer completions
df_final_transaction = df_final.set_index(['customer_id', 'time_completed'])
# 3. Create a boolean mask
# .isin() checks whether each index entry of 'df_transaction' also exists
# in 'df_final_transaction'.
is_offer_driven_mask = df_transaction.index.isin(df_final_transaction.index)
# 4. Separate and compare
offer_txns = df_transaction[is_offer_driven_mask]
non_offer_txns = df_transaction[~is_offer_driven_mask]
print(f"Total offer related transaction is {offer_txns.shape}")
print(f"Total non offer transaction is {non_offer_txns.shape}")
# 5. Calculate averages
avg_offer_txn = offer_txns['amount'].mean()
avg_non_offer_txn = non_offer_txns['amount'].mean()

print(f"Average Offer-Driven Transaction: ${avg_offer_txn:,.2f}")
print(f"Average Non-Offer Transaction:    ${avg_non_offer_txn:,.2f}")
print(f"\nInsight: Offer-driven purchases are, on average, {((avg_offer_txn/avg_non_offer_txn) - 1):.1%} more valuable.")

Total offer related transaction is (21642, 18)
Total non offer transaction is (102315, 18)
Average Offer-Driven Transaction: $20.19
Average Non-Offer Transaction:    $12.69

Insight: Offer-driven purchases are, on average, 59.2% more valuable.


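**Comment**
Here I get 21,642 offer-related transactions rather than 22,533 rows, because in some cases one transaction is attached to more than one offer completion. The difference in averages is a descriptive comparison; it does not by itself show that offers caused the larger purchases.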
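
Because some transactions appear on more than one row of `df_final`, a row-level sum of `transaction_amount` counts those purchases more than once. The sketch below shows one way the revenue figure could be cross-checked at the transaction level. It is an assumption about a possible check, not a step in the original notebook, and the toy customer IDs are hypothetical.

```python
import pandas as pd

# Toy rows: customer c1 has one $18.42 transaction linked to two offers.
toy_final = pd.DataFrame({
    "customer_id": ["c1", "c1", "c2"],
    "time_completed": [522, 522, 510],
    "transaction_amount": [18.42, 18.42, 17.24],
    "reward": [5.0, 2.0, 10.0],
})

# Row-level sum: the shared transaction is counted twice.
row_level_revenue = toy_final["transaction_amount"].sum()

# Transaction-level sum: keep one row per (customer, completion time).
unique_txns = toy_final.drop_duplicates(subset=["customer_id", "time_completed"])
txn_level_revenue = unique_txns["transaction_amount"].sum()

# Rewards are paid per completed offer, so they are summed over all rows.
total_reward = toy_final["reward"].sum()

print(f"Row-level revenue:         ${row_level_revenue:.2f}")
print(f"Transaction-level revenue: ${txn_level_revenue:.2f}")
print(f"Total rewards:             ${total_reward:.2f}")
```

Rewards stay at the row level because each completed offer pays its own reward, even when two offers are completed by the same purchase.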

### 4.2: Offer Performance (Conversion Rate)

I will get the Completions count from `df_final` and Received count from `df_master`.

In [33]:
# 1. Get completion counts from df_final
completions = df_final['offer_key'].value_counts()
# 2. Get received counts from the master log
# I only count BOGO/Discount offers, since that's what we are analyzing
bogo_discount_received = df_master[
    (df_master['event'] == 'offer received') &
    (df_master['offer_type'].isin(['bogo', 'discount']))
]
received = bogo_discount_received['offer_key'].value_counts()
# 3. Combine into a new DataFrame
df_conversion = pd.DataFrame({'Completions': completions, 'Received': received})
# 4. Calculate the conversion rate, sort on the numeric value, then format it
# (sorting the formatted strings would compare them as text, not as numbers)
df_conversion['Conversion Rate'] = ((df_conversion['Completions'] / df_conversion['Received']) * 100).round(2)
df_conversion = df_conversion.sort_values('Conversion Rate', ascending=False)
df_conversion['Conversion Rate'] = df_conversion['Conversion Rate'].astype(str) + '%'
print("--- Offer Conversion Rates (BOGO/Discount) ---")
print(df_conversion)

--- Offer Conversion Rates (BOGO/Discount) ---
                  Completions  Received Conversion Rate
offer_key                                              
discount-10-2-10         4329      6652          65.08%
discount-7-3-7           4119      6655          61.89%
bogo-5-5-5               3374      6576          51.31%
bogo-10-10-5             2731      6593          41.42%
bogo-10-10-7             2585      6683          38.68%
discount-10-2-7          2047      6631          30.87%
bogo-5-5-7               2037      6685          30.47%
discount-20-5-10         1311      6726          19.49%


**Comment**
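The two best-converting offers are discounts (`discount-10-2-10` at 65.08% and `discount-7-3-7` at 61.89%). However, `discount-20-5-10` is the weakest offer at 19.49%, so offer type alone does not determine success.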
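
A small sketch of why conversion rates are best sorted as numbers before they are formatted as percentage strings. The three offers and their rates below are hypothetical. One rate is below 10% on purpose, because that is the case where text ordering and numeric ordering disagree.

```python
import pandas as pd

# Hypothetical conversion rates; offer_b is deliberately a single-digit rate.
rates = pd.Series([65.08, 9.5, 19.49], index=["offer_a", "offer_b", "offer_c"])

# Formatting first turns the values into text.
as_text = rates.round(2).astype(str) + "%"

# Text sort compares character by character, so "9.5%" ranks above "65.08%".
text_order = as_text.sort_values(ascending=False).index.tolist()
# Numeric sort gives the intended ranking.
numeric_order = rates.sort_values(ascending=False).index.tolist()

print("Sorted as text:   ", text_order)
print("Sorted as numbers:", numeric_order)
```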

### 4.3: Channel Performance (Conversion Rate)
I reuse `bogo_discount_received` to get the conversion rate for each channel combination.

In [41]:
# 1. Get completion counts per channel combination from df_final
channels = df_final['channels'].value_counts()
# 2. Get received counts from bogo_discount_received
channel_received = bogo_discount_received['channels'].value_counts()
# 3. Combine into a new DataFrame
df_channel_conversion = pd.DataFrame({'completions': channels, 'Received': channel_received})
# 4. Calculate the conversion rate, sort on the numeric value, then format it
# (sorting the formatted strings would compare them as text, not as numbers)
df_channel_conversion['Conversion Rate'] = ((df_channel_conversion['completions'] / df_channel_conversion['Received']) * 100).round(2)
df_channel_conversion = df_channel_conversion.sort_values('Conversion Rate', ascending=False)
df_channel_conversion['Conversion Rate'] = df_channel_conversion['Conversion Rate'].astype(str) + '%'
print("--- Channel Conversion Rates (BOGO/Discount) ---")
print(df_channel_conversion)


--- Channel Conversion Rates (BOGO/Discount) ---
                                      completions  Received Conversion Rate
channels                                                                   
['web', 'email', 'mobile', 'social']        14553     26476          54.97%
['email', 'mobile', 'social']                2585      6683          38.68%
['web', 'email', 'mobile']                   4084     13316          30.67%
['web', 'email']                             1311      6726          19.49%


**Comment**
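The `['web', 'email', 'mobile', 'social']` combination was the most successful, converting 54.97% of received offers. Two combinations have exactly the counts of a single offer (`['web', 'email']` matches `discount-20-5-10` and `['email', 'mobile', 'social']` matches `bogo-10-10-7`), so channel and offer effects cannot be fully separated here.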
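
The table above ranks channel combinations. To look at individual channels, the combinations can be split apart and the counts summed for each channel. This sketch rebuilds the table from the printed figures and assumes `channels` is stored as a string such as `"['web', 'email']"`, which is why `ast.literal_eval` is used to parse it. The column names `channel` and `conversion_pct` are my own.

```python
import ast

import pandas as pd

# Counts copied from the channel conversion table above.
combo = pd.DataFrame({
    "channels": [
        "['web', 'email', 'mobile', 'social']",
        "['email', 'mobile', 'social']",
        "['web', 'email', 'mobile']",
        "['web', 'email']",
    ],
    "completions": [14553, 2585, 4084, 1311],
    "received": [26476, 6683, 13316, 6726],
})

# Parse each string into a real list, then give every channel its own row.
combo["channel"] = combo["channels"].apply(ast.literal_eval)
per_channel = (
    combo.explode("channel")
    .groupby("channel")[["completions", "received"]]
    .sum()
)

# Conversion rate per channel, kept numeric so the sort is numeric.
per_channel["conversion_pct"] = (
    per_channel["completions"] / per_channel["received"] * 100
).round(2)
per_channel = per_channel.sort_values("conversion_pct", ascending=False)

print(per_channel)
```

These per-channel rates overlap heavily, since every offer includes email and most include several channels, so they describe which channels appear in better-converting offers and do not measure each channel's independent effect.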

### 4.4: Demographic Performance

Who are our best customers?

In [42]:
#1. --- Analyze by Age Group ---
age_analysis = df_final.groupby('age_group')['transaction_amount'].agg(['count', 'sum', 'mean'])
age_analysis = age_analysis.sort_values('mean', ascending=False)
age_analysis['sum'] = age_analysis['sum'].map('${:,.2f}'.format)
age_analysis['mean'] = age_analysis['mean'].map('${:,.2f}'.format)

print("--- Analysis by Age Group (Offer Completions) ---")
print(age_analysis)

--- Analysis by Age Group (Offer Completions) ---
                  count          sum    mean
age_group                                   
Senior             4947  $111,882.33  $22.62
Older Adult        8258  $185,998.16  $22.52
Elderly            1808   $39,608.99  $21.91
Middle Age Adult   4918   $88,471.96  $17.99
Young Adult        2602   $35,169.31  $13.52


**Comment:** Senior, Older Adult, and Elderly customers are the premium segment: they average about $22 per attributed offer transaction, versus $17.99 for Middle Age Adults and $13.52 for Young Adults. Older Adults also account for the most completions (8,258).

In [43]:
#2. --- Analyze by Income Group ---
income_analysis = df_final.groupby('income_group')['transaction_amount'].agg(['count', 'sum', 'mean'])
income_analysis = income_analysis.sort_values('mean', ascending=False)
income_analysis['sum'] = income_analysis['sum'].map('${:,.2f}'.format)
income_analysis['mean'] = income_analysis['mean'].map('${:,.2f}'.format)

print("\n--- Analysis by Income Group (Offer Completions) ---")
print(income_analysis)


--- Analysis by Income Group (Offer Completions) ---
               count          sum    mean
income_group                             
High Income     5455  $162,555.71  $29.80
Middle Income  14030  $261,636.93  $18.65
Low Income      3048   $36,938.11  $12.12


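**Comment**
The High Income group spends the most per transaction ($29.80 on average), followed by Middle Income ($18.65) and Low Income ($12.12). Middle Income customers account for the most completions (14,030).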

In [44]:
#3. --- Analyze by Gender ---
gender_analysis = df_final.groupby('gender')['transaction_amount'].agg(['count', 'sum', 'mean'])
gender_analysis = gender_analysis.sort_values('mean', ascending=False)
gender_analysis['sum'] = gender_analysis['sum'].map('${:,.2f}'.format)
gender_analysis['mean'] = gender_analysis['mean'].map('${:,.2f}'.format)

print("\n--- Analysis by Gender (Offer Completions) ---")
print(gender_analysis)


--- Analysis by Gender (Offer Completions) ---
        count          sum    mean
gender                            
F       10500  $235,064.10  $22.39
O         389    $7,536.91  $19.38
M       11644  $218,529.74  $18.77


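**Comment**
Female customers spend more per transaction ($22.39) than male customers ($18.77), although male customers complete slightly more offers (11,644 vs. 10,500).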
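
The three demographic cells above repeat the same group, aggregate, and sort pattern, so they could be folded into one helper. This is a minimal sketch; the function name `summarize_by` is hypothetical, and the toy frame reuses the five rows shown by `df_final.head()` in section 3.2.

```python
import pandas as pd


def summarize_by(df, column, value="transaction_amount"):
    """Return count, sum, and mean of `value` per group, sorted by mean."""
    summary = df.groupby(column)[value].agg(["count", "sum", "mean"])
    return summary.sort_values("mean", ascending=False)


# The five rows shown by df_final.head() in section 3.2 (relevant columns only).
toy_final = pd.DataFrame({
    "age_group": ["Middle Age Adult"] * 3 + ["Older Adult"] * 2,
    "gender": ["O", "O", "O", "F", "F"],
    "transaction_amount": [11.93, 22.05, 22.05, 17.63, 17.24],
})

for col in ["age_group", "gender"]:
    print(f"--- Analysis by {col} ---")
    print(summarize_by(toy_final, col).round(2))
```

The helper returns numeric columns and leaves dollar formatting to the caller, so the result can still be sorted or plotted afterwards.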


