# Starbucks Capstone Project

## 1. Business Understanding

* The program used to create the data simulates how people make purchasing decisions and how those decisions are influenced by promotional offers.

* The basic task is to use the data to identify which groups of people are most responsive to each type of offer, and how best to present each type of offer. In detail, the task is to combine transaction, demographic, and offer data to determine which demographic groups respond best to which offer type.

## 2. Data Understanding

This data set contains simulated data that mimics customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free). Some users might not receive any offer during certain weeks. 

This data set is a simplified version of the real Starbucks app because the underlying simulator only has one product whereas Starbucks actually sells dozens of products.

The data is contained in three files:

* portfolio.json - offer ids and metadata about each offer (duration, type, etc.)
* profile.json - demographic data for each customer
* transcript.json - records for transactions, offers received, offers viewed, and offers completed

Here is the schema and explanation of each variable in the files:

**portfolio.json**
* id (string) - offer id
* offer_type (string) - type of offer, i.e. BOGO, discount, or informational
* difficulty (int) - minimum required spend to complete an offer
* reward (int) - reward given for completing an offer
* duration (int) - time for offer to be open, in days
* channels (list of strings) - channels listed for the offer (web, email, mobile, social)

**profile.json**
* age (int) - age of the customer 
* became_member_on (int) - date when customer created an app account
* gender (str) - gender of the customer (note some entries contain 'O' for other rather than M or F)
* id (str) - customer id
* income (float) - customer's income

**transcript.json**
* event (str) - record description (i.e. transaction, offer received, offer viewed, etc.)
* person (str) - customer id
* time (int) - time in hours since start of test. The data begins at time t=0
* value (dict) - either an offer id or a transaction amount, depending on the record
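
Note that `duration` in portfolio.json is given in days while `time` in transcript.json is given in hours, so the two need a common unit before they can be compared. The following is a minimal sketch, assuming an offer stays open for `duration` days counted from the hour it was received. The data description does not state this explicitly, and the helper name `offer_window_hours` is made up for illustration.

```python
HOURS_PER_DAY = 24

def offer_window_hours(received_time, duration_days):
    """Return (start, end) of the offer window in hours since the start of the test."""
    # Assumption: the window opens when the offer is received and
    # closes duration_days later.
    return received_time, received_time + duration_days * HOURS_PER_DAY

# A 7-day offer received at t=0 and a 5-day offer received at t=168
window_a = offer_window_hours(0, 7)
window_b = offer_window_hours(168, 5)
print(window_a, window_b)
```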

In [75]:
# import necessary packages
import pandas as pd
import numpy as np
import math
import json
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
%matplotlib inline


In [136]:
# read in the json files
portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)

In [140]:
profile = pd.read_json('data/profile.json', orient='records', lines=True)

In [None]:
transcript = pd.read_json('data/transcript.json', orient='records', lines=True)
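
The `lines=True` argument tells pandas to treat every line of the file as one JSON object. As a rough illustration of that layout, the sketch below reads two invented records from an in-memory string instead of a file. The ids and values are placeholders, not rows from the real data.

```python
import io
import pandas as pd

# Two made-up records in the one-object-per-line layout implied by lines=True
sample = (
    '{"id": "offer_a", "offer_type": "bogo", "channels": ["email", "mobile"]}\n'
    '{"id": "offer_b", "offer_type": "discount", "channels": ["web"]}\n'
)

sample_df = pd.read_json(io.StringIO(sample), orient='records', lines=True)
print(sample_df)
```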

First, check the portfolio data.

In [3]:
portfolio.head(10)

Unnamed: 0,reward,channels,difficulty,duration,offer_type,id
0,10,"[email, mobile, social]",10,7,bogo,ae264e3637204a6fb9bb56bc8210ddfd
1,10,"[web, email, mobile, social]",10,5,bogo,4d5c57ea9a6940dd891ad53e9dbe8da0
2,0,"[web, email, mobile]",0,4,informational,3f207df678b143eea3cee63160fa8bed
3,5,"[web, email, mobile]",5,7,bogo,9b98b8c7a33c4b65b9aebfe6a799e6d9
4,5,"[web, email]",20,10,discount,0b1e1539f2cc45b7b9fa7c272da2e1d7
5,3,"[web, email, mobile, social]",7,7,discount,2298d6c36e964ae4a3e7e9706d1fb8c2
6,2,"[web, email, mobile, social]",10,10,discount,fafdcd668e3743c1bb461111dcafc2a4
7,0,"[email, mobile, social]",0,3,informational,5a8bc65990b245e5a138643cd4eb9837
8,5,"[web, email, mobile, social]",5,5,bogo,f19421c1d4aa40978ebb69ca19b0e20d
9,2,"[web, email, mobile]",10,7,discount,2906b810c7d4411798c6938adc9daaa5


In [137]:
portfolio.shape

(10, 6)

In [138]:
portfolio.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   reward      10 non-null     int64 
 1   channels    10 non-null     object
 2   difficulty  10 non-null     int64 
 3   duration    10 non-null     int64 
 4   offer_type  10 non-null     object
 5   id          10 non-null     object
dtypes: int64(3), object(3)
memory usage: 608.0+ bytes


In [139]:
portfolio['id'].duplicated().sum()

0

We can see that this dataframe is small and contains no missing values.
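
The same three checks (shape, missing values, duplicate ids) are repeated for each dataframe, so they could be wrapped in a small helper. This is only a sketch; `quick_check` is a hypothetical name and the toy frame stands in for the real data.

```python
import pandas as pd

def quick_check(df, key):
    """Summarize shape, missing values, and duplicate keys for a dataframe."""
    return {
        'shape': df.shape,
        'missing': int(df.isna().sum().sum()),
        'duplicate_keys': int(df[key].duplicated().sum()),
    }

# Toy stand-in for portfolio; the ids are invented for illustration
toy = pd.DataFrame({'id': ['a', 'b', 'c'], 'reward': [10, 0, 5]})
summary = quick_check(toy, 'id')
print(summary)
```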

Next, check the profile data.

In [6]:
profile.head()

Unnamed: 0,gender,age,id,became_member_on,income
0,,118,68be06ca386d4c31939f3a4f0e3dd783,20170212,
1,F,55,0610b486422d4921ae7d2bf64640c50b,20170715,112000.0
2,,118,38fe809add3b4fcf9315a9694bb96ff5,20180712,
3,F,75,78afa995795e4d85b5d9ceeca43f5fef,20170509,100000.0
4,,118,a03223e636434f42ac4c3df47e8bac43,20170804,


In [7]:
profile.shape

(17000, 5)

In [142]:
profile['id'].duplicated().sum()

0

In [8]:
profile.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            14825 non-null  object 
 1   age               17000 non-null  int64  
 2   id                17000 non-null  object 
 3   became_member_on  17000 non-null  int64  
 4   income            14825 non-null  float64
dtypes: float64(1), int64(2), object(2)
memory usage: 664.2+ KB


In [9]:
profile.isna().sum()

gender              2175
age                    0
id                     0
became_member_on       0
income              2175
dtype: int64

In [10]:
(profile['gender'].isna() & profile['income'].isna()).sum()

2175

There are missing values in the gender and income columns, and they always occur together. In other words, 14825 rows contain complete information, and 2175 rows are missing both gender and income.
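
The head of the profile table also hints that the incomplete rows carry an age of 118, which looks like a placeholder. One way this could be checked is to compare the two boolean masks. The sketch below uses a toy frame with invented values, so it shows the technique rather than a result for the real data.

```python
import numpy as np
import pandas as pd

toy_profile = pd.DataFrame({
    'gender': [None, 'F', None, 'M'],
    'age': [118, 55, 118, 68],
    'income': [np.nan, 112000.0, np.nan, 70000.0],
})

incomplete = toy_profile['gender'].isna() & toy_profile['income'].isna()
placeholder_age = toy_profile['age'] == 118

# True when the two masks flag exactly the same rows
same_rows = bool((incomplete == placeholder_age).all())
print(incomplete.sum(), same_rows)
```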

In [11]:
transcript.head()

Unnamed: 0,person,event,value,time
0,78afa995795e4d85b5d9ceeca43f5fef,offer received,{'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'},0
1,a03223e636434f42ac4c3df47e8bac43,offer received,{'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'},0
2,e2127556f4f64592b11af22de27a7932,offer received,{'offer id': '2906b810c7d4411798c6938adc9daaa5'},0
3,8ec6ce2a7e7949b1bf142def7d0e0586,offer received,{'offer id': 'fafdcd668e3743c1bb461111dcafc2a4'},0
4,68617ca6246f4fbc85e91a2a49552598,offer received,{'offer id': '4d5c57ea9a6940dd891ad53e9dbe8da0'},0


In [12]:
transcript.shape

(306534, 4)

In [143]:
transcript['person'].duplicated().sum()

289534

In [13]:
transcript.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306534 entries, 0 to 306533
Data columns (total 4 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   person  306534 non-null  object
 1   event   306534 non-null  object
 2   value   306534 non-null  object
 3   time    306534 non-null  int64 
dtypes: int64(1), object(3)
memory usage: 9.4+ MB


We can see there are no missing values in the transcript dataframe.
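
The large duplicate count in `person` is expected, because each customer generates many events. `duplicated().sum()` equals the number of rows minus the number of distinct values, so 306534 - 289534 = 17000 distinct customers appear in the transcript, the same count as the rows in profile. A toy illustration of that relationship, with invented ids:

```python
import pandas as pd

toy_events = pd.DataFrame({'person': ['a', 'b', 'a', 'c', 'a', 'b']})

n_rows = len(toy_events)
n_dupes = int(toy_events['person'].duplicated().sum())
n_unique = toy_events['person'].nunique()
print(n_rows, n_dupes, n_unique)

# Same arithmetic with the counts reported above
customers_in_transcript = 306534 - 289534
```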

## 3. Data Preparation

Data cleaning is especially important and tricky.

### 3.1 portfolio dataframe

For this dataframe, we don't have to worry about missing values. However, the channels column holds lists of strings, which are difficult to process directly. We want to:

A. Break the channels column up into dummy variables.

B. Get dummy variables for the categorical column 'offer_type'.

C. Rename column 'id' to 'offer_id' to avoid future confusion, and use offer_id as the index.

In [14]:
portfolio.head()

Unnamed: 0,reward,channels,difficulty,duration,offer_type,id
0,10,"[email, mobile, social]",10,7,bogo,ae264e3637204a6fb9bb56bc8210ddfd
1,10,"[web, email, mobile, social]",10,5,bogo,4d5c57ea9a6940dd891ad53e9dbe8da0
2,0,"[web, email, mobile]",0,4,informational,3f207df678b143eea3cee63160fa8bed
3,5,"[web, email, mobile]",5,7,bogo,9b98b8c7a33c4b65b9aebfe6a799e6d9
4,5,"[web, email]",20,10,discount,0b1e1539f2cc45b7b9fa7c272da2e1d7


Part A

In [23]:
channel_type = set()
for item in portfolio['channels']:
    channel_type.update(set(item))

channel_type = list(channel_type)

In [33]:
# use 'channel' as the loop variable to avoid shadowing the built-in type()
for channel in channel_type:
    portfolio[channel] = portfolio['channels'].apply(lambda chans: int(channel in chans))

portfolio.drop(labels='channels', axis=1, inplace=True)    
portfolio.head(10)

Unnamed: 0,reward,difficulty,duration,offer_type,id,email,social,mobile,web
0,10,10,7,bogo,ae264e3637204a6fb9bb56bc8210ddfd,1,1,1,0
1,10,10,5,bogo,4d5c57ea9a6940dd891ad53e9dbe8da0,1,1,1,1
2,0,0,4,informational,3f207df678b143eea3cee63160fa8bed,1,0,1,1
3,5,5,7,bogo,9b98b8c7a33c4b65b9aebfe6a799e6d9,1,0,1,1
4,5,20,10,discount,0b1e1539f2cc45b7b9fa7c272da2e1d7,1,0,0,1
5,3,7,7,discount,2298d6c36e964ae4a3e7e9706d1fb8c2,1,1,1,1
6,2,10,10,discount,fafdcd668e3743c1bb461111dcafc2a4,1,1,1,1
7,0,0,3,informational,5a8bc65990b245e5a138643cd4eb9837,1,1,1,0
8,5,5,5,bogo,f19421c1d4aa40978ebb69ca19b0e20d,1,1,1,1
9,2,10,7,discount,2906b810c7d4411798c6938adc9daaa5,1,0,1,1
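
As an alternative to looping over the channel names, the list column can be encoded with pandas string methods. This is a sketch on a toy frame with invented ids, not the approach used in this notebook. Note that the resulting columns come out in alphabetical order.

```python
import pandas as pd

toy = pd.DataFrame({
    'id': ['a', 'b'],
    'channels': [['email', 'mobile'], ['web', 'email']],
})

# Join each list into a string such as "email|mobile", then split into 0/1 columns
channel_dummies = toy['channels'].str.join('|').str.get_dummies(sep='|')
encoded = pd.concat([toy.drop(columns='channels'), channel_dummies], axis=1)
print(encoded)
```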


Part B

In [16]:
portfolio['offer_type'].unique()

array(['bogo', 'informational', 'discount'], dtype=object)

In [42]:
dummies = pd.DataFrame(pd.get_dummies(portfolio['offer_type']))

In [45]:
portfolio = pd.concat([portfolio, dummies], axis=1)

In [48]:
portfolio.drop(labels='offer_type', axis=1, inplace=True)
portfolio.head(10)

Unnamed: 0,reward,difficulty,duration,id,email,social,mobile,web,bogo,discount,informational
0,10,10,7,ae264e3637204a6fb9bb56bc8210ddfd,1,1,1,0,1,0,0
1,10,10,5,4d5c57ea9a6940dd891ad53e9dbe8da0,1,1,1,1,1,0,0
2,0,0,4,3f207df678b143eea3cee63160fa8bed,1,0,1,1,0,0,1
3,5,5,7,9b98b8c7a33c4b65b9aebfe6a799e6d9,1,0,1,1,1,0,0
4,5,20,10,0b1e1539f2cc45b7b9fa7c272da2e1d7,1,0,0,1,0,1,0
5,3,7,7,2298d6c36e964ae4a3e7e9706d1fb8c2,1,1,1,1,0,1,0
6,2,10,10,fafdcd668e3743c1bb461111dcafc2a4,1,1,1,1,0,1,0
7,0,0,3,5a8bc65990b245e5a138643cd4eb9837,1,1,1,0,0,0,1
8,5,5,5,f19421c1d4aa40978ebb69ca19b0e20d,1,1,1,1,1,0,0
9,2,10,7,2906b810c7d4411798c6938adc9daaa5,1,0,1,1,0,1,0
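
The three steps above (create the dummies, concatenate, drop the original column) can also be done in a single call by passing `columns=` to `pd.get_dummies`. The sketch below uses a toy frame. Passing `dtype=int` is a choice made here to get 0/1 integers regardless of the pandas version, since newer versions return booleans by default.

```python
import pandas as pd

toy = pd.DataFrame({'reward': [10, 0, 2],
                    'offer_type': ['bogo', 'informational', 'discount']})

# columns= replaces the named column by its dummies in one call;
# the empty prefix and separator keep the bare category names
encoded = pd.get_dummies(toy, columns=['offer_type'],
                         prefix='', prefix_sep='', dtype=int)
print(encoded)
```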


Part C

In [61]:
portfolio.rename(columns = {'id':'offer_id'}, inplace=True)

In [62]:
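# set_index returns a new dataframe; the result is only displayed here,
# so portfolio itself keeps its RangeIndex
portfolio.set_index('offer_id')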

offer_id,reward,difficulty,duration,email,social,mobile,web,bogo,discount,informational
ae264e3637204a6fb9bb56bc8210ddfd,10,10,7,1,1,1,0,1,0,0
4d5c57ea9a6940dd891ad53e9dbe8da0,10,10,5,1,1,1,1,1,0,0
3f207df678b143eea3cee63160fa8bed,0,0,4,1,0,1,1,0,0,1
9b98b8c7a33c4b65b9aebfe6a799e6d9,5,5,7,1,0,1,1,1,0,0
0b1e1539f2cc45b7b9fa7c272da2e1d7,5,20,10,1,0,0,1,0,1,0
2298d6c36e964ae4a3e7e9706d1fb8c2,3,7,7,1,1,1,1,0,1,0
fafdcd668e3743c1bb461111dcafc2a4,2,10,10,1,1,1,1,0,1,0
5a8bc65990b245e5a138643cd4eb9837,0,0,3,1,1,1,0,0,0,1
f19421c1d4aa40978ebb69ca19b0e20d,5,5,5,1,1,1,1,1,0,0
2906b810c7d4411798c6938adc9daaa5,2,10,7,1,0,1,1,0,1,0


In [63]:
portfolio.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   reward         10 non-null     int64 
 1   difficulty     10 non-null     int64 
 2   duration       10 non-null     int64 
 3   offer_id       10 non-null     object
 4   email          10 non-null     int64 
 5   social         10 non-null     int64 
 6   mobile         10 non-null     int64 
 7   web            10 non-null     int64 
 8   bogo           10 non-null     uint8 
 9   discount       10 non-null     uint8 
 10  informational  10 non-null     uint8 
dtypes: int64(7), object(1), uint8(3)
memory usage: 798.0+ bytes
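
The `info()` output still lists `offer_id` as a regular column with a `RangeIndex`, because `set_index` returns a new dataframe rather than modifying `portfolio`. A small sketch on a toy frame (invented ids) shows the difference between discarding and keeping the returned object:

```python
import pandas as pd

toy = pd.DataFrame({'offer_id': ['a', 'b'], 'reward': [10, 5]})

toy.set_index('offer_id')            # result discarded, toy is unchanged
unchanged_columns = toy.columns.tolist()

indexed = toy.set_index('offer_id')  # keep the returned dataframe
print(unchanged_columns, indexed.index.name, indexed.columns.tolist())
```

Whether to keep the id as an index or as a column depends on later steps; a column is often more convenient for merging.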


### 3.2 profile dataframe

A. As shown in Section 2, there are missing values in the gender and income columns. We will drop these rows.

B. We can also see from the head of the dataframe that there are entries with an age of 118, which is very unlikely. Since all three of these entries fall in rows that will be dropped, we will check the range of ages after dropping.

C. The became_member_on column stores dates as integers. We will convert them to datetime values.

D. We get dummies for the gender column and drop the original column. There are three categories: F, M, and O.

E. We will rename the column 'id' to 'customer_id' and set it as the index.

In [103]:
profile.head()

Unnamed: 0,gender,age,id,became_member_on,income
0,,118,68be06ca386d4c31939f3a4f0e3dd783,20170212,
1,F,55,0610b486422d4921ae7d2bf64640c50b,20170715,112000.0
2,,118,38fe809add3b4fcf9315a9694bb96ff5,20180712,
3,F,75,78afa995795e4d85b5d9ceeca43f5fef,20170509,100000.0
4,,118,a03223e636434f42ac4c3df47e8bac43,20170804,


Part A

In [104]:
profile.dropna(how='any', inplace=True)

In [105]:
profile.isna().sum()

gender              0
age                 0
id                  0
became_member_on    0
income              0
dtype: int64

Part B

In [106]:
profile['age'].max(), profile['age'].min()

(101, 18)

After dropping rows, the range of ages seems reasonable. No change applied.

Part C

In [107]:
profile['became_member_on']= profile['became_member_on'].apply(lambda x: datetime.strptime(str(x),'%Y%m%d'))
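
A vectorized alternative to the row-wise `strptime` call is `pd.to_datetime` with an explicit format, which should produce the same datetime64 column. The sketch below runs on two of the sample values shown earlier; `member_year` is a hypothetical derived value added only to show the `.dt` accessor.

```python
import pandas as pd

toy_dates = pd.Series([20170212, 20180712], name='became_member_on')

# Cast to string first so that 20170212 is read as a date, not as a number
parsed = pd.to_datetime(toy_dates.astype(str), format='%Y%m%d')
member_year = parsed.dt.year
print(parsed.tolist(), member_year.tolist())
```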

Part D

In [108]:
profile['gender'].unique()

array(['F', 'M', 'O'], dtype=object)

In [109]:
dummies = pd.DataFrame(pd.get_dummies(profile['gender']))

In [110]:
profile = pd.concat([profile, dummies], axis=1)

In [111]:
profile.drop(labels='gender', axis=1, inplace=True)

Part E

In [112]:
profile.rename(columns = {'id':'customer_id'}, inplace=True)

In [113]:
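# set_index returns a new dataframe; the result is only displayed here,
# so profile itself keeps its integer index
profile.set_index('customer_id')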

customer_id,age,became_member_on,income,F,M,O
0610b486422d4921ae7d2bf64640c50b,55,2017-07-15,112000.0,1,0,0
78afa995795e4d85b5d9ceeca43f5fef,75,2017-05-09,100000.0,1,0,0
e2127556f4f64592b11af22de27a7932,68,2018-04-26,70000.0,0,1,0
389bc3fa690240e798340f5a15918d5c,65,2018-02-09,53000.0,0,1,0
2eeac8d8feae4a8cad5a6af0499a211d,58,2017-11-11,51000.0,0,1,0
...,...,...,...,...,...,...
6d5f3a774f3d4714ab0c092238f3a1d7,45,2018-06-04,54000.0,1,0,0
2cb4f97358b841b9a9773a7aa05a9d77,61,2018-07-13,72000.0,0,1,0
01d26f638c274aa0b965d24cefe3183f,49,2017-01-26,73000.0,0,1,0
9dc1421481194dcd9400aec7c9ae6366,83,2016-03-07,50000.0,1,0,0


In [114]:
profile.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14825 entries, 1 to 16999
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   age               14825 non-null  int64         
 1   customer_id       14825 non-null  object        
 2   became_member_on  14825 non-null  datetime64[ns]
 3   income            14825 non-null  float64       
 4   F                 14825 non-null  uint8         
 5   M                 14825 non-null  uint8         
 6   O                 14825 non-null  uint8         
dtypes: datetime64[ns](1), float64(1), int64(1), object(1), uint8(3)
memory usage: 622.5+ KB


### 3.3 transcript dataframe

In [203]:
transcript.head()

Unnamed: 0,customer_id,event,value,time
0,78afa995795e4d85b5d9ceeca43f5fef,offer received,{'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'},0
1,a03223e636434f42ac4c3df47e8bac43,offer received,{'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'},0
2,e2127556f4f64592b11af22de27a7932,offer received,{'offer id': '2906b810c7d4411798c6938adc9daaa5'},0
3,8ec6ce2a7e7949b1bf142def7d0e0586,offer received,{'offer id': 'fafdcd668e3743c1bb461111dcafc2a4'},0
4,68617ca6246f4fbc85e91a2a49552598,offer received,{'offer id': '4d5c57ea9a6940dd891ad53e9dbe8da0'},0


In [129]:
transcript['event'].unique()

array(['offer received', 'offer viewed', 'transaction', 'offer completed'],
      dtype=object)

In [128]:
transcript[transcript['event'] == 'transaction'].iloc[0:2]

Unnamed: 0,person,event,value,time
12654,02c083884c7d45b39cc68e1314fec56c,transaction,{'amount': 0.8300000000000001},0
12657,9fa9ae8f57894cc9a3b8a9bbe0fc1b2f,transaction,{'amount': 34.56},0


In [131]:
transcript[transcript['event'] == 'offer received'].iloc[0:2]

Unnamed: 0,person,event,value,time
0,78afa995795e4d85b5d9ceeca43f5fef,offer received,{'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'},0
1,a03223e636434f42ac4c3df47e8bac43,offer received,{'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'},0


In [132]:
transcript[transcript['event'] == 'offer viewed'].iloc[0:2]

Unnamed: 0,person,event,value,time
12650,389bc3fa690240e798340f5a15918d5c,offer viewed,{'offer id': 'f19421c1d4aa40978ebb69ca19b0e20d'},0
12651,d1ede868e29245ea91818a903fec04c6,offer viewed,{'offer id': '5a8bc65990b245e5a138643cd4eb9837'},0


In [130]:
transcript[transcript['event'] == 'offer completed'].iloc[0:2]

Unnamed: 0,person,event,value,time
12658,9fa9ae8f57894cc9a3b8a9bbe0fc1b2f,offer completed,{'offer_id': '2906b810c7d4411798c6938adc9daaa5...,0
12672,fe97aa22dd3e48c8b143116a8403dd52,offer completed,{'offer_id': 'fafdcd668e3743c1bb461111dcafc2a4...,0


After exploring, I found that the records fall into two categories: purchases (transaction events) and offers (offer received, offer viewed, and offer completed events). So I will split the data into two dataframes, one for each category.

A. Rename the column person to customer_id.

B. Split transcript into a transactions dataframe and an offer dataframe, based on the event type.

C. For each dataframe, extract the data from the value column. For transactions it is a float, stored as 'amount'; for offers it is the offer id string, stored as 'offer_id'. After the extraction, the value column is dropped from both dataframes, and the event column is dropped from transactions.

D. Get dummies for the offer events, then drop the event column from the offer dataframe.

The person column contains duplicates, so it cannot serve as a unique index. The default index is kept.
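
One detail in the samples above deserves attention: the 'offer received' and 'offer viewed' records use the key 'offer id', while the 'offer completed' samples show 'offer_id'. A minimal sketch of an extraction that accepts both spellings follows. The helper `extract_offer_id` and the id 'abc' are invented for illustration.

```python
import pandas as pd

def extract_offer_id(value):
    """Return the offer id whether the key is 'offer id' or 'offer_id'."""
    return value.get('offer id', value.get('offer_id'))

toy_offer = pd.DataFrame({
    'event': ['offer received', 'offer completed'],
    'value': [{'offer id': 'abc'}, {'offer_id': 'abc'}],
})
toy_offer['offer_id'] = toy_offer['value'].apply(extract_offer_id)
print(toy_offer[['event', 'offer_id']])
```

Looking up only one of the two keys would leave the other group of records with a missing offer id.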

Part A

In [148]:
transcript.rename(columns = {'person':'customer_id'}, inplace=True)

Part B

In [235]:
transactions = transcript[transcript['event'] == 'transaction'].copy()

In [250]:
offer = transcript[transcript['event'] != 'transaction'].copy()

In [182]:
transactions.shape[0] + offer.shape[0] == transcript.shape[0]

True

In [183]:
transactions.shape[1] == transcript.shape[1]

True

In [184]:
offer.shape[1] == transcript.shape[1]

True

Part C

First, the transactions dataframe.

In [236]:
transactions.head()

Unnamed: 0,customer_id,event,value,time
12654,02c083884c7d45b39cc68e1314fec56c,transaction,{'amount': 0.8300000000000001},0
12657,9fa9ae8f57894cc9a3b8a9bbe0fc1b2f,transaction,{'amount': 34.56},0
12659,54890f68699049c2a04d415abc25e717,transaction,{'amount': 13.23},0
12670,b2f1cd155b864803ad8334cdf13c4bd2,transaction,{'amount': 19.51},0
12671,fe97aa22dd3e48c8b143116a8403dd52,transaction,{'amount': 18.97},0


In [237]:
amount = transactions['value'].apply(lambda x: float(x.get('amount')))

In [238]:
transactions.drop(labels=['event','value'], axis=1, inplace=True)

In [239]:
transactions = pd.concat([transactions, amount],axis=1)

In [240]:
transactions.rename(columns = {'value':'amount'}, inplace=True)

In [241]:
transactions.head()

Unnamed: 0,customer_id,time,amount
12654,02c083884c7d45b39cc68e1314fec56c,0,0.83
12657,9fa9ae8f57894cc9a3b8a9bbe0fc1b2f,0,34.56
12659,54890f68699049c2a04d415abc25e717,0,13.23
12670,b2f1cd155b864803ad8334cdf13c4bd2,0,19.51
12671,fe97aa22dd3e48c8b143116a8403dd52,0,18.97
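
The extract, drop, concat, and rename sequence above can be shortened by assigning the extracted values directly to a new column. A sketch on a toy frame with invented customer ids:

```python
import pandas as pd

toy_tx = pd.DataFrame({
    'customer_id': ['c1', 'c2'],
    'event': ['transaction', 'transaction'],
    'value': [{'amount': 0.83}, {'amount': 34.56}],
    'time': [0, 0],
})

# Pull the amount out of each dict, then drop the columns no longer needed
toy_tx['amount'] = toy_tx['value'].apply(lambda v: float(v['amount']))
toy_tx = toy_tx.drop(columns=['event', 'value'])
print(toy_tx)
```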


Then the offer dataframe

In [251]:
offer.head()

Unnamed: 0,customer_id,event,value,time
0,78afa995795e4d85b5d9ceeca43f5fef,offer received,{'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'},0
1,a03223e636434f42ac4c3df47e8bac43,offer received,{'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'},0
2,e2127556f4f64592b11af22de27a7932,offer received,{'offer id': '2906b810c7d4411798c6938adc9daaa5'},0
3,8ec6ce2a7e7949b1bf142def7d0e0586,offer received,{'offer id': 'fafdcd668e3743c1bb461111dcafc2a4'},0
4,68617ca6246f4fbc85e91a2a49552598,offer received,{'offer id': '4d5c57ea9a6940dd891ad53e9dbe8da0'},0


In [252]:
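# 'offer completed' records use the key 'offer_id' instead of 'offer id', so check both
offer_id = offer['value'].apply(lambda x: x.get('offer id', x.get('offer_id')))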

In [253]:
offer.drop(labels='value', axis=1, inplace=True)

In [254]:
offer = pd.concat([offer, offer_id], axis=1)

In [255]:
offer.rename(columns = {'value':'offer_id'}, inplace=True)

In [256]:
offer.head()

Unnamed: 0,customer_id,event,time,offer_id
0,78afa995795e4d85b5d9ceeca43f5fef,offer received,0,9b98b8c7a33c4b65b9aebfe6a799e6d9
1,a03223e636434f42ac4c3df47e8bac43,offer received,0,0b1e1539f2cc45b7b9fa7c272da2e1d7
2,e2127556f4f64592b11af22de27a7932,offer received,0,2906b810c7d4411798c6938adc9daaa5
3,8ec6ce2a7e7949b1bf142def7d0e0586,offer received,0,fafdcd668e3743c1bb461111dcafc2a4
4,68617ca6246f4fbc85e91a2a49552598,offer received,0,4d5c57ea9a6940dd891ad53e9dbe8da0


Part D

In [257]:
dummies = pd.DataFrame(pd.get_dummies(offer['event']))

In [258]:
offer = pd.concat([offer, dummies], axis=1)

In [259]:
offer.drop(labels='event', axis=1, inplace=True)

In [260]:
offer.head()

Unnamed: 0,customer_id,time,offer_id,offer completed,offer received,offer viewed
0,78afa995795e4d85b5d9ceeca43f5fef,0,9b98b8c7a33c4b65b9aebfe6a799e6d9,0,1,0
1,a03223e636434f42ac4c3df47e8bac43,0,0b1e1539f2cc45b7b9fa7c272da2e1d7,0,1,0
2,e2127556f4f64592b11af22de27a7932,0,2906b810c7d4411798c6938adc9daaa5,0,1,0
3,8ec6ce2a7e7949b1bf142def7d0e0586,0,fafdcd668e3743c1bb461111dcafc2a4,0,1,0
4,68617ca6246f4fbc85e91a2a49552598,0,4d5c57ea9a6940dd891ad53e9dbe8da0,0,1,0
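
With the events encoded as 0/1 columns, one possible next step (an assumption about where the analysis might go, not something done in this notebook) is to count how often each customer received, viewed, and completed each offer. The sketch below uses a toy frame with invented ids and passes `dtype=int` so the dummies are integers regardless of the pandas version.

```python
import pandas as pd

toy_offer = pd.DataFrame({
    'customer_id': ['c1', 'c1', 'c1', 'c2'],
    'offer_id': ['o1', 'o1', 'o1', 'o1'],
    'event': ['offer received', 'offer viewed', 'offer completed', 'offer received'],
})

dummies = pd.get_dummies(toy_offer['event'], dtype=int)
toy_offer = pd.concat([toy_offer.drop(columns='event'), dummies], axis=1)

# One row per (customer, offer) with the number of events of each kind
counts = toy_offer.groupby(['customer_id', 'offer_id']).sum()
print(counts)
```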


## 4. Modeling

## 5. Evaluation

## 6. Deployment