In [1]:
#Import packages
import pandas as pd
import numpy as np
import math
import json
from pandas_profiling import ProfileReport

In [12]:
# read in the json files

# Portfolio dataset
# Contains offer ids and information about each offer - duration, type, etc.
# id (string) - offer id
# offer_type (string) - type of offer ie BOGO, discount, informational
# difficulty (int) - minimum required spend to complete an offer
# reward (int) - reward given for completing an offer
# duration (int) - time for offer to be open, in days
# channels (list of strings)
portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)

# Customer profile dataset
# Demographic data on each customer
# age (int) - age of the customer
# became_member_on (int) - date when customer created an app account
# gender (str) - gender of the customer (note some entries contain 'O' for other rather than M or F)
# id (str) - customer id
# income (float) - customer's income
profile = pd.read_json('data/profile.json', orient='records', lines=True)

# Transcript dataset
# Records for transactions, offers received, offers viewed and offers completed
# event (str) - record description (ie transaction, offer received, offer viewed, etc.)
# person (str) - customer id
# time (int) - time in hours since start of test. The data begins at time t=0
# value - (dict of strings) - either an offer id or transaction amount depending on the record
transcript = pd.read_json('data/transcript.json', orient='records', lines=True)


In [14]:
pf_portfolio = ProfileReport(portfolio, title='Portfolio Profiling Report')

In [15]:
pf_portfolio.to_notebook_iframe()

Summarize dataset: 100%|██████████| 20/20 [00:02<00:00,  7.13it/s, Completed]
Generate report structure: 100%|██████████| 1/1 [00:03<00:00,  3.04s/it]
Render HTML: 100%|██████████| 1/1 [00:00<00:00,  2.36it/s]


In [16]:
pf_profile = ProfileReport(profile, title='Customer Profile Profiling Report')

In [17]:
pf_profile.to_notebook_iframe()

Summarize dataset: 100%|██████████| 19/19 [00:03<00:00,  5.06it/s, Completed]
Generate report structure: 100%|██████████| 1/1 [00:02<00:00,  2.97s/it]
Render HTML: 100%|██████████| 1/1 [00:00<00:00,  1.26it/s]


Notes:

1. `gender` has 12.8% missing values. If we use this feature, we will have to deal with them. The values are possibly not missing at random, so it is worth investigating why gender would be missing.

2. `age` tails off approaching 100, but there is an unusual spike: 12.8% of values have an age of 118. This matches the 12.8% missing for `gender`, suggesting that 118 has been used as a stand-in value when age is missing.

3. `became_member_on` is the date in yyyymmdd numerical format. We may want to plot this as a time series if we are interested in the rate at which new customers are acquired. Dates range from July 2013 to July 2018.

4. `income` has the same 12.8% missing.
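
To make note 3 concrete, here is a minimal sketch of parsing the yyyymmdd integers and counting sign-ups per month. `toy_profile` and its four dates are made-up stand-ins for illustration, not values taken from `profile.json`.

```python
import pandas as pd

# Hypothetical stand-in for the profile table (the real one comes from data/profile.json).
toy_profile = pd.DataFrame({"became_member_on": [20130729, 20170415, 20170420, 20180726]})

# Parse the yyyymmdd integers into proper timestamps.
joined = pd.to_datetime(toy_profile["became_member_on"].astype(str), format="%Y%m%d")

# Count new members per calendar month, in chronological order.
signups_per_month = joined.dt.to_period("M").value_counts().sort_index()
print(signups_per_month)
```

On the real data the same two steps could be applied to `profile['became_member_on']`, and the resulting series plotted to see how the sign-up rate changes over time.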

For the 12.8% of customers with missing gender, age and income, it would be interesting to see whether they appear in the transcript. We have no demographic information on these customers, so they will need careful handling if we want to include them.
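
A rough sketch of how both checks might be done: whether the three "missing" conditions select the same rows, and whether those customers show up in the transcript. The toy frames below are invented for illustration; only the column names (`id`, `gender`, `age`, `income`, `person`, `event`) come from the datasets described above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the profile and transcript tables.
toy_profile = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "gender": ["F", None, "M", None],
    "age": [55, 118, 40, 118],
    "income": [72000.0, np.nan, 50000.0, np.nan],
})
toy_transcript = pd.DataFrame({
    "person": ["a", "b", "b", "c"],
    "event": ["offer received", "offer received", "transaction", "transaction"],
})

gender_missing = toy_profile["gender"].isnull()
income_missing = toy_profile["income"].isnull()
age_placeholder = toy_profile["age"] == 118  # 118 assumed to be the stand-in for a missing age

# Do the three conditions pick out exactly the same customers?
same_rows = ((gender_missing == income_missing) & (gender_missing == age_placeholder)).all()

# Customers with no demographic information at all.
unknown_ids = toy_profile.loc[gender_missing & income_missing & age_placeholder, "id"]

# Which of them appear in the transcript, and what share of events do they account for?
seen_in_transcript = unknown_ids.isin(toy_transcript["person"])
share_of_events = toy_transcript["person"].isin(unknown_ids).mean()
```

If `same_rows` also holds on the real data, the 118 ages can be replaced with `NaN` and the three columns treated as a single missingness pattern.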

In [18]:
pf_transcript = ProfileReport(transcript, title='Transcript Profiling Report')

In [19]:
pf_transcript.to_notebook_iframe()

Summarize dataset: 100%|██████████| 18/18 [01:52<00:00,  6.26s/it, Completed]
Generate report structure: 100%|██████████| 1/1 [00:02<00:00,  2.36s/it]
Render HTML: 100%|██████████| 1/1 [00:00<00:00,  3.50it/s]


Reflections

In [41]:
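# Expand the dicts in `value` into their own columns and attach them to the remaining transcript columns
transcript1 = pd.concat([transcript.drop(['value'], axis=1), pd.DataFrame(transcript['value'].tolist())], axis=1)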

In [53]:
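# The expanded `value` dicts yield both an 'offer id' and an 'offer_id' column; merge them into 'offer_id'.
# Guarding the whole step keeps the cell safe to re-run once 'offer id' has been dropped.
if 'offer id' in transcript1.columns:
    transcript1['offer_id'] = transcript1['offer_id'].fillna(transcript1['offer id'])
    transcript1 = transcript1.drop(columns=['offer id'])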

In [43]:
pf_transcript1 = ProfileReport(transcript1, title='Expanded Transcript Profiling Report')

In [44]:
pf_transcript1.to_notebook_iframe()

Summarize dataset: 100%|██████████| 21/21 [00:30<00:00,  1.44s/it, Completed]
Generate report structure: 100%|██████████| 1/1 [00:04<00:00,  4.19s/it]
Render HTML: 100%|██████████| 1/1 [00:00<00:00,  1.54it/s]


In [57]:
transcript1[transcript1['person'] == transcript1['person'].unique()[50]]

Unnamed: 0,person,event,time,amount,offer_id,reward
50,fb98ebe9e3e14afaa5b5db182b00b7ec,offer received,0,,2906b810c7d4411798c6938adc9daaa5,
45437,fb98ebe9e3e14afaa5b5db182b00b7ec,offer viewed,120,,2906b810c7d4411798c6938adc9daaa5,
53227,fb98ebe9e3e14afaa5b5db182b00b7ec,offer received,168,,fafdcd668e3743c1bb461111dcafc2a4,
88893,fb98ebe9e3e14afaa5b5db182b00b7ec,offer viewed,228,,fafdcd668e3743c1bb461111dcafc2a4,
110880,fb98ebe9e3e14afaa5b5db182b00b7ec,offer received,336,,fafdcd668e3743c1bb461111dcafc2a4,
123554,fb98ebe9e3e14afaa5b5db182b00b7ec,offer viewed,336,,fafdcd668e3743c1bb461111dcafc2a4,
150650,fb98ebe9e3e14afaa5b5db182b00b7ec,offer received,408,,2298d6c36e964ae4a3e7e9706d1fb8c2,
171224,fb98ebe9e3e14afaa5b5db182b00b7ec,offer viewed,420,,2298d6c36e964ae4a3e7e9706d1fb8c2,
201622,fb98ebe9e3e14afaa5b5db182b00b7ec,offer received,504,,3f207df678b143eea3cee63160fa8bed,
232885,fb98ebe9e3e14afaa5b5db182b00b7ec,transaction,540,33.38,,


Structure of data

The events of a successful offer appear to occur in this order:

1. Offer received

2. Offer viewed

3. Transaction

4. Offer completed

Idea: To know whether a customer will take an offer, we need a single record per person and offer that captures the trail of events.
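
One possible way to build such a record is to pivot the offer events so that each (person, offer) pair becomes a row holding the first time each event occurred. This is only a sketch on invented toy data, and it collapses repeat sends: the sample customer above receives the same offer twice, so a real version would need an extra key for each send.

```python
import pandas as pd

# Hypothetical events in the same shape as transcript1 (person, event, time, offer_id).
toy_events = pd.DataFrame({
    "person": ["p1", "p1", "p1", "p1", "p2", "p2", "p3"],
    "event": ["offer received", "offer viewed", "transaction", "offer completed",
              "offer received", "offer completed", "offer received"],
    "time": [0, 6, 12, 12, 0, 30, 0],
    "offer_id": ["o1", "o1", None, "o1", "o2", "o2", "o1"],
})

# Transactions carry no offer id, so only offer events can be keyed by (person, offer_id).
offer_events = toy_events.dropna(subset=["offer_id"])

# One row per (person, offer); one column per event type holding its earliest time.
trail = offer_events.pivot_table(
    index=["person", "offer_id"], columns="event", values="time", aggfunc="min"
)

# Put the columns in the order the events are expected to happen.
trail = trail.reindex(columns=["offer received", "offer viewed", "offer completed"])

# Flag whether the offer was ever completed.
trail["completed"] = trail["offer completed"].notnull()
print(trail)
```

A missing time (`NaN`) in a column means that event never happened for that person and offer.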

Notes:
- Transaction and offer completed occur at the same time

- Some transactions happen without any offer - do we care about these?
- Someone can redeem two offers at the same time
- Someone can complete a received offer without viewing it
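
The notes above could be checked with a couple of merges. The sketch below uses invented toy data and makes two simplifying assumptions: a transaction is linked to a completion when person and time match exactly, and "viewed" means viewed at any time, not necessarily before completion.

```python
import pandas as pd

# Hypothetical events in the same shape as transcript1.
toy_events = pd.DataFrame({
    "person": ["p1", "p1", "p1", "p1", "p2", "p2", "p2", "p3"],
    "event": ["offer received", "offer viewed", "transaction", "offer completed",
              "offer received", "transaction", "offer completed", "transaction"],
    "time": [0, 6, 12, 12, 0, 30, 30, 18],
    "offer_id": ["o1", "o1", None, "o1", "o2", None, "o2", None],
})

transactions = toy_events.loc[toy_events["event"] == "transaction", ["person", "time"]]
completions = toy_events.loc[toy_events["event"] == "offer completed", ["person", "time", "offer_id"]]
views = toy_events.loc[toy_events["event"] == "offer viewed", ["person", "offer_id"]]

# Link each transaction to any completion by the same person at the same time.
matched = transactions.merge(completions, on=["person", "time"], how="left")

# Transactions with no completion attached happened without completing an offer.
transactions_without_offer = matched[matched["offer_id"].isnull()]

# Completions that have no matching view for the same person and offer.
completed_vs_viewed = completions.merge(
    views.drop_duplicates(), on=["person", "offer_id"], how="left", indicator=True
)
completed_unviewed = completed_vs_viewed[completed_vs_viewed["_merge"] == "left_only"]
```

A transaction that completes two offers at once would appear twice in `matched`, once per offer, which is one way to spot the double redemptions mentioned above.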