<h1 style="color: blue;">Starbucks Project Overview</h1>

<h2 style="color: blue;">Starbucks Capstone Challenge</h2>

### Instructions for the project can be found in the Starbucks Project Workspace.

#### Dataset overview

- The program used to create the data simulates how people make purchasing decisions and how those decisions are influenced by promotional offers.
- Each person in the simulation has some hidden traits that influence their purchasing patterns and are associated with their observable traits. People produce various events, including receiving offers, opening offers, and making purchases.
- As a simplification, there are no explicit products to track. Only the amounts of each transaction or offer are recorded.
- There are three types of offers that can be sent: buy-one-get-one (BOGO), discount, and informational. In a BOGO offer, a user needs to spend a certain amount to get a reward equal to that threshold amount. In a discount, a user gains a reward equal to a fraction of the amount spent. In an informational offer, there is no reward, but neither is there a requisite amount that the user is expected to spend. Offers can be delivered via multiple channels.
- The basic task is to use the data to identify which groups of people are most responsive to each type of offer, and how best to present each type of offer.
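
The offer rules above can be written as a small function. This is only an illustrative sketch: `offer_reward` and its arguments are hypothetical names, and it assumes the reward is a fixed amount paid once the spend reaches the offer's `difficulty`, matching the `reward` and `difficulty` fields in the data dictionary.

```python
def offer_reward(offer_type: str, difficulty: float, reward: float, spent: float) -> float:
    """Return the reward earned for `spent` under the simplified rule (hypothetical helper)."""
    if offer_type == "informational":
        return 0.0  # informational offers carry no reward and no spend threshold
    if offer_type not in ("bogo", "discount"):
        raise ValueError(f"unknown offer type: {offer_type}")
    # Assumption: BOGO and discount offers pay out only once the threshold is met.
    return float(reward) if spent >= difficulty else 0.0


print(offer_reward("bogo", difficulty=10, reward=10, spent=12.5))
print(offer_reward("discount", difficulty=20, reward=5, spent=12.5))
```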

#### Data Dictionary
##### profile.json
Rewards program users (17000 users x 5 fields)
- gender: (categorical) M, F, O, or null
- age: (numeric) missing value encoded as 118
- id: (string/hash)
- became_member_on: (date) format YYYYMMDD
- income: (numeric)
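
Two of these fields need care before analysis: `age` uses 118 as a missing-value sentinel, and `became_member_on` is an integer in YYYYMMDD form. The sketch below shows one possible way to handle both on toy rows. The `toy_profile` frame is an assumption for illustration, not the notebook's own cleaning step.

```python
import numpy as np
import pandas as pd

# Toy rows shaped like profile.json (illustrative values only).
toy_profile = pd.DataFrame({
    "gender": [None, "F", "M"],
    "age": [118, 55, 61],
    "became_member_on": [20170212, 20170715, 20180713],
    "income": [np.nan, 112000.0, 72000.0],
})

# Treat the sentinel age 118 as missing rather than as a real age.
toy_profile["age"] = toy_profile["age"].mask(toy_profile["age"] == 118)

# Parse the YYYYMMDD integers into timestamps.
toy_profile["became_member_on"] = pd.to_datetime(
    toy_profile["became_member_on"].astype(str), format="%Y%m%d"
)
print(toy_profile.dtypes)
```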

##### portfolio.json
Offers sent during 30-day test period (10 offers x 6 fields)

- reward: (numeric) money awarded for the amount spent
- channels: (list) web, email, mobile, social
- difficulty: (numeric) money required to be spent to receive reward
- duration: (numeric) time for offer to be open, in days
- offer_type: (string) bogo, discount, informational
- id: (string/hash)

##### transcript.json
Event log (306534 events x 4 fields)

- person: (string/hash)
- event: (string) offer received, offer viewed, transaction, offer completed
- value: (dictionary) different values depending on event type
    - offer id: (string/hash) not associated with any "transaction"
    - amount: (numeric) money spent in "transaction"
    - reward: (numeric) money gained from "offer completed"
- time: (numeric) hours after start of test
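
Note that `duration` in portfolio.json is given in days while `time` here is given in hours, so comparing the two requires a unit conversion. A minimal sketch follows. The helper names are hypothetical, and treating the window as inclusive at both ends is an assumption.

```python
from typing import Tuple


def offer_window(received_hour: int, duration_days: int) -> Tuple[int, int]:
    """Return (start, end) of an offer's validity window, both in hours."""
    return received_hour, received_hour + duration_days * 24


def within_window(event_hour: int, received_hour: int, duration_days: int) -> bool:
    """True if the event falls inside the window (assumed inclusive on both ends)."""
    start, end = offer_window(received_hour, duration_days)
    return start <= event_hour <= end


# A 3-day offer received at hour 168 and viewed at hour 192.
print(offer_window(168, 3), within_window(192, 168, 3))
```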

#### Import libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
import seaborn as sns
import calendar

#### Load the datasets

In [2]:
portfolio = pd.read_json('data/portfolio.json', lines=True)
profile = pd.read_json('data/profile.json', lines=True)
transcript = pd.read_json('data/transcript.json', lines=True)

#### Get the unique channels

In [3]:
unique_channels = list(set(portfolio.channels.explode()))
unique_channels

['email', 'social', 'mobile', 'web']

#### Clean and structure the portfolio dataset

In [4]:
# Turn the list-valued channels column into one boolean column per channel
portfolio[unique_channels] = list(map(lambda x: np.isin(unique_channels, x), portfolio.channels))
portfolio.drop(columns='channels', inplace=True)
portfolio # look at the portfolio

| | reward | difficulty | duration | offer_type | id | email | social | mobile | web |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 10 | 10 | 7 | bogo | ae264e3637204a6fb9bb56bc8210ddfd | True | True | True | False |
| 1 | 10 | 10 | 5 | bogo | 4d5c57ea9a6940dd891ad53e9dbe8da0 | True | True | True | True |
| 2 | 0 | 0 | 4 | informational | 3f207df678b143eea3cee63160fa8bed | True | False | True | True |
| 3 | 5 | 5 | 7 | bogo | 9b98b8c7a33c4b65b9aebfe6a799e6d9 | True | False | True | True |
| 4 | 5 | 20 | 10 | discount | 0b1e1539f2cc45b7b9fa7c272da2e1d7 | True | False | False | True |
| 5 | 3 | 7 | 7 | discount | 2298d6c36e964ae4a3e7e9706d1fb8c2 | True | True | True | True |
| 6 | 2 | 10 | 10 | discount | fafdcd668e3743c1bb461111dcafc2a4 | True | True | True | True |
| 7 | 0 | 0 | 3 | informational | 5a8bc65990b245e5a138643cd4eb9837 | True | True | True | False |
| 8 | 5 | 5 | 5 | bogo | f19421c1d4aa40978ebb69ca19b0e20d | True | True | True | True |
| 9 | 2 | 10 | 7 | discount | 2906b810c7d4411798c6938adc9daaa5 | True | False | True | True |
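
Because `unique_channels` is built from a `set`, the order of the indicator columns can differ between runs. One possible alternative that yields a stable, sorted column order is sketched below on toy data. The `toy_portfolio` frame is an assumption for illustration, not the notebook's own method.

```python
import pandas as pd

# Toy stand-in for the list-valued portfolio.channels column.
toy_portfolio = pd.DataFrame({
    "offer_type": ["bogo", "informational", "discount"],
    "channels": [["email", "mobile", "social"], ["web", "email", "mobile"], ["web", "email"]],
})

# One row per (offer, channel), then one indicator column per channel,
# collapsed back to one row per offer.
indicators = (
    pd.get_dummies(toy_portfolio["channels"].explode())
    .groupby(level=0)
    .max()
    .astype(bool)
)

# Columns come out sorted by channel name, independent of set iteration order.
toy_portfolio = toy_portfolio.drop(columns="channels").join(indicators)
print(toy_portfolio)
```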


#### Clean and Structure the transcript file

In [5]:
transcript_value = pd.json_normalize(transcript.value)
transcript[transcript_value.columns] = transcript_value
transcript.drop(columns='value', inplace=True)
transcript

| | person | event | time | offer id | amount | offer_id | reward |
|---|---|---|---|---|---|---|---|
| 0 | 78afa995795e4d85b5d9ceeca43f5fef | offer received | 0 | 9b98b8c7a33c4b65b9aebfe6a799e6d9 | | | |
| 1 | a03223e636434f42ac4c3df47e8bac43 | offer received | 0 | 0b1e1539f2cc45b7b9fa7c272da2e1d7 | | | |
| 2 | e2127556f4f64592b11af22de27a7932 | offer received | 0 | 2906b810c7d4411798c6938adc9daaa5 | | | |
| 3 | 8ec6ce2a7e7949b1bf142def7d0e0586 | offer received | 0 | fafdcd668e3743c1bb461111dcafc2a4 | | | |
| 4 | 68617ca6246f4fbc85e91a2a49552598 | offer received | 0 | 4d5c57ea9a6940dd891ad53e9dbe8da0 | | | |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 306529 | b3a1272bc9904337b331bf348c3e8c17 | transaction | 714 | | 1.59 | | |
| 306530 | 68213b08d99a4ae1b0dcb72aebd9aa35 | transaction | 714 | | 9.53 | | |
| 306531 | a00058cf10334a308c68e7631c529907 | transaction | 714 | | 3.61 | | |
| 306532 | 76ddbd6576844afe811f1a3c0fbb5bec | transaction | 714 | | 3.53 | | |
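
The preview shows two separate columns, `offer id` and `offer_id`, which suggests that the `value` dictionaries spell the same key in two ways. A minimal sketch of merging them into one column is shown below on toy data. It assumes the two keys never both appear in the same event, and `offer` is a hypothetical column name.

```python
import pandas as pd

# Toy events using both key spellings (identifiers shortened for illustration).
toy_events = pd.DataFrame({
    "event": ["offer received", "offer completed", "transaction"],
    "value": [{"offer id": "abc"}, {"offer_id": "abc", "reward": 5}, {"amount": 9.53}],
})

values = pd.json_normalize(toy_events["value"].tolist())

# Prefer "offer id"; fall back to "offer_id" where the first is missing.
toy_events["offer"] = values["offer id"].combine_first(values["offer_id"])
print(toy_events[["event", "offer"]])
```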


In [6]:
profile

| | gender | age | id | became_member_on | income |
|---|---|---|---|---|---|
| 0 | | 118 | 68be06ca386d4c31939f3a4f0e3dd783 | 20170212 | |
| 1 | F | 55 | 0610b486422d4921ae7d2bf64640c50b | 20170715 | 112000.0 |
| 2 | | 118 | 38fe809add3b4fcf9315a9694bb96ff5 | 20180712 | |
| 3 | F | 75 | 78afa995795e4d85b5d9ceeca43f5fef | 20170509 | 100000.0 |
| 4 | | 118 | a03223e636434f42ac4c3df47e8bac43 | 20170804 | |
| ... | ... | ... | ... | ... | ... |
| 16995 | F | 45 | 6d5f3a774f3d4714ab0c092238f3a1d7 | 20180604 | 54000.0 |
| 16996 | M | 61 | 2cb4f97358b841b9a9773a7aa05a9d77 | 20180713 | 72000.0 |
| 16997 | M | 49 | 01d26f638c274aa0b965d24cefe3183f | 20170126 | 73000.0 |
| 16998 | F | 83 | 9dc1421481194dcd9400aec7c9ae6366 | 20160307 | 50000.0 |


#### Merge the transcript with the profile and portfolio datasets

In [7]:
df = transcript.merge(profile, left_on='person', right_on='id', how='left')
df = df.merge(portfolio, left_on='offer id', right_on='id', how='left')
df = df.drop(columns=['reward_x', 'id_y', 'id_x']).rename(columns={'reward_y':'reward'})
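
A left merge keeps the number of events unchanged only if the join key is unique on the right-hand side. One way to guard against accidental row duplication is the `validate` argument of `DataFrame.merge`, sketched here on toy frames. The toy names and values are assumptions for illustration.

```python
import pandas as pd

toy_transcript = pd.DataFrame({"person": ["p1", "p1", "p2"], "time": [0, 6, 0]})
toy_profile = pd.DataFrame({"id": ["p1", "p2"], "age": [55, 75]})

# validate="many_to_one" raises a MergeError if `id` is not unique in
# toy_profile, so the left merge cannot silently duplicate events.
merged = toy_transcript.merge(
    toy_profile, left_on="person", right_on="id", how="left", validate="many_to_one"
)

# The row count should match the event log exactly.
print(len(merged) == len(toy_transcript))
```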

In [8]:
df = df.sort_values(by=['person', 'time'])
df.drop(columns='offer_id', inplace=True)

In [9]:
%%time
df = pd.concat([df[df.person==p].ffill() for p in df.person.unique()])
df = df.sort_values(['person', 'time', 'offer id'])



CPU times: user 14min 47s, sys: 118 ms, total: 14min 47s
Wall time: 14min 47s
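
The per-person loop above filters the full frame once for every person, which likely accounts for most of the runtime. A grouped forward-fill is a possible alternative that may be considerably faster. The sketch below compares the two on toy data. It is not the notebook's own code, and `toy`, `looped`, and `grouped` are illustrative names.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "person": ["a", "a", "a", "b", "b"],
    "time": [0, 6, 12, 0, 6],
    "offer_type": ["bogo", np.nan, np.nan, np.nan, "discount"],
})

# Loop version, as in the cell above: forward-fill within each person's rows.
looped = pd.concat([toy[toy.person == p].ffill() for p in toy.person.unique()])

# Grouped version: a single pass in which values never leak across persons.
grouped = toy.copy()
fill_cols = [c for c in grouped.columns if c != "person"]
grouped[fill_cols] = grouped.groupby("person")[fill_cols].ffill()

print(looped.equals(grouped))
```

On the full dataset, the frame would need to be sorted by `person` and `time` first, as the notebook already does.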


#### Derive membership and channel features, then drop duplicate rows

In [10]:
# Split the YYYYMMDD integer into [year, month, day]
decomposition = df.became_member_on.apply(lambda x: [int(str(x)[:4]), int(str(x)[4:6]), int(str(x)[6:])])
year = decomposition.apply(lambda x: x[0])
month = decomposition.apply(lambda x: x[1])
#df['sin_month'] = df.month.apply(lambda x: np.sin(2*np.pi*x/12))
day = decomposition.apply(lambda x: x[2])
#df['sin_day'] = decomposition.apply(lambda x: np.sin(2*np.pi*x[2]/calendar.monthrange(x[0], x[1])[1]))
# Reference date built from the smallest year, month, and day taken independently;
# it is not necessarily an actual membership date
min_date = datetime(year.min(), month.min(), day.min())
df['days_since_membership'] = decomposition.apply(lambda x:(datetime(x[0], x[1], x[2]) - min_date).days)
df.drop(columns='became_member_on', inplace=True)
# Number of channels through which the offer was sent
df['offered_channels_count'] = df[['email', 'web', 'mobile', 'social']].sum(axis=1)
df.set_index('offer id', inplace=True)
print('Before Dropping duplicates', df.shape) # basic dataset shape
# drop_duplicates compares column values only; the 'offer id' index is ignored
df = df.drop_duplicates()
print('After Dropping Duplicates', df.shape) # after dropping duplicates

Before Dropping duplicates (306534, 17)
After Dropping Duplicates (303572, 17)
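
The reference date in the cell above combines the smallest year, month, and day independently, so the resulting day counts are offsets from that composite date. If tenure relative to the earliest actual sign-up is wanted instead, `pd.to_datetime` offers one possible route. The sketch below uses toy values and would produce different numbers from those in the notebook.

```python
import pandas as pd

# Toy YYYYMMDD values (illustrative only).
became_member_on = pd.Series([20170212, 20170715, 20160307])

joined = pd.to_datetime(became_member_on.astype(str), format="%Y%m%d")

# Days elapsed since the earliest sign-up date present in the data.
days_since_first_signup = (joined - joined.min()).dt.days
print(days_since_first_signup.tolist())  # [342, 495, 0]
```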


In [11]:
df.head(2)

| offer id | person | event | time | amount | gender | age | income | reward | difficulty | duration | offer_type | email | social | mobile | web | days_since_membership | offered_channels_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5a8bc65990b245e5a138643cd4eb9837 | 0009655768c64bdeb2e877511632db8f | offer received | 168 | | M | 33 | 72000.0 | 0.0 | 0.0 | 3.0 | informational | True | True | True | False | 1571 | 3 |
| 5a8bc65990b245e5a138643cd4eb9837 | 0009655768c64bdeb2e877511632db8f | offer viewed | 192 | | M | 33 | 72000.0 | 0.0 | 0.0 | 3.0 | informational | True | True | True | False | 1571 | 3 |


In [12]:
df.to_csv('dataset.csv', index=True)
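
Since the file is written with `offer id` as the index, reading it back needs `index_col` to restore the same layout. A minimal round-trip sketch follows. It uses an in-memory buffer and toy data in place of `dataset.csv`, both of which are assumptions for illustration.

```python
import io

import pandas as pd

toy = pd.DataFrame(
    {"person": ["p1", "p2"], "time": [0, 6]},
    index=pd.Index(["offer_a", "offer_b"], name="offer id"),
)

buffer = io.StringIO()
toy.to_csv(buffer, index=True)  # same call shape as the cell above
buffer.seek(0)

# index_col restores "offer id" as the index instead of a regular column.
restored = pd.read_csv(buffer, index_col="offer id")
print(restored)
```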