# AIMS

## Introduction:

Once every few days, Starbucks sends out an offer to users of its mobile app. An offer can be a simple advertisement for a drink or an actual offer such as a discount or BOGO (buy one, get one free). Some users might not receive any offer during certain weeks, and not all users receive the same offer. That is the challenge to solve with this data set.


## Goal:

The data set contains simulated data that mimics customer behavior on the Starbucks rewards mobile app.
This data set is a simplified version of the real Starbucks app because the underlying simulator only has one product whereas Starbucks actually sells dozens of products.
The main goal is to combine transaction, demographic and offer data to determine which demographic groups respond best to which offer type.
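
As a rough illustration of the goal above, the three sources can be joined on the customer and offer identifiers. This is a minimal sketch on invented toy data, not the notebook's actual pipeline. The `*_toy` frames and their values are hypothetical, and the `offer_id` column is assumed to have already been extracted from the transcript.

```python
import pandas as pd

# Toy stand-ins for the three files; the values are invented for illustration.
profile_toy = pd.DataFrame({
    "id": ["p1", "p2"],
    "gender": ["F", "M"],
    "income": [72000.0, 51000.0],
})
portfolio_toy = pd.DataFrame({
    "id": ["o1", "o2"],
    "offer_type": ["bogo", "discount"],
})
events_toy = pd.DataFrame({
    "person": ["p1", "p1", "p2"],
    "offer_id": ["o1", "o2", "o1"],  # assumed to be already extracted
    "event": ["offer completed", "offer received", "offer completed"],
})

# Attach demographics first, then offer metadata, to every event row.
combined = (
    events_toy
    .merge(profile_toy, left_on="person", right_on="id", how="left")
    .drop(columns="id")
    .merge(portfolio_toy, left_on="offer_id", right_on="id", how="left")
    .drop(columns="id")
)

# Count completed offers per gender and offer type.
completed_toy = combined[combined["event"] == "offer completed"]
summary = completed_toy.groupby(["gender", "offer_type"]).size()
print(summary)
```

A left join keeps every event row even when a customer or offer has no match, which makes unmatched records visible as missing values instead of silently dropping them.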


## Data Sets:
The data is contained in three files:

- portfolio.json - offer ids and metadata about each offer (duration, type, etc.)
- profile.json - demographic data for each customer
- transcript.json - records for transactions, offers received, offers viewed, and offers completed

## Problem Statement:
We will explore the Starbucks dataset, which simulates how people make purchasing decisions and how those decisions are influenced by promotional offers.

There are three offer types that can be sent: buy-one-get-one (BOGO), discount, and informational.

We will segment the customers on different parameters and examine how each segment behaves toward the different offer types, using both supervised and unsupervised learning.

We will analyse the data in the Exploratory Data Analysis part of this section and answer the following questions about customer segmentation and buying behaviour:

- What is the gender distribution of Starbucks customers?
- What is the age distribution of Starbucks customers?
- What is the income distribution of Starbucks customers?
- How many customers enrolled each year?
- Which gender has the highest yearly membership?
- Which gender has the highest annual income?
- What is the distribution of events in the transcript?
- What percentage of events are transactions and what percentage are offers?
- What is the income distribution for the offer events?
- How are offer types distributed across age, gender and income groups?
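
Several of the questions above compare groups of customers, which requires turning continuous columns such as age and income into categories. One possible way to do this is sketched below with `pd.cut` and `pd.qcut`. The bin edges, labels, and the `customers_toy` frame are hypothetical choices for illustration, not the segmentation used later in the analysis.

```python
import pandas as pd

# Toy customers; values are invented for illustration.
customers_toy = pd.DataFrame({
    "age": [22, 40, 55, 75],
    "income": [35000.0, 64000.0, 112000.0, 100000.0],
})

# Hypothetical age bands; intervals are closed on the right, e.g. (17, 34].
age_bins = [17, 34, 54, 74, 101]
age_labels = ["18-34", "35-54", "55-74", "75+"]
customers_toy["age_group"] = pd.cut(
    customers_toy["age"], bins=age_bins, labels=age_labels
)

# Split income at the median into two equally sized groups.
customers_toy["income_group"] = pd.qcut(
    customers_toy["income"], q=2, labels=["lower", "higher"]
)
print(customers_toy)
```

Fixed edges (`pd.cut`) give bands that are easy to interpret, while quantile-based bins (`pd.qcut`) give groups of similar size. Which one is preferable depends on the question being asked.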

## Menu


In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
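import json
import warnings

import pandas as pd

# Show every column and row when displaying DataFrames
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
# Silence library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")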



In [3]:
import sys
sys.path.append("../") 

import utils.paths as path
from utils.paths2 import direcciones

In [4]:
# Create the Drive directory paths
G_raw, G_processed, G_interim, G_external, G_models, G_reports, G_reports_figures = direcciones('starbucks')

In [5]:
# Load the three JSON files (one JSON record per line)
portfolio = pd.read_json(path.data_raw_dir('portfolio.json'), orient='records', lines=True)
# G_portfolio = pd.read_json(G_raw/'portfolio.json', orient='records', lines=True)
profile = pd.read_json(path.data_raw_dir('profile.json'), orient='records', lines=True)
transcript = pd.read_json(path.data_raw_dir('transcript.json'), orient='records', lines=True)

## portfolio.json:

In [6]:
portfolio.head()

|   | reward | channels | difficulty | duration | offer_type | id |
|---|---|---|---|---|---|---|
| 0 | 10 | [email, mobile, social] | 10 | 7 | bogo | ae264e3637204a6fb9bb56bc8210ddfd |
| 1 | 10 | [web, email, mobile, social] | 10 | 5 | bogo | 4d5c57ea9a6940dd891ad53e9dbe8da0 |
| 2 | 0 | [web, email, mobile] | 0 | 4 | informational | 3f207df678b143eea3cee63160fa8bed |
| 3 | 5 | [web, email, mobile] | 5 | 7 | bogo | 9b98b8c7a33c4b65b9aebfe6a799e6d9 |
| 4 | 5 | [web, email] | 20 | 10 | discount | 0b1e1539f2cc45b7b9fa7c272da2e1d7 |


In [7]:
print('minimum duration: ', portfolio.duration.min())
print('maximum duration: ', portfolio.duration.max())

minimum duration:  3
maximum duration:  10


In [8]:
portfolio.describe()

|   | reward | difficulty | duration |
|---|---|---|---|
| count | 10.0 | 10.0 | 10.0 |
| mean | 4.2 | 7.7 | 6.5 |
| std | 3.583915 | 5.831905 | 2.321398 |
| min | 0.0 | 0.0 | 3.0 |
| 25% | 2.0 | 5.0 | 5.0 |
| 50% | 4.0 | 8.5 | 7.0 |
| 75% | 5.0 | 10.0 | 7.0 |
| max | 10.0 | 20.0 | 10.0 |


In [9]:
portfolio.offer_type.describe()

count       10
unique       3
top       bogo
freq         4
Name: offer_type, dtype: object

In [10]:
portfolio.channels

0         [email, mobile, social]
1    [web, email, mobile, social]
2            [web, email, mobile]
3            [web, email, mobile]
4                    [web, email]
5    [web, email, mobile, social]
6    [web, email, mobile, social]
7         [email, mobile, social]
8    [web, email, mobile, social]
9            [web, email, mobile]
Name: channels, dtype: object
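
Because each `channels` entry is a list, it cannot be used directly as a feature for grouping or modelling. One common option is to expand it into one 0/1 column per channel. The sketch below does this on a hypothetical `channels_toy` frame. It is an illustration of the technique, not necessarily the preprocessing applied later.

```python
import pandas as pd

# Toy offers with list-valued channels, shaped like the column above.
channels_toy = pd.DataFrame({
    "id": ["a", "b", "c"],
    "channels": [
        ["email", "mobile", "social"],
        ["web", "email"],
        ["web", "email", "mobile"],
    ],
})

# explode() yields one row per (offer, channel); get_dummies() one-hot
# encodes the channel names; grouping on the original index collapses
# the rows back to one row per offer.
flags = (
    pd.get_dummies(channels_toy["channels"].explode())
    .groupby(level=0)
    .max()
    .astype(int)
)
encoded = channels_toy.drop(columns="channels").join(flags)
print(encoded)
```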

In [11]:
portfolio.reward.value_counts()

5     3
10    2
0     2
2     2
3     1
Name: reward, dtype: int64

In [12]:
portfolio.duration.value_counts()

7     4
5     2
10    2
4     1
3     1
Name: duration, dtype: int64

In [13]:
portfolio['id'].nunique() == portfolio.shape[0]

True

In [14]:
print(portfolio.offer_type.nunique())
print(portfolio.offer_type.unique())

3
['bogo' 'informational' 'discount']


In [20]:
portfolio.shape

(10, 6)

Portfolio dataset contains 10 records and 6 columns.

## profile.json:

In [17]:
profile.head()

|   | gender | age | id | became_member_on | income |
|---|---|---|---|---|---|
| 0 |  | 118 | 68be06ca386d4c31939f3a4f0e3dd783 | 20170212 |  |
| 1 | F | 55 | 0610b486422d4921ae7d2bf64640c50b | 20170715 | 112000.0 |
| 2 |  | 118 | 38fe809add3b4fcf9315a9694bb96ff5 | 20180712 |  |
| 3 | F | 75 | 78afa995795e4d85b5d9ceeca43f5fef | 20170509 | 100000.0 |
| 4 |  | 118 | a03223e636434f42ac4c3df47e8bac43 | 20170804 |  |


In [21]:
profile.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            14825 non-null  object 
 1   age               17000 non-null  int64  
 2   id                17000 non-null  object 
 3   became_member_on  17000 non-null  int64  
 4   income            14825 non-null  float64
dtypes: float64(1), int64(2), object(2)
memory usage: 664.2+ KB
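
The `info()` output shows that `gender` and `income` each have 14825 non-null values out of 17000, and in the `head()` rows the missing values coincide with an age of 118. The sketch below shows one way to check whether the two patterns line up, using a toy frame that mirrors those five rows. Treating 118 as a placeholder for a missing age is an assumption that should be verified on the full data.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the five rows displayed by profile.head().
profile_toy = pd.DataFrame({
    "gender": [None, "F", None, "F", None],
    "age": [118, 55, 118, 75, 118],
    "income": [np.nan, 112000.0, np.nan, 100000.0, np.nan],
})

# Missing values per column.
missing_per_column = profile_toy.isna().sum()

# Do the rows with age 118 match the rows lacking gender and income?
placeholder_age = profile_toy["age"] == 118
demographics_missing = profile_toy[["gender", "income"]].isna().all(axis=1)
same_rows = bool((placeholder_age == demographics_missing).all())
print(missing_per_column)
print("age 118 matches missing demographics:", same_rows)

# One possible treatment (an assumption): mark age 118 as missing too.
cleaned = profile_toy.assign(age=profile_toy["age"].where(~placeholder_age))
```

Running the same comparison on the real `profile` frame would confirm or refute the pattern before any rows are dropped or imputed.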


In [22]:
print('mean income value: ', profile.income.mean())
print('median income value: ', profile.income.median())

mean income value:  65404.9915682968
median income value:  64000.0


In [23]:
profile.describe(include = 'object')

|   | gender | id |
|---|---|---|
| count | 14825 | 17000 |
| unique | 3 | 17000 |
| top | M | 68be06ca386d4c31939f3a4f0e3dd783 |
| freq | 8484 | 1 |


In [24]:
profile.gender.value_counts()

M    8484
F    6129
O     212
Name: gender, dtype: int64

In [25]:
profile['id'].nunique() == profile.shape[0]

True

In [26]:
profile.became_member_on.nunique()

1716
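
`became_member_on` is stored as an integer such as 20170212, which appears to encode a date as YYYYMMDD. To answer the yearly enrollment question it is convenient to parse it into a proper datetime first. A minimal sketch follows, using the five values shown in `profile.head()`. The `member_toy` frame and the `member_year` column name are hypothetical.

```python
import pandas as pd

# The five became_member_on values displayed by profile.head().
member_toy = pd.DataFrame({
    "became_member_on": [20170212, 20170715, 20180712, 20170509, 20170804]
})

# Convert the integers to strings and parse them with an explicit format.
joined = pd.to_datetime(
    member_toy["became_member_on"].astype(str), format="%Y%m%d"
)
member_toy["member_year"] = joined.dt.year

# Number of enrollments per year, in chronological order.
per_year = member_toy["member_year"].value_counts().sort_index()
print(per_year)
```

Passing an explicit `format` avoids ambiguous parsing and raises an error if a value does not match the expected pattern.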

In [29]:
profile.age.unique()

array([118,  55,  75,  68,  65,  58,  61,  26,  62,  49,  57,  40,  64,
        78,  42,  56,  33,  46,  59,  67,  53,  22,  96,  69,  20,  45,
        54,  39,  41,  79,  66,  29,  44,  63,  36,  76,  77,  30,  51,
        27,  73,  74,  70,  89,  50,  90,  60,  19,  72,  52,  18,  71,
        83,  43,  47,  32,  38,  34,  85,  48,  35,  82,  21,  24,  81,
        25,  37,  23, 100,  28,  84,  80,  87,  86,  94,  31,  88,  95,
        93,  91,  92,  98, 101,  97,  99], dtype=int64)

In [18]:
profile.shape

(17000, 5)

Profile dataset contains 17000 records and 5 columns.

## transcript.json:

In [33]:
transcript.head()

|   | person | event | value | time |
|---|---|---|---|---|
| 0 | 78afa995795e4d85b5d9ceeca43f5fef | offer received | {'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'} | 0 |
| 1 | a03223e636434f42ac4c3df47e8bac43 | offer received | {'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'} | 0 |
| 2 | e2127556f4f64592b11af22de27a7932 | offer received | {'offer id': '2906b810c7d4411798c6938adc9daaa5'} | 0 |
| 3 | 8ec6ce2a7e7949b1bf142def7d0e0586 | offer received | {'offer id': 'fafdcd668e3743c1bb461111dcafc2a4'} | 0 |
| 4 | 68617ca6246f4fbc85e91a2a49552598 | offer received | {'offer id': '4d5c57ea9a6940dd891ad53e9dbe8da0'} | 0 |
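
The `value` column holds a dictionary per row, so the offer id has to be pulled out before the transcript can be joined to the portfolio. The sketch below flattens such a column on toy data. The `offer id` key is taken from the rows shown above, while the `amount` key for transactions is a hypothetical example of a second key and should be checked against the real data.

```python
import pandas as pd

# Toy transcript; the last row uses an assumed key for transactions.
transcript_toy = pd.DataFrame({
    "person": ["p1", "p2", "p1"],
    "event": ["offer received", "offer received", "transaction"],
    "value": [
        {"offer id": "9b98b8c7a33c4b65b9aebfe6a799e6d9"},
        {"offer id": "0b1e1539f2cc45b7b9fa7c272da2e1d7"},
        {"amount": 4.5},  # hypothetical key, not shown in the rows above
    ],
    "time": [0, 0, 6],
})

# Build one column per dictionary key; keys absent from a row become NaN.
value_cols = pd.DataFrame(
    transcript_toy["value"].tolist(), index=transcript_toy.index
)
# Replace spaces so the new columns are valid attribute names.
value_cols = value_cols.rename(columns=lambda name: name.replace(" ", "_"))

flat = transcript_toy.drop(columns="value").join(value_cols)
print(flat)
```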


In [34]:
print('mean time value: ', transcript.time.mean())
print('median time value: ',transcript.time.median())

mean time value:  366.382939576034
median time value:  408.0


In [35]:
transcript.describe()

|   | time |
|---|---|
| count | 306534.0 |
| mean | 366.38294 |
| std | 200.326314 |
| min | 0.0 |
| 25% | 186.0 |
| 50% | 408.0 |
| 75% | 528.0 |
| max | 714.0 |


In [36]:
transcript.describe(include = 'object')

|   | person | event | value |
|---|---|---|---|
| count | 306534 | 306534 | 306534 |
| unique | 17000 | 4 | 5121 |
| top | 94de646f7b6041228ca7dec82adb97d2 | transaction | {'offer id': '2298d6c36e964ae4a3e7e9706d1fb8c2'} |
| freq | 51 | 138953 | 14983 |


In [37]:
transcript.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306534 entries, 0 to 306533
Data columns (total 4 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   person  306534 non-null  object
 1   event   306534 non-null  object
 2   value   306534 non-null  object
 3   time    306534 non-null  int64 
dtypes: int64(1), object(3)
memory usage: 9.4+ MB


In [38]:
transcript.event.value_counts()

transaction        138953
offer received      76277
offer viewed        57725
offer completed     33579
Name: event, dtype: int64
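
The counts above can be turned into percentages to answer the question about the share of transactions and offers. The sketch below recomputes the shares from the four counts printed by `value_counts()`. On the full frame, `transcript.event.value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Event counts copied from the value_counts() output above.
event_counts = pd.Series({
    "transaction": 138953,
    "offer received": 76277,
    "offer viewed": 57725,
    "offer completed": 33579,
})

# Share of each event type, in percent.
share_pct = (event_counts / event_counts.sum() * 100).round(2)

# Combined share of the three offer-related events.
offer_share_pct = share_pct.drop("transaction").sum()

print(share_pct)
print("offer events combined (%):", round(offer_share_pct, 2))
```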

In [39]:
received = transcript.query(" event == 'offer received' ")['time'].mean()
completed = transcript.query(" event == 'offer completed' ")['time'].mean()
viewed = transcript.query(" event == 'offer viewed' ")['time'].mean()
transactions = transcript.query(" event == 'transaction' ")['time'].mean()

print('mean time of offer received:', received)
print('mean time of offer viewed:', viewed)
print('mean time of offer completed:', completed)
print('mean time of transactions:', transactions)

mean time of offer received: 332.57951938330035
mean time of offer viewed: 354.29051537462107
mean time of offer completed: 401.0528008576789
mean time of transactions: 381.58433427130035


In [40]:
received = transcript.query(" event == 'offer received' ")['time'].count()
completed = transcript.query(" event == 'offer completed' ")['time'].count()
viewed = transcript.query(" event == 'offer viewed' ")['time'].count()
transactions = transcript.query(" event == 'transaction' ")['time'].count()

print('offer received counts:', received)
print('offer viewed counts:', viewed)
print('offer completed counts:', completed)
print('transactions counts:', transactions)

offer received counts: 76277
offer viewed counts: 57725
offer completed counts: 33579
transactions counts: 138953
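
The two cells above run one `query` per event type for the mean and again for the count. As an alternative, a single `groupby` can produce both statistics at once. The sketch below shows the idea on a hypothetical `events_toy` frame with invented values.

```python
import pandas as pd

# Toy events; values are invented for illustration.
events_toy = pd.DataFrame({
    "event": [
        "offer received", "offer viewed", "transaction",
        "offer received", "transaction",
    ],
    "time": [0, 6, 12, 168, 180],
})

# Mean time and number of rows for every event type in one pass.
per_event = events_toy.groupby("event")["time"].agg(["mean", "count"])
print(per_event)
```

Applied to the real data, `transcript.groupby("event")["time"].agg(["mean", "count"])` would be the equivalent call, and it stays correct if new event types appear.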


In [41]:
transcript['person'].nunique() == transcript.shape[0]

False

In [30]:
transcript.shape

(306534, 4)

Transcript dataset contains 306534 records and 4 columns.