# Introduction

This data set contains simulated data that mimics customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free). Some users might not receive any offer during certain weeks.

Not all users receive the same offer, and that is the challenge to solve with this data set.

Your task is to combine transaction, demographic and offer data to determine which demographic groups respond best to which offer type. This data set is a simplified version of the real Starbucks app because the underlying simulator only has one product whereas Starbucks actually sells dozens of products.

Every offer has a validity period before the offer expires. As an example, a BOGO offer might be valid for only 5 days. You'll see in the data set that informational offers have a validity period even though these ads are merely providing information about a product; for example, if an informational offer has 7 days of validity, you can assume the customer is feeling the influence of the offer for 7 days after receiving the advertisement.
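The validity period can be turned into a simple window check. The sketch below assumes, following the schema later in this document, that offer duration is given in days and transcript time in hours. The function name `in_validity_window` is hypothetical, and treating the expiry hour itself as still valid is an assumption.

```python
def in_validity_window(received_hour, duration_days, event_hour):
    """Return True if event_hour falls within the offer's validity period.

    Assumes the window starts at receipt and includes the expiry hour itself.
    """
    expires_hour = received_hour + duration_days * 24  # convert days to hours
    return received_hour <= event_hour <= expires_hour


# A 7-day informational offer received at hour 0 is assumed to influence
# the customer through hour 168.
print(in_validity_window(0, 7, 100))
print(in_validity_window(0, 7, 200))
```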

You'll be given transactional data showing user purchases made on the app including the timestamp of purchase and the amount of money spent on a purchase. This transactional data also has a record for each offer that a user receives as well as a record for when a user actually views the offer. There are also records for when a user completes an offer.

Keep in mind as well that someone using the app might make a purchase through the app without having received an offer or seen an offer.

# Example

For example, a user could receive a "buy 10 dollars, get 2 dollars off" discount offer on Monday. The offer is valid for 10 days from receipt. If the customer accumulates at least 10 dollars in purchases during the validity period, the customer completes the offer.

However, there are a few things to watch out for in this data set. Customers do not opt into the offers that they receive; in other words, a user can receive an offer, never actually view the offer, and still complete the offer. For example, a user might receive the "buy 10 dollars get 2 dollars off" offer, but the user never opens the offer during the 10-day validity period. The customer spends 15 dollars during those ten days. There will be an offer completion record in the data set; however, the customer was not influenced by the offer because the customer never viewed the offer.
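One way to separate influenced completions from incidental ones is to check, per customer and offer, whether a view was recorded no later than the completion. The sketch below runs on a tiny hand-made event table, not the real data, and the helper name `influenced_completions` is hypothetical. It also assumes each offer is sent to a customer only once, which the real data would need to handle.

```python
import pandas as pd

# Toy events in the shape of the flattened transcript; all values are made up.
events = pd.DataFrame({
    "person":   ["a", "a", "a", "b", "b"],
    "offer_id": ["d1", "d1", "d1", "d1", "d1"],
    "event":    ["offer received", "offer viewed", "offer completed",
                 "offer received", "offer completed"],
    "time":     [0, 6, 48, 0, 30],  # hours since start of test
})


def influenced_completions(df):
    """Return one row per completed (person, offer_id) pair with a flag that
    is True when a view happened at or before the completion time."""
    first = (df.pivot_table(index=["person", "offer_id"], columns="event",
                            values="time", aggfunc="min")
               .reindex(columns=["offer received", "offer viewed", "offer completed"]))
    done = first.dropna(subset=["offer completed"]).copy()
    # A missing view is NaN, and NaN <= x is False, so unviewed completions
    # are flagged as not influenced.
    done["influenced"] = done["offer viewed"] <= done["offer completed"]
    return done.reset_index()


result = influenced_completions(events)
print(result[["person", "offer_id", "influenced"]])
```

Here customer `a` viewed the offer before completing it, while customer `b` completed it without ever viewing it.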

# Cleaning

Completions like these, recorded without the customer ever viewing the offer, make data cleaning especially important and tricky.

You'll also want to take into account that some demographic groups will make purchases even if they don't receive an offer. From a business perspective, if a customer is going to make a 10 dollar purchase without an offer anyway, you wouldn't want to send a buy 10 dollars get 2 dollars off offer. You'll want to try to assess what a certain demographic group will buy when not receiving any offers.
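A rough baseline can be sketched by comparing spend between customers who did and did not receive an offer, grouped by a demographic attribute. The tables below are toy data with made-up values, and the `toy_` names and the `offer_status` column are hypothetical. If nearly every customer in the real data receives at least one offer, a comparison by time window rather than by customer would likely be needed.

```python
import pandas as pd

# Toy stand-ins for profile and the flattened transcript; values are made up.
toy_profile = pd.DataFrame({"id": ["a", "b", "c", "d"],
                            "gender": ["F", "F", "M", "M"]})
toy_transcript = pd.DataFrame({
    "person": ["a", "a", "b", "c", "d", "d"],
    "event": ["transaction", "offer received", "transaction",
              "transaction", "offer received", "transaction"],
    "amount": [5.0, None, 8.0, 4.0, None, 12.0],
})

# Customers who received at least one offer
received = set(toy_transcript.loc[toy_transcript["event"] == "offer received", "person"])

# Total spend per customer, labelled by whether they ever received an offer
spend = (toy_transcript[toy_transcript["event"] == "transaction"]
         .groupby("person")["amount"].sum()
         .rename("total_spend")
         .reset_index())
spend["offer_status"] = spend["person"].isin(received).map({True: "offer", False: "no_offer"})

# Mean spend per gender, split by offer status
baseline = (spend.merge(toy_profile, left_on="person", right_on="id")
            .groupby(["gender", "offer_status"])["total_spend"].mean()
            .unstack())
print(baseline)
```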

# Final Remark

Because this is a capstone project, you are free to analyze the data any way you see fit. For example, you could build a machine learning model that predicts how much someone will spend based on demographics and offer type. Or you could build a model that predicts whether or not someone will respond to an offer. Or, you don't need to build a machine learning model at all. You could develop a set of heuristics that determine what offer you should send to each customer (e.g., 75 percent of female customers who were 35 years old responded to offer A vs 40 percent from the same demographic to offer B, so send offer A).
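The heuristic approach can be sketched as a response-rate table per demographic group and offer, followed by picking the best offer per group. The data below is invented for illustration, and the column names `group`, `offer`, and `responded` are hypothetical.

```python
import pandas as pd

# Hypothetical outcomes: one row per offer sent, with 1 if the customer responded.
sends = pd.DataFrame({
    "group":     ["F_35", "F_35", "F_35", "F_35", "M_50", "M_50", "M_50", "M_50"],
    "offer":     ["A", "A", "B", "B", "A", "A", "B", "B"],
    "responded": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Response rate for each (group, offer) pair, with offers as columns
rates = sends.groupby(["group", "offer"])["responded"].mean().unstack()

# The offer with the highest response rate in each group
best_offer = rates.idxmax(axis=1)
print(rates)
print(best_offer)
```

With real data, each group would need enough sends for the rates to be meaningful; small groups can produce misleading percentages.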

# Data Sets

The data is contained in three files:

- portfolio.json - containing offer ids and metadata about each offer (duration, type, etc.)
- profile.json - demographic data for each customer
- transcript.json - records for transactions, offers received, offers viewed, and offers completed

Here is the schema and explanation of each variable in the files:

portfolio.json

- id (string) - offer id
- offer_type (string) - type of offer, i.e. BOGO, discount, informational
- difficulty (int) - minimum required spend to complete an offer
- reward (int) - reward given for completing an offer
- duration (int) - time for offer to be open, in days
- channels (list of strings)

profile.json

- age (int) - age of the customer
- became_member_on (int) - date when customer created an app account
- gender (str) - gender of the customer (note some entries contain 'O' for other rather than M or F)
- id (str) - customer id
- income (float) - customer's income

transcript.json

- event (str) - record description (e.g. transaction, offer received, offer viewed, etc.)
- person (str) - customer id
- time (int) - time in hours since start of test. The data begins at time t=0
- value (dict of strings) - either an offer id or transaction amount depending on the record

Note: If you are using the workspace, you will need to go to the terminal and run the command conda update pandas before reading in the files. This is because the version of pandas in the workspace cannot read in the transcript.json file correctly, but the newest version of pandas can. You can access the terminal from the orange icon in the top left of this notebook.

Once the update has finished, return to the notebook (use the Jupyter icon again); you should then be able to run the cell that reads in the files without any errors.

# Problem Statement 

We will explore the Starbucks dataset, which simulates how people make purchasing decisions and how those decisions are influenced by promotional offers.

There are three offer types that can be sent: buy-one-get-one (BOGO), discount, and informational.

We will create a model that can predict customer behaviour.

We will analyse the data in the Exploratory Data Analysis part of this section and answer the following business questions related to customer buying behaviour.

- What is the Gender, Age and Income Distribution of Starbucks Customers?
- How many customers enrolled yearly?
- What is the average age of Starbucks Customers?
- What is the average Income of Starbucks Customers?
- Which gender has the highest yearly membership?
- Which gender has the highest annual income?
- What is the distribution of event in transcripts?
- What is the percent of transactions and offers in the event?
- What are the types of offers: received, viewed, completed?
- What is the Income Distribution for the Offer Events?
- What is the highest completed offer?
- What is the lowest completed offer?

To compare different classification algorithms, we will use the lazypredict library.

In [1]:
# Install lazypredict library
!pip install lazypredict



# 1. Data Preparing

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import math
import json
import seaborn as sns
from lazypredict.Supervised import LazyClassifier
from sklearn.model_selection import train_test_split
%matplotlib inline

# read in the json files
portfolio = pd.read_json('https://raw.githubusercontent.com/bilgin-kocak/starbucks-customer-capstone-project/master/data/portfolio.json', orient='records', lines=True)
profile = pd.read_json('https://raw.githubusercontent.com/bilgin-kocak/starbucks-customer-capstone-project/master/data/profile.json', orient='records', lines=True)
transcript = pd.read_json('https://raw.githubusercontent.com/bilgin-kocak/starbucks-customer-capstone-project/master/data/transcript.json', orient='records', lines=True)




# 2. Data Understanding

In [None]:
portfolio.head()

In [None]:
profile.head()

In [None]:
transcript.head()

# 3. Data Cleaning

In this part, we will deal with missing values and extreme values.

In [None]:
profile.isna().sum()

In [None]:
transcript.isnull().sum()

In [None]:
portfolio.isnull().sum()

Only the profile dataframe has missing values.

In [None]:
profile['gender'].value_counts()

In [None]:
profile['age'].describe()

In [None]:
age_counts = profile['age'].value_counts()
age_counts.sort_index()

We will drop the rows of the profile dataframe where age is 118 and both gender and income are NaN.
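Before dropping, it can help to confirm that the age value 118 and the missing gender and income values occur on exactly the same rows, so that no otherwise valid record is lost. The sketch below does this on a toy frame with made-up values; the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy profile rows; values are made up for illustration.
toy = pd.DataFrame({
    "age":    [118, 45, 118, 62],
    "gender": [None, "F", None, "M"],
    "income": [np.nan, 72000.0, np.nan, 51000.0],
})

placeholder_age = toy["age"] == 118
missing_demo = toy["gender"].isna() & toy["income"].isna()

# True when both conditions mark exactly the same rows
same_rows = bool((placeholder_age == missing_demo).all())

# Drop only the rows where both conditions hold
cleaned = toy[~(placeholder_age & missing_demo)]
print(same_rows)
print(cleaned)
```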

In [None]:
profile = profile[~((profile.age == 118) & (profile.gender.isnull()) & (profile.income.isnull()))]
profile.head()

In [None]:
profile.isnull().sum()

In [None]:
profile['became_member_on'] = pd.to_datetime(profile.became_member_on, format = '%Y%m%d')
profile['start_year'] = profile.became_member_on.dt.year
profile.head()

In [None]:
# One-hot encode the channels column
channels = portfolio["channels"].str.join(sep="*").str.get_dummies(sep="*")

# One-hot encode the offer_type column
offer_type = pd.get_dummies(portfolio['offer_type'])

# Concatenate the one-hot columns onto portfolio
portfolio_df = pd.concat([portfolio, channels, offer_type], axis=1, sort=False)

# Remove the channels column; offer_type is kept because it is used later
portfolio = portfolio_df.drop(['channels'], axis=1)
portfolio

In [None]:
# dict_keys objects are not hashable, so convert them to tuples before counting
transcript['value'].apply(lambda x: tuple(x.keys())).value_counts()

The value column holds three kinds of keys: the offer id (spelled either 'offer id' or 'offer_id'), amount, and reward.
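The key layout can be illustrated on a small sample. The sketch below flattens a few hand-made dictionaries with `pd.json_normalize` and folds the two spellings of the offer id into one column; the sample values are invented, and this is only one possible way to do the flattening.

```python
import pandas as pd

# Hand-made examples of the three kinds of dictionaries in the value column.
values = pd.Series([
    {"offer id": "o1"},               # key spelled with a space
    {"amount": 4.5},                  # transaction
    {"offer_id": "o1", "reward": 2},  # key spelled with an underscore
])

# One column per key; keys missing from a record become NaN
flat = pd.json_normalize(values.tolist())

# Fold both spellings of the offer id into a single column
flat["offer_id"] = flat["offer_id"].combine_first(flat["offer id"])
flat = flat.drop(columns=["offer id"])
print(flat)
```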

In [None]:
# The offer id key is spelled either 'offer id' or 'offer_id', depending on the event
transcript['offer_id'] = transcript['value'].apply(lambda x: x.get('offer id', x.get('offer_id', np.nan)))
transcript['amount'] = transcript['value'].apply(lambda x: x.get('amount', np.nan))
transcript['reward'] = transcript['value'].apply(lambda x: x.get('reward', np.nan))
transcript.isnull().sum()

In [None]:
transcript.drop(columns=['value'], inplace=True)
transcript.head()

# Merge the datasets

In [None]:
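# Align the key names first: profile 'id' and transcript 'person' both hold the customer id,
# and portfolio 'id' holds the offer id
profile = profile.rename(columns={'id': 'customer_id'})
transcript = transcript.rename(columns={'person': 'customer_id'})
portfolio = portfolio.rename(columns={'id': 'offer_id'})

# Merge the transcript and profile dataframes on the customer_id column
transcript = transcript.merge(profile, on=['customer_id'])
# Merge the transcript and portfolio on the offer_id column using a left join,
# to keep every transcript row, including transactions that have no offer_id
transcript = transcript.merge(portfolio, on=['offer_id'], how='left')
transcript.head(2)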

# Data Labelling

In [None]:
# Label-encode the categorical columns
from sklearn import preprocessing

# offer_id: the offer ids from portfolio, plus 'nan' for transaction rows without an offer
le1 = preprocessing.LabelEncoder()
transcript['offer_id'] = le1.fit_transform(transcript['offer_id'].astype(str))

# offer_type: bogo, discount, informational, plus 'nan' for transaction rows
le2 = preprocessing.LabelEncoder()
transcript['offer_type'] = le2.fit_transform(transcript['offer_type'].astype(str))

# gender: three values (F, M, O)
le3 = preprocessing.LabelEncoder()
transcript['gender'] = le3.fit_transform(transcript['gender'].astype(str))

transcript.head(2)

In [None]:
# To retrieve the original values, use the encoder's inverse transform
le3.inverse_transform([0,1,2])

In [None]:
# Use a separate encoder (le4) so the gender encoder (le3) is not overwritten
le4 = preprocessing.LabelEncoder()
transcript['customer_id'] = le4.fit_transform(transcript['customer_id'].astype(str))
transcript.head()

In [None]:
transcript.customer_id.nunique()

In [None]:
# Separate the three offer events from the transaction events
# (.copy() so that columns can be added later without a SettingWithCopyWarning)
transaction_df = transcript[transcript.event == "transaction"].copy()
offers_df = transcript[transcript.event != "transaction"].copy()
offers_df.head()

# 4. Exploratory Data Analysis

Q1: What is the Gender, Age and Income Distribution of Starbucks Customers?

Q2: How many customers enrolled yearly?

In [None]:
# Subplots for the distribution of gender, age, income, and membership start year in the cleaned profile data
sns.set_theme()  # apply seaborn styling once, before any plotting
fig, axes = plt.subplots(2, 2, figsize=(13, 12))
fig.suptitle('Demographics of Customer Data of Starbucks', fontsize=15, weight='bold')

# Gender subplot
profile['gender'].value_counts().plot(kind='bar', ax=axes[0, 0], rot=0)
axes[0, 0].set_title('Gender Distribution of Starbucks Customers')
axes[0, 0].set_xlabel('Gender')
axes[0, 0].set_ylabel('Frequency')

# Age subplot
axes[0, 1].hist(profile['age'])
axes[0, 1].set_title('Age Distribution of Starbucks Customers')
axes[0, 1].set_xlabel('Age')
axes[0, 1].set_ylabel('Frequency')

# Income subplot (income shown in thousands)
axes[1, 0].hist(profile['income'] * 1e-3)
axes[1, 0].set_title('Income Distribution of Starbucks Customers')
axes[1, 0].set_xlabel('Income (thousands)')
axes[1, 0].set_ylabel('Frequency')

# Membership start year subplot, in chronological order
profile['start_year'].value_counts().sort_index().plot(kind='bar', ax=axes[1, 1], rot=0)
axes[1, 1].set_title('Yearly Membership of Starbucks Customers')
axes[1, 1].set_xlabel('Membership Start Year')
axes[1, 1].set_ylabel('Frequency')

plt.show()

Q3: What is the average age of Starbucks Customers?

In [None]:
avg_age = profile['age'].mean()
print(f"The average age of Starbucks customers: {avg_age:.2f}")

Q4: What is the average Income of Starbucks Customers?

In [None]:
avg_income = profile['income'].mean()
print(f"The average income of Starbucks customers: {avg_income:.2f}")

Q5: Which gender has the highest yearly membership?

In [None]:
# Count members per start year and gender, then pick the gender with the most members in each year
membership_year = profile.groupby(['start_year', 'gender'])["income"].count().reset_index()
membership_year.rename(columns={'income':'number of membership'}, inplace=True)
# Select the count column explicitly so idxmax is not applied to the string gender column
highest_gender_index = membership_year.groupby('start_year')['number of membership'].idxmax()
membership_year.loc[highest_gender_index]

In [None]:
#plot a bar graph for membership program as a function of gender 
plt.figure(figsize=(15, 5))
sns.barplot(x='start_year', y='number of membership', hue='gender', data=membership_year);
plt.xlabel('Membership Start Year',fontsize = 12);
plt.ylabel('Count',fontsize = 12);
plt.title("Gender distribution of yearly membership", fontsize = 15)
plt.show()

Q6: Which gender has the highest annual income?

In [None]:
avg_income = profile.groupby('gender')['income'].mean()
avg_income

Q7: What is the distribution of event in transcripts?

In [None]:
# Pass the column as x explicitly; recent seaborn versions expect keyword arguments here
sns.countplot(x=transcript['event'])
plt.title('Number of Events in Transcripts')
plt.ylabel('Number of Events')
plt.xlabel('Event Type')
plt.xticks(rotation = 0)
plt.show()

Q8: What is the percent of transactions and offers in the event?

In [None]:
transaction_percentage = len(transcript[transcript['event']=='transaction'])/len(transcript)
offer_percentage = len(transcript[transcript['event']!='transaction'])/len(transcript)

print(f'The percentage of transaction in all events: {transaction_percentage*100:.2f}')
print(f'The percentage of offer in all events: {offer_percentage*100:.2f}')

Q9: What are the types of offers: received, viewed, completed?

In [None]:
offers_df.event.value_counts()

In [None]:
offer_received = offers_df[offers_df["event"] == "offer received"]
offer_viewed = offers_df[offers_df["event"]== "offer viewed"]
offer_completed = offers_df[offers_df["event"] == "offer completed"]

# Visualize the offer types among received, viewed, and completed offers
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
fig.suptitle('Offer types : received, viewed and completed', fontsize=15, weight='bold')

# Subplot for received offers
sns.countplot(x=offer_received['offer_type'], ax=axes[0])
axes[0].set_title('Number of Received Offers by Type', fontsize=13)
axes[0].set_xlabel('Offer Received')
axes[0].tick_params(axis='x', labelrotation=45, labelsize=13)

# Subplot for viewed offers
sns.countplot(x=offer_viewed['offer_type'], ax=axes[1])
axes[1].set_title('Number of Viewed Offers by Type', fontsize=13)
axes[1].set_xlabel('Offer Viewed')
axes[1].tick_params(axis='x', labelrotation=45, labelsize=13)

# Subplot for completed offers
sns.countplot(x=offer_completed['offer_type'], ax=axes[2])
axes[2].set_title('Number of Completed Offers by Type', fontsize=13)
axes[2].set_xlabel('Offer Completed')
axes[2].tick_params(axis='x', labelrotation=45, labelsize=13)
plt.show()

# Print the mapping from encoded label to offer type
for i, offer_type in enumerate(le2.inverse_transform([0, 1, 2])):
    print(f"{i} : {offer_type}")

Q10: What is the Income Distribution for the Offer Events?

In [None]:
# Bin income into 10K-wide groups, using the income column of offers_df itself
# so that the bins line up with its rows
offers_df['income_groups'] = pd.cut(x=offers_df["income"],
                                    bins=[30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 110000,  120000],
                                    labels=['30-40K','40-50K','50-60K','60-70K','70-80K','80-90K','90-100K','100-110K','110-120K'],
                                    include_lowest=True)

plt.figure(figsize=(14, 6))
sns.countplot(x='income_groups', hue="event", data=offers_df)
plt.title("Income Distribution for the Offer Events")
plt.ylabel('Total')
plt.xlabel('Income')
plt.xticks(rotation = 30)
plt.legend(title='Offer Event')
plt.show()

Q11: What is the highest completed offer?

In [None]:
max_completion = offer_completed.offer_id.value_counts().values[0]
max_comp_offer_id = offer_completed.offer_id.value_counts().index[0]
print(f"Number of Completion: {max_completion}")
print(f"offer_id with maximum offers completed:{max_comp_offer_id}")
print(f"Original offer_id with maximum offers completed: {le1.inverse_transform([max_comp_offer_id])}")

Q12: What is the lowest completed offer?

In [None]:
min_completion = offer_completed.offer_id.value_counts().values[-1]
min_comp_offer_id = offer_completed.offer_id.value_counts().index[-1]
print(f"Number of Completion: {min_completion}")
print(f"offer_id with minimum offers completed: {min_comp_offer_id}")
print(f"Original offer_id with minimum offers completed: {le1.inverse_transform([min_comp_offer_id])}")

# 5. Data Modelling

# Metrics:
This is a simple classification problem, so we will use accuracy to evaluate the models.
Accuracy is the number of correct predictions divided by the total number of predictions; we compare it across models and choose the best.
To compare the models, we will use the lazypredict library.
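As a small illustration of the metric, the sketch below computes accuracy by hand and with scikit-learn's `accuracy_score` on made-up labels. The confusion matrix is included as an optional extra view of which classes get mixed up; it is not part of the original analysis.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted labels for three classes (0, 1, 2)
y_true = np.array([0, 1, 2, 1, 0, 2])
y_pred = np.array([0, 1, 1, 1, 0, 2])

# Accuracy by hand: share of predictions that match the true label
manual_accuracy = (y_true == y_pred).mean()

# The same quantity via scikit-learn
accuracy = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(manual_accuracy, accuracy)
print(cm)
```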

In [None]:
# Drop columns that contain null values or have datetime, object, or category dtypes
cols_to_drop = ['income_groups','amount','became_member_on' ,'event','reward_x']
offers_df = offers_df.drop(columns= cols_to_drop)

In [None]:
offers_df.columns

In [None]:
offers_df.head()

# Supervised Learning (Classification)
The target column is offer_type. It will help to predict the correct offer_type to send to each customer.
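Before training, it can be worth checking whether any single feature already determines the target, since such a feature lets a model score perfectly without learning anything about customers. In this notebook, the one-hot columns bogo, discount, and informational created from portfolio are candidates. The sketch below runs on toy data, and the helper name `determines_target` is hypothetical.

```python
import pandas as pd

# Toy data: 'discount' is a one-hot copy of the target, 'age' is not.
toy = pd.DataFrame({
    "age":        [30, 30, 45, 45, 60, 60],
    "discount":   [1, 0, 1, 0, 0, 1],
    "offer_type": ["discount", "bogo", "discount", "bogo", "bogo", "discount"],
})


def determines_target(df, feature, target):
    """Return True if every value of `feature` maps to a single target value."""
    return bool((df.groupby(feature)[target].nunique() <= 1).all())


for col in ["age", "discount"]:
    print(col, determines_target(toy, col, "offer_type"))
```

Features flagged this way would normally be removed from X before fitting, so that the score reflects the demographic features only.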

In [None]:
X = offers_df.drop(columns=['offer_type', 'customer_id', 'offer_id'])
y = offers_df.offer_type

In [None]:
set(offers_df.offer_type)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.2,random_state =42)

clf = LazyClassifier(verbose=0,ignore_warnings=True, custom_metric=None)
models,predictions = clf.fit(X_train, X_test, y_train, y_test)

print(models)

In [None]:
print(predictions)

# Evaluate the model accuracy
Let's evaluate the AdaBoostClassifier model.

In [None]:
print(f'Train Data Size: {X_train.shape[0]}')
print(f'Test Data Size: {X_test.shape[0]}')

In [None]:
X.columns

Select one example offer from the test set.

In [None]:
X_test.iloc[12535,:]

In [None]:
y_test.iloc[12535]

In [None]:
le2.inverse_transform([0, 1, 2])

In [None]:
random_customer_data = list(X_test.iloc[12535,:])

First, we train the model.

In [None]:
from sklearn.ensemble import AdaBoostClassifier

clf = AdaBoostClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
# Score on the held-out test set rather than on data the model was trained on
clf.score(X_test, y_test)

In [None]:
clf.predict(np.reshape(np.array(random_customer_data),(1,-1)))

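The model predicts the discount offer type for this customer, which matches the true label. A single correct prediction does not establish accuracy, though; that has to be judged over many held-out examples. Note also that X still contains the one-hot offer-type columns (bogo, discount, informational) together with difficulty and duration, all of which come from portfolio and determine offer_type directly, so a near-perfect score mostly reflects that overlap.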
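Accuracy is better estimated over the whole held-out test set than from one example. The sketch below shows the pattern on synthetic data generated with `make_classification`, not on the Starbucks data, so its numbers say nothing about this project; the `_demo` variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic three-class data standing in for the real features and target
X_demo, y_demo = make_classification(n_samples=300, n_features=6, n_informative=4,
                                     n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

model = AdaBoostClassifier(n_estimators=50, random_state=0)
model.fit(X_tr, y_tr)

# Evaluate only on rows the model has not seen during training
y_hat = model.predict(X_te)
test_accuracy = accuracy_score(y_te, y_hat)
print(f"Test accuracy: {test_accuracy:.3f}")
print(classification_report(y_te, y_hat))
```

The per-class precision and recall in the report show whether one offer type is predicted much better than the others, which a single accuracy number can hide.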

# 6. Conclusion
Different segments of customers react to offers differently.

The average age of Starbucks customers is 54.39.

The count of male customers in the low-income level is slightly higher than that of female customers.

The average income of female customers is greater than that of male customers, yet female customers spend less at Starbucks than male customers.

Starbucks has more young customers than older ones.

The offer_type was predicted successfully by training a supervised classifier.

Improvement:
With more data, we could select the best classification algorithm. For the given data, the classifier models below give the best results.

AdaBoostClassifier
BaggingClassifier
XGBClassifier
SVC
SGDClassifier
RidgeClassifierCV
RidgeClassifier
RandomForestClassifier
QuadraticDiscriminantAnalysis
Perceptron
PassiveAggressiveClassifier
NuSVC
NearestCentroid
LogisticRegression
LinearSVC
LinearDiscriminantAnalysis
KNeighborsClassifier
GaussianNB
ExtraTreesClassifier
ExtraTreeClassifier
DecisionTreeClassifier
CalibratedClassifierCV
BernoulliNB
LGBMClassifier

# 7. Results
Customers are attracted to BOGO and discount offers more than to informational offers. The buying behaviour of a customer is independent of their annual income.

Starbucks has more male customers than female customers or customers of other genders.

Most of the classification models perform well on the given data.