# Starbucks Capstone Challenge

### Introduction

This data set contains simulated data that mimics customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free). Some users might not receive any offer during certain weeks. 

Not all users receive the same offer, and that is the challenge to solve with this data set.

Your task is to combine transaction, demographic and offer data to determine which demographic groups respond best to which offer type. This data set is a simplified version of the real Starbucks app because the underlying simulator only has one product whereas Starbucks actually sells dozens of products.

Every offer has a validity period before the offer expires. As an example, a BOGO offer might be valid for only 5 days. You'll see in the data set that informational offers have a validity period even though these ads are merely providing information about a product; for example, if an informational offer has 7 days of validity, you can assume the customer is feeling the influence of the offer for 7 days after receiving the advertisement.

You'll be given transactional data showing user purchases made on the app including the timestamp of purchase and the amount of money spent on a purchase. This transactional data also has a record for each offer that a user receives as well as a record for when a user actually views the offer. There are also records for when a user completes an offer. 

Keep in mind as well that someone using the app might make a purchase through the app without having received an offer or seen an offer.

### Example

To give an example, a user could receive a discount offer (buy 10 dollars, get 2 dollars off) on Monday. The offer is valid for 10 days from receipt. If the customer accumulates at least 10 dollars in purchases during the validity period, the customer completes the offer.

However, there are a few things to watch out for in this data set. Customers do not opt into the offers that they receive; in other words, a user can receive an offer, never actually view the offer, and still complete the offer. For example, a user might receive the "buy 10 dollars get 2 dollars off" offer, but never open the offer during the 10 day validity period. The customer spends 15 dollars during those ten days. There will be an offer completion record in the data set; however, the customer was not influenced by the offer because the customer never viewed the offer.

### Cleaning

This makes data cleaning especially important and tricky.

You'll also want to take into account that some demographic groups will make purchases even if they don't receive an offer. From a business perspective, if a customer is going to make a 10 dollar purchase without an offer anyway, you wouldn't want to send a buy 10 dollars get 2 dollars off offer. You'll want to try to assess what a certain demographic group will buy when not receiving any offers.

### Final Advice

Because this is a capstone project, you are free to analyze the data any way you see fit. For example, you could build a machine learning model that predicts how much someone will spend based on demographics and offer type. Or you could build a model that predicts whether or not someone will respond to an offer. Or, you don't need to build a machine learning model at all. You could develop a set of heuristics that determine what offer you should send to each customer (i.e., 75 percent of women customers who were 35 years old responded to offer A vs 40 percent from the same demographic to offer B, so send offer A).
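
The heuristic idea in the last sentence can be sketched with a grouped response rate. This is only an illustration: the data below is made up, and the column names `gender`, `offer` and `responded` are assumptions rather than columns of the real files.

```python
import pandas as pd

# Toy, made-up data purely for illustration (not taken from the real data set).
responses = pd.DataFrame({
    "gender":    ["F", "F", "F", "F", "M", "M", "M", "M"],
    "offer":     ["A", "A", "B", "B", "A", "A", "B", "B"],
    "responded": [1,   1,   0,   1,   0,   1,   0,   0],
})

# Response rate of each demographic group to each offer.
rates = responses.groupby(["gender", "offer"])["responded"].mean().unstack("offer")

# Heuristic: for each group, pick the offer with the highest response rate.
best_offer = rates.idxmax(axis=1)
print(rates)
print(best_offer)
```

With real data, the grouping keys would be whichever demographic features the analysis settles on.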

# Data Sets

The data is contained in three files:

* portfolio.json - containing offer ids and meta data about each offer (duration, type, etc.)
* profile.json - demographic data for each customer
* transcript.json - records for transactions, offers received, offers viewed, and offers completed

Here is the schema and explanation of each variable in the files:

**portfolio.json**
* id (string) - offer id
* offer_type (string) - type of offer, i.e. BOGO, discount, informational
* difficulty (int) - minimum required spend to complete an offer
* reward (int) - reward given for completing an offer
* duration (int) - time for offer to be open, in days
* channels (list of strings)

**profile.json**
* age (int) - age of the customer 
* became_member_on (int) - date when customer created an app account
* gender (str) - gender of the customer (note some entries contain 'O' for other rather than M or F)
* id (str) - customer id
* income (float) - customer's income

**transcript.json**
* event (str) - record description (i.e. transaction, offer received, offer viewed, etc.)
* person (str) - customer id
* time (int) - time in hours since start of test. The data begins at time t=0
* value - (dict of strings) - either an offer id or transaction amount depending on the record

**Note:** If you are using the workspace, you will need to go to the terminal and run the command `conda update pandas` before reading in the files. This is because the version of pandas in the workspace cannot read in the transcript.json file correctly, but the newest version of pandas can. You can access the terminal from the orange icon in the top left of this notebook.

You can see how to access the terminal and how the install works using the two images below.  First you need to access the terminal:

<img src="data/images/pic1.png"/>

Then you will want to run the above command:

<img src="data/images/pic2.png"/>

Finally, when you enter back into the notebook (use the jupyter icon again), you should be able to run the below cell without any errors.

# Importing libraries and reading files

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from datetime import datetime
from collections import defaultdict
import pickle
import seaborn as sns

pd.options.display.max_rows = 100 # Expanding to "n" the number of rows shown in the notebook using (.head(n))

In [None]:
# read in the json files
portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)
profile = pd.read_json('data/profile.json', orient='records', lines=True)
transcript = pd.read_json('data/transcript.json', orient='records', lines=True)

# 0. Introduction to the Method

The main objective of this work is to define which offers in the portfolio dataframe resonate best with each demographic. The demographics targeted in this study will be defined during development, while analyzing the information present in both the profile and transcript dataframes.

A supervised Machine Learning model will be used to identify the most effective offers for a potential customer, given any of the following information about that customer: age, income or sex.

As input, we shall provide any or all of the customer's information (income, age or sex). If any information is missing, the mode of the missing parameter will be used, filtered by a range of the other provided parameters (e.g. given sex = 'F' and age = 30, the income fed into the model will be the mode of the income for female customers aged between 27 and 33).
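
A minimal sketch of this imputation rule is shown below. The helper `impute_income`, its `age_window` parameter and the toy dataframe are hypothetical names introduced only for illustration; the notebook's own implementation may differ.

```python
import pandas as pd

def impute_income(profile, gender, age, age_window=3):
    """Return the most frequent income among customers of the given gender
    whose age lies within +/- age_window years of the given age."""
    in_range = profile["age"].between(age - age_window, age + age_window)
    mask = (profile["gender"] == gender) & in_range
    modes = profile.loc[mask, "income"].mode()
    # mode() can return several values (ties) or none (no matching rows).
    return modes.iloc[0] if not modes.empty else None

# Made-up customers, only to exercise the helper.
toy = pd.DataFrame({
    "gender": ["F", "F", "F", "F", "M"],
    "age":    [28,  30,  33,  45,  30],
    "income": [50000.0, 50000.0, 60000.0, 90000.0, 70000.0],
})
estimate = impute_income(toy, gender="F", age=30)
print(estimate)
```

When several incomes tie for the mode, this sketch takes the smallest one; that tie-breaking rule is an assumption.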

The output will be the offers that would spark the customer's interest.

The success/fail (1/0) criteria for the ML Model is not explicitly portrayed as a column in one of the dataframes. Therefore the "y" vector will be created based on conditions regarding the occurrence of transactions and interaction with sent offers.

Cases considered as successes:
1. No offer was sent but a transaction was still completed, that is, Starbucks saved money not sending the promotion and still made the sale, maximizing profit.
2. An offer was sent and completed, resulting in a sale, before the end of the validity period (after this period, we consider that the promotional offer had no effect on the transaction).

Cases considered as failures:
1. An offer was sent but no transaction resulted from it, that is, the offer was just "sent" or "viewed" and **not** "completed".
2. An offer was sent, a transaction occurred, but the customer never actually viewed the offer, meaning that the purchase would have been made despite the promotion.
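
The four cases above can be condensed into a small labelling function. This is a sketch under stated assumptions: the function name and its boolean arguments are hypothetical, and returning `None` for a customer who neither received an offer nor purchased is a choice made here, since the criteria do not cover that case.

```python
def label_outcome(offer_sent, viewed, completed_in_time, purchased):
    """Return 1 (success), 0 (failure), or None when the criteria do not cover the case."""
    if not offer_sent:
        # Success case 1: a sale happened without any promotion.
        return 1 if purchased else None
    if completed_in_time and viewed:
        # Success case 2: offer sent, viewed and completed within its validity period.
        return 1
    # Failure case 1 (not completed) or failure case 2 (completed but never viewed).
    return 0

print(label_outcome(offer_sent=True, viewed=False, completed_in_time=True, purchased=True))
```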

# 1. Initial Data Examination

In this section I examine the data and understand how it is assembled.

## 1.1 Portfolio Dataframe

This dataframe holds the **offers** information.

### 1.1.1 Portfolio Info:

In [None]:
print(portfolio.shape)

In [None]:
portfolio.head(10)

In [None]:
portfolio.info()

In [None]:
portfolio.describe()

## 1.2 Profile Dataframe

This dataframe holds the **customers** information.

### 1.2.1 Profile Dataframe Info

In [None]:
profile.shape

In [None]:
profile.head()

In [None]:
profile.info()

In [None]:
profile.describe()

In [None]:
# Pie chart visualization of gender column
total = profile[profile['gender'].notna()].shape[0]
labels = ['Male', 'Female', 'Other']
M_perc = profile[profile['gender'] == 'M'].shape[0] / total
F_perc = profile[profile['gender'] == 'F'].shape[0] / total
O_perc = profile[profile['gender'] == 'O'].shape[0] / total
data = [M_perc, F_perc , O_perc]
explode = (0, 0, 0.3)
colors = sns.color_palette('pastel')[0:3]

fig1, ax1 = plt.subplots()

plt.pie(data, labels = labels, colors = colors, explode=explode, radius=1.4, autopct='%.2f%%',normalize=False)
# ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

plt.show()

In [None]:
#Histogram for age:
sns.histplot(data=profile[profile['age'] != 118]['age'],bins=20)

In [None]:
print(profile[profile['age'] != 118]['age'].mean())
print(profile[profile['age'] != 118]['age'].var())
print(profile[profile['age'] != 118]['age'].std())

In [None]:
#Histogram for income:
sns.histplot(profile['income'],bins=20)

In [None]:
print(profile['income'].mean())
print(profile['income'].var())
print(profile['income'].std())

## 1.3 Transcript Dataframe 

This dataframe holds the records of the **transactions**.

### 1.3.1 Transcript Dataframe Info

In [None]:
transcript.shape

In [None]:
transcript.head()

In [None]:
transcript.info()

In [None]:
transcript.describe()

# 2. Data Wrangling

In this section, the data will be analyzed and prepared for the next steps.

## 2.1 Data Formatting

### 2.1.1 Portfolio dataframe

Turning the codes used for offer ids into a more convenient sequence:

In [None]:
# Create dictionary of offer_ids:
offer_ids = dict()
for i, offer_id in enumerate(portfolio.id.unique()):
    offer_ids[offer_id] = i+1

In [None]:
# Insert offer id values on portfolio dataframe:
portfolio.id = portfolio.id.apply(lambda entry: offer_ids[entry])

In [None]:
# Reordering and renaming the columns for easier data visualization:
port_cols = ['id', 'offer_type', 'reward', 'difficulty', 'duration', 'channels']
portfolio = portfolio[port_cols]
portfolio.columns = ['offer_id', 'offer_type', 'reward', 'difficulty', 'duration', 'channels']

Since the time in the transcript dataframe is in hours, the validity period of each offer in the portfolio dataframe must be converted to hours to be compared in the next steps.

In [None]:
# Transforming the "duration" column from days to hours
portfolio.duration = portfolio.duration.apply(lambda entry: entry*24)

### 2.1.2 Profile dataframe

Turning the code used for client ids into a logical sequence: 

In [None]:
# Create dictionary of client_ids:
client_ids = dict()
for i, client_id in enumerate(profile.id.unique()):
    client_ids[client_id] = i

In [None]:
# Insert user id values on profile dataframe:
profile.id = profile.id.apply(lambda entry: client_ids[entry])

In [None]:
profile.head()

Defining fidelity levels based on the "became_member_on" column dates:

- Between 29/07/2013 (First date) and 28/07/2014 : Level 1
- Between 29/07/2014 and 28/07/2015 : Level 2
- Between 29/07/2015 and 28/07/2016 : Level 3
- Between 29/07/2016 and 28/07/2017 : Level 4
- Between 29/07/2017 and 26/07/2018 (Last date) : Level 5
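
As a sketch of how these levels can be assigned without a row-by-row loop, the boundaries can be compared directly against the integer dates. It assumes, as in the schema above, that dates are stored as `YYYYMMDD` integers; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Dates stored as YYYYMMDD integers, as in the "became_member_on" column.
dates = pd.Series([20130729, 20140728, 20140729, 20160101, 20170729, 20180726])

# First day of levels 2 to 5; integer YYYYMMDD values sort chronologically.
level_starts = [20140729, 20150729, 20160729, 20170729]

# searchsorted counts how many level boundaries each date has reached.
fidelity = np.searchsorted(level_starts, dates.to_numpy(), side="right") + 1
print(fidelity)
```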

In [None]:
profile['fidelity'] = 0
# Dates are YYYYMMDD integers, so a difference of at most 9999 means "within one year of the start date".
# profile.loc[idx, col] is used instead of chained indexing, which may silently write to a copy.
for idx, series in profile.iterrows():
    if series['became_member_on'] - 20130729 <= 9999:
        profile.loc[idx, 'fidelity'] = 1
    elif series['became_member_on'] - 20140729 <= 9999:
        profile.loc[idx, 'fidelity'] = 2
    elif series['became_member_on'] - 20150729 <= 9999:
        profile.loc[idx, 'fidelity'] = 3
    elif series['became_member_on'] - 20160729 <= 9999:
        profile.loc[idx, 'fidelity'] = 4
    else:
        profile.loc[idx, 'fidelity'] = 5

In [None]:
# errors='ignore' keeps the cell safe to re-run without hiding unrelated errors:
profile.drop(columns='became_member_on', inplace=True, errors='ignore')

In [None]:
# Reordering columns for better visualization:
try:
    pro_cols = ['id', 'gender', 'age', 'income', 'fidelity']
    profile = profile[pro_cols]
    profile.columns = ['customer_id', 'gender', 'age', 'income', 'fidelity']
except:
    pass

profile.head()

In [None]:
# Fidelity histogram:
sns.histplot(data=profile['fidelity'])

### 2.1.3 Transcript dataframe

In [None]:
# Insert user id values on transcript dataframes:
transcript.person = transcript.person.apply(lambda entry: client_ids[entry])

Cleaning the "value" column on transcript dataframe:

This column was composed of dicts, which were not easily handled and held information that could be understood from other columns of the dataframe. Therefore, I decided to simplify it by turning it into a column of lists, where each list's entries are the offer sent and the amount spent by the targeted customer (in the cases where no offer was sent, but there was some kind of transaction, there will be only the amount spent; the same principle applies to cases where an offer was sent, but there was no actual transaction).

In [None]:
transcript.value

In [None]:
# Create list of dictionary values in transcript "value" column:
values = list()
for entry in transcript.value:
    values.append(list(entry.values()))

In [None]:
# Replacing the entries in column values with recently created list:
transcript.value = values

Splitting the "value" column:

The column was modified and now holds a list in each row. For the rows where "event" is "offer completed", the list contains both the offer code and the amount spent by the customer. Therefore, for better data handling, I chose to split the "value" column into two columns: "offer_id" and "spent". In the cases where no offer was sent, the value for "offer_id" will be 0. Likewise, in cases where no transaction was carried out, the "spent" column will be 0.
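
The split can also be sketched directly on the original dictionaries, picking values by type. The toy rows, the dictionary keys and the helper `first_of_type` are assumptions made for this illustration only.

```python
import pandas as pd

# Toy rows; the dictionary keys are hypothetical and used only for illustration.
toy = pd.DataFrame({
    "event": ["offer received", "transaction", "offer completed"],
    "value": [{"offer id": "abc"}, {"amount": 4.5}, {"offer_id": "abc", "amount": 10.0}],
})

def first_of_type(d, kind, default=0):
    """Return the first dictionary value of the given type, or a default."""
    return next((v for v in d.values() if isinstance(v, kind)), default)

# Strings are offer codes; numbers are amounts. Missing entries default to 0.
toy["offer_id"] = toy["value"].apply(first_of_type, kind=str)
toy["spent"] = toy["value"].apply(first_of_type, kind=(int, float))
print(toy[["event", "offer_id", "spent"]])
```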

In [None]:
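# Collecting the new column values in plain lists
# (much faster and safer than chained .iloc assignment on every row):
offer_col, spent_col = [], []

# Going through the values in "value" column and filling the new columns:
for lista in transcript.value:
    offer, spent = 0, 0  # defaults: no offer / nothing spent
    for item in lista:
        if isinstance(item, str):
            offer = item
        else:
            spent = item
    offer_col.append(offer)
    spent_col.append(spent)

# Creating columns:
transcript['offer_id'] = offer_col
transcript['spent'] = spent_col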

Turning the offer_id code into the offer_ids in the portfolio dataframe:

In [None]:
# Adding key:value 0:0 to account for transaction rows.
offer_ids[0] = 0 

# Insert offer id values into the transcript dataframe:
transcript.offer_id = transcript.offer_id.apply(lambda entry: offer_ids[entry])


In [None]:
# Dropping 'value' column, since the values on it are already separated:
transcript.drop(columns='value', inplace=True, errors='ignore')

In [None]:
# Reordering the columns for easier data handling:
trans_cols = ['person', 'event', 'offer_id', 'time', 'spent']
transcript = transcript[trans_cols]
transcript.columns = ['customer_id', 'event', 'offer_id', 'time', 'spent']

## 2.2 Treating NaNs

### 2.2.1 Profile Dataframe

In [None]:
transcript.head()

In [None]:
profile.head()

In [None]:
print(profile[(profile['age'] == 118)]['customer_id'].count())
print(profile.income.isna().sum())
print(profile.gender.isna().sum())
print(profile[(profile['age'] == 118) & (profile.income.isna()) & (profile.gender.isna())]['customer_id'].count())

This value of 2175 customers, when compared to the totality of customers in the database, corresponds to:

In [None]:
perc_nan = profile[(profile['age'] == 118) & (profile.income.isna()) & (profile.gender.isna())]['customer_id'].count() / profile.shape[0]
print(round(100 * perc_nan, 2),'% of all customers in the database.')

It is clear that the customers who did not provide their age also did not provide gender or income, since the number of rows where these features are simultaneously null equals the number of rows where each feature is individually null.
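
Equal counts strongly suggest the same customers, and this can be checked row by row. A minimal sketch on a made-up dataframe (the values are invented; only the column names follow the profile dataframe):

```python
import numpy as np
import pandas as pd

# Made-up customers: two of them have all three fields missing.
toy = pd.DataFrame({
    "age":    [118, 35, 118, 60],
    "gender": [None, "F", None, "M"],
    "income": [np.nan, 52000.0, np.nan, 81000.0],
})

age_missing = toy["age"] == 118          # 118 is the placeholder for a missing age
gender_missing = toy["gender"].isna()
income_missing = toy["income"].isna()

# The three masks describe the same customers if they agree on every row.
same_rows = bool((age_missing == gender_missing).all() and (gender_missing == income_missing).all())
print(same_rows)
```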

Therefore, the only information we have about these customers is the date they became members. Hence, I will check how frequently these customers (I'll refer to them as "NaN customers") appear in the transcript dataframe (divided by "event"):

In [None]:
# Getting the customers ids:
profile_nan_ids = np.array(profile[(profile['age'] == 118) & (profile.income.isna()) & (profile.gender.isna())]['customer_id'])

# Checking the transactions done by the "nan_ids":
nan_ids_df = transcript[transcript.customer_id.isin(profile_nan_ids)]

In [None]:
nan_transaction = nan_ids_df[nan_ids_df['event'] == 'transaction'].shape[0]
transcript_transaction = transcript[transcript['event'] == 'transaction'].shape[0]

nan_received = nan_ids_df[nan_ids_df['event'] == 'offer received'].shape[0]
transcript_received = transcript[transcript['event'] == 'offer received'].shape[0]

nan_viewed = nan_ids_df[nan_ids_df['event'] == 'offer viewed'].shape[0]
transcript_viewed = transcript[transcript['event'] == 'offer viewed'].shape[0]

nan_completed = nan_ids_df[nan_ids_df['event'] == 'offer completed'].shape[0]
transcript_completed = transcript[transcript['event'] == 'offer completed'].shape[0]

print('Transactions made by NaN customers are', round(100 * nan_transaction / transcript_transaction,3),'% of all transaction events.\n')
print('Offers received by NaN customers are', round(100 * nan_received / transcript_received,3),'% of all "offer received" events.\n')
print('Offers viewed by NaN customers are', round(100 * nan_viewed / transcript_viewed,3),'% of all "offer viewed" events.\n')
print('Offers completed by NaN customers are', round(100 * nan_completed / transcript_completed,3),'% of all "offer completed" events.\n')

Share of each event type among all transcript entries:

In [None]:
total = transcript.shape[0]
perc_transaction = transcript_transaction / total
perc_received = transcript_received / total
perc_viewed = transcript_viewed / total
perc_completed = transcript_completed / total
print('Transaction events are',round(100 * perc_transaction, 2),'% of all transcript entries.\n')
print('"Offer received" events are',round(100 * perc_received, 2),'% of all transcript entries.\n')
print('"Offer viewed" events are',round(100 * perc_viewed, 2),'% of all transcript entries.\n')
print('"Offer completed" events are',round(100 * perc_completed, 2),'% of all transcript entries.\n')

Finally, I'll check the percentage of all transcript entries (transactions and offer events) that belong to NaN customers.

In [None]:
print(round(100*(nan_transaction + nan_received + nan_viewed + nan_completed) / transcript.shape[0],3),'%')

Since the available information about NaN customers is very scarce and their share of transactions and of offers received, viewed and completed is small, I chose to drop both these customer ids from the profile dataframe and the records of these customers from the transcript dataframe.

In [None]:
# Dropping NaN customers from profile dataframe:

# Get index of NaN customers rows:
profile_nan_index = profile[profile['customer_id'].isin(profile_nan_ids)].index

In [None]:
# Dropping profile rows:
profile.drop(index=profile_nan_index, inplace=True)

In [None]:
# NaN check:
profile.isna().sum()

With that, I have cleaned the profile dataframe from NaN values.

In [None]:
# Dropping NaN customers from transcript dataframe:

# Get index of NaN customers rows:
transcript_nan_index = transcript[transcript['customer_id'].isin(profile_nan_ids)].index

In [None]:
# Dropping transcript rows:
transcript.drop(index=transcript_nan_index, inplace=True)

In [None]:
# NaN check:
transcript.isna().sum()

With that, I have aligned the transcript dataframe with the current NaN-free profile dataframe.

Checking the differences of the profile dataframe before and after NaN removal, we have:

In [None]:
# Checking the general state of the profile dataframe after NaN removal.
profile.describe()

Comparing the "income" and "age" columns from the 'profile.describe()' from the Initial Data Examination section:

<img src="data/images/profile_describe_initial.png"/>

We observe that:
1. The income values have not changed. This is expected, since `describe()` ignores NaN values, so the missing incomes were never influencing these statistics; it is also consistent with only the NaN rows having been removed.
2. The age percentiles changed, since the value 118 was being used as a NaN placeholder. With those rows gone, the age statistics are far more meaningful for our analysis.
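
The first observation follows from how pandas treats missing values: summary statistics skip NaN by default. A tiny sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

# Made-up incomes with one missing entry.
income = pd.Series([30000.0, 60000.0, np.nan, 90000.0])

# pandas skips NaN by default, so dropping them first changes nothing.
mean_with_nan = income.mean()
mean_without_nan = income.dropna().mean()
print(mean_with_nan, mean_without_nan)
```

A numeric placeholder such as 118 is not skipped, which is why the age statistics did change.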

## 2.3 Categorical Variables Dummyfication

In this subsection I will analyze which categorical variables need to undergo dummyfication and then perform it.

### 2.3.1 Portfolio Dataframe:

In [None]:
# Dummyfy the channels column:

# Getting the values in the lists of the column:
media = set()
for lista in portfolio.channels:
    for entry in lista:
        media.add(entry)

# Creating dummy columns:
portfolio[list(media)] = 0

# Creating dummy values:
for index, lista in enumerate(portfolio.channels):
    for entry in lista:
        portfolio.loc[index, entry] = 1

In [None]:
# Remove channels column:
portfolio.drop(columns='channels', inplace=True)

In [None]:
# Rename dummy columns by name, so the result does not depend on the
# arbitrary iteration order of the "media" set used to create them:
portfolio = portfolio.rename(columns={name: 'channel_' + name for name in media})

In [None]:
portfolio.head()

### 2.3.2 Profile Dataframe:

In [None]:
# Dummyfying the gender column (dtype=int yields 0/1 values instead of booleans on recent pandas versions):
profile[['F', 'M', 'O']] = pd.get_dummies(profile['gender'], dtype=int)

In [None]:
# Dropping the column:
profile.drop(columns='gender', inplace=True)

In [None]:
profile.head()

### 2.3.3 Transcript Dataframe:

Since I intend to use the response of a customer to a type of offer as a measure of success, it is interesting to get the dummy variables of the "event" column.

I will not drop any dummy column, because each one is used separately as an event flag in the next steps; a transaction, for instance, can occur whether or not a customer received an offer.

In [None]:
transcript.head()

In [None]:
# Creating dummy columns and variables (get_dummies orders the columns alphabetically by event name):
transcript[['offer_completed', 'offer_received', 'offer_viewed', 'transaction']] = pd.get_dummies(transcript['event'], dtype=int)

In [None]:
# Dropping the event column:
transcript.drop(columns='event', inplace=True, errors='ignore')

In [None]:
# Reordering the columns for better visualization:
trans_cols = ['customer_id', 'offer_id','offer_received','offer_viewed', 'offer_completed', 
                      'transaction', 'time', 'spent']
transcript = transcript[trans_cols]

In [None]:
transcript.head()

## 2.4 Final Wrangling

This subsection performs the data wrangling that could not be done in previous steps because it depended on their results:

### 2.4.1 Profile Dataframe

#### Age Range

Here, I will reduce the ages to a small number of ranges to avoid overspecialization in the classification process.

In [None]:
profile.head()

In [None]:
# Number of different ages in the dataframe:
profile.age.unique().size

In [None]:
# Age Range:
print('Min age:',profile.age.min(),'/ Max age:', profile.age.max())

In [None]:
plt.hist(profile.age)

The age range will be divided into 4 categories:

- From 18 to 40 years old (Young Adults / Adults): Category 1
- From 41 to 60 years old (Middle Age Adults): Category 2
- From 61 to 80 years old (Old Adults): Category 3
- From 81 to 101 years old (Elderly): Category 4
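
A compact way to express such bins is `pd.cut`. The sketch below uses made-up ages at the category boundaries; it is an alternative illustration, not necessarily the approach used in the cells that follow.

```python
import pandas as pd

# Made-up ages sitting on the category boundaries.
ages = pd.Series([18, 40, 41, 60, 61, 80, 81, 101])

# Bin edges are right-inclusive; include_lowest=True keeps age 18 in category 1.
age_range = pd.cut(ages, bins=[18, 40, 60, 80, 101], labels=[1, 2, 3, 4], include_lowest=True)
print(age_range.tolist())
```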

In [None]:
# Creating age range list:
range_ages = list()

for age in profile.age:
    if age <= 40:  # 18 to 40 (age 18 itself must land here, not in the else branch)
        range_ages.append(1)
    elif age <= 60:
        range_ages.append(2)
    elif age <= 80:
        range_ages.append(3)
    else:
        range_ages.append(4)

# Adding column to profile dataframe
profile['age_range'] = range_ages

In [None]:
# Dropping age column:
profile.drop(columns='age', inplace=True, errors='ignore')

In [None]:
profile.head()

#### Income range

Repeating the same process done for the age variable.

In [None]:
# Number of different incomes in the dataframe:
profile.income.unique().size

In [None]:
# Income Range:
print('Min income:',profile.income.min(),'/ Max income:', profile.income.max())

In [None]:
plt.hist(profile.income)

The income range will be divided into 4 categories:

- From 30,000 up to 50,000: Category 1
- Above 50,000 up to 70,000: Category 2
- Above 70,000 up to 90,000: Category 3
- Above 90,000 up to 120,000: Category 4

In [None]:
# Creating income range list:
range_incomes = list()

for income in profile.income:
    if income <= 50000:  # 30,000 to 50,000 (an income of exactly 30,000 must land here)
        range_incomes.append(1)
    elif income <= 70000:
        range_incomes.append(2)
    elif income <= 90000:
        range_incomes.append(3)
    else:
        range_incomes.append(4)

# Adding column to profile dataframe
profile['income_range'] = range_incomes

In [None]:
# Dropping income column:
profile.drop(columns='income', inplace=True, errors='ignore')

In [None]:
profile.head()

In [None]:
# Number of events:

received = transcript[transcript['offer_received'] == 1].shape[0]
viewed = transcript[transcript['offer_viewed'] == 1].shape[0]
completed = transcript[transcript['offer_completed'] == 1].shape[0]
transactions = transcript[transcript['transaction'] == 1].shape[0]

In [None]:
# Plot of offers:

sns.set_theme(style="whitegrid")
x = ['offers received', 'offers viewed', 'offers completed', 'transactions']
y = [received, viewed, completed, transactions]
fig, ax = plt.subplots()
plt.ylabel('Number of Events')
plt.title('Division of Events', fontsize=14)
colors = sns.color_palette('pastel')[0:4]
sns.barplot(x=x, y=y, palette=colors)

# Adding labels (a separate loop variable avoids overwriting the list "y"):
for i, count in enumerate(y):
    plt.text(i, count/2, count, ha = 'center')

fig.tight_layout()


In [None]:
profile.head()

In [None]:
profile.info()

After all this processing, I will save the dataframes to CSV files to avoid having to re-run all cells up to this point.

In [None]:
# Saving files to avoid having to run all previous cells.
profile.to_csv('data/profile_v0.csv')
transcript.to_csv('data/transcript_v0.csv')
portfolio.to_csv('data/portfolio_v0.csv')

In [None]:
# Cell used to load files:
profile = pd.read_csv('data/profile_v0.csv', index_col=0)
transcript = pd.read_csv('data/transcript_v0.csv', index_col=0)
portfolio = pd.read_csv('data/portfolio_v0.csv', index_col=0)

# 3. Machine Learning Model

## 3.1 Success/Fail Criteria

To assess the success or failure of an offer, we need to follow its evolution from the time it is delivered to a potential customer until its completion, when a transaction is made due to its influence.


### 3.1.1 Base dataframe for criteria definition

In this subsection, we will create the dataframe used to define if an offer was effective when presented to a customer (a criteria dataframe).

This dataframe will be based on the information contained on transcript dataframe and its creation will follow the steps:

1. The foundation of the dataframe is the set of offers sent to each customer; therefore each row will contain one pair (customer_id-offer_id).
2. The information for each pair will be summarized in the respective row: the number of events for the pair will be under the columns "offer_received", "offer_viewed" and "offer_completed" (e.g. if customer_id "1" received offer "3" two times, viewed it one time and completed it zero times, the values under these columns will be 2, 1 and 0, respectively).
3. The time when each event happened (counted from the beginning of the experiment, when t=0) will also be recorded.
4. Lastly, each offer's duration will be added to its row in the criteria dataframe.

Steps 1 and 2 are relevant because they simplify the information contained on the transcript dataframe, where the records about the pairs (customer-offer) were divided among many columns. In the criteria dataframe, these records are condensed in just one row.
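
The four steps can be sketched with a grouped aggregation on a toy event log. This is an illustration under stated assumptions, not the notebook's implementation: the toy rows, the `durations` mapping and the helper names are invented, and the column names mirror those of the transcript dataframe.

```python
import pandas as pd

# Toy event log for one customer-offer pair; times are in hours.
events = pd.DataFrame({
    "customer_id":     [1, 1, 1],
    "offer_id":        [3, 3, 3],
    "offer_received":  [1, 0, 1],
    "offer_viewed":    [0, 1, 0],
    "offer_completed": [0, 0, 0],
    "time":            [0, 6, 168],
})
durations = {3: 168}  # hypothetical offer duration in hours

flags = ["offer_received", "offer_viewed", "offer_completed"]
time_cols = []
for flag in flags:
    col = flag.replace("offer_", "time_")      # "offer_viewed" -> "time_viewed"
    # Keep the timestamp only on rows where this kind of event happened.
    events[col] = events["time"].where(events[flag] == 1)
    time_cols.append(col)

# Steps 1 to 3: one row per pair, with event counts and first timestamps (NaN if never).
grouped = events.groupby(["customer_id", "offer_id"])
criteria = pd.concat([grouped[flags].sum(), grouped[time_cols].min()], axis=1).reset_index()

# Step 4: attach the offer duration.
criteria["offer_duration"] = criteria["offer_id"].map(durations)
print(criteria)
```

In this sketch an event that never happened gets a NaN timestamp rather than 0, which keeps it distinguishable from an event at t=0.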

The third step matters when a customer receives an offer, completes it without opening it and only later views it. Because the record shows only the number of events for that (customer-offer) pair, the offer might appear to have been successful. The event timestamps correct this misconception: "time_viewed" will be larger than "time_completed". Since a customer may complete an offer without knowing about it, that ordering implies that the customer was going to make the purchase regardless of the offer.

When the same offer is sent to a customer two or more times, only the first timestamp of each event will be recorded, to keep the dataframe simple. In those cases, we will only compare the counts of "offer_received", "offer_viewed" and "offer_completed".

We noted that these criteria cannot be used for informational offers. The recording system has no measure for their completion, so the "offer_completed" column can never be positive for them under the method defined above. Therefore, an informational offer will be considered successful when both conditions below hold:

1. The customer received and viewed the informational offer;
2. The customer purchased any product during the duration of the offer provided on the portfolio dataframe.
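
These two conditions can be sketched as a small check. The function name and arguments are hypothetical, and counting the window from the moment of receipt is an assumption, since the text does not say whether the purchase must also come after the view.

```python
def informational_success(time_received, time_viewed, transaction_times, duration):
    """Return True if the offer was viewed and a purchase happened within its duration.

    All times are in hours; time_viewed is None when the offer was never viewed.
    """
    if time_viewed is None:
        return False
    deadline = time_received + duration
    # Assumption: the validity window runs from receipt to receipt + duration.
    return any(time_received <= t <= deadline for t in transaction_times)

print(informational_success(0, 12, [30, 200], 72))
```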

In [None]:
# Operations Dataframe creation (transcript dataframe with the duration of the offer):
operations = transcript.copy()

In [None]:
operations.head()

In [None]:
# Creation of dictionary linking the offer_id to its duration.
offer_duration = dict(zip(portfolio['offer_id'], portfolio['duration']))

In [None]:
# Insert offer_duration column on operations dataframe.
# Rows without an offer (offer_id 0) are not in the dictionary and get a duration of 0:
operations['offer_duration'] = operations['offer_id'].map(offer_duration).fillna(0).astype(int)

Splitting the operations dataframe into two:

1. df_o: Dataframe containing information on events related to offers, not transactions (receiving, viewing and completing offers).
2. df_t: Dataframe containing information on all transactions carried during the experiment.

In [None]:
df = operations.sort_values(by=['customer_id','offer_id','time'])

# Explicit copies, so that the new columns below are not written to a slice of df:
df_o = df[df['transaction'] != 1].copy()
df_t = df[df['transaction'] == 1].copy()

df_t[['time_received', 'time_viewed', 'time_completed', 'transaction_time']] = 0
df_o[['time_received', 'time_viewed', 'time_completed', 'transaction_time']] = 0

# Each row of df_o is exactly one offer event, so its time goes into the matching column (0 elsewhere):
df_o['time_received'] = df_o['time'].where(df_o['offer_received'] == 1, 0)
df_o['time_viewed'] = df_o['time'].where(df_o['offer_viewed'] == 1, 0)
df_o['time_completed'] = df_o['time'].where(df_o['offer_completed'] == 1, 0)

In [None]:
df_t['transaction_time'] = df_t['time']

Creation of offer dataframe with tags for separating each offer (pair customer-offer):

In [None]:
customers = np.sort(df_o.customer_id.unique())
customer_offer = defaultdict(list)

for customer in customers:
    vector = df_o[df_o['customer_id'] == customer]['offer_id'].unique()
    customer_offer[customer] = list(np.sort(vector))

In [None]:
tag = 1
df_o['offer_tag'] = 0

for customer in customers:
    for offer in customer_offer[customer]:
        # Create customer-offer dataframe:
        df_co = df_o[(df_o['customer_id'] == customer) & (df_o['offer_id'] == offer)]
        
        # Information about received offers:
        n_rec = df_co['offer_received'].sum() # Number of received offers
        idx_rec = list(df_co[df_co['offer_received'] == 1].index) # Indexes of received offers
        times_rec = list(df_co[df_co['offer_received'] == 1]['time']) # Times of received offers
        
        # Chronological order of received offers' timestamps:
        sorted_times_rec = sorted(times_rec)
        
        # Chronological order of indexes (received):
        order = [times_rec.index(x) for x in sorted_times_rec]     
        sorted_idx_rec = [idx_rec[x] for x in order]
        
        # Information about viewed offers
        n_view = df_co['offer_viewed'].sum() # Number of viewed offers
        idx_view = list(df_co[df_co['offer_viewed'] == 1].index) # Indexes of viewed offers
        times_view = list(df_co[df_co['offer_viewed'] == 1]['time']) # Times of viewed offers
        
        # Chronological order of viewed offers' timestamps:
        sorted_times_view = sorted(times_view)
        
        # Chronological order of indexes (viewed):
        order = [times_view.index(x) for x in sorted_times_view]     
        sorted_idx_view = [idx_view[x] for x in order]
        
        # Information about completed offers
        n_comp = df_co['offer_completed'].sum() # Number of completed offers
        idx_comp = list(df_co[df_co['offer_completed'] == 1].index) # Indexes of completed offers
        times_comp = list(df_co[df_co['offer_completed'] == 1]['time']) # Times of completed offers
        
        # Chronological order of completed offers' timestamps:
        sorted_times_comp = sorted(times_comp)
        
        # Chronological order of indexes (completed):
        order = [times_comp.index(x) for x in sorted_times_comp]     
        sorted_idx_comp = [idx_comp[x] for x in order]

        for offer_set in range(n_rec): # Populating offer_tag columns for each event of an offer:
            
            # Populating the tag column for received offer:
            df_o.loc[sorted_idx_rec[offer_set], 'offer_tag'] = tag
            
            # Populating the tag column for viewed offer (if there is one):
            try:
                df_o.loc[sorted_idx_view[offer_set], 'offer_tag'] = tag
            except IndexError:
                pass
            
            # Populating the tag column for completed offer (if there is one):
            try:
                df_o.loc[sorted_idx_comp[offer_set], 'offer_tag'] = tag
            except IndexError:
                pass
            
            tag += 1
            

In [None]:
df_t['offer_tag'] = 0
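
The tagging loop above pairs the k-th "received" event of a customer-offer pair with the k-th "viewed" and the k-th "completed" event. As a minimal sketch of the same pairing idea, the toy example below uses `groupby().cumcount()` and `ngroup()`. The `toy` frame, its single `event` column and the `occurrence` column are hypothetical and only serve the illustration; the notebook itself stores events in separate indicator columns.

```python
import pandas as pd

# Toy event log (made-up data): customer 1 receives offer 4 twice.
toy = pd.DataFrame({
    'customer_id': [1, 1, 1, 1, 1, 2],
    'offer_id':    [4, 4, 4, 4, 4, 4],
    'event':       ['received', 'viewed', 'received', 'viewed', 'completed', 'received'],
    'time':        [0, 6, 168, 170, 200, 0],
})

toy = toy.sort_values(['customer_id', 'offer_id', 'time'])

# k-th occurrence (0-based) of each event type within a customer-offer pair.
toy['occurrence'] = toy.groupby(['customer_id', 'offer_id', 'event']).cumcount()

# One tag per (customer, offer, occurrence); ngroup() numbers the groups from 0.
toy['offer_tag'] = toy.groupby(['customer_id', 'offer_id', 'occurrence']).ngroup() + 1

print(toy)
```

Because the pairing is purely by order of occurrence, the first completion (time 200) is attached to the first reception even though it happened after the second one. The loop above pairs events in the same way, so this is worth keeping in mind when interpreting the tags.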

Creating the criteria dataframe, where each row is a customer-offer pair, using only the offer dataframe (df_o):

In [None]:
criteria_offers = df_o.groupby(by=['customer_id','offer_id','offer_tag']).sum()
criteria_offers.reset_index(inplace=True)

Correction of offer_completed column for informational offers:

Criteria for a successful informational offer:

1. The offer was actually viewed by the customer;
2. A purchase was made after the receiving time (rt) and before the end of the offer validity period (ovp), that is, rt < purchase time < rt + ovp;
3. The purchase time is greater than the viewing time.

In [None]:
inf_offers = [3,8]

inf_df = criteria_offers[(criteria_offers['offer_id'].isin(inf_offers))] # Dataframe of informational offers

for idx, series in inf_df.iterrows(): # Checking all the rows with informational offers:
    if series['offer_viewed'] == 1: # Check only seen offers:
        customer = series['customer_id']
        t0 = series['time_received']
        t1 = series['time_viewed']
        # Renamed so that the offer_duration dictionary is not overwritten.
        # Caution: the groupby(...).sum() above also sums offer_duration over the event rows
        # of a tag (received + viewed), so this value may be a multiple of the portfolio duration.
        duration = series['offer_duration']
        t_trans = list(df_t[df_t['customer_id'] == customer]['time'])
        
        for t in t_trans:
            if t0 < t < (t0 + duration) and t > t1:
                criteria_offers.loc[idx, 'offer_completed'] = 1
                criteria_offers.loc[idx, 'time_completed'] = t
                break
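
As a sanity check, the three criteria for informational offers can be written as a small pure function. This is only a sketch: the function name `informational_success` and all numbers below are made up for illustration, and it assumes that the duration is expressed in the same unit as the event times.

```python
def informational_success(t_received, t_viewed, duration, purchase_times):
    """Return the first purchase time that satisfies the three criteria, or None."""
    for t in sorted(purchase_times):
        inside_validity = t_received < t < t_received + duration
        after_view = t > t_viewed
        if inside_validity and after_view:
            return t
    return None

# Offer received at 168 and viewed at 216, with a hypothetical duration of 72.
print(informational_success(168, 216, 72, [100, 222, 300]))  # 222
print(informational_success(168, 216, 72, [100, 210]))       # None
```

In the second call the purchase at 210 falls inside the validity period but happens before the offer was viewed, so it does not count.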

Dropping unnecessary columns:

In [None]:
criteria = criteria_offers.drop(columns=['offer_tag','transaction','time','spent','offer_duration','transaction_time'])
criteria.head()

Saving the results so that the steps above do not have to be rerun:

In [None]:
# Saving the dataframes created in this section:

criteria_offers.to_csv('data/criteria_offers_v0.csv')
criteria.to_csv('data/criteria_v0.csv')
operations.to_csv('data/operations_v0.csv')
df_o.to_csv('data/df_o_v0.csv')

In [2]:
# Cell used to load files:
criteria_offers = pd.read_csv('data/criteria_offers_v0.csv', index_col=0)
criteria = pd.read_csv('data/criteria_v0.csv', index_col=0)
operations = pd.read_csv('data/operations_v0.csv', index_col=0)
df_o = pd.read_csv('data/df_o_v0.csv', index_col=0)

In [6]:
criteria.head()

Unnamed: 0,customer_id,offer_id,offer_received,offer_viewed,offer_completed,time_received,time_viewed,time_completed
0,1,3,1,0,0,504.0,0.0,0.0
1,1,4,1,0,1,408.0,0.0,528.0
2,3,1,1,1,1,408.0,408.0,510.0
3,3,4,1,1,1,0.0,6.0,132.0
4,3,8,1,1,1,168.0,216.0,222.0


### 3.1.2 Success/Fail Vector (ML Output Vector)

None of the dataframes used so far has a column that directly expresses success or failure. Since this label is essential for a Machine Learning model, the criterion must be defined by the data scientist.

The conditions for success have already been presented on section "Introduction to the Method", and they are:

Success Conditions (Output 1):
1. No offer was sent but a transaction was still completed, that is, Starbucks saved money not sending the promotion and still made the sale, maximizing profit.
2. An offer was sent and completed, resulting in a sale, before the end of the validity period (after this period, we consider that the promotional offer had no effect over the transaction).

Failure Condition (Output 0):
1. An offer was sent but no transaction resulted from it, that is, the offer was just "sent" or "viewed" and **not** "completed".
2. An offer was sent, a transaction occurred, but the customer never actually viewed the offer, meaning that the purchase would have been made despite the promotion.

<!-- outside of the offer's validity period, implying that the offer, even if seen, was ignored. Leading us to believe that the transaction would take place even if no offer was sent. -->

Since our objective is to decide which customers we should **send** offers to, Success Condition 1 will be represented in our ML model as a failure (0): if a customer already intends to purchase a given product, the offer has no effect on Starbucks's revenue and therefore should not be sent.

In [None]:
criteria['send_offer'] = 0

for idx, series in criteria.iterrows():
    if (series['offer_viewed'] != 0) and (series['offer_completed'] != 0): # The customer SAW and COMPLETED the offer.
        if (series['time_completed'] >= series['time_viewed']): # The customer saw the offer no later than completing it.
            criteria.loc[idx, 'send_offer'] = 1

In [None]:
criteria.head()
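
The labelling rule can also be written without a loop. The sketch below applies it to the five rows shown in the `criteria.head()` output above; the `sample` frame is only a hand-copied excerpt of those rows, not the full dataframe.

```python
import pandas as pd

# The relevant columns of the five rows shown in the criteria.head() output.
sample = pd.DataFrame({
    'offer_viewed':    [0, 0, 1, 1, 1],
    'offer_completed': [0, 1, 1, 1, 1],
    'time_viewed':     [0.0, 0.0, 408.0, 6.0, 216.0],
    'time_completed':  [0.0, 528.0, 510.0, 132.0, 222.0],
})

# The offer was both viewed and completed ...
viewed_and_completed = (sample['offer_viewed'] != 0) & (sample['offer_completed'] != 0)
# ... and the completion did not happen before the view.
viewed_first = sample['time_completed'] >= sample['time_viewed']

sample['send_offer'] = (viewed_and_completed & viewed_first).astype(int)
print(sample['send_offer'].tolist())  # [0, 0, 1, 1, 1]
```

The second row illustrates Failure Condition 2: the offer was completed but never viewed, so it is labelled 0.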

In [None]:
# Saving the dataframes created in this section:

criteria.to_csv('data/criteria_v1.csv')

In [None]:
# Loading the dataframe:

criteria = pd.read_csv('data/criteria_v1.csv',index_col=0)

## 3.2  ML Data Preparation

This subsection prepares the data for the creation of the Machine Learning model.

### 3.2.1 Assembly of the Input Data Dataframe

After all the data preparation, it is time to define what will be used as input to the ML model and to create a dataframe that allows simple visualization and handling of this data.

1. Profile dataframe:

From this dataframe we will use:
- customer_id: for customer identification throughout dataframes;
- age: direct input for the ML model;
- income: direct input for the ML model;
- sex (dummy variables):  direct input for the ML model.

2. Portfolio dataframe:

From this dataframe, the information to be used is:
- offer_id: for offer identification throughout dataframes;

3. Transcript dataframe:

From this dataframe, we will use:
- customer_id: linking the entry to the information on the profile dataframe;
- offer_id: linking the entry to the information on the portfolio dataframe;

Surprisingly, we don't need to know how much a customer spent at Starbucks. The choice not to include the amount spent is backed by:
- The focus of this study is to know whether a promotion was effective for a customer, not to find a function that maximizes profit, so the amount spent is not necessary.
- Other information in the dataframes already tells us whether an offer resonated with a customer.

Having the criteria dataframe and the information about the customers, we can create a dataframe that will gather all the data the ML model will need to fit a chosen classification algorithm.

Observing the criteria dataframe, it is clear that all the needed data coming from the portfolio and transcript is already in it. So now we just need to match the customer_ids in the criteria dataframe to their respective information in the profile dataframe.

In [None]:
# Adding customer data from the profile dataframe on the ml_input dataframe.
# A left merge on customer_id replaces the row-by-row lookup; it assumes that
# customer_id is unique in the profile dataframe:
profile_cols = ['customer_id','age_range','income_range','F','M','O','fidelity']
ml_input = criteria.merge(profile[profile_cols], on='customer_id', how='left')

In [None]:
# Removing columns that will not be used by the ML model from the ml_input dataframe
# (errors='ignore' skips listed columns that are not present, instead of aborting the whole drop):
ml_input.drop(columns=['customer_id', 'offer_received', 'offer_viewed', 'offer_completed', 'time_received', 'time_viewed', 'time_completed', 'offer_duration'], errors='ignore', inplace=True)

In [None]:
# Rearranging the columns:
ml_input = ml_input[['offer_id','age_range','income_range','M','F','O','fidelity','send_offer']]

In [None]:
# Saving the recently created ml_input dataframe:

ml_input.to_csv('data/ml_input_v0.csv')

In [None]:
# Loading the ml_input dataframe:
ml_input = pd.read_csv('data/ml_input_v0.csv',index_col=0)

In [None]:
ml_input.head()

In [None]:
ml_input.shape

With that, we have the dataframe that will serve as base for the ML model.

### 3.2.2 Input Data Exploration

Now that we have the input dataframe for the ML model, we will extract some summary statistics to use as reference points once the prediction model is finished.

In [None]:
# For easier typing:
df = ml_input.copy()

In [None]:
# Getting offers ids:
bogo_offers = portfolio[portfolio['offer_type'] == 'bogo']['offer_id']
inf_offers = portfolio[portfolio['offer_type'] == 'informational']['offer_id']
disc_offers = portfolio[portfolio['offer_type'] == 'discount']['offer_id']

#### GENERAL - Completion rates by offer type

In [None]:
# Total offers that were completed:
offers_1 = df[df['send_offer'] == 1].shape[0]

In [None]:
# How well BOGO offers do:
bogo_1 = df[(df['send_offer'] == 1) & (df['offer_id'].isin(bogo_offers))].shape[0]
bogo_perc = bogo_1 / offers_1

# How well Informational offers do:
inf_1 = df[(df['send_offer'] == 1) & (df['offer_id'].isin(inf_offers))].shape[0]
inf_perc = inf_1 / offers_1

# How well discount offers do:
disc_1 = df[(df['send_offer'] == 1) & (df['offer_id'].isin(disc_offers))].shape[0]
disc_perc = disc_1 / offers_1

print(f'Percentage of completed offers that were BOGO: {round(bogo_perc*100,2)}%')
print(f'Percentage of completed offers that were Informational: {round(inf_perc*100,2)}%')
print(f'Percentage of completed offers that were Discount: {round(disc_perc*100,2)}%')

In [None]:
# Plotting pie chart:
data = [bogo_perc, inf_perc, disc_perc]
labels = ['Buy One - Get One', 'Informational','Discount']
colors = sns.color_palette('pastel')[0:3]

plt.pie(data, labels = labels, colors = colors, autopct='%.2f%%')
plt.title('Division of completed offer types', fontsize = 14)
plt.show()

As expected, the most completed offer types are **Discount** and **BOGO**.

#### GENERAL - Completion rates WITHIN each offer type

In [None]:
# BOGO offers:
bogo_total = df[(df['offer_id'].isin(bogo_offers))].shape[0]
bogo_effec = bogo_1 / bogo_total

print(f'{round(bogo_effec*100,2)}% of sent BOGO offers were completed.')

# Informational offers:
inf_total = df[df['offer_id'].isin(inf_offers)].shape[0]
inf_effec = inf_1 / inf_total

print(f'{round(inf_effec*100,2)}% of sent Informational offers were completed.')

# Discount offers:
disc_total = df[(df['offer_id'].isin(disc_offers))].shape[0]
disc_effec = disc_1 / disc_total

print(f'{round(disc_effec*100,2)}% of sent Discount offers were completed.')
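
The two views computed so far (share of all successes per offer type, and success rate within each offer type) can also be obtained in one pass with `groupby`. The sketch below uses a toy frame with made-up values; the `offer_type` column is an assumption and would first have to be derived from `offer_id` via the portfolio dataframe.

```python
import pandas as pd

# Toy stand-in for the ML input dataframe (made-up values).
toy = pd.DataFrame({
    'offer_type': ['bogo', 'bogo', 'discount', 'discount', 'discount', 'informational'],
    'send_offer': [1, 0, 1, 1, 0, 0],
})

# Number of sent offers and number of successes per offer type.
by_type = toy.groupby('offer_type')['send_offer'].agg(sent='count', successes='sum')

# Success rate WITHIN each type, and each type's share of ALL successes.
by_type['rate_within_type'] = by_type['successes'] / by_type['sent']
by_type['share_of_successes'] = by_type['successes'] / by_type['successes'].sum()

print(by_type)
```

Keeping both columns side by side makes it easier to see when a type has a large share of successes mainly because many offers of that type were sent.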


In [None]:
# Plotting barplot of results:
# Overall completion rate of offers:

sns.set_theme(style="whitegrid")

y = [bogo_effec*100, inf_effec*100, disc_effec*100]
x = ['BOGO', 'Informational', 'Discount']
colors = sns.color_palette('pastel')[0:4]

sns.barplot(x=x, y=y, palette=colors)
plt.xlabel('Type of Offer')
plt.ylabel('Overall Completion Rates')
plt.title('Completion Rates for Different Offer Types', fontsize = 14)

for i,y in enumerate(y):
    plt.text(i, y/2, f'{round(y,2)}%', ha = 'center')


In [None]:
# Number of offers:

sns.set_theme(style="whitegrid")

y = [bogo_total, inf_total, disc_total]
x = ['BOGO', 'Informational', 'Discount']
colors = sns.color_palette('pastel')[0:4]

sns.barplot(x=x, y=y, palette=colors)
plt.xlabel('Type of Offer')
plt.ylabel('Number of Sent Offers')
plt.title('Number of Offers divided by Type', fontsize = 14)

for i,y in enumerate(y):
    plt.text(i, y/2, y, ha = 'center')

#### AGE - Age x Overall Completion Rate

In [None]:
# Percentage of age 1 users that completed an offer
# (multiplying before rounding avoids floating-point artifacts in the printed value):
cr_age1 = round(df[(df['age_range'] == 1)]['send_offer'].mean() * 100, 1) # Completion rate for age 1
print(cr_age1)

# Percentage of age 2 users that completed an offer:
cr_age2 = round(df[(df['age_range'] == 2)]['send_offer'].mean() * 100, 1) # Completion rate for age 2
print(cr_age2)

# Percentage of age 3 users that completed an offer:
cr_age3 = round(df[(df['age_range'] == 3)]['send_offer'].mean() * 100, 1) # Completion rate for age 3
print(cr_age3)

# Percentage of age 4 users that completed an offer:
cr_age4 = round(df[(df['age_range'] == 4)]['send_offer'].mean() * 100, 1) # Completion rate for age 4
print(cr_age4)

In [None]:
# Data Visualization:
sns.set_theme(style="whitegrid")

y = [cr_age1, cr_age2, cr_age3, cr_age4]
x = ['18-40', '40-60', '60-80', '80-101']
colors = sns.color_palette('pastel')[0:4]

sns.barplot(x=x, y=y, palette=colors)
plt.xlabel('Age Ranges')
plt.ylabel('Overall Completion Rates')
plt.title('Overall Completion Rate x Age Range', fontsize = 14)

for i,y in enumerate(y):
    plt.text(i, y/2,  f'{round(y,2)}%', ha = 'center')

In [None]:
# Checking number of customers in each age range:

n_1 = profile[profile['age_range'] == 1].shape[0]
n_2 = profile[profile['age_range'] == 2].shape[0]
n_3 = profile[profile['age_range'] == 3].shape[0]
n_4 = profile[profile['age_range'] == 4].shape[0]

sns.set_theme(style="whitegrid")

y = [n_1, n_2, n_3, n_4]
x = ['18-40', '40-60', '60-80', '80-101']
colors = sns.color_palette('pastel')[0:4]

sns.barplot(x=x, y=y, palette=colors)
plt.ylabel('Number of Individuals')
plt.xlabel('Age Range')
# plt.title('', fontsize = 14)

for i,y in enumerate(y):
    plt.text(i, y/2, y, ha = 'center')

The offer completion rate is pretty **similar** (around 40%) amongst all age ranges.

#### AGE - Age distribution among the customers who responded to an offer

In [None]:
# General effectiveness of offers divided by age range:
# Out of all the people who completed the offers, what's the age range division? 

# Getting the number of "yes" by age:
n_age1 = df[(df['age_range'] == 1) & (df['send_offer'] == 1)].shape[0]
n_age2 = df[(df['age_range'] == 2) & (df['send_offer'] == 1)].shape[0]
n_age3 = df[(df['age_range'] == 3) & (df['send_offer'] == 1)].shape[0]
n_age4 = df[(df['age_range'] == 4) & (df['send_offer'] == 1)].shape[0]

perc_age1 = n_age1/offers_1
perc_age2 = n_age2/offers_1
perc_age3 = n_age3/offers_1
perc_age4 = n_age4/offers_1

print(perc_age1, perc_age2, perc_age3, perc_age4)

In [None]:
# Pie chart

data = [perc_age1, perc_age2, perc_age3, perc_age4]
labels = ['Ages: 18-40', 'Ages: 40-60','Ages: 60-80', 'Ages: 80-101']
colors = sns.color_palette('pastel')[0:4]

plt.pie(data, labels = labels, colors = colors, autopct='%.2f%%')
plt.title('Age distribution of customers who completed offers', fontsize = 14)
plt.show()

The age range responsible for the most completed offers is 40 to 60 years old. However, judging by the age histogram of the data set, this can be a consequence of the number of customers within this age range, since we have already seen that the offer completion rate is approximately the same for all age ranges.
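
The reasoning can be made explicit: a group's share of all completed offers is its number of sent offers times its completion rate, divided by the same product summed over all groups. When the rate is the same everywhere, it cancels out and the share equals the group's share of sent offers. A minimal sketch with made-up group sizes (not the values of this data set):

```python
# Hypothetical number of sent offers per age range and one common completion rate.
sizes = {'18-40': 200, '40-60': 400, '60-80': 300, '80-101': 100}
rate = 0.4

# Completed offers per group and each group's share of all completed offers.
successes = {group: n * rate for group, n in sizes.items()}
total_successes = sum(successes.values())
shares = {group: s / total_successes for group, s in successes.items()}

total_sent = sum(sizes.values())
for group, share in shares.items():
    print(f'{group}: {share:.0%} of completed offers, {sizes[group] / total_sent:.0%} of sent offers')
```

With equal rates, the largest group necessarily accounts for the largest slice of the pie chart.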

#### AGE -  Most effective offer type within each age range

In [None]:
# Effectiveness of offer types divided by age range:
# Level 1 (18-40):
yes_age1 = df[(df['age_range'] == 1) & (df['send_offer'] == 1)]
n_age1 = yes_age1.shape[0]

# BOGO offers
bogo_age1_perc = yes_age1[yes_age1['offer_id'].isin(bogo_offers)].shape[0]/n_age1

# Informational offers:
inf_age1_perc = yes_age1[yes_age1['offer_id'].isin(inf_offers)].shape[0]/n_age1

# Discount offers:
disc_age1_perc = yes_age1[yes_age1['offer_id'].isin(disc_offers)].shape[0]/n_age1

print(f'Out of all completed offers: {100*round(bogo_age1_perc,3)}% are BOGO offers, \
{100*round(inf_age1_perc,3)}% are informational offers and {100*round(disc_age1_perc,3)}% are discount offers.')

# Plotting pie chart:

data = [bogo_age1_perc, inf_age1_perc, disc_age1_perc]
labels = ['BOGO offers', 'Informational offers', 'Discount offers']
colors = sns.color_palette('pastel')[0:4]

plt.pie(data, labels = labels, colors = colors, autopct='%.2f%%')
plt.title('Offer effectiveness for customers between 18 and 40 years old', fontsize = 14)
plt.show()

In [None]:
# Range 1 BOGO:
all_bogo = df[(df['offer_id'].isin(bogo_offers)) & (df['age_range'] == 1)]
n_bogo = all_bogo.shape[0]
all_bogo1 = all_bogo[all_bogo['send_offer'] == 1]
n_bogo1 = all_bogo1.shape[0]
bogo_comp = n_bogo1 / n_bogo

# Range 1 discount:
all_disc = df[(df['offer_id'].isin(disc_offers)) & (df['age_range'] == 1)]
n_disc = all_disc.shape[0]
all_disc1 = all_disc[all_disc['send_offer'] == 1]
n_disc1 = all_disc1.shape[0]
disc_comp = n_disc1 / n_disc

# Range 1 Informational:
all_inf = df[(df['offer_id'].isin(inf_offers)) & (df['age_range'] == 1)]
n_inf = all_inf.shape[0]
all_inf1 = all_inf[all_inf['send_offer'] == 1]
n_inf1 = all_inf1.shape[0]
inf_comp = n_inf1 / n_inf

# Data Visualization:
sns.set_theme(style="whitegrid")

y = [bogo_comp, inf_comp, disc_comp]
x = ['BOGO', 'Informational', 'Discount']
colors = sns.color_palette('pastel')[0:4]

sns.barplot(x=x, y=y, palette=colors)
plt.ylabel('Completion Rates')
plt.title('[18-40] Age Range Offer Completion for Each Offer Type', fontsize = 14)
plt.xlabel('Type of Offer')

for i,y in enumerate(y):
    plt.text(i, y/2, f'{round(y*100,1)}%', ha = 'center')

The most appealing type of offer for customers between 18 and 40 years old is **Discount**.

In [None]:
# Effectiveness of offer types divided by age range:
# Level 2 (40-60):
yes_age2 = df[(df['age_range'] == 2) & (df['send_offer'] == 1)]
n_age2 = yes_age2.shape[0]

# BOGO offers
bogo_age2_perc = yes_age2[yes_age2['offer_id'].isin(bogo_offers)].shape[0]/n_age2

# Informational offers:
inf_age2_perc = yes_age2[yes_age2['offer_id'].isin(inf_offers)].shape[0]/n_age2

# Discount offers:
disc_age2_perc = yes_age2[yes_age2['offer_id'].isin(disc_offers)].shape[0]/n_age2

print(f'Out of all completed offers: {100*round(bogo_age2_perc,3)}% are BOGO offers, \
{100*round(inf_age2_perc,3)}% are informational offers and {100*round(disc_age2_perc,3)}% are discount offers.')

In [None]:
# Plotting pie chart:

data = [bogo_age2_perc, inf_age2_perc, disc_age2_perc]
labels = ['BOGO offers', 'Informational offers', 'Discount offers']
colors = sns.color_palette('pastel')[0:4]

plt.pie(data, labels = labels, colors = colors, autopct='%.2f%%')
plt.title('Offer effectiveness for customers between 40 and 60 years old', fontsize = 14)
plt.show()

In [None]:
# Range 2 BOGO:
all_bogo = df[(df['offer_id'].isin(bogo_offers)) & (df['age_range'] == 2)]
n_bogo = all_bogo.shape[0]
all_bogo1 = all_bogo[all_bogo['send_offer'] == 1]
n_bogo1 = all_bogo1.shape[0]
bogo_comp = n_bogo1 / n_bogo

# Range 2 discount:
all_disc = df[(df['offer_id'].isin(disc_offers)) & (df['age_range'] == 2)]
n_disc = all_disc.shape[0]
all_disc1 = all_disc[all_disc['send_offer'] == 1]
n_disc1 = all_disc1.shape[0]
disc_comp = n_disc1 / n_disc

# Range 2 Informational:
all_inf = df[(df['offer_id'].isin(inf_offers)) & (df['age_range'] == 2)]
n_inf = all_inf.shape[0]
all_inf1 = all_inf[all_inf['send_offer'] == 1]
n_inf1 = all_inf1.shape[0]
inf_comp = n_inf1 / n_inf

# Data Visualization:
sns.set_theme(style="whitegrid")

y = [bogo_comp, inf_comp, disc_comp]
x = ['BOGO', 'Informational', 'Discount']
colors = sns.color_palette('pastel')[0:4]

sns.barplot(x=x, y=y, palette=colors)
plt.ylabel('Completion Rates')
plt.title('[40-60] Age Range Offer Completion for Each Offer Type', fontsize = 14)
plt.xlabel('Type of Offer')

for i,y in enumerate(y):
    plt.text(i, y/2, f'{round(y*100,1)}%', ha = 'center')

In [None]:
# Level 3 (60-80):
yes_age3 = df[(df['age_range'] == 3) & (df['send_offer'] == 1)]
n_age3 = yes_age3.shape[0]

# BOGO offers
bogo_age3_perc = yes_age3[yes_age3['offer_id'].isin(bogo_offers)].shape[0]/n_age3

# Informational offers:
inf_age3_perc = yes_age3[yes_age3['offer_id'].isin(inf_offers)].shape[0]/n_age3

# Discount offers:
disc_age3_perc = yes_age3[yes_age3['offer_id'].isin(disc_offers)].shape[0]/n_age3

print(f'Out of all completed offers: {100*round(bogo_age3_perc,3)}% are BOGO offers, \
{100*round(inf_age3_perc,3)}% are informational offers and {100*round(disc_age3_perc,3)}% are discount offers.')

In [None]:
# Plotting pie chart:

data = [bogo_age3_perc, inf_age3_perc, disc_age3_perc]
labels = ['BOGO offers', 'Informational offers', 'Discount offers']
colors = sns.color_palette('pastel')[0:4]

plt.pie(data, labels = labels, colors = colors, autopct='%.2f%%')
plt.title('Offer effectiveness for customers between 60 and 80 years old', fontsize = 14)
plt.show()

In [None]:
# Range 3 BOGO:
all_bogo = df[(df['offer_id'].isin(bogo_offers)) & (df['age_range'] == 3)]
n_bogo = all_bogo.shape[0]
all_bogo1 = all_bogo[all_bogo['send_offer'] == 1]
n_bogo1 = all_bogo1.shape[0]
bogo_comp = n_bogo1 / n_bogo

# Range 3 discount:
all_disc = df[(df['offer_id'].isin(disc_offers)) & (df['age_range'] == 3)]
n_disc = all_disc.shape[0]
all_disc1 = all_disc[all_disc['send_offer'] == 1]
n_disc1 = all_disc1.shape[0]
disc_comp = n_disc1 / n_disc

# Range 3 Informational:
all_inf = df[(df['offer_id'].isin(inf_offers)) & (df['age_range'] == 3)]
n_inf = all_inf.shape[0]
all_inf1 = all_inf[all_inf['send_offer'] == 1]
n_inf1 = all_inf1.shape[0]
inf_comp = n_inf1 / n_inf

# Data Visualization:
sns.set_theme(style="whitegrid")

y = [bogo_comp, inf_comp, disc_comp]
x = ['BOGO', 'Informational', 'Discount']
colors = sns.color_palette('pastel')[0:4]

sns.barplot(x=x, y=y, palette=colors)
plt.ylabel('Completion Rates')
plt.title('[60-80] Age Range Offer Completion for Each Offer Type', fontsize = 14)
plt.xlabel('Type of Offer')

for i,y in enumerate(y):
    plt.text(i, y/2, f'{round(y*100,1)}%', ha = 'center')

In [None]:
# Level 4 (80-101):
yes_age4 = df[(df['age_range'] == 4) & (df['send_offer'] == 1)]
n_age4 = yes_age4.shape[0]

# BOGO offers
bogo_age4_perc = yes_age4[yes_age4['offer_id'].isin(bogo_offers)].shape[0]/n_age4

# Informational offers:
inf_age4_perc = yes_age4[yes_age4['offer_id'].isin(inf_offers)].shape[0]/n_age4

# Discount offers:
disc_age4_perc = yes_age4[yes_age4['offer_id'].isin(disc_offers)].shape[0]/n_age4

print(f'Out of all completed offers: {100*round(bogo_age4_perc,3)}% are BOGO offers, \
{100*round(inf_age4_perc,3)}% are informational offers and {100*round(disc_age4_perc,3)}% are discount offers.')

In [None]:
# Plotting pie chart:

data = [bogo_age4_perc, inf_age4_perc, disc_age4_perc]
labels = ['BOGO offers', 'Informational offers', 'Discount offers']
colors = sns.color_palette('pastel')[0:4]

plt.pie(data, labels = labels, colors = colors, autopct='%.2f%%')
plt.title('Offer effectiveness for customers between 80 and 101 years old', fontsize = 14)
plt.show()

In [None]:
# Range 4 BOGO:
all_bogo = df[(df['offer_id'].isin(bogo_offers)) & (df['age_range'] == 4)]
n_bogo = all_bogo.shape[0]
all_bogo1 = all_bogo[all_bogo['send_offer'] == 1]
n_bogo1 = all_bogo1.shape[0]
bogo_comp = n_bogo1 / n_bogo

# Range 4 discount:
all_disc = df[(df['offer_id'].isin(disc_offers)) & (df['age_range'] == 4)]
n_disc = all_disc.shape[0]
all_disc1 = all_disc[all_disc['send_offer'] == 1]
n_disc1 = all_disc1.shape[0]
disc_comp = n_disc1 / n_disc

# Range 4 Informational:
all_inf = df[(df['offer_id'].isin(inf_offers)) & (df['age_range'] == 4)]
n_inf = all_inf.shape[0]
all_inf1 = all_inf[all_inf['send_offer'] == 1]
n_inf1 = all_inf1.shape[0]
inf_comp = n_inf1 / n_inf

# Data Visualization:
sns.set_theme(style="whitegrid")

y = [bogo_comp, inf_comp, disc_comp]
x = ['BOGO', 'Informational', 'Discount']
colors = sns.color_palette('pastel')[0:4]

sns.barplot(x=x, y=y, palette=colors)
plt.ylabel('Completion Rates')
plt.title('[80-101] Age Range Offer Completion for Each Offer Type', fontsize = 14)
plt.xlabel('Type of Offer')

for i,y in enumerate(y):
    plt.text(i, y/2, f'{round(y*100,1)}%', ha = 'center')
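
The per-age-range cells above repeat the same computation four times. As a compact alternative, a pivot table gives the completion rate for every combination of age range and offer type at once. This is only a sketch on a toy frame with made-up values; the `offer_type` column is an assumption and would have to be derived from `offer_id` via the portfolio dataframe.

```python
import pandas as pd

# Toy stand-in for the ML input dataframe (made-up values).
toy = pd.DataFrame({
    'age_range':  [1, 1, 1, 2, 2, 2, 2, 3],
    'offer_type': ['bogo', 'discount', 'discount', 'bogo', 'bogo',
                   'informational', 'discount', 'bogo'],
    'send_offer': [0, 1, 1, 1, 0, 0, 1, 1],
})

# Mean of the 0/1 label = completion rate per (age range, offer type) cell.
rates = toy.pivot_table(index='age_range', columns='offer_type',
                        values='send_offer', aggfunc='mean')

print((rates * 100).round(1))
```

Cells without any sent offer show up as NaN, which makes missing combinations visible instead of raising a division-by-zero error.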

#### GENDER - Gender x Overall Completion Rate

In [None]:
# Percentage of male users that completed an offer:
cr_m = round(df[(df['M'] == 1)]['send_offer'].mean() * 100, 1) # Completion rate for males
print(cr_m)

# Percentage of female users that completed an offer:
cr_f = round(df[(df['F'] == 1)]['send_offer'].mean() * 100, 1) # Completion rate for females
print(cr_f)

# Percentage of users of other genders that completed an offer (rounded to a whole percent):
cr_o = round(df[(df['O'] == 1)]['send_offer'].mean() * 100, 0) # Completion rate for others
print(cr_o)

In [None]:
# Data Visualization:
sns.set_theme(style="whitegrid")

y = [cr_m, cr_f, cr_o]
x = ['Males', 'Females', 'Others']
colors = sns.color_palette('pastel')[0:4]

sns.barplot(x=x, y=y, palette=colors)
plt.ylabel('Overall Completion Rates')
plt.title('Overall Completion Rate x Gender', fontsize = 14)

for i,y in enumerate(y):
    plt.text(i, y/2, f'{y}%', ha = 'center')

In [None]:
# Checking number of customers in each gender:

males = profile[profile['M'] == 1].shape[0]
females = profile[profile['F'] == 1].shape[0]
others = profile[profile['O'] == 1].shape[0]

sns.set_theme(style="whitegrid")

y = [males, females, others]
x = ['n_Males', 'n_Females', 'n_Others']
colors = sns.color_palette('pastel')[0:4]

sns.barplot(x=x, y=y, palette=colors)
plt.ylabel('Number of Individuals')
# plt.title('', fontsize = 14)

for i,y in enumerate(y):
    plt.text(i, y/2, y, ha = 'center')

Since the number of customers who identify themselves as "Other" is considerably smaller than the number of "Males" and "Females", the observations will focus on the results for "Male" and "Female".

That said, the numbers related to "Others" will still be shown.
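
One way to see why a small group deserves more caution is the standard error of a proportion, sqrt(p(1-p)/n), which grows as the group gets smaller. The sketch below uses made-up group sizes and the normal approximation; it also treats offer records as independent, which is a simplification since one customer can receive several offers.

```python
import math

def rate_standard_error(rate, n):
    """Normal-approximation standard error of a proportion estimated from n records."""
    return math.sqrt(rate * (1 - rate) / n)

# Hypothetical group sizes: the same observed rate is far less certain in the small group.
for label, n in [('large group', 8000), ('small group', 200)]:
    se = rate_standard_error(0.4, n)
    print(f'{label}: 40.0% +/- {1.96 * se * 100:.1f} percentage points (approx. 95% interval)')
```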

#### GENDER - Gender distribution among the customers who responded to an offer

In [None]:
# General effectiveness of offers divided by gender:
# Out of all the people who completed the offers, what's their gender? 

# Getting the number of "completed" by gender:
n_m = df[(df['M'] == 1) & (df['send_offer'] == 1)].shape[0]
n_f = df[(df['F'] == 1) & (df['send_offer'] == 1)].shape[0]
n_o = df[(df['O'] == 1) & (df['send_offer'] == 1)].shape[0]

perc_m = round(n_m/offers_1, 2) * 100
perc_f = round(n_f/offers_1, 2) * 100
perc_o = round(n_o/offers_1, 2) * 100


print(perc_m, perc_f, perc_o)

In [None]:
# Pie chart

data = [perc_m, perc_f, perc_o]
labels = ['Males', 'Females', 'Others']
colors = sns.color_palette('pastel')[0:4]

plt.pie(data, labels = labels, colors = colors, autopct='%.2f%%')
plt.title('Gender distribution of customers who completed offers', fontsize = 14)
plt.show()

#### GENDER -  Most effective offer type within each gender

In [None]:
# Effectiveness of offer types divided by gender:
# Male:
yes_m = df[(df['M'] == 1) & (df['send_offer'] == 1)]
n_m = yes_m.shape[0]

# BOGO offers
bogo_m_perc = yes_m[yes_m['offer_id'].isin(bogo_offers)].shape[0]/n_m

# Informational offers:
inf_m_perc = yes_m[yes_m['offer_id'].isin(inf_offers)].shape[0]/n_m

# Discount offers:
disc_m_perc = yes_m[yes_m['offer_id'].isin(disc_offers)].shape[0]/n_m

print(f'Out of all completed offers: {100*round(bogo_m_perc,3)}% are BOGO offers, \
{100*round(inf_m_perc,3)}% are informational offers and {100*round(disc_m_perc,3)}% are discount offers.')

# Plotting pie chart:

data = [bogo_m_perc, inf_m_perc, disc_m_perc]
labels = ['BOGO offers', 'Informational offers', 'Discount offers']
colors = sns.color_palette('pastel')[0:4]

plt.pie(data, labels = labels, colors = colors, autopct='%.2f%%')
plt.title('Offer effectiveness for male customers', fontsize = 14)
plt.show()

In [None]:
# Male BOGO:
all_bogo_m = df[(df['offer_id'].isin(bogo_offers)) & (df['M'] == 1)]
n_bogo_m = all_bogo_m.shape[0]
all_bogo_m1 = all_bogo_m[all_bogo_m['send_offer'] == 1]
n_bogo_m1 = all_bogo_m1.shape[0]
bogo_comp_m = n_bogo_m1 / n_bogo_m

# Male discount:
all_disc_m = df[(df['offer_id'].isin(disc_offers)) & (df['M'] == 1)]
n_disc_m = all_disc_m.shape[0]
all_disc_m1 = all_disc_m[all_disc_m['send_offer'] == 1]
n_disc_m1 = all_disc_m1.shape[0]
disc_comp_m = n_disc_m1 / n_disc_m

# Male Informational:
all_inf_m = df[(df['offer_id'].isin(inf_offers)) & (df['M'] == 1)]
n_inf_m = all_inf_m.shape[0]
all_inf_m1 = all_inf_m[all_inf_m['send_offer'] == 1]
n_inf_m1 = all_inf_m1.shape[0]
inf_comp_m = n_inf_m1 / n_inf_m

# Data Visualization:
sns.set_theme(style="whitegrid")

y = [bogo_comp_m, inf_comp_m, disc_comp_m]
x = ['BOGO', 'Informational', 'Discount']
colors = sns.color_palette('pastel')[0:4]

sns.barplot(x=x, y=y, palette=colors)
plt.ylabel('Male - Completion Rates')
plt.title('Male Offer Completion for Each Offer Type', fontsize = 14)
plt.xlabel('Type of Offer')

for i,y in enumerate(y):
    plt.text(i, y/2, f'{round(y*100,1)}%', ha = 'center')

In [None]:
# Female:
yes_f = df[(df['F'] == 1) & (df['send_offer'] == 1)]
n_f = yes_f.shape[0]

# BOGO offers
bogo_f_perc = yes_f[yes_f['offer_id'].isin(bogo_offers)].shape[0]/n_f

# Informational offers:
inf_f_perc = yes_f[yes_f['offer_id'].isin(inf_offers)].shape[0]/n_f

# Discount offers:
disc_f_perc = yes_f[yes_f['offer_id'].isin(disc_offers)].shape[0]/n_f

print(f'Out of all completed offers: {100*round(bogo_f_perc,3)}% are BOGO offers, \
{100*round(inf_f_perc,3)}% are informational offers and {100*round(disc_f_perc,3)}% are discount offers.')

# Plotting pie chart:

data = [bogo_f_perc, inf_f_perc, disc_f_perc]
labels = ['BOGO offers', 'Informational offers', 'Discount offers']
colors = sns.color_palette('pastel')[0:4]

plt.pie(data, labels = labels, colors = colors, autopct='%.2f%%')
plt.title('Offer effectiveness for female customers', fontsize = 14)
plt.show()

In [None]:
# Female BOGO:
all_bogo = df[(df['offer_id'].isin(bogo_offers)) & (df['F'] == 1)]
n_bogo = all_bogo.shape[0]
all_bogo1 = all_bogo[all_bogo['send_offer'] == 1]
n_bogo1 = all_bogo1.shape[0]
bogo_comp = n_bogo1 / n_bogo

# Female discount:
all_disc = df[(df['offer_id'].isin(disc_offers)) & (df['F'] == 1)]
n_disc = all_disc.shape[0]
all_disc1 = all_disc[all_disc['send_offer'] == 1]
n_disc1 = all_disc1.shape[0]
disc_comp = n_disc1 / n_disc

# Female Informational:
all_inf = df[(df['offer_id'].isin(inf_offers)) & (df['F'] == 1)]
n_inf = all_inf.shape[0]
all_inf1 = all_inf[all_inf['send_offer'] == 1]
n_inf1 = all_inf1.shape[0]
inf_comp = n_inf1 / n_inf

# Data Visualization:
sns.set_theme(style="whitegrid")

y = [bogo_comp, inf_comp, disc_comp]
x = ['BOGO', 'Informational', 'Discount']
colors = sns.color_palette('pastel')[0:4]

sns.barplot(x=x, y=y, palette=colors)
plt.ylabel('Female - Completion Rates')
plt.title('Female Offer Completion for Each Offer Type', fontsize = 14)
plt.xlabel('Type of Offer')

for i,y in enumerate(y):
    plt.text(i, y/2, f'{round(y*100,1)}%', ha = 'center')

In [None]:
# Other gender:
yes_o = df[(df['O'] == 1) & (df['send_offer'] == 1)]
n_o = yes_o.shape[0]

# BOGO offers
bogo_o_perc = yes_o[yes_o['offer_id'].isin(bogo_offers)].shape[0]/n_o

# Informational offers:
inf_o_perc = yes_o[yes_o['offer_id'].isin(inf_offers)].shape[0]/n_o

# Discount offers:
disc_o_perc = yes_o[yes_o['offer_id'].isin(disc_offers)].shape[0]/n_o

print(f'Out of all completed offers: {100*round(bogo_o_perc,3)}% are BOGO offers, \
{100*round(inf_o_perc,3)}% are informational offers and {100*round(disc_o_perc,3)}% are discount offers.')

# Plotting pie chart:

data = [bogo_o_perc, inf_o_perc, disc_o_perc]
labels = ['BOGO offers', 'Informational offers', 'Discount offers']
colors = sns.color_palette('pastel')[0:4]

plt.pie(data, labels = labels, colors = colors, autopct='%.2f%%')
plt.title('Offer effectiveness for "Other" customers', fontsize = 14)
plt.show()

In [None]:
# Other BOGO:
all_bogo = df[(df['offer_id'].isin(bogo_offers)) & (df['O'] == 1)]
n_bogo = all_bogo.shape[0]
all_bogo1 = all_bogo[all_bogo['send_offer'] == 1]
n_bogo1 = all_bogo1.shape[0]
bogo_comp = n_bogo1 / n_bogo

# Other discount:
all_disc = df[(df['offer_id'].isin(disc_offers)) & (df['O'] == 1)]
n_disc = all_disc.shape[0]
all_disc1 = all_disc[all_disc['send_offer'] == 1]
n_disc1 = all_disc1.shape[0]
disc_comp = n_disc1 / n_disc

# Other Informational:
all_inf = df[(df['offer_id'].isin(inf_offers)) & (df['O'] == 1)]
n_inf = all_inf.shape[0]
all_inf1 = all_inf[all_inf['send_offer'] == 1]
n_inf1 = all_inf1.shape[0]
inf_comp = n_inf1 / n_inf

# Data Visualization:
sns.set_theme(style="whitegrid")

y = [bogo_comp, inf_comp, disc_comp]
x = ['BOGO', 'Informational', 'Discount']
colors = sns.color_palette('pastel')[0:4]

sns.barplot(x=x, y=y, palette=colors)
plt.ylabel('Other Gender - Completion Rates')
plt.title('Other Gender Offer Completion for Each Offer Type', fontsize = 14)
plt.xlabel('Type of Offer')

for i,y in enumerate(y):
    plt.text(i, y/2, f'{round(y*100,1)}%', ha = 'center')

#### INCOME - Income x Overall Completion Rate
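
The per-range filters in this section can also be expressed as a single `groupby`. The following is a minimal sketch on a small made-up frame: `toy` and its values are illustrative, and only the column names `income_range` and `send_offer` are taken from the notebook.

```python
import pandas as pd

# Toy stand-in for the notebook's `df` (values are made up for illustration).
toy = pd.DataFrame({
    'income_range': [1, 1, 1, 1, 2, 2, 3, 3, 3, 4],
    'send_offer':   [1, 0, 0, 1, 1, 1, 0, 1, 1, 1],
})

# The mean of a 0/1 column is the share of ones, i.e. the completion rate.
completion_rate = toy.groupby('income_range')['send_offer'].mean().mul(100).round(1)
print(completion_rate)
```

Because the result is indexed by `income_range`, it can be passed to a bar plot directly instead of building the `y` list by hand.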

In [None]:
# Percentage of offers completed by users with income_range = 1:
cr_inc1 = round(df[(df['income_range'] == 1)]['send_offer'].mean(), 2) * 100 # Completion rate for income range 1
print(cr_inc1)

# Percentage of offers completed by users with income_range = 2:
cr_inc2 = round(df[(df['income_range'] == 2)]['send_offer'].mean(), 2) * 100 # Completion rate for income range 2
print(cr_inc2)

# Percentage of offers completed by users with income_range = 3:
cr_inc3 = round(df[(df['income_range'] == 3)]['send_offer'].mean(), 2) * 100 # Completion rate for income range 3
print(cr_inc3)

# Percentage of offers completed by users with income_range = 4:
cr_inc4 = round(df[(df['income_range'] == 4)]['send_offer'].mean(), 2) * 100 # Completion rate for income range 4
print(cr_inc4)

# Data Visualization:
sns.set_theme(style="whitegrid")

y = [cr_inc1, cr_inc2, cr_inc3, cr_inc4]
x = ['30,000 - 50,000', '50,000 - 70,000', '70,000 - 90,000', '90,000 - 120,000']
colors = sns.color_palette('pastel')[0:4]

sns.barplot(x=x, y=y, palette=colors)
plt.ylabel('Overall Completion Rates')
plt.title('Overall Completion Rate x Income Range', fontsize = 14)

for i,y in enumerate(y):
    plt.text(i, y/2, f'{y}%', ha = 'center')

In [None]:
# Checking number of customers in each income range:

inc1 = profile[profile['income_range'] == 1].shape[0]
inc2 = profile[profile['income_range'] == 2].shape[0]
inc3 = profile[profile['income_range'] == 3].shape[0]
inc4 = profile[profile['income_range'] == 4].shape[0]

sns.set_theme(style="whitegrid")

y = [inc1, inc2, inc3, inc4]
x = ['30k-50k', '50k-70k', '70k-90k', '90k-120k']
colors = sns.color_palette('pastel')[0:4]

sns.barplot(x=x, y=y, palette=colors)
plt.ylabel('Number of Individuals')
plt.xlabel('Income Ranges')
# plt.title('', fontsize = 14)

for i,y in enumerate(y):
    plt.text(i, y/2, y, ha = 'center')

#### INCOME - Among customers who completed an offer, how are the income ranges distributed?

In [None]:
# General effectiveness of offers divided by income:
# Out of all the people who completed the offers, what's their income range? 

# Getting the number of "completed" by income:
n_inc1 = df[(df['income_range'] == 1) & (df['send_offer'] == 1)].shape[0]
n_inc2 = df[(df['income_range'] == 2) & (df['send_offer'] == 1)].shape[0]
n_inc3 = df[(df['income_range'] == 3) & (df['send_offer'] == 1)].shape[0]
n_inc4 = df[(df['income_range'] == 4) & (df['send_offer'] == 1)].shape[0]

perc_inc1 = round(n_inc1/offers_1, 2) * 100
perc_inc2 = round(n_inc2/offers_1, 2) * 100
perc_inc3 = round(n_inc3/offers_1, 2) * 100
perc_inc4 = round(n_inc4/offers_1, 2) * 100

print(perc_inc1, perc_inc2, perc_inc3, perc_inc4)

# Pie chart

data = [perc_inc1, perc_inc2, perc_inc3, perc_inc4]
labels = ['30k-50k', '50k-70k', '70k-90k', '90k-120k']
colors = sns.color_palette('pastel')[0:4]

plt.pie(data, labels = labels, colors = colors, autopct='%.2f%%')
plt.title('Income distribution among customers who completed an offer', fontsize = 14)
plt.show()

#### INCOME -  Most effective offer type within each income range
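
The cells in this section report two different quantities for each group: the share of completed offers that belongs to each offer type (pie charts) and the completion rate of each offer type (bar charts). The sketch below shows how the two differ on made-up data. The `offer_type` column is a hypothetical stand-in for the notebook's `offer_id.isin(...)` filters.

```python
import pandas as pd

# Toy data for a single income range (values are made up for illustration).
toy = pd.DataFrame({
    'income_range': [1] * 10,
    'offer_type': ['bogo'] * 4 + ['discount'] * 4 + ['informational'] * 2,
    'send_offer': [1, 1, 0, 0, 1, 1, 1, 0, 1, 0],
})

# (a) Share of completions by offer type: the denominator is all completions.
completed = toy[toy['send_offer'] == 1]
share = completed['offer_type'].value_counts(normalize=True)

# (b) Completion rate per offer type: the denominator is all offers of that type.
rate = toy.groupby('offer_type')['send_offer'].mean()

print(share.round(3))
print(rate.round(3))
```

In this toy example informational offers have the same completion rate as BOGO offers but a smaller share of completions, simply because fewer of them were sent. The share therefore depends on how many offers of each type were distributed, while the rate does not.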

In [None]:
# Effectiveness of offer types divided by income range:
# Range1:
yes_inc1 = df[(df['income_range'] == 1) & (df['send_offer'] == 1)]
n_inc1 = yes_inc1.shape[0]

# BOGO offers
bogo_inc1_perc = yes_inc1[yes_inc1['offer_id'].isin(bogo_offers)].shape[0]/n_inc1

# Informational offers:
inf_inc1_perc = yes_inc1[yes_inc1['offer_id'].isin(inf_offers)].shape[0]/n_inc1

# Discount offers:
disc_inc1_perc = yes_inc1[yes_inc1['offer_id'].isin(disc_offers)].shape[0]/n_inc1

print(f'Out of all completed offers: {100*round(bogo_inc1_perc,2)}% are BOGO offers, \
{100*round(inf_inc1_perc,2)}% are informational offers and {100*round(disc_inc1_perc,2)}% are discount offers.')

# Plotting pie chart:

data = [bogo_inc1_perc, inf_inc1_perc, disc_inc1_perc]
labels = ['BOGO offers', 'Informational offers', 'Discount offers']
colors = sns.color_palette('pastel')[0:4]

plt.pie(data, labels = labels, colors = colors, autopct='%.2f%%')
plt.title('Offer effectiveness for customers on income range level 1 (30k-50k)', fontsize = 14)
plt.show()

In [None]:
# Range 1 BOGO:
all_bogo = df[(df['offer_id'].isin(bogo_offers)) & (df['income_range'] == 1)]
n_bogo = all_bogo.shape[0]
all_bogo1 = all_bogo[all_bogo['send_offer'] == 1]
n_bogo1 = all_bogo1.shape[0]
bogo_comp = n_bogo1 / n_bogo

# Range 1 discount:
all_disc = df[(df['offer_id'].isin(disc_offers)) & (df['income_range'] == 1)]
n_disc = all_disc.shape[0]
all_disc1 = all_disc[all_disc['send_offer'] == 1]
n_disc1 = all_disc1.shape[0]
disc_comp = n_disc1 / n_disc

# Range 1 Informational:
all_inf = df[(df['offer_id'].isin(inf_offers)) & (df['income_range'] == 1)]
n_inf = all_inf.shape[0]
all_inf1 = all_inf[all_inf['send_offer'] == 1]
n_inf1 = all_inf1.shape[0]
inf_comp = n_inf1 / n_inf

# Data Visualization:
sns.set_theme(style="whitegrid")

y = [bogo_comp, inf_comp, disc_comp]
x = ['BOGO', 'Informational', 'Discount']
colors = sns.color_palette('pastel')[0:4]

sns.barplot(x=x, y=y, palette=colors)
plt.ylabel('Completion Rates')
plt.title('[30k-50k] Income Range Offer Completion for Each Offer Type', fontsize = 14)
plt.xlabel('Type of Offer')

for i,y in enumerate(y):
    plt.text(i, y/2, f'{round(y*100,1)}%', ha = 'center')

In [None]:
# Range 2:
yes_inc2 = df[(df['income_range'] == 2) & (df['send_offer'] == 1)]
n_inc2 = yes_inc2.shape[0]

# BOGO offers
bogo_inc2_perc = yes_inc2[yes_inc2['offer_id'].isin(bogo_offers)].shape[0]/n_inc2

# Informational offers:
inf_inc2_perc = yes_inc2[yes_inc2['offer_id'].isin(inf_offers)].shape[0]/n_inc2

# Discount offers:
disc_inc2_perc = yes_inc2[yes_inc2['offer_id'].isin(disc_offers)].shape[0]/n_inc2

print(f'Out of all completed offers: {100*round(bogo_inc2_perc,2)}% are BOGO offers, \
{100*round(inf_inc2_perc,2)}% are informational offers and {100*round(disc_inc2_perc,2)}% are discount offers.')

# Plotting pie chart:

data = [bogo_inc2_perc, inf_inc2_perc, disc_inc2_perc]
labels = ['BOGO offers', 'Informational offers', 'Discount offers']
colors = sns.color_palette('pastel')[0:4]

plt.pie(data, labels = labels, colors = colors, autopct='%.2f%%')
plt.title('Offer effectiveness for customers on income range level 2 (50k-70k)', fontsize = 14)
plt.show()

In [None]:
# Range 2 BOGO:
all_bogo = df[(df['offer_id'].isin(bogo_offers)) & (df['income_range'] == 2)]
n_bogo = all_bogo.shape[0]
all_bogo1 = all_bogo[all_bogo['send_offer'] == 1]
n_bogo1 = all_bogo1.shape[0]
bogo_comp = n_bogo1 / n_bogo

# Range 2 discount:
all_disc = df[(df['offer_id'].isin(disc_offers)) & (df['income_range'] == 2)]
n_disc = all_disc.shape[0]
all_disc1 = all_disc[all_disc['send_offer'] == 1]
n_disc1 = all_disc1.shape[0]
disc_comp = n_disc1 / n_disc

# Range 2 Informational:
all_inf = df[(df['offer_id'].isin(inf_offers)) & (df['income_range'] == 2)]
n_inf = all_inf.shape[0]
all_inf1 = all_inf[all_inf['send_offer'] == 1]
n_inf1 = all_inf1.shape[0]
inf_comp = n_inf1 / n_inf

# Data Visualization:
sns.set_theme(style="whitegrid")

y = [bogo_comp, inf_comp, disc_comp]
x = ['BOGO', 'Informational', 'Discount']
colors = sns.color_palette('pastel')[0:4]

sns.barplot(x=x, y=y, palette=colors)
plt.ylabel('Completion Rates')
plt.title('[50k-70k] Income Range Offer Completion for Each Offer Type', fontsize = 14)
plt.xlabel('Type of Offer')

for i,y in enumerate(y):
    plt.text(i, y/2, f'{round(y*100,1)}%', ha = 'center')

In [None]:
# Range 3:
yes_inc3 = df[(df['income_range'] == 3) & (df['send_offer'] == 1)]
n_inc3 = yes_inc3.shape[0]

# BOGO offers
bogo_inc3_perc = yes_inc3[yes_inc3['offer_id'].isin(bogo_offers)].shape[0]/n_inc3

# Informational offers:
inf_inc3_perc = yes_inc3[yes_inc3['offer_id'].isin(inf_offers)].shape[0]/n_inc3

# Discount offers:
disc_inc3_perc = yes_inc3[yes_inc3['offer_id'].isin(disc_offers)].shape[0]/n_inc3

print(f'Out of all completed offers: {100*round(bogo_inc3_perc,2)}% are BOGO offers, \
{100*round(inf_inc3_perc,2)}% are informational offers and {100*round(disc_inc3_perc,2)}% are discount offers.')

# Plotting pie chart:

data = [bogo_inc3_perc, inf_inc3_perc, disc_inc3_perc]
labels = ['BOGO offers', 'Informational offers', 'Discount offers']
colors = sns.color_palette('pastel')[0:4]

plt.pie(data, labels = labels, colors = colors, autopct='%.2f%%')
plt.title('Offer effectiveness for customers on income range level 3 (70k-90k)', fontsize = 14)
plt.show()

In [None]:
# Range 3 BOGO:
all_bogo = df[(df['offer_id'].isin(bogo_offers)) & (df['income_range'] == 3)]
n_bogo = all_bogo.shape[0]
all_bogo1 = all_bogo[all_bogo['send_offer'] == 1]
n_bogo1 = all_bogo1.shape[0]
bogo_comp = n_bogo1 / n_bogo

# Range 3 discount:
all_disc = df[(df['offer_id'].isin(disc_offers)) & (df['income_range'] == 3)]
n_disc = all_disc.shape[0]
all_disc1 = all_disc[all_disc['send_offer'] == 1]
n_disc1 = all_disc1.shape[0]
disc_comp = n_disc1 / n_disc

# Range 3 Informational:
all_inf = df[(df['offer_id'].isin(inf_offers)) & (df['income_range'] == 3)]
n_inf = all_inf.shape[0]
all_inf1 = all_inf[all_inf['send_offer'] == 1]
n_inf1 = all_inf1.shape[0]
inf_comp = n_inf1 / n_inf

# Data Visualization:
sns.set_theme(style="whitegrid")

y = [bogo_comp, inf_comp, disc_comp]
x = ['BOGO', 'Informational', 'Discount']
colors = sns.color_palette('pastel')[0:4]

sns.barplot(x=x, y=y, palette=colors)
plt.ylabel('Completion Rates')
plt.title('[70k-90k] Income Range Offer Completion for Each Offer Type', fontsize = 14)
plt.xlabel('Type of Offer')

for i,y in enumerate(y):
    plt.text(i, y/2, f'{round(y*100,1)}%', ha = 'center')

In [None]:
# Range 4:
yes_inc4 = df[(df['income_range'] == 4) & (df['send_offer'] == 1)]
n_inc4 = yes_inc4.shape[0]

# BOGO offers
bogo_inc4_perc = yes_inc4[yes_inc4['offer_id'].isin(bogo_offers)].shape[0]/n_inc4

# Informational offers:
inf_inc4_perc = yes_inc4[yes_inc4['offer_id'].isin(inf_offers)].shape[0]/n_inc4

# Discount offers:
disc_inc4_perc = yes_inc4[yes_inc4['offer_id'].isin(disc_offers)].shape[0]/n_inc4

print(f'Out of all completed offers: {100*round(bogo_inc4_perc,2)}% are BOGO offers, \
{100*round(inf_inc4_perc,2)}% are informational offers and {100*round(disc_inc4_perc,2)}% are discount offers.')

# Plotting pie chart:

data = [bogo_inc4_perc, inf_inc4_perc, disc_inc4_perc]
labels = ['BOGO offers', 'Informational offers', 'Discount offers']
colors = sns.color_palette('pastel')[0:4]

plt.pie(data, labels = labels, colors = colors, autopct='%.2f%%')
plt.title('Offer effectiveness for customers on income range level 4 (90k-120k)', fontsize = 14)
plt.show()

In [None]:
# Range 4 BOGO:
all_bogo = df[(df['offer_id'].isin(bogo_offers)) & (df['income_range'] == 4)]
n_bogo = all_bogo.shape[0]
all_bogo1 = all_bogo[all_bogo['send_offer'] == 1]
n_bogo1 = all_bogo1.shape[0]
bogo_comp = n_bogo1 / n_bogo

# Range 4 discount:
all_disc = df[(df['offer_id'].isin(disc_offers)) & (df['income_range'] == 4)]
n_disc = all_disc.shape[0]
all_disc1 = all_disc[all_disc['send_offer'] == 1]
n_disc1 = all_disc1.shape[0]
disc_comp = n_disc1 / n_disc

# Range 4 Informational:
all_inf = df[(df['offer_id'].isin(inf_offers)) & (df['income_range'] == 4)]
n_inf = all_inf.shape[0]
all_inf1 = all_inf[all_inf['send_offer'] == 1]
n_inf1 = all_inf1.shape[0]
inf_comp = n_inf1 / n_inf

# Data Visualization:
sns.set_theme(style="whitegrid")

y = [bogo_comp, inf_comp, disc_comp]
x = ['BOGO', 'Informational', 'Discount']
colors = sns.color_palette('pastel')[0:4]

sns.barplot(x=x, y=y, palette=colors)
plt.ylabel('Completion Rates')
plt.title('[90k-120k] Income Range Offer Completion for Each Offer Type', fontsize = 14)
plt.xlabel('Type of Offer')

for i,y in enumerate(y):
    plt.text(i, y/2, f'{round(y*100,1)}%', ha = 'center')

#### FIDELITY - Fidelity Level x Overall Completion Rate
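
The comments in the cells of this section associate each fidelity level with a one-year date window. As a rough sketch, and assuming the levels are derived from the date a customer became a member, such a binning could be done with `pd.cut`. The column name `became_member_on` and the toy dates are assumptions made for illustration; only the window boundaries come from the notebook's comments.

```python
import pandas as pd

# Toy membership dates; `became_member_on` is an assumed column name.
toy = pd.DataFrame({
    'became_member_on': pd.to_datetime(
        ['2013-07-29', '2014-07-28', '2014-07-29', '2016-01-15', '2018-07-26']),
})

# Right-closed bin edges, so each level runs from 29 July of one year to
# 28 July of the next (the last level ends on 26/07/2018).
edges = pd.to_datetime(['2013-07-28', '2014-07-28', '2015-07-28',
                        '2016-07-28', '2017-07-28', '2018-07-26'])
toy['fidelity'] = pd.cut(toy['became_member_on'], bins=edges,
                         labels=[1, 2, 3, 4, 5]).astype(int)
print(toy)
```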

In [None]:
# Percentage of users with fidelity = 1 that completed an offer (Between 29/07/2013 and 28/07/2014):
cr_f1 = round(df[(df['fidelity'] == 1)]['send_offer'].mean(), 3) * 100 # Completion rate for fidelity=1
print(cr_f1)

# Percentage of users with fidelity = 2 that completed an offer (Between 29/07/2014 and 28/07/2015):
cr_f2 = round(df[(df['fidelity'] == 2)]['send_offer'].mean(), 3) * 100 # Completion rate for fidelity=2
print(cr_f2)

# Percentage of users with fidelity = 3 that completed an offer (Between 29/07/2015 and 28/07/2016):
cr_f3 = round(df[(df['fidelity'] == 3)]['send_offer'].mean(), 3) * 100 # Completion rate for fidelity=3
print(cr_f3)

# Percentage of users with fidelity = 4 that completed an offer (Between 29/07/2016 and 28/07/2017):
cr_f4 = round(df[(df['fidelity'] == 4)]['send_offer'].mean(), 2) * 100 # Completion rate for fidelity=4
print(cr_f4)

# Percentage of users with fidelity = 5 that completed an offer (Between 29/07/2017 and 26/07/2018):
cr_f5 = round(df[(df['fidelity'] == 5)]['send_offer'].mean(), 2) * 100 # Completion rate for fidelity=5
print(cr_f5)

# Data Visualization:
sns.set_theme(style="whitegrid")

y = [cr_f1, cr_f2, cr_f3, cr_f4, cr_f5]
x = ['Fidelity 1', 'Fidelity 2', 'Fidelity 3', 'Fidelity 4', 'Fidelity 5']
colors = sns.color_palette('pastel')[0:6]

sns.barplot(x=x, y=y, palette=colors)
plt.ylabel('Overall Completion Rates')
plt.title('Overall Completion Rate x Fidelity Level', fontsize = 14)

for i,y in enumerate(y):
    plt.text(i, y/2, f'{y}%', ha = 'center')

In [None]:
# Checking number of customers in each fidelity level:

f1 = profile[profile['fidelity'] == 1].shape[0]
f2 = profile[profile['fidelity'] == 2].shape[0]
f3 = profile[profile['fidelity'] == 3].shape[0]
f4 = profile[profile['fidelity'] == 4].shape[0]
f5 = profile[profile['fidelity'] == 5].shape[0]

sns.set_theme(style="whitegrid")

y = [f1, f2, f3, f4, f5]
x = ['Fidelity 1', 'Fidelity 2', 'Fidelity 3', 'Fidelity 4', 'Fidelity 5']
colors = sns.color_palette('pastel')[0:5]

sns.barplot(x=x, y=y, palette=colors)
plt.ylabel('Number of Individuals')
plt.xlabel('Fidelity Levels')
# plt.title('', fontsize = 14)

for i,y in enumerate(y):
    plt.text(i, y/2, y, ha = 'center')

#### FIDELITY - Among customers who completed an offer, how are the fidelity levels distributed?

In [None]:
# General effectiveness of offers divided by fidelity:
# Out of all the people who completed the offers, what's their fidelity level? 

# Getting the number of "completed" by fidelity level:
n_f1 = df[(df['fidelity'] == 1) & (df['send_offer'] == 1)].shape[0]
n_f2 = df[(df['fidelity'] == 2) & (df['send_offer'] == 1)].shape[0]
n_f3 = df[(df['fidelity'] == 3) & (df['send_offer'] == 1)].shape[0]
n_f4 = df[(df['fidelity'] == 4) & (df['send_offer'] == 1)].shape[0]
n_f5 = df[(df['fidelity'] == 5) & (df['send_offer'] == 1)].shape[0]

perc_f1 = round(n_f1/offers_1, 2) * 100
perc_f2 = round(n_f2/offers_1, 2) * 100
perc_f3 = round(n_f3/offers_1, 2) * 100
perc_f4 = round(n_f4/offers_1, 2) * 100
perc_f5 = round(n_f5/offers_1, 2) * 100

print(perc_f1, perc_f2, perc_f3, perc_f4,perc_f5)

# Pie chart

data = [perc_f1, perc_f2, perc_f3, perc_f4,perc_f5]
labels = ['Fidelity 1', 'Fidelity 2', 'Fidelity 3', 'Fidelity 4', 'Fidelity 5']
colors = sns.color_palette('pastel')[0:6]

plt.pie(data, labels = labels, colors = colors, autopct='%.2f%%')
plt.title('Fidelity distribution among customers who completed an offer', fontsize = 14)
plt.show()

#### FIDELITY - Most effective offer type within each fidelity level

In [None]:
# Effectiveness of offer types divided by fidelity:
# Level 1:
yes_f1 = df[(df['fidelity'] == 1) & (df['send_offer'] == 1)]
n_f1 = yes_f1.shape[0]

# BOGO offers
bogo_f1_perc = yes_f1[yes_f1['offer_id'].isin(bogo_offers)].shape[0]/n_f1

# Informational offers:
inf_f1_perc = yes_f1[yes_f1['offer_id'].isin(inf_offers)].shape[0]/n_f1

# Discount offers:
disc_f1_perc = yes_f1[yes_f1['offer_id'].isin(disc_offers)].shape[0]/n_f1

print(f'Out of all completed offers: {100*round(bogo_f1_perc,2)}% are BOGO offers, \
{100*round(inf_f1_perc,2)}% are informational offers and {100*round(disc_f1_perc,2)}% are discount offers.')

# Plotting pie chart:

data = [bogo_f1_perc, inf_f1_perc, disc_f1_perc]
labels = ['BOGO offers', 'Informational offers', 'Discount offers']
colors = sns.color_palette('pastel')[0:4]

plt.pie(data, labels = labels, colors = colors, autopct='%.2f%%')
plt.title('Offer effectiveness for customers on fidelity level 1 (Between 29/07/2013 and 28/07/2014)', fontsize = 14)
plt.show()

In [None]:
# Level 1 BOGO:
all_bogo = df[(df['offer_id'].isin(bogo_offers)) & (df['fidelity'] == 1)]
n_bogo = all_bogo.shape[0]
all_bogo1 = all_bogo[all_bogo['send_offer'] == 1]
n_bogo1 = all_bogo1.shape[0]
bogo_comp = n_bogo1 / n_bogo

# Level 1 discount:
all_disc = df[(df['offer_id'].isin(disc_offers)) & (df['fidelity'] == 1)]
n_disc = all_disc.shape[0]
all_disc1 = all_disc[all_disc['send_offer'] == 1]
n_disc1 = all_disc1.shape[0]
disc_comp = n_disc1 / n_disc

# Level 1 Informational:
all_inf = df[(df['offer_id'].isin(inf_offers)) & (df['fidelity'] == 1)]
n_inf = all_inf.shape[0]
all_inf1 = all_inf[all_inf['send_offer'] == 1]
n_inf1 = all_inf1.shape[0]
inf_comp = n_inf1 / n_inf

# Data Visualization:
sns.set_theme(style="whitegrid")

y = [bogo_comp, inf_comp, disc_comp]
x = ['BOGO', 'Informational', 'Discount']
colors = sns.color_palette('pastel')[0:4]

sns.barplot(x=x, y=y, palette=colors)
plt.ylabel('Completion Rates')
plt.title('Level 1 Fidelity Offer Completion for Each Offer Type', fontsize = 14)
plt.xlabel('Type of Offer')

for i,y in enumerate(y):
    plt.text(i, y/2, f'{round(y*100,1)}%', ha = 'center')

In [None]:
# Effectiveness of offer types divided by fidelity:
# Level 2:
yes_f2 = df[(df['fidelity'] == 2) & (df['send_offer'] == 1)]
n_f2 = yes_f2.shape[0]

# BOGO offers
bogo_f2_perc = yes_f2[yes_f2['offer_id'].isin(bogo_offers)].shape[0]/n_f2

# Informational offers:
inf_f2_perc = yes_f2[yes_f2['offer_id'].isin(inf_offers)].shape[0]/n_f2

# Discount offers:
disc_f2_perc = yes_f2[yes_f2['offer_id'].isin(disc_offers)].shape[0]/n_f2

print(f'Out of all completed offers: {100*round(bogo_f2_perc,2)}% are BOGO offers, \
{100*round(inf_f2_perc,2)}% are informational offers and {100*round(disc_f2_perc,2)}% are discount offers.')

# Plotting pie chart:

data = [bogo_f2_perc, inf_f2_perc, disc_f2_perc]
labels = ['BOGO offers', 'Informational offers', 'Discount offers']
colors = sns.color_palette('pastel')[0:4]

plt.pie(data, labels = labels, colors = colors, autopct='%.2f%%',normalize=False)
plt.title('Offer effectiveness for customers on fidelity level 2 (Between 29/07/2014 and 28/07/2015)', fontsize = 14)
plt.show()

In [None]:
# Level 2 BOGO:
all_bogo = df[(df['offer_id'].isin(bogo_offers)) & (df['fidelity'] == 2)]
n_bogo = all_bogo.shape[0]
all_bogo1 = all_bogo[all_bogo['send_offer'] == 1]
n_bogo1 = all_bogo1.shape[0]
bogo_comp = n_bogo1 / n_bogo

# Level 2 discount:
all_disc = df[(df['offer_id'].isin(disc_offers)) & (df['fidelity'] == 2)]
n_disc = all_disc.shape[0]
all_disc1 = all_disc[all_disc['send_offer'] == 1]
n_disc1 = all_disc1.shape[0]
disc_comp = n_disc1 / n_disc

# Level 2 Informational:
all_inf = df[(df['offer_id'].isin(inf_offers)) & (df['fidelity'] == 2)]
n_inf = all_inf.shape[0]
all_inf1 = all_inf[all_inf['send_offer'] == 1]
n_inf1 = all_inf1.shape[0]
inf_comp = n_inf1 / n_inf

# Data Visualization:
sns.set_theme(style="whitegrid")

y = [bogo_comp, inf_comp, disc_comp]
x = ['BOGO', 'Informational', 'Discount']
colors = sns.color_palette('pastel')[0:4]

sns.barplot(x=x, y=y, palette=colors)
plt.ylabel('Completion Rates')
plt.title('Level 2 Fidelity Offer Completion for Each Offer Type', fontsize = 14)
plt.xlabel('Type of Offer')

for i,y in enumerate(y):
    plt.text(i, y/2, f'{round(y*100,1)}%', ha = 'center')

In [None]:
# Effectiveness of offer types divided by fidelity:
# Level 3:
yes_f3 = df[(df['fidelity'] == 3) & (df['send_offer'] == 1)]
n_f3 = yes_f3.shape[0]

# BOGO offers
bogo_f3_perc = yes_f3[yes_f3['offer_id'].isin(bogo_offers)].shape[0]/n_f3

# Informational offers:
inf_f3_perc = yes_f3[yes_f3['offer_id'].isin(inf_offers)].shape[0]/n_f3

# Discount offers:
disc_f3_perc = yes_f3[yes_f3['offer_id'].isin(disc_offers)].shape[0]/n_f3

print(f'Out of all completed offers: {100*round(bogo_f3_perc,2)}% are BOGO offers, \
{100*round(inf_f3_perc,2)}% are informational offers and {100*round(disc_f3_perc,2)}% are discount offers.')

# Plotting pie chart:

data = [bogo_f3_perc, inf_f3_perc, disc_f3_perc]
labels = ['BOGO offers', 'Informational offers', 'Discount offers']
colors = sns.color_palette('pastel')[0:4]

plt.pie(data, labels = labels, colors = colors, autopct='%.2f%%',normalize=False)
plt.title('Offer effectiveness for customers on fidelity level 3 (Between 29/07/2015 and 28/07/2016)', fontsize = 14)
plt.show()

In [None]:
# Level 3 BOGO:
all_bogo = df[(df['offer_id'].isin(bogo_offers)) & (df['fidelity'] == 3)]
n_bogo = all_bogo.shape[0]
all_bogo1 = all_bogo[all_bogo['send_offer'] == 1]
n_bogo1 = all_bogo1.shape[0]
bogo_comp = n_bogo1 / n_bogo

# Level 3 discount:
all_disc = df[(df['offer_id'].isin(disc_offers)) & (df['fidelity'] == 3)]
n_disc = all_disc.shape[0]
all_disc1 = all_disc[all_disc['send_offer'] == 1]
n_disc1 = all_disc1.shape[0]
disc_comp = n_disc1 / n_disc

# Level 3 Informational:
all_inf = df[(df['offer_id'].isin(inf_offers)) & (df['fidelity'] == 3)]
n_inf = all_inf.shape[0]
all_inf1 = all_inf[all_inf['send_offer'] == 1]
n_inf1 = all_inf1.shape[0]
inf_comp = n_inf1 / n_inf

# Data Visualization:
sns.set_theme(style="whitegrid")

y = [bogo_comp, inf_comp, disc_comp]
x = ['BOGO', 'Informational', 'Discount']
colors = sns.color_palette('pastel')[0:4]

sns.barplot(x=x, y=y, palette=colors)
plt.ylabel('Completion Rates')
plt.title('Level 3 Fidelity Offer Completion for Each Offer Type', fontsize = 14)
plt.xlabel('Type of Offer')

for i,y in enumerate(y):
    plt.text(i, y/2, f'{round(y*100,1)}%', ha = 'center')

In [None]:
# Effectiveness of offer types divided by fidelity:
# Level 4:
yes_f4 = df[(df['fidelity'] == 4) & (df['send_offer'] == 1)]
n_f4 = yes_f4.shape[0]

# BOGO offers
bogo_f4_perc = yes_f4[yes_f4['offer_id'].isin(bogo_offers)].shape[0]/n_f4

# Informational offers:
inf_f4_perc = yes_f4[yes_f4['offer_id'].isin(inf_offers)].shape[0]/n_f4

# Discount offers:
disc_f4_perc = yes_f4[yes_f4['offer_id'].isin(disc_offers)].shape[0]/n_f4

print(f'Out of all completed offers: {100*round(bogo_f4_perc,2)}% are BOGO offers, \
{100*round(inf_f4_perc,2)}% are informational offers and {100*round(disc_f4_perc,2)}% are discount offers.')

# Plotting pie chart:

data = [bogo_f4_perc, inf_f4_perc, disc_f4_perc]
labels = ['BOGO offers', 'Informational offers', 'Discount offers']
colors = sns.color_palette('pastel')[0:4]

plt.pie(data, labels = labels, colors = colors, autopct='%.2f%%',normalize=False)
plt.title('Offer effectiveness for customers on fidelity level 4 (Between 29/07/2016 and 28/07/2017)', fontsize = 14)
plt.show()

In [None]:
# Level 4 BOGO:
all_bogo = df[(df['offer_id'].isin(bogo_offers)) & (df['fidelity'] == 4)]
n_bogo = all_bogo.shape[0]
all_bogo1 = all_bogo[all_bogo['send_offer'] == 1]
n_bogo1 = all_bogo1.shape[0]
bogo_comp = n_bogo1 / n_bogo

# Level 4 discount:
all_disc = df[(df['offer_id'].isin(disc_offers)) & (df['fidelity'] == 4)]
n_disc = all_disc.shape[0]
all_disc1 = all_disc[all_disc['send_offer'] == 1]
n_disc1 = all_disc1.shape[0]
disc_comp = n_disc1 / n_disc

# Level 4 Informational:
all_inf = df[(df['offer_id'].isin(inf_offers)) & (df['fidelity'] == 4)]
n_inf = all_inf.shape[0]
all_inf1 = all_inf[all_inf['send_offer'] == 1]
n_inf1 = all_inf1.shape[0]
inf_comp = n_inf1 / n_inf

# Data Visualization:
sns.set_theme(style="whitegrid")

y = [bogo_comp, inf_comp, disc_comp]
x = ['BOGO', 'Informational', 'Discount']
colors = sns.color_palette('pastel')[0:4]

sns.barplot(x=x, y=y, palette=colors)
plt.ylabel('Completion Rates')
plt.title('Level 4 Fidelity Offer Completion for Each Offer Type', fontsize = 14)
plt.xlabel('Type of Offer')

for i,y in enumerate(y):
    plt.text(i, y/2, f'{round(y*100,1)}%', ha = 'center')

In [None]:
# Effectiveness of offer types divided by fidelity:
# Level 5:
yes_f5 = df[(df['fidelity'] == 5) & (df['send_offer'] == 1)]
n_f5 = yes_f5.shape[0]

# BOGO offers
bogo_f5_perc = yes_f5[yes_f5['offer_id'].isin(bogo_offers)].shape[0]/n_f5

# Informational offers:
inf_f5_perc = yes_f5[yes_f5['offer_id'].isin(inf_offers)].shape[0]/n_f5

# Discount offers:
disc_f5_perc = yes_f5[yes_f5['offer_id'].isin(disc_offers)].shape[0]/n_f5

print(f'Out of all completed offers: {100*round(bogo_f5_perc,2)}% are BOGO offers, \
{100*round(inf_f5_perc,2)}% are informational offers and {100*round(disc_f5_perc,2)}% are discount offers.')

# Plotting pie chart:

data = [bogo_f5_perc, inf_f5_perc, disc_f5_perc]
labels = ['BOGO offers', 'Informational offers', 'Discount offers']
colors = sns.color_palette('pastel')[0:4]

plt.pie(data, labels = labels, colors = colors, autopct='%.2f%%',normalize=False)
plt.title('Offer effectiveness for customers on fidelity level 5 (Between 29/07/2017 and 26/07/2018)', fontsize = 14)
plt.show()

In [None]:
# Level 5 BOGO:
all_bogo = df[(df['offer_id'].isin(bogo_offers)) & (df['fidelity'] == 5)]
n_bogo = all_bogo.shape[0]
all_bogo1 = all_bogo[all_bogo['send_offer'] == 1]
n_bogo1 = all_bogo1.shape[0]
bogo_comp = n_bogo1 / n_bogo

# Level 5 discount:
all_disc = df[(df['offer_id'].isin(disc_offers)) & (df['fidelity'] == 5)]
n_disc = all_disc.shape[0]
all_disc1 = all_disc[all_disc['send_offer'] == 1]
n_disc1 = all_disc1.shape[0]
disc_comp = n_disc1 / n_disc

# Level 5 Informational:
all_inf = df[(df['offer_id'].isin(inf_offers)) & (df['fidelity'] == 5)]
n_inf = all_inf.shape[0]
all_inf1 = all_inf[all_inf['send_offer'] == 1]
n_inf1 = all_inf1.shape[0]
inf_comp = n_inf1 / n_inf

# Data Visualization:
sns.set_theme(style="whitegrid")

y = [bogo_comp, inf_comp, disc_comp]
x = ['BOGO', 'Informational', 'Discount']
colors = sns.color_palette('pastel')[0:4]

sns.barplot(x=x, y=y, palette=colors)
plt.ylabel('Completion Rates')
plt.title('Level 5 Fidelity Offer Completion for Each Offer Type', fontsize = 14)
plt.xlabel('Type of Offer')

for i,y in enumerate(y):
    plt.text(i, y/2, f'{round(y*100,1)}%', ha = 'center')

## 3.3. Finding the best algorithm

In this section, we apply several supervised ML algorithms and compare their performance.

### 3.3.1 Method

**Method Description**

To optimize the prediction model, we fit and evaluate several classification methods from the sklearn library. Once the best-performing methods are identified, their hyperparameters are tuned with GridSearchCV and the tuned models are evaluated again.

The selected initial methods for classification are:

- LinearSVC
- SVC
- NuSVC
- K-Neighbors
- Random Forest
- AdaBoost
- Gradient Boosting
- Decision Tree Classifier

This variety of methods reflects the exploratory nature of the project: the main objective is to become familiar with how each algorithm behaves on a classification problem like this one.
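
The two-step procedure described above (screen several classifiers, then tune the most promising one with GridSearchCV) can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic (`make_classification`), only three of the listed classifiers are included, and the parameter grids are arbitrary examples rather than the grids used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data standing in for the notebook's features and target.
X, y = make_classification(n_samples=400, n_features=7, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 1: screen several classifiers with (mostly) default hyperparameters.
candidates = {
    'K-Neighbors': KNeighborsClassifier(),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=50, random_state=42),
}
scores = {name: model.fit(X_train, y_train).score(X_test, y_test)
          for name, model in candidates.items()}
best_name = max(scores, key=scores.get)

# Step 2: tune the best candidate. The grids here are illustrative only.
param_grids = {
    'K-Neighbors': {'n_neighbors': [3, 5, 11]},
    'Decision Tree': {'max_depth': [3, 5, None]},
    'Random Forest': {'max_depth': [3, 5, None]},
}
search = GridSearchCV(candidates[best_name], param_grids[best_name], cv=3, scoring='accuracy')
search.fit(X_train, y_train)
print(best_name, search.best_params_, round(search.score(X_test, y_test), 3))
```

GridSearchCV selects hyperparameters by cross-validation on the training split only, so the held-out test split remains available for the final evaluation phase.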

**Performance Metrics**

The most common metrics for classification models are:

- Accuracy
- Precision and Recall
- F1-score
- AU-ROC

For this project, accuracy and AU-ROC will be the main measures of performance, while precision will serve as a secondary/tie-breaker metric.
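
As a small worked example of these metrics, the sketch below computes accuracy, macro-averaged precision and AU-ROC on six made-up observations. It also shows that `roc_auc_score` gives different values depending on whether it receives continuous scores or hard 0/1 predictions. The numbers and the 0.5 threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, roc_auc_score

# Made-up labels and model scores for six observations.
y_true = np.array([0, 0, 1, 1, 1, 0])
y_score = np.array([0.1, 0.6, 0.8, 0.4, 0.9, 0.2])
y_pred = (y_score >= 0.5).astype(int)  # hard predictions at a 0.5 threshold

acc = accuracy_score(y_true, y_pred)                     # 4 of 6 correct
prec = precision_score(y_true, y_pred, average='macro')  # mean of per-class precision
auc_from_scores = roc_auc_score(y_true, y_score)         # uses the ranking of scores
auc_from_labels = roc_auc_score(y_true, y_pred)          # uses only the 0/1 predictions

print(round(acc, 3), round(prec, 3), round(auc_from_scores, 3), round(auc_from_labels, 3))
```

With hard predictions, the AU-ROC of a binary problem reduces to the average of the true positive rate and the true negative rate, so it carries less information than the score-based value.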

_Explain the metrics on the blog._

In [None]:
# Importing necessary libraries:

import time

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support, accuracy_score, classification_report, roc_auc_score, precision_score
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import LogisticRegression, Perceptron, PassiveAggressiveClassifier, LogisticRegressionCV
from sklearn.svm import LinearSVC, SVC, NuSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier


In [None]:
def display_results(y_test, y_pred):
    '''
    Description:
        Displays the confusion matrix, accuracy, AU-ROC and classification
        report of a classification model.
    INPUT:
        - y_test: True labels of the test data.
        - y_pred: Labels predicted by the model for the same data.
    '''
    # Use the labels present in either vector, so that a class the model never
    # predicts still appears in the confusion matrix.
    labels = np.unique(np.concatenate([np.asarray(y_test), np.asarray(y_pred)]))
    confusion_mat = confusion_matrix(y_test, y_pred, labels=labels)
    disp = ConfusionMatrixDisplay(confusion_matrix=confusion_mat, display_labels=labels)
    accuracy = accuracy_score(y_test, y_pred)
    # AU-ROC is computed from the hard class predictions, not from scores.
    au_roc = roc_auc_score(y_test, y_pred, average='macro')

    print("Labels:", labels)
    disp.plot()
    print("Accuracy:", accuracy)
    print("AU-ROC:", au_roc)
    print(classification_report(y_test, y_pred))

In [None]:
# Separation of input dataframe and output vector from ml_input:
X = ml_input[['offer_id','age_range','income_range','M','F','O','fidelity']]
y = ml_input.send_offer

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
X_train.head()

**Method Testing**
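
Every cell in this subsection repeats the same steps: fit, predict, time the run and append the scores to `cols`. The sketch below shows one way those steps could be folded into a single loop. It is a minimal illustration under stated assumptions: `evaluate_models`, `MajorityClassifier` and the toy data are hypothetical and not part of the notebook. Any object with `fit` and `predict` (such as the scikit-learn estimators imported above) and any metric function taking `(y_true, y_pred)` could be passed in instead.

```python
import time
from collections import defaultdict


class MajorityClassifier:
    """Hypothetical stand-in exposing fit/predict like a scikit-learn estimator."""

    def fit(self, X, y):
        labels = list(y)
        # Remember the most frequent training label.
        self.label_ = max(set(labels), key=labels.count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def evaluate_models(models, metric_fns, X_train, y_train, X_test, y_test):
    """Fit each model, time it, and collect one score per metric."""
    records = defaultdict(list)
    for name, model in models.items():
        start = time.time()
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        records["method"].append(name)
        records["seconds"].append(time.time() - start)
        for metric_name, metric_fn in metric_fns.items():
            records[metric_name].append(metric_fn(y_test, y_pred))
    return records


# Toy data, made up for illustration only.
X_train, y_train = [[0], [1], [2], [3]], [1, 1, 1, 0]
X_test, y_test = [[4], [5]], [1, 0]

records = evaluate_models(
    {"Majority": MajorityClassifier()},
    {"accuracy": accuracy},
    X_train, y_train, X_test, y_test,
)
print(dict(records))
```

The returned dictionary has the same column-oriented shape as `cols`, so it could be passed to `pd.DataFrame` in the same way.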

In [None]:
# Creation of records dictionaries:
cols = defaultdict(list)

In [None]:
# Testing for Logistic Regression:
print('Logistic Regression')

start = time.time()

model_lrg = LogisticRegression()
model_lrg.fit(X_train, y_train)
y_pred_lrg = model_lrg.predict(X_test)
display_results(y_test, y_pred_lrg)

end = time.time()

print('Time elapsed:',end - start,'s')

cols['method'].append('Logistic Regression')
cols['accuracy'].append(accuracy_score(y_test,y_pred_lrg))
cols['auroc'].append(roc_auc_score(y_test, y_pred_lrg, average='macro'))
cols['precision'].append(precision_score(y_test, y_pred_lrg, average='macro'))

In [None]:
# Testing for Perceptron:
print('Perceptron')

start = time.time()

model_per = Perceptron()
model_per.fit(X_train, y_train)
y_pred_per = model_per.predict(X_test)
display_results(y_test, y_pred_per)

end = time.time()

print('Time elapsed:',end - start,'s')

cols['method'].append('Perceptron')
cols['accuracy'].append(accuracy_score(y_test,y_pred_per))
cols['auroc'].append(roc_auc_score(y_test, y_pred_per, average='macro'))
cols['precision'].append(precision_score(y_test, y_pred_per, average='macro'))

In [None]:
# Testing for Passive Aggressive Classifier:
print('Passive Aggressive Classifier')

start = time.time()

model_pac = PassiveAggressiveClassifier()
model_pac.fit(X_train, y_train)
y_pred_pac = model_pac.predict(X_test)
display_results(y_test, y_pred_pac)

end = time.time()

print('Time elapsed:',end - start,'s')

cols['method'].append('Passive Aggressive')
cols['accuracy'].append(accuracy_score(y_test,y_pred_pac))
cols['auroc'].append(roc_auc_score(y_test, y_pred_pac, average='macro'))
cols['precision'].append(precision_score(y_test, y_pred_pac, average='macro'))

In [None]:
# Testing for Logistic RegressionCV:
print('Logistic RegressionCV')

start = time.time()

model_lrcv = LogisticRegressionCV()
model_lrcv.fit(X_train, y_train)
y_pred_lrcv = model_lrcv.predict(X_test)
display_results(y_test, y_pred_lrcv)

end = time.time()

print('Time elapsed:',end - start,'s')

cols['method'].append('Logistic Regression CV')
cols['accuracy'].append(accuracy_score(y_test,y_pred_lrcv))
cols['auroc'].append(roc_auc_score(y_test, y_pred_lrcv, average='macro'))
cols['precision'].append(precision_score(y_test, y_pred_lrcv, average='macro'))

In [None]:
# Testing for Linear SVC:
print('Linear SVC')

start = time.time()

model_lsvc = LinearSVC()
model_lsvc.fit(X_train, y_train)
y_pred_lsvc = model_lsvc.predict(X_test)
display_results(y_test, y_pred_lsvc)

end = time.time()

print('Time elapsed:',end - start,'s')

cols['method'].append('Linear SVC')
cols['accuracy'].append(accuracy_score(y_test,y_pred_lsvc))
cols['auroc'].append(roc_auc_score(y_test, y_pred_lsvc, average='macro'))
cols['precision'].append(precision_score(y_test, y_pred_lsvc, average='macro'))

In [None]:
# Testing for SVC:
print('SVC')

start = time.time()

model_svc = SVC()
model_svc.fit(X_train, y_train)
y_pred_svc = model_svc.predict(X_test)
display_results(y_test, y_pred_svc)

end = time.time()

print('Time elapsed:',end - start,'s')

cols['method'].append('SVC')
cols['accuracy'].append(accuracy_score(y_test,y_pred_svc))
cols['auroc'].append(roc_auc_score(y_test, y_pred_svc, average='macro'))
cols['precision'].append(precision_score(y_test, y_pred_svc, average='macro'))

In [None]:
# Testing for NuSVC:
print('NuSVC')

start = time.time()

model_nusvc = NuSVC()
model_nusvc.fit(X_train, y_train)
y_pred_nusvc = model_nusvc.predict(X_test)
display_results(y_test, y_pred_nusvc)

end = time.time()

print('Time elapsed:',end - start,'s')

cols['method'].append('NuSVC')
cols['accuracy'].append(accuracy_score(y_test,y_pred_nusvc))
cols['auroc'].append(roc_auc_score(y_test, y_pred_nusvc, average='macro'))
cols['precision'].append(precision_score(y_test, y_pred_nusvc, average='macro'))

In [None]:
# Testing for K-Neighbors Classifier:
print('K-Neighbors')

start = time.time()

model_knc = KNeighborsClassifier()
model_knc.fit(X_train, y_train)
y_pred_knc = model_knc.predict(X_test)
display_results(y_test, y_pred_knc)

end = time.time()

print('Time elapsed:',end - start,'s')

cols['method'].append('K-Neighbors')
cols['accuracy'].append(accuracy_score(y_test,y_pred_knc))
cols['auroc'].append(roc_auc_score(y_test, y_pred_knc, average='macro'))
cols['precision'].append(precision_score(y_test, y_pred_knc, average='macro'))

In [None]:
# Testing for Random Forest Classifier:
print('Random Forest')

start = time.time()

model_rfc = RandomForestClassifier()
model_rfc.fit(X_train, y_train)
y_pred_rfc = model_rfc.predict(X_test)
display_results(y_test, y_pred_rfc)

end = time.time()

print('Time elapsed:',end - start,'s')

cols['method'].append('Random Forest')
cols['accuracy'].append(accuracy_score(y_test,y_pred_rfc))
cols['auroc'].append(roc_auc_score(y_test, y_pred_rfc, average='macro'))
cols['precision'].append(precision_score(y_test, y_pred_rfc, average='macro'))

In [None]:
# Testing for AdaBoost:
print('AdaBoost')

start = time.time()

model_ada = AdaBoostClassifier()
model_ada.fit(X_train, y_train)
y_pred_ada = model_ada.predict(X_test)
display_results(y_test, y_pred_ada)

end = time.time()

print('Time elapsed:',end - start,'s')

cols['method'].append('AdaBoost')
cols['accuracy'].append(accuracy_score(y_test,y_pred_ada))
cols['auroc'].append(roc_auc_score(y_test, y_pred_ada, average='macro'))
cols['precision'].append(precision_score(y_test, y_pred_ada, average='macro'))

In [None]:
# Testing for Gradient Boosting:
print('Gradient Boosting')

start = time.time()

model_gbc = GradientBoostingClassifier()
model_gbc.fit(X_train, y_train)
y_pred_gbc = model_gbc.predict(X_test)
display_results(y_test, y_pred_gbc)

end = time.time()

print('Time elapsed:',end - start,'s')

cols['method'].append('Gradient Boosting')
cols['accuracy'].append(accuracy_score(y_test,y_pred_gbc))
cols['auroc'].append(roc_auc_score(y_test, y_pred_gbc, average='macro'))
cols['precision'].append(precision_score(y_test, y_pred_gbc, average='macro'))

In [None]:
# Testing for Bagging Classifier:
print('Bagging Classifier')

start = time.time()

model_bag = BaggingClassifier()
model_bag.fit(X_train, y_train)
y_pred_bag = model_bag.predict(X_test)
display_results(y_test, y_pred_bag)

end = time.time()

print('Time elapsed:',end - start,'s')

cols['method'].append('Bagging')
cols['accuracy'].append(accuracy_score(y_test,y_pred_bag))
cols['auroc'].append(roc_auc_score(y_test, y_pred_bag, average='macro'))
cols['precision'].append(precision_score(y_test, y_pred_bag, average='macro'))

In [None]:
# Testing for Decision Tree:
print('Decision Tree')

start = time.time()

model_dtc = DecisionTreeClassifier()
model_dtc.fit(X_train, y_train)
y_pred_dtc = model_dtc.predict(X_test)
display_results(y_test, y_pred_dtc)

end = time.time()

print('Time elapsed:',end - start,'s')

cols['method'].append('Decision Trees')
cols['accuracy'].append(accuracy_score(y_test,y_pred_dtc))
cols['auroc'].append(roc_auc_score(y_test, y_pred_dtc, average='macro'))
cols['precision'].append(precision_score(y_test, y_pred_dtc, average='macro'))

In [None]:
# Testing for Extra Tree:
print('Extra Tree')

start = time.time()

model_etc = ExtraTreeClassifier()
model_etc.fit(X_train, y_train)
y_pred_etc = model_etc.predict(X_test)
display_results(y_test, y_pred_etc)

end = time.time()

print('Time elapsed:',end - start,'s')

cols['method'].append('Extra Trees')
cols['accuracy'].append(accuracy_score(y_test,y_pred_etc))
cols['auroc'].append(roc_auc_score(y_test, y_pred_etc, average='macro'))
cols['precision'].append(precision_score(y_test, y_pred_etc, average='macro'))

In [None]:
cols

In [None]:
# Creation of metrics dataframe:
metrics = pd.DataFrame(data=cols)

In [None]:
metrics.sort_values(by=['accuracy','auroc','precision'], ascending=False)

Several methods returned an AU-ROC of exactly 0.5, which means they perform no better than a constant prediction. As a filter, we therefore compare only the methods whose AU-ROC is greater than 0.5.
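
To see why a constant prediction lands on exactly 0.5: when AU-ROC is computed from hard 0/1 predictions rather than scores, the ROC curve has a single interior point, and the area reduces to the mean of the true positive rate and the true negative rate. The sketch below works this out on made-up labels. The function name and the toy vectors are illustrative assumptions.

```python
def auroc_from_hard_labels(y_true, y_pred):
    """AU-ROC of a 0/1 prediction vector.

    With hard labels the ROC curve is (0, 0) -> (FPR, TPR) -> (1, 1),
    so the area under it is (TPR + TNR) / 2.
    """
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == 0]
    tpr = sum(pos) / len(pos)                  # positives predicted as 1
    tnr = sum(1 - p for p in neg) / len(neg)   # negatives predicted as 0
    return (tpr + tnr) / 2


# Toy labels, made up for illustration only.
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
always_one = [1] * len(y_true)
always_zero = [0] * len(y_true)
mixed = [1, 0, 1, 0, 0, 1, 1, 1]

print(auroc_from_hard_labels(y_true, always_one))   # 0.5
print(auroc_from_hard_labels(y_true, always_zero))  # 0.5
print(auroc_from_hard_labels(y_true, mixed))
```

A model that predicts a single class gets a TPR of 1 and a TNR of 0 (or the reverse), hence 0.5. `roc_auc_score` also accepts continuous scores, so passing probabilities or decision values instead of class labels is an option where a finer ranking measure is wanted.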

In [None]:
metrics[metrics['auroc'] > 0.5].sort_values(by=['accuracy','auroc','precision'], ascending=False)

In [None]:
%store cols

From the dataframe above, the best classifiers were Gradient Boosting, Bagging and Random Forest. For the optimization step, we will use GridSearchCV to improve the performance metrics through hyperparameter tuning.
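
Grid search fits one model per parameter combination and per cross-validation fold, so the cost grows multiplicatively with every parameter added. The sketch below expands a grid with the same shape as the Gradient Boosting grid used in the next cells and counts the resulting fits. `expand_grid` is a hypothetical helper, and the fold count of 5 is an assumption based on the default `cv` in recent scikit-learn versions.

```python
from itertools import product

param_grid = {
    "learning_rate": [0.5, 0.2, 0.1],
    "n_estimators": [100, 250, 500],
    "max_depth": [3, 5, 7],
    "min_samples_split": [2, 4, 6],
}


def expand_grid(grid):
    """List every combination of the parameter values as a dict."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


candidates = expand_grid(param_grid)
n_folds = 5  # assumption: default number of folds in recent scikit-learn versions
n_fits = len(candidates) * n_folds

print(len(candidates), "candidates,", n_fits, "cross-validation fits")
```

With four parameters of three values each there are 81 candidates, or 405 fits at 5 folds, which explains why the tuning cells take much longer than the initial tests.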

In [None]:
# Gradient Boosting method tuning:
model_gbc.get_params()

In [None]:
# Testing for Gradient Boosting with Grid Search:
print('Gradient Boosting GridSearch')

start = time.time()

gbc = GradientBoostingClassifier()
parameters = {'learning_rate': [0.5, 0.2, 0.1],
              'n_estimators': [100, 250, 500],
              'max_depth': [3,5,7],
              'min_samples_split': [2,4,6],              
            }

gbcgs = GridSearchCV(estimator=gbc, param_grid=parameters)
gbcgs.fit(X_train, y_train)
y_pred_gbcgs = gbcgs.predict(X_test)

display_results(y_test, y_pred_gbcgs)

end = time.time()

print('Time elapsed:',end - start,'s')

cols['method'].append('Gradient Boosting GridSearch')
cols['accuracy'].append(accuracy_score(y_test,y_pred_gbcgs))
cols['auroc'].append(roc_auc_score(y_test, y_pred_gbcgs, average='macro'))
cols['precision'].append(precision_score(y_test, y_pred_gbcgs, average='macro'))

In [None]:
metrics = pd.DataFrame(data=cols)
metrics.sort_values(by='accuracy',ascending=False)

In [None]:
# Decision Tree Classifier method tuning:
model_dtc.get_params()

In [None]:
# Testing for Decision Tree Classifier with Grid Search:
print('Decision Tree Classifier GridSearch')

start = time.time()

dtc = DecisionTreeClassifier()
parameters = {'criterion': ['gini', 'entropy'],
              'min_samples_leaf': [1,2],
              'min_samples_split': [2, 3],
              'splitter': ['random', 'best'],
            }

dtcgs = GridSearchCV(estimator=dtc, param_grid=parameters)
dtcgs.fit(X_train, y_train)
y_pred_dtcgs = dtcgs.predict(X_test)

display_results(y_test, y_pred_dtcgs)

end = time.time()

print('Time elapsed:',end - start,'s')

cols['method'].append('Decision Tree Classifier GridSearch')
cols['accuracy'].append(accuracy_score(y_test,y_pred_dtcgs))
cols['auroc'].append(roc_auc_score(y_test, y_pred_dtcgs, average='macro'))
cols['precision'].append(precision_score(y_test, y_pred_dtcgs, average='macro'))

In [None]:
# Random Forest method tuning:
model_rfc.get_params()

In [None]:
# Testing for Random Forest Classifier with Grid Search:
print('Random Forest GridSearch')

start = time.time()

rfc = RandomForestClassifier()
parameters = {'n_estimators': [100, 200, 400],
              'criterion': ['entropy', 'gini'],
            }

rfcgs = GridSearchCV(estimator=rfc, param_grid=parameters)
rfcgs.fit(X_train, y_train)
y_pred_rfcgs = rfcgs.predict(X_test)

display_results(y_test, y_pred_rfcgs)

end = time.time()

print('Time elapsed:',end - start,'s')

cols['method'].append('Random Forest GridSearch')
cols['accuracy'].append(accuracy_score(y_test,y_pred_rfcgs))
cols['auroc'].append(roc_auc_score(y_test, y_pred_rfcgs, average='macro'))
cols['precision'].append(precision_score(y_test, y_pred_rfcgs, average='macro'))

In [None]:
%store cols

In [None]:
metrics = pd.DataFrame(data=cols)
metrics[metrics['auroc'] > 0.5].sort_values(by=['accuracy','auroc','precision'], ascending=False).head(8)

In [None]:
# Updating dataframe and removing possible duplicates:
metrics = pd.DataFrame(data=cols)
metrics = metrics.drop_duplicates(subset='method')

In [None]:
# Displaying results:
metrics[metrics['auroc'] > 0.5].sort_values(by=['accuracy','auroc','precision'], ascending=False).head(8)

Selected method:
**Gradient Boosting**

In [None]:
# Presenting parameters:
gbcgs.best_estimator_.get_params()

In [None]:
# Saving model to pickle file (the context manager closes the file afterwards):
filename = 'data/final_model.sav'
with open(filename, 'wb') as model_file:
    pickle.dump(gbcgs, model_file)

In [None]:
# Loading model from pickle file:
filename = 'data/final_model.sav'
with open(filename, 'rb') as model_file:
    gbcgs = pickle.load(model_file)

## 3.4. Application of ML Model

In this subsection, we apply the optimized model in a function that receives a customer's information and returns which offers, if any, we should send to that customer.

If no customer information is provided, we will use the modes of the customer features (age, income, sex and fidelity).

In [None]:
# Getting mode values:
mode_age = ml_input.age_range.mode()[0]
mode_income = ml_input.income_range.mode()[0]
mode_sex = ml_input[['M','F','O']].sum().sort_values(ascending=False).index[0]
mode_fidelity =  ml_input.fidelity.mode()[0]


In [None]:
def recommend_offer(customer_age=mode_age, customer_income=mode_income, customer_sex=mode_sex,
                    customer_fidelity=mode_fidelity, df=ml_input, model=gbcgs):
    '''
    Description:
        - This function receives customer information and returns which offers are most effective for this
        specific customer.
    Input:
        - customer_age: (int) Customer age range, must be an integer.
        - customer_income: (float) Customer income range, must be a float.
        - customer_sex: (str) Customer sex, must be 'M'(male), 'F'(female) or 'O'(other).
        - customer_fidelity: (int) Customer fidelity level.
        - df: (pandas dataframe) Dataframe with the information for the ML model training and testing.
        - model: (Sklearn ML model) Machine Learning model to be used for prediction.
    Output:
        - offers: (array) An array of 10 items of 0s (do not send offer) and 1s (send offer),
            where each index corresponds to an offer_id as listed in the portfolio dataframe.
    '''

    # Creating input dataframe: one row per offer_id, with the customer
    # features repeated on every row. Column order matches the training data.
    offer_ids = np.sort(df.offer_id.unique())
    df_input = pd.DataFrame({
        'offer_id': offer_ids,
        'age_range': customer_age,
        'income_range': customer_income,
        'M': int(customer_sex == 'M'),
        'F': int(customer_sex == 'F'),
        'O': int(customer_sex == 'O'),
        'fidelity': customer_fidelity,
    })

    offers = model.predict(df_input)

    return offers

test = recommend_offer(customer_age=2, customer_income=3, customer_sex='M',customer_fidelity=4)

In [None]:
# Creating 'O' prediction dataframe:
gender = ['O']
n_age = profile.age_range.unique().shape[0]
n_income = profile.income_range.unique().shape[0]
n_fidelity = profile.fidelity.unique().shape[0]
n_offer = portfolio.offer_id.unique().shape[0]
n_rows = n_age * n_income * n_fidelity * n_offer
i = 0

offer_list = list(np.sort(portfolio.offer_id.unique()))
offer_col = offer_list * int(n_rows / n_offer)
gender_col = gender * int(n_rows)

df_other = pd.DataFrame(columns=['offer_id','age_range','income_range','fidelity','gender','prediction'])
df_other['offer_id'] = offer_col
df_other['gender'] = gender_col

for age in np.sort(profile.age_range.unique()):
    for income in np.sort(profile.income_range.unique()):
        for fidelity in np.sort(profile.fidelity.unique()):

            ages = list()
            ages.append(age)
            ages = ages * n_offer
            
            incomes = list()
            incomes.append(income)
            incomes = incomes * n_offer
            
            fidelities = list()
            fidelities.append(fidelity)
            fidelities = fidelities * n_offer

            recommendations = recommend_offer(customer_age=age, customer_income=income, customer_sex=gender[0],customer_fidelity=fidelity)

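            # Assign through .loc to avoid chained assignment,
            # and step by n_offer instead of a hard-coded 10:
            rows = df_other.index[i:i + n_offer]
            df_other.loc[rows, 'age_range'] = ages
            df_other.loc[rows, 'income_range'] = incomes
            df_other.loc[rows, 'fidelity'] = fidelities
            df_other.loc[rows, 'prediction'] = recommendations

            i += n_offer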
            

In [None]:
# Creating 'F' prediction dataframe:
gender = ['F']
n_age = profile.age_range.unique().shape[0]
n_income = profile.income_range.unique().shape[0]
n_fidelity = profile.fidelity.unique().shape[0]
n_offer = portfolio.offer_id.unique().shape[0]
n_rows = n_age * n_income * n_fidelity * n_offer
i = 0

offer_list = list(np.sort(portfolio.offer_id.unique()))
offer_col = offer_list * int(n_rows / n_offer)
gender_col = gender * int(n_rows)

df_fem = pd.DataFrame(columns=['offer_id','age_range','income_range','fidelity','gender','prediction'])
df_fem['offer_id'] = offer_col
df_fem['gender'] = gender_col

for age in np.sort(profile.age_range.unique()):
    for income in np.sort(profile.income_range.unique()):
        for fidelity in np.sort(profile.fidelity.unique()):

            ages = list()
            ages.append(age)
            ages = ages * n_offer
            
            incomes = list()
            incomes.append(income)
            incomes = incomes * n_offer
            
            fidelities = list()
            fidelities.append(fidelity)
            fidelities = fidelities * n_offer

            recommendations = recommend_offer(customer_age=age, customer_income=income, customer_sex=gender[0],customer_fidelity=fidelity)

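            # Assign through .loc to avoid chained assignment,
            # and step by n_offer instead of a hard-coded 10:
            rows = df_fem.index[i:i + n_offer]
            df_fem.loc[rows, 'age_range'] = ages
            df_fem.loc[rows, 'income_range'] = incomes
            df_fem.loc[rows, 'fidelity'] = fidelities
            df_fem.loc[rows, 'prediction'] = recommendations

            i += n_offer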
            

In [None]:
# Creating 'M' prediction dataframe:
gender = ['M']
n_age = profile.age_range.unique().shape[0]
n_income = profile.income_range.unique().shape[0]
n_fidelity = profile.fidelity.unique().shape[0]
n_offer = portfolio.offer_id.unique().shape[0]
n_rows = n_age * n_income * n_fidelity * n_offer
i = 0

offer_list = list(np.sort(portfolio.offer_id.unique()))
offer_col = offer_list * int(n_rows / n_offer)
gender_col = gender * int(n_rows)

df_male = pd.DataFrame(columns=['offer_id','age_range','income_range','fidelity','gender','prediction'])
df_male['offer_id'] = offer_col
df_male['gender'] = gender_col

for age in np.sort(profile.age_range.unique()):
    for income in np.sort(profile.income_range.unique()):
        for fidelity in np.sort(profile.fidelity.unique()):

            ages = list()
            ages.append(age)
            ages = ages * n_offer
            
            incomes = list()
            incomes.append(income)
            incomes = incomes * n_offer
            
            fidelities = list()
            fidelities.append(fidelity)
            fidelities = fidelities * n_offer

            recommendations = recommend_offer(customer_age=age, customer_income=income, customer_sex=gender[0],customer_fidelity=fidelity)

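            # Assign through .loc to avoid chained assignment,
            # and step by n_offer instead of a hard-coded 10:
            rows = df_male.index[i:i + n_offer]
            df_male.loc[rows, 'age_range'] = ages
            df_male.loc[rows, 'income_range'] = incomes
            df_male.loc[rows, 'fidelity'] = fidelities
            df_male.loc[rows, 'prediction'] = recommendations

            i += n_offer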
            

In [None]:
#Checking dataframe
df_male

In [None]:
#Checking dataframe
df_fem

In [None]:
#Checking dataframe
df_other
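
The three cells that build `df_other`, `df_fem` and `df_male` differ only in the gender value. A minimal sketch of how the same table could be produced by one helper is shown below. `build_prediction_rows` and `fake_recommend` are hypothetical names introduced for illustration. In the notebook, `recommend_offer` would take the place of the stand-in, and the returned rows could be passed to `pd.DataFrame`.

```python
from itertools import product


def build_prediction_rows(recommend, ages, incomes, fidelities, offer_ids, gender):
    """Return one row (dict) per feature combination and offer.

    `recommend` is any callable taking (age, income, sex, fidelity) and
    returning one 0/1 prediction per offer, in the order of `offer_ids`.
    """
    rows = []
    for age, income, fidelity in product(sorted(ages), sorted(incomes), sorted(fidelities)):
        predictions = recommend(age, income, gender, fidelity)
        for offer_id, prediction in zip(offer_ids, predictions):
            rows.append({
                "offer_id": offer_id,
                "age_range": age,
                "income_range": income,
                "fidelity": fidelity,
                "gender": gender,
                "prediction": prediction,
            })
    return rows


def fake_recommend(age, income, sex, fidelity):
    """Hypothetical stand-in for recommend_offer: sends even-numbered offers only."""
    return [1 if offer_id % 2 == 0 else 0 for offer_id in range(3)]


# Toy feature ranges, made up for illustration only.
rows = build_prediction_rows(
    fake_recommend,
    ages=[1, 0], incomes=[0, 1], fidelities=[0],
    offer_ids=[0, 1, 2], gender="F",
)
print(len(rows), rows[0])
```

Calling the helper once per gender would replace the three copies of the loop, and building the rows as a list avoids writing into a pre-allocated dataframe slice by slice.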

In [None]:
# Saving dataframes:

df_other.to_csv('data/df_other_v0.csv')
df_male.to_csv('data/df_male_v0.csv')
df_fem.to_csv('data/df_fem_v0.csv')

In [None]:
# Loading dataframes:
df_other = pd.read_csv('data/df_other_v0.csv', index_col=0)
df_fem = pd.read_csv('data/df_fem_v0.csv', index_col=0)
df_male = pd.read_csv('data/df_male_v0.csv', index_col=0)