The AirCnC product manager for the Mansions and Manors (M&M) category has just run
an ad to encourage customers to upgrade and consider an M&M property for their next booking. When the results came in, the booking rate was lower for customers who had seen the ad, even when restricting to customers who considered an M&M property.

In [24]:
import pandas as pd
from statsmodels.formula.api import ols
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import statsmodels.formula.api as smf

In [2]:
# Load the exercise data from a local path
df = pd.read_csv(r'D:\2024\data science portfolio\Custoemr behavior analysis Oreilly\data\AirCnC_MnM_exercises_data.csv')

In [15]:
# The booking rate is lower for customers who have seen the ad
df.groupby('ad').agg(bkg_rate=('bkd', 'mean'))

ad,bkg_rate
0,0.464139
1,0.448417


Description: 
- period: numeric variable, taking the values 0 (preliminary period before the ad was run) and 1 (period when the ad was run)
- income: numeric variable, indicating the income of the customer
- ad: binary variable, 0 for customers who haven't been served the ad, 1 for customers who have been served the ad
- mm: "considered a Mansion & Manor property", binary variable, 0 for customers who didn't consider an M&M property, 1 for customers who considered an M&M property
- bkd: "booked property", binary variable, 0 for customers who didn't book the property they were considering, 1 for customers who booked the property they were considering

What is the booking rate for customers who have seen the ad, restricting to customers considering an M&M property? Customers who haven’t seen the ad, with the same restriction?

In [17]:
# Customers who were served the ad and considered an M&M property
filtered_df1 = df[(df['ad'] == 1) & (df['mm'] == 1)]

In [19]:
filtered_df1.groupby('ad').agg(bkg_rate=('bkd', 'mean'))

ad,bkg_rate
1,0.911111


In [21]:
# Customers who were not served the ad and considered an M&M property
filtered_df2 = df[(df['ad'] == 0) & (df['mm'] == 1)]

In [22]:
filtered_df2.groupby('ad').agg(bkg_rate=('bkd', 'mean'))

ad,bkg_rate
0,0.932051


In [23]:
# This remains true even when restricting to customers considering an M&M property
df[df['mm'] == 1].groupby('ad').agg(bkg_rate=('bkd', 'mean'))

ad,bkg_rate
0,0.932051
1,0.911111
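
The filtered computations above can also be done in a single pass by grouping on both `mm` and `ad`, which returns the group sizes alongside the rates. The sketch below uses a small made-up data frame (`toy`, not the AirCnC data) that only borrows the column names, so the numbers it prints are illustrative and not results from this analysis.

```python
import pandas as pd

# Toy data, NOT the AirCnC dataset: same column names, made-up values.
toy = pd.DataFrame({
    'ad':  [0, 0, 0, 0, 1, 1, 1, 1],
    'mm':  [1, 1, 0, 0, 1, 1, 0, 0],
    'bkd': [1, 1, 1, 0, 1, 0, 0, 0],
})

# One pass: booking rate and group size for every (mm, ad) combination.
rates = (
    toy.groupby(['mm', 'ad'])
       .agg(bkg_rate=('bkd', 'mean'), n=('bkd', 'size'))
)
print(rates)

# Booking rates among customers who considered an M&M property, by ad exposure.
mm_only = rates.xs(1, level='mm')['bkg_rate']
print(mm_only)
```

Reporting `n` next to each rate makes it easier to see how many customers each rate is based on before comparing them.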


## Understanding the behavior 

1.a. What are the behavioral categories for the variables in the data (Income, Ad, MM, Bkd)?

Income is a personal characteristic. Ad is a business behavior. MM is a customer behavior. Bkd is a customer behavior.

1.b. What is (are) the goal(s) of the ad?

The goals of the ad are:
- to increase the percentage of customers who consider an M&M property
- to increase the percentage of customers who book an M&M property

In [25]:
# Does the ad increase the probability that a customer will consider an M&M property?
mod_mm = smf.logit('mm ~ ad', data = df)
res_mm = mod_mm.fit()
res_mm.summary()

Optimization terminated successfully.
         Current function value: 0.295525
         Iterations 6


Dep. Variable:,mm,No. Observations:,10000.0
Model:,Logit,Df Residuals:,9998.0
Method:,MLE,Df Model:,1.0
Date:,"Wed, 18 Sep 2024",Pseudo R-squ.:,5.536e-05
Time:,18:36:22,Log-Likelihood:,-2955.3
converged:,True,LL-Null:,-2955.4
Covariance Type:,nonrobust,LLR p-value:,0.5673

,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-2.3576,0.037,-62.933,0.000,-2.431,-2.284
ad,0.0673,0.117,0.576,0.564,-0.162,0.296


The coefficient for ad is 0.0673, with a p-value of 0.564 and a 95% confidence interval ([-0.162, 0.296]) that includes zero, so it is not statistically significant. In this model, there is no evidence that showing the ad changes the probability that a customer will consider an M&M property.
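
Logit coefficients are on the log-odds scale, which is hard to read directly. As a minimal sketch, the coefficients from the summary above can be converted into predicted probabilities with the inverse logit function; `inv_logit` is a helper defined here for illustration, not a statsmodels function.

```python
import math

def inv_logit(log_odds):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Coefficients copied from the summary of 'mm ~ ad' above.
intercept = -2.3576
coef_ad = 0.0673

p_no_ad = inv_logit(intercept)          # customers not served the ad
p_ad = inv_logit(intercept + coef_ad)   # customers served the ad

print(f"P(mm=1 | ad=0) = {p_no_ad:.4f}")
print(f"P(mm=1 | ad=1) = {p_ad:.4f}")
print(f"difference     = {p_ad - p_no_ad:.4f}")
```

The two probabilities come out at roughly 8.6% and 9.2%, a gap of about half a percentage point, which is consistent with the wide confidence interval around the ad coefficient.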

In [26]:
# Does the ad increase the probability that a customer will book an M&M property?
df['bkd_mm'] = df['bkd'] * df['mm'] # Equal to 1 if and only if a customer books an M&M property

mod_bkd_mm = smf.logit('bkd_mm ~ ad', data = df)
res_bkd_mm = mod_bkd_mm.fit()
res_bkd_mm.summary()

Optimization terminated successfully.
         Current function value: 0.280956
         Iterations 6


Dep. Variable:,bkd_mm,No. Observations:,10000.0
Model:,Logit,Df Residuals:,9998.0
Method:,MLE,Df Model:,1.0
Date:,"Wed, 18 Sep 2024",Pseudo R-squ.:,2.103e-05
Time:,18:37:17,Log-Likelihood:,-2809.6
converged:,True,LL-Null:,-2809.6
Covariance Type:,nonrobust,LLR p-value:,0.731

,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-2.4344,0.039,-62.937,0.000,-2.510,-2.359
ad,0.0420,0.122,0.345,0.730,-0.196,0.281


•  Method: MLE (Maximum Likelihood Estimation).

•  Pseudo R-squ.: 2.103e-05 (a measure of model fit, very low in this case).

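•  Log-Likelihood: -2809.6 (a measure of model fit; higher, i.e. closer to zero, is better. Here it is essentially equal to LL-Null, the log-likelihood of the intercept-only model, so adding ad barely improves the fit).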

•  Converged: True (the model fitting process was successful).

The coefficient for ad is 0.0420, but with a high p-value (0.730), indicating it is not statistically significant. This means that showing an ad does not significantly increase the probability that a customer will book an M&M property in this model.

## Summary
A logistic regression was used to test whether showing the ad (ad) increases the likelihood of a customer booking an M&M property (bkd_mm). The ad's coefficient is not statistically significant, so on its own, without controlling for any other variable, this model provides no evidence that the ad affects the outcome.

## Income affects booking

In [27]:
# Income increases the probability that a customer will consider an M&M property.
# Logistic regression with mm (considering an M&M property) as the dependent variable
# and income and ad (whether the ad was served) as the independent variables.
mod_mm = smf.logit('mm ~ income + ad', data = df)
res_mm = mod_mm.fit()
res_mm.summary()

Optimization terminated successfully.
         Current function value: 0.092007
         Iterations 9


Dep. Variable:,mm,No. Observations:,8489.0
Model:,Logit,Df Residuals:,8486.0
Method:,MLE,Df Model:,2.0
Date:,"Wed, 18 Sep 2024",Pseudo R-squ.:,0.6023
Time:,18:38:44,Log-Likelihood:,-781.04
converged:,True,LL-Null:,-1964.1
Covariance Type:,nonrobust,LLR p-value:,0.0

,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-5.0019,0.119,-41.925,0.000,-5.236,-4.768
income,9.768e-06,3.47e-07,28.160,0.000,9.09e-06,1.04e-05
ad,0.4934,0.464,1.063,0.288,-0.417,1.403


The intercept of -5.0019 is the log-odds of considering an M&M property (mm) when both income and ad are zero, i.e. for a customer with no income who has not been served the ad. That corresponds to a probability of about 0.7% (1 / (1 + exp(5.0019))), so the baseline chance of considering an M&M property is very low.

Coefficients table:
•  Intercept: The coefficient for the intercept is -5.0019, with a very small p-value (0.000), indicating it is statistically significant.

•  income: The coefficient for income is 9.768e-06, with a very small p-value (0.000), indicating it is statistically significant. This means that income has a significant positive effect on the likelihood of considering an M&M property. The coefficient looks tiny because it is the change in log-odds per additional unit of income.

•  ad: The coefficient for ad is 0.4934, but with a high p-value (0.288), indicating it is not statistically significant. This means that showing an ad does not significantly increase the probability that a customer will consider an M&M property in this model.
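
To make the coefficients above easier to compare, they can be exponentiated into odds ratios. The sketch below copies the estimates and confidence bounds from the summary table; the 100,000-unit income step is an arbitrary choice made here for readability, and `odds_ratio` is an illustrative helper, not part of statsmodels.

```python
import math

# (coefficient, 95% CI lower, 95% CI upper), copied from the 'mm ~ income + ad' summary.
coefs = {
    'income': (9.768e-06, 9.09e-06, 1.04e-05),  # per additional unit of income
    'ad': (0.4934, -0.417, 1.403),
}

def odds_ratio(coef, scale=1.0):
    """Multiplicative change in the odds for a `scale`-unit increase in the predictor."""
    return math.exp(coef * scale)

# Income: express the effect per 100,000 units (an arbitrary, more readable step).
income_or = tuple(odds_ratio(c, scale=100_000) for c in coefs['income'])
# Ad: effect of being served the ad versus not.
ad_or = tuple(odds_ratio(c) for c in coefs['ad'])

print("income, per 100k: OR = %.2f, 95%% CI [%.2f, %.2f]" % income_or)
print("ad:               OR = %.2f, 95%% CI [%.2f, %.2f]" % ad_or)
```

An odds ratio of 1 means no change in the odds. The interval for income sits well above 1, while the interval for ad spans 1, which is the odds-ratio view of the significance results in the table.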

Summary
A logistic regression was used to test whether income and showing the ad (ad) increase the likelihood of a customer considering an M&M property (mm). Income has a significant positive effect: higher income increases the likelihood of considering an M&M property. The ad's coefficient is positive but not statistically significant, so its effect cannot be distinguished from zero in this dataset. Note that this model is fit on 8,489 observations rather than 10,000, which suggests that rows with a missing income value were dropped.

In [28]:
# Income increases the probability that a customer will book an M&M property
mod_bkd_mm = smf.logit('bkd_mm ~ income + ad', data = df)
res_bkd_mm = mod_bkd_mm.fit()
res_bkd_mm.summary()

Optimization terminated successfully.
         Current function value: 0.064053
         Iterations 9


Dep. Variable:,bkd_mm,No. Observations:,8489.0
Model:,Logit,Df Residuals:,8486.0
Method:,MLE,Df Model:,2.0
Date:,"Wed, 18 Sep 2024",Pseudo R-squ.:,0.7007
Time:,18:39:08,Log-Likelihood:,-543.74
converged:,True,LL-Null:,-1816.8
Covariance Type:,nonrobust,LLR p-value:,0.0

,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-5.7731,0.161,-35.783,0.000,-6.089,-5.457
income,1.127e-05,4.23e-07,26.658,0.000,1.04e-05,1.21e-05
ad,0.2978,0.725,0.411,0.681,-1.124,1.719


The logistic regression model was used to see if income and showing an ad (ad) increase the likelihood of a customer booking an M&M property (bkd_mm). The results show that:

•  Income: Higher income significantly increases the likelihood of booking an M&M property. This means that as a customer's income goes up, they are more likely to book an M&M property.

•  Ad: The coefficient for ad is positive (0.2978) but not statistically significant (p = 0.681). This model therefore provides no evidence that seeing the ad changes the likelihood of booking an M&M property once income is accounted for.
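
The jump in Pseudo R-squ. once income is added can be checked by hand. statsmodels reports McFadden's pseudo R-squared, which compares the model's log-likelihood with that of the intercept-only model (LL-Null). The sketch below plugs in the rounded values printed in the two summaries above, so it should reproduce the reported figures only up to rounding; `mcfadden_r2` is a helper named here for illustration.

```python
def mcfadden_r2(llf, llnull):
    """McFadden's pseudo R-squared: 1 - (model log-likelihood / null log-likelihood)."""
    return 1.0 - llf / llnull

# Log-Likelihood and LL-Null copied from the 'mm ~ income + ad' summary.
r2_mm = mcfadden_r2(-781.04, -1964.1)
# Log-Likelihood and LL-Null copied from the 'bkd_mm ~ income + ad' summary.
r2_bkd_mm = mcfadden_r2(-543.74, -1816.8)

print(f"mm model:     {r2_mm:.4f}")
print(f"bkd_mm model: {r2_bkd_mm:.4f}")
```

In the ad-only models the two log-likelihoods are almost identical, which is why their pseudo R-squared values are close to zero.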

In [29]:
# Customers considering an M&M property after seeing the ad have a lower income than customers
# considering an M&M property without having seen the ad
df.groupby(['ad', 'mm']).agg(avg_income=('income', 'mean'))

ad,mm,avg_income
0,0,65419.63
0,1,1098118.0
1,0,22362.17
1,1,19363.8


The ad appears to have driven more customers to consider an M&M property across the board (i.e. irrespective of income): once income is controlled for, its coefficient is positive for both considering (0.4934) and booking (0.2978) an M&M property, although neither estimate is statistically significant in this sample. Because there are more lower-income customers than higher-income ones, the ad added proportionately more lower-income customers to the pool of customers considering an M&M property. These lower-income customers are less likely to book, so the average booking rate among customers considering an M&M property decreased. In other words, the mix of customers considering an M&M property changed, while the individual probability that a given customer would consider and book an M&M property did not fall and, by the point estimates, increased. Read this way, the ad is a success rather than the failure the raw booking rates suggest.
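
The mix-shift argument can be illustrated with a few lines of arithmetic. The segment sizes and booking rates below are hypothetical numbers chosen only to show the mechanism; they are not taken from the AirCnC data, and the two-segment split is an assumption made for the sketch.

```python
# Hypothetical numbers, chosen only to illustrate the mechanism (not from the dataset).
# For each income segment: (customers considering an M&M property, booking rate).
no_ad = {'high_income': (90, 0.95), 'low_income': (10, 0.60)}
with_ad = {'high_income': (100, 0.96), 'low_income': (40, 0.65)}

def pooled_rate(segments):
    """Booking rate across all segments, weighted by segment size."""
    total = sum(n for n, _ in segments.values())
    booked = sum(n * rate for n, rate in segments.values())
    return booked / total

# Within each segment the booking rate goes up with the ad...
for seg in no_ad:
    print(seg, no_ad[seg][1], '->', with_ad[seg][1])

# ...but the pooled rate goes down, because the low-income segment grew the most.
print('pooled', round(pooled_rate(no_ad), 3), '->', round(pooled_rate(with_ad), 3))
```

In this toy setup every segment books more often with the ad, yet the pooled rate falls, because the pool now contains a larger share of customers from the segment with the lower booking rate.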