<a href="https://colab.research.google.com/github/mhmmdmin/Portfolio/blob/main/Marketing%20Response%20Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Import and Load Libraries

In [None]:
!pip install dalex
!pip install scikit-plot


In [None]:
import numpy as np
import pandas as pd
pd.set_option("display.max_columns", None)
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

from xgboost import XGBClassifier

import dalex as dx

import scikitplot as skplt

%matplotlib inline

#Import and Load Data

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
df = pd.read_csv("/content/drive/MyDrive/data/ifood_df.csv")

1. Age = Customer's age
2. Education = Customer's education level
3. Customer_Days = Number of days since the customer enrolled with the company
4. Marital_Divorced = Customer's marital status is divorced
5. Marital_Married = Customer's marital status is married
6. Marital_Single = Customer's marital status is single
7. Marital_Together = Customer's marital status is together
8. Marital_Widow = Customer's marital status is widow
9. Education_2n Cycle = Customer's education level is 2nd cycle
10. Education_Basic = Customer's education level is basic
11. Education_Graduation = Customer's education level is graduation
12. Education_Master = Customer's education level is master
13. Education_PhD = Customer's education level is PhD
14. Marital_Status = Customer's marital status
15. Income = Customer's yearly household income
16. Kidhome = Number of children in customer's household
17. Teenhome = Number of teenagers in customer's household
18. Dt_Customer = Date of customer's enrollment with the company
19. Recency = Number of days since customer's last purchase
20. MntWines = Amount spent on wine in the last 2 years
21. MntFruits = Amount spent on fruits in the last 2 years
22. MntMeatProducts = Amount spent on meat in the last 2 years
23. MntFishProducts = Amount spent on fish in the last 2 years
24. MntSweetProducts = Amount spent on sweets in the last 2 years
25. MntGoldProds = Amount spent on gold in the last 2 years
26. NumDealsPurchases = Number of purchases made with a discount
27. NumWebPurchases = Number of purchases made through the company's web site
28. NumCatalogPurchases = Number of purchases made using a catalogue
29. NumStorePurchases = Number of purchases made directly in stores
30. NumWebVisitsMonth = Number of visits to company's web site in the last month
31. AcceptedCmp3 = 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
32. AcceptedCmp4 = 1 if customer accepted the offer in the 4th campaign, 0 otherwise
33. AcceptedCmp5 = 1 if customer accepted the offer in the 5th campaign, 0 otherwise
34. AcceptedCmp1 = 1 if customer accepted the offer in the 1st campaign, 0 otherwise
35. AcceptedCmp2 = 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
36. Response = 1 if customer accepted the offer in the last campaign, 0 otherwise
37. Complain = 1 if customer complained in the last 2 years, 0 otherwise

#Data Inspection

In [None]:
#Check Dataset
df.head()

In [None]:
#Check data structure
df.info()

**Check Unique Values in Each Column**

In [None]:
for x in df.columns:
  print(f"unique value of {x}")
  print(f"{df[x].unique()}")
  print()

In [None]:
#Check missing value
df.isnull().sum()

In [None]:
#Check duplicate value
df.duplicated().sum()

In [None]:
df[df.duplicated(keep = False)]

In [None]:
#Drop duplicated rows (keep=False removes every copy, including the first occurrence)
df.drop_duplicates(keep=False, inplace=True)
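
Note that `keep=False` drops every copy of a duplicated row, not only the repeats. A minimal sketch on made-up data (not the real dataset) shows how this differs from the default `keep="first"`:

```python
import pandas as pd

# Toy frame: the first two rows are identical
toy = pd.DataFrame({"a": [1, 1, 2], "b": [5, 5, 6]})

# keep=False: every row that has a duplicate is removed
dropped_all = toy.drop_duplicates(keep=False)
# keep="first" (the default): one copy of each duplicated row survives
kept_first = toy.drop_duplicates(keep="first")

print(len(dropped_all), len(kept_first))
```

Which behaviour is appropriate depends on whether the duplicated rows are considered unreliable or simply repeated entries.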

#Data Transformation

In [None]:
#Change Column data type to string and then replace 1 with different number & 0 with blank
df['marital_Married']=df['marital_Married'].astype(str).replace({'1':'3','0':''})
df['marital_Single']=df['marital_Single'].astype(str).replace({'0':''})
df['marital_Together']=df['marital_Together'].astype(str).replace({'1':'2','0':''})
df['marital_Widow']=df['marital_Widow'].astype(str).replace({'1':'5','0':''})
df['marital_Divorced']=df['marital_Divorced'].astype(str).replace({'1':'4', '0':''})
#Join them in one column
df['marital_status']=df["marital_Widow"]+df['marital_Together']+df['marital_Single']+df['marital_Married']+df['marital_Divorced']
#Map numbers into different categorical values
df['marital_status']=df['marital_status'].map({'1':'Single', '2':'Together','3':'Married','4':'Divorced','5':'Widow'})

In [None]:
#Change Column data type to string and then replace 1 with different number & 0 with blank.
df['education_2n Cycle']= df['education_2n Cycle'].astype(str).replace({'0':''})
df['education_Basic']= df['education_Basic'].astype(str).replace({'1':'2','0':''})
df['education_Graduation']= df['education_Graduation'].astype(str).replace({'1':'3','0':''})
df['education_Master']= df['education_Master'].astype(str).replace({'1':'4','0':''})
df['education_PhD']= df['education_PhD'].astype(str).replace({'1':'5','0':''})
#Join them in one column
df['education_level']= df['education_2n Cycle']+df['education_Basic']+df['education_Graduation']+df['education_Master']+df['education_PhD']
#Map numbers into different categorical values
df['education_level']= df['education_level'].map({'1':'2n Cycle','2':'Basic','3':'Graduation','4':'Master','5':'PhD'})
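
The same collapse of one-hot columns into a single label can be sketched more compactly with `idxmax`. This is an alternative to the string-concatenation approach above, shown on a small made-up frame. The helper name `collapse_dummies` is hypothetical, and the sketch assumes exactly one column per row holds a 1.

```python
import pandas as pd

def collapse_dummies(frame, prefix):
    """Return one label per row from one-hot columns named '<prefix><label>'."""
    cols = [c for c in frame.columns if c.startswith(prefix)]
    # idxmax(axis=1) gives the name of the column holding the row maximum (the 1)
    return frame[cols].idxmax(axis=1).str[len(prefix):]

# Toy data (not the real dataset): one row per customer
toy = pd.DataFrame({
    "marital_Single":   [1, 0, 0],
    "marital_Married":  [0, 1, 0],
    "marital_Divorced": [0, 0, 1],
})
toy["marital_status"] = collapse_dummies(toy, "marital_")
print(toy["marital_status"].tolist())
```

This must run on the original 0/1 columns, before they are converted to strings.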

**Check the relationship between Kidhome and Teenhome before joining them**

In [None]:
df.groupby(['Kidhome', 'Teenhome'])['MntTotal'].median().plot(kind='bar')

There's no clear relationship between Kidhome and Teenhome in terms of total amount spent, so I will merge them into one column.

In [None]:
df['dependent_counts']=df['Kidhome']+df['Teenhome']

In [None]:
df.groupby('dependent_counts')['MntTotal'].mean().plot(kind='bar')

In [None]:
df = df.drop(['education_2n Cycle','education_Basic','education_Graduation','education_Master','education_PhD','marital_Widow','marital_Together','marital_Single','marital_Married','marital_Divorced','Kidhome','Teenhome'], axis=1)

In [None]:
df.describe().T

#Train Test Split

In [None]:
#most successful campaign
campaign = df.loc[:,['AcceptedCmp1','AcceptedCmp2','AcceptedCmp3','AcceptedCmp4','AcceptedCmp5','Response']]

campaign = campaign.melt()
campaign = pd.crosstab(campaign["variable"], campaign["value"])

cols = list(campaign.columns)
a, b = cols.index(0), cols.index(1)
cols[b], cols[a] = cols[a], cols[b]
campaign = campaign[cols]

campaign.columns = "Yes","No"
campaign.plot.bar(stacked=True)
plt.title('Acceptance of Marketing Campaigns')
plt.xlabel('Campaign')
plt.ylabel('Acceptance')
plt.legend(title='Response',loc='upper right')
plt.show()

In [None]:
df['AcceptedCmpOverall'].value_counts()

We use only the last campaign (`Response`) as the target because it has the most significant effect of all the campaigns.

In [None]:
X = df.drop(["Response"], axis = 1)
y = df["Response"]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y, 
    test_size=0.2,
    stratify = y, 
    random_state=1000
)
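
Passing `stratify=y` keeps the class proportions of the target nearly identical in the train and test sets, which matters when the target is imbalanced. A sketch with synthetic labels (not the real data):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced target: 15% positives out of 200 rows
y_toy = pd.Series([1] * 30 + [0] * 170)
X_toy = pd.DataFrame({"feature": np.arange(200)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=1000
)

# Both shares of positives should be close to the overall 15%
print(y_tr.mean(), y_te.mean())
```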

# EDA

In [None]:
#Define function for proportion
def prop_agg(df, y, x):
  temp_df = df.groupby([y,x], as_index = False).size()
  temp_df['prop'] = temp_df['size'] / temp_df.groupby(y)['size'].transform('sum')
  return temp_df
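
As a quick check of what `prop_agg` returns, here is a sketch on made-up data (the function is repeated so the block runs on its own). Within each group of `y`, the `prop` values sum to 1.

```python
import pandas as pd

def prop_agg(df, y, x):
    # Count rows per (y, x) pair, then divide by the total count of each y group
    temp_df = df.groupby([y, x], as_index=False).size()
    temp_df["prop"] = temp_df["size"] / temp_df.groupby(y)["size"].transform("sum")
    return temp_df

# Toy data: group "a" has 1 response in 4 rows, group "b" has 1 in 2
toy = pd.DataFrame({
    "group":    ["a", "a", "a", "a", "b", "b"],
    "Response": [0, 0, 0, 1, 0, 1],
})
result = prop_agg(toy, "group", "Response")
print(result)
```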

In [None]:
X_train["Response"] = y_train

##Age

In [None]:
X_train["age_bin"] = pd.cut(X_train['Age'],bins=[20,30,40,50,60,70,80],labels=['21-30','31-40','41-50','51-60',"61-70", "71+"])
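
`pd.cut` uses right-closed intervals by default, so the bin `(20, 30]` includes age 30 but not age 20, and ages outside the range 20-80 become `NaN`. A small sketch with made-up ages:

```python
import pandas as pd

ages = pd.Series([20, 21, 30, 31, 75, 85])
bins = [20, 30, 40, 50, 60, 70, 80]
labels = ["21-30", "31-40", "41-50", "51-60", "61-70", "71+"]

# Each age falls into the interval (left, right]; 20 and 85 match no interval
age_bin = pd.cut(ages, bins=bins, labels=labels)
print(age_bin.tolist())
```

If the data could contain customers aged 20 or younger, or older than 80, the bin edges would need widening to avoid silently dropping them from the grouped views.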

In [None]:
prop_agg(X_train, "age_bin", "Response")

In [None]:
# Visualization last campaign
g = sns.FacetGrid(data = prop_agg(X_train, "age_bin", "Response"),col = "age_bin");
g.map(sns.barplot, "Response", "prop");

In the last campaign, the 21-30 and 71+ age groups have the highest response rates, followed by the 31-40, 41-50, 61-70, and 51-60 groups.

In [None]:
X_train.groupby("age_bin")["Response"].mean()

**Check the income of age**

In [None]:
sns.boxplot(x = "age_bin", 
            y = "Income", 
            data = X_train,
            palette='pastel');

In [None]:
X_train.groupby("age_bin")["Income"].mean().plot(kind="bar").set_title("Average Income by Age Group")
plt.ylabel("Average Income")
plt.xlabel("Age Group")
plt.xticks(rotation=0);

After the 21-30 group, average income rises with age, starting from the 31-40 group, which has the lowest average income of all.

In [None]:
sns.boxplot(x = "age_bin", 
            y = "MntTotal", 
            data = X_train,
            palette='pastel');

In [None]:
X_train.groupby("age_bin")["MntTotal"].mean().plot(kind="bar").set_title("Average Expense by Age Group")
plt.ylabel("Average Expense")
plt.xlabel("Age Group")
plt.xticks(rotation=0);

Average expense by age group follows the same trend as income after the 21-30 group. It may be influenced by occupation, but the dataset does not include that information.

##Dependent

In [None]:
prop_agg(X_train, "dependent_counts", "Response")

In [None]:
# Visualization
g = sns.FacetGrid(data = prop_agg(X_train, "dependent_counts", "Response"),col = "dependent_counts");
g.map(sns.barplot, "Response", "prop");

Customers with 0 dependents have the most significant response to the last campaign.

In [None]:
sns.boxplot(x = "dependent_counts", 
            y = "Age", 
            data = X_train,
            palette='pastel');

The age range varies across dependent counts, and customers with 0 dependents show the widest variation.

In [None]:
X_train.groupby("dependent_counts")["Income"].mean().plot(kind="bar").set_title("Average Income by Dependent Counts")
plt.ylabel("Average Income")
plt.xlabel("Dependent Counts")
plt.xticks(rotation=0);

Customers with no dependents have the highest average income.

In [None]:
sns.boxplot(x = "dependent_counts", 
            y = "Income", 
            data = X_train,
            palette='pastel');

In [None]:
X_train.groupby("dependent_counts")["MntTotal"].mean().plot(kind="bar").set_title("Average Expenses by Dependent Counts")
plt.ylabel("Average Expenses")
plt.xlabel("Dependent Counts")
plt.xticks(rotation=0);

Customers with 0 dependents also have the highest average expenses.

In [None]:
sns.boxplot(x = "dependent_counts", 
            y = "MntTotal", 
            data = X_train,
            palette='pastel');

##Education

In [None]:
prop_agg(X_train, "education_level", "Response")

In [None]:
# Visualization last campaign
g = sns.FacetGrid(data = prop_agg(X_train, "education_level", "Response"),col = "education_level");
g.map(sns.barplot, "Response", "prop");

In [None]:
order_el = ['Basic', '2n Cycle', 'Graduation', 'Master', 'PhD']
sns.boxplot(x = "education_level", 
            y = "Income", 
            data = X_train,
            order = order_el,
            palette='pastel');

The plot shows that customers with a basic education level have the lowest income.

In [None]:
X_train.groupby("education_level")["Income"].mean().loc[order_el].plot(kind="bar").set_title("Average Income by Education Level")
plt.ylabel("Average Income")
plt.xlabel("Education")
plt.xticks(rotation=0);

In [None]:
sns.boxplot(x = "education_level", 
            y = "MntTotal", 
            data = X_train,
            order = order_el,
            palette='pastel');

In [None]:
X_train.groupby("education_level")["MntTotal"].mean().loc[order_el].plot(kind="bar").set_title("Average Expenses by Education Level")
plt.ylabel("Average Expenses")
plt.xlabel("Education Level")
plt.xticks(rotation=0);

Because the average income of the basic education group is low, its average expense is low too.

In [None]:
pd.pivot_table(X_train, index='age_bin', columns='education_level', values='Income', aggfunc='mean').plot(kind='bar', colormap='Set3')
plt.ylabel('Average Income')
plt.xlabel('Age')
plt.legend(bbox_to_anchor=(1.02, 1), loc='upper left', borderaxespad=0);

The plot above shows why the average income of the 21-30 group is high: it contains a few people with a master's degree or PhD whose income exceeds that of the older groups, from age 31 up to 70.

##Marital

In [None]:
prop_agg(X_train, "marital_status", "Response")

In [None]:
# Visualization last campaign
g = sns.FacetGrid(data = prop_agg(X_train, "marital_status", "Response"),col = "marital_status");
g.map(sns.barplot, "Response", "prop");

In [None]:
sns.boxplot(x = "marital_status", 
            y = "Income", 
            data = X_train,
            palette='pastel');

In [None]:
X_train.groupby("marital_status")["Income"].mean().plot(kind="bar").set_title("Average Income by Marital Status")
plt.ylabel("Average Income")
plt.xlabel("Marital Status")
plt.xticks(rotation=0);

Average income differs little across marital statuses.

In [None]:
sns.boxplot(x = "marital_status", 
            y = "MntTotal", 
            data = X_train,
            palette='pastel');

In [None]:
X_train.groupby("marital_status")["MntTotal"].mean().plot(kind="bar").set_title("Average Expense by Marital Status")
plt.ylabel("Average Expense")
plt.xlabel("Marital Status")
plt.xticks(rotation=0);

Average expense, however, is highest for widows.

In [None]:
pd.pivot_table(X_train, index='age_bin', columns='marital_status', values='Income', aggfunc='mean').plot(kind='bar', colormap='Set3')
plt.ylabel('Average Income')
plt.xlabel('Age')
plt.legend(bbox_to_anchor=(1.02, 1), loc='upper left', borderaxespad=0);

The plot above shows that income in the 21-30 group comes from customers who are single, married, or together. In the 31-40 group, the highest average income belongs to divorced customers. Beyond that, there is no significant difference by marital status across age groups.

In [None]:
pd.pivot_table(X_train, index='marital_status', columns='dependent_counts', values='Response', aggfunc='mean').plot(kind='bar', colormap='Set3')
plt.ylabel('Response')
plt.xlabel('Marital Status')
plt.legend(bbox_to_anchor=(1.02, 1), loc='upper left', borderaxespad=0);

There seems to be a relationship between having 0 dependents and living alone (single, divorced, or widowed): these customers have the highest response to the last campaign. However, customers who are married or together but have no dependents also respond at a high rate.

In [None]:
pd.pivot_table(X_train, index='marital_status', columns='dependent_counts', values='MntTotal', aggfunc='mean').plot(kind='bar', colormap='Set3')
plt.ylabel('MntTotal')
plt.xlabel('Marital Status')
plt.legend(bbox_to_anchor=(1.02, 1), loc='upper left', borderaxespad=0);

Regardless of marital status, customers without dependents tend to spend more and to respond positively to the last campaign.

##Products

In [None]:
sns.boxplot(x = "Response", y = "MntWines", data = X_train);

In [None]:
sns.boxplot(x = "Response", y = "MntMeatProducts", data = X_train);

In [None]:
sns.boxplot(x = "Response", y = "MntFruits", data = X_train);

In [None]:
sns.boxplot(x = "Response", y = "MntFishProducts", data = X_train);

In [None]:
sns.boxplot(x = "Response", y = "MntSweetProducts", data = X_train);

In [None]:
sns.boxplot(x = "Response", y = "MntGoldProds", data = X_train);

The boxplots above show that customers who spend more on wines, fruits, meat products, fish products, sweet products, and gold products are more likely to respond to the last campaign.

In [None]:
X_train.groupby('age_bin')[['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']].mean().plot(kind='bar', colormap='Set3')
plt.ylabel('Average Spend')
plt.xlabel('Age')
plt.legend(bbox_to_anchor=(1.02, 1), loc='upper left', borderaxespad=0);

The plot above shows that, in every age range, the highest average spend is on wines and meat products.

In [None]:
X_train.groupby('dependent_counts')[['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']].mean().plot(kind='bar', colormap='Set3')
plt.ylabel('Average Spend')
plt.xlabel('Dependent Counts')
plt.legend(bbox_to_anchor=(1.02, 1), loc='upper left', borderaxespad=0);

As with age, the highest average spend for every dependent count is on wines and meat products.

#Pre-Processing

**Encoding Categorical Values as Numeric**

In [None]:
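# Assign the result back instead of calling replace(inplace=True) on a column selection,
# which may not modify X_train in recent pandas versions
X_train['education_level'] = X_train['education_level'].map({'Basic':0,'2n Cycle':1,'Graduation':2,'Master':3,'PhD':4})
X_train['marital_status'] = X_train['marital_status'].map({'Single':0,'Together':1,'Married':2,'Divorced':3,'Widow':4})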

In [None]:
# Same encoding as the training set
X_test['education_level'] = X_test['education_level'].map({'Basic':0,'2n Cycle':1,'Graduation':2,'Master':3,'PhD':4})
X_test['marital_status'] = X_test['marital_status'].map({'Single':0,'Together':1,'Married':2,'Divorced':3,'Widow':4})

In [None]:
X_train = X_train.drop(["Response"], axis=1)

In [None]:
X_train.columns

In [None]:
drop_col1 = ["age_bin","Z_CostContact","Z_Revenue","Complain",'AcceptedCmp3',
       'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1', 'AcceptedCmp2']
drop_col2 = ["Z_CostContact","Z_Revenue","Complain",'AcceptedCmp3',
       'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1', 'AcceptedCmp2']


X_train = X_train.drop(drop_col1,axis=1)
X_test = X_test.drop(drop_col2, axis=1)

#Modeling

We will use 5 models:

- KNN as a baseline model
- Decision tree
- Random Forest
- SVM RBF
- XGBoost

Compare them all and choose the best model
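
KNN and an RBF-kernel SVM both rely on distances between rows, so features on large scales (such as `Income`) can dominate features on small ones. The notebook fits both models on unscaled data. One common option, sketched here on synthetic data as an assumption rather than as part of the original workflow, is to wrap the model in a pipeline with `StandardScaler`.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic data: one informative small-scale feature, one noisy large-scale feature
signal = rng.normal(size=400)
noise = rng.normal(scale=10_000, size=400)
X_syn = np.column_stack([signal, noise])
y_syn = (signal > 0).astype(int)

X_fit, y_fit = X_syn[:300], y_syn[:300]
X_new, y_new = X_syn[300:], y_syn[300:]

# Without scaling, distances are dominated by the large-scale noise column
raw_knn = KNeighborsClassifier(n_neighbors=5).fit(X_fit, y_fit)
# With scaling, both columns contribute on a comparable scale
scaled_knn = make_pipeline(
    StandardScaler(), KNeighborsClassifier(n_neighbors=5)
).fit(X_fit, y_fit)

raw_acc = raw_knn.score(X_new, y_new)
scaled_acc = scaled_knn.score(X_new, y_new)
print("unscaled accuracy:", raw_acc)
print("scaled accuracy:  ", scaled_acc)
```

Tree-based models (decision tree, random forest, XGBoost) split on one feature at a time and are not affected by feature scale in this way.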

In [None]:
y_train.value_counts(normalize=True)

**KNN**

In [None]:
knn_clf = KNeighborsClassifier(
    n_neighbors = 5,
)

**Decision Tree**

In [None]:
dc_clf = DecisionTreeClassifier(
    max_depth = 5,
    ccp_alpha = 0.001,
    class_weight = {0: 0.156716, 1:0.843284}
)
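
The hard-coded `class_weight` values above look like the class proportions swapped between the two classes, so that the rarer class gets the larger weight. That reading is an assumption, since the output of `value_counts` is not shown. A sketch of deriving such weights from a target instead of typing them by hand, on synthetic labels:

```python
import pandas as pd

# Synthetic target (not the real y_train): 20% positives
y_toy = pd.Series([0] * 80 + [1] * 20)

proportions = y_toy.value_counts(normalize=True)
# Give each class the share of the *other* class, so the minority weighs more
class_weight = {int(label): 1.0 - share for label, share in proportions.items()}
print(class_weight)
```

scikit-learn classifiers that take `class_weight` also accept `class_weight="balanced"`, which computes inverse-frequency weights automatically.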

**Random Forest**

In [None]:
rf_clf = RandomForestClassifier(
    random_state=1000,
    n_estimators=1000,
    class_weight = {0: 0.156716, 1:0.843284}
)

**SVM RBF**

In [None]:
svm_clf = SVC(
    random_state = 1000,
    probability=True,
    class_weight = {0: 0.156716, 1:0.843284}
)

**XGBoost**

In [None]:
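xgb_clf = XGBClassifier(
    random_state=1000,
    n_estimators=1000,
    # XGBClassifier has no class_weight parameter; scale_pos_weight weights the
    # positive class relative to the negative class (same ratio as the other models)
    scale_pos_weight = 0.843284 / 0.156716
)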

##Fitting the Model

**KNN**

In [None]:
knn_clf.fit(X_train, y_train)

**Decision Tree**

In [None]:
dc_clf.fit(X_train, y_train)

**Random Forest**

In [None]:
rf_clf.fit(X_train, y_train)

**SVM RBF**

In [None]:
svm_clf.fit(X_train, y_train)

**XGBoost**

In [None]:
xgb_clf.fit(X_train, y_train)

##Model Evaluation

In [None]:
X_train.shape

In [None]:
X_test.shape

In [None]:
# knn prediction
knn_pred = knn_clf.predict(X_test)
knn_pred_proba = knn_clf.predict_proba(X_test)

# decision tree prediction
dc_pred = dc_clf.predict(X_test)
dc_pred_proba = dc_clf.predict_proba(X_test)

# random forest prediction
rf_pred = rf_clf.predict(X_test)
rf_pred_proba = rf_clf.predict_proba(X_test)

# SVM RBF prediction
svm_pred = svm_clf.predict(X_test)
svm_pred_proba = svm_clf.predict_proba(X_test)

# XGBoost prediction
xgb_pred = xgb_clf.predict(X_test)
xgb_pred_proba = xgb_clf.predict_proba(X_test)

**KNN**

In [None]:
pd.DataFrame(metrics.classification_report(y_test, knn_pred, target_names=['No Response','Response'], output_dict=True))

In [None]:
# knn result
skplt.metrics.plot_confusion_matrix(y_test, knn_pred);

In [None]:
skplt.metrics.plot_roc_curve(y_test, knn_pred_proba);

**Decision Tree**

In [None]:
pd.DataFrame(metrics.classification_report(y_test, dc_pred, target_names=['No Response','Response'], output_dict=True))

In [None]:
# decision tree result
skplt.metrics.plot_confusion_matrix(y_test, dc_pred);

In [None]:
skplt.metrics.plot_roc_curve(y_test, dc_pred_proba);

**Random Forest**

In [None]:
pd.DataFrame(metrics.classification_report(y_test, rf_pred, target_names=['No Response','Response'], output_dict=True))

In [None]:
# randomforest result
skplt.metrics.plot_confusion_matrix(y_test, rf_pred);

In [None]:
skplt.metrics.plot_roc_curve(y_test, rf_pred_proba);

**SVM RBF**

In [None]:
pd.DataFrame(metrics.classification_report(y_test, svm_pred, target_names=['No Response','Response'], output_dict=True))

In [None]:
# svm result
skplt.metrics.plot_confusion_matrix(y_test, svm_pred);

In [None]:
skplt.metrics.plot_roc_curve(y_test, svm_pred_proba);

**XGBoost**

In [None]:
pd.DataFrame(metrics.classification_report(y_test, xgb_pred, target_names=['No Response','Response'], output_dict=True))

In [None]:
# xgboost result
skplt.metrics.plot_confusion_matrix(y_test, xgb_pred);

In [None]:
skplt.metrics.plot_roc_curve(y_test, xgb_pred_proba);
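
To compare the models side by side rather than reading five separate reports, the test-set metrics can be collected into one table. The sketch below uses synthetic data and two stand-in models; the helper `compare_models` is hypothetical and not part of the original notebook.

```python
import pandas as pd
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def compare_models(models, X_test, y_test):
    """Return one row of test metrics per fitted model."""
    rows = {}
    for name, model in models.items():
        pred = model.predict(X_test)
        proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class
        rows[name] = {
            "precision": metrics.precision_score(y_test, pred, zero_division=0),
            "recall": metrics.recall_score(y_test, pred),
            "f1": metrics.f1_score(y_test, pred),
            "roc_auc": metrics.roc_auc_score(y_test, proba),
        }
    return pd.DataFrame(rows).T

# Synthetic imbalanced data standing in for the campaign dataset
X_syn, y_syn = make_classification(n_samples=500, weights=[0.85], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, stratify=y_syn, random_state=0)
models = {
    "knn": KNeighborsClassifier().fit(X_tr, y_tr),
    "tree": DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr),
}
table = compare_models(models, X_te, y_te)
print(table)
```

With an imbalanced target, recall and F1 for the positive class are usually more informative than overall accuracy.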

#Explanatory Model Analysis

##Feature Importance

In [None]:
## initiate explainer for XGBoost model
response_xgb_exp = dx.Explainer(xgb_clf, X_train, y_train, label = "XGBoost Interpretation")

In [None]:
# visualize permutation feature importance for the XGBoost model
response_xgb_exp.model_parts().plot()

From the permutation feature importance above, we can conclude that several features affect the campaign response. The most influential features are the overall number of accepted campaigns and recency. This makes sense: the more recently customers made a transaction and the more often they accepted previous campaigns, the more likely they are to respond well to the last marketing campaign.
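
Permutation importance measures how much a model's score drops when one feature's values are shuffled, which breaks that feature's link to the target. The sketch below illustrates the idea with scikit-learn's `permutation_importance` on synthetic data. It is a stand-in for the concept, not a reproduction of the dalex output.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
# Synthetic data: only the first column determines the label
X_syn = rng.normal(size=(500, 3))
y_syn = (X_syn[:, 0] > 0).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_syn, y_syn)
result = permutation_importance(model, X_syn, y_syn, n_repeats=5, random_state=0)

# Mean drop in accuracy per feature; a larger drop means a more important feature
print(result.importances_mean)
```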

##Partial Dependence Plot

In [None]:
# create partial dependence plot of XGBoost model
response_xgb_exp.model_profile().plot()
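
A partial dependence profile answers the question: if one feature were set to a given value for every row, what would the average prediction be? A manual sketch on synthetic data follows. The helper `partial_dependence_1d` is hypothetical and is not how dalex is implemented internally.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def partial_dependence_1d(model, X, feature_idx, grid):
    """Average positive-class probability with one feature forced to each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = value  # overwrite the feature for every row
        averages.append(model.predict_proba(X_mod)[:, 1].mean())
    return np.array(averages)

rng = np.random.default_rng(0)
X_syn = rng.normal(size=(300, 2))
# The label becomes more likely as feature 0 grows
y_syn = (X_syn[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

model = LogisticRegression().fit(X_syn, y_syn)
grid = np.linspace(-2, 2, 5)
profile = partial_dependence_1d(model, X_syn, 0, grid)
print(profile)
```

A rising profile, as here, indicates that larger values of the feature push the average predicted probability up.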

#Interpretation

1. The more recently customers made a transaction, the more likely they are to accept the marketing campaign (2.5x compared with those whose recency is more than 80 days).
2. The more previous campaigns customers accepted, the more likely they are to accept the marketing campaign (for more than 2 accepted campaigns, the probability increases by 44%).
3. If customers spend more than 500 on meat products, the probability of a campaign response increases to 61%.
4. The longer customers have been with the company, the more likely they are to respond to the marketing campaign (an 82% increase for customers who joined more than 2700 days ago).
5. The less customers spend on wines, the more likely they are to respond to the marketing campaign.
6. The fewer purchases customers make directly in stores, the more likely they are to respond to the last marketing campaign.
7. Customers who are single, divorced, or widowed tend to respond to the marketing campaign more than those who are married or living together.