# Table Of Contents

- [Importing Libraries](#Importing-Libraries)
- [Loading Data](#Loading-Data)
- [Auto EDA](#Auto-EDA)
- [Feature Engineering](#Feature-Engineering)
- [Visualize](#Visualize)
- [Removing Outliers](#Removing-Outliers)
- [Preprocessing for Modelling](#Preprocessing-For-Modelling)
- [Dimensionality Reduction](#Dimensionality-reduction-PCA)
- [Clustering](#Clustering)
- [Insights](#Get-insights)
- [Conclusion Segmentation](#Conclusion-Segmentation)
- [Promotion EDA](#Promotion-EDA)
- [Classification](#Classification)
- [SHAP Analysis](#SHAP-Analysis)
- [Conclusion Promotion](#Conclusion-Promotion)

# Importing Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler,LabelEncoder
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from mpl_toolkits.mplot3d import Axes3D
from yellowbrick.cluster import KElbowVisualizer
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import shap
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

# Loading Data

In [2]:
df = pd.read_csv('../input/customer-personality-analysis/marketing_campaign.csv',sep = '\t')
df.head()

In [3]:
df.info()

There are some missing values in `Income`. Let's remove those rows.

In [4]:
df.dropna(inplace = True)
print("No of data points in data are : " , len(df))

In [5]:
df.describe()

# Auto EDA

An automated EDA report helps surface the key insights quickly.

In [6]:
!pip install dataprep

In [7]:
from dataprep.eda import plot, plot_correlation, create_report, plot_missing
plot(df)

`Z_Revenue` and `Z_CostContact` should be removed, as they don't contribute anything to training.

# Feature Engineering

In [8]:
## Dt_Customer: date of the customer's enrolment with the company
df["Dt_Customer"] = pd.to_datetime(df["Dt_Customer"])

print("The newest customer's enrolment date in the records:", df["Dt_Customer"].max().date())
print("The oldest customer's enrolment date in the records:", df["Dt_Customer"].min().date())

## Tenure in whole days, measured from the most recent enrolment in the records.
## (pd.to_numeric on the raw timedeltas would give nanosecond counts, not days.)
df["Customer_days"] = (df["Dt_Customer"].max() - df["Dt_Customer"]).dt.days

In [9]:
### Age is easier to interpret than birth year (2021 is the reference year)
df["Age"] = 2021 - df["Year_Birth"]

## Total spending across all product categories
df["Spent"] = df["MntWines"]+ df["MntFruits"]+ df["MntMeatProducts"]+ df["MntFishProducts"]+ df["MntSweetProducts"]+ df["MntGoldProds"]

## Collapse Marital_Status into two living situations to make the number of adults in the household clear
df["Living_With"] = df["Marital_Status"].replace({"Married":"Partner", "Together":"Partner", "Absurd":"Alone", "Widow":"Alone", "YOLO":"Alone", "Divorced":"Alone", "Single":"Alone",})

## Total number of children (kids and teens) in the household
df["Children"] = df["Kidhome"] + df["Teenhome"]

## Household size: adults (1 or 2) plus children
df["Family_Size"] = df["Living_With"].map({"Alone": 1, "Partner": 2}) + df["Children"]

df["Is_Parent"] = np.where(df["Children"] > 0, 1, 0)

### Drop the identifier, the Z_ columns, and the source columns now replaced by engineered features
features_to_drop = ["ID","Marital_Status", "Dt_Customer", "Z_CostContact", "Z_Revenue", "Year_Birth"]
df.drop(features_to_drop, axis=1, inplace = True)

In [10]:
df.head()

In [11]:
df.describe()

There seem to be outliers in `Age`, since the maximum is 128.<br>
This one is obvious from the summary, but others may not be, so let's plot the data.
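
Outliers that are not obvious from `describe()` can also be flagged numerically. The sketch below uses Tukey's IQR fences, which is an assumed technique here and not the rule this notebook applies later; the helper name `iqr_outlier_mask` and the toy ages are made up for illustration.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy ages (invented), with one implausible value similar to the 128 seen above
toy_ages = pd.Series([25, 31, 38, 44, 45, 52, 59, 63, 128], name="Age")
mask = iqr_outlier_mask(toy_ages)
flagged = toy_ages[mask].tolist()
print(flagged)  # [128]
```

On the real data the same helper could be applied column by column, for example `iqr_outlier_mask(df["Income"])`, though skewed spending columns may be flagged more aggressively than is useful.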

# Visualize

In [12]:
cont_vars = ['Income','Spent',"MntWines","MntFruits","MntMeatProducts","MntFishProducts","MntSweetProducts","MntGoldProds",'NumDealsPurchases',
       'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases',
       'NumWebVisitsMonth','Recency','Children']



In [13]:

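plt.figure(figsize = (20,7))
for i in range(len(cont_vars)):
    plt.subplot(5,3,i+1)
    # No palette here: seaborn only uses a palette when a hue variable is set
    sns.scatterplot(data = df , y =cont_vars[i],x = 'Age')
# Adjust the layout once, after all subplots have been drawn
plt.tight_layout()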
    


### It looks like `Income` has outliers too!

# Removing Outliers

In [14]:
df = df[(df["Age"]<100)]
df = df[(df["Income"]<600000)]
print("Number of data points after removing outliers:", len(df))

In [15]:
df.columns

In [16]:
plt.figure(figsize=(20,7))
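# Education and Living_With are still strings at this point, so restrict the
# correlation to numeric columns (numeric_only requires pandas >= 1.5)
sns.heatmap(df.corr(numeric_only = True),annot =True,cmap = "YlGnBu");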

# Preprocessing For Modelling

In [17]:
le = LabelEncoder()
df['Education'] = le.fit_transform(df['Education'])
df['Living_With'] = le.fit_transform(df['Living_With'])
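
One caveat worth keeping in mind: `LabelEncoder` assigns integer codes in sorted (alphabetical) class order, so the codes carry no meaningful ranking, yet the scaling, PCA and K-Means steps that follow treat them as distances. The sketch below illustrates this with hypothetical levels (`low`, `medium`, `high`), not the dataset's actual `Education` categories, and shows an explicit mapping as one possible alternative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical ordered levels, used only for illustration
levels = pd.Series(["low", "medium", "high", "medium"])

# LabelEncoder sorts the classes alphabetically: high=0, low=1, medium=2
alphabetical_codes = LabelEncoder().fit_transform(levels).tolist()

# An explicit mapping keeps the intended order: low < medium < high
order = {"low": 0, "medium": 1, "high": 2}
ordinal_codes = levels.map(order).tolist()

print(alphabetical_codes)  # [1, 2, 0, 2]
print(ordinal_codes)       # [0, 1, 2, 1]
```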

### Removing all the Promotion Variables

In [18]:
data = df.copy()

In [19]:
promotion_vars = ['AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1','AcceptedCmp2', 'Complain', 'Response']
data.drop(promotion_vars,axis=1,inplace = True)

In [20]:
# let's scale every column
sc = StandardScaler()

scaled_df = pd.DataFrame(sc.fit_transform(data),columns = data.columns)
scaled_df.head()

# Dimensionality reduction PCA

In [21]:
pca = PCA(n_components = 3)

pca_df = pd.DataFrame(pca.fit_transform(scaled_df),columns = ['PC1','PC2','PC3'])
pca_df.head()
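
Before clustering on three components, it can help to check how much of the variance they retain. The sketch below does this on synthetic data standing in for `scaled_df` (the array `X_scaled` and the name `pca_demo` are assumptions); on the notebook's data the same attribute is available as `pca.explained_variance_ratio_`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for scaled_df: 200 rows, 6 correlated columns
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 6))
X_scaled = StandardScaler().fit_transform(X)

pca_demo = PCA(n_components=3).fit(X_scaled)
ratios = pca_demo.explained_variance_ratio_

print("Variance explained per component:", np.round(ratios, 3))
print("Cumulative variance explained:", np.round(ratios.cumsum(), 3))
```

If the cumulative share is low, the clusters found in PCA space may miss structure that lives in the discarded components.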

# Clustering

# Elbow Method

In [22]:
elbow= KElbowVisualizer(KMeans(), k=10)
elbow.fit(pca_df)
elbow.show();

In [23]:
# iner = [] 
# for i in range(1, 11): 
#     kmeans = KMeans(n_clusters = i, init = 'k-means++')
#     kmeans.fit(pca_df) 
#     iner.append(kmeans.inertia_)
# plt.plot(range(1,11),iner,marker ='o')
# plt.xlabel('K')
# plt.ylabel('Distortion Score')
# plt.show()
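
The elbow in a distortion curve can be hard to read, so a second criterion is sometimes used alongside it. The sketch below computes the silhouette score for several values of k. This is a complementary technique, not what the notebook uses, and it runs on synthetic blobs (`X_demo`) rather than on `pca_df`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with four blobs, standing in for pca_df
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)  # ranges from -1 to 1, higher is better

best_k = max(scores, key=scores.get)
print({k: round(v, 3) for k, v in scores.items()})
print("k with the highest silhouette score:", best_k)
```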

In [24]:
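# Fix the seed so the cluster labels, and the group descriptions that rely on them,
# stay the same between runs (42 is an arbitrary choice)
cluster = KMeans(n_clusters = 4, random_state = 42).fit_predict(pca_df)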
pca_df['cluster'] =cluster
pca_df.head()


In [25]:
## Add the cluster labels as a feature to the data as it was before scaling and PCA
data['cluster'] = cluster

In [26]:
data.head()

In [27]:
fig = plt.figure(figsize=(10,8))
ax = plt.subplot(111, projection='3d')
ax.scatter(pca_df['PC1'], pca_df['PC2'],pca_df['PC3'], c=pca_df["cluster"], marker='o',s=50,cmap = 'brg' )
plt.show()

In [28]:
sns.countplot(x = pca_df['cluster']);

# Get insights

In [29]:
sns.scatterplot(data = data,x = 'Spent',y = 'Income',hue = 'cluster',palette = 'viridis');

In [30]:
sns.boxenplot(x = 'cluster' , y ='Spent' ,data = data);

- Group 0: high spending and average income
- Group 1: high spending and high income
- Group 2: low spending and low income
- Group 3: high spending and low income
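
Descriptions like these are easier to verify with a per-cluster summary table next to the plots. The sketch below builds one with named aggregation on a toy frame; the numbers are invented and are not results from this dataset. On the notebook's data the same call could be applied to `data`.

```python
import pandas as pd

# Toy data (invented values) with the same column names as the notebook
toy = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 2, 2, 3, 3],
    "Income":  [52000, 58000, 81000, 79000, 24000, 30000, 41000, 45000],
    "Spent":   [700, 900, 1500, 1700, 60, 100, 400, 600],
})

# One row per cluster: size, median income and median spending
profile = toy.groupby("cluster").agg(
    n=("Spent", "size"),
    median_income=("Income", "median"),
    median_spent=("Spent", "median"),
)
print(profile)
```

Medians are used here because the spending columns tend to be skewed, which makes means less representative of a typical customer.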


### Group 1 is the biggest spender

In [31]:
# Spent vs Products
Product_vars = ['MntWines',
       'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
       'MntGoldProds']

for i in Product_vars:
    plt.figure(figsize = (7,7))
    sns.barplot(x  = 'cluster' , y = i,data = data)
    plt.show()
    

### Group 3 spends more on wines and gold.


In [32]:
Personal_vars = ['Customer_days','Age','Education','Kidhome','Teenhome','Children','Family_Size','Is_Parent','Living_With']
# catplot is a figure-level function and creates its own figure for each variable
for i in Personal_vars:
    sns.catplot(data = data,x='cluster',y=i)

In [33]:
Place_vars = ['NumDealsPurchases', 'NumWebPurchases',
       'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth']

for i in Place_vars:
    plt.figure(figsize = (7,7))
    sns.boxplot(x='cluster',y=i,data= data)
    plt.show()

# Conclusion Segmentation

## Write a short text on the key business takeaway of the recommendation.

## Group 0
- High spending and average income
- Is a parent
- Is older
- Has a teenager at home
- Family size is at least 2



## Group 1
- High spending and high income
- More store purchases and catalog purchases
- Family size is at most 3
- At most 1 child
- Spends on all products

## Group 2
- Low spending and low income
- More web visits
- At most 2 children
- Has at most 1 teenager

## Group 3
- High spending and low income
- Spends more on wines and gold
- More store purchases
- Family size is at least 2
- Definitely a parent
 

# Promotion EDA

In [34]:
df['cluster'] = cluster
df["Total"] = df["AcceptedCmp1"]+ df["AcceptedCmp2"]+ df["AcceptedCmp3"]+ df["AcceptedCmp4"]+ df["AcceptedCmp5"]

sns.countplot(x=df["Total"],hue=df["cluster"]);



### Promotions overall haven't done very well

In [35]:
sns.boxenplot(y = 'NumDealsPurchases',x = 'cluster',data=data);

### Group 1 is not much into deals, even though it has high income

In [36]:
sns.countplot(x = df["Response"], hue = df['cluster']);
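
Count plots are hard to compare when the clusters differ in size, so a response rate per cluster can be a useful companion. The sketch below computes one with `pd.crosstab` on a toy frame (invented values, not results from this dataset); on the notebook's data the inputs would be `df["cluster"]` and `df["Response"]`.

```python
import pandas as pd

# Toy data (invented values) with the same column names as the notebook
toy = pd.DataFrame({
    "cluster":  [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
    "Response": [0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0],
})

# normalize="index" turns each row into shares that sum to 1 within a cluster
rate_table = pd.crosstab(toy["cluster"], toy["Response"], normalize="index")
response_rate = rate_table[1]  # share of customers in each cluster who responded
print(response_rate)
```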

In [37]:
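# Rebuild the list without 'Complain' and add 'NumDealsPurchases' only once,
# so that re-running this cell does not fail or duplicate entries
promotion_vars = [v for v in promotion_vars if v != 'Complain']
if 'NumDealsPurchases' not in promotion_vars:
    promotion_vars.append('NumDealsPurchases')
promotion_vars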

In [38]:
df_promotion = df[promotion_vars]
df_promotion.head() 

In [39]:
sns.countplot(x = df['Total'], hue = df['Response']);

# Classification

In [40]:
train,test = train_test_split(df_promotion,test_size= 0.2)


In [41]:
X_train = train.drop('Response',axis = 1)
y_train = train['Response']
X_test = test.drop('Response',axis = 1)
y_test = test['Response']

In [42]:
xgb = XGBClassifier()
xgb.fit(X_train,y_train)
y_pred = xgb.predict(X_test)
# scikit-learn metrics expect (y_true, y_pred) in that order
accuracy_score(y_test,y_pred)
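
Accuracy alone can look flattering when one class dominates, so it is worth checking the balance of `Response` with `value_counts()` and comparing against a trivial baseline. The sketch below assumes an imbalanced target and uses synthetic data, with `LogisticRegression` as a lightweight stand-in for `XGBClassifier`; none of the numbers it prints relate to this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem with roughly 85% negatives (an assumed imbalance)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, weights=[0.85, 0.15], random_state=0
)
# stratify keeps the class ratio the same in the train and test splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

classifiers = {
    "baseline": DummyClassifier(strategy="most_frequent"),  # always predicts the majority class
    "model": LogisticRegression(max_iter=1000),
}

results = {}
for name, clf in classifiers.items():
    pred = clf.fit(X_tr, y_tr).predict(X_te)
    results[name] = {
        "accuracy": accuracy_score(y_te, pred),
        "balanced_accuracy": balanced_accuracy_score(y_te, pred),
    }
    print(name, results[name])
```

The baseline reaches a high accuracy without learning anything, while its balanced accuracy stays at 0.5, which is why a model's accuracy is best read relative to that reference.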

# SHAP Analysis

In [43]:
pred = xgb.predict(X_test, output_margin=True)
explainer = shap.TreeExplainer(xgb)
shap_values = explainer.shap_values(X_test)
np.abs(shap_values.sum(1) + explainer.expected_value - pred).max()
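
The value computed in the cell above is the largest gap between the model's raw margin output and the sum of the SHAP values plus the expected value; it should be close to zero because SHAP values are additive. The sketch below illustrates the same property with NumPy only, using a hypothetical linear model (weights `w` and intercept `b` are made up), for which the SHAP values under a feature-independence assumption have the closed form `w_i * (x_i - mean(x_i))`.

```python
import numpy as np

rng = np.random.default_rng(0)
X_bg = rng.normal(size=(500, 3))              # background data
w, b = np.array([2.0, -1.0, 0.5]), 0.3        # hypothetical linear model f(x) = w.x + b

def f(X):
    return X @ w + b

expected_value = f(X_bg).mean()               # average model output over the background data
x = X_bg[:5]                                  # rows to explain

# Closed-form SHAP values for a linear model with independent features
phi = w * (x - X_bg.mean(axis=0))

# Additivity: per-row SHAP values plus the expected value recover the prediction
max_gap = np.abs(phi.sum(axis=1) + expected_value - f(x)).max()
print(max_gap)
```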

In [44]:
shap.summary_plot(shap_values, X_test)

# Conclusion Promotion

### Run SHAP analysis on the model results, and write a short text on your recommendation to the business for the next round of campaigns.

<i>It looks like the campaigns had very little effect on people and haven't pulled the audience to buy the product. Deals with a discount may have had more effect than the campaigns. Perhaps better-targeted, well-planned campaigns are needed.</i><br>
##### Recommendation to business for the next round of campaigns.
- Discounts can be mentioned in the campaigns.
- Campaigns should be more family oriented, as our data mostly contains families.
- Since the customers are already clustered, recommendations can be tailored to each segment's interests.
- Some discounts can be offered on the products shown in the campaigns, to sell more products at a lower price. This will help retain customers.
- Meat and fish can be sold together at a discounted rate.
- Campaigns should reflect cultural aspects of the country, which will draw more people towards them.
- Campaigns can use humour, since memes will do the rest.
- Social media can be used more effectively.







