Marketing

In [1]:
# widen the notebook so it uses the full width of the screen

from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

There are a few different features in this dataset that could be interesting to predict for marketing analysis. Some possibilities include:

    Recency: This feature measures how recently a customer made a purchase, which could be useful for identifying who to target with promotions or other marketing efforts.

    Income: Income could be a useful feature to predict because it could allow you to segment customers into different income groups and target them with different marketing campaigns.

    AcceptedCmp1-5: These features indicate whether a customer accepted a marketing campaign in the past, which could be useful for predicting who is more likely to accept future campaigns.

    NumDealsPurchases, NumWebPurchases, NumCatalogPurchases, NumStorePurchases, NumWebVisitsMonth: These features measure different types of customer behavior, such as how often they make purchases or visit the website,
    which could be useful for identifying patterns in customer behavior and targeting marketing efforts accordingly.

    Complain: This feature indicates whether a customer has filed a complaint in the past, which could be useful for identifying potential issues with products or services and addressing them in future marketing efforts.

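    Z_Revenue and Z_CostContact: each holds only one value (11 and 3, respectively), so both can be dropped. Other columns set aside for dropping: Marital, Complain, AcceptedCmp1, AcceptedCmp2, AcceptedCmp3, AcceptedCmp4, AcceptedCmp5, Education, Income, MntWines.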

It is also important to note that depending on the business, the target feature may vary. It is always best to consult with stakeholders and industry experts before making a decision.
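The single-value columns mentioned in the list above can be found programmatically rather than by eye. The following is a minimal sketch on a made-up three-row frame; the `toy` frame and the variable names `constant_cols` and `reduced` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Made-up rows that mimic the shape of the dataset (values are illustrative only).
toy = pd.DataFrame({
    "Income": [58138.0, 46344.0, 71613.0],
    "Z_CostContact": [3, 3, 3],
    "Z_Revenue": [11, 11, 11],
    "Response": [1, 0, 0],
})

# A column with exactly one distinct value carries no information for a model.
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]

# Drop the constant columns and keep everything else.
reduced = toy.drop(columns=constant_cols)
print(constant_cols)
```

On the real data the same two lines would be applied to `data` after loading, using whichever column names are current at that point.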

To do: group by NumWebPurchases and NumWebVisitsMonth.
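One possible reading of the groupby note is to summarise web purchases for each level of monthly web visits. The sketch below assumes that reading and uses made-up numbers; `toy` and `web_summary` are hypothetical names.

```python
import pandas as pd

# Made-up values, only to illustrate the grouping.
toy = pd.DataFrame({
    "NumWebVisitsMonth": [7, 5, 7, 5, 4],
    "NumWebPurchases":   [8, 1, 4, 3, 6],
})

# For each number of monthly web visits: how many customers, and their average web purchases.
web_summary = (
    toy.groupby("NumWebVisitsMonth")["NumWebPurchases"]
    .agg(["count", "mean"])
    .rename(columns={"count": "customers", "mean": "avg_web_purchases"})
)
print(web_summary)
```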

Content<br>
<br>
Campaigns (encoded)<br>
* AcceptedCmp1 - 1 if customer accepted the offer in the 1st campaign, 0 otherwise <br>
* AcceptedCmp2 - 1 if customer accepted the offer in the 2nd campaign, 0 otherwise <br>
* AcceptedCmp3 - 1 if customer accepted the offer in the 3rd campaign, 0 otherwise <br>
* AcceptedCmp4 - 1 if customer accepted the offer in the 4th campaign, 0 otherwise <br>
* AcceptedCmp5 - 1 if customer accepted the offer in the 5th campaign, 0 otherwise <br>
<br>
Response (target) - 1 if customer accepted the offer in the last campaign, 0 otherwise <br>
<br>Complain - 1 if customer complained in the last 2 years<br>
<br>DtCustomer - date of customer’s enrolment with the company<br>
<br>Education - customer’s level of education: 'Graduation', 'PhD', 'Master', 'Basic', '2n Cycle'<br>
<br>Marital - customer’s marital status<br>
<br>Kidhome - number of small children in customer’s household<br>
<br>Teenhome - number of teenagers in customer’s household<br>
<br>Income - customer’s yearly household income<br>
<br>MntFishProducts - amount spent on fish products in the last 2 years<br>
<br>MntMeatProducts - amount spent on meat products in the last 2 years<br>
<br>MntFruits - amount spent on fruit products in the last 2 years<br>
<br>MntSweetProducts - amount spent on sweet products in the last 2 years<br>
<br>MntWines - amount spent on wine products in the last 2 years<br>
<br>MntGoldProds - amount spent on gold products in the last 2 years<br>
<br>NumDealsPurchases - number of purchases made with discount<br>
<br>NumCatalogPurchases - number of purchases made using catalogue<br>
<br>NumStorePurchases - number of purchases made directly in stores<br>
<br>NumWebPurchases - number of purchases made through company’s web site<br>
<br>NumWebVisitsMonth - number of visits to company’s web site in the last month<br>
<br>Recency - number of days since the last purchase <br>

In [2]:
# import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import copy

from sklearn.preprocessing import LabelEncoder

from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
#from sklearn import metrics

from sklearn.metrics import roc_curve 
from sklearn.metrics import accuracy_score, confusion_matrix 
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

%matplotlib inline
%config InlineBackend.figure_format  = 'retina'

In [3]:
data=pd.read_csv("marketing_campaign.csv",sep=";")


In [4]:
# rename the columns
data.rename(columns = {'Year_Birth': 'YearBirth',
                       'Marital_Status': 'MaritalStatus', 
                       'Dt_Customer': 'DtCustomer', 
                       'Z_CostContact': 'ZCostContact',
                       'Z_Revenue': 'ZRevenue', 
                       'Kidhome': 'KidHome',
                       'Teenhome': 'TeenHome'}, inplace = True)

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 29 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   2240 non-null   int64  
 1   YearBirth            2240 non-null   int64  
 2   Education            2240 non-null   object 
 3   MaritalStatus        2240 non-null   object 
 4   Income               2216 non-null   float64
 5   KidHome              2240 non-null   int64  
 6   TeenHome             2240 non-null   int64  
 7   DtCustomer           2240 non-null   object 
 8   Recency              2240 non-null   int64  
 9   MntWines             2240 non-null   int64  
 10  MntFruits            2240 non-null   int64  
 11  MntMeatProducts      2240 non-null   int64  
 12  MntFishProducts      2240 non-null   int64  
 13  MntSweetProducts     2240 non-null   int64  
 14  MntGoldProds         2240 non-null   int64  
 15  NumDealsPurchases    2240 non-null   i

In [6]:
data.describe()

Unnamed: 0,ID,YearBirth,Income,KidHome,TeenHome,Recency,MntWines,MntFruits,MntMeatProducts,MntFishProducts,...,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,ZCostContact,ZRevenue,Response
count,2240.0,2240.0,2216.0,2240.0,2240.0,2240.0,2240.0,2240.0,2240.0,2240.0,...,2240.0,2240.0,2240.0,2240.0,2240.0,2240.0,2240.0,2240.0,2240.0,2240.0
mean,5592.159821,1968.805804,52247.251354,0.444196,0.50625,49.109375,303.935714,26.302232,166.95,37.525446,...,5.316518,0.072768,0.074554,0.072768,0.064286,0.013393,0.009375,3.0,11.0,0.149107
std,3246.662198,11.984069,25173.076661,0.538398,0.544538,28.962453,336.597393,39.773434,225.715373,54.628979,...,2.426645,0.259813,0.262728,0.259813,0.245316,0.114976,0.096391,0.0,0.0,0.356274
min,0.0,1893.0,1730.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,11.0,0.0
25%,2828.25,1959.0,35303.0,0.0,0.0,24.0,23.75,1.0,16.0,3.0,...,3.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,11.0,0.0
50%,5458.5,1970.0,51381.5,0.0,0.0,49.0,173.5,8.0,67.0,12.0,...,6.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,11.0,0.0
75%,8427.75,1977.0,68522.0,1.0,1.0,74.0,504.25,33.0,232.0,50.0,...,7.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,11.0,0.0
max,11191.0,1996.0,666666.0,2.0,2.0,99.0,1493.0,199.0,1725.0,259.0,...,20.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,11.0,1.0


In [7]:
data['Response'].value_counts()

0    1906
1     334
Name: Response, dtype: int64

In [8]:
# Fill missing values in numerical columns with the column mean for the customer's Education level.
unique_education = pd.unique(data.Education)

temp_data = data.copy()  # work on a copy so the original data is not lost
columns = ['Income']  # more columns can be added here
for c in unique_education:
    # boolean mask for the rows with this Education level
    education_filter = temp_data.Education == c

    for s in columns:
        # mean of the column within this Education level (NaN values are skipped)
        mean = np.round(temp_data.loc[education_filter, s].mean(), 2)
        if not np.isnan(mean):  # a mean specific to this Education level exists
            temp_data.loc[education_filter, s] = temp_data.loc[education_filter, s].fillna(mean)
            print(f"Missing Value in {s} column fill with {mean} when Education:{c}")
        else:  # fall back to the overall mean when the whole group is missing
            all_data_mean = np.round(data[s].mean(), 2)
            temp_data.loc[education_filter, s] = temp_data.loc[education_filter, s].fillna(all_data_mean)
            print(f"Missing Value in {s} column fill with {all_data_mean}")

# copy the filled temporary data back to the main variable
data = temp_data.copy()

Missing Value in Income column fill with 52720.37 when Education:Graduation
Missing Value in Income column fill with 56145.31 when Education:PhD
Missing Value in Income column fill with 52917.53 when Education:Master
Missing Value in Income column fill with 20306.26 when Education:Basic
Missing Value in Income column fill with 47633.19 when Education:2n Cycle
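The per-Education mean fill in the cell above can also be expressed without an explicit loop, using `groupby(...).transform("mean")`. This is a minimal sketch on made-up data, not the notebook's own method; `toy`, `group_mean` and `overall_mean` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up rows: one PhD and one Basic income are missing, and the only Master income is missing.
toy = pd.DataFrame({
    "Education": ["PhD", "PhD", "PhD", "Basic", "Basic", "Master"],
    "Income": [60000.0, np.nan, 50000.0, 20000.0, np.nan, np.nan],
})

# Mean income of each row's Education group, aligned to the original index.
group_mean = toy.groupby("Education")["Income"].transform("mean").round(2)

# Overall mean, used only where a whole group has no known income.
overall_mean = round(toy["Income"].mean(), 2)

# Fill with the group mean first, then fall back to the overall mean.
toy["Income"] = toy["Income"].fillna(group_mean).fillna(overall_mean)
print(toy)
```

Because the assignment targets a column of the original frame directly, this form does not trigger pandas' "value is trying to be set on a copy of a slice" warning.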


In [9]:
data.columns

Index(['ID', 'YearBirth', 'Education', 'MaritalStatus', 'Income', 'KidHome',
       'TeenHome', 'DtCustomer', 'Recency', 'MntWines', 'MntFruits',
       'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
       'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',
       'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
       'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1',
       'AcceptedCmp2', 'Complain', 'ZCostContact', 'ZRevenue', 'Response'],
      dtype='object')

In [10]:
data.head()

Unnamed: 0,ID,YearBirth,Education,MaritalStatus,Income,KidHome,TeenHome,DtCustomer,Recency,MntWines,...,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,ZCostContact,ZRevenue,Response
0,5524,1957,Graduation,Single,58138.0,0,0,2012-09-04,58,635,...,7,0,0,0,0,0,0,3,11,1
1,2174,1954,Graduation,Single,46344.0,1,1,2014-03-08,38,11,...,5,0,0,0,0,0,0,3,11,0
2,4141,1965,Graduation,Together,71613.0,0,0,2013-08-21,26,426,...,4,0,0,0,0,0,0,3,11,0
3,6182,1984,Graduation,Together,26646.0,1,0,2014-02-10,26,11,...,6,0,0,0,0,0,0,3,11,0
4,5324,1981,PhD,Married,58293.0,1,0,2014-01-19,94,173,...,5,0,0,0,0,0,0,3,11,0


In [11]:
df_encoded = copy.deepcopy(data)

# label-encode the household count columns
df_encoded.loc[:, ['KidHome', 'TeenHome']] = df_encoded.loc[:, ['KidHome', 'TeenHome']].apply(LabelEncoder().fit_transform)


Strong correlations: catalog purchases and meat sales.
Strong correlation: wine and high sales, plus meat and
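Correlation notes like the ones above can be backed by a Pearson correlation matrix over the numeric columns. The sketch below uses made-up values for three of the dataset's column names, so the printed numbers say nothing about the real data; `toy` and `corr` are hypothetical names.

```python
import pandas as pd

# Made-up values, only to show the mechanics.
toy = pd.DataFrame({
    "MntMeatProducts":     [546, 6, 127, 20, 118],
    "NumCatalogPurchases": [10, 1, 2, 0, 3],
    "MntWines":            [635, 11, 426, 11, 173],
})

# Pairwise Pearson correlation between the numeric columns.
corr = toy.select_dtypes("number").corr()
print(corr.round(2))
```

With seaborn already imported in the notebook, `sns.heatmap(corr, annot=True)` would show the same matrix as a heatmap.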


In [13]:
bar_chart(data,'Education')

NameError: name 'bar_chart' is not defined
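The NameError above arises because `bar_chart` is never defined in the notebook. Its intended behaviour is not recorded, so the following is only a guess: a helper that cross-tabulates a feature against `Response` and draws a stacked bar chart. The `target` parameter, the returned counts table and the `demo` frame are all assumptions made for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

def bar_chart(df, feature, target="Response"):
    """Hypothetical helper: stacked bars of `feature` counts for each `target` value."""
    # Rows are target values (0/1), columns are the categories of `feature`.
    counts = pd.crosstab(df[target], df[feature])
    ax = counts.plot(kind="bar", stacked=True, figsize=(10, 5))
    ax.set_xlabel(target)
    ax.set_ylabel("Number of customers")
    ax.set_title(f"{feature} by {target}")
    return counts

# Tiny made-up frame, only to show the call; the notebook would pass `data`.
demo = pd.DataFrame({
    "Education": ["PhD", "PhD", "Master", "Basic", "Master", "PhD"],
    "Response":  [1, 0, 0, 0, 1, 1],
})
demo_counts = bar_chart(demo, "Education")
print(demo_counts)
```

Defining a helper along these lines in a cell before the first call would let the `bar_chart(data, ...)` cells run.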

In [None]:
bar_chart(data,'MaritalStatus')

In [None]:
bar_chart(data,'KidHome')

In [None]:
bar_chart(data,'TeenHome')

In [None]:
bar_chart(data,'NumWebVisitsMonth')

In [None]:
bar_chart(data,'AcceptedCmp2')

In [None]:
bar_chart(data,'AcceptedCmp3')

In [None]:
bar_chart(data,'AcceptedCmp4')

In [None]:
retention = data.drop(columns=['ZRevenue', 'ID','ZCostContact','MaritalStatus',  'Complain', 'AcceptedCmp1','AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5','Education', 'Income', 'MntWines'])


In [None]:
retention.columns

In [None]:
df_attr = retention  # use every remaining column in the pair plot

sns.pairplot(df_attr, diag_kind ="kde", corner = True); # pairplot

In [None]:
def facetgridplot(train, var):
    # KDE of `var`, drawn as one filled curve per KidHome value
    facet = sns.FacetGrid(train, hue="KidHome", aspect=4)
    facet.map(sns.kdeplot, var, fill=True)
    facet.set(xlim=(0, train[var].max()))
    facet.add_legend()
    plt.show()

In [None]:
facetgridplot(retention, 'TeenHome')

In [None]:
retention.columns
