## Marketing Customer Behavior Analysis

### Key Objectives:
---
1. Conduct thorough data exploration to uncover insights, establish relationships, and enhance comprehension of customer characteristics.
2. Propose and articulate a customer segmentation strategy derived from observed customer behaviors.
3. Engineer and validate a predictive model to empower the company in maximizing profits from upcoming marketing campaigns.

### Data Dictionary
---
| **Column Name**         | **Description**                                            |
|--------------------------|------------------------------------------------------------|
| id                       | Customer's unique ID                                       |
| year_birth               | Customer's birth year                                      |
| education                | Customer's education level                                 |
| marital_status           | Customer's marital status                                  |
| income                   | Customer's yearly household income                         |
| kidhome                  | Number of children in customer's household                 |
| teenhome                 | Number of teenagers in customer's household                |
| dt_customer              | Date of customer's enrollment with the company             |
| recency                  | Number of days since customer's last purchase              |
| mntwines                 | Amount spent on wine in the last 2 years                   |
| mntfruits                | Amount spent on fruits in the last 2 years                 |
| mntmeatproducts          | Amount spent on meat in the last 2 years                   |
| mntfishproducts          | Amount spent on fish in the last 2 years                   |
| mntsweetproducts         | Amount spent on sweets in the last 2 years                 |
| mntgoldprods             | Amount spent on gold in the last 2 years                   |
| numdealspurchases        | Number of purchases made with a discount                   |
| numwebpurchases          | Number of purchases made through the company's web site    |
| numcatalogpurchases      | Number of purchases made using a catalogue                |
| numstorepurchases        | Number of purchases made directly in stores                |
| numwebvisitsmonth        | Number of visits to company's web site in the last month   |
| acceptedcmp1             | 1 if customer accepted the offer in the 1st campaign, 0 otherwise  |
| acceptedcmp2             | 1 if customer accepted the offer in the 2nd campaign, 0 otherwise  |
| acceptedcmp3             | 1 if customer accepted the offer in the 3rd campaign, 0 otherwise  |
| acceptedcmp4             | 1 if customer accepted the offer in the 4th campaign, 0 otherwise  |
| acceptedcmp5             | 1 if customer accepted the offer in the 5th campaign, 0 otherwise  |
| response                 | 1 if customer accepted the offer in the last campaign, 0 otherwise (Target variable) |
| complain                 | 1 if customer complained in the last 2 years, 0 otherwise |
| country                  | Customer's location                                       |



In [2]:
import datetime

from IPython.display import display

import pandas as pd
import numpy as np

import seaborn as sns

import scipy.stats as stats 


### Exploratory Data Analysis and Data Preprocessing
---

In [3]:
marketing_data = pd.read_csv("../data/raw-data.csv")

marketing_data.columns = [column.lower() for column in marketing_data.columns] # Converting columns to lowercase to standardize column names.

display(marketing_data.head())
print(marketing_data.info())

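|   | id | year_birth | education | marital_status | income | kidhome | teenhome | dt_customer | recency | mntwines | ... | numstorepurchases | numwebvisitsmonth | acceptedcmp3 | acceptedcmp4 | acceptedcmp5 | acceptedcmp1 | acceptedcmp2 | response | complain | country |
|---|----|------------|-----------|----------------|--------|---------|----------|-------------|---------|----------|-----|-------------------|-------------------|--------------|--------------|--------------|--------------|--------------|----------|----------|---------|
| 0 | 1826 | 1970 | Graduation | Divorced | $84,835.00 | 0 | 0 | 6/16/14 | 0 | 189 | ... | 6 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | SP |
| 1 | 1 | 1961 | Graduation | Single | $57,091.00 | 0 | 0 | 6/15/14 | 0 | 464 | ... | 7 | 5 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | CA |
| 2 | 10476 | 1958 | Graduation | Married | $67,267.00 | 0 | 1 | 5/13/14 | 0 | 134 | ... | 5 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | US |
| 3 | 1386 | 1967 | Graduation | Together | $32,474.00 | 1 | 1 | 5/11/14 | 0 | 10 | ... | 2 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | AUS |
| 4 | 5371 | 1989 | Graduation | Single | $21,474.00 | 1 | 0 | 4/8/14 | 0 | 6 | ... | 2 | 7 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | SP |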


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 28 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   id                   2240 non-null   int64 
 1   year_birth           2240 non-null   int64 
 2   education            2240 non-null   object
 3   marital_status       2240 non-null   object
 4   income               2216 non-null   object
 5   kidhome              2240 non-null   int64 
 6   teenhome             2240 non-null   int64 
 7   dt_customer          2240 non-null   object
 8   recency              2240 non-null   int64 
 9   mntwines             2240 non-null   int64 
 10  mntfruits            2240 non-null   int64 
 11  mntmeatproducts      2240 non-null   int64 
 12  mntfishproducts      2240 non-null   int64 
 13  mntsweetproducts     2240 non-null   int64 
 14  mntgoldprods         2240 non-null   int64 
 15  numdealspurchases    2240 non-null   int64 
 16  numweb

In [4]:
marketing_data.isna().sum()/marketing_data.shape[0]*100

id                     0.000000
year_birth             0.000000
education              0.000000
marital_status         0.000000
income                 1.071429
kidhome                0.000000
teenhome               0.000000
dt_customer            0.000000
recency                0.000000
mntwines               0.000000
mntfruits              0.000000
mntmeatproducts        0.000000
mntfishproducts        0.000000
mntsweetproducts       0.000000
mntgoldprods           0.000000
numdealspurchases      0.000000
numwebpurchases        0.000000
numcatalogpurchases    0.000000
numstorepurchases      0.000000
numwebvisitsmonth      0.000000
acceptedcmp3           0.000000
acceptedcmp4           0.000000
acceptedcmp5           0.000000
acceptedcmp1           0.000000
acceptedcmp2           0.000000
response               0.000000
complain               0.000000
country                0.000000
dtype: float64

In [5]:
marketing_data.drop(labels="id", axis=1, inplace=True) # Dropping ID since it is not necessary for this exercise.
marketing_data.dropna(inplace=True) # Dropping rows with null values; only income has missing values (about 1.07% of rows).

In [6]:
marketing_data["income"] = marketing_data["income"].apply(lambda x: x.replace("$", "").replace(",", "").replace(" ", "")) # Removing unnecessary characters ("$", ",", and spaces) from income amounts.
marketing_data["income"] = marketing_data["income"].astype(float)
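The same cleanup can also be written with vectorized string methods. The following is a minimal sketch on toy values (the series `raw_income` is illustrative, not the notebook's data); it assumes the raw strings follow the `$84,835.00` pattern seen in the preview above. Unlike the `apply` version, it tolerates missing values, so it does not depend on nulls having been dropped first.

```python
import pandas as pd

# Toy values in the same "$84,835.00" style as the raw income column (illustrative only).
raw_income = pd.Series(["$84,835.00 ", "$57,091.00 ", None, "$1,730.00 "])

# Remove every character that is not a digit or a decimal point, then convert.
# errors="coerce" turns anything that still cannot be parsed into NaN.
clean_income = pd.to_numeric(
    raw_income.str.replace(r"[^0-9.]", "", regex=True),
    errors="coerce",
)

print(clean_income)
```

The regular expression is broader than the three explicit `replace` calls: it would also strip other currency symbols or stray letters, which may or may not be desirable for a given dataset.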

In [7]:
marketing_data["dt_customer"] = pd.to_datetime(marketing_data["dt_customer"], format="%m/%d/%y") # Converting the customer's enrollment date from string to datetime.

In [8]:
marketing_data.describe()

|   | year_birth | income | kidhome | teenhome | dt_customer | recency | mntwines | mntfruits | mntmeatproducts | mntfishproducts | ... | numcatalogpurchases | numstorepurchases | numwebvisitsmonth | acceptedcmp3 | acceptedcmp4 | acceptedcmp5 | acceptedcmp1 | acceptedcmp2 | response | complain |
|---|------------|--------|---------|----------|-------------|---------|----------|-----------|-----------------|-----------------|-----|---------------------|-------------------|-------------------|--------------|--------------|--------------|--------------|--------------|----------|----------|
| count | 2216.0 | 2216.0 | 2216.0 | 2216.0 | 2216 | 2216.0 | 2216.0 | 2216.0 | 2216.0 | 2216.0 | ... | 2216.0 | 2216.0 | 2216.0 | 2216.0 | 2216.0 | 2216.0 | 2216.0 | 2216.0 | 2216.0 | 2216.0 |
| mean | 1968.820397 | 52247.251354 | 0.441787 | 0.505415 | 2013-07-10 11:29:27.509025280 | 49.012635 | 305.091606 | 26.356047 | 166.995939 | 37.637635 | ... | 2.671029 | 5.800993 | 5.319043 | 0.073556 | 0.074007 | 0.073105 | 0.064079 | 0.013538 | 0.150271 | 0.009477 |
| min | 1893.0 | 1730.0 | 0.0 | 0.0 | 2012-07-30 00:00:00 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1959.0 | 35303.0 | 0.0 | 0.0 | 2013-01-16 00:00:00 | 24.0 | 24.0 | 2.0 | 16.0 | 3.0 | ... | 0.0 | 3.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 1970.0 | 51381.5 | 0.0 | 0.0 | 2013-07-08 12:00:00 | 49.0 | 174.5 | 8.0 | 68.0 | 12.0 | ... | 2.0 | 5.0 | 6.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 1977.0 | 68522.0 | 1.0 | 1.0 | 2013-12-31 00:00:00 | 74.0 | 505.0 | 33.0 | 232.25 | 50.0 | ... | 4.0 | 8.0 | 7.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 1996.0 | 666666.0 | 2.0 | 2.0 | 2014-06-29 00:00:00 | 99.0 | 1493.0 | 199.0 | 1725.0 | 259.0 | ... | 28.0 | 13.0 | 20.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| std | 11.985554 | 25173.076661 | 0.536896 | 0.544181 |  | 28.948352 | 337.32792 | 39.793917 | 224.283273 | 54.752082 | ... | 2.926734 | 3.250785 | 2.425359 | 0.261106 | 0.261842 | 0.260367 | 0.24495 | 0.115588 | 0.357417 | 0.096907 |


### Feature Engineering
---

We will create five new features:

1. The customer's current age, derived from their birth year, to assess the age groups of our customers.
2. The total amount spent across all product categories.
3. The total number of purchases across the web, store, and catalog channels.
4. The web engagement rate, defined as the ratio of the number of web visits in the last month to the total number of purchases.
5. An income bracket that classifies each customer as "low", "medium", or "high" income.

In [26]:
marketing_data["customer_age"] = datetime.datetime.now().year - marketing_data["year_birth"]

marketing_data['total_amount_spent'] = marketing_data[['mntwines', 'mntfruits', 'mntmeatproducts', 
                                'mntfishproducts', 'mntsweetproducts', 
                                'mntgoldprods']].sum(axis=1)

marketing_data['total_num_purchases'] = marketing_data[['numwebpurchases', 'numcatalogpurchases', 
                                                        'numstorepurchases']].sum(axis=1)

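# Guarding against division by zero: customers with no recorded purchases get NaN instead of inf.
marketing_data['web_engagement_rate'] = marketing_data['numwebvisitsmonth'] / marketing_data['total_num_purchases'].replace(0, np.nan)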


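# Note: pd.cut assigns NaN to incomes above the last edge (140,000), so extreme incomes are left without a bracket.
income_bins = [0, 30000, 60000, 140000]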
income_labels = ['low', 'medium', 'high']
marketing_data['income_bracket'] = pd.cut(marketing_data['income'], bins=income_bins, labels=income_labels)
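
Two edge cases are worth keeping in mind for these features. A ratio such as `web_engagement_rate` is undefined when `total_num_purchases` is zero, and `pd.cut` assigns `NaN` to any value outside its outermost bin edges (the `describe()` output above shows a maximum income of 666,666, well above the 140,000 upper edge). The sketch below illustrates both on a toy frame; the values in `toy` are made up, and the open-ended top bin is an assumption shown for comparison, not the notebook's choice.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative only): one customer with zero purchases, one with an extreme income.
toy = pd.DataFrame({
    "income": [25000.0, 45000.0, 90000.0, 666666.0],
    "numwebvisitsmonth": [5, 3, 2, 7],
    "total_num_purchases": [10, 0, 4, 7],
})

# Replace a zero denominator with NaN so the ratio is missing rather than infinite.
toy["web_engagement_rate"] = (
    toy["numwebvisitsmonth"] / toy["total_num_purchases"].replace(0, np.nan)
)

labels = ["low", "medium", "high"]

# Closed top edge: the 666,666 income falls outside every bin and becomes NaN.
closed_top = pd.cut(toy["income"], bins=[0, 30000, 60000, 140000], labels=labels)

# Open-ended top edge (an assumption): every income above 60,000 is labeled "high".
open_top = pd.cut(toy["income"], bins=[0, 30000, 60000, np.inf], labels=labels)

print(toy.assign(closed_top=closed_top, open_top=open_top))
```

Whether extreme incomes should be excluded or folded into the "high" bracket is a modeling decision; either way, it helps to count how many rows end up without a bracket before running downstream tests.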


In [24]:
marketing_data["complain"] = pd.Categorical(marketing_data["complain"])
marketing_data["income_bracket"] = pd.Categorical(marketing_data["income_bracket"])
marketing_data["education"] = pd.Categorical(marketing_data["education"])

In [11]:
stats.pointbiserialr(marketing_data["complain"], marketing_data["income"])

SignificanceResult(statistic=-0.027224512314477325, pvalue=0.2001612932395199)
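
With a correlation of about -0.027 and a p-value of about 0.20, the test above gives no evidence at the conventional 0.05 level of a linear association between complaining and income. As background, the point-biserial correlation is mathematically the Pearson correlation computed with one binary variable. The sketch below checks this on synthetic data; the arrays `flag` and `value` are hypothetical and unrelated to the marketing dataset.

```python
import numpy as np
from scipy import stats

# Synthetic data (illustrative only): a binary flag and an unrelated continuous value.
rng = np.random.default_rng(0)
flag = rng.integers(0, 2, size=200)
value = rng.normal(50_000, 20_000, size=200)

# Both results unpack as (statistic, p-value).
r_pb, p_pb = stats.pointbiserialr(flag, value)
r_pearson, p_pearson = stats.pearsonr(flag, value)

print(f"point-biserial r = {r_pb:.6f}, p = {p_pb:.4f}")
print(f"Pearson r        = {r_pearson:.6f}, p = {p_pearson:.4f}")
```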

In [23]:
contingency_table = pd.crosstab(marketing_data["income_bracket"], marketing_data["education"])
print("Chi2 P-value",stats.chi2_contingency(contingency_table).pvalue)
print("Cramer's V:",stats.contingency.association(contingency_table, method="cramer"))

Chi2 P-value 2.9052559944433553e-61
Cramer's V: 0.26298233434203155
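
The very small p-value indicates that income bracket and education are not independent, while a Cramér's V of about 0.26 suggests the association is modest in strength rather than strong. To make the effect-size measure concrete, Cramér's V can be computed by hand as sqrt(chi2 / (n * (min(rows, cols) - 1))). The sketch below does so for a made-up 3x3 table (the counts in `table` are hypothetical) and compares the result with `stats.contingency.association`.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical contingency table (illustrative counts only).
table = pd.DataFrame(
    [[30, 10, 5], [20, 25, 15], [5, 15, 40]],
    index=["low", "medium", "high"],
    columns=["Basic", "Graduation", "PhD"],
)
observed = table.to_numpy()

# Chi-square statistic without continuity correction.
chi2, p_value, dof, expected = stats.chi2_contingency(observed, correction=False)

n = observed.sum()              # total number of observations
k = min(observed.shape) - 1     # min(rows, cols) - 1
v_manual = np.sqrt(chi2 / (n * k))

v_scipy = stats.contingency.association(observed, method="cramer")

print(f"manual V = {v_manual:.6f}, scipy V = {v_scipy:.6f}, p = {p_value:.3g}")
```

Because V is normalized to the range 0 to 1, it stays comparable across tables of different sizes, which the chi-square p-value alone is not.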
