# Project Proposal: Customer Personality Analysis Dataset

## Team Members
- Ian Hash
- Pleng Witayaweerasak

### Project Overview

#### **Introduction and Motivation**

Have you ever received an irrelevant advertisement? Do the product suggestions in your online shopping cart fail to match your needs? Do you ever wish your favorite brands would make a product just for you, to meet your specific requirements? We have consistently experienced less-than-ideal product recommendations, poorly targeted ads, and a lack of tailored products across the retailers we buy from. Occasionally, a retailer suggests, markets, or creates exactly the product we are looking for, and when that happens, we buy it! The scarcity of brands that serve individual customers well is all the motivation we need to dive deep into customer segmentation, machine learning for predicting a customer’s segment, and how to use both to better serve customers (including ourselves!). These strategies are widely used among larger retailers, yet we still yearn for a better shopping experience, especially with smaller retailers.


Both team members’ previous data science projects addressed retail questions, including customer segmentation analysis, a product recommendation service (fashion), and improving customer product selection (wine quality-price relationships). We are interested in building on these previous projects by implementing both a customer segmentation analysis and a machine learning model that predicts customer segment, with the goal of an improved customer experience in the retail space.

-------------


#### **Business Questions**

Retail businesses are often heavily focused on their number of sales and the revenue they bring in each quarter. Retail Key Performance Indicators (KPIs) are often directly related to increases in sales and revenue. Sales volume and revenue are tied together, and both can be greatly affected by how a business markets, develops products, retains customers, and distributes its resources. The business questions we aim to answer are directly related to these common retail KPIs and are intended to apply to most retail businesses.

- How can we use segmentation analysis/machine learning to conduct personalized marketing campaigns?
- How can we use segmentation analysis/machine learning to better allocate business resources to high-value customer segments?
- How can we use segmentation analysis/machine learning to guide our product development and innovation process?
- How can we use segmentation analysis/machine learning to improve customer retention?

Each of these key business questions is tied to both the technical process we are implementing (segmentation analysis/machine learning) and how it can help a retail business meet its KPIs (sales and revenue related). By answering these questions, we aim to demonstrate that customer segmentation and machine learning can be used to increase sales and revenue through targeted marketing, maximizing customer lifetime value, customer retention, and informed product development and innovation.

-------------



#### **Project Goals**

Understanding your customer is key to business success. An important part of that understanding is recognizing that your customer base is made up of different kinds of customers with different preferences, needs, and behaviors. With this information, a business can take a number of value-generating actions, including but not limited to targeted marketing, informed product development/evolution, and improved purchase recommendations. This project aims to conduct a customer segmentation analysis and implement machine learning infrastructure to market and recommend different products to different customer segments depending on their behavior, as well as to inform new product development or the evolution of existing products to better match customer needs. Our project has three central goals:

- Conduct customer segmentation analysis
- Implement a machine learning model to predict customer segment/group based on demographic information and customer behavior
- Suggest value-adding propositions to a company looking to increase its revenue with machine learning, including:
  - Showcase ML-Informed Targeted Marketing: how different segments can be marketed to differently to both save on overall advertising budget and increase ROI on advertising
  - Demonstrate ML-Informed Product Recommendations: how different customer segments can be recommended different products while shopping in real time
  - Provide ML-Informed Product Development/Evolution Suggestions: how products can be designed or evolved differently for the unique customer segments
  - Identify High-Value Customer Segments: which customers stick around longer and/or buy more. Identifying these customers can improve the ROI of marketing efforts and direct resources toward the products that high-value customers want.

Adding complicated software and infrastructure to a business can add significant overhead. We intend to show how valuable machine learning models informed by customer segmentation analysis can be to a retailer, and how they can improve its bottom line.

-------------



### Data Description

The **Customer Personality Analysis** dataset offers a comprehensive view of customer demographics, purchasing behavior, marketing campaign responses, and interaction channels.

#### Key Metadata:
- **Source**: [Kaggle – Customer Personality Analysis](https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis)
- **Size**: 2,240 rows × 29 columns  
- **Format**: CSV

#### Columns Overview:
Data definition sourced from [Kaggle](https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis)

**Demographics**
- `ID`: Customer's unique identifier
- `Year_Birth`: Customer's birth year
- `Education`: Customer's education level
- `Marital_Status`: Customer's marital status
- `Income`: Customer's yearly household income
- `Kidhome`: Number of children in customer's household
- `Teenhome`: Number of teenagers in customer's household
- `Dt_Customer`: Date of customer's enrollment with the company
- `Recency`: Number of days since customer's last purchase
- `Complain`: 1 if the customer complained in the last 2 years, 0 otherwise

**Products**

- `MntWines`: Amount spent on wine in last 2 years
- `MntFruits`: Amount spent on fruits in last 2 years
- `MntMeatProducts`: Amount spent on meat in last 2 years
- `MntFishProducts`: Amount spent on fish in last 2 years
- `MntSweetProducts`: Amount spent on sweets in last 2 years
- `MntGoldProds`: Amount spent on gold in last 2 years

**Promotion**

- `NumDealsPurchases`: Number of purchases made with a discount
- `AcceptedCmp1`: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
- `AcceptedCmp2`: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
- `AcceptedCmp3`: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
- `AcceptedCmp4`: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
- `AcceptedCmp5`: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
- `Response`: 1 if customer accepted the offer in the last campaign, 0 otherwise

**Place**

- `NumWebPurchases`: Number of purchases made through the company’s website
- `NumCatalogPurchases`: Number of purchases made using a catalogue
- `NumStorePurchases`: Number of purchases made directly in stores
- `NumWebVisitsMonth`: Number of visits to company’s website in the last month

### Task: Machine Learning

Improve business understanding of customers through segmentation analysis, and use machine learning to predict customer segments that can be used for targeted marketing, product recommendation, and product development/evolution. Clustering is an **unsupervised learning** technique that groups customers into segments based on their purchasing behavior, promotional response, and channel usage. We will explore the following techniques, which can help discover meaningful patterns:

1. K-Means Clustering

  - Works well with numerical customer attributes (e.g., Income, Purchases, Recency).

  - Produces clear, non-overlapping segments of customers.

  - Requires categorical variables to be encoded properly (e.g., One-Hot Encoding for Education, Marital Status).

  - Sensitive to feature scale and to outliers (numerical values such as income need to be scaled, and extreme values should be checked).

  - Needs the number of clusters (k) to be chosen in advance.

2. Dimensionality Reduction (PCA/t-SNE/UMAP)
  - For feature reduction or visualization prior to clustering.
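
The preprocessing and clustering steps listed above can be sketched roughly as follows. This is a minimal illustration, not our final pipeline: the `toy` frame borrows column names from the dataset, but its values, the choice of `n_clusters=2`, and the `Segment` column name are assumptions made only for the example.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny illustrative sample; column names follow the dataset, values are for demonstration only
toy = pd.DataFrame({
    "Income": [58138.0, 46344.0, 71613.0, 26646.0, 58293.0, 62513.0],
    "Recency": [58, 38, 26, 26, 94, 16],
    "MntWines": [635, 11, 426, 11, 173, 520],
    "Education": ["Graduation", "Graduation", "Graduation", "Graduation", "PhD", "Master"],
    "Marital_Status": ["Single", "Single", "Together", "Together", "Married", "Together"],
})

numeric_cols = ["Income", "Recency", "MntWines"]
categorical_cols = ["Education", "Marital_Status"]

# Scale numeric features and one-hot encode categorical ones before clustering
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# k is fixed at 2 here purely for illustration; it has to be chosen up front
segmenter = Pipeline([
    ("preprocess", preprocess),
    ("kmeans", KMeans(n_clusters=2, n_init=10, random_state=42)),
])

# "Segment" is a hypothetical column name for the assigned cluster label
toy["Segment"] = segmenter.fit_predict(toy)
print(toy[["Income", "MntWines", "Segment"]])
```

Wrapping the steps in a single `Pipeline` keeps the scaling and encoding tied to the clustering model, so the same transformations would be applied when assigning new customers to a segment.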

### Experience:

The experience is sourced from the customer behavior dataset, which includes information on customer demographics and consumer behavior. Our machine learning model will learn from this dataset, gaining experience in the domain and, more specifically, in the customer base of the business represented in the dataset. Additional experience can be gained from further data collection in the future or from the addition of reinforcement learning.


### Performance: Evaluation Methods

Since our task is unsupervised learning, our evaluation will focus on internal metrics and cluster interpretability. These are the evaluation methods that we want to explore:

1. Elbow Method: Plots Within-Cluster-Sum-of-Squares (WCSS) vs. number of clusters; the "elbow" where the curve flattens suggests a reasonable k.

2. Silhouette Score: Measures how similar an object is to its own cluster vs. others, on a scale from -1 to 1 (higher is better).

3. Domain validation: We will interpret clusters using our business logic, e.g., "High-income should be wine lovers" or "Budget-conscious customers should prefer promotions."
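
The first two evaluation methods can be sketched as below. The data here is a synthetic stand-in generated with `make_blobs` (three well-separated groups), not the customer dataset, and the range of k values is an arbitrary choice for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for scaled customer features (assumption: three clear groups)
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=42,
)
X = StandardScaler().fit_transform(X)

wcss = {}        # within-cluster sum of squares per k (for the elbow plot)
silhouette = {}  # mean silhouette score per k

for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss[k] = model.inertia_
    silhouette[k] = silhouette_score(X, model.labels_)

# Pick the k with the highest mean silhouette score
best_k = max(silhouette, key=silhouette.get)

for k in wcss:
    print(f"k={k}: WCSS={wcss[k]:.2f}, silhouette={silhouette[k]:.3f}")
print("k with highest silhouette:", best_k)
```

On real customer data the elbow and the silhouette maximum are usually less clear-cut than on synthetic blobs, which is why domain validation remains part of the evaluation.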

### Workflow Plan for our Project

**Intermediate Deliverables (April 17)**

1. Exploratory Data Analysis (EDA) - April 10

  - Analyze distributions, correlations.

  - Visualize trends in spending behavior by demographic attributes.

2. Data Understanding & Cleaning - April 14

  - Convert categorical columns (Education, Marital_Status) using encoding.

  - Normalize numeric features using a scaler.

**Final Deliverables (May 2)**

3. Clustering Modeling - April 21

  - Try K-Means

  - Use PCA or t-SNE for visualization.

  - Choose optimal clusters using silhouette or elbow method.
  
  - Evaluate model

4. Cluster Profiling - April 24

  - Analyze cluster composition: income, age, product preference.

  - Interpret clusters.

5. Business Insights - May 1

  - Recommend campaign strategies for each segment.

  - Finalize report with segment summaries and visualizations.
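
As a rough sketch of the cluster profiling and visualization steps in this plan, the example below summarizes each segment and projects the features to two dimensions with PCA. The rows and the `Segment` labels are made up for illustration; in the project they would come from the fitted clustering model.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Made-up rows with a hypothetical "Segment" label already assigned
profiled = pd.DataFrame({
    "Income": [70000, 25000, 80000, 30000, 75000, 35000],
    "Year_Birth": [1960, 1985, 1957, 1990, 1965, 1982],
    "MntWines": [600, 15, 700, 20, 650, 10],
    "MntMeatProducts": [400, 20, 500, 30, 450, 25],
    "Segment": [0, 1, 0, 1, 0, 1],
})

# Cluster profiling: segment size plus average income, birth year, and wine spend
profile = profiled.groupby("Segment").agg(
    customers=("Income", "size"),
    mean_income=("Income", "mean"),
    mean_birth_year=("Year_Birth", "mean"),
    mean_wine_spend=("MntWines", "mean"),
)
print(profile)

# 2-D PCA projection of the scaled features, for a scatter plot colored by segment
features = profiled.drop(columns="Segment")
coords = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(features))
profiled["PC1"] = coords[:, 0]
profiled["PC2"] = coords[:, 1]
```

The `profile` table is the kind of per-segment summary that could feed the campaign recommendations, while `PC1` and `PC2` can be plotted to check visually how well the segments separate.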





In [None]:
import kagglehub
imakash3011_customer_personality_analysis_path = kagglehub.dataset_download('imakash3011/customer-personality-analysis')

print('Data source import complete.')

Data source import complete.


In [None]:
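import os

import pandas as pd

# Find the first .csv or .txt file in the downloaded dataset directory
file_path = None
for root, dirs, files in os.walk(imakash3011_customer_personality_analysis_path):
    for file in files:
        if file.endswith(('.csv', '.txt')):
            file_path = os.path.join(root, file)
            break
    if file_path:
        break

# Fail early with a clear message if nothing was found
if file_path is None:
    raise FileNotFoundError('No .csv or .txt file found in the downloaded dataset.')

# The data file is tab-delimited, so pass sep='\t'
df = pd.read_csv(file_path, sep='\t')
df.head()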

Unnamed: 0,ID,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Z_CostContact,Z_Revenue,Response
0,5524,1957,Graduation,Single,58138.0,0,0,04-09-2012,58,635,...,7,0,0,0,0,0,0,3,11,1
1,2174,1954,Graduation,Single,46344.0,1,1,08-03-2014,38,11,...,5,0,0,0,0,0,0,3,11,0
2,4141,1965,Graduation,Together,71613.0,0,0,21-08-2013,26,426,...,4,0,0,0,0,0,0,3,11,0
3,6182,1984,Graduation,Together,26646.0,1,0,10-02-2014,26,11,...,6,0,0,0,0,0,0,3,11,0
4,5324,1981,PhD,Married,58293.0,1,0,19-01-2014,94,173,...,5,0,0,0,0,0,0,3,11,0
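
The `Dt_Customer` values above appear to be in day-month-year order (e.g., `21-08-2013`), so parsing them with an explicit format is safer than letting pandas guess. A minimal sketch, using only the five values shown above; the `Enroll_Year` column is a hypothetical derived feature:

```python
import pandas as pd

# Sample values copied from the Dt_Customer column in the output above
sample = pd.DataFrame({
    "Dt_Customer": ["04-09-2012", "08-03-2014", "21-08-2013", "10-02-2014", "19-01-2014"],
})

# Assumption: dates are day-month-year; an explicit format avoids month-first misparsing
sample["Dt_Customer"] = pd.to_datetime(sample["Dt_Customer"], format="%d-%m-%Y")

# Hypothetical derived feature: the year the customer enrolled
sample["Enroll_Year"] = sample["Dt_Customer"].dt.year
print(sample)
```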


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 29 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   2240 non-null   int64  
 1   Year_Birth           2240 non-null   int64  
 2   Education            2240 non-null   object 
 3   Marital_Status       2240 non-null   object 
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64  
 6   Teenhome             2240 non-null   int64  
 7   Dt_Customer          2240 non-null   object 
 8   Recency              2240 non-null   int64  
 9   MntWines             2240 non-null   int64  
 10  MntFruits            2240 non-null   int64  
 11  MntMeatProducts      2240 non-null   int64  
 12  MntFishProducts      2240 non-null   int64  
 13  MntSweetProducts     2240 non-null   int64  
 14  MntGoldProds         2240 non-null   int64  
 15  NumDealsPurchases    2240 non-null   i