## Dataset: https://www.kaggle.com/datasets/kaushiksuresh147/customer-segmentation

#### Objective

This case study analyzes customer data to uncover demographic and behavioral patterns that support segmentation. It covers the size of the customer base, the distribution of professions and gender, and the average age and family size within each segment. It also examines spending-score patterns and how they relate to profession, work experience, and marital status. Preprocessing steps, such as handling missing values and group-by operations, are applied to compare age and work experience across segments. Customers are classified into three age groups (Young, Middle-aged, and Senior) to derive targeted insights. The resulting picture of high- and low-spending customers can help businesses personalize marketing, improve retention, and optimize sales performance.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('customer_segmentation.csv')

In [3]:
df.head(5)

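| | ID | Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Var_1 | Segmentation |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 462809 | Male | No | 22 | No | Healthcare | 1.0 | Low | 4.0 | Cat_4 | D |
| 1 | 462643 | Female | Yes | 38 | Yes | Engineer | NaN | Average | 3.0 | Cat_4 | A |
| 2 | 466315 | Female | Yes | 67 | Yes | Engineer | 1.0 | Low | 1.0 | Cat_6 | B |
| 3 | 461735 | Male | Yes | 67 | Yes | Lawyer | 0.0 | High | 2.0 | Cat_6 | B |
| 4 | 462669 | Female | Yes | 40 | Yes | Entertainment | NaN | High | 6.0 | Cat_6 | A |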


In [4]:
df.shape

(8068, 11)

In [5]:
df.columns

Index(['ID', 'Gender', 'Ever_Married', 'Age', 'Graduated', 'Profession',
       'Work_Experience', 'Spending_Score', 'Family_Size', 'Var_1',
       'Segmentation'],
      dtype='object')

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8068 entries, 0 to 8067
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ID               8068 non-null   int64  
 1   Gender           8068 non-null   object 
 2   Ever_Married     7928 non-null   object 
 3   Age              8068 non-null   int64  
 4   Graduated        7990 non-null   object 
 5   Profession       7944 non-null   object 
 6   Work_Experience  7239 non-null   float64
 7   Spending_Score   8068 non-null   object 
 8   Family_Size      7733 non-null   float64
 9   Var_1            7992 non-null   object 
 10  Segmentation     8068 non-null   object 
dtypes: float64(2), int64(2), object(7)
memory usage: 693.5+ KB


In [7]:
df.describe()

| | ID | Age | Work_Experience | Family_Size |
|---|---|---|---|---|
| count | 8068.0 | 8068.0 | 7239.0 | 7733.0 |
| mean | 463479.214551 | 43.466906 | 2.641663 | 2.850123 |
| std | 2595.381232 | 16.711696 | 3.406763 | 1.531413 |
| min | 458982.0 | 18.0 | 0.0 | 1.0 |
| 25% | 461240.75 | 30.0 | 0.0 | 2.0 |
| 50% | 463472.5 | 40.0 | 1.0 | 3.0 |
| 75% | 465744.25 | 53.0 | 4.0 | 4.0 |
| max | 467974.0 | 89.0 | 14.0 | 9.0 |


In [8]:
df.isnull().sum()

ID                   0
Gender               0
Ever_Married       140
Age                  0
Graduated           78
Profession         124
Work_Experience    829
Spending_Score       0
Family_Size        335
Var_1               76
Segmentation         0
dtype: int64
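
Raw counts are easier to compare once they are expressed as a share of the rows. The following is a minimal sketch of that idea on a small made-up frame; the `toy` data and the `missing_pct` name are illustrative assumptions, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    'Ever_Married': ['Yes', None, 'No', 'Yes'],
    'Work_Experience': [1.0, np.nan, np.nan, 4.0],
    'Age': [22, 38, 67, 40],
})

# isnull() yields booleans; their column-wise mean is the fraction of missing entries
missing_pct = (toy.isnull().mean() * 100).round(1)
print(missing_pct.sort_values(ascending=False))
```

Applied to `df`, the same expression would rank the columns by how much of their data is missing.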

### What is the total number of customers in the dataset?

In [9]:
total_count = len(df)
total_count

8068

### How many unique professions are listed in the dataset?

In [10]:
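# Assign the result back; inplace=True on a column selection may not update df in newer pandas
df['Profession'] = df['Profession'].fillna('Unknown')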

In [11]:
unique_profession = df['Profession'].unique()
unique_profession

array(['Healthcare', 'Engineer', 'Lawyer', 'Entertainment', 'Artist',
       'Executive', 'Doctor', 'Homemaker', 'Marketing', 'Unknown'],
      dtype=object)
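
Because missing professions were filled with a placeholder, the array above contains one entry that is not a real profession. A small sketch of how a count could be taken with and without that placeholder, using a made-up Series (the `toy` values are assumptions for illustration):

```python
import pandas as pd

# Made-up sample with one missing profession
toy = pd.Series(['Artist', 'Doctor', None, 'Artist', 'Lawyer'], name='Profession')

# nunique() skips missing values by default, so no placeholder is counted
n_professions = toy.nunique()

# After filling, the placeholder shows up as one extra "profession"
n_with_placeholder = toy.fillna('Unknown').nunique()

print(n_professions, n_with_placeholder)
```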

### Count the number of male and female customers.

In [12]:
gender_distribution = df['Gender'].value_counts()
gender_distribution

Male      4417
Female    3651
Name: Gender, dtype: int64

### Find the average age of customers.

In [13]:
avg_age = df['Age'].mean()
avg_age

43.46690629647992

### List the distinct values in the Segmentation column.

In [14]:
distinct_segment = df['Segmentation'].value_counts()
distinct_segment

D    2268
A    1972
C    1970
B    1858
Name: Segmentation, dtype: int64

### What is the distribution of customers across different spending scores?

In [15]:
dist_spending_score = df['Spending_Score'].value_counts()
dist_spending_score

Low        4878
Average    1974
High       1216
Name: Spending_Score, dtype: int64

### Identify the top 3 most common professions in the dataset.

In [16]:
# value_counts() sorts by frequency, so the first three rows are the most common professions
top_3_professions = df['Profession'].value_counts().head(3)
top_3_professions

Artist           2516
Healthcare       1332
Entertainment     949
Name: Profession, dtype: int64

### Calculate the average family size for each segmentation group.

In [17]:
avg_family_size = df.groupby('Segmentation')['Family_Size'].mean()
round(avg_family_size,0)

Segmentation
A    2.0
B    3.0
C    3.0
D    3.0
Name: Family_Size, dtype: float64

### How many customers have a work experience of more than 5 years?

In [18]:
cnt_of_customers = len(df[df['Work_Experience'] > 5])
cnt_of_customers

1579

### Handle missing values for the Ever_Married column by filling them with the mode.

In [19]:
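# mode() returns a Series; [0] takes the most frequent value. Assign back rather than using inplace=True.
df['Ever_Married'] = df['Ever_Married'].fillna(df['Ever_Married'].mode()[0])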

### Perform a group-by operation to find the average age and work experience for each segmentation group.

In [20]:
avg_age_and_exp = df.groupby('Segmentation')[['Age','Work_Experience']].mean()
avg_age_and_exp

| Segmentation | Age | Work_Experience |
|---|---|---|
| A | 44.924949 | 2.874578 |
| B | 48.200215 | 2.378151 |
| C | 49.144162 | 2.240771 |
| D | 33.390212 | 3.021717 |
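
`mean()` skips missing values, so the Work_Experience averages are based on fewer rows than the Age averages. A hedged sketch of how the group-by could also report that row count, using named aggregation on made-up data (the `toy` frame and the output column names are assumptions):

```python
import numpy as np
import pandas as pd

# Made-up sample; one Work_Experience value is missing in segment A
toy = pd.DataFrame({
    'Segmentation': ['A', 'A', 'D', 'D'],
    'Age': [40, 50, 20, 30],
    'Work_Experience': [2.0, np.nan, 1.0, 3.0],
})

# Named aggregation: output column = (input column, aggregation function)
summary = toy.groupby('Segmentation').agg(
    avg_age=('Age', 'mean'),
    avg_experience=('Work_Experience', 'mean'),
    n_experience=('Work_Experience', 'count'),  # count() ignores NaN
)
print(summary)
```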


### Create a new column categorizing customers as “Young” (<30), “Middle-aged” (30–50), or “Senior” (>50).

In [21]:
def cust_category(age):
    """Bucket an age into Young (<30), Middle-aged (30-50), or Senior (>50)."""
    if age < 30:
        return 'Young'
    elif age <= 50:
        return 'Middle-aged'
    else:
        return 'Senior'

In [22]:
df['Customer_Category'] = df['Age'].apply(cust_category)

In [23]:
df

| | ID | Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Var_1 | Segmentation | Customer_Category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 462809 | Male | No | 22 | No | Healthcare | 1.0 | Low | 4.0 | Cat_4 | D | Young |
| 1 | 462643 | Female | Yes | 38 | Yes | Engineer | NaN | Average | 3.0 | Cat_4 | A | Middle-aged |
| 2 | 466315 | Female | Yes | 67 | Yes | Engineer | 1.0 | Low | 1.0 | Cat_6 | B | Senior |
| 3 | 461735 | Male | Yes | 67 | Yes | Lawyer | 0.0 | High | 2.0 | Cat_6 | B | Senior |
| 4 | 462669 | Female | Yes | 40 | Yes | Entertainment | NaN | High | 6.0 | Cat_6 | A | Middle-aged |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8063 | 464018 | Male | No | 22 | No | Unknown | 0.0 | Low | 7.0 | Cat_1 | D | Young |
| 8064 | 464685 | Male | No | 35 | No | Executive | 3.0 | Low | 4.0 | Cat_4 | D | Middle-aged |
| 8065 | 465406 | Female | No | 33 | Yes | Healthcare | 1.0 | Low | 1.0 | Cat_6 | D | Middle-aged |
| 8066 | 467299 | Female | No | 27 | Yes | Healthcare | 1.0 | Low | 4.0 | Cat_6 | B | Young |
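
The three age bands named in the heading can also be assigned without a Python-level function. A minimal sketch using `pd.cut`, assuming integer ages as in the Age column; the sample ages are made up to exercise the band boundaries.

```python
import numpy as np
import pandas as pd

# Made-up ages chosen to sit on and around the band boundaries
ages = pd.Series([18, 29, 30, 50, 51, 89], name='Age')

# Right-closed bins: (0, 29] -> Young, (29, 50] -> Middle-aged, (50, inf] -> Senior.
# With integer ages this matches <30, 30-50, and >50.
categories = pd.cut(
    ages,
    bins=[0, 29, 50, np.inf],
    labels=['Young', 'Middle-aged', 'Senior'],
)
print(categories.tolist())
```

Checking the boundary ages (29, 30, 50, 51) is a quick way to confirm that the bands behave as the heading describes.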


### What is the age distribution of customers in each segmentation group?

In [24]:
age_dist = df.groupby('Segmentation')['Age'].describe()
age_dist

| Segmentation | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| A | 1972.0 | 44.924949 | 16.406909 | 18.0 | 33.0 | 41.0 | 52.0 | 89.0 |
| B | 1858.0 | 48.200215 | 14.806443 | 18.0 | 37.0 | 46.0 | 58.0 | 89.0 |
| C | 1970.0 | 49.144162 | 14.57509 | 18.0 | 38.0 | 49.0 | 59.0 | 89.0 |
| D | 2268.0 | 33.390212 | 15.680304 | 18.0 | 22.0 | 29.0 | 38.0 | 89.0 |


### Which professions contribute the most to the "High" spending score group?

In [25]:
high_score = df[df['Spending_Score'] == 'High']
prof_with_high_score = high_score['Profession'].value_counts()
prof_with_high_score.head(1)

Executive    398
Name: Profession, dtype: int64
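
Raw counts favor professions with many customers overall. As a complementary view, the share of high spenders within each profession could be computed with a row-normalized cross-tab. This is only a sketch on made-up data; the `toy` frame is an assumption and its numbers say nothing about the real dataset.

```python
import pandas as pd

# Made-up sample: Artist has more high spenders in absolute terms,
# Executive has a larger share of them
toy = pd.DataFrame({
    'Profession': ['Artist'] * 6 + ['Executive'] * 3,
    'Spending_Score': ['Low', 'Low', 'Low', 'High', 'High', 'High',
                       'High', 'High', 'Low'],
})

# normalize='index' makes each row sum to 1: the share of each
# spending score within a profession
share = pd.crosstab(toy['Profession'], toy['Spending_Score'], normalize='index')
print(share['High'].sort_values(ascending=False))
```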

### Which customer segment has the highest average family size?

In [26]:
avg_family_size = df.groupby('Segmentation')['Family_Size'].mean()
highest_avg_segment = avg_family_size.idxmax()
highest_avg_segment

'D'

### What is the proportion of customers in each segment (Segmentation column)?

In [27]:
cust_dist = df.groupby('Segmentation')['ID'].count()
total_cust = len(df['ID'])
cust_proportion = cust_dist / total_cust
cust_proportion

Segmentation
A    0.244422
B    0.230293
C    0.244175
D    0.281111
Name: ID, dtype: float64
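
The same proportions can be obtained in one step with `value_counts(normalize=True)`. A short sketch on a made-up Series (the `toy` values are assumptions for illustration):

```python
import pandas as pd

# Made-up segment labels
toy = pd.Series(['A', 'B', 'A', 'D', 'D', 'D', 'C', 'A'], name='Segmentation')

# normalize=True returns fractions that sum to 1 instead of raw counts
proportions = toy.value_counts(normalize=True).sort_index()
print(proportions)
```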

### Identify the age group (e.g., <30, 30–50, >50) that contributes the most to the "D" segment.

In [28]:
age_grp = df[df['Segmentation']=='D'] 
result = age_grp.groupby('Customer_Category')['ID'].count()
result.idxmax()

'Young'

### For customers with "Low" spending scores, what is the most common profession?

In [29]:
low_spending_score = df[df['Spending_Score'] == 'Low']
result = low_spending_score['Profession'].value_counts()
result.idxmax()

'Artist'

### Which segment has the highest percentage of customers with a family size greater than 4?

In [30]:
family_size_ = df[df['Family_Size'] > 4]
result = family_size_.groupby('Segmentation')['ID'].count()
total_cust = df.groupby('Segmentation')['ID'].count()
highest_percentage = (result/total_cust)*100
highest_percentage

Segmentation
A     8.113590
B    10.710441
C    11.675127
D    18.738977
Name: ID, dtype: float64
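
Note that the denominator above includes customers whose Family_Size is missing. The sketch below reproduces the same calculation with a boolean mask and picks the top segment with `idxmax()`; the `toy` frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing Family_Size per segment
toy = pd.DataFrame({
    'Segmentation': ['A', 'A', 'A', 'A', 'D', 'D', 'D', 'D'],
    'Family_Size': [2.0, 5.0, np.nan, 3.0, 6.0, 5.0, 1.0, np.nan],
})

# Comparisons with NaN are False, so rows with a missing Family_Size
# count as "not large" and still contribute to the denominator
is_large = toy['Family_Size'] > 4
pct_large = is_large.groupby(toy['Segmentation']).mean() * 100

print(pct_large)
print(pct_large.idxmax())
```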

### Among customers with more than 10 years of work experience, which spending score is the most common?

In [31]:
cust_data = df[df['Work_Experience'] > 10]
common_spending_score = cust_data['Spending_Score'].value_counts()
most_common_score = common_spending_score.idxmax()
most_common_count = common_spending_score.max()
print(f'The most common spending score among customers with more than 10 years of experience is {most_common_score} ({most_common_count} customers).')

The most common spending score among customers with more than 10 years of experience is Low (132 customers).
