## Customer Segmentation for Automobile Company

<u>Context</u>

An automobile company plans to enter new markets with its existing products (P1, P2, P3, P4 and P5). After intensive market research, the company has concluded that the behavior of the new markets is similar to that of its existing market.

<u>Content</u>

In the existing market, the sales team has classified all customers into 4 segments (A, B, C, D). They then performed segmented outreach and communication for the different customer segments. This strategy has worked exceptionally well for them. They plan to use the same strategy in the new markets and have identified 2627 new potential customers.

You are required to help the manager predict the right segment for each of the new customers.

<u>Acknowledgements</u>

https://datahack.analyticsvidhya.com/contest/janatahack-customer-segmentation/#ProblemStatement

<u>Inspiration</u>

https://datahack.analyticsvidhya.com/contest/janatahack-customer-segmentation/#ProblemStatement

<u>Dataset Source</u>

https://www.kaggle.com/vetrirah/customer

### __1. Data Preprocessing__ 

The first step in building the model is to import the libraries and understand the features in the unprocessed dataset.

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

__a. Sample Submission Data__

In [3]:
df1 = pd.read_csv('./Files/sample_submission.csv')
df1.head()

Unnamed: 0,ID,Segmentation
0,458989,A
1,458994,A
2,458996,A
3,459000,A
4,459001,A


In [28]:
listItem = []
for col in df1.columns:
    listItem.append([col, df1[col].dtype,
                   df1[col].isna().sum(),
                   round((df1[col].isna().sum()/len(df1[col])) *100, 2),
                   df1[col].nunique(), list(df1[col].unique()[:5])]);

df1Desc = pd.DataFrame(columns=['dataFeatures', 'dataType', 'null', 'nullPct', 'unique', 'uniqueSample'],
                     data=listItem)

df1Desc

Unnamed: 0,dataFeatures,dataType,null,nullPct,unique,uniqueSample
0,ID,int64,0,0.0,2627,"[458989, 458994, 458996, 459000, 459001]"
1,Segmentation,object,0,0.0,1,[A]


In [30]:
df1.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
ID,2627.0,463433.918919,2618.245698,458989.0,461162.5,463379.0,465696.0,467968.0


__b. Train and Test Data__

Columns Explanation:

- Work_Experience = customer's work experience in Years.
- Graduated = Whether the customer is a graduate.
- Family_Size = Number of family members including the customer.
- Spending_Score = Spending score of the customer.
- Var_1 = Anonymised Category for the customer.
- Segmentation = (target) Customer Segment of the Customer.

Dataframe preparation

In [46]:
dfTr = pd.read_csv('./Files/Train.csv')
dfTr.head()

Unnamed: 0,ID,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1,Segmentation
0,462809,Male,No,22,No,Healthcare,1.0,Low,4.0,Cat_4,D
1,462643,Female,Yes,38,Yes,Engineer,,Average,3.0,Cat_4,A
2,466315,Female,Yes,67,Yes,Engineer,1.0,Low,1.0,Cat_6,B
3,461735,Male,Yes,67,Yes,Lawyer,0.0,High,2.0,Cat_6,B
4,462669,Female,Yes,40,Yes,Entertainment,,High,6.0,Cat_6,A


In [47]:
dfTs = pd.read_csv('./Files/Test.csv')
dfTs.head()

Unnamed: 0,ID,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1
0,458989,Female,Yes,36,Yes,Engineer,0.0,Low,1.0,Cat_6
1,458994,Male,Yes,37,Yes,Healthcare,8.0,Average,4.0,Cat_6
2,458996,Female,Yes,69,No,,0.0,Low,1.0,Cat_6
3,459000,Male,Yes,59,No,Executive,11.0,High,2.0,Cat_6
4,459001,Female,No,19,No,Marketing,,Low,4.0,Cat_6
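The two previews suggest that the test data has the same columns as the training data, minus the target. A quick schema comparison can confirm this before modelling. The sketch below is illustrative only: it uses tiny stand-in frames instead of the real files, and `compare_columns` is a hypothetical helper name, not part of the original notebook.

```python
import pandas as pd

def compare_columns(train: pd.DataFrame, test: pd.DataFrame) -> dict:
    """Return the columns that appear in only one of the two frames."""
    train_cols, test_cols = set(train.columns), set(test.columns)
    return {
        "only_in_train": sorted(train_cols - test_cols),
        "only_in_test": sorted(test_cols - train_cols),
    }

# Stand-in frames that mimic the layout of Train.csv and Test.csv (assumed subset of columns)
train_demo = pd.DataFrame({"ID": [462809, 462643], "Age": [22, 38], "Segmentation": ["D", "A"]})
test_demo = pd.DataFrame({"ID": [458989, 458994], "Age": [36, 37]})

diff = compare_columns(train_demo, test_demo)
print(diff)
```

With the real data this would be called as `compare_columns(dfTr, dfTs)`; judging from the previews above, only `Segmentation` should show up under `only_in_train`.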


Dataframe Information

In [48]:
def describeDF(x):
    """Summarise each column: dtype, null count, null percentage, unique count and sample values."""
    listItem = []
    for col in x.columns:
        nullCount = x[col].isna().sum()
        listItem.append([col, x[col].dtype,
                         nullCount,
                         round((nullCount / len(x[col])) * 100, 2),
                         x[col].nunique(), list(x[col].unique()[:5])])

    desc = pd.DataFrame(columns=['dataFeatures', 'dataType', 'null', 'nullPct', 'unique', 'uniqueSample'],
                        data=listItem)
    return desc

In [49]:
describeDF(dfTr)

Unnamed: 0,dataFeatures,dataType,null,nullPct,unique,uniqueSample
0,ID,int64,0,0.0,8068,"[462809, 462643, 466315, 461735, 462669]"
1,Gender,object,0,0.0,2,"[Male, Female]"
2,Ever_Married,object,140,1.74,2,"[No, Yes, nan]"
3,Age,int64,0,0.0,67,"[22, 38, 67, 40, 56]"
4,Graduated,object,78,0.97,2,"[No, Yes, nan]"
5,Profession,object,124,1.54,9,"[Healthcare, Engineer, Lawyer, Entertainment, ..."
6,Work_Experience,float64,829,10.28,15,"[1.0, nan, 0.0, 4.0, 9.0]"
7,Spending_Score,object,0,0.0,3,"[Low, Average, High]"
8,Family_Size,float64,335,4.15,9,"[4.0, 3.0, 1.0, 2.0, 6.0]"
9,Var_1,object,76,0.94,7,"[Cat_4, Cat_6, Cat_7, Cat_3, Cat_1]"


Comments:

- Work_Experience has the highest share of null values (10.28%), which might affect the model later on. This is perhaps because some customers are unwilling to disclose their work experience, or are still looking for a steady job.
- Family_Size has 4.15% null values, possibly because some customers prefer to keep this information private.
- Ever_Married, Profession, Graduated and Var_1 each have less than 2% null values.
- The null values must be handled before modelling, since they might affect model performance.
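One possible way to handle the null values noted above is simple imputation: the median for numeric columns and the most frequent value for categorical ones. This is a minimal sketch of that option, not necessarily the approach used later in this notebook. `impute_simple` is a hypothetical helper, the demo frame is made up, and the code assumes every column has at least one non-null value.

```python
import numpy as np
import pandas as pd

def impute_simple(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric nulls with the column median and other nulls with the column mode."""
    out = df.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            # mode() ignores NaN; assumes at least one non-null value exists
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

# Made-up rows using two of the columns that contain nulls in Train.csv
demo = pd.DataFrame({
    "Work_Experience": [1.0, np.nan, 0.0, 4.0],
    "Ever_Married": ["No", "Yes", np.nan, "Yes"],
})
filled = impute_simple(demo)
print(filled)
```

Dropping the affected rows is another option, but with 10.28% of Work_Experience missing it would discard a noticeable part of the training data, which is why imputation is sketched here.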

In [50]:
dfTr.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
ID,8068.0,463479.214551,2595.381232,458982.0,461240.75,463472.5,465744.25,467974.0
Age,8068.0,43.466906,16.711696,18.0,30.0,40.0,53.0,89.0
Work_Experience,7239.0,2.641663,3.406763,0.0,0.0,1.0,4.0,14.0
Family_Size,7733.0,2.850123,1.531413,1.0,2.0,3.0,4.0,9.0


In [51]:
dfTr.Profession.unique()

array(['Healthcare', 'Engineer', 'Lawyer', 'Entertainment', 'Artist',
       'Executive', 'Doctor', 'Homemaker', 'Marketing', nan], dtype=object)

In [52]:
dfTr.groupby('Profession').count()

Profession,ID,Gender,Ever_Married,Age,Graduated,Work_Experience,Spending_Score,Family_Size,Var_1,Segmentation
Artist,2516,2516,2487,2516,2502,2305,2516,2447,2491,2516
Doctor,688,688,677,688,683,630,688,662,683,688
Engineer,699,699,682,699,695,628,699,673,694,699
Entertainment,949,949,937,949,937,862,949,909,943,949
Executive,599,599,587,599,594,528,599,585,594,599
Healthcare,1332,1332,1298,1332,1320,1184,1332,1264,1317,1332
Homemaker,246,246,240,246,244,211,246,216,242,246
Lawyer,623,623,615,623,611,540,623,590,617,623
Marketing,292,292,285,292,287,253,292,275,290,292
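The grouped count above repeats the same information across every column. If only the distribution of Profession is of interest, `value_counts` gives a more compact view and can also report percentages and the null group. The sketch below uses a made-up stand-in series, so the numbers are illustrative and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Stand-in for dfTr["Profession"]; values are illustrative only
profession = pd.Series(
    ["Artist", "Artist", "Doctor", np.nan, "Artist", "Doctor", "Lawyer", "Artist"]
)

# Absolute counts, keeping the null group visible
counts = profession.value_counts(dropna=False)
# Percentage share of each group, rounded to two decimals
shares = profession.value_counts(normalize=True, dropna=False).mul(100).round(2)

summary = pd.DataFrame({"count": counts, "pct": shares})
print(summary)
```

On the real data the equivalent call would be `dfTr["Profession"].value_counts(dropna=False)`.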
