# Customer Segmentation Analysis

An automobile company has plans to enter new markets with their existing products (P1, P2, P3, P4, and P5). After intensive market research, they’ve deduced that the behavior of the new market is similar to their existing market.

In their existing market, the sales team has classified all customers into 4 segments (A, B, C, D). They then performed segmented outreach and communication tailored to each customer segment. This strategy has worked exceptionally well for them. They plan to use the same strategy in the new markets and have identified 2627 new potential customers.

You are required to help the manager predict the right segment for each of the new customers.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [3]:
df = pd.read_csv('Train customer segmentation .csv')

In [4]:
df.head()

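| | ID | Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Var_1 | Segmentation |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 462809 | Male | No | 22 | No | Healthcare | 1.0 | Low | 4.0 | Cat_4 | D |
| 1 | 462643 | Female | Yes | 38 | Yes | Engineer | NaN | Average | 3.0 | Cat_4 | A |
| 2 | 466315 | Female | Yes | 67 | Yes | Engineer | 1.0 | Low | 1.0 | Cat_6 | B |
| 3 | 461735 | Male | Yes | 67 | Yes | Lawyer | 0.0 | High | 2.0 | Cat_6 | B |
| 4 | 462669 | Female | Yes | 40 | Yes | Entertainment | NaN | High | 6.0 | Cat_6 | A |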


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8068 entries, 0 to 8067
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ID               8068 non-null   int64  
 1   Gender           8068 non-null   object 
 2   Ever_Married     7928 non-null   object 
 3   Age              8068 non-null   int64  
 4   Graduated        7990 non-null   object 
 5   Profession       7944 non-null   object 
 6   Work_Experience  7239 non-null   float64
 7   Spending_Score   8068 non-null   object 
 8   Family_Size      7733 non-null   float64
 9   Var_1            7992 non-null   object 
 10  Segmentation     8068 non-null   object 
dtypes: float64(2), int64(2), object(7)
memory usage: 693.5+ KB


In [6]:
df.isnull().sum()

ID                   0
Gender               0
Ever_Married       140
Age                  0
Graduated           78
Profession         124
Work_Experience    829
Spending_Score       0
Family_Size        335
Var_1               76
Segmentation         0
dtype: int64

In [7]:
df = df.dropna(axis=0)

In [8]:
df.shape

(6665, 11)
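
Dropping every incomplete row shrinks the data from 8068 to 6665 rows, a loss of about 17%. The sketch below contrasts that approach with a simple median/mode imputation on a tiny hypothetical frame (`toy` and its values are invented for illustration and are not part of the dataset). It is one possible alternative, not the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame that mimics the missing-value pattern in the data
toy = pd.DataFrame({
    "Ever_Married": ["Yes", None, "No", "Yes"],
    "Work_Experience": [1.0, np.nan, 0.0, 3.0],
    "Family_Size": [4.0, 3.0, np.nan, 2.0],
})

# Option used in this notebook: drop every row with at least one missing value
dropped = toy.dropna(axis=0)

# Alternative: fill numeric columns with the median, categorical ones with the mode
imputed = toy.copy()
for col in imputed.columns:
    if pd.api.types.is_numeric_dtype(imputed[col]):
        imputed[col] = imputed[col].fillna(imputed[col].median())
    else:
        imputed[col] = imputed[col].fillna(imputed[col].mode().iloc[0])

print(len(toy) - len(dropped), "rows lost by dropna")
print(int(imputed.isnull().sum().sum()), "missing values left after imputation")
```

Whether imputation helps here would need to be checked against the held-out scores; it keeps more rows but also injects estimated values.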

In [9]:
df.Gender.unique()

array(['Male', 'Female'], dtype=object)

In [10]:
df.Gender = df.Gender.map({'Male':1, 'Female':0})

In [11]:
df.Ever_Married = df.Ever_Married.map({'Yes':1, 'No':0})

In [12]:
df.Graduated = df.Graduated.map({'Yes':1, 'No':0})


In [13]:
# Perform one-hot encoding on the Profession column
profession_dummies = pd.get_dummies(df.Profession, prefix='Profession').astype('int')

# Concatenate the one-hot encoded results with the original DataFrame
df = pd.concat([df, profession_dummies], axis=1)

# Optionally, drop the original Profession column
df.drop('Profession', axis=1, inplace=True)

In [14]:
df.Spending_Score = df.Spending_Score.map({'Low':1, 'High':3, 'Average':2})

In [15]:
df.Family_Size = df.Family_Size.astype('int')

In [16]:
df.Work_Experience = df.Work_Experience.astype('int')

In [17]:
df.Var_1 = df.Var_1.map({'Cat_1':1, 'Cat_2':2, 'Cat_3':3, 'Cat_4':4, 'Cat_5':5, 'Cat_6':6, 'Cat_7':7}) 

In [18]:
df.Segmentation = df.Segmentation.map({'A':1, 'B':2, 'C':3, 'D':4})

In [105]:
cor = df.corr()
cor

| | ID | Gender | Ever_Married | Age | Graduated | Work_Experience | Spending_Score | Family_Size | Var_1 | Segmentation | Profession_Artist | Profession_Doctor | Profession_Engineer | Profession_Entertainment | Profession_Executive | Profession_Healthcare | Profession_Homemaker | Profession_Lawyer | Profession_Marketing |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID | 1.0 | 0.007943 | 0.033197 | 0.011664 | -0.011744 | -0.028948 | 0.014333 | 0.006343 | -0.007728 | -0.013782 | -0.01909 | -0.005157 | 0.014949 | 0.00439 | 0.021269 | 0.001264 | 0.017542 | -0.006205 | -0.013512 |
| Gender | 0.007943 | 1.0 | 0.114869 | 0.021269 | -0.045263 | -0.052294 | 0.06814 | 0.058025 | 0.023131 | 0.03359 | -0.046428 | 0.008909 | -0.219165 | 0.137358 | 0.233025 | 0.030189 | -0.12367 | -0.03647 | -0.03867 |
| Ever_Married | 0.033197 | 0.114869 | 1.0 | 0.567729 | 0.202987 | -0.092892 | 0.617433 | -0.083559 | 0.090855 | -0.206909 | 0.17731 | -0.084033 | 0.01579 | 0.016149 | 0.203215 | -0.421563 | -0.014425 | 0.200622 | -0.099513 |
| Age | 0.011664 | 0.021269 | 0.567729 | 1.0 | 0.24725 | -0.188769 | 0.432261 | -0.281772 | 0.171659 | -0.231696 | 0.119892 | -0.119262 | -0.034804 | -0.017287 | 0.133837 | -0.441194 | -0.054111 | 0.541586 | -0.076032 |
| Graduated | -0.011744 | -0.045263 | 0.202987 | 0.24725 | 1.0 | 0.032257 | 0.114328 | -0.234985 | 0.127577 | -0.172233 | 0.367099 | -0.035552 | -0.111697 | -0.003579 | -0.06598 | -0.249805 | -0.026476 | 0.0074 | -0.097774 |
| Work_Experience | -0.028948 | -0.052294 | -0.092892 | -0.188769 | 0.032257 | 1.0 | -0.077021 | -0.069123 | 0.026172 | 0.006982 | 0.01782 | -0.003784 | 0.00029 | 0.015254 | -0.024093 | -0.007009 | 0.181573 | -0.117349 | -0.008296 |
| Spending_Score | 0.014333 | 0.06814 | 0.617433 | 0.432261 | 0.114328 | -0.077021 | 1.0 | 0.095669 | 0.085278 | -0.097284 | 0.040527 | -0.085421 | -0.030472 | -0.066033 | 0.358155 | -0.267461 | -0.019459 | 0.214281 | -0.077522 |
| Family_Size | 0.006343 | 0.058025 | -0.083559 | -0.281772 | -0.234985 | -0.069123 | 0.095669 | 1.0 | -0.142051 | 0.199412 | -0.159135 | 0.004173 | 0.025264 | -0.018821 | 0.103099 | 0.25269 | -0.059818 | -0.163836 | 0.027336 |
| Var_1 | -0.007728 | 0.023131 | 0.090855 | 0.171659 | 0.127577 | 0.026172 | 0.085278 | -0.142051 | 1.0 | -0.019768 | 0.09019 | -0.022684 | -0.067445 | -0.03173 | 0.041406 | -0.090001 | -0.028779 | 0.101777 | -0.035689 |
| Segmentation | -0.013782 | 0.03359 | -0.206909 | -0.231696 | -0.172233 | 0.006982 | -0.097284 | 0.199412 | -0.019768 | 1.0 | -0.137655 | -0.008625 | -0.093229 | -0.117926 | -0.010167 | 0.367225 | 0.003346 | -0.067182 | 0.090604 |


In [107]:
sns.pairplot(df, hue='Segmentation')

<seaborn.axisgrid.PairGrid at 0x78371c43eab0>

In [21]:
plt.figure(figsize=(20, 16))  
sns.heatmap(cor, annot=True)

<Axes: >

In [22]:
df.describe()

| | ID | Gender | Ever_Married | Age | Graduated | Work_Experience | Spending_Score | Family_Size | Var_1 | Segmentation | Profession_Artist | Profession_Doctor | Profession_Engineer | Profession_Entertainment | Profession_Executive | Profession_Healthcare | Profession_Homemaker | Profession_Lawyer | Profession_Marketing |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 6665.0 | 6665.0 | 6665.0 | 6665.0 | 6665.0 | 6665.0 | 6665.0 | 6665.0 | 6665.0 | 6665.0 | 6665.0 | 6665.0 | 6665.0 | 6665.0 | 6665.0 | 6665.0 | 6665.0 | 6665.0 | 6665.0 |
| mean | 463519.84096 | 0.551688 | 0.591748 | 43.536084 | 0.637509 | 2.629107 | 1.550638 | 2.84111 | 5.178395 | 2.542836 | 0.328882 | 0.088822 | 0.087322 | 0.12138 | 0.075769 | 0.16159 | 0.026257 | 0.075019 | 0.034959 |
| std | 2566.43174 | 0.497358 | 0.491547 | 16.524054 | 0.480755 | 3.405365 | 0.740806 | 1.524743 | 1.409265 | 1.122723 | 0.469842 | 0.284508 | 0.282327 | 0.326593 | 0.264648 | 0.368102 | 0.159909 | 0.263441 | 0.183689 |
| min | 458982.0 | 0.0 | 0.0 | 18.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 461349.0 | 0.0 | 0.0 | 31.0 | 0.0 | 0.0 | 1.0 | 2.0 | 4.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 463575.0 | 1.0 | 1.0 | 41.0 | 1.0 | 1.0 | 1.0 | 2.0 | 6.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 465741.0 | 1.0 | 1.0 | 53.0 | 1.0 | 4.0 | 2.0 | 4.0 | 6.0 | 4.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 467974.0 | 1.0 | 1.0 | 89.0 | 1.0 | 14.0 | 3.0 | 9.0 | 7.0 | 4.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [23]:
y = df.Segmentation
# Spending_Score, Age, and Ever_Married are moderately correlated with one another
# (0.43 to 0.62 in the matrix above), but all three are kept as features here.
X = df.drop(columns=['ID', 'Segmentation'])

In [24]:
df = df.reset_index(drop=True)
features = X.columns
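
To check which features are strongly correlated without scanning the full matrix by eye, the pairs above a threshold can be listed programmatically. The sketch below is a minimal illustration: `correlated_pairs`, the 0.5 threshold, and the `toy` frame are hypothetical and not part of the original analysis.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame, threshold=0.5):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is skipped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Hypothetical data: 'b' is a noisy copy of 'a', 'c' is independent of both
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=500),
    "c": rng.normal(size=500),
})

result = correlated_pairs(toy)
print(result)
```

Applied to this notebook's data, the call would presumably be `correlated_pairs(X)`, which should surface pairs such as Ever_Married and Spending_Score (0.62 in the matrix above).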

In [25]:
from sklearn.model_selection import StratifiedShuffleSplit

# Get the split indexes
strat_shuf_split = StratifiedShuffleSplit(n_splits=1, 
                                          test_size=0.3, 
                                          random_state=42)

train_idx, test_idx = next(strat_shuf_split.split(X, y))

# Create the dataframes
X_train = df.loc[train_idx, features]
y_train = df.loc[train_idx, 'Segmentation']

X_test  = df.loc[test_idx, features]
y_test  = df.loc[test_idx, 'Segmentation']
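
A stratified split is meant to keep the segment proportions the same in the training and test sets. The sketch below checks that property on hypothetical labels (`y_demo` and its 40/30/20/10 class balance are invented for illustration).

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical labels with a 40/30/20/10 class balance, encoded 1-4 like Segmentation
y_demo = np.array([1] * 40 + [2] * 30 + [3] * 20 + [4] * 10)
X_demo = np.zeros((len(y_demo), 1))  # feature values do not affect the split

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx_demo, test_idx_demo = next(sss.split(X_demo, y_demo))

# Share of each class (1 to 4) in the two subsets
train_share = np.bincount(y_demo[train_idx_demo], minlength=5)[1:] / len(train_idx_demo)
test_share = np.bincount(y_demo[test_idx_demo], minlength=5)[1:] / len(test_idx_demo)

print("train:", train_share)
print("test: ", test_share)
```

The same comparison on the real split could use `y_train.value_counts(normalize=True)` and `y_test.value_counts(normalize=True)`.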

In [26]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

## Logistic Regression

In [28]:
from sklearn.linear_model import LogisticRegression

# Standard logistic regression
lr = LogisticRegression(solver='liblinear').fit(X_train, y_train)

In [29]:
from sklearn.linear_model import LogisticRegressionCV

# L1 regularized logistic regression
lr_l1 = LogisticRegressionCV(Cs=10, cv=4, penalty='l1', solver='liblinear').fit(X_train, y_train)

In [30]:
# L2 regularized logistic regression
lr_l2 = LogisticRegressionCV(Cs=10, cv=4, penalty='l2', solver='liblinear').fit(X_train, y_train)

In [31]:
# Predict the class and the probability for each
y_pred = list()
y_prob = list()

coeff_labels = ['lr', 'l1', 'l2']
coeff_models = [lr, lr_l1, lr_l2]

for lab,mod in zip(coeff_labels, coeff_models):
    y_pred.append(pd.Series(mod.predict(X_test), name=lab))
    y_prob.append(pd.Series(mod.predict_proba(X_test).max(axis=1), name=lab))
    
y_pred = pd.concat(y_pred, axis=1)
y_prob = pd.concat(y_prob, axis=1)

y_pred.head()

| | lr | l1 | l2 |
|---|---|---|---|
| 0 | 3 | 3 | 3 |
| 1 | 1 | 2 | 1 |
| 2 | 1 | 2 | 1 |
| 3 | 4 | 2 | 1 |
| 4 | 1 | 2 | 1 |


In [32]:
y_prob.head()

| | lr | l1 | l2 |
|---|---|---|---|
| 0 | 0.52709 | 0.423914 | 0.370917 |
| 1 | 0.396117 | 0.412147 | 0.414076 |
| 2 | 0.40037 | 0.442572 | 0.415775 |
| 3 | 0.293829 | 0.439803 | 0.385172 |
| 4 | 0.377388 | 0.447147 | 0.409857 |


In [33]:
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score
from sklearn.preprocessing import label_binarize

metrics = list()
cm = dict()

for lab in coeff_labels:

    # Precision, recall, f-score from the multi-class support function
    precision, recall, fscore, _ = score(y_test, y_pred[lab], average='weighted')
    
    # The usual way to calculate accuracy
    accuracy = accuracy_score(y_test, y_pred[lab])
    
    # ROC-AUC scores can be calculated by binarizing the data
    auc = roc_auc_score(label_binarize(y_test, classes=[1,2,3,4]),
              label_binarize(y_pred[lab], classes=[1,2,3,4]), 
              average='weighted')
    
    # Last, the confusion matrix
    cm[lab] = confusion_matrix(y_test, y_pred[lab])
    
    metrics.append(pd.Series({'precision':precision, 'recall':recall, 
                              'fscore':fscore, 'accuracy':accuracy,
                              'auc':auc}, 
                             name=lab))

metrics = pd.concat(metrics, axis=1)


In [34]:
metrics

| | lr | l1 | l2 |
|---|---|---|---|
| precision | 0.480711 | 0.579827 | 0.4255 |
| recall | 0.507 | 0.4205 | 0.483 |
| fscore | 0.480691 | 0.385738 | 0.426143 |
| accuracy | 0.507 | 0.4205 | 0.483 |
| auc | 0.670611 | 0.618225 | 0.656337 |
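
The AUC values above are computed from binarized hard predictions, which reduces each ROC curve to a single operating point. ROC-AUC is more commonly computed from predicted probabilities. The sketch below shows both variants side by side on a synthetic 4-class problem; the data from `make_classification` and the names `X_demo`, `y_demo`, and `clf` are stand-ins, not the customer data or the models fitted above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Synthetic 4-class stand-in for the customer data (hypothetical, for illustration only)
X_demo, y_demo = make_classification(n_samples=600, n_features=10, n_informative=6,
                                     n_classes=4, random_state=42)
y_demo = y_demo + 1  # relabel 0..3 as 1..4 to mirror the Segmentation encoding

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3,
                                      stratify=y_demo, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)

# Variant 1: AUC from hard labels, as in the metrics loop above
auc_labels = roc_auc_score(label_binarize(yte, classes=[1, 2, 3, 4]),
                           label_binarize(clf.predict(Xte), classes=[1, 2, 3, 4]),
                           average='weighted')

# Variant 2: AUC from class probabilities with one-vs-rest averaging
auc_proba = roc_auc_score(yte, clf.predict_proba(Xte),
                          multi_class='ovr', average='weighted')

print(f"AUC from labels: {auc_labels:.3f}, from probabilities: {auc_proba:.3f}")
```

In this notebook the second variant would presumably take `mod.predict_proba(X_test)` for each model instead of the row-wise maximum stored in `y_prob`.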


Display or plot the confusion matrix for each model.

In [36]:
fig, axList = plt.subplots(nrows=2, ncols=2)
axList = axList.flatten()
fig.set_size_inches(12, 10)

axList[-1].axis('off')

for ax,lab in zip(axList[:-1], coeff_labels):
    sns.heatmap(cm[lab], ax=ax, annot=True, fmt='d');
    ax.set(title=lab);
    
plt.tight_layout()
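
Raw counts can be hard to compare across segments of different sizes. As a complement to the heatmaps, the sketch below builds a row-normalized confusion matrix labelled with the segment letters. The `y_true_demo` and `y_pred_demo` arrays are hypothetical; the letter order assumes the `{'A':1, 'B':2, 'C':3, 'D':4}` mapping used earlier.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels using the 1-4 Segmentation encoding
y_true_demo = np.array([1, 1, 2, 2, 3, 3, 4, 4])
y_pred_demo = np.array([1, 2, 2, 2, 3, 4, 4, 4])

segment_names = ['A', 'B', 'C', 'D']  # inverse of the mapping {'A':1, 'B':2, 'C':3, 'D':4}

# normalize='true' divides each row by its total, so the diagonal holds per-class recall
cm_norm = confusion_matrix(y_true_demo, y_pred_demo, labels=[1, 2, 3, 4], normalize='true')
cm_table = pd.DataFrame(cm_norm, index=segment_names, columns=segment_names)
cm_table.index.name = 'true'
cm_table.columns.name = 'predicted'

print(cm_table)
```

The resulting frame can be passed to `sns.heatmap` with `fmt='.2f'` in place of `fmt='d'`.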

## Decision Tree

In [38]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(random_state=42)
dt = dt.fit(X_train, y_train)

In [39]:
dt.tree_.node_count, dt.tree_.max_depth

(4565, 28)

In [40]:
# A function to return error metrics

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def measure_error(y_true, y_pred, label):
    return pd.Series({
        'accuracy': accuracy_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred, average='weighted'),  # Use 'weighted', 'macro', or 'micro'
        'recall': recall_score(y_true, y_pred, average='weighted'),        # Use 'weighted', 'macro', or 'micro'
        'f1': f1_score(y_true, y_pred, average='weighted')               # Use 'weighted', 'macro', or 'micro'
    }, name=label)

In [41]:
y_train_pred = dt.predict(X_train)
y_test_pred = dt.predict(X_test)

train_test_full_error = pd.concat([measure_error(y_train, y_train_pred, 'train'),
                              measure_error(y_test, y_test_pred, 'test')],
                              axis=1)

train_test_full_error

| | train | test |
|---|---|---|
| accuracy | 0.965702 | 0.453 |
| precision | 0.966344 | 0.456251 |
| recall | 0.965702 | 0.453 |
| f1 | 0.965761 | 0.454363 |
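
The unrestricted tree reaches about 0.97 accuracy on the training set but only 0.45 on the test set, a gap that points to overfitting. The sketch below reproduces the effect on synthetic, noisy 4-class data and shows how capping `max_depth` narrows the gap. The data, the depth of 4, and all `*_demo` names are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy 4-class data (hypothetical stand-in for the customer features)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, n_informative=5,
                                     n_classes=4, flip_y=0.2, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3,
                                      stratify=y_demo, random_state=0)

scores = {}
for depth in (None, 4):  # None grows the tree until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(Xtr, ytr)
    scores[depth] = (tree.score(Xtr, ytr), tree.score(Xte, yte))

for depth, (train_acc, test_acc) in scores.items():
    print(f"max_depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}, "
          f"gap={train_acc - test_acc:.3f}")
```

A shallow tree gives up training accuracy in exchange for a smaller train/test gap; a cross-validated search over depth is one way to pick the trade-off.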


In [42]:
# Grid search CV
from sklearn.model_selection import GridSearchCV

param_grid = {'max_depth':range(1, dt.tree_.max_depth+1, 2),
              'max_features': range(1, len(dt.feature_importances_)+1)}

GR = GridSearchCV(DecisionTreeClassifier(random_state=42),
                  param_grid=param_grid,
                  scoring='accuracy',
                  n_jobs=-1)

GR = GR.fit(X_train, y_train)


In [43]:
GR.best_estimator_.tree_.node_count, GR.best_estimator_.tree_.max_depth

(231, 7)

In [44]:
y_train_pred_gr = GR.predict(X_train)
y_test_pred_gr = GR.predict(X_test)

train_test_gr_error = pd.concat([measure_error(y_train, y_train_pred_gr, 'train'),
                                 measure_error(y_test, y_test_pred_gr, 'test')],
                                axis=1)

In [45]:
train_test_gr_error

| | train | test |
|---|---|---|
| accuracy | 0.565273 | 0.517 |
| precision | 0.5709 | 0.51246 |
| recall | 0.565273 | 0.517 |
| f1 | 0.562311 | 0.510237 |


## Random Forest Classifier

In [47]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()

In [48]:
# 'auto' is omitted: for classifiers it was an alias of 'sqrt', and scikit-learn 1.3+ no longer accepts it
param_grid = {'n_estimators': [2*n+1 for n in range(20)],
              'max_depth': [2*n+1 for n in range(10)],
              'max_features': ['sqrt', 'log2']}

In [49]:
import warnings
warnings.filterwarnings('ignore')

In [50]:
search = GridSearchCV(estimator=model, param_grid=param_grid,scoring='accuracy')
search.fit(X_train, y_train)

In [51]:
search.best_score_

0.5404072883172562

In [52]:
search.best_params_

{'max_depth': 7, 'max_features': 'sqrt', 'n_estimators': 39}

In [53]:
def get_accuracy(X_train, X_test, y_train, y_test, model):
    # accuracy_score was imported from sklearn.metrics earlier; importing the
    # sklearn.metrics module as `metrics` would overwrite the `metrics` DataFrame above
    return {"test Accuracy": accuracy_score(y_test, model.predict(X_test)),
            "train Accuracy": accuracy_score(y_train, model.predict(X_train))}

In [54]:
print(get_accuracy(X_train, X_test, y_train, y_test, search.best_estimator_))

{'test Accuracy': 0.538, 'train Accuracy': 0.6030010718113612}
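
The models predict the numeric codes 1 to 4, while the business question is about segments A to D. Before handing predictions for the new customers to the manager, the codes need to be mapped back to letters. The sketch below inverts the mapping used earlier; `predicted_codes` is a hypothetical model output, not a real prediction.

```python
import numpy as np
import pandas as pd

# Mapping applied to the Segmentation column earlier in the notebook
segment_to_code = {'A': 1, 'B': 2, 'C': 3, 'D': 4}
code_to_segment = {code: seg for seg, code in segment_to_code.items()}

# Hypothetical model output for five new customers
predicted_codes = np.array([3, 1, 4, 2, 1])

# Translate numeric predictions back to segment letters
predicted_segments = pd.Series(predicted_codes, name='Segmentation').map(code_to_segment)

print(predicted_segments.tolist())  # ['C', 'A', 'D', 'B', 'A']
```

For real predictions, the new-customer features would first need the same encoding and the already fitted `scaler.transform` applied before calling `predict`.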
