# Correct segment prediction

### Project overview:
- This project demonstrates the importance of behavioral segmentation, implemented in Python.
- Retailers use behavioral segmentation to plan their marketing, promotion, and loyalty strategies.

#### Dataset:
- The dataset contains 4,999 entries and 26 features.
- It is a real dataset from one of the biggest retailers based on the West Coast.

#### Objective:
Our aim is to build a multiclass classification model that predicts the correct segment for each customer.

In [1]:
import pandas as pd
import numpy as np
import os
import seaborn as sns
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel      
from sklearn.feature_selection import RFE
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix

In [2]:
df = pd.read_csv('Segment-GP.csv',encoding='latin1')

In [3]:
df.head()

Unnamed: 0,customer id,channel,Nearest_distance_cosco,Net Sales,Age,children count,Transaction_date,Martial status,Region,Loyality_flag,...,Orders groceries,Orders_cerials,Orders_chokolates,Orders dentals,Orders cosmetics,Orders_ready_eat,Orders_braverage,Orders_frozen,Loyality_amt,segments
0,5102,Email,4,367941,58,4,2017-09-19,0,San Diego,0,...,6,10,1,6,3,7,4,0,335,6
1,3549,Email,33,171305,98,1,2017-02-17,0,Atlanta,0,...,7,8,0,3,5,4,10,8,446,5
2,5885,direct,28,173439,18,2,2017-06-06,0,TampaSt. Petersburg,0,...,5,5,8,9,3,7,5,0,49,6
3,7381,store,40,997988,87,3,2016-09-23,0,Seattle,1,...,5,0,8,1,2,7,7,7,280,3
4,2713,direct,5,898554,31,2,2017-01-20,0,Seattle,0,...,4,5,10,5,10,8,9,0,89,3


In [4]:
df.info()

# Dataset is clean: info() reports no missing values (4999 non-null entries per column)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4999 entries, 0 to 4998
Data columns (total 26 columns):
customer id               4999 non-null int64
channel                   4999 non-null object
Nearest_distance_cosco    4999 non-null int64
Net Sales                 4999 non-null int64
Age                       4999 non-null int64
children count            4999 non-null int64
Transaction_date          4999 non-null object
Martial status            4999 non-null int64
Region                    4999 non-null object
Loyality_flag             4999 non-null int64
coupon amount             4999 non-null int64
Customer_type             4999 non-null object
Race_cd                   4999 non-null object
payment                   4999 non-null object
Brands                    4999 non-null object
Income                    4999 non-null int64
Orders groceries          4999 non-null int64
Orders_cerials            4999 non-null int64
Orders_chokolates         4999 non-null int64
Orders denta

In [7]:
categorical = [var for var in df.columns if df[var].dtype=='O']
numerical = [var for var in df.columns if df[var].dtype!='O']

In [9]:
categorical

['channel',
 'Transaction_date',
 'Region',
 'Customer_type',
 'Race_cd',
 'payment',
 'Brands']

In [10]:
numerical

['customer id',
 'Nearest_distance_cosco',
 'Net Sales',
 'Age',
 'children count',
 'Martial status',
 'Loyality_flag',
 'coupon amount',
 'Income',
 'Orders groceries',
 'Orders_cerials',
 'Orders_chokolates',
 'Orders dentals',
 'Orders cosmetics',
 'Orders_ready_eat',
 'Orders_braverage',
 'Orders_frozen',
 'Loyality_amt',
 'segments']

In [11]:
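# Apply one-hot encoding to the categorical columns.
# Note: 'Transaction_date' is an object column, so each distinct date becomes its own dummy column.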

df = pd.concat([df[numerical], pd.get_dummies(df[categorical])], axis=1)
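
Because `Transaction_date` is stored as text, one-hot encoding it yields one column per distinct date, which is unlikely to generalize. One possible alternative, sketched here on a toy frame, is to derive a few date parts first. The `txn_*` column names and the toy data are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for the real data; only the column names
# 'Transaction_date' and 'channel' are taken from the notebook.
toy = pd.DataFrame({
    "Transaction_date": ["2017-09-19", "2017-02-17", "2016-09-23"],
    "channel": ["Email", "Email", "store"],
})

# Parse the strings into datetimes, then derive low-cardinality features.
dates = pd.to_datetime(toy["Transaction_date"], format="%Y-%m-%d")
toy["txn_year"] = dates.dt.year
toy["txn_month"] = dates.dt.month
toy["txn_dayofweek"] = dates.dt.dayofweek  # Monday=0 ... Sunday=6

# Drop the raw date so get_dummies only encodes the true categorical column.
encoded = pd.get_dummies(toy.drop(columns="Transaction_date"), columns=["channel"])
print(encoded.columns.tolist())
```

Whether these date parts carry any signal for the segments would still need to be checked on the real data.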

In [12]:
from xgboost import XGBClassifier

X = df.drop(['customer id','segments'],axis=1)
y = df['segments']

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3, random_state=42, shuffle=True)
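
The split above shuffles rows but does not stratify on the target. A minimal sketch of a stratified split follows, assuming synthetic data (`X_demo`, `y_demo` are made-up stand-ins). The `stratify` argument of `train_test_split` keeps each segment's share the same in both partitions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in: 600 rows, labels 1..6 with deliberately unequal frequencies.
y_demo = pd.Series(rng.choice([1, 2, 3, 4, 5, 6], size=600,
                              p=[0.3, 0.2, 0.2, 0.1, 0.1, 0.1]))
X_demo = pd.DataFrame(rng.normal(size=(600, 3)), columns=["f0", "f1", "f2"])

# stratify=y_demo preserves the class proportions in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(y_tr.value_counts(normalize=True).sort_index().round(3))
print(y_te.value_counts(normalize=True).sort_index().round(3))
```

The test-set supports reported in this notebook are already fairly even (231 to 288 per class), so stratifying would probably change little here; it matters more when classes are imbalanced.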

In [13]:
#Fit model

classifier = XGBClassifier(random_state=2)
classifier.fit(X_train,y_train)

y_pred = classifier.predict(X_test)

f1 = f1_score(y_test,y_pred, average='macro')
print("F1-score:",f1)

xgb_cr = classification_report(y_test,y_pred)
print("Classification report \n",xgb_cr)

F1-score: 0.1748557401852023
Classification report 
               precision    recall  f1-score   support

           1       0.15      0.17      0.16       247
           2       0.23      0.16      0.19       288
           3       0.19      0.24      0.21       241
           4       0.13      0.14      0.14       231
           5       0.18      0.14      0.16       251
           6       0.18      0.21      0.20       242

    accuracy                           0.18      1500
   macro avg       0.18      0.18      0.17      1500
weighted avg       0.18      0.18      0.18      1500
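
To put the score above in context: with six segments of similar size, random guessing is expected to give a macro F1 of roughly 1/6 ≈ 0.167, so 0.175 is close to chance level. The sketch below checks this with scikit-learn's `DummyClassifier` on synthetic labels; the `*_demo` arrays are assumptions that only mimic the 3,499/1,500 split sizes, not the real data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Synthetic labels 1..6, drawn uniformly; features are irrelevant to a dummy model.
y_train_demo = rng.integers(1, 7, size=3499)
y_test_demo = rng.integers(1, 7, size=1500)
X_train_demo = np.zeros((3499, 1))
X_test_demo = np.zeros((1500, 1))

# 'stratified' predicts labels at random, following the training class frequencies.
baseline = DummyClassifier(strategy="stratified", random_state=0)
baseline.fit(X_train_demo, y_train_demo)

baseline_f1 = f1_score(y_test_demo, baseline.predict(X_test_demo), average="macro")
print("Chance-level macro F1:", round(baseline_f1, 3))
```

Fitting the same dummy model on the notebook's own `X_train`/`y_train` would give the exact baseline to compare against.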



In [14]:
# The macro F1 score is low (about 0.17); try to improve the model with GridSearchCV

from sklearn.model_selection import GridSearchCV

parameters={'learning_rate':[0.1,0.15,0.2,0.25,0.3],
            'max_depth':range(1,3)}

grid_search = GridSearchCV(estimator=classifier, param_grid = parameters, n_jobs=-1, verbose=4)
grid_search.fit(X_train,y_train)

grid_predictions = grid_search.predict(X_test)

grid_f1 = f1_score(y_test, grid_predictions, average='macro')
print("Grid F1score",grid_f1)

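# Report on the tuned model's predictions (grid_predictions), not the baseline y_pred.
# The output recorded below was produced with y_pred, so it repeats the baseline report.
report = classification_report(y_test, grid_predictions)
print("Classification report grid \n",report)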

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:  6.8min
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed: 18.5min finished


Grid F1score 0.1773890153382315
Classification report grid 
               precision    recall  f1-score   support

           1       0.15      0.17      0.16       247
           2       0.23      0.16      0.19       288
           3       0.19      0.24      0.21       241
           4       0.13      0.14      0.14       231
           5       0.18      0.14      0.16       251
           6       0.18      0.21      0.20       242

    accuracy                           0.18      1500
   macro avg       0.18      0.18      0.17      1500
weighted avg       0.18      0.18      0.18      1500
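
By default, `GridSearchCV` ranks candidates with the estimator's own `score` method, which is accuracy for classifiers, whereas this notebook evaluates with macro F1. The sketch below shows a search that optimizes macro F1 directly and then inspects the winning parameters. It is an illustration under stated assumptions: the data is synthetic, and scikit-learn's `GradientBoostingClassifier` stands in for `XGBClassifier` so the example needs only scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Small synthetic 3-class problem (assumption: stands in for the retail data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# A reduced version of the grid used in the notebook.
param_grid = {"learning_rate": [0.1, 0.3], "max_depth": [1, 2]}

search = GridSearchCV(
    estimator=GradientBoostingClassifier(n_estimators=20, random_state=0),
    param_grid=param_grid,
    scoring="f1_macro",  # select candidates by the metric that is reported
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X_demo, y_demo)

print("Best params:", search.best_params_)
print("Best CV macro F1:", round(search.best_score_, 3))
```

Printing `best_params_` and `best_score_` also makes it easy to see whether the search found anything better than the defaults.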



In [15]:
# The macro F1 score barely changes after GridSearchCV (0.1749 -> 0.1774), so this grid did not help much;
# further hyperparameter tuning may be needed to improve the F1 score.

# Try a different ensemble model: random forest

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=2)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

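f1 = f1_score(y_test,y_pred, average='macro')
# Print the random forest score (f1), not grid_f1 from the previous cell.
# The value recorded below is grid_f1; the report's macro avg (0.16) reflects the random forest.
print("RF F1score",f1)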

report = classification_report(y_test,y_pred)
print("Classification report RF \n",report)

RF F1score 0.1773890153382315
Classification report RF 
               precision    recall  f1-score   support

           1       0.18      0.21      0.19       247
           2       0.16      0.10      0.13       288
           3       0.15      0.22      0.18       241
           4       0.16      0.17      0.17       231
           5       0.18      0.14      0.16       251
           6       0.15      0.14      0.14       242

    accuracy                           0.16      1500
   macro avg       0.16      0.16      0.16      1500
weighted avg       0.16      0.16      0.16      1500
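
The notebook imports `confusion_matrix` but never calls it. A confusion matrix could show which segments are mistaken for which, a detail the per-class scores above do not reveal. A minimal sketch on made-up labels (the `*_demo` lists are illustrative, not model output):

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted segment labels for six customers.
y_true_demo = [1, 1, 2, 2, 3, 3]
y_pred_demo = [1, 2, 2, 2, 3, 1]

# Rows are true labels, columns are predicted labels, in the order given by `labels`.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[1, 2, 3])
print(cm)
# [[1 1 0]
#  [0 2 0]
#  [1 0 1]]
```

With the notebook's variables, the equivalent call would be `confusion_matrix(y_test, y_pred)`.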

