# Introduction

Source: https://datahack.analyticsvidhya.com/contest/janatahack-customer-segmentation/#ProblemStatement

### Problem Statement

An automobile company plans to enter new markets with its existing products (P1, P2, P3, P4 and P5). After intensive market research, they've deduced that the behavior of the new market is similar to that of their existing market.

In their existing market, the sales team has classified all customers into 4 segments (A, B, C, D). They then performed segmented outreach and communication for each customer segment. This strategy has worked exceptionally well for them. They plan to use the same strategy in the new markets and have identified 2627 new potential customers.

You are required to help the manager predict the right segment for each of the new customers.

### Data Description 

|Variable|Definition|
|---|---|
|ID|Unique ID|
|Gender|Gender of the customer|
|Ever_Married|Marital status of the customer|
|Age|Age of the customer|
|Graduated|Is the customer a graduate?|
|Profession|Profession of the customer|
|Work_Experience|Work experience in years|
|Spending_Score|Spending score of the customer|
|Family_Size|Number of family members for the customer (including the customer)|
|Var_1|Anonymised category for the customer|
|Segmentation|(target) Customer segment of the customer|

sample_submission.csv

ID: Unique ID

Segmentation: Predicted segment for customers in the test set

### Evaluation Metric
The evaluation metric for this hackathon is Accuracy Score.
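
Accuracy is the fraction of customers whose predicted segment matches the true one. A minimal sketch with made-up labels (the two lists and the `accuracy` helper are illustrative and not part of the contest material); `sklearn.metrics.accuracy_score` computes the same quantity:

```python
# Illustrative labels only; not taken from the contest data.
y_true = ["A", "B", "C", "D", "A"]
y_pred = ["A", "B", "D", "D", "C"]


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


score = accuracy(y_true, y_pred)
print(score)  # 3 of 5 predictions match -> 0.6
```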


### Public and Private split
The public leaderboard is based on 40% of the test data, while the final rank is decided on the remaining 60% (the private leaderboard).

In [2]:
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [48]:
from sklearn.preprocessing import OrdinalEncoder, FunctionTransformer, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer, MissingIndicator

In [26]:
LabelEncoder()

In [6]:
train = pd.read_csv('Train_aBjfeNk.csv')

In [7]:
train.head()

Unnamed: 0,ID,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1,Segmentation
0,462809,Male,No,22,No,Healthcare,1.0,Low,4.0,Cat_4,D
1,462643,Female,Yes,38,Yes,Engineer,,Average,3.0,Cat_4,A
2,466315,Female,Yes,67,Yes,Engineer,1.0,Low,1.0,Cat_6,B
3,461735,Male,Yes,67,Yes,Lawyer,0.0,High,2.0,Cat_6,B
4,462669,Female,Yes,40,Yes,Entertainment,,High,6.0,Cat_6,A


In [59]:
train.columns

Index(['ID', 'Gender', 'Ever_Married', 'Age', 'Graduated', 'Profession',
       'Work_Experience', 'Spending_Score', 'Family_Size', 'Var_1',
       'Segmentation'],
      dtype='object')

In [8]:
train.Segmentation.value_counts()


Segmentation
D    2268
A    1972
C    1970
B    1858
Name: count, dtype: int64

In [11]:
_, factors = pd.factorize(train.Segmentation)

In [12]:
factors

Index(['D', 'A', 'B', 'C'], dtype='object')

In [13]:
dict(factors)

ValueError: dictionary update sequence element #0 has length 1; 2 is required

In [15]:
factors[1]

'A'

In [16]:
factors['D']

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
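
The `IndexError` arises because square brackets on a pandas `Index` take positions, not labels. A small sketch of going from a label to its position with `Index.get_loc`, rebuilding the same values as `factors` by hand so the snippet stands alone:

```python
import pandas as pd

# Same values as the `factors` index shown above, rebuilt by hand.
factors = pd.Index(["D", "A", "B", "C"])

# Position -> label: plain integer indexing.
label_at_1 = factors[1]

# Label -> position: get_loc, because [] on an Index is positional.
position_of_d = factors.get_loc("D")

print(label_at_1, position_of_d)
```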

In [19]:
factors.tolist()

['D', 'A', 'B', 'C']

In [21]:
label_map = dict(zip(factors, range(len(factors))))

In [22]:
label_map

{'D': 0, 'A': 1, 'B': 2, 'C': 3}

In [23]:
train.Segmentation.map(label_map)

0       0
1       1
2       2
3       2
4       1
       ..
8063    0
8064    0
8065    0
8066    2
8067    2
Name: Segmentation, Length: 8068, dtype: int64
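
Since a model trained on these integer codes will also predict integer codes, the submission needs the reverse mapping. A minimal sketch, assuming the `label_map` shown above; `predicted_codes` is a hypothetical model output used only for illustration:

```python
# Mapping copied from the label_map output above.
label_map = {"D": 0, "A": 1, "B": 2, "C": 3}

# Invert it so integer predictions can be turned back into segment letters.
inverse_label_map = {code: segment for segment, code in label_map.items()}

# Hypothetical model output, for illustration only.
predicted_codes = [0, 1, 2, 2, 1]
predicted_segments = [inverse_label_map[code] for code in predicted_codes]

print(predicted_segments)  # ['D', 'A', 'B', 'B', 'A']
```

The same dictionary can be passed to `Series.map` when the predictions are held in a pandas Series.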

In [27]:
ColumnTransformer?

Init signature:
ColumnTransformer(
    transformers,
    *,
    remainder='drop',
    sparse_threshold=0.3,
    n_jobs=None,
    transformer_weights=None,
    verbose=False,
    verbose_feature_names_out=True,
)
Docstring:
Applies transformers to columns of an array or pandas DataFrame.

This estimator allows different columns or column subsets of the input
to be transformed separately and the features generated by each transfor

In [49]:
# Add Missing Indicators for experience + family Size
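
One possible way to act on the note above is `SimpleImputer(add_indicator=True)`, which appends a 0/1 column for each feature that had missing values at fit time. This is a sketch on a made-up toy frame, not the approach the notebook ends up using:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with the two columns named in the note; the values are made up.
toy = pd.DataFrame({
    "Work_Experience": [1.0, np.nan, 3.0, 0.0],
    "Family_Size": [4.0, 3.0, np.nan, 2.0],
})

# Median-impute and append one missing-indicator column per affected feature.
imputer = SimpleImputer(strategy="median", add_indicator=True)
imputed = imputer.fit_transform(toy)

# Columns 0-1 hold the imputed values, columns 2-3 hold the indicators.
print(imputed)
```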

In [53]:

# Shared settings: unseen categories and missing values are both encoded as -1.
ordinal_kwargs = dict(
    handle_unknown="use_encoded_value",
    unknown_value=-1,
    encoded_missing_value=-1,
)

preprocessor = ColumnTransformer(
    [
        # Categorical columns -> integer codes
        ("Gender_Ordinal", OrdinalEncoder(**ordinal_kwargs), ["Gender"]),
        ("Married_Ordinal", OrdinalEncoder(**ordinal_kwargs), ["Ever_Married"]),
        ("Graduated_Ordinal", OrdinalEncoder(**ordinal_kwargs), ["Graduated"]),
        ("Profession_Ordinal", OrdinalEncoder(**ordinal_kwargs), ["Profession"]),
        # Numeric columns with gaps -> fill with the column median
        ("Experience_Imputer", SimpleImputer(missing_values=np.nan, strategy="median"), ["Work_Experience"]),
        ("Family_Size_Imputer", SimpleImputer(missing_values=np.nan, strategy="median"), ["Family_Size"]),
    ],
    # All other columns (ID, Age, Spending_Score, Var_1, Segmentation) pass through unchanged.
    remainder="passthrough",
)
                  

In [54]:
preprocessor.set_output(transform='pandas')

In [55]:
preprocessor.fit(train)

In [56]:
preprocessor.transform(train)

Unnamed: 0,Gender_Ordinal__Gender,Married_Ordinal__Ever_Married,Graduated_Ordinal__Graduated,Profession_Ordinal__Profession,Experience_Imputer__Work_Experience,Family_Size_Imputer__Family_Size,remainder__ID,remainder__Age,remainder__Spending_Score,remainder__Var_1,remainder__Segmentation
0,1.0,0.0,0.0,5.0,1.0,4.0,462809,22,Low,Cat_4,D
1,0.0,1.0,1.0,2.0,1.0,3.0,462643,38,Average,Cat_4,A
2,0.0,1.0,1.0,2.0,1.0,1.0,466315,67,Low,Cat_6,B
3,1.0,1.0,1.0,7.0,0.0,2.0,461735,67,High,Cat_6,B
4,0.0,1.0,1.0,3.0,1.0,6.0,462669,40,High,Cat_6,A
...,...,...,...,...,...,...,...,...,...,...,...
8063,1.0,0.0,0.0,-1.0,0.0,7.0,464018,22,Low,Cat_1,D
8064,1.0,0.0,0.0,4.0,3.0,4.0,464685,35,Low,Cat_4,D
8065,0.0,0.0,1.0,5.0,1.0,1.0,465406,33,Low,Cat_6,D
8066,0.0,0.0,1.0,5.0,1.0,4.0,467299,27,Low,Cat_6,B
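
The transformed frame above still carries `ID` and the target `Segmentation`, and `Spending_Score` is still a string. A hedged sketch of one way to handle this: split off the target and identifier first, then encode `Spending_Score` with an explicit order. The ordering Low < Average < High is an assumption (the data description does not state it), and the toy rows are illustrative:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# A few rows shaped like the training data; the values are illustrative.
toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "Spending_Score": ["Low", "Average", "High"],
    "Segmentation": ["D", "A", "B"],
})

# Split off the target and the identifier before any preprocessing.
y = toy["Segmentation"]
X = toy.drop(columns=["ID", "Segmentation"])

# Assumed ordering Low < Average < High; not stated in the data description.
spending_encoder = OrdinalEncoder(categories=[["Low", "Average", "High"]])
encoded = spending_encoder.fit_transform(X[["Spending_Score"]])

print(encoded.ravel())
```

Passing the category list explicitly keeps the integer codes in the intended order instead of the alphabetical order the encoder would otherwise infer.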

