## 4 - Experimentation

Here I experimented with different data-cleaning approaches.

The aim is to improve on the initial model, which had 26% accuracy.

The highest accuracy I found was 41.5%, obtained by dropping the columns ['id', 'anon_cat'], min-max scaling all variables and leaving the outliers untouched. However, the accuracy fluctuates every time I rerun the notebook.

In [1]:
## import libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from IPython.display import display

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OrdinalEncoder
from sklearn import preprocessing

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import scale


In [2]:
## loading data
df = pd.read_csv('data/train.csv')
df.head()

Unnamed: 0,ID,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1,Segmentation
0,462809,Male,No,22,No,Healthcare,1.0,Low,4.0,Cat_4,D
1,462643,Female,Yes,38,Yes,Engineer,,Average,3.0,Cat_4,A
2,466315,Female,Yes,67,Yes,Engineer,1.0,Low,1.0,Cat_6,B
3,461735,Male,Yes,67,Yes,Lawyer,0.0,High,2.0,Cat_6,B
4,462669,Female,Yes,40,Yes,Entertainment,,High,6.0,Cat_6,A


In [3]:
## save the id column to its own Series
df_id = df.loc[:,'ID']

## convert all column names to lower case for ease of typing
df.columns = df.columns.str.lower()

## removing the 'segmentation' column, which is what we are trying to predict (id is dropped later, after scaling)
df = df.drop(['segmentation'], axis= 'columns')

## Cleaning the dataset

- 1) dealing with the missing values<br>


- 2) replace outliers<br>

- 3) cleaning columns<br>

- 4) encoding variables<br>

- 5) standardisation<br>


### 1) dealing with the missing values


In [4]:
## 1) dealing with the missing values

# assign the (truncated) median for numerical variables
# assigning the result back avoids calling fillna(inplace=True) on a column selection
df['work_experience'] = df['work_experience'].fillna(int(df['work_experience'].median()))
df['family_size'] = df['family_size'].fillna(int(df['family_size'].median()))

# assign the mode for categorical variables
df['ever_married'] = df['ever_married'].fillna(df['ever_married'].mode()[0])
df['graduated'] = df['graduated'].fillna(df['graduated'].mode()[0])
df['profession'] = df['profession'].fillna(df['profession'].mode()[0])
df['var_1'] = df['var_1'].fillna(df['var_1'].mode()[0])

In [5]:
## check all is filled
df.isna().sum()

id                 0
gender             0
ever_married       0
age                0
graduated          0
profession         0
work_experience    0
spending_score     0
family_size        0
var_1              0
dtype: int64
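
The imputation above can also be wrapped in a small helper so that it is easier to swap strategies between experiments. This is only a sketch: `fill_missing` and the toy frame are hypothetical names and made-up values, and unlike the cell above it does not truncate the median with `int()`.

```python
import pandas as pd

def fill_missing(frame, num_cols, cat_cols):
    """Return a copy with numeric gaps filled by the median and categorical gaps by the mode."""
    out = frame.copy()
    for col in num_cols:
        out[col] = out[col].fillna(out[col].median())
    for col in cat_cols:
        out[col] = out[col].fillna(out[col].mode()[0])
    return out

# toy frame with made-up values, only to illustrate the helper
toy = pd.DataFrame({
    "work_experience": [1.0, None, 0.0, 9.0],
    "profession": ["Artist", "Artist", None, "Doctor"],
})
filled = fill_missing(toy, ["work_experience"], ["profession"])
print(filled)
```

Because the helper works on a copy, the original frame keeps its missing values, which makes it possible to compare several imputation strategies on the same raw data.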

### 2) replace outliers

From the experiments, capping the outliers decreases accuracy, so the capping code below is left disabled (wrapped in a string).


In [6]:
"""# below Q5 then above Q95

# age 
q1 = df['age'].quantile(0.05)
df['age'][df['age']<=q1] = q1


q4 = df['age'].quantile(0.95)
df['age'][df['age']>=q4] = q4

# work experience 
q1 = df['work_experience'].quantile(0.05)
df['work_experience'][df['work_experience']<=q1] = q1

q4 = df['work_experience'].quantile(0.95)
df['work_experience'][df['work_experience']>=q4] = q4

# family size
q1 = df['family_size'].quantile(0.05)
df['family_size'][df['family_size']<=q1] = q1

q4 = df['family_size'].quantile(0.95)
df['family_size'][df['family_size']>=q4] = q4"""


In [7]:
## checking the range of the numerical columns (outliers are left in, see above)
df.describe()

#or with boxplots ( to uncomment)
#for column in num_col:
    #plt.figure(figsize=(10,2))
    #sns.boxplot(data=df, x=column, showfliers= True) # set showfliers to False to remove outliers

Unnamed: 0,id,age,work_experience,family_size
count,8068.0,8068.0,8068.0,8068.0
mean,463479.214551,43.466906,2.47298,2.856346
std,2595.381232,16.711696,3.265248,1.499577
min,458982.0,18.0,0.0,1.0
25%,461240.75,30.0,0.0,2.0
50%,463472.5,40.0,1.0,3.0
75%,465744.25,53.0,4.0,4.0
max,467974.0,89.0,14.0,9.0
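
For reference, the percentile capping that was tried in this step can be written more compactly with `Series.clip`. This is a sketch under assumptions: `cap_to_percentiles` is a hypothetical helper and the ages below are made-up values, not rows from the dataset.

```python
import pandas as pd

def cap_to_percentiles(series, lower=0.05, upper=0.95):
    """Clip values below the lower quantile and above the upper quantile."""
    low_bound = series.quantile(lower)
    high_bound = series.quantile(upper)
    return series.clip(lower=low_bound, upper=high_bound)

# made-up ages for illustration
ages = pd.Series([18.0, 22.0, 38.0, 40.0, 67.0, 89.0])
capped = cap_to_percentiles(ages)
print(capped.tolist())
```

`clip` returns a new Series, so it avoids the chained assignment (`df['age'][mask] = value`) used in the disabled cell.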


### 3) cleaning columns

In [8]:
## cleaning of var_1
# renaming it to anon_cat for anonymised category
df.rename(columns={"var_1": "anon_cat"}, inplace=True)

# extracting the numbers from the string: 1 instead of cat_1
df['anon_cat'] = df['anon_cat'].str.extract(r'(\d+)')

# converting 'family_size' from float to int, as the number of people is a whole number
df['family_size'] = df['family_size'].astype(int)


In [9]:
## checking that work_experience values are all whole numbers
we = df['work_experience']%1 == 0
we.value_counts() ## all non decimal numbers

## converting float to integer type
df['work_experience'] = df['work_experience'].astype(int)

### 4) encoding variables

In [10]:
## label encoding: columns that have binary values are mapped to 0/1

## gender 
gender_ohe = preprocessing.LabelEncoder()
df['gender'] = gender_ohe.fit_transform(df['gender'])


## ever_married
ever_married_ohe = preprocessing.LabelEncoder()
df['ever_married'] = ever_married_ohe.fit_transform(df['ever_married'])

## graduated
graduated_ohe = preprocessing.LabelEncoder()
df['graduated'] = graduated_ohe.fit_transform(df['graduated'])



In [11]:
## multi-categories encoding

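## ordered category: spending_score
## note: by default OrdinalEncoder sorts the categories alphabetically
## (Average=0, High=1, Low=2), so the codes do not follow Low < Average < High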
spending_score_oe = OrdinalEncoder()
df['spending_score'] = spending_score_oe.fit_transform(df['spending_score'].values.reshape(-1,1))

## unordered category: profession
## get dummy variables for 'profession' variable 
dummies = pd.get_dummies(df.profession, dtype=int)

## concatenate dummy variables to main df
df = pd.concat([df,dummies], axis='columns')

## drop the 'profession' and 1 dummy variable 'Artist' to avoid multicollinearity
df = df.drop(['profession','Artist'], axis = 'columns')



In [12]:
df.head()

Unnamed: 0,id,gender,ever_married,age,graduated,work_experience,spending_score,family_size,anon_cat,Doctor,Engineer,Entertainment,Executive,Healthcare,Homemaker,Lawyer,Marketing
0,462809,1,0,22,0,1,2.0,4,4,0,0,0,0,1,0,0,0
1,462643,0,1,38,1,1,0.0,3,4,0,1,0,0,0,0,0,0
2,466315,0,1,67,1,1,2.0,1,6,0,1,0,0,0,0,0,0
3,461735,1,1,67,1,0,1.0,2,6,0,0,0,0,0,0,1,0
4,462669,0,1,40,1,1,1.0,6,6,0,0,1,0,0,0,0,0
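
In the output above, `spending_score` is encoded as Low=2.0, Average=0.0, High=1.0, because `OrdinalEncoder` orders string categories alphabetically unless told otherwise. A minimal sketch of how the intended order could be passed explicitly through the `categories` parameter (the toy values are made up; this was not one of the combinations tested here):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# toy column with made-up values
scores = pd.DataFrame({"spending_score": ["Low", "Average", "High", "Low"]})

# default behaviour: categories sorted alphabetically
default_codes = OrdinalEncoder().fit_transform(scores[["spending_score"]]).ravel()

# explicit order: Low < Average < High
ordered_encoder = OrdinalEncoder(categories=[["Low", "Average", "High"]])
ordered_codes = ordered_encoder.fit_transform(scores[["spending_score"]]).ravel()

print(default_codes)
print(ordered_codes)
```

With the explicit order, distances between the encoded values reflect the ranking of the spending levels, which matters for a distance-based method such as KMeans.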


### 5) standardising the numerical categories



In [13]:
## separating the variables into their types
num_df = df[['age','family_size','anon_cat']]
cat_df = df.drop(['age','family_size','anon_cat'], axis='columns')

In [14]:
## min-max scale all columns instead of standardising
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_scaled = scaler.fit_transform(df.to_numpy())
df_scaled = pd.DataFrame(df_scaled, columns= df.columns)

In [15]:
## check final df
df_scaled = df_scaled.drop(['id','anon_cat'], axis= 'columns')
df_scaled.head()

Unnamed: 0,gender,ever_married,age,graduated,work_experience,spending_score,family_size,Doctor,Engineer,Entertainment,Executive,Healthcare,Homemaker,Lawyer,Marketing
0,1.0,0.0,0.056338,0.0,0.071429,1.0,0.375,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,0.0,1.0,0.28169,1.0,0.071429,0.0,0.25,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,1.0,0.690141,1.0,0.071429,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,1.0,0.690141,1.0,0.0,0.5,0.125,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0.0,1.0,0.309859,1.0,0.071429,0.5,0.625,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
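
As a sanity check, the first row of the scaled table can be reproduced by hand from the minimum and maximum values reported by `df.describe()` earlier (age 18 to 89, family_size 1 to 9, work_experience 0 to 14). The helper name `min_max` is only illustrative.

```python
def min_max(value, col_min, col_max):
    """Min-max scaling: map col_min to 0 and col_max to 1."""
    return (value - col_min) / (col_max - col_min)

# first row of the data: age 22, family_size 4, work_experience 1
age_scaled = min_max(22, 18, 89)
family_scaled = min_max(4, 1, 9)
work_scaled = min_max(1, 0, 14)

print(round(age_scaled, 6), family_scaled, round(work_scaled, 6))
# prints: 0.056338 0.375 0.071429
```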


In [16]:
## saved final_df to csv 
# to be reused next in segmentation using Kmeans
#df_scaled.to_csv('final_scaled_df.csv', index=False)


## Kmeans clustering

In [17]:
## get original dataset
orig_df = pd.read_csv('train.csv')
orig_df.columns = orig_df.columns.str.lower()

In [18]:
## fitting the KMeans model with k = 4
kmeans = KMeans(n_clusters=4, init= 'random' , n_init=15)
kmeans.fit(df_scaled)
print("WCSS: ", kmeans.inertia_)
print("Iterations until converged: ", kmeans.n_iter_)
#print("Final centroids: ")
#print(kmeans.cluster_centers_)
print("Cluster assignments ")
print(kmeans.labels_)

## adding the cluster assignment to the original data as the last column
label = pd.DataFrame(kmeans.labels_, columns = ['labels'])
orig_df = pd.concat([orig_df,label],axis=1)
orig_df 

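## converting labels 0-3 to A-D
## note: KMeans numbers its clusters arbitrarily, so this fixed mapping is not guaranteed to line up with the true segments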
orig_df['labels'] = orig_df['labels'].replace({0:'A',1:'B', 2:'C', 3:'D'})
orig_df

## where the labels are equal to the segmentation 
true = orig_df[orig_df['segmentation'] == orig_df['labels']]
true

WCSS:  8147.749118316345
Iterations until converged:  9
Cluster assignments 
[3 1 1 ... 3 3 0]


Unnamed: 0,id,gender,ever_married,age,graduated,profession,work_experience,spending_score,family_size,var_1,segmentation,labels
0,462809,Male,No,22,No,Healthcare,1.0,Low,4.0,Cat_4,D,D
2,466315,Female,Yes,67,Yes,Engineer,1.0,Low,1.0,Cat_6,B,B
7,464347,Female,No,33,Yes,Healthcare,1.0,Low,3.0,Cat_6,D,D
11,464942,Male,No,19,No,Healthcare,4.0,Low,4.0,Cat_4,D,D
13,459573,Male,Yes,70,No,Lawyer,,Low,1.0,Cat_6,A,A
...,...,...,...,...,...,...,...,...,...,...,...,...
8052,467455,Female,No,37,Yes,Artist,8.0,Low,2.0,Cat_6,C,C
8053,465667,Male,No,23,No,Healthcare,1.0,Low,3.0,Cat_2,D,D
8055,461291,Male,No,18,No,Healthcare,0.0,Low,2.0,Cat_6,D,D
8059,460132,Male,No,39,Yes,Healthcare,3.0,Low,2.0,Cat_6,D,D


In [19]:
## calculate the share of clients whose cluster label matches their segment

len(true)/len(df)


0.30800694100148734
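
One likely contributor to the fluctuating accuracy is that KMeans cluster numbers carry no meaning: cluster 0 in one run may correspond to cluster 2 in the next, so the fixed mapping 0→A, 1→B, 2→C, 3→D can penalise an otherwise identical clustering. A possible alternative, sketched below, is to try every one-to-one mapping between cluster numbers and segments and keep the best one. `best_label_accuracy` and the toy lists are hypothetical, and because the mapping is chosen using the true segments, the result should be read as an upper bound for one-to-one mappings rather than as a prediction accuracy.

```python
from itertools import permutations

def best_label_accuracy(true_segments, cluster_ids, segment_names=("A", "B", "C", "D")):
    """Try every one-to-one mapping of cluster id -> segment and keep the best."""
    clusters = sorted(set(cluster_ids))
    best_acc, best_map = -1.0, None
    for perm in permutations(segment_names, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[c] == s for c, s in zip(cluster_ids, true_segments))
        acc = hits / len(true_segments)
        if acc > best_acc:
            best_acc, best_map = acc, mapping
    return best_acc, best_map

# made-up example: the clusters are pure, but their numbers are "shuffled"
segments = ["D", "A", "B", "B", "A", "D"]
clusters = [3, 1, 0, 0, 1, 2]

fixed_map = {0: "A", 1: "B", 2: "C", 3: "D"}
fixed_acc = sum(fixed_map[c] == s for c, s in zip(clusters, segments)) / len(segments)
acc, mapping = best_label_accuracy(segments, clusters)
print(fixed_acc, acc, mapping)
```

With four clusters there are only 24 mappings to check, so the brute-force search is cheap; for many more clusters an assignment solver would be the more practical choice.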

## Playing with different combinations

The base accuracy is at 26%.

- 1: 0.3166831928606842 standard-scaling all variables only
- 2: 0.39439762022806146 1 + dropping id, anon_cat
- 3: 0.2854486861675756 1 + dropping id, anon_cat, gender
- 4: 0.41460089241447695 dropping id, anon_cat, min-max scaling all variables, outliers not replaced <--- best one
- 5: 0.17178978681209717 1 + 4 + mean instead of median
- 6: 0.2811105602379772 1 + 4 + replacing outliers

I found cleaning combination 4 to be the best. However, the accuracy fluctuates between reruns; in the run shown above it is 30.8%.
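
Part of the run-to-run variation comes from the random initialisation of the centroids (`init='random'`). A minimal sketch of how runs could be made repeatable with the `random_state` parameter of `KMeans`, and how the spread across several seeds could be summarised. It uses synthetic data from `make_blobs` as a stand-in for `df_scaled`, so the numbers it prints say nothing about the customer dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# synthetic stand-in data (assumption: replace with df_scaled in the notebook)
X, _ = make_blobs(n_samples=200, centers=4, random_state=0)

def fit_labels(data, seed):
    """Fit KMeans with a fixed seed and return the cluster assignments."""
    model = KMeans(n_clusters=4, init="random", n_init=15, random_state=seed)
    return model.fit_predict(data)

# same seed twice: the assignments are expected to be identical
labels_a = fit_labels(X, seed=42)
labels_b = fit_labels(X, seed=42)
print(np.array_equal(labels_a, labels_b))

# several seeds: report the spread of the WCSS instead of a single run
inertias = [
    KMeans(n_clusters=4, init="random", n_init=15, random_state=seed).fit(X).inertia_
    for seed in range(5)
]
print(np.mean(inertias), np.std(inertias))
```

Fixing the seed makes a single run reproducible, but it does not make the cluster numbers meaningful, so comparing cleaning combinations on the mean over several seeds is probably a fairer basis than a single rerun.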

### Next steps:

If I had more time, I would: \
-automate the cleaning and KMeans pipeline to test different combinations more efficiently \
-try one-hot encoding on the binary features: 'gender', 'ever_married', 'graduated' \
-try other ways of dealing with outliers \
-try log-transforming 'age' so that its distribution is closer to a Gaussian