# Data preparation for clustering

Clustering with the k-means method involves a few steps:
* **Standardise the data:** Convert each variable from its actual range of values to a standard range, so that all variables are on a comparable scale
* **Find a Similarity Measure:**
* **Interpret Results:**

In [11]:
import pandas as pd
import numpy as np
from sklearn import preprocessing

Importing cleaned data

In [20]:
all_data = pd.read_csv('data/all_data.csv')
all_data = all_data.drop(columns=["Unnamed: 0"])  # Drop the leftover "Unnamed: 0" column
all_data.head()

df = all_data.copy()  # Working copy to encode and standardise; all_data keeps the original labels

Categorical entries can be converted to numbers either by one-hot encoding or by label encoding.

**Label encoding** is a data preparation method that replaces each category of a categorical variable with an integer code, so that the variable can be passed to machine learning algorithms.

`LabelEncoder()` is a scikit-learn class that maps each distinct category to an integer between 0 and n_classes - 1. Unlike one-hot encoding, it does not create indicator variables.
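To make this behaviour concrete, here is a minimal sketch on a toy column. The category names (`Study_A`, `Study_B`, `Study_C`) are hypothetical and are not taken from `all_data.csv`. It shows that the fitted encoder assigns codes in sorted order of the categories, and that `classes_` and `inverse_transform` can be used to recover the original labels.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column with hypothetical category names (not from all_data.csv)
toy = pd.Series(["Study_C", "Study_A", "Study_C", "Study_B"])

encoder = LabelEncoder()
codes = encoder.fit_transform(toy)

# classes_ holds the categories in sorted order; position i corresponds to code i
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(codes)    # [2 0 2 1]
print(mapping)  # {'Study_A': 0, 'Study_B': 1, 'Study_C': 2}

# inverse_transform turns the integer codes back into the original labels
restored = encoder.inverse_transform(codes)
print(list(restored))
```

Keeping the fitted encoder around is useful later, when cluster results need to be read in terms of the original study names rather than the integer codes.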

:::{note}
Most machine learning models, k-means included, require all input variables to be numeric
:::

k-means cannot be applied directly to a categorical variable, because it relies on means and Euclidean distances, which are only defined for numeric values.

In [21]:
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
df["Study_Type"] = labelencoder.fit_transform(df["Study_Type"])
df.head()

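| | Total | Study_Type | No_participants | Amount_won | Amount_lost | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1150 | 0 | 95 | 5800 | -4650 | 12.0 | 9.0 | 3.0 | 71.0 |
| 1 | -675 | 0 | 95 | 7250 | -7925 | 24.0 | 26.0 | 12.0 | 33.0 |
| 2 | -750 | 0 | 95 | 7100 | -7850 | 12.0 | 35.0 | 10.0 | 38.0 |
| 3 | -525 | 0 | 95 | 7000 | -7525 | 11.0 | 34.0 | 12.0 | 38.0 |
| 4 | 100 | 0 | 95 | 6450 | -6350 | 10.0 | 24.0 | 15.0 | 46.0 |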


In [14]:
df["Study_Type"].value_counts()

1    162
8    153
6     70
5     57
7     41
3     40
9     35
4     25
2     19
0     15
Name: Study_Type, dtype: int64

I opted for `LabelEncoder` rather than one-hot encoding to limit the number of dimensions in the data set.

This matters for clustering because k-means can suffer from the curse of dimensionality.
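The difference in dimensionality can be illustrated with a small sketch. The frame below is made up for the example (the category values `A` to `D` are hypothetical), and `pd.get_dummies` stands in for one-hot encoding; it is not the method used in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical frame with a categorical column that has four levels
toy = pd.DataFrame({
    "Total": [1150, -675, -750, 100],
    "Study_Type": ["A", "B", "C", "D"],
})

# Label encoding keeps Study_Type as a single integer column
label_encoded = toy.copy()
label_encoded["Study_Type"] = LabelEncoder().fit_transform(toy["Study_Type"])

# One-hot encoding replaces Study_Type with one indicator column per level
one_hot = pd.get_dummies(toy, columns=["Study_Type"])

print(label_encoded.shape)  # (4, 2)
print(one_hot.shape)        # (4, 5)
```

With the ten study types counted above, one-hot encoding would produce ten indicator columns in place of one. The trade-off is that integer codes imply an ordering and distances between studies that are arbitrary, which is worth keeping in mind when interpreting clusters.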

In [15]:
all_data.head()

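| | Total | Study_Type | No_participants | Amount_won | Amount_lost | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1150 | Fridberg | 95 | 5800 | -4650 | 12.0 | 9.0 | 3.0 | 71.0 |
| 1 | -675 | Fridberg | 95 | 7250 | -7925 | 24.0 | 26.0 | 12.0 | 33.0 |
| 2 | -750 | Fridberg | 95 | 7100 | -7850 | 12.0 | 35.0 | 10.0 | 38.0 |
| 3 | -525 | Fridberg | 95 | 7000 | -7525 | 11.0 | 34.0 | 12.0 | 38.0 |
| 4 | 100 | Fridberg | 95 | 6450 | -6350 | 10.0 | 24.0 | 15.0 | 46.0 |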


### Standardising data

k-means is sensitive to the variance of each feature: features with larger variance dominate the distance calculation and so have more influence on the result. Centring the mean alone is therefore often not enough; the variance should be equalised across features as well. For this reason I use `StandardScaler`, which rescales each feature to zero mean and unit variance.
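As a quick check of what the scaler does, the sketch below compares `StandardScaler` with a manual z-score, $z = (x - \mu) / \sigma$, computed per column. The array `X` is made-up illustrative data, not the study data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 5000.0]])

# Manual z-score: subtract the column mean, divide by the population
# standard deviation (ddof=0, NumPy's default)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

scaled = StandardScaler().fit_transform(X)

print(np.allclose(manual, scaled))  # True
print(scaled.mean(axis=0))          # approximately 0 for each column
print(scaled.std(axis=0))           # approximately 1 for each column
```

After scaling, both columns contribute on the same footing to the Euclidean distances that k-means uses.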

Keep in mind that k-means results also depend on the initial cluster centres (and, in some implementations, on the order of the observations). It is therefore worth running the algorithm several times from different starting points and keeping the solution with the lowest within-cluster sum of squares.
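In scikit-learn, repeated runs can be requested through the `n_init` parameter of `KMeans`, which refits from several initialisations and keeps the run with the lowest inertia. The sketch below uses synthetic data generated for illustration only; the number of clusters and the seed are arbitrary choices, not values from this analysis.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two synthetic, well-separated blobs of 50 points each (illustrative data only)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

# n_init=10 fits k-means from 10 different initialisations and keeps
# the run with the lowest inertia (within-cluster sum of squares)
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X)

print(model.inertia_)
print(np.bincount(labels))  # number of points assigned to each cluster
```

Fixing `random_state` makes the result reproducible, which helps when comparing runs on differently prepared versions of the data.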

### Standardising data Test

In [22]:
scaler = preprocessing.StandardScaler()
segmentation_std = scaler.fit_transform(df)

In [23]:
segmentation_std

array([[ 1.04498799, -1.60707299, -0.69883592, ..., -1.38690362,
        -1.0891972 ,  2.13175282],
       [-0.41434565, -1.60707299, -0.69883592, ..., -0.42018166,
        -0.66885443,  0.03874291],
       [-0.47431826, -1.60707299, -0.69883592, ...,  0.09161232,
        -0.76226394,  0.31413895],
       ...,
       [ 1.28487845,  0.74499351,  2.29926737, ..., -0.81824364,
         0.49876436,  2.40714887],
       [ 1.08496973,  0.74499351,  2.29926737, ..., -0.19271767,
         0.82569762,  1.03016866],
       [-1.31393487,  0.74499351,  2.29926737, ...,  4.01536616,
        -0.94908294, -0.18157392]])

The standardised data is now stored in a NumPy array. I will convert it back to a pandas DataFrame.

In [25]:
df_standard = pd.DataFrame(segmentation_std, columns=df.columns)  # reuse the original column names
df_standard

| | Total | Study_Type | No_participants | Amount_won | Amount_lost | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.044988 | -1.607073 | -0.698836 | -1.471413 | 1.525683 | -0.477115 | -1.386904 | -1.089197 | 2.131753 |
| 1 | -0.414346 | -1.607073 | -0.698836 | -0.538613 | 0.135451 | 1.024186 | -0.420182 | -0.668854 | 0.038743 |
| 2 | -0.474318 | -1.607073 | -0.698836 | -0.635110 | 0.167288 | -0.477115 | 0.091612 | -0.762264 | 0.314139 |
| 3 | -0.294400 | -1.607073 | -0.698836 | -0.699441 | 0.305250 | -0.602224 | 0.034746 | -0.668854 | 0.314139 |
| 4 | 0.205371 | -1.607073 | -0.698836 | -1.053262 | 0.804036 | -0.727332 | -0.533914 | -0.528740 | 0.754773 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 612 | 0.365298 | 0.744994 | 2.299267 | 2.613607 | -1.530705 | 1.024186 | 2.025056 | -0.622150 | 0.644614 |
| 613 | 1.844623 | 0.744994 | 2.299267 | 0.780173 | 0.464437 | -1.352875 | -0.135852 | 0.919107 | 1.966515 |
| 614 | 1.284878 | 0.744994 | 2.299267 | 0.812338 | 0.146063 | 0.273535 | -0.818244 | 0.498764 | 2.407149 |
| 615 | 1.084970 | 0.744994 | 2.299267 | 1.391318 | -0.342110 | 1.149295 | -0.192718 | 0.825698 | 1.030169 |


**Exporting Data to CSV file**

In [26]:
df_standard.to_csv('data/normalise.csv')
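One detail worth knowing when reading this file back: by default `to_csv` also writes the row index, which reappears as an `Unnamed: 0` column on `pd.read_csv`. The sketch below demonstrates this on a toy frame, using an in-memory buffer instead of a file; passing `index=False` is shown as an option, not as what this notebook does.

```python
import io
import pandas as pd

# Toy frame standing in for df_standard
toy = pd.DataFrame({"Total": [1150, -675], "Study_Type": [0, 0]})

# Default behaviour: the row index is written as an unnamed first column
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)
with_index = pd.read_csv(buffer)
print(with_index.columns.tolist())     # ['Unnamed: 0', 'Total', 'Study_Type']

# With index=False the index is omitted, so no extra column appears
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
without_index = pd.read_csv(buffer)
print(without_index.columns.tolist())  # ['Total', 'Study_Type']
```

Either approach works, as long as the code that loads the file is consistent with it: drop `Unnamed: 0` after loading, or write without the index in the first place.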

### Conclusion