In [1]:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
#https://www.datacamp.com/community/tutorials/k-means-clustering-python

In [2]:
#We vertrekken van de titanic dataset titanic-train.csv (zie BB)
titanic_df = pd.read_csv('titanic-train.csv')

In [4]:
print(titanic_df.head())

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  


In [5]:
titanic_df.isna().head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,False,False,False,False,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,True,False
3,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,True,False


In [6]:
titanic_df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [None]:
#Pandas provides the fillna() function for replacing missing values with a specific value. 
#Let's apply that with Mean Imputation.

In [10]:
titanic_df.fillna(titanic_df.mean(), inplace=True)

In [11]:
titanic_df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [None]:
Yes, you can see there are still some missing values in the Cabin and Embarked columns. 
This is because these values are non-numeric. 
In order to perform the imputation the values need to be in numeric form. 
There are ways to convert a non-numeric value to a numeric one. More on this later.

Let's do some more analytics in order to understand the data better. 
Understanding is really required in order to perform any Machine Learning task. 
Let's start with finding out which features are categorical and which are numerical.

Categorical: Survived, Sex, and Embarked. Ordinal: Pclass.
Continuous: Age, Fare. Discrete: SibSp, Parch.
Two features are left out which are not listed above in any of the categories. 
Yes, you guessed it right, Ticket and Cabin. Ticket is a mix of numeric and alphanumeric data types. 
Cabin is alphanumeric. Let see some sample values.

In [12]:
titanic_df['Ticket'].head()

0           A/5 21171
1            PC 17599
2    STON/O2. 3101282
3              113803
4              373450
Name: Ticket, dtype: object

In [13]:
titanic_df['Cabin'].head()

0     NaN
1     C85
2     NaN
3    C123
4     NaN
Name: Cabin, dtype: object

In [None]:
Let's actually build a K-Means model with the training set. 
But before that you will need some data preprocessing as well. 
You can see that not all the feature values are of same type. 
Some of them are numerical and some of them are not. 
In order to ease the computation, you will feed all numerical data to the model. 
Let's see the data types of different features that you have:

In [14]:
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


In [None]:
So, you can see that the following features are non-numeric:

Name
Sex
Ticket
Cabin
Embarked
Before converting them into numeric ones, you might want to do some feature engineering, 
i.e. features like Name, Ticket, Cabin and Embarked do not have any impact on the survival status of the passengers. 
Often, it is better to train your model with only significant features than to train it with all the features, 
including unnecessary ones. 
It not only helps in efficient modelling, but also the training of the model can happen in much lesser time. 
Although, feature engineering is a whole field of study itself, I will encourage you to dig it further. 
But for this tutorial, know that the features Name, Ticket, Cabin and Embarked 
can be dropped and they will not have significant impact on the training of the K-Means model.

In [15]:
titanic_df = titanic_df.drop(['Name','Ticket', 'Cabin','Embarked'], axis=1)

In [None]:
Now that the dropping part is done let's convert the 'Sex' feature to a numerical one 
(only 'Sex' is remaining now which is a non-numeric feature). 
You will do this using a technique called Label Encoding.

In [18]:
labelEncoder = LabelEncoder()
labelEncoder.fit(titanic_df['Sex'])
titanic_df['Sex'] = labelEncoder.transform(titanic_df['Sex'])

In [19]:
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Sex            891 non-null int32
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Fare           891 non-null float64
dtypes: float64(2), int32(1), int64(5)
memory usage: 52.3 KB


In [None]:
Looks like you are good to go to train your K-Means model now.

You can first drop the Survival column from the data with the drop() function.

In [20]:
X = np.array(titanic_df.drop(['Survived'], 1).astype(float))
y = np.array(titanic_df['Survived'])

In [21]:
kmeans = KMeans(n_clusters=2) # You want cluster the passenger records into 2: Survived or Not survived
kmeans.fit(X)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=2, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [None]:
You can see all the other parameters of the model other than n_clusters. 
Let's see how well the model is doing by looking at the percentage of passenger records that were clustered correctly.

In [22]:
correct = 0
for i in range(len(X)):
    predict_me = np.array(X[i].astype(float))
    predict_me = predict_me.reshape(-1, len(predict_me))
    prediction = kmeans.predict(predict_me)
    if prediction[0] == y[i]:
        correct += 1

print(correct/len(X))

0.49158249158249157


In [None]:
That is nice for the first go. Your model was able to cluster correctly with a 50% (accuracy of your model). 
But in order to enhance the performance of the model you could tweak some parameters of the model itself. 
I will list some of these parameters which the scikit-learn implementation of K-Means provides:

algorithm
max_iter
n_jobs

Let's tweak the values of these parameters and see if there is a change in the result.

In the scikit-learn documentation, you will find a solid information about these parameters which you should dig further.

In [23]:
kmeans = kmeans = KMeans(n_clusters=2, max_iter=600, algorithm = 'auto')
kmeans.fit(X)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=600,
    n_clusters=2, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [24]:
correct = 0
for i in range(len(X)):
    predict_me = np.array(X[i].astype(float))
    predict_me = predict_me.reshape(-1, len(predict_me))
    prediction = kmeans.predict(predict_me)
    if prediction[0] == y[i]:
        correct += 1

print(correct/len(X))

0.49158249158249157


In [None]:
You can see a decrease in the score. 
One of the reasons being you have not scaled the values of the different features that you are feeding to the model. 
The features in the dataset contain different ranges of values. 
So, what happens is a small change in a feature does not affect the other feature. 
So, it is also important to scale the values of the features to a same range.

Let's do that now and for this experiment you are going to take 0 - 1 as the uniform value range across all the features.

In [25]:
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

In [26]:
kmeans.fit(X_scaled)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=600,
    n_clusters=2, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [27]:
correct = 0
for i in range(len(X)):
    predict_me = np.array(X[i].astype(float))
    predict_me = predict_me.reshape(-1, len(predict_me))
    prediction = kmeans.predict(predict_me)
    if prediction[0] == y[i]:
        correct += 1

print(correct/len(X))

0.6262626262626263


In [None]:
Great! You can see an instant 12% increase in the score.

So far you were able to load your data, preprocess it accordingly, do a little bit of feature engineering 
and finally you were able to make a K-Means model and see it in action.

Now, let's discuss K-Means's limitations.

In [None]:
Disadvantages of K-Means
Now that you have a fairly good idea on how K-Means algorithm works let's discuss some its disadvantages.

The biggest disadvantage is that K-Means requires you to pre-specify the number of clusters (k). 
However, for the Titanic dataset, you had some domain knowledge available that told you the number of people 
who survived in the shipwreck. This might not always be the case with real world datasets. 
Hierarchical clustering is an alternative approach that does not require a particular choice of clusters. 
An additional disadvantage of k-means is that it is sensitive to outliers and different results can occur 
if you change the ordering of the data.

K-Means is a lazy learner where generalization of the training data is delayed until a query is made to the system. 
This means K-Means starts working only when you trigger it to, thus lazy learning methods can construct 
a different approximation or result to the target function for each encountered query. 
It is a good method for online learning, but it requires a possibly large amount of memory to store the data, 
and each request involves starting the identification of a local model from scratch.

In [None]:
Conclusion
So, in this tutorial you scratched the surface of one of the most popular clustering techniques - K-Means. 
You learned about its inner mechanics, implemented it using the Titanic Dataset in Python, 
and you also got a fair idea of its disadvantages. 
If you would like to learn more about these clustering techniques, 
I highly recommend you check out our Unsupervised Learning in Python course.