## k-means clustering in Python
In the following, we will go through the steps of performing k-means clustering in Python. Before building the model, we need to take care of two things in particular, if they have not been handled already: missing values and data standardization. Because k-means relies on distances between observations, it is a typical example of a method where proper scaling in advance is essential.
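
To make the scaling argument concrete, here is a minimal sketch on hypothetical toy data (the values below are made up for illustration and are not taken from `USArrests.csv`). It compares Euclidean distances before and after standardizing each column to mean 0 and standard deviation 1.

```python
import numpy as np

# Hypothetical toy data: column 0 spans hundreds, column 1 spans only a few units.
X = np.array([
    [50.0, 1.0],
    [300.0, 2.0],
    [60.0, 17.0],
    [290.0, 16.0],
])

def euclidean(a, b):
    """Euclidean distance between two 1-D arrays."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Raw distances from row 0: the large-scale column dominates
d_raw_01 = euclidean(X[0], X[1])
d_raw_02 = euclidean(X[0], X[2])

# Standardize each column to mean 0 and standard deviation 1
X_s = (X - X.mean(axis=0)) / X.std(axis=0)
d_std_01 = euclidean(X_s[0], X_s[1])
d_std_02 = euclidean(X_s[0], X_s[2])

print(f"raw:          {d_raw_01:.2f} vs {d_raw_02:.2f}")
print(f"standardized: {d_std_01:.2f} vs {d_std_02:.2f}")
```

On the raw values, row 1 appears more than ten times farther from row 0 than row 2 does, purely because column 0 has the larger scale. After standardization the two distances are nearly equal, so both columns contribute comparably.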

In [None]:
# Importing libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Here, we'll use a data set which contains statistics on arrests
# per 100,000 residents for assault, murder, and rape in each of the 50 US states in 1973.
# It also includes the percentage of the population living in urban areas.

crime_rates = pd.read_csv("USArrests.csv", index_col=0)

crime_rates.head()

In [None]:
# As we can see from info(), there are no missing values, so this is not an issue in this case

crime_rates.info()

In [None]:
# Since the "UrbanPop" variable is a percentage, while the other variables are rates per 100,000 people,
# we need to scale/standardize the data

crime_rates.describe()

In [None]:
# We have seen standardization before; here we do it with a different function
# After importing it, we use the apply function to standardize every column

from sklearn import preprocessing

crime_rates_s = crime_rates.apply(lambda x: preprocessing.scale(x))

In [None]:
crime_rates_s.head()
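
As a quick sanity check on the standardization step, the sketch below uses a small hypothetical DataFrame (made-up values standing in for the crime data) to show what `preprocessing.scale` produces. It also compares the result with `StandardScaler`, which is an alternative shown here for illustration and is not used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the crime data (values are made up)
demo = pd.DataFrame(
    {"Murder": [13.2, 10.0, 8.1, 8.8, 9.0],
     "UrbanPop": [58.0, 48.0, 80.0, 50.0, 91.0]},
    index=["A", "B", "C", "D", "E"],
)

# Same pattern as above: scale every column separately
demo_apply = demo.apply(lambda x: preprocessing.scale(x))

# Alternative: StandardScaler on the whole table at once
demo_scaler = StandardScaler().fit_transform(demo)

# Column means are 0; standard deviations are 1 when computed with ddof=0
print(demo_apply.mean().round(6))
print(demo_apply.std(ddof=0))

# pandas uses ddof=1 by default, so describe() or std() will show values slightly above 1
print(demo_apply.std())
```

This explains why `describe()` on the standardized data reports a standard deviation that is close to, but not exactly, 1: `preprocessing.scale` divides by the population standard deviation, while pandas reports the sample standard deviation by default.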

In [None]:
# Now we can perform clustering
# First, we look at how to create a single model; afterwards we determine the optimal k value
# Let's look at one use of the KMeans class

from sklearn.cluster import KMeans

# First we create an object with several parameters
# n_clusters: number of clusters
# init: cluster initialization technique, "random" or the more advanced "k-means++"
# n_init: number of initializations to perform; this is important because two runs can converge on
# different cluster assignments, and the run with the lowest SSE is kept
# random_state: by specifying this, we make sure that we will get the same model when re-running the code

kmeans = KMeans(n_clusters = 3, init = 'k-means++', n_init = 5, random_state = 42)

kmeans.fit(crime_rates_s)
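
The role of `random_state` can be checked with a small sketch. The data below is synthetic (three hypothetical, well-separated groups of points), and the helper `fit_demo` is introduced only for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three hypothetical groups of 20 points each
rng = np.random.default_rng(0)
group_centers = np.array([[-4.0, 0.0], [0.0, 4.0], [4.0, 0.0]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in group_centers])

def fit_demo(seed):
    """Fit a 3-cluster model with a fixed seed (hypothetical helper)."""
    return KMeans(n_clusters=3, init="k-means++", n_init=5, random_state=seed).fit(X_demo)

first = fit_demo(42)
second = fit_demo(42)

# With the same seed and the same data, both fits give the same result
same_labels = bool(np.array_equal(first.labels_, second.labels_))
same_inertia = bool(np.isclose(first.inertia_, second.inertia_))
print(same_labels, same_inertia)
```

Without a fixed `random_state`, the numbering of the clusters (which group is called 0, 1, or 2) can change between runs even when the grouping itself stays the same.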

In [None]:
# After fitting the model, we can inspect the sum of squared errors
# (squared distances of the samples to their closest centroid)

print(kmeans.inertia_)

# Final locations of the centroids of clusters

print(kmeans.cluster_centers_)

# The number of iterations required to converge

print(kmeans.n_iter_)
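
To see what `inertia_` measures, the following sketch recomputes it by hand from the fitted labels and centroids. It uses synthetic data (two hypothetical groups of points) rather than the crime data, so the numbers are only illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: two hypothetical groups of 20 points each
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=-2.0, scale=0.5, size=(20, 2)),
    rng.normal(loc=2.0, scale=0.5, size=(20, 2)),
])

model = KMeans(n_clusters=2, init="k-means++", n_init=5, random_state=42).fit(X_demo)

# For every sample, look up the centroid of the cluster it was assigned to
assigned_centers = model.cluster_centers_[model.labels_]

# Sum of squared distances between samples and their assigned centroids
manual_sse = float(np.sum((X_demo - assigned_centers) ** 2))

print(manual_sse, model.inertia_)
```

The two printed values should agree up to floating-point rounding, which confirms that the SSE is the within-cluster sum of squared distances.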

In [None]:
# Now that we know how to build one model, we can select the optimal k value
# For this, we iterate over k values and record the quality of the model for each k,
# then visualize how the performance changes in order to apply the elbow method
# The metric used to evaluate a model is the sum of squared errors

sse_clust = []

for i in range(1, 11):
    # n_init is set explicitly because its default differs between scikit-learn versions
    kmeans = KMeans(n_clusters = i, init = 'k-means++', n_init = 10, random_state = 42)
    kmeans.fit(crime_rates_s)
    sse_clust.append(kmeans.inertia_)

In [None]:
sse_clust

In [None]:
# Let's visualize the results
# We can confirm that the elbow method suggests using 4 clusters

plt.plot(range(1, 11), sse_clust)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('SSE')
plt.show()
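
Reading an elbow off a plot is a judgment call, so it can help to look at the numbers behind it. The sketch below is one possible way to do that, on synthetic data with four hypothetical, well-separated groups (not the crime data): it prints how much the SSE drops each time one more cluster is added.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: four hypothetical groups of 50 points each
rng = np.random.default_rng(0)
true_centers = np.array([[-5.0, -5.0], [-5.0, 5.0], [5.0, -5.0], [5.0, 5.0]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in true_centers])

k_values = list(range(1, 8))
sse = []
for k in k_values:
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    model.fit(X_demo)
    sse.append(model.inertia_)

# Drop in SSE gained by going from k-1 to k clusters
drops = -np.diff(sse)
for k, drop in zip(k_values[1:], drops):
    print(f"k={k}: SSE drops by {drop:.1f}")
```

On this synthetic example the drops are large up to k=4 and small afterwards, which is the pattern the elbow method looks for. On real data such as the crime rates the change is usually less abrupt, so the plot and the numbers should be read together.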

In [None]:
# Having decided on 4 clusters, we can create the final model

# n_init is set explicitly because its default differs between scikit-learn versions
kmeans = KMeans(n_clusters = 4, init = 'k-means++', n_init = 10, random_state = 42)

y_clust = kmeans.fit_predict(crime_rates_s)

y_clust

In [None]:
# Now we can interpret the clusters.
# As we have the cluster labels, we can simply group by them and look at how the
# attributes vary across clusters

crime_rates.groupby(y_clust).mean()

In [None]:
# We can check which specific states are in a given cluster

crime_rates[y_clust == 2]
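
Instead of filtering one cluster at a time, the members of all clusters can be listed in one go. The sketch below shows one way to do this; the DataFrame, the state names, and the label array are hypothetical placeholders, not results from the model above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the crime data and for the labels returned by fit_predict
demo = pd.DataFrame(
    {"Murder": [13.2, 2.1, 9.0, 3.4], "Assault": [236, 57, 276, 83]},
    index=["State A", "State B", "State C", "State D"],
)
labels = np.array([0, 1, 0, 1])

# Number of observations per cluster
cluster_sizes = pd.Series(labels).value_counts().sort_index()
print(cluster_sizes)

# Index entries (state names) belonging to each cluster
members = {int(c): demo.index[labels == c].tolist() for c in np.unique(labels)}
for cluster, states in members.items():
    print(cluster, states)
```

The same pattern should work with `crime_rates` and `y_clust`, since the boolean mask `labels == c` has one entry per row of the DataFrame.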