<a href="https://colab.research.google.com/github/kundyyy/100-Days-Of-ML-Code/blob/master/AfterWork_Data_Science_Clustering_Analysis_with_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<font color="blue">To use this notebook on Google Colaboratory, you will need to make a copy of it. Go to **File** > **Save a Copy in Drive**. You can then use the new copy that will appear in the new tab.</font>

# AfterWork Data Science: Clustering Analysis with Python

### Prerequisites

In [0]:
# Let's first import the libraries that we will need
# ----
#
import pandas as pd               # pandas for performing data manipulation
import numpy as np                # numpy for performing scientific computations
import matplotlib.pyplot as plt   # matplotlib for performing visualisation 

### Examples 

In [0]:
# Example
# ---
# In this example we will use the K-means algorithm to create three
# clusters from our given dataset. 
# ---
# This algorithm will partition our data into k clusters such that data points 
# in the same cluster are similar to each other, and data points in different 
# clusters are far apart. The value of k is the 
# no. of clusters we intend to have. 
# ---
# Dataset url = https://bit.ly/RioTemperature
# ---
# 

# Let's first import our algorithm 
# ---
# 
from sklearn.cluster import KMeans 
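
To make the idea of "similar points in the same cluster" concrete, here is a minimal NumPy sketch of the two steps that K-means repeats: assign each point to its nearest centroid, then move each centroid to the mean of its points. The toy points and the hand-picked starting centroids are made up for illustration; scikit-learn's `KMeans` chooses starting centroids itself and handles edge cases that this sketch ignores.

```python
import numpy as np

# Toy 2-D points forming two obvious groups (made-up values for illustration)
points = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
                   [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])

# Hand-picked starting centroids (an assumption; KMeans picks these automatically)
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])

for _ in range(5):
    # Assignment step: distance from every point to every centroid
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Update step: move each centroid to the mean of the points assigned to it
    centroids = np.array([points[labels == k].mean(axis=0)
                          for k in range(len(centroids))])

print(labels)
print(centroids)
```

Each point ends up labelled with the index of the closest centroid, which is the same kind of label we will later get from the clusterer's `predict` method.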

#### Data Importation

In [0]:
# Then import our dataset
# ---
# Dataset url = https://bit.ly/RioTemperature

# This dataset contains Rio temperatures across the years 1997 - 2019.
# We also note that we don't have a label.
# ---
# 
rio_temp_df = pd.read_csv('https://bit.ly/RioTemperature')
rio_temp_df.head()

In [0]:
# Checking our last records
# ---
# 
rio_temp_df.tail()

#### Data Exploration / Cleaning / Preparation / Statistical Analysis

We won't perform extensive exploration / cleaning / preparation / statistical analysis steps here because our main focus for this part of the session is to apply clustering analysis to our dataset.
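
Even when we skip a full exploration, K-means still needs numeric features with no missing values, so a quick check is worthwhile. The sketch below uses a small made-up dataframe (`sample_df`, with hypothetical column names `JAN` and `FEB`) rather than the real dataset, and dropping rows is only one of several ways to deal with missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a dataset such as rio_temp_df (values are made up)
sample_df = pd.DataFrame({
    "JAN": [26.1, 27.3, np.nan, 25.8],
    "FEB": [27.0, 26.5, 28.1, 26.9],
})

# Count the missing values in each column
missing_per_column = sample_df.isna().sum()

# Confirm that every column holds numeric data
all_numeric = all(pd.api.types.is_numeric_dtype(dtype) for dtype in sample_df.dtypes)

# One simple option: drop the rows that contain missing values
clean_df = sample_df.dropna()

print(missing_per_column)
print("All numeric:", all_numeric)
print("Rows before/after:", len(sample_df), len(clean_df))
```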

#### Data Modeling / Implementing the Solution

In [0]:
# During this step we select the data that we would like to work with.
# The following code selects all the rows and columns of our dataframe and 
# stores their values in an array (matrix) that holds our features. 
# This matrix will then be passed to our K-means algorithm for clustering.
# ---
# 
X = rio_temp_df.iloc[:,].values

# Let's preview our resulting data. 
# We can make comparisons with the previewed data in the previous cell just to 
# confirm we put the right values in our matrix.
# ---
# 
X

In [0]:
# Let's now create the K-means clusterer that we will use to perform cluster analysis. 
# Because we want three clusters, we pass 3 to the clusterer. 
# NB: We use a clusterer for clustering, a regressor for regression and a classifier for classification. 
# ---
# In addition, we set random_state = 1 so that we can reproduce our results 
# at some later point in time. 
# ----
# For further info about K-means, we can refer to its documentation
# by following this link: https://bit.ly/2To6GKN. This will be useful 
# for exploring the other model parameters that you'll get to see as an output of this cell.
# ---
# 
clusterer = KMeans(3, random_state=1)

# Then pass our data to the clusterer
# ---
# 
clusterer.fit(X)
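
Once a clusterer has been fitted, it exposes a few attributes that are worth inspecting. The sketch below shows them on a made-up array (`toy_X`) with three well-separated groups, so it runs without the dataset; `n_init=10` is set explicitly here only to keep the behaviour the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D data with three well-separated groups
toy_X = np.array([[1, 1], [1, 2], [2, 1],
                  [10, 10], [10, 11], [11, 10],
                  [20, 1], [21, 1], [20, 2]], dtype=float)

toy_clusterer = KMeans(n_clusters=3, n_init=10, random_state=1).fit(toy_X)

print(toy_clusterer.cluster_centers_)  # one row per cluster centre
print(toy_clusterer.labels_)           # cluster index assigned to each training row
print(toy_clusterer.inertia_)          # sum of squared distances to the nearest centre
```

The `inertia_` value is the quantity that the elbow method plots against the number of clusters.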

In [0]:
# We then use the predict method to return the cluster that each data point 
# belongs to and then store this in a new column of our dataframe.
# ---
# 
rio_temp_df['cluster_group'] = clusterer.predict(X)

In [0]:
# We then sample our dataset to check for the assigned clustering groups 0, 1 and 2. 
# We check the last column with the name "cluster_group".
# If we don't get to see these clusters, we can run the code again to get another set of records.
# ---
#
rio_temp_df.sample(10)

In [0]:
# To preview the records in our first cluster which is cluster 0 we perform
# the following pandas operation.
# ---
# 
first_cluster = rio_temp_df[rio_temp_df.cluster_group.isin([0])]
first_cluster.head()

In [0]:
# To preview the records in our second cluster, which is cluster 1, we can do the following:
# ---
# 
second_cluster = rio_temp_df[rio_temp_df.cluster_group.isin([1])]
second_cluster.head()

In [0]:
# We can preview the records in our third cluster which is cluster 2 as shown below:
# ---
# NB: If you investigate this cluster and compare it with the other clusters,
# you might conclude that it is a cluster of outliers, owing to the large values found in our features.
# ---
#
third_cluster = rio_temp_df[rio_temp_df.cluster_group.isin([2])]
third_cluster.head()
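
Rather than previewing each cluster separately, we can also summarise all of them at once. This is a minimal sketch on a made-up dataframe (`demo_df`, with hypothetical columns `feature_a` and `feature_b`) that already carries a `cluster_group` column; the same two calls should work on any dataframe that has such a column and numeric features.

```python
import pandas as pd

# Hypothetical frame that already carries a cluster_group column (values made up)
demo_df = pd.DataFrame({
    "feature_a": [1.0, 1.2, 0.9, 5.0, 5.3, 40.0],
    "feature_b": [2.0, 2.1, 1.9, 6.0, 6.2, 55.0],
    "cluster_group": [0, 0, 0, 1, 1, 2],
})

# Number of records in each cluster
cluster_sizes = demo_df["cluster_group"].value_counts().sort_index()

# Mean of every feature within each cluster
cluster_means = demo_df.groupby("cluster_group").mean()

print(cluster_sizes)
print(cluster_means)
```

A very small cluster whose feature means are far larger than the others is a hint that it may be holding outliers.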

If we had a dataset with only 2 features, i.e. x and y, then it would be easy to visualise our clusters on a scatter plot. Our dataset has more than that, but there are techniques such as Principal Component Analysis (PCA) which we can use for visualising a dataset with many features. This topic is beyond the scope of this session.
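
For the curious, here is a rough sketch of what that could look like, using made-up data (`demo_X`, 60 rows with 5 features) instead of our dataset. It only reduces the features to two columns with scikit-learn's `PCA`; it is not a full treatment of the technique.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Made-up data: 60 rows, 5 features, drawn around three different centres
rng = np.random.default_rng(1)
demo_X = np.vstack([rng.normal(loc=centre, scale=1.0, size=(20, 5))
                    for centre in (0.0, 10.0, 20.0)])

# Cluster on all 5 features
demo_labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(demo_X)

# Project the 5 features onto the first 2 principal components
demo_2d = PCA(n_components=2).fit_transform(demo_X)

print(demo_2d.shape)
# The two columns can then be drawn with, for example:
# plt.scatter(demo_2d[:, 0], demo_2d[:, 1], c=demo_labels)
```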

#### Challenging our Solution

In [0]:
# We can also check the optimal values of K through the use of the 
# elbow method as shown below.
# We will run the K-means algorithm for different values of K (say K = 1 to 10) 
# and plot the K values against the SSE (Sum of Squared Errors). 
# We then select the value of K at the elbow point of the resulting figure.
# ---
# 

# We will first define an empty list in which we will store our errors
# ---
#
Error = []

# Then use a for loop to run K-means several times and append the sum of squared errors 
# to the error list created above. The values in this list will then be plotted against the no. of clusters
# to create our elbow method visualisation.
# ---
#
for i in range(1, 11):
    # Fit once per value of k; random_state = 1 keeps the curve reproducible
    kmeans = KMeans(n_clusters = i, random_state = 1).fit(X)
    Error.append(kmeans.inertia_)

# We plot our elbow method visualisation: No. of clusters vs Error
# ---
# 
plt.plot(range(1, 11), Error)
plt.title('Elbow method')
plt.xlabel('No of clusters')
plt.ylabel('Error');

The output graph of the Elbow method is shown above. 
As we can see, the optimal value of k is 2, as the elbow-like shape is formed at k=2 in the above graph. We can implement k-means again using k = 2.
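
Reading the elbow off a plot can be subjective, so it may help to look at the numbers as well. The sketch below uses made-up data with two clear groups (`elbow_X`), prints how much the inertia falls each time k increases by one, and then refits with the chosen k. The variable names are illustrative only, and a large drop followed by small ones is a rule of thumb rather than a strict test.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data with two clear groups
rng = np.random.default_rng(1)
elbow_X = np.vstack([rng.normal(0.0, 1.0, size=(30, 2)),
                     rng.normal(10.0, 1.0, size=(30, 2))])

# Inertia (sum of squared distances to the nearest centre) for k = 1..6
k_values = list(range(1, 7))
inertias = [KMeans(n_clusters=k, n_init=10, random_state=1).fit(elbow_X).inertia_
            for k in k_values]

# Relative drop in inertia when moving from k - 1 clusters to k clusters
drops = [(inertias[i] - inertias[i + 1]) / inertias[i]
         for i in range(len(inertias) - 1)]
for k, drop in zip(k_values[1:], drops):
    print(f"k={k}: inertia fell by {drop:.0%}")

# Refit with the chosen number of clusters
final_clusterer = KMeans(n_clusters=2, n_init=10, random_state=1).fit(elbow_X)
```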



### <font color="green">Challenges</font> 

In [0]:
# Challenge 1
# ---
# Perform cluster analysis on the following loan applicants dataset by creating 
# 4 clusters of customers. In addition, challenge your solution by determining 
# the optimal no. of clusters using the elbow method.
# Hint: Check and deal with missing values.
# ---
# Dataset url = https://bit.ly/LoanApplicantsDs
# ---
# OUR CODE GOES BELOW
#  

In [0]:
# Challenge 2
# ---  
# A computer distributor has the given data on the computers in stock.
# Perform clustering analysis on the following computers dataset to identify
# the optimal no. of clusters.
# ---
# Dataset url = https://bit.ly/ComputersDs
# ---
# OUR CODE GOES BELOW
#  