<a name='0'></a>
#### LICENSE
MIT License

Copyright (c) 2021 Jean de Dieu Nyandwi

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

# Intro to Unsupervised Learning - K-Means Clustering

K-Means clustering is an unsupervised learning algorithm. In unsupervised learning, the machine learning model does not get labels during training. It instead has to figure out the structure of the data itself. It's like learning without instructions. Imagine a teacher telling you, "Hey, here are 100 exercises to use while preparing for a test; the test will be only 5 questions drawn from those exercises." That can feel like a struggle, and you will do all you can to narrow those 100 exercises down to 5. Some questions may be similar, or may be solved by the same method, and so on. The goal is to narrow down the exercises while maximizing the chance of passing the test.

That example can be compared to clustering. The model is given a bunch of unlabeled data, and its job is to find the groups (labels) that are present in that data.


K-Means clustering requires the number of clusters to be specified before training. The full details of the algorithm are beyond the scope of this notebook, but here are its 3 main steps:

* K-Means picks an initial centroid for each cluster, typically at random. This step is called initialization. A centroid is also referred to as a cluster center, and it is the mean of all the samples in a cluster.

* It then assigns each sample to its nearest centroid.
* It then computes the new centroid of each cluster by taking the mean of all the samples assigned to that cluster.

The last two steps are repeated until a stopping criterion is met, such as reaching a maximum number of iterations or the centroids no longer changing between iterations.
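
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration and not scikit-learn's implementation: the function name `kmeans_sketch`, its parameters, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, tol=1e-6, seed=0):
    """Toy K-Means (hypothetical helper): returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct samples as the starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: index of the nearest centroid for every sample.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned samples
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids have (almost) stopped moving.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, labels

# Two well-separated synthetic blobs (made-up data for illustration).
rng = np.random.default_rng(42)
X_toy = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])
toy_centroids, toy_labels = kmeans_sketch(X_toy, k=2)
print(toy_centroids)
```

On data this cleanly separated, the two centroids should end up near the centers of the two blobs. Real data is rarely this tidy, which is one reason library implementations run several initializations and keep the best result.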

Unsupervised learning has applications in areas such as grouping web search results, customer segmentation, news aggregation, and more.

## KMeans Clustering

### Contents

* [1 - Imports](#1)
* [2 - Loading the data](#2)
* [3 - Exploratory Data Analysis](#3)
* [4 - Data Preprocessing](#4)
* [5 - Training K-Means Clustering to Find Clusters](#5)
* [6 - Evaluating K-Means Clustering](#6)
* [7 - Final Notes](#7)

<a name='1'></a>
## 1 - Imports

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn
import matplotlib.pyplot as plt
%matplotlib inline

<a name='2'></a>

## 2 - Loading the data

In this notebook, we will use a different dataset. Throughout these notebooks, my goal has been to look on the other side, to try something new, to try a new dataset. If you have gone through some of the notebooks about other algorithms, you have no doubt learned something new, or at least experienced a new dataset.

In this notebook, we will use a mushroom dataset. The dataset describes mushrooms in terms of their physical characteristics, and each one is classified as poisonous or edible.

The dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family. Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The field guide the data was drawn from clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like "leaflets three, let it be" for poison oak and ivy.

The dataset contains the labels (edibility) but for the purpose of doing clustering, we will remove the labels. 

In [None]:
# Let's first hide warnings just in case

import warnings
warnings.filterwarnings('ignore')

In [None]:
from sklearn.datasets import fetch_openml

# as_frame=True makes sure the `.frame` attribute used later is a pandas DataFrame
mushroom_data = fetch_openml(name='mushroom', version=1, as_frame=True)

In [None]:
mushroom_data.data.shape

As you can see above, there are 8124 examples and 22 features. 

In [None]:
# Description of the data 
print(mushroom_data.DESCR)

In [None]:
# Displaying feature names

mushroom_data.feature_names

In [None]:
# Displaying target name

mushroom_data.target_names

In [None]:
# Getting the whole dataframe

mushroom_data = mushroom_data.frame

<a name='3'></a>
## 3 - Exploratory Data Analysis


### Taking a quick look into the dataset

In [None]:
mushroom_data.head()

In [None]:
# Displaying the last rows 

mushroom_data.tail()

In [None]:
mushroom_data.info()

All features are categorical, so we will need to encode them as numbers before clustering.

### Checking Summary Statistics

In [None]:
# Summary stats

mushroom_data.describe()

### Checking Missing Values

In [None]:
# Checking missing values

mushroom_data.isnull().sum()

It seems that we have missing values in the feature `stalk-root`. 

Usually there are three things we can do with missing values:
* We can remove them completely,
* We can leave them as they are, or
* We can fill them using a strategy such as the mean, median, or most frequent value (for a categorical feature like this one, only the most frequent value applies). Both scikit-learn and Pandas provide quick ways to fill these kinds of values.

We will handle that during the data preprocessing.
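
As a quick sketch of the options listed above, here is how dropping and filling look in Pandas. The small DataFrame is made up for illustration and only borrows two column names from the dataset; it is not the mushroom data itself.

```python
import numpy as np
import pandas as pd

# Made-up categorical data with gaps, standing in for a column like `stalk-root`.
toy = pd.DataFrame({
    "stalk-root": ["b", np.nan, "e", "b", np.nan, "c"],
    "habitat":    ["d", "g", "d", "u", "g", "d"],
})

# Option 1: drop every row that contains a missing value.
dropped = toy.dropna()

# Option 2: fill with the most frequent value
# (mean and median only make sense for numeric columns).
most_frequent = toy["stalk-root"].mode()[0]
filled = toy.fillna({"stalk-root": most_frequent})

print(len(toy), len(dropped))
print(filled["stalk-root"].tolist())
```

Dropping is the simplest choice but shrinks the dataset, while filling keeps every row at the cost of inventing values that were never observed.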

### More Data Exploration

Before preprocessing the data, let's take a look into specific features. 

I also want to note that I do not know a lot about mushrooms. I thought it would be interesting to use this real-world dataset, and perhaps some people who come across this notebook may know some of these mushroom samples and their characteristics.

In [None]:
plt.figure(figsize=(12,7))
sns.countplot(data=mushroom_data, x='cap-shape', hue='class')

In `cap-shape`, the letters stand for: `bell=b,conical=c,convex=x,flat=f,knobbed=k,sunken=s`. It seems that the convex type is dominant, and most of those are edible.

In [None]:
plt.figure(figsize=(12,7))

sns.countplot(data=mushroom_data, x='cap-color', hue='class')

The above is the cap color. The letters stand for `brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y`.

Also, it seems that most caps are brown (n), whether edible or poisonous.

In [None]:
plt.figure(figsize=(12,7))

sns.countplot(data=mushroom_data, x='population')

The most common population type is several (v). Here is what the letters stand for: `abundant=a,clustered=c,numerous=n,scattered=s,several=v,solitary=y`.

In [None]:
plt.figure(figsize=(12,7))

sns.countplot(data=mushroom_data, x='habitat')

In [None]:
plt.figure(figsize=(12,7))

sns.countplot(data=mushroom_data, x='stalk-root')

Above is the feature that has missing values. Since the missing values are confined to this one feature, we will drop the rows that contain them rather than fill them in, to avoid adding noise to the dataset.

And finally, we can look at the `class` feature. There are two categories, `e` (edible) and `p` (poisonous).

In [None]:
plt.figure(figsize=(12,7))

sns.countplot(data=mushroom_data, x='class')

<a name='4'></a>

## 4 - Data Preprocessing 


Let's remove the missing values first. 

In [None]:
mushroom_df = mushroom_data.dropna()

For the purpose of performing clustering, we will remove the labels. 

In [None]:
mushroom = mushroom_df.drop('class', axis=1)
mushroom_labels = mushroom_df['class']

Let's now convert all categorical features into numbers.

In [None]:
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder()

mushroom_prepared = encoder.fit_transform(mushroom)

In [None]:
mushroom_prepared

As you can see above, `mushroom_prepared` is a NumPy array. We can convert it back to a Pandas DataFrame, although the KMeans algorithm accepts both as input.

In [None]:
mushroom_prep_df = pd.DataFrame(mushroom_prepared, columns=mushroom.columns)
mushroom_prep_df.head()

No letters anymore. They were all encoded as numeric values.

We are now ready to find the labels with KMeans clustering. Again, this assumes that we do not have labels. Put simply, we have data about the characteristics of different mushrooms, but we do not know whether they are edible or not. We want to use unsupervised learning to figure that out.
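
To see what the encoder actually did, it can help to inspect the mapping on a tiny example. The sketch below uses a made-up DataFrame that borrows two column names from the dataset; `categories_` and `inverse_transform` are part of scikit-learn's `OrdinalEncoder`.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up sample using two of the dataset's column names.
toy = pd.DataFrame({
    "cap-shape": ["x", "b", "f", "x"],
    "cap-color": ["n", "y", "w", "n"],
})

toy_encoder = OrdinalEncoder()
toy_encoded = toy_encoder.fit_transform(toy)

# categories_ holds the sorted categories of each column;
# a value's code is its position in that list.
for column, categories in zip(toy.columns, toy_encoder.categories_):
    print(column, dict(zip(categories, range(len(categories)))))

print(toy_encoded)

# inverse_transform recovers the original letters from the codes.
toy_decoded = toy_encoder.inverse_transform(toy_encoded)
print(toy_decoded)
```

One design note: ordinal codes imply an order and a distance between categories (for example, that `b` is closer to `f` than to `x`) that the categories do not really have. Since KMeans relies on distances, this encoding is a simplification; one-hot encoding is a common alternative when that matters.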

<a name='5'></a>

## 5 - Training K-Means Clustering to Find Clusters

We are going to create a KMeans model from `sklearn.cluster`. We need to provide the number of clusters, which is 2 in our case.

In [None]:
from sklearn.cluster import KMeans

# n_init is set explicitly because its default differs across scikit-learn versions
k_clust = KMeans(n_clusters=2, n_init=10, random_state=42)

k_clust.fit(mushroom_prep_df)

We can access the cluster centers by `model.cluster_centers_`. 

In [None]:
k_clust.cluster_centers_

Also, we can get the labels that the KMeans provided for each data point. 

In [None]:
k_labels = k_clust.labels_
k_labels
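
As a small sketch of what can be done with a fitted model, the example below counts how many samples landed in each cluster and assigns new points to the nearest centroid with `predict`. It runs on synthetic data from `make_blobs` rather than the mushroom data, and the names `X_demo`, `demo_model`, and `new_points` are made up for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with two groups (made up for illustration).
X_demo, _ = make_blobs(n_samples=200, centers=2, cluster_std=0.6, random_state=0)
demo_model = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_demo)

# How many samples landed in each cluster.
cluster_sizes = np.bincount(demo_model.labels_)
print(cluster_sizes)

# predict assigns new points to the nearest learned centroid.
# Here the new points sit right next to centroid 0 and centroid 1.
new_points = demo_model.cluster_centers_ + 0.1
new_labels = demo_model.predict(new_points)
print(new_labels)
```

The same two calls work on the mushroom model, as long as new samples are encoded with the same fitted encoder before being passed to `predict`.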

<a name='6'></a>

## 6 - Evaluating K-Means Clustering

In the real world, evaluating the performance of KMeans is not easy, because there are no true labels to compare the cluster labels against. In our case, since we have them, we can compute things like the accuracy score, or the confusion matrix to display the actual and predicted classes, not to mention a classification report to find things like recall, precision, or F1 score.

But since we are merely comparing the labels (true and clustered), we do not need those extra metrics.

Before finding the accuracy score, I will first encode the true labels as numbers. For simplicity, I will use the Pandas `map` function.

In [None]:
map_dict = {
    
    'p':0,
    'e':1
}

mushroom_labels_prep = mushroom_labels.map(map_dict)

In [None]:
mushroom_labels_prep

In [None]:
from sklearn.metrics import accuracy_score

accuracy_score(mushroom_labels_prep, k_labels)

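This is not excellent, but it is still impressive. Why? KMeans never saw the labels. It was only fed the characteristics of poisonous and edible mushrooms, and its job was to find patterns in the data that separate the poisonous species from the edible ones. Keep in mind that cluster ids are arbitrary: KMeans has no way of knowing that we encoded poisonous as 0 and edible as 1, so the score depends on which cluster happens to be numbered 0.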
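
Because cluster numbering is arbitrary, a plain accuracy score can understate how well the clusters match the true classes. The sketch below uses made-up label arrays (not the notebook's results) to show the effect and two ways to account for it: trying the flipped numbering, and using `adjusted_rand_score`, which does not depend on how clusters are numbered.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, adjusted_rand_score

# Made-up labels: the clustering is nearly perfect, but its ids are flipped
# relative to the encoding used for the true labels.
true_toy = np.array([0, 0, 0, 1, 1, 1, 1, 0])
cluster_toy = np.array([1, 1, 1, 0, 0, 0, 1, 1])

raw_acc = accuracy_score(true_toy, cluster_toy)
# With two clusters, the only other possible matching is the flipped one.
flipped_acc = accuracy_score(true_toy, 1 - cluster_toy)
aligned_acc = max(raw_acc, flipped_acc)

# A contingency table shows how each cluster splits across the true classes.
print(pd.crosstab(true_toy, cluster_toy, rownames=["true"], colnames=["cluster"]))
print(raw_acc, aligned_acc)

# The adjusted Rand index gives the same value for either numbering.
ari = adjusted_rand_score(true_toy, cluster_toy)
ari_flipped = adjusted_rand_score(true_toy, 1 - cluster_toy)
print(ari, ari_flipped)
```

If the accuracy computed in this notebook comes out below 0.5, that most likely means the cluster ids are flipped relative to `map_dict`, not that the clustering is worse than chance.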


The KMeans algorithm is very useful in areas where you have a bunch of unlabeled data. Take customer segmentation as an example. You may want to offer different promotions to different groups of customers, but you have no clue who would benefit from a particular promotion. So, you can try to find the groups of customers using this algorithm. It will try to group similar customers according to their interests, so that each promotion can go to the group most likely to appreciate it.

The same concept can be applied to grouping equipment with similar defects in an industrial setting. These are just a few examples; there are many more applications of KMeans clustering.
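
In applications like these there are no true labels to compare against, so internal measures are commonly used instead. As a rough sketch, the example below fits KMeans for several cluster counts on synthetic data and prints the inertia (the within-cluster sum of squared distances) and the silhouette score for each. The data and variable names are made up, and this is only one of several ways to judge a clustering.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic unlabeled data with three groups (made up for illustration).
X_blobs, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.7, random_state=7)

scores = {}
for k in range(2, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_blobs)
    # Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters.
    scores[k] = silhouette_score(X_blobs, model.labels_)
    print(k, round(model.inertia_, 1), round(scores[k], 3))
```

Inertia always decreases as the number of clusters grows, so it is usually read by looking for an "elbow", whereas the silhouette score can be compared directly across values of `k`.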

<a name='7'></a>

## 7 - Final Notes

In this notebook, we learned the idea behind unsupervised learning and KMeans clustering. We also practiced it on the mushroom dataset, where we were interested in grouping the species into poisonous and edible.

If you like mushrooms and you know some of their characteristics, no doubt that you enjoyed this notebook. Maybe pick one edible sample and make it your next meal :)

## Acknowledgments

Thanks to Jean de Dieu Nyandwi for creating the open-source project [unsupervised-learning-ipython-notebooks](https://github.com/Nyandwi/machine_learning_complete/blob/main/6_classical_machine_learning_with_scikit-learn/10_intro_to_unsupervised_learning_with_kmeans_clustering.ipynb). It inspires the majority of the content in this chapter.

## [BACK TO TOP](#0)