<img width="200" style="float:left" 
     src="https://upload.wikimedia.org/wikipedia/commons/f/f3/Apache_Spark_logo.svg" />

<img style="display: float:left" src="https://storage.googleapis.com/kaggle-datasets-images/903978/1533070/57da797ac0a3334dfa9e0eda0f5559cc/dataset-cover.jpg?t=2020-10-14-15-50-13" />

# Sections
* [Description](#0)
* [1. Setup](#1)
  * [1.1 Start Hadoop](#1.1)  
  * [1.2 Search for Spark Installation](#1.2)
  * [1.3 Create SparkSession](#1.3)
* [2. Lab](#2)
  * [2.1 Check Lab Files](#2.1)
* [3. Clustering](#3)
* [4. Tear Down](#4)
  * [4.1 Stop Hadoop](#4.1)

<a id='0'></a>
## Description
<p>
In this notebook, we are going to use K-Means to cluster our data.

We will be using the Iris dataset, which has labels.

</p>
<p>
In this lab we will use Apache Spark to do unsupervised learning.
</p>
<div>The goals for this lab are:</div>
<ul>
    <li>Practice the Spark ML API</li>
    <li>Build a K-Means model</li>
</ul>



<a id='1'></a>
## 1. Setup

Since we are going to process data stored in HDFS, let's start the service first.

<a id='1.1'></a>
### 1.1 Start Hadoop

To start Hadoop, open a terminal and execute:
```sh
hadoop-start.sh
```

<a id='1.2'></a>
### 1.2 Search for Spark Installation 
This step is required just because we are working in the course environment.

In [None]:
import findspark
findspark.init()

I'm changing the pandas maximum column width setting so that column contents are displayed without truncation.

In [None]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

<a id='1.3'></a>
### 1.3 Create SparkSession

By setting the PYSPARK_SUBMIT_ARGS environment variable we can include extra libraries in our Spark cluster. In this lab no extra libraries are added.<br/>

In [None]:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = ' pyspark-shell'
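To make the idea of "including extra libraries" concrete, here is a minimal sketch of how such a value could be assembled. It assumes the standard `--packages` option of `spark-submit` and that the value should end with `pyspark-shell`. The helper name `build_submit_args` and the Maven coordinate are hypothetical and used for illustration only.

```python
def build_submit_args(packages=()):
    """Build a PYSPARK_SUBMIT_ARGS value (hypothetical helper, not part of PySpark)."""
    parts = []
    if packages:
        # --packages takes a comma-separated list of Maven coordinates
        parts.append("--packages " + ",".join(packages))
    # the value is expected to end with pyspark-shell
    parts.append("pyspark-shell")
    return " ".join(parts)

# Hypothetical coordinate, for illustration only
example_args = build_submit_args(["org.example:some-library_2.12:1.0.0"])
print(example_args)
```

The resulting string would then be assigned to `os.environ['PYSPARK_SUBMIT_ARGS']` before the SparkSession is created.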

The first step is always to create the SparkSession.

In [None]:
from pyspark.sql.session import SparkSession

spark = (SparkSession.builder
    .appName("Iris - Clustering - MLlib")
    .config("spark.sql.warehouse.dir","hdfs://localhost:9000/warehouse")
    .enableHiveSupport()
    .getOrCreate())

<a id='2'></a>
## 2. Lab

<a id='2.1'></a>
### 2.1 Check Lab Files

To complete this lab, you must first upload the datasets into HDFS.<br/>

Check that you have the data ready in HDFS:

http://localhost:50070/explorer.html#/datalake/raw/kaggle/iris/

<a id='3'></a>
## 3. Clustering

Let's create the DataFrame

In [None]:
irisDF = (spark.read.option("header","true")
                 .option("inferSchema","true")
                 .csv("hdfs://localhost:9000/datalake/raw/kaggle/iris/")
                 .cache())

print(f"There are {irisDF.count()} rows in the dataset")

In [None]:
irisDF.printSchema()

In [None]:
irisDF.limit(5).toPandas()

Notice that we have four variables we will consider as "features".  


In [None]:
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=["SepalLengthCm", "SepalWidthCm","PetalLengthCm","PetalWidthCm"], outputCol="features")
irisFeaturesDF = assembler.transform(irisDF)
irisFeaturesDF.limit(5).toPandas()

I'm going to use PCA (Principal Component Analysis) to create another vector with just two features, so that we can plot the clusters later.

In [None]:
from pyspark.ml.feature import PCA

pca = PCA(k=2, inputCol="features", outputCol="pca_features")
irisTwoFeaturesDF = pca.fit(irisFeaturesDF).transform(irisFeaturesDF)
irisTwoFeaturesDF.limit(5).toPandas()

In [None]:
irisTwoFeaturesDF.printSchema()

### How to Determine the Optimal K for K-Means? 

In the K-Means clustering algorithm, the number of clusters (k) is the hyperparameter to be tuned.

Two common methods for finding a good value of k are described below.

**1. Silhouette Method:**

Based on the Silhouette Score metric: the higher the silhouette score, the better the clustering.

The Silhouette value ranges from -1 to +1.

A high value is desirable and indicates that the point is placed in the correct cluster.

If many points have a negative Silhouette value, it may indicate that we have created too many or too few clusters.
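To make the score less abstract, the following is a minimal pure-Python sketch of the silhouette computation on a toy dataset. It is not Spark's implementation. It assumes squared Euclidean distance (the default distance measure of Spark's `ClusteringEvaluator`) and at least two clusters. The function names and the toy points are made up for illustration.

```python
from statistics import mean

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def silhouette_score(points, labels):
    """Mean silhouette over all points; assumes at least two clusters."""
    scores = []
    for i, p in enumerate(points):
        # a: mean distance to the other points in the same cluster
        same = [sq_dist(p, q) for j, q in enumerate(points)
                if labels[j] == labels[i] and j != i]
        if not same:
            scores.append(0.0)  # common convention for a single-point cluster
            continue
        a = mean(same)
        # b: mean distance to the points of the nearest other cluster
        b = min(
            mean([sq_dist(p, q) for j, q in enumerate(points) if labels[j] == other])
            for other in set(labels) if other != labels[i]
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return mean(scores)

# Toy data: two tight groups that are far apart
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = silhouette_score(points, [0, 0, 0, 1, 1, 1])   # labels follow the groups
mixed = silhouette_score(points, [0, 1, 0, 1, 0, 1])  # labels ignore the groups
print(f"well separated: {good:.3f}, mixed up: {mixed:.3f}")
```

The labelling that follows the two groups scores close to +1, while the labelling that ignores them scores much lower.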

**2. Elbow Method:**

Based on the WSSSE (Within Set Sum of Squared Errors) metric: for a given k, the lower the WSSSE, the tighter the clusters.

As the number of clusters increases, the WSSSE decreases.

When we plot the WSSSE against k, the curve drops rapidly up to a certain point and then bends, creating an elbow shape.

From this point on, the curve runs almost parallel to the X-axis.

The k value at the elbow is taken as the optimal number of clusters.
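As a rough illustration of why the curve bends, here is a minimal sketch that computes the WSSSE by hand for a toy dataset split into one, two, and three clusters. The groupings are chosen by hand rather than by running K-Means. The function names and data are assumptions made for this example.

```python
def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(coord) / n for coord in zip(*cluster))

def wssse(clusters):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for cluster in clusters:
        c = centroid(cluster)
        total += sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in cluster)
    return total

# Toy data: two tight groups that are far apart
left = [(0, 0), (0, 1), (1, 0)]
right = [(10, 10), (10, 11), (11, 10)]

cost_k1 = wssse([left + right])                 # everything in one cluster
cost_k2 = wssse([left, right])                  # the two natural groups
cost_k3 = wssse([left, right[:2], right[2:]])   # one group split further

print(f"k=1: {cost_k1:.2f}, k=2: {cost_k2:.2f}, k=3: {cost_k3:.2f}")
```

Going from one to two clusters removes most of the error, while the third cluster adds very little. That is the elbow, here at k=2.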

In [None]:
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
import matplotlib.pyplot as plt

evaluator = ClusteringEvaluator()

seed=1234

# k values we are going to try: 2 through 9 (range excludes the upper bound)
ks = range(2,10)

#metrics
models=[]
wssse=[]
sil=[]

for k in ks:
    kmeans = KMeans(k=k, seed=seed, maxIter=300, featuresCol="features")
    model = kmeans.fit(irisTwoFeaturesDF)    
    predictions = model.transform(irisTwoFeaturesDF)
    models.append(model)
    wssse.append(model.summary.trainingCost)
    sil.append(evaluator.evaluate(predictions))

# silhouette plot
plt.plot(ks, sil)
plt.title("Silhouette Method")
plt.xlabel("Number of Clusters")
plt.ylabel("Silhouette Score")
plt.show()

# elbow plot
plt.plot(ks, wssse)
plt.title("Elbow Method")
plt.xlabel("Number of Clusters")
plt.ylabel("WSSSE")
plt.show()

According to the charts, the optimal value for k would be k=5.
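Reading k off a chart can be cross-checked with a couple of small helpers. The sketch below picks the k with the highest silhouette score and, as a simple heuristic, the k where the cost curve bends most (largest second difference). The helper names are hypothetical, and the numbers are placeholders for illustration, not results from this notebook.

```python
def best_k_by_silhouette(ks, scores):
    """Return the k whose silhouette score is highest."""
    return max(zip(ks, scores), key=lambda pair: pair[1])[0]

def elbow_k(ks, costs):
    """Return the k where the cost curve bends most (largest second difference)."""
    bends = [costs[i - 1] - 2 * costs[i] + costs[i + 1]
             for i in range(1, len(costs) - 1)]
    return ks[1 + bends.index(max(bends))]

# Placeholder values for illustration only (not results from this notebook)
example_ks = [2, 3, 4, 5, 6]
example_sil = [0.60, 0.66, 0.71, 0.64, 0.58]
example_cost = [400.0, 250.0, 120.0, 100.0, 90.0]

k_sil = best_k_by_silhouette(example_ks, example_sil)
k_elbow = elbow_k(example_ks, example_cost)
print(f"silhouette suggests k={k_sil}, elbow heuristic suggests k={k_elbow}")
```

In the notebook, the same helpers could be applied to the `ks`, `sil`, and `wssse` lists built in the loop above. The second-difference rule is only one of several possible elbow heuristics, so it is worth comparing its answer with the plot.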

Let's visualize the clusters, this time using the **pca_features** column (to get two-dimensional centroids).

In [None]:
from pyspark.sql.functions import col
from pyspark.ml.functions import vector_to_array

kmeans = KMeans(k=5, seed=seed, maxIter=300, featuresCol="pca_features")
bestModel = kmeans.fit(irisTwoFeaturesDF)    
predictions = bestModel.transform(irisTwoFeaturesDF)

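# Evaluate clustering by computing Silhouette score
# (the evaluator reads its default "features" column, i.e. the four original features)
silhouette = evaluator.evaluate(predictions)
print(f"Silhouette with squared euclidean distance = {silhouette}")

# Evaluate clustering by computing Within Set Sum of Squared Errors,
# taken from bestModel rather than from the last model fitted in the loop above.
wssse = bestModel.summary.trainingCost
print(f"Within Set Sum of Squared Errors = {wssse}")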

In [None]:
df = predictions.select((vector_to_array(col('pca_features'))[0]).alias('x'),
                        (vector_to_array(col('pca_features'))[1]).alias('y'),
                         col('prediction').alias('label')).toPandas()

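clusters = df['label'].unique()

# cluster centers in the two-dimensional PCA space
centroids = bestModel.clusterCenters()

fig = plt.figure()
ax = fig.add_subplot(111)

# one scatter series per cluster
for i in list(clusters):
    t = df.loc[df['label']==i]
    ax.scatter(x=t['x'],y=t['y'],label=i)

# centroids drawn in black
for c in centroids:
    ax.scatter(x=c[0],y=c[1],c='black')

# show which color corresponds to which cluster
ax.legend(title="Cluster")
plt.show()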

<a id='4'></a>
## 4. Tear Down

Once we complete the lab, we can stop all the services.

<a id='4.1'></a>
### 4.1 Stop Hadoop

To stop Hadoop, open a terminal and execute:
```sh
hadoop-stop.sh
```