<img src="images/cads-logo.png" style="height: 100px;" align=left> <img src="images/apache_spark.png" style="height: 20%;width:20%" align=right>

mc : https://colab.research.google.com/drive/1lN85HNOWdoRNml_BK0yRenuTEvI7di83?usp=sharing

# Clustering
In clustering, we look for natural groupings in the data. For example, let's take the utilization data and see whether it divides into three groups whose members logically belong together. To do that, we need the Apache Spark machine learning package (`pyspark.ml`).

In [3]:
!pip install pyspark
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

Successfully installed py4j-0.10.9 pyspark-3.0.1


In [4]:
spark = SparkSession.builder.getOrCreate()

In [None]:
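# MC: I'm using Google Drive instead (see the cells below).
# This cell reads the file from a Data/ folder under the current working directory.
import os
MAIN_DIRECTORY = os.getcwd()
file_path = os.path.join(MAIN_DIRECTORY, "Data", "utilization.json")
df_util = spark.read.format("json").load(file_path)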

In [23]:
# Data analysis vs. data analytics: two different concepts

# Data analysis
# - reviews the performance of the company
# - works with previous/transactional data
# - answers the question: what is going on in the business?

# Data analytics
# - supports decisions about the future
# - is about predictions
# - uses historical data to make those predictions

# Recommended reading: business and management books


In [24]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [25]:
cd "/content/drive/MyDrive/UM Lecture/CADS/13 BDA with Apache Spark 2day"

/content/drive/MyDrive/UM Lecture/CADS/13 BDA with Apache Spark 2day


In [17]:
pwd

'/content/drive/My Drive/UM Lecture/CADS/13 BDA with Apache Spark 2day'

In [19]:
df_util = spark.read.format("json").load("/content/drive/MyDrive/UM Lecture/CADS/13 BDA with Apache Spark 2day/data/utilization.json")

In [20]:
df_util.show(5)

+---------------+-------------------+-----------+---------+-------------+
|cpu_utilization|     event_datetime|free_memory|server_id|session_count|
+---------------+-------------------+-----------+---------+-------------+
|           0.77|03/16/2019 17:21:40|       0.22|      115|           58|
|           0.53|03/16/2019 17:26:40|       0.23|      115|           64|
|            0.6|03/16/2019 17:31:40|       0.19|      115|           82|
|           0.46|03/16/2019 17:36:40|       0.32|      115|           60|
|           0.77|03/16/2019 17:41:40|       0.49|      115|           84|
+---------------+-------------------+-----------+---------+-------------+
only showing top 5 rows



Now we would like to group the data by CPU utilization, free memory, and session count. Spark MLlib works with vectors. A vector is like an array: a single data structure that holds all the values from a row that the ML algorithm will look at. In our case, we look at only three columns: `cpu_utilization`, `free_memory`, and `session_count`.

We create a vector to store these three values by using `VectorAssembler`.

In [26]:
vecAssembler = VectorAssembler(inputCols=['cpu_utilization','free_memory','session_count'], outputCol='features')

`VectorAssembler` gives us a transformer object. Calling its `transform()` method on our DataFrame creates a new DataFrame in which the listed columns are combined into a single vector, stored in a new column called `features`.

In [27]:
vecCluster_df  = vecAssembler.transform(df_util)

In [28]:
vecCluster_df.show(5)

+---------------+-------------------+-----------+---------+-------------+----------------+
|cpu_utilization|     event_datetime|free_memory|server_id|session_count|        features|
+---------------+-------------------+-----------+---------+-------------+----------------+
|           0.77|03/16/2019 17:21:40|       0.22|      115|           58|[0.77,0.22,58.0]|
|           0.53|03/16/2019 17:26:40|       0.23|      115|           64|[0.53,0.23,64.0]|
|            0.6|03/16/2019 17:31:40|       0.19|      115|           82| [0.6,0.19,82.0]|
|           0.46|03/16/2019 17:36:40|       0.32|      115|           60|[0.46,0.32,60.0]|
|           0.77|03/16/2019 17:41:40|       0.49|      115|           84|[0.77,0.49,84.0]|
+---------------+-------------------+-----------+---------+-------------+----------------+
only showing top 5 rows
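
To make the `features` column less mysterious, here is a minimal plain-Python sketch of what the assembly step does conceptually. It is not how Spark implements `VectorAssembler`; the `assemble` helper is hypothetical, and the rows are the first three from the output above.

```python
# Plain-Python illustration only; Spark's VectorAssembler works on DataFrames.
rows = [
    {"cpu_utilization": 0.77, "free_memory": 0.22, "session_count": 58},
    {"cpu_utilization": 0.53, "free_memory": 0.23, "session_count": 64},
    {"cpu_utilization": 0.60, "free_memory": 0.19, "session_count": 82},
]
input_cols = ["cpu_utilization", "free_memory", "session_count"]


def assemble(row, cols):
    """Hypothetical helper: collect the chosen columns into one list of floats."""
    return [float(row[c]) for c in cols]


# One feature vector per row, in the order given by input_cols.
features = [assemble(r, input_cols) for r in rows]
for f in features:
    print(f)
```

Note that the integer `session_count` becomes a float inside the vector, which matches the `58.0` shown in the `features` column.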



Now we want to use this DataFrame, with the three inputs combined into the single column `features`, in our clustering algorithm. We did this because the machine learning algorithms in Spark MLlib expect their input as a single vector column, and by default `KMeans` reads from a column named `features`. The algorithm we are going to use is **KMeans**.

In [29]:
# setK(3): number of clusters to find
# setSeed(1): seed for the random initialization, so results are repeatable
kmeans = KMeans().setK(3).setSeed(1)

Now `kmeans` is an estimator that is configured and ready to run the KMeans algorithm. To run it, we call `fit()`, which takes the input data, applies the algorithm, and returns a fitted model.
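
For intuition about what fitting a k-means model involves, here is a minimal plain-Python sketch of the classic iteration (Lloyd's algorithm): assign each point to its nearest center, then move each center to the mean of its points, and repeat until nothing changes. This is a simplified illustration and not Spark's implementation, which runs distributed and uses a different initialization. The helpers `nearest` and `simple_kmeans` and the toy points are made up for this example.

```python
import math
import random


def nearest(point, centers):
    """Index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def simple_kmeans(points, k, seed=1, max_iter=20):
    """Hypothetical helper: a minimal Lloyd's-algorithm sketch, not Spark's code."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k randomly chosen points
    for _ in range(max_iter):
        # Assignment step: group points by their nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        # Update step: move each center to the mean of its points
        # (an empty cluster keeps its previous center).
        new_centers = [
            [sum(dim) / len(c) for dim in zip(*c)] if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:  # converged: centers stopped moving
            break
        centers = new_centers
    return centers


# Made-up toy data: two well-separated groups in two dimensions.
toy_points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
toy_centers = sorted(simple_kmeans(toy_points, k=2))
print(toy_centers)
```

The `seed` argument plays the same role as `setSeed(1)` in the notebook: it fixes the random starting centers so that repeated runs give the same result.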

In [30]:
kmodel = kmeans.fit(vecCluster_df)

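The critical thing in a KMeans model is the cluster centers, or centroids. Let's look at what they are. Each centroid lists its values in the same order as the input columns: `cpu_utilization`, `free_memory`, `session_count`.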

In [31]:
kmodel.clusterCenters()

[array([ 0.61918113,  0.38080285, 68.75004716]),
 array([ 0.71174897,  0.28808911, 86.87510507]),
 array([ 0.51439668,  0.48445202, 50.49452021])]
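
To see how these centroids are used, here is a minimal plain-Python sketch that assigns a row to the cluster whose centroid is nearest in Euclidean distance. The centroid values are copied from the output above; the `predict` helper is hypothetical and only mimics the idea, not Spark's code.

```python
import math

# Centroids copied from the clusterCenters() output above, in the order
# (cpu_utilization, free_memory, session_count).
centers = [
    [0.61918113, 0.38080285, 68.75004716],
    [0.71174897, 0.28808911, 86.87510507],
    [0.51439668, 0.48445202, 50.49452021],
]


def predict(features, centers):
    """Hypothetical helper: index of the nearest centroid (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(features, centers[i]))


# First and third rows from df_util.show(5).
first = predict([0.77, 0.22, 58.0], centers)
third = predict([0.60, 0.19, 82.0], centers)
print(first, third)
```

In Spark itself, calling `kmodel.transform(vecCluster_df)` adds a `prediction` column with the assigned cluster for every row. Because `session_count` is in the tens while the other two features lie between 0 and 1, the distance here is dominated by `session_count`, which is why the three centroids differ mainly in that value.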

#### Well Done!