<a href="https://colab.research.google.com/github/muhammetsnts/SPARK/blob/main/2.ML_with_PySpark_MLlib/K_Means_Clustering/1.K_Means_Clustering_Documentation_Example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup Environment

In [1]:
# install Java 8 (headless JDK), silencing apt output
!apt-get -q install openjdk-8-jdk-headless -qq > /dev/null

# download Spark 3.1.1 built for Hadoop 2.7
!wget -q https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop2.7.tgz

# extract the archive
!tar xf spark-3.1.1-bin-hadoop2.7.tgz

# install findspark so Python can locate the Spark installation
!pip install -q findspark


import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop2.7"


import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

# Download and Read Dataset

In [5]:
!wget -q https://raw.githubusercontent.com/muhammetsnts/SPARK/main/data/sample_kmeans_data.txt

In [7]:
data = spark.read.format('libsvm').load("sample_kmeans_data.txt")

In [8]:
data.show()

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|           (3,[],[])|
|  1.0|(3,[0,1,2],[0.1,0...|
|  2.0|(3,[0,1,2],[0.2,0...|
|  3.0|(3,[0,1,2],[9.0,9...|
|  4.0|(3,[0,1,2],[9.1,9...|
|  5.0|(3,[0,1,2],[9.2,9...|
+-----+--------------------+
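
The `features` column is printed in Spark's sparse-vector notation `(size,[indices],[values])`, so `(3,[],[])` is a 3-dimensional vector whose entries are all zero. A minimal pure-Python sketch of how that notation can be read; the helper `to_dense` is hypothetical and not part of PySpark, and the second call uses made-up values:

```python
def to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) triple into a plain list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v  # only the listed positions are non-zero
    return dense

print(to_dense(3, [], []))              # the first row of the table above
print(to_dense(3, [0, 2], [1.5, 4.0]))  # made-up values, for illustration only
```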



# Modelling
We will create a KMeans model with k = 2 and set a seed so that repeated runs give the same results.

K-means is an unsupervised algorithm, so it needs only the `features` column; the `label` column is dropped before fitting.

In [9]:
final_data = data.select('features')

In [6]:
from pyspark.ml.clustering import KMeans

In [10]:
kmeans = KMeans().setK(2).setSeed(1)

In [11]:
model = kmeans.fit(final_data)

In [15]:
centers = model.clusterCenters()

In [16]:
centers

[array([9.1, 9.1, 9.1]), array([0.1, 0.1, 0.1])]

With k = 2 and three components in each feature vector, the model returns two cluster centers, each with three coordinates, as shown above.
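
In K-means, each center is the component-wise mean of the points assigned to it. The sketch below illustrates this in plain Python. It rests on an assumption: the printed table truncates the vectors, so the full coordinates used here are illustrative values chosen to be consistent with the centers above, and `centroid` is a hypothetical helper rather than a Spark function.

```python
# Assumed coordinates for the three small-valued rows (the table truncates them).
low_group = [[0.0, 0.0, 0.0], [0.1, 0.1, 0.1], [0.2, 0.2, 0.2]]

def centroid(points):
    """Component-wise mean of a list of equal-length points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]

center = centroid(low_group)
print([round(c, 6) for c in center])  # rounded to hide floating-point noise
```

Under that assumption the result agrees with the second center reported by `clusterCenters()`.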

In [17]:
final_data.show()

+--------------------+
|            features|
+--------------------+
|           (3,[],[])|
|(3,[0,1,2],[0.1,0...|
|(3,[0,1,2],[0.2,0...|
|(3,[0,1,2],[9.0,9...|
|(3,[0,1,2],[9.1,9...|
|(3,[0,1,2],[9.2,9...|
+--------------------+



Next, we will find which cluster each row belongs to.

In [19]:
results = model.transform(final_data) # clustering is unsupervised, so no train-test split is needed

In [21]:
results.show()

+--------------------+----------+
|            features|prediction|
+--------------------+----------+
|           (3,[],[])|         1|
|(3,[0,1,2],[0.1,0...|         1|
|(3,[0,1,2],[0.2,0...|         1|
|(3,[0,1,2],[9.0,9...|         0|
|(3,[0,1,2],[9.1,9...|         0|
|(3,[0,1,2],[9.2,9...|         0|
+--------------------+----------+
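
The `prediction` column holds the index of the closest cluster center; Spark's KMeans uses Euclidean distance by default. A rough pure-Python sketch of that rule follows. The helper names are hypothetical, and the coordinates of the fourth row are assumed because the table truncates them.

```python
centers = [[9.1, 9.1, 9.1], [0.1, 0.1, 0.1]]  # as reported by clusterCenters() for k = 2

def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest_center(point, centers):
    """Index of the center with the smallest squared distance to the point."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))

print(nearest_center([0.0, 0.0, 0.0], centers))  # the all-zero first row
print(nearest_center([9.0, 9.0, 9.0], centers))  # assumed coordinates for the fourth row
```

The two calls return 1 and 0, matching the predictions for those rows in the table above.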



That's it! Let's try a higher k value.

In [22]:
kmeans2 = KMeans().setK(3).setSeed(1)

In [24]:
model2 = kmeans2.fit(final_data)

In [25]:
centers2 = model2.clusterCenters()

In [26]:
centers2

[array([9.1, 9.1, 9.1]), array([0.05, 0.05, 0.05]), array([0.2, 0.2, 0.2])]

In [27]:
results2 = model2.transform(final_data)

In [28]:
results2.show()

+--------------------+----------+
|            features|prediction|
+--------------------+----------+
|           (3,[],[])|         1|
|(3,[0,1,2],[0.1,0...|         1|
|(3,[0,1,2],[0.2,0...|         2|
|(3,[0,1,2],[9.0,9...|         0|
|(3,[0,1,2],[9.1,9...|         0|
|(3,[0,1,2],[9.2,9...|         0|
+--------------------+----------+
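
One common way to compare the two fits is the within-cluster sum of squared distances: for every point, take the squared distance to its nearest center and add them up. The sketch below computes it in plain Python for both sets of centers. The point coordinates are assumed (the tables truncate them), and `wssse` is a hypothetical helper, not a Spark call.

```python
# Assumed points: each row repeats one value in all three coordinates.
points = [[v, v, v] for v in (0.0, 0.1, 0.2, 9.0, 9.1, 9.2)]
centers_k2 = [[9.1] * 3, [0.1] * 3]                 # centers reported for k = 2
centers_k3 = [[9.1] * 3, [0.05] * 3, [0.2] * 3]     # centers reported for k = 3

def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def wssse(points, centers):
    """Sum over all points of the squared distance to the nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

cost_k2 = wssse(points, centers_k2)
cost_k3 = wssse(points, centers_k3)
print(round(cost_k2, 4), round(cost_k3, 4))
```

This cost can only go down as k grows, so a lower value for k = 3 does not by itself mean three clusters fit better. Here the third cluster merely splits the low-valued group, which suggests k = 2 already captures the structure. For a measure that does not automatically improve with k, Spark provides a silhouette score through `ClusteringEvaluator` in `pyspark.ml.evaluation`.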

