<a href="https://colab.research.google.com/github/muhammetsnts/SPARK/blob/main/projects/7.Wheat_Clustering_with_K_Means.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project Info

We'll work with a real data set of wheat seed measurements from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/seeds.

The examined group comprised kernels belonging to three different varieties of wheat: Kama, Rosa and Canadian, 70 elements each, randomly selected for the experiment. High-quality visualization of the internal kernel structure was obtained using a soft X-ray technique, which is non-destructive and considerably cheaper than more sophisticated imaging techniques such as scanning microscopy or laser technology. The images were recorded on 13x18 cm X-ray KODAK plates. Studies were conducted using combine-harvested wheat grain originating from experimental fields, explored at the Institute of Agrophysics of the Polish Academy of Sciences in Lublin.

The data set can be used for the tasks of classification and cluster analysis.


Attribute Information:

To construct the data, seven geometric parameters of wheat kernels were measured:

1. area A,
2. perimeter P,
3. compactness $C = 4\pi A / P^2$,
4. length of kernel,
5. width of kernel,
6. asymmetry coefficient,
7. length of kernel groove.

All of these parameters are real-valued and continuous.
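
The compactness definition can be checked against the data directly. The sketch below recomputes C for the first row printed later in the notebook (area 15.26, perimeter 14.84). The function name `compactness` is ours for illustration; it is not part of the data set or of Spark.

```python
import math

def compactness(area: float, perimeter: float) -> float:
    """Compactness C = 4*pi*A / P^2; equals 1.0 for a perfect circle."""
    return 4.0 * math.pi * area / perimeter ** 2

# First row of the data set: area = 15.26, perimeter = 14.84
c = compactness(15.26, 14.84)
print(round(c, 3))  # 0.871, matching the compactness column for that row
```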

Let's see if we can cluster them into 3 groups with K-means!


# Setup Environment

In [1]:
# Install Java 8
!apt-get -q install openjdk-8-jdk-headless -qq > /dev/null

# Download Spark 3.1.1 (prebuilt for Hadoop 2.7)
!wget -q https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop2.7.tgz

# Extract the archive
!tar xf spark-3.1.1-bin-hadoop2.7.tgz

# Install findspark
!pip install -q findspark


import os

# Point the environment at the Java and Spark installations from the previous step
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop2.7"


import findspark
findspark.init()  # makes the pyspark package under SPARK_HOME importable
from pyspark.sql import SparkSession

# Start a local Spark session that uses all available cores
spark = SparkSession.builder.master("local[*]").getOrCreate()
#spark = SparkSession.builder.appName('ops').getOrCreate()

# Download and Read the Data

In [2]:
!wget -q https://raw.githubusercontent.com/muhammetsnts/SPARK/main/data/seeds_dataset.csv

In [3]:
data = spark.read.csv("seeds_dataset.csv", header=True, inferSchema=True)

In [4]:
data.printSchema()

root
 |-- area: double (nullable = true)
 |-- perimeter: double (nullable = true)
 |-- compactness: double (nullable = true)
 |-- length_of_kernel: double (nullable = true)
 |-- width_of_kernel: double (nullable = true)
 |-- asymmetry_coefficient: double (nullable = true)
 |-- length_of_groove: double (nullable = true)



In [5]:
data.show()

+-----+---------+-----------+------------------+------------------+---------------------+------------------+
| area|perimeter|compactness|  length_of_kernel|   width_of_kernel|asymmetry_coefficient|  length_of_groove|
+-----+---------+-----------+------------------+------------------+---------------------+------------------+
|15.26|    14.84|      0.871|             5.763|             3.312|                2.221|              5.22|
|14.88|    14.57|     0.8811| 5.553999999999999|             3.333|                1.018|             4.956|
|14.29|    14.09|      0.905|             5.291|3.3369999999999997|                2.699|             4.825|
|13.84|    13.94|     0.8955|             5.324|3.3789999999999996|                2.259|             4.805|
|16.14|    14.99|     0.9034|5.6579999999999995|             3.562|                1.355|             5.175|
|14.38|    14.21|     0.8951|             5.386|             3.312|   2.4619999999999997|             4.956|
|14.69|    14.49|  

In [6]:
# Import VectorAssembler and Vectors

from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [7]:
data.columns

['area',
 'perimeter',
 'compactness',
 'length_of_kernel',
 'width_of_kernel',
 'asymmetry_coefficient',
 'length_of_groove']

The data set contains only feature columns; there is no label column, so every column goes into the feature vector.

In [8]:
assembler = VectorAssembler(inputCols=data.columns, 
                            outputCol='features')

In [9]:
output = assembler.transform(data)

In [10]:
output.printSchema()

root
 |-- area: double (nullable = true)
 |-- perimeter: double (nullable = true)
 |-- compactness: double (nullable = true)
 |-- length_of_kernel: double (nullable = true)
 |-- width_of_kernel: double (nullable = true)
 |-- asymmetry_coefficient: double (nullable = true)
 |-- length_of_groove: double (nullable = true)
 |-- features: vector (nullable = true)



In [11]:
final_data = output.select(['features'])

# Modelling

In [12]:
from pyspark.ml.clustering import KMeans

In [13]:
kmeans = KMeans().setK(3).setSeed(42)

In [14]:
model = kmeans.fit(final_data)

In [19]:
centers = model.clusterCenters()

In [20]:
centers

[array([18.72180328, 16.29737705,  0.88508689,  6.20893443,  3.72267213,
         3.60359016,  6.06609836]),
 array([11.96441558, 13.27480519,  0.8522    ,  5.22928571,  2.87292208,
         4.75974026,  5.08851948]),
 array([14.64847222, 14.46041667,  0.87916667,  5.56377778,  3.27790278,
         2.64893056,  5.19231944])]

In [21]:
result = model.transform(final_data)

In [23]:
result.show()

+--------------------+----------+
|            features|prediction|
+--------------------+----------+
|[15.26,14.84,0.87...|         2|
|[14.88,14.57,0.88...|         2|
|[14.29,14.09,0.90...|         2|
|[13.84,13.94,0.89...|         2|
|[16.14,14.99,0.90...|         2|
|[14.38,14.21,0.89...|         2|
|[14.69,14.49,0.87...|         2|
|[14.11,14.1,0.891...|         2|
|[16.63,15.46,0.87...|         2|
|[16.44,15.25,0.88...|         2|
|[15.26,14.85,0.86...|         2|
|[14.03,14.16,0.87...|         2|
|[13.89,14.02,0.88...|         2|
|[13.78,14.06,0.87...|         2|
|[13.74,14.05,0.87...|         2|
|[14.59,14.28,0.89...|         2|
|[13.99,13.83,0.91...|         1|
|[15.69,14.75,0.90...|         2|
|[14.7,14.21,0.915...|         2|
|[12.72,13.57,0.86...|         1|
+--------------------+----------+
only showing top 20 rows
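
To see what `model.transform` does for each row, the sketch below assigns the first row to its nearest center by hand, assuming the default Euclidean distance measure. The centers are copied from the `clusterCenters()` output above and the row from `data.show()`. `nearest_center` is a hypothetical helper, not a Spark API.

```python
import math

# Centers copied from the clusterCenters() output above (list index = cluster ID).
centers = [
    [18.72180328, 16.29737705, 0.88508689, 6.20893443, 3.72267213, 3.60359016, 6.06609836],
    [11.96441558, 13.27480519, 0.8522, 5.22928571, 2.87292208, 4.75974026, 5.08851948],
    [14.64847222, 14.46041667, 0.87916667, 5.56377778, 3.27790278, 2.64893056, 5.19231944],
]

def nearest_center(point, centers):
    """Return the index of the center with the smallest Euclidean distance to point."""
    distances = [math.dist(point, c) for c in centers]
    return distances.index(min(distances))

first_row = [15.26, 14.84, 0.871, 5.763, 3.312, 2.221, 5.22]
print(nearest_center(first_row, centers))  # 2, the prediction shown for the first row
```

Most of the distance here comes from the area and perimeter columns, which have the largest numeric range.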



# Feature Scaling

Let's scale the data and then cluster again to compare the predictions.

In [24]:
output.show()

+-----+---------+-----------+------------------+------------------+---------------------+------------------+--------------------+
| area|perimeter|compactness|  length_of_kernel|   width_of_kernel|asymmetry_coefficient|  length_of_groove|            features|
+-----+---------+-----------+------------------+------------------+---------------------+------------------+--------------------+
|15.26|    14.84|      0.871|             5.763|             3.312|                2.221|              5.22|[15.26,14.84,0.87...|
|14.88|    14.57|     0.8811| 5.553999999999999|             3.333|                1.018|             4.956|[14.88,14.57,0.88...|
|14.29|    14.09|      0.905|             5.291|3.3369999999999997|                2.699|             4.825|[14.29,14.09,0.90...|
|13.84|    13.94|     0.8955|             5.324|3.3789999999999996|                2.259|             4.805|[13.84,13.94,0.89...|
|16.14|    14.99|     0.9034|5.6579999999999995|             3.562|                1.355| 

In [25]:
from pyspark.ml.feature import StandardScaler

In [26]:
scaler = StandardScaler(inputCol='features', outputCol='scaledFeatures')

In [27]:
scaler_model = scaler.fit(output)

In [28]:
scaled_output = scaler_model.transform(output)

In [29]:
scaled_output.show()

+-----+---------+-----------+------------------+------------------+---------------------+------------------+--------------------+--------------------+
| area|perimeter|compactness|  length_of_kernel|   width_of_kernel|asymmetry_coefficient|  length_of_groove|            features|      scaledFeatures|
+-----+---------+-----------+------------------+------------------+---------------------+------------------+--------------------+--------------------+
|15.26|    14.84|      0.871|             5.763|             3.312|                2.221|              5.22|[15.26,14.84,0.87...|[5.24452795332028...|
|14.88|    14.57|     0.8811| 5.553999999999999|             3.333|                1.018|             4.956|[14.88,14.57,0.88...|[5.11393027165175...|
|14.29|    14.09|      0.905|             5.291|3.3369999999999997|                2.699|             4.825|[14.29,14.09,0.90...|[4.91116018695588...|
|13.84|    13.94|     0.8955|             5.324|3.3789999999999996|                2.259|     
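
With its default settings (`withStd=True`, `withMean=False`), `StandardScaler` divides each feature by its sample standard deviation and does not subtract the mean, which is why an area of 15.26 becomes about 5.24 rather than a value near zero. A plain-Python sketch of that per-column operation, using made-up numbers rather than the seeds data:

```python
import statistics

def scale_by_std(values):
    """Divide each value by the column's sample standard deviation (no centering)."""
    std = statistics.stdev(values)  # corrected (n - 1) sample standard deviation
    return [v / std for v in values]

column = [2.0, 4.0, 6.0]       # hypothetical feature column
print(scale_by_std(column))    # [1.0, 2.0, 3.0]
```

After this step every feature has unit standard deviation, so no single column dominates the Euclidean distance.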

In [30]:
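# NOTE: featuresCol='features' points at the unscaled vectors, so this model
# ignores 'scaledFeatures'; pass featuresCol='scaledFeatures' to cluster the scaled data.
# No seed is set here, so initialization is not tied to the first model's seed of 42.
kmeans2 = KMeans(featuresCol='features', k=3)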

In [31]:
model_scaled = kmeans2.fit(scaled_output)

In [32]:
centers_scaled = model_scaled.clusterCenters()
centers_scaled

[array([14.81910448, 14.53716418,  0.88052239,  5.59101493,  3.29935821,
         2.70658209,  5.21753731]),
 array([18.72180328, 16.29737705,  0.88508689,  6.20893443,  3.72267213,
         3.60359016,  6.06609836]),
 array([11.98865854, 13.28439024,  0.85273659,  5.22742683,  2.88008537,
         4.58392683,  5.0742439 ])]

In [34]:
result_scaled = model_scaled.transform(scaled_output)
result_scaled.show()

+-----+---------+-----------+------------------+------------------+---------------------+------------------+--------------------+--------------------+----------+
| area|perimeter|compactness|  length_of_kernel|   width_of_kernel|asymmetry_coefficient|  length_of_groove|            features|      scaledFeatures|prediction|
+-----+---------+-----------+------------------+------------------+---------------------+------------------+--------------------+--------------------+----------+
|15.26|    14.84|      0.871|             5.763|             3.312|                2.221|              5.22|[15.26,14.84,0.87...|[5.24452795332028...|         0|
|14.88|    14.57|     0.8811| 5.553999999999999|             3.333|                1.018|             4.956|[14.88,14.57,0.88...|[5.11393027165175...|         0|
|14.29|    14.09|      0.905|             5.291|3.3369999999999997|                2.699|             4.825|[14.29,14.09,0.90...|[4.91116018695588...|         0|
|13.84|    13.94|     0.8955

In [35]:
result.show()

+--------------------+----------+
|            features|prediction|
+--------------------+----------+
|[15.26,14.84,0.87...|         2|
|[14.88,14.57,0.88...|         2|
|[14.29,14.09,0.90...|         2|
|[13.84,13.94,0.89...|         2|
|[16.14,14.99,0.90...|         2|
|[14.38,14.21,0.89...|         2|
|[14.69,14.49,0.87...|         2|
|[14.11,14.1,0.891...|         2|
|[16.63,15.46,0.87...|         2|
|[16.44,15.25,0.88...|         2|
|[15.26,14.85,0.86...|         2|
|[14.03,14.16,0.87...|         2|
|[13.89,14.02,0.88...|         2|
|[13.78,14.06,0.87...|         2|
|[13.74,14.05,0.87...|         2|
|[14.59,14.28,0.89...|         2|
|[13.99,13.83,0.91...|         1|
|[15.69,14.75,0.90...|         2|
|[14.7,14.21,0.915...|         2|
|[12.72,13.57,0.86...|         1|
+--------------------+----------+
only showing top 20 rows



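The labels and centers differ from the first run, but scaling is not the cause: `kmeans2` was fit on the `features` column, so the centers above are still in the original units, and it did not reuse the seed of 42. Cluster IDs are also arbitrary, so label 2 in the first model and label 0 here can describe the same group. To measure the effect of scaling, fit with `featuresCol='scaledFeatures'` and the same seed, then compare the groupings.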
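
One way to compare two runs without relying on the raw cluster IDs is to check whether they group the rows identically up to a renaming of the labels. The sketch below does this in plain Python. `same_partition` and the example label lists are illustrative assumptions, not output from this notebook.

```python
def same_partition(labels_a, labels_b):
    """True if the two labelings group the items identically, ignoring label names."""
    if len(labels_a) != len(labels_b):
        return False
    a_to_b, b_to_a = {}, {}
    for a, b in zip(labels_a, labels_b):
        # Each label in one run must map to exactly one label in the other run.
        if a_to_b.setdefault(a, b) != b or b_to_a.setdefault(b, a) != a:
            return False
    return True

run_1 = [2, 2, 1, 0, 1]   # hypothetical predictions from one model
run_2 = [0, 0, 2, 1, 2]   # same groups, different IDs
run_3 = [0, 1, 2, 1, 2]   # the first two items are now split apart

print(same_partition(run_1, run_2))  # True
print(same_partition(run_1, run_3))  # False
```

In the notebook, the two label lists could be collected from the `prediction` columns of `result` and `result_scaled`.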