# Classification with Clustering 

We'll be working with a real data set about seeds from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/seeds.

The dataset contains measurements of seeds, specifically wheat kernels.

It covers three varieties of wheat: Kama, Rosa and Canadian, with 70 kernels of each variety selected at random for the experiment.

High-quality visualization of the internal kernel structure was obtained using a soft X-ray technique. The images were recorded on 13x18 cm X-ray KODAK plates.

The data set can be used for the tasks of classification and cluster analysis.

Attribute Information:

To construct the data, seven geometric parameters of wheat kernels were measured: 

1. area A, 
2. perimeter P, 
3. compactness C = 4*pi*A/P^2, 
4. length of kernel, 
5. width of kernel, 
6. asymmetry coefficient 
7. length of kernel groove. 

All of these parameters are real-valued and continuous.
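To get a feel for the compactness attribute, here is a small sketch that evaluates C = 4*pi*A/P^2 for two simple shapes. The function name `compactness` and the shapes are illustrative assumptions and are not taken from the dataset. A circle gives exactly 1, and any other shape gives a smaller value.

```python
import math

def compactness(area: float, perimeter: float) -> float:
    """Return C = 4*pi*A / P^2, the compactness defined in the attribute list."""
    return 4 * math.pi * area / perimeter ** 2

# A circle of radius r has A = pi*r^2 and P = 2*pi*r, so C is 1.
r = 2.0
circle_c = compactness(math.pi * r ** 2, 2 * math.pi * r)

# A unit square has A = 1 and P = 4, so C = pi/4.
square_c = compactness(1.0, 4.0)

print(circle_c, square_c)
```

A kernel with compactness close to 1 is therefore close to circular in outline, while elongated kernels score lower.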

Let's see if we can cluster them into 3 groups with K-means!

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('cluster').getOrCreate()

In [None]:
from pyspark.ml.clustering import KMeans

# Loads data.
dataset = spark.read.csv("seeds_dataset.csv",header=True,inferSchema=True)

- Note that the data does not have labels.
- So this is unsupervised machine learning.

In [None]:
dataset.head()

In [None]:
dataset.describe().show()

## Format the Data

It's quite common to reformat the original data before training so that the model can use it. Spark ML estimators expect all input features to be packed into a single vector column.

In [None]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [None]:
dataset.columns

In [None]:
vec_assembler = VectorAssembler(inputCols = dataset.columns, outputCol='features')

help(VectorAssembler)

In [None]:
final_data = vec_assembler.transform(dataset)

In [None]:
# Inspect the DataFrame's structure and columns

final_data

## Scale the Data
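It is a good idea to scale our data so that features with large numeric ranges do not dominate the Euclidean distances K-means relies on. Distances become harder to interpret as the number of features grows; see the curse of dimensionality: https://en.wikipedia.org/wiki/Curse_of_dimensionality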

In [None]:
from pyspark.ml.feature import StandardScaler

help(StandardScaler)

In [None]:
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures", withStd=True, withMean=False)

In [None]:
# Compute summary statistics by fitting the StandardScaler
scalerModel = scaler.fit(final_data)

In [None]:
# Normalize each feature to have unit standard deviation.
final_data = scalerModel.transform(final_data)
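As a rough picture of what `withStd=True, withMean=False` does, the sketch below divides each column by its sample standard deviation without subtracting the mean. It uses plain Python rather than Spark, and the rows are made-up numbers, not values from the seeds data.

```python
import statistics

# Hypothetical rows of (area, perimeter); the values are made up for illustration.
rows = [
    [15.0, 14.0],
    [17.0, 15.0],
    [19.0, 16.0],
]

# Sample standard deviation (n - 1 denominator) of each column.
columns = list(zip(*rows))
stds = [statistics.stdev(col) for col in columns]

# Divide by the standard deviation without subtracting the mean,
# mirroring withStd=True, withMean=False.
scaled = [[value / std for value, std in zip(row, stds)] for row in rows]

# After scaling, every column has unit standard deviation.
scaled_stds = [statistics.stdev(col) for col in zip(*scaled)]
print(scaled)
print(scaled_stds)
```

Because the mean is not subtracted, the scaled values are not centered on zero; only their spread is made comparable across features.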

In [None]:
final_data.head()


## Train the Model and Evaluate

In [None]:
# Train a k-means model with k=3 clusters on the scaled features.
kmeans = KMeans(featuresCol='scaledFeatures', k=3)
model = kmeans.fit(final_data)
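For intuition about what fitting does, here is a minimal sketch of the classic k-means loop (Lloyd's algorithm) in plain Python. It is not Spark's implementation, which differs in initialization and runs distributed. The helper names `nearest` and `lloyd`, the toy points, and the fixed starting centers are all assumptions made for illustration.

```python
import math

def nearest(point, centers):
    """Index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

def lloyd(points, centers, iterations=10):
    """Alternate assignment and centroid-update steps a fixed number of times."""
    for _ in range(iterations):
        # Assignment step: group points by their nearest center.
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        centers = [
            [sum(dim) / len(g) for dim in zip(*g)] if g else c
            for g, c in zip(groups, centers)
        ]
    return centers

# Made-up 2-D points forming three obvious groups.
points = [[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]]
initial = [[0, 0], [5, 5], [10, 0]]  # fixed start so the run is deterministic
centers = lloyd(points, initial)
labels = [nearest(p, centers) for p in points]
print(centers, labels)
```

The cluster indices are arbitrary: cluster 0 has no built-in link to any particular wheat variety.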

In [None]:
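# Evaluate clustering by computing the Within Set Sum of Squared Errors (WSSSE).
# Note: computeCost was deprecated in Spark 2.4 and removed in Spark 3.0;
# on newer versions, model.summary.trainingCost gives the cost on the training data.
wssse = model.computeCost(final_data)
print("Within Set Sum of Squared Errors = " + str(wssse))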
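To make the metric concrete, this sketch computes a within-set sum of squared errors by hand: for each point, take the squared distance to its nearest center, then add everything up. The function name `wssse` and the toy points and centers are assumptions for illustration, not output from the model above.

```python
import math

def wssse(points, centers):
    """Sum over all points of the squared distance to the nearest center."""
    return sum(min(math.dist(p, c) ** 2 for c in centers) for p in points)

# Made-up points: two pairs, each pair sitting 1 unit above and below its center.
points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
centers = [[0.0, 1.0], [10.0, 1.0]]

cost = wssse(points, centers)
print(cost)

# A single center in the middle fits the same points much worse.
single_center_cost = wssse(points, [[5.0, 1.0]])
print(single_center_cost)
```

Lower values mean tighter clusters, but the number usually keeps falling as k increases, so it is more useful for comparing runs than as an absolute score.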

In [None]:
# Show the result: the three clusters produced.
# Print the centroid (center) of each cluster, one value per scaled feature.

centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)

In [None]:
# We want the predicted cluster of each seed, i.e., the group it was assigned to.

model.transform(final_data)

In [None]:
model.transform(final_data).select('prediction').show()

Now you are ready for your consulting project!
# Great Job!