Welcome to exercise two of week three of “Apache Spark for Scalable Machine Learning on BigData”. In this exercise we’ll work on clustering.

Let’s create our DataFrame again:


In [1]:
from pyspark import SparkConf
from pyspark import SparkContext

conf = SparkConf()

sc = SparkContext(conf=conf)
from pyspark.sql import Row
from pyspark.sql import SparkSession

spark = SparkSession(sc)

In [3]:
# delete files from previous runs
#!rm -f hmp.parquet*

# download the file containing the data in PARQUET format
#!wget https://github.com/IBM/coursera/raw/master/hmp.parquet
    
# create a dataframe out of it
df = spark.read.parquet('hmp.parquet')

# register a corresponding query table
df.createOrReplaceTempView('df')

Let’s reuse our feature engineering pipeline.

In [4]:
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler, Normalizer
from pyspark.ml.linalg import Vectors
from pyspark.ml import Pipeline

indexer = StringIndexer(inputCol="class", outputCol="classIndex")
encoder = OneHotEncoder(inputCol="classIndex", outputCol="categoryVec")
vectorAssembler = VectorAssembler(inputCols=["x","y","z"],
                                  outputCol="features")
normalizer = Normalizer(inputCol="features", outputCol="features_norm", p=1.0)

pipeline = Pipeline(stages=[indexer, encoder, vectorAssembler, normalizer])
model = pipeline.fit(df)
prediction = model.transform(df)
prediction.show()

+---+---+---+--------------------+-----------+----------+--------------+----------------+--------------------+
|  x|  y|  z|              source|      class|classIndex|   categoryVec|        features|       features_norm|
+---+---+---+--------------------+-----------+----------+--------------+----------------+--------------------+
| 22| 49| 35|Accelerometer-201...|Brush_teeth|       6.0|(13,[6],[1.0])|[22.0,49.0,35.0]|[0.20754716981132...|
| 22| 49| 35|Accelerometer-201...|Brush_teeth|       6.0|(13,[6],[1.0])|[22.0,49.0,35.0]|[0.20754716981132...|
| 22| 52| 35|Accelerometer-201...|Brush_teeth|       6.0|(13,[6],[1.0])|[22.0,52.0,35.0]|[0.20183486238532...|
| 22| 52| 35|Accelerometer-201...|Brush_teeth|       6.0|(13,[6],[1.0])|[22.0,52.0,35.0]|[0.20183486238532...|
| 21| 52| 34|Accelerometer-201...|Brush_teeth|       6.0|(13,[6],[1.0])|[21.0,52.0,34.0]|[0.19626168224299...|
| 22| 51| 34|Accelerometer-201...|Brush_teeth|       6.0|(13,[6],[1.0])|[22.0,51.0,34.0]|[0.20560747663551...|
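
The `features_norm` values shown above come from `Normalizer(..., p=1.0)`, which divides each vector by its L1 norm (the sum of the absolute values of its components). As a rough check, the sketch below reproduces the first row in plain Python. The helper `l1_normalize` is a hypothetical name used only for illustration; it is not part of the Spark API.

```python
def l1_normalize(vector):
    """Divide each component by the L1 norm (sum of absolute values)."""
    norm = sum(abs(v) for v in vector)
    if norm == 0:
        # this sketch simply leaves an all-zero vector unchanged
        return list(vector)
    return [v / norm for v in vector]


# first row of the `features` column in the output above
row = [22.0, 49.0, 35.0]
normalized = l1_normalize(row)
print(normalized)
```

The first component is 22 / (22 + 49 + 35) = 22 / 106, which agrees with the 0.20754716981132... printed in the `features_norm` column, and the components of the normalized vector sum to 1.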

Now let’s create a new pipeline for KMeans.

In [13]:
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

kmeans = KMeans(featuresCol="features").setK(14).setSeed(1)
pipeline = Pipeline(stages=[indexer, encoder, vectorAssembler, normalizer ,kmeans])
model = pipeline.fit(df)
predictions = model.transform(df)

evaluator = ClusteringEvaluator()

silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))

Silhouette with squared euclidean distance = 0.2668998965895519


We have 14 different movement patterns in the dataset, so setting K of KMeans to 14 is a reasonable starting point. Please experiment with different values for K: can you find a sweet spot? The silhouette score ranges from -1 to 1, and the closer it gets to 1, the better.

https://en.wikipedia.org/wiki/Silhouette_(clustering)
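
To make the score more concrete, here is a minimal brute-force sketch of the silhouette computation with squared Euclidean distance. It assumes a tiny in-memory dataset. The function names and the toy points are made up for illustration, and this is not how `ClusteringEvaluator` computes the score internally on a distributed DataFrame.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def silhouette_score(points, labels):
    """Mean silhouette over all points; needs at least two clusters."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-member clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(squared_distance(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(squared_distance(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# hypothetical toy data: two tight, well-separated groups
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette_score(points, [0, 0, 1, 1])  # labels match the groups
bad = silhouette_score(points, [0, 1, 0, 1])   # labels cut across the groups
print(good, bad)
```

A labeling that matches the two groups scores close to 1, while a labeling that mixes them scores below 0.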


In [12]:
for k in range(2,15):
    kmeans = KMeans(featuresCol="features").setK(k).setSeed(1)
    pipeline = Pipeline(stages=[indexer, encoder, vectorAssembler, normalizer ,kmeans])
    model = pipeline.fit(df)
    predictions = model.transform(df)

    evaluator = ClusteringEvaluator()

    silhouette = evaluator.evaluate(predictions)
    print("K = {} and Silhouette with squared euclidean distance = {}".format(k,silhouette))

K = 2 and Silhouette with squared euclidean distance = 0.6875664014387497
K = 3 and Silhouette with squared euclidean distance = 0.6147915951361759
K = 4 and Silhouette with squared euclidean distance = 0.6333227654128869
K = 5 and Silhouette with squared euclidean distance = 0.5937447997439024
K = 6 and Silhouette with squared euclidean distance = 0.592463658820136
K = 7 and Silhouette with squared euclidean distance = 0.5484627422401509
K = 8 and Silhouette with squared euclidean distance = 0.46686489256383346
K = 9 and Silhouette with squared euclidean distance = 0.48034893889849645
K = 10 and Silhouette with squared euclidean distance = 0.47370428136987536
K = 11 and Silhouette with squared euclidean distance = 0.4819049717562352
K = 12 and Silhouette with squared euclidean distance = 0.40964155503229643
K = 13 and Silhouette with squared euclidean distance = 0.4153293521373778
K = 14 and Silhouette with squared euclidean distance = 0.41244594513295846
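
One simple way to read off the sweet spot is to pick the K with the highest score. The sketch below copies the scores printed above into a plain dictionary (the variable names are only illustrative) and ranks them.

```python
# silhouette scores copied from the output of the loop above
silhouette_by_k = {
    2: 0.6875664014387497,
    3: 0.6147915951361759,
    4: 0.6333227654128869,
    5: 0.5937447997439024,
    6: 0.592463658820136,
    7: 0.5484627422401509,
    8: 0.46686489256383346,
    9: 0.48034893889849645,
    10: 0.47370428136987536,
    11: 0.4819049717562352,
    12: 0.40964155503229643,
    13: 0.4153293521373778,
    14: 0.41244594513295846,
}

# K with the highest silhouette score
best_k = max(silhouette_by_k, key=silhouette_by_k.get)

# the three best values of K, best first
top_three = sorted(silhouette_by_k, key=silhouette_by_k.get, reverse=True)[:3]

print("best K:", best_k)
print("top three:", top_three)
```

On these numbers the highest score is at K = 2, followed by K = 4 and K = 3. Whether such a small K is useful for telling 14 movement patterns apart is a separate question from the silhouette score.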


Now please extend the pipeline to work on the normalized features. Tell KMeans to use the normalized feature column, and change the pipeline so that it also contains the normalizer stage.

In [22]:
kmeans = KMeans(featuresCol='features_norm').setK(14).setSeed(1)
pipeline = Pipeline(stages=[indexer, encoder, vectorAssembler, normalizer ,kmeans])
model = pipeline.fit(df)

predictions = model.transform(df)

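# ClusteringEvaluator reads the "features" column by default, so this score
# is computed on the raw vectors, not on features_norm
evaluator = ClusteringEvaluator()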

silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))
df_pan = predictions.toPandas()

Silhouette with squared euclidean distance = 0.2668998965895519


Sometimes inflating the dataset helps. Here we multiply x by 10; let’s see if the performance increases.

In [23]:
df_pan

Unnamed: 0,x,y,z,source,class,classIndex,categoryVec,features,features_norm,prediction
0,22,49,35,Accelerometer-2011-04-11-13-28-18-brush_teeth-...,Brush_teeth,6.0,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ...","[22.0, 49.0, 35.0]","[0.20754716981132076, 0.46226415094339623, 0.3...",12
1,22,49,35,Accelerometer-2011-04-11-13-28-18-brush_teeth-...,Brush_teeth,6.0,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ...","[22.0, 49.0, 35.0]","[0.20754716981132076, 0.46226415094339623, 0.3...",12
2,22,52,35,Accelerometer-2011-04-11-13-28-18-brush_teeth-...,Brush_teeth,6.0,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ...","[22.0, 52.0, 35.0]","[0.2018348623853211, 0.47706422018348627, 0.32...",12
3,22,52,35,Accelerometer-2011-04-11-13-28-18-brush_teeth-...,Brush_teeth,6.0,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ...","[22.0, 52.0, 35.0]","[0.2018348623853211, 0.47706422018348627, 0.32...",12
4,21,52,34,Accelerometer-2011-04-11-13-28-18-brush_teeth-...,Brush_teeth,6.0,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ...","[21.0, 52.0, 34.0]","[0.19626168224299065, 0.48598130841121495, 0.3...",10
...,...,...,...,...,...,...,...,...,...,...
446524,41,35,51,Accelerometer-2012-06-11-11-39-29-walk-m1.txt,Walk,0.0,"(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[41.0, 35.0, 51.0]","[0.3228346456692913, 0.2755905511811024, 0.401...",2
446525,40,35,52,Accelerometer-2012-06-11-11-39-29-walk-m1.txt,Walk,0.0,"(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[40.0, 35.0, 52.0]","[0.31496062992125984, 0.2755905511811024, 0.40...",2
446526,39,37,51,Accelerometer-2012-06-11-11-39-29-walk-m1.txt,Walk,0.0,"(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[39.0, 37.0, 51.0]","[0.30708661417322836, 0.29133858267716534, 0.4...",8
446527,39,37,53,Accelerometer-2012-06-11-11-39-29-walk-m1.txt,Walk,0.0,"(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[39.0, 37.0, 53.0]","[0.3023255813953488, 0.2868217054263566, 0.410...",8


In [None]:
from pyspark.sql.functions import col
# replace column x with x * 10, leaving all other columns unchanged
df_denormalized = df.withColumn('x', col('x') * 10)
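
Why would scaling a single column matter? KMeans assigns points by squared Euclidean distance, so multiplying x by 10 makes differences along x count 100 times as much. The sketch below illustrates this with three made-up points (not taken from the dataset): which neighbour is closer flips once x is scaled.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def scale_x(point, factor=10):
    """Multiply only the first (x) component by `factor`."""
    x, y, z = point
    return (x * factor, y, z)


# hypothetical accelerometer-like readings
p = (22, 49, 35)
near_in_x = (24, 49, 35)  # differs from p by 2 along x
near_in_y = (22, 55, 35)  # differs from p by 6 along y

before = (squared_distance(p, near_in_x), squared_distance(p, near_in_y))
after = (
    squared_distance(scale_x(p), scale_x(near_in_x)),
    squared_distance(scale_x(p), scale_x(near_in_y)),
)
print("before scaling:", before)  # (4, 36): the x-neighbour is closer
print("after scaling:", after)    # (400, 36): the y-neighbour is closer
```

Because the relative distances change, the cluster assignments and the silhouette score can change as well, in either direction.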

In [15]:
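# requires df_denormalized from the previous cell, so run that cell first
# this pipeline has no normalizer stage, so cluster on the raw "features" column
kmeans = KMeans(featuresCol="features").setK(14).setSeed(1)
pipeline = Pipeline(stages=[vectorAssembler, kmeans])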
model = pipeline.fit(df_denormalized)
predictions = model.transform(df_denormalized)

evaluator = ClusteringEvaluator()

silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))

NameError: name 'df_denormalized' is not defined

Apache SparkML can be used to try many different algorithms and parametrizations using the same pipeline. Please change the code below to use GaussianMixture instead of KMeans. Please use the following link for your reference.

https://spark.apache.org/docs/latest/ml-clustering.html#gaussian-mixture-model-gmm
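
The main conceptual difference from KMeans is that a Gaussian mixture assigns each point a probability for every component instead of a single hard label. The following is a minimal one-dimensional sketch of that soft assignment, assuming two components with made-up weights, means and variances. The function names are hypothetical and the code does not use Spark.

```python
import math


def gaussian_pdf(x, mean, variance):
    """Density of a one-dimensional normal distribution at x."""
    exponent = -((x - mean) ** 2) / (2 * variance)
    return math.exp(exponent) / math.sqrt(2 * math.pi * variance)


def responsibilities(x, weights, means, variances):
    """Posterior probability of each mixture component given x."""
    weighted = [
        w * gaussian_pdf(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    total = sum(weighted)
    return [value / total for value in weighted]


# hypothetical mixture: two equally weighted components
weights = [0.5, 0.5]
means = [0.0, 10.0]
variances = [4.0, 4.0]

close_to_first = responsibilities(1.0, weights, means, variances)
in_between = responsibilities(5.0, weights, means, variances)
print(close_to_first)  # almost all probability on the first component
print(in_between)      # evenly split between the two components
```

In Spark, the fitted Gaussian mixture model writes these per-component probabilities to a `probability` column next to `prediction`.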


In [None]:
from pyspark.ml.clustering import GaussianMixture

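# one possible completion of the exercise placeholder, mirroring the
# KMeans settings used above (k=14, seed=1)
gmm = GaussianMixture(featuresCol="features").setK(14).setSeed(1)
pipeline = Pipeline(stages=[vectorAssembler, gmm])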


model = pipeline.fit(df)

predictions = model.transform(df)

evaluator = ClusteringEvaluator()

silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))

In [24]:
GaussianMixture?

Object `GaussianMixture` not found.
