# K-Means Quiz
Use this Jupyter notebook to find the answers to the quiz in the previous section. There is an answer key in the next part of the lesson.

Instead of using the raw word count of the Title+Body length feature we built before, we might want to look at its distribution and create categories based on this length: short, longer, ..., super long.

In the questions below, I'll refer to the length of the combined Title and Body fields as Description length, where length means the number of words when the text is tokenized with pattern="\W".
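To make the definition of Description length concrete, here is a minimal sketch in plain Python that approximates what the tokenizer does. It assumes `RegexTokenizer`'s defaults (split on the pattern, lowercase, drop empty tokens) and uses `re.ASCII` so that Python's `\W` behaves like Java's default ASCII-only class. The function name `description_length` is hypothetical and not part of the notebook.

```python
import re

def description_length(title, body):
    """Approximate the notebook's DescLength for one question (hypothetical helper)."""
    # Same concatenation as the Desc column: Title, a space, then Body
    desc = f"{title} {body}".lower()
    # Split on every non-word character and drop the empty strings left between separators
    tokens = [t for t in re.split(r"\W", desc, flags=re.ASCII) if t]
    return len(tokens)

print(description_length("How do I parse JSON?", "<p>Use json.loads()</p>"))  # 10
```

Note that HTML tag names such as `p` count as words under this definition, because only non-word characters act as separators.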

In [73]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import RegexTokenizer, CountVectorizer, IDF, StringIndexer
from pyspark.ml.feature import VectorAssembler, Normalizer, StandardScaler, MinMaxScaler
from pyspark.sql.functions import concat, lit, udf, sum as Fsum, pow as Fpow, col, sqrt as Fsqrt, max as Fmax, min as Fmin
from pyspark.sql.functions import stddev as Fstddev, mean, avg, count
from pyspark.sql.types import IntegerType, FloatType
import numpy as np
import re
from pyspark.sql import functions as F
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator



In [19]:
spark = SparkSession.builder \
    .master("local") \
    .appName("Creating Features") \
    .getOrCreate()

### Read Dataset

In [20]:
stack_overflow_data = 'Train_onetag_small.json'

In [21]:
df = spark.read.json(stack_overflow_data)
df.persist()

DataFrame[Body: string, Id: bigint, Tags: string, Title: string, oneTag: string]

### Build Description Length Features

In [22]:
df = df.withColumn("Desc", concat(col("Title"), lit(' '), col("Body")))

In [23]:
regexTokenizer = RegexTokenizer(inputCol="Desc", outputCol="words", pattern="\\W")
df = regexTokenizer.transform(df)
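from pyspark.sql.types import ArrayType, StringType

# Count the elements of an array column (tokens or tags)
body_length = udf(lambda x: len(x), IntegerType())
# split() returns a list of strings, so the UDF must declare an array return type
split_text = udf(lambda x: x.split(" "), ArrayType(StringType()))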
df = df.withColumn("DescLength", body_length(df.words))
df = df.withColumn("TagLength", body_length(split_text(df.Tags)))

In [24]:
assembler = VectorAssembler(inputCols=["DescLength"], outputCol="DescVec")
df = assembler.transform(df)

# Question 1
How many times longer is the Description of the longest question than that of the shortest one?

Tip: Don't forget to import Spark SQL's aggregate functions that can operate on DataFrame columns.

In [26]:
df.select("DescLength").show()

+----------+
|DescLength|
+----------+
|        96|
|        83|
|      3168|
|       124|
|       154|
|        75|
|       121|
|       170|
|       107|
|        74|
|       145|
|       148|
|        24|
|        49|
|        48|
|       389|
|       380|
|       216|
|       123|
|       404|
+----------+
only showing top 20 rows



In [29]:
df.groupBy().agg(Fmax("DescLength")).collect()

[Row(max(DescLength)=7532)]

In [32]:
# Equivalent shorthand without agg()
df.groupBy().max("DescLength").collect()

[Row(max(DescLength)=7532)]

In [30]:
df.groupBy().agg(Fmin("DescLength")).collect() # 7532 / 10 = 753.2, so roughly 753 times longer

[Row(min(DescLength)=10)]
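
The answer to Question 1 is the ratio of the maximum to the minimum. A minimal sketch of that arithmetic in plain Python, using the two values reported by the aggregations above (the helper name `length_ratio` is hypothetical):

```python
def length_ratio(lengths):
    """Return how many times longer the longest item is than the shortest."""
    return max(lengths) / min(lengths)

# Maximum and minimum DescLength reported above
print(length_ratio([10, 7532]))  # 753.2
```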

# Question 2
What are the mean and standard deviation of the Description length?

In [37]:
print('mean', df.groupBy().mean("DescLength").collect())
print('stddev', df.groupBy().agg(Fstddev("DescLength")).collect())

mean [Row(avg(DescLength)=180.28187)]
stddev [Row(stddev_samp(DescLength)=192.10819533505023)]


In [36]:
df.describe("DescLength").collect()

[Row(summary='count', DescLength='100000'),
 Row(summary='mean', DescLength='180.28187'),
 Row(summary='stddev', DescLength='192.10819533505023'),
 Row(summary='min', DescLength='10'),
 Row(summary='max', DescLength='7532')]
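
The column name in the output above shows that Spark's `stddev` is the sample standard deviation (`stddev_samp`), which divides by n - 1 rather than n. As a small sketch of the difference, the following plain-Python snippet computes it by hand on the 20 rows printed earlier. It covers only those 20 rows, not the full 100,000, so its numbers will not match the answer to Question 2.

```python
import statistics

# First 20 DescLength values printed by df.select("DescLength").show() above
sample = [96, 83, 3168, 124, 154, 75, 121, 170, 107, 74,
          145, 148, 24, 49, 48, 389, 380, 216, 123, 404]

def sample_stddev(values):
    """Sample standard deviation: squared deviations divided by n - 1."""
    m = sum(values) / len(values)
    return (sum((v - m) ** 2 for v in values) / (len(values) - 1)) ** 0.5

print("mean", statistics.mean(sample))
print("sample stddev", sample_stddev(sample))          # same quantity as statistics.stdev
print("population stddev", statistics.pstdev(sample))  # divides by n, so slightly smaller
```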

# Question 3
Let's use K-means to create 5 clusters of Description lengths. Set the random seed to 42 and fit a 5-cluster K-means model on the Description length column (you can use KMeans().setParams(...)). What length is the center of the cluster representing the longest questions?
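
For intuition about what the model does with a single numeric feature, here is a minimal one-dimensional K-means sketch in plain Python. It is not Spark's implementation: the initialization (evenly spaced points from the sorted data) is a hypothetical, deterministic choice made for illustration, so it will not reproduce the centers Spark finds with seed 42.

```python
def kmeans_1d(values, k, iterations=20):
    """Lloyd's algorithm on scalars; returns the list of k cluster centers."""
    data = sorted(values)
    # Hypothetical deterministic init: k evenly spaced points from the sorted data
    centers = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: each value joins the cluster with the nearest center
        groups = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda j: (v - centers[j]) ** 2)
            groups[nearest].append(v)
        # Update step: each center moves to the mean of its group (kept if the group is empty)
        new_centers = [sum(g) / len(g) if g else centers[j] for j, g in enumerate(groups)]
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers

print(kmeans_1d([1, 2, 3, 101, 102, 103], k=2))
```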

In [40]:
dataset = df.select(col("DescVec").alias("features"))

In [57]:
kmeans = KMeans().setParams(featuresCol="DescVec", predictionCol="DescGroup", k=5, seed=42)
#kmeans = KMeans().setK(5).setSeed(42)
model = kmeans.fit(df)

In [58]:
#predictions = model.transform(dataset) # dataset was df["DescVec"]
df = model.transform(df)

In [59]:
evaluator = ClusteringEvaluator()

In [44]:
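# `predictions` is never defined above, so point a fresh evaluator at the df columns instead
silhouette = ClusteringEvaluator(featuresCol="DescVec", predictionCol="DescGroup").evaluate(df)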
print("Silhouette with squared euclidean distance = " + str(silhouette))

Silhouette with squared euclidean distance = 0.7631218189100486


In [80]:
silhouette = evaluator.evaluate(df.select(col("DescVec").alias("features"), col("DescGroup").alias("prediction"))) 
print("Silhouette with squared euclidean distance = " + str(silhouette))

Silhouette with squared euclidean distance = 0.7631218189100486
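
To show what the silhouette score measures, here is a naive sketch in plain Python using squared Euclidean distance on scalars. It follows the standard textbook definition and compares every pair of points, so it is only meant for tiny inputs; it is not how Spark computes the metric internally. The function name `silhouette_1d` is hypothetical.

```python
def silhouette_1d(points, labels):
    """Mean silhouette with squared Euclidean distance; needs at least two clusters."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean squared distance to the other members of the point's own cluster
        a = sum((p - q) ** 2 for q in own) / (len(own) - 1)
        # b: smallest mean squared distance to the members of any other cluster
        b = min(sum((p - q) ** 2 for q in other) / len(other)
                for key, other in clusters.items() if key != label)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

print(silhouette_1d([0, 1, 10, 11], [0, 0, 1, 1]))
```

Values near 1 mean points sit much closer to their own cluster than to the nearest other cluster; values near 0 mean the clusters overlap.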


In [63]:
df.select(["DescLength", "DescVec", "DescGroup"]).show()

+----------+--------+---------+
|DescLength| DescVec|DescGroup|
+----------+--------+---------+
|        96|  [96.0]|        0|
|        83|  [83.0]|        0|
|      3168|[3168.0]|        3|
|       124| [124.0]|        0|
|       154| [154.0]|        0|
|        75|  [75.0]|        0|
|       121| [121.0]|        0|
|       170| [170.0]|        4|
|       107| [107.0]|        0|
|        74|  [74.0]|        0|
|       145| [145.0]|        0|
|       148| [148.0]|        0|
|        24|  [24.0]|        0|
|        49|  [49.0]|        0|
|        48|  [48.0]|        0|
|       389| [389.0]|        2|
|       380| [380.0]|        2|
|       216| [216.0]|        4|
|       123| [123.0]|        0|
|       404| [404.0]|        2|
+----------+--------+---------+
only showing top 20 rows



In [65]:
df.filter(df.DescLength == 7532.0).show()

+--------------------+----+--------------+--------------------+------+--------------------+--------------------+----------+---------+--------+---------+
|                Body|  Id|          Tags|               Title|oneTag|                Desc|               words|DescLength|TagLength| DescVec|DescGroup|
+--------------------+----+--------------+--------------------+------+--------------------+--------------------+----------+---------+--------+---------+
|<p>Ok After lots ...|2032|php svg arabic|Generating SVG Dy...|   php|Generating SVG Dy...|[generating, svg,...|      7532|        3|[7532.0]|        3|
+--------------------+----+--------------+--------------------+------+--------------------+--------------------+----------+---------+--------+---------+



In [54]:
model.clusterCenters()

[array([ 97.09383642]),
 array([ 1077.93227792]),
 array([ 502.39304611]),
 array([ 2731.08284024]),
 array([ 242.33112488])]
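
The cluster ids carry no order, so the cluster of longest questions has to be found by comparing centers. A small sketch in plain Python, using the centers printed above; the category names are hypothetical, since the notebook only suggests "short, longer, ..., super long".

```python
# Centers as printed by model.clusterCenters() above (list index = DescGroup id)
centers = [97.09383642, 1077.93227792, 502.39304611, 2731.08284024, 242.33112488]

# Hypothetical category names, ordered from shortest to longest
names = ["short", "medium", "long", "very long", "super long"]

# Rank the cluster ids by center, then map each id to the name at its rank
order = sorted(range(len(centers)), key=lambda i: centers[i])
group_to_name = {group: names[rank] for rank, group in enumerate(order)}

longest_group = order[-1]
print(longest_group, centers[longest_group])  # 3 2731.08284024
print(group_to_name)
```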

In [77]:
df.groupby("DescGroup").agg(avg("DescLength")).show()

+---------+------------------+
|DescGroup|   avg(DescLength)|
+---------+------------------+
|        1|      1074.2109375|
|        3|2731.0828402366865|
|        4| 241.0267434466191|
|        2|499.83863263173606|
|        0| 96.71484436347646|
+---------+------------------+
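
The group averages are close to the centers from `model.clusterCenters()` but not identical; only cluster 3 matches to the printed precision. One possible explanation, which is an assumption this notebook does not verify, is that the fit stopped before the assignments fully settled, so the stored centers are not exactly the means of the final assignments. The gaps can be checked with a short plain-Python sketch using the printed values:

```python
# Values copied from the two outputs above, keyed by DescGroup id
centers = {0: 97.09383642, 1: 1077.93227792, 2: 502.39304611,
           3: 2731.08284024, 4: 242.33112488}
group_means = {0: 96.71484436347646, 1: 1074.2109375, 2: 499.83863263173606,
               3: 2731.0828402366865, 4: 241.0267434466191}

# Absolute gap between each stored center and the mean of the rows assigned to it
gaps = {g: abs(centers[g] - group_means[g]) for g in centers}
for g in sorted(gaps):
    print(g, round(gaps[g], 4))
```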




