# K-Means Quiz
Use this Jupyter notebook to find the answers to the quiz in the previous section. There is an answer key in the next part of the lesson.

Instead of using the raw number of words from the Title+Body length feature we built before, we might look at its distribution and create categories based on this length: short, longer, ..., super long.

In the questions below, I'll refer to the length of the combined Title and Body fields as Description length (by length we mean the number of words when the text is tokenized with pattern="\W").
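To make that definition concrete, here is a rough pure-Python approximation of the word count. It is only a sketch: it assumes the `RegexTokenizer` defaults (split on the pattern, lowercase the text, drop empty tokens), restricts `\W` to ASCII to stay close to Java's default regex behaviour, and `desc_length` is a hypothetical helper that is not part of the notebook.

```python
import re

def desc_length(title, body):
    """Approximate word count of Title + ' ' + Body, split on non-word characters."""
    desc = f"{title} {body}".lower()
    # Splitting on every non-word character leaves empty strings between
    # consecutive separators, so those are filtered out before counting.
    tokens = [t for t in re.split(r"\W", desc, flags=re.ASCII) if t]
    return len(tokens)

n_words = desc_length("How to check a file?", "<p>I'd like to check.</p>")
print(n_words)
```

Note that HTML markup contributes to the count: each `<p>` or `</p>` tag adds a `p` token, and `I'd` splits into two tokens.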

In [35]:
from pyspark.sql import SparkSession
# Note: importing max from pyspark.sql.functions shadows Python's built-in max in this notebook
from pyspark.sql.functions import col, concat, count, lit, udf, max, avg
from pyspark.sql.types import IntegerType
from pyspark.ml.feature import RegexTokenizer, VectorAssembler
from pyspark.ml.clustering import KMeans

# TODOS: 
# 1) import any other libraries you might need
# 2) run the cells below to read the dataset and extract description length features
# 3) write code to answer the quiz questions

In [18]:
spark = SparkSession.builder \
    .master("local") \
    .appName("Creating Features") \
    .getOrCreate()

### Read Dataset

In [19]:
stack_overflow_data = 'Train_onetag_small.json'

In [20]:
df = spark.read.json(stack_overflow_data)
df.persist()

DataFrame[Body: string, Id: bigint, Tags: string, Title: string, oneTag: string]

### Build Description Length Features

In [21]:
df = df.withColumn("Desc", concat(col("Title"), lit(' '), col("Body")))

In [22]:
body_length = udf(lambda x: len(x), IntegerType())

In [23]:
regexTokenizer = RegexTokenizer(inputCol="Desc", outputCol="words", pattern="\\W")
df = regexTokenizer.transform(df)
df = df.withColumn("DescLength", body_length(df.words))

In [24]:
assembler = VectorAssembler(inputCols=["DescLength"], outputCol="DescVec")
df = assembler.transform(df)

In [25]:
number_of_tags = udf(lambda x: len(x.split(" ")), IntegerType())
df = df.withColumn("NumTags", number_of_tags(df.Tags))

In [26]:
df.head()

Row(Body="<p>I'd like to check if an uploaded file is an image file (e.g png, jpg, jpeg, gif, bmp) or another file. The problem is that I'm using Uploadify to upload the files, which changes the mime type and gives a 'text/octal' or something as the mime type, no matter which file type you upload.</p>\n\n<p>Is there a way to check if the uploaded file is an image apart from checking the file extension using PHP?</p>\n", Id=1, Tags='php image-processing file-upload upload mime-types', Title='How to check if an uploaded file is an image without mime type?', oneTag='php', Desc="How to check if an uploaded file is an image without mime type? <p>I'd like to check if an uploaded file is an image file (e.g png, jpg, jpeg, gif, bmp) or another file. The problem is that I'm using Uploadify to upload the files, which changes the mime type and gives a 'text/octal' or something as the mime type, no matter which file type you upload.</p>\n\n<p>Is there a way to check if the uploaded file is an imag

# Question 1
How many times greater is the Description Length of the longest question than the Description Length of the shortest question (rounded to the nearest whole number)?

Tip: Don't forget to import Spark SQL's aggregate functions that can operate on DataFrame columns.

In [27]:
# TODO: write your code to answer this question
df.agg({"DescLength": "max"}).collect()[0][0]/df.agg({"DescLength": "min"}).collect()[0][0]

753.2

# Question 2
What are the mean and standard deviation of the Description length?

In [28]:
# TODO: write your code to answer this question
df.agg({"DescLength": "avg"}).collect()[0][0]

180.28187

In [29]:
df.agg({"DescLength": "std"}).collect()[0][0]

192.10819533505023
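When comparing this number with other tools, it helps to know which standard deviation is being reported. To my knowledge, Spark SQL's `std` aggregate is the sample standard deviation (divisor n - 1), not the population one (divisor n). The sketch below shows the difference with Python's `statistics` module on a small made-up list of lengths; the values are hypothetical and not taken from the dataset.

```python
import statistics

# Hypothetical description lengths, chosen so the arithmetic is easy to check.
toy_lengths = [2, 4, 4, 4, 5, 5, 7, 9]

mean_len = statistics.mean(toy_lengths)
sample_std = statistics.stdev(toy_lengths)       # divides by n - 1
population_std = statistics.pstdev(toy_lengths)  # divides by n

print(mean_len, sample_std, population_std)
```

On roughly 100,000 rows the two versions differ only far past the decimal point, so either would give the same quiz answer.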

# Question 3
Let's use K-means to create 5 clusters of Description lengths. Set the random seed to 42 and fit a K-means model with k=5 on the Description length column (you can use KMeans().setParams(...)). What length is the center of the cluster representing the longest questions?
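For intuition about what the fit does on a single numeric column, here is a minimal sketch of plain Lloyd's algorithm on scalars. It is not Spark's implementation (which, among other things, initializes centers differently and uses the seed); `kmeans_1d`, the toy lengths, and the starting centers are all assumptions made for illustration.

```python
def kmeans_1d(values, centers, max_iter=20):
    """Plain Lloyd's algorithm on scalars, starting from the given centers."""
    centers = list(centers)
    for _ in range(max_iter):
        # Assignment step: each value goes to its nearest center.
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Update step: each center moves to the mean of its group
        # (an empty group keeps its previous center).
        new_centers = [sum(g) / len(g) if g else c
                       for g, c in zip(groups, centers)]
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers

# Hypothetical lengths with three obvious groups.
toy_lengths = [10, 12, 14, 100, 110, 120, 1000, 1100]
final_centers = kmeans_1d(toy_lengths, centers=[0, 100, 1000])
print(final_centers)
```

Each returned center is the mean length of the questions assigned to it, which is why the largest center can be read as the typical length of the longest group.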

In [43]:
# create df to perform ml
ml_df = df.select(col("NumTags"), col("DescLength"), 
                  col("DescVec"))

In [44]:
# train k means model
kmeans = KMeans().setParams(featuresCol="DescVec", 
                            predictionCol="DescGroup", 
                            k=5, 
                            seed=42)
model = kmeans.fit(ml_df)
ml_df = model.transform(ml_df)

In [45]:
model.clusterCenters()

[array([ 97.09383642]),
 array([ 1077.93227792]),
 array([ 502.39304611]),
 array([ 2731.08284024]),
 array([ 242.33112488])]
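Cluster ids are arbitrary, so the longest group has to be found by comparing the centers. The sketch below ranks the centers copied from the output above and maps each cluster id to an ordered category. The labels `"long"` and `"very long"` are hypothetical fillers for the elided middle categories, and `sorted` is used instead of `max` because the notebook's imports shadow the built-in `max`.

```python
# Cluster centers copied from the model.clusterCenters() output above.
centers = [97.09383642, 1077.93227792, 502.39304611, 2731.08284024, 242.33112488]

# Cluster ids ordered from the shortest to the longest center.
order = sorted(range(len(centers)), key=lambda i: centers[i])

# Hypothetical category names: "long" and "very long" are made up for this sketch.
labels = ["short", "longer", "long", "very long", "super long"]
group_to_label = {group: label for group, label in zip(order, labels)}

longest_group = order[-1]
print(longest_group, round(centers[longest_group]))
print(group_to_label)
```

A mapping like `group_to_label` could then be used to turn the `DescGroup` prediction column into a readable categorical feature.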

In [46]:
ml_df.head()

Row(NumTags=5, DescLength=96, DescVec=DenseVector([96.0]), DescGroup=0)

In [47]:
ml_df.groupby("DescGroup") \
    .agg(avg(col("DescLength")), avg(col("NumTags")), count(col("DescLength"))) \
    .orderBy("avg(DescLength)") \
    .show()

+---------+------------------+------------------+-----------------+
|DescGroup|   avg(DescLength)|      avg(NumTags)|count(DescLength)|
+---------+------------------+------------------+-----------------+
|        0| 96.71484436347646|   2.7442441184785|            63674|
|        4| 241.0267434466191| 3.093549070868367|            28306|
|        2|499.83863263173606|3.2294372294372296|             6699|
|        1|      1074.2109375|3.2864583333333335|             1152|
|        3|2731.0828402366865|  3.42603550295858|              169|
+---------+------------------+------------------+-----------------+

