# Answers to the K-Means Quiz
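Rather than describing question length in words (short, longer, super long), we would probably prefer to look at the distribution of a numeric feature: the length of the title plus the body.

In the questions below, I call the length of the combined `Title` and `Body` fields the description length. Length here means the number of words obtained by tokenizing with `pattern="\\W"`.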
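To make the definition concrete, here is a minimal plain-Python sketch of the same counting rule. It assumes that `RegexTokenizer` with `pattern="\\W"` splits on every non-word character, lowercases the text, and drops empty tokens, which is my understanding of its defaults. The function name `description_length` and the sample strings are hypothetical and do not come from the dataset.

```python
import re


def description_length(title, body):
    """Count word tokens in title + body, splitting on non-word characters."""
    # Join the two fields with a space, as the Desc column does.
    desc = f"{title} {body}".lower()
    # re.split leaves empty strings between adjacent separators; drop them.
    tokens = [t for t in re.split(r"\W", desc) if t]
    return len(tokens)


# Hypothetical sample: HTML tags such as <p> contribute tokens too ("p").
print(description_length("How to check?", "<p>I'd like</p>"))
```

Because the split happens on every non-word character, markup and apostrophes inflate the count: `<p>` becomes the token `p`, and `I'd` becomes `i` and `d`.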

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, concat, count, desc, explode, lit, min, max, split, stddev, udf
from pyspark.sql.types import IntegerType
from pyspark.ml.feature import RegexTokenizer, VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.clustering import KMeans

In [None]:
spark = SparkSession.builder \
    .master("local") \
    .appName("Creating Features") \
    .getOrCreate()

### Read the dataset

In [None]:
stack_overflow_data = 'Train_onetag_small.json'

In [None]:
df = spark.read.json(stack_overflow_data)
df.persist()

In [None]:
DataFrame[Body: string, Id: bigint, Tags: string, Title: string, oneTag: string]



### Create the description length feature

In [None]:
df = df.withColumn("Desc", concat(col("Title"), lit(' '), col("Body")))

In [None]:
regexTokenizer = RegexTokenizer(inputCol="Desc", outputCol="words", pattern="\\W")
df = regexTokenizer.transform(df)

In [None]:
body_length = udf(lambda x: len(x), IntegerType())
df = df.withColumn("DescLength", body_length(df.words))

In [None]:
assembler = VectorAssembler(inputCols=["DescLength"], outputCol="DescVec")
df = assembler.transform(df)

In [None]:
number_of_tags = udf(lambda x: len(x.split(" ")), IntegerType())
df = df.withColumn("NumTags", number_of_tags(df.Tags))

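# Question 1
How many times longer is the description of the longest question than that of the shortest question (rounded to the nearest whole number)?

Hint: don't forget to import the Spark SQL aggregate functions that operate on DataFrame columns.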

In [None]:
df.agg(min("DescLength")).show()

In [None]:
+---------------+
|min(DescLength)|
+---------------+
|             10|
+---------------+




In [None]:
df.agg(max("DescLength")).show()

In [None]:
+---------------+
|max(DescLength)|
+---------------+
|           7532|
+---------------+
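The two aggregates above report a minimum of 10 words and a maximum of 7532 words, but the ratio the question asks for is never computed. A minimal sketch of that last step, with the two values copied by hand from the outputs (the variable names are only illustrative):

```python
# Values copied from the min/max outputs above.
shortest = 10
longest = 7532

# Ratio of the longest to the shortest description, rounded to an integer.
ratio = longest / shortest
rounded_ratio = round(ratio)
print(rounded_ratio)
```

So the longest description is roughly 753 times as long as the shortest one.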



# Question 2
What are the mean and the standard deviation of the description length?

In [None]:
df.agg(avg("DescLength"), stddev("DescLength")).show()

In [None]:
+---------------+-----------------------+
|avg(DescLength)|stddev_samp(DescLength)|
+---------------+-----------------------+
|      180.28187|     192.10819533505023|
+---------------+-----------------------+
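The output column is named `stddev_samp`, which shows that `stddev` computes the sample standard deviation (dividing by n - 1) rather than the population one. The sketch below illustrates the difference with Python's `statistics` module on a few made-up lengths; the toy values are hypothetical and are not taken from the dataset.

```python
import statistics

# Hypothetical toy lengths, not taken from the dataset.
lengths = [10, 20, 30]

mean = statistics.mean(lengths)
sample_std = statistics.stdev(lengths)       # divides by n - 1, like stddev_samp
population_std = statistics.pstdev(lengths)  # divides by n, like stddev_pop

print(mean, sample_std, population_std)
```

On three values the two estimates differ noticeably. On a dataset as large as this one the gap should be negligible, so either figure answers the question to the precision usually reported.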



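# Question 3
Let's use K-means to group the description lengths into 5 clusters. Set the random seed to 42 and fit a 5-cluster K-means model to the description length (you can use `KMeans().setParams(...)`). What is the center length of the cluster that represents the longest questions?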

In [None]:
kmeans = KMeans().setParams(featuresCol="DescVec", predictionCol="DescGroup", k=5, seed=42)
model = kmeans.fit(df)
df = model.transform(df)

In [None]:
df.head()

In [None]:
Row(Body="<p>I'd like to check if an uploaded file is an image file (e.g png, jpg, jpeg, gif, bmp) or another file.The problem is that I'm using Uploadify to upload the files, which changes the mime type and gives a 'text/octal' or something as the mime type, no matter which file type you upload.</p>\n\n<p>Is there a way to check if the uploaded file is an image apart from checking the file extension using PHP?</p>\n", Id=1, Tags='php image-processing file-upload upload mime-types', Title='How to check if an uploaded file is an image without mime type?', oneTag='php', Desc="How to check if an uploaded file is an image without mime type?<p>I'd like to check if an uploaded file is an image file (e.g png, jpg, jpeg, gif, bmp) or another file.The problem is that I'm using Uploadify to upload the files, which changes the mime type and gives a 'text/octal' or something as the mime type, no matter which file type you upload.</p>\n\n<p>Is there a way to check if the uploaded file is an image apart from checking the file extension using PHP?</p>\n", words=['how', 'to', 'check', 'if', 'an', 'uploaded', 'file', 'is', 'an', 'image', 'without', 'mime', 'type', 'p', 'i', 'd', 'like', 'to', 'check', 'if', 'an', 'uploaded', 'file', 'is', 'an', 'image', 'file', 'e', 'g', 'png', 'jpg', 'jpeg', 'gif', 'bmp', 'or', 'another', 'file', 'the', 'problem', 'is', 'that', 'i', 'm', 'using', 'uploadify', 'to', 'upload', 'the', 'files', 'which', 'changes', 'the', 'mime', 'type', 'and', 'gives', 'a', 'text', 'octal', 'or', 'something', 'as', 'the', 'mime', 'type', 'no', 'matter', 'which', 'file', 'type', 'you', 'upload', 'p', 'p', 'is', 'there', 'a', 'way', 'to', 'check', 'if', 'the', 'uploaded', 'file', 'is', 'an', 'image', 'apart', 'from', 'checking', 'the', 'file', 'extension', 'using', 'php', 'p'], DescLength=96, DescVec=DenseVector([96.0]), DescGroup=0)




In [None]:
df.groupby("DescGroup").agg(avg(col("DescLength")), avg(col("NumTags")), count(col("DescLength"))).orderBy("avg(DescLength)").show()

In [None]:
+---------+------------------+------------------+-----------------+
|DescGroup|   avg(DescLength)|      avg(NumTags)|count(DescLength)|
+---------+------------------+------------------+-----------------+
|        0| 96.71484436347646|   2.7442441184785|            63674|
|        4| 241.0267434466191| 3.093549070868367|            28306|
|        2|499.83863263173606|3.2294372294372296|             6699|
|        1|      1074.2109375|3.2864583333333335|             1152|
|        3|2731.0828402366865|  3.42603550295858|              169|
+---------+------------------+------------------+-----------------+
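The grouped averages above can stand in for the cluster centers because, once K-means has converged, each center is the mean of the points assigned to it. The largest average, about 2731, therefore approximates the center of the longest-question cluster; the fitted model's `clusterCenters()` method would report the centers directly. The sketch below shows the idea with a plain-Python, one-dimensional version of Lloyd's algorithm. The function `kmeans_1d`, the toy lengths, and the starting centers are all hypothetical, and this is not how Spark initializes or runs K-means.

```python
import builtins  # the notebook shadows min/max with Spark functions


def kmeans_1d(values, centers, max_iter=100):
    """Lloyd's algorithm on scalars, starting from the given centers."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(max_iter):
        # Assignment step: each value goes to its nearest center.
        groups = [[] for _ in centers]
        for v in values:
            nearest = builtins.min(
                range(len(centers)), key=lambda i: abs(v - centers[i])
            )
            groups[nearest].append(v)
        # Update step: each center moves to the mean of its members
        # (an empty cluster keeps its previous center).
        new_centers = [
            sum(g) / len(g) if g else c for g, c in zip(groups, centers)
        ]
        if new_centers == centers:
            break  # converged: centers equal the group means
        centers = new_centers
    return centers, groups


# Hypothetical toy description lengths and starting centers.
toy_lengths = [10, 12, 14, 100, 110, 1000]
centers, groups = kmeans_1d(toy_lengths, centers=[10, 100, 1000])
print(centers)
print(groups)
```

At convergence every returned center equals the average of its group, which is the same relationship the `groupby("DescGroup")` table relies on. Spark may stop before full convergence, so treat the grouped average as a close approximation of the center, not an exact match.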



