# Feature Extractor

# TF-IDF

In [1]:
from pyspark.ml.feature import Tokenizer, HashingTF, IDF

In [3]:
# create dataframe
sentenceData = spark.createDataFrame([(0, "I heard about Spark and I love Spark"),
                                      (0, "I wish Java could use case classes"),
                                      (1, "Logistic regression models are neat")]).toDF("label", "sentence")

In [4]:
# split each sentence into lowercase words
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(sentenceData)

In [5]:
# feature extraction: use HashingTF's transform() to hash each sentence into a feature vector;
# the number of hash buckets is set to 20 here.
hashingtf = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurized_data = hashingtf.transform(wordsData)

In [8]:
print(featurized_data.collect())

[Row(label=0, sentence='I heard about Spark and I love Spark', words=['i', 'heard', 'about', 'spark', 'and', 'i', 'love', 'spark'], rawFeatures=SparseVector(20, {0: 1.0, 5: 2.0, 9: 2.0, 13: 1.0, 17: 2.0})), Row(label=0, sentence='I wish Java could use case classes', words=['i', 'wish', 'java', 'could', 'use', 'case', 'classes'], rawFeatures=SparseVector(20, {2: 1.0, 7: 1.0, 9: 3.0, 13: 1.0, 15: 1.0})), Row(label=1, sentence='Logistic regression models are neat', words=['logistic', 'regression', 'models', 'are', 'neat'], rawFeatures=SparseVector(20, {4: 1.0, 6: 1.0, 13: 1.0, 15: 1.0, 18: 1.0}))]


In [13]:
featurized_data.select("label", "rawFeatures").show()

+-----+--------------------+
|label|         rawFeatures|
+-----+--------------------+
|    0|(20,[0,5,9,13,17]...|
|    0|(20,[2,7,9,13,15]...|
|    1|(20,[4,6,13,15,18...|
+-----+--------------------+
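The hashing step can be illustrated without Spark. The following is a minimal sketch, assuming a stand-in hash function (`zlib.crc32`) and a hypothetical helper named `hashing_tf`. Spark's HashingTF uses a MurmurHash3-based hash, so the bucket indices produced here will not match the `rawFeatures` output above. The idea is the same: each word is mapped to one of `numFeatures` buckets, and the bucket's value is the number of words that landed in it.

```python
import zlib
from collections import Counter


def hashing_tf(words, num_features=20):
    """Count how many tokens fall into each hash bucket (hypothetical helper)."""
    counts = Counter()
    for word in words:
        # Assumption: crc32 stands in for Spark's MurmurHash3, so indices differ from Spark's.
        bucket = zlib.crc32(word.encode("utf-8")) % num_features
        counts[bucket] += 1.0
    # Return a plain dict ordered by bucket index, similar in spirit to a sparse vector.
    return dict(sorted(counts.items()))


words = ["i", "heard", "about", "spark", "and", "i", "love", "spark"]
vector = hashing_tf(words)
print(vector)
```

With only 20 buckets, collisions are expected. In the Spark output above, the second sentence has seven distinct words, yet bucket 9 holds a count of 3.0, so at least three different words were hashed to the same index. A larger `numFeatures` reduces such collisions.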



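Finally, IDF rescales the raw term-frequency vectors so that they better reflect how well each term discriminates between documents. IDF is an Estimator: calling fit() on the term-frequency vectors produces an IDFModel.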

In [10]:
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurized_data)

IDFModel is a Transformer: calling its transform() method yields the TF-IDF value for each term (strictly, for each hash bucket).

In [11]:
rescaledData = idfModel.transform(featurized_data)
rescaledData.select("label", "features").show()

+-----+--------------------+
|label|            features|
+-----+--------------------+
|    0|(20,[0,5,9,13,17]...|
|    0|(20,[2,7,9,13,15]...|
|    1|(20,[4,6,13,15,18...|
+-----+--------------------+
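Because show() truncates the features column, the rescaled values are not visible in the table. The sketch below recomputes them in plain Python from the `rawFeatures` values printed earlier, assuming the smoothed formula given in the Spark MLlib documentation, idf = log((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents containing the bucket. The helper names `fit_idf` and `transform` are made up for this illustration.

```python
import math

# Term-frequency vectors copied from the rawFeatures output shown earlier.
raw_features = [
    {0: 1.0, 5: 2.0, 9: 2.0, 13: 1.0, 17: 2.0},
    {2: 1.0, 7: 1.0, 9: 3.0, 13: 1.0, 15: 1.0},
    {4: 1.0, 6: 1.0, 13: 1.0, 15: 1.0, 18: 1.0},
]


def fit_idf(docs):
    """Compute one IDF weight per bucket (hypothetical helper)."""
    num_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        for bucket in doc:
            doc_freq[bucket] = doc_freq.get(bucket, 0) + 1
    # Smoothed IDF with natural logarithm, as assumed in the lead-in.
    return {b: math.log((num_docs + 1) / (df + 1)) for b, df in doc_freq.items()}


def transform(doc, idf):
    """Multiply each term frequency by the IDF weight of its bucket."""
    return {b: tf * idf[b] for b, tf in doc.items()}


idf = fit_idf(raw_features)
tfidf = [transform(doc, idf) for doc in raw_features]
for row in tfidf:
    print(row)
```

Bucket 13 occurs in all three documents, so its weight is log(4/4) = 0 and it contributes nothing to any document's features. Bucket 0 occurs in only one document and receives the largest weight, log(4/2).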



# Word2Vec

First, import Word2Vec and create three word sequences, each representing one document:

In [1]:
from pyspark.ml.feature import Word2Vec

documentDF = spark.createDataFrame([
    ("Hi I heard about Spark".split(" "), ),
    ("I wish Java could use case classes".split(" "), ),
    ("Logistic regression models are neat".split(" "), )], ["text"])

Create a Word2Vec instance, which is an Estimator, and set its hyperparameters; here the dimension of the feature vectors is set to 3.

In [2]:
word2Vec = Word2Vec(vectorSize=3, minCount=0, inputCol="text", outputCol="result")

Feed in the training data and call fit() to produce a Word2VecModel.

In [3]:
model = word2Vec.fit(documentDF)

Use the Word2VecModel to convert each document into a feature vector.

In [6]:
result = model.transform(documentDF)

for row in result.collect():
    text, vector = row
    print("Text: [%s] => \nVector: %s\n" % (", ".join(text), str(vector)))
 


Text: [Hi, I, heard, about, Spark] => 
Vector: [0.0824229435064,-0.00583029687405,-0.0562269836664]

Text: [I, wish, Java, could, use, case, classes] => 
Vector: [0.0372266347653,-0.00211243118559,-0.0131242724934]

Text: [Logistic, regression, models, are, neat] => 
Vector: [-0.0836555838585,0.00822269544005,-0.0409857220948]



As shown, each document is converted into a 3-dimensional feature vector, which can then be fed to downstream machine learning methods.
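According to the Spark MLlib documentation, Word2VecModel turns a document into a single vector by averaging the vectors of its words. The sketch below shows only that averaging step. The word vectors are hypothetical values invented for this example, not learned ones, and `document_vector` is an illustrative helper that assumes every word in the document has a vector.

```python
# Hypothetical 3-dimensional word vectors, made up for this sketch;
# a trained Word2VecModel would supply learned values instead.
word_vectors = {
    "Hi": [0.25, -0.5, 0.0],
    "I": [0.5, 0.0, 0.25],
    "heard": [-0.25, 0.5, 0.75],
}


def document_vector(words, vectors):
    """Average the word vectors component by component (illustrative helper)."""
    # Assumption: every word in `words` is present in `vectors`.
    columns = zip(*(vectors[w] for w in words))
    return [sum(column) / len(words) for column in columns]


doc = ["Hi", "I", "heard"]
print(document_vector(doc, word_vectors))
```

The result has the same dimension as the word vectors regardless of document length, which is why all three documents above map to 3-dimensional vectors.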

# CountVectorizer

In [7]:
from pyspark.ml.feature import CountVectorizer

Suppose we have the following DataFrame with id and words columns; it can be viewed as a mini corpus of two documents.

In [9]:
df = spark.createDataFrame([
        (0, "a b c".split(" ")),
        (1, "a b b c a".split(" "))], ["id", "words"])

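Next, configure a CountVectorizer through its hyperparameters: here the vocabulary size is capped at 3, and a term must appear in at least 2 documents to enter the vocabulary, which filters out terms that occur only incidentally.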

In [10]:
# configure a CountVectorizer; fitting it on the corpus yields a CountVectorizerModel
cv = CountVectorizer(inputCol="words", outputCol="features", vocabSize=3, minDF=2.0)

Calling fit() on the DataFrame trains the model; the resulting CountVectorizerModel holds the learned vocabulary:

In [11]:
model = cv.fit(df)

Transforming the DataFrame with this model yields the vectorized representation of each document.

In [12]:
result = model.transform(df)
result.show(truncate=False)

+---+---------------+-------------------------+
|id |words          |features                 |
+---+---------------+-------------------------+
|0  |[a, b, c]      |(3,[0,1,2],[1.0,1.0,1.0])|
|1  |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
+---+---------------+-------------------------+
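To make the roles of vocabSize and minDF concrete, here is a plain-Python sketch of the same two steps: selecting a vocabulary, then counting terms against it. The helpers `fit_vocabulary` and `transform` are hypothetical names. The alphabetical tie-break between equally frequent terms is an assumption of this sketch; Spark does not promise a particular order for ties.

```python
from collections import Counter


def fit_vocabulary(docs, vocab_size, min_df):
    """Pick the most frequent terms that occur in at least `min_df` documents."""
    term_counts = Counter()
    doc_freq = Counter()
    for words in docs:
        term_counts.update(words)      # total occurrences across the corpus
        doc_freq.update(set(words))    # number of documents containing the term
    eligible = [t for t in term_counts if doc_freq[t] >= min_df]
    # Most frequent first; ties broken alphabetically (an assumption of this sketch).
    eligible.sort(key=lambda t: (-term_counts[t], t))
    return eligible[:vocab_size]


def transform(words, vocabulary):
    """Count occurrences of each vocabulary term, keyed by vocabulary index."""
    index = {term: i for i, term in enumerate(vocabulary)}
    counts = Counter(index[w] for w in words if w in index)
    return {i: float(c) for i, c in sorted(counts.items())}


docs = ["a b c".split(" "), "a b b c a".split(" ")]
vocabulary = fit_vocabulary(docs, vocab_size=3, min_df=2)
vectors = [transform(words, vocabulary) for words in docs]
print(vocabulary)
print(vectors)
```

All three terms occur in both documents, so none is filtered out by the document-frequency threshold, and the counts agree with the features column in the table above. A term that appeared in only one of the two documents would be dropped from the vocabulary and ignored during the transform.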



# Feature Transformer