### Feature Extractors
Extraction: Extracting features from “raw” data
+ TF-IDF
+ Word2Vec
+ CountVectorizer
+ FeatureHasher

#### TF-IDF

In [1]:
import pyspark
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName("Feature Extractor").getOrCreate()



In [3]:
from pyspark.ml.feature import HashingTF, IDF, Tokenizer

In [4]:
sentenceData = spark.createDataFrame([
    (0.0, "Hi I heard about Spark"),
    (0.0, "I wish Java could use case classes"),
    (1.0, "Logistic regression models are neat")
], ["label", "sentence"])

sentenceData.show()

                                                                                

+-----+--------------------+
|label|            sentence|
+-----+--------------------+
|  0.0|Hi I heard about ...|
|  0.0|I wish Java could...|
|  1.0|Logistic regressi...|
+-----+--------------------+



In [5]:
# Lowercase each sentence and split it on whitespace into a list of words
tokenizer = Tokenizer(inputCol='sentence', outputCol='words')
wordsData = tokenizer.transform(sentenceData)

In [6]:
# Hash each word into one of 20 buckets and count the occurrences per bucket
hashingTF = HashingTF(inputCol='words', outputCol='rawFeatures', numFeatures=20)
featurizedData = hashingTF.transform(wordsData)

In [7]:
# Fit the IDF weights on the corpus, then rescale the raw term frequencies
idf = IDF(inputCol='rawFeatures', outputCol='features')
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)

                                                                                

In [11]:
rescaledData.select("label",'features').show(truncate=False)

+-----+-------------------------------------------------------------------------------------------------------------------------------------------+
|label|features                                                                                                                                   |
+-----+-------------------------------------------------------------------------------------------------------------------------------------------+
|0.0  |(20,[6,8,13,16],[0.28768207245178085,0.6931471805599453,0.28768207245178085,0.5753641449035617])                                           |
|0.0  |(20,[0,2,7,13,15,16],[0.6931471805599453,0.6931471805599453,1.3862943611198906,0.28768207245178085,0.6931471805599453,0.28768207245178085])|
|1.0  |(20,[3,4,6,11,19],[0.6931471805599453,0.6931471805599453,0.28768207245178085,0.6931471805599453,0.6931471805599453])                       |
+-----+-------------------------------------------------------------------------------------------------------------------------------------------+
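
The weights in the output above can be checked by hand. Spark's `IDF` uses a smoothed inverse document frequency, `log((m + 1) / (df + 1))` with the natural logarithm, where `m` is the number of documents and `df` is the number of documents in which a hash bucket is non-empty; the TF-IDF value is the bucket's term frequency times this weight. The sketch below is plain Python rather than Spark, and the helper name `smoothed_idf` is made up for illustration.

```python
import math

def smoothed_idf(num_docs: int, doc_freq: int) -> float:
    """Smoothed IDF: natural log of (m + 1) / (df + 1)."""
    return math.log((num_docs + 1) / (doc_freq + 1))

num_docs = 3  # sentenceData holds three sentences

# Weight of a bucket that is non-empty in 2 of the 3 documents
idf_df2 = smoothed_idf(num_docs, 2)
# Weight of a bucket that is non-empty in only 1 document
idf_df1 = smoothed_idf(num_docs, 1)

print(idf_df2)      # about 0.2877
print(idf_df1)      # about 0.6931
print(2 * idf_df1)  # about 1.3863: term frequency 2 in a bucket with df = 1
```

Values such as 1.386 and 0.575 in the table are twice a base weight, which means the bucket received two words from the same sentence. With only 20 buckets (`numFeatures=20`), different words can hash to the same index.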

#### Word2Vec

Word2Vec is an Estimator which takes sequences of words representing documents and trains a Word2VecModel. The model maps each word to a unique fixed-size vector. 

The Word2VecModel transforms each document into a vector by averaging the vectors of all words in the document; this vector can then be used as features for prediction, document similarity calculations, etc.
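
The averaging step can be sketched in plain Python. The word vectors below are invented toy values, not output from a trained model, and the helper name `document_vector` is hypothetical. The sketch also assumes that every word in the document has a vector, as is the case when `minCount=0`.

```python
def document_vector(words, word_vectors):
    """Return the element-wise mean of the vectors of the given words.

    Assumes every word has an entry in word_vectors.
    """
    vectors = [word_vectors[w] for w in words]
    size = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(size)]

# Toy 3-dimensional word vectors (made up for illustration)
toy_vectors = {
    "spark": [1.0, 0.0, 2.0],
    "is":    [0.0, 1.0, 0.0],
    "neat":  [2.0, 2.0, 1.0],
}

doc_vec = document_vector(["spark", "is", "neat"], toy_vectors)
print(doc_vec)  # [1.0, 1.0, 1.0]
```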

In [12]:
from pyspark.ml.feature import Word2Vec

In [14]:
# Input data: Each row is a bag of words from a sentence or document.
documentDF = spark.createDataFrame([
    ("Hi I heard about Spark".split(" "), ),
    ("I wish Java could use case classes".split(" "), ),
    ("Logistic regression models are neat".split(" "), )
], ["text"])

documentDF.show(truncate=False)

+------------------------------------------+
|text                                      |
+------------------------------------------+
|[Hi, I, heard, about, Spark]              |
|[I, wish, Java, could, use, case, classes]|
|[Logistic, regression, models, are, neat] |
+------------------------------------------+



In [15]:
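word2vec = Word2Vec(vectorSize=3, minCount=0, inputCol='text', outputCol='result')

# fit() returns a Word2VecModel; transform() then adds the 'result' column.
# Note: `model` therefore holds the transformed DataFrame, not the fitted model.
model = word2vec.fit(documentDF).transform(documentDF)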



In [18]:
model.show(truncate=False)

+------------------------------------------+-------------------------------------------------------------------+
|text                                      |result                                                             |
+------------------------------------------+-------------------------------------------------------------------+
|[Hi, I, heard, about, Spark]              |[0.10757351368665696,0.005313180014491082,0.02163493409752846]     |
|[I, wish, Java, could, use, case, classes]|[0.02210963943174907,-0.03750888577529362,0.046501401013561657]    |
|[Logistic, regression, models, are, neat] |[-0.023545664548873902,-0.036877965182065965,0.0036725979298353195]|
+------------------------------------------+-------------------------------------------------------------------+



In [24]:
for row in model.collect():
    text, vector = row
    print(f"Text:- {' '.join(text)} => vector:- {vector}")

Text:- Hi I heard about Spark => vector:- [0.10757351368665696,0.005313180014491082,0.02163493409752846]
Text:- I wish Java could use case classes => vector:- [0.02210963943174907,-0.03750888577529362,0.046501401013561657]
Text:- Logistic regression models are neat => vector:- [-0.023545664548873902,-0.036877965182065965,0.0036725979298353195]


#### CountVectorizer

CountVectorizer and CountVectorizerModel aim to help convert a collection of text documents to vectors of token counts. When an a-priori dictionary is not available, CountVectorizer can be used as an Estimator to extract the vocabulary, and generates a CountVectorizerModel. The model produces sparse representations for the documents over the vocabulary, which can then be passed to other algorithms like LDA.

During the fitting process, CountVectorizer selects the top vocabSize words ordered by term frequency across the corpus. The optional parameter minDF also affects fitting by specifying the minimum number (or fraction, if < 1.0) of documents a term must appear in to be included in the vocabulary. Another optional parameter, binary, controls the output vector: if set to true, all nonzero counts are set to 1. This is especially useful for discrete probabilistic models that model binary, rather than integer, counts.
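
The fitting and counting logic described above can be sketched in plain Python. This is an illustration of the idea rather than Spark's implementation: the function names and the `vocab_size`, `min_df` and `binary` arguments are hypothetical stand-ins for vocabSize, minDF and binary, and the alphabetical tie-break is an assumption added to make the result deterministic.

```python
from collections import Counter

def fit_vocabulary(docs, vocab_size, min_df=1.0):
    """Pick the top vocab_size terms by corpus term frequency, subject to min_df."""
    term_freq = Counter()
    doc_freq = Counter()
    for doc in docs:
        term_freq.update(doc)       # total occurrences across the corpus
        doc_freq.update(set(doc))   # number of documents containing the term
    # min_df >= 1.0 is a document count; below 1.0 it is a fraction of documents
    min_docs = min_df if min_df >= 1.0 else min_df * len(docs)
    candidates = [t for t in term_freq if doc_freq[t] >= min_docs]
    # Most frequent first; alphabetical tie-break is an assumption of this sketch
    candidates.sort(key=lambda t: (-term_freq[t], t))
    return candidates[:vocab_size]

def count_vector(doc, vocabulary, binary=False):
    """Count each vocabulary term in doc; with binary=True, clip counts to 1."""
    counts = Counter(doc)
    vec = [float(counts[t]) for t in vocabulary]
    if binary:
        vec = [1.0 if c > 0 else 0.0 for c in vec]
    return vec

docs = ["a b c".split(" "), "a b b c a".split(" ")]
vocabulary = fit_vocabulary(docs, vocab_size=3, min_df=2.0)

print(vocabulary)                                      # ['a', 'b', 'c']
print(count_vector(docs[1], vocabulary))               # [2.0, 2.0, 1.0]
print(count_vector(docs[1], vocabulary, binary=True))  # [1.0, 1.0, 1.0]
```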


In [3]:
from pyspark.ml.feature import CountVectorizer

df = spark.createDataFrame([
    (0,"a b c".split(" ")),
    (1, "a b b c a".split(" "))
], ['id','words'])

df.show()

                                                                                

+---+---------------+
| id|          words|
+---+---------------+
|  0|      [a, b, c]|
|  1|[a, b, b, c, a]|
+---+---------------+



In [4]:
# Fit a CountVectorizerModel from the corpus
cv = CountVectorizer(inputCol='words', outputCol='features',
                     vocabSize=3, minDF=2.0)
model = cv.fit(df)

result = model.transform(df)
result.show(truncate=False)

+---+---------------+-------------------------+
|id |words          |features                 |
+---+---------------+-------------------------+
|0  |[a, b, c]      |(3,[0,1,2],[1.0,1.0,1.0])|
|1  |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
+---+---------------+-------------------------+



#### FeatureHasher
Feature hashing projects a set of categorical or numerical features into a feature vector of specified dimension (typically substantially smaller than that of the original feature space). This is done using the hashing trick to map features to indices in the feature vector.

The FeatureHasher transformer operates on multiple columns. Each column may contain either numeric or categorical features. Behavior and handling of column data types is as follows:

- Numeric columns: For numeric features, the hash value of the column name is used to map the feature value to its index in the feature vector. By default, numeric features are not treated as categorical (even when they are integers). To treat them as categorical, specify the relevant columns using the categoricalCols parameter.
- String columns: For categorical features, the hash value of the string “column_name=value” is used to map to the vector index, with an indicator value of 1.0. Thus, categorical features are “one-hot” encoded (similarly to using OneHotEncoder with dropLast=false).
- Boolean columns: Boolean values are treated in the same way as string columns. That is, boolean features are represented as “column_name=true” or “column_name=false”, with an indicator value of 1.0.
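
The three rules above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Spark's implementation: `zlib.crc32` stands in for Spark's hash function, so the indices will not match Spark's output; the names `hash_index`, `hash_features` and `categorical_cols` are hypothetical; and summing values that collide in the same bucket is an assumption of this sketch.

```python
import zlib

def hash_index(key: str, num_features: int) -> int:
    """Map a string key to a vector index (crc32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_features

def hash_features(row, num_features=262144, categorical_cols=()):
    """Hash one row (a dict of column name to value) into a sparse {index: value} dict."""
    vec = {}
    for col, value in row.items():
        is_numeric = isinstance(value, (int, float)) and not isinstance(value, bool)
        if is_numeric and col not in categorical_cols:
            # Numeric column: hash the column name, keep the value itself
            key, weight = col, float(value)
        else:
            # String, boolean, or numeric-as-categorical: hash "column_name=value"
            if isinstance(value, bool):
                value = "true" if value else "false"
            key, weight = f"{col}={value}", 1.0
        idx = hash_index(key, num_features)
        vec[idx] = vec.get(idx, 0.0) + weight  # colliding values are summed (assumption)
    return vec

row1 = {"real": 2.2, "bool": True, "stringNum": "1", "string": "foo"}
row2 = {"real": 3.3, "bool": False, "stringNum": "2", "string": "bar"}

hashed1 = hash_features(row1)
hashed2 = hash_features(row2)
print(hashed1)
print(hashed2)
```

Because a numeric column is keyed by its name alone, `real` lands at the same index in every row, whereas each distinct value of a categorical column gets its own index.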

In [5]:
from pyspark.ml.feature import FeatureHasher

dataset = spark.createDataFrame([
    (2.2, True, "1", "foo"),
    (3.3, False, "2", "bar"),
    (4.4, False, "3", "baz"),
    (5.5, False, "4", "foo")
], ["real", "bool", "stringNum", "string"])

dataset.show()

+----+-----+---------+------+
|real| bool|stringNum|string|
+----+-----+---------+------+
| 2.2| true|        1|   foo|
| 3.3|false|        2|   bar|
| 4.4|false|        3|   baz|
| 5.5|false|        4|   foo|
+----+-----+---------+------+



In [6]:
# numFeatures is left at its default of 2^18 = 262144, the vector size shown in the output
hasher = FeatureHasher(inputCols=['real', 'bool', 'stringNum', 'string'], outputCol='features')

featurized = hasher.transform(dataset)
featurized.show(truncate=False)

+----+-----+---------+------+--------------------------------------------------------+
|real|bool |stringNum|string|features                                                |
+----+-----+---------+------+--------------------------------------------------------+
|2.2 |true |1        |foo   |(262144,[174475,247670,257907,262126],[2.2,1.0,1.0,1.0])|
|3.3 |false|2        |bar   |(262144,[70644,89673,173866,174475],[1.0,1.0,1.0,3.3])  |
|4.4 |false|3        |baz   |(262144,[22406,70644,174475,187923],[1.0,1.0,4.4,1.0])  |
|5.5 |false|4        |foo   |(262144,[70644,101499,174475,257907],[1.0,1.0,5.5,1.0]) |
+----+-----+---------+------+--------------------------------------------------------+

