# Essentials of Feature Engineering in PySpark - Part I

Data preprocessing in Spark

The most commonly used data preprocessing techniques in Spark are as follows:

1) VectorAssembler

2) Bucketing

3) Scaling and normalization

   a) StandardScaler

   b) MinMaxScaler

   c) MaxAbsScaler

   d) ElementwiseProduct

   e) Normalizer

4) Working with categorical features

   a) StringIndexer

   b) Converting indexed values back to text

   c) Indexing in vectors

   d) One-hot encoding

5) Text data transformers

   a) Tokenizing text

   b) Removing common words

   c) Creating word combinations

   d) Converting words into numerical representations

   e) TF-IDF

   f) Word2Vec

6) Feature manipulation

7) PCA

In [None]:
# Initializing a Spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").appName("Word Count").config("spark.some.config.option","some-value").getOrCreate()

Downloading the dataset
For the purpose of this demonstration we will be using four different datasets

1) retail-data/by-day

2) simple-ml-integers

3) simple-ml

4) simple-ml-scaling

The datasets can be downloaded from this link https://github.com/databricks/Spark-The-Definitive-Guide

In [None]:
# # Let us begin by reading the "retail-data/by-day" which is in .csv format
# sales = spark.read.format("csv") \ # here we specify the format of the file we intend to read
#         .option("header","true") \ # setting "header" as true will consider the first row as the header of the Dataframe
#         .option("inferSchema", "true") \ # Spark has its own mechanism to infer the schema which I will leverage at this point of time
#         .load("/data/retail-data/by-day/*.csv") \ # here we specify the path to our csv file(s)
#         .coalesce(5)\
#         .where("Description IS NOT NULL") # We intend to take only those rows in which the value in the description column is not null

In [None]:
# Let us begin by reading the "retail-data/by-day" which is in .csv format and save it into a Spark dataframe named 'sales'
sales = spark.read.format("csv").option("header","true").option("inferSchema", "true").load(r"data/retail-data/by-day/*.csv").coalesce(5).where("Description IS NOT NULL")

In [None]:
# Let us read the parquet files in "simple-ml-integers" and make a Spark dataframe named 'fakeIntDF'
fakeIntDF=spark.read.parquet("/home/spark/DhirajR/Spark/feature_engineering/data/simple-ml-integers")
# Let us read the JSON files in "simple-ml" and make a Spark dataframe named 'simpleDF'
simpleDF=spark.read.json(r"/home/spark/DhirajR/Spark/feature_engineering/data/simple-ml")
# Let us read the parquet files in "simple-ml-scaling" and make a Spark dataframe named 'scaleDF'
scaleDF=spark.read.parquet(r"/home/spark/DhirajR/Spark/feature_engineering/data/simple-ml-scaling")

In [None]:
sales.cache()
sales.show()

In [None]:
type(sales)

# Vector assembler

The vector assembler is used to concatenate all the features into a single vector, which can then be passed to an estimator or ML algorithm. In order to demo the 'VectorAssembler' we will use the 'fakeIntDF' which we created in the previous steps.

In [None]:
# Let us see what kind of data we have in 'fakeIntDF'

In [None]:
fakeIntDF.cache()
fakeIntDF.show()

In [None]:
# Let us import the vector assembler
from pyspark.ml.feature import VectorAssembler
# Once the VectorAssembler is imported we are required to create an object of the same. Here I will create an object named assembler
# The above result shows that we have three features in 'fakeIntDF' i.e. int1, int2, int3. Let us create the object assembler so as to combine the three features into a single column named features
assembler = VectorAssembler(inputCols=["int1", "int2", "int3"],outputCol="features")
# Now let us use the transform method to transform our dataset
assembler.transform(fakeIntDF).show()

# Bucketing
Bucketing is the most straightforward approach for converting continuous variables into categorical variables. Let us understand this with an example straight away.

In PySpark the task of bucketing can be easily accomplished using the Bucketizer class.

First, we define the bucket borders (splits) as a list of increasing values. In the example below we use
bucketBorders = [-float("inf"), -0.5, 0.0, 0.5, float("inf")]
so that every value, however small or large, falls into some bucket.

Next, let us create an object of the Bucketizer class. Then we will apply the transform method to our target DataFrame "dataFrame".

In [None]:
# Let us create a sample dataframe for demo purpose

data = [(-999.9,), (-0.5,), (-0.3,), (0.0,), (0.2,), (999.9,)]
dataFrame = spark.createDataFrame(data, ["features"])

In [None]:
from pyspark.ml.feature import Bucketizer
bucketBorders=[-float("inf"), -0.5, 0.0, 0.5, float("inf")]

bucketer=Bucketizer().setSplits(bucketBorders).setInputCol("features").setOutputCol("Buckets")
bucketer.transform(dataFrame).show()
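
To make the bucket assignment concrete, here is a minimal plain-Python sketch (not Spark's implementation) of the rule that Spark documents for the Bucketizer: a bucket covers the range [x, y), except the last bucket, which also includes y. The helper name `bucket_index` is hypothetical.

```python
from bisect import bisect_right

def bucket_index(value, splits):
    """Return the index of the bucket [splits[i], splits[i+1]) that contains value."""
    if value < splits[0] or value > splits[-1]:
        raise ValueError(f"{value} lies outside the splits")
    # bisect_right counts the borders that are <= value; subtracting 1 gives the bucket.
    # min() keeps a value equal to the last border inside the final bucket.
    return min(bisect_right(splits, value) - 1, len(splits) - 2)

splits = [-float("inf"), -0.5, 0.0, 0.5, float("inf")]
values = [-999.9, -0.5, -0.3, 0.0, 0.2, 999.9]
buckets = [bucket_index(v, splits) for v in values]
print(buckets)  # [0, 1, 1, 2, 2, 3]
```

Spark reports the same bucket numbers as doubles (0.0, 1.0, ...) in the output column.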

# Scaling and normalization

Scaling and normalization is another common task that we come across while handling continuous variables. It is not always imperative to scale and normalize the features. However, it is highly recommended to do so before applying an ML algorithm, in order to avert the risk of the algorithm being insensitive to certain features (for example, those with a small numeric range).

Spark ML provides us with a class "StandardScaler" for easy scaling and normalization of features

In [None]:
scaleDF.show()

In [None]:
from pyspark.ml.feature import StandardScaler
# Let us create an object of StandardScaler class
Scalerizer=StandardScaler().setInputCol("features").setOutputCol("Scaled_features")
Scalerizer.fit(scaleDF).transform(scaleDF).show(truncate=False)
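
The arithmetic behind standardization can be sketched for a single feature column in plain Python. This is only an illustration, not Spark's code: the function name and the keyword arguments `with_mean` and `with_std` are hypothetical stand-ins for the withMean (default False) and withStd (default True) parameters of Spark's StandardScaler, and the sample standard deviation (n - 1 in the denominator) is used.

```python
from statistics import mean, stdev

def standard_scale(column, with_mean=False, with_std=True):
    """Standardize one feature column (a list of floats)."""
    mu = mean(column) if with_mean else 0.0
    sigma = stdev(column) if with_std else 1.0  # sample standard deviation
    if sigma == 0.0:
        # Assumption for this sketch: a constant column is mapped to zeros
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]

column = [2.0, 4.0, 6.0]
scaled_default = standard_scale(column)            # unit standard deviation only
centered = standard_scale(column, with_mean=True)  # zero mean and unit standard deviation
print(scaled_default)  # [1.0, 2.0, 3.0]
print(centered)        # [-1.0, 0.0, 1.0]
```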

# MinMaxScaler

With its default settings, Spark's StandardScaler scales each feature to a standard deviation of 1; it also centers the features to zero mean only when withMean is set to True. Sometimes, we encounter situations where we need to scale values within a given range (i.e. between a min and a max). For such tasks Spark ML provides a MinMaxScaler.

The StandardScaler and MinMaxScaler are used in the same way; the only difference is that we can provide the minimum and maximum values within which we wish to scale the features.

For the sake of illustration, let us scale the features in the range 5 to 10.

In [None]:
from pyspark.ml.feature import MinMaxScaler
# Let us create an object of MinMaxScaler class
MinMaxScalerizer=MinMaxScaler().setMin(5).setMax(10).setInputCol("features").setOutputCol("MinMax_Scaled_features")
MinMaxScalerizer.fit(scaleDF).transform(scaleDF).show()
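
For reference, the rescaling rule can be written out for a single feature column in plain Python. This is a minimal sketch rather than Spark's implementation; the function name is hypothetical, and the handling of a constant column (mapping it to the midpoint of the target range) follows what Spark's documentation describes.

```python
def min_max_scale(column, new_min=0.0, new_max=1.0):
    """Linearly rescale one feature column into [new_min, new_max]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant column: use the midpoint of the target range
        return [0.5 * (new_min + new_max) for _ in column]
    return [(x - lo) / (hi - lo) * (new_max - new_min) + new_min for x in column]

column = [1.0, 2.0, 3.0]
rescaled = min_max_scale(column, new_min=5.0, new_max=10.0)
print(rescaled)  # [5.0, 7.5, 10.0]
```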

# MaxAbsScaler

Sometimes we need to scale features to the range -1 to 1. The MaxAbsScaler does exactly this by dividing each feature by its maximum absolute value.

In [None]:
from pyspark.ml.feature import MaxAbsScaler
# Let us create an object of MaxAbsScaler class
MaxAbsScalerizer=MaxAbsScaler().setInputCol("features").setOutputCol("MaxAbs_Scaled_features")
MaxAbsScalerizer.fit(scaleDF).transform(scaleDF).show(truncate=False)
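
The same operation on a single feature column, as a small plain-Python sketch (the function name is hypothetical, and this is not Spark's code):

```python
def max_abs_scale(column):
    """Divide every value by the largest absolute value in the column."""
    max_abs = max(abs(x) for x in column)
    if max_abs == 0.0:
        # Assumption for this sketch: an all-zero column is returned unchanged
        return list(column)
    return [x / max_abs for x in column]

scaled = max_abs_scale([-8.0, 2.0, 4.0])
print(scaled)  # [-1.0, 0.25, 0.5]
```

Because the values are only divided and never shifted, zeros stay zeros, which preserves sparsity.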

# ElementwiseProduct

What differentiates ElementwiseProduct from the previously mentioned scalers is the fact that in ElementwiseProduct each feature is multiplied by a factor that the user supplies in a scaling vector, rather than by one derived from the data.

The below mentioned code snippet will multiply feature#1 by 10, feature#2 by 0.1 and feature#3 by -1.

For example --> the features [10, 20, 30] if scaled by [10, 0.1, -1] will become [100, 2.0, -30]

In [None]:
from pyspark.ml.feature import ElementwiseProduct
from pyspark.ml.linalg import Vectors

# Let us define a scaling vector 

ScalebyVector=Vectors.dense([10,0.1,-1])

# Let us create an object of the class Elementwise product
ScalingUp=ElementwiseProduct().setScalingVec(ScalebyVector).setInputCol("features").setOutputCol("ElementWiseProduct")
# Let us transform
ScalingUp.transform(scaleDF).show(truncate=False)

# Normalizer

The Normalizer scales each feature vector (row) to unit norm, i.e. it divides every vector by its length. How the length is measured depends on the parameter "p", which specifies the p-norm to use. The most commonly used norms correspond to the "Manhattan distance" and the "Euclidean distance".

For example, Manhattan norm (Manhattan distance) p = 1; Euclidean norm (Euclidean distance) p = 2; p = infinity gives the max norm (the largest absolute value).

In [None]:
from pyspark.ml.feature import Normalizer
# Let us create objects of the Normalizer class, one for each norm
l1_norm=Normalizer().setP(1).setInputCol("features").setOutputCol("l1_norm")
l2_norm=Normalizer().setP(2).setInputCol("features").setOutputCol("l2_norm")
linf_norm=Normalizer().setP(float("inf")).setInputCol("features").setOutputCol("linf_norm")
# Let us transform
l1_norm.transform(scaleDF).show(truncate=False)

In [None]:
l2_norm.transform(scaleDF).show(truncate=False)

In [None]:
linf_norm.transform(scaleDF).show(truncate=False)
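
What the three norms do to a single vector can be checked by hand with a few lines of plain Python. This is only a sketch of the arithmetic (the function name is hypothetical), not Spark's implementation:

```python
def normalize(vector, p):
    """Scale a vector so that its p-norm equals 1."""
    if p == float("inf"):
        norm = max(abs(x) for x in vector)                      # max norm
    else:
        norm = sum(abs(x) ** p for x in vector) ** (1.0 / p)    # p-norm
    if norm == 0.0:
        return list(vector)  # a zero vector is left unchanged in this sketch
    return [x / norm for x in vector]

v = [3.0, 4.0]
l1 = normalize(v, 1)               # divides by |3| + |4| = 7
l2 = normalize(v, 2)               # divides by sqrt(9 + 16) = 5
linf = normalize(v, float("inf"))  # divides by max(|3|, |4|) = 4
print(l1, l2, linf)
```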

# StringIndexer (Converting strings to numerical values)

Most ML algorithms require categorical features to be converted into numerical ones.

Spark's StringIndexer maps strings to different numerical values. For demo purposes we will use the simpleDF dataframe, which contains a categorical feature named "lab".


In [None]:
simpleDF.show(5)

Let us apply string indexer to a categorical variable named "lab" in "simpleDF" DataFrame.

In [None]:
from pyspark.ml.feature import StringIndexer
# Let us create an object of the class StringIndexer
lblindexer=StringIndexer().setInputCol("lab").setOutputCol("LabelIndexed")
# Let us transform
idxRes=lblindexer.fit(simpleDF).transform(simpleDF)
idxRes=idxRes.drop("value1","value2")
idxRes.show(5)
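
By default the StringIndexer orders labels by descending frequency, so the most frequent label receives index 0.0. The plain-Python sketch below mimics that behaviour on a tiny made-up list of labels; the helper name is hypothetical, and breaking ties alphabetically is an assumption of this sketch.

```python
from collections import Counter

def fit_string_indexer(values):
    """Return the labels ordered by descending frequency (list position = index)."""
    counts = Counter(values)
    return sorted(counts, key=lambda label: (-counts[label], label))

labels = ["good", "bad", "bad", "good", "bad"]
index_to_label = fit_string_indexer(labels)
label_to_index = {label: float(i) for i, label in enumerate(index_to_label)}

indexed = [label_to_index[x] for x in labels]        # forward mapping (StringIndexer)
restored = [index_to_label[int(i)] for i in indexed]  # reverse mapping (IndexToString)
print(indexed)   # [1.0, 0.0, 0.0, 1.0, 0.0]
print(restored)  # ['good', 'bad', 'bad', 'good', 'bad']
```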

# IndexToString

Sometimes we come across situations where it is necessary to convert the indexed values back to text. To do this Spark ML provides a class IndexToString. To demonstrate "IndexToString" let us use the "LabelIndexed" column of the "idxRes" dataframe which was created in the previous code snippet.

The LabelIndexed column consists of 1.0 --> good and 0.0 --> bad. Now let us try and reverse this

In [None]:
from pyspark.ml.feature import IndexToString
LabelReverse=IndexToString().setInputCol("LabelIndexed").setOutputCol("ReverseIndex")
LabelReverse.transform(idxRes).show()

# Indexing within Vectors

Spark offers yet another class named "VectorIndexer". The "VectorIndexer" works on a column of features that has already been vectorized: it decides which features are categorical based on the number of distinct values they take (at most maxCategories) and converts those features to zero-based category indices, leaving the remaining features as continuous.

For the purpose of illustration let us first create a new DataFrame with features in the form of Vectors.

In [None]:
from pyspark.ml.linalg import Vectors
dataln=spark.createDataFrame([(Vectors.dense(1,2,3),1),(Vectors.dense(2,5,6),2),(Vectors.dense(1,8,9),3)]).toDF("features","labels")
dataln.show()

In [None]:
from pyspark.ml.feature import VectorIndexer
VecInd=VectorIndexer().setInputCol("features").setMaxCategories(2).setOutputCol("indexed")
VecInd.fit(dataln).transform(dataln).show()

# One-hot encoding

One-hot encoding is one of the most common transformations performed during pre-processing. Let us look at an example straight away.

In [None]:
simpleDF.show()

In [None]:
# Let us encode the "color" feature in the "simpleDF"
from pyspark.ml.feature import StringIndexer,OneHotEncoder
SI=StringIndexer().setInputCol('color').setOutputCol('StrIndexed')
ColorIdx=SI.fit(simpleDF).transform(simpleDF)
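ohe=OneHotEncoder().setInputCol('StrIndexed').setOutputCol("oheIndexed")
# In Spark 2.x OneHotEncoder is a Transformer, so transform() can be called directly.
# In Spark 3.x it is an Estimator: use ohe.fit(ColorIdx).transform(ColorIdx) instead.
ohe.transform(ColorIdx).show()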
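
The encoding itself is simple enough to write out by hand. The sketch below is plain Python with a hypothetical helper name and uses dense lists for readability, whereas Spark returns sparse vectors. It also reflects Spark's default of dropping the last category (dropLast=True), so that the last category is represented by an all-zero vector.

```python
def one_hot(index, num_categories, drop_last=True):
    """Encode a category index as a 0/1 vector."""
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector

first = one_hot(0, 3)                   # first of three categories
last = one_hot(2, 3)                    # last category: all zeros when it is dropped
full = one_hot(2, 3, drop_last=False)   # keep every category
print(first, last, full)  # [1.0, 0.0] [0.0, 0.0] [0.0, 0.0, 1.0]
```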

# Tokenizing text

Tokenizing is the process of converting free-form text into a sequence of tokens. Spark ML offers a Tokenizer class to do this task. The "Description" column in the sales dataframe consists of text with words separated by white spaces. Let us use this for the sake of our demo.

In [None]:
# Let us import the tokenizer
from pyspark.ml.feature import Tokenizer
# Create an object of the Tokenizer class
Tok=Tokenizer().setInputCol("Description").setOutputCol("Tokenized")
sales_tok=Tok.transform(sales).select("Description",'Tokenized')
sales_tok.show()

# RegexTokenizer

The Tokenizer class by default treats the white space between words as the separator. However, at times, we may come across other separators such as '|', '\' or '@'. To handle such situations, Spark ML provides the RegexTokenizer class.

To demonstrate this class let us create our own dataframe named "data_txt" which consists of text separated by '|'

In [None]:
from pyspark.sql.types import StringType
mydata=['Too Fast For You','For|Your|Eyes|Only','As|a|Matter|of|Fact','As|far|as|I|know','Away|from|Keyboard']
data_txt=spark.createDataFrame(mydata,StringType()).toDF("Text")

In [None]:
data_txt.show()

In [None]:
# Let us import RegexTokenizer class
from pyspark.ml.feature import RegexTokenizer
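# Create an object of this class. The pattern is a regular expression, so the '|' must be escaped;
# with gaps=True (the default) the pattern describes the separators between the tokens
RegTok=RegexTokenizer().setInputCol('Text').setOutputCol("Tokenized").setPattern(r"\|").setGaps(True)
RegTok.transform(data_txt).show()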
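
The two modes of the RegexTokenizer can be illustrated with Python's own `re` module. This is a rough sketch rather than Spark's implementation (Spark uses Java regular expressions, and the function name here is hypothetical): with gaps=True the pattern matches the separators, with gaps=False it matches the tokens themselves. Lowercasing mirrors the tokenizer's default behaviour.

```python
import re

def regex_tokenize(text, pattern, gaps=True, to_lowercase=True):
    """Split text on the pattern (gaps=True) or collect the pattern's matches (gaps=False)."""
    if to_lowercase:
        text = text.lower()
    tokens = re.split(pattern, text) if gaps else re.findall(pattern, text)
    return [t for t in tokens if t]  # drop empty tokens

by_gaps = regex_tokenize("For|Your|Eyes|Only", r"\|")                  # pattern = separator
by_match = regex_tokenize("For|Your|Eyes|Only", r"[^|]+", gaps=False)  # pattern = token
print(by_gaps)   # ['for', 'your', 'eyes', 'only']
print(by_match)  # ['for', 'your', 'eyes', 'only']
```

Note that a bare "|" is the alternation operator in a regular expression, which is why it has to be escaped (or placed inside a character class) to be matched literally.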

In [None]:
from pyspark.sql.types import StringType
mydata=['Too Fast For You You','For Your Eyes Only','As a Matter of Fact','As far as I know','Away from Keyboard']
dataln=spark.createDataFrame(mydata,StringType())

In [None]:
dataln.show()

# Removing Stopwords

A common task in NLP after tokenizing is to remove stopwords such as 'the', 'and', 'but', etc. Spark ML offers the StopWordsRemover class to handle this task.

In [None]:
# Let us import the StopWordsRemover
from pyspark.ml.feature import StopWordsRemover
# Let us import a predefined corpus of stopwords
englishStopWords = StopWordsRemover.loadDefaultStopWords("english")
# Create an object of StopWordsRemover
stops=StopWordsRemover().setStopWords(englishStopWords).setInputCol('Tokenized').setOutputCol("Stops_removed")
stops.transform(sales_tok).show()

# Unigrams, bigrams, trigrams ... N-grams

An n-gram is a sequence of n consecutive tokens. Consider the following sentence:

Big data Processing made simple

The bigrams of the above sentence would look like
"Big data" "data Processing" "Processing made" "made simple"

The trigrams of the above sentence would look like
"Big data Processing" "data Processing made" "Processing made simple"

And so on. Spark ML provides the NGram class for this task.


In [None]:
# Let us import the NGram class

from pyspark.ml.feature import NGram
uni_gram=NGram().setInputCol("Tokenized").setOutputCol("uni_gram").setN(1)
bi_gram=NGram().setInputCol("Tokenized").setOutputCol("bi_gram").setN(2)
tri_gram=NGram().setInputCol("Tokenized").setOutputCol("tri_gram").setN(3)

# uni_gram.transform(sales_tok.select("Tokenized"))
bi_gram.transform(sales_tok.select("Tokenized")).show(truncate=True)
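
The sliding-window idea behind n-grams fits in one line of plain Python. The helper below is a hypothetical illustration, not the NGram class itself; it reproduces the bigrams and trigrams of the example sentence.

```python
def ngrams(tokens, n):
    """Join every run of n consecutive tokens with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "Big data Processing made simple".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)   # ['Big data', 'data Processing', 'Processing made', 'made simple']
print(trigrams)  # ['Big data Processing', 'data Processing made', 'Processing made simple']
```

A sentence with fewer than n tokens yields an empty list.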

# Converting words to numerical representation (CountVectorizer)

The 'CountVectorizer' treats every row (or sentence) as a document and does the following:

1) It collects the distinct words that occur across all the documents.

2) For each word, it counts the number of occurrences of that particular word in the entire corpus; when there are more candidate words than the vocab can hold, the most frequent words are kept.

3) For each word in each row (or sentence), it counts the number of occurrences of that particular word in the given row (or sentence), and outputs these counts as a vector.


In addition, the following parameters control which words are counted

1) minTF --> minimum term frequency (within a document, a word is counted only if its term frequency is >= minTF)

2) minDF --> the minimum number of documents (or sentences) in which a word must appear to be included in the vocab

3) vocabSize --> the maximum size of the vocab.

In [None]:
from pyspark.ml.feature import CountVectorizer
cv = CountVectorizer().setInputCol("Tokenized").setOutputCol("CountVec").setVocabSize(500).setMinTF(1).setMinDF(2)

In [None]:
cv.fit(sales_tok).transform(sales_tok).select("Tokenized","CountVec").show(truncate=False)

In [None]:
from pyspark.ml.feature import Tokenizer
Tok=Tokenizer().setInputCol("value").setOutputCol("Tokenized")
dataln=Tok.transform(dataln)
dataln.show()

In [None]:
from pyspark.ml.feature import CountVectorizer
cv = CountVectorizer().setInputCol("Tokenized").setOutputCol("CountVec").setVocabSize(500).setMinTF(1).setMinDF(1)
# Fit the vocabulary on dataln itself, so that its words are part of the vocabulary
cv.fit(dataln).transform(dataln).select("Tokenized","CountVec").show(truncate=False)
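
The roles of vocabSize, minDF and minTF can be seen in a small plain-Python sketch. It is not Spark's implementation: the function names and the toy documents are made up for illustration, dense lists are used instead of sparse vectors, and breaking frequency ties alphabetically is an assumption.

```python
from collections import Counter

def fit_vocabulary(docs, vocab_size, min_df):
    """Keep the most frequent terms that appear in at least min_df documents."""
    term_counts = Counter()  # occurrences over the whole corpus
    doc_counts = Counter()   # number of documents containing the term
    for doc in docs:
        term_counts.update(doc)
        doc_counts.update(set(doc))
    kept = [t for t in term_counts if doc_counts[t] >= min_df]
    kept.sort(key=lambda t: (-term_counts[t], t))
    return kept[:vocab_size]

def count_vector(doc, vocabulary, min_tf):
    """Count each vocabulary term in one document, ignoring counts below min_tf."""
    counts = Counter(doc)
    return [counts[t] if counts[t] >= min_tf else 0 for t in vocabulary]

docs = [["red", "mug", "red"], ["red", "bag"], ["blue", "bag"]]
vocabulary = fit_vocabulary(docs, vocab_size=500, min_df=2)
vectors = [count_vector(d, vocabulary, min_tf=1) for d in docs]
print(vocabulary)  # ['red', 'bag']
print(vectors)     # [[2, 0], [1, 1], [0, 1]]
```

Here "mug" and "blue" are left out of the vocabulary because each appears in only one document, which is below min_df=2.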

# TF-IDF (Term frequency-inverse document frequency)

TF-IDF measures how often a word occurs in each document, weighted according to how many documents that word occurs in. In order to demonstrate the TF-IDF in pyspark let us create a new dataframe named tfIdfln. We will derive this dataframe from sales_tok that we had previously created.

From the "Tokenized" column of the "sales_tok" let us keep only those records which contain the word "red". This can be done as follows:

In [None]:
tfIDFln=sales_tok.where("array_contains(Tokenized,'red')").select("Tokenized").limit(10)
tfIDFln.show(truncate=False)

Now let us import the two important classes offered by Spark, HashingTF and IDF, create objects of them and apply the transformations

In [None]:
from pyspark.ml.feature import HashingTF,IDF
tf=HashingTF().setInputCol("Tokenized").setOutputCol("TFOut").setNumFeatures(10000)
idf=IDF().setInputCol("TFOut").setOutputCol("IDFOut").setMinDocFreq(2)
tfOut=tf.transform(tfIDFln)  # compute the term frequencies once and reuse them
idf.fit(tfOut).transform(tfOut).select("IDFOut").show(1,False)
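
To see where the numbers in the output come from, the weighting can be reproduced in plain Python on a toy corpus. The sketch uses the smoothed formula given in Spark's documentation, idf = log((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents that contain the term. It skips the hashing step (terms are kept as strings instead of hashed indices), and the function names and documents are made up for illustration.

```python
import math
from collections import Counter

def idf_weights(docs, min_doc_freq=0):
    """Inverse document frequency for every term; 0.0 for terms below min_doc_freq."""
    m = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))
    return {
        term: math.log((m + 1) / (df + 1)) if df >= min_doc_freq else 0.0
        for term, df in doc_freq.items()
    }

def tf_idf(doc, idf):
    """Multiply the raw term counts of one document by the IDF weights."""
    tf = Counter(doc)
    return {term: tf[term] * idf[term] for term in tf}

docs = [["red", "mug"], ["red", "bag"], ["red", "red", "hat"]]
idf = idf_weights(docs)
weights = tf_idf(docs[2], idf)
print(weights)  # 'red' gets weight 0.0 because it occurs in every document
```

This is also why a corpus filtered to rows containing "red" gives that word no weight: a term that appears everywhere does not help to tell documents apart.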

# Word2Vec

Word2Vec is a neural-network-based framework for computing vector representations of a set of words. The goal is to have similar words close to one another in a vector space.

In order to demonstrate Word2Vec let us first initialize a new DataFrame named "documentDF", which contains only a few sentences.

In [None]:
documentDF=spark.createDataFrame([("Hi I heard about Spark".split(" "), ),("I wish Java could use case classes".split(" "), ),("Logistic regression models are neat".split(" "), )],["text"])
documentDF.show(truncate=False)

In [None]:
from pyspark.ml.feature import Word2Vec
w2v=Word2Vec(vectorSize=5,minCount=0,inputCol="text",outputCol="result")
model=w2v.fit(documentDF)
results=model.transform(documentDF)
# minCount is the minimum number of times a word should appear in the complete corpus to be included in the vocabulary
# For more information about the various parameters in Word2Vec please refer to https://spark.apache.org/docs/2.1.1/api/java/org/apache/spark/mllib/feature/Word2Vec.html

In [None]:
for row in results.collect():
    text, vector = row
    print("Text:[%s] => \n Vector: %s\n" % (",".join(text),str(vector)))


# Principal Component Analysis (PCA)

PCA helps us find the most important aspects of our data. It is a dimensionality reduction technique, which is employed when the dataset consists of a large number of features. PCA changes the feature representation of our data by deriving new features from the original features.

PCA takes a parameter 'k' which specifies the number of output features.

In order to demonstrate PCA let us go back to our scaleDF dataframe which we created earlier. Remember, it consists of 3 features

In [None]:
scaleDF.show()

In [None]:
from pyspark.ml.feature import PCA
pca=PCA().setInputCol("features").setOutputCol("PCA_features").setK(2)
scaleDF=pca.fit(scaleDF).transform(scaleDF)


In [None]:
scaleDF.show(truncate=False)

# Feature selection

One of the most commonly used feature selection approaches is the chi-square selector (ChiSqSelector). It uses the chi-square test of independence between each feature and the label to obtain a p-value for every feature. Once the p-values are determined, the "best" features can be selected in one of three ways:

1) numTopFeatures: In this case, the user is required to specify the desired number of features (N) to be selected. All the features are then sorted in ascending order of p-value and the top N features are selected.

2) percentile: In this case the user specifies the fraction of the total number of features to keep, and the features are then selected based on the p-value on a lower-the-better basis.

3) threshold (fpr): In this case, we simply set a threshold on the p-value (commonly 0.05). All features with a p-value below the given threshold are considered to be significant and are selected.

Let us start from scratch and develop a code example. We will work with the sales dataframe that we created earlier.

In [None]:
sales=sales.where("Description is NOT NULL").where("CustomerID is NOT NULL").select("Description","CustomerID")
sales.show(10)

In [None]:
# Let us tokenize the "Description"
from pyspark.ml.feature import Tokenizer
Tok=Tokenizer().setInputCol("Description").setOutputCol("Tokenized")
sales=Tok.transform(sales)
sales.show(10)

In [None]:
# Let us countvectorize the Tokenized column
from pyspark.ml.feature import CountVectorizer
cv = CountVectorizer().setInputCol("Tokenized").setOutputCol("CountVec").setVocabSize(500).setMinTF(1).setMinDF(2)
sales=cv.fit(sales).transform(sales)
sales.show(10)

Now we will consider the "CountVec" as our features and "CustomerID" as our labels

In [None]:
from pyspark.ml.feature import ChiSqSelector
chisq=ChiSqSelector().setFeaturesCol("CountVec").setLabelCol("CustomerID").setNumTopFeatures(5).setOutputCol("Aspects")
sales=chisq.fit(sales).transform(sales)
sales.drop("Description","CustomerID","Tokenized").show(truncate=False)
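
The three selection rules differ only in how they pick from the list of p-values, which a short plain-Python sketch can show. The p-values below are hypothetical, the function names are made up, and the chi-square test itself is not computed here; selected feature indices are returned in index order for readability.

```python
def select_num_top(p_values, n):
    """Indices of the n features with the smallest p-values."""
    ranked = sorted(range(len(p_values)), key=lambda i: p_values[i])
    return sorted(ranked[:n])

def select_percentile(p_values, fraction):
    """Keep the given fraction of the features, smallest p-values first."""
    return select_num_top(p_values, int(len(p_values) * fraction))

def select_fpr(p_values, threshold):
    """Keep every feature whose p-value is below the threshold."""
    return [i for i, p in enumerate(p_values) if p < threshold]

p_values = [0.30, 0.01, 0.04, 0.20, 0.001, 0.60]  # hypothetical, one per feature
top_two = select_num_top(p_values, 2)
top_half = select_percentile(p_values, 0.5)
significant = select_fpr(p_values, 0.05)
print(top_two)      # [1, 4]
print(top_half)     # [1, 2, 4]
print(significant)  # [1, 2, 4]
```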

# Polynomial Expansion

Polynomial expansion is used to generate interaction variables of all the input columns. With polynomial expansion, we specify to what degree we would like to see various interactions. For example, for a degree-2 polynomial, Spark takes every value in our feature vector, multiplies it by every value in the feature vector (including itself), and stores the distinct products alongside the original values. For instance, two input features (x, y) expand to five output features at degree 2: (x, x*x, y, x*y, y*y). In general, n input features expanded to degree d yield C(n + d, d) - 1 output features, so three input features give nine output features at degree 2 and 19 at degree 3. This transformation is useful when you want to see interactions between particular features but aren't necessarily sure about which interactions to consider.

In [None]:
scaleDF.show()

In [None]:
from pyspark.ml.feature import PolynomialExpansion
PE=PolynomialExpansion().setInputCol("PCA_features").setOutputCol("Poly_features").setDegree(2)
scaleDF=PE.transform(scaleDF)


In [None]:
scaleDF.select('PCA_features','Poly_features').show(10,truncate=False)
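
Spark's documentation describes the degree-2 expansion of (x, y) as (x, x * x, y, x * y, y * y). The plain-Python sketch below generates the same set of terms, although not necessarily in Spark's order, and checks the count against C(n + d, d) - 1. The function name is hypothetical and `math.prod` and `math.comb` need Python 3.8 or later.

```python
from itertools import combinations_with_replacement
from math import comb, prod

def polynomial_expand(vector, degree):
    """All products of 1..degree features, with repetition allowed (no constant term)."""
    expanded = []
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(vector, d):
            expanded.append(prod(combo))
    return expanded

terms = polynomial_expand([2.0, 3.0], 2)  # x = 2, y = 3
print(terms)  # [2.0, 3.0, 4.0, 6.0, 9.0] i.e. x, y, x*x, x*y, y*y

# The number of output features for n inputs and degree d is C(n + d, d) - 1
sizes = {(n, d): len(polynomial_expand([1.0] * n, d)) for n, d in [(2, 2), (3, 2), (3, 3)]}
print(sizes)  # {(2, 2): 5, (3, 2): 9, (3, 3): 19}
```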

# Conclusion
In this humble blog I have tried to cover some basic and widely used data preprocessing transformations offered by Spark ML, and I have demonstrated each of them with an illustration. However, Spark offers a plethora of other tools to aid the feature engineering task; I will try to cover some of these in my next blog.

# Bibliography
Chambers, B., & Zaharia, M. (2018). Spark: The Definitive Guide. O'Reilly Media, Inc.