<h1 style="text-align:center"> INFO 323: Cloud Computing and Big Data</h1>
<h2 style="text-align:center"> College of Computing and Informatics</h2>
<h2 style="text-align:center">Drexel University</h2>

<h3 style="text-align:center"> Spark Preprocessing</h3>
<h3 style="text-align:center"> Yuan An, PhD</h3>
<h3 style="text-align:center">Associate Professor</h3>

# Formatting Models According to Your Use Case
To preprocess data for Spark’s different advanced analytics tools, you must consider your end
objective. The following list walks through the requirements for input data structure for each
advanced analytics task in MLlib:
* In the case of most classification and regression algorithms, you want to get your data into **a column of type Double to represent the label and a column of type Vector (either dense or sparse) to represent the features**.
* In the case of recommendation, you want to get your data into **a column of users, a column of items (say movies or books), and a column of ratings**.
* In the case of unsupervised learning, **a column of type Vector (either dense or sparse)** is needed to represent the features.
* In the case of graph analytics, you will want **a DataFrame of vertices and a DataFrame of edges**.

The best way to get your data in these formats is through transformers. Transformers are functions that
accept a DataFrame as an argument and return a new DataFrame as a response.

## Read in Several Sample Datasets

In [0]:
sales = spark.read.format("csv")\
  .option("header", "true")\
  .option("inferSchema", "true")\
  .load("dbfs:/FileStore/tables/retail-data/by-day/*.csv")\
  .coalesce(5)\
  .where("Description IS NOT NULL")
    

In [0]:
sales.show(2)

In [0]:
sales.count()

In [0]:
fakeIntDF = spark.read.parquet("dbfs:/FileStore/tables/simple-ml-integers")
simpleDF = spark.read.json("dbfs:/FileStore/tables/simple-ml")
scaleDF = spark.read.parquet("dbfs:/FileStore/tables/simple-ml-scaling")

In [0]:
fakeIntDF.show(2)

In [0]:
simpleDF.show(2)

In [0]:
scaleDF.show(2)

## Transformers

Transformers are functions that convert raw data in some way. This might be to create a new
interaction variable (from two other variables), to normalize a column, or to simply turn it into a
Double to be input into a model. Transformers are primarily used in preprocessing or feature
generation.

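Most transformers require you to specify, at a minimum, the inputCol and the outputCol, which
represent the column names of the input and output, respectively. If you omit outputCol, Spark generates a default name from the transformer's unique ID (for example, NGram_08853e7c5d02__output).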

## High-Level Transformers
High-level transformers allow you to
concisely specify a number of transformations in one. 
These operate at a “high level”, and allow you
to avoid doing data manipulations or transformations one by one. In general, you should try to use the
highest level transformers you can, in order to minimize the risk of error and help you focus on the
business problem instead of the smaller details of implementation. While this is not always possible,
it’s a good objective.

### RFormula
The RFormula is the easiest transformer to use when you have “conventionally” formatted data.

In [0]:
# COMMAND ----------

from pyspark.ml.feature import RFormula

supervised = RFormula(formula="lab ~ . + color:value1 + color:value2")
supervised.fit(simpleDF).transform(simpleDF).show()

###  SQL Transformers
A SQLTransformer allows you to leverage Spark’s vast library of SQL-related manipulations just as
you would an MLlib transformation. Any SELECT statement you can use in SQL is a valid
transformation. The only thing you need to change is that instead of using the table name, you should
just use the keyword `__THIS__`. The following is a basic example of using SQLTransformer:

In [0]:
# COMMAND ----------

from pyspark.ml.feature import SQLTransformer

basicTransformation = SQLTransformer()\
  .setStatement("""
    SELECT sum(Quantity), count(*), CustomerID
    FROM __THIS__
    GROUP BY CustomerID
  """)

basicTransformation.transform(sales).show()

### VectorAssembler
The VectorAssembler is a tool you’ll use in nearly every single pipeline you generate. It helps
concatenate all your features into one big vector you can then pass into an estimator. It’s used
typically in the last step of a machine learning pipeline and takes as input a number of columns of
Boolean, Double, or Vector. This is particularly helpful if you’re going to perform a number of
manipulations using a variety of transformers and need to gather all of those results together.
The output from the following code snippet will make it clear how this works:

In [0]:
# COMMAND ----------

from pyspark.ml.feature import VectorAssembler
va = VectorAssembler().setInputCols(["int1", "int2", "int3"]).setOutputCol("features")
va.transform(fakeIntDF).show()

## Working with Continuous Features

There are two common kinds of transformers for continuous features. First, you can convert continuous
features into categorical features via a process called bucketing. Second, you can scale and normalize your
features according to several different requirements. These transformers only work on Double
types, so make sure you’ve cast any other numerical values to Double:

In [0]:
# COMMAND ----------

contDF = spark.range(20).selectExpr("cast(id as double)")

In [0]:
contDF.show(2)

### Bucketing
The most straightforward approach to bucketing or binning is using the Bucketizer. This will split a
given continuous feature into the buckets of your designation. You specify how buckets should be
created via an array or list of Double values.

In [0]:
# COMMAND ----------

from pyspark.ml.feature import Bucketizer
bucketBorders = [-1.0, 5.0, 10.0, 250.0, 600.0]
bucketer = Bucketizer().setSplits(bucketBorders).setInputCol("id")
bucketer.transform(contDF).show()
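
To make the bucket boundaries concrete, here is a minimal plain-Python sketch of how a list of splits maps values to bucket indices. It assumes the documented Bucketizer convention that each bucket covers [lower, upper) and that the last bucket also includes its upper bound. The helper `assign_bucket` is hypothetical and is not part of Spark.

```python
from bisect import bisect_right

def assign_bucket(value, splits):
    """Return the index i of the bucket [splits[i], splits[i + 1]) holding value."""
    if value < splits[0] or value > splits[-1]:
        raise ValueError(f"{value} is outside the split range")
    if value == splits[-1]:
        # Assumption: the last bucket is closed on the right.
        return len(splits) - 2
    return bisect_right(splits, value) - 1

bucket_borders = [-1.0, 5.0, 10.0, 250.0, 600.0]
buckets = [assign_bucket(float(v), bucket_borders) for v in range(20)]
print(buckets)
```

With the ids 0 through 19 used above, only the first three buckets are populated, because no value reaches 250.0.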

In addition to splitting based on hardcoded values, another option is to split based on percentiles in
our data. This is done with QuantileDiscretizer, which will bucket the values into a user-specified
number of buckets, with the splits determined by approximate quantile values.

In [0]:
# COMMAND ----------

from pyspark.ml.feature import QuantileDiscretizer
bucketer = QuantileDiscretizer().setNumBuckets(5).setInputCol("id")

In [0]:
bucketer = QuantileDiscretizer(numBuckets=5, inputCol="id", outputCol="buckets")

In [0]:
fittedBucketer = bucketer.fit(contDF)
fittedBucketer.transform(contDF).show()

### StandardScaler
The StandardScaler standardizes a set of features to have zero mean and a standard deviation of 1.
The flag withStd (true by default) scales the data to unit standard deviation, while the flag withMean (false by
default) centers the data prior to scaling it. With the default settings, the features are therefore scaled but not centered.

In [0]:
# COMMAND ----------

from pyspark.ml.feature import StandardScaler
sScaler = StandardScaler().setInputCol("features").setOutputCol("scaledFeatures")
sScaler.fit(scaleDF).transform(scaleDF).show(10, False)
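
The effect of the two flags can be sketched for a single feature column in plain Python. This is only an illustration, assuming the sample standard deviation (dividing by n - 1) and a zero output for constant features. The helper `standardize` is hypothetical and is not Spark's implementation.

```python
from statistics import mean, stdev

def standardize(column, with_mean=False, with_std=True):
    """Scale one feature column; the defaults mirror withMean=False, withStd=True."""
    mu = mean(column) if with_mean else 0.0
    sigma = stdev(column) if with_std else 1.0  # sample standard deviation (n - 1)
    if sigma == 0.0:
        # Assumption: a constant feature is mapped to 0.0.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]

column = [1.0, 2.0, 3.0]
scaled_only = standardize(column)                  # scaled, not centered
centered = standardize(column, with_mean=True)     # centered, then scaled
print(scaled_only, centered)
```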

### MinMaxScaler
The MinMaxScaler will scale the values in a vector (component wise) to the proportional values on
a scale from a given min value to a max value. If you specify the minimum value to be 0 and the
maximum value to be 1, then all the values will fall in between 0 and 1:

In [0]:
# COMMAND ----------

from pyspark.ml.feature import MinMaxScaler
minMax = MinMaxScaler().setMin(5).setMax(10).setInputCol("features")

In [0]:
minMax = MinMaxScaler(inputCol="features")

In [0]:
fittedminMax = minMax.fit(scaleDF)
fittedminMax.transform(scaleDF).show()

In [0]:
scaleDF.show()
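
The rescaling itself is a linear map applied to each feature. The following plain-Python sketch shows the arithmetic for one feature column; `min_max_scale` is a hypothetical helper, and the handling of constant columns (mapping them to the midpoint of the target range) is stated here as an assumption.

```python
def min_max_scale(column, new_min=0.0, new_max=1.0):
    """Linearly map a feature column from [min(column), max(column)] to [new_min, new_max]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant feature maps to the midpoint of the target range.
        return [0.5 * (new_max + new_min) for _ in column]
    return [(x - lo) / (hi - lo) * (new_max - new_min) + new_min for x in column]

unit_range = min_max_scale([1.0, 2.0, 3.0])
custom_range = min_max_scale([1.0, 2.0, 3.0], new_min=5.0, new_max=10.0)
print(unit_range, custom_range)
```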

### MaxAbsScaler
Rescale each feature individually to the range [-1, 1] by dividing by the largest absolute value of that feature. It does not shift or center the data, and thus does not destroy any sparsity.

In [0]:
# COMMAND ----------

from pyspark.ml.feature import MaxAbsScaler
maScaler = MaxAbsScaler().setInputCol("features")
fittedmaScaler = maScaler.fit(scaleDF)
fittedmaScaler.transform(scaleDF).show(10, False)

### ElementwiseProduct
Outputs the Hadamard product (i.e., the element-wise product) of each input vector with a provided “weight” vector. In other words, it scales each column of the dataset by a scalar multiplier.

In [0]:
# COMMAND ----------

from pyspark.ml.feature import ElementwiseProduct
from pyspark.ml.linalg import Vectors
scaleUpVec = Vectors.dense(10.0, 15.0, 20.0)
scalingUp = ElementwiseProduct()\
  .setScalingVec(scaleUpVec)\
  .setInputCol("features")
scalingUp.transform(scaleDF).show()

### Normalizer
Normalize a vector to have unit norm using the given p-norm.

In [0]:
# COMMAND ----------

from pyspark.ml.feature import Normalizer
manhattanDistance = Normalizer().setP(1).setInputCol("features")
manhattanDistance.transform(scaleDF).show(10, False)
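
Unlike the scalers above, which work per feature (column), the Normalizer works per row vector. A minimal plain-Python sketch of p-norm normalization for finite p follows; `normalize` is a hypothetical helper and does not handle the infinity norm.

```python
def normalize(vector, p=2.0):
    """Divide a vector by its p-norm so that the result has unit p-norm."""
    norm = sum(abs(x) ** p for x in vector) ** (1.0 / p)
    if norm == 0.0:
        return list(vector)  # leave an all-zero vector unchanged
    return [x / norm for x in vector]

l1_unit = normalize([1.0, -1.0, 2.0], p=1)  # Manhattan (L1) norm, as with setP(1)
l2_unit = normalize([3.0, 4.0], p=2)        # Euclidean (L2) norm
print(l1_unit, l2_unit)
```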

## Working with Categorical Features
The most common task for categorical features is indexing. Indexing converts a categorical variable
in a column to a numerical one that you can plug into machine learning algorithms.

### StringIndexer
The simplest way to index is via the StringIndexer, which maps strings to different numerical IDs.
Spark’s StringIndexer also creates metadata attached to the DataFrame that specifies which inputs
correspond to which outputs. This allows us later to get inputs back from their respective index values:

In [0]:
# COMMAND ----------

from pyspark.ml.feature import StringIndexer
lblIndxr = StringIndexer().setInputCol("lab").setOutputCol("labelInd")
idxRes = lblIndxr.fit(simpleDF).transform(simpleDF)
idxRes.show()

We can also apply StringIndexer to columns that are not strings, in which case, they will be
converted to strings before being indexed:

In [0]:
# COMMAND ----------

valIndexer = StringIndexer().setInputCol("value1").setOutputCol("valueInd")
valIndexer.fit(simpleDF).transform(simpleDF).show()
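
The idea behind indexing, and behind reversing it later, can be sketched in plain Python. The sketch assumes the default ordering in which the most frequent label receives index 0. Breaking ties alphabetically is an assumption made here for determinism, and `fit_string_index` is a hypothetical helper.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct value (as a string) to an index, most frequent first."""
    counts = Counter(str(v) for v in values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

labels = ["good", "bad", "bad", "good", "bad"]
mapping = fit_string_index(labels)
indexed = [mapping[v] for v in labels]

# Keeping the mapping is what makes the reverse lookup possible.
inverse = {index: label for label, index in mapping.items()}
restored = [inverse[i] for i in indexed]
print(mapping, indexed, restored)
```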

### Converting Indexed Values Back to Text

In [0]:
# COMMAND ----------

from pyspark.ml.feature import IndexToString
labelReverse = IndexToString().setInputCol("labelInd")
labelReverse.transform(idxRes).show()

### Indexing in Vectors
VectorIndexer is a helpful tool for working with categorical variables that are already found inside
of vectors in your dataset. This tool will automatically find categorical features inside of your input
vectors and convert them to categorical features with zero-based category indices.

In [0]:
# COMMAND ----------

from pyspark.ml.feature import VectorIndexer
from pyspark.ml.linalg import Vectors
idxIn = spark.createDataFrame([
  (Vectors.dense(1, 2, 3),1),
  (Vectors.dense(2, 5, 6),2),
  (Vectors.dense(1, 8, 9),3)
]).toDF("features", "label")
indxr = VectorIndexer()\
  .setInputCol("features")\
  .setOutputCol("idxed")\
  .setMaxCategories(2)
indxr.fit(idxIn).transform(idxIn).show()

### One-Hot Encoding
Indexing categorical variables is only half of the story. One-hot encoding is an extremely common
data transformation performed after indexing categorical variables.

In [0]:
simpleDF.show(2)

In [0]:
# COMMAND ----------

from pyspark.ml.feature import OneHotEncoder, StringIndexer
lblIndxr = StringIndexer().setInputCol("color").setOutputCol("colorInd")
colorLab = lblIndxr.fit(simpleDF).transform(simpleDF.select("color"))
ohe = OneHotEncoder().setInputCol("colorInd")
ohe.fit(colorLab).transform(colorLab).show()
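
To show what the encoder produces for each category index, here is a plain-Python sketch that returns dense lists. Spark returns sparse vectors, and by default it drops the last category so that it is encoded as all zeros. The helper `one_hot` is hypothetical.

```python
def one_hot(index, num_categories, drop_last=True):
    """Encode a category index as a 0/1 list; with drop_last the final category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector

encoded = [one_hot(i, 3) for i in range(3)]
print(encoded)
```

Dropping the last category keeps the encoded columns from summing to one in every row, which matters for linear models that also fit an intercept.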

## Text Data Transformers

### Tokenizing Text
Tokenization is the process of converting free-form text into a list of “tokens” or individual words.

In [0]:
# COMMAND ----------

from pyspark.ml.feature import Tokenizer
tkn = Tokenizer().setInputCol("Description").setOutputCol("DescOut")
tokenized = tkn.transform(sales.select("Description"))
tokenized.show(20, False)

We can also create a Tokenizer that is not just based on white space but on a regular expression with the
RegexTokenizer. The format of the regular expression should conform to the Java Regular
Expression (RegEx) syntax:

In [0]:
# COMMAND ----------

from pyspark.ml.feature import RegexTokenizer
rt = RegexTokenizer()\
  .setInputCol("Description")\
  .setOutputCol("DescOut")\
  .setPattern(" ")\
  .setToLowercase(True)
rt.transform(sales.select("Description")).show(20, False)

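Another way of using the RegexTokenizer is to use it to output values matching the provided pattern
instead of using it as a gap. With the pattern " " used below, the output tokens are therefore the spaces themselves rather than the words between them.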

In [0]:
# COMMAND ----------

from pyspark.ml.feature import RegexTokenizer
rt = RegexTokenizer()\
  .setInputCol("Description")\
  .setOutputCol("DescOut")\
  .setPattern(" ")\
  .setGaps(False)\
  .setToLowercase(True)
rt.transform(sales.select("Description")).show(20, False)
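
The difference between the two modes can be sketched with Python's `re` module. This is only an approximation: Spark uses Java regular expressions, and the sketch assumes a minimum token length of 1, which removes empty tokens. The helper `regex_tokenize` is hypothetical.

```python
import re

def regex_tokenize(text, pattern, gaps=True, to_lowercase=True, min_token_length=1):
    """Split on the pattern (gaps=True) or return the pattern's matches (gaps=False)."""
    s = text.lower() if to_lowercase else text
    tokens = re.split(pattern, s) if gaps else re.findall(pattern, s)
    return [t for t in tokens if len(t) >= min_token_length]

split_tokens = regex_tokenize("RED  Mug Set", " ")              # pattern marks the gaps
match_tokens = regex_tokenize("RED  Mug Set", " ", gaps=False)  # pattern marks the tokens
word_tokens = regex_tokenize("RED  Mug Set", r"\w+", gaps=False)
print(split_tokens, match_tokens, word_tokens)
```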

### Removing Common Words

In [0]:
# COMMAND ----------

from pyspark.ml.feature import StopWordsRemover
englishStopWords = StopWordsRemover.loadDefaultStopWords("english")
stops = StopWordsRemover()\
  .setStopWords(englishStopWords)\
  .setInputCol("DescOut")
stops.transform(tokenized).show()

### Creating Word Combinations
With n-grams, we can look at sequences of words that commonly co-occur and use them as inputs to a
machine learning algorithm.

In [0]:
# COMMAND ----------

from pyspark.ml.feature import NGram
unigram = NGram().setInputCol("DescOut").setN(1)
bigram = NGram().setInputCol("DescOut").setN(2)

In [0]:
unigram.transform(tokenized.select("DescOut")).show()

In [0]:
bigram.transform(tokenized.select("DescOut")).show()

+--------------------+--------------------------+
|             DescOut|NGram_08853e7c5d02__output|
+--------------------+--------------------------+
|[rabbit, night, l...|      [rabbit night, ni...|
|[doughnut, lip, g...|      [doughnut lip, li...|
|[12, message, car...|      [12 message, mess...|
|[blue, harmonica,...|      [blue harmonica, ...|
|[gumball, coat, r...|      [gumball coat, co...|
|[skulls, , water,...|      [skulls ,  water,...|
|[feltcraft, girl,...|      [feltcraft girl, ...|
|[camouflage, led,...|      [camouflage led, ...|
|[white, skull, ho...|      [white skull, sku...|
|[english, rose, h...|      [english rose, ro...|
|[hot, water, bott...|      [hot water, water...|
|[scottie, dog, ho...|      [scottie dog, dog...|
|[rose, caravan, d...|      [rose caravan, ca...|
|[gingham, heart, ...|      [gingham heart, h...|
|[storage, tin, vi...|      [storage tin, tin...|
|[set, of, 4, knic...|      [set of, of 4, 4 ...|
|   [popcorn, holder]|          [popcorn holder]|
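
The n-gram output above is a sliding window over the token list, with each window joined by a space. A minimal plain-Python sketch follows; `ngrams` is a hypothetical helper.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams(["hot", "water", "bottle"], 2)
short_input = ngrams(["popcorn", "holder"], 3)  # fewer than n tokens
print(bigrams, short_input)
```

An input with fewer than n tokens yields an empty list.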


### Converting Words into Numerical Representations

In [0]:
# COMMAND ----------

from pyspark.ml.feature import CountVectorizer
cv = CountVectorizer()\
  .setInputCol("DescOut")\
  .setOutputCol("countVec")\
  .setVocabSize(500)\
  .setMinTF(1)\
  .setMinDF(2)
    

In [0]:
fittedCV = cv.fit(tokenized)

In [0]:
fittedCV.transform(tokenized).show()
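
The fit and transform steps can be sketched in plain Python. The sketch assumes that the vocabulary keeps terms appearing in at least `min_df` documents, ordered by total count across the corpus, and it ignores minTF. The helpers `fit_vocabulary` and `count_vector` are hypothetical, and the alphabetical tie-break is only for determinism.

```python
from collections import Counter

def fit_vocabulary(docs, vocab_size=500, min_df=2):
    """Pick up to vocab_size terms that occur in at least min_df documents."""
    doc_freq = Counter(term for doc in docs for term in set(doc))
    term_count = Counter(term for doc in docs for term in doc)
    kept = [t for t in term_count if doc_freq[t] >= min_df]
    kept.sort(key=lambda t: (-term_count[t], t))
    return kept[:vocab_size]

def count_vector(doc, vocabulary):
    """Count how often each vocabulary term occurs in one document."""
    counts = Counter(doc)
    return [float(counts[t]) for t in vocabulary]

docs = [["red", "mug"], ["red", "red", "bowl"], ["blue", "mug"]]
vocab = fit_vocabulary(docs)
vectors = [count_vector(doc, vocab) for doc in docs]
print(vocab, vectors)
```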

### Term frequency–inverse document frequency

In [0]:
# COMMAND ----------

tfIdfIn = tokenized\
  .where("array_contains(DescOut, 'red')")\
  .select("DescOut")\
  .limit(10)
tfIdfIn.show(10, False)

In [0]:
# COMMAND ----------

from pyspark.ml.feature import HashingTF, IDF
tf = HashingTF()\
  .setInputCol("DescOut")\
  .setOutputCol("TFOut")\
  .setNumFeatures(10000)
idf = IDF()\
  .setInputCol("TFOut")\
  .setOutputCol("IDFOut")\
  .setMinDocFreq(2)

In [0]:
# COMMAND ----------

idf.fit(tf.transform(tfIdfIn)).transform(tf.transform(tfIdfIn)).show(10, False)
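
The weighting behind these two stages can be sketched in plain Python. The sketch assumes the smoothed formula idf = ln((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents containing the term. It keys the weights by the term itself rather than by a hashed index as HashingTF does. The helper `tf_idf` is hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs, min_doc_freq=0):
    """Return one {term: tf * idf} dict per document."""
    m = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {
        term: math.log((m + 1) / (df + 1)) if df >= min_doc_freq else 0.0
        for term, df in doc_freq.items()
    }
    return [{term: tf * idf[term] for term, tf in Counter(doc).items()} for doc in docs]

docs = [["red", "mug"], ["red", "bowl"], ["red", "red", "cup"]]
weights = tf_idf(docs)
filtered = tf_idf(docs, min_doc_freq=2)
print(weights[0])
```

Because every document in this toy corpus contains "red", its weight is zero, which mirrors the filtered input above where every row contains that word.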

### Word2Vec

Word2Vec is notable for capturing relationships between words based on their semantics. 
Here’s a simple example from the documentation:

In [0]:
# COMMAND ----------

from pyspark.ml.feature import Word2Vec
# Input data: Each row is a bag of words from a sentence or document.
documentDF = spark.createDataFrame([
    ("Hi I heard about Spark".split(" "), ),
    ("I wish Java could use case classes".split(" "), ),
    ("Logistic regression models are neat".split(" "), )
], ["text"])

In [0]:
# Learn a mapping from words to Vectors.
word2Vec = Word2Vec(vectorSize=3, minCount=0, inputCol="text",
  outputCol="result")

In [0]:
model = word2Vec.fit(documentDF)

In [0]:
result = model.transform(documentDF)
for row in result.collect():
    text, vector = row
    print("Text: [%s] => \nVector: %s\n" % (", ".join(text), str(vector)))

## Feature Manipulation

### PCA
Principal Components Analysis (PCA) is a mathematical technique for finding the most important
aspects of our data (the principal components). It changes the feature representation of our data by
creating a new set of features (“aspects”). Each new feature is a combination of the original features.

You’d want to use PCA if you have a large input dataset and want to reduce the total number of features you have.

In [0]:
# COMMAND ----------

from pyspark.ml.feature import PCA
pca = PCA().setInputCol("features").setK(2)
pca.fit(scaleDF).transform(scaleDF).show(20, False)

### Interaction
The feature transformer Interaction allows you to create an
interaction between two variables manually. It just multiplies the two features together, something
that a typical linear model would not do for every possible pair of features in your data. When the
source material was written, this transformer was available directly only in Scala (more recent PySpark
releases also include pyspark.ml.feature.Interaction), but it can be used from any language through
RFormula. We recommend that users just use RFormula instead of manually creating interactions.

### Polynomial Expansion
Polynomial expansion is used to generate interaction variables of all the input columns. With
polynomial expansion, we specify to what degree we would like to see various interactions. For
example, for a degree-2 polynomial, Spark takes every value in our feature vector, multiplies it by
itself and by every other value in the feature vector, and stores these products alongside the original
values as features. For instance, if we have two input features (x, y), a second-degree polynomial gives
five output features: x, x², y, xy, and y². If we have three input features, we’ll get nine output
features. If we use a third-degree polynomial on three input features, we’ll get 19 output features,
and so on. The output grows quickly with both the number of inputs and the degree. This transformation
is useful when you want to see interactions between particular features but aren’t necessarily sure
about which interactions to consider.

In [0]:
# COMMAND ----------

from pyspark.ml.feature import PolynomialExpansion
pe = PolynomialExpansion().setInputCol("features").setDegree(2)
pe.transform(scaleDF).show()
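
The size of the expanded vector can be checked by enumerating the monomials directly. Spark's documentation gives (x, x * x, y, x * y, y * y) as the degree-2 expansion of (x, y). The plain-Python sketch below lists the same terms, though not necessarily in Spark's order. Both helpers are hypothetical.

```python
from itertools import combinations_with_replacement
from math import comb

def polynomial_terms(names, degree):
    """List every monomial of total degree 1..degree over the given feature names."""
    terms = []
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(names, d):
            terms.append("*".join(combo))
    return terms

def expanded_size(num_features, degree):
    """Number of monomials of degree 1..degree (the constant term is excluded)."""
    return comb(num_features + degree, degree) - 1

terms = polynomial_terms(["x", "y"], 2)
print(terms, expanded_size(3, 2), expanded_size(3, 3))
```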

## Feature Selection
There are a number of ways to evaluate feature
importance once you’ve trained a model, but another option is to do some rough filtering beforehand.
Spark has some simple options for doing that, such as ChiSqSelector.

### ChiSqSelector
ChiSqSelector leverages a statistical test to identify features that are not independent from the label
we are trying to predict, and drop the uncorrelated features. 

It’s often used with categorical data in
order to reduce the number of features you will input into your model, as well as to reduce the
dimensionality of text data (in the form of frequencies or counts). Since this method is based on the
Chi-Square test, there are several different ways we can pick the “best” features. The methods are
numTopFeatures, which is ordered by p-value; percentile, which takes a proportion of the input
features (instead of just the top N features); and fpr, which sets a cut off p-value.

In [0]:
# COMMAND ----------

from pyspark.ml.feature import ChiSqSelector, Tokenizer
tkn = Tokenizer().setInputCol("Description").setOutputCol("DescOut")
tokenized = tkn\
  .transform(sales.select("Description", "CustomerId"))\
  .where("CustomerId IS NOT NULL")
prechi = fittedCV.transform(tokenized)\
  .where("CustomerId IS NOT NULL")
chisq = ChiSqSelector()\
  .setFeaturesCol("countVec")\
  .setLabelCol("CustomerId")\
  .setNumTopFeatures(2)
chisq.fit(prechi).transform(prechi)\
  .drop("customerId", "Description", "DescOut").show()
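
The statistic that drives this selection can be sketched for one categorical feature in plain Python. This shows only the Pearson chi-square statistic computed from a contingency table of feature values against labels. It does not compute p-values or reproduce Spark's selection logic, and `chi_square_statistic` is a hypothetical helper.

```python
from collections import Counter

def chi_square_statistic(feature_values, labels):
    """Pearson chi-square statistic for independence between one feature and the label."""
    n = len(labels)
    observed = Counter(zip(feature_values, labels))
    feature_totals = Counter(feature_values)
    label_totals = Counter(labels)
    stat = 0.0
    for f, f_count in feature_totals.items():
        for l, l_count in label_totals.items():
            expected = f_count * l_count / n  # count expected under independence
            stat += (observed.get((f, l), 0) - expected) ** 2 / expected
    return stat

independent = chi_square_statistic([0, 0, 1, 1], [0, 1, 0, 1])
dependent = chi_square_statistic([0, 0, 1, 1], [0, 0, 1, 1])
print(independent, dependent)
```

A larger statistic means stronger evidence that the feature and the label are not independent, which is why such features are the ones a chi-square selector keeps.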