# Preparing Data for Apache Spark ML Model

## Preparing the environment

### Importing libraries

In [13]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import (StringIndexer, OneHotEncoder, VectorAssembler)

### Spark Connection

In [2]:
spark = SparkSession.builder.getOrCreate()

# render DataFrames eagerly in notebooks, without an explicit .show()
spark.conf.set('spark.sql.repl.eagerEval.enabled', True)

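# request one GPU for the driver; resource settings like this normally have to be
# supplied before the session is created, so setting it here may have no effect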
spark.conf.set("spark.driver.resource.gpu.amount","1")

## Scenario 1: VectorAssembler: numerical features

Let's start with `VectorAssembler` when only numerical features are available in the data. The code below creates a Spark DataFrame `customerdata` with three columns: `cust_id`, `monthly_payment` and `tenure_yrs`. Both `monthly_payment` and `tenure_yrs` are numerical features.

In [9]:
customerdata = spark.createDataFrame([(1, 29.99, 5),
                                      (2, 31.99, 3),
                                      (3, 24.99, 1),
                                      (4, 21.99, 3),
                                     ], ['cust_id', 'monthly_payment', 'tenure_yrs'])
customerdata.show()

+-------+---------------+----------+
|cust_id|monthly_payment|tenure_yrs|
+-------+---------------+----------+
|      1|          29.99|         5|
|      2|          31.99|         3|
|      3|          24.99|         1|
|      4|          21.99|         3|
+-------+---------------+----------+



In [10]:
# VectorAssembler combines the listed input columns into a single vector column, named 'features' here.
assembler = VectorAssembler(inputCols=["monthly_payment", "tenure_yrs"],
                            outputCol="features")
outdata = assembler.transform(customerdata)
outdata.show(truncate=False)

+-------+---------------+----------+-----------+
|cust_id|monthly_payment|tenure_yrs|features   |
+-------+---------------+----------+-----------+
|1      |29.99          |5         |[29.99,5.0]|
|2      |31.99          |3         |[31.99,3.0]|
|3      |24.99          |1         |[24.99,1.0]|
|4      |21.99          |3         |[21.99,3.0]|
+-------+---------------+----------+-----------+



## Scenario 2: VectorAssembler: numerical + categorical features

Below, I have created a Spark DataFrame `customerdata1` with two numerical variables (`monthly_payment` and `tenure_yrs`) and one categorical variable (`state`). The `state` variable takes three values: `MD`, `VA` and `WA`.

In [11]:
customerdata1 = spark.createDataFrame([(1, 29.99, 5, 'MD'),
                                       (2, 31.99, 3, 'MD'),
                                       (3, 24.99, 1, 'VA'),
                                       (4, 21.99, 3, 'WA'),
                                       (5, 22.00, 3, 'WA'),
                                       (6, 25.00, 7, 'WA'),
                                      ], ['cust_id', 'monthly_payment', 'tenure_yrs', 'state']
)
customerdata1.show(truncate=False)

+-------+---------------+----------+-----+
|cust_id|monthly_payment|tenure_yrs|state|
+-------+---------------+----------+-----+
|1      |29.99          |5         |MD   |
|2      |31.99          |3         |MD   |
|3      |24.99          |1         |VA   |
|4      |21.99          |3         |WA   |
|5      |22.0           |3         |WA   |
|6      |25.0           |7         |WA   |
+-------+---------------+----------+-----+



A little more work is involved when the data contains a categorical variable. The first step is to convert the categories to numbers, which is done with `StringIndexer` from `pyspark.ml.feature`. `StringIndexer` assigns a unique numeric index to each category of a variable.

When `StringIndexer` is applied, the most frequent category gets index 0, the next most frequent gets 1, and so on. Below, for `state`, `WA` gets index 0 because it is the most frequent category, followed by `MD`: 1 and `VA`: 2.

In [14]:
indexer = StringIndexer(inputCol='state', outputCol='stateNum')
indexd_data = indexer.fit(customerdata1).transform(customerdata1)
indexd_data.show(truncate=False)

+-------+---------------+----------+-----+--------+
|cust_id|monthly_payment|tenure_yrs|state|stateNum|
+-------+---------------+----------+-----+--------+
|1      |29.99          |5         |MD   |1.0     |
|2      |31.99          |3         |MD   |1.0     |
|3      |24.99          |1         |VA   |2.0     |
|4      |21.99          |3         |WA   |0.0     |
|5      |22.0           |3         |WA   |0.0     |
|6      |25.0           |7         |WA   |0.0     |
+-------+---------------+----------+-----+--------+
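
The index assignment in the `stateNum` column can be reproduced with a few lines of plain Python. This is only a sketch of the ordering rule, not Spark's implementation; the helper name `frequency_index` is hypothetical, and breaking ties alphabetically is an assumption based on the documented default ordering in recent Spark versions.

```python
from collections import Counter


def frequency_index(labels):
    """Map each label to an index: most frequent -> 0.0, next -> 1.0, and so on."""
    counts = Counter(labels)
    # Sort by descending count; ties are broken alphabetically (assumption).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# The `state` column from customerdata1
states = ['MD', 'MD', 'VA', 'WA', 'WA', 'WA']
mapping = frequency_index(states)
indexed = [mapping[s] for s in states]

print(mapping)   # {'WA': 0.0, 'MD': 1.0, 'VA': 2.0}
print(indexed)   # [1.0, 1.0, 2.0, 0.0, 0.0, 0.0]
```

`WA` appears three times, `MD` twice and `VA` once, which gives the same indices as the table above.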



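Next, we use `OneHotEncoder` to encode the indexed variable (`stateNum`); `VectorAssembler` then assembles the numerical columns and the one-hot encoded vector together. The output from `OneHotEncoder` (`stateVec`) is shown below in sparse notation, `(size, [indices], [values])`. By default the encoder drops the last category (`dropLast=True`), so three states produce vectors of length 2, and `VA` (index 2) is encoded as the all-zero vector `(2,[],[])`.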

In [16]:
encoder = OneHotEncoder(inputCol='stateNum', outputCol='stateVec')
onehotdata = encoder.fit(indexd_data).transform(indexd_data)
onehotdata.show(truncate=False)

+-------+---------------+----------+-----+--------+-------------+
|cust_id|monthly_payment|tenure_yrs|state|stateNum|stateVec     |
+-------+---------------+----------+-----+--------+-------------+
|1      |29.99          |5         |MD   |1.0     |(2,[1],[1.0])|
|2      |31.99          |3         |MD   |1.0     |(2,[1],[1.0])|
|3      |24.99          |1         |VA   |2.0     |(2,[],[])    |
|4      |21.99          |3         |WA   |0.0     |(2,[0],[1.0])|
|5      |22.0           |3         |WA   |0.0     |(2,[0],[1.0])|
|6      |25.0           |7         |WA   |0.0     |(2,[0],[1.0])|
+-------+---------------+----------+-----+--------+-------------+
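
The sparse notation in `stateVec` is easier to read once it is expanded. The following plain-Python sketch mimics the encoding for this example, assuming the encoder's default of dropping the last category; `one_hot_sparse` and `to_dense` are hypothetical helpers written for illustration, not part of PySpark.

```python
def one_hot_sparse(index, num_categories, drop_last=True):
    """Return (size, indices, values), the notation used in the stateVec column."""
    i = int(index)
    if not 0 <= i < num_categories:
        raise ValueError(f"index {index} is outside 0..{num_categories - 1}")
    size = num_categories - 1 if drop_last else num_categories
    if i < size:
        return (size, [i], [1.0])
    # The dropped last category is represented by the all-zero vector.
    return (size, [], [])


def to_dense(sparse):
    """Expand (size, indices, values) into a plain list of floats."""
    size, indices, values = sparse
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


for state, idx in [('WA', 0.0), ('MD', 1.0), ('VA', 2.0)]:
    vec = one_hot_sparse(idx, num_categories=3)
    print(state, vec, to_dense(vec))
# WA (2, [0], [1.0]) [1.0, 0.0]
# MD (2, [1], [1.0]) [0.0, 1.0]
# VA (2, [], []) [0.0, 0.0]
```

Dropping one category avoids a redundant column: if a row is neither `WA` nor `MD`, it must be `VA`.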



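Now that we have converted the categorical feature to numerical form, we need to assemble the input columns, including the converted one, into a single vector column called `features`. Note that the code below also passes `cust_id` to the assembler; an identifier column like this would usually be left out of the features used to train a model.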

In [18]:
assembler1 = VectorAssembler(
    inputCols=['cust_id', 'monthly_payment', 'tenure_yrs', 'stateVec'],
    outputCol='features')
outdata1 = assembler1.transform(onehotdata)
outdata1.show(truncate=False)

+-------+---------------+----------+-----+--------+-------------+-----------------------+
|cust_id|monthly_payment|tenure_yrs|state|stateNum|stateVec     |features               |
+-------+---------------+----------+-----+--------+-------------+-----------------------+
|1      |29.99          |5         |MD   |1.0     |(2,[1],[1.0])|[1.0,29.99,5.0,0.0,1.0]|
|2      |31.99          |3         |MD   |1.0     |(2,[1],[1.0])|[2.0,31.99,3.0,0.0,1.0]|
|3      |24.99          |1         |VA   |2.0     |(2,[],[])    |[3.0,24.99,1.0,0.0,0.0]|
|4      |21.99          |3         |WA   |0.0     |(2,[0],[1.0])|[4.0,21.99,3.0,1.0,0.0]|
|5      |22.0           |3         |WA   |0.0     |(2,[0],[1.0])|[5.0,22.0,3.0,1.0,0.0] |
|6      |25.0           |7         |WA   |0.0     |(2,[0],[1.0])|[6.0,25.0,7.0,1.0,0.0] |
+-------+---------------+----------+-----+--------+-------------+-----------------------+
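
To make the `features` column concrete, here is a plain-Python sketch of the concatenation that produced it: scalar columns are cast to floats and vector columns are expanded in place, in the order given by `inputCols`. The `assemble` helper and the tuple representation of a sparse vector are assumptions made for illustration; Spark's `VectorAssembler` works on its own vector types.

```python
def assemble(row, input_cols):
    """Concatenate scalar and (size, indices, values) columns into one flat list."""
    features = []
    for col in input_cols:
        value = row[col]
        if isinstance(value, tuple):
            # Sparse vector column: expand to its dense form first.
            size, indices, values = value
            dense = [0.0] * size
            for i, v in zip(indices, values):
                dense[i] = v
            features.extend(dense)
        else:
            # Scalar column: cast to float, as in the output above.
            features.append(float(value))
    return features


# Row 3 of the one-hot encoded data (state VA)
row = {'cust_id': 3, 'monthly_payment': 24.99, 'tenure_yrs': 1,
       'stateVec': (2, [], [])}

with_id = assemble(row, ['cust_id', 'monthly_payment', 'tenure_yrs', 'stateVec'])
without_id = assemble(row, ['monthly_payment', 'tenure_yrs', 'stateVec'])

print(with_id)     # [3.0, 24.99, 1.0, 0.0, 0.0]
print(without_id)  # [24.99, 1.0, 0.0, 0.0]
```

The second call shows the effect of leaving `cust_id` out of the input columns, which is the more common choice when the vector is used to train a model.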



[Source](https://medium.com/@statistics.sudip/preparing-data-for-apache-spark-ml-model-4fedcc31a0f4)