# Setting the stage: Preparing features for Machine Learning

This chapter focuses on the most important stage of machine learning for your use case: exploring, understanding, preparing, and giving purpose to your data. More specifically, we prepare a data set by cleaning the data, creating new features (the fields that will serve in training the model), and then selecting a curated set of features based on how promising they look.

### Reading, exploring, and preparing our machine learning data set
For our ML model, we chose a data set of 20,057 dish names with 680 columns characterizing the ingredient list, the nutritional content, and the category of each dish. The goal here is to predict whether a dish is a dessert.

At their core, data cleanup, exploration, and feature preparation are purpose-driven data transformation. The data set is available from Kaggle [here](https://www.kaggle.com/hugodarwood/epirecipes); the CSV file is named `epi_r.csv`.

We allocate 4 gibibytes of RAM to the driver. Then we read the data frame using the CSV-specialized method of the reader object (`spark.read.csv`) and print the dimensions of the data frame: 20,057 rows and 680 columns.

In [1]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import pyspark.sql.types as T

spark = (
    SparkSession.builder.appName("Recipes ML model - Are you a dessert?")
    .config("spark.driver.memory", "4g")
    .getOrCreate()
)

food = spark.read.csv(
    "./data/recipes/epi_r.csv", inferSchema=True, header=True
)

print(food.count(), len(food.columns))

20057 680


In [2]:
food.printSchema()

root
 |-- title: string (nullable = true)
 |-- rating: string (nullable = true)
 |-- calories: string (nullable = true)
 |-- protein: double (nullable = true)
 |-- fat: double (nullable = true)
 |-- sodium: double (nullable = true)
 |-- #cakeweek: double (nullable = true)
 |-- #wasteless: double (nullable = true)
 |-- 22-minute meals: double (nullable = true)
 |-- 3-ingredient recipes: double (nullable = true)
 |-- 30 days of groceries: double (nullable = true)
 |-- advance prep required: double (nullable = true)
 |-- alabama: double (nullable = true)
 |-- alaska: double (nullable = true)
 |-- alcoholic: double (nullable = true)
 |-- almond: double (nullable = true)
 |-- amaretto: double (nullable = true)
 |-- anchovy: double (nullable = true)
 |-- anise: double (nullable = true)
 |-- anniversary: double (nullable = true)
 |-- anthony bourdain: double (nullable = true)
 |-- aperitif: double (nullable = true)
 |-- appetizer: double (nullable = true)
 |-- apple: double (nullable = true)


#### Standardizing column names using toDF()

We process all the column names to give them a uniform look and facilitate their subsequent usage. We remove anything that isn’t a letter or a number, standardize the spaces and other separators to use the underscore (`_`) character, and replace the ampersand (`&`) with its English equivalent, `and`. While not mandatory, this helps us write a clearer program and improves the consistency of our column names by reducing typos and mistakes.

In [3]:
def sanitize_column_name(name):
    """Drops unwanted characters from the column name.
    We replace spaces, dashes and slashes with underscore,
    and only keep alphanumeric characters."""
    
    answer = name
    
    for i, j in ((" ", "_"), ("-", "_"), ("/", "_"), ("&", "and")):
        answer = answer.replace(i, j)
    return "".join(
        [
            char
            for char in answer
            if char.isalpha() or char.isdigit() or char == "_"
        ]
    )

food = food.toDF(*[sanitize_column_name(name) for name in food.columns])
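
As a quick sanity check, the sketch below reuses the same function body on a few names and verifies that sanitizing does not make two columns collide. The first three names come from the schema printed earlier; the `salt & pepper` entry and the duplicate check are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter


def sanitize_column_name(name):
    """Same logic as the cell above: separators become underscores,
    '&' becomes 'and', and any other non-alphanumeric character is dropped."""
    answer = name
    for old, new in ((" ", "_"), ("-", "_"), ("/", "_"), ("&", "and")):
        answer = answer.replace(old, new)
    return "".join(c for c in answer if c.isalpha() or c.isdigit() or c == "_")


# The first three names appear in the schema above; the last one is made up.
sample_names = ["#cakeweek", "22-minute meals", "3-ingredient recipes", "salt & pepper"]
sanitized = [sanitize_column_name(n) for n in sample_names]
print(sanitized)

# Two different raw names could map to the same clean name, so look for duplicates.
duplicates = [n for n, count in Counter(sanitized).items() if count > 1]
print(duplicates)
```

If `duplicates` were non-empty on the real column list, later references to those columns would be ambiguous, so it is worth checking before renaming.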

#### Exploring our data and getting our first feature columns

A natural first step is to look at the summary of the data contained in each column. Since there are so many columns, we won't run it here, but the code and its output would look similar to the following.

```python
for x in food.columns:
    food.select(x).summary().show()

# many tables looking like this one.
# +-------+----------------------+
# |summary|                 clove|
# +-------+----------------------+
# | count | 20052                |
# | mean  | 0.009624975064831438 |
# | stddev| 0.09763611178399834  |
# | min   | 0.0                  |
# | 25%   | 0.0                  |
# | 50%   | 0.0                  |
# | 75%   | 0.0                  |
# | max   | 1.0                  |
# +-------+----------------------+
```
 
In our summary data, we are looking at numerical columns. In machine learning, we classify numerical features into two categories: `categorical` or `continuous`. A categorical feature takes its values from a discrete set, such as the month of the year (1 to 12). A continuous feature can take any value within a range, such as the price of an item. We can subdivide the categorical family into three main types:
- `Binary` (or `dichotomous`), when you have only two choices (0/1, true/false)
- `Ordinal`, when the categories have a certain ordering (e.g., the position in a race) that matters.
- `Nominal`, when the categories have no specific ordering (e.g., the color of an item)

<img src="images/types_of_numerical_features.png">




Looking at our summary data, it seems that we have a lot of potentially binary columns. In the case of the clove column, the minimum and three quartile values are all zero. To verify this, we’ll group the entire data frame and collect a set of distinct values. If we have only two values for a given column, binary it is! 

In [4]:
import pandas as pd
import warnings

warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning)

pd.set_option("display.max_rows", 100)

is_binary = food.agg(
    *[
        (F.size(F.collect_set(x)) == 2).alias(x)
        for x in food.columns
    ]
).toPandas()

is_binary.unstack()

title       0    False
rating      0    False
calories    0    False
protein     0    False
fat         0    False
                 ...  
cookbooks   0     True
leftovers   0     True
snack       0     True
snack_week  0     True
turkey      0     True
Length: 680, dtype: bool
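
The aggregation above boils down to counting the distinct non-null values of each column (`collect_set` ignores nulls). Here is a minimal plain-Python sketch of the same idea on a made-up three-row sample; the `rows` data and the `is_binary_column` helper are hypothetical and only serve as an illustration.

```python
def is_binary_column(rows, column):
    """True when the column holds exactly two distinct non-null values."""
    distinct = {row[column] for row in rows if row[column] is not None}
    return len(distinct) == 2


# Made-up sample: `clove` is a clean 0/1 flag, `cakeweek` contains a stray value.
rows = [
    {"title": "a", "clove": 0.0, "cakeweek": 0.0},
    {"title": "b", "clove": 1.0, "cakeweek": 880.0},
    {"title": "c", "clove": None, "cakeweek": 1.0},
]

flags = {col: is_binary_column(rows, col) for col in rows[0]}
print(flags)
```

Note that this test only counts values: a column holding two values other than 0 and 1 would also be flagged, so the result is a hint to investigate rather than a proof.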

Most columns are binary, but let's take a closer look at the `cakeweek` and `wasteless` columns.

#### Addressing data mishaps and building our first feature set

We investigate some seemingly incoherent features and, following our findings, clean our data set. We also identify our first feature set, along with each feature type. This section is an example of forensic data exploration.

In [5]:
food.agg(*[F.collect_set(x) for x in ("cakeweek", "wasteless")]).show(1, False)

food.where("cakeweek > 1.0 or wasteless > 1.0").select(
    "title", "rating", "wasteless", "cakeweek", food.columns[-1]
).show()


+-------------------------------+----------------------+
|collect_set(cakeweek)          |collect_set(wasteless)|
+-------------------------------+----------------------+
|[0.0, 1.0, 1188.0, 24.0, 880.0]|[0.0, 1.0, 1439.0]    |
+-------------------------------+----------------------+

+--------------------+--------------------+---------+--------+------+
|               title|              rating|wasteless|cakeweek|turkey|
+--------------------+--------------------+---------+--------+------+
|"Beet Ravioli wit...| Aged Balsamic Vi...|      0.0|   880.0|   0.0|
|"Seafood ""Catapl...|            Vermouth|   1439.0|    24.0|   0.0|
|"""Pot Roast"" of...| Aunt Gloria-Style "|      0.0|  1188.0|   0.0|
+--------------------+--------------------+---------+--------+------+



For three records, it seems like our data set had a bunch of quotation marks along
with some commas that confused PySpark’s otherwise robust parser. Let's drop them.

In [6]:
food = food.where(
    (
        F.col("cakeweek").isin([0.0, 1.0])
        | F.col("cakeweek").isNull()
    )
    & (
        F.col("wasteless").isin([0.0, 1.0])
        | F.col("wasteless").isNull()
    )
)

print(food.count(), len(food.columns))

20054 680
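
To see how a few quotation marks and commas can shift values into the wrong columns, the sketch below parses one made-up line (not an actual record from `epi_r.csv`) twice: once by naively splitting on commas, and once with Python's `csv` module, which treats a doubled quote inside a quoted field as a literal quote.

```python
import csv
import io

# Hypothetical line: a quoted title containing doubled quotes and commas.
line = '"Seafood ""Stew"" with Saffron, Vermouth, and Sorrel",4.375,0.0'

# Naive parsing: every comma is a separator, so the title is cut into pieces.
naive_fields = line.split(",")

# Quote-aware parsing: the whole title stays in a single field.
parsed_fields = next(csv.reader(io.StringIO(line)))

print(len(naive_fields), naive_fields)
print(len(parsed_fields), parsed_fields)
```

Whenever a parser disagrees with the file about how quotes are escaped, fields spill into neighboring columns, which is consistent with the title fragments that ended up in `rating` above. Spark's CSV reader exposes `quote` and `escape` options that may be worth experimenting with as an alternative to dropping the records; that route is not explored here.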


Now that we have identified two binary-in-hiding feature columns, we can identify our feature set and our target variable. The target (or label) is the column containing the value we want to predict. In our case, the column is aptly named `dessert`. We create all-caps variables containing the four main sets of columns we care about:
- The identifiers, which are the column(s) that contain the information unique to each record
- The targets, which are the column(s) (most often one) that contain the value we wish to predict
- The continuous columns, containing continuous features
- The binary columns, containing binary features

In [7]:
IDENTIFIERS = ["title"]

CONTINUOUS_COLUMNS = [
    "rating",
    "calories",
    "protein",
    "fat",
    "sodium",
]

TARGET_COLUMN = ["dessert"]

BINARY_COLUMNS = [
    x
    for x in food.columns
    if x not in CONTINUOUS_COLUMNS
    and x not in TARGET_COLUMN
    and x not in IDENTIFIERS
]

In this section, we rapidly cleaned our data. In practice, this stage will take more than half your time when building an ML model. Fortunately, data cleaning is principled data manipulation, so you can leverage everything in the PySpark toolkit we've built so far.

#### Weeding out useless records and imputing binary features

This section covers the deletion of useless records, those that provide no information to our ML model. In our case, this means removing two types of records:
- Those where all the features are null
- Those where the target is null

Furthermore, we will impute our binary features, meaning that we will provide a default value for them. Since each of them is 0/1, where zero is False and one is True, we equate null to False and fill in zero as a default value. Given the context of our model, this is a reasonable assumption. Those operations are common to every ML model: we always face a point where we want to ensure every record will provide some sort of information to our ML model.

In [8]:
food = food.dropna(
    how="all",
    subset=[x for x in food.columns if x not in IDENTIFIERS],
)

food = food.dropna(subset=TARGET_COLUMN)

print(food.count(), len(food.columns))

20049 680


In [9]:
food = food.fillna(0.0, subset=BINARY_COLUMNS)

print(food.where(F.col(BINARY_COLUMNS[0]).isNull()).count())

0
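
The three operations above (drop rows where everything is null, drop rows without a target, fill the remaining binary nulls with zero) can be sketched in plain Python. The three sample rows are made up, and the comprehension-based logic is only meant to mirror the semantics of `dropna()` and `fillna()`, not their implementation.

```python
IDENTIFIERS = ["title"]
TARGET = "dessert"

# Made-up records for illustration.
rows = [
    {"title": "cake", "dessert": 1.0, "almond": None},
    {"title": "ghost", "dessert": None, "almond": None},   # all non-identifier fields null
    {"title": "mystery", "dessert": None, "almond": 1.0},  # target is null
]

feature_names = [c for c in rows[0] if c not in IDENTIFIERS]

# Like dropna(how="all", subset=...): drop a row only if all listed fields are null.
kept = [r for r in rows if any(r[c] is not None for c in feature_names)]

# Like dropna(subset=[target]): drop rows whose target is null.
kept = [r for r in kept if r[TARGET] is not None]

# Like fillna(0.0, subset=binary columns): a null flag is treated as "not set".
binary_columns = [c for c in feature_names if c != TARGET]
filled = [
    {**r, **{c: 0.0 for c in binary_columns if r[c] is None}} for r in kept
]
print(filled)
```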


#### Taking care of extreme values: Cleaning continuous columns

We review the distribution of numerical columns to account for extreme or unrealistic values. Many ML models don’t deal well with extreme values. Just like we did with binary columns, taking the time to assess the fit of our numerical columns will pay dividends since we will not feed the wrong information to our ML model.

Before we can start exploring the distribution of our continuous features, we need to make sure they are properly typed. PySpark inferred the type of the `rating` and `calories` columns as strings, whereas they should have been numerical.

The UDF below looks rather complicated, but if we take it slowly, it’s very simple. We return `True` right off the bat if the value is `null`. If we have a non-`null` value, we try to cast the value as a Python `float`. If it fails, `False` it is!



In [10]:
from typing import Optional

@F.udf(T.BooleanType())
def is_a_number(value: Optional[str]) -> bool:
    if not value:
        return True
    
    try:
        _ = float(value)
    except ValueError:
        return False

    return True

food.where(~is_a_number(F.col("rating"))).select(
    *CONTINUOUS_COLUMNS
).show()

+---------+------------+-------+----+------+
|   rating|    calories|protein| fat|sodium|
+---------+------------+-------+----+------+
| Cucumber| and Lemon "|   3.75|null|  null|
+---------+------------+-------+----+------+
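
The body of the UDF is ordinary Python, so it can be exercised without Spark. The sketch below repeats the same logic as a plain function and runs it on a handful of made-up inputs, including a few edge cases worth knowing about.

```python
from typing import Optional


def is_a_number(value: Optional[str]) -> bool:
    """Plain-Python version of the UDF body above."""
    if not value:  # None and the empty string both count as "missing"
        return True
    try:
        float(value)
    except ValueError:
        return False
    return True


# Made-up candidates; " Cucumber" mimics the rogue value found above.
candidates = [None, "", "4.375", " Cucumber", "nan", "1e3"]
results = {repr(c): is_a_number(c) for c in candidates}
for candidate, verdict in results.items():
    print(candidate, verdict)
```

Because `float()` accepts strings such as `"nan"`, `"inf"`, and `"1e3"`, this check treats them as numbers. Nothing shown in this notebook suggests that such strings are present in the data, but it is something to keep in mind when reusing the function elsewhere.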



We have a single remaining rogue record, which we remove in the next cell. After that, our continuous feature columns are all numerical.

In [11]:
for column in ["rating", "calories"]:
    food = food.where(is_a_number(F.col(column)))
    food = food.withColumn(column, F.col(column).cast(T.DoubleType()))

print(food.count(), len(food.columns))

20048 680


Now we want to look at the actual values to remove any ridiculous values that would break the computation of the average. We repeat the summary table displayed in rapid fire during our initial data exploration. We immediately see that some dishes are over the top! Just like with binary features, we need to use our judgment for the best course of action to address this data quality issue. We could filter the records once more, but this time, we’ll cap the values to the 99th percentile, avoiding extreme (and potentially wrong) values.

```python
food.select(*CONTINUOUS_COLUMNS).summary(
    "mean",
    "stddev",
    "min",
    "1%",
    "5%",
    "50%",
    "95%",
    "99%",
    "max",
).show()

# +-------+------------------+------------------+------------------+
# |summary|            rating|          calories|           protein|
# +-------+------------------+------------------+------------------+
# | mean  | 3.714460295291301|6324.0634571930705|100.17385283565179|
# | stddev|1.3409187660508959|359079.83696340164|3840.6809971287403|
# | min   | 0.0              |               0.0|               0.0|
# | 1%    | 0.0              |              18.0|               0.0|
# | 5%    | 0.0              |              62.0|               0.0|
# | 50%   | 4.375            |             331.0|               8.0|
# | 95%   | 5.0              |            1318.0|              75.0|
# | 99%   | 5.0              |            3203.0|             173.0|
# | max   | 5.0              |       3.0111218E7|          236489.0|
# +-------+------------------+------------------+------------------+
# +-------+-----------------+-------------------+
# |summary|              fat|             sodium|
# +-------+-----------------+-------------------+
# | mean  |346.9398083953107|  6226.927244193346|
# | stddev|20458.04034412409|  333349.5680370268|
# | min   |              0.0|                0.0|
# | 1%    |              0.0|                1.0|
# | 5%    |              0.0|                5.0|
# | 50%   |             17.0|              294.0|
# | 95%   |             85.0|             2050.0|
# | 99%   |            207.0|             5661.0|
# | max   |        1722763.0|         2.767511E7|
# +-------+-----------------+-------------------+
```

In [12]:
maximum = {
    "calories": 3203.0,
    "protein": 173.0,
    "fat": 207.0,
    "sodium": 5661.0,
}

for k, v in maximum.items():
    food = food.withColumn(
        k,
        F.when(F.isnull(F.col(k)), F.col(k)).otherwise(
            F.least(F.col(k), F.lit(v))
        ),
    )
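
The capping logic can be illustrated in plain Python on a tiny made-up sample. The nearest-rank percentile used here is an assumption made for this sketch; `summary()` computes approximate percentiles, so its values may differ slightly from an exact computation on large data.

```python
import math


def percentile_nearest_rank(values, q):
    """Nearest-rank percentile of the non-null values (q between 0 and 1)."""
    ordered = sorted(v for v in values if v is not None)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def cap(values, ceiling):
    """Replace values above the ceiling by the ceiling; nulls stay null."""
    return [v if v is None else min(v, ceiling) for v in values]


# Made-up calorie counts with one absurd value and one missing value.
calories = [100.0, 250.0, None, 331.0, 30111218.0]

ceiling = percentile_nearest_rank(calories, 0.75)
capped = cap(calories, ceiling)
print(ceiling, capped)
```

Just like the `F.when(...).otherwise(F.least(...))` construct above, the sketch leaves missing values untouched so that they can be imputed later.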

#### Weeding out the rare binary occurrence columns
Rarely occurring features are an annoyance when building a model, as the machine can pick up a signal that is there by chance.  Binary features with only a few zeroes or ones are not helpful in classifying a recipe as a dessert: if every recipe (or no recipe) has a certain feature as true, then that feature does not discriminate properly, meaning that our model has no use for it.

We compute the sum of each binary column; this gives us the number of 1.0 values, since the sum of the ones is equal to their count. If the sum of a column is below 10 or above the number of records minus 10, we collect the column name to remove it.

In [13]:
inst_sum_of_binary_columns = [
    F.sum(F.col(x)).alias(x) for x in BINARY_COLUMNS
]

sum_of_binary_columns = (
    food.select(*inst_sum_of_binary_columns).head().asDict()
)

num_rows = food.count()
too_rare_features = [
    k
    for k, v in sum_of_binary_columns.items()
    if v < 10 or v > (num_rows - 10)
]

len(too_rare_features) # => 167
print(too_rare_features)
# ['cakeweek', 'wasteless', '30_days_of_groceries',
# [...]
# 'yuca', 'cookbooks', 'leftovers']

BINARY_COLUMNS = [x for x in BINARY_COLUMNS if x not in too_rare_features]

['cakeweek', 'wasteless', '30_days_of_groceries', 'alabama', 'alaska', 'anthony_bourdain', 'apple_juice', 'arizona', 'aspen', 'atlanta', 'australia', 'beverly_hills', 'biscuit', 'boston', 'bran', 'brooklyn', 'brownie', 'buffalo', 'bulgaria', 'burrito', 'cambridge', 'camping', 'canada', 'caviar', 'chicago', 'chili', 'cobbler_crumble', 'columbus', 'cook_like_a_diner', 'cookbook_critic', 'costa_mesa', 'cranberry_sauce', 'crêpe', 'crme_de_cacao', 'cuba', 'cupcake', 'custard', 'dallas', 'denver', 'digestif', 'dominican_republic', 'dorie_greenspan', 'eau_de_vie', 'egg_nog', 'egypt', 'emeril_lagasse', 'england', 'entertaining', 'epi__ushg', 'epi_loves_the_microwave', 'flat_bread', 'frankenrecipe', 'freezer_food', 'friendsgiving', 'frittata', 'fritter', 'germany', 'grains', 'grand_marnier', 'granola', 'grappa', 'guam', 'haiti', 'hamburger', 'hawaii', 'healdsburg', 'hollywood', 'house_cocktail', 'houston', 'hummus', 'iced_coffee', 'idaho', 'illinois', 'indiana', 'iowa', 'israel', 'italy', 'jama

We removed 167 features that are either too rare or too frequent.
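
The filtering rule is easy to test in isolation. Below is a minimal sketch with made-up counts for a hypothetical 1,000-row data set; the `too_rare_or_too_common` helper and its `threshold` parameter are names introduced for this illustration.

```python
def too_rare_or_too_common(ones_per_column, num_rows, threshold=10):
    """Names of binary columns with fewer than `threshold` ones or zeroes."""
    return [
        name
        for name, ones in ones_per_column.items()
        if ones < threshold or ones > num_rows - threshold
    ]


# Made-up number of ones per column, for a data set of 1,000 rows.
ones_per_column = {"always_on": 998, "almost_never": 3, "balanced": 480}

flagged = too_rare_or_too_common(ones_per_column, num_rows=1000)
print(flagged)
```

The threshold of 10 is a judgment call; a larger data set or a more cautious modeler might warrant a higher value.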

### Feature creation and refinement

This section covers two important steps of model building: feature creation (also called feature engineering) and refinement. Feature creation and refinement are where the data scientists can express their judgment and creativity. Our ability to encode meaning and recognize patterns in the data means that our model can more easily pick up on the signal.

We'll look at the following:
- Creating a few custom features using our continuous feature columns
- Measuring correlation over original and generated continuous features

#### Creating custom features

Fundamentally, creating a custom feature in PySpark is nothing more than creating a new column, with a little more thought and notes on the side. Manual feature creation is one of the secret weapons of a data scientist: you can embed business knowledge into highly custom features that can improve your model’s accuracy and interpretability. As an example, we’ll take the protein and fat columns representing the quantity (in grams) of protein and fat in the recipe, respectively. With the information in those two columns, we create two features representing the percentage of calories attributed to each macronutrient, using the usual approximation of 4 kcal per gram of protein and 9 kcal per gram of fat.


In [14]:
food = food.withColumn(
    "protein_ratio", F.col("protein") * 4 / F.col("calories")
).withColumn(
    "fat_ratio", F.col("fat") * 9 / F.col("calories")
)

food = food.fillna(0.0, subset=["protein_ratio", "fat_ratio"])

CONTINUOUS_COLUMNS += ["protein_ratio", "fat_ratio"]
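
The arithmetic behind the two ratios can be checked by hand. The sketch below uses the values of the first record displayed later in this chapter (426 calories, 30 g of protein, 7 g of fat); the `calorie_share` helper is a hypothetical name, and returning 0.0 for missing or zero calories mirrors the `fillna(0.0, ...)` step above.

```python
from typing import Optional


def calorie_share(grams: Optional[float], kcal_per_gram: float,
                  calories: Optional[float]) -> float:
    """Share of calories coming from one macronutrient; 0.0 when undefined."""
    if grams is None or not calories:  # missing value or zero calories
        return 0.0
    return grams * kcal_per_gram / calories


protein_ratio = calorie_share(30.0, 4, 426.0)  # 120 kcal out of 426
fat_ratio = calorie_share(7.0, 9, 426.0)       # 63 kcal out of 426
print(round(protein_ratio, 4), round(fat_ratio, 4))
```

Roughly 28% of the calories of that recipe come from protein and 15% from fat.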

#### Removing highly correlated features

In this section, we take our set of continuous variables and look at the correlation between them in order to improve our model accuracy and explainability. 

Correlation in linear models is not always bad; as a matter of fact, you want your features to be correlated with your target (this provides predictive power). On the other hand, we want to avoid correlation between features for two main reasons:

> **NOTE:**  
> - If two features are highly correlated, it means that they provide almost the same information. In the context of machine learning, this can confuse the fitting algorithm and create model or numerical instability.
> - The more complex your model, the more complex the maintenance. Highly correlated features rarely provide improved accuracy, yet complicate the model. Simple is better.

For computing correlation between variables, PySpark provides the `Correlation` object. `Correlation` has a single method, `corr`, that computes the correlation between features in a `Vector`. `Vectors` are like PySpark arrays but with a special representation optimized for ML work. We use the `VectorAssembler` transformer on the food data frame to create a new column, `continuous_features`, that contains a `Vector` of all our continuous features.

In [15]:
from pyspark.ml.feature import VectorAssembler

continuous_features = VectorAssembler(
    inputCols=CONTINUOUS_COLUMNS, outputCol="continuous_features"
)

vector_food = food.select(CONTINUOUS_COLUMNS)

for x in CONTINUOUS_COLUMNS:
    vector_food = vector_food.where(~F.isnull(F.col(x)))

vector_variable = continuous_features.transform(vector_food)
vector_variable.select("continuous_features").show(3, False)

+---------------------------------------------------------------------+
|continuous_features                                                  |
+---------------------------------------------------------------------+
|[2.5,426.0,30.0,7.0,559.0,0.28169014084507044,0.14788732394366197]   |
|[4.375,403.0,18.0,23.0,1439.0,0.17866004962779156,0.5136476426799007]|
|[3.75,165.0,6.0,7.0,165.0,0.14545454545454545,0.38181818181818183]   |
+---------------------------------------------------------------------+
only showing top 3 rows



In [16]:
vector_variable.select("continuous_features").printSchema()

root
 |-- continuous_features: vector (nullable = true)



We apply the `Correlation.corr()` function on our continuous feature vector and export the correlation matrix into an easily interpretable pandas DataFrame. PySpark returns the correlation matrix in a `DenseMatrix` column type, which is like a two-dimensional vector. In order to extract the values in an easy-to-read format, we have to do a little method juggling:
1. We extract the single record as a `Row` using `head()`.
2. A `Row` is like an ordered dictionary, so we can access the first (and only) field containing our correlation matrix by index (`[0]`).
3. A `DenseMatrix` can be converted into a NumPy array by using the `toArray()` method on the matrix.
4. We can directly create a pandas DataFrame from our NumPy array. Inputting our column names as an index (in this case, they’ll play the role of “row names”) makes our correlation matrix very readable.

In [17]:
from pyspark.ml.stat import Correlation

correlation = Correlation.corr(
    vector_variable, "continuous_features"
)

correlation.printSchema()

root
 |-- pearson(continuous_features): matrix (nullable = false)





In [18]:
correlation_array = correlation.head()[0].toArray()

correlation_pd = pd.DataFrame(
    correlation_array,
    index=CONTINUOUS_COLUMNS,
    columns=CONTINUOUS_COLUMNS,
)
print(correlation_pd.iloc[:, :4])

                 rating  calories   protein       fat
rating         1.000000  0.102257  0.113292  0.111536
calories       0.102257  1.000000  0.757837  0.918052
protein        0.113292  0.757837  1.000000  0.664899
fat            0.111536  0.918052  0.664899  1.000000
sodium         0.065225  0.516818  0.585450  0.421920
protein_ratio  0.094429  0.164735  0.600182  0.125572
fat_ratio      0.129946  0.176823  0.109188  0.424986


In [19]:
print(correlation_pd.iloc[:, 4:])

                 sodium  protein_ratio  fat_ratio
rating         0.065225       0.094429   0.129946
calories       0.516818       0.164735   0.176823
protein        0.585450       0.600182   0.109188
fat            0.421920       0.125572   0.424986
sodium         1.000000       0.339067   0.033819
protein_ratio  0.339067       1.000000   0.024854
fat_ratio      0.033819       0.024854   1.000000
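
For intuition about what each cell of the matrix contains, the Pearson coefficient of two columns can be computed by hand. The sketch below is a plain-Python illustration on made-up values, not a description of how PySpark computes it internally.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values: calories growing in lockstep with fat content.
fat = [1.0, 2.0, 3.0, 4.0]
calories = [10.0, 20.0, 30.0, 40.0]

same_direction = pearson(fat, calories)
opposite_direction = pearson(fat, [-c for c in calories])
print(same_direction, opposite_direction)
```

A coefficient close to 1 or -1 means that one column is almost a linear function of the other, which is why highly correlated features carry nearly the same information.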


The last step in our correlation computation is to assess which variables we want to keep and which we want to drop. There is no absolute threshold for keeping or removing correlated variables (nor is there a protocol for which variable to keep). From the correlation matrix, we see high correlation among calories, protein, and fat (0.92 between calories and fat, 0.76 between calories and protein), with sodium moderately correlated with all three. Surprisingly, our custom features are less correlated with the columns that contributed to their creation than one might expect: 0.60 between `protein` and `protein_ratio`, 0.42 between `fat` and `fat_ratio`, and below 0.20 with `calories`.

We could follow up with these action items:
- Explore the relationship between calorie count and the ratio of macronutrients (and sodium). Is there a pattern there, or is the calorie count (or size of portions) just all over the place?
- Is the calorie/protein/fat/sodium content related to the “dessert-ness” of the recipes? I can’t imagine a dessert being very salty.
- Run the model with all features, then with calories and protein removed. What is the impact on performance?
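
When deciding which variables to investigate first, it can help to list the pairs whose correlation exceeds a chosen cut-off. The sketch below copies the rounded values of the four raw nutrition columns from the matrix printed above; the 0.7 threshold is an arbitrary value picked for this illustration, not a recommended rule.

```python
from itertools import combinations

# Upper triangle of the correlation matrix printed above (rounded values).
corr = {
    ("calories", "protein"): 0.757837,
    ("calories", "fat"): 0.918052,
    ("calories", "sodium"): 0.516818,
    ("protein", "fat"): 0.664899,
    ("protein", "sodium"): 0.585450,
    ("fat", "sodium"): 0.421920,
}

THRESHOLD = 0.7  # arbitrary cut-off chosen for this sketch

columns = ["calories", "protein", "fat", "sodium"]
flagged = [
    (a, b, corr[(a, b)])
    for a, b in combinations(columns, 2)
    if abs(corr[(a, b)]) > THRESHOLD
]
print(flagged)
```

With this cut-off, `calories` appears in both flagged pairs, which makes it a natural candidate for the removal experiment mentioned in the last action item.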

### Feature preparation with transformers and estimators

We use transformers and estimators as an abstraction over common operations in machine learning modeling. We explore two relevant examples of transformers and estimators:
- Null imputation, where we provide a value to replace null occurrences in a column (e.g., the mean)
- Scaling features, where we normalize the values of a column, so they are on a more logical scale (e.g., between zero and one)

Let's compare a VectorAssembler to a function `assemble_vector()` that performs the same work, which is to create a `Vector` named after the argument to outputCol, which contains all the values in the columns passed to inputCols. Don’t focus on the actual work here, but more on the mechanism of application.

<img src="images/function_assemble.png">

The transformer object has a two-staged process. First, when instantiating the transformer, we provide the parameters necessary for its application, but not the data frame on which it’ll be applied. Then, we use the instantiated transformer’s `transform()` method on the data frame to get a transformed data frame. This separation of instructions and data is key in creating serializable ML pipelines, which leads to easier ML experiments and model portability.
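
The difference between the two styles can be sketched in plain Python. Both `assemble_vector` and `ToyVectorAssembler` below are hypothetical stand-ins operating on a list of dictionaries; they only illustrate the mechanism of application and are not how PySpark implements `VectorAssembler`.

```python
def assemble_vector(rows, input_cols, output_col):
    """Function style: parameters and data are passed together."""
    return [{**r, output_col: [r[c] for c in input_cols]} for r in rows]


class ToyVectorAssembler:
    """Transformer style: parameters first, data later."""

    def __init__(self, input_cols, output_col):
        self.input_cols = input_cols
        self.output_col = output_col

    def transform(self, rows):
        return assemble_vector(rows, self.input_cols, self.output_col)


rows = [{"protein": 30.0, "fat": 7.0}]  # made-up record

# Stage 1: instantiate with parameters only. Stage 2: apply to the data.
assembler = ToyVectorAssembler(input_cols=["protein", "fat"], output_col="features")
transformed = assembler.transform(rows)
print(transformed)
```

Because the instantiated object holds only parameters, it can be stored, shared, or applied to a different data set without change.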

#### Imputing continuous features using the Imputer estimator

Estimators are the main abstraction used by Spark for any data-dependent transformation, including ML models, so they are pervasive in any ML code using PySpark. At the core, an estimator is a transformer-generating object. We instantiate an estimator object just like a transformer by providing the relevant parameters to its constructor. To apply an estimator, we call the `fit()` method, which returns a Model object, which is, for all purposes, the same as a transformer. Estimators allow for automatically creating transformers that depend on the data.

<img src="images/imputer_estimator.png">
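
The estimator mechanism can be mimicked in a few lines of plain Python. The `ToyMeanImputer` and `ToyMeanImputerModel` classes below are hypothetical and work on a simple list; they only illustrate how `fit()` turns data into a transformer-like object.

```python
class ToyMeanImputerModel:
    """The 'transformer' produced by fitting: it remembers the learned mean."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, values):
        return [self.mean if v is None else v for v in values]


class ToyMeanImputer:
    """The 'estimator': it holds no data-dependent state until fit() is called."""

    def fit(self, values):
        known = [v for v in values if v is not None]
        return ToyMeanImputerModel(sum(known) / len(known))


calories = [400.0, None, 600.0]  # made-up values

model = ToyMeanImputer().fit(calories)  # learns the mean from the data
imputed = model.transform(calories)     # applies it like any transformer
print(model.mean, imputed)
```

The fitted model can then be applied to other data (for example, a test set) while still using the mean learned during fitting.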


As an example, we want our Imputer to impute the mean value to every record in the calories, protein, fat, and sodium columns when the record is `null`.

In [20]:
from pyspark.ml.feature import Imputer

OLD_COLS = ["calories", "protein", "fat", "sodium"]
NEW_COLS = ["calories_i", "protein_i", "fat_i", "sodium_i"]

imputer = Imputer(
    strategy="mean",
    inputCols=OLD_COLS,
    outputCols=NEW_COLS,
)

imputer_model = imputer.fit(food)

CONTINUOUS_COLUMNS = [
    x for x in CONTINUOUS_COLUMNS if x not in OLD_COLS
] + NEW_COLS

We apply the resulting ImputerModel just like with any transformer by using the `transform()` method.

In [21]:
food_imputed = imputer_model.transform(food)

food_imputed.where("calories is null").select("calories", "calories_i").show(5, False)


+--------+-----------------+
|calories|calories_i       |
+--------+-----------------+
|null    |475.5222194325885|
|null    |475.5222194325885|
|null    |475.5222194325885|
|null    |475.5222194325885|
|null    |475.5222194325885|
+--------+-----------------+
only showing top 5 rows



#### Scaling our features using the MinMaxScaler estimator

Scaling variables means performing a mathematical transformation on the variables so that they are all on the same numeric scale. When using a linear model, having scaled features means that your model coefficients (the weight of each feature) are comparable.

Since we have so many binary variables, it is convenient to have every variable be between zero and one. Our protein_ratio and fat_ratio are ratios between zero and one too! PySpark provides the MinMaxScaler for this use case: for each value in the input column, it creates a normalized output between 0.0 and 1.0.

<img src="images/min_max_scale.png">


In [22]:
from pyspark.ml.feature import MinMaxScaler

CONTINUOUS_NB = [x for x in CONTINUOUS_COLUMNS if "ratio" not in x]

continuous_assembler = VectorAssembler(
    inputCols=CONTINUOUS_NB, outputCol="continuous"
)

food_features = continuous_assembler.transform(food_imputed)

continuous_scaler = MinMaxScaler(
    inputCol="continuous",
    outputCol="continuous_scaled",
)

food_features = continuous_scaler.fit(food_features).transform(food_features)
food_features.select("continuous_scaled").show(3, False)

+-----------------------------------------------------------------------------------------+
|continuous_scaled                                                                        |
+-----------------------------------------------------------------------------------------+
|[0.5,0.13300031220730565,0.17341040462427745,0.033816425120772944,0.09874580462815757]   |
|[0.875,0.12581954417733376,0.10404624277456646,0.1111111111111111,0.2541953718424307]    |
|[0.75,0.051514205432407124,0.03468208092485549,0.033816425120772944,0.029146793852676208]|
+-----------------------------------------------------------------------------------------+
only showing top 3 rows
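
The first scaled row can be reproduced by hand with the min-max formula, (value - min) / (max - min). The sketch below assumes that every column has a minimum of 0 and that the maxima are the top of the rating scale and the 99th-percentile caps applied earlier, which is consistent with the summary table; the handling of constant columns is a choice made for this sketch.

```python
def min_max_scale(value, col_min, col_max):
    """Rescale a value to [0, 1] given the column's minimum and maximum."""
    if col_max == col_min:
        return 0.5  # sketch-only choice for constant columns
    return (value - col_min) / (col_max - col_min)


# First record seen earlier: rating, calories, protein, fat, sodium.
row = [2.5, 426.0, 30.0, 7.0, 559.0]
# Assumed maxima: rating scale and the caps from the `maximum` dictionary.
maxima = [5.0, 3203.0, 173.0, 207.0, 5661.0]

scaled = [min_max_scale(v, 0.0, m) for v, m in zip(row, maxima)]
print([round(s, 4) for s in scaled])
```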



All the variables in our continuous_scaled vector are now between zero and one. Our continuous variables are ready; our binary variables are ready. 