# Setting the stage: Preparing features for Machine Learning

This chapter focuses on the most important stage of machine learning for your use case: exploring, understanding, preparing, and giving purpose to your data. More specifically, we prepare a data set by cleaning the data, creating new features (the fields that will serve in training the model), and then selecting a curated set of features based on how promising they look.

### Reading, exploring, and preparing our machine learning data set
For our ML model, we chose a data set of 20,057 dishes described by 680 columns that characterize the ingredient list, the nutritional content, and the category of each dish. The goal here is to predict whether a dish is a dessert.

At their core, data cleanup, exploration, and feature preparation are purpose-driven data transformations. The data set is available from Kaggle [here](https://www.kaggle.com/hugodarwood/epirecipes). The CSV file is `epi_r.csv`.

We will allocate 4 gibibytes of RAM to the driver. Then we will read the data into a data frame using the CSV-specialized method of the `DataFrameReader` object (`spark.read.csv`) and print the dimensions of the data frame: 20,057 rows and 680 columns.

In [1]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import pyspark.sql.types as T

spark = (
    SparkSession.builder.appName("Recipes ML model - Are you a dessert?")
    .config("spark.driver.memory", "4g")
    .getOrCreate()
)

food = spark.read.csv(
    "./data/recipes/epi_r.csv", inferSchema=True, header=True
)

print(food.count(), len(food.columns))

20057 680


In [2]:
food.printSchema()

root
 |-- title: string (nullable = true)
 |-- rating: string (nullable = true)
 |-- calories: string (nullable = true)
 |-- protein: double (nullable = true)
 |-- fat: double (nullable = true)
 |-- sodium: double (nullable = true)
 |-- #cakeweek: double (nullable = true)
 |-- #wasteless: double (nullable = true)
 |-- 22-minute meals: double (nullable = true)
 |-- 3-ingredient recipes: double (nullable = true)
 |-- 30 days of groceries: double (nullable = true)
 |-- advance prep required: double (nullable = true)
 |-- alabama: double (nullable = true)
 |-- alaska: double (nullable = true)
 |-- alcoholic: double (nullable = true)
 |-- almond: double (nullable = true)
 |-- amaretto: double (nullable = true)
 |-- anchovy: double (nullable = true)
 |-- anise: double (nullable = true)
 |-- anniversary: double (nullable = true)
 |-- anthony bourdain: double (nullable = true)
 |-- aperitif: double (nullable = true)
 |-- appetizer: double (nullable = true)
 |-- apple: double (nullable = true)


#### Standardizing column names using toDF()

We process all the column names to give them a uniform look and facilitate their subsequent usage. We remove anything that isn't a letter or a number, standardize the spaces and other separators to use the underscore (`_`) character, and replace the ampersand (`&`) with its English equivalent, `and`. While not mandatory, this helps us write a clearer program and improves the consistency of our column names, which reduces typos and mistakes.

In [3]:
def sanitize_column_name(name):
    """Drops unwanted characters from the column name.
    We replace spaces, dashes, and slashes with underscores, spell out
    ampersands as "and", and only keep alphanumeric characters."""
    
    answer = name
    
    for i, j in ((" ", "_"), ("-", "_"), ("/", "_"), ("&", "and")):
        answer = answer.replace(i, j)
    return "".join(
        [
            char
            for char in answer
            if char.isalpha() or char.isdigit() or char == "_"
        ]
    )

food = food.toDF(*[sanitize_column_name(name) for name in food.columns])
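
To see what the sanitizer does, here is a small, self-contained sketch that applies the same function to a few raw names. The first three names come from the schema printed above; `kitchen & bath` is a hypothetical name used only to exercise the ampersand rule. The duplicate check at the end is our own assumption, not part of the original notebook: it guards against two raw names collapsing into the same sanitized name, which could lead to ambiguous column references later.

```python
from collections import Counter


def sanitize_column_name(name):
    """Same logic as the notebook cell above, repeated so this sketch runs alone."""
    answer = name
    for old, new in ((" ", "_"), ("-", "_"), ("/", "_"), ("&", "and")):
        answer = answer.replace(old, new)
    return "".join(
        char
        for char in answer
        if char.isalpha() or char.isdigit() or char == "_"
    )


# "kitchen & bath" is a hypothetical name; the others appear in the schema.
raw_names = [
    "#cakeweek",
    "22-minute meals",
    "30 days of groceries",
    "kitchen & bath",
]
clean_names = [sanitize_column_name(name) for name in raw_names]

for raw, clean in zip(raw_names, clean_names):
    print(f"{raw!r} -> {clean!r}")

# Sanitized names that occur more than once would be ambiguous columns.
duplicates = [name for name, count in Counter(clean_names).items() if count > 1]
print("duplicates:", duplicates)
```

On the real data frame, the same check could be run on the sanitized version of `food.columns` before calling `toDF()`.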

#### Exploring our data and getting our first feature columns

A natural first step is to look at a summary of the data contained in each column. Since there are so many columns, we will not print every summary here, but the code and its output would look similar to the following.

```python
for x in food.columns:
    food.select(x).summary().show()

# many tables looking like this one.
# +-------+----------------------+
# |summary|                 clove|
# +-------+----------------------+
# | count | 20052                |
# | mean  | 0.009624975064831438 |
# | stddev| 0.09763611178399834  |
# | min   | 0.0                  |
# | 25%   | 0.0                  |
# | 50%   | 0.0                  |
# | 75%   | 0.0                  |
# | max   | 1.0                  |
# +-------+----------------------+
```
 
In our summary data, we are looking at numerical columns. In machine learning, we classify numerical features into two categories: `categorical` or `continuous`. A categorical feature takes one of a limited set of discrete values, such as the month of the year (1 to 12). A continuous feature can take any value within a range, such as the price of an item. We can subdivide the categorical family into three main types:
- `Binary` (or `dichotomous`), when you have only two choices (0/1, true/false)
- `Ordinal`, when the categories have a certain ordering (e.g., the position in a race) that matters.
- `Nominal`, when the categories have no specific ordering (e.g., the color of an item)
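
As a rough illustration of these definitions, the sketch below guesses a feature's kind from its observed values. The function name `guess_feature_kind` and the `max_categories` threshold are hypothetical choices made for this example, not a standard rule. The values alone cannot tell ordinal from nominal, since that distinction depends on what the categories mean.

```python
def guess_feature_kind(values, max_categories=12):
    """Heuristic guess of a numerical feature's kind (assumed thresholds).

    Nulls (None) are ignored, then:
    - exactly two distinct values     -> "binary"
    - a few distinct whole numbers    -> "categorical"
    - anything else                   -> "continuous"
    """
    distinct = {v for v in values if v is not None}
    if len(distinct) == 2:
        return "binary"
    few_values = len(distinct) <= max_categories
    whole_numbers = all(float(v).is_integer() for v in distinct)
    if few_values and whole_numbers:
        return "categorical"
    return "continuous"


clove = [0.0, 1.0, 0.0, None]   # two choices, with one null
month = list(range(1, 13))      # 1 to 12
price = [9.99, 12.5, 3.25]      # hypothetical item prices

print(guess_feature_kind(clove))
print(guess_feature_kind(month))
print(guess_feature_kind(price))
```

A heuristic like this is only a starting point: a column holding a handful of whole-number prices would be misread as categorical, so the guess still needs a human check.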

<img src="images/types_of_numerical_features.png">




Looking at our summary data, it seems that we have a lot of potentially binary columns. In the case of the clove column, the minimum and three quartile values are all zero. To verify this, we’ll group the entire data frame and collect a set of distinct values. If we have only two values for a given column, binary it is! 

In [5]:
import pandas as pd
import warnings

warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning)

pd.set_option("display.max_rows", 100)

is_binary = food.agg(
    *[
        (F.size(F.collect_set(x)) == 2).alias(x)
        for x in food.columns
    ]
).toPandas()

is_binary.unstack()

title       0    False
rating      0    False
calories    0    False
protein     0    False
fat         0    False
                 ...  
cookbooks   0     True
leftovers   0     True
snack       0     True
snack_week  0     True
turkey      0     True
Length: 680, dtype: bool

Most columns are binary, but let's take a closer look at the `cakeweek` and `wasteless` columns.

#### Addressing data mishaps and building our first feature set

We investigate some seemingly incoherent features and, following our findings, clean our data set. We also identify our first feature set, along with each feature type. This section is an example of forensic data exploration.

In [6]:
food.agg(*[F.collect_set(x) for x in ("cakeweek", "wasteless")]).show(1, False)

food.where("cakeweek > 1.0 or wasteless > 1.0").select(
    "title", "rating", "wasteless", "cakeweek", food.columns[-1]
).show()


+-------------------------------+----------------------+
|collect_set(cakeweek)          |collect_set(wasteless)|
+-------------------------------+----------------------+
|[0.0, 1.0, 1188.0, 24.0, 880.0]|[0.0, 1.0, 1439.0]    |
+-------------------------------+----------------------+

+--------------------+--------------------+---------+--------+------+
|               title|              rating|wasteless|cakeweek|turkey|
+--------------------+--------------------+---------+--------+------+
|"Beet Ravioli wit...| Aged Balsamic Vi...|      0.0|   880.0|   0.0|
|"Seafood ""Catapl...|            Vermouth|   1439.0|    24.0|   0.0|
|"""Pot Roast"" of...| Aunt Gloria-Style "|      0.0|  1188.0|   0.0|
+--------------------+--------------------+---------+--------+------+



For three records, it seems that the titles contain a bunch of quotation marks along with some commas, which confused PySpark's otherwise robust parser and shifted the values into the wrong columns. Let's drop these three records.
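
To make the failure mode concrete, the sketch below uses Python's standard `csv` module on a hypothetical record (not taken from the data set) whose title contains both doubled quotation marks and a comma. Splitting naively on commas yields one field too many, which is the kind of column shift visible in the table above. A reader that treats a doubled quote as an escaped quote recovers the three intended fields.

```python
import csv
import io

# Hypothetical record: title, rating, and one binary column.
# The title itself contains double quotes (escaped by doubling) and a comma.
raw_line = '"""Quick"" Soup, Family Style",4.375,0.0\n'

# Naive approach: the comma inside the title splits it in two.
naive_fields = raw_line.rstrip("\n").split(",")

# Quote-aware parsing with doubled-quote escaping (the csv module default).
parsed_fields = next(csv.reader(io.StringIO(raw_line)))

print(len(naive_fields), naive_fields)
print(len(parsed_fields), parsed_fields)
```

We have not verified this against the file, but since Spark's CSV reader uses the backslash as its default escape character, passing `escape='"'` to `spark.read.csv` may be worth trying as an alternative to dropping the records.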

In [7]:
food = food.where(
    (
        F.col("cakeweek").isin([0.0, 1.0])
        | F.col("cakeweek").isNull()
    )
    & (
        F.col("wasteless").isin([0.0, 1.0])
        | F.col("wasteless").isNull()
    )
)

print(food.count(), len(food.columns))

20054 680


Now that we have identified two binary-in-hiding feature columns, we can identify our feature set and our target variable. The target (or label) is the column containing the value we want to predict. In our case, the column is aptly named `dessert`. We create all-caps variables containing the four main sets of columns we care about:
- The identifiers, which are the column(s) that contain the information unique to each record
- The targets, which are the column(s) (most often one) that contain the value we wish to predict
- The continuous columns, containing continuous features
- The binary columns, containing binary features

In [8]:
IDENTIFIERS = ["title"]

CONTINUOUS_COLUMNS = [
    "rating",
    "calories",
    "protein",
    "fat",
    "sodium",
]

TARGET_COLUMN = ["dessert"]

BINARY_COLUMNS = [
    x
    for x in food.columns
    if x not in CONTINUOUS_COLUMNS
    and x not in TARGET_COLUMN
    and x not in IDENTIFIERS
]
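
Because `BINARY_COLUMNS` is defined as "everything else", a typo in one of the hand-written lists silently pushes a column into the wrong group. The sketch below is one possible sanity check; `check_partition` is a hypothetical helper, and the shortened column list stands in for `food.columns`. It reports columns that belong to no group, to several groups, or that are named in a group but absent from the data frame.

```python
def check_partition(all_columns, *groups):
    """Return (missing, duplicated, unknown) column names.

    missing:    columns of the data frame that appear in no group
    duplicated: columns that appear in more than one group (or twice in one)
    unknown:    names listed in a group but absent from the data frame
    """
    counts = {}
    for group in groups:
        for column in group:
            counts[column] = counts.get(column, 0) + 1
    missing = [c for c in all_columns if c not in counts]
    duplicated = [c for c, n in counts.items() if n > 1]
    unknown = [c for c in counts if c not in all_columns]
    return missing, duplicated, unknown


# Hypothetical, shortened stand-in for food.columns.
columns = ["title", "rating", "calories", "protein", "fat", "sodium",
           "cakeweek", "dessert"]

IDENTIFIERS = ["title"]
CONTINUOUS_COLUMNS = ["rating", "calories", "protein", "fat", "sodium"]
TARGET_COLUMN = ["dessert"]
BINARY_COLUMNS = [
    c for c in columns
    if c not in CONTINUOUS_COLUMNS + TARGET_COLUMN + IDENTIFIERS
]

missing, duplicated, unknown = check_partition(
    columns, IDENTIFIERS, CONTINUOUS_COLUMNS, TARGET_COLUMN, BINARY_COLUMNS
)
print(missing, duplicated, unknown)

# A misspelled name shows up as unknown instead of going unnoticed.
_, _, typo_unknown = check_partition(columns, ["ratting"])
print(typo_unknown)
```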

In this section, we rapidly cleaned our data. In practice, this stage will take more than half your time when building an ML model. Fortunately, data cleaning is principled data manipulation, so you can leverage everything in the PySpark tool kit we've built so far.

#### Weeding out useless records and imputing binary features

This section covers the deletion of useless records, those that provide no information to our ML model. In our case, this means removing two types of records:
- Those where all the features are null
- Those where the target is null
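
The two rules above can be sketched in plain Python on a handful of hypothetical records; `keep_record` is an illustrative helper, not part of the notebook. On the PySpark data frame, the same filtering would likely rely on `dropna()` with its `how` and `subset` parameters.

```python
def keep_record(record, feature_columns, target_column):
    """Keep a record only if its target is known and at least one feature is known."""
    if record.get(target_column) is None:
        return False  # target is null: nothing to learn from
    # Drop the record when every feature is null.
    return any(record.get(col) is not None for col in feature_columns)


# Hypothetical records; None plays the role of a null value.
records = [
    {"title": "A", "calories": 120.0, "clove": 0.0, "dessert": 1.0},
    {"title": "B", "calories": None, "clove": None, "dessert": 0.0},   # all features null
    {"title": "C", "calories": 300.0, "clove": 1.0, "dessert": None},  # target null
    {"title": "D", "calories": None, "clove": 1.0, "dessert": 0.0},    # one feature known
]
features = ["calories", "clove"]

kept = [r["title"] for r in records if keep_record(r, features, "dessert")]
print(kept)
```

Record `D` survives because a single known feature is enough; only records with no usable feature or no target are removed.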