# <span style = "color:rebeccapurple"> Preprocessing</span>

<span style="text-transform: uppercase;
        font-size: 14px;
        letter-spacing: 1px;
        font-family: 'Segoe UI', sans-serif;">
    Author
</span><br>
efrén cruz cortés
<hr style="border: none; height: 1px; background: linear-gradient(to right, transparent 0%, #ccc 10%, transparent 100%); margin-top: 10px;">

**Main Points**
- Standardization
- Normalization
- Categorical Variables

## <span style = "color:darkorchid"> Imports</span>

In [17]:
# :: IMPORTS ::

# Scikit-learn specifics:
from sklearn import datasets
from sklearn import preprocessing

# Helper modules
import pandas as pd
import numpy as np

## <span style = "color:darkorchid">Preprocessing: Transforming data with scikit-learn</span>

Many machine learning algorithms rely on distances between data points. Those distances depend on the scale of each dimension (each measured feature). Hence, if the scales differ too much (say one feature is measured in the thousands and another in decimals), the large-scale feature dominates the distance and we get suboptimal, if not completely disastrous, results. There are many ways to transform your data; which one you choose depends on the type of data and the algorithm you are using. Here are three commonly used transformations:
<ul>
    <li>Standardization</li>
    <li>Normalization</li>
    <li>Encoding categorical features</li>
</ul>

`scikit-learn` uses the module `preprocessing` to perform the operations above.

We have done the necessary imports at the top of this notebook. Now let's review these transformations one by one.

### <span style="color:teal">Standardization</span>

Standardization is a statistics-based approach to bringing all features to a similar scale. It computes the mean and standard deviation over the observations of a feature, then subtracts the mean from each observation and divides by the standard deviation. The transformed feature has mean $0$ and standard deviation $1$.

The name *standardization* comes from computing the standard score, or z-score, of your observations. This is done in the following manner:
$$
z_i = \frac{x_i - \hat{\mu}}{\hat{\sigma}}
$$
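
As a quick sanity check of the formula, here is a minimal sketch that computes z-scores by hand with numpy. The five values are toy data, and `x`, `mu_hat`, `sigma_hat`, and `z` are illustrative names.

```python
import numpy as np

# Toy observations of a single feature (illustrative values)
x = np.array([181.0, 186.0, 195.0, 193.0, 190.0])

mu_hat = x.mean()     # sample mean
sigma_hat = x.std()   # standard deviation (numpy divides by n by default)

# z-score: subtract the mean, divide by the standard deviation
z = (x - mu_hat) / sigma_hat

print(z)
print(f"mean: {z.mean():.2f}, std: {z.std():.2f}")
```

Up to floating point error, the transformed values should have mean $0$ and standard deviation $1$.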

Let's try it with one of the penguin columns.

In [2]:
penguins_df = pd.read_csv("data/penguins.csv")

In [3]:
# This is how flipper length looks
penguins_df[["flipper_length_mm"]].head(10)

|   | flipper_length_mm |
|---|---|
| 0 | 181.0 |
| 1 | 186.0 |
| 2 | 195.0 |
| 3 | 193.0 |
| 4 | 190.0 |
| 5 | 181.0 |
| 6 | 195.0 |
| 7 | 182.0 |
| 8 | 191.0 |
| 9 | 198.0 |


Did you notice I used double brackets? `scikit-learn` expects $2$-dimensional input (rows are observations, columns are features). Double brackets return a pandas dataframe rather than a pandas series, so even though we use only one column, we keep it as a dataframe. Similarly, if your input is a numpy array, it must be $2$-dimensional (like a matrix, even if it is a one-column matrix).
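
To make the difference concrete, here is a small sketch comparing shapes. The dataframe `toy_df` and its values are made up for illustration.

```python
import pandas as pd

toy_df = pd.DataFrame({"flipper_length_mm": [181.0, 186.0, 195.0]})

as_series = toy_df["flipper_length_mm"]     # single brackets -> Series, 1-dimensional
as_frame = toy_df[["flipper_length_mm"]]    # double brackets -> DataFrame, 2-dimensional

print(as_series.shape)   # (3,)
print(as_frame.shape)    # (3, 1)

# A 1-dimensional numpy array can be turned into a one-column matrix with reshape
as_column = as_series.to_numpy().reshape(-1, 1)
print(as_column.shape)   # (3, 1)
```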

In [4]:
# Create a standard scaler object:
z_scaler = preprocessing.StandardScaler()
z_scaler

Depending on your IDE, you may get a handy-dandy visual representation of your object. If you hover over the information icon, you will see it says *Not fitted*.

In [5]:
# Fit it to the data
z_scaler.fit(penguins_df[["flipper_length_mm"]])

What does fitting do? The fitted scaler now stores $\hat{\mu}$ and $\hat{\sigma}^2$, that is, the sample mean and variance. The standard deviation $\hat{\sigma}$ used for scaling is available as `scale_`.

In [6]:
print(f"Mean: {z_scaler.mean_}\nVariance: {z_scaler.var_}")

Mean: [200.96696697]
Variance: [195.85176167]
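
The scaler reports a variance, while the z-score formula divides by a standard deviation. A minimal sketch on toy data (the values and the name `toy_df` are illustrative) suggests how the fitted attributes relate: `scale_` is the square root of `var_`, and both use the divide-by-$n$ convention, which differs from the pandas default (divide by $n-1$).

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

toy_df = pd.DataFrame({"flipper_length_mm": [181.0, 186.0, 195.0, 193.0, 190.0]})

scaler = preprocessing.StandardScaler().fit(toy_df)

print(scaler.mean_)    # sample mean of each column
print(scaler.var_)     # variance of each column (divides by n)
print(scaler.scale_)   # standard deviation used for scaling: sqrt(var_)

# numpy's default (ddof=0) matches the scaler; pandas' default (ddof=1) does not
print(np.var(toy_df["flipper_length_mm"].to_numpy()))
print(toy_df["flipper_length_mm"].var())
```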


We can use the `transform` method to transform our data:

In [7]:
# Transform the data into z-scores
z_flipper_length = z_scaler.transform(penguins_df[["flipper_length_mm"]])

In [8]:
# Let's check our transformed data
z_flipper_length[0:10]

array([[-1.42675157],
       [-1.06947358],
       [-0.42637319],
       [-0.56928439],
       [-0.78365118],
       [-1.42675157],
       [-0.42637319],
       [-1.35529597],
       [-0.71219559],
       [-0.2120064 ]])

Notice that the output is a numpy array. When working with `sklearn` I recommend copious use of `numpy`, but if you are a hardcore Pandas fan, you can get dataframes back instead: we'll look into the method `.set_output()` later.

<b>Review, with two columns</b>

In [9]:
# Let's use two columns this time
x = penguins_df[["flipper_length_mm", "bill_length_mm"]]

# Create a standard scaler object
z_scaler = preprocessing.StandardScaler()

# Fit it to the data
z_scaler.fit(x)

# Transform the data
z = z_scaler.transform(x)

# Let's see the first few rows
z[0:10]

array([[-1.42675157, -0.89604189],
       [-1.06947358, -0.82278787],
       [-0.42637319, -0.67627982],
       [-0.56928439, -1.33556603],
       [-0.78365118, -0.85941488],
       [-1.42675157, -0.9326689 ],
       [-0.42637319, -0.87772838],
       [-1.35529597, -0.52977177],
       [-0.71219559, -0.98760942],
       [-0.2120064 , -1.72014965]])
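
Fitting and transforming the same data is common enough that scikit-learn transformers also offer `fit_transform`, which does both in one call. A small sketch on toy data (the dataframe `toy_df` and its values are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

toy_df = pd.DataFrame({
    "flipper_length_mm": [181.0, 186.0, 195.0, 193.0],
    "bill_length_mm": [39.1, 39.5, 40.3, 36.7],
})

# Two steps: fit, then transform
two_step = preprocessing.StandardScaler().fit(toy_df).transform(toy_df)

# One step: fit_transform
one_step = preprocessing.StandardScaler().fit_transform(toy_df)

print(np.allclose(two_step, one_step))   # True
```

Keeping the two steps separate is still useful when the data you transform is not the data you fitted on.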

![penguin-vitrubius](images/vitrubian_penguin.png){width=25%}

Let's compare the standardized and non-standardized data. Don't worry about the code below; it's not what we care about right now.

#### <span style="color:blue">Example - Building Intuition</span>

Imagine we want to run an ML algorithm (clustering, for example) and find that our features come in completely different scales:

In [10]:
var1 = 'flipper_length_mm'
var2 = 'body_mass_g'
penguins_df[[var1, var2]].head()

|   | flipper_length_mm | body_mass_g |
|---|---|---|
| 0 | 181.0 | 3750.0 |
| 1 | 186.0 | 3800.0 |
| 2 | 195.0 | 3250.0 |
| 3 | 193.0 | 3450.0 |
| 4 | 190.0 | 3650.0 |


In [11]:
# Pick sample points to track in our visualization
sample_idx = [1, 50, 100]
sample_colors = ['red', 'blue', 'green']
sample_points = [np.array([penguins_df[var1].iloc[idx], penguins_df[var2].iloc[idx]]) for idx in sample_idx]

# Standardize data
z_scaler = preprocessing.StandardScaler()
z_scaler.fit(penguins_df[[var1, var2]])
z = z_scaler.transform(penguins_df[[var1, var2]])

Let's visualize the data, highlighting three sample points (the code generating the image below can be found in the `support_materials.ipynb` file).

![standardization-comparison](images/standardization_comparison.png){width=90%}

**Question**

Which point is closest to RED?

In [13]:
# If we take distances in original scale:
d_raw_rb = np.linalg.norm(sample_points[0] - sample_points[1]) # dist between red and blue point
d_raw_rg = np.linalg.norm(sample_points[0] - sample_points[2]) # dist between red and green point
print(f"red -> blue (raw): {d_raw_rb:.1f}\nred -> green (raw): {d_raw_rg:.1f}\n")

red -> blue (raw): 250.0
red -> green (raw): 51.7



In [15]:
# If we take distances in standard scale:
z_sample_points = [np.array([z[idx, 0], z[idx, 1]]) for idx in sample_idx]
d_std_rb = np.linalg.norm(z_sample_points[0] - z_sample_points[1])
d_std_rg = np.linalg.norm(z_sample_points[0] - z_sample_points[2])
print(f"red -> blue (standardized): {d_std_rb:.1f}\nred -> green (standardized): {d_std_rg:.1f}\n")

red -> blue (standardized): 0.3
red -> green (standardized): 0.9
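
In the output above, green is closest to red on the raw scale, but blue is closest once both features are standardized: the raw distance is driven almost entirely by body mass, which is measured in the thousands. The sketch below reproduces this flip on five made-up points (the numbers and the names `X`, `red`, `blue`, `green` are illustrative, not taken from the penguins data), standardizing by hand with numpy in the same way `StandardScaler` does.

```python
import numpy as np

# Columns: flipper length (mm), body mass (g). Made-up values.
X = np.array([
    [186.0, 3800.0],   # red
    [188.0, 3550.0],   # blue
    [215.0, 3850.0],   # green
    [175.0, 2900.0],
    [230.0, 6000.0],
])
red, blue, green = 0, 1, 2

def dist(points, i, j):
    """Euclidean distance between rows i and j."""
    return np.linalg.norm(points[i] - points[j])

# Column-wise z-scores
Z = (X - X.mean(axis=0)) / X.std(axis=0)

print(f"raw:          red-blue {dist(X, red, blue):.1f}, red-green {dist(X, red, green):.1f}")
print(f"standardized: red-blue {dist(Z, red, blue):.2f}, red-green {dist(Z, red, green):.2f}")
```

On the raw scale the 250 g difference in body mass swamps the 2 mm difference in flipper length; after standardization both features contribute on comparable terms.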



#### <span style = "color:red"> EXERCISE</span>

Pick any numerical column(s) from your fish dataframe (except weight) and standardize it/them!

In [None]:
# I'll reload the data for you
fish_df = pd.read_csv("data/fish.csv")

In [None]:
# Create a standard scaler object


# Fit it to the data


# Transform the data


In [None]:
# View the first few elements


### <span style="color:teal">Normalization</span>

While standardization is statistics-based, normalization is geometry-based. That means there is an important assumption: the data point to be normalized is assumed to be a vector in a vector space. If you don't know what this is, let's not worry about it at this point. For now, this means we cannot use it on categorical data.

Another important point is that normalization uses all the coordinates of a single data point. That is, it normalizes per row. In contrast, standardization uses column statistics computed over the whole sample.

If you are interested, normalization is the process of transforming a vector so it has unit norm:
$$
x_{\text{norm}} = \frac{x}{\left\| x \right\|}
$$
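
The formula can be sketched by hand with numpy. The array `X` below is made up for illustration; each row is divided by its own Euclidean norm.

```python
import numpy as np

X = np.array([
    [3.0, 4.0],
    [1.0, 0.0],
    [6.0, 8.0],
])

# Euclidean norm of each row, kept as a column so it broadcasts across the row
row_norms = np.linalg.norm(X, axis=1, keepdims=True)

X_norm = X / row_norms
print(X_norm)                           # rows 1 and 3 both become [0.6, 0.8]
print(np.linalg.norm(X_norm, axis=1))   # every row now has norm 1
```

Note that the first and third rows point in the same direction, so they end up as the same unit vector: normalization keeps direction and discards length.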

Let's try it with the first few rows from diabetes:

In [18]:
diabetes = datasets.load_diabetes()

In [19]:
# Rows before normalization
diabetes.data[0:3]

array([[ 0.03807591,  0.05068012,  0.06169621,  0.02187239, -0.0442235 ,
        -0.03482076, -0.04340085, -0.00259226,  0.01990749, -0.01764613],
       [-0.00188202, -0.04464164, -0.05147406, -0.02632753, -0.00844872,
        -0.01916334,  0.07441156, -0.03949338, -0.06833155, -0.09220405],
       [ 0.08529891,  0.05068012,  0.04445121, -0.00567042, -0.04559945,
        -0.03419447, -0.03235593, -0.00259226,  0.00286131, -0.02593034]])

In [20]:
# Create normalizer
norm_scaler = preprocessing.Normalizer(norm = "l2")    # <-- "l2" indicates Euclidean norm
norm_scaler

In [21]:
# Fit to data
norm_scaler.fit(diabetes.data)

In [22]:
# Transform data
norm_diab = norm_scaler.transform(diabetes.data)
norm_diab

array([[ 0.32100638,  0.42726865,  0.52014193, ..., -0.02185457,
         0.16783396, -0.14876911],
       [-0.01166161, -0.27661457, -0.31895057, ..., -0.24471426,
        -0.42340521, -0.57132725],
       [ 0.65740682,  0.39059652,  0.34258975, ..., -0.01997881,
         0.02205238, -0.1998476 ],
       ...,
       [ 0.42498708,  0.51640371, -0.16207644, ..., -0.11289447,
        -0.47770833,  0.15784238],
       [-0.4486938 , -0.44049558,  0.38544075, ...,  0.26207365,
         0.43938148, -0.25586427],
       [-0.19283454, -0.18931121, -0.30969865, ..., -0.16747907,
        -0.01790212,  0.0129952 ]], shape=(442, 10))

And double check:

In [23]:
np.linalg.norm(norm_diab[0])

np.float64(1.0)

**Note**

Normalization actually does not require fitting (you are not computing a sample mean and standard deviation as in standardization). Hence, the fit method above is kind of redundant. However, it is kept for consistency among other transformation methods. There is actually a shortcut function that can be used directly:

In [24]:
# Normalization using one function:
norm_diab = preprocessing.normalize(diabetes.data[0:3], norm = "l2")

norm_diab

array([[ 0.32100638,  0.42726865,  0.52014193,  0.18439942, -0.37283485,
        -0.29356325, -0.36589932, -0.02185457,  0.16783396, -0.14876911],
       [-0.01166161, -0.27661457, -0.31895057, -0.1631342 , -0.05235113,
        -0.11874249,  0.46107903, -0.24471426, -0.42340521, -0.57132725],
       [ 0.65740682,  0.39059652,  0.34258975, -0.04370249, -0.35143933,
        -0.26354002, -0.24937026, -0.01997881,  0.02205238, -0.1998476 ]])
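
One way to see that fitting is redundant here: a `Normalizer` fitted on one dataset and applied to another gives the same result as calling `normalize` directly on the second dataset. A minimal sketch with made-up arrays (`a` and `b` are illustrative names):

```python
import numpy as np
from sklearn import preprocessing

a = np.array([[1.0, 2.0, 2.0], [0.0, 3.0, 4.0]])    # data used for "fitting"
b = np.array([[2.0, 0.0, 0.0], [1.0, 1.0, 1.0]])    # different data to transform

fitted_on_a = preprocessing.Normalizer(norm="l2").fit(a)

via_object = fitted_on_a.transform(b)
via_function = preprocessing.normalize(b, norm="l2")

print(np.allclose(via_object, via_function))   # True: nothing was learned from `a`
```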

<b>Summary</b>

In [25]:
# data to normalize
x = diabetes.data

# Create normalizer
norm_scaler = preprocessing.Normalizer(norm = "l2")

# "Fit" to data
norm_scaler.fit(x)    # <-- doesn't really do much, but keeps syntax/logic consistent

# transform data
x_norm = norm_scaler.transform(x)

In [26]:
# Alternatively, you can use the function shortcut:
x_norm = preprocessing.normalize(x, norm = "l2")

The advantage of creating a Normalizer object is that it can be added to a Pipeline object (we will talk about these soon).

#### <span style = "color:red"> EXERCISE</span>

Normalize the rows of the fish dataset, using only its predictive, numerical columns.

In [None]:
# I'll load the data for you, and drop the columns we don't want
fish_df = pd.read_csv("data/fish.csv")
fish_df = fish_df.drop(columns = ["species", "weight"])

In [None]:
# Create normalizer


# "Fit" to data


# transform data


In [None]:
# Visualize the first 10 elements


In [None]:
# Check the norm of the first row is 1
np.linalg.norm() # <-- Your first row goes inside these parentheses.

### <span style="color:teal">Encoding Categorical variables</span>

Categorical variables don't always have the nice properties of numbers (order, distance, etc.). Therefore, we have to encode them in a way we can deal with them mathematically. There are two main ways of doing this, which depend on the categories having an ordered structure or not.

#### Categorical variables with order

When your categorical variables have an order structure, you will do *ordinal* encoding, which means you will map them to integers. For example, if you have a list of grades $A$, $B$, $C$, etc., you know the following facts:
- $A > B$
- $A > C$
- $B > C$

Note that the integers $\{0, 1, 2, 3, \ldots\}$ have the same kind of order structure, so we can map each letter to a number in a way that preserves these facts.

To exemplify this, let's make a mock dataframe:

In [27]:
grades_df = pd.DataFrame({"Grades":["A", "A", "D", "B", "C", "A", "C"]})
grades_df

|   | Grades |
|---|---|
| 0 | A |
| 1 | A |
| 2 | D |
| 3 | B |
| 4 | C |
| 5 | A |
| 6 | C |


In [28]:
# Step 1: Create the OrdinalEncoder
ord_encoder = preprocessing.OrdinalEncoder()
ord_encoder

In [29]:
# Step 2: Fit it to our data
ord_encoder.fit(grades_df)

In [30]:
# Step 3: Transform your data
ord_encoder.transform(grades_df)

array([[0.],
       [0.],
       [3.],
       [1.],
       [2.],
       [0.],
       [2.]])

You can also transform new data:

In [31]:
new_grades_df = pd.DataFrame({"Grades":["D", "D", "B"]})

In [32]:
# Transforming new data
ord_encoder.transform(new_grades_df)

array([[3.],
       [3.],
       [1.]])

Notice that the order was alphabetical: $A$ got mapped to $0$. However, you may want the opposite: for $D$ to be $0$.

To my knowledge, there is no nice option to encode in reverse alphabetical order. However, we can provide the categories to encode as an explicit list (a list of lists to be precise, where the $i^{th}$ list corresponds to the $i^{th}$ column in your dataframe). In this case, the order in which we provide these categories will indicate the order of encoding:

In [33]:
# Indicating explicitly the categories:
ord_encoder = preprocessing.OrdinalEncoder(categories = [["D", "C", "B", "A"]])

ord_encoder.fit(grades_df)

ord_encoder.transform(grades_df)

array([[3.],
       [3.],
       [0.],
       [2.],
       [1.],
       [3.],
       [1.]])

Now $A$ maps to the highest number.
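
With an explicit category list, it is worth deciding what should happen when new data contains a grade the encoder has never seen. By default `OrdinalEncoder` raises an error in that case. The sketch below uses the `handle_unknown` and `unknown_value` parameters to map unseen grades to $-1$ instead; the grade `"F"` and the choice of $-1$ are illustrative assumptions.

```python
import pandas as pd
from sklearn import preprocessing

grades_df = pd.DataFrame({"Grades": ["A", "A", "D", "B", "C", "A", "C"]})

ord_encoder = preprocessing.OrdinalEncoder(
    categories=[["D", "C", "B", "A"]],     # lowest to highest
    handle_unknown="use_encoded_value",    # don't raise on unseen categories
    unknown_value=-1,                      # code assigned to unseen categories
)
ord_encoder.fit(grades_df)

new_grades_df = pd.DataFrame({"Grades": ["A", "F"]})   # "F" was never seen
encoded = ord_encoder.transform(new_grades_df)
print(encoded)

# inverse_transform maps codes back to the original labels
decoded = ord_encoder.inverse_transform(ord_encoder.transform(grades_df))
print(decoded[:3])
```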

#### Categorical variables with no order

If your variables don't have any order whatsoever, the recommended approach is one-hot encoding. This basically maps each value to a vector whose $i^{th}$ element is $1$ if it belongs to category $i$, and $0$ otherwise.

For example, if we have efrén's favorite animals: "Aardvark", "Babirusa", and "Capybara", for which there is no natural order, the encoding could go like this:
- "Aardvark" $\rightarrow [1,0,0]$
- "Babirusa" $\rightarrow [0,1,0]$
- "Capybara" $\rightarrow [0,0,1]$

![aardvark-and-friends](images/aardvark_babirusa_capybara.png?v1){width=50%}
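
The animal mapping above can be written out by hand. A minimal numpy sketch, where the `observations` list and the helper names are illustrative:

```python
import numpy as np

animals = ["Aardvark", "Babirusa", "Capybara"]               # the categories
observations = ["Capybara", "Aardvark", "Aardvark", "Babirusa"]

# Position of each category in the one-hot vector
index_of = {name: i for i, name in enumerate(animals)}

# One row per observation, one column per category
one_hot = np.zeros((len(observations), len(animals)))
for row, name in enumerate(observations):
    one_hot[row, index_of[name]] = 1.0

print(one_hot)
```

Each row contains exactly one $1$, in the column of the category that observation belongs to.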


Let's see this in action with the penguins island feature:

In [34]:
# This is how the original data looks
penguins_df[["island"]]

|   | island |
|---|---|
| 0 | Torgersen |
| 1 | Torgersen |
| 2 | Torgersen |
| 3 | Torgersen |
| 4 | Torgersen |
| ... | ... |
| 328 | Dream |
| 329 | Dream |
| 330 | Dream |
| 331 | Dream |


In [35]:
# Step 1: Create a OneHotEncoder
oh_encoder = preprocessing.OneHotEncoder()
oh_encoder

In [36]:
# Step 2: Fit it to the data
oh_encoder.fit(penguins_df[["island"]])

In [37]:
# Step 3: Transform the data
oh_islands = oh_encoder.transform(penguins_df[["island"]])

Note: since the result is a matrix with a lot of zeros, scikit-learn returns it in "compressed sparse row" format by default. Like this:

In [38]:
oh_islands

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 333 stored elements and shape (333, 3)>

We don't have time to go over this in detail, but you have two options:
1. you can specify that you don't want a sparse output with the `sparse_output` argument, or
2. you can easily "decompress" it by calling the `toarray()` method:

In [39]:
# convert to dense array
oh_islands_arr = oh_islands.toarray()
oh_islands_arr[0:10]

array([[0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.]])

In [40]:
# create OneHotEncoder with sparse_output as False:
preprocessing.OneHotEncoder(sparse_output = False).fit(penguins_df[["island"]]).transform(penguins_df[["island"]])[0:10]

array([[0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.]])

How do we know which vector element belongs to which category? Use the `categories_` attribute: the order in which the categories appear is the order of the columns in the encoded data.

In [41]:
oh_encoder.categories_

[array(['Biscoe', 'Dream', 'Torgersen'], dtype=object)]

#### **(Optional)**

Some extra details on categorical encoding.

How do we put the data back into our dataframe?

In [42]:
penguins_df.head()

|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
| 3 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |
| 4 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | male | 2007 |


First, we could just create a new dataframe from the array:

In [43]:
oh_islands_df = pd.DataFrame(oh_islands_arr, columns = oh_encoder.categories_)
oh_islands_df.head()

|   | Biscoe | Dream | Torgersen |
|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 |
| 1 | 0.0 | 0.0 | 1.0 |
| 2 | 0.0 | 0.0 | 1.0 |
| 3 | 0.0 | 0.0 | 1.0 |
| 4 | 0.0 | 0.0 | 1.0 |


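Now we just join the dataframes. Notice the tuple-like column names such as `(Biscoe,)` in the result: `categories_` is a list holding one array per encoded column, so pandas builds multi-level column labels from it. Passing `oh_encoder.categories_[0]` as `columns` instead would give flat names.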

In [44]:
penguins_df.join(oh_islands_df).head()

|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year | (Biscoe,) | (Dream,) | (Torgersen,) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 | 0.0 | 0.0 | 1.0 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 | 0.0 | 0.0 | 1.0 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 | 0.0 | 0.0 | 1.0 |
| 3 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 | 0.0 | 0.0 | 1.0 |
| 4 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | male | 2007 | 0.0 | 0.0 | 1.0 |


Alternatively, and presumably more easily, we can change the encoder's output format to a pandas dataframe using the `.set_output()` method. In that case, however, we must set `sparse_output` to `False`, because pandas output does not support sparse data.

In [45]:
# Make new encoder with updated sparse_output argument:
oh_encoder = preprocessing.OneHotEncoder(sparse_output = False)

# Changing the output format of encoder:
oh_encoder.set_output(transform = "pandas")

oh_encoder.fit(penguins_df[["island"]])

oh_islands_df = oh_encoder.transform(penguins_df[["island"]])

In [46]:
oh_islands_df.head()

|   | island_Biscoe | island_Dream | island_Torgersen |
|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 |
| 1 | 0.0 | 0.0 | 1.0 |
| 2 | 0.0 | 0.0 | 1.0 |
| 3 | 0.0 | 0.0 | 1.0 |
| 4 | 0.0 | 0.0 | 1.0 |


In [47]:
penguins_df.join(oh_islands_df).head()

|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year | island_Biscoe | island_Dream | island_Torgersen |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 | 0.0 | 0.0 | 1.0 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 | 0.0 | 0.0 | 1.0 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 | 0.0 | 0.0 | 1.0 |
| 3 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 | 0.0 | 0.0 | 1.0 |
| 4 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | male | 2007 | 0.0 | 0.0 | 1.0 |


#### <span style = "color:red"> EXERCISE</span>

The first predictive feature of the fish dataset (the species) is categorical.
1. Create a one hot encoder and fit it to this data.
2. Transform the data.
3. Check the first few rows of the transformed data.

In [None]:
# I'll load the data for you, and select the column we want
fish_df = pd.read_csv("data/fish.csv")
fish_df = fish_df[["species"]]

In [None]:
# Make new encoder with updated sparse_output argument:


# Changing the output format of encoder:


# Fit it to the data


# Transform the data


In [None]:
# Check the transformed data


### <span style="color:teal">How to choose the right scaler?</span>

There are a variety of factors to consider. We don't have much time to go over them here, but here's a preview:
- What is the data type?
- How is it distributed?
- What models will be used, and what are the model assumptions?
- Do we want to keep the data within a given range?
- etc.
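
As one illustration of the question about keeping data within a given range: `preprocessing` also offers `MinMaxScaler`, which is not covered in this notebook. A minimal sketch on made-up values, following the same fit/transform pattern (the names `x` and `range_scaler` are illustrative):

```python
import numpy as np
from sklearn import preprocessing

# One feature, five made-up observations (2-dimensional, as scikit-learn expects)
x = np.array([[181.0], [186.0], [195.0], [193.0], [190.0]])

# Rescale so the smallest value maps to 0 and the largest to 1
range_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
x_scaled = range_scaler.fit_transform(x)

print(x_scaled)
```

Whether this is preferable to standardization depends on the answers to the other questions in the list, for example how the data is distributed.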

### <span style="color:teal">Summary</span>

We learned about a few classes from the `preprocessing` module:
- `StandardScaler`
- `Normalizer`
- `OrdinalEncoder`
- `OneHotEncoder`

There are many more you will learn in your scikit-learn adventures, but we can't deal with those here. Remember the main recipe for preprocessors:
1. Create an instance of the object you need, usually like this: `preprocessor = SomeClass()`
2. Fit the preprocessor, usually like this: `preprocessor.fit(X)`
3. Transform your data, which may or may not be the same as the data used for fitting, usually like this: `preprocessor.transform(X_2)`
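
The three steps can be sketched with `StandardScaler` standing in for `SomeClass`. The arrays `X` and `X_2` are made up for illustration; the point is that `X_2` is scaled with the statistics learned from `X`.

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1.0], [2.0], [3.0]])     # data used for fitting
X_2 = np.array([[2.0], [4.0]])          # new data, transformed with the fitted statistics

preprocessor = preprocessing.StandardScaler()   # 1. create
preprocessor.fit(X)                             # 2. fit (learns mean 2.0 from X)
X_2_scaled = preprocessor.transform(X_2)        # 3. transform

print(X_2_scaled)
```

The value $2.0$ in `X_2` maps to $0$ because it equals the mean of `X`, not the mean of `X_2`.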

Keep this recipe in mind, because the models we are about to use follow a similar logic!