# 3. Handling Numerical Data

## 3.0 Introduction

Quantitative data is the measurement of something—whether class size, monthly
sales, or student scores. The natural way to represent these quantities is numerically
(e.g., 29 students, $529,392 in sales). In this chapter, we will cover numerous strategies
for transforming raw numerical data into features purpose-built for machine
learning algorithms.

## 3.1 Rescaling a Feature

Use scikit-learn’s MinMaxScaler to rescale a feature array:

In [None]:
import pandas as pd
import numpy as np

In [None]:
# Load libraries
from sklearn import preprocessing
# Create feature
feature = np.array([[-500.5],
                    [-100.1],
                    [0],
                    [100.1],
                    [900.9]])
# Create scaler
minmax_scale = preprocessing.MinMaxScaler(feature_range=(0, 1))
# Scale feature
scaled_feature = minmax_scale.fit_transform(feature)


In [None]:
scaled_feature

Rescaling is a common preprocessing task in machine learning. Many algorithms assume
that all features are on the same scale, typically 0 to 1 or –1 to 1. There are a number
of rescaling techniques, but one of the simplest is called min-max scaling. Min-max
scaling uses the minimum and maximum values of a feature to rescale values to within
a range. Specifically, min-max scaling calculates:

$$x'_i = \frac{x_i - \min(x)}{\max(x) - \min(x)}$$

where $x$ is the feature vector, $x_i$ is an individual element of feature $x$, and $x'_i$ is the
rescaled element. In our example, we can see from the outputted array that the feature
has been successfully rescaled to between 0 and 1.

scikit-learn’s MinMaxScaler offers two options to rescale a feature. One option is to
use fit to calculate the minimum and maximum values of the feature, then use
transform to rescale the feature. The second option is to use fit_transform to do both
operations at once. There is no mathematical difference between the two options, but
there is sometimes a practical benefit to keeping the operations separate because it
allows us to apply the same transformation to different sets of the data.
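
The benefit of keeping fit and transform separate can be sketched as follows. This is a minimal illustration, not part of the original recipe: the `new` array is a hypothetical second set of values, and the manual calculation simply restates the min-max formula in NumPy.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# The recipe's feature, treated here as the data we fit on
train = np.array([[-500.5], [-100.1], [0.0], [100.1], [900.9]])
# Hypothetical new values that arrive later (an assumption for illustration)
new = np.array([[200.2], [1000.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(train)                        # learn min and max from train only
train_scaled = scaler.transform(train)   # rescale train with those values
new_scaled = scaler.transform(new)       # reuse the same min and max on new data

# Manual min-max calculation for comparison
manual = (train - train.min()) / (train.max() - train.min())

print(train_scaled.ravel())
print(new_scaled.ravel())
```

Because the minimum and maximum come from `train` alone, a new value larger than the training maximum lands above 1 rather than being squeezed into the range.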

## 3.2 Standardizing a Feature

You want to transform a feature to have a mean of 0 and a standard deviation of 1.
scikit-learn’s **StandardScaler** performs both transformations:

In [None]:
# Load libraries
import numpy as np
from sklearn import preprocessing
# Create feature
x = np.array([[-1000.1],
        [-200.2],
        [500.5],
        [600.6],
        [9000.9]])
# Create scaler
scaler = preprocessing.StandardScaler()
# Transform the feature
standardized = scaler.fit_transform(x)

In [None]:
standardized

## Explanation

A common alternative to the min-max scaling discussed in Recipe 3.1 is rescaling features
to be approximately standard normally distributed. To achieve this, we use
standardization to transform the data such that it has a mean, $\bar{x}$, of 0 and a standard
deviation, $\sigma$, of 1. Specifically, each element in the feature is transformed so that:

$$x'_i = \frac{x_i - \bar{x}}{\sigma}$$

where $x'_i$ is our standardized form of $x_i$. The transformed feature represents the number
of standard deviations the original value is away from the feature’s mean value
(also called a z-score in statistics).

Standardization is a common go-to scaling method for machine learning preprocessing
and in my experience is used more than min-max scaling. However, it depends on
the learning algorithm. For example, principal component analysis often works better
using standardization, while min-max scaling is often recommended for neural networks
(both algorithms are discussed later in this book). As a general rule, I’d recommend
defaulting to standardization unless you have a specific reason to use an alternative.

We can see the effect of standardization by looking at the mean and standard deviation
of our solution’s output:

In [None]:
# Print mean and standard deviation
print("Mean:", round(standardized.mean()))
print("Standard deviation:", standardized.std())
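
As a cross-check on the z-score definition, the same result can be computed by hand. This is a sketch that assumes the population standard deviation (NumPy's default, `ddof=0`), which is what StandardScaler uses:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Same feature as in the recipe
x = np.array([[-1000.1], [-200.2], [500.5], [600.6], [9000.9]])

# Manual z-score: subtract the mean, divide by the population standard deviation
manual = (x - x.mean()) / x.std()

# scikit-learn's result for comparison
sk = StandardScaler().fit_transform(x)

print(np.allclose(manual, sk))
```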

If our data has significant outliers, they can negatively impact our standardization by
distorting the feature’s mean and variance. In this scenario, it is often helpful to instead
rescale the feature using the median and interquartile range. In scikit-learn, we do this
using **RobustScaler**:

In [None]:
# Create scaler
robust_scaler = preprocessing.RobustScaler()
# Transform feature
robust_scaler.fit_transform(x)
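
To make "median and interquartile range" concrete, here is a hedged sketch of the same calculation done by hand. It assumes RobustScaler's default settings (centering on the median and scaling by the 25th to 75th percentile range):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Same feature as in the recipe
x = np.array([[-1000.1], [-200.2], [500.5], [600.6], [9000.9]])

# Manual version: subtract the median, divide by the interquartile range
q1, q3 = np.percentile(x, [25, 75])
manual = (x - np.median(x)) / (q3 - q1)

# scikit-learn's result with default settings
robust = RobustScaler().fit_transform(x)

print(np.allclose(manual, robust))
```

Because the median and quartiles barely move when a single extreme value is added, the bulk of the data keeps a similar scale even with the outlier present.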

## 3.3 Normalizing Observations

You want to rescale the feature values of observations to have unit norm (a total
length of 1).

Use **Normalizer** with a norm argument:

In [None]:
# Load libraries
from sklearn.preprocessing import Normalizer
# Create feature matrix
features = np.array([[0.5, 0.5],
                    [1.1, 3.4],
                    [1.5, 20.2],
                    [1.63, 34.4],
                    [10.9, 3.3]])
# Create normalizer
normalizer = Normalizer(norm="l2")
# Transform feature matrix
normalizer.transform(features)

## Explanation 

Many rescaling methods (e.g., min-max scaling and standardization) operate on features;
however, we can also rescale across individual observations. **Normalizer**
rescales the values of each individual observation to have unit norm (a total
length of 1). This type of rescaling is often used when we have many equivalent features
(e.g., text classification when every word or n-word group is a feature).

Normalizer provides three norm options, with the Euclidean norm (often called L2)
being the default argument:

$$\|x\|_2 = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$$

where $x$ is an individual observation and $x_n$ is that observation’s value for the nth feature.

In [None]:
# Transform feature matrix
features_l2_norm = Normalizer(norm="l2").transform(features)
# Show feature matrix
features_l2_norm

Alternatively, we can specify **Manhattan** norm (L1):

$$\|x\|_1 = |x_1| + |x_2| + \cdots + |x_n|$$

In [None]:
# Transform feature matrix
features_l1_norm = Normalizer(norm="l1").transform(features)
# Show feature matrix
features_l1_norm

Intuitively, L2 norm can be thought of as the distance between two points in New
York for a bird (i.e., a straight line), while L1 can be thought of as the distance for a
human walking on the street (walk north one block, east one block, north one block,
east one block, etc.), which is why it is called **“Manhattan norm”** or **“Taxicab norm.”**
Practically, notice that norm='l1' rescales an observation’s values so that their absolute
values sum to 1 (for non-negative values like ours, the values themselves sum to 1),
which can sometimes be a desirable quality:

In [None]:
print("Sum of the first observation's values:",
      features_l1_norm[0, 0] + features_l1_norm[0, 1])
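
Both norms can also be verified by hand. The following is a minimal sketch, not part of the original recipe, that divides each row by its own norm and compares the result with Normalizer:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Same feature matrix as in the recipe
features = np.array([[0.5, 0.5],
                     [1.1, 3.4],
                     [1.5, 20.2],
                     [1.63, 34.4],
                     [10.9, 3.3]])

l2 = Normalizer(norm="l2").fit_transform(features)
l1 = Normalizer(norm="l1").fit_transform(features)

# Manual versions: divide every row by that row's norm
manual_l2 = features / np.linalg.norm(features, axis=1, keepdims=True)
manual_l1 = features / np.abs(features).sum(axis=1, keepdims=True)

# After normalizing, each row has length 1 (L2) or absolute sum 1 (L1)
row_lengths = np.linalg.norm(l2, axis=1)
row_sums = np.abs(l1).sum(axis=1)

print(row_lengths)
print(row_sums)
```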

## 3.4 Generating Polynomial and Interaction Features

Even though some choose to create polynomial and interaction features manually,
scikit-learn offers a built-in method:

In [25]:
# Load libraries
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

In [26]:
features = np.array([[2,3],
                      [2,3],
                      [2,3]])

In [27]:
# Create PolynomialFeatures object
polynomial_interaction = PolynomialFeatures(degree=2, include_bias=False)
# Create polynomial features
polynomial_interaction.fit_transform(features)

array([[2., 3., 4., 6., 9.],
       [2., 3., 4., 6., 9.],
       [2., 3., 4., 6., 9.]])

The degree parameter determines the maximum degree of the polynomial. For
example, degree=2 will create new features raised to the second power:

$$x_1, x_2, x_1^2, x_2^2$$

while degree=3 will create new features raised to the second and third power:

$$x_1, x_2, x_1^2, x_2^2, x_1^3, x_2^3$$

Furthermore, by default **PolynomialFeatures** includes interaction features:

$$x_1 x_2$$

We can restrict the features created to only interaction features by setting **interaction_only** to True:

In [28]:
interaction = PolynomialFeatures(degree=2,
                                 interaction_only=True,
                                 include_bias=False)
interaction.fit_transform(features)

array([[2., 3., 6.],
       [2., 3., 6.],
       [2., 3., 6.]])

## Explanation

Polynomial features are often created when we want to include the notion that there
exists a nonlinear relationship between the features and the target. For example, we
might suspect that the effect of age on the probability of having a major medical condition
is not constant over time but increases as age increases. We can encode that
nonconstant effect in a feature, $x$, by generating that feature’s higher-order forms ($x^2$,
$x^3$, etc.).

Additionally, we often run into situations where the effect of one feature is dependent
on another feature. A simple example would be if we were trying to predict whether
or not our coffee was sweet and we had two features: 1) whether or not the coffee was
stirred and 2) if we added sugar. Individually, each feature does not predict coffee
sweetness, but the combination of their effects does. That is, a coffee would only be
sweet if the coffee had sugar and was stirred. The effects of each feature on the target
(sweetness) are dependent on each other. We can encode that relationship by including
an interaction feature that is the product of the individual features.
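
The coffee example can be sketched with a tiny table. The observations below are hypothetical values invented for illustration, with 1 meaning "yes" and 0 meaning "no":

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical observations; columns are [sugar_added, stirred]
coffee = np.array([[0, 0],
                   [1, 0],
                   [0, 1],
                   [1, 1]])

# The interaction feature is the product of the two individual features
sweet = coffee[:, 0] * coffee[:, 1]

# PolynomialFeatures produces the same product as its third column
interaction = PolynomialFeatures(degree=2,
                                 interaction_only=True,
                                 include_bias=False)
product_column = interaction.fit_transform(coffee)[:, 2]

print(sweet)
print(product_column)
```

Only the last observation, with sugar added and the coffee stirred, gets a nonzero interaction value.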

## 3.5 Transforming Features

In [29]:
# Load libraries
import numpy as np
from sklearn.preprocessing import FunctionTransformer
# Create feature matrix
features = np.array([[2, 3],
                    [2, 3],
                    [2, 3]])
# Define a simple function
def add_ten(x):
    return x + 10
# Create transformer
ten_transformer = FunctionTransformer(add_ten)
# Transform feature matrix
ten_transformer.transform(features)

array([[12, 13],
       [12, 13],
       [12, 13]])

We can create the same transformation in pandas using apply:

In [30]:
# Load library
import pandas as pd

In [31]:
# Create DataFrame
df = pd.DataFrame(features, columns=["feature_1", "feature_2"])
# Apply function
df.apply(add_ten)

| | feature_1 | feature_2 |
|---|---|---|
| 0 | 12 | 13 |
| 1 | 12 | 13 |
| 2 | 12 | 13 |


## 3.6 Detecting Outliers

Detecting outliers is unfortunately more of an art than a science. However, a common
method is to assume the data is normally distributed and based on that assumption
“draw” an ellipse around the data, classifying any observation inside the ellipse as an
inlier (labeled as 1) and any observation outside the ellipse as an outlier (labeled as
-1):

In [32]:
# Load libraries
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs
# Create simulated data
features, _ = make_blobs(n_samples = 10,
                         n_features = 2,
                         centers = 1,
                         random_state = 1)
# Replace the first observation's values with extreme values
features[0,0] = 10000
features[0,1] = 10000
# Create detector
outlier_detector = EllipticEnvelope(contamination=.1)
# Fit detector
outlier_detector.fit(features)
# Predict outliers
outlier_detector.predict(features)

array([-1,  1,  1,  1,  1,  1,  1,  1,  1,  1])

In [33]:
features

array([[ 1.00000000e+04,  1.00000000e+04],
       [-2.76017908e+00,  5.55121358e+00],
       [-1.61734616e+00,  4.98930508e+00],
       [-5.25790464e-01,  3.30659860e+00],
       [ 8.52518583e-02,  3.64528297e+00],
       [-7.94152277e-01,  2.10495117e+00],
       [-1.34052081e+00,  4.15711949e+00],
       [-1.98197711e+00,  4.02243551e+00],
       [-2.18773166e+00,  3.33352125e+00],
       [-1.97451969e-01,  2.34634916e+00]])

A major limitation of this approach is the need to specify a contamination parameter,
which is the proportion of observations that are outliers—a value that we don’t
know. Think of contamination as our estimate of the cleanliness of our data. If we
expect our data to have few outliers, we can set contamination to something small.
However, if we believe that the data is very likely to have outliers, we can set it to a
higher value.

Instead of looking at observations as a whole, we can look at individual features
and identify extreme values in those features using interquartile range (IQR):

In [34]:
# Create one feature
feature = features[:,0]
# Create a function to return index of outliers
def indices_of_outliers(x):
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - (iqr * 1.5)
    upper_bound = q3 + (iqr * 1.5)
    return np.where((x > upper_bound) | (x < lower_bound))
# Run function
indices_of_outliers(feature)

(array([0], dtype=int64),)

In [None]:
feature

IQR is the difference between the first and third quartile of a set of data. You can
think of IQR as the spread of the bulk of the data, with outliers being observations far
from the main concentration of data. Outliers are commonly defined as any value 1.5
IQRs less than the first quartile or 1.5 IQRs greater than the third quartile.
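
A small worked example may make the 1.5 IQR rule easier to follow. The values below are made up for illustration, and the quartiles use NumPy's default linear interpolation:

```python
import numpy as np

# Hypothetical feature with one obviously extreme value
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])

# Quartiles and interquartile range
q1, q3 = np.percentile(values, [25, 75])   # 2.25 and 4.75
iqr = q3 - q1                              # 2.5

# Bounds sit 1.5 IQRs beyond the first and third quartiles
lower_bound = q1 - 1.5 * iqr               # -1.5
upper_bound = q3 + 1.5 * iqr               # 8.5

# Indices of values falling outside the bounds
outlier_indices = np.where((values < lower_bound) | (values > upper_bound))[0]

print(lower_bound, upper_bound, outlier_indices)
```

Only the value 100 lies outside the interval from -1.5 to 8.5, so only its index (5) is reported.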

## 3.7 Handling Outliers

Typically we have three strategies we can use to handle outliers.

First, we can drop them:

In [35]:
# Load library
import pandas as pd
# Create DataFrame
houses = pd.DataFrame()
houses['Price'] = [534433, 392333, 293222, 4322032]
houses['Bathrooms'] = [2, 3.5, 2, 116]
houses['Square_Feet'] = [1500, 2500, 1500, 48000]
# Filter observations
houses[houses['Bathrooms'] < 20]

| | Price | Bathrooms | Square_Feet |
|---|---|---|---|
| 0 | 534433 | 2.0 | 1500 |
| 1 | 392333 | 3.5 | 2500 |
| 2 | 293222 | 2.0 | 1500 |


Second, we can mark them as outliers and include that marker as a feature:

In [36]:
# Load library
import numpy as np
# Create feature based on boolean condition
houses["Outlier"] = np.where(houses["Bathrooms"] < 20, 0, 1)
# Show data
houses

| | Price | Bathrooms | Square_Feet | Outlier |
|---|---|---|---|---|
| 0 | 534433 | 2.0 | 1500 | 0 |
| 1 | 392333 | 3.5 | 2500 | 0 |
| 2 | 293222 | 2.0 | 1500 | 0 |
| 3 | 4322032 | 116.0 | 48000 | 1 |


Finally, we can transform the feature to dampen the effect of the outlier:

In [37]:
# Log feature
houses["Log_Of_Square_Feet"] = np.log(houses["Square_Feet"])
# Show data
houses

| | Price | Bathrooms | Square_Feet | Outlier | Log_Of_Square_Feet |
|---|---|---|---|---|---|
| 0 | 534433 | 2.0 | 1500 | 0 | 7.31322 |
| 1 | 392333 | 3.5 | 2500 | 0 | 7.824046 |
| 2 | 293222 | 2.0 | 1500 | 0 | 7.31322 |
| 3 | 4322032 | 116.0 | 48000 | 1 | 10.778956 |


## 3.8 Discretizing Features

Depending on how we want to break up the data, there are two techniques we can
use. First, we can binarize the feature according to some threshold:

In [38]:
# Load libraries
import numpy as np
from sklearn.preprocessing import Binarizer
# Create feature
age = np.array([[6],
                [12],
                [20],
                [36],
                [65]])
# Create binarizer
binarizer = Binarizer(threshold=18)
# Transform feature
binarizer.fit_transform(age)

array([[0],
       [0],
       [1],
       [1],
       [1]])

Second, we can break up numerical features according to multiple thresholds:

In [39]:
# Bin feature
np.digitize(age, bins=[20,30,64])


array([[0],
       [0],
       [1],
       [2],
       [3]], dtype=int64)

In [40]:
# Bin feature
np.digitize(age, bins=[20,30,64], right=True)


array([[0],
       [0],
       [0],
       [2],
       [3]], dtype=int64)
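
The two outputs above differ only for the age of 20, which equals a bin edge. The following sketch, using a flattened copy of the same ages, shows how the `right` parameter of `np.digitize` decides which side of an edge is inclusive, and that a single threshold reproduces the binarized result:

```python
import numpy as np

# Same ages as in the recipe, as a one-dimensional array
age = np.array([6, 12, 20, 36, 65])
bins = [20, 30, 64]

# Default (right=False): a value equal to an edge goes into the higher bin
left_closed = np.digitize(age, bins=bins)

# right=True: a value equal to an edge stays in the lower bin
right_closed = np.digitize(age, bins=bins, right=True)

# A single threshold gives the same 0/1 split as the binarizer example
single_threshold = np.digitize(age, bins=[18])

print(left_closed)
print(right_closed)
print(single_threshold)
```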

## 3.9 Grouping Observations Using Clustering

In [47]:
# Load libraries
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
# Make simulated feature matrix
features, _ = make_blobs(n_samples = 50,
                         n_features = 2,
                         centers = 3,
                         random_state = 1)
# Create DataFrame
dataframe = pd.DataFrame(features, columns=["feature_1", "feature_2"])
# Make k-means clusterer
clusterer = KMeans(3, random_state=0)
# Fit clusterer
clusterer.fit(features)
# Predict values
dataframe["group"] = clusterer.predict(features)
# View first few observations
dataframe.head(5)

| | feature_1 | feature_2 | group |
|---|---|---|---|
| 0 | -9.877554 | -3.336145 | 2 |
| 1 | -7.28721 | -8.353986 | 0 |
| 2 | -6.943061 | -7.023744 | 0 |
| 3 | -7.440167 | -8.791959 | 0 |
| 4 | -6.641388 | -8.075888 | 0 |


## 3.10 Deleting Observations with Missing Values

You need to delete observations containing missing values.

In [48]:
# Load library
import numpy as np
# Create feature matrix
features = np.array([[1.1, 11.1],
                    [2.2, 22.2],
                    [3.3, 33.3],
                    [4.4, 44.4],
                    [np.nan, 55]])
# Keep only observations that are not (denoted by ~) missing
features[~np.isnan(features).any(axis=1)]

array([[ 1.1, 11.1],
       [ 2.2, 22.2],
       [ 3.3, 33.3],
       [ 4.4, 44.4]])

Alternatively, we can drop missing observations using pandas:

In [49]:
# Load library
import pandas as pd
# Load data
dataframe = pd.DataFrame(features, columns=["feature_1", "feature_2"])
# Remove observations with missing values
dataframe.dropna()

| | feature_1 | feature_2 |
|---|---|---|
| 0 | 1.1 | 11.1 |
| 1 | 2.2 | 22.2 |
| 2 | 3.3 | 33.3 |
| 3 | 4.4 | 44.4 |
