# CH 4 -  Handling Numerical Data

## Introduction

In this chapter, we will cover numerous strategies for transforming raw numerical data into features purpose-built for machine learning algorithms.

## Rescaling a Feature

Rescaling means transforming your data so that it fits within a specific range, such as 0-100 or 0-1. You want to scale data when you're using methods based on measures of how far apart data points are, such as support vector machines (SVM) or k-nearest neighbors (KNN).

Rescaling is a common preprocessing task in machine learning. Many of the algorithms
described later in this book will assume all features are on the same scale, typically
0 to 1 or –1 to 1. There are a number of rescaling techniques, but one of the
simplest is called min-max scaling. Min-max scaling uses the minimum and maximum
values of a feature to rescale values to within a range.
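To make the arithmetic concrete, here is a minimal sketch of the min-max formula, x' = (x - min) / (max - min), written directly in NumPy. It reuses the example values from the solution below; the variable names are illustrative only.

```python
import numpy as np

# Same example values as the solution in this section
feature = np.array([[-500.5], [-100.1], [0.0], [100.1], [900.9]])

# Min-max formula applied per column: x' = (x - min) / (max - min)
feature_min = feature.min(axis=0)
feature_max = feature.max(axis=0)
rescaled = (feature - feature_min) / (feature_max - feature_min)

print(rescaled)
```

The smallest value maps to 0, the largest to 1, and everything else lands proportionally in between, which should agree with what MinMaxScaler returns for the default range.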

### Problem
You need to rescale the values of a numerical feature to be between two values.

<h3 style ="color : green">Solution ? </h3>

In [3]:
from sklearn.preprocessing import MinMaxScaler
import numpy as np 

In [11]:
min_max_scaler = MinMaxScaler(feature_range=(0,1))

In [12]:
feature = np.array([[-500.5],
                 [-100.1],
                 [0],
                 [100.1],
                 [900.9]])

In [13]:
feature

array([[-500.5],
       [-100.1],
       [   0. ],
       [ 100.1],
       [ 900.9]])

In [14]:
min_max_scaler.fit_transform(feature)

array([[0.        ],
       [0.28571429],
       [0.35714286],
       [0.42857143],
       [1.        ]])

MinMaxScaler offers two ways to rescale a feature. One option is to
use fit to calculate the minimum and maximum values of the feature, then use
transform to rescale the feature. The second option is to use fit_transform to do both
operations at once.
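The two-step option might look like the sketch below. Separating fit from transform is mainly useful when the same minimum and maximum should be reused on other data; the `new_values` array is a made-up example, not part of the original notebook.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

feature = np.array([[-500.5], [-100.1], [0.0], [100.1], [900.9]])

# Step 1: learn the minimum and maximum of the feature
scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(feature)

# Step 2: rescale the feature using the learned values
scaled = scaler.transform(feature)

# The fitted scaler can be reused on new data (hypothetical value)
new_values = np.array([[200.2]])
scaled_new = scaler.transform(new_values)
print(scaled_new)  # (200.2 + 500.5) / 1401.4 = 0.5
```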

## Standardizing a Feature


Standardization is a common go-to scaling method for machine learning preprocessing
and in my experience is used more than min-max scaling. However, it depends on
the learning algorithm. For example, principal component analysis often works better
using standardization, while min-max scaling is often recommended for neural networks
(both algorithms are discussed later in this book). As a general rule, I’d recommend
defaulting to standardization unless you have a specific reason to use an alternative.

### Problem
You want to transform a feature to have a mean of 0 and a standard deviation of 1.

<h3 style ="color : green">Solution ? </h3>

In [17]:
from sklearn.preprocessing import StandardScaler
import numpy as np 

In [18]:
# Create feature matrix
features = np.array([[0.5, 0.5],
                     [1.1, 3.4],
                     [1.5, 20.2],
                     [1.63, 34.4],
                     [10.9, 3.3]])

In [19]:
features

array([[ 0.5 ,  0.5 ],
       [ 1.1 ,  3.4 ],
       [ 1.5 , 20.2 ],
       [ 1.63, 34.4 ],
       [10.9 ,  3.3 ]])

In [20]:
std = StandardScaler()

In [22]:
std.fit(features)

StandardScaler()

In [23]:
std.transform(features)

array([[-0.67215216, -0.90948567],
       [-0.51857589, -0.68709879],
       [-0.4161917 ,  0.60121144],
       [-0.38291684,  1.69014032],
       [ 1.9898366 , -0.6947673 ]])
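As a quick sanity check, the sketch below confirms that each standardized column has a mean of 0 and a standard deviation of 1, and that the result matches the formula z = (x - mean) / std. It assumes the population standard deviation (NumPy's default, `ddof=0`), which is what StandardScaler uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

features = np.array([[0.5, 0.5],
                     [1.1, 3.4],
                     [1.5, 20.2],
                     [1.63, 34.4],
                     [10.9, 3.3]])

standardized = StandardScaler().fit_transform(features)

# Manual version of the same transformation, column by column
manual = (features - features.mean(axis=0)) / features.std(axis=0)

# Column means should be (numerically) 0 and standard deviations 1
print("Mean:", standardized.mean(axis=0).round(6))
print("Standard deviation:", standardized.std(axis=0).round(6))
```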

## Transforming Features

### Problem
You want to make a custom transformation to one or more features.

It is common to want to make some custom transformations to one or more features.
For example, we might want to create a feature that is the natural log of the values of
a different feature. We can do this by creating a function and then mapping it to
features using either scikit-learn’s FunctionTransformer or pandas’ apply. In the
solution we create a very simple function, add_10, which adds 10 to each input, but
there is no reason we could not define a much more complex function.

<h3 style ="color : green">Solution ? </h3>

In [1]:
import numpy as np 
from sklearn.preprocessing import FunctionTransformer

In [3]:
# Create feature matrix
features = np.array([[2, 3],
                     [2, 3],
                     [2, 3]])

In [5]:
def add_10(arr):
    return arr+10

In [9]:
fr = FunctionTransformer(add_10)

In [10]:
fr.transform(features)

array([[12, 13],
       [12, 13],
       [12, 13]])
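The same transformation can be sketched with pandas' apply, the alternative mentioned above. The column names `feature_1` and `feature_2` are made up for illustration.

```python
import numpy as np
import pandas as pd

def add_10(arr):
    # Add 10 to every value passed in
    return arr + 10

# Same values as the feature matrix above, with hypothetical column names
df = pd.DataFrame(np.array([[2, 3],
                            [2, 3],
                            [2, 3]]),
                  columns=["feature_1", "feature_2"])

# apply passes each column (a Series) to add_10
result = df.apply(add_10)
print(result)
```

Either route gives the same values; FunctionTransformer tends to fit more naturally into a scikit-learn pipeline, while apply is convenient when the data already lives in a DataFrame.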

## Detecting Outliers


### Problem
You want to identify extreme observations.

<h3 style ="color : green">Solution ? </h3>

A major limitation of contamination-based outlier detection is the need to specify a
contamination parameter, which is the proportion of observations that are outliers, a
value that we don’t know. Think of contamination as our estimate of the cleanliness of
our data. If we expect our data to have few outliers, we can set contamination to
something small. However, if we believe that the data is very likely to have outliers,
we can set it to a higher value.
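As an illustration of how contamination is used, here is a minimal sketch with scikit-learn's EllipticEnvelope, one of several detectors that accept this parameter. The choice of detector, the simulated data, and the 0.1 value are all assumptions made for the example, not something prescribed by the text above.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs

# Simulated data (assumption): one cluster of 50 points in two dimensions
features, _ = make_blobs(n_samples=50, n_features=2, centers=1, random_state=1)

# Replace the first observation with extreme values
features[0] = [10000, 10000]

# contamination is our guess at the proportion of outliers
detector = EllipticEnvelope(contamination=0.1, random_state=0)

# fit_predict returns -1 for outliers and 1 for inliers
predictions = detector.fit_predict(features)
print(predictions)
```

Raising or lowering contamination changes how many observations are labeled -1, which is why the value has to reflect how clean we believe the data is.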

## Handling Outliers


### Problem
You have outliers in your data.

<h3 style ="color : green">Solution ? </h3>

In [11]:
import pandas as pd

In [12]:
# Create DataFrame
houses = pd.DataFrame()

In [13]:
houses

In [14]:
houses['Price'] = [534433, 392333, 293222, 4322032]
houses['Bathrooms'] = [2, 3.5, 2, 116]
houses['Square_Feet'] = [1500, 2500, 1500, 48000]

In [15]:
houses

|   | Price | Bathrooms | Square_Feet |
|---|-------|-----------|-------------|
| 0 | 534433 | 2.0 | 1500 |
| 1 | 392333 | 3.5 | 2500 |
| 2 | 293222 | 2.0 | 1500 |
| 3 | 4322032 | 116.0 | 48000 |


In [16]:
houses[houses['Bathrooms'] < 10]

|   | Price | Bathrooms | Square_Feet |
|---|-------|-----------|-------------|
| 0 | 534433 | 2.0 | 1500 |
| 1 | 392333 | 3.5 | 2500 |
| 2 | 293222 | 2.0 | 1500 |


In [17]:
import numpy as np 

In [18]:
houses['out'] = np.where(houses['Bathrooms'] < 10 , 0 , 1)

In [19]:
houses

|   | Price | Bathrooms | Square_Feet | out |
|---|-------|-----------|-------------|-----|
| 0 | 534433 | 2.0 | 1500 | 0 |
| 1 | 392333 | 3.5 | 2500 | 0 |
| 2 | 293222 | 2.0 | 1500 | 0 |
| 3 | 4322032 | 116.0 | 48000 | 1 |


## Discretizing Features


### Problem
You have a numerical feature and want to break it up into discrete bins.

<h3 style ="color : green">Solution ? </h3>
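One possible approach, sketched under the assumption that a fixed set of thresholds is acceptable: scikit-learn's Binarizer splits a feature into two bins around a single threshold, and NumPy's digitize handles three or more bins. The `age` values and the thresholds are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical feature: ages of five people
age = np.array([[6], [12], [20], [36], [65]])

# Two bins: values greater than the threshold become 1, the rest 0
binarizer = Binarizer(threshold=18)
two_bins = binarizer.fit_transform(age)
print(two_bins.ravel())    # [0 0 1 1 1]

# Several bins: each bin edge is the inclusive lower bound of the next bin
multi_bins = np.digitize(age, bins=[20, 30, 64])
print(multi_bins.ravel())  # [0 0 1 2 3]
```

Discretizing in this way can be useful when there is reason to treat a numerical feature as categorical, for example when behavior differs sharply on either side of a threshold.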

## Grouping Observations Using Clustering


### Problem
You want to cluster observations so that similar observations are grouped together.

<h3 style ="color : green">Solution ? </h3>
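A minimal sketch, assuming we already know (or are willing to guess) that there are three groups: k-means clustering can assign each observation to a cluster, and the cluster label can then be used as a new feature. The simulated data, the column names, and the choice of `n_clusters=3` are assumptions for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Simulated data (assumption): 50 observations drawn around three centers
features, _ = make_blobs(n_samples=50, n_features=2, centers=3, random_state=1)
dataframe = pd.DataFrame(features, columns=["feature_1", "feature_2"])

# Fit k-means and predict a cluster label for every observation
clusterer = KMeans(n_clusters=3, n_init=10, random_state=0)
dataframe["group"] = clusterer.fit_predict(features)

print(dataframe.head())
```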

## Deleting Observations with Missing Values


### Problem
You need to delete observations containing missing values.

<h3 style ="color : green">Solution ? </h3>
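Two common ways to do this are sketched below: a boolean mask in NumPy, and dropna in pandas. The feature values and column names are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical feature matrix with one missing value in the last row
features = np.array([[1.1, 11.1],
                     [2.2, 22.2],
                     [3.3, 33.3],
                     [4.4, 44.4],
                     [np.nan, 55.0]])

# NumPy: keep only the rows that contain no NaN
complete_rows = features[~np.isnan(features).any(axis=1)]
print(complete_rows)

# pandas: dropna removes every row with at least one missing value
dataframe = pd.DataFrame(features, columns=["feature_1", "feature_2"])
dataframe_clean = dataframe.dropna()
print(dataframe_clean)
```

Deleting observations is simple, but it discards information, so it is worth checking how many rows are lost before relying on it.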

## Imputing Missing Values


### Problem
You have missing values in your data and want to fill in or predict their values.

<h3 style ="color : green">Solution ? </h3>
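One simple option, sketched here as an assumption rather than a recommendation, is to fill each missing value with the mean of its column using scikit-learn's SimpleImputer. The feature values are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix with one missing value in the second column
features = np.array([[1.0, 10.0],
                     [2.0, np.nan],
                     [3.0, 30.0]])

# Replace each NaN with the mean of the observed values in its column
imputer = SimpleImputer(strategy="mean")
imputed = imputer.fit_transform(features)

print(imputed)  # the missing entry becomes (10 + 30) / 2 = 20
```

Other strategies such as "median" or "most_frequent" are available through the same parameter; which one is appropriate depends on the feature's distribution.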