# Introduction:
> - Quantitative data is the measurement of something—whether class size, monthly
sales, or student scores.
> - We will cover numerous strategies for transforming raw numerical data into features purpose-built for machine
learning algorithms.

# Rescaling a Feature
> - Rescaling is a common preprocessing task in machine learning.
> - Many of the algorithms will assume all features are on the same scale,
typically 0 to 1 or –1 to 1.
> - There are a number of rescaling techniques, but one of
the simplest is called `min-max scaling`.
>> - Min-max scaling uses the minimum and maximum values of a feature to rescale values to within a range:<br>
`xi′ = (xi − min x)/(max x − min x)`
>>> where x is the feature vector, xi is an individual element of feature x, and xi′ is
the rescaled element.
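
As a quick check of the formula above, the rescaling can be sketched by hand in NumPy. This is only an illustrative sketch; the array reuses the feature values from this section, and the variable name `rescaled` is our own.

```python
import numpy as np

# Same values as the feature used in this section, as a flat vector
feature = np.array([-500.5, -100.1, 0.0, 100.1, 900.9])

# x_i' = (x_i - min(x)) / (max(x) - min(x))
rescaled = (feature - feature.min()) / (feature.max() - feature.min())

# The smallest value maps to 0 and the largest maps to 1
print(rescaled)
```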

In [7]:
# To rescale the values of a numerical feature to be between two values, 
# use scikit-learn’s MinMaxScaler to rescale a feature array:
# Load libraries
import numpy as np
from sklearn import preprocessing
# Create feature
feature = np.array([[-500.5],
 [-100.1],
 [0],
 [100.1],
 [900.9]])
# Create scaler
minmax_scale = preprocessing.MinMaxScaler(feature_range=(0, 1))
minmax_scale.fit_transform(feature)
# In our example, we can see from the outputted array that the feature has been successfully rescaled to between 0 and 1

array([[0.        ],
       [0.28571429],
       [0.35714286],
       [0.42857143],
       [1.        ]])

**Note:**
> 1. One option is to use `fit` to calculate the minimum and maximum values of the feature, and then
use `transform` to rescale the feature.
> 2. The second option is to use `fit_transform` to
do both operations at once.
- There is no mathematical difference between the two
options, but there is sometimes a practical benefit to keeping the operations separate
because it allows us to apply the same transformation to different sets of the data.
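
A minimal sketch of keeping the two operations separate, assuming a hypothetical train/test split (the `train` and `test` arrays are made up for illustration): the scaler learns the minimum and maximum from the training data only, and the same transformation is then applied to both sets.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training and test values for a single feature
train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[2.5], [12.0]])

# Learn the minimum and maximum from the training data only
scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(train)

# Apply the same learned transformation to both sets
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Note that a test value larger than the training maximum ends up above 1, because the scaler does not clip by default.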

# Standardizing a Feature
> - A common alternative to the `min-max scaling`  is rescaling of
features to be approximately standard normally distributed.
>> - To achieve this, we use standardization to transform the data such that it has a mean, x̄, of 0 and a standard
deviation, σ, of 1:<br>
`xi′ = (xi − x̄)/σ`
>>> where xi′ is our standardized form of xi. The transformed feature represents the
number of standard deviations of the original value from the feature’s mean value
(also called a z-score in statistics).
>______________________________________
> - Standardization is used more often than min-max scaling. However, it
depends on the learning algorithm. 
>> Example:<br>
>>> 1. principal component analysis often
works better using standardization
>>> 2. min-max scaling is often recommended for
neural networks 
>______________________________________________
> **Note:** As a general rule, we recommend defaulting to standardization unless you have a specific reason to use
an alternative.
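
The z-score formula can be sketched by hand and compared against `StandardScaler`. This is an illustrative check rather than a replacement for the scaler; it relies on `np.std` using the population standard deviation (`ddof=0`) by default, which is also what `StandardScaler` uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Same values as the feature used in this section, as a flat vector
x = np.array([-1000.1, -200.2, 500.5, 600.6, 9000.9])

# x_i' = (x_i - mean) / standard deviation (population std, ddof=0)
z_manual = (x - x.mean()) / x.std()

# StandardScaler expects a 2D array, so reshape to one column
z_sklearn = StandardScaler().fit_transform(x.reshape(-1, 1)).ravel()

print(z_manual)
print(np.allclose(z_manual, z_sklearn))
```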

In [8]:
# To transform a feature to have a mean of 0 and a standard deviation of 1, 
# use scikit-learn’s StandardScaler to perform both transformations:
# Load libraries
import numpy as np
from sklearn import preprocessing
# Create feature
x = np.array([[-1000.1],
 [-200.2],
 [500.5],
 [600.6],
 [9000.9]])
# Create scaler
scaler = preprocessing.StandardScaler()
# Transform the feature
standardized = scaler.fit_transform(x)
# Show feature
standardized

array([[-0.76058269],
       [-0.54177196],
       [-0.35009716],
       [-0.32271504],
       [ 1.97516685]])

In [9]:
# We can see the effect of standardization by looking at the mean and standard
# deviation of our solution’s output:
# Print mean and standard deviation
print("Mean:", round(standardized.mean()))
print("Standard deviation:", standardized.std())

Mean: 0
Standard deviation: 1.0


**Note:**
- If our data has significant outliers, they can negatively impact our standardization by
affecting the feature’s mean and variance. In this scenario, it is often helpful to instead
rescale the feature using the median and interquartile range. In scikit-learn, we do this
using the `RobustScaler` class:

In [10]:
# Create scaler
robust_scaler = preprocessing.RobustScaler()
# Transform feature
robust_scaler.fit_transform(x)

array([[-1.87387612],
       [-0.875     ],
       [ 0.        ],
       [ 0.125     ],
       [10.61488511]])
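
To see where these numbers come from, here is a small sketch of the same calculation by hand: subtract the median and divide by the interquartile range. It assumes the scaler's default settings (centering on the median, scaling by the 25th to 75th percentile range).

```python
import numpy as np

# Same values as the feature x used in this section, as a flat vector
x = np.array([-1000.1, -200.2, 500.5, 600.6, 9000.9])

# Median and interquartile range of the feature
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])

# (x_i - median) / IQR
robust = (x - median) / (q3 - q1)
print(robust)
```

Unlike the mean and standard deviation, the median and quartiles barely move when the extreme value 9000.9 changes, which is why the other four values stay on a sensible scale.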

# Normalizing Observations
> - Many rescaling methods (e.g., min-max scaling and standardization) operate on
features; however, we can also rescale across individual observations.
> - Normalizer rescales the values on individual observations to have unit norm (each observation's vector has a total
length of 1).
>> - This type of rescaling is often used when we have many equivalent
features (e.g., text classification when every word or n-word group is a feature).
>>> Normalizer provides three norm options (`"l2"`, `"l1"`, and `"max"`); the two most common are:
>>> 1. Euclidean norm (often called L2), which is the default argument:<br>
`∥ x ∥2 = sqrt((x1)^2 + (x2)^2 + ⋯ + (xn)^2)`
>>>>where x is an individual observation and xn is that observation’s value for the nth
feature.
>>> 2. Manhattan norm (L1):<br>
`∥ x ∥1 = ∑(i = 1 till n)|xi|`
>>>>Intuitively, L2 norm can be thought of as the distance between two points in New
York for a bird (i.e., a straight line), while L1 can be thought of as the distance for a
human walking on the street (walk north one block, east one block, north one block,
east one block, etc.), which is why it is called `“Manhattan norm”` or `“Taxicab norm.”`
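
The two norms above can be sketched directly in NumPy for a single observation. The observation below is the second row of the feature matrix used in this section; the variable names are our own.

```python
import numpy as np

# One observation with two feature values
observation = np.array([1.1, 3.4])

# L2 (Euclidean) norm: square root of the sum of squared values
l2_norm = np.sqrt(np.sum(observation ** 2))

# L1 (Manhattan) norm: sum of absolute values
l1_norm = np.sum(np.abs(observation))

# Dividing by the norm gives the observation unit norm
l2_normalized = observation / l2_norm
l1_normalized = observation / l1_norm

print(l2_normalized)
print(l1_normalized)
```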

In [13]:
# To rescale the feature values of observations to have unit norm (a total length of 1),
# use Normalizer with a norm argument:
# Load libraries
import numpy as np
from sklearn.preprocessing import Normalizer
# Create feature matrix
features = np.array([[0.5, 0.5],
 [1.1, 3.4],
 [1.5, 20.2],
 [1.63, 34.4],
 [10.9, 3.3]])
# Create normalizer
normalizer = Normalizer(norm="l2")
# Transform feature matrix
print(normalizer.transform(features))
# array([[ 0.70710678, 0.70710678],
#  [ 0.30782029, 0.95144452],
#  [ 0.07405353, 0.99725427],
#  [ 0.04733062, 0.99887928],
#  [ 0.95709822, 0.28976368]])
# ----- or -----------------
# Transform feature matrix
features_l2_norm = Normalizer(norm="l2").transform(features)
# Show feature matrix
features_l2_norm

[[0.70710678 0.70710678]
 [0.30782029 0.95144452]
 [0.07405353 0.99725427]
 [0.04733062 0.99887928]
 [0.95709822 0.28976368]]


array([[0.70710678, 0.70710678],
       [0.30782029, 0.95144452],
       [0.07405353, 0.99725427],
       [0.04733062, 0.99887928],
       [0.95709822, 0.28976368]])

In [14]:
# Transform feature matrix
features_l1_norm = Normalizer(norm="l1").transform(features)
# Show feature matrix
features_l1_norm

array([[0.5       , 0.5       ],
       [0.24444444, 0.75555556],
       [0.06912442, 0.93087558],
       [0.04524008, 0.95475992],
       [0.76760563, 0.23239437]])

- Practically, notice that `norm="l1"` rescales an observation’s values so they sum to 1,
which can sometimes be a desirable quality:

In [15]:
# Print sum
print("Sum of the first observation\'s values:",
 features_l1_norm[0, 0] + features_l1_norm[0, 1])

# Note: Sum of the first observation's values: 1.0

Sum of the first observation's values: 1.0


# Generating Polynomial and Interaction Features
> - Polynomial features are often created when we want to include the notion that there
exists a nonlinear relationship between the features and the target. For example, we
might suspect that the effect of age on the probability of having a major medical
condition is not constant over time but increases as age increases. We can encode that
nonconstant effect in a feature, x, by generating that feature’s higher-order forms (x^2, x^3, etc.).

> **Note:**
> > Often we run into situations where the effect of one feature is dependent
on another feature.
> > > A simple example would be if we were trying to predict whether
or not our coffee was sweet, and we had two features: (1) whether or not the coffee
was stirred, and (2) whether or not we added sugar.<br><br>
> > > Individually, each feature does
not predict coffee sweetness, but the combination of their effects does. That is, a
coffee would only be sweet if the coffee had sugar and was stirred.
> > > - The effects of each
feature on the target (sweetness) are dependent on each other.
>>> -  We can encode that
relationship by including an interaction feature that is the product of the individual
features.
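
The coffee example can be sketched with a tiny, made-up table of 0/1 features. The column order (stirred, then sugar) and the variable names are assumptions for illustration only.

```python
import numpy as np

# Hypothetical observations; columns: stirred (0/1), sugar (0/1)
coffee = np.array([[0, 0],
                   [0, 1],
                   [1, 0],
                   [1, 1]])

# Interaction feature: the product of the two individual features
stirred_and_sugar = coffee[:, 0] * coffee[:, 1]

# Only the last observation (stirred AND sugar) gets a 1
print(stirred_and_sugar)
```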

In [16]:
# To create polynomial and interaction features,
# use scikit-learn's built-in PolynomialFeatures class:
# Load libraries
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
# Create feature matrix
features = np.array([[2, 3],
 [2, 3],
 [2, 3]])
# Create PolynomialFeatures object
polynomial_interaction = PolynomialFeatures(degree=2, include_bias=False)
# Create polynomial features
polynomial_interaction.fit_transform(features)

array([[2., 3., 4., 6., 9.],
       [2., 3., 4., 6., 9.],
       [2., 3., 4., 6., 9.]])

- The degree parameter determines the maximum degree of the polynomial. For
example, degree=2 will create new features raised to the second power:<br>
`x1, x2, (x1)^2, (x2)^2`
<br>
- while degree=3 will create new features raised to the second and third power:<br>
`x1, x2, (x1)^2, (x2)^2, (x1)^3, (x2)^3`
<br>
- By default, interaction terms such as `x1x2` are generated in addition to these powers, which is why the output above has five columns.

In [17]:
# Furthermore, by default PolynomialFeatures includes interaction features (e.g., x1x2).
# We can restrict the features created to only interaction features by setting
# interaction_only to True:
interaction = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
interaction.fit_transform(features)

array([[2., 3., 6.],
       [2., 3., 6.],
       [2., 3., 6.]])

# Transforming Features

In [20]:
# To make a custom transformation to one or more features,
# use FunctionTransformer in scikit-learn to apply a function to a set of features:
# Load libraries
import numpy as np
from sklearn.preprocessing import FunctionTransformer
# Create feature matrix
features = np.array([[2, 3],
 [2, 3],
 [2, 3]])
# Define a simple function
def add_ten(x: np.ndarray) -> np.ndarray:
    return x + 10
# Create transformer
ten_transformer = FunctionTransformer(add_ten)
# Transform feature matrix
ten_transformer.transform(features)

array([[12, 13],
       [12, 13],
       [12, 13]])

In [21]:
# We can create the same transformation in pandas using apply:
# Load library
import pandas as pd
# Create DataFrame
df = pd.DataFrame(features, columns=["feature_1", "feature_2"])
# Apply function
df.apply(add_ten)

   feature_1  feature_2
0         12         13
1         12         13
2         12         13


# Detecting Outliers
> - There is no single best technique for detecting outliers.
> - Instead, we have a collection
of techniques all with their own advantages and disadvantages.
>>  Our best strategy is often trying multiple techniques.

In [1]:
# To identify extreme observations, a common method is to assume the data is
# normally distributed and, based on that assumption, “draw” an ellipse around
# the data, classifying any observation inside the ellipse as an inlier
# (labeled as 1) and any observation outside the ellipse as an outlier (labeled as -1):
# Load libraries
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs
# Create simulated data
features, _ = make_blobs(n_samples = 10,
 n_features = 2,
 centers = 1,
 random_state = 1)
# Replace the first observation's values with extreme values
features[0,0] = 10000
features[0,1] = 10000
# Create detector
outlier_detector = EllipticEnvelope(contamination=.1)
# Fit detector
outlier_detector.fit(features)
# Predict outliers
outlier_detector.predict(features)
# array([-1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

array([-1,  1,  1,  1,  1,  1,  1,  1,  1,  1])

> - In these arrays, values of -1 refer to outliers whereas values of 1 refer to inliers.
> - A major limitation of this approach is the need to specify a contamination parameter,
which is the proportion of observations that are outliers—a value that we don’t
know.
>> - Think of contamination as our estimate of the cleanliness of our data. If we
expect our data to have few outliers, we can set contamination to something small.
However, if we believe that the data is likely to have outliers, we can set it to a higher value.
In [2]:
# Instead of looking at observations as a whole, we can instead look at individual
# features and identify extreme values in those features using interquartile range (IQR):
# Create one feature
feature = features[:,0]
# Create a function to return the indices of outliers
def indices_of_outliers(x: np.ndarray) -> tuple:
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - (iqr * 1.5)
    upper_bound = q3 + (iqr * 1.5)
    return np.where((x > upper_bound) | (x < lower_bound))
# Run function
indices_of_outliers(feature)
# (array([0]),)

(array([0], dtype=int64),)

> - IQR is the difference between the first and third quartile of a set of data. You can
think of IQR as the spread of the bulk of the data, with outliers being observations far
from the main concentration of data.
>> - Outliers are commonly defined as any value more than 1.5
IQRs below the first quartile, or more than 1.5 IQRs above the third quartile.
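
A small worked example of the rule, using a made-up array so the arithmetic is easy to follow by hand. The numbers and names here are illustrative and not part of the original data.

```python
import numpy as np

# Hypothetical feature with one obviously extreme value
values = np.array([1, 2, 3, 4, 5, 100])

# First and third quartiles (NumPy's default linear interpolation)
q1, q3 = np.percentile(values, [25, 75])   # 2.25 and 4.75
iqr = q3 - q1                              # 2.5

# Fences at 1.5 IQRs beyond each quartile
lower_bound = q1 - 1.5 * iqr               # -1.5
upper_bound = q3 + 1.5 * iqr               # 8.5

# Positions of the values that fall outside the fences
outlier_positions = np.where((values < lower_bound) | (values > upper_bound))[0]
print(outlier_positions)
```

Only the value 100 (position 5) falls outside the fences.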

# Handling Outliers
> - First, consider what makes them outliers. If we believe they are errors in the data, such as from a broken
sensor or a miscoded value, then we might drop the observation or replace outlier
values with NaN since we can’t trust those values.
> - Second, how we handle outliers should be based on our goal for machine learning.
>> For example, if we want to predict house prices based on features of the house, we
might reasonably assume the price for mansions with over 100 bathrooms is driven
by a different dynamic than regular family homes.
>_____________________________________________________________

> What should we do if we have outliers?
> > Think about why they are outliers, have an
end goal in mind for the data, and, most importantly, remember that not making a
decision to address outliers is itself a decision with implications.
>>> - **Note:** if you do have outliers, standardization might not be appropriate
because the mean and variance might be highly influenced by the outliers. In this
case, use a rescaling method more robust against outliers, like `RobustScaler`.
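
If we decide that an extreme value is an error we cannot trust, one option mentioned above is to replace it with NaN rather than drop the whole observation. The sketch below is one possible way to do that in pandas; the cutoff of 20 bathrooms is an illustrative assumption, not a general rule.

```python
import numpy as np
import pandas as pd

# Small illustrative DataFrame of houses
houses = pd.DataFrame({
    "Price": [534433, 392333, 293222, 4322032],
    "Bathrooms": [2, 3.5, 2, 116],
})

# Replace suspicious values with NaN, keeping the rest of the row
# (the threshold of 20 is a hypothetical choice for this example)
cleaned = houses.copy()
cleaned["Bathrooms"] = cleaned["Bathrooms"].mask(cleaned["Bathrooms"] >= 20)

print(cleaned)
```

The NaN can then be handled with the missing-value techniques covered later in this chapter.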

In [3]:
# Suppose you have outliers in your data and want to reduce their impact on the data distribution.
# Typically we can use three strategies to handle outliers.
# First, we can drop them:
# Load library
import pandas as pd
# Create DataFrame
houses = pd.DataFrame()
houses['Price'] = [534433, 392333, 293222, 4322032]
houses['Bathrooms'] = [2, 3.5, 2, 116]
houses['Square_Feet'] = [1500, 2500, 1500, 48000]
# Filter observations
houses[houses['Bathrooms'] < 20]

    Price  Bathrooms  Square_Feet
0  534433        2.0         1500
1  392333        3.5         2500
2  293222        2.0         1500


In [5]:
# Second, we can mark them as outliers and include “Outlier” as a feature:
# Load library
import numpy as np
# Create feature based on boolean condition
houses["Outlier"] = np.where(houses["Bathrooms"] < 20, 0, 1)
# Show data
houses

     Price  Bathrooms  Square_Feet  Outlier
0   534433        2.0         1500        0
1   392333        3.5         2500        0
2   293222        2.0         1500        0
3  4322032      116.0        48000        1


In [7]:
# Finally, we can transform the feature to dampen the effect of the outlier:
# Log feature
houses["Log_Of_Square_Feet"] = np.log(houses["Square_Feet"])
# Show data
houses

     Price  Bathrooms  Square_Feet  Outlier  Log_Of_Square_Feet
0   534433        2.0         1500        0            7.313220
1   392333        3.5         2500        0            7.824046
2   293222        2.0         1500        0            7.313220
3  4322032      116.0        48000        1           10.778956


# Discretizing Features
> - Discretization can be a fruitful strategy when we have reason to believe that a numerical feature should behave more like a categorical feature.
> - **Note:** we can use clustering as a preprocessing step. 

In [8]:
# Suppose you have a numerical feature and want to break it up into discrete bins.
# Depending on how we want to break up the data, there are two techniques we can use.
# First, we can binarize the feature according to some threshold:
# Load libraries
import numpy as np
from sklearn.preprocessing import Binarizer
# Create feature
age = np.array([[6],
 [12],
 [20],
 [36],
 [65]])
# Create binarizer
binarizer = Binarizer(threshold=18)
# Transform feature
binarizer.fit_transform(age)

array([[0],
       [0],
       [1],
       [1],
       [1]])

In [9]:
# Second, we can break up numerical features according to multiple thresholds:
# Bin feature
np.digitize(age, bins=[20,30,64])

array([[0],
       [0],
       [1],
       [2],
       [3]], dtype=int64)
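
Two details of `np.digitize` are worth a quick sketch. By default each bin includes its left edge (so 20 lands in bin 1 above), and setting `right=True` flips that so the right edge is included instead. A single-element `bins` list behaves like a simple threshold.

```python
import numpy as np

# Same ages as in the example above
age = np.array([[6], [12], [20], [36], [65]])

# Default: bins[i-1] <= x < bins[i], so 20 falls in bin 1
left_closed = np.digitize(age, bins=[20, 30, 64])

# right=True: bins[i-1] < x <= bins[i], so 20 falls in bin 0
right_closed = np.digitize(age, bins=[20, 30, 64], right=True)

# A single threshold splits the feature into two groups
thresholded = np.digitize(age, bins=[18])

print(left_closed.ravel())
print(right_closed.ravel())
print(thresholded.ravel())
```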

# Deleting Observations with Missing Values
> - Most machine learning algorithms cannot handle any missing values in the target and
feature arrays. The simplest solution is to delete every observation that contains one or more
missing values.
>> That said, we should be very reluctant to delete observations with missing values.
Deleting them is the nuclear option.

> There are three types of missing data:

> 1. Missing completely at random (MCAR):
>> The probability that a value is missing is independent of everything.<br><br>
>> For example,
a survey respondent rolls a die before answering a question: if she rolls a six, she
skips that question.
>_________________________________________
> 2. Missing at random (MAR):
>> The probability that a value is missing is not completely random but depends
on the information captured in other features.<br><br>
>> For example, a survey asks about
gender identity and annual salary, and women are more likely to skip the salary
question; however, their nonresponse depends only on information we have
captured in our gender identity feature.
>____________________________________
> 3. Missing not at random (MNAR):
>> The probability that a value is missing is not random and depends on information
not captured in our features.<br><br>
>> For example, a survey asks about annual salary,
and women are more likely to skip the salary question, and we do not have a
gender identity feature in our data.<br>

> - **Note:** It is sometimes acceptable to delete observations if they are MCAR or MAR.
However, if the value is MNAR, the fact that a value is missing is itself information.
Deleting MNAR observations can inject bias into our data because we are removing
observations produced by some unobserved systematic effect.
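
To make the difference between MCAR and MAR concrete, here is a small simulation sketch. Everything in it is made up for illustration: the sample size, the salary distribution, and the skip probabilities (one in six for MCAR, mirroring the die roll; 0.4 versus 0.1 for MAR) are assumptions, not values from any real survey.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical survey: an observed gender feature and a salary answer
is_woman = rng.random(n) < 0.5
salary = rng.normal(50_000, 10_000, size=n)

# MCAR: every respondent skips with the same probability (a die roll of six)
mcar_mask = rng.random(n) < 1 / 6

# MAR: the skip probability depends only on the observed gender feature
mar_mask = rng.random(n) < np.where(is_woman, 0.4, 0.1)

# Salary columns with the skipped answers set to NaN
salary_mcar = np.where(mcar_mask, np.nan, salary)
salary_mar = np.where(mar_mask, np.nan, salary)

# Missing rates by group: similar under MCAR, clearly different under MAR
print(mcar_mask[is_woman].mean(), mcar_mask[~is_woman].mean())
print(mar_mask[is_woman].mean(), mar_mask[~is_woman].mean())
```

If the `is_woman` column were removed from the data, the second case would become MNAR, because the cause of the missingness would no longer be captured in our features.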

In [10]:
# If you need to delete observations containing missing values, it is easy with a clever line of NumPy:
# Load library
import numpy as np
# Create feature matrix
features = np.array([[1.1, 11.1],
 [2.2, 22.2],
 [3.3, 33.3],
 [4.4, 44.4],
 [np.nan, 55]])
# Keep only observations that are not (denoted by ~) missing
features[~np.isnan(features).any(axis=1)]

array([[ 1.1, 11.1],
       [ 2.2, 22.2],
       [ 3.3, 33.3],
       [ 4.4, 44.4]])

In [12]:
# Alternatively, we can drop missing observations using pandas:
# Load library
import pandas as pd
# Load data
dataframe = pd.DataFrame(features, columns=["feature_1", "feature_2"])
# Remove observations with missing values
dataframe.dropna()

   feature_1  feature_2
0        1.1       11.1
1        2.2       22.2
2        3.3       33.3
3        4.4       44.4


# Imputing Missing Values
> There are two main strategies for replacing missing data with substitute values:<br>
>> **First:** we can use machine learning to
predict the values of the missing data. To do this we treat the feature with missing
values as a target vector and use the remaining subset of features to predict missing
values.<br>
>>> A popular choice is KNN. The
short explanation is that the algorithm uses the k nearest observations (according to
some distance metric) to predict the missing value. In the code below, we predict the
missing value using the five closest observations.<br>
>>> - **Note:** The downside to KNN is that in order to know which observations are the closest to
the missing value, it needs to calculate the distance between the missing value and
every single observation. This is reasonable in smaller datasets but quickly becomes
problematic if a dataset has millions of observations. In such cases, approximate
nearest neighbors (ANN) is a more feasible approach.<br>
>>______________________________________________
>> **Second:** we can use scikit-learn’s SimpleImputer class from the `impute` module
to fill in missing values with the feature’s mean, median, or most frequent value.
>>> - We will typically get worse results than with KNN ;(
<br><br>
> - **Note:** If we use imputation, it is a good idea to create a binary feature indicating whether
the observation contains an imputed value.
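
One possible way to create that binary indicator is the `add_indicator` option of `SimpleImputer`, sketched below on a tiny made-up matrix. With this option, the output gains one extra 0/1 column for each feature that had missing values during fitting.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix with one missing value in the first column
features = np.array([[1.0, 2.0],
                     [np.nan, 3.0],
                     [3.0, 4.0]])

# Impute with the column mean and append a missing-value indicator column
imputer = SimpleImputer(strategy="mean", add_indicator=True)
imputed = imputer.fit_transform(features)

# Columns: feature 1 (imputed), feature 2, indicator for feature 1
print(imputed)
```

The second row's missing value is replaced by the mean of the observed values (2.0), and the last column flags that row with a 1.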

In [13]:
# Suppose you have missing values in your data and want to impute them via a generic method or prediction.
# You can impute missing values using k-nearest neighbors (KNN) or the scikit-learn SimpleImputer class.
# If you have a small amount of data, predict and impute the missing values using k-nearest neighbors:
# Load libraries
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs
# Make a simulated feature matrix
features, _ = make_blobs(n_samples = 1000,
 n_features = 2,
 random_state = 1)
# Standardize the features
scaler = StandardScaler()
standardized_features = scaler.fit_transform(features)

# Replace the first feature's first value with a missing value
true_value = standardized_features[0,0]
standardized_features[0,0] = np.nan
# Predict the missing values in the feature matrix
knn_imputer = KNNImputer(n_neighbors=5)
features_knn_imputed = knn_imputer.fit_transform(standardized_features)
# Compare true and imputed values
print("True Value:", true_value)
print("Imputed Value:", features_knn_imputed[0,0])

True Value: 0.8730186113995938
Imputed Value: 1.0959262913919632


In [14]:
# Alternatively, we can use scikit-learn’s SimpleImputer class from the impute module
# to fill in missing values with the feature’s mean, median, or most frequent value.
# However, we will typically get worse results than with KNN:
# Load libraries
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs
# Make a simulated feature matrix
features, _ = make_blobs(n_samples = 1000,
 n_features = 2,
 random_state = 1)
# Standardize the features
scaler = StandardScaler()
standardized_features = scaler.fit_transform(features)
# Replace the first feature's first value with a missing value
true_value = standardized_features[0,0]
standardized_features[0,0] = np.nan
# Create imputer using the "mean" strategy
mean_imputer = SimpleImputer(strategy="mean")
# Impute values
features_mean_imputed = mean_imputer.fit_transform(standardized_features)
# Compare true and imputed values
print("True Value:", true_value)
print("Imputed Value:", round(features_mean_imputed[0,0], 6))

True Value: 0.8730186113995938
Imputed Value: -0.000874


# END of Chapter 04 ---> Handling Numeric Data