## CHAPTER 4 
---
# HANDLING NUMERICAL DATA

---
### 2.1 Rescaling a Feature

- Load NumPy library
- Load 'preprocessing' from 'sklearn'

In [1]:
import numpy as np
from sklearn import preprocessing

- Create a NumPy array (feature named 'array_1') containing: -500.5, -100.1, 0, 100.1, 900.9
- Show 'array_1'

In [2]:
array_1 = np.array([[-500.5], [-100.1], [0], [100.1], [900.9]])
array_1

array([[-500.5],
       [-100.1],
       [   0. ],
       [ 100.1],
       [ 900.9]])

- Create a 'MinMaxScaler' (named 'minmax_scaler') with a range of 0-1
- Show 'minmax_scaler'

In [3]:
minmax_scaler = preprocessing.MinMaxScaler(feature_range=(0,1))
minmax_scaler

MinMaxScaler()

- Use 'minmax_scaler' to scale 'array_1' and name it 'array_1_scaled'
- Show 'array_1_scaled'

In [4]:
array_1_scaled = minmax_scaler.fit_transform(array_1)
array_1_scaled

array([[0.        ],
       [0.28571429],
       [0.35714286],
       [0.42857143],
       [1.        ]])

- Create a 'MinMaxScaler' (named 'minmax_scaler_2') with a range of 0-5
- Use 'minmax_scaler_2' to scale 'array_1' and name it 'array_1_scaled_2'
- Show 'array_1_scaled_2'

In [5]:
minmax_scaler_2 = preprocessing.MinMaxScaler(feature_range=(0,5))
array_1_scaled_2 = minmax_scaler_2.fit_transform(array_1)
array_1_scaled_2

array([[0.        ],
       [1.42857143],
       [1.78571429],
       [2.14285714],
       [5.        ]])

- Create a 'MinMaxScaler' (named 'minmax_scaler_3') with a range of -5 to 5
- Use 'minmax_scaler_3' to scale 'array_1' and name it 'array_1_scaled_3'
- Show 'array_1_scaled_3'

In [6]:
minmax_scaler_3 = preprocessing.MinMaxScaler(feature_range=(-5,5))
array_1_scaled_3 = minmax_scaler_3.fit_transform(array_1)
array_1_scaled_3

array([[-5.        ],
       [-2.14285714],
       [-1.42857143],
       [-0.71428571],
       [ 5.        ]])

**MinMaxScaler Formula:**
$$
x_i' = \frac{x_i - \min(x)}{\max(x) - \min(x)}
$$
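
The formula above can be checked by hand with plain NumPy. The sketch below is illustrative only: `minmax_rescale` is a hypothetical helper name, not part of scikit-learn, and it assumes the feature is not constant (otherwise the denominator is zero). Stretching the 0-1 result to another `feature_range` is done here by multiplying by the width of the range and adding its lower bound.

```python
import numpy as np

def minmax_rescale(x, feature_range=(0.0, 1.0)):
    """Rescale each column of x to feature_range (hypothetical helper)."""
    low, high = feature_range
    x = np.asarray(x, dtype=float)
    x_min = x.min(axis=0)
    x_max = x.max(axis=0)
    # Formula from the text: (x_i - min(x)) / (max(x) - min(x))
    unit = (x - x_min) / (x_max - x_min)
    # Stretch and shift the 0-1 values to the requested range
    return unit * (high - low) + low

array_1 = np.array([[-500.5], [-100.1], [0], [100.1], [900.9]])
scaled_01 = minmax_rescale(array_1)
scaled_pm5 = minmax_rescale(array_1, feature_range=(-5, 5))

print(scaled_01.ravel())
print(scaled_pm5.ravel())
```

The printed values should agree with the 'MinMaxScaler' outputs shown earlier for the 0-1 and -5 to 5 ranges.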

### 2.2 Standardizing a Feature

- Create a NumPy array (feature named 'array_2') containing: -1000.1, -200.2, 500.5, 600.6, 9000.9
- Show 'array_2'

In [7]:
array_2 = np.array([[-1000.1], [-200.2], [500.5], [600.6], [9000.9]])
array_2

array([[-1000.1],
       [ -200.2],
       [  500.5],
       [  600.6],
       [ 9000.9]])

- Create a 'StandardScaler' named 'std_scaler'
- Show 'std_scaler'

In [8]:
std_scaler = preprocessing.StandardScaler()
std_scaler

StandardScaler()

- Standardize 'array_2' and name it 'array_2_stdized'
- Show 'array_2_stdized'

In [9]:
array_2_stdized = std_scaler.fit_transform(array_2)
array_2_stdized

array([[-0.76058269],
       [-0.54177196],
       [-0.35009716],
       [-0.32271504],
       [ 1.97516685]])

- Print the rounded mean of 'array_2'
- Print the rounded mean of 'array_2_stdized'

In [10]:
print(round(array_2.mean()))
print(round(array_2_stdized.mean()))

1780
0


- Print the rounded standard deviation of 'array_2'
- Print the rounded standard deviation of 'array_2_stdized'

In [11]:
print(round(array_2.std()))
print(round(array_2_stdized.std()))

3656
1


**StandardScaler Formula:**
$$
x_i' = \frac{x_i - \bar{x}}{\sigma}
$$

### 2.3 Normalizing Observations

- Import 'Normalizer' from 'sklearn.preprocessing'
- Create a 2d array ('array_3') containing: 0.5, 0.5; 1.1, 3.4; 1.5, 20.2; 1.63, 34.4; and 10.9, 3.3
- Create a normalizer ('norm_l2') with 'norm=l2'
- Normalize 'array_3'

In [12]:
from sklearn.preprocessing import Normalizer

array_3 = np.array([
                    [0.5, 0.5],
                    [1.1, 3.4],
                    [1.5, 20.2],
                    [1.63, 34.4],
                    [10.9, 3.3]
                ])
norm_l2 = Normalizer(norm="l2")
norm_l2.transform(array_3)

array([[0.70710678, 0.70710678],
       [0.30782029, 0.95144452],
       [0.07405353, 0.99725427],
       [0.04733062, 0.99887928],
       [0.95709822, 0.28976368]])

- Create a normalizer with 'norm=l2' and normalize 'array_3' in one step

In [13]:
Normalizer(norm="l2").transform(array_3)

array([[0.70710678, 0.70710678],
       [0.30782029, 0.95144452],
       [0.07405353, 0.99725427],
       [0.04733062, 0.99887928],
       [0.95709822, 0.28976368]])

- Create a normalizer with 'norm=l1' and normalize 'array_3' in one step

In [14]:
Normalizer(norm="l1").transform(array_3)

array([[0.5       , 0.5       ],
       [0.24444444, 0.75555556],
       [0.06912442, 0.93087558],
       [0.04524008, 0.95475992],
       [0.76760563, 0.23239437]])

**Note:** Intuitively, the L2 norm is the straight-line distance a bird would fly between two points in New York, while the L1 norm is the distance a person would walk along the street grid (north one block, east one block, and so on), which is why it is called the “Manhattan norm” or “taxicab norm.” In practice, norm='l1' rescales each observation so that the absolute values of its entries sum to 1 (for non-negative data such as 'array_3', the values themselves sum to 1), which can sometimes be a desirable property.
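
To make the two norms concrete, here is a minimal NumPy sketch of row-wise normalization. `normalize_rows` is a hypothetical helper written for illustration, not scikit-learn's implementation, and it assumes no row consists entirely of zeros.

```python
import numpy as np

def normalize_rows(x, norm="l2"):
    """Divide each row by its L1 or L2 norm (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    if norm == "l2":
        # Euclidean length of each row: sqrt of the sum of squares
        lengths = np.sqrt((x ** 2).sum(axis=1, keepdims=True))
    elif norm == "l1":
        # Manhattan length of each row: sum of absolute values
        lengths = np.abs(x).sum(axis=1, keepdims=True)
    else:
        raise ValueError("norm must be 'l1' or 'l2'")
    return x / lengths

array_3 = np.array([
    [0.5, 0.5],
    [1.1, 3.4],
    [1.5, 20.2],
    [1.63, 34.4],
    [10.9, 3.3],
])

rows_l2 = normalize_rows(array_3, norm="l2")
rows_l1 = normalize_rows(array_3, norm="l1")

print(rows_l2)
print(rows_l1.sum(axis=1))  # each row of non-negative data sums to 1
```

With L2 each row ends up with unit Euclidean length, while with L1 the absolute values in each row sum to 1.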

### 2.4 Transforming Features

- Import 'FunctionTransformer' from 'sklearn.preprocessing'
- Create a 2d array ('array_4') containing: 2, 3; 2, 3; and 2, 3
- Define a simple function ('add_10') that takes in one argument ('x') and returns 'x+10'
- Create a transformer, 'ten_transformer'
- Transform 'array_4'

In [15]:
from sklearn.preprocessing import FunctionTransformer

array_4 = np.array([[2, 3], [2, 3], [2, 3]])

def add_10(x):
    return x + 10

ten_transformer = FunctionTransformer(add_10)
ten_transformer.transform(array_4)

array([[12, 13],
       [12, 13],
       [12, 13]])

- Import Pandas
- Create a dataframe ('array_4_df') from 'array_4' with 'col1' and 'col2' as columns 
- Apply 'add_10' function to the dataframe

In [16]:
import pandas as pd

array_4_df = pd.DataFrame(array_4, columns=['col1', 'col2'])
array_4_df.apply(add_10)

|   | col1 | col2 |
|---|------|------|
| 0 | 12 | 13 |
| 1 | 12 | 13 |
| 2 | 12 | 13 |


### 2.5 Deleting Observations with Missing Values

- Create a 2d array ('array_5') containing: 1.1, 11.1; 2.2, 22.2; 3.3, 33.3; 4.4, 44.4; np.nan, 55
- Keep only observations that are not missing

In [17]:
array_5 = np.array([[1.1, 11.1], [2.2, 22.2], [3.3, 33.3], [4.4, 44.4], [np.nan, 55]])
array_5[~np.isnan(array_5).any(axis=1)]

array([[ 1.1, 11.1],
       [ 2.2, 22.2],
       [ 3.3, 33.3],
       [ 4.4, 44.4]])

- Create a dataframe ('array_5_df') from 'array_5' with 'col1' and 'col2' as columns
- Remove observations with missing values

In [18]:
array_5_df = pd.DataFrame(array_5, columns=['col1', 'col2'])
array_5_df.dropna()

|   | col1 | col2 |
|---|------|------|
| 0 | 1.1 | 11.1 |
| 1 | 2.2 | 22.2 |
| 2 | 3.3 | 33.3 |
| 3 | 4.4 | 44.4 |
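
The NumPy one-liner used for 'array_5' packs three steps into a single expression. The sketch below unpacks them; the intermediate variable names are illustrative only.

```python
import numpy as np

array_5 = np.array([[1.1, 11.1], [2.2, 22.2], [3.3, 33.3], [4.4, 44.4], [np.nan, 55]])

# Step 1: elementwise test, True wherever a value is NaN
is_missing = np.isnan(array_5)

# Step 2: collapse across columns, True for any row with at least one NaN
row_has_missing = is_missing.any(axis=1)

# Step 3: invert the mask and keep only the complete rows
complete_rows = array_5[~row_has_missing]

print(row_has_missing)      # [False False False False  True]
print(complete_rows.shape)  # (4, 2)
```

Naming the mask also makes it easy to count how many observations were dropped, for example with `row_has_missing.sum()`.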


### 2.6 Imputing Missing Values

- Import 'StandardScaler' from 'sklearn.preprocessing'
- Import 'make_blobs' from 'sklearn.datasets'
- Import 'SimpleImputer' from 'sklearn.impute'
- Make fake data with 'make_blobs' (1000 samples, 2 features, and a random state of 1), storing the features in 'fake' and the cluster labels in 'blobs'
- Show the first 5 rows of 'fake' and 'blobs'

In [19]:
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs
from sklearn.impute import SimpleImputer

fake,blobs = make_blobs(n_samples = 1000,
                        n_features = 2,
                        random_state = 1)
fake[:5],blobs[:5]

(array([[-3.05837272,  4.48825769],
        [-8.60973869, -3.72714879],
        [ 1.37129721,  5.23107449],
        [-9.33917563, -2.9544469 ],
        [-8.63895561, -8.05263469]]),
 array([0, 1, 0, 1, 2]))

- Standardize the data and name it 'standardized_blobs'
- Assign the first value of 'standardized_blobs' to a variable named 'true_value'
- Show the first 3 rows of 'standardized_blobs'

In [20]:
standardized_blobs = StandardScaler().fit_transform(fake)
true_value = standardized_blobs[0, 0]
standardized_blobs[:3]

array([[ 0.87301861,  1.31426523],
       [-0.67073178, -0.22369263],
       [ 2.1048424 ,  1.45332359]])

- Replace the first value of 'standardized_blobs' with 'np.nan'
- Show the first 3 rows of 'standardized_blobs'

In [21]:
standardized_blobs[0,0] = np.nan
standardized_blobs[:3]

array([[        nan,  1.31426523],
       [-0.67073178, -0.22369263],
       [ 2.1048424 ,  1.45332359]])

- Create a mean imputer named 'mean_imputer'
- Impute 'standardized_blobs' (the array that contains the missing value) and call the result 'blob_imputed'
- Using the format function, print "True Value" and 'true_value', rounded to six decimal places
- Using the format function, print "Imputed Value" and 'blob_imputed' at 0,0, rounded to six decimal places

In [22]:
mean_imputer = SimpleImputer(strategy="mean")
# Impute the standardized array, which is the one holding the NaN
blob_imputed = mean_imputer.fit_transform(standardized_blobs)

print("True Value: {:.6f}".format(true_value))
print("Imputed Value: {:.6f}".format(blob_imputed[0,0]))

True Value: 0.873019
Imputed Value: -0.000874
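
Mean imputation itself is simple enough to sketch in NumPy: each missing entry is replaced by the mean of the observed values in its column. The example below is a minimal illustration on a small array rather than on the blob data; the variable names are assumptions made for this sketch.

```python
import numpy as np

data = np.array([[1.1, 11.1], [2.2, 22.2], [3.3, 33.3], [4.4, 44.4], [np.nan, 55.0]])

# Column means computed while ignoring NaN entries
col_means = np.nanmean(data, axis=0)

# Fill each missing entry with the mean of its own column
missing = np.isnan(data)
imputed = np.where(missing, col_means, data)

print(col_means)      # first column mean uses only the four observed values
print(imputed[4, 0])  # 2.75
```

Because the fill value is a column average, it can be far from the value that was actually missing, which is why comparing it with the true value is a useful sanity check.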
