## CHAPTER 4 & 5
---
# HANDLING NUMERICAL DATA

---
### 2.1 Rescaling a Feature

- Load NumPy library
- Load 'preprocessing' from 'sklearn'

In [1]:
import numpy as np
from sklearn import preprocessing

- Create a NumPy array (feature named 'array_1') containing: -500.5, -100.1, 0, 100.1, 900.9
- Show 'array_1'

In [2]:
array_1 = np.array([[-500.5], [-100.1], [0], [100.1], [900.9]])
array_1

array([[-500.5],
       [-100.1],
       [   0. ],
       [ 100.1],
       [ 900.9]])

- Create a 'MinMaxScaler' (named 'minmax_scaler') with a range of 0-1
- Show 'minmax_scaler'

In [3]:
minmax_scaler = preprocessing.MinMaxScaler(feature_range=(0,1))
minmax_scaler

MinMaxScaler(copy=True, feature_range=(0, 1))

- Use 'minmax_scaler' to scale 'array_1' and name it 'array_1_scaled'
- Show 'array_1_scaled'

In [4]:
array_1_scaled = minmax_scaler.fit_transform(array_1)
array_1_scaled

array([[0.        ],
       [0.28571429],
       [0.35714286],
       [0.42857143],
       [1.        ]])

- Create a 'MinMaxScaler' (named 'minmax_scaler_2') with a range of 0-5
- Use 'minmax_scaler_2' to scale 'array_1' and name it 'array_1_scaled_2'
- Show 'array_1_scaled_2'

In [5]:
minmax_scaler_2 = preprocessing.MinMaxScaler(feature_range=(0,5))
array_1_scaled_2 = minmax_scaler_2.fit_transform(array_1)
array_1_scaled_2

array([[0.        ],
       [1.42857143],
       [1.78571429],
       [2.14285714],
       [5.        ]])

- Create a 'MinMaxScaler' (named 'minmax_scaler_3') with a range of -5 to 5
- Use 'minmax_scaler_3' to scale 'array_1' and name it 'array_1_scaled_3'
- Show 'array_1_scaled_3'

In [6]:
minmax_scaler_3 = preprocessing.MinMaxScaler(feature_range=(-5,5))
array_1_scaled_3 = minmax_scaler_3.fit_transform(array_1)
array_1_scaled_3

array([[-5.        ],
       [-2.14285714],
       [-1.42857143],
       [-0.71428571],
       [ 5.        ]])

**MinMaxScaler Formula:**
$$
x_i' = \frac{x_i - \min(x)}{\max(x) - \min(x)}
$$
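
The formula above maps a feature onto the range 0 to 1. A minimal sketch of the same arithmetic in NumPy, extended to an arbitrary `feature_range` by stretching and shifting the unit-range result, is shown here. The helper name `min_max_rescale` is hypothetical (it is not part of scikit-learn), and the sketch assumes no column is constant, since that would divide by zero.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler


def min_max_rescale(x, feature_range=(0.0, 1.0)):
    """Rescale each column of x to feature_range (hypothetical helper)."""
    low, high = feature_range
    x = np.asarray(x, dtype=float)
    x_min = x.min(axis=0)
    x_max = x.max(axis=0)
    # The formula above: values land in [0, 1]
    unit = (x - x_min) / (x_max - x_min)
    # Stretch and shift the unit range to [low, high]
    return unit * (high - low) + low


array_1 = np.array([[-500.5], [-100.1], [0], [100.1], [900.9]])

manual = min_max_rescale(array_1, feature_range=(-5, 5))
reference = MinMaxScaler(feature_range=(-5, 5)).fit_transform(array_1)

# The hand-rolled version should agree with scikit-learn's scaler
print(np.allclose(manual, reference))
```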

### 2.2 Standardizing a Feature

- Create a NumPy array (feature named 'array_2') containing: -1000.1, -200.2, 500.5, 600.6, 9000.9
- Show 'array_2'

In [7]:
array_2 = np.array([[-1000.1], [-200.2], [500.5], [600.6], [9000.9]])
array_2

array([[-1000.1],
       [ -200.2],
       [  500.5],
       [  600.6],
       [ 9000.9]])

- Create a 'StandardScaler' named 'std_scaler'
- Show 'std_scaler'

In [8]:
std_scaler = preprocessing.StandardScaler()
std_scaler

StandardScaler(copy=True, with_mean=True, with_std=True)

- Standardize 'array_2' and name it 'array_2_stdized'
- Show 'array_2_stdized'

In [9]:
array_2_stdized = std_scaler.fit_transform(array_2)
array_2_stdized

array([[-0.76058269],
       [-0.54177196],
       [-0.35009716],
       [-0.32271504],
       [ 1.97516685]])

- Print the rounded mean of 'array_2'
- Print the rounded mean of 'array_2_stdized'

In [10]:
print(round(array_2.mean()))
print(round(array_2_stdized.mean()))

1780.0
0.0


- Print the rounded standard deviation of 'array_2'
- Print the rounded standard deviation of 'array_2_stdized'

In [11]:
print(round(array_2.std()))
print(round(array_2_stdized.std()))

3656.0
1.0


**StandardScaler Formula:**
$$
x_i' = \frac{x_i - \bar{x}}{\sigma}
$$
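
As a quick check of the formula, the standardization can be reproduced by hand. This sketch assumes the population standard deviation (NumPy's default, `ddof=0`), which is what `StandardScaler` uses; the variable names `manual_z` and `reference_z` are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

array_2 = np.array([[-1000.1], [-200.2], [500.5], [600.6], [9000.9]])

# Column mean and population standard deviation (ddof=0)
mean = array_2.mean(axis=0)
std = array_2.std(axis=0)

# Apply the formula element-wise
manual_z = (array_2 - mean) / std

reference_z = StandardScaler().fit_transform(array_2)

# Both approaches should give the same standardized values
print(np.allclose(manual_z, reference_z))
```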

### 2.3 Normalizing Observations

- Import 'Normalizer' from 'sklearn.preprocessing'
- Create a 2d array ('array_3') containing: 0.5, 0.5; 1.1, 3.4; 1.5, 20.2; 1.63, 34.4; and 10.9, 3.3
- Create a normalizer ('norm_l2') with 'norm=l2'
- Normalize 'array_3'

In [12]:
from sklearn.preprocessing import Normalizer

array_3 = np.array([
                    [0.5, 0.5],
                    [1.1, 3.4],
                    [1.5, 20.2],
                    [1.63, 34.4],
                    [10.9, 3.3]
                ])
norm_l2 = Normalizer(norm="l2")
norm_l2.transform(array_3)

array([[0.70710678, 0.70710678],
       [0.30782029, 0.95144452],
       [0.07405353, 0.99725427],
       [0.04733062, 0.99887928],
       [0.95709822, 0.28976368]])

- Create a normalizer with 'norm=l2' and normalize 'array_3' in one step

In [13]:
Normalizer(norm="l2").transform(array_3)

array([[0.70710678, 0.70710678],
       [0.30782029, 0.95144452],
       [0.07405353, 0.99725427],
       [0.04733062, 0.99887928],
       [0.95709822, 0.28976368]])

- Create a normalizer with 'norm=l1' and normalize 'array_3' in one step

In [14]:
Normalizer(norm="l1").transform(array_3)

array([[0.5       , 0.5       ],
       [0.24444444, 0.75555556],
       [0.06912442, 0.93087558],
       [0.04524008, 0.95475992],
       [0.76760563, 0.23239437]])

**Note:** Intuitively, the L2 norm is the straight-line distance between two points in New York as a bird would fly, while the L1 norm is the distance for a person walking the street grid (north one block, east one block, and so on), which is why it is called the “Manhattan norm” or “taxicab norm.” In practice, norm='l1' rescales an observation’s values so that their absolute values sum to 1, which can sometimes be a desirable property.
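
The two norms can also be computed directly, which makes the note concrete: dividing each row by its L2 norm gives rows of unit Euclidean length, and dividing by its L1 norm gives rows whose absolute values sum to 1. This is a sketch using `np.linalg.norm`; the names `manual_l1` and `manual_l2` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

array_3 = np.array([
    [0.5, 0.5],
    [1.1, 3.4],
    [1.5, 20.2],
    [1.63, 34.4],
    [10.9, 3.3],
])

# Per-row norms; keepdims=True keeps shape (5, 1) so the division broadcasts
l2_norms = np.linalg.norm(array_3, ord=2, axis=1, keepdims=True)
l1_norms = np.linalg.norm(array_3, ord=1, axis=1, keepdims=True)

manual_l2 = array_3 / l2_norms
manual_l1 = array_3 / l1_norms

# Compare with scikit-learn's Normalizer
print(np.allclose(manual_l2, Normalizer(norm="l2").transform(array_3)))
print(np.allclose(manual_l1, Normalizer(norm="l1").transform(array_3)))

# Each L1-normalized row sums to 1 (all values here are positive)
print(manual_l1.sum(axis=1))
```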

### 2.4 Transforming Features

- Import 'FunctionTransformer' from 'sklearn.preprocessing'
- Create a 2d array ('array_4') containing: 2, 3; 2, 3; and 2, 3
- Define a simple function ('add_10') that takes in one argument ('x') and returns 'x+10'
- Create a transformer, 'ten_transformer'
- Transform 'array_4'

In [15]:
from sklearn.preprocessing import FunctionTransformer

array_4 = np.array([[2, 3], [2, 3], [2, 3]])

def add_10(x):
    return x + 10

ten_transformer = FunctionTransformer(add_10)
ten_transformer.transform(array_4)

array([[12, 13],
       [12, 13],
       [12, 13]])

- Import Pandas
- Create a dataframe ('array_4_df') from 'array_4' with 'col1' and 'col2' as columns 
- Apply 'add_10' function to the dataframe

In [16]:
import pandas as pd

array_4_df = pd.DataFrame(array_4, columns=['col1', 'col2'])
array_4_df.apply(add_10)

   col1  col2
0    12    13
1    12    13
2    12    13


### 2.5 Deleting Observations with Missing Values

- Create a 2d array ('array_5') containing: 1.1, 11.1; 2.2, 22.2; 3.3, 33.3; 4.4, 44.4; np.nan, 55
- Keep only observations that are not missing

In [17]:
array_5 = np.array([[1.1, 11.1], [2.2, 22.2], [3.3, 33.3], [4.4, 44.4], [np.nan, 55]])
array_5[~np.isnan(array_5).any(axis=1)]

array([[ 1.1, 11.1],
       [ 2.2, 22.2],
       [ 3.3, 33.3],
       [ 4.4, 44.4]])

- Create a dataframe ('array_5_df') from 'array_5' with 'col1' and 'col2' as columns
- Remove observations with missing values

In [18]:
array_5_df = pd.DataFrame(array_5, columns=['col1', 'col2'])
array_5_df.dropna()

   col1  col2
0   1.1  11.1
1   2.2  22.2
2   3.3  33.3
3   4.4  44.4


### 2.6 Imputing Missing Values

- Import 'StandardScaler' from 'sklearn.preprocessing'
- Import 'make_blobs' from 'sklearn.datasets'
- Import 'SimpleImputer' from 'sklearn.impute'
- Make fake data with 'make_blobs' (1000 samples, 2 features, random state of 1), unpacking the feature matrix into 'fake' and the labels into 'blobs'
- Show the first 5 rows of 'fake' and 'blobs'

In [19]:
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs
from sklearn.impute import SimpleImputer

fake,blobs = make_blobs(n_samples = 1000,
                        n_features = 2,
                        random_state = 1)
fake[:5],blobs[:5]

(array([[-3.05837272,  4.48825769],
        [-8.60973869, -3.72714879],
        [ 1.37129721,  5.23107449],
        [-9.33917563, -2.9544469 ],
        [-8.63895561, -8.05263469]]),
 array([0, 1, 0, 1, 2]))

- Standardize the data and name it 'standardized_blobs'
- Assign the first value of 'standardized_blobs' to a variable named 'true_value'
- Show the first 3 rows of 'standardized_blobs'

In [20]:
standardized_blobs = StandardScaler().fit_transform(fake,blobs)
true_value = standardized_blobs[0, 0]
standardized_blobs[:3]

array([[ 0.87301861,  1.31426523],
       [-0.67073178, -0.22369263],
       [ 2.1048424 ,  1.45332359]])

- Replace the first value of 'standardized_blobs' with 'np.nan'
- Show the first 3 rows of 'standardized_blobs'

In [21]:
standardized_blobs[0,0] = np.nan
standardized_blobs[:3]

array([[        nan,  1.31426523],
       [-0.67073178, -0.22369263],
       [ 2.1048424 ,  1.45332359]])

- Create a mean imputer named 'mean_imputer'
- Impute 'standardized_blobs' (the array that now contains the missing value) and call it 'blob_imputed'
- Using format function, print "True Value" and 'true_value'
- Using format function, print "Imputed Value" and 'blob_imputed' at 0,0

In [22]:
mean_imputer = SimpleImputer(strategy="mean")
# impute the standardized array that holds the NaN; 'fake' has no missing values
blob_imputed = mean_imputer.fit_transform(standardized_blobs)

print("True Value: {:.6f}".format(true_value))
print("Imputed Value: {:.6f}".format(blob_imputed[0,0]))

True Value: 0.873019
Imputed Value: -0.000874
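
Mean imputation replaces each missing entry with the mean of the observed values in the same column. The sketch below rebuilds the data from this recipe in a self-contained way and checks the imputed entry against `np.nanmean`; the names `standardized` and `column_mean` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Recreate the standardized data and blank out one entry
features, _ = make_blobs(n_samples=1000, n_features=2, random_state=1)
standardized = StandardScaler().fit_transform(features)
standardized[0, 0] = np.nan

# Mean of the first column, ignoring the missing entry
column_mean = np.nanmean(standardized[:, 0])

# SimpleImputer(strategy="mean") should fill the gap with that same value
imputed = SimpleImputer(strategy="mean").fit_transform(standardized)
print(np.isclose(imputed[0, 0], column_mean))
```

Because the column was standardized to mean 0 before the value was removed, the imputed value sits very close to 0 rather than near the true value, which illustrates how coarse mean imputation can be.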


---
# HANDLING CATEGORICAL DATA


### 2.7 Encoding Nominal Categorical Features

In [23]:
import numpy as np
from sklearn.preprocessing import LabelBinarizer, MultiLabelBinarizer

feature = np.array([
                    ["Texas"],
                    ["California"],
                    ["Texas"],
                    ["Delaware"],
                    ["Texas"]
                    ])

# create one-hot encoder
one_hot = LabelBinarizer()

# one-hot encode feature
one_hot.fit_transform(feature)

array([[0, 0, 1],
       [1, 0, 0],
       [0, 0, 1],
       [0, 1, 0],
       [0, 0, 1]])

In [24]:
# view feature classes
one_hot.classes_

array(['California', 'Delaware', 'Texas'], dtype='<U10')

In [25]:
# reverse one-hot encoding
one_hot.inverse_transform(one_hot.transform(feature))

array(['Texas', 'California', 'Texas', 'Delaware', 'Texas'], dtype='<U10')

In [26]:
# We can even use Pandas to one-hot-encode the feature
import pandas as pd

# Create dummy variables from feature
pd.get_dummies(feature[:,0])

   California  Delaware  Texas
0           0         0      1
1           1         0      0
2           0         0      1
3           0         1      0
4           0         0      1
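
scikit-learn also provides `OneHotEncoder`, which is aimed at feature columns rather than labels. It is not used in this notebook, so the following is only a sketch of an alternative. It relies on the default sparse output and converts it with `.toarray()`; `handle_unknown="ignore"` is an optional setting chosen here so that an unseen category encodes as all zeros instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

feature = np.array([
    ["Texas"],
    ["California"],
    ["Texas"],
    ["Delaware"],
    ["Texas"],
])

# Unseen categories at transform time become all-zero rows
encoder = OneHotEncoder(handle_unknown="ignore")

# The default output is a sparse matrix; convert to dense for display
encoded = encoder.fit_transform(feature).toarray()
print(encoder.categories_)
print(encoded)

# A state that was not present during fitting
unseen = encoder.transform([["Nevada"]]).toarray()
print(unseen)
```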


In [27]:
# create multiclass feature
multiclass_feature = [
                    ("Texas", "Florida"),
                    ("California", "Alabama"),
                    ("Texas", "Florida"),
                    ("Delaware", "Florida"),
                    ("Texas", "Alabama")
                    ]

# create multiclass one-hot encoder
one_hot_multiclass = MultiLabelBinarizer()

# one-hot encode multiclass feature
one_hot_multiclass.fit_transform(multiclass_feature)

array([[0, 0, 0, 1, 1],
       [1, 1, 0, 0, 0],
       [0, 0, 0, 1, 1],
       [0, 0, 1, 1, 0],
       [1, 0, 0, 0, 1]])

In [28]:
# view classes
one_hot_multiclass.classes_

array(['Alabama', 'California', 'Delaware', 'Florida', 'Texas'],
      dtype=object)

In [29]:
# Create dummy variables from multiclass feature
pd.get_dummies(multiclass_feature)

   (California, Alabama)  (Delaware, Florida)  (Texas, Alabama)  (Texas, Florida)
0                      0                    0                 0                 1
1                      1                    0                 0                 0
2                      0                    0                 0                 1
3                      0                    1                 0                 0
4                      0                    0                 1                 0


### 2.8 Encoding Ordinal Categorical Features

In [30]:
import pandas as pd

# create features
df = pd.DataFrame({"Score": 
                   ["Low", "Low", "Medium", "Medium", "High"]
                  })
# create mapper
scale_mapper = {"Low": 1,
                "Medium": 2,
                "High": 3
                }

# replace feature values with scale
df["Score"].replace(scale_mapper)

0    1
1    1
2    2
3    2
4    3
Name: Score, dtype: int64

In [31]:
df

    Score
0     Low
1     Low
2  Medium
3  Medium
4    High


In [32]:
# Choose the numeric value for each class carefully: the gaps between values encode how far apart the classes are
df = pd.DataFrame({"Score": 
                   ["Low", "Low", "Medium", "Medium", "High", "Barely More Than Medium"]
                  })
scale_mapper = {"Low": 1,
                "Medium": 2,
                "Barely More Than Medium": 2.1,
                "High": 3
               }
df["Score"].replace(scale_mapper)

0    1.0
1    1.0
2    2.0
3    2.0
4    3.0
5    2.1
Name: Score, dtype: float64
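
When equally spaced integer codes are acceptable, scikit-learn's `OrdinalEncoder` can apply an explicit category order. This is a sketch of an alternative to the dictionary mapper, not the approach used above; note that it always assigns 0, 1, 2, ... in the given order, so it cannot express an uneven spacing such as 2.1.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"Score": ["Low", "Low", "Medium", "Medium", "High"]})

# Pass the order explicitly; otherwise categories are sorted alphabetically
encoder = OrdinalEncoder(categories=[["Low", "Medium", "High"]])

# The encoder expects a 2D input, hence the double brackets
codes = encoder.fit_transform(df[["Score"]])
print(codes.ravel())
```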

### 2.9  Encoding Dictionaries of Features

In [33]:
from sklearn.feature_extraction import DictVectorizer

data_dict = [{"Red": 2, "Blue": 4},
             {"Red": 4, "Blue": 3},
             {"Red": 1, "Yellow": 2},
             {"Red": 2, "Yellow": 2}
            ]
# create dictionary vectorizer
dictvectorizer = DictVectorizer(sparse=False)

# convert dictionary to feature matrix
features = dictvectorizer.fit_transform(data_dict)

features

array([[4., 2., 0.],
       [3., 4., 0.],
       [0., 1., 2.],
       [0., 2., 2.]])

By default, DictVectorizer outputs a sparse matrix that stores only the elements with a value other than 0. This can be very helpful when we have massive matrices (often encountered in natural language processing) and want to minimize memory requirements. We can force DictVectorizer to output a dense matrix by passing sparse=False.
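
To see the difference, the same dictionaries can be vectorized with the default settings. In this sketch the result is a SciPy sparse matrix whose `nnz` attribute counts only the stored entries, and `.toarray()` recovers the dense form; the name `sparse_features` is illustrative.

```python
from scipy import sparse
from sklearn.feature_extraction import DictVectorizer

data_dict = [
    {"Red": 2, "Blue": 4},
    {"Red": 4, "Blue": 3},
    {"Red": 1, "Yellow": 2},
    {"Red": 2, "Yellow": 2},
]

# Default behaviour: sparse output
sparse_features = DictVectorizer().fit_transform(data_dict)

print(sparse.issparse(sparse_features))   # is it a SciPy sparse matrix?
print(sparse_features.shape)              # logical shape of the matrix
print(sparse_features.nnz)                # number of stored entries
print(sparse_features.toarray())          # dense view for inspection
```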

In [34]:
# get feature names (get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out())
feature_names = dictvectorizer.get_feature_names()
feature_names

['Blue', 'Red', 'Yellow']

In [35]:
#While not necessary, for the sake of illustration we can create a pandas DataFrame to view the output better

# Import library
import pandas as pd

# Create dataframe from features
pd.DataFrame(features, columns=feature_names)

   Blue  Red  Yellow
0   4.0  2.0     0.0
1   3.0  4.0     0.0
2   0.0  1.0     2.0
3   0.0  2.0     2.0


### 2.10 Imputing Missing Class Values

In [36]:
# load libraries
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0, 2.10, 1.45],
             [1, 1.18, 1.33],
             [0, 1.22, 1.27],
             [1, -0.21, -1.19]
             ])

X_with_nan = np.array([[np.nan, 0.87, 1.31],
                      [np.nan, -0.67, -0.22]])

# train KNN learner
clf = KNeighborsClassifier(3, weights='distance')
trained_model = clf.fit(X[:,1:], X[:, 0])

# predict missing values' class
imputed_values = trained_model.predict(X_with_nan[:, 1:])

# join column of predicted class with their other features
X_with_imputed = np.hstack((imputed_values.reshape(-1, 1), X_with_nan[:, 1:]))

# join the two feature matrices
np.vstack((X_with_imputed, X))

array([[ 0.  ,  0.87,  1.31],
       [ 1.  , -0.67, -0.22],
       [ 0.  ,  2.1 ,  1.45],
       [ 1.  ,  1.18,  1.33],
       [ 0.  ,  1.22,  1.27],
       [ 1.  , -0.21, -1.19]])

In [37]:
from sklearn.impute import SimpleImputer

# join the two feature matrices
X_complete = np.vstack((X_with_nan, X))

imputer = SimpleImputer(strategy='most_frequent')

imputer.fit_transform(X_complete)

array([[ 0.  ,  0.87,  1.31],
       [ 0.  , -0.67, -0.22],
       [ 0.  ,  2.1 ,  1.45],
       [ 1.  ,  1.18,  1.33],
       [ 0.  ,  1.22,  1.27],
       [ 1.  , -0.21, -1.19]])

### 2.11 Handling Imbalanced Classes

In [38]:
# Load libraries
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

# Load iris data
iris = load_iris()

# Create feature matrix
features = iris.data

# Create target vector
target = iris.target

# Remove first 40 observations
features = features[40:, :]
target = target[40:]

# Create a binary target vector (0 for class 0, 1 otherwise) and view the resulting imbalance
target = np.where((target == 0), 0, 1)
target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [39]:
# Create weights
weights = {0: 0.9, 1: 0.1}

# Create random forest classifier with weights
RandomForestClassifier(class_weight=weights)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                       class_weight={0: 0.9, 1: 0.1}, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       max_samples=None, min_impurity_decrease=0.0,
                       min_impurity_split=None, min_samples_leaf=1,
                       min_samples_split=2, min_weight_fraction_leaf=0.0,
                       n_estimators=100, n_jobs=None, oob_score=False,
                       random_state=None, verbose=0, warm_start=False)

In [40]:
# Create a random forest classifier with balanced class weights
RandomForestClassifier(class_weight="balanced")

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight='balanced',
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
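
With `class_weight="balanced"`, each class is weighted inversely to its frequency, computed as `n_samples / (n_classes * np.bincount(y))`. The sketch below works this out for a target with the same 10-to-100 split as above and compares it with scikit-learn's `compute_class_weight` helper. The `target` array is rebuilt by hand here so the block is self-contained.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Same imbalance as above: 10 observations of class 0, 100 of class 1
target = np.array([0] * 10 + [1] * 100)
classes = np.unique(target)

# Manual version of the "balanced" heuristic
manual = len(target) / (len(classes) * np.bincount(target))

# scikit-learn's helper for the same calculation
reference = compute_class_weight(class_weight="balanced", classes=classes, y=target)

print({int(c): float(w) for c, w in zip(classes, manual)})
print(np.allclose(manual, reference))
```

The minority class receives a weight ten times larger than the majority class, mirroring the 1:10 ratio of observations.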

In [41]:
# Indices of each class's observations
i_class0 = np.where(target == 0)[0]
i_class1 = np.where(target == 1)[0]

# Number of observations in each class
n_class0 = len(i_class0)
n_class1 = len(i_class1)

# For every observation of class 0, randomly sample
# from class 1 without replacement
i_class1_downsampled = np.random.choice(i_class1, size=n_class0, replace=False)

# Join together class 0's target vector with the
# downsampled class 1's target vector
np.hstack((target[i_class0], target[i_class1_downsampled]))

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [42]:
# Join together class 0's feature matrix with the
# downsampled class 1's feature matrix
np.vstack((features[i_class0,:], features[i_class1_downsampled, :]))[0:5]

array([[5. , 3.5, 1.3, 0.3],
       [4.5, 2.3, 1.3, 0.3],
       [4.4, 3.2, 1.3, 0.2],
       [5. , 3.5, 1.6, 0.6],
       [5.1, 3.8, 1.9, 0.4]])

In [43]:
# For every observation in class 1, randomly sample from class 0 with replacement
i_class0_upsampled = np.random.choice(i_class0, size=n_class1, replace=True)


# Join together class 0's upsampled target vector with class 1's target vector
np.concatenate((target[i_class0_upsampled], target[i_class1]))

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1])

In [44]:
# Join together class 0's upsampled feature matrix with class 1's feature matrix
np.vstack((features[i_class0_upsampled,:], features[i_class1,:]))[0:5]

array([[5. , 3.5, 1.6, 0.6],
       [4.8, 3. , 1.4, 0.3],
       [4.8, 3. , 1.4, 0.3],
       [4.4, 3.2, 1.3, 0.2],
       [5. , 3.5, 1.6, 0.6]])
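
The index-based sampling above can also be written with `sklearn.utils.resample`, which samples rows directly and accepts a `random_state` for reproducibility. This is a sketch of an equivalent approach rather than the method used in this notebook; the variable names (`minority_X`, `majority_X`, and so on) are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.utils import resample

# Rebuild the imbalanced data: drop the first 40 rows, then binarize the target
iris = load_iris()
features = iris.data[40:, :]
target = np.where(iris.target[40:] == 0, 0, 1)

minority_X = features[target == 0]   # 10 observations
majority_X = features[target == 1]   # 100 observations

# Upsample the minority class with replacement to match the majority
minority_upsampled = resample(
    minority_X, replace=True, n_samples=len(majority_X), random_state=0
)

# Downsample the majority class without replacement to match the minority
majority_downsampled = resample(
    majority_X, replace=False, n_samples=len(minority_X), random_state=0
)

# Assemble the upsampled, balanced dataset
balanced_up_X = np.vstack((minority_upsampled, majority_X))
balanced_up_y = np.hstack((
    np.zeros(len(minority_upsampled), dtype=int),
    np.ones(len(majority_X), dtype=int),
))

print(balanced_up_X.shape, np.bincount(balanced_up_y))
print(majority_downsampled.shape)
```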