<a href="https://colab.research.google.com/github/kursataker/cng562-machine-learning-spring-19/blob/master/Basic_Practice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
## Disclaimer: This notebook closely follows Burkov's book Chapter 5.

import numpy as np

## Feature Engineering
 Converting raw data into a dataset is called **feature engineering**.
 
 *Informative features* have high *predictive power*.
 
 In a credit default problem, we expect *income* or *credit card payment history* to be informative features.
 
 
 
 

## Categorical Data to Numerical Data
 
 Among all the algorithms we have seen, the only one that accepts categorical data (e.g. color = 'blue') is the decision tree algorithm.
 
 Linear regression, logistic regression, kNN and SVM **do not accept** categorical data.
 
*One-Hot encoding* transforms categorical features into several binary features:

\begin{align}
red &= [1,0,0]\\
yellow &= [0,1,0]\\
green &=[0,0,1].
\end{align}
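
The mapping above can also be written out by hand. This is a minimal sketch, assuming the category order red, yellow, green from the equations; the helper name `one_hot` is hypothetical and not part of any library.

```python
import numpy as np

# Category order is an assumption taken from the equations above.
categories = ['red', 'yellow', 'green']
index = {name: i for i, name in enumerate(categories)}

def one_hot(value):
    """Return a binary vector with a single 1 at the category's position."""
    vec = np.zeros(len(categories))
    vec[index[value]] = 1.0
    return vec

# Encode a small hypothetical column of colors, one row per example.
encoded = np.array([one_hot(c) for c in ['red', 'green', 'yellow']])
print(encoded)
```

Each row contains exactly one 1, so no artificial ordering between the colors is introduced.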



In [0]:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore')
X = [['Male', 1], ['Female', 3], ['Female', 2]]
enc.fit(X)

OneHotEncoder(categorical_features=None, categories=None,
       dtype=<class 'numpy.float64'>, handle_unknown='ignore',
       n_values=None, sparse=True)

In [0]:
enc.categories_

[array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]

In [0]:
enc.transform([['Female', 1], ['Male', 4]]).toarray()

array([[1., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0.]])

In [0]:
enc.inverse_transform([[0, 1, 1, 0, 0], [0, 0, 0, 1, 0]])


array([['Male', 1],
       [None, 2]], dtype=object)

In [0]:
enc.get_feature_names()  # scikit-learn >= 1.0: use enc.get_feature_names_out(); the old name was removed in 1.2


array(['x0_Female', 'x0_Male', 'x1_1', 'x1_2', 'x1_3'], dtype=object)

## Numerical Data to Categorical Data
*Binning* (*Bucketing* or *Discretization*) is the process of converting a continuous feature into multiple binary features called *bins* or *buckets*, typically based on value range.

In [0]:
X = np.array([[ -3., 5., 15 ],
              [  0., 6., 14 ],
              [  6., 3., 11 ]])

In [0]:
from sklearn import preprocessing

In [0]:
est = preprocessing.KBinsDiscretizer(n_bins=[3, 2, 2], encode='ordinal').fit(X)

In [0]:
est.bin_edges_

array([array([-3., -1.,  2.,  6.]), array([3., 5., 6.]),
       array([11., 14., 15.])], dtype=object)

In [0]:
est.n_bins_

array([3, 2, 2])

In [0]:
est.n_bins

[3, 2, 2]

In [0]:
est.transform(X)

array([[0., 1., 1.],
       [1., 1., 1.],
       [2., 0., 0.]])
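
The ordinal encoding above returns one bin index per feature. To obtain the binary features mentioned in the definition, the same discretizer can be asked for a one-hot encoding. The following is a sketch, assuming the same data and bin counts as above; the names `X_demo`, `est_onehot` and `X_binary` are introduced only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Same values as the array X used above.
X_demo = np.array([[-3., 5., 15.],
                   [ 0., 6., 14.],
                   [ 6., 3., 11.]])

# Same bin counts as before, but every bin becomes its own 0/1 column.
est_onehot = KBinsDiscretizer(n_bins=[3, 2, 2], encode='onehot-dense')
X_binary = est_onehot.fit_transform(X_demo)

print(X_binary.shape)  # 3 + 2 + 2 = 7 binary columns
print(X_binary)
```

Every row has exactly three ones, one per original feature, marking the bin that feature's value falls into.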

## Preparing numerical features


### Standardization = Mean removal and Variance Scaling

*Standardization* (or *z-score normalization*, or *mean removal and variance scaling*) is the procedure of rescaling a feature so that the resulting distribution has mean **zero** and standard deviation **one**:


$$ x \to \frac{x-\mu}{\sigma}.$$



Many scikit-learn estimators expect numerical features to be *standardized*, for example linear models with L1 or L2 regularization and SVMs with an RBF kernel.



In [0]:
from sklearn import preprocessing
import numpy as np
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

In [0]:
X_scaled = preprocessing.scale(X_train)
X_scaled

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])

In [0]:
X_scaled.mean(axis=0)

array([0., 0., 0.])

In [0]:
X_scaled.std(axis=0)

array([1., 1., 1.])

In [0]:
scaler = preprocessing.StandardScaler().fit(X_train)
scaler

StandardScaler(copy=True, with_mean=True, with_std=True)

In [0]:
scaler.mean_

array([1.        , 0.        , 0.33333333])

In [0]:
scaler.transform(X_train)

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])

In [0]:
X_test = [[-1., 1., 0]]
scaler.transform(X_test)

array([[-2.44948974,  1.22474487, -0.26726124]])
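
The formula $x \to (x-\mu)/\sigma$ can be checked by hand with NumPy. This is a minimal sketch, assuming the same training values as above and the population standard deviation (`ddof=0`), which is what `StandardScaler` uses; the names `X_demo`, `mu`, `sigma` and `x_new` are introduced for illustration only.

```python
import numpy as np

X_demo = np.array([[1., -1.,  2.],
                   [2.,  0.,  0.],
                   [0.,  1., -1.]])

mu = X_demo.mean(axis=0)     # per-column mean
sigma = X_demo.std(axis=0)   # per-column population standard deviation (ddof=0)
X_z = (X_demo - mu) / sigma  # z-score of every entry

print(X_z)

# New data is transformed with the statistics learned from the training data.
x_new = np.array([[-1., 1., 0.]])
x_new_z = (x_new - mu) / sigma
print(x_new_z)
```

The two printed arrays should agree with the outputs of `scaler.transform` shown above.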

### Normalization / Scaling features to a range
Sometimes we need to scale a numeric feature to a fixed interval, such as [0,1] (MinMaxScaler).

$$ x \to \frac{x-\text{min}}{\text{max}-\text{min}}.$$

1. Some learning algorithms need features to be scaled. Recall that the kNN algorithm needs features to be of roughly the same scale, because it compares examples by distance.
2. Optimization algorithms, such as gradient descent, may need features to be of roughly the same scale. Suppose feature $x$ is in the range [0, 10000] and feature $y$ is in the range [0, 0.0001]; then the updates in gradient descent will be dominated by the larger feature.
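
The first point can be illustrated with a small distance computation. This is a sketch on hypothetical data (one feature in the thousands, one between 0 and 1); the helpers `min_max` and `dist` are made up for this example and simply apply the formula above and the Euclidean distance.

```python
import numpy as np

# Hypothetical data: column 0 spans thousands, column 1 spans fractions.
X_demo = np.array([[1000.0, 0.1],
                   [1010.0, 0.9],
                   [5000.0, 0.1]])

def min_max(X):
    """Rescale each column to [0, 1] with (x - min) / (max - min)."""
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    return (X - col_min) / (col_max - col_min)

def dist(a, b):
    """Euclidean distance, as used by kNN with default settings."""
    return np.linalg.norm(a - b)

# Raw features: the distance is driven almost entirely by column 0.
raw_01 = dist(X_demo[0], X_demo[1])
raw_02 = dist(X_demo[0], X_demo[2])

# Scaled features: both columns contribute on a comparable scale.
X_scaled_demo = min_max(X_demo)
scaled_01 = dist(X_scaled_demo[0], X_scaled_demo[1])
scaled_02 = dist(X_scaled_demo[0], X_scaled_demo[2])

print(raw_01, raw_02)
print(scaled_01, scaled_02)
```

Before scaling, example 0 looks several hundred times closer to example 1 than to example 2, although they differ strongly in the second feature. After scaling, the two distances are nearly equal.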






In [0]:
X_train

array([[ 1., -1.,  2.],
       [ 2.,  0.,  0.],
       [ 0.,  1., -1.]])

In [0]:
min_max_scaler = preprocessing.MinMaxScaler()

In [0]:
X_train_minmax = min_max_scaler.fit_transform(X_train)

In [0]:
X_train_minmax

array([[0.5       , 0.        , 1.        ],
       [1.        , 0.5       , 0.33333333],
       [0.        , 1.        , 0.        ]])

In [0]:
X_test = [[-3., -1., 4]]
min_max_scaler.transform(X_test)

array([[-1.5       ,  0.        ,  1.66666667]])

In [0]:
min_max_scaler.scale_

array([0.5       , 0.5       , 0.33333333])

In [0]:
min_max_scaler.min_

array([0.        , 0.5       , 0.33333333])

### Rules of thumb: Standardization vs Scaling

Standardization is preferred 
* if the feature is close to being normally distributed (bell-curve)
* if the feature has extremely high or extremely low values (outliers)
* for unsupervised learning algorithms.

## Dealing with Missing Features

It is common for some examples in the dataset to lack values for some features.

Typical approaches to dealing with missing values for a feature include

* remove the examples with missing features
* use a learning algorithm that can deal with missing feature values
* **impute** the data.

### Imputation Techniques

* Replace a missing feature with the average of feature values (SimpleImputer).
* Set the missing value to a value outside the feature's normal range (e.g. -1 or 2 if the range is [0,1]).
* Set the missing value to the middle of the range (e.g. 0 if the range is [-1,1]).
* Set up a regression problem to estimate the values of the missing features.
* Add a binary identifier to signal a missing feature (MissingIndicator).

You **must use the same imputation technique** at prediction time as you used during training.

For more, see
https://scikit-learn.org/stable/modules/impute.html#impute
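
Two of the techniques listed above, filling with an out-of-range constant and adding a binary missing-value indicator, can be sketched with `SimpleImputer` and `MissingIndicator`. The data and the fill value -1 are assumptions chosen for illustration (the features are taken to be non-negative, so -1 lies outside their range).

```python
import numpy as np
from sklearn.impute import SimpleImputer, MissingIndicator

# Hypothetical training data with one missing entry.
X_demo = np.array([[1.0, 2.0],
                   [np.nan, 3.0],
                   [7.0, 6.0]])

# Replace missing entries with a constant outside the assumed feature range.
const_imp = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=-1.0)
X_filled = const_imp.fit_transform(X_demo)

# Boolean mask with one column per feature: True where the value was missing.
indicator = MissingIndicator(missing_values=np.nan, features='all')
X_flags = indicator.fit_transform(X_demo)

# Append the indicator columns to the imputed features.
X_augmented = np.hstack([X_filled, X_flags.astype(float)])
print(X_augmented)
```

The fitted `const_imp` and `indicator` objects would then be reused unchanged on new data at prediction time.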

In [0]:
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
X_train = np.array([[1, 2], [np.nan, 3], [7, 6]])
imp.fit(X_train)  
X = np.array([[np.nan, 2], [6, np.nan], [7, 6]])
print(imp.transform(X))           

[[4.         2.        ]
 [6.         3.66666667]
 [7.         6.        ]]


In [0]:
print(X_train)

[[ 1.  2.]
 [nan  3.]
 [ 7.  6.]]


In [0]:
print(X)

[[nan  2.]
 [ 6. nan]
 [ 7.  6.]]


In [0]:
print(imp.transform(X))

[[4.         2.        ]
 [6.         3.66666667]
 [7.         6.        ]]


## Model Performance Assessment


### Regression

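For any input, the **mean model** predicts the average of the training labels. It serves as a baseline that a useful regression model should beat.

We typically compute the *Mean Squared Error* (MSE), or *Root Mean Squared Error* (RMSE), separately on the training and test data and compare them; both should also be lower than the error of the mean model.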
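
A baseline comparison of this kind might look as follows. This is a sketch on hypothetical toy data in which the label is exactly $2x + 1$; the variable names are chosen for this example only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Hypothetical toy data: y = 2x + 1 without noise.
Xr_train = np.array([[0.], [1.], [2.], [3.]])
yr_train = np.array([1., 3., 5., 7.])
Xr_test = np.array([[4.], [5.]])
yr_test = np.array([9., 11.])

# Mean model: always predict the average of the training labels (here 4.0).
mean_pred = np.full_like(yr_test, yr_train.mean())
mse_mean = mean_squared_error(yr_test, mean_pred)
rmse_mean = np.sqrt(mse_mean)

# Linear regression fitted on the same training data.
reg = LinearRegression().fit(Xr_train, yr_train)
mse_model = mean_squared_error(yr_test, reg.predict(Xr_test))

print("mean model: MSE =", mse_mean, "RMSE =", rmse_mean)
print("linear regression: MSE =", mse_model)
```

On this data the mean model has a test MSE of 37, while the linear model recovers the relationship almost exactly.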

### Classification

Widely used metrics are
* Confusion matrix
* Accuracy
* Cost-sensitive accuracy
* Precision/recall
* Area under the ROC curve (AUC)

#### Confusion matrix

In [0]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.utils.multiclass import unique_labels

def plot_confusion_matrix(y_true, y_pred, classes,
                          normalize=False,
                          title=None,
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if not title:
        if normalize:
            title = 'Normalized confusion matrix'
        else:
            title = 'Confusion matrix, without normalization'

    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    # Only use the labels that appear in the data
    classes = classes[unique_labels(y_true, y_pred)]
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    fig, ax = plt.subplots()
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    # We want to show all ticks...
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # ... and label them with the respective list entries
           xticklabels=classes, yticklabels=classes,
           title=title,
           ylabel='True label',
           xlabel='Predicted label')

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout()
    return ax

In [0]:
from sklearn.metrics import confusion_matrix


y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]
confusion_matrix(y_true, y_pred)


array([[2, 0, 0],
       [0, 0, 1],
       [1, 0, 2]])

| |spam (predicted)|not_spam (predicted)|
|-|---------|----------------|
|spam (actual)| 23 (TP)| 1 (FN)|
|not_spam (actual)| 12 (FP)| 556 (TN)|


#### Precision/Recall

\begin{align}
\text{precision} &= \frac{\text{TP}}{\text{TP+FP}} \\
\\
\\
\text{recall} &= \frac{\text{TP}}{\text{TP+FN}}
\end{align}



[https://en.wikipedia.org/wiki/Precision_and_recall#/media/File:Precisionrecall.svg](https://en.wikipedia.org/wiki/Precision_and_recall#/media/File:Precisionrecall.svg)

Precision = Ratio of correct positive predictions to all positive predictions

Recall = Ratio of correct positive predictions to all positive examples

In a database query (document retrieval) setting:

Precision = ratio of relevant documents among all returned documents

Recall = ratio of relevant documents returned to all relevant documents.
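
As a worked example, the two formulas can be applied to the counts in the spam table above (TP = 23, FN = 1, FP = 12, TN = 556). The variable names are chosen for this sketch.

```python
# Counts taken from the spam confusion matrix above.
tp, fn, fp, tn = 23, 1, 12, 556

precision = tp / (tp + fp)  # 23 / 35
recall = tp / (tp + fn)     # 23 / 24

print(f"precision = {precision:.3f}")  # precision = 0.657
print(f"recall    = {recall:.3f}")     # recall    = 0.958
```

The classifier catches almost all spam (high recall), but about a third of the messages it flags are not spam (moderate precision).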


#### Accuracy

\begin{align}
\text{accuracy} &= \frac{\text{TP+TN}}{\text{TP+TN+FP+FN}} \\
\end{align}

Accuracy is useful when errors in predicting all classes are equally important.

In spam detection, we tolerate false negatives (a spam message not marked as spam) more than false positives (a friend's email marked as spam), so the two kinds of error are not equally important.

#### Cost-Sensitive Accuracy

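Assign a cost (a positive number) to each type of mistake, FP and FN. Compute TP, TN, FP and FN as usual, then multiply the FP and FN counts by their costs before plugging them into the accuracy formula.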
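
A sketch of this computation on the spam counts from the table above, assuming the costs enter as multipliers of the FP and FN counts in the accuracy formula. The cost values (5 for a false positive, 1 for a false negative) are hypothetical and chosen only for illustration.

```python
# Counts taken from the spam confusion matrix above.
tp, fn, fp, tn = 23, 1, 12, 556

# Plain accuracy: every mistake counts the same.
accuracy = (tp + tn) / (tp + tn + fp + fn)  # 579 / 592

# Hypothetical costs: a false positive is treated as 5x worse than a false negative.
cost_fp, cost_fn = 5.0, 1.0
cost_sensitive_accuracy = (tp + tn) / (tp + tn + cost_fp * fp + cost_fn * fn)  # 579 / 640

print(f"accuracy                = {accuracy:.3f}")                 # 0.978
print(f"cost-sensitive accuracy = {cost_sensitive_accuracy:.3f}")  # 0.905
```

Weighting the false positives more heavily lowers the score, reflecting that this classifier makes the more expensive kind of mistake twelve times.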

#### Area under the ROC curve (AUC)

\begin{align}
\text{TPR} &= \frac{\text{TP}}{\text{TP+FN}} \\
\\
\\
\text{FPR} &= \frac{\text{FP}}{\text{FP+TN}}
\end{align}

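ROC curves can only be used with classifiers that return a confidence score (or probability) for each prediction.

If the score range is [0,1], discretize the interval, for example into the thresholds [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1].

For each T in [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
1. Use T as the prediction threshold: predict the positive class when the score is at least T.
2. Compute TPR and FPR for these predictions.
3. Mark the point (FPR, TPR) on the graph.

The area under the resulting curve is the AUC; the higher it is, the better the classifier.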
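
The threshold loop above can be sketched in NumPy as follows. The labels and scores are hypothetical, and the area is approximated with the trapezoidal rule over the collected points; this is an illustration of the procedure, not a replacement for `sklearn.metrics.roc_curve`.

```python
import numpy as np

# Hypothetical true labels and classifier confidence scores.
y_true_demo = np.array([0, 0, 1, 1])
scores = np.array([0.15, 0.45, 0.35, 0.85])
thresholds = np.linspace(0, 1, 11)  # 0, 0.1, ..., 1

points = []
for t in thresholds:
    y_pred_demo = scores >= t  # predict positive when the score is at least T
    tp = np.sum(y_pred_demo & (y_true_demo == 1))
    fp = np.sum(y_pred_demo & (y_true_demo == 0))
    fn = np.sum(~y_pred_demo & (y_true_demo == 1))
    tn = np.sum(~y_pred_demo & (y_true_demo == 0))
    points.append((fp / (fp + tn), tp / (tp + fn)))  # (FPR, TPR)

# Reverse so that FPR is increasing, then integrate with the trapezoidal rule.
fpr, tpr = np.array(points[::-1]).T
auc = np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2)

print(points)
print("AUC =", auc)
```

With threshold 0 every example is predicted positive, giving the point (1, 1); with threshold 1 none is, giving (0, 0). The curve between them encloses an area of 0.75 for this data.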



See https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc_crossval.html#sphx-glr-auto-examples-model-selection-plot-roc-crossval-py