# Chapter 3: A Tour of Machine Learning Classifiers Using Scikit-learn

**Key Steps in Training an ML Algorithm**
1. Selection of features.
2. Choosing a performance metric.
3. Choosing a classifier and optimization algorithm.
4. Evaluating the performance of the model.
5. Tuning the algorithm.


In [None]:
from sklearn import datasets
import numpy as np

In [None]:
iris = datasets.load_iris()
X = iris.data[:, [2, 3]]
y = iris.target

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
sc = StandardScaler()
# Estimate the sample mean and standard deviation on the training data.
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
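
As a rough sketch of what `StandardScaler` does in the cell above, the same transformation can be written by hand with NumPy. The toy arrays and the `_demo` names are hypothetical and only stand in for the real data. The point to notice is that the mean and standard deviation are estimated on the training data alone and then reused for the test data.

```python
import numpy as np

# Hypothetical toy data standing in for X_train / X_test.
X_train_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
X_test_demo = np.array([[2.5, 25.0]])

# Per-feature statistics, estimated on the training data only.
mu = X_train_demo.mean(axis=0)
sigma = X_train_demo.std(axis=0)  # population std (ddof=0), which StandardScaler also uses

# Apply the same training statistics to both sets.
X_train_demo_std = (X_train_demo - mu) / sigma
X_test_demo_std = (X_test_demo - mu) / sigma

print(X_train_demo_std.mean(axis=0), X_train_demo_std.std(axis=0))
print(X_test_demo_std)
```

After this step each training feature has zero mean and unit variance, while the test features are only approximately standardized because they were scaled with the training statistics.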

In [None]:
from sklearn.linear_model import Perceptron
# max_iter caps the number of passes over the training data (older releases called this n_iter).
ppn = Perceptron(max_iter=40, eta0=0.1, random_state=0)
ppn.fit(X_train_std, y_train)

y_pred = ppn.predict(X_test_std)
print('Misclassified samples: %d' % (y_test != y_pred).sum())

In [None]:
from sklearn.metrics import accuracy_score
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred))

In [None]:
from matplotlib.colors import ListedColormap
import matplotlib.pyplot as plt

def plot_decision_regions(X, y, classifier, test_idx=None, resolution=0.02):
    # Set up marker generator and color map
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])
    
    # Plot the decision surface
    x1_min, x1_max = X[:,0].min() - 1, X[:,0].max() + 1
    x2_min, x2_max = X[:,1].min() - 1, X[:,1].max() + 1
    # Generate a grid over the 2D feature space at the given resolution.
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution), np.arange(x2_min, x2_max, resolution))
    # np.array([xx1.ravel(), xx2.ravel()]).T stacks the grid into an
    # (n_points, 2) array holding every (x1, x2) coordinate pair on the grid.
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())
    
    # Plot the class samples
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
                   alpha=0.8, color=cmap(idx),
                   marker=markers[idx], label=cl)
    
    # Highlight test samples
    if test_idx is not None:
        X_test, y_test = X[test_idx, :], y[test_idx]
        # Draw hollow circles around the test samples.
        plt.scatter(X_test[:, 0], X_test[:, 1],
                   facecolors='none', edgecolors='black',
                   alpha=1.0, linewidths=1, marker='o',
                   s=55, label='test set')

In [None]:
X_combined_std = np.vstack((X_train_std, X_test_std))
y_combined = np.hstack((y_train, y_test))
plot_decision_regions(X=X_combined_std,
                      y=y_combined,
                      classifier=ppn,
                      test_idx=range(105, 150))
plt.xlabel('petal length [standardized]')
plt.ylabel('petal width [standardized]')
plt.legend(loc='upper left')
plt.show()

In [None]:
help(Perceptron)

## Logistic Regression

Note that the perceptron never converges on datasets that are not linearly separable.

Logistic regression is a simple yet powerful algorithm for linear, binary classification problems. The name is somewhat of a misnomer: it is a classification model, not a regression model.

**odds ratio**: the odds in favor of a particular event, $\frac{p}{1-p}$, where $p$ is the probability of the positive event (the one we want to predict).

**logit** function: $$\text{logit}(p) = \log \frac{p}{1 - p}$$

We use the logit function to express a linear relationship between feature values and the log-odds.

$$\text{logit}\big(p\big(y=1 \mid \mathbf{x}\big)\big) = w_0 x_0 + w_1 x_1 + \cdots + w_m x_m = \sum_{i=0}^m w_i x_i = \mathbf{w}^T \mathbf{x}$$

What we really want is the probability that a given sample belongs to a particular class. This is given by the inverse of the logit function, the **logistic** function:
$$\phi(z) = \frac{1}{1+ e^{-z}}$$

where $z$ is the net input (the linear combination of weights and features), $z = \mathbf{w}^T \mathbf{x}$.
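
To see why the logistic function is the inverse of the logit, one can set the log-odds equal to $z$ and solve for $p$. The short derivation below uses only the definitions given so far:

$$\log \frac{p}{1 - p} = z \;\implies\; \frac{p}{1 - p} = e^{z} \;\implies\; p\,(1 + e^{z}) = e^{z} \;\implies\; p = \frac{e^{z}}{1 + e^{z}} = \frac{1}{1 + e^{-z}} = \phi(z)$$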

So why use the logistic function? Note that the logit function maps probabilities in the open interval $(0,1)$ to the reals. The logistic function, as its inverse, maps the reals back to $(0,1)$, so its output may be interpreted as a probability.

Additionally, note that $\phi(z) = 0.5 \implies z = 0$.
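
The relationship between the two functions can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper names `sigmoid` and `logit` are chosen here for illustration and are not part of the notebook's earlier code.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return np.log(p / (1.0 - p))

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
p = sigmoid(z)

print(p)                         # values strictly between 0 and 1, increasing in z
print(np.allclose(logit(p), z))  # the round trip recovers the net input
print(sigmoid(0.0))              # 0.5
```

A net input of zero therefore sits exactly on the boundary between the two classes.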

Logistic regression entails replacing the activation function from before with the sigmoid function.

The output of the sigmoid function can then be interpreted as the probability that a particular sample belongs to class $1$, given its features $\mathbf{x}$ and the weights $\mathbf{w}$: $\phi(z) = P(y = 1\mid \mathbf{x};\mathbf{w})$.

For binary classification, the complement gives the probability of the sample belonging to class $0$.

A quantizer function (e.g. some step function) can then be used to convert the probabilities into a binary outcome.
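
One possible quantizer is a unit step at a probability of $0.5$. The sketch below assumes that threshold (the text does not fix one), and the name `quantize` is hypothetical. Because $\phi(z) \ge 0.5$ exactly when $z \ge 0$, thresholding the probability is equivalent to checking the sign of the net input.

```python
import numpy as np

def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def quantize(proba, threshold=0.5):
    """Unit step: class 1 if the probability reaches the threshold, else class 0."""
    return np.where(proba >= threshold, 1, 0)

z = np.array([-2.0, -0.1, 0.0, 0.3, 4.0])  # example net inputs
y_hat = quantize(sigmoid(z))

print(y_hat)
# With a 0.5 threshold, the same labels follow from the sign of z.
print(np.array_equal(y_hat, np.where(z >= 0.0, 1, 0)))
```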

It *is* often helpful to be able to estimate class-membership probability in addition to simply outputting a single yes/no result.
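
As an illustration, scikit-learn's `LogisticRegression` exposes both kinds of output through `predict_proba` and `predict`. The sketch below makes some assumptions of its own: it keeps only two of the three iris classes so that the problem is binary, relabels them as 0 and 1, and uses the estimator's default settings. The variable names are chosen for this example only.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()

# Assumption for this sketch: drop class 0 and relabel classes 1 and 2 as 0 and 1.
mask = iris.target != 0
X_bin = iris.data[mask][:, [2, 3]]
y_bin = (iris.target[mask] == 2).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_bin, y_bin, test_size=0.3, random_state=0)

# Standardize using statistics from the training split only.
sc_bin = StandardScaler().fit(X_tr)
lr = LogisticRegression(random_state=0)
lr.fit(sc_bin.transform(X_tr), y_tr)

proba = lr.predict_proba(sc_bin.transform(X_te))  # columns: P(y=0 | x), P(y=1 | x)
labels = lr.predict(sc_bin.transform(X_te))       # hard 0/1 decisions

print(proba[:3])
print(labels[:3])
```

Each row of `proba` sums to one, and the predicted label is the class with the larger probability.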