# Index

* [About classification algorithms](#About-classification-algorithms)
* [Train and test subsamples](#Train-and-test-subsamples)
* [K nearest neighbors](#KNN)
* [Decision Trees](#Decision-Trees)
    * [Splitting the dataset](#Splitting-the-dataset)
    * [Avoiding overfitting](#Avoiding-overfitting)
* [Logistic regression](#Logistic-Regression) 
    * [Example #1: random samples](#Example-#1,-logistic-regresion-with-random-samples)
    * [Example #2: iris dataset](#Example-#2,-logistic-regresion-with-iris-dataset)
* [Discriminant analysis](#Discriminant-analysis)
* [Naive Bayes](#Naive-Bayes)
* [Support vector machines (SVM)](#Support-Vector-Machines)
* [PCA for classification](#PCA-for-classification)

## About classification algorithms
Some classification algorithms can only distinguish between two classes. To use them in multiclass problems there are two common approaches:

* **One vs one (OvO):** we evaluate the classes in pairs. Say we have three classes A, B and C. The OvO ensemble is composed of 3 (= 3 * (3 - 1) / 2) binary classifiers: the first discriminates A from B, the second A from C, and the third B from C. At prediction time, the class that collects the most votes wins. Notice that the number of classifiers grows as $O(n^2)$ with the number of classes $n$.

* **One vs rest (OvR, aka one-vs-all):** we train one classifier (estimator) per class, each separating that class from all the others, and take the class whose classifier gives the highest confidence.

[wiki](https://en.wikipedia.org/wiki/Multiclass_classification#Transformation_to_binary)
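To make the difference in the number of binary models concrete, here is a minimal sketch using scikit-learn's `OneVsOneClassifier` and `OneVsRestClassifier` wrappers. The synthetic 4-class dataset and the variable names are illustrative assumptions, not part of this notebook's data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Synthetic 4-class problem (illustrative data only)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0)

base = LogisticRegression(max_iter=1000)
ovo = OneVsOneClassifier(base).fit(X_demo, y_demo)
ovr = OneVsRestClassifier(base).fit(X_demo, y_demo)

n_ovo = len(ovo.estimators_)  # one model per pair of classes: 4 * 3 / 2
n_ovr = len(ovr.estimators_)  # one model per class
print(n_ovo, n_ovr)
```

With four classes the OvO ensemble already needs more models than OvR, and the gap widens as the number of classes grows.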

## Train and test subsamples
In general we should split the given data into two parts: one for training and the other for testing. A common choice is to hold out about 1/3 of the dataset for testing.
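As a minimal sketch of such a split, assuming the iris dataset and scikit-learn's `train_test_split` (the variable names are illustrative):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_all, y_all = load_iris(return_X_y=True)

# Hold out roughly one third of the samples for testing;
# stratify keeps the class proportions similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=1 / 3, stratify=y_all, random_state=0)

print(X_tr.shape, X_te.shape)
print(np.bincount(y_tr), np.bincount(y_te))  # samples per class in each part
```

Fixing `random_state` only makes the split reproducible; it is not required.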

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import (
    ListedColormap, LinearSegmentedColormap, Normalize, )
from matplotlib.patches import Ellipse

# Some estimators
from sklearn import neighbors, datasets, linear_model
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis, )
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Import metrics
from sklearn.metrics import confusion_matrix, accuracy_score

# Import some utils
from sklearn.utils import shuffle


# import some data to play with
IRIS = datasets.load_iris()

# Pandas version
IRIS_PD = pd.DataFrame(data= np.c_[IRIS['data'], IRIS['target']],columns= IRIS['feature_names'] + ['target'])
IRIS_PD['target'] = IRIS_PD['target'].astype(int)

%matplotlib inline

In [None]:
# Some useful functions

def mesh(X, h=.01):
    """Create a meshgrid object with the input space dimensions."""
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(
        np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    return (xx, yy)

def quick_scatter(X, y):
    """Create a quick scatter plot."""
    plt.figure()
    plt.scatter(X, y)
    plt.show()

## KNN
K nearest neighbors is a *lazy* algorithm: it has no real training phase and defers the computation to classification time. For each new point it finds a predefined number (k) of training samples closest in distance and predicts the label from them.

Notice that by default KNN takes the k **closest samples regardless of how far away they are**. To mitigate this effect, the votes can be weighted by distance.

KNN can also be applied to time series, but those are mostly regression problems; we'll see them in due time.

**Key features of KNN:**
* Easy to understand and implement.
* Computationally efficient in general (with small datasets).
* Requires defining a similarity (distance) measure.
* A good first baseline when approaching an ML problem.
* It suffers especially from the [curse of dimensionality](./../Glossary.ipynb/#C).
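The idea can be sketched in a few lines of NumPy. This is a simplified illustration with a hypothetical `knn_predict` helper and toy data, not how scikit-learn implements it; it shows how distance weighting can change the vote when k is large.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3, weighted=False):
    """Predict the label of x_new from its k nearest training samples."""
    dists = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distances
    nearest = np.argsort(dists)[:k]                  # indices of the k closest
    votes = Counter()
    for i in nearest:
        if weighted:
            # Inverse-distance weighting: closer neighbours count more
            votes[int(y_train[i])] += 1.0 / (dists[i] + 1e-12)
        else:
            votes[int(y_train[i])] += 1.0
    return votes.most_common(1)[0][0]

# Toy data: two points of class 0 near the origin, three of class 1 far away
X_toy = np.array([[0.0, 0.0], [0.1, 0.0], [2.0, 2.0], [2.1, 2.0], [2.0, 2.1]])
y_toy = np.array([0, 0, 1, 1, 1])
query = np.array([0.2, 0.1])

label_uniform = knn_predict(X_toy, y_toy, query, k=5)
label_weighted = knn_predict(X_toy, y_toy, query, k=5, weighted=True)
print(label_uniform, label_weighted)
```

With k=5 every sample votes, so the uniform vote follows the majority class even though its members are far from the query, while the weighted vote follows the two close neighbours.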

In [None]:
##### PART #1, load and preprocess the data #####

# we only take the first two features in the dataset
X = IRIS.data[:, :2]
cols = IRIS['feature_names'][:2]
y = IRIS.target


##### PART 2, create the model #####

# Number of neighbors and weight
k, w = 30, 'distance'

# we create an instance of Neighbours Classifier and fit the data.
clf = neighbors.KNeighborsClassifier(n_neighbors=k, weights=w)
clf.fit(X, y)


##### PART #3, plot the outcome ####

# We are about to create a mesh of points that will represent a bunch of predictions 
xx, yy = mesh(X)
    
# Once created the mesh, drop all the points into the model and predict the values for them
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

# Plot the prediction areas (background)
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
Z = Z.reshape(xx.shape)  # reshape to match the grid, same as yy.shape
plt.figure()
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

# Plot also the training points (real points)
cmap_bold =  ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold, edgecolor='k', s=20)
plt.xlabel(cols[0])
plt.ylabel(cols[1])
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("3-Class classification (k = %i)"% (k))

plt.show()

## Decision Trees
[Nice visualization](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/) 

Decision trees divide the space into high dimensional rectangles. They are simple to understand and interpret (white box model), but they tend to overfit the data. However, they are useful in other ML techniques like bagging or random forests.

In [None]:
# We take the features in pairs (Uncomment to see other pairs)
pair = [0,1]
#pair = [1,2] 
#pair = [2,3] 
X = IRIS.data[:, pair]
y = IRIS.target

# Train
clf = DecisionTreeClassifier().fit(X, y)

# Display the score (in the same training set, notice)
print('score was: {}'.format(clf.score(X, y)))

# Again, create a mesh of points that will represent a bunch of predictions
xx, yy = mesh(X)

# Once created the mesh, drop all the points into the model and predict the values for them
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

plt.xlabel(IRIS.feature_names[pair[0]])
plt.ylabel(IRIS.feature_names[pair[1]])

# Plot the training points
plot_colors, n_classes = 'ryb', 3
for i, color in zip(range(n_classes), plot_colors):
    idx = np.where(y == i)
    plt.scatter(X[idx, 0], X[idx, 1], c=color, label=IRIS.target_names[i], edgecolor='black', s=15)
plt.show()

### Splitting the dataset
Let's split the dataset into training and testing subsamples so we can check how effective our training is.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, stratify=y)

# Train
clf = DecisionTreeClassifier().fit(X_train, y_train)

# Display the score
print('score was: {}'.format(clf.score(X_test, y_test)))

# Plot the decision boundary
xx, yy = mesh(X)

Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

plt.xlabel(IRIS.feature_names[pair[0]])
plt.ylabel(IRIS.feature_names[pair[1]])

# Plot the test points
for i, color in zip(range(n_classes), plot_colors):
    idx = np.where(y_test == i)
    plt.scatter(X_test[idx, 0], X_test[idx, 1], c=color, label=IRIS.target_names[i], edgecolor='black', s=15)
plt.show()


### Avoiding overfitting
As we can see, decision trees tend to overfit the training data. There are two ways to mitigate this:
* Bagging **(B**ootstrap **agg**regat**ing**) [[wiki]](https://en.wikipedia.org/wiki/Bootstrap_aggregating)
* Random Forests [[wiki]](https://en.wikipedia.org/wiki/Random_forest)

**A: Bagging:** take several random subsets (bootstrap samples) of the data, train one model on each of them independently, and finally aggregate the results by voting.

**B: Random Forests:** like bagging, but in addition each tree only considers a random subset of the features at every split. That decorrelates the trees and prevents errors produced by correlations in the features.
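A rough way to compare the three options is to cross-validate them on the same data. The sketch below assumes a synthetic dataset and scikit-learn's `BaggingClassifier` and `RandomForestClassifier`; the hyperparameters are arbitrary and the scores will vary with the data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem (illustrative data only)
X_syn, y_syn = make_classification(
    n_samples=500, n_features=10, n_informative=4, random_state=0)

models = {
    'single tree': DecisionTreeClassifier(random_state=0),
    # Bagging: many trees, each trained on a bootstrap sample of the rows
    'bagging': BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=50, random_state=0),
    # Random forest: bootstrap samples plus random feature subsets per split
    'random forest': RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean accuracy over 5 cross-validation folds for each model
cv_scores = {
    name: cross_val_score(model, X_syn, y_syn, cv=5).mean()
    for name, model in models.items()
}
for name, score in cv_scores.items():
    print('{}: {:.3f}'.format(name, score))
```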

In [None]:
# Create an artificial dataset
X, y = datasets.make_classification(n_samples=10000, n_features=6,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)

# Now split the train and the test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, stratify=y)

# Train the model
clf = RandomForestClassifier(
    n_estimators=100, max_depth=2, random_state=0)
clf.fit(X_train, y_train)  

# Display the score
print('score was: {}'.format(clf.score(X_test, y_test)))


print(clf.feature_importances_)

print(clf.predict([[0, 0, 0, 0, 0, 0]]))

## Logistic Regression

[Regression](https://en.wikipedia.org/wiki/Regression_analysis) is a wide topic in maths and machine learning that tries to estimate an outcome (target) given several independent variables called predictors or features so we can forecast a future output.

The basic idea of regression is the following:


$$\hat{y}(\mathbf{w},\mathbf{x})=w_0+w_1 x_1+...+w_p x_p$$


We'll try to predict $\hat{y}$ by assigning a coefficient ($w_i$, weight) to each component (feature) $x_i$ of the input vector $\mathbf{x}$, plus an intercept (constant term) $w_0$.

That is: we assume that **every target in the data can be approximated by a linear combination of its features.**

In the case of logistic regression, we plug this linear combination into the logistic function to get the probability that a sample $\mathbf{x}$ belongs to class 1. Samples with a probability below 0.5 (odds under 1:1) are classified as $0$, and as $1$ otherwise.
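A small numeric sketch of this decision rule for a single feature follows. The coefficients `w0_demo` and `w1_demo` are made-up values chosen for illustration, not fitted ones.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients of a one-feature model
w0_demo, w1_demo = -1.0, 2.0

x_demo = np.array([-1.0, 0.5, 2.0])
p = sigmoid(w0_demo + w1_demo * x_demo)  # probability of class 1
labels = (p >= 0.5).astype(int)          # classify as 1 when p >= 0.5

# The probability is exactly 0.5 where the linear part is zero
boundary = -w0_demo / w1_demo
print(p, labels, boundary)
```

The boundary $-w_0/w_1$ is the value of the feature at which the model is undecided between both classes.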

**Consider:** 
* When the classes are well separated the fit can be unstable: if a feature separates the classes perfectly, the coefficients grow towards infinity.

* If the sample is small and the features are roughly normally distributed within each class, discriminant analysis tends to be more stable.
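The first point can be observed on a tiny, perfectly separable toy dataset (assumed here for illustration). In scikit-learn a larger `C` means weaker regularization, so the fitted coefficient keeps growing as the penalty is relaxed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# One feature that separates the two classes perfectly (toy data)
X_sep = np.array([[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]])
y_sep = np.array([0, 0, 0, 1, 1, 1])

coefs = {}
for C in (1.0, 1e2, 1e4):
    model = LogisticRegression(C=C, max_iter=10000).fit(X_sep, y_sep)
    coefs[C] = abs(model.coef_[0, 0])  # magnitude of the single weight

for C, w in coefs.items():
    print('C={:g}: |w1|={:.2f}'.format(C, w))
```

Only the regularization term keeps the weight finite here; without it the optimizer would keep increasing it.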


#### **Example #1, logistic regresion with random samples**
An adaptation from a [[scikit-learn]](https://scikit-learn.org/stable/auto_examples/linear_model/plot_logistic.html#sphx-glr-auto-examples-linear-model-plot-logistic-py) example. 

* **Black dots:** a random sample where values over 0 yield $y=1$ (mostly but not always) and $y=0$ otherwise.

* **Blue line:** the class predicted by the logistic model. Every sample with $X>0.222$ (that is, $X>-w_0/w_1$) is classified as $1$, otherwise as $0$.

* **Red curve:** the probability that the prediction will be 1. Notice that the point where the probability is $0.5$ is precisely $-w_0/w_1$.

There is also a [Desmos graph](https://www.desmos.com/calculator/binjtdtjry) to get a feeling of how those coefficients affect the final logistic curve.


In [None]:
# Create an array of 300 random samples centered at 0
n_samples = 300
np.random.seed(0)
X = np.random.normal(size=n_samples)

# Label samples over 0 as 1 and the rest as 0
# (np.float was removed in NumPy 1.24, the builtin float is equivalent)
y = (X > 0).astype(float).ravel()

# Stretch out values over 0
X[X > 0] *= 4

# Add some noise so that a few samples under 0 have y=1 and vice versa
X += .3 * np.random.normal(size=n_samples)

# Finally, make it a col vector
X = X[:, np.newaxis]

# Plot the point distribution
plt.figure(figsize=(17, 5))
plt.scatter(X, y, color='black')

# Instantiate the classifier
clf = linear_model.LogisticRegression(C=1e7, solver='lbfgs')
clf.fit(X, y)
w0, w1 = clf.intercept_, clf.coef_

def log(x):
    """Get the probability using a logistic function."""
    return 1 / (1 + np.exp(-x))

# Create a bunch of test samples and predict the class and the probability for them.
X_test = np.linspace(-1, 2, n_samples)
y_hat = clf.predict(X_test[:, np.newaxis])
odds = log(w1 * X_test + w0).ravel()

# Plot the outcomes
plt.plot(X_test, odds, color='red', linewidth=3, label='Probability')
plt.plot(X_test, y_hat, color='blue', linewidth=3, label='Prediction (y_hat)')

# Finally, add some more details to the plot
plt.axhline(.5, color='.5')  # probability 0.5, i.e. 1:1 odds
plt.ylabel('y')
plt.xlabel('X')
plt.xlim(-1, 2)
plt.legend( loc="lower right")
plt.show()

#### **Example #2, logistic regresion with iris dataset**
An adaptation from a [[scikit-learn]](https://scikit-learn.org/stable/auto_examples/linear_model/plot_iris_logistic.html#sphx-glr-auto-examples-linear-model-plot-iris-logistic-py) example. 

In this example, we'll try to set linear boundaries between the points so that blue ones fall in the blue area, yellow ones in the yellow area and brown ones in the brown area.

We can imagine these boundaries as different heights (Z). The blue points are classified quite accurately, whereas most of the errors involve the brown ones.

The second graph shows how some pairs of features are more suitable for prediction than others.

In [None]:
# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, 2:4]  # we only take the last two features (petal length and width).
Y = iris.target

# Create an instance of Logistic Regression Classifier and fit the data.
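# multi_class is deprecated since scikit-learn 1.5; lbfgs already fits a
# multinomial model when the target has more than two classes
logreg = linear_model.LogisticRegression(
    C=1, solver='lbfgs')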
logreg.fit(X, Y)

# Plot the decision boundary.
xx, yy = mesh(X)

# Drop all the points into the model and predict the values for them
Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1, figsize=(7,7))
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=Y, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel('Petal length')
plt.ylabel('Petal width')

plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xticks(())
plt.yticks(())

plt.show()

In [None]:
# Multi-comparison between pairs of features.

# Let's take features by couples to compare them
X = (IRIS.data[:, :2], IRIS.data[:, 1:3], IRIS.data[:, 2:4])
Y = IRIS.target

# Create an instance of Logistic Regression Classifier and fit the data.
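# multi_class is deprecated since scikit-learn 1.5; lbfgs already fits a
# multinomial model when the target has more than two classes
logreg = linear_model.LogisticRegression(
    C=1e5, solver='lbfgs')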

# Instantiate the plot
_, axs = plt.subplots(1, 3, figsize=(24, 8))

for n, pair in enumerate(X):
    # Fit the model
    clf = logreg.fit(pair, Y)
    
    # Get the boundaries
    xx, yy = mesh(pair)
    
    # Make a bunch of predictions
    y_hat = (clf.predict(np.c_[xx.ravel(), yy.ravel()]))
    Z = (y_hat.reshape(xx.shape))
    
    # Plot the classification boundaries
    axs[n].pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)
    
    # And the training points
    axs[n].scatter(pair[:, 0], pair[:, 1], c=Y, edgecolors='k', cmap=plt.cm.Paired)
    
    # Now set the limits, labels and remove the ticks
    axs[n].set_xlim(xx.min(), xx.max())
    axs[n].set_ylim(yy.min(), yy.max())
    axs[n].set_xlabel(IRIS.feature_names[n])
    axs[n].set_ylabel(IRIS.feature_names[n + 1])
    axs[n].set_xticks(()), axs[n].set_yticks(())
    
plt.show() 

## Discriminant analysis

Linear and quadratic discriminant analysis (LDA and QDA) are algorithms that can be used for classification; LDA can also be used for dimensionality reduction (especially when dealing with multiclass datasets).

They are based on Bayes' theorem:

$$P(y=k|\mathbf{X})=\frac{P(\mathbf{X}|y=k)\cdot P(y=k)}{P(\mathbf{X})}$$

Where:

* $P(y=k|\mathbf{X})$, **posterior (estimation):** probability that a given vector $\mathbf{X}$ (a test sample) belongs to class $k$.

* $P(\mathbf{X}|y=k)$, **likelihood:** modelled by taking all the training vectors $\mathbf{X}$ of class $k$ and assuming they follow a multivariate normal distribution.

* $P(y=k)$, **priors:** proportion of class $k$ in the training dataset.

* $P(\mathbf{X})$, **evidence:** a term we can drop, since it is the same for every class and therefore does not change which class gets the highest value.

With all of the above we get:

> **Estimation(k) ∝ likelihood(k) · priors(k)** and then,
>
> `max[estimation(k) for k in classes]`

That is, for a given vector in the test dataset we calculate the estimation for each possible class and choose the $k$ with the maximum value (maximum a posteriori).

LDA assumes that every class has the same covariance matrix (that is, the distributions have the same shape and orientation but different means), which implies that the classification boundary is a straight line. Intuitively, the boundary is the line where the probabilities are in equilibrium: every class pulls the vector **X** towards itself with the same strength.

Alternatively, QDA does not assume equal covariances: each class gets its own covariance matrix. The squared distance from vector **X** to the mean of each class is then measured with a different matrix per class, so the quadratic terms no longer cancel out and the boundaries become quadratic.

[[Scikit-learn]](https://scikit-learn.org/stable/modules/lda_qda.html#mathematical-formulation-of-the-lda-and-qda-classifiers)
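The rule "likelihood times prior, then take the maximum" can be sketched by hand with NumPy and compared against scikit-learn's QDA. This is an illustrative reimplementation on the iris data, assuming one Gaussian per class with its own covariance; the helper name `log_posterior_scores` is hypothetical, and the constant term shared by all classes is dropped.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

X_da, y_da = load_iris(return_X_y=True)
classes = np.unique(y_da)

def log_posterior_scores(X, X_fit, y_fit):
    """Return log likelihood(k) + log prior(k) for every sample and class."""
    scores = []
    for k in classes:
        Xk = X_fit[y_fit == k]
        mean, cov = Xk.mean(axis=0), np.cov(Xk, rowvar=False)
        diff = X - mean
        # Squared Mahalanobis distance of every sample to the class mean
        maha = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
        log_lik = -0.5 * (maha + np.linalg.slogdet(cov)[1])
        log_prior = np.log(len(Xk) / len(X_fit))
        scores.append(log_lik + log_prior)
    return np.column_stack(scores)

# Choose the class with the highest (unnormalised) posterior
manual_pred = classes[np.argmax(log_posterior_scores(X_da, X_da, y_da), axis=1)]
qda_pred = QuadraticDiscriminantAnalysis().fit(X_da, y_da).predict(X_da)

agreement = (manual_pred == qda_pred).mean()
print('agreement with sklearn QDA: {:.3f}'.format(agreement))
```

Working with logarithms avoids numerical underflow and turns the product likelihood · prior into a sum.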


In [None]:
# First, generate datasets
def dataset_fixed_cov(n=300):
    """Generate 2 Gaussian samples with the same covariance matrix"""
    dim = 2
    np.random.seed(0)
    
    # generate two random samples 
    s1, s2 = np.random.randn(n, dim), np.random.randn(n, dim)
    
    # define a linear transformation
    T = np.array([[0., -0.23], [0.83, .23]])
    
    # Apply transformations and ensure s2 is far enough
    s1, s2 = np.dot(s1, T), np.dot(s2, T) + np.array([1, 1])
    
    # Finally join them together
    X = np.r_[s1, s2]
    
    # Assign classes
    y = np.hstack((np.zeros(n), np.ones(n)))
    
    return X, y


def dataset_cov(n=300):
    """Generate 2 Gaussian samples with different covariance matrices."""
    dim = 2
    np.random.seed(0)
    
    # generate two random samples 
    s1, s2 = np.random.randn(n, dim), np.random.randn(n, dim)
    
    # define a linear transformation
    T = np.array([[0., -1.], [2.5, .7]]) * 2
    
    # Apply transformations and ensure s2 is far enough
    s1, s2 = np.dot(s1, T), np.dot(s2, T.T) + np.array([1, 4])
    
    # Finally join them together
    X = np.r_[s1, s2]
    
    # Create the classes
    y = np.hstack((np.zeros(n), np.ones(n)))
    
    return X, y

# #############################################################################
# Second, generate a colormap.
cmap = LinearSegmentedColormap(
    'red_blue_classes',
    {'red': [(0, 1, 1), (1, 0.7, 0.7)],
     'green': [(0, 0.7, 0.7), (1, 0.7, 0.7)],
     'blue': [(0, 0.7, 0.7), (1, 1, 1)]})
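# plt.cm.register_cmap was removed in Matplotlib 3.9; use the colormap registry
# (force=True lets this cell be re-run without an "already registered" error)
import matplotlib
matplotlib.colormaps.register(cmap, force=True)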

# #############################################################################
# Third, plot functions
def plot_data(lda, X_train, X_test, y_test, y_hat, fig_index):
    """Plot the data for each of the subplots."""
    # Instantiate the subplot so we can set its properties
    splot = plt.subplot(2, 2, fig_index)
    
    # Add some titles
    if fig_index == 1:
        plt.title('Linear Discriminant Analysis')
        plt.ylabel('Data with\n fixed covariance')
    elif fig_index == 2:
        plt.title('Quadratic Discriminant Analysis')
    elif fig_index == 3:
        plt.ylabel('Data with\n varying covariances')

    ### # Split the test samples into correct and wrong predictions # ###
    tp = (y_test == y_hat)  # True where the prediction is correct
    tp0, tp1 = tp[y_test == 0], tp[y_test == 1]  # Group hits by true class
    X0, X1 = X_test[y_test == 0], X_test[y_test == 1]  # Group samples by class also
    X0_tp, X0_fp = X0[tp0], X0[~tp0]  # Correct & misclassified samples of class 0
    X1_tp, X1_fp = X1[tp1], X1[~tp1]  # Correct & misclassified samples of class 1

    # plot points
    # dots are correct and crosses are misclassified, red is class 0 and blue is class 1
    plt.scatter(X0_tp[:, 0], X0_tp[:, 1], marker='.', color='red')
    plt.scatter(X0_fp[:, 0], X0_fp[:, 1], marker='x', s=20, color='#990000')
    plt.scatter(X1_tp[:, 0], X1_tp[:, 1], marker='.', color='blue')
    plt.scatter(X1_fp[:, 0], X1_fp[:, 1], marker='x', s=20, color='#000099')  # dark blue

    # Get a bunch of points inside the values of X
    xx, yy = mesh(X_train, h=.1)
    
    # Drop'em all into the model and predict the values for them
    Z = lda.predict_proba(np.c_[xx.ravel(), yy.ravel()])
    
    # Plot the outcome into a color plot
    Z = Z[:, 1].reshape(xx.shape)
    plt.pcolormesh(xx, yy, Z, cmap='red_blue_classes',
                   norm=Normalize(0., 1.), zorder=0)
    
    # Also paint the border line
    plt.contour(xx, yy, Z, [0.5], linewidths=2., colors='white')
    
    # Add some limits to the graph
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())

    # Finally draw a star where the means are located
    plt.plot(lda.means_[0][0], lda.means_[0][1],
             '*', color='yellow', markersize=15, markeredgecolor='grey')
    plt.plot(lda.means_[1][0], lda.means_[1][1],
             '*', color='yellow', markersize=15, markeredgecolor='grey')

    return splot


def plot_ellipse(splot, mean, cov, color):
    """Draw an ellipse to show standard deviation."""
    v, w = np.linalg.eigh(cov)  # Get eigenvalues & eigenvectors (columns of w)
    u = w[:, 0] / np.linalg.norm(w[:, 0])  # Unit eigenvector paired with v[0]
    angle = np.arctan2(u[1], u[0])  # Get the angle, in radians
    angle = 180 * angle / np.pi  # convert to degrees
    
    # filled ellipse whose axes span 2 standard deviations (one on each side of the mean)
    # (angle must be passed by keyword in recent Matplotlib versions)
    ell = Ellipse(mean, 2 * v[0] ** 0.5, 2 * v[1] ** 0.5,
                              angle=180 + angle, facecolor=color,
                              edgecolor='black', linewidth=2)
    
    # Clip the ellipse to the subplot and make it translucent
    ell.set_clip_box(splot.bbox)
    ell.set_alpha(0.2)
    splot.add_artist(ell)
    splot.set_xticks(()), splot.set_yticks(())


def plot_lda_cov(lda, splot):
    """Invoke the ellipses."""
    plot_ellipse(splot, lda.means_[0], lda.covariance_, 'red')
    plot_ellipse(splot, lda.means_[1], lda.covariance_, 'blue')


def plot_qda_cov(qda, splot):
    """Invoke the ellipses."""
    plot_ellipse(splot, qda.means_[0], qda.covariance_[0], 'red')
    plot_ellipse(splot, qda.means_[1], qda.covariance_[1], 'blue')


plt.figure(figsize=(17, 12), facecolor='white')
plt.suptitle('Linear Discriminant Analysis vs Quadratic Discriminant Analysis',
             y=0.98, fontsize=15)

# #############################################################################
# Fourth, generate the plots

# Get confusion matrices
cml, cmq = list(), list()

for i, (X, y) in enumerate([dataset_fixed_cov(n=600), dataset_cov(n=600)]):
    # Split the datasets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, stratify=y)
    
    # Instantiate the estimator
    lda = LinearDiscriminantAnalysis(solver="lsqr", store_covariance=True)
    
    # Make some predictions on the test set
    y_hat = lda.fit(X_train, y_train).predict(X_test)
    
    # Get the confusion matrix for them
    cml.append(confusion_matrix(y_test, y_hat))
    
    # Finally plot the outcome
    splot = plot_data(lda, X_train, X_test, y_test, y_hat, fig_index=2 * i + 1)
    plot_lda_cov(lda, splot)
    plt.axis('tight')

    # Quadratic Discriminant Analysis
    # Instantiate the estimator
    qda = QuadraticDiscriminantAnalysis(store_covariance=True)
    
    # Make some predictions using same set
    y_hat = qda.fit(X_train, y_train).predict(X_test)
    
    # Get the confusion matrix for them
    cmq.append(confusion_matrix(y_test, y_hat))
    
    # Finally plot the outcome
    splot = plot_data(qda, X_train, X_test, y_test, y_hat, fig_index=2 * i + 2)
    plot_qda_cov(qda, splot)
    plt.axis('tight')

plt.tight_layout()
plt.subplots_adjust(top=0.92)
plt.show()

# Print the outcomes of the confusion matrices
# (for binary labels, ravel() yields TN, FP, FN, TP in that order)
data = {
    'Value': ['TN', 'FP', 'FN', 'TP', ],
    'LDA fixed': cml[0].ravel(),
    'LDA var': cml[1].ravel(),
    'QDA fixed': cmq[0].ravel(),
    'QDA var': cmq[1].ravel(),
}
pd.DataFrame(data)

## Naive Bayes

**Bayes theorem:**  
Bayes' theorem lets us update our beliefs (the priors over the classes $y$) when we observe new evidence (the features $\mathbf{X}$).

**Naive Bayes:**
> The probability of a class ($\hat{y}$) given a certain feature ($\mathbf{X}_i$) is proportional to the probability (distribution) of that feature within that class in the training set, times the proportion of that class in the whole training set.
>
> Then we choose the $\hat{y}$ that gives the highest value (most likely).

It is called naive because it assumes that the features are independent of each other given the class, and therefore the covariance matrix is diagonal (all the values are 0 except the variance of each component with itself).

**Algorithms**  
scikit-learn provides several naive Bayes variants, among them:
* Gaussian: the likelihood of each feature is assumed to be Gaussian.
* Multinomial: the features are assumed to follow a multinomial distribution (for example, word counts).
* Complement: an adaptation of the multinomial variant suited to imbalanced classes.
* Bernoulli: the features are assumed to follow a multivariate Bernoulli distribution (binary features).

One of the most common uses of Naive Bayes is for text classification (like spam filtering for email) where one can assume that the words in the message are independent events. This condition is not generally satisfied (for example, in natural languages like English the probability of finding an adjective is affected by the probability of having a noun), but it is a useful idealization, especially since the statistical correlations between individual words are usually not known.
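A toy version of such a spam filter can be sketched with `CountVectorizer` and `MultinomialNB`. The six training messages below are invented for illustration; a real filter would need far more data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny invented corpus: three spam and three legitimate ("ham") messages
train_docs = [
    'win money now',
    'free money offer',
    'win free prize',
    'meeting at noon',
    'lunch at noon tomorrow',
    'project meeting tomorrow',
]
train_labels = ['spam', 'spam', 'spam', 'ham', 'ham', 'ham']

# Word counts per message feed a multinomial naive Bayes classifier
spam_clf = make_pipeline(CountVectorizer(), MultinomialNB())
spam_clf.fit(train_docs, train_labels)

predictions = spam_clf.predict(['free money', 'meeting tomorrow'])
print(predictions)
```

Each word contributes independently to the score of a class, which is exactly the naive assumption described above.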


In [None]:
X, y = IRIS.data, IRIS.target

# Instantiate the classifier
gnb = GaussianNB()
clf = gnb.fit(X, y)

# Make some predictions
y_hat = clf.predict(X)

print("Number of mislabeled points out of a total %d points : %d"
      % (X.shape[0],(y != y_hat).sum()))

## Support Vector Machines
The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space (N being the number of features) that distinctly classifies the data points.

Support vectors are the data points closest to the hyperplane. They are so called because they *support* the plane: they determine its position and orientation.

SVC is a **linear classifier** that can be converted to non-linear through the so-called *kernel trick*.
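The effect of the kernel trick is easy to see on data that no straight line can separate. The sketch below assumes scikit-learn's `make_circles` toy dataset (two concentric rings) and compares a linear kernel with an RBF kernel; exact scores depend on the random split.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original 2D space
X_c, y_c = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)
Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(
    X_c, y_c, test_size=0.33, stratify=y_c, random_state=0)

# Same estimator, different kernels
acc_linear = SVC(kernel='linear').fit(Xc_tr, yc_tr).score(Xc_te, yc_te)
acc_rbf = SVC(kernel='rbf').fit(Xc_tr, yc_tr).score(Xc_te, yc_te)

print('linear kernel accuracy: {:.2f}'.format(acc_linear))
print('rbf kernel accuracy: {:.2f}'.format(acc_rbf))
```

The RBF kernel implicitly maps the points to a space where a separating hyperplane exists, while the linear kernel is restricted to a straight line in the original plane.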

In [None]:
# remove one of the classes
svc_df = IRIS_PD.drop((IRIS_PD[IRIS_PD['target'] == 2]).index)
x, y= 'sepal length (cm)', 'petal length (cm)'
svc_df.plot.scatter(x=x, y=y)

# Replace target 0 by -1 so sign can change with the product afterwards
t1 = svc_df[svc_df['target'] == 0]
svc_df.loc[t1.index, 'target'] = -1

# Get vectors and targets
X = svc_df[['sepal length (cm)', 'petal length (cm)']].values 
Y = svc_df['target'].values 

# Shuffle a bit and split into train and test
X, Y = shuffle(X, Y)
x_train, x_test, y_train, y_test = train_test_split(X, Y, train_size=0.9)

# Make targets col vectors
y_train, y_test = y_train.reshape(90, 1), y_test.reshape(10, 1)

# Also for features
train_f1 = x_train[:, 0].reshape(90, 1)
train_f2 = x_train[:, 1].reshape(90, 1)

# Initialize the coefficients
w1, w2 = 0, 0

# Training cycles & learning rate
epochs, alpha = 1000, 1e-4

# record the evolution of w
w1_tape, w2_tape = np.zeros(0), np.zeros(0)

# Now, start training
for e in range(1, epochs):
    
    # Compute the decision value w·x for each of the 90 training samples
    # (two features each; this simplified model has no bias term)
    y_hat = w1 * train_f1 + w2 * train_f2
    
    # Multiply by the target (+1 or -1): the product is >= 1 only for samples
    # on the correct side of the boundary and outside the margin
    prod = y_hat * y_train
    
    # regularization parameter (reduces the impact of grad w over training)
    # Used to produce stable solutions
    lamda = 1 / (epochs)
    
    
    for n, val in enumerate(prod):
        grad_w1, grad_w2 = 2 * lamda * w1, 2 * lamda * w2
        if val >= 1:
            w1 -= alpha * grad_w1
            w2 -= alpha * grad_w2
        else:
            loss_f1 = train_f1[n] * y_train[n]
            loss_f2 = train_f2[n] * y_train[n]
            w1 += alpha * (loss_f1 -  grad_w1)
            w2 += alpha * (loss_f2 -  grad_w2)

    w1_tape = np.append(w1_tape, w1[0])
    w2_tape = np.append(w2_tape, w2[0])
    
    print('Epoch: {}\r'.format(e), end='')
    
### Plot ###
x = np.arange(1, epochs)
fig, ax = plt.subplots()

p1 = ax.plot(x, w1_tape)
p2 = ax.plot(x, w2_tape)

plt.show()

# Get test features
test_f1 = x_test[:, 0].reshape(10, 1)
test_f2 = x_test[:, 1].reshape(10, 1)

# Predict'em
y_hat = w1 * test_f1 + w2 * test_f2

# The predicted class is the sign of the decision value
pred = [1 if val > 0 else -1 for val in y_hat]
print('accuracy:', accuracy_score(y_test, pred))

### Create a maximum margin separating hyperplane
[sklearn origin](https://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane.html#sphx-glr-auto-examples-svm-plot-separating-hyperplane-py)

In [None]:
# Create 200 separable points
X, y = datasets.make_blobs(n_samples=200, centers=2, random_state=6)

# Fit the model
clf = SVC(kernel='linear', C=1000)
clf.fit(X, y)

plt.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap=plt.cm.Paired)

# Plot the decision function
ax = plt.gca()
xlim = ax.get_xlim()
ylim = ax.get_ylim()

# create grid to evaluate model
xx = np.linspace(xlim[0], xlim[1], 30)
yy = np.linspace(ylim[0], ylim[1], 30)
YY, XX = np.meshgrid(yy, xx)
xy = np.vstack([XX.ravel(), YY.ravel()]).T
Z = clf.decision_function(xy).reshape(XX.shape)

# plot decision boundary and margins
ax.contour(XX, YY, Z, colors='k', levels=[-1, 0, 1], alpha=0.5,
           linestyles=['--', '-', '--'])

# plot support vectors
ax.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], s=100,
           linewidth=1, facecolors='none', edgecolors='k')
plt.show()



## PCA for classification

[[PCA visualized]](https://notsquirrel.com/pca/)  
[[SVD insight]](https://machinelearningmastery.com/singular-value-decomposition-for-machine-learning/)

Principal component analysis (PCA) is a technique that lets us get a reduced and efficient version of a given matrix (typically the covariance matrix of the data), which can in turn be used for predictions. It is efficient because it provides eigenvectors (uncorrelated directions of information), and reduced because only a few of these eigenvectors are enough to closely approximate the original matrix.

**Process breakdown (PCA visualized):**

1. Flatten every image to an array of $1\times 64$.
2. Get the expected value (mean) of every pixel across all flattened images.
3. Get, over the N images, the expected deviation of each pixel from its pixel mean.
4. Build the covariance matrix, that is, a $64\times 64$ matrix that describes how pairs of pixels vary together.  
    If the entry for pixels $i$ and $j$ is:
    * Large positive, then when $i>\mu_i$, $j$ tends to be $>\mu_j$
    * Large negative, then when $i>\mu_i$, $j$ tends to be $<\mu_j$
    * Close to zero, then $i$ does not provide much information about $j$
5. The goal of this process is to get an efficient version of the matrix created above so it can be used to predict unknown images. This means that we should find the eigenvectors of this matrix, which is achieved by decomposing it with [SVD](../Glossary.ipynb/#S) (singular value decomposition). The eigenvectors we are looking for are the columns of $\mathbf{U}$.
6. Apply the reduced matrix to an unknown image: if the outcome is an enhanced version of the image, it agrees with the training data; otherwise it returns noise.
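The steps above can be sketched on scikit-learn's digits dataset, whose 8×8 images are already flattened to 64 pixels. The variable names and the choice of 8 and 32 components are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_digits

# Steps 1-3: flattened images, pixel means and centered data
X_img = load_digits().data.astype(float)  # shape (n_images, 64)
mean_img = X_img.mean(axis=0)
X_centered = X_img - mean_img

# Step 4: 64 x 64 covariance matrix between pixels
cov_px = X_centered.T @ X_centered / X_centered.shape[0]

# Step 5: the columns of U are the eigenvectors, sorted by importance
U_px, s_px, _ = np.linalg.svd(cov_px)

def reconstruction_error(k):
    """Mean squared error after keeping only the first k eigenvectors."""
    basis = U_px[:, :k]
    projected = X_centered @ basis  # reduced representation, k values per image
    restored = projected @ basis.T  # back to 64 pixels
    return np.mean((X_centered - restored) ** 2)

err_8, err_32 = reconstruction_error(8), reconstruction_error(32)
print('error with 8 components: {:.3f}'.format(err_8))
print('error with 32 components: {:.3f}'.format(err_32))
```

Keeping more eigenvectors lowers the reconstruction error, at the cost of a larger reduced representation.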

In [None]:
# Dimensionality reduction with SVD
# Forked from: http://cs231n.github.io/neural-networks-2/

# Create a sample matrix
X = np.random.normal(size=(100, 10))

# Get the covariance matrix
X -= X.mean(axis=0)  # Center at zero
cov = np.dot(X.T, X) / X.shape[0]

# Decompose it using SVD
U, s, V_T = np.linalg.svd(cov)

# Decorrelate the data
Xrot = np.dot(X, U)

# Reduce dimensions
Xrot_reduced = np.dot(X, U[:, :4])

Xrot_reduced.shape  # 100x4