# Classification
With classification, each sample is assigned a label according to which side of a decision boundary it falls on. The boundary partitions your feature space into regions, one per class.

## Logistic Regression

In [15]:
def logistic_mod(df, logProb = 1.0):
    from sklearn import linear_model

    ## Prepare data for model
    nrow = df.shape[0]
    # .as_matrix() was removed in pandas 1.0; .to_numpy() is its replacement
    X = df[['x', 'y']].to_numpy().reshape(nrow, 2)
    Y = df.z.to_numpy().ravel()
    ## Compute the logistic regression model
    lg = linear_model.LogisticRegression()
    logr = lg.fit(X, Y)
    ## Compute the y values
    temp = logr.predict_log_proba(X)  
    df['predicted']  = [1 if (logProb > p[1]/p[0]) else 0 for p in temp]
    return df

def eval_logistic(df):
    import matplotlib.pyplot as plt
    import pandas as pd

    truePos = df[((df['predicted'] == 1) & (df['z'] == df['predicted']))]  
    falsePos = df[((df['predicted'] == 1) & (df['z'] != df['predicted']))] 
    trueNeg = df[((df['predicted'] == 0) & (df['z'] == df['predicted']))]  
    falseNeg = df[((df['predicted'] == 0) & (df['z'] != df['predicted']))]

    fig = plt.figure(figsize=(5, 5))
    fig.clf()
    ax = fig.gca()
    truePos.plot(kind = 'scatter', x = 'x', y = 'y', ax = ax, 
                       alpha = 1.0, color = 'DarkBlue', marker = '+', s = 80) 
    falsePos.plot(kind = 'scatter', x = 'x', y = 'y', ax = ax, 
                       alpha = 1.0, color = 'Red', marker = 'o', s = 40)  
    trueNeg.plot(kind = 'scatter', x = 'x', y = 'y', ax = ax, 
                       alpha = 1.0, color = 'DarkBlue', marker = 'o', s = 40)  
    falseNeg.plot(kind = 'scatter', x = 'x', y = 'y', ax = ax, 
                       alpha = 1.0, color = 'Red', marker = '+', s = 80) 
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    ax.set_title('Classes vs X and Y')
    
    TP = truePos.shape[0]
    FP = falsePos.shape[0]
    TN = trueNeg.shape[0]
    FN = falseNeg.shape[0]
       
    confusion = pd.DataFrame({'Positive': [FP, TP],
                              'Negative': [TN, FN]},
                               index = ['TrueNeg', 'TruePos'])
    accuracy = float(TP + TN)/float(TP + TN + FP + FN)      
    precision = float(TP)/float(TP + FP)     
    recall =  float(TP)/float(TP + FN)      
    
    print(confusion)
    print('accuracy = ' + str(accuracy))
    print('precision = ' + str(precision))
    print('recall = ' + str(recall))
    
    return 'Done'
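
The counts and ratios computed by hand in `eval_logistic` can be cross-checked against `sklearn.metrics`. The following is a minimal sketch; the label vectors `y_true` and `y_pred` are made up for illustration and do not come from the dataset above.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical labels, for illustration only
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]

# Rows are actual classes, columns are predicted classes (both ordered 0, 1)
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / all samples
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)

print(cm)
print('accuracy =', accuracy, 'precision =', precision, 'recall =', recall)
```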

## K Nearest Neighbors Classification
SciKit-Learn prepares your data for neighbor lookups using a tree structure such as a kd-tree (or a ball tree, or a brute-force search, depending on the `algorithm` parameter). Prediction is based on a majority vote, so it is probably best to use an odd number of neighbors with binary classification.

Choose the best-fitting distance measure. Be careful with **imbalanced labeled datasets** when choosing your distance measure: in this case, using uniform or user-defined weighting may be better.
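
To see how the `weights` setting changes the vote, here is a small sketch on a made-up, imbalanced one-dimensional dataset (the data and the query point are assumptions chosen so the two settings disagree). Which setting is preferable depends on your data.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical imbalanced 1-D data: four samples of class 0, one of class 1
X = [[0], [1], [2], [3], [5]]
y = [0, 0, 0, 0, 1]
query = [[4.6]]  # its 3 nearest neighbors are 5 (class 1), 3 and 2 (class 0)

uniform = KNeighborsClassifier(n_neighbors=3, weights='uniform').fit(X, y)
weighted = KNeighborsClassifier(n_neighbors=3, weights='distance').fit(X, y)

# Uniform voting: the two class-0 neighbors outvote the single, much closer class-1 neighbor
print(uniform.predict(query))
# Distance weighting: each vote is scaled by 1/distance, so the close neighbor dominates
print(weighted.predict(query))
```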

In [16]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import euclidean_distances

In [17]:
data   = [[0],[1],[2],[3],[4], [5],[6],[7],[8],[9]]  # input dataframe samples
labels = [0,0,0,0,0, 1,1,1,1,1]  # the function we're training is " >4 "
data_train, data_test, label_train, label_test = train_test_split(data, labels, test_size=0.5, random_state=7)

In [18]:
model = KNeighborsClassifier(n_neighbors=3)
model.fit(data_train, label_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')

In [19]:
predictions = model.predict(data_test)
predictions

array([1, 0, 0, 0, 0])

In [20]:
model.predict_proba(data_test)

array([[ 0.        ,  1.        ],
       [ 0.66666667,  0.33333333],
       [ 0.66666667,  0.33333333],
       [ 0.66666667,  0.33333333],
       [ 0.66666667,  0.33333333]])

In [21]:
#Model evaluation
model.score(data_test, label_test)

0.80000000000000004

## Support Vector Machines (SVC)
In a nutshell, SVC solves the classification problem by finding the equation of the hyperplane (linear surface) which results in the most separation between two classes of samples. This allows you to confidently label your samples in a very fast and efficient way. The kernel is essentially a similarity function.

SVC is a classifier, so in short, you could use it on any classification problem. Other algorithms like K-Neighbors are instantaneous with training, but require you to traverse a complicated tree structure for each sample you want to classify. That can become a bottleneck in real-time applications like self-driving cars that need to rapidly tell the difference between a plastic bag and a large rock. One of the advantages of SVC is that once you've done the hard work of finding the hyperplane and its supporting vectors, the real job of classifying your samples is as simple as answering: which side of the line is the point on? This makes SVC a classifier of choice for problems where classification speed is more critical than training speed.

SVC is extremely effective, even in high dimensional spaces. Just as you saw in the billiards explanation earlier, even after the instructor added in the rest of the balls onto the table, the accuracy of the original pool-stick classification was still pretty good. With SVC, most of your dataset actually doesn't even matter. The only important samples are those closest to the decision boundary, called the support vectors. Those samples determine the position of the separating hyperplane and the size of its margin. If you have a very large dataset consisting of many samples and want to speed it up simply by throwing away samples, a way to do so without sacrificing your classification accuracy too much would be by using SVC.
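
The claim that only the support vectors matter can be checked on a toy problem. This sketch assumes a small, hand-made, linearly separable dataset; it fits one model on all samples and a second model on the support vectors alone, then compares their predictions.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data: two square clusters
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1],
              [3, 3], [4, 3], [3, 4], [4, 4]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

full = SVC(kernel='linear', C=1.0).fit(X, y)
sv = full.support_  # indices of the samples chosen as support vectors
print('support vector indices:', sv)

# Refit using only the support vectors and compare predictions
reduced = SVC(kernel='linear', C=1.0).fit(X[sv], y[sv])

grid = np.array([[0.5, 0.5], [1.5, 1.0], [2.5, 3.0], [3.5, 3.5]])
print(full.predict(grid), reduced.predict(grid))
```

On real, noisy data the two models will not always agree this cleanly, so treat this as an illustration of the idea rather than a general guarantee.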

There may be cases where your dataset has more features than samples. Not all machine learning algorithms can work with that, but such datasets aren't an issue for SVC. In fact, at least conceptually, if you use the kernel trick then at some point your data will almost assuredly be at a higher dimensionality than the number of features, depending on which kernel you use. And with the ability to use different kernel functions or even define your own, SVC will prove to be a very versatile classifier to have in your machine learning arsenal.

Lastly, SVC is non-probabilistic. That means the resulting classification is calculated based off of the geometry of your dataset, as opposed to probabilities of occurrences. Once you get to decision trees, you'll see an example of a classifier that works using probabilities and not the geometric nature of your dataset.

The most important parameter to look at when using the Support Vector Classifier is the kernel parameter, which specifies the kernel function type. Next comes C: the higher the C parameter is, the more it costs the algorithm to incorrectly classify a training sample. As a result, with a high C it'll put all its effort into wriggling through your training data so that every sample is classified correctly. You have to strike a balance on this seesaw of C values in order to get the best possible fit.

Moving right along, gamma is a parameter whose value is inversely proportional to the region of influence of a single training sample. Large gamma values result in the training samples in your dataset having localized effects only: each one affects only a very close region. A smaller gamma value results in each training sample affecting a much larger and broader region. In essence, the gamma value dictates how pronounced your decision boundary is, by varying the influence of your support vector samples.
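
A little arithmetic makes the role of gamma concrete. The RBF kernel scores the similarity of two points as `exp(-gamma * squared_distance)`; the helper name `rbf_similarity` and the two points below are assumptions made for this sketch.

```python
import numpy as np

def rbf_similarity(a, b, gamma):
    """RBF kernel value exp(-gamma * ||a - b||^2) between two points."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.exp(-gamma * np.sum((a - b) ** 2))

sample = [0.0, 0.0]
neighbor = [1.0, 1.0]  # squared distance to `sample` is 2

for gamma in (0.01, 1.0, 100.0):
    # Larger gamma: similarity decays faster, so each sample's influence is more local
    print(gamma, rbf_similarity(sample, neighbor, gamma))
```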

With SciKit-Learn's SVC, the class_weight parameter lets you make the samples of certain classes more important than those of others.
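
As a rough sketch of how this looks in practice, the snippet below passes a per-class dictionary to `class_weight` on a made-up imbalanced dataset. The weights `{0: 1, 1: 10}` are arbitrary values chosen for illustration; `class_weight='balanced'` is another option that derives the weights from the class frequencies.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical imbalanced data: six samples of class 0, two of class 1
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [2, 0], [0, 2],
              [3, 3], [4, 4]], dtype=float)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])

# Errors on class 1 are penalized ten times as heavily as errors on class 0
model = SVC(kernel='linear', C=1.0, class_weight={0: 1, 1: 10}).fit(X, y)

# class_weight_ holds the per-class multipliers applied to C
print(model.class_weight_)
```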

After training, you can retrieve the support vector samples, which could be useful both for examination and for visualization purposes. You can also pick up your intercepts. If you're wondering what an intercept looks like with, for example, a radial basis function kernel, just remember that once the table's been flipped, your samples are in a higher-dimensional, linearly separable space, and all you're really trying to do is calculate that linear decision boundary using a straight-line equation.

Of the SVC class's parameters, the most important, in order, are:
* **kernel** Defines the type of kernel used with your classifier. The default is the radial basis function rbf, the most popular kernel used with support vector machines generally. SciKit-Learn also supports linear, poly, sigmoid, and precomputed kernels. You can also specify a user-defined function to pre-compute the kernel matrix from your sample's feature space, which should be shaped `[n_samples, n_samples]`.
* **C** This is the penalty parameter for the error term. Do you want your SVC to never miss a single classification? Or is having a more generalized solution important to you? The lower your C value, the smoother and more generalized your decision boundary is going to be. But if you have a large C value, the classifier will attempt to do whatever is in its power to squiggle and wiggle between each sample to correctly classify it.
* **gamma** This parameter's value is inversely proportional to the extent a single training sample's influence extends. Large gamma values result in each training sample having localized effects only. Smaller values result in each sample affecting a larger area. In essence, the gamma values dictate how pronounced your decision boundary is by varying the influence of your support vector samples.
* **random_state** Support vector machines are theoretically deterministic algorithms, meaning that if you re-run one against the same input, it should produce identical output each time. However, SciKit-Learn's SVC, via its libsvm implementation, randomly shuffles your data during its probability estimation step. So to truly get deterministic execution, set a state seed.
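
To get a feel for the C seesaw described in the list above, here is a hedged sketch that trains the same RBF classifier with three C values on synthetic data. The dataset (`make_moons`), the split, and the particular C values are all assumptions picked for illustration, and `gamma='scale'` assumes a reasonably recent SciKit-Learn.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, noisy two-class data (an assumption made for illustration)
X, y = make_moons(n_samples=200, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

scores = {}
for C in (0.01, 1.0, 1000.0):
    model = SVC(kernel='rbf', C=C, gamma='scale', random_state=0)
    model.fit(X_train, y_train)
    # Keep (training accuracy, testing accuracy, number of support vectors)
    scores[C] = (model.score(X_train, y_train),
                 model.score(X_test, y_test),
                 len(model.support_))
    print(C, scores[C])
```

Comparing the training and testing accuracy for each C is a quick way to spot where the boundary starts chasing individual training samples.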

```python
>>> from sklearn.svm import SVC
>>> model = SVC(kernel='linear')
>>> model.fit(X, y) 
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
```

In addition to the regular .fit(), .predict(), and .score() methods, SVC also allows you to calculate the distance of a set of samples to the decision boundary in high-dimensional space using .decision_function(X), where X is a series of samples of the form `[n_samples, n_features]`.
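
A minimal sketch of `.decision_function()` on made-up two-class data follows. Note that the returned values are signed scores that grow with the distance from the boundary, rather than raw Euclidean distances.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data with two classes
X = np.array([[0, 0], [1, 1], [3, 3], [4, 4]], dtype=float)
y = np.array([0, 0, 1, 1])
model = SVC(kernel='linear').fit(X, y)

# Samples to score, shaped [n_samples, n_features]
samples = np.array([[0.5, 0.5], [1.9, 1.9], [2.1, 2.1], [5.0, 5.0]])
distances = model.decision_function(samples)
predictions = model.predict(samples)

# For a two-class problem, the sign says which side of the boundary a sample is on,
# and the magnitude grows the further the sample is from the boundary
print(distances)
print(predictions)
```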

In terms of attributes, a few goodies are exposed here too:
* **support_** Contains an array of the indices belonging to the selected support vectors
* **support_vectors_** The actual samples chosen as the support vectors
* **intercept_** The constants of the decision function
* **dual_coef_** Each support vector's contribution to the decision function, on a per classification basis. This has similarities to the weights of linear regression
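
The attributes above can be inspected directly after fitting. This sketch uses made-up data and a linear kernel, where the hyperplane weights exposed as `coef_` can be rebuilt from `dual_coef_` and `support_vectors_`; that reconstruction applies to the linear kernel only.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data with two classes
X = np.array([[0, 0], [1, 1], [1, 0], [3, 3], [4, 4], [4, 3]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])
model = SVC(kernel='linear').fit(X, y)

print(model.support_)          # indices of the support vectors in X
print(model.support_vectors_)  # the support vector samples themselves
print(model.intercept_)        # constant term of the decision function
print(model.dual_coef_)        # signed weight of each support vector

# With a linear kernel, the hyperplane weights are a weighted sum of the support vectors
weights = model.dual_coef_ @ model.support_vectors_
print(weights, model.coef_)
```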

One of SVC's strong points is that since the core of the algorithm is based on a small subset of your dataset's samples, namely the support vectors, even if you have fewer samples than dimensions, so long as the samples you do have are close to the decision boundary, there's a good chance your support vector classifier will do just swell.

Intuitively, those samples that are further away from the decision boundary are more clearly identifiable as belonging to their respective classes. The samples closer to the decision boundary are more vague, and could easily be mistaken as belonging to the wrong class. If you wanted to teach a child how to tell cats from dogs, good training samples would include the most "catly" cat you could find, and the most "dogly" dog. By showing them samples from the two classes that are far away from the decision boundary, they are less likely to look for characteristics and features that might accidentally be misconstrued. Support vector machines behave counterintuitively; they don't care about the samples that are clearly cats, or that are clearly dogs. Rather, they focus on those samples, or support vectors, closest to the decision boundary, so they can compute precisely the smallest change in features that differentiates the two, in the hope of properly classifying all of them.

Since SVC is one of SciKit-Learn's highly configurable predictors, it's easy to start overfitting your models if you're not careful. Furthermore, unlike KNeighbors that does all its processing at the point of predicting, SVC does the majority of its heavy lifting at the point of training, so large training sets can result in sluggish training. If the ability to do realtime training and updating of your model is of great concern to you, you might have to consider another algorithm, depending on the size of your dataset. That said, there are a few mechanisms to speed it up.

In [22]:
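import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np

# Assumed module-level flag (not defined in this excerpt): when True, skip one
# of the two mirrored halves of the plot grid
FAST_DRAW = True

def drawPlots(model, X_train, X_test, y_train, y_test, wintitle='Figure 1'):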
    # You can use this to break any higher-dimensional space down,
    # And view cross sections of it.

    # If this line throws an error, use plt.style.use('ggplot') instead
    mpl.style.use('ggplot') # Look Pretty

    padding = 3
    resolution = 0.5
    max_2d_score = 0

    y_colors = ['#ff0000', '#00ff00', '#0000ff']
    my_cmap  = mpl.colors.ListedColormap(['#ffaaaa', '#aaffaa', '#aaaaff'])
    colors   = [y_colors[i] for i in y_train]
    num_columns = len(X_train.columns)

    fig = plt.figure()
    # Figure.canvas.set_window_title was removed in Matplotlib 3.6; use the manager
    fig.canvas.manager.set_window_title(wintitle)
    fig.set_tight_layout(True)
    
    cnt = 0
    for col in range(num_columns):
        for row in range(num_columns):
            
            # Easy out
            if FAST_DRAW and col > row:
                cnt += 1
                continue

            ax = plt.subplot(num_columns, num_columns, cnt + 1)
            plt.xticks(())
            plt.yticks(())

            # Intersection:
            if col == row:
                plt.text(0.5, 0.5, X_train.columns[row], verticalalignment='center', horizontalalignment='center', fontsize=12)
                cnt += 1
                continue


            # Only select two features to display, then train the model
            # (.ix was removed in pandas 1.0; .iloc selects by integer position)
            X_train_bag = X_train.iloc[:, [row,col]]
            X_test_bag = X_test.iloc[:, [row,col]]
            model.fit(X_train_bag, y_train)

            # Create a mesh to plot in
            x_min, x_max = X_train_bag.iloc[:, 0].min() - padding, X_train_bag.iloc[:, 0].max() + padding
            y_min, y_max = X_train_bag.iloc[:, 1].min() - padding, X_train_bag.iloc[:, 1].max() + padding
            xx, yy = np.meshgrid(np.arange(x_min, x_max, resolution),
                                 np.arange(y_min, y_max, resolution))

            # Plot Boundaries
            plt.xlim(xx.min(), xx.max())
            plt.ylim(yy.min(), yy.max())

            # Prepare the contour
            Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
            Z = Z.reshape(xx.shape)
            plt.contourf(xx, yy, Z, cmap=my_cmap, alpha=0.8)
            plt.scatter(X_train_bag.iloc[:, 0], X_train_bag.iloc[:, 1], c=colors, alpha=0.5)


            score = round(model.score(X_test_bag, y_test) * 100, 3)
            plt.text(0.5, 0, "Score: {0}".format(score), transform = ax.transAxes, horizontalalignment='center', fontsize=8)
            max_2d_score = score if score > max_2d_score else max_2d_score

            cnt += 1

    print("Max 2D Score: ", max_2d_score)

## Decision Trees
Since every branch is based on only a single feature, a decision tree looks at just one value at a time. That makes decision trees the first machine learning method we're looking at together that is resilient to things like skewed feature scales and distributions within your datasets. The leaves of a decision tree are the classification results. The other unique aspect of decision trees is that they use a clever arrangement of linear decision boundaries in order to do nonlinear decision-making.

Decision trees, similar to SVC, make use of a clever trick to allow you to do non-linear decision-making by use of a linear decision surface. With SVC it was the table-flipping, kernel trick. With decision trees, you can divide up your feature set into sections and boxes, which otherwise would not have been possible using a single, linear classifier. Due to this, a few other nifty qualities are born to decision trees, such as them being indifferent to feature scaling, unlike KNeighbors or PCA.
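
The indifference to feature scaling can be checked with a quick experiment. This sketch assumes the iris dataset and arbitrary per-feature scale factors; it fits one tree on the raw features and one on the stretched features, then compares them.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target

# Stretch each feature by a different (arbitrary) positive factor
X_scaled = X * np.array([1.0, 10.0, 100.0, 1000.0])

plain = DecisionTreeClassifier(random_state=0).fit(X, y)
scaled = DecisionTreeClassifier(random_state=0).fit(X_scaled, y)

# Both trees split on the same features in the same order; only the thresholds are rescaled
same_structure = np.array_equal(plain.tree_.feature, scaled.tree_.feature)
same_predictions = np.array_equal(plain.predict(X), scaled.predict(X_scaled))
print(same_structure, same_predictions)
```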

```python
>>> from sklearn import tree
>>> model = tree.DecisionTreeClassifier(max_depth=9, criterion="entropy")
>>> model.fit(X,y)
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=9,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
	
# .DOT files can be rendered to .PNGs, if you've already `brew install graphviz`.
>>> tree.export_graphviz(model, out_file='tree.dot', feature_names=X.columns)
```

SciKit-Learn's trees are quite configurable:

* **criterion** By default, SciKit-Learn uses Gini, which is an impurity rating. Alternatively, you could also make use of information gain, or entropy instead.
* **splitter** Lets you control whether the algorithm chooses the best split or not. We'll discuss why that's important once you move on to the random forest classifier.
* **max_features** One of the possible options for splitter above is called 'best'. SciKit-Learn runs a bunch of tests on your features to figure out which mechanism should be used when searching for the best split. This parameter limits the number of features to consider while doing this.

After you've gone ahead and trained your tree, you can of course get back all the end-node, leaf classifications that the tree has reached, as well as the entire tree object if you like. For those leaf nodes that aren't 100% pure due to having samples belonging to multiple classes within them, the end-result class the leaf takes is a weighted mode vote, based on the number of each class label inside of it.

You can also get back a **feature_importances_** vector that stores an importance score for each feature used to make the labeling decisions of your tree, listed in the same order as your features. You can use this as a dimensionality reduction tool.
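
As a small sketch of reading that vector, the snippet below fits a tree on the iris dataset (an assumption made for illustration) and ranks the features by their scores.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
model = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# One score per feature, in the same order as the columns; the scores sum to 1
importances = model.feature_importances_

# Rank the features from most to least important
for idx in np.argsort(importances)[::-1]:
    print(iris.feature_names[idx], round(float(importances[idx]), 3))
```

Features with a score near zero are candidates to drop if you want to reduce dimensionality.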

Decision Trees have some high points which make them desirable for machine learning. They're easy to interpret, by linearizing them into IF .. AND .. AND .. THEN .. blocks. Both training and testing speed are fast. They work with either categorical or continuous features, with or without encoding, and are invariant to feature scaling. Moreover, if configured properly, they can pretty decently handle irrelevant and noisy features. In a nutshell, decision trees help you decide the worst, best and expected classification labels given various scenarios.
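
One way to see the IF .. AND .. THEN .. reading is to print a small tree as text. This sketch assumes a SciKit-Learn version that ships `export_text` (0.21 or later) and uses the iris dataset with a shallow `max_depth` purely to keep the output short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(iris.data, iris.target)

# Each indented line is one condition; following a path from the root to a leaf
# reads as IF <condition> AND <condition> THEN <class>
rules = export_text(model, feature_names=list(iris.feature_names))
print(rules)
```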

While using them, keep in mind that your data must still be separable by a combination of flat, linear decision surfaces for classification to work well: the decision surfaces of the tree are still flat, even though they can intersect at angles. Also, considering how malleable trees are and how many parameters they have, it's easy to get carried away and overfit your tree, particularly if you don't fully test against an independent testing set. A good sign of an overfit tree is a very complex structure with branches that reach out erroneously just to correctly label single records, due to the tree's sensitivity to small, local fluctuations in your training data.
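
The overfitting risk is easy to reproduce. This sketch assumes noisy synthetic data (`make_moons`) and compares a depth-limited tree with an unrestricted one against an independent testing set. `get_depth()` assumes SciKit-Learn 0.21 or later.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy two-class data (an assumption made for illustration)
X, y = make_moons(n_samples=300, noise=0.35, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

results = {}
for depth in (3, None):  # None lets the tree grow until every leaf is pure
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    # Keep (actual depth, training accuracy, testing accuracy)
    results[depth] = (model.get_depth(),
                      model.score(X_train, y_train),
                      model.score(X_test, y_test))
    print(depth, results[depth])
```

A large gap between the training and testing accuracy of the unrestricted tree is the symptom to watch for.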

### Visualizing Trees using Graphviz
```python
>>> from sklearn.datasets import load_iris
>>> from sklearn import tree
>>>
>>> clf = tree.DecisionTreeClassifier()
>>> iris = load_iris()
>>>
>>> clf = clf.fit(iris.data, iris.target)
>>> tree.export_graphviz(clf,
...     out_file='tree.dot') 
```

Use the CLI to produce PNG or PS files:
```shell
$ dot -Tps tree.dot -o tree.ps      # PostScript format
$ dot -Tpng tree.dot -o tree.png    # PNG format
```

### Random Forests or Forests of Randomized Decision Trees
With random forests, you're also somewhat better able to control overfitting, which is a real problem that singly trained decision trees are extremely prone to. In fact, that's probably the main reason why random forests were created in the first place.

A common belief is that the complexity of a machine learning classifier can only grow to a certain level before its accuracy starts to get hurt by overfitting. But with random forests, even if you add more decision trees to your ensemble in order to create a larger forest, as long as you have sufficient training data it seems only to boost the classifier's accuracy.

Since random forests train a multitude of trees, it's almost effortless to parallelize them across multiple processing units. And they're pretty fast at training to boot.

A single decision tree tasked to learn a dataset might not be able to perform well due to the outliers, and the breadth and depth complexity of the data. So instead of relying on a single tree, random forests rely on a forest of cleverly grown decision trees. Each tree within the forest is allowed to become highly specialized in a specific area, but still retains some general knowledge about most areas. When a random forest classifies a sample, it is actually each tree in the forest working together to cast votes on what label they think the sample should be assigned.
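
The voting can be reproduced by hand. In SciKit-Learn's implementation the trees cast "soft" votes: the forest averages each tree's predicted class probabilities rather than counting one hard vote per tree. The sketch below assumes the iris dataset and an arbitrary forest size.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(iris.data, iris.target)

samples = iris.data[::50]  # one sample from each of the three classes

# Average the class probabilities predicted by every tree in the forest
per_tree = np.array([t.predict_proba(samples) for t in forest.estimators_])
averaged = per_tree.mean(axis=0)

print(averaged)
print(forest.predict_proba(samples))  # matches the average above
```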

**With random forests, however, you can train many trees on independent parts of the training set, and by doing that, you actually reduce variance without increasing bias**. SciKit-Learn uses bootstrapping!

Random forests make use of two techniques when training: one occurs at the forest level, and the other at the individual tree level. First, like any supervised classifier, you'll pass in a training set of samples along with "truth" labels when you fit the model. Instead of sharing the entire dataset with each decision tree, the forest performs an operation which is essentially a train / test split of the training data. Each decision tree in the forest randomly samples from the overall training data set. Through doing so, each tree exists in an independent subspace and the variation between trees is controlled. This technique is known as tree bagging, or bootstrap aggregating.

Random forests also use one more trick. In addition to the tree bagging of training samples at the forest level, each individual decision tree further 'feature bags' at each node-branch split. This is helpful because some datasets contain a feature that is very correlated to the target (the 'y'-label). By selecting a random sampling of features every split, if such a feature were to exist, it wouldn't show up on as many branches of the tree and there would be more diversity of the features examined.

Since each tree within the forest is only trained using a subset of the overall training set, the forest ensemble has the ability to error test itself. It does this by scoring each tree's predictions against that tree's out-of-bag samples. A tree's out-of-bag samples are those forest training samples that were withheld from a specific tree during training. There's nothing unique about splitting your data between training and testing sets except that you have an independent set of unseen samples to validate the accuracy of your training. Part of the random forest algorithm is the creation of independent sets for the training of each tree, so an overall out-of-bag error metric can be calculated for the forest ensemble. This error value is defined as the mean prediction error for each training sample using only those trees that didn't have the sample in their bootstrap.
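
The in-bag / out-of-bag split for a single tree can be sketched with plain NumPy. The sample count and seed below are arbitrary assumptions, and `np.random.default_rng` assumes NumPy 1.17 or later; this mimics the idea rather than SciKit-Learn's internal code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1000
indices = np.arange(n_samples)

# Bootstrap: draw n_samples indices *with replacement* for one tree
in_bag = rng.choice(indices, size=n_samples, replace=True)

# Samples never drawn for this tree are its out-of-bag samples
out_of_bag = np.setdiff1d(indices, in_bag)

in_bag_fraction = len(np.unique(in_bag)) / n_samples
oob_fraction = len(out_of_bag) / n_samples
print('unique in-bag fraction:', in_bag_fraction)
print('out-of-bag fraction:', oob_fraction)
```

For large datasets roughly a third of the samples end up out-of-bag for any given tree, which is what makes the out-of-bag error estimate possible.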


One of the advantages of using the out-of-bag error is that it eliminates the need for you to split your data into training / testing sets before feeding it into the forest model, since that's part of the forest algorithm. However, using the out-of-bag error metric often underestimates the actual performance improvement and the optimal number of training iterations. Due to this and a few other reasons we'll discuss in the next module, SciKit-Learn recommends you maintain separate training and testing sets.

Random forest is an ensemble or meta-estimator: one that combines many versions of another estimator together. Perhaps you'll be able to get to a higher level of accuracy. So the estimator that's being combined, of course, is the decision tree estimator. **Bootstrapping is a process which is also known as bagging. And it basically means that every tree within the forest will be grown independently by drawing on a bootstrap, or a subset of the input data**. So of your training data, the samples that a specific tree sees will be considered in-bag, and the samples that a specific tree doesn't see are considered out-of-bag for that one tree.

```python
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=30, max_depth=10, oob_score=True, random_state=0)
```
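
To round this off, here is a hedged usage sketch of the same configuration fitted on the iris dataset (the dataset and `n_jobs=-1`, which asks for all available cores, are assumptions added for illustration). After fitting, `oob_score_` holds the accuracy estimated from the out-of-bag samples.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
model = RandomForestClassifier(n_estimators=30, max_depth=10, oob_score=True,
                               n_jobs=-1, random_state=0)
model.fit(iris.data, iris.target)

print(len(model.estimators_))  # number of trees that were grown
print(model.oob_score_)        # accuracy estimated from out-of-bag samples only
```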