# Other classification methods


## Review

Regression and classification are two classes of supervised machine learning problems.

We know how regression and classification are similar and how they differ.

They are similar in that the target ($y$) is known for the training data.

They differ in the type of target: for regression $y$ is continuous; for classification $y$ is discrete.


We learned about linear regression (a regression method) and decision trees (a classification method).

We learned about validation metrics ($R^2$ for regression; precision and recall for classification).
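As a reminder of what these metrics measure, here is a minimal sketch in plain Python. The function names `precision_recall` and `r_squared` and the toy labels are made up for illustration; in practice a library implementation would normally be used.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall, treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    # Precision: of everything predicted positive, how much was right?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything actually positive, how much was found?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy data (hypothetical): 2 true positives, 1 false positive, 1 false negative
precision, recall = precision_recall([1, 0, 1, 1, 0, 0], [1, 1, 1, 0, 0, 0])
```

A perfect regression fit gives `r_squared` equal to 1, while always predicting the mean of the targets gives 0.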

We learned about cross-validation and the importance of training and testing on different data. We learned about validating (training/testing) across multiple train/test splits of the data.



## Other classification methods

We will learn about logistic regression and support vector machines. We'll try to understand the fundamental concepts behind these methods and how to use them.


### Logistic Regression

![logistic regression vs linear regression](./linear_vs_logistic_regression.png)


In linear regression, we find $y = \alpha \cdot x + \beta$ to fit the data.

In logistic regression, we find $y = 1 / (1 + e^{-(\alpha \cdot x + \beta)})$ to fit the data. Here $y$ lies between 0 and 1 and is read as the probability of the positive class.
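To see how the logistic curve turns a linear score into a class, here is a minimal sketch. It assumes a single feature and an intercept `beta` as in the linear model; the coefficient values are hypothetical, not fitted to any data, and the 0.5 threshold is the usual default rather than a requirement.

```python
import math


def logistic(x, alpha, beta):
    """Logistic function applied to the linear score alpha * x + beta."""
    return 1.0 / (1.0 + math.exp(-(alpha * x + beta)))


alpha, beta = 1.5, -3.0  # hypothetical coefficients, chosen for illustration

for x in (0.0, 2.0, 4.0):
    p = logistic(x, alpha, beta)   # value between 0 and 1
    label = 1 if p >= 0.5 else 0   # threshold at 0.5 to get a class
    print(x, round(p, 3), label)
```

The output crosses 0.5 exactly where the linear score $\alpha \cdot x + \beta$ is zero, so the decision boundary is still linear in $x$.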



### Support Vector Machine

![SVM](./svm.png)

We try to find a linear function that best separates the data into two classes, that is, with the largest margin between them. The function is $f(x) = w \cdot x + b$. The predicted class is the sign of $f(x)$: $+1$ on one side of the separating line and $-1$ on the other.
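The prediction step of a linear SVM can be sketched in a few lines. This only shows how a given $w$ and $b$ classify points; it does not show how training finds them. The weights and sample points below are hypothetical.

```python
def decision_function(x, w, b):
    """Linear score f(x) = w . x + b for one sample x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(x, w, b):
    """Class label: +1 if the score is non-negative, otherwise -1."""
    return 1 if decision_function(x, w, b) >= 0 else -1


w, b = [1.0, -1.0], 0.5  # hypothetical weights, not fitted to any data
points = [(2.0, 0.0), (0.0, 2.0), (1.0, 1.0)]
labels = [predict(p, w, b) for p in points]
```

Points with a large positive or negative score lie far from the separating line; points with a score near zero lie close to it.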



In [1]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

Compare the performance of these three methods on predicting (A) iris species and (B) admission to UCLA grad school.

To compare the performance rigorously, you should cross-validate. Use stratified k-fold cross-validation (k=10). The performance measures should be precision and recall.
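To make "stratified" concrete, here is a minimal sketch of one way to build folds that keep the class proportions roughly equal. The helper `stratified_folds` is hypothetical and uses simple round-robin assignment; it illustrates the idea and is not scikit-learn's actual algorithm.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign each sample index to one of k folds, class by class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal the samples of each class out to the folds in turn,
        # so every fold gets about the same share of that class.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


labels = ['a'] * 6 + ['b'] * 3   # toy data: twice as many 'a' as 'b'
folds = stratified_folds(labels, 3)
```

Each fold is used once as the test set while the remaining folds form the training set. With the toy labels above, every fold holds two `'a'` samples and one `'b'` sample, matching the 2:1 ratio of the full dataset.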

In [2]:
# 1.  Get the data
import pandas
iris = pandas.read_csv('~/Dropbox/datasets/iris.csv')
admit = pandas.read_csv('~/Dropbox/datasets/admission.csv')



In [3]:

# 2.  Select features
X1 = iris.drop('Species', axis=1)
y1 = iris.Species

X2 = admit.drop('admit', axis=1)
y2 = admit.admit


In [4]:

# 3.  Create the models

models = [ DecisionTreeClassifier(), SVC(), LogisticRegression() ]


In [5]:

# 4.  Cross validate
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import StratifiedKFold

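def cross_validate(model, X, y, k=10):
    """Return mean weighted precision and recall over k stratified folds."""
    cv = StratifiedKFold(n_splits=k)
    ps, rs = [], []
    for train_idx, test_idx in cv.split(X, y):
        # split() yields positional indices, so select rows with .iloc
        X_train = X.iloc[train_idx]
        X_test = X.iloc[test_idx]
        y_train = y.iloc[train_idx]
        y_test = y.iloc[test_idx]
        # Fit on the training folds, then predict on the held-out fold
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        ps.append( precision_score(y_test, y_pred, average='weighted') )
        rs.append( recall_score(y_test, y_pred, average='weighted') )
    # Average the per-fold scores
    return sum(ps)/len(ps), sum(rs)/len(rs)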


In [7]:
# Evaluate each model on the admission data (X2, y2)
for model in models:
    print(model)
    print('Precision, recall: ', cross_validate(model, X2, y2, 10))

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
Precision, recall:  (0.6349062392247146, 0.6348624140087555)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
Precision, recall:  (0.6657640441457836, 0.6874421513445903)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
Precision, recall:  (0.669807542722384