# Classification

Classification is another supervised learning problem. The difference between regression and classification lies in the labels $y$. If $y$ can take any value in a continuous range, then it is a regression problem. If $y$ only takes discrete values (class labels), then it is a classification problem.


# Classification example

We can use the `make_classification` function to generate synthetic data samples.

    sklearn.datasets.make_classification(n_samples=100, n_features=20, *, n_informative=2, n_redundant=2, n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=None, flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None)

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
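The notebook's own call is not shown, so here is a minimal sketch of how the two imports might be used together. All parameter values (`n_samples=1000`, `n_informative=4`, `n_classes=4`, `test_size=0.2`, the random seeds) are assumptions chosen for illustration; four classes were picked only because the probability outputs later in this section have four columns.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Assumed settings: the values used in the original notebook are not shown.
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=4,         # must satisfy n_classes * n_clusters_per_class <= 2**n_informative
    n_redundant=2,
    n_classes=4,
    n_clusters_per_class=1,
    random_state=0,
)

# Hold out 20% of the samples for testing, keeping class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the synthetic data and the split reproducible, which is convenient when comparing several models on the same samples.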

## K-nearest neighbors

#### Algorithm explanation:

An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.
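To make the plurality vote concrete, here is a small sketch on a made-up one-dimensional dataset (the points, labels, and query value are all hypothetical). It shows how the prediction for the same query can change between k = 1 and k = 3.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data invented for illustration: one feature, two classes.
X_toy = np.array([[1.0], [2.0], [3.0], [5.0], [9.0], [10.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

query = np.array([[4.2]])

# k = 1: only the single nearest point (x = 5.0, class 1) decides.
knn_1 = KNeighborsClassifier(n_neighbors=1).fit(X_toy, y_toy)

# k = 3: the nearest points are x = 5.0, 3.0, 2.0 with classes 1, 0, 0.
knn_3 = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)

pred_1 = knn_1.predict(query)[0]
pred_3 = knn_3.predict(query)[0]
proba_3 = knn_3.predict_proba(query)

print("k=1 prediction:", pred_1)
print("k=3 prediction:", pred_3)
print("k=3 vote fractions:", proba_3)
```

With the default uniform weights, `predict_proba` reports the fraction of the k neighbors that belong to each class, so it is a direct view of the vote.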

#### Regression:
The k-nearest neighbors algorithm can also be used for regression problems. In k-NN regression, the output is a property value for the object. This value is the average of the values of its k nearest neighbors. If k = 1, then the output is simply the value of that single nearest neighbor.
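The averaging rule can be checked on a tiny example. The sketch below uses scikit-learn's `KNeighborsRegressor` on made-up data (the points, targets, and query are hypothetical).

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy data invented for illustration.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0.0, 10.0, 20.0, 30.0])

query = np.array([[1.2]])

# k = 1: the nearest point is x = 1.0, so the output is its value, 10.0.
reg_1 = KNeighborsRegressor(n_neighbors=1).fit(X_toy, y_toy)

# k = 2: the nearest points are x = 1.0 and x = 2.0, so the output is (10 + 20) / 2.
reg_2 = KNeighborsRegressor(n_neighbors=2).fit(X_toy, y_toy)

out_1 = reg_1.predict(query)[0]
out_2 = reg_2.predict(query)[0]

print("k=1 output:", out_1)
print("k=2 output:", out_2)
```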

#### Documentation:
Python documentation: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

    from sklearn.neighbors import KNeighborsClassifier


Output:

    Test accuracy is 0.74
    [[0.625 0.125 0.125 0.125]
     [0.75  0.25  0.    0.   ]
     [0.    0.    1.    0.   ]
     ...
     [0.125 0.5   0.25  0.125]
     [0.625 0.    0.    0.375]
     [0.125 0.625 0.    0.25 ]]


You can use cross-validation to select the best number of neighbors.
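One common way to do this in scikit-learn is `GridSearchCV`. The sketch below is only an illustration: the dataset settings, the candidate values of k, and the 5-fold split are all assumptions, not the notebook's own choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumed synthetic data; the notebook's own settings are not shown.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=4,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Candidate numbers of neighbors (hypothetical grid).
param_grid = {"n_neighbors": [1, 3, 5, 8, 11, 15]}

# 5-fold cross-validation on the training set only.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best_k = search.best_params_["n_neighbors"]
print("Best k:", best_k)
print("Cross-validated accuracy:", search.best_score_)

# The test set is used once, after k has been chosen.
print("Test accuracy is", search.score(X_test, y_test))
```

Selecting k on cross-validation folds rather than on the test set keeps the final test accuracy an honest estimate.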

## Logistic Regression

See Wikipedia for an explanation of the model: https://en.wikipedia.org/wiki/Logistic_regression

**Logistic regression is designed for classification problems only. In other words, you cannot use this model for regression problems.**
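As a rough illustration, the sketch below fits `LogisticRegression` on assumed synthetic data (all dataset settings and `max_iter=1000` are illustrative choices), prints class probabilities, and then shows that scikit-learn refuses a continuous target.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumed synthetic data; the notebook's own settings are not shown.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=4,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# train the model
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)
print("Test accuracy is", log_reg.score(X_test, y_test))

# One row per test sample, one column per class; each row sums to 1.
proba = log_reg.predict_proba(X_test)
print(proba[:3])

# A continuous target (here simply the first feature column) is rejected.
y_continuous = X_train[:, 0]
try:
    LogisticRegression().fit(X_train, y_continuous)
    rejected = False
except ValueError as err:
    rejected = True
    print("Rejected:", err)
```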

Python documentation: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

    # train the model
    from sklearn.linear_model import LogisticRegression


Output:

    Test accuracy is 0.59
    [[0.56281303 0.04135957 0.07570028 0.32012711]
     [0.20904043 0.51212522 0.15466231 0.12417203]
     [0.67553186 0.12558172 0.16790161 0.03098482]
     ...
     [0.22122758 0.59547218 0.11085617 0.07244407]
     [0.79096746 0.02564928 0.00510108 0.17828218]
     [0.07873112 0.75625907 0.1178367  0.04717311]]


## Tree-based model

A decision tree looks like this: ![](https://scikit-learn.org/stable/_images/sphx_glr_plot_iris_dtc_002.png)


A decision tree is a flowchart-like structure in which each internal node represents a "test" on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the test, and each leaf node represents a class label (decision taken after computing all attributes). The paths from root to leaf represent classification rules.
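To see the "paths from root to leaf" as explicit rules, you can print a fitted tree as text. The sketch below uses the iris dataset (the same data as in the picture above) with `max_depth=2`, an assumed value chosen only to keep the printout short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree (assumed depth limit) so that the rules stay readable.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each indented line is a test on one attribute; each "class:" line is a leaf.
rules = export_text(tree, feature_names=iris.feature_names)
print(rules)
```

Reading the printout from the outermost test down to a `class:` line gives one classification rule.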

Python documentation: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

    from sklearn.tree import DecisionTreeClassifier


Output:

    Test accuracy is 0.76
    [[0.65454545 0.         0.23636364 0.10909091]
     [0.75       0.         0.         0.25      ]
     [0.         0.         1.         0.        ]
     ...
     [0.03688525 0.91803279 0.03278689 0.01229508]
     [0.06976744 0.06976744 0.         0.86046512]
     [0.03688525 0.91803279 0.03278689 0.01229508]]


## Ensemble Learning

To better understand the random forest model, we should first look at **Ensemble Learning**, which is an important technique in machine learning.

#### Motivation:

Suppose that you ask a complex question to thousands of random people, then aggregate their answers. In many cases you will find that this aggregated answer is better than an expert's answer. This is called the *wisdom of the crowd*. Similarly, if you aggregate the predictions of a group of predictors (such as classifiers or regressors), you will often get better predictions than with the best individual predictor. A group of predictors is called an ensemble; thus this technique is called Ensemble Learning.

#### When do we use an ensemble model?

Usually, an ensemble model works better than a single model, but there is no guarantee. In my opinion, since the No Free Lunch Theorem already forces us to try several different single models, it does not hurt to ensemble all the single models you have tried and look at the performance on the test dataset.

#### Example:

Let's ensemble the K-nearest neighbors, logistic regression, and decision tree models together.

Python documentation: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html

    from sklearn.ensemble import VotingClassifier


Output:

    KNeighborsClassifier 0.7235
    LogisticRegression 0.588
    DecisionTreeClassifier 0.71
    VotingClassifier 0.741
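The cell that produced this comparison is not shown. A minimal sketch of how such a comparison could be written is given below; the dataset settings, the hyperparameters (`n_neighbors=8`, `max_depth=5`, `max_iter=1000`), and the choice of hard voting are all assumptions, so the numbers it prints will differ from the ones above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Assumed synthetic data; the notebook's own settings are not shown.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=4,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Three single models with assumed hyperparameters.
knn_clf = KNeighborsClassifier(n_neighbors=8)
log_clf = LogisticRegression(max_iter=1000)
tree_clf = DecisionTreeClassifier(max_depth=5, random_state=0)

# Hard voting: each model casts one vote and the most common class wins.
voting_clf = VotingClassifier(
    estimators=[("knn", knn_clf), ("lr", log_clf), ("tree", tree_clf)],
    voting="hard",
)

# Fit every model on the same split and record its test accuracy.
scores = {}
for clf in (knn_clf, log_clf, tree_clf, voting_clf):
    clf.fit(X_train, y_train)
    name = clf.__class__.__name__
    scores[name] = clf.score(X_test, y_test)
    print(name, scores[name])
```

`VotingClassifier` fits its own copies of the three estimators, so fitting the single models separately in the loop does not interfere with the ensemble.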


## Random Forest

In short, random forest is an ensemble of decision trees. 

A group of decision tree classifiers is trained, each on a different random subset of the training set. Then you use a majority vote (the ensemble technique described above) to obtain the prediction.
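The sketch below fits a `RandomForestClassifier` on assumed synthetic data (the dataset settings and `n_estimators=100` are illustrative choices) and inspects the individual trees inside it. Note that scikit-learn's implementation combines the trees by averaging their predicted class probabilities rather than by counting one hard vote per tree; the last lines check this.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumed synthetic data; the notebook's own settings are not shown.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=4,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Each of the 100 trees is trained on a bootstrap sample of the training set.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print("Number of trees:", len(forest.estimators_))
print("Test accuracy is", forest.score(X_test, y_test))

# The forest's probabilities are the mean of the individual trees' probabilities.
tree_mean = np.mean([t.predict_proba(X_test) for t in forest.estimators_], axis=0)
forest_proba = forest.predict_proba(X_test)
print("Matches tree average:", np.allclose(forest_proba, tree_mean))
```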


Python documentation: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html


    from sklearn.ensemble import RandomForestClassifier


Output:

    Test accuracy is 0.8245
