### Before you choose a classifier, ask yourself the following questions:

- How large is the training dataset?
- How many dimensions do the features have?
- Is the problem linearly separable?
- Are the features independent?

Then follow the Occam's Razor principle: use the least complicated algorithm that can address your problem.

# 1. Logistic Regression

### 1.1. Pros and cons

Advantages:

- A pretty well-behaved classification algorithm that can be trained as long as you expect your features to be roughly linear and the problem to be linearly separable. You can do some feature engineering to turn most non-linear features into linear ones pretty easily.
- Robust to noise; you can avoid overfitting with l2 regularization and even perform feature selection with l1 regularization.
- Easy to update with new data, efficient, and usable on large datasets.
- The output can be interpreted as a probability.
- Lots of ways to regularize your model, and you don't have to worry much about your features being correlated.

Disadvantages:

- It cannot use categorical features directly; they must first be encoded as numbers (e.g. as binary indicator columns).
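
The regularization and probability points above can be sketched with scikit-learn's LogisticRegression. This is a minimal illustration, not a recipe: the synthetic dataset, `C=0.1` and the `liblinear` solver are assumptions chosen only to make the effect of l1 visible.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 20 features, only 4 of them informative (illustrative choice).
X, y = make_classification(n_samples=500, n_features=20, n_informative=4,
                           n_redundant=0, random_state=0)

# l2 shrinks coefficients; l1 can drive some of them exactly to zero.
l2_model = LogisticRegression(penalty="l2", C=0.1, solver="liblinear").fit(X, y)
l1_model = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X, y)

n_zero_l2 = int(np.sum(l2_model.coef_ == 0))
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
print("zero coefficients  l2:", n_zero_l2, " l1:", n_zero_l1)

# Each row of predict_proba holds class probabilities that sum to 1.
proba = l1_model.predict_proba(X[:5])
print(proba)
```

Features whose l1 coefficient is exactly zero are effectively dropped from the model, which is the sense in which l1 performs feature selection.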

# 2. Support Vector Machine

### 2.1. Pros and cons

Advantages:

- Effective in high dimensional spaces, especially popular in text classification problems where very high-dimensional spaces are the norm
- Robust to noise because they maximize margins
- Versatile: can model complex, nonlinear relationships because different kernel functions are available

Disadvantages:

- If the number of features is much greater than the number of samples, the methods might perform poorly
- SVMs do not directly provide probability estimates and are hard to interpret
- Memory-intensive
- Hard to run and tune

### 2.2. Classes

SVC and NuSVC:
  - are similar methods but accept slightly different sets of parameters.
  - implement the "one-against-one" approach for multi-class classification: n_class * (n_class - 1) / 2 classifiers are constructed.

LinearSVC:
  - is another implementation of support vector classification for the case of a linear kernel; it does not accept the keyword kernel.
  - implements the "one-vs-the-rest" approach: n_class models are trained.
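
The difference between the two multi-class strategies shows up in the shape of the decision function. The sketch below assumes a made-up four-class blob dataset; `dual=False` is set only to keep the LinearSVC call explicit across scikit-learn versions.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC, LinearSVC

# Four classes, so one-against-one needs 4 * 3 / 2 = 6 binary classifiers.
X, y = make_blobs(n_samples=200, centers=4, random_state=0)

ovo = SVC(kernel="linear", decision_function_shape="ovo").fit(X, y)
ovr = LinearSVC(dual=False).fit(X, y)

ovo_shape = ovo.decision_function(X).shape   # one column per pair of classes
ovr_shape = ovr.decision_function(X).shape   # one column per class
print(ovo_shape, ovr_shape)
```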

### 2.3. Scores and probabilities

- The probabilities are calibrated using Platt scaling, which is an expensive operation for large datasets.
- It is advisable to set probability=False and use decision_function instead of predict_proba.
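
As a rough sketch of working with scores instead of probabilities: in the binary case the sign of decision_function determines the predicted class. The dataset is synthetic, and `probability` is simply left at its default of False.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# probability is left at its default (False), so no Platt scaling is run.
clf = SVC(kernel="rbf").fit(X, y)

# Signed score per sample; a positive score maps to clf.classes_[1].
scores = clf.decision_function(X)
pred_from_scores = clf.classes_[(scores > 0).astype(int)]
print(scores[:5])
print(pred_from_scores[:5])
```

The magnitude of a score indicates how far a sample lies from the decision boundary, which is often enough for ranking or thresholding.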

### 2.4. Parameters

- In problems where it is desired to give more importance to certain classes or certain individual samples, the keywords class_weight and sample_weight can be used.

- C and gamma. The parameter C, common to all SVM kernels, trades off misclassification of training examples against simplicity of the decision surface. A low C makes the decision surface smooth, while a high C aims at classifying all training examples correctly. gamma defines how much influence a single training example has. The larger gamma is, the closer other examples must be to be affected.

### 2.5. Complexity

The core of an SVM is a quadratic programming problem (QP), so the compute and storage requirements increase rapidly with the number of training vectors.

Also note that for the linear case, the algorithm used in LinearSVC by the liblinear implementation is much more efficient than its libsvm-based SVC counterpart and can scale almost linearly to millions of samples and/or features.

### 2.6. Tips on practical use

- Kernel cache size: for SVC and NuSVC, the kernel cache has a strong impact on run times for larger problems. If you have enough RAM, it is recommended to set cache_size to a larger value.
- Setting C: C is 1 by default. If you have a lot of noisy observations, you should decrease it, which corresponds to more regularization.
- In SVC, if data for classification are unbalanced (e.g. many positive and few negative), set class_weight='balanced' and/or try different penalty parameters C.
- Support Vector Machine algorithms are not scale invariant, so it is highly recommended to scale your data. For example, scale each attribute on the input vector X to [0,1] or [-1,+1], or standardize it to have mean 0 and variance 1. 
- Use sklearn.model_selection.GridSearchCV with C and gamma spaced exponentially far apart to choose good values.
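
The scaling and grid-search tips can be combined in one pipeline. This is a sketch under assumptions: the iris dataset, the grid ranges, `cv=3` and `cache_size=500` are illustrative choices, not recommendations from the notes above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Scaling lives inside the pipeline so it is refit on each training fold.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf", cache_size=500))

# Exponentially spaced candidate values (the ranges are illustrative).
param_grid = {
    "svc__C": np.logspace(-2, 2, 5),
    "svc__gamma": np.logspace(-3, 1, 5),
}
search = GridSearchCV(pipe, param_grid, cv=3).fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Keeping the scaler inside the pipeline avoids leaking statistics from the validation folds into training.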

### 2.7. Mathematical formulation

A support vector machine constructs a hyper-plane or set of hyper-planes in a high or infinite dimensional space, which can be used for classification, regression or other tasks. Intuitively, a good separation is achieved by the hyper-plane that has the largest distance to the nearest training data points of any class (so-called functional margin), since in general the larger the margin the lower the generalization error of the classifier.
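
To make the margin idea concrete, the sketch below fits a linear SVC on a tiny hand-made separable dataset and computes the width between the two margin lines as 2 / ||w||. The data points and the very large `C` (used to approximate a hard margin) are assumptions made for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Tiny, linearly separable toy set (made up for illustration).
X = np.array([[0, 0], [1, 0], [0, 1], [3, 3], [4, 3], [3, 4]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard-margin SVM on separable data.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]
margin_width = 2.0 / np.linalg.norm(w)   # distance between the two margin lines
print("support vectors:\n", clf.support_vectors_)
print("margin width:", margin_width)
```

Only the points nearest to the other class end up as support vectors; moving any other point slightly would leave the hyper-plane unchanged.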

# 3. Nearest Neighbors

### 3.1. Pros and cons

Advantages:

- No training involved ("lazy")
- Naturally handles multiclass classification and regression


Disadvantages:

- For high-dimensional parameter spaces, this method becomes less effective due to the so-called “curse of dimensionality”.
- Expensive and slow to predict new instances
- Must define a meaningful distance function

### 3.2. Classes

- KNeighborsClassifier: implements learning based on the k nearest neighbors of each query point, where k is an integer value specified by the user.
- RadiusNeighborsClassifier: implements learning based on the number of neighbors within a fixed radius r of each training point, where r is a floating-point value specified by the user. In cases where the data is not uniformly sampled, radius-based neighbors classification can be a better choice.

### 3.3. Parameters

- Algorithms:
  - Brute force: computes the distances between all pairs of points in the dataset. For N samples in D dimensions, this approach scales as O[D N^2]. As the number of samples N grows, the brute-force approach quickly becomes infeasible.
  - K-D tree: the computational cost of a nearest neighbors search is reduced to O[D N log(N)] or better. Though the K-D tree approach is very fast for low-dimensional (D < 20) neighbors searches, it becomes inefficient as D grows very large: this is one manifestation of the so-called “curse of dimensionality”.
  - Ball tree: 
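
The search algorithm is selected through the `algorithm` parameter of KNeighborsClassifier. The sketch below, on made-up random data, checks that the three exact strategies return the same predictions; they differ in speed, not in result.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))              # low-dimensional random points
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # arbitrary labelling rule
X_query = rng.normal(size=(50, 3))

predictions = {}
for algorithm in ("brute", "kd_tree", "ball_tree"):
    knn = KNeighborsClassifier(n_neighbors=5, algorithm=algorithm).fit(X, y)
    predictions[algorithm] = knn.predict(X_query)

# The search structure changes the speed, not the neighbors that are found.
same = (np.array_equal(predictions["brute"], predictions["kd_tree"])
        and np.array_equal(predictions["brute"], predictions["ball_tree"]))
print("identical predictions:", same)
```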


### 3.4. Mathematical formulation

Neighbors-based classification is a type of instance-based learning or non-generalizing learning: it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the nearest neighbors of each point.
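
The majority vote can be reproduced by hand from the neighbor indices. This sketch assumes random two-class data and k = 3 (odd, so a two-class vote cannot tie).

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)
X_query = rng.normal(size=(20, 2))

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Indices of the 3 stored training points closest to each query point.
_, neighbor_idx = knn.kneighbors(X_query)

# Majority vote over the neighbors' labels: class 1 wins with 2 or more votes.
manual_pred = (y[neighbor_idx].sum(axis=1) >= 2).astype(int)
print(np.array_equal(manual_pred, knn.predict(X_query)))
```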

# 4. Decision tree classification

### 4.1. Pros and cons

Advantages:

- Simple to understand and to interpret
- Requires little data preparation. Other techniques often require data normalisation, dummy variables need to be created and blank values to be removed. Note however that this module does not support missing values.
- The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points

Disadvantages:

- Decision-tree learners can create over-complex trees that do not generalise the data well. This is called overfitting. Setting the minimum number of samples required at a leaf node or the maximum depth of the tree is necessary to avoid this problem.
- Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This problem is mitigated by using decision trees within an ensemble.
- There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems.
- Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset prior to fitting with the decision tree.

### 4.2. Parameters

- criterion:
    - gini: default
    - entropy: information gain

- max_depth:

- min_samples_split: default is 2
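
A minimal sketch of how these parameters limit tree growth, assuming a synthetic dataset with 10% label noise. The values `max_depth=3` and `min_samples_leaf=10` are illustrative, not tuned.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y adds label noise, which an unconstrained tree will memorise.
X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(criterion="gini", random_state=0)
full.fit(X_train, y_train)
small = DecisionTreeClassifier(criterion="entropy", max_depth=3,
                               min_samples_leaf=10, random_state=0)
small.fit(X_train, y_train)

for name, tree in (("unconstrained", full), ("constrained", small)):
    print(name, "depth:", tree.get_depth(),
          "train:", round(tree.score(X_train, y_train), 3),
          "test:", round(tree.score(X_test, y_test), 3))
```

The unconstrained tree fits the training set perfectly, noise included; comparing the train and test scores of both trees shows how much of that fit generalises.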

# 5. Stochastic Gradient Descent

### 5.1. Pros and cons

Advantages:
- Efficiency: if X is a matrix of size (n, p), training has a cost of O(k n p_hat), where k is the number of iterations and p_hat is the average number of non-zero attributes per sample. It is particularly useful when the number of samples is very large.

### 5.2. Tips on practical use

- Stochastic Gradient Descent is sensitive to feature scaling, so it is highly recommended to scale your data. If your attributes have an intrinsic scale (e.g. word frequencies or indicator features) scaling is not needed. 


### 5.3. Parameters

- loss: 
    - hinge: default
    - log: logistic regression, a probabilistic classifier suited to large datasets (named log_loss in recent scikit-learn releases)
    - modified_huber: a smooth loss that brings tolerance to outliers
    - squared_hinge: like hinge but quadratically penalized
    - perceptron: a linear loss used by the perceptron algorithm
    
- penalty:
    - l2: default, the standard regularizer for linear SVM models
    - l1: brings sparsity to the model that is not achievable with l2
    - elasticnet: a combination of l1 and l2; it might bring sparsity to the model that is not achievable with l2
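
The loss and penalty options can be tried on a scaled pipeline as sketched below. The synthetic dataset, `max_iter`, and `tol` are assumptions for illustration; hinge loss is used because its name is the same across scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           n_clusters_per_class=1, class_sep=2.0,
                           random_state=0)

accuracies = {}
# hinge + l2 gives a linear SVM trained by SGD; elasticnet mixes l1 and l2.
for penalty in ("l2", "l1", "elasticnet"):
    model = make_pipeline(
        StandardScaler(),   # SGD is sensitive to feature scale
        SGDClassifier(loss="hinge", penalty=penalty, max_iter=1000,
                      tol=1e-3, random_state=0),
    )
    accuracies[penalty] = model.fit(X, y).score(X, y)

for penalty, acc in accuracies.items():
    print(penalty, "train accuracy:", round(acc, 3))
```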

### 5.4. Mathematical formulation

SGD is a simple yet very efficient approach to discriminative learning of linear classifiers under convex loss functions such as (linear) Support Vector Machines and Logistic Regression. 

# 6. Gaussian Naive Bayes (GaussianNB)

### 6.1. Pros and cons

Advantages:

- Super simple
- If the NB conditional independence assumption actually holds, a Naive Bayes classifier will converge quicker than discriminative models like logistic regression, so you need less training data

Disadvantages:

- It can't learn interactions between features.
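
Part of why the model is so simple: fitting GaussianNB amounts to computing per-class statistics for each feature independently. The sketch below uses the iris dataset as an assumed example and checks that the fitted `theta_` attribute matches the class means.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
nb = GaussianNB().fit(X, y)

# Training stores per-class, per-feature statistics: theta_ holds the means.
class0_mean = X[y == 0].mean(axis=0)
print(nb.theta_[0], class0_mean)

# Posterior class probabilities for the first three samples.
print(nb.predict_proba(X[:3]).round(3))
```

Because each feature is modelled on its own within a class, nothing in the fitted parameters describes how two features vary together.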

# 7. Ensemble methods

### 7.1. Two families of ensemble methods

- **Averaging methods:** build several estimators independently and then average their predictions.
    - Bagging
    - Forests of randomized trees
    
- **Boosting methods:** base estimators are built sequentially, and each one tries to reduce the bias of the combined estimator.
    - Adaboost
    - Gradient tree boosting
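
All four methods listed above are available in sklearn.ensemble and share the same estimator interface. The comparison below is a sketch on an assumed synthetic dataset; the estimator counts are arbitrary, and the scores say nothing general about which family is better.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_clusters_per_class=1, class_sep=2.0,
                           random_state=0)

models = {
    # Averaging methods: estimators are built independently.
    "bagging": BaggingClassifier(n_estimators=20, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    # Boosting methods: estimators are built sequentially.
    "adaboost": AdaBoostClassifier(n_estimators=50, random_state=0),
    "gradient boosting": GradientBoostingClassifier(n_estimators=50,
                                                    random_state=0),
}

scores = {name: cross_val_score(model, X, y, cv=3).mean()
          for name, model in models.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```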

### 7.2. Ensemble Methods in Machine Learning - Thomas G. Dietterich
[Thomas G. Dietterich](http://www.cs.orst.edu/~tgd)

In low-noise cases, AdaBoost gives good performance because it is able to optimize the ensemble without overfitting. However, in high-noise cases, AdaBoost puts a large amount of weight on the mislabeled examples, and this leads it to overfit very badly. Bagging and Randomization do well in both the noisy and noise-free cases because they focus on the statistical problem, and noise increases this statistical problem.

In very large datasets, Randomization can be expected to do better than Bagging because bootstrap replicates of a large training set are very similar to the training set itself, and hence the learned decision trees will not be very diverse. Randomization creates diversity under all conditions, but at the risk of generating low-quality decision trees.
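
One way to probe the noise claim yourself is to flip a fraction of the training labels and compare the two methods on clean test labels. This is only a sketch and not a reproduction of Dietterich's experiments: the dataset is synthetic, the 20% noise rate is arbitrary, `flip_labels` is a hypothetical helper written for this example, and scikit-learn's default base learners differ from those used in the paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def flip_labels(labels, rate, seed=0):
    """Return a copy of binary labels with roughly a fraction `rate` flipped."""
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    flip = rng.random(len(labels)) < rate
    noisy[flip] = 1 - noisy[flip]
    return noisy

results = {}
for rate in (0.0, 0.2):
    y_noisy = flip_labels(y_train, rate)
    for name, model in (("adaboost", AdaBoostClassifier(random_state=0)),
                        ("bagging", BaggingClassifier(random_state=0))):
        # Train on (possibly) noisy labels, evaluate on clean test labels.
        results[(name, rate)] = model.fit(X_train, y_noisy).score(X_test, y_test)

for key, acc in results.items():
    print(key, round(acc, 3))
```

Results on a single small dataset vary with the random seed, so several repetitions would be needed before drawing any conclusion.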