All materials are summarized from:

1. [scikit-learn.org](http://scikit-learn.org/)
2. Machine Learning in Action

### Before you choose a classifier, please ask yourself the following questions:

- How large is the training dataset?
- How many dimensions do the features have?
- Is the problem linearly separable?
- Are the features independent?

Then follow the Occam's Razor principle: use the least complicated algorithm that can address your problem.

# 1. Logistic Regression - optimization algorithm

### 1.1. Pros and cons

Advantages:

- A pretty well-behaved classification algorithm that can be trained as long as you expect your features to be roughly linear and the problem to be linearly separable. With some feature engineering, most non-linear features can be turned into linear ones fairly easily.
- Robust to noise; you can avoid overfitting and even do feature selection by using l2 or l1 regularization.
- Computationally inexpensive, easy to update, and efficient enough to be used on large datasets.
- Output can be interpreted as a probability.
- Lots of ways to regularize your model, and you don't have to worry as much about your features being correlated as you do with Naive Bayes.

Disadvantages:

- It does not handle categorical features directly; they must first be encoded numerically (e.g. as binary dummy variables).
- Prone to underfitting, may have low accuracy

### 1.2. Step function - sigmoid

$$\sigma(z) = 1 /(1 + e ^{-z})$$
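
As a minimal sketch of the formula above, the sigmoid can be written with the standard library alone. The branch for negative `z` is an added numerical safeguard, not part of the formula itself.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid: maps any real z into the interval (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form e^z / (1 + e^z)
    # so that exp() is never called with a large positive argument.
    ez = math.exp(z)
    return ez / (1.0 + ez)

for z in (-5.0, 0.0, 5.0):
    print(z, round(sigmoid(z), 4))
```

Values far below 0 map close to 0, values far above 0 map close to 1, and `sigmoid(0)` is exactly 0.5, which is why the function behaves like a smooth step.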


### 1.3. Optimize the best weights (regression coefficients) - gradient descent

z = w0 * x0 + w1 * x1 + w2 * x2 + ... + wn * xn      
In vector notation: $z = W^T X$


Pseudocode for gradient descent:

    Start with the weights all set to 1
    Repeat R times:
        calculate the gradient over the entire dataset
        update the weights vector by alpha * gradient
    Return the weights vector

Gradient descent can be simplified with stochastic gradient descent, which updates the weights using one example at a time.
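
The batch procedure can be sketched in plain Python as follows. This is an illustration under assumptions, not a reference implementation: the function name `fit_logistic_gd`, the learning rate, the iteration count, and the toy data are all hypothetical, and the gradient is averaged over the dataset to keep the step size stable.

```python
import math

def sigmoid(z):
    z = max(-60.0, min(60.0, z))        # clamp so exp() cannot overflow
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic_gd(X, y, alpha=0.1, n_iter=500):
    """Batch gradient descent on the logistic log loss.

    X: list of feature rows (include a constant 1.0 column for the bias x0).
    y: list of 0/1 labels.
    """
    n_features = len(X[0])
    w = [1.0] * n_features                       # start with all weights set to 1
    for _ in range(n_iter):                      # repeat R times
        grad = [0.0] * n_features
        for row, label in zip(X, y):             # gradient over the entire dataset
            z = sum(wj * xj for wj, xj in zip(w, row))
            error = sigmoid(z) - label
            for j, xj in enumerate(row):
                grad[j] += error * xj
        # step against the (averaged) gradient
        w = [wj - alpha * gj / len(X) for wj, gj in zip(w, grad)]
    return w

# Hypothetical toy data: x0 = 1 is the bias column; the label is 1 when x1 > 0.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, -0.5], [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 0, 1, 1, 1]
w = fit_logistic_gd(X, y)
preds = [int(sigmoid(w[0] * r[0] + w[1] * r[1]) > 0.5) for r in X]
print(w, preds)
```

Every iteration touches every row, so the cost per update grows with the dataset size.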

# 2. Support Vector Machine

### 2.1. Pros and cons

Advantages:

- Effective in high dimensional spaces, especially popular in text classification problems where very high-dimensional spaces are the norm
- Robust to noise because they maximize margins
- Make good decisions for data points that are outside the training set
- Versatile: can model complex, nonlinear relationships, since different kernel functions are available

Disadvantages:

- If the number of features is much greater than the number of samples, the methods might perform poorly
- SVMs do not directly provide probability estimates, hard to interpret
- Sensitive to tuning parameters and kernel choice, hard to run and tune

### 2.2. Classes

SVC and NuSVC:
  - are similar methods but accept slightly different sets of parameters.
  - implement the "one-against-one" approach for multi-class classification: n_class * (n_class - 1) / 2 classifiers are constructed.

LinearSVC:
  - is another implementation of support vector classification for the case of a linear kernel; it does not accept the keyword kernel.
  - implements the "one-vs-the-rest" approach, which trains n_class models.

### 2.3. Scores and probabilities

- The probabilities are calibrated using Platt scaling, which is an expensive operation for large datasets.
- It is advisable to set probability=False and use decision_function instead of predict_proba.

### 2.4. Parameters

- To give more importance to certain classes or to certain individual samples, use the keywords class_weight and sample_weight.

- C and gamma. The parameter C, common to all SVM kernels, trades off misclassification of training examples against simplicity of the decision surface. A low C makes the decision surface smooth, while a high C aims at classifying all training examples correctly. gamma defines how much influence a single training example has. The larger gamma is, the closer other examples must be to be affected.

### 2.5. Complexity

The core of an SVM is a quadratic programming problem (QP), so the compute and storage requirements increase rapidly with the number of training vectors.

Also note that for the linear case, the algorithm used in LinearSVC by the liblinear implementation is much more efficient than its libsvm-based SVC counterpart and can scale almost linearly to millions of samples and/or features.

### 2.6. kernels - mapping data to higher dimensions

One great thing about the SVM optimization is that all operations can be written in terms of inner products. We can replace the inner products with kernel functions without making simplifications. Replacing the inner product with a kernel is known as the kernel trick or kernel substitution.

Radial basis function (rbf, Gaussian version) kernel: $k(x,y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)$
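
A small sketch of the Gaussian kernel formula in plain Python. The function name `rbf_kernel` and the sample points are illustrative only. Note that scikit-learn parameterizes the same kernel as exp(-gamma * ||x - y||^2), so gamma corresponds to 1 / (2 * sigma^2).

```python
import math

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel: exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

print(rbf_kernel([0.0, 0.0], [0.0, 0.0]))             # identical points give the maximum value
print(rbf_kernel([0.0, 0.0], [3.0, 4.0]))             # distance 5, narrow kernel: close to 0
print(rbf_kernel([0.0, 0.0], [3.0, 4.0], sigma=5.0))  # same distance, wider kernel
```

A smaller sigma (larger gamma) makes the kernel value fall off faster with distance, which matches the description of gamma in section 2.4.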

There is an optimum number of support vectors. The beauty of SVMs is that they classify things efficiently. If you have too few support vectors, you may have a poor decision boundary. If you have too many support vectors, you're using the whole dataset every time you classify something, which is effectively k-Nearest Neighbors.

The k-Nearest Neighbors algorithm works well, but you have to carry around all the training examples. With support vector machines, you can carry around far fewer examples (only your support vectors) and achieve comparable performance.

[Multiclass SVM](https://www.csie.ntu.edu.tw/~cjlin/papers/multisvm.pdf)

### 2.7. Tips on practical use

- Kernel cache size: for SVC and NuSVC,  the kernel cache has a strong impact on run times for larger problems. If you have enough RAM, it is recommended to set cache_size to a larger value
- Setting C: C is 1 by default. If you have a lot of noisy observations, you should decrease it.
- In SVC, if data for classification are unbalanced (e.g. many positive and few negative), set class_weight='balanced' and/or try different penalty parameters C.
- Support Vector Machine algorithms are not scale invariant, so it is highly recommended to scale your data. For example, scale each attribute on the input vector X to [0,1] or [-1,+1], or standardize it to have mean 0 and variance 1. 
- Use sklearn.model_selection.GridSearchCV with C and gamma spaced exponentially far apart to choose good values.
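
The two scaling options mentioned in the tips can be sketched for a single feature column as follows. This is a plain-Python illustration with hypothetical function names and data. In practice, the statistics computed on the training data should also be applied to the test data.

```python
def min_max_scale(column):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:                       # constant column: nothing to scale
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

def standardize(column):
    """Shift and scale values to mean 0 and variance 1."""
    n = len(column)
    mean = sum(column) / n
    std = (sum((v - mean) ** 2 for v in column) / n) ** 0.5
    if std == 0:                       # constant column: only center it
        return [0.0 for _ in column]
    return [(v - mean) / std for v in column]

# Hypothetical feature column.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
print(min_max_scale(heights_cm))
print(standardize(heights_cm))
```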

### 2.8. Mathematical formulation

A support vector machine constructs a hyper-plane or set of hyper-planes in a high or infinite dimensional space, which can be used for classification, regression or other tasks. Intuitively, a good separation is achieved by the hyper-plane that has the largest distance to the nearest training data points of any class (so-called functional margin), since in general the larger the margin the lower the generalization error of the classifier.

# 3. Nearest Neighbors

### 3.1. Pros and cons

Advantages:

- No training involved ("lazy")
- Naturally handles multiclass classification and regression


Disadvantages:

- For high-dimensional parameter spaces, this method becomes less effective due to the so-called “curse of dimensionality”.
- Expensive and slow to predict new instances
- Must define a meaningful distance function

### 3.2. Classes

- KNeighborsClassifier: implements learning based on the k nearest neighbors of each query point, where k is an integer value specified by the user.
- RadiusNeighborsClassifier: implements learning based on the number of neighbors within a fixed radius r of each training point, where r is a floating-point value specified by the user. In cases where the data is not uniformly sampled, radius-based neighbors classification can be a better choice.

### 3.3. Parameters

- Algorithms:
  - Brute Force:  computation of distances between all pairs of points in the dataset: for N samples in D dimensions, this approach scales as O[D N^2]. As the number of samples N grows, the brute-force approach quickly becomes infeasible.
  - K-D tree: the computational cost of a nearest neighbors search reduces to O[D N log(N)] or better. Though the KD tree approach is very fast for low-dimensional (D < 20) neighbors searches, it becomes inefficient as D grows very large: this is one manifestation of the so-called “curse of dimensionality”.
  - Ball tree: 


### 3.4. Mathematical formulation

Neighbors-based classification is a type of instance-based learning or non-generalizing learning: it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the nearest neighbors of each point.
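
The majority-vote rule can be sketched as a brute-force search in plain Python. The function name `knn_predict`, the use of Euclidean distance, and the four-point toy dataset are assumptions made for illustration.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    def distance(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    # Brute force: sort all stored training points by distance to the query.
    neighbors = sorted(zip(train_X, train_y),
                       key=lambda pair: distance(pair[0], query))
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two points per class.
train_X = [[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]]
train_y = ["A", "A", "B", "B"]
print(knn_predict(train_X, train_y, [0.2, 0.1]))
print(knn_predict(train_X, train_y, [0.9, 0.9]))
```

Nothing is learned ahead of time: all the work happens at prediction time, which is why prediction is slow on large training sets.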

# 4. Decision tree classification

### 4.1. Pros and cons:
Advantages:

- Simple to understand and to interpret
- Requires little data preparation. Other techniques often require data normalisation, dummy variables need to be created and blank values to be removed. Note however that this module does not support missing values.
- The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points, computationally cheap to use

Disadvantages:

- Decision-tree learners can create over-complex trees that do not generalise the data well. This is called overfitting. Setting the minimum number of samples required at a leaf node or setting the maximum depth of the tree is necessary to avoid this problem.
- Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This problem is mitigated by using decision trees within an ensemble.
- There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems.
- Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset prior to fitting with the decision tree.

### 4.2. Pseudocode

The recursive createBranch function:

    Check if every item in the dataset is in the same class:
        If so, return the class label
        Else
            find the best feature to split the data
            split the dataset
            create a branch node
            for each split
                call createBranch and add the result to the branch node
            return branch node

### 4.3. Parameters

- criterion:
    - gini: default
    - entropy: information gain

- max_depth:

- min_samples_split: default is 2
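
To make the two `criterion` options concrete, here is a minimal sketch of Gini impurity, entropy, and the information gain of a split. The function names and labels are illustrative and this is not scikit-learn's internal code.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted

parent = ["yes", "yes", "no", "no"]
print(gini(parent), entropy(parent))
print(information_gain(parent, [["yes", "yes"], ["no", "no"]]))  # perfect split
print(information_gain(parent, [["yes", "no"], ["yes", "no"]]))  # uninformative split
```

Both measures are 0 for a pure node and largest when the classes are evenly mixed. The best split is the one that reduces impurity the most.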

### 4.4. Algorithms

- ID3: can split nominal-valued datasets
- C4.5
- CART

# 5. Stochastic Gradient Descent

### 5.1. Pros and cons

Advantages:

- Efficiency: if X is a matrix of size (n, p), training has a cost of O(k n p_hat), where k is the number of iterations and p_hat is the average number of non-zero attributes per sample. It is particularly useful when the number of samples is very large.

### 5.2. Tips on practical use

- Stochastic Gradient Descent is sensitive to feature scaling, so it is highly recommended to scale your data. If your attributes have an intrinsic scale (e.g. word frequencies or indicator features) scaling is not needed. 


### 5.3. Parameters:

- loss:
    - hinge: default, gives a linear SVM
    - log: logistic regression, a probabilistic classifier, suited to large datasets (named log_loss in recent scikit-learn releases)
    - modified_huber: a smooth loss that brings tolerance to outliers
    - squared_hinge: like hinge but quadratically penalized
    - perceptron: a linear loss used by the perceptron algorithm

- penalty:
    - l2: default, the standard regularizer for linear SVM models
    - l1: brings sparsity to the model that is not achievable with l2
    - elasticnet: a combination of both l1 and l2, which might bring sparsity to the model that is not achievable with l2

### 5.4. Pseudocode

SGD is a simple yet very efficient approach to discriminative learning of linear classifiers under convex loss functions, such as (linear) Support Vector Machines and Logistic Regression. It updates the weights using only one instance at a time.

    Start with the weights all set to 1
    For each piece of data in the dataset:
        calculate the gradient of that one piece of data
        update the weights vector by alpha * gradient
    Return the weights vector
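
A plain-Python sketch of this per-example update for logistic regression follows. The function name, learning rate, and toy data are hypothetical. Running several shuffled passes (epochs) over the data is an addition beyond the single pass in the pseudocode.

```python
import math
import random

def sigmoid(z):
    z = max(-60.0, min(60.0, z))        # clamp so exp() cannot overflow
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic_sgd(X, y, alpha=0.1, n_epochs=100, seed=0):
    """Stochastic gradient updates for logistic regression, one row at a time."""
    rng = random.Random(seed)
    w = [1.0] * len(X[0])               # start with all weights set to 1
    order = list(range(len(X)))
    for _ in range(n_epochs):
        rng.shuffle(order)              # visit the rows in random order
        for i in order:
            z = sum(wj * xj for wj, xj in zip(w, X[i]))
            error = y[i] - sigmoid(z)   # gradient signal from this one example
            w = [wj + alpha * error * xj for wj, xj in zip(w, X[i])]
    return w

# Hypothetical toy data: x0 = 1 is the bias column; the label is 1 when x1 > 0.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, -0.5], [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 0, 1, 1, 1]
w = fit_logistic_sgd(X, y)
preds = [int(sigmoid(w[0] * r[0] + w[1] * r[1]) > 0.5) for r in X]
print(w, preds)
```

Each update costs only one row's worth of work, so the weights can also be updated incrementally as new data arrives.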

# 6. Gaussian Naive Bayes (GaussianNB)

### 6.1. Pros and cons

Advantages:

- Super simple
- If the NB conditional independence assumption actually holds, a Naive Bayes classifier will converge quicker than discriminative models like logistic regression, so you need less training data
- handles multiple classes

Disadvantages:

- It can’t learn interactions between features
- Sensitive to how the input data is prepared
- Works with nominal values

### 6.2. Bayes' rule:

p(c|x,y) = p(x,y|c)p(c)/p(x,y)

We basically compare p(c1|x,y) and p(c2|x,y):

- If p(c1|x,y) > p(c2|x,y), the class is c1
- If p(c1|x,y) < p(c2|x,y), the class is c2

In Naive Bayes, we:

- assume features are independent
- assume every feature is equally important

A token is any combination of characters. You can think of tokens as words, but we may also use things that aren't words, such as URLs or IP addresses.

### 6.3. Practical considerations 

- **zero probability**: We multiply a lot of probabilities together to get the probability that a question belongs to a given class. If any of them is zero, the whole product will be 0. To lessen the impact of this, we initialize all occurrence counts to 1 and the denominators to 2.

- **underflow**: Because we multiply a lot of probabilities together and many of them are very small, the product can underflow. One solution is to take the natural logarithm of the product: ln(a * b) = ln(a) + ln(b)
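
Both considerations can be demonstrated with a few lines of Python. The probability values and the helper name `smoothed_prob` are made up for illustration.

```python
import math

def smoothed_prob(count, total):
    """Estimate a probability with counts started at 1 and the denominator at 2."""
    return (count + 1) / (total + 2)

# A token never seen in a class still gets a non-zero probability.
print(smoothed_prob(0, 10))

# Hypothetical per-token probabilities for one class: 100 small factors.
probs = [1e-5] * 100

product = 1.0
for p in probs:
    product *= p                 # shrinks below the smallest float and becomes 0.0

log_sum = sum(math.log(p) for p in probs)   # the sum of logs stays finite

print(product)
print(log_sum)
```

Because the logarithm is monotonic, comparing the log sums of two classes picks the same class as comparing the raw products would.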

### 6.4. Summary

Using probabilities can sometimes be more effective than using hard rules for classification. Bayesian probability and Bayes' rule give us a way to estimate unknown probabilities from known values.

# 7. Ensemble methods

### 7.1 Two families of ensemble methods:

- **Averaging methods:** build several estimators independently and then average their predictions.
    - Bagging: also known as bootstrap aggregating. S new datasets are drawn from the original dataset. Each is the same size as the original and is built by randomly selecting examples from the original with replacement. After the S datasets are built, a learning algorithm is applied to each one. To classify a new piece of data, you apply the S classifiers to it and take a majority vote.
    - Forests of randomized trees
    
- **Boosting methods:** base estimators are built sequentially and one tries to reduce the bias of the combined estimator.
    - AdaBoost: short for adaptive boosting. Boosting applies a weight to every sample in the training data. Initially, these weights are all equal (**a weight vector D**). A weak classifier is first trained on the training data. The errors from the weak classifier are calculated, and the weak classifier is trained a second time on the same dataset. This second time, the weights of the training set are adjusted so the examples properly classified the first time are weighted less and the examples incorrectly classified in the first iteration are weighted more. To get one answer from all of these weak classifiers, AdaBoost assigns an alpha value to each classifier. The alpha values are based on the error of each weak classifier.
    - Gradient tree boosting
    
    To make this approach work, there are two fundamental questions that must be answered: first, how should each distribution be chosen on each round, and second, how should the weak rules be combined into a single rule? Regarding the choice of distribution, the technique that we advocate is to place the most weight on the examples most often misclassified by the preceding weak rules; this has the effect of forcing the base learner to focus its attention on the “hardest” examples. As for combining the weak rules, simply taking a (weighted) majority vote of their predictions is natural and effective.  
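
The bagging procedure from the list above can be sketched as follows. The base learner `fit_threshold` is a deliberately trivial, hypothetical stand-in for a real learning algorithm, and the one-dimensional data is made up.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) examples from data with replacement."""
    return [rng.choice(data) for _ in data]

def fit_threshold(sample):
    """Hypothetical weak learner: split halfway between the two class means.
    Assumes the positive class (label 1) has the larger x values."""
    pos = [x for x, label in sample if label == 1]
    neg = [x for x, label in sample if label == 0]
    if not pos or not neg:              # the sample contains only one class
        only = 1 if pos else 0
        return lambda x: only
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: int(x > threshold)

def bagging_predict(classifiers, x):
    """Majority vote over the predictions of all classifiers."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

data = [(1.0, 0), (2.0, 0), (3.0, 0), (7.0, 1), (8.0, 1), (9.0, 1)]
rng = random.Random(0)
S = 11                                  # number of bootstrap datasets
classifiers = [fit_threshold(bootstrap_sample(data, rng)) for _ in range(S)]
print(bagging_predict(classifiers, 2.0), bagging_predict(classifiers, 8.0))
```

Each classifier sees a slightly different dataset, so their individual mistakes tend to differ and are outvoted by the majority.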
    

### 7.2 Ensemble Methods in Machine Learning - Thomas G. Dietterich
[Thomas G. Dietterich](http://www.cs.orst.edu/~tgd)

__In low-noise cases, AdaBoost gives good performance because it is able to optimize the ensemble without overfitting. However, in high-noise cases, AdaBoost puts a large amount of weight on the mislabeled examples, and this leads it to overfit very badly__. Bagging and Randomization do well in both the noisy and noise-free cases because they focus on the statistical problem, and noise increases this statistical problem.

In very large datasets, Randomization can be expected to do better than Bagging because bootstrap replicates of a large training set are very similar to the training set itself, and hence the learned decision trees will not be very diverse. Randomization creates diversity under all conditions, but at the risk of generating low-quality decision trees.

### 7.3 Boosting

error: $$\epsilon = \frac{\text{number of incorrectly classified examples}}{\text{total number of examples}}$$

alpha: $$\alpha = \frac{1}{2} \ln\left(\frac{1 - \epsilon}{\epsilon}\right)$$

If correctly predicted: $$D_i^{(t + 1)} = \frac{D_i^{(t)} e^{-\alpha}}{\mathrm{Sum}(D)}$$

If incorrectly predicted: $$D_i^{(t + 1)} = \frac{D_i^{(t)} e^{\alpha}}{\mathrm{Sum}(D)}$$

[read here](https://www.cs.princeton.edu/courses/archive/spring07/cos424/papers/boosting-survey.pdf)
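
One round of this reweighting can be sketched in Python. Two assumptions are worth flagging: the sketch uses the D-weighted error, which equals the fraction of misclassified examples given above when all weights are equal (as in the first round), and it clamps the error away from 0 and 1 to avoid dividing by zero. The function name `adaboost_round` is hypothetical.

```python
import math

def adaboost_round(D, correct):
    """One AdaBoost reweighting step.

    D: current sample weights (summing to 1).
    correct: list of booleans, True where the weak classifier was right.
    Returns (alpha, new_D).
    """
    eps = sum(d for d, c in zip(D, correct) if not c)    # weighted error
    eps = min(max(eps, 1e-16), 1.0 - 1e-16)              # assumption: clamp to avoid /0
    alpha = 0.5 * math.log((1.0 - eps) / eps)
    # Shrink the weights of correct examples, grow those of incorrect ones.
    unnormalized = [d * math.exp(-alpha if c else alpha) for d, c in zip(D, correct)]
    total = sum(unnormalized)                            # Sum(D) in the formulas
    return alpha, [u / total for u in unnormalized]

D = [0.2] * 5                                            # equal initial weights
correct = [True, True, True, True, False]                # one mistake out of five
alpha, new_D = adaboost_round(D, correct)
print(alpha, new_D)
```

With one of five equally weighted examples misclassified, the error is 0.2 and alpha is ln(2). After normalization the misclassified example holds half of the total weight.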

### 7.4. Overfitting

It has been claimed in the literature that for well-behaved datasets, the test error for AdaBoost reaches a plateau and won't increase with more classifiers.

For a dataset that isn't "well behaved", you may see that the test error reaches a minimum and then starts to increase. The dataset in question started off with 30% missing values, which were replaced with zeros. That works well for logistic regression but may not work for a decision tree. You can try replacing missing values with the average for a given class instead.

# 8. Options for handling missing data

- Use the feature's mean value from all the available data
- Fill in the unknown with a special value like -1
- Ignore the instance
- Use a mean value from  similar items
- Use another machine learning algorithm to predict the value

For example, setting missing values to 0 works well for logistic regression for two reasons:

- The update rule is weights = weights + alpha * error * dataMatrix[randindex]. If dataMatrix is 0 for any feature, the weight for that feature is left unchanged.
- The error term will not be impacted because sigmoid(0) = 0.5, which is neutral for predicting the class.
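
A tiny check of the first reason, using the update rule as written above. The weights, the feature row, and the function name `sgd_update` are arbitrary illustrative choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_update(weights, x, label, alpha=0.01):
    """One update: weights = weights + alpha * error * x."""
    error = label - sigmoid(sum(w * xi for w, xi in zip(weights, x)))
    return [w + alpha * error * xi for w, xi in zip(weights, x)]

weights = [0.5, -0.3, 0.8]
x = [1.0, 0.0, 2.0]          # the second feature was missing and filled with 0
new_weights = sgd_update(weights, x, label=1)

print(new_weights)
print(sigmoid(0.0))          # 0.5: neutral between the two classes
```

The weight of the zero-filled feature stays at -0.3 while the other two weights move.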

# 9. Classification imbalance

### 9.1. Alternative performance metrics: precision, recall and ROC

- Confusion matrix
- Precision: the fraction of records that were positive from the group that the classifier predicted to be positive
$$\frac{TP}{(TP + FP)}$$
- Recall: the fraction of positive examples the classifier got right
$$\frac{TP}{(TP + FN)}$$

You can easily construct a classifier that achieves a high measure of recall or precision but not both. Creating a classifier that maximizes both precision and recall is a challenge.

- ROC curve: x-axis False positive rate; y-axis True positive rate
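
Precision and recall can be computed directly from the counts in the confusion matrix. The sketch below uses made-up labels and a hypothetical function name. It also shows why a high recall alone is easy to achieve.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
print(precision_recall(y_true, y_pred))

# Predicting "positive" for everything gives perfect recall but poor precision.
print(precision_recall(y_true, [1] * len(y_true)))
```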

### 9.2. Manipulating the classifier's decision with a cost function

Originally, the total cost is calculated as: TP * 0 + FN * 1 + FP * 1 + TN * 0

Now consider a weighted total cost: TP * -5 + FN * 1 + FP * 50 + TN * 0
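
The two cost formulas can be compared on the same confusion-matrix counts. The counts below are hypothetical, chosen only to show how the weights change the picture.

```python
def total_cost(tp, fn, fp, tn, costs):
    """Weighted sum of the confusion-matrix counts."""
    return (tp * costs["TP"] + fn * costs["FN"]
            + fp * costs["FP"] + tn * costs["TN"])

uniform = {"TP": 0, "FN": 1, "FP": 1, "TN": 0}     # every error costs the same
weighted = {"TP": -5, "FN": 1, "FP": 50, "TN": 0}  # false positives are expensive

# Hypothetical counts for one classifier.
counts = dict(tp=40, fn=10, fp=5, tn=45)
print(total_cost(costs=uniform, **counts))
print(total_cost(costs=weighted, **counts))
```

Under the uniform costs the 5 false positives contribute 5 to the total. Under the weighted costs they contribute 250, so a classifier tuned to minimize the weighted cost will avoid false positives much more aggressively.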

### 9.3. Data sampling for dealing with classification imbalance


Oversample: duplicate examples

Undersample: delete examples

For example, suppose you are trying to identify credit card fraud, which is a rare case. You want to preserve as much information as possible about the rare case, so you should keep all of the examples from the positive class and undersample or discard examples from the negative class. One drawback of this approach is deciding which negative examples to toss out: the examples you discard could carry valuable information that isn't contained in the remaining examples. One solution is to discard samples that aren't near the decision boundary.
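
A minimal sketch of undersampling follows. It discards majority-class examples at random, which is the simplest variant and not the boundary-aware selection suggested above. The function name and data are hypothetical.

```python
import random

def undersample(X, y, majority_label, rng):
    """Keep every minority example and an equally sized random subset of the majority."""
    minority = [(x, label) for x, label in zip(X, y) if label != majority_label]
    majority = [(x, label) for x, label in zip(X, y) if label == majority_label]
    kept = rng.sample(majority, min(len(minority), len(majority)))
    combined = minority + kept
    rng.shuffle(combined)               # avoid leaving the classes in blocks
    return [x for x, _ in combined], [label for _, label in combined]

# Hypothetical data: 2 fraud cases (label 1) among 10 transactions.
X = list(range(10))
y = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
X_bal, y_bal = undersample(X, y, majority_label=0, rng=random.Random(0))
print(X_bal, y_bal)
```

All rare-class examples survive, and the classes end up balanced at the cost of throwing away most of the negative examples.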