Supervised learning
https://en.wikipedia.org/wiki/Supervised_learning

The most widely used learning algorithms are support vector machines, linear regression, logistic regression, naive Bayes, linear discriminant analysis, decision trees, the k-nearest neighbor algorithm, and neural networks (multilayer perceptron).

#### There are four major issues to consider in supervised learning:

#### Bias-variance tradeoff
A first issue is the tradeoff between bias and variance.[2] Imagine that we have available several different, but equally good, training data sets. A learning algorithm is biased for a particular input $x$ if, when trained on each of these data sets, it is systematically incorrect when predicting the correct output for $x$. A learning algorithm has high variance for a particular input $x$ if it predicts different output values when trained on different training sets. The prediction error of a learned classifier is related to the sum of the bias and the variance of the learning algorithm.[3] Generally, there is a tradeoff between bias and variance. A learning algorithm with low bias must be "flexible" so that it can fit the data well. But if the learning algorithm is too flexible, it will fit each training data set differently, and hence have high variance. A key aspect of many supervised learning methods is that they are able to adjust this tradeoff between bias and variance (either automatically or by providing a bias/variance parameter that the user can adjust).
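The definitions above can be checked with a small simulation. The following is a minimal sketch, not part of the source: the target function `true_f`, the noise level, and the two toy learners (a constant predictor and a 1-nearest-neighbor predictor) are all assumptions chosen for illustration. It draws many training sets, records each learner's prediction at one input `x0`, and estimates squared bias and variance from those predictions.

```python
import math
import random

def true_f(x):
    # Hypothetical "true" function, assumed for this illustration only.
    return math.sin(2 * math.pi * x)

def sample_training_set(rng, n=20, noise=0.3):
    # One of the "different, but equally good" training sets.
    xs = [rng.random() for _ in range(n)]
    return [(x, true_f(x) + rng.gauss(0, noise)) for x in xs]

def predict_constant(train, x):
    # Inflexible learner: always predicts the mean training output.
    return sum(y for _, y in train) / len(train)

def predict_1nn(train, x):
    # Flexible learner: copies the output of the nearest training input.
    return min(train, key=lambda p: abs(p[0] - x))[1]

def bias_variance(predict, x0, trials=500, seed=0):
    # Train on many training sets and look at the predictions for x0.
    rng = random.Random(seed)
    preds = [predict(sample_training_set(rng), x0) for _ in range(trials)]
    mean_pred = sum(preds) / trials
    bias_sq = (mean_pred - true_f(x0)) ** 2
    variance = sum((p - mean_pred) ** 2 for p in preds) / trials
    return bias_sq, variance

x0 = 0.25
bias_const, var_const = bias_variance(predict_constant, x0)
bias_nn, var_nn = bias_variance(predict_1nn, x0)
print(f"constant: bias^2={bias_const:.3f} variance={var_const:.3f}")
print(f"1-NN:     bias^2={bias_nn:.3f} variance={var_nn:.3f}")
```

Under these assumptions the constant predictor should show the larger squared bias and the 1-NN predictor the larger variance, which is the tradeoff described in the paragraph.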

#### Function complexity and amount of training data
The second issue is the amount of training data available relative to the complexity of the "true" function (classifier or regression function). If the true function is simple, then an "inflexible" learning algorithm with high bias and low variance will be able to learn it from a small amount of data. But if the true function is highly complex (e.g., because it involves complex interactions among many different input features and behaves differently in different parts of the input space), then the function will only be learnable from a very large amount of training data and using a "flexible" learning algorithm with low bias and high variance.

#### Dimensionality of the input space
A third issue is the dimensionality of the input space. If the input feature vectors have very high dimension, the learning problem can be difficult even if the true function only depends on a small number of those features. This is because the many "extra" dimensions can confuse the learning algorithm and cause it to have high variance. Hence, high input dimensionality typically requires tuning the classifier to have low variance and high bias. In practice, if the engineer can manually remove irrelevant features from the input data, this is likely to improve the accuracy of the learned function. In addition, there are many algorithms for feature selection that seek to identify the relevant features and discard the irrelevant ones. This is an instance of the more general strategy of dimensionality reduction, which seeks to map the input data into a lower-dimensional space prior to running the supervised learning algorithm.
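As one hedged illustration of feature selection, the sketch below ranks features by the absolute Pearson correlation between each feature and the target and keeps the top `k`. This univariate filter is only one of many possible selection methods and is an assumption here, as are the synthetic data and the helper names `pearson` and `select_top_k`.

```python
import math
import random

def pearson(xs, ys):
    # Pearson correlation coefficient; returns 0.0 for a constant column.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def select_top_k(rows, targets, k):
    # Score every feature column, then keep the k highest-scoring indices.
    n_features = len(rows[0])
    scores = [abs(pearson([r[j] for r in rows], targets))
              for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k], scores

# Synthetic data: five features, but only feature 0 determines the target.
rng = random.Random(0)
rows = [[rng.random() for _ in range(5)] for _ in range(200)]
targets = [1.0 if r[0] > 0.5 else 0.0 for r in rows]

selected, scores = select_top_k(rows, targets, k=1)
print(selected, [round(s, 2) for s in scores])
```

A filter like this only sees one feature at a time, so it can miss features that matter only in combination with others.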

#### Noise in the output values
A fourth issue is the degree of noise in the desired output values (the supervisory target variables). If the desired output values are often incorrect (because of human error or sensor errors), then the learning algorithm should not attempt to find a function that exactly matches the training examples. Attempting to fit the data too carefully leads to overfitting. You can overfit even when there are no measurement errors (stochastic noise) if the function you are trying to learn is too complex for your learning model. In such a situation, the part of the target function that cannot be modeled "corrupts" your training data; this phenomenon has been called deterministic noise. When either type of noise is present, it is better to go with a higher bias, lower variance estimator.

In practice, there are several approaches to alleviating noise in the output values, such as early stopping to prevent overfitting, as well as detecting and removing the noisy training examples prior to training the supervised learning algorithm. Several algorithms identify noisy training examples, and removing the suspected noisy examples prior to training has decreased generalization error with statistical significance.[4][5]

https://www.aaai.org/Papers/KDD/1996/KDD96-033.pdf

#### Logarithmic transformation

For highly skewed feature distributions, it is common practice to apply a logarithmic transformation to the data so that the very large and very small values do not negatively affect the performance of a learning algorithm. Using a logarithmic transformation significantly reduces the range of values caused by outliers. Care must be taken when applying this transformation, however: the logarithm of 0 is undefined, so we must translate the values by a small amount above 0 to apply the logarithm successfully.
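A minimal sketch of this transformation, assuming a shift of exactly 1 so that zero maps to zero (the text only asks for "a small amount above 0", so the size of the shift is a choice made here). The sample values are made up for illustration.

```python
import math

# Hypothetical skewed feature with a zero and one extreme outlier.
values = [0, 9, 99, 99999]

# log1p(v) computes log(1 + v), so a raw value of 0 is handled safely.
transformed = [math.log1p(v) for v in values]

raw_range = max(values) - min(values)
log_range = max(transformed) - min(transformed)
print(transformed)
print(raw_range, round(log_range, 2))
```

The transformed range is a small fraction of the raw range, which is the effect on outliers described above.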

https://en.wikipedia.org/wiki/Data_transformation_(statistics)

#### Normalizing Numerical Features
In addition to performing transformations on features that are highly skewed, it is often good practice to perform some type of scaling on numerical features. Applying a scaling to the data does not change the shape of each feature's distribution (such as 'capital-gain' or 'capital-loss' above); however, normalization ensures that each feature is treated equally when applying supervised learners. Note that once scaling is applied, observing the data in its raw form will no longer have the same original meaning.
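One common choice of scaling is min-max scaling to the range [0, 1]. The sketch below assumes that choice; the notes do not say which scaler is meant, and the `ages` values are invented for the example.

```python
def min_max_scale(values):
    # Linearly map the smallest value to 0.0 and the largest to 1.0.
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

ages = [18, 30, 45, 90]  # hypothetical raw feature
scaled = min_max_scale(ages)
print(scaled)
```

Because the mapping is linear, the ordering and relative spacing of the values are preserved, but a scaled value such as 0.375 no longer reads as an age.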

We can use the **F-beta score** as a metric that considers both precision and recall:

$$ F_{\beta} = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{\left( \beta^2 \cdot \text{precision} \right) + \text{recall}} $$

In particular, when $\beta = 0.5$, more emphasis is placed on precision. This is called the **F$_{0.5}$ score** (or F-score for simplicity).

Precision (also called positive predictive value) is the fraction of retrieved instances that are relevant, while recall (also known as sensitivity) is the fraction of relevant instances that are retrieved. Both precision and recall are therefore based on an understanding and measure of relevance.

Suppose a computer program for recognizing dogs identifies 8 dogs in a video scene containing 12 dogs and some cats. Of the 8 dogs identified, 5 are actually dogs (true positives), while the rest are cats (false positives). The program's precision is 5/8 while its recall is 5/12. When a search engine returns 30 pages, only 20 of which are relevant, while failing to return 40 additional relevant pages, its precision is 20/30 = 2/3 while its recall is 20/60 = 1/3. So, in this case, precision is "how useful the search results are", and recall is "how complete the results are".
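The two worked examples can be reproduced with a few lines of Python. This is a sketch that applies the F-beta formula given earlier to the counts from the text; the function names are illustrative, and the false-negative counts (7 missed dogs, 40 missed pages) are derived from the numbers in the examples.

```python
def precision_recall(tp, fp, fn):
    # precision = TP / (TP + FP), recall = TP / (TP + FN)
    return tp / (tp + fp), tp / (tp + fn)

def f_beta(precision, recall, beta):
    # F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0

# Dog detector: 8 flagged, 5 correct, 12 dogs in total (so 7 missed).
dog_p, dog_r = precision_recall(tp=5, fp=3, fn=7)
dog_f05 = f_beta(dog_p, dog_r, beta=0.5)

# Search engine: 30 returned, 20 relevant, 40 relevant pages missed.
search_p, search_r = precision_recall(tp=20, fp=10, fn=40)
search_f1 = f_beta(search_p, search_r, beta=1.0)

print(dog_p, dog_r, round(dog_f05, 3))
print(search_p, search_r, round(search_f1, 3))
```

With $\beta = 0.5$ the dog detector's score (25/44, about 0.568) sits closer to its precision of 0.625 than to its recall of about 0.417, reflecting the extra weight on precision.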

### Regularization:

For example, let's say you're training your guard robot to identify which neighborhood pets to scare off because they bother your prize cockatiel. You give the robot the following examples:

Bo, snake, small, FRIENDLY

Miles, dog, small, FRIENDLY

Fifi, cat, small, ENEMY

Muffy, cat, small, FRIENDLY

Rufus, dog, large, FRIENDLY

Jebediah, snail, small, FRIENDLY

Aloysius, dog, large, ENEMY

Tom, cat, large, ENEMY

Your robot might happily learn the rule "pets with up to four-letter names are enemies, as are large dogs with names beginning with 'A', except that small snakes are not enemies", since it perfectly fits this data, but it feels clunky.

Intuitively, we'd suppose there's a simpler rule that fits as well. The reason we feel complexity is bad per se is that it seems unlikely to generalize well to new pets: are small snakes really not enemies, or is it just that they weren't in this input?

In contrast, the rule "large dogs and cats are enemies" does not fit the data perfectly, but fits reasonably well. It's simpler, and maybe we therefore suppose it will be more correct for the thousands of other pets not in this sample.

Loosely, regularization is whatever discourages complexity in the math that evaluates these rules, even if it means picking a rule that is less accurate on the training data.
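To make the pet example concrete, here is a toy sketch that scores each rule as training errors plus a penalty `lam` times the number of clauses in the rule. The clause counts (3 and 1), the penalty form, and the reading of the simple rule as "large dogs and large cats" are assumptions made for illustration; real regularizers usually penalize model weights rather than count clauses.

```python
pets = [
    ("Bo", "snake", "small", "FRIENDLY"),
    ("Miles", "dog", "small", "FRIENDLY"),
    ("Fifi", "cat", "small", "ENEMY"),
    ("Muffy", "cat", "small", "FRIENDLY"),
    ("Rufus", "dog", "large", "FRIENDLY"),
    ("Jebediah", "snail", "small", "FRIENDLY"),
    ("Aloysius", "dog", "large", "ENEMY"),
    ("Tom", "cat", "large", "ENEMY"),
]

def complex_rule(name, species, size):
    short_name = len(name) <= 4
    large_a_dog = species == "dog" and size == "large" and name.startswith("A")
    small_snake = species == "snake" and size == "small"
    is_enemy = (short_name or large_a_dog) and not small_snake
    return "ENEMY" if is_enemy else "FRIENDLY"

def simple_rule(name, species, size):
    is_enemy = size == "large" and species in ("dog", "cat")
    return "ENEMY" if is_enemy else "FRIENDLY"

def training_errors(rule):
    return sum(rule(n, s, z) != label for n, s, z, label in pets)

# Each rule is paired with an assumed complexity (number of clauses).
rules = {"complex": (complex_rule, 3), "simple": (simple_rule, 1)}

def pick_rule(lam):
    # Regularized objective: training errors + lam * complexity.
    def cost(name):
        rule, n_clauses = rules[name]
        return training_errors(rule) + lam * n_clauses
    return min(rules, key=cost)

print(training_errors(complex_rule), training_errors(simple_rule))
print(pick_rule(0.0), pick_rule(1.5))
```

With no penalty (`lam = 0.0`) the complex rule wins because it makes zero training errors; with `lam = 1.5` the simple rule wins despite its two errors.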


- Regularization is a technique used in an attempt to solve the overfitting[1] problem in statistical models.

## Choosing a Machine Learning Classifier
 
by Edwin Chen on Wed 27 April 2011


http://blog.echen.me/2011/04/27/choosing-a-machine-learning-classifier/

https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-algorithm-choice

http://ml.posthaven.com/machine-learning-done-wrong


How do you know what machine learning algorithm to choose for your classification problem? Of course, if you really care about accuracy, your best bet is to test out a couple different ones (making sure to try different parameters within each algorithm as well), and select the best one by cross-validation. But if you’re simply looking for a “good enough” algorithm for your problem, or a place to start, here are some general guidelines I’ve found to work well over the years.
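The "try several and select by cross-validation" advice can be sketched in plain Python. Everything below is an illustrative assumption rather than the blog author's code: the one-dimensional synthetic data, the two stand-in classifiers (a nearest-centroid model as the simpler, higher-bias option and 1-NN as the more flexible one), and the helper `cross_val_accuracy`.

```python
import random

def nearest_centroid(train):
    # Simple model: one mean per class, predict the closest mean.
    sums = {}
    for x, y in train:
        total, count = sums.get(y, (0.0, 0))
        sums[y] = (total + x, count + 1)
    centroids = {y: total / count for y, (total, count) in sums.items()}
    return lambda x: min(centroids, key=lambda y: abs(centroids[y] - x))

def one_nn(train):
    # Flexible model: copy the label of the nearest training point.
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

def cross_val_accuracy(fit, data, k=5):
    # k-fold cross-validation: each fold is held out exactly once.
    folds = [data[i::k] for i in range(k)]
    correct = 0
    for i, held_out in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        model = fit(train)
        correct += sum(model(x) == y for x, y in held_out)
    return correct / len(data)

# Synthetic two-class data, shuffled so every fold mixes both classes.
rng = random.Random(0)
data = ([(rng.gauss(0, 1), 0) for _ in range(50)]
        + [(rng.gauss(5, 1), 1) for _ in range(50)])
rng.shuffle(data)

candidates = {"nearest_centroid": nearest_centroid, "1-NN": one_nn}
scores = {name: cross_val_accuracy(fit, data) for name, fit in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

In practice the same loop would also iterate over parameter settings within each algorithm, as the paragraph suggests.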

### How large is your training set?

If your training set is small, high bias/low variance classifiers (e.g., Naive Bayes) have an advantage over low bias/high variance classifiers (e.g., kNN), since the latter will overfit. But low bias/high variance classifiers start to win out as your training set grows (they have lower asymptotic error), since high bias classifiers aren’t powerful enough to provide accurate models.

You can also think of this as a generative model vs. discriminative model distinction.

### Advantages of some particular algorithms

Advantages of Naive Bayes: Super simple, you’re just doing a bunch of counts. If the NB conditional independence assumption actually holds, a Naive Bayes classifier will converge quicker than discriminative models like logistic regression, so you need less training data. And even if the NB assumption doesn’t hold, a NB classifier still often does a great job in practice. A good bet if you want something fast and easy that performs pretty well. Its main disadvantage is that it can’t learn interactions between features (e.g., it can’t learn that although you love movies with Brad Pitt and Tom Cruise, you hate movies where they’re together).

Advantages of Logistic Regression: Lots of ways to regularize your model, and you don’t have to worry as much about your features being correlated, like you do in Naive Bayes. You also have a nice probabilistic interpretation, unlike decision trees or SVMs, and you can easily update your model to take in new data (using an online gradient descent method), again unlike decision trees or SVMs. Use it if you want a probabilistic framework (e.g., to easily adjust classification thresholds, to say when you’re unsure, or to get confidence intervals) or if you expect to receive more training data in the future that you want to be able to quickly incorporate into your model.

Advantages of Decision Trees: Easy to interpret and explain (for some people – I’m not sure I fall into this camp). They easily handle feature interactions and they’re non-parametric, so you don’t have to worry about outliers or whether the data is linearly separable (e.g., decision trees easily take care of cases where you have class A at the low end of some feature x, class B in the mid-range of feature x, and A again at the high end). One disadvantage is that they don’t support online learning, so you have to rebuild your tree when new examples come in. Another disadvantage is that they easily overfit, but that’s where ensemble methods like random forests (or boosted trees) come in. Plus, random forests are often the winner for lots of problems in classification (usually slightly ahead of SVMs, I believe), they’re fast and scalable, and you don’t have to worry about tuning a bunch of parameters like you do with SVMs, so they seem to be quite popular these days.

Advantages of SVMs: High accuracy, nice theoretical guarantees regarding overfitting, and with an appropriate kernel they can work well even if your data isn’t linearly separable in the base feature space. Especially popular in text classification problems where very high-dimensional spaces are the norm. Memory-intensive, hard to interpret, and kind of annoying to run and tune, though, so I think random forests are starting to steal the crown.

### But…

Recall, though, that better data often beats better algorithms, and designing good features goes a long way. And if you have a huge dataset, then whichever classification algorithm you use might not matter so much in terms of classification performance (so choose your algorithm based on speed or ease of use instead).

And to reiterate what I said above, if you really care about accuracy, you should definitely try a bunch of different classifiers and select the best one by cross-validation. Or, to take a lesson from the Netflix Prize (and Middle Earth), just use an ensemble method to choose them all.

### Applications of supervised machine learning algorithms:
#### Logistic Regression: 
One real-world application of logistic regression is the prediction of bankruptcy. Since the outcome is either bankrupt or not bankrupt, logistic regression is a natural model for predicting future trends in bankruptcy. It gives a probabilistic estimate of the output, fitted with techniques such as maximum likelihood [Richard P.H and David B (2011). Predicting bankruptcy with Robust Logistic Regression. Journal of Data Science.], which can later be used for ranking purposes. Logistic regression is a very good tool for classification problems posed in terms of probability. It is not very prone to outliers [Richard P.H and David B (2011). Predicting bankruptcy with Robust Logistic Regression. Journal of Data Science.]. Regularization can easily be added to help the model generalize to unseen data. It cannot predict continuous outcomes. It performs poorly when multicollinearity exists, and it does not perform well on data with a very large number of features. Logistic regression is good at classification problems where there are not many features, and it works best when the classes are close to linearly separable.
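As a rough sketch of how maximum likelihood produces those probability estimates, the code below fits a one-feature logistic regression by gradient ascent on the average log-likelihood. The toy data (a single made-up risk score per company, with label 1 meaning bankrupt), the learning rate, and the number of epochs are all assumptions; it is not the robust method of the cited paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    # Gradient ascent on the average log-likelihood of a 1-feature model.
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        residuals = [y - sigmoid(w * x + b) for x, y in zip(xs, ys)]
        grad_w = sum(r * x for r, x in zip(residuals, xs)) / n
        grad_b = sum(residuals) / n
        w += lr * grad_w
        b += lr * grad_b
    return w, b

# Hypothetical risk scores; the classes overlap, so the fit stays finite.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(xs, ys)
p_low = sigmoid(w * 0.5 + b)    # estimated P(bankrupt) at a low score
p_high = sigmoid(w * 4.0 + b)   # estimated P(bankrupt) at a high score
print(round(w, 2), round(b, 2), round(p_low, 3), round(p_high, 3))
```

The fitted probabilities can be sorted to rank companies by estimated risk, which is the ranking use mentioned above.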


#### Support Vector Machines:

The SVM is a major machine learning algorithm used in the field of health and medicine. One such application is the detection of diabetes and pre-diabetes in a given population. Unlike logistic regression, which predicts its output from a fitted model, an SVM transforms the input variables and then generates a hyperplane that separates the outcomes while maximising the "margin", the distance between the hyperplane and the points closest to it. Unlike logistic regression, an SVM does not calculate a probabilistic value for the output. [Wei.Y, Tiebin.L, Rodolfo.V, Marta.G and Muin.J.K (2010). Application of Support Vector Machine modelling for prediction of common diseases: the case of diabetes and pre-diabetes. BMC Medical Informatics and Decision Making.] One of the main advantages of SVMs over other methods is that, through the use of kernels, they can transform input data that are not linearly separable into a space where they are. They also include a regularization parameter that helps avoid overfitting the data. SVMs work very well when there are many feature variables. A disadvantage of SVMs is that they require more training time than other algorithms. Also, when the features become too complex, an SVM can fail to find a separating hyperplane. SVMs work better for classification problems and, from the data given, I think this model would make a good candidate.

#### Random Forest Classifier: 

The random forest classifier has been used in medical diagnostics, for example in multiple organ segmentation and the detection of Parkinson's lesions within the brain [Olivier P. Random Forests for Medical Applications [2010]. Technische Universitat Munchen]. The algorithm follows a divide-and-conquer rule. It randomly selects a subset of the data and builds a tree from it. This is repeated again and again until the trees form a forest. The output is then decided by a vote over, or an average of, the predictions of all the trees. Random forests can deal with a huge amount of data containing many different variables. They indicate which predictors are important for the classification. They also have good visualizations and graphics. They come with an internal estimate of the error as the forest is built. They can also be run in parallel and are resistant to outliers. Drawbacks include possible overfitting and longer training times when working with a large number of predictors, as the trees and forests grow. Random forests grow their trees greedily, finding the best tree for each particular subset rather than for the whole data set. Random forests are known to work well, especially for classification problems, where they are often accurate and fast. Hence I believe this is a good model for the data.
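The "random subset, build a tree, repeat, then vote" procedure can be sketched as follows. This is a deliberately simplified illustration: each tree is a one-split stump, and only the rows are resampled (bagging), whereas a full random forest grows deeper trees and also samples features at each split. The data set and function names are made up for the example.

```python
import random
from collections import Counter

def fit_stump(sample):
    # One-split tree: try every (feature, threshold), keep the fewest errors.
    best = None
    for j in range(len(sample[0][0])):
        for x, _ in sample:
            t = x[j]
            left = [y for xi, y in sample if xi[j] <= t]
            right = [y for xi, y in sample if xi[j] > t]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = (Counter(right).most_common(1)[0][0]
                           if right else left_label)
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, j, t, left_label, right_label)
    _, j, t, left_label, right_label = best
    return lambda x: left_label if x[j] <= t else right_label

def fit_forest(data, n_trees=25, seed=0):
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        # Bootstrap: draw len(data) rows with replacement for each tree.
        sample = [rng.choice(data) for _ in data]
        trees.append(fit_stump(sample))
    return trees

def predict_forest(trees, x):
    # Majority vote across all trees.
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

# Toy data: the first feature separates the classes, the second is noise.
data = [((0.1, 5.0), "A"), ((0.3, 1.0), "A"), ((0.2, 3.0), "A"),
        ((0.4, 4.0), "A"), ((0.8, 2.0), "B"), ((0.9, 4.5), "B"),
        ((0.7, 0.5), "B"), ((0.6, 3.5), "B")]
forest = fit_forest(data)
print(predict_forest(forest, (0.05, 2.0)), predict_forest(forest, (0.95, 2.0)))
```

Because each tree is fitted independently of the others, the loop in `fit_forest` is the part that can be run in parallel.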

### Random Forests

Random forests are fast, flexible, and represent a robust approach to mining high-dimensional data. They are an extension of classification and regression trees (CART). They perform well even in the presence of a large number of features and a small number of observations. In analogy to CART, random forests can deal with continuous outcome, categorical outcome, and time-to-event outcome with censoring. The tree-building process of random forests implicitly allows for interaction between features and high correlation between features. Approaches are available for measuring variable importance and reducing the number of features. Although random forests perform well in many applications, their theoretical properties are not fully understood. Recently, several articles have provided a better understanding of random forests, and we summarize these findings. We survey different versions of random forests, including random forests for classification, random forests for probability estimation, and random forests for estimating survival data. We discuss the consequences of (1) no selection, (2) random selection, and (3) a combination of deterministic and random selection of features for random forests. We then review a backward elimination and a forward procedure, the determination of trees representing a forest, and the identification of important variables in a random forest. Finally, we provide a brief overview of different areas of application of random forests. WIREs Data Mining Knowl Discov 2014, 4:55–63. doi: 10.1002/widm.1114
