Comparison of Machine Learning Models

K-nearest neighbors (KNN)

Advantages:

  • Simple to understand and explain
  • Model training is fast
  • Can be used for classification and regression

Disadvantages:

  • Must store all of the training data
  • Prediction can be slow when the number of training observations (n) is large
  • Sensitive to irrelevant features
  • Sensitive to the scale of the data
  • Accuracy is (generally) not competitive with the best supervised learning methods
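The sensitivity to scale can be illustrated with a minimal sketch in plain Python. The data, the two features (income in dollars, age in years), and the helper names are hypothetical, made up for illustration. With raw values, income dominates the Euclidean distance; after standardizing with the training set's mean and standard deviation, age matters too and the 1-nearest-neighbor prediction changes.

```python
import math

def column_stats(points):
    """Return a (mean, standard deviation) pair for each column."""
    stats = []
    for col in zip(*points):
        mean = sum(col) / len(col)
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
        stats.append((mean, std))
    return stats

def standardize(point, stats):
    """Rescale one point using precomputed training statistics."""
    return tuple((v - mean) / std for v, (mean, std) in zip(point, stats))

def nearest_label(train_X, train_y, query):
    """1-nearest-neighbor prediction using Euclidean distance."""
    distances = [math.dist(x, query) for x in train_X]
    return train_y[distances.index(min(distances))]

# Hypothetical training data: (income, age) -> label driven by age
train_X = [(30000, 25), (30500, 60), (60000, 26), (61000, 58)]
train_y = ["young", "old", "young", "old"]
query = (30400, 27)

# Raw features: the income gap (hundreds) swamps the age gap (tens)
raw_prediction = nearest_label(train_X, train_y, query)

# Standardized features: both columns contribute on a comparable scale
stats = column_stats(train_X)
scaled_X = [standardize(x, stats) for x in train_X]
scaled_prediction = nearest_label(scaled_X, train_y, standardize(query, stats))

print(raw_prediction, scaled_prediction)
```

On this toy data the raw prediction follows the closest income, while the scaled prediction follows the closest age, which is the feature that actually determines the label here.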

Linear Regression

Advantages:

  • Simple to explain
  • Highly interpretable
  • Model training and prediction are fast
  • No tuning is required (excluding regularization)
  • Features don't need scaling
  • Can perform well with a small number of observations
  • Well-understood

Disadvantages:

  • Presumes a linear relationship between the features and the response
  • Performance is (generally) not competitive with the best supervised learning methods due to high bias
  • Can't automatically learn feature interactions
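The point about feature interactions can be sketched with a tiny, hypothetical dataset in which the response is exactly the product of two features. The helper names are illustrative, and the dot-product shortcut for the least-squares coefficients is valid here only because the columns were chosen to be mutually orthogonal. Without the product column the best linear fit predicts zero everywhere; once the interaction is added by hand, the fit is exact.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def fit_orthogonal(columns, y):
    """Least-squares coefficients, assuming mutually orthogonal columns."""
    return [dot(col, y) / dot(col, col) for col in columns]

def sse(columns, coefs, y):
    """Sum of squared errors of the fitted linear model."""
    preds = [sum(c * col[i] for c, col in zip(coefs, columns))
             for i in range(len(y))]
    return sum((p - t) ** 2 for p, t in zip(preds, y))

# Hypothetical design: two features on the corners of a square
ones = [1, 1, 1, 1]
x1 = [-1, -1, 1, 1]
x2 = [-1, 1, -1, 1]
y = [a * b for a, b in zip(x1, x2)]  # response is a pure interaction

# Main effects only: the model cannot represent x1 * x2
main_effects = [ones, x1, x2]
coefs_main = fit_orthogonal(main_effects, y)
sse_main = sse(main_effects, coefs_main, y)

# Manually engineered interaction column
x1_x2 = [a * b for a, b in zip(x1, x2)]
with_interaction = main_effects + [x1_x2]
coefs_inter = fit_orthogonal(with_interaction, y)
sse_inter = sse(with_interaction, coefs_inter, y)

print(coefs_main, sse_main)
print(coefs_inter, sse_inter)
```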

Logistic Regression

Advantages:

  • Highly interpretable (once you know how to read coefficients as changes in log-odds)
  • Model training and prediction are fast
  • No tuning is required (excluding regularization)
  • Features don't need scaling
  • Can perform well with a small number of observations
  • Outputs well-calibrated predicted probabilities

Disadvantages:

  • Presumes a linear relationship between the features and the log-odds of the response
  • Performance is (generally) not competitive with the best supervised learning methods
  • Can't automatically learn feature interactions
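The linearity assumption on the log-odds can be checked numerically. The sketch below uses a hypothetical one-feature model whose intercept and coefficient are made-up values. A one-unit increase in the feature always adds the coefficient to the log-odds, while the change in predicted probability depends on where you start.

```python
import math

def predict_proba(x, intercept, coef):
    """Predicted probability from a one-feature logistic regression."""
    return 1.0 / (1.0 + math.exp(-(intercept + coef * x)))

def log_odds(p):
    return math.log(p / (1.0 - p))

# Hypothetical fitted parameters
intercept, coef = -1.0, 0.5

xs = [-2.0, 0.0, 4.0]

# Change in probability for a one-unit increase, at different starting points
prob_steps = [
    predict_proba(x + 1, intercept, coef) - predict_proba(x, intercept, coef)
    for x in xs
]

# Change in log-odds for the same one-unit increase
log_odds_steps = [
    log_odds(predict_proba(x + 1, intercept, coef))
    - log_odds(predict_proba(x, intercept, coef))
    for x in xs
]

print(prob_steps)      # varies with x
print(log_odds_steps)  # each value is approximately equal to coef
```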

Naive Bayes

Advantages:

  • Model training and prediction are very fast
  • Somewhat interpretable
  • No tuning is required
  • Features don't need scaling
  • Insensitive to irrelevant features (with enough observations)
  • Can perform better than logistic regression when the training set is very small

Disadvantages:

  • Predicted probabilities are not well-calibrated
  • Correlated features can be problematic (due to the independence assumption)
  • Multinomial Naive Bayes can't handle negative feature values
  • Has a higher "asymptotic error" (error as the training set grows very large) than logistic regression
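The effect of correlated features on the predicted probabilities can be sketched as follows. The likelihood values and the function name are hypothetical. Because Naive Bayes multiplies per-feature likelihoods as if they were independent, feeding it the same evidence three times (an extreme case of correlation) pushes the posterior toward 1 even though no new information was added.

```python
def naive_bayes_posterior(prior_pos, likelihoods_pos, likelihoods_neg):
    """P(positive | features), treating every feature as independent."""
    score_pos = prior_pos
    score_neg = 1.0 - prior_pos
    for lik_pos, lik_neg in zip(likelihoods_pos, likelihoods_neg):
        score_pos *= lik_pos
        score_neg *= lik_neg
    return score_pos / (score_pos + score_neg)

# Hypothetical feature: P(x | positive) = 0.8, P(x | negative) = 0.4
prior = 0.5

# The feature counted once
once = naive_bayes_posterior(prior, [0.8], [0.4])

# The same feature duplicated twice more (perfectly correlated copies)
thrice = naive_bayes_posterior(prior, [0.8] * 3, [0.4] * 3)

print(round(once, 4), round(thrice, 4))
```

The evidence is identical in both calls, yet the second posterior is noticeably more confident, which is one reason the predicted probabilities tend to be poorly calibrated.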

Decision Trees

Advantages:

  • Can be used for regression or classification
  • Can be displayed graphically
  • Highly interpretable
  • Can be specified as a series of rules, and more closely approximates human decision-making than other models
  • Prediction is fast
  • Features don't need scaling
  • Automatically learns feature interactions
  • Tends to ignore irrelevant features
  • Non-parametric (can outperform linear models if the relationship between features and response is highly non-linear)

Disadvantages:

  • Performance is (generally) not competitive with the best supervised learning methods
  • Can easily overfit the training data (tuning is required)
  • Small variations in the data can result in a completely different tree (high variance)
  • Recursive binary splitting makes "locally optimal" decisions that may not result in a globally optimal tree
  • Doesn't tend to work well if the classes are highly unbalanced
  • Doesn't tend to work well with very small datasets
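A single step of recursive binary splitting might look like the sketch below. It assumes Gini impurity as the split criterion (one common choice, not necessarily the one the author has in mind) and uses hypothetical data and function names. The search picks the split with the lowest weighted impurity at this step only; it never looks ahead, which is why the resulting tree is locally rather than globally optimal.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) for the best split."""
    best = None
    n = len(rows)
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            if not left or not right:
                continue  # skip splits that leave one side empty
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (j, t, score)
    return best

# Hypothetical data: feature 0 separates the classes, feature 1 does not
rows = [(2.0, 7.0), (3.0, 1.0), (6.0, 8.0), (7.0, 2.0)]
labels = ["a", "a", "b", "b"]

split = best_split(rows, labels)
print(split)
```

A full tree would apply `best_split` again to each side of the chosen split until a stopping rule is met; that stopping rule is where tuning comes in.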

Random Forests

Advantages (compared to decision trees):

  • Performance is competitive with the best supervised learning methods
  • Provides a more reliable estimate of feature importance
  • Allows you to estimate out-of-sample error without using train/test split or cross-validation (via the out-of-bag error)

Disadvantages (compared to decision trees):

  • Less interpretable
  • Slower to train
  • Slower to predict
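The out-of-sample error estimate relies on each tree being trained on a bootstrap sample, so some observations are left out of that tree's training data and can be used to evaluate it. The sketch below only simulates the sampling step, with hypothetical helper names and an arbitrary dataset size, to show that roughly a third of the observations are "out-of-bag" for any given tree.

```python
import random

def bootstrap_indices(n, rng):
    """Draw n row indices with replacement."""
    return [rng.randrange(n) for _ in range(n)]

def out_of_bag(n, in_bag):
    """Indices never drawn into the bootstrap sample."""
    return sorted(set(range(n)) - set(in_bag))

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 1000                # hypothetical number of training observations

fractions = []
for _ in range(200):    # one bootstrap sample per hypothetical tree
    in_bag = bootstrap_indices(n, rng)
    fractions.append(len(out_of_bag(n, in_bag)) / n)

mean_oob_fraction = sum(fractions) / len(fractions)
print(round(mean_oob_fraction, 3))  # close to (1 - 1/n) ** n, about 0.368
```

Predicting each observation using only the trees for which it was out-of-bag gives an error estimate on data those trees never saw.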

Regularized Linear Models

Advantages (compared to unregularized linear models):

  • Often better out-of-sample performance (less overfitting)
  • L1 regularization performs automatic feature selection
  • Useful for high-dimensional problems (p > n)

Disadvantages (compared to unregularized linear models):

  • Tuning is required
  • Feature scaling is recommended
  • Less interpretable (coefficients apply to the scaled features rather than the original units)
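Why L1 regularization selects features while L2 only shrinks them can be sketched in a simplified setting. The example assumes an orthonormal design matrix, where the penalized solutions have closed forms in terms of the unregularized coefficients: soft-thresholding for L1 and proportional shrinkage for L2. The starting coefficients and the penalty strength are hypothetical.

```python
def soft_threshold(coef, lam):
    """L1 (lasso) solution for one coefficient under an orthonormal design."""
    if coef > lam:
        return coef - lam
    if coef < -lam:
        return coef + lam
    return 0.0  # small coefficients are set exactly to zero

def ridge_shrink(coef, lam):
    """L2 (ridge) solution for one coefficient under an orthonormal design."""
    return coef / (1.0 + lam)

# Hypothetical unregularized coefficients and penalty strength
ols = [3.0, -0.4, 0.2, -2.5]
lam = 0.5

lasso = [soft_threshold(c, lam) for c in ols]
ridge = [ridge_shrink(c, lam) for c in ols]

print(lasso)  # the two small coefficients become exactly 0.0
print(ridge)  # every coefficient shrinks, none reaches zero
```

With a general (non-orthonormal) design the solutions are no longer this simple, but the qualitative behavior is the same: L1 zeroes out weak features and L2 keeps them all.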