# Top 10 algorithms in data mining

The top 10 algorithms are: 

1. C4.5
2. k-Means
3. SVM
4. Apriori
5. EM
6. PageRank 
7. AdaBoost
8. kNN
9. Naive Bayes
10. CART

Together, the 10 algorithms cover:
* classification
* clustering
* statistical learning
* association analysis
* link mining

All of these are among the most important topics in data mining research and development.

 
The objective is simply to highlight what the community considered to be the state-of-the-art data mining tools.


## 1. k-means

This is one of the workhorse unsupervised algorithms. 
* The goal of k-means is simply to cluster the data by proximity to a set of `k` points (the centroids).
  * By repeatedly moving each of the `k` points to the mean of the data points closest to it, the algorithm iterates until it settles on the k means.
  
[Example](https://github.com/Egade/notes/blob/main/070_Clustering.ipynb)
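
The assign-then-update loop described above can be sketched in plain Python. This is a minimal illustration, not the code from the linked notebook: the function name `kmeans`, the toy `points`, the fixed `seed`, and the choice to initialise the centroids from randomly sampled data points are all assumptions made for this example.

```python
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm on a list of equal-length tuples (illustrative sketch)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k distinct data points
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # leave an empty cluster where it is
        if new_centroids == centroids:  # nothing moved: converged
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical toy data: two well-separated groups in the plane.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.9, 5.1)]
centroids, labels = kmeans(points, k=2)
```

The result depends on the initial centroids in general, which is why practical implementations usually run several restarts and keep the best clustering.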

## 2. EM (mixture models)

Mixture models are another workhorse algorithm for unsupervised learning. 
* The assumption underlying mixture models is that the observed data are produced by a mixture of different probability distribution functions whose weightings are unknown. 
  * Moreover, the parameters of each distribution must also be estimated, which is what the Expectation-Maximization (EM) algorithm is used for. 
  
Eg.
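
As a minimal sketch, the code below runs EM on a one-dimensional mixture of two Gaussians. The Gaussian form of the components, the function names, the toy `data`, the initial guesses, and the small variance floor are assumptions chosen for illustration; the text above does not prescribe them.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, means, variances, weights, n_iter=50):
    """EM for a 1-D Gaussian mixture; initial guesses are supplied by the caller."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            probs = [weights[j] * normal_pdf(x, means[j], variances[j])
                     for j in range(k)]
            total = sum(probs)
            resp.append([p / total for p in probs])
        # M-step: re-estimate weights, means and variances from the responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # assumption: floor to avoid collapse
    return means, variances, weights

# Hypothetical toy data drawn around -2 and 3.
data = [-2.1, -1.9, -2.0, -2.2, -1.8, 3.0, 3.2, 2.8, 3.1, 2.9]
means, variances, weights = em_gmm_1d(data, [-1.0, 1.0], [1.0, 1.0], [0.5, 0.5])
```

The E-step fills in the unknown "which component produced this point" information with soft assignments, and the M-step treats those assignments as weights when re-fitting each component.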


## 3. Support vector machine (SVM)

One of the most powerful and flexible supervised learning algorithms used for most of the 90s and 2000s, the SVM is an exceptional off-the-shelf method for classification and regression. 

The main idea: 
* project the data into higher dimensions and split the data with hyperplanes. 

Critical to making this work in practice was the kernel trick for efficiently evaluating inner products of functions in higher-dimensional space. 

Eg.
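
The kernel trick can be illustrated without training a full SVM. The sketch below assumes a degree-2 homogeneous polynomial kernel on 2-D inputs (one possible choice, picked here only because its feature map is short) and checks that the kernel value equals the inner product in the explicit higher-dimensional space. The names `feature_map` and `poly_kernel` are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def feature_map(x):
    """Explicit degree-2 feature map for a 2-D input: R^2 -> R^3."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2: (x . y)^2."""
    return dot(x, y) ** 2

x, y = (1.0, 2.0), (3.0, -1.0)
explicit = dot(feature_map(x), feature_map(y))  # inner product in the lifted space
implicit = poly_kernel(x, y)                    # same value, computed in 2-D
```

Both quantities agree, but `poly_kernel` never builds the lifted vectors. For kernels whose feature space is very high-dimensional, that saving is what makes the approach practical.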


## 4. CART (Classification And Regression Tree)

One of the most powerful techniques of supervised learning. 

The main idea:
* split the data in a principled and informed way so as to produce an interpretable partition of the data. 
  * The data splitting occurs along a single variable at a time to produce branches of the tree structure. 

Eg.
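
A single-variable split of the kind described above can be sketched as a search over thresholds. The example assumes Gini impurity as the splitting score (a common choice for classification trees) and shows only one split rather than a full recursive tree; `best_split`, `rows`, and `labels` are hypothetical names and data.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature index, threshold, weighted Gini) of the best one-variable split."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        values = sorted(set(r[f] for r in rows))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (f, t, score)
    return best

# Hypothetical toy data: feature 0 separates the classes, feature 1 does not.
rows = [(2.0, 7.0), (3.0, 1.0), (4.0, 8.0), (6.0, 2.0), (7.0, 9.0), (8.0, 3.0)]
labels = ["a", "a", "a", "b", "b", "b"]
best = best_split(rows, labels)
```

A full tree would apply `best_split` recursively to the left and right subsets until a stopping rule is met.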


## 5. k-Nearest Neighbors (kNN)

The simplest supervised algorithm to understand. 
It is highly interpretable and easy to execute. 

The main idea: 
* Given a new data point $x_k$ which does not have a label, simply find the $k$ nearest neighbors $x_j$ with labels $y_j$. 
* The label of the new point $x_k$ is determined by a majority vote among these $k$ nearest neighbors. 

Eg.
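
A minimal sketch of the majority vote, assuming Euclidean distance as the notion of "nearest" (the text does not fix a metric). The function name `knn_predict` and the toy training set are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_x)),
                   key=lambda j: math.dist(train_x[j], query))
    votes = Counter(train_y[j] for j in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labeled data: two groups in the plane.
train_x = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 6.0), (6.0, 7.0)]
train_y = ["red", "red", "red", "blue", "blue", "blue"]

near_red = knn_predict(train_x, train_y, (2.0, 2.0), k=3)
near_blue = knn_predict(train_x, train_y, (6.0, 5.0), k=3)
```

An odd `k` avoids ties in two-class problems. There is no training step; all the work happens at prediction time.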

## 6. The Naive Bayes algorithm 

The Naive Bayes algorithm provides an intuitive framework for supervised learning. 
It is simple to construct and does not require any complicated parameter estimation, similar to SVM and/or classification trees. 
It further gives highly interpretable results that are remarkably good in practice. 

The main idea: 

* The method is based upon Bayes's theorem and the computation of conditional probabilities. 
  * One can estimate the label of a new data point based on the prior probability distributions of the labeled data.

Eg.
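
The sketch below assumes continuous features modeled by one Gaussian per feature and class (often called Gaussian naive Bayes), which is only one of several possible variants. The function names, the toy data, and the small constant added to the variance are assumptions for this example.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(rows, labels):
    """Estimate a class prior and per-feature (mean, variance) for each class."""
    grouped = defaultdict(list)
    for r, y in zip(rows, labels):
        grouped[y].append(r)
    model = {}
    for y, members in grouped.items():
        prior = len(members) / len(rows)
        stats = []
        for col in zip(*members):
            mean = sum(col) / len(col)
            var = sum((v - mean) ** 2 for v in col) / len(col) + 1e-9  # avoid zero
            stats.append((mean, var))
        model[y] = (prior, stats)
    return model

def predict_gaussian_nb(model, x):
    """Return the class with the largest log posterior (up to a constant)."""
    def log_posterior(y):
        prior, stats = model[y]
        # "Naive" assumption: features are independent given the class,
        # so the log-likelihood is a sum over features.
        ll = sum(-0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)
                 for v, (mean, var) in zip(x, stats))
        return math.log(prior) + ll
    return max(model, key=log_posterior)

# Hypothetical labeled data.
rows = [(1.0, 1.1), (1.2, 0.9), (0.8, 1.0), (3.0, 3.2), (3.1, 2.9), (2.9, 3.0)]
labels = ["red", "red", "red", "blue", "blue", "blue"]
model = fit_gaussian_nb(rows, labels)
```

Calling `predict_gaussian_nb(model, (1.1, 1.0))` applies Bayes's theorem with the fitted priors and likelihoods. Working in log space avoids numerical underflow when many features are multiplied together.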

## 7. AdaBoost (ensemble learning and boosting)

AdaBoost is an example of an ensemble learning algorithm. 

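* Like a random forest, AdaBoost combines an ensemble of decision tree models, but it builds them sequentially, with each new tree focusing on the errors of the previous ones, rather than training them independently on random subsets of the data. 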

AdaBoost, like other re-weighting boosting algorithms, starts by giving an equal weighting to all training data $x_j$. 
* Boosting then re-weights the importance of the data according to how difficult they are to classify. 
* The algorithm thus focuses on data that are harder to classify. 
  * In this way, a family of weak learners can be trained to yield a strong learner by boosting the importance of hard-to-classify data. 

The concept and its usefulness are based upon a seminal theoretical contribution by Kearns and Valiant. 
* Robust boosting and gradient boosting are among the most powerful boosting techniques.

Eg.
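
The re-weighting scheme can be sketched with the simplest possible weak learner: a one-threshold "decision stump" on 1-D data with labels in {-1, +1}. The stump learner, the function names, the toy data, and the number of rounds are assumptions for illustration; the weight update follows the standard discrete AdaBoost formulas.

```python
import math

def train_stump(xs, ys, w):
    """Pick the (threshold, polarity) stump with the lowest weighted error."""
    best = None
    values = sorted(set(xs))
    thresholds = [values[0] - 1.0] + [(a + b) / 2 for a, b in zip(values, values[1:])]
    for t in thresholds:
        for polarity in (1, -1):
            preds = [polarity if x > t else -polarity for x in xs]
            err = sum(wi for wi, p, y in zip(w, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, t, polarity)
    return best

def adaboost(xs, ys, n_rounds=3):
    n = len(xs)
    w = [1.0 / n] * n  # start with equal weights on all training points
    ensemble = []
    for _ in range(n_rounds):
        err, t, polarity = train_stump(xs, ys, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # vote strength of this stump
        ensemble.append((alpha, t, polarity))
        # Raise the weight of misclassified points, lower the rest, renormalise.
        preds = [polarity if x > t else -polarity for x in xs]
        w = [wi * math.exp(-alpha * y * p) for wi, y, p in zip(w, ys, preds)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    score = sum(alpha * (polarity if x > t else -polarity)
                for alpha, t, polarity in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical toy data that no single threshold can separate.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(xs, ys, n_rounds=3)
```

On this data a single stump misclassifies at least three points, while the weighted vote of three stumps classifies every training point correctly.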



## 8. C4.5 (Ensemble learning of decision trees)

This algorithm is another variant of decision tree learning developed by J. R. Quinlan. 

The main idea: 

* the algorithm splits the data according to an information entropy score. 

* it supports boosting as well as many other well known functionalities to improve performance. 

* broadly, we can think of this as a strongly performing version of CART. 

Eg.
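
The entropy-based score can be sketched for a single categorical attribute. C4.5 normalises the information gain by the "split information" to obtain a gain ratio; the code below computes that ratio and nothing else (no tree building, pruning, or continuous attributes). The function names and the toy columns are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels):
    """Information gain of splitting on `values`, divided by the split information."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Expected entropy of the labels after the split.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(values)  # entropy of the partition sizes themselves
    return gain / split_info if split_info > 0 else 0.0

# Hypothetical toy data: `windy` predicts the label perfectly, `other` not at all.
labels = ["play", "play", "stay", "stay"]
windy = ["no", "no", "yes", "yes"]
other = ["a", "b", "a", "b"]

ratio_windy = gain_ratio(windy, labels)
ratio_other = gain_ratio(other, labels)
```

The attribute with the highest gain ratio is the one chosen for the split at that node.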

## 9. Apriori algorithm

Used to find frequent itemsets from data. 

The main idea:

* Although finding frequent itemsets from data may sound trivial, it is not: data sets tend to be very large, and the combinatorial nature of the problem can easily lead to NP-hard computations.
* The Apriori algorithm provides an efficient way of finding frequent itemsets using a candidate generation architecture. 
* This algorithm can then be used for fast learning of association rules in the data.

[Example](https://github.com/Egade/notes/blob/main/091_Apriori_algorithm.ipynb)
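
The candidate generation idea can be sketched in a few lines. This is a minimal illustration and not the code from the linked notebook: the function name `apriori`, the absolute `min_support` count, and the toy shopping baskets are assumptions made for this example.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # Count how many transactions contain each candidate; keep the frequent ones.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    items = {frozenset([i]) for t in transactions for i in t}
    frequent = count(items)
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets ...
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # ... and prune any candidate that has an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        frequent = count(candidates)
        result.update(frequent)
        k += 1
    return result

# Hypothetical toy baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]
result = apriori(transactions, min_support=3)
```

The pruning step relies on the fact that every subset of a frequent itemset must itself be frequent, so most of the combinatorially many candidates are never counted against the data.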

## 10. PageRank

The founding of Google by Sergey Brin and Larry Page revolved around the PageRank algorithm. 

* PageRank produces a static ranking of variables, such as web pages, by computing an off-line value for each variable that does not depend on search queries. 

* PageRank is associated with graph theory, as it originally interpreted a hyperlink from one page to another as a vote. 
  * From this, and various modifications of the original algorithm, one can then compute an importance score for each variable and provide an ordered rank list. 
  * The number of enhancements for this algorithm is quite large. 

Eg.
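
The "hyperlink as a vote" idea can be sketched as a power iteration on a small link graph. The damping factor of 0.85, the fixed iteration count, the handling of pages without outgoing links, and the four-page toy graph are all assumptions for this example, not details taken from the text above.

```python
def pagerank(links, damping=0.85, n_iter=100):
    """Power iteration on a dict {page: [pages it links to]}.

    Assumes every linked-to page also appears as a key of `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(n_iter):
        # Every page receives a small baseline share ...
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                # ... plus an equal share of the rank of each page linking to it.
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new_rank[q] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank

# Hypothetical four-page web.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
rank = pagerank(links)
ranking = sorted(rank, key=rank.get, reverse=True)
```

The scores depend only on the link structure, which matches the query-independent, off-line ranking described above. Page `C` collects the most votes here, and page `D`, which nothing links to, ends up last.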
