Machine learning algorithms overview

LinearSVC

LinearSVC is a Python class from the Scikit Learn library.

>>> from sklearn.svm import LinearSVC

LinearSVC performs classification.
LinearSVC finds a linear separator: a line separating the classes.
There are many possible linear separators; LinearSVC chooses the optimal one, i.e. the one that maximizes our confidence, i.e. the one that maximizes the geometrical margin, i.e. the one that maximizes the distance between itself and the closest/nearest data point.

Support vectors are the data points that are closest to the line.
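
Below is a minimal sketch of training LinearSVC on a small made-up dataset (the points and labels are invented only for illustration):

>>> from sklearn.svm import LinearSVC
>>> X = [[0, 0], [1, 1], [2, 2], [8, 8], [9, 9], [10, 10]]   # made-up samples, two features each
>>> y = [0, 0, 0, 1, 1, 1]                                   # made-up class labels
>>> clf = LinearSVC().fit(X, y)
>>> clf.predict([[1, 2], [9, 8]])                            # classify two new data points
array([0, 1])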

SVM.png

Support vector classifier

Support vector machines (SVM) are a set of supervised learning methods in the Scikit Learn library.
Support vector classifier (SVC) is a Python class capable of performing classification on a dataset.
The class SVC is in the module svm of the Scikit Learn library.

>>> from sklearn.svm import SVC
>>> clf = SVC(kernel='linear')

SVC with the parameter kernel='linear' is similar to LinearSVC: it finds the linear separator that maximizes the distance between itself and the closest/nearest data point.
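
As a small sketch (the data points are made up for illustration), the fitted SVC exposes the support vectors through its support_vectors_ attribute:

>>> from sklearn.svm import SVC
>>> X = [[0, 0], [1, 1], [2, 2], [8, 8], [9, 9], [10, 10]]   # made-up samples
>>> y = [0, 0, 0, 1, 1, 1]                                   # made-up class labels
>>> clf = SVC(kernel='linear').fit(X, y)
>>> clf.support_vectors_                                     # the data points closest to the separator
array([[2., 2.],
       [8., 8.]])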

k-nearest neighbors

k-NN classification is used with a supervised learning set (labeled training data).
K is an integer.
To classify a new data point, the algorithm computes the distance between the new data point and the existing data points.
The distance can be Euclidean, Manhattan, etc.
Once the algorithm knows the K closest neighbors of the new data point, it takes the most common class of these K closest neighbors and assigns that class to the new data point.
In other words, the new data point is assigned to the class to which the majority of its K nearest neighbors belong.
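
A minimal sketch with KNeighborsClassifier from the Scikit Learn library, where K is the n_neighbors parameter (the data is made up for illustration):

>>> from sklearn.neighbors import KNeighborsClassifier
>>> X = [[0], [1], [2], [3], [4], [5]]                    # made-up samples, one feature each
>>> y = [0, 0, 0, 1, 1, 1]                                # made-up class labels
>>> knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)   # K = 3, Euclidean distance by default
>>> knn.predict([[1.2]])                                  # most common class among the 3 nearest neighbors
array([0])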

KNN.png

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

It is an unsupervised machine learning algorithm.
It is a density-based clustering algorithm.
It groups datapoints that are in regions with many nearby neighbors: datapoints in the same cluster are more similar to each other than to those in other clusters.
Clusters are dense groups of points: dense regions in the data space, separated by regions of lower density.
If a point belongs to a cluster, it should be near many other points in that cluster.
It marks datapoints in lower-density regions as outliers.
It works like this: First, we choose two parameters, a number epsilon (distance) and a number minPoints (minimum cluster size).
epsilon is a letter of the Greek alphabet.
We then begin by picking an arbitrary point in our dataset.
If there are at least minPoints datapoints within a distance of epsilon from this point (including the point itself), this is a high-density region and a cluster is formed: all of these points are considered part of the cluster.
We then expand that cluster by checking all of the new points and seeing if they too have at least minPoints points within a distance of epsilon, growing the cluster recursively if so.
Eventually, we run out of points to add to the cluster.
We then pick a new arbitrary point and repeat the process.
Now, it's entirely possible that a point we pick has fewer than minPoints points in its epsilon range, and is also not a part of any other cluster: in that case, it's considered a "noise point" (outlier) not belonging to any cluster.
epsilon and minPoints remain the same while the algorithm is running.
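
A minimal sketch with the DBSCAN class from the Scikit Learn library, where the eps parameter plays the role of epsilon and min_samples plays the role of minPoints (the coordinates are made up for illustration; the label -1 marks a noise point):

>>> from sklearn.cluster import DBSCAN
>>> X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8], [25, 25]]   # made-up datapoints
>>> db = DBSCAN(eps=2, min_samples=3).fit(X)                         # eps = epsilon, min_samples = minPoints
>>> db.labels_                                                       # one label per datapoint, -1 = outlier
array([ 0,  0,  0,  1,  1,  1, -1])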

DBSCAN.png

k-means clustering

k-means clustering splits N data points into K groups (called clusters).
K ≤ N.

kmeans.png

A cluster is a group of data points.
Each cluster has a center, called the centroid.
A cluster centroid is the mean of a cluster (average across all the data points in the cluster).
The radius of a cluster is the maximum distance between any point in the cluster and the centroid.

radius.png
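
A small sketch of computing a centroid and a cluster radius with NumPy (the points are made up for illustration):

>>> import numpy as np
>>> cluster = np.array([[1, 1], [2, 2], [3, 1]])               # made-up points of one cluster
>>> centroid = cluster.mean(axis=0)                            # mean of the points: [2., 1.333...]
>>> radius = np.linalg.norm(cluster - centroid, axis=1).max()  # distance to the farthest point, about 1.054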

Distance between clusters = distance between centroids.
k-means clustering uses a basic iterative process.
k-means clustering splits N data points into K clusters.
Each data point will belong to a cluster, based on the nearest centroid.
The objective is to find the most compact partitioning of the data set into k partitions.
k-means makes compact clusters.
It minimizes the radius of clusters.
The objective is to minimize the variance within each cluster.
Clusters are well separated from each other.
It maximizes the average inter-cluster distance.
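
A minimal sketch with the KMeans class from the Scikit Learn library (the points are made up for illustration; the exact cluster numbering can vary with the initialization):

>>> from sklearn.cluster import KMeans
>>> X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]          # made-up datapoints
>>> km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)   # K = 2
>>> km.cluster_centers_   # the two centroids, near [1.33, 1.33] and [8.33, 8.33]
>>> km.labels_            # the cluster assigned to each data point, e.g. [0, 0, 0, 1, 1, 1]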

The file kmeans.xlsx computes one iteration of k-means with k=2

k-means clusters tend to be of the same size. Size refers to the area, not to the number of elements. Two clusters of the same area do not have to have the same number of elements (unless the data set has uniform density).

kmeans_mouse.png

The tendency of k-means to produce equal-sized clusters leads to bad results here.