# Introduction to Machine Learning

## What is machine learning?

Machine learning is the use of machines (or computers) to “learn” their own algorithms for solving problems that human programmers cannot solve efficiently by hand. In the context of data science, computer programs can make predictions and decisions based purely on the data, without being explicitly programmed to do so.

There are three general approaches to machine learning: supervised learning, unsupervised learning, and reinforcement learning.
- **Supervised learning**: How do we map inputs to outputs?
- **Unsupervised learning**: What are the patterns in the data?
- **Reinforcement learning**: How do we identify rules to make the best decision for any situation?

In this module, we will mainly focus on supervised learning and unsupervised learning.

Note: Reinforcement learning differs substantially from supervised and unsupervised learning. In supervised and unsupervised learning, the model learns from an existing dataset. In reinforcement learning, an agent instead learns by interacting with an environment and receiving rewards or penalties for the actions it takes.


## Categorizing machine learning algorithms

![ML-1.jpg](attachment:ML-1.jpg)

The type of machine learning approach we can take depends on the data we are working with.

Supervised learning algorithms use labeled training data (where each input is paired with its output) to build a mathematical model that maps inputs to outputs. There are two classical types of supervised learning: regression and classification. Recall from the data science module that the goal of **regression modeling** is to make predictions using the existing data. **Classification** uses the prediction generated by a model (which can be a regression model) to assign data to separate groupings.

Unsupervised learning algorithms use unlabeled data (where inputs have no corresponding outputs) to uncover underlying structure within the data. There are two classical types of unsupervised learning: dimensionality reduction and clustering. In **dimensionality reduction**, the goal is to reduce the dataset to only its fundamental features. **Clustering** groups data points that are similar to one another into clusters.

We will focus on classification and clustering in this module.

## Classification

![ML-2.jpg](attachment:ML-2.jpg)

The example above depicts the relationship between serum sodium level and red blood cell count for patients with and without chronic kidney disease (CKD). Using this data, how would we determine whether a new patient has CKD if we are given only their serum sodium level and red blood cell count?

### K-Nearest Neighbors (k-NN)

One approach to this question is the **k-nearest neighbors algorithm**. The idea is to classify any new data point based on the labels of the k nearest data points. **Euclidean distance** is commonly used to measure the distance between data points, and **majority voting** among those neighbors determines the class.

![ML-3.png](attachment:ML-3.png)

In this example, the green circle is the new sample. With k = 3 (majority voting by three neighbors, solid line), the sample is classified as a red triangle. With k = 5 (majority voting by five neighbors, dotted line), it is classified as a blue square.
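The voting rule described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the function name `knn_predict` and the toy coordinates are assumptions made up for this example, chosen so that the two nearest neighbors are triangles and the next three are squares.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbors
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbors' labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data (not the coordinates from the figure)
train_points = [(1.0, 0.0), (0.0, 1.2), (-1.5, 0.0), (0.0, -2.0), (2.2, 0.0)]
train_labels = ["triangle", "triangle", "square", "square", "square"]
query = (0.0, 0.0)

print(knn_predict(train_points, train_labels, query, k=3))  # two triangles outvote one square
print(knn_predict(train_points, train_labels, query, k=5))  # three squares outvote two triangles
```

Choosing an odd k for a two-class problem avoids tied votes, which this sketch does not handle specially.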

### Decision Tree

Another approach is the decision tree. As you may have expected, a **decision tree** is a series of conditions that help us classify any new data point. Below is an example of a decision tree that creates decision boundaries for the given plot. As you may have noticed, the decision tree algorithm is **not deterministic** (there can be many valid decision trees for the same dataset).

![ML-4.png](attachment:ML-4.png)
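To make the "series of conditions" idea concrete, here is a small sketch in which two hand-written trees both classify the same training data perfectly. The data values, the thresholds, and the names `tree_a` and `tree_b` are all made up for illustration. They are not taken from the figure or from any real patient data.

```python
# Hypothetical training data: (serum sodium, red blood cell count) -> label
samples = [
    ((125.0, 3.2), "CKD"),
    ((130.0, 3.9), "CKD"),
    ((141.0, 3.6), "CKD"),
    ((131.0, 4.9), "CKD"),
    ((140.0, 5.1), "non-CKD"),
    ((142.0, 4.8), "non-CKD"),
    ((138.0, 5.4), "non-CKD"),
]

def tree_a(sodium, rbc):
    """First tree: split on sodium, then on red blood cell count."""
    if sodium <= 134.0:
        return "CKD"
    if rbc <= 4.3:
        return "CKD"
    return "non-CKD"

def tree_b(sodium, rbc):
    """Second tree: split on red blood cell count first, with different thresholds."""
    if rbc <= 4.5:
        return "CKD"
    if sodium <= 136.0:
        return "CKD"
    return "non-CKD"

# Both trees are valid: each one classifies every training sample correctly
for tree in (tree_a, tree_b):
    print(tree.__name__, all(tree(*features) == label for features, label in samples))

# Yet their decision boundaries differ, so they can disagree on a new point
new_patient = (135.0, 5.0)
print(tree_a(*new_patient), tree_b(*new_patient))
```

In practice a library would learn the split conditions from the data. Writing them by hand here only shows that more than one tree can fit the same dataset.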

### Random Forest

The **random forest** algorithm builds upon the decision tree classifier by constructing multiple decision trees during training. This is done using the **bootstrapping method**, in which data points (repeats allowed) and features are randomly selected to create multiple new training datasets, each producing a decision tree. As in the k-nearest neighbors algorithm, majority voting across all the decision trees determines the final classification.
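The sketch below illustrates the two ingredients named above, bootstrapping and majority voting, in plain Python. To keep it short, each "tree" is replaced by a one-split stump whose threshold sits halfway between the two class means. That is a deliberate simplification and not how a real decision tree chooses its splits. All function names and data values are hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows at random with replacement (repeats allowed)."""
    return [rng.choice(rows) for _ in rows]

def fit_stump(rows, feature):
    """Stand-in for a decision tree: one split on one feature,
    placed halfway between the two class means (binary labels assumed)."""
    values_by_label = {}
    for features, label in rows:
        values_by_label.setdefault(label, []).append(features[feature])
    means = {label: sum(v) / len(v) for label, v in values_by_label.items()}
    if len(means) == 1:
        # The bootstrap sample contained a single class: always predict it
        only_label = next(iter(means))
        return lambda features: only_label
    low_label, high_label = sorted(means, key=means.get)
    threshold = (means[low_label] + means[high_label]) / 2
    return lambda features: low_label if features[feature] <= threshold else high_label

def fit_forest(rows, n_trees, rng):
    """Train each stump on its own bootstrap sample and one random feature."""
    n_features = len(rows[0][0])
    return [
        fit_stump(bootstrap_sample(rows, rng), rng.randrange(n_features))
        for _ in range(n_trees)
    ]

def forest_predict(forest, features):
    """Majority vote across all trees in the forest."""
    votes = Counter(tree(features) for tree in forest)
    return votes.most_common(1)[0][0]

# Hypothetical data: (serum sodium, red blood cell count) -> label
rows = [
    ((125.0, 3.2), "CKD"), ((130.0, 3.9), "CKD"), ((128.0, 3.5), "CKD"),
    ((140.0, 5.1), "non-CKD"), ((142.0, 4.8), "non-CKD"), ((138.0, 5.4), "non-CKD"),
]
forest = fit_forest(rows, n_trees=25, rng=random.Random(0))
print(forest_predict(forest, (124.0, 3.0)))
print(forest_predict(forest, (143.0, 5.5)))
```

Because each stump sees a different resampled dataset and feature, individual stumps can disagree. The vote smooths those differences out.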

### Support Vector Machine (SVM)

Another common approach to the classification problem is the **support vector machine algorithm**. It creates a decision boundary that separates the data into their respective groupings while maximizing the distance from the boundary to the nearest data points of each grouping (the margin). As in the k-nearest neighbors algorithm, Euclidean distance is used to measure distance.

![ML-5.png](attachment:ML-5.png)

In the above example, both the blue and red lines correctly separate the data into the appropriate groupings. However, only the red line maximizes the distance between the groupings and the decision boundary.
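The comparison between two separating lines can be checked numerically. The sketch below computes the margin of two candidate lines, using the standard formula for the distance from a point to a line. The points and both candidate lines are hypothetical and do not correspond to the figure. A real SVM solver searches over all possible lines instead of comparing only two.

```python
import math

def margin(w, b, points, labels):
    """Smallest signed distance from any point to the line w[0]*x + w[1]*y + b = 0.

    Labels are +1 or -1. The result is positive only when every point lies
    on the side of the line that matches its label.
    """
    norm = math.hypot(w[0], w[1])
    return min(
        label * (w[0] * x + w[1] * y + b) / norm
        for (x, y), label in zip(points, labels)
    )

# Hypothetical, linearly separable toy data
points = [(2, 2), (3, 3), (3, 2), (0, 0), (1, 0), (0, 1)]
labels = [1, 1, 1, -1, -1, -1]

# Two candidate boundaries, each given as (w, b)
candidates = {
    "vertical line x = 1.5": ((1.0, 0.0), -1.5),
    "diagonal line x + y = 2.5": ((1.0, 1.0), -2.5),
}
margins = {name: margin(w, b, points, labels) for name, (w, b) in candidates.items()}
best = max(margins, key=margins.get)

print(margins)
print("larger margin:", best)
```

Both lines separate the two groups (both margins are positive), but the diagonal line keeps the nearest points farther away, which is the property the SVM optimizes for.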

### Algorithm determines the decision boundary

Because these classification algorithms approach the same classification problem differently, the resulting **decision boundary** can either be similar or vastly different, as shown below.

![ML-6.png](attachment:ML-6.png)

You may have noticed that the RBF support vector machine decision boundary is not linear. How is that even possible? The RBF (Radial Basis Function) support vector machine uses a more complex algorithm that transforms the data into a higher-dimensional space where it is linearly separable. When that decision boundary is mapped back into the two-dimensional space, it results in a non-linear separation of the data points.
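The idea of lifting data into a higher-dimensional space can be shown with a much simpler transformation than the one an RBF kernel actually uses. In the sketch below, the explicit map (x, y) → (x, y, x² + y²) is a stand-in chosen for illustration. It is not the RBF kernel itself, and the data and the function name `lift` are hypothetical.

```python
import math

def lift(point):
    """Map a 2-D point (x, y) to 3-D by adding the coordinate z = x^2 + y^2."""
    x, y = point
    return (x, y, x * x + y * y)

# Hypothetical data: one class on a circle of radius 1, the other on radius 3.
# No straight line in 2-D can separate an inner ring from an outer ring.
angles = [i * math.pi / 4 for i in range(8)]
inner = [(math.cos(a), math.sin(a)) for a in angles]
outer = [(3 * math.cos(a), 3 * math.sin(a)) for a in angles]

# After lifting, the flat plane z = 4 separates the two classes
inner_z = [lift(p)[2] for p in inner]
outer_z = [lift(p)[2] for p in outer]
separable = max(inner_z) < 4 < min(outer_z)
print(separable)
```

Mapped back to two dimensions, the plane z = 4 becomes the circle x² + y² = 4, which is a non-linear boundary in the original space.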

As noted in the earlier section, decision trees are not deterministic. The figure below shows two different possible decision boundaries generated from the same data, model, and algorithm. Notice that the random forest decision boundary encompasses elements of both decision tree decision boundaries. This is due to the majority voting aspect of the random forest algorithm.

![ML-7.png](attachment:ML-7.png)

## Clustering

![ML-8.jpg](attachment:ML-8.jpg)

Now we are looking at the same data and relationship (serum sodium level and red blood cell count), except this time we do not know whether the patients have CKD. Using this data, how would we determine which groups of patients are similar?

### K-Means Clustering

One approach to this question is the **k-means clustering** algorithm. The idea is to first assign the data to k clusters and then repeatedly update the clusters until convergence is reached. During each cycle, each data point is reassigned to the cluster with the nearest **mean (centroid)** based on Euclidean distance, and each mean is then recalculated from its newly assigned data points.

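Note that the k-means clustering algorithm is not the same as the k-nearest neighbors algorithm. K-means is an unsupervised clustering method in which k is the number of clusters, whereas k-nearest neighbors is a supervised classification method in which k is the number of neighbors that vote.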

![ML-9.png](attachment:ML-9.png)
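The assign-then-update cycle can be written out directly. The following is a minimal sketch in plain Python, assuming the initial centroids are supplied by the caller. The function name `kmeans`, the starting centroids, and the toy points are made up for this example.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
        if new_centroids == centroids:  # convergence: nothing changed
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical unlabeled data with two visible groups
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])

print(centroids)
print(clusters)
```

The result depends on the starting centroids. Library implementations typically try several random initializations and keep the best one.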

It is important to remember that k-means clusters are determined purely by the centroid of each cluster. These clusters may not match the original labels of the data, as shown below (centroids are represented by the larger stars).

![ML-10.jpg](attachment:ML-10.jpg)

## Core machine learning libraries

Scikit-learn and SciPy are standard libraries used for machine learning and optimization functions, respectively.
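As a brief taste of both libraries, the sketch below fits a k-nearest neighbors classifier and a k-means model with scikit-learn, then minimizes a simple function with SciPy. It assumes NumPy, SciPy, and scikit-learn are installed. The toy data and the quadratic objective are hypothetical choices made for this example.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: two well-separated groups of 2-D points
X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# Supervised learning: k-nearest neighbors with k = 3, trained on labeled data
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
pred = knn.predict([[2.0, 2.0], [7.0, 9.0]])

# Unsupervised learning: k-means with k = 2, which never sees the labels
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_ids = km.labels_

# Optimization with SciPy: find the x that minimizes (x - 3)^2
result = minimize(lambda x: (x[0] - 3.0) ** 2, x0=[0.0])

print(pred)
print(cluster_ids)
print(result.x)
```

Note that the cluster numbers assigned by k-means are arbitrary, so cluster 0 need not correspond to label 0.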