### Classification Definition

* Classification, in machine learning and statistics, is a supervised learning approach in which a program learns from labeled data and then assigns class labels to new observations.

* The computer is presented with both example inputs and their respective outputs. The algorithm learns a general rule to map the inputs to the outputs.

**Two Stages**

* **Training** - We show the model a set of inputs along with their respective outputs. The model's task is to learn a mapping from the inputs to the outputs.

<img src="https://raw.githubusercontent.com/msameeruddin/Data-Analysis-Python/main/7_DA_Machine_Learning/training.png">

* **Testing/Evaluation** - We show the model a set of new inputs without their respective outputs. The model's aim is to predict the outputs based on what it learned during training.

<img src="https://raw.githubusercontent.com/msameeruddin/Data-Analysis-Python/main/7_DA_Machine_Learning/testing.png">

The model predicts the category based on the previous training or learning.

### Mathematical Notation

* In any dataset, $x_i$ represents the input and $y_i$ represents the class label. If we have two class labels (+ve and -ve), we can encode them as $\{0, 1\}$.

* It is important to convert textual data into vectors for mathematically mapping inputs with respect to class labels.

$$D_n = \big\{(x_i, y_i)_{i = 1}^n \ | \ x_i \in \rm I\!R^d \ \& \ y_i \in \{0, 1\}\big\}$$

### Classification VS Regression

* For classification, we have a finite set of class labels from which we predict the class of a new data point.
    - Binary Classification - {0, 1}
    - Multi-class Classification (MNIST) - {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
* For regression, the output values are generally continuous (as opposed to a finite set of classes).
    - Predicting the height (inches) based on the weight of the person

### Geometric Intuition

* Suppose we have a data set that is plotted (visualized).

![data_set-knn](https://user-images.githubusercontent.com/63333753/115667178-2acb7a00-a363-11eb-94cd-5770228fd3b6.png)

* Let's introduce a new query point ($x_q$) plotted along with the data (magenta color).

![qp-introduced](https://user-images.githubusercontent.com/63333753/115667817-e55b7c80-a363-11eb-995d-86857e3c31ca.png)

* How do we determine which category $x_q$ belongs to?

    - Find the $k$ (=3) nearest points to $x_q$ in the data set.
    - Apply a voting system and take the majority class among those $k$ points.
        * Since $k = 3$, we have the points $\{x_1, x_2, x_3 \}$.
        * Find the respective classes of the points. Let $\{y_1, y_2, y_3 \}$ be the classes, with values $\{\text{positive}, \text{positive}, \text{negative} \}$.
        * The majority is $\text{positive}$, hence $x_q$ belongs to the positive category (blue).
    
    - **Note** - For binary classification, it is recommended to choose an odd $k$ so that the vote cannot end in a tie.

![geometric-intuit-knn](https://user-images.githubusercontent.com/63333753/115669067-8139b800-a365-11eb-93f8-f655e9b4d5df.png)

This is the geometric intuition behind the KNN algorithm.
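The voting steps above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `knn_classify` and the toy data are hypothetical, and Euclidean distance is assumed as the notion of "nearest".

```python
from collections import Counter

import numpy as np


def knn_classify(X_train, y_train, x_q, k=3):
    """Label x_q by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    # Euclidean distance from x_q to every training point
    dists = np.linalg.norm(X_train - np.asarray(x_q, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]  # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two positive points near the origin, two negative ones far away
X_toy = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]]
y_toy = ["positive", "positive", "negative", "negative"]

label = knn_classify(X_toy, y_toy, x_q=[0.5, 0.5], k=3)
print(label)  # positive (two of the three nearest points are positive)
```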

### Distance Measures

A distance is measured between two points, whereas a norm measures the length of a single vector in $d$-dimensional space. The distance between $x_1$ and $x_2$ is the norm of the difference vector $x_1 - x_2$.

* **Euclidean Distance** between two points
    - $x_1 = (x_{11}, x_{12}) \ \text{and} \ x_2 = (x_{21}, x_{22}) \ \text{is} \ \rightarrow ||x_1 - x_2|| = \sqrt{(x_{21} - x_{11})^2 + (x_{22} - x_{12})^2}$
    - We can also represent it in the form of $||x_1 - x_2||_2$
    - In general, for $d$ dimensions, $$||x_1 - x_2||_2 = \bigg[\sum_{i = 1}^d \big(x_{1i} - x_{2i}\big)^2 \bigg]^{\frac{1}{2}} \implies L_2 \ \text{Norm}$$

* **Manhattan Distance** between two points
    - $x_1 = (x_{11}, x_{12}, \dots, x_{1d}) \ \text{and} \ x_2 = (x_{21}, x_{22}, \dots, x_{2d}) \ \text{is} \ \rightarrow ||x_1 - x_2|| = \sum_{i = 1}^d \big| x_{1i} - x_{2i} \big|$
    - We can also represent it in the form of $||x_1 - x_2||_1 \implies L_1 \ \text{Norm}$

* **Minkowski Distance** between two points
    - $x_1 = (x_{11}, x_{12}, \dots, x_{1d}) \ \text{and} \ x_2 = (x_{21}, x_{22}, \dots, x_{2d}) \ \text{is} \ \rightarrow ||x_1 - x_2|| = \bigg[\sum_{i = 1}^d \big|x_{1i} - x_{2i}\big|^p \bigg]^{\frac{1}{p}}$
    - We can also represent it in the form of $||x_1 - x_2||_p \implies L_p \ \text{Norm}$
    - If $(p = 1) \rightarrow$ Manhattan Distance
    - If $(p = 2) \rightarrow$ Euclidean Distance
    - $p$ must always be $> 0$ (and $p \geq 1$ for the triangle inequality to hold)

* **Hamming Distance** is measured between two vectors
    - Let $x_1$ be $[0, 1, 1, 0, 1, 0, 0]$ and $x_2$ be $[1, 0, 1, 0, 1, 0, 1]$.
    - Hamming distance is the number of positions (dimensions) at which the two binary vectors differ.
    - $\text{Ham}(x_1, x_2) = 3$

In [1]:
import numpy as np

class DistanceMeasures():
    def __init__(self, point1, point2):
        self.point1 = np.array(point1)
        self.point2 = np.array(point2)
        self.flag = len(self.point1) == len(self.point2)  # True when the dimensions match
    
    def euclidean_measure(self):
        if self.flag:
            dist = np.linalg.norm(self.point1 - self.point2)
            return round(dist, 4)
        return None
    
    def manhattan_measure(self):
        if self.flag:
            dist = np.abs(self.point1 - self.point2).sum()
            return round(dist, 4)
        return None
    
    def minkowski_measure(self, p):
        if not self.flag or (p <= 0):  # reject mismatched dimensions or a non-positive p
            return None
        
        if (p == 1):
            return self.manhattan_measure()
        elif (p == 2):
            return self.euclidean_measure()
        
        dist = np.sum(np.abs(self.point1 - self.point2) ** p) ** (1 / p)
        return round(dist, 4)
    
    def hamming_measure(self):
        if self.flag:
            return np.sum(self.point1 != self.point2)
        return None

In [2]:
p1 = [1, 2, 3, 4, 5]
p2 = [4, 3, 1, 4, 5]

dist = DistanceMeasures(point1=p1, point2=p2)

print("Euclidean \t →", dist.euclidean_measure())
print("Manhattan \t →", dist.manhattan_measure())
print("Minkowski \t →", dist.minkowski_measure(p=3))
print("Hamming \t →", dist.hamming_measure())

Euclidean 	 → 3.7417
Manhattan 	 → 6
Minkowski 	 → 3.3019
Hamming 	 → 3


### Relationship b/w Cosine Similarity and Cosine Distance

Study - https://cmry.github.io/notes/euclidean-v-cosine

* The relationship between the two is inverse: as the distance increases, the similarity decreases, and vice versa.

**General Notation**

$$1 - \text{cosine-sim}(x_1, x_2) = \text{cosine-dis}(x_1, x_2)$$

where

* $x_1 = [x_{11}, x_{12}, \dots, x_{1d}]$
* $x_2 = [x_{21}, x_{22}, \dots, x_{2d}]$

$$\text{cosine-sim}(x_1, x_2) = \frac{x_1 \cdot x_2}{||x_1||_2 \ ||x_2||_2} = \frac{x_1 \cdot x_2}{\sqrt{x_1 \cdot x_1} \sqrt{x_2 \cdot x_2}}$$
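The two formulas above translate directly into NumPy. The following is a small sketch; the function names are hypothetical, and it assumes neither vector is the zero vector (the similarity is undefined in that case).

```python
import numpy as np


def cosine_similarity(x1, x2):
    """Dot product divided by the product of the L2 norms."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    # Assumption: both vectors are non-zero, otherwise this divides by zero
    return float(np.dot(x1, x2) / (np.linalg.norm(x1) * np.linalg.norm(x2)))


def cosine_distance(x1, x2):
    """cosine-dis = 1 - cosine-sim"""
    return 1.0 - cosine_similarity(x1, x2)


a, b, c = [1, 0], [0, 1], [2, 0]
print(cosine_distance(a, c))  # 0.0 (same direction, different length)
print(cosine_distance(a, b))  # 1.0 (perpendicular vectors)
```

Note that `a` and `c` have a Euclidean distance of 1 but a cosine distance of 0, because cosine distance depends only on the angle between the vectors.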

<br>

* **Training error** is the error that you get when you run the trained model back on the training data.
    - Remember that this data has already been used to train the model, yet this does not necessarily mean that the trained model will perform accurately when applied back to the training data itself.

* **Test error** is the error that you get when you run the trained model on a set of data that it has never been exposed to before.
    - This data is often used to measure the accuracy of the model before it is shipped to production.

### Test / Evaluation - Time & Space Complexity

Study - https://kevinzakka.github.io/2016/07/13/k-nearest-neighbor/

Below is a simple sketch used to determine the time and space complexity of KNN.

$x_q \rightarrow y_q$

**Inputs** - $D_{Train}$, $k$, $x_q \in \rm I\!R^d$

**Output** - $y_q$

**Algo**

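```python
import heapq
import numpy as np

def knn_predict(D_train, k, xq):
    """Brute-force KNN; D_train is a sequence of (xi, yi) pairs, yi in {0, 1}."""
    knn_pts = []  # max-heap (negated distance) holding the k nearest points seen so far
    for i, (xi, yi) in enumerate(D_train):  # O(n)
        di = np.linalg.norm(np.asarray(xi) - np.asarray(xq))  # O(d)
        # keep only the k smallest distances ==> (-di, i, yi)
        if len(knn_pts) < k:
            heapq.heappush(knn_pts, (-di, i, yi))  # O(log k), ~O(1) for small k
        elif -di > knn_pts[0][0]:  # closer than the farthest point kept so far
            heapq.heapreplace(knn_pts, (-di, i, yi))
    # Total time complexity → O(nd), treating k as a small constant

    cnt_pos = 0
    cnt_neg = 0
    for _, _, yi in knn_pts:  # O(k)
        if yi > 0:  # O(1)
            cnt_pos += 1
        else:
            cnt_neg += 1

    # O(1)
    return 1 if cnt_pos > cnt_neg else 0  # 1 → +ve, 0 → -ve

# Overall time complexity is O(nd) + O(k) + O(1) = O(nd)
```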

* Time Complexity → $O(nd)$
* Space Complexity → $O(nd)$

**Can we reduce the time and space complexities?**

* We can use techniques such as -

    * KD-Tree (data structure)
    * LSH - Locality Sensitive Hashing

### Decision surface for KNN as `K` changes

* A decision surface is a (hyper)surface in a multidimensional space that partitions the space into different regions.
* Data lying on one side of a decision surface are assigned to a different class from those lying on the other side.

![decision-surface](https://user-images.githubusercontent.com/63333753/116030754-ee0cc500-a679-11eb-972f-4c87e6e97062.png)

* Decision surfaces help us understand what happens as `K` increases.
* As `K` increases, the decision surface becomes smoother.
* If `K` is exactly equal to `n`, every new query point is assigned to the majority class of the whole data set, no matter how deep inside a region of the other class it lies.

### Important terminology

* **Overfitting** - An error that occurs in data modeling when a function aligns too closely to a limited set of data points, so the model fails to generalize to new data.
* **Regular fitting** - Model fit is a measure of how well a machine learning model generalizes to data similar to that on which it was trained. A well-fitted model produces more accurate outcomes.
* **Underfitting** - A statistical model or machine learning algorithm is said to underfit when it cannot capture the underlying trend of the data. Underfitting hurts the accuracy of the model.

### Cross Validation

* Study
    * http://ethen8181.github.io/machine-learning/model_selection/model_selection.html
    * https://www.ritchieng.com/machine-learning-cross-validation/

<br>

* Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample.
* The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation.

**Generalization** - If an algorithm performs well for an unseen data point, it is known as generalization.

<img src="http://ethen8181.github.io/machine-learning/model_selection/img/kfolds.png">

**Credits** - Image from Internet

We divide the whole data set into 3 parts -

* Training - it is used to find the nearest neighbors (`NN`).
* Validation - it is used to determine `K`.
* Testing (unseen) - it is used to measure how well the model predicts class labels for unseen data.

The time complexity for k-fold cross validation is $O(n*k*d)$.
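The fold splitting used by k-fold cross-validation can be sketched as follows. This is an illustrative sketch only: the helper name `k_fold_indices` and the `seed` parameter are hypothetical, and the data is assumed to be safe to shuffle (i.e. no time ordering).

```python
import numpy as np


def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train_idx, val_idx) pairs; every sample is used for validation exactly once."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, n_folds)  # n_folds nearly equal groups
    for i in range(n_folds):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train_idx, val_idx


splits = list(k_fold_indices(n_samples=10, n_folds=5))
for train_idx, val_idx in splits:
    print("train:", sorted(train_idx.tolist()), "validate:", sorted(val_idx.tolist()))
```

For each candidate `K` of KNN, one would train on `train_idx`, measure the error on `val_idx`, average the error over the folds, and keep the `K` with the lowest average.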

### Visualizing the data - (Train, CV, Test)

* $D_{\text{Train}}$, $D_{\text{CV}}$, and $D_{\text{Test}}$ do not overlap perfectly (because we split the data randomly).

* If there are many $\text{+ve}$ or $\text{-ve}$ points from $D_{\text{Train}}$ in a region then it is highly likely to find many $\text{+ve}$ or $\text{-ve}$ points from $D_{\text{CV}}$ in that region.

* On the other hand, if there are very few $\text{+ve}$ or $\text{-ve}$ points in a region from $D_{\text{Train}}$ then it is very unlikely to find $\text{+ve}$ or $\text{-ve}$ points from $D_{\text{CV}}$ in that region. Such isolated points are likely to be noise in the data.

![intuitive-viz](https://user-images.githubusercontent.com/63333753/116349082-dade2e80-a80c-11eb-8bed-8ba86991f631.png)

### Train error, Cross Validation error

* **Training Error:** We get this by calculating the classification error of a model on the same data the model was trained on.

* **Cross Validation Error:** We get this by calculating the classification error of a model on the cross validation data, with the model trained on the training data.

* **Test Error:** We get this by calculating the classification error on the test data, i.e., $(1 - \text{accuracy})$. We generally report this to show how well the model performs on unseen data.

![ttcv-errors](https://user-images.githubusercontent.com/63333753/116353894-314f6b00-a815-11eb-8bb0-0f053f7311a3.png)

**Note:** When both the Training Error and the CV Error are high, it is known as **Underfitting**. Similarly, when the Training Error is low and the CV Error is high, it is known as **Overfitting**.
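To see how the two errors move as `K` changes, one can compute both for a few values of `K`. The sketch below uses synthetic data and hypothetical helper names (`knn_predict`, `error_rate`); it assumes binary labels in {0, 1} and an odd `K`, and it is not tied to the data shown in the plot above.

```python
import numpy as np


def knn_predict(X_train, y_train, X_query, k):
    """Predict binary labels (0/1) for every row of X_query by majority vote."""
    X_train = np.asarray(X_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    y_train = np.asarray(y_train)
    # Pairwise Euclidean distances, shape (n_query, n_train)
    dists = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    # With 0/1 labels and odd k, the majority vote is "mean of neighbor labels > 0.5"
    return (y_train[nearest].mean(axis=1) > 0.5).astype(int)


def error_rate(y_true, y_pred):
    """Classification error = 1 - accuracy."""
    return float(np.mean(np.asarray(y_true) != np.asarray(y_pred)))


# Synthetic data (an assumption for illustration): label is 1 when x + y > 0
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_tr, y_tr, X_cv, y_cv = X[:150], y[:150], X[150:], y[150:]

errors = {}
for k in (1, 15, 149):
    train_err = error_rate(y_tr, knn_predict(X_tr, y_tr, X_tr, k))
    cv_err = error_rate(y_cv, knn_predict(X_tr, y_tr, X_cv, k))
    errors[k] = (train_err, cv_err)
    print(f"K={k:3d}  train error={train_err:.3f}  CV error={cv_err:.3f}")
```

With `K = 1` the training error is always 0 here, because each training point is its own nearest neighbor; that is why a low training error alone says little about how the model generalizes.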

### Time based Splitting (TBS)

If there is a time/date feature in the data set, then it is better to use this method.

* First, sort the data by the time feature.
* Select the first 60% of the data for training.
* Use the next 20% for cross validation.
* Use the final 20% for testing.

With time, many things change (reviews, products), so when a time column is available, it is good to choose TBS.
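The 60/20/20 split above can be sketched like this. The function name `time_based_split` and the toy timestamps are hypothetical; the sketch returns index arrays so that it works with any feature and label arrays.

```python
import numpy as np


def time_based_split(timestamps, train_frac=0.6, cv_frac=0.2):
    """Return (train_idx, cv_idx, test_idx), ordered from oldest to newest."""
    order = np.argsort(timestamps, kind="stable")  # oldest first
    n = len(order)
    n_train = int(round(n * train_frac))
    n_cv = int(round(n * cv_frac))
    return order[:n_train], order[n_train:n_train + n_cv], order[n_train + n_cv:]


# Toy timestamps in arbitrary order (illustrative values only)
ts = np.array([5, 1, 9, 3, 7, 2, 8, 4, 6, 0])
train_idx, cv_idx, test_idx = time_based_split(ts)

print(ts[train_idx])  # [0 1 2 3 4 5]
print(ts[cv_idx])     # [6 7]
print(ts[test_idx])   # [8 9]
```

Every training timestamp is older than every cross validation timestamp, which in turn is older than every test timestamp, so the model is always evaluated on data from its "future".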

### KNN for Regression

Some classification algorithms can be extended to do regression analysis. A simple outline of how KNN can do this is below.

* Given $x_q$, find the k-nearest neighbors
    - $(x_1, y_1), (x_2, y_2), (x_3, y_3), \dots, (x_k, y_k)$
* **Classification** → works based on majority vote.
* **Regression** → here each $y_i$ is a real number -
    - $y_1, y_2, y_3, \dots, y_k \in \rm I\!R$
    - $y_q = \text{Mean}(y_i)_{i=1}^k$
    - $y_q = \text{Median}(y_i)_{i=1}^k$
    - We do not use the mode, as it amounts to the same technique as the majority vote.
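A minimal sketch of KNN regression, assuming Euclidean distance; the function name `knn_regress`, the `aggregate` parameter, and the weight/height numbers are made up for illustration.

```python
import numpy as np


def knn_regress(X_train, y_train, x_q, k=3, aggregate=np.mean):
    """Predict y_q as the mean (or median) of the k nearest neighbors' targets."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    dists = np.linalg.norm(X_train - np.asarray(x_q, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]
    return float(aggregate(y_train[nearest]))


# Hypothetical data: weight → height (inches)
X_w = [[100.0], [120.0], [150.0], [180.0], [200.0]]
y_h = [60.0, 63.0, 67.0, 70.0, 72.0]

mean_pred = knn_regress(X_w, y_h, [125.0], k=3)                        # mean of 60, 63, 67
median_pred = knn_regress(X_w, y_h, [125.0], k=3, aggregate=np.median) # median of 60, 63, 67
print(round(mean_pred, 2), median_pred)  # 63.33 63.0
```

The median is less sensitive than the mean to a single neighbor with an extreme target value.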

### Weighted KNN

* This is an extension of the regular KNN implementation.
* A weight is assigned to each neighbor based on its computed distance.
* The smaller the distance, the larger the weight, and vice versa.
* Sum the weights per class label and choose the class with the largest total.

**One way** to determine how much weight should be given is $w_i = \frac{1}{d_i}$, where $d_i$ is the distance from $x_q$ to the $i$-th neighbor.
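The weighting scheme can be sketched as below. The function name and the toy data are hypothetical, and the small `eps` added to each distance is an assumption (not part of the notes) that avoids dividing by zero when the query coincides with a training point.

```python
from collections import defaultdict

import numpy as np


def weighted_knn_classify(X_train, y_train, x_q, k=3, eps=1e-12):
    """Sum the weights w_i = 1 / d_i per class and return the heaviest class."""
    X_train = np.asarray(X_train, dtype=float)
    dists = np.linalg.norm(X_train - np.asarray(x_q, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]
    class_weight = defaultdict(float)
    for i in nearest:
        # eps is an assumption to guard against d_i = 0
        class_weight[y_train[i]] += 1.0 / (dists[i] + eps)
    return max(class_weight, key=class_weight.get)


# One very close +ve point (label 1) and two farther -ve points (label 0)
X_toy = [[0.1, 0.0], [2.0, 0.0], [0.0, 2.0], [10.0, 10.0]]
y_toy = [1, 0, 0, 0]

pred = weighted_knn_classify(X_toy, y_toy, x_q=[0.0, 0.0], k=3)
print(pred)  # 1
```

Here a plain majority vote over the 3 nearest neighbors would return 0 (two votes against one), while the weighted vote returns 1 because the single positive neighbor is much closer (weight of about 10 versus 0.5 + 0.5).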

### Kd Tree

Imagine the following image is your data visualized

![kd-tree-points](https://user-images.githubusercontent.com/63333753/116850618-94267500-ac0e-11eb-9f29-e1f02c40d7b6.png)

To construct a `Kd` Tree, we follow the steps below

* Split the data at the **median** after projecting the points onto the `X` axis.

![kd-tree-step1](https://user-images.githubusercontent.com/63333753/116852540-1ebca380-ac12-11eb-9e99-8b84934d1470.png)

* Switch to the next axis and repeat step 1 on each half.

![kd-tree-step2](https://user-images.githubusercontent.com/63333753/116858391-e9b54e80-ac1b-11eb-96ec-c771154851e4.png)

* We cycle through the axes one after the other, as in the diagram above, constructing the tree until we reach the leaf nodes.


With a `Kd` Tree, we break up the space using axis-parallel lines or planes into rectangles/cuboids/hypercuboids.
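The construction steps can be sketched recursively. This is a simplified illustration, not a production structure: the dictionary-based node layout, the `leaf_size` parameter, and the sample points are all assumptions, and only construction is shown (no nearest-neighbor search).

```python
import numpy as np


def build_kd_tree(points, depth=0, leaf_size=1):
    """Recursively split the points at the median, cycling through the axes."""
    points = np.asarray(points, dtype=float)
    if len(points) <= leaf_size:
        return {"leaf": True, "points": points}
    axis = depth % points.shape[1]  # alternate X, Y, ... with the depth
    points = points[np.argsort(points[:, axis], kind="stable")]
    mid = len(points) // 2  # position of the (upper) median along this axis
    return {
        "leaf": False,
        "axis": axis,
        "split": float(points[mid, axis]),
        "left": build_kd_tree(points[:mid], depth + 1, leaf_size),
        "right": build_kd_tree(points[mid:], depth + 1, leaf_size),
    }


def count_points(node):
    """Total number of points stored in the leaves under this node."""
    if node["leaf"]:
        return len(node["points"])
    return count_points(node["left"]) + count_points(node["right"])


pts = [[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]]
tree = build_kd_tree(pts)
print(tree["axis"], tree["split"])                  # 0 7.0 (first split on X)
print(tree["left"]["axis"], tree["left"]["split"])  # 1 4.0 (then on Y)
```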

### Limitations of `Kd` Tree

* When `d` is small - {2, 3, 4, 5} → `Kd` tree works efficiently.
* When `d` is `>=` 10 → the time complexity increases dramatically.
* `Kd` tree was first developed to solve problems related to computer graphics. It is poorly suited to Machine Learning problems where the data has many dimensions.

### Locality Sensitive Hashing

* Uses hashing and a hash table to find the nearest neighbors.
* Stores the data as `key` and `value` pairs (similar to a dictionary).
* $\text{hash}(x_q) \rightarrow$ used to look up the bucket of candidate neighbors for a new query point. The bucket lookup takes $O(1)$ time on average.

![lsh-concept](https://user-images.githubusercontent.com/63333753/116969053-40ca2a80-acd3-11eb-9139-346eed4fe5b5.png)

* This technique works well even when the dimension ($d$) is large.

### LSH for Cosine Similarity

Steps for locality sensitive hashing:

1. First, make `m` hyperplanes that split the space into regions (slices), such that a cluster of points lies in a particular slice, which is called their neighbourhood. Typically $m = \log(n)$.

2. Next, for each point, compute $W_1^T \ \text{point}$, where $W_1$ is the normal to hyperplane 1. If it is greater than 0, the point lies on the same side as the normal; otherwise it lies on the other side. Repeating this for every hyperplane gives a vector of size `m` (the hash of the point). 
    - For example, the vector can be [1,-1,-1], denoting that point $x$ lies on the same side as the normal to hyperplane 1 and on the opposite side to the normals of hyperplanes 2 and 3. This vector serves as the key to the hash table, and all the points with the same key go in the same bucket, as they lie in the neighbourhood of each other.

![lsh-cosine](https://user-images.githubusercontent.com/63333753/116975496-5d6b6000-acdd-11eb-9d3a-8c16129d88cc.png)

| Key | local points |
| --- | --- |
| [1 1 1] | {x1, x2, x3, x4, x5 } |
| [1 -1 -1] | {x6, x7, x8, x9 } |

3. It may happen that two points which are very close fall in different slices because of where a hyperplane is placed, and hence are not considered nearest neighbours. To resolve this, create `l` hash tables (`l` is typically small); in other words, repeat **step 2** `l` times, each time with a new set of `m` random hyperplanes. Because the hyperplanes are placed differently in each iteration, the regions differ, so the same point gets a different hash in each table, and points that shared a region in one iteration may be separated in another and vice versa. When a query point arrives, compute its hash for each of the `l` tables, fetch the neighbours from the matching bucket in each table, take the union, and find the nearest neighbours within that list.

**Time and Space Complexity**

* Time complexity is $O(m*d*l)$ for each query point. Creating the hash tables takes $O(m*d*l*n)$, which is a one-time cost.
* Space complexity is $O(n)$
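The three steps can be sketched with random hyperplanes as follows. This is a rough sketch under stated assumptions: the function names are hypothetical, the hyperplane normals are drawn from a standard normal distribution and pass through the origin, and the final nearest-neighbor search within the candidate set is left out.

```python
from collections import defaultdict

import numpy as np


def hash_key(x, W):
    """Sign pattern of W @ x: +1 if x is on the normal's side of a hyperplane, else -1."""
    return tuple(np.where(W @ x > 0, 1, -1).tolist())


def build_lsh_tables(X, m, l, seed=0):
    """Build l hash tables, each with its own m random hyperplanes."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    tables = []
    for _ in range(l):
        W = rng.normal(size=(m, X.shape[1]))  # rows are the m hyperplane normals
        buckets = defaultdict(list)
        for idx, x in enumerate(X):
            buckets[hash_key(x, W)].append(idx)
        tables.append((W, buckets))
    return tables


def lsh_candidates(tables, x_q):
    """Union of the buckets that x_q falls into across all tables."""
    x_q = np.asarray(x_q, dtype=float)
    candidates = set()
    for W, buckets in tables:
        candidates.update(buckets.get(hash_key(x_q, W), []))
    return candidates


# Synthetic data for illustration: n = 50 points in d = 5 dimensions
X_demo = np.random.default_rng(1).normal(size=(50, 5))
tables = build_lsh_tables(X_demo, m=4, l=3)

cands = lsh_candidates(tables, 2.0 * X_demo[0])  # same direction as point 0
print(0 in cands, len(cands))
```

Scaling a vector does not change which side of a hyperplane through the origin it lies on, so `2.0 * X_demo[0]` lands in the same buckets as point 0. The exact nearest neighbors would then be found by computing distances only within `cands`.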

<br>

> **LSH** is extensively used in Computer Vision areas.

### Probabilistic Class Labels

In the probabilistic approach, instead of giving a deterministic value, we express certainty by giving probabilities. This is helpful in applications such as medicine, where we also want to state how confident we are in our decision for one class or another. You can combine the probabilistic approach with weighted KNN as follows:

1. Let $d_1$, $d_2$, and $d_3$ be the distances to the `3 NNs`. Here, let $d_1$ be the distance to a `+ve` labeled point and $d_2$ and $d_3$ be the distances to `-ve` labeled points. Let $w_1 = \frac{1}{d_1}$, $w_2 = \frac{1}{d_2}$, $w_3 = \frac{1}{d_3}$, where the $w_i$ are the weights.

2. $P(x_q = \text{+ve}) = \frac{w_1}{(w_1+w_2+w_3)}$ and $P(x_q = \text{-ve}) = \frac{(w_2+w_3)}{(w_1+w_2+w_3)}$. Use these probability values to decide the class label.

In a nutshell, instead of counting each neighbor as 1 when computing the probabilistic scores, we use the weights.
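The two steps can be sketched as a small function. The name `weighted_class_probabilities`, the example distances, and the `eps` guard against a zero distance are assumptions made for illustration.

```python
def weighted_class_probabilities(distances, labels, eps=1e-12):
    """Turn the neighbors' distances into per-class probabilities using w_i = 1 / d_i."""
    weights = {}
    for d, label in zip(distances, labels):
        # eps is an assumption that avoids dividing by zero when d == 0
        weights[label] = weights.get(label, 0.0) + 1.0 / (d + eps)
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}


# d_1 belongs to a +ve point; d_2 and d_3 belong to -ve points
probs = weighted_class_probabilities([0.5, 1.0, 2.0], ["+ve", "-ve", "-ve"])
print({label: round(p, 4) for label, p in probs.items()})  # {'+ve': 0.5714, '-ve': 0.4286}
```

With these distances the weights are 2, 1, and 0.5, so $P(x_q = \text{+ve}) = 2 / 3.5$ even though two of the three neighbors are negative.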

### Interview Questions

* https://www.analyticsvidhya.com/blog/2017/09/30-questions-test-k-nearest-neighbors-algorithm/