# Anomaly Detection using k-Nearest Neighbor

k-NN is usually introduced as a supervised algorithm for classification and regression, but it is not limited to those tasks. Its neighbor-distance computation works without labels, so it can also be used as an anomaly detection algorithm.

## Dataset
### Context
The dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases.<br>
Several constraints were placed on the selection of these instances from a larger database.<br>
In particular, all patients here are females at least 21 years old of Pima Indian heritage.<br>
The age and BMI (body mass index) from this dataset are analyzed to identify any anomalies.
### Content
<li>Age: Age (in years)</li>
<li>BMI: Body mass index (weight in kg/(height in m)^2)</li>

## What is k-Nearest Neighbor?

k-Nearest Neighbor (k-NN) is a supervised machine learning algorithm used for classification and regression tasks. It is a non-parametric algorithm, meaning it does not make any assumptions about the underlying data distribution.
<br><br>
In the k-NN algorithm, the "k" refers to the number of nearest neighbors considered for making predictions. When given a new data point, the algorithm identifies the "k" closest data points in the training set based on a chosen distance metric (such as Euclidean distance). The predicted class or value for the new data point is then determined by the majority vote or averaging of the labels or values of its k nearest neighbors.
<br><br>
For classification tasks, the k-NN algorithm assigns the most common class label among the neighbors to the new data point. In regression tasks, it calculates the average of the values from the k nearest neighbors to predict a continuous value.
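
The voting and averaging described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the scikit-learn implementation: the function name `knn_predict` and the toy data are assumptions made up for this example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """Predict from the k nearest training points (hypothetical helper)."""
    # Euclidean distance from the new point to every training point.
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(dists)[:k]
    if task == "classification":
        # Majority vote among the neighbors' labels.
        return Counter(y_train[nearest].tolist()).most_common(1)[0][0]
    # Regression: average of the neighbors' values.
    return float(np.mean(y_train[nearest]))

# Toy data (made up for illustration): two small clusters.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
labels = np.array([0, 0, 0, 1, 1, 1])
values = np.array([10.0, 12.0, 11.0, 50.0, 52.0, 48.0])

x_new = np.array([1.1, 1.0])
predicted_label = knn_predict(X_train, labels, x_new, k=3)
predicted_value = knn_predict(X_train, values, x_new, k=3, task="regression")
print(predicted_label, predicted_value)
```

The new point sits inside the first cluster, so its three nearest neighbors all carry label 0 and their values average to 11.0.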
<br><br>
One important aspect of the k-NN algorithm is the choice of the value for "k." A smaller value of k can make the algorithm sensitive to local variations and noise, while a larger value of k can smooth out the decision boundary and may not capture local patterns well.
<br><br>
k-NN is a simple and intuitive algorithm, but its performance can be affected by the curse of dimensionality: as the number of features grows, distances between points become less informative, which can decrease accuracy. Additionally, the algorithm can be computationally expensive on large datasets, as a brute-force search calculates the distance between the new data point and every existing data point.
<br><br>
In anomaly detection using k-NN, the algorithm calculates the distances between each data point and its k nearest neighbors. If a data point's distances to its nearest neighbors are significantly larger than those of most other points, it is considered an anomaly.
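
As a rough sketch of that idea, the snippet below scores each point by its mean distance to its k nearest other points. The function name `knn_anomaly_scores` and the toy data are assumptions for illustration, and the sketch deliberately ignores each point's zero distance to itself.

```python
import numpy as np

def knn_anomaly_scores(X, k=3):
    """Mean Euclidean distance from each point to its k nearest other points."""
    # Pairwise distance matrix, shape (n, n).
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # Ignore each point's zero distance to itself.
    np.fill_diagonal(dists, np.inf)
    # Keep the k smallest distances per row and average them.
    nearest = np.sort(dists, axis=1)[:, :k]
    return nearest.mean(axis=1)

# Toy data (made up): a tight cluster plus one far-away point at index 5.
X_toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                  [0.1, 0.1], [0.05, 0.05], [3.0, 3.0]])
scores = knn_anomaly_scores(X_toy, k=3)
most_anomalous = int(np.argmax(scores))
print(most_anomalous)
```

The isolated point receives by far the largest score, which is what a distance threshold would later pick up.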

## Libraries

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

## ML Data Engineering

In [2]:
# Read the data.
df = pd.read_csv("diabetes.csv", sep=",", usecols=["BMI", "Age"])

In [3]:
# Visualize the age and BMI.
fig = px.scatter(df, x="Age", y="BMI")
fig.show()

In [4]:
# Scale the features.
scaler = StandardScaler()
scale = scaler.fit_transform(df)
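
Scaling matters here because Euclidean distance would otherwise be dominated by whichever feature has the larger numeric range. The short check below, on made-up toy values rather than the diabetes data, shows that `StandardScaler` amounts to subtracting each column's mean and dividing by its population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy values (made up) standing in for two columns such as BMI and Age.
toy = np.array([[20.0, 25.0], [30.0, 35.0], [40.0, 60.0]])

scaled_toy = StandardScaler().fit_transform(toy)

# Manual equivalent: (x - mean) / std per column, with population std (ddof=0).
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(scaled_toy, manual))
```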

## ML Model Engineering

In [5]:
# Create arrays.
X = scale

# Create the model.
knn = NearestNeighbors(n_neighbors=3)

# Fit the model.
knn.fit(X)

## ML Model Evaluation

In [6]:
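# Distances to, and indexes of, the 3 nearest neighbors of each point.
# X is also the training set, so each point's nearest neighbor is normally
# itself (distance 0) and the row mean below includes that zero.
distances, indexes = knn.kneighbors(X)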

# Visualize the average distances.
fig = px.line(distances.mean(axis=1), title="Average Distance Plot")
fig.update_xaxes(title_text='index')
fig.update_yaxes(title_text='average distance')
fig.update_layout(showlegend=False)
fig.show()
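
One detail worth knowing: when the training data is passed back into `kneighbors`, each point is returned as its own nearest neighbor. The sketch below, on made-up toy points, contrasts this with calling `kneighbors()` without an argument, which queries the training data but does not count a point as its own neighbor. Switching the notebook to that variant would change the average distances, so the 0.6 threshold would need to be revisited.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy points (made up), all distinct.
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [4.0, 4.0]])
nn = NearestNeighbors(n_neighbors=2).fit(X_toy)

# Querying with the training data: the first neighbor of each point is itself.
d_with_self, _ = nn.kneighbors(X_toy)

# No argument: same query points, but a point is not its own neighbor.
d_without_self, _ = nn.kneighbors()

print(d_with_self[:, 0])     # self-distances
print(d_without_self[:, 0])  # distances to the nearest other point
```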

Data points with a relatively high average distance are likely anomalies. The plot suggests candidates at indexes 9, 177, 459, and 684, so let's set the threshold at 0.6.
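
The threshold here is chosen by eye from the plot. A possible alternative, not used in this notebook, is to derive it from a percentile of the scores. The sketch below uses synthetic scores standing in for `distances.mean(axis=1)`, and the 96th percentile is an arbitrary assumption that would need tuning on real data.

```python
import numpy as np

# Synthetic scores (made up): 96 small values plus 4 clearly larger ones.
rng = np.random.default_rng(0)
avg_dist = np.concatenate([rng.uniform(0.0, 0.2, size=96),
                           [0.7, 0.8, 0.9, 1.0]])

# Flag everything above the 96th percentile; the percentage is an assumption.
threshold = np.percentile(avg_dist, 96)
flagged = np.where(avg_dist > threshold)[0]
print(threshold, flagged)
```

A percentile always flags a fixed share of the data, even when no real anomalies exist, so it is best treated as a starting point for inspection.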

In [7]:
# Indexes of the points whose average distance exceeds the threshold.
outlier_index = np.where(distances.mean(axis=1) > 0.6)
outlier_index

(array([  9, 177, 459, 684], dtype=int64),)

In [8]:
# Select the outlier rows from the original (unscaled) data.
outlier_values = df.iloc[outlier_index]
outlier_values

|     | BMI  | Age |
|-----|------|-----|
| 9   | 0.0  | 54  |
| 177 | 67.1 | 26  |
| 459 | 25.9 | 81  |
| 684 | 0.0  | 69  |


In [21]:
# Scatter plot with main data
fig = px.scatter(df, x="Age", y="BMI", color_discrete_sequence=['black'])

# Scatter plot with outlier data
outlier_fig = px.scatter(outlier_values, x="Age", y="BMI", color_discrete_sequence=['red'])

# Add outlier data to the main figure
fig.add_trace(outlier_fig.data[0])

# Show the combined graph
fig.show()

## References

https://www.datacamp.com/tutorial/k-nearest-neighbor-classification-scikit-learn<br>
https://www.enjoyalgorithms.com/blog/introduction-to-anomaly-detection<br>
https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database<br>
https://towardsdatascience.com/k-nearest-neighbors-knn-for-anomaly-detection-fdf8ee160d13<br>