> Q1. What is anomaly detection and what is its purpose?

Anomaly detection, also known as outlier detection, is a data analysis technique aimed at identifying patterns in data that do not conform to expected behavior. Its purpose is to identify rare, unusual, or anomalous data points or events within a dataset. Anomaly detection is used in various domains, including fraud detection, network security, quality control, and healthcare, to uncover deviations from normal behavior.

> Q2. What are the key challenges in anomaly detection?

Key challenges in anomaly detection include:

Lack of labeled data: Anomalies are often rare, making it challenging to have sufficient labeled examples for supervised learning.

Data imbalance: Anomalies are typically a minority class, leading to imbalanced datasets.

Evolving data: Data distributions may change over time, requiring models to adapt.

High-dimensional data: Distances between points become less discriminative as the number of dimensions grows, so anomalies are harder to separate from normal points.

Interpretability: Understanding why a data point is flagged as an anomaly can be difficult.

Scalability: Handling large datasets efficiently can be a challenge.

> Q3. How does unsupervised anomaly detection differ from supervised anomaly detection?

Unsupervised anomaly detection requires no labels: it flags points based solely on the structure of the data itself, such as low density or a large distance from other points. Supervised anomaly detection, in contrast, trains a model on labeled examples of both normal and anomalous data, which turns the task into a (usually highly imbalanced) classification problem.

> Q4. What are the main categories of anomaly detection algorithms?

Anomaly detection algorithms can be categorized into several types:

Statistical methods: Based on statistical properties of the data, such as the Gaussian distribution.

Machine learning algorithms: Including Isolation Forest, One-Class SVM, and k-Nearest Neighbors (KNN).

Clustering methods: Points that lie far from every cluster center, or that belong to no cluster, are treated as anomalies.

Density estimation techniques: Such as kernel density estimation and Gaussian Mixture Models.

Time series methods: Specialized techniques for detecting anomalies in time series data.
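As a minimal sketch of the statistical category, the snippet below flags values whose z-score exceeds a threshold. The function name `zscore_outliers`, the threshold of 3.0, and the sample readings are illustrative assumptions, not part of the original answer, and the approach presumes the data is roughly Gaussian.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds `threshold`."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical: nothing can stand out
    return [v for v in values if abs(v - mean) / std > threshold]

# Hypothetical sensor readings with one obvious spike.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 9.7, 10.1, 10.0, 9.9, 10.2, 25.0]
print(zscore_outliers(readings))
```

On small samples a single extreme value inflates the mean and standard deviation it is measured against, so robust alternatives such as the median and median absolute deviation are often preferred.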

> Q5. What are the main assumptions made by distance-based anomaly detection methods?

Distance-based anomaly detection methods assume that anomalies are far from the majority of data points in feature space. The key assumptions include:

Anomalies are sparse and isolated.

Anomalies have a higher distance (dissimilarity) from other data points.

Normal data points are concentrated in dense regions.
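These assumptions can be illustrated with a simple k-th-nearest-neighbor distance score: a point in a dense region has a small distance to its k-th neighbor, while an isolated point has a large one. The helper `knn_distance_score`, the toy dataset, and the choice k=2 are assumptions made for illustration only.

```python
import math

def knn_distance_score(point, others, k):
    """Distance from `point` to its k-th nearest neighbor among `others`."""
    distances = sorted(math.dist(point, other) for other in others)
    return distances[k - 1]

# Hypothetical data: a tight cluster plus one isolated point.
cluster = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5)]
isolated = (8.0, 8.0)
dataset = cluster + [isolated]

scores = {}
for i, p in enumerate(dataset):
    rest = dataset[:i] + dataset[i + 1:]  # exclude the point itself
    scores[p] = knn_distance_score(p, rest, k=2)

# The point with the largest score is the most isolated one.
most_anomalous = max(scores, key=scores.get)
print(most_anomalous, scores[most_anomalous])
```

A single global threshold on this score works only when normal regions have similar density; density-relative scores such as LOF relax that assumption.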

> Q6. How does the LOF algorithm compute anomaly scores?

The LOF (Local Outlier Factor) algorithm computes anomaly scores for data points based on their local density compared to the density of their neighbors. For each point it computes the local reachability density (the inverse of the average reachability distance to its k nearest neighbors), then takes the ratio of the average local reachability density of those neighbors to the point's own. A score close to 1 means the point is about as dense as its neighbors, and a score well above 1 means the point sits in a sparser region than its neighbors and is more likely an anomaly.
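The computation can be sketched in plain Python as follows. This is a minimal illustration, assuming no duplicate points (which would make a reachability sum zero) and breaking distance ties by index; the function `lof_scores` and the toy grid are hypothetical. In this formulation the score is the neighbors' average local reachability density divided by the point's own, so values well above 1 mark outliers.

```python
import math

def lof_scores(points, k):
    """Return the Local Outlier Factor of every point (brute force, O(n^2))."""
    n = len(points)
    # For each point: its k nearest neighbors as (distance, index) pairs.
    neighbors = []
    for i in range(n):
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        neighbors.append(dists[:k])
    # k-distance: distance to the k-th nearest neighbor.
    k_distance = [nbrs[-1][0] for nbrs in neighbors]

    def local_reachability_density(i):
        # reach-dist(i, j) = max(k-distance(j), d(i, j))
        reach = [max(k_distance[j], d) for d, j in neighbors[i]]
        return len(reach) / sum(reach)

    lrd = [local_reachability_density(i) for i in range(n)]
    # LOF(i) = mean lrd of i's neighbors divided by lrd of i.
    return [
        sum(lrd[j] for _, j in neighbors[i]) / (k * lrd[i]) for i in range(n)
    ]

# Hypothetical data: a 3x3 grid plus one far-away point (last entry).
grid = [(float(x), float(y)) for x in range(3) for y in range(3)]
points = grid + [(10.0, 10.0)]
scores = lof_scores(points, k=3)
print(round(scores[-1], 2), [round(s, 2) for s in scores[:-1]])
```

Note that scikit-learn's `LocalOutlierFactor` stores the negated value in `negative_outlier_factor_`, so there the more negative scores are the more anomalous ones.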


> Q7. What are the key parameters of the Isolation Forest algorithm?

The key parameters of the Isolation Forest algorithm include:

Number of trees (n_estimators): The number of isolation trees in the forest.

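Maximum tree depth: The depth limit for each isolation tree. In scikit-learn's IsolationForest this is not a user-settable parameter; each tree is grown to a depth of at most ceil(log2(max_samples)).

Sample size for tree construction (max_samples): The number (or fraction) of data points drawn to construct each tree.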

Contamination: The expected proportion of anomalies in the dataset, used to set a threshold for anomaly classification.
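As a rough sketch of how these parameters are passed in scikit-learn, the snippet below fits an `IsolationForest` on synthetic data. The data, the seed, and the specific values (100 trees, 256 samples per tree, 5% contamination) are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic 2-D data, assumed purely for demonstration.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))

forest = IsolationForest(
    n_estimators=100,    # number of isolation trees
    max_samples=256,     # points drawn to build each tree
    contamination=0.05,  # expected share of anomalies; sets the threshold
    random_state=0,
)
labels = forest.fit_predict(X)  # +1 for inliers, -1 for flagged anomalies

flagged_fraction = float(np.mean(labels == -1))
print(flagged_fraction)
```

Because `contamination` fixes the decision threshold at the corresponding percentile of the training scores, the flagged fraction on the training data ends up close to the value passed in.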

In [27]:
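'''Q8. If a data point has only 2 neighbours of the same class within a radius of 0.5, what is its anomaly score
using KNN with K=10?'''
import numpy as np

# The question gives no coordinates, so these points are hypothetical:
# two neighbours placed within radius 0.5 of the data point.
data_point = np.array([0.0, 0.0])
neighbor_1 = np.array([0.25, 0.0])
neighbor_2 = np.array([0.0, 0.375])

# Calculate the distance to the two neighbors
distance_to_neighbor_1 = np.linalg.norm(data_point - neighbor_1)
distance_to_neighbor_2 = np.linalg.norm(data_point - neighbor_2)

# Score used here: mean distance to the neighbours found inside the radius.
# With K=10 and only 2 same-class neighbours in range, the other 8 lie farther
# than 0.5, so an exact K=10 score cannot be derived from the question alone.
anomaly_score = (distance_to_neighbor_1 + distance_to_neighbor_2) / 2

# Print the anomaly score
print(anomaly_score)

0.3125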

In [18]:
'''Q9. Using the Isolation Forest algorithm with 100 trees and a dataset of 3000 data points, what is the
anomaly score for a data point that has an average path length of 5.0 compared to the average path
length of the trees?'''
import math

n_samples = 3000           # points used to build each tree (assumed: the full dataset)
average_path_length = 5.0  # E(h(x)): path length of the point, averaged over the 100 trees

# c(n): average path length of an unsuccessful search in a binary search tree,
# used to normalise E(h(x)). H(i) is approximated by ln(i) + Euler's constant.
euler_gamma = 0.5772156649
c_n = 2 * (math.log(n_samples - 1) + euler_gamma) - 2 * (n_samples - 1) / n_samples

# Isolation Forest score: s = 2 ** (-E(h(x)) / c(n)); values near 1 indicate anomalies.
anomaly_score = 2 ** (-average_path_length / c_n)
print(f"Anomaly score for a data point that has an average path length of 5.0 compared to the average path length of the trees is: {anomaly_score:.3f}")

Anomaly score for a data point that has an average path length of 5.0 compared to the average path length of the trees is: 0.796
