## Ans : 1

Anomaly detection, also known as outlier detection, is a data mining technique used to identify patterns or data points that deviate significantly from the norm in a dataset. These deviant patterns or data points are considered anomalies because they do not conform to the expected behavior or distribution of the majority of the data. The purpose of anomaly detection is to flag and investigate these unusual instances, as they may represent errors, fraud, unusual events, or potential security breaches. Anomaly detection is applicable in various domains, such as cybersecurity, finance, healthcare, manufacturing, and more, where detecting rare and abnormal occurrences is essential for maintaining system integrity and safety.


## Ans : 2

Anomaly detection poses several challenges, including:

Lack of labeled data: In many real-world scenarios, anomalies are rare, making it challenging to collect enough labeled data to train supervised models effectively.

Imbalanced datasets: Since anomalies are usually a minority class, imbalanced datasets can lead to biased models that perform poorly in detecting anomalies.

Novelty detection: Anomaly detection models should be able to identify novel and previously unseen anomalies that might differ significantly from known anomalies.

Feature engineering: Choosing relevant features and representation of data is crucial for the effectiveness of anomaly detection algorithms.

High-dimensional data: High-dimensional feature spaces can lead to the "curse of dimensionality," making it harder to distinguish normal and anomalous data points.

False positives and false negatives: Striking the right balance between detecting anomalies accurately while minimizing false alarms is a challenge.
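The imbalance and false-negative issues above can be illustrated with a small sketch. The data is hypothetical (990 normal points, 10 anomalies), and the helper names `confusion_counts` and `accuracy_and_recall` are made up for this example. It shows why accuracy alone is misleading when anomalies are rare:

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN, treating label 1 as 'anomaly'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy_and_recall(y_true, y_pred):
    """Return overall accuracy and recall on the anomaly class."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall


# Hypothetical labels: 990 normal points (0) and 10 anomalies (1).
y_true = [0] * 990 + [1] * 10
# A degenerate "model" that always predicts normal.
y_pred = [0] * 1000

accuracy, recall = accuracy_and_recall(y_true, y_pred)
print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")  # accuracy=0.99, recall=0.00
```

The always-normal model looks 99% accurate yet misses every anomaly, which is why recall and precision on the anomaly class are usually reported instead.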

## Ans : 3

The main difference between unsupervised and supervised anomaly detection lies in the availability of labeled training data:

Unsupervised Anomaly Detection: In unsupervised anomaly detection, the algorithm works with unlabeled data, meaning it doesn't have access to examples of anomalies during training. The algorithm aims to learn the normal patterns in the data and identify deviations as anomalies without explicit knowledge of what anomalies look like.

Supervised Anomaly Detection: In supervised anomaly detection, the algorithm is trained on a dataset that contains both normal and anomalous instances. The algorithm uses this labeled data to learn the patterns of both normal and anomalous behavior. During testing, it can then classify new instances as either normal or anomalous based on what it learned during training.

## Ans : 4

Anomaly detection algorithms can be broadly categorized into the following types:

Statistical Methods: These methods assume that the data follows a certain statistical distribution (e.g., Gaussian) and identify anomalies based on deviations from this assumed distribution.

Machine Learning Methods: These methods use various machine learning techniques, such as clustering, classification, or density estimation, to identify anomalies in the data.

Distance-based Methods: These methods measure the distance or dissimilarity between data points and identify instances that are farthest from the rest as anomalies.

Density-based Methods: These methods focus on identifying regions of low data density and label data points in these regions as anomalies.

Model-based Methods: These methods create models of normal behavior and identify deviations from these models as anomalies.

Time Series Anomaly Detection: These methods specifically deal with time series data and identify anomalies based on unusual patterns or trends over time.
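As a rough illustration of the statistical category, here is a minimal z-score sketch. It assumes the data is roughly Gaussian; the function name `zscore_anomalies`, the threshold of 3 standard deviations, and the sample readings are all illustrative choices, not fixed rules:

```python
import statistics


def zscore_anomalies(values, threshold=3.0):
    """Return indices of values more than `threshold` std devs from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []  # all values identical, nothing can deviate
    return [i for i, v in enumerate(values) if abs(v - mean) / std > threshold]


# Hypothetical sensor readings with one obviously unusual value at the end.
readings = [10.1, 9.8, 10.0, 10.2, 9.9, 10.1, 10.0, 9.7, 10.3, 10.0, 9.9, 25.0]
flagged = zscore_anomalies(readings)
print(flagged)  # [11]
```

Note that the outlier itself inflates the mean and standard deviation, so on small samples a robust variant (median and median absolute deviation) may be a better choice.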

## Ans : 5

Distance-based anomaly detection methods make the following assumptions:

Distance Metric: They assume the existence of a meaningful distance or similarity metric to measure the dissimilarity between data points.

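Normal Data Density: Normal data points are assumed to lie in dense neighborhoods, so each of them has many other points close by. Unlike statistical methods, distance-based methods do not require the data to follow a specific parametric distribution such as a Gaussian.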

Anomaly Separability: Anomalies are assumed to be more distant from the normal data instances in the feature space. In other words, anomalies are expected to have significantly different characteristics compared to normal data points.

Global vs. Local Anomalies: These methods may make assumptions about whether anomalies are global (rare throughout the entire dataset) or local (rare within specific regions or clusters of the data).
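One simple way to turn these assumptions into a score is to use the distance to the k-th nearest neighbor. The sketch below assumes Euclidean distance is meaningful for the data. The function name `knn_distance_scores` and the sample points are hypothetical:

```python
import math


def knn_distance_scores(points, k):
    """Score each point by the Euclidean distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


# Hypothetical data: four points in a tight cluster and one far away.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
scores = knn_distance_scores(points, k=2)
most_anomalous = max(range(len(points)), key=scores.__getitem__)
print(most_anomalous, scores)
```

A larger score means the point is farther from its neighbors. This captures global anomalies well, but it can miss local ones when clusters have very different densities.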

## Ans : 6

The Local Outlier Factor (LOF) algorithm computes anomaly scores based on the local density of data points. Here's a high-level overview of the LOF algorithm's computation for a given data point:

Calculate Local Reachability Density (LRD): For each data point, find its k-nearest neighbors (where k is a user-defined parameter) and compute the reachability distance to each one, defined as the larger of the actual distance to the neighbor and that neighbor's own k-distance. The LRD is the inverse of the average of these reachability distances, so a point that is far from its neighbors has a low LRD.

Calculate Local Outlier Factor (LOF): The LOF for each data point is the average LRD of its k-nearest neighbors divided by its own LRD. It quantifies how much sparser or denser the point's neighborhood is compared with the neighborhoods of its neighbors. A value close to 1 means the point has a density similar to its neighbors, a value significantly greater than 1 indicates the point is in a sparser region (potential anomaly), and a value below 1 suggests the point is in a denser region (typical data point).

Anomaly Ranking: The data points are ranked based on their LOF scores, with higher LOF scores indicating higher likelihood of being anomalies.
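The steps above can be sketched in plain Python. This is a simplified illustration, not a library implementation: it assumes Euclidean distance, no duplicate points, and exactly k neighbors per point (ties are broken by index order). The function name `lof_scores` is hypothetical. LOF is computed as the average LRD of the neighbors divided by the point's own LRD:

```python
import math


def lof_scores(points, k):
    """Compute a Local Outlier Factor for every point (simplified sketch)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # k nearest neighbors and k-distance of every point.
    neighbors, k_distance = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(order[:k])
        k_distance.append(dist[i][order[k - 1]])

    # Local reachability density: inverse of the mean reachability distance,
    # where reach_dist(i, j) = max(k_distance[j], dist[i][j]).
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: average LRD of the neighbors divided by the point's own LRD.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]


# Hypothetical data: a tight cluster plus one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
lof = lof_scores(points, k=2)
ranking = sorted(range(len(points)), key=lambda i: lof[i], reverse=True)
print(ranking[0], [round(v, 2) for v in lof])
```

On this data the four cluster points get an LOF of about 1, while the isolated point gets a much larger value and ranks first.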

## Ans : 7

The Isolation Forest algorithm has two main parameters:

Number of Trees (n_estimators): This parameter determines the number of isolation trees to build. More trees give more stable anomaly scores, with diminishing returns, at the cost of additional computation.

Subsample Size (max_samples): This parameter controls the number of samples drawn from the dataset to construct each isolation tree. In scikit-learn it defaults to "auto", which uses min(256, n_samples) samples per tree.
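To show where these two parameters enter, here is a simplified isolation-forest-style sketch in plain Python. It is not the scikit-learn implementation. The names `build_tree`, `build_forest`, `n_trees`, and `sample_size` are hypothetical stand-ins for the real parameters, and the usual leaf-size adjustment to the path length is omitted for brevity:

```python
import math
import random


def build_tree(points, rng, depth, max_depth):
    """Recursively split on a random feature at a random value."""
    if len(points) <= 1 or depth >= max_depth:
        return {"size": len(points)}
    feature = rng.randrange(len(points[0]))
    lo = min(p[feature] for p in points)
    hi = max(p[feature] for p in points)
    if lo == hi:
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[feature] < split]
    right = [p for p in points if p[feature] >= split]
    return {"feature": feature, "split": split,
            "left": build_tree(left, rng, depth + 1, max_depth),
            "right": build_tree(right, rng, depth + 1, max_depth)}


def path_length(tree, x):
    """Number of splits needed before x reaches a leaf."""
    depth = 0
    while "feature" in tree:
        tree = tree["left"] if x[tree["feature"]] < tree["split"] else tree["right"]
        depth += 1
    return depth


def build_forest(data, n_trees, sample_size, seed=0):
    """n_trees plays the role of n_estimators, sample_size of max_samples."""
    rng = random.Random(seed)
    sample_size = min(sample_size, len(data))
    max_depth = math.ceil(math.log2(sample_size))
    return [build_tree(rng.sample(data, sample_size), rng, 0, max_depth)
            for _ in range(n_trees)]


def average_path_length(forest, x):
    return sum(path_length(t, x) for t in forest) / len(forest)


# Hypothetical data: a Gaussian cloud plus one far-away point.
data_rng = random.Random(42)
data = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(500)]
data.append((8.0, 8.0))

forest = build_forest(data, n_trees=100, sample_size=256)
outlier_depth = average_path_length(forest, (8.0, 8.0))
inlier_depth = average_path_length(forest, (0.0, 0.0))
print(outlier_depth, inlier_depth)
```

Averaging over more trees reduces the variance of the path length estimate, while a small subsample keeps each tree shallow and cheap to build.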

## Ans : 8

To calculate the anomaly score of a data point using K-Nearest Neighbors (KNN), you can use the concept of the local outlier factor (LOF). The LOF is calculated based on the local density of a data point relative to its k-nearest neighbors. In this case, the data point has 2 neighbors (k=2) within a radius of 0.5.

The formula for the LOF is as follows:
LOF = (Average LRD of k-nearest neighbors) / LRD of the data point

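However, the Local Reachability Density (LRD) depends on reachability distances, which are built from the k-distance (the distance to the k-th nearest neighbor) of the data point and of each of its neighbors. Knowing only that the 2 neighbors lie within a radius of 0.5 bounds the point's k-distance at 0.5, but it gives neither the exact distances nor the neighbors' own LRDs, so a specific LOF or anomaly score cannot be computed from this information alone.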

## Ans : 9

In the Isolation Forest algorithm, the anomaly score for a data point is computed based on its average path length in the isolation trees. The intuition is that anomalies are easier to isolate and will have shorter average path lengths, while normal data points will require more splits and have longer average path lengths.

The average path length of a data point in isolation trees is denoted by E(h(x)), where h(x) is the path length of the data point x in a single tree. The anomaly score s(x) for the data point x is calculated as follows:

s(x) = 2^(-E(h(x))/c(n))

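where c(n) is a normalization factor, the average path length of an unsuccessful search in a binary search tree, given by c(n) = 2 * H(n-1) - 2 * (n-1)/n, with the harmonic number approximated as H(i) ≈ ln(i) + 0.5772156649 (Euler's constant). Here n is the number of data points used to build each tree.

Given that you have 100 trees and a dataset of 3000 data points, let's assume the average path length for the specific data point is 5.0 and take n as the full dataset size. We can plug these values into the formula to calculate the anomaly score:

n = 3000
E(h(x)) = 5.0
c(n) = 2 * (ln(2999) + 0.5772156649) - 2 * 2999/3000 ≈ 17.167 - 1.999 ≈ 15.167

anomaly score = 2^(-5.0/15.167) ≈ 0.796

So, the anomaly score for the data point with an average path length of 5.0 is approximately 0.796. Scores close to 1 indicate a likely anomaly, while scores well below 0.5 indicate a normal point, so this data point is isolated much faster than average and is a likely anomaly.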
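The calculation can be checked with a few lines of Python. This sketch uses the normalization c(n) = 2 * H(n-1) - 2 * (n-1)/n with H(i) ≈ ln(i) + 0.5772156649, and it assumes n is the number of points used to build each tree (taken here as 3000). The function names `c` and `anomaly_score` are chosen for this example:

```python
import math

EULER_GAMMA = 0.5772156649


def c(n):
    """Average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def anomaly_score(avg_path_length, n):
    """Isolation Forest score s(x) = 2^(-E(h(x)) / c(n))."""
    return 2.0 ** (-avg_path_length / c(n))


norm = c(3000)
score = anomaly_score(5.0, 3000)
print(round(norm, 3), round(score, 3))
```

If the trees were built on a subsample instead (for example 256 points), c(n) would be smaller and the same path length of 5.0 would give a lower score.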