# Q1. What is anomaly detection and what is its purpose?

Anomaly detection is the task of identifying data points or patterns that deviate significantly from the expected behavior of a dataset. Its purpose is to surface rare or unusual events that may indicate fraud, errors, defects, security threats, or other suspicious activity, so that they can be investigated further.


# Q2. What are the key challenges in anomaly detection?

There are several key challenges in anomaly detection:

- Lack of labeled data: Labeled examples of anomalies are usually scarce or unavailable, so anomaly detection often has to operate in an unsupervised or semi-supervised manner, which makes both training and evaluation harder.

- Imbalanced datasets: Anomalies are typically rare compared to normal instances, so datasets are heavily imbalanced. This biases many learning algorithms toward the majority class and makes plain accuracy a misleading evaluation metric.

- Feature selection and extraction: Choosing relevant features or finding appropriate representations from the data can be challenging, as anomalies may exhibit different characteristics and distributions compared to normal instances.

- Dynamic and evolving patterns: Anomalies can change over time, requiring algorithms that can adapt and detect anomalies in evolving data streams.

- Interpretability: Understanding and explaining the detected anomalies can be difficult, as anomaly detection algorithms may work as black boxes without providing clear insights into the reasons behind the detection.


# Q3. How does unsupervised anomaly detection differ from supervised anomaly detection?

Unsupervised anomaly detection algorithms aim to identify anomalies in a dataset without the use of labeled instances. These algorithms focus on detecting patterns or instances that deviate significantly from the normal behavior, often relying on statistical, clustering, or density-based methods.

On the other hand, supervised anomaly detection involves training a model using labeled instances of both normal and anomalous behavior. The model learns to differentiate between normal and anomalous instances based on the provided labels. During the detection phase, the model predicts whether a new instance is normal or an anomaly based on the learned patterns from the training data.


# Q4. What are the main categories of anomaly detection algorithms?

The main categories of anomaly detection algorithms are:

- Statistical-based methods: These methods use statistical techniques to model the normal behavior of the data and identify instances that significantly deviate from the expected statistical properties.

- Machine learning-based methods: These methods involve training models on labeled or unlabeled data to identify anomalies based on deviations from learned patterns or features.

- Density-based methods: These methods focus on identifying regions of low density in the data, where anomalies are more likely to reside.

- Proximity-based methods: These methods measure the distance or similarity between instances and identify instances that are far from or dissimilar to the majority of the data.

- Information theory-based methods: These methods leverage information theory principles to quantify the unexpectedness or surprise of instances and detect anomalies based on high information content.
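
As a small illustration of the statistical category, the sketch below flags values using the modified z-score, which is based on the median and the median absolute deviation (MAD). The sample values, the function name `modified_z_scores`, and the 3.5 cutoff are assumptions chosen for the example, not part of the original answer.

```python
import statistics

def modified_z_scores(values):
    """Modified z-score of each value, based on the median and the MAD."""
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    # 0.6745 rescales the MAD so it is comparable to a standard deviation
    # for normally distributed data. This sketch assumes mad != 0.
    return [0.6745 * (v - median) / mad for v in values]

# Hypothetical sensor readings with one obvious outlier.
readings = [10.1, 9.8, 10.0, 10.3, 9.9, 10.2, 25.0]
THRESHOLD = 3.5  # a commonly used cutoff; treat it as a tunable assumption

scores = modified_z_scores(readings)
flagged = [v for v, z in zip(readings, scores) if abs(z) > THRESHOLD]
print(flagged)
```

The median and MAD are used here instead of the mean and standard deviation because a single extreme value inflates the standard deviation and can hide itself in a small sample.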


# Q5. What are the main assumptions made by distance-based anomaly detection methods?

Distance-based anomaly detection methods typically make the following assumptions:

- Normal instances are densely packed and have similar patterns or characteristics, leading to small distances between them.
- Anomalous instances deviate significantly from the norm and are located in regions of lower density, resulting in larger distances to their nearest neighbors or to the majority of the data.
- The chosen distance metric is meaningful for the data: distances must reflect real dissimilarity, which depends on feature scaling and tends to break down in high-dimensional spaces, where distances between points become increasingly similar.
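
A minimal sketch of a distance-based score under these assumptions: each point is scored by the Euclidean distance to its k-th nearest neighbor. The function name `kth_nn_scores`, the toy points, and `k = 2` are illustrative assumptions.

```python
import math

def kth_nn_scores(points, k):
    """Score each point by the distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Hypothetical data: five tightly packed points and one far away.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.1, 0.0), (3.0, 3.0)]
scores = kth_nn_scores(points, k=2)
most_anomalous = points[scores.index(max(scores))]
print(most_anomalous)
```

Points in the dense group get small scores because their neighbors are close, while the isolated point gets a much larger one.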


# Q6. How does the LOF algorithm compute anomaly scores?

The Local Outlier Factor (LOF) algorithm scores each data point by comparing its local density with the local densities of its neighbors. For each point, it finds the k nearest neighbors and computes the reachability distance to each of them, defined as the larger of the actual distance to the neighbor and that neighbor's own k-distance (its distance to its k-th nearest neighbor). The local reachability density (LRD) of a point is the inverse of its average reachability distance to its k nearest neighbors, so a low LRD means the point lies in a sparse region. The LOF score is the average ratio of the neighbors' LRDs to the point's own LRD. A score near 1 means the point is about as dense as its neighbors, while a score well above 1 means it lies in a sparser region than they do and is likely an anomaly.
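
The LOF computation can be sketched in plain Python as follows. This is a brute-force illustration under simplifying assumptions: ties at the k-th distance are ignored, and there are no duplicate points (which would make a reachability distance zero). The function name `lof_scores` and the sample points are hypothetical.

```python
import math

def lof_scores(points, k):
    """Return a LOF score for every point (brute force, O(n^2))."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    neighbors = []   # indices of the k nearest neighbors of each point
    k_distance = []  # distance from each point to its k-th nearest neighbor
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(order[:k])
        k_distance.append(dist[i][order[k - 1]])

    # Local reachability density: inverse of the mean reachability distance,
    # where reach_dist(i, j) = max(k_distance[j], dist[i][j]).
    lrd = []
    for i in range(n):
        mean_reach = sum(max(k_distance[j], dist[i][j]) for j in neighbors[i]) / k
        lrd.append(1.0 / mean_reach)

    # LOF: average ratio of the neighbors' density to the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]

# Hypothetical data: a small cluster plus one distant point (the last one).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (5, 5)]
scores = lof_scores(points, k=3)
print([round(s, 2) for s in scores])
```

With this data the cluster points score close to 1 and the distant point scores far above 1.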

# Q7. What are the key parameters of the Isolation Forest algorithm?


The key parameters of the Isolation Forest algorithm are:

- n_estimators: The number of isolation trees to be constructed. Increasing the number of trees can improve the performance but also increases the computational cost.

- max_samples: The number of samples drawn to build each isolation tree. Smaller subsamples give shallower trees that are faster to build, and the original Isolation Forest paper reports that a small subsample (256 points) is usually sufficient.

- contamination: The assumed proportion of anomalies in the dataset. It helps in setting the threshold for identifying anomalies based on the anomaly scores.

- max_features: The number (or fraction) of features drawn to train each isolation tree. In scikit-learn this subset is chosen once per tree, not at every split. Using fewer features increases the diversity of the trees and reduces the cost per tree.
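
A short sketch of how these parameters appear in scikit-learn's `IsolationForest`. The synthetic data, the specific parameter values, and `random_state=0` are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data (an assumption for this example): a Gaussian cluster of
# normal points plus a handful of points placed far away from it.
rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(300, 2))
outliers = rng.uniform(low=6.0, high=8.0, size=(6, 2))
X = np.vstack([normal, outliers])

model = IsolationForest(
    n_estimators=100,    # number of isolation trees
    max_samples=256,     # samples drawn per tree
    contamination=0.02,  # assumed share of anomalies, sets the threshold
    max_features=1.0,    # fraction of features drawn per tree
    random_state=0,
)
labels = model.fit_predict(X)     # +1 = inlier, -1 = anomaly
scores = -model.score_samples(X)  # negated so that higher = more anomalous
print((labels == -1).sum(), "points flagged")
```

`score_samples` returns the opposite of the anomaly score defined in the original paper, which is why it is negated here.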

# Q8. If a data point has only 2 neighbours of the same class within a radius of 0.5, what is its anomaly score using KNN with K=10?

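The exact score depends on how the KNN detector defines it, and the question does not give enough information to compute a single number. Class labels also play no role in standard KNN-based anomaly scores, which rely only on distances. Under the two most common definitions (the distance to the K-th nearest neighbor, or the mean distance to the K nearest neighbors), a point with only 2 neighbors within a radius of 0.5 has its 3rd through 10th nearest neighbors farther than 0.5 away, so its 10th-neighbor distance exceeds 0.5. If typical points have most of their 10 nearest neighbors within that radius, this point sits in a sparse region and receives a relatively high anomaly score.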
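
To make the scenario concrete, the sketch below computes two common KNN scores from a list of neighbor distances. The distances are hypothetical: only the first two (within 0.5) come from the question, and the rest are invented for illustration.

```python
def kth_neighbor_distance(distances, k):
    """Distance to the k-th nearest neighbor."""
    return sorted(distances)[k - 1]

def mean_knn_distance(distances, k):
    """Mean distance to the k nearest neighbors."""
    return sum(sorted(distances)[:k]) / k

# Hypothetical distances from the query point to the other points.
# Only two of them fall within the radius of 0.5.
neighbor_distances = [0.3, 0.4, 0.9, 1.1, 1.2, 1.4, 1.5, 1.7, 1.8, 2.0, 2.3, 2.6]

K = 10
within_radius = sum(d <= 0.5 for d in neighbor_distances)
kth_score = kth_neighbor_distance(neighbor_distances, K)
mean_score = mean_knn_distance(neighbor_distances, K)
print(within_radius, kth_score, round(mean_score, 2))
```

Whatever the actual distances are, the 10th-neighbor distance must be larger than 0.5 because only two neighbors lie inside that radius.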

# Q9. Using the Isolation Forest algorithm with 100 trees and a dataset of 3000 data points, what is the anomaly score for a data point that has an average path length of 5.0 compared to the average path length of the trees?

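In the Isolation Forest algorithm, the anomaly score of a point x is `s(x, n) = 2^(-E[h(x)] / c(n))`, where `E[h(x)]` is the point's path length averaged over all trees and `c(n) = 2H(n - 1) - 2(n - 1)/n` is the average path length of an unsuccessful binary search tree lookup over n points, with `H(i) ≈ ln(i) + 0.5772`. Taking n = 3000 gives c(3000) ≈ 15.17, so s = 2^(-5.0 / 15.17) ≈ 0.80. Scores close to 1 indicate anomalies and scores well below 0.5 indicate normal points, so this point is likely an anomaly: its average path length of 5.0 is far shorter than the roughly 15 expected for a typical point, meaning it is isolated quickly. The number of trees (100) only affects how stable the average is. If each tree is built on a subsample instead of the full dataset (the original paper suggests 256 points), n is the subsample size. In that case c(256) ≈ 10.24 and the score is about 0.71, which is still above 0.5.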
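
The score can be checked numerically with the formula from the original Isolation Forest paper. This sketch assumes each tree is built on all 3000 points. The function names are illustrative, and the harmonic number is approximated as ln(i) plus the Euler-Mascheroni constant.

```python
import math

EULER_GAMMA = 0.5772156649

def average_path_length(n):
    """c(n): average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, n):
    """s(x, n) = 2 ** (-E[h(x)] / c(n)); values near 1 suggest an anomaly."""
    return 2.0 ** (-mean_path_length / average_path_length(n))

c = average_path_length(3000)
score = anomaly_score(5.0, 3000)
print(round(c, 2), round(score, 2))
```

The result is roughly 0.80, well above the 0.5 level that corresponds to a point whose path length equals the average.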
