In [None]:
# Q1. What is anomaly detection and what is its purpose?
"""
Anomaly detection is the process of identifying patterns or data points that deviate significantly from the norm or expected
 behavior. The goal of anomaly detection is to identify unusual, rare, or suspicious events or behaviors that may be indicative
  of a problem or potential threat.

In other words, anomaly detection is the process of finding data points that are outliers in a given dataset, and it can be used
 for a variety of applications, such as fraud detection, intrusion detection, system health monitoring, and predictive maintenance.

The purpose of anomaly detection is to identify and flag potentially problematic data points that may require further 
investigation or action. By identifying anomalies early, organizations can take proactive steps to address issues and 
prevent more serious problems from occurring. Additionally, anomaly detection can help organizations optimize their 
operations and improve overall efficiency by identifying areas where processes can be improved or streamlined."""

In [None]:
# Q2. What are the key challenges in anomaly detection?
"""

Anomaly detection can be a powerful tool for spotting unusual or fraudulent activity, but several hurdles make it
 hard to implement effectively:

Lack of labeled data--- Supervised and semi-supervised detectors need labeled examples to train on, and even unsupervised
 detectors need some labels to be evaluated. Obtaining labels can be difficult or expensive, especially when dealing with
 rare events.

Imbalanced data--- Anomaly detection datasets are often highly imbalanced, with a large number of normal instances and
 only a few anomalous instances. This can make it difficult for the algorithm to accurately detect anomalies.

Data preprocessing--- Anomaly detection algorithms often require preprocessing steps to transform the data into a format 
suitable for analysis. This can be time-consuming and require expertise in data manipulation.

False positives--- Anomaly detection algorithms can sometimes generate false positives, flagging normal instances as anomalies.
 This can be problematic, especially if it leads to unnecessary investigations or actions.

Real-time detection--- Real-time anomaly detection can be challenging, especially when dealing with large volumes of data.
 The algorithm must be able to process data quickly and accurately, without compromising on the quality of the results.

Interpretability--- Some anomaly detection algorithms can be difficult to interpret, making it hard to understand why a
 particular instance was flagged as an anomaly. This can make it challenging to take appropriate action based on the results."""



In [None]:
# Q3. How does unsupervised anomaly detection differ from supervised anomaly detection?

"""Unsupervised anomaly detection and supervised anomaly detection are two different approaches used to identify anomalies in data. 

Unsupervised anomaly detection is used when the data is not labeled, and there are no prior examples of anomalies to learn from.
 This means that the algorithm must identify patterns in the data that deviate from the norm without any prior knowledge of what
  constitutes an anomaly. Unsupervised anomaly detection methods include clustering-based methods, density-based methods, and 
  statistical methods.

Supervised anomaly detection, on the other hand, uses labeled data to train a model to recognize anomalies. The model is trained
 on a dataset that contains both normal and anomalous instances. The goal is to learn the patterns that distinguish normal 
 instances from anomalies. Once the model is trained, it can be used to identify anomalies in new, unseen data. 
 Supervised anomaly detection methods include decision tree-based methods, rule-based methods, and deep learning-based methods.

One key difference between the two approaches is that supervised anomaly detection requires labeled data,
 which can be time-consuming and expensive to obtain. In contrast, unsupervised anomaly detection can be used when
  labeled data is not available, making it a more practical option in some situations.

Another difference is accuracy. When the labeled anomalies are representative of what will occur in practice, supervised
 anomaly detection is often more accurate because it learns directly from those labels. However, it tends to recognize
 only anomalies that resemble the ones seen during training. In contrast, unsupervised anomaly detection can flag
 novel anomalies that have not been seen before, making it more adaptable to changing data, usually at the cost of
 more false positives."""
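
The contrast between the two approaches can be sketched with scikit-learn. This is only an illustration under assumptions that are not part of the answer above: the data is synthetic, IsolationForest stands in for "an unsupervised method", and RandomForestClassifier stands in for "a supervised method".

```python
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): 500 normal points near the
# origin and 20 anomalies shifted far away from them.
X_normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
X_anomalous = rng.normal(loc=6.0, scale=1.0, size=(20, 2))
X = np.vstack([X_normal, X_anomalous])
y = np.concatenate([np.zeros(500, dtype=int), np.ones(20, dtype=int)])  # 1 = anomaly

# Unsupervised: the model is fit on X only and never sees the labels.
unsupervised = IsolationForest(n_estimators=100, random_state=0).fit(X)
unsup_pred = (unsupervised.predict(X) == -1).astype(int)  # predict() returns -1 for anomalies

# Supervised: the model is fit on X together with the labels y.
supervised = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
sup_pred = supervised.predict(X)

print("points flagged without labels:", int(unsup_pred.sum()))
print("points flagged with labels:   ", int(sup_pred.sum()))
```

The only structural difference is the call to fit: the supervised model needs y, the unsupervised one does not.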

In [None]:
# Q4. What are the main categories of anomaly detection algorithms?
"""

Anomaly detection algorithms can be broadly categorized into the following types:

Statistical Methods--- These algorithms use statistical techniques such as probability density estimation, clustering, and
 regression analysis to identify anomalies. They assume that normal data points will follow a certain statistical pattern, 
 and any data point that deviates significantly from that pattern is considered an anomaly.

Machine Learning Methods--- These algorithms use supervised, unsupervised, or semi-supervised machine learning techniques to 
detect anomalies. Supervised learning algorithms are trained on labeled data to predict anomalies in new data, whereas 
unsupervised learning algorithms use clustering or density estimation techniques to identify unusual patterns in data.
 Semi-supervised learning algorithms combine elements of both supervised and unsupervised learning to detect anomalies.

Deep Learning Methods--- These algorithms use deep neural networks to identify complex patterns in data. They are particularly
 useful for detecting anomalies in large and complex datasets, such as image, speech, or text data. Deep learning algorithms
  can be supervised or unsupervised, depending on the availability of labeled data.

Rule-Based Methods--- These algorithms use predefined rules to identify anomalies in data. Rules can be based on expert 
knowledge, domain-specific heuristics, or data-driven patterns. These methods are particularly useful in situations where 
the underlying statistical or machine learning models may not be able to capture all possible anomalies."""



In [None]:
# Q5. What are the main assumptions made by distance-based anomaly detection methods?
"""Density assumption--- normal instances are more concentrated in the data space than anomalous instances.

Distance assumption--- anomalous instances are far away from the normal instances in the data space.

Local structure assumption--- normal instances exhibit a consistent local structure in the data space, whereas anomalous 
instances do not follow this pattern.

Independence assumption--- each instance is independent of the others, and its attributes do not depend on the attributes
 of other instances."""
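
A minimal sketch of a distance-based detector that relies on these assumptions. The scoring rule (mean distance to the k nearest neighbors), the synthetic data, and the choice k = 5 are assumptions made for illustration, not something the question prescribes.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# 200 normal points plus one planted outlier in the last row (synthetic data).
X = np.vstack([rng.normal(size=(200, 2)), [[8.0, 8.0]]])

k = 5
# Ask for k + 1 neighbours: each point is returned as its own nearest
# neighbour at distance 0, so the first column is dropped below.
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
distances, _ = nn.kneighbors(X)

# Distance assumption in action: a larger mean neighbour distance means
# the point is farther from the bulk of the data.
knn_score = distances[:, 1:].mean(axis=1)

print("index with the highest score:", int(np.argmax(knn_score)))
```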

In [None]:
# Q6. How does the LOF algorithm compute anomaly scores?
"""The LOF score for a data point is computed as the ratio of the average local density of its k-nearest neighbors to its own
 local density. If the ratio is close to 1, the point is about as dense as its neighbors and is not anomalous.
 If the ratio is well above 1, the point sits in a sparser region than its neighbors and is considered anomalous; the
 larger the ratio, the more anomalous the point. A ratio below 1 means the point lies in a denser region than its neighbors.

 The local density used here is the local reachability density: the inverse of the average reachability distance from the
 point to its k-nearest neighbors. The reachability distance from a point p to a neighbor o is the larger of the true
 distance d(p, o) and the k-distance of o (the distance from o to its own k-th nearest neighbor), which smooths out
 fluctuations between points that are very close together. The k-nearest neighbors are the k points in the dataset
 that are closest to the given point."""
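
The LOF computation can be sketched step by step as follows. The helper name lof_scores, the synthetic data, and k = 10 are assumptions for illustration; the sketch also assumes the dataset has no duplicate points.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def lof_scores(X, k):
    """Return the LOF score of every row of X (hypothetical helper)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, idx = nn.kneighbors(X)
    dist, idx = dist[:, 1:], idx[:, 1:]      # drop each point's match with itself

    k_distance = dist[:, -1]                 # distance to the k-th nearest neighbour
    # reach-dist(p, o) = max(k-distance(o), d(p, o)) for each neighbour o of p
    reach_dist = np.maximum(dist, k_distance[idx])
    lrd = 1.0 / reach_dist.mean(axis=1)      # local reachability density

    # LOF = average density of the neighbours divided by the point's own density
    return lrd[idx].mean(axis=1) / lrd


rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 2)), [[6.0, 6.0]]])  # last row is a planted outlier

lof = lof_scores(X, k=10)
print("median LOF:", round(float(np.median(lof)), 2))
print("LOF of the planted outlier:", round(float(lof[-1]), 2))
```

Points inside the cluster should score near 1, while the planted outlier should score well above 1.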

In [None]:
# Q7. What are the key parameters of the Isolation Forest algorithm?
"""The Isolation Forest algorithm is a tree-based anomaly detection method that is based on the concept of isolating anomalies
 from the majority of the data. 

Number of trees or n_estimators--- This parameter controls the number of trees in the forest. More trees make the averaged
 path lengths, and therefore the anomaly scores, more stable, with diminishing returns, at the cost of training time and
 memory. The scikit-learn default is 100.

Sample size or max_samples--- This parameter controls how many instances are drawn at random from the dataset to construct
 each tree. Smaller samples train faster and produce shallower trees; scikit-learn's default ('auto') uses
 min(256, n_samples). A sample that is too small may miss the structure of the normal data and reduce accuracy.

Tree depth--- The original algorithm caps the height of each tree at about ceil(log2(max_samples)), because anomalies are
 expected to be isolated after only a few splits. scikit-learn's IsolationForest derives this limit from max_samples and
 does not expose a max_depth parameter.

Contamination--- This parameter is the expected proportion of anomalous data points in the dataset. It sets the score
 threshold used to label points as anomalies; it does not change how the trees are built."""
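
A short sketch of how these parameters appear in scikit-learn's IsolationForest. The data is synthetic and the values chosen (256 samples, 5% contamination) are assumptions for illustration. Note that scikit-learn does not accept a max_depth argument for this estimator.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))  # synthetic data, assumed for illustration

model = IsolationForest(
    n_estimators=100,     # number of trees in the forest
    max_samples=256,      # points drawn at random to build each tree
    contamination=0.05,   # expected share of anomalies; sets the threshold
    random_state=0,       # fixed seed so the run is repeatable
).fit(X)

labels = model.predict(X)  # -1 = anomaly, 1 = normal
flagged_fraction = float(np.mean(labels == -1))
print("fraction flagged:", round(flagged_fraction, 3))
```

Because contamination is 0.05, roughly 5% of the training points end up below the threshold, whether or not they are genuinely unusual.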

In [None]:
# Q8. If a data point has only 2 neighbours of the same class within a radius of 0.5, what is its anomaly score
# using KNN with K=10?
""" Its anomaly score would be 1 - (2/10) = 0.8. This is because the anomaly score is calculated
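  here as 1 minus the ratio of the number of same-class neighbors found within the radius to K. This is one simple
  convention assumed for this answer, not a standard definition: KNN-based detectors more commonly score a point by its
  distance to the k-th nearest neighbor or by its average distance to the k nearest neighbors.
  
  On this 0-to-1 scale, a score of 0.8 means only 2 of the expected 10 neighbors are nearby, so the point is more likely
   to be an anomaly than points with lower scores."""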
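
The arithmetic can be checked with a few lines of Python. The function name and the scoring rule itself are assumptions taken from the answer above, not a standard KNN anomaly score.

```python
def neighbour_fraction_score(same_class_in_radius, k):
    """Hypothetical rule: 1 - (same-class neighbours within the radius / K)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return 1.0 - same_class_in_radius / k


# Values from the question: 2 neighbours within radius 0.5, K = 10.
q8_score = neighbour_fraction_score(same_class_in_radius=2, k=10)
print(round(q8_score, 3))
```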



In [1]:
# Q9. Using the Isolation Forest algorithm with 100 trees and a dataset of 3000 data points, what is the
# anomaly score for a data point that has an average path length of 5.0 compared to the average path
# length of the trees?
import numpy as np

EULER_GAMMA = 0.5772156649


def average_path_length(n):
    """c(n): average path length of an unsuccessful search in a binary tree of n points."""
    harmonic = np.log(n - 1) + EULER_GAMMA  # approximation of the harmonic number H(n - 1)
    return 2 * harmonic - 2 * (n - 1) / n


n = 3000               # dataset size from the question (assumes each tree sees all n points)
avg_path_length = 5.0  # E[h(x)], the path length averaged over the 100 trees

# Isolation Forest anomaly score: s(x, n) = 2 ** (-E[h(x)] / c(n))
anomaly_score = 2 ** (-avg_path_length / average_path_length(n))

print(f"The anomaly score for the data point is: {anomaly_score:.3f}")


The anomaly score for the data point is: 0.796
